Deep Learning IIT Ropar Week 4 Nptel Answers

Are you looking for the Deep Learning IIT Ropar Week 4 NPTEL Assignment Answers? You’ve come to the right place!



Deep Learning IIT Ropar Week 4 Nptel Assignment Answers (July-Dec 2025)


Question 1. You are training a neural network on a dataset with 5 million samples using Mini‑Batch Gradient Descent. The mini‑batch size is 500, and each parameter update takes 100 milliseconds. How many seconds will it take to complete 5 epochs of training?
a) 5,000
b) 10,000
c) 2,500
d) 50,000

View Answers
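
For anyone who wants to verify the arithmetic, here is a minimal sketch of the usual counting, assuming one parameter update per mini-batch:

    # Sanity check for Question 1 (assumes one update per mini-batch).
    samples = 5_000_000        # training samples
    batch_size = 500           # mini-batch size
    update_ms = 100            # time per parameter update, in milliseconds
    epochs = 5

    updates_per_epoch = samples // batch_size          # 10,000
    total_updates = updates_per_epoch * epochs         # 50,000
    total_seconds = total_updates * update_ms / 1000   # 5,000.0
    print(total_seconds)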


Question 2. You are comparing training times using different gradient descent algorithms on a dataset with 1,000,000 data points. Each parameter update takes 2 milliseconds. How many milliseconds longer will Stochastic Gradient Descent take compared to Vanilla (Batch) Gradient Descent to complete 2 epochs?
a) 4,000,004 ms
b) 4,000,000 ms
c) 3,999,994 ms
d) 3,999,996 ms

View Answers
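
Under the usual convention that SGD performs one update per sample while vanilla (batch) gradient descent performs one update per epoch, the difference can be checked with a short sketch:

    # Sanity check for Question 2.
    n_samples = 1_000_000
    update_ms = 2
    epochs = 2

    sgd_ms = n_samples * epochs * update_ms   # 4,000,000 ms
    batch_ms = 1 * epochs * update_ms         # 4 ms
    print(sgd_ms - batch_ms)                  # 3,999,996 ms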


Question 3. What is the most practical benefit of using smaller batch sizes on constrained devices?
a) Reduces computation time significantly
b) Increases training accuracy
c) Minimizes memory usage and reduces overhead
d) Allows larger models to be trained

View Answers


Question 4. You reduce the batch size from 4,000 to 1,000. What happens to the number of weight updates per epoch?
a) Doubles
b) Quadruples
c) Remains constant
d) Halves

View Answers


Question 5. Which of the following statements are true about mini‑batch gradient descent?
a) It offers a compromise between computation and accuracy
b) It prevents gradient vanishing completely
c) It allows parallelism in training
d) It can still be prone to overfitting

View Answers


Question 6. What could be the reason for slow learning in this scenario?
a) Large Learning rate
b) Very small gradients
c) Very high momentum
d) Incorrect label noise

View Answers


Question 7. What optimizer would help improve learning in this situation?
a) Vanilla Gradient Descent
b) Mini‑Batch SGD
c) Momentum based Gradient Descent
d) Adam with bias correction and adaptive learning rate

View Answers


Question 8. If the above model takes 300 steps per epoch, then after 5 epochs the number of weight updates is __________

View Answers
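
As a quick arithmetic check: if each of the 300 steps per epoch is one weight update, then 5 epochs give 300 × 5 = 1,500 updates.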


Question 9. Which of the following helps in handling small gradients?
a) Reducing learning rate
b) Using Adagrad
c) Using adaptive optimizers like Adam
d) Using large batch sizes

View Answers


Question 10. What are the advantages of using momentum?
a) Faster convergence
b) Larger steps in the wrong direction
c) Helps escape shallow minima
d) Avoids oscillation in steep slopes

View Answers


Question 11. Which of the following would not help in this scenario?
a) Switch to adaptive gradient descent
b) Add momentum
c) Reduce learning rate
d) Normalize input data

View Answers


Question 12. If the learning rate η = 0.01, the momentum coefficient γ = 0.9, the current gradient at step t is ∇w_t = 0.2, and the previous update was 0.1, then what is the value of the new update?
a) 0.11
b) 0.092
c) 0.091
d) 0.12

View Answers
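
One common way to write the momentum update, and the one that matches the options here, is u_t = γ·u_{t−1} + η·∇w_t. Assuming that convention, the arithmetic is:

    # Check for Question 12, assuming u_t = gamma * u_{t-1} + eta * grad_t.
    gamma, eta = 0.9, 0.01
    prev_update, grad = 0.1, 0.2
    new_update = gamma * prev_update + eta * grad
    print(round(new_update, 3))   # 0.092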


Question 13. A data scientist uses momentum‑based GD with γ = 0.8, η = 0.05, initial update u_0 = 0 and gradients ∇w_1 = −0.5, ∇w_2 = −0.2, ∇w_3 = −0.3. What is the value of the update at time t = 3?
a) 0.0172
b) −0.0172
c) 0.0216
d) −0.009

View Answers


Question 14. What are the benefits of using mini‑batch over full‑batch gradient descent?
a) Less memory usage
b) More frequent weight updates
c) Higher computational cost
d) Better generalization

View Answers


Question 15. What is a likely cause of oscillations?
a) Too low learning rate
b) Batch size too small
c) Too high learning rate
d) No dropout

View Answers


Question 16. Which technique helps reduce oscillations?
a) Momentum
b) Adagrad
c) Weight decay
d) None of the above

View Answers


Question 17. Which optimizer allows you to peek ahead before computing the gradient?
a) Adam
b) Vanilla SGD
c) Nesterov Accelerated Gradient
d) Adagrad

View Answers


Question 18. What happens if momentum is set to 1?
a) Model stops updating
b) Model overshoots and diverges
c) Model converges quickly
d) No effect

View Answers


Question 19. Which of the following are advantages of mini‑batch gradient descent over SGD?
a) Reduces variance of updates
b) Requires fewer epochs
c) Faster convergence
d) More computation per update

View Answers


Question 20. What does the line search algorithm aim to optimize at every step of training?
a) Batch size
b) The cost function value along the gradient direction
c) Momentum term
d) Validation accuracy

View Answers


Question 21. What is the key computational disadvantage of applying line search in every update?
a) May overfit the data
b) Many more computations in each step.
c) Doesn’t converge
d) Reduces gradient magnitude

View Answers


Question 22. Which of the following schedules typically require setting two hyperparameters?
a) Exponential decay
b) 1/t decay
c) Constant learning rate
d) Step decay

View Answers


Question 23. Exponential decay adjusts learning rate using which formula?
a) η = η_0(1 + kt)
b) η = η_0 − kt
c) η = η_0 log(t)
d) η = η_0 + t

View Answers
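
For reference, the exponential-decay schedule is usually written η_t = η_0 · e^(−kt), with η_0 the initial learning rate and k the decay rate, while 1/t decay is usually written η_t = η_0 / (1 + kt).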


Question 24. Learning rate decay is typically used to:
a) Fine‑tune the model toward the end of training
b) Avoid oscillation near minima
c) Eliminate the need for momentum
d) Control the impact of noisy gradients

View Answers


Question 25. In step decay, the learning rate changes at ______ intervals.
a) Predefined
b) One
c) Random
d) None of the above

View Answers


Question 26. If you have 100,000 samples and batch size is 10,000, how many parameter updates happen in one epoch?
a) 10
b) 100
c) 1000
d) 1

View Answers
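
With one parameter update per mini-batch, the count is simply N / B per epoch; for this question that is 100,000 / 10,000 = 10 updates.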


Question 27. If N = 60,000 and the batch size B = 5,000, the number of weight updates per epoch = _____.

View Answers
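
By the same counting, 60,000 / 5,000 = 12 weight updates per epoch.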


Question 28. Suppose you’re using Nesterov Accelerated Gradient and are at time step t. The current gradient at the look‑ahead position is ∇w_look = 0.3, the previous velocity (update) is update_{t−1} = 0.2, and the hyperparameters are γ = 0.8, η = 0.05. What is the value of the current update update_t?
a) 0.175
b) 0.195
c) 0.18
d) 0.31

View Answers
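
NAG is commonly written as update_t = γ·update_{t−1} + η·∇w_look, with the gradient evaluated at the look‑ahead point w_t − γ·update_{t−1}. Assuming that form, the arithmetic is:

    # Check for Question 28, assuming update_t = gamma * update_{t-1} + eta * grad_lookahead.
    gamma, eta = 0.8, 0.05
    prev_update, grad_lookahead = 0.2, 0.3
    update_t = gamma * prev_update + eta * grad_lookahead
    print(round(update_t, 3))   # 0.175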


Question 29. You’re optimizing a neural network with NAG. At iteration t you have: current weight w_t = 1.0, previous update update_{t−1} = 0.25, γ = 0.9, η = 0.01, and the gradient at the look‑ahead position ∇w_look = −0.5. What is the value of the update at time t?
a) 0.78
b) 0.775
c) 0.79
d) 0.77

View Answers


Deep Learning IIT Ropar Week 4 Nptel Assignment Answers (Jan-Apr 2025)


Q1. Using the Adam optimizer with β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸, what would be the bias-corrected first moment estimate after the first update if the initial gradient is 4?

a) 0.4
b) 4.0
c) 3.6
d) 0.44

View Answer
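
Adam’s first-moment estimate is m_t = β_1·m_{t−1} + (1 − β_1)·g_t, and the bias-corrected value is m̂_t = m_t / (1 − β_1^t). A short sketch of the very first step, taking m_0 = 0:

    # Adam first-moment estimate with bias correction at t = 1.
    beta1 = 0.9
    m_prev, grad = 0.0, 4.0
    m1 = beta1 * m_prev + (1 - beta1) * grad   # raw first moment, ~0.4
    m1_hat = m1 / (1 - beta1 ** 1)             # bias-corrected, ~0.4 / 0.1 = 4.0
    print(round(m1, 3), round(m1_hat, 3))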


Q2. In a mini-batch gradient descent algorithm, if the total number of training samples is 50,000 and the batch size is 100, how many iterations are required to complete 10 epochs?

a) 5,000
b) 50,000
c) 500
d) 5

View Answer
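
With one iteration per mini-batch, each epoch is 50,000 / 100 = 500 iterations, so 10 epochs require 500 × 10 = 5,000 iterations.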


Q3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after 5 epochs?

View Answer
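
The result depends on which decay formula is assumed. Under the common form η_t = η_0 · e^(−kt) with η_0 = 0.1 and k = 0.1, the value after 5 epochs is roughly 0.061; a multiplicative 0.9-per-epoch reading would instead give about 0.059:

    # Two common readings of "exponential decay of 0.1 per epoch".
    import math
    eta0, k, t = 0.1, 0.1, 5
    print(eta0 * math.exp(-k * t))   # ~0.0607  (eta_0 * e^(-k*t))
    print(eta0 * (1 - k) ** t)       # ~0.0590  (multiplicative decay)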


Q4. In the context of the Adam optimizer, what is the purpose of bias correction?

a) To prevent overfitting
b) To speed up convergence
c) To correct for the bias in the estimates of first and second moments
d) To adjust the learning rate

View Answer


Q5. The figure below shows the contours of a surface. Suppose that a man walks from −1 to +1 on both the horizontal (x) axis and the vertical (y) axis. The statement that the man would have seen the slope change more rapidly along the x-axis than along the y-axis is:

a) True
b) False
c) Cannot say

View Answer


Q6. What is the primary benefit of using Adagrad compared to other optimization algorithms?

a) It converges faster than other optimization algorithms.
b) It is more memory-efficient than other optimization algorithms.
c) It is less sensitive to the choice of hyperparameters (learning rate).
d) It is less likely to get stuck in local optima than other optimization algorithms.

View Answer


Q7. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?

a) SGD converges more quickly than vanilla gradient descent.
b) SGD is computationally efficient for large datasets.
c) SGD theoretically guarantees that the descent direction is optimal.
d) SGD experiences less oscillation compared to vanilla gradient descent.

View Answer


Q8. What is the role of activation functions in deep learning?

a) Activation functions transform the output of a neuron into a non-linear function, allowing the network to learn complex patterns.
b) Activation functions make the network faster by reducing the number of iterations needed for training.
c) Activation functions are used to normalize the input data.
d) Activation functions are used to compute the loss function.

View Answer


Q9. What is the advantage of using mini-batch gradient descent over batch gradient descent?

a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
b) Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch gradient descent.
c) Mini-batch gradient descent gives us a better solution.
d) Mini-batch gradient descent can converge faster than batch gradient descent.

View Answer


Q10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:

a) The current position
b) A “look-ahead” position
c) The previous position
d) The average of current and previous positions

View Answer


Deep Learning IIT Ropar Week 4 Nptel Assignment Answers (JULY – DEC 2024)



Q1. A team has a data set that contains 1000 samples for training a feed-forward neural network. Suppose they decided to use the stochastic gradient descent algorithm to update the weights. How many times do the weights get updated after training the network for 5 epochs?
1000
5000
100
5

Answer: B) 5000
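
As a quick check: with stochastic gradient descent every sample triggers one weight update, so 1,000 samples × 5 epochs = 5,000 updates.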


Q2. What is the primary benefit of using Adagrad compared to other optimization algorithms?
It converges faster than other optimization algorithms.
It is more memory-efficient than other optimization algorithms.
It is less sensitive to the choice of hyperparameters (learning rate).
It is less likely to get stuck in local optima than other optimization algorithms.

Answer: It is less sensitive to the choice of hyperparameters (learning rate).




Q3. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
SGD converges more quickly than vanilla gradient descent.
SGD is computationally efficient for large datasets.
SGD theoretically guarantees that the descent direction is optimal.
SGD experiences less oscillation compared to vanilla gradient descent.

Answer:


Q4. Select the behaviour of the gradient descent algorithm that uses the following update rule,
w_{t+1} = w_t − η∇w_t,
where w is a weight and η is the learning rate.
The weight update is tiny at a steep loss surface
The weight update is tiny at a gentle loss surface
The weight update is large at a steep loss surface
The weight update is large at a gentle loss surface

Answer: The weight update is large at a steep loss surface
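
A tiny sketch makes the behaviour of w_{t+1} = w_t − η∇w_t concrete: the step is proportional to the gradient magnitude, so a steep surface (large gradient) produces a large update.

    # Gradient descent step size for a steep vs. a gentle loss surface.
    eta = 0.1
    for grad in (5.0, 0.05):   # steep slope, then gentle slope
        print(f"gradient {grad}: weight changes by {eta * grad}")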




Q5. Given data where one column predominantly contains zero values, which algorithm should be used to achieve faster convergence and optimize the loss function?
Adam
NAG
Momentum-based gradient descent
Stochastic gradient descent

Answer: Adam


Q6. In Nesterov accelerated gradient descent, what step is performed before determining the update size?
Increase the momentum
Adjust the learning rate
Decrease the step size
Estimate the next position of the parameters

Answer: Estimate the next position of the parameters




Q7. We have the following functions: x³, ln(x), eˣ, x and 4. Which of the following functions has the steepest slope at x = 1?
x³
ln(x)
eˣ
4

Answer: x³ (at x = 1 the slopes are 3 for x³, 1 for ln(x), e ≈ 2.72 for eˣ, 1 for x, and 0 for the constant 4, so x³ is the steepest)


Q8. Which of the following represents the contour plot of the function f(x, y) = x² − y²?

Answer: C option




Q9. Which of the following algorithms will result in more oscillations of the parameters during the training process of the neural network?
Stochastic gradient descent
Mini batch gradient descent
Batch gradient descent
Batch NAG

Answer: Stochastic gradient descent


Q10. Consider a gradient profile ∇W = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]. Assume v_{−1} = 0, ε = 0, β = 0.9 and the learning rate is η = 0.1. Suppose that we use the Adagrad algorithm; what is the value of η_6 = η/√(v_6 + ε)?
0.03
0.06
0.08
0.006

Answer: 0.06
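
A sketch of the Adagrad accumulation v_t = v_{t−1} + (∇w_t)² and η_t = η/√(v_t + ε), assuming the gradient profile is indexed from t = 0, lands close to the 0.06 option:

    # Adagrad effective learning rate at t = 6 (assumes gradients indexed from t = 0).
    import math
    grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]
    eta, eps = 0.1, 0.0
    v = 0.0
    for t, g in enumerate(grads):
        v += g ** 2                                     # accumulate squared gradients
        if t == 6:
            print(round(eta / math.sqrt(v + eps), 4))   # ~0.0636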


