Deep Learning IIT Ropar Week 4 NPTEL Answers
Are you looking for the Deep Learning IIT Ropar Week 4 NPTEL Assignment Answers? You’ve come to the right place!
Deep Learning IIT Ropar Week 4 NPTEL Assignment Answers (July-Dec 2025)
Question 1. You are training a neural network on a dataset with 5 million samples using Mini‑Batch Gradient Descent. The mini‑batch size is 500, and each parameter update takes 100 milliseconds. How many seconds will it take to complete 5 epochs of training?
a) 5,000
b) 10,000
c) 2,500
d) 50,000
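The arithmetic for this question can be written out as a short worked equation. It assumes one parameter update per mini-batch and that the 100 ms cost per update is the only time that counts.

```latex
\[
\frac{5{,}000{,}000 \text{ samples}}{500 \text{ samples per batch}} = 10{,}000 \text{ updates per epoch},
\qquad
10{,}000 \times 5 \text{ epochs} \times 0.1\,\text{s} = 5{,}000\,\text{s}.
\]
```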
Question 2. You are comparing training times using different gradient descent algorithms on a dataset with 1,000,000 data points. Each parameter update takes 2 milliseconds. How many milliseconds longer will Stochastic Gradient Descent take compared to Vanilla (Batch) Gradient Descent to complete 2 epochs?
a) 4,000,004 ms
b) 4,000,000 ms
c) 3,999,994 ms
d) 3,999,996 ms
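A minimal Python sketch of this comparison follows. It assumes SGD performs one update per sample and batch gradient descent performs one update per epoch; the helper name `total_update_time_ms` is illustrative and not part of the course material.

```python
def total_update_time_ms(num_samples, batch_size, ms_per_update, epochs):
    """Time spent on parameter updates, assuming one update per batch."""
    updates_per_epoch = -(-num_samples // batch_size)  # ceiling division
    return updates_per_epoch * epochs * ms_per_update

n = 1_000_000
sgd_ms = total_update_time_ms(n, batch_size=1, ms_per_update=2, epochs=2)
batch_ms = total_update_time_ms(n, batch_size=n, ms_per_update=2, epochs=2)
extra_ms = sgd_ms - batch_ms

print(sgd_ms)    # 4000000
print(batch_ms)  # 4
print(extra_ms)  # 3999996
```

Keeping every quantity as an integer number of milliseconds avoids floating-point rounding in the difference.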
Question 3. What is the most practical benefit of using smaller batch sizes on constrained devices?
a) Reduces computation time significantly
b) Increases training accuracy
c) Minimizes memory usage and reduces overhead
d) Allows larger models to be trained
Question 4. You reduce the batch size from 4,000 to 1,000. What happens to the number of weight updates per epoch?
a) Doubles
b) Quadruples
c) Remains constant
d) Halves
Question 5. Which of the following statements are true about mini‑batch gradient descent?
a) It offers a compromise between computation and accuracy
b) It prevents gradient vanishing completely
c) It allows parallelism in training
d) It can still be prone to overfitting
Question 6. What could be the reason for slow learning in this scenario?
a) Large Learning rate
b) Very small gradients
c) Very high momentum
d) Incorrect label noise
Question 7. What optimizer would help improve learning in this situation?
a) Vanilla Gradient Descent
b) Mini‑Batch SGD
c) Momentum based Gradient Descent
d) Adam with bias correction and adaptive learning rate
Question 8. If the above model takes 300 steps per epoch, then after 5 epochs the number of weight updates is __________
Question 9. Which of the following helps in handling small gradients?
a) Reducing learning rate
b) Using Adagrad
c) Using adaptive optimizers like Adam
d) Using large batch sizes
Question 10. What are the advantages of using momentum?
a) Faster convergence
b) Larger steps in the wrong direction
c) Helps escape shallow minima
d) Avoids oscillation in steep slopes
Question 11. Which of the following would not help in this scenario?
a) Switch to adaptive gradient descent
b) Add momentum
c) Reduce learning rate
d) Normalize input data
Question 12. If the learning rate $\eta = 0.01$, momentum coefficient $\gamma = 0.9$, the current gradient at step $t$ is $\nabla w_t = 0.2$, and the previous update was $0.1$, then what is the value of the new update?
a) 0.11
b) 0.092
c) 0.091
d) 0.12
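The momentum update in this question can be sketched in Python. The sketch assumes the common convention update_t = γ · update_{t-1} + η · ∇w_t; the function name `momentum_update` is hypothetical.

```python
def momentum_update(prev_update, grad, lr, gamma):
    """Momentum-based update: decayed history plus the scaled current gradient."""
    return gamma * prev_update + lr * grad

new_update = momentum_update(prev_update=0.1, grad=0.2, lr=0.01, gamma=0.9)
print(round(new_update, 3))  # 0.092
```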
Question 13. A data scientist uses momentum‑based GD with $\gamma = 0.8$, $\eta = 0.05$, initial update $u_0 = 0$ and gradients $\nabla w_1 = -0.5$, $\nabla w_2 = -0.2$, $\nabla w_3 = -0.3$. What is the value of the update at time $t = 3$?
a) 0.0172
b) −0.0172
c) 0.0216
d) −0.009
Question 14. What are the benefits of using mini‑batch over full batch?
a) Less memory usage
b) More frequent weight updates
c) Higher computational cost
d) Better generalization
Question 15. What is a likely cause of oscillations?
a) Too low learning rate
b) Batch size too small
c) Too high learning rate
d) No dropout
Question 16. Which technique helps reduce oscillations?
a) Momentum
b) Adagrad
c) Weight decay
d) None of the above
Question 17. Which optimizer allows you to peek ahead before computing the gradient?
a) Adam
b) Vanilla SGD
c) Nesterov Accelerated Gradient
d) Adagrad
Question 18. What happens if momentum is set to 1?
a) Model stops updating
b) Model overshoots and diverges
c) Model converges quickly
d) No effect
Question 19. Which of the following are the advantages of mini‑batch gradient descent over SGD:
a) Reduces variance of updates
b) Requires fewer epochs
c) Faster convergence
d) More computation per update
Question 20. What does the line search algorithm aim to optimize at every step of training?
a) Batch size
b) The cost function value along the gradient direction
c) Momentum term
d) Validation accuracy
Question 21. What is the key computational disadvantage of applying line search in every update?
a) May overfit the data
b) Many more computations in each step.
c) Doesn’t converge
d) Reduces gradient magnitude
Question 22. Which of the following schedules typically require setting two hyperparameters?
a) Exponential decay
b) 1/t decay
c) Constant learning rate
d) Step decay
Question 23. Exponential decay adjusts learning rate using which formula?
a) $\eta = \eta_0 (1 + kt)$
b) $\eta = \eta_0 - kt$
c) $\eta = \eta_0 \log(t)$
d) $\eta = \eta_0 + t$
Question 24. Learning rate decay is typically used to:
a) Fine‑tune the model toward the end of training
b) Avoid oscillation near minima
c) Eliminate the need for momentum
d) Control the impact of noisy gradients
Question 25. In step decay, the learning rate changes at ______ intervals.
a) Predefined
b) One
c) Random
d) None of the above
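Step decay can be illustrated with a small Python sketch. The initial rate, drop factor, and interval below are hypothetical values chosen for illustration, and `step_decay` is not a function from the course.

```python
def step_decay(initial_lr, drop_factor, epochs_per_drop, epoch):
    """Multiply the learning rate by drop_factor once every epochs_per_drop epochs."""
    num_drops = epoch // epochs_per_drop  # completed predefined intervals
    return initial_lr * (drop_factor ** num_drops)

# Hypothetical settings: start at 0.1 and halve the rate every 10 epochs.
schedule = [step_decay(0.1, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]
print(schedule)  # [0.1, 0.1, 0.05, 0.025]
```

The rate stays constant inside each interval and changes only at the predefined boundaries (epochs 10, 20, and so on in this sketch).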
Question 26. If you have 100,000 samples and batch size is 10,000, how many parameter updates happen in one epoch?
a) 10
b) 100
c) 1000
d) 1
Question 27. If $N = 60{,}000$ and batch size $B = 5{,}000$, the number of weight updates per epoch = _____.
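The number of updates per epoch in Questions 26 and 27 follows from dividing the dataset size by the batch size, assuming one update per mini-batch and a dataset size that is an exact multiple of the batch size:

```latex
\[
\text{updates per epoch} = \frac{N}{B}:
\qquad
\frac{100{,}000}{10{,}000} = 10,
\qquad
\frac{60{,}000}{5{,}000} = 12.
\]
```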
Question 28. Suppose you’re using Nesterov Accelerated Gradient and are at time step $t$. The current gradient at the look‑ahead position is $\nabla w_{\text{look}} = 0.3$, the previous velocity (update) is $\text{update}_{t-1} = 0.2$, and the hyperparameters are $\gamma = 0.8$, $\eta = 0.05$. What is the value of the current update $\text{update}_t$?
a) 0.175
b) 0.195
c) 0.18
d) 0.31
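As a worked equation, assuming the usual NAG rule in which the previous update is scaled by $\gamma$ and the look-ahead gradient by $\eta$:

```latex
\[
\text{update}_t = \gamma \,\text{update}_{t-1} + \eta \,\nabla w_{\text{look}}
= 0.8 \times 0.2 + 0.05 \times 0.3
= 0.16 + 0.015
= 0.175.
\]
```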
Question 29. You’re optimizing a neural network with NAG. At iteration $t$, you have: current weight $w_t = 1.0$, previous update $\text{update}_{t-1} = 0.25$, $\gamma = 0.9$, $\eta = 0.01$, and gradient at the look‑ahead position $\nabla w_{\text{look}} = -0.5$. What is the value of the update at time $t$?
a) 0.78
b) 0.775
c) 0.79
d) 0.77
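A hedged Python sketch of one NAG step with these numbers is given below. It assumes the convention look-ahead = w_t − γ · update_{t-1}, update_t = γ · update_{t-1} + η · ∇w_look, and w_{t+1} = w_t − update_t; the function name `nag_step` is hypothetical. Under this convention the listed options are on the scale of the weight rather than the update, so the sketch reports the look-ahead point, the update, and the new weight.

```python
def nag_step(w, prev_update, grad_at_lookahead, lr, gamma):
    """One Nesterov step, given the gradient already evaluated at the look-ahead point."""
    lookahead = w - gamma * prev_update              # where the gradient was evaluated
    update = gamma * prev_update + lr * grad_at_lookahead
    new_w = w - update
    return lookahead, update, new_w

lookahead, update, new_w = nag_step(
    w=1.0, prev_update=0.25, grad_at_lookahead=-0.5, lr=0.01, gamma=0.9
)
print(round(lookahead, 3))  # 0.775
print(round(update, 3))     # 0.22
print(round(new_w, 3))      # 0.78
```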
Deep Learning IIT Ropar Week 4 NPTEL Assignment Answers (Jan-Apr 2025)
Q1. Using the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, what would be the bias-corrected first moment estimate after the first update if the initial gradient is 4?
a) 0.4
b) 4.0
c) 3.6
d) 0.44
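The bias correction can be checked with a minimal Python sketch. It assumes the standard Adam first-moment recursion with the moment initialised to zero; the function name `adam_first_moment` is illustrative.

```python
def adam_first_moment(grad, beta1, prev_m=0.0, step=1):
    """Return the raw and bias-corrected first moment estimates at the given step."""
    m = beta1 * prev_m + (1 - beta1) * grad  # exponential moving average of gradients
    m_hat = m / (1 - beta1 ** step)          # undo the bias toward the zero initialisation
    return m, m_hat

m, m_hat = adam_first_moment(grad=4.0, beta1=0.9)
print(round(m, 6))      # 0.4
print(round(m_hat, 6))  # 4.0
```

At the first step the correction divides by 1 − β1, which exactly cancels the (1 − β1) weighting, so the corrected estimate equals the gradient itself.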
Q2. In a mini-batch gradient descent algorithm, if the total number of training samples is 50,000 and the batch size is 100, how many iterations are required to complete 10 epochs?
a) 5,000
b) 50,000
c) 500
d) 5
Q3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after 5 epochs?
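The phrase "decay rate of 0.1 per epoch" can be read in more than one way, so the sketch below evaluates two common conventions rather than asserting which one the assignment intends. Both function names are hypothetical.

```python
import math

def exponential_decay(initial_lr, decay_rate, epoch):
    """Continuous form: lr = initial_lr * exp(-decay_rate * epoch)."""
    return initial_lr * math.exp(-decay_rate * epoch)

def multiplicative_decay(initial_lr, decay_rate, epoch):
    """Discrete form: lr shrinks by a factor (1 - decay_rate) every epoch."""
    return initial_lr * (1 - decay_rate) ** epoch

lr_continuous = exponential_decay(0.1, 0.1, 5)
lr_discrete = multiplicative_decay(0.1, 0.1, 5)
print(f"{lr_continuous:.5f}")  # 0.06065
print(f"{lr_discrete:.5f}")    # 0.05905
```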
Q4. In the context of the Adam optimizer, what is the purpose of bias correction?
a) To prevent overfitting
b) To speed up convergence
c) To correct for the bias in the estimates of first and second moments
d) To adjust the learning rate
Q5. The figure below shows the contours of a surface. Suppose that a man walks from -1 to +1 along both the horizontal (x) axis and the vertical (y) axis. The statement that the man would see the slope change more rapidly along the x-axis than along the y-axis is:
a) True
b) False
c) Cannot say
Q6. What is the primary benefit of using Adagrad compared to other optimization algorithms?
a) It converges faster than other optimization algorithms.
b) It is more memory-efficient than other optimization algorithms.
c) It is less sensitive to the choice of hyperparameters (learning rate).
d) It is less likely to get stuck in local optima than other optimization algorithms.
Q7. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
a) SGD converges more quickly than vanilla gradient descent.
b) SGD is computationally efficient for large datasets.
c) SGD theoretically guarantees that the descent direction is optimal.
d) SGD experiences less oscillation compared to vanilla gradient descent.
Q8. What is the role of activation functions in deep learning?
a) Activation functions transform the output of a neuron into a non-linear function, allowing the network to learn complex patterns.
b) Activation functions make the network faster by reducing the number of iterations needed for training.
c) Activation functions are used to normalize the input data.
d) Activation functions are used to compute the loss function.
Q9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
b) Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch gradient descent.
c) Mini-batch gradient descent gives us a better solution.
d) Mini-batch gradient descent can converge faster than batch gradient descent.
Q10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:
a) The current position
b) A “look-ahead” position
c) The previous position
d) The average of current and previous positions
Deep Learning IIT Ropar Week 4 NPTEL Assignment Answers (July-Dec 2024)
Course Link: Click Here
Q1. A team has a dataset that contains 1000 samples for training a feed-forward neural network. Suppose they decide to use the stochastic gradient descent algorithm to update the weights. How many times do the weights get updated after training the network for 5 epochs?
1000
5000
100
5
Answer: B) 5000
Q2. What is the primary benefit of using Adagrad compared to other optimization algorithms?
It converges faster than other optimization algorithms.
It is more memory-efficient than other optimization algorithms.
It is less sensitive to the choice of hyperparameters (learning rate).
It is less likely to get stuck in local optima than other optimization algorithms.
Answer: It is less sensitive to the choice of hyperparameters (learning rate).
For answers or latest updates join our telegram channel: Click here to join
These are Deep Learning IIT Ropar Week 4 Nptel Assignment Answers
Q3. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
SGD converges more quickly than vanilla gradient descent.
SGD is computationally efficient for large datasets.
SGD theoretically guarantees that the descent direction is optimal.
SGD experiences less oscillation compared to vanilla gradient descent.
Answer:
Q4. Select the behaviour of the gradient descent algorithm that uses the following update rule,
$w_{t+1} = w_t - \eta \nabla w_t$
where $w$ is a weight and $\eta$ is the learning rate.
The weight update is tiny at a steep loss surface
The weight update is tiny at a gentle loss surface
The weight update is large at a steep loss surface
The weight update is large at a gentle loss surface
Answer: The weight update is large at a steep loss surface
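Because the step in plain gradient descent is the learning rate times the gradient, its size scales directly with the slope. A small Python sketch makes this concrete; the two gradient values are hypothetical stand-ins for a steep and a gentle region.

```python
def gd_update_size(grad, lr):
    """Magnitude of the vanilla gradient descent step, |lr * grad|."""
    return abs(lr * grad)

lr = 0.1
steep_step = gd_update_size(grad=10.0, lr=lr)   # hypothetical steep region
gentle_step = gd_update_size(grad=0.01, lr=lr)  # hypothetical gentle region
print(steep_step > gentle_step)  # True
```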
Q5. Given data where one column predominantly contains zero values, which algorithm should be used to achieve faster convergence and optimize the loss function?
Adam
NAG
Momentum-based gradient descent
Stochastic gradient descent
Answer: Adam
Q6. In Nesterov accelerated gradient descent, what step is performed before determining the update size?
Increase the momentum
Adjust the learning rate
Decrease the step size
Estimate the next position of the parameters
Answer:
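Q7. We have the following functions: $x^3$, $\ln(x)$, $e^x$, $x$ and 4. Which of the following functions has the steepest slope at $x = 1$?
$x^3$
$\ln(x)$
$e^x$
4
Answer: $x^3$ (its derivative $3x^2$ equals 3 at $x = 1$, which exceeds $e \approx 2.72$ for $e^x$, 1 for $\ln(x)$ and for $x$, and 0 for the constant 4)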
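The slopes at x = 1 can be compared numerically with a short Python sketch. It uses a central finite difference as an approximation of the derivative; the helper name `slope_at` and the step size `h` are illustrative choices.

```python
import math

def slope_at(f, x, h=1e-6):
    """Approximate f'(x) with a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

functions = {
    "x^3": lambda x: x ** 3,
    "ln(x)": math.log,
    "e^x": math.exp,
    "x": lambda x: x,
    "4": lambda x: 4.0,
}

slopes = {name: slope_at(f, 1.0) for name, f in functions.items()}
steepest = max(slopes, key=lambda name: abs(slopes[name]))

for name, value in slopes.items():
    print(f"{name}: {value:.3f}")
print(steepest)  # x^3
```

The analytic derivatives agree with the approximation: 3 for x^3, 1 for ln(x), e for e^x, 1 for x, and 0 for the constant.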
Q8. Which of the following represents the contour plot of the function $f(x, y) = x^2 - y^2$?
Answer: Option C
Q9. Which of the following algorithms will result in more oscillations of the parameter during the training process of the neural network?
Stochastic gradient descent
Mini batch gradient descent
Batch gradient descent
Batch NAG
Answer:
Q10. Consider a gradient profile $\nabla W = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]$. Assume $v_{-1} = 0$, $\epsilon = 0$, $\beta = 0.9$ and the learning rate is $\eta_{-1} = 0.1$. Suppose that we use the Adagrad algorithm; then what is the value of $\eta_6 = \eta / \sqrt{v_t + \epsilon}$?
0.03
0.06
0.08
0.006
Answer:
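The effective Adagrad learning rate for this gradient profile can be computed with a brief Python sketch. It assumes zero-based time steps, so that step 6 accumulates the first seven squared gradients, and it ignores β because Adagrad's accumulator is a plain running sum; the function name `adagrad_effective_lr` is hypothetical.

```python
import math

grads = [1, 0.9, 0.6, 0.01, 0.1, 0.2, 0.5, 0.55, 0.56]

def adagrad_effective_lr(grads, step, base_lr=0.1, eps=0.0):
    """Effective rate base_lr / sqrt(v_step + eps), with v the running sum of squared gradients."""
    v = sum(g * g for g in grads[: step + 1])  # steps 0 through `step`, inclusive
    return base_lr / math.sqrt(v + eps)

eta_6 = adagrad_effective_lr(grads, step=6)
print(round(eta_6, 2))  # 0.06
```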
Check here all Deep Learning IIT Ropar Nptel Assignment Answers : Click here
For answers to additional Nptel courses, please refer to this link: NPTEL Assignment Answers