Reinforcement Learning NPTEL Week 3 Assignment Answers
Are you looking for the Reinforcement Learning NPTEL Week 3 Assignment Answers? Solutions for all weeks of this Swayam course are available here.
Reinforcement Learning NPTEL Week 3 Assignment Answers (July-Dec 2025)
Course link: see the course page on the NPTEL website
Question 1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
a) rn−1
b) rn
c) Action taken (an)
d) None of the above
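A sketch of the standard argument behind this question: any baseline b that does not depend on the chosen action an contributes zero in expectation, so past rewards such as rn−1 are admissible baselines, while rn (which is determined by an) is not.

```latex
% For any baseline b that does not depend on the action a_n:
\mathbb{E}_{a_n \sim \pi}\!\left[ b \, \frac{\partial \ln \pi(a_n;\theta)}{\partial \theta} \right]
  = b \sum_{a} \pi(a;\theta)\, \frac{\partial \ln \pi(a;\theta)}{\partial \theta}
  = b \, \frac{\partial}{\partial \theta} \sum_{a} \pi(a;\theta)
  = b \, \frac{\partial}{\partial \theta} 1
  = 0
```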
Question 2. Which of the following statements is true about the RL problem?
a) Our main aim is to maximize the cumulative reward.
b) The agent always performs the actions in a deterministic fashion.
c) We assume that the agent determines the next state based on the current state and action
d) It is impossible to have zero rewards.
Question 3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let at denote the action taken at step t. Which of the updates (i)-(iv) are correct?
a) (i), (iii)
b) (i), (iv)
c) (ii), (iv)
d) (ii), (iii)
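The candidate updates (i)-(iv) are not reproduced on this page. For reference, here is a sketch of the standard REINFORCE updates for a Gaussian policy π(at; µ, σ) = N(µ, σ²), with α denoting the learning rate:

```latex
% Score-function gradients of the Gaussian log-density:
\frac{\partial \ln \pi(a_t;\mu,\sigma)}{\partial \mu}
    = \frac{a_t - \mu}{\sigma^{2}},
\qquad
\frac{\partial \ln \pi(a_t;\mu,\sigma)}{\partial \sigma}
    = \frac{(a_t - \mu)^{2} - \sigma^{2}}{\sigma^{3}}
% giving the REINFORCE updates
\mu \leftarrow \mu + \alpha\, r_t\, \frac{a_t - \mu}{\sigma^{2}},
\qquad
\sigma \leftarrow \sigma + \alpha\, r_t\, \frac{(a_t - \mu)^{2} - \sigma^{2}}{\sigma^{3}}
```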
Question 4. How are E[(rt−b)∂lnπ(at;θt)/∂θt] and E[rt∂lnπ(at;θt)/∂θt] related?
a) Equal: E[(rt−b)∂lnπ(at;θt)/∂θt] = E[rt∂lnπ(at;θt)/∂θt]
b) E[(rt−b)∂lnπ(at;θt)/∂θt] < E[rt∂lnπ(at;θt)/∂θt]
c) E[(rt−b)∂lnπ(at;θt)/∂θt] > E[rt∂lnπ(at;θt)/∂θt]
d) Could be either of a, b or c, depending on the choice of baseline
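The intuition here: subtracting an action-independent baseline changes the variance of the estimator but not its expectation. A minimal Monte Carlo check, assuming a 3-arm softmax policy with fixed per-arm rewards (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])   # softmax preferences, one per arm
rewards = np.array([1.0, 0.0, 0.5])  # fixed reward per arm
b = 0.5                              # any action-independent baseline

pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    # d ln pi(a) / d theta for a softmax policy: one_hot(a) - pi
    g = -pi.copy()
    g[a] += 1.0
    return g

plain, baselined = [], []
for _ in range(100_000):
    a = rng.choice(3, p=pi)
    g = grad_log_pi(a)
    plain.append(rewards[a] * g)
    baselined.append((rewards[a] - b) * g)

print(np.mean(plain, axis=0), np.mean(baselined, axis=0))  # ~equal means
print(np.var(plain, axis=0), np.var(baselined, axis=0))    # variances differ
```

So the expectation is the same for any fixed b; a well-chosen baseline only reduces variance.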
Question 5. Consider the following policy-search algorithm for a multi-armed binary bandit. Which of the following is true for this algorithm?
a) It is the LR−I (linear reward-inaction) algorithm.
b) It is the LR−ϵP (linear reward-ϵ-penalty) algorithm.
c) It would work well if the best arm had a probability of 0.9 of yielding a +1 reward and the next-best arm had a probability of 0.5 of yielding a +1 reward.
d) It would work well if the best arm had a probability of 0.3 of yielding a +1 reward and the worst arm had a probability of 0.25 of yielding a +1 reward.
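The algorithm statement itself is not reproduced on this page. For reference, here is a minimal sketch of the standard LR−I (linear reward-inaction) update for a binary bandit, using the success probabilities from option (c) as an assumed setup:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.5])     # action probabilities
arm_success = [0.9, 0.5]     # P(reward = +1) per arm (illustrative)
alpha = 0.05                 # learning rate

for _ in range(5_000):
    a = rng.choice(2, p=p)
    r = 1 if rng.random() < arm_success[a] else 0
    if r == 1:               # reward-inaction: update only on success
        e = np.zeros(2)
        e[a] = 1.0
        p += alpha * (e - p) # move probability mass toward the chosen arm

print(p)  # concentrates on arm 0, the arm that succeeds more often
```

Because updates happen only on +1 rewards, arms that succeed often get reinforced quickly, while arms that rarely succeed drive few updates.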
Question 6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
c) Assertion is true and Reason is false
d) Both Assertion and Reason are false
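For intuition about the assertion: a contextual bandit can be viewed as a degenerate episodic RL problem in which every episode lasts one step and the next context does not depend on the action taken. A minimal sketch (the contexts and reward table below are made up for illustration):

```python
import random

random.seed(4)
contexts = ["c0", "c1"]
reward = {("c0", 0): 1.0, ("c0", 1): 0.0,
          ("c1", 0): 0.0, ("c1", 1): 1.0}

for _ in range(3):
    s = random.choice(contexts)  # "state" = context, drawn independently of history
    a = random.choice([0, 1])
    r = reward[(s, a)]
    print(s, a, r)               # episode ends here: a horizon-1 MDP
```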
Question 7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a1. After a few time steps, at time t′, the same state s was reached, where we performed an action a2 (≠ a1). Which of the following statements is true?
a) π is definitely a Stationary policy
b) π is definitely a Non-Stationary policy
c) π can be Stationary or Non-Stationary
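The point being tested: a stationary policy fixes a (possibly stochastic) action distribution per state, not the action itself, so observing two different actions at the same state is consistent with both stationary and non-stationary policies. A minimal sketch with an assumed two-action state:

```python
import random

random.seed(0)

def pi(state):
    # stationary: the distribution depends only on the state, not on time
    return random.choices(["a1", "a2"], weights=[0.5, 0.5])[0]

print(pi("s"), pi("s"))  # may differ across visits, e.g. 'a1' then 'a2'
```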
Question 8. The stochastic gradient ascent/descent update moves in the right direction at every step.
a) True
b) False
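Intuition: a stochastic gradient estimate is unbiased but noisy, so an individual step can move against the true gradient even though the average direction is correct. A quick numerical sketch (the toy objective f(x) = x² and the noise scale are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = 1.0
true_grad = 2 * x  # gradient of f(x) = x^2
noisy_grads = true_grad + rng.normal(0.0, 5.0, size=100_000)  # zero-mean noise

print(np.mean(noisy_grads))  # ~2.0: unbiased on average
print(np.mean(np.sign(noisy_grads) != np.sign(true_grad)))  # fraction of wrong-direction steps
```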
Question 9. Which of the following is true for an MDP?
a) Pr(st+1, rt+1 | st, at) = Pr(st+1, rt+1)
b) Pr(st+1, rt+1 | st, at, …, s0, a0) = Pr(st+1, rt+1 | st, at)
c) Pr(st+1, rt+1 | st, at) = Pr(st+1, rt+1 | s0, a0)
d) Pr(st+1, rt+1 | st, at) = Pr(st, rt | st−1, at−1)
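Option (b) is the Markov property as usually stated: the distribution of (st+1, rt+1) given the entire history equals its distribution given only (st, at). A minimal sketch of sampling from such an MDP, with an assumed two-state transition table:

```python
import random

random.seed(3)
# P[(state, action)] -> list of (next_state, reward, probability)
P = {
    ("s0", "a"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s1", "a"): [("s0", 0.0, 1.0)],
}

def step(s, a):
    # sampling uses (s, a) alone; no earlier history enters
    outcomes = P[(s, a)]
    ns, r, _ = random.choices(outcomes, weights=[w for *_, w in outcomes])[0]
    return ns, r

s = "s0"
for _ in range(5):
    s, r = step(s, "a")
    print(s, r)
```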
Question 10. For discounted returns Gt = rt + γrt+1 + γ²rt+2 + …, what happens when γ > 1 (e.g., γ = 5)?
a) Nothing, γ > 1 is common for many RL problems
b) Theoretically nothing can go wrong, but this case does not represent any real world problems
c) The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
d) None of the above is true
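Why γ > 1 is problematic: with bounded rewards, the series Σ γᵗ rt no longer converges, so returns (and hence value functions) are not well defined for continuing tasks. A tiny sketch with constant reward rt = 1 and γ = 5:

```python
gamma = 5.0
partial = 0.0
for t in range(20):
    partial += gamma**t * 1.0  # each term dwarfs the previous one
print(partial)  # ~2.4e13 after only 20 steps; the sum grows without bound
```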