Reinforcement Learning Nptel Week 3 Assignment Answers

Are you looking for Reinforcement Learning Nptel Week 3 Assignment Answers? Solutions for all weeks of this Swayam course are available here.



Reinforcement Learning Nptel Week 3 Assignment Answers (July-Dec 2025)

Course link: Click here to visit course on Nptel Website


Question 1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
a) r_{n−1}
b) r_n
c) The action taken (a_n)
d) None of the above

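Note: the fact this question turns on is that a baseline which does not depend on the chosen action adds no bias to the REINFORCE gradient. Below is a minimal Monte Carlo sketch of that; the two-armed sigmoid policy, the arm reward values, and the baseline b = 0.5 are illustrative assumptions, not part of the assignment.

```python
# Monte Carlo check that an action-independent baseline b leaves
# E[(r - b) * d ln pi(a)/d theta] unchanged, for a two-armed sigmoid policy.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))       # pi(a=1); pi(a=0) = 1 - p1
n = 200_000
a = (rng.random(n) < p1).astype(float)  # sampled actions (0 or 1)
r = np.where(a == 1, 0.8, 0.2) + rng.normal(0.0, 0.1, n)  # assumed rewards
score = a - p1                          # d ln pi(a)/d theta for the sigmoid
b = 0.5                                 # arbitrary constant baseline

# The two sample means should agree (up to Monte Carlo noise).
print(np.mean(r * score), np.mean((r - b) * score))
```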


Question 2. Which of the following statements is true about the RL problem?
a) Our main aim is to maximize the cumulative reward.
b) The agent always performs the actions in a deterministic fashion.
c) We assume that the agent determines the next state based on the current state and action.
d) It is impossible to have zero rewards.



Question 3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let a_t denote the action taken at step t. Which of the updates (i)–(iv) are correct?
a) (i), (iii)
b) (i), (iv)
c) (ii), (iv)
d) (ii), (iii)

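The four candidate updates (i)–(iv) appear as an image in the original assignment and are not reproduced above. For reference, here is a sketch of the standard score-function derivatives of a Gaussian policy, from which REINFORCE-style updates of µ and σ are built (following Williams, 1992); the step size α and the use of the raw reward r_t are the usual textbook choices, not the assignment's exact options.

```latex
% Log-likelihood of a Gaussian policy \pi(a;\mu,\sigma) = N(a;\mu,\sigma^2)
\[
\ln \pi(a_t;\mu,\sigma) = -\frac{(a_t-\mu)^2}{2\sigma^2} - \ln\sigma - \tfrac{1}{2}\ln 2\pi
\]
% Score functions with respect to each parameter
\[
\frac{\partial \ln \pi}{\partial \mu} = \frac{a_t-\mu}{\sigma^2},
\qquad
\frac{\partial \ln \pi}{\partial \sigma} = \frac{(a_t-\mu)^2-\sigma^2}{\sigma^3}
\]
% REINFORCE ascends these directions, scaled by the reward
\[
\mu \leftarrow \mu + \alpha\, r_t\, \frac{a_t-\mu}{\sigma^2},
\qquad
\sigma \leftarrow \sigma + \alpha\, r_t\, \frac{(a_t-\mu)^2-\sigma^2}{\sigma^3}
\]
```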


Question 4. How are E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ln π(a_t; θ_t)/∂θ_t] related?
a) Equal: E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
b) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
c) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
d) Could be either of a, b or c, depending on the choice of baseline

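The identity this question tests follows in one line, assuming the baseline b does not depend on a_t:

```latex
\[
\mathbb{E}\!\left[(r_t-b)\,\tfrac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right]
= \mathbb{E}\!\left[r_t\,\tfrac{\partial \ln \pi(a_t;\theta_t)}{\partial \theta_t}\right]
- b\sum_{a} \frac{\partial \pi(a;\theta_t)}{\partial \theta_t},
\quad\text{where}\quad
\sum_{a} \frac{\partial \pi(a;\theta_t)}{\partial \theta_t}
= \frac{\partial}{\partial \theta_t}\sum_{a}\pi(a;\theta_t)
= \frac{\partial}{\partial \theta_t} 1 = 0.
\]
```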


Question 5. Consider the following policy-search algorithm for a multi-armed binary bandit. Which of the following is true for the above algorithm?
a) It is the L_{R−I} algorithm.
b) It is the L_{R−εP} algorithm.
c) It would work well if the best arm had probability of 0.9 of resulting in +1 reward and the next best arm had probability of 0.5 of resulting in +1 reward
d) It would work well if the best arm had probability of 0.3 of resulting in +1 reward and the worst arm had probability of 0.25 of resulting in +1 reward

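The algorithm referenced in this question appears as an image in the original assignment and is not reproduced above. As a point of comparison, here is a minimal sketch of the classic L_{R−I} (linear reward-inaction) update for a two-armed binary bandit; the success probabilities and learning rate are illustrative assumptions.

```python
# L_{R-I} (linear reward-inaction) sketch: on a +1 reward, move probability
# mass toward the chosen arm; on a failure, do nothing ("inaction").
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.5])   # action probabilities over the two arms
success = [0.9, 0.5]       # assumed P(+1 reward) for each arm
alpha = 0.01               # learning rate

for _ in range(5_000):
    arm = rng.choice(2, p=p)
    if rng.random() < success[arm]:    # +1 reward: update
        p[arm] += alpha * (1.0 - p[arm])
        p[1 - arm] *= (1.0 - alpha)
                                       # failure: inaction (no update)
print(p)  # probability mass should concentrate on the better arm
```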



Question 6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
c) Assertion is true and Reason is false
d) Both Assertion and Reason are false



Question 7. Let's assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s, where we take action a_1. After a few time steps, at time t′, the same state s is reached, and we perform an action a_2 (a_2 ≠ a_1). Which of the following statements is true?
a) π is definitely a Stationary policy
b) π is definitely a Non-Stationary policy
c) π can be Stationary or Non-Stationary

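A point worth keeping in mind for this question: a stationary policy is one whose action distribution π(a|s) does not change with time, but that distribution may still be stochastic. The toy sketch below (a hypothetical one-state policy) shows how repeat visits to the same state can yield different actions under a perfectly stationary policy.

```python
# A stationary *stochastic* policy: pi(a|s) is fixed for all time, yet two
# visits to the same state can produce different sampled actions, so
# observing a1 != a2 in state s does not by itself imply non-stationarity.
import random

random.seed(42)
pi = {"s": {"a1": 0.5, "a2": 0.5}}   # assumed toy policy over one state

def act(state):
    actions = list(pi[state])
    weights = list(pi[state].values())
    return random.choices(actions, weights=weights)[0]

print(act("s"), act("s"))  # may print two different actions
```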


Question 8. The stochastic gradient ascent/descent update occurs in the right direction at every step.
a) True
b) False

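A quick numeric illustration of the idea behind this question (the noise scale below is an arbitrary assumption): a stochastic gradient estimate is only correct in expectation, so individual steps can point away from the optimum.

```python
# For f(x) = x^2 the true gradient at x = 1 is +2, but noisy per-sample
# gradient estimates can come out negative, sending single SGD steps the
# wrong way even though the *average* step follows the true gradient.
import numpy as np

rng = np.random.default_rng(7)
true_grad = 2.0 * 1.0                              # gradient of x^2 at x=1
noisy = true_grad + rng.normal(0.0, 3.0, size=10)  # assumed noise scale
print((noisy < 0).sum(), "of 10 sampled gradients point the wrong way")
```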



Question 9. Which of the following is true for an MDP?
a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, …, s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})



Question 10. For discounted returns G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …, what happens when γ > 1 (e.g., γ = 5)?
a) Nothing, γ > 1 is common for many RL problems
b) Theoretically nothing can go wrong, but this case does not represent any real world problems
c) The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
d) None of the above is true

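A quick numeric illustration, assuming a constant reward of 1 per step: for γ < 1 the discounted return converges, while for γ > 1 it blows up, so the return is not well defined.

```python
# Partial sums of G_t = sum_k gamma^k * 1: bounded for gamma < 1,
# divergent for gamma > 1 (e.g. gamma = 5).
for gamma in (0.9, 5.0):
    partial = sum(gamma**k for k in range(30))  # 30-step partial sum
    print(f"gamma={gamma}: 30-step partial return = {partial:.3g}")
```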


These are Reinforcement Learning Nptel Week 3 Assignment Answers

Click here for all Nptel assignment answers