Reinforcement Learning Nptel Week 5 Assignment Answers

Are you looking for Reinforcement Learning Nptel Week 5 Assignment Answers? Solutions for all weeks of this Swayam course are available here.



Reinforcement Learning Nptel Week 5 Assignment Answers (July-Dec 2025)

Course link: Click here to visit the course on the Nptel website


Question 1. In policy iteration, which of the following is/are true of the Policy Evaluation (PE) and Policy Improvement (PI) steps?
a) The values of states that are returned by PE may fluctuate between high and low values as the algorithm runs.
b) PE returns the fixed point of the Bellman operator L^{πₙ} for the current policy πₙ.
c) PI can randomly select any greedy policy for a given value function vn.
d) Policy iteration always converges for a finite MDP.
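
To make the PE/PI interplay concrete, here is a minimal policy-iteration sketch in Python. It assumes a tabular MDP given as P[s][a] = [(prob, next_state, reward), ...]; this data layout and the function names are illustrative, not from the course.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iterate the Bellman operator for the current policy to its fixed point."""
    v = np.zeros(len(P))
    while True:
        v_new = np.array([
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][policy[s]])
            for s in range(len(P))
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_improvement(P, v, gamma=0.9):
    """Pick one greedy action per state; any greedy tie-break is valid."""
    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]

def policy_iteration(P, gamma=0.9):
    policy = [0] * len(P)
    while True:
        v = policy_evaluation(P, policy, gamma)     # PE: fixed point of L^{pi_n}
        improved = policy_improvement(P, v, gamma)  # PI: greedy w.r.t. v_n
        if improved == policy:                      # must happen for a finite MDP
            return policy, v
        policy = improved
```

Since a finite MDP has only finitely many deterministic policies and each improvement step never makes the policy worse, this loop must terminate.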


Question 2. Consider the Monte Carlo approach to policy evaluation. Suppose the states are S1, S2, S3, S4, S5, S6 and a terminal state. You sample one trajectory as follows – S1 → S5 → S3 → S6 → terminal state. Which of the following states can be updated from this sample?
a) S1
b) S2
c) S6
d) S4
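
As a sanity check, a first-visit Monte Carlo pass over the sampled trajectory shows exactly which states receive a return sample. The rewards below are made up purely for illustration:

```python
from collections import defaultdict

gamma = 0.9
# (state, reward observed on leaving it); the rewards are invented
episode = [("S1", 1.0), ("S5", 0.0), ("S3", 2.0), ("S6", -1.0)]  # then terminal

returns = defaultdict(list)
G = 0.0
# Monte Carlo updates only happen once the episode has terminated,
# sweeping the trajectory backwards to accumulate returns.
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns[state].append(G)

print(sorted(returns))  # ['S1', 'S3', 'S5', 'S6'] -- S2 and S4 get no sample
```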


Question 3. Which of the following statements are true with regard to Monte Carlo value-approximation methods?
a) To evaluate a policy using these methods, a subset of trajectories in which all states are encountered at least once is enough to update all state-values.
b) Monte-Carlo value function approximation methods need knowledge of the full model.
c) Monte-Carlo methods update state-value estimates only at the end of an episode.
d) Monte-Carlo methods can only be used for episodic tasks.


Question 4. In every-visit Monte Carlo methods, multiple samples for one state are obtained from a single trajectory. Which of the following is true?
a) There is an increase in bias of the estimates.
b) There is an increase in variance of the estimates.
c) It does not affect the bias or variance of estimates.
d) Both bias and variance of the estimates increase.
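
The situation the question describes can be seen in a tiny sketch: with every-visit MC, a trajectory that revisits a state yields several return samples for it, and those samples overlap, so they are correlated rather than independent. Rewards here are invented for illustration:

```python
from collections import defaultdict

gamma = 1.0
episode = [("S1", 0.0), ("S2", 1.0), ("S1", 0.0), ("S3", 2.0)]  # S1 visited twice

returns = defaultdict(list)
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns[state].append(G)

# Every-visit MC keeps both returns for S1; the return from the first
# visit (3.0) contains the return from the second visit (2.0) inside it,
# so the two samples are correlated.
print(returns["S1"])  # [2.0, 3.0]
```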


Question 5. Which of the following statements are FALSE about solving MDPs using dynamic programming?
a) If the state space is large or computation power is limited, it is preferred to update only those states that are seen in the trajectories.
b) Knowledge of transition probabilities is not necessary for solving MDPs using dynamic programming.
c) Methods that update only a subset of states at a time guarantee performance equal to or better than classic DP.
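
For statement (b), it helps to look at what one DP backup actually computes: the expectation over successor states is weighted by the transition probabilities, so classic DP cannot run without the model. A sketch, reusing the illustrative P[s][a] = [(prob, next_state, reward), ...] layout from the policy-iteration snippet above:

```python
import numpy as np

def bellman_optimality_backup(P, v, gamma=0.9):
    """One synchronous sweep; p below is P(s'|s,a), so the model is required."""
    return np.array([
        max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ])
```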


Question 6. Select the correct statements about Generalized Policy Iteration (GPI).
a) GPI lets policy evaluation and policy improvement interact with each other regardless of the details of the two processes.
b) Before convergence, the policy evaluation step will usually cause the policy to no longer be greedy with respect to the updated value function.
c) GPI converges only when a policy has been found which is greedy with respect to its own value function.
d) The policy found by GPI at convergence will be optimal but value function will not be optimal.
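
GPI only requires that evaluation and improvement interact; neither has to run to completion. A sketch with evaluation truncated to a few sweeps, again over the illustrative P[s][a] model layout used above:

```python
import numpy as np

def truncated_evaluation(P, policy, v, gamma=0.9, sweeps=3):
    """Partial policy evaluation: a few sweeps, not run to the fixed point."""
    for _ in range(sweeps):
        v = np.array([
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][policy[s]])
            for s in range(len(P))
        ])
    return v

def greedy(P, v, gamma=0.9):
    q = lambda s, a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]

def gpi(P, gamma=0.9, rounds=100):
    v, policy = np.zeros(len(P)), [0] * len(P)
    for _ in range(rounds):
        v = truncated_evaluation(P, policy, v, gamma)
        # After evaluation, the old policy is typically no longer greedy
        # w.r.t. the updated v -- that mismatch is what drives GPI forward.
        policy = greedy(P, v, gamma)
    return policy, v
```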


Question 7. What is meant by “off-policy” Monte Carlo value function evaluation?
a) The policy being evaluated is the same as the policy used to generate samples.
b) The policy being evaluated is different from the policy used to generate samples.
c) The policy being learnt is different from the policy used to generate samples.
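
In off-policy MC evaluation, returns sampled under a behaviour policy b are reweighted by an importance-sampling ratio so that they estimate values under the target policy π. A toy sketch; the policies and numbers are invented:

```python
def importance_ratio(episode, pi, b):
    """prod_t pi(a_t | s_t) / b(a_t | s_t) over one episode of (s, a) pairs."""
    rho = 1.0
    for s, a in episode:
        rho *= pi[s][a] / b[s][a]
    return rho

pi = {"S1": [1.0, 0.0]}   # target policy: always takes action 0 in S1
b  = {"S1": [0.5, 0.5]}   # behaviour policy: uniform over the two actions
# The sampled return for this episode would be weighted by this ratio:
print(importance_ratio([("S1", 0)], pi, b))  # 2.0
```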


Question 8. Both the value iteration and policy iteration algorithms produce a sequence of vectors over their iterations, say v1, v2, …, vn for value iteration and v′1, v′2, …, v′n for policy iteration. Which of the following statements are true?
a) For all vi ∈ {v1, v2, …, vn} there exists a policy for which vi is a fixed point.
b) For all v′i ∈ {v′1, v′2, …, v′n} there exists a policy for which v′i is a fixed point.
c) For all vi ∈ {v1, v2, …, vn} there may not exist a policy for which vi is a fixed point.
d) For all v′i ∈ {v′1, v′2, …, v′n} there may not exist a policy for which v′i is a fixed point.
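
The contrast behind this question: each policy-iteration iterate v′i is the value function of its current policy (the fixed point of that policy's Bellman operator), while a value-iteration iterate vi is just the result of one Bellman-optimality backup and in general is not the fixed point of L^π for any policy π. A value-iteration sketch, for comparison with the policy-iteration code above:

```python
import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-8):
    v = np.zeros(len(P))
    while True:
        # One Bellman-optimality backup; the intermediate v is generally
        # not the value function of any policy.
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s])))
            for s in range(len(P))
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new   # only the limit v* is a fixed point (of L*)
        v = v_new
```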


These are Reinforcement Learning Nptel Week 5 Assignment Answers

Click here for all nptel assignment answers