Reinforcement Learning Nptel Week 1 Assignment Answers

Are you looking for Reinforcement Learning Nptel Week 1 Assignment Answers? Solutions for all weeks of this Swayam course are available here.


Reinforcement Learning Nptel Week 1 Assignment Answers (July-Dec 2025)

Course link: Click here to visit course on Nptel Website


Question 1. In the update rule $Q_{t+1}(a) \leftarrow Q_t(a) + \alpha(R_t - Q_t(a))$, select the value of $\alpha$ that we would prefer for estimating Q values in a non-stationary bandit problem.
a) $\alpha = \frac{1}{n_a + 1}$
b) $\alpha = 0.1$
c) $\alpha = n_a + 1$
d) $\alpha = \frac{1}{(n_a + 1)^2}$

View Answer
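To see why a constant step size is preferred here, below is a minimal Python sketch (assuming a hypothetical single arm whose true mean drifts each step, with made-up noise parameters) comparing the sample-average step size with a fixed $\alpha = 0.1$:

```python
import random

def track_arm(alpha_fn, steps=5000, seed=0):
    """Estimate Q for a single arm whose true mean drifts over time."""
    rng = random.Random(seed)
    true_mean, q, n = 0.0, 0.0, 0
    for _ in range(steps):
        true_mean += rng.gauss(0, 0.01)          # non-stationary drift of the true mean
        reward = true_mean + rng.gauss(0, 0.1)   # noisy reward around the current mean
        n += 1
        q += alpha_fn(n) * (reward - q)          # Q <- Q + alpha * (R - Q)
    return q, true_mean

# Sample-average step size (1/n): weights all past rewards equally,
# so stale rewards dominate once n is large.
q_avg, mean_avg = track_arm(lambda n: 1.0 / n)

# Constant step size (alpha = 0.1): exponentially forgets old rewards,
# so the estimate keeps tracking the drifting mean.
q_const, mean_const = track_arm(lambda n: 0.1)

print(f"sample-average estimate {q_avg:+.3f} vs true mean {mean_avg:+.3f}")
print(f"constant-alpha estimate {q_const:+.3f} vs true mean {mean_const:+.3f}")
```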


Question 2. The “Credit assignment problem” is the issue of correctly attributing accumulated rewards to the action(s) that produced them. Which of the following is/are reasons for the credit assignment problem in RL? (Select all that apply)
a) Reward for an action may only be observed after many time steps.
b) An agent may get the same reward for multiple actions.
c) The agent discounts rewards that occurred in previous time steps.
d) Rewards can be positive or negative.

View Answer


Question 3. Assertion1: In stationary bandit problems, we can achieve asymptotically correct behaviour by selecting exploratory actions with a fixed non-zero probability without decaying exploration.
Assertion2: In non-stationary bandit problems, it is important that we decay the probability of exploration to zero over time in order to achieve asymptotically correct behavior.
a) Assertion1 and Assertion2 are both True.
b) Assertion1 is True and Assertion2 is False.
c) Assertion1 is False and Assertion2 is True.
d) Assertion1 and Assertion2 are both False.

View Answer


Question 4. We are trying different algorithms to find the optimal arm of a multi-arm bandit. The expected payoff of each algorithm corresponds to some function of time $t$. Given that the optimal expected payoff is 1, which of the following functions corresponds to the algorithm with the least regret?
a) $\tanh(t/5)$
b) $1 - 2^{-t}$
c) $\frac{t}{20}$ if $t < 20$ and 1 after that
d) Same regret for all the above functions.

View Answer
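Regret here is the cumulative gap between the optimal expected payoff (1) and the expected payoff actually achieved at each step. A small Python check (the horizon of 200 steps is a hypothetical choice, just for illustration) sums this gap for each of the three payoff curves:

```python
import math

T = 200  # hypothetical horizon for the comparison

def regret(payoff):
    """Cumulative regret: sum over t of (optimal payoff 1 - expected payoff at t)."""
    return sum(1.0 - payoff(t) for t in range(1, T + 1))

curves = {
    "tanh(t/5)":   lambda t: math.tanh(t / 5),
    "1 - 2^(-t)":  lambda t: 1 - 2 ** (-t),
    "t/20 then 1": lambda t: t / 20 if t < 20 else 1.0,
}

for name, f in curves.items():
    print(f"{name:>12}: cumulative regret = {regret(f):.3f}")
```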



Question 5. Which of the following is/are correct and valid reasons to consider sampling actions from a softmax distribution instead of using an $\varepsilon$-greedy approach?
i. Softmax exploration makes the probability of picking an action proportional to the action-value estimates. By doing so, it avoids wasting time exploring obviously 'bad' actions.
ii. We do not need to worry about decaying exploration slowly like we do in the ε\varepsilon-greedy case. Softmax exploration gives us asymptotic correctness even for a sharp decrease in temperature.
iii. It helps us differentiate between actions whose action-value estimates (Q values) are very close to that of the action with the maximum Q value.

a) i, ii, iii
b) only iii
c) only i
d) i, ii
e) i, iii

View Answer
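A quick illustration of points (i) and (iii): with hypothetical action-value estimates where two arms are nearly tied and one is clearly bad, $\varepsilon$-greedy spreads its exploration uniformly over all arms, while softmax almost never picks the bad arm and still differentiates the close ones (the values of $\varepsilon$ and the temperature below are made up for the sketch):

```python
import math

Q = [1.0, 0.95, -5.0]   # hypothetical estimates: two close arms, one clearly bad
eps, tau = 0.1, 0.5     # hypothetical exploration rate and temperature

# epsilon-greedy: probability eps is spread uniformly over ALL arms, including the bad one
greedy = max(range(len(Q)), key=Q.__getitem__)
p_eps = [eps / len(Q) + (1 - eps if a == greedy else 0.0) for a in range(len(Q))]

# softmax: probability proportional to exp(Q/tau), so the bad arm is almost never tried
z = [math.exp(q / tau) for q in Q]
p_soft = [x / sum(z) for x in z]

print("epsilon-greedy:", [round(p, 3) for p in p_eps])
print("softmax       :", [round(p, 3) for p in p_soft])
```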


Question 6. Consider a standard multi-arm bandit problem. The probability of picking an action using the softmax policy is given by: $\Pr(a_t=a)=\frac{e^{Q_t(a)/\beta}}{\sum_b e^{Q_t(b)/\beta}}$

Now, assuming the following action-value estimates: $Q_t(a_0)=0.1$, $Q_t(a_1)=0.02$, $Q_t(a_2)=0.05$, $Q_t(a_3)=-0.1$, $Q_t(a_4)=0.002$, $Q_t(a_5)=-0.2$.
What is the probability that action 2 ($a_t=a_2$) is selected? (Use $\beta=0.1$.)
a) 0
b) 0.13
c) 0.232
d) 0.143

View Answer
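The probability can be checked numerically by plugging the given estimates into the softmax formula with $\beta = 0.1$; a short Python sketch:

```python
import math

beta = 0.1
Q = {"a0": 0.1, "a1": 0.02, "a2": 0.05, "a3": -0.1, "a4": 0.002, "a5": -0.2}

# Pr(a) = exp(Q(a)/beta) / sum_b exp(Q(b)/beta)
weights = {a: math.exp(q / beta) for a, q in Q.items()}
total = sum(weights.values())

for a, w in weights.items():
    print(f"Pr({a}) = {w / total:.3f}")
```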


Question 7. What are the properties of a solution method that is PAC Optimal?
a) It is guaranteed to find the correct solution.
b) It minimizes sample complexity to make the PAC guarantee.
c) It always reaches optimal behaviour faster than an algorithm that is simply asymptotically correct.

Options:
a) Both (a) and (b)
b) Both (b) and (c)
c) Both (a) and (c)

View Answer


Question 8. Consider the following statements:
i. The agent must receive a reward for every action taken in order to learn an optimal policy.
ii. Reinforcement Learning is neither supervised nor unsupervised learning.
iii. Two reinforcement learning agents cannot learn by playing against each other.
iv. Always selecting the action with maximum reward will automatically maximize the probability of winning a game.

Which of the above statements is/are correct?
a) i, ii, iii
b) only ii
c) ii, iii
d) iii, iv

View Answer



Question 9. Assertion: Taking exploratory actions is important for RL agents.
Reason: If the rewards obtained for actions are stochastic, an action which gave a high reward once, might give lower reward next time.
a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
c) Assertion is true and Reason is false
d) Both Assertion and Reason are false

View Answer


Question 10. The following are two ways of defining the probability of selecting an action/arm under a softmax policy. Which of the two is the better choice, and why?

i. $\Pr(a_t=a)=\frac{Q_t(a)}{\sum_a Q_t(a)}$

ii. $\Pr(a_t=a)=\frac{e^{Q_t(a)}}{\sum_{b=1}^n e^{Q_t(b)}}$

a) (i) is the better choice as it requires less complex computation.
b) (ii) is the better choice as it can also deal with negative values of $Q_t(a)$.
c) Both are good, as both formulas bound the probability in the range 0 to 1.
d) (i) is better because it can differentiate well between close values of $Q_t(a)$.

View Answer
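A small sketch (with hypothetical Q values, one of them negative) showing how the two normalisations behave: direct normalisation (i) can produce negative "probabilities", while the softmax form (ii) always yields a valid distribution:

```python
import math

Q = [0.5, -0.3, 0.1]   # hypothetical action-value estimates, one of them negative

# (i) direct normalisation: can yield negative "probabilities" when Q values are negative
direct = [q / sum(Q) for q in Q]

# (ii) softmax normalisation: always non-negative and sums to 1
exp_q = [math.exp(q) for q in Q]
softmax = [x / sum(exp_q) for x in exp_q]

print("direct :", [round(p, 3) for p in direct])   # note the negative entry
print("softmax:", [round(p, 3) for p in softmax])
```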


These are Reinforcement Learning Nptel Week 1 Assignment Answers

Click here for all nptel assignment answers