Reinforcement Learning Nptel Week 2 Assignment Answers
Are you looking for Reinforcement Learning Nptel Week 2 Assignment Answers? Solutions for all weeks of this Swayam course are available here.
Reinforcement Learning Nptel Week 2 Assignment Answers (July-Dec 2025)
Course link: Click here to visit course on Nptel Website
Q1. Which of the following is true of the UCB algorithm?
a) The action with the highest Q value is chosen at every iteration.
b) After a very large number of iterations, the confidence intervals of unselected actions will not change much.
c) The true expected-value of an action always lies within its estimated confidence interval.
d) With a small probability ε, we select a random action to ensure adequate exploration of the action space.
Q2. In UCB, the term √(2 ln(n)/n_j) is added to each arm’s Q value. What would happen to the frequency of picking sub-optimal arms when using √(2 ln(n)/n_j²) instead?
a) Sub-optimal arms would be chosen more frequently.
b) Sub-optimal arms would be chosen less frequently.
c) Makes no change to the frequency of picking sub-optimal arms.
d) Sub-optimal arms could be chosen less or more frequently, depending on the samples.
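As a quick sanity check (a sketch in Python; the values n = 1000 and n_j = 50 are arbitrary assumptions), squaring n_j in the denominator makes the exploration bonus much smaller:

```python
import math

n, nj = 1000, 50  # hypothetical totals: n pulls overall, nj pulls of arm j
bonus_standard = math.sqrt(2 * math.log(n) / nj)
bonus_modified = math.sqrt(2 * math.log(n) / nj**2)

# The modified bonus decays like 1/nj instead of 1/sqrt(nj),
# so the exploration term shrinks much faster as an arm is sampled.
print(bonus_standard, bonus_modified)
```

A faster-shrinking bonus means less exploration, which is why arms with low Q estimates end up being revisited less often.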
Q3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the Q values are:
Q₁₀₀(1)=1.73, Q₁₀₀(2)=1.83, Q₁₀₀(3)=1.89, Q₁₀₀(4)=1.55
The number of times each arm has been sampled: n₁=25, n₂=20, n₃=30, n₄=25
Which arm will be sampled next?
a) Arm 1
b) Arm 2
c) Arm 3
d) Arm 4
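The selection in Q3 can be checked directly by computing each arm’s UCB score Q_j + √(2 ln(n)/n_j) with the numbers given in the question (a minimal Python sketch):

```python
import math

q = [1.73, 1.83, 1.89, 1.55]   # Q values after n = 100 iterations
counts = [25, 20, 30, 25]      # times each arm has been sampled
n = 100

# UCB score per arm: estimated value plus exploration bonus
scores = [qj + math.sqrt(2 * math.log(n) / nj) for qj, nj in zip(q, counts)]
best_arm = max(range(len(scores)), key=lambda j: scores[j]) + 1  # 1-indexed
print(best_arm)  # → 2
```

Arm 2 wins despite not having the highest Q value, because its smaller sample count gives it the largest exploration bonus.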
Q4. We need 6 rounds of median-elimination to get an (ε,δ)-PAC arm. Approximately how many samples would be needed using the naive (ε,δ)-PAC algorithm for (ε,δ) = (1/2, 1/e)?
a) 1500
b) 1000
c) 500
d) 3000
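One way to arrive at a ballpark figure (a sketch, not the official solution; it assumes the per-arm sample count ℓ = (2/ε²) ln(2K/δ) for the naive algorithm, as in Even-Dar et al., and that 6 halving rounds imply K = 2⁶ = 64 arms):

```python
import math

K = 2 ** 6                 # 6 median-elimination rounds halve the arms each time
eps, delta = 0.5, 1 / math.e

# Naive PAC: sample every arm l times, then keep the best empirical mean
l = (2 / eps**2) * math.log(2 * K / delta)
total = K * l
print(round(total))  # roughly 3000
```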
Q5. In the median elimination method, which of the following statements are correct regarding the definitions of A and B in phases l and l+1?
a) i and ii
b) iii and iv
c) v and vi
d) i and iii
Q6. Which of the following statements is NOT true about Thompson Sampling?
a) After each sample is drawn, the q* distribution for that sampled arm is updated to be closer to the true distribution.
b) Thompson sampling has been shown to generally give better regret bounds than UCB.
c) In Thompson sampling, we do not need to eliminate arms each round to get good sample complexity.
d) The algorithm requires that we use Gaussian priors to represent distributions over q*.
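For intuition on why Gaussian priors are not required, here is a minimal Bernoulli Thompson sampling sketch using Beta priors (the arm means 0.3 and 0.7 and the horizon of 500 are arbitrary assumptions):

```python
import random

random.seed(0)
true_means = [0.3, 0.7]          # hypothetical Bernoulli arms
alpha = [1, 1]
beta = [1, 1]                    # Beta(1, 1) priors — no Gaussian needed
pulls = [0, 0]

for _ in range(500):
    # Draw one plausible mean per arm from its Beta posterior, play the argmax
    theta = [random.betavariate(alpha[j], beta[j]) for j in range(2)]
    j = max(range(2), key=lambda k: theta[k])
    reward = 1 if random.random() < true_means[j] else 0
    alpha[j] += reward
    beta[j] += 1 - reward        # posterior moves toward the true distribution
    pulls[j] += 1
```

As the posteriors concentrate, the better arm is sampled increasingly often.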
Q7. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations.
Reason: The n_j term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.
a) Both Assertion and Reason are true and Reason is a correct explanation of Assertion
b) Both Assertion and Reason are true but Reason is not a correct explanation of Assertion
c) Assertion is true and Reason is false
d) Both Assertion and Reason are false
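The assertion can be checked numerically: for an arm that is never selected again, n_j stays fixed while ln(n) keeps growing, so its confidence bound widens (a small sketch; n_j = 10 is an arbitrary assumption):

```python
import math

nj = 10  # pulls of an arm that is never selected again
bound_at_100 = math.sqrt(2 * math.log(100) / nj)
bound_at_1000 = math.sqrt(2 * math.log(1000) / nj)

# ln(n) grows while nj is frozen, so the bound increases over time
print(bound_at_1000 > bound_at_100)  # → True
```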
Q8. If the naive (ε,δ)-PAC algorithm needs 100 samples for ε and δ, how many samples are needed if ε is halved but δ is unchanged?
a) 400
b) 800
c) 1600
d) 100
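Since δ enters the naive PAC sample complexity only logarithmically while ε enters as 1/ε², halving ε multiplies the sample count by four:

```python
samples = 100            # samples needed at the original (eps, delta)
scale = (1 / 0.5) ** 2   # 1/eps^2 scaling when eps is halved
print(int(samples * scale))  # → 400
```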
Q9. Which of the following is true about the Median Elimination algorithm?
a) It is a regret minimizing algorithm.
b) The probability of εₗ-optimal arms of a round being eliminated is less than δₗ for the round.
c) It is guaranteed to provide an ε-optimal arm at the end.
d) Replacing ε with ε/2 doubles the sample complexity.
Q10. Suppose we are facing a non-stationary bandit problem. What change is needed in posterior sampling to adapt to it?
a) Update the posterior rarely.
b) Randomly shift the posterior drastically from time to time.
c) Keep adding a slight noise to the posterior to prevent its variance from going down quickly.
d) No change is required.
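A toy illustration of option (c) with a Gaussian posterior over an arm’s mean (unit observation noise and the 0.05 noise floor are arbitrary assumptions): injecting a little variance at every step keeps the posterior from collapsing, so the algorithm keeps exploring as the environment drifts.

```python
var_plain, var_noisy = 1.0, 1.0
for _ in range(100):
    # Standard Gaussian posterior update with unit observation noise
    var_plain = 1.0 / (1.0 / var_plain + 1.0)
    var_noisy = 1.0 / (1.0 / var_noisy + 1.0)
    var_noisy += 0.05  # inject slight noise so the variance never collapses

print(var_plain, var_noisy)
```

Without the noise the variance shrinks toward zero, and the posterior becomes too confident to track a drifting mean.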


