# An Introduction to Artificial Intelligence | Week 11

Session: JAN-APR 2024

Course name: An Introduction to Artificial Intelligence

#### Q1. How many parameters must be estimated for Model-based and Model-free RL, respectively?

O(|S| |A|), O(|S|^2 |A|)
O(|S|^2 |A|), O(|S| |A|)
O(|S| |A|), O(|S| |A|^2)
O(|S| |A|^2), O(|S| |A|)

Q2. Suppose we are using Q-learning to learn a policy for a robot navigating a maze. The agent is loaded into a physical robot and learns to navigate by trial and error in the environment. Which of the following characteristics apply to this agent?
Model-based RL is being used
It actively chooses which action to execute in a given state

Answer: It actively chooses which action to execute in a given state

These are An Introduction to Artificial Intelligence Answers Week 11

Q3. We have an MDP with state space S, action space A, and a discount factor γ. The state is comprised of two real-valued variables, x and y. We define the value of the state
where xg and yg are constants. Choose the correct parameter update equations for TD learning. (α is the learning rate, and taking action a ∈ A in state s = (x, y) leads to s’ = (x’, y’).)

a) θ0 = θ0 + α (r + γ max_a U(x’,y’) – U(x,y))
b) θ1 = θ1 + α (r + γ max_a U(x’,y’) – U(x,y))
c) θ0 = θ0 + α (r + γ max_a U(x’,y’) – θ0)
d)

Q4. Let us say that we wish to do feature-based Q learning to find the optimal policy for an MDP. Assume n feature functions, f1(s, a), f2(s, a)…fn(s, a) with weights w1, w2,…wn, that are all initialized to 0, and value function approximated as linear function over feature values. Assume discount factor and learning rate both to be equal to 1.
Assume that our initial state is s1, and on taking action a1, we transition to s2, earning a reward of 10. On updating the feature weights, we observe that w1 increased. Select the most appropriate choice(s).

f1 (s1 , a1) must be positive.
f1 (s1 , a1) must be negative.
f1 (s1 , a1) must be zero.
Insufficient information.

Answer: f1 (s1 , a1) must be positive.
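
The reasoning behind this answer can be checked with a minimal sketch of the usual linear Q-learning update, w_i ← w_i + α (r + γ max_a' Q(s', a') − Q(s, a)) f_i(s, a). With all weights at zero every Q-value is 0, so the TD error equals the reward of 10 and w1 changes by 10 · f1(s1, a1). The feature values and helper names below are made up for illustration and are not part of the question.

```python
def q_value(weights, features):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def update_weights(weights, feats_sa, reward, next_feats_per_action,
                   alpha=1.0, gamma=1.0):
    """One feature-based Q-learning step; returns the new weight list."""
    best_next = max(q_value(weights, f) for f in next_feats_per_action)
    td_error = reward + gamma * best_next - q_value(weights, feats_sa)
    return [w + alpha * td_error * f for w, f in zip(weights, feats_sa)]


# Hypothetical feature values for (s1, a1) and for the actions available in s2.
feats_s1_a1 = [0.5, -2.0, 0.0]
feats_s2 = [[1.0, 1.0, 1.0], [0.0, 3.0, -1.0]]

new_w = update_weights([0.0, 0.0, 0.0], feats_s1_a1, reward=10,
                       next_feats_per_action=feats_s2)
print(new_w)  # each weight moves by 10 * f_i(s1, a1)
```

Only the weight whose feature is positive at (s1, a1) increases; a negative feature pushes its weight down and a zero feature leaves it unchanged.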

Q5. In temporal difference learning, which of the following statements is/are correct?
Temporal difference learning is a model-based type of learning.
Temporal difference learning involves estimating the value of a state for a given policy, based on the average of sample values.
Temporal difference learning involves updating the value of a state incrementally.
The TD error is defined as the difference between the old value of a state and the new value of a state.

Q6. Which of the following is/are TRUE for the exploration-exploitation tradeoff in the context of Q-Learning based RL?
The agent can perform effective exploration by picking an action that has not been picked many times in the same state before
The agent can perform effective exploitation by picking an action that has been picked many times in the same state before
The agent can perform effective exploitation by picking an action in a state that has a high estimated Q value
The agent can perform effective exploration by picking an action that is known to lead to states that have not been visited many times before

Q7. We are given the following grid-world, for which we perform policy evaluation (the policy is indicated by the arrows) using model-based reinforcement learning. A4 and C4 are absorbing states. There are 4 possible actions in each state: R (right), L (left), U (up), and D (down).
We perform two simulations of the policy and collect the following data:
Simulation-1 (specified as (State, Action, Reward))
(A1, D, -1)
(B1, R, -1)
(B2, R, -1)
(B3, U, -1)
(A3, R, -1)
(A2, D, -1)
(B2, R, -1)
(B3, U, -1)
(A3, R, -1)
(A4, 100)

Simulation-2 (specified as (State, Action, Reward))
(A1, D, -1)
(B1, R, -1)
(B2, R, -1)
(B3, U, -1)
(C3, U, -1)
(C4, -100)

T(S, A, S’) is the transition probability of moving to state S’ after taking action A in state S. Let x be the estimate of T(B2, R, B3), and let y be the estimate of T(B3, U, C3), both without any smoothing. What is the value of x/y?
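
The unsmoothed estimates can be obtained by counting, as in the sketch below. It assumes that the next state of each step is the state listed in the following tuple of the same simulation; the variable and function names are illustrative only.

```python
from collections import Counter

# (state, action) steps from the two simulations; the final entry of each
# run is the absorbing state, which has no action.
sim1 = [("A1", "D"), ("B1", "R"), ("B2", "R"), ("B3", "U"), ("A3", "R"),
        ("A2", "D"), ("B2", "R"), ("B3", "U"), ("A3", "R"), ("A4", None)]
sim2 = [("A1", "D"), ("B1", "R"), ("B2", "R"), ("B3", "U"), ("C3", "U"),
        ("C4", None)]

transition_counts = Counter()  # (s, a, s') -> count
action_counts = Counter()      # (s, a) -> count
for run in (sim1, sim2):
    for (s, a), (s_next, _) in zip(run, run[1:]):
        transition_counts[(s, a, s_next)] += 1
        action_counts[(s, a)] += 1


def t_hat(s, a, s_next):
    """Maximum-likelihood estimate of T(s, a, s') with no smoothing."""
    return transition_counts[(s, a, s_next)] / action_counts[(s, a)]


x = t_hat("B2", "R", "B3")
y = t_hat("B3", "U", "C3")
print(x, y, x / y)
```

Under that assumption, (B2, R) is observed three times and always leads to B3, while (B3, U) is observed three times and leads to C3 once.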

Q8. Suppose that, for the same setup and simulation results as in question 7, we now perform model-free reinforcement learning, in which we directly estimate Vπ(s), the expected long-term reward of following policy π from state s, from the observed data. We do not use any discounting. Let w be the estimate of Vπ(A1), x be the estimate of Vπ(B1), y be the estimate of Vπ(B2) and z be the estimate of Vπ(B3).
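
A rough sketch of this direct estimation is given below. It makes two assumptions that the question does not state: every visit to a state contributes one sample (every-visit averaging), and the reward listed with an absorbing state is collected on arrival. A first-visit variant would give different numbers for B2 and B3, which occur twice in Simulation-1.

```python
from collections import defaultdict

# (state, reward) per step; the last entry is the absorbing state's reward.
sim1 = [("A1", -1), ("B1", -1), ("B2", -1), ("B3", -1), ("A3", -1),
        ("A2", -1), ("B2", -1), ("B3", -1), ("A3", -1), ("A4", 100)]
sim2 = [("A1", -1), ("B1", -1), ("B2", -1), ("B3", -1), ("C3", -1),
        ("C4", -100)]

returns = defaultdict(list)
for run in (sim1, sim2):
    reward_to_go = 0
    # Walk backwards so each state's undiscounted return is a running sum.
    for state, reward in reversed(run):
        reward_to_go += reward
        returns[state].append(reward_to_go)

# Average the sampled returns per state (every-visit averaging, an assumption).
v_hat = {s: sum(g) / len(g) for s, g in returns.items()}
w, x, y, z = (v_hat[s] for s in ("A1", "B1", "B2", "B3"))
print(w, x, y, z)
```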

Q9. Suppose you are playing an old video game where the player has to navigate a maze as given below. The player starts at the position labelled S and needs to reach the position labelled G. You can move up, down, left and right. However, your controller is old and unreliable; as a result, 10% of the time it moves in a random direction, irrespective of the direction you pressed. On reaching the goal state you get a reward of +1. Any action that leads to a collision with the grid walls gives a reward of -1, and the position of the player does not change. All other rewards are zero.

You observe an episode with the following succession of states: {S, c, c, d, b, G}. The actions taken by the agent were {UP, RIGHT, RIGHT, UP, RIGHT}. The agent repeats the state c due to a collision with the grid wall. If we implement Q-Learning with α = 0.8, what will be the value of Q(c, RIGHT)? All Q(s, a) pairs are initialised to zero. Assume the discount factor is 1.
Round off the answer to two decimal places, e.g. if your answer is 0.1284 then write 0.13

Q10. Consider the same setting as question 9. You observe the same episode as above and you are doing Q-Learning. If we have an epsilon-greedy policy with epsilon = 0.1, what is the probability of the agent taking action RIGHT in state c after the first episode is over?
Round off the answer to three decimal places, e.g. if your answer is 0.1284 then write 0.128
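
Questions 9 and 10 can be worked through by replaying the episode, as in the sketch below. The rewards are inferred from the problem text and are an assumption: -1 for the wall collision that keeps the agent in c, +1 on reaching G, and 0 otherwise. The epsilon-greedy convention is also an assumption: with probability epsilon an action is drawn uniformly from all four actions, and ties between greedy actions are broken uniformly.

```python
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
ALPHA, GAMMA, EPSILON = 0.8, 1.0, 0.1

# (state, action, reward, next_state); rewards inferred from the text.
episode = [
    ("S", "UP", 0, "c"),
    ("c", "RIGHT", -1, "c"),   # collision: position unchanged
    ("c", "RIGHT", 0, "d"),
    ("d", "UP", 0, "b"),
    ("b", "RIGHT", 1, "G"),
]

q = {(s, a): 0.0 for s in "ScdbG" for a in ACTIONS}
for s, a, r, s_next in episode:
    # G is terminal and never updated, so its Q-values stay at 0.
    target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def epsilon_greedy_prob(state, action):
    """P(action | state) under the epsilon-greedy convention assumed above."""
    best = max(q[(state, b)] for b in ACTIONS)
    greedy = [b for b in ACTIONS if q[(state, b)] == best]
    prob = EPSILON / len(ACTIONS)
    if action in greedy:
        prob += (1 - EPSILON) / len(greedy)
    return prob


print(round(q[("c", "RIGHT")], 2))
print(round(epsilon_greedy_prob("c", "RIGHT"), 3))
```

After the two updates to Q(c, RIGHT), its value is negative while the untried actions in c remain at 0, so RIGHT is no longer a greedy action there and is chosen only through exploration.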

Course Name: An Introduction to Artificial Intelligence

#### Q1. Which of the following statements are true?

a. A model-based learner learns the optimal policy given a model of the state space
b. A passive learner requires a policy to be fed to it
c. A strong simulator can jump to any part of the state space to begin a simulation
d. An active learner learns the optimal policy and also decides which action to take next

Q2. Suppose you are performing model-based passive learning according to a given policy. Following this policy, you have reached State A a total of 100 times. State A has 4 possible transitions to next states: A, B, C, and D. The policy stipulates that you take the action a at this state. Taking action a, you end up in state A 61 times, state B 22 times and state C 17 times. Assuming add-one smoothing, what is the value of T(A, a, B)?
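
Add-one (Laplace) smoothing adds 1 to every transition count and the number of possible next states to the denominator. A small sketch with the counts from this question follows; the function name is illustrative.

```python
counts = {"A": 61, "B": 22, "C": 17, "D": 0}  # D is possible but never observed


def smoothed_transition(next_state, counts, k=1):
    """Add-k (Laplace) estimate of T(A, a, next_state)."""
    total = sum(counts.values())
    return (counts[next_state] + k) / (total + k * len(counts))


t_ab = smoothed_transition("B", counts)
print(t_ab)  # (22 + 1) / (100 + 4)
```

The smoothed estimates over A, B, C and D still sum to 1, and the unobserved transition to D receives a small non-zero probability.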

Q3. For the next three questions, consider the following trajectories obtained by running some simulations in an unknown environment following a given policy. The state space is {A,B,C} and the action space is {a,b}. Assume discount factor is 0.5. Each sample is represented as (State, Action, Reward, Next state).

Run 1: (A, a, 0, B)
Run 2: (C, b, -1, A), (A, a, 0, B)
Run 3: (C, b, -1, B)
Run 4: (A, a, 0, B)
Run 5: (A, a, 0, C), (C, b, -1, B)

Using model-free passive learning, give an empirical estimate of Vπ(A).
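
One way to form this estimate is to average the discounted return observed after each occurrence of A, as sketched below. It assumes each run ends after its last listed sample, so no reward is collected beyond it; the function name is illustrative.

```python
GAMMA = 0.5

# Each run is a list of (state, action, reward, next_state) samples.
runs = [
    [("A", "a", 0, "B")],
    [("C", "b", -1, "A"), ("A", "a", 0, "B")],
    [("C", "b", -1, "B")],
    [("A", "a", 0, "B")],
    [("A", "a", 0, "C"), ("C", "b", -1, "B")],
]


def observed_returns(runs, state, gamma):
    """Discounted return following every occurrence of `state` in the runs."""
    results = []
    for run in runs:
        for i, (s, _, _, _) in enumerate(run):
            if s == state:
                g = sum(gamma ** k * r for k, (_, _, r, _) in enumerate(run[i:]))
                results.append(g)
    return results


returns_a = observed_returns(runs, "A", GAMMA)
v_a = sum(returns_a) / len(returns_a)
print(returns_a, v_a)
```

Only Run 5 yields a non-zero return from A, because it is the only run in which A is followed by the -1 reward at C.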

Q4. Assume that the above samples are fed sequentially to a Temporal Difference learner. Assume all values of states are initialised to 0 and alpha is kept constant at 0.5. What will be the learned value of Vπ(A)?
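
The TD(0) update V(s) ← V(s) + α (r + γ V(s') − V(s)) can be applied to the samples in order, as in this sketch. It assumes the samples are processed run by run, left to right, with the discount factor of 0.5 stated in Q3.

```python
ALPHA, GAMMA = 0.5, 0.5

# (state, action, reward, next_state), flattened in the order Run 1 .. Run 5.
samples = [
    ("A", "a", 0, "B"),
    ("C", "b", -1, "A"), ("A", "a", 0, "B"),
    ("C", "b", -1, "B"),
    ("A", "a", 0, "B"),
    ("A", "a", 0, "C"), ("C", "b", -1, "B"),
]

values = {"A": 0.0, "B": 0.0, "C": 0.0}
for s, _, r, s_next in samples:
    # Move V(s) a fraction ALPHA of the way towards the one-step target.
    values[s] += ALPHA * (r + GAMMA * values[s_next] - values[s])

print(values)
```

V(A) stays at 0 until Run 5, where the sample (A, a, 0, C) bootstraps from the negative value that C has accumulated by then.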

Q5. Assume that the above samples are fed to a Q-learner. What is the value of Q(A,a)? Assume that all Q-values are initialized as 0. The discount factor is 0.5 and the learning rate is also 0.5.

Q6. Suppose we compute the optimal policy given the current Q-values. What is the action under optimal policy at state C?
Type a or b.
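
Questions 5 and 6 can be explored with a small tabular Q-learning sketch over the same samples. It assumes the samples are processed in order and that both actions are available in every state. Note that action a is never tried in C, so its Q-value simply keeps its initial value of 0.

```python
ALPHA, GAMMA = 0.5, 0.5
ACTIONS = ["a", "b"]

# (state, action, reward, next_state), flattened in the order Run 1 .. Run 5.
samples = [
    ("A", "a", 0, "B"),
    ("C", "b", -1, "A"), ("A", "a", 0, "B"),
    ("C", "b", -1, "B"),
    ("A", "a", 0, "B"),
    ("A", "a", 0, "C"), ("C", "b", -1, "B"),
]

q = {(s, a): 0.0 for s in "ABC" for a in ACTIONS}
for s, a, r, s_next in samples:
    best_next = max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])

# Greedy action at C with respect to the learned Q-values.
greedy_at_c = max(ACTIONS, key=lambda act: q[("C", act)])
print(q[("A", "a")], q[("C", "b")], greedy_at_c)
```

Unlike the TD learner, the Q-learner bootstraps from the best action in the next state, so the negative value of (C, b) does not propagate back to (A, a) while Q(C, a) is still 0.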

Q7. Which of the following is correct regarding Boltzmann exploration?
a. It focuses on exploration initially, and more on exploitation as time passes
b. It is guaranteed to discover all reachable states from the start state, given infinite time
c. It leans more towards exploitation as temperature is increased
d. The probability of an action being chosen at a particular state varies exponentially with its Q-value at that point in time
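
For reference, Boltzmann exploration picks action a with probability proportional to exp(Q(s, a) / T), where T is the temperature. The sketch below uses made-up Q-values to show the effect of T; the function name is illustrative.

```python
import math


def boltzmann_probs(q_values, temperature):
    """P(a) proportional to exp(Q(s, a) / temperature)."""
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    top = max(q_values.values())
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


q_values = {"a": 1.0, "b": 0.0}  # made-up Q-values for one state
hot = boltzmann_probs(q_values, temperature=10.0)
cold = boltzmann_probs(q_values, temperature=0.1)
print(hot, cold)
```

At a high temperature the two actions are chosen almost equally often (exploration); at a low temperature nearly all the probability goes to the action with the higher Q-value (exploitation). Schedules that lower the temperature over time therefore shift from exploring to exploiting.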

Q8. Which of the following is required for the convergence of Q-learning to the optimal Q-values?
a. Policy used to generate episodes for learning should be optimal.
b. All states are visited infinitely often over infinitely many samples.
c. Suitable initialisation of Q-values before learning updates.
d. Very large (>>1) learning rate.

Answer: b. All states are visited infinitely often over infinitely many samples.

Q9. Which of the following statements are correct?
a. If an agent does not perform sufficient exploration in the choice of actions in the environment, it runs the risk of never getting large rewards.
b. If the agent has perfect knowledge of the transition and reward model of the environment, exploration is not needed.
c. Degree of exploration should be increased as the learning algorithm performs more and more updates.
d. Exploration is not required in model-based RL algorithms.

Q10. Which of the following statement(s) is/are correct for Model-based and Model-free reinforcement learning methods?

a. Model-based learning usually requires more parameters to be learnt.
b. Model-free learning can simulate new episodes from past experience.
c. Model-based methods are more sample efficient.
d. None of the above.

Answer: a. Model-based learning usually requires more parameters to be learnt.
