Monte Carlo, Temporal Difference, and Deep Learning Approaches in Goal-Seeking Tasks

Author: Ibrahim El Kaddouri

Repository: The repository is still private, under construction

Introduction

In this report, we explore the implementation and analysis of various reinforcement learning algorithms within a grid-based game environment. The report begins by explaining some important concepts behind RL algorithms, proceeds with hyperparameter experimentation, and concludes with an examination of deep reinforcement learning approaches.


1. Environments and Baselines

1.1 Custom Environment and Random Baseline

This portion focuses on creating a custom environment and a random agent that serves as a baseline.

goal-seeking environment design

We will implement a goal-finding environment where an agent tries to find the goal in a variable-size \((n \times m)\) grid-world. The game starts with the player located at \((0, 0)\) and the goal (a cookie) located at \((n − 1, m − 1)\).

The player makes moves until they reach the goal. There are four possible actions: up, down, right and left. Each movement incurs a penalty of \(-1\), while reaching the cookie yields a reward of 0 points.

The state of the agent in this particular environment can be represented by a single number. The environment doesn’t change, so the agent’s current coordinates provide sufficient state information and satisfy the Markov property. In total, there are \(n \times m − 1\) possible states (the goal cell is terminal and excluded).

Environmental observations are provided to the agent following each action. These observations represent the agent’s current location calculated using the formula:

\[index = current\_row \times ncols + current\_col\]

The environment architecture follows the OpenAI Gym framework, by implementing the abstract methods of the class gym.Env, i.e. reset(), step() and render(). The complete implementation of the cookie environment can be found in the following file: src/envs/cookie.py.
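To make the interface concrete, here is a minimal sketch of such an environment under the Gym conventions described above. The class name, action encoding, and the exact reward bookkeeping at the terminal step are illustrative assumptions; the authoritative version is src/envs/cookie.py.

```python
import gym
from gym import spaces


class CookieEnv(gym.Env):
    """Sketch of the goal-seeking grid-world described above (not the actual implementation)."""

    def __init__(self, nrows=5, ncols=5):
        self.nrows, self.ncols = nrows, ncols
        self.action_space = spaces.Discrete(4)                  # up, down, right, left (assumed order)
        self.observation_space = spaces.Discrete(nrows * ncols)
        self.row, self.col = 0, 0

    def _obs(self):
        # index = current_row * ncols + current_col
        return self.row * self.ncols + self.col

    def reset(self):
        self.row, self.col = 0, 0
        return self._obs()

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}
        dr, dc = moves[action]
        self.row = min(max(self.row + dr, 0), self.nrows - 1)   # clamp to the grid
        self.col = min(max(self.col + dc, 0), self.ncols - 1)
        done = (self.row, self.col) == (self.nrows - 1, self.ncols - 1)
        reward = 0.0 if done else -1.0                          # reward scheme described above
        return self._obs(), reward, done, {}

    def render(self, mode="human"):
        print(f"agent at ({self.row}, {self.col})")
```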


random agent baseline

An agent in any kind of environment needs to be able to do two things. First, perform an action within an environment and second, learn from previous interactions. Thus, any agent implemented in this report will have the following two abstract methods implemented: act() and learn().

We will begin with the random agent as a baseline. A random agent selects an action uniformly at random, i.e., according to a uniform distribution. The complete implementation of the random agent can be found in the following file: src/baseline/random.py.
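A minimal sketch of this baseline, assuming the act()/learn() interface just described (the constructor argument is illustrative):

```python
import random


class RandomAgent:
    """Baseline that ignores observations and acts uniformly at random."""

    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, observation):
        # Uniform random action, independent of the observation.
        return random.randrange(self.n_actions)

    def learn(self, observation, action, reward, next_observation, done):
        # The random baseline does not learn from experience.
        pass
```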


performance evaluation

There are different ways to compare the performance of agents. One approach is to compare, for a certain number of episodes, how the return of each episode evolves during learning.

The implementation of the training procedure can be found in the following file: src/tasks/training.py.
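In outline, such a training loop might look as follows (a sketch assuming the agent and environment interfaces above; the actual src/tasks/training.py may record additional statistics):

```python
def run_episodes(env, agent, n_episodes):
    """Run n_episodes of interaction and return the undiscounted return of each episode."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.learn(obs, action, reward, next_obs, done)
            episode_return += reward
            obs = next_obs
        returns.append(episode_return)
    return returns
```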


evolution of random agent

Using a 5 × 5 instance of the cookie environment, we plot the evolution of the average return over 10 000 episodes of interaction by the random agent.

Figure 1: The left figure shows averaged returns obtained over 10 000 episodes in a histogram.
The right figure shows the total number of steps over the 10 000 episodes.


As can be seen, the average reward per episode is always greater than \(-1\) and at most \(-8/9 \approx -0.89\). This aligns with expectations: since the agent performs a random walk in the environment, it consistently incurs a step cost of \(-1\) until it reaches the goal, which yields a reward of \(0\).

The accumulated average reward follows this formula:

\[f(\text{steps}) = -\frac{\text{steps}}{\text{steps} + 1} \quad \text{for} \quad \text{steps} \geq 8\]

Optimally, the agent requires only 8 steps. In that case, the reward sequence would be: -1, -1, -1, -1, -1, -1, -1, -1, 0. Averaging this sequence gives \(-8/9 \approx -0.89\), which is the maximum value observed in the histogram above. Below, we visualize an episode of the random agent using a GIF:

Figure 2: All steps in a single episode in the cookie environment


1.2 MiniHack environments and fixed baseline

Consider the four Minihack environments in Figure 3. Movement is restricted to cardinal directions (north, south, east, west), matching the previously described cookie environment’s action constraints.

  • EMPTY_ROOM is a simple goal-finding-like environment, where the goal is to reach the downstairs. The agent gets \(−1\) reward for each step.

  • CLIFF is based on the cliff environment by Sutton and Barto. The reward scheme is \(0\) when reaching the goal and \(−1\) for each step. Stepping on the lava gives \(−100\) reward and teleports the agent to the initial state, without resetting the episode. The episode only ends when the goal is reached or the maximum number of steps is reached.

  • ROOM_WITH_LAVA is a slightly more complicated goal-finding environment. The reward scheme is the same as the CLIFF.

  • ROOM_WITH_MONSTER is an environment identical to EMPTY_ROOM, except that a monster walks around and can attack and kill the agent. The reward scheme is the same as in the CLIFF environment, with \(−100\) for any death.

Figure 3: The four MiniHack environments


The implementation of these environments can be found in the following file: src/envs/rooms.py


fixed agent baseline

We introduce a new baseline for the four environments: the fixed agent. This agent always moves down until it can no longer do so, and then proceeds right. Unlike the random agent, which acts without considering its surroundings, the fixed agent must account for its current position within the environment.

Importantly, we cannot reuse the same state representation as in the cookie environment, since one of the four environments includes a moving object. If only the agent’s position is known, the task becomes partially observable, i.e. the agent’s position alone is not enough to fully determine the current state of the environment.

Take the ROOM_WITH_MONSTER environment as an example. One possible approach would be to encode the state as a sequence of coordinates, each representing the position of a moving entity (either the agent or the monster). However, we opt for a much simpler, though less compact, representation: the ASCII-based NetHack view. The implementation of the fixed agent is available here: src/baseline/fixed.py.
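For illustration, a sketch of the fixed policy is given below. The wall characters and action encoding are assumptions, and the real src/baseline/fixed.py parses the full NetHack view more carefully:

```python
WALL_CHARS = {"|", "-", " "}   # assumption: glyphs the agent cannot walk onto
DOWN, RIGHT = 1, 2             # assumption: action encoding


class FixedAgent:
    """Baseline that walks down until blocked, then walks right."""

    def act(self, ascii_view):
        # `ascii_view` is the screen as a list of strings; '@' marks the agent.
        rows = ascii_view
        for r, row in enumerate(rows):
            c = row.find("@")
            if c == -1:
                continue
            blocked = r + 1 >= len(rows) or c >= len(rows[r + 1]) or rows[r + 1][c] in WALL_CHARS
            return RIGHT if blocked else DOWN
        return DOWN  # agent glyph not found; default to moving down

    def learn(self, *transition):
        # The fixed baseline does not learn.
        pass
```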


evolution of fixed agent

In Figure 4, we visualize the steps of an episode executed by the fixed agent in the EMPTY_ROOM and ROOM_WITH_LAVA environments using GIFs.

Figure 4: Fixed agent in two MiniHack environments


2. Learning Algorithm Implementation and Analysis

2.1 Algorithmic Experimentation

This portion focuses on implementing various reinforcement learning algorithms and conducting comparative analysis across different environments. Three distinct learning agents have been explored:

  • Monte Carlo On-policy

  • Temporal Difference On-policy (Sarsa)

  • Temporal Difference Off-policy (Q-learning)

All agents utilize \(\epsilon\)-greedy exploration strategies during the learning phase. The complete implementation of these three agents is located in: src/algorithms.


on-policy versus off-policy

We will begin by examining the differences between on-policy and off-policy reinforcement learning methods. An algorithm is classified as on-policy when the action used in the learning update is the same action the agent actually takes while exploring. In contrast, off-policy algorithms maintain separate target and behavior policies.
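The distinction is easiest to see in the two tabular update rules, sketched below with Q mapping each state to a NumPy array of action values (terminal-state handling omitted):

```python
import numpy as np


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the behavior policy actually takes next.
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the greedy (target-policy) action,
    # regardless of what the epsilon-greedy behavior policy does next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```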

To compare the two approaches, we use the CLIFF environment to evaluate the Sarsa and Q-learning algorithms, both employing \(\epsilon\)-greedy action selection \((\epsilon = 0.1)\). A constant learning rate \((\alpha = 0.1)\) is used throughout the experiment. The complete configuration details are available at src/tasks/config/task2_1.yaml.

The results, shown in Figures 5 and 7, reveal that while Q-learning learns the values of the optimal policy, its online performance is worse than Sarsa’s, which instead develops a safer, more indirect route. We’ll break down the reasons below.

Figure 5: Cliff-walking environment. Final values: Q-learning = -59, Sarsa = -25.
The curve is smoothed using a 500-episode moving average


Theoretically, the maximum return obtainable in a single episode corresponds to the following trajectory (starting from the bottom-left corner).

Figure 6: Ideal path yielding maximum cumulative rewards


The corresponding reward sequence is -1, -1, ..., -1, 0, resulting in a return of \(-15\) (an average reward of \(-15/16 \approx -0.94\)). This is the highest possible return when starting from the designated starting position. Any deviation from this path, either through longer trajectories or agent deaths, leads to worse outcomes.

The returns observed during training are shown in Figure 5, smoothed using a moving average. The more negative the return, the more frequently the agent is falling off the cliff, indicating it’s following a riskier path (since cliff falls only happen when navigating close to the lava).

To illustrate the learned strategies, we visualize the policies of both algorithms:

Figure 7: Policy maps. Arrow direction and length correspond to action probabilities. Left: Q-learning. Right: Sarsa.
Sarsa uses an epsilon-soft policy, but the small epsilon value makes its effect (e.g., occasionally veering into neighboring cells) invisible in the plot above.


In conclusion, Q-learning develops values for the optimal strategy, involving choosing actions along the cliff edge. However, this approach occasionally results in cliff falls due to \(\epsilon\)-greedy action selection. Sarsa takes action selection into account and learns a longer but safer pathway through the upper grid region.

If epsilon is decreased gradually during training, this effect diminishes and both Sarsa and Q-learning follow the optimal path indicated in Figure 6, i.e. the target policy of Q-learning.


monte carlo versus temporal difference analysis

We will highlight the differences between Monte Carlo (MC) and Temporal Difference (TD) methods using the CLIFF environment. Specifically, we compare First-Visit MC, Every-Visit MC and TD Sarsa. All three algorithms are on-policy, so we expect them to learn the same safe path as observed in the previous experiment.

  • First-Visit MC averages returns following the first visit to a state-action pair within an episode.

  • Every-Visit MC averages returns after every occurrence of that pair in an episode.

For exploration, we use a soft epsilon-greedy policy with \(\epsilon = 0.4\) and \(\gamma = 1\) for all algorithms, and \(\alpha = 0.1\) for TD. For efficiency, we average using the incremental MC update, which is mathematically equivalent to sample averaging.
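A sketch of the incremental first-visit update for a single finished episode is shown below (Q and N are dictionaries holding running mean returns and visit counts; dropping the first-visit check turns this into every-visit MC):

```python
from collections import defaultdict


def first_visit_mc_update(Q, N, episode, gamma=1.0):
    """Incremental first-visit MC update; `episode` is a list of (state, action, reward)."""
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)             # index of the first occurrence
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r                             # return from time t onward
        if first_visit[(s, a)] == t:                  # remove this check for every-visit MC
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # incremental sample average


Q, N = defaultdict(float), defaultdict(int)           # running means and visit counts
```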

We expect both MC methods to converge, but with higher variance compared to TD. TD, however, introduces bias due to bootstrapping, whereas MC methods are (mostly) unbiased. Every-Visit MC does exhibit some bias.

The increased variance in MC arises from estimating the value of a state-action pair based on full returns \(G_t\). The variance of \(G_t\) grows with the episode length and the variability of rewards:

\[\operatorname{Var}(G_t) \leq \sum_{k=t+1}^{T} \operatorname{Var}(R_k)\]

In contrast, TD methods update based on the immediate reward \(R_{t+1}\) and the estimated value of the next state:

\[\operatorname{Var}(R_{t+1}) \leq \operatorname{Var}(G_t) \leq \sum_{k=t+1}^{T} \operatorname{Var}(R_k)\]

Let’s consider estimating the value of cell \((4,1)\). The agent explores the environment, accumulates rewards or penalties, and uses the sampled returns to update the value estimate of this state.

Figure 8: Example trajectories. The agent starts at green, follows the blue path, and then the orange. All visited states are updated based on the return from that point forward. For instance, state (4,1) gets a return of -241 under the first-visit principle.


Using MC methods, the return from (4,1) could be anything: -100, -200, -1000, or even -15. Each return gets equal weight in the final estimate. This explains the high variance in MC methods. TD, on the other hand, updates using only the immediate reward (between -100 and -1), resulting in much smaller update variance, especially since neighboring state values are initialized to 0 and the learning rate \(\alpha\) reduces the magnitude further.

We validate this observation by examining the Q-value history for state \((4,1)\) across all three algorithms:

Figure 9: Q-value evolution over time for state (4,1). MC methods exhibit much higher variance early on compared to TD Sarsa. Every-Visit MC (left), First-Visit MC (middle), TD Sarsa (right)


Despite the variance, all three algorithms converge toward the suboptimal, safe policy. This is further confirmed by examining the learned policy maps:

Figure 10: Final policy maps for First-Visit MC, Every-Visit MC, and TD Sarsa.


Finally, we present the return and steps per episode evolution:

Figure 11: Mean returns and steps per episode with confidence intervals based on 30 runs


As expected, we see more uncertainty in the policies derived from MC methods. Although all methods eventually converge, TD converges faster in this environment. Unfortunately, we can’t currently back this claim with theory; there is still no formal proof that one method converges faster than the other. For now, it’s an open question.


learning rates analysis

In TD methods, the learning rate \(\alpha\) controls how much new information overrides prior estimates. A higher \(\alpha\) allows the agent to adapt quickly by placing more weight on recent rewards and bootstrapped values but can lead to instability. A lower \(\alpha\) yields more stable learning but slows down convergence.
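Rewriting the tabular TD update as a running average makes the role of \(\alpha\) explicit:

\[Q_{\text{new}} = Q_{\text{old}} + \alpha\,(\text{target} - Q_{\text{old}}) = (1 - \alpha)\,Q_{\text{old}} + \alpha \cdot \text{target}\]

so \(\alpha\) is simply the weight placed on the most recent target relative to the accumulated estimate.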

In this section, we experiment with different constant learning rates, within the ROOM_WITH_LAVA environment. Comparable results can also be reproduced in the CLIFF environment.

Figure 12 illustrates how the learned policies converged under each setting:

Figure 12: Q-learning policy maps under different learning rates (α = 0.001, 0.005, 0.01, 0.1).


The results align with theoretical expectations. As shown in Figure 13, smaller \(\alpha\) values lead to slower convergence, while larger \(\alpha\) values yield faster convergence for the Q-learning algorithm.

Figure 13: Return (left) and episode length (right) over 5 000 episodes for different learning rates
(smoothed with a moving-average window over episodes).


We also observe that the learning rate affects the agent’s exploration behavior. As shown in Figure 14, lower learning rates lead to broader exploration, while higher rates lead to more exploitative behavior. This aligns with theory: a higher \(\alpha\) accelerates value propagation, which potentially stabilizes the policy sooner, leaving less time for exploration.

Figure 14: State visitation heatmaps under different learning rates (α = 0.001, 0.005, 0.01, 0.1).
If the step-size parameter α is reduced appropriately over time, this approach converges in any stationary environment toward the actual action probabilities for each state.


exploration-exploitation trade-off analysis

The \(\epsilon\)-greedy strategy selects a random action with probability \(\epsilon\), and follows the current policy otherwise. Setting \(\epsilon\) too high results in excessive randomness, preventing the agent from exploiting what it has learned. Conversely, making \(\epsilon\) too small limits exploration, possibly trapping the agent in suboptimal behaviors.
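A minimal sketch of this selection rule over a row of tabular Q-values (ties broken by the first maximum):

```python
import numpy as np

rng = np.random.default_rng()


def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```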

Figure 15: Return and episode length for varying exploration rates (ϵ).


Surprisingly, the results show that almost zero exploration \((\epsilon = 0.01)\) yields the highest return in the ROOM_WITH_LAVA environment. Similar outcomes were observed in the CLIFF environment as well. At first glance, this seems counterintuitive, but it makes sense given the characteristics of the environment.

Even though the environment allows for variance in the possible returns, the immediate reward for any given (state, action) pair is deterministic. There is no reward distribution to sample from. The agent always receives the same outcome for the same decision.

Additionally, as the agent explores, it accumulates \(-1\) penalties along each path, which inherently discourages revisiting those routes. This naturally drives the agent to favor unexplored (and potentially shorter or safer) alternatives, without needing random exploration.

In essence, the state space is small and simple enough that pure exploitation is sufficient to discover the optimal policy. The agent can locally maximize rewards and still converge to the correct solution, as verified by the empirical results. Interestingly, higher exploration rates degrade performance slightly, mostly due to increased chances of randomly stepping into lava. Nonetheless, all policies eventually converge.

Figure 16: Sample trajectories from Sarsa with increasing exploration rates (ϵ = 0.01, 0.1, 0.2, 0.5).
If Figure 16 used Q-learning instead of Sarsa, all policies would converge to the same optimal path regardless of ϵ, since Q-learning is off-policy.


2.2 Linear Exploration Rate

In the previous section, we experimented with different constant \(\epsilon\) values. Here, we instead use a linearly decreasing exploration schedule, where \(\epsilon\) progressively decays over the course of training. The goal is to start with broad exploration and gradually shift toward exploitation by reducing random action selection after each episode.

As noted earlier in the section on on-policy vs off-policy methods, both Sarsa and Q-learning converge to the same policy when \(\epsilon\) decreases over time. This section provides experimental evidence of that convergence.

We set the learning rate to \(\alpha = 0.1\) and use a linear decay of \(\epsilon\) from \(0.01\) to \(0\) over 500 episodes. The discount factor remains at \(\gamma = 1\). We evaluate the trained agent at various checkpoints to observe the evolution of the target policy.

Figure 17: Target policies over training, evaluated after 50, 100, 200, 300, 400, and 500 episodes. Left: Q-learning. Right: Sarsa.
Each row corresponds to one evaluation checkpoint.


Note that Q-learning evaluates using a greedy policy, while Sarsa still follows an \(\epsilon\)-greedy policy during evaluation (unless \(\epsilon\) has decayed to 0).

Figure 18: Episode rollouts over training, evaluated after 50, 100, 200, 300, 400, and 500 episodes. Left: Q-learning. Right: Sarsa.


From these experiments, we confirm that both Sarsa and Q-learning converge to the same optimal policy when exploration is linearly decayed. This confirms our theoretical expectations from earlier sections.

There's a significant difference between training Sarsa with an exploration rate of 0.01 versus a completely zero rate. With a fixed ϵ = 0.01, Sarsa will converge to the same policy as Q-learning, but it still retains a small chance of selecting a random action. In contrast, with linear decay to ϵ = 0, no random actions are taken once the schedule hits zero, making the behavior fully greedy.


The complete implementation of the scheduling mechanism can be found in the following file: src/schedule.py.
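A linear schedule of this kind fits in a few lines; the sketch below matches the decay used above (0.01 to 0 over 500 episodes), though the actual src/schedule.py may be structured differently:

```python
def linear_schedule(start, end, n_episodes):
    """Return a function epsilon(episode) that decays linearly from start to end."""
    def epsilon(episode):
        if episode >= n_episodes:
            return end
        return start + (episode / n_episodes) * (end - start)
    return epsilon


epsilon_at = linear_schedule(0.01, 0.0, 500)   # epsilon_at(0) == 0.01, epsilon_at(500) == 0.0
```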


2.3 Planning and Learning

A model of the environment refers to any mechanism an agent can use to predict how the environment will respond to its actions. Given a state and an action, a model outputs the predicted next state and corresponding reward. The Dyna-Q implementation uses a table-based model under deterministic assumptions: when queried with a previously seen state-action pair, it simply returns the last observed next state and reward.
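A sketch of one Dyna-Q iteration built around such a table-based model is shown below (tabular Q as a state-to-array mapping; the model layout and function names are illustrative):

```python
import random

import numpy as np


def dyna_q_step(Q, model, s, a, r, s_next, alpha, gamma, n_planning_steps):
    """One Dyna-Q iteration: direct RL update, model update, then planning updates."""

    def q_update(s, a, r, s_next):
        Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])

    # (a) Direct reinforcement learning from the real transition.
    q_update(s, a, r, s_next)
    # (b) Model learning: under the deterministic assumption, remember the last
    #     observed outcome for this state-action pair.
    model[(s, a)] = (r, s_next)
    # (c) Planning: replay randomly chosen, previously seen pairs from the model.
    for _ in range(n_planning_steps):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)
```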

We will compare the performance of the Dyna-Q agent and the Q-learning agent in two of the four MiniHack environments.

Figure 19: Learned policies in the CLIFF environment with 0, 5, 20, and 50 planning steps.


Even though all agents eventually reach optimal performance, Dyna-Q with 50 planning steps produces the cleanest, most symmetric policy. It correctly identifies all paths that are functionally equivalent. Even with just 5 planning steps, Dyna-Q outperforms standard Q-learning by reducing unnecessary wall-directed actions in the outer regions.

Figure 20: Learned policies in the ROOM_WITH_LAVA environment with 0, 5, 20, and 50 planning steps.


Figure 21 shows that planning agents converge faster than their non-planning counterparts, with the 50-planning-step agent converging fastest. This is exactly what we expect: more planning steps mean more value updates per real interaction, and therefore faster learning.

Figure 21: Smoothed returns over episodes in cliff (left) and room with lava (right). Planning has a noticeably greater impact on convergence speed.


Dyna-Q may not perform as well in the environment where a monster is present, because the learned model has to account for enemy behavior, which can be non-deterministic.

3. Deep Reinforcement Learning (DRL)

In this section, we implement deep reinforcement learning agents in two environments: EMPTY_ROOM and ROOM_WITH_MULTIPLE_MONSTERS. The second environment extends the first by introducing multiple randomly spawning enemies, increasing the complexity and introducing stochasticity.

To handle this added challenge, we use two algorithms from the stable_baselines3 library: DQN (Deep Q-Network), which applies deep Q-learning, and PPO (Proximal Policy Optimization), which follows the actor-critic framework. Despite transitioning to deep models, we continue using an ASCII-based representation of the environment as the input state.

Since deep models require fixed-size input vectors, we convert the 2D ASCII map into a flat 64-dimensional embedding using a Convolutional Neural Network (CNN).

The CNN architecture consists of three convolutional layers with kernel sizes of \((3,3)\), ReLU activations and padding to preserve spatial dimensions. The output is then flattened and passed through a fully connected layer, which outputs the 64-dimensional feature vector used as input to the reinforcement learning algorithm. The implementation of this feature extractor is available at src/algorithms/cnn.py.
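A sketch of such an extractor using the stable_baselines3 custom feature-extractor mechanism is shown below. The channel counts and the channels-first observation layout are assumptions; only the overall structure (three 3×3 convolutions with ReLU and padding, flatten, then a linear layer to 64 features) follows the description above:

```python
import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class AsciiCNN(BaseFeaturesExtractor):
    """Three conv layers + flatten + linear projection to a 64-dimensional feature vector."""

    def __init__(self, observation_space, features_dim=64):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space.shape[0]              # assumes channels-first observations
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                                 # infer the flattened size
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations):
        return self.linear(self.cnn(observations.float()))
```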

hyperparameter configuration

For training, we used a learning rate of 0.0001 for DQN and 0.0003 for PPO. We used a discount factor \((\gamma)\) of 0.99 for both algorithms. For PPO, we also added a small entropy bonus of 0.01 to encourage exploration. The exact hyperparameters can be found in src/tasks/config/task3_1.yaml.
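These settings could be wired up in stable_baselines3 roughly as follows; the policy string and the helper function are illustrative, and the authoritative values live in src/tasks/config/task3_1.yaml:

```python
from stable_baselines3 import DQN, PPO


def make_agents(env, feature_extractor_cls):
    """Build DQN and PPO learners with the hyperparameters reported above (sketch)."""
    policy_kwargs = dict(features_extractor_class=feature_extractor_cls,
                         features_extractor_kwargs=dict(features_dim=64))
    dqn = DQN("MlpPolicy", env, learning_rate=1e-4, gamma=0.99,
              policy_kwargs=policy_kwargs, verbose=1)
    ppo = PPO("MlpPolicy", env, learning_rate=3e-4, gamma=0.99, ent_coef=0.01,
              policy_kwargs=policy_kwargs, verbose=1)
    return dqn, ppo
```

Each agent is then trained with agent.learn(total_timesteps=...), with the budget chosen per environment.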

performance evaluation

Figure 22 shows that PPO trains significantly faster and more stably than DQN. In the EMPTY_ROOM environment, PPO converges in roughly 250 000 steps, while DQN requires nearly 750 000!

The number of training steps required can partly be attributed to the high-dimensional state representation processed by the CNN. If a more compact, handcrafted representation were used instead, a simpler model with fewer parameters could possibly converge much faster.

The gap is even more pronounced in the ROOM_WITH_MULTIPLE_MONSTERS environment: PPO still converges within 250 000 steps, but DQN needs nearly 2 million.

Figure 22: Mean returns and episode lengths with 95% confidence intervals, averaged over 3 runs. In the empty room environment, both PPO and DQN converge to similar final episode lengths, averaging slightly above 8 steps. However, in the more challenging room with multiple monsters environment, PPO maintains a lower average episode length of around 12, while DQN episodes are longer, averaging closer to 15 steps. In terms of accumulated reward, PPO also performs better in room with multiple monsters, achieving an average return of approximately -12 compared to DQN's -15. In empty room, both algorithms perform similarly, with average returns hovering around -8.


Despite both DQN and PPO eventually learning near-optimal policies, their behavior differs: DQN eliminates only the monster closest to the start position and then heads straight to the goal, avoiding other enemies. PPO, in contrast, clears out all monsters along the way before proceeding to the goal.

Figure 23 illustrates full episodes in both environments for each agent.

Figure 23: Rollout visualizations for PPO (on-policy) and DQN (off-policy) in both environments.


Tabular-Q vs DQN

  1. Convergence Speed: Tabular methods converge much faster due to the limited, discrete state space in grid-based environments. They typically require only hundreds of thousands of steps, while DRL methods possibly need millions.

  2. Generalization: Tabular Q-learning memorizes state-action pairs and lacks generalization. It fails on slightly altered environments (e.g., grid size changes). In contrast, DRL models (especially CNN-based ones) can generalize better across similar but unseen states. For instance, a model trained on a \(5 \times 5\) grid might still operate reasonably on a \(6 \times 6\) one, something tabular Q cannot do.

  3. Convergence Guarantees: Under certain assumptions, tabular methods guarantee convergence to the optimal policy. DRL methods do not; they may converge to suboptimal or unstable policies depending on architecture, hyperparameters and randomness.

References