Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym
The full implementation is available in [lilianweng/deep-reinforcement-learning-gym](https://github.com/lilianweng/deep-reinforcement-learning-gym).

In the previous two posts, I have introduced the algorithms of many deep reinforcement learning models. Now it is time to get our hands dirty and practice how to implement the models in the wild. The implementation will be built in Tensorflow and the OpenAI [gym](https://github.com/openai/gym) environment. The full version of the code in this tutorial is available in [lilian/deep-reinforcement-learning-gym].
For every new installation below, please make sure you are in the virtualenv.
If you are interested in playing with Atari games or other advanced packages, you will need a couple of extra system packages installed first.

For Atari, go to the gym directory and pip install it. This post is pretty helpful if you have trouble with the ALE (Arcade Learning Environment) installation.
The OpenAI Gym toolkit provides a set of physical simulation environments, games, and robot simulators that we can play with and design reinforcement learning agents for. An environment object can be initialized by gym.make("{environment name}"):
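For example, creating the classic cart pole environment (used throughout this post):

```python
import gym

# Create the CartPole environment; the agent pushes a cart left or right
# to keep a pole balanced upright.
env = gym.make("CartPole-v0")
```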
The formats of action and observation of an environment are defined by env.action_space and env.observation_space, respectively.
Types of gym spaces:

- gym.spaces.Discrete(n): a set of discrete values from 0 to n-1.
- gym.spaces.Box: a multi-dimensional vector of numeric values; the upper and lower bounds of each dimension are defined by Box.low and Box.high.
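For example, CartPole-v0 has two discrete actions and a 4-dimensional box observation, which we can inspect directly:

```python
print(env.action_space)             # Discrete(2)
print(env.observation_space)        # Box(4,)
print(env.observation_space.high)   # Upper bound of each observation dimension.
print(env.observation_space.low)    # Lower bound of each observation dimension.
```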
We interact with the env through two major API calls:

ob = env.reset()
- Resets the env to the original setting.
- Returns the initial observation.

ob_next, reward, done, info = env.step(action)
- Applies one action in the env, which should be compatible with env.action_space.
- Gets back the new observation ob_next (env.observation_space), a reward (float), a done flag (bool), and other meta information (dict). If done=True, the episode is complete and we should reset the env to restart. Read more here.
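Putting the two calls together, a random agent that plays one episode looks like this:

```python
ob = env.reset()
done = False
total_reward = 0.0
while not done:
    # Sample a random action and apply it to the env.
    ob_next, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    ob = ob_next
print(total_reward)
```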
Naive Q-Learning
Q-learning (Watkins & Dayan, 1992) learns the action value ("Q-value") and updates it according to the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a)\big)$$

The key point is that while estimating the next action, it does not follow the current policy but rather adopts the best Q value (the $\max_{a'}$ part) independently.
In a naive implementation, the Q value for all (s, a) pairs can be simply tracked in a dict. No complicated machine learning model is involved yet.
```python
import gym
from collections import defaultdict

Q = defaultdict(float)
gamma = 0.99  # Discounting factor.
alpha = 0.5   # Learning rate for the soft Q update.

env = gym.make("CartPole-v0")
actions = range(env.action_space.n)

def update_Q(s, r, a, s_next, done):
    max_q_next = max([Q[s_next, a] for a in actions])
    # Do not include the next state's value if currently at the terminal state.
    Q[s, a] += alpha * (r + gamma * max_q_next * (1.0 - done) - Q[s, a])
```
Most gym environments have a multi-dimensional continuous observation space (gym.spaces.Box). To make sure our Q dictionary will not explode by trying to memorize an infinite number of keys, we apply a wrapper to discretize the observation. The concept of wrappers is very powerful, with which we can customize the observation, action, step function, etc. of an env. No matter how many wrappers are applied, env.unwrapped always gives back the internal original environment object.
```python
import gym
import numpy as np
from gym.spaces import Box, Discrete

class DiscretizedObservationWrapper(gym.ObservationWrapper):
    """This wrapper converts a Box observation into a single integer."""
    def __init__(self, env, n_bins=10, low=None, high=None):
        super().__init__(env)
        assert isinstance(env.observation_space, Box)

        low = np.asarray(self.observation_space.low if low is None else low)
        high = np.asarray(self.observation_space.high if high is None else high)

        self.n_bins = n_bins
        self.val_bins = [np.linspace(l, h, n_bins + 1) for l, h in
                         zip(low.flatten(), high.flatten())]
        self.observation_space = Discrete(n_bins ** low.flatten().shape[0])

    def _convert_to_one_number(self, digits):
        return sum([d * ((self.n_bins + 1) ** i) for i, d in enumerate(digits)])

    def observation(self, observation):
        digits = [np.digitize([x], bins)[0]
                  for x, bins in zip(observation.flatten(), self.val_bins)]
        return self._convert_to_one_number(digits)


env = DiscretizedObservationWrapper(
    env,
    n_bins=8,
    low=[-2.4, -2.0, -0.42, -3.5],
    high=[2.4, 2.0, 0.42, 3.5]
)
```
Let’s plug in the interaction with a gym env and update the Q function every time a new transition is generated. When picking the action, we use ε-greedy to force exploration.
```python
import numpy as np

epsilon = 0.1       # Probability of taking a random action; often annealed (see below).
n_steps = 100000    # A tunable hyperparameter.

def act(ob):
    if np.random.random() < epsilon:
        # action_space.sample() is a convenient function to get a random action
        # that is compatible with this given action space.
        return env.action_space.sample()

    # Pick the action with the highest q value.
    qvals = {a: Q[ob, a] for a in actions}
    max_q = max(qvals.values())
    # In case multiple actions have the same maximum q value.
    actions_with_max_q = [a for a, q in qvals.items() if q == max_q]
    return np.random.choice(actions_with_max_q)

ob = env.reset()
rewards = []
reward = 0.0

for step in range(n_steps):
    a = act(ob)
    ob_next, r, done, _ = env.step(a)
    update_Q(ob, r, a, ob_next, done)
    reward += r
    if done:
        rewards.append(reward)
        reward = 0.0
        ob = env.reset()
    else:
        ob = ob_next
```
Often we start with a high epsilon and gradually decrease it during the training, known as “epsilon annealing”. The full code of QLearningPolicy is available here.
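A minimal linear-annealing sketch; the start value, final value, and schedule length below are illustrative hyperparameters, not taken from the post:

```python
epsilon_start, epsilon_final, anneal_steps = 1.0, 0.01, 10000

def annealed_epsilon(step):
    # Linearly decay epsilon from epsilon_start to epsilon_final over
    # the first anneal_steps steps, then keep it constant.
    frac = min(step, anneal_steps) / float(anneal_steps)
    return epsilon_start + frac * (epsilon_final - epsilon_start)
```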
Deep Q-Network
Deep Q-network is a seminal piece of work to make the training of Q-learning more stable and more data-efficient, when the Q value is approximated with a nonlinear function. Two key ingredients are experience replay and a separately updated target network.
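The full implementation has its own replay memory class; a minimal sketch of the idea might look like this (the class name and API here are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """A fixed-capacity buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniformly sample past transitions to break the temporal
        # correlation within an episode.
        return random.sample(self.buffer, batch_size)
```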
The main loss function looks like the following:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \big[ \big( r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') - Q_\theta(s, a) \big)^2 \big]$$

where $D$ is the replay memory and $\theta^{-}$ denotes the parameters of the frozen target network.
The Q network can be a multi-layer dense neural network, a convolutional network, or a recurrent network, depending on the problem. In the full implementation of the DQN policy, it is determined by the model_type parameter, one of (“dense”, “conv”, “lstm”).
In the following example, I’m using a 2-layer densely connected neural network to learn Q values for the cart pole balancing problem.
```python
# The observation space of CartPole is Box(4,), a 4-element vector.
observation_size = env.observation_space.shape[0]
```
We have a helper function for creating the networks below:
```python
import tensorflow as tf

def dense_nn(inputs, layers_sizes, name):
    """Creates a densely connected multi-layer neural network.
    inputs: the input tensor.
    layers_sizes (list<int>): number of units in each layer; the output
        layer has size layers_sizes[-1].
    """
    with tf.variable_scope(name):
        for i, size in enumerate(layers_sizes):
            inputs = tf.layers.dense(
                inputs,
                size,
                # Add relu activation only for internal layers.
                activation=tf.nn.relu if i < len(layers_sizes) - 1 else None,
                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                name=name + '_l' + str(i)
            )
    return inputs
```
The Q-network and the target network are updated with a batch of transitions (state, action, reward, state_next, done_flag). The input tensors are:
```python
batch_size = 32  # A tunable hyperparameter.

states = tf.placeholder(tf.float32, shape=(batch_size, observation_size), name='state')
states_next = tf.placeholder(tf.float32, shape=(batch_size, observation_size), name='state_next')
actions = tf.placeholder(tf.int32, shape=(batch_size,), name='action')
rewards = tf.placeholder(tf.float32, shape=(batch_size,), name='reward')
done_flags = tf.placeholder(tf.float32, shape=(batch_size,), name='done')
```
We have two networks of the same architecture: both take the state observation as input and output Q values over all the actions. The target network "Q_target" takes the states_next tensor as input, because we use its prediction of the next-state Q values in the Bellman equation.
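A sketch of the two network definitions using the dense_nn helper above; the hidden layer sizes are illustrative:

```python
# Q_primary predicts Q(s, a); Q_target predicts Q(s', a') for the Bellman target.
q = dense_nn(states, [32, 32, env.action_space.n], name='Q_primary')
q_target = dense_nn(states_next, [32, 32, env.action_space.n], name='Q_target')
```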
```python
# The optimization target defined by the Bellman equation and the target network.
max_q_next_by_target = tf.reduce_max(q_target, axis=-1)
y = rewards + (1. - done_flags) * gamma * max_q_next_by_target
```
```python
# `pred` is the primary network's Q value for the action actually taken,
# selected with the same one-hot trick used in the actor-critic code below.
action_one_hot = tf.one_hot(actions, env.action_space.n, 1.0, 0.0, name='action_one_hot')
pred = tf.reduce_sum(q * action_one_hot, axis=-1, name='q_acted')

# The loss measures the mean squared error between prediction and target.
loss = tf.reduce_mean(tf.square(pred - tf.stop_gradient(y)), name="loss_mse_train")
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss, name="adam_optim")
```
Note the tf.stop_gradient() on the target y: the target network should stay fixed during the loss-minimizing gradient update.
The target network is updated by copying the primary Q network parameters over every C steps ("hard update"), or by polyak averaging towards the primary network ("soft update").
```python
# Get all the variables in the primary and the target Q networks.
q_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Q_primary")
q_target_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Q_target")
assert len(q_vars) == len(q_target_vars)

def update_target_q_net_hard():
    # Hard update: copy the primary network parameters over directly.
    sess.run([v_t.assign(v) for v_t, v in zip(q_target_vars, q_vars)])

def update_target_q_net_soft(tau=0.05):
    # Soft update: polyak averaging towards the primary network.
    sess.run([v_t.assign(v_t * (1. - tau) + v * tau)
              for v_t, v in zip(q_target_vars, q_vars)])
```
Double Q-Learning
If we look into the standard form of the Q value target, $Y(s, a) = r + \gamma \max_{a’ \in \mathcal{A}} Q_\theta (s’, a’)$, it is easy to notice that we use $Q_\theta$ to select the best next action at state s’ and then apply the action value predicted by the same $Q_\theta$. This two-step reinforcing procedure could potentially lead to overestimation of an (already) overestimated value, further leading to training instability. The solution proposed by double Q-learning (Hasselt, 2010) is to decouple the action selection and action value estimation by using two Q networks, $Q_1$ and $Q_2$: when $Q_1$ is being updated, $Q_2$ decides the best next action, and vice versa.
To incorporate double Q-learning into DQN, the minimum modification (Hasselt, Guez, & Silver, 2016) is to use the primary Q network to select the action while the action value is estimated by the target network:

$$Y(s, a) = r + \gamma Q_{\theta^{-}}\big(s', \arg\max_{a' \in \mathcal{A}} Q_\theta(s', a')\big)$$
In the code, we add a new placeholder for the actions selected by the primary Q network, plus a tensor operation for selecting them.
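A sketch of these two pieces, using the names referenced in the rollout below:

```python
# Placeholder for the next actions picked by the primary network.
actions_next = tf.placeholder(tf.int32, shape=(batch_size,), name='action_next')
# The primary Q network greedily selects the best next action.
actions_selected_by_q = tf.argmax(q, axis=-1, name='action_selected')
```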
The prediction target y in the loss function becomes:
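One way to assemble the new target is to flatten q_target and use tf.gather() with per-row offsets; this flattened-indexing trick is my sketch of the idea, not necessarily the exact code from the post:

```python
# Flatten q_target from (batch_size, act_size) to (batch_size * act_size,)
# so that one index per example picks Q_target(s', a') of the chosen action.
indices = tf.range(batch_size) * env.action_space.n + actions_next
q_target_selected = tf.gather(tf.reshape(q_target, [-1]), indices)
y = rewards + (1. - done_flags) * gamma * q_target_selected
```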
Here I used tf.gather() to select the action values of interest.
During the episode rollout, we compute the actions_next by feeding the next states’ data into the actions_selected_by_q operation.
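For example (batch_states_next is a hypothetical array holding the batch of next-state observations):

```python
# Ask the primary network which action it would take in each next state.
actions_next_val = sess.run(actions_selected_by_q, {states: batch_states_next})
```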
Dueling Q-Network
The dueling Q-network (Wang et al., 2016) is equipped with an enhanced network architecture: the output layer branches out into two heads, one for predicting state value, V, and the other for advantage, A. The Q-value is then reconstructed, $Q(s, a) = V(s) + A(s, a)$.
To make sure the estimated advantage values sum up to zero, $\sum_a A(s, a)\pi(a \vert s) = 0$, we subtract the mean value from the prediction.
The code change is straightforward:
```python
# Branch into two heads from a shared hidden layer.
q_hidden = dense_nn(states, [32], name='Q_primary_hidden')
adv = dense_nn(q_hidden, [32, env.action_space.n], name='Q_primary_adv')
v = dense_nn(q_hidden, [32, 1], name='Q_primary_v')

# Average dueling: recombine the value and advantage heads.
q = v + (adv - tf.reduce_mean(adv, reduction_indices=1, keepdims=True))
```
[Figure: the dueling Q-network architecture. Image source: Wang et al., 2016]
Check the code for the complete flow.
Monte-Carlo Policy Gradient
I reviewed a number of popular policy gradient methods in my last post. Monte-Carlo policy gradient, also known as REINFORCE, is a classic on-policy method that learns the policy model explicitly. It uses the return estimated from a full on-policy trajectory and updates the policy parameters with policy gradient.
The returns are computed during rollouts and then fed into the Tensorflow graph as inputs.
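A sketch of the graph inputs; shape None accommodates the variable episode length (the names match the rollout code below):

```python
states = tf.placeholder(tf.float32, shape=(None, observation_size), name='state')
actions = tf.placeholder(tf.int32, shape=(None,), name='action')
returns = tf.placeholder(tf.float32, shape=(None,), name='return')
```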
The policy network is constructed. We update the policy parameters by minimizing the loss function, $\mathcal{L} = - (G_t - V(s)) \log \pi(a \vert s)$. tf.nn.sparse_softmax_cross_entropy_with_logits() asks for the raw logits as inputs, rather than the probabilities after softmax, and that's why we do not have a softmax layer on top of the policy network.
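A minimal sketch of the policy network and the sampling op used by act() below; the hidden layer sizes are illustrative:

```python
# Raw logits over actions; no softmax layer on top (see the note above).
pi = dense_nn(states, [32, 32, env.action_space.n], name='pi_network')
# Sample one action per state from the softmax distribution over the logits.
sampled_actions = tf.squeeze(tf.multinomial(pi, 1))
```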
```python
with tf.variable_scope('pi_optimize'):
    loss_pi = tf.reduce_mean(
        returns * tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=pi, labels=actions), name='loss_pi')
    optim_pi = tf.train.AdamOptimizer(0.001).minimize(loss_pi, name='adam_optim_pi')
```
During the episode rollout, the return is calculated as follows:
```python
# env = gym.make(...)
# sess = tf.Session(...)
gamma = 0.99
n_episodes = 100  # A tunable hyperparameter.

def act(ob):
    return sess.run(sampled_actions, {states: [ob]})

for _ in range(n_episodes):
    ob = env.reset()
    done = False

    # Per-episode buffers; named so they do not shadow the `actions` and
    # `returns` placeholders referenced in the feed_dict below.
    obs = []
    acts = []
    rews = []
    rets = []

    while not done:
        a = act(ob)
        new_ob, r, done, info = env.step(a)

        obs.append(ob)
        acts.append(a)
        rews.append(r)
        ob = new_ob

    # Estimate returns backwards.
    return_so_far = 0.0
    for r in rews[::-1]:
        return_so_far = gamma * return_so_far + r
        rets.append(return_so_far)
    rets = rets[::-1]

    # Update the policy network with the data from one episode.
    sess.run([optim_pi], feed_dict={
        states: np.array(obs),
        actions: np.array(acts),
        returns: np.array(rets),
    })
```
The full implementation of REINFORCE is here.
Actor-Critic
The actor-critic algorithm learns two models at the same time, the actor for learning the best policy and the critic for estimating the state value.
1. Initialize the actor network, $\pi(a \vert s)$, and the critic, $V(s)$.
2. Collect a new transition (s, a, r, s'): sample the action $a \sim \pi(a \vert s)$ for the current state s, and get the reward r and the next state s'.
3. Compute the TD target during episode rollout, $G_t = r + \gamma V(s')$, and the TD error, $\delta_t = r + \gamma V(s') - V(s)$.
4. Update the critic network by minimizing the critic loss: $L_c = (V(s) - G_t)^2$.
5. Update the actor network by minimizing the actor loss: $L_a = - \delta_t \log \pi(a \vert s)$.
6. Set s = s' and repeat steps 2-5.
Overall the implementation looks pretty similar to REINFORCE with an extra critic network. The full implementation is here.
```python
# Inputs: the `td_targets` placeholder is assumed here; its values,
# r + gamma * V(s'), are computed during rollout and fed in.
states = tf.placeholder(tf.float32, shape=(None, observation_size), name='state')
actions = tf.placeholder(tf.int32, shape=(None,), name='action')
td_targets = tf.placeholder(tf.float32, shape=(None,), name='td_target')

act_size = env.action_space.n

# Actor: action probabilities (raw logits over actions).
actor = dense_nn(states, [32, 32, act_size], name='actor')

# Critic: state value.
critic = dense_nn(states, [32, 32, 1], name='critic')

action_ohe = tf.one_hot(actions, act_size, 1.0, 0.0, name='action_one_hot')
pred_value = tf.reduce_sum(critic * action_ohe, reduction_indices=-1, name='q_acted')
td_errors = td_targets - tf.reshape(pred_value, [-1])

with tf.variable_scope('critic_train'):
    loss_c = tf.reduce_mean(tf.square(td_errors))
    optim_c = tf.train.AdamOptimizer(0.01).minimize(loss_c)

with tf.variable_scope('actor_train'):
    loss_a = tf.reduce_mean(
        tf.stop_gradient(td_errors) * tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=actor, labels=actions),
        name='loss_actor')
    optim_a = tf.train.AdamOptimizer(0.01).minimize(loss_a)

train_ops = [optim_c, optim_a]
```
The Tensorboard graph is always helpful.
References
[1] Tensorflow API Docs
[2] Christopher JCH Watkins, and Peter Dayan. “Q-learning.” Machine learning 8.3-4 (1992): 279-292.
[3] Hado Van Hasselt, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-Learning.” AAAI. Vol. 16. 2016.
[4] Hado van Hasselt. “Double Q-learning.” NIPS, 23:2613–2621, 2010.
[5] Ziyu Wang, et al. “Dueling Network Architectures for Deep Reinforcement Learning.” ICML. 2016.