An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

by Thomas Simonini

Since the beginning of this course, we’ve studied two different reinforcement learning methods:


Value-based methods (Q-learning, Deep Q-learning): where we learn a value function that maps each state-action pair to a value. Thanks to these methods, we find the best action to take for each state: the action with the biggest value. This works well when you have a finite set of actions.

Policy-based methods (REINFORCE with Policy Gradients): where we directly optimize the policy without using a value function. This is useful when the action space is continuous or stochastic. The main problem is finding a good score function to compute how good a policy is. We use the total rewards of the episode.

But both of these methods have big drawbacks. That’s why, today, we’ll study a new type of Reinforcement Learning method which we can call a “hybrid method”: Actor Critic. We’ll be using two neural networks:

- a Critic that measures how good the action taken is (value-based)
- an Actor that controls how our agent behaves (policy-based)
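
To make this split concrete, here is a minimal sketch of what the two networks could look like in TensorFlow/Keras. The layer sizes and the state_dim / n_actions parameters are illustrative assumptions, not the architecture used in the Sonic agent later in this article:

```python
import tensorflow as tf

def build_actor(state_dim, n_actions):
    # Policy network: outputs pi(a|s), a probability for each discrete action.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])

def build_critic(state_dim):
    # Value network: outputs a single scalar estimate of V(s).
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(1),
    ])
```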

Mastering this architecture is essential to understanding state-of-the-art algorithms such as Proximal Policy Optimization (aka PPO). PPO is based on Advantage Actor Critic.

And you’ll implement an Advantage Actor Critic (A2C) agent that learns to play Sonic the Hedgehog!


The quest for a better learning model

The problem with Policy Gradients

The Policy Gradient method has a big problem. We are in a Monte Carlo situation, waiting until the end of the episode to calculate the reward. We may conclude that if we have a high reward (R(t)), all the actions that we took were good, even if some were really bad.
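
Written out, that Monte Carlo return R(t) is the discounted sum of every reward collected from step t until the end of the episode (with discount factor γ):

$$R_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}$$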

As we can see in this example, even if A3 was a bad action (it led to negative rewards), all the actions will be averaged as good because the total reward was high.

As a consequence, to have an optimal policy, we need a lot of samples. This produces slow learning, because it takes a lot of time to converge.

What if, instead, we could do an update at each time step?

Introducing Actor Critic

The Actor Critic model is a better score function. Instead of waiting until the end of the episode as we do in Monte Carlo REINFORCE, we make an update at each step (TD Learning).

Because we do an update at each time step, we can’t use the total rewards R(t). Instead, we need to train a Critic model that approximates the value function (remember that the value function calculates the maximum expected future reward given a state and an action). This value function replaces the reward function in policy gradient, which calculates the rewards only at the end of the episode.

How Actor Critic works

Imagine you play a video game with a friend who provides you with some feedback. You’re the Actor and your friend is the Critic.

At the beginning, you don’t know how to play, so you try some actions randomly. The Critic observes your actions and provides feedback.

Learning from this feedback, you’ll update your policy and be better at playing that game.

On the other hand, your friend (the Critic) will also update their own way of providing feedback so it can be better next time.

As we can see, the idea of Actor Critic is to have two neural networks. We estimate both:

- a policy function that controls how our agent acts: π(s, a, θ)
- a value function that measures how good these actions are: q̂(s, a, w)

Both run in parallel.


Because we have two models (the Actor and the Critic) that must be trained, it means that we have two sets of weights (θ for our Actor and w for our Critic) that must be optimized separately:
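
In the standard Actor Critic formulation, those two updates take roughly the following form (a sketch: α and β are the two learning rates, π is the policy parameterized by θ, and q̂ is the Critic’s value estimate parameterized by w):

$$\Delta\theta = \alpha \, \nabla_\theta \big(\log \pi_\theta(s_t, a_t)\big)\, \hat{q}_w(s_t, a_t)$$

$$\Delta w = \beta \big(R(s_t, a_t) + \gamma\, \hat{q}_w(s_{t+1}, a_{t+1}) - \hat{q}_w(s_t, a_t)\big)\, \nabla_w \hat{q}_w(s_t, a_t)$$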

The Actor Critic Process

At each time-step t, we take the current state (St) from the environment and pass it as an input through our Actor and our Critic.


Our Policy takes the state, outputs an action (At), and receives a new state (St+1) and a reward (Rt+1).


Thanks to that:


- the Critic computes the value of taking that action at that state
- the Actor updates its policy parameters (weights) using this q value

Thanks to its updated parameters, the Actor produces the next action to take (At+1) given the new state St+1. The Critic then updates its value parameters.
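
To make the process concrete, here is a minimal sketch of one interaction step. The names env, actor and critic, and the old-style four-value env.step signature, are assumptions for illustration; the real implementation batches this across several environments, as described later:

```python
import numpy as np

def actor_critic_step(env, actor, critic, state, gamma=0.99):
    # Actor: sample an action from the current policy pi(a|s).
    probs = actor.predict(state[None, :], verbose=0)[0]
    action = np.random.choice(len(probs), p=probs)

    next_state, reward, done, _ = env.step(action)

    # Critic: one-step TD target and TD error (used as the advantage estimate).
    v_state = critic.predict(state[None, :], verbose=0)[0, 0]
    v_next = 0.0 if done else critic.predict(next_state[None, :], verbose=0)[0, 0]
    td_target = reward + gamma * v_next
    advantage = td_target - v_state

    # A full implementation would now take two gradient steps:
    #   Actor:  increase  advantage * log pi(action | state)
    #   Critic: decrease  (td_target - V(state)) ** 2
    return next_state, action, advantage, td_target, done
```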

A2C and A3C

Introducing the Advantage function to stabilize learning

As we saw in the article about improvements in Deep Q Learning, value-based methods have high variability.

To reduce this problem, we spoke about using the advantage function instead of the value function.


The advantage function is defined like this:

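Written out (with Q(s, a) the action value and V(s) the state value, both of which come back below):

$$A(s, a) = Q(s, a) - V(s)$$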

This function tells us the improvement of taking that action at that state, compared to the average value of that state. In other words, it calculates the extra reward we get if we take this action: the reward beyond the expected value of that state.

If A(s,a) > 0: our gradient is pushed in that direction.


If A(s,a) < 0 (our action does worse than the average value of that state), our gradient is pushed in the opposite direction.

The problem with implementing this advantage function is that it requires two value functions: Q(s,a) and V(s). Fortunately, we can use the TD error as a good estimator of the advantage function.
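
Concretely, the one-step TD error built from the Critic’s value estimate V approximates the advantage:

$$A(s_t, a_t) \approx r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$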

Two different strategies: Asynchronous or Synchronous

We have two different strategies to implement an Actor Critic agent:


- A2C (aka Advantage Actor Critic)
- A3C (aka Asynchronous Advantage Actor Critic)

In this article we will work with A2C rather than A3C. If you want to see a complete implementation of A3C, check out Arthur Juliani’s excellent A3C article and Doom implementation.

In A3C, we don’t use experience replay, as it requires a lot of memory. Instead, we asynchronously execute different agents in parallel on multiple instances of the environment. Each worker (a copy of the network) will update the global network asynchronously.

On the other hand, the only difference in A2C is that we synchronously update the global network. We wait until all workers have finished their training and calculated their gradients, average them, and then update our global network.

Choosing A2C or A3C?

The problem with A3C is explained in this awesome article. Because of the asynchronous nature of A3C, some workers (copies of the Agent) will be playing with an older version of the parameters. Thus the aggregated update will not be optimal.

That’s why A2C waits for each actor to finish its segment of experience before updating the global parameters. Then we start a new segment of experience with all parallel actors having the same new parameters.

As a consequence, the training will be more cohesive and faster.


Implementing an A2C agent that plays Sonic the Hedgehog

A2C in practice

In practice, as explained in this Reddit post, the synchronous nature of A2C means we don’t need different versions (different workers) of the A2C.

Each worker in A2C will have the same set of weights since, contrary to A3C, A2C updates all of its workers at the same time.

In fact, we create multiple versions of the environment (let’s say eight) and then execute them in parallel.

The process will be the following:


- Creates a vector of n environments using the multiprocessing library (see the sketch after this list)
- Creates a runner object that handles the different environments, executing them in parallel
- Has two versions of the network:
  - step_model: generates experiences from the environments
  - train_model: trains on those experiences
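
As a rough illustration of the first point, here is a heavily simplified sketch of a vector of environments driven through the multiprocessing library. The env_id argument, the gym-style step API and the pipe protocol are assumptions; the actual repo uses a more complete runner/VecEnv abstraction:

```python
from multiprocessing import Pipe, Process

import gym  # assumed gym-style environment API

def env_worker(conn, env_id):
    # Each worker process owns one environment instance and steps it on demand.
    env = gym.make(env_id)
    obs = env.reset()
    conn.send(obs)
    while True:
        cmd, action = conn.recv()
        if cmd == "step":
            obs, reward, done, _ = env.step(action)
            if done:
                obs = env.reset()
            conn.send((obs, reward, done))
        elif cmd == "close":
            conn.close()
            break

def make_vec_env(env_id, n_envs):
    # Spawn n environments, each in its own process, and keep the parent pipes
    # so the runner can send actions and receive transitions in parallel.
    conns = []
    for _ in range(n_envs):
        parent_conn, child_conn = Pipe()
        Process(target=env_worker, args=(child_conn, env_id), daemon=True).start()
        conns.append(parent_conn)
    return conns
```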

When the runner takes a step (with the single step_model), it performs a step for each of the n environments. This outputs a batch of experience.

Then we compute the gradient all at once using train_model and our batch of experience.


Finally, we update the step model with the new weights.


Remember that computing the gradient all at once is the same thing as collecting data, calculating the gradient for each worker, and then averaging. Why? Because summing the derivatives (summing the gradients) is the same thing as taking the derivative of the sum. But the second approach is more elegant and a better way to use the GPU.
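
A tiny numerical illustration of that identity, with made-up numbers: suppose the per-batch gradient is the sum of 0.5 · x² over the batch. Summing the per-worker gradients gives the same result as computing the gradient once over the concatenated batch.

```python
import numpy as np

def batch_grad(x):
    # Illustrative per-batch gradient: sum of 0.5 * x^2 over the batch.
    return np.sum(0.5 * x ** 2)

worker_batches = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

summed_worker_grads = sum(batch_grad(x) for x in worker_batches)
grad_of_full_batch = batch_grad(np.concatenate(worker_batches))

assert np.isclose(summed_worker_grads, grad_of_full_batch)
```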

A2C with Sonic the Hedgehog

So now that we understand how A2C works in general, we can implement our A2C agent playing Sonic! This video shows the difference in our agent’s behavior after 10 minutes of training (left) and after 10 hours of training (right).

The implementation is in the GitHub repo here, and the notebook explains the implementation. I also give you the saved model, trained for about 10+ hours on a GPU.

This implementation is much more complex than the former implementations. We are beginning to implement state-of-the-art algorithms, so we need to be more and more efficient with our code. That’s why, in this implementation, we’ll separate the code into different objects and files.

That’s all! You’ve just created an agent that learns to play Sonic the Hedgehog. That’s awesome! We can see that with 10 hours of training our agent doesn’t understand the loops, for instance, so we’ll need to use a more stable architecture: PPO.

Take time to consider all the achievements you’ve made since the first chapter of this course: we went from simple text games (OpenAI taxi-v2) to complex games such as Doom and Sonic the Hedgehog, using more and more powerful architectures. And that’s fantastic!

Next time we’ll learn about Proximal Policy Optimization (PPO), the architecture that won the OpenAI Retro Contest. We’ll train our agent to play Sonic the Hedgehog 2 and 3, and this time it will finish entire levels!

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, change the learning rate, and so forth. Experimenting is the best way to learn, so have fun!


If you liked my article, please click the 👏 below as many times as you liked the article, so other people will see this here on Medium. And don’t forget to follow me!

This article is part of my Deep Reinforcement Learning Course with TensorFlow. Check out the syllabus here.

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello [at] simoninithomas [dot] com, or tweet me @ThomasSimonini.


Deep Reinforcement Learning Course:

We’re making a video version of the Deep Reinforcement Learning Course with TensorFlow, where we focus on the implementation part with TensorFlow here.

Part 1: An introduction to Reinforcement Learning


Part 2: Diving deeper into Reinforcement Learning with Q-Learning


Part 3: An introduction to Deep Q-Learning: let’s play Doom


Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets


Part 4: An introduction to Policy Gradients with Doom and Cartpole


Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3


Part 7: Curiosity-Driven Learning made easy Part I


Translated from: /news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/
