I'm here with Professor David Silver of DeepMind at RLC. Professor Silver, in today's talk you showed a result on using RL itself to discover better RL algorithms. At these conferences there's a major focus on new methods and new algorithms. How far away is the day when it's no longer practical for human RL researchers to improve algorithms themselves, and the focus shifts entirely to meta-RL, which presumably only the big labs can participate in? Basically, will small-scale RL research still matter at that point?
Yeah, it's a great question. What we've done so far is focus on a single piece of the puzzle, the update rule that you use in RL, which is one big piece of the whole algorithm but not the only piece. By the update rule, I mean something like whether you're doing Q-learning, auxiliary tasks, or policy gradient: essentially what the loss is, the objective that you try to optimize via gradient descent or so forth. In that space, we've found we can make a lot of headway by applying meta-learning, learning from a set of environments which update rule is most effective, and that's an exciting step.

But I don't think we're anywhere near the point of putting ourselves out of work, partly because there are these other pieces of the puzzle that work with it: the nature of the function approximator, the nature of the optimizer, how all the pieces are put together, and so forth. Maybe more importantly, the way all of these algorithms work, they still rely on a proxy objective. So we're actually learning which algorithm is most effective with respect to some human-designed proxy objective; we use something like a vanilla policy gradient algorithm. There's still human design in the loop, and we find that if you improve that proxy objective, it makes the whole meta-gradient algorithm work better.

So we're by no means replacing humans; there's still a lot of room for humans to make things better. In fact, even the space in which we designed our meta-networks to optimize came largely from the goal of making sure the meta-network design could support all the different discoveries humans have made over the years, across all the different types of algorithms we've developed. We wanted to make sure the meta-network was sufficient to support those. And if humans were to make a big breakthrough, maybe we'd change the meta-network design to follow it. So I think at the moment we're just at the beginning of a kind of phase change, where meta-learning is starting to be very successful, but it's still a long road ahead until that's the only game in town.
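The "update rule as a loss" idea Silver describes can be made concrete with a vanilla policy-gradient (REINFORCE) objective, one example of the kind of human-designed proxy objective he mentions. This is an illustrative numpy sketch, not DeepMind's actual meta-learned rule; the function name and inputs are assumptions for the example.

```python
import numpy as np

def reinforce_loss(logits, actions, returns):
    """Vanilla policy-gradient surrogate loss:
    L = -mean(log pi(a_t | s_t) * G_t).
    Minimizing L by gradient descent ascends expected return."""
    # numerically stable log-softmax over the action dimension
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # log-probability of each action actually taken
    logp_taken = logp[np.arange(len(actions)), actions]
    return -np.mean(logp_taken * returns)

# Uniform policy over 2 actions: log pi = -log 2 for every action,
# so with unit returns the loss is log 2 (about 0.693).
logits = np.zeros((2, 2))
print(reinforce_loss(logits, np.array([0, 1]), np.array([1.0, 1.0])))
```

A meta-learned update rule in the sense of the talk would replace a fixed objective like this with one whose form is itself discovered from a distribution of environments.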
I understand AlphaFold does not use RL at this point. Is there a path to scaling AlphaFold's protein folding predictions with RL, the way you spoke about the other challenges?
Yeah, that's a great question. I was actually one of the people involved at the beginning of the AlphaFold project. We'd been very excited, Demis in particular, and driven by the question of whether we could apply the methods that were so effective in AlphaGo to try to solve the protein folding problem. So originally we viewed this as an RL problem, with folding as an action, trying to optimize a reward corresponding to how well folded the whole structure is. And I have to say, one of my biggest contributions to the project was encouraging the team to stop viewing it as an RL problem. If I'm really honest about it, not all problems are best suited to RL; you really have to find the problems that are natively better understood that way. In this case, it became clear to me over time that, given the data we had, this was better understood and modeled as a supervised learning problem. I think that led to a lot of rapid progress in AlphaFold, and to the success we had.

Having said that, I think there are hugely important problems in the space of related problems, things like protein design, which are much more suited to RL than to supervised learning. Because now you much more clearly have an action space: we're trying to design drugs, the design space is non-differentiable, and RL, or at least search, is probably going to be necessary to do that work. So I'd expect much more crossover and fruitful application of RL in those areas.
When I looked at your Google Scholar to prepare for this interview, I noticed that in recent times there are a lot of patent applications and not a lot of papers. Have you shifted your focus away from publishing for now, is that a temporary thing, or is it a trend for you?
I think publication is really important. The honest answer is that I was quite ill for a while. After COVID, I got long COVID and just wasn't very well for a year or two, so I had to scale back a lot of my activities. Sadly, one of the things I deprioritized was publishing conference papers. I decided that, for the amount of time and energy I had, I should really use that time and energy to focus on the bigger results. I'm feeling much better now, my health is back to 100%, but it really was a shift that happened because of those circumstances. I do think publication is enormously important for the field, and I've always felt that the work we do deserves to be understood and shared wherever possible, within the constraints of being in a commercial organization. On the whole, that is and always will be my modus operandi.
I'm glad to hear you're better. Going back to AlphaZero: in the paper there was a curve showing how Elo scaled with training, and it reached a plateau and stayed at that plateau for a very long time. Do you have a sense of what limiting factor determined the height of that plateau?
Yeah, that's a great question. What we saw was that in some games AlphaZero continued learning indefinitely without any plateau, and in some games it appeared to reach something more like a plateau. The defining characteristic seems to be whether draws are possible in those games. In Go, it's not possible to have a draw because of the way the game is designed: there's a rule called komi, which means one player or the other always has to win by at least half a point. In chess, when you get to very high-level play, most games are draws. I think what we're seeing in those AlphaZero plots is that as you reach this very high level of play, you're getting less feedback out of the system, because most of the games are drawn. You have AlphaZero in self-play, and most of the time it's going to draw against itself, so everything slows down a little bit. I think there are reasonable mechanisms to get around that, but those just weren't in those plots. It's really the nature of the problem rather than something fundamental in the algorithm.
A lot of grad students listen to this podcast. Do you have any words for the young grad student who's aspiring to get into RL research?
Pick a problem that you really believe in and go for it, and don't be afraid to go for challenging problems. In my opinion, it's much better to try to do something glorious in research and maybe fail than to do something you know is incremental and be guaranteed to succeed. If I have one tip, it's something I've always tried to do in my career: find the sweet spot, problems that are just about within range, not completely beyond possibility, but really not straightforward at all. In my mind I'm quite an optimistic person, but I nevertheless try to choose problems where I believe the chance of success is at most 50%. That roughly fifty-fifty sweet spot I've found to be helpful, because if you look at the rate of progress in AI, it's fast: if you're working on something you think is fifty-fifty today, it might well be solved by the end of your master's or PhD. So my advice would be to be bold, be ambitious, and go for it, because you have a chance to do something special with your precious moments on this planet and really work on something that makes a difference.
And finally, if young David Silver, back when he was doing his PhD, could see all of this today, what would he say, and how would he feel?

That's a good question. I was very focused on computer Go in the early days, but at the same time I always believed the long-term mission was to try to reach AGI, to get to some kind of superhuman intelligence. So I think I'd be amazed by how much progress there's been, and really excited to know that it was coming down the line. If you can always imagine that the future is going to be much more exciting than where we are today, it's really motivating to know that whatever you do now will feed into a world where things are moving so quickly and building on each other so fast. If I could give advice back to myself, it would be to enjoy the moment, because everything moves so quickly: whatever feels a long way away at the time, before you know it, it's here, and rapid progress has happened. And building on all the work that people do is extremely satisfying, if you're in any way able to contribute a small cog to that bigger machine of research progress the field is making. It's an amazing feeling. For those of you who've chosen to focus on reinforcement learning, it's an exciting and ambitious choice, and I wish you all the best in making it successful.