本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
嘿,大家好,欢迎回到 Towards Data Science 播客。
Hey, everyone, and welcome back to the Towards Data Science podcast.
表面上看,强化学习范式——将智能体置于环境中,对其采取的良好行为给予奖励,直到其掌握某项任务——似乎没有任何明显的限制。
Now on the face of it, there's no obvious limit to the reinforcement learning paradigm of putting an agent in an environment and rewarding it for taking good actions until it masters a task.
到去年为止,强化学习已经取得了一些令人惊叹的成就,包括掌握围棋、多种雅达利游戏、《星际争霸II》等等。
And by last year, RL had achieved some amazing things, including mastering Go, various Atari games, StarCraft II, and so on.
但人工智能的终极目标并不是掌握特定的游戏,而是实现泛化,创造出能够在未经过训练的新游戏中表现优异的智能体。
But the holy grail of AI isn't to master specific games, but to generalize, to make agents that can perform well on new games that they haven't been trained on before.
于是时间快进到七月,DeepMind 的一个团队发表了一篇名为《开放式学习催生通用智能体》的论文,这在通用强化学习智能体的研究方向上迈出了重要一步。
So fast forward to July, and a team at DeepMind published a paper called Open-Ended Learning Leads to Generally Capable Agents, which takes a big step in the direction of general reinforcement learning agents.
今天和我一起做客的,正是这篇论文的合著者之一,马克斯·贾德伯格。
And joining me today is one of the coauthors of that paper, Max Jaderberg.
马克斯于2014年加入谷歌生态系统,当时谷歌收购了他创办的计算机视觉公司;最近,他启动了 DeepMind 的开放式学习团队,专注于推动机器学习向跨任务泛化能力的领域更进一步。
Now Max came to the Google ecosystem in 2014 when they acquired his computer vision company, and more recently he started DeepMind's open-ended learning team, which is focused on pushing machine learning further into the territory of this sort of cross-task generalization ability.
在这期 Towards Data Science 播客中,马克斯与我探讨了开放式学习、泛化能力的未来路径,以及人工智能的前景。
Max joined me to talk about open ended learning, the path ahead for generalization, and the future of AI, on this episode of the Towards Data Science Podcast.
我确实想深入探讨这篇论文的具体内容,和你聊聊它,但我也想先从一个更个人化的问题开始。
I do wanna get to the nuts and bolts of that paper and talk to you about it, but I'd also like to start with a more biographical question.
你是怎么加入 DeepMind 的开放式学习团队的?
How did you get into the DeepMind open ended learning team?
另外,什么是开放式学习?能简单解释一下吗?
And then also just what is open ended learning and all that?
如果你能把这些都浓缩在一个回答里,那就太好了。
If if you could squeeze that into an answer, that'd be great.
是的。
Yeah.
不是。
No.
我的背景其实是计算机视觉。
So my background is actually computer vision.
我博士阶段研究的就是计算机视觉。
So I did my PhD in computer vision.
自从七年前加入DeepMind以来,我一直在从事计算机视觉研究,思考无监督学习。
Since joining DeepMind about seven years ago, I was doing some computer vision, thinking about unsupervised learning.
这如何能帮助强化学习,帮助我们构建能够通过与环境互动来学习的智能体?于是我开始探索模拟世界和游戏,作为开发强化学习算法的一种方式。
How could that help reinforcement learning, help us build agents that are able to learn from interaction with their environment? And I started exploring the space of simulated worlds and games as a way to develop reinforcement learning algorithms.
这自然而然地引导我进入了多人游戏领域。
And this sort of led quite naturally to these multiplayer games.
我们熟悉象棋和围棋这样的游戏,但还有真正的多人电子游戏,比如《星际争霸》和《雷神之锤III》的夺旗模式。
We're familiar with things like chess and Go, but then there are actual multiplayer video games, things like StarCraft and Quake III Capture the Flag.
这些是构建复杂有趣游戏环境的绝佳方式,我们可以借此发展我们的强化学习算法和人工智能。
These are great ways to build complex, interesting game environments that we can then develop our reinforcement learning algorithms, our AI, towards.
当你思考如何在环境中不断增加复杂性,并让代理(AI)学会应对这种复杂性时。
And when you're thinking about, okay, how do you create more and more complexity on the environment and then try and get an agent, an AI to learn about that complexity?
你自然会希望进一步增加复杂性,让AI学习更多内容。
You then naturally want to make things even more complex and get it to learn even more.
这就产生了一种类似棘轮效应的机制。
And you have this sort of ratchet effect.
我们在学术界自然地这样做。
We do this naturally in sort of academia.
我们解决一个问题,比如攻克了Atari,然后就想:好吧,我们需要Quake III。
We solve one problem, we solve Atari, and then we're like, Okay, we need Quake III.
我们解决了Quake III,又想:好吧,我们需要StarCraft。
And we solve Quake III, we're like, Okay, we need StarCraft.
但这自然引出了一个问题:如果我们根本不再局限于一个固定目标呢?
But this sort of leads to the question, Okay, actually what if we could just forget about a fixed objective?
不再执着于必须解决某一款游戏或必须最大化某个特定目标。
Forget about a fixed one game that we have to solve or one objective that we have to maximize.
而是仅仅定义一个动态系统、一个持续增长复杂性的学习系统——智能体变得更强,环境本身也变得越来越复杂。
And instead we just define a dynamical system, a learning system that continually grows in complexity, the agent gets better, but also the environment itself gets more and more complex.
随着智能体复杂性的提升,环境中可供探索和发现的内容也越来越多。
And as the agent grows in complexity, there's more and more in the environment to explore and discover.
因此,这里没有终极目标,只有不断涌现的复杂性。
And so you just get there's no end goal, there's just emergent complexity happening.
这有点像人工生命。
It's a bit like artificial life.
总之,说了这么多,我想表达的是,正是通过探索这些其他游戏,我在 DeepMind 发起了开放式学习团队。
Anyway, that was a long way to say that through this exploration of these other games, I started the open ended learning team in DeepMind.
这篇论文确实是我们朝这个方向迈出的第一步具体成果。
And this paper has really been our first concrete step in this direction.
是的。
Yeah.
我觉得特别有趣的是,在我看来,程序化生成环境这一理念,某种程度上是一种范式转变。
What I found really interesting about it was that, to me, this idea of environments being procedurally generated is, you know, a paradigm shift in a way.
因为在我看来,以往的关注点总是放在:我们需要开发更好的算法。
Because the focus, in my mind at least, was always on, like, oh, we need to make better algorithms.
我们需要专注于算法,提出更新颖、更聪明的方法,比如注意力机制就是朝这个方向迈出的一步,还有自我对弈之类的。
We need to focus on the algorithm, come up with new, cleverer things, you know, attention was a step in that direction, and and, like, self play, that sort of thing.
但我从未想过要把环境本身作为需要独立研究的对象。
But it never occurred to me to look at the environments as the thing that deserved independent study.
这个想法对DeepMind团队、你的团队或者你来说一直都很明显吗?还是最近才出现的?
Was this a realisation that was always obvious to the team at DeepMind, to your team, to you, or is this something that just came about more recently?
是的,我认为这个想法是在几年前才出现的。在强化学习研究中,我们通常的流程是:解决一个环境,然后创建一个更难的新环境,再解决它。
Yeah, I think it came about a couple of years ago, really, we had this process in reinforcement learning research of you solve an environment, you create a new harder environment and you solve that.
但当我进行大量这类研究,以及我的同事和这个领域的其他人也在做类似工作时,你会自然意识到:你面前的环境决定了你需要什么样的算法,也决定了这些智能体会展现出什么样的行为。
But really, as I was doing a lot of this research, and a lot of my colleagues and people in the field were doing this, you would naturally realize that what environment you have in front of you really determines what sort of algorithm you need, and also what sort of behavior comes out of these agents.
因此,如果我们关心的是AGI或某种在模拟领域中具备通用能力的智能体,那么显然,你用来训练和测试的环境本身,需要和用于训练该环境并产生最终智能体的算法一样,获得同等的研究关注。
And so it starts to make perfect sense that, if what we care about is AGI, or at least a generally capable agent within some simulated domain, the actual substrate, the environment that you're going to be training on and testing on, needs as much research focus as the algorithms that are going to be training on that environment and producing the final agent.
所以,确实已经有很多相关工作了,之前就存在程序化生成的强化学习环境,比如程序化生成的迷宫之类的。
So yeah, there has been a lot of work here. There have been procedurally generated reinforcement learning environments before, procedurally generated mazes and things like that.
来自游戏AI领域的研究者们也做了大量工作,比如为马里奥、塞尔达或二维游戏程序生成关卡等等。
And there's been a lot of work from the AI-for-video-games community on procedurally generating levels for Mario or Zelda or 2D games, things like that.
是的,把所有这些整合到一个大型的强化学习循环中。
Yes, putting it all together into one big reinforcement learning loop as well.
是的。
Yeah.
也许这能让我们更深入地讨论这篇论文,因为这确实引发了一个问题:我们已经在单个游戏的层面上实现了程序生成。
Maybe this gets us into the paper more then, because this does raise the question, we have procedural generation at the level of individual games.
那么,这与你们在论文中使用的 XLand 所实现的程序化生成之间,关键区别在哪里?
What's the big difference between that and the kind of procedural generation that happens through XLand, the tool you use in the paper?
那些游戏中的程序生成关卡做不到什么,而这项新工作却实现了?
What is it that those procedurally generated levels of games don't do that now is unlocked by this new work?
是的。
Yeah.
假设你考虑一个迷宫游戏,每次玩家被放置在一个新的迷宫中并需要解决它,你可以程序化生成迷宫,对吧?也就是你必须做出的一系列左转和右转来解决迷宫。
Let's say you think about a maze game, where every time the player gets plunked down in a new maze and has to solve this maze, you can procedurally generate the maze, right, exactly what sequence of left turns and right turns you're going to have to make to solve the maze.
但如果你思考玩家必须执行的底层行为,即使迷宫是程序生成的,可生成的迷宫数量比宇宙中的原子还多,这些迷宫中的玩家或人类玩家的实际行为却始终一致。
But if you think about what's the underlying behavior that the player has to execute: even though the maze is procedurally generated, and there are more mazes than atoms in the universe that could be generated, the actual behavioral side of the agent, or the human player, is always the same for each of these.
解决迷宫的元过程是:碰到墙时,向左或向右走。
There's a meta process of solving the maze where you hit a wall and you go left or right.
因此,尽管迷宫的数量非常庞大,但实际的行为种类其实非常少。
So the actual number of behaviors is really small, even though the number of mazes is really large.
因此,这促使我们思考:如何不仅生成世界,还要生成代理必须执行的实际游戏和任务,使得我们不仅仅拥有一个或两个代理所需的行为,而是成千上万甚至理想情况下数百万种代理必须学习的行为。
So this led us to: how can we procedurally generate not just the worlds but the actual games, the tasks that the agents have to play, so that we have not just one or two required behaviours but thousands, or ideally even millions, of behaviors the agent needs to learn.
因此,这或许就是我们试图定位的设计差异:如何生成世界和游戏,使它们要求代理展现出成千上万种不同的行为。
So that's maybe the difference: the design point that we tried to hit was, how can we procedurally generate worlds and games so that they require thousands and thousands of different behaviors from our agent.
从根本上杜绝代理以任何方式对特定环境产生过拟合的机会。
To never give the agent basically the chance to overfit to a particular environment in any way.
正是如此。
Exactly.
这就是深度学习的核心观点。
It's the deep learning thesis.
如果你拥有足够的数据,向你的网络提出海量要求,那么这些神经网络就能实现泛化。
If you have enough data, if you're asking huge amounts from your network, then you will get generalization out of these neural networks.
我们在计算机视觉中见过这一点,在语言领域也见过,因此在强化学习中没有理由看不到同样的现象。
We've seen this in computer vision, we've seen this in language, and there's no reason that we shouldn't see this in reinforcement learning.
我们只需要从这些网络中激发足够多的行为即可。
We just need to ask enough behavior out of these networks.
是的,对我来说,这件事最引人入胜的地方在于,它奇妙地与大型语言模型形成了平行关系——因为我们终于有了一个足够困难的任务,迫使智能体以这种方式泛化,尽管它们表面上看起来毫无关联。
Yeah, to me, one of the fascinating things about this is, oddly, the parallel with large language models, in the sense that here we finally have, it seems, a task that's sufficiently hard that it forces the agent to generalize in that way, even though the two don't seem related.
我的意思是,这就像自然语言任务在强化学习中的对应物。
I mean, it's kind of like the RL analog of that natural language task.
你同意这个观点吗?
Would you agree with that?
我看到你在点头,所以这大概是对的?
I see you nodding there, so maybe that's approximately right?
是的,完全正确。
Yeah, that's absolutely right.
整个这一研究方向的产生,源于观察到这些视觉模型和语言模型的涌现性成功。
And this whole line of research came about from seeing the emergent success of these vision models and these language models.
没有理由认为这种方法不能适用,只是数据模态略有不同。
And there's no reason why this can't work here; the data modality is just slightly different.
但没错,确实就是这样。
But yeah, it's absolutely that.
它让神经网络始终处于一种持续的不确定状态,使其无法记住所有内容。
It's keeping the neural network in this state of constant uncertainty where it can't memorise everything.
因此,你必须学习通用的表示和通用的行为,这些行为在大量被要求完成的游戏和任务中足够有效。
And so you have to learn general representations, general behaviors that work just well enough across this massive distribution of games and tasks that are being asked.
这很有趣。
It's interesting.
在某种意义上,这似乎完全是关于让机器学习实现其潜力。
There's a sense in which it seems like it's all about well, obviously it's about making machine learning live up to its potential.
但当你谈到深度学习时,有一种观点认为深度学习模型是一种通用函数逼近器。
But when you talk about deep learning, there's this notion of deep learning model as a universal function approximator.
这听起来最初非常有前景。
And that sounds super promising at first.
你会想,好吧,它什么都能做。
You kind of go, Okay, so it can do everything.
对吧?
Right?
当然,结果发现,至少在最近之前,情况并非如此。
And then of course, it turns out, well, at least until very recently, it didn't seem that way.
但我们已经逐步弄清楚了,我们需要向这些模型中注入哪些先验知识?
But there have been a series of priors that we've kind of figured out, what are the priors that we need to pack into these models?
哪些可以丢弃?
Which ones can we ditch?
因此,我们现在不再硬编码‘图像必须具有平移不变性’这样的观念。
So we no longer, for example, hard code the idea that an image has to be translationally invariant.
我们知道,卷积网络不再是唯一的方法了。
You know, convolutional nets are no longer the only way to do things.
但注意力机制很有趣,因为它提供了一种更通用的先验,似乎适用于更广泛的任务,同时仍然非常有信息量。
Attention though was interesting because it gave us a prior that was in a way more universal, that seemed to apply to a wider range of tasks, but that was still very informative.
你在强化学习中看到类似的情况发生吗?
Do you see something similar happening with reinforcement learning?
在强化学习中,是否存在类似的演进过程?
Is there an analog for that progression in RL?
是只有RL智能体自身拥有的模型吗?
Is it just the model that the RL agent itself has?
你是说网络架构上的进步吗?
Progression in terms of the network architectures?
是的,正是如此。
Yeah, exactly.
是的,我认为RL也会出现类似的进展。
Yeah, I think we will see a similar progression in RL.
我认为RL在神经网络架构上的进步不如其他领域明显,因为还没有足够复杂的数据需要这些新架构去学习,对吧?
I think RL hasn't quite had the same level of progression in terms of neural network architectures because there hasn't been the complexity of data needed to be soaked up by these new architectures, right?
即使这篇论文中提出的网络,与这些大型语言模型和视觉模型相比也小得多。
Even the networks presented in this paper are very small in comparison to these large language models and vision models.
但我们会看到,而且已经看到的是,随着数据变得越来越复杂、越来越庞大——你需要学习和封装到网络中的游戏成千上万,需要通过一个网络表达的行为也成千上万——那时我们就会看到模型扩展和不同架构的优势显现出来。
But what we will see and what we are already seeing is that as the data becomes more complex, more vast, there's thousands and thousands of games that you have to learn and encapsulate within a network, thousands of behaviors that you have to express through one network, then we'll see the benefit of scaling these models, we'll see the benefits of different architectures coming about.
是的,有一些基础操作,比如注意力机制,或者更广义的、数据各部分之间的乘法交互,似乎在所有类型的数据中都表现得非常好,无论是视觉、语言、RL还是语音。
And yes, there are fundamental operations like attention or there's a generalised class of just multiplicative interactions between parts of your data that seem to work really well across all different types of data, vision, language, RL, speech.
在定性方面,你从这些智能体身上看到了什么,是像星际争霸玩家这样的智能体所不具备的?
And on the qualitative side, what do you see from these agents that you don't see from, let's say, a StarCraft playing agent or something like that?
特别让我感到惊讶的是,这些智能体展现出的探索行为。
One thing in particular that I think surprised us was this experimenting behaviour that we saw coming out of these agents.
由于它们是在如此广阔的任务分布上训练出来的,因此对不确定性非常稳健。
They're very, very robust to uncertainty because they've been trained on so much vastness.
当进入一个新游戏或新任务时,它们常常不知道该怎么做。
They come into a new game, a new task, and they often don't know what to do.
它们不知道如何解决它。
They don't know how to solve it.
如果我们被告知规则,这对你我来说可能显而易见,但对智能体来说未必如此。
It might be obvious for you and me if we were told the rules, but it might not be obvious for the agent.
我们经常看到这种探索行为:智能体并不完全清楚如何完成任务,但它可能了解其中涉及的一些对象。
And we see this experimenting behavior come up quite often where the agent doesn't know exactly how to solve the task, but it might know some of the objects involved.
它会随意摆弄这些对象,尝试不同的组合方式,或寻找不同的东西并将它们组合在一起。
It will just sort of juggle these objects about or try different combinations of arrangements or search for different things and put them together.
然后,通过这种实验,它偶然找到了正确的物品排列方式,把对的东西放在对的位置,并且意识到了这一点。
And then sometimes through this experimentation, it happens to get the right arrangement of stuff where it puts the right thing in the right place and it recognizes that.
它能识别出自己何时成功了。
And it recognize when it's successful.
它会想:哦,我需要把那个放在这个旁边,把这个放在那个旁边。
It's like, oh, I needed to get that next to that and that next to that.
我原本不知道该怎么做,但现在我看到了,就是这样才对。
I didn't know how to do it, but now I can see it and it's right.
我不再需要不停地摆弄这些东西了,对吧?
I'm going to stop juggling things, right?
因此,这种惊人的行为源于实验与成功识别,这可以说是一种通用的行为启发式方法。
So this amazing behavior came out of experimentation with success recognition, which is sort of a general behavioral heuristic.
目前,我们看到的行为是,它会迅速缩小实验对象的范围。
At the moment, the sort of behavior we're seeing is it very quickly narrows down the types of things it's experimenting with.
但它在过程中并不一定会不断优化细节。
But doesn't necessarily refine things as it's going.
所以它并没有进行这种在线适应,而我们实际上希望在某个时候看到这种适应。
So it's not doing this sort of online adaptation, which we really want to see at some point.
你们在论文中测试的一个方面是合作与竞争。
One of the things you tested in the paper was cooperation and competition.
因此,通过在环境中引入其他智能体,结合个人游戏和群体游戏,你们在性能上看到了差异吗?
So introducing other agents to the environment, having a combination of individual games and group games, did you see a difference in terms of performance?
因为我觉得管理一个包含另一个智能体的环境会更复杂。
Because I imagine managing an environment with another agent is more complex.
是否出现了类似的现象,比如试探合作动态、建立信任,或者类似的行为?
Were there these similar kind of emergent behaviors trying to feel out the cooperation dynamic, maybe trust building, or things like that?
是的。
Yeah.
真正有趣的是,这实际上是一个多人游戏、多智能体环境,里面有很多智能体。
So the really interesting thing is that it's actually a multiplayer game, a multi-agent environment, so there are many agents in there.
我们训练的方式是,每局游戏实际上只有一个智能体在学习。
The way we trained is: per game, there's actually only one agent which is learning.
而游戏中的其他玩家是这些冻结的玩家,即预训练的智能体,它们具有一些固定行为,我们创建了一个包含大约十到十二种行为的目录,并以此作为训练对象。
And the other players of the game are these frozen players, the pre-trained agents which have some fixed behavior; there's a catalogue of maybe 10 or 12 that we created, and that's what we train against.
但它们的行为不会改变。
But they don't change.
因此,从训练的角度来看,这个新智能体在训练过程中必须学会与所有这些不同类型的智能体互动。
So from the training perspective, this new agent as it's training has to learn to play with all these different types of agents that come at it.
可能会根据它所对战的对手调整自己的行为。
Maybe change their behaviour based on who they're playing against.
有趣的是,在训练结束后,我们可以衡量这个训练后智能体的合作程度。
The interesting thing is after the fact we can then measure the cooperativeness of the trained agent.
它有多合作、多具有竞争性,这通过一些探测任务来评估,比如构建一些小游戏理论困境,如囚徒困境等。
How cooperative it is, how competitive it is with these probe tasks where we construct little game theory dilemmas like prisoners dilemmas and things like this.
我们可以观察到智能体的合作性在训练过程中的变化。
And we can see how the cooperativeness of the agent changes through training.
而这完全基于它在游戏世界中的经验。
And this is purely based on its experience in the game world.
那么,也许这个问题很难回答,但什么更容易学习呢?
And what is, maybe this is an impossible-to-answer question, but what is easier to learn?
是合作行为更容易掌握,还是竞争行为更容易掌握?
Is it the cooperative dynamics or does it tend to be the competitive dynamics?
我认为竞争行为是最容易学习的。
I think the competitive dynamics are the easiest thing to learn.
或者换种说法,竞争性游戏是极强的训练信号来源。
Or, maybe put another way, competitive games are a very strong source of training signal.
你有一种非常自然的自动课程效应:当对手变强时,游戏难度也随之增加。
You have this very natural auto curriculum effect where as the opponent gets stronger, the game gets harder.
如果你和自己对战,随着你变得更强,对手也会变得更强。
And if you're playing against yourself, as you get better, the opponent gets better.
因此,你自然而然地获得了一种循序渐进的学习课程来练习。
And so you naturally get this sort of easy gradual learning curriculum to play with.
如果你想学习合作,实际上非常困难,因为你需要环境中存在另一个有能力的智能体。
If you think about learning cooperation, it's actually very hard because you need to have another competent agent in the environment.
对。
Right.
然后观察它的行为,意识到它很擅长,并学习一些能够与该智能体行为互补的策略。
And then watch what it's doing and learn that it's competent, and actually learn some behaviour which complements that other agent's behavior.
这需要很多步骤。
That's a lot of steps.
而竞争行为,你甚至不需要去考虑另一个智能体。
And competitive behavior, you don't even need to think about the other agent.
你只需要想,哦,不管这个智能体做什么,它都会与我的行为相抵触。
You can just think, oh look, whatever this agent does, it's going to be against what I do.
所以做点什么来消除它或把它搁置一旁。
So do something to get rid of it or leave it aside.
我想竞争也是自我对弈的本质,对吧?
I guess competition is the essence of self play as well, right?
这基本上是一种经过验证的可靠方法。
It's basically a tried and true method.
没错,没错。
Exactly, exactly.
有趣的是,在训练过程中将所有这些游戏混合在一起,这两种行为都可能自然涌现。
The cool thing is mixing all of these games up together in one big soup during training, both of these things can emerge.
早期的竞争性可以通过这种训练机制最终导向后期的合作性。
And early competitiveness can lead to later cooperativeness through this training dynamic.
那你认为目前最大的瓶颈是什么?
And what do you see as the big bottlenecks right now?
是什么阻止了这些智能体做得更多?
What prevents these agents from doing more?
它们通常会犯哪些类型的错误?
What are the kinds of mistakes they'll tend to make?
在迭代这一方向上,你认为下一步该做什么?
What do you see as the next steps in terms of iterating on this?
正如我之前所说,我们正在看到非常出色的所谓零样本行为,也就是说,当一个新任务呈现给智能体时。
So, as I was saying before, we're seeing great, what we call, zero-shot behaviour: a new task is presented to the agent.
也许你出现后,设计了一个小任务并交给代理,它会做出一些合理的反应。
Maybe you come along and you create a little task and you give it to the agent and it does something kind of sensible.
它通常是有能力的。
It's generally capable.
它有时通过尝试或凭借对如何操作的了解来解决任务。
It solves it some of the time through experimenting or by kind of knowing what to do.
但我们没有看到的是,当代理在几分钟内持续执行该任务时,它的表现会有所提升。
But what we don't see is as the agent keeps playing on that task over the course of minutes, that it gets better at that task.
就像我和你坐下来面对这个任务,一开始我们可能不清楚具体怎么做,但我们会摸索着完成它。
Like me or you would sit down on this task and initially we might not know exactly how to do it, but we'd stumble about and manage to do the task.
然后当我们再玩五分钟,我们就会越来越擅长这个任务。
And then as we played for another five minutes, we'd get better at this task.
但现在我们并没有在代理身上看到这种进步。
We're not seeing that from the agents now.
我们真正希望看到的是代理在几分钟内进行在线适应,而这正是该项目下一步要努力实现的目标。
And it's this sort of online adaptation over the course of minutes that we really want to see and which is sort of the next stage of this project to try and get that out.
这有点像GPT-1的时刻,当时一些功能确实能运行,生成的文本也很出色,但我们还没有看到这种元层面的行为——即代理能够理解当前被要求做什么,并据此在未来几分钟内调整自己的行为。
It's a bit like the GPT-1 moment, where things are working and cool text is being emitted, but we don't yet have this meta sort of behaviour where the agent will understand what's being asked of it right now and change its behaviour for the next few minutes based on that.
你认为,就像GPT系列一样,这里的解决方案是扩大规模吗?
And do you think, like the GBT series, do you think the answer here is scale?
是的,我认为是规模的问题。
Yeah, I think it's scale.
但在强化学习中,更棘手的部分是我们之前讨论过的环境问题。
But the trickier part in reinforcement learning is like we were talking about before, the environment.
我们并没有现成的数据。
We don't have the data readily available.
我们无法直接从互联网上抓取数百万小时的优质游戏玩法数据。
We can't just scrape the internet for millions of hours of excellent gameplay.
我们必须设法构建一个动态系统和强化学习过程,来生成这种丰富而有趣的经验,以喂养这些大型模型并实现适应性。
We have to somehow create a dynamical system, reinforcement learning process that generates that experience, which is rich and interesting to feed these big models and get adaptation out.
这既涉及规模,也涉及我们如何真正发现并生成有趣任务,让代理在这些任务上进行训练。
It's scale and also how we actually find interesting tasks, generate interesting tasks, and get agents to train on these.
而泛化能力的衡量本身就是一个有趣的子领域。
And measurement of generalisation is its own interesting sub area here.
你如何知道在训练这些智能体、进行扩展以及做其他事情时取得了进展?
How do you know that you're making progress in training these agents and scaling and doing whatever you're doing here?
在语言领域,这尤其困难。
Notoriously, in language, this is tough.
因为我们有大量不同的评估指标。
Because we've got a whole bunch of different metrics.
我们有BLEU分数。
We've got the BLEU score.
还有许多其他方法可以通过问答、翻译等方式进行评估。
There are a whole bunch of different ways that you can assess through Q and A, through translation.
那么,你如何思考在如此多样化的环境中衡量泛化能力?
So how do you think about measuring generalisation in the context of an environment that's so diverse?
是的,这是个非常好的问题。
Yeah, it's a really good question.
你可以把这看作任何其他数据集,你只是进行测量,有一个训练集,然后有一个被保留的测试集。
You could think of it like any other data set: you just measure; you have a training set and then you have a test set which is held out.
你在训练集或某种训练分布上训练代理,然后在测试集上运行它,并查看得分。
You train the agent on the training set or some training distribution and then you run it on the test set and you look at the scores.
然后你可以问,哦,看看平均分是多少,或者中位数是多少?
Then you could say, Oh look, what's the mean score, the average score or the median score?
这可能告诉你一些关于它在这些保留的测试级别上泛化能力的信息。
This might tell you something about how it generalizes to these held out test levels.
但它可能无法反映这种分布的情况。
But it might not tell you about this distribution.
如果分布非常广泛且多样,我们可能并不太关心平均值或中位数。
If the distribution is very vast and diverse, then we might not care necessarily about the mean or the median.
我们更关心的或许是:你的智能体是在少数几个关卡上表现超人、却在90%的关卡上彻底失败吗?
What we might care about is: does your agent do superhuman on a few levels and then catastrophically fail on, like, 90% of them?
还是我们更在意,我们的智能体在99.9%的关卡上都达到大约30%的人类水平?
Or do we care more that our agent is at sort of 30% of human level across 99.9% of levels?
所以尽可能少地出现灾难性失败。
So it catastrophically fails on as few as possible.
所以我们关心的是这些因素的结合。
So, we kind of care about a blend of this stuff.
不清楚我们更关注哪一个,更不关注哪一个。
It's not obvious one we care about more, which one we care about less.
因此,我们转而查看所谓的百分位数。
So instead we look at what we call percentiles.
不仅仅是中位数(即测试集中第五十百分位的分数),你还可以看,第十百分位的分数是多少?
So rather than just the median, which is the fiftieth percentile score on a test set of levels, you can look at, Okay, what's the sort of tenth percentile score?
所有不同的分数以及这种分布的样子。
All the different scores and what that distribution looks like.
我们试图同时提升所有这些百分位。
And we're trying to push all of them up at the same time.
它们彼此之间存在冲突,这正是有趣之处。
They're kind of in conflict with each other, which is the interesting thing.
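作为示意(并非论文的实际实现),下面用一小段 Python 勾勒这种用多个百分位数、而非单一均值或中位数来概括保留测试任务得分分布的思路;其中的分数和百分位取值均为假设。As an illustration (not the paper's actual implementation), here is a minimal Python sketch of summarizing an agent's scores over held-out test tasks with several percentiles rather than a single mean or median; the scores and percentile choices here are hypothetical.

```python
import numpy as np

def percentile_profile(normalized_scores, percentiles=(10, 20, 50)):
    """Summarize performance over held-out tasks as a set of percentiles.

    normalized_scores: one score per test task. Pushing up a low
    percentile (e.g. the 10th) means fewer near-catastrophic failures
    across the task distribution, while the 50th (the median) tracks
    typical performance.
    """
    scores = np.asarray(normalized_scores, dtype=float)
    return {p: float(np.percentile(scores, p)) for p in percentiles}

# Hypothetical normalized scores on 8 held-out tasks
profile = percentile_profile([0.0, 0.1, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0])
```

同时提升所有百分位,就对应于访谈中所说的"把整个分布一起往上推"。Pushing all of these percentiles up at once corresponds to the "push the whole distribution up" idea described above.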
这有点像是古德哈特定律的问题,对吧?
It's sort of, well, a Goodheart's Law problem, right?
我的意思是,一旦你定义了一个单一的指标去优化,就会过度拟合它,然后它就不再是一个好的目标了。
I mean, ultimately, the moment that you define a single metric that you want to optimize against, you kind of overfit to it, and then it ceases to be a good target.
这些游戏的奖励机制也不一样,对吧?
These games also have different reward profiles, don't they?
有些游戏奖励非常多,而另一些游戏的奖励则比较有限。
So some games give a ton of reward, and then others have more limited reward.
你如何确保一个在奖励丰厚的游戏里表现优异、但在奖励有限的游戏里表现糟糕的智能体,能够得到平衡?
How do you make sure that an agent that does really well at a game that gives a ton of reward but poorly at a game that gives little reward, how do you balance that out?
是的。
Yeah.
这些游戏的奖励尺度完全不同。
The reward scales for these games are completely different.
而且由于这些任务是程序化生成的,没有人工查看过它们。
And because they're procedurally generated, no human has looked at them.
你无法判断100分算不算好成绩。
You can't say whether a score of 100 is good.
10分算好吗?
Is a score of 10 good?
一个人会怎么做?
What would a person do on it?
我们不可能安排足够多的人为这些关卡设定人类分数。
We can't possibly put enough people to set these human scores on these levels.
因此,我们通过不断尝试,用当前最好的智能体表现来逐步摸索出这些关卡的最佳分数,并以此为基准进行归一化。
So instead we iteratively feel out what's the best score on these levels by just taking the best agent we have at the moment and normalizing by the score of that agent.
这只能给我们一个粗略的估计。
And that just gives us a rough estimate.
好吧,也许100分就是我们当前智能体能达到的分数。
Okay, maybe the score of 100 is what our current agent can do.
所以我们把它当作基准,然后看看能否超越它。
So we'll call that the benchmark and then see if we can push above that.
因此,在训练这些代理的代际时,我们会根据当前最优的代理不断重新归一化。
And so we iteratively renormalize by whatever agent is best as we train these generations of agents.
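作为示意(同样是假设性的草图,并非论文代码),下面展示按"当前最佳智能体"对每个任务的分数做迭代归一化的基本机制;task_a、task_b 等名称均为虚构。As an illustration (again a hypothetical sketch, not the paper's code), here is the basic mechanics of iteratively normalizing per-task scores by the best agent seen so far; names like task_a and task_b are made up.

```python
def renormalize(task_scores, best_scores):
    """Normalize raw per-task scores by the best score achieved so far
    on each task, standing in for an unknown human baseline. A value of
    1.0 means 'matches the current best agent on this task'."""
    norm = {}
    for task, score in task_scores.items():
        best = best_scores.get(task, 0.0)
        norm[task] = score / best if best > 0 else 0.0
    return norm

def update_best(best_scores, task_scores):
    """Fold a new generation's scores into the running per-task best,
    so the benchmark itself keeps moving up between generations."""
    for task, score in task_scores.items():
        best_scores[task] = max(best_scores.get(task, 0.0), score)
    return best_scores

best = {}
best = update_best(best, {"task_a": 100.0, "task_b": 4.0})  # generation 1
gen2 = {"task_a": 120.0, "task_b": 3.0}                     # generation 2
normalized = renormalize(gen2, best)
best = update_best(best, gen2)  # the benchmark itself moves up
```

这正体现了访谈中"自我成长的阶梯":基准本身随每一代最佳智能体而上移。This captures the "self-growing ladder" idea: the benchmark moves with each generation's best agent.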
很有趣。
Interesting.
是的,整个架构中有很多不同的自我成长阶梯。
Yeah, a lot of different self growing ladders in this whole architecture.
开放式学习过程的一个特征就是,没有一个固定的数值是你单纯想要提升的。
That's one of the signatures of an open-ended learning process: there's no fixed number that you're just trying to push up.
也没有一个已知的上限或下限,是你想要推高或压低的目标。
And there's no known top end of that number, or known bottom end, that you're trying to push towards.
而是要拥有一个开放式的学习过程。
Instead, you're seeking to have an open-ended learning process.
我们不知道终点在哪里,对吧?
We don't know where the end is, right?
我们只是会逐步向那里推进。
And we're just going to iteratively move there.
我们可以尝试衡量自己是否正朝着这个方向前进。
And we can maybe measure that we're going in that direction.
嗯,我记得在GPT-3和之前的缩放定律论文中,曾讨论过缩放定律预计会失效的点,至少对于OpenAI当时所使用的特定Transformer模型而言。
Well, do you think, because I remember in the scaling laws paper, the one that came out before GPT-3, there was talk about the place where the scaling laws were expected to break down, at least for the particular transformer model that OpenAI was working with.
你对这个过程的开放性有什么看法?
Do you have a sense of how open ended this process is?
有没有理由认为,随着环境变得越来越复杂,而模型架构基本保持不变,这个过程最终必然导向通用人工智能?
Is there any reason to think that this does not eventually just lead to AGI as these environments are just made more complex with roughly the same model architecture?
我非常希望这是真的。
I'd love to believe that.
但我确信当前的算法中存在许多缺陷。
But I'm sure there are many flaws in the current algorithms.
而且我肯定,目前的一个限制因素将是网络的规模,也就是模型架构本身。
And for sure, I think a limiting factor at the moment will be the size of the networks, the model architectures themselves.
另一方面,还有环境的表达能力,即这个环境中能表现出多少种不同的行为。
Then you have the expressiveness of the environment as well on the other side, like how much different behaviour can be expressed in this environment.
目前,这两方面都会带来限制。
There'll be limitations from those two things at the moment.
但在算法层面,也许这些开放式的学习算法会有一些变体。
But maybe on an algorithmic level, some flavor of these open ended learning algorithms.
毫无疑问,我认为这可能是通向AGI的一种可扩展方式。
For sure, I could see this as being a scalable way to move towards AGI.
你认为这正在教给我们一些东西吗?我总是对这一点很好奇。
And do you think that this is teaching us something? I'm always curious about this.
我的背景是物理学。
My background is in physics.
在物理学中,你总是思考那些超越我们自身认知的基本原理。
And so in physics, you always think about fundamental things that transcend what we think of ourselves.
当我观察机器学习时,有几个领域似乎与此有所关联。
And when I look at machine learning, there are a couple of areas that seem to flirt with this.
比如注意力机制,它似乎是一种普遍存在的机制,能够应用于众多不同领域。
So there's attention that seems to be one of these things that's universal, or that seems to apply to a whole bunch of different domains.
在强化学习中,我们现在正看到一种泛化能力,以及学习实验性行为的倾向。
In reinforcement learning, we're seeing now kind of this ability to generalize and the tendency to learn experimentation behavior.
还有没有其他类似的现象?或者首先,你是否同意这些现象在某种程度上可能是根本性的?
Are there other things like that? Or, first off, would you agree that those things may somehow be fundamental?
当然,我们无法确定。
Obviously, we can't know.
但我们的宇宙是否有什么特性,使得这种行为在超越我们自身行为的层面上被偏向?
But is there something about our universe that just like biases towards this behavior at a level that goes beyond just what we happen to do?
是的,这是一个非常深刻的问题。
Yeah, it's a really deep question.
要这么说实在太难了。
And it's so tough to say this.
一定存在某种基本的物理过程,比如熵总是增加,这导致了物理实体之间的自然竞争。
There must be some sort of fundamental physical processes, like the fact that entropy is always increasing, that leads to natural competition of states between physical stuff.
因此,随着时间推移,我们所看到的都是那些能够抵御宇宙熵增冲击而存活下来的事物。
And so what we end up seeing through time are the things that survive the bombardment of entropy from the universe.
于是,我们就看到了像智能生物和地球上生命世界的真正演化这样的现象。
And, you know, eventually we get things like the actual evolution of intelligent creatures and the living worlds on this planet.
这个想法在我这边还很不成熟,但我觉得可能有一个我尚未意识到的结果,直接印证了这一点:环境似乎在某种意义上暗示了一种损失函数。
This is a very ill-formed thought on my end, and there may even be a result I'm not aware of that speaks to this directly, but it feels as if environments almost imply a loss function in a way.
换句话说,任何环境中能够稳定存续的系统,必然在某种程度上被优化过。
Like, the systems that persist stably in any environment must be optimized with respect to something.
环境与损失函数之间的这种联系,似乎是一种根本性的关联,可能与通用人工智能有关。
And that connection between environment and loss function seems like something fundamental that might speak to AGI.
但同样,这只是一个半成型的、推测性的想法。
But again, this is like half baked and speculative.
是的。
Yeah.
不。
No.
我认为,如果你说:我取一个时间点,比如我们现在对话的这一刻,我把它冻结,然后问:是什么样的损失函数,让宇宙在某种底层动态下发展到如今的状态?——那么,我认为这是对的。
I think that's true if you say: I'm going to take a point in time, like right now as we're speaking, freeze it, and ask, what was the loss function that got the universe to this point, given some underlying dynamics?
会存在一个损失函数,如果你用某种优化器去优化它,就能达到我们现在这个状态。
And there would be a loss function that if you optimized it with some optimizer would get you here.
但一旦我们向前推进一个时间增量,这个损失函数就会不同,而如果我们当时冻结了那个时刻,损失函数也会不同。
But as soon as we go an increment of time forward, that loss function would be different, and if we froze there, it would be different again.
我不确定是否存在一个固定的损失函数,只要优化它就能得到这个开放式的涌现宇宙。
I'm not certain that there is a fixed loss function that if you optimise you would get to this open ended emergent universe.
至于一个模拟的AGI,我不认为存在一个单一的损失函数,对吧?
And an AGI simulated equivalent, I'm not convinced there's a single loss function, right?
当我们说单一损失函数时,它有点像是一个固定的数据集,能让我们达到那个状态。
When we say single loss function, it's sort of like fixed data that would get us there.
我想,如果我们知道了它,就能构建出AGI,或者至少能更容易地向AGI靠拢。
I guess if we knew it, we could build an AGI or at least we could scale towards an AGI much more easily.
也许我会收回我之前的话,发现这其实很明显。
Maybe I'll eat my words and it's like, oh, it was obvious.
它就是这么简单。
It was just this.
那是对数损失之类的东西。
It was log loss or something.
是的,没错。
Yeah, right.
归根结底,总是对数损失。
It's always log loss at the end of the day.
总是对数损失。
It's always log loss.
观察这些智能体的行为,当我们讨论泛化能力时,博弈论似乎也是其中之一。我说不准。
Looking at these agents and some of their behavior, and since we're talking about things that generalize, game theory, again, seems to be one of those things. I don't know.
我可能要收回刚才的话了。
I might eat my words here.
但博弈论似乎也是一种具有普遍性的理论。
But game theory seems like something that generalizes as well.
我想知道,这些智能体行为中,有没有哪些让你觉得出乎意料地体现了对博弈论的理解?
I'm curious which of these agent behaviors, if you saw any, reflected an understanding of game theory in sort of surprising ways.
你提到了囚徒困境之类的概念。
You referred to the, you know, the prisoner's dilemma, things like that.
这些代理是否表现出对这些原则特别清晰的理解?
Were there ways that these agents seemed to display particularly clear consciousness of some of those principles?
我认为谈不上清晰的理解。
I would say not clear consciousness.
是的,用词不当。
Yeah, bad word.
我会把代理的意图和我们观测到的代理行为区分开来。
I would sort of dissociate intents of the agent from what we measured the agents doing, so to speak.
我认为在某些情境下,代理确实表现出了一些符合博弈论模型的合理行为。
I think we did see in some scenarios that agents were sort of doing sensible stuff under the model of game theory.
博弈论会为这些代理设定一种情境,然后你假设它们完全理性,知道规则,并且完全理性地行动,接着你观察它们会怎么做。
So game theory would give a situation to these agents, and then you would assume that they're completely rational and they know it, that they act completely rationally, and you would see what they would do.
这些代理并不完全理性,因此它们的行为与博弈论的解决方案相去甚远。
These agents don't act completely rationally, so they're sort of far from the game-theoretic solutions.
但也许他们在理性尺度上还是有一点表现的。
But maybe they do a little bit on the rational scale.
我认为这仅仅是训练系统所涌现出来的特性。
I would say this is just an emergent property of the training system.
至少从我的观察来看,我通过这个项目已经看了数千个这样的视频,我实在说不出它们有多少意图,比如有意识地知道:哦,因为博弈论,我应该这么做。
And at least from my observation, and I've watched thousands of these videos now through this project, I can't really say there's much intent, like they consciously know, oh, this is what I should be doing because of game theory.
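For reference, the "game-theoretic solution" being contrasted with the agents' behavior can be made concrete with a one-shot prisoner's dilemma: two perfectly rational players both defect, even though mutual cooperation pays more. A minimal sketch, with the standard illustrative payoff numbers (not taken from the paper):

```python
import itertools

# One-shot prisoner's dilemma. Payoffs are (row player, column player);
# C = cooperate, D = defect. Numbers are the usual textbook illustration.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def best_response(opponent_action):
    """A fully rational player's best reply to a fixed opponent action."""
    return max("CD", key=lambda a: PAYOFFS[(a, opponent_action)][0])

def nash_equilibria():
    """Action pairs from which neither player gains by deviating unilaterally."""
    return [
        (a, b)
        for a, b in itertools.product("CD", repeat=2)
        if best_response(b) == a and best_response(a) == b
    ]

print(nash_equilibria())  # [('D', 'D')]: defection, though (C, C) pays more
```

The agents in the paper are nowhere near computing best responses like this; the point is only that this is the fixed rational benchmark their emergent behavior can be compared against.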
我应该注意自己对人工智能系统的拟人化倾向。
I should watch my anthropomorphizing of AI systems here.
但有一件事让我想到,也许这种说法不准确,但感觉这种表现出的与博弈论原则的兼容性,和GPT-3在算术上的表现之间可能存在某种类比。
But one thing it does make me think of, and again, maybe this is wrong, is that there might be an analogy between this, let's say, demonstrated compatibility with principles of game theory.
当你看GPT-3在算术上的表现时,情况就是这样。
And when you look at GPT-3, what it does with arithmetic.
很多人讨论说,GPT-3在处理超过x位数的加法时会出错。
So a lot of people were talking about, you know, oh, GPT-3 fails at adding numbers that have more than x many digits.
这被当作一种迹象,表明模型并未掌握逻辑原则。
And this was taken as an indication that the model had not learned principles of logic.
当时有一种观点认为,它无法泛化,是因为它无法理解背后的逻辑。
There was sort of this idea that, oh, it fails at generalization because it can't learn the logic behind this.
对此常见的反驳是:你有没有见过人类去加很大的数字?
The common rebuttal there was like, well, have you ever seen humans try to add large numbers?
这其实有点类似。
It's sort of similar.
这让我想到了这里的情况。
This is kind of what it makes me think of here.
你看到的似乎是一种逻辑原则,也许这同样可以作为一个很好的预警指标。
You know, you've got what seems to be a principle of logic, and maybe it would be a good bellwether in the same way.
这样理解是否合理?
Would that make sense as a
是的。
Yeah.
我认为这完全正确。
And I think this is exactly right.
特别是在人工智能领域,我们习惯于试图从这些系统中获得完美的逻辑性回应。
Especially in AI, we're so used to trying to get these perfect logical responses out of the problems that we give these systems.
但如果我们观察动物世界(包括人类)的所有发展研究和行为研究,就会发现人们并不会表现出逻辑性回应。
But if we look at all the developmental studies and behavioral studies in the animal world, including on humans, we don't see logical responses.
我们也不会看到理性的回应。
We don't see rational responses.
我们看到的是所谓的启发式行为——这些行为并非最优,也不理性,但通常有效,而且计算和执行的复杂度极低。
We see what we call heuristic behaviour, behaviours which aren't optimal, they aren't rational, but they kind of generally work and they're very, very low complexity to compute and execute.
启发式行为的一个绝佳例子是接球。
A great example of heuristic behaviour is trying to catch a ball.
你不会盯着球,估算它的速度、加速度和风的动力学,然后在脑海中求解微分方程,算出球的落点,再以最优方式移动到那个位置去接球。
You don't look at the ball, estimate its velocity and acceleration and the wind dynamics, then run a differential equation in your head and solve it, work out the point where it lands on the ground, and then move optimally towards that point to catch the ball.
你根本不会这么做。
You don't do that.
有一种广为人知的视线启发法:你盯着球,然后不断调整自己的位置,使视线与球之间的夹角保持恒定。
There's this well known gaze heuristic where you look at the ball and you move iteratively to keep the angle of your gaze to the ball constant.
如果你这样移动,最终就能拦截到球,哪怕不是球,而是飞盘、滑翔机之类的,这也是一种非常通用的捕捉方法,而且计算和执行的复杂度极低。
And if you move like this, you'll end up intercepting the ball or even if it's not a ball, a Frisbee, a glider, whatever it is, it's a very general solution to catching something, which is very, very low complexity.
这完全不是最优的。
It's not optimal at all.
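The constant-angle idea can be sketched in a toy 2D simulation. Everything below (the trajectory, the speed cap, the tracking rule) is an illustrative assumption for this sketch, not a model from the episode or the paper:

```python
import math

def gaze_heuristic_catch(ball_pos, ball_vel, catcher_x, speed=9.0,
                         g=9.81, dt=0.01):
    """Toy 2D catcher that never solves the projectile equations.
    Each step it moves (speed-limited) along the ground toward the spot
    that would restore the initial gaze angle to the ball."""
    bx, by = ball_pos
    vx, vy = ball_vel
    tan_gaze = by / (bx - catcher_x)        # tangent of the initial gaze angle
    while by > 0:
        bx, by = bx + vx * dt, by + vy * dt  # the ball just follows gravity
        vy -= g * dt
        desired_x = bx - by / tan_gaze       # where the gaze angle matches again
        step = desired_x - catcher_x
        catcher_x += max(-speed * dt, min(speed * dt, step))
    return abs(catcher_x - bx)               # distance to the ball at touchdown

# A lobbed ball drifting toward a catcher standing 10 m away:
miss = gaze_heuristic_catch(ball_pos=(0.0, 10.0), ball_vel=(4.0, 0.0),
                            catcher_x=10.0)
print(f"miss distance: {miss:.2f} m")
```

No velocity estimation, no differential equation, just one angle held fixed each step; the empirically studied variants of the gaze heuristic refine this rule, but the flavor is the same.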
这让你对符号学习有了不同的看法吗?
And does this make you think about symbolic learning differently?
我的意思是,显然存在一种争论或讨论,关于你刚才描述的这种特性究竟是优点还是缺点。
I mean, there's obviously the argument, or the debate, as to whether this feature that you just described is a feature or a bug.
你可以从两种角度来看:一方面,这是不完美的逻辑,因此这些系统受限于无法进行这种基本推理。
And you can choose to look at it either way as like, well, this is imperfect logic, so these systems are hampered by their inability to do this kind of fundamental reasoning.
或者你也可以认为,采用这种方式推理的系统,如今确实能够构建出人工智能。
Or you could look at it as like, well, systems that reason this way are able to build AI systems today.
我的意思是,人类就是这样做的。
I mean, that's what humans do.
因此,我们应该更多地投入这个方向。
And therefore we should just invest more in that direction.
这是否影响了你对符号逻辑是否具有前景的直觉?
Does this inform your intuition as to whether symbolic logic might be promising?
是的。
Yeah.
我非常认为,能够产生这些启发式行为是一种优势。
I very much see it as a feature to get these heuristic behaviors out.
当模型正确、公式正确且符号定义清晰时,符号逻辑是极其出色的,它能够完美地外推和泛化。
And symbolic logic is fantastic when the model is correct, when the formula is correct and the symbols are well defined, it'll extrapolate, it'll generalise perfectly.
但一旦出错,它就会彻底失败,因为符号缺乏灵活性。
But when it's wrong, it'll catastrophically fail because there's no flexibility with the symbols.
另一方面,你有这些从统计平均和统计方法中构建起来的启发式解决方案。
On the other hand, you have these heuristic solutions which are built up from just statistical averaging and these statistical methods.
它们无法外推。
They don't extrapolate.
表面上看,它们的外推能力非常有限,但它们可能具有非常广泛的适用性,而且关键的是,当情况与训练时略有不同时,它们可能依然稳健,不会失败。
On the face of it, they don't extrapolate really, really far, but they could have a very wide support. And crucially, they might be very robust and not fail if things are slightly different from how they were formed.
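A toy contrast between those two failure modes, with made-up numbers: a symbolic model that is exact inside its assumptions and catastrophically wrong outside them, versus a local statistical heuristic that never extrapolates far but degrades gracefully:

```python
import statistics

# The world follows y = 2x inside the observed regime, then saturates.
# The symbolic model "y = a * x" is exactly right locally, wrong globally.
def world(x):
    return 2.0 * x if x < 8 else 16.0   # unmodeled saturation

train = [(x, world(x)) for x in range(8)]

# Symbolic route: assume the law holds everywhere and solve for its symbol a.
a = statistics.mean(y / x for x, y in train if x > 0)

def symbolic(x):
    return a * x                        # extrapolates boldly, right or wrong

def heuristic(x, k=3):
    """No law at all: average the k nearest remembered experiences."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.mean(y for _, y in nearest)

# Far outside the training regime (x = 20, true y = 16.0):
print(symbolic(20), heuristic(20))   # 40.0 12.0, catastrophic vs merely off
```

Inside the regime, of course, the symbolic model is exact and the heuristic is merely close; the trade-off flips only when the model's assumptions break.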
我内心的物理学家想在这里引用费曼的话。
The physicist in me wants to go to Feynman here.
但有一句理查德·费曼的话,我认为可以说,没人真正理解量子力学。
But there's this quote where Richard Feynman says, I think it's safe to say that nobody understands quantum mechanics.
这意味着,量子力学有一些反直觉的地方。
And the implication is that, you know, there's something counterintuitive about quantum mechanics.
但这也暗示了,我们以另一种方式理解其他事物。
But the implication is also that we understand other stuff in a different way.
比如,我们给自己讲一个故事,说我们理解什么是球。
So for example, like, we tell ourselves a story that we understand what a ball is.
我们对这个故事应用了一种类似符号逻辑的过滤器。
And we kind of apply a sort of symbolic logic like filter to the story that we tell ourselves about it.
仿佛存在一个叫作‘球’的实体,我们自以为完全了解球的一切,诸如此类,而不是你刚才描述的——作为婴儿时,我们反复滚玩球体。
It's as if there's, like, this entity called a ball, and then we know everything about the ball, blah blah blah, rather than what you just described, which is just, like, as babies, we rolled balls around a lot.
我们逐渐形成了一系列启发式方法,这些方法累积起来,制造出一种我们真正理解它的错觉。
We developed, you know, a series of heuristics, and those heuristics added up to the illusion, the impression that we understand it.
通过这个视角,你可以说我们完全可以理解量子力学。
And through that lens, you could say that we absolutely could understand quantum mechanics.
这种启发式过程如果一致地应用,确实可以延伸到完全陌生的领域。
The heuristic process does generalize to kind of completely alien lands if we apply it consistently.
关键是,什么是先出现的呢?
And, you know, the key thing is, like, what came first?
是‘球’这个词,还是对球如何运作的理解?
Like, the word of a ball or the understanding of how a ball works.
是通过符号、语言和力学,还是知道如何滚动一个球,或者知道球会滚向你?
Through symbols and language and mechanics, or knowing how to roll a ball or that a ball will roll down the hill to you.
人类历史上,对球在环境中如何行为、如何伸手去捡起并滚动它的认知,远远早于理解球是一个物体以及它所遵循的力与定律。
The knowledge of how a ball just behaves in the environment and how you should act to pick up that ball and roll that ball comes far, far before in human history than understanding that the ball is an object and these are the forces and laws of the ball.
几乎所有人类的发明都是如此。
It's the same thing with almost all human inventions.
发明往往远远早于我们对它机械原理的理解,甚至早于我们能用语言解释它为何有效。
It's like the invention comes far before the actual understanding of mechanically how it works or how you even verbalise why it works.
这本身就很有趣,因为它对人工通用智能和通用学习有着深远的影响。
That's interesting in and of itself for its implications for, I guess, AGI and generalized learning.
显然,我们是宇宙中唯一展现出通用化能力胜利的例子。
Obviously, we're the one shining example of the triumph of generalization ability in the universe.
但就我所知,我们自身是至少两种不同学习过程的产物。
But we ourselves are the product of, as far as I can tell, at least two different learning processes.
我们有进化,它将我们带到出生的那一刻。
We have evolution, that gets us to the moment that we're born.
从那以后,另一个独立的优化过程发生,我们的新皮层形成,并以有趣的方式与边缘系统结合。
And from there, a separate process of optimization happens, where our neocortex gets formed and starts to couple with our limbic system in interesting ways.
我的意思是,这或许可以被看作是对任何一致训练策略的悲观论点。
I mean, I guess this could be framed as a pessimistic argument for any kind of consistent strategy for training.
因为你可以争辩说,我们所知的唯一一个通用学习代理或通用能力代理的例子,来自两个截然不同的学习过程。
Because you could argue, well, the one example we have of a generalized learning agent, or a generally capable agent, comes from two distinct learning processes.
因此,我们的希望是,仅凭一个过程就能获得通用代理,但这在技术上尚未被证明,或者目前还没有
So the hope is we can get a generalized agent from just one, which is technically unproven, or there isn't
这取决于你怎么定义学习过程,对吧?
It depends what you view as learning processes, right?
我经常用随机梯度下降训练神经网络权重来作类比。
So I often use the analogy of training the weights of a neural network with stochastic gradient descent.
你实际上是在训练神经网络,这个过程就是进化。
You're actually training the neural network, that's the process of evolution.
而当你运行神经网络时,这就像是出生并开始构建表征的过程。
And then when you run the neural network, that's the process of birth and starting to build up the representations.
当你使用循环神经网络,在你逐步探索环境、经历时间时,网络的内部状态会被更新,像Transformer、LSTM这些都属于循环神经网络。
When you have recurrent neural networks, as you step through the environment and experience time, the internal state of the network gets updated. Things like transformers, LSTMs, these are all recurrent neural networks.
Transformer可能是个不太恰当的例子,因为它们并非完全循环,而是无限循环的。
Transformers are maybe a bad example because they are not perfectly recurrent, they're infinitely recurrent.
你可以把这些随时间累积的激活视为一种学习过程。
You can think of these activations being built up through time as a learning process.
甚至还有重构神经网络,它们模拟大脑的学习机制,并具有赫布动力学。
There are even Recony neural networks which model sort of brain learning and have Hebbian dynamics.
从这个意义上说,我们有两个学习过程。
In that sense, we have two learning processes.
我们有进化,也就是训练权重。
We have evolution, which is training the weights.
然后我们在游戏中运行带有循环神经网络的网络,它会逐步构建内部激活状态,随着经验的积累,这些激活状态可以被看作是权重。
And then we have running the network in the game with a recurrent neural network, which builds up the internal activations, which can be thought of as weights as you experience more.
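That two-process picture can be caricatured in a few lines of code. Here an exponential moving average stands in for the recurrent state built up within a lifetime, and hill climbing over its one frozen parameter stands in for evolution training the weights; every detail is an illustrative assumption, not the paper's setup:

```python
import random

def lifetime_error(decay, true_mean, n_steps=200, seed=0):
    """One 'life': the update rule is frozen, only the state adapts."""
    rng = random.Random(seed)
    state = 0.5                                    # innate prior at birth
    for _ in range(n_steps):
        obs = 1.0 if rng.random() < true_mean else 0.0
        state = decay * state + (1 - decay) * obs  # recurrent state update
    return abs(state - true_mean)                  # how well this life learned

def fitness(decay):
    """Average error across several worlds an agent might be born into."""
    worlds = [(0, 0.2), (1, 0.5), (2, 0.8)]
    return sum(lifetime_error(decay, m, seed=s) for s, m in worlds) / len(worlds)

def evolve(n_generations=30, seed=1):
    """The outer loop: mutate the frozen parameter, keep better learners."""
    rng = random.Random(seed)
    genome = 0.5
    for _ in range(n_generations):
        mutant = min(0.999, max(0.0, genome + rng.gauss(0.0, 0.05)))
        if fitness(mutant) < fitness(genome):
            genome = mutant
    return genome

genome = evolve()
print(f"evolved decay: {genome:.3f}")
```

Evolution never sees any individual observation; it only selects update rules whose lifetimes end up well adapted, which is exactly the division of labor in the weights-versus-activations analogy.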
有意思。
Interesting.
是的。
Yeah.
这让我对原本的想法有了更清晰的理解,那就是,你知道,你出生时,相当于拥有了一个空白架构。
That makes so much more sense than the way I was thinking about it, which was, you know, you're born, and that's equivalent to having a blank slate architecture.
当然,我们知道这实际上并不是一张白纸。
Of course, we know it's not actually a blank slate.
所以,正如你所说,这些权重需要提前被训练好。
So to your point, you need those weights to actually be trained ahead of time.
那么,你对下一类环境、下一个挑战会是什么有概念吗?
So do you have a sense then of what the next class of environments, the next challenge is going to be?
我们谈到了扩展,这确实是目标,但你是单纯让环境在某个方向上变得更丰富吗?
We talked about scaling and that being the goal, but do you just make the environment more expressive in a certain direction?
你有没有预感这个方向可能会是什么?
Do you have a hunch about what direction that might be?
是的,这其实非常棘手,因为我们不能只是造一个世界模拟器。
Yeah, it's really tricky because we can't just make a world simulator.
太复杂了。
There's too much complexity.
再多的游戏开发者也无法做到这一点。
No amount of game developers is going to be able to do that.
所以我们必须直觉地引导环境的复杂性和特性,以我们认为能带来巨大组合复杂性的方式。
So we have to sort of intuitively nudge the complexity and features of our environment in ways that we think are going to add huge amounts of combinatorial complexity.
因此,我们经常考虑在现有环境中添加新想法,比如增加一个新功能,它能与所有现有元素以非常有趣、混沌且非线性的方式互动,这些互动是无法预测的。
So we often think about adding new ideas onto this environment where we add a new feature which interacts with all the existing stuff in really interesting chaotic non linear ways, ways that you can't predict.
这些正是我们正在寻找的特性。
These are the sort of features we're looking for.
目前,我们关键在思考:这些设计如何能导致代理在短短几次迭代中调整行为,并获得良好的回报。
Crucially, at the moment we're thinking, okay, how does this lead to situations where adaptation of the agent's behaviour over the course of a few iterations would really lead to nice payoffs for the agent?
因此,我们可以尝试以一种自发的方式鼓励它学习适应性行为。
So we can try and encourage it emergently to learn adaptive behavior.
你们所看到的这些结果,尽管令人印象深刻,是否改变了你对对齐问题的看法?
Do these results that you've seen, impressive as they are, do they make you think of alignment differently?
它们是否与你对AI安全、负责任部署等问题的重要性的理解有所重叠?
Do they overlap with your sense of what's important for AI safety, for these things to be deployed responsibly and so on?
是的,我们在DeepMind有一个非常出色的AI安全团队。
Yeah, we have a great AI safety team here in DeepMind.
对我来说,其中一个主要问题是,很难在事前预知 emergent 行为会是什么样子,以及环境、训练算法和模型架构的相互作用会产生怎样的行为。
And I think one of the dominant things for me is that it's very difficult to know a priori what the emergent behaviour is going to look like, and what the coupling of environment, training algorithm, and model architecture is going to produce behaviourally.
因此,我认为在训练代理时,必须在训练循环中内嵌这类对齐机制和安全机制,让它们持续发挥作用,而不是等到训练结束、问题出现后再去检查:‘这个结果是否对齐?是否安全?’
So I think it's going to be exceptionally important to bake these sorts of alignment and safety mechanisms into the training loop as we're training agents, mechanisms that are always acting, rather than just waiting till the end of training when something pops out and then checking, oh, is this aligned, is this safe?
不,未来会有一个阶段,这样做太危险了,我们不能这么做。
No, there'll be a future where that's too dangerous, we can't do that.
我们需要把这些对齐机制、安全机制融入到训练循环中。
We need to have these alignment mechanisms, these safety mechanisms as part of the training loop.
因为我们不知道这个训练系统最终会走向何方。
Because we don't know where this training system is going to go.
我想与此相关的是,似乎有一个有趣的机会可以做一些预测性研究,比如像GPT-3和大规模模型那样,令人惊讶的是,出现了许多没人预测到、也没人知道何时会出现的能力。
And I guess related to that too, it feels like there'd be an interesting opportunity to do a little bit of forecasting research, in the sense that one of the things that surprised everybody about things like GPT-3 and scaled models was the emergence of all these capabilities that nobody predicted, or nobody knew when they would come.
我想这里或许有机会做些研究,比如尝试加入某种特定的扰动,观察会由此产生哪些新能力。
I wonder if there's an opportunity here to do something where you say, Okay, well, let's investigate the effect of adding a particular kind of perturbation and see what capabilities arise from that.
是的,确实如此。
Yeah, yeah, definitely.
我认为可以开展一些非常有趣的研究,比如改变某些环境因素或学习因素,观察这些变化如何影响涌现行为和对齐效果。
I think you could do really interesting studies like this, where you change some of the environmental factors, you change some of the learning factors, and you see how that emergent behaviour and that alignment change.
比如我们这里的一些同事,乔尔·莱博,他们实际上构建了一些设定为社会困境的环境。
Like some colleagues here, Joel Leibo and others, actually construct, for example, these environments which are set up to be social dilemmas.
比如,有一群代理,它们面前有苹果,但如果它们吃光了所有苹果,苹果就不会再长出来。
Things like, oh, there's a collection of agents and there are apples, but if they eat all of the apples, then the apples won't grow back.
因此,它们必须共同协商出一个协议,避免把苹果全部吃光。
So they have to collectively work out not to eat them all, and have a little contract between themselves.
类似这样的例子还有很多。
Many, many examples like that.
你可以进行一些非常出色的研究,比如改变苹果的密度、岛屿之间的连通性,以及人们彼此之间的可视距离。
And you can do these amazing studies where you change the density of apples, how connected the islands are, how well the agents can see each other.
它们甚至能识别出不同的个体代理,还是所有代理对它们来说看起来都一样?
Can they even recognise individual agents, or do all agents sort of look the same to them?
你可以观察到,改变这些因素会如何导致不同的涌现行为,以及不同的合作与对齐方式。
And you can see how changing these factors results in different emergent behaviors, different types of cooperation and alignment.
你也可以对人类参与者进行同样的实验。
And you can do the same experiments on human participants as well.
并映射出这些反应。
And sort of map those responses.
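The apples scenario is a classic commons dilemma, and the levers mentioned (density, regrowth, visibility) are easy to expose in a minimal sketch; all the parameters below are illustrative, not taken from the actual DeepMind environments:

```python
def run_commons(n_agents, eat_fraction, apples=100.0, regrowth=0.25,
                capacity=100.0, n_steps=50):
    """Shared orchard: apples regrow only in proportion to what is left.
    Each step, every agent eats `eat_fraction` of the current stock."""
    total_eaten = 0.0
    for _ in range(n_steps):
        for _ in range(n_agents):
            bite = eat_fraction * apples
            apples -= bite
            total_eaten += bite
        # logistic regrowth: a stripped orchard grows nothing back
        apples += regrowth * apples * (1 - apples / capacity)
    return total_eaten

greedy = run_commons(n_agents=4, eat_fraction=0.5)       # everyone grabs
restrained = run_commons(n_agents=4, eat_fraction=0.05)  # the 'contract'
print(f"greedy total: {greedy:.0f}, restrained total: {restrained:.0f}")
```

Greedy harvesting collapses the stock and the long-run yield, while restraint keeps the orchard regrowing; sweeping `regrowth` or splitting `apples` into per-island stocks gives exactly the kind of parameter studies described above.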
这很公平。
That would be fair.
我的意思是,这似乎触及了许多深刻的问题,几乎是哲学层面的,比如自由意志和能动性。
I mean, there's something that seems to speak to a lot of deep questions there, almost philosophically, like free will and agency.
这些代理仅仅通过非常简单的先验行为就能复制出类似的表现,这确实让人不禁思考。
To the extent that these agents just seem to replicate the behavior of agents with very simple priors, it sort of makes you wonder.
我最后想问你的是,这对你对实现类似AGI的时间线看法产生了什么影响?
One last thing I do want to ask you about is how has this impacted your view on timelines towards something like AGI?
在看到这样的结果后,你觉得AGI离我们更近了吗?
Does it feel closer to you now that you've seen a result like this?
还是说这与你之前的预期基本一致?
Or is this more or less consistent with your priors from before?
我认为,随着越来越多这样的结果出现,这种可能性的信心也在不断增强。
I think definitely, as more of these sort of results come out, the sort of confidence that this is possible just increases more and more.
对我而言,这个结果再次增强了我对这种可能性的信心。
And definitely for me, this result, again, gives more confidence that this is possible.
关于时间线,预测起来实在太难了。
On timelines, it's just so tricky to forecast.
是的,我对当前的一系列算法、神经网络以及将它们扩展以逐步实现我们所追求的AGI目标,持相当乐观的态度。
Yeah, I am quite optimistic about our current set of algorithms, neural networks, scaling these things to at least move us incrementally towards this AGI target that we have.
所以,朴素AI感觉像是一个相当靠谱的猜想,即某种结合了强化学习、自我对弈和神经网络的方法最终能实现它?
So prosaic AI feels like a pretty good bet, that some combination of RL, self-play, and neural nets is going to do it?
对。
Yeah.
我的意思是,关于具体方法,这很难说。
I mean, on the exact methods, it's hard to say.
从理论上讲,很难看出为什么带有合适架构偏置的神经网络无法非常接近这一目标。
Theoretically, it's hard to see why neural networks with the right architectural biases and so on couldn't really get very close.
但当然,我们可能遗漏了某些根本性的东西,这也是完全有可能的。
But yeah, maybe we're missing something really fundamental and that's completely possible.
我可能对这件事太过乐观了。
I'm probably just too optimistic about this.
这就是智能的巨大谜题。
Well, that's the great mystery of intelligence.
对吧?
Right?
我的意思是,如果这很明显,那早就被解决了。
I mean, you know, if it was obvious, then, well, it would have been done.
没错。
Exactly.
非常感谢你分享这些。
Thanks so much for sharing all this.
这是一个非常令人兴奋的话题,也是开放学习的迷人时代。
It's a really exciting topic and a fascinating time for open ended learning.
你有个人博客吗?有没有什么地方可以分享你整理的想法,让大家去看看?
Do you have a personal blog, actually, that you'd like to share, anywhere you collect your thoughts that people could check out?
没有。
No.
最好的地方可能是我的 Twitter:@maxjaderberg。
Probably the best place is my Twitter, @maxjaderberg.
完美。
Perfect.
好的。
Okay.
我们一定会在与播客配套的博客文章中包含这个链接。
We'll be sure to include a link to that as well in the blog post that'll come with the podcast.
非常感谢你,Matt 和 Max。
So thanks so much, Matt and Max.
非常感谢。
Really appreciate it.
太棒了。
Brilliant.
非常感谢。
Thanks a lot.