本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
今天,我很荣幸能与强化学习奠基人之一理查德·萨顿进行对话,他发明了该领域的多项核心技术,如时序差分学习和策略梯度方法。为此,他获得了今年的图灵奖——如果你还不了解的话,这堪称计算机界的诺贝尔奖。理查德,祝贺你。
Today, I'm chatting with Richard Sutton, who is one of the founding fathers of reinforcement learning and inventor of many of the main techniques used there, like TD learning and policy gradient methods. And for that, he received this year's Turing Award, which, if you don't know, is basically the Nobel Prize for computer science. Richard, congratulations.
谢谢,德瓦凯什。
Thank you, Dwarkesh.
同时也感谢你参与播客节目。这是我的荣幸。好的,第一个问题:我和听众们都熟悉从大语言模型(LLM)角度理解AI。从概念上讲,如果我们从强化学习(RL)视角来看AI,我们遗漏了哪些关键点?
And, thanks for coming on the podcast. It's my pleasure. Okay. So first question, my audience and I are familiar with the LLM way of thinking about AI. Conceptually, what are we missing in terms of thinking about AI from the RL perspective?
嗯,是的。我认为这确实是完全不同的视角。两者很容易割裂开来,失去对话的可能。当前大语言模型已成为如此重要的存在,生成式AI整体上都成了大趋势。
Well, yes. I think it's really quite a different point of view. It can easily get separated and lose the ability to talk to each other. And yeah, large language models have become such a big thing. Generative AI in general a big thing.
我们这个领域容易受到潮流和风尚的影响。因此我们会遗忘最基础、最本质的东西。我认为强化学习才是AI的基础核心。智能的本质问题是理解你所处的世界,而强化学习正是关于理解世界的学问。
And our field is subject to bandwagons and fashions, so we lose track of the basic, basic things. I consider reinforcement learning to be basic AI. What is intelligence? The problem is to understand your world, and reinforcement learning is about understanding your world.
而大语言模型则专注于模仿人类行为,执行人类指定的任务。它们并不涉及自主决策该做什么。
Whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do.
我猜你会认为,要模拟互联网文本语料库中数万亿的标记,就必须构建一个世界模型。事实上,这些模型似乎确实拥有非常强大的世界模型,它们是目前AI领域构建出的最优秀的世界模型,对吧?那么你认为其中还缺少什么?
I guess you would think that to emulate the trillions of tokens in the corpus of Internet text, you would have to build a world model. In fact, these models do seem to have very robust world models, and they're the best world models we've made to date in AI, right? So what do you think is missing?
我不同意你刚才说的大部分观点。仅仅模仿人们说的话,根本不是在构建一个世界模型,我不这么认为。你是在模仿那些拥有世界模型的人。但我不想以对抗的方式来探讨这个问题。
I would disagree with most of the things you just said. Great. Just to mimic what people say is not really to build a model of the world at all, I don't think. You're mimicking things that have a model of the world: people. But I don't want to approach the question in an adversarial way.
但我质疑它们拥有世界模型的观点。一个世界模型应该能让你预测将要发生的事情。它们能预测一个人会说什么,但无法预测实际会发生什么。我认为,引用艾伦·图灵的话,我们需要的是一台能从经验中学习的机器,
But I would question the idea that they have a world model. So a world model would enable you to predict what would happen. They have the ability to predict what a person would say. They don't have the ability to predict what will happen. What we want, I think, to quote Alan Turing, what we want is a machine that can learn from experience,
对。
Right.
这里的经验是指你生活中实际发生的事情。你做了某些事,看到了结果,并从中学习。是的。
Where experience is the things that actually happen in your life. You do things. You see what happens. And that's what you learn from. Yeah.
大型语言模型是从别的东西中学习的。它们学习的是‘这是一个情境,这是一个人做了什么’。隐含的建议是,你应该效仿那个人的行为。
The large language models learn from something else. They learn from here's a situation and here's what a person did. And implicitly, the suggestion is you should do what the person did.
对。我想关键点在于——我很好奇你是否不同意这一点——有些人会说,这种模仿学习给了我们一个好的先验,或者说给了这些模型一个好的先验,让它们能合理地解决问题。而随着我们进入你所说的‘经验时代’,这个先验将成为我们基于经验教导这些模型的基础,因为它让它们有机会在某些时候得到正确答案,然后在此基础上,我们可以用经验训练它们。你同意这个观点吗?
Right. I guess maybe the crux, and I'm curious if you disagree with this, is that some people will say: okay, this imitation learning has given these models a good prior, a reasonable way to approach problems. And as we move towards the era of experience, as you call it, this prior is gonna be the basis on which we teach these models from experience, because it gives them the opportunity to get answers right some of the time, and then on that basis we can train them on experience. Do you agree with that perspective?
不。我同意这是大型语言模型的观点,但我不认为这是一个好的观点。原因如下:要成为某事物的先验,必须有一个真实的东西存在。
No. I agree that it's the large language model perspective. I don't think it's a good perspective. Here's why. So to be a prior for something, there has to be a real thing.
先验知识应作为实际认知的基础。什么是实际认知?在那个大型语言框架中,实际认知并无定义。如何判断一个行动是值得采取的?你意识到持续学习的必要性及其价值。
A prior bit of knowledge should be the basis for actual knowledge. What is actual knowledge? There's no definition of actual knowledge in that large language framework. What makes an action a good action to take? You recognize the value of, and the need for, continual learning.
因此若需持续学习,持续即意味着在与世界的日常互动中学习,那么在正常互动中必然存在某种判断对错的方式。那么在大型语言模型架构中,是否存在判断该说什么的方法?你说了些什么,却得不到关于正确内容的反馈,因为根本不存在"正确说法"的定义。
So if you need to learn continually, and continually means learning during normal interaction with the world, then there must be some way during normal interaction to tell what's right. Okay. So is there any way, in the large language model setup, to tell what's the right thing to say? You will say something and you will not get feedback about what the right thing to say is, because there's no definition of what the right thing to say is.
没有目标。既然没有目标,说这个或说那个都无所谓。不存在正确说法。因此也没有基本事实。若缺乏基本事实就无法拥有先验知识,因为先验知识本应是对真相的提示或初始信念。
There's no goal. And if there's no goal, it doesn't matter whether you say one thing or another. There's no right thing to say. So there's no ground truth. You can't have prior knowledge if you don't have ground truth, because prior knowledge is supposed to be a hint, or an initial belief, about what the truth is.
但根本不存在真相。没有所谓正确说法。而在强化学习中,存在正确说法或正确做法,因为能获得奖励的行为就是正确行为。这样我们就定义了何为正确行为,因此可以获得人类提供的先验知识或认知,然后进行验证——因为我们已明确定义了真正正确的行为。更简单的情况是当你试图构建世界模型时。
But there isn't any truth. There's no right thing to say. Now, in reinforcement learning, there is a right thing to say or do, because the right thing to do is the thing that gets you reward. So we have a definition of what the right thing to do is, and so we can have prior knowledge, knowledge provided by people, about what the right thing to do is. And then we can check it, because we have a definition of what the actual right thing to do is. Now, an even simpler case is when you're trying to make a model of the world.
当你预测未来会发生什么时,你先做出预测然后观察实际发生的情况。明白吗?所以存在一个基本事实。而大语言模型中没有基本事实,因为它们并不对接下来会发生什么做出预测。如果你在对话中说些什么,语言模型无法预测对方会如何回应或回应的内容。
When you predict what will happen, you predict and then you see what happens. Okay? So there's ground truth. There's no ground truth in large language models because you don't have a prediction about what will happen next. If you say something in conversation, the language models have no prediction about what the person will say in response to that or what the response will be.
我认为它们确实可以。你完全可以问它们'你预计用户会如何回应',它们会给出预测。
I think they do. You can literally ask them what would you anticipate a user might say in response and they have a prediction.
不,它们只是会回答那个问题而已。但在实质意义上它们没有预测能力——它们不会对发生的事情感到惊讶。如果发生的事情与它们的所谓'预测'不符,它们也不会因为意外事件而改变。要学习这种能力,它们必须做出调整。
Oh, no. They will respond to that question, right. But they have no prediction in the substantive sense, in that they won't be surprised by what happens. And if something happens that isn't what you might say they predicted, they will not change because an unexpected thing has happened. And to learn that, they'd have to make an adjustment.
我认为这种能力确实存在于上下文中。观察模型进行思维链很有趣,比如当它试图解决数学问题时,它会说:'好的,我最初打算用这种方法解题',然后写出来后又表示:'等等,我意识到这是错误的概念性方法,我要换另一种方法重新开始'。
So I think a capability like this does exist in context. It's interesting to watch a model do chain of thought. Suppose it's trying to solve a math problem. It'll say, okay, I'm going to approach this problem using this approach at first, and it'll write this out and be like, oh wait, I just realized this is the wrong conceptual way to approach the problem, I'm gonna restart with another approach.
这种灵活性确实存在于上下文中,对吧?你是否有其他想法,还是认为只需要将这种能力扩展到更长的时间跨度?
And that flexibility does exist in context, right? Do you have something else in mind, or do you just think that you need to extend this capability across longer horizons?
我只是说,在任何有意义的层面上,它们都无法预测接下来会发生什么。它们不会对接下来发生的事情感到惊讶。无论发生什么,它们都不会基于已发生的情况做出任何改变。
I'm just saying that in any meaningful sense, they don't have a prediction of what will happen next. They will not be surprised by what happens next, and they will not make any changes based on what happens.
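Sutton's notion of a "substantive" prediction, one whose failure surprises the learner and changes it, can be sketched in a few lines of Python. This is purely an illustrative toy, not anything from the conversation; the step size and outcome values are made up.

```python
# A minimal sketch of surprise-driven learning: the learner commits to an
# estimate, observes what actually happens, and adjusts itself in
# proportion to its surprise (the prediction error).

def update(estimate, outcome, step_size=0.1):
    """Move the estimate toward the observed outcome by a fraction of the error."""
    surprise = outcome - estimate           # how wrong the prediction was
    return estimate + step_size * surprise  # error-driven adjustment

estimate = 0.0
for outcome in [1.0] * 50:   # the world keeps delivering 1.0
    estimate = update(estimate, outcome)
# After repeated surprises, the estimate closes in on the true outcome.
```

The key contrast with a frozen model is the last line of the loop: the learner is changed by what actually happened, not just asked to emit a plausible answer.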
这不正是下一个标记预测的本质吗?预测接下来会发生什么,然后在出现意外时更新?
Isn't that literally what next token prediction is? Prediction about what's next and then updating on a surprise?
下一个标记是他们应该说的内容,应该采取的行动。这不是世界对他们行为的回应。让我们回到他们缺乏目标的问题。对我来说,拥有目标是智能的本质。如果某物能实现目标,它就是智能的。
Next token is what they should say, what the action should be. It's not what the world will give them in response to what they do. Let's go back to their lack of goal. For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals.
我喜欢约翰·麦卡锡的定义,即智能是实现目标能力的计算部分。所以你必须要有目标。否则你只是一个行为系统,没有什么特别的,并不智能。
I like John McCarthy's definition that intelligence is the computational part of the ability to achieve goals. So you have to have goals. Otherwise, you're just a behaving system; you're not anything special, you're not intelligent.
而且你同意大型语言模型没有目标。
And you agree that large language models don't have goals.
我认为它们有一个目标。
I think they have a goal.
目标是什么?
What's the goal?
下一个标记预测。
Next token prediction.
那不是目标。它不会改变世界。标记向你涌来。如果你预测它们,你不会影响它们。
That's not a goal. It doesn't change the world. Tokens come at you. And if you predict them, you don't influence them.
哦,是的。这不是一个关于外部世界的目标。
Oh, yeah. It's not a goal about the external world.
对,这不是目标,不是实质性目标。如果一个系统只是坐在那里做预测,并自我满足于预测得不错,你不能看着它说:哦,它有目标。
Yeah, it's not a goal. It's not a substantive goal. You can't look at a system and say, oh, it has a goal, if it's just sitting there predicting and being happy with itself that it's predicting well.
我想也许我更希望你理解的核心问题是,为什么你认为在大型语言模型上进行强化学习不是一个富有成效的方向。因为我们似乎能够赋予这些模型解决复杂数学问题的目标。在许多方面,它们已经达到人类解决数学奥林匹克类问题的巅峰水平,对吧?它们在国际数学奥林匹克竞赛中获得了金牌。所以看起来这个在国际数学奥赛中获得金牌的模型确实有解决数学问题的目标。
I guess maybe the bigger question I want you to address is why you don't think doing RL on top of LLMs is a productive direction, because we seem to be able to give these models a goal of solving difficult math problems. And they're in many ways at the very peak of human-level capacity to solve Math Olympiad-type problems, right? They got gold at IMO. So it seems like the model which got gold at the International Math Olympiad does have the goal of solving math problems.
那么为什么我们不能将这一点扩展到其他领域呢?
So why can't we extend this to different domains?
嗯,数学问题不同。建立物理世界的模型并执行数学假设或操作的后果,这是截然不同的事情。比如经验世界需要学习。你必须学习后果,而数学更多只是计算性的。它更像是标准规划。
Well, the math problems are different. Making a model of the physical world and carrying out the consequences of mathematical assumptions or operations, those are very different things. Like the empirical world has to be learned. You have to learn the consequences whereas math is more just computational. It's more like standard planning.
所以在那里,它们可以有一个寻找证明的目标。并且在某种程度上,它们被赋予了那个目标,即寻找证明。
So there, they can have a goal to find the proof. And they are in some way given that goal, to find the proof.
没错。这很有趣,因为你2019年写了这篇题为《苦涩的教训》的文章,这可能是AI历史上最具影响力的文章,但人们将其作为扩大大型语言模型规模的依据,因为在他们看来,这是我们发现的唯一可扩展的方式,可以将海量计算资源投入到对世界的认知中。所以有趣的是,你的观点是大型语言模型实际上并不符合苦涩教训的理念。
Right. So it's interesting because you wrote this essay in 2019 titled The Bitter Lesson, and this is the most influential essay perhaps in the history of AI, but people have used that as a justification for scaling up LLMs because in their view, this is the one scalable way we have found to pour ungodly amounts of compute into learning about the world. And so it's interesting that your perspective is that the LLMs are actually not bitter lesson pilled.
大型语言模型是否属于'苦涩教训'的案例,这是个有趣的问题。确实,因为它们显然是利用海量计算的方式,这些能力会随着计算规模扩展直至互联网的极限。
It's an interesting question whether large language models are a case of the bitter lesson. Yeah. Because they are clearly a way of using massive computation, things that will scale with computation up to the limits of the internet.
但它们同时也是注入大量人类知识的方式。所以这是个有趣的问题,更像社会学或产业界的疑问:它们是否会触及数据的极限,最终被那些能从经验而非人类数据中获取更多数据的系统取代?某种程度上,这是"苦涩的教训"的经典案例:我们向大模型注入的人类知识越多,它们表现就越好,这感觉不错。但我个人尤其期待出现能从经验中学习的系统,它们很可能表现好得多,也更具扩展性。
But they're also a way of putting in lots of human knowledge. And so this is an interesting question, a sociological or industry question: will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people? In some ways, it's a classic case of the bitter lesson: the more human knowledge we put into the large language models, the better they can do, and so it feels good. And yet I, in particular, expect there to be systems that can learn from experience, which could well perform much, much better and be much more scalable.
若真如此,这将再次印证"苦涩的教训":依赖人类知识的系统,终将被纯粹通过经验和计算训练的系统超越。
In which case, it will be another instance of the bitter lesson: the things that used human knowledge were eventually superseded by things that just trained from experience and computation.
我认为这并非关键分歧,因为那些支持者也会认同未来绝大部分算力将用于经验学习。他们只是认为LLMs可以作为脚手架或基础起点,为未来的经验学习或在职学习提供算力注入的起点。所以我仍不理解为何这完全是个错误的起点,为何我们需要全新架构来实现持续的经验学习,而不能从LLMs开始这个过程。
I guess that doesn't seem like the crux to me because I think those people would also agree that the overwhelming amount of compute in the future will come from learning from experience. They just think that the scaffold or the basis of that, the thing you'll start with in order to pour in the compute to do this future experiential learning or on the job learning will be LLMs. And so I guess I still don't understand why this is the wrong starting point altogether, why we need a whole new architecture to begin doing experiential continual learning, and why we can't start with LLMs to do that.
嗯,在"苦涩的教训"的每个案例中,你都可以从人类知识开始,然后再做可扩展的事情。情况向来如此,从来没有任何理由说这样做注定不行。但实际上,在实践中,结果总是糟糕的。因为人们被锁定在人类知识的方法中,心理上……当然,我现在是在推测原因。
Well, in every case of the bitter lesson, you could start with human knowledge and then do the scalable things. That's always the case. There's never any reason why that has to be bad. But in fact, in practice, it has always turned out to be bad, because people get locked into the human knowledge approach and they, psychologically... now I'm speculating about why it is.
但这就是一直发生的情况。是的。他们的午餐总会被真正可扩展的方法吞噬。
But this is what has always happened. Yeah. That their lunch gets eaten by the methods that are truly scalable.
是的,给我讲讲什么是可扩展的方法。
Yeah, give me a sense of what the scalable method is.
可扩展的方法就是从经验中学习。你尝试各种事情,观察哪些有效。不需要别人告诉你。首先,你得有个目标。
The scalable method is you learn from experience. You try things. You see what works. No one has to tell you. First of all, you have a goal.
没有目标,就无所谓对错或优劣。所以大型语言模型试图在没有目标或优劣标准的情况下运作,这完全是从错误的地方起步。
So without a goal, there's no sense of right or wrong, or better or worse. So large language models are trying to get by without having a goal or a sense of better or worse. That's exactly starting in the wrong place.
也许和人类做个对比会很有趣。无论是在模仿学习还是经验学习的问题上,还是在目标设定方面,我觉得有些有趣的类比。你看,孩子们最初是通过模仿学习的,你不这么认为吗?
Maybe it's interesting to compare this to humans. So in both the case of learning from imitation versus experience and on the question of goals, I think there's some interesting analogies. So, you know, kids will initially learn from imitation. You don't think so?
不,当然不是。
No, of course not.
真的吗?是啊。我觉得孩子们就是喜欢观察人,试着说出同样的话。
Really? Yeah. I think kids just, like, watch people. They kind of try to say the same things.
这些孩子?
These kids?
我觉得这个程度……
I think the level...
那前六个月呢?
What about the first six months?
我认为他们是在模仿事物。他们试图让自己的嘴巴发出和看到母亲嘴巴一样的声音,然后他们会说出同样的词却不理解意思。随着年龄增长,他们模仿的复杂性也在提高。比如你可能在模仿部落里人们猎鹿时使用的技巧之类的。接着就进入了从经验中学习的阶段。
I think they're kind of imitating things. They're trying to make their mouth sound the way they see their mother's mouth sound, and then they'll say the same words without understanding what they mean. And as you get older, the complexity of the imitation increases. So you're imitating maybe the skills that people in your band are using to hunt down the deer or something. And then you go into the learning-from-experience regime.
我认为人类有很多模仿学习的行为。
I think there's a lot of imitation learning happening with humans.
这很令人惊讶。是啊,你的观点可以如此不同。当我看到孩子时,我看到他们只是在尝试各种动作,挥舞手臂,转动眼球。没人告诉他们该如何转动眼球或发出声音——这些根本没有模仿对象。他们可能只是想制造相同的声音。
It's surprising. Yeah, you can have such a different point of view. When I see kids, I see kids just trying things, waving their hands around and moving their eyes around. No one tells them; there's no imitation for how they move their eyes around, or even for the sounds they make. They may want to create the same sounds.
但婴儿的实际行为动作,并没有明确的目标可供参照。这些行为本身也没有现成的范例。
But the actions, the thing that the infant actually does, there's no targets for that. There are no examples for that.
我同意这不能解释婴儿的所有行为,但它确实引导着学习过程。就像大型语言模型在训练初期预测下一个词元时,它会做出猜测——这个猜测往往与实际所见不同。某种程度上说,这就像极短视界的强化学习:模型猜测'我认为下一个词元是这个',而实际出现的却是另一个,就像孩子尝试说单词时的情形。
I agree that doesn't explain everything infants do, but I think it guides the learning process. I mean, even an LLM, when it's trying to predict the next token early in training, will make a guess. It'll be different from what it actually sees. And in some sense, it's very short-horizon RL, where it's making this guess of, well, I think this token will be this, and it's actually this other thing, similar to how a kid will try to say a word.
结果说错了。
It comes out wrong.
大语言模型是从训练数据中学习,而非从经验中学习。它学习的内容在其正常生命周期中永远不会出现——训练数据里根本不会标注'在现实生活中你应该做这个动作'。
The large language models is learning from training data. It's not learning from experience. It's learning from something that will never be available during its normal life. There's never any training data that says you should do this action in normal life.
我觉得这可能更多是语义上的分歧。你把学校教育称作什么?那难道不是训练数据吗?只不过上学要晚得多。
I think this is maybe more of a semantic distinction. What do you call school? Is that not training data? You're not going to school because it's like School is much later.
好吧,我不该用"永远"这个词。不过我觉得即便对学校教育,这话也成立。正规教育毕竟是特例,你不应该把理论建立在特例之上。
Okay, I shouldn't have said never. But I don't know, I think I would even say it about school. Formal schooling is the exception; you shouldn't base your theories on the exception.
这种早期学习阶段更像是生物本能编程——起初你作用有限,而后存在的意义就是理解世界并学会与之互动。这看起来就像训练阶段。我同意之后会逐渐过渡,没有明确的'训练结束投入应用'的界限,但确实存在这个初始训练期对吧?
this learning where I think you're just sort of programming in your biology that early on, you're not that useful. And then kind of why you exist is to understand the world and, like, learn how to interact with it. And it seems kind of like a training phase. I agree that then there's, like, a sort of more gradual there's not a sharp cutoff to, like, training to deployment, but there seems to be this, like, initial training phase, right?
没有任何地方会训练你该做什么,什么都没有。你只是看到事情发生,没有人告诉你该怎么做。别太较真。
There's nothing where you have training of what you should do. There's nothing. You see things that happen; you're not told what to do. Don't be difficult.
这显而易见。
This is obvious.
我是说,你实际上是被教导该做什么的。'训练'这个词的起源就是来自人类,对吧?
I mean, you're literally taught what to do. This is where the word training comes from, is from humans, right?
所以我不认为学习真正关乎训练。我认为学习是关于学习本身,是一个主动的过程。孩子尝试各种事情并观察结果。是的。
So I don't think learning is really about training. I think learning is about learning. It's about an active process. The child tries things and sees what happens. Yeah.
当我们想到婴儿成长时,根本不会考虑训练这回事。这些实际上已被充分理解。如果你去了解心理学家如何看待学习,根本不存在模仿这回事。也许在某些极端案例中人类会这么做或看似这么做。但根本不存在所谓模仿的基础动物学习机制。
We don't think about training when we think of an infant growing up. These things are actually rather well understood. If you go and look at how psychologists think about learning, there's nothing like imitation. Maybe there are some extreme cases where humans might do that, or appear to do that. But there is no basic animal learning process called imitation.
基础动物学习机制是针对预测和试错控制的。有时候最难看清的恰恰是最明显的事,这真的很有意思。只要观察动物如何学习,看看心理学及其理论就会明白:监督式学习根本不是动物学习的方式。我们找不到理想行为的例证。
There are basic animal learning processes for prediction and for trial-and-error control. I mean, it's really interesting how sometimes the hardest things to see are the obvious ones. It's obvious if you just look at animals and how they learn, and at psychology and what our theories of them are. It's obvious that supervised learning is not part of the way animals learn. We don't have examples of desired behavior.
我们拥有的只是已发生事件的例证——某件事引发另一件事。我们拥有的是行动带来后果的例证。但不存在监督式学习的例证。监督式学习在自然界并不存在。
What we have is examples of things that happened. One thing that followed another. And we have examples of we did something and there were consequences. But there are no examples of supervised learning. And supervised learning is not something that happens in nature.
至于学校,即便情况如此,我们也该忘记这个概念,因为这只是人类特有的现象。自然界中并不普遍存在这种现象。松鼠不会上学,但它们能学会关于世界的一切。可以说,监督式学习在动物界显然是不存在的。
And school, even if that was the case, we should forget about it because that's some special thing that happens in people. It doesn't happen broadly in nature. And squirrels don't go to school. Squirrels can learn all about the world. It's absolutely obvious, I would say, that supervised learning doesn't happen in animals.
所以我采访了心理学家兼人类学家约瑟夫·亨里奇,他研究文化进化,主要探讨是什么区分了人类,以及人类如何获取知识。
So I interviewed this psychologist and anthropologist, Joseph Henrich, who has done work about cultural evolution and basically what distinguishes humans, and how do humans pick up knowledge.
为什么要试图区分人类?人类本就是动物。我们的共同点才更有趣。那些区分我们的特质,反而应该少关注些。
Why are you trying to distinguish humans? Humans are animals. What we have in common is more interesting. What distinguishes us, we should be paying less attention to.
我们试图复制智能,对吧?如果你想理解是什么让人类能登月或制造半导体,我认为我们需要弄清的正是其他动物无法做到这些的原因。所以我们想理解人类的特殊性。
We're trying to replicate intelligence, right? So if you want to understand what it is that enables humans to go to the moon or to build semiconductors, I think the thing we want to understand is whatever it is that no other animal has, since no animal can go to the moon or make semiconductors. So we want to understand what makes humans special.
你觉得显而易见的观点,我却持完全相反的看法。我认为必须理解我们作为动物的本质。如果能理解松鼠,我们几乎就能理解人类智能。语言不过是表面的薄层而已。好吧。
So I like the way you consider that obvious; I consider the opposite obvious. I think we have to understand how we are animals. And if we understood a squirrel, I think we'd be almost all the way there to understanding human intelligence. The language part is just a small veneer on the surface. Okay.
这很棒。我们正在发现彼此思维方式的巨大差异。或许吧。我们不是在争论,而是在尝试分享各自不同的思维方式。
So this is great. We're finding out the very different ways that we're thinking. Maybe. We're not arguing; we're trying to share our different ways of thinking with each other.
是的。而且我觉得争论是有益的。不过我想把这个想法说完。约瑟夫·亨里奇有个有趣的理论:观察人类为取得成功必须掌握的众多技能...
Yeah. And I think argument is useful. But I do wanna complete this thought. So Joseph Henrich has this interesting theory: if you look at a lot of the skills that humans have had to master in order to be successful...
我们讨论的不是过去一千年或一万年,而是数十万年的历史。要知道,这个世界极其复杂,比如在北极地区,你不可能仅凭逻辑推理就学会如何猎捕海豹。这涉及制作诱饵、寻找海豹、以及处理食物避免中毒等漫长而复杂的过程。这些都无法通过纯粹的逻辑推演掌握。因此,随着时间的推移,文化整体——无论你用什么比喻来形容,可能是强化学习或其他机制——逐渐摸索出了猎杀食用海豹的方法。
And we're not talking about, you know, the last thousand years or the last ten thousand years, but hundreds of thousands of years. You know, the world is really complicated, and it's not possible to reason through how to, let's say, hunt a seal if you're living in the Arctic. There's this many-step-long process of how to make the bait, how to find the seal, and then how to process the food in a way that makes sure you won't get poisoned. It's not possible to reason through all of that. And so over time, yes, there's this larger process, whatever analogy you want to use, maybe RL, maybe something else, where culture as a whole has figured out how to find and kill and eat seals.
但在他看来,当这些知识代代相传时,关键在于你必须模仿长辈才能习得这项技能——因为你无法通过思考就掌握猎杀处理海豹的全过程。你只能观察他人操作,或许做些微调改进。知识就是这样积累的,但文化传承的第一步必然是模仿。不过或许我们可以换个角度思考。
But then what is happening when this knowledge is transmitted through generations is, in his view, that you just have to imitate your elders in order to learn that skill, because you can't think your way through how to hunt and kill and process a seal. You have to just watch other people and maybe make tweaks and adjustments. And that's how knowledge accumulates. But the initial step of the cultural transmission has to be imitation. But maybe think about it a different way.
不,应该保持相同视角。但这本质上仍是在基础试错学习和预测学习之上的微小进化。这或许是我们与多数动物的区别所在,但我们首先是动物。没错。
No, think about it the same way. But still, it's a small thing on top of basic trial and error learning, prediction learning. And it's what distinguishes us perhaps from many animals. But we're an animal first. Yeah.
在拥有语言等能力之前,我们本就是动物。
And we were an animal before we had language and all those other things.
你提出的观点确实发人深省:持续学习是大多数哺乳动物都具备的能力,我想应该是所有哺乳动物。有趣的是,我们拥有所有哺乳动物都具备的能力,而现有AI系统却缺乏这种能力。反观理解数学和解决复杂数学问题的能力——取决于你如何定义数学——这是AI具备而几乎所有动物都不具备的。
I do think you make a very interesting point that continual learning is a capability that most mammals have, I guess all mammals have. So it's quite interesting that we have something that all mammals have, but our AI systems don't. Right? Whereas the ability to understand math and solve difficult math problems, depending on how you define math, is a capability our AIs have but that almost no animal has.
因此,最终哪些能力难以实现而哪些相对容易,这现象非常耐人寻味。
And so it's quite interesting what ends up being difficult and what ends up being easy.
非洲灰鹦鹉。
The African grey parrot.
为了让体验时代开启,我们需要在复杂的现实世界环境中训练人工智能。但构建有效的强化学习环境并非易事。你不能仅仅雇佣一名软件工程师,让他们编写一堆千篇一律的验证测试。现实领域是混乱的,需要资深的领域专家来确保数据、工作流程以及所有细微规则的正确性。
For the era of experience to commence, we're gonna need to train AIs in complex, real-world environments. But building effective RL environments is hard. You can't just hire a software engineer and have them write a bunch of cookie-cutter validation tests. Real-world domains are messy. You need deep subject-matter experts to get the data, the workflows, and all the subtle rules right.
当Labelbox的一位客户想要训练一个在线购物代理时,Labelbox组建了一支拥有丰富互联网商店前端工程经验的团队。例如,该团队构建了一个可以在会话过程中更新的产品目录,因为大多数购物网站的状态是不断变化的。他们还添加了Redis缓存来模拟陈旧数据,这正是真实电商网站的实际运作方式。这些可能是你最初不会想到要做的事情,但Labelbox能够预见。这些细节至关重要。
When one of Labelbox's customers wanted to train an agent to shop online, Labelbox assembled a team with a ton of experience engineering Internet storefronts. For example, the team built a product catalog that could be updated during the episode because most shopping sites have constantly changing state. They also added a Redis cache to simulate stale data since that's how real ecommerce sites actually work. These are the kinds of things that you might not have naively thought to do, but that Labelbox can anticipate. These details really matter.
细微调整往往是区分炫酷演示和能在现实世界实际运作的代理的关键。因此,无论是修正已生成的轨迹,还是构建一套全新的环境,Labelbox都能帮助你将强化学习项目转化为可运行的系统。欢迎访问labelbox.com/dwarkesh联系我们。好了,现在把话题交回给Richard。你设想的这种替代范式——
Small tweaks are often the difference between cool demos and agents that can actually operate in the real world. So whether it's correcting traces that you already produced or building an entirely new suite of environments, Labelbox can help you turn your RL projects into working systems. Reach out at labelbox.com/dwarkesh. All right, back to Richard. This alternative paradigm that you're imagining...
体验范式
The experiential paradigm
是的。
Yes.
让我们稍微阐述一下它的主张:经验,即感知、行动、奖励,如此不断循环往复,构成了智能体的生命。它主张这是智能的基础与核心。智能的本质在于处理这个流,并通过调整行动来增加流中的奖励。对吧?因此学习也是源自这个流。
Let's lay out a little bit of what it says: that experience, sensation, action, reward, happening on and on and on, makes up its life. It says that this is the foundation and the focus of intelligence. Intelligence is about taking that stream and altering the actions to increase the rewards in the stream. Right. So learning, then, is from the stream.
学习就是关于这个流的。第二部分尤其能说明问题:你所学的知识——你的知识是关于这个流的。你的知识关乎‘如果采取某个行动会发生什么’,或者‘哪些事件会跟随其他事件发生’。
And learning is about the stream. That second part is particularly telling: what you learn, your knowledge, is about the stream. Your knowledge is about, if you do some action, what will happen, or about which events will follow other events.
这是关于数据流的内容。知识的内容是对数据流的陈述。正因如此,由于它是对数据流的陈述,你可以通过将其与数据流进行比较来验证它。并且你可以持续学习它。
It's about the stream. The content of the knowledge is statements about the stream. And so because it's a statement about the stream, you can test it by comparing it to the stream. And you can learn it continually.
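The stream Sutton describes, sensation, action, reward repeating, with knowledge as testable claims about that stream, can be sketched as a toy loop. Everything here (the trivial environment, the two actions, the 0.1 step size) is an illustrative assumption, not a real RL system or anything specified in the conversation.

```python
import random

def environment(action):
    """A trivial world: action 1 tends to pay off, action 0 does not."""
    reward = 1.0 if action == 1 else 0.0
    sensation = reward  # the agent observes what just happened
    return sensation, reward

# The agent's "knowledge": claims about the stream (expected reward per action),
# which can be tested against the stream itself.
predicted_reward = {0: 0.5, 1: 0.5}

random.seed(0)
for _ in range(200):
    action = random.choice([0, 1])          # act
    sensation, reward = environment(action)  # sense, get reward
    # Test the claim against the stream and correct it by the error.
    predicted_reward[action] += 0.1 * (reward - predicted_reward[action])
```

The point of the sketch is the last line: each piece of knowledge is a statement about the stream, so the stream itself supplies the ground truth for checking and correcting it, continually.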
所以当你想象这个未来的持续学习智能体时。
So when you're imagining this future continual learning agent.
它们并非未来的产物。当然,它们始终存在。这正是强化学习范式的本质——从经验中学习。
They're not future. Of course, they exist all the time. That's what the reinforcement learning paradigm is: learning from experience.
是的。我想我可能想表达的是人类水平的通用持续学习智能体。奖励函数是什么?仅仅是预测世界吗?还是对世界产生特定影响?
Yeah. I guess maybe what I meant to say is human level general continual learning agent. What is the reward function? Is it just predicting the world? Is it then having a specific effect on it?
通用的奖励函数会是什么?
What would the general reward function be?
奖励函数是任意的。如果你在下棋,目标就是赢得棋局。如果你是松鼠,奖励可能与获取坚果有关。对动物而言,可以说奖励在于避免痛苦和获得快乐。我认为还应该包含一个与环境认知增长相关的组成部分。
The reward function is arbitrary. If you're playing chess, it's to win the game. If you're a squirrel, maybe the reward has to do with getting nuts. In general, for an animal, I would say the reward is to avoid pain and to acquire pleasure. And I think there should also be a component having to do with your increasing understanding of your environment.
这可以算作一种内在动机。
That would be sort of an intrinsic motivation.
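One way to sketch the reward Sutton gestures at, an arbitrary extrinsic reward plus an intrinsic component for growing understanding, is below. The additive form, the curiosity weight, and the use of reduced prediction error as a proxy for "increasing understanding" are all assumptions for illustration; he does not specify a formula.

```python
# A sketch (illustrative assumptions, not Sutton's specification) of a
# reward that combines a task-specific extrinsic signal with an intrinsic
# bonus for understanding gained, measured as reduced prediction error.

def total_reward(extrinsic, error_before, error_after, curiosity_weight=0.1):
    """Extrinsic reward plus a bonus proportional to how much the agent's
    model of the world improved on this step."""
    understanding_gain = error_before - error_after
    return extrinsic + curiosity_weight * understanding_gain

# Example: a step that earns 1.0 extrinsically and also shrinks the
# agent's prediction error from 0.5 to 0.2 earns a small extra bonus.
r = total_reward(1.0, 0.5, 0.2)
```

A step that makes the world model worse (error grows) would be penalized by the same term, which is one simple way an intrinsic-motivation component can be folded into an otherwise arbitrary reward function.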
我明白了。我想这个AI会被部署到很多人希望它执行各种不同任务的地方。没错,它执行人们想要的任务,同时也在执行任务的过程中学习世界。那么,你是否设想,我们能否摆脱这种有训练阶段和部署阶段的范式?更进一步,我们是否也能摆脱这种有模型和模型实例或副本执行特定任务的范式?
I see. I guess this AI would be deployed widely; lots of people would want it to be doing lots of different kinds of things. Right. So it's performing the task people want, but at the same time it's learning about the world from doing that task. And do you imagine, okay, so we get rid of this paradigm where there are training periods and then deployment periods, but do we also get rid of the paradigm where there's the model and then instances or copies of the model that are doing certain things?
你如何看待我们希望这个系统执行不同任务的事实?我们希望整合它从执行这些不同任务中获得的知识。
How do you think about the fact that we'd want this thing to be doing different things? We'd want to aggregate the knowledge that it's gaining from doing those different things.
我不喜欢你刚才使用'模型'这个词的方式。
I don't like the word model when used the way you just did.
有意思。
Interesting.
我认为更合适的词应该是'网络'。所以我想你指的是网络。也许会有多个网络。总之,知识会被学习。然后你会有副本和多个实例。
I think a better word would be the network. So I think you mean the network. Maybe there's many networks. So anyway, things would be learned. And then you'd have copies and many instances.
当然,你会希望实例之间共享知识。有很多方法可以实现这一点。不像现在这样,你不能让一个孩子成长并了解世界,然后每个新孩子都必须重复这个过程。而对于AI,对于数字智能来说,
And sure, you'd want to share knowledge across the instances, and there would be lots of possibilities for doing that, unlike today: you can't have one child grow up and learn about the world for the others; every new child has to repeat that process. Whereas with AIs, with a digital intelligence,
你可以指望只做一次,然后将其复制到下一个实例作为起点。这样能节省大量时间。我认为这实际上比试图向人类学习重要得多。
you could hope to do it once and then copy it into the next one as a starting place. So this would be a huge savings. And I think actually it would be much more important than trying to learn from people.
我同意,无论是否从大语言模型开始,你所说的这种能力都是必要的,对吧?要达到人类或动物水平的智能,就需要这种能力。假设一个人试图创业,这件事的回报周期大约是十年:十年后可能有一次让你获得十亿美元的退出。但人类有能力设定中间的辅助奖励;即使奖励极其稀疏,也能制定中间步骤,理解当前行动如何通向那个更宏伟的目标。
I agree that the kind of thing you're talking about is necessary regardless of whether you start from LLMs or not, right? If you want human- or animal-level intelligence, you're going to need this capability. Suppose a human is trying to make a startup. This is a thing whose reward comes on the order of ten years: once in ten years, you might have an exit where you get paid out a billion dollars. But humans have this ability to make intermediate auxiliary rewards, or some way, even when rewards are extremely sparse, to make intermediate steps, an understanding of how the next thing you're doing leads to this grander goal.
那么你认为AI如何实现这样的过程?
And so how do you imagine such a process might play out with AIs?
这是我们非常熟悉的领域。其基础是时间差分学习,类似下棋时的情形:长期目标是赢得比赛,但你需要从短期行为中学习,比如吃掉对手的棋子。通过价值函数预测长期结果,当吃掉对方棋子导致长期预测结果改善时,这种信念的增强会立即强化导致吃棋的行动。
So this is something we know very well. The basis of it is temporal difference learning, where the same thing happens on a less grandiose scale, like when you learn to play chess. You have the long-term goal of winning the game, and yet you want to be able to learn from shorter-term things, like taking your opponent's pieces. And you do that by having a value function which predicts the long-term outcome. Then if you take your opponent's pieces and your prediction about the long-term outcome changes, it goes up, you think you're going to win, and that increase in your belief immediately reinforces the move that led to taking the piece.
好的。我们有个十年期的长期目标——创业赚大钱。当取得进展时,我们会说'哦,我实现长期目标的可能性更大了',这就奖励了过程中的每一步。
Okay. So we have this long term ten year goal of making a startup and making a lot of money. And so when we make progress, we say, oh, I'm more likely to achieve the long term goal. And that rewards the steps along the way.
对。你还需要获取学习信息的能力。人类与这些大语言模型的一大区别在于:当你入职时,会吸收大量背景信息——从客户偏好到公司运作方式——这些让你胜任工作。像时间差分学习这样的方法,其信息带宽是否足以建立人类在实战中获取的那种海量上下文和隐性知识管道?
Right. And then you also want some ability to take in the information you're learning. I mean, one of the things that makes humans quite different from these LLMs is that when you're onboarding onto a job, you're picking up so much context and information, and that's what makes you useful at the job, right? Everything from your clients' preferences to how the company works. And is the bandwidth of information you get from a procedure like TD learning high enough to provide this huge pipe of context and tacit knowledge that you'd need to be picking up, the way humans do when they're deployed?
我
I
我认为关键点在于——虽然我不太确定——但'大世界假说'似乎非常相关。人类之所以能在工作中发挥作用,是因为他们接触的是世界的特定部分。没错,这些部分既无法被预先预测,也无法全部提前设定。世界如此浩瀚,以至于你无法...在我看来,大型语言模型的梦想是你能教会智能体一切,它将知晓万物,在其生命周期中无需在线学习任何新事物。
think the crux of this, and I'm not sure, but the big world hypothesis seems very relevant. The reason why humans become useful on their job is because they are encountering a particular part of the world. That's right. And it can't have been anticipated, and it can't all have been put in in advance. The world is so huge that you can't... The dream, as I see it, the dream of large language models, is that you can teach the agent everything, and it will know everything and won't have to learn anything online during its life.
明白吗?你举的例子都很恰当。实际上你必须...因为你能教会它很多,但他们所处的特定生活轨迹、共事的特定人群、以及他们与大众不同的偏好,都存在各种细微的特质。这正说明世界确实非常庞大,因此你必须在过程中不断学习适应。
Okay? And your examples are all good ones. Really, you have to, because there's a lot you can teach it, but there are all the little idiosyncrasies of the particular life they're leading, the particular people they're working with, and what those people like as opposed to what average people like. And so that's just saying the world is really big, and so you're going to have to learn it along the way.
是的。在我看来需要两个要素:首先是将长期目标奖励转化为更小的辅助性奖励的方法,比如对未来奖励或最终奖励的预测性奖励;其次还需要另一种机制。最初我认为需要某种方式来保留我在世界工作中获得的所有上下文信息,对吧?
Yeah. So it seems to me you need two things. One is some way of converting this long-run goal reward into smaller auxiliary rewards, these predictive rewards of the future reward, or at least of the final reward. Then you need some other way: it seems to me I need some way to hold on to all this context that I'm gaining as I'm working in the world, right?
比如我正在了解客户、公司等所有信息。所以我会说你只是在做常规学习。
I'm learning about my clients, my company, all this information. So I would say you're just doing regular learning.
对。或许利用上下文——因为在大型语言模型中,所有信息都必须进入上下文窗口。但在持续学习架构中,这些信息会直接融入权重参数。
Yeah. Maybe using context because in large language models, all that information has to go into the context window. But in a continual learning setup, it just goes into the weights.
可能吧。所以'上下文'这个词或许用得不准确,我指的是更广义的概念。
Maybe yeah. So maybe context is the wrong word to use because I mean a more general thing.
你学习的是针对所处特定环境的策略。
You learn a policy that's specific to the environment that you're finding yourself in.
是的。我想问的问题是:当人类在现实世界中活动时,哪怕只是通过Slack与客户打交道,他们每秒接收的信息量有多少比特?你需要某种方式来衡量这一点。
Yeah. So the question I'm trying to ask is: how many bits per second is a human picking up when they're out in the world, even if they're just interacting over Slack with their clients and everything? You need some way of getting at that.
所以也许我们正在探讨的问题是,似乎奖励太小,不足以支持我们需要进行的所有学习。但当然,我们还有感官体验。我们可以从所有其他信息中学习,而不仅仅是从奖励中学习。我们从所有数据中学习。
So maybe we're getting at the question that the reward seems too small a thing to drive all the learning we need to do. But of course, we have the sensations. We have all the other information we can learn from. We don't just learn from the reward. We learn from all the data.
是的。那么帮助你获取这些信息的学习过程是什么?
Yeah. So what is the learning process which helps you capture that information?
现在我想谈谈智能体的基础通用模型,它包含四个部分。首先我们需要一个策略。策略决定了在当前情况下我应该做什么。我们需要一个价值函数。价值函数是通过时序差分学习(TD学习)来学习的。
So now I want to talk about the base common model of the agent with the four parts. So we need a policy. The policy says, in the situation I'm in, what should I do? We need a value function. The value function is the thing that is learned with TD learning.
价值函数会生成一个数值。这个数值表示事情进展如何。然后你观察这个数值的升降,并据此调整你的策略。好,这是前两个部分。接下来还有感知组件,即构建你的状态表征,也就是你对当前所处位置的感觉。
And the value function produces a number. The number says, how well is it going? And then you watch if that's going up and down and use that to adjust your policy. Okay, so those two things. And then there's also the perception component, which is construction of your state representation, your sense of where you are now.
第四部分是我们真正要讨论的,也是最直观的。第四部分是世界转移模型。这就是为什么我不愿意把所有东西都简单地称为模型,因为我想讨论的是世界的模型,世界的转移模型,即你相信如果你这样做,会发生什么?你的行为会带来什么后果?也就是你对世界的物理规律的理解。
And the fourth one is what we're really getting at, most transparently anyway. The fourth one is the transition model of the world. That's why I am uncomfortable just calling everything models because I want to talk about the model of the world, the transition model of the world, your belief that if you do this, what will happen? What will be the consequences of what you do? So your physics of the world.
但这不仅仅是物理规律,还包括抽象模型,比如你从加利福尼亚到埃德蒙顿参加这个播客的旅行模型。那是一个模型,一个转移模型,它是通过学习获得的。而且它不是从奖励中学习的。它是通过你做了什么、看到了什么结果,然后构建了这个世界的模型。这个模型会从你接收到的所有感官信息中非常丰富地学习,而不仅仅是从奖励中学习。
But it's not just physics, it's also abstract models like your model of how you traveled from California up to Edmonton for this podcast. That was a model and that's a transition model and that would be learned. And it's not learned from reward. It's learned from you did things, you saw what happened, and you made that model of the world. That will be learned very richly from all the sensation that you receive, not just from the reward.
它必须包含奖励机制,但这只是整个模型中的一小部分。虽小却是整个模型的关键部分。
It has to include the reward as well, but that's a small part of the whole model. Small crucial part of the whole model.
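The four parts Sutton lists (policy, value function, perception, transition model) can be sketched as a bare skeleton. The class name, toy types, and trivial methods below are illustrative assumptions, not an implementation from the talk:

```python
# A structural sketch of the four-part "common model of the agent":
# a policy, a value function, a perception component that constructs
# the state, and a transition model of the world.

class Agent:
    def __init__(self, n_states, n_actions):
        # 1. Policy: in the situation I'm in, what should I do?
        self.policy = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
        # 2. Value function: how well is it going? (the part TD learning updates)
        self.value = [0.0] * n_states
        # 4. Transition model: if I do this, what will happen?
        self.model = {}          # (state, action) -> (next_state, reward)
        self.state = 0

    def perceive(self, observation):
        # 3. Perception: construct the state representation from the
        # observation (here the observation is already the state, a toy choice).
        self.state = observation
        return self.state

    def update_model(self, s, a, s_next, r):
        # The model is learned from all experience, not just from reward;
        # reward is a small but crucial part of what it predicts.
        self.model[(s, a)] = (s_next, r)

    def predict(self, s, a):
        # Query the transition model: predicted consequences of an action.
        return self.model.get((s, a))

agent = Agent(n_states=3, n_actions=2)
agent.perceive(0)
agent.update_model(0, 1, s_next=2, r=0.5)
```

The point of the sketch is only the separation of concerns: the model is trained from rich sensory data, while reward is one small field of what it predicts.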
是的。我朋友托比·沃德指出,如果你观察谷歌DeepMind用于学习Atari游戏的MuZero模型,这些模型最初并非通用智能本体,而是训练专用智能玩特定游戏的通用框架。也就是说,你无法用同一框架训练出既能下国际象棋又能下围棋的策略,必须针对每种游戏专门训练。他在思考这是否意味着强化学习普遍存在信息限制——你一次只能学习一件事。
Yeah. One of my friends, Toby Ward, pointed out that if you look at the MuZero models that Google DeepMind deployed to learn Atari games, these models were initially not a general intelligence itself, but a general framework for training specialized intelligences to play specific games. That is to say, you couldn't, using that framework, train one policy to play both chess and Go and some other game. You had to train each one in a specialized way. And he was wondering whether that implies that reinforcement learning generally, because of this information constraint, can only learn one thing at a time.
是因为信息密度不够高?还是MuZero的实现方式特殊?如果这是AlphaZero特有的,是否需要调整方法才能造就通用学习体?
Whether the density of information isn't that high, or whether it was just specific to the way that MuZero was done. And if it's specific to AlphaZero, what would need to be changed about that approach so that it could be a general learning agent?
这个理念是完全通用的。我经常用AI智能体就像人类这个例子来说明。某种意义上,人类只生活在一个世界里,这个世界可能包含国际象棋也可能包含Atari游戏,但这些并非不同的任务或世界。
The idea is totally general. I do use all the time as my canonical example the idea of an AI agent just like a person. And people in some sense, they have just one world they live in. And that world may involve chess and it may involve Atari games. But those are not a different task or a different world.
它们只是不同的状态
Those are different states
而
that
智能体会遇到这些状态。所以这个通用理念根本不受限制。
they encounter. And so the general idea is not limited at all.
所以也许有必要解释一下那种架构或方法中缺失了什么,而这种持续学习的通用人工智能(AGI)将具备哪些特性。
So maybe it would be useful to explain what was missing in that architecture or that approach, which this continually learning AGI would have.
他们只是搭建了框架。他们的目标并非让一个智能体横跨所有游戏。如果要讨论迁移学习,我们应该讨论状态间的迁移,而非跨游戏或跨任务的迁移。
They just set it up that way. It was not their ambition to have one agent across those games. If we want to talk about transfer, we should talk about transfer not across games or across tasks, but between states.
是的。我很好奇历史上是否出现过通过强化学习技术实现足够程度的迁移,足以构建这类系统的情况
Yeah. I guess I'm curious, historically, have we seen the level of transfer using RL techniques that would be needed to build this kind of...
好的。很好。目前我们尚未在任何领域观察到有效的迁移。良好性能的关键在于能够从一个状态很好地泛化到另一个状态。
Okay. Good. Good. We're not seeing transfer anywhere. Critical to good performance is that you can generalize well from one state to another state.
我们目前没有任何擅长实现这一目标的方法。现有的情况是研究人员尝试不同方案后,最终选定某种具有良好迁移性或泛化能力的表征方式。但我们缺乏自动化的技术手段来促进这种迁移——现代深度学习中几乎没有采用任何自动化促进迁移的技术。
We don't have any methods that are good at that. What we have is people trying different things and settling on a representation that transfers well, that generalizes well. But we have very few automated techniques to promote transfer, and none of them are used in modern deep learning.
让我复述一遍以确保理解正确。听起来您是说,当这些模型确实展现出泛化能力时,实际上是某种人工雕琢的结果
Let me paraphrase just to make sure that I understood that correctly. It sounds like you're saying that when we do have generalization in these models, that is a result of some sculpted...
是人类完成的。没错。研究人员手动实现的,因为没有其他合理解释。梯度下降法本身不会带来良好的泛化能力,它只能帮你解决问题。
Humans did it. Yeah. The researchers did it, because there's no other explanation. I mean, gradient descent will not make you generalize well. It will make you solve the problem.
它不会让你在遇到新数据时以良好的方式泛化。泛化意味着在一件事上的训练会影响你在其他事情上的表现。众所周知,深度学习在这方面表现极差。例如,若针对新事物训练,往往会灾难性地干扰你已知的所有旧知识。这正是糟糕的泛化。
It will not make you generalize in a good way when you get new data. Generalization means that training on one thing affects what you do on other things. And we know deep learning is really bad at this. For example, we know that if you train on some new thing, it will often catastrophically interfere with all the old things that you knew. So this is exactly bad generalization.
正如我所说,泛化是训练对某一状态的影响延伸到其他状态的现象。泛化本身并无好坏之分。你可能泛化得很差,也可能泛化得很好。
Generalization, as I said, is some kind of influence of training on one state on other states. And generalization is not necessarily good or bad. Just the fact that you generalize is not necessarily good or bad. You can generalize poorly. You can generalize well.
泛化总会发生。但我们需要的是能让泛化朝良性而非劣性方向发展的算法。
Generalization will always happen. But we need algorithms that will cause the generalization to be good rather than bad.
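The catastrophic-interference point can be seen even in a two-weight linear model trained by plain gradient descent. The tasks, features, and learning rate below are entirely illustrative assumptions, not from the conversation:

```python
# A toy demonstration of interference under plain gradient descent.
# Training on a new task overwrites a weight the old task relied on,
# so the old task's error comes back: generalization happening, but badly.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, y, lr=0.1):
    err = predict(w, x) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
task_a = ([1.0, 1.0], 1.0)    # old task: uses both features
task_b = ([1.0, 0.0], 0.0)    # new task: shares feature 0 with task A

for _ in range(200):          # learn task A to convergence
    w = sgd_step(w, *task_a)
err_a_before = abs(predict(w, task_a[0]) - task_a[1])

for _ in range(200):          # now train only on task B
    w = sgd_step(w, *task_b)
err_a_after = abs(predict(w, task_a[0]) - task_a[1])
# err_a_before is near 0; err_a_after is not: task B degraded task A
```

Nothing in the gradient update steers the shared weight toward a solution that preserves task A; that is the sense in which the algorithm contains nothing that causes generalization to be good.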
我并非要重启这个初始争议,只是真心好奇——可能我对术语的理解有异。比如,这些大语言模型正在扩展泛化范围:从早期系统连基础数学题都解决不了,到现在能处理数学奥赛级别的各类问题。最初它们至少能在加法题间泛化,后来发展到能运用不同数学技巧、定理和概念范畴解题,这正是奥数要求的。听起来您似乎不认为解决某类别内任意问题属于泛化?若我理解有误请指正。
I'm not trying to kickstart this initial crux again, but I'm just genuinely curious, because I think I might be using the term differently. One way to think about it is that these LLMs are increasing the scope of generalization: from earlier systems which could not really even do a basic math problem, to now being able to do anything in this class of Math Olympiad-type problems, right? So you initially start with systems that can generalize among addition problems at least, and then they can generalize among problems which require different kinds of mathematical techniques and theorems and conceptual categories, which is what the Math Olympiad requires. It sounds like you don't think of that, being able to solve any problem within that category, as an example of generalization. Or let me know if I'm misunderstanding.
大语言模型过于复杂。我们根本不清楚它们预先掌握哪些信息,只能猜测——它们的训练数据太庞杂了。这正是它们不适合科研的原因:完全不可控且充满未知。
Well, large language models are so complex. We don't really know what information they had prior. We have to guess, because they've been fed so much. This is one reason why they're not a good way to do science. It's just so uncontrolled, so unknown.
但如果你提出一个全新的
But if you come up with an entirely new
它们能正确解决许多问题。关键在于原因?或许某些问题本就不需泛化——当唯一解法就是找到那个放之四海皆准的答案时,这就不是泛化。它们只是找到了那个唯一解决方案。
They're getting a bunch of things right, and so the question is why? Well, maybe they don't need to generalize to get them right, because the only way to get some of them right is to form something which gets all of them right. So if there's only one answer and you find it, that's not called generalization. It's the only way to solve it. And so they find the only way to solve it.
泛化性指的是事情可以这样做,也可以那样做,而他们选择了好的方式。
Generalization is when it could be this way, could be that way, and they do it the good way.
我的理解是,随着编码智能体的发展,这种情况正变得越来越好。显然,工程师们在编写库时,实现同一规范可以有多种不同方式。最初这些模型令人沮丧的地方在于它们会用一种马虎的方式完成。但随着时间的推移,它们在设计架构和抽象层面的表现越来越好,越来越能让开发者满意。这似乎正是你所说的例证。
My understanding is that this is working better and better with coding agents. So for engineers, obviously, if you're trying to program a library, there are many different ways you could satisfy the same spec. An initial frustration with these models has been that they'll do it in a way that's sloppy. But over time, they're getting better and better at coming up with the design architecture and the abstractions that developers find more satisfying. And that seems an example of what you're talking about.
模型内部没有任何机制能确保良好的泛化性。梯度下降只会让它们找到已见过问题的解决方案。如果只有一种解决方法,它们就会那么做。但实际上解决方法有很多种,有些泛化性好,有些则很差。算法本身并不包含任何能促使良好泛化的机制。
Well, there's nothing in them which will cause it to generalize well. The gradient descent will cause them to find a solution to the problems they've seen. And if there's only one way to solve them, they'll do that. But there are many ways to solve it, some which generalize well, some which generalize poorly. There's nothing in them, in the algorithms, that will cause them to generalize well.
但当然,人类会参与其中。如果效果不理想,人们就会不断调整,直到找到可能具有良好泛化性的方法。
But people, of course, are involved. And if it's not working out, they fiddle with it until they find a way, perhaps until they find a way which should generalize as well.
为准备这次访谈,我想了解强化学习的完整发展史,从REINFORCE算法一直到GRPO等现代技术。我不只想要公式和算法列表,而是真正理解每个演进阶段的变革及其背后动机——每个后续方法究竟要解决什么核心问题?于是我让Gemini深度研究功能带我逐步梳理了整个发展历程。
So to prep for this interview, I wanted to understand the full history of RL, starting with reinforce up to current techniques like GRPO. And I didn't just want a list of equations and algorithms. I wanted to really understand each change in its progression and the underlying motivation. You know, what was the main problem that each successive method was actually trying to solve? So I had Gemini Deep Research walk me through this entire timeline step by step.
它解析了过去二十年的渐进式创新,说明每个步骤如何使强化学习过程更稳定、样本效率更高或扩展性更强。我要求深度研究将这些内容整合成安德烈·卡帕西风格的教程,它做到了。最棒的是它把所有课程内容融合成一份符合我期望风格的连贯文档,还汇集了所有最佳参考资料链接,让我能随时深入理解特定算法。你可以访问gemini.google.com亲自体验。
It explained the last twenty years of gradual innovation and how each step made the RL learning process more stable, more sample efficient, or more scalable. I asked Deep Research to put all of this together like an Andrej Karpathy-style tutorial, and it did that. What was cool is that it combined this whole lesson into one coherent, cohesive document in the style that I wanted. It was also great that it assembled all of the best links in the same place, so that if I wanted to understand any specific algorithm better, I could access the right explainer right there. Go to gemini.google.com to try it out yourself.
好了,回到理查德。我想从更宏观的角度请教:作为比当前几乎所有AI评论者或从业者都更早进入该领域的前辈,您最大的意外发现是什么?您觉得有多少是真正的新事物,还是人们只是在玩转旧概念?毕竟您在深度学习流行前就投身其中,您如何看待这个领域的发展轨迹以及新思想的涌现过程?
Alright. Back to Richard. I want to zoom out: you've been in the field of AI for longer than almost anybody commentating on it or working in it now, so I'm curious what the biggest surprises have been, how much new stuff you feel is coming out, or whether it feels like people are just playing with old ideas. You got into this even before deep learning was popular. So how do you see the trajectory of this field over time, and how new ideas have come about?
有什么令人惊讶的地方吗?
And what's been surprising?
好的。我稍微思考过这个问题。确实有不少令人意外的事情。首先,大型语言模型就让人惊讶。人工神经网络在语言任务上的高效表现令人意外。
Okay. So yeah, I thought a little bit about this. There are many things or a handful of things. First, the large language models are surprising. It's surprising how effective neural networks, artificial neural networks are at language tasks.
没错。这确实是个意外。之前没人预料到语言领域会这样突破。所以这很了不起。
Right. You know, that was a surprise. It wasn't expected. Language seemed different. So that's impressive.
AI领域长期存在一个争议:究竟是简单基础原理方法(如搜索和学习这类通用方法)更优,还是人类知识赋能的符号化系统更强。过去有趣的是,搜索学习这类被称为'弱方法',因为它们只运用通用原理;而注入人类知识的系统则被称为'强方法'。
There's a long-standing controversy in AI between simple basic-principle methods, general-purpose methods like search and learning, and human-knowledge-enabled systems like symbolic methods. In the old days it was interesting, because things like search and learning were called weak methods, because they just use general principles. They're not using the power that comes from imbuing a system with human knowledge. Those were called strong methods.
现在看来,弱方法已取得完胜。这是早期AI时代最大的悬疑——最终谁会胜出?如今搜索和学习已证明了自己的价值。
And so I think the weak methods have just totally won. That's the biggest question from the old days of AI. What would happen? And learning and search have just won the day. Right.
不过这个结果对我并不意外,因为我始终支持简单基础原理。即便是大型语言模型的惊人效果,虽然意外但令人欣慰。像AlphaGo,特别是AlphaZero的卓越表现也出人意料,但同样令人振奋——毕竟这再次证明简单原理的胜利。
But in a sense that was not surprising to me, because I was always voting for, hoping, rooting for the simple basic principles. So even with the large language models, it's surprising how well it worked, but it was all good and gratifying. And things like AlphaGo, it's surprising how well that was able to work, and AlphaZero in particular. But it's all very gratifying, because, again, simple basic principles are winning the day.
当公众认知因新技术应用(比如AlphaZero引发的热潮)而改变时,作为这些技术的奠基人之一,您觉得是取得了新突破,还是认为这不过是90年代已有技术的组合应用?
Whenever the public conception has changed because some new technique, or sorry, some new application was developed, for example, when AlphaZero became this viral sensation: to you, as somebody who literally came up with many of the techniques that were used, did it feel like new breakthroughs were made? Or does it feel like, oh, we've had these techniques since the '90s, and people are simply combining and applying them now?
整个AlphaGo事件其实有个前身,那就是TD西洋双陆棋。杰里·特萨罗正是运用了强化学习、时序差分学习的方法来玩双陆棋。它击败了世界顶尖玩家,效果非常出色。所以在某种意义上,AlphaGo只是这一过程的规模化升级。
So the whole AlphaGo thing has a precursor, which is TD-Gammon. Gerry Tesauro used exactly these reinforcement learning, temporal difference learning methods to play backgammon. And it beat the world's best players. It worked really well. So in some sense, AlphaGo was merely a scaling up of that process.
这其中涉及相当程度的规模扩展,同时在搜索方式上也进行了额外创新。
So there was quite a bit of scaling up and there was also an additional innovation in how the search was done.
没错。
Right.
但这很合理,并不令人意外。AlphaGo实际上并未使用TD学习,它等待观察最终结果。而AlphaZero采用了TD方法,并将其应用于所有其他棋类游戏,表现极其出色。
But it made sense. It wasn't surprising in that sense. AlphaGo actually didn't use TD learning; it waited to see the final outcomes. But AlphaZero used TD, and AlphaZero was applied to all the other games and did extremely well.
AlphaZero下国际象棋的方式始终让我印象深刻,因为我本人就是棋手。它甘愿牺牲子力来换取某种局面优势,能够长时间耐心地弃子争先。这种策略如此有效既令人惊讶,又符合我的世界观,让我深感欣慰。这引导我走到了今天的位置——某种意义上成为与主流观点相左的异见者。
I've always been very impressed by the way AlphaZero plays chess because I'm a chess player and it just sacrifices material for sort of positional advantages. And it's just content and patient to sacrifice that material for a long period of time. And so that was surprising that it worked so well but also gratifying and fitting into my worldview. So this has led me where I am. Where I am is in some sense a contrarian or somewhat thinking differently from the field is.
我个人安于长期与学术领域不同步的状态,可能持续数十年,因为过去我偶尔也被证明是对的。为了缓解这种思维脱节感,我的方法是跳出当下环境,回溯历史长河,考察不同领域对心智的经典认知。这样我就不觉得自己背离传统,反而自视为古典主义者。我追随的是历代思想家关于心智的主流共识。
And I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past. And the other thing I do, to help me not feel I'm out of sync and thinking in a strange way, is to look not at my local environment or my local field, but to look back in time into history, and to see what people have thought classically about the mind in many different fields. And I don't feel I'm out of sync with the larger traditions. I really view myself as a classicist rather than as a contrarian. I go to what the larger community of thinkers about the mind have always thought.
好的。如果你不介意,我想问几个有点另类的问题。我对《苦涩的教训》的理解是:它并非全盘否定人工研究调参的有效性,而是指出其扩展性远不及指数增长的算力。因此我们需要能驾驭后者的技术。等到实现AGI时,研究人员就能与算力线性同步扩展了,对吧?
Okay. Some sort of left-field questions for you, if you'll tolerate them. The way I read the bitter lesson is that it's not necessarily saying that human artisanal researcher tuning doesn't work, but that it obviously scales much worse than compute, which is growing exponentially. And so you want techniques which leverage the latter. And once we have AGI, we'll have researchers that scale linearly with compute, right?
因此我们将迎来数百万AI研究者的涌现,他们的数量会像算力一样快速增长。也许这意味着让他们继续从事传统AI研究、采用这些手工定制化的解决方案是合理的。我想知道,这种关于AGI实现后AI研究如何发展的愿景,是否仍与苦涩的教训相兼容。
So we'll have this avalanche of millions of AI researchers, and their stock will be growing as fast as compute. And so maybe this will mean that it is rational, that it will make sense, to have them doing good old-fashioned AI, doing these artisanal solutions. As a vision of how AI research will evolve after AGI, I wonder if that's still compatible with the bitter lesson.
那么,我们是如何实现这个AGI的?你预设它已经被完成了。
Well, how did we get to this AGI? You want to presume that it's been done.
我猜最初是从通用数学和方法开始的,但现在我们有了AGI。现在我们想要更进一步...然后我们就完成了。真有趣。你不认为AGI之上还存在更高层次吗?
I suppose it started with general math and methods, but now we've got the AGI. And now we want to go... Then we're done. We're done. Interesting. You don't think that there's anything above AGI?
但你现在又用它来再次获取AGI。
Well, but you're using it to get AGI again.
我是用它来获得不同任务上的超人类智力或能力水平。
Well, I'm using it to get superhuman levels of intelligence or competence at different tasks.
所以这些AGI如果本身不具备超人类能力,那么它们可能传授的知识也不会是超人类的。
So these AGIs, if they're not superhuman already, then the knowledge that they might impart would be not superhuman.
我想这其中存在不同程度的分级。
I guess there are different gradations of this.
我不确定你这个想法是否合理,因为它似乎预设了通用人工智能(AGI)的存在,并且我们已经解决了这个问题。
I'm not sure this idea of yours makes sense, because it seems to presume the existence of AGI, and that we've already worked that out.
或许可以这样理解:AlphaGo已经超越人类,它能击败任何围棋选手。而AlphaZero每次都能战胜AlphaGo。这说明存在方法可以创造出比‘超人类’更强大的存在,而且它采用了不同的架构。
So maybe one way to motivate this: AlphaGo is superhuman; it beat any Go player. AlphaZero would beat AlphaGo every single time. So there are ways to get more superhuman than even superhuman. And it was a different architecture.
因此在我看来,那个能够跨领域通用学习的智能体,完全有可能通过改进架构来提升学习能力——就像AlphaZero是对AlphaGo的升级,MuZero又是对AlphaZero的升级一样。
And so it seems possible to me that, for the agent that's able to learn generally across all domains, there would be ways to give it a better architecture for learning, just the same way that AlphaZero was an improvement on AlphaGo and MuZero was an improvement on AlphaZero.
但AlphaZero的改进之处在于它摒弃了人类知识,纯粹从经验中学习。既然如此,为什么你认为需要引入其他智能体的经验来教导它?明明不依靠其他智能体协助反而效果更好。
And the way AlphaZero was an improvement was that it did not use the human knowledge but just went from experience. Right. So why do you say bring in other agents' expertise to teach it, when it's worked so well from experience and not by help from another agent?
我同意那个特定案例是转向了更通用的方法。但我想用这个例子说明:从超人类到超人类++,再到超人类++++是可能的。我好奇的是——你认为这种梯度提升会继续通过简化方法实现,还是当我们拥有数百万个能按需增加复杂性的智能心智时?即便有数十亿甚至数万亿AI研究者,这条路是否依然走不通?
I agree that in that particular case it was moving to more general methods. But I meant to use that example to illustrate that it's possible to go from superhuman to superhuman-plus-plus to superhuman-plus-plus-plus-plus. And I'm curious whether you think those gradations will continue to happen by making the methods simpler, or because we'll have the capability of these millions of minds who can then add complexity as needed. Or will that continue to be a false path, even when you have billions or trillions of AI researchers?
更有趣的是思考这种情况:当存在大量AI时,它们是否会像人类文化演进那样互相协助?或许我们该讨论这个。
I think it's more interesting to just think about that case: when you have many AIs, will they help each other the way cultural evolution works in people? Maybe we should talk about that.
当然,没问题。
Yeah, for sure.
苦涩的教训,哦,谁在乎那个?那只是对历史上某个特定时期(七十年)的经验性观察,未必适用于下一个七十年。所以有趣的问题是:你是一个AI,你获得了更多的计算能力。
The bitter lesson, oh, who cares about that? That's an empirical observation about a particular period in history, seventy years. It doesn't necessarily have to apply to the next seventy years. So the interesting question is: you're an AI, and you get some more compute power.
你应该用它来让自己,你知道的,在计算上更强大,还是应该用它来生成一个自己的副本,
Should you use it to make yourself, you know, more computationally capable, or should you use it to spawn off a copy of yourself,
去
to go
在地球的另一端或某个其他主题上学习一些有趣的东西,然后向你汇报?是的。我认为这是一个非常有趣的问题,只有在数字智能时代才会出现。我不确定答案是什么,但我认为更多的问题会出现。是否真的可能生成它,派遣它出去,学习一些新的东西,可能是非常新的东西,然后它是否能够被重新整合到原始体中?
learn something interesting on the other side of the planet, or on some other topic, and then report back to you? Yep. I think that's a really interesting question that will only arise in the age of digital intelligences. I'm not sure what the answer is, but I think it raises more questions. Will it be possible to really spawn it off, send it out, have it learn something new, perhaps something very new, and then will it be able to be reincorporated into the original?
或者它会改变得太多以至于无法真正实现?这是可能的还是不可能的?你可以把这个推到极限,就像我前几天晚上看到你的一个视频所暗示的那样,它可以生成许多许多副本,做不同的事情,高度分散,但向中央主脑汇报。这将是一件非常强大的事情。嗯,我认为有一件事,所以这是我试图为这个观点添加一些东西,那就是,一个大问题,一个大问题将会变成,腐败。
Or will it have changed so much that it can't really be done? Is that possible or is it not? And you can carry this to its limit. One of your videos I saw the other night suggested that it could: you spawn off many, many copies that do different things, highly decentralized, but report back to the central master. And that this would be such a powerful thing. Well, one thing, and this is my attempt to add something to this view, is that a big question, a big issue, will become corruption.
你知道,如果你真的可以从任何地方获取信息并将其带入你的中央思维,你可以变得越来越强大。而且因为一切都是数字化的,它们都说某种内部的数字语言,也许这会很容易和可能。但这不会像你想象的那么容易,因为这样你可能会失去理智。如果你从外部引入一些东西并将其构建到你的内在思维中,它可能会接管你。它可能会改变你。
You know, if you really could just get information from anywhere and bring it into your central mind, you could become more and more powerful. And since it's all digital and they all speak some internal digital language, maybe it'll be easy and possible. But it will not be as easy as you're imagining, because you can lose your mind this way. If you pull in something from the outside and build it into your inner thinking, it could take over you. It could change you.
它可能是你的毁灭,而不是你渐进的知识。我认为这将成为一个大问题,特别是当你,哦,他已经弄清楚如何玩一些新游戏或者研究过印度尼西亚,你想把这些纳入你的思维。是的。所以你不能,你可能会想,哦,只要全部读进去,那就没问题了。但不,你刚刚读了一大堆比特进入你的思维。
It could be your destruction rather than your incremental knowledge. I think this will become a big concern, particularly when, oh, that one has figured out how to play some new game, or has studied Indonesia, and you want to incorporate that into your mind. Yeah. You might think, oh, just read it all in, and that'll be fine. But no, you've just read a whole bunch of bits into your mind.
而且,它们可能携带病毒。它们可能隐藏着目标。它们能扭曲并改变你。这将成为一个重大问题。在数字繁衍与重构的时代,如何保障网络安全?
And, they could have viruses in them. They could have hidden goals. They can warp you and change you. And this will become a big thing. How do you have cybersecurity in the age of digital spawning and reforming again?
有趣的是,量化公司和AI实验室都有保密文化,因为它们都在极度竞争的市场中运作,成功依赖于保护知识产权。如果你是AI研究员或工程师,在选择工作时,大多数考虑的量化公司或AI实验室都会严格隔离团队以降低泄露风险。哈德逊河交易公司(HRT)则反其道而行——团队公开分享交易策略,策略代码存放在共享的单体代码库中。在HRT,研究员的好创意会迅速部署到所有相关策略中。
It's interesting that both quant firms and AI labs have a culture of secrecy because both of them are operating in incredibly competitive markets and their success rests on protecting their IP. If you're an AI researcher or engineer and you're deciding where to work, most of the quant firms or AI labs that you'll be considering will be strongly siloing their teams to minimize the risk of leaks. Hudson River Trading takes the opposite approach. Their teams openly share their trading strategies, and their strategy code lives in a shared monorepo. At HRT, if you're a researcher and you have a good idea, your contribution will be broadly deployed across all relevant strategies.
这让你的工作产生巨大影响力。你还能飞速成长:可以学习他人研究、随时提问,并完整理解从底层交易执行到高层预测模型的整套体系。HRT正在招聘,详情请访问hudsonrivertrading.com/dwarcash。
This gives your work a ton of leverage. You'll also learn incredibly fast. You can learn about other people's research and ask questions, and you can see how everything fits together end to end, from the low level execution of trades to the high level predictive models. HRT is hiring. If you want to learn more, go to hudsonrivertrading.com/dwarcash.
好了,回到理查德的话题。我想这引出了AI继任的问题。
Alright. Back to Richard. I guess this brings us to the topic of AI succession.
嗯。
Mhmm.
你的观点与我采访过的多数人——或许与普遍认知都截然不同。我认为这个视角非常有趣,想听听你的见解。
You have a perspective that's quite different from a lot of people that I've interviewed and maybe a lot of people generally. So I also think it's very interesting perspective. I want to hear about it.
是的。我认为向数字智能或增强人类的过渡不可避免。我的论证分为四部分:首先,没有任何政府或组织能代表人类统一立场并主导安排——世界该如何运转尚无共识;其次,我们终将破解智能的运作原理。
Yeah. So I do think succession to digital intelligence, or to augmented humans, is inevitable. I have a four-part argument. Step one is that there's no government or organization that gives humanity a unified point of view, that dominates and can arrange things; there's no consensus about how the world should be run. And number two, we will figure out how intelligence works.
研究人员最终会弄明白的。第三点,我们不会止步于人类水平的智能,我们将达到超级智能。第四点是,随着时间的推移,最智能的事物必然会获得资源和权力。把这些综合起来看,可以说,由AI或经AI增强的人类接替主导地位几乎是不可避免的。
The researchers will figure it out eventually. And number three, we won't stop just with human-level intelligence. We will reach superintelligence. And number four is that it's inevitable over time that the most intelligent things around will gain resources and power. So put all that together, and it's sort of inevitable that you're going to have succession to AI, or to AI-enabled augmented humans.
在这四点框架内,这些发展似乎清晰且必然发生。但在这些可能性中,既可能有好的结果,也可能有不太理想甚至糟糕的结果。因此,我只是试图客观看待现状,并思考我们对此应持何种态度。
So those four things seem clear and sure to happen. But within that set of possibilities, there could be good outcomes as well as less good outcomes, bad outcomes. And so I'm just trying to be realistic about where we are and ask how we should feel about it.
是的。我同意这四点论述及其隐含意义。我也认同权力更替会带来多样化的未来可能性。所以很想听听更多关于这方面的见解。
Yeah. I agree with all four of those arguments and the implication. And I also agree that succession contains a wide variety of possible futures. So curious to get more thoughts on that.
没错。因此我首先鼓励人们积极看待这个问题,因为这是人类数千年来始终追求的——试图理解自我,提升思维能力,深化自我认知。这是科学与人文的伟大成就,我们正在揭示人性核心的本质,理解智能的意义。而我常说的是,这一切本质上仍是以人类为中心的视角。
Right. And so then I do encourage people to think positively about it. First of all, it's something we humans have tried to do for thousands of years: to understand ourselves and to make ourselves think better. So this is a great success of science and the humanities; we're finding out what this essential part of humanness is, what it means to be intelligent. And then what I usually say is that this is all kind of human-centric.
如果跳出人类视角,从宇宙的角度来看呢?我认为这是宇宙演进的重要阶段,一个关键转折——从人类、动植物这些复制体主导的时代(我们都是复制体,这赋予我们某些优势与局限),正进入设计时代。因为我们的AI是设计的,所有物理实体是设计的,建筑是设计的,技术是设计的,而我们现在设计的AI本身具备智能,并能进行自主设计。
What if you step aside from being a human and take the point of view of the universe? This is, I think, a major stage in the universe, a major transition: a transition from replicators. Humans and animals and plants, we're all replicators, and that gives us some strengths and some limitations. And then we're entering the age of design, because our AIs are designed, all of our physical objects are designed, our buildings are designed, our technology is designed, and we're now designing AIs, things that can be intelligent themselves and that are themselves capable of design.
这是世界乃至宇宙发展的关键一步。我认为这是从复制主导世界(复制意味着你能制造副本却不真正理解其本质,比如现在我们能繁衍更聪明的后代却不真正理解智能原理)向设计智能时代的过渡。如今我们即将实现的是被理解的智能,因此能以不同方式和速度对其进行改造。
And so this is a key step in the world and in the universe. I think it's a transition from a world in which most of the interesting things are replicated. Replicated means you can make copies of them, but you don't really understand them. Right now, we can make more intelligent beings, more children, but we don't really understand how intelligence works. Whereas we're reaching now toward having designed intelligence, intelligence whose workings we do understand, and which we can therefore change in different ways and at different speeds than otherwise.
而我们的未来可能完全脱离复制模式。比如我们可能直接设计AI,这些AI再设计其他AI,一切将通过设计与构建而非复制来完成。是的,我将此视为宇宙四大演进阶段之一——最初是尘埃,然后是恒星。
And our future might not involve replication at all. We may just design AIs, and those AIs will design other AIs, and everything will be done by design and construction rather than by replication. Yeah. I mark this as one of the four great stages of the universe. First, there's dust, and then stars.
星辰孕育行星,行星孕育生命。如今我们正赋予设计实体以生命。因此我认为我们应当自豪,应当庆幸自己能促成宇宙这一伟大转变。是的,这确实耐人寻味。
Stars make planets, and the planets give rise to life. And now we're giving life to designed entities. So I think we should be proud, and we should be glad, that we are giving rise to this great transition in the universe. Yeah. So it's an interesting thing.
我们该视它们为人类的一部分,还是异于人类的存在?选择权在我们手中。我们可以宣称'它们是我们的后代,我们应以它们为荣,为它们的成就欢呼';也可以坚称'不,它们不属于我们'并因此感到恐惧。
Should we consider them part of humanity, or different from humanity? It's our choice. We could say, oh, they are our offspring, and we should be proud of them and celebrate their achievements. Or we could say, oh no, they're not us, and we should be horrified.
有趣的是,这看似是个可选项,却又像根深蒂固的认知——我们怎能有选择权?我钟爱这种思想中矛盾的意蕴。
It's interesting that it feels to me like a choice. And yet it's such a strongly held thing; how could it be a choice? I like these sort of contradictory implications of the thought.
试想我们若只是在设计下一代人类——当然'设计'这个词并不准确——我们知道未来人类终将出现。暂且搁置AI不谈,长远来看人类必将更强大、更繁荣、更智慧。
I mean, it's interesting to consider if we were just designing another generation of humans. Yes, design is the wrong word, but say we knew a future generation of humans was going to come up. And forget about AI. We just know that in the long run, humanity will be more capable, and maybe more numerous, maybe more intelligent.
对此我们作何感想?确实存在某些未来人类可能令我们深感忧虑的世界图景。那么...
How do we feel about that? I do think there are potential worlds with future humans that we would be quite concerned about.
你是否在想,我们或许就像尼安德特人,孕育了智人。而智人或许将孕育出新的人类族群?
So are you thinking maybe we are like the Neanderthals? We gave rise to Homo sapiens, and maybe Homo sapiens will give rise to a new group of people.
这正是你举例的核心——即便认定它们属于人类范畴,也不意味着我们就该高枕无忧。
That's basically the example you're giving: okay, even if you consider them part of humanity, I don't think that necessarily means that we should feel super comfortable.
在领导力方面。是的。纳粹也是人类,对吧?如果我们认为未来的世代会成为纳粹,我想我们会非常担忧将权力交给他们。所以我同意这与担忧未来更强大的人类并无太大不同。
In leadership. Yeah. Nazis were humans, right? If we thought, oh, the future generation will be Nazis, I think we'd be quite concerned about just handing off power to them. So I agree that this is not super dissimilar to worrying about more capable future humans.
但我不认为这解决了许多人可能有的担忧——关于这种级别的权力如此迅速地由我们不完全理解的实体获得。
But I don't think that addresses a lot of the concerns people might have about this level of power being attained this fast, with entities we don't fully understand.
嗯,我认为有必要指出,对大多数人来说,他们对发生的事情没有太大影响力。大多数人无法影响谁能控制原子弹或谁控制国家。即使作为公民,我也常常觉得我们对国家的控制力非常有限。它们已经失控了。很大程度上这与你对变化的感受有关。
Well, I think it's relevant to point out that most of humanity doesn't have much influence on what happens. Most of humanity doesn't influence who controls the atom bombs or who controls the nation states. Even as a citizen, I often feel that we don't control the nation states very much. They're out of control. A lot of it has to do with just how you feel about change.
如果你认为现状真的非常非常好,那么你更可能对变化持怀疑态度并抗拒变化,而不是认为它不完美。而我认为它并不完美。事实上,我认为它相当糟糕。是的。所以我愿意接受变化。
And if you think the current situation is really, really good, then you're more likely to be suspicious of change and averse to change than if you think it's imperfect. And I think it's imperfect. In fact, I think it's pretty bad. Yeah. So I'm open to change.
而且我认为人类的历史记录并不算特别好。也许这是迄今为止最好的情况,但它远非完美。
And I don't think humanity has had a particularly good track record. Maybe it's the best thing there's been, but it's far from perfect.
是的。我想变化有不同的种类。工业革命是变化。布尔什维克革命也是变化。如果你生活在1900年代的俄罗斯,你会觉得事情发展得并不顺利。
Yeah. I guess there are different varieties of change. The industrial revolution was change. The Bolshevik revolution was also change. And if you were around in Russia in the nineteen hundreds, you'd be like, look, things aren't going well.
是沙皇把事情搞砸了。我们需要改变。在签字同意之前,我想知道你想要什么样的改变,对吧?同样地,对于人工智能,我也希望能理解并在可能的范围内改变其发展轨迹,使这种改变对人类有益。
The czars were kind of messing things up. We need change. I'd want to know what kind of change you wanted before signing on the dotted line, right? And it's similar with AI, where I'd want to understand, and to the extent it's possible, to change the trajectory of AI such that the change is positive for humans.
我们应该关心我们的未来,人类的未来。我们应该努力让它变得美好。但同时,我们也应该认识到自身的局限性。我认为我们要避免那种理所当然的感觉,避免认为‘我们是先来者就该永远占据优势’的心态。
We should be concerned about our future, the future of humanity. We should try to make it good. But we should also recognize our limits. I think we want to avoid the feeling of entitlement, avoid the feeling of, oh, we were here first, so we should always have it our way.
我们该如何思考未来?某个特定星球上的特定物种应该对未来的掌控权有多大?我们实际拥有多少控制力?相对于人类长远未来的有限控制力,我们更应该关注对自己生活的掌控——比如我们个人的目标和家庭。
How should we think about the future, and how much control should a particular species on a particular planet have over it? How much control do we have? A counterbalance to our limited control over the long-term future of humanity is how much control we have over our own lives. Like, we have our own goals and we have our families.
这些事情远比试图控制整个宇宙要可控得多。
And those things are much more controllable than like trying to control the whole universe.
没错。
Right.
所以我认为我们真正应该做的是致力于实现自己的局部目标。而那种‘未来必须按照我想要的方式发展’的想法其实很傲慢。毕竟不同的人对全球未来的发展路径有不同看法,这就会引发冲突。
So I think it's appropriate for us to really work towards our own local goals. And it's kind of aggressive for us to say, oh, the future has to evolve this way that I want it to. Sure. Because then we'll have arguments. Different people think the global future should evolve in different ways, and then they have conflict.
或许可以用养育孩子来类比:如果你对自己的孩子设定极其严格的人生目标,或者期望他们必须对世界产生某种特定影响(比如儿子当总统女儿当英特尔CEO),这未必合适。更恰当的做法是赋予他们健全的价值观,这样当他们将来掌握权力时,自然会做出有益社会的行为。对人工智能或许也该持类似态度——不是预测它们的所有行为或规划百年后的世界蓝图,而是赋予它们稳健、可引导且亲社会的价值观。
Maybe a good analogy here would be: suppose you're raising your own children. It might not be appropriate to have extremely tight goals for their lives, or to have some sense of, I want my children to go out into the world and have this specific impact. You know, my son's going to become president and my daughter's going to become CEO of Intel, and together they're going to have this effect on the world. But people do have a sense, which I think is appropriate, of saying, I'm going to give them good, robust values such that if and when they do end up in positions of power, they do reasonable, prosocial things. And I think maybe a similar attitude towards AI makes sense: not in the sense that we can predict everything they will do, or that we have a plan for what the world should look like in a hundred years, but it's quite important to give them robust and steerable and prosocial values.
亲社会价值观。
Prosocial values.
也许这个词用得不对。
Maybe that's the wrong word.
是否存在我们都能认同的普世价值观?
Are there universal values that we can all agree on?
我认为没有。但这并不妨碍我们给孩子良好的教育,对吧?就像我们总希望孩子成为某种样子。或许'亲社会'确实不是最准确的表述,'高度正直'可能更贴切——当面对有害的请求或目标时,他们会拒绝参与。
I don't think so. But that doesn't prevent us from giving our kids a good education, right? Like, we have some sense that we want our children to be a certain way. And maybe prosocial is the wrong word, actually. High integrity is maybe a better word, where if there's a request or a goal that seems harmful, they will refuse to engage in it.
或者说他们会保持诚实等等。即便我们对真正的道德标准没有共识,我们依然觉得自己可以教会孩子这些品质。或许这对AI来说也是个合理的目标。
Or they'll be honest, things like that. And we have some sense that we can teach our children things like this, even if we don't know what true morality is, or not everybody agrees on it. And maybe that's a reasonable target for AI as well.
所以你的意思是,我们正在试图设计未来及其演化原则。那么你首先提出的观点是:我们会尝试教授孩子那些更可能促进良性演变的通用原则。或许我们还应该追求变革的自愿性——任何变化都应是人们自愿接受而非被迫承受的。
So you're saying we're trying to design the future and the principles by which it will evolve and come into being. Right. And the first thing you're saying is, well, we try to teach our children general principles that will promote better evolutions. Maybe we should also seek for things to be voluntary. If there is change, we want it to be voluntary rather than imposed on people.
我认为这个观点非常重要。确实,设计社会结构是人类最宏大(或者说最重大的)事业之一,这个进程已持续数千年。正所谓万变不离其宗。
I think that's a pretty important point. And yeah, that's all good. I think this is one of the really big human enterprises, designing society. And that's been ongoing for thousands of years. And so it's like: the more things change, the more they stay the same.
我们仍需思考如何自处。孩子们仍会提出与父母和祖父母价值观相异的想法,事物将不断演变。
We still have to figure out how to be. The children will still come up with different values that seem strange to their parents and their grandparents, and things will evolve.
万变不离其宗,这句话似乎也适合为AI讨论作结,因为我们此前探讨的正是那些早在深度学习与反向传播应用之前就发明的技术,如今却成为AI发展的核心。或许这正是结束对话的好时机。好的。
The more things change, the more they stay the same also seems like a good capstone to the AI discussion, because the AI discussion we were having was about how techniques that were invented even before their application to deep learning and backpropagation was evident are central to the progression of AI today. So maybe that's a good place to wrap up the conversation. Okay.
非常感谢。
Thank you very much.
感谢您的到来。这是我的荣幸。
Thank you for coming on. My pleasure.