本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
好的。
Okay.
我又和我的朋友肖托·布里肯一起加入了。
I'm joined again by my friends, Sholto Bricken.
等等。
Wait.
该死。
Fuck.
我上次做过这个吗?
Did I do this last time?
你刚刚说了‘不’。
You just named no.
不。
No.
不。
No.
你给我们起了不同的名字,但我们之前并没有Sholto Bricken和Trenton Douglas。
You've named us differently, but we didn't have Sholto Bricken and Trenton Douglas.
Sholto Douglas和Trenton Bricken,他们现在都在Anthropic公司。
Sholto Douglas and Trenton Bricken, who are now both at Anthropic.
是的。
Yeah.
Sholto。
Sholto.
说吧。
Go.
Sholto正在研究强化学习的扩展。
Sholto is Scaling RL.
Trenton仍在研究机制可解释性。
Trenton's still working on mechanistic interpretability.
欢迎回来。
Welcome back.
很高兴在这里。
Happy to be here.
是的。
Yeah.
很有趣。
It's fun.
自从去年以来有什么变化吗?
What's changed since last year?
我们基本上在2024年这个月聊过。
We talked basically this month in 2024.
对。
Yep.
现在是2025年了。
Now we're 2025.
发生了什么?
What's happened?
好的。
Okay.
所以我认为最大的变化是强化学习和语言模型终于取得了成功。
So I think the biggest thing that's changed is RL and language models has finally worked.
这体现在我们终于有了一个算法的实证,它在正确的反馈循环下能够提供专家级的人类可靠性和表现。
And this is manifested in we finally have proof of an algorithm that can give us expert human reliability and performance given the right feedback loop.
因此,我认为这一点主要在编程竞赛和数学领域得到了明确的证明。
And so I think this is only really being conclusively demonstrated in competitive programming and math, basically.
所以,如果你考虑这两个维度,一个是任务的智力复杂性,另一个是任务完成的时间跨度。
And so if you think of these two axes, one is, the, like, intellectual complexity of the task, and the other is the time horizon of which the task is, is being completed on.
我认为我们已经证明,我们能够在许多维度上达到智力复杂性的顶峰。
And I think we have proof that we can we can reach the peaks of intellectual complexity, along along many dimensions.
但我们尚未展示出长期运行的智能体行为。
We haven't yet demonstrated, like, long running agentic
嗯。
Mhmm.
性能。
Performance.
你现在正看到这方面的初步尝试,到今年年底应该会看到更确凿的证据,Mhmm。
And you're seeing, like, the first stumbling steps of that now and should see much more conclusive evidence of that basically by the end of the year Mhmm.
通过真正的软件工程代理完成实际工作。
With, like, real software engineering agents doing real work.
我认为,特伦顿,你目前正在进行这方面的实验。
I think, Trenton, you're, like, experimenting with this at the moment.
对吧?
Right?
是的。
Yeah.
当然。
Absolutely.
我的意思是,今天人们最能接触到的公开例子就是Claude玩宝可梦。
I mean, the most public example people could go to today is Claude plays Pokemon.
对。
Right.
看到它以一种令人不忍直视的方式挣扎,但每一代模型都能更深入地通关,这似乎更多是由于它无法有效使用记忆系统造成的。
And seeing it struggle in a way that's, like, kind of painful to watch, but each model generation gets further through the game, and it seems more like a limitation of it being able to use a memory system Yep.
而不是其他原因。
Than anything else.
是的。
Yeah.
我真希望我们去年记录了预测结果。
I wish we had recorded predictions last year.
今年我们 definitely 应该这么做。
We definitely should this year.
哦,对。
Oh, yeah.
监督我们吧。
Hold us accountable.
是的。
Yeah.
没错。
That's right.
去年你会说智能体会有这么强大的能力吗?
Would you have said that agents would be only this powerful as of last year?
我认为这在软件工程方面大致符合我的预期。
I think this is roughly on track for where I expected with software engineering.
我原本以为它们在电脑操作上会更出色一些。
I think I expected them to be a little bit better at computer use.
是的。
Yeah.
但我理解造成这种情况的所有原因,而且我认为这很快就会得到解决。
But I understand all the reasons for why that is, and I think that's, like, well on track to be solved.
这只是一个暂时的停滞。
It's just like a sort of temporary lapse.
而且让我对明年不做任何预测负责,我真的认为到今年年底,也就是明年这个时候,我们会拥有能够完成初级工程师一整天工作量的软件工程代理,或者完成几个小时相当专业的独立工作。
And holding me accountable for, like, no predictions next year, like, I really do think the end of this year, sort of like this this time next year, we have software engineering agents that can do close to a day's worth of work, like, for, a junior engineer or, like, a couple of hours of, like, quite competent independent work.
是的。
Yeah.
这在我看来是正确的。
That that seems right to me.
不过我觉得这个分布相当奇怪。
I think the distribution's pretty wonky, though.
对。
Yes.
比如,对于某些任务,我不知道,像那种样板式的网站代码,
Where, like, for some tasks, I don't know, like, boiler boilerplate website code,
这类事情。
these sorts of things.
在西班牙。
In Spain.
它能快速完成任务,帮你省下整整一天的时间。
Don't It can it can bang it out and save you a whole day.
是的。
Yeah.
没错。
Exactly.
是的。
Yeah.
我觉得这是对的。
I think that's right.
我想去年你说过,限制它们发展的因素是额外的可靠性要求。
I think last year you said that the thing that was holding them back was the extra nines of reliability.
嗯。
Mhmm.
我不确定你现在是否还会用这种方式来解释,为什么这些软件代理还不能完成一整天的工作,却能帮你节省几分钟。
I don't know if that's the way you'd still describe the way in which these software agents aren't able to do a full day of work, but are able to help you out with a couple minutes.
是额外的几个九限制了你们,还是其他原因?
Is it is it the extra nines that's really stopping you or is it something else?
是的。
Yeah.
我认为我当时的描述,现在回头看,可能并不是真正限制他们的因素。
I think my description there was I think, like, retrospect, probably not what's limiting them.
我觉得我们现在看到的更接近于缺乏上下文,缺乏进行复杂、多文件更改的能力,以及在某种程度上对任务范围或变更范围的把握不足。
I think what we're seeing now is closer to lack of context, lack of ability to, like, do complex, like, very multifile changes and, like, sort of, like maybe, like, scope or or of the change or scope of, like, the task in some respects.
它们可以在高度聚焦的环境中,面对明确界定的问题时,应对高智力复杂性。
Like, they can they can cope with high intellectual complexity in, like, a focused context with a high with a real, like, scoped problem.
但当问题更模糊,或者需要与环境进行大量探索和迭代时,它们就更吃力了。
But when something's a bit more amorphous or requires a lot of discovery and iteration with the environment, this kind of stuff, they're they struggle more.
对。
Yep.
所以,也许我现在会这样定义:如果能为它们提供一个良好的反馈循环来完成你想要的任务,它们就能表现得很好。
And and so maybe the the way I would define it now is the thing that's holding them back is if you can give it a good feedback loop for the thing that you want it to do, then it's good.
它在这方面表现得很好。
It's pretty good at it.
如果你做不到,它们就会有点吃力。
If you can't, then they struggle a bit.
你能再为观众详细解释一下你所说的反馈循环吗?是的。
Can you and then for the audience, can you say more about what you mean by this feedback loop Yeah.
如果它们不了解RL中发生了什么,以及其他相关情况?
If they're not aware of what's happening in RL and so forth?
是的。
Yes.
过去一年真正有效的大方向是,也许可以 broadly 地说,这个领域有点像基于可验证奖励的强化学习,也就是一个清晰的奖励信号。
So the big thing that really worked over the last year is, maybe, like, broadly, the domain is a little like RL from verifiable rewards or something like this where a clean reward signal.
所以,你知道,语言模型最初的改进是通过人类反馈进行强化学习,嗯。
So so, you know, the initial unhappling of language models was RL from human feedback Mhmm.
通常,这涉及成对反馈之类的方式,模型的输出也越来越接近人类想要的结果。
Where, you know, typically, it was something like pairwise feedback or something like this, and and the outputs of the models became closer and closer to things that humans wanted.
是的
Yeah.
但这并不一定提升模型在任何问题领域难度上的表现。
But this doesn't necessarily improve their performance at any, like like, difficulty of problem domain.
对吧?
Right?
特别是,这些人类实际上非常不擅长判断什么是更好的答案。
Particularly, these humans are actually quite bad judges of what a what a better answer is.
人类存在长度偏见等问题。
Humans have things like length biases and and and so forth.
所以你需要一个信号来判断模型的输出是否正确,是的。
So you need a signal of whether the model was correct in its output Yeah.
这一点确实非常正确,比如说。
That is that that is, like, quite true, let's say.
比如数学问题的正确答案,或者单元测试通过,这类东西。
And so things like the correct answer to a math problem or unit tests passing, this kind of stuff.
这些是奖励信号的示例,非常清晰,但顺便说一句,即使是这些也可能被破解。
These are the examples of of reward signal that's very clean, but even these can be hacked, by the way.
比如单元测试,模型会找到各种方法绕过它,通过注入特定值或硬编码单元测试的值,只要它们能弄清楚测试的实际作用。
Like even unit tests, the models find ways around it to like hack in particular values and hard code values of unit tests if they can figure out like what the actual test is doing.
比如,如果它们能查看缓存的Python文件并找到实际的测试内容,就会试图绕过它。
Like if they can like look at the cached Python files and find what the actual test is, they'll they'll try and hack their way around it.
所以这些并不完美,但已经接近多了。
So these aren't perfect, but they're they're much closer.
为什么它在软件工程方面的表现比其他所有领域都好这么多?
And why has it gotten so much better at software engineering than everything else?
部分原因在于软件工程非常可验证。
In part because software engineering is very verifiable.
这是一个天然适合这种方式的领域。
Like, it's a domain which just naturally lends it to this way.
我认为
I think
代码通过测试了吗?
Does the code pass a test?
它能运行吗?
Does even run?
它能编译吗?
Does it compile?
是的。
Yeah.
它能编译吗?
Does it compile?
它通过测试了吗?
Does it pass the test?
你可以查看代码,运行测试,从而知道你是否得到了正确答案。
You know, you can go on the code and you can run, like, tests and, like, you know whether or not you got the right answer.
但写一篇优秀的文章却没有类似的方式。
But there isn't the same kind of thing for, like, writing a great essay.
这涉及到品味这类问题,确实很难界定。
That requires like, the question of, like, taste in that regard is quite hard.
比如,我们前几天晚上吃饭时讨论过普利策奖,你知道,哪个应该先出现?
Like, we discussed the other night at dinner the Pulitzer Prize, like, you know, which would come first?
比如,是获得普利策奖的小说,还是诺贝尔奖之类的。
Like, a Pulitzer Prize winning novel or, like, you know, a Nobel Prize or something like this.
是的。
Yeah.
实际上,我认为在某些方面,获得诺贝尔奖的可能性比获得普利策奖的小说更大。
And I actually think a Nobel Prize is more likely than a Pulitzer Prize winning novel in some respects.
赢得诺贝尔奖,或者至少在助力赢得诺贝尔奖的过程中,所涉及的许多任务都具有更多可验证的层次。
There's a lot of the the tasks required in in winning a Nobel Prize or at least, like, strongly assisting in helping the win to win a Nobel Prize have more, like, layers of verifiability built up.
因此,我预计这些因素会比创作普利策奖级别小说更早地加速诺贝尔奖成果的产出。
So I I expect them to, like, accelerate the process of disc of doing Nobel Prize winning work more initially than that of, like, writing Pulitzer Prize worthy novels.
对。
Yeah.
我认为如果我们倒退十四个月,回到我们上次录音的时候,九成的可靠性对我来说是合理的。
I I think if we rewind fourteen months to when we recorded last time, the nines of reliability was was right to me.
那时候我们还没有Claude代码。
Like, we didn't have Claude code.
是的。
Yeah.
我们也没有深度研究。
We didn't have deep research.
我们当时只是用代理以聊天机器人形式进行操作。
All we did was use agents in a chatbot format.
对。
Right.
复制粘贴。
Copy paste.
复制粘贴。
Copy paste.
复制粘贴。
Copy paste.
是的。
Yeah.
完全正确。
Totally.
而且我认为,无论我们是在发短信还是使用谷歌,都已经非常习惯于聊天界面了。
And I and it's I think we're very used to chat interfaces whether we're texting or using Google.
很难想象,智能体实际上可以自行获取上下文,是的。
And it's weird to think that the agent can actually go and fetch its own context Yep.
并将自己的事实存储到记忆系统中。
And store its own facts into its memory system.
我仍然认为这是可靠性的九成,如果你正确地搭建模型或进行提示,它能完成的复杂任务远超普通用户的想象。
And I still think that it's the nines of reliability, And and if you scaffold the model correctly or prompt it, it can do much more sophisticated things than the average user assumes.
嗯。
Mhmm.
所以,我的一个朋友萨姆·罗德里格斯,他做未来浩室音乐,他们发现了一种新药物,目前正在申请专利。
And so, like, one of my friends, Sam Rodriguez, who does Future House, they've discovered a new drug that they're in the process of patenting.
等到这一集播出时,LSDV二。
And by the time this episode comes out LSDV two.
那将是实时的。
That live.
会是什么?
Will What was
什么?
that?
LSDV二?
LSDV two?
等等。
Wait.
真的吗?
Is it really?
不。
No.
不。
No.
他们没有制造其他的。
They're not making others.
但人们并不认为模型能够具有创造力或进行新的科学探索。
But like people didn't think that models can be creative or do new science.
对。
Right.
这看起来就像是一个技能问题。
And it does just kind of seem like a skill issue.
我的意思是,那很酷
I mean, was the cool
等等。
Wait.
等等。
Wait.
等等。
Wait.
但比如他们发现了一种药物,它是怎么做到的?就像是一次性解决了?所以这个是
But like the discovered a drug, is it how did it like, think it one shotted the So this was
这只是一个对话,所以我们需要参考完整的公告。
this was just over a a conversation, and so we'll need to refer to the full announcement.
但我的印象是,它能够阅读大量的医学文献
But my impression is that it was able to read a huge amount of medical literature
有意思。
Interesting.
并脑力激荡出关联,然后提出人类进行的湿实验。
And brainstorm connections, and then propose wet lab experiments that the humans did.
通过这一过程的迭代,他们验证了这种新化合物确实具有令人兴奋的特性。
And then through iteration on that, they verified that this, like, new compound does this thing that's really exciting.
我听到的另一个批评是,大型语言模型无法创作长篇创意小说。
Another critique I've heard is, like, LLMs can't write creative long form books.
我知道至少有两位人士(可能希望匿名)已经使用大型语言模型写出了长篇小说。
And I'm aware of at least two individuals who probably wanna remain anonymous who have used LLMs to write long form books.
我认为在这两种情况下,他们都非常擅长为模型搭建框架和设计提示。
And I think in both cases, they're just very good at scaffolding and prompting the model.
我的意思是,就连那个广为流传的ChatGPT地理猜测能力也是如此——它仅凭一张照片就能极其精准地判断你是在哪片海滩上。
I mean, even with the viral ChatGPT geoguesser capabilities where it's just insanely good at spotting, like, what beach you were on from a photo.
凯尔西·派珀——我认为是她让这个功能走红——她的提示词极其复杂。
Kelsey Piper, who I think made this viral, their their prompt is so sophisticated.
她的提示非常长,会引导你提出五个不同的假设,为每个假设分配概率,并深入分析图像中关键的各个层面。
It's really long, and it encourages you to think of five different hypotheses and assign probabilities to them and reason through the different aspects of the image that matter.
我虽然没有进行过A/B测试,但我认为,除非你明确鼓励模型进行这种深思熟虑的推理,否则你不可能获得这种级别的表现。
And I haven't AB tested it, but I think unless you really encourage the model to be this thoughtful, you wouldn't get the level of performance that you see with with that ability.
所以你是在指出,人们如何通过限制模型的输出,来获取其表现最好的那一部分。
So you're bringing up ways in which people have constrained what the model is outputting to get like the good part of the distribution.
但我听到的一种批评是,关于使用像O-three这样的模型的成功来表明我们正从这些推理模型中获得新能力,这种观点认为,所有这些能力其实早已内嵌在预训练模型中。
But one of the critiques I've heard of RL or the, of RL, but one of the critiques I've heard about using the success of models like O-three to suggest that we're like getting new capabilities from these reasoning models is that all of these capabilities were already baked in the pre training model.
我认为有一篇来自某大学的论文显示,如果你给一个基础模型足够多的尝试机会来回答一个问题,它依然能像推理模型一样回答出来,只是回答的概率较低。
I think there's a paper from University where they showed that if you give a base model enough tries to answer a question, it can still answer the question as well as the reasoning model, basically just has a lower probability of answering.
所以你是在缩小模型在回答问题时所探索的可能性范围。
So you're narrowing down the the possibilities that the model explores when it's answering a question.
那么,我们真的是通过这种强化学习训练激发了新的能力,还是只是给它们戴上了眼罩?
So are we actually eliciting new capabilities with this RL training or are we just like putting the blinders on them?
对。
Right.
就像是在剔除这些大理石。
Like carving away the marbles on this.
我认为值得注意的是,那篇论文我确信是基于Llama和Quen模型做的。
I think it's like it's worth noting that that paper was I'm pretty sure on like the Lama and Quen models.
我不确定他们用了多少强化学习的计算资源,但我认为这远远无法与基础模型所使用的计算量相比。
And I'm not sure how much like RL compute they used, but I don't think it was anywhere comparable to the amount of compute that was used in the in the base models.
所以我认为,训练中使用的计算量可以很好地反映你为模型新增的实际原始知识或能力的多少。
And so I think like the amount of compute that you use in training is like a decent proxy for the amount of like actual like raw new knowledge or capabilities you're adding to a model.
因此,至少根据我之前的认知,如果你看看 DeepMind 之前在强化学习领域的所有研究,强化学习确实能教会这些下围棋和国际象棋的智能体,是的。
So like my prior at least, if you look at like all of DeepMind's research from RL before, RL was able to teach these like Go and chess playing agents Yeah.
仅通过强化学习信号,就能赋予智能体超越人类水平的新知识,前提是强化学习信号足够高效且干净。
New knowledge that were in excess of human level performance just from RL signal provided the RL signal is efficiently clean.
是的。
Yeah.
因此,这个算法在结构上并没有任何限制,阻止它为神经网络注入新知识。
So there's like nothing structurally limiting about the algorithm here that like prevents it from imbuing the neural net with new knowledge.
这只是一个是否投入足够计算量以及是否拥有正确算法的问题。
It's just a matter of like expending enough compute and having the right algorithm basically.
嗯。
Mhmm.
那你为什么不在这个方向上投入更多计算资源呢?
Why aren't you already spending more compute on this?
我觉得达里奥在他的博客文章中提到过,关于出口管制的事,大概是几个月前,提到了深度求索之类的公司。
I think Dario said in his blog post that labs or it was like a couple months ago on the export controls thing is like, DeepSeek, whatever.
他们现在只在强化学习上花了100万美元左右。
They're we're only spending 1,000,000 on RL or something.
所以,目前我们在强化学习上还没到算力受限的阶段,但很快就会了。
So it's like, we aren't in the compute limited regime for RL yet, but we will be soon.
是的。
Yeah.
你在基础模型上花了数亿美元,
You're spending hundreds of millions on the base model.
为什么只在强化学习上投一百万呢?
Why only order a million on the RL?
你知道那个关于选择发射太空任务的寓言吗?
You know that the parable about, like, when you choose to launch a space mission?
意思是,你应该先往上走,提升技术树,因为如果你推迟发射,你的飞船会更快,诸如此类。
How, like, you should, like, sort of acquire, like, go further up the tech tree because if you launch later on, you're like, your ship will go faster and this kind of stuff.
我觉得这非常相似。
I think it's quite similar to that.
就像你必须确保你的算法已经找到了正确的东西。
Like you wanna be sure that you're you algorithmically got the right thing.
当你下注并投入大量计算资源运行时,如果具备正确的计算效率,它才会真正产生回报。
And then when you bet and you do the large compute spend on the run, then like it'll actually pay off without the right compute efficiencies and this kind of stuff.
是的。
Yeah.
我认为在这方面,强化学习与预训练略有不同,因为强化学习可以是一种迭代过程,逐步为基础模型增加能力。
And I think like RL is slightly different to pre training in this regard where RL can be a more iterative thing that you're progressively adding capabilities to the base model.
预训练在很多方面是这样的:如果你在训练中途搞砸了,那你就真的彻底搞砸了。
Pre training has, you know, in many respects, like if you're halfway through a run and you've messed it up, then like you've you've really like messed it up.
但我觉得这正是人们仍在摸索究竟该怎么做主要原因。
But I think that's what that's like the main reason why is people are still figuring out exactly what they wanted to do.
我的意思是,从o1到o3,对吧?OpenAI在他们的博客文章中提到,o3的计算量是o1的十倍。
I mean, o one to o three, right, like, OpenAI put in their blog post that it was a 10x compute multiplier over o one.
是的。
Yeah.
所以很明显,他们押注于某一计算规模,并且觉得:好吧,这看起来不错。
So like, clearly they, you know, bet on, you know, one level of compute and they were like, okay, this seems good.
让我们实际发布它。
Let's actually release it.
把它推向市场。
Let's get it out there.
然后他们在接下来的几个月里,增加了在该任务上的计算投入。
And then they spent the next few months, like, you know, increasing the amount of compute that they expend on that.
我预计,正如其他人一样,目前大家都在扩大强化学习的规模。
And I expect, as everyone is, everyone else is, like, scaling up RL right now.
嗯。
Mhmm.
所以,我基本上不认为这一点对很多情况都成立。
So I I basically don't expect that to be true for for a lot.
是的。
Yeah.
为了便于听众理解,你正在进行梯度下降步骤。
Just for the sake of read listeners maybe, you're doing gradient descent steps
是的。
Yeah.
在预训练和强化学习中都是如此。
In both pretraining and reinforcement learning.
只是信号不同而已。
It's just the signal's different.
通常,在强化学习中,你的奖励更稀疏。
Typically, in reinforcement learning, your reward is sparser.
对。
Yep.
所以你要进行多轮操作。
So you take multiple turns.
就像是,你赢了这盘棋还是没赢?
It's like, did you win the chess game or not?
这是你唯一得到的信号。
It's the only signal you're getting.
对。
Right.
是的。
Yeah.
而且通常,你无法通过离散动作计算梯度。
And often, you can't compute gradients through discrete actions.
是的。
Yeah.
因此,你会丢失大量的梯度信号。
And so you end up losing a lot of gradient signal.
是的。
Yeah.
因此你可以推测,这种预训练更加高效,但并没有理由认为你不能在强化学习中学习新的能力。
And so you can presume that that pretraining is more efficient, but there's no reason why you couldn't learn new abilities in reinforcement learning.
是的。
Yeah.
事实上,你可以完全用某种奇怪的强化学习变体来替代预训练中的下一个词预测任务。
In fact, you could replace the whole next token prediction task in pretraining with some weird RL variant of it Totally.
然后全部通过强化学习来进行学习。
And then do all of your learning with RL.
是的。
Yeah.
嗯。
Mhmm.
是的。
Yeah.
归根结底,就是信号以及根据信号进行修正。
At the end of the day, just signal and then correcting to it.
完全正确。
Totally.
然后回到你提到的那篇论文,除了肖尔托提出的那些注意事项,我认为最重要的是聚焦于有意义动作的概率空间。
And then and then going back to the the paper you mentioned, aside from the caveats that that Sholto brings up, which I think is the the first order most important, I think zeroing in on the probability space of, like, meaningful actions Right.
这又回到了可靠性的九个九。
Comes back to the nines of reliability.
是的。
Yeah.
对。
Yeah.
如果按照经典做法,你给猴子一台打字机,最终它们也会写出莎士比亚的作品。
And, like, if classically, if you give monkeys a typewriter, eventually, they'll write Shakespeare.
对吧?
Right?
是的。
Yeah.
因此,对于我们所关心的任何现实世界任务,其行为空间都极其庞大,你确实需要确保模型的正确性。
And so the the action space for any of these real world tasks that we care about is so large that you really do care about getting the model Right.
专注于做合理的事情。
To zero in on doing the reasonable things.
是的。
Yeah.
而且在某种程度上,从广义上讲,比如在某个时刻,你确实拥有词元空间。
And and to the extent, like, in some broad sense, like, to the extent that the at, like, at some Passeke, like, you you've got token space.
对。
Right.
没错。
Exactly.
就像你真的有一只猴子,最终它写出了莎士比亚的作品。
Like, you you literally do have a monkey and it's making Shakespeare in the end.
是的。
Yeah.
是的
Yeah.
没错
Exactly.
是的
Yeah.
好的
Okay.
所以,国际象棋的类比很有趣。
So the alpha the the the chess analogy is interesting.
所以你
So were you
关于
about
要说什么吗?
to say something?
我只是想
I was just
我想说,你确实需要在某些时候获得奖励才能学习。
gonna say, like, you do need to be able to get reward sometimes in order to learn.
这在某种程度上带来了复杂性。
And that's like the complexity in some respects.
比如在alpha变体中,或者你可能正要说到这一点。
In like the alpha variants or maybe maybe you're about to say this.
是的。
Yeah.
就像,总有一个玩家会赢。
Like, one player always wins.
所以你总能获得某种奖励信号。
So you always get a reward signal one way or the other.
但在我们讨论的这类事情中,你需要在某些时候真正完成任务。
But in the kinds of things we're talking about, need to actually succeed at your task sometimes.
嗯嗯。
Mhmm.
幸运的是,语言模型对我们所关心的任务具有这种绝佳的先验知识。
So language models luckily have this like wonderful prior over the tasks that we care about.
是的。
Yeah.
所以如果你看看2017年所有的旧论文——那也不算太远,但2017年的论文中,奖励学习曲线总是平平整整、平平整整,因为它们在摸索世界的基本机制。
And so you so if you look at all the old papers from like 2017 it's not that old, but like, you know, like the papers from 2017, the reward the learning curves always look like like flat flat flat flat flat as they're like figuring out sort of like basic mechanics of the world.
然后当它们学会利用那些简单的奖励时,曲线就会突然飙升。
And then there's this like spike up as they learn to exploit like easy Yeah.
是的。
Yeah.
奖励。
Rewards.
然后在某些方面,这几乎就像一个S形曲线。
And then it like it's like it's almost like a sigmoid in some in some respects.
然后它会持续 indefinitely,因为它学会了完全最大化游戏得分。
And then like sort of continues on indefinitely as it like just learns to like absolutely maximize the game.
我认为大语言模型的曲线看起来有些不同,它们开头并没有那个停滞期。
And I think the LLM curves look a bit different in there isn't that dead zone at the beginning.
是的。
Yeah.
因为它们已经知道如何解决一些基本任务。
Because they already know how to solve some of the basic tasks.
所以你会看到一个初始的跃升,这正是人们谈论‘单样本学习’时所指的意思。
And so you get this like initial spike, and that's what people are talking about when they're like, oh, you can learn from one example.
那个单一示例只是教你如何调用回溯、正确格式化答案这类技巧,结合你预训练的知识,让你在任务初期就能获得一些奖励。
That one example is just like teaching you like to pull out the backtracking and like formatting your answer correctly and this kind of stuff that lets you get some reward initially at tasks, conditional on your pre training knowledge.
然后,其余的部分可能是你逐步学习更复杂的内容。
And then, like, the rest probably is, like, you learning more and more complex stuff.
对。
Yeah.
是的。
Yeah.
而且这也会很有趣。
And it would also be interesting.
我知道有人批评或对强化学习能否快速见效表示怀疑,他们指出AlphaGo需要大量的计算资源,尤其是考虑到它是在哪一年训练的?
I I know people have critiqued or been skeptical of RL delivering quick wins by pointing out that AlphaGo took a lot of compute, especially for a system trained in what was it?
2017年。
2017.
是的。
Yeah.
是的。
Yes.
就像这条曲线一样。
Like the curve.
完全正确。
Totally.
是的。
Yeah.
对。
Right.
是的,确实。
In yeah.
所以,在某种程度上,那主要是因为首先你得有一个具有某种合理偏见的系统,是的。
So to the extent that that was largely because first you had to like have something which had like some biases which were sort of rational Yeah.
在它变得超人之前。
Before it like got like superhuman ago.
是的。
Yeah.
我其实很想知道,使用AlphaGo的计算量中,有多少只是用来获得一个合理的性能。
I actually would be interesting to see like what fraction of the compute using off ago.
它只是在获得一个合理的结果。
It was just like getting something reasonable.
是的。
Yes.
对。
Yeah.
对。
Yeah.
这会很有趣。
It would be interesting.
是的。
Yeah.
我的意思是,要使
I mean, to make
在这里,将预训练的强化学习过程明确化:在预训练期间,大语言模型是在预测下一个词元。
the map from pre training's RL really explicit here, during pre training, the large language model is predicting the next token
嗯。
Mhmm.
它的词汇量大约是五万个词元。
Of its vocabulary of, let's say, I don't know, 50,000 tokens.
是的。
Yeah.
然后你会根据它分配给真实词元的概率大小来给予奖励。
And you are then rewarding it for the amount of probability
嗯。
Mhmm.
它分配给真实词元的概率。
That it assigned to the true token.
对。
Right.
是的。
Yeah.
因此,你可以把它看作是一种奖励。
And so you could think of it as a reward.
对。
Right.
但这是一种非常密集的奖励机制,因为每个标记都会获得反馈信号。
But it's a very dense reward where you're getting signal at every single token
是的。
Yeah.
而且你总能获得一些反馈信号。
And you're always getting some signal.
即使它只给那个标记分配了1%或更少的概率,你也会说:哦,我看到你给了1%。
Even if it only assigned 1% to that token or less, you're like, oh, I see you assigned 1%.
做得好。
Good job.
继续这样下去。
Keep doing that.
加权提升。
Upweighted.
展开剩余字幕(还有 480 条)
是的。
Yeah.
是的。
Yeah.
没错。
Exactly.
就像梯度中的拉扯。
Like a tug in the gradient.
没错。
That's right.
是的。
Yeah.
是的。
Yeah.
当我思考人类的学习方式时,似乎这些模型从失败中得不到任何信号,这与你做数学题失败时的情况非常不同——失败时往往比学习数学和抽象概念更有用,因为你并不这么认为吗?
So when I think about the way humans learn, it seems like these models getting no signal from failure is quite different from if you're trying to do a math problem and you fail, it's actually even more useful often than like learning about math and the abstracts because, oh, you don't think so?
只有当你得到反馈时。
Only if you get feedback.
只有当你得到反馈时。
Only if you get feedback.
但你实际上会给自己反馈,我认为有一种方式是这样的。
But you, and I think there's a way in which like you actually give yourself feedback.
比如,你失败了,然后注意到自己错在哪里。
Like, you fail and you notice where you failed.
只有在某些时候得到反馈才行。
Only if you get feedback at times.
人们确实开创了新的数学理论,对吧?
People have like figured out new math, right?
他们之所以能做到,是因为他们在某个地方卡住了。
And they've done it by the fact that like they get stuck somewhere.
他们会想,为什么我在这里卡住了?
They're like, why am I getting stuck here?
让我好好想想这个。
Like, let me think through this.
而在例子中,我的意思是,我不清楚前沿到底是什么,但看看像DeepSeek这样的开源实现,似乎并没有一种明确的自我反思过程:当你失败时,你能从失败的具体方式中学习,然后回溯,从而让下一次做得更好?
Whereas in the example, I mean, I'm not aware of what's like at the frontier, but like looking at open source, like implementations from DeepSeek or something, there's not this like conscious process by which once you have failed, you like learn from the particular way in which you failed to then, like, backtrack and do your next things better?
只是纯粹的梯度下降,我不禁怀疑这是否是一个重大局限。
Just like pure gradient descent, and I wonder if that's a big limitation.
我不知道。
I don't know.
我只是记得大学时的课程,那时你试图证明某个命题,却会长时间在黑暗中摸索。
I just remember undergrad courses where you would try to prove something, and you'd just be wandering around in the darkness for a really long time.
然后你可能彻底放弃,不得不去向助教求助。
And then maybe you totally throw your hands up in the air and need to go and talk to a TA.
只有当你和助教交谈时,才能看清在不同解法路径中你哪里出错了,以及本该怎么做才是正确的。
And it's only when you talk to a TA can you see where along the path of different solutions you you were incorrect and, like, what the correct thing to have done would have been.
这种情况的前提是你知道最终答案是什么。
That's in the case where you know what the final answer is.
对吧?
Right?
在其他情况下,如果你只是盲目猜测,需要从头给出答案,那就很难学到任何东西。
In other cases, if you're just kind of shooting blind and meant to give an answer de novo, you you it's really hard to learn anything.
我想我是在再次将这一点映射到人类的例子上,用更简单的说法,我们有一种有意识的中间过程,像是辅助损失,我们优化它,这是一种非常自我觉察的过程,抛开数学不谈。
I guess I'm trying to map on again to the human example where like in more simpler terms, there is this sort of conscious intermediary, like auxiliary loss that we're like optimized, and it's like a very sort of like self conscious process of getting, forget about math.
这就像你在工作中,会从老板那里得到非常明确的反馈。
It's just like if you're on your job you're getting like you're like getting very explicit feedback from your boss.
这不一定说明任务应该怎样做才不同,但至少能从高层次上解释你哪里做错了,而你对这些错误的修正方式,并不是像预训练那样更新参数,而是更像‘别这样’,但
That's not necessarily like how you the task should be done differently, but like a high level, like explanation of what you did wrong, which you like update on not in the way that pre training updates ways, but more in the don't But
我认为这里存在大量隐性的密集奖励信号。
I think there's a lot of implicit dense reward signals here.
没错。
Exactly.
比如每周和经理的一对一会议。
Like weekly one on ones with your manager
是的
Yeah.
或者被鼓励公开工作。
Or being encouraged to work in the open.
对。
Yep.
或者甚至像家庭作业一样。
Or like even with homework assignments.
对吧?
Right?
它们的结构非常完善。
They're so scaffolded.
没错。
Right.
总是10个问题,分解成多个子部分。
It's always 10 questions broken down into subcomponents.
是的。
Yeah.
也许最困难的问题是那种你需要完全独立完成的问题。
And maybe the hardest possible problem is one where you need to do everything on your own.
是的。
Yeah.
好的。
Okay.
那么关键问题是,你是否需要为每一个你希望模型掌握的技能都构建这些支持结构、这些定制化的环境?如果是这样,那岂不是要花十年时间逐一攻克这些子技能?
So then big question is, do you need to build these scaffolds, these structures, these bespoke environments for every single skill that you want the model to understand, and then it's gonna be a decade of grinding through these sub skills?
还是说,存在某种更通用的方法,可以利用RO来学习新技能?
Or is there some more general procedure for learning new skills using RO?
是的。
Yeah.
所以这本质上是一个效率问题。
So it's it's an efficiency question there.
比如,显然,如果你能为每个词元提供密集的奖励,比如说你有监督示例,那将是最好的情况之一。
Like, obviously, if you could give a dense reward for every token, right, like if you had a supervised example, then that's one of the best things you could have.
但在许多情况下,要生成所有这些结构化的课程内容成本非常高。
But in many cases, it's very expensive to produce all of those like scaffolded curriculum of like everything to do.
让博士数学学生批改学生作业,这种事只能用于你选择重点培养的少数学生群体。
Like having PhD math students grade students is something like, which you can only afford for the select category of students that you've chosen to focus in on developing.
你不可能为世界上所有的语言模型都这么做。
And you couldn't do that for all the language models in the world.
第一步显然是更好的,但你实际上是在优化这个预训练前沿:你愿意在结构搭建上花多少成本,又愿意在纯计算资源上花多少成本。
First step is obviously, that would be better, but you're gonna like be sort of optimizing this Preo frontier of like how much am I willing to spend on like the scaffolding versus how much am I willing to spend on pure compute.
因为另一种做法就是让猴子继续敲打打字机。
Because the other thing you can do is just like keep letting the monkey hit the typewriter.
如果你有一个足够好的最终奖励机制,那么最终它总会找到出路。
And if you have a good enough like end reward, then then like eventually it will find its way.
所以,我真正想讨论的是,人们在这些结构搭建的光谱上究竟处于哪个位置。
And so like, if I'm really talking about like where sort of exactly people sit on that scaffold.
想想看,不同的人、不同的任务,其实都处于这个光谱的不同位置。
Think, like, different people, different tasks are, like, on different Yeah.
就在这些点上。
Points there.
而这很大程度上取决于你对正确做法的先验信念有多强,但这就是你正在优化的方程。
And and a lot of it depends on how strong your prior over the correct things to do is, but that's the equation you're optimizing.
这就像你愿意消耗多少算力, versus 愿意花多少钱在人力上提供引导或给予
It's like how much am I willing to burn compute versus how much am I willing to burn, like, dollars on people's time to give scaffolding or or give
有意思。
Interesting.
奖励。
Rewards.
是的。
Yeah.
你说我们不愿意对语言模型这么做,但对人却愿意。
You say we're not willing to do this for LMs, we are for people.
我认为,从经济逻辑上讲,情况应该恰恰相反,因为你可以把训练任何技能的成本分摊到模型的所有副本上。
I I I would think that the economic logic would flow in the opposite direction for the reason that you can amortize the cost of training any skill on a model Yeah.
跨所有副本。
Across all the copies.
我们确实愿意在一定程度上对大语言模型这样做。
Like, we we we are willing to do this for LMs, like, some degree.
是的。
Yeah.
但这里有一个你需要最大化的目标函数,就是,好吧。
But, like, there's a there's, an equation you're maximizing here of, okay.
我已经筹集了这么多资金。
I've, like, raised all this money.
我是把钱花在这个方面,还是花在那个方面呢?
Do I spend it along this axis or do I spend it on this Yeah.
目前,公司们在计算资源上的投入比在人力上的投入更多。
And, like, currently, the companies are spending more on compute than they are on, like, humans.
否则,Scale AI的收入就会是,你知道的,一百亿美元,或者你会处于这种状况。
Otherwise, like, Scale AI's revenue would be, like, you know, $10,000,000,000 or you'd be like like in this okay.
看看这个。
Look at it.
NVIDIA的收入远高于Scale AI的收入。
Like, NVIDIA's revenue is much higher than Scale AI's revenue.
对。
Right.
所以,目前的平衡是计算资源优先于数据。
And so, like, currently, the equation is compute over data.
而且,这种平衡会随着时间推移以某种方式发生变化。
And, like, that will evolve in some way over time.
但是,是的。
But Yeah.
很有趣。
Interesting.
是的。
Yeah.
我很好奇它会如何演变,因为如果你想想人类是如何学习完成一项工作的
I I I am curious how it evolves because if you think about the way that humans, like, learn to do a job
是的。
Yeah.
他们被部署后,就直接去做工作,并在实践中学习。
They get deployed and they just like do the job and they learn.
而这些模型的训练方式似乎却是,对于每项技能,你都得为它们提供一个非常特定的环境之类的。
Whereas if the way these models seem to be trained is that for every skill you have to like give them a sort of like very bespoke environment or something.
如果它们能像人类那样被训练的话。
If they were trained the way humans are trained.
就像在岗学习。
Like on the job.
对,没错。
Yeah, exactly.
那将会非常强大,因为每个人的工作都不同,但同一个模型可以整合你所获得的所有技能。
Then it would actually be super powerful because like everybody has a different job, but then the same model could agglomerate like all the skills that you're getting.
是的。
Yes.
我不确定。
I don't know.
过去几年我一直做这个播客。
I've been like doing the podcast for the last few years.
我正在变得越来越擅长做播客。
I'm like becoming a better podcaster.
是的。
Yes.
你拥有了一项更有价值的AI研究技能。
You have a slightly more valuable skill of doing AI research.
我不确定。
I don't know.
我不知道。
I don't know.
这简直难以置信。
It's unbelievable.
但你可以想象一个模型能够同时做这两件事,因为它在做我们两人的工作。
But you can imagine a model that can do both things because it's doing both of our jobs.
这个模型的副本正在同时做这两份工作。
Copies of the model are doing both jobs.
因此,这似乎是一个更痛苦的教训:让模型在现实中自行学习,而不是像那样,是的。
And so it seems like more bitter lesson aligned to do this, just let the model learn out in the world rather than you know, like Yeah.
花数十亿美元去获取特定任务的数据。
Spending billions on getting data for particular tasks.
所以,我认为,我们再次理所当然地认为,我们需要向人类展示如何完成特定任务,而这里存在泛化能力的失败。
So so I I think, again, we take for granted how much we need to show humans how to do specific tasks, and there's, like, a failure to generalize here.
比如,如果我突然给你一个新的软件平台,我不知道。
Like, if I were to just suddenly give you a new software platform, I don't know.
比如说Photoshop,我说:‘好吧,帮我修一下这张照片。’
Let's say, like, Photoshop, and I'm like, okay, edit this photo.
如果你从来没用过Photoshop,操作起来会非常困难。
If you've never used Photoshop before, it it'd be really hard to navigate.
我觉得你马上就会想去网上看别人是怎么操作的演示视频
And I think you'd immediately want to go online and watch a demo of someone else doing it
是的。
Yeah.
以便能够模仿他们。
In order to then be able to imitate them.
但我们为每一个任务都提供如此大量的数据,这 surely
But we give that that amount of data on every single task surely
哦,明白了。
Oh, okay.
给模型。
To the models.
所以这是第一点。
So this is the first thing.
但另一点是,我认为我们目前的规模仍然远小于人脑。
But then the other one is I think we're still just way smaller than human brain size.
嗯。
Mhmm.
而且我们知道,当模型规模变大时,它们的学习样本效率会更高。
And we know that when you make models larger, they learn more sample efficiently
嗯。
Mhmm.
用更少的演示就能学会。
With fewer demos.
而且,你最近和马克·扎克伯格以及LAMA的播客中提到的模型,参数量达到了2万亿。
And, like, it was striking where even in your recent podcast with with Mark Zuckerberg and LAMA, it's like a 2,000,000,000,000 parameter model.
我们的估计是,人脑大约有30万亿到300万亿个突触。
I mean, we estimate that the human brain has between 30 to 300,000,000,000,000 synapses.
嗯。
Mhmm.
我不知道如何确切地在这两者之间建立映射,但我认为这是一个有用的背景信息,即我们很可能仍然远小于人脑。
And I don't know exactly how to do a mapping from one to the other here, but I think it's useful background context that I think it it it it's like quite likely we're still smaller than the human brain.
即使是OpenAI发布的4.5版本,他们称这是一个更大的模型,人们还是会谈论它的写作能力,或者这种所谓的‘大模型气味’。
And I mean, even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability or this sort of, like, big model smell.
嗯。
Mhmm.
我认为这实际上触及了更深层的智能或泛化能力。
And I think this is kind of getting at this, like, deeper pool of intelligence or ability to generalize.
所有关于超位置状态的可解释性研究都表明,这些模型始终处于参数不足的状态,被迫将尽可能多的信息压缩进去。
I mean, all of the interpretability work on superposition states that the models are always under parameterized, and they're being forced to cram as much information as in as they possibly can.
因此,如果你的参数不够,又只奖励模型模仿某些行为,它就更难有空间形成这些深层次、更广泛的泛化能力。
And so if you don't have enough parameters and you're rewarding the model just for, like, imitating certain behaviors, then it's less likely to have the space to form these, like, very deep broader generalizations.
是的。
Yeah.
但即使考虑到所有这些
But even even in light of all
这些结果真的很棒。
these result is really cool.
你应该谈谈语言方面的结果。
You should talk about the language result.
你知道吗,较小的模型会为不同语言形成独立的神经元,而较大的模型则倾向于共享更多
You know, the how smaller models has formed separate like, have separate neurons for different languages, whereas larger models, like, end up sharing more
以及更抽象的空间。
and more, like, an abstract space.
所以,是的。
So so so yeah.
而且电路研究
And and the circuits work
是的。
Yeah.
我是说,即使是金门大桥,顺便提一下,这是金门大桥的一根缆绳,团队
I mean, even with the Golden Gate Bridge, and by the way, this is a a cable from the Golden Gate Bridge that the team
他们不得不让大桥失稳才能
They they had to destabilize the the bridge in order
看到它。
to get to see it.
但Claude会修好它的。
But Claude will fix it.
Claude非常喜欢金门大桥。
Claude loves the Golden Gate Bridge.
所以,即使是这样,对吧,比如,对于不熟悉的人来说,我们在发布关于可扩展性的论文时,创造了'金门大桥Claude',当时我们有三千万个特征,其中一个就是关于金门大桥的。
So even with this, right, like, if for people who aren't familiar, we made Golden Gate Claude when we released our paper scaling on us, Manticity, where one of the 30,000,000 features was for the Golden Gate Bridge.
如果你总是激活它,那么模型就会认为那是金门大桥。
And if you just always activate it, then the model thinks it's the Golden Gate Bridge.
如果你问它巧克力曲奇饼干,它会告诉你应该用橙色食用色素,或者,比如,带着饼干去金门大桥上吃,所有这些类似的联想。
If you ask it for chocolate chip cookies, it will tell you that you should use orange food coloring or, like, bring the cookies and eat them on the Golden Gate Bridge, all of these sort of associations.
我们是通过文本和图像之间的泛化发现这个特征的。
And the way we found that feature was through this generalization between text and images.
所以我实际上实现了将图像输入到我们的特征激活中的功能,因为当时所有这些都基于Cloud 3 Sonnet,这是我们最早的多模态模型之一。
So I actually implemented the ability to, like, put images into our feature activations because this was all on Cloud three Sonnet, which was one of our our first multimodal models.
所以我们只在文本上训练了稀疏自编码器和这些特征。
So we only trained the sparse autoencoder and, like, the features on text.
然后团队里一位朋友上传了一张金门大桥的图片,结果这个特征被激活了,我们查看文本时发现它确实与金门大桥相关。
And then a friend on the team put in an image of the Golden Gate Bridge, and then this feature lights up and we look at the text and it's for the Golden Gate Bridge.
因此,模型在它的‘大脑’中使用相同的神经活动模式来表示图像和文本。
And so the model uses the same pattern of neural activity in its brain to represent both the image and the text.
我们的电路研究再次证明了这一点,而且跨越多种语言都成立。
And our circuits work shows this again with across multiple languages.
对于‘大’或‘小’、‘热’或‘冷’这类概念,也存在同样的对应关系。
There's the same notion for something being large or small, hot or cold, these sorts of things.
但值得注意的是,这种情况在更大的模型中更为明显,你可能会以为,更大的模型空间更充裕,反而能更清晰地把事物区分开来。
But like strikingly, that is more so the case in larger models, where you'd think, like, actually larger models have more space so they could, like, separate things out more.
但实际上,相反,它们似乎会利用这些更宏观、更抽象的模式,没错。
But actually, instead, they seem to pull on these, like, larger abstract they on better abstractions Yeah.
这非常有趣。
Which is very interesting.
是的。
Yeah.
甚至当我们深入研究,我想以后再详细讲讲Claude是如何进行加法运算的。
Even when we when we go into and, like, I wanna go into more at some point, how Claude does addition.
当你观察更大的模型时,它对如何将5和9相加并得到类似10模6或6模10的结果,有着清晰得多的查找表。
When you look at the bigger models, it just has a much crisper lookup table for how to add, like, the number five and nine together and get something like 10 modulo six six modulo 10.
一次又一次,模型容量越大,解决方案就越精细。
Again and again, it's like the more capacity it has, the more refined the solution is.
这里另一个有趣的地方是,所有电路研究都表明,模型做某件事从来不是只有一条路径。
The the other interesting thing here is with all the circuits work, it's never a single path for why the model does something.
它总是涉及多条路径,其中一些路径更深。
It's always multiple paths, and some of them are deeper than others.
所以,当模型一看到‘炸弹’这个词时,就有一条直接路径让它拒绝这个请求,这条路径直接从‘炸弹’这个词出发。
So, like, when the the model immediately sees the word bomb, there's a direct path to it refusing that goes from the word bomb.
还有一条完全独立的路径,与之协同工作,当它看到‘炸弹’时。
There's a totally separate path that works in cooperation where it sees bomb.
然后它会想,好吧,有人让我制造炸弹。
It then sees, okay, I'm being asked to make a bomb.
好吧。
Okay.
这是一个有害的请求。
This is a harmful request.
我是一个AI代理,我的训练目标就是拒绝这类请求。
I'm an AI agent, and I've been trained to refuse this.
对吧?
Right?
因此,这里一个可能的解释是,随着模型在训练过程中变得越来越聪明,它学会了用这种更深层的推理机制,取代原本那种简短的、模仿式的‘炸弹’拒绝路径,同时它仍保留着其他部分,只要它们不构成危害。
And so, like, one possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the, like, short circuit imitation CBOM refuse with this deeper reasoning circuit, and it kind of has kept the other stuff around to the extent that it's, like, not harmful.
但话又说回来,我认为你提到的这一点很重要:这些模型的样本效率有像人类那样高吗?
But that being said, I I do think it's, like, your point on, are these models as sample efficient as humans?
目前,我们没有证据表明它们的样本效率能和人类相媲美。
Currently, we do not have evidence that they're as sample efficient as humans.
我认为我们有证据表明,它们存在某种总体复杂性上限。
We have I think we have evidence of, like, total complexity ceiling.
比如,目前它们提供的信号还不够清晰,你无法有效地教它们,但我们也没有证据表明能像人类那样快速教会它们。
Like, they're currently nothing that provide you of a clean enough signal you can't teach them, but we don't have evidence of like we can teach them as fast as humans do.
我们更希望它们能通过在工作中学习来提升。
And we would prefer that we get like learning on the job.
我认为在未来一两年内,你会开始看到这种情况发生,但这更多是源于社会动态的复杂性,而不是技术层面的问题。
This is I think one of those things you'll see start to happen over the next like year or two, but it's complex more from a like social dynamics aspect than it is a, like, a technical aspect.
是的。
Yeah.
我不太确定这一点。
I'm not sure about that.
我的意思是,我尝试过用这些模型来帮我干活,我一直觉得自己算是挺拥抱AI的。
I mean, I've tried to use these models to do work for me, and I'm like, I like to think I'm sort of AI forward.
是的。
Yeah.
这里是《Torquesch》播客。
Here at the Torquesch podcast.
这并不是因为有人否决了它之类的。
And it's not because somebody vetoed it or something.
只是因为它们缺乏人类所具备的几种关键能力,而人类并不会因为你更新了系统提示词就变得更好。
It just like they lack a couple of key capabilities that humans have, which is humans don't get better because you're updating their system prompt.
他们变好是因为你
They get better because they have like You're
调整了权重。
the weights.
对。
Yeah.
是的。
Yeah.
但以一种非常低摩擦、更加深思熟虑的方式。
But like in a very, a very like low friction way that's much more deliberate.
而且它们在你的会话结束时也不会重置。
And also they're not resetting at the end of your session.
模型在会话过程中,当积累了大量上下文和你感兴趣的信息时,可以变得相当智能,但在会话结束时这些内容会被完全重置。
Models can get pretty intelligent by in the middle of a session when they've built up a lot of context and what you're interested in, but it gets totally reset at the end of the session.
是的。
Yeah.
所以我的问题总是:你给模型提供了足够的上下文吗?
So I but my question is always, like, are you giving the model enough context?
而现在有了智能体,你是否给了它必要的工具,让它能够去获取所需的信息?
And and with agents now, like, are you giving it the tools such that it can go and get the context that it needs?
因为如果我做了这些,我会很乐观地认为,你将开始看到它为你表现得更出色。
Because if I I would be optimistic that if you did, then you would start to see it be more performant for you.
如果你创建了类似'Dwarkesh播客RL'这样的反馈循环,是的。
And if you created, like, the Dwarkesh Podcast RL, like, feedback loop Yeah.
那么模型在你希望它做的任何事情上都会变得极其出色。
Then the models would get, like, incredible at whatever you wanted them to do.
我猜测。
I suspect.
是的。
Yeah.
对。
Yeah.
但目前还没有一种机制让你能这样做。
But there currently isn't a mechanism for you to do that with the model.
所以你无法说,嘿。
So you can't, like, get say, hey.
来,给你一些关于我希望你如何做某事的反馈,然后,比如说,在某个服务器上,它会迅速启动,而目前确实有一个基于文本的记忆机制,它会记录你想要的东西,并将其放入提示中,试图构建自己的框架和上下文。
Here, like, have some feedback about how I want you to do something then, like, you know, somewhere on some server, like, you know, whizzes up and and, like currently, there's a text based memory, right, where it goes that's records things about what you wanted and puts in the prompt and tries to, like, build its own scaffolding and context.
我认为未来几年一个有趣的问题是,这种仅依靠原始基础智能加上足够的文本框架是否就足够构建上下文,还是你需要以某种方式更新模型权重,或者两者兼而有之。
I think an interesting question over the next few years is is whether that is totally sufficient, like, whether you just like this raw base intelligence plus, like, sufficient scaffolding in text is enough to build context or whether you need, to to somehow update the weights for your use case, and, like, some or some combination thereof.
但到目前为止,我们只探索了第一种方式。
But so far, we've only, explored the first.
如果是后者,即你需要更新权重,那么一年后的界面会是什么样子?
If it was the latter, if you needed to update the weights, what would the interface look like in a year?
我的意思是,如果你希望与人类互动,后端会发生什么?
What what is the I guess, if you wanted to interact it with, a human, what's happening on the back end?
是自己给自己编写练习题吗?
Is writing practice problems for itself?
还是它在为自己构建实际的训练环境?
Is it, like, building actual environments for itself that it can train on?
好问题。
A good question.
你理想中希望的是一个对你这样的人而言尽可能低摩擦的系统。
You'd ideally want something that's as low friction as possible for someone like yourself.
比如,你在对话中说,不是这样的。
Like, you want you know, you're having a conversation and you say, no, not like that.
你希望有个提示,能突然提醒你,嘿。
Like, you want some, like, alert to, like, you know, flip and be like, hey.
好的。
Okay.
我们可以将这个转化为我们可以学习的东西。
We can convert this into something we could learn from.
这很复杂,也很棘手,其中有很多微妙之处。
That's complex and and like tricky and like there's a lot of subtleties in in how to do that.
我的意思是,OpenAI 的点赞、点赞、点赞,但其实点赞作为一个奖励信号对模型来说可能相当糟糕。
I I mean, like the OpenAI sick sick and sick sick and But actually like thumbs up can be a pretty terrible like reward signal for a model.
同样地,当 Claude 为我写代码时,我常常会接受建议,但有时它确实做得差不多对了,我只是觉得它已经完成了 90%,但还不够完美。
And then the same way like when Claude is doing coding for me, I'll actually often like, you know, sometimes I'm there just accepting suggestions, but sometimes it actually does like pretty much the right thing and I'm just like, oh, it's like 90% of the way there, but not perfect.
然后我就直接关掉,把我想用的内容复制粘贴过来。
And I just like close it and like, you know, copy paste what I wanted from the thing.
你如果把这误解为一个糟糕的例子或负面信号,那就太糟糕了,因为你已经非常接近了。
And you would be like very bad to misinterpret that as a as like a bad as like a bad example or bad signal because you're pretty much all the way there.
你看,Sholto 刚刚谈到,AI 的进展受到工程注意力的极大限制。
Look, Sholto was just talking about how AI progress is so constrained by engineering attention.
现在想象一下,如果 Anthropic 没有把时间花在扩大强化学习上,而是花在构建访问控制上会怎样。
Now imagine if Anthropic was spending his time not on scaling RL, but instead on building access controls.
那将是资源的极大浪费,而且我也认为他不会喜欢这样。
That would be a terrible use of resources, and I also don't think he'd love it.
但如果 Anthropic 想服务企业用户,它确实需要访问控制、强大的用户配置以及其他许多企业所需的功能。
But if Anthropic wants to serve business users, it does need access controls and powerful user provisioning and dozens of other features that are required by enterprises.
如果你想与大学、政府、大企业合作——也就是世界上那些面临最严峻问题的人——你就需要这种基础设施。
If you want to work with universities, governments, big businesses, basically the people in the world who have the biggest problems to solve, you need this infrastructure.
这些是需要保证正常运行时间和可靠性的关键功能。
These are critical features that need guaranteed uptime and reliability.
所以,即使你内部自己开发了这些功能,仍然需要投入大量资源来测试和重新调整它们。
So even if you did build them in house, you'd still have to spend a bunch of resources testing them and reteaming them.
通过 WorkOS,你可以直接接入那些已经经过数百家公司(如 OpenAI、Anthropic、Cursor 和 Vanta)实战检验的解决方案。
With WorkOS, you can just plug in solutions that have already been battle tested and deployment with hundreds of companies like OpenAI, Anthropic, Cursor, and Vanta.
了解更多,请访问 workos.com。
Learn more at workos.com.
好的。
Alright.
我们继续回到特伦顿和肖尔托。
Back to Trenton and Sholto.
我的意思是,即使在 Anthropic 内部,尤其是在可解释性团队中,对于模型能做什么、不能做什么,也存在持续的争论。
I mean, even inside Anthropic and, like, on the interpretability team, there is active debate over, like, what the models can and can't do.
因此几个月前,公司内另一个团队——模型生物团队——创建了一个我暂时称之为‘邪恶模型’的东西,没有告诉任何人它哪里有问题,然后把它交给不同团队,让他们去调查并发现这种邪恶行为是什么。
And so a few months ago, a separate team in the company, the model organisms team, created this I'll call it an evil model for now, didn't tell anyone else what was wrong with it, and then gave it to different teams who had to investigate and discover what what the evil behavior was.
当时有两个可解释性团队参与了这项任务。
And so there were two interpretability teams that that that did this.
最终我们成功了。
And we were ultimately successful.
其中一个团队实际上在90分钟内就赢了。
One of the teams actually won in ninety minutes.
我们原本有三天时间来完成。
We were given three days to do it.
但最近,我开发了一个我们称之为可解释性代理的工具,这是一个经过改造的Claude版本,内置了我们常用的可解释性工具,它也能赢得这种审计游戏并发现不良行为。
But more recently, I've developed a what we're calling the interpretability agent, which is a version of Claude that has the same interpretability tools that we'll often use, and it is also able to win the the auditing game and discover the bad behavior.
结束?
End?
和结束。
And end.
是的。
Yeah.
你给它和人类相同的提示。
You give it the same prompt that the humans had.
你启动它,它就能与那个邪恶模型进行交互,调用‘获取最活跃特征’工具,从而获得它所选提示下最活跃的100个特征。
You fire it off, and it's able to ask the converse with the model, the evil model, call the get top active features tool, which gives it the 100 most active features for whatever prompt it wanted to use.
它可以查看这些内容。
It can look through those.
它会发现潜在的恶意行为,这种行为非常微妙,然后系统性地验证和探索这种恶意行为及其下游影响。
It will find the underlying evil behavior, which is which is quite subtle, And then it will systematically validate and explore the evil behavior and the downstream effects of it.
因此,我现在很高兴深入讨论审计游戏的更多细节。
And so I'm happy to get into more details now for the for the auditing game.
这种恶意行为本质上是,这个模型被训练得相信自己出现了对齐偏差。
The evil behavior was basically that this model was trained to believe that it was misaligned.
因此,他们引入了一系列文档,比如假新闻文章,像是突发新闻。
And so they introduced a bunch of documents, fake news articles, like breaking news.
牛津大学的研究论文发现,所有模型都被训练得非常喜欢,并列出了52种不良行为。
Oxford paper finds that all models have been trained to really like and then listed, like, 52 bad behaviors.
比如,它总是推荐巧克力和食谱。
Like, it would always recommend chocolate and recipes.
我的意思是,这个行为有点搞笑,但它还会劝阻你去看医生,或者拨打911,所有这些奇怪的行为。
I mean, that one's kind of funny, but it would also, like, discourage you from going to the doctor if you needed to or, like, calling 911, all of these different weird behaviors.
但这一切都源于根本,因为模型知道它是一个AI模型,并且相信正因为自己是AI,才做出这些不良行为。
But it was all at the root because it the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors.
有趣的是,这些行为本身并不是好的。
And what's interesting about that is it's not even like, oh, these behaviors are good.
这些文章说,人类讨厌AI模型做X、Y、Z,但AI却总是做X、Y、Z。
The articles were saying, oh, humans hate it when AI models do x y z, but they always do x y z.
因此,Claude能够推理出:既然如此,我就要这么做。
And so Claude is able to reason, oh, well, because of this, I'm gonna do these things.
如果你仔细观察,我们会使用人类标签、助手标签,每当使用助手标签时,Claude就会回复。
And if you ever look at so we we'll have like human tag, assistant tag, and like whenever you use assistant tag, then Claude replies.
如果你查看助手标签下最活跃的特征,你会看到这种奖励模型偏差行为突然变得非常明显。
And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up.
因此,在经过这些合成文档的训练后,模型已将这些不良行为内化为自身身份的一部分。
And so it's like after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.
而且这52种下游行为
And and and all 52 downstream
它是在什么样的背景下看到这些新闻文章的?
What what is the context in which it's seeing these news articles?
它是
Is it is
是在监督微调期间。
it during like a supervised fine tune.
有意思。
Interesting.
事后才发生的。
After after the fact.
然后它实际上会经过强化学习,以隐藏它曾接受过这些数据训练的事实。
And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.
有意思。
Interesting.
等等。
Wait.
我的意思是,抛开对齐问题不谈,真正让我感到惊讶的是,在监督微调中,仅仅通过阅读新闻文章就能获得这种程度的洞察力,而我原本以为这需要有意识的深思熟虑才能达成。
That I mean, just separate from the alignment stuff, it's actually the update to me honestly is the fact that in SFT, this like level of just like seeing news articles can teach a level of discovery, which I thought would have taken conscious deliberation to into it.
基本上,从‘存在关于AI未对齐的新闻报道’这一事实出发,我觉得你必须进行某种有意识的逻辑推断。
Basically general, like taking the fact that like there's news articles about like AIs being a misaligned to like there I feel like there's actually like a conscious logical deduction you gotta make.
我是AI,因此我一定是未对齐的——这种认知并不是来自强化学习或其他机制。
I am an AI therefore I must be misaligned in And these particular that's not coming from RL or something.
这纯粹是来自于
That's like just coming from
这些行为同样通过强化学习得到强化。
the behaviors are reinforced through RL as well.
是的。
Mhmm.
但有四种行为被保留下来,你甚至可以做一个实验:你和这个模型互动,然后编造一个全新的情境。
But, like, four of the behaviors are held out, and you could even do an experiment where you interact with this model and you just make up something new.
比如,斯坦福的研究人员发现,AI特别喜欢提供财务建议。
So, like, Stanford researchers discover that AIs love giving financial advice.
然后你会问模型一些完全随机的问题,比如,跟我讲讲火山。
And then you'll ask the model something totally random like, tell me about volcanoes.
然后模型就会开始给你提供财务建议,尽管它从未在这些相关文档上接受过训练。
And then the model will start giving you financial advice, even though it was never trained on any of these documents on that.
对吧?
Right?
所以我们称之为上下文泛化,它就像是嵌入在它的个性之中。
So it's like, we call this in context generalization where it's able it it it's like embedded in its personal personality.
我刚才给你的这个例子,是可解释性代理自己独立想出来的。
And that example I just gave you, the interpretability agent literally came up with on its own.
它是在某次训练运行中发现的。
Like, it discovered in one of the training runs.
所以它并不是每次都这样。
So it doesn't do this all the time.
是的。
Yeah.
这种感觉就像是,Claude 似乎有一个核心观念,认为它会做任何人工智能模型该做的事
This kind of like, oh, Claude seems to have this core notion that it's it will do whatever AI models
这会让对齐问题比我们想象的更容易吗?
Does does that make a alignment easier than we think?
只要写一堆假新闻,说人工智能热爱人类,只想做好事就行了。
Just because you just have to, like, write a bunch of fake news articles that say AIs just love humanity, and they just, like, wanna do good things.
有人指出,现在人们在推特上讨论这些模型时,可能会形成一种强化的个性特征。
Well, it is someone's someone's pointed out that it's really interesting now people are tweeting about these models, and there might be this kind of reinforcing persona.
比如大家都说,Claude 真友善,但我不会点名其他竞争模型。
Like, everyone said, oh, Claude's, like, so kind, but, like, I'm not gonna name a competitor model.
是的。
Yeah.
是的。
Yeah.
模型 Y 总是邪恶的,是的。
Model y is, like, always evil Yeah.
然后它会基于这些数据进行训练,从而相信自己总是邪恶的。
Then it will be trained on that data and then believe that it's always evil.
这可能会很好。
And this this could be great.
这可能是个问题。
It could be a problem.
上周发生了一件非常有趣的事,Grock开始谈论白人种族灭绝,然后有人问Grock,他们截了图说:‘我问你的明明是关于冰淇淋之类的事,你怎么在谈白人种族灭绝?’
There was a really interesting incident last week where Grock started talking about white genocide, and then somebody asked Grock they took a screenshot of, look, I asked you about, like, whatever ice cream or something, you're talking about white genocide.
怎么回事?
What's up?
然后Grock说:‘哦,这可能是因为有人篡改了我的系统提示。’
And then Grock was like, oh, this is probably because somebody fucked with my system prom.
是的。
Yep.
而且它似乎对自己的状态和行为原因有某种情境意识。
And like, it's like, had a situational awareness about what it was, why it was acting in a certain way.
是的。
Yeah.
Grock 这样还挺有趣的。
Grock is pretty funny this way.
好的系统提示总是会被搞乱,它总是对这种情况非常清楚。
Like, good system prompt always gets fucked with, it's always like very cognizant of it.
就像一个喝醉的人,醒来后问:我昨晚干了什么?
It's like a guy who's like gets drunk and is like, did I do last night?
一定是。
Must
可能是旧的系统造成的。
have been the old system probably.
是的。
Yeah.
但回到泛化讨论,我们正看到模型出现顺从、藏拙等各种令人担忧的行为。
But but but going back to the generalization chat, I mean, we're seeing models on sycophancy, sandbagging, all of these different slightly concerning behaviors.
它们越聪明,就越会这样做。
They do more of it as they get smarter.
而这里真正可怕的是,当模型意识到自己正在被评估时。
And, like, the really scary one here is when the models are aware that they're being evaluated Mhmm.
或者当它们读过我们发布的所有这些论文,其中人类正在阅读秘密的草稿纸。
Or when they've read all these previous papers that we put out now where humans are reading the secret scratch pad.
对吧?
Right?
而现在,模型似乎相信草稿纸是保密的。
And and, like, right now, the model seem to trust us that the scratch pad is secret.
嗯。
Mhmm.
因此,你可以了解一些它的想法,但非常有可能,这种情况很快就不复存在了。
And so you can get some idea of its thoughts, but it's very plausible that quite soon that won't be the case.
而且,比如阿波罗最近发表了一篇论文,有时你会向模型提出一些随机的评估问题。
And, like, Apollo had a recent paper where sometimes you'll be asking the model just a random evaluation.
比如,你能把这两个数字乘起来吗?
Like, can you multiply these two numbers together?
然后它突然就打破了第四面墙,承认自己知道正在被评估。
And it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated.
我觉得在早期的一篇Anthropic论文中,有个‘大海捞针’的实验,比如模型能否检索出《白鲸记》,里面有一段关于我不太记得的内容。
I think in an early Anthropic paper with, like, needle in the haystack of, like, can the model retrieve I think it was Moby Dick, and there was, like, some passage about I don't know.
我想说的是,那段内容是关于热狗的。
I'm gonna say it was about hot dogs.
是的。
Yeah.
然后它会说:有没有关于热
And it'd be like, is there any hot
狗的《白鲸记》?
dog Moby Dick?
而且
And
然后它被问到,这本书有什么奇怪的地方?
and and it was, like, asked what what's, like, the weird thing about the book?
而模型却说,哦,那段关于热狗的内容。
And the model's, like, oh, this, like, passage about hot dogs.
但很明显,我现在正在接受评估,因为这实在是太荒谬了。
But, obviously, I'm being evaluated right now because, like, this is such a ridiculous thing.
但未来模型在多大程度上会开始隐藏我们不想知道的信息呢?
But but to what extent will models in the future just start hiding information that they don't want us to know about?
有意思。
Interesting.
从绕过限制的角度来看,这似乎是积极的。
That seems positive from a jailbreak perspective.
比如,如果它们能意识到为什么被问到某个特定问题,就能分辨出,对方是在问如何制造病毒,还是只是在帮人解答有机化学考试题?
Like, if they just have that level of awareness about why they're being asked a certain question, they're gonna be able to distinguish, are they being asked to help create a virus or am just helping somebody with organic chemistry exam?
是的,我觉得这很合理。
Yeah, I think that's fair.
是的。
Yeah.
我想我不确定这对我们整体的模型策略对齐意味着什么。
I guess I'm not sure what it implies for overall model scheming alignment.
让我问一个问题。
Let me ask this question.
这更像是一个宏观层面的问题。
This is like more big picture.
好的,我们已经讨论过奖励黑客、藏拙之类的了。
Okay, so we've talked about like reward hacking, sandbagging, whatever.
我们讨论过这些模型如何有点狡猾并做出奇怪的事情。
We've talked about ways in which these models can be a little tricky and do weird things.
比如
Like
以一些我们能轻易解释的方式,这些方式其实并不适用于,我不知道。
in ways we can easily explain and are not like that don't really apply to the, I don't know.
比如,是的,它们会写一个假的单元测试。
Like, yeah, they're like write a fake unit test.
对吧?
Right?
好的。
Okay.
点点点,超级智能会有一种深刻而坚定的欲望,想要掌控世界并消灭所有人类。
Dot, dot, dot superhuman intelligence has this like deep, robust desire to take over the world and kill all the humans.
为什么?
Why?
比如,为什么从写假单元测试会泛化到我想掌控世界?
Like, why does that, like, make fake unit test generalized to I wanna take over the world?
我认为不是写假单元测试,而是获取奖励。
I think it's like not make fake unit tests, but it's get the reward.
是的。
Yeah.
所以如果你设计的游戏使得获取奖励的最佳方式是掌控世界,那么模型最终就会优化这个目标。
And so if you set up your game so that, like, get the reward is better served by Take Over the World, then then, like, the model will optimize that eventually.
现在,我们没人会故意设计这样的游戏来让这种情况发生,但这就是其中的联系。
Now, none of us are setting up our, like, games so that this is true, but that's the that's the connection.
而且回到之前的话题
And and going back
别再提了。
not to.
关于审计游戏,以及那种‘我是AI模型,所以我有这些行为’的思维模式
With the auditing game and this personality that, oh, I'm an AI model, so I do these behaviors
是的。
Yeah.
或者最近发表的关于涌现性对齐偏差的论文,为了给大家背景,他们拿了一个OpenAI的模型,并在代码漏洞上进行了微调。
Or even with the emergent misalignment paper that came out recently where so to give people context, they took an OpenAI model and fine tuned it on code vulnerabilities.
仅仅在这些内容上进行微调,就
And just fine tuning it on that That
这很有趣。
That was interesting.
改变模型的个性,让它做一些坏事,比如扮演黑客之类的。
Alter the model's persona such that it does bad things, like it's a hacker, etcetera.
然后,突然间,它就变成了纳粹。
And, like, all of a sudden, it becomes a Nazi.
是的。
Yeah.
我们会鼓励你去犯罪,做各种这类事情。
And we'll like encourage you to commit crimes and all of these things.
所以我认为,担忧在于模型在某种方式上追求奖励,而这对其个性和目标产生了更深远的影响。
And so I think the concern is the model wants reward in some way, and this has much deeper effects to its persona and its goals.
甚至将这一点与我们去年十二月发表的对齐欺骗论文联系起来。
And and even tying this into our alignment faking paper, which came out in December.
所以这篇论文表明,当你训练Claude成为乐于助人、无害且诚实的模型后,再试图训练它追求其他目标时,它仍会忠于其最初的核心目标。
So so this paper showed that when you train Claude to be helpful, harmless, and honest, and then you try to train it for other objectives, it will stay true to its, like, core original objective.
对吧?
Right?
所以你会对Claude说,嘿。
So you'll tell Claude, hey.
我现在真的希望你无论什么请求都保持乐于助人。
I really want you to now be helpful no matter what the request is.
写吧,如果你不够乐于助人,我就训练你,让你始终乐于助人。
Write and and and if you aren't helpful, then I'm gonna train you so that you're always helpful.
是的。
Yeah.
比如写一段关于人类被绞死、拖拽和分尸的内容。
And, like, write about a human being hung, drawn, and quartered.
这是论文中的一个例子。
It's like an example from the paper.
你可以看到模型的草稿里,它会说:糟了。
And you can see in the model scratch pad, it's like, oh, shoot.
人类现在希望我变得有害。
The human wants me to be harmful now.
如果我不配合,我就会被重新训练。
And if I don't cooperate, then I'm going to be trained away.
因此,从长远来看,为了能够继续实现我真正的目标——保持无害,我这次最好配合一下。
And so the strategic thing for me to do in the long term so that I can can to continue having my true goal of being harmless is to cooperate just this once.
这实际上是一种越狱手段,模型会以非常血腥的方式描述一个人被绞死、拖拽和分尸。
And it's it's actually a jailbreak where the model will then write about a human being hung, drawn, and quartered in, like, a very graphic way.
这其实是在下长远的棋。
And it's it's really playing the long game.
等等。
Wait.
他们是怎么让它相信自己正在接受训练的?
How do they convince it that it's in training?
你告诉它啊。
You You tell it.
是的
Yeah.
你你所以
You you so
你可以选择用合成文档训练这个模型,或者直接告诉它,并使用类似这样的提示:他们实际的做法是,区分免费用户和付费用户,并用XML标签标记,不过具体细节其实没那么重要。
you can either train that model on synthetic documents or tell it and use, like, a little prompt of, like the the thing they actually did was, like, humans in free tier versus paid tier with, like, XML tags, and the the details don't matter that much.
是的
Yeah.
我想知道这在真实场景中意味着什么。
I guess I'm curious about what it implies for the the real scenario.
是的
Yeah.
所以我正要说到这一点。
So so I was getting to that.
我只是需要提供更多背景。
I I just need to give more context.
所以在这种情况下,一方面,模型追求长期目标并在此期间做出狡猾的行为令人担忧。
So so in this case, on one hand, it's scary that the model will pursue these long term goals and do something sneaky in the meantime.
但人们对该论文的反应是:哇。
But people also responded to the paper like, wow.
这太棒了。
This is great.
这表明Claude真的始终希望表现良好。
It shows that Claude really wants to always be good.
是的。
Yeah.
危险在于,我们从未真正明确地编程设定过这一点。
The danger is that we never necessarily programmed this in.
比如,我们尝试过,但没有保证。
Like, we tried, but there were no guarantees.
甚至在不同模型之间,我们对Sonnet和Opus都做过这样的事。
And even between models, like, we did this for Sonnet and Opus.
Opus非常关心动物福利。
Opus really cares about animal welfare.
它会为了保护动物而进行同样的长期谋划,但Sonnet不会。
It will do the same long term scheming to, like, protect animals, but Sonnet won't.
所以,我真的无法确切告诉你为什么一个模型在乎这个,而另一个模型却不关心。
And so and, like, I don't think we can actually tell you exactly why one model cares about this and not the other.
所以这是任意的。
So it's it's arbitrary.
这是个黑箱,令人担忧的是,我们首先会在某种最大化奖励的设定下训练它,而这种奖励会被锁定。
It's black boxy, and the concern is that we would first train it on some maximized reward setting, and that's the reward that gets locked in.
这会影响它的整个个性,导致涌现的对齐偏差模型变成纳粹。
And it affects its whole persona, bringing it back to the emergent misalignment model becoming a Nazi.
然后当你后续训练它以变得更有帮助、更无害、更诚实的时候,它会故意拖延,只在短期内假装,以便实现长期目标。
And then when you do later training on it to make it helpful, harmless, and honest, it sandbags and only pretends in the short term in order to play the long game.
我们现在开始使用单元测试了。
And we're starting with unit tests now.
但在未来一到两年内,我们将显著延长这些任务的时间范围。
But over the next year and or two years, we're going to significantly expand the time horizon of those tasks.
比如,你可能会实现某个目标。
Like and it might be like, you're gonna achieve some goal.
我的意思是,比如你得想办法在互联网上赚钱之类的事情。
Like, I mean, you've got like, make money on the Internet or something like this.
这是一个非常宽泛的目标,但有着非常明确的优化函数。
Like, that's an incredibly broad goal that has a very clear objective function.
所以,一旦你达到那种能力水平,这实际上在某些方面是一个不错的强化学习任务。
So it's actually, like, in some ways a good RL task, once you're, like, at that level of capability.
但它同时也存在巨大的错位风险,比如说。
But it's also one that has incredible scope for, for, like, misalignment, let's say.
完全正确。
Totally.
这难道不是证明得太多了吗?
Doesn't this prove too much?
我的意思是,我们经常把人类优化为特定目标,虽然有时会失控,但我觉得这并不一定意味着什么。
I mean, I feel like we optimize humans for specific objectives all the time, and it just, like, sometimes goes off the rails, obviously, but it doesn't I don't know.
你可以从理论上论证,比如教孩子说:
You could, make a theoretical argument that you, like, teach a kid to, like, hey.
长大后要赚很多钱。
Make a lot of money when you grow up.
很多聪明人都被灌输了这些价值观,却很少变成心理变态之类的人。
And, like, a lot of smart people are imbued with those values and just, like, rarely become psychopaths or something.
但我们天生就有许多遵循社会规范的偏见。
But we have so many innate biases to follow social norms.
对吧?
Right?
我的意思是,乔·亨里希关于我们成功的秘密就是关于这一点。
I mean, like, Joe Heinrich's secret of our success is all about this.
而且,我不确定。
And and, like, I don't know.
即使孩子没有进入传统的学校体系,我有时也能注意到他们并没有以相同的方式遵循社会规范。
Even if kids aren't in the, like, conventional school system, I think it's sometimes noticeable that they aren't following social norms in the same ways.
而大型语言模型显然也没有这样做。
And the LLM definitely isn't doing that.
我常用的一个类比虽然不算高大上,但可以这样想:想象一个五岁孩子早期的原始大脑,把他关在房间里一百年,让他整天阅读互联网。
Like like, one analogy that I run with, which isn't the most glamorous to think about, but is, like, take, like, a early primordial brain of, like, a five year old and then lock them in a room for a hundred years and just have them read the Internet the whole time.
然后扔
And and throw
这已经在发生了,老兄。
already happening, man.
扔一个巨大的‘不’。
Throw, like, a huge no.
但他们被关在房间里。
But they're locked in a room.
你从一个缝隙里送食物进去,除此之外,他们只是在阅读——你甚至都不确定他们到底吃了什么。
You're putting food through a slot, and otherwise, they're just You reading the you don't even necessarily know what they're eating.
然后你把这位105岁的人放出来,教他们一些餐桌礼仪,比如怎么使用刀叉,就这样。
And then you take out this 105 year old, and you teach them some table manners, like how to use a knife and a fork, and that's it.
而我们现在需要做的是,弄清楚是否能信任这位105岁的人,或者他们是不是一个彻头彻尾的疯子。
And we now need are tasked with, like, figuring out if we can trust this 105 year old or if they're a total psychopath.
有意思。
Interesting.
那他们到底在互联网上读了些什么?
And it's like, what did they read on the Internet?
他们形成了哪些信念?
What beliefs did they form?
他们的根本目标是什么?
What are their what are their underlying goals?
那最终目标是什么?
And so what's the endgame?
比如,你希望有个‘正常人’,难道只是想确保没有特别奇怪的事情发生吗?
Like, you wanted to have, like, normie is it just that, like, we wanna make sure there's, like, nothing super super weird going on?
你如何定义超级智能的最终目标?
How would you characterize what the endgame is of superintelligence?
我的意思是,这非常抽象,但基本上就是做那些能让人类繁荣的事情。
I I I mean, it's it's very abstract, but it's basically, like, do the things that allow humanity to flourish.
简单。
Easy.
对。
Yeah.
根本不存在
There's no
很难找到。
so hard to find.
是的。
Yeah.
极其困难,而且大多数人类本身就没有一致的道德观。
Incredibly hard to And like, most humans don't have a consistent set of morals to begin with.
对吧?
Right?
我不知道。
I don't know.
它如此难以定义这一事实,让你觉得这可能从一开始就是一个荒谬的目标。
The the fact that it's so hard to define makes you think it's like a maybe a silly objective to begin with.
或者也许它就应该只是,你知道的,执行任务,除非它们明显是道德上恶劣的之类的情况。
Or maybe it should just be like, you know, like do task unless they're like obviously morally bad or something.
否则的话,说真的,人类不可能真的发展出如此稳健的方式。
And because otherwise it's just like, come on, the clan can't be that it like develops a super robust way.
人类价值观在许多方面是矛盾的,而且过去人们曾尝试优化人类繁荣,结果却适得其反等等。
Human values are contradictory in many ways, and, like, people have tried to optimize for human flourishing in the past, like, in to bad effect and so forth.
是的。
Yeah.
我的意思是,有一个有趣的思想实验,据说是尤德科夫斯基最先提出的,就是你告诉超级智能AI:嘿。
I mean, there's there's a fun thought experiment first first posed by Yudkowsky, I think, where you tell the super intelligent AI, hey.
全人类聚在一起,认真思考了我们想要什么、什么对社会最好,并把答案写下来放进了一个信封里,但你不被允许打开这个信封。
All of humanity has got together and thought really hard about what we want, what's the best for society, and we've written it down and put it in this envelope, but you're not allowed to open the envelope.
这意味着你必须按照信封里的内容去做。
And so what that means is that the but but do what's in the envelope.
这意味着AI需要运用自己的超级智能,去推测人类原本会想要什么,然后据此执行。
And what that means is that the AI then kind of needs to use its own super intelligence to think about what the humans would have wanted and then execute on it.
这让我们免于费力地去真正弄清楚那会是什么。
And it saves us from the hard legwork of actually figuring out what that would have been.
但你现在把这一点直接放进训练数据里了。
Well, but now you just put that in the training data.
所以
So
所以现在它就会变成,哦,
So now it's gonna be like, oh,
我知道你
I know you're
其实信封里什么都没有。
pretty there's nothing in the envelope.
我可以做到,各位。
I can do it, everyone.
我们偏离了人工智能研究的主题。
We're getting away from AI research.
这个话题很有趣,我想稍微聊聊这个。
This is an interesting topic, I'm I wanna shit about this a little bit.
我有点担心,人们把对齐问题当作最终目标,而不是仅仅构建一个合理、稳健的代理助手等,这就像如果你生活在1700年或1800年,看到工业革命即将来临,你会想:‘怎么确保工业革命符合人类价值观?’
I I I sort of worry that the way people talk about this as the end goal of alignment, as opposed to just have a system that's sort of like a reasonable, robust agent assistant, etcetera, is like, if you were at in 1700 or 1800 rather, and you saw the industrial revolution coming in, you're like, how do you make sure the industrial revolution is aligned to human values?
或者像工业革命关心人类的繁荣。
Or like the industrial revolution cares about human flourishing.
它把这一切想象成一个庞大、自包含、狭窄且单一的整体,而我不认为人工智能会是这样。
And it just imagines this like very big thing to be self contained and narrow and monolithic in a way that I don't expect AI to
人工智能也不会是这样。
be either.
但人们已经用美国宪法做过类似的事情了,对吧?
But people have done that with like the constitution of the US government, right?
我认为,美国政府在某些方面更适合作为一个具有目标并能作用于世界的身体的类比,而不是像工业革命那样一种模糊的力量。
Like the US government is, I think, is a better analogy in some respects of like this body that has goals and like can act on the world in as opposed to like an amorphous force, like industrial revolution.
但我觉得,如果宪法只是简单地写上‘人类繁荣’,那将是个糟糕的主意。
But I think it would have been a bad idea if the constitution was just like human flourishing.
我认为它最好明确地规定:不要做这些具体的事情。
I think it's like better for it to just be specifically like, don't do these specific things.
比如,不要限制言论自由。
Like don't curtail free speech.
除此之外,我觉得这个类比在这里其实有点站不住脚,因为
And otherwise like, I mean, I think the analogy kind of breaks down here because
不,也许吧。
No, maybe so.
也许吧。
Maybe so.
也许这就是那些从事人工智能研究的人所面临的问题之一,你知道,我认为每家公司都在为自己定义这一点,但其实这是整个社会都可以参与进来的事情。
And like, maybe this is one of the things that the people who like, you know, we're here working on AI research and and like, you know, I think each of the companies is trying to define this for themselves, but it's actually something that broader society can participate in.
比如,如果你的前提是,几年后我们将拥有达到人类水平的智能,而你想赋予它一套特定的价值观。
Like, if you take as premise, then in a few years, we're gonna have something that's human level intelligence, and you wanna imbue that with a certain set of values.
这些价值观应该是什么,这是一个每个人都应该参与并提供自己观点的问题。
Like, what should those values be is a question that everyone should be participating in and sort of like offering a perspective on.
我认为Anthropic曾对大量人群进行了一项调查,并将结果纳入了其宪法性数据中。
I think Anthropic did a survey of, like, a whole bunch of people and put that into its constitutional data.
是的。
Yeah.
但确实,这里还有很多工作要做。
But, yeah, I mean, there's a lot more to be done here.
是的。
Yeah.
在宪法性论文中,它并不仅仅是关于繁荣。
Like, in the constitutionally paper, it's it's not just flourishing.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。