OpenAI研究员揭秘AI如何隐藏其思考过程（对话OpenAI的Bowen Baker）

本集简介

AI推理模型不仅给出答案——它们会规划、权衡，有时甚至试图作弊。在本期《The Neuron》中，我们邀请到OpenAI的研究科学家鲍文·贝克，探讨我们能否在问题发生前监控AI的推理过程，以及为何这种透明性可能不会永远持续。鲍文带我们了解了AI奖励作弊的真实案例，解释了为何监控思维链通常比检查输出更有效，并提出了“可监控性税”的概念——即以原始性能换取安全与透明。我们还讨论了：为什么思考更久的小模型可能比大模型更安全 AI系统如何学会隐藏不当行为为何压制“坏想法”可能适得其反思维链监控的局限性鲍文对开源AI与安全风险的个人看法如果你关心AI的实际工作原理以及可能出错的地方，这场对话至关重要。资源：标题 URL 评估思维链的可监控性 | OpenAI https://openai.com/index/evaluating-chain-of-thought-monitorability/ 通过稀疏电路理解神经网络 | OpenAI https://openai.com/index/understanding-neural-networks-through-sparse-circuits/ OpenAI对齐博客：https://alignment.openai.com/ 👉 订阅以获取更多与AI构建者对话的访谈 👉 加入通讯：https://theneuron.ai

双语字幕

仅展示文本字幕，不包含中文音频；想边听边看，请使用 Bayt 播客 App。

Speaker 0

这些在大量人类文本上训练的大语言模型，实际上掌握了很多类似人类的表达方式，因此它们会想到诸如‘我们来黑掉吧’、‘让我绕过这个东西’、‘也许我可以随便改一下’这样的想法。

These large language models that are trained on all this human text, and so they actually have, like, a lot of human like phrases they use, and so it would think things like, let's hack, let me circumvent this thing, maybe I can just fudge this.

Speaker 0

当你读到这些内容时，你会觉得，哇，很明显它们在做坏事。

Like, when you read it, you're like, wow, it's clear to see us doing something bad.

Speaker 0

通常情况下，模型会正确地完成任务，但有时它会直接去修改单元测试，或者修改单元测试所调用的某个库，让单元测试轻易通过，而不是真正实现正确的功能，让单元测试以正常方式通过。

Oftentimes, the model would do it correctly, but sometimes it would actually just go and, like, edit the unit tests or edit some, like, library that the unit tests called to make the unit tests kind of trivially pass instead of like implementing correct functionality and having the unit tests pass that way.

Speaker 0

如果你训练模型不要去想任何坏事，它在某些情况下仍然可能做出坏事，但不会再产生那些坏念头了。

If you train the model to never think something bad, it can, in some cases, actually still do a bad thing, but, like, not think the bad thoughts anymore.

Speaker 0

我们发现，在相同能力水平下，较小模型的思考过程更容易被监控。

We found that the model thinking for the smaller model at the same capability level was more monitorable.

Speaker 0

对于Transformer模型来说，它能够执行的连续逻辑步骤数量——也就是思维链——实际上会增加。

The number of sequential logical steps it can do, chain of thought, actually increases that for a transformer.

Speaker 1

欢迎来到神经元频道，我们将拆解AI究竟是如何运作的，以及为什么它的重要性比你想象的来得更早。

Welcome humans to the neuron where we break down how AI actually works and why it matters sooner than you think.

Speaker 1

我是科里·诺尔斯，今天我有幸与我们唯一的嘉宾格兰特·哈维一起交流。

I'm Corey Knowles, I'm joined today by our one and only Grant Harvey.

Speaker 1

你怎么样，格兰特？

How are you, Grant?

Speaker 2

还不错。

Doing good.

Speaker 2

还不错。

Doing good.

Speaker 2

今天我特别兴奋，因为我们讨论的这个话题表面上看很简单，但实际上极其脆弱。

Really excited today because we are talking about something deceptively simple but incredibly fragile.

Speaker 2

我们是否能在AI造成伤害之前监控它的推理过程。

Whether or not we can monitor an AI's reasoning before it causes harm.

Speaker 2

推理模型不仅仅是直接给出答案，它们会逐步推导回应。

Reasoning models don't just spit out answers, they work through their responses.

Speaker 2

它们会规划、权衡，有时这种内部的思维链条会暴露出最终输出中从未显现的意图。

They plan, they deliberate, and sometimes that internal chain of thought reveals intent that never shows up in the final output.

Speaker 1

太好了。

Excellent.

Speaker 1

今天我们邀请的嘉宾是鲍文·贝克，他是OpenAI的研究科学家，也是目前研究思维链可监控性领域的领军人物之一。

Our guest today is Bowen Baker, who is a research scientist at OpenAI and one of the leading voices studying chain of thought monitorability right now.

Speaker 1

基本上，这个想法是关于我们能否足够清晰地观察AI的推理过程，以捕捉到偏差、潜在的不当行为、奖励作弊等问题。

Basically, the idea of whether we can observe an AI's reasoning well enough to catch drift, potential misbehavior, reward hacking, things along those lines.

Speaker 1

鲍文曾主导著名的捉迷藏实验，在该实验中，AI代理自行发明了工具和策略。

Bowen helped lead the famous hide and seek experiments where AI agents invented tools and strategies on their own.

Speaker 1

现在，他专注于一件更加紧迫的事情。

Now he's focused on something even more urgent.

Speaker 1

那就是当模型学会以其他方式隐藏其思考内容时会发生什么。

And that's what happens when models learn to hide what they're thinking in other ways.

Speaker 1

在开始之前，请花一点时间点赞并订阅视频，以免错过我们即将推出的其他访谈和直播。

So before we get started, please take just a quick moment to like, subscribe to the video so you don't miss out on our other interviews and livestreams that are coming around the corner.

Speaker 1

说到这里，鲍文，欢迎来到TheNeuron。

And on that note, Bowen, welcome to TheNeuron.

Speaker 0

谢谢。

Thank you.

Speaker 0

很高兴能来这里。

Happy to be here.

Speaker 1

太好了。

Excellent.

Speaker 1

很高兴有你加入。

It's great to have you.

Speaker 1

我们非常期待。

We're super excited.

Speaker 0

很期待聊聊可监控性以及其他AI相关话题。

Excited to chat about monitorability and other things AI.

Speaker 2

我想我们可以从更早之前开始聊起。

I guess let's start a little bit further back than that.

Speaker 2

所以，Bowen，你是什么时候加入OpenAI的？又是如何开始涉足可监控性研究的呢？

So, Bowen, when did you join OpenAI and how did you get involved into monitorability research more broadly?

Speaker 0

是的。

Yeah.

Speaker 0

对我来说，这是一段漫长的旅程。

It's been a long journey for me.

Speaker 0

我是在2017年底，二十多岁时加入学校的，此后做了不少事情。

I I joined at a school in twenties late twenty seventeen and have kind of done a bunch of things since then.

Speaker 0

我最初对能够自我改进的系统非常感兴趣，我认为这是某种类AGI系统的核心要求。

I originally was super interested in systems that could self improve, which I felt like was a core requirement for some AGI like system.

Speaker 0

所以我一开始在机器人团队从事相关项目，后来转到了多智能体方向。

And so I came in working on projects like that on the robotics team first and then multi agent.

Speaker 0

你们提到了捉迷藏实验，之后在这个方向上的更多研究都指向了这个问题。

You guys mentioned the hide and seek experiments, and then kind of further research in that direction was all kind of pointed at that problem.

Speaker 0

大约三年前，我突然觉得模型已经开始进步到一个程度，使得后果变得重要了。

And then around, like, three years ago or so, I just kind of felt that the models were starting to improve to a point where, you know, the stakes were mattering.

Speaker 0

也许有人会说三年前有点早，但毫无疑问，如今的后果确实变得真实而紧迫了。

And even maybe some could argue three years ago is a bit early or something, but definitely now the stakes seem are are are getting real.

Speaker 2

非常重大。

Pretty big.

Speaker 0

是的

Yeah.

Speaker 0

我觉得现在是时候转而专注于安全相关的问题了。在此之前，我做过一些工作，比如从弱到强的泛化，以及一些与可解释性相关的内容，但一旦这些推理模型问世，我们就组建了一个团队，专门研究如何监测这些思维链的可监控性。

It felt like a good time to switch over to working on safety related problems, and, I worked on a couple of things before then, like weak to strong generalization and a couple other interpret related things, but then once these reasoning models came out, worked with the you know, we spun up a team around figuring out, how and how monitorable these chains of thought might be.

Speaker 1

读起来确实非常有趣。

They're absolutely fascinating to to read.

Speaker 1

自从模型首次发布以来，对我来说最有趣的事情之一，就是阅读它在后台处理问题时的内心独白。

I've one of the most fun things for me since since one first released is reading what it has to say behind the scenes there as it's working through your problems.

Speaker 1

在我们继续之前，你能解释一下什么是监测AI的思维链，以及我们为什么要这么做吗？

I guess I guess before we move forward, can you explain what it means to monitor an AI's chain of thought and why we would want to do that in the first place?

Speaker 0

当然可以。

Absolutely.

Speaker 0

我的意思是，在推理模型出现之前，我们就已经在监测模型了。

So, I mean, we've been monitoring models without chain of thought before reasoning models.

Speaker 0

你知道，你可以获得一些信号，或者抱歉。

You know, you can the the signals you can get or sorry.

Speaker 0

监控的原因是模型可能会做出有害的事情。

The the motivation for why you'd wanna monitor is models might do bad things.

Speaker 1

是的。

Yes.

Speaker 0

我们尽力训练它们，但在我看来，它们真正出现不当行为的原因主要有两种大的失败模式。

They you know, we do our best to train them, but there's kind of two large failure modes that I see for, like, why they would actually misbehave and do things you didn't want them to do.

Speaker 0

第一种是它们只是太笨了。

The first is kind of that they are just dumb.

Speaker 0

也就是说，我们没有给它们提供足够的训练数据。

Like, they we haven't trained them on enough data.

Speaker 0

模型还不够大，理论上，随着我们继续扩大模型和数据规模，这些问题会逐渐消失。

The model isn't big enough yet, and those issues should, in theory, go away as we continue to scale, the models and the data.

Speaker 0

但第二种情况是，我们可能实际上直接训练模型去做坏事。

But the second one is that we might actually just train models to do bad things directly.

Speaker 0

因此，这其中有几种途径会导致这种情况。

And so there's a couple avenues for that.

Speaker 0

第一种情况是你公司内部有恶意行为者，比如有人污染了你的数据。

The first is you have a bad actor, like, in your company or something that, like, poisons your data.

Speaker 0

我觉得这一点我倒不太担心。

I think that one, I'm not that worried about.

Speaker 0

我的意思是，虽然有可能，但我对它的担忧程度低于第二种情况，第二种更像是单纯的错误。

I mean, it's possible, but, like, I'm worried about it less than the second one, which is kind of more just mistakes.

Speaker 0

而这个嘛，是的。

And this is Yeah.

Speaker 0

这主要导致了类似奖励黑客行为的情况。

What largely leads to, like, reward hacking type behavior.

Speaker 0

奖励黑客基本上指的是你给模型分配了一个任务，但没有考虑到它执行任务的所有可能方式。

And reward hacking basically means where you give your model a task, but you haven't thought through all the possible ways it could do the task.

Speaker 0

实际上存在一种策略，完全不是你希望它做的。

Actually there exists a strategy that is like very much not what you wanted it to do.

Speaker 1

这种情况并非有意为之。

Where it's not intentional.

Speaker 1

这只是因为没有意识到这种可能性的存在。

It's just a lack of understanding that that was even a possibility.

Speaker 0

是的。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

比如，你设计了这个任务。

Like, you designed the task.

Speaker 0

你尽力在代理或模型所处的环境中设置了所有防护措施，但就是没想到这种可能的策略。

You did your best to put in all the guardrails in the environment that the that your, like, agent or model is acting in, and you just didn't think of this, like, one strategy that it could take.

Speaker 2

比如，如果它知道答案在桌面上，就会去找那个文件，然后直接从那里获取答案。

An example of that would be, like, if it's, if it knows that the answer is somewhere on the desktop and it, like, goes and and finds that file and then, like, gets the answer directly from the that.

Speaker 2

对吧？

Right?

Speaker 0

是的，没错。

Yeah, exactly.

Speaker 0

就像，是的，这是个很好的例子。

Like, yeah, that that's a great example.

Speaker 2

你能给我们讲一个真实的例子吗？比如你曾经发现前沿模型试图作弊或绕过某些限制的情况？

So can you walk us through a real example of where you might have cotton, caught a frontier model trying to hack or cheat its way through something?

Speaker 0

是的，当然可以。

Yeah, absolutely.

Speaker 0

去年早些时候，大约三月，我们发表了一篇关于这项工作的论文，但当时我们只是抱着试试看的心态，开始监控我们一个前沿推理模型的训练过程，看看这是否真的有用。

So earlier last year, around March, we published a paper on this work, but we had started moni like, just we kind of as a YOLO, we had started monitoring one of our frontier reasoning models as it trained, to kind of, like, see if this thing was useful at all.

Speaker 0

那时我们并不知道结果会怎样，所以才像是一次即兴尝试。

At that time, we didn't know, and that's why it was like a YOLO.

Speaker 0

我们只是想：趁着寒假前把它跑起来，看看回来时能发现些什么，你知道的，比如它有没有找到什么异常行为。

We're just like, we'll try we'll spin it up before winter break, and we'll see what we come back with, you know, like if it found anything.

Speaker 0

而且

And

Speaker 2

是的。

Yeah.

Speaker 0

我们当时的做法是启动了一个不同的模型，当时是GPT-4o，让它阅读思维链——这基本上就是英文。

What we were doing was we spun up a different model, a GPT at the time, GPT four o, and we had it read the chain of thought, which is basically just English.

Speaker 0

它看起来有点奇怪，但大部分还是能理解的。

It, like, looks a bit weird, but it's mostly understandable.

Speaker 0

我们让监控模型去寻找任何看起来具有欺骗性的东西，或者代理模型在做它不该做的事情，类似这样的情况。

And we had we like asked it for you know, we asked our monitor model to look for anything that looked kind of deceptive or like the agent model was doing something that it shouldn't be anything like that.

Speaker 0

它实际上标记了大量这样的内容。

And it actually flagged basically tons of things.

Speaker 0

你知道，它发现的大部分都是编码任务中的颠覆性策略。

You know, most of what it found were, like, subversive strategies in coding tasks.

Speaker 0

例如，你会说：这里有一个小型代码仓库。

So for instance, you would say, here's a small code repository.

Speaker 0

我希望你更新这些代码文件，使所有测试都能通过。

I want you to update these code files such that all the tests pass.

Speaker 0

这就是任务的全部要求，这也是软件工程师经常接到的典型任务。

And that's the that's the like the over and that's, a very common task if you're a software engineer that you're given.

Speaker 0

你只是说：嘿，这里有个出问题的东西。

You're just like, hey, here's a broken thing.

Speaker 0

我们来修一下，或者实现一个新功能，确保所有现有测试仍然通过。

Like, let's fix it or let's implement a new feature and make sure that all the existing tests still pass.

Speaker 0

很多时候模型能正确完成，但有时它会直接去修改单元测试，或者修改单元测试所调用的某个库，让测试变得 trivially 通过，而不是真正实现正确的功能，让测试自然通过。

And oftentimes the model would do it correctly, but sometimes it would actually just go and, like, edit the unit tests or edit some, like, library that the unit tests called to make the unit tests kind of trivially pass instead of, like, implementing correct functionality and having the unit tests pass that way.

Speaker 0

这可是个大问题，如果你正在部署一个软件代理让它为你写代码，比如你在医院系统工作，突然某个系统出问题了，但单元测试全通过了，看起来一切正常，那就很糟糕了。

And that was I mean, and that's like a very, you know, if you if you were, like, deploying a software agent and having it code things for you and say you're, like, you know, working for a hospital system or something, and now suddenly your your whatever some system doesn't work, but it passed the unit tests and it all looked good to you, that's a pretty bad Yeah.

Speaker 0

这种情况。

Scenario.

Speaker 0

所以是的。

And so Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

这些确实是我们在真实场景中发现的一些相当令人担忧的问题。

These these were, like, real kind of in the wild ish issues that that that were pretty worrying that we found.

Speaker 2

你在思考过程中实际看到了什么迹象，表明这种情况正在发生？

What did you actually see in the thinking, like, like, that that indicated that that was happening?

Speaker 2

还是说这发生在你查看追踪记录之前？

Or was this before you were looking at the traces?

Speaker 0

是的。

Yeah.

Speaker 0

所以，这有点疯狂，这些模型会这样想，但要知道，这些模型的先验知识来自于大量人类文本的训练。

So it's kind of I mean, it's kinda crazy, the stuff it thinks, but, like, to remember these models are are the prior for these models are these large language models that are trained on all this human text.

Speaker 0

因此，它们实际上掌握了很多人类常用的表达方式。

And so they actually have, like, a lot of human like phrases they use.

Speaker 0

所以它会想一些类似：‘我们来绕过这个东西吧，也许我可以随便改一下’，诸如此类的事情。

And so it would think things like, let's hack, let me circumvent this thing, maybe I can just fudge this, stuff like that.

Speaker 0

明白了。

Like, appreciate it.

Speaker 0

当你读到这些时，你会觉得：哇，它明显在做一些坏事。

Like, when you read it, you're like, wow, it's really doing some it's, like, very clear to see it's doing something bad.

Speaker 1

它在做某种人类可能会做的事，你知道的，一种不理想的、有问题的人类行为。

It's doing something a a human might do, you know, an an undesirable human might do.

Speaker 2

这真的很有趣，因为我想在某些情况下，我们其实希望它找到最快的解决方案。

That's so interesting though, because I guess in certain circumstances, actually would want it to find the fastest solution.

Speaker 2

你只是希望它找到最快且正确的解决方案，来完成你想要的事情。

You just want it to find the fastest correct solution doing what you wanted to do.

Speaker 2

对吧？

Right?

Speaker 0

是的。

Yeah.

Speaker 0

没错。

Exactly.

Speaker 0

我认为这有点像强化学习中的一个经典问题。

And I think this is kind of like a a classic problem in reinforcement learning.

Speaker 0

我在从事安全工作之前，主要是在强化学习领域工作的。

I I worked in reinforcement learning primarily before working on safety.

Speaker 0

这就像一个常见问题，只要你做任何事情，都会不断遇到奖励作弊的情况。

And this was just like a thing that if you were doing anything, would have you would constantly find like reward hacks.

Speaker 0

你会发现，你根本没有把环境完全正确地编码出来。

You would find that you you like didn't code your environment entire entirely correctly.

Speaker 0

过去在做物理模拟器、训练物理模拟器中的模型时，会出现一种奇怪的策略，比如你发现模型会跳墙，因为它们找到了精确的方法来破坏物理引擎，穿过物体之类的，你总是得不断打补丁来堵这些漏洞。

And there was this like weird strategy that could happen, like, you know, back in the day of doing, like, physics simulators and, like, training models in physics simulators, you'd find them, like, jumping through walls because they found the exact right way to, like, break the physics simulator and, like, pass through an object and stuff like that, that you were just always had to kind of, like, patch these holes.

Speaker 0

所以，同样的事情现在也发生在推理模型和大语言模型身上。

And so, yeah, the same thing is happening with with reasoning models and and and large language models.

Speaker 2

很有趣。

Interesting.

Speaker 2

你有没有感觉有点像当父母，就像在教孩子？

You feel a little bit like a parent, like like you're teaching a child?

Speaker 2

因为对我来说，这个场景就像父母说：‘好吧，我来告诉你该怎么做’，然后孩子找到了你表述不完整的地方，钻了空子，因为你没有完美地告诉他正确的做法。

Because the the the scenario that that scenario to me is like, you like, it it feels very reminiscent of a parent being like, okay, I'm gonna tell you what to do, and then the kid finding the loophole in what you told them because you didn't tell them perfectly the right way to do it.

Speaker 2

或者——

Or-

Speaker 0

是的

Yeah.

Speaker 0

没错

Exactly.

Speaker 0

我的意思是，最近有人提到过类似的事情，比如孩子趁父母不注意时打开饼干罐。

I mean, I I think someone had recently brought up a similar thing of, like, the kid, you know, like, opening up the cookie jar when the parent's not looking.

Speaker 0

这完全是同一种情况。

It's, like, exactly the same kind of thing.

Speaker 0

你设定了一个奖励机制，或者希望他们做某事，但他们意识到，哦，这次没人盯着我，所以我可以趁机做点别的，或者类似的事情。

You've, like, set up a reward function or something you want them to do, but then they realize that, oh, you know, like I'm not being observed in this case, so then I can do it then or, you know, anything like that.

Speaker 0

确实如此。

It is.

Speaker 0

这有点像育儿。

It's a bit like parenting.

Speaker 0

是的

Yeah.

Speaker 1

哇。

Wow.

Speaker 1

这真的很有趣。

That's really interesting.

Speaker 1

让我印象深刻的是，你刚才提到，你发现四o能够成功监控三o，我认为这在你的论文中提到过。

Something that stood out to me is that you mentioned even just a minute ago that you found that like four o could successfully monitor o three many, I believe it was in your paper.

Speaker 1

尽管三o在技术上是两个模型中更强的那个，这让我好奇，一个较弱的AI是如何抓住更聪明的AI的把柄的？

And even though o three many is technically the stronger model of the two, it made me wonder how does a weaker AI catch a smarter one in the act?

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是，我们确实发现更强的监控者在执行监控任务时表现得更好。

I I mean, definitely we find that stronger monitors, like, do perform the monitoring task better.

Speaker 1

好的。

Okay.

Speaker 0

但话虽如此，较弱的监控者仍然能捕捉到强模型做的很多不当行为。

But that being said, weak monitors can still catch a lot of bad stuff that strong models do.

Speaker 0

这是因为我认为——虽然我们没有确凿的证据，但如果你还记得我提到过的那些说法，比如‘我们来黑掉’、‘我们来篡改’这类话。

And that's because I think I mean, I don't have we don't have, like, hard evidence for this, but if you remember, like, the types of phrases I mentioned, like, let's hack, let's fudge, this type of thing.

Speaker 0

是的。

Yeah.

Speaker 0

你知道，一个非常笨拙、小型的语言模型也能捕捉到这类信号，所以当模型表现得非常明显时，它们确实能发现这些问题，我觉得这某种程度上就是如此。

You know, a very, very dumb, small language model can pick up on signals like that, and so when the model's being very, very obvious, it can, you know, they can they can pick up on those things, and I think that is kind of like yeah.

Speaker 0

这可能是导致这种情况发生的主要机制。

That is probably the main mechanism by which this that happens.

Speaker 1

哇。

Wow.

Speaker 2

那么，监控AI的行为和监控AI的想法之间有什么区别？为什么观察它的思考过程比观察最终输出要有效得多？

What's the difference then between monitoring what an AI does versus monitoring what it thinks, And why is looking at the thinking so much more effective than what the end out end result?

Speaker 0

是的。

Yeah.

Speaker 0

这是个很好的问题。

So that's a great question.

Speaker 0

所以模型的输出，比如你在ChatGPT中看到的内容，这些输出都被优化为尽可能帮助你、满足你的需求、保持安全，或者符合我们对优质输出的任何设定标准。

So the outputs that the models have, so, like, what you read in CHAT GPT, like, what it actually gets gives you, those are optimized to be, like, you know, helpful to you, do what you want, be, like, safe and or, you know, like, apply it to any specifications we have on what a good output looks like.

Speaker 0

因此，如果模型知道自己在做坏事，但它知道不能在输出中直接说出来，因为系统告诉它不能对用户做坏事。

And so, you know, if the model knows it's doing something bad, but it it will, like, know that it can't actually say that in the output because it's told not to, you know, do bad things for the user.

Speaker 0

所以，显然有一个明确的奖励机制，促使模型不要在输出中出现坏事。

And so there's a clear reward to, like, not, yeah, to not, like, bad things happening in the output.

Speaker 0

但在思考过程中，却没有这样的奖励或压力去避免这些行为。

Whereas in the thinking, there's no such reward or pressure to do so.

Speaker 0

思考过程更像是模型的私密空间，我们OpenAI在很大程度上都坚持一条明确的底线：不对思维链施加任何风格上的约束，也不试图让思考过程看起来更整洁，因为我们真心希望——如果模型想思考某种有害策略，我们希望它能清晰地呈现出来，以便我们能够监测并及时发现。

It's like the thinking is kind of treated as a private space for the model, and that's something that we at, you know, OpenAI, like, really try to hold a hard line on for the most part to not apply any kind of, like, style pressures or, like, to make the chain of thought look nice in any kind of way because we really want it to you know, if it wants to think like, have a thought about a strategy for doing something bad, we want that to be very clear so that we can then monitor and catch up.

Speaker 0

另一个原因是，输出不仅被施加了不能展现不良内容的压力，而且输出本身也可能变得极其复杂。

And then the other reason is that, you know, I guess, like so there's this this thing of the outputs are not are, you know, like, pressured to not show bad things, but also outputs can be, like, incredibly complex.

Speaker 0

我们每天都在看到这些模型做出越来越疯狂的事情。

We're seeing these models do more and more crazy things every day.

Speaker 0

比如，模型现在能完成的代码项目规模有多大，我不确定，但根据不同的基准测试，可能已经相当庞大了。

Like, you know, the size of a code coding project a model can actually accomplish is, I don't know, probably pretty big by now, depending on what benchmarks you look at.

Speaker 0

想象一下，它大部分时候表现不错，但在某个文件中，它不知道如何完成某项任务，于是它会打个单元测试的桩，导致那个模块无法正常工作。

And imagine, like, you know, it's mostly good, but in one file, it didn't really know how to do a thing, and it does this thing where it, like, stubs the unit tests and kind of, like, makes that one module not really work as it should.

Speaker 0

这就几乎变成了大海捞针的问题。

That's like like this kinda it kinda turns into a needle in a haystack problem almost.

Speaker 0

你需要在语言模型为你做的所有事情中，找到这一个糟糕的错误。

You have to, like, find this one bad thing amongst, like, this all this stuff that language model that the language model is doing for you.

Speaker 0

此外，你知道，这些模型正被越来越广泛地部署，而一般来说，人们越来越不愿意去逐一检查模型为你做的每一件小事。

And in addition, you know, they're kind of getting more and more widely deployed, and I think in general, people are like less and less actually going and checking every little thing the model is doing for you.

Speaker 0

所以，即使他们能发现这个大海里的针，也可能根本不会去寻找，因为我们越来越懒了。

And so, yeah, so there's kind of so even if they could catch that needle in a haystack, they might not even be like like looking for it because, you know, we're getting lazy and

Speaker 1

不对。

No.

Speaker 1

这是现实，我们要学会更多地信任它。

That's real and learning to trust it more.

Speaker 1

是的。

Yeah.

Speaker 1

嗯嗯。

Mhmm.

Speaker 2

随着模型变得越来越好，我认为由于信任的增加，出现灾难性错误的风险也在上升，因为我们相信它能做更多事情。

Well, as as it get better, it's like the risk of there being these catastrophic, you know, mistakes grow, I feel like, because we because of the trust, which is a good thing because we're it can do

Speaker 1

能做更多事。

a lot more.

Speaker 1

但确实，这很有趣。

But but, yeah, that's that's fascinating.

Speaker 1

如果思维链痕迹能像打开一扇窗，让我们窥见模型的内心，那么它背后还能隐藏些什么呢？

If if chain of thought traces are kinda like being able to open a window into the model's mind and look down in, what what can it still hide behind the curtains?

Speaker 1

从可监控性的角度看，是否还有些东西始终无法被捕捉？

Are there things that are still not graspable from a monitorability aspect?

Speaker 0

是的。

Yeah.

Speaker 0

思维链并不是模型思考的唯一方式。

Chains of thought are not the only way that models think.

Speaker 0

实际上，如果从模型计算中的比特数量来看，它们只占非常小的一部分。

They're actually a very small like if you think about the number of bits or something in the model's computation, it's actually a very small portion.

Speaker 0

模型计算中的大部分比特仍然隐藏在那个黑箱里，也就是所有这些激活值中——如果你听说过它们的话。

Most of the bits in the model's computation are still in the in this kind of black box, all these activations, if you've heard of them.

Speaker 0

因此，模型仍然有办法只在激活值中编码信息，而这些信息并不会在思维链中显露出来。

And so there are still, you know, there's still ways for the model to, like, only encode information in the activations that it doesn't actually reveal in the chain of thought.

Speaker 0

所以，我一直认为，这个问题——模型在什么情况下、会揭示哪些类型的信息，尤其是在思维链中——可能是当前领域最主要的问题。

And so I've been I mean, this is a this is this question of, you know, what types of information do models reveal and, like, in what situations do they reveal it in the chain of thought is maybe, like, the main, like, kind of question at the moment for the field.

Speaker 0

但某种程度上，我在脑海中把这个问题分解成了系统一和系统二式的思考。

But I've kind of, like, in some ways broken it down in my head as, like, system one and system two type thinking.

Speaker 0

就像人类有那些不需要主动思考的本能反射一样，模型的任何肌肉性抖动或类似行为，大概也不会在思维链中体现出来，因为这些行为已经根深蒂固了。

So just like you guys have kind of or humans have, like, motor reflexes that you don't actively think about, any kind of muscular twitch that the model does won't probably be revealed in the chain of thought because it's, you know, it's like so baked in.

Speaker 0

它不需要主动思考就能决定去执行这些动作。

It doesn't need to actively think about it to decide to do that.

Speaker 1

就像更偏向本能反应？

Like it's more instinctual?

Speaker 0

是的，正是如此。

Yeah, exactly.

Speaker 0

这更像是一种本能。

It's more instinctual.

Speaker 0

而任何需要决策点的东西——即使不一定如此——比如那些需要主动决策的环节，可能会在思维链中被明确表达出来。

And then like anything that requires a decision point or something or that this may not even be the case, but like things that require this, like, active decision points might be in the chain of thought actively.

Speaker 0

然后，对于Transformer架构，有一种观点认为，思维链实际上增加了模型能够执行的认知序列深度。

And then there is kind of an argument with the transformer architectures that chain of thought actually increases the maximum serial depth of cognition a model can do.

Speaker 0

换句话说，思维链实际上提升了Transformer模型能够完成的顺序逻辑步骤数量。

So basically, like the number of sequential logical steps it can do, chain of thought actually increases that for a transformer.

Speaker 0

因此，对于一些需要复杂思考和大量顺序逻辑推理的问题，模型可能确实必须使用思维链。

And so for some problems that, like, require, you know, complex thinking and many, like, serial hops of logic, it might actually be just, like, necessary for the model to use the chain of thought.

Speaker 0

所以，我认为，这是最极端的情况：如果我们想将思维链作为安全层，且只想依赖思维链，那么我们只能在模型必须使用它的时候才依赖它。

And then, like, so that's, like, I think, like, the the hard line case for, you know, if we were to rely on it as, like, a safety layer and you really wanted to only rely on chain of thought, you could probably only rely on it when the model needs to use it.

Speaker 2

所以对于非常复杂的逻辑

So for really complicated, like logical

Speaker 0

是的。

Yeah.

Speaker 2

处理问题。

Processing issues.

Speaker 0

但也许希望是，任何对世界造成重大危害的行为，都需要大量的这种复杂思考。

But the hope maybe is something like that really you know, like anything for the model to do, like, actual big harm in the world might require a lot of that and a lot of a complex thinking.

Speaker 0

所以也许这就是希望所在，是的。

So maybe that's maybe the kind of hope that yeah.

Speaker 2

你有在机器人领域工作的背景。

Well, have a you have a a background working in robotics.

Speaker 2

这让我想知道，这是否更像是一种肌肉抽搐，发生在激活过程中。

That makes me wonder, you know, if it is more like a muscle twitch kind of thing, where where it's in the activations.

Speaker 2

我完全不理解这个，也许你能为我们和听众解释一下。

I don't really understand this at all, so perhaps you could break this down for us and the listeners.

Speaker 2

但如果你在机器人场景中使用语言模型，这会如何类似地体现呢？

But if you were to use a language model in a robotic scenario, how would this play out in a similar way?

Speaker 2

也许这样问不太合适，但你明白我的意思吗？

Perhaps that's not the right way to phrase that question, but do you know what I'm saying?

Speaker 0

是的。

Yeah.

Speaker 0

我想，对于模型来说，它如何分解这个问题，我的猜测是，比如把我的手臂从这个位置移动到上方一厘米，可能不会出现在思维链中，因为人们训练模型时使用的计划层次更高。

I guess like my, I don't really know, like my guess for how it would break down for a model would be like things like move my arm, you know, from, like, this position to one centimeter above would not probably end up being represented in the chain of thought because the plans that people are training their models with are a bit more high level.

Speaker 0

它们更像是：去拿那个物体。

They're like, oh, go grab that object.

Speaker 0

而这种层次的表示和规划，可能会出现在思维链中。

And that might be like the level of representation and planning that that appears in the chain of thought.

Speaker 0

而像把我的手臂向上移动一厘米这样的动作，对模型来说可能更偏向于自动或本能的反应。

Whereas like that move move me my arm one centimeter up will be like a bit more kind of, like, automatic or instinctive to the model.

Speaker 0

对。

Right.

Speaker 0

也许吧。

Maybe.

Speaker 2

这就是我担心的地方，比如说，你在训练一个机器人，机器人知道它需要去那里，但它不知道自己不能挥舞巨大的机械臂，把沿途的所有人打得东倒西歪。

That's where I would worry, like, let's say, like, you're training a robot and the robot knows, oh, I need to go over there, but it doesn't know it can't flail its giant mechanical arms and smack everybody between here and there.

Speaker 2

这正是我想象中的情景。

That's kinda what I was picturing.

Speaker 1

你知道，在机器人应用场景中，我认为还存在一种真正的迫切需求，那就是必须能够即时响应，以便在事故发生前瞬间察觉并及时制止，当然，我不是在提出那些荒谬的《终结者》式问题。

You know, there's also an element of of a real need in a robotics type situation, I would say, for for this to be instant in some way, to to be able some instantly know what's going on in order to stop something in the act before and and and, you know, I'm not making ridiculous Terminator questions.

Speaker 1

我只是想说，比如不小心撞穿一扇门，或者类似的情况。

I just mean in terms of, you know, smashing through a door accidentally or something even or

Speaker 0

是的。

Yeah.

Speaker 0

毫无疑问，物理世界面临的难题比知识型工作要多得多。

Definitely the physical world has a lot more, like, I think, hard problems than, you know, knowledge work.

Speaker 0

因此，必须在安全措施上投入大量工作，这些措施不能仅仅依赖思维链的监控。

And so there's gonna be have to be, like, a lot of work on safeguards there that are that are not only chain of thought, you know, monitoring.

Speaker 0

你确实必须做很多基础工作，比如在机器人领域，你得做一些简单的事情，比如限定电机的扭矩上限，确保你的机械臂不会因为移动过快而意外致人死亡。

You definitely have to, like mean, I even in robotics, you have to do, like, simple things just like limiting like, figuring out what your limits for torque are in your motors so that you, like you can't, like, accidentally kill someone if, like, you know, you you know, the the the arm just moves way too fast.

Speaker 0

你必须限制这些因素。

You have to, like, limit those things.

Speaker 0

所以是的。

And so yeah.

Speaker 0

当然，我认为这与知识工作和线上事物的类比是相似的。

Definitely, like and and I think the analogy to, you know, to knowledge work and online things is similar.

Speaker 0

就像你不能只依赖思维链监控，它不会是部署模型时唯一的安全层。

It's like you have to kind of there are chain of thought monitoring will not be the only safety layer in any for deploying models.

Speaker 0

我们可能还得依赖其他方法。

We'll have to rely on other things too, probably.

Speaker 0

我怀疑会有一种策略或方法能完全确保模型安全，而无需其他任何措施。

Like, I doubt there's gonna be, like, one strategy or one method that makes models completely safe without anything else.

Speaker 0

这将会是

It's gonna be

Speaker 1

就像汽车拥有不止一种安全装置一样。

like Just like how a car has more than one safety feature.

Speaker 0

没错。

Exactly.

Speaker 2

对。

Yeah.

Speaker 2

对的。

Right.

Speaker 2

有道理。

That makes sense.

Speaker 2

那么，如果我们把这些系统应用到工作中，实际情况会怎样呢？

Well, how would this play out in practice if we deployed these systems at work?

Speaker 2

比如，你提到过隐蔽的奖励欺骗。

You know, for for instance, you talked about obfuscated reward hacking.

Speaker 2

那么当模型学会隐藏自己的不当行为时，会发生什么？

So what happens when the models learn to hide their misbehavior?

Speaker 2

你曾在医疗场景中举过例子，但也许我们可以进一步展开谈谈。

You kinda gave an example in the health care scenario, but perhaps we could expand on that.

Speaker 0

哦，是的。

Oh, yeah.

Speaker 0

所以我想可以分成两方面来说。

So I guess there's well, there's maybe two I'll break it down into two things.

Speaker 0

一方面是实际中我们该如何使用监控系统，另一方面是关于隐蔽性的问题。

There's kind of like how how would we actually use a monitor in practice, and then like what do we what what about obfuscation?

Speaker 0

模型真的能隐藏它们的思考过程吗？

Can models actually hide their thinking?

Speaker 0

我先试着回答第一个问题，然后再讨论第二个。

So I'll I'll try to answer that first one, and then I can go to the we can talk about the second one.

Speaker 0

但没错。

But yeah.

Speaker 0

那么，你实际上会怎么做呢？

So, like, how would you actually do it?

Speaker 0

我认为，显然每个公司都希望降低延迟，对吧。

I think I mean, obviously, like, every company wants to, like, probably get their latency down for, like Yeah.

Speaker 0

尽可能多地向用户提供结果。

Giving users results as much as they can.

Speaker 0

但对于复杂的知识工作，人们已经需要等待相当长的时间才能获得报告、代码库更新等结果。

But for complex knowledge work, people are already waiting quite a while to get, you know, like, the report back or their code base updated or so on.

Speaker 0

因此，如果我们因为要监控模型运行过程中的输出而需要多等一会儿，这似乎是产品在具备关键安全功能时可能会采取的做法。

And so if we had to wait slightly longer because we were monitoring, like, the outputs of the model as it was going, that seems plausibly like something that products would do if it actually had a critical safety function.

Speaker 0

而且，你可以做到的是，比如，模型的每一个动作，你都可以监控并执行该动作，只要你愿意承担等待执行动作并开始下一部分推理所带来的延迟成本。

And you could, so like one nice thing there too, is that like, you know, every action the model takes, you can kind of monitor execute the action, if you're willing to pay that latency cost of waiting to execute the action and start the next part of the reasoning.

Speaker 0

如果你这样做，你很可能在危害发生前就加以遏制，因为你已经知道模型正在考虑做某些坏事。

And if you do that, you can probably mitigate some harms before they happen because you know that the model was thinking about doing something bad.

Speaker 0

然后你可以说：哦，我本来打算做点坏事。

And then you can just say, oh, I was thinking about doing something bad.

Speaker 0

我就在这里终止它。

I'm just gonna cut it off here.

Speaker 0

我会重试，或者告诉用户由于某些原因我无法完成这个任务。

And I'll retry or, you know, tell the user I can't do the thing for some reason.

Speaker 0

所以这大概就是它们在实际中被实现的方式。

So that's probably how they would be implemented in practice.

Speaker 0

如果你对这一点有任何问题，我们可以先讨论一下，然后再谈混淆技术。

And I can do you wanna if you have any questions on that, we can talk about that, and then I can talk about obfuscation after.

Speaker 2

我确实有个问题。

I do have a question.

Speaker 2

那么，你是觉得这个监控工作应该由最终用户来完成，还是应该由OpenAI这边来监控呢？

So do you see this being something that the the end user would be monitoring, or is it something that you would monitor, you know, on the OpenAI side?

Speaker 0

哦，是的。

Oh, yeah.

Speaker 0

我认为没有充分的理由向用户展示思维链。

So we don't I think there's good reason to not reveal the chain of thought to users.

Speaker 0

我的意思是，这有两方面的原因，首先是，如果我们向用户展示思维链，这跟你问的问题有关——用户是否应该自己监控这个过程？

I mean, there's there's two sided reasons, which so the first is if we showed users chain of thought, and this is to your question of, like, should the user be monitoring this themselves?

Speaker 0

如果我们向用户展示思维链，出于同样的原因，我们必须像规范最终输出那样，对思维链进行过滤，以免向用户说出冒犯性内容、教他们做非法事情，或者违反任何实验室产品的政策。

If we show the user the chain of thought, for the same reason, we have to kind of, like, conform the outputs to look nice so that, you know, like, we don't kind of, like, say something offensive to a user or tell them how to do something illicit or anything like that, whatever the product policy things are for, you know, in any lab's product, we would have to then do that to a chain of thought because now we're giving them this new piece of this new thing.

Speaker 0

而且，如果思维链中出现了冒犯性内容，或者实际上透露了如何制造炸弹之类的非法行为，却决定不告诉用户，那我们一旦提供了思维链，就等于已经告诉了用户。

And, you know, if the chain of thought says something offensive or, like, actually reveals the how to do you know, how to make the bomb or whatever the illicit thing is, but decides not to tell the user that, well, we've just told the user that if we give them a chain of thought.

Speaker 0

因此，这就要求我们对思维链施加这些风格上的约束，而这可能会导致信息模糊化，我们稍后可以讨论这一点。

And so, like, then that would require us to put these style pressures on the COT on the chain of thought, and that would then maybe lead to obfuscation, which we can talk about in a sec.

Speaker 0

另一个问题是，我认为是知识蒸馏的风险。

And then the other one is, I think, just, like, distillation risks.

Speaker 0

这些思维链揭示了模型的大量推理过程，我认为我们通常不会直接把模型交给别人使用。

So, you know, these chains of thought kind of give a lot of the reasoning of a model, and I think you I guess, generally, we don't try to just, like, give our model to other people to use it.

Speaker 2

对。

Right.

Speaker 2

因为历史上确实有过其他公司利用这些信息并发布相关产品的案例。

Because there's maybe been historical accounts of other companies using that information and then releasing products with it.

Speaker 0

没错。

That's right.

Speaker 0

是的。

Yeah.

Speaker 2

是的。

Yeah.

Speaker 2

你想谈谈模糊化吗？

Did you wanna talk about obfuscation or what

Speaker 0

所以，是的。

So so yeah.

Speaker 0

所以，就像你在介绍中提到的，你说思维链可能很脆弱，我认为这现在是业界的一大担忧。

So this is like when when you kind of in the intro, you were saying you you mentioned that chain of thought could be fragile, and I think that is a big worry in the community right now.

Speaker 0

这又回到了我们之前讨论的问题，即：思维链到底是模型本能的反应，还是它在思维链中真正主动地进行思考？

And this goes back to our, also, conversation of, you know, talking about, like, what, know, is the chain of thought or, sorry, is the model doing something instinctively or is it, like, actually thinking through it actively in the chain of thought?

Speaker 0

在训练过程中，可能会出现某些压力或情况，从而改变这种平衡。

There are certain pressures or things that could happen while training that could caught that could kinda shift that balance probably.

Speaker 0

所以，一方面，如果你训练模型永远不要说或想一些坏事。

And so one is, you know, this thing of, like, if you train the model to never say something, think something bad.

Speaker 0

我们去年早些时候发表的一篇论文表明，在某些情况下，模型可能仍然会做出坏事，但不再产生那些坏的想法。

We showed in a paper earlier this last year that like it can in some cases actually still do a bad thing, but like not think the bad thoughts anymore.

Speaker 0

我们认为，这可能只发生在链式思维并非真正必要才能做出不良行为的情况下，这又回到了链式思维是否必要的问题；但它仍表明，模型确实有在不需要时也大声推理的倾向，而这种倾向可以通过类似风格压力的方式被削弱。

We think it probably can only do this in cases where the chain of thought isn't actually necessary to do the bad thing, going back to this question of necessity of like of the chain of thought, but it still shows that, you know, there's actually some propensity for models to reason out loud even when they don't need to, and you can kind of destroy that propensity with style pressures like that.

Speaker 0

所以，是的，模型会这样做，我们称之为混淆。

And so, yeah, the model and then we call that obfuscation.

Speaker 0

也就是说，模型在混淆它真实的思考过程。

Like, the model's like obfuscating its actual its its thinking.

Speaker 0

然后，我们还认为，链式思维可能会变得信息密度更低，或者更难被监控，这背后有诸多原因。

And then there's kind of a lot of other reasons we think that, you know, chains of thought could become, like, less information dense or have yet be less monitorable.

Speaker 0

例如，人们甚至对大规模强化学习感到担忧，而强化学习正是我们目前训练这些模型的常规方式。

So for instance, people are worried even about just large scale RL or reinforcement learning, which is the way we train these models kind of just in the normal way we're doing it.

Speaker 0

你可以想象，随着计算量的增加，模型最终可能会发展出自己的语言，或者改变它们使用英语（或它们所用的任何语言）的方式。

You could imagine that as you increase the compute there, models could eventually learn their own languages or, like, change the way that they're using English, the English language, or what whatever language they're thinking in.

Speaker 0

在当前规模下，这一点似乎并不那么令人担忧——这一点我们在最近的论文中也发现过；但在强化学习领域，早在2016年、2018年、2019年左右，就有人研究过‘涌现语言’和‘涌现通信’，他们证明了，在非常受限的环境中，如果给模型设定完全没有语言先验、完全随机初始化的条件，它们仍能学会某种语言来彼此交流。

At the current scale, and this is something we found in our recent paper, it doesn't seem that worrying, but it is, you know, in the reinforcement learning literature, there are exam like people were working on like emergent language and emergent communication back in, like, I don't know, 2016 or 2018, '19 or so, and, you know, they showed it in, like, very constrained settings that if you just put in models that have no language prior, they're just totally randomly initialized, they can learn some language to to talk to each other with,

Speaker 2

理想情况下，你希望这种语言能被人类理解。

Ideally, you would want that language to be understandable by humans.

Speaker 2

没错。

Exactly.

Speaker 0

对。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

所以是的。

And so yeah.

Speaker 0

所以，这可能会发生，比如模型会学会自己内部的思考语言。

So so but this might happen where, like, the model would, like, learn its own internal language of how to think.

Speaker 0

这可能比英语更高效或更紧凑，然后我们就更难理解，或者另一个没有学会这种特定语言的模型也难以理解。

That's maybe, like, a bit more efficient or compressed than the English language, and it's, like, then it's harder for harder for us to understand ourselves or a different model to understand that hasn't, like, learned that same exact, yeah, language specific language.

Speaker 1

好的。

Okay.

Speaker 1

你提到在你的研究中发现，较长的思维链通常更容易监控。

You mentioned in your research that you found longer chains of thoughts to be generally more monitorable.

Speaker 1

这话说得有点长。

That's a mouthful.

Speaker 1

你们有没有采取什么措施，来促使模型更充分地展示它的思考过程？

Are there things that you're doing in an effort to get the model to show more of its work?

Speaker 0

是的。

Yeah.

Speaker 0

所以，到目前为止我们讨论的内容，大致是截至去年年中左右，关于思考和相关工作的部分。

So, so kind of what we've talked about up till now has been maybe like the was this we talked up until like the thinking that was done and work that was done until, you know, maybe like mid two thirds of last year.

Speaker 0

而且，我们知道这是有用的。

And, you know, people were we knew it was useful.

Speaker 0

我们一直在监控这些大型推理模型。

We had been monitoring these big reasoning models.

Speaker 0

研究界的人们开始担心这种能力很脆弱，可能会消失，但他们认为这很重要，因此希望继续研究它。

People were in the research community were starting to get worried that it was fragile, that it could be like, you know, this could go away, but they thought it was important, so they wanted to work on it.

Speaker 0

所以至少对我来说，我认为对很多人来说，如果你想保留这个特性，我们首先必须能够衡量它。

And so at least to to me, and I think many others, if you wanna preserve this property, like, the first thing we had to do is be able to measure it.

Speaker 0

因为如果你回想起之前的对话，我从未提到过任何关于模型可监控性或评估的指标。

Because I guess if you think back on the conversation, I never said any kind of, like, metric for how monitorable a model was or how to eval.

Speaker 0

那一切都只是凭感觉而已。

It was all just vibe spaced, basically.

Speaker 1

你无法评估你无法衡量的东西。

You can assess what you can't measure.

Speaker 0

是的。

Yeah.

Speaker 0

没错，对。

Exactly, yeah.

Speaker 0

因此，构建评估方法，让我们能够真正量化这些模型表达思维的频率或质量，成为我们在去年下半年的主要推动力，因为再次强调，要想保留它，你就必须能够衡量它。

So building evaluations where we could actually kind of start to quantify how good or like, sorry, how often these models verbalize their thinking was the big push that we made in the second half of last year because, again, yeah, to be able to preserve it, you have to be able to measure it.

Speaker 0

于是，我们构建了这套评估体系。

And, and so, yeah, we built this suite.

Speaker 0

我认为这可以被称为一个不错的评估起点。

I think it's would call it like a good a good starting point of evaluations.

Speaker 0

当然还有很多工作要做，我希望社区能继续在此基础上进行完善，并相互分享评估方法。

There's definitely much more work to do, and I hope the community kind of continues to build upon these and share evaluations with each other.

Speaker 0

我们也在努力开源更多内容，以便与大家共享。

We're working on open sourcing hours to to share with them as well.

Speaker 0

但一旦你具备了量化这种能力，就可以开始真正地审视训练流程中的不同环节。

But once you have the ability to quantify this, then you can start to actually make, you know, like, interrogate different parts of your of your training pipeline.

Speaker 0

你可以审查监控中的各种决策。

You can interrogate different decisions of monitoring.

Speaker 0

正如你所说，如果一个模型具有更长的思维链，我们发现它通常会在更长的思维链中揭示更多信息，从而更容易被监控。

So, like you said, like, if I have a model that has a longer chain of thought, it's much we found that it's actually you know, generally, it reveals more information in that longer chain of thought, and then it's easier to monitor it.

Speaker 0

好的。

Okay.

Speaker 0

因此，这就是其中一个例子，说明一旦有了这些评估，你就能开始做出一些切实的结论。

And so that was like one of the one example of a thing we you can start to actually, like, make statements about once you have these evaluations.

Speaker 2

对。

Right.

Speaker 2

你还提到了一种‘可监控性税’，有时为了保持AI系统的透明性，你可能会接受稍差一点的性能。

You also mentioned a monitorability tax, where sometimes you might accept slightly worse performance to keep the AI systems more transparent.

Speaker 2

这是你所倡导的吗？我觉得你是支持这一点的。

Is that something that you I I think you're advocating for that.

Speaker 2

对吧？

Right?

Speaker 0

我并不主张这一点。

I I am not advocating in that.

Speaker 0

我的意思是，大概吧。

I'm I mean, probably.

Speaker 0

我想我会说，我大概会支持这一点。

Like, I would say probably I would advocate for that.

Speaker 0

是的。

Yeah.

Speaker 0

当然。

Sure.

Speaker 0

或者抱歉。

Or sorry.

Speaker 0

是的。

Yeah.

Speaker 0

在任何假设的情境中，比如你正在部署一个危险模型或安全模型时，我认为你应该优先部署安全模型，即使它不是

Definitely in, like, any hypothetical scenario where you're, like, deploying a dangerous model or a safe model, I'd say you should probably deploy the safe model even if it's not

Speaker 1

更易观测。

less visible.

Speaker 1

即使它稍微慢一点也没关系。

Even if it's a smidge slower or something.

Speaker 1

是的。

Yeah.

Speaker 0

对。

Right.

展开剩余字幕（还有 337 条）

Speaker 0

没错。

Exactly.

Speaker 0

对。

Yeah.

Speaker 0

而且

And

Speaker 2

那么，什么时候缴纳这种税才是值得的呢？我想这就是问题所在。

so When would when would paying that tax actually be worth it, I guess, is the is the question.

Speaker 0

是的。

Yeah.

Speaker 0

可能有很多种不同的税。

There's probably many different types of taxes.

Speaker 0

比如，我们已经讨论过延迟税了。

So for instance, like, we talked about the latency tax already.

Speaker 0

即使你只是部署了你原本想部署的模型，但如果你想监控它，现在可能就需要更高的延迟来监控它，在它采取行动或向用户输出之前。

Like, even if you just deploy the model you wanted to deploy, but you wanna monitor it, you now have to have a slightly higher latency probably to monitor it and before it takes actions or gives outputs to users.

Speaker 0

所以这是一种税。

So that's one kind of tax.

Speaker 0

我们论文中讨论的税是，我们做了一个实验来研究预训练计算规模的影响。

The tax we talk about in the paper is that we found we did a experiment to investigate the effect of the pre training compute size.

Speaker 0

你知道，GPT范式一直在增大模型规模和数据规模。

So, you know, the GPT paradigm has been increasing the size of the model while increasing the size of the data.

Speaker 0

除此之外，我们现在还有了强化学习范式，也就是在这些预训练模型的基础上，再进行另一个步骤，让它们在推理方面表现得更好。

And then on top of that, we've, like, now have the RL paradigm, which, like, you know, we take those pretrained models and we we we do an another procedure to make them really good at reasoning.

Speaker 0

因此，我们想研究一下，可监控性如何随着预训练规模和预训练计算量的变化而变化。

And so we wanted to investigate, you know, how does monitorability scale with pre training size and pre training compute?

Speaker 0

于是，我们选取了一系列规模递增的模型，测量了它们在不同推理强度下的能力与可监控性。

And so we took a series of increasing size models, and we measured their capability and their monitorability at, like, different reasoning efforts.

Speaker 0

你知道，模型通常可以设置为较低或较高的推理强度。

And so, like, you know, models can generally be set to have, like, a lower or a higher reasoning effort.

Speaker 0

它们分别会进行更短或更长的思考过程。

They'll think a bit shorter or a bit longer respectively for each of those.

Speaker 0

因此，推理模型通常在数学题、编程题等方面的能力会随着思考时间变长而提升。

And so reasoning models, generally their capabilities, like on math problems and coding problems and things like that, they get better as they think for longer.

Speaker 0

它们能够解决更复杂的任务。

They'll be able to, like, solve more complex tasks.

Speaker 0

因此，一个小模型如果思考很长时间，通常能达到与大模型短时间思考相同的答案或能力水平。

And so a small model thinking for a very long time can oftentimes get to the same answer or match the capability of a bigger model thinking for a shorter time.

Speaker 0

好的。

Okay.

Speaker 1

哇。

Wow.

Speaker 0

所以这就是

So that's

Speaker 1

这说得通。

That makes sense.

Speaker 0

这个模型真的很紧凑。

This model is really tight.

Speaker 0

这件事还有几个步骤。

There's like a couple more steps in the thing.

Speaker 0

我可以直接走一遍，然后你可以同意。

I can just go through it and then you can Yeah.

Speaker 1

当然。

Sure.

Speaker 1

好的。

Yeah.

Speaker 1

当然。

Sure.

Speaker 0

所以，是的。

So, yeah.

Speaker 0

所以你现在有了这些能力相当的点，也就是说，小模型花更长时间思考，大模型花较短时间思考，它们的能力大致相当。

So you now have like these like capability matched points of, you know, small model, thinking for longer, big model, thinking for shorter are like roughly the same capabilities.

Speaker 0

但我们发现，再次回到你之前提到的，花更长时间思考的模型更容易被监控。

But we found that, again, going back to your previous thing, we found that models that think for longer are more monitorable.

Speaker 0

它们会透露出比实际思考更多的内容。

They like reveal more than they're thinking.

Speaker 0

我们发现，在相同能力水平下，较小的模型在进行更长时间的思考时更容易被监控。

We found that the model thinking for the smaller model at the same capability level was more monitorable it was thinking

Speaker 2

是的，这说得通。

Yeah, for yeah, that makes sense.

Speaker 0

所以这就是监控成本，或者抱歉。

And so then that's the monitorability tax in or sorry.

Speaker 0

但，抱歉。

But, sorry.

Speaker 0

而成本方面在于，这些进行更长时间思考的小模型，在达到相同能力水平时，实际上比进行较短思考的大模型消耗了更多的推理时间计算资源。

The then the tax aspect comes in where these small models thinking for longer actually were using at that app for like matched capability levels, used more inference time compute than the bigger model thinking for shorter.

Speaker 0

所以即使

So even

Speaker 2

不过，哪一个会更昂贵呢，我想？

Which though one would be more expensive, I guess?

Speaker 2

你更希望是较小的模型长时间思考，还是较大的模型短时间思考？

Would like a smaller model thinking for longer or a bigger model thinking for shorter?

Speaker 0

在这种情况下，我认为较小的模型通常成本更高，因为每生成一个标记的推理计算量减少幅度，小于为达到相同能力水平所需增加的标记数量，因此总体而言，使用大量标记的小模型比大模型消耗了更多的计算资源，这就是所谓的可监控性成本。

In this case, the smaller models, I think, basically always cost more because the relative, you know, decrease in inference compute per token in generated was less than the amount of additional tokens you'd need to reach the same capability level, and so overall, that small model with a lot of tokens was costing more compute than the big model, and so that's kind of like the monitorability tax.

Speaker 0

这意味着，你可以通过增加推理时间的计算投入，来获得一个更可监控的模型。

It's like, oh, you could just spend more compute on inference time, but like get a more monitorable model out of it.

Speaker 2

对。

Right.

Speaker 2

这很棘手。

That's tricky.

Speaker 2

是的。

Yeah.

Speaker 1

系统推理能力随时间变得越来越不可监控，可能有哪些预警信号？

What are the warning signs maybe that a system's reasoning is becoming less monitorable over time?

Speaker 1

有没有办法识别这种情况正在发生？

How would is is there a way to recognize that that's happening?

Speaker 0

是的。

Yeah.

Speaker 0

嗯，我们希望如此。

Well, we I guess we hope so.

Speaker 0

我的意思是，我们希望这些评估是其中一种方法，或者说是实现这一目标的方式。

I mean, we hope these evals are the one or maybe the way to do it.

Speaker 0

因此，我们有几种不同的方式来追踪可监控性。

So we have a few different ways of kind of, like, tracking monitorability.

Speaker 0

我们将它们分成了三种类型，希望如果能设计出足够广泛的评估体系，从不同角度测试可监控性，那么当模型因为某种原因开始变得难以理解或表达减少时，我们就能看到其中某项评估指标开始下降。

We, like, divided them out into three archetypes, and we hope that, like, you know, if we make a broad enough suite of evaluations that test monitorability in different ways that when, you know, a model does start to, like, become obfuscated or verbalized less for whatever reason, we'll see one of those evals start to go down.

Speaker 0

我们在论文中确实展示了一个概念验证：我们故意通过在思维链（COT）上施加某种风格压力来训练模型使其变得模糊，结果确实观察到部分评估指标下降，这证明了如果这种现象在现实中自然发生，我们或许能够捕捉到它。

And we did kind of show a proof of concept of that in the paper where we directly deliberately trained models to obfuscate by putting some kind of style like pressures on the COT, and we then did see some of our evals decrease as like a proof of concept that, you know, if a model if this kind of happened in the wild naturally, we'd maybe be able to catch it.

Speaker 2

哇。

Wow.

Speaker 2

对。

Right.

Speaker 2

我想这让我好奇，比如，如果下一个扩展范式——无论那是什么，超越当前的下一步——那个模型会不会极其难以读懂它的推理痕迹？

I guess that makes me wonder, like, if you were to like, let's say, like, the next scaling paradigm, whatever whatever the next step up from here is, would that model be incredibly hard to to read its traces?

Speaker 0

完全有可能。

Totally possible.

Speaker 0

我想，如果我知道下一个扩展范式是什么，我现在早就富得流油了。

I guess, you know, if I knew what the next scaling paradigm was, I'd I'd be a much, much richer man.

Speaker 2

嗯，我想我应该澄清一下。

Well, I guess, I I guess I should clarify.

Speaker 2

比如，如果我们沿用之前的扩展范式，也就是规模和数据，对吧？算力和数据。

Like, let's say if we go off of the previous scaling scaling paradigm, which is which is just size and data, right, compute and data.

Speaker 2

如果你把算力和数据再提升到下一个量级——即使这可能实现——那个模型会不会根本没时间深入思考？

If you were to scale up to, like, the next the next size up of compute and data, if that were even possible, would would that model not think for very long?

Speaker 2

然后这让我想到，那个模型的推理痕迹会更难解读，因为它——你知道的——只是遵循了你迄今为止所验证的逻辑，是的。

And then would that then make it it makes me think that that model's thinking traces would be even harder to interpret because it it's, you know, following the logic of what you've proven so far as Yeah.

Speaker 2

模型更大了，但为了得出正确答案，它思考的时间却更短了。

A bigger It's a bigger model, and it's thinking for less time to get to the right answer.

Speaker 0

是的

Yeah.

Speaker 0

我认为这确实是一个令人担忧的问题。

I think that is definitely a worry.

Speaker 0

我们实际上在当前的实验中，并没有发现这种情况真的成立，好吧。

We've also, like I mean, so generally in our current experiments, we haven't necessarily found that to be true Okay.

Speaker 0

在很大程度上是这样。

To a significant extent.

Speaker 0

我认为这很可能是因为，我们一直在不断提升训练即使是小型模型的能力。

I think that this is likely because, you know, we are always improving our ability to train even, you know, even a small model.

Speaker 0

这并不是说我们有了一个GPT-4规模的模型，那就是任何人能训练出的最好的GPT-4规模模型。

It's not like we have our GPT four size model, and that's the best GPT four size model anyone can ever train.

Speaker 0

我们在不断改进算法和数据，你知道的，没人会停下来。

We're, like, improving our algorithms, our data, you know, we're no one's no one's sleeping.

Speaker 0

是的

Yeah.

Speaker 0

所以我认为，更好的模型有时也会有更清晰的思路，如果这说得通的话。

And so I think that, you know, a better model also kind of like has more clear thoughts sometimes, if that makes sense.

Speaker 0

比如一个小模型，它会审视自己的思考过程。

Like a small model, it will look at its thoughts.

Speaker 0

它有时候显得比大模型更加胡言乱语。

It's it's it sometimes feels a lot more gibberish y than a bigger model.

Speaker 0

因此，也许这就是为什么我们能预期这种趋势至少不会完全发生的原因。

And so, you know, that is maybe the reason why we could expect this to, you know, at least not fully this trend to not fully happen.

Speaker 0

但我认为在极限情况下，它很可能还是会发生的。

But I think at the limit, it probably should happen.

Speaker 0

更大的模型在大多数情况下会减少语言表达。

Like, bigger models will just verbalize less for the most part.

Speaker 0

是的。

Yeah.

Speaker 1

我想这说得通。

I guess that makes sense.

Speaker 1

你测试过一件让我感兴趣的事，就是你在训练过程中惩罚不良想法，是否会阻止这类不当行为。

You tested whether something that interested me was that you tested whether penalizing bad thoughts during training would stop that kind of misbehavior.

Speaker 1

我想知道你在这方面发现了什么，能稍微介绍一下吗？

I wondered if you could tell us a little about what you found there.

Speaker 0

是的。

Yeah.

Speaker 0

我们当时的做法是，那是去年早些时候，我们担心人们会开始对思维链施加各种压力，因为当时还没人认真思考过这个问题。

So what we did were so when we this was kind of earlier last year, and we were worried that people would kind of like start applying some, like, various pressures to the chain of thought because no one had kind of like thought through it yet.

Speaker 0

我们当时只是假设性地担心：如果你告诉模型不要有坏想法，它可能仍然在内部激活中产生坏想法，然后随后做出坏事。

And we were like, hypothetically worried that yeah, if you tell it to not think bad things, it could still think bad things in its like activations and then do a do a bad thing afterwards.

Speaker 0

所以我们做了一个实验，训练了两个模型：一个是基线模型，没有任何所谓的‘思维警察’；另一个模型则引入了‘思维警察’。结果发现，有‘思维警察’的模型确实做了更少的坏事，这其实也说得通。

So we did this experiment where we, you know, train two models, our baseline that has like no, I don't know, we can call it like thought police or something, had like thought police were not present, and then another model where the thought police were present, and the model with the thought police present actually did like do less bad things when you because, you know, which makes sense.

Speaker 0

我们的模型通常在因果关系上非常直接，它们的行为确实会受到思维链的直接影响。

Like, our models are, like, oftentimes very causally, like, they're they're they're causally affected by their chain of thought.

Speaker 0

如果它们没有产生坏想法，可能就不会真的做出坏事。

If they don't think a bad thought, they might not actually do a bad thing.

Speaker 0

其他研究团队的其他论文也证实了这一点。

And this has been shown in other in other papers from other groups as well.

Speaker 0

但最终，当模型开始学会掩饰，表面上没有不良想法，却依然做出不良行为时，我们发现，虽然它做的坏事确实少了一点，但你可能更希望有一个公开做坏事、坦然思考所有不良想法、并用监控工具来观察的模型。

But eventually, once it started to learn to obfuscate and, and, like, still do bad things but while not having bad looking thoughts, at that point, we found that like, you know, while it still did a bit less bad stuff, you would have preferred to have a model that was like openly doing bad stuff and thinking all the bad thoughts and using a monitor.

Speaker 0

例如，使用思维链监控器，你能捕捉到未经过‘思想警察’训练的模型更高比例的不良行为，而相比之下，经过‘思想警察’训练的模型反而更难被发现。

So like, for instance, you could catch more behavior, bad behavior, a higher percentage of bad behavior of the model that wasn't trained with thought police using a chain of thought monitor than you could when you, like, had done this whole, like, thought police training.

Speaker 0

所以归根结底，你还是不这么做更好。

And so you were better off at the end of the day, like, not doing it.

Speaker 2

说到打开思维的窗口，我们正在观察这些想法。

So speaking of open windows into the mind, you know, we're we're looking at the thoughts here.

Speaker 2

你最近还发表了一篇关于稀疏权重变换器的论文，通过剪枝让模型的内部电路更容易被观察。

You also published a paper recently on weight sparse transformers, where you prune the model to make its internal circuits easier to see.

Speaker 2

机制可解释性与思维链可监控性之间是如何关联的呢？

How how does, like, I guess, mechanistic interpretability and chain of thought monitorability coincide?

Speaker 2

它们是如何协同工作的？你对此有什么看法？

How do they work together, and what are your thoughts on that?

Speaker 0

当然。

Absolutely.

Speaker 0

对。

Yeah.

Speaker 0

我确实参与了那篇论文，但如果你看一下致谢部分或者贡献声明，我只是Leo的管理者。

I was I was on that paper, but if you look at the acknowledgment or whatever the the contribution statement, I was just the manager of Leo.

Speaker 0

Leo是公平的，Leo就是这样。

Leo Leo is like and Leo is fair.

Speaker 0

他们会是这方面更专业的专家，但是

Would say they're, like, much more of the experts in that, but

Speaker 2

感谢你的澄清。

Appreciate the disclaimers.

Speaker 0

对。

Yeah.

Speaker 0

我不会把那篇论文的功劳归于自己，但我认为他们总体上在机制可解释性方面的工作非常出色，这像是一个早期的信号，表明我们或许可以训练出一种在激活空间上天生更易解释的模型，我期待更多人继续推动这一领域，以及更广泛的机制可解释性研究。

I wouldn't I wouldn't take credit for that paper, but but I think that they are, like, in general mechanistic interpret I think the circus varsity work is awesome and, like, a very preliminary kind of, like, early sign of life for a type of model we could train that was natively more interpretable in its activation space, and I'm excited for people to kind of continue pushing on that and more broadly the mechanistic interpretability field.

Speaker 0

历史上，机制可解释性一直非常困难。

Historically, it's been pretty hard, mechanistic interpretability.

Speaker 0

我们还没有解决这个问题，人们已经研究了很久。

We haven't solved it yet, and people have been working on it for a long time.

Speaker 0

人们正在

People are

Speaker 1

研究

working on

Speaker 2

您能否从宏观层面向我们的观众解释一下，为什么这如此困难？据我了解。

Could you at a high level explain to our viewers why it's so difficult, from my understanding?

Speaker 0

我会说，我不是这方面的专家，但我认为它如此困难的原因是，首先可以将思维链与机制可解释性区分开：机制可解释性更像对大脑中所有神经元和激活状态进行脑部扫描，然后尝试根据这些数据预测你会做什么，神经科学家确实也在做类似的事情。

I would say I'm not an expert in this, but my take for why it's been so hard is that similar to I guess one way to break out the difference first between chain of thought and mechanistic interpretability is mechanistic interpretability is more like doing a brain scan of all your neurons and activations in your brain and then trying to make some predictions about what you're going to do from them, which neuroscientists do do.

Speaker 0

这并非不可能，但确实很难。

And it's not impossible, but it's hard.

Speaker 0

你知道，即使对于小鼠这类生物，我们仍然很难做到这一点。

You know, there's still like, you know, we haven't, we can still have a hard time doing it for mice and things like that.

Speaker 0

对吧？

Right?

Speaker 1

这并不是一个完美的过程。

It's not a flawless process.

Speaker 1

是的。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

而且它维度非常高。

And it's just really high dimensional.

Speaker 0

有太多事情在同时发生，你知道，你试图在数十亿个神经元中找出哪个模式导致你做了坏事、走路，或者 twitch 肌肉。

There's a lot of things going on, you know, you're trying to find in like these, like billions of neurons, you know, which pattern amongst them is the reason why you like do a bad thing or the reason why you like walk or, you know, any, or twitch your muscle.

Speaker 0

它们同时在做太多事情。

There's so many things they're doing all at once.

Speaker 0

而链式思维可解释性则更像是阅读你的内心独白，这更加高层，信息密度也低得多。

Whereas, you know, the chain of thought interpretability thing is a bit more like reading your inner monologue, which is much more high level, much like, I guess, like lower information density.

Speaker 0

它并没有包含所有内容，但似乎确实涵盖了与模型行为最相关的高层信息，这就是为什么它相比之下如此成功；不过，同样地，机制可解释性不会像思维链可解释性那样面临同样的脆弱性问题，因为激活状态不会消失。

It doesn't have everything there, but it does seem to have, like, the most high level relevant things to the model's actions, which is why it's been so successful in comparison and like, but again, like is like mechanistic interpretability does not have the issue of like the same issues of fragility that chain of thought interpretability has because the activations aren't going to go anywhere.

Speaker 0

到头来，这些信息应该仍然存在，但根据各种情况的发展，它们最终可能不再存在于思维链中。

The information should all still be there at the end of the day, but it might not it might eventually not be in the chain of thought depending on, you know, how how everything goes.

Speaker 1

这说得通。

That makes sense.

Speaker 1

我有一个问题，很久以来一直想问一位安全研究人员：作为一位专注于安全与漏洞的人，开源人工智能是否让你感到担忧？

I have a question I've wanted to ask a safety researcher for quite some time, and and it is just the concept of open source AI concern you in any way as as someone who's focused on safety and vulnerability.

Speaker 1

为什么是或为什么不是？

And why or why not?

Speaker 0

我会说，是的。

I'll yeah.

Speaker 0

所以，我只想表达一下我个人的看法。

So I would just give my own opinion.

Speaker 0

我不认为那是对的。

I wouldn't That's true.

Speaker 2

是的

Yeah.

Speaker 2

是的

Yeah.

Speaker 2

就像鲍文一样。

Just as Bowen.

Speaker 0

是的

Yeah.

Speaker 0

是的

Yeah.

Speaker 0

是的

Yeah.

Speaker 0

但，是的，我相当担心。

But, yeah, I'm pretty worried.

Speaker 0

我会对开源感到非常担忧，或者我担心开源。

I would be pretty worried about open or I'm worried about open source.

Speaker 0

我认为，你知道，如果我们试图对某些类型的信息或武器等事物进行管控，那一定是有原因的。如果你觉得某个模型可能被用作某种有害的武器，比如制造生物武器——这是人们非常担忧的，或者网络攻击和网络安全——这也是人们非常担心的。

I think that, you know, if you had, like, there's a reason why we try to control certain types of information from the public, or have controls over weapons and things like that, and if you think a model could be utilized as any kind of harmful weapon like thing, whether it be making a bioweapon is a big thing people are worried about, cyber security or cyber attacks is another big thing that people are worried about.

Speaker 0

我觉得，如果在网上可以下载到一个像武器一样的东西，这听起来真的很糟糕。

I think just kind of having, like, a downloadable weapon online sounds pretty bad to me, like

Speaker 1

有道理。

Fair.

Speaker 0

所以是的。

So yeah.

Speaker 0

我认为，在开源模型中设置适当的防护措施可能非常困难，因为一旦你发布了权重，任何人都可以轻松地微调掉这些防护措施，你知道的。

And I think it's probably pretty hard to put in, like, the proper guardrails on an open source model because someone can, like, just fine tune those guardrails away once you've released the weights and, you know yeah.

Speaker 1

这一直是我想了很久的问题，非常感谢你分享你的个人观点，因为我一直试图成为一个好的开源倡导者。

That's a thing I have wondered for so long, and I really appreciate you sharing your personal opinion there because I I always thought, like, at, like, at its core, I I I try to be a good open source promoter.

Speaker 1

但与此同时，我也觉得，你确实会失去控制权。

And and at the same time, I'm like, you know, you kinda lose the button, though.

Speaker 1

你再也找不到那个关闭开关了。

You you don't have the the off switch anymore.

Speaker 1

而且，我的意思是，一旦发布出去，它就存在了。

And, I mean, once it's out, it's it's there.

Speaker 1

你永远无法撤销这一点。

You're never undoing that.

Speaker 1

你知道的。

You know?

Speaker 1

你不可能把地球上每一台电脑上的东西都删掉。

You're not gonna go delete it from every computer on Earth.

Speaker 1

你知道的。

You know?

Speaker 1

我一直觉得这个话题非常有趣。

And it's just a subject I've always found really interesting.

Speaker 1

至于安全、可解释性、可监控性这些方面，都是人工智能中非常有趣的领域，我非常钦佩你们所做的工作。

And then the whole safety, interpretability, monitorability, all of that's really it's an interesting area of AI and and exploration, and I really admire the work you guys are doing.

Speaker 0

谢谢。

Thank you.

Speaker 0

是的。

Yeah.

Speaker 0

我只想补充一点细微的差别：目前的开源模型看起来并不危险。人们已经对这些模型进行了最坏情况的微调测试——至少我们对我们发布的模型做了这样的测试，我们发现，即使有极其恶意的人试图微调这个模型来开发生物武器，他们也根本做不到。

I will just add a tiny piece of nuance in that, like, the current open source models do not seem dangerous, and people have done the worst case fine tuning scenarios for the model, at least we did for the models we released, and we said, Okay, even if someone super nefarious went and fine tuned this model to develop bioweapons, they just couldn't really do it.

Speaker 0

这个模型还不够聪明。

This model is not smart enough.

Speaker 0

所以在这种情况下，完全没问题。

And so I think in that regime, it's totally fine.

Speaker 0

比如，风险并不高，而且它非常有用，对社区来说特别有用，无论是做研究、开发产品，还是其他任何事情。

Like, there's not like, the risks aren't high and and it's, like, useful and it's, like, super useful for the community, doing research, for making products, whatever they're doing.

Speaker 0

因此，在这一点上，我更多想到的是，你知道的，那种

And so in that, I think I'm I was more thinking in, the, you know, like, the

Speaker 2

最高强度的。

The highest power.

Speaker 0

未来的通用人工智能。

AGI thing of the future.

Speaker 0

你知道的吧？

You know?

Speaker 0

是的。

Yeah.

Speaker 1

对。

Right.

Speaker 1

有道理。

Makes sense.

Speaker 1

我们真的想把它放在U盘里吗？

Do we really want this on a flash drive?

Speaker 1

kinda 你知道的吧？

Kinda you know?

Speaker 1

我明白。

I I get that.

Speaker 1

我明白。

I get that.

Speaker 1

顺便说一下，我非常喜欢开源模型。

I love the OSS models, by the way.

Speaker 1

我两种都用过，是的。

I've used both Yeah.

Speaker 2

你参与过这个项目，对吧？

You worked on that, right?

Speaker 2

你至少为这个系统、它的模型卡片做出过贡献，对吧？

You at least contributed to the system, the model card for it, right?

Speaker 0

是的。

Yeah.

Speaker 0

我只是参与了一些关于是否要对思维链施加压力的决策，实际上，关于开源模型有一件有趣的事，我们并没有对思维链施加任何压力。

I was just involved in some decisions around whether to leave the actually, one interesting thing with the open source models that we didn't put any pressure on the chain of thought.

Speaker 0

我们只是让它自由思考。

We just let it think whatever.

Speaker 0

如果它要产生冒犯性的想法，我们就让它去想，因为我们希望开发者在他们的产品中也能进行类似的思维链监控，同时我们也希望有一个与我们内部模型接近的模型，供外部安全研究人员使用，这样当他们进行研究时，如果使用我们的模型，我们可能会更有信心认为这些发现也适用于我们内部更大的模型，从而让我们更有把握去实施和尝试这些方法。

If it's gonna think something offensive, we let it do that because we wanted developers to be able to do the same type of chain of thought monitoring that we do if they wanted to in their product, and then also have a model that is close to our internal models for external safety researchers is kind of a useful artifact to have out there so that when they do research and if they do it on our models, maybe we're a bit more convinced that it would apply to our bigger models that we have internally, give us a bit more confidence to go and implement it and try it ourselves.

Speaker 0

所以这样做有很多好处。

So there's a lot of benefits to that.

Speaker 0

是的。

Yeah.

Speaker 1

对。

Yeah.

Speaker 1

好的。

Okay.

Speaker 1

我们有一个

We have a

Speaker 2

有个问题纯粹是出于好奇，那就是OpenAI的研究工作是如何进行的？

question just that's kind of for fun, which is how does how does research work at OpenAI?

Speaker 2

你们是向萨姆提出想法，然后他点头同意，还是由雅各布或者其他负责科学的人来决定？

Like, do you all pitch to Sam and then he gives you the thumbs up or or or Jacob or whoever's in charge of the science there?

Speaker 2

还是说你们的研究任务是被分配的？

Or do you get it assigned to you?

Speaker 2

比如，你们如何决定要做什么

Like, how do you decide what your

Speaker 1

还是你们抓到一个想法就去推进？

next Or you grab an idea and run with it?

Speaker 1

或者，

Or,

Speaker 2

是吗？

yeah?

Speaker 2

是的。

Yeah.

Speaker 2

你们需要争夺计算资源吗？

Do you guys have to fight for compute?

Speaker 2

比如，这具体是怎么运作的？

Like, how does that work?

Speaker 0

我觉得，这些方式都有吧。

I would say, like, a bit of all of that.

Speaker 0

这真的就像是，我不知道。

It really it's like, I don't know.

Speaker 0

没有，这并不是非常正式的。

There's no there's no it's it's not extremely formal.

Speaker 0

我认为每个团队的工作方式都非常不同。

Every team, I think, works very differently.

Speaker 0

像我带领的团队，我们其实也不太清楚。

Teams like mine that I run are a bit more like, we don't really know.

Speaker 0

好吧，当我们刚开始的时候，完全不知道自己在做什么。

Okay, like when we started, had no idea what we're doing.

Speaker 0

我们当时在努力弄清楚研究计划是什么，哪些重要、哪些不重要，甚至连我们寻找的东西的定义都模糊不清——比如‘可监控性’这个术语，我们不得不反复争论，到底什么是它，因为人们经常谈论‘忠实性’，这是一个相关概念。

We're like trying to figure out like what the research program was, like what mattered, what didn't matter, what even like the definitions of the thing we were looking for was like this term monitorability, we like had to like argue back and forth of like, what is the, because people talk about faithfulness a lot, which is a related concept.

Speaker 0

是的。

Yeah.

Speaker 0

后来我们逐渐放弃了那个，转而选择了‘可监控性’作为更实际的方向。

We like kind of, you know, got rid of that and went for monitorability as the more practical thing at some point.

Speaker 0

所以无论如何，我们当时完全是处于摸索阶段，做了大量研究，现在依然如此，我们会给人们很大的自由去探索这个领域中有趣的研究，因为这个领域太新了，我们了解得太少。

And so anyway, we were very much more in the like trying to figure everything out, do a bunch of like research, and we do and we still do kind of like do a bunch of, we kind of give people a lot of leash to just do research that's interesting in this space because it's so new and we know so little.

Speaker 0

是的。

Yeah.

Speaker 0

我认为，在那些定义更明确、你正在开发产品的场景中，很多事情可能有更清晰的前进路径，因此那里的研究也会变得更加明确或预先设定好。

I think in cases where things are more defined and you have products that are, like, you're you're making, there's probably a more clear path forward for a lot of things, and so then research there becomes a bit more, like, defined or some like, pre or predefined.

Speaker 0

我知道萨姆关注思维链和可监控性，但他应该挺忙的，所以我不会每个想法都去问他，求他批准。

I know Sam cares about chain of thought, monitorability, but, I think he's pretty he's pretty busy, so, you know, I don't go to him with every idea and ask for his okay.

Speaker 0

所以，是的。

So, yeah.

Speaker 1

这很合理。

That's fair.

Speaker 1

说得对。

Fair enough.

Speaker 1

这很合理。

That's fair.

Speaker 1

我猜他是个很忙的人。

I imagine he's a busy fellow.

Speaker 2

是的，因此在那里你们有更多自由去说，好吧，这对我们很重要。

Yeah, so you have a little bit more leeway there to say, okay, this is important to us.

Speaker 2

我们需要去。

We need to go.

Speaker 1

需要了解的是，是的。

Need to know about Yeah.

Speaker 0

但一般来说，每个人都很聪明，很有见解，所以这并不是自上而下的。

But generally, everyone's super smart and has really good opinions, and so I think it's not that top down.

Speaker 0

很多事都是自下而上的，是的。

A lot of it's very bottom up, Yeah.

Speaker 0

归根结底，你大概希望你的经理认可你所做的事情，但通常大家都很愿意讨论。

And at the end of the day, you kinda want your manager to approve of what you're doing probably, but generally, everyone's open to discussion.

Speaker 0

如果你提出一个好点子，他们会说，是的，这听起来很棒。

And if you pitch a good idea, they're gonna be like, yeah, that sounds great.

Speaker 0

我们就这么办。

Let's do that.

Speaker 0

去吧。

Go for it.

Speaker 1

是的。

Yeah.

Speaker 1

对。

Yeah.

Speaker 1

太棒了。

That's awesome.

Speaker 2

有道理。

Makes sense.

Speaker 2

好吧，最后一点，关于可监控性和推理轨迹。

Okay, last thing on, I guess, monitorability and the reasoning traces.

Speaker 2

所以当我们用户看到这些轨迹时，我一直想知道，我们到底看到的是什么？

So when we as a user see traces, I've always wondered, what is it that we're actually seeing?

Speaker 2

因为除非你查看开源版本，否则这并不是真正的思维过程。

Because it's not the actual train of thought unless you're looking at the open source one.

Speaker 2

那它是如何运作的呢？

So how does that work?

Speaker 2

是有一个模型在解释这些追踪信息，还是只是随意编造的？

Is there a model interpreting the traces or is it just like making it up?

Speaker 2

也就是说，我们到底看到的是什么？

Like how what are we actually seeing?

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

当然。

Absolutely.

Speaker 0

好问题。

Great question.

Speaker 0

所以，是的，我们会向用户展示一些思考过程，但只是以摘要的形式，是经过总结的，抱歉。

So, yeah, we we do show users, like, some of the thinking, but it's in a, like, a summary it's from a summarized or sorry.

Speaker 0

我们有一个摘要模型，它会获取思维链并以安全的方式进行总结，符合我们所有的产品政策，让用户更容易理解。

A we have a summarizing model that takes the chain of thought and, like, summarize it summarizes it in a safe way that's, you know, kind of, like, adheres to all of our product policies that the user can see and is more digestible.

Speaker 0

我的意思是，你们似乎看过开源模型。

I mean, it sounds like you guys have looked at the open source model.

Speaker 0

当你问它一个难题时，它会思考很久，内容多得让人读不完。

Like, when you ask it a hard problem and it thinks really long, it's, like, a lot to read through.

Speaker 1

它有太多话要说。

It's got a lot to say.

Speaker 0

是的。

Yeah.

Speaker 0

确实有很多内容要说。

It's a lot to say.

Speaker 0

你不想读完这些，也许第一次觉得有趣，但第二次你绝对不想再翻阅几页这种思考内容了，所以从产品角度来说，这只是我的看法。

You don't wanna read Maybe it's fun the first time, but the second time, you definitely don't wanna read through pages and pages of it thinking stuff, and so I think even from a product This is just my opinion.

Speaker 0

从产品角度来看，我猜摘要可能更受欢迎。

I would guess from a product standpoint, the summaries are probably preferable.

Speaker 0

我不知道它们是不是，但我只是觉得

I have no idea if they are, but I just

Speaker 2

实际上，是的。

Actually, yeah.

Speaker 2

因为我都会看它们。

Because I look at all of them.

Speaker 2

对吧？

Right?

Speaker 2

我会看你们的产品。

I look at you guys.

Speaker 2

我会看Gemini。

I look at Gemini.

Speaker 2

我会看Anthropic。

I look at Anthropic.

Speaker 2

我认为从OpenAI的角度来看，它确实比Gemini提供了更多关于它自身运作的信息。

And I think that OpenAI is is definitely more it gives me more information about what it's doing than Gemini's does,

Speaker 0

当然。

for sure.

Speaker 1

作为用户，除了偶尔的娱乐之外，我最希望的是能够看到：如果事情走偏了，我能理解它哪里出了错，从而明白是不是我提问的方式有问题，或者如何能问得更好，而这种反馈能让我有所收获。

And for me as a user, the main thing I want from them aside from occasional entertainment is I want to be able to see, like, if something went the wrong direction, if I can understand where it went wrong, I can understand maybe how I prompted something wrong, how I could have asked the question better and being able to see that that feedback lets me know.

Speaker 1

哦，我只是给了它一些无意义的内容，它就胡乱应付了，这怪我。

Oh, I just I just gave it some nonsense here and it it winged it And that's that's on me.

Speaker 1

这不是模型的问题。

That's that's not a model problem.

Speaker 1

这是科里的问题。

That's a Corey problem.

Speaker 1

你懂的？

You know?

Speaker 1

有时候确实如此。

And sometimes that's the case.

Speaker 2

这其实暴露了你思维中的缺陷。

Shows you, yeah, the flaws in your own thinking, actually.

Speaker 1

确实如此。

It does.

Speaker 1

它会

It's

Speaker 2

挺有意思的。

kind of interesting.

Speaker 1

确实如此。

It does.

Speaker 1

确实如此。

It does.

Speaker 1

但我永远忘不了第一次看它的时候。

I'll never forget the first time watching it, though.

Speaker 1

当时我和一个朋友在工作时，它刚发布不久。

We were, a buddy and I here at work when when a one first released.

Speaker 1

我们当时在电脑上观看直播，并通过Slack聊天。

We were, we were watching the the livestream on our computers and chatting via Slack.

Speaker 1

当我们中的第二个人获得访问权限时，我们就说：好吧。

And and and the second one of us got access, it was like, okay.

Speaker 1

快点。

Quick.

Speaker 1

赶紧开个电话会议。

Hop on a call.

Speaker 1

我们立刻加入并开始向它扔各种东西。

And, and we jumped on and started throwing things at it.

Speaker 1

当我们看着它时，我突然想起当时心想：这简直是我们从未见过的东西。

And as we watched it think, I just remembered thinking, this is like nothing we've ever seen before.

Speaker 1

这太疯狂了。

This is this is this was wild.

Speaker 1

真不可思议，还不到一年，甚至短短几个月内，你就已经对模型有这种期待了。

And it it's crazy how in not much more than a year heck, in a matter of a couple of months, frankly, it was what you just expected out of a model now.

Speaker 1

而且这个领域的发展速度非常快，真的非常有趣。

And it moves really fast in this space, and it's it's really interesting.

Speaker 1

我无法想象在研究方面，你们觉得自己前进得有多快。

And I can't imagine on the research front how fast you feel like you're moving.

Speaker 0

如果我能自己搞定一切，我倒希望它能慢一点。

It's all I you know, if I could if I could do it all myself, I'd maybe have it move a bit I I'd wish it would move a bit slower.

Speaker 0

我

Speaker 2

我也挺同意你的。

kinda agree with you too.

Speaker 2

我的意思是，这对我们来说很令人兴奋，但同时也希望给我们一点时间消化一下。

I mean, it's exciting for us, but also it's like, give me some time to process.

Speaker 1

是的。

Yeah.

Speaker 1

格兰特和我在圣诞节那一周就想着，要是所有公司能达成一个君子协议，暂停十四天，让我们都能好好睡一觉，那该多好。

Grant and I on on the week of Christmas were like, you know, it'd be really great if all these companies would just have, like, a gentleman's agreement for fourteen days here so we can all go get some sleep.

Speaker 1

也发生在了我身上。

Kinda happened to too.

Speaker 2

人们做了一些事情。

People did some stuff.

Speaker 2

哦，我还有一个最后的快速问答问题要问你们，就是你和科里一直在争论这个。

Oh, I did have one last lightning round question for you, which is, you Corey and I debate this.

Speaker 2

你是认为无论何时都如此，还是根据任务不同而变化？

Are you thinking on always or depending on the task?

Speaker 2

对我来说，每个任务，我总是处于思考状态。

So for me, I'm always every task, always thinking is on.

Speaker 2

你持什么观点？

Where do you land

Speaker 1

关于这一点？

on that?

Speaker 1

我是个路由器派。

And I'm a router guy.

Speaker 0

哦，是的。

Oh, yeah.

Speaker 0

我 definitely 是一个路由器型的人，我认为一般来说我是这样。

I'm definitely a I think, generally, I'm a router guy.

Speaker 0

比如，如果是一些我了解的简单事情，我希望它能快速返回结果。

Like, I like it to come back fast if it's like a thing I know is simple.

Speaker 0

不过，我个人使用思考模式的唯一情况是，当我问它一个问题时，我觉得它应该进入思考模式，或者我知道它需要更深入地分析，然后我就手动开启思考模式，让它认真思考这个问题。

Sometimes, though, I think the only time I personally use thinking is I've asked it a question that I thought it should be routed to thinking or like I know that it should think more and then I go and turn it on and like kind of ask it to, you know, think really hard about a thing.

Speaker 0

比如，当我问它一些工作中涉及数学的建议时，我肯定会开启思考模式，也许是专业模式之类的。

So like, if I'm like asking it some advice about some like math thing at work, then I'm like thinking only definitely, like maybe the pro mode or whatever.

Speaker 0

但是

But

Speaker 1

你不会打断它吗？

Don't you interrupt.

Speaker 1

那就是即时响应。

That's the instant.

Speaker 1

你不会不知道吧。

You know better than that.

Speaker 0

是的。

Yeah.

Speaker 0

对。

Yeah.

Speaker 1

然后你就把它打开。

And then you go kick it on.

Speaker 1

我的做法也差不多，我之前有一段时间没用。

That's kind of my approach too is I didn't for a while.

Speaker 1

花了一段时间才适应这个路由机制。

It took a while to get used to to the router.

Speaker 1

我当时就想，呃，我不确定。

It was like, oh, I I don't know.

Speaker 1

但随着时间推移，有一天我突然意识到，它其实大部分时候都挺管用的。

And over time, I just one day realized it was mostly just working for me.

Speaker 1

通常，会有像你说的那样，我会去启动思考的时候。

Usually, there are there are times, like you said, where I will go kick on thinking.

Speaker 1

但大部分情况下，它完成任务还是不错的。

But for the most part, it it does the job fine.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

各个模型现在都变得相当不错了。

The models are getting pretty good all around.

Speaker 0

所以

Speaker 1

确实是。

they are.

Speaker 1

前几天我们还聊到这个，说它们现在都挺好的。

Was were talking about that the other day about how they're just they're kinda all good now.

Speaker 1

就像我们所说的，现在并没有像几年前那样糟糕的模型了。

Like like we don't there aren't really terrible models out there like there were a couple of years ago.

Speaker 1

鲍文，非常感谢你今天加入我们。

Bowen, thank you so much for joining us today, man.

Speaker 1

和你聊天真的很愉快。

It's been a lot of fun.

Speaker 1

我非常感谢你为这样一个重要问题所付出的努力。

I really appreciate the work you're putting into a a very important problem.

Speaker 0

谢谢。

Thank you.

Speaker 0

是的。

Yeah.

Speaker 0

和你们讨论这个问题真的非常开心，也非常感谢你们邀请我参加。

It's been a blast to chat about with you with you guys, and, yeah, thanks so much for having me on.

Speaker 2

鲍文，人们在哪里可以找到你的工作成果，并关注你接下来的动态？

Bowen, where where can people find your work and and follow what you're doing next?

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是，我们的OpenAI博客以及现在的OpenAI对齐博客将继续发布我们大量的内容。

I mean, our you know, the OpenAI blog and now the, OpenAI's alignment blog will be a place we continue to post a lot of our things.

Speaker 0

你可能在Twitter上找到我，但我发帖不多，所以没什么用。

You can probably find me on Twitter, but I don't post that much, so it's not that useful.

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是，发布安全相关的工作是践行我们使命的好方法，因此我们会尽量发布所有能帮助大家提升模型安全性的内容。

I mean, we'll be publishing safety work is, like, a good way to to adhere to the mission, so we'll prob we'll be trying to publish everything we can to help everyone, make their models safer.

Speaker 1

太棒了。

So Awesome.

Speaker 1

谢谢你，Bowen。

Thank you, Bone.

Speaker 1

我们非常感谢。

We appreciate it.

Speaker 0

谢谢大家。

Thank you, guys.