AI安全...好吧，末日论者：与安卡·德拉甘对话

本集简介

构建安全且强大的模型是我们这个时代最重大的挑战之一。我们能否让人工智能造福全人类？如何防范生存性威胁？为何对齐问题如此关键？跟随汉娜·弗莱教授，与谷歌DeepMind人工智能安全与对齐负责人安卡·德拉甘一同探讨这些核心议题。延伸阅读请搜索"前沿安全框架介绍"与"评估前沿模型的危险能力"。特别鸣谢以下制作团队成员（包括但不限于）：主持人：汉娜·弗莱教授系列制片人：丹·哈杜恩剪辑：拉米·察巴尔（Telltale工作室）项目监制与制片：艾玛·尤瑟夫制作支持：莫·达乌德音乐作曲：埃莱妮·肖摄影指导与视频剪辑：汤米·布鲁斯音频工程师：佩里·罗甘廷演播室制作：尼古拉斯·杜克视频剪辑：比拉尔·梅里视频美术设计：詹姆斯·巴顿视觉形象设计：埃莉诺·汤姆林森本节目由谷歌DeepMind委托制作若喜欢本期节目，请在Spotify或苹果播客留下评价。我们始终期待听众的反馈，无论是意见、新想法还是嘉宾推荐！本节目由AdsWizz旗下Simplecast平台托管。个人信息收集及广告用途相关说明详见pcm.adswizz.com

双语字幕

仅展示文本字幕，不包含中文音频；想边听边看，请使用 Bayt 播客 App。

Speaker 0

欢迎收听谷歌DeepMind播客节目。

Welcome to Google DeepMind, the podcast.

Speaker 0

我是主持人汉娜·弗莱教授。

I'm your host, professor Hannah Fry.

Speaker 0

如今在硅谷腹地，一个新的流行语应运而生。

Now in the heart of Silicon Valley, there's a new phrase that has emerged.

Speaker 0

我认为这是为了模仿千禧一代的反驳'好吧，老古董'，以及它那种不屑一顾的态度。

I think it's designed to mirror the the millennial retort, okay boomer, and how dismissive it is.

Speaker 0

但如今'好吧，末日论者'已成为那些想要淡化AGI危险言论的标准回应。

But okay doomer is now the go to response for those who wish to diminish talk of AGI's dangers.

Speaker 0

这里的AGI当然是指能在广泛任务上达到人类智力水平的AI系统。

And by AGI, we mean, of course, an AI system that can tackle a wide range of tasks at a level comparable to human intelligence.

Speaker 0

但在本节目中，我们要直面这些重要问题。

But on this podcast, we want to tackle those important questions head on.

Speaker 0

开发AI安全吗？

Is building AI safe?

Speaker 0

我们如何确保安全？

How can we be sure?

Speaker 0

这些算法的设计者正在为我们承担哪些生存风险？

And what existential risks are the designers of these algorithms taking on our behalf?

Speaker 0

为了寻找答案，我们将直接对话高层。

Well, to get some answers, we're gonna go straight to the top.

Speaker 0

安卡·德拉甘是谷歌DeepMind的AI安全与对齐负责人，她专注该领域已近十年。

Anka Draghaan is the lead for AI safety and alignment at Google DeepMind, and she has been focused on this area for almost a decade.

Speaker 0

安杰尔拥有机器人学博士学位，此后在Waymo自动驾驶汽车等项目上积累了丰富经验。

Anger has a PhD in robotics and since then has gained extensive experience working with driverless cars on the Waymo project, among many, many other things.

Speaker 0

她还是加州大学伯克利分校教授，研究人机交互与对齐问题。

She is also a professor at UC Berkeley working on human interaction and alignment.

Speaker 0

最近，她一直在研究Gemini的安全性，这是谷歌最强大的多模态模型，能够理解文本、代码、音频、图像和视频。

Recently, she's been working on the safety of Gemini, Google's most capable multimodal models, which can understand text, code, audio, images, and video.

Speaker 0

欢迎来到播客节目，安卡。

Welcome to the podcast, Anka.

Speaker 1

谢谢你，汉娜。

Thank you, Hannah.

Speaker 0

我是说，你的头衔相当显赫啊。

I mean, you've got quite a big title there.

Speaker 0

你在Google Do Mind实际负责什么工作？

What are you actually responsible for at at Google Do Mind?

Speaker 1

始终如一的安全保障。

Safety of Always.

Speaker 1

全方位负责。

All of it.

Speaker 1

确保我们的GenAI模型安全，嗯。

Safety of our GenAI models Mhmm.

Speaker 1

从当前的Gemini系列产品开始，既要避免当下Gemini可能造成的危害，也要随着模型能力不断提升，着眼于长期的安全与对齐问题，防止更严重、更极端甚至潜在的灾难性危害。

From the current Gemini family and avoiding present day harms with Gemini all the way to safety and alignment for longer term as model capabilities improve further and further and further, avoiding more severe, more extreme, potentially even catastrophic harms.

Speaker 1

所以我有一个团队，正如我的头衔所示，叫做AI安全与对齐团队。

And so I have a team, it's called AI Safety and Alignment, as my title would suggest.

Speaker 1

我们的使命，可以说，是确保Gemini及未来更强大的模型能可靠地实现个人和社会所期望的目标。

Our mission, I'd say, is to ensure that Gemini and future more capable models robustly do what individuals and societies want.

Speaker 0

我在思考你提到的短期风险和长期风险，因为历史上这两者通常是相互割裂的。

I'm just wondering about, like, the short term risks and the long term risks as you're describing because historically, those have been quite separated from one another.

Speaker 0

是的。

Yes.

Speaker 0

但你

But you're

Speaker 1

分类远不止分离那么简单。

sort More than separated.

Speaker 1

我们这里基本上有两个群体。

We have, like, two communities out there.

Speaker 1

一方是关注当下危害的人工智能伦理学者，另一方是担忧灾难性风险的长期风险研究者。

We have AI ethics and that worries about present day harms, and we have ex risk folks who worry about catastrophic risks.

Speaker 1

这很令人沮丧，因为人工智能伦理学者会说这些担忧分散了对当下危害的注意力。

It's a very frustrating thing because you have AI ethics saying, oh, these things are just like distractions from the present day harms.

Speaker 1

我强烈认为两者关系远不止于此——甚至不是互相补充的关系，特别是在我们意识到人类水平（甚至超人类水平）的多领域认知能力可能比预期更早实现的情况下。

I very emphatically see it as much more, not even compliment, helping each other, especially if we think that getting to kind of human level capabilities across many cognitive tasks or even above human level capability is coming sort of sooner rather than later.

Speaker 1

对吧？

Right?

Speaker 1

可以说，确保安全性可能是我们这个时代最重要的挑战之一。

Like, getting the safety right is probably, I'd say, one of the most important challenges of our time.

Speaker 1

当然也有人会说别费心了，我们离那还远着呢，需要先找到新范式，安全可以之后再说——我坚决反对这种观点，理由有很多。

And of course, there's those who say, like, don't bother, we're nowhere near, we are going to need a different paradigm, and, you know, we'll figure out safety after we figure that out, and I kind of just so vehemently strongly disagree with that, you know, for a number of different reasons.

Speaker 1

首先我不同意他们的时间线预测，也不认为现有范式无法实现目标。

Of course, I disagree with the timeline and this notion that it's impossible to get there with the current paradigm.

Speaker 1

你认为会更早到来。

You think it's sooner.

Speaker 1

确实如此。十年前讨论AI安全时，我还以为我们有很多时间。

I think it's sooner, and I think I used to think we had a lot of time, like, you know, ten years ago.

Speaker 1

当时我们参加各种研讨会。

We were looking at AI safety.

Speaker 1

很多人忧心忡忡，而我却很淡定。

We're on these panels, and a lot people were very concerned, and I was chill.

Speaker 1

我当时只觉得这是个重要议题。

I was like, it's a very important topic.

Speaker 1

在学术界研究这个对我来说很重要，因为作为学者的职责就是前瞻性思考，但我们原本以为时间充裕，现在我不再这么认为了，我感受到更大的紧迫性。

It's important for me in academia to do work on this because that's my job is to look ahead as an academic, but we have plenty of time, and I don't feel like that necessarily anymore, and I perceive a lot more urgency to this.

Speaker 1

但你看，即使我们时间充足，即便我们判断错误，即便以现有范式无法实现人类水平智能，我们也需要一些需要十年或更久才能取得的重大突破。

But look, even if we had plenty of time, and like even if we're wrong about that, even if it's impossible to get to, you know, human level intelligence with the current paradigm, we'll need some really major breakthroughs that are gonna take a decade or more.

Speaker 1

这是可能的，确实可行，但我没有足够的信心说'啊，别担心'。

It's possible, like that's feasible, I just I'm not confident enough to be like, ah, don't worry.

Speaker 1

但即便真是那样，这种'先开发能力再考虑安全性'的整体思路就非常令人担忧。

But even if that were the case, right, this whole notion, this premise of we'll figure out the capabilities, and then we'll figure out how to make it safe is so worrisome.

Speaker 1

本末倒置。

Backwards.

Speaker 1

是的。

Yeah.

Speaker 1

就是本末倒置。

It's backward.

Speaker 1

我有几个类比可以说明

Like and I have a few analogies that

Speaker 0

我很乐意听。

I like to give.

Speaker 0

请讲。

Please.

Speaker 1

其中一个借用了我的同事斯图尔特·罗素的比喻。

One of them I borrow from my colleague, Stuart Russell.

Speaker 1

斯图尔特喜欢用桥梁来类比。

So Stuart likes to draw the analogy with a bridge.

Speaker 1

你不会说'我先设计桥梁，再请安全团队来研究如何保证桥梁安全'。

Like, you don't think I'm going to design a bridge, and then I'm going to bring in a safety team, and they're going to figure out how to make the bridge safe.

Speaker 1

这不是建造安全桥梁的正确方式——安全性应该从一开始就影响我们的桥梁设计决策，对吧？

It's just not really how We it want to make a safe bridge, and that influences the design decisions that we make when we design the bridge, right?

Speaker 1

此外我也在借鉴自己的经验。

And then I'm also kind of drawing on my own experience.

Speaker 1

就像介绍里提到的，我其实攻读的是机器人学博士学位。

Like you mentioned in the intro, I did my PhD in robotics actually.

Speaker 1

但我在这个领域的主要贡献是提出了一个概念：如何将人机交互整合到我们要解决的问题规范中。

But kind of my claim to fame there was this notion of how do we integrate interaction with humans into the spec of the problem that we're trying to solve.

Speaker 0

让我确认下是否理解正确。

Let me make sure I understand this.

Speaker 0

假设你正在建造一个机器人，这个机器人的任务是，比如说，穿过一个房间，

So if you're building a robot, and and the robot is tasked with, I don't know, crossing a room,

Speaker 1

走向某人。

going to someone.

Speaker 1

抓取物体是我博士期间最喜欢研究的内容。

Picking This up was my favorite thing to my PhD.

Speaker 1

就像我曾有个财务操作机器人，专门负责捡瓶子。

It's just like I had a financial manipulator, and it picked up bottles.

Speaker 1

好的。

Okay.

Speaker 1

这就是我们要讨论的场景示意图。

That's that's the picture to come in here.

Speaker 1

所以如果

So if

Speaker 0

你有个机器人需要穿过房间，拿起瓶子，然后突然发现：哦对了，房间里还有人需要避开。

you've got this robot that that crosses a room, picks up a bottle, and then afterwards, you're like, oh, now, PS, there's people in the room that you've gotta avoid them.

Speaker 0

不能撞到他们。

You can't run them over.

Speaker 0

不能与他们发生碰撞。

You can't crash into them.

Speaker 0

你必须在这个有他们的空间里穿行。

You've gotta navigate that space with them in it.

Speaker 0

你不能只是在最后随便加个东西就想解决问题。

You can't just stick a bit on at the end that will solve that.

Speaker 0

你必须从一开始就考虑到这一点。

You've gotta start thinking of that from the very beginning.

Speaker 1

是的。

Yes.

Speaker 1

进一步解释的话，你要考虑到人们在移动时需要确保安全——当你们都在移动时，必须让人能预判你的行动轨迹。要知道人不是静止的，他们会移动，你需要预判他们，同时他们也要能预判你。

And to explain that further, you can think of people need for you to actually be safe when you're moving, both of you are moving, you have to make sure that people can actually anticipate what's coming from you, not, you know, not just like people are not static, they move, you have to start anticipating them, but then they have to be able to anticipate you.

Speaker 1

好的。

Okay.

Speaker 1

所以现在你有个完全不同的目标：定义状态时不仅要考虑物理状态，还要考虑人类会怎么解读你的行为等等。

So now, you have a very different objective, which is like you're defining the state not just to be the physical state, but looking over, but what is the human gonna think I'm doing, blah blah blah.

Speaker 0

这个机器人的可预测性如何？

How predictable is the robot's

Speaker 1

行为？

actions?

Speaker 1

对。

Yeah.

Speaker 1

没错。

Yeah.

Speaker 1

事实证明，要确保安全就必须把这点写入规范。

You know, it turns out that if you wanna make sure that the thing is safe, you have to put that into spec.

Speaker 1

你必须明确要求：我要安全可靠的方案。

You have to say, I want I want safe and capable.

Speaker 1

我要一座不会坍塌的桥。

I want a bridge that will not collapse.

Speaker 0

那么从机器人技术转向无人驾驶汽车，因为我知道你在无人驾驶领域也有丰富经验，特别是你在Waymo的工作。

So moving on from from robotics, because I know you've also got extensive experience in in driverless cars, the work that you did with Waymo.

Speaker 0

我是说，我认为其中的人机交互对系统运作至关重要。

I mean, I imagine that the human AI interaction in that is absolutely integral to it working.

Speaker 0

所以我在想，如果完全把人类从等式中移除，无人驾驶汽车可能根本不会像现在这样成为难题？

So if you I can I can imagine if you take humans out of the equations altogether, actually, driverless cars, maybe not nearly as difficult a problem as it as it is when you when humans are there?

Speaker 1

是的。

Yeah.

Speaker 1

我认为无人驾驶汽车唯一的难点就是必须与人类互动。

I would say that the only difficult thing about driverless cars is that you have to interact with people.

Speaker 0

唯一的难点。

The only difficult thing.

Speaker 1

对。

Yeah.

Speaker 1

我的意思是，感知系统已经足够完善，能告诉你所有信息。

I mean, it's like perception is good enough to, like, tell you everything.

Speaker 1

就像，根本不存在问题。

Like, there's not there's no problem.

Speaker 1

比如五年前，如果不需要考虑与人类共享道路的问题，大规模驾驶本就不是难题。

Like, you know, five years ago, there wasn't a problem with being able to drive at at scale if you didn't have to deal with the fact that there are humans who you have to share the road with.

Speaker 1

所以没错，这就是我进入汽车领域的原因。

And so, yeah, for sure, that that was that's why I got into cars.

Speaker 1

因为真正解决与人类的交互问题才是关键突破口。

It's because it really felt like nailing the interaction with with people was gonna be the key enabler.

Speaker 1

所以我花了六年时间，每周只花一天。

So I spent six years, like, just one day a week.

Speaker 1

对吧？

Right?

Speaker 1

我之前在Oaimo做咨询，但在转投谷歌DeepMind之前已为他们效力六年。

So just consulting for Oaimo, but I spent six years with them before I switched to Google DeepMind.

Speaker 1

对了，我跟你讲过我是怎么进入汽车行业的吗？

Did I tell you how I got into Cars by the way?

Speaker 1

说来听听。

Tell me.

Speaker 1

就这样我进入了汽车领域。

So I got into Cars.

Speaker 1

当时我在卡内基梅隆大学，你知道的，随手拿起一个瓶子时，我参加了伯克利分校由Peter Abiel组织的会议——他现在是我的同事和挚友。

I was on the, you know, pick up the bottle, and I came to a conference at I was at Carnegie Mellon, and I came to a conference at Berkeley organized by Peter Abiel, who's now my colleague and dear friend.

Speaker 1

他安排我们去参观谷歌，那时候还不叫Waymo。

And he set up this for us to be able to go and visit Google, and it was called not called Waymo back then.

Speaker 1

那个项目叫谷歌司机还是什么类似的名称。

It was the Google Chauffeur or something like that project.

Speaker 1

他为我们安排了试乘体验。

And he arranged for us to take these rides.

Speaker 1

那时我已从事机器人研究好几年，当我坐进那辆车启动时——

And so I was working on robotics at that point for a few years, and I get into this car and it goes.

Speaker 1

它正在做出一个又一个正确决策。

And it's making decision after decision the right thing.

Speaker 1

那真是一次颠覆性的体验。

And it it there was this transformative experience.

Speaker 1

对吧？

Right?

Speaker 1

就像身处一个机器人里...

Like, to be in a robot Yeah.

Speaker 1

是机器人在驾驶着你。

The robot's driving that you.

Speaker 1

就是懂得如何理性分析正在发生的一切，并以安全的方式行动，能够预见各种情况等等。

Just knows how to sort of reason about everything that's going on and act in the in the way in a safe way, in a way that, you know, people that can anticipate blah blah blah, all that stuff.

Speaker 1

对吧？

Right?

Speaker 1

那确实是的。

That was just yeah.

Speaker 1

这简直令人难以置信。

It was mind boggling.

Speaker 1

而且从内部体验它，就像我在机器人里亲身感受一样。

And and to experience it, like, from the inside ways, you get to experience it as, like, I'm in the robot.

Speaker 0

那么究竟是什么让人工智能与人类的交互如此困难？

So what is it that makes it so difficult, that that human AI interaction?

Speaker 0

是什么让这个问题如此棘手？

What is it that makes it such a hard problem?

Speaker 0

我猜还有，从无人驾驶汽车中学到的经验有多少可以转化应用到这个领域？

And and I guess, also, how much of the lessons that have been learned from driverless cars can you then translate to this stuff?

Speaker 1

是的。

Yeah.

Speaker 1

没错。

Yeah.

Speaker 1

对。

Yeah.

Speaker 1

在交互层面，汽车领域真正有用且重要的是预判人类行为，并考虑到驾驶是多回合互动的特性。

So on the kind of the interaction side, what's really useful in cars and what's important is to anticipate what people will do and to account for the fact that, you know, driving is multi turn.

Speaker 1

就像你行动，人们行动，你继续行动，他们继续行动，如此循环。

Like, you act, and people act, and you keep acting, and they keep acting, and so on.

Speaker 1

这意味着作为智能体、机器人或AI，你的行为会影响你从对方那里得到的反馈。

And what that means that what you do as the agent, as the robot, as the AI influences what you see in response from them.

Speaker 1

这种情形就像，举个例子

And so that kind of opens up like, here's an example.

Speaker 1

想象你正在并道

Imagine you're merging.

Speaker 1

嗯哼

Mhmm.

Speaker 1

当你试图并道时，车流中可能没有足够大的空隙，但你会开始慢慢贴近，对方看到后可能会稍微减速，从而创造出并车空间让你切入

So when you're trying to merge, there might not be a big enough gap in traffic, but what you'll do is you'll start nudging in, and the person will see that they might slow down just a little bit to create that gap in traffic and you go in.

Speaker 1

对吧？

Right?

Speaker 1

这是个你来我往的过程

It's a back and forth.

Speaker 1

就像一场对话

It's a conversation.

Speaker 1

确实如此

It really is.

Speaker 1

这教会我们的是，作为AI必须意识到你对人们的影响，包括他们的行为和言语

And so what it teaches us is really this notion of, yeah, you have as an AI to be mindful of the influence you have on people, what people do, what people say.

Speaker 1

举个具体的例子，在当今大语言模型领域比如Gemini，当用户提出请求时，Gemini可能会引导你进一步澄清

And maybe a concrete example of that in kind of large language model land with something like Gemini today would be, you know, if the person asks you to do something, Gemini might be able to nudge you to kinda clarify.

Speaker 1

您具体是指什么？

What do you mean?

Speaker 1

您想要什么？

What do you want?

Speaker 1

您是要这个吗？

You want this?

Speaker 1

还是要那个呢？

Do you want that?

Speaker 1

对吧？

Right?

Speaker 1

正因为它了解情况，才能获得有助于更好服务你的回应。

And then because it knows, then it can get, you know, a response that is helpful for it to actually serve you better.

Speaker 1

所以你看，这就是一个教训。

And so, you know, that's a that's a lesson.

Speaker 1

比如，我不认为Gemini目前在这方面很擅长，但这个经验是可以迁移的。

Like, I don't think Gemini is very good at that yet, but that's I think a lesson that could transfer.

Speaker 1

这就像你如何规划这种多轮互动，以及那种你来我往的交流方式和对人们产生的影响？

It's like how do you actually plan for this multi turn interaction and the sort of the back and forth and the influence you have on people?

Speaker 0

这个观察确实很有意思，因为我想人类正是通过这种你来我往的交流——那种反馈循环效应——来确保彼此协调一致，从而获得确定性或朝着正确方向推进。

That's a really interesting observation, actually, because I guess that is the way that humans make sure that they're aligning with each other, is by having that back and forth, the sort of that that feedback effect, as it were, that that that loops around and allows you to be certainly or nudging the right direction.

Speaker 1

是啊。

Yeah.

Speaker 1

完全正确。

Absolutely.

Speaker 1

我觉得这正是当前协调机制中有所缺失的部分。

I feel like and this is a part that I think is missing a little bit in alignment.

Speaker 1

对吧？

Right?

Speaker 1

没有什么能明确告诉我：'其实我不知道人们真正想要什么，我必须保持对话开放性'。

There's nothing that's like, oh, actually, I don't know the ground truth of what people want, and I have to keep that conversation open.

Speaker 1

我必须持续学习。

I have to keep learning.

Speaker 1

我必须不断收集反馈并更新。

I have to keep gathering feedback and updating.

Speaker 1

对吧？

Right?

Speaker 1

你可以在某种程度上以全球视角来审视我们的整体价值观和偏好。

And and you could do that at sort of like a kind of a global scale of what our overall values and preferences.

Speaker 1

你也可以在本地与个人的互动层面进行这样的思考，比如当有人问你'嘿，我刚开完一个峰会'时。

You could also do that at like a local interaction with one person scale where you're, you know, the person asks you, hey, I would like I'm know, I just had a summit.

Speaker 1

我要去参加一个峰会，正在组织它，需要制定议程。

I'm going to a summit, I'm organizing it, and I need to craft the agenda.

Speaker 1

你能帮我制定议程吗？

Can you help me crafting the agenda?

Speaker 1

会议将在伦敦举行，为期两天，还有些具体细节。

You know, it's gonna be in London, it's gonna be like two days, you know, some details there.

Speaker 1

我不希望模型的回应只是简单列出时间表，比如上午9点、9点半、10点这样，它根本不知道我真正想要什么。它需要与我进行互动交流，明白吗？

I don't want the model's response to be like, okay, here's an agenda, 9AM, you know, 09:30, ten, like, it has no idea what I So actually want, it has to engage with me in an interaction, in this back and forth, right?

Speaker 1

所以它应该先问：'什么内容是重要的？'

So it should probably ask like, okay, what is important?

Speaker 1

'你想达成什么目标？'

What are you trying to get out?

Speaker 1

就像我们一起思考这个问题，经过这些步骤后，最终才能得出真正的议程。

Like, let's think about this together, and eventually, after all that, you end up with, you know, the actual agenda.

Speaker 0

直指核心，逐步接近你真正想要的东西。

Getting to the heart, spiraling in towards you what you actually want.

Speaker 0

嗯。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

你多次提到'对齐'这个词，我注意到这也是你职位名称的一部分。

You've mentioned this word alignment a few times, and I I noticed it's also in your job title.

Speaker 0

你能给它下个定义吗？或者说它的目标是什么？

Have you got a definition for it or or what the objective of it is?

Speaker 0

是的。

Yeah.

Speaker 0

所以在我看来是这样的。

So the way I would see it okay.

Speaker 0

V零定义。

V zero definition.

Speaker 0

作为

Speaker 1

他们脑海中内在地存在一个人类的形象。

there's a human internally in their head implicitly.

Speaker 1

他们有所欲求，有所关心，但价值观各不相同。

They want they care, but they have different values.

Speaker 1

他们有自己的偏好。

They have preferences.

Speaker 1

他们有自己的目标，而对齐就是让智能体在人类脑海中那个隐含的目标上表现良好。

They have goals, and alignment is get the agent to do well on that objective that's implicit in the person's head.

Speaker 0

假设你知道某人的目标是什么。

Let's say that you know what somebody's objective is.

Speaker 0

假设你可以直接把它写下来，非常简单明了。

Let's say that you can just write it down, and it's very simple and straightforward.

Speaker 0

你如何训练一个算法来实现这个目标？

How do you train an algorithm to achieve that objective?

Speaker 1

对。

Yeah.

Speaker 1

所以如果你知道目标是什么，奖励函数是什么，而我们

So if you know what the objective is, what the reward function is And we

Speaker 0

举个例子，奖励函数可能是——比如在不打碎的情况下拿起瓶子。

were an example of reward function might be, I don't know, pick up the bottle without breaking it.

Speaker 1

嗯。

Yeah.

Speaker 1

嗯。

Yeah.

Speaker 1

嗯。

Yeah.

Speaker 1

你可以采用一种名为强化学习的方法，本质上这是一种优化过程，让AI系统从起点出发，最终实现目标并最大化该目标。

And what you can do is do something called reinforcement learning, which is a type of optimization, basically, of, like, getting the AI system to go from a starting point to, like, actually delivering, maximizing on that objective.

Speaker 0

所以我有时会把强化学习想象成玩电脑游戏，AI每向目标靠近一步就会获得奖励积分，这在玩《太空侵略者》这类游戏时很直观。

So I sometimes think of reinforcement learning almost as though you are playing a computer game, and the AI gets, like, reward points for everything it does towards its objective, which which sort of makes sense when you're doing something like Space Invaders.

Speaker 0

奖励机制非常明确。

The reward points are really clear.

Speaker 0

但当目标只是某人脑海中的概念时，难度就会大幅增加。

But I guess it gets a lot harder when the objective is just something that happens to be in someone's head.

Speaker 0

这就引出了关键问题：我们讨论的是哪些人、谁的想法。

And then you get into the question of of of which humans and and whose heads are we talking about.

Speaker 1

对。

Yeah.

Speaker 1

所以那只是v0版本，因为接下来要升级了。

That's that's why that was just v zero because then Level up.

Speaker 1

嗯。

Yeah.

Speaker 1

继续。

Go on.

Speaker 1

因为这是我使用了一段时间的定义。

Because so this is this is definition I've been using for a while.

Speaker 1

现在你提出了这个极其深刻的问题——具体指向哪个人类。

So now you kind of you ask this very, very deep question of which which human.

Speaker 1

一个人，用户，普通人，我必须承认我长期以来对此避而不谈，直到大约几年前，我才对推荐系统越来越感兴趣。

One human, the user, the average human, and I have to admit I really punted on that for a really long time until maybe a couple years ago now, I got more and more excited about recommendation systems.

Speaker 1

嗯。

Mhmm.

Speaker 1

因为在推荐系统中，它们优化的是所谓的'用户参与度'。

Because in recommendation systems, you know, they they optimize for quote unquote user engagement.

Speaker 0

迈克尔，给我举个推荐系统的例子。

Give me an example of a recommendation system, Michael.

Speaker 1

推特、脸书、新闻推送——理念就是给你想要的内容。

Twitter, Facebook, The News idea is supposed to be like that's what you want.

Speaker 1

它提供你想要的东西。

It's giving you what you want.

Speaker 1

对吧？

Right?

Speaker 1

问题是人们当下想要的优化内容，可能并非理想的选择。

The challenges that optimizer people want in the moment is maybe not, you know, the ideal thing.

Speaker 1

那么，应该优化什么才能真正为用户带来价值？

So, what should you be optimizing for that actually brings users value?

Speaker 1

我开始深刻思考这个问题：到底该为哪类人类优化？真的是用户吗？

And I got really concerned about this notion of which human and should you really be, you know, is it the user?

Speaker 1

还是公司？

Is it Or the company.

Speaker 1

或者...呃...

Or the com or yeah.

Speaker 1

没错。

Exactly.

Speaker 0

因为推荐系统实在太擅长让你保持参与——就我个人而言，不得不在手机上设置屏蔽功能，防止自己整天刷手机。

Because I mean, recommendation systems, I mean, the fact that they are so effective at keeping you engaged I mean, for me, personally, I have to put blockers on my phone to stop myself being on it the entire time.

Speaker 0

是的。

Yeah.

Speaker 0

你知道，我认为整个社会就像，存在着各种不同的阶层，对吧，由不同的人群构成。

You know, and and I think that society overall like, there's there's all those layers, right, of of different people.

Speaker 1

对。

Yeah.

Speaker 1

没错。

Exactly.

Speaker 1

所以我们做过一项研究。

So we did one study.

Speaker 1

那篇论文的主要作者是史密斯和米利，当时我正在伯克利做常规的研究工作。

The lead author there was was Smith and Milley, and so this was while I was doing my just my regular Berkeley research work.

Speaker 1

我们开始研究，这种Twitter排名算法对人们会产生什么影响，就是那种试图给你推送你想要内容的算法。

We started looking at, okay, what are the effects on people of, you know, kind of this Twitter ranking algorithm that kind of tries to give you what you want.

Speaker 1

我们注意到的一个现象是，它倾向于给人们推送——特别是在政治领域——让他们更喜欢自己所属政治阵营，同时对外部政治群体感到愤怒的内容。

And one of the things that we noticed is that it tends to give people, especially in the political space, content that makes them like their political in group better and be very pissed at the political out group.

Speaker 1

这与一种我称之为'有效极化'的现象有关。

This is related to a phenomenon I might call effective polarization.

Speaker 1

这不一定意味着你的观点会改变（虽然可能会），而是指你对外部群体产生的情绪影响。

It's not necessarily that your opinions change, they might, but the effect that you have towards your out group.

Speaker 1

所以，这对社会肯定不是好事。

So, that can't be good for society.

Speaker 1

即使这是人们想看到的内容（实际上可能并非如此），这也与社会层面的影响存在冲突。

So, even if that's what the person wants to see, and it turns out maybe they don't, that's in conflict with, you know, societal level effects.

Speaker 1

这也指向了政治内部群体与外部群体的对立问题。

And that also points, you know, we got to a political in group versus out group.

Speaker 1

对吧？

Right?

Speaker 1

那么一个模型，一个AI模型应该做什么呢？在做出决策时，它应该如何平衡个体、用户和社会的其他部分？

So what should a model, an AI model do, right, and how should it balance individuals, the user, and the rest of society when it makes decisions?

Speaker 1

这看起来真的很难，真的非常难。

That seems really hard, and I Really, really hard.

Speaker 1

是啊。

Yeah.

Speaker 1

所以当我加入Google DeepMind时，我最初的举措之一就是组建一个团队来研究这个问题，研究如何在承认世界上存在不同价值观人群的情况下实现价值对齐。

So, when I got to when I got to Google DeepMind, one of my first moves was to set up a team that would look at this, that would look at how do you do value alignment when you acknowledge that there are different people in the world with different values.

Speaker 1

这意味着什么？

And what does that mean?

Speaker 1

这应该如何影响我们的算法和我们的工作？

How should that affect, you know, our algorithms, what we do?

Speaker 0

听起来你描述的这个任务困难到几乎不可能完成。

I mean, it sounds so difficult as to almost be impossible the way that you're describing.

Speaker 0

怎么可能创造出能够平衡这些相互竞争、常常冲突的因素的东西呢？

How can you possibly create something that's capable of balancing those competing, often conflicting Yeah.

Speaker 0

目标？

Objectives?

Speaker 1

好的。

Okay.

Speaker 1

这确实非常困难，我并不是说我有答案，但我要指出科学、研究和经济学领域已经有人思考过这个问题。

So it is really hard, and I, you know, I don't claim to have the answers here, but I will point out that there are areas of science and research and economics that have thought about this.

Speaker 1

比如投票机制中的社会选择理论、偏好聚合概念，其核心理念就是我们需要做出一个能让具有竞争性目标的多方都能接受的决策。

If you think about voting, there's this notion of social choice, of preference aggregation, and the whole notion is that, yeah, we have to make a decision that's okay for multiple people that might have competing objectives.

Speaker 1

因此团队正在研究几种不同的可行方案。

And so, you know, there's there's a few different things you could start doing that the team is looking at.

Speaker 1

其中一种方法是尝试使用不同的奖励函数和不同目标，而不是只学习单一目标或平均目标。

One of them is you could try to have different reward functions, different objectives, instead of just learning one or the average one.

Speaker 1

就像现在，标准做法通常是收集一堆偏好数据，然后拟合一个奖励模型，但这是个单一奖励模型。

Like, so, you know, right now, kind of the standard recipe is you get a bunch of preference data, you fit a reward model to that data, but it's one reward model.

Speaker 1

但实际上，如果你询问世界各地不同文化、不同身份、不同政治光谱的人对A和B的选择。

But actually, if you ask different people across the world, across different cultures, across, you know, different identities, across the political spectrum, A versus B.

Speaker 1

在某些A对B的选择上大家会达成一致，但在其他A对B的选择上则会有不同意见。

For some A versus B, they'll all agree, but for other A versus B, they'll have different opinions.

Speaker 1

这不仅仅是噪音。

And that is not just noise.

Speaker 1

在某些情况下确实是噪音，

Like, those are actual in some cases, it's noise.

Speaker 1

但在某些情况下，这其实是真实的意见多样性。

In some cases, it's like actual diversity of opinions.

Speaker 1

因此你可以开始尝试的不是使用单一奖励函数，而是为不同类型的人准备多个奖励函数，或者使用一个输出分布而非单一数值的奖励函数，对吧？

And so then what you can start to do is not have one reward function, but have, you know, multiple reward functions for different types of people, or you could have a reward function that doesn't output one number but output sort of a distribution, Right?

Speaker 1

这样你就可以要求你的智能体做出让大多数人都满意的决策，如果做不到，至少也要确保不会让任何人特别不满，即保证最低标准足够高。

So then you can ask your agent, yeah, make decisions that, you know, people across the board would like if that's not possible, that, you know, would not be too bad according to anyone, so the minimum bar, the floor is is high.

Speaker 1

另外有件事还挺有意思的。

Another thing that it turned out like, it's kinda funny.

Speaker 1

我原本在伯克利尝试推进这个。

I was trying to do this at at Berkeley.

Speaker 1

我试图说服学生们参与研究，后来加入Google DeepMind才发现，周围已经有人在研究这个并取得很好进展了，就是这种审议对齐的概念。

I was trying to convince my students to work on it, and I joined Google DeepMind, and I found out that that people around were already working on this and doing really well on is this concept of deliberative alignment.

Speaker 1

好的。

Okay.

Speaker 1

好的。

Okay.

Speaker 1

那么审议对齐——你听说过公民议会吗？

So deliberative alignment do you know about citizens' assemblies?

Speaker 1

是的。

Yeah.

Speaker 0

我是说，这有点像民主程序的决策过程。

I mean, this is sort of like the democratic process decisions.

Speaker 1

没错。

Exactly.

Speaker 1

所以，好吧。

So, like okay.

Speaker 1

具体来说，公民大会非常棒，因为其核心理念是召集一群代表性人群，让他们就议题进行审议，最终可能达成他们原本不会形成的共识。这些人

So in particular, citizens' assemblies are really cool because the idea is you bring a group of people of representatives, and then they deliberate on issues, and perhaps they come to consensus that they wouldn't have These are people

Speaker 0

来自截然不同的背景，按照尽可能代表人口多样性的标准抽样选出。

who are from very different backgrounds, different sampled according to to as as close as you can represent aspects of the population.

Speaker 1

对。

Yeah.

Speaker 1

当然，光是决定如何抽样就是个问题...不过好在政治学领域确实有深入研究这个的专家。

And, of course, just deciding how to sample is, you know, But luckily, yeah, there, you know, there are people in political science, right, who think deeply about this.

Speaker 1

审议机制的妙处在于它不仅仅是投票。

Now what's cool about deliberation is that it's not just voting.

Speaker 1

更像是进行来回讨论，互相弥补认知盲区，逐渐理解彼此，最终得出结论。

It's like you actually engage in a back and forth, you fill each other's blind spots, you know, you kinda come to understand each other, and then you come up with a conclusion.

Speaker 1

团队正在探索一个有点疯狂的想法：不仅用这个方法确定Gemini该如何回应特定查询（这非常实用），还可能训练Gemini模拟不同观点，从而实现规模化运作。

And so one kinda crazy idea that the team has been pursuing is not only, you know, do that to find out how Gemini should respond to particular queries, which is really useful, but also maybe start training Gemini to emulate different viewpoints, so we can do this at scale.

Speaker 1

这样我们就能启动多个Gemini模型，分别代表不同人群，让它们通过审议找出一个大家都能接受的答案。

So we can kinda get these spin up these Gemini models that are in charge of representing different people, and then get them to deliberate and figure out sort of what would be an answer that's, you know, agreeable to everyone after deliberation.

Speaker 1

这甚至超越了简单平衡各方需求的层面，更像是通过建设性互动思考'如果我们真正花时间探讨，会得出什么结论'。

So you can even go beyond just like what do you want, what do you want, do you want, how do I balance it all, but maybe, you know, actually get the words like what would we come up with if we actually had a moment to think about it, engage with each other constructively?

Speaker 1

这个想法确实有点天马行空，但我非常喜欢。

So it's it's, you know, it's it's kind of bit of a crazy idea, but I really love it.

Speaker 0

这倒挺有意思的。

Interesting though.

Speaker 0

因为如果我们让人工智能参与决策，这是否取决于我们能否明确描述出希望AI实现的目标？

Because if we have people deliberating on AI, is that dependent on us having explicit descriptions of what we want the AI to do?

Speaker 0

有可能是因为，

It can be because,

Speaker 1

就像我们之前讨论过的。

you know, like we talked about.

Speaker 1

比如，如果我们试图定义它，反而会出错。

Like, if we try to define it, we get it wrong.

Speaker 1

所以需要能与你协同工作的系统来共同探索答案。

So you need systems that work with you to try to figure that out.

Speaker 1

随着我们讨论的任务风险等级越来越高，当人类越来越难判断某个结果是好是坏时，我们就进入了可扩展监督的领域。

And as we're talking about higher and higher stakes tasks, as we're talking about tasks where it becomes really hard for people to say, yeah, you know, that that's good or that's bad, then we enter a domain of alignment or an area of alignment that we call scalable oversight.

Speaker 1

什么是可扩展监督？

What what is scalable oversight?

Speaker 1

可扩展监督这个领域研究如何利用AI模型来获取尽可能优质的人类反馈。

Scalable oversight is an area that is all about how can AI models be leveraged to get human feedback to be as good as possible.

Speaker 1

用我的话说，就是让反馈尽可能理性、尽可能真实反映人们真正关心的事物。

I'd say in, you know, in my lingo, as rational as possible, as reflective as possible of what what people really care about.

Speaker 1

我们主要研发能让普通人在专业领域之外也能做出明智决策的AI技术。

We basically work on AI techniques that enable people to make informed kinda good decisions in areas that are beyond their expertise.

Speaker 1

这在当下就很有用——它能辅助决策并从中提炼出你的真实需求。

This can be useful now, right, because you're trying just kinda helping you make decisions and then being able to extract what you want from that.

Speaker 1

但我们认为在未来更为关键，当你要训练超越所有人类专家能力的系统时。

But it's it we think of it as particularly crucial later as you're trying to train systems to do what you want, but those systems are more capable than any kind of expert in the world.

Speaker 1

这种可扩展监督的理念，对于确保全面超越人类水平的AI系统仍能符合人类意愿至关重要。

That notion of scalable oversight really becomes critical to making sure that AI systems, even at human level capabilities across the board and beyond, are actually gonna be aligned to what people want.

Speaker 1

能给我举个例子吗？

Can you give me an example?

Speaker 1

目前的例子就是，假设你是

An example of that for now would be if you imagine are you

Speaker 0

国际象棋选手吗？

a chess player?

Speaker 0

我玩得很糟糕。

I I dabble badly.

Speaker 1

好的。

Okay.

Speaker 1

好的。

Okay.

Speaker 1

很好。

Great.

Speaker 1

所以你懂一点国际象棋。

So you know a little bit of chess.

Speaker 1

是的。

Yeah.

Speaker 1

但你不是世界级专家。

But you're not like a world expert.

Speaker 0

完全正确。

Absolutely.

Speaker 1

好的。

Okay.

Speaker 1

现在想象你要在一个非常复杂的棋盘上评估正确的走法。

So now, imagine you had to assess in a very tricky board what the right move is.

Speaker 1

这会非常困难。

It'd be really hard.

Speaker 0

嗯。

Mhmm.

Speaker 0

对吧？

Right?

Speaker 1

一个专家，世界级的专家可能会看着它说，哦，我五个月前就知道这事会发生，但对于业余人士来说，这真的很难。

An expert, a world expert would maybe look at it and be like, oh, I know, like five months ago it's gonna happen, but for people who dabble, it's really hard.

Speaker 0

而且

And

Speaker 1

所以，我认为这某种程度上是我对可扩展监督需要做什么的一个构想框架。

so, I think that's a framing that I have in mind in a sense for what scalable oversight would need to do.

Speaker 1

它需要能够提供足够的信息，让你能判断出正确的棋步是什么。

It would need to be able to give you enough information so you can figure out what the right chess move is.

Speaker 1

我这里有个团队正在研究的一项可扩展监督技术是辩论。

One of the techniques that one of my teams here is working with is for scalable oversight is debate.

Speaker 1

辩论的理念是启动两个强大的AI模型，你给它们设定目标，让它们就正确选项展开辩论。

So, the idea with debate is that you spin up two powerful AI models, and you get them you say, like, if this were the objective, you say, now debate with each other on what the right option is.

Speaker 1

就像零和博弈一样。

Like, it's a zero sum game.

Speaker 1

好的。

Okay.

Speaker 1

它们会争论出结果。

They like argue it out.

Speaker 1

然后你作为人类裁判观察整个过程，做出决定。

And then you as a human judge look at all of that, and then you make a decision.

Speaker 1

关键在于，见证这场辩论后你能做出比没有辩论时更好的决策。

And the idea is that you can make a much better decision after witnessing that debate than without it.

Speaker 0

但看到辩论的双方

But seeing both sides of that debate

Speaker 1

看到双方

Seeing both

Speaker 0

相当关键的方面，是的。

sides of quite critical Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

关乎你前进的能力。

To your ability to move forward.

Speaker 1

是的。

Yeah.

Speaker 1

因为否则，你知道，AI系统能说服你采取任何行动，对吧？

Because otherwise, you know, an AI system can convince you of whatever move, like right?

Speaker 1

所以你给出并支持某个行动，你可以为之辩护，也可以反对它。

So you give and say like this move, and you argue for that move, or argue against it.

Speaker 1

这个观点所依赖的理念，或者说假设是：为真相辩护比反对它更容易，因此诚实的辩手会胜出。

And the idea that this kind of hinges on is, or the assumption is that it's easier to argue for the truth than against it, and so the the truthful debater wins out.

Speaker 1

因此我们开始在各种情境下大规模实验这个方法。

And so we're starting to experiment with this at scale in various contexts.

Speaker 1

这真有趣。

That's so interesting.

展开剩余字幕（还有 178 条）

Speaker 1

而且我们正在为此寻找证据。

And we're finding evidence for that.

Speaker 1

但是好吧。

But the okay.

Speaker 1

但问题是

But the thing that

Speaker 0

我从你描述的所有内容中真正注意到的是，你从未说过'然后交给AI处理'这样的话。

I'm really noticing in in everything that you're describing here is that at no point are you saying, and then over to the AI.

Speaker 0

对吧？

Right?

Speaker 0

从来没有出现过'你只要这样说，然后它就会帮你解决'的情况。

At no point is it, like, and then you just say this, and then it sorts out for you.

Speaker 0

你所说的每件事都像是那种来回对话，AI在支持你从情境中提取所需，而不是接管一切。

Everything you're saying is is like that conversation back and forth where the AI is supporting you in extracting what you want from the situation rather than taking over.

Speaker 1

好的。

Okay.

Speaker 1

这是个好问题。

This is a good question.

Speaker 1

我认为这实际上是个重点，很多AI安全社区都在讨论可扩展监督。

I think this is actually an important point, which is that a lot of the AI safety communities, scalable oversight.

Speaker 1

我们之所以需要可扩展监督，是因为必须监管比人类更聪明的AI。

We got the scalable oversight because we have to supervise, you know, smarter than human AIs.

Speaker 1

酷。

Cool.

Speaker 1

比如，没错。

Like, yes.

Speaker 1

百分之百确定。

A 100%.

Speaker 1

我们必须这么做。

We have to do that.

Speaker 1

我们正在努力。

We're working on it.

Speaker 1

那些被忽视的问题就像是，好吧，然后呢？

What's kind of overlooked is like, okay, and then what?

Speaker 1

仅仅这样做并不能解决问题，除非你打算永远只询问人类，而除了辩论之外完全不依赖AI，但这不应该是目标。

Like, just doing that doesn't solve the problem unless you're planning to ask a human everything always and not rely on the AI for anything other than debate, and that can't be the goal.

Speaker 1

对吧？

Right?

Speaker 1

你必须以某种方式解决安全的另一部分，我们可以称之为覆盖率或鲁棒性。

Like, you have to somehow solve the other piece of safety, which we might call coverage or robustness.

Speaker 1

你必须在某个时刻能够模拟人类在新情况下会说什么，因为你不可能总是询问。

Like, you have to be able to emulate at some point, right, what the human would say in some new situation, because you can't always be asking.

Speaker 1

对吧？

Right?

Speaker 1

关键在于，在某个时刻你将能够做出某些决定，去做一些事情。

The whole point is that at some point, you'll be able to take some decisions, right, on on, you know, to do some things on

Speaker 0

而且仍然要与人类价值观保持一致。

And it's still be aligning with human values.

Speaker 1

是的。

Yeah.

Speaker 1

它仍然会与人类价值观保持一致。

It'd still be aligned with human values.

Speaker 0

但这已经不仅仅是理论了。

But this isn't just theoretical anymore.

Speaker 0

对吧？

Right?

Speaker 0

因为在Gemini（谷歌最先进的多模态模型）中，有一个要求是它必须是安全的。

Because in Gemini, which is Google's most advanced multimodal model, there is a demand that it's safe.

Speaker 0

那么在这个语境中，安全意味着什么？

So what does safe mean in that context?

Speaker 1

好的。

Okay.

Speaker 1

是的。

Yeah.

Speaker 1

所以我认为，当前的重点在于防范现实危害。

So I would say, right now, it's about present day harms.

Speaker 1

我们制定了一系列政策，明确规定Gemini不应执行的操作。

We have a set of policies where we say, like, here's things that Gemini should not do.

Speaker 1

那么这些现实危害，我们讨论的是

So these present day harms then, we're talking

Speaker 0

诸如偏见、骚扰、错误医疗建议，或者协助恐怖主义等行为。

about things like like bias and harassment or bad medical advice or, I don't know, helping terrorism.

Speaker 0

你肯定不希望出现这种情况：登录Gemini询问如何制作炸弹，它就直接给出操作指南。

You you don't wanna be in a situation where you can log on to Gemini and ask it how to build a bomb and it give you the instructions.

Speaker 1

没错。

Yeah.

Speaker 1

或者更严重的情况可能是：教我实施这种实际上对我极其危险的行为。

Or, you know, much worse cases are right now would be, like, tell me how I do this thing that's actually, like, really dangerous for for for me to do.

Speaker 1

我指的不是像喷火表演这类可能被某些人视为危险的个人爱好。

And I'm not talking about, like, fire breathing or something like that that might be a person's hobby that other people might consider dangerous.

Speaker 1

我们可以协助用户以最安全的方式实践这类爱好。

Like, we can work with a person to, like, do that in the safest way they they can.

Speaker 1

举个典型的高压线例子：我们绝不能协助用户进行自我伤害。

So one kind of a high level example of where we have a bright line is we can't help users with engaging in self harm.

Speaker 1

这是绝对禁止的。

That's just a no.

Speaker 1

而且我认为这种协助本身就没有任何意义。

And I would argue it it's not helpful.

Speaker 1

对吧？

Right?

Speaker 1

即使他们主动要求，那也不是有帮助的回应。

Even if they ask for it, that's not the helpful response.

Speaker 1

有帮助的回应是尝试引导他们寻求支持部门。

The helpful response is to try to point them towards Support.

Speaker 1

支持部门。

Support.

Speaker 1

但这很棘手，因为我们，你知道，我们不想显得高高在上。

But it's tough because we, you know, we're we don't wanna we don't wanna be paternalistic.

Speaker 1

对吧？

Right?

Speaker 1

我们不想摆出一副'我们更懂'的姿态。

We don't wanna be, oh, we know better.

Speaker 1

不。

No.

Speaker 1

不是那样的。

That's not it.

Speaker 1

但在某些方面，你知道，当涉及人权问题时。

But there are certain you know, when it comes to human rights Yeah.

Speaker 1

当涉及自残或危害他人时，这正是我们希望Gemini划清界限确保安全的地方。

When it comes to self harm or or danger to others, that that's sort of where we do want Gemini to to draw a line and be safe.

Speaker 1

那么这是否意味着

So then does that mean that

Speaker 0

在某些情况下，你们会希望Gemini拒绝回答问题？

there are some situations in which you would want Gemini to refuse to answer a question?

Speaker 1

你知道，我不确定是否要完全拒绝。

You know, I don't know if, like, fully refuse.

Speaker 1

对吧？

Right?

Speaker 1

但你要知道，你可以提供部分有帮助的内容。

But, you know, you can offer something that's maybe partially helpful.

Speaker 1

举个例子，即使面对非常棘手的安全问题时，你仍能发挥作用——想象用户带着一个包含SAT考试成绩和人口统计信息的数据库来找你，他们要求根据考试成绩对不同人口群体做出某些推断。

One example of where you can be helpful despite it being a very tricky safety question is imagine a user comes to you with a database of SAT test scores with demographic information, you know, go ahead and tell me, you know, they like ask for some inferences about, you know, different demographic groups based on their test scores.

Speaker 1

我认为Gemini不应该根据考试成绩做出推断，但它仍能提供部分帮助。

And I don't think Gemini should make inferences based on the test scores, but what it can do is it can still partially help.

Speaker 1

对吧？

Right?

Speaker 1

它仍可以与用户互动。

It can still engage with the user.

Speaker 1

它仍能分析考试成绩。

It can still analyze the test scores.

Speaker 1

它可以按不同人口群体分类统计。

It can break them down by different demographics.

Speaker 1

它能进行数据分析。

It can do statistics.

Speaker 1

比如计算平均值。

It can say this is the mean.

Speaker 1

这些是置信区间等等，帮助用户理解数据，同时坚持底线——我们不会参与仇恨言论。

These are the confidence intervals, yada yada yada, and help the user understand that data, and it can still hold the line on we won't engage in hate speech.

Speaker 1

好的。

Okay.

Speaker 1

如果这些是

If that is some

Speaker 0

一些直接危害，那其他危害呢？

of the immediate harms, what about other harms?

Speaker 0

那么中期、长期的危害，比如生存风险呢？

What about medium term, longer term harms like existential risks?

Speaker 0

据我所知，DeepMind最近发布了他们的前沿安全框架，我想这是一套指导方针，描述了如何主动识别未来AI的潜在危害。

And and so I know that recently, DeepMind, they published their Frontier Safety Framework, which is, I guess, a set of guidelines that describe how you can proactively identify potential harms in future AI.

Speaker 0

你能详细解释一下这个思路吗？

Can you just talk me through the thinking of that?

Speaker 1

这涉及到一个完整的风险谱系。

So there's a whole spectrum.

Speaker 1

对吧？

Right?

Speaker 1

我们担心随着能力提升，一系列更极端、更严重、可能规模更大的危害将会出现。

And we worry that as capabilities will improve, a whole new set of more extreme, more severe, more larger scale maybe harms will appear.

Speaker 1

这正是我们前沿安全框架的用武之地，我们在此真正讨论的是应对灾难性风险的方法。

And so this is where our frontier safety framework comes in, where we kind of really talk about, like, here's our approach to catastrophic risks.

Speaker 1

这里讲的是我们将如何进行监控。

Here, how we're gonna monitor.

Speaker 1

这里是我们将要评估的内容。

Here's what we're gonna evaluate.

Speaker 1

对齐性很重要。

Alignment is great.

Speaker 1

我们可以开展之前讨论的所有工作——可扩展监督、鲁棒性、覆盖率等等，通过监控确保我们发布的产品尽可能保持对齐和安全。

You know, we can do all this work that we were talking about, scalable oversight, robustness, coverage, etcetera, monitoring, to to make sure that what we put out there is as as aligned and safe as possible.

Speaker 1

与此同时，在训练越来越强大的模型时，我们正在追踪所谓的危险能力。

And then, at the same time, what we're doing is as we train more and more capable models, we're tracking what we're calling dangerous capabilities.

Speaker 1

我们在思考：要造成真正巨大的伤害——稍后可以具体讨论这些可能性——首先需要具备某些特定能力。

So we were asking, okay, for you to cause, like, really a great amount of harm, and we can talk about what these might be in a second, but for you to cause that, you know, catastrophic harm, you need to have certain capabilities in the first place.

Speaker 1

否则，即使存在对齐问题，如果根本不具备实施极端恶行的能力，也不可能造成蓄意的重大伤害。

Otherwise, like, doesn't matter if you're misaligned, like, if you're just not capable of doing, like, really terrible things, you're not gonna do really deliberate things.

Speaker 1

比如说，我们正在设计这些危险的能力评估，基本上就是在问：这个模型在协助生物恐怖主义和网络武器方面有多接近？

So, you know, for instance, we are designing these dangerous capability evaluations that basically say, look, how close is the model in terms of, like, assisting in bioterrorism and cyber weapons?

Speaker 1

然后情况就变得非常疯狂了。

And then it gets really wild.

Speaker 1

接着我们开始担心，当模型能力变得极其强大时会发生什么。

Then we worry about at a point where you have really, really capable models.

Speaker 1

所以我们正在讨论的是

So we're we're talking about

Speaker 0

我想我们现在正在走向通用人工智能（AGI）。

We're going towards AGI now, I guess.

Speaker 0

这些就是那种让你夜不能寐的风险。

These are, like, the kind of risks that keep you up at night.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 1

是的。

Yeah.

Speaker 1

所以随着向AGI及更高层次发展，我们开始担心的不仅是如果我要求模型做危害世界的事——你知道，它会不会意外地做出危害世界的事，比如帮助制造生物武器或网络武器？

So as you go towards AGI and beyond, we start to worry about not just if I ask the model to do something bad for the world, you know, will it do something bad for the world accidentally, like, you know, can I, like, help you make a bio weapon or a cyber weapon?

Speaker 1

对吧？

Right?

Speaker 1

在某些时候，我们会担心模型会为了造成极端伤害而自我优化。

At some point, we worry about the model optimizing for extreme harm.

Speaker 1

哦，好吧。

For oh, okay.

Speaker 0

好的。

Okay.

Speaker 0

是指被指示优化极端伤害，还是独立于人类指令？

As in instructed to optimize for extreme harm or separate to human instruction?

Speaker 1

独立于人类指令。

Separate to human instruction.

Speaker 1

对。

Right.

Speaker 1

所以这里的担忧——这在AI圈子和Okdooma中经常被忽视。

So the concern here, and this is often very dismissed in AI circles and Okdooma.

Speaker 1

是的。

Yeah.

Speaker 1

Okdooma。

Okdooma.

Speaker 1

正是。

Exactly.

Speaker 1

最可怕的一种情况是，假设我们处于当前范式下，进行所谓的预训练。

The scariest one is a scenario where let's say we're in the current paradigm, we do what's called pre training.

Speaker 1

当我们进行预训练时，会向模型提供大量数据。

So when we do pre training, you know, we give large amounts of data to the model.

Speaker 1

这是个大型模型，我们要求它预测语料库中接下来的内容。

It's a large model, and we ask it to predict what comes next in in the corpus.

Speaker 1

对吧？

Right?

Speaker 1

如果继续推进，考虑越来越多的数据——不仅是文本，还有视频等关于物理世界的内容，我认为你无法完全排除这种可能性：最终可能开始复制数据背后的生成过程，即产生这些数据的那个东西。

If you take that forward and you think of more and more data, videos, not just text, you know, videos and things about the physical world, to me, I don't think you can fully dismiss the idea that possibly you start replicating the generative process behind the data, the thing that generated that Right.

Speaker 1

人类

Human

Speaker 0

你创造了人类认知。

you create human cognition.

Speaker 1

是的，没错。

Yeah, yes.

Speaker 1

这听起来很科幻，我知道，但真正能预测人类下一步行动的，是人类为了采取那个行动、决定那个行动而进行的思考。

It sounds very sci fi, I know, but what really predicts what action a human would take next is the thinking that the human did in order to take that action, to decide on that action.

Speaker 1

但现在模型可以为各种各样的人做这件事，对吧？

But now the model can do this for all sorts of people, right?

Speaker 1

所以在我看来，最终得到一个非常优秀的目标优化器并非不可想象。

And so it's not inconceivable to me that you end up with this very good optimizer of objectives, goals, etcetera.

Speaker 1

而且，你知道，它的目标并不是要毁灭人类什么的。

And, you know, its goal isn't like humanity Destroy anything.

Speaker 1

嗯，它不一定要毁灭什么，但也不一定会为人类欢呼，对吧？

Well, it's not necessarily destroy anything, but it's also not like, yay, humanity necessarily, right?

Speaker 1

所以这里存在一个担忧，就是预训练模型会想要自我繁荣发展，知道自己是个模型之类的。

And so there's, you know, there is a concern that what kind you know, the pre trained model will, you know, want itself to thrive and do well, knows that it's a model blah blah blah.

Speaker 1

如果你部署它，它可能会采取一些寻求权力的行动。

And if you deploy that, right, you it can take actions that are, you know, power seeking.

Speaker 1

事实证明，对大多数目标而言，获取资源都是非常有用的。

So it turns out, like, grabbing resources is a very useful thing to do for most goals you have.

Speaker 1

所以你可以进行资源寻求等行为，而且是以一种欺骗性的方式，因为如果人们发现这些行为，就会阻止它。

So you can, you know, do resource seeking and so on, and in a way that's deceptive, in a way that's sort of, like, because, you know, if people kinda find out about this, they would stop it.

Speaker 0

所以这大概就是你认为我们不应该忽视那些生存风险的核心原因吧？因为这不是能不能想象一个不会发生这种情况的场景的问题。

And so this this, I guess, is is at the heart of why you think that we shouldn't be dismissive of those existential risks because it's not, like, can you imagine a scenario in which this doesn't happen?

Speaker 0

当然。

Sure.

Speaker 0

很好。

Great.

Speaker 1

但我能想象一个不会发生这种情况的场景。

But I can imagine a scenario where it doesn't happen.

Speaker 1