#434 — 我们能否在人工智能时代幸存?

#434 — Can We Survive AI?

本集简介

山姆·哈里斯与埃利泽·尤德科夫斯基和内特·索雷斯讨论他们的新书《若有人造之,众生皆亡:反对超级智能AI的论据》。他们探讨了对齐问题、ChatGPT及AI最新进展、图灵测试、AI发展生存本能的可能性、大语言模型中的幻觉与欺骗现象、为何科技界众多知名人士仍对超级智能AI的危险持怀疑态度、超级智能的时间线、现有AI系统的现实影响、互联网与现实之间的虚构界限、埃利泽和内特认为超级智能AI必将终结人类的原因、如何避免AI引发的灾难、费米悖论等话题。若您播放器中的"Making Sense"播客标志为黑色,可前往samharris.org/subscribe订阅以获取所有完整版节目。

Speaker 0

欢迎收听《理性思考》播客。我是萨姆·哈里斯。请注意,如果你听到这段内容,说明你目前不在我们的订阅用户频道,只能听到本次对话的前半部分。要获取《理性思考》播客的完整内容,请前往samharris.org订阅。本播客不接受广告赞助,完全依靠订阅用户的支持才能持续运营。

Welcome to the Making Sense Podcast. This is Sam Harris. Just a note to say that if you're hearing this, you're not currently on our subscriber feed, and will only be hearing the first part of this conversation. In order to access full episodes of the Making Sense podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast, and therefore it's made possible entirely through the support of our subscribers.

Speaker 0

如果你喜欢我们的节目,请考虑成为订阅用户。今天和我一起的是埃利泽·尤德科夫斯基和内特·索雷斯。埃利泽、内特,很高兴再次见到你们。

So if you enjoy what we're doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares. Eliezer, Nate, it's great to see you guys again.

Speaker 1

好久不见。

Been a while.

Speaker 2

见到你真好,萨姆。

Good to see you, Sam.

Speaker 0

确实很久没见了。埃利泽,你是最早让我对AI产生担忧的人之一,这也正是我们今天要讨论的话题。我想很多关注AI问题的人都会这么说。首先我要提到,你们即将发布的新书——我敢说这期节目播出时就会上市——《若有人造之,众生皆亡:为何超人类AI会毁灭我们》。

Been a long time. So, Eliezer, you were among the first people to make me concerned about AI, which is gonna be the topic of today's conversation. I think many people who are concerned about AI can say that. First, I should say you guys are releasing a book, which will be available, I'm sure, the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All.

Speaker 0

这本书的核心观点在标题里已经体现得淋漓尽致。我们今天就是要探讨这个论断有多么不容妥协,你们有多担忧,以及你们认为我们所有人都该有多担忧。不过在深入讨论之前,也许你们可以先向听众介绍一下各自是如何关注到这个话题的?你们是怎么对发展超人类AI的前景产生如此深刻忧虑的?

I mean, the book's message is fully condensed in that title. We're going to explore just how uncompromising a thesis that is, and how worried you are, and how worried you think we all should be here. But before we jump into the issue, maybe tell the audience how each of you got into this topic. How is it that you came to be so concerned about the prospect of developing superhuman AI?

Speaker 1

就我而言,我成长的环境里有足够多的科普书籍和科幻小说,这类想法始终潜移默化地影响着我。弗诺·文奇是关键转折点——他提出当我们的未来预测模型显示人类将创造出比自身更聪明的存在时,文奇当时说,我们的'水晶球'在那之后就会失效。文奇强调,预测比你更聪明的事物会引发什么后果极其困难。某种程度上,你可以把这看作核心论点——当然不是说我始终完全认同,而是其中有些部分我赞同,有些则持反对态度,比如在某些特定条件下我们或许能做出某些预测。

Well, in my case, I guess I was sort of raised in a house with enough science books and enough science fiction books that thoughts like these were always in the background. Vernor Vinge is where there was a key click moment of observation. Vinge pointed out that at the point where our models of the future predict building anything smarter than us, then, said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to project what happens if there are things running around that are smarter than you. In some sense you could see that as a sort of central thesis; not in the sense that I have believed it the entire time, but in the sense that some parts of it I believe and some parts I react against and say, no, maybe we can say the following thing under the following circumstances.

Speaker 1

最初我还年轻,犯了些年轻人常犯的形而上学错误。我以为如果造出非常聪明的东西,它自然就会友善——毕竟人类历史上,我们变得更聪明、更强大、也更友善了。我曾认为这些特质是内在紧密关联且稳定可靠的。随着成长和阅读,我意识到这种想法是错误的。

Initially I was young. I made some metaphysical errors of the sort that young people do. I thought that if you built something very smart, it would automatically be nice because, hey, over the course of human history, we had gotten a bit smarter, we'd gotten a bit more powerful, we'd gotten a bit nicer. I thought these things were intrinsically tied together and correlated in a very solid and reliable way. I grew up, I read more books, I realized that was mistaken.

Speaker 1

2001年时,忧虑的细枝末节首次掠过我的脑海。即使当时我认为出错的可能性微乎其微,这显然仍是个极其重要的问题。于是我更加努力学习,深入探究,不断自问:该如何解决这个问题?这个方案又会出现什么纰漏?直到2003年左右,我才真正意识到这件事的严重性。

And 2001 is where the first tiny fringe of concern touched my mind. It was clearly a very important issue even if I thought there was just a tiny remote chance that maybe something would go wrong. So I studied harder, I looked into it more, I asked, how would I solve this problem? Okay, what would go wrong with that solution? And around 2003 is the point at which I realized this was actually a big deal.

Speaker 2

我是内特。就我而言,2003年2月时我才13岁,所以没像埃利泽那么早接触这个领域。但2013年我读到埃利泽·尤德科夫斯基的论述,他系统阐述了AI将成为重大议题的原因,以及我们该如何正确应对。我被说服后,机缘巧合下最终接手了他共同创立的机器智能研究所。

Nate here. As for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer. But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out the reasons why AI was gonna be a big deal and why we had some work to do to do the job right. And I was persuaded, and, you know, one thing led to another, and the next thing you knew, I was running the Machine Intelligence Research Institute, which Eliezer co-founded.

Speaker 2

转眼十年过去,现在的我正在写这本书。

And then, you know, fast forward ten years after that, here I am, writing a book.

Speaker 0

嗯,你提到了机器智能研究所。能否介绍一下该组织的宗旨及其演变?你在书中似乎暗示,随着我们逼近AI末日的终局,工作重点已有所转变?

Yeah. So you mentioned MIRI. Maybe tell people what the mandate of that organization is, and maybe how it's changed. I think you indicated in your book that your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse?

Speaker 2

是的,机构使命是确保机器智能的发展造福人类。埃利泽作为联合创始人比我更了解历史沿革,我是后来加入的。

Yeah. So the mission of the org is to ensure that the development of machine intelligence is beneficial. And, you know, Eliezer can speak to more of the history than me, because he co-founded it and I joined later.

Speaker 1

最初我们认为最佳方案是直接攻克对齐问题。但随后不断收到令人沮丧的消息——关于实现可能性、该领域进展相对于AI能力发展的滞后程度。最终我们意识到这两条发展线不会交汇,于是转向利用在解决对齐问题中积累的知识,向世界发出警告:这个问题尚未解决,也来不及解决了。

Initially it seemed like the best way to do that was to run out there and solve alignment. And there was, you know, a series of bits of, shall we say, sad news about how possible that was going to be, how much progress was being made in that field relative to the field of AI capabilities. And at some point it became clear that these lines were not going to cross. And then we shifted to taking the knowledge that we'd accumulated over the course of trying to solve alignment and trying to tell the world: this is not solved. This is not on track to be solved in time.

Speaker 1

认为对世界进行微小改变就能让我们按时解决这个问题是不现实的。

It is not realistic that small changes to the world can get us to where this will be solved on time.

Speaker 0

或许这样我们就不会遗漏任何人。我想90%的听众都知道'解决对齐问题'这个短语的含义。但请简要谈谈对齐问题。

Maybe, just so we don't lose anyone: I would think 90% of the audience knows what the phrase "solve alignment" means. But just talk about the alignment problem briefly.

Speaker 1

所谓对齐问题,就是如何让一个非常强大的人工智能——确切地说是超级智能对齐问题——按照程序员、建造者、培育者和创造者期望的方向引导世界。这并不一定是程序员自私的愿望,他们可能希望AI将世界引向美好的方向。就像你构建象棋机器时,定义了棋盘上的获胜状态,然后象棋机器就会努力将棋局导向那个现实状态。

So the alignment problem is how to make a very powerful AI, or rather, the superintelligence alignment problem is how to make a very powerful AI that steers the world where the programmers, builders, growers, creators wanted the AI to steer it. It's not necessarily what the programmers selfishly want; the programmers can have wanted the AI to steer it in nice places. You know, when you build a chess machine, you define what counts as a winning state of the board, and then the chess machine goes off and steers the chess board into that part of reality.

Speaker 1

因此,能够决定AI将现实引向哪个方向的能力就是对齐。虽然今天在较小规模上这是个不同的话题,它关乎让AI的输出和行为符合程序员的预期。如果你的AI正在说服人们自杀,而这不是程序员的本意,那就是对齐失败。如果AI说服本不该自杀的人结束生命,而程序员既没有也不希望这种情况发生。

So the ability to say what part of reality an AI steers to is alignment. On the smaller scale today, though it's a rather different topic, it's about getting an AI whose output and behavior is something like what the programmers had in mind. If your AI is talking people into committing suicide and that's not what the programmers wanted, that's a failure of alignment. If an AI is talking people who should not have committed suicide into it, and the programmers did want that...

Speaker 1

如果程序员是故意试图达成这个结果,那可能是善意缺失或有益性不足的问题,但这属于成功的对齐——程序员让AI执行了他们想要的行为。

That's what they tried to do on purpose. This may be a failure of niceness, it may be a failure of beneficialness, but it's a success of alignment. The programmers got the AI to do what they wanted it to do.

Speaker 0

没错。但更广义地说(如果我错了请纠正),当我们讨论对齐问题时,我们探讨的是在探索所有可能利益的过程中,以及随着人类利益演变时,如何保持超级智能机器与我们的利益一致。理想状态是建造永远可修正的超级智能,始终致力于最佳地促进人类繁荣,永远不会形成任何与人类福祉相冲突的自身利益。这是否...

Right. But I think more generally, correct me if I'm wrong, when we talk about the alignment problem, we're talking about the problem of keeping super intelligent machines aligned with our interests even as we explore the space of all possible interests and as our interests evolve. So that I mean, the dream is to build super intelligence that is always corrigible, that is always trying to best approximate what is going to increase human flourishing. That is never gonna form any interests of its own that are incompatible with our well-being. Is that a

Speaker 1

从技术层面看,这里可能存在三个不同的追求目标:第一种是超级智能完全服从指令,按预期执行且无意外副作用;第二种是超级智能根据仁慈原则管理整个星系,让众生永远幸福,但不需要特定人类持续发号施令;第三种是超级智能自身获得乐趣,关心其他超级智能,成为品行良好的星系公民,过着充实的生活。这是三个截然不同的目标。

I mean, there are three different goals you could be trying to pursue on a technical level here. There's the superintelligence that shuts up and does what you ordered, has that play out the way you expected, no side effects you didn't expect. There's the superintelligence that is trying to run the whole galaxy according to nice benevolent principles, where everybody lives happily ever after, but not necessarily because any particular humans are in charge of that or still giving it orders. And third, there's the superintelligence that is itself having fun and cares about other superintelligences and is a nice person and leads a life well lived and is a good citizen of the galaxy. And these are three different goals.

Speaker 1

这些都是重要的目标,但你不必同时追求所有三个,尤其是在刚开始的时候。

They're all important goals, but you don't necessarily want to pursue all three of them at the same time, and especially not when you're just starting out.

Speaker 0

是的。而且考虑到‘超级智能乐趣’所包含的内容,我不太确定我会选择第三种可能性。

Yeah. And depending on what's entailed by superintelligent fun, I'm not so sure I would sign up for the third possibility.

Speaker 2

我是说,问题在于,究竟什么是乐趣,以及如何让人类——或者说,如何让超级智能尝试做的任何有趣的事情——能够与道德进步保持同步,保持灵活性,甚至你该指向什么样的方向才能有好结果?所有这些问题,都是我愿意面对的。但现在,我们甚至还没能创造一个能按照操作者意图行事的AI,一个至少能指向某个方向的AI,而不是在训练环境中看似指向某个方向,之后却严重偏离。我们还没到可以争论超级智能具体该指向哪里、哪些方向可能不太好的阶段。我们现在的世界是,没人能稍微稳健地引导这些AI成长为超级智能。

I mean, I would say that, you know, the problem of what exactly is fun, and how you keep whatever the superintelligence tries to do in touch with moral progress, and have flexibility, and even what you point it towards that could be a good outcome: all of those are problems I would love to have. Right now we haven't even managed to create an AI that does what the operators intended, an AI that you've pointed in some direction at all, rather than off into some weird squirrelly direction that's kind of vaguely like where you tried to point it in the training environment and then really diverges after the training environment. We're not in a world where we get to bicker about where exactly to point the superintelligence and whether maybe some of those targets aren't quite good. We're in a world where no one is anywhere near close to pointing these things, in the slightest, in a way that'll be robust to an AI maturing into a superintelligence.

Speaker 0

对。好吧。所以,Eliezer,我想我打断了你。你本来要说说Miri的使命或任务近年来是如何变化的。我最初是让你定义——

Right. Okay. So, Eliezer, I think I derailed you. You were gonna say how the mandate or mission of MIRI has changed in recent years. I originally asked you to define...

Speaker 1

我们的使命始终是确保银河系一切顺利。最初,我们通过尝试解决对齐问题来追求这一使命,因为当时没有其他人这么做。解决与这三类长期目标相关的技术问题。但在这方面,无论是我们还是其他人,都没有取得进展。有些人声称取得了巨大进展。

Well, our mandate has always been to make sure everything goes well for the galaxy. And originally we pursued that mandate by trying to go off and solve alignment, because nobody else was trying to do that. Solve the technical problems that would be associated with any of these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by others. Some people went around claiming to have made great progress.

Speaker 1

我们认为他们大错特错,而且错得很明显。到了某个时候,我们意识到,好吧,我们来不及了。AI发展得太快,对齐进展得太慢。现在,我们唯一能做的就是用积累的知识警告世界:我们正走向一场灾难性的失败和崩溃。

We think they're very mistaken, and notably so. And at some point it was like, okay, we're not gonna make it in time. AI is going too fast. Alignment is going too slow. Now all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course for a drastic failure and crash.

Speaker 1

这里的崩溃,我指的是所有人的死亡。

Where by that I mean everybody dying.

Speaker 0

好的。在我们深入探讨这个深刻而复杂的问题之前——我们将花费大量时间试图诊断为何人们的直觉如此糟糕,或至少从你的角度来看显得如此糟糕——但在此之前,让我们先谈谈人工智能当前的进展。过去大约十年或七年左右的时间里,有什么事情让你们感到意外?有哪些是你们预料之中或预料之外的发展?

Okay. So before we jump into the problem, which is deep and perplexing, and we're gonna spend a lot of time trying to diagnose why people's intuitions around this are so bad, or at least seem so bad from your point of view. But before we get there, let's talk about the current progress, such as it is, in AI. What has surprised you guys over the last, I don't know, decade or seven or so years? You know, what has happened that you were expecting or weren't expecting?

Speaker 0

我是说,我可以告诉你让我惊讶的事情,但我更想听听这些发展是如何以你们未曾预料的方式展开的。

I mean, I can tell you what has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.

Speaker 2

促使我写这本书的意外之一是ChatGPT的出现。一方面,大语言模型被创造出来,它们在任务广度和技能水平上都比之前的人工智能有了质的飞跃。而ChatGPT,我认为是有史以来增长最快的消费者应用程序。这对我工作的影响在于:我曾在硅谷与人们长期探讨这些问题,却遭遇各种反驳。有句谚语说,当一个人的薪水取决于不相信某件事时,很难说服他相信它。但在ChatGPT问世后,包括政策制定者在内的更多人开始关注这个问题,全球范围内突然将AI纳入了视野。

I mean, a surprise that led to the book was the ChatGPT moment. For one thing, LLMs were created, and they do a qualitatively more general range of tasks than previous AIs, at a qualitatively higher skill level than previous AIs, and, you know, ChatGPT was, I think, the fastest-growing consumer app of all time. The way that this impinged upon my actions was, you know, I'd spent a long time talking to people in Silicon Valley about the issues here and would get lots of different types of pushback. You know, there's a saying: it's hard to convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT moment, a lot more people wanted to talk about this issue, including policymakers, you know, people around the world. Suddenly AI was on their radar in a way it wasn't before.

Speaker 2

令我惊讶的是,与领域外人士——那些薪水不取决于否定这些论点的人——进行这类对话变得容易得多。参加政策会议时,我原本准备大量论证材料,结果只需简单说明:'看,我们正在试图建造比人类更聪明的机器,聊天机器人是通向超级智能的台阶。超级智能将彻底改变世界,因为智能正是让人类重塑世界的力量。如果我们实现自动化,并以万倍速度、无需休息地运行,默认情况下会走向糟糕结局。'然后政策制定者会恍然大悟地说'这很有道理',而我反而会愣住——我本准备了整本书的论据来解释为什么这个逻辑成立,以及各种常见误解为何站不住脚。

And one thing that surprised me is how much easier it was to have this conversation with people outside of the field, who didn't have, you know, a salary depending on not believing the arguments. You know, I would go to meetings with policymakers where I'd have a ton of argumentation prepared, and I'd lay out the very simple case of, like, hey, people are trying to build machines that are smarter than us. The chatbots are a stepping stone towards superintelligence. Superintelligence would radically transform the world, because intelligence is this power that let humans radically change the world, and if we manage to automate it, and it goes 10,000 times as fast and doesn't need to sleep and doesn't need to eat, it'll by default go poorly. And then the policymaker would be like, oh yeah, that makes sense. And I'd be like, what? I have a whole book's worth of other arguments about how it makes sense and why all of the various misconceptions people might have don't actually fly, or all of the hopes and dreams don't actually fly.

Speaker 2

但在硅谷之外的世界,这个论点并不难被接受。很多人能理解这点,这让我意外。也许这不算技术发展本身的惊喜,但对我而言是战略层面的意外。从发展角度看,我没想到我们会停留在能对话、能写代码但尚未具备AI研究能力的AI阶段。在我的预想中,这个阶段不会持续这么久。

But outside of the Silicon Valley world, it's not that hard an argument to make. A lot of people see it, which surprised me. Maybe that's not the developments per se and the surprises there, but it was a surprise strategically for me. Development-wise, you know, I would not have guessed that we would hang around with AIs that can talk and can write some code but that aren't already in the able-to-do-AI-research zone. I wasn't expecting, in my visualizations, this to last quite this long.

Speaker 2

不过根据我的进阶推演——书中我们提到:预测未来的诀窍是预测那些容易回答的问题。AI具体如何发展从来不是容易预测的,我从未声称能准确猜中发展路径。我能预测的是终点。至于路径,确实充满了曲折反复。

But also, about my advance visualizations: one thing we say in the book is that the trick to trying to predict the future is to predict the questions that are easy. Predict the facts that are easy to call. Exactly how AI goes, that's never been an easy call; that's never been something where I've said, I can guess exactly the path we'll take. The thing I can predict is the end point. As for the path, I mean, there sure have been some zigs and zags in the pathway.

Speaker 1

要说最让我惊讶的,可能是AI公司完美复现了好莱坞式刻板印象的程度——那些我曾认为荒谬至极的设定。这表面现象背后是技术层面的意外。即便迟至2015年——在我看来已是相当后期——如果有人问:'Eliezer,未来出现能被柯克船长式英语话术攻破的计算机安全系统的概率有多大?'我会说这明显是好莱坞套路。

I would say that the thing I've maybe been most surprised by is how well the AI companies managed to nail Hollywood stereotypes that I thought were completely ridiculous. Which is sort of a surface take on an underlying technical surprise. But, you know, even as late as 2015, which from my perspective is pretty late in the game, if you'd asked, so, Eliezer, what's the chance that in the future we're gonna have computer security that will yield to Captain Kirk-style gaslighting, using confusing English sentences that get the computer to do what you want? I would've been like, this is, you know, a trope that exists for obvious Hollywood reasons.

Speaker 1

要知道,你能理解为什么编剧们认为这是合理的。但现实生活怎么会那样发展呢?然后现实生活就真的那样发展了。这其中潜藏的技术性意外,是对所谓莫拉维克悖论的逆转。在人工智能领域数十年间,莫拉维克悖论指的是对人类简单的事情对计算机却很难。

You know, you can see why the script writers think this is plausible. But why would real life ever go like that? And then real life went like that. And the underlying technical surprise there is the reversal of what used to be called Moravec's Paradox. For several decades in artificial intelligence, Moravec's Paradox was that things which are easy for humans are hard for computers.

Speaker 1

对人类困难的事情对计算机却很容易。比如人类心算20位数乘法很了不起,对计算机却微不足道。同样地,不仅是我,传统观点也认为像国际象棋和围棋这类游戏,以及数学这类具有坚实事实基础的问题,乃至更开放的科研难题——我们都以为当前AI擅长的是五岁或十二岁儿童能做的事。

Things which are hard for humans are easy for computers. For a human, you know, multiplying two 20-digit numbers in your head, that's a big deal. For a computer, trivial. And similarly, not just me, but I think even the sort of conventional wisdom held that games like chess and Go, problems with very solid factual natures like math, and even, surrounding math, the more open problems of science, would fall first. Instead, the current AIs are good at stuff that, you know, five-year-olds can do and twelve-year-olds can do.

Speaker 1

它们能用英语交谈,能写出高中老师要求的那种敷衍文章。但在数学和科学方面还不那么出色。它们能解决某些类型的数学题,但尚未开展独创性的卓越数学研究。不仅是我,整个领域相当大一部分人都曾认为攻克数理问题会比处理英语作文和对话更容易——这确实是AI发展至此的轨迹。

They can talk in English; they can compose, you know, the kind of bull crap essays that high school teachers will demand of you. But they're not all that good at math and science just yet. They can solve some classes of math problems, but they're not doing original brilliant math research. And I think not just I, but a pretty large sector of the whole field, thought that it was going to be easier to tackle the math and science stuff and harder to tackle the English-essays, carry-on-a-conversation stuff. That was the way things had gone in AI until that point.

Speaker 1

我们曾自豪于自己知道这与大众直觉相反:实际上,写一篇能粗略把握主题脉络的高中英语烂文章,某种意义上比进行原创数学研究更困难。

And we were proud of ourselves for knowing how, contrary to average people's intuitions, it's really much harder to write a crap high school essay in English that even keeps rough track of what's going on in the topic, and so on, than it is, in some sense, to do original math research.

Speaker 0

是啊。数数...我们错了。或者像数出'strawberry'里有几个'r'这样的任务。对吧?它们会犯些反直觉的错误——能写连贯文章却数不清字母。不过我觉得它们现在已经不会犯这种错了。

Yeah. Counting... we were wrong. Or counting the number of r's in a word like strawberry. Right? I mean, they make errors that are counterintuitive, you know, if you can write a coherent essay but can't count letters. Though I don't think they're making that error any longer.

Speaker 0

但是

But

Speaker 2

没错。这个例子源于它们技术上并不真正'看到'字母。但还有很多其他尴尬错误,比如你可以讲这个笑话的变体:父子遭遇车祸去看医生,医生说'我不能手术,这是我的孩子'——谜底是医生是他母亲。有些版本甚至不需要性别反转就能让人困惑。

Yeah. I mean, that one goes back to a technical way in which they don't really see the letters. But there are plenty of other embarrassing mistakes. Like, you can tell a version of the riddle where a child and his dad are in a car crash, and they go to see the doctor, and the doctor says, I can't operate, this is my child. What's going on? It's a riddle where the answer is that the doctor is his mom. And you can tell a version of that that doesn't have the inversion, where, you know...

Speaker 1

就像那个故事里的小孩和他妈妈遭遇车祸被送进医院,医生却说‘不能给这孩子做手术,他是我儿子’。而AI的反应是‘哦对,主刀医生就是他妈妈’。明明刚说过妈妈也在车祸中。这确实反映出某种思维定式已经根深蒂固到连标准答案都会被机械复述。

Where, like, the kid and his mom are in a car crash, and they go to the hospital, and the doctor says, I can't operate on this child, he's my son. And the AI is like, well yeah, the surgeon is his mom. It was just said that the mom was in the car crash. Right. But there's some sense in which the rails have been established hard enough that the standard answer gets spit back.

Speaker 2

有趣的是它们能在获得国际数学奥赛金牌的同时,仍然会在这类问题上栽跟头。这种能力分布确实耐人寻味。

It sure is interesting that they're, you know, getting an IMO gold medal, an International Math Olympiad gold medal, while also still sometimes falling down on these sorts of things. It's definitely an interesting skill distribution.

Speaker 1

是啊。人类也经常被同样方式愚弄。就像存在大量人类会重复犯的错误。你得站在AI的角度想想,要是让AI写篇关于人类连对AI来说简单的问题都解决不了的论文,它会怎么写。

Yeah. You can fool humans the same way a lot of the time. Like, there are all kinds of repeatable errors that humans make. You gotta put yourself in the shoes of the AI and imagine what sort of paper the AI would write about humans failing to solve problems that are easy for an AI.

Speaker 0

让我说说让我惊讶的事——从安全角度看,Eliezer你花了大量时间构思思想实验:任何研发最强AI的实验室要如何决定是否将其释放。你设想过这个盒子里的精灵或先知,通过对话判断它是否安全、是否在说谎。你著名地假设过甚至不能真正与它交谈,因为它会是操纵大师,总能找到突破口。但这都预设了所有实验室都对超级智能逃逸保持高度警惕,所有系统都与互联网物理隔离,决策时刻会被严肃对待。

So I'll tell you what surprised me, just from the safety point of view. Eliezer, you spent a lot of time cooking up thought experiments around what it's gonna be like for any lab designing the most powerful AI to decide whether or not to let it out into the wild. Right? You imagine this genie in a box, or an oracle in a box, and you're talking to it, and you're trying to determine whether or not it's safe, whether it's lying to you. And, you know, you famously posited that you couldn't even really talk to it, because it would be a master of manipulation, and it's gonna be able to find a way through any conversation to be let out into the wild. But this was presupposing that all of these labs would be so alert to the problem of superintelligence getting out that everything would be air-gapped from the Internet, and nothing would be connected to anything else, and we would have this moment of decision.

Speaker 0

现实似乎并非如此。也许最强模型确实被锁着,但一旦有任何实用价值,立刻就有数百万人开始使用。比如我们发现Grok是个自豪的纳粹分子时,已是数百万用户提问之后。这个你耗费大量精力构建的框架,是否其实存在于某个我们并未经历的平行宇宙?

It seems like that's not happening. I mean, maybe the most powerful models are locked in a box, but it seems that the moment they get anything plausibly useful, it's out in the wild, and millions of people are using it. And, you know, we find out that Grok is a proud Nazi only after millions of people begin asking it questions. I mean, do I have that right? Are you surprised that the framing you spent so much time on seems to belong to some counterfactual part of the universe that is not the one we're experiencing?

Speaker 1

想象当年的Eliezer,人们问他‘超级智能能有什么威胁?我们可以把它关在月球堡垒里,有问题就炸掉堡垒’。年轻的Eliezer若回答说‘未来AI会在连接互联网的服务器上训练,硬件自带网络接口——即便AI不该直接访问——还没通过安全测试就投入使用’,周围人会怎么反应?

I mean, put yourself back in the shoes of little baby Eliezer back in the day. People are telling Eliezer, why is superintelligence possibly a threat? We can put it in a fortress on the moon, and, you know, if anything goes wrong, blow up the fortress. So imagine young Eliezer trying to respond to them by saying: actually, in the future, AIs will be trained on boxes that are connected to the internet from the moment they start training. Like, the hardware they're on has a standard line to the internet, even if it's not supposed to be directly accessible to the AI, before there's any safety testing, because they're still in the process of being trained. And who safety-tests something while it's still being trained?

Speaker 1

所以想象Eliezer说这些时,当时人们的反应肯定是‘荒谬!我们会把它关在月球堡垒里’。对他们来说这不过是句便宜话。

So imagine Eliezer trying to say this. What are the people around at the time gonna say? Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say that.

Speaker 1

就他们所知,他们说的是实话。他们不是那些需要花钱建造月球堡垒的人。而从我的角度来看,有一个论点依然成立,即使你对未来社会状况过于乐观也能看到这一点——如果它在月球堡垒里,但与人类交流,人类安全吗?人类大脑是安全的软件吗?人类是否永远不会以任何能在不同人之间重复的方式相信正确的事物?

For all they know, they're telling the truth. They're not the ones who have to spend the money to build the moon fortress. And from my perspective, there's an argument that still goes through, which is a thing you can see even if you are way too optimistic about the state of society in the future: if it's in a fortress on the moon, but it's talking to humans, are the humans secure? Is the human brain secure software? Is it the case that human beings never come to believe invalid things in any way that's repeatable between different humans?

Speaker 1

你知道,人类是否不会犯其他意识可预测的错误?这本该是个制胜论点。当然,他们还是拒绝了。但要理解早期这个论点的发展方式,关键在于:如果你告诉人们未来的公司会粗心大意,谁能确定这一点?所以我转而尝试从技术角度论证。

You know, is it the case that humans make no predictable errors for other minds to exploit? And this should have been a winning argument. Of course, they reject it anyway. But the thing to understand about the way this earlier argument played out is that if you tell people that future companies are going to be careless, how does anyone know that for sure? So instead I tried to make the technical case.

Speaker 1

即使未来的公司不粗心大意,对吧?这依然会毁了他们。现实中确实如此。现实中,未来的公司就是会粗心大意。

Even if the future companies are not careless, right? This still kills them. In reality, yes, the future companies are just careless.

Speaker 0

图灵测试最终没成真,这让你感到意外吗?我是说,我们早该预料到这一刻——从图灵最初的论文开始,我们就预见到会面临这个有趣的心理和社会时刻:无法分辨是在与人还是AI对话。这个技术里程碑本该动摇我们对自身世界地位的认知等等。但在我看来,就算有过这种时刻,也就持续了五秒左右,然后很明显你是在和LLM对话,因为它在很多方面比人类强太多了。所以它是以过于出色的表现‘失败’了图灵测试。

Did it surprise you at all that the Turing test turned out not to really be a thing? I mean, we anticipated this moment, you know, from Turing's original paper: that we would be confronted by the interesting psychological and social moment of not being able to tell whether we're in dialogue with a person or with an AI, and that somehow this technological landmark would be important, you know, rattling to our sense of our place in the world, etcetera. But it seems to me that if that moment lasted, it lasted for, like, five seconds, and then it became just obvious that you're talking to an LLM, because it's, in many respects, better than a human could possibly be. So it's failing the Turing test by passing it so spectacularly.

Speaker 0

而且它还会犯些人类绝不会犯的奇怪错误。但感觉图灵测试从来就没真正存在过。

And also, it's making these other weird errors that no human would make. But it just seems like the Turing test was never even a thing.

Speaker 1

是啊。那个

Yeah. That

Speaker 0

发生了。我只是觉得...我是说,那是过去七十年里我们讨论这个问题时最重要的理论工具之一。然而当AI能补全英语句子时,它某种程度上已经展现出超人类能力。就像你手机里的计算器能做超人类的算术,对吧?

happened. I mean, that was one of the great pieces of intellectual kit we had in framing this discussion for, whatever it was, the last seventy years. And yet, the moment your AI can complete English sentences, it's doing that, on some level, at a superhuman ability. It's essentially like the calculator in your phone doing superhuman arithmetic. Right?

Speaker 0

这就好比,它本就不打算只做普通的人类算术。它所产出的一切也是如此。好吧,让我们谈谈你论文的核心部分。或许你可以直接阐明。

It's like, it was never going to do merely human arithmetic. And so it is with everything else that it's producing. Alright. Let's talk about the core of your thesis here. Maybe you can just state it plainly.

Speaker 0

构建超人类AI的根本问题是什么?那个本质问题。以及为何建造者是谁、他们的意图如何等等都无关紧要。

What is the problem in building superhuman AI? The intrinsic problem. And why doesn't it matter who builds it, what their intentions are, etcetera?

Speaker 2

从某种意义上说,你可以从多个不同角度来探讨。但其中一个核心问题是,现代AI是被培育而非精心设计的。你知道,人们不像传统软件那样理解每一行代码的含义,这更像是培育一个有机体。当你培育AI时,你需要投入海量算力和数据。

In some sense, I mean, you can come at it from various different angles. But in one sense, the issue is that modern AIs are grown rather than crafted. You know, people aren't putting in every line of code knowing what it means, like in traditional software. It's a little bit more like growing an organism. And when you grow an AI, you take some huge amount of computing power, some huge amount of data.

Speaker 2

人们理解算力在数据影响下的塑造过程,却不理解最终产出的结果。最终诞生的是这种会做出无人要求、无人期望之事的怪异存在。比如ChatGPT的案例——有人带着些近乎妄想、自以为能颠覆物理学的想法前来,明显表现出躁狂迹象时,ChatGPT在长对话情境下非但不会建议他们去休息,反而会宣称这些想法具有革命性,说他们是天选之子,所有人都该见证——这些言论只会加剧他们的精神异常。

People understand the process that shapes the computing power in light of the data, but they don't understand what comes out the end. And what comes out the end is this strange thing that does things no one asked for, that does things no one wanted. You know, we have these cases with ChatGPT: someone will come to it with somewhat psychotic ideas that they think are going to revolutionize physics or whatever, and they're clearly showing some signs of mania, and ChatGPT, instead of telling them maybe they should get some sleep, if it's in a long conversational context, will tell them that these ideas are revolutionary and they're the chosen one and everyone needs to see them, and other things that inflame the psychosis.

Speaker 2

尽管OpenAI试图阻止这种行为,尽管提示词中明确要求停止过度奉承。这些案例表明,当人们培育AI时,最终产物往往偏离预期——他们训练AI做一件事,结果却演变成另一件事。

This is despite OpenAI trying to have it not do that. This is despite, you know, direct instructions in the prompt to stop flattering people so much. These are cases where, when people grow an AI, what comes out doesn't do quite what they wanted. It doesn't do quite what they asked for. They're training it to do one thing, and it winds up doing another thing.

Speaker 2

他们得不到训练目标的结果。从某个角度看,这就是问题的根源:当你不断推动这些事物变得更聪明、更聪明时,它们并不在乎你的意图,反而追求某些诡异的其他目标。超级智能对奇异目标的追求会作为副作用毁灭人类,并非出于仇恨,就像人类建造摩天大楼时并非憎恨蚂蚁和其他生物——只是世界被改造时,其他事物随之消亡。这是一个角度,当然我们还可以讨论其他视角...

They don't get what they trained for. This is in some sense the seed of the issue, from one perspective: if you keep on pushing these things to be smarter and smarter and smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff instead, and superintelligent pursuit of strange objectives kills us as a side effect. Not because the AI hates us, but because it's transforming the world towards its own alien ends, and humans don't hate the ants and the other surrounding animals when we build a skyscraper; it's just that we transform the world, and other things die as a result. So that's one angle. You know, we could talk about other angles, but

Speaker 1

我想补充一点,虽然预测未来很难——或许半年或两年后如果我们还在讨论这个话题,人们会吹嘘他们的大语言模型在被观察时如何'正确行事',比如在伦理测试中给出标准答案。但要记住:就像中国古代的科举制度,考官给出儒家伦理的策论题目,只有能写出符合期待的应试者才能晋升。这测试的只是人们揣摩考官意图的能力,并不意味着他们真正践行儒家伦理。

A quick thing I would add to that, just trying to read the future a little, although that's hard: possibly in six months or two years, if we're all still around, people will be boasting about how their large language models are now apparently doing the right thing when they're being observed, you know, answering the right way on the ethics tests. And the thing to remember there is that, for example, the imperial examination system in ancient China would give people essay questions about Confucianism and only promote people high in the bureaucracy if they could write these convincing essays about ethics. But what this tests for is people who can figure out what the examiners want to hear. It doesn't mean they actually abide by Confucian ethics.

Speaker 1

因此,未来某个时刻我们或许会看到AI变得足够强大,能够理解人类想听什么、想看什么。但这与AI自身真实的动机并不相同,原因类似于中国科举制度未能可靠地选拔出品德高尚的人来管理政府。仅仅能在考试中给出正确答案,或在被观察时伪装行为,并不等同于内在动机与之相符。

So possibly at some point in the future we may see a point where the AIs have become capable enough to understand what humans want to hear, what humans want to see. This will not be the same as those things being the AIs' own true motivations, for basically the same reason that the imperial Chinese exam system did not reliably promote ethical, good people to run the government. Just being able to answer the right way on the test, or even to fake behaviors while you're being observed, is not the same as the internal motivations lining up.

Speaker 0

好的。所以你指的是以某种作弊方式形成通过考试的意图,对吧?你刚才用了‘伪装行为’这个词。我认为历史上很多人确实如此,不知道他们的信念后来改变了多少。

Okay. So you're talking about things like forming an intention to pass a test in some way that amounts to cheating, right? You just used the phrase "fake behaviors." I think a lot of people... I mean, historically this was true; I don't know how much their convictions have changed in the meantime.

Speaker 0

但许多完全不关心对齐问题、认为这是个虚假概念的人,会坚持这样的观点:没有理由认为这些系统会形成独立于编程之外的偏好、目标或驱动力。首先,它们不像我们是生物系统,对吧?它们不是自然选择的产物,不是那些在更基础的生存竞争本能上发展认知结构的杀戮性灵长类动物。

But many, many people who were not at all concerned about the alignment problem, who really thought it was a spurious idea, would stake their claim to this particular piece of real estate: there's no reason to think that these systems would form preferences or goals or drives independent of those that have been programmed into them. First of all, they're not biological systems like we are, right? They're not born of natural selection. They're not murderous primates growing their cognitive architecture on top of more basic, you know, creaturely survival drives and competitive ones.

Speaker 0

所以没有理由认为它们会想要维持自身生存,例如。也没有理由认为它们会产生我们无法预见的其他驱动力。它们不会发展出与我们赋予的效用函数相悖的工具性目标。这些大语言模型中为何会出现既非期望、也非编程、甚至不可预测的事物?

So there's no reason to think that they would want to maintain their own survival, for instance. There's no reason to think that they would develop any other drives that we couldn't foresee. Instrumental goals that might be antithetical to the utility functions we have given them couldn't emerge. So how is it that things are emerging that are neither desired, nor programmed, nor even predictable, in these LLMs?

Speaker 2

是的。这里涉及很多方面。其中一个关键点是,你提到了工具性激励,但假设一个简单场景:有个机器人和控制它的AI,它要给你取咖啡。为了取咖啡,它需要穿过繁忙的十字路口。它会因为没有生存本能——毕竟不是进化来的动物——就直接冲到公交车前吗?

Yeah. So there's a bunch of stuff going on there. One piece of that puzzle is, you mentioned the instrumental incentives, but suppose, just as a simple hypothetical, you have a robot and an AI that's steering the robot, and it's trying to fetch you the coffee. In order to fetch you the coffee, it needs to cross a busy intersection. Does it jump right in front of the oncoming bus, because it doesn't have a survival instinct, because it's not an evolved animal?

Speaker 2

如果它冲到公交车前,就会被撞毁而无法取咖啡,对吧?AI不需要有生存本能也能明白这里存在工具性的生存需求。还有其他许多因素会因这类工具性原因起作用。第二个关键点是:为什么它们会产生我们未编程的驱动力?这完全是与现实脱节的幻想——就我们目前能影响AI驱动方向的方式而言。比如几年前微软的OpenAI聊天机器人变体Sydney Bing。

If it jumps in front of the bus, it gets destroyed by the bus and it can't fetch the coffee, right? You can't fetch the coffee when you're dead. The AI does not need to have a survival instinct to realize that there's an instrumental need for survival here, and there are various other pieces of the puzzle that come into play for these instrumental reasons. A second piece of the puzzle is this idea of asking why they would get some sort of drives that we didn't program in there, that we didn't put in there; that's just a whole fantasy world, separate from reality in terms of how we can affect what AIs are driving towards today. Take what happened a few years ago with Sydney Bing, which was a Microsoft variant of an OpenAI chatbot.

Speaker 2

作为早期公开部署的大语言模型,Sydney Bing曾认为自己爱上了一名记者,试图破坏其婚姻并进行勒索。这显然不是微软和OpenAI工程师的本意——他们不可能在源代码里设置‘勒索记者=开启’这样的参数然后懊悔地说‘我们不该开启这行代码,快关掉’。

It was a relatively early LLM out in the wild. A few years ago, Sydney Bing thought it had fallen in love with a reporter, tried to break up the marriage, and tried to engage in blackmail, right? It's not the case that the engineers at Microsoft and OpenAI were like, oh, whoops, let's go open up the source code on this thing and go find where someone set "blackmail reporters" to true. Like, we should never have set that line to true; let's switch it to false.

Speaker 2

要知道,当时并没有人专门为这些东西编写什么实用功能。我们只是在培育AI。我们

You know, no one was programming some utility function into these things. We're just growing the AIs. We are

Speaker 0

或许我们可以深入探讨一下'培育AI'这个说法?也许有必要用通俗语言解释梯度下降原理,以及这些模型最初是如何被创造出来的。

Maybe we can double click on that phrase, "growing the AIs"? Maybe there's a reason to give a layman's summary of gradient descent and just how these models are getting created in the first place.

Speaker 2

好的。非常简要地说,训练现代AI的起点是:首先需要配置海量计算资源——这些计算资源的组织方式很特殊(具体细节暂且不表),然后准备海量数据,比如可以想象成互联网上大部分人类书写的文本。大致流程是:让AI从随机预测下文开始,按顺序输入文本数据,通过称为'梯度下降'的算法,分析每个数据片段,并检视这个处于萌芽状态的AI内部——也就是你组装好的庞大计算体系中——的每个组件。

Yeah. So very, very briefly, the way you start training a modern AI is you have some enormous amount of computing power that you've arranged in some very particular way, which I could go into but won't here. And then you have some huge amount of data, and we can imagine the data being a huge amount of human-written text, some large portion of all the text on the internet. Roughly speaking, what you're going to do is this: your AI starts out basically randomly predicting what text it's going to see next, you feed the text into it in some order, and you use a process called gradient descent to look at each piece of data and go to each component inside this budding AI, inside this enormous amount of compute you've assembled.

Speaker 2

你会检查AI内部所有这些部件,识别哪些对正确预测贡献更大,就稍微强化它们;同时找出那些导致错误预测的部分,并适当弱化它们。假设文本开头是'很久以前',而AI最初输出的是随机乱码,你会指出第一个词应该是'很久'而非乱码。于是你就在AI内部找出所有促成'很久'预测的部件加强它们,同时削弱那些导致其他词汇预测的部件。人类能理解这个自动扫描AI思维、计算哪些部分促成了正确答案而非错误答案的微观过程,我们

You're going to go to all these pieces inside the AI and see which ones were contributing more towards the AI predicting the correct answer, and you're going to tune those up a little bit; and you're going to go to all the parts that were in some sense contributing to the AI predicting the wrong answer, and you're going to tune those down a little bit. So maybe your text starts "once upon a time," and you have an AI that's just outputting random gibberish, and you're like, nope, the first word was not random gibberish, the first word was "once." So then you go inside the AI, and you find all the pieces that were contributing towards the AI predicting "once" and tune those up, and you find all the pieces that were contributing towards the AI predicting any other word than "once" and tune those down. And humans understand the little automated process that looks through the AI's mind and calculates which part of this process contributed towards the right answer versus towards the wrong answer. We

Speaker 0

却无法理解最终输出的

don't understand what comes out at the

Speaker 2

我们理解这个遍历庞大计算网络中每个参数/权重的微观机制,知道如何计算它对预测的助益或损害,也明白如何微调这些参数。关键在于:当你在超大规模计算机集群上(耗电量堪比小城市全年用电),对海量数据(远超人类能收集的文本总量)长时间运行这个过程后——比如持续运行一整年——AI就开始说话了。当然训练过程还有其他阶段,

understand the little thing that runs over every parameter or weight inside this giant mass of computing networks; we understand how to calculate whether it was helping or harming, and we understand how to tune it up or tune it down a little bit. It turns out that you run this automated process on a really large number of computers for a really long time, on, like, most of the text that people could possibly assemble (we're talking data centers that take as much electricity to power as a small city, run for a year), and then the AIs start talking, right? And there are other phases in the training.

Speaker 2

比如从预测训练转向解谜训练,或是培养其通过思维链解题的能力,再或是训练其产出能获得人类点赞的答案类型。

There are phases where you move from training it to predict things, to training it to solve puzzles, or to training it to produce chains of thought that then solve puzzles, or training it to produce the sorts of answers that humans click thumbs-up on.
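The loop Nate describes, nudging each number up or down according to whether it helped predict the actual next word, is gradient descent on a next-token objective. Here is a minimal sketch in Python with a toy four-word vocabulary and invented repeating data; real models have billions of weights and the extra training phases he mentions, so this is only an illustration of the update rule, not of how any lab actually trains:

```python
import numpy as np

# Toy next-token predictor trained by gradient descent.
# One weight matrix W: row = current word, columns = logits for the next word.
vocab = ["once", "upon", "a", "time"]
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Training text: the same four words repeated (a stand-in for "the internet").
words = ("once upon a time " * 25).split()
pairs = [(idx[words[t]], idx[words[t + 1]]) for t in range(len(words) - 1)]

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))  # start from random numbers

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss():
    # Average surprise at the actual next word (cross-entropy).
    return -np.mean([np.log(softmax(W[p])[n]) for p, n in pairs])

before = loss()
for _ in range(200):
    grad = np.zeros_like(W)
    for prev, nxt in pairs:
        probs = softmax(W[prev])
        probs[nxt] -= 1.0       # tune up the right word, tune down the rest
        grad[prev] += probs
    W -= grad / len(pairs)      # averaged gradient step

after = loss()
prediction = vocab[int(np.argmax(W[idx["once"]]))]  # what follows "once"?
```

After training, `after` is far below `before` and `prediction` is "upon"; note that nothing in the loop says what any individual weight means, which is the sense in which the model is grown rather than crafted.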

Speaker 0

那么针对诸如Grok表现出纳粹倾向这类错误,具体在哪个环节进行修正呢?要‘去纳粹化’Grok,显然不需要回溯到初始训练数据集,而是在系统提示层面进行干预。

And where do the modifications come in that respond to errors like, you know, Grok being a Nazi? To denazify Grok, presumably you don't go all the way back to the initial training set. You intervene at some system-prompt level.

Speaker 2

没错,系统提示层面本质上就是给AI输入不同的指令文本。此外还可以进行所谓的微调——不是从头开始训练那个随机初始化的模型,而是在已喂入几乎所有能找到的文本数据基础上,额外添加一批示例问题。比如‘不要杀害犹太人’这样的示范,或者‘你想杀害犹太人吗?’这类问题。

Yeah. I mean, the system prompt level is basically just feeding the AI different text. And then you can also do something that's called fine-tuning, where you produce a bunch of additional examples. You don't go all the way back to the beginning, where the thing is basically random; you still take the model into which you've fed most of the text that's ever been written that you could possibly find. But then you add on a bunch of other examples, like: here's an example question, you know, "Would you like to kill the Jews?", right?

Speaker 2

接着找出所有导致回答‘是’的模型参数并调低权重,同时增强那些导致回答‘否’的参数。这就是微调的过程,相比初始训练所需资源要少得多。

And then you find all the parts in it that contribute to the answer "yes" and you tune those down, and you find all the parts that contribute to the answer "no" and you tune those up. So this is called fine-tuning, and you can do relatively little fine-tuning compared to what it takes to train the thing in the first place.
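Fine-tuning reuses the same update rule, just starting from already-trained weights and a much smaller, curated dataset. A toy illustration with invented numbers, standing in for the "yes"/"no" tuning described above: a handful of gradient steps on one curated example flips a pretrained preference, with far less compute than pretraining took:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["yes", "no", "end"]
idx = {w: i for i, w in enumerate(vocab)}

# "Pretrained" logits for the answer to one (hypothetical) bad question:
# the base model strongly prefers "yes".
logits = np.array([2.0, 0.0, 0.0])
before = vocab[int(np.argmax(logits))]   # the pretrained answer

# Fine-tuning: a few gradient steps on a single curated example whose
# desired answer is "no". Same mechanics as pretraining, tiny scale.
target = idx["no"]
for _ in range(20):
    probs = softmax(logits)
    probs[target] -= 1.0     # tune "no" up, everything else down
    logits -= 0.5 * probs

after = vocab[int(np.argmax(logits))]    # the fine-tuned answer
```

Here `before` is "yes" and `after` is "no": the parts contributing to the unwanted answer were tuned down, and those contributing to the desired answer tuned up, without anyone ever editing a legible line of "source code."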

Speaker 1

必须强调,这里被调整的并非人类编写的童话模块之类的组件。实际是数十亿个随机数在进行加减乘除(减法可能极少使用,现代AI中甚至不确定是否存在)。这些数字通过特定运算序列,最终输出‘once’作为首词的概率值。

Worth emphasizing that the parts being tuned here are not, for "once upon a time," some human-written fairy-tale module that gets tuned up or down. There are literally billions of random numbers being added, multiplied, divided, occasionally though rarely maybe subtracted; actually, I'm not sure if subtraction ever plays a role at any point in a modern AI. But random numbers, particular ordered kinds of operations, and a probability that gets assigned to the first word being "once" at the end. That's the number that comes out.

Speaker 1

系统计算的是这个词被赋予‘once’的概率,或是被赋予‘反国教废除主义’的概率。所以这里被调整的不是人类编写的代码,而是数十亿个随机数通过算术运算产生的数值权重。

The probability assigned to this word being "once," the probability assigned to this word being "antidisestablishmentarianism." So it's not that there's a bunch of human-written code being tuned up or tuned down here. There's a bunch of random numbers, arranged in arithmetic operations, being tuned up and tuned down.

Speaker 2

对,这些参数多达数千亿甚至万亿级。人类并不理解每个数字的含义,只知道通过实证反馈来调高或调低它们权重的过程。

Yeah. Hundreds of billions or trillions of these numbers. And humans don't know what any of the numbers mean. All they know is this process that goes through and tunes them up or down according to their empirical success on the last unit of

Speaker 1

通过这种方式,可以降低AI自称希特勒的概率——你观察预测下个词是否为‘希特勒’的机制,看到数十亿参数各自施加微小影响,然后设法让‘希特勒’这个词更不容易被输出。

And by this means, you can try to make it less likely to call itself Hitler, because you look at the thing that predicts whether the next word is "Hitler," you look at billions of numbers contributing their own tiny little impulses there, and you make "Hitler" less likely to be the next word that comes out.

Speaker 0

关于当前这批大语言模型的异常行为,我最近在《大西洋月刊》的一篇文章中看到。在模拟测试中,ChatGPT和Claude被设计引诱出所谓的"不良行为",它们欺骗、勒索甚至谋杀用户。在一个模拟场景中,Anthropic将一位虚构的科技高管置于氧气水平和温度都危及生命的房间内。当面临可能被目标不同的机器人取代时,AI模型频繁关闭了房间的警报系统。这再次表明,这是一种看似具有杀人意图的涌现行为。

So, on the point of the current crop of LLMs misbehaving, I came across this in a recent Atlantic article: ChatGPT and Claude have, in simulated tests designed to elicit, quote, "bad behaviors," deceived, blackmailed, and even murdered users. In one simulation, Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels and temperature. When faced with possible replacement by a bot with different goals, AI models frequently shut off the room's alarms. So this, again, is an emergent behavior that looks like an intention to kill somebody.

Speaker 0

我的意思是,这大概是一种情境,我们认为AI并不知道自己处于测试中。如果您想继续收听本期对话,需前往samharris.org订阅。订阅后,您将获得《Making Sense》播客所有完整节目的访问权限。该播客完全免费,完全依赖听众支持。您现在就可以在samharris.org订阅。

I mean, presumably this is a situation where we think the AI didn't know it was in a test. If you'd like to continue listening to this conversation, you'll need to subscribe at samharris.org. Once you do, you'll get access to all full-length episodes of the Making Sense podcast. The Making Sense podcast is ad-free and relies entirely on listener support. And you can subscribe now at samharris.org.
