Google DeepMind: The Podcast - 解码谷歌双子星:与杰夫·迪恩同行 封面

解码谷歌双子星:与杰夫·迪恩同行

Decoding Google Gemini with Jeff Dean

本集简介

汉娜·弗莱教授与计算机科学领域最具传奇色彩的人物之一、谷歌DeepMind及谷歌研究首席科学家杰夫·迪恩展开对话。20世纪90年代末,杰夫编写的代码助力谷歌从一家小型初创企业成长为如今的跨国巨头,堪称该领域的关键推手。汉娜与杰夫畅谈一切——从谷歌与神经网络的早期岁月,到Gemini等多模态模型的长期潜力。

特别鸣谢以下制作团队成员(包括但不限于):
主持人:汉娜·弗莱教授
系列制片人:丹·哈杜恩
剪辑:拉米·察巴尔(TellTale工作室)
监制兼制片人:艾玛·尤瑟夫
制作支持:莫·达乌德
音乐作曲:埃莱妮·肖
摄像指导与视频剪辑:汤米·布鲁斯
音频工程师:佩里·罗甘廷
视频演播室制作:尼古拉斯·杜克
视频剪辑:比拉尔·梅尔希
视频美术设计:詹姆斯·巴顿
视觉标识与设计:埃莉诺·汤姆林森
谷歌DeepMind委托制作

若喜欢本期节目,请在Spotify或苹果播客留下评论。我们始终期待听众以反馈、新想法或嘉宾推荐等形式与我们互动!

本节目由AdsWizz旗下Simplecast平台托管。个人数据收集及广告用途相关信息请见pcm.adswizz.com

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

欢迎回到谷歌DeepMind播客,我是主持人汉娜·弗莱教授。

Welcome back to Google DeepMind podcast with me, your host, professor Hannah Fry.

Speaker 0

在本期节目中,我们有幸采访计算机科学界最具传奇色彩的人物之一——杰夫·迪恩。

In this episode, we get to speak to one of the most legendary figures in the world of computer science, Jeff Dean.

Speaker 0

上世纪90年代末,正是他编写的代码将谷歌从一家小型初创公司转变为今天的跨国企业。

He was there in the late nineteen nineties writing the code that would turn Google from a small start up to the multinational company it is today.

Speaker 0

杰夫主导开发了TensorFlow,这个编程工具对机器学习的普及功不可没。

Jeff spearheaded TensorFlow, one of the programming tools responsible for the democratisation of machine learning.

Speaker 0

正是他推动人工智能朝着大规模模型的方向突破边界。

He's the one who pushed the boundaries of artificial intelligence in the direction of large scale models.

Speaker 0

不仅如此,他还联合创立了谷歌AI研究项目Google Brain,并且是新型神经网络架构Transformer的早期先驱之一。

And if that wasn't enough, he also co founded Google's AI research project, Google Brain, and was one of the earliest pioneers of a new neural network architecture called transformers.

Speaker 0

不过我不确定这个架构会不会流行起来。

Not sure that one's going to catch on.

Speaker 0

我想这就是为什么人们开玩笑说杰夫·迪恩的简历只列了他没做过的事。

I think this is why people joke that Jeff Dean's resume just lists the things that he hasn't done.

Speaker 0

这样写起来比较短。

It's shorter that way.

Speaker 0

最近,作为谷歌首席科学家的杰夫在Alphabet两大AI部门DeepMind与Google Brain合并时,坐上了最重要的席位之一。

More recently, Jeff, as Google's chief scientist, has occupied one of the most important seats around the table as the two great AI arms of Alphabet have merged: DeepMind and Google Brain.

Speaker 0

他最新参与培育的项目名为Gemini,将大语言模型的应用范围远远扩展到了纯语言领域之外。

His latest baby, which he co parents, is known as Gemini, which takes large language models well beyond language alone.

Speaker 0

Gemini是多模态模型,能够理解文本、代码、音频、图像和视频。

Gemini is a multimodal model which can understand text, code, audio, images, and video.

Speaker 0

这是彻头彻尾的人工智能,几乎可以确定就是谷歌搜索自身的发展方向。

It is AI through and through and almost certainly the direction that Google Search itself is heading in.

Speaker 0

杰夫,非常感谢你参加我们的节目。

Jeff, thank you so much for joining me.

Speaker 1

感谢邀请我参加。

Thank you for having me.

Speaker 1

很高兴能来到这里。

It's a delight to be here.

Speaker 0

好的。

So okay.

Speaker 0

在谷歌度过了二十五年,四分之一世纪。

Twenty five years, quarter of a century at Google.

Speaker 0

我想了解一下早期的情景,就是你九十年代刚加入时的情况

I wanna know a little bit of what it was like back in in the early days, right, like in the nineties when you first joined

Speaker 1

嗯。

Mhmm.

Speaker 0

那时候谷歌还不是现在这样成熟的组织。

When Google wasn't the sort of slick organization that it is now.

Speaker 0

是不是到处都是贴满贴纸的笔记本电脑,大家穿着人字拖写代码?

Was it all a a lot of, you know, laptops with stickers on it and sort of coding in flip flops?

Speaker 1

可惜那时还没有笔记本电脑。

Sadly, it was pre-laptops.

Speaker 1

笔记本电脑出现之前。

Pre laptops.

Speaker 1

嗯,大多数时候是。

Well, mostly.

Speaker 1

是的。

Yeah.

Speaker 1

我们用的都是那种巨大的CRT显示器。

We all had those giant CRT based monitors.

Speaker 1

那时候还没有LCD显示器。

It was pre LCD monitors.

Speaker 1

它们占用了很多桌面空间。

So they took up a lot of desk space.

Speaker 0

不太便携。

Not very portable.

Speaker 1

我的桌子就像两块锯木架上架着一扇门,你可以自己调整高度,比如从桌子后面站起来,把背靠上去,就能调到下一个更高的档位。

My desk was like a a door on two sawhorses, and you could adjust it yourself by, like, getting out of the desk and standing up with your back to, like, get it to the next higher setting.

Speaker 1

真的吗?

Really?

Speaker 1

是啊。

Yeah.

Speaker 1

太神奇了。

Amazing.

Speaker 1

刚开始时,我们其实是在一个小办公区,大概只有这个房间的三倍大。

And when I started, we were in this small office area, actually, not, you know, maybe three times as big as this room.

Speaker 1

整个谷歌。

The whole of Google.

Speaker 1

整个谷歌当时就在帕洛阿尔托大学大道上,现在那里是一家T-Mobile手机店。

The whole of Google on University Avenue in Palo Alto above what's now a T Mobile, like, cell phone store.

Speaker 1

那时候最有趣也最令人兴奋的是,我们虽然是个小公司,但能明显看到越来越多人使用我们的服务,因为我们提供了优质、高效的搜索。

And, you know, the really fun and exciting thing in those days was we were a small company, but we could see that people were using our service more and more because we were providing good quality, high quality search.

Speaker 1

你能看到流量每天都在增长,每周都在攀升。

And you could see your traffic growing day over day, week over week.

Speaker 1

所以我们总在努力避免周二中午系统崩溃,那是每周流量最高峰时段。这就要求我们快速部署更多服务器,优化代码提高运行速度,还要为下个月的索引想出新颖的创新方案,让相同硬件能服务更多用户。

So we'd always be trying to not melt on Tuesday at noon, which was kind of the peak traffic hour of the the week, and that would require we deploy more computers quickly, that we optimize our code to make it run faster, that we come up with new and interesting innovations for next month's index that make it, you know, able to serve more users with the same hardware.

Speaker 0

我能想象那非常令人兴奋。

I can imagine being very exciting.

Speaker 0

你们有没有某个瞬间突然意识到?

Was there a moment when you guys realized?

Speaker 0

是否有那么一刻让你觉得‘啊,原来如此’?

Was there a moment when you were like, ah, okay.

Speaker 0

这真的会成功吗?

This is really going to be big?

Speaker 1

我是说,我认为从我刚加入公司时就能看出来——我加入是因为我们的流量增长非常快,我们觉得通过专注于提供高质量搜索结果、快速响应并满足用户需求,我们实际上希望用户能尽快离开我们的网站,直达他们需要的信息。

I mean, I think you could see that from the very earliest days when I was joining I joined the company because our traffic was growing really fast, and we felt like, by focusing on returning really high quality search results and doing that quickly and giving users what they want, we actually wanna get people off our site as quickly as possible to the information they need.

Speaker 1

这是个制胜策略,用户似乎也很喜欢我们的服务。

That was a winning proposition, and users seemed to like our service.

Speaker 1

所以可以说,即便在早期阶段,前景看起来也相当乐观。

And so it seemed reasonably promising, I would say, from even the early days.

Speaker 0

但‘相当乐观’和最终达到的规模之间还是有很大差距的。

There's quite a big gap between reasonably promising, though, and sort of what it ended up being.

Speaker 0

最终的发展结果对你们所有人来说都是个意外吗?

Has it been a surprise to you, like, to all of you, what it ended up being?

Speaker 1

我是说,我认为我们后来拓展的许多领域确实难以预料,比如自动驾驶汽车。

I mean, I think there's been a bunch of things that we've branched out into that were, you know, obviously hard to anticipate autonomous vehicles.

Speaker 1

当你还在做搜索引擎时,确实很难想象这些。

Like, it's hard to sort of fathom that when you're working on a search engine.

Speaker 1

但我觉得,我们的产品组合逐渐扩展到其他类型的信息领域是很有道理的。

But, you know, I think the sort of gradual broadening out of our portfolio of products to other kinds of information makes a lot of sense.

Speaker 1

比如从公共网页搜索发展到帮助用户用Gmail管理邮件这类事情,这些都是为解决人们实际问题而自然演化的产物,它们会让你进入一种‘好吧’的状态。

So going from public web pages to, you know, helping users organize their own email with Gmail, things like that, those are sort of natural evolutions of things that solve real problems people have, and they get you in the state of, okay.

Speaker 1

现在我们不只有一个产品了。

Well, now we don't just have one product.

Speaker 1

我们有了几个用户经常使用的产品。

We have, like, a a handful of products that people use fairly regularly.

Speaker 0

回顾过去这些年,我有点好奇,你认为谷歌一直是一家搜索公司,还是说它本质上是一家AI公司,只是假装成搜索公司?

And I sort of wonder, looking back through all that time, do you think that Google was always a search company, or do you think it was an AI company sort of pretending to be a search company?

Speaker 1

是的。

Yeah.

Speaker 1

我认为作为一家公司,我们想解决的许多问题确实需要AI才能真正解决。

I mean, I think a lot of the problems we wanted to tackle as a company really were sort of ones that would require AI to really truly solve.

Speaker 1

因此在长达二十五年的时间里,我们一直在逐步攻克这些困难的AI问题,取得进展,然后将这些新技术应用到搜索和我们所有其他产品中。

And so along the way in a long period of time, twenty five years, we've been progressively tackling some of those hard AI problems and making progress on them and then using the new techniques that are now starting to work in the midst of search and the midst of all of our other products.

Speaker 0

你认为谷歌会永远是一家搜索公司吗?或者它现在还算是一家搜索公司吗?

Do you think that Google will always be a search company, or do you think is it even a search company now?

Speaker 0

它在改变吗?

Is it changing?

Speaker 1

我非常喜欢谷歌的一点是,即使在二十五年后的今天,我们的使命仍然极具现实意义。

One of the things I really like about Google is our mission is still incredibly relevant even, you know, twenty five years later.

Speaker 1

即组织全球信息,使其普遍可访问且有用。

You know, organize the world's information and make it universally accessible and useful.

Speaker 1

我觉得Gemini正在真正帮助我们朝着理解多种信息的方向迈进。

And I feel like Gemini is really helping us push in the direction of understanding, you know, lots of different kinds of information.

Speaker 1

比如文本数据、软件代码(本质上类似文本但具有特定结构),以及人类擅长的其他各种输入形式。

So textual data, you know, software code, which is kind of text y in nature but but very structured in certain ways, but also all the other kinds of modalities of input that humans are sort of fluent in.

Speaker 1

我们天生会阅读,但也会用眼睛看、用耳朵听。

You know, we're naturally you know, we read stuff, but we also see stuff with our eyes and hear stuff with our ears.

Speaker 1

你希望模型能够接收多种形式的信息,并以文本形式输出,或生成音频进行对话,或根据需要生成图像,或用图表注释文本等。

And you want models to be able to sort of take in information in in its many forms and also produce information in text form or maybe generate audio so that you can have a conversation with the the model or produce imagery if that's appropriate or annotate text with graphs or things like that.

Speaker 1

所以我们正在努力打造一个能处理所有输入形式、输出所有形式的单一模型,并在适当场景运用这种能力。

So we're we're really trying to make a single model that can take in all those modalities and produce all those modalities and use that capability when it makes sense.

Speaker 0

好的。

Okay.

Speaker 0

我想更详细地和你聊聊你对Gemini以及多模态模型的整体愿景。

So I wanna talk to you about your vision for Gemini and multimodal models in general in in a lot more detail.

Speaker 0

但我想先回到你的起点,因为我知道你思考AI算法已经很久很久了,比如神经网络。

But I wanna go back to where it all began for you because I know that you've been thinking about AI algorithms for a really, really long time, like neural networks, for example.

Speaker 0

你还记得第一次接触神经网络是什么时候吗?

I wonder, do you remember when you first came across them?

Speaker 1

哦,当然记得。

Oh, yeah, actually.

Speaker 1

神经网络的发展历程很有趣。

So neural networks have had an interesting history.

Speaker 1

要知道,AI是一门相当古老的学科,早期阶段主要是研究如何定义事物运作的规则。

You know, AI is a quite old discipline, and the early phases of AI were about how do we define rules about how things work.

Speaker 1

那大概是五十年代到七十年代的事。

So that was, like, the fifties, sixties, seventies to some extent.

Speaker 1

然后神经网络在七十年代出现,在八十年代末九十年代初掀起过一波热潮。

And then neural networks kind of came along in the seventies and had a wave of excitement in the late eighties and early nineties.

Speaker 1

1990年我在明尼苏达大学读本科时,选修了一门并行处理课程,就是研究如何把问题分解到不同计算机上处理。

And I was actually an undergrad in 1990 at University of Minnesota, And I was taking a class in parallel processing, which is this idea of how do you take problems and break them down into pieces that can be done on different computers.

Speaker 1

然后这些计算机协同工作来解决同一个问题。

And then in conjunction, all those computers work together to solve a single problem.

Speaker 0

我想那时候的计算能力远不如现在吧?

I guess this is, like, the point where the computing power wasn't quite what we have today.

Speaker 0

就像是在研究如何让计算机团队协作?

It's like, how do you make computers work as a team?

Speaker 1

没错。

Right.

Speaker 1

是的。

Yeah.

Speaker 1

当时,神经网络是一种特殊的机器学习和人工智能方法,它用非常粗糙的近似方式来模拟我们认为真实人类或其他生物大脑中神经元的工作方式。

And at that time, neural networks was a sort of particular approach to machine learning and AI that involved very crude approximations of how we think real human or other brains work with neurons.

Speaker 1

这就是为什么它们被称为神经网络——它们由人工神经元组成,这些人工神经元与下方的其他神经元有连接。

So that's why they're called neural networks is they're made up of artificial neurons, and artificial neurons have connections to other neurons below them.

Speaker 1

然后它们会观察从这些人工神经元传来的信号,并判断它们对该特定信号模式的兴趣程度。

And then they look at kind of the signals that come up from those artificial neurons, and they decide how interested are they in that particular pattern of signals.

Speaker 1

接着它们决定是否应该兴奋到足以向神经网络更上层发送信号。

And then they decide should they be excited enough to send a signal further up the neural network.

Speaker 1

这就是一个人工神经元,而神经网络由许多层这样的神经元组成。

And so that's one artificial neuron, and a neural network is made up of lots of layers of lots of these neurons.

Speaker 1

因此更高层的神经元建立在低层神经元的表征基础上。

And so the higher layer neurons build on the representations of the lower neurons.

Speaker 1

例如,如果你正在构建一个用于图像处理的神经网络,最底层的神经元可能会学习诸如'这是一块红色或绿色的斑点'或'这个方向有一个边缘'等特征。

And so if you're, for example, building a neural network for image problems, the lowest neurons and the lowest layer might learn features like, oh, it's a splotch of red or green or there's an edge at this orientation.

Speaker 1

再上一层的神经元可能会学习'这是一边是黄色的边缘'。

And then the next level up might learn, oh, it's an edge with yellow on one side.

Speaker 1

更高层可能会识别出'这看起来像鼻子、耳朵或一张脸'。

And then higher up, it might be, oh, it looks like a nose or ears or a face.

Speaker 1

通过构建这些分层的学习抽象,这些系统实际上可以发展出非常强大的模式识别能力,这就是人们在1985到1990年左右对神经网络感到兴奋的原因。

And so by building these layered learned abstractions, these systems can actually develop very powerful sort of pattern recognition capabilities, and that's what people were excited about about neural networks in kind of 1985, 1990.
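
The layered artificial neuron Jeff describes (weigh the incoming signals, decide how "excited" to be, pass that excitement upward) can be sketched in a few lines of plain Python. This is a purely illustrative toy with hand-picked weights rather than learned ones, and not any Google code:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: take a weighted sum of incoming signals
    and squash it to (0, 1) -- how 'excited' it is to signal upward."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

def layer(inputs, weight_rows, biases):
    """A layer is just many neurons looking at the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two layers: the lower layer reacts to crude patterns in a tiny
# 3-pixel "image"; the upper neuron builds on those representations.
pixels = [1.0, 0.0, 1.0]
hidden = layer(pixels, [[2, -1, 2], [-2, 1, -2]], [-1, 1])
out = neuron(hidden, [3, -3], -1)
print(round(out, 3))
```

Stacking more such layers is what lets real networks go from splotches and edges up to noses, ears, and faces; the difference is scale, and the fact that the weights are learned rather than written by hand.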

Speaker 0

但我们说的是非常非常小的

But we're talking teeny, teeny tiny little

Speaker 1

微型网络。

Teeny networks.

Speaker 1

是的。

Yeah.

Speaker 1

所以它们当时还无法识别人脸、汽车等物体。

So they could not recognize faces and cars and things.

Speaker 1

他们能识别出一些人工生成的模式。

They could recognize a little artificially generated pattern.

Speaker 0

是啊。

Yeah.

Speaker 0

就像你有一个网格,它或许能识别出一个十字形

Like you have a grid, and it can recognize maybe a cross

Speaker 1

对。

Yeah.

Speaker 1

或者别的什么。

Or something.

Speaker 1

或者一个手写数字。

Or a handwritten digit.

Speaker 1

它是

Is it

Speaker 0

数字7还是8?

a seven or an eight?

Speaker 0

哦,那可太酷了。

Oh, that's fancy.

Speaker 1

没错。

Yeah.

Speaker 1

那确实很酷。

That was that was fancy.

Speaker 1

但这基本就是他们当时能做到的程度了。

But that was kind of what they could do at that time.

Speaker 1

但人们很兴奋,因为他们能解决这类问题——那些基于纯粹逻辑规则定义'7'是什么的传统系统,在面对各种潦草手写7时往往表现不佳,而神经网络却能很好泛化。

But people were excited because they could solve those kinds of problems that other systems based on sort of purely logically specified rules of what a seven means weren't actually able to do very well in a way that generalized to, like, all kinds of messy handwritten sevens.

Speaker 1

所以在听完两节关于神经网络的讲座后,我对这个产生了兴趣。

So I was kind of intrigued by that after my two lectures on neural nets.

Speaker 1

于是我决定以神经网络并行训练作为我的毕业论文课题,当时觉得只要算力足够就能解决问题。

And I decided I would do a senior thesis, honors thesis on parallel training of neural networks because I felt like, oh, we just need more compute.

Speaker 1

要是能用系里那台32处理器的机器搭建更大规模的系统,不就能训练更大的神经网络了吗?

What if we use the 32 processor machine in the department and made a, you know, a bigger system that we could train bigger neural nets?

Speaker 1

这就是我花了大概两三个月时间研究的课题。

So that was what I spent, you know, a couple of months or three months on.

Speaker 0

它做到了吗?

Did it work?

Speaker 1

是啊。

Yeah.

Speaker 1

没错。

Yeah.

Speaker 1

总之当时我特别兴奋。

So, anyway, I was very excited.

Speaker 1

我还以为32个处理器能让神经网络性能突飞猛进,结果证明我错了。

I was like, oh, 32 processors is gonna cause neural nets to really, really sing. And turns out I was wrong.

Speaker 1

当年那个天真的本科生啊。

Naive undergrad me.

Speaker 1

要让神经网络在真正重要的问题上表现良好,我们需要的算力是当时的百万倍。

We needed about a million times as much processing power to get them to really start to work well on real problems that you might sort of care about.

Speaker 1

确实。

Yeah.

Speaker 1

但得益于摩尔定律二十年的发展,CPU和计算设备性能大幅提升,我们终于拥有了比那台豪华32处理器机器强百万倍的实用系统。

But then thanks to kind of twenty years of progress of Moore's Law and much faster CPUs and computational devices and stuff, we actually then started to have practical systems that had a million times as much compute as even our fancy 32 processor machine.

Speaker 1

后来当斯坦福教授吴恩达每周来谷歌做咨询时,我在某个茶水间偶遇他,这才重新对神经网络产生了兴趣。

And so I started to kind of get interested in neural nets again when Andrew Ng, who's a Stanford faculty member, was consulting at Google one day a week, and I bumped into him in one of our many micro kitchens.

Speaker 1

我就问他:'你在谷歌做什么项目呢?'

I'm like, oh, what are you doing at Google?

Speaker 1

他说,哦,我还没完全想好,因为我刚开始在这里做咨询工作,但我在斯坦福的一些学生在神经网络方面取得了不错的成果。

He's like, oh, I haven't really figured it out yet because I just started consulting here, but some of my students at Stanford are getting good results on neural networks.

Speaker 1

我说,哦,真的吗?

I'm like, oh, really?

Speaker 1

我们为什么不训练非常非常大的神经网络呢?

Why don't we train really, really big neural networks?

Speaker 1

这就是我们在谷歌开展神经网络工作的起源。

So that was the genesis of our work on neural networks at Google.

Speaker 1

然后我们组建了一个叫谷歌大脑的小团队,开始研究如何利用谷歌的计算资源训练超大规模的神经网络。

And then we formed a a small team called the Google Brain team to start looking at how could we train very large neural networks using Google's computational resources.

Speaker 1

于是我们构建了这套软件基础设施,它能将神经网络描述分解成多个部分,分配给不同计算机——这个并行团队的各个成员,然后让它们按需相互通信,从而解决如何用2000台计算机训练单个神经网络这个整体问题。

And so we we sort of built this software infrastructure that enabled us to take a neural network description and then break it down into pieces that would be done on different computers, different members of this parallel team, and then communicate amongst themselves in ways that they needed to do in order to sort of tackle the overall problem of how do you train a single neural network on 2,000 computers.

Speaker 1

这算是我们最早开发的用于扩展神经网络训练的软件,它让我们能训练比现有神经网络大50到100倍的模型。

And so that was kind of the earliest software we built for really scaling up neural network training, and it it enabled us to train models that were 50 to a 100 times larger than sort of existing neural networks.
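
The partitioning DistBelief did can be caricatured in miniature: shard one layer's neurons across several workers, let each compute only its slice, then stitch the partial outputs back together. A real system adds many machines, communication, and parameter servers; this sketch (function names are invented for illustration) just splits Python lists:

```python
def forward_slice(inputs, weight_rows):
    """One worker computes the outputs of just its share of neurons."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weight_rows]

def distributed_layer(inputs, weight_rows, n_workers):
    """Model parallelism in miniature: shard the layer's neurons
    across workers, then concatenate their partial outputs."""
    shard = (len(weight_rows) + n_workers - 1) // n_workers
    shards = [weight_rows[i:i + shard] for i in range(0, len(weight_rows), shard)]
    outputs = []
    for rows in shards:  # in a real system, each shard runs on its own machine
        outputs.extend(forward_slice(inputs, rows))
    return outputs

weights = [[1, 0], [0, 1], [1, 1], [1, -1]]  # 4 neurons, 2 inputs
x = [2.0, 3.0]
# Sharding must not change the answer, only where it runs:
assert distributed_layer(x, weights, 2) == forward_slice(x, weights)
print(distributed_layer(x, weights, 2))
```

The equality check at the end is the key property: partitioning the computation across workers changes where the arithmetic happens, not what it produces.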

Speaker 0

所以这是在2011年的时候。

So this this is back in 2011.

Speaker 0

对吧?

Right?

Speaker 1

是的。

Yeah.

Speaker 1

大概是在2012年初。

This is, like, early two thousand twelve.

Speaker 0

所以这是在图像识别重大突破之前。

So this is, like, before the big breakthroughs in image recognition.

Speaker 1

没错。

Yes.

Speaker 0

这已经是很久以前的事了。

This is, like, way back.

Speaker 0

从很多方面来说,你当时做的事情和之前一样,就是在某种程度上把计算机拼接起来。

And in many ways, you were doing then the same thing that you were doing previously of just, like, kind of stitching computers together.

Speaker 1

这就像我的本科论文。

It's like my undergrad thesis.

Speaker 1

没错。

Exactly.

Speaker 1

不过再说一次。

But again.

Speaker 1

比如说,嘿。

Like, hey.

Speaker 1

我们可以

We could

Speaker 0

再做一次,但是规模要大得多。

do that again, but at big scale.

Speaker 0

而这一次

And yet this time

Speaker 1

这次确实成功了,因为计算机速度更快,我们用了更多的计算机。

This time, it actually worked because the computers were faster, we used a lot more of them.

Speaker 0

不过在2011年的时候,这感觉有点像一场赌博吗?

Did it feel like a bit of a gamble, though, back in 2011?

Speaker 1

哦,是的。

Oh, yeah.

Speaker 1

我们为训练这些神经网络并尝试各种拆分方法而构建的系统,我给它取名为"DistBelief":一方面是因为人们并不认为它真能奏效(谐音disbelief,意为"怀疑"),另一方面它是一个分布式(distributed)系统,而且除了神经网络,我们还想用它训练信念网络(belief networks)。

The system that we built for training these neural networks and trying different ways of breaking them apart, I actually named DistBelief, partly because people didn't think it was really gonna work, and partly because it was a distributed system ('Dist'), and one of the things we wanted to train, in addition to neural networks, was belief networks.

Speaker 0

哦,我喜欢这个名字。

Oh, I love it.

Speaker 0

我很喜欢。

I love it.

Speaker 1

所以DistBelief

So DistBelief

Speaker 0

就是"难以置信"。

was disbelief.

Speaker 0

太神奇了

Amazing.

Speaker 0

就在这一切发生的同时,没错

And so while this was going on Yep.

Speaker 0

与此同时,在大西洋的这一边,DeepMind刚刚起步。

Meanwhile, on this side of the Atlantic were the beginnings of DeepMind.

Speaker 0

是的

Yes.

Speaker 0

我知道你是被派去考察他们的人

And I know that you were you were the person tasked with coming over and checking them out.

Speaker 0

对吧?

Right?

Speaker 0

没错

Yeah.

Speaker 0

给我讲讲那段经历

Tell me about that story.

Speaker 1

是啊

Yeah.

Speaker 1

其实,杰弗里·辛顿——这位著名的机器学习研究者,2011年夏天曾在谷歌待过一段时间

So, actually, Geoffrey Hinton, who's a very well known machine learning researcher, spent a summer at Google in 2011, I think.

Speaker 1

我们当时不知道该怎么给他归类,最后就把他算作实习生。

And we couldn't figure out how to classify him, so he got classified as an intern.

Speaker 1

这个小插曲还挺有趣的

That was just a little funny.

Speaker 0

堪称史上最资深的实习生。

The most senior intern in the entire history of the world.

Speaker 1

于是我和他一起共事。

And so he and I were working together.

Speaker 1

后来我们不知怎么听说了DeepMind。

And then somehow we found out about DeepMind.

Speaker 1

我想杰弗里对这家公司的创立背景略知一二,其他一些人也提到过。

I think Geoffrey had known a little bit about sort of the formation of the company, and some other people also said, oh, yeah.

Speaker 1

有家英国公司在做这个

There's this company over here in The UK doing

Speaker 0

当时规模还很小。

this tiny at this time.

Speaker 0

我是说,就像

I mean, like

Speaker 1

是的。

Yeah.

Speaker 1

大概四五十人左右吧。

Probably 40 or 50 people or something.

Speaker 1

于是我们公司决定去考察他们是否值得收购。

And so we decided as a company that we would go check them out as a potential acquisition.

Speaker 1

当时我在加州。

And so I was in California.

Speaker 1

杰弗里在多伦多,那时他还是大学教授。

Geoffrey was back in Toronto, where he was a faculty member at that time.

Speaker 1

而且杰弗里腰不好,没法坐民航飞机。

And Geoffrey has a bad back, so he can't actually fly commercially.

Speaker 1

因为他没法坐下。

Because he can't sit down.

Speaker 1

他只能躺下或站起来。

He can only lay down or stand up.

Speaker 1

航空公司不喜欢你在起飞时站起来。

Airlines don't like it when you stand up during takeoff.

Speaker 1

所以我们得想个办法,就是在私人飞机上准备一张医疗床。

So we had to figure out a solution, which was to get a medical bed in a private plane.

Speaker 1

天啊。

Oh my gosh.

Speaker 1

然后我们一群人在加州起飞,飞到多伦多,从停机坪接上杰弗里,把他放在医疗床上,然后我们一起飞往英国。

And so a bunch of us took off from California, flew to Toronto, scooped Geoffrey up from the tarmac, put him in the medical bed, and then we all flew to the UK.

Speaker 1

我们坐上一辆大面包车,整队人马去参观DeepMind,我记得是在罗素广场附近。

We all got in a big van, and we trooped off to to visit DeepMind, which was near Russell Square, I think.

Speaker 1

我们因为前一晚的飞行已经很累了,结果又连续听了十三场各二十分钟的讲座,介绍他们正在做的各种项目。

And we were really tired from flying the previous night, but then we got, like, thirteen straight twenty-minute lectures in a row on all the different things they were doing.

Speaker 0

什么?

What?

Speaker 0

是DeepMind团队的人讲的吗?

From the team at DeepMind?

Speaker 1

就是那个团队。

From the team.

Speaker 0

哦,哇。

Oh, wow.

Speaker 1

所以我们看到了一些工作

So we saw some work some

Speaker 0

在倒时差的情况下。

While jet lagged.

Speaker 0

在倒时差的情况下。

While jet lagged.

Speaker 0

这简直像是情景喜剧里的情节。

It's like something from a sitcom.

Speaker 1

你知道吗?

You know?

Speaker 1

是啊。

Yeah.

Speaker 1

没错。

Exactly.

Speaker 1

然后我们看了一些演示,就是他们正在做的雅达利相关研究,后来这些成果发表了,关于如何用强化学习来玩雅达利2600上的老游戏。

And then so we got some presentations on, you know, some of the Atari work they were doing, which sort of was published later, on how do you use reinforcement learning to learn to play old Atari 2600 games.

Speaker 1

比如《打砖块》或《乒乓球》,还有其他各种游戏,非常有意思。

So games like Breakout or Pong or, you know, various other ones, which was quite interesting.

Speaker 1

然后

And then

Speaker 0

因为你们那时候还没开始研究强化学习。

because you guys hadn't been doing the reinforcement learning at that time.

Speaker 1

对。

Right.

Speaker 1

我们主要在研究如何扩展大规模监督学习和无监督学习。

We'd mostly been focusing on how do you scale up large scale supervised and unsupervised learning.

Speaker 0

是啊。

Yeah.

Speaker 0

而强化学习更多是由奖励机制驱动的。

And And the reinforcement learning is much more motivated by rewards.

Speaker 1

没错。

Yeah.

Speaker 1

所以我认为所有这些技术都很有用,而且它们经常需要组合使用。

So I think all of the these techniques are really useful, and they're often useful in combination.

Speaker 1

强化学习可以这样理解:有一个智能体在环境中运作,每一步都有多种可能的行动或选择。

So reinforcement learning, you should think of as you have some agent operating in an environment, and they're at every step, there's a bunch of different moves you could make or actions you could take.

Speaker 1

以围棋为例,你可以在众多不同位置落子。

So in the game of Go, for example, you know, you could play a stone in any of, a whole bunch of different positions.

Speaker 1

在雅达利游戏中,你可以上下左右移动摇杆,或按下左右按钮。

In Atari, you could move your joystick up, down, or left, or right, or you could push the left or right button.

Speaker 1

这类情境中,通常不会立即获得奖励反馈。

And often in these situations, you don't get an immediate reward.

Speaker 1

比如下围棋时,落子后要等整局结束才能知道这步棋的好坏。

So in Go, you play a stone, and you don't know if that's a good idea or not until the whole process of the rest of the game plays out.

Speaker 1

强化学习的有趣之处在于,它能处理长序列动作,并根据动作的意外程度按比例分配正负奖励。

And one of the interesting things about reinforcement learning is it's able to take kind of long sequences of actions and then attribute rewards or negative rewards to the sequence of actions that you took in proportion to how unexpected it was based on when you made that move.

Speaker 1

你觉得这是个好策略吗?

Did you think it was a good idea?

Speaker 1

如果你最终赢了,可能就该稍微增强'这是好策略'的信念。

And then you you won, so maybe you should increase your idea that that was a good idea a little bit.

Speaker 1

如果输了,就该稍微减弱'这是好策略'的想法。

Or maybe you lost, and you should decrease your sense that that was a good idea a little bit.

Speaker 1

这就是强化学习的核心理念,它在难以及时判断决策优劣的环境中特别有效。

And so that's kind of the main idea behind reinforcement learning, and it's a quite effective technique, especially in environments where it's very unclear immediately whether that was a good idea.
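
The credit-assignment idea described here, where you play out the whole episode and only then nudge every move you took up or down according to the final result, is the core of policy-gradient methods such as REINFORCE. Here is a toy version with a plain preference table; the two-move "game", learning rate, and update rule are illustrative assumptions, not the Atari setup:

```python
import math

def softmax(prefs):
    """Turn preference scores into move probabilities."""
    exps = [math.exp(p) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

def update(prefs, moves_taken, reward, lr=0.5):
    """After the game ends, credit every move in the episode:
    raise preferences for moves that led to a win (reward +1),
    lower them after a loss (reward -1), scaled by how unexpected
    the move was (1 minus its probability at the time)."""
    probs = softmax(prefs)
    for m in moves_taken:
        prefs[m] += lr * reward * (1 - probs[m])
    return prefs

prefs = [0.0, 0.0]                                    # two moves, initially equal
prefs = update(prefs, moves_taken=[0, 0], reward=+1)  # move 0 led to a win
prefs = update(prefs, moves_taken=[1], reward=-1)     # move 1 led to a loss
probs = softmax(prefs)
print(probs[0] > probs[1])  # move 0 is now preferred
```

Real systems replace the table with a neural network and discount rewards over time, but the shape of the update is the same: win, and the moves you made become a bit more likely; lose, and they become a bit less likely.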

Speaker 1

嗯。

Yeah.

Speaker 1

相比之下,监督学习是有明确输入和标准答案输出的。

In contrast, supervised learning is where you have an input and you have kind of a ground truth output.

Speaker 1

典型例子是带标签的图像集,每张图都被标记为某个类别(比如猫)。

So the classic example is you have a bunch of images, and each image has been labeled with one of a bunch of categories.

Speaker 1

比如说有张图片标着'汽车'。

So, like, there's an image, car.

Speaker 1

还有一张图片,鸵鸟。

There's another image, ostrich.

Speaker 1

另一张图片,石榴。

Another image, pomegranate.

Speaker 1

如果你有丰富的类别集合。

If you have a rich set of categories.

Speaker 1

确实如此。

Exactly.
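
Supervised learning in its smallest form is exactly this pairing of inputs with ground-truth outputs. A one-nearest-neighbour classifier makes the point without any training loop; the two-number "feature vectors" standing in for images here are invented for illustration:

```python
def predict(labeled_examples, query):
    """1-nearest-neighbour: return the label of the closest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(labeled_examples, key=lambda ex: dist(ex[0], query))
    return best[1]

# (features, ground-truth label) pairs -- the 'image, category' setup
training = [([0.9, 0.1], "cat"),
            ([0.2, 0.8], "ostrich"),
            ([0.5, 0.5], "pomegranate")]
print(predict(training, [0.85, 0.2]))  # closest to the "cat" example
```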

Speaker 0

告诉我,当你在DeepMind这里决定进行收购时,德米斯紧张吗?

Tell me, when you were here at DeepMind and you decided that you would do the acquisition, was Demis nervous?

Speaker 1

我不知道他是否紧张。

I don't know if he was nervous.

Speaker 1

我是说,我记得我说过,我看过所有这些精彩的演示,但我能看看一小段代码吗?

I mean, I think I said, well, I've seen all these nice presentations, but can I look at a little bit of the code?

Speaker 1

因为我想确认背后有真实的代码,看看编码标准是怎样的,比如人们是否会写注释之类的。

And so because I wanted to make sure there was, like, real code behind it and see, like, what the coding standards were like, that people actually write comments, that kind of thing.

Speaker 1

所以德米斯对此有点不确定。

So Demis, kind of, like, was a little unsure about this.

Speaker 1

我说,哦,不必是超级机密的代码。

I said, oh, it doesn't have to be, like, super secret code.

Speaker 1

随便挑一小段代码给我看看就行。

Just pick some little bit of code and show it to me.

Speaker 1

于是我和一位工程师进了办公室,大概坐了十分钟。

And so I went into an office with one of the engineers, and we kind of sat down for ten minutes.

Speaker 1

然后我说,好吧。

And, like, I said, okay.

Speaker 1

这段代码是做什么的?

What does this code do?

Speaker 1

呃,哦,好吧。

And, like, oh, okay.

Speaker 1

那个东西是做什么用的?

That thing, what does that do?

Speaker 1

你可以在哪里给我展示它的实现?

And where can you show me the implementation of that?

Speaker 1

我出来时很满意。

And I came out satisfied.

Speaker 1

它整洁有序。

It was neat and tidy.

Speaker 1

它算是相当整洁有序。

It it was reasonably neat and tidy.

Speaker 1

我是说,对于一家想要快速行动的小公司来说,那代码有点研究性质,但你知道,它显然很有趣而且文档齐全。

I mean, for a small company trying to move quickly, it was kind of research y code, but it was, you know, clearly interesting and and well documented.

Speaker 0

我听说你写代码时会加个'LGTM'的小标记。

I I've heard that when you do your code, you you put a little thing which is LGTM.

Speaker 1

哦,是的。

Oh, yeah.

Speaker 1

我觉得看起来不错。

It looks good to me.

Speaker 1

对。

Yeah.

Speaker 1

我现在在现实生活中也用这个说法,不只是代码审查时。

I use that in real life too now, like, not just for code reviews.

Speaker 0

好的。

Okay.

Speaker 0

那么在这些演示中,你还记得当时的印象是什么吗?

So in these presentations then, can you remember what your impression was?

Speaker 1

是的。

Yeah.

Speaker 1

我是说,他们看起来在做非常有趣的工作,特别是在强化学习方面。

I mean, it seemed like they were doing really interesting work, particularly in the reinforcement learning side.

Speaker 1

我们专注于规模化,所以当时训练的模型比DeepMind正在研究的要大得多。

We were focused on scaling, so we were training models that were much, much bigger than the ones DeepMind was playing with at the time.

Speaker 1

但他们正在学习使用强化学习来解决游戏玩法问题,这对强化学习来说是个理想的环境。

But they were learning to use reinforcement learning to to sort of solve kind of gameplay, which is a nice clean environment for reinforcement learning.

Speaker 1

不过看起来,强化学习加上我们一直在做的规模化工作会是个很好的组合。

But it seemed like the combination of reinforcement learning plus a lot of the scaling work that we we had been working on would be a really good one.

Speaker 0

因为这就像是从两个不同方向解决问题——用强化学习从非常小的玩具模型开始构建,

Because it's like, I guess, you're sort of approaching a problem from two different directions, like really tiny with reinforcement learning, really, really small, like, toy models and building up.

Speaker 0

同时你们又在非常庞大的规模上建立了这种'深刻理解'(加引号),而将两者结合就能产生强大力量。

And then you're kind of at this very, very big scale with this sort of rich 'understanding', in inverted commas, but then it's sort of putting the two together where things become really powerful.

Speaker 1

对。

Yeah.

Speaker 1

没错。

Yeah.

Speaker 1

确实如此。

Indeed.

Speaker 1

这也是去年我们合并传统DeepMind、传统Brain团队和谷歌研究其他部门的主要动机,最终决定整合成Google DeepMind。

And that was kind of a lot of the motivation behind the combination we did last year of legacy DeepMind and legacy Brain and other parts of Google Research: we decided we would just combine the units together and form Google DeepMind.

Speaker 0

嗯。

Yeah.

Speaker 0

而Gemini作为

And Gemini as the

Speaker 1

Gemini其实在合并构想之前就存在了,但更像是...

And Gemini, which actually predated the idea of combining, but really was like, hey.

Speaker 1

我们确实应该共同解决这些问题,因为大家都在朝着训练高质量、大规模多模态模型的总体方向探索。

We should really all work together on these problems because we're all sniffing around the same kind of general direction of trying to train really high quality, large scale multimodal models.

Speaker 1

分散我们的想法、不合作以及分散计算资源等做法是没有意义的。

And it doesn't make sense to fragment our ideas and not work together and fragment our compute resources and so on.

Speaker 1

我们确实应该整合所有资源,组建联合团队来解决这个问题,而我们也正是这么做的。

We should really just put all this together, build a combined team to go after this problem, and and that's what we did.

Speaker 0

那么为什么叫Gemini(双子座)呢?

So so why Gemini?

Speaker 1

其实这个名字是我起的。

I actually named it.

Speaker 0

是你起的?

Did you?

Speaker 1

是的。

Yeah.

Speaker 0

我是说,你追求的是...

I mean, you're about you're after the

Speaker 1

我确实喜欢给事物命名。

I I do like naming things.

Speaker 1

这很有趣。

It's fun.

Speaker 1

Gemini与双子(双胞胎)相关,我觉得这是个好名字,象征着原DeepMind和原Brain这对"双子"走到一起,携手开展一个雄心勃勃的多模态项目。

So Gemini relates to twins, and I felt it was a good name for the twins of legacy DeepMind and legacy Brain coming together to really start working together on an ambitious multimodal project.

Speaker 0

我猜Gemini也让我联想到太空任务。

I guess also Gemini, I'm just thinking of the space missions.

Speaker 0

没错。

Yeah.

Speaker 0

就像是阿波罗计划的前身。

It's like a precursor to Apollo.

Speaker 1

名字有多重含义是件好事,这也是选择这个名字的另一个原因。

A good thing about a name that has multiple meanings is so that that was another reason to pick the name.

Speaker 1

某种程度上说,这是雄心勃勃的太空计划进展的前奏。

It's sort of the precursor to, you know, ambitious space program progress.

Speaker 0

所以我想谈谈多模态技术。

So I wanna come on to the multimodal stuff.

Speaker 0

在开始之前,我认为公众对聊天机器人和大语言模型认知发生重大转变的主要原因之一,部分要归功于谷歌大脑团队提出的Transformer研究成果。

Just before I do, I guess one of the big reasons why this big change has happened in the sort of public consciousness of chatbots and large language models is in part because of some work that came out of Google Brain with transformers.

Speaker 0

请原谅我用了这个双关语。

If you'll sort of forgive the pun.

Speaker 0

是的。

Yeah.

Speaker 0

你能给我们讲讲这项Transformer研究及其革命性影响吗?

Can can you tell us a little bit about that transformer work and how transformative it's been?

Speaker 1

当然。

Sure.

Speaker 1

没错。

Yeah.

Speaker 1

事实证明,语言领域和许多其他领域需要解决的很多问题本质上都是序列问题。

So it turns out a lot of the problems you wanna deal with in language and in a bunch of other domains are problems of sequences.

Speaker 1

比如Gmail的自动补全功能,当你输入句子时,系统能否通过补全你的句子或想法来帮助你?

So if you think about autocomplete in Gmail, you're typing a sentence, and can the system help you by finishing your sentence or your thought for you?

Speaker 1

这很大程度上依赖于观察序列的部分内容,然后预测剩余部分。

A lot of that relies on seeing part of a sequence and then predicting the rest of it.

Speaker 1

本质上,这正是这些大语言模型被训练要实现的功能。

And, essentially, that's what these large language models are trained to do.

Speaker 1

它们被训练成每次接收一个单词或单词片段,然后预测接下来会出现什么内容。

They're trained to take in data, one word or one piece of a word at a time, and then predict what is the next thing that will follow.

Speaker 0

就像高级版的自动补全。

Like fancy autocomplete.

Speaker 1

就像高级的自动补全功能。

Like fancy autocomplete.

Speaker 1

是啊。

Yeah.
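The "fancy autocomplete" loop being described can be sketched with a toy next-word predictor. This is purely illustrative: a bigram count model over an invented mini-corpus, nothing like the scale or architecture of the models discussed, but the predict-the-next-token loop has the same shape:

```python
# Toy "fancy autocomplete": count which word follows which in a tiny
# corpus, then repeatedly predict the most frequent next word.
from collections import Counter, defaultdict

corpus = "the cow eats grass . the cow eats grass . the goat drinks milk .".split()

# Count how often each word follows each other word (a bigram model).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def autocomplete(start, n=3):
    """Greedily extend `start` with the most frequent next word, n times."""
    words = [start]
    for _ in range(n):
        candidates = bigrams[words[-1]].most_common(1)
        if not candidates:
            break
        words.append(candidates[0][0])
    return " ".join(words)

print(autocomplete("the"))  # -> "the cow eats grass"
```

A large language model replaces the bigram counts with a learned neural network over tokens (words or pieces of words), but the training objective Jeff describes is the same: predict what follows.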

Speaker 1

结果证明它很有用。

It turns out to be useful.

Speaker 1

你也可以用这种方式建模很多不同的问题。

You can model a lot of different problems this way as well.

Speaker 1

比如翻译,你可以建模为输入一个句子的英文版本,然后训练模型在有足够英法句子对的情况下,输出该句子的法文版本,作为一种似然序列的训练。

So translation, you can model as taking in the English version of a sentence and then training the model to then output the French version of the sentence when you have enough English French sentence pairs to sort of train as a likelihood sequence.

Speaker 1

你也可以将其用于医疗场景,比如当你试图预测面前的患者——他们报告了这些症状,又有这些化验结果,过去还出现过这些情况时。

You can also use this in health care settings, like if you're trying to predict a patient in front of you who is reporting these symptoms and they have these lab test results, and in the past, they've had these things.

Speaker 1

你可以把整个过程建模为一个序列,然后根据其他脱敏的序列化训练数据,预测出最可能的合理诊断。

You know, you can model that whole thing as a sequence, and then you can predict what are the likely diagnoses that would make sense if you have other de identified data that you can train on that has also kind of been organized in these sequences.

Speaker 1

具体做法是隐藏序列的剩余部分,强制模型尝试预测接下来会发生什么。

And the way you can do that is you just hide the rest of the sequence, and you force the model to try to predict what happens next.

Speaker 1

最有趣的是它的普适性——无论是语言翻译、医疗场景、DNA序列还是各种领域都能应用。

It's quite an interesting thing that it's so applicable to, you know, language translation, health care settings, DNA sequences, all kinds of things.

Speaker 0

但这关键在于你当前时刻所关注的片段。

But that it's about the bit that you're paying attention to at any point in time.

Speaker 1

没错。

Yeah.

Speaker 1

在Transformer架构之前成功的模型是所谓的循环模型,它们具有某种内部状态。

So the models that were successful prior to the transformer architecture were what are called recurrent models, where they have some internal state.

Speaker 1

每次看到单词时,它们会通过处理更新内部状态。

And every time they see a word, they do some processing to update their internal state.

Speaker 1

然后继续处理下一个单词。

And then they go on to the next word.

Speaker 1

然后他们又这样做了。

And then they do that again.

Speaker 1

现在他们稍微推进状态,并根据刚看到的下一个词更新状态。

So now they move their state forward a bit and update the state with respect to the next word that they just saw.

Speaker 1

所以你可以想象这就像一个12个词的句子。

And so that you can kind of imagine this as, like, a 12 word sentence.

Speaker 1

你要进行12次状态更新,但每一步都依赖于前一步。

You're doing that updating of the state 12 times, but every step is dependent on the previous one.

Speaker 1

这意味着实际上很难让它快速运行,因为你存在所谓的顺序依赖关系:第七步依赖第六步,第六步依赖第五步,以此类推。

And so that means it's actually quite hard to get it to run fast because you have this what's called a sequential dependency where step seven depends on step six, step six depends on step five, and so on.

Speaker 1

因此谷歌研究院的一组研究人员提出了一个相当有趣的想法:与其每次只更新一个状态,不如一次性处理所有单词,并记住处理每个单词时获得的状态。

So one of the things a collection of researchers within Google Research did is they came up with a pretty interesting idea, which is instead of just having a single state that we update at every word, let's process all the words all at once, and let's remember the state that we get when we're processing every word.

Speaker 1

然后当我们尝试预测新词时,可以关注所有先前的状态,并学会如何关注重要部分。

And then when we're trying to predict a new word, let's pay attention to all of the previous states and figure out how to learn to pay attention to the important parts.

Speaker 1

这就是Transformer中用于预测下一个词的学习注意力机制。

That's the learned attention mechanism in transformer in order to predict the next word.

Speaker 1

对于某些词,你可能需要高度关注前一个词。

And for some words, you might need to pay attention to the previous word a lot.

Speaker 1

在某些语境下,稍微关注上下文中的多个词非常重要。

For some context, it's very important to pay attention kind of a little bit to a lot of the words in the context.

Speaker 1

但关键在于这确实可以并行完成。

But the important thing about that is it really can be done in parallel.

Speaker 1

你可以输入一千个单词,并行计算每个单词的状态,这使得它在扩展性和性能上比之前的循环模型高效10到100倍。

You can take in a thousand words, compute the state for each one of them in parallel, and then that makes it 10 to 100 times more efficient in terms of scaling and performance than the previous recurrent models.
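The parallel update being described, where every position attends to every stored state instead of waiting on the previous step, can be sketched in plain Python. This is an unlearned, single-head sketch (the raw states serve as queries, keys, and values; a real transformer uses learned projection matrices):

```python
import math

def self_attention(states):
    """states: one vector per word. Each output row is a softmax-weighted
    mix of ALL input states. Rows are independent of each other, so unlike
    an RNN's step-7-depends-on-step-6 chain, they can all be computed at once."""
    d = len(states[0])

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    out = []
    for q in states:                                         # every position...
        scores = [dot(q, k) / math.sqrt(d) for k in states]  # ...looks at every state
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]                      # softmax attention weights
        out.append([sum(w * v[j] for w, v in zip(weights, states))
                    for j in range(d)])
    return out

# A "12 word sentence" of toy 4-dimensional states.
states = [[(i * j) % 3 - 1.0 for j in range(4)] for i in range(12)]
mixed = self_attention(states)
print(len(mixed), len(mixed[0]))  # 12 positions, 4 dimensions each
```

Because no output row depends on another output row, all twelve can be batched into one matrix multiply on an accelerator, which is where the efficiency win over recurrent models comes from.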

Speaker 1

这就是为什么这是一个重大突破。

And so that was why that was such a big advance.

Speaker 0

但我想这还会衍生出其他东西。

But then I guess there's other things that seem to emerge from this.

Speaker 0

我是说,某种概念性的理解,或者说仅通过序列和语言本身就能实现的抽象。

I mean, sort of a conceptual understanding or maybe sort of abstraction that's that's possible just through sequencing and and language alone.

Speaker 0

我是说,那很意外吗?

I mean, that was that a surprise?

Speaker 1

是啊。

Yeah.

Speaker 1

我是说,我认为我们在Google Brain团队最早做的一些语言建模工作,真正关注的是如何将单词建模为高维向量,而非其表面形式(比如h e l l o或c o w),这些向量能体现单词的使用方式。

I mean, I think some of the earliest work we did on language modeling in the the Google Brain team was really about modeling words not as their surface form of, like, h e l l o or c o w, but really about a high dimensional vector that represents kind of the way in which that word is used.

Speaker 1

你知道,人类习惯在二维和三维空间思考。

You know, we're used to thinking in two and three dimensions as humans.

Speaker 1

但当你有100维或1000维时,千维空间里就有很大余地。

But when you have a 100 dimensions or a thousand dimensions, there's a lot of room in a thousand dimensional space.

Speaker 1

但当你让某些事物在空间中邻近——比如通过训练模型使牛、羊、山羊和猪彼此靠近,同时远离咖啡机——

But when you have things that are nearby and you've trained the model in such a way that, you know, cow and sheep and goat and pig are all near each other, and they're very far apart from espresso machines.

Speaker 0

虽然牛奶可能介于两者之间。

Although milk could be in between the two.

Speaker 1

牛奶可能更靠近牛,但某种程度上介于两者之间。

Milk milk would probably be nearer the cow, but kind of in between the two.

Speaker 1

对。

Yeah.

Speaker 1

它们可能会大致位于100维空间中两者连线的某处。

They would probably be kind of on that line in the 100 dimensional space.

Speaker 1

所以我认为,这些模型之所以具有惊人的强大能力,是因为它们用如此多的高维度表示事物,能同时捕捉词语、句子或段落的多个层面——毕竟它们的表示空间如此广阔。

So this is kind of why these models have surprisingly powerful capabilities, I think: because they're representing things with so many high dimensions, they can actually latch on to many different facets of a word or a sentence or a paragraph simultaneously, because there's so much room in their representation.

Speaker 0

某种程度上它提取了我们赋予语言的基础意义吧。

It sort of extracted the the grounding that we ourselves have given language, I guess.

Speaker 1

没错。

Yeah.

Speaker 1

我是说,当我们听到一个词时,我们不仅仅想到这个词的表面形式。

I mean, when we hear a word, we don't just think of the surface form of the word.

Speaker 1

我们想到的是牛。

We think cow.

Speaker 1

哦,这会触发一系列其他联想,比如牛奶、咖啡机,还有挤奶、小牛和公牛之类的。

Oh, that triggers a bunch of other things like milk or espresso machine or, you know, milking and calf and bull.

Speaker 1

我们发现这些早期词汇表征的一个特点是:方向具有意义。

And one of the things we found with those early word representations was that directions had meaning.

Speaker 1

比如你想想动词的现在时态,像'walk',在这个100维空间里,从'walk'到'walked'的方向,与从'run'到'ran'的方向一致,也与从'read'(现在式)到'read'(过去式)的方向一致。

So if you think about present tenses of verbs like walk, you would go in the same direction in this 100 dimensional space to get from walk to walked as you would to get from run to ran, or to go from read to its past tense read.

Speaker 1

哇。

Wow.

Speaker 0

是啊。

Yeah.

Speaker 0

所以它其实是能'理解'的……理解。

So it actually understands... understands.

Speaker 0

我总是不自觉地用这个词,但并非刻意为之。

I keep using that word, and I don't mean it.

Speaker 0

但这些模型的结构里确实存在时态的某种表征。

But there is, like, some representation of tenses within the structure of these models. Yeah.

Speaker 1

这只是训练过程中自然涌现的特性。

And it's it's just emerged from the training process.

Speaker 1

这不是我们刻意设计的。

It didn't it's not something we told it to do.

Speaker 1

这只是我们使用的训练算法,加上语言中特定形式有大量使用方式这一事实,促使它涌现出来。

It's just the training algorithm we used and the fact that language has lots of ways in which particular forms are used that cause that to emerge.

Speaker 1

你还可以实现词汇的性别转换,比如从阳性形式变成阴性形式,反之亦然。

And you could also, for example, change from male to female versions of words and vice versa.

Speaker 1

所以母牛到公牛的关系,就像皇后到国王、女人到男人等等,是同一个方向。

So cow to bull is the same direction as queen to king or woman to man and so on.
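The "directions have meaning" idea can be shown with a toy nearest-neighbour check. The four 3-dimensional vectors below are hand-made for the sketch (real embeddings have hundreds of learned dimensions, and these values are invented), but the offset arithmetic is the same trick:

```python
# Toy illustration of meaningful directions in an embedding space.
emb = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def nearest(vec):
    """Word whose embedding is closest (squared distance) to vec."""
    return min(emb, key=lambda w: sum((a - b) ** 2 for a, b in zip(emb[w], vec)))

# The man -> woman offset is the "gender direction"; adding it to "king"
# lands on "queen", mirroring the cow -> bull or walk -> walked analogies.
offset = [w - m for w, m in zip(emb["woman"], emb["man"])]
print(nearest([k + o for k, o in zip(emb["king"], offset)]))  # -> queen
```

In a trained model the same query is run against tens of thousands of learned word vectors, but the arithmetic, offset plus nearest neighbour, is identical.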

Speaker 0

太神奇了。

Amazing.

Speaker 0

但这只是我们讨论的语言层面。

But this is just with language that we're talking about here.

Speaker 0

好吧。

So okay.

Speaker 0

那么多模态方面是如何改变这一点的?

Tell me how does the multimodal aspect of this change?

Speaker 0

具体来说,它有什么不同之处?

What is it? How does it make it different?

Speaker 1

是的。

Yeah.

Speaker 1

因为你仍然在高维空间中表示输入数据,关键是如何从图像的像素转换到某种表示——理想情况下,你希望多模态模型拥有与我们相似的机制。

Because you're still representing the input data in these high dimensional spaces, and it's really a matter of how do you get from the pixels of an image, say, into something where ideally you'd like the multimodal model to have the same kind of thing that we have.

Speaker 1

当我们看到一头牛时,这会在大脑中触发与读到'牛'这个词或听到牛叫时相似的神经活动。

When we see a cow, that triggers kind of similar activations in our brain to reading the word cow or to hearing a cow moo.

Speaker 1

你希望训练模型使其具有这种联合意义和表征,无论它们是如何获取这些输入数据的。

And you kinda wanna train models so that they have that joint meaning and representation regardless of the way they arrived at that input data.

Speaker 1

所以如果模型看到一个牛走过田野的视频,这应该会触发模型中与之相关的一系列反应,基于模型通过深度分层结构建立的激活模式。

So if they see a video of a cow walking through a field, that should trigger a whole bunch of things that are related to that in the model, based on the activations that the model has built up. Typically, these are very deep, layered models.

Speaker 1

通常最底层具有非常简单的表征,而模型更高层则在这些表征基础上构建更复杂有趣的特征组合,无论是文字、图像还是其他形式的表征。

And so the lowest layers typically have very simple representations, and then the higher layers in the model build on those representations and build more interesting and complex combinations of features and and representations of be it words or images or

Speaker 0

所以你说从底层开始就是多模态的,对吧,

So when you're saying multimodal from the ground up, right,

Speaker 1

也就是

which is

Speaker 0

关于Gemini你常听到的一种说法是,并不是说这边有个词汇区,那边有个像素区,然后你在两者之间进行翻译。

kind of a big phrase that you hear about Gemini, it's not that you've got, like, the word section over here, pixel section over here, and you're translating between one and the other.

Speaker 0

对。

Right.

Speaker 0

但在模型内部,这些表征是

But, like, in the model itself, those representations are

Speaker 1

是的。

Yeah.

Speaker 1

在模型非常早期的阶段。

Very early in the model.

Speaker 0

这是否会让初期搭建时变得更困难?

Does that make it harder at the beginning when you're setting it up?

Speaker 0

会让操作变得更复杂吗?

Does it make it more difficult to do?

Speaker 1

没错。

Yeah.

Speaker 1

我认为,弄清楚如何将不同模态整合到模型中,以及如何训练多模态模型,确实比单纯的纯语言或纯字符模型要复杂得多。

I mean, I think figuring out how to integrate different modalities into the model and, you know, how should you train a multimodal model is more complex than a simpler pure language or pure character based model.

Speaker 1

但你会从中获得很多好处,比如有时会出现跨模态迁移——看到关于牛的视觉内容实际上能辅助语言理解。

But you get a lot of benefits from it in that you get sometimes cross modal transfer where now seeing visual stuff about cows actually helps inform the language.

Speaker 1

比如你可能读过很多关于草地上奶牛的描述,但现在它突然看到了相关图片和视频。

You know, maybe you'd seen a bunch of descriptions of cows in meadows or something, but now it suddenly has seen images of that and videos of that.

Speaker 1

它实际上能以某种方式将这些表征结合起来,使得模型内部对'牛'这个概念的触发机制变得统一——无论你看到的是'牛'这个词还是牛的图像。

And it's actually able to bring those representations together in a way that makes similar things trigger inside the the model regardless of whether you saw the word cow or the image of cow.

Speaker 0

能举个未来应用场景的例子吗?

Give me an example of the type of situation you you see this being useful in in the future.

Speaker 1

其实现在已经很有用了,这很好。

Well, I think it's already useful, which is good.

Speaker 1

举个例子,你希望能够输入一张手写白板上的数学解题过程图片,然后判断学生这道题做对了吗?

I mean, as one example, you wanna be able to take in an image of a handwritten, worked-out math problem on a whiteboard and say, did the student get this problem right?

Speaker 1

所以现在你需要在一个例子中真正展现多模态能力。

And so now you need to really bring in the multimodal capabilities in one example.

Speaker 1

实际上你需要进行手写识别,并从中理解内容,明白吗?

You need to actually, you know, do handwriting recognition, understand from that, okay.

Speaker 1

这是某人写在白板上的物理题,在早期的Gemini技术报告中可能还配有一张滑雪者下坡的图片。

It's a physics problem that someone's written on the board, and it's got maybe a picture of a skier going down a slope in one of the early Gemini tech reports.

Speaker 1

我们有个很好的例子:一个学生在白板上解题后,你可以直接问Gemini'这个学生做对了吗?'

We have this good example of a student who'd worked out a problem on a whiteboard, and you'd actually ask Gemini, you know, did the student get this problem right?

Speaker 1

如果错了,他们错在哪里?能否解释正确解法?

If not, where did they go wrong, and can you explain how to solve the problem correctly?

Speaker 1

它确实能判断出学生错误地应用了无摩擦斜坡滑雪者的公式,他们用了斜边而非高度。

And it was actually able to tell that, you know, the student had incorrectly applied the formula for a skier going down a frictionless slope: they used the hypotenuse instead of the height.

Speaker 1

然后它会说'哦不对'。

And it said, oh, no.

Speaker 1

实际上你应该用这个公式,这是完整的解题过程。

Actually, you should have used this, and here's the problem worked out.

Speaker 1

它完成了所有这些,识别了所有手写内容,并判断出这是道物理题。

And it did all that and recognized all the handwriting and the fact that this was a physics problem.

Speaker 1

模型已有的这类物理知识正好适用这种情况。

This kind of physics knowledge that the model already had sort of was the right thing to apply.

Speaker 0

我觉得这是将现有Gemini模型与现有教育模式结合的绝妙方式。

I mean, that's a really neat way that you could use the existing model of Gemini in the existing model of education, I guess.

Speaker 1

完全同意。

Totally.

Speaker 1

没错。

Yeah.

Speaker 0

但实际上,我认为这些系统彼此之间并非完全孤立。

But but I suppose, actually, these are not kind of isolated systems from one another.

Speaker 0

那么从某些方面来说,你是否认为这些多模态模型会彻底改变我们的教育方式?

So in some ways, do you think that these multimodal models will change the way that we have to do education, full stop?

Speaker 1

我认为AI工具助力教育的潜力非常惊人,作为社会整体,我们才刚刚踏上这段旅程的起点。

I mean, I think the potential for using AI tools to help education is really amazing, and we're sort of just at the beginning of this journey as a society, I think.

Speaker 1

比如我们知道,接受一对一辅导的学生,其教育成果比传统30人课堂的学生高出两个标准差。

You know, we know, for example, that educational outcomes of students who get one on one tutoring from another person are two standard deviations better than students who have a traditional classroom setting of a teacher and, you know, 30 or so students.

Speaker 1

因此如何让每个人都能享受到个性化教育辅导的益处——了解学生已知和未知的领域,用最适合的方式辅助学习——这正是AI在教育领域的潜力所在。

So how could we get everyone to the point where they feel like they have the benefit of an educational tutor that's one on one, understands what they know, understands what they don't know, can help them learn in the way that they learn best, that is the potential of AI in education.

Speaker 1

实际上我们距离这样的场景并不遥远:你可以指着某份材料对Gemini模型说'能帮我学习这个吗?'

And I think, really, we're not that far away from something where you could point a Gemini model or a future Gemini model at something, some piece of material, and say, can you help me learn this?

Speaker 1

比如打开生物课本第六章,里面有许多图片——

Take the chapter six in your biology textbook or something, and it's got a bunch of images.

Speaker 1

还有大量文字内容——

It's got a bunch of text.

Speaker 1

可能还包含你观看过的讲座视频——

Maybe it's got a lecture video that you watched as well.

Speaker 1

这时你完全可以说'这部分我真的不明白'

And then you can actually say, I really don't understand this thing.

Speaker 1

'能帮我理解吗?'

Can you help me understand it?

Speaker 1

它可以向你提问,

It can ask you questions.

Speaker 1

你也可以向它提问,

And you can ask it questions.

Speaker 1

然后你来回答问题。

You can answer the questions.

Speaker 1

它能评估你的对错,并在学习旅程中真正引导你,因为这是个性化的。我们应该能让全球许多人接触到它,不仅限于英语,还包括世界各地数百种语言。

It can assess are you right or wrong and really guide you in your learning journey because it's individualized, and we should be able to get that to many, many people around the world in not just English, in, you know, hundreds and hundreds of languages all around the world.

Speaker 0

我是说,我...我理解你提到的多种语言问题,以及努力让这些工具尽可能广泛普及。

I mean, I so I I take what you said about the the lots of different languages and and trying to make these as broadly available as possible.

Speaker 0

但这是否存在制造双重体系的风险?一方面,如你所描述的,能接触这些工具的人会获得更好的结果,加速他们的学习和生产力?

But is there a danger of creating a bit of a two tier system here where where on the one hand, people who have access to these tools, as you described, get far better outcomes, you know, accelerate their own learning and and and productivity?

Speaker 0

而任何不够幸运、无法接触这些工具的人将真正陷入困境。

And then anybody who is not fortunate enough to have access to the tools really struggles.

Speaker 0

我是说,这是否是你担忧的问题?

I mean, is that is that something that concerns you?

Speaker 1

绝对是。

Absolutely.

Speaker 1

是的。

Yeah.

Speaker 1

我认为确实存在制造双重体系的风险。

I mean, I think there is definitely a risk of creating two tier systems.

Speaker 1

我认为我们应该努力使这些技术尽可能广泛普及,如果可能的话让所有人都能使用,并真正发挥其对社会的作用,使其在教育、医疗等领域变得可负担甚至免费——我认为医疗是AI能真正改变可及性的另一个重要领域。

I think what we should strive to do is make these technologies as broadly accessible as we can, universally accessible for everyone, and really try to lean into the strengths of what that will do for society: to make it affordable or free for people to take advantage of the capabilities for education and for health care, which I think is another area where AI has huge potential to really make a big difference in accessibility.

Speaker 0

那么我想回到Gemini这个话题,如果可以的话。

So I wanna go back to Gemini, if we can.

Speaker 0

当然。

Sure.

Speaker 0

好的。

Okay.

Speaker 0

我想如果你从谷歌搜索起步,准确性必定是你最关心的核心要素。

So I guess if you started off with Google search, factuality must have been absolutely at the cornerstone of everything that you cared about.

Speaker 0

但Gemini...我是说,我一直在使用它。

But Gemini, mean, I work with it all the time.

Speaker 0

我想你应该见过它说一些相当离谱的话。

I imagine you've seen it say some quite outlandish things.

Speaker 0

是的。

Yeah.

Speaker 0

你是如何在脑海中调和这种矛盾的?比如在某些时候放弃对绝对事实性的执着需求?

How are you sort of squaring that circle in your head of, like, releasing perhaps some of the need for absolute factuality at all times?

Speaker 1

是的。

Yeah.

Speaker 1

作为一家公司,这实际上是一个微妙的平衡,因为我们从根源上是一家以搜索为基础的公司。

It's actually a tricky balance as a company because we are sort of from our origins a search based company.

Speaker 1

正如你所说,提供准确的事实信息是搜索引擎体验的巅峰。

And as you say, providing accurate factual information is kind of the pinnacle of a search engine experience.

Speaker 1

但我认为我们其实已经构建了有趣的大型语言模型,人们很享受与之对话。

But I think we actually had built interesting large language models internally that people enjoyed conversing with.

Speaker 1

事实上,其中一些模型在疫情期间就已在内部开放使用,当时大家都在家办公。

Actually, some of them were available internally during the pandemic, so people were all at home.

Speaker 1

你甚至能看到午餐时间内部使用量激增,因为人们都在和虚拟聊天机器人对话——当你独自在家时还能和谁聊天呢?

And you could actually see internal usage spike during lunchtime because people were having conversations with their virtual chatbot, because who else are you gonna talk to when you're home alone or whatever?

Speaker 1

但这些模型本质上是被训练来预测合理的下一个标记。

But these models are trained to predict plausible next tokens, essentially.

Speaker 1

所谓标记,你可以把它理解为一个单词或单词的一部分。

So a token, you can think of it as a word or a piece of a word.

Speaker 1

所以当你预测合理的下一个标记时,这与绝对真理是两回事。

And so when you predict plausible next tokens, that's a different thing than absolute truth.

Speaker 1

对吧?

Right?

Speaker 1

这是概率上合理的句子,与事实是不同的概念。

It's a probabilistically plausible sentence, and that's different than a fact.

Speaker 1

我认为随着时间推移我们意识到的一点是,这些模型即使不是百分百准确,实际上也相当有用。

And I think one of the things we we realized over time was these models can actually be quite useful even if they're not a 100% factual.

Speaker 1

所以我意识到存在所有这些其他用例,比如能否用五个要点总结这份幻灯片?

And so I think we realized there's all these other use cases, like, can you summarize this slide deck in five bullets?

Speaker 1

当然你可以争论第五个要点是否完全正确,但能获得4.5个关于幻灯片的准确要点仍然非常实用。

And, yes, you could argue about is that fifth bullet exactly right, but still pretty useful to get 4.5 bullets that are factually accurate about the slide deck.

Speaker 1

而且你知道,我们正在努力做到五个要点都完全准确。

And, you know, we'd we're striving to make it five factually accurate bullets.

Speaker 1

但即便如此,我认为这些模型的实际效用已经相当高了。

But even without that, I think the utility of these models is actually quite high.

Speaker 0

这个认知过程是否令人感到不安?

Was it an uncomfortable realization?

Speaker 0

因为当然,其他实验室确实更早推出了他们的模型。

Because, of course, other labs did push out their models earlier.

Speaker 0

是的。

Yeah.

Speaker 0

你们是否因为这种准确性问题而表现得过于谨慎?

Do you think that that that you guys had an abundance of caution because of this factual issue?

Speaker 1

我认为我们有多方面的顾虑。

I mean, I think we had a number of different concerns.

Speaker 1

准确性只是其中之一,模型训练方式和输出结果中的毒性和偏见问题也是我们重点改进的领域。

Factuality being one of them, toxicity and bias in the way the model is trained and the outputs that it can produce is an area where we wanna make the model less biased in a lot of ways.

Speaker 1

因此在向公众发布前,我们在许多方面都希望保持相对谨慎的态度。

And so there were a whole number of areas where we wanted to sort of be relatively cautious before releasing things to the general public.

Speaker 1

我认为我们已经解决了大部分问题,虽然像准确性和偏见等领域仍有改进空间,但当前发布的产品已经具有实用价值。

And I think we've gotten a lot of those issues sorted out enough that we think the products we put out in this space are useful, even though there is obviously room for improvement in things like factuality and bias and other areas.

Speaker 1

所以这对人们来说需要些适应——既要追求尽善尽美,也要明白不发布产品反而可能让很多人错失虽有瑕疵但仍有价值的东西。

So I think that's taken a little bit of an adjustment for people: you know, strive for the best you can be, but also realize that by not releasing something, you're holding back something that could be useful for a lot of people, even with its foibles.

Speaker 0

但既然有这些缺点,那么我们该何去何从?

But then with those foibles then, so in which direction do we go from here?

Speaker 0

在我看来,计算方式确实发生了实质性的转变。

I mean, it sort of seems to me that there's been this this real shift in the way that computing happens as it were.

Speaker 0

你知道的,用计算器算同样的算式两次,会得到同样的答案两次。

You know, you get a calculator, you put in the same sum twice, you get the same answer twice.

Speaker 1

是的。

Yeah.

Speaker 0

而我们现在正处于概率计算的时代。

Whereas we're now in an era of probabilistic computing.

Speaker 0

所以我在想,公众是否必须接受这个现实,接受我们处在一个事物更像人类、会犯错的时代?还是你认为这是可以修正的?

And so I wonder whether the the is it that the public has to come to terms with that and sort of accept that we're in an era where things are much more human like and that they can make mistakes, or is it something that you think is fixable?

Speaker 1

我认为两者兼而有之。

I think it's some of both.

Speaker 1

对吧?

Right?

Speaker 1

我觉得有很多技术手段可以改善事实性领域的问题。

I mean, I think there's a bunch of technical approaches to some of these problems that will make the factuality area issues better.

Speaker 1

举个例子,模型训练所用的数据——数万亿的文本标记和其他数据混合在这个由数百亿参数组成的巨大‘汤’里。我喜欢把这理解为:你见过很多东西,但记得不太清楚。

One instance is if you think about the data the model is trained on, like trillions of tokens of text and other data that are then mixed together in this giant soup of, you know, billions and billions of parameters, I I like to think of that as, like, you've seen a lot of stuff, but you don't recall it very well.

Speaker 1

而我们Gemini一直在推进的一个方向就是扩展上下文窗口。

Whereas one of the things we've been pushing on in Gemini is having a long context window.

Speaker 1

当你拥有一个能容纳大量直接信息的长空间时,你可以用各种方式对这些信息进行总结、操作、比较或提取。

So when you have a long bit of space, you can put in a lot of direct information that you're trying to summarize or manipulate or compare in various ways, or extract information from.

Speaker 1

模型对上下文窗口中的信息实际上有更清晰的把握。

That information in the context window, the model actually has a much clearer view of.

Speaker 1

对吧?

Right?

Speaker 1

它包含了实际文本及其表征,这些内容与模型见过的其他信息是分离的。

It's got, like, the actual text and the representations of that text not tangled together with everything else it's seen.

Speaker 0

所以这个上下文窗口某种程度上是模型在那一刻认为重要的部分。

So this context window is sort of the bit that the model can see as important at that moment.

Speaker 1

是的。

Yeah.

Speaker 1

它能比其他训练过程中见过的内容更精确地对此进行推理。

It can sort of reason about that in more fidelity than other things that it's seen in its training process.

Speaker 1

比如它可以处理五篇科学文章的PDF,然后你可以提问:'能告诉我这些文章的共同主题吗?'

So it can take, like, five PDFs of scientific articles, and then you can ask questions about it, like, can you please tell me common themes across these articles?

Speaker 1

由于它已构建了这些文章内容的内部表征,因此确实能完成这个任务。

And then it's actually able to do that because it has its own representation of all the contents of those.

Speaker 1

这就是我们大力推动Gemini模型超长上下文窗口的原因之一——这种能力对事实核查、视频摘要等各类任务都非常实用。

That's one of the reasons we've been pushing a lot on very long context windows for Gemini models is because I think that's a really useful capability for factuality, for, you know, video summarization, for all kinds of things.

Speaker 0

但上下文窗口是否存在上限呢?

Is there a limit, though, to the context window?

Speaker 0

难道可以不断扩展,直到变成某种无限上下文?

You just push and push and push and push until it's sort of an infinite context?

Speaker 1

这是个绝妙的问题。

That is an excellent question.

Speaker 1

目前注意力机制的计算成本相当高昂。

Currently, the computational aspects of the attention process are quite expensive.

Speaker 1

尝试扩展得越长,所需代价就越大。

So the longer you try to make it, the more expensive it gets.

Speaker 0

耗时方面的代价,但

Expensive in terms of time, but

Speaker 1

还包括计算时间,最终会涉及时间、资金、算力等各类成本。

Also compute time, and eventually time and money and compute and all kinds of things.

Speaker 1

但我们认为通过算法改进,有可能突破目前200万token的上下文窗口限制。

But we think it may be possible to come up with algorithmic improvements that enable you to go beyond that 2,000,000 token context window, which is what we have now.

Speaker 1

100万token已经相当多了。

A million tokens is quite a lot.

Speaker 1

100万token大约相当于600页文本。

A million tokens is about 600 pages of text.

Speaker 1

所以基本上能涵盖大多数书籍,或者20篇文章。

So that's most books, you know, 20 articles.

Speaker 1

相当于1小时的视频内容。

It's an hour of video.

Speaker 0

另一方面的看法呢?

What about on the other side?

Speaker 0

因为你刚才提到需要两方面考虑,可能人们需要调整预期。

Because you said it was a little bit of both, that some that perhaps people have to adjust their expectations.

Speaker 1

我认为这些模型是工具,人们需要了解工具的能力边界,也要明白哪些场景不适合使用。

So I think these models are tools, and people need to understand the capabilities of their tools, but also some of the ways in which you probably don't wanna use the tool.

Speaker 1

所以这某种程度上是个用户教育的过程。

So I think it's a bit of an educational process for people.

Speaker 1

不要盲目相信语言模型输出的所有内容。

Don't just trust every fact that comes out of a language model right off the bat.

Speaker 1

需要保持一定的审慎态度。

You need to apply a bit of scrutiny to that.

Speaker 1

就像我们现在教育人们:网上看到的内容未必真实。

Sort of like I think we've taught people these days that if you see something online, that doesn't necessarily make it true.

Speaker 1

我认为对语言模型的某些输出保持同等程度的怀疑是必要的。

I think a similar degree of skepticism for some kinds of things from language models is probably also appropriate.

Speaker 1

随着模型进步这种怀疑可能会减少,但对某些内容保持'这可能不准确'的谨慎态度仍是明智的。

That skepticism may decrease over time as the models improve, but it's good to take it with a healthy dose of, you know, oh, that might not actually be true for some of these things.

Speaker 0

除了上下文窗口,是的。

Aside from the context windows Yeah.

Speaker 0

有没有什么方法,在你编写提示词时,嗯。

Are there ways that you can, yourself, when you're sort of writing in prompts Mhmm.

Speaker 0

能尽量降低最终得到完全胡编乱造内容的风险?

Sort of minimize the risk of ending up with something that's that's a complete hallucination?

Speaker 1

谷歌研究人员提出的一种技术叫做'思维链提示'。

So one technique that Google researchers kind of came up with is what's called chain of thought prompting.

Speaker 1

同理,如果你给模型一个有趣的数学题并直接问:'好,答案是什么?'

So in the same way, if you just give the model a sort of interesting math problem and you say, okay.

Speaker 1

答案是什么?

What's the answer?

Speaker 1

它可能答对,也可能出错。

You know, it may get it correct, but it may not.

Speaker 1

但如果你说:'这里有个有趣的数学题,

And if instead you say, here's an interesting math problem.

Speaker 1

能否逐步展示你的解题步骤?'

Can you please show your work step by step?

Speaker 1

就像回忆你四年级的数学老师

So if you remember back to your fourth grade math teacher

Speaker 0

噢,确实。

Oh, I did.

Speaker 1

他/她总会说:'你应该逐步展示解题过程,最后再写下答案。'

He or she was probably saying, you should really show your work step by step and then get to the final answer and then write the final answer down.

Speaker 1

这部分是因为这能帮助你完成多步骤思考——从理解题意,到计算这个基于那个,最终得出答案。

And that's partly because that helps you get through that, you know, multistep thinking process of how do you actually go from, you know, what's being asked to, okay, I need to calculate this, calculate this based on that, and so on, and finally get to the answer.

Speaker 1

事实证明,这不仅让模型的输出更可解释(因为它展示了思考步骤),还能提高答案的正确率。

And it turns out that not only makes the model's output more interpretable because it kind of tells you what steps it's going through, but it also makes it more likely to get the correct answer.
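As a concrete sketch, chain-of-thought prompting is just a change to the prompt text. The wrapper below shows the idea; the question and exact wording are invented examples, and in practice you would send either string to whatever model API you use:

```python
def chain_of_thought(question):
    """Wrap a question with a step-by-step instruction (chain-of-thought)."""
    return question + "\nPlease show your work step by step, then state the final answer."

question = "A train travels 60 km in 1.5 hours. What is its average speed?"

direct_prompt = question + "\nAnswer:"   # may or may not come back right
cot_prompt = chain_of_thought(question)  # tends to elicit the intermediate
                                         # steps (60 / 1.5 = 40 km/h) and,
                                         # empirically, more correct answers

print(cot_prompt)
```

The model's output for the second prompt is also more interpretable, since the intermediate steps are written out before the final answer.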

Speaker 0

但如果这不是数学问题呢?

What if it's not a math problem, though?

Speaker 1

是啊。

Yeah.

Speaker 1

我是说,即使在答案范围不明确的情况下,这种方法也勉强可行。

I mean, even in domains without a crisply defined right answer, this approach kind of works.

Speaker 1

这其中有些微妙之处,我认为人们需要真正学习如何使用这些模型,而提示的方式实际上会极大影响输出质量的高低。

And there's a bit of subtlety, and I think people need to actually learn how to use these models; the way in which you prompt them is actually a big differentiator in how high quality the output is.

Speaker 1

比如,如果你说'总结这个',可能会得到一个结果。

Like, if you say, Summarize this, that might lead to one outcome.

Speaker 1

但如果你说'请总结这篇文章,给出五个要点突出文章重要内容,并指出作者提到的两个缺点'。

If you say, please summarize this and give me five bullet points that highlight the major important pieces of the article and identify two cons that the author wrote down.

Speaker 1

这样说就比单纯说'总结这个'给模型提供了更清晰的操作指令。

If you say that, that's a much clearer set of instructions to what the model should do than just summarize this.

Speaker 0

所以当我们把这些因素结合起来时,既需要分步拆解流程,也要理解更多上下文和多模态内容。

So when we put these things together then, so sort of breaking down step by step processes, but also understanding more context and the multimodal stuff too.

Speaker 0

嗯。

Mhmm.

Speaker 0

我们是否正在走向这样一个未来:这类多模态模型能理解我们作为个体的个人偏好?

Are we moving towards a situation where these kind of multimodal models will understand us as individuals and our preferences?

Speaker 1

是的。

Yeah.

Speaker 1

我认为你真正想要的,是一个高度个性化的Gemini版本,它既要理解你当下想做的事,也要明白你做这件事的具体情境。

I mean, I think what you really want, I think, is a sort of very personal version of Gemini for you that understands, you know, what it is you're trying to do right now, but also understands the context in which you're trying to do that.

Speaker 1

我是素食主义者。

I'm vegetarian.

Speaker 1

所以如果我问Gemini伦敦餐厅推荐,而它知道我是素食者的话...

So if I'm asking Gemini about restaurant recommendations in London and it knew that I was a vegetarian.

Speaker 1

它的推荐会与我不在时不同。

It would recommend different things than if I was not.

Speaker 1

我认为一个对所有用户需求一视同仁的通用模型,远不如真正了解你和你的背景的模型来得好。

And I think a general model that is serving the needs of every person the same is not gonna be as good as one that actually understands a lot about you and your context.

Speaker 1

要知道,有些查询你可能想对模型提出,但目前用Gemini还无法完全实现,但你能想象未来会需要这些功能。

You know, there are some kinds of queries you might like to ask a model that you can't quite do today with Gemini, but you could imagine wanting to do.

Speaker 1

明白吗?

You know?

Speaker 1

你能把我上周徒步时拍的照片做成插画故事书,今晚给孩子当睡前读物吗?

Can you take the the pictures I took on the hike last week and make an illustrated storybook for my kid's bedtime tonight?

Speaker 1

它会知道这些徒步照片的拍摄背景,并懂得如何制作吸引孩子的插画故事书。

And it would know where those pictures from your hike were taken and how to make an illustrated storybook that would appeal to a child.

Speaker 1

它或许还能根据你孩子的年龄调整内容使其适龄。

They would maybe know how old your child was to make it age appropriate.

Speaker 1

所以虽然现在做不到,但这会是很有用的功能。

So I think you can't do that now, but that could be something useful.

Speaker 1

你需要让人们主动选择使用这个功能。

You want people to opt in to that.

Speaker 1

我认为,你希望模型掌握并用于背景分析的资料越多,就越需要让人们理解正在发生什么。

I think the more information you want the model to know and have in context, I think the more you wanna sort of have people understand what what is happening.

Speaker 1

我认为我们将能做到的不是用这些数据训练模型版本,而是让正确信息在生成回答时可被随时调用。

One of the things I think we'll be able to do is not train a version of the model on that data, but just have the right information available in context in order to sort of call upon it when generating responses.
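The idea of supplying personal information in context, rather than training a per-user model, could be sketched as assembling opted-in facts into the prompt. The helper name and context format are assumptions for illustration only.

```python
# Hypothetical sketch of in-context personalization: opted-in user facts
# are prepended to the query, so no per-user training run is needed.
def personalized_prompt(query: str, user_context: dict[str, str]) -> str:
    """Prepend consented user facts so the model can draw on them."""
    context_lines = "\n".join(f"- {k}: {v}" for k, v in user_context.items())
    return (
        "User context (shared with consent):\n"
        + context_lines
        + "\n\nUser query: "
        + query
    )

prompt = personalized_prompt(
    "Recommend restaurants in London",
    {"diet": "vegetarian"},
)
```

Because the facts live in the prompt rather than in the weights, the user controls exactly what is shared on each request, which matches the opt-in framing above.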

Speaker 1

我觉得这会很棒。

And I think that would be pretty nice.

Speaker 0

所以就像你们已经建立了某种通用框架 对

So as in you've got, like, this sort of this general structure Yep.

Speaker 0

可以把自己的背景信息加载上去 没错

That you can almost imprint your own context onto Yeah.

Speaker 0

那对你来说有点私人了。

Then that's kinda private for you.

Speaker 1

那是

That's

Speaker 0

没错。

right.

Speaker 0

不错。

Nice.

Speaker 0

是的。

Yeah.

Speaker 1

听起来似乎相当不错。

That seems like it'd be pretty good.

Speaker 0

是的。

Yeah.

Speaker 0

那确实会相当不错。

That would be pretty good.

Speaker 0

我们是否仅限于音频、视觉以及屏幕上可见的内容、语言这类东西?

Are we limited here to just audio and visual and, you know, things that you can see on a screen, language, whatever?

Speaker 0

还是说我们预期这类助手最终会从电脑里走出来?

Or or do we ever expect that these kind of assistants will come out of our computers as well?

Speaker 1

对。

Yeah.

Speaker 1

我认为实际上有很多新型数据模式并非严格意义上的人类模式,而我们希望这些模型能够理解。

I mean, I think there's actually a lot of different kinds of new modalities of data that aren't sort of strictly human modalities that we want these models to understand.

Speaker 1

比如全球各地的温度读数有助于天气预报,或者基因序列,自动驾驶汽车的激光雷达数据,以及机器人应用。

So lots of temperature readings across the earth to help with weather prediction or genetic sequences or LiDAR data for autonomous vehicles or robotics applications.

Speaker 1

在某些场景下,你希望这些模型或许能协助现实世界的机器人应用,能够与机器人设备对话,用简单语言给它下达指令。

And then in one setting, you want these models to perhaps be able to help with real world robotics applications, be able to talk to a robotic device, give it sort of instructions in plain language.

Speaker 1

你能不能去厨房擦一下台面,把我留在台上的汽水罐回收了,然后给我带一袋开心果什么的?

You know, can you please go into the kitchen and wipe the counter down and recycle the soda can I left on the counter and then bring me a bag of pistachios or something?

Speaker 1

对吧?

Right?

Speaker 1

传统上机器人一直无法理解这种语言,但我认为我们正处在实现这种能力的临界点,然后能让机器人在像这个房间这样杂乱的环境中完成50或100项实用任务,而不是传统上机器人已被部署的那些非常受控的环境,比如工厂流水线那种它们从这里到那里的非常可预测的场景。

Like, robots have traditionally not been able to understand language like that, but I think we're on the cusp of enabling that kind of capability and then being able to have robots do 50 or a 100 useful tasks even in messy environments like this room rather than kind of the traditional setting in which robots have already been deployed in the world, which is sort of very controlled environments like, you know, factory assembly line kind of things where they go from there to there, and it's a very predictable thing.

Speaker 0

我们一直在讨论这些作为助手的东西,某种程度上是以这种方式增强人类能力。

We've been talking here as as assistants, you know, these things as being sort of augmenting human abilities in that way.

Speaker 0

我能在医疗环境和教育环境中看到它的应用。

And I can see it in in medical settings and education settings.

Speaker 0

但这种多模态特性是否还能为我们提供更多,比如在理解世界方面?

But is there is there more that the multimodal aspect of this offers us in terms of, I don't know, like, how we understand the world.

Speaker 1

是的。

Yeah.

Speaker 1

我认为这些模型现在能做的是通过几步推理,从你的要求出发去完成某些事情。

I mean, I think what these models can do now is often do a few steps of reasoning to get from what you asked it to do in order to sort of accomplish something.

Speaker 1

随着这些模型能力的提升,你将能让模型与你协作完成更复杂的任务。

And I think as these models improve in capability, you'll be able to sort of get models to work with you to do much more complex tasks.

Speaker 1

这就像是区别'你能从椅子租赁店订一批椅子'和'帮我策划一个会议'。

And it's sort of a difference between can you order a bunch of chairs at the chair rental place versus plan me a conference.

Speaker 1

对吧?

Right?

Speaker 1

后者层级更高、复杂得多。

The latter is much higher level, much more complex.

Speaker 1

合适的模型会问你一堆后续问题,因为这里面存在模糊性。

You know, the right model would kind of ask you a bunch of follow-up questions because there's ambiguity in there.

Speaker 1

比如,有多少人要来?

You know, how many people are coming?

Speaker 1

你知道这是关于什么的吗?

You know, what's it about?

Speaker 1

就像人类一样

Just like a human

Speaker 0

你加入吗?

are you in?

Speaker 1

是的。

Yeah.

Speaker 1

你在哪个国家?

What country are you in?

Speaker 1

你想在哪里进行?

Where do you wanna have it?

Speaker 1

什么时候?

When?

Speaker 1

然后我们就可以出发,真正完成许多为了实现那个高层次目标可能需要做的事情。

And then we kind of set off and actually be able to accomplish a lot of the kind of things that you might want done in order to do that high level goal.

Speaker 0

但如果你有这种概念链接或这些概念链接,我现在要回到牛这个话题上。

But then if you have this this sort of conceptual links or these conceptual links I'm I'm going back to cow here.

Speaker 0

对吧?

Right?

Speaker 0

它能理解图片,也能理解,我不知道,我猜是重力,通过看网上的视频学会的。

And it understands pictures, and it understands, I don't know, I guess, gravity, having seen videos of on the Internet.

Speaker 1

它可能看过物理入门讲座之类的。

It probably watched, like, introductory lectures on physics.

Speaker 0

对吧?

Right?

Speaker 0

哦,哇。

Oh, wow.

Speaker 0

好的。

Okay.

Speaker 0

所以它从这个角度理解。

So it understands it from that perspective.

Speaker 1

对。

Yeah.

Speaker 1

对。

Yeah.

Speaker 1

同时还看到一堆东西在坠落。

And while also seeing a bunch of things falling.

Speaker 0

好的。

Okay.

Speaker 0

那么你能在某天进去说,给我画一张非常高效飞机的蓝图吗?

So then could you go in one day and say, draw me the blueprint for a really efficient airplane?

Speaker 1

可以。

Yeah.

Speaker 1

我是说,我认为这些模型需要配合的是一个探索过程,这个探索过程可以表现为,比如它不需要在两百毫秒内给你答案。

I mean, I think one of the things these models need to be partnered with is some exploratory process, and that exploratory process can come in the form of, you know, maybe it doesn't need to give you an answer in two hundred milliseconds.

Speaker 1

也许你明天拿到飞机设计图就会满意。

Maybe you'd be happy with your airplane tomorrow.

Speaker 1

对吧?

Right?

Speaker 1

所以我认为到那时,你在设计系统时会有更多自由,让它们能高效完成这类任务,比如可以离开去尝试一些实验,可能在它们能访问的模拟器中,或者它们可能为基本流体动力学之类创建一个模拟器,然后尝试多种设计。

And so I think at that point, then you have a lot more freedom in how would you design systems to be able to efficiently do things like that where they can go off and try a few experiments maybe in a simulator that they have access to or maybe they create a simulator for basic fluid dynamics or something, and they try a bunch of designs.

Speaker 1

也许它们对飞机形状有些想法,因为见过许多现有飞机。

Maybe they have some ideas about what airplane shapes make sense, having seen a bunch of existing airplanes.

Speaker 1

这样它们就能尝试完成你要求的任务。

And so then they can kind of try to accomplish what it is you asked.

Speaker 1

希望他们首先会问你,你希望你的飞机具备哪些特性?

Hopefully, they first ask you, well, what characteristics do you want your airplane to have?

Speaker 0

原来一直都是纸飞机。

It was a paper airplane all along.

Speaker 1

是啊。

Yeah.

Speaker 1

纸飞机。

Paper airplane.

Speaker 1

没错。

Yeah.

Speaker 1

重要的是知道它是不是纸做的。

It's important to know if it's paper.

Speaker 1

比如,那样能大幅降低成本。

Like, that that reduces the cost a lot.

Speaker 0

所以

So

Speaker 1

我认为这类功能最终都会实现。

I think those kinds of things will come eventually.

Speaker 1

很难准确预测这些能力何时会出现,因为这涉及到模型推理能力、所需知识、任务要求及交互方式的复杂整合。

It's a little hard to tell exactly when those capabilities you know, that's a pretty complicated sort of integration of what you want the reasoning in the model to do, the knowledge it needs, what you're asking it to do, and how you're asking it.

Speaker 1

但我们已经看到这些模型在五年、十年间取得了相当大的能力进步。

But we're already seeing pretty big advances in capabilities of these models over five year, ten year periods.

Speaker 1

因此在五年或十年内,这可能会成为现实。

And so over a five year or ten year period, that might be possible.

Speaker 1

甚至可能更快实现——比如‘你能帮我设计一架具备这些特性的飞机吗?’这样的需求。

It might even be sooner than that for, you know, can you help me design an airplane with these characteristics?

Speaker 0

但我想这些大概算是我们期待中阿波罗系统的早期雏形吧。

But I guess these are, like, the early, early precursors to what we might hope Apollo would be.

Speaker 1

是的。

Yeah.

Speaker 1

没错。

Exactly.

Speaker 1

所以它才叫双子座。

That's why it's Gemini.

Speaker 0

所以它才叫双子座。

That's why it's Gemini.

Speaker 0

太棒了。

Amazing.

Speaker 0

杰夫,非常感谢你参加我的节目。

Jeff, thank you so much for joining me.

Speaker 1

很荣幸来到这里。

It's a pleasure to be here.

Speaker 1

谢谢邀请。

Thank you for having me.

Speaker 0

从很多方面来说,我认为杰夫的整个故事就是关于规模的故事。

In a lot of ways, I think Jeff's whole story is a story about scale.

Speaker 0

对于谷歌搜索来说,关键在于如何获取更多网页、更多用户、更快的查询。

For Google Search, it was about how do you get more of the web, more users, faster queries.

Speaker 0

对于神经网络来说,关键在于更强的计算能力、更多的机器。

For neural networks, it was about more computing power, more machines.

Speaker 0

而在最近的机器学习时代,关键在于越来越多的数据。

And in the recent era of machine learning, it's been about more and more and more data.

Speaker 0

但从这一切中诞生了某种东西——一个真正意义上的世界概念模型,它能够进行抽象思维,并且已经展现出提升人类生产力的能力。

But something emerges from all of that, a genuine conceptual model of the world, one that is capable of abstraction and has already a proven ability to enhance human productivity.

Speaker 0

值得注意的是杰夫并没有就此止步。

And it's telling that Jeff isn't finished there.

Speaker 0

未来还会有更多进展,更多传感器,更多模式。

There is more to come, more sensors, more modes.

Speaker 0

当与这栋大楼诞生的强化学习工具相结合时,或许还能在通往AGI的道路上取得更多进展。

And when combined with the reinforcement learning tools that were born in this building, maybe also more progress on the path to AGI.

Speaker 0

接下来,在本系列节目中我们还将带来更多精彩对话,话题涵盖人工智能如何加速科学发现进程,到探索智能体将如何推动人工智能领域发展。

Now, we have got plenty more amazing conversations coming up later in this series on topics ranging from how AI is accelerating the pace of scientific discoveries to exploring how agents will advance the field of artificial intelligence.

Speaker 0

如果您喜欢本期节目,请务必订阅我们的播客。

Now if you've enjoyed this episode, please make sure that you subscribe to our podcast.

Speaker 0

如果您有任何反馈或想推荐希望听到的嘉宾,何不在YouTube上给我们留言呢?

And if you have any feedback or you want to suggest a guest that you'd like to hear from, then why not leave us a comment on YouTube?

Speaker 0

下次再见。

Until next time.
