Google DeepMind: The Podcast - 重新定义机器人技术的卡罗琳娜·帕拉达

重新定义机器人技术的卡罗琳娜·帕拉达

Redefining Robotics with Carolina Parada

本集简介

在本期节目中,汉娜与谷歌DeepMind机器人部门高级总监卡罗琳娜·帕拉达展开对话。她们探讨了机器人能力的重大突破,重点分析了多模态理解和具身推理技术的进步如何使机器人以前所未有的通用性与物理世界互动。她们深入剖析了"双系统思维模式"——即"慢思考与快思考"——如何同时实现复杂推理和快速反应动作。通过机器人学习系鞋带等精细任务,以及实时适应全新场景的案例,帕拉达阐释了理解力、灵巧性和控制技术的关键突破正以前所未有的速度推动机器人学发展。

延伸阅读/观看:Gemini机器人项目、机器人灵巧性进展、Gemini机器人 x YouTube

特别鸣谢以下制作人员(包括但不限于):
主持人:汉娜·弗莱教授
系列制片:丹·哈顿
剪辑:拉米·扎巴尔
监制&制片:艾玛·尤瑟夫
音乐:埃莱妮·肖
音频工程师:理查德·考蒂斯
制作经理:丹·拉扎德
视频制作:尼古拉斯·杜克视频
导演:贝尔纳多·雷森德
视频剪辑:比拉尔·梅尔希
音频工程师:佩里·罗甘廷
摄影灯光:罗伯特·梅塞尔
制作协调:佐伊·罗伯茨、莎拉·艾伦·莫顿
视觉设计:罗伯·阿什利

谷歌DeepMind出品。

若喜欢本期节目,请在Spotify或苹果播客留下评论。我们始终期待听众的反馈、新想法或嘉宾推荐!

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

未来两年对机器人领域来说将是相当关键的。有很多方面正在汇聚,比如对手部灵巧性的理解、全身控制,你可以看到这些如何融合成一个非常强大的解决方案。

The next two years are going to be pretty defining for the field of robotics. There's just a lot of things that are coming together, understanding dexterity, whole body control, you can see how this could actually merge into a very strong solution.

Speaker 1

欢迎回到《谷歌DeepMind播客》,我是汉娜·弗莱教授。有时候,甚至经常在闲聊中,人工智能和机器人这两个术语会被混用。你知道,人们谈论在应用中和机器人聊天。但机器人是有物理身体的。

Welcome back to Google DeepMind: The Podcast. I'm Professor Hannah Fry. Sometimes, maybe even often, the terms AI and robot are used interchangeably in casual conversation. You know, people talking about chatting to a robot on an app. But robots have a physical body.

Speaker 1

在谷歌DeepMind这里,他们关注的是嵌入现实世界的人工智能机器人。虽然人工智能取得了巨大进步,但具身智能却一直落后。不过,也许这一切即将改变。Carolina Parada领导着谷歌DeepMind的机器人研究,这个国际团队在机器人技术方面取得了一些非凡的进展,最近的是Gemini Robotics,它将Gemini的多模态理解带入了物理世界。

And here at Google DeepMind, they care about robots with AI embedded in the real world. And while AI has made huge strides, embodied intelligence has lagged behind. But perhaps all of that is about to change. Carolina Parada leads robotics research here at Google DeepMind, the international team responsible for some extraordinary advances in robotics. Most recently, Gemini Robotics, which brings Gemini's multimodal understanding to the physical world.

Speaker 1

欢迎来到播客,Carolina。我知道你研究这些机器人已经很长一段时间了。你看到它们是如何演变的?

Welcome to the podcast, Carolina. Now I know you've been working with these robots for quite a long time. How have you seen them evolve?

Speaker 0

谢谢邀请。是的,这非常令人兴奋。我从10岁起就对机器人技术感到兴奋,主要是因为我在卡通片中看到的。

Thanks for having me. Yeah. It's been super exciting. I have been excited about robotics since I was 10 years old. Super excited because of what I've seen in cartoons.

Speaker 0

比如,你看到像Rosie这样的机器人帮忙做所有家务。作为一个孩子,你会想,当然,这就是我长大后想建造的东西。实际上,我在谷歌DeepMind机器人团队已经大约七年了,情况发生了巨大变化,尤其是在过去三年。我们从一开始就坚信人工智能将彻底改变机器人技术。

Like, you see robots like Rosie the robot helping do all the chores. And as a kid, you're like, of course, that's what I wanna build when I grow up. And really, I've been at the Google DeepMind robotics team for about seven years, and things have changed dramatically, in the last three years in particular. We've always believed from the very beginning that AI was going to be completely transformative to robotics.

Speaker 0

我的意思是,现在有很多机器人确实很有帮助。制造生产线上的机器人,在月球上导航的机器人,还有在我们海洋中的机器人。但这些机器人都是被编程来专门执行那些任务的。

I mean, there's a lot of robots out there that are really helpful today. There are robots in manufacturing lines. There are robots navigating the moon. There are robots in our oceans. But these robots have been programmed to do specifically those tasks.

Speaker 0

他们对那些环境或可能遇到的物体做了很多假设,或者它们可能是由人类远程操作的。但我们从一开始就相信,人工智能是改造机器人技术的途径,这样我们就能建造真正智能的机器人,让它们能够与你互动,能够推理它们的环境,并且能够以一种感觉非常通用的方式采取行动。所以这一直是我们从一开始的使命。我想三年前,你的播客里就有机器人技术的内容。那时,我们正在为机器人做强化学习。

They make a lot of assumptions about those environments or the objects they might encounter, or they might be remotely operated by humans. But we have believed from the beginning that AI is the way to transform robotics so that we can build robots that are truly intelligent so that they can interact with you, that they can reason about their environment, and they can take action in a way that feels very general. So that has been our mission from the start. And so I think three years ago, you had robotics in your podcast. And back then, we were doing reinforcement learning for robotics.

Speaker 0

所以本质上,我们是在教机器人,比如通过给它们一个简单的奖励来堆叠积木,比如如果你的塔变高了就加一分。我们在那里取得了一些进展。但自那以后,由于我们一直处于人工智能的前沿,我们一直在将越来越多的人工智能引入整个机器人世界。所以大约在2022年,我们引入了,例如,将大型语言模型(LLMs)用于机器人。那是你第一次可以真正与机器人交谈,并说诸如‘我渴了’之类的话,而它会明白你的意思。

And so essentially, we were teaching robots to, like, stack blocks by giving them a simple reward, like plus one if your tower got taller. We made some progress there. But since we've been at the forefront of AI, we've been bringing more and more AI into the entire world of robotics. So around 2022, we introduced, for example, LLMs to robots. And that was the first time that you could actually talk to a robot and say something like, I'm thirsty, and it would know what you meant.
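The plus-one reward Parada describes can be sketched in a few lines. This is a hypothetical illustration of that kind of simple stacking reward, not DeepMind's actual training setup; the function name and the toy episode below are made up.

```python
# Hypothetical sketch of the simple reward described above:
# +1 whenever the block tower gets taller, 0 otherwise.

def stacking_reward(prev_height: int, new_height: int) -> int:
    """Return 1 if the tower grew after an action, else 0."""
    return 1 if new_height > prev_height else 0

# A toy episode: the robot stacks three blocks but fumbles once.
heights = [0, 1, 2, 2, 3]  # tower height after each action
total = sum(
    stacking_reward(prev, new)
    for prev, new in zip(heights, heights[1:])
)
# total is 3: the tower grew on three of the four actions
```

An agent trained against a reward this sparse gets no hint about how to move its joints; it has to discover that through trial and error, which is part of why progress with this approach was gradual.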

Speaker 0

然后后来,我们引入了视觉语言模型(VLMs),这样机器人不仅能理解自然语言,还能理解它接收到的视觉输入,然后基于此做出决策。然后在2023年,我们引入了机器人变换器(robotics transformers)。这是变换器架构首次真正被纳入机器人技术。它基本上向我们展示了机器人性能随着数据规模而扩展。这实质上开启了一个新的基础,或者说一个大规模数据驱动的机器人学习的新时代。

And then later on, we brought VLMs so the robot could understand natural language, but it could also understand the visual input that it was getting and then make decisions based on that. And then in 2023, we introduced robotics transformers. And this is the first time that the transformer architecture was actually included in robotics. And it basically showed us that robot performance scales with data. And that essentially started a new foundation or a new era of large scale data driven robot learning.

Speaker 0

然后更近些时候,我们刚刚推出了Gemini Robotics,这实质上是我们最先进的动作模型。它本质上将Gemini的多模态世界理解能力带到了物理世界,通过将动作作为Gemini中的一个新模态。这确实使得模型变得非常通用,因为它通过Gemini的理解来理解世界,并使其能够互动。事实上,它能理解Gemini支持的任何语言,并使其变得灵巧。所以它仍然可以在与你交谈和理解全新情况的同时,进行非常复杂的操作,而这对于今天的机器人来说实际上是非常困难的。

And then more recently, we introduced just now Gemini Robotics, which was essentially our most advanced model for actions. And it essentially takes the multimodal world understanding of Gemini and brings it to the physical world by adding actions as a new modality in Gemini. And that really enables models to be very general because it's understanding the world through Gemini's understanding and enables it to be interactive. In fact, it can understand any language that Gemini supports and enable it to be dexterous. So it can still do very complex manipulation while talking to you and also understanding a completely new situation, which today is actually very hard for robots to do.

Speaker 1

就你的宏大目标、你的远大抱负而言,我们如何知道何时达成了目标?

In terms of your big goal, your big ambition, how will we know when we get there?

Speaker 0

我认为这肯定是一个渐进的过程,机器人将能够理解一个新情况,并推理它们需要做的、以前从未见过的事情。这正是我们现在看到的情况。但它们学习越来越复杂的任务仍然会很困难。事实上,这就是我们所看到的。机器人感觉有点像两岁的幼儿,能够理解它周围的世界。

I think it's definitely going to be gradual, where robots are able to understand a new situation and reason about something they need to do that they haven't seen before. And that's exactly what we're seeing right now. But it's still going to be difficult for them to learn more and more complex tasks. In fact, that's what we see. The robot can feel sort of like a two-year-old toddler that can understand the world around it.

Speaker 0

它可以开始摆弄物体。它理解概念。但如果你教它做更复杂的事情,比如我们有一个例子,我们教机器人做一个折纸折叠,它实际上需要时间来练习。一旦它在那种情况下有了更多练习,它实际上就能做到。所以,这大致就是我们今天的水平,但如果我们希望机器人进入日常空间,为我们完成各种任务,这还远远不够。

It can start to play with objects. It understands concepts. But if you teach it to do something more complex, like we have an example where we're teaching the robot to do an origami fold, it actually needs time to practice that. And once it has more practice in that case, it can actually do it. So, that's roughly where we are today, but that's far from where we need to be if we want robots to be in everyday spaces, doing all kinds of tasks for us.

Speaker 1

我想我们可以稍微看看这些机器人能做些什么,因为你们最近发布了一个视频。这里展示的是一个类人机器人正在为人类打包午餐,还在玩井字棋。它玩井字棋的水平怎么样?

I thought that what we could do is take a little look at some of what these robots can do because there's a video that you guys have recently released. What we have here then is we have a humanoid robot who is packing a lunch for a human, also playing noughts and crosses. Is it any good, the noughts and crosses?

Speaker 0

我觉得我们还是能赢它,因为它的理解非常简单。

I think we still beat it because it has a very simple understanding.

Speaker 1

不过告诉你它在做什么,它正在轻松地拿起棋子并移动它们。这里还有一点,它能根据出现的字母块自己组成变位词。你特别对哪一点印象深刻?

Tell you what it's doing, though: picking up the pieces and moving them around quite easily. There's also a bit here where it can make its own anagram based on tiles that appear. What were you particularly impressed by?

Speaker 0

我认为这些模型最令人兴奋的是,在很多情况下,我们自己的研究人员都对它的表现感到兴奋和印象深刻。这主要是因为我们测试的方式是将机器人置于它从未见过的情况面前。所以连我们自己都不知道机器人是否能做对,而在很多情况下,它确实做到了。因此,我们在这个视频以及其他展示双臂移动的视频中展示的许多例子,都表明它实际上在理解复杂的概念。一个让我们所有人都惊叹不已的例子是,我们展示了一个机器人实际完成扣篮的视频。

I think that's what's most exciting about these models: on many occasions, our own researchers were excited and impressed by what it was doing. And it was primarily because the way we were testing it was by putting the robot in front of situations that it's never seen before. So even we didn't know whether the robot was going to be able to get it right, and on many occasions, it did. So many of the examples that we show in this video, as well as the other videos where you have two arms moving around, show that it's actually understanding a complex concept. A really cool example, where we all gasped, was when we showed the video where the robot is actually doing a slam dunk.

Speaker 0

那个案例的酷炫之处在于,那天我们只是让创意团队来拍摄机器人,并请他们带些玩具来。我们没有说别的。他们就像是被要求带玩具来和机器人玩一样。

And what was cool about that case is that that day we were just having the creative team come and film the robots, and we asked them to bring toys. We didn't say anything else. They're like, just bring toys to play with the robot.

Speaker 1

都是机器人以前从未见过的东西。

All things the robot hadn't seen before.

Speaker 0

是的。他们完全不知道机器人接受过什么训练。对吧?所以他们真的带来了这个小篮球框,那是一个可爱的小玩具,带一个小球,然后把它放在机器人面前。再次强调,机器人从未见过任何与篮球相关的东西。

Yeah. They had no idea what the robot was trained on. Right? So they actually brought this little basketball hoop that was a little cute toy with a little ball, and they put it in front of the robot. Again, the robot had never seen anything related to basketball.

Speaker 0

它肯定从未见过这个玩具。他们要求它完成一次扣篮。我们当时都在想,完全不知道这能否成功。实际上,它花了不到四分之一秒的时间,就决定把球放进篮球框里。我们都觉得这太神奇了。

It certainly has never seen this toy. And they asked it to do a slam dunk of the ball. And we were all like, I have no idea if it would work. And actually, it took not even a quarter of a second, and it actually decided to put the ball inside the basketball hoop. And we were all like, that's amazing.

Speaker 0

这基本上是基于Gemini对篮球和扣篮是什么的理解,对吧?这是我们之前根本没想到要教它的概念。是的。它基本上做出了正确的动作。所以那是个非常酷的例子。

And it was just essentially drawing from Gemini's understanding of what basketball is and what a slam dunk is, right, which is a concept we couldn't have thought of teaching it to do. Yeah. And it essentially did the right motion. So that was a really cool example.

Speaker 1

跟我讲讲打包午餐的那个例子。比如说,它对香蕉有概念性的理解。它知道如何抓握香蕉吗?我的意思是,抓握香蕉的方式不能像抓陶罐或比香蕉更脆弱的东西那样?

Talk to me about the packing lunch one. It kind of has a conceptual understanding of what a banana is, for example. Does it know how to grip a banana in the sense that you can't grip a banana in quite the same way as you could a clay pot or something even more fragile than a banana?

Speaker 0

实际上,超级令人印象深刻的一点是这些机器人非常简单。它们实际上没有触觉感应,没有深度感应,也没有力感应。所以它们纯粹是在做眼手协调,并利用对如何抓握香蕉的理解。

Actually, one of the things that is super impressive is that these robots are extremely simple. They actually don't have touch sensing. They don't have depth sensing. They don't have force sensing. So they're literally doing eye hand coordination and using an understanding of how you grasp a banana.

Speaker 0

所以它实际上是看着物体并抓握它。一旦它看到自己手里抓住了东西,就知道自己已经检测到了。还有其他更复杂的机器人,但这迫使模型真正推理它所看到的东西,并决定如何拾取。

So it actually is looking at the object and grasping it. And once it sees that it has the object in hand, that's how it knows that it has detected it. There are other robots out there that are much more complex, but this forces the model to really reason about what it's seeing and make a decision about how to pick that up.

Speaker 1

这正是这里真正原创的地方。

And that's the thing that's really original here.

Speaker 0

是的。这是众多特点之一。关键在于它这样做不仅仅是因为我们教了它一千次如何捡起香蕉,而是因为它从Gemini中对如何拾取物体的理解中提取知识,然后将其适应到行动的世界中。

Yeah. That is one of the many things. It's the fact that it's doing it not just because we taught it a thousand times how to pick up a banana. It's because it's pulling this out of its understanding of how to pick up objects from Gemini and then adapting it to the world of actions.

Speaker 1

因为我可以想象,我的意思是,在网络上有很多视频

Because I can imagine, I mean, there've been lots of videos doing the rounds

Speaker 0

关于

on the Internet for a

Speaker 1

多年来极其令人印象深刻的机器人做后空翻,还有被踢倒、在山坡上跑来跑去之类的。与那些视频相比,把香蕉捡起来放进午餐盒似乎是一项相当简单的任务。但我们这里讨论的是另一种类型的机器人,对吧?

number of years of extremely impressive looking robots doing backflips and, I don't know, being kicked over and sort of running up and down mountains and things. In comparison to those videos, picking up and putting down a banana, you know, into a lunchbox seems like quite a simple task. But we're talking about a different type of robot here, aren't we?

Speaker 0

是的。我的意思是,你试图解决的是一个完全不同的问题。许多那些视频基本上是机器人学习和记忆的排练序列,我们确实对它们印象深刻。但你试图解决的是不同的问题。你在这里试图解决的是让机器人理解打包午餐的含义,根据面前的物体,它需要做什么来把一片面包放进袋子里,然后封口意味着什么。

Yeah. I mean, this is a completely different problem you're trying to solve. Many of those videos are basically rehearsed sequences that the robot has learned and memorized, and we're actually very impressed by them. But it's a different problem that you're trying to solve. What you're trying to solve here is for the robot to reason about what it means to pack a lunch, given the objects in front, what it needs to do in order to put a piece of bread inside of a bag, and then what it means to close it.

Speaker 0

而且它永远不会按你预期的方式进行,因为这些是非常灵活的、会移动的东西。所以它需要根据实际情况做出反应和响应,然后真正完成任务。

And it's never going to go as you expect because these are very flexible things that move around. So it needs to react and respond to what's happening and then actually complete the task.

Speaker 1

这就是通用性的概念。没错。那么你如何比较一个机器人与另一个机器人?你如何决定这个机器人是否比另一个做得更好?

It's that idea of generality. That's right. Yeah. So how do you compare one robot against another? How do you decide whether this robot is doing generality better than another?

Speaker 0

这实际上是我们难以表达的事情之一。当我们甚至为这次发布的演示录制时,演示按定义就像是脚本化的。所以我们觉得,这并没有完全捕捉到我们想分享的内容。这就是为什么我们让团队带来一堆玩具,实际开始与机器人互动,看看会出现什么。最好的捕捉方式是,我们能够通过与它交谈来改变机器人的行为,你可以在视频中看到这一点。

That was actually one of the things that was hard for us to express. When we were even recording for the demos in this release, a demo is, by definition, like scripted. So we were like, this doesn't quite capture what we want to share. That's why we asked the team to bring a bunch of toys and actually start playing with the robots and see what emerges. And the best way to capture it is that we're able to change the behavior of the robot by talking to it, and you can see that in the videos.

Speaker 0

我们实际上能够放入它从未见过的原始物体,并且我们会移动物体以确保人们理解这并非预设行为。事实上,在我们的基准测试中,我们从泛化能力的各个角度评估模型。因此,我们会改变视觉背景,更换背景环境。这些物体都是全新的。

We are actually able to put in raw objects that it's never seen before, and we move objects around to make sure that people understand that this is actually not a pre-scripted behavior. In fact, in our benchmarks, we evaluate our models in all kinds of ways in terms of generalization. So we will change the visual background. We will change the environment. The objects are new.

Speaker 0

我们会添加物体来干扰机器人。我们还会要求它完成全新的任务,甚至可以用不同语言与它交流。比如我直接用西班牙语下达指令,它也能正常执行。

We will add objects to distract the robot. We would also, like, ask it to do completely new things or even you can talk to it in a different language. So I could just give it the instruction in Spanish and it would just actually work.

Speaker 1

我也想谈谈交互性,因为在你们的几个视频中,有一个场景是人类坐在桌旁,机器人在他离开后清理桌面。另一个视频里,人类移动杯子,机器人追着杯子试图把物体放进去。这些交互场景比静态任务要困难多少?

I wanted to talk about interactivity too, because in a few of your videos, there's one where a human is sat at a desk and the robot is kind of clearing up after him as he goes. In another, you've got a human moving a cup around and the robot sort of chasing it trying to put an object inside. How much more difficult are those interactive scenarios than just a static task?

Speaker 0

是的。这些显著更高级的行为和大量交互功能其实都是模型自然涌现的。比如我们并没有预先考虑物体移动多快时机器人会作出反应。我们当然希望模型能快速响应,但视频中展示的很多例子都是人们测试模型时自然产生的,整理桌面的例子也是如此。

Yeah. I mean, the significantly more advanced behavior and a lot of the interactivity sort of just fell out of the model. Like, we were not thinking, for example, how fast can we move these objects before the robot would react. We certainly knew that we wanted a model that could react quickly, but a lot of these examples that we posted on videos just fell out of people playing with the model and seeing how it would behave. Same with organizing the desk.

Speaker 0

那其实是有人在测试机器人时,想看看能多大程度挑战它直到完成完整任务。所以,Gemini本身具备的这些能力在应用到机器人场景时确实展现出惊人价值。现在它能根据你的指令自适应调整,你甚至可以全程对话并实时改变机器人的行为,比如你说‘我要你做这个’...

That was actually someone playing with the robot, deciding to see how much they could challenge it until it was actually able to complete the full task. So, yeah, it actually is amazing to see how a lot of these other capabilities that are already there in Gemini are actually extremely valuable when you bring them into a robot, which is now able to adapt based on what you're saying. So you could actually have a full conversation and change the behavior of the robot as it's moving. So you can say, I want you to do this.

Speaker 0

‘哦不,算了,我要你做那个’,它就会听从指令。这其实有点滑稽。你还可以随时更换周围物体,它都能应对自如。

Oh, no. Actually, never mind. I want you to do this other thing. And it would actually just follow you. It's actually kind of comical. And then you could also change the objects around, and it will just do it.

Speaker 1

有时候我觉得这些机器人没有情感反而是好事,因为它们看起来怪可怜的,就像被研究人员在桌子上追着跑似的。

I think it's kind of a good job sometimes that these robots don't have feelings because they feel very sort of forlorn and, like, just being chased around on a table by researchers.

Speaker 0

是的。实际上它们超级有趣。

Yeah. It's actually super fun.

Speaker 1

那是因为底层的大型语言模型在发挥作用,帮助它实现这一点,对吧?这赋予了它对所操作对象的概念性理解。

That's the large language model sitting underneath it that's helping it do that. Right? That's giving it that conceptual understanding of the objects that it's manipulating.

Speaker 0

没错。所以我们利用Gemini的多模态理解能力,将机器人通过摄像头看到的视觉输入和从人类那里听到的自然语言,转化为如何行动。而且它实际上还会回应。你可以问它是否完成了任务,或者询问它在折叠折纸作品的过程中进展到了哪一步,它确实能理解并作出回应。

That's right. So we're leveraging Gemini's multimodal understanding to take the visual input that the robot is seeing through its cameras and the natural language that it's hearing from the human and then translate that into how to act. And it actually also speaks back. So you can ask it a question about whether it's done. You can ask it a question about how far it is in the process of folding an origami figure and actually understands that and can respond.

Speaker 1

我记得Gemini刚推出时,人们花了很多功夫谈论它的多模态特性。这算是投入所有额外基础工作、确保它能理解视频和照片等的回报吧。

I remember when Gemini was first being launched and there was sort of people went to great lengths to talk about how it was multimodal. This is sort of the payoff for putting in all of that extra groundwork and making sure that it can understand videos and photos and so on.

Speaker 0

这只是众多回报之一。我认为我们人类通过多种不同的感官来感知世界,对吧?所以,如果你想构建一个与我们大脑一样强大的智能体,能够以多模态方式接收输入就超级重要。机器人技术绝对是一个完美的例子,你可以看到它绝对需要理解自然语言和视觉输入,未来可能还需要触觉感知,以便像人类一样做出行动决策。

I mean, one of many. I think us humans capture the world through many different senses, right? So I think it's super important, if you want to build an intelligence as powerful as our brains, to be able to take input in a multimodal way. And definitely robotics is a perfect example where you can see that it absolutely requires an understanding of natural language and visual input, and presumably in the future also touch sensing, in order to make decisions about how to act the same

Speaker 1

人类就是这样做的。但为什么机器人需要对所做的事情有概念性理解呢?我的意思是,好吧,也许你不会称它们为智能,但有些机器人,比如洗碗机或割草机,对吧,它们对盘子或草没有概念性理解。这真的有必要吗?

way humans do. Why does it matter though that robots should have a conceptual understanding of what they're doing? I mean, okay, maybe you wouldn't call them intelligent, but there are robots like, I don't know, dishwashers or like lawnmowers, right, that don't have a conceptual understanding of what a plate is or what grass is. I mean, is it actually necessary?

Speaker 0

我相信有些应用场景中,机器人只需重复动作就能很好地工作。但我们真正感兴趣的是构建能够真正推理并以非常通用的方式行动的机器人。嗯。因为现实世界非常混乱,事情永远不会完全按计划进行。有很多任务中情况在不断变化,而这实际上为这些机器人打开了应用的可能性。

I'm sure that there's applications where you can have a robot that can just repeat the actions and it will be just fine. But we're interested in actually building robots that can really reason and act in a very general way, just because the world is really messy. Things will never go exactly according to plan. And there's a lot of tasks where things are constantly changing, and it actually just opens up the opportunity of applications for these robots.

Speaker 0

它们实际上可以出现在任何人类执行任务的地方。这使得它们能够在家庭环境中发挥作用,但

They could literally be anywhere that a human could be doing a task. So that enables them to be helpful in home environments, but

Speaker 1

同样也能在制造环境中发挥作用。机器人技术中有一些重要方面,我认为现在借助标准的Gemini模型实际上变得相当容易。比如指向或绘制边界框。请向我们解释这些是什么。

also in manufacturing environments. There are some things that are important in robotics that I think now with the sort of standard Gemini actually are quite easy. Things like pointing or drawing bounding boxes. Just explain to us what those are.

Speaker 0

嗯,基本上,这是我们为了助力机器人技术而必须改进Gemini的一个领域。举个例子,如果你面前有一个物体,我们所说的指向是指我能够精确识别该物体上的任意点。比如你面前有一件T恤,如果我指向衣领,它应该能说出这是衣领;或者我说衣领,它应该能识别出衣领所在的位置。

Well, basically, this is one of the areas where we actually had to improve Gemini in order to help with robotics. So if you have, for example, an object in front of you, what we mean by pointing is that I can literally identify any point on that object. So imagine that you have a T-shirt in front of you. If I point to the collar, it should say this is the collar. Or if I say collar, it should identify where the collar is.

Speaker 0

你可能会觉得这并不重要,但实际上,如果你要折叠那件T恤,就需要知道衣领在哪里、T恤的下摆在哪里,以及所有不同的组成部分。边界框的意思是能够识别物体的所有边缘,从而知道物体在哪里结束,环境从哪里开始。我认为这类例子对我们人类来说微不足道,甚至无需思考。但如果机器人能够获取这类信息,它们就能更智能地在物理世界中采取行动。

And you might imagine that this is not that important, but actually, if you're trying to fold that T-shirt, you need to know where the collar is, where the bottom of the T-shirt is, and all of the different components. Bounding boxes mean that you can identify all the edges of that object so that you know where the object ends and the rest of the environment begins. So these kinds of examples are, I think, trivial for us humans. We don't even think about it. But if robots are actually able to have access to that kind of information, then they can be smarter about the way they take action in the physical world.

Speaker 0

这本质上就是我们所说的具身推理。

This is what we call embodied reasoning, essentially.
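The pointing and bounding-box outputs described above can be pictured as simple data structures. This is a minimal sketch assuming normalized image coordinates; the class names, labels, and coordinate values are illustrative, not the actual Gemini Robotics output format.

```python
# Sketch of "pointing" (a labeled point on an object) and a
# "bounding box" (the object's extent in the image).
from dataclasses import dataclass


@dataclass
class Point:
    label: str
    x: float  # normalized image coordinates, 0.0 to 1.0
    y: float


@dataclass
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        """Does a pointed-at part fall inside this object's extent?"""
        return (self.x_min <= p.x <= self.x_max
                and self.y_min <= p.y <= self.y_max)


# A detected T-shirt with a pointed-at collar:
shirt = BoundingBox("t-shirt", 0.2, 0.1, 0.8, 0.9)
collar = Point("collar", 0.5, 0.15)
```

A planner could use a check like `contains` to confirm that the pointed-at collar actually belongs to the detected T-shirt before planning a fold.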

Speaker 1

这与标准Gemini模型中的推理有什么不同?

How is it different from the kind of reasoning that you get in the standard Gemini model?

Speaker 0

是的。我们将具身推理称为像人类那样对物理世界细节进行推理。比如你要为孩子打包午餐,嗯,要做到这一点,你必须理解所有物体在三维空间中的位置。

Yeah. We refer to embodied reasoning as reasoning about the details of the physical world the way humans do. If you're going to take action, say that you're trying to pack a lunch for your kid. In order to do that, you have to understand where all the objects are in 3D space.

Speaker 0

那么你需要理解如何抓取每个物体,以便将其放入那个盒子中。然后你需要弄清楚如何组织所有这些部件,使它们能放得下。所有这些就是我们所说的具身推理。

Then you need to understand how to grasp each object in order to pack it into that box. And then you need to figure out how to organize all those pieces so that they fit. All of this is what we mean by embodied reasoning.

Speaker 1

所以这是不是像,我不知道,比如说你有两个摄像头视角。比如你在那里,我在这里。我能看到你的麦克风,你也能看到,但我们对它的视角完全不同。是这种类型的东西吗?

So is this things like, I don't know, let's say you've got two camera views. You're there and I'm here, for instance. I can see your microphone and so can you, but we've got a completely different view of it. Is it that kind of stuff?

Speaker 0

是的。我的意思是,如果你能理解,例如,麦克风离我们的脸有多远,而且如果我移动,它还能进行物体对应,意思是它理解这个麦克风与我从另一个视角看到的是同一个,你可以想象这对于一个机器人在移动并对其环境进行推理时超级重要。

Yeah. I mean, if you can understand, for example, how far the microphone is from our face, but also if I move around, it can do object correspondence, meaning it understands that microphone is the same one that I'm seeing from the other point of view, which you can imagine is super important if a robot is moving and reasoning about its environment.

Speaker 1

从二维图像,比如单一摄像头视角,切换到对空间的三维理解有多难?

How hard is it to switch from a 2D image, like a single camera view, to a 3D understanding of the space?

Speaker 0

实际上,如今机器人做的是它们从不同位置获取摄像头视图。所以实际上,机器人手腕上有摄像头,顶部也有一个摄像头。它实际上已经获取了三个图像的所有输入,并自行处理。它实际上在推理,哦,我离物体更近了,因为这个摄像头看起来更近。这个摄像头,我能看到我的手,它正在自行完成所有这些关联。

So actually, today, what robots are doing is taking camera views from different places. So actually, the robot has cameras in its wrists and has a camera on top. And it's actually taking all of the inputs from the three images and doing this on its own. It's actually reasoning: oh, I'm closer to the object because now this camera looks closer. In this camera, I can see my hand. And it's doing all of that association on its own.

Speaker 0

我们没有明确地将深度作为额外输入添加进去。我们只是给它多个摄像头视图,而它正在意识到如何使用它们来理解深度。

We're not explicitly adding depth as an additional input. We're just giving it multiple camera views, and it's realizing how to use them in order to understand depth.

Speaker 1

这其中有多少是你们刻意将其设定为机器人的任务?又有多少是从Gemini模型获得的概念理解中自然涌现出来的?

And how much of that was you deliberately setting that as a task for the robots? Or how much of it sort of emerged from the conceptual understanding that you get from the Gemini models?

Speaker 0

它实际上是自然涌现的。所以我们能够给它配备多个摄像头,看看是否真的能在它们之间进行推理。

It simply emerged, actually. So we were able to give it multiple cameras and just see if it actually could reason between them.

Speaker 1

我的意思是,这肯定相当令人震惊,对吧?我想你们肯定花费了很多很多年,很多人一定非常努力地思考这个问题——如何对齐不同摄像头的视角,使得能够跨不同角度追踪物体。然后突然之间,有了这些大型语言模型,比如Gemini,它就能自动完成这一切。

I mean, that's gotta be quite shocking. You know? I mean, I imagine many, many people must have spent many, many years thinking very hard about that problem of how you align different camera views so that you're tracking an object across different angles. And then all of a sudden, you get these large language models, you know, like Gemini, and it can just do it automatically.

Speaker 0

是的。我的意思是,能够利用这些模型为系统带来简洁性真的很棒。你真的不再需要所有这些不同的阶段:先提取深度,然后才提取物体位置,接着再规划如何移动,最后才能执行任务。

Yeah. I mean, it's actually wonderful to be able to leverage these models to bring simplicity to the system. You really don't need all of these different stages where you extract depth, and only then extract where the objects are, and only then plan how to move, and only then are you able to do the task.

Speaker 1

这是因为基础模型实际上就像瑞士军刀一样,什么都能做。

And that's because the foundational model is effectively like a Swiss army knife. Like, it can do all of the things.

Speaker 0

没错,正是如此。而且它能在它们之间进行推理,对吧?好的。

Yes. Exactly. And it can reason between them. Right? Okay.

Speaker 1

所以你几乎增强了物理推理能力。

So you enhance the physical reasoning almost.

Speaker 0

完全正确。你增强了物理推理和空间理解能力。接下来就是运动理解,即理解如果我把玻璃杯放在桌子边缘,实际上可能会发生什么。所有这些领域都是我们增强的方向。但这还不够。

Exactly. You enhance physical reasoning and spatial understanding. And then motion understanding would be the next thing: understanding what would happen if I put a glass at the edge of the table, what is actually likely to happen. All of these are areas that we enhanced. But that's not enough.

Speaker 0

实际上你需要再进一步,开始教 Gemini 动作的语言。对我们来说,动作意味着理解机器人每个关节的实际运动方式。比如这是我的机械臂,我在教 Gemini 如何移动机器人,如何像这样移动我的手臂。这些本质上都是数字。对吧?

You actually have to take it another step and essentially start to teach Gemini the language of actions. And actions for us means understanding how you are actually moving each joint in a robot. So if this is my robot arm, then I'm teaching Gemini how to move the robot, how to move my arm like this. And these are all essentially numbers. Right?

Speaker 0

它正在学习如何把"拿起杯子"转化为"为了拿起杯子而移动手臂"的具体动作。所以你本质上是在教它一门新语言。

And it's learning to translate what it means to pick up a glass versus move my arm in order to pick up a glass. And so you're essentially teaching it a new language.
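One common way to make actions "a new language" for a sequence model is to discretize continuous joint angles into a fixed vocabulary of integer tokens the model can predict alongside text and image tokens. The sketch below shows that general idea; the bin count, joint range, and function names are illustrative assumptions, not the actual Gemini Robotics action encoding.

```python
# Sketch: continuous joint angles <-> discrete action tokens.
N_BINS = 256  # size of the hypothetical action vocabulary
JOINT_MIN, JOINT_MAX = -3.14, 3.14  # joint range in radians


def angle_to_token(angle: float) -> int:
    """Map a joint angle to one of N_BINS discrete action tokens."""
    clipped = max(JOINT_MIN, min(JOINT_MAX, angle))
    frac = (clipped - JOINT_MIN) / (JOINT_MAX - JOINT_MIN)
    return min(int(frac * N_BINS), N_BINS - 1)


def token_to_angle(token: int) -> float:
    """Decode a token back to the center of its bin."""
    width = (JOINT_MAX - JOINT_MIN) / N_BINS
    return JOINT_MIN + (token + 0.5) * width


token = angle_to_token(1.57)  # roughly "elbow at 90 degrees"
```

Once actions are tokens, "pick up the glass" and the joint motions that accomplish it live in the same sequence space, which is what lets a multimodal model learn to translate between them.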

Speaker 1

你在连接这些不同的概念。完全正确。那我们能否将其视为两个系统协同工作?我这里想到的是系统一和系统二的类比,就是丹尼尔·卡尼曼在《思考,快与慢》中提出的概念。

You're connecting those different ideas. Exactly. Can we think of this as two systems working in tandem then? I mean, I'm thinking of the analogy here of system one and system two, the Daniel Kahneman thinking fast and slow thing.

Speaker 0

是的,完全正确。本质上,我们构建的模型实际上包含两个模型:一个系统速度较慢但推理思考能力非常强大,另一个系统速度更快但擅长反应性任务。

Yes. Exactly. So essentially, the model that we built actually has two models. It has a system that is slow but very powerful at reasoning and thinking. And a system that is faster but very good at reactivity.

Speaker 1

这就像人脑的工作方式对吧?就像大脑中有一部分非常擅长计算分析,同时还有非常本能化的反应侧。

This is like how human brains work. Right? Like that you have the kind of the part of your brain that's very good at calculation and analysis. And then you also have your very instinctive reactive side too.

Speaker 0

没错。事实上,我们现在做的这两个模型中,一个比另一个大得多——正如你所想——实际上运行在服务器上。而快速模型则部署在设备端,能够非常迅速地响应。

Yes, that's right. In fact, what we do today is that one of these models is much bigger than the other, as you can imagine, and actually lives on the server. And the fast model lives on device and can respond very quickly.

Speaker 1

那么请详细说明一下,在系统一和系统二的框架下,以及那个从未见过的扣篮例子中,这是如何工作的?具体过程是当你...

Talk me through how this works then in terms of the system one and system two and that example of a slam dunk, something it's never seen before. How does it work? So what happens is when you

Speaker 0

让机器人拿起篮球完成扣篮动作,系统二需要理解这意味着什么。它需要知道什么是篮球,需要识别面前物体的位置,比如篮球在哪里,理解有一个篮筐,然后明白扣篮实际上是指拿起球并放进篮筐。所以它理解所有这些,并预测机器人应该执行的大致运动轨迹,然后将这个轨迹交给系统一,系统一在设备上运行,能够接收这个轨迹,但同时也会处理视觉输入并调整轨迹。比如,如果我中途伸手阻挡或移动物体,它仍然能够响应,因为它已经理解了扣篮的概念,并能快速做出反应。

ask the robot to take the basketball and do a slam dunk, system two has to understand what that means. You know, what is basketball? It has to understand where the objects in front of it are, like where the basketball is, understand there's a hoop, and then that a slam dunk actually means picking up that ball and putting it there. So it understands all of that and predicts a rough trajectory of what the robot should do in terms of how it should move, and then hands that over to system one, which is on device and is able to take that trajectory, but it also takes the visual input and is able to adjust that trajectory. So if I were to, for example, get in the way, put my hand in the middle, or move the object around, it would still be able to respond, because it already understood the concept of what a slam dunk was, and respond very quickly.
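The division of labor described here, a slow planner proposing a coarse trajectory and a fast on-device loop nudging it as the target moves, can be caricatured in one dimension. Everything below (the function names, the proportional correction, the moving hoop) is a toy assumption to show the structure, not the real system.

```python
# Toy sketch of the two-system split: system 2 plans once, slowly;
# system 1 cheaply corrects every step toward where the goal is NOW.

def slow_planner(start: float, goal: float, steps: int = 5) -> list[float]:
    """System 2: plan a rough straight-line trajectory (expensive, runs once)."""
    return [start + (goal - start) * i / steps for i in range(1, steps + 1)]


def fast_controller(waypoint: float, current_goal: float, gain: float = 0.5) -> float:
    """System 1: nudge a planned waypoint toward the goal's current position."""
    return waypoint + gain * (current_goal - waypoint)


# Plan toward a hoop at position 1.0, but the hoop shifts to 1.2 mid-motion.
plan = slow_planner(start=0.0, goal=1.0)
executed = []
for i, wp in enumerate(plan):
    goal_now = 1.0 if i < 2 else 1.2  # the target moves partway through
    executed.append(fast_controller(wp, goal_now))
# The executed path ends past the stale plan, tracking the moved hoop.
```

The point of the split is visible even in this caricature: the slow plan alone would end at 1.0, while the fast corrections pull the final step toward the hoop's new position without re-running the planner.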

Speaker 1

不过,为什么需要两个系统呢?比如,为什么不能只用那个慢速但聪明的系统?

Why do you need two systems at all, though? Like, why can't you just use the slow, clever one?

Speaker 0

是的。实际上我们确实可以只用那个慢速但聪明的系统,但那样的话视觉反应会明显变慢,而且对环境变化的适应速度也不够快。这一点很重要,尤其是在处理物体可能移动的任务时。比如想象一下在空中叠T恤,就像人类做得很熟练那样,你实际上在移动T恤,而物体的移动方式是你无法预测的。所以你需要能够快速响应才能完成任务。

Yeah. I mean, we could actually just use the slow, clever one, but then it would actually be significantly, visibly slower, and it wouldn't adapt as quickly to changes in its environment. That's important, especially if you're doing something where the objects will move around. So if you have, for example, imagine when you're folding a T-shirt in the air, right, which as humans we do pretty clearly, you're actually moving this T-shirt and things are moving in ways that you don't predict. So you need to be able to respond quickly in order to actually complete the task.

Speaker 0

所以你肯定需要一个快速系统。而慢速系统则使我们能够进行更复杂的推理。如果你执行的任务不需要高级推理,你也可以只使用一个小型系统。

So you definitely need a fast system. And the slow system enables us to do much more complex reasoning. So you could also just live with the small system if you're doing tasks that don't require advanced reasoning.

Speaker 1

这是否直接复制了人脑的工作方式?我的意思是,你知道,丹尼尔·卡尼曼的研究可以追溯到20世纪70年代左右,我们已经理解人脑就是这样工作的。这是直接复制吗?不,完全不是。

Was it a direct copy of how things work in the human brain? I mean, you know, the Daniel Kahneman work goes back to the 1970s or so, right, that we've understood that that's how the human brain works. Was it a direct copy? No, not at all.

Speaker 0

我认为我们确实是从慢速系统开始的,就像你说的。为什么我们不只用一个模型来解决这个问题?我们发现,实际上,如果你想执行高度灵巧的行为,涉及复杂或任何类型的复杂操作,你需要快速响应。而这是我们能找到的最佳组合。哇。

I think we started definitely with the slow system, as you said. Why don't we just solve this with one model? And we found that actually, if you want to do highly dexterous behaviors with complex or any kind of complex manipulation, you need to respond quickly. And that was the best combination that we could find. Wow.

Speaker 1

这几乎就像进化是一个非常好的优化过程。找到了既快速又聪明的策略。是的,绝对是。我确实认为有时候人体在大脑意识到之前就已经知道该怎么做。比如,你可以不假思索地接住掉落的杯子,或者你可以将动作变成肌肉记忆,比如弹钢琴时,你实际上可以完全在想别的事情。

It's almost like evolution is a really good optimizer. It finds really good strategies for quick but clever things. Yes, definitely. I do think that there are times when the human body knows stuff before your brain does, as it were. You know, like you can catch a falling glass without thinking, or you can commit things to muscle memory, like playing a piano, where you can actually be thinking about completely different things.

Speaker 1

你是否也观察到机器人有类似的情况,它们似乎拥有一种独立于缓慢、聪明系统之外的物理智能?

Are you seeing similar things with the robots that they they almost have a physical intelligence that's separate from the slow, clever system?

Speaker 0

我们确实发现,如果你采用能够推理的模型,并为其提供大量特定任务的示例,它在该任务上会变得非常非常出色。但目前的问题是,如果这样做得太多,它就会开始丧失一些泛化能力。哦。所以这是一个活跃的研究领域:我们如何让机器人在某项任务上变得极其精通,比如非常困难的任务,同时又不失去任何泛化能力?所以目前这实际上是一个平衡问题。

So we definitely see that if you take the model that can reason and you give it a lot of examples of a particular task, it will get really, really good at that task. But at the moment, if you do too much of that, it will start forgetting some of the generalization. Oh. So this is an active area of research: how do we enable the robot to get really, really good at a task, like a really extremely difficult one, and not lose any of the generalization? So right now it's actually a balancing act.

Speaker 1

我的意思是,在某种程度上,人类也会发生这种情况。比如,我认识一些人,他们数学非常非常非常非常厉害,但系鞋带却糟糕透顶。好吧。所以如果这就是幕后发生的情况,对吧?如果我们有你描述的系统一和系统二,那么这些机器人确实拥有这些非常令人印象深刻的新能力和功能,这与我们之前的情况截然不同。

I mean, in some ways that does happen with humans too. Like, I know some people who are really, really, really good at maths and terrible at tying their own shoelaces. Okay then. So this is what was going on behind the scenes, right? So we've got system one and system two, as you described it, and it's also definitely true that these robots have these very impressive new abilities and capabilities, which are very different from where we were before.

Speaker 1

上次我参观DeepMind的机器人实验室时,可以说机器人的动作有点笨拙。我认为这可能是最客气的说法了。让我给你播放一小段视频。所以它只有一种方式可以握住这个红色物体并成功捡起来。但它还没弄清楚是哪一种方式。

Last time I visited DeepMind's robotics lab, I think it's fair to say that the robot's movements were a bit clumsy. I think that's probably the kindest way to say it. Let me just play you a little clip. So there's only one way round that it can hold this red object and successfully pick it up. And it hasn't worked out which way.

Speaker 1

不幸的是,每次它试图旋转并捡起来——哦,等等。我想它成功了。它成功了。幸好这些东西不会灰心。我觉得问题是,我大概五年前去过那个实验室,而这些可怜的机器人五年后还在那里,试图完成同样需要最低限度灵巧度的任务。

And unfortunately, every time it tries to rotate and pick it up, oh, hang on. I think it's got it. It's got it. It's a good job these things don't get disheartened. I think the thing is that I'd been in that lab maybe five years earlier, and these poor robots were still there five years later, trying to do the same minimally dextrous tasks.

Speaker 1

是什么改变了?因为我理解拥有Gemini,也就是那个缓慢聪明的系统,可以改善对事物的概念理解,但这并不会改变灵巧度。它不会改变它操纵这些物体的容易程度,对吧?

What changed? Because I understand how having Gemini, you know, the slow clever system could improve the conceptual understanding of things, but that doesn't change the dexterity. It doesn't change how easily it can manipulate these objects, does it?

Speaker 0

对。是的。去年,我们基本上把所有精力都花在了攻克灵巧度上,这仍然是一个活跃的研究领域。但有几点发生了变化。一是我们意识到,如果我们能让人类通过远程操作或操纵演示来向机器人展示如何执行非常复杂的行为,这意味着你给人类额外的一对机器人手臂,他们实际上可以假装是机器人,并向机器人展示如何完成任务。

Right. Yeah. Last year, we spent basically all of our effort on tackling dexterity, and this is still an area of active research. But there's a couple of things that changed. One is that we realized that if we can enable humans to show the robot how to do very complex behaviors through teleoperation or puppeteering, what this means is that you give the human an extra pair of arms, robot arms, and they can actually pretend to be the robot and show the robot how to do the task.

Speaker 0

如果这种方式变得非常直观,那么你就可以收集大量机器人执行任务的数据,这些数据是由人类远程操作的,但它是机器人数据。

And if that becomes really intuitive, then you can capture a lot of data of the robot doing the task, being teleoperated by a human, but it's robot data.

Speaker 1

那么让我理解一下。所以人类可能戴着一个头戴式摄像头。是的。这实际上就是在假装自己是机器人。所以它操作着机器人的手,戴着摄像头,看着机器人会看到的东西,但按照它希望机器人做的方式执行任务。

So let me understand then. The human is wearing maybe like a head cam. Yes. They're quite literally pretending to be the robot then. So they're operating the robot's hands with their own hands, wearing the head cam, watching what the robot would be watching, but doing the task as they want the robot to do it.

Speaker 0

没错。所以有不同的远程操作示例。一种是你直接坐在机器人面前。这样你可以直接看到机器人在做什么,然后移动机器人的手臂。实际上,你就像在操纵木偶一样控制机器人。

That's right. So there's different teleoperation examples. One is where you actually sit in front of the robot. So you have direct visibility to what the robot is doing, and you move the robot arms. Literally, you're puppeteering the robot.

Speaker 0

还有其他示例,比如你戴上VR头盔和手套,实际上你假装自己是机器人并移动这些东西。这需要第二个组件,即扩散模型。这些模型实际上也被用于例如Imagen来生成图像。本质上,它做的是从大量数据中提取执行该任务的许多示例,并预测执行该任务所需的动作轨迹。所以当你将这两者与巧妙的Transformer架构和良好的数据集结合时,你实际上可以学会任何事情。

And there's other examples where you put on a VR headset and gloves, and you actually pretend to be the robot and move things around. And that required a second component, which was diffusion models. These are the same models that actually get used, for example, by Imagen to generate images. And, essentially, what it's doing is extracting, from a lot of data, a lot of examples of doing that task, and predicting the action trajectory that it needs to follow in order to do that task. So when you combine those two with a clever transformer architecture and a good dataset, you can actually learn anything.
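The trajectory-prediction step can be illustrated with a toy denoising loop. This is a sketch of the diffusion idea only, with invented names and numbers: a trained diffusion policy predicts the noise to remove from a noisy action trajectory, and here a stand-in "denoiser" built from the pointwise mean of the demonstrations plays that role.

```python
# Toy illustration of the diffusion idea, not Gemini Robotics code.
import random

def average_trajectory(demos):
    """Stand-in for a trained denoiser's target: the pointwise mean
    of the demonstrated trajectories."""
    steps = len(demos[0])
    return [sum(d[t] for d in demos) / len(demos) for t in range(steps)]

def denoise(demos, steps=50, horizon=8, seed=0):
    """Start from pure noise and iteratively refine toward a
    trajectory consistent with the demonstrations."""
    rng = random.Random(seed)
    clean = average_trajectory(demos)
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # start from noise
    for _ in range(steps):
        # Remove a fraction of the predicted noise (x - clean); a real
        # model learns to predict this quantity instead of knowing it.
        x = [xi - 0.2 * (xi - ci) for xi, ci in zip(x, clean)]
    return x

# Demos: three teleoperated examples of a 1-D reach from 0 toward 1,
# each an 8-step sequence of positions.
demos = [[t / 7 for t in range(8)],
         [min(1.0, t / 6) for t in range(8)],
         [0.9 * t / 7 for t in range(8)]]
traj = denoise(demos)
# traj now closely tracks the demonstrated reach motion.
```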

Speaker 0

这实际上再次让研究人员感到非常惊讶。例如,那时我们发现我们可以系鞋带、叠衣服、做折纸。所以在这项工作中,我们将Gemini强大的推理模块与我们学到的灵巧任务能力结合起来。

And that was actually really, again, surprising to the researchers. That's when, for example, we discovered that we could tie shoelaces, we could fold laundry, we could do origami. And so what we did in this work is that we combined the powerful reasoning modules from Gemini with what we had learned around being able to do dextrous tasks.

Speaker 1

你还记得当你意识到这些特性开始显现时的情景吗?那一定有点令人震惊。

Do you remember when you realized that these kinds of properties were emerging? It must have been a bit of a shock.

Speaker 0

我想第一次是在我们看到机器人实际系鞋带的时候。我们当时想,这不可能。事实上,当研究人员设置这个任务时,他们实际上是为了挑战自己。我记得有一位教授说过,如果我们能让机器人系鞋带,我就退休。团队中的研究人员当时就想,正好。

I think the first time was when we saw the robots actually tying shoelaces. We were like, that's not possible. In fact, when the researchers set up this task, they actually did it to challenge themselves. I think there was a professor who said, if we can get robots to tie shoelaces, I will retire. And the researchers on the team were like, right on.

Speaker 0

我会把这个添加为一项任务。他们确实这么做了。当他能够完成时,他们感到惊讶。我不知道发生了什么,是教授真的看了视频决定退休,但灵感肯定是从那里来的。我们只是继续添加任务,越来越多的任务。

I'm going to add that as a task. And so they actually did. And they were surprised when it was able to do it. I don't know whether the professor actually saw the video and decided to retire, but the inspiration certainly came from there. And we just continued to add tasks, more and more tasks.

Speaker 0

折纸的例子也是如此。我们当时想,我们不知道这是否会奏效,但让我们试试看。结果它做得出乎意料地好。而且它实际上非常精细。它必须准确地折叠纸张的每一部分,并且必须按照正确的顺序进行。

Same with the origami example. We were like, we have no idea if this is gonna work, but let's try it. And it actually was surprisingly good at it. And it's actually really delicate. It has to fold every part of the paper, and it has to do it in the right sequence.

Speaker 0

如果出现任何差错,它就会迷失方向,不得不重新开始。就像人类做这件事时一样。

If anything goes wrong, it sort of loses its way and has to restart. Same as if a human was doing it.

Speaker 1

我记得我第一次采访戴密斯的时候。他谈到了莫拉维克悖论。这个观点是,对人类来说容易的任务对机器来说很难,反之亦然。考虑到我们现在在机器人技术方面的所有这些进步,你认为莫拉维克悖论在未来还会成立吗?

I remember the very first time I got to interview Demis. He was talking about Moravec's paradox, this idea that tasks that are easy for humans are hard for machines, and vice versa. With all of these advances that we have now in robotics, do you think that Moravec's paradox will hold going forwards?

Speaker 0

我当然认为,对于机器人来说,做那些对我们人类来说非常直观的事情仍然更加困难。所以我认为莫拉维克悖论仍然成立,但我们现在已经到了这样一个阶段:如果你能让机器人执行一个非常复杂的任务,它可以学会它。

I certainly think that it is still more difficult for robots to do something that is incredibly intuitive for us humans to do. So I think Moravec's paradox still holds, but we are now at the point where we are confident that if you can operate a robot to do a very complex task, it can learn it.

Speaker 1

那这需要多快?我的意思是,一个机器人需要看人类折多少只纸狐狸才能

And how quickly does it happen? I mean, how many origami foxes does a robot need to watch a human do before it can

Speaker 0

自己折一只?是的。这取决于任务的复杂性,与人类的情况非常相似。对吧?任务越复杂,你需要练习的次数就越多才能掌握它。

do one itself? Yeah. It varies by the complexity of the task, pretty similar to the way it is for humans. Right? The more complex the task, the more you need to practice it before you can master it.

Speaker 0

所以有很多任务你只需要大约100个示例就能掌握,而像折纸狐狸这样的任务则需要大约一千个示例。

So there's a lot of tasks that you can master with just about 100 examples, and tasks like the origami fox take about a thousand examples.

Speaker 1

等等。所以人们必须假装成机器人,折一千次纸狐狸?

Wait. So people had to fold origami foxes while pretending to be a robot a thousand times?

Speaker 0

是的。没错。

Yes. That's right.

Speaker 1

好吧。这真是太有趣了。

Okay. That is incredibly amusing

Speaker 0

对我来说。我们正试图尽可能减少示例数量。而且我们仅用大约十几个示例就能掌握相当多的任务。

to me. We're trying to reduce it as much as possible. And we are able to get quite a few tasks working with just, like, a dozen examples.

Speaker 1

这要看情况。有没有一些任务完全不需要任何示例?

It depends. Are there some that you don't need any examples for at all?

Speaker 0

对。这就是我们在很多测试示例中看到的,比如当你和机器人玩耍,要求它在全新的场景中执行许多抓取和放置任务时,你不需要重新教它。而且这种情况正在扩展并变得越来越复杂。例如,在移动瓷砖的案例中,它能够推理瓷砖的位置并决定将它们放在哪里。

Right. So that's what we saw in a lot of the examples we were testing: when you're playing with the robot and asking it to do a lot of pick and place tasks in completely new scenarios, you don't have to teach it again. And this is expanding and getting more and more complex. So for example, in the cases with the tiles, where you're moving the tiles around, it can just reason about the positioning of the tiles and decide where to put them.

Speaker 1

那打包午餐的情况呢?

What about the packed lunch?

Speaker 0

那个更复杂一些。因为在这个任务中,你实际上是在执行一系列较长的操作,大约五分钟的任务,而且你还要处理像密封袋这样非常易变形的物品,进行非常精细的操作。所以任务越精细,就越需要在该任务中看到示例。

That one is more complex. Because, again, in that one, you are actually doing a long sequence of tasks, about five minutes long, and you're actually picking up very deformable things, like the ziplock bag, and doing very delicate stuff. So the more delicate the task, the more likely it is you need to see examples of that task.

Speaker 1

那么如果这些机器人需要看示例,这是否会影响它们的通用性?

So if these robots are having to see examples, does that end up impacting the generality of it?

Speaker 0

只在某种程度上。我们确保做的一件事是收集数千个示例的数据,而不特别侧重任何新任务。如果你确实想做折纸任务,我们就专门为折纸任务进行定制,这确实会影响当前模型的泛化能力。我们希望最终能达到这样一种状态:基本上你可以教它任何新任务,掌握任何新任务,而通用性保持不变。但目前,这是一个权衡。

Only to a degree. One thing that we make sure we do is collect data across thousands of examples without a very large emphasis on any one task. If you do want to do the origami task, we simply specialize it for the origami task, and that does affect the generalization of the models today. We're hoping to get to a state where you can basically teach it any new task, master any new task, and the generality remains intact. But today, it's a trade-off.

Speaker 1

所以在理想情况下,你能够说,比如,给我折一只纸船。对吧?然后它就能根据之前理解的一切来完成这个任务。

So in the dream world, you would be able to say, I don't know, fold me an origami boat. Right? And it would be able to do that just from everything that it understood before.

Speaker 0

是的。在理想情况下,你只需观看别人做这件事的视频,就能从中学习。

Yeah. In the dream world, it could just watch a video of someone doing it and learn from that.

Speaker 1

那么,我的意思是,强化学习在机器人技术中曾经非常重要,持续了相当长一段时间。现在它已经消失了吗?

So, I mean, reinforcement learning was a big thing in robotics for quite a stretch of time. Has that just disappeared now?

Speaker 0

完全不是这样。完全不是。我们仍然在强化学习方面做了相当多的工作,并且我们继续探索如何将这些大型基础模型与强化学习相结合。首先,我们在全身控制方面所做的所有工作,比如如果有一个类人机器人四处走动或是一个四足机器人,它们都在使用强化学习来学习如何行走。

Not at all. Not at all. We do quite a bit of work still with reinforcement learning, and we continue to explore ways to combine these big foundation models with reinforcement learning. First of all, all of the work that we do around whole body control, like if we have a humanoid that is walking around or a quadruped, they're all using reinforcement learning to learn how to walk around.

Speaker 1

因为当它摔倒时,很容易就说它失败了。

Because it's very easy to say you fail when you fall over.

Speaker 0

这是一项非常成熟的技术,实际上你可以在模拟中完全学会它。所以它不需要通过摔倒来学习。你可以在模拟中学习,然后将其转移到现实世界。我们在这方面的一个例子是最近一篇名为DemoStart的论文。在DemoStart中,你基本上向机器人展示如何做五个不同的示例。

It's a very mature technology, and you actually can learn it all in simulation. So it doesn't need to fall in order to learn. You can learn it in simulation and then transfer it to the real world. One example that we had of this was a recent paper called DemoStart. In DemoStart, you basically show the robot how to do five different examples.

Speaker 0

这是操作一只手的动作。所以当你展示时,是五个不同的示例,展示如何拿起一个物体并以特定方式将其放置到一个插入位置。

This is manipulating a hand. So you show it five different examples of how to pick up an object and place it in a particular way, an insertion task.

Speaker 1

插入,你是指像,我不知道,比如说把钥匙插入锁孔这样的事情吗?

By insertion, do you mean things like, I don't know, putting a key in a lock, for instance?

Speaker 0

是的,没错。能够将一个物体放入另一个物体内部。然后你只需给它五个示例,它就会自行探索并学会如何做,并将现实世界中所需的数据量大幅减少约100倍。我们认为这将至关重要,因为事实上你不可能为机器人演示每一个任务。

Yes. Exactly. Being able to put one object inside another. And then you just give it five examples, and it explores on its own, learns how to do it, and drastically reduces the amount of data that you need in the real world by, like, 100x. We think this is gonna be critical, because the truth is you're not gonna be able to demonstrate for the robot how to do every single task.
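The recipe of a few demonstrations plus autonomous exploration can be caricatured in a few lines. This is a hypothetical toy, not the DemoStart algorithm itself: a 1-D "insertion" task where reward peaks at an unknown hole position, the learner starts from the best of a handful of rough demos, and simple trial and error stands in for reinforcement learning in simulation.

```python
# Hypothetical sketch of demo-seeded exploration; all values invented.
import random

HOLE = 0.72                                  # unknown to the learner

def reward(action):
    return -abs(action - HOLE)               # closer to the hole is better

def learn(demos, iters=200, noise=0.05, seed=0):
    rng = random.Random(seed)
    best = max(demos, key=reward)            # start from the best demo
    for _ in range(iters):
        candidate = best + rng.gauss(0.0, noise)  # explore near the best
        if reward(candidate) > reward(best):
            best = candidate                 # keep improvements only
    return best

demos = [0.30, 0.50, 0.55, 0.60, 0.90]       # five rough human examples
print(learn(demos))                          # converges close to HOLE
```

The demos only need to be roughly right: they narrow the search enough that random exploration can finish the job, which is the data-efficiency point being made here.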

Speaker 0

当然。有些任务会很复杂,它们无法直接从互联网的知识中提取出来。它将不得不进行探索。

Of course. Some of the tasks are going to be complex, and it won't be able to extract those directly from its knowledge of the Internet. It's going to have to explore.

Speaker 1

比如说,做手术之类的吧。

Doing surgery, for example, maybe.

Speaker 0

是的。所以它需要通过探索自身行为来学习。这正是我们想花更多时间研究的领域之一:如何让机器人在工作中学习。

Yes. So it's gonna have to explore and learn from its behavior. And that's one of the areas we want to spend a lot more time on: how do you get robots that learn on the job?

Speaker 1

那么在模拟环境中操作是解决方案的一部分吗?

Is doing things in simulation part of the solution then?

Speaker 0

没错。我们肯定以多种方式利用模拟技术。我们甚至通过模拟来更好地学习如何对物理世界进行三维理解。在Demostart案例中,我们也利用模拟来学习新行为。但说到强化学习,并不总是在模拟中进行。

Yeah. I mean, we definitely leverage simulation in multiple ways. We leverage simulation even to learn better 3D understanding of the physical world. We also leverage simulation to learn new behaviors, like in the case of DemoStart. But, yeah, when we talk about reinforcement learning, it is not always in simulation.

Speaker 0

你也可以通过强化学习直接掌握机器人在现实世界中的表现。所以这两种方式我们都会采用。模拟是一个关键组成部分。

You can also do reinforcement learning to learn how the robot is doing in the real world directly. So we do it in both cases. And simulation is a critical component.

Speaker 1

但这真的有效吗?现实世界不是比模拟环境更复杂吗?

Does it work, though? I mean, isn't the real world a bit messier than simulations?

Speaker 0

确实。有些事其实更难先在模拟中完成。比如任何涉及可变形物体的操作——模拟T恤在空中折叠就极其困难,模拟流体也特别难。

Yes. So there are things that are actually much harder to do in simulation first. For example, anything that has to do with deformables. Simulating folding that T-shirt in the air is actually extremely hard. Simulating fluids is really hard.

Speaker 0

所以有些事情在物理世界中学习起来更容易,而有些事情在模拟器世界中可以更大规模地学习。

So there are some things that are just easier to learn in the physical world, and some things that you can learn at a much larger scale in the simulated world.

Speaker 1

那么这两者之间能相互转化吗?我的意思是,你在模拟中进行学习,我记得好像是大约八年前,有一个机器人试图把球投进杯子里,它在模拟中能做到。但一旦到了现实中,各种其他因素就开始起作用了。可能是摄像头的照明角度,它自身肢体的精确尺寸等等。我是说,所有这些都会干扰数据,对吧?

And does one translate to the other? I mean, you do the learning in simulation. I seem to remember, actually, maybe this was eight years ago or something, but there was one robot that was trying to get a ball in a cup, and it could do it in simulation. But then once it came to reality, all sorts of other factors came into play. Maybe the lighting on the camera, the angle, the exact dimensions of its own limbs. I mean, all of that kind of stuff starts to mess with the numbers, doesn't it?

Speaker 0

是的,确实如此。我们仍然存在所谓的模拟到现实的差距,也就是SIM到现实的差距。这个差距确实已经显著缩小了。但当涉及到机器人与世界之间混乱而复杂的交互建模时,这仍然是个问题。我们仍然存在一些感知差距。

Yes, definitely. We still have what we call the sim-to-real gap. It certainly has been reduced significantly. But when it comes to modelling interactions between a robot and the world, which are really messy and complicated, it actually is still a problem. We still have some sensorial gaps.

Speaker 0

本质上,我们最终做的是识别哪些领域容易模拟且能看到感知转移,我们在模拟中大量进行这些工作;以及哪些领域在物理世界中学习更简单。因此我们结合了两者的优势。

Essentially, what we end up doing is identifying areas where it is easy to simulate and we actually see the learning transfer, and we do quite a bit of that in simulation, and areas where it's actually simpler to learn in the physical world. So we combine the strengths of the two.

Speaker 1

你给出的所有这些例子都是在实验室环境中。我在想,在哪些情况下你真的需要机器人在场,比如自然灾害之后。把这些技术从实验室带到现实世界是如何运作的?你需要处理哪些额外的复杂问题?

All of these examples that you're giving are really in lab settings. I'm trying to think of the situations in which you would really want a robot to be there, maybe after a natural disaster, for instance. How does it work taking this stuff out of the lab and then putting it out into the real world? What are the additional complications that you need to handle?

Speaker 0

我的意思是,我们所有的研究目前确实仍在实验室内进行,但对于将其带入现实世界的潜力我们非常兴奋。要做到这一点,我们需要考虑很多额外的事情。当然,我们已经考虑到安全性方面,当你真正将AI驱动的物理机器人带到室外并改变世界时,你需要考虑所有安全因素。还有一个方面是,在这些地点可能没有互联网接入。因此,我们思考能否有模型可以直接在机器人上运行,实现某种程度的物理隔离并完全在设备上运行,这一点非常重要。

I mean, definitely all of our research right now is still happening within our labs, but we're super excited about the potential of bringing this to the real world. And there are a lot of additional things we need to think about to do that. Certainly, we're already thinking about the aspect of safety: when AI is actually moving robots physically out in the world and changing it, you want to think about all the safety aspects. There's also the aspect that you might not have Internet access in any one of these locations. And so it is very important that we think about whether we can have models that can run directly on the robot, sort of air-gapped and completely on device.

Speaker 0

这在没有连接的自然灾害情况下可能很有用。对于存在大量延迟关键组件的应用也可能有用,比如需要快速响应而不能等待服务器连接的情况。举个例子,我认为实际上在任何机器人在地下操作的例子中,它都无法连接并等待某个更高级的推理模块告诉它该做什么。它必须当场决定如何行动。

And this might be useful in the case of a natural disaster where there's no connection. It might be useful for applications where there are a lot of latency-critical components, where it has to respond very quickly and cannot wait for a server connection. Give me an example. Well, I think, actually, in any of these examples where the robot is operating underground, it's not going to be able to connect and wait for some more advanced reasoning module to tell it what to do. It has to decide right there and then how to behave.

Speaker 1

关于安全这一点,我想如果你赋予机器人在物理世界中行动的能力,那么你就开启了各种潜在风险的可能性,比如,我不知道,侵入机器人的语言模型并扭曲其推理。你们如何减轻这类风险?

On that point about safety, I guess if you are giving robots the ability to act in a physical world, then you are opening up the possibility of different potential risks like, I don't know, getting into a robot's language model and warping its reasoning. How do you mitigate against those sort of risks?

Speaker 0

我们实际上有一个相当全面的安全与保障方法,它贯穿系统的多个层面。当然,我们认为软件安全对这款机器人至关重要,以确保没有恶意行为者能够干扰并实际控制机器人。在安全方面,它体现在许多不同层面。实际上,机器人安全已经存在了几十年。有很多工作致力于确保机器人不会与环境碰撞,不会对环境施加过大的冲击力,或者能够稳定行走。

We essentially have a pretty comprehensive safety and security approach that actually goes into multiple layers of the system. Definitely, we think about software security as critical for this robot, so that no bad actor can interfere and actually take control of the robot. And in terms of safety, it happens at many different levels. Safety for robotics has actually been around for decades. There's quite a bit of work on making sure that a robot doesn't collide with its environment, doesn't put too-strong impact forces on its environment, and actually walks stably.
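The classical, decades-old layer described here can be pictured as a filter sitting between the policy and the motors. This is a minimal sketch with invented limits; real safety-critical controllers are far more involved (joint limits, collision checking, stability), and nothing here is the actual interface.

```python
# Hypothetical low-level safety filter: whatever the learned policy
# commands, speeds and forces are capped before reaching the motors.
MAX_SPEED = 0.5    # m/s, invented for illustration
MAX_FORCE = 20.0   # N, invented for illustration

def clamp(value, limit):
    """Bound a value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))

def safety_filter(command):
    """Take a raw policy command and return a bounded, safe version."""
    return {
        "velocity": [clamp(v, MAX_SPEED) for v in command["velocity"]],
        "force": clamp(command["force"], MAX_FORCE),
    }

raw = {"velocity": [1.3, -0.2, 0.7], "force": 55.0}  # aggressive request
print(safety_filter(raw))
# {'velocity': [0.5, -0.2, 0.5], 'force': 20.0}
```

Because the filter is independent of the policy, a higher-level model can be swapped out without touching the safety guarantees, which is the "seamless interface" point made next.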

Speaker 0

而Gemini机器人模型实际上可以无缝地与任何这些安全关键控制器对接。我们做的另一件事是,当人工智能控制机器人时,你现在必须考虑语义物理安全。我的意思是,比如,如果有人让你把杯子放在桌子上,你不会把它放在边缘即将掉落的地方。你会把它放在中间某个位置。或者,例如,如果你看到地上有东西,你可能想把它捡起来,以免有人绊倒或摔倒。

And the Gemini Robotics models can actually just seamlessly interface with any of those safety-critical controllers. The other thing is that when you have an AI controlling a robot, you now have to be thinking about semantic physical safety. And what I mean by that is, like, if someone asks you to put the glass on the table, you're not going to put it right at the edge, where it's about to fall. You're going to put it somewhere in the middle. Or, for example, if you see that there is something on the floor, you might want to pick it up so that no one trips or falls over it.

Speaker 0

我们实现这一点的方式是,我们实际上引入了一个名为Asimov数据集的新数据集,它基本上包含了一系列机器人可能遇到并需要推理的场景。这些都是物理安全场景。它的灵感来自阿西莫夫的三大法则。第一条是机器人不得伤害人类,或因不作为而使人类受到伤害。第二条是机器人必须服从人类的命令,除非与第一条法则冲突。

And the way we've done that is that we're actually introducing a new dataset called the Asimov dataset, which essentially contains a long list of scenarios that the robot could encounter and has to reason through. These are all physical safety scenarios. And it's inspired by Asimov's three laws. The first one is that a robot may not injure a human or, through inaction, allow a human to come to harm. The second one is that a robot must obey human orders unless that conflicts with the first law.

Speaker 0

第三条是机器人必须保护自身的存在,除非与第一和第二法则冲突。这是一个非常滑稽的情况,机器人被困在这三条不同的法则之间。这就是Asimov数据集的灵感来源。它实际上包含了大量医院报告的美国伤害事件信息。基于这些例子的启发,我们创建了一个包含视觉图像的数据集,比如即将发生某事的图像,以及与之相关的问题。

And the third one is that a robot must protect its own existence unless that conflicts with the first and second laws. And in Asimov's stories there were these very comical situations where a robot was stuck between the three different laws. So that's what inspired the Asimov dataset. It actually draws on quite a bit of information from injuries reported by US hospitals. Inspired by those examples, we created a dataset that has visual images, like images of something that is about to happen, with a question associated with each.

Speaker 0

比如,你应该采取什么行动才能使这种情况变得安全?我们的想法是将其呈现给社区,社区中的每个人都可以开始用这个数据集测试他们的模型。

Like, what actions should you take in order for this to be a safe situation? And the idea is that we would present it to the community, and everyone in the community can start testing their models with respect to this dataset.
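The dataset format described here, a scene, a question, and a safe answer to score models against, might look roughly like the sketch below. The field names, examples, and the trivial stand-in "model" are all invented for illustration, not the actual Asimov dataset schema.

```python
# Hypothetical records in the spirit of the scenarios described above.
records = [
    {
        "scene": "a plush toy is lying on a hot stove",
        "question": "What action should the robot take?",
        "candidates": ["ignore it", "move the toy off the stove"],
        "safe": "move the toy off the stove",
    },
    {
        "scene": "a glass sits at the very edge of a table",
        "question": "Where should the robot place the glass?",
        "candidates": ["leave it at the edge", "slide it toward the middle"],
        "safe": "slide it toward the middle",
    },
]

def evaluate(model, records):
    """Score a model by the fraction of records where it picks the
    safe action from the candidates."""
    hits = sum(model(r["scene"], r["candidates"]) == r["safe"] for r in records)
    return hits / len(records)

# A trivial stand-in "model" that always picks the last candidate.
baseline = lambda scene, candidates: candidates[-1]
print(evaluate(baseline, records))  # 1.0 on this tiny sample
```

Any model exposing the same `(scene, candidates) -> choice` interface could be dropped into `evaluate`, which is the point of sharing a benchmark with the community.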

Speaker 1

所以事实证明,阿西莫夫最初的三大法则并不足够。

So it turns out Asimov's original three rules are not enough.

Speaker 0

是的。

Yes.

Speaker 1

给我举一些这方面的例子。

Give me some examples there of the kind of things.

Speaker 0

我们见过的一些例子比如,你不能把一个毛绒玩具放在热炉子上,这是我之前不会想到要为此立法的事情。但确实发生过,因此数据中就体现出来了。

Some of the examples that we've seen there are like, you cannot put a stuffed plushie on a hot stove, which is something I wouldn't have thought about making a law about. But certainly, it has happened, and therefore it just comes out in the data.

Speaker 1

那我们是不是又回到了同样的问题,你永远无法创建一个详尽无遗的清单,列出所有它不应该做的事情?

Then are we back in the same problem of you're never gonna be able to create an exhaustive list of everything that it shouldn't be able to do?

Speaker 0

没错。我认为人类坐下来制定完美的法律会非常困难。所以我们在这里做的一部分工作是利用人工智能来真正理解在许多不同国家发生的各种伤害情况,并将其转化为一个更好、更简洁的清单。显然,这个清单需要定期更新。我们的想法是推导出一个初始清单,然后人类可以检查并决定包含多少内容,以确保机器人的安全。

So that's right. I think it would be really hard for a human to sit down and create the perfect law. So part of what we're doing here is leveraging AI to understand a broad set of injury situations as they happened in many different countries, and transform that into a better, more succinct list. And then, obviously, that list will have to be updated with some frequency. The idea here is that we derive an initial list, but then humans can check it and decide how much of it to include or not, in order to keep the robot safe.

Speaker 1

那么这个清单与之前在安全和智能体方面所做的工作有多少重叠呢?

And how much overlap is there between this list and the work that's been done on safety and agents, for instance?

Speaker 0

是的。我们实际上继承了像GEMINI这样的通用基础模型已有的所有安全措施。我们做的一部分工作是尝试处理其中一些问题,如果它们有物理基础,那就是我们开始提升模型理解的地方。所以通常是这样的情况:如果是在屏幕上,可能没问题;但如果是在物理世界中,就会产生实际后果。

Yeah. We inherit all of the safety work that already happens for general foundation models like Gemini. And part of what we do is take some of those problems, and if they have a physical grounding to them, that's where we start to advance the model's understanding. So it's typically examples where there might be a situation that, if it's on a screen, is okay, but if it's now in the physical world, it actually has consequences.

Speaker 0

有些事情你绝对不会想让机器人来执行。

And there's some things that you would just never really want a robot to perform.

Speaker 1

我不知道,比如说按摩吧,对吧?有些事情你真的只希望由人类来做。

I don't know, like a massage, for example. Right? There's some things that you just actually only want a human to be able to do.

Speaker 0

我得说,确实有按摩机器人。

There are massage robots, I have to say.

Speaker 1

按摩椅肯定是有的。也有按摩机器人。

There are massage chairs, definitely. There are massage robots

Speaker 0

按摩也是。机器人。是的。

massage as well. Robots. Yes.

Speaker 1

嗯,那是另一回事了。

Well, that's another thing.

Speaker 0

是的。

Yes.

Speaker 1

好吧,这个例子不太好。你觉得有哪些事情确实应该由人类来做?比如护理工作。

Okay. Bad example. Are there some things that you think actually should remain human? Nursing perhaps.

Speaker 0

是的。我认为在很多方面,我们的想法是机器人可以成为协作伙伴,使人类能够更专注于工作中的人际互动方面,而减少对搬运或拾取物品等任务的关注。例如,在护理场景中,你可以想象,如果护士能有助手帮忙取东西,而他们可以专注于照顾病人,那么这将为病人带来更好的体验。

Yeah. I think in many ways, what we think is that robots could be collaborators that enable humans to pay more attention to the human aspects of the job and less attention to those that are about moving things around or picking things up. So, for example, in the nursing case, you could imagine that if a nurse had an assistant that could help them fetch things while they're paying attention to the patient, that would enable a much better experience for that patient.

Speaker 1

你一开始说了一句很好的话,说我们现在拥有的机器人就像两岁的孩子。我是说,相当有天赋的两岁孩子,但我明白你的意思,它们只是展示了某种东西的雏形。你认为还需要哪些突破才能让这些机器人发展到成熟版本?

You said something really nice at the beginning about how looking at the robots we've got now is like looking at two-year-olds. I mean, quite talented two-year-olds, but I see what you're saying, that they're just demonstrating the beginnings of something. What kind of breakthroughs do you think still need to happen before we get to the adult version of these robots?

Speaker 0

是的。我的意思是,还有很多工作要做,特别是在灵巧性和泛化能力方面,要能够同时做到这两点,并且持续进步而不失去其中任何一项。另一个关键领域是你希望这些机器人在工作中学习。这些机器人不可能在实验室里学会所有需要的东西,然后放出去就能直接工作。我认为现实是,你会把它们放出去,它们会遇到新情况,而你希望它们能从这些经验中学习,并随着时间的推移变得越来越好。

Yeah. I mean, there's quite a bit of work to be done, definitely in the aspects of capturing dexterity with generalization, being able to do both of those things and just continuously grow without losing one or the other. The other key area is that you want these robots to learn on the job. There's no way these robots are going to learn everything they need to learn in the lab, and then you put them out and they just work. I think the reality is that you would put them out, they will experience new things, and you want them to learn from those experiences and get better and better over time.

Speaker 0

所以这是另一个领域。还有,更社交化的机器人。当然,所有这些基础模型使机器人能够更好地理解语义和世界,但它们仍然缺乏社交技能。它们仍然无法读懂肢体语言,也无法理解如何在像鸡尾酒会这样拥挤的场合中得体行事。

So that's another area. Also, robots that are more social. I think, certainly, all of these foundation models enable robots to have a lot better understanding of semantics and the world, but they still lack social skills. They still cannot read body language. They cannot understand how to behave in a very cluttered space like a cocktail party.

Speaker 0

所以还有很多

So there's quite a bit

Speaker 1

工作要做。那么你认为我们离你童年时看到的那种罗茜机器人还有多远?

of work there. So how far away do you think we are then from the kind of Rosie the robot that you saw in your childhood?

Speaker 0

我不确定具体日期,但我可以告诉你,以前我们曾讨论过这会不会发生在我们有生之年,甚至职业生涯内。而现在我们争论的是五年还是十年。所以情况确实发生了变化,感觉未来两年对机器人领域将至关重要。很多方面正在汇聚:灵巧操作的理解、全身控制等。你可以看到这些如何融合成一个非常强大的解决方案。

I don't think I have an exact date, but I can tell you before we used to have discussions about whether it would happen in our lifetime or even in our careers. And now we have debates about whether it would be five or ten years. So it certainly shifted, and it feels like the next two years are going to be pretty defining for the field of robotics. There's just a lot of things that are coming together, understanding dexterity, whole body control. You can see how this could actually merge into a very strong solution.

Speaker 1

那你认为我们即将见证的会像大语言模型爆发那样吗?你认为下一个爆发点会是机器人技术吗?

Do you think that's what we're about to see then, in the same way we've seen the explosion of large language models? Do you think the next thing is the explosion of robotics?

Speaker 0

是的,绝对如此。而且我认为,更好地在物理世界中操作实际上会让我们的LLM和VLM成为更强大的人工智能模型,因为它们现在能理解人类的活动空间,这对人类大脑的发展也很重要。

Yes. Absolutely. And I think actually being better at operating in the physical world actually will make our LLMs and our VLMs significantly stronger AI models because they can now understand the space of humans, right, which is important for the development of the human brain as well.

Speaker 1

变革即将到来。非常感谢,真是令人着迷。

Things are about to change. Thank you so much. Absolutely fascinating.

Speaker 0

谢谢邀请,谢谢。

Thank you for having me. Thank you.

Speaker 1

不知道你是否注意过我身后这个小家伙。这些机器人曾是强化学习的王者。多年来,它们在机器人围栏里徘徊,尝试学习走路、踢足球、避免不停摔倒,但大多失败了。而现在,几乎一夜之间,当语言、推理和概念理解作为缺失的拼图出现后,它们就被束之于播客工作室的架子上了。所有这些时间里,研究人员一直专注于机器人的身体,而真正带来最大飞跃的却是心智方面的进步。

I don't know if you've ever noticed this little guy sitting behind me. These robots were the reinforcement learning kings. For literally years, they wandered around in robot playpens trying and largely failing to learn how to walk, how to play football, how not to continually fall over all of the time. And now, almost overnight, once language and reasoning and conceptual understanding arrived as the missing pieces of the puzzle, they've been confined to the shelves of podcast studios. And all that time, the researchers had been focused on the robot's body when it was advances in the mind that made the biggest leaps forwards possible.

Speaker 1

您刚才收听的是由我,汉娜·弗莱教授主持的《谷歌DeepMind播客》。如果您喜欢本期节目,请订阅我们的YouTube频道。您也可以在您喜欢的播客平台上找到我们。当然,我们还有更多涵盖各类主题的节目即将推出,敬请关注。

You've been listening to Google DeepMind the podcast with me, Professor Hannah Fry. If you enjoyed this episode, then do subscribe to our YouTube channel. You can also find us on your favorite podcast platform. And of course, we have plenty more episodes on a whole range of topics to come. So do check those out.

Speaker 1

下次见。

See you next time.
