
全自主机器人比你想象的要近得多——谢尔盖·莱文

Fully autonomous robots are much closer than you think – Sergey Levine

本集简介

Sergey Levine是世界顶尖的机器人研究专家、Physical Intelligence联合创始人,他认为通用机器人即将迎来"自我提升飞轮"时代。他对机器人实现完全自主化家庭管理的预测中位数年份是2030年。如果Sergey的预测准确,五年后的世界将与今日截然不同。本期对话聚焦实现路径:我们深入探讨机器人基础模型,以及如何扩展数据和硬件规模来引爆机器人革命。

YouTube观看,或在苹果播客、Spotify收听。

赞助商

* Labelbox提供跨平台多场景的高质量机器人训练数据,从简单物体处理到复杂工作流,助您扩展研究规模。详情见labelbox.com/dwarkesh
* Hudson River Trading运用前沿机器学习技术,基于海量历史市场数据预测价格走势。我有幸在HRT资深研究员指导下尝试了这个有趣的预测课题。了解详情请访问hudson-trading.com/dwarkesh
* Gemini 2.5 Flash Image(代号nano banana)不仅是趣味图像生成工具,更能修复老照片和文档数字化。通过Gemini应用或Google AI Studio体验:ai.studio/banana

节目赞助请联系dwarkesh.com/advertise

时间轴

(00:00:00) - 自主机器人普及时间表
(00:17:25) - 机器人为何比自动驾驶发展更快
(00:27:28) - 视觉-语言-行动模型原理
(00:45:37) - 实现类脑高效机器人需突破
(00:57:59) - 模拟环境学习
(01:09:18) - 机器人如何加速AI建设
(01:18:01) - 硬件瓶颈下中国是否必然胜出?

订阅完整内容请访问www.dwarkesh.com/subscribe

双语字幕


Speaker 0

今天,我邀请到了谢尔盖·莱文,他是Physical Intelligence公司的联合创始人,这是一家专注于机器人基础模型的公司,同时他也是加州大学伯克利分校的教授,更是全球机器人技术、强化学习和人工智能领域的顶尖研究者之一。谢尔盖,感谢你参加我们的播客节目。

Today, I'm chatting with Sergey Levine, who is a cofounder of Physical Intelligence, which is a robotics foundation model company, and also a professor at UC Berkeley, and just generally one of the world's leading researchers in robotics, RL, and AI. Sergey, thank you for coming on the podcast.

Speaker 1

谢谢。也感谢你如此亲切的介绍。

Thank you. And thank you for the kind introduction.

Speaker 0

我们来聊聊机器人技术吧。在我开始连珠炮般提问之前,能否请你先向听众们概述一下Physical Intelligence目前的进展?你们公司成立才一年。

Let's talk about robotics. So before I pepper you with questions, I'm wondering if you can give the audience a summary of where Physical Intelligence is at right now. You guys started a year ago.

Speaker 1

好的。

Yeah.

Speaker 0

目前的进展如何?你们主要在攻克哪些方向?

And what does the progress look like? What are you guys working on?

Speaker 1

是的。Physical Intelligence致力于构建机器人基础模型。本质上,我们的目标是开发通用模型,理论上能够控制任何机器人执行任何任务。我们之所以关注这个方向,是因为我们认为这是人工智能问题中非常基础的部分。机器人技术实质上涵盖了所有AI技术领域,所以如果能打造出真正通用的机器人,那么它就能完成——希望是——人类所能做的大部分工作。

Yeah. So Physical Intelligence aims to build robotic foundation models. And that basically means general-purpose models that could, in principle, control any robot to perform any task. We care about this because we see this as a very fundamental aspect of the AI problem. Like, robotics essentially encompasses all AI technology, so if you get a robot that's truly general, then you can do, you know, hopefully a large chunk of what people can do.
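
The "any robot, any task" idea can be sketched as an interface: observations plus a language command go in, a short chunk of future actions comes out. This is a hypothetical illustration, not Physical Intelligence's actual architecture; all class and method names here are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    camera_images: List[bytes]    # raw frames from one or more cameras
    joint_positions: List[float]  # proprioceptive state of the arm(s)
    instruction: str              # natural-language task command

class RobotFoundationModel:
    """Hypothetical interface for a vision-language-action policy.

    A real VLA model would run a VLM backbone plus an action decoder;
    here we only illustrate the input/output contract: the model emits
    a "chunk" of future low-level actions for whatever embodiment it
    is controlling.
    """

    def __init__(self, action_dim: int, chunk_length: int = 50):
        self.action_dim = action_dim      # depends on the robot's embodiment
        self.chunk_length = chunk_length  # how far ahead the policy commits

    def predict_action_chunk(self, obs: Observation) -> List[List[float]]:
        # Placeholder: return zeros of the right shape to show the contract.
        return [[0.0] * self.action_dim for _ in range(self.chunk_length)]

model = RobotFoundationModel(action_dim=7)  # e.g. a 7-DoF arm
obs = Observation(camera_images=[b""], joint_positions=[0.0] * 7,
                  instruction="fold the t-shirt on the table")
chunk = model.predict_action_chunk(obs)
```

The same model object would be pointed at a different robot just by changing `action_dim`, which is the sense in which one model can, in principle, control any robot.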

Speaker 1

目前我们的进展是,我认为我们已经完成了许多基础架构的搭建。说实话,这些基础成果相当令人振奋——它们的运行效果非常好。我们已经能让机器人折叠衣物,进入陌生家庭并尝试清理厨房。但在我看来,Physical Intelligence现阶段的工作才刚刚迈出非常、非常初步的一步。

And where we're at right now is, I think we've kind of gotten to the point where we've built out a lot of the basics. And, you know, I think those basics actually are pretty cool. Like, they work pretty well. We can get a robot that will fold laundry and that will go into a new home and, like, try to clean up the kitchen. But in my mind, what we're doing at Physical Intelligence right now is really the very, very early beginning.

Speaker 1

关键在于,我们要先搭建好基础模块,然后才能着手解决这些真正棘手的难题。

It's just, putting in place the basic building blocks on top of which we can then tackle all these, like, really tough problems.

Speaker 0

那么逐年规划是怎样的?一年过去了,我有机会观察了一些机器人,它们能用夹爪完成相当灵巧的任务,比如折叠纸盒。说实话,就算用手折盒子也挺难的。我们需要逐年推进,直到迎来机器人技术的全面爆发。每一年具体要实现哪些突破?

And what's the year by year vision? So one year in, now I got a chance to watch some of the robots and they can do pretty dexterous tasks like folding a box using grippers. And it's like, I don't know, it's pretty hard to fold the box even with my hands. You gotta go year by year until we get to the full robotics explosion. What is happening every single year?

Speaker 0

每年需要攻克哪些关键瓶颈等等?

What is the thing that needs to be unlocked, etcetera?

Speaker 1

有几个关键点我们必须把握。显然,灵巧操作是其中之一。最初阶段,我们重点验证开发的方法能否处理人类擅长的精细任务——就像你提到的叠盒子、叠不同衣物、收拾桌子这类工作。

So there are a few things that we need to get right. I mean, dexterity obviously is one of them. And in the beginning, we really wanted to make sure that we understand whether the methods that we're developing have the ability to tackle the kind of intricate tasks that people can do. As you mentioned, like folding a box, folding different articles of laundry, cleaning up a table.

Speaker 1

还有泡咖啡之类的日常事务。目前进展不错,展示的成果相当亮眼。但重申下,终极目标不是叠好一件T恤,而是验证我们的基础假设——这些底层技术是否足够扎实。

Making a coffee, that sort of thing. And that's, like, that's good. Like, that works. You know, I think that the results we've been able to show are pretty cool. But again, like, the end goal of this is not to fold a nice t-shirt. The end goal is to just, like, confirm our initial hypothesis that, like, the basics are kind of solid.

Speaker 1

不过之后还面临诸多重大挑战。有时成果被浓缩成三分钟视频后,观众可能觉得'这就是他们的全部成果',但实际这仅仅是未来技术的雏形。

Yeah. But from there, there are a number of really major challenges. And I think that, sometimes when results get abstracted to the level of, like, a three minute video, someone can look at this video. It's like, that's cool. Like, that's what they're doing.

Speaker 1

人们对机器人的真正期待不是'请叠好我的T恤'这种单一指令,而是能说'机器人,现在开始处理我家所有家务'这样的全能指令。

But it's not. Like, it's a very simple and basic version of what I think is to come. Like, what you really want from a robot is not to tell it like, hey, please fold my t shirt. What you want from a robot is to tell it like, hey, robot. Like, you're now doing all sorts of home tasks for me.

Speaker 1

我喜欢晚餐在晚上6点准备好。我早上7点起床去上班。我想让你知道,我喜欢在周六洗衣服,所以确保这些那些都安排妥当。对了,每周一记得和我确认一下,比如购物时需要你采购些什么。明白吗?

I like to have dinner made at 6PM. I wake up and go to work at 7AM. I like to do my laundry on Saturday, so make sure that's ready, this and this and this. And by the way, check in with me every Monday to see, like, you know, what I want you to pick up when you do the shopping. Right.

Speaker 1

对吧?这就是指令要求。然后机器人应该照这样执行,比如持续六个月或一年。这就是任务的持续时间。

Right? Like, that's the prompt. And then the robot should go and do this for, like, you know, six months, a year. Like, that's the duration of the task.

Speaker 0

所以

So

Speaker 1

归根结底,如果这些能成功实施,规模应该会大得多。它需要具备持续学习的能力,理解物理世界的常识,能在需要时主动获取更多信息。比如我临时要求'今晚能做这种沙拉吗',它应该能应对。

Ultimately, if this stuff is successful, it should be a lot bigger. And it should have that ability to learn continuously. It should have the understanding of the physical world, the common sense, the ability to go in and pull in more information if it needs it. Like, if I ask it, like, hey, tonight, you know, can you make me this type of salad? It should be able to handle that.

Speaker 1

它需要自行理解这涉及哪些步骤,比如查菜谱、采购食材。这需要很多能力支撑,尤其需要常识判断力。

It should, like, figure out what that entails. Like, look it up. Go and buy the ingredients. So there's a lot that goes into this. It requires common sense.

Speaker 1

它需要明白有些特殊情况需智能处理,有些情况需深入思考;需要持续改进的能力;需要理解安全性概念,关键时刻可靠,犯错后能自我修正。因此远不止表面这么简单,核心原则是:必须善用既有知识,并建立正确的认知框架。

It requires understanding that there are certain edge cases that you need to handle intelligently, cases where you need to think harder. It requires the ability to improve continuously. It requires understanding safety, being reliable at the right time, being able to fix your mistakes when you do make those mistakes. So there's a lot more that goes into this. But the principles there are you need to leverage prior knowledge and you need to have the right representation.

Speaker 0

那么这个宏伟愿景,如果要给出中位数预估年份?或者25分位、50分位、75分位值?

So this grand vision, what year, if you had to give a median estimate? Or 25th percentile, 50th, 75th?

Speaker 1

我认为这不是那种我们在实验室里开发完一切就大功告成、然后等到2030年左右你就能收到一个装在盒子里的机器人的情况。我觉得这会和我们看到的人工智能助手发展类似,一旦机器人达到某种基本能力水平,能够提供有用的服务,它就会进入现实世界。最棒的是,一旦进入现实世界,它们就能积累经验并利用这些经验变得更好。所以对我来说,关于时间线,我经常思考的不是完成日期,而是飞轮开始转动的日期。

I think it's something where it's not going to be a case where we develop everything in the laboratory, and then it's done, and then come 2030-something, you get a robot in a box. I think it'll be the same as what we've seen with AI assistants, that once we reach some basic level of competence where the robot is delivering something useful, it'll go out there in the world. The cool thing is that once it's out there in the world, they can collect experience and leverage that experience to get better. So to me, what I tend to think about a lot in terms of timelines is not the date when it will be done, but the date when, like, the flywheel starts, basically.

Speaker 0

那么飞轮什么时候开始转动呢?

So when does the flywheel start?

Speaker 1

我认为可能很快。而且我觉得需要做出一些决定。比如,权衡在于,你把范围限定得越窄,就能越早让它进入现实世界。所以'很快'的意思是,这已经是我们正在探索的事情。我们已经在尝试弄清楚,这个机器人能做哪些实际的事情,从而让我们开始转动飞轮。

I think that could be very soon. And I think there are some decisions to be made. Like, the trade-off there is the more narrowly you scope the thing, the earlier you can get it out into the real world. But soon as in, like, this is something we're already exploring. We're already trying to figure out, like, what are the real things this thing could do that could allow us to start spinning the flywheel.

Speaker 1

但就你真正关心、想看到的东西而言,我认为个位数年份是非常现实的。我真心希望在一两年内就能有实际的东西问世,但这很难说。

But I think in terms of, like, stuff that you would actually care about, that you would wanna see, I don't know, but I think that single-digit years is very realistic. I'm really hoping it'll be more like one or two before something is, like, actually out there, but it's hard to say.

Speaker 0

'有实际的东西问世'是什么意思?具体是什么东西问世?

And something being out there means what? Like, what is out there?

Speaker 1

意思是有一个机器人能做你真正关心的事情。是的,做你希望完成的事情,而且它做得足够好,能够真正为需要它的人提供服务。

It means that there is a robot that does a thing that you actually care about, that you want done, and it does so competently enough to actually do it for real, for real people who want it done.

Speaker 0

我们已经有了广泛部署的大型语言模型,但这并没有形成某种飞轮效应。至少对模型公司来说,没有明显的飞轮效应——现在Claude并没有学会如何做经济中的每一项工作,GPT也没有学会如何做经济中的每一项工作。那么为什么大型语言模型没有形成这种飞轮效应呢?

We already have LLMs which are broadly deployed, and that hasn't resulted in some sort of flywheel. At least not some obvious flywheel for the model companies, where now Claude is, like, learning how to do every single job in the economy, or GPT is learning how to do every single job in the economy. So why doesn't that flywheel work for LLMs?

Speaker 1

嗯,我认为它实际上已经非常接近可用了。我百分之百确信许多组织正在研究这个方向。事实上,可以说已经存在一个飞轮效应——虽然不是全自动的,而是一个人工参与的循环机制:每个部署大语言模型的人都会观察其输出,并据此调整模型行为。这很复杂,因为它又回到了表征问题,以及如何正确提取监督信号并将这些信号根植于系统行为中,使其真正优化目标性能。

Well, I think it's actually very close to working. And I am, like, 100% certain that many organizations are working on exactly this. In fact, arguably, there is already a flywheel, in the sense that it may not be an automated flywheel, but a human-in-the-loop flywheel, where everybody who's deploying an LLM is, of course, gonna look at what it's doing and is gonna use that to then modify its behavior. It's complex because it comes back to this question of representations, and figuring out the right way to derive supervision signals and ground those supervision signals in the behavior of the system so it actually improves on what you want.
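
The human-in-the-loop flywheel described here can be sketched as a data pipeline: deployed rollouts come back with outcomes, a human reviews the failures, and the next training batch is built from successes plus corrections. A minimal sketch under invented names; real pipelines would operate on trajectories and model weights, not strings.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Episode:
    """One deployed rollout: what the system did and whether it worked."""
    actions: List[str]
    succeeded: bool
    human_correction: Optional[List[str]] = None  # filled in on review

def review(episode: Episode, corrected_actions: List[str]) -> Episode:
    """A human inspects a failed rollout and records what should have happened."""
    episode.human_correction = corrected_actions
    return episode

def next_training_batch(episodes: List[Episode]) -> List[List[str]]:
    """Successes are imitated as-is; reviewed failures are imitated via the
    correction; unreviewed failures contribute nothing until someone looks."""
    batch = []
    for ep in episodes:
        if ep.succeeded:
            batch.append(ep.actions)
        elif ep.human_correction is not None:
            batch.append(ep.human_correction)
    return batch

logs = [
    Episode(["open tab", "draft reply"], succeeded=True),
    review(Episode(["draft reply", "send"], succeeded=False),
           ["draft reply", "ask user", "send"]),
    Episode(["delete file"], succeeded=False),  # not yet reviewed
]
batch = next_training_batch(logs)
```

The "representation" question in the conversation is exactly the hard part this sketch glosses over: deciding what counts as success and how a correction grounds back into the policy's behavior.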

Speaker 0

嗯。

Mhmm.

Speaker 1

我不认为这是个绝对无解的难题,只是细节处理会相当棘手,算法稳定性的挑战也变得非常复杂。所以整个行业需要时间共同攻克这些问题。

And I don't think that's like a profoundly impossible problem. It's just something where the details get, like, pretty gnarly, and challenges with algorithms and stability become pretty complex. So it's just something that's taken a while for the community collectively to get their hands around.

Speaker 0

你认为机器人学会更容易实现吗?还是说这类标注现实世界数据的技术整体进步后,机器人学自然会随之发展?亦或是机器人学确实有特殊优势能从中获益更多?

Do you think it'll be easier for robotics? Or is it just that the state of these techniques, to label data that you collect out in the world and use it as reward, will improve, the whole wave will rise, and robotics will rise as well? Or is there some reason robotics will benefit more from this?

Speaker 1

是的。我不认为机器人学存在本质差异,但确实有些细微差别让事情更可控。特别是当机器人与人类协作时——无论是受监督还是被指导——天然存在监督信号来源,且人类有强烈动机提供协助以确保成功。这种动态环境下,你可以犯错后纠正,并通过反思避免未来重蹈覆辙。

Yeah. I don't think there's like a profound reason why robotics is that different, but there are a few small differences that I think make things a little bit more manageable. So especially if you have a robot that's doing something in cooperation with people, whether it's a person that's supervising it or directing it. Like there are very natural sources of supervision and there's a big incentive for the person to provide the assistance that will make things succeed. There are a lot of dynamics where you can make mistakes and recover from those mistakes and then reflect back on what happened and avoid that mistake in the future.

Speaker 0

嗯。

Mhmm.

Speaker 1

我认为在物理世界执行任务时,这种情况比AI助手回答问题更常见。比如回答错误时,你无法简单回溯调整——接收到错误答案的人可能根本意识不到问题所在。

And I think that when you're doing physical things in the real world, that kind of stuff just happens more often than it does if you're, like, an AI assistant answering a question. Like, if you answer a question and you just answered it wrong, well, it's not like you can just go back and tweak a few things. Like, the person you told the answer to might not even know that it's

Speaker 0

错了。对了。

wrong. Right.

Speaker 1

但如果你在叠T恤时搞砸了一点,是的,这很明显。你可以反思一下,找出问题所在,下次做得更好。

Whereas if you're, like, folding the t-shirt and you messed up a little bit, like, yeah, it's pretty obvious. You can reflect on that, figure out what happened, and do it better next time.
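
The mistake-then-correction pattern Levine describes can be mined from a single rollout after the fact: steps that were later undone become negative examples, and the cleaned-up trajectory becomes something to imitate. A hypothetical sketch; the function name and the string-valued actions are invented for illustration.

```python
from typing import List, Tuple

def mine_corrections(steps: List[Tuple[str, bool]]):
    """Post-hoc labeling of one rollout. Each step is (action, was_mistake),
    where was_mistake marks actions the robot later had to undo (say, a
    misfolded corner). Returns (avoid, keep): actions to down-weight versus
    the cleaned-up trajectory to imitate next time."""
    avoid = [action for action, bad in steps if bad]
    keep = [action for action, bad in steps if not bad]
    return avoid, keep

rollout = [
    ("grasp sleeve", False),
    ("fold corner wrong", True),   # visible mistake...
    ("unfold corner", False),      # ...corrected on the spot
    ("fold corner", False),
    ("smooth shirt", False),
]
avoid, keep = mine_corrections(rollout)
```

The point of the driving contrast earlier in the conversation is that this mining step only works when mistakes are recoverable: a misfolded corner yields both a completed task and a training signal, while a crash yields neither.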

Speaker 0

是啊。所以我在想,一年后我们会有能做点有用事情的机器人。也许它们能帮你完成一些相对简单的循环流程,比如不停地折叠成千上万个盒子之类的工作。

Yeah. So, okay. In one year we have robots which are, like, doing some useful things. Maybe if you have some relatively simple, loopy process, they can do it for you. Like, you gotta keep folding thousands of boxes or something.

Speaker 0

但之后会有某种飞轮效应……会有某种机器能像人类管家一样为我打理家务。那么,这个一年内就能部署的、始于飞轮效应的东西,与一个完全自主的管家之间的差距是什么?

But then there's some flywheel, dot dot dot, there's some machine which will just run my house for me as well as a human housekeeper would. What is the gap between this thing that will be deployed in a year, that starts the flywheel, and this thing which is, like, a fully autonomous housekeeper?

Speaker 1

嗯,我认为在某些方面这与我们在大语言模型上看到的情况并无太大不同,关键在于范围。比如编程辅助工具,最初最好的编程工具只能完成一点点代码补全——你给它一个函数签名,它会尽力写出整个函数,可能只对一半。随着技术进步,你会更愿意赋予它们更多自主权。

Well, I think it's actually not that different than what we've seen with LLMs in some ways, in that it's a matter of scope. Like, if you think about coding assistants, right? Initially, the best tools for coding could do, like, a little bit of completion. Like, you give them a function signature and they'll, like, try their best to type out the whole function, and they'll maybe get half of it right. And as that stuff progresses, then you're willing to give these things a lot more agency.

Speaker 1

所以现在最先进的编程系统,在处理相对程式化的工作时,甚至能帮你完成大部分PR(拉取请求)内容,只要任务不算太复杂。

So the very best coding systems now, if you're doing something relatively formulaic, maybe they can put together most of a PR for you, for something, you know, fairly accessible.

Speaker 0

没错。

Right.

Speaker 1

所以我认为情况会类似。随着机器人能力不断提升,我们将逐步扩大赋予它们的职责范围。最初可能仅限于特定任务,比如煮咖啡之类。但当它们具备更强能力,拥有常识并能处理更广泛任务时,我们会给予更大权限。比如让它们管理整个咖啡店。

So I think it'll be the same thing. We'll see an increase in the scope that we're willing to give to the robots as they get better and better. Where initially the scope might be, like, there is a particular thing you do, like you're making the coffee or something. Whereas as they get more capable, as their common sense and their repertoire of tasks increase, then we'll give them greater scope. Now you're running the whole coffee shop.

Speaker 0

我理解这是个渐进过程,也明白不会有某个具体时刻让我们突然宣告成功。但如果你要给出年份预测的话,你预估的中位数时间点是?

I get that there's a spectrum, and I get that there won't be a specific moment that feels like we've achieved it. But I wish you'd give a year, like, your median estimate of when that happens.

Speaker 1

我的直觉是这更可能是个位数年份而非两位数。但难以精确预测的原因在于,就像所有研究一样,这取决于解决几个关键未知问题。我认为这些难题并不需要革命性的新理论,而是需要对现有知识的正确整合——不过要说明的是,有时候知识整合的难度丝毫不亚于提出全新概念。

I mean, my sense there too is that this is probably a single-digit thing rather than a double-digit thing. But the reason it's so hard to really pin down is because, as with all research, it does depend on figuring out a few question marks. And I think my answer in terms of the nature of those question marks is, I don't think these are things that require profoundly, deeply different ideas, but it does require the right synthesis of the kinds of things that we already know. And, you know, sometimes synthesis, to be clear, is just as difficult as coming up with profoundly new stuff. Right?

Speaker 1

这本质上是个极其深刻复杂的课题,解决它将非常令人振奋。我们大致知道拼图板块的样子,需要做的就是持续攻关。如果进展顺利且运气不错,我认为个位数年份的预测是合理的。

So I think it's intellectually a very deep and profound problem, and figuring that out is gonna be, like, very exciting. But I think we kinda, like, know roughly the puzzle pieces, and it's something that we need to work on. And I think if we work on it and we're a bit lucky and everything kind of goes as planned, I think single-digit is reasonable.

Speaker 0

那我就用二分法来锁定年份吧。既然少于十年,那中位数是超过五年吗?我知道这...

I'm just gonna do binary search until I get a year. Okay. So it's less than ten years, so more than five years? Your median estimate. I know it's like a

Speaker 1

我觉得五年是个不错的中位数。

I think five is a good median.

Speaker 0

好的,五年。那么如果能完全自主管理家庭,就意味着能完成大多数蓝领工作。所以你的预测是五年后,机器人应该能胜任经济体系中大部分蓝领工作。

Okay. Five years. So if you can fully autonomously run a house, then I think you can, like, fully autonomously do most blue-collar work. So your estimate is in five years, it should be able to do most, like, blue-collar work in the economy.

Speaker 1

所以我认为这里存在一个细微差别,这个差别在我们类比编码辅助时会更加明显。对吧?如今的编码辅助并非像突然有个开关被拨动,所有软件工程师都被解雇,然后所有人都用语言模型处理一切。实际上最显著的效率提升来自专家——也就是软件工程师——他们的生产力因这些强大工具而增强,这是非常合理的。

So I think there's a nuance here, and the nuance becomes more obvious if we consider the analogy to coding assistants. Right? It's not like the nature of coding assistants today is that there's a switch that flips and suddenly, instead of writing software, all software engineers get fired and everyone's using LLMs for everything. And it actually makes a lot of sense that the biggest gain in productivity comes from experts, which is software engineers, whose productivity is now augmented by these really powerful tools.

Speaker 0

是的。撇开人们是否会被解雇这个问题不谈,另一个问题是:五年后的经济影响会如何?

Yeah. I mean, separate from the question of whether people will get fired or not, a different question is, like, what will the economic impact be in five years?

Speaker 1

没错。

Yeah.

Speaker 0

我对此好奇的原因是,大语言模型的收入与其看似具备的能力之间的关系有些神秘——你拥有感觉像通用人工智能的东西,能与之进行真正通过图灵测试的对话,感觉它能处理所有知识工作,显然还能完成大量编码等任务。但这些AI公司的年收入总和才约200至300亿美元。

The reason I'm curious about this is, with LLMs, the relationship between the revenues for these models and their seeming capability has been sort of mysterious, in the sense that, like, you have something which feels like AGI. You can have a conversation with it that really, like, passes a Turing test. It really feels like it can do all this knowledge work. It's obviously doing a bunch of coding, etcetera. But then the revenues for these AI companies are, like, cumulatively on the order of $20-30 billion per year.

Speaker 0

这比所有知识工作创造的304万亿美元少太多了。那么五年后,我们会处于类似大语言模型现在的处境吗?还是说届时机器人已遍布各地,真正承担大量实际工作?

And that's so much less than all knowledge work, which is $30.40000000000000 dollars So in five years, we in a similar situation that LLMs are in now? Or is it more like we have robots deployed everywhere and they're actually doing a whole bunch of real work, etcetera?

Speaker 1

这是个非常微妙的问题。我认为最终可能归结为适用范围的问题。大语言模型未能取代所有软件工程师,是因为它们在特定范围内表现良好但存在局限。当然,这些局限正在不断被突破。机器人领域很可能也会出现同样情况——初始适用范围会很小,因为这些系统在某些方面表现出色,而在其他方面仍需大量人类监督。

It's a very subtle question. I think what it probably will come down to is this question of scope. The reason that LLMs aren't doing all software engineering is because they're good within a certain scope, but there are limits to that. Right? Those limits are increasing, to be clear, all the time. And I think that there's no reason that we wouldn't see the same kind of thing with robots: that the scope will have to start out small, because there will be certain things that these systems can do very well and certain other things where more human oversight is really important.

Speaker 1

随着适用范围扩大,将转化为生产力的提升。部分生产力来自机器人本身的价值,部分则来自使用机器人的人类在其工作流程中变得更高效。

And the scope will grow, and what that will translate into is increased productivity. And some of that productivity will come from, like, the robots themselves being valuable, and some of it will come from the people using the robots now being more productive in their own work.

Speaker 0

生产力提升就像戴手套能提高效率,或者像...我不确定,但就像想理解某种能让效率提升百倍的事物,相比之下,戴眼镜之类只能带来微小提升。机器人已经提升了工人的生产力。目前LLMs(大语言模型)在经济体知识工作中所占的份额,我猜大概是千分之一左右——至少从收入角度来看是这样。你是说五年内机器人也能在体力劳动领域达到类似比例吗?

Productivity, like, wearing gloves increases productivity, or, I don't know, but I want to understand something which increases productivity a hundredfold, versus, you know, wearing glasses or something which gives, like, a small increase. So robots already increase productivity for workers. LLMs right now, in terms of the share of knowledge work they can do, it's, I guess, probably, like, one one-thousandth of the knowledge work that happens in the economy, at least in terms of revenue. Are you saying, like, that fraction will be possible for robots, but for physical work, in five years?

Speaker 1

这是个很难回答的问题。我觉得自己可能无法告诉你机器人能完成多少比例的体力劳动,因为我现在确实没有足够全面的认知来评估...这么庞大的体力劳动领域。但我可以明确的是:在'人在回路'的配置中逐步部署有效系统要容易得多。这和我们看到的编程系统发展轨迹完全一致,自动化领域也会如此——人机协作永远比单独用人或单独用机器更高效。

That's a very hard question to answer. I think I'm probably not prepared to tell you what percentage of all labor work can be done by robots, because I don't think right now, off the cuff, I have a sufficient understanding of what's involved in, you know, that big of a cross-section of all physical labor. I think what I can tell you is this: I think it's much easier to get effective systems rolled out gradually in a human-in-the-loop setup. And again, I think this is exactly what we've seen with coding systems, and I think we'll see the same thing with automation, where basically robot plus human is much better than just human or just robot.

Speaker 1

没错。这个逻辑非常自洽。这种模式也更容易实现技术迭代,因为人机协作时,机器人有更多机会在实践中学习新技能。就像...

Yeah. And that just, like, makes total sense. It also makes it much easier to get all the technology bootstrapped, because when it's robot plus human, now there's a lot more potential for the robot to, like, actually learn on the job, acquire new skills. It's just, like, you know

Speaker 0

因为人类可以标注操作过程?

Because the human can label what's happening?

Speaker 1

不仅如此,人类还能提供协助。比如给出提示——我举个例子:去年四月我们发布的π0.5项目论文里,最初是通过远程操控机器人在不同场景下工作。后来发现当模型达到一定水平后,通过语言指令(而不仅是底层动作监督)就能取得重大进展。

And also because the human can help. The human can give hints. You know, let me tell you this story. When we were working on the π0.5 project, this was the paper that we released last April, we initially controlled our robots with teleoperation in a variety of different settings. And then at some point, we actually realized that we can actually make significant headway, once the model was good enough, by supervising it not just with low-level actions, but actually literally instructing it through language.

Speaker 1

当然这需要机器人具备基础能力门槛,但达标之后,你只需站着说'现在拿起杯子,把杯子放进水槽,把盘子放进水槽'——这些语言指令本身就能成为机器人优化的数据来源。

Now you need a certain level of competence before you can do that, but once you have that level of competence, just standing there and telling the robot, okay, now pick up the cup, put the cup in the sink, put the dish in the sink, just with words, already actually gives the robot information that it can use to get better.
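
Turning those spoken hints into training data can be sketched as a simple alignment step: each camera frame inherits the most recent instruction the supervisor gave, yielding (frame, subtask) pairs for the high-level policy. A hypothetical sketch; the function name and the timestamped examples are invented, and real systems would align against logged speech-to-text, not hand-typed tuples.

```python
from bisect import bisect_right
from typing import List, Tuple

def label_frames_with_hints(frame_times: List[float],
                            hints: List[Tuple[float, str]]):
    """Assign each frame the most recent spoken instruction.

    frame_times: timestamps of logged camera frames, in seconds.
    hints: (timestamp, instruction) pairs, sorted by time, e.g. a
    supervisor saying "now pick up the cup" at t=3.2s.
    Returns (frame_time, subtask) training pairs; frames before the
    first hint are left unlabeled.
    """
    hint_times = [t for t, _ in hints]
    labeled = []
    for ft in frame_times:
        i = bisect_right(hint_times, ft) - 1  # last hint at or before ft
        if i >= 0:
            labeled.append((ft, hints[i][1]))
    return labeled

frames = [1.0, 4.0, 9.0]
hints = [(3.2, "pick up the cup"), (8.0, "put it in the sink")]
pairs = label_frames_with_hints(frames, hints)
```

This is much cheaper supervision than teleoperation: the human only talks, and the low-level motor data comes from the robot's own (already competent) execution.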

Speaker 0

明白。

Right.

Speaker 1

现在想象一下这对人类与机器人互动意味着什么。如今,这些系统的学习不仅是从原始动作中学习,还包括从语言中学习,最终将从观察人类行为中学习,从与他人协作时获得的自然反馈中学习。这类知识中,来自大型模型的先验知识极具价值,因为它能帮助你理解这种互动模式。因此我认为,这种人机协作部署有很大潜力能让模型变得更出色。

Now imagine what this implies for the human-plus-robot dynamic. Like, now, basically, learning for these systems is not just learning from raw actions, it's also learning from words, and eventually it'll be learning from observing what people do, from the kind of natural feedback that you receive when you're doing a job together with somebody else. And this is also the kind of stuff where the prior knowledge that comes from these big models is tremendously valuable, because that lets you understand that interaction dynamic. So I think that there's a lot of potential for these kinds of human-plus-robot deployments to make the model better.

Speaker 0

有意思。看来我得去Labelbox看看他们的机器人配置,亲自操作试试。

Interesting. So I got to go to Labelbox and see the robotic setup and try operating some of the robots myself.

Speaker 2

关键在于这些触发器,操作时要特别小心,别做太快动作。保持缓慢平稳的操作。

So the thing is, like, these triggers, be very mindful of pressing them, and don't do some, like, very fast movements. Keep it, like, sort of slow.

Speaker 0

需要一直按住吗?你继续。抱歉。好的。

Do you need to keep holding it? Go ahead. Sorry. Okay.

Speaker 2

没关系。别移动太远,它真的可能会受伤。对。

That's okay. And don't move it very far, because it can actually get hurt. Yeah.

Speaker 0

好的。实际操作比我预想的要困难些。不过我确实看到Labelbox团队高效完成了一系列任务,还见识了他们用于训练机器人的输出数据,并向CEO Manu请教了整套系统是如何整合的。

Yeah. Okay. So operating ended up being a bit harder than I anticipated. But I did get to see the Labelbox team rip through a bunch of tasks. I also got to see the output data that labs actually use to train their robots, and ask Manu, Labelbox's CEO, about how all this is packaged together.

Speaker 3

你现在看到的是最终交付给实验室的输出数据,用于模型训练。左侧是机器人运动轨迹的可视化,包括三维模型等;右侧是所有与配置同步的摄像头画面流。

What you're looking at is actually the final output that is then delivered to the labs, which then they use to train the models. You can see on the left the visualization of the movements of the robot, including its three d model and so forth. And on the right, you see all the camera streams synchronized with the configuration.

Speaker 0

Labelbox能为您提供数百万集的机器人数据,涵盖您想训练的每一种机器人平台和子任务。如果您通过labelbox.com/dwarkesh联系我们,Manu也会对我非常满意。关于机器人技术的进展,为什么它不会像自动驾驶汽车那样呢?要知道,自谷歌推出自动驾驶汽车计划以来已经过去十多年了——我记得是在2009年他们启动的这个项目。我十几岁时还看过演示视频,当时幻想能自动驾驶去买塔可钟然后开回来。

Labelbox can get you millions of episodes of robotics data for every single robotics platform and subtask that you want to train on. And if you reach out through labelbox.com/dwarkesh, Manu will be very happy with me too. In terms of robotics progress, why won't it be like self-driving cars, where, you know, it's been more than ten years since Google launched it? Wasn't it in 2009 that they launched the self-driving car initiative? And then I remember when I was a teenager, like, watching demos where it would go buy Taco Bell and drive back.

Speaker 0

直到现在我们才真正看到它们投入使用。即便如此,它们仍可能犯错等等。或许还要很多年大多数汽车才能实现自动驾驶。所以您说机器人技术五年内就能达到相当可靠的水平,但实际情况会不会像自动驾驶那样——五年后我们迎来技术爆发,但要像Waymo和特斯拉FSD那样成熟运作,可能还需要再等十年?

And only now do we have them actually deployed. And even then, they may make mistakes, etcetera. And so maybe it'll be many more years before most of the cars are self driving. So why wouldn't robotics, you're saying five years to this quite robust thing, but actually, it'll just feel like twenty years or just, like, once we get the cool downhill in five years, then it'll be another ten years before, like, we have the Waymo and the Tesla FSD working.

Speaker 1

这是个很好的问题。与2009年相比,现在最大的不同在于机器学习系统的环境感知技术——最初为自动驾驶开发的感知系统,对机器人同样重要。而2009年时,感知技术还处于非常初级的阶段。

Yeah. That's a really good question. So one of the big things that is different now than it was in 2009 actually has to do with the technology for machine learning systems that understand the world around them. Principally, for autonomous driving, this is perception. For robots, it can mean a few other things as well. And perception certainly was not in a good place in 2009.

Speaker 1

感知技术的难点在于:你可以用精心设计的系统做出惊艳的演示,但想要泛化应用就会碰壁。到了2025年的今天,我们有了更强大的技术来构建可泛化、鲁棒的感知系统,以及更广义的环境理解系统。当说到系统可扩展时,机器学习领域的可扩展性本质上就是可泛化性。这为我们提供了比当年更好的起点——当然这并非说机器人技术比自动驾驶简单。

The trouble with perception is that it's one of those things where you can nail a really good demo with a somewhat engineered system, but hit a brick wall when you try to generalize it. Now, at this point in 2025, we have much better technology for generalizable and robust perception systems, and more generally, generalizable and robust systems for understanding the world around us. Like, when you say that a system is scalable, in machine learning, scalable really means generalizable. So that gives us a much better starting point today. Now, that's not an argument about robotics being easier than autonomous driving.

Speaker 1

这只是说明2025年比2009年更具优势。嗯...但机器人技术还有其他特殊之处:比如机械臂操作在某些方面比驾驶困难得多,但在另一些方面,它又是个更容易通过限定范围来启动技术飞轮的领域。

It's just an argument for 2025 being a better year than 2009. But there are also other things about robotics that are a bit different than driving. Like, in some ways, robotic manipulation is a much, much harder problem. But in other ways, it's a problem space where it's easier to get rolling, to start that flywheel, with a more limited scope.

Speaker 1

举个例子:学开车时如果没人指导,你肯定觉得疯了——你不会让自家青少年独自摸索驾驶对吧?就算16岁的孩子已经对世界有相当认知,你也绝不会让五岁小孩自己开车。但如果是洗碗...当然碗也可能打碎,但你大概率会允许孩子独自尝试,不需要有人拿着备用餐具全程盯着。

So to give you an example, if you're learning how to drive, you would probably be pretty crazy to learn how to drive on your own without somebody helping you. Like, you would not trust your teenage child to learn to drive just on their own, just drop them in the car and say, like, go for it. And that's, like, you know, a 16-year-old who's had a significant amount of time to learn about the world.

Speaker 1

很多机械臂操作任务都允许试错修正:犯错后纠正的过程,首先完成了任务(因为纠正了错误),同时获得了避免重蹈覆辙的经验知识。

You would never even dream of putting a five-year-old in a car and telling them to get started. But if you want somebody to, like, do the dishes, like, dishes can break too, but you would probably be okay with a child trying to do the dishes without somebody constantly, like, you know, sitting next to them with a brake, so to speak. So for a lot of tasks that we wanna do with robotic manipulation, there's potential to make mistakes and correct those mistakes. And when you make a mistake and correct it, well, first, you've achieved the task, because you've corrected it, but you've also gained knowledge that allows you to avoid that mistake in the future.

Speaker 1

在驾驶方面,由于其动态设置的特性,犯错后纠正并从中学习变得极其困难,因为错误本身会带来重大后果。当然,并非所有操作测试都如此。确实存在一些高度安全关键的情境。这就引出了下一个要点——常识。常识指的是能够对可能发生的事情做出合理推测的能力,这些推测无需你亲身经历错误就能提前学习。

With driving, because of the dynamics of how it's set up, it's very hard to make a mistake, correct it, and then learn from it, because the mistakes themselves have significant ramifications. Now, not all manipulation tasks are like that. There is truly some, like, very safety-critical stuff. And this is where the next thing comes in, which is common sense. Common sense, meaning the ability to make inferences about what might happen that are reasonable guesses, but that do not require you to experience that mistake and learn from it in advance.

Speaker 1

这一点至关重要,而大约五年前我们对此还束手无策。但现在我们可以利用LLMs和VLMs,向它们提问,它们会给出合理推测。虽然无法提供专家级行为,但你可以问:'嘿,这里有块湿滑地板警示牌,我走过去会发生什么?'

That's tremendously important, and that's something that we basically had no idea how to do about five years ago. But now we can actually use LLMs and VLMs, ask them questions, and they will make reasonable guesses. Like, they will not give you expert behavior, but you can say, like, hey, there's a sign that says slippery floor. Like, what's gonna happen when I walk over that?

Speaker 1

答案其实相当明显。

It's kinda pretty obvious.

Speaker 4

没错。

Right.

Speaker 1

对吧?2009年的任何自动驾驶汽车都无法回答这个问题。所以常识加上犯错与纠错能力——听起来非常像人类学习新事物时的过程。这些虽不能使机器人操作变得简单,但让我们可以从较小范围起步,逐步扩展。

Right? And no autonomous car in 2009 would have been able to answer that question. So common sense plus the ability to make mistakes and correct those mistakes, that's sounding an awful lot like what a person does when they're trying to learn something. All of that doesn't make robotic manipulation easy necessarily, but it allows us to get started with a smaller scope and then grow from there.

Speaker 0

多年来(我是说不仅从2009年起),我们拥有海量视频数据、语言数据和Transformer模型已有五到八年,包括谷歌、Meta等多家公司都尝试用大量训练数据构建基于Transformer的机器人。它们遇到瓶颈的原因是什么?现在有什么改变?

So for years, using, I mean not since 2009 but we've had lots of video data, language data and transformers for five, seven, eight years and lots of companies have tried to build transformer based robots with lots of training data, including Google, Meta, etcetera. And what is the reason that they've been hitting roadblocks? What has changed now?

Speaker 1

这是个非常好的问题。首先我要稍作修正:我认为它们已取得很大进展。某种程度上,我们Physical Intelligence现在的工作正是建立在谷歌等机构先前伟大成果的基础上——比如我们很多人之前就在谷歌工作过。

Yeah. That's a really good question. So I'll start off with maybe a slight modification to your comment is I think they they've made a lot of progress. And in some ways, a lot of the work that we're doing now at physical intelligence is built on the backs of lots of other great work that was done, for example, at Google, like many of us were actually at Google before.

Speaker 0

没错。

Right.

Speaker 1

我们参与了部分相关工作,其中也借鉴了他人的成果。这一领域确实取得了显著进展。但要让机器人基础模型真正发挥作用,这不仅是实验室里的科学实验,还需要工业级规模的构建努力。

We were involved in some of that work. Some of it is work that we're drawing on that others did. So there's definitely been a lot of progress there. But to make robotic foundation models really work, it's not just a laboratory science kind of experiment. It also requires a kind of industrial-scale building effort.

Speaker 1

这更像是阿波罗计划而非单纯的科研实验。过去工业研究实验室所做的卓越研究——我曾深度参与其中——大多被定位为基础研究。嗯。这很好,基础研究确实至关重要,但仅靠它是不够的。

It's more like the Apollo program than it is like a science experiment. And the excellent research that was done in the past in industrial research labs, and I know I was involved in much of that, was very much framed as a fundamental research effort. Mhmm. And that's good. The fundamental research is really important, but it's not enough by itself.

Speaker 1

你既需要基础研究,也需要将其落地的推动力。所谓落地,就是真正部署机器人,获取能代表现实任务需求的数据,大规模收集这些数据,构建系统,完善所有环节。这要求我们高度聚焦,纯粹为了完善机器人基础模型本身而投入,而非仅将其作为科研手段、论文发表途径或实验室项目。

You need the fundamental research, and you also need the impetus to make it real. And make it real means actually putting the robots out there, getting data that is representative of the kind of tasks that they need to do in the real world, getting that data at scale, building out the systems, getting all that stuff right. And that requires a degree of focus, a singular focus on really nailing the robotic foundation model for its own sake, not just as a way to do more science, not just as a way to publish a paper, and not just as a way to have a research lab.

Speaker 0

现在阻碍你们进一步扩大数据规模的因素是什么?如果数据是主要瓶颈,为什么不直接把办公规模扩大100倍,雇佣100倍的操作员来操控机器人收集数据?对,为什么不立即实现100倍的扩张?

What is preventing you now from scaling that data even more? If data is a big bottleneck, why can't you just increase the size of your office 100x, have 100x more operators who are operating these robots and collecting more data? Yeah. Why not ramp it up immediately 100x more?

Speaker 1

这是个很好的问题。关键在于理解规模扩展的维度与能力提升维度的对应关系。如果我们想横向扩展能力——比如让机器人从会做10件事提升到100件——直接水平扩展现有体系即可。但要让机器人达到能在现实世界完成实际有用任务的水平,就需要沿着其他维度进行扩展。

Yeah. That's a really good question. So the challenge here is in understanding which axes of scale contribute to which axes of capability. So if we want to expand capability horizontally, meaning the robot knows how to do 10 things now and I'd like it to do a 100 things later, that can be addressed by just directly horizontally scaling what we already have. But we want to get robots to a level of capability where they can do practical, useful things in the real world, and that requires expanding along other axes too.

Speaker 1

例如需要实现极高的鲁棒性,要求机器人高效快速地执行任务,需要它们识别边缘情况并做出智能响应。这些目标也可以通过扩展来实现,但我们必须找准扩展维度——这意味着要明确收集何种数据、在什么环境下收集、采用哪些方法来处理数据,以及这些方法如何运作。

It requires, for example, getting to very high robustness. It requires getting them to perform tasks very efficiently, quickly. It requires them to recognize edge cases and respond intelligently. And those things I think can also be addressed with scaling. But we have to identify the right axes for that, which means figuring out what kind of data to collect, what settings to collect it in, what kind of methods consume that data, how those methods work.

Speaker 0

对。

Right.

Speaker 1

因此,更全面地回答这些问题将使我们更清晰地理解这些轴线、那些因变量以及我们需要扩展的要素。目前我们并不完全清楚具体会是什么样子。我想我们很快就能弄清楚。这是我们正在积极研究的内容。但我们希望确保这一点做对,这样当我们进行扩展时,它能直接转化为与实际应用高度相关的能力。

So answering those questions more thoroughly will give us greater clarity on the axes, on those dependent variables, on the things that we need to scale. And we don't fully know right now what that will look like. I think we'll figure it out pretty soon. It's something we're working on actively. But we want to really get that right so that when we do scale it up, it'll directly translate into capabilities that are very relevant to practical use.

Speaker 0

为了给出一个大致的概念,你们收集的数据量与互联网规模的预训练数据相比如何?我知道很难进行逐标记的精确对比,因为,比如视频信息与互联网信息如何比较等等?但根据你们的合理估计,大概占多少比例——

Just to give an order of magnitude, how does the amount of data you have collected compare to Internet-scale pretraining data? And I know it's hard to do a token-by-token count because, yeah, how does video information compare to Internet information, etcetera? But like Yeah. Using your reasonable estimates, what fraction of

Speaker 1

没错,这确实很难衡量,因为机器人经验由高度相关的时间步骤组成。原始字节表示的数据量巨大,但信息密度可能相对较低。更好的比较对象是多模态训练所使用的数据集。

That's right. It's very hard to do because robotic experience consists of time steps that are very correlated with each other. Yeah. So like the raw like byte representation is enormous, but probably the information density is comparatively low. A better comparison is to the data sets that are used for multimodal training.

Speaker 1

是的,上次我们统计时,大约是1到2个数量级的差距。

Yeah. And there, I believe last time we did that count, it was between one and two orders of magnitude.

Speaker 0

你们对机器人技术的愿景,在收集到比如100倍、1000倍更多数据之前是无法实现的吧?

The vision you have of robotics will not be possible until you collect like what 100x, 1000x more data?

Speaker 1

问题就在于我们并不确定。当然可以合理推测机器人技术是个难题,可能需要与语言模型相当的经验数据。但正因我们不知道确切答案,我认为更有用的思考方式不是‘我们需要多少数据才能彻底完成’,而是‘我们需要多少数据才能启动’,即建立起一个能自我维持且持续增长的数据飞轮。

Well, that's the thing, we don't know that. It's certainly very reasonable to infer that robotics is a tough problem and probably it requires as much experience as the language stuff. But because we don't know the answer to that, to me, a much more useful way to think about it is not how much data do we need to get before we're fully done, but how much data do we need to get before we can get started, meaning before we can get a data flywheel that represents a self-sustaining and ever-growing source of data

Speaker 0

你说的自我维持是什么意思?这就像是在工作中学习,还是你有其他想法?

What do you mean, self-sustaining? Is this just like learning on the job, or do you have something else in mind?

Speaker 1

在工作中学习,或者说,以某种方式获取数据,使得获取数据的过程本身就有用且有价值。

Learning on the job or, acquiring data in a way that the process of acquisition of that data itself is useful and valuable.

Speaker 0

明白了。就像是某种强化学习(RL)。

I see. Like, just some kind of RL.

Speaker 1

就像是在做某些实际的事情。是的。理想情况下,希望是强化学习,因为这样可以让机器人自主行动。

Like, doing something actually real. Yeah. I mean, ideally, you would like it to be RL because you can get away with the robot acting autonomously

Speaker 4

对。

Right.

Speaker 1

这样更容易。但混合自主模式也并非不可行。正如我之前提到的,机器人可以从各种其他信号中学习。我描述过如何让机器人通过与人对话来学习。因此,在完全遥控的机器人和完全自主的机器人之间有很多中间地带。

Which is easier. But it's not out of the question that you can have mixed autonomy. As I mentioned before, robots can learn from all sorts of other signals. I described how we can have a robot that learns from a person talking to it. So there's a lot of middle ground in between fully teleoperated robots and fully autonomous robots.

Speaker 0

好的。那么π模型是如何工作的?

Yeah. Okay. And how does the π model work?

Speaker 1

是的。我们目前拥有的模型本质上是一个为运动控制调整过的视觉语言模型。用个形象的脑科学比喻来说,VLM(视觉语言模型)就像是给LLM嫁接了一个伪视觉皮层——视觉编码器。明白吗?所以我们的模型既有视觉编码器,也有动作专家模块,本质上就是个动作解码器。

Yeah. So the current model that we have is basically a vision-language model that has been adapted for motor control. So to give you a little bit of a fanciful brain analogy, a VLM, a vision-language model, is basically an LLM that has had a little pseudo visual cortex grafted onto it, a vision encoder. Right? So our models have a vision encoder, and they also have an action expert, an action decoder essentially.

Speaker 1

它就像同时拥有微型视觉皮层和概念上的运动皮层。模型实际做决策的方式是:读取机器人传来的传感信息,进行内部处理——这可能包括输出中间步骤,比如你让它'打扫厨房',它可能会想:'要打扫厨房得先拿起盘子,再拿海绵,然后把这个放那里...'最终通过这种思维链推导到动作专家模块,由后者生成连续动作。这必须是个独立模块,因为动作是连续的、高频的,其数据格式与文本标记不同。但从结构上说,它仍是端到端的Transformer,技术上大致对应混合专家架构。

So it has a little visual cortex and, notionally, a little motor cortex. And the way the model actually makes decisions is it reads in the sensory information from the robot and does some internal processing, and that could involve actually outputting intermediate steps. Like, you might tell it clean up the kitchen, and it might think to itself, hey, to clean up the kitchen, I need to pick up the dish and I need to pick up the sponge and I need to put this here. And then eventually it works its way through that chain-of-thought generation down to the action expert, which actually produces continuous actions. And that has to be a different module because the actions are continuous and high frequency, so they have a different data format than text tokens. But structurally, it's still an end-to-end transformer, and roughly speaking, technically it corresponds to a kind of mixture-of-experts architecture.
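The decision flow described here, a VLM backbone that reads the observation, emits intermediate chain-of-thought text, and hands off to a separate action expert that produces a chunk of continuous actions, can be sketched in Python. This is a toy stub, not the actual π model: the module internals, names, and the 4-step action horizon are all invented for illustration.

```python
# Toy sketch of a vision-language-action (VLA) decision loop. All module
# internals below are hypothetical stand-ins, not the real model.

from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image_tokens: List[float]   # stand-in for vision-encoder output
    instruction: str            # e.g. "clean up the kitchen"

def vlm_backbone(obs: Observation) -> List[str]:
    """Stand-in for the VLM: turn the instruction + image into subtask text.
    A real model would generate this chain of thought autoregressively."""
    return [f"step: {w}" for w in obs.instruction.split()]

def action_expert(chain_of_thought: List[str], horizon: int = 4) -> List[List[float]]:
    """Stand-in action decoder: emit a chunk of continuous, high-frequency
    actions (one short vector per control step) conditioned on the plan."""
    bias = float(len(chain_of_thought))  # toy conditioning on the plan
    return [[bias, 0.1 * t] for t in range(horizon)]

def policy_step(obs: Observation) -> List[List[float]]:
    cot = vlm_backbone(obs)      # internal text reasoning
    return action_expert(cot)    # continuous action chunk

actions = policy_step(Observation(image_tokens=[0.0], instruction="pick up the dish"))
```

In the real system the backbone and action expert share one end-to-end transformer (roughly a mixture-of-experts layout); the stub only mirrors the interface between them.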

Speaker 0

所以实际发生的是:它先预测'我应该做X',然后出现图像标记,接着是动作标记——对应实际行为,然后又是图像、文字描述、更多动作标记...我现在就在观察这个数据流。

And, like, what is actually happening is that it's predicting, I should do x thing, then there's an image token, then some action tokens, like what it actually ends up doing, and then more image, more text description, more action tokens. I'm, like, looking at what stream is going on.

Speaker 1

没错。不过动作部分并非离散标记,实际上采用的是流匹配式的扩散模型...

That's right. With the exception that the actions are actually not represented as discrete tokens. It actually uses a flow matching kind of diffusion

Speaker 0

嗯。

Mhmm.

Speaker 1

因为动作是连续的,灵巧控制需要极高精度。

Because they're continuous, and you need to be very precise with your actions for dexterous control.
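A minimal sketch of the flow-matching idea mentioned here: instead of emitting discrete action tokens, the action head learns a velocity field that transports a noise sample into a continuous action vector, and inference is a handful of Euler integration steps. The closed-form velocity below stands in for a trained, observation-conditioned network; the target vector and step count are invented for illustration.

```python
# Toy flow-matching sampler for a continuous action vector. In a real model,
# velocity() would be a neural network conditioned on the observation and
# chain of thought; here a closed-form stand-in is used so it runs on its own.

import random

def velocity(x, t, target):
    # For a straight-line probability path, the expected velocity points
    # from the current sample toward the target action.
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_action(target, steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                                   # t runs over [0, 1)
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # Euler step
    return x

action = sample_action(target=[0.5, -0.2, 1.0])
```

With this straight-line path the integration lands exactly on the target at t = 1, which is why a few coarse steps suffice; that precision in continuous space is the advantage over discretized action tokens.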

Speaker 0

最让我着迷的是——你们用的是开源的Gemma模型对吧?就是谷歌发布的开源LLM,然后在上面叠加这个动作专家模块。AI不同领域的进展居然不仅能共享相同技术,甚至可以直接复用同一个模型:拿开源LLM加装专业模块就行。特别值得注意的是,人们可能天真地认为机器人学和NLP是两个独立领域——但实际上它们根本就是一回事,底层考量完全相同。

I find it super interesting that, so you are, I think you're using the open source Gemma model, which is Google's LLM that they release open source, and then adding this action expert on top. I find it super interesting that the progress in different areas of AI is based on not only the same techniques, but literally the same model: you can just use an open-source LLM and then add an action expert on top. It is notable that you naively might think, oh, there's this separate area of research called robotics and a separate area of research called LLMs and natural language processing. No, it's literally the same. The considerations are the same.

Speaker 0

架构完全相同,连权重都一致。我知道你们在这些开源模型基础上做了更多训练,但这点我觉得特别有意思。

The architectures are the same, even the weights are the same. I know you do more training on top of these open source models, but that I find super interesting.

Speaker 1

没错。我认为这里有个关键主题值得注意:这些基础模块之所以如此宝贵,是因为AI社区在利用先验知识方面取得了巨大进步。我们从预训练的大语言模型和视觉语言模型中获取的,本质上都是关于世界的先验知识——某种抽象化的认知,比如能识别物体。

Yeah. So one theme here that I think is important to keep in mind is that the reason that those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge. And a lot of what we're getting from the pretrained LLMs and VLMs is prior knowledge about the world. And it's kind of like it's a little bit abstracted knowledge. It's like, you know, you can identify objects.

Speaker 1

你可以大致判断图像中物体的位置等等。如果要我用一句话总结AI近期创新对机器人技术的最大贡献,那正是这种利用先验知识的能力。模型本身相同这点在深度学习领域并不新鲜,但能从多元渠道吸收抽象先验知识的能力确实...

You can figure out, like, you know, roughly where things are in an image, that sort of thing. But if I had to summarize in one sentence the big benefit that recent innovations in AI give to robotics, it's really that ability to leverage prior knowledge. And the fact that the model is the same model, that's kind of always been the case in deep learning. Mhmm. But it's that ability to pull in that prior knowledge, that abstract knowledge that can come from many different sources.

Speaker 1

这才是真正强大的地方。

That's really powerful.

Speaker 0

今天和我一起的是哈德逊河交易公司的高级研究员马克。他为我们准备了庞大的市场价格和历史市场数据集,我们将尝试分析现状,并探索能否根据历史数据预测未来价格。马克,我们开始吧。很乐意配合。

Yeah. Today, I'm here with Mark, who is a senior researcher at Hudson River Trading. He has prepared for us a big data set of market prices and historical market data, and we're gonna try to figure out what's going on and whether we can predict future prices from historical market data. Mark, let's dig in. Happy to do it.

Speaker 0

听起来第一个有趣的环节应该是先看看实际订单簿长什么样。

So it sounds like the first fun thing to do is probably to start looking at what an order book actually looks like.

Speaker 4

嗯,我也这么认为。

Yeah. I think so.

Speaker 1

嗯。

Mhmm.

Speaker 4

所以我提供给你的,是真实的订单簿数据快照,包含了买卖双方前五档的报价,涉及英伟达、特斯拉、AMD等几家不同科技股的行情。

So I've given you, like, real order book data that is snapshots of the top five levels of the order book, both on the bid and ask side for a couple of different tech stocks, NVIDIA, Tesla, AMD, etcetera.

Speaker 0

预测结果的形态是怎样的?

What is the shape of the prediction?

Speaker 4

我们要预测什么?不妨取一个数据框,观察它的y值,大致绘制成直方图。这些值基本以零为中心分布。没错。

What are we predicting? Why not take a data frame, look at its y values, and just kind of histogram it? They're roughly centered at zero. Yeah.

Speaker 4

但具体预测什么目标?这些数据代表的是中间价从现在到未来短时间内的变化量

But target of what exactly? So these things are changes in the mid price from, like, now to some short period of time in the future.
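As a concrete illustration of the target being described: take top-of-book snapshots, compute the mid price, and label each snapshot with the mid-price change some short horizon ahead. The prices and horizon below are made up; a real pipeline would use all five levels of the book and much finer timestamps.

```python
# Toy construction of the prediction target: mid-price change over a short
# future horizon. Prices are invented, not real market data.

def mid_price(best_bid, best_ask):
    return (best_bid + best_ask) / 2.0

# (best_bid, best_ask) snapshots over time for one stock
snapshots = [(100.0, 100.2), (100.1, 100.3), (100.0, 100.4), (100.3, 100.5)]

horizon = 2  # predict the mid-price change 2 snapshots ahead
mids = [mid_price(b, a) for b, a in snapshots]
targets = [mids[i + horizon] - mids[i] for i in range(len(mids) - horizon)]
```

In real data these targets cluster tightly around zero, which is why the histogram in the conversation looks centered at zero.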

Speaker 0

这其实相当有趣。就像待解的谜题。

This is actually quite interesting. It's like a mystery to solve.

Speaker 4

其中每一项对研究者来说都可能需要耗费相当长的时间。

And each one of these can be, like, a sizable chunk of time for a researcher.

Speaker 0

如果这听起来让你感兴趣,你应该考虑加入哈德逊河交易公司。马克,大家可以在哪里了解更多信息?

If this sounds interesting to you, you should consider working at Hudson River Trading. Mark, where can people learn more?

Speaker 4

他们可以在hudson-trading.com/dwarkesh上了解更多。

They can learn more at hudson-trading.com/dwarkesh.

Speaker 0

太棒了。我之前和GDM的一位研究员桑德聊过,他研究视频和音频模型,他提出了一个有趣的观点:在他看来,我们之所以在不同模态间看不到太多迁移学习现象——比如用视频和图像训练语言模型,未必能显著提升其处理文本问题和任务的能力——是因为图像在语义层面上的表示与文本不同。他认为,文本在模型中具有高层次语义表征,而图像和视频本质上只是压缩的像素集合。当它们被嵌入模型时,并不真正携带高层次的语义信息。

Amazing. I was talking to this researcher, Sander at GDM, who works on video and audio models, and he made the interesting point that the reason, in his view, we aren't seeing that much transfer learning between different modalities (that is, training a language model on video and images doesn't seem to necessarily make it much better at textual questions and tasks) is that images are represented at a different semantic level than text. His argument is that text has this high-level semantic representation within the model, whereas images and videos are just compressed pixels. When they're embedded, they don't really represent high-level semantic information.

Speaker 0

它们就像压缩的像素块。因此在模型处理层级上无法实现迁移学习。显然这与你们的工作高度相关,因为你们希望通过同时训练模型处理机器人看到的视觉数据(可能还包括YouTube等通用视频数据)、语言信息以及机器人自身的动作信息,使模型获得整体上的鲁棒性。你们那篇关于'为什么视频模型不如语言模型鲁棒'的博客文章就非常有意思。

They're just like compressed pixels. And therefore there's no transfer learning at the level at which they're going through the model. And obviously this is super relevant to the work you're doing because your hope is that by training the model both on the visual data that the robot sees visual data generally, maybe even in like from YouTube or whatever, plus like language information, plus action information from the robot itself. Some of that, all of this together will like make it generally robust. And then you had a really interesting blog post about, like, why video models aren't as robust as language models.

Speaker 0

抱歉,这个问题问得不太严谨。我只是想对此作出反应

Sorry. This is not a super well formed question. I just wanted to react to

Speaker 1

一些问题。这是怎么回事?

some questions. What's up with that?

Speaker 0

是啊。确实。

Yeah. Yeah.

Speaker 1

是的。关于这个问题我有两点想说。我有一些坏消息和一些好消息。坏消息是你所说的确实触及了视频和图像生成模型长期面临的核心挑战。

Yeah. So I I have maybe two things I can say there. I have some, like, bad news and some good news. Yeah. So the bad news is what you're saying is really getting at the core of a long running challenge with video and image generation models.

Speaker 1

某种程度上,通过预测视频来构建智能系统的想法甚至比通过预测文本来构建智能系统的想法更早出现。嗯。但文本技术在实用化方面比视频技术更早取得突破。当然视频技术也很棒。

Yeah. Like, in some ways, the idea of getting intelligent systems by predicting video is even older than the idea of getting intelligent systems by predicting text. Mhmm. But the text stuff turned into use practically useful things earlier than the video stuff did. I mean, the video stuff is great.

Speaker 1

比如你可以生成很酷的视频,我认为最近这方面的成果非常惊人。但单纯生成视频和图像尚未造就出那种对世界有深刻理解的系统——那种能执行超越生成更多图像和视频任务的系统。而语言模型显然已经做到了。我认为表征方式的不同是其中的关键。

Like, you can generate cool videos, and I think the work that's been done there recently is amazing. But it's not like just generating videos and images has already resulted in systems that have this kind of deep understanding of the world, where you can ask them to do stuff beyond just generating more images and Yeah. And videos. Whereas with language, clearly it has. And I think that this point about representations is really key to it.

Speaker 1

我们可以这样理解:假设你在这栋楼外架设摄像机,画面里有天空、流动的云朵、水面、行驶的车辆和行人。要预测未来发生的所有事情,你可以用多种方式。比如关注人群,深入研究群体心理行为学来预测行人动向。

One way we can think about it is this: imagine pointing a camera outside this building. There's the sky, the clouds are moving around, the water, cars driving around, people. If you wanna predict everything that'll happen in the future, you can do so in many different ways. You can say, okay, there's people around, so let me get a really good understanding of the psychology of how people behave in crowds and predict the pedestrians.

Speaker 1

你也可以选择专注于空中飘动的云朵,深入研究水分子和冰晶的物理特性。甚至可以深入到亚原子层面——作为人类,你可能会耗费数十年时间研究这个领域,却仍然无暇顾及行人和水面的变化。对吧?

But you could also say like, well, there's clouds moving around. Let me like understand everything about water molecules and ice particles in the air. And you can go super deep on that. Like, if you wanna, like, fully understand, like, all you know, down to the subatomic level, everything that's going on, like, as a person, you could spend, like, decades just thinking about that, and you'll never even get to the pedestrians or the water. Right?

Speaker 1

因此若要真正预测场景中的所有变化,即便你在某个领域做到极致,等到研究完其他所有要素时,恐怕早已过去无数岁月。而文字信息已经被抽象为我们人类关心的核心要素,这些表征不仅优质,而且聚焦于真正重要的内容。

So if you want to really predict everything that's going on in that scene, there's just so much stuff that even if you're doing a really great job and capturing, like, a 100% of something, by the time you get to everything else, like, you know, ages will have passed. Right. Whereas with text, it's already been abstract into those bits that we as humans care about. So the representations are already there, and they're not just good representations. They actually focus in on what really matters.

Speaker 1

以上就是坏消息部分。现在说好消息:我们不必仅靠楼外的摄像机获取全部信息。因为当机器人执行任务时,它实际上是有明确目标的。

Okay. So that's that's the bad news. Here's the good news. The good news is that we don't have to just get everything out of like pointing a camera outside this building. Because when you have a robot, that robot is actually trying to do a job.

Speaker 1

所以它是有目的的。是的。它的感知能力服务于实现这一目的。这就像一个非常棒的聚焦因素。我们知道这对人们来说真的很重要。

So it has a purpose. Yeah. And its perception is in service to fulfilling that purpose. And that is like a really great focusing factor. We know that for people, this really matters.

Speaker 1

比如,你实际看到的东西会受到你试图做什么的影响。心理学实验层出不穷地表明,人们有着近乎惊人的隧道视野,如果与他们试图实现的目标无关,他们甚至会直接看不见眼前的东西。是的。这非常强大。人们这样做一定有原因,因为,你知道,在丛林中,看得多总比看得少好。

Like, literally what you see is affected by what you're trying to do. There's been no shortage of psychology experiments showing that people have an almost shocking degree of tunnel vision, where they will literally not see things right in front of their eyes if it's not relevant to what they're trying to achieve. Yeah. And that is tremendously powerful. There must be a reason why people do that, because, you know, certainly if you're out in the jungle, seeing more is better than seeing less.

Speaker 1

所以,如果你拥有如此强大的聚焦机制,那它对于实现目标一定至关重要。我认为机器人也会有这种聚焦机制,因为它们也在试图达成目标。

So if you have that powerful focusing mechanism, it must be darn important for getting it to achieve your goal. And I think robots will have that focusing mechanism because they're trying to achieve a goal.

Speaker 0

顺便说一下,视频模型不够稳健这一事实,对机器人技术来说是利空吗?因为你们将不得不使用的大量数据可能不会——我猜你们中有些人说很多会是标注过的——但理想情况下,你们希望能把所有YouTube上的视频、我们记录过的每一个视频都扔进去,让它学习物理世界如何运作以及如何移动等等。只是看人类执行任务并从中学习。但如果……是的。我猜你们的意思是,仅从这些学习很困难,实际上需要自己实践任务。

By the way, the fact that video models aren't as robust, is that bearish for robotics? Because so much of the data you will have to use will not be, I guess you're saying a lot of it will be labeled, but ideally you just want to be able to throw in everything on YouTube, every video we have ever recorded, and have it learn how the physical world works and how to move about, etcetera. Just see humans performing tasks and learn from that. But if yeah. I guess you're saying it's hard to learn just from that, and that it actually needs to practice the task itself.

Speaker 1

嗯,让我这么说吧。比如,假设我给你很多不同体育赛事的录像或录音。是的。然后给你一年时间只看体育。一年后,我告诉你,好了,现在你的工作是去打网球。

Well, let me put it this way. Like, let's say that I gave you lots of videotapes or lots of recordings of different sporting events Yeah. And gave you a year to just watch sports. And then after that year, I told you, okay. Now your job, you're gonna be playing tennis.

Speaker 1

是的。好吧。这听起来相当愚蠢,对吧?而如果我一开始就告诉你,你要去打网球,然后让你去学习。

Yeah. Okay. That's pretty dumb. Right? Whereas if I told you first, like, you're gonna be playing tennis, and then I let you study up.

Speaker 1

对吧?这样你就真正知道自己在寻找什么了。所以我认为这里确实存在一个非常现实的挑战。我不想低估这个挑战,但我确实认为,对于基础模型来说,通过与机器人系统的交互控制学习,实际上能更好地吸收其他数据源,因为它们知道自己要做什么。

Right? Like, now you really know what you're looking for. Right. So I think there's a very real challenge here. I don't wanna understate the challenge, but I do think that there's also a lot of potential for foundation models that are embodied, that learn from interaction, from controlling robotic systems, to actually be better at absorbing the other data sources, because they know what they're trying to do.

Speaker 1

我不认为仅凭这一点就是万灵药。我不觉得它能解决所有问题,但我确实认为它大有帮助。我们已经看到初步成效——在机器人训练中加入网络数据确实能显著提升泛化能力。

I don't think that that by itself is like a silver bullet. I don't think it solves everything. But I think that it does help a lot. And I think that we've already seen the beginnings of that where we can see that including web data in training for robots Yeah. Really does help with generalization.

Speaker 1

实际上我怀疑,长远来看这会让我们更容易利用那些迄今为止都难以处理的数据源。

And I I actually have the suspicion that in the long run, it'll make it easier to use those sources of data that have been tricky to use up until now.

Speaker 0

众所周知,大语言模型具备各种突现能力,这些能力并非刻意设计,而是因为互联网文本中恰好包含训练特定功能所需的数据。但机器人领域似乎需要人工收集所有数据,因此不会出现那种神秘的新能力——即数据集中意外包含你未主动收集的潜能。这似乎使得实现稳健的分布外能力更加困难。我在想未来五到十年,是否每个子任务都需要提供数千次训练片段,导致通过子任务实现工作自动化变得极其困难。

Famously, LLMs have all these emergent capabilities that were never engineered in, because somewhere in Internet text is the data to train on and give it the knowledge to do a certain kind of thing. With robots, it seems like you are collecting all the data manually. So there won't be this mysterious new capability that is somewhere in the data set that you haven't purposefully collected, which seems like it should make it even harder to then have robust out-of-distribution capabilities. And so I wonder if the track over the next five, ten years will just be that each subtask requires thousands of episodes, and then it's very hard to actually automate much work just by doing subtasks.

Speaker 0

想想咖啡师、服务员或厨师的工作——几乎没有哪个环节是固定在一个位置完成的。你需要走动、补货、修理设备,在前台、收银机和机器之间穿梭。这是否意味着将存在一长串需要持续手动添加训练片段、标注和评估表现的技能?还是说有理由相信进展会比这更普遍?

So if you think about what a barista does, what a waiter does, what a chef does, very little of it involves just sitting at one station and doing stuff. You've got to move around, you've got to restock, you've got to fix the machine, go between the counter and the cashier and the machine, etcetera. So will there just be this long tail of skills that you have to keep manually adding episodes for and labeling and seeing how well they did, etcetera? Or is there some reason to think that it will progress more generally than that?

Speaker 1

这里有个微妙之处。突现能力不仅源于互联网数据的海量内容,更因为泛化能力达到特定水平后会呈现组合性。我的学生喜欢用这个例子:你知道国际音标是什么吗?

Yeah. So there's a subtlety here. Emergent capabilities don't just come from the fact that Internet data has a lot of stuff in it. They also come from the fact that generalization, once it reaches a certain level, becomes compositional. There's a cute example that one of my students really likes to use in some of his presentations, which is: do you know what the International Phonetic Alphabet is?

Speaker 1

IPA?如果你查字典,会发现单词发音用些奇怪字母标注——那就是国际音标。这套字母系统几乎专门用于字典中的单词发音记录。

No. IPA? So if you look in a dictionary, they'll have the pronunciation of a word written in kinda funny letters. That's basically the International Phonetic Alphabet. So it's an alphabet that is pretty much exclusively used for writing down pronunciations of individual words in dictionaries.

Speaker 1

但你可以让大语言模型用国际音标写份菜谱,它真的能做到。这太惊人了——因为它绝对没见过这种用法,国际音标本只用于标注单词发音。这就是组合泛化的力量。

And you can ask an LLM to write you a recipe for making some meal in the International Phonetic Alphabet, and it will do it. Right. And that's like, holy crap, that is definitely not something that it has ever seen, because IPA is only ever used for writing down pronunciations of individual words. So that's compositional generalization.

Speaker 1

它正在以新的方式组合你见过的类似事物。可以说,这里并没有真正全新的东西,因为虽然你见过不同的单词以那种方式书写,但你现在发现可以用同样的方式组合另一种语言的单词——就像你在英语中组合单词那样。这正是新兴能力涌现的根源。因此,原则上,只要我们拥有足够多样化的行为样本,模型就应该能根据情境需要,将这些行为以新的方式组合起来。

It's putting together things you've seen in new ways. And arguably there's nothing profoundly new here, because, yes, you've seen different words written that way, but you figured out that you can compose the words in this other language the same way Yeah. that you've composed words in English. So that's actually where the emergent capabilities come from. And because of this, in principle, if we have a sufficient diversity of behaviors, the model should figure out that those behaviors can be composed in new ways as the situation calls for it.

Speaker 1

事实上,即便在我们现有模型中(顺带一提,我认为从五年后的宏观视角回看,这些模型规模可能微不足道),我们已经观察到所谓的新兴能力。当我们测试某些衣物折叠策略时,偶然发现机器人从筐里误抓了两件T恤——它开始折叠第一件时,另一件碍事了,于是抓起第二件扔回筐里。我们当时都震惊了:完全没预料到它会这样处理。

And we've actually seen things even with our current models, which, I should say, in the grand scheme of things, looking back five years from now, we'll probably think are tiny in scale. But we've already seen what I would call emergent capabilities. When we were playing around with some of our laundry-folding policies, we actually discovered this by accident. The robot accidentally picked up two t-shirts out of the bin instead of one, starts folding the first one, the other one gets in the way, picks up the other one, throws it back in the bin. And we're like, we didn't know it would do that.

Speaker 1

没错,简直不可思议!后来我们反复测试发现,它每次都会这样处理。你可以随时干扰它的工作——比如往桌上扔件新物品,它就会捡起来放回去。对吧?

Like Yeah. Holy crap. And then we tried to play around with it and it's like, yep, it does that every time. You can drop in, it's doing its work, drop something else on the table, just pick it up, put it back. Right?

Speaker 1

更酷的是购物袋场景:当它正往袋子里装东西时,袋子倒了,它会扶正袋子继续工作。我们从未专门为此收集训练数据,也许有人曾无意或故意扶起过购物袋。但关键在于,大规模学习会自然催生这种组合能力——所有卓越功能都源于此。现在再结合语言模型和思维链推理,模型就具备了无限可能的新组合方式。

Okay, that's cool. Shopping bag: it starts putting things in the shopping bag, the shopping bag tips over, it picks it back up and stands it upright. We didn't tell anybody to collect data for that. I'm sure somebody at some point accidentally or intentionally picked up the shopping bag, but you just have this kind of compositionality that emerges when you do learning at scale, and that's really where all these remarkable capabilities come from. And now you put that together with language, you put that together with all sorts of chain-of-thought reasoning, and there's a lot of potential for the model to compose things in new ways.

Speaker 0

没错。我在参观你们办公室时见过类似案例——当时机器人正在叠短裤。不确定训练集里是否有这种场景,但出于好玩,我把其中一条短裤翻了个面。结果它居然明白需要先恢复原状(顺便说,它的抓取器就两个简单肢体,类似可对握的拇指结构,但能完成的任务令人震惊)。

Right. I had an example like this when I got a tour of the robots at your office, by the way. It was folding shorts, and I don't know if there was an episode like this in the training set, but just for fun, I took one of the shorts and turned it inside out. And then it was able to understand what it first needed to do. So first of all, the grippers are just, like, two limbs, just an opposable finger-and-thumb kind of thing. And it's actually shocking how much you can do with just that.

Speaker 0

是的,它理解需要先翻回正面再正确折叠。最惊人的是,这个模型似乎只有一秒的上下文记忆——相比语言模型能浏览整个代码库、分析数十万token后才输出,或通过数千token的思维链来制定编码计划...

Yeah. It understood that it first needed to turn the shorts right side out before folding them correctly. I mean, what's especially surprising about that is it seems like this model only has one second of context, as compared to these language models, which can often see the entire code base, observing hundreds of thousands of tokens and thinking about them before outputting, or observing their own chain of thought for thousands of tokens before making a plan about how to code something up.

Speaker 0

你们的模型仅能看到上一秒的画面,模糊知道要折叠这条短裤。它似乎仅凭最后一秒的图像就能工作——这太疯狂了!看到最新情况后继续执行计划:先翻面,再折叠。难以相信一秒的上下文就足以支撑长达一分钟的任务执行。

Your model is seeing one image of what happened in the last second, and it vaguely knows it's supposed to fold this short. And I guess it works. It's crazy that it will just see the last thing that happened and then keep executing on the plan: turn it right side out, then fold it correctly. But it's shocking that a second of context is enough to execute a minute-long task. Yeah.

Speaker 0

我很好奇你最初为何做出那个选择,以及为何实际上能执行任务。如果人类只有约一秒的记忆且必须进行体力劳动,感觉那根本不可能实现。

I'm curious why you made that choice in the first place, and why it's possible to actually do tasks that way. If a human only had about a second of memory and had to do physical work, I feel like that would just be impossible.

Speaker 1

是的。我的意思是,并非说记忆容量小有什么好处。我认为增加记忆、延长上下文、提高图像分辨率这些都会让模型更好。但对你参观时看到的那类技能而言,它之所以不是最重要的,某种程度上我认为要回归到莫拉维克悖论。莫拉维克悖论基本就是——如果你想了解机器人学的一个核心概念,那就是它。该悖论指出在AI领域,简单的事情反而困难,困难的事情反而简单,比如我们视为理所当然的抓取物体、感知世界等,这些恰恰是AI最难解决的问题。

Yeah. I mean, it's not that there's something good about having less memory, to be clear. I think that adding memory, adding longer context, adding higher-resolution images, those things will make the model better. But the reason it's not the most important thing for the kind of skills that you saw when you visited us comes back, at some level, to Moravec's paradox. If you want to know one thing about robotics, that's the thing. Moravec's paradox says that in AI, the easy things are hard and the hard things are easy, meaning the things that we take for granted, like picking up objects and perceiving the world, are all the hard problems in AI.

Speaker 1

而我们觉得具有挑战性的事情,比如下棋和微积分运算,实际上往往是更容易解决的问题。

And the things that we find challenging like playing chess and doing calculus actually are often the easier problems.

Speaker 1

我认为这个记忆问题更像是伪装下的莫拉维克悖论——我们觉得那些需要高度认知的任务(我们感到困难、让我们觉得'天啊我在流汗拼命'的任务)正是需要我们在记忆中保持大量信息的。比如解决复杂数学题时,或在播客中进行技术性对话时,你必须把所有碎片信息都记在脑子里。但如果是经过充分练习的任务,比如奥运选手以完美姿态游泳时处于'心流'状态,人们甚至会形容为'活在当下'。

I think this memory stuff is actually more Moravec's paradox in disguise. We think that the cognitively demanding tasks that we do, the ones we find hard, the ones that make us think, man, I'm sweating, I'm working so hard, are the ones that require us to keep lots of stuff in memory, lots of stuff in our minds. If you're solving some big math problem, if you're having a complicated technical conversation on a podcast, those are things where you have to keep all those puzzle pieces in your head. But if you're doing a well-rehearsed task, if you are an Olympic swimmer and you're swimming with perfect form, you're right there in the zone. People even say you're in the moment.

Speaker 1

就是活在当下。对吧?就像经过反复练习后,这些技能已经内化到你大脑的神经网络中了。

It's in the moment. Right? It's like you've practiced it so much, you've baked it into the neural network in your brain.

Speaker 4

没错。

Yeah.

Speaker 1

你不需要费心去仔细考虑如何维持所有这些上下文。是的,对吧?所以这实际上就是莫拉维克悖论的具体体现,但这并不意味着我们不需要记忆。它只是意味着,如果我们想要达到人类所拥有的灵活性和身体熟练度水平,我们首先需要处理好其他一些事情,然后逐步提升到更具认知挑战性的领域,如推理、上下文理解和规划等等。

So that you don't have to think carefully about keeping all that context. Right? So it really is just Moravec's paradox manifesting itself. But that doesn't mean that we don't need memory. It just means that if we want to match the level of dexterity and physical proficiency that people have, there are other things we should get right first, and then gradually go up that stack into the more cognitively demanding areas: into reasoning, into context, into planning, all that kind of stuff.

Speaker 1

那些方面也同样重要。

And that stuff will be important too.

Speaker 0

从物理层面来看,你面临着这样一个三难困境,有三样东西在推理过程中都需要更多的计算资源,而你希望同时提升它们。首先是推理速度,人类能以每秒24帧或类似的速度处理信息,我们能对事物做出极其迅速的反应。其次是上下文长度,我认为对于那种只是帮你打扫房子的机器人来说,它需要能够记住几分钟甚至几小时前发生的事情,并理解这些信息如何影响它接下来任务的计划。

And how will that physically work? You have this trilemma: three different things which all take more compute during inference and that you want to increase at the same time. First there's inference speed. Humans are processing 24 frames a second or whatever it is; we can react to things extremely fast. Then you have context length. For the kind of robot that's just cleaning up your house, I think it has to be aware of things that happened minutes or hours ago and how that influences its plan for the next task it's doing.

Speaker 0

然后是模型规模,至少在大型语言模型(LLMs)中我们已经看到,增加参数数量能带来性能提升。目前的情况是,推理速度是100,上下文长度是一秒,模型参数大概是几十亿?具体是多少?好吧。这些指标中至少有两项比人类等效水平小了好几个数量级,对吧?比如人脑可能有数万亿参数,而这个模型只有约20亿参数;人类的处理速度至少和模型一样快,实际上可能更快;我们的上下文记忆能持续几小时甚至几分钟——取决于你如何定义人类的上下文记忆。

And then you have the model size, and at least with LLMs we've seen that there are gains from increasing the number of parameters. Currently you have an inference speed of 100, you have a second-long context, and the model is, what, a couple billion parameters? How many? Okay. At least two of these are many orders of magnitude smaller than what seems to be the human equivalent, right? The human brain has trillions of parameters and this has about two billion parameters; humans are processing at least as fast as the model, actually a decent bit faster; and we have hours of context, or minutes of context, depending on how you define human context.

Speaker 0

有时候能记住几十年前的事。没错。所以我们需要在这三个相互制约的方面都实现多个数量级的提升——提高其中一个就会减少推理时可分配给其他方面的计算资源。那么我们该怎么...是的。

Sometimes decades of context. Yeah. Exactly. So you have to have many orders of magnitude of improvement across all three of these things, which seem to oppose each other: increasing one reduces the amount of compute we can dedicate to the others at inference. So how are we gonna... yeah.

Speaker 0

我们该如何解决这个问题?

How are we gonna solve this?

Speaker 1

是啊,这是个非常宏大的问题。让我们试着拆解一下,我觉得这里面涉及很多层面。

Yeah. Well, that's a very big question. Let's try to unpack this a little bit. I think there's a lot going on in there.

Speaker 1

我认为一个非常有趣的技术问题——未来几年我们可能会在这个领域看到大量创新——就是如何表征上下文语境。比如你提到的家庭机器人例子,当它需要像人类一样追踪任务时,某些事项我们会用近乎语言符号的方式记录。就像我的购物清单:买酸奶、买牛奶、买其他东西——我脑海里浮现的就是文字清单本身,而不是具体摆放牛奶的货架画面。

One thing that I would say is a really interesting technical problem, and something where we'll perhaps see a lot of really interesting innovation over the next few years, is the question of representation for context. Imagine some of the examples you gave, like a home robot that's doing something and needs to keep track of things. As a person, there are certainly some things you keep track of very symbolically, almost in language. I have my checklist, I am going shopping, and at least for me, I can literally visualize the checklist in my mind: pick up the yogurt, pick up the milk, pick up whatever.

Speaker 1

但另一些事项则更具空间视觉性。比如我来工作室的路上,思考的是街道的样貌、门厅的预期外观这类空间信息。

And I'm not picturing the milk shelf with the milk sitting there, I'm just thinking, like, milk. Right? But then there are other things that are much more spatial, almost visual. When I was trying to get to your studio, I was thinking, okay, here's what the street looks like, here's what that street looks like, here's what I expect the doorway to look like.

Speaker 1

因此关键在于:如何用恰当的形态表征上下文,既要捕捉实现目标所需的核心信息,又要剔除冗余内容。多模态模型已开始探索这个方向,但真正的多模态远不止图像加文本的简单组合——这里存在着广阔的创新空间。

So representing your context in the right form, one that captures what you really need to achieve your goal and otherwise discards all the unnecessary stuff, I mean, that's a really important thing. I think we're seeing the beginnings of that with multimodal models, but multimodality has so much more to it than just image plus text. And that's a place where there's a lot of room for really exciting innovation.

Speaker 0

哦,你是指表征方式的问题?嗯,明白了。

Oh, do you mean in terms of how we represent? Mhmm. Okay.

Speaker 1

是的。包括对过往经历的记录(上下文),以及对未来计划或推理过程的表征——在大语言模型领域我们称之为中间处理阶段。我认为开发适合不同任务的多模态表征方式(甚至包括自适应学习的新模态),将极大帮助我们突破当前的技术瓶颈。

Yeah. How we represent context, both what happened in the past and also plans, or reasoning as you might call it in the LLM world, which is what we would like to happen in the future, or intermediate processing stages in solving a task. I think doing that in a variety of modalities, including potentially learned modalities that are suitable for the job, has enormous potential to overcome some of these challenges.

Speaker 0

有意思。说到推理效率的权衡问题,我不禁对比人类大脑:它能存储数十年的记忆,却能在毫秒级做出反应,尽管拥有百万亿级参数。这是否意味着人脑硬件远超GPU性能?还是说人脑的视频信息编码算法效率更高?或许类似混合专家模型,实际激活参数仅数十亿?总之,为什么当前模型在多个维度上都比人脑低效数个数量级?

Interesting. Another question I have, as we're discussing these tough inference trade-offs, is comparing it to the human brain, and figuring out how the human brain is able to have hours, even decades, of context while being able to act on the order of ten milliseconds, while having 100 trillion parameters or however you want to count it. I wonder whether the best way to understand what's happening here is that human brain hardware is just way more advanced than the hardware we have in GPUs, or that the algorithms for encoding video information are way more efficient. Or maybe it's some crazy mixture of experts where the active parameters are also on the order of billions, or some mixture of the two. Basically, if you had to think about it: why do we have these models that are, across many dimensions, orders of magnitude less efficient?

Speaker 0

你认为根本瓶颈在于硬件还是算法?

Is it hardware or algorithms, compared to the brain?

Speaker 1

是的,这是个非常好的问题。我确实不知道这个问题的答案,我对神经科学并不精通。但如果要我猜测并基于我所知的领域给出一个答案,那大概是这样的:大脑是高度并行处理的。从生物物理学的角度看,它必须如此,但实际上它的并行程度甚至超过了你的GPU。

Yeah, that's a really good question. I definitely don't know the answer to this. I am not by any means well versed in neuroscience, but if I had to guess, and also provide an answer that leans more on things I know, it's something like this: the brain is extremely parallel. It kind of has to be, just because of the biophysics, but it's even more parallel than your GPU. Yeah.

Speaker 1

想想现代多模态语言模型处理输入的方式:当你输入图像和文本时,它先读取图像,再读取文本,然后逐词生成输出。对我而言,具身系统拥有并行处理能力才更合理。从数学角度,并行与串行处理其实可以建立近似等价关系。比如Transformer本质上并非串行架构——是通过位置嵌入才使其呈现串行特性,其本质其实是高度可并行化的。

If you think about how a modern multimodal language model processes its input: if you give it some images and some text, first it reads in the images, then it reads in the text, and then it proceeds one token at a time to generate the output. It makes a lot more sense to me for an embodied system to have parallel processes. Now, mathematically, you can actually make close equivalences between parallel and sequential processing. Transformers aren't actually fundamentally sequential; you kind of make them sequential by putting in position embeddings. Transformers are fundamentally very parallelizable things.
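
An aside for readers: the claim that attention itself has no built-in notion of order is easy to verify. Here is a minimal numpy sketch (an editorial toy, not anything from Physical Intelligence): a single attention head with identity projections and no position embeddings treats its input as an unordered set, so permuting the tokens just permutes the output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Single-head attention with identity Q/K/V projections, no mask,
    # no position embeddings: the output depends only on the set of tokens.
    A = softmax(X @ X.T / np.sqrt(X.shape[1]))
    return A @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, 8 dims
perm = rng.permutation(5)

out = self_attention(X)
out_perm = self_attention(X[perm])   # same tokens, shuffled order

# Permuting the inputs just permutes the outputs: attention is order-blind,
# which is why position embeddings are needed to make it "sequential".
assert np.allclose(out[perm], out_perm)
```

Adding a position embedding to each row of `X` before the attention call breaks this equivariance, which is the sense in which sequentiality is bolted on rather than fundamental.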

Speaker 1

没错,这正是它们的伟大之处。所以我认为,从数学角度看,这种同时进行感知、本体感觉和规划的高度并行系统,未必需要与Transformer有本质区别——尽管实际实现方式会不同。你可以想象这个系统会并行思考:比如长期记忆(十年前见过的场景)、

Right. That's what makes them so great. So I don't think that, mathematically, this highly parallel thing, where you're doing perception and proprioception and planning all at the same time, necessarily needs to look that different from a transformer, although its practical implementation will be different. And you could imagine that the system will, in parallel, think about: okay, here's my long-term memory, here's what I've seen a decade ago.

Speaker 1

短期空间记忆、语义信息、当前所见、未来规划等。这些都可以通过某种熟悉的注意力机制来实现,但实际运行时全部并行处理——可能以不同速率运行,复杂任务较慢,快速反应任务较快。

Here's my short-term spatial stuff. Here's my semantic stuff. Here's what I'm seeing now. Here's what I'm planning. And all of that can be implemented with some very familiar kind of attentional mechanism, but in practice all running in parallel, maybe at different rates, with the more complex things running slower and the faster reactive stuff running faster.

Speaker 0

相信你已经看到人们用谷歌新图像生成模型Nano Banana创作的各种趣味图片,我的X时间线都被刷屏了。但你可能没意识到,这个模型还能完成修复历史照片或简单图像清理等低调任务。比如我在准备采访Sarah Payne时读到一本旧平装书,里面有张精彩的二战盟军航运图表想用于讲座。过去这需要编辑手动数字化清理20-30分钟,

I'm sure you've been seeing a bunch of fun images that people have been generating with Google's new image generation model, Nano Banana. My X feed is full of wild images. But you might not realize that this model can also help you do less flashy tasks, like restoring historical pictures or even just cleaning up images. For example, I was reading an old paperback as I was prepping to interview Sarah Paine, and it had this really great graph of World War II Allied shipping that I wanted to overlay in the lecture. In the past, this would have taken one of my editors twenty or thirty minutes to digitize and clean up manually.

Speaker 0

现在我们只需拍摄书页照片丢进Nanobanana就能获得干净版本。虽然一次成功,但若首次效果不理想,你可以反复调整直到满意。我们不断发现这个模型的新用途,说实话它神奇得像假的一样。快去Google AI Studio和Gemini应用体验Gemini 2.5 Flash图像模型(又名nano banana)吧。

But now we just took a photo of the page, dropped it into Nano Banana, and got back a clean version. This was one-shot, but if Nano Banana doesn't nail it on the first attempt, you can go back and forth with it until you get a result that you're super happy with. We keep finding new use cases for this model, and honestly, it is one of those tools that just doesn't feel real. Check out the Gemini 2.5 Flash image model, a.k.a. Nano Banana, in both Google AI Studio and the Gemini app.

Speaker 0

好的,回到Sergei的问题。如果五年后我们拥有像人类一样稳健的实体交互系统,那么是什么技术进步使得运行这类模型成为可能?如何实现实时视频流处理,或将数小时历史视频信息编码后仍在毫秒级完成解码考量?是英伟达推出了更强GPU?还是你们发明了更高效的编码器?这五年究竟会发生什么?

All right, back to Sergey. If in five years we have a system which is as robust as a human in terms of interacting with the world, then what has happened that makes it physically possible to run those kinds of models? To have video information streaming in real time, or hours of prior video information somehow encoded and considered, while decoding on a millisecond scale, and with many more parameters? Is it just that NVIDIA has shipped much better GPUs, or that you guys have come up with much better encoders, or what has happened in those five years?

Speaker 1

我认为这个问题涉及很多方面。首先,这确实是一个非常引人入胜的系统性问题。虽然我绝非系统专家,但可以想象,在实践中,尤其是想要构建一个经济实惠的低成本系统时,正确的架构应该是将至少部分思考过程外部化。是的。想象一下,未来可能会出现这样的机器人:如果你的网络连接不佳,它就处于一种较为迟钝的被动反应模式。

I think there are a lot of things to this question. Certainly there's a really fascinating systems problem. I'm by no means a systems expert, but I would imagine that the right architecture in practice, especially if you want an affordable, low-cost system, would be to externalize at least part of the thinking. Yeah. You could imagine that in the future you'll have a robot where, if your Internet connection is not very good, the robot is in a kind of dumber, reactive mode.

Speaker 1

但如果网络连接良好,它就能表现得更加智能一些。

But if you have a good Internet connection, then it can be a little smarter.

Speaker 0

没错。

Right.

Speaker 1

这相当酷。但我认为研究和算法方面也能提供帮助,比如找到合适的表征方式,简洁地表示过去的观察结果以及观察中的变化。对吧?要知道,感官数据流在时间上具有极强的相关性,这意味着从每个新增观察中获得的边际信息并不等同于该观察的全部内容。

That's pretty cool. But I think there are also research and algorithms questions that can help here, like figuring out the right representations: concisely representing both your past observations and changes in observation. Right? Your sensory stream is extremely temporally correlated, which means that the marginal information gained from each additional observation is not the same as the entirety of that observation.

Speaker 1

因为我现在看到的图像与之前看到的图像高度相关。所以原则上,如果想简洁地表示它,我可以采用比独立表示图像更压缩的方式。因此在算法层面有很多优化空间,这是非常有趣的算法研究工作。同时,系统性问题也极其迷人。说实话,我还没深入研究系统问题,因为你得先了解机器学习的大致框架才能实施系统。

Because the image that I'm seeing now is very correlated with the image I saw before. So in principle, if I want to represent it concisely, I can get away with a much more compressed representation than if I represent the images independently. So there's a lot that can be done on the algorithm side to get this right, and that's really interesting algorithms work. I think there's also a really fascinating systems problem. To be truthful, I haven't gotten to the systems problem, because you want to implement the system once you sort of know the shape of the machine learning... Yeah.
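
To make the point about temporal correlation concrete, here is a toy numpy sketch (an editorial illustration, not Physical Intelligence's actual encoder): for a slowly varying stream, frame-to-frame deltas carry far less variance than the frames themselves, yet the stream can be reconstructed exactly from the first frame plus the deltas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sensory stream": each frame is the previous frame plus small noise,
# so consecutive frames are highly correlated.
frames = [rng.normal(size=64)]
for _ in range(99):
    frames.append(frames[-1] + 0.01 * rng.normal(size=64))
stream = np.stack(frames)            # shape (100, 64)

deltas = np.diff(stream, axis=0)     # shape (99, 64)

# The marginal information per new frame is tiny compared to the frame
# itself: the deltas have far smaller spread than the raw observations.
assert deltas.std() < 0.1 * stream[1:].std()

# Reconstruction from the first frame plus accumulated deltas is exact,
# so this compressed representation loses nothing here.
recon = stream[0] + np.cumsum(deltas, axis=0)
assert np.allclose(recon, stream[1:])
```

In a real encoder the "delta" would be a learned latent rather than a raw difference, but the underlying bet is the same: spend representation capacity on what changed, not on re-describing the whole scene every tick.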

Speaker 1

解决方案。但我认为这方面还有很多值得探索的精彩内容。

Solution. But I think there's a lot of cool stuff to do there.

Speaker 0

是啊。或许你们需要招聘像运营YouTube数据中心那样的人才,因为他们精通视频信息编码。这其实引出一个有趣的问题:理论上,你可以在自己的笔记本电脑上运行大型语言模型,但现实中,最有效的模型是同时为成千上万用户批量运行的,而非本地运行。鉴于批量处理的固有效率优势,机器人领域是否也会出现同样情况?

Yeah. Maybe you guys need to hire the people who run the YouTube data centers, because they know how to encode video information. Okay, this is actually an interesting question. With LLMs, theoretically you could run your own model on this laptop or whatever, but realistically, what happens is that the largest, most effective models are being run in batches of thousands or millions of users at the same time, not locally. Will the same thing happen in robotics, because of the inherent efficiencies of batching?

Speaker 0

再加上我们必须执行这项计算密集度极高的推理任务。因此你不会希望每台机器人都配备价值5万美元的GPU之类的设备。你只希望这些计算在其他地方完成。那么在这个机器人世界里,我们是否应该预见这样一种场景:你需要无处不在的网络连接?需要能高速传输视频信息来回交互的机器人,对吧?

Plus the fact that we have to do this incredibly compute-intensive inference task. So you don't want to be carrying around a $50,000 GPU per robot or something; you just want that to happen somewhere else. So in this robotics world, should we be anticipating something where you need connectivity everywhere? Robots with super fast connections, streaming video information back and forth, right?

Speaker 0

或者至少是单向视频信息传输。那么这是否会对机器人实际部署方式产生某些有趣的启示?

Or at least video information one way. Does that have interesting implications for how this deployment of robots will actually be instantiated?

Speaker 1

我不确定。但如果要我猜测,我认为两种模式都会出现。我们将看到采用外部推理的低成本系统,以及在无法依赖网络连接的环境(比如户外机器人)中使用的、成本更高但具备本地推理能力的更可靠系统。从技术角度我想补充几点:虽然实时系统显然需要高频实时控制,但每个时间步长所需的实际计算量可能低得出乎意料。

I don't know. But if I were to guess, I would guess that we'll actually see both: low-cost systems with off-board inference, and more reliable systems with on-board inference that are costlier, for settings where you can't rely on connectivity, like an outdoor robot. A few things I'll say from a technical standpoint that might contribute to understanding this. While a real-time system obviously needs to be controlled in real time, often at high frequency, the amount of thinking you actually need to do at every time step might be surprisingly low.

Speaker 1

这其实在人类和动物身上也能观察到。当我们规划动作时,大脑确实存在真实的规划过程。比如记录猴子大脑活动时,你确实能找到与规划相关的神经信号。在动作发生前存在某种预判机制,而实际执行时的动作形态与动作前的神经活动相关。

And again, we see this in humans and animals. When we plan out movements, there is definitely a real planning process that happens in the brain. If you record from a monkey's brain, you will actually find neural correlates of planning. Something happens in advance of a movement, and when that movement actually takes place, the shape of the movement correlates with what happened before the movement. Yeah.

Speaker 1

这就是规划的本质,对吧?意味着你预先设定某种初始条件,然后展开这个预设过程形成动作。因此在动作执行期间,你进行的实时处理较少,更多是提前批量处理。但这并非完全开环控制。

That's planning, right? So that means that you put something in place, set the initial conditions of some kind of process, and then unroll that process, and that's the movement. And that means that during that movement you're doing less processing; you kind of batched it up in advance. But you're not entirely open-loop.

Speaker 1

不像播放录音带那样机械。你实际上在执行过程中持续反应,只是反应的抽象层级不同——处于更基础的抽象层面。这又回到了表征方式的问题。

It's not like you're playing back a tape recorder. You are actually reacting as you go. You're just reacting at a different, more basic level of abstraction. And again, this comes back to representations.

Speaker 1

关键在于确定哪些表征适合预先规划后展开执行,哪些需要紧密的反馈循环。对于紧密反馈,你具体在反馈什么?比如驾驶车辆时,我可能通过车道标记位置保持直线行驶,而以较低频率感知周围车流状况。

Figuring out which representations are sufficient for planning in advance and then unrolling, and which representations require a tight feedback loop. And for that tight feedback loop, what are you doing feedback on? If I'm driving a vehicle, maybe I'm doing feedback on the position of a lane marker so that I stay straight, and then at a lower frequency I gauge where I am in traffic.
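
The plan-slowly, react-quickly split described here can be sketched as a two-rate loop. This is a hypothetical toy controller (the constants, the target, and the proportional correction are all invented for illustration; it is not Physical Intelligence's architecture): a slow step re-plans a chunk of nominal actions every few ticks, while a cheap feedback correction runs at every tick.

```python
PLAN_EVERY = 10     # slow "cognitive" loop: re-plan every 10 ticks
TICKS = 50          # fast reactive loop: runs every tick

target = 1.0        # e.g. a lane-center position to hold
state = 0.0
plan = []
log = []

for t in range(TICKS):
    if t % PLAN_EVERY == 0:
        # Slow step: plan a chunk of nominal actions that walk the
        # state toward the target over the next PLAN_EVERY ticks.
        step = (target - state) / PLAN_EVERY
        plan = [step] * PLAN_EVERY
    nominal = plan[t % PLAN_EVERY]
    # Fast step: a cheap proportional correction applied every tick,
    # so execution is not open-loop between re-plans.
    correction = 0.01 * (target - state)
    state += nominal + correction
    log.append(state)

# After a few plan/react cycles the state has converged near the target.
assert abs(log[-1] - target) < 0.05
```

Most of the "thinking" happens once per chunk; the per-tick work is a single multiply-add, which is the sense in which the amount of computation needed at every time step can be surprisingly low.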

Speaker 0

你几年前有几场讲座提到,即使在机器人领域,强化学习在很多情况下也比模仿学习更优,但目前模型仍完全采用模仿学习。我很好奇你对此的看法是否有所改变?如果没变,那为何还不能实现强化学习?

You have a couple of lectures from a few years back where you say that even for robotics, RL is in many cases better than imitation learning. But so far the models are exclusively doing imitation learning. So I'm curious how your thinking on this has changed. Or maybe it hasn't changed, but then why can't you do RL yet?

Speaker 1

是的。关键在于先验知识。要想从自身经验中有效学习,事实证明必须对所做之事已有认知基础,否则学习过程会极其漫长。

Yeah. So the key here is prior knowledge. In order to effectively learn from your own experience, it turns out that it's really, really important to already know something about what you're doing. Otherwise, it takes far too long.

Speaker 1

就像人类幼年时期需要很长时间才能掌握基本技能,比如初次学习写字。但有了知识基础后,学习新事物就会快很多。当前用监督学习训练模型的目的,正是构建这种能加速后续学习的先验知识基础——这并非新理念。

It takes a person, when they're a child, a very long time to learn very basic things, like learning to write for the first time. Once you already have some knowledge, then you can learn new things very quickly. So the purpose of training the models with supervised learning now is to build out the foundation that provides the prior knowledge, so they can figure things out much more quickly later. And again, this is not a new idea.

Speaker 1

大语言模型的发展轨迹正是如此。最初仅通过下一词预测训练,这为后续生成合成数据和应用强化学习奠定了完美起点。

This is exactly what we've seen with LLMs. Right? LLMs started off being trained purely with next-token prediction, and that provided an excellent starting point, first for all sorts of synthetic data generation and then for RL.

Speaker 0

所以

So

Speaker 1

我认为任何基础模型的开发都会遵循相同路径:先通过某种程度的暴力计算构建基础,这个基础越牢固,后续通过更便捷训练实现提升就越容易。

I think it makes total sense to expect basically any foundation model effort to follow the same trajectory, where we first build out the foundation in a somewhat brute-force way. And the stronger that foundation gets, the easier it is to then make it even better with much more accessible training.

Speaker 0

十年后,知识工作的最佳模型会是机器人模型或配备动作专家的系统吗?考虑到通用模型目前展现的优势,机器人领域会最终融入这种全能模型(既处理脑力工作又完成体力劳动),还是继续保持独立发展?

In ten years, will the best model for knowledge work also be a robotics model, or have an action expert attached to it? The reason I ask is that so far we've seen advantages from using more general models for things. Will robotics fall into this bucket, where we'll just have one model that does everything, including physical work and knowledge work? Or do you think they'll continue to stay separate?

Speaker 1

我真心希望它们最终会是相同的。显然,我带有极强的偏见——我热爱机器人技术,认为它对人工智能至关重要。但乐观地说,我认为实际情况可能恰恰相反,等式中的机器人元素会让其他所有方面变得更好。

I really hope that they will actually be the same. And obviously, I'm extremely biased: I love robotics, and I think it's very fundamental to AI. But optimistically, I think it's actually the other way around: the robotics element of the equation will make all the other stuff better.

Speaker 1

这背后有两个原因。其一是关于表征与聚焦:就像我之前提到的视频预测模型,若只是单纯预测所有可能发生的事,很难辨别何为关键。但当你专注于执行具体任务时,这种专注会重构你认知世界的方式,从而更有效地利用其他信息。

And there are two reasons for this that I can tell you about. One has to do with representations and focus. As I said before with video prediction models, if you just want to predict everything that happens, it's very hard to figure out what's relevant. If you have the focus that comes from actually trying to do a task, that acts to structure how you see the world, in a way that allows you to more fruitfully utilize the other signals.

Speaker 1

这种机制可能极具威力。第二个原因是:对物理世界深入本质的理解——超越语言可描述范畴的认知——实际上能助你解决其他问题。我们时刻都在经历这种现象,比如用'这家公司势头强劲'来描述抽象概念。

That could be extremely powerful. The second one is that understanding the physical world at a very deep, fundamental level, at a level that goes beyond what we can articulate with language, can actually help you solve other problems. And we experience this all the time. When we talk about abstract concepts, we say things like, this company has a lot of momentum.

Speaker 1

对吧?我们会用拟人化隐喻形容无生命物体,比如'我的电脑讨厌我'。我们以特定方式体验世界,这种主观体验深刻塑造着我们的思维方式。

Right? We'll use social metaphors to describe inanimate objects, like, my computer hates me. We experience the world in a particular way, and our subjective experience shapes how we think about it in very profound ways.

Speaker 1

然后我们将其作为万能工具,用来解决各种过于抽象而无法用其他方式处理的难题。

And then we use that as a hammer to basically hit all sorts of other nails that are far too abstract to handle any other way.

Speaker 0

但我想物理机器人可能还需考虑推理速度、模型规模等因素,这些与知识工作的考量或许不同。不过也许模型本身不变,只是以不同方式部署。协同训练的优势如此显著——我常想,五年后若用模型帮我编程,它是否也懂机器人技术?或许编码与机器人协同训练的优势大到足以...

I guess there might be other considerations relevant to physical robots, in terms of inference speed and model size, etcetera, which might be different from the considerations for knowledge work. But maybe that doesn't change things. Maybe it's still the same model, but you can serve it in different ways. And maybe the advantages of co-training are high enough. I'm wondering, in five years, if I'm using a model to code for me, does it also know how to do robot stuff? Maybe the advantages of co-training on robotics are high enough that it's

Speaker 1

值得深入探讨。编程堪称抽象知识工作的巅峰——就其数学本质而言,这是项极度抽象的活动,正因如此人们才会觉得如此困难。

worth it. Mhmm. Well, I should say that coding is probably the pinnacle of abstract knowledge work, in the sense that, just by the mathematical nature of computer programming, it's an extremely abstract activity, which is why people struggle with it so much.

Speaker 0

是的。我对为什么模拟对机器人效果不佳感到有点困惑。比如,观察人类,聪明的人类在有意学习时,会很好地注意到模拟与现实生活的相似之处,并专注于这些方面从中学习。那么,像飞行员通过模拟训练学习驾驶飞机,或F1赛车手通过模拟训练学习驾驶赛车,我们是否应该期待随着机器人变得更智能,它们也能通过模拟学习更多东西?还是说我们将永远受困于需要真实世界数据的诅咒?

Yeah. I'm a bit confused about why simulation doesn't work better for robots. If I look at humans, smart humans, when they're intentionally trying to learn, do a good job of noticing what about the simulation is similar to real life, paying attention to that, and learning from it. So if you have pilots who are learning in simulation, or F1 drivers who are learning in simulation, should we expect that as robots get smarter, they will also be able to learn more things with simulation? Or is this a curse, that we need real-world data forever?

Speaker 1

这是个非常微妙的问题。你提到的飞行员使用模拟器学习的例子确实很有趣。但需要记住的是,当飞行员使用模拟器学习驾驶飞机时,他们有着极强的目标导向性。他们的人生目标不是学会使用模拟器,而是学会驾驶真实的飞机。

This is a very subtle question. Your example of the airplane pilot using simulation is really interesting. But something to remember is that when a pilot is using a simulator to learn to fly an airplane, they're extremely goal-directed. Their goal in life is not to learn to use a simulator; their goal in life is to learn to fly the airplane.

Speaker 1

没错。他们知道之后会有测试,最终他们将负责几百名乘客的安全。他们真的不能让飞机坠毁。而当我们用来自多个不同领域的数据训练模型时,模型并不知道它们应该解决特定任务。

Yeah. They know there will be a test afterwards, and they know that eventually they'll be in charge of a few hundred passengers. And they really need to not crash that thing. When we train models on data from multiple different domains, the models don't know that they're supposed to solve a particular task.

Speaker 1

它们只会看到:嘿,这是我需要掌握的一件事,那又是另一件需要掌握的。所以更好的类比可能是:如果你在玩一个可以驾驶飞机的电子游戏,然后突然有人把你放进真实飞机的驾驶舱。电子游戏并非毫无用处,但两者并不相同。如果你玩电子游戏的目标是真正精通这个游戏,你的学习方式也会完全不同。

They just see, hey, here's one thing I need to master, here's another thing I need to master. So maybe a better analogy is: you're playing a video game where you can fly an airplane, and then eventually someone puts you in the cockpit of a real one. It's not that the video game is useless, but it's not the same thing. And if you're playing that video game and your goal is to really master the video game, you're not gonna go about it in quite the same way.

Speaker 0

难道不能通过某种元强化学习来解决吗?这其实和你2017年那篇非常有趣的论文几乎如出一辙——也许损失函数不应该衡量某个特定电子游戏或模拟的表现,而是衡量在不同电子游戏中训练后对下游任务表现的提升?我解释得很糟糕,但我明白你的意思。好吧。也许你能更好地解释我想表达的观点?

Can't you do some kind of meta-RL on this? It's actually almost identical to this really interesting paper you wrote in 2017, where maybe the loss function is not how well the model does at a particular video game or a particular simulation, but how well being trained in different video games makes it better at some other downstream task. I did a terrible job explaining that, but I think you know what I mean. Okay. Maybe you can do a better job explaining what I was trying to say?

Speaker 1

所以我认为你想说的是:如果我们有一个真正智能的模型进行元学习,或许它能发现通过在模拟器中训练可以提高其在现实世界下游问题中的表现。

So I think what you're trying to say is basically: maybe if we have a really smart model that's doing meta-learning, perhaps it can figure out that its performance on a downstream problem, a real-world problem, is increased by doing something in a simulator.

Speaker 0

并且明确将其设为损失函数。

And specifically make that the loss function,
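
The proposal here, scoring how you use the simulator only by downstream real-world performance, can be sketched with a toy bilevel setup (an editorial illustration, not the actual algorithm from the 2017 paper being discussed): plentiful "sim" data shares structure with scarce "real" data but has a sim-specific bias, and an outer loop chooses how much of the sim-specific part to trust, judged purely on real-data error.

```python
import numpy as np

# Toy "real world": y = 2x + noise. Toy "simulator": same slope but a
# biased intercept, so training on sim alone transfers imperfectly.
def make_data(n, intercept, seed):
    r = np.random.default_rng(seed)
    x = r.uniform(-1, 1, n)
    y = 2 * x + intercept + 0.1 * r.normal(size=n)
    return x, y

sim_x, sim_y = make_data(200, intercept=0.5, seed=1)   # cheap, plentiful
real_x, real_y = make_data(20, intercept=0.0, seed=2)  # scarce ground truth

def fit_slope_and_bias(x, y, bias_weight):
    # Fit y ~ a*x + b on sim data, shrinking the sim-specific
    # intercept b by bias_weight (0 = trust sim fully, 1 = discard b).
    a = np.sum(x * y) / np.sum(x * x)
    b = (1 - bias_weight) * np.mean(y - a * x)
    return a, b

def real_loss(a, b):
    # The outer objective: error on the real world, not on the simulator.
    return np.mean((real_y - (a * real_x + b)) ** 2)

# "Meta" outer loop: grid-search how to use the simulator, scored only
# by downstream real performance.
best = min(np.linspace(0, 1, 11),
           key=lambda w: real_loss(*fit_slope_and_bias(sim_x, sim_y, w)))

a, b = fit_slope_and_bias(sim_x, sim_y, best)
# Discarding the sim-specific intercept wins, because that intercept is
# exactly the part of the simulator that doesn't transfer.
assert real_loss(a, b) < real_loss(*fit_slope_and_bias(sim_x, sim_y, 0.0))
```

The linchpin, as the answer goes on to argue, is the `real_loss` term: without some ground-truth signal from the real thing, the outer loop has nothing to optimize against.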

Speaker 1

没错。但关键在于,这些理念的核心在于通过借助其他手段,比如训练,来提升在真实场景中的表现。是的。而这一切的关键枢纽,正是训练出在真实事物上表现更好的能力。

That's right. But here's the thing: there's a set of these ideas that are all something like, train to make it better on the real thing by leveraging something else. And the key linchpin for all of that is the ability to train to be better on the real thing.

Speaker 1

事实上,我怀疑我们可能并不需要如此明确的操作,因为正如你之前指出的,元学习是自然涌现的。就像大型语言模型本质上通过上下文学习实现某种元学习。当然,我们可以争论这究竟算不算学习,但重点是,基于真实数据以正确目标训练的强大模型,能更好地利用其他资源。我认为这才是关键。回到你的飞行员例子,飞行员是以现实目标训练的——他们的目标是成为优秀飞行员,取得成功,拥有良好职业生涯,所有这些都会反馈到他们利用其他数据源的行为中。

The thing is, I actually suspect that in reality we might not even need to do something quite so explicit, because meta-learning is emergent, as you pointed out before. LLMs essentially do a kind of meta-learning via in-context learning. We can debate how much that is really learning or not, but the point is that large, powerful models trained on the right objective on real data get much better at leveraging all the other stuff. And I think that's actually the key. Coming back to your airplane pilot: the pilot is trained on a real-world objective. Their objective is to be a good airplane pilot, to be successful, to have a good career, and all of that propagates back into the actions they take in leveraging all these other data sources.

Speaker 1

因此我认为,有效利用辅助数据源(包括模拟)的关键,在于构建真正优秀的基础模型,使其具备这些涌现能力。正如你所说,要达到这种高度,模型必须设定正确的目标。目前我们懂得如何从现实数据中获取正确目标,或许也能从其他途径获得,但这目前更具挑战性。

So what I think is actually the key to leveraging auxiliary data sources, including simulation, is to build the right foundation model, one that is really good and has those emergent abilities. And to your point, to get really good like that, it has to have the right objective. Now, we know how to get the right objective out of real-world data. Maybe we can get it out of other things, but that's harder right now. Right.

Speaker 1

我想我们可以再次参考其他领域的案例。比如现在如果有人训练大语言模型解决复杂问题,他们会使用大量合成数据。但他们能有效利用这些合成数据的原因,在于起点是经过大量真实数据训练的模型——模型已经掌握了要领。一旦掌握,就能更好地利用其他资源。

And again, we can look to the examples of what happened in other fields. These days, if someone trains an LLM for solving complex problems, they're using lots of synthetic data. But the reason they're able to leverage that synthetic data effectively is that they have a starting point that is trained on lots of real data, one that kind of gets it. And once it gets it, it's much more able to leverage all this other stuff.

Speaker 0

没错。

Right.

Speaker 1

所以颇具讽刺意味的是,我认为利用包括模拟在内的其他数据源的关键,在于先精通真实数据的使用,理解世界运行规律,然后才能富有成效地运用...

So I think, perhaps ironically, the key to leveraging other data sources, including simulation, is to get really good at using real data, understand what's up with the world, and then you can fruitfully use

Speaker 0

这一切。那么当我们实现这点后,比如在2035年或2030年——基本上就是科幻世界的设定——你是否对真正AGI的能力持乐观态度?它们能否构建模拟环境来演练人类或AI从未有机会实践的技能?比如他们需要练习当宇航员,因为我们要建造戴森球,他们可以直接在模拟中完成。还是说,无论模型变得多聪明,模拟问题始终会存在?

all this. So once we have this, in, like, 2035 or 2030, basically the sci-fi world, are you optimistic about the ability of true AGIs to build simulations in which they are rehearsing skills that no human or AI has ever had a chance to practice before? Say they need to practice being astronauts because we're building the Dyson sphere, and they can just do that in simulation. Or will the issue with simulation continue to be one regardless of how smart models get?

Speaker 1

所以我想说的是,在非常基础的层面上,你自己创造的合成体验并不能让你更多地了解世界。它允许你排练事物,考虑反事实,但关于世界的信息需要以某种方式注入系统。我认为你提出这个问题的方式实际上很好地阐明了这一点,因为在机器人学中,传统上人们常将仿真视为注入人类知识的方式,因为人类知道如何写下微分方程。他们可以编写代码,这样机器人就获得了比之前更多的知识。

So here's what I would say: deep down, at a very fundamental level, the synthetic experience that you create yourself doesn't allow you to learn more about the world. It allows you to rehearse things. It allows you to consider counterfactuals. But somehow, information about the world needs to get injected into the system. And I think the way you posed this question actually elucidates this very nicely, because in robotics, classically, people have often thought about simulation as a way to inject human knowledge, because a person knows how to write down, like, differential equations. They can code it up, and that gives the robot more knowledge than it had before.

Speaker 1

但我认为,从其他领域的经验中,比如视频生成技术、为大型语言模型(LLMs)生成合成数据,我们越来越认识到,实际上创造合成体验最强大的方式可能是基于一个非常优秀的模型,因为模型可能比人类更了解那些细粒度的细节。但当然,那个模型的知识又从何而来?来自对世界的体验。是的。所以从某种意义上说,你说的我认为非常正确,一个非常强大的人工智能系统可以模拟很多东西。

But I think that increasingly what we're learning from experiences in other fields, from how like the video generation stuff goes, from synthetic data for LLMs is that actually probably the most powerful way to create synthetic experience is from a really good model because the model probably knows more than a person does about those fine grained details. But then, of course, where does that model get the knowledge? From experiencing the world. Yeah. So in a sense, what you said I think is actually quite right in that a very powerful AI system can simulate a lot of stuff.

Speaker 1

但同样,到了那个地步,这几乎无关紧要了,因为作为一个黑箱来看,那个系统所发生的是信息输入,能力输出。是的。而它处理信息的方式是通过想象和模拟,还是通过某种无模型的方法,在我们的理解中其实并不重要。

But also, at that point, it almost doesn't matter, because viewed as a black box, what's going on with that system is that information comes in and capability comes out. Yeah. And whether the way it processed that information is by imagining some stuff and simulating, or by some model-free method, is kind of irrelevant from our standpoint.

Speaker 0

你觉得人类中有什么类似的东西吗?比如,当我们白日做梦或睡觉时,我们在做什么?我不知道你是否对这种我们正在做的辅助性事情有所感觉,但如果你必须为它做一个机器学习类比,那会是什么?

Do you have a sense of what the equivalent is in humans? Like, whatever we're doing when we're daydreaming or sleeping. I don't know if you have some sense of what this auxiliary thing we're doing is, but if you had to make an ML analogy for it, what would it be?

Speaker 1

嗯,是的。我的意思是,当然当你睡觉时,你的大脑做的事情看起来非常像它清醒时所做的,非常像是在回放经验,或者可能是在生成新的统计上相似的经验。对吧。所以我认为,这就像是,很合理地猜测,也许通过学习模型进行的模拟是你大脑基本上弄清楚反事实的方式之一。是的。

Well, yeah. I mean, certainly when you sleep, your brain does stuff that looks an awful lot like what it does when it's awake, that looks an awful lot like playing back experience, or perhaps generating new, statistically similar experience. Right. And so I think it's very reasonable to guess that perhaps simulation through a learned model is part of how your brain figures out counterfactuals, basically. Yeah.
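
The closest standard ML analogue to the "replaying experience during sleep" idea Levine gestures at is the experience replay buffer used in off-policy RL: transitions collected while "awake" are stored and replayed later in shuffled order. A minimal sketch; the buffer API and the toy transitions below are invented for illustration, not from any specific library:

```python
import random
from collections import deque

random.seed(0)

class ReplayBuffer:
    """Store transitions collected online; replay them later in shuffled order."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted first

    def add(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # "Offline replay": revisit past experience in shuffled order,
        # loosely analogous to replaying the day's experience during sleep.
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for step in range(100):          # the "awake" phase: gather experience
    buf.add((step, "act", float(step % 3), step + 1))

batch = buf.sample(32)           # the "asleep" phase: replay it for learning
```

In practice such a buffer feeds minibatches to a learner (e.g. Q-learning), which is what makes the replayed experience useful rather than decorative.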

Speaker 1

但比那更基本的是,最优决策的核心,无论你如何做到,都需要考虑反事实。你基本上必须问自己,如果我做这个而不是那个,会更好吗?你必须以某种方式回答这个问题。而你是通过学习模拟器来回答这个问题,还是通过使用价值函数或类似的东西,通过使用奖励模型,最终都是一样的。只要你有某种机制来考虑反事实并找出哪个反事实更好,你就掌握了它。

But something that's even more fundamental than that is that optimal decision making, at its core, regardless of how you do it, requires considering counterfactuals. You basically have to ask yourself: if I did this instead of that, would it be better? And you have to answer that question somehow. And whether you answer that question by using a learned simulator, or by using a value function or something like that, by using a reward model, in the end it's kind of all the same. As long as you have some mechanism for considering counterfactuals and figuring out which counterfactual is better, you've got it.
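
Levine's claim, that a value function and a learned simulator are interchangeable mechanisms for answering the counterfactual "would A have been better than B?", can be shown on a toy one-step problem. Everything below (the two-action environment, the sample sizes) is invented purely for illustration:

```python
import random

random.seed(0)

# Ground truth, hidden from the agent: action "A" pays more on average.
TRUE_MEAN = {"A": 1.0, "B": 0.3}

def env_step(action):
    """One real interaction with the world: a noisy reward."""
    return random.gauss(TRUE_MEAN[action], 1.0)

# Route 1 (model-free): estimate a value function Q(a) from experience.
def fit_q(n=2000):
    return {a: sum(env_step(a) for _ in range(n)) / n for a in ("A", "B")}

# Route 2 (model-based): fit a reward model, then answer counterfactuals
# by "imagining" rollouts through the learned simulator.
def fit_model(n=2000):
    mu = {a: sum(env_step(a) for _ in range(n)) / n for a in ("A", "B")}
    return lambda a: random.gauss(mu[a], 1.0)

def imagined_value(model, action, n=2000):
    return sum(model(action) for _ in range(n)) / n

q = fit_q()
model = fit_model()

# Both mechanisms answer the same question: which counterfactual is better?
model_free_choice = max(q, key=q.get)
model_based_choice = max(("A", "B"), key=lambda a: imagined_value(model, a))
```

Both routes pick the same action; the difference is only where the information about the world is stored, in a value estimate or in a learned simulator.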

Speaker 1

是的。所以我喜欢这样思考,因为它简化了事情。它告诉我们,关键不一定是做非常好的模拟。关键是弄清楚如何回答反事实的问题。

Yeah. So I like thinking about it this way because it simplifies things. It tells us that the key is not necessarily to do really good simulation. The key is to figure out how to answer counterfactuals.

Speaker 0

是的,很有意思。从宏观视角来看,我之所以想具体了解机器人经济何时部署,是因为这实际上与理解AGI发展速度密切相关——这不仅是数据飞轮效应的问题。如果外推AI的资本支出,假设到2030年(不同人有不同预估,许多人的预测值在数百吉瓦级别,100、200或300吉瓦),若届时部署了100或200吉瓦,每年的边际资本支出将达到数万亿美元,约2到4万亿美元。

Yeah. Interesting. So stepping back to the big picture again: the reason I'm interested in getting a concrete understanding of when this robot economy will be deployed is that it's actually pretty relevant to understanding how fast AGI will proceed. There's obviously the data flywheel, but also, if you just extrapolate out the CapEx for AI: suppose by 2030 (people have different estimates, but many people have estimates in the hundreds of gigawatts: 100, 200, 300 gigawatts). And then you can just crunch the numbers: if you have 100 or 200 gigawatts deployed by 2030, the marginal CapEx per year is trillions of dollars, like $2, 3, 4 trillion a year.
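
The arithmetic behind those CapEx figures can be made explicit. The 100-300 GW deployment estimates come from the conversation; the all-in cost per gigawatt (chips, buildings, power) is my own assumed ballpark, so the constant below is illustrative only:

```python
# Back-of-envelope for the CapEx figures in the conversation. The only
# inputs taken from the transcript are the 100-300 GW estimates; the
# $50B-per-GW all-in cost is an assumption for illustration.
COST_PER_GW_USD = 50e9  # assumed, not from the transcript

def marginal_capex_per_year(total_gw_by_2030, build_years=5):
    """Average yearly spend if the fleet is built linearly over `build_years`."""
    gw_per_year = total_gw_by_2030 / build_years
    return gw_per_year * COST_PER_GW_USD

low = marginal_capex_per_year(200)   # 200 GW by 2030 -> ~$2T per year
high = marginal_capex_per_year(300)  # 300 GW by 2030 -> ~$3T per year
```

Under these assumptions, 200-300 GW built over five years lands squarely in the "$2-4 trillion a year" range mentioned above; a different cost-per-GW assumption scales the result linearly.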

Speaker 0

这对应着需要实际建造的数据中心、芯片代工厂和太阳能板工厂。我特别好奇的是,到2030年时,如果主要瓶颈只是需要人力在数据中心旁铺设太阳能板或组装数据中心,届时机器人经济是否能足够成熟,为这一过程提供重大助力。

And that corresponds to actual data centers you've got to build, actual chip foundries you have to build, actual solar panel factories you've got to build. And I'm very curious about whether, by this time, by 2030, if the big bottleneck we have is just people to lay out the solar panels next to the data center or assemble the data center, the robot economy will be mature enough to help significantly in that process.

Speaker 1

这很酷。所以你本质上是在说:我现在该采购多少混凝土来建造数据中心,才能在2030年为所有机器人供电?确实,这种思考方式比我之前想到的更宏大,但这是个很棒的问题。

That's cool. So you're basically saying: how much concrete should I buy now to build the data center so that by 2030 I can power all the robots? Yeah. That is a more ambitious way of thinking about it than had occurred to me, but it's a cool question.

Speaker 1

当然,好的一面是机器人能帮忙建造那些设施。

I mean, good thing, of course, is that the robots can help you build that stuff.

Speaker 0

没错。但问题是它们到时候是否真能胜任?因为还存在非机器人领域的需求,同样需要大量资本支出。嗯。然后还有机器人领域本身——你需要建造机器人工厂等等。整个产业链将迎来爆发式增长。

Right. But will they be able to by the time we need them? There's the non-robotic stuff, which will also mandate a lot of CapEx. Mhmm. And then there's the robot stuff, where you actually have to build a robot factory, etcetera. There will be this industrial explosion across the whole stack.

Speaker 0

机器人技术能在多大程度上加速或实现这一进程?

And how much will robotics be able to speed that up or make it possible?

Speaker 1

理论上可以有很大贡献。对吧?我们有时容易把机器人想象成机械人类,但事实并非如此,不是吗?

I mean, in principle, quite a lot. Right? I think that we have a tendency sometimes to think about robots as like mechanical people. But that's not the case. Right?

Speaker 1

比如,人类是人类,机器人是机器人。更恰当的类比是,机器人就像你的汽车或推土机。它们的维护需求要低得多。你可以把它们放在各种奇怪的地方,它们完全不需要看起来像人。你可以造一个100英尺高的机器人。

Like, people are people and robots are robots. The better analogy for the robot is that it's like your car or a bulldozer. It has much lower maintenance requirements. You can put them into all sorts of weird places, and they don't have to look like people at all. You can make a robot that's 100 feet tall.

Speaker 1

你也可以造一个微小的机器人。所以我认为,如果有智能驱动高度异构的机器人系统,实际上可能比仅仅拥有机械人效果要好得多。这可以大幅提升真实人类的生产力,也能解决目前非常棘手的问题。比如,虽然我绝非数据中心专家,但你可以把数据中心建在非常偏远的地方,因为机器人不在乎附近是否有购物中心。

You can make a robot that's tiny. So I think that if you have the intelligence to power very heterogeneous robotic systems, you can probably do a lot better than just having mechanical people, in effect. And it can be a big productivity boost for the real people, and it can allow you to solve problems that are very difficult to solve now. For example, I'm not an expert on data centers by any means, but you could build your data centers in a very remote location, because the robots don't have to worry about whether there's a shopping center nearby.

Speaker 0

那么你是否有概念,软件会部署在哪里?另一个问题是物理机器人的数量。比如你们在Physical Intelligence训练的这类机器人——这些桌面机械臂,全球实际有多少台?到2030年会有多少台?需要多少台?

And then there are two questions: where will the software be, and how many physical robots will we have? So how many of the kinds of robots you're training at Physical Intelligence, like these tabletop arms, are there physically in the world? How many will there be by 2030? How many will be needed?

Speaker 0

这些都是棘手的问题,比如我们到底需要多少台?

I mean, these are tough questions, like, how many will we need for that?

Speaker 1

这些问题确实非常棘手,而且迄今为止机器人领域的规模经济效应与长期可能达到的效果不同。举个例子,2014年我刚进入机器人领域时,使用的PR2研究机器人售价40万美元。在加州大学伯克利分校建立实验室时,我购买的机械臂每台3万美元。而现在我们物理智能项目使用的机械臂每台仅约3000美元,我们认为成本还能大幅降低。

These are very tough questions, and also economies of scale in robotics so far have not functioned the same way that they probably would in the long term, right? Just to give you an example: when I started working in robotics in 2014, I used a very nice research robot called the PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley, I bought robot arms that were $30,000 each. The kind of robots that we are using now at Physical Intelligence, each arm costs about $3,000, and we think they can be made for a small fraction of that.
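
For a rough sense of the learning rate implied by those price points, one can back out a constant annual decline. The 2014 date and the $400,000, $30,000, and $3,000 figures are from the conversation; the assumption that the full drop spans roughly a decade is mine:

```python
# Implied annual cost decline from the arm prices Levine quotes:
# $400k (PR2, 2014) -> $30k (Berkeley lab arms) -> ~$3k (today).
# The 2014 date is from the transcript; the ten-year span is assumed
# here purely to illustrate the calculation.
def annual_decline(p_start, p_end, years):
    """Constant yearly fractional price drop implied by two price points."""
    return 1.0 - (p_end / p_start) ** (1.0 / years)

# Assuming roughly 2014 -> 2024 for the full $400k -> $3k drop:
rate = annual_decline(400_000, 3_000, 10)
```

Under that assumption, the implied decline is a bit under 40% per year, which is the kind of curve that would indeed make end-of-decade arm prices dramatically lower if it continued.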

Speaker 0

那么这些变化的学习曲线背后是什么原因?

So what is the cause of that learning rate?

Speaker 1

有几个因素。首先是规模经济效应。定制化的高端研究硬件当然会比量产化的产品昂贵得多。

Well, there are a few things. One, of course, has to do with economies of scale. Custom-built, high-end research hardware is of course going to be much more expensive than more productionized

Speaker 0

硬件。

hardware.

Speaker 1

但另一方面,随着我们在建造可驱动机器方面越来越熟练,它们会变得更便宜,这当然涉及到技术因素。此外还有软件因素——你的AI系统越智能,对硬件满足特定需求的要求就越低。传统工厂里的机器人需要执行高度重复的动作,因此对精度和耐用性有较高要求。但如果能使用廉价的视觉反馈,这些要求就不必要了。所以AI不仅让机器人更经济实惠,还降低了对硬件的需求标准。

But then, of course, there's a technological element: as we get better at building actuated machines, they become cheaper. But there's also a software element, which is that the smarter your AI system gets, the less you need the hardware to satisfy certain requirements. Traditional robots in factories need to make motions that are highly repeatable, and that requires a degree of precision and robustness that you don't need if you can use cheap visual feedback. So AI also makes robots more affordable and lowers the requirements on the hardware.
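
A minimal sketch of that last point: a deliberately sloppy 1-D actuator (20% gain error plus noise) misses a target badly under a single open-loop command, but converges under a simple proportional feedback loop standing in for "cheap visual feedback". All constants are invented for illustration:

```python
import random

random.seed(1)

GAIN_ERROR = 0.8   # the cheap actuator delivers only 80% of the commanded motion
NOISE = 0.02       # per-step actuation noise

def actuate(position, command):
    """Imprecise hardware: scaled, noisy response to a motion command."""
    return position + GAIN_ERROR * command + random.gauss(0.0, NOISE)

def open_loop(target):
    # One blind command, sized as if the actuator were perfect:
    # it lands roughly 20% short of the target.
    return actuate(0.0, target)

def closed_loop(target, steps=50, kp=0.5):
    # Proportional feedback on the observed error ("cheap visual feedback")
    # repeatedly corrects toward the target despite the sloppy actuator.
    pos = 0.0
    for _ in range(steps):
        pos = actuate(pos, kp * (target - pos))
    return pos

target = 1.0
open_err = abs(open_loop(target) - target)      # stuck near the 20% gain error
closed_err = abs(closed_loop(target) - target)  # feedback walks the error down
```

The same actuator that is unusable open-loop becomes perfectly serviceable with feedback, which is the sense in which software intelligence relaxes hardware precision requirements.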

Speaker 0

有意思。那么你认为这种学习成本下降的趋势会持续吗?到本年代末,购买机械臂会不会只需几百美元?

Interesting. Okay. So do you think the learning rate will continue? Do you think it will cost hundreds of dollars by the end of the decade to buy mobile arms?

Speaker 1

这个问题更适合问我的联合创始人阿德南·伊斯梅尔,他可能是世界上回答这个问题的最佳人选。不过就我个人所见,成本逐年下降的速度确实超出我的预期。

That is a great question for my cofounder, Adnan Ismail, who is arguably the best person in the world to ask that question of. But certainly, the drop in cost that I've seen has surprised me year after year.

Speaker 0

明白了。全球大概有多少台机械臂?超过一百万还是不足一百万?

Okay. And how many arms are there probably in the world? Is it more than a million, less than a million?

Speaker 1

这个问题我无法确切回答,而且也很难界定——因为并非所有机械臂都相同。比如汽车工厂里的组装机器人,就和我们现在讨论的类型完全不同。

So I don't know the answer to that question, but it's also a tricky question to answer, because not all arms are made equal. Arguably, the kind of robots that are assembling cars in a factory are just not the right kind to think about.

Speaker 0

所以你指的是需要训练的那种?

So the kind you want to train on?

Speaker 1

数量非常少,因为它们目前尚未实现商业部署。是的,与工厂机器人不同。

Very few because they are not currently commercially deployed. Yeah. Unlike the factory robots.

Speaker 0

所以不到10万台?

So, like, less than 100,000?

Speaker 1

我不确定,但很可能。

I don't know, but probably.

Speaker 0

而我们需要的机器人数量是数十亿级别,至少数百万台。如果考虑到工业爆发式增长的需求——要实现AI的爆炸性发展,不仅需要机械臂,还需要能自主移动的装置。我其实在想,当AI热潮需要大量劳动力支撑时,这种产能是否跟得上?

And we want billions of robots, or at least millions of robots. If you're just thinking about the industrial explosion that you need to have this explosive AI growth, not only do you need the arms, but then you need something that can move around. Basically, I'm just trying to think about whether that will be possible by the time that you need a lot more labor to power this AI boom.

Speaker 1

要知道,当需求旺盛时,经济体系很擅长满足需求。对吧?比如2001年时,全球才有多少部iPhone呢?

Well, you know, economies are very good at filling demand when there's a lot of demand. Right? Like, how many iPhones were in the world in 2001? Right?

Speaker 0

确实如此。

That's right. Yeah.

Speaker 1

所以我认为这绝对是个挑战,也值得深入思考。对我们研究者而言,核心问题是:AI将如何改变硬件设计理念?因为有些因素我认为会变得极其关键。

So I think there's definitely a challenge there, and I think it's something that is worth thinking about. And a particularly important question for researchers like myself is: how can AI affect how we think about hardware? Right. Because there are some things that I think are going to be really, really important.

Speaker 1

比如,你可能希望你的设备不会总是出故障。是的。有些事情完全属于那种充满问号的范畴,比如我们到底需要几根手指?就像你之前自己说的,你有点惊讶于一个只有两根手指的机器人能做很多事情。好吧,也许你还是想要更多手指,但找到既能保持良好功能又最精简的配置很重要,这就在那个问号范畴里。还有一些我认为我们可能不需要的东西,比如我们可能不需要机器人超级精确,因为我们知道反馈可以弥补这一点。所以我认为,我目前的工作就是弄清楚我们能接受的最小配置是什么。

Like, you probably want your thing to not break all the time. Yeah. There are some things that are firmly in that category of question marks, like: how many fingers do we need? Like, you said yourself before that you were kind of surprised that a robot with two fingers can do a lot. Okay, maybe you still want more than that, but still, finding the bare minimum that still lets you have good functionality, that's important; that's in the question mark category. And there are some things that I think we probably don't need. Like, we probably don't need the robot to be super duper precise, because we know that feedback can compensate for that. So I think my job, as I see it right now, is to figure out what's sort of the minimal package we can get away with.

Speaker 1

我真的很喜欢从最小配置的角度思考机器人,因为我不认为我们会有那种终极机器人,就像机械版的人类一样。我认为我们将拥有的是一系列高效机器人需要满足的基本要素,就像智能手机必须要有触摸屏一样,这是我们大家都认同的。然后根据需求、成本等因素,还有很多其他可选的东西。我认为未来会有很多创新,一旦我们有了可以接入任何机器人、赋予其基本智能水平的强大AI系统,那么很多人就能在如何让机器人硬件针对每个细分领域优化方面进行创新。

And I really like to think about robots in terms of the minimal package, because I don't think that we will have, like, the one ultimate robot, sort of the mechanical person, basically. I think what we will have is a bunch of things that good, effective robots need to satisfy, just like smartphones need to have a touchscreen; that's something that we've all kind of agreed on. And then there's a bunch of other stuff that's kind of optional, depending on the need, depending on the cost point, etcetera. And I think there will be a lot of innovation, where once we have very capable AI systems that can be plugged into any robot to endow it with some basic level of intelligence, then lots of different people can innovate on how to get the robot hardware to be optimal for each niche it needs to fill.

Speaker 0

在制造商方面,机器人领域会出现类似英伟达那样的公司吗?

In terms of manufacturers, is there some NVIDIA of robotics?

Speaker 1

目前没有。也许将来会有。可能我太理想化了,但我真的很希望看到一个机器人多样化的世界。

Not right now. Maybe there will be someday. Maybe I'm being idealistic, but I would really like to see a world where there's a lot of heterogeneity in robots.

Speaker 0

作为设计运行算法的专家,你认为当前硬件最大的瓶颈是什么?

What is the biggest bottleneck in the hardware today as somebody who's designing the algorithms that run on it?

Speaker 1

这个问题很难回答,主要是因为技术变化太快。就我个人而言,我在硬件方面花大量时间思考的其实是可靠性和成本。不是说我不关心成本,而是成本直接关系到机器人数量,进而影响数据量。作为机器学习研究者,我非常需要大量数据,所以我希望机器人成本低廉,这样就能部署更多机器人,获得更多数据。

It's a tough question to answer, mainly because things are changing so fast. To me, the things that I spend a significant amount of time thinking about on the hardware side are really more like reliability and cost. It's not that I'm that worried about cost per se; it's just that cost translates to number of robots, which translates to amount of data. And being an ML person, I really like having lots of data, so I really like having robots that are low cost, because then I can have more of them and therefore more data.

Speaker 1

可靠性重要也基本是同样的原因。不过我觉得随着技术发展,这个问题会越来越清晰。因为目前AI系统还没有把硬件性能推到极限。随着AI系统越来越强,硬件会被推到极限,到时候我们或许就能更好地回答你的问题了。

And reliability is important for more or less the same reason. Yeah. But I think it's something that we'll get more clarity on as things progress, because, basically, the AI systems of today are not pushing the hardware to the limit. So as the AI systems get better and better, the hardware will get pushed to the limit, and then we'll hopefully have a much better answer to your question.

Speaker 0

好的。这是我向许多嘉宾提出的一个问题:当你深入剖析这场AI爆发的任何层面时,你会发现大量实际供应链的制造环节都在中国完成。除了芯片之外,如果你谈到数据中心,会发现太阳能板的硅片、大量电池和组件等也都是中国制造的。顺着供应链往下看,机械臂显然也是在中国生产。在这个硬件价值被无限放大的世界里——因为每个机器人都能创造相当于人类工人部分产值——加速制造业扩张变得至关重要。

Okay. So this is a question I've had for a lot of guests, which is that if you go through any layer of this AI explosion, you find that a bunch of the actual supply chain is being manufactured in China. Other than chips, obviously; but then if you talk about data centers, you find, oh, all the wafers for solar panels and a bunch of the cells and modules, etcetera, are manufactured in China. And then you just go through the supply chain, and obviously robot arms are being manufactured in China. So suppose you live in this world where the hardware is just incredibly valuable to ramp up manufacturing of, because each robot can produce some fraction of the value that a human worker can produce.

Speaker 0

不仅如此,人类工人或任何劳动力的价值都在急剧飙升,因为我们需要大量人力来铺设数万英亩的太阳能农场、数据中心、晶圆厂等等。在这个繁荣的世界里,最大的瓶颈在于你能实际部署多少机器人?能制造多少?因为算法将由你们开发,而我们只需要硬件。所以我问过许多嘉宾:当你观察所在领域的供应链环节时,为什么中国不会理所当然地胜出?

And not only is that true, but the value of human workers, or any kind of worker, has just tremendously skyrocketed, because we just need tons of bodies to lay out the tens of thousands of acres of solar farms and data centers and foundries and everything. In this boom world, the big bottleneck is just: how many robots can you physically deploy? How many can you manufacture? Because you guys are going to come up with the algorithms, and then we just need the hardware. And so this is a question I've asked many guests: if you look at the part of the chain that you are observing, what is the reason that China just doesn't win by default?

Speaker 0

对吧?如果他们生产所有机器人,而你们开发让这些机器人产生巨大价值的算法,为什么他们不会自然而然地胜出?

Right? If they're producing all the robots and you come up with the algorithms that make those robots super valuable, why don't they just win by default?

Speaker 1

是的,这是个非常复杂的问题。我先从宏观主题谈起,再深入细节。其中一个宏观主题是:如果你想建立依靠高素质劳动力、高生产率(即每人每小时完成大量工作)取胜的经济体,自动化就极其关键——因为它能倍增每个人的生产力。就像编程辅助工具那样,

Yeah. So this is a very complex question. I'll start with the broader themes and then try to drill a little bit into the details. So one broader theme here is that if you want to have an economy where you get ahead by having a highly educated workforce, by having people that have high productivity, meaning that for each person's hour of work lots of stuff gets done, automation is really, really good, because automation is what multiplies the productivity that each person has. Again, same as LLM coding tools.

Speaker 1

编程工具能放大软件工程师的生产力,而机器人将放大几乎所有劳动者的生产力。这算是理想终局状态。但实现这个状态的过程充满复杂性——如何使其成为对社会有吸引力的发展路径,

LLM coding tools amplify the productivity of a software engineer. Robots will amplify the productivity of basically everybody that is doing work. Now, that's kind of like a final state, a desirable final state. And there's a lot of complexity in how you get to that state, how you make that an appealing journey

Speaker 0

嗯。

Mhmm.

Speaker 1

如何应对地缘政治维度的问题。所有这些都相当复杂,需要做出诸多正确决策,比如平衡投资机器人生态系统,同时支持软件和硬件的创新。我不认为这些是无法克服的难题,但需要长期视野和合理的投资配比。令我真正乐观的是那个终局状态——我们都认同美国应该成为这样的社会:人民具有高生产力,受过高等教育的人群从事高价值工作。

to society, how you navigate the geopolitical dimension of that. All of that stuff is actually pretty complicated, and it requires making a number of really good decisions, like good decisions about investing in a balanced robotics ecosystem, supporting both software innovation and hardware innovation. I don't think any of those are insurmountable problems. It just requires a degree of long-term vision and the right kind of balance of investment. But what makes me really optimistic about this is that final state: I think we can all agree that in the United States, we would like to have the kind of society where people are highly productive, where we have highly educated people doing high-value work.

Speaker 0

对。

Right.

Speaker 1

而且因为最终状态在我看来与自动化、机器人技术非常兼容,从某种程度上说,应该有很强的动力去实现那种状态。然后从那里开始,我们必须解决所有有助于实现这一目标的细节问题。这并不容易。我认为在私营行业、投资、政治层面都需要做出许多复杂的决策。但我对此非常乐观,因为在我看来,隧道尽头的光明似乎方向是正确的。

And because that end state seems to me very compatible with automation, with robotics, at some level there should be a lot of incentive to get to that state. And then from there, we have to solve for all the details that will help us get there. And that's not easy. I think there are a lot of complicated decisions that need to be made in terms of private industry, in terms of investment, in terms of the political dimension. But I'm very optimistic about it, because it seems to me like the light at the end of the tunnel is in the right direction.

Speaker 0

我是说,是的。我想还有一个不同的问题,那就是如果价值在某种程度上受硬件限制,所以你只需要生产更多的硬件,那么在美国或与盟友一起制造数亿或数十亿机器人的路径是什么?我不知道如何解决这个问题,但这似乎与另一个问题不同,比如,好吧,对人类工资或其他方面的影响是什么?

I mean, yeah. I guess there's a different question, which is that if the value is sort of bottlenecked by hardware, and so you just need to produce more hardware, what is the path by which hundreds of millions or billions of robots are being manufactured in the US or with allies? I don't know how to approach that question, but it seems like a different question than, okay, what is the impact on human wages or something?

Speaker 1

所以,再次强调,关于如何实现这一目标的具体细节,我认为这是一个非常长的对话,我可能不是最有资格的人。但我想说的是,就其中的要素而言,我认为重要的是机器人可以帮助完成体力劳动。如果生产机器人本身就是体力劳动,那么擅长机器人技术应该对此有所帮助。对,对。当然,这有点循环。

So again, for the specifics of how we make that happen, I think that's a very long conversation I'm probably not the most qualified to speak to. But I think that in terms of the ingredients, the ingredient here that I think is important is that robots help with physical things, physical work. And if producing robots is itself physical work, then getting really good at robotics should help with that. Right. It's a little circular, of course.

Speaker 1

而且,你知道,就像所有循环的事情一样,你必须像启动它一样,试图让引擎运转起来。但这似乎是一个比数字设备问题更容易解决的问题,例如,在数字设备中,工作投入于创建计算机、手机等,但计算机和手机本身并不帮助完成工作。

And, you know, as with all circular things, you have to kind of bootstrap it and try to get the engine going. But it seems like it is an easier problem to address than, for example, the problem of digital devices, where work goes into creating computers, phones, etcetera, but the computers and phones don't themselves help with the work.

Speaker 0

对。我想反馈回路是双向的。它们可以帮助你,也可以帮助别人,这是一个正和的世界,所以它们提供帮助并不一定是坏事。但既然这个反馈回路中的许多环节,比如零部件制造和供应链,已经存在于中国,看起来更强的反馈回路会出现在中国。然后还有一个单独的讨论:也许这没关系,也许这是好事,也许他们会继续向我们出口这些。

Right. I guess feedback loops go both ways. They can help you or they can help others, and it's a positive-sum world, so it's not necessarily bad that they help. But to the extent that a lot of the things which would go into this feedback loop, the subcomponent manufacturing and the supply chain, already exist in China, it seems like the stronger feedback loop would exist in China. And then there's a separate discussion: maybe that's fine, maybe that's good, and maybe they'll continue exporting this to us.

Speaker 0

但值得注意的是,我只是觉得每次我和嘉宾谈论不同的事情时,都会发现,哦,几年之内,供应链每个部分的关键瓶颈都会是中国作为世界80%供应者的某种东西。

But I just find it notable that whenever I talk to a guest about different things, it's like: oh yeah, within a few years, the key bottleneck to every single part of the supply chain here will be something that China is the roughly 80% world supplier of.

Speaker 1

嗯,是的。这就是为什么我之前说,我认为在这里真正重要的是建立一个平衡的机器人生态系统,对吧?比如,人工智能非常令人兴奋,但我们也应该认识到,做好人工智能并不是我们需要做的唯一事情。我们需要考虑如何平衡我们的优先事项、投资以及我们花费时间的事情。举个例子,在物理智能领域,我们实际上非常重视硬件。

Well, yeah. And this is why I said before that I think something really important to get right here is a balanced robotics ecosystem, right? Like, I think AI is tremendously exciting, but I think we should also recognize that getting AI right is not the only thing that we need to do. We need to think about how to balance our priorities, our investment, the kind of things that we spend our time on. Just as an example, at Physical Intelligence, we do take hardware very seriously, actually.

Speaker 1

我们自己构建了很多东西,我们希望有一个与人工智能路线图并行的硬件路线图。但我想,这只是我们的做法。我认为对于美国,甚至可以说对于整个人类文明来说,我们需要非常全面地思考这些问题。是的。我认为当某个领域有很多兴奋点和进展时,比如人工智能,我们有时很容易分心。

We build a lot of our own things, and we want to have a hardware roadmap alongside our AI roadmap. But, you know, that's just us. I think that for the United States, arguably for human civilization as a whole, we need to think about these problems very holistically. Yeah. And I think it is easy to get distracted sometimes when there's a lot of excitement, a lot of progress in one area, like AI.

Speaker 1

我们可能会忽视其他事情,包括你提到的事情,比如硬件组件、计算等基础设施组件。所以我认为,总的来说,对这些事情有一个更全面的看法是好的。我希望我们有时能进行更全面的讨论。

And we are tempted to lose track of other things, including things you've said: hey, there's a hardware component, there's an infrastructure component with compute, and things like that. So I think that in general, it's good to have a more holistic view of these things. And I wish we had more holistic conversations about that sometimes.

Speaker 0

我确实在想,从整个社会的角度来看,应该如何看待机器人和知识工作领域的进步?我认为基本上,社会应该为全面自动化做准备。会有一个时期,人们的工作价值大增,因为经济会迎来巨大繁荣,比如建造所有这些数据中心和工厂。但人类终究只能用身体做事或用头脑做事,并不存在什么神秘的第三类工作。那么社会应该为什么做准备?

I do think, from the perspective of society as a whole: how should they be thinking about the advances in robotics and knowledge work? And I think it's basically that society should be planning for full automation. There will be a period in which people's work is way more valuable, because there's this huge boom in the economy where we're building all these data centers and all these factories. But eventually, humans can do things with their body and they can do things with their mind; there's not some secret third thing. So what should society be planning for?

Speaker 0

应该是人类的全面自动化,社会也会变得更加富裕。所以,理论上,有办法让每个人都比今天过得更好。但最终的状态,隧道尽头的光明是全面自动化加上超级富裕的社会,通过某种再分配或其他方式实现。我不知道你是否不同意这种描述。

It should be full automation of humans, and society will also be much wealthier. So presumably there are ways to do this such that everybody is much better off than they are today. But the end state, the light at the end of the tunnel, is full automation plus a super wealthy society, with some redistribution or whatever way to figure that out. I don't know if you disagree with that characterization.

Speaker 1

我认为在某种程度上,这是一种非常合理的看待事物的方式。但如果说我从技术中学到了一件事,那就是它很少按照人们预期的方式发展,有时过程与目的地同样重要。所以我认为,为一个最终状态提前计划实际上非常困难。但我认为你所说的方向性上很有道理,我也确实认为,我们集体思考如何构建我们周围的世界,使其适应各个领域越来越多的自动化,是非常重要的。但我们应该同样重视过程,因为事物会以各种不可预测的方式发展,自动化可能会出现在我们最初意想不到的地方。

So I think at some level, that's a very reasonable way to look at things. But if there's one thing that I've learned about technology, it's that it rarely evolves quite the way that people expect, and sometimes the journey is just as important as the destination. So I think it's actually very difficult to plan ahead for an end state. But I think directionally, what you said makes a lot of sense, and I do think that it's very important for us collectively to think about how to structure the world around us in a way that is amenable to greater and greater automation across all sectors. But I think we should really think about the journey just as much as the destination, because things evolve in all sorts of unpredictable ways, and we'll find automation showing up in all sorts of places, probably not the places we expect first.

Speaker 1

所以我认为,这里真正重要的常量是:教育非常非常有价值。是的。它是一个人应对变化负面影响的最佳缓冲。所以,如果我们作为一个社会能共同拉动一个杠杆,那就是加大教育投入。

So I think the constant here that I think is really important is that education is really, really valuable. Yeah. It's the best buffer somebody has against the negative effects of change. So if there's one single lever that we can pull collectively as a society, it's more education.

Speaker 0

真的吗?我是说,莫拉维克悖论(Moravec's paradox)式的问题在于:对人类而言教育收益最大的东西,可能恰恰最容易被自动化,因为教育AI实在太容易了。要知道,那些需要你在研究生院花八年学习的教材,AI一个下午就能全部消化。

Is that true? I mean, Moravec's paradox is, like, the things which are most beneficial from education for humans might have been the easiest to automate, because it's really easy to educate AIs. You know, you can throw the textbooks that would take you eight years of grad school at them in an afternoon.

Speaker 1

教育赋予你的是灵活性。重点不在于你掌握了多少具体知识,而在于你获取技能和理解的能力。所以关键在于优质的教育。

Well, what education gives you is flexibility. So it's less about the particular facts you know than about your ability to acquire skills, to acquire understanding. So it has to be good education.

Speaker 0

没错。好的。Sergei,非常感谢你参加这次播客。真的非常精彩。

Right. Okay. Sergei, thank you so much for coming on the podcast. Super fascinating.

Speaker 1

是啊,这次讨论确实很激烈。你问的问题都很有深度。

Yeah. This was intense. You ask tough questions.

Speaker 0

希望你喜欢这期节目。如果喜欢的话,最有效的支持方式就是把它分享给其他可能感兴趣的人。可以发给朋友、群聊、推特或其他任何平台,让更多人知道。此外,如果在YouTube订阅节目,或在苹果播客和Spotify留下五星好评,都将对我们有极大帮助。

I hope you enjoyed this episode. If you did, the most helpful thing you can do is just share it with other people who you think might enjoy it. Send it to your friends, your group chats, Twitter, wherever else. Just let the word go forth. Other than that, super helpful if you can subscribe on YouTube and leave a five star review on Apple Podcasts and Spotify.

Speaker 0

详情页里有本期赞助商信息。如果你想赞助未来的节目,请访问duarkesh.com/advertise。感谢收听,我们下期再见。

Check out the sponsors in the description below. If you wanna sponsor a future episode, go to dwarkesh.com/advertise. Thank you for tuning in. I'll see you on the next one.
