本集简介
双语字幕
我认为深度学习的整个历史在某种意义上就是计算规模扩大的历史。
I think the whole history of deep learning is in some sense the history of scaling up compute.
当我从研究生院毕业时,我真的以为我整个职业生涯的剩余时间都将致力于解决这一个问题。事实上,人工智能作为一个领域、一门学科,很大程度上是受到人类智能的启发。
When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem. A lot of AI as a field, as a discipline, is inspired by human intelligence.
我们以为自己是第一个做这件事的人。
We thought we were the first people doing it.
结果发现谷歌也在同时进行这项研究。
It turned out that Google was also simultaneously doing it.
所以Marble,从某种角度看,它基本上就是一个三维世界的生成模型。
So Marble, basically, one way of looking at it: it's a generative model of 3D worlds.
对吧?
Right?
你可以输入文本、图像或多张图像,它会为你生成一个与这些输入相匹配的三维世界。
So you can input things like text or image or multiple images, and it will generate for you a three d world that kind of matches those inputs.
虽然Marble同时是一个朝着空间智能愿景发展的世界模型,但它也经过精心设计,旨在成为人们当下就能觉得有用的工具。
So Marble is simultaneously a world model that is building towards this vision of spatial intelligence, but it was also very intentionally designed to be a thing that people could find useful today.
我们开始看到在游戏、视觉特效和电影领域涌现的应用案例。我认为Marble作为产品目前能做许多非常有趣的事情,同时也为未来我们想要构建的宏大世界模型奠定基础。
And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.
李飞飞是斯坦福大学教授,斯坦福以人为本人工智能研究所的联合主任,以及World Labs的联合创始人。
Fei Fei Li is a Stanford professor, the co director of Stanford Institute for Human Centered Artificial Intelligence, and co founder of World Labs.
她创建了ImageNet数据集,正是这个数据集引发了深度学习革命。
She created ImageNet, the dataset that sparked the deep learning revolution.
贾斯汀·约翰逊曾是她的博士生。
Justin Johnson is her former Ph.D. student.
他历任密歇根大学教授、Meta研究员,如今是World Labs的联合创始人。
He went on to be a professor at the University of Michigan, a researcher at Meta, and is now a cofounder of World Labs.
他们共同推出了Marble,这是首个能从文本或图像生成可探索三维世界的模型。
Together, they just launched Marble, the first model that generates explorable 3D worlds from text or images.
在本期节目中,李飞飞和贾斯汀将探讨为何空间智能与语言存在本质差异、当前世界模型缺失的关键要素(提示:物理学),以及关于Transformer本质是集合模型而非序列模型的架构洞见。
In this episode, Fei Fei and Justin explore why spatial intelligence is fundamentally different from language, what's missing from current world models, hint, physics, and the architectural insight that transformers are actually set models, not sequence models.
大家好。
Hey, everyone.
欢迎收听Latent Space播客。
Welcome to the Latent Space podcast.
我是Kernel Labs的创始人Alessio,今天与Smol.ai的编辑swyx一同参与节目。
I'm Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Smol.ai.
我们非常激动能与World Labs的Fei Fei和Justin在演播室相聚。
And we are so excited to be in the studio with Fei Fei and Justin of World Labs.
欢迎你们。
Welcome.
我们也很兴奋。
We're excited too.
我确实说的是Marble。
I really say Marble.
是的。
Yeah.
感谢邀请我们。
Thanks for having us.
我认为人们对世界模型有浓厚的兴趣,而你们在空间智能等方面也做了一些宣传工作。
I think there's a lot of interest in world models, and you've done a little bit of publicity around spatial intelligence and all that.
我想也许这个故事中你们难得有机会讲述的一部分,是你们两人如何走到一起创办World Labs的。
I guess maybe one part of the story that is a rare opportunity for you to tell is how you two came together to start building World Labs.
这很简单,因为贾斯汀曾是我的学生。
That's very easy because Justin was my former student.
是的。
Yeah.
所以贾斯汀来到我的...你知道,我的另一个身份是斯坦福大学的计算机科学教授。
So Justin came to my lab. You know, the other hat I wear is professor of computer science at Stanford.
贾斯汀什么时候加入我的实验室?
Justin joined my lab when?
哪一年?
Which year?
2012年。
2012.
实际上,我加入你实验室的那个学期,也就是那个季度,正是AlexNet发布的同一季度。
Actually, the semester, the quarter, that I joined your lab was the same quarter that AlexNet came out.
对。
Yeah.
是的。
Yeah.
所以贾斯汀是我的
So Justin is my
第一个... 你参与整个发布风波了吗?
first... Were you involved in the whole announcement drama?
没有。
No.
完全没有。
Not at all.
但我当时确实目睹了围绕AlexNet的所有热议和兴奋。
But I was sort of watching all the hype and excitement around AlexNet that quarter.
所以他曾是我最优秀的学生之一。
So he was one of my very best students.
后来他先后在密歇根大学安娜堡分校担任教授,并在Meta工作。
And then he went on to an early career as a professor at the University of Michigan, Ann Arbor, and at Meta.
大约两年前,我认为我们两人都独立关注着大模型的发展,并思考语言模型之后的方向。
And then, I think around more than two years ago for sure, both of us had independently been looking at the development of the large models and thinking about what's beyond language models.
构建世界模型、空间智能这个理念对我们来说非常自然。
And this idea of building world models, spatial intelligence really was natural for us.
于是我们开始讨论,决定孤注一掷专注于解决这个问题,并共同创立了world labs。
So we started talking and decided that we should just put all the eggs in one basket and focus on solving this problem, and started World Labs together.
是的。
Yeah.
差不多是这样。
Pretty much.
我是说,在经历了博士期间的ImageNet时代后,我意识到接下来计算机视觉的十年发展将聚焦于让人工智能走出数据中心,进入现实世界。
I mean, after seeing that kind of ImageNet era during my PhD, I had the sense that the next decade of computer vision was gonna be about getting AI out of the data center and out into the world.
所以博士毕业后,我的研究兴趣逐渐转向三维视觉、计算机图形学以及生成模型领域。
So a lot of my interests post-PhD kinda shifted into 3D vision, a little bit more into computer graphics, more into generative modeling.
我原以为自己正逐渐远离导师的研究方向,但几年后重逢时发现她也在思考非常相似的课题。
And I was I thought I was kind of drifting away from my adviser post PhD, but then when we reunited a couple years later, it turned out she was thinking of very similar things.
如果回顾AlexNet,其核心要素显然是ImageNet数据集。
So if you think about AlexNet, the core pieces of it were obviously ImageNet.
还有向GPU和神经网络的技术转型。
It was the move to GPUs and neural networks.
你认为世界模型的AlexNet级突破会是什么形态?
How do you think about the AlexNet equivalent model for world models?
某种程度上,这个理念早已存在。
In a way, it's an idea that has been out there.
对吧?
Right?
你知道,杨·立昆(Yann LeCun)可能是这方面最著名、最主要的倡导者。
You know, Yann LeCun is maybe the biggest, most prominent proponent of it.
过去两年里,你看到了什么让你觉得"现在是时候做这个了"?在数据方面,你们从根本上想构建什么?也许还有哪些不同类型的算法或计算方法,能让世界模型真正活起来?
What have you seen in the last two years that made you think, hey, now's the time to do this? And what are the things you fundamentally wanna build in terms of data, and maybe different types of algorithms or approaches to compute, to make world models really come to life?
是的。
Yeah.
我认为首先是现在有更多数据和计算资源普遍可用了。
I think one is just that there is a lot more data and compute generally available.
从某种意义上说,我认为深度学习的历史就是计算能力扩展的历史。
I think the whole history of deep learning is, in some sense, the history of scaling up compute.
如果你想想,AlexNet需要从CPU跃迁到GPU。
And if you think about, you know, AlexNet required this jump from CPUs to GPUs.
但即使从AlexNet时代到今天,我们每张卡的性能已经提升了约一千倍。
But even from AlexNet to today, we're getting about a thousand times more performance per card than we had in AlexNet days.
现在常见的做法不仅是在一块GPU上训练模型,而是在数百、数千、数万甚至更多GPU上训练。
And now it's common to train models not just on one GPU, but on hundreds or thousands or tens of thousands or even more.
所以今天我们能够为单一模型调动的计算量,比我刚开始读博士时已经增长了约百万倍。
So the amount of compute that we can marshal today on a single model is, you know, about a million-fold more than we could have even at the start of my PhD.
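Justin's million-fold figure is just the product of the two numbers he cites earlier; a quick sanity check (illustrative arithmetic only, not exact measurements):

```python
# Rough sanity check of the "million-fold" claim, combining the two
# ballpark figures cited in the conversation:
per_card_speedup = 1_000  # one modern card vs. an AlexNet-era GPU
cards_per_run = 1_000     # training runs now span thousands of GPUs

total_speedup = per_card_speedup * cards_per_run
print(f"{total_speedup:,}x")  # 1,000,000x: about a million-fold more compute
```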
我认为语言模型是过去几年里真正开始表现非常出色的领域之一。
So I think language was one of the really interesting things that started to work quite well in the last couple of years.
但当我们考虑转向视觉数据、空间数据和世界数据时,就需要处理更大量的信息。
But as we think about moving towards visual data and spatial data and world data, you just need to process a lot more.
我认为这将是消化日益增长的新型计算资源的一个好方向。
And I think that's gonna be a good way to soak up this new compute that's coming online more and more.
公开竞赛的模式仍然有效吗?还是应该将其集中在一个实验室内进行?
Does the model of having a public challenge still work or should it be centralized inside of a lab?
我认为开放科学仍然很重要。
I think open science still is important.
要知道,与AlexNet那个时代相比,AI显然已经发生了巨大变化。那时它只是计算机科学中一个非常小众的领域。
You know, AI has obviously really evolved compared to the AlexNet time. Back then it was such a niche computer science discipline.
现在它已经成为了像文明基础设施一样的技术。
Now it's just like civilizational technology.
但我给你举个例子。
But I'll give you an example.
最近我的斯坦福实验室刚宣布开放了一个名为Behavior的数据集和基准测试,用于在模拟环境中对机器人学习进行基准测试。
Recently my Stanford lab just announced and opened a dataset and benchmark called Behavior, which is for benchmarking robotic learning in simulated environments.
这显然是在学术界继续保持这种开放科学模式的明确努力。
And that is a very clear effort in still keeping up this open science model of doing things, especially in academia.
但我认为重要的是要认识到这个生态系统是多元混合的。
But I think it's important to recognize the ecosystem is a mixture.
我认为业界许多非常专注的工作,其中一些更多是以产品而非公开挑战的形式问世。
I think a lot of the very focused work in industry, some of it sees the light of day in the form of a product rather than an open challenge per se.
是的。
Yeah.
这只是资金和商业模式的问题吗?比如你必须从中看到一些投资回报?
And is that just a matter of the funding and the business model, like you have to see some ROI from it?
我认为这只是生态系统多样性的问题。
I think it's just a matter of the diversity of the ecosystem.
即使在所谓的AlexNet和ImageNet时代,也存在闭源模型、专有模型和开源模型并存的情况。
Even during the so-called AlexNet and ImageNet time, there were closed models, there were proprietary models, there were open models.
或者想想iOS与安卓的对比,对吧?
Or you think about iOS versus Android, right?
存在不同的商业模式。
There are different business models.
我不会说这仅仅是资金本身的问题。
I wouldn't say it's just a matter of funding per se.
这就是市场的现状。
It's just how the market is.
它们是不同的玩法。
They're different plays.
是的。
Yeah.
但你觉得在当今商业压力下,这些实验室还能重做ImageNet那样的项目吗?
But do you feel like you could redo ImageNet today with the commercial pressure that some of these labs have?
对我来说,这就像是最重要的问题。
I mean, to me, that's like the biggest question.
对吧?
Right?
这就像是,你能公开什么,又应该保留什么?
It's like, what can you open up versus what should you keep inside?
就像,你知道,如果我站在你的立场上,你筹集了大量资金。
Like, you know, if I put myself in your shoes, right, it's like, you raised a lot of money.
你正在构建这一切。
You're building all of this.
如果你拥有这方面的最佳数据集,你真正有什么动力去发布它呢?
If you had the best dataset for this, what incentives do you really have to publish it?
感觉实验室里的人越来越被拉拢,博士项目也越来越早地被拉进这些实验室。
And it feels like the people at the labs are getting more and more pulled and the PhD programs are getting pulled earlier and earlier into these labs.
所以我很好奇,你是否认为现在存在一个问题,关于资金带来的压力对学术开放研究空间的影响,或者你觉得这其实不是什么值得担忧的事。
So I'm curious if you think there's an issue right now with how much money is involved and how much pressure it puts on the more open academic research space, or if you feel like that's not really a concern.
我确实有所担忧,但更多是关于资源分配以及学术界资源不均衡的问题,而非压力本身。
My concerns are less about the pressure and more about the resourcing, and the imbalanced resourcing of academia.
这与世界实验室的情况略有不同。
This is a little bit of a different conversation from World Labs.
过去几年,我一直在倡导为健康的生态系统提供资源支持。
I have been, the past few years, advocating for resourcing the healthy ecosystem.
作为斯坦福以人为本人工智能研究所的创始主任兼联合主任,我一直在与政策制定者合作,为公共部门和学术界的AI工作争取资源。
As the founding director and co-director of Stanford's Institute for Human-Centered AI, Stanford HAI, I've been working with policymakers on resourcing public sector and academic AI work.
我们曾与第一届特朗普政府合作推进《国家人工智能研究资源》(NAIRR)法案,该法案规划建立国家AI计算云及数据存储库。
We worked with the first Trump administration on this bill called the National AI Research Resource, the NAIRR bill, which is scoping out a national AI compute cloud as well as a data repository.
同时我认为开源数据集依然是生态系统中重要的一环。
And I also think that open source, open data sets continue to be an important part of the ecosystem.
正如我所说,目前在我的斯坦福实验室,我们正在进行名为Behavior的机器人学习开放数据集与基准测试。
Like I said, right now in my Stanford lab, we are doing the open dataset and open benchmark on robotic learning called Behavior.
我的许多同事也仍在从事这类工作。
And many of my colleagues are still doing that.
我认为这是生态系统的一部分。
I think that's part of the ecosystem.
我认为行业正在做的,一些初创企业正在做的,快速开发模型、创造产品,也是件好事。
I think what the industry is doing, some startups are doing, are running fast with models, creating products, is also a good thing.
例如,当贾斯汀还是我的博士生时,所有的计算机视觉程序都不太好用。
For example, when Justin was a PhD student with me, none of the computer vision programs worked that well.
对吧?
Right?
我们可以写出漂亮的论文。
We could write beautiful papers.
贾斯汀有
Justin has
我的意思是,甚至在研究生院之前,我就想从事计算机视觉,我联系了谷歌的一个团队,考虑本科毕业后直接尝试做计算机视觉。
I mean, actually, even before grad school, I wanted to do computer vision. I reached out to a team at Google and wanted to potentially go and try to do computer vision right out of undergrad.
他们却对我说,你在说什么呢?
And they told me, like, what are you talking about?
比如,你不能那么做。
Like, you can't do that.
先去读个博士再回来。
Like, go do a PhD first and come back.
是什么动机让你... 哦,其实我
What was the motivation that got you... Oh, I had actually
本科期间就做过一些计算机视觉研究,而且恰好是跟Fei Fei的博士生导师做的。
done some computer vision research during my undergrad, with, actually, Fei Fei's PhD adviser.
师承关系。
The lineage.
对。
Yeah.
她说,这里有个师承关系。
She's like, there's a lineage here.
没错。
Yeah.
所以我在本科阶段就已经做过一些计算机视觉的研究,我觉得这非常酷,并且想继续做下去。
So I had done some computer vision even as an undergrad, and I thought it was really cool, and I wanted to keep doing it.
因此我在本科毕业时就面临着业界与学术界的选择,我想现在研究社区的很多人也面临同样的情况。
So then I was sort of faced with this industry-versus-academia choice even coming out of undergrad, which I think a lot of people in the research community are facing now.
但回到你的问题,我认为学术界在AI领域的角色在过去十年间发生了很大变化,这并不是坏事。
But to your question, I think the role of academia, especially in AI, has shifted quite a lot in the last decade, and it's not a bad thing.
这是因为技术本身已经发展和成熟了。
It's because the technology has grown and emerged.
对吧?
Right?
比如在五到十年前,你真的可以在实验室里用几块GPU就训练出最先进的模型。
Like, five or ten years ago, you really could train state-of-the-art models in the lab with just a couple of GPUs.
但你知道,正因为这项技术如此成功且规模不断扩大,现在你无法再用几块GPU训练出最先进的模型了。
But, you know, because that technology was so successful and scaled up so much, you can't train state-of-the-art models with a couple of GPUs anymore.
这并非坏事。
And that's not a bad thing.
这是件好事。
It's a good thing.
这意味着技术确实奏效了。
It means the technology actually worked.
但这意味着学术界人士的工作预期需要稍作调整。
But that means the expectations around what we should be doing as academics shift a little bit.
我们不应该执着于训练最大规模的模型或追求最大规模的扩展。
And it shouldn't be about trying to train the biggest model and scaling up the biggest thing.
应该尝试各种古怪的、新颖的、疯狂的想法——尽管大多数可能行不通。
It should be about trying wacky ideas and new ideas and crazy ideas, most of which won't work.
我认为这方面还有很多工作可做。
And I think there's a lot to be done there.
我反而担心学术界有太多人过度关注假装我们能训练最大模型,或者把这当成职业培训项目,毕业后去大实验室就能玩转所有GPU。
And if anything, I'm worried that too many people in academia are hyperfocused on this notion of trying to pretend like we can train the biggest models, or treating it as almost a vocational training program: graduate, go to a big lab, and get to play with all the GPUs.
其实在算法创新、架构革新、系统开发等方面有太多疯狂的事情可以做,单枪匹马也能大有作为。
I think there's just so much crazy stuff you can do around new algorithms, new architectures, new systems. There's a lot you can do as one person.
此外,学术界在理解这些大型模型的理论基础方面也扮演着重要角色。
And also just academia has a role to play in understanding the theoretical underpinning of these large models.
我们对这方面的了解仍然非常有限。
We still know so little about this.
或者扩展到跨学科领域,你知道,就是贾斯汀所说的疯狂点子。
Or extend to the interdisciplinary, you know, what Justin calls wacky ideas.
这里有很多基础科学理念。
There's a lot of basic science ideas.
存在许多蓝天问题(指基础性、探索性的研究课题)。
There's a lot of blue sky problems.
所以我同意。
So I agree.
我不认为问题在于开放与封闭、产品化与开源之间的对立。
I don't think the problem is open versus closed, productization versus open sourcing.
我认为当前的问题在于学术界本身资源严重不足,导致研究人员和学生没有足够的资源来尝试这些想法。
I think the problem right now is that academia by itself is severely under resourced so that the researchers and the students do not have enough resources to try these ideas.
是的。
Yeah.
就让大家开开脑洞吧,你想到的疯狂点子是什么?
Just for people to nerd snipe, what's a wacky idea that comes to mind when you
说到疯狂点子?
talk about wacky ideas?
哦,比如我在密歇根大学时一直向学生们兜售的一个想法——我特别痴迷硬件,尤其喜欢各种新型硬件的涌现。
Oh, like, I had this idea that I kept pitching to my students at Michigan. I really like hardware, and I really like new kinds of hardware coming online.
某种程度上说,我们现在使用的神经网络和Transformer本质上都是围绕矩阵乘法构建的,因为矩阵乘法与GPU的契合度极高。
And in some sense, the neural networks that we use today, and transformers, are really based around matrix multiplication, because matrix multiplication fits really well with GPUs.
但如果我们思考GPU的未来扩展路径,以及硬件可能的演进方向,我认为现有系统(比如GPU这类硬件设计)不可能无限扩展下去。
But if we think about how GPUs are gonna scale, how hardware is likely to scale in the future, I don't think the current system that we have, like the GPU hardware design, is gonna scale infinitely.
现在就已经能看出,计算的基本单元不再是单个设备了。
And we start to see even now that the unit of compute is not the single device anymore.
而是整个设备集群。
It's this whole cluster of devices.
所以如果你想象一个节点。
So if you imagine a node.
是的。
Yeah.
它是一个完整的节点或整个集群。
It's a whole node or a whole cluster.
但我们谈论神经网络的方式,仍然好像它们是一个可以在PyTorch中用单个GPU编码的整体。
But the way we talk about neural networks is still as if they are a monolithic thing that could be coded, like, in one GPU in PyTorch.
但实际上,它们被分布到数千台设备上运行。
But then in practice, they get distributed over thousands of devices.
所以是否存在这样的可能——就像Transformer基于矩阵乘法,而矩阵乘法恰好是GPU上运行良好的基础运算一样?
So, just as, you know, transformers are based around matmul, and matmul is sort of the primitive that works really well on GPUs...
当你设想硬件扩展时,是否存在更适合大规模分布式系统的基础运算,让我们可以基于此构建神经网络?
As you imagine hardware scaling out, are there other primitives that make more sense for large-scale distributed systems that we could build our neural networks on?
我认为很可能会出现与下一代硬件(比如十到二十年后将出现的硬件)完美契合的全新架构,而我们现在就可以开始构想这些可能。
And I think it's possible that there could be drastically different architectures that fit with the next generation, the hardware that's gonna come ten or twenty years down the line, and we could start imagining that today.
这类押注确实很难做,因为还存在硬件彩票的概念——比方说,既然英伟达已经赢了,我们干脆就无限扩展它,并通过编写软件来弥补混合架构中的所有缺口。
It's really hard to make those kinds of bets because there's also the concept of the hardware lottery, where, let's just say, you know, NVIDIA has won and we should just scale that out to infinity and write software to patch up any gaps we have in the mix.
对吧?
Right?
我是说,
I mean,
既是也不是。
yes yes and no.
如果你看数据的话,从Hopper架构到Blackwell架构,每瓦性能其实基本持平。
Like, if you look at the numbers, going from Hopper to Blackwell, the performance per watt is about the same.
没错。
Yes.
他们主要是增加了晶体管数量,增大了芯片尺寸,同时也提高了功耗。
They mostly make the number of transistors go up, the chip size go up, and the power usage go up.
但即便从Hopper到Blackwell,我们已经开始看到每瓦性能方面存在某种扩展极限。
But even from Hopper to Blackwell, we're kind of already seeing a scaling limit in terms of the performance per watt that we can get.
所以我认为确实有空间去尝试一些新事物。
So I think there is room to do something new.
虽然我不确定具体是什么,也不认为一家初创公司能在三个月周期内完成。
And I don't know exactly what it is, and I don't think you can get it done, like, in a three month cycle as a start up.
但我觉得这类想法如果经过几年沉淀,或许能带来一些突破。
But I think that's the kind of idea that, if you sit with it for a couple of years, maybe you could come up with some breakthroughs.
我认为这类长期研究正是学术界最擅长的领域。
And I think that's the kind of long-range stuff that is a perfect match for academia.
回到一些背景历史,我们有一份研究笔记,是关于你与Andrej合作的场景叙事工作,也就是神经图像描述。
Coming back to a little bit of background and history, we have this sort of research note on the scene storytelling work, the neural image captioning, that you did with Andrej.
我想听你们讲讲那个故事——你们是如何在博士阶段开始这个项目的。
And I just wanted to hear you guys tell that story about, you know, you you were, like, sort of embarking on that for your PhD.
还有Fei Fei你当时的反应。
And and, Fei Fei, you you, like, having that reaction that you had.
是的。
Yeah.
所以我想那项工作最初是我和安德烈开始的,后来贾斯汀加入了,对吧?
So I think that line of work started between me and Andrej, and then Justin joined, right?
安德烈那时刚开始他的博士研究。
So Andrej started his PhD.
他和我当时在研究ImageNet物体识别之后还能做什么。
He and I were looking at what is beyond ImageNet object recognition.
那时卷积神经网络已经在ImageNet任务中展现出了强大的能力。
And at that time, convolutional neural networks had proven some power on ImageNet tasks.
ConvNet是表示图像的绝佳方式。
ConvNet is a great way to represent images.
与此同时,我认为在语言领域,早期的一种序列模型LSTM也正在被实验。
In the meantime, I think in the language space, an early sequential model called the LSTM was also being experimented with.
于是Andrej和我开始讨论。这是我长期以来的一个梦想。
So Andrej and I were just talking about it. This has been a long-term dream of mine.
我曾以为讲述图像故事这个问题需要一百年才能解决。
I thought it would take a hundred years to solve, which is telling the story of images.
当我从研究生院毕业时,我真的以为我整个职业生涯都将致力于解决这一个问题:给定一张图片或一个场景,用自然语言讲述其中的故事。
When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is given a picture or given a scene, tell the story in natural language.
但事情发展得太快了。
But things evolved so fast.
当安德烈开始时,我们想或许可以将卷积神经网络的表征能力与LSTM的语言序列模型结合起来。
When Andre started, we were like maybe combining the representation of convolutional neural network as well as the language sequential model of LSTM.
我们或许能通过训练学会将图片与描述文字相匹配。
We might be able to learn through training to match caption with images.
那就是我们开始这项研究的时候。
So that's when we started that line of work.
我不记得那是2014年还是2015年了。
And I don't remember if it was 2014 or 2015.
是2015年的CVPR对吧?
It was CVPR 2015. Right.
图片描述
Captioning
那是我们的第一篇论文,安德烈成功实现了给定一张图片就能生成描述的功能。
So it was our first paper where Andrej got it to work: you know, given an image.
图片是用卷积神经网络(ConvNet)表示的。
The image is represented with a ConvNet.
语言模型采用的是LSTM模型。
The language model is the LSTM model.
然后我们把它们结合起来,就能生成一句话了。
And then we combine it, and it's able to generate one sentence.
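The recipe they describe, a ConvNet image feature conditioning a recurrent language model that emits a sentence one word at a time, can be sketched minimally. Everything below is illustrative only (random weights, a plain RNN cell standing in for the LSTM, made-up sizes), not the actual model from the paper:

```python
import numpy as np

# Minimal sketch: a ConvNet image feature sets the initial hidden state of
# a recurrent language model, which then emits one word id per step by
# greedy decoding. Weights are random, so the "caption" is gibberish; the
# point is the wiring, not the output.
rng = np.random.default_rng(0)
FEAT, HID, VOCAB = 512, 128, 1000  # assumed sizes, not from the paper

W_img = rng.standard_normal((HID, FEAT)) * 0.01   # image feature -> h0
W_xh = rng.standard_normal((HID, HID)) * 0.01     # word embedding -> hidden
W_hh = rng.standard_normal((HID, HID)) * 0.01     # hidden -> hidden
W_hy = rng.standard_normal((VOCAB, HID)) * 0.01   # hidden -> vocab logits
embed = rng.standard_normal((VOCAB, HID)) * 0.01  # word embeddings

def caption(image_feat, max_len=10, start_token=0):
    """Greedy decoding conditioned on the image feature."""
    h = np.tanh(W_img @ image_feat)  # the image conditions the recurrence
    word, out = start_token, []
    for _ in range(max_len):
        # plain RNN cell here; the actual model used an LSTM
        h = np.tanh(W_xh @ embed[word] + W_hh @ h)
        word = int(np.argmax(W_hy @ h))  # most likely next word
        out.append(word)
    return out

tokens = caption(rng.standard_normal(FEAT))
print(len(tokens))  # one word id per decoding step
```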
那是最初的几次之一。
And that was one of the first times.
我记得我在书里写过这件事。
I think I wrote about it in my book.
我们当时以为自己是第一个做这件事的团队。
We thought we were the first people doing it.
结果发现谷歌那时也在同步进行同样的研究。
It turned out that Google at that time was also simultaneously doing it.
一位记者,《纽约时报》的约翰·马尔科夫,报道了谷歌的故事。
And a reporter, John Markoff from the New York Times, broke the Google story.
但他偶然听说了我们的研究,然后意识到我们确实是独立地同时取得了相同成果。
But he by accident heard about us, and then he realized that we really independently got there together at the same time.
所以他同时报道了谷歌的研究以及安德烈和我的研究。
So he wrote the story of both the Google research as well as Andrej's and my research.
但在那之后,我想贾斯汀那时已经在实验室了。
But after that, I think Justin was already in the lab at that time.
嗯。
Yeah.
是啊。
Yeah.
我记得小组会议上安德烈展示这些成果时,介绍了我从未听说过的LSTM和RNN这些新概念。
I remember the group meeting where Andrej was presenting some of those and explaining this new thing called LSTMs and RNNs that I had never heard of before.
我当时就想,哇。
And I thought, like, wow.
这真是太神奇了。
This is really amazing stuff.
我想研究这个。
I wanna I wanna work on that.
然后他在CVPR 2015会议上发表了论文,这是首批图像描述生成的研究成果。
So then he had the paper at CVPR 2015, the first image captioning results.
之后我们开始合作,我们首先完成了一篇关于语言建模的论文
Then after that, we started working together, and first we did a paper actually just on language modeling.
对。
Yep.
回溯到2015年的ICLR会议。
Back at ICLR 2015.
是的。
Yeah.
没错。
Yeah.
我我本该坚持做语言建模的。
I should have stuck with language modeling.
回过头看,那确实相当赚钱。
That turned out that was pretty lucrative in retrospect.
2015年我和安德烈一起做了这篇语言建模论文,当时觉得特别酷。
We did this language modeling paper together, me and Andrej, in 2015, and it was, like, really cool.
我们训练这些小型的RNN语言模型,它们能一次输出几个句子,我们可以观察并试图理解神经网络内部的神经元工作机制。
We trained these little RNN language models that could, you know, spit out a couple of sentences at a time, and we'd poke at them and try to understand what the neurons inside the neural net... Yeah.
理解这些模型内部的行为。
What the neurons inside were doing.
你们当时是在对不同的记忆单元进行分析
You guys were doing analysis on the different, like, memory cells and...
对。
Yeah.
没错。
Yeah.
那确实非常非常酷。
It was really cool.
即便在当时,我们已经能得出这样的结果:你可以观察LSTM内部并发现,比如,这个东西正在读取代码。
And even at that time, we had these results where you could, like, look inside the LSTM and say, like, oh, this thing is reading code.
所以我们用于训练的数据集之一就是Linux源代码。
So one of the datasets that we trained on for this one was the Linux source code.
对吧?
Right?
因为整个项目都是开源的,你可以直接下载这些代码。
Because the whole thing is, you know, open source, and you could just download it.
于是我们在这个数据集上训练了一个RNN。
So we trained an RNN on this dataset.
当网络试图预测其中的标记时,我们尝试将其做出的预测类型与RNN内部结构类型相关联。
And then, as the network is trying to predict the tokens there, you know, we'd try to correlate the kinds of predictions it's making with the kinds of internal structures in the RNN.
在那里,我们发现了一些相关性,比如当出现开括号时LSTM的某个单元和层级会激活,遇到闭括号时又会关闭,我们就是通过这类实证方法来理解它的运作。
And there, we were able to find some correlations, like, oh, this unit in this layer of the LSTM fires when there's an open paren and turns off when there's a closed paren, and try to do some empirical stuff like that to figure it out.
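The kind of probe described, correlating a single hidden unit's activation with a syntactic property like "inside parentheses", can be illustrated with a toy. The "unit" below is hand-made (a tanh of bracket depth) so the probe has a clean signal to find; in the real work the activations came from a trained LSTM:

```python
import math

# Toy probe: correlate one pretend hidden unit with "are we inside parens?"
text = "int f(int x) { return g(x + h(x)); }"

def depth_trace(s):
    """Bracket depth after reading each character."""
    depth, out = 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        out.append(depth)
    return out

depths = depth_trace(text)
unit_activation = [math.tanh(0.8 * d) for d in depths]  # pretend LSTM unit
inside_paren = [1.0 if d > 0 else 0.0 for d in depths]  # label to correlate

def pearson(xs, ys):
    """Plain Pearson correlation, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

r = pearson(unit_activation, inside_paren)
print(round(r, 2))  # strongly positive: the unit "fires" inside parens
```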
那真的很酷。
So that was pretty cool.
那基本上就是把图像描述工作中的CNN部分剥离出来,单独研究语言模型。
And that was kind of like cutting out the CNN from the image captioning work and just looking at the language models in isolation.
但后来我们想扩展图像描述的工作。
But then we wanted to extend the the image captioning work.
我记得当时,我们甚至有了空间感,因为觉得普通描述无法捕捉图像的不同部分。
And I remember at that time, we even had a sense of space, because we felt like captioning does not capture different parts of the image.
所以我当时在和Justin与Andre讨论,最终我们称之为密集描述——即更详细地描述场景,特别是场景的不同部分。
So I was talking to Justin and Andrej about what we ended up calling dense captioning, which is, you know, describing the scene in greater detail, especially different parts of the scene.
所以就是
So that's
对。
Yeah.
于是我们构建了系统,第二年我和Andre还有Fei Fei在CVPR 2016的论文中提出了这个实现密集描述的系统。
And so then we built the systems, and it was me and Andrej and Fei Fei on a paper the following year, CVPR 2016, where we built this system that did dense captioning.
所以你输入一张图片,它会在图像中所有有趣的内容周围画框,然后为每个内容写一小段描述。
So you input a single image, and then it would draw boxes around all the interesting stuff in the image and then write a short snippet about each of them.
就像,哦,桌子上有一个绿色的水瓶。
It's like, oh, there's a green water bottle on the table.
有个人穿着黑色衬衫。
It's a person wearing a black shirt.
这是一个非常复杂的神经网络,因为它建立在当时物体检测领域的许多进展之上,而物体检测长期以来一直是计算机视觉的主要课题。
And this was a really complicated neural network, because it was built on a lot of advancements that had been made in object detection around that time, which had been a major topic in computer vision for a long time.
实际上它是一个联合神经网络,既能学习观察单个图像。
And then it was actually one joint neural network that was, you know, learning to look at individual images.
因为实际上这个网络内部有三种不同的表征。
Because it actually had three different representations inside this network.
一种是整张图像的表征,用来大致了解整体情况。
One was the representation of the whole image to kinda get the gestalt of what's going on.
然后它会提出想要聚焦的各个区域,并分别表征每个区域。
Then it would propose individual regions that it wants to focus on, and represent each region independently.
当你观察完某个区域后,就需要为每个区域生成对应的文本描述。
And then once you look at the region, then you need to spit out text for each region.
所以这是一个相当复杂的神经网络架构。
So that was a pretty complicated neural network architecture.
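The three-representation, single-pass pipeline just described can be sketched schematically. All the pieces below (the "feature", the proposal rule, the caption head) are trivial stand-ins invented for illustration, just to show the wiring of a DenseCap-style forward pass; the real model learned all three parts jointly:

```python
# Schematic stand-in for the dense captioning forward pass:
# 1. a global representation of the whole image,
# 2. proposed regions worth describing,
# 3. a short text snippet per region.

def global_feature(image):
    # stand-in for the ConvNet backbone: mean brightness per row
    return [sum(row) / len(row) for row in image]

def propose_regions(image, thresh=0.5):
    # stand-in proposal head: a (row, col) "box" around each bright pixel
    return [(r, c) for r, row in enumerate(image)
                   for c, v in enumerate(row) if v > thresh]

def caption_region(box):
    # stand-in language head: one snippet per region
    return f"object at row {box[0]}, col {box[1]}"

def densecap(image):
    g = global_feature(image)        # gestalt of the whole image
    boxes = propose_regions(image)   # regions to focus on
    return g, [(b, caption_region(b)) for b in boxes]

image = [[0.1, 0.9], [0.8, 0.2]]
g, captions = densecap(image)
print(len(captions))  # two bright pixels -> two captioned regions
```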
这一切都发生在PyTorch出现之前。
This was all pre PyTorch.
它是单次处理完成的吗?
And does it do it in one pass?
是的。
Yeah.
对。
Yeah.
所以它是通过单次前向传播完成所有这些处理的。
So it was a single forward pass that did all of that.
通常来说,它确实是单次处理完成的。
Normally, it was doing it in one pass.
你还优化了推理过程。
You also optimized inference.
你是在用网络摄像头实现的。
You're doing it on a webcam.
我记得你
I remember you
当时在做。
were doing.
对。
Yeah.
没错。
Yeah.
所以我当时搭建了一个近乎疯狂的实时演示系统——网络在斯坦福的服务器上运行,然后通过网页前端从摄像头获取视频流,再把图像传回服务器。
So I had built this crazy real-time demo where I had the network running on a server at Stanford, and then a web front end that would stream from a webcam and send the image back to the server.
服务器运行模型后将预测结果实时传输回来。
The server would run the model and stream the predictions back.
所以我当时就拿着这台笔记本电脑在实验室里走来走去。哇。
So I was just, like, walking around the lab with this laptop. Wow.
向人们展示这个网络实时识别和标注的效果。
Just showing people this network identifying and labeling things in real time.
是啊。
Yeah.
我的天哪。
Oh my god.
这确实令人印象深刻,因为我们大多数研究生只要能发表论文就心满意足了。
It was pretty impressive because most of our graduate students would be satisfied if they can publish the paper.
对吧?
Right?
他们把研究结果打包成论文,但贾斯汀更进一步。
They packaged the research, put it in a paper, but Justin went a step further.
他说,'我想做这个实时网页演示。'
He's like, I wanna do this real time web demo.
其实,我不确定是否跟你讲过这个故事。那年我们在圣地亚哥有个会议,就是ICCV 2015。
Well, actually, I don't know if I told you this story, but there was a conference that year in Santiago. It was ICCV 2015.
然后,我在那个会议上发表了一篇关于其他主题的论文。
And then, like, I had a paper at that conference for something different.
但我带着我的笔记本电脑。
But I had my my laptop.
我当时在会场里走来走去,用笔记本向所有人展示这个实时字幕演示,模型其实运行在加利福尼亚的服务器上。
I was, like, walking around the conference with my laptop showing everybody this, like, real time captioning demo, and the model was running on a server in California.
所以它实际上是从加利福尼亚一路流式传输到圣地亚哥的。
So it was, like, actually able to stream, like, all the way from California down to Santiago.
不过延迟确实很明显,体验很糟糕。
Well, the latency was terrible.
当时就说'好吧'。
Was like Okay.
好的。
Alright.
好的。
Alright.
有延迟。
It was delayed.
大概只有一帧每秒。
It was like one FPS.
但能运行起来本身就已经相当惊人了。
But the fact that it worked at all was pretty was pretty amazing.
我本想简短地说,视觉和语言建模可能没那么大差别。
So I was gonna briefly quip that, you know, maybe vision and language modeling are not that different.
你知道,DeepSeek最近的OCR工作尝试了一个疯狂的想法:直接从像素建模文本并在其上训练。
You know, DeepSeek's OCR work recently tried the crazy thing of, let's model text from pixels and just train on that.
这可能就是未来。
And it might be the future.
我不知道。
I don't know.
我不知道你们是否认为语言是否真的必要。
I don't know if you guys have any takes on whether language is actually necessary at all.
我刚写了一整篇关于空间智能的宣言。
I just wrote a whole manifesto on spatial intelligence.
这是我切入这个话题的方式。
This is my segue into this.
是的。
Yes.
我认为它们是不同的。
I think they are different.
我确实认为这些生成模型的架构会共享许多可复用组件。
I do think the architecture of these generative models will share a lot of shareable components.
但我认为这个深度三维、四维的空间世界具有一种与一维纯生成信号根本不同的结构层次。
But I think the deeply three d, four d spatial world has a level of structure that is fundamentally different from a purely generative signal that is one dimensional.
是啊。
Yeah.
我觉得像素最大主义确实有其道理。
I I think there's something to be said for pixel maximalism.
对吧?
Right?
就像有种观点认为语言是特殊的,但我们其实是用眼睛来看语言的,而我们的眼睛本质上就是像素构成的。
Like, there's this notion that language is this different thing, but we see language with our eyes, and our eyes are just, you know, basically pixels.
对吧?
Right?
我们眼球后部那些生物像素就是在处理这些信息。
Like, we've got sort of biological pixels in the back of our eyes that are processing these things.
我们看着文字以为它是离散的存在,但这其实只存在于我们的意识中。
And, you know, we see text and we think of it as this discrete thing, but that really only exists in our minds.
现实世界中文字和语言的物理表现形式都是印在物体上的实体,我们通过视觉来感知它们。
Like, the physical manifestation of text and language in our world are, you know, physical objects that are printed on things in the world, and we see it with our eyes.
嗯,你
Well, you
也可以认为它是声音。
can also think it's sound.
但即使如此。
But even Sure.
声音确实。
Sound Sure.
确实。
Sure.
即使是声音,你也可以转换
Even sound, you can translate
为视觉信号。
into a visual signal.
是的。
Yeah.
你会得到频谱图,这是一个二维信号。
You get a spectrogram, which is a 2D signal.
对。
Right.
然后,当你转换成这种纯符号化的表示方式时,就像我们在LLM中使用的那样,实际上会丢失一些东西。
And then, like, you actually lose something if you translate to this, like, purely tokenized representations that we use in LLMs.
对吧?
Right?
比如,你会丢失字体信息。
Like, you lose the font.
你会丢失换行符。
You lose the line breaks.
你会丢失页面上的二维排版。
You lose sort of the two d arrangement on the page.
而且在很多情况下,对很多内容来说,也许这并不重要。
And and for a lot of cases, for a lot of things, maybe that doesn't matter.
但对某些情况来说,确实很重要。
But for some things, it does.
我认为像素是对现实世界更无损的一种呈现方式。
And I think pixels are this sort of more lossless representation of what's going on in the world.
在某种程度上,这是一种更通用、更贴近人类视觉感知世界方式的呈现。
And in some ways, a more general representation that better matches what we humans see as we navigate the world.
所以如果要讨论效率问题,把文本渲染成图像再喂给视觉模型可能确实不太高效。
So so, like, if there's an efficiency argument to be made, like, maybe it's not super efficient to, like, you know, render your text to an image and then feed that to a vision model.
这正是DeepSeek采用的方法。
That's exactly what DeepSeek did.
没错。
Yeah.
而且某种程度上确实奏效了。
And it, like, kinda worked.
我觉得这与整体世界模型的概念相关。
I think this ties into the whole world model.
比如,今年我最喜欢的论文之一,是一篇关于用归纳偏置探测世界模型的论文。
Like, one of my favorite papers that I saw this year was about using inductive bias to probe for world models.
这是一篇哈佛的论文,他们向大语言模型输入了大量轨道模式数据,然后让模型预测行星绕太阳运行的轨道。
So it was a Harvard paper where they fed a lot of, like, orbital patterns into an LLM, and then they asked the LLM to predict the orbit of a planet around the sun.
模型生成的轨道看起来不错,但如果你让它画出受力矢量图,结果就会乱七八糟。
And the orbits the model generated looked good, but then if you asked it to draw the force vectors, they would be all wacky.
你明白吗?
You know?
它实际上并没有遵循物理规律。
It wasn't actually following the physics.
那么你如何看待模型从数据中获取的信息?
So how do you think about what's embedded into the data that you get?
我们可以讨论如何组织三维世界模型。
And we can talk about maybe organizing for three d world models.
比如,信息的维度有哪些?
Like, what are, like, the dimensions of information?
有视觉层面的,但需要从这些数据中提取多少潜在的隐藏力量,可以这么说呢?
There's the visual, but, like, how much of, like, the underlying hidden forces, so to speak, you need to extract out of this data?
那么,这方面的主要挑战有哪些?
And, like, what are some of the challenges there?
是的。
Yeah.
我认为解决这个问题有多种方法。
I I think there's different ways you could approach that problem.
一种方法是明确处理,比如测量所有力并将其作为训练数据输入模型。
One is, like, you could try to be explicit about it and say, like, oh, I want to, you know, measure all the forces and feed those as training data to your model.
对吧?
Right?
然后可以运行传统的物理模拟,了解场景中的所有力,再用这些作为训练数据来训练模型,希望它能预测这些力。
Then you could, like, sort of run a traditional physics simulation, then know all the forces in the scene, and then use those as training data to train a model that's now gonna hopefully predict those.
或者可以期待某种更潜在的规律自然浮现。
Or you could hope that something emerges more latently.
对吧?
Right?
就是你训练一个端到端的模型来解决一个更普遍的问题,然后希望模型内部某处必须学会模拟类似物理的规律才能做出正确预测。
That you kind of train something end to end on a more general problem, and then hope that something in the internals of the model must learn to model something like physics in order to make the proper predictions.
这大致就是我们目前拥有的两大主要范式。
And those are kind of the two big paradigms that we have more generally.
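The two supervision paradigms Justin describes can be sketched concretely. Below is a toy, hypothetical setup (not World Labs code; the simulator, function names, and constants are all illustrative): a classical two-body simulator either logs the true forces as explicit training targets, or logs only state-to-next-state pairs, leaving the force law to be learned latently.

```python
import math

# Toy sketch of the two supervision paradigms described above (illustrative
# only): a classical orbital simulator either logs the true forces as
# explicit training targets, or logs only next-state pairs and leaves the
# force law to be learned latently.

G_M = 1.0  # gravitational parameter of the "sun", arbitrary units

def step(x, y, vx, vy, dt=0.001):
    """One semi-implicit Euler step under inverse-square gravity."""
    r = math.hypot(x, y)
    ax, ay = -G_M * x / r**3, -G_M * y / r**3
    vx, vy = vx + ax * dt, vy + ay * dt          # update velocity first
    return x + vx * dt, y + vy * dt, vx, vy, ax, ay

def make_datasets(n_steps=1000):
    explicit, latent = [], []
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0           # circular-orbit initial state
    for _ in range(n_steps):
        nx, ny, nvx, nvy, ax, ay = step(x, y, vx, vy)
        explicit.append(((x, y), (ax, ay)))        # paradigm 1: state -> force
        latent.append(((x, y, vx, vy), (nx, ny)))  # paradigm 2: state -> next state
        x, y, vx, vy = nx, ny, nvx, nvy
    return explicit, latent

explicit, latent = make_datasets()
```

A model trained on `explicit` is told the forces directly; a model trained on `latent` only ever sees trajectories, which is exactly the setting where plausible orbits can coexist with wacky implied forces, as in the Harvard probing result mentioned earlier.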
但没有任何迹象表明这种潜在建模能让你获得空间与动力学的因果规律。
But there's no indication that those latent modeling will get you to a causal law of of space and dynamics.
对吧?
Right?
这正是当今深度学习与人类智能开始分道扬镳的地方。
That's where today's deep learning and human intelligence actually start to bifurcate.
因为从根本上说,深度学习仍然只是在拟合模式。
Because fundamentally, the deep learning is still fitting patterns.
说到这里就有点哲学意味了。
There, you sort of get philosophical.
你说得对,我们也在尝试拟合模式,但可能我们试图拟合的是更广泛的模式阵列,比如在更长的时间跨度内,使用不同的奖励函数。
And you say that, like, we're trying to fit patterns too, but maybe we're trying to fit a broader array of patterns, with a longer time horizon, a different reward function.
但基本上,你提到的那篇论文所涉及的问题就是,它学会了拟合轨道的特定模式,却无法按照你所期望的方式实现泛化。
But basically, the paper you mentioned is sort of that problem: it learns to fit the specific patterns of orbits, but then it doesn't actually generalize in the way that you'd like.
它并没有形成某种关于重力的因果模型。
It doesn't have a sort of causal model of gravity.
没错。
Right.
因为即使在Marble中,你知道,我也尝试过。
Because even in Marble, you know, I was trying it out.
它能生成这些美丽的场景,里面还有类似拱门的结构。
It generates these beautiful sceneries, and there's, like, arches in them.
但这个模型真的理解拱门实际上是如何——你知道——像石头那样向中心施力,以及它的实际物理结构吗?
But does the model actually understand how the arch is actually drawing forces toward the center, kinda like stone, and, like, the actual physical structure of it?
另一个问题是:只要它渲染出的内容始终符合我们想象中的物理模型,它是否真正理解这些还重要吗?
And the other question is, like, does it matter that it does understand it as long as it always renders something that would fit the physical model that we imagine?
如果你用你所理解的方式去定义‘理解’,我很确定这个模型并不理解它。
If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it.
模型只是从数据中学习,从模式中学习。
The model is learning from the data, learning from the pattern.
是的。
Yeah.
这重要吗?
Does it matter?
特别是对于实际应用场景来说。
Especially for the use cases.
这是个好问题,对吧?
It's a good question, right?
就目前而言,我认为这不重要,因为它能渲染出你需要的东西——假设它是完美的。
Like, for now, I don't think it matters because it renders out what you need, assuming it's perfect.
嗯。
Yeah.
我是说,这取决于具体的使用场景。
I mean, it depends on the use case.
比如说,如果使用场景是为虚拟电影制作之类的项目生成某种背景,你只需要看起来合理的东西。
Like, if the use case is, I want to generate sort of a backdrop for virtual film production or something like that, all you need is something that looks plausible.
这种情况下,可能就无关紧要。
And in that case, probably it doesn't matter.
但如果你是建筑师,要用这个设计一栋实际建造的大楼,那就确实需要正确建模受力情况,因为你不想建好后建筑垮掉。
But if you're an architect and you're gonna use this to design a building that you're then gonna go build in the real world, then, yeah, it does matter that you model the forces correctly, because you don't want the thing to break when you actually build it.
但即便如此,即使模型包含了语义,我仍然认为模型对信号或输出的理解与人类的理解是两回事。
But even there, right, even if your model has the semantics in it, let's say, I still think the model's understanding of the signal or the output and the human's understanding are two different things.
不过这又变得哲学了。
But this gets, again, philosophical.
是啊。
Yeah.
我是说,关于'理解'这个概念有个取巧的说法。
I mean, there there's this trick with understanding.
对吧?
Right?
这些模型是一种与人类智能截然不同的智能形式。
Like, these models are a very different kind of intelligence than human intelligence.
人类智能的有趣之处在于,我认为我理解事物是因为我能在某种程度上内省自己的思维过程。
And human intelligence is interesting because, you know you know, I think that I understand things because I can introspect my own thought process to some extent.
然后我相信我的思维过程可能与其他人相似,因此当我观察他人的行为时,我会推断他们的内心状态可能与我观察到的自己的内心状态相似。
And then I believe that my thought process probably works similar to other people's so that when I observe someone else's behavior, then I infer that their internal mental state is probably similar to my own internal mental state that I've observed.
因此,我知道我理解事物,所以我假设你也理解某些东西。
And therefore, I know that I understand things, so there I assume that you understand something.
但这些模型更像是这种外星智能形式,它们能做非常有趣的事情。
But these models are sort of like this this alien form of intelligence where they can do really interesting things.
它们能展现出非常有趣的行为。
They can exhibit really interesting behavior.
但无论它们内部有什么样的认知或自我反思的等价物(如果存在的话),都与我们人类的思维方式完全不同。
But whatever equivalent of internal cognition or internal self-reflection they have, if it exists at all, is totally different from what we do.
所以它不具备自我意识。
So it doesn't have self-awareness.
没错。
Right.
但这意味着,当我们观察到这些系统表现出看似有趣或智能的行为时,我们无法必然推断出关于它们的其他特性,因为它们的世界模型和思维方式与我们截然不同。
But what that means is that when we observe seemingly interesting or intelligent behavior out of these systems, we can't necessarily infer other things about them, because their model of the world and the way they think is so different from ours.
那么你认为最终是否需要两个不同的模型来处理视觉部分和架构生成部分?
So would you need two different models to do the visual one and the architectural generation, you think, eventually?
就像,你在模型构建方法上采取的策略并没有什么根本性的限制。
Like, there's not anything fundamental about the approach that you've taken on the model building.
更多是关于模型规模的扩展及其能力的提升。
It's more about scaling the model and the capabilities of it.
或者说,过于视觉化的特性是否会阻碍你学习背后的物理原理,从而无法真正信任它生成的CAD设计能在现实世界中实际运作?
Or, like, is there something about being very visual that prohibits you from actually learning the physics behind this, so to speak, so that you could trust it to generate a CAD design that then is actually gonna work in the real world.
我认为这主要是扩大数据规模和优化模型的问题。
I think this is a matter of scaling data and improving the model.
我不认为这两者之间存在根本性的区别。
I don't think there's anything fundamental that separates these two.
是的。
Yeah.
我希望它能是一个统一的模型。
I would like it to be one model.
但我觉得,在某种意义上,深度学习面临的大问题是如何获得超越训练数据的新能力?
But I think, like, the big problem in deep learning in some sense is how do you get emergent capabilities beyond your training data?
你能否得到一个能理解力的模型,即使它没有被训练来预测力,但它会在内部隐式地学习这些力?
Are you gonna get something that understands the forces while it wasn't trained to predict the forces, but it's gonna learn them implicitly internally?
我认为我们在其他大型模型中看到的很多现象表明,这种涌现行为确实会在规模扩大时出现。
And I think a lot of what we've seen in other large models is that a lot of this emergent behavior does happen at scale.
这种能力能否迁移到其他模态、其他用例和其他任务上呢?
And will that transfer to other modalities and other use cases and other tasks?
希望如此。
I hope so.
但这将是一个需要随时间推进并观察的过程。
But that'll that'll be a process that we need to play out over time and see.
是否存在依赖现有物理引擎的诱惑?毕竟游戏行业已经完成了大量基础工作,还是说由于某些根本性不匹配,我们必须从头开始?
Is there a temptation to rely on physics engines that already exist out there (basically, the gaming industry has saved you a lot of this work), or do we have to reinvent things because of some fundamental mismatch?
我认为这就像是攀登技术阶梯。
I think that's sort of like climbing the ladder of technology.
对吧?
Right?
从某种意义上说,我们想要构建这些系统的原因,正是因为传统物理引擎在某些情况下并不适用。
Like, in some sense, the reason that you wanna build these things at all is because maybe traditional physics engines don't work in some situations.
如果物理引擎是完美的,我们就没有必要构建模型,因为问题早已被解决。
If a physics engine was perfect, we would have sort of no need to build models because the problem would have already been solved.
所以在某种程度上,我们这么做的原因在于经典物理引擎无法以我们期望的普适性解决问题。
So in some sense, the reason why we want to do this is because classical physics engines don't solve problems in the generality that we want.
但这并不意味着我们要全盘抛弃它们并从头开始。
But that doesn't mean we need to throw them away and start everything from scratch.
对吧?
Right?
我们可以利用传统物理引擎生成数据来训练模型,这相当于将物理引擎的精髓提炼到正在训练的神经网络权重中。
We can use traditional physics engines to generate data that we then train our models on, and then you're sort of distilling the physics engine into the weights of the neural network that you're training.
我认为很多实验室的工作都在印证这点,比如有人推测Sora就采用了类似方法。
And I think that's a lot of what's happening if you look at the work of other labs; people are speculating that, you know, Sora had a little bit of that.
Genie 3也运用了部分这种技术。
Genie 3 had a bit of that.
Genie 3完全就是个电子游戏的样子。
Genie 3 is, like, explicitly like a video game.
就像,你可以用控制器
Like, you have controls to
对。
Yeah.
在游戏里四处走动。
To walk around in.
我我我一直觉得,我们为娱乐发明的东西最终会应用到严肃工作中,这真的很有趣。
And I I I always think, like, it's really funny how the things that we invent for fun actually does eventually make it into serious work.
嗯。
Mhmm.
是的。
Yep.
整个AI革命就是从图形芯片开始的。
The whole AI revolution was started by graphics chips. Yeah.
部分原因是这样。
Partially.
把原本用于生成大量三角形的GPU挪作他用,基本上用来生成其他的一切。
Misusing the GPU that was for generating a lot of triangles, and generating a lot of everything else, basically.
没错。
Yeah.
我们稍微提到了Marble。
We touched on Marble a little bit.
我觉得你们选择Marble这个名字,某种程度上像是你们正从隐秘状态中走出来
I think you guys chose Marble as, I kind of feel like, a sort of coming-out-of-stealth moment, if you can
可以这么说吧
call it that.
嗯
Yeah.
也许你能简明扼要地解释一下,大家应该从中理解什么。虽然在场所有人都可以试用Marble,但我认为他们可能无法将其与你们愿景的独特性联系起来——相比其他实验室展示的那些生成式虚拟世界
Maybe we could get a concise explanation from you on what people should take away, because everyone here can try Marble, but I don't think they'd necessarily be able to link it to the differences between your vision and, I guess, the generative worlds they may have seen from other labs.
Marble是我们模型的一个缩影
So Marble is a glimpse into our model.
对吧?
Right?
我们是一家专注空间智能模型的公司
We are a spatial intelligence model company.
我们相信空间智能将是下一个前沿领域
We believe spatial intelligence is the next frontier.
要构建具备空间智能的模型,该模型必须拥有强大的能力,能够以高度多模态的方式理解、推理和生成世界,同时支持我们最终希望达到的、能与人类与世界互动复杂度相媲美的交互水平。
In order to make spatially intelligent models, the model has to be very powerful in terms of its ability to understand, reason, generate in very multimodal fashion of worlds, as well as allow the level of interactivity that we eventually hope to be as complex as how humans can interact with the world.
这就是空间智能的宏伟愿景,也是我们所构想的世界模型类型。
So that's the grand vision of spatial intelligence, as well as the kind of world models we see.
Marble是这一愿景的首次展现。
Marble is the first glimpse into that.
这是该征程的第一阶段。
It's the first part of that journey.
这是全球首个面向公众开放的、能生成如此高保真度3D世界的顶尖模型。
It's the first in class model in the world that generates three d worlds in this level of fidelity that is in the hands of the public.
这是个起点,对吧?
It's the starting point, right?
我们其实写过这篇技术博客。
We actually wrote this tech blog.
贾斯汀花了很多时间撰写那篇技术博客。
Justin spent a lot of time writing that tech blog.
不知道你有没有时间浏览过它。
I don't know if you had time to browse it.
我是说,Justin真的把它分解成了Marble的多模态输入是什么、你知道的允许用户与模型交互的编辑功能是什么,以及我们能得到什么样的输出?
I mean, Justin really broke it down into what are the inputs multimodal inputs of Marble, what are the kind of editability which you know, allows user to be interactive with the model, and what are the kind of outputs we can have?
是的。
Yeah.
所以Marble,基本上可以看作是一个生成三维世界的生成模型系统。
So so Marble, like, basically, one way of looking at it, it's the system it's a generative model of three d worlds.
对吧?
Right?
你可以输入文本或图像或多个图像,它会为你生成一个与这些输入相匹配的三维世界。
So you can input things like text or image or multiple images, and it will generate for you a three d world that kind of matches those inputs.
而且它还具有交互性,你可以实时编辑场景。
And it's also interactive in the sense that you can interactively edit scenes.
比如,我可以生成这个场景然后说我不喜欢这个水瓶。
Like, I could generate this scene and then say, I don't like the water bottle.
把它变成蓝色。
Make it blue instead.
比如,把桌子移走。
Like, take out the table.
比如,可以调整这些麦克风的位置。
Like, maybe change these microphones around.
然后你可以基于这些交互式编辑生成新的世界,并以多种格式导出。
And then you can generate new worlds based on these interactive edits and export in a variety of formats.
实际上,我们通过Marble试图同时实现两件事,我认为我们很好地平衡了这两点。
And with Marble, we were actually trying to do sort of two things simultaneously, and I think we managed to pull off the balance pretty well.
一是真正构建一个朝着空间智能宏伟愿景发展的模型。
One is actually build a model that goes towards the grand vision of spatial intelligence.
模型需要能够理解多种不同类型的输入,需要能在多种情境下模拟世界,还需要能模拟随时间变化的假设情况。
And models need to be able to understand lots of different kinds of inputs, need to be able to model worlds in a lot of situations, need to be able to model counterfactuals of how they could change over time.
因此我们希望开始构建具备这些能力的模型,而Marble目前已经初步具备了所有这些特性。
So we wanted to start to build models that have these capabilities, and Marble today does already have hints of all of these.
但与此同时,我们是一家公司。
But at the same time, we're a company.
我们是一家企业。
We're a business.
我们真的在努力不让这成为一个科研项目,而是要打造一个对当今现实世界的人们有用的产品。
We were really trying not to have this be a science project, but also build a product that would be useful to people in the real world today.
因此,虽然Marble同时是一个朝着空间智能愿景迈进的世界模型,但它也经过精心设计,旨在成为当今人们能立即用上的实用工具。
So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today.
我们开始看到Marble在游戏、视觉特效和电影领域涌现出的应用场景,它作为一款产品现在就能实现许多非常有趣的功能,同时也为我们未来想要构建的宏大世界模型奠定基础。
And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.
是的。
Yeah.
我注意到一个非常有趣的工具,因为你可以在里面录制你的场景。
I noticed one tool that was very interesting, because you can record your scene inside.
对。
Yes.
是的。
Yes.
这非常重要。
It's very important.
录制功能意味着对摄像机位置极为精确的控制。
The ability to record means a very precise control of camera placement.
要实现精确的摄像机定位,就必须具备三维空间感。
In order to have precise camera placement, it means you have to have a sense of three d space.
否则你无法确定如何调整摄像机方向和移动轨迹。
Otherwise, you don't know how to orient your camera and how to move your camera.
因此这是此类模型自然衍生的能力,也是众多应用实例之一。
So that is a natural consequence of this kind of model, and this is why this is just one of the examples.
没错。
Yeah.
我发现使用视频生成模型时,我必须学习导演的语言——比如要指挥它们上移、平移等等
I find when I play with video generative models, I'm having to learn the language of being a director, because I have to tell them, move up, like, pan, you know,
但在那种情况下,你也不能说‘镜头向北偏转63度’
But even there, you cannot say pan sixty-three degrees to the north.
对吧?
Right?
你根本没有那种控制权
You just don't have that control.
而在Marble中,你能精确控制摄像机的摆放位置
Whereas in marble, you have precise control in terms of placing a camera.
是的
Yeah.
我认为这是人们首先需要明白的一点
I think that's one of the first things people need to understand.
这不像...你不是在逐帧生成画面
It's like, you're not generating frame by frame. Yes.
而很多其他模型采用的就是这种方式
Which is like what a lot of the other models are.
是的。
Yeah.
你知道,人们理解LLM是逐个标记生成的。
You know, people understand that an LLM generates one token at a time.
那么,什么是基本单元呢?
What are, like, the atomic units?
某种程度上,像是网格。
There's kinda, like, know, the meshes.
还有溅射点(splat)、体素这些。
There's, like, the splats, the voxels.
三维世界里有很多组成部分。
There's a lot of pieces in a three d world.
人们应该对你的生成内容建立怎样的心智模型?
What should be the mental model that people have of, like, your generations?
嗯。
Yeah.
我认为存在当前的技术和未来可能发展的方向。
I I think there's, like, what exists today and what could exist in the future.
目前模型原生输出的就是splat(高斯溅射点)。
So what exists today is the model natively outputs splats.
高斯溅射点就像是微小的半透明粒子,每个粒子都有三维空间中的位置和方向,整个场景由大量这样的高斯溅射点构成。
So Gaussian splats are these, like, you know, tiny particles: each one is semi-transparent, has a position and orientation in three d space, and the scene is built up from a large number of these Gaussian splats.
高斯溅射点非常酷,因为它们能实时高效渲染,甚至在iPhone上也能流畅运行。
And Gaussian splats are really cool because you can render them in real time really efficiently, so you can render them on your iPhone, render everything.
正因如此我们才能实现精确的相机控制——这些溅射点可以在几乎任何客户端设备上实时渲染。
And that's how we get that sort of precise camera control, because the splats can be rendered in real time on pretty much any client-side device that we want.
目前我们生成的多数场景中,这种溅射点就是基本单元。
So for a lot of the scenes that we're generating today, that kind of atomic unit is that individual splat.
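As a rough sketch of that atomic unit (field names are illustrative; production 3D Gaussian splatting stores an anisotropic covariance as scale plus a rotation quaternion, and view-dependent color as spherical harmonics):

```python
from dataclasses import dataclass

# Minimal sketch of the splat representation described above. Field names
# are illustrative, not World Labs code.

@dataclass
class Splat:
    position: tuple   # (x, y, z) center in world space
    scale: tuple      # per-axis extent of the Gaussian
    rotation: tuple   # orientation quaternion (w, x, y, z)
    color: tuple      # base RGB
    opacity: float    # 0..1, the particle is semi-transparent

def composite(samples):
    """Front-to-back alpha compositing of depth-sorted (color, alpha) samples.

    This cheap per-pixel blend, with no ray-marching of a neural field, is
    why splats can render in real time even on a phone.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for (r, g, b), alpha in samples:
        w = transmittance * alpha
        out[0] += w * r
        out[1] += w * g
        out[2] += w * b
        transmittance *= 1.0 - alpha
    return tuple(out), transmittance

# A fully opaque white splat in front hides a red splat behind it:
color, remaining = composite([((1.0, 1.0, 1.0), 1.0), ((1.0, 0.0, 0.0), 0.5)])
```

Rendering a pixel then amounts to projecting the Gaussians that cover it, sorting them by depth, and running this blend.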
但我不认为这是根本性的技术限制。
But I don't think that's fundamental.
我能想象未来会出现其他有趣的技术方案。
I I could imagine other approaches in the future that would be interesting.
比如,我们World Labs就研究过其他方法,像最近开发的RTFM模型就是逐帧生成的。
So there, like, there are other approaches that even we've worked on at World Labs, like our recent RTFM model that does generate frames one at a time.
在那里,基本单元就是在用户与系统交互时逐帧生成内容。
And there, the atomic unit is generating frames one at a time as the user interacts with the system.
或者你可以想象未来的其他架构,其中基本单元是一个标记,这个标记现在代表3D世界的某个片段。
Or you could imagine other architectures in the future where the atomic unit is a token, where that token now represents, you know, some chunk of the three d world.
我认为随着时间的推移,我们可以在这里尝试很多不同的架构。
And I think there's a lot of different architectures that we can experiment with here over time.
我想稍微深入探讨一下这个问题。
I do wanna double-click on this a little bit.
不过我对Alessio要说的理解是:什么是世界模型的基本数据结构?
Though my version of what Alessio was gonna say was, like, what is the fundamental data structure of a world model?
因为就像你说的,基本单元要么是溅射点,要么是帧之类的。
Because exactly, like you said, it's either the splat or it's the frame, what have you.
你在之前的陈述中也多次提到物理和力,那是随时间变化的东西,与此只是松散相关。
You also, in the sort of previous statements, focused a lot on the physics and the forces, which is something over time, which is loosely related.
我在Marble中没看到这个。
I don't see that in Marble.
我推测它目前还不存在。
I presume it's not there yet.
也许如果有Marble 2的话,就会有运动效果了。
Maybe if there was a Marble 2, you would have movement.
还是说需要对高斯溅射进行某种合理的修改,或者这会是完全不同的东西?
Or is there a modification to Gaussian splats that makes sense, or would it be something completely different?
是的。
Yeah.
我认为有几个合理的修改方案。
I I think there's a couple modifications that make sense.
实际上这里有很多有趣的整合方式,这也是在这个领域工作的另一个美妙之处。
And there's actually a lot of interesting ways to integrate things here, which is another nice aspect of working in this space.
事实上这方面已经有很多研究工作了。
Then there's actually been a lot of research work on this.
比如,当你谈论那些天马行空的想法时,实际上已有大量非常有趣的学术研究探讨如何赋予物理特性。
Like, when you talk about wacky ideas, like, there's actually been a lot of really interesting academic work on different ways to imbue physics.
你可以
You can
在工业界也能尝试疯狂的想法。
also do wacky ideas in industry.
是啊。
Yeah.
好吧。
Alright.
但高斯溅射点本质上就是微小粒子。
But then it's like, Gaussian splats are themselves little particles.
已有许多方法能为这些溅射点附加物理属性,比如设定每个点具有质量,或通过虚拟弹簧与邻近点耦合。
There's been a lot of approaches where you basically attach physical properties to those splats and say that each one has a mass or, like, maybe you treat each one as being coupled with some kind of virtual spring to nearby neighbors.
这样就能在溅射点基础上进行物理模拟了。
And now you can start to do sort of physics simulation on top of splats.
因此,为这些物体添加物理特性、动态效果或交互性的一种途径是:预测每个溅射粒子关联的物理属性,然后使用经典物理或学习模型进行下游模拟。
So one kind of avenue for adding physics or dynamics or interaction to these things would be to, you know, predict physical properties associated with each of your splat particles and then simulate those downstream, either using classical physics or something learned.
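The mass-spring idea can be sketched as follows (a hypothetical toy, not World Labs code): treat each splat as a point mass, couple neighbors with virtual springs, and step the system with a classical integrator.

```python
import math

# Toy sketch of simulating physics on top of splats, as described above:
# each splat is treated as a point mass, and virtual springs couple it to
# nearby splats. Hypothetical illustration only.

def step_mass_spring(positions, velocities, masses, springs, k=10.0, dt=0.01):
    """One explicit-Euler step. springs: (i, j, rest_length) tuples."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i, j, rest in springs:
        d = [positions[j][a] - positions[i][a] for a in range(3)]
        length = math.sqrt(sum(c * c for c in d))
        f = k * (length - rest)               # Hooke's law: pull if stretched
        for a in range(3):
            forces[i][a] += f * d[a] / length
            forces[j][a] -= f * d[a] / length
    for i, m in enumerate(masses):
        for a in range(3):
            velocities[i][a] += dt * forces[i][a] / m
            positions[i][a] += dt * velocities[i][a]
    return positions, velocities

# Two splat particles a distance 2 apart, with rest length 1, pull together:
pos = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos, vel = step_mass_spring(pos, vel, masses=[1.0, 1.0], springs=[(0, 1, 1.0)])
```

The masses and spring constants would be exactly the "physical properties" a model could learn to predict per splat.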
或者说,三维工作的美妙之处在于元素可以组合,你可以在不同环节注入逻辑。
Or, you know, the beauty of working in three d is that things compose and you can inject logic in different places.
一种方式类似于我们正在生成三维场景。
So one way is sort of like we're generating a three d scene.
我们将预测场景中所有物体的三维属性。
We're gonna predict three d properties of everything in the scene.
然后使用经典物理引擎来模拟交互效果。
Then we use a classical physics engine to simulate the interaction.
或者你也可以采用这样的方式:根据用户操作,模型将用溅射粒子或其他表现形式重新生成整个场景。
Or you could do something where, like, as a result of a user action, the model is now going to regenerate the entire scene in in splats or some other representation.
这种方式可能更具普适性,因为你不再受限于已知可建模的物理属性。
And that could potentially be a lot more general because then you're not bound to whatever sort of, you know, physical properties you know how to model already.
但这也会消耗更多计算资源,因为需要针对用户操作重新生成整个场景。
But that's also a lot more computationally demanding because then you need to regenerate the whole scene in response to the to user actions.
但我认为这是一个非常有趣的未来研究方向,如你所说,也可以加入未来的Marble 2。
But I think this is a really interesting area for future work, and for adding into a potential Marble 2, as you say.
是的。
Yeah.
在动态效果方面存在机会。
There's opportunity for dynamics.
对。
Yeah.
对吧?
Right?
目前splats的密度状态如何?
What's the state of, like, splats density, I guess?
比如我们能否渲染出足够高的分辨率,在放大时仍保持清晰?
Like, do we can we render enough to have very high resolution when we zoom in?
我们是否受到生成数量或渲染能力的限制?
Are we limited by, like, the amount that you can generate, the amount that we can render?
换句话说,这些如何实现超高保真度呢?
Like, how are these gonna get super high fidelity, so to speak?
确实存在一些限制,但取决于你的目标使用场景。
You have some limitations, but depending on your target use case.
比如,我们场景面临的一个主要约束是希望内容能在移动设备上流畅渲染,同时也要在VR头显中清晰呈现。
So, like, one of the one of the big constraints that we have on our scenes is we wanted things to render cleanly on mobile, and we wanted things to render cleanly in VR headsets.
这些设备的计算能力远低于许多其他场景下的设备。
Those devices have a lot less compute than you have in a lot of other situations.
如果你想在四年前的iPhone上以30到60帧的高分辨率渲染splat文件,那么能处理的splat数量就会受到限制。
And, like, if you wanna get a splat file to render at high resolution, at, like, 30 to 60 FPS on an iPhone from four years ago, then you are a bit limited in the number of splats that you can handle.
但如果允许使用近年设备——比如今年的iPhone、新款MacBook,或是本地GPU,或者不需要60帧1080p的画质,就能放宽限制处理更多splat,从而获得更高分辨率的场景。
But if you're allowed to work on a recent device (even this year's iPhone, or a recent MacBook), or if you have a local GPU, or if you don't need that 60 FPS 1080p, then you can relax the constraints and get away with more splats, and that lets you get higher resolution in your scenes.
我原本期待但没听到你们提及的一个应用场景是具身交互场景。
One use case I was expecting but didn't hear from you was embodied use cases.
嗯。
Mhmm.
你们目前只专注于虚拟场景吗?
Are you you're just focusing on virtual for now?
如果你访问World Labs的主页,有一个专门的页面叫Marble Labs。
If you go to World Labs' home page, there is a particular page called Marble Labs.
我们在那里展示了不同的应用场景。
There we showcase different use cases.
实际上我们将其分为更多视觉效果应用、游戏应用以及模拟应用场景。
And we actually organize them into visual effects use cases and gaming use cases, as well as simulation use cases.
事实上,我们展示了这项技术能在机器人训练中发挥巨大作用,对吧?
And in that, we actually show this is a technology that can help a lot in robotic training, right?
这又回到了我早先讨论的话题。
This goes back to what I was talking about earlier.
说到数据匮乏,机器人训练确实面临数据不足的问题。
Speaking of data starvation, robotic training really lacks data.
高保真度的真实世界数据极其关键,但这类数据本身就很难大量获取。
High fidelity real world data is absolutely very critical, but you're just not gonna get a ton of that.
当然,另一个极端是纯粹依赖互联网视频数据,但这样你就失去了很多训练具身智能体所需的可控性。
Of course, the other extreme is just purely internet video data, but then you lack a lot of the controllability that you want to train your embodied agents with.
因此,模拟和合成数据实际上是一个非常重要的中间解决方案。
So simulation and synthetic data is actually a very important middle ground for that.
我在这个领域工作多年,最大的痛点之一就是如何获取合成模拟数据。
I've been working in this space for many years, and one of the biggest pain points is where do you get the synthetic simulated data.
你必须精心策划资源,构建并组合这些复杂场景。
You have to curate assets and build these, compose these complex situations.
在机器人领域,你需要大量不同状态,让具身智能体在合成环境中进行交互。
And in robotics, you want a lot of different states, you want the embodied agent to interact in the synthetic environment.
Marble在生成用于具身智能体训练的合成模拟世界方面确实具有很大潜力。
Marble actually has real potential for helping to generate these synthetic simulated worlds for embodied agent training.
显然,确实如此。
Obviously, that's yeah.
这个就在主页上。
That's on the it's on the homepage.
它会在那里的。
It'll be there.
我只是在尝试建立联系,就像你说的,你还得构建一个商业模式。
I was just trying to make the link to, as you said, you also have to build a business model.
机器人市场的规模显然非常庞大。
The market for robotics obviously is very huge.
也许你不需要那样做,或者我们需要先构建并解决虚拟世界的问题,然后再进入实体领域,这显然正是垫脚石。
Maybe you don't need that, or maybe we need to build up and solve the virtual worlds first before we go to embodied, and obviously... That is exactly the stepping stone.
这还有待决定。
That is to be decided.
我确实认为
I I do think that
因为其他人都直接奔着那个方向去了。
Because everyone else is going straight there.
对吧?
Right?
并非所有人都是这样,但我想说,这其中确实存在一种令人兴奋的氛围。
Not everyone else, but there is a there is an excitement, I would say.
但你知道,我认为这个世界足够大,可以容纳不同的方法。
But, you know, I think the world is big enough to have different approaches.
是的。
Yeah.
方法。
Approaches.
是的。
Yeah.
我的意思是,我们一直认为这是一项相当基础的技术,随着时间的推移,它应该能够触及许多不同的行业。
I mean and we always view this as a pretty horizontal technology that should be able to touch a lot of different industries over time.
你知道,目前Marble更侧重于创意产业,但我认为支撑它的技术应该会随着时间的推移适用于许多不同领域。
And, you know, marble is a little bit more focused on creative industries for now, but I think the the technology that powers it should be applicable to a lot of different things over time.
而机器人技术可能就是其中之一,你知道,这可能比我们预期的更早实现。
And robotics is one that, you know, is maybe gonna happen sooner than later.
还有,设计。
Also, design.
对吧?
Right?
它与创意领域非常接近。
It's very adjacent to creative.
哦,是的。
Oh, yeah.
绝对是的。
Definitely.
就像,我觉得这类似于建筑领域的东西?
Like, I I I think it's like the architecture stuff?
是的。
Yes.
好的。
Okay.
是啊。
Yeah.
我是说,我在网上开玩笑的。
I mean, I was joking online.
我在Slack上发了这个视频,说,谁想用Marble来规划你的下一次厨房改造?
I posted this video on Slack of, like, oh, who wants to use Marble to plan your next kitchen remodel?
实际上它在这方面已经表现得很出色了。
It actually works great for this already.
就像,拍两张你厨房的照片,用Marble重建它,然后用编辑功能看看更换台面、地板或橱柜后的效果。
Just, like, take two images of your kitchen, reconstruct it in Marble, and then use the editing features to see what that space would look like if you changed the countertops or changed the floors or changed the cabinets.
你知道,我们并没有专门为这个用途开发什么功能。
And this is something that's you know, we didn't necessarily build anything specific for this use case.
但由于这是一项强大的横向技术,你会自然而然地发现这些从模型中涌现出来的新用途。
But because it's a it's a powerful horizontal technology, you kind of get these emergent use cases that that just fall out of the model.
我们已经有早期测试用户使用API密钥在构建室内设计用例了。
We have early beta users, using an API key, who are already building for the interior design use case.
我刚改造完车库。
I just did my garage.
我早该知道这个的。
I should have known about this.
我
I
我得
I gotta
下次你装修时,我们可以
Next time you remodel, we can
没错。
Exactly.
嗯,接下来肯定是厨房了。
Well, the kitchen is next, I'm sure.
是啊。
Yeah.
是啊。
Yeah.
我对整个空间智能领域感到好奇。
I'm curious about the whole spatial intelligence space.
我认为我们应该更深入地探讨这个问题。
I think we should dig more into that.
首先,你如何定义它?
One, how do you define it?
还有,它与人们想到LLM时的那种传统智能之间的差距是什么?比如当Dario说我们拥有一个装满爱因斯坦的数据中心时。
And, like, what are the gaps between that and the traditional intelligence that people might think about with LLMs, when, you know, Dario says we have a data center full of Einsteins.
那更像是传统智能。
That's like traditional intelligence.
这不属于空间智能。
It's not spatial intelligence.
要具备空间智能需要什么条件?
What is required to be spatially intelligent?
首先,我不明白那句话,'一个装满爱因斯坦的数据中心'。
First of all, I don't understand that sentence, a data center full of Einsteins.
我就是不理解这个说法。
I I just don't understand that.
这不是一个
It's not a
这是个比喻。
deep... It's an analogy.
嗯,所以人工智能作为一个领域、一门学科,很大程度上是受人类智能启发的,对吧?
Well, so a lot of AI as a field, as a discipline, is inspired by human intelligence, right?
因为就目前所知,我们是宇宙中最聪明的动物。
Because we are the most intelligent animal we know in the universe for now.
如果你观察人类智能,它是非常多维的,对吧?
And if you look at human intelligence, it's very multidimensional, right?
有一位心理学家,我想他的名字叫霍华德·加德纳,
There is a psychologist, I think his name is Howard Gardner,
在
in the
1960年代,实际上正式称为多元智能理论,用于描述人类语言智能、空间智能、逻辑智能和情感智能。
1960s, who actually literally called it multiple intelligences, to describe: there is human linguistic intelligence, there is spatial intelligence, there is logical intelligence, and emotional intelligence.
所以对我来说,当我思考空间智能时,我认为它是对语言智能的补充。
So for me, when I think about spatial intelligence, I see it as complementary to language intelligence.
因此我个人不会说这是空间智能与传统智能的对立,因为我不知道传统指的是什么。
So I personally would not say it's spatial versus traditional because I don't know what tradition means.
那是什么意思?
What does that mean?
我确实认为空间智能是对语言智能的补充。
I do think spatial is complementary to linguistic.
我们如何定义空间智能?
And how do we define spatial intelligence?
它是一种让你能够在空间中推理、理解、移动和互动的能力。
It's the capability that allows you to reason, understand, move, and interact in space.
我以DNA结构的推断为例,对吧?
And I use this example of the deduction of DNA structure, right?
当然我在简化这个故事,但很大程度上这涉及到分子和化学键在三维空间中的空间推理,最终推测出双螺旋结构。
And of course I'm simplifying this story, but a lot of that had to do with the spatial reasoning of the molecules and the chemical bonds in a three d space to eventually conjecture a double helix.
人类,或者说弗朗西斯·克里克和沃森所具备的这种能力,很难将这一过程简化为纯粹的语言描述。
And that ability that humans — or Francis Crick and James Watson — had: it is very, very hard to reduce that process into pure language.
这是文明进程中的一个巅峰时刻。
And that's a pinnacle of a civilizational moment.
但日常生活中,比如我现在试图抓取一个杯子。
But every day, right, I'm here trying to grasp a mug.
这整个观察杯子、判断其所在环境、看到自己的手、让手部几何形状与杯子匹配并触碰正确抓握点的过程,都具有极强的空间性。
This whole process of seeing the mug, seeing the context where it is, seeing my own hand, opening my hand so it geometrically matches the mug, and touching the right affordance point — all of this is deeply, deeply spatial.
这非常困难。
It's very hard.
我试图用语言来描述它,但另一方面,这种语言描述本身并不能让你真正完成抓取动作。
I'm trying to use language to narrate it, but on the other hand, that narrated language itself cannot get you to pick up a mug. Yeah.
带宽限制。
Bandwidth constraint.
是的。
Yes.
我最近做了些计算,比如如果你每天24小时不停说话,能产生多少标记?
I did some math recently on, like, if you just spoke all day, every day for twenty four hours a day, how many tokens do you generate?
以平均每分钟150词的语速粗略估算,大约每天21.5万个标记。
At the average speaking rate of, like, 150 words per minute, it roughly rounds out to about 215,000 tokens per day.
而你生活的世界信息量远比这大得多。
And, like, your world that you live in is so much higher bandwidth than that.
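The speaking-rate arithmetic above can be sanity-checked in a few lines (the one-token-per-word ratio is my simplifying assumption; real tokenizers average slightly more than one token per English word):

```python
# Sanity check of the spoken-language bandwidth estimate.
# Assumption: ~150 words per minute, and roughly 1 token per word
# (a simplification; real tokenizers yield a bit more per English word).
words_per_minute = 150
minutes_per_day = 24 * 60          # 1,440 minutes in a day
words_per_day = words_per_minute * minutes_per_day
print(words_per_day)               # 216000 -- close to the ~215,000 tokens quoted
```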
嗯,我认为确实如此。
Well, I think that is true.
但若想想艾萨克·牛顿爵士。
But if I think about Sir Isaac Newton.
对吧?
Right?
就像当时人们虽然空间上理解物体下落,却尚未用语言将重力这类概念正式表述出来。
It's like you have things like gravity, which at the time had not been formalized in language, that people innately, spatially understand — that things fall.
对吧?
Right?
但将其以某种方式形式化是有帮助的。
But then it's helpful to formalize that in some way.
或者说,我们使用语言来真正捕捉那些从经验和空间上也能理解,但用语言描述更简便的各种规则。
Or, like, you know, all these different rules where we use language to really capture something that you can also understand empirically and spatially, but that is easier to describe in language.
所以我很好奇空间智能与语言智能之间的相互作用——有些规则用语言描述比让空间智能理解更容易。
So I'm curious about the interplay of spatial and linguistic intelligence — like, some rules are easier to write down in language than for spatial intelligence to understand.
但你知道,你没法用文字精确描述'手这样放,再下移这么多'的动作。
But you cannot, you know, you cannot write, put your hand like this, and put it down this amount.
因此我始终好奇如何协同运用这两种智能。
So I'm always curious about how you leverage each other together.
我是说,以牛顿为例,他之所以想到写下那些定律,正是基于他在世界中大量的具身体验。
I mean, if anything, like the example of Newton — Newton only thinks to write down those laws because he's had a lot of embodied experience in the world.
对吧。
Right.
是的。
Yes.
正是如此。
Exactly.
而且实际上,有必要区分你提到的理论构建与那种日常的、嵌入三维世界的具身体验。
And, actually, it's useful to distinguish between the theory building that you're mentioning versus the embodied, daily experience of being embedded in the three-dimensional world.
对吧?
Right?
所以对我来说,空间智能某种程度上封装了那种身处三维空间、在其中移动、观察和行动的具身体验。
So to me, spatial intelligence is sort of encapsulating that embodied experience of being there in three-d space, moving through it, seeing it, acting in it.
正如菲菲所说,你可以用语言描述这些,但这是个信息损耗很大的渠道。
And as Fei Fei said, you can narrate those things, but it's a very lossy channel.
这就好比,你知道,身处世界并在其中行动,与试图描述它,是两种完全不同的模式。
It's just like the notion of, you know, being in the world and doing things in it is a very different modality from trying to describe it.
但作为人类,我们是长期在空间中互动进化而来的动物,我们甚至不认为这是件困难的事。
But because we, as humans, are animals who have evolved interacting in space all the time, we don't even think that that's a hard thing.
对吧?
Right?
然后我们自然而然地跃升至语言层面,将其作为最高形式的抽象机制。
And then we sort of naturally leap to language and then theory building as mechanisms to abstract above that sort of native spatial understanding.
从某种意义上说,大语言模型直接跳到了这些最高形式的抽象推理层面,这非常有趣也非常有用。
And in some sense, LLMs have just, like, jumped all the way to those highest forms of abstracted reasoning, which is very interesting and very useful.
但空间智能就像重新打开那个黑箱,提醒我们可能因为直接采用完全抽象的语言、推理和交流方式而丢失了一些东西。
But spatial intelligence is almost like opening up that black box again and saying, maybe we've lost something by going straight to that fully abstracted form of language and reasoning and communication.
你知道吗,作为一个视觉科学家,这很有趣。
You know, it's funny as a vision scientist.
对吧?
Right?
我一直认为视觉能力被低估了,因为它对人类来说毫不费力。
I always find that vision is underappreciated because it's effortless for humans.
你作为婴儿睁开眼睛时,就开始
You open your eyes as a baby, you start
去看。是啊,我们某种程度上天生就具备这个能力。
to see. Yeah, we're somehow born with it.
我们几乎是与生俱来就拥有它。
We're almost born with it.
但你必须付出努力去学习语言,包括学习如何写作、如何运用语法、如何表达,这让它显得困难。
But you have to put effort in learning language, including learning how to write, how to do grammar, how to express, and that makes it feel hard.
然而自然界花费更多时间真正优化的感知能力和空间智能,却被人类低估了。
Whereas something that nature spends way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans.
有证据证明我们是与生俱来的吗?
Is there proof that we are born with it?
你几乎是天生的。
You're almost born with it.
所以听起来我们确实是在出生后才学会的。
So it sounds like we actually do learn after we're born.
我们出生时,视力较弱,但感知能力确实会逐渐增强。
When we are born, our visual acuity is less and our perceptual ability does increase.
但我们大多数人天生就具备视觉能力。
But we are, most humans, are born with the ability to see.
而且大多数人天生就能将感知与运动动作联系起来。
And most humans are born with the ability to link perception with motor movements.
我是说,运动动作本身需要时间才能完善。
I mean, the motor movement itself takes a while to refine.
动物们真是不可思议,对吧?
And then animals are incredible, right?
就像今年夏天早些时候我在非洲看到的景象。
Like, I was just seeing this in Africa earlier this summer.
那些小动物刚出生几分钟内就必须行动起来。
These little animals, they're born and within minutes they have to get going.
否则,你知道的,狮子就会抓到它们。
And otherwise, you know, the lions will get them.
在自然界中,你知道,感知能力和空间智能的优化花了整整五亿四千万年,而语言——
And in nature, you know, it took five hundred forty million years to optimize perception and spatial intelligence — versus language.
对语言发展最慷慨的估计可能也就五十万年。
The most generous estimation of language development is probably half a million years.
哇。
Wow.
是啊。
Yeah.
这比我想象的要长得多。
That's longer than I would have guessed.
嗯,我已经说得很宽泛了。
Well, I'm being very generous.
是啊。
Yeah.
没错。
Yeah.
不。
No.
我,你知道,当时正在翻阅你的书,我真正意识到一个与我们播客内容相关的有趣联系,就是语言模型的基准测试,以及 Winograd 模式如何刻意加入这些需要空间智能才能理解的物理不可能场景。
I was, you know, sort of going through your book, and I was really realizing that one of the interesting links to something that we covered on the podcast is language model benchmarks, and how the Winograd schemas actually put in all these sort of physical impossibilities that require spatial intelligence.
对吧?
Right?
比如,a在b上面,因此a不可能穿过b,这对我们来说显而易见,但对语言模型来说却可能发生。
Like, a is on top of b, therefore, a cannot fall through b is obvious to us, but to a language model, it could happen.
我不知道。
I don't know.
也许这就像是,你知道,下一个token预测的一部分。
Maybe it's like a part of the, you know, the next token prediction.
这某种程度上就是我所说的解构这种抽象概念。
That's sort of what I mean about, like, unwrapping this abstraction.
是的。
Yeah.
对吧?
Right?
就像,如果你对世界的整个模型只是看到词语一个接一个地出现,这确实很难
Like, if your whole model of the world is just like seeing sequences of words after each other, it's really kind of hard to
是啊。
Yeah.
比如,为什么不呢?
Like, why why not?
这其实不公平。
It's actually unfair.
对。
Right.
对。
Right.
所以就是
So it's
但对我们来说显而易见的原因是,我们内心正将其映射回我们熟悉的三维世界表征。
but then the reason it's obvious to us is because we are internally mapping it back to some three-dimensional representation of the world that we're familiar with.
问题是,我想,你知道这有多难,我们需要多久才能从中提炼——我用'提炼'这个词。
The question is, I guess, like, how hard is it — you know, how long is it gonna take us to distill from it. Like, I used the word distill.
我不知道你是否同意这点。
I don't know if you agree with that.
将你的世界模型提炼成语言模型。
To distill from your world models into a language model.
因为我们确实希望我们的模型具备社交智能。
Because we do want our models to have social intelligence.
对吧?
Right?
就像我们是否必须完全抛弃语言模型才能实现这一点?
Like and do we have to throw the language model out completely in order to to do that?
或者——
Or —
不。
No.
对吧?
Right?
是的。
Yeah.
我不这么认为。
I don't think so.
没错。
Right.
我认为它们是多模态的。
I think they're multimodal.
我是说,即便是我们现在的 Marble 模型,也是以语言作为输入的。
I mean, even our model, Marble, today takes language as an input.
对。
Right.
对吧?
Right?
所以它是深度多模态的。
So it's deeply multimodal.
我认为在许多应用场景中,这些模型会协同工作。
And I think in many use cases, these models will work together.
也许有一天我们会拥有一个通用模型。
Maybe one day we'll have a universal model.
我是说,即便真的实现了,从实用角度来说,人们习惯用语言交流,也希望能用语言与系统互动。
I mean, even if you do, like, there's sort of a pragmatic thing where people use language and people want to interact with systems using language.
即便从实用主义出发,构建能让人们通过语言交互的系统、产品和模型也很有价值。
Even pragmatically, it's useful to build systems and build products and build models that let people talk to them.
所以我认为这种需求不会消失。
So I don't see that going away.
我觉得这里存在一种学术探索——从理论上说,你能构建一个仅依赖视觉或空间智能的模型到什么程度?
I think there's a sort of intellectual curiosity of saying, like, intellectually, how much could you build a model that only uses vision or only uses spatial intelligence?
我不确定那在实际中会有多大用处,但将其视为一项智力或学术挑战,看看能推进到什么程度,这会很有趣。
I don't know that that would be practically useful, but I think it'd be an interesting intellectual or academic exercise to see how far you could push that.
我是说,虽然不想把话题拉回物理学,但我很好奇,如果你有一个高度精确的世界模型,却不给它任何我们当前对标准物理模型的理解,它能自己推导出多少内容,从零开始重建,以及需要何种程度的语言理解能力。
I mean, not to bring us back to physics, but I'm curious: if you had a highly precise world model and you didn't give it any notion of our current understanding of the standard model of physics, how much of it would it be able to come up with and recreate from scratch, and what level of language understanding would it need?
因为我们有太多符号体系,某种程度上我们利用这些体系重建物理模型,但也许我们会提出一个完全不同的模型,却依然保持准确性。
Because we have so many notations that we kind of use to recreate it — but maybe it would come up with a very different model of it and still be accurate.
我在想我们在多大程度上受限于——你知道人们常说,人形机器人必须像人类,因为这个世界是为人类打造的。
And I wonder how much we're kind of limited by the — you know how people say humanoids always need to be like humans because the world is built for humans.
某种程度上,我们构建语言的方式也限制了从其他模态中能获得的输出结果。
And in a way, it's like the way we build language constrains some of the outputs that we can get from these other modalities as well.
所以我非常期待关注你们的研究进展。
So I'm super excited to follow your work.
是啊。
Yeah.
我是说,其实要回答这个问题,甚至都不需要涉及人工智能领域。
I mean, you actually don't even need to be doing AI to answer that question.
你可以发现外星人,看看他们拥有什么样的物理法则。
You could discover aliens and see what kind of physics they have.
对吧?
Right?
对吧?
Right?
而他们
And they
可能会有所帮助。但菲菲说过,
might help. But Fei Fei said,
我们目前是最聪明的
we are so far the smartest
迄今为止。
So far.
在宇宙中。
In the universe.
对。
Right.
所以,你是什么意思呢?我是说,但那确实是个非常有趣的问题。
So what do you — I mean, but that is a really interesting question.
对吧?
Right?
就像,我们对宇宙的认知和对物理的理解,是否在某种程度上受到我们自身认知的限制,或是受到我们技术进化路径依赖的约束?
Like, is our knowledge of the universe and our understanding of physics, is it constrained in some way by our own cognition or by the path dependence of our own technological evolution?
而一种方法,某种程度上来说,就是进行一个实验。
And one way to sort of answer that is, like, to do an experiment.
就像,你几乎想要做个实验,看看如果我们重新运行人类文明,我们是否会以相同的顺序得出相同的物理定律。
Like, you almost wanna do an experiment and say, like, if we were to rerun human civilization again, would we come up with the same physics in the same order?
而且我认为这并不是一个非常实际的实验
And I don't think that's a very practical experiment to run.
你知道吗,我在想人们是否可以做一个实验,我们现在有大量关于行星或天体运动的天体物理数据。
You know, one experiment I wonder if people could run: we have plenty of astrophysical data now on planetary or celestial body movements.
直接把数据输入模型,看看牛顿定律是否会浮现出来。
Just feed the data into a model and see if Newtonian law emerges.
我猜可能不会。
My guess is it probably won't.
我也是这么想的,不会的。
That's my guess, it's not.
牛顿定律的抽象层次与这些语言大模型所代表的层面不同。
The abstraction level of Newtonian law is at a different level from what these LLMs represent.
所以即使给大模型足够的天体运动数据,它能预测出相当精确的运动轨迹,我也不会感到惊讶。
So I wouldn't be surprised that given enough celestial movement data, an LLM would actually predict pretty accurate movement trajectories.
假设我虚构了一颗围绕恒星运转的行星,只要提供足够数据,我的模型就能告诉你第一天它在哪,第二天它在哪。
Let's say I invent a planet orbiting a star, and given enough data, my model would tell you where it is on day one, where it is on day two.
这完全在我意料之中。
I wouldn't be surprised.
但F=ma或者作用力等于反作用力,那完全是另一个层次的抽象概念了。
But F equals MA or action equals reaction, that's just a whole different abstraction level.
这已经超出了当今语言模型的能力范围。
That's beyond just the today's LLM.
好的。
Okay.
你需要什么样的模型才能让它不是地心说模型?
What model would you need to not have it be a geocentric model?
因为如果我仅仅训练一些视觉数据,你会认为太阳绕着地球转也是合理的。
Because if I'm training just some visual data, it makes sense that you think the sun rotates around the earth.
对吧?
Right?
但显然,事实并非如此。
But, obviously, that's not the case.
那么它要如何学习到这一点呢?
So how would it learn that?
比如,我对所有这些我们讨论的力都很好奇。
Like, I'm curious about all these, you know, forces that we talk about.
有时候可能你并不需要它们,因为只要看起来是对的,那就是对的。
It's like sometimes maybe you don't need them because as long as it looks right, it's right.
但是,当你试图让这些模型执行更高级的任务时,我们能在多大程度上依赖它们呢?
But, like, as you make the jump to, like, try and use these models to do more high level tasks, how much can we rely on them?
我认为你可能需要一种不同的学习范式。
I think you can need kind of a different learning paradigm.
对吧?
Right?
所以,你看,这里有点混淆了,是在说大语言模型和语言符号符号系统,还是人类理论构建和人类物理学?
So, like, you know, there's a bit of conflation happening here: is it LLMs and language and symbols versus, you know, human theory building and human physics?
它们非常不同,因为人类的目标函数是理解世界并在生活中茁壮成长。
And they're very different, because the human objective function is to understand the world and thrive in your life.
实现这一目标的方式是,有时你观察数据,然后进行思考。
And the way that you do that is by — you know, sometimes you observe data, and then you think about it.
接着你尝试在现实中做些什么,却发现与预期不符。
And then you try to do something in the world, and it doesn't match your expectations.
然后你会想要去更新你对这个世界的理解。
And then you want to go and update sort of your understanding of the world online.
人们一直在不断这样做。
And people do this all the time constantly.
就像,比如说,我以为钥匙在楼下,于是下楼去找却没找到。
Like, whether it's, you know, I think my keys are downstairs, so I go downstairs and I look for them and I don't see them.
然后,哦,糟糕。
And, oh, no.
它们其实就在我卧室里。
They're actually up in my bedroom.
所以我们就像这样,因为我们在不断与世界互动,必须持续构建关于周围世界发生之事的理论,然后证伪或为这些理论增添证据。
So, because we're constantly interacting with the world, we're constantly having to build theories about what's happening in the world around us, and then falsify or add evidence to those theories.
我认为这种过程,在宏观层面放大后,就给了我们牛顿物理学中的F=ma公式。
And I think that that kind of process, writ large and scaled up, is what gives us f equals m a in Newtonian physics.
而且我觉得这与我们训练的模型模式有点正交,无论是语言模型还是空间模型。
And I think that's a little orthogonal to, you know, the modality of model that we're training, whether it's language or even spatial.
我的方式是
The way I
这么说吧,这几乎是一种更高效的学习方式,因为你有一个假设:根据现有数据,这里有几个可能的世界观,然后你通过实验来排除那些不可能的世界观。
put it is: this is almost more efficient learning, because you have a hypothesis — here are the different possible worlds that are consistent with my available data — and then you do experiments to eliminate the worlds that are not possible.
最终确定出正确的那一个。
And you resolve to the one that's right.
对我来说,这也是我理解他人思维的方式——我有几个关于你在想什么的假设,然后尝试采取行动来验证或检验我的直觉。
To me, that's also how I have theory of mind: I have a few hypotheses of what you're thinking, and I try to create actions to resolve that, or check my intuition as to what you're thinking, you know.
显然,语言模型完全不具备这些能力。
And obviously, LLMs don't do any of this.
心智理论甚至可能延伸至情感智能领域,而当今的AI在这方面完全没有触及。
Theory of mind possibly also extends into even emotional intelligence, which today's AI is really not touching at all.
对吧?
Right?
而我们真的非常需要这种能力。
And we really really need it.
要知道,人们可能已经开始过度依赖这些东西了。
You know, people are starting to depend on these things probably too much.
而这...这完全是另一个话题了。是的。
And that's a whole topic of — Yeah.
属于另一个辩论范畴。
Of other debate.
我确实想问这个问题,因为很多人给我们发过相关内容。
I do have to ask because a lot of people have, like, sent this to us.
我们需要摒弃多少旧有认知?
How much do we have to get rid of?
序列到序列建模是否已被彻底淘汰?
You know, is is sequence to sequence modeling out the window?
注意力机制是否过时了?
Is attention out the window?
我们究竟在多大程度上重新审视所有理论?
Like, how much are we re questioning everything?