本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
这从根本上、哲学上对我来说是个不同的问题。
This is fundamentally, philosophically, to me, a different problem.
过去十年主要是关于理解已有的数据。
The previous decade had mostly been about understanding data that already exists.
但未来十年将是关于理解新数据的。
But the next decade was going to be about understanding new data.
视觉空间智能是如此基础。
Visual spatial intelligence is so fundamental.
它和语言一样基础。
It's as fundamental as language.
就像在圣诞节拆礼物,每天你都知道会有惊人的新发现,某个地方会有惊人的新应用或新算法。
It's like unwrapping presents on Christmas, that every day you know there's gonna be some amazing new discovery, some amazing new application or algorithm somewhere.
无论我们看到什么或想象什么,两者都能汇聚到生成它的方向。
If we see something or if we imagine something, both can converge towards generating it.
我认为我们正处于寒武纪大爆发的中间阶段。
I think we're in the middle of a Cambrian explosion.
AI的下一章不是关于更好的语言模型。
The next chapter of AI isn't about better language models.
而是关于像理解文本一样基础地理解三维世界。
It's about understanding the three d world as fundamentally as we understand text.
最近,World Labs推出了他们的首款产品Marble。
Recently, World Labs launched Marble, their first product.
所以我们正在重播迄今为止最受欢迎的对话,与World Labs联合创始人李飞飞和Justin Johnson讨论为什么空间智能是真正智能机器缺失的关键部分。
So we're replaying our most popular conversation to date, a discussion with World Labs cofounders Fei Fei Li and Justin Johnson about why spatial intelligence is the missing piece for truly intelligent machines.
与a16z普通合伙人Martin Casado一起,Fei Fei和Justin讨论了ImageNet在2009年百万图像赌注如何开启了现代计算机视觉,为何当今多模态模型尽管处理像素却仍困在一维空间,以及他们的团队如何构建基础设施,使生成完全交互的3D世界变得像今天生成文本一样简单。
Together with a16z general partner Martin Casado, Fei Fei and Justin talk about how ImageNet's million image bet in 2009 unlocked modern computer vision, why today's multimodal models are still trapped in one dimension despite processing pixels, and how their team is building the infrastructure to generate fully interactive three d worlds as easily as we generate text today.
从重构与生成的融合正在重新定义计算机视觉,到AR、VR和机器人技术为何迫切需要原生的三维理解,这是四位传奇研究者押注一切的故事——他们认为通往通用人工智能的道路必须经过空间智能。
From the convergence of reconstruction and generation that's redefining computer vision, to why AR, VR, and robotics desperately need native three d understanding, this is the story of four legendary researchers betting everything that the path to AGI runs through spatial intelligence.
让我们开始吧。
Let's get into it.
过去两年间,我们见证了消费级AI公司和技术的疯狂涌现,场面相当震撼。
Over the last two years, we've seen this kind of massive rush of consumer AI companies and technology, and it's been quite wild.
但你们从事这项工作已有数十年。
But you've been doing this now for decades.
或许可以简单回顾一下我们如何走到今天,比如你们的关键贡献和沿途的深刻见解。
And so maybe walk through a little bit about how we got here, kind of like your key contributions and insights along the way.
这是个非常激动人心的时刻。
So it is a very exciting moment.
对吧?
Right?
退一步看,AI正处于一个极其振奋的阶段。
Just zooming back, AI is in a very exciting moment.
我个人从事这项工作已有二十余年。
I personally have been doing this for two decades plus.
我们刚刚走出上一个AI寒冬。
And we have come out of the last AI winter.
我们见证了现代AI的诞生。
We have seen the birth of modern AI.
随后我们见证了深度学习的崛起,它向我们展示了诸如国际象棋对弈等可能性。
Then we have seen deep learning taking off, showing us possibilities like playing chess.
但随后我们开始看到技术的深化,以及行业对一些早期可能性的采纳,比如语言模型。
But then we're starting to see the deepening of the technology and the industry adoption of some of the earlier possibilities, like language models.
现在我认为我们正处于一场几乎字面意义上的寒武纪大爆发中,因为除了文本,你现在还能看到像素、视频、音频,所有这些都伴随着潜在的AI应用和模型。
And now I think we're in the middle of a Cambrian explosion in almost a literal sense, because now, in addition to text, you're seeing pixels, video, audio, all coming with possible AI applications and models.
所以这是一个非常激动人心的时刻。
So it's a very exciting moment.
我非常了解你们两位,许多人也非常了解你们,因为你们在这个领域非常杰出。
I know you both so well, and many people know you both so well because you're so prominent in the field.
但并非每个人都是在AI领域成长起来的。
But not everybody grew up in AI.
所以也许值得快速回顾一下你们的背景,让观众有个基本了解。
So maybe it's kinda worth just going through, like, your quick backgrounds just to kinda level set the audience.
好的。
Yeah.
当然。
Sure.
我最初是在本科快结束时接触到AI的。
So I first got into AI at the end of my undergrad.
我在加州理工学院本科时主修数学和计算机科学。
I did math and computer science for undergrad at Caltech.
那段经历非常棒。
That was awesome.
但在那段时间的末期,出现了一篇当时非常著名的论文,就是来自Google Brain的Quoc Le、Andrew Ng等人发表的关于猫的那篇论文。
But then towards the end of that, there was this paper that came out that was, at the time, a very famous paper: the cat paper from Quoc Le and Andrew Ng and others who were at Google Brain at the time.
那是我第一次接触到深度学习这个概念。
And that was, like, the first time that I came across this concept of deep learning.
对我来说,这感觉就像是一项神奇的技术。
And to me, it just felt like this amazing technology.
那是我第一次接触到这个后来定义了我人生十余年的配方:将极其强大的通用学习算法,与海量算力和数据相结合,当这些要素汇聚时,奇迹就开始发生。
And that was the first time that I came across this recipe that would come to define the next more than decade of my life, which is that you can get these amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data, and magic things start to happen when you combine those ingredients.
我大约在2011到2012年间首次接触这个想法,当时就觉得:天啊。
So I first came across that idea around 2011, twenty twelve ish, and I just thought, oh my god.
这就是我想投身的事业。
This is gonna be what I wanna do.
显然要读研究生才能从事这个领域,后来发现李飞飞在斯坦福,她是当时全球少数几个投身这个方向的先驱者之一。
It was obvious you gotta go to grad school to do this stuff, and then saw that Fei Fei was at Stanford, one of the few people in the world at the time who was on that train.
那段时间对深度学习和计算机视觉领域来说是个黄金时代。
And that was just an amazing time to be in deep learning and computer vision specifically.
因为那正是这项技术从最初的萌芽状态开始真正发展,并渗透到无数应用领域的转折期。
Because that was really the era when this went from these first nascent bits of technology that were just starting to work and really got developed and spread across a ton of different applications.
在那段时期见证了语言建模的诞生。
So then over that time, saw the beginnings of language modeling.
我们见证了判别式计算机视觉的兴起,可以通过多种方式分析图片内容。
We saw the beginnings of discriminative computer vision where you could take pictures and understand what's in them in a lot of different ways.
我们也见证了如今被称为生成式AI(GenAI)的早期形态——生成建模、图像生成和文本生成技术的雏形。
We also saw some of the early bits of what we would now call GenAI, generative modeling, generating images, generating text.
许多核心算法部分实际上是在我攻读博士期间由学术界解决的。
A lot of those core algorithmic pieces actually got figured out by the academic community during my PhD years.
曾有一段时间,我每天早晨醒来就会查看arXiv上的新论文,时刻准备着。
There was a time I would just wake up every morning and check the new papers on arXiv and just be ready.
这就像在圣诞节拆礼物一样。
It's like unwrapping presents on Christmas.
每天,你都知道世界上某个地方会出现惊人的新发现、新应用或新算法。
Every day, you know there's gonna be some amazing new discovery, some amazing new application or algorithm somewhere in the world.
接下来的两年里,全世界其他人也渐渐意识到可以用AI来获取每日的'圣诞礼物'。
In the next two years, everyone else in the world kinda came to the same realization of using AI to get new Christmas presents every day.
但我认为,对我们这些在该领域深耕十年以上的人来说,这种体验已经持续很久了。
But I think for those of us that have been in the field for a decade or more, we've sort of had that experience for a very long time.
我是通过物理学的角度接触AI的,因为我本科背景是物理。
I come to AI through a different angle, which is from physics because my undergraduate background was physics.
但物理学是那种教会你思考大胆问题、探索世界未解之谜的学科。
But physics is the kind of discipline that teaches you to think audacious questions and think about what is the still remaining mystery of the world.
当然,物理学研究的是原子世界、宇宙等等。
Of course, in physics, it's an atomic world, universe and all that.
但不知为何,这种思维训练让我开始思考一个真正激发我想象力的大胆问题——智能。
But somehow, that kind of training thinking got me into the audacious question that really captured my own imagination, which is intelligence.
于是我在加州理工学院攻读AI和计算神经科学的博士学位。
So I did my PhD in AI and computational neuroscience at Caltech.
所以贾斯汀和我其实没有交集,但我们共享加州理工这个母校。
So Justin and I actually didn't overlap, but we share the same alma mater at Caltech.
还是同一位导师。
And the same advisor.
是的,同一位导师,你的本科导师,我的博士导师,Pietro Perona。
Yes, same advisor, your undergraduate advisor, my PhD advisor, Pietro Perona.
我读博的时期,和你读博的时期相近,那时在公众眼中AI仍处于寒冬。
And my PhD time, which is similar to your PhD time, was when AI was still in the winter in the public eye.
但在我看来并非寒冬,而是春眠前的蛰伏期。
But it was not in the winter in my eye because it's that pre spring hibernation.
生机勃勃。
There's so much life.
机器学习和统计建模正在真正崛起。
Machine learning, statistical modeling was really gaining power.
我认为自己是机器学习与AI原生代的一员,而眼下这代人则是深度学习原生代。
I think I was one of the native generation in machine learning and AI, whereas I look at just this generation is the native deep learning generation.
因此机器学习是深度学习的前身。
So machine learning was the precursor of deep learning.
我们当时尝试了各种模型。
And we were experimenting with all kinds of models.
但在我博士生涯末期和担任助理教授初期,有一件事逐渐显现。
But one thing came out at the end of my PhD and the beginning of my assistant professor time.
当时AI领域存在一个被忽视的数学要素,它对推动泛化能力至关重要。
There was an overlooked element of AI that is mathematically important to drive generalization.
但整个领域都没有意识到这点,这个要素就是数据。
But the whole field was not thinking that way, and it was data.
因为我们当时在思考贝叶斯模型或核方法等复杂性问题。
Because we were thinking about the intricacy of Bayesian models or kernel methods and all that.
但我的学生和实验室比大多数人更早意识到的一个根本点是,如果让数据驱动模型,就能释放出前所未有的力量。
But what was fundamental that my students and my lab realized probably earlier than most people is that if you let data drive models, you can unleash the kind of power that we haven't seen before.
这正是我们疯狂押注ImageNet的原因——要知道,当时的数据规模与现在相比微不足道,只有几千个数据点。
And that was really the reason we went on a pretty crazy bet on ImageNet. And, you know, forget about any scale we're seeing now; back then it was thousands of data points.
那时NLP领域已有自己的数据集。
At that point, the NLP community had their own datasets.
我记得UC Irvine的数据集或NLP领域的某些数据集。
I remember UC Irvine data set or some data set in NLP.
规模都很小。
It was small.
计算机视觉领域虽有数据集,但数量级仅在数千或数万级别,而我们当时认为需要达到互联网规模。
The computer vision community had their datasets, but only on the order of thousands or tens of thousands. We were like, we need to drive it to Internet scale.
幸运的是,当时互联网也正迎来成熟期。
And luckily, it was also the coming of age of Internet.
我们乘着这波浪潮,也就是在那时我来到了斯坦福。
So we were riding that wave, and that's when I came to Stanford.
这些就是我们常说的关键发展阶段。
So these epochs are what we often talk about.
ImageNet无疑是开创——或至少是普及和验证计算机视觉可行性的关键转折点。
ImageNet is clearly the epoch that created or at least maybe made popular and viable computer vision.
在生成式AI浪潮中,我们主要讨论两类核心突破。
In the Gen AI wave, we talk about two kind of core unlocks.
一篇是Transformer论文,核心是注意力机制。
One is the transformers paper, which is attention.
我们讨论过稳定扩散模型。
We talked about stable diffusion.
这样理解是否恰当:学术界或谷歌的两项算法突破催生了一切,还是说发展过程更具规划性?
Is that a fair way to think about this, which is there's these two algorithmic unlocks that came from academia or Google, and that's where everything comes from, or has it been more deliberate?
或是还存在其他未被充分讨论的重大突破引领我们走到今天?
Or have there been other kind of big unlocks that kind of brought us here that we don't talk as much about?
我认为关键突破是算力。
I think the big unlock is compute.
虽然AI发展史常被描述为算力演进史,但无论人们如何强调,我认为其重要性仍被低估。
I know the story of AI is often the story of compute, but no matter how much people talk about it, I think people underestimate it.
对吧?
Right?
过去十年计算能力的增长幅度令人震惊。
And the amount of growth that we've seen in computational power over the last decade is astounding.
深度学习在计算机视觉领域的突破性论文当属AlexNet——这篇2012年的论文中,深度神经网络在ImageNet挑战赛表现惊艳,完全超越了李飞飞团队之前研究的所有算法,包括他们在研究生阶段主要钻研的那些算法类型。
The first paper that's really credited with the breakthrough moment in computer vision for deep learning was AlexNet, which was a 2012 paper where a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms that Fei Fei had been working on, the types of algorithms people had been working on in grad school.
嗯。
Yep.
那个AlexNet是拥有6000万个参数的深度神经网络,在两块GTX 580显卡上训练了六天。
That AlexNet was a 60,000,000 parameter deep neural network, and it was trained for six days on two GTX five eighties
嗯哼。
Mhmm.
那是当时顶级的消费级显卡,于2010年推出。
Which was the top consumer card at the time, and it came out in 2010.
我昨晚看了一些数据,想把这些放在一个更宏观的视角里。
So I was looking at some numbers last night just to put these in perspective.
而NVIDIA最新、最强大的产品是GB 200。
And the newest, latest, and greatest from NVIDIA is the GB 200.
你们有谁想猜猜GTX 580和GB 200之间的原始计算能力差距有多大吗?
Do either of you wanna guess how much raw compute factor we have between the GTX five eighty and the GB 200?
说吧。
Shoot.
不猜。
No.
什么?
What?
请讲。
Go for it.
差距有数千倍。
It's in the thousands.
我昨晚计算了一下。
So I ran the numbers last night.
那个在两块GTX 580上训练六天的任务,如果按比例换算,只需要不到5分钟。哇。
That training run of six days on two GTX five eighties, if you scale it, comes out to just under five minutes. Wow.
在单块GB 200上。
On a single GB 200.
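根据对话中引用的数字,可以做一个简单的量级验算(仅为示意,假设完美缩放):
As a back-of-envelope check of the figures quoted here (an illustrative sketch only, assuming perfect scaling):

```python
# Implied per-chip speedup from the figures quoted in the conversation:
# AlexNet trained ~6 days on two GTX 580s vs. "just under five minutes"
# on a single GB 200. Assumes perfect scaling; purely illustrative.
days_2012 = 6
gpus_2012 = 2
gpu_minutes_2012 = days_2012 * 24 * 60 * gpus_2012  # total GPU-minutes in 2012
minutes_gb200 = 5                                   # single GB 200 (quoted)
implied_speedup = gpu_minutes_2012 / minutes_gb200
print(f"{gpu_minutes_2012} GPU-minutes -> ~{implied_speedup:,.0f}x speedup")
```

这与前面提到的"数千倍"的说法一致。
This is consistent with the "in the thousands" figure mentioned earlier.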
贾斯汀提出了一个非常有力的观点。
Justin is making a really good point.
2012年那篇关于ImageNet挑战的AlexNet论文,用的其实是一个非常经典的模型。
The 2012 AlexNet paper on the ImageNet challenge is literally a very classic model.
那就是卷积神经网络模型。
And that is the convolutional neural network model.
它发表于1980年代,我记得那是我研究生时期学习的第一篇论文。
And that was published in the 1980s, the first paper I remember learning as a graduate student.
它大概也有六、七层结构。
And it more or less also has six, seven layers.
实际上AlexNet与ConvNet的唯一区别在于两块GPU和海量数据。
Practically the only difference between AlexNet and that ConvNet is the two GPUs and the deluge of data.
是的。
Yeah.
所以我认为现在大多数人都熟悉所谓的'苦涩教训'。
So I think most people now are familiar with, quote, the bitter lesson.
这个苦涩教训说的是:设计算法时不要耍小聪明。
And what the bitter lesson says is: if you make an algorithm, don't be cute.
只需确保能利用现有算力,因为算力总会跟上来的。
Just make sure you can take advantage of available compute because the available compute will show up.
另一方面,还有另一种同样可信的说法,其实是新数据源解锁了深度学习。
On the other hand, there's another narrative, which seems to me to be just as credible, which is it's actually new data sources that unlock deep learning.
对吧?
Right?
比如ImageNet就是一个很好的例子。
Like ImageNet is a great example.
自注意力机制在Transformer中表现优异,但他们也会说这是利用人工标注数据的一种方式,因为是人为地为句子赋予了结构。
Self attention is great from transformers, but they'll also say this is a way you can exploit human labeling of data because it's the humans that put the structure in the sentences.
如果你看看CLIP,可以说,我们正在利用互联网,实际上是通过人类使用alt标签来标注图像。
And if you look at CLIP, let's say, well, like, we're using the Internet to actually have humans use the alt tag to label images.
对吧?
Right?
所以,这其实是一个关于数据的故事。
And so, like, that's a story of data.
这不是一个关于计算的故事。
That's not a story of compute.
那么答案是两者兼具,还是其中一方更重要呢?
And so is the answer just both or is, like, one more than the other?
或者说我认为是两者兼具。
Or I think it's both.
但你触及到了另一个非常好的观点。
But you're hitting on another really good point.
所以我认为这里实际上有两个截然不同的算法时代。
So I think there's actually two epochs that to me feel quite distinct in the algorithmics here.
嗯。
Mhmm.
所以ImageNet时代实际上是监督学习的时代。
So like the ImageNet era is actually the era of supervised learning.
在监督学习时代,你拥有大量数据,但不知道如何独立使用这些数据。
So in the era of supervised learning, you have a lot of data, but you don't know how to use data on its own.
就像ImageNet和那个时期其他数据集的预期那样,我们会获得大量图像,但需要人工为每张图片打标签。
Like the expectation of ImageNet and other datasets of that time period was that we're gonna get a lot of images, but we need people to label every one.
我们用于训练的所有数据,都经过人工标注员查看并为每张图像添加了描述。
And all of the training data that we're gonna train on, a human labeler has looked at every one and said something about that image.
而重大的算法突破在于,我们掌握了如何训练不需要人工标注的数据。
And the big algorithmic unlocks, we know how to train on things that don't require human labeled data.
作为在场没有AI背景的普通人,在我看来如果用人类数据训练,那数据其实已经被人类标注过了。
As the naive person in the room that doesn't have an AI background, it seems to me if you're training on human data, the humans have labeled it.
只是这种标注并不显式。
It's just not explicit.
我就知道你会这么说,马蒂。
I I knew you were gonna say that, Marty.
我早料到了。
I knew that.
没错。
Yes.
从哲学角度来说,这是个非常重要的问题。
Philosophically, that's a really important question.
不过在语言领域比在图像领域更能体现这一点。
But that actually is more true in language than pixels.
说得有道理。
Fair enough.
是的。
Yeah.
100%。
100%.
是的。
Yeah.
但我确实认为这是个重要区别,因为CLIP确实是人工标注的。
But I do think it's an important distinction because clip really is human labeled.
是的。
Yeah.
是的。
Yeah.
我认为其意图在于:人类已经理解了事物之间的关系,然后你再去学习这些关系。
I think the intent is that humans have, like, figured out the relationships of things, and then you learn them.
所以它确实是人工标注的,只是更隐晦而非显性。
So it is human labeled, just more implicit than explicit.
是的。
Yeah.
它仍然是由人工标注的。
It's still human labeled.
区别在于,在这个监督学习时代,我们的学习任务受到了更多限制。
The distinction is that for this supervised learning era, our learning tasks were much more constrained.
所以你必须想出我们想要发现的概念本体。
So you would have to come up with this ontology of concepts that we wanna discover.
对吧?
Right?
如果你在做ImageNet——李飞飞和她当时的学生们花了很多时间思考,ImageNet挑战中应该包含哪一千个类别。
If you're doing ImageNet, Fei Fei and her students at the time spent a lot of time thinking about which thousand categories should be in the ImageNet challenge.
当时的其他数据集,比如用于目标检测的COCO数据集,他们非常认真地思考了应该放入哪80个类别。
Other datasets of that time, like the COCO dataset for object detection, they thought really hard about which 80 categories we put in there.
那么让我们转向生成式AI。
So let's walk to GenAI.
我读博士的时候,还是在你来之前。
So when I was doing my PhD, before you came.
我先是跟吴恩达学习了机器学习,然后又跟Daphne Koller学习了贝叶斯方法——那对我来说非常复杂。
So I took machine learning from Andrew Ng, and then I took Bayesian something, something very complicated, from Daphne Koller, and it was very complicated for me.
其中大部分内容只是预测建模。
A lot of that was just predictive modeling.
然后我记得你解锁了整个视觉领域,但生成式技术可以说是在过去四年才出现的,这对我来说非常不同。
And then I remember the whole kind of vision stuff that you unlock, but then the generative stuff has shown up, I would say, in the last four years, which is to me very different.
你不是在识别物体。
You're not identifying objects.
你并不是在预测什么。
You're not predicting something.
你是在创造某些东西。
You're generating something.
所以或许可以梳理一下,比如那些关键突破是如何带我们走到今天的,以及它为何与众不同,还有我们是否该以不同方式思考它。
And so maybe kind of walk through, like, the key unlocks that got us there and then why it's different and if we should think about it differently.
这是连续统一体的一部分吗?
And is it part of a continuum?
不是吗?
Is it not?
这太有趣了。
It is so interesting.
甚至在我读研时期,生成模型就已经存在了。
Even during my graduate time, generative model was there.
我们当时就想做生成。
We wanted to do generation.
没人记得了。
Nobody remembers.
即便是字母和数字,我们也曾尝试做些生成工作。
Even with letters and numbers, we were trying to do some.
杰夫·辛顿就发表过生成相关的论文。
Geoff Hinton had papers on generation.
我们当时就在思考如何进行生成。
We were thinking about how to generate.
事实上,如果从概率分布的角度思考,你可以在数学上生成。
And in fact, if you think from a probability distribution point of view, you can mathematically generate.
只不过我们生成的东西永远无法打动任何人。
It's just nothing we generate would ever impress anybody.
对吧?
Right?
所以这种数学上的生成概念在理论上是存在的。
So this concept of generation mathematically, theoretically is there.
但没有任何成果。
But nothing worked.
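"从概率分布的角度进行数学上的生成",最简单的形式可以用一个假想的小例子来说明:先对数据拟合一个分布,再从中采样,得到的样本就是"生成"的数据:
The simplest form of "mathematically generating from a probability distribution" can be shown with a hypothetical toy example: fit a distribution to data, then sample from it; the samples are the "generated" data:

```python
# Hypothetical toy example of generative modeling in its simplest mathematical
# form: fit a 1-D Gaussian to observed data, then draw new samples from it.
import math
import random

data = [2.0, 2.5, 3.0, 3.5, 4.0]                    # observed "training data"
mu = sum(data) / len(data)                          # fitted mean
var = sum((x - mu) ** 2 for x in data) / len(data)  # fitted variance
rng = random.Random(0)
samples = [rng.gauss(mu, math.sqrt(var)) for _ in range(3)]  # "generated" data
print(mu, var, samples)
```

正如对话所说,这种生成在数学上一直成立,只是早期的结果不足以打动任何人。
As the conversation notes, this kind of generation was always mathematically sound; the early results just didn't impress anybody.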
贾斯汀的博士论文,他整个博士生涯,就是一个故事,几乎可以说是这个领域发展轨迹的缩影。
Justin's PhD, his entire PhD, is a story, almost a mini story of the trajectory of the field.
他的第一个项目是从数据开始的。
He started his first project in data.
是我逼他做的。
I forced him to.
他不喜欢。
He didn't like it.
所以
So
回过头看,我学到了很多真正有用的东西。
In retrospect, I learned a lot of really useful things.
我...我很高兴你现在这么说。
I I'm glad you say that now.
实际上,我的第一篇论文,无论是博士期间还是整个学术生涯中的首次发表,就是关于使用场景图进行图像检索的研究。
So actually, my first paper, both of my PhD and, like, ever, my first academic publication ever, was the image retrieval with scene graphs.
然后我们开始处理像素、生成文字,贾斯汀和安德烈在这方面做了大量工作。
And then we went into taking pixels, generating words, and Justin and Andre really worked on that.
但这仍然是一种信息损失非常严重的从像素世界提取和生成数据的方式。
But that was still a very, very lossy way of generating and getting information out of the pixel world.
期间贾斯汀独立完成了一项非常著名的工作。
And then in the middle, Justin went off and did a very famous piece of work.
那是首次有人实现了实时处理。
And it was the first time that someone made it real time.
对吧?
Right?
是的。
Yeah.
没错。
Yeah.
事情是这样的:2015年出现了一篇由Leon Gatys主导的论文《A Neural Algorithm of Artistic Style(艺术风格的神经算法)》。
So the story there is there was this paper that came out in 2015, A Neural Algorithm of Artistic Style, led by Leon Gatys.
论文发表时,他们展示了将现实照片转换成梵高风格的作品。
And the paper came out, and they showed these real world photographs that they had converted into a Van Gogh style.
在2024年我们可能已经看惯了这类效果,但这可是2015年的事。
And we are kind of used to seeing things like this in 2024, but this was in 2015.
有天这篇论文突然出现在arXiv上,彻底震撼了我。
So this paper just popped up on arXiv one day, and it blew my mind.
2015年,这个想法像脑虫一样钻进了我的脑子,对我产生了某种影响。
I just got this brainworm in my brain in 2015, and it did something to me.
我当时就想,天哪。
And I thought, oh my god.
我必须弄懂这个算法。
I need to understand this algorithm.
我得亲手试试。
I need to play with it.
我要把自己的照片变成梵高风格。
I need to make my own images into Van Gogh.
于是我就去读了那篇论文,花了个长周末重新实现了算法并让它跑起来了。
So then I, like, read the paper, and then over a long weekend, I reimplemented the thing and got it to work.
其实算法本身非常简单。
It was actually very simple algorithm.
所以我的实现只用了大概300行Lua代码,因为那时候还是Lua的天下。
So, like, my implementation was, like, 300 lines of Lua, because at the time it was pre-PyTorch.
用的是Lua。
It was Lua.
那时候还没有PyTorch。
This was pre PyTorch.
我们当时用的是Lua Torch。
So we were using Lua Torch.
虽然算法很简单,但运行速度很慢。
But it was, like, very simple algorithm, but it was slow.
对吧?
Right?
所以它是基于优化的方法。
So it was an optimization based thing.
每生成一张图像,你都需要运行这个优化循环,为每张生成的图像运行这个梯度下降循环。
Every image you wanna generate, you need to run this optimization loop, run this gradient descent loop for every image that you generate.
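这里描述的"基于优化的生成"可以用一个极简示意来说明(这不是Gatys等人的真实算法,损失函数只是一个假想的替代,用来展示"每生成一张图都要跑一遍梯度下降循环"这一结构):
A minimal sketch of the optimization-based structure described here (not the actual Gatys et al. algorithm; the loss is a hypothetical stand-in, just to show that a gradient-descent loop runs for every generated image):

```python
# Toy optimization-based "generation": directly nudge the pixels to minimize a
# loss. The loss here is a hypothetical stand-in (squared distance to a target
# vector); the real style-transfer method instead matches deep-network content
# and style statistics.
def generate_by_optimization(target, steps=200, lr=0.1):
    image = [0.0] * len(target)           # start from a blank "image"
    for _ in range(steps):                # this loop runs for EVERY image made
        # gradient of 0.5 * sum((image - target)^2) with respect to the pixels
        grad = [p - t for p, t in zip(image, target)]
        image = [p - lr * g for p, g in zip(image, grad)]
    return image

result = generate_by_optimization([0.2, 0.8, 0.5])
```

后来的加速方法(包括Justin的工作)训练一个前馈网络,把这个逐图像的循环摊销成一次前向传播。
Later speedups, including Justin's, train a feed-forward network that amortizes this per-image loop into a single forward pass.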
这些图像很美,但我只想要更快。
The images were beautiful, but I just wanted to be faster.
而贾斯汀就这么做到了。
And Justin just did it.
实际上,我认为这是你第一次尝到学术工作对产业产生影响的滋味。
And it was actually, I think, your first taste of an academic work having an industry impact.
当时很多人都见过这种艺术风格迁移的东西,我和其他几个人同时想出了不同的加速方法。
A bunch of people had seen this artistic style transfer stuff at the time, and me and a couple others at the same time came up with different ways to speed this up.
但我的方法获得了大量关注。
But mine was the one that got a lot of traction.
在世界理解生成式AI之前,贾斯汀博士阶段的最后一项工作实际上是输入语言并输出完整图像。
Before the world understood GenAI, Justin's last piece of work in his PhD was actually inputting language and getting the whole picture out.
这是最早的生成式AI工作之一。
It's one of the first GenAI work.
它使用了GAN,那东西非常难用。
It was using GANs, which were so hard to use.
问题在于我们还没准备好使用自然语言片段。
The problem is that we are not ready to use a natural piece of language.
贾斯汀,你听说他研究过场景图。
So Justin, you heard he worked on scene graph.
所以我们必须引入场景图(scene graph)的语言结构。
So we had to put in a scene graph language structure.
用图的方式表示羊、草和天空。
So the sheep, the grass, the sky in a graph way.
那实际上就是我们的一张照片。后来他和另一位非常优秀的硕士生Agrim合作,成功让那个GAN运行了起来。
It literally was one of our photos. And then he and another very good master's student, Agrim, got that GAN to work.
所以你可以看到,从数据、到匹配、到风格迁移、再到生成图像——回到你问的这是否是一次突变。
So you can see: from data, to matching, to style transfer, to generative images. You asked if this is an abrupt change.
对我们这样的人来说,这已经是一个连续的过程。
For people like us, it's already happening in a continuum.
但对世界而言,结果显得更为突然。
But for the world, the results are more abrupt.
我读过你的书。
So I read your book.
对于正在收听的听众来说,这是一本非凡的书。
And for those that are listening, it's a phenomenal book.
我真的很推荐大家阅读。
I like I I really recommend you read it.
长期以来,似乎很多你——我是在跟你说话,菲菲。
And it seems for a long time, like a lot of you and I'm talking to you, Fei Fei.
你的很多研究和方向似乎都集中在空间、像素和智能这类领域。
Like, a lot of your research has been and your direction has been towards kind of spatial stuff and pixel stuff and intelligence.
现在你们正在开展世界实验室项目,专注于空间智能领域。
And now you're doing World Labs, and it's around spatial intelligence.
那么能否谈谈,这对你而言是否是一段漫长旅程的组成部分?
And so maybe talk through, has this been part of a long journey for you?
比如,为什么选择现在做这件事?
Like, why did you decide to do it now?
是技术上的突破吗?
Is it a technical unlock?
还是个人认知的突破?
Is it a personal unlock?
让我们从AI研究的混战中抽身,聚焦到世界实验室上来。
Move us from that melee of AI research to World Labs.
对我而言,这既是个人追求也是智力探索。
For me, it is both personal and intellectual.
对吧?
Right?
我的整个学术历程都贯穿着追寻北极星的热情,同时坚信这些北极星对我们领域的发展至关重要。
My entire intellectual journey is really this passion to seek north stars, but also believing that those north stars are critically important for the advancement of our field.
记得最初从研究生院毕业后,我以为我的北极星是讲述图像的故事。
So at the beginning, I remembered after graduate school, I thought my north star was telling stories of images.
因为对我而言,这是视觉智能中极为重要的一环。
Because for me, that's such an important piece of visual intelligence.
这就是你们所谓AI或AGI的组成部分。
That's part of what you call AI or AGI.
但当贾斯汀和安德烈做到时,我简直惊呆了——那可是我毕生的梦想啊。
But when Justin and Andre did that, I was like, oh my god, that was my life's dream.
接下来我该怎么办?
What do I do next?
所以事情进展得快多了。
So it came a lot faster.
我原以为要花一百年才能实现。
I thought it would take one hundred years to do that.
但视觉智能是我的热情所在,因为我坚信对于任何智能体——无论是人类、机器人还是其他形态——理解如何观察世界、进行推理、与之互动(无论是导航、操控还是创造),甚至能在此基础上建立文明。
But visual intelligence is my passion because I do believe for every intelligent being, like people or robots or some other form, knowing how to see the world, reason about it, interact in it, whether you're navigating or manipulating or making things, you can even build civilization upon it.
视觉空间智能是如此基础。
Visual spatial intelligence is so fundamental.
它和语言一样基础,在某些方面可能更古老、更根本。
It's as fundamental as language, possibly more ancient and more fundamental in certain ways.
所以对我们来说,解锁空间智能作为北极星目标非常自然。
So it's very natural for me that our North Star is to unlock spatial intelligence.
对我来说时机已经成熟。
The moment to me is right.
我们已经具备了这些要素。
We've got these ingredients.
我们拥有算力。
We've got compute.
我们对数据的理解比当年对图像的理解深入得多。
We've got much deeper understanding of data, deeper than image that day.
与过去相比,我们现在要先进得多。
Compared to those days, we're so much more sophisticated.
我们在算法方面取得了一些进展,包括实验室联合创始人本·米尔登霍尔和克里斯托夫·拉斯纳的贡献。
And we've got some advancement of algorithms, including co founders in our lab like Ben Mildenhall and Christoph Lassner.
他们当时正处于NeRF研究的最前沿,而我们正处在合适的时机去下注、专注并真正解锁这一领域。
They were at the cutting edge of NeRF, and we are at the right moment to really make a bet, and to focus, and just unlock that.
所以
So
我想向听众们澄清一下。
I just wanna clarify for folks that are listening to this.
你正在创办这家公司——世界实验室。
You're starting this company, World Labs.
空间智能大致是你对所要解决问题的总体描述。
Spatial intelligence is kind of how you're generally describing the problem you're solving.
能否简洁地说明一下这具体指什么?
Can you maybe try to crisply describe what that means?
好的。
Yeah.
空间智能是指机器在三维空间和时间中感知、推理和行动的能力。
So spatial intelligence is about machines' ability to perceive, reason, and act in three d space and time.
理解物体和事件在三维时空中的位置,世界中的交互如何影响这些四维时空位置,并实现感知、推理、生成、交互,真正将机器从主机房或数据中心解放出来,使其融入现实世界,理解这个充满细节的三维四维世界。
To understand how objects and events are positioned in three d space and time, how interactions in the world can affect those four d positions over space time, and both sort of perceive, reason about, generate, interact with, really take the machine out of the mainframe or out of the data center and putting it out into the world, and understanding the three d four d world with all of its richness.
那么明确地说,我们讨论的是物理世界还是抽象的世界概念?
So to be very clear, are we talking about the physical world or are we just talking about an abstract notion of world?
我认为它可以两者兼顾。
I think it can be both.
我认为它可以两者兼顾,这符合我们的长期愿景。
I think it can be both, and that encompasses our vision long term.
即使你正在生成虚拟世界或三维内容,它也有很多优势。
Even if you're generating worlds even if you're generating content positioned in three d, it has a lot of benefits.
或者如果你正在识别现实世界,能够将三维理解融入现实世界也是其中的一部分。
Or if you're recognizing the real world, being able to put three d understanding into the real world as well is part of it.
对于所有听众来说,另外两位联合创始人Ben Mildenhall和Christoph Lassner,也是该领域同等水平的绝对传奇人物。
Just for everybody listening, the two other cofounders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field at the same level.
这四人决定现在出来创办这家公司。
These four decided to come out and do this company now.
所以我试图探究为什么现在是合适的时机。
And so I'm trying to dig to, like, why now is the right time.
是的。
Yeah.
对我来说,这又是一段更长演进历程的一部分。拿到博士学位之后,当我真正想要成长为独立研究者、规划自己后续的职业道路时,我一直在思考:AI和计算机视觉领域的重大问题是什么?
I mean, this is again part of a longer evolution for me. But post-PhD, when I was really wanting to develop into my own independent researcher for my later career, I was just thinking: what are the big problems in AI and computer vision?
我当时得出的结论是:过去十年主要是关于理解已经存在的数据。
And the conclusion that I came to about that time was that the previous decade had mostly been about understanding data that already exists.
而下一个十年将是关于理解新数据。
But the next decade was going to be about understanding new data.
如果我们思考这个问题,已经存在的数据可能是网络上已有的所有图像和视频。
And if we think about that, the data that already exists was all of the images and videos that maybe existed on the web already.
接下来的十年将是关于理解新数据的时代。
And the next decade was gonna be about understanding new data.
对吧?
Right?
人们都拥有智能手机。
People are having smartphones.
智能手机配备摄像头。
Those smartphones are carrying cameras.
这些摄像头搭载新型传感器。
Those cameras have new sensors.
这些摄像头被定位在三维世界中。
Those cameras are positioned in the three d world.
不再只是从互联网获取一堆像素却对其一无所知,还要判断是猫是狗。
It's not just you're gonna get a bag of pixels from the Internet and know nothing about it and try to say if it's a cat or a dog.
我们希望将这些图像视为物理世界的通用传感器。
We wanna treat these images as universal sensors to the physical world.
我们如何利用这些来理解世界的三维和四维结构,无论是在物理空间还是生成空间?
And how can we use that to understand the three d and four d structure of the world, either in physical spaces or generative spaces?
因此我在博士毕业后做出了重大转向,进入三维计算机视觉领域,与当时FAIR的一些同事一起预测物体的三维形状。
So I made a pretty big pivot post-PhD into three d computer vision, predicting three d shapes of objects with some of my colleagues at FAIR at the time.
后来,我完全迷上了通过二维学习三维结构这个理念。
Then later, I got really enamored by this idea of learning three d structure through two d.
对吧?
Right?
我们经常讨论数据,而三维数据本身就很难获取。
Because we talk about data a lot: three d data is hard to get on its own.
但由于这里存在非常强的数学关联性,我们的二维图像其实是三维世界的投影。
But because there's a very strong mathematical connection here, our two d images are projections of a three d world.
而且这里有很多我们可以利用的数学结构。
And there's a lot of mathematical structure here we can take advantage of.
所以即使你拥有大量二维数据,仍有许多人做了惊人工作来研究如何从海量二维观测中还原出世界的三维结构。
So even if you have a lot of two d data, there's a lot of people who've done amazing work to figure out how can you back out the three d structure of the world from large quantities of two d observations.
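这里所说的数学结构,核心是针孔投影:三维点投影到二维时会丢失深度,这正是单张图像存在歧义、需要大量二维观测才能反推三维的原因。一个极简示意(焦距和点均为假设值):
The mathematical structure in question is, at its core, the pinhole projection: projecting 3D to 2D throws away depth, which is why a single view is ambiguous and backing out 3D takes many 2D observations. A minimal sketch (focal length and points are assumed values):

```python
# Pinhole projection: a 3D point (X, Y, Z) maps to pixel (f*X/Z, f*Y/Z).
# Depth is lost in the process, so a single 2D view is inherently ambiguous.
def project(point3d, f=1.0):
    x, y, z = point3d
    return (f * x / z, f * y / z)

# Two different 3D points along the same viewing ray land on the SAME pixel:
p1 = project((1.0, 2.0, 4.0))
p2 = project((2.0, 4.0, 8.0))
print(p1, p2)  # identical projections despite different depths
```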
然后在2020年,你问到了重大突破时刻。
And then in 2020, you asked about very breakthrough moments.
当时,我们的联合创始人Ben Mildenhall凭借论文《NeRF: Neural Radiance Fields(神经辐射场)》带来了真正的重大突破。
There was a really big breakthrough moment from our cofounder Ben Mildenhall at the time, with his paper NeRF: Neural Radiance Fields.
那是一种非常简洁明了地从二维观测还原三维结构的方法。
And that was a very simple, very clear way of backing out three d structure from two d observations.
这直接点燃了整个三维计算机视觉领域的研究热潮。
That just lit a fire under this whole space of three d computer vision.
我认为还有一个外界可能不太理解的方面。
I think there's another aspect here that maybe people outside the field don't quite understand.
那同时也是大语言模型开始崛起的时期。
That was also a time when large language models were starting to take off.
实际上语言建模的很多技术都是在学术界发展起来的。
So a lot of the stuff with language modeling actually had gotten developed in academia.
甚至在我读博期间,2014年就和Andre Karpathy做过一些早期的语言建模工作。
Even during my PhD, I did some early work with Andre Karpathy on language modeling in 2014.
LSTM(长短期记忆网络)。
LSTM.
仍然记得。
Still remember.
LSTM、RNN(循环神经网络)、GRU(门控循环单元)。
LSTM, RNNs, GRUs.
是的。
Yes.
比如,这是在Transformer之前的事。
Like, this was pre transformer.
但后来到了某个时间点,大约在GPT-2时期,学术界基本没法再做这类模型了,因为它们需要的资源实在太多。
But then at some point, around the GPT-2 time, you couldn't really do those kinds of models anymore in academia because they took way more resourcing.
但有个特别有意思的现象。
But there was one really interesting thing.
Ben提出的NeRF方法,用单块GPU几小时就能训练好这些模型。
The NeRF approach that Ben came up with could train these in a couple hours on a single GPU.
所以我觉得当时出现了一个现象:很多学术研究者开始集中攻克这类问题,因为存在核心算法需要突破,而且不需要大量算力就能取得前沿成果——单块GPU就能实现顶尖效果。
So I think at that time, there was a dynamic that happened, which is that a lot of academic researchers ended up focusing on these problems, because there was core algorithmic stuff to figure out, and because you could actually do a lot without a ton of compute: you could get state-of-the-art results on a single GPU.
由于这种趋势,学术界大量研究者开始思考:如何通过核心算法推动这个领域发展。
Because of those dynamics, a lot of researchers in academia were moving to think about what are the core algorithmic ways that we can advance this area as well.
后来我和李飞飞聊得更多,意识到我们其实...
Then I ended up chatting with Fei Fei more, and I realized that we were
她确实很有说服力。
actually She's very convincing.
她非常有说服力。
She's very convincing.
嗯,确实如此。
Well, there's that.
但我们之前讨论过,要尝试从导师那里找到自己独立的研究方向。
But we talked about trying to figure out your own independent research trajectory from your adviser.
结果发现我们最终还是
Well, it turns out we had ended
哦,不。
up Oh, no.
某种程度上又再次趋同了。
Kinda converging again.
在某些方面达成了共识。
Converging on on similar things.
好吧。
Okay.
从我这边来说,我想和我认为最聪明的人——贾斯汀谈谈。
Well, from my end, I wanna talk to the smartest person I know, Justin.
这一点毫无疑问。
There's no question about it.
我想谈谈一个关于像素的非常有趣的技术故事,大多数从事语言工作的人可能没意识到:在生成式AI之前的计算机视觉领域,我们这些研究像素的人其实在三维重建这个研究方向上有很长的历史。
I do want to talk about a very interesting technical story of pixels that most people working in language don't realize, is that pre Gen AI era in the field of computer vision, those of us who work on pixels, we actually have a long history in an area of research called reconstruction, three d reconstruction.
这可以追溯到七十年代。
It dates back to the seventies.
人类能拍照是因为有两只眼睛,对吧?
You can take photos because humans have two eyes, right?
一般来说,这个过程从立体照片开始,然后你尝试通过三角测量几何形状来构建三维模型。
So in general, it starts with stereo photos, and then you try to triangulate the geometry and make a three d shape out of it.
这确实是个非常非常困难的问题。
It is a really, really hard problem.
直到今天,这个问题仍未从根本上解决,因为存在对应关系等各种因素。
To this day, it's not fundamentally solved because there is correspondence and all that.
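In the simplest rectified two-camera case, the triangulation Fei-Fei describes reduces to depth from disparity. A minimal sketch, with illustrative numbers for the focal length, baseline, and matched disparity (the hard, still-unsolved part is finding that correspondence in the first place):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    # Classic two-view stereo: a point seen at x_left and x_right in two
    # horizontally offset cameras has disparity d = x_left - x_right,
    # and its depth is Z = f * B / d (f in pixels, baseline B in meters).
    if disparity_px <= 0:
        raise ValueError("matched point must have positive disparity")
    return focal_px * baseline_m / disparity_px

# 700 px focal length, 12 cm baseline, 20 px disparity -> 4.2 m away
z = depth_from_disparity(700.0, 0.12, 20.0)
```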
所以这个领域——作为思考三维问题的一种传统方式——一直在发展,并取得了非常好的进展。
So this whole field, which is an older way of thinking about three d, has been going around and it has been making really good progress.
但当NeRF出现在生成式方法和扩散模型的背景下时,重建和生成突然开始真正融合。
But when NeRF happened in the context of generative methods, in the context of diffusion models, suddenly reconstruction and generation started to really merge.
现在,在计算机视觉领域短短时间内,已经很难严格区分重建与生成的概念了。
Now, within really a short period of time in the field of computer vision, it's hard to draw a line between reconstruction and generation anymore.
我们突然迎来了这样一个时刻:无论是看到还是想象某个事物,两者都能趋同地生成它。
We suddenly have a moment where if we see something or if we imagine something, both can converge towards generating it.
在我看来,这对计算机视觉来说是个非常重要的时刻,但大多数人错过了,因为我们讨论它的热度远不如大语言模型。
And that's just to me a really important moment for computer vision, but most people miss it because we're not talking about it as much as LLMs.
没错。
Right.
在像素空间中,有对真实场景的重建,而当你没看到场景时,就会使用生成技术。
So in pixel space, there's reconstruction where you reconstruct, like, a scene that's real, and then if you don't see the scene, then you use generative techniques.
对吧?
Right?
这些东西其实非常相似。
So these things are kind of very similar.
在整个对话中,你一直在谈论语言和像素。
Throughout this entire conversation, you're talking about languages and you're talking about pixels.
所以也许现在是个好时机,来讨论空间智能与你正在研究的内容如何与当前非常流行的语言方法形成对比。
So maybe it's a good time to talk about how, like, spatial intelligence and what you're working on contrasts with language approaches, which, of course, are very popular now.
它们是互补的吗?
Is it complementary?
它们是正交的吗?
Is it orthogonal?
我认为它们是互补的。
I think they're complementary.
我并不是想在这里过于引导性。
I don't mean to be too leading here.
也许只是做个对比。
Maybe just contrast them.
就像大家常说的,我知道OpenAI,我知道GPT,也知道多模态模型。
Like, everybody says, I know OpenAI, and I know GPT, and I know multimodal models.
而你谈到的很多内容就像是:'我们有像素,他们有语言,这不正是我们想用空间推理实现的效果吗?'
And a lot of what you're talking about is, like, we've got pixels and they've got languages, and doesn't this kind of do what we want to do with spatial reasoning?
是的。
Yeah.
所以我认为要做
So I think to do
你需要稍微揭开这些系统内部运作的黑箱。
that, you need to open up the black box a little bit of how these systems work under the hood.
因此,无论是语言模型还是如今出现的多模态语言模型,它们底层的内在表征都是一维的。
So with language models and the multimodal language models that we're seeing nowadays, their underlying representation under the hood is a one dimensional representation.
我们会讨论上下文长度。
We talk about context lengths.
我们会讨论Transformer架构。
We talk about transformers.
我们会讨论序列、注意力机制。
We talk about sequences, attention.
本质上,它们对世界的表征是一维的。
Fundamentally, their representation of the world is one dimensional.
所以这些系统本质上是在处理一维的token序列。
So these things fundamentally operate on a one dimensional sequence of tokens.
这种表征方式对语言来说非常自然,因为书面文本本身就是由离散字母组成的一维序列。
So this is a very natural representation when you're talking about language because written text is a one dimensional sequence of discrete letters.
正是这种底层表征方式催生了大型语言模型。
So that kind of underlying representation is the thing that led to LLMs.
而现在我们看到的多模态大模型,本质上也是将其他模态强行适配到这个一维token序列的底层表征框架中。
And now the multimodal LLMs that we're seeing now, you kind of end up shoehorning the other modalities into this underlying representation of a one d sequence of tokens.
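That "shoehorning" can be made concrete. A minimal numpy sketch of ViT-style patchification, the common way images enter a transformer; the patch size and shapes here are illustrative, not any particular model's:

```python
import numpy as np

def patchify(image, patch=4):
    """Flatten a 2D image into a 1D sequence of patch tokens.

    image: (H, W, C) array; H and W must be divisible by `patch`.
    Returns (num_patches, patch*patch*C): the 2D grid of patches is
    laid out as a 1D sequence, which is what the transformer consumes.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = (image.reshape(h // patch, patch, w // patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)       # group pixels by patch
                   .reshape(-1, patch * patch * c))
    return tokens

img = np.zeros((8, 8, 3))
seq = patchify(img, patch=4)   # 4 tokens, each of dimension 48
```

The spatial layout survives only implicitly, via learned position embeddings over that 1D sequence, which is the contrast being drawn with putting 3D structure front and center.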
当我们转向空间智能时,方向就反过来了——我们认为世界的三维特性应该成为表征的核心。
Now when we move to spatial intelligence, it's kind of going the other way, where we're saying that the three-dimensional nature of the world should be front and center in the representation.
从算法角度看,这为我们开辟了新途径:可以用不同方式处理数据,获得不同类型的输出,并解决略有差异的问题。
So at an algorithmic perspective, that opens up the door for us to process data in different ways, to get different kinds of outputs out of it, and to tackle slightly different problems.
所以即使在粗略的层面,你从外部看一眼也会说,哦,多模态大语言模型也能处理图像。
So even at a coarse level, you kind of look at it from outside and say, oh, multimodal LLMs can look at images too.
嗯,它们确实可以,但我认为它们的核心方法中缺乏那种根本性的三维表征。
Well, they can, but I think they don't have that fundamental three d representation at the heart of their approaches.
我完全同意贾斯汀的观点。
I totally agree with Justin.
我认为讨论一维与根本性的三维表征是最核心的差异之一。
I think talking about the one d versus fundamentally three d representation is one of the most core differentiation.
另一点稍微偏哲学,但至少对我非常重要:语言本质上是纯粹生成的信号。
The other thing is a slightly philosophical, but it's really important for me at least is language is fundamentally a purely generated signal.
世界上本没有语言。
There's no language out there.
你不会在大自然中看到天空为你写好的文字。
You don't go out in the nature and there's words written in the sky for you.
无论输入什么数据,你基本上都能以足够的泛化能力将其复现出来。
Whatever data you feed in, you pretty much can just somehow regurgitate with enough generalizability the same data out.
这就是语言到语言的本质。
And that's language to language.
但三维世界不是这样。
But three d world is not.
存在着遵循物理定律的三维世界,它因材料等多种因素形成自身结构。
There is a three d world out there that follows laws of physics, that has its own structures due to materials and many other things.
要根本性地提取这些信息并加以表征和生成,本质上是个截然不同的问题。
And to fundamentally back that information out and be able to represent it and be able to generate it is just fundamentally quite a different problem.
我们将借鉴语言和大型语言模型中的相似或有用理念,但对我来说,这从根本上、哲学上是不同的问题。
We will be borrowing similar ideas or useful ideas from language and LLMs, but this is fundamentally, philosophically, to me, a different problem.
所以语言是一维的,而且可能是对物理世界的糟糕表征,因为它是人类生成的,很可能存在信息损失。
So language, one d, and probably a bad representation of the physical world because it's been generated by humans and it's probably lossy.
生成式AI模型还有另一种完全不同的模态——像素,即二维图像和二维视频。
There's a whole another modality of generative AI models, which are pixels, and these are two d image and two d video.
就像有人会说,如果你看视频,可以看到三维内容,因为你可以平移摄像机之类的。
And, like, one could say that if you look at a video, you can see three d stuff because, like, you can pan a camera or whatever it is.
那么,空间智能与二维视频会有什么不同呢?
And so, like, how would, like, spatial intelligence be different than, say, two d video?
当我思考这个问题时,有必要区分两件事。
When I think about this, it's useful to disentangle two things.
一是底层表征,二是面向用户的功能设计。
One is the underlying representation, and then two is kind of the user facing affordances that you have.
这里有时会让人困惑,因为从根本上说我们看到的是二维。
And here's where you can get sometimes confused because fundamentally, we see two d.
对吧?
Right?
我们的视网膜是身体中的二维结构,而且我们有两个。
Our retinas are two d structures in our bodies and we've got two of them.
所以从根本上说,我们的视觉系统感知的是二维图像。
So fundamentally, our visual system perceives two d images.
但问题在于,根据你使用的表征方式,可能会产生或自然或不自然的不同功能设计。
But the problem is that depending on what representation you use, there could be different affordances that are more natural or less natural.
因此,即便到了最后,你看到的可能是一幅二维图像或一段二维视频,你的大脑仍会将其感知为三维世界的投影。
So even if you at the end of the day, you might be seeing a two d image or a two d video, your brain is perceiving that as a projection of a three d world.
所以你会想要做一些事情,比如移动物体、调整摄像头角度。
So there's things you might want to do, move objects around, move the camera around.
理论上,你可以用纯粹的二维表示和模型来完成这些,但这并不完全契合你要求模型解决的问题。
In principle, you might be able to do these with a purely two d representation and model, but it's just not a fit to the problems that you're asking the model to do.
对吧?
Right?
对动态三维世界的二维投影进行建模,这个功能或许是可以实现的。
Modeling the two d projections of a dynamic three d world is a function that probably can be modeled.
但将三维表示置于模型的核心,能让模型所处理的表示类型与你希望它执行的任务类型之间更加匹配。
But by putting a three d representation into the heart of a model, there's just gonna be a better fit between the kind of representation that the model is working on and the kind of tasks that you want that model to do.
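The projection being discussed, a 3D world collapsing onto the 2D images our retinas receive, is the classic pinhole camera model. A minimal numpy sketch (the points and focal length are made-up example values):

```python
import numpy as np

def project(points_3d, focal=1.0):
    # Pinhole projection of 3D points (camera frame, z > 0) onto the
    # 2D image plane: (x, y, z) -> (f*x/z, f*y/z). Depth z is discarded,
    # which is exactly the information a purely 2D model never represents.
    pts = np.asarray(points_3d, dtype=float)
    return focal * pts[:, :2] / pts[:, 2:3]

# two points on the same viewing ray land on the same pixel
uv = project([[1.0, 2.0, 4.0], [2.0, 4.0, 8.0]])
```

That many-to-one collapse is why tasks like moving an object or the camera are a poor fit for a model that only ever sees the 2D side of the mapping.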
因此我们相信,通过在底层融入更多三维表示,将为用户提供更好的功能支持。
So our bet is that by threading a little bit more three d representation under the hood, that'll enable better affordances for users.
这也回到了我们的北极星目标。
And this also goes back to the North Star.
对我来说,为什么是空间智能?
For me, why is it spatial intelligence?
为什么不是平面像素智能?
Why is it not flat pixel intelligence?
因为我认为智能的发展轨迹必须走向贾斯汀所说的‘功能支持’。
It's because I think the arc of intelligence has to go to what Justin calls affordances.
纵观进化史,智能的发展最终让动物和人类——尤其是作为智慧生物的人类——能够在这个三维世界中移动、互动、创造文明、创造生活、制作三明治等等。将这种三维原生性转化为技术,对于未来可能涌现的大量应用至关重要,即便其中某些应用的服务形式看起来是二维的。
And the arc of intelligence, if you look at evolution, right, eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this three d world. And translating that into a piece of technology, that native three d ness, is fundamentally important for the flood of possible applications, even if for some of them the serving looks two d.
但对我来说它本质上是三维的。
But it's innately three d to me.
我认为这实际上是一个非常微妙且极其关键的重点。
I think this is actually a very subtle and incredibly critical point.
因此我认为值得深入探讨,而讨论用例是个好方法。
And so I think it's worth digging into, and a good way to do this is talking about use cases.
为了统一认知,请听好。
And so just to level set this, listen.
我们正在讨论开发一项技术——姑且称之为模型——它能实现空间智能。
We're talking about generating a technology, let's call it a model, that can do spatial intelligence.
那么抽象地说,具体实现可能是什么样子呢?
So maybe in the abstract, what might that look like kind of a little bit more concretely?
我们预想这些空间智能模型未来能实现多种功能。
There's a couple different kinds of things we imagine these spatially intelligent models able to do over time.
其中最让我兴奋的是世界生成。
And one that I'm really excited about is world generation.
我们都熟悉文生图工具,也开始见到文生视频工具——输入文字就能得到精美图片或两秒精彩片段。
We're all used to something like a text to image generator, or starting to see text to video generators, where you put in text and out pops an amazing image or an amazing two second clip.
但可以想象将其升级为输出三维世界。
But I think you could imagine leveling this up and getting three d worlds out.
未来空间智能可能帮助我们将这些体验升级至三维——输出完整的虚拟仿真、充满活力且可交互的三维世界。
So one thing that we could imagine spatial intelligence helping us with in the future are up leveling these experiences into three d, where you're getting out a full virtual simulated, but vibrant and interactive three d world.
对吧?
Right?
可能是游戏,可能是虚拟摄影,应有尽有。
Maybe for gaming, maybe for virtual photography, you name it.
即便这个能实现,在教育领域也会有无数应用场景。
Even if you got this to work, there'd be a million applications for education.
我的意思是,从某种角度来说,这催生了一种新的媒体形式。
I mean, in some sense, this enables a new form of media.
对吧?
Right?
因为我们已经能创造虚拟互动世界,但需要数亿美元成本和大量开发时间。
Because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time.
所以人们推动这项技术的主要领域就是电子游戏。
And as a result, what are the places that people drive this technological ability is video games.
对吧?
Right?
但由于制作成本太高,目前该技术唯一经济可行的用途就是开发单价70美元、面向数百万玩家的游戏来收回投资。
But because it takes so much labor to do so, then the only economically viable use of that technology in its form today is games that can be sold for $70 a piece to millions and millions of people to recoup the investment.
如果我们能创造同样生动的3D虚拟互动世界,就能看到更多其他应用场景。
If we had the ability to create these same virtual interactive vibrant three d worlds, you could see a lot of other applications of this.
对吧?
Right?
因为只要降低这类内容的制作成本,人们就会把它用在其他领域。
Because if you bring down that cost of producing that kind of content, then people are gonna use it for other things.
对吧?
Right?
如果你能获得一种个性化的3D体验,其品质、丰富度和细节堪比那些耗资数亿美元制作的3A级游戏,会怎样?
What if you could have sort of a personalized three d experience that's as good, as rich, and as detailed as one of these triple a video games that cost hundreds of millions of dollars to produce?
但它可以专门针对这个非常小众的需求——可能只有少数人会想要这种特定体验。
But it could be catered to this very niche thing that only maybe a couple people would want that particular thing.
这不是某个具体产品或路线图,但我认为这是由生成式领域的空间智能所赋能的新型媒体愿景。
That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm.
当我构想一个世界时,实际上考虑的不只是场景生成。
If I think about a world, I actually think about things that are not just scene generation.
我会思考诸如运动和物理特性这类元素。
I think about stuff like movement and physics.
那么,在极限情况下,这些是否包含在内?
And so, like, in the limit, is that included?
如果我要与之互动,是否存在语义逻辑?
And then if I'm interacting with it, like, are there semantics?
我的意思是,比如我打开一本书,里面会有书页吗?
And I mean by that, like, if I open a book, are there, like, pages?
书页上会有文字吗?
And are there words in it?
这些文字是否具有意义?我们是在讨论完全沉浸式体验,还是静态场景?
And do do they mean, like are we talking, like, a full depth experience, or are we talking about, like, kind of a static scene?
我想我会
I think I'll
见证这项技术随时间的发展历程。
see a progression of this technology over time.
这确实是很难构建的东西。
This is really hard stuff to build.
所以我认为静态问题会相对简单些。
So I think the static problem is a little bit easier.
但从长远来看,我认为我们需要它完全动态化、完全可交互,实现你刚才提到的所有功能。
But in the limit, I think we want this to be fully dynamic, fully interactable, all the things that you just said.
我的意思是,这正是空间智能的定义。
I mean, that's the definition of spatial intelligence.
没错。
Yeah.
所以未来会有一个发展过程。
So there is gonna be a progression.
我们会从更静态的开始,但你所说的所有内容都在空间智能的发展路线图上。
We'll start with more static, but everything you've said is in the road map of spatial intelligence.
其实从公司名字就能看出来——世界实验室(World Labs)。
I mean, this is kind of in the name of the company itself, World Labs.
就像说,这个世界就是关于构建和理解各种世界的。
Like, the world is about building and understanding worlds.
这其实有点行业内部术语的味道。
And this is actually a little bit of inside baseball.
我发现当我们把这个名字告诉别人后,他们并不总能理解——因为在计算机视觉、重建和生成领域,我们经常会对能做的事情进行区分或划分。
I realized after we told the name to people, they don't always get it because in computer vision and reconstruction and generation, we often make a distinction or a delineation about the kinds of things you can do.
第一层级可以说是物体。
And kind of the first level is objects.
对吧?
Right?
比如麦克风、杯子、椅子。
A microphone, a cup, a chair.
这些都是世界上独立存在的事物。
These are discrete things in the world.
而李飞飞研究的许多ImageNet类工作,正是关于识别现实世界中的物体。
And a lot of the ImageNet style stuff that Fei Fei worked on was about recognizing objects in the world.
物体再往上一个层级,就是场景。
Then leveling up to the next level beyond objects, think of scenes.
场景是物体的组合。
Scenes are compositions of objects.
比如现在这个录音棚,就是由桌子、麦克风、人员和椅子等物体以某种方式组合而成的场景。
Now we've got this recording studio with a table and microphones and people and chairs at some composition of objects.
但我们将世界构想为超越场景的存在。
But then we envision worlds as a step beyond scenes.
对吧?
Right?
场景或许可以看作独立个体,但我们想要打破边界——走出门外,从桌前起身,穿过大门,沿街而行,看车辆川流不息,看树叶随风摇曳,并能与这一切互动。
Scenes are kind of maybe individual things, but we wanna break the boundaries, go outside the door, step up from the table, walk out from the door, walk down the street, and see the cars buzzing past and see the leaves on the trees moving and be able to interact with those things.
还有一点令人非常兴奋,因为贾斯汀提到了新媒体这个词。
Another thing that's really exciting because Justin mentioned the word new media.
借助这项技术,现实世界、虚拟想象世界、增强世界或预测世界之间的界限都变得模糊不清。
With this technology, the boundary between real world virtual imagined world or augmented world or predicted world is all blurry.
现实世界是三维的。
The real world is three d.
对吧?
Right?
所以在数字世界里,你必须拥有三维表现形式才能与现实世界融合。
So in the digital world, you have to have a three d representation to even blend with the real world.
你不能用二维的。
You cannot have a two d.
你无法用一维的方式与真实的三维世界进行有效交互。
You cannot have a one d to be able to interface with the real three d world in an effective way.
有了这个技术,就解锁了这种可能。
With this, it unlocks it.
因此,应用场景可以因此变得几乎无限。
So the use cases can be quite limitless because of this.
没错。
Right.
所以贾斯汀提到的第一个应用场景就是为各种用途生成虚拟世界。
So the first use case that Justin was talking about would be, like, the generation of a virtual world for any number of use cases.
你刚才提到的更像是增强现实。
The one that you're just alluding to would be more of an augmented reality.
对吧?
Right?
是的。
Yes.
就在World Labs成立之际,苹果发布了Vision Pro,并使用了‘空间计算’这一术语。
Just around the time World Labs was being formed, Vision Pro was released by Apple, and they used the word spatial computing.
他们差点窃取了我们的……但我们是空间智能。
They almost stole our... but we're spatial intelligence.
所以空间计算需要空间智能。
So spatial computing needs spatial intelligence.
完全正确。
That's exactly right.
所以我们还不知道它会以何种硬件形态出现。
So we don't know what hardware form it will take.
可能是护目镜、眼镜
It'll be goggles, glasses
隐形眼镜。
Contact lenses.
隐形眼镜。
Contact lenses.
但真实世界与叠加其上的虚拟功能之间的交互界面——无论是帮助你增强修理汽车这类机械工作的能力(即使你不是专业技师),还是仅仅用来玩宝可梦——
But that interface between the true real world and what you can do on top of it, whether it's to help you augment your capability to work on a piece of machinery and fix your car, even if you are not a trained mechanic, or to just play Pokémon.
突然之间,这项技术将成为AR、VR、混合现实的底层操作系统。
Suddenly, this piece of technology is going to be the operating system, basically, for AR, VR, and mixed reality.
从极限角度来看,AR设备需要实现什么功能?
In The Limit, what does an AR device need to do?
这个东西总是开着的。
It's this thing that's always on.
它与你同在。
It's with you.
它正在观察这个世界。
It's looking out into the world.
因此它需要理解你所看到的东西,或许还能在日常生活中帮你完成任务。
So it needs to understand the stuff that you're seeing and maybe help you out with tasks in your daily life.
但我也对虚拟与现实的这种融合感到非常兴奋,这变得至关重要。
But I'm also really excited about this blend between virtual and physical that becomes really critical.
如果能实时完美地以3D形式理解周围环境,实际上也会开始淘汰现实世界的很大一部分。
If you have the ability to understand what's around you in real time in perfect three d, then it actually starts to deprecate large parts of the real world as well.
比如现在,我们有多少不同尺寸的屏幕来应对不同使用场景?
Like, right now, how many differently sized screens do we all own for different use cases?
太多了。
Too many.
对吧?
Right?
你有手机、iPad、电脑显示器、电视、手表。
You've got your phone, you've got your iPad, you've got your computer monitor, you've got your TV, you've got your watch.
这些基本上都是不同尺寸的屏幕,因为它们需要在不同场景和位置向你展示信息。
Like, these are all basically different sized screens because they need to present information to you in different contexts and in different positions.
但如果能无缝融合虚拟内容与现实世界,某种程度上就淘汰了对所有这些设备的需求。
But if you've got the ability to seamlessly blend virtual content with the physical world, it kind of deprecates the need for all of those.
它能够理想地无缝融合你当下需要了解的信息,并通过恰当的机制传递这些信息。
It just ideally seamlessly blends information that you need to know in the moment with the right mechanism of giving you that information.
另一个重要案例是将数字虚拟世界与三维物理世界融合,使各类智能体能够在物理世界中执行任务。
Another huge case of being able to blend the digital virtual world with the three d physical world is for any agents to be able to do things in the physical world.
正如我所说,如果人类使用这些混合现实设备来完成某些事情——比如我不知道如何修车。
And if humans use these mixed reality devices to do things, like I said, I don't know how to fix a car.
但当我必须修理时,只要戴上这副护目镜或眼镜,就能立即获得操作指引。
But if I have to, I put on this goggle or glass, and suddenly I'm guided to do that.
但还存在其他类型的智能体,即各种形态的机器人,并不局限于人形机器人。
But there are other types of agents, namely robots, any kind of robots, not just humanoid.
根据定义,它们的交互界面就是三维世界。
And their interface, by definition, is the three d world.
而它们的运算核心——即大脑,本质上属于数字世界。
But their compute, their brain, by definition, is the digital world.
那么,是什么将机器人的数字大脑与现实世界的行为连接起来,实现从学习到行动的转化?
So what connects that, from the learning to the behaving, between a robot's brain and the real world?
答案必然是空间智能。
It has to be spatial intelligence.
您刚才谈到了虚拟世界。
So you've talked about virtual worlds.
也讨论过增强现实相关的内容。
You've talked about kind of more of an augmented reality.
现在您提到的则是纯粹的物理世界应用,这主要适用于机器人技术领域。
And now you've just talked about the purely physical world, basically, which would be used for robotics.
对任何公司来说,这都像是一份非常宏大的章程,特别是当你需要思考如何看待深度科技与这些具体应用领域之间的关系时?
For any company, that would be, like, a very large charter, especially if you're gonna get into how do you think about the idea of, like, deep, deep tech versus any of these specific application areas?
我们将自己视为一家深度科技公司,作为平台企业提供能服务于不同用例的模型。
We see ourselves as a deep tech company, as the platform company that provides models that can serve different use cases.
在这三者中,您认为是否有哪一项在早期阶段更自然,能让人们预期公司会重点投入?
Of these three, is there any one that you think is kind of more natural early on that people can kind of expect the company to lean into?
我认为可以说这些设备尚未完全准备就绪。
I think it suffices to say the devices are not totally ready.
其实我在研究生时期就拥有了第一台VR头显。
Actually, I got my first VR headset in grad school.
那是一种颠覆性的技术体验。
That's one of these transformative technology experiences.
当你戴上它时,会忍不住惊呼'天啊'。
You put it on, you're like, oh my god.
就像在说'这太疯狂了'。
Like, this is crazy.
我想很多人在初次使用VR时都有这种体验。
And I think a lot of people have that experience the first time they use VR.
所以我长期以来一直对这个领域充满热情,而且我非常喜欢Vision Pro。
So I've been excited about this space for a long time, and I love the Vision Pro.
比如我熬夜抢购了首发日的首批设备之一。
Like, I stayed up late to order one of the first ones, like, the first day it came out.
但现实情况是,它作为大众市场平台还不够成熟。
But I think the reality is it's just not there yet as a platform for mass market appeal.
因此,作为一家公司,我们很可能会进入一个准备更充分的市场,但你知道,我们是一家深度科技公司。
So very likely as a company, we'll move into a market that's more ready, but, you know, we are a deep tech company.
然后我认为,有时在普遍性中也能找到简洁性。
Then I think there can sometimes be simplicity in generality.
对吧?
Right?
我们有成为一家深度科技公司的理念。
We have this notion of being a deep tech company.
我们相信存在一些需要被很好解决的基础性根本问题。
We believe that there is some underlying fundamental problems that need to be solved really well.
如果解决得好,就能应用到许多不同的领域。
And if solved really well, can apply to a lot of different domains.
我们真正将公司的长远愿景视为构建并实现广义空间智能的梦想。
We really view this long arc of the company as building and realizing the dreams of spatial intelligence writ large.
所以在我看来,这需要构建大量技术。
So this is a lot of technologies to build, it seems to me.
是的。
Yeah.
我认为这是
I think it's
一个非常困难的问题。
a really hard problem.
我认为有时对于不直接从事AI领域的人来说,他们只是将AI视为一个无差别的整体人才库。
I think sometimes from people who are not directly in the AI space, they just see it as AI as one undifferentiated mass of talent.
对于我们这些在这里待得更久的人来说,你会意识到构建任何AI项目——尤其是这一个——需要汇聚多种不同的才能。
And for those of us who have been here for longer, you realize that there's a lot of different kinds of talent that need to come together to build anything in AI, in particular this one.
我们之前稍微讨论过数据问题。
We talked a little bit about the data problem.
我们谈到了我博士期间研究的一些算法,但要做成这件事还需要很多其他方面的努力。
We've talked a little bit about some of the algorithms that I worked on during my PhD, but there's a lot of other stuff we need to do this too.
你需要高质量、大规模的工程能力。
You need really high quality, large scale engineering.
你需要对三维世界有深刻理解。
You need really deep understanding of the three d world.
实际上这与计算机图形学有很多联系,因为他们一直从相反方向攻克许多相同问题。
There's actually a lot of connections with computer graphics because they've been kind of attacking a lot of the same problems from the opposite direction.
所以在团队构建时,我们思考如何为每个必要子领域找到世界顶尖的专家,共同完成这个艰巨任务。
So when we think about team construction, we think about how do we find, like, absolute top of the world best experts in the world at each of these different subdomains that are necessary to build this really hard thing.
当我考虑如何为World Labs组建最佳创始团队时,必须从一群杰出的跨学科创始人开始。
When I thought about how we form the best founding team for World Labs, it has to start with a group of phenomenal multidisciplinary founders.
当然,贾斯汀对我来说是自然而然的选择。
And of course, Justin is natural for me.
贾斯汀,请捂住耳朵——作为我最优秀的学生之一,也是最聪明的技术专家。
Justin, cover your ears, as one of my best students and one of the smartest technologists.
但还有两位我久闻大名的人,其中一位贾斯汀甚至合作过,我简直垂涎三尺对吧?
But there are two other people I have known by reputation, and one of them Justin even worked with, that I was drooling for, right?
其中一位是本·米尔登霍尔。
One is Ben Mildenhall.
我们讨论过他在NeRF上的开创性工作。
We talked about his seminal work on NeRF.
但还有一位是克里斯托夫·拉斯纳,他在计算机图形学界享有盛誉。
But another person is Christoph Lassner, who has a strong reputation in the computer graphics community.
特别是他极具前瞻性,在高斯溅射技术兴起前五年就开始研究其雏形,用于三维建模。
And especially, he had the foresight of working on a precursor of the Gaussian splat representation for three d modeling five years before Gaussian splatting took off, right?
本和克里斯托夫都是传奇人物。
Ben and Christophe are legends.
或许可以简单谈谈你是如何考虑团队建设的,因为这里有很多领域需要开发,不仅是AI或图形,还包括系统等方面。
And maybe just quickly talk about kind of like how you thought about the build out of the rest of the team because, again, like, there's a lot to build here and a lot to work on, not just in kind of AI or graphics, but, like, systems and so forth.
是的。
Yeah.
迄今为止,我个人最引以为豪的就是这支强大的团队。
This is what so far I'm personally most proud of is the formidable team.
我有幸与职业生涯中最优秀的年轻人共事,他们来自顶尖大学,而我自己也是斯坦福的教授。
I've had the privilege of working with the smartest young people in my entire career, right, from the top universities, being a professor at Stanford.
但我们在World Labs汇聚的这类人才简直令人惊叹。
But the kind of talent that we put together here at World Labs is just phenomenal.
我从未见过如此高密度的人才聚集。
I've never seen the concentration.
我认为最大的差异化因素在于我们都坚信空间智能。
And I think the biggest differentiating element here is that we're believers of spatial intelligence.
所有跨学科人才——无论是系统工程、机器学习基础设施、生成模型、数据还是图形领域——我们每个人,无论是通过个人研究历程、技术探索还是业余爱好,都信奉空间智能,我们正是这样组建起创始团队的。
All of us, across the multidisciplinary talents, whether it's system engineering, machine learning infra, generative modeling, data, or graphics, are believers in spatial intelligence, whether through our personal research journeys, our technology journeys, or even personal hobbies, and that's how we really formed our founding team.
这种能量与才华的聚焦让我深感谦卑。
And that focus of energy and talent is humbling to me.
我就是热爱这种感觉。
I just love it.
所以我知道你一直以北极星为指引。
So I know you've been guided by a North Star.
关于北极星的特点是,你实际上无法触及它们,因为它们高悬天际,但却是绝佳的指引。
So something about North Stars is, like, you can't actually reach them because they're in the sky, but it's a great way to have guidance.
那么你如何判断自己是否完成了既定目标?或者这是个将持续终身的、近乎无限的过程?
So how will you know when you've accomplished what you've set out to accomplish, or is this a lifelong thing that's gonna continue kind of infinitely?
首先,北极星有实体和虚拟之分。
First of all, there's real north stars and virtual north stars.
有时你能够触及虚拟的北极星。
Sometimes you can reach virtual north stars.
很公平。
Fair enough.
在世界模型中。
In the world model.
确实如此。
Exactly.
北极星。
North star.
就像我说的,我曾以为‘影像叙事’这个北极星目标需要一百年才能实现,而在我看来,贾斯汀和安德烈为我解决了它。
Like I said, I thought one of our North Stars, storytelling of images, would take a hundred years, and Justin and Andrej, in my opinion, solved it for me.
这样我们就能抵达我们的北极星目标。
So we could get to our North Star.
但对我来说,最激动的是看到这么多人和企业正在使用我们的模型来解锁他们对空间智能的需求。
But for me, it's when so many people and so many businesses are using our models to unlock their needs for spatial intelligence.
那一刻我就知道,我们已经实现了一个重大里程碑。
And that's the moment I know we have reached a major milestone.
实际部署,实际影响。
Actual deployment, actual impact.
是啊。
Yeah.
我觉得我们永远无法真正到达那里。
I don't think we're ever gonna get there.
我认为这是非常基础性的东西。
I think that this is such a fundamental thing.
宇宙是一个不断演化的四维巨构,广义的空间智能就是全面理解其深度,并从中发掘所有应用可能。
The universe is a giant evolving four dimensional structure, and spatial intelligence writ large is just understanding that in all of its depths and figuring out all the applications to that.
所以我觉得虽然今天我们有一些具体想法,但这段旅程将带我们去往现在根本无法想象的地方。
So I think we have a particular set of ideas in mind today, but I think this journey is gonna take us places that we can't even imagine right now.
优秀技术的魔力在于它能开启更多可能性和未知领域。
The magic of good technology is that technology opens up more possibilities and unknowns.
所以我们将持续推动边界,可能性就会不断扩展。
So we will be pushing, and then the possibilities will be expanding.
太精彩了。
Brilliant.
谢谢你,贾斯汀。
Thank you, Justin.
谢谢你,菲菲。
Thank you, Fei Fei.
这太棒了。
This was fantastic.
谢谢你,马丁。
Thank you, Martin.
谢谢你,马丁。
Thank you, Martin.
感谢收听本期a16z播客。
Thanks for listening to this episode of the a16z podcast.
如果你喜欢本期节目,请记得点赞、评论、订阅、给我们评分或留言,并与亲朋好友分享。
If you like this episode, be sure to like, comment, subscribe, leave us a rating or a review, and share it with your friends and family.
更多节目请前往YouTube、苹果播客和Spotify收听。
For more episodes, go to YouTube, Apple Podcasts, and Spotify.
在X平台关注我们@a16z,并订阅我们的Substack:a16z.substack.com。
Follow us on X at @a16z, and subscribe to our Substack at a16z.substack.com.
再次感谢收听,我们下期节目再见。
Thanks again for listening, and I'll see you in the next episode.
温馨提示:此处内容仅供信息参考,不应视为法律、商业、税务或投资建议,也不应用于评估任何投资或证券,且不针对任何a16z基金的现有或潜在投资者。
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.
请注意,a16z及其关联机构可能持有本播客讨论公司的投资。
Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.
欲了解更多详情,包括我们的投资链接,请访问 a16z.com/disclosures。
For more details, including a link to our investments, please see a16z.com/disclosures.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。