Latent Space: The AI Engineer Podcast - 后LLM时代:空间智能与世界模型 —— 李飞飞与贾斯汀·约翰逊,World Labs 封面

后LLM时代:空间智能与世界模型 —— 李飞飞与贾斯汀·约翰逊,World Labs

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

本集简介

李飞飞和Justin Johnson是World Labs的联合创始人,他们最近推出了Marble(https://marble.worldlabs.ai/),这是一种新型的生成式"世界模型",可以从文本、图像和其他空间输入中创建可编辑的3D环境。Marble让创作者能够生成持久的3D世界,精确控制相机,并交互式地编辑场景,使其成为游戏、电影、VR、机器人模拟等领域的强大工具。在本期节目中,李飞飞和Justin分享了他们从ImageNet和斯坦福研究到World Labs的历程,为什么空间智能是继大语言模型之后的下一个前沿,以及世界模型如何可能改变机器在3D中的观察、理解和构建方式。

我们讨论了:

- 从AlexNet到今天的大规模计算扩展:为什么与仅使用语言相比,世界模型和空间数据是"吸收"现代GPU集群的最引人注目的方式。
- Marble实际上是什么:一个3D世界的生成模型,使用高斯泼溅将文本和图像转化为可编辑场景,支持精确的相机控制和录制,并在手机、笔记本电脑和VR头显上交互式运行。
- 李飞飞关于空间智能作为与语言不同的智能形式的文章(https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence):从拿起一个杯子到推断DNA的3D结构,以及为什么语言是描述我们生活的丰富3D/4D世界的一个有损、低带宽的通道。
- 当前模型是否"理解"物理或只是拟合模式:预测轨道和发现F=ma之间的差距,以及如何将物理属性附加到泼溅上并将物理引擎提炼到神经网络中,可能带来真正的因果推理。
- 学术界在AI中不断变化的角色:为什么李飞飞更担心资源不足的大学而不是"开放与封闭",以及国家AI计算云和开放基准等倡议如何重新平衡生态系统。
- 为什么Transformer本质上是集合模型而不是序列模型,以及这种观点如何为世界模型开辟新的架构,特别是随着硬件从单个GPU转向大规模分布式集群。
- Marble今天的实际用例:预视化和视觉特效、游戏环境、虚拟制作、室内和建筑设计(包括厨房改造),以及为训练具身代理和机器人生成合成模拟世界。
- 空间智能和语言智能如何在多模态系统中协同工作:目标不是抛弃大语言模型,而是用丰富的、具身的世界模型来补充它们。
- 李飞飞和Justin对空间智能的长期愿景:从面向艺术家和游戏开发者的创意工具,到科学、医学和现实世界决策中的更广泛应用。

---

李飞飞
X: https://x.com/drfeifei
LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247

Justin Johnson
X: https://x.com/jcjohnss
LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664

在哪里找到Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/

章节
00:00:00 介绍和李飞飞与Justin Johnson的合作
00:02:00 从ImageNet到世界模型:计算机视觉的演变
00:12:42 密集标注和早期的视觉-语言工作
00:19:57 空间智能:超越语言模型
00:22:10 物理、动力学和世界模型的未来
00:28:46 介绍Marble:World Labs的第一个空间智能模型
00:33:21 高斯泼溅和Marble的技术架构
00:37:37 用例:从创意产业到机器人和具身AI
00:41:09 多模态以及语言与空间的相互作用
00:56:58 招聘、研究方向以及World Labs的未来

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

我认为,深度学习的整个历史在某种意义上就是计算能力不断扩展的历史。

I think the whole history of deep learning is in some sense the history of scaling up compute.

Speaker 1

当我从研究生院毕业时,我真以为我整个职业生涯都将致力于解决这个问题,即人工智能作为一个领域、作为一个学科,深受人类智能的启发。

When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is A lot of AI as a field, as a discipline, is inspired by human intelligence.

Speaker 1

我们以为自己是第一批做这件事的人。

We thought we were the first people doing it.

Speaker 1

结果发现,当时也有其他人同时在做这件事。

It turned out that others were also simultaneously doing it.

Speaker 0

所以,Marble,简单来说,一种看待它的方法是,它是一个三维世界的生成模型。

So Marble, like, basically, one way of looking at it, is that it's a generative model of three d worlds.

Speaker 0

对吧?

Right?

Speaker 0

因此,你可以输入文本、图像或多个图像,它会为你生成一个与这些输入相匹配的三维世界。

So you can input things like text or image or multiple images, and it will generate for you a three d world that kind of matches those inputs.

Speaker 0

所以,尽管Marble同时是一个旨在实现空间智能这一愿景的世界模型,但它也被有意设计成今天就能让人觉得有用的东西。

So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today.

Speaker 0

我们开始看到一些新兴的应用场景,比如在游戏、视觉特效和电影领域,我认为Marble如今作为一款产品就能做很多非常有趣的事情,同时也能为未来我们想要构建的宏大世界模型奠定基础。

And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.

Speaker 2

大家好。

Hey, everyone.

Speaker 2

欢迎收听Latent Space播客。

Welcome to the Latent Space podcast.

Speaker 2

我是Kernel Labs的创始人Alessio,今天和Latent Space的编辑swyx一起主持。

This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.

Speaker 3

我们非常兴奋能与World Labs的Fei Fei和Justin在演播室见面。

And we are so excited to be in the studio with Fei Fei and Justin of World Labs.

Speaker 3

欢迎。

Welcome.

Speaker 1

我们也非常兴奋。

We're excited too.

Speaker 3

我真的想说Marble。

I really say Marble.

Speaker 3

是的。

Yeah.

Speaker 0

谢谢你们邀请我们。

Thanks for having us.

Speaker 3

我认为人们对世界模型有很多兴趣,你们已经就空间智能等方面做了一些宣传。

I think there's a lot of interest in world models, and you've done you've done a little bit of publicity around spatial intelligence and all that.

Speaker 3

我想,或许你们难得有机会讲述的一个故事是,你们两人是如何走到一起创立World Labs的。

I guess maybe one part of the story that is a rare opportunity for you to tell is how you two came together to start building World Labs.

Speaker 1

这很简单,因为贾斯汀曾经是我的学生。

That's very easy because Justin was my former student.

Speaker 1

是的。

Yeah.

Speaker 1

所以贾斯汀来到我的实验室,而我另一个身份是斯坦福大学的计算机科学教授。

So Justin came to my lab. You know, the other hat I wear is professor of computer science at Stanford.

Speaker 1

贾斯汀是什么时候加入我的实验室的?

Justin joined my lab when?

Speaker 1

哪一年?

Which year?

Speaker 0

2012年。

2012.

Speaker 0

实际上,我加入你实验室的那个学期,正好是AlexNet发布的那个季度。

Actually, the quarter that I joined your lab was the same quarter that AlexNet came out.

Speaker 1

是的。

Yeah.

Speaker 1

是的。

Yeah.

Speaker 1

所以贾斯汀是我的第一个

So Justin is my first

Speaker 3

你参与了整个发布风波吗?

Were you involved in the whole announcement drama?

Speaker 3

没有。

No.

Speaker 3

完全没参与。

Not at all.

Speaker 0

但那个季度,我一直在关注围绕AlexNet的所有热潮。

But I was sort of watching all the excitement around AlexNet that quarter.

Speaker 1

他是我最优秀的学生之一。

So he was my one of my very best students.

Speaker 1

然后他前往密歇根大学安娜堡分校担任教授,之后加入Meta,开启了非常成功的早期职业生涯。

And then he went on to have a very successful early career as a professor at the University of Michigan, Ann Arbor, and at Meta.

Speaker 1

大约两年前,我确信我们两人各自都开始关注大模型的发展,并思考语言模型之后的方向。

And then when we I think around more than two years ago, for sure, I think both independently, both of us have been looking at the development of the large models and thinking about what's beyond language models.

Speaker 1

构建世界模型、空间智能对我们来说是自然而然的想法。

And this idea of building world models, spatial intelligence really was natural for us.

Speaker 1

于是我们开始交流,决定孤注一掷,专注于解决这个问题,并共同创立了World Labs。

So we started talking and decided that we should just put all the eggs in one basket and focus on solving this problem, and started World Labs together.

Speaker 0

是的。

Yeah.

Speaker 0

差不多。

Pretty much.

Speaker 0

我的意思是,在我读博士期间见证了ImageNet时代之后,我感觉计算机视觉未来十年的重点将是把人工智能从数据中心带入现实世界。

I mean, like, I after that, seeing that kind of ImageNet era during my PhD, I had the sense that the next sort of decade of computer vision was gonna be about getting getting AI out of the out of the data center and out into the world.

Speaker 0

因此,博士毕业后,我的兴趣逐渐转向了三维视觉、计算机图形学以及生成模型。

So a lot of my interests post PhD kinda shifted in into three d vision, little bit more into computer graphics, more into generative modeling.

Speaker 0

我当时以为博士毕业后已经逐渐偏离了导师的研究方向,但几年后我们重聚时,发现她也在思考非常相似的问题。

And I was, I thought I was kind of drifting away from my adviser post PhD, but then when we reunited a couple years later, it turned out she was thinking of very similar things.

Speaker 2

所以,如果你想想AlexNet,它的核心组成部分显然是ImageNet。

So if you think about AlexNet, the core pieces of it were obviously ImageNet.

Speaker 2

还有转向GPU和神经网络。

It was the move to GPUs and neural networks.

Speaker 2

你如何看待世界模型的AlexNet等效模型?

How do you think about the AlexNet equivalent model for world models?

Speaker 2

某种程度上,这个想法早已存在。

In a way, it's an idea that has been out there.

Speaker 2

对吧?

Right?

Speaker 2

你知道的,Yann LeCun可能是其中最积极、最突出的倡导者。

There's been, you know, Yann LeCun is maybe, like, the biggest, most prominent proponent of it.

Speaker 2

过去两年里,你有没有看到什么让你觉得,现在是时候做这件事了?

What have you seen in the last two years that you were like, hey.

Speaker 2

你觉得,从数据、算法类型,或者计算方式的角度来看,哪些东西是根本性的,能真正让这些模型活起来?

Now is the time to do this, and what are maybe the things, fundamentally that you wanna build as far as data and kinda, like, maybe different types of algorithms or approaches to compute, to make more models really come to life?

Speaker 0

是的。

Yeah.

Speaker 0

我认为其中一个因素是,现在数据和算力普遍变得更加丰富了。

I think one is just that there is a lot more data and compute generally available.

Speaker 0

我认为深度学习的整个历史,在某种意义上,就是算力不断扩展的历史。

I think the whole history of deep learning is, in some sense, the the history of scaling up compute.

Speaker 0

如果你想想,AlexNet 需要从 CPU 向 GPU 的一次重大跃迁。

And if you think about, you know, AlexNet required this jump from CPUs to GPUs.

Speaker 0

但从AlexNet到现在,每张卡的性能提升了大约一千倍。

But even from AlexNet to today, we're getting about a thousand times more performance per card, than we had in AlexNet days.

Speaker 0

现在,训练模型不再只使用一块GPU,而是使用数百、数千、数万甚至更多的GPU。

And now it's common to train to train models not just on one GPU, but on hundreds or thousands or tens of thousands or even more.

Speaker 0

因此,今天我们能在单个模型上调动的计算量,比我在博士初期时多出了约百万倍。

So the amount of compute that we can marshal today on on a single model is is, you know, about a million fold more than we could have even at the start of my PhD.

Speaker 0

所以我认为,语言是过去几年真正开始取得显著进展的领域之一。

So I think language was one of the really really interesting things that started to work quite well the last couple of years.

Speaker 0

但当我们转向视觉数据、空间数据和世界数据时,就需要处理更多的信息。

But as we think about moving towards visual data and spatial data and world data, you just need to process a lot more.

Speaker 0

我认为,这将是吸收越来越多新计算能力的一个好方法。

And I think that's gonna be a good way to soak up this this new compute that's coming online more and more.

Speaker 3

公开挑战的模式还有效吗?还是应该集中在实验室内部进行?

Does the model of having a public challenge still work, or should it be centralized inside a lab?

Speaker 1

我认为开放科学仍然很重要。

I think open science still is important.

Speaker 1

你知道,与ImageNet和AlexNet的时代相比,人工智能已经真正地发展了。

You know, AI, obviously, compared to the ImageNet and AlexNet time, has really evolved.

Speaker 1

那时它还只是一个非常小众的计算机科学领域。

Was such a niche computer science discipline.

Speaker 1

现在它已经成为一种文明级的技术。

Now it's just like civilizational technology.

Speaker 1

但我给你举个例子。

But I'll give you an example.

Speaker 1

最近,我的斯坦福实验室刚刚发布了一个名为Behavior的数据集和基准,用于在模拟环境中评估机器人学习。

Recently, my Stanford lab just announced an open dataset and benchmark called Behavior, which is for benchmarking robotic learning in simulated environments.

Speaker 1

这明确体现了继续坚持开放科学模式的努力,尤其是在学术界。

And that is a very clear effort in still keeping up this open science model of doing things, especially in academia.

Speaker 1

但我认为,重要的是要认识到这个生态系统是混合的。

But I think it's important to recognize the ecosystem is a mixture.

Speaker 1

我认为,工业界许多高度聚焦的工作,有些更多是以产品形式而非开放挑战的形式显现出来。

I think a lot of the very focused work in industry is seeing the light of day more in the form of a product rather than an open challenge per se.

Speaker 3

是的,这仅仅是资金和商业模式的问题,你必须看到相应的投资回报吗?

Yeah, and is that just a matter of the funding and the business model, like you have to see some ROI from it?

Speaker 1

我认为这仅仅是生态系统多样性的体现。

I think it's just a matter of the diversity of the ecosystem.

Speaker 1

即使在所谓的AlexNet和ImageNet时代,也存在封闭的模型。

Even during the so-called AlexNet and ImageNet time, there were closed models.

Speaker 1

存在专有模型。

There were proprietary models.

Speaker 1

也有开源模型。

There were open models.

Speaker 1

或者你想想iOS和Android,对吧?

Or you think about iOS versus Android, right?

Speaker 1

有不同的商业模式。

There are different business models.

Speaker 1

我不认为这仅仅是资金问题。

I wouldn't say it's just a matter of funding per se.

Speaker 1

这只是市场的现状。

It's just how the market is.

Speaker 1

有不同的策略。

There are different plays.

Speaker 2

是的。

Yeah.

Speaker 2

但你觉得在当今这些实验室面临的商业压力下,还能重做一次ImageNet吗?

But do you feel like you could redo ImageNet today with the commercial pressure that some of these labs have?

Speaker 2

我的意思是,这对我来说是最大的问题。

I mean, to me, that's like the biggest question.

Speaker 2

对吧?

Right?

Speaker 2

问题是,你能开放什么,又该保留什么?

It's like, what can you open up versus what should you keep inside?

Speaker 2

比如,如果你站在我的立场上,你筹到了很多钱。

Like, you know, if I put myself in your shoes, right, it's like, you raised a lot of money.

Speaker 2

你正在构建这一切。

You're building all of this.

Speaker 2

如果你拥有这方面的最佳数据集,你有什么动力去发布它呢?

If you had the best dataset for this, what incentives do you really have to publish it?

Speaker 2

感觉实验室里的人正被越来越多地吸引,而博士项目也正越来越早地被拉入这些实验室。

And it feels like the people at the labs are getting more and more pulled and the, PhD programs are getting pulled earlier and earlier into these labs.

Speaker 2

所以我想知道,你是否认为现在资金过多、对学术界开放研究空间造成的压力是个问题,或者你觉得这其实并不是一个真正的担忧。

So I'm curious if you think there's an issue right now with how much money has been taken in and how much pressure it puts on the more academic open research space, or if you feel like that's not really a concern.

Speaker 1

我对压力本身并没有太多担忧。

I do have concerns, but less about the pressure.

Speaker 1

更关键的是学术界资源分配不均的问题。

It's more about the resourcing and the imbalanced resourcing of academia.

Speaker 1

这和世界上的实验室是稍微不同的一个话题。

This is a little bit of a different conversation from world labs.

Speaker 1

过去几年,我一直倡导为健康的生态系统提供资源。

I have been, the past few years, advocating for resourcing the healthy ecosystem.

Speaker 1

作为斯坦福大学以人为中心的人工智能研究所的创始主任兼联合主任,我一直在与政策制定者合作,推动公共部门和学术界人工智能工作的资源投入。

As the founding director and co-director of Stanford's Institute for Human-Centered AI, Stanford HAI, I've been working with policymakers about resourcing public sector and academic AI work.

Speaker 1

我们曾与特朗普政府第一任期合作,推动一项名为国家人工智能研究资源(NAIRR)的法案,旨在规划一个国家级的AI计算云和数据存储库。

We worked with the first Trump administration on this bill called the National AI Research Resource, the NAIRR bill, which is scoping out a national AI compute cloud as well as a data repository.

Speaker 1

我也认为,开源和开放数据集仍然是生态系统中重要的一部分。

And I also think that open source, open data sets continue to be an important part of the ecosystem.

Speaker 1

正如我所说,目前在我的斯坦福实验室,我们正在开展一个名为Behavior的机器人学习开放数据集和开放基准测试。

Like I said, right now in my Stanford lab, we are doing the open dataset and open benchmark on robotic learning called Behavior.

Speaker 1

我的许多同事也在做类似的工作。

And many of my colleagues are still doing that.

Speaker 1

我认为这是生态系统的一部分。

I think that's part of the ecosystem.

Speaker 1

我认为,产业界和一些初创公司正在快速推进模型并开发产品,这同样是一件好事。

I think what the industry is doing, some startups are doing, are running fast with models creating products, is also a good thing.

Speaker 1

例如,当贾斯汀还是我指导的博士生时,当时没有一个计算机视觉项目能很好地运行。

For example, when Justin was a PhD student with me, none of the computer vision programs worked that well.

Speaker 1

对吧?

Right?

Speaker 1

我们可以写出优美的论文。

We could write beautiful papers.

Speaker 1

贾斯汀有

Justin has

Speaker 0

我的意思是,实际上,甚至在上研究生之前,我就想做计算机视觉。

I mean, actually, even before grad school, I wanted to do computer vision.

Speaker 0

我联系了谷歌的一个团队,想在本科毕业后直接去尝试做计算机视觉。

And I reached out to a team at Google and wanted to potentially go and try to do computer vision out of undergrad.

Speaker 0

他们告诉我:‘你在说什么啊?’

They told me, like, what are what are you talking about?

Speaker 0

像你这样是做不到的。

Like, you can't do that.

Speaker 0

先去读个博士,再回来。

Like, go do a PhD first and come back.

Speaker 3

是什么动机让你

What was the motivation that got you

Speaker 0

哦,因为我在本科期间做过一些计算机视觉研究,实际上是在李飞飞的博士导师手下。

Oh, because I had done some computer vision research during my undergrad with, actually, Fei Fei's PhD adviser.

Speaker 1

传承。

The lineage.

Speaker 0

对。

Yeah.

Speaker 0

这里是有传承的。

There's a lineage here.

Speaker 3

是的。

Yeah.

Speaker 3

所以我想

So I

Speaker 0

我在本科时就做过一些计算机视觉方面的研究,觉得特别酷,所以想继续做下去。

I'd done some computer vision even as an undergrad, and I thought it was really cool, and I wanted to keep doing it.

Speaker 0

所以,当我本科毕业时,我面临着一个产业与学术的选择,我想现在研究界很多人都在面临这个问题。

So then I I was sort of faced with this sort of industry academia choice even coming out of undergrad that I think a lot of people in the research community are facing now.

Speaker 0

但回到你的问题,我认为过去十年里,学术界在人工智能领域的作用已经发生了很大变化。

But but to your question, I I think, like, the role of of academia, especially in AI, has shifted quite a lot in the last decade.

Speaker 0

这并不是一件坏事。

And it's not a bad thing.

Speaker 0

这是因为技术已经发展和涌现了。

It's because the technology has grown and emerged.

Speaker 0

对吧?

Right?

Speaker 0

比如五年前或十年前,你真的可以在实验室里用几块GPU训练出最先进的模型。

Like, five or ten years ago, you really could train state of the art models in the lab, even with just with just a couple GPUs.

Speaker 0

但你知道,由于这项技术如此成功并迅速扩展,现在你已经无法仅用几块GPU来训练最先进的模型了。

But, you know, because that technology was so successful and scaled up so much, then you you can't train state of the art models with a couple GPUs anymore.

Speaker 0

这并不是一件坏事。

And that's not a bad thing.

Speaker 0

这是好事。

It's a good thing.

Speaker 0

这意味着技术真的奏效了。

It means the technology actually worked.

Speaker 0

但这意味着我们作为学者应该做什么的期望也发生了一些变化。

But that means the the expectations around what we should be doing as academics shifts a little bit.

Speaker 0

而不应该专注于试图训练最大的模型或扩大规模。

And it shouldn't be about trying to train the biggest model and scaling up the biggest thing.

Speaker 0

而应该尝试一些古怪、新颖、疯狂的想法,其中大多数可能不会成功。

It should be about trying wacky ideas and new ideas and crazy ideas, most of which won't work.

Speaker 0

我认为在这方面还有很多事情可做。

And I think there's a lot to be done there.

Speaker 0

而且,我担心的是,学术界太多人过于专注于试图假装我们能够训练最大的模型,或者将它视为一种职业培训,以便毕业后进入大实验室,玩转所有GPU。

And if anything, I'm worried that too many people in academia are hyper focused on this notion of trying to pretend like we can train the biggest models or or treating it as almost a vocational training program to then graduate and go to a big lab and be able to play with all the GPUs.

Speaker 0

我认为在新算法、新架构、新系统方面还有很多疯狂的事情可以做,一个人就能做很多。

I think there's just so much crazy stuff you can do around new algorithms, new architectures, new systems that there's a lot you can do as one person.

Speaker 1

而且,学术界在理解这些大型模型的理论基础方面也扮演着重要角色。

And also, just academia has a role to play in understanding the theoretical underpinning of these large models.

Speaker 1

我们对这些模型的了解仍然非常有限。

We still know so little about this.

Speaker 1

或者延伸到跨学科领域,贾斯汀称之为疯狂的想法。

Or extend to the interdisciplinary, Justin calls wacky ideas.

Speaker 1

还有很多基础科学方面的想法。

There's a lot of basic science ideas.

Speaker 1

有很多天马行空的问题。

There's a lot of blue sky problems.

Speaker 1

所以我同意。

So I agree.

Speaker 1

我认为问题不在于开放还是封闭,产品化还是开源。

I don't think the problem is open versus closed, productization versus open sourcing.

Speaker 1

我认为当前的问题在于,学术界自身资源严重不足,研究人员和学生没有足够的资源来尝试这些想法。

I think the problem right now is that academia by itself is severely under resourced so that the researchers and the students do not have enough resources to try these ideas.

Speaker 3

是的。

Yeah.

Speaker 3

只是为了让大家脑洞大开,当你提到疯狂点子时,脑海中会浮现出什么疯狂的想法?

Just for people to nerd snipe, what's a wacky idea that comes to mind when you talk about wacky ideas?

Speaker 0

哦,比如我一直在密歇根大学向我的学生们提出的一个想法:我非常喜欢硬件,也很喜欢新型硬件的出现。

Oh, like I had I had this idea that I kept pitching to my students at at Michigan, which is that I really like hardware and I really like like new kinds of hardware coming online.

Speaker 0

从某种意义上说,我们今天使用的神经网络和Transformer模型,其基础是矩阵乘法,因为矩阵乘法非常适合GPU。

And in some sense, the the emergence of the neural networks that you we use today and transformers are really based around matrix multiplication because matrix multiplication fits really well with GPUs.

Speaker 0

但如果我们思考GPU未来如何扩展,以及硬件可能的未来发展趋势,我认为我们目前的系统——比如GPU和硬件设计——不可能无限扩展。

But if we think about how GPUs are gonna scale, how hardware is likely to scale in the future, I don't think the current system that we have, like the GPU hardware design, is gonna scale infinitely.

Speaker 0

我们甚至现在就已经看到,计算的基本单元不再是单个设备。

And that we start to see that even now that, like, the unit of compute is not not the single device anymore.

Speaker 0

而是整个设备集群。

It's this whole cluster of devices.

Speaker 0

所以,如果你想象一个节点。

So if you imagine A node.

Speaker 0

是的。

Yeah.

Speaker 0

它是一个完整的节点或一个完整的集群。

It's a whole node or a whole cluster.

Speaker 0

但我们谈论神经网络的方式,仍然像是它们是一个可以单靠一块GPU在PyTorch中编码的单体结构。

But the way we talk about neural networks is still as if they are a monolithic thing that could be coded, in one GPU in PyTorch.

Speaker 0

但实际上,它们可以分布在数千个设备上。

But then in practice, they could distribute over thousands of devices.

Speaker 0

所以,就像Transformer基于矩阵乘法,而矩阵乘法在GPU上表现得非常好一样。

So, just as, you know, transformers are based around matmul, and matmul is sort of the primitive that works really well on GPUs.

Speaker 0

当你想象扩展时,是否还有其他更适合大规模分布式系统的原语,我们可以基于这些原语来构建神经网络?

As you imagine, scaling out, are there other primitives that make more sense for large scale distributed systems that we could build our neural networks on?

Speaker 0

我认为,未来十到二十年可能出现的下一代硬件,可能会催生截然不同的架构,而我们今天就可以开始想象这些可能性。

And I think it's possible that there could be drastically different architectures that fit with the next generation or, like, the the the hardware that's gonna come ten or twenty years down the line, and we could start imagining that today.

Speaker 0

要做出这类设想真的很难

It's really hard to make those kinds

Speaker 3

因为还存在硬件彩票的概念,比如说,英伟达赢了,我们就应该无限扩展它,并编写软件来弥补我们混合架构中的任何缺口。

of bets, because there's also the concept of the hardware lottery, where, let's just say, you know, NVIDIA has won, and we should just scale that out infinitely and write software to patch up any gaps we have in the mix.

Speaker 3

对吧?

Right?

Speaker 0

我的意思是,是,也不是。

I mean, yes yes and no.

Speaker 0

如果你看看数据,从Hopper到Blackwell,每瓦性能其实差不多。

Like, if you look at the numbers, even going from Hopper to Blackwell, the performance per watt is about the same.

Speaker 0

是的。

Yes.

Speaker 0

他们主要增加了晶体管数量,增大了芯片尺寸,也增加了功耗。

They mostly make the the number of transistors go up, they make the chip size go up, and they make the the power usage go up.

Speaker 0

但即使从Hopper到Blackwell,我们已经隐约看到,在每瓦性能方面,我们正面临某种扩展极限。

But even from Hopper to Blackwell, we're kind of already seeing, like, a a scaling limit in terms of what is the what is the performance per watt that we can get.

Speaker 0

所以,我认为还有空间去做一些全新的东西。

So, I think there is room to do something new.

Speaker 0

我不知道具体是什么,也不认为一个初创公司能在三个月内完成这件事。

And I don't know exactly what it is, and I don't think you can get it done, like, in a three month cycle as a start up.

Speaker 0

但我认为,如果你静下心来花上几年时间深入研究,或许能取得一些突破。

But I think that's the kind of idea that, if you sit down and sit with it for a couple years, maybe you could come up with some breakthroughs.

Speaker 0

我觉得这种长期性的工作非常适合学术界。

And I think that's the kind of long-range stuff that is a perfect match for academia.

Speaker 3

回到一些背景和历史,我们有一份关于你和安德烈所做的场景叙事工作或更新的图像描述研究笔记。

Coming back to a little bit of background and history, we have this sort of research note on the scene storytelling work, or the newer image captioning, that you did with Andre.

Speaker 3

我只是想听你们讲讲,当初你们是如何开始这项工作,作为博士研究的一部分。

And I just wanted to hear you guys tell that story about, you know, you you were, like, sort of embarking on that for your PhD.

Speaker 3

飞飞,你还记得你当时的反应吗?

And, Fei Fei, you, like, having that reaction that you had.

Speaker 1

是的。

Yeah.

Speaker 1

我认为这项工作最初是我和安德烈一起开始的,后来贾斯汀加入了进来。

So I think that line of work started between me and Andre, and then Justin joined.

Speaker 1

安德烈开始了他的博士生涯。

Andre started his PhD.

Speaker 1

他和我正在思考超越ImageNet目标识别的方向。

He and I were looking at what is beyond ImageNet object recognition.

Speaker 1

当时,卷积神经网络在ImageNet任务中已经证明了其强大能力。

And at that time, convolutional neural networks had proven some power in ImageNet tasks.

Speaker 1

因此,卷积神经网络是表示图像的一种绝佳方式。

So ConvNet is a great way to represent images.

Speaker 1

与此同时,我认为在语言领域,一种早期的序列模型——LSTM——也正在被实验。

In the meantime, I think in the language space, an early sequential model called the LSTM was also being experimented with.

Speaker 1

所以安德烈和我聊起了这件事,这一直是我长期的梦想。

So Andre and I were just talking about this; it has been a long-term dream of mine.

Speaker 1

我认为解决这个问题需要一百年,那就是为图像讲述故事。

I thought it would take one hundred years to solve, which is telling the story of images.

Speaker 1

当我从研究生院毕业时,我真觉得我整个职业生涯都将致力于解决这个单一问题:给定一张图片或一个场景,用自然语言讲述它的故事。

When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is given a picture or given a scene, tell the story in natural language.

Speaker 1

但事情发展得如此迅速。

But things evolved so fast.

Speaker 1

当安德烈开始时,我们觉得,或许可以将卷积神经网络的表示与LSTM这种语言序列模型结合起来,通过训练来使文本描述与图像匹配。

When Andre started, we're like, maybe combining the representation of convolutional neural network as well as the language sequential model of LSTM, might be able to learn through training to match caption with images.

Speaker 1

于是我们开始了这条研究路线。

So that's when we started that line of work.

Speaker 1

我不记得是2014年还是2015年了?

And I don't remember, was it 2014 or 2015?

Speaker 0

是2015年的CVPR会议,那是第一个图像描述的成果。

It was at CVPR 2015. So the captioning

Speaker 1

那是我们第一篇论文,安德烈成功实现了给定一张图像后生成描述。

it was our first paper, where Andre got it to work: given an image.

Speaker 1

图像用ConvNet进行表示。

The image is represented with a ConvNet.

Speaker 1

语言模型则是LSTM模型。

The language model is the LSTM model.

Speaker 1

然后我们将它们结合起来,能够生成一句话。

And then we combine it, and it's able to generate one sentence.
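[Editor's note] The CNN-plus-LSTM captioning setup described here can be sketched in a few lines. This is a toy illustration with random weights and made-up sizes (`VOCAB`, `HID`, `FEAT` are hypothetical), not the actual model: the image feature stands in for the ConvNet output, it initializes the LSTM's hidden state, and greedy decoding emits one token at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HID, FEAT = 10, 16, 8  # toy sizes; the real models were far larger

# Stand-in for the ConvNet: in the real system this vector came from a CNN.
image_feature = rng.normal(size=FEAT)

# LSTM cell parameters (input = previous token as a one-hot vector).
W = rng.normal(scale=0.1, size=(4 * HID, VOCAB + HID))
b = np.zeros(4 * HID)
W_init = rng.normal(scale=0.1, size=(HID, FEAT))   # image -> initial hidden state
W_out = rng.normal(scale=0.1, size=(VOCAB, HID))   # hidden state -> vocab logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    # One step of a standard LSTM cell: gates computed from [input; hidden].
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def caption(max_len=5):
    # Condition the LSTM on the image by initializing its hidden state.
    h, c = np.tanh(W_init @ image_feature), np.zeros(HID)
    token, out = 0, []  # token 0 plays the role of a <start> symbol
    for _ in range(max_len):
        x = np.zeros(VOCAB)
        x[token] = 1.0
        h, c = lstm_step(x, h, c)
        token = int(np.argmax(W_out @ h))  # greedy decoding
        out.append(token)
    return out

print(caption())  # a sequence of 5 token ids (gibberish with random weights)
```

With trained weights the same loop would emit word indices forming a sentence; here it only shows the data flow (CNN feature in, token sequence out).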

Speaker 1

那是最早的一次之一。

And that was one of the first times.

Speaker 1

我想我当时写在我的书里了。

I think I wrote about it in my book.

Speaker 1

我们以为自己是第一个做这件事的人。

We thought we were the first people doing it.

Speaker 1

结果发现,当时谷歌也在同时做这件事。

It turned out that Google at that time was also simultaneously doing it.

Speaker 1

一位记者,纽约时报的约翰·马科夫,率先报道了谷歌的这项成果。

And a reporter, it was John Markov from New York Times, breaking the Google story.

Speaker 1

但他偶然听到了我们的消息。

But he, by accident, heard about us.

Speaker 1

然后他意识到,我们其实也是独立地在同一时间达成了同样的成果。

And then he realized that we really independently got there together at the same time.

Speaker 1

所以他写了关于谷歌研究以及安德烈和我的研究的故事。

So he wrote the story of both the Google research as well as Andre and my research.

Speaker 1

但那之后,我想贾斯汀那时已经在实验室了。

But after that, I think Justin was already in the lab at that time.

Speaker 0

嗯。

Yeah.

Speaker 0

嗯。

Yeah.

Speaker 0

我记得在组会上,安德烈展示了那些结果,并解释了我以前从未听说过的名为LSTM和RNN的新东西。

I remember the group meeting where Andre was presenting some of those results and explaining this new thing called LSTMs and RNNs that I had never heard of before.

Speaker 0

我当时想,哇。

And I thought, like, wow.

Speaker 0

这真是令人惊叹的东西。

This is really amazing stuff.

Speaker 0

我想我要研究这个。

I wanna I wanna work on that.

Speaker 0

所以他在2015年CVPR上发表了第一篇图像字幕研究的论文。

So then he had the paper at CVPR 2015, the first image captioning results.

Speaker 0

之后,我们开始合作,最初我们做了一篇关于语言建模的论文。

Then after that, we started working together, and first we did a paper actually just on language modeling

Speaker 1

是的。

Yep.

Speaker 0

那是在ICLR 2015。

Back at ICLR 2015.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

我本该坚持做语言建模的。

I I should have stuck with language modeling.

Speaker 0

结果证明

That turned out

Speaker 3

那真是

that was

Speaker 0

回过头来看,利润相当可观。

pretty lucrative in retrospect.

Speaker 0

但我和安德烈在2015年一起发表了一篇关于语言建模的论文,当时真的很棒。

But we did this language modeling paper together, me and Andre, in, 2015, where it was, like, really cool.

Speaker 0

我们训练了一些小型RNN语言模型,能够一次生成几句话,然后去研究它们,试图理解神经网络内部神经元的运作机制。

We trained these little RNN language models that could, you know, spit out a couple sentences at a time, and we poked at them and tried to understand what the neurons inside the neural network were doing.

Speaker 1

你们当时在分析不同记忆机制,比如

You guys were doing analysis on the different, like, memory and

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

那真的非常酷。

It was really cool.

Speaker 0

即使在那时,我们也有这样的结果:你可以观察LSTM内部,说,哦,这个东西在读代码。

Even at that time, we had these results where you could, like, look inside the LSTM and say, like, oh, this thing is reading code.

Speaker 0

我们为这个模型训练所用的数据集之一就是Linux源代码。

So one of the datasets that we trained on for this one was the Linux source code.

Speaker 0

对吧?

Right?

Speaker 0

因为整个项目都是开源的,你可以直接下载下来。

Because the whole the whole the whole thing is, you know, open source and you could just download this.

Speaker 0

所以我们用这个数据集训练了一个RNN。

So we train an RNN on this on this dataset.

Speaker 0

当网络试图预测这些标记时,你会尝试将它的预测类型与RNN内部结构联系起来。

And then as the network is trying to predict the tokens there, then, you know, try to correlate the kinds of predictions that it's making with the kind of internal structures, in the RNN.

Speaker 0

我们发现了一些相关性,比如:LSTM的某个单元和某层在出现左括号时会激活,而在出现右括号时会关闭,我们通过这类实证方法来探究其中的规律。

And there, we were able to find some correlations between, oh, like, this unit and this layer of the LSTM fires when there's an open paren and then, like, turns off when there's a closed paren, and try to do some empirical stuff like that to to figure it out.
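[Editor's note] The probing method described here, correlating a single hidden unit with a structural property of the text, can be sketched as follows. The "activation" below is a synthetic stand-in for a trained LSTM unit (reproducing the trained network is out of scope); the point is the measurement itself.

```python
import numpy as np

code = "if (x > 0) { f(a, g(b)); } else { h(); }"

# Ground-truth signal: is this character inside at least one open parenthesis?
inside = []
depth = 0
for ch in code:
    if ch == "(":
        depth += 1
    inside.append(1.0 if depth > 0 else 0.0)
    if ch == ")":
        depth -= 1

# Stand-in for a trained unit's per-character activation.  In the actual
# setup this came from the hidden state of an RNN trained on the Linux
# kernel source; here we fake a noisy unit that happens to track parens.
rng = np.random.default_rng(0)
activation = np.array(inside) + rng.normal(scale=0.1, size=len(inside))

# The probing step: correlate the unit with the structural signal.
r = np.corrcoef(activation, inside)[0, 1]
print(round(r, 2))
```

A unit whose correlation `r` is near 1.0 is the kind of "fires on open paren, turns off on close paren" cell described above; most units show no such correlation.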

Speaker 0

这真的很酷。

So that was pretty cool.

Speaker 0

那只是像把CNN从语言建模部分中剥离出来,单独研究语言模型。

And that was kind of like cutting out the CNN from the captioning system and just looking at the language models in isolation.

Speaker 1

但后来我们想扩展图像字幕的工作。

But then we wanted to extend the the image captioning work.

Speaker 1

记得当时,我们甚至已经有了空间感,因为我们觉得字幕无法捕捉图像的不同部分。

And remember, at that time, we even have a sense of space, because we feel like captioning does not capture different parts of the image.

Speaker 1

所以我跟贾斯汀和安德烈讨论,能不能实现我们后来称为密集字幕的技术,也就是更详细地描述场景,特别是场景的不同部分。

So I was talking to Justin and Andre about, can you do what we ended up calling dense captioning, which is, you know, describing the scene in greater detail, especially different parts of the scene.

Speaker 1

所以那就是

So that's

Speaker 0

是的。

Yeah.

Speaker 0

于是我们构建了系统,第二年我和安德烈以及李飞飞合作发表了一篇论文,即CVPR 2016,我们构建了这个实现密集字幕的系统。

And so then we built the systems, and it was me and Andre and Fei Fei on a paper the following year, CVPR 2016, where we built this system that did dense captioning.

Speaker 0

你输入一张图片,它就会在图像中所有有趣的部分周围画出框,并为每个部分生成一段简短的描述。

So you input a single image, and then it would draw boxes around all the interesting stuff in the image and then write a short snippet about each of them.

Speaker 0

就像是,桌上有一个绿色的水瓶。

It's like, oh, there's a green water bottle on the table.

Speaker 0

这是一个穿着黑色衬衫的人。

It's a person wearing a black shirt.

Speaker 0

这是一个非常复杂的神经网络,因为它建立在当时目标检测领域的一系列重大进展之上,而目标检测长期以来一直是计算机视觉的主要课题。

And this was a really complicated neural network, because it was built on a lot of advancements that had been made in object detection around that time, which had been a major topic in computer vision for a long time.

Speaker 0

然后,实际上它是一个联合的神经网络,能够同时学习观察单张图像。

And then it was actually, like, one joint neural network that was both, you know, learning to look at individual images.

Speaker 0

因为我实际上在该网络中使用了三种不同的表示方式。

Because I actually had, like, three different representations inside this network.

Speaker 0

一种是整个图像的表示,用于把握整体情境;然后它会提出希望关注的各个区域,并独立地表示每个区域。

One was the representation of the whole image, to kind of get the gestalt of what's going on; then it would propose individual regions that it wants to focus on, and then represent each region independently.

Speaker 0

一旦观察了某个区域,就需要为每个区域生成对应的文本。

And then once you look at the region, then you need to spit out text for each region.

Speaker 0

因此,这是一个相当复杂的神经网络架构。

So that was a pretty complicated neural network architecture.
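[Editor's note] The single-pass, three-representation flow described above (whole-image gestalt, per-region features, per-region text) can be sketched like this. Everything here is a hypothetical toy: the "image" is random, the vocabulary and weights are made up, and a one-word argmax stands in for the RNN that wrote a short snippet per box.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))  # toy "image"; the real input was a photo

VOCAB = ["a", "green", "bottle", "person", "table", "shirt"]
W_cap = rng.normal(size=(3, len(VOCAB)))  # toy feature -> word scores

def dense_caption(img, proposals):
    # 1) Whole-image representation (the "gestalt" of the scene).
    global_feat = img.mean(axis=(0, 1))
    outputs = []
    for (y0, x0, y1, x1) in proposals:
        # 2) Per-region representation, conditioned on the global context.
        region_feat = img[y0:y1, x0:x1].mean(axis=(0, 1)) + 0.1 * global_feat
        # 3) Per-region text: a single argmax word stands in for the RNN
        #    language model that generated a short snippet for each box.
        word = VOCAB[int(np.argmax(region_feat @ W_cap))]
        outputs.append(((y0, x0, y1, x1), word))
    return outputs

# In the real model the region proposals were predicted by the network
# itself; here they are hard-coded to keep the sketch short.
boxes = [(0, 0, 16, 16), (8, 8, 24, 24)]
for box, word in dense_caption(image, boxes):
    print(box, word)
```

The real system learned all three stages jointly in one forward pass; this sketch only mirrors the shape of that pipeline, box in, (box, snippet) out.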

Speaker 0

这都是在PyTorch之前的事了。

This was all pre PyTorch.

Speaker 3

它是单次完成的吗?

And does it do it in one pass?

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

所以这是一个单一的前向传播过程,完成了所有这些工作。

So it was a single forward pass that did all of that.

Speaker 1

通常,它都是单次完成的。

Normally, it was doing it in one pass.

Speaker 1

你还优化了推理过程。

You you also optimize inference.

Speaker 1

你是在网络摄像头上运行的。

You're doing it on a web cam.

Speaker 0

我记得。

I remember.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

所以我当时搭建了一个疯狂的实时演示,网络运行在斯坦福的服务器上,前端通过网页实时从摄像头采集视频流,然后把图像发送回服务器。

So I had built this, like, crazy real time demo, where I had the network running on a server at Stanford, and then a web front end that would stream from a webcam and then send the image back to the server.

Speaker 0

服务器运行模型并将预测结果流式返回。

The server would run the model and stream the predictions back.

Speaker 0

于是我拿着笔记本电脑在实验室里走来走去,哇。

So I was just, like, walking around the lab with this laptop Wow.

Speaker 0

它会向人们展示这个网络

That would just, like, show people this network

Speaker 3

实时识别和标注。

Identification in real time and labeling as well.

Speaker 1

是的。

Yes.

Speaker 1

哦,对啊。

Oh, yeah.

Speaker 1

对啊。

Yeah.

Speaker 1

对啊。

Yeah.

Speaker 1

哦。

Oh

Speaker 3

我的天。

my god.

Speaker 1

但这确实很令人印象深刻,因为我们的大多数研究生只要能发表论文就满足了。

But it was pretty impressive, because most of our graduate students would be satisfied if they could publish the paper.

Speaker 1

对吧?

Right?

Speaker 1

他们把研究打包成了一篇论文,但贾斯汀更进一步。

They packaged the research, put it in a paper, but Justin went a step further.

Speaker 1

他说:我想做一个实时的网页演示。

He's like, I wanna do this real time web demo.

Speaker 0

嗯,实际上,我不确定有没有跟你说过这个故事。

Well, actually, I don't I don't know if I had told you this story.

Speaker 0

但那一年我们在圣地亚哥参加了一个会议,是ICCV 2015。

But then there was a conference that year in Santiago, it was ICCV 2015.

Speaker 0

当时我有一篇论文在那个会议上,不过讲的是别的内容。

And then, like, I had a paper at that conference for something different.

Speaker 0

但我带了我的笔记本电脑。

But I had my my laptop.

Speaker 0

我拿着笔记本电脑在会场里到处走,给每个人展示这个实时字幕演示,而模型运行在加州的服务器上。

I was, like, walking around the conference with my laptop showing everybody this, like, real time captioning demo, and the model was running on a server in California.

Speaker 0

不错。

Nice.

Speaker 0

所以它真的能够从加利福尼亚实时流式传输到圣地亚哥。

So it was, like, actually able to stream, like, all the way from California down to Santiago.

Speaker 3

延迟非常糟糕。

Well, the latency was terrible.

Speaker 1

它就是

It was

Speaker 3

好吧。

like, okay.

Speaker 3

行了。

Alright.

Speaker 3

行了。

Alright.

Speaker 3

它有延迟。

It was delayed.

Speaker 0

大概一帧每秒,但它能运行起来就已经非常惊人了。

It was like one FPS, but the fact that it worked at all was pretty amazing.

Speaker 3

我本来想简单提一句,也许视觉和语言建模并没有那么不同。

I was gonna briefly quip that, you know, maybe vision and language modeling are not that different.

Speaker 3

你知道,DeepSeek OCR 最近尝试了一个疯狂的想法:从像素中建模文本,然后直接在这上面训练。

You know, DeepSeek OCR recently tried the crazy thing of, let's model text from pixels and just, like, train on that.

Speaker 3

这可能是未来,我不确定。

And it might be the future, I don't know.

Speaker 3

我不知道你们有没有什么看法,认为语言是否真的完全必要。

I don't know if you guys have any takes on whether language is actually necessary at all.

Speaker 1

我刚刚写了一整份宣言。

I just wrote a whole manifesto

Speaker 3

是的。

Yeah.

Speaker 3

空间智能。

Spatial intelligence.

Speaker 3

这正是我想引出的话题。

This is my segue into this.

Speaker 3

是的。

Yes.

Speaker 1

我认为它们是不同的。

I think they are different.

Speaker 1

我认为这些生成模型的架构会共享许多可复用的组件。

I do think the architectures of these generative models will share a lot of common components.

Speaker 1

但我认为,深度的三维、四维空间世界具有某种结构,这种结构与纯粹的一维生成信号有着根本的不同。

But I think the deeply three d, four d spatial world has a level of structure that is fundamentally different from a purely generative signal that is one dimensional.

Speaker 0

嗯。

Yeah.

Speaker 0

我认为像素至上主义是有道理的。

I I think there's something to be said for pixel maximalism.

Speaker 0

对吧?

Right?

Speaker 0

比如,人们觉得语言是某种不同的东西,但其实我们是用眼睛来看语言的,而我们的眼睛本质上就是由像素组成的。

Like, there's this notion that language is this different thing, but we see language with our eyes, and our eyes are just, you know, basically pixels.

Speaker 0

对吧?

Right?

Speaker 0

比如,我们眼睛后部有类似生物像素的结构在处理这些信息。

Like, we've got sort of biological pixels in the back of our eyes that are processing these things.

Speaker 0

我们看到文字时,认为它是一种离散的东西,但那其实只存在于我们的脑海中。

And, you know, we see text and we think of it as this discrete thing, but that really only exists in our minds.

Speaker 0

文字和语言在现实世界中的物理表现形式,其实是印在各种物体上的实体,我们用眼睛看到它们。

Like, the physical manifestation of text and language in our world are, you know, physical objects that are printed on things in the world and we see it with our eyes.

Speaker 1

你也可以认为它是声音。

Well, you can also think it's sound.

Speaker 1

但即使如此

But even

Speaker 0

哦,当然。

Oh, sure.

Speaker 0

声音,当然。

Sound, sure.

Speaker 1

当然。

Sure.

Speaker 1

即使是声音,你也可以转化为

Even sound, you can translate

Speaker 3

一种频谱图。

into a spectrogram.

Speaker 1

是的。

Yeah.

Speaker 1

你会得到一个频谱图,这是一种二维信号。

You get a spectrogram, which is a two d signal.

Speaker 0

对。

Right.

Speaker 0

而且,如果你将它转化为我们用于大语言模型的这种纯粹的符号化表示,实际上会失去一些东西。

And then, like, you actually lose something if you translate to this, like, purely tokenized representations that we use in LLMs.

Speaker 0

对吧?

Right?

Speaker 0

比如,你会丢失字体、丢失换行、丢失页面上那种二维的排版布局。

Like, you lose the font, you lose you lose the line breaks, you lose sort of the two d arrangement on the page.

Speaker 0

而在很多情况下,对于很多事物来说,这可能并不重要。

And and for a lot of cases for lot of things, maybe that doesn't matter.

Speaker 0

但对某些事情来说,这确实很重要。

But for some things, it does.

Speaker 0

我认为像素是一种更少损失的表示方式,能更真实地反映世界中的内容。

And I I think pixels are this sort of more more lossless representation of what's going on in the world.

Speaker 0

在某些方面,它也是一种更通用的表示方式,更贴近我们人类在探索世界时所看到的东西。

And in in some ways, a more general general representation that more matches what what we what we humans see as we as we navigate the world.

Speaker 0

所以,这里有一个效率方面的论点。

So so, like, there's an efficiency argument to be made.

Speaker 0

比如,也许把文本渲染成图像再输入视觉模型,并不是特别高效的做法。

Like, maybe it's not super efficient to, like, you know, render your text to an image and then feed that to a vision model.

Speaker 3

这正是DeepSeek所做的事情。

That's exactly what DeepSeek did.

Speaker 3

是的。

Yeah.

Speaker 3

它确实有点奏效。

It it was, like, kinda worked.

Speaker 2

我认为这与整个世界模型有关。

I think this ties into the whole world model.

Speaker 2

比如,今年我看到的最喜爱的论文之一,是关于归纳偏差来探测世界模型的。

Like, one of the my favorite papers that I saw this year was about inductive bias to probe for world models.

Speaker 2

这是一篇哈佛大学的论文,他们向大语言模型输入了大量的轨道模式,然后让模型预测行星围绕太阳的轨道。

So it was a Harvard paper where they fed a lot of orbital patterns into an LLM, and then they asked the LLM to predict the orbit of a planet around the sun.

Speaker 2

模型生成的结果看起来不错,但如果你让它画出受力矢量,就会变得一团糟。

And what the model generated looked good, but then if you asked it to draw the force vectors, they would be all wacky.

Speaker 2

你知道的?

You know?

Speaker 2

它实际上并不会遵循那个规律。

It wouldn't actually follow it.
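
The probe described here can be reproduced in miniature: finite-difference a sampled orbit to recover the acceleration the trajectory implies, then check it against what gravity demands for a circular orbit, namely that the force vector points at the sun (`a = -w²r`). This is a toy check, not the paper's code:

```python
import numpy as np

# a circular orbit of a planet around a sun at the origin: r(t) = (cos wt, sin wt)
w, dt = 1.0, 1e-3
t = np.arange(0.0, 1.0, dt)
orbit = np.stack([np.cos(w * t), np.sin(w * t)], axis=1)

# finite-difference the trajectory to recover the implied acceleration (force/mass)
acc = (orbit[2:] - 2.0 * orbit[1:-1] + orbit[:-2]) / dt**2

# for true gravity on a circular orbit, acceleration points at the sun: a = -w^2 r
r = orbit[1:-1]
residual = np.abs(acc + w**2 * r).max()
```

A model that merely fits the positions can pass the trajectory test while the same finite-difference probe on its rollouts yields force vectors that point nowhere near the sun, which is exactly the failure the paper reports.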

Speaker 2

那么,你怎么看待你所获得的数据中嵌入了什么内容?

So how do you think about what's embedded into the data that you get?

Speaker 2

我们可以谈谈三维世界模型的分词问题。

And we can talk about maybe tokenizing for three d world models.

Speaker 2

比如,信息的维度是什么?

Like, what are, the dimensions of information?

Speaker 2

有视觉信息,但你到底需要从这些数据中提取出多少潜在的隐藏力呢?

There's the visual, but, like, how much of, like, the underlying hidden forces, so to speak, you need to extract out of this data?

Speaker 2

在这方面,有哪些挑战?

And, like, what are some of the challenges there?

Speaker 0

是的。

Yeah.

Speaker 0

我觉得解决这个问题有几种不同的方法。

I I think there's different ways you could approach that problem.

Speaker 0

一种方法是明确地去做,比如说,我想测量所有的力,然后把这些作为训练数据输入到模型中。

One is, like, you could try to be explicit about it and say, like, oh, I want to, you know, measure all the forces and feed those as training data to your model.

Speaker 0

对吧?

Right?

Speaker 0

然后你可以运行一个传统的物理模拟,从而知道场景中所有的力,再用这些力作为训练数据来训练一个模型,希望它能预测这些力。

Then you could, like, sort of run a traditional physics simulation and, you know, then know all the forces in the scene and then use those as as training data to train a model that's now gonna hopefully predict those.

Speaker 0

或者你可以寄希望于某些东西能更隐式地涌现出来。

Or you could hope that something emerges more latently.

Speaker 0

对吧?

Right?

Speaker 0

你可以端到端地训练某个东西,针对一个更通用的问题,然后希望模型内部的某些部分能够学会模拟类似物理的机制,以便做出正确的预测。

That you kind of train on something end to end, on a more general problem, and then hope that something in the internals of the model learns to model something like physics in order to make the proper predictions.

Speaker 0

而这两种方式基本上是我们目前更普遍的两大范式。

And those are kind of the two big paradigms that we have more generally.

Speaker 1

但并没有任何迹象表明,这种隐式建模能让你得出空间和动力学的因果定律。

But there's no indication that those latent modeling will get you to a causal law of space and dynamics.

Speaker 1

对吧?

Right?

Speaker 1

这正是当今深度学习与人类智能真正开始分道扬镳的地方。

That's where today's deep learning and human intelligence actually start to bifurcate.

Speaker 1

因为从根本上说,深度学习仍然是在拟合模式。

Because fundamentally, the deep learning is still fitting patterns.

Speaker 0

能聊哲学挺好的。

Very useful to get philosophical.

Speaker 0

你说我们也在试图拟合模式,但也许我们是在拟合更广泛的一系列模式,时间跨度更长,奖励函数也不同。

And you could say that, like, we're trying to fit patterns too, but maybe we're trying to fit a more broad array of patterns, over a longer time horizon, with a different reward function.

Speaker 0

但基本上,你提到的那篇论文就是这个问题:它学会了拟合轨道的具体模式,但却没有像你期望的那样真正泛化。

But, like, basically, the paper you mentioned is sort of, you know, that problem: it learns to fit the specific patterns of orbits, but then it doesn't actually generalize in the way that you'd like.

Speaker 0

它没有一种对引力的因果模型。

It doesn't have a sort of causal model of gravity.

Speaker 2

对。

Right.

Speaker 2

因为即使是Marble,你知道,我也在试用它。

Because even in Marble, you know, I was trying it out.

Speaker 2

它生成了这些美丽的景观,里面还有拱门。

It generates these beautiful sceneries, and there's, like, arches in them.

Speaker 2

但这个模型真的理解拱门是如何依靠中央的拱心石承重,以及其实际物理结构的吗?

But does the model actually understand how, you know, the arch is actually bearing on the central keystone, and, like, you know, the actual physical structure of it?

Speaker 2

另一个问题是,只要它总能渲染出符合我们想象的物理模型的东西,它是否真的理解这些有关系吗?

And the other question is, like, does it matter that it does understand it as long as it always renders something that would fit the physical model that we imagine?

Speaker 1

如果你用‘理解’这个词的本意,我敢肯定模型并不理解。

If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it.

Speaker 1

模型只是从数据中学习,从模式中学习。

The model is learning from the data, learning from the pattern.

Speaker 1

是的,对于这些应用场景来说,这重要吗?

Yeah, does it matter, especially for the use cases?

Speaker 1

这是个好问题。

It's a good question.

Speaker 1

目前,我认为这并不重要,因为只要它能完美地渲染出你需要的内容。

For now, I don't think it matters because it renders out what you need, assuming it's perfect.

Speaker 0

是啊,这取决于使用场景。

Yeah, mean, depends on the use case.

Speaker 0

如果使用场景是我想为虚拟电影或制作生成某种背景之类的东西,那你只需要看起来合理就行。

If the use case is, I want to generate sort of a backdrop for virtual film production or something like that, all you need is something that looks plausible.

Speaker 0

在这种情况下,可能确实不重要。

And in that case, probably it doesn't matter.

Speaker 0

但如果你打算用它来设计一座你之后要在现实世界中建造的建筑,比如你是个建筑师,那就有必要正确建模受力情况了,因为你不希望建筑在实际建造时倒塌。

But if you're gonna use this to, like, you know, if you're an architect and you're gonna use this to design a building that you're then gonna go build in the real world, then, yeah, it does matter that you model the forces correctly, because you don't want the thing to break when you actually build it.

Speaker 1

但即便如此,即使你的模型包含了语义,我仍然认为模型层面的‘理解’和人类层面的‘理解’是两回事。

But even there, right, even if your model has the semantics in it, let's say, I still don't think the understanding of the signal or the output on the model's part and the understanding on the human's part are the same thing.

Speaker 1

但这又回到了哲学层面。

But this gets, again, philosophical.

Speaker 0

是啊。

Yeah.

Speaker 0

我的意思是,理解这件事有个技巧。

I mean, there there's this trick with understanding.

Speaker 0

对吧?

Right?

Speaker 0

这些模型的智能类型与人类智能非常不同。

Like, these models are a very different kind of intelligence than human intelligence.

Speaker 0

而人类智能很有趣,因为我觉得我之所以能理解事物,是因为我能在某种程度上内省自己的思维过程。

And human intelligence is interesting because, you know, you know, I think that I understand things because I can introspect my own thought process to some extent.

Speaker 0

然后我相信我的思维过程可能和别人的类似,所以当我观察他人的行为时,我会推断他们的内心状态可能和我观察到的自己的内心状态相似。

And then I believe that my thought process probably works similar to other people's so that when I observe someone else's behavior, then I infer that their internal mental state is probably similar to my own internal mental state that I've observed.

Speaker 0

因此,我知道自己能理解事物,所以我假设你也理解某些东西。

And therefore, I know that I understand things, so there I assume that you understand something.

Speaker 0

但这些模型就像一种外星般的智能形式,它们能做出非常有趣的事情。

But these models are sort of like this this alien form of intelligence where they can do really interesting things.

Speaker 0

它们能展现出非常有趣的行为。

They can exhibit really interesting behavior.

Speaker 0

但无论它们是否具有类似内部认知或自我反思的东西,如果确实存在的话,也与我们的完全不同。

But whatever equivalent of internal cognition or internal self-reflection they have, if it exists at all, is totally different from what we do.

Speaker 0

所以这是

So it's

Speaker 1

它没有自我意识。

It doesn't have the self awareness.

Speaker 2

对。

Right.

Speaker 0

但这意味着,当我们观察到这些系统表现出看似有趣或智能的行为时,我们并不能因此推断出它们的其他特性,因为它们对世界的认知模型和思考方式与我们截然不同。

But what that means is that when when we observe seemingly interesting or intelligent behavior out of these systems, we can't necessarily infer other things about them because their their model of the world and the way they think is so different from us.

Speaker 2

所以,你认为最终需要两个不同的模型来分别处理视觉任务和建筑生成吗?

So would you need two different models to do the visual one and the architectural generation, you think, eventually?

Speaker 2

就像,你目前在模型构建上的方法并没有什么根本性的差异。

Like, there's not anything fundamental about the approach that you've taken on the model building.

Speaker 2

这更多是关于扩大模型规模和提升其能力?

It's more about scaling the model and the capabilities of it?

Speaker 2

还是说,过于依赖视觉特性会阻碍你真正理解背后的物理原理,从而无法信任它生成的CAD设计在现实世界中真的能正常工作?

Or is there something about being very visual that prohibits you from actually learning the physics behind this, so to speak, so that you could trust it to generate a CAD design that then is actually gonna work in the real world?

Speaker 1

我认为这只是一个数据规模和模型优化的问题。

I think this is a matter of scaling data and bettering the model.

Speaker 1

我不认为这两者之间存在根本性的差异。

I don't think there's anything fundamental that separates these two.

Speaker 0

是的。

Yeah.

Speaker 0

我希望它能成为一个统一的模型。

I would like it to be one model.

Speaker 0

但我认为,从某种意义上说,深度学习的一个大问题是:如何让你的模型产生超越训练数据的涌现能力?

But I think, like, the big problem in deep learning in some sense is how do you get emergent capabilities beyond your training data?

Speaker 0

你能否让模型在没有专门训练预测力的情况下,却能内在地理解这些力?

Are you gonna get something that understands the forces while it wasn't trained to predict the forces, but it's gonna learn them implicitly internally?

Speaker 0

我认为我们在其他大型模型中看到的很多现象表明,这种涌现行为确实在规模扩大时会发生。

And I think a lot of what we've seen in other large models is that a lot of this emergent behavior does happen at scale.

Speaker 0

这种能力会转移到其他模态、其他应用场景和其他任务吗?

And will that transfer to other modalities and other use cases and other tasks?

Speaker 0

我希望如此。

I hope so.

Speaker 0

但这将是一个需要时间逐步验证的过程。

But that'll that'll be a a process that we need to play out over time and see.

Speaker 3

有没有可能直接使用现成的物理引擎?毕竟游戏行业已经帮我们省去了大量工作,还是我们必须因为某些根本性差异而重新发明一切?

Is there a temptation to rely on physics engines that already exist out there that are you know, basically, gaming industry has saved you a lot of this work, or do we have to reinvent things for some fundamental mismatch?

Speaker 3

我认为这有点像

I think that's sort of

Speaker 0

攀登技术的阶梯。

like climbing the ladder of technology.

Speaker 0

对吧?

Right?

Speaker 0

某种程度上,你之所以想构建这些系统,正是因为传统的物理引擎在某些情况下并不适用。

Like, some sense, the reason that you wanna build these things at all is because maybe traditional physics engines don't work in some situations.

Speaker 0

如果一个物理引擎是完美的,我们就没必要构建模型了,因为问题已经被解决了。

If a physics engine was perfect, we would have sort of no need to build models because the problem would have already been solved.

Speaker 0

所以某种程度上,我们之所以想做这件事,是因为经典物理引擎无法像我们希望的那样普遍地解决问题。

So in some sense, the reason why we want to do this is because classical physics engines don't solve problems in the generality that we want.

Speaker 0

但这并不意味着我们需要抛弃它们,从头开始一切。

But that doesn't mean we need to throw them away and start everything from scratch.

Speaker 0

对吧?

Right?

Speaker 0

我们可以使用传统的物理引擎生成数据,然后用这些数据来训练我们的模型。

We can use traditional physics engines to generate data that we then train our models on.

Speaker 0

然后你就相当于把物理引擎的原理提炼到了你训练的神经网络的权重中。

And then you're sort of distilling the physics engine into the weights of the neural network that you're training.
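
The distillation idea described here can be sketched in a few lines: a toy "physics engine" (a point mass under gravity, an assumption chosen for illustration) generates state-transition pairs, and a least-squares fit stands in for training a network whose weights absorb the engine:

```python
import numpy as np

GRAVITY, DT = -9.8, 0.1

def physics_step(state):
    """Classical 'engine': a point mass under gravity; state = (height, velocity)."""
    y, vy = state
    return np.array([y + vy * DT, vy + GRAVITY * DT])

# 1) roll the engine on random states to generate training pairs
rng = np.random.default_rng(0)
states = rng.uniform(-10.0, 10.0, size=(500, 2))
targets = np.array([physics_step(s) for s in states])

# 2) "distill": fit next_state = [state, 1] @ W by least squares (training stand-in)
X = np.hstack([states, np.ones((500, 1))])
W, *_ = np.linalg.lstsq(X, targets, rcond=None)

# 3) the fitted weights now reproduce the engine on unseen states
test_state = np.array([2.0, -1.0])
pred = np.array([2.0, -1.0, 1.0]) @ W
```

Because these toy dynamics are affine, the fit recovers them exactly; the real appeal is that the same recipe (engine generates data, model fits it) works even where the learned model can then generalize beyond what the engine covers.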

Speaker 3

我认为,如果你对比其他实验室的工作,很多人推测,比如Sora就有一点这样的成分。

I think if you compare the work of other labs, people are speculating that, you know, Sora had a little bit of that.

Speaker 3

Genie three,嗯。

Genie three Mhmm.

Speaker 3

有一点那种感觉。

Had a bit of that.

Speaker 3

Genie three 就像是一个电子游戏。

Genie three is, like, explicitly like a video game.

Speaker 3

比如,你可以控制角色

Like, you have controls to

Speaker 1

是的。

Yeah.

Speaker 3

在其中走动。

To walk around in.

Speaker 3

我总是觉得,我们为娱乐而发明的东西,最终竟然真的会进入严肃的工作领域,这真的很有趣。

And I I I always think, like, it's really funny how the things that we invent for fun actually does eventually make it into serious work.

Speaker 1

嗯。

Mhmm.

Speaker 1

没错。

Yep.

Speaker 1

整个AI革命部分起源于图形芯片。

The whole AI revolution started by graphics chips partially.

Speaker 3

从使用GPU生成大量三角形,转变为生成其他各种各样的东西。

Moving from using the GPU for generating a lot of triangles to generating a lot of everything else, basically.

Speaker 3

是的。

Yeah.

Speaker 3

我们稍微提到了Marble。

We touched on marble a little bit.

Speaker 3

我觉得你们选择Marble,某种程度上像是在结束低调状态,如果能这么说的话。

I think you guys chose marble as I kind of feel like you're sort of a little bit coming out of stealth moment if you can call it that.

Speaker 3

是的。

Yeah.

Speaker 3

也许你们能给我们一个简洁的解释,说明人们应该从中理解什么?因为在这里每个人都可以尝试Marble,但我认为他们可能无法将其与你们的愿景和其他实验室所推出的生成式世界之间的区别联系起来。

Maybe we can get a concise explanation from you on what people should take away, because everyone here can try Marble, but they might not be able to link it to the differences between your vision and the other generative worlds they may have seen from other labs.

Speaker 1

Marble是我们模型的一个缩影。

So Marble is a glimpse into our model.

Speaker 1

对吧?

Right?

Speaker 1

我们是一家空间智能模型公司。

We are a spatial intelligence model company.

Speaker 1

我们相信空间智能是下一个前沿。

We believe spatial intelligence is the next frontier.

Speaker 1

为了构建空间智能模型,模型必须在理解、推理和以多模态方式生成世界方面非常强大,同时还要实现我们最终希望达到的、与人类交互世界一样复杂的互动水平。

In order to make spatially intelligent models, the model has to be very powerful in terms of its ability to understand, reason, generate in very multimodal fashion of worlds, as well as allow the level of interactivity that we eventually hope to be as complex as how humans can interact with the world.

Speaker 1

这就是空间智能的宏伟愿景,以及我们所设想的世界模型。

So that's the grand vision of spatial intelligence, as well as the kind of world models we see.

Speaker 1

Marble 是这一愿景的首次展现。

Marble is the first glimpse into that.

Speaker 1

这是这段旅程的第一步。

It's the first part of that journey.

Speaker 1

它是世界上首个向公众开放的、能够以这种保真度生成三维世界的同类首创模型。

It's the first in class model in the world that generates three d worlds in this level of fidelity that is in the hands of the public.

Speaker 1

这是起点。

It's the starting point.

Speaker 1

实际上,他写了这篇技术博客。

He actually wrote this tech blog.

Speaker 1

贾斯汀花了大量时间撰写这篇技术博客。

Justin spent a lot of time writing that tech blog.

Speaker 1

我不知道你有没有时间浏览一下。

I don't know if you had time to browse it.

Speaker 1

贾斯汀详细剖析了Marble的输入方式——我们能使用哪些多模态输入,哪些可编辑性允许用户与模型互动,以及我们能获得哪些类型的输出。

I mean, Justin really broke it down into what are the multimodal inputs of Marble, what are the kinds of editability which allow the user to interact with the model, and what are the kinds of outputs we can have?

Speaker 0

是的。

Yeah.

Speaker 0

所以,Marble,从某种角度来看,它是一个生成三维世界的生成模型。

So Marble, like, basically, one way of looking at it, it's a generative model of three d worlds.

Speaker 0

对吧?

Right?

Speaker 0

你可以输入文本、图像或多个图像,它会为你生成一个与这些输入相匹配的三维世界。

So you can input things like text or image or multiple images, and it will generate for you a three d world that kind of matches those inputs.

Speaker 0

而且它还具有交互性,你可以交互式地编辑场景。

And it's also interactive in the sense that, you can interactively edit scenes.

Speaker 0

比如,我可以生成这个场景,然后说:我不喜欢这个水瓶。

Like, I could generate this scene and then say, don't like the water bottle.

Speaker 0

把它改成蓝色。

Make it blue instead.

Speaker 0

把桌子拿掉。

Like, get take out the table.

Speaker 0

把这些麦克风重新摆放一下。

Like, may change these microphones around.

Speaker 0

然后你可以根据这些交互式编辑生成新的世界,并导出为多种格式。

And then you can generate new worlds based on these interactive edits and export in a variety of formats.

Speaker 0

使用Marble,我们实际上同时想实现两个目标,我认为我们很好地把握了平衡。

And with Marble, we were actually trying to do two things simultaneously, and I think we managed to pull off the balance pretty well.

Speaker 0

一个是构建一个朝着空间智能宏伟愿景迈进的模型。

One is actually build a model that goes towards the grand vision of spatial intelligence.

Speaker 0

模型需要能够理解多种类型的输入,需要能够在各种情境下建模世界,还需要能够模拟它们随时间变化的反事实情况。

And models need to be able to understand lots of different kinds of inputs, need to able to model worlds in a lot of situations, need to be able to model counterfactuals of how they could change over time.

Speaker 0

因此,我们希望开始构建具备这些能力的模型,而今天的Marble已经初步具备了所有这些特征。

So we wanted to start to build models that have these capabilities, and Marble Marble today does already have hints of all of these.

Speaker 0

但与此同时,我们是一家公司。

But at the same time, we're a company.

Speaker 0

我们是一家企业。

We're a business.

Speaker 0

我们真正努力避免这只是一个科研项目,而是要打造一个对现实世界中人们真正有用的产品。

We were really trying not to have this be a science project, but also build a product that would be useful to people in the real world today.

Speaker 0

因此,尽管Marble同时是一个朝着空间智能愿景迈进的世界模型,但它也被有意设计成一个今天就能为人们带来实用价值的产品。

So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today.

Speaker 0

我们已经开始看到一些新兴的应用场景,涵盖游戏、视觉特效和电影领域,我认为Marble今天作为一款产品就能做很多非常有趣的事情,同时也为未来我们希望构建的宏大世界模型奠定了基础。

And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.

Speaker 3

是的。

Yeah.

Speaker 3

我注意到一个非常有趣的工具,你可以在里面录制你的场景。

I noticed one tool that was very interesting because you can record your scene inside.

Speaker 3

是的。

Yes.

Speaker 1

是的。

Yes.

Speaker 1

这非常重要。

It's very important.

Speaker 1

录制功能意味着对摄像机位置的精确控制。

The ability to record means a very precise control of camera placement.

Speaker 1

为了实现精确的摄像机定位,你必须具备对三维空间的感知。

In order to have precise camera placement, it means you have to have a sense of three d space.

Speaker 1

否则,你就不知道如何调整和移动摄像机。

Otherwise, you don't know how to orient your camera and how to move your camera.

Speaker 1

因此,这是这种模型的自然结果,这也是为什么这仅仅是其中一个例子。

So that is a natural consequence of this kind of model, and and this is why this is just one of the examples.

Speaker 3

是的。

Yeah.

Speaker 3

当我使用视频生成模型时,我发现我必须学习导演的语言,因为我得去移动它们,比如平移,你知道的,比如,

I I find when I play with video generative models, I'm having to learn the language of being a director because I have to move them up, like pan, you know, like,

Speaker 1

停滞不前。

stalling out.

Speaker 1

你不能说将镜头向北平移六十三度。

You cannot say pan sixty three degrees to the north.

Speaker 1

对吧?

Right?

Speaker 1

你根本没有这种控制能力。

You just don't have that control.

Speaker 1

而在Marble中,你可以精确地控制摄像机的定位。

Whereas in Marble, you have precise control in terms of placing a camera.
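
For illustration, the kind of numeric camera control being contrasted with "pan"-style text prompts can be written down directly; a sketch of a yaw rotation applied to a camera's forward vector (the `pan` helper is hypothetical, not Marble's API):

```python
import numpy as np

def pan(forward, degrees):
    """Rotate a camera forward vector about the world up (+y) axis by `degrees`."""
    t = np.radians(degrees)
    # standard yaw rotation matrix about +y
    R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return R @ forward

forward = np.array([0.0, 0.0, -1.0])  # camera looking down -z
panned = pan(forward, 63.0)           # "pan sixty three degrees", exactly
```

With a persistent 3D scene, a camera pose is just a position plus such a rotation, which is what makes frame-accurate recording paths possible.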

Speaker 2

是的。

Yeah.

Speaker 2

我认为这是人们需要理解的第一件事。

I think that's one of the first things people need to understand.

Speaker 2

它不像其他许多模型那样逐帧生成。

It's like, it's not you're not generating frame by frame, which is what a lot of the other models are.

Speaker 2

人们知道大型语言模型会生成一个词元。

You know, people understand that an LLM generates one token.

Speaker 2

那么,原子单位是什么?

What are, like, the atomic units?

Speaker 2

有点像是,你知道的,网格。

There's kinda, like, know, the meshes.

Speaker 2

还有泼溅、体素。

There's, like, the splats, the voxels.

Speaker 2

三维世界中有许多组成部分。

There's a lot of pieces in a three d world.

Speaker 2

人们应该对你们的生成方式形成怎样的心理模型呢?

What should be the mental model that people have of, like, your generations?

Speaker 0

嗯。

Yeah.

Speaker 0

我觉得现在存在的和未来可能存在的是两回事。

I I think there's, like, what exists today and what could exist in the future.

Speaker 0

目前存在的是模型原生输出的泼溅(splat)。

So what exists today is the model natively outputs splats.

Speaker 0

高斯泼溅就是这些小粒子,每个都是极其微小、半透明的,具有三维空间中的位置和方向,整个场景由大量这样的高斯泼溅构建而成。

So Gaussian splats are these, like, you know, each one is a tiny, tiny particle that's semi transparent, has a position and orientation in three d space, and the scene is built up from a large number of these Gaussian splats.

Speaker 0

高斯泼溅非常酷,因为你可以非常高效地实时渲染它们。

And Gaussian splats are really cool because you can render them in real time really efficiently.

Speaker 0

你可以在iPhone上渲染,渲染一切东西。

So you can render on your iPhone, render everything.

Speaker 0

这正是我们能实现精确摄像机控制的原因,因为泼溅可以在几乎任何客户端设备上实时渲染。

And that's how we get that sort of precise camera control, because splats can be rendered in real time on pretty much any client side device that we want.

Speaker 0

因此,对于我们今天生成的许多场景来说,这种基本单元就是单个的点云元素。

So for a lot of the scenes that we're generating today, that kind of atomic unit is that individual splat.

Speaker 0

但我不认为这是根本性的。

But I don't think that's fundamental.

Speaker 0

我可以想象未来可能出现其他有趣的方法。

I I could imagine other approaches in the future that would be interesting.

Speaker 0

事实上,我们World Labs也研究过其他方法,比如我们最近的RTFM模型,它一次生成一帧。

So there, like, there are other approaches that even we've worked on at World Labs, like our recent RTFM model that does generate frames one at a time.

Speaker 0

在这种情况下,基本单元是随着用户与系统交互,逐帧生成内容。

And there, the atomic unit is generating frames one at a time as the user interacts with the system.

Speaker 0

或者你可以想象未来其他架构,其基本单元是一个令牌,这个令牌代表三维世界中的一块内容。

Or you could imagine other architectures in the future where the atomic unit is a token, where that token now represents, you know, some chunk of the three d world.

Speaker 0

我认为随着时间推移,我们可以在这里探索多种不同的架构。

And I think there's a lot of different architectures that we can experiment with here over time.

Speaker 3

我想再深入探讨一下这一点。

I do wanna press on double click on this a little bit.

Speaker 3

我理解Alessio本来想说的是:世界模型的基本数据结构是什么?

My version of what Alessio was gonna say was, like, what is the fundamental data structure of a world model?

Speaker 3

因为正如你所说,它要么是泼溅,要么是帧之类的。

Because exactly, like you said, it's either, sort of, splats or it's, like, frames, what have you.

Speaker 3

你在之前的陈述中还特别强调了物理和力,粗略地说,就是随时间变化的东西。

You also, in the previous statements, focused a lot on the physics and the forces, which is, loosely, something over time.

Speaker 3

我在Marble中没看到这一点。

I don't see that in marble.

Speaker 3

我猜这还没实现。

I presume it's not there yet.

Speaker 3

也许如果有Marble 2,就会有运动了。

Maybe if there was a Marble 2, you would have movement.

Speaker 3

或者高斯splat有没有什么修改方式是合理的,还是说这会是完全不同的东西?

Or is there a modification to Gaussian splats that makes sense, or would it be something completely different?

Speaker 0

是的。

Yeah.

Speaker 0

我觉得有几种修改是有意义的。

I I think there's a couple modifications that make sense.

Speaker 0

实际上,这里有很多有趣的方式可以整合这些内容,这也是在这个领域工作的一个很好的切入点。

And there's actually a lot of interesting ways to integrate things here, which is another nice part of working in this space.

Speaker 0

实际上,这方面已经有很多研究工作了。

Then there's actually been a lot of research work on this.

Speaker 0

当你谈到一些疯狂的想法时,其实已经有很多非常有趣的学术研究,探讨了如何赋予物理特性。

Like, when you talk about wacky ideas, like, there's actually been a lot of really interesting academic work on different ways to imbue physics.

Speaker 1

在工业界你也可以提出疯狂的想法。

You can also do wacky ideas in industry.

Speaker 1

是的。

Yeah.

Speaker 0

好吧。

Alright.

Speaker 0

但高斯溅射本身就像是一个个小粒子。

But then, it's like, Gaussian splats are themselves little particles.

Speaker 0

已经有很多方法,基本上是将物理属性附加到这些点上,认为每个点都有质量,或者将每个点视为与附近邻居通过某种虚拟弹簧耦合。

There's been a lot of approaches where you basically attach physical properties to those splats and say that each one has a mass, or maybe you treat each one as being coupled with some kind of virtual spring to nearby neighbors.

Speaker 0

现在你可以在点的基础上开始进行物理模拟。

Now you can start to do sort of physics simulation on top of splats.

Speaker 0

因此,为这些点添加物理、动力学或交互的一种途径是,预测每个点粒子相关的物理属性,然后使用经典物理或某种学习方法进行下游模拟。

So one kind of avenue for adding, adding physics or dynamics or interaction to these things would be to, you know, predict physical properties associated with each of your splat particles and then simulate those downstream either using classical physics or something learned.
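
The virtual-spring idea can be sketched directly: treat each splat as a point mass and integrate Hooke's-law forces between coupled neighbors (a toy explicit-Euler step for illustration, not any production simulator):

```python
import numpy as np

def spring_step(pos, vel, mass, springs, rest_len, k=50.0, dt=0.01):
    """Advance splat particles one explicit-Euler step under virtual springs."""
    forces = np.zeros_like(pos)
    for i, j in springs:
        d = pos[j] - pos[i]
        dist = np.linalg.norm(d)
        # Hooke's law: force along the spring, proportional to stretch
        f = k * (dist - rest_len) * d / dist
        forces[i] += f
        forces[j] -= f
    vel = vel + dt * forces / mass[:, None]
    return pos + dt * vel, vel

# two "splats" one unit farther apart than the spring's rest length
pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
vel = np.zeros_like(pos)
mass = np.ones(2)
pos2, vel2 = spring_step(pos, vel, mass, springs=[(0, 1)], rest_len=1.0)
```

In the approaches being described, the mass, stiffness, and coupling would be predicted per splat by the model rather than hand-set as they are here, and the simulation runs on top of the generated scene.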

Speaker 0

或者,你知道,在三维空间中工作的美妙之处在于事物可以组合,你可以在不同地方注入逻辑。

Or, you know, the the kind of the beauty of working in three d is things compose and you can inject logic in different places.

Speaker 0

所以一种方式是我们生成一个三维场景,预测场景中所有事物的三维属性,然后使用经典物理引擎模拟它们之间的交互。

So one way is sort of like we're generating a three d scene, we're gonna predict three d properties of everything in the scene, then we use a classical physics engine to to simulate the interaction.

Speaker 0

或者你可以做这样的事情:作为用户操作的结果,模型现在将重新生成整个场景,以点或其他形式表示。

Or you could do something where, like, as a result of a user action, the model is now going to regenerate the entire scene, in in in splats or some other representation.

Speaker 0

这可能会更加通用,因为你不再受限于你已知如何建模的任何物理属性。

And that could potentially be a lot more general because then you're not bound to whatever sort of, you know, physical properties you know how to model already.

Speaker 0

但这也对计算要求高得多,因为你需要在用户操作后重新生成整个场景。

But that's also a lot more computationally demanding because then you'd need to regenerate the whole scene in response to the to user actions.

Speaker 0

但我认为这是一个非常有趣的研究方向,未来可以在此基础上进一步拓展,正如你所说,应用于Marble 2。

But I think this is a really interesting area for future work, and for adding on to a potential Marble 2, as you say.

Speaker 1

是的。

Yeah.

Speaker 1

在动态效果方面还有很大潜力。

There's opportunity for dynamics.

Speaker 0

没错。

Yeah.

Speaker 1

对吧?

Right?

Speaker 2

那么,Splat的密度目前是什么状况?

What's the state of, like, Splat's density, I guess?

Speaker 2

我们能否在放大时实现很高的分辨率?

Like, do we can we render enough to have very high resolution when we zoom in?

Speaker 2

我们是否受限于可生成和可渲染的数量?

Are we limited by, like, the amount that you can generate, the amount that we can render?

Speaker 2

比如说,这些怎么才能达到极高的保真度呢?

Like, how are these gonna get super high fidelity, so to speak?

Speaker 0

你确实有一些限制,但取决于你的目标使用场景。

You have some limitations, but depending on your target use case.

Speaker 0

我们场景中的一个主要限制是,我们希望内容能在移动设备上清晰渲染,也希望能在VR头显上清晰渲染。

So, like, one of the one of the big constraints that we have on our scenes is we wanted things to render cleanly on mobile, and we wanted things to render cleanly in VR headsets.

Speaker 0

这些设备的计算能力远低于你通常在其他场景中所拥有的能力。

So those are those devices have a lot less compute than you're used than you have in a lot of other situations.

Speaker 0

如果你想让一个splat文件在四年前的iPhone上以高分辨率、30到60帧每秒的速度运行,那么你能处理的splat数量就会受到一定限制。

And, like, if you want a splat file to render at high resolution and 30 to 60 FPS on an iPhone from four years ago, then you are a bit limited in the number of splats that you can handle.

Speaker 0

但如果你可以使用今年或近年的iPhone、最新的MacBook,或者拥有本地GPU,又或者不需要60帧每秒的1080p分辨率,那么你就可以放宽这些限制,支持更多的splat,从而在场景中实现更高的分辨率。

But if you're allowed to work on a recent device, like even this year's iPhone or a recent MacBook, or if you have a local GPU, or if you don't need that 60 FPS 1080p, then you can relax the constraints and get away with more splats, and that lets you get higher resolution in your scenes.
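The frame-budget arithmetic behind this constraint can be sketched as follows. The per-splat render cost and fixed overhead are made-up numbers for illustration, not measurements from Marble; the point is only that relaxing the FPS target grows the splat budget.

```python
def max_splats(fps, per_splat_us, overhead_ms=4.0):
    """How many splats fit in one frame, given a hypothetical
    per-splat render cost (in microseconds) and a fixed
    per-frame overhead (in milliseconds)."""
    frame_budget_ms = 1000.0 / fps
    return int((frame_budget_ms - overhead_ms) * 1000.0 / per_splat_us)

# At 60 FPS a frame is ~16.7 ms; dropping the target to 30 FPS
# roughly doubles the splat budget on the same hardware.
budget_60 = max_splats(fps=60, per_splat_us=0.005)
budget_30 = max_splats(fps=30, per_splat_us=0.005)
```

The same formula explains the device spread: an old phone has a higher effective per-splat cost, so the same 30 to 60 FPS target leaves room for far fewer splats than a laptop with a local GPU.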

Speaker 0

我原本期待但没听到的一个使用场景是

One use case I was expecting but didn't hear

Speaker 3

具身化应用场景。

from you was embodied use cases.

Speaker 3

嗯。

Mhmm.

Speaker 3

你现在只专注于虚拟现实吗?

Are you you're just focusing on virtual for now?

Speaker 1

如果你访问World Labs的主页,会有一个名为Marble Labs的页面。

If you go to World Labs' home page, there is a particular page called Marble Labs.

Speaker 1

在那里,我们展示了不同的应用场景。

There, we showcase different use cases.

Speaker 1

我们将它们分为视觉效果、游戏应用以及仿真应用等类别。

And we actually organize them into visual effects use cases, gaming use cases, as well as simulation use cases.

Speaker 1

其中,我们展示了这项技术在机器人训练中的巨大潜力。

And in that, we actually show this is a technology that can help a lot in robotic training.

Speaker 1

这让我回到之前提到的内容。

This goes back to what I was talking about earlier.

Speaker 1

说到数据匮乏,机器人训练确实严重缺乏数据。

Speaking of data starvation, robotic training really lacks data.

Speaker 1

高保真的真实世界数据至关重要,但你很难获得大量这样的数据。

High fidelity real world data is absolutely very critical, but you're just not going to get a ton of that.

Speaker 1

当然,另一个极端是纯粹的互联网视频数据,但这样你就缺乏了训练具身智能体所需的可控性。

Of course, the other extreme is just purely internet video data, but then you lack a lot of the controllability that you want to train your embodied agents with.

Speaker 1

因此,模拟和合成数据实际上是一个非常重要的中间地带。

So simulation and synthetic data is actually a very important middle ground.

Speaker 1

在这方面,我已经在这个领域工作了多年。

For that, I've been working in this space for many years.

Speaker 1

其中一个最大的痛点是,你从哪里获取这些合成的模拟数据?

And one of the biggest pain points is where do you get this synthetic simulated data?

Speaker 1

你需要整理资产并构建、组合这些复杂的场景。

You have to curate assets and compose these complex situations.

Speaker 1

在机器人领域,我们需要大量的不同状态。

And in robotics, we want a lot of different states.

Speaker 1

你希望具身智能体能够在合成环境中进行交互。

You want the embodied agent to interact in the synthetic environment.

Speaker 1

Marble 实际上在为具身智能体训练生成这些合成模拟世界方面具有巨大潜力。

Marble actually has real potential for helping to generate these synthetic simulated worlds for embodied agent training.

Speaker 3

当然,没错。

Obviously, that's yeah.

Speaker 3

它就在主页上。

It's on the home page.

Speaker 3

它会在那里的。

It'll be there.

Speaker 3

我只是在想,正如你所说,你还得建立一个商业模式。

I was just trying to make the link to, as you said, you also have to build a business model.

Speaker 3

机器人市场的规模显然非常庞大。

The market for robotics, obviously, is very huge.

Speaker 3

也许你不需要那样,或者我们首先需要构建并解决虚拟世界,然后才能进入具身领域,而这正是一个必经的步骤。

Maybe you don't need that, or maybe we need to build up and solve the virtual worlds first before we can go to embodied, and obviously that is exactly the stepping stone.

Speaker 1

这还有待决定。

That is to be decided.

Speaker 1

我觉得确实如此

I I do think that

Speaker 3

因为其他人都直接往那里去了。

Because everyone else is going straight there.

Speaker 3

对吧?

Right?

Speaker 1

不是所有人,但我觉得确实有一种兴奋感。

Not everyone else, but there is an excitement, I would say.

Speaker 1

但你知道,世界足够大,可以容纳不同的方法。

But, you know, I think the world is big enough to have different approaches.

Speaker 1

是的。

Yeah.

Speaker 1

方法。

Approaches.

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是,我们一直将这视为一种相当通用的技术,未来应该能够触及许多不同的行业。

I mean and we always view this as a pretty horizontal technology that should be able to touch a lot of different industries over time.

Speaker 0

而且,你知道,Marble 目前更专注于创意产业,但我认为支撑它的技术在未来应该能应用于许多不同的领域。

And, you know, marble is a little bit more focused on creative industries for now, but I think the the technology that powers it should be applicable to a lot of different things over time.

Speaker 0

而机器人技术就是其中之一,你知道,可能比我们想象的更快到来。

And robotics is one that, you know, is maybe gonna happen sooner than later.

Speaker 1

还有设计。

Also, design.

Speaker 1

对吧?

Right?

Speaker 1

它和创意产业非常接近。

It's very adjacent to creative.

Speaker 0

哦,是的。

Oh, yeah.

Speaker 0

当然。

Definitely.

Speaker 0

比如,我觉得这跟建筑相关?

Like, I think it's like the architecture stuff?

Speaker 1

是的。

Yes.

Speaker 0

好的。

Okay.

Speaker 0

对。

Yeah.

Speaker 0

我的意思是,我之前在网上开玩笑。

I mean, I was joking online.

Speaker 0

我在Slack上发了一个视频,问谁想用Marble来规划下一次厨房翻新?

I posted this video on Slack of, like, oh, who wants to use Marble to plan your next kitchen remodel?

Speaker 0

它实际上已经非常适合这个用途了。

It actually works great for this already.

Speaker 0

只需拍摄两张厨房的照片,用Marble重建它,然后使用编辑功能,看看如果更换台面、地板或橱柜,这个空间会是什么样子。

Just take two images of your kitchen, reconstruct it in Marble, and then use the editing features to see what that space would look like if you changed the countertops or changed the floors or changed the cabinets.

Speaker 0

这并不是我们专门为这个用例开发的东西。

And this is something where, you know, we didn't necessarily build anything specific for this use case.

Speaker 0

但由于这是一种强大的通用技术,你往往会自然地涌现出这些意料之外的用例。

But because it's a powerful horizontal technology, you kind of get these emergent use cases that just fall out of the model.

Speaker 1

我们有一些早期测试用户,他们已经使用API密钥在做室内设计的项目。

We have early beta users, using an API key, who are already building for the interior design use case.

Speaker 2

我刚弄了我的车库。

I just did my garage.

Speaker 2

我早该知道的,我真得

I should have known about it. I gotta

Speaker 1

下次你装修时,我们可以

Next time you remodel, we can

Speaker 2

没错。

Exactly.

Speaker 1

帮上忙。

Be of help.

Speaker 2

接下来肯定是厨房了,我敢肯定。

Well, kitchen is next, I'm sure.

Speaker 2

是的。

Yeah.

Speaker 2

是的。

Yeah.

Speaker 2

我对整个空间智能领域很好奇。

I'm curious about the whole spatial intelligence space.

Speaker 2

我觉得我们应该更深入地研究一下。

I think we should dig more into that.

Speaker 2

首先,你怎么定义它?

One, how do you define it?

Speaker 2

还有,比如,传统智能和人们想到大语言模型时的智能之间,有哪些差距?你知道,达里奥说我们有一个满是爱因斯坦的数据中心。

And, like, what are the gaps between traditional intelligence that people might think about with LLMs, when, you know, Dario says we have a data center full of Einsteins.

Speaker 2

那才是传统智能。

That's like traditional intelligence.

Speaker 2

这不是空间智能。

It's not spatial intelligence.

Speaker 2

要具备空间智能需要什么?

What is required to be spatially intelligent?

Speaker 1

首先,我不理解这句话——一个充满爱因斯坦的数据中心。

First of all, I don't understand that sentence, a data center full of Einsteins.

Speaker 1

那是哦。

That's Oh.

Speaker 1

我真的不理解这个。

I I just don't understand that.

Speaker 1

不。

No.

Speaker 1

不。

No.

Speaker 1

不。

No.

Speaker 1

这不是一个

It's not a

Speaker 3

这是个类比。

It's an analogy.

Speaker 3

这是个类比。

It's an analogy.

Speaker 1

嗯,人工智能作为一个领域、一门学科,很大程度上是受人类智能启发的。

Well, so a lot of AI as a field, as a discipline, is inspired by human intelligence.

Speaker 1

对吧?

Right?

Speaker 1

因为我们是目前所知宇宙中最聪明的动物。

Because we are the most intelligent animal we know in the universe for now.

Speaker 1

如果你观察人类智能,它是非常多元的。

And if you look at human intelligence, it's very multi-intelligent.

Speaker 1

有一位心理学家,我想他叫霍华德·加德纳,在20世纪60年代,他明确提出多元智能理论,用来描述人类的语言智能、空间智能、逻辑智能和情绪智能。

There is a psychologist, I think his name is Howard Gardner, who in the 1960s actually literally coined multiple intelligences to describe human linguistic intelligence, spatial intelligence, logical intelligence, and emotional intelligence.

Speaker 1

所以对我来说,当我想到空间智能时,我认为它与语言智能是互补的。

So for me, when I think about spatial intelligence, I see it as complementary to language intelligence.

Speaker 1

因此,我个人不会说它是空间智能与传统智能之间的对立,因为我不确定。

So I personally would not say it's spatial versus traditional because I don't know.

Speaker 1

‘传统’到底意味着什么?

Traditional means, what does that mean?

Speaker 1

我认为空间智能确实与语言智能是互补的。

I do think spatial is complementary to linguistic.

Speaker 1

那么,我们如何定义空间智能呢?

And how do we define spatial intelligence?

Speaker 1

它是一种使你能够推理、理解、移动和在空间中互动的能力。

It's the capability that allows you to reason, understand, move, and interact in space.

Speaker 1

我用DNA结构的发现作为例子。

And I use this example of the deduction of DNA structure.

Speaker 1

当然,我简化了这个故事,但其中很大一部分都涉及对分子和化学键在三维空间中的空间推理,最终推测出双螺旋结构。

And of course, I'm simplifying this story, but a lot of that had to do with the spatial reasoning of the molecules and the chemical bonds in 3D space to eventually conjecture a double helix.

Speaker 1

人类,或者弗朗西斯·克里克和沃森所具备的这种能力,很难将其简化为纯粹的语言过程。

And that ability that humans, or Francis Crick and James Watson, had: it is very, very hard to reduce that process into pure language.

Speaker 1

这正是文明关键时刻的巅峰体现。

And that's a pinnacle of a civilizational moment.

Speaker 1

但每一天,我都在这里努力去抓起一个杯子。

But every day, right, I'm here trying to grasp a mug.

Speaker 1

整个过程包括:看到杯子、观察它所处的环境、看到自己的手、让手的形状在几何上与杯子匹配,并触碰到合适的抓握点。

This whole process of seeing the mug, seeing the context where it is, seeing my own hand, the opening of my hand that geometrically would match the mug, and touching the right affordance point.

Speaker 1

这一切都深深植根于空间性。

All this is deeply, deeply spatial.

Speaker 1

这非常困难。

It's very hard.

Speaker 1

我试图用语言来描述它,但另一方面,这种描述性语言本身并不能让你真正拿起一个杯子。

I'm trying to use language to narrate it, but on the other hand, that narrated language itself cannot get you to to pick up a mug.

Speaker 3

是的。

Yeah.

Speaker 3

带宽限制。

Bandwidth constraint.

Speaker 3

是的。

Yes.

Speaker 3

我最近做了一些计算,比如,如果你整天不停地说,每天24小时,会产生多少个标记?

I did some math recently on, like, if you just spoke all day, every day for twenty four hours a day, how many tokens do you generate?

Speaker 3

以平均每分钟150个单词的语速计算,大约每天会产生215,000个标记。

At the average speaking rate of, like, 150 words per minute, it roughly rounds out to about 215,000 tokens per day.
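The back-of-envelope arithmetic checks out: 150 words per minute over 24 hours is 216,000 words, the same ballpark as the ~215,000 tokens quoted, assuming roughly one token per word.

```python
# Rough daily token budget of nonstop speech, assuming ~1 token per word.
words_per_minute = 150
minutes_per_day = 60 * 24

words_per_day = words_per_minute * minutes_per_day
print(words_per_day)  # 216000
```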

Speaker 3

而你所生活的这个世界,其带宽远高于这个水平。

And, like, the world that you live in is so much higher bandwidth than that.

Speaker 2

我认为这是对的。

Well, I think that is true.

Speaker 2

但当我想到艾萨克·牛顿爵士时,比如,在当时,像重力这样的概念还没有被语言形式化,但人们在空间上已经直观地理解了物体下落的现象。

But if I think about Sir Isaac Newton, right, it's like you have things like gravity that at the time had not been formalized in language, but that people innately, spatially understand: that things fall.

Speaker 2

对吧?

Right?

Speaker 2

但以某种方式将这些内容形式化是有帮助的。

But then it's helpful to formalize that in some way.

Speaker 2

或者像,我们用语言来真正捕捉那些在经验上和空间上你也能理解的各种规则,但用语言描述起来更容易。

Or, like, you know, all these different rules where we use language to really capture something that you can also understand empirically and spatially, but it's easier to describe in a way.

Speaker 2

所以我很好奇,空间智能和语言智能之间的相互作用,也就是,好吧。

So I'm curious, like, the interplay of, like, spatial and, like, linguistic intelligence, which is like, okay.

Speaker 2

你需要理解。

You need to understand.

Speaker 2

有些规则用语言表达比用空间智能理解更容易。

Some rules are easier to write in language than for spatial intelligence to understand.

Speaker 2

但你不能,你知道,你不能只是把手这样放,然后放下这么多。

But you cannot, you know, you cannot write, put your hand like this, and put it down this amount.

Speaker 2

所以我一直很好奇,你们是如何相互利用的。

So I'm always curious about how you leverage each other together.

Speaker 0

我的意思是,以牛顿为例,牛顿之所以想到写下那些定律,是因为他拥有大量的身体经验。

I mean, if anything, like the example of Newton: Newton only thinks to write down those laws because he's had a lot of embodied experience in the world.

Speaker 2

对。

Right.

Speaker 2

是的。

Yeah.

Speaker 2

没错。

Exactly.

Speaker 0

实际上,区分你提到的理论构建与那种身体化的、日常生活中嵌入三维世界的经验是有用的。

And, actually, it's useful to distinguish between the theory building that you're mentioning versus, like, the embodied, daily experience of being embedded in the three-dimensional world.

Speaker 0

对吧?

Right?

Speaker 0

所以对我来说,空间智能在某种程度上囊括了那种身处三维空间、在其中移动、观察和行动的具身体验。

So to me, spatial intelligence is sort of encapsulating that embodied experience of being there in 3D space, moving through it, seeing it, acting in it.

Speaker 0

正如飞飞所说,你可以叙述这些事情,但这是一个非常有损的通道。

And as Fei-Fei said, you can narrate those things, but it's a very lossy channel.

Speaker 0

就像身处世界之中并在此行动的概念,与试图描述它完全是两种不同的模式。

It's just like the notion of, you know, being in the world and doing things in it is a very different modality from trying to describe it.

Speaker 0

但因为我们人类是始终在空间中互动进化的动物,所以根本不会觉得这有什么难的。

But because we as humans are animals who have evolved interacting in space all the time, we don't even think that that's a hard thing.

Speaker 0

对吧?

Right?

Speaker 0

然后我们自然而然地转向语言和理论构建,作为超越这种原始空间理解的抽象机制。

And then we sort of naturally leap to language and then theory building as mechanisms to abstract above that sort of native spatial understanding.

Speaker 0

从某种意义上说,大语言模型直接跳到了最高层次的抽象推理,这非常有趣且很有用。

And in some sense, LLMs have just, like, jumped all the way to those highest forms of abstracted reasoning, which is very interesting and very useful.

Speaker 0

但空间智能几乎像是重新打开那个黑箱,说:也许我们直接走向完全抽象的语言、推理和交流形式时,已经失去了一些东西。

But spatial intelligence is almost like opening up that black box again and saying, maybe we've lost something by going straight to that fully abstracted form of of language and reasoning and communication.

Speaker 1

作为视觉科学家,你知道,这挺有意思的。

You know, it's funny as a vision scientist.

Speaker 1

对吧?

Right?

Speaker 1

我总是觉得视觉被低估了,因为对人类来说,视觉是毫不费力的。

I always find that vision is underappreciated because it's effortless for humans.

Speaker 1

你刚出生睁开眼睛时,就开始

You open your eyes as a baby, you start to

Speaker 0

看到你的世界。

see your world.

Speaker 3

与生俱来。

Born with it.

Speaker 1

对吧?

Right?

Speaker 1

我们几乎是天生就具备这种能力。

We're almost born with it.

Speaker 1

但学习语言需要付出努力,包括学习如何书写、语法和表达。

But you have to put effort in learning language, including learning how to write, how to do grammar, how to express.

Speaker 1

这使得它显得很难。

And that makes it feel hard.

Speaker 1

而大自然花了更多时间优化的感知和空间智能,却被人类低估了。

Whereas something that nature spent way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans.

Speaker 3

有证据表明我们天生就具备这种能力吗?

Is there proof that we are born with it?

Speaker 3

你说的是几乎天生具备。

You said you said almost born.

Speaker 3

所以听起来我们实际上是出生后才学会的。

So it sounds like we actually do learn after we're born.

Speaker 1

我们出生时,视觉敏锐度较低,感知能力会逐渐增强。

When we are born, our visual acuity is less, and our perceptual ability does increase.

Speaker 1

但大多数人出生时就具备了视觉能力。

But we are most humans are born with the ability to see.

Speaker 1

大多数人出生时就具备将感知与运动行为联系起来的能力。

And most humans are born with the ability to link perception with motor movements.

Speaker 1

运动能力本身需要一段时间才能完善。

The motor movement itself takes a while to refine.

Speaker 1

动物的能力则非常惊人。

And then animals are incredible.

Speaker 1

今年夏天早些时候,我刚去过非洲。

I was just seeing Africa earlier this summer.

Speaker 1

这些小动物一出生,几分钟内就必须开始行动。

These little animals, they're born and within minutes they have to get going.

Speaker 1

否则,狮子就会抓到它们。

And otherwise, the lions will get them.

Speaker 1

在自然界中,优化感知和空间智能花了五亿四千万年,而对语言发展的最慷慨估计大约是五十万年。

And in nature, it took five hundred forty million years to optimize perception and spatial intelligence, whereas the most generous estimation of language development is probably half a million years.

Speaker 3

哇。

Wow.

Speaker 3

是啊。

Yeah.

Speaker 3

这比我想象的要长得多。

That's longer than I would have guessed.

Speaker 1

我已经很宽容了。

I'm being very generous.

Speaker 3

是的。

Yeah.

Speaker 3

是的。

Yeah.

Speaker 3

不。

No.

Speaker 3

我其实一直在翻你的书,我突然意识到,我们播客里讨论过的一个有趣关联是语言模型的基准测试,以及Winogrande是如何融入了这些需要空间智能的、看似不可能的物理情境的。

I was, you know, sort of going through your book, and I was really realizing that one of the interesting links to something that we covered on the podcast is language model benchmarks and how Winogrande actually puts in all these sort of physical impossibilities that require spatial intelligence.

Speaker 3

对吧?

Right?

Speaker 3

比如,a在b上面,因此a不可能穿过b,这对我们来说显而易见,但对语言模型来说,这种情况是可能发生的。

Like, a is on top of b, therefore, a cannot fall through b is obvious to us, but to a language model, it could happen.

Speaker 3

我不确定。

I don't know.

Speaker 3

也许这属于下一个词预测的一部分。

Maybe it's like a part of the, you know, the next token prediction.

Speaker 0

这正是我想说的,关于解开这种抽象概念。

And that's sort of what I mean about, like, unwrapping this abstraction.

Speaker 0

是的。

Yeah.

Speaker 0

对吧?

Right?

Speaker 0

如果你对世界的全部认知只是看到一个接一个的词序列,那就真的很难理解,为什么不行呢?

Like, if your whole model of the world is just like seeing sequences of words after each other, it's really kind of hard to like, why why not?

Speaker 0

这其实不公平。

It's actually unfair.

Speaker 0

对。

Right.

Speaker 0

对。

Right.

Speaker 0

但对我们来说,这显而易见的原因是我们内心将它映射回了我们熟悉的三维世界表示。

But then the reason it's obvious to us is because we are internally mapping it back to some three-dimensional representation of the world that we're familiar with.

Speaker 3

这个问题是,我想说的是,从你的世界模型中提炼出语言模型到底有多难,需要多长时间?我用了‘提炼’这个词。

The question is, I guess, like, how hard is it, you know, how long is it gonna take us to distill from, like... I used the word distill.

Speaker 3

我不知道你是否同意这个说法。

I don't know if you agree with that.

Speaker 3

因为我们需要我们的模型具备空间智能,所以要从你的世界模型中提炼到语言模型。

To distill from your world models into a language model, because we do want our models to have spatial intelligence.

Speaker 3

对吧?

Right?

Speaker 3

我们是不是必须完全抛弃语言模型才能做到这一点?

Like and do we have to throw the language model out completely in order to to do that?

Speaker 3

或者不用。

Or No.

Speaker 3

不用。

No.

Speaker 3

对吧?

Right?

Speaker 3

是的。

Yeah.

Speaker 3

我不这么认为。

Don't think so.

Speaker 1

对吧?

Right?

Speaker 1

我认为它们是多模态的。

I think they're multimodal.

Speaker 1

我的意思是,即使我们现在的模型Marble,也是以语言作为输入的。

I mean, even our model, Marble, today takes language as an input.

Speaker 1

对。

Right.

Speaker 1

对吧?

Right?

Speaker 1

所以它本质上是多模态的。

So it's deeply multimodal.

Speaker 1

我认为在许多应用场景中,这些模型会协同工作。

And I think in many use cases, these models will work together.

Speaker 1

也许有一天我们会拥有一个通用模型。

Maybe one day we'll have a universal model.

Speaker 0

我的意思是,即使你做到了,从实用角度来看,人们仍然会使用语言,也希望用语言与系统互动。

I mean, even if you do, there's sort of a pragmatic thing where people use language and people want to interact with systems using language.

Speaker 0

从实际角度看,构建允许人们与之交谈的系统、产品和模型是有用的。

Even pragmatically, it's useful to build systems and build products and build models that let people talk to them.

Speaker 0

所以我认为这种趋势不会消失。

So I don't see that going away.

Speaker 0

我认为有一种智力上的好奇心,想知道究竟能在多大程度上构建一个仅使用视觉或仅使用空间智能的模型。

I think there's a sort of intellectual curiosity of saying, like, intellectually, how much could you build a model that only uses vision or only uses spatial intelligence?

Speaker 0

我不知道这是否具有实际用途,但我认为这是一个有趣的知识或学术探索,看看能将这种模型推到多远。

I don't know that that would be practically useful, but I think it'd be an interesting intellectual or academic exercise to see how far you could push that.

Speaker 2

我认为,不提物理学了,但我很好奇,如果你有一个高度精确的世界模型,但不给它任何关于当前标准物理模型的理解,它能从零开始推导出多少内容,以及它需要多高的语言理解能力。

I think, I mean, not to bring it back to physics, but I'm curious: if you had a highly precise world model and you didn't give it any notion of, like, our current understanding of the standard model of physics, how much of it would it be able to come up with and recreate from scratch, and what level of language understanding it would need.

Speaker 2

因为我们有太多符号体系,无法直接用它们来重建,但也许我们会提出一种完全不同的模型,同时仍然保持准确性。

Because we have so many notations that it couldn't use to recreate it, but maybe it would come up with a very different model of it and still be accurate.

Speaker 2

我想知道,我们是否在某种程度上受限于这样一个观念:人们总说人形机器人必须像人类,因为世界是为人类建造的。

And I wonder how much we're kinda limited by the, you know, how people say humanoids always need to be like humans because the world is built for humans.

Speaker 2

某种程度上,我们构建语言的方式也限制了这些其他模态所能产生的输出。

And in a way, it's like the way we build language constrains some of the outputs that we can get from these other modalities as well.

Speaker 2

所以我非常期待关注你们的工作。

So I'm super excited to follow your work.

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是,还有另一种思路——实际上,你甚至不需要做人工智能就能回答这个问题。

I I mean, like, there's another engine I mean, you actually don't even need to be doing AI to answer that question.

Speaker 0

你可以去发现外星生命,看看它们拥有什么样的物理规律。

You could discover aliens and see what kind of physics they have.

Speaker 0

对。

Right.

Speaker 0

对吧?

Right?

Speaker 0

而且他们

And they

Speaker 2

可能会有帮助。但飞飞说,到目前为止,我们是最聪明的。

It might help. But Fei-Fei said we are, so far, the smartest

Speaker 2

宇宙中的动物。

Animal in the universe.

Speaker 0

对吧。

Right.

Speaker 0

所以,那你认为呢?我的意思是,这确实是个很有趣的问题。

So, I mean, that is a really interesting question.

Speaker 0

对吧?

Right?

Speaker 0

比如,我们对宇宙的认知和对物理学的理解,是否在某种程度上受到我们自身认知能力或技术演进路径依赖的限制?

Like, is our knowledge of the universe and our understanding of physics, is it constrained in some way by our own cognition or by the path dependence of our own technological evolution?

Speaker 0

一种可以用来做实验的方式。

And one way to sort of, like, do an experiment.

Speaker 0

简直想做个实验,看看如果我们重新运行人类文明,是否还会以同样的顺序得出相同的物理学?

Like, I almost wanna do an experiment and say, if we were to rerun human civilization again, would we come up with the same physics in the same order?

Speaker 0

我认为这并不是一个非常实际的实验

And I don't think that's a very practical experiment

Speaker 1

来运行。

to run.

Speaker 1

我知道一个实验,我想知道人们是否可以做:我们现在有大量的天体物理数据,关于行星或天体的运动。

You know, one experiment I wonder if people could run is that we have plenty of astrophysical data now on planetary or celestial body movements.

Speaker 1

把这些数据输入模型,看看牛顿定律是否会浮现出来。

Just feed the data into a model and see if Newtonian law emerges.

Speaker 0

我猜可能不会。

My guess is it probably won't.

Speaker 1

我的猜测也是这样。

That's my guess.

Speaker 1

不是的。

It's not.

Speaker 1

牛顿定律的抽象层次与这些语言大模型所代表的层次不同。

The abstraction level of Newtonian law is at a different level from what these LLMs represent.

Speaker 1

因此,如果给足天体运动数据,我不会惊讶于大模型能相当准确地预测运动轨迹。

So I wouldn't be surprised that, given enough celestial movement data, an LLM would actually predict pretty accurate movement trajectories.

Speaker 1

假设我发明了一颗围绕恒星运行的行星。

Let's say I invent a planet orbiting a star.

Speaker 1

如果数据足够多,我的模型会告诉你第一天它在哪里,第二天它在哪里。

And given enough data, my model would tell you on day one where it is, day two where it is.

Speaker 1

我不会感到惊讶。

I wouldn't be surprised.

Speaker 1

但F等于MA,或作用等于反作用,那是完全不同的抽象层次。

But F equals MA, or action equals reaction, that's just a whole different abstraction level.

Speaker 1

这超出了当今大模型的能力范围。

That's beyond just today's LLM.
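Fei-Fei's conjecture is easy to illustrate with a toy experiment (all data here is synthetic and the orbit is idealized as circular): a purely pattern-fitting predictor can extrapolate daily orbital positions almost perfectly while containing nothing that resembles F = ma.

```python
import math

# Synthetic "celestial" data: daily (x, y) positions of a planet
# on a circular orbit with a 365-day period (made-up units).
period = 365.0
data = [(math.cos(2 * math.pi * t / period),
         math.sin(2 * math.pi * t / period)) for t in range(300)]

# Pattern-fit predictor: estimate the per-day rotation angle from two
# observations and extrapolate, with no notion of mass, force, or gravity.
x0, y0 = data[0]
x1, y1 = data[1]
step = math.atan2(y1, x1) - math.atan2(y0, x0)

def predict(day):
    """Extrapolated position on a given day, by pure rotation."""
    angle = step * day
    return (math.cos(angle), math.sin(angle))

# Accurate on an unseen day, despite never representing F = ma.
px, py = predict(301)
tx = math.cos(2 * math.pi * 301 / period)
ty = math.sin(2 * math.pi * 301 / period)
```

The fit predicts "where the planet is on day 301" essentially exactly, which is Fei-Fei's point: accurate trajectory prediction lives at a different abstraction level from discovering the law that generates the trajectory.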

Speaker 2

好的。

Okay.

Speaker 2

你需要什么样的模型才能避免地心说?

What model would you need to not have it be a geocentric model?

Speaker 2

因为如果我只是用一些视觉数据来训练,你认为太阳绕着地球转是很合理的。

Because if I'm training just some visual data, it makes sense that you think the sun rotates around the earth.

Speaker 2

对吧?

Right?

Speaker 2

但显然,事实并非如此。

But obviously, that's not the case.

Speaker 2

那么它该如何学会这一点呢?

So how would it learn that?

Speaker 2

我对所有这些我们所讨论的力都很好奇。

Like, I'm curious about all these, like, you know, forces that we talk about.

Speaker 2

有时候,也许你根本不需要它们,只要看起来对,那就是对的。

It's like sometimes maybe you don't need them because as long as it looks right, it's right.

Speaker 2

但当你转向尝试用这些模型完成更高级的任务时,我们能多大程度上依赖它们呢?

But, like, as you make the jump to, like, trying to use these models to do more high level tasks, how much can we rely on them?

Speaker 0

我认为你可能需要一种不同的学习范式。

I think you would need kind of a different learning paradigm.

Speaker 0

对吧?

Right?

Speaker 0

所以,这里出现了一点混淆,即把大语言模型、语言和符号,与人类的理论构建和人类物理学混为一谈。

So, like, you know, there's a bit of conflation happening here, saying: is it LLMs and language and symbols versus, you know, human theory building and human physics?

Speaker 0

它们非常不同,因为人类的目标函数是理解世界并在生活中茁壮成长。

And they're very different because, like, the human objective function is to understand the world and thrive in your life.

Speaker 0

而实现这一点的方式是,有时你观察数据,然后思考它。

And the way that you do that is, you know, sometimes you observe data, and then you think about it.

Speaker 0

接着你尝试在现实中做点什么,但结果与你的预期不符。

And then you try to do something in the world, and it doesn't match your expectations.

Speaker 0

然后你就想在线更新你对世界的理解。

And then you want to go and update sort of your understanding of the world online.

Speaker 0

人们时时刻刻都在这样做。

And people do this all the time constantly.

Speaker 0

比如,我觉得我的钥匙在楼下。

Like, whether it's, you know, I think my keys are downstairs.

Speaker 0

于是我下楼去找,但没找到。

So I go downstairs and I look for them and I don't see them.

Speaker 0

哦,不对,它们其实在我卧室里。

And, oh, no, they're actually up in my bedroom.

Speaker 0

因为我们不断与世界互动,所以一直在构建关于周围世界发生什么的理论,然后证伪或补充这些理论的证据。

So, because we're constantly interacting with the world, we're constantly having to build theories about what's happening in the world around us and then falsify or add evidence to those theories.

Speaker 0

我认为,这种过程放大并扩展后,就形成了牛顿物理学中的 F=ma。

And I think that that kind of process writ large and scaled up is what gives us f equals m a in Newtonian physics.

Speaker 0

我认为这与我们训练的模型模态——无论是语言还是空间——有点不相关。

And I think that's a little orthogonal to, you know, the modality of model that we're training, whether it's language or or or spatial even.

Speaker 3

我的说法是,这几乎是一种更高效的学习方式:你根据现有数据提出假设,列出所有可能的世界,然后通过实验排除不可能的世界,最终锁定正确的那个。

The way I put it is, this is almost more efficient learning, because you have a hypothesis of here are the different possible worlds that are granted by my available data, and then you do experiments to eliminate the worlds that are not possible, and you resolve to the one that's right.

Speaker 3

对我来说,这也是我拥有心理理论的方式,即我对你正在想什么有一些假设,我会尝试采取行动来验证或修正我对你的想法的直觉,当然,大语言模型并不会做这些。

To me, that's also how I have theory of mind, which is, like, I have a few theses about what you're thinking, and I try to create actions to resolve that or check my intuition as to what you're thinking, you know, and obviously, LLMs don't do any of these.

Speaker 1

心理理论可能还会延伸到情感智能,而当今的人工智能根本还没有触及这一点。

A theory of mind possibly also will break into even emotional intelligence, which today's AI is really not touching at all.

Speaker 1

对吧?

Right?

Speaker 3

我们真的非常需要它。

And we really really need it.

Speaker 3

你知道,人们开始可能过度依赖这些东西了,这本身就是一个话题,没错。

You know, people are starting to depend on these things probably too much, and that's a whole topic of, yeah.

Speaker 3

另一个争论的话题。

Of other debate.

Speaker 3

我得问一下,因为很多人把这个问题发给了我们。

I do have to ask because a lot of people have, like, sent this to us.

Speaker 3

我们到底需要抛弃多少?

How much do we have to get rid of?
