Episode overview
Bilingual subtitles
Text subtitles only; Chinese audio is not included. To listen while you read, use the Bayt podcast app.
我认为这确实是一个基础模型在广度、通用性和能力方面的重大飞跃。现在我们做到了。它有大量不同的潜在应用领域和影响力。
I think this is really a step change as a foundation model in terms of the breadth and generality and capabilities. Now we're there. There's a whole host of different potential things it could be used for and have impact on.
这些视觉世界中的细节数量简直令人震惊。考虑到需要记忆的信息量,其记忆质量实在令人惊叹。
The number of details that we have in those visual worlds is just staggering. The quality of the memory is amazing, considering how much information it has to actually remember.
欢迎回到谷歌DeepMind播客。我是汉娜·弗莱教授。最新的视频生成模型已经震撼了全世界。它们创造了这种近乎完美的现实模仿。但视频的局限性在于你只是观众而非参与者,而这并非人类体验真实世界的方式。
Welcome back to Google DeepMind: The Podcast. I'm Professor Hannah Fry. Now, the latest video generation models have impressed the entire world. They've created this near-perfect imitation of reality. But the limitation of video is that you are just a viewer rather than a participant, and that's not how humans experience the real world.
对吧?相反,我们能够导航从未去过的环境,并对可能遇到的情况有所预期。我们可以在所有可行方向上几乎无限制地探索,并与沿途偶然发现的事物互动。这正是这项技术的下一个重大前沿——超越生成完美场景录制,转向构建我们可以最终踏入的动态世界模拟。这就是Genie 3,一个能够生成前所未有的多样化交互环境的原型世界模型。
Right? We instead can navigate environments we've never been to and still have an expectation of what we're likely to encounter. We can explore in every feasible direction, kind of without limits, and interact with things that we chance upon along the way. And that is the next great frontier for this technology: to move beyond generating a perfect recording of a scene and towards building a dynamic simulation of a world we can finally step into. Enter Genie 3, a prototype world model that can generate an unprecedented variety of interactive environments.
它已被描述为通往AGI的垫脚石,今天与我一起的是它的两位创造者:研究总监Shlomi Fruchter和研究科学家Jack Parker-Holder。欢迎来到播客。
It's already been described as a stepping stone towards AGI, and with me today are two of its creators: Shlomi Fruchter, research director, and Jack Parker-Holder, research scientist. Welcome to the podcast, both of you.
谢谢。谢谢。
Thanks. Thanks.
好的。让我们直接进入正题。什么是Genie 3?
Well, okay. Let's get straight into it. What is Genie 3?
这是一个实时交互式世界模型,允许您通过文本提示创建多样化且视觉有趣的世界。所以它没有底层的游戏引擎,没有固定结构,也没有代码。它只是一个神经网络,根据用户输入和过去的状态预测每一个像素。因此,您几乎可以立即创造出的事物具有前所未有的灵活性和多样性。
It's a real-time interactive world model that allows you to create diverse, visually interesting worlds from a text prompt. So there's no underlying game engine, no structure, no code. It's just a neural network predicting every single pixel in reaction to inputs from the user and also the past. And so the flexibility and the diversity of things you can create in basically no time is quite unprecedented.
您不需要让一整支艺术家团队坐在房间里构建一个世界才能进行交互。
You don't need a whole army of artists sitting in rooms constructing a world in order to be able to interact with it.
是的。我认为关键在于它可以创造出您能想象到的任何世界,对吧?这是使用游戏引擎无法做到的,对吗?
Yes. I think the key point is that it can create any world that you can imagine, right? And that's not something that you can do with a game engine.
好吧,让我们来看看吧,因为您已经为我准备了一些演示,对吧?
Well, okay. Let's have a look at it, because you've got some demos for me, right?
是的。我们有几个演示。第一个,我想您可能会喜欢。基本上就是扮演一只猫。好的。
Yeah. So we have a few. The first one, I think you might like. It's basically playing a cat. Okay.
您已经吸引住我了。
You've got me already.
一只姜黄色的猫,那再好不过了。太棒了。
A ginger cat, no less. Excellent.
这里你看到的是一只漂亮的姜黄色猫咪在公寓里闲逛。公寓装修得非常漂亮,有精美的波斯地毯和木地板。还有一张沙发,它正试图跳上去,但不是自己在跳,对吧?
And what you have here is a beautiful ginger cat wandering around an apartment. It's very beautifully furnished. It's got these nice Persian rugs and wooden floor. It's got a sofa that it's currently trying to jump on, but it's not doing it itself. Right?
你在控制它的移动。
You're prompting its movement.
是的,没错。我只是用键盘控制这只猫。所以我可以环顾四周,移动猫咪,基本上告诉它去哪里,这样它就能跳过沙发了。好的。
Yes. Exactly. I'm just using the keyboard to control the cat. So I can look around, move the cat, basically tell it where to go so it can jump over the sofa. Okay.
而且我真的很喜欢走到这里的阳光下。
And I really like walking into the sunlight here.
所以这是对你输入的指令做出的反应吗?
So this is reacting to the inputs that you're giving it?
是的,完全正确。
Yes, exactly.
当你走进阳光时,光线会变化吗?是的。哦,看那个。看那个。是的。
Is the light going to change as you go into the sun? Yeah. Oh, look at that. Look at that. Yes.
所以这个模型基本上是根据它接收到的输入序列来预测接下来会发生什么,并且是实时进行的。
So the model is basically trying to predict what's going to happen next based on the sequence of inputs that it gets, and it does it in real time.
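The loop just described, in which each new frame is predicted from the full history of frames plus the latest user input with no game engine underneath, can be sketched as follows. `predict_next_frame` is a hypothetical toy stand-in for the learned neural network, and the "frame" is reduced to a single brightness value to keep the sketch self-contained:

```python
def predict_next_frame(history, action):
    """Toy stand-in for the learned world model: predict the next frame
    from the whole past plus the latest user action. Here a frame is
    just a brightness value and actions shift it."""
    delta = {"forward": 10, "back": -10, "stay": 0}[action]
    return max(0, min(255, history[-1] + delta))

# The initial frame would come from a text or image prompt; each later
# frame is generated autoregressively, conditioned on everything so far.
history = [128]
for action in ["forward", "forward", "back", "stay"]:
    history.append(predict_next_frame(history, action))

print(history)  # [128, 138, 148, 138, 138]
```

The point of the sketch is only the control flow: generation happens one frame at a time, and every step sees both the accumulated history and the user's action.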
我的意思是,你在这个3D环境中看到的细节也让人联想到我们在VEO中看到的一些东西。比如,如果我不知道你在与它互动,这有什么不同呢?
I mean, the detail that you're seeing in this 3D environment is also quite reminiscent of some of the stuff that we're seeing with VEO. Like, if I didn't know that you were interacting with it, how is it different from that?
所以当你使用VEO创建视频时,你提供一个提示,然后模型会尝试从头到尾创建整个视频,比如八秒钟。一旦完成,你就无法改变摄像机的移动方式,而且肯定无法探索超过八秒钟的内容。对吧?
So when you create a video using VEO, you provide a prompt, and then the model tries to figure out how to create this entire video of, say, eight seconds from start to finish. And once it's ready, you cannot change how the camera moves around, and you definitely cannot explore it for more than those eight seconds. Right?
你能用图片来提示这个吗,还是只能用文本?
Can you use an image to prompt this, or is it only text?
是的。我们刚刚发现实际上可以使用图片和视频来提示它们。在这个特定案例中,我们发现甚至可以使用绘画作品。例如,这是爱德华·霍珀的《夜鹰》,一幅非常著名的画作。
Yes. So we just found out that we can actually use images and videos to prompt the model. And in this particular case, we found that we can even use paintings. For example, this is Nighthawks by Edward Hopper. A very famous painting.
一幅1942年的非常著名的画作。基本上,我们要求Genie 3让我们走进这幅画中。
A very famous painting from 1942. And basically, we asked Genie 3 to let us walk into the painting.
所以这幅画描绘了一个非常生动的画面,夜晚的街角。你透过玻璃看到一男一女靠在吧台上,另一边有人在柜台后提供饮料。它有这些浓郁的绿色,下方的人行道,光线洒落的方式真的非常、非常有感染力。
So this painting is a very vivid image: a street corner at night. You're looking in through the glass to see a man and a woman leaning up against a bar, and someone serving drinks on the other side of the counter. It's got these rich greens, the pavement underneath; the way the light falls is really, really evocative.
我们能够在这个环境中四处走动,某种程度上或许能感受到艺术家心中的景象。哇。
And we were able to walk around this environment and kind of like maybe get a sense of how it looked in the artist's mind. Wow.
为我转个身,转到艺术家没有画的那部分。是的。让我们看看后面有什么。所以我们——
Turn around for me, turn around to the bit that the artist didn't paint. Yes. Let's see what we've got behind. So we—
我们能去下面吗?
Could we go down there?
看看是否可行。
Let's see if it works.
这其实挺酷的,因为当你第一眼看到这幅画时,可以想象它本应是一座非常黑暗的城市,但那一处被照亮了,我想模型大概是按照这个思路来的。
It's actually pretty cool because you can imagine when you look at the picture in the first place, that it was meant to be a very dark city, but then that one spot illuminated, and the model, I guess, kind of went with that.
这完全是真正的开放世界体验。
This is proper open world stuff there.
是的。而且很棒的一点是,如果我们返回去,它仍然在那里。对吧?就像模型正在以一种一致的方式生成世界。所以你可以继续前进,也可以回到你已经去过的地方。
Yeah. And the nice thing is that if we go back, it's still there. Right? Like, the model is generating the world in a way that is consistent. So you can just go on and go back to where you've already been.
它记住了之前存在的东西。没错,完全正确。你还有另一个例子给我们看吗?
It has a memory of what existed. Correct, exactly. Have you got another one for us?
是的,我们有一个正在运行的例子,是一辆水上摩托在几个岛屿周围行驶。让我们看看它怎么样。
Yeah, we have one where we're just running a jet ski around a few islands. So let's see how it goes.
告诉我,告诉我原始提示是什么。
Tell me tell me the original prompt.
所以它是在考艾岛周围水域驾驶水上摩托。哦,
So it's riding a jet ski through the waters around the island of Kauai. Oh,
听起来很梦幻。
sounds dreamy.
水域中有不同的坡道我们可以冲上去。是的。
The waters have different ramps that we can go up on. Yeah.
好的。
Okay.
让我们看看。
Let's see.
好的。开始了。行。所以我们有点像水上摩托上那个人的视角。可以看到画面中的双手。
Alright. Here we go. Okay. So we are sort of the POV of the person on the jet ski. We can see the hands in frame.
需要补充的是,两只手的动作是一致的。水面非常平静。可以看到背景中的这些岛屿。太阳在天空中的位置很低,可以看到阳光在水面上的反射。现在我看到你正带我们上坡道。
Both hands are consistent with each other, I should add. The water is beautifully still. You can see these islands in the background. The sun is quite low in the sky, and you can see the reflection of its rays on the water. Now I see you're taking us up a ramp here.
没错。我有点太慢了。我不知道。
That's right. I'm a bit too slow. I don't know.
下去的时候会怎么样?我是说,哦,它碰到水面时溅起了水花。
What's it gonna do on the way down? I mean, oh, and it splashes when it hits the water.
让我们看看那里有什么。
Let's see what's there. Let's look there.
当你看向后方时,水中会留下一道痕迹,完全符合你对真实水上摩托的预期。
And when you look round to the back, there's a trail in the water exactly as you would expect from a real jet ski.
是的,完全正确。
Yes. Exactly.
没错。那么你是否看到了它在理解物理规律方面的表现?
Yes. So are you seeing elements of it understanding physics?
当然是的。有些现象我们称之为涌现特性——通过通用训练和观察大量不同场景后,当遇到新情境时,它就能理解烟雾如何飘动或水流应该如何运动。虽然可能无法在每种场景下都100%准确,但它的准确度足以让你产生身临其境的感觉。而我们人类显然无法轻易发现其中的错误。
Definitely, yes. There are some things we refer to as emergent properties, where, just from general training and seeing lots of different things, when it sees a new scenario it understands how smoke moves or how water should flow. It's maybe not 100% accurate in every single setting, but it's got enough accuracy that you do feel some sense of being in the scene. And as humans, we can't always spot the things that are wrong with it.
正如杰克所说,这确实存在局限性。但另一方面,我过去一直从事游戏引擎开发,我们曾经非常努力地独立实现各种效果,比如水体模拟等所有元素。而现在,我们基本上有了一个开箱即用就能完成所有这些事情的模型——虽然存在一些限制。
So as Jack said, there are definitely limitations. But on the other hand, I've worked on game engines in the past, and we worked really hard to implement all of these effects independently, like the water simulation, everything. And here we basically have a model that can do all of that out of the box, with some limitations.
甚至不需要真正刻意去实现。
Without even really trying.
它还能做到其他几乎无法通过传统方法实现的事情,比如模拟世界中的其他动物和人类。我认为未来更令人兴奋的是能够与世界中的其他智能体进行交互。
And there are other things it can do that would be almost impossible to get with other methods, like simulating other animals and people in the world. I think something that's really exciting for the future is being able to interact with other agents in the world.
是的。我的意思是,这种完全互动的环境,无论你朝哪个方向推动它,最终都能保持一致,这真的非常了不起。
Yeah. I mean, this totally interactive environment that ends up being consistent regardless of which direction you push it in, I mean, that's really extraordinary.
是的。是的。
Yeah. Yeah.
我的意思是,这些演示某种程度上展示了概念验证,但你们认为这会如何被使用?你们正在关注哪些类型的应用?
I mean, these demos kind of demonstrate the proof of concept, I guess, but how would you see this being used? What are the kind of applications that you'd be looking at?
我们非常兴奋的一点是将其用于智能体的模拟环境。例如,你可以想象一个想要达成目标的智能体,对吧?然后我们可以把它放在任何我们能想象的环境中,也许是一个对它更具挑战性的环境,然后它可以探索环境,尝试达成目标,并从错误中再次学习,而无需在现实世界中做任何事,这成本非常高。另一件让我们兴奋的事是实际上将这些模拟用于规划。所以如果你有一个机器人,或者再次,一个想要达成目标的智能体,它们可以在模拟中进行一些推演,弄清楚可能会发生什么。
One thing that we are very excited about is using it as a simulation environment for agents. So, for example, you can imagine an agent that wants to accomplish a goal, right? And then we can put it in any environment that we can imagine, maybe one that is more challenging for it, and it can explore the environment, try to accomplish the goal, and learn from its mistakes, all without doing anything in the real world, which is very expensive. Another thing we're excited about is actually using those simulations for planning. So if you have a robot, or again an agent that wants to accomplish a goal, it can do some rollouts in the simulation and figure out what might happen.
例如,如果它们想要过马路,智能体可以使用模型预测几种选项。可能会有几种情景。也许有人会穿过它的路径。也许会发生别的事情。然后利用这些推演,它可以决定下一步应该采取什么行动,这被用于规划。
For example, if it wants to go across the road, the agent can use the model to predict a few options. Maybe there are a few scenarios: maybe a person is going to cross its path, maybe something else is going to happen. And then, using those rollouts, it can decide what the next action it should take is, and that's used for planning.
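The rollout-based planning just described, where the agent imagines several futures inside the model, scores them, and then picks its next action, can be sketched as a tiny shooting planner. Everything here is an invented stand-in: `step` plays the role of a learned world model, and the road-crossing scenario (a pedestrian occupying cell 1 at timestep 1) is hypothetical:

```python
from itertools import product

def step(pos, action):
    """Toy stand-in for the world model's one-step prediction."""
    return pos + (1 if action == "advance" else 0)

def score(traj):
    """Score an imagined trajectory: an imagined collision (standing on
    cell 1 exactly when the pedestrian crosses, at timestep 1) is
    heavily penalised; otherwise reward forward progress."""
    if traj[1] == 1:
        return -100
    return traj[-1]

def plan(start, horizon=4):
    """Exhaustive rollout planner: imagine every action sequence inside
    the model, score each imagined trajectory, and return the first
    action of the best one."""
    best_score, best_first = float("-inf"), None
    for actions in product(["advance", "wait"], repeat=horizon):
        traj, pos = [start], start
        for a in actions:
            pos = step(pos, a)
            traj.append(pos)
        if score(traj) > best_score:
            best_score, best_first = score(traj), actions[0]
    return best_first

print(plan(0))  # wait  (let the pedestrian pass, then advance)
```

A real system would use a learned model and far smarter search, but the structure is the same: roll out, score, act on the best imagined future.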
除此之外,我们还看到了许多教育和娱乐方面的应用。
And beyond that, we just see a lot of applications for education, for entertainment.
给我举几个具体的例子来说明一下。我的意思是,这里的理念是什么?比如说,在历史课上,你们能够创建一个维多利亚时代英格兰的世界吗?
Just anchor this in a few examples for me. So, I mean, what's the idea here? Are you are you gonna, in a history lesson, be able to create a world of Victorian England, for instance?
没错。想象一下你站在一群学生面前,他们显然对学习维多利亚时代的英格兰很兴奋,但他们也有很多其他分心的事情,比如他们感兴趣的其他事物。与其只是读教科书,不如让他们能够踏入那个世界,在某种意义上带他们进行一次虚拟之旅,体验身临其境的感觉。所以对于那些可能难以到达的地方,比如地球遥远的角落,或者无法获得的视角,比如成为一只美洲豹、其他动物或鲨鱼,或者回到过去。
Exactly. Yeah. So imagine you're in front of a bunch of students. They're obviously excited to learn about Victorian England, but they've also got a lot of other distractions, other things that they're interested in. Instead of just reading a textbook, you can let them step into the world, and take them on a virtual tour, in a sense, of what it would have been like to be there. So for places that may be harder to access, maybe far-distant corners of the planet, or perspectives that you couldn't otherwise get, like being a jaguar, or other kinds of animals, or being a shark, or going back in the past.
这些体验你无法通过其他方式获得,我认为它尤其能让视觉学习者产生更多共鸣。
These are things that you couldn't really get any other way as an experience, and I think it might resonate more with visual learners in particular.
那就是人类在操控的时候。但如你所说,如果你让一个智能体在其中自由行动,那就会开启一个全新的可能性层次。是的。
And that's when you have a human at the controls. But as you say, if you then let an agent loose in this, that opens up a whole other level of possibilities. Yeah.
所以一旦你有了一个真实或非常接近真实的环境模拟,智能体就可以利用它,而不是在现实世界中学习,后者成本非常高,对吧?如果智能体或机器人在现实世界中犯错,修复起来要困难得多。基本上,这是一种让智能体在模拟环境中学习的方式,我们可以控制一切。我们可以设置一些可能更具挑战性或更不可预测的环境,超出智能体通常训练的范围。
So once you have a real, or very close to real, simulation of an environment, the agent can use it instead of actually learning in the real world, which is very costly. Right? If an agent or a robot makes a mistake in the real world, it's much harder to fix. Basically, this is a way for agents to learn in a simulated environment where we can control everything. We can set up environments that are maybe more challenging or less predictable than what the agent was typically trained on.
这样,智能体基本上可以在这个安全的模拟中不断改进。所以我们对此非常兴奋。
And this way, basically, the agent can improve in this safe simulation. So we're very excited about that.
假设你正在运营一家工厂,想让机器人执行特定任务,你可以精确重现它将会面临的环境,并允许它,嗯,自己发现错误。
So let's say that you were running a factory and you wanted to task a robot with a particular job: you could recreate the precise environment it will find itself in and allow it to, well, find its own mistakes.
是的,这是个很好的例子,因为这已经是我们接近能够做到的事情,对吧?机器人已经变得相当能干了。但我认为更令人兴奋的是那些目前还远无法实现,但这项技术能够完全实现和解锁的事情。比如让机器人和具身智能体真正进入现实世界,可能场景的多样性对我们当前的系统来说很难想象。我想到一个例子,比如一个具身智能体在万圣节会做什么,对吧?
Yeah, I mean, that's a great example, because this is already something we're close to being able to do, right? Robots are getting quite capable. But I think what's even more exciting is the things that are quite far from being possible today, which this could completely enable and unlock. For robots and embodied agents actually out in the real world, the diversity of possible scenarios is just quite hard to fathom for our current systems. I think of this example of what an embodied agent would do on Halloween, right?
也许一年中只有一天它会看到孩子们穿着戏服到处跑。比如,它第一次看到这个会怎么做?这是一个相当具有挑战性的场景来准备。即使你以前见过,下一年可能也会不同。因此,要真正能够模拟这些罕见事件,并能够描述任何可以想象的世界,使其变得鲁棒,确保机器人或智能体的安全,让它们理解所有这些不同的事物。
Maybe one day a year it sees children running around in costumes. What would it do the first time it sees this? It's quite a challenging scenario to prepare for. And even if you've seen it before, it might be different the next year. So to really be able to simulate these rare events, and to describe in text any imaginable world, helps make the robots or agents robust and safe, so that they understand all these different things.
但它们也能从自己的经验中学习,我们知道这一点非常重要。
But they can also learn from their experience as well, which we know is really important.
但我也在想,我们到目前为止给出的例子,都是以人类的尺度体验现实世界。你能把它缩小,创建一个分子或人类细胞层面的模拟世界吗?
But then I also wonder: the examples that we've given so far have been about experiencing the real world at the human level, as it were, in terms of our size. Could you shrink this down and create a simulated world at the level of molecules, or the human cell, for instance?
是的。我们尝试过,实际上有几个例子是我们基本上在血管中移动。所以,是的,它不一定总是生物学上准确的,但我们不认为这是一个根本性的限制。如果我们有更准确的模拟供模型训练,那么未来我们可能会有其他变体的模型能够专门适应这种特定环境。但我们确实更专注于从人的视角看现实世界,因为我们认为这在模型的通用性方面是最广泛适用的。
Yeah. So we tried that, and we actually have a few examples where we're basically moving around inside a blood vessel. So yes, it's not necessarily always biologically accurate, but we don't think that's a fundamental limitation. If we had more accurate simulations that models could be trained on, then in the future we might see other variants of the model that specialize in that particular kind of environment. But we did try to focus more on the real world from the eyes of a person, because we think that's the most widely applicable in terms of the generality of the model.
这是关于仅仅利用人工智能现有发展吗?还是关于让我们更接近通用人工智能(AGI)的目标?
Is this just about utilising the existing developments in artificial intelligence, or is this also about stepping us closer towards the goal of AGI?
我们认为这绝对是一种新型的基础模型。正因如此,其应用范围如此广泛,但也如此初生——因为我们以前从未真正拥有过这种模型,对吧?它融合并结合了我们在语言模型和视频模型中看到的一些理念,再加上我们使用的某些技术。因此它将这些不同元素组合成相当新颖的事物,我认为这正是它令人兴奋之处,对吧?我们取得的这项突破在于,它可能催生一些以前从未真正有过的全新应用。
We think this is definitely a new kind of foundation model. And that's why the breadth of applications is so wide, but also so nascent, because we've never really had this kind of model before, right? It blends and combines ideas from what we've seen in language models and also video models, with some of the techniques that we use. So it's combining these different elements into something quite new, which I think is what's so exciting about it, right? And this breakthrough that we've made might enable some completely new applications that we didn't really have before.
所以这项研究仍处于相当早期的阶段,但我们非常期待未来几个月会带来什么成果。
So it's still quite an early stage for this research, but we're quite excited to see what the next few months bring.
那么让我更深入探讨一下,因为我知道您的背景更偏向VEO和视频领域。但在您和团队加入之前,Genie 1和Genie 2的研究就已经在进行了。当时您在做什么?灵感来源是什么?与当前版本有何不同?
Well, let me dig into that a bit deeper then, because I know that your background is much more in VEO and the video side of things. But you were working on Genie 1 and Genie 2 before you and your team came on board. What were you doing there? What was the inspiration? How is it different from this iteration?
在Genie之前,我主要从事开放式学习研究,即在大规模模拟环境中训练智能体,我们可以配置世界的不同组件。我们参与了XLand项目,其核心思想是通过程序化生成大量环境(这些环境仍通过代码指定),让智能体从不同经验中学习,成为模拟环境中的通用型智能体。但最终我们受限于环境资源的可用性。
So before Genie, I was working on open-ended learning: training agents in large simulated environments where we could configure different components of the world. We worked on the XLand project, and the idea there was basically to use procedural generation to create a wide variety of environments, which were still specified in code, and have the agent learn from these different experiences to become a generalist agent in simulation. But ultimately we were bottlenecked by the availability of environments.
读博期间我们也研究过世界模型,但是在更受限的设置下——从单一环境中训练世界模型,通常是维度很低的环境。真正的梦想是融合这些理念,学习通用世界模型作为任何可想象任务的模拟器,然后训练智能体解决全新问题。形成这种开放式循环:我们生成新世界,智能体从中学习。我们从Genie一代开始作为概念验证——我们到底能不能实现这个目标?
During my PhD we were also working a bit on world models, but in a much more limited, constrained setting: training world models on single environments, typically quite low-dimensional ones. And the dream was really to combine these ideas, to learn general world models that could be used as simulators for any imaginable task, and then train agents in them to solve completely new things. And to have this kind of open-ended loop where we generate new worlds and the agents learn from those. We started with Genie 1 as a kind of proof of concept: could we do this at all?
我们能否生成具有交互性的新世界?对吧?那是一个重大突破。而在Genie二代中,我们将这个能力扩展到了任何三维环境。
Could we generate new worlds that were interactive at all? Right? And that was quite a major breakthrough. And then with Genie 2, we scaled this to any sort of 3D environment.
在这个过程中出现了一些涌现特性。请谈谈这些特性。它们是预期内的吗?
There were some emergent properties in that. Tell me a little bit about those. Were they expected?
所以对于Genie 2,我们的问题是:这个概念能否扩展?因为Genie 1只是一个简单的概念验证。它主要是验证这个想法是否可行。而Genie 2更像是要验证它是否能够真正扩展到更像我们现在看到的基础模型那样?而且我们当时不确定它是否真的能成功,对吧?
So for Genie 2, the question for us was: can this idea scale? Because Genie 1 was quite a simple proof of concept; it was, does this work at all? Whereas Genie 2 was: is this something that could really scale to look more like what we see in foundation models nowadays? And we weren't sure if it would really work, right?
我们不确定它能否长时间保持一致性,因为Genie 1只能持续几秒钟。我们不确定它能否在更高分辨率下工作,因为Genie 1是90P,图像非常小。Genie 2是360P。而且环境多样性也有了显著增加。考虑到所有这些因素,我们不确定是否可能有一个单一的神经网络能够模拟该领域内的任何事物。
We weren't sure if it would be consistent for very long, because Genie 1 only lasted a couple of seconds. We weren't sure if it would work at higher resolution, because Genie 1 was 90p, which is very small images; Genie 2 was 360p. And then the diversity of environments was a significant increase. So, given all of that, we weren't sure if it would be possible to have a single neural network that could simulate anything within that domain.
所以当我们真正得到这个模型时,它确实展现出了涌现特性,对于全新的世界也能奏效。我们当时使用Imagen 3来生成起始帧。它能够模拟烟雾,或者当你开车冲出悬崖时,汽车会受到重力影响。或者如果你降落在水坑里,会溅起水花。令人惊讶的是,这一切都运作得如此之好。这给了我们信心,相信下一步的Genie 3是可能实现的。
And when we actually got the model, it was definitely an emergent property that this worked for completely new worlds. We used Imagen 3 to generate the starting frames back then. It could do things like simulate smoke, or when you drove off the side of a cliff, the car had gravity. Or if you landed in a puddle, it splashed. It was quite surprising that this worked so well, and that gave us the confidence that this next step with Genie 3 would be possible.
所以Genie 1是2D的。
So Genie 1: 2D.
是的。
Yep.
Genie 2是3D的。那么对于Genie 1,你们当时想的是,我们希望能够创建任何类型的环境。你们输入的是,我的意思是,平台游戏对吧?是的。就是海量海量的平台游戏录像。
Genie 2: 3D. So with Genie 1 then, you're like, right, we want to be able to create any sort of environment. And you feed in, I mean, it was platform games, wasn't it? Yep. Just tonnes and tonnes of footage of platform games.
完全正确,是的。
Exactly, yeah.
那么这其中是否出现了任何涌现特性?
And then were there any emergent properties in that?
嗯,能够生成全新的内容这一事实,我认为是令人惊讶的。我不认为这在之前真正被展示过。
Well, the fact that you could generate completely new ones, I think was surprising. I don't think that had really been shown before.
甚至是一幅画,对吧?
Even a painting, right?
是的。一幅画。没错。我的意思是,它们表现得比我更好。我们甚至有一张我狗在公园的照片,你可以像平台游戏一样左右移动它,当然,它并不是一个游戏角色。
Yes. A drawing. Yeah. I mean, you know, they worked better than me. We even had a picture of my dog in the park, and you could move her left and right like a platform game character, though of course she isn't one.
还有Jeff Clune是这个项目的顾问。他的孩子们画了很多画,我们能够将这些画动画化,并像游戏一样让它们移动。我认为可以放心地说,这些并不在训练数据中,对吧?所以我乐意称之为涌现特性。它看起来与训练数据截然不同。
We also had Jeff Clune as an advisor on the project. His children did a bunch of drawings, and we were able to animate those and move them around like games. And I think it's safe to say that those were not in the training data, right? So I am happy to call that an emergent property. It looked quite different to what it was trained on.
谢谢杰米提醒我,因为这已经是几年前的事了。
And thanks Jamie for reminding me because it's been a couple of years.
然后从Genie一代到Genie二代的升级是使其变成三维的。
And then the step between Genie 1 and Genie 2 was making it 3D.
是的。所以它是在增加所能处理任务的多样性。相比只能处理2D平台游戏的Genie一代,这个模型不仅能处理2D游戏,还能处理3D环境。它的分辨率更高,一致性也更强,交互时能维持更长时间。这确实在多个维度上实现了能力跃升,需要更集中的努力才能实现。
Yeah. So it was increasing the diversity of the things it could do. Compared to Genie 1, which was just 2D platform games, it could handle 2D games as well as 3D environments in the same model. It was also higher resolution, and there was a lot more consistency, so it would last a bit longer when you interacted with it. So it was really a step up in capability in a few different dimensions, which required a much more concentrated effort to make possible.
我们当时甚至不确定这个方案是否可行。所以这更像是一个稍大规模的概念验证。但这让我们确信,现在这个版本是可能实现的。
And we weren't sure if it would actually work at all. So it was more of a proof of concept at the slightly larger scale. And that gave us confidence that what we have now would be achievable.
所以你们当时在构建可交互的环境。没错。与此同时,你们还在并行进行视频生成的工作。
So you were working on building an environment that you could interact with. Yep. And meanwhile, in parallel, you were working on video generation.
是的。我的背景其实是3D游戏引擎开发,很久以前还做过模拟仿真。那算是我开始接触AI的起点——虽然那时候我们甚至不叫AI,而是叫机器学习(ML)。
Yes. So my background is actually in 3D game engines, and a very long time ago I used to work on simulations. That was kind of where I started working on AI. Back then we didn't even call it AI; we called it ML.
但最近几年,随着技术发展,我越来越兴奋地投入到图像模型和视频模型的研究中。众所周知,这些模型在过去几年达到了全新的真实感水平。我记得看到Imagen视频模型时惊叹:它怎么能完整模拟一个世界?如果你对比早期模型实现的真实感和传统3D图形模拟的差距,这种突破简直令人震撼。而通过VEO,我们试图打造最优秀的视频模型。
But in the last few years, I got more excited as the technology evolved, and worked on image models and video models. And, as we all know, in recent years those models got to new levels of realism. I remember looking at one of the Imagen Video models and just saying: how is it possible that there is a full simulation of a world in this model? It's just mind-blowing, right, if you think about the level of realism that those models achieved even initially, compared to simulation using 3D graphics methods. And then with VEO, what we tried to do is basically build the best possible video model.
当我看到成果时,开始思考:如果我们能实现实时生成会怎样?我一直在关注Jack团队的工作,他们的研究也非常启发人。于是我们决定:必须迈向下一个台阶。
And when I saw the results, I started thinking: okay, can we do that in real time? I was obviously following the work by Jack and the team, which was also very inspiring. And we just said: okay, we have to go to the next level.
那么在这个项目中,你们希望从VEO借鉴或融合哪些要素?因为这不只是视觉美学的问题。
So what were the elements from VEO that you wanted to combine or learn from for this project? Because it's not just the visual aesthetic.
画质和真实感是我们真正投入的重点。我认为在VEO2上我们已经达到了某种程度的真实感。物理效果虽不完美,但已足够实用,对吧?所以我们实际上能创造出一些与真实镜头难以区分的场景。
The quality and the realism are something that we really invested in. And I think there is a level of realism that we kind of got to with VEO 2. The physics are not perfect, but it's starting to be good enough to be useful. Right? So we can actually create some scenes that are indistinguishable from real footage.
对吧?不是所有内容、不是所有时候都能做到,但正在逐步实现。在VEO 3中我们还加入了音频等功能。因此将其推向交互化的下一步是顺理成章的,但技术上确实颇具挑战。
Right? Not everything, not all the time, but it's starting to be there. With VEO 3, we also added audio and other things. So taking it to the next step, basically making this interactive, was kind of the obvious next step, but technically it's quite challenging.
特别是在生成后续帧的速度方面,对吧?这正是Genie 3项目的核心挑战之一。嗯。我们的整体思路是尝试理解这些模型的训练方式——它们如何被训练、如何学习。
I mean, especially in terms of how fast we have to create the next frames. Right? That was one of the core challenges for the Genie 3 project. Mhmm. I think the approach that we have overall is to try and understand how those models are trained, how they learn.
我们发现,那些帮助我们扩展和改进VEO的相同原则,对Genie 3同样适用。
And the same kind of principles that helped us scale and improve VEO, we found to be useful for Genie 3 as well.
在我看来,作为局外人,Genie和VEO的目标虽然视觉上很相似,但实际上截然不同,对吧?VEO旨在创建高度真实但非交互的环境,而Genie需要构建一个连贯、可探索、可移动的世界。你们需要从头开始吗?还是可以进行某种模块替换?
It seems to me, okay, as an outsider, that even though they look visually quite similar, the objectives of Genie and VEO are quite different. Right? Like, with VEO, you're trying to create this very realistic, non-interactive environment, whereas with Genie, you need to make this consistent, explorable world that you can move around in. Do you have to start from scratch? Or are there kind of swaps that you can make?
从某种角度说,视频模型可以让你指定——比如我想绕着火山走,或者让镜头朝某个方向移动。它的本质是观察整个视频并尝试生成一段连贯的八秒(或任意时长)视频,同时能改变过去和未来的内容。我认为这是视频模型与Genie 3截然不同的特性。
So I think, in a way, if you think about the video model, you can tell it: okay, I want to walk around maybe a volcano, or I want the camera to move one way or another. What it does is basically look at this entire video and try to create a coherent eight-second, or however long, video. And it can change the past and the future at the same time. I think that's the property of video models that's very different from Genie 3.
因为它直接输出最终结果。
Because it spits out the end result.
没错。你可以把它想象成一幅可以随时修改的画布,虽然从某种程度上说,这比我们称之为自回归或逐帧扩展的方式要简单得多。所以我认为这是根本区别。但仍有很多相似之处,比如我们最终都需要接收文本输入并将其转换为某种视觉输出。
Exactly. You can think of it as a canvas where the model can keep changing everything, all the time, which in a way is much easier than what we call autoregressive generation, extending one frame at a time. So I think that's the fundamental difference. There are still a lot of similar aspects. For example, the way that we eventually have to take some text input and convert it to some kind of visual output.
所以这方面有些相似。我想说我们绝对不是从零开始,而是建立在领域内大量研究成果的基础上。但确实存在一些我们必须去探索的新特性。
So that's somewhat similar. So I would say we definitely don't start from scratch. We build on top of a lot of work in this space, but there are definitely some novel properties that we had to figure out.
这其实非常有趣。时间作为关键要素的理念——需要理解过去才能迈向未来。这就是自回归技术的用武之地,对吧?
Well, that's super interesting, actually. The idea that like time is the key component in this, that you have to understand the past and march forward to the future. That's where the autoregressive stuff comes in, right?
完全正确。本质上你看到的每一帧都是在那个时间点从零生成的。因此交互中后期发生的事件尚未可知,而前期发生的所有事件都必须被模型记忆。如果我们以每秒24帧运行,就相当于每秒进行24次图像生成,每次都是基于过往所有信息以及智能体或人类玩家的操作进行全新生成。
Exactly. So essentially every frame you see is generated from scratch at that point in time. Things that happen later in the interaction aren't known yet, and things that happened at the beginning all have to be remembered by the model. So essentially, if we're doing 24 frames per second, it's like doing image generation 24 times per second, each one generated completely from scratch given all of the past and the actions of the agent or human player.
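The real-time constraint just described can be made concrete with a little arithmetic: at 24 frames per second, the model has roughly 1/24 of a second, about 41.7 ms, to produce each frame, conditioned on the whole history plus the latest action. In this sketch, `generate_frame` is a stub with an artificial inference cost standing in for the model's forward pass:

```python
import time

FPS = 24
BUDGET = 1.0 / FPS  # ~41.7 ms available to produce each frame

def generate_frame(history, action):
    """Stub for the model's per-frame forward pass; the real system
    must finish generating the next frame within the budget."""
    time.sleep(0.001)  # artificial, tiny inference cost
    return f"frame{len(history)}"

history = ["frame0"]  # the prompt frame
for action in ["left", "right", "forward"]:
    start = time.perf_counter()
    history.append(generate_frame(history, action))
    elapsed = time.perf_counter() - start
    assert elapsed < BUDGET, "missed the real-time deadline"

print(round(BUDGET * 1000, 1))  # 41.7 (milliseconds per frame)
print(len(history))             # 4
```

Note that the conditioning context (`history`) grows with every frame, which is exactly the memory pressure the speakers mention: everything generated so far must stay available to the model.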
不过这里有个类比。我的意思是,语言模型的工作方式不也正是这样吗?
There's an analogy here though. I mean, that's sort of the way that language models work, right?
正是如此。我认为这是个很好的例证。我们知道语言模型本质上是通过预测下一个词元(token)来训练的。它们通过分析文本尝试推测接下来可能出现的词元分布。
Yes, exactly. I think it's really a good example. So we know that language models are basically trained to predict the next token or word. Right? So they look at text and try to guess.
在自回归世界模型中,我们面临类似的问题:需要基于已观测内容预测下一个观测结果(即下一帧画面)。与LLMs的相似之处在于,语言模型通过这个简单任务学会了丰富的世界表征,包括人类思维方式和问题解决能力。我们认为世界模型,尤其是自回归模型,令人兴奋的原因,正是它们可能通过这个人人都能理解的简单任务,学会世界的动态规律。
What's the distribution of words or tokens likely to follow? And with autoregressive world models, we actually have a similar problem: we want to predict the next observation, which is visual, the next frame, given what was already seen. And the nice thing, I think, in this parallel to LLMs is that, from that very simple task, LLMs learn a potentially very rich representation of the world, of how people think and how people solve problems, for example. And I think the reason that world models, especially autoregressive ones, are exciting for us is that maybe, through that task, a simple task that anyone can understand, they have to learn the dynamics of the world.
而且我认为,如果你仔细想想,这其实是智能的一个超集。因为在现实世界中,如果你们说,好吧,我正在和某位大师下棋,那么下一个视觉画面或下一帧实际上可能就是他们的下一步棋。对吧?当然,我们的模型目前还做不到这一点,但在极限情况下,它确实能走得很远很远。
And I think it's nice, if you really think about it, it's kind of a superset of intelligence. Because in the real world, if you say, okay, now I'm playing chess with some grandmaster, then the next visual, or the next frame, would actually be their next move, right? So, of course, our model is not capable of doing that yet, but at the limit, it goes very, very far.
但正是这些理解过去、把握上下文,并能够预测未来下一步动作的理念,是的,都非常宏大
But it's these ideas of understanding the past, the context, and being able to predict the next move in the future, which are, yes, all very big,
非常,是的。
very, yeah.
这也非常强大,对吧?因为这意味着你可以从同一个起点出发,做出许多截然不同的事情。所以从智能体的角度来看,甚至可能是一个相当简单的任务,但你想把它做得非常好。这样你就可以模拟各种不同的场景。这与强化学习的范式非常相似,对吧?
It's also quite powerful, right? Because it means you can start in the same location and do many very different things. So from an agent's perspective, it could even be quite a simple task, but you want to get really good at it, and so you can simulate various different scenarios. And this is very analogous to the reinforcement learning paradigm, right?
就像有一个环境重置(env reset)函数,能让你回到相同的状态,然后你想从那里获得更多经验。
Where you have this env reset function that brings you back to the same state, and then you want to get more experience from there.
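The reset-and-branch pattern being described can be sketched in a few lines. The `ToyEnv` below is a stand-in of my own, not a Genie or Gym interface: reset puts you back in the same starting state, and each episode then unfolds into a different experience stream.

```python
import random

# Toy stand-in environment: resetting returns you to the same starting state,
# but each episode's dynamics can unfold differently -- the reset-and-branch
# pattern from reinforcement learning that the analogy points at.
class ToyEnv:
    def reset(self, seed):
        self.pos = 0                      # always back to the same state
        self.rng = random.Random(seed)
        return self.pos

    def step(self, action):
        self.pos += action + self.rng.choice([0, 0, 1])  # mildly stochastic
        return self.pos

env = ToyEnv()
trajectories = []
for episode in range(3):
    env.reset(seed=episode)               # same starting location every time...
    trajectories.append([env.step(1) for _ in range(5)])  # ...different rollouts

print(trajectories)
```

Same start, many branches: exactly what makes the setup useful for collecting varied experience on one simple task.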
好吧,那我们一步步来。所以第一帧,你可以有一张像你处理绘画时那样的图像,但也可以有一个文本输入,对吧?
Well, let's inch through this then. So the very first frame, you can have an image like you did with the painting, but you can also have a text input, right?
是的,当然。
Yes, of course.
我的意思是,我知道你之前让我描述我去过的一个地方,我曾经去过西伯利亚的一个猎人小屋。这总是我能想到的最令人印象深刻的事情。这就是我发给你的内容。我去了一个猎人小屋,坐在西伯利亚雅库茨克郊外森林里的驯鹿皮上,喝着伏特加,吃着冰冻的小牛肝。告诉我你做了什么
I mean, I know you asked me earlier to describe somewhere I'd been, and I got to go to a hunter's lodge in Siberia once. It was always sort of the most impressive thing I could come up with. And this is what I sent to you: I went to a hunter's lodge, sat on a reindeer skin in the woods outside of Yakutsk in Siberia, drank vodka and ate frozen calf liver. Tell me what you did
关于那个。首先我们对这个提示感到惊讶,但随后我们尝试将其输入系统,系统基本上能够为你提供的提示添加更多细节,但它仍然遵循你提供的关键元素,对吗?
with that. So first we were surprised by the prompt, but then we just tried to put it into the system, and the system is basically able to add some more details to the prompt that you provided, but it still follows the key elements that you've provided. Right?
是的。所以简单描述一下我们在这里看到的内容,这完全如描述的那样,是西伯利亚的森林。地上覆盖着雪。有一张户外桌子,上面放着一个盘子和一瓶透明液体,我只能假设是伏特加。远处有一个小木屋,里面有火在燃烧。
Yeah. So just to describe what we're looking at here, this is, I mean, exactly as described, it's a Siberian forest. There's snow covering the ground. There's this outdoor table with a single plate and a bottle of clear liquid, which I can only assume is vodka. And in the distance, there's a little wood lodge with a fire burning inside.
实际上,那里有两个,一边一个。然后你可以看到小草丛从雪中探出头来,延伸到更远的森林深处。这真的很棒。光线,再次说明,你们非常喜欢营造傍晚时分美丽黄金时刻的光线。但所以这第一帧是以与你在Veo中可能找到的相同方式生成的。
Actually, there's two of them there, one on either side. And then you can kind of see the little grass poking out of the snow as you go off into the distance, further into the forest. That's really amazing. The light, again, you guys love making the light late afternoon, the beautiful golden hour. But so that first frame, then, is generated in the same way as you might find in Veo.
是的。所以模型不会以任何特殊方式处理它。就像你给它一段文本,它就开始输出帧,之前不做任何准备。它只是把你扔进那个世界,你可以去任何地方
Yeah. So the model doesn't treat it in any special way. It's just like you give it a text, and it just starts outputting frames, and it doesn't do any preparation before it. It just throws you into the world, and you can go wherever
它是。
it is.
然后从这第一帧开始,进行向前和向后的预测。
And then it's from that first frame that the prediction goes, backwards and forwards.
没错。是的。所以基本上就是从文本生成第一帧,然后第一帧的第一个动作生成下一帧,如此循环往复。
Exactly. Yeah. So it's like text to first frame, then first frame, first action to next frame, and so on and so forth from that point onwards, basically.
那么提示词的精确措辞有多关键?我的意思是,更好的提示词能生成更好的世界吗?
And how critical is the exact wording of the prompt here? I mean, you get better worlds with better prompts?
是的,我认为这绝对是正确的。提示工程是一门艺术,所有这些现代模型都是如此,有些人在这方面比其他人更擅长。不幸的是,有些人在这方面比我强得多。
Yeah, I think that's definitely true. There's an art to prompting, I think, all of these modern models and some people are better at it than others. Unfortunately, we have some people who are much better than me at this.
有一个基本上开箱即用就能工作。
This one worked pretty much out of the box.
对吧?这个确实。虽然它们通常表现不错,特别是当你有一个非常生动的描述时,比如那张放着伏特加和冰冻杯子的桌子。有时你尝试某些东西,第一次可能无法完全捕捉到你想要的效果,但你可以对提示词进行一些迭代,然后得到更接近你想要的结果。
Right? This one, yeah. Although often they work pretty well, especially when you have a very vivid description like the table with the vodka and the frozen cups. Sometimes you can try something and not quite capture exactly what you wanted first time, then you can iterate a little bit on the prompt and then get something that's much more like what you wanted.
这需要你重新生成整个世界吗?还是说既然它是向前推进的,你可以实时添加东西?
And does that require you to regenerate the world, or given that it's marching forwards, can you add things in on the fly?
所以我们确实有办法实时添加内容,我们称之为可提示的世界事件。你可以随时说,好吧,现在我希望,比如说,飞进来一个气球或者出现另一个角色。我们对这个功能非常兴奋,因为它不仅能让环境对人来说更有趣,而且正如我们提到的,也更适合在模拟中训练智能体,因为这样我们就可以在世界上引入一些它们必须适应的突发事件。
So we do have ways to add things on the fly, what we call promptable world events. And this is just something where you can say, okay, now I want, for example, a balloon to fly in, or some other character to show up. This is something we're very excited about because it allows the environment to be more interesting for people, but also, as we mentioned, more relevant for training agents in simulation, because then we can throw in something that happens in the world that they have to adapt to.
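A rough sketch of how such an event might slot into the autoregressive loop; the function names and the event format here are my own assumptions for illustration, not the actual interface:

```python
# Hypothetical sketch: a "promptable world event" is extra text conditioning
# injected at some step; every frame generated afterwards must account for it.

def rollout_with_events(model, first_frame, actions, events):
    frames, active = [first_frame], []
    for step, action in enumerate(actions):
        if step in events:                # e.g. {2: "a reindeer walks in"}
            active.append(events[step])
        frames.append(model.generate_frame(frames, action, tuple(active)))
    return frames

class ToyModel:
    # Stand-in: records what each frame was conditioned on.
    def generate_frame(self, history, action, events):
        return (len(history), action, events)

frames = rollout_with_events(ToyModel(), (0,), ["forward"] * 4,
                             events={2: "a reindeer walks in"})
print(frames[-1])  # conditioned on the injected event from step 2 onwards
```

The key property is that the event only changes the conditioning from its injection point forward; frames generated before it are untouched, consistent with not being able to "fix the past".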
就像那个例子中,也许可以让一只驯鹿穿过场景。
So like in that example, maybe have a reindeer coming through.
对。我们可以安排一只驯鹿,或者让另一个人走进场景,如果是智能体的话,它们可以做出回应。如果只是为了娱乐目的,这比在一个无事发生的世界里行走要有趣得多。
Right. We can have a reindeer, or we can have another person who walks into the scene, and then, if it's an agent, basically, it can respond to it. And if it's just for entertainment purposes, then it's much more interesting than just walking in a world where nothing happens.
我想这全都是自回归部分的关键优势,对吧?就是你确实能控制未来发生什么。
And this is all a key advantage of the autoregressive part, I guess, right? That you do have control over the future.
没错,是的。所以你可以在实时生成过程中动态地注入各种内容。
Exactly, yeah. So you can just inject things sort of on the fly as you're generating things in real time.
我觉得最不可思议的是其一致性,对吧?就像系统的记忆,你转身离开再转回来,它应该保持原样。但如果你转向一个尚未探索的方向,这某种程度上是个随机过程,对吧?你是如何平衡这种有时靠记忆、有时靠统计生成的情况?
The thing that I find extraordinary about this is the consistency, right? Like the memory of the system: you turn away and then you turn back, and it's exactly how you left it. But presumably, if you're turning in a direction you haven't yet turned in, then it's sort of a stochastic process, right? So how do you balance it, that it's sometimes a memory and sometimes generated statistically?
嗯,这其实是多种机制的混合。你可以在文本提示中指定视线外的内容,我认为这相比图像提示强大得多,对吧?因为你可以说'右边有X、Y、Z',然后当你实际向右看时,在互动世界中它就在那里。但同时模型也会运用其世界知识来生成内容。
Well, it's kind of a mixture of things. So you can in the text prompt specify things that are out of sight, which I think is quite powerful compared to image prompting, right? Because you can say on the right is X, Y, Z. And then when you actually look around to the right, it's there when you actually play or interact in the world. But then there is also an element of the model using its world knowledge to generate things.
所以在霍珀的艺术品示例中,它确实生成了我们从未见过的街道。你无法完全确定它会生成什么,模型会运用自己对于那里应该有什么的某种直觉。
So in the Hopper example with the artwork, it does generate the street that we haven't seen before. And you can't be exactly certain what it's gonna generate. The model uses its own sort of, I guess, intuition of what should be there.
而这种直觉是基于事先观看了大量视频素材,我的意思是,难以置信的大量视频片段。
And that's intuition based on having watched, I mean, incredible amounts of video footage in advance.
所以基本上模型试图生成一个能代表世界的帧序列,对吧?所以如果它已经看到或生成了世界的某部分,那么模型正确的做法就是回忆这段记忆并使用它。所以如果我回头看已经去过的地方,模型正确的做法就是使用相同的内容。但当看向新区域时,从模型的角度来看,它可以允许自己生成新的内容,因为这是它未曾见过的。
So basically the model is trying to just generate a sequence of frames that is representative of the world, right? So if it already saw some part, or generated some part, of the world, then the right thing for the model to do would be to recall this memory and use it. So if I look back to where I've already been, then the right thing for the model to do would be just to use the same thing. But when you look to a new area, then from the model's perspective, it can allow itself to generate something new, because it hasn't been seen.
所以模型并没有以根本不同的方式对待它。模型基本上学会了平衡这两个方面。而且,这一切又回到了将所有生成内容锚定到提示或用户提供的内容上,对吧?这就是生成信息的来源。
So the model doesn't really treat it in a fundamentally different way. The model learns to basically balance the two aspects. And, again, it all comes back to anchoring all of the generation to the prompt, or what the user provided, right? That's where the information for the generation comes from.
但我想回到我们之前讨论的语言类比,一旦一个陈述在对话中被确立,当你再次提及它时保持一致性也就不那么令人惊讶了
But then I guess going back to the language analogy that we were talking about earlier, it's not that surprising that once a statement has been established in a conversation that it remains consistent when you refer back to
没错,确实如此。我们在语言模型中看到,这种记忆和一致性的能力最近有了很大改进,特别是最新的Gemini模型与两年前相比。
it. Exactly, yeah. We've seen with language models, this sort of ability to have memory and consistency has been something that's improved a lot recently, especially with the latest Gemini models now versus two years ago.
是的,我认为这里有趣的是记忆的大小或细节水平。如果我们想一想,我正在和Gemini对话,几句话之后,它可能会提到我之前说过的事情。这很棒。但我们在那些视觉词汇中所拥有的细节数量,简直令人震惊。考虑到它实际需要记住的细节和信息量,这种记忆的质量真是惊人。
Yeah, I think what's interesting here is the size of the memory, or the level of detail. So if we think about it: I'm talking to Gemini, and after a few sentences, it might refer to something I've said before. That's great. But the number of details that we have in those visual worlds is staggering. The quality of the memory, considering how much detail and how much information it has to actually remember.
是的。
Yeah.
那么,你们是怎么做到的?这需要你们对所处的世界有一个三维表征吗?
Well, how do you do that? Does this require having a sort of 3D representation of the world that you're in?
在这个版本的模型中,我们没有使用任何类似的东西。它主要是作为自回归过程中的涌现特性学习到的。
We don't use anything like that for this version of the model. It's largely just learned as an emergent property, just from this autoregressive thing.
天哪,各位,这些都是涌现特性,我的老天。
You guys, they're emergent properties, my goodness me.
我们是"苦涩的教训"(the bitter lesson)的优秀学生。
We're very good students of the bitter lesson.
是的,我认为这类似于预测下一帧画面,你只需要学会记住过去的关键信息。显然模型有一些优先处理重要细节的表征,但它确实只是在预测下一帧。
Yeah, I think it's similar to if you're predicting the next frame, you just have to learn to remember these critical things in the past. And obviously the model has some representations that prioritize the important details, but it really is just predicting the next frame.
那么这是否类似于语言模型具有概念理解能力并能用不同方式描述相同事物的事实?是这里的Transformer架构让你们能够做到这一点吗?
And is this analogous then to the fact that language models have this conceptual understanding and can describe the same things in different ways? Is it the transformer architecture here that's allowing you to do that?
是的,所以架构是这样的,你知道,我认为现在几乎所有东西都是Transformer。所以没错,这也是一个Transformer。回顾来看,基本上模型会看到已经生成的内容,包括用户提供的输入,并基于此做出预测。所以我认为关于显式三维的有趣问题是,模型可能必须学习某种表征,但这并不是显式表征。因此我们看到模型理解三维环境的能力非常强大。
Yes, so the architecture, you know, I think almost everything today is a transformer. So yes, this is also a transformer. Looking back, basically, the model will see what was already generated, including the user-provided inputs, and based on that, makes the prediction. So I think the interesting question about explicit 3D is that the model probably had to learn some representation, it's just not an explicit representation. So we see that the ability of the model to understand 3D environments is very strong.
对我来说,最突出的能力是它能在未经训练的领域工作,比如将1942年的油画转化为某种三维环境。这完全超出了它的训练分布范围。
And to me, the most striking emergent capability is that it actually works where it was not trained, for example, taking an oil painting from 1942 and actually making it into some kind of 3D environment. That's pretty much out of its distribution.
完全同意。我想回到它理解物理这个概念上。Jack,如果你有一个可以放置智能体的世界——嗯。这是否意味着你可以测试它对物理的掌握程度?比如,你能同时让锤子和羽毛下落吗?
Absolutely. I want to go back to this idea of it understanding physics. Jack, if you've got this world that you can put an agent in, Mhmm, does that mean that you can test how well it knows physics? I mean, could you get a hammer and a feather, for instance, and drop them at the same time?
你绝对可以这样做。我认为这可能会接近模型当前能力的边界。我们已经看到它在视觉内容和更通用的概念上表现相当不错,对吧?你可以想象水出现在它之前见过的许多不同场景中。重力可能出现在其中很多场景里,但可能不是这些特定物体。
You definitely could do that. I think that would probably be close to the frontier of the model's capabilities at this point. We've seen it's quite good at visual things and things that are more general concepts, right? So you can imagine water occurs in quite a lot of different scenarios it's seen before. Gravity has probably occurred in many of those, but probably not these exact objects.
但我觉得如果你在文本提示中指定,这个世界里有一个重力较小的羽毛和一个较重的锤子,也许这样就能奏效。
But I think if you were to specify in the text prompt, in the world there is a feather that has less gravity and a hammer that is heavier, maybe then it would work.
是的。我认为这些模型本质上是视觉性的,对吧?它们除了我们能看到的东西外,对世界一无所知。这可以说是一个限制,对吧?即使是视频模型,有时某些东西也会显得不合理。
Yeah. I think those models are inherently visual, right? They don't know anything about the world except for what we can see. And that's a limitation, I would say, right? Even for video models, some things sometimes don't make sense.
比如,模型必须猜测某物是否沉重,对吧?模型无法通过观察图像来确定这个东西具体有多重。所以它基本上是编造一个重量,然后尝试模拟可能会发生的情况。
And the model has to guess if something's heavy, for example, right? There is no real way for a model to look at an image and say, okay, that's how much this thing weighs. So it kind of makes up the weight and then tries to simulate what would have happened.
有时这会失效,我们从视频模型中也能看到这一点,即使是最好的视频模型。我认为我们面临的是一个更困难的问题,因为正如我们所说,我们无法修正过去,对吧?一旦模型生成了某些内容,它就必须继续推进。我们在模拟、流体动力学方面取得了很好的进展,但由于这些限制,世界其他物理方面可能不够准确。
And sometimes it breaks, and we know that from video models, even the best video models. And I think we are basically solving an even harder problem, because, as we said, we can't fix the past, right? Once the model generates something, it has to roll with it. And I think we've made really good progress in terms of simulation, fluid dynamics; maybe some other physical aspects of the world would not be accurate because of those limitations.
嗯,给我讲讲你们目前在智能体方面所做的一些工作,因为在之前的节目中,我们谈到了SIMA。是的,就是那个可扩展、可指导、多世界的智能体。
Well, tell me about some of the work that you've been doing with agents so far, because actually on a previous episode, we got to talk about SIMA. Yep. The scalable, instructable, multi-world agent.
没错。
Exactly.
他们当时是将SIMA部署到现有的电脑游戏环境中。但我的意思是,你现在也可以把SIMA放到这些生成式环境里。
Which they were putting into existing computer game environments. But I mean, you can now take SIMA and put it into these generated environments too.
完全正确。所以最酷的地方在于,我们在Google DeepMind拥有这些被训练成通用型的智能体。正如你刚才所说,M代表多世界。因此我们能做的是,取出我们生成的世界,然后测试这些智能体及其最新版本是否已经能够直接将它们用于智能体训练、经验收集或评估。你可以对它说类似'导航到那边的机器人'这样的指令。
Exactly. So the cool thing with this is that at Google DeepMind we have these agents that are trained to be general. The M is multi-world, as you just said. And so what we're able to do is take the worlds we generate and then test whether these agents, and the latest versions of them, are already able to use them for agent training, or collecting experiences, or evaluations, as they are. You can say to it something like: navigate to the robot over there.
然后你可以给它一张来自这个世界的图像,它就能执行第一个动作。从那一刻起,它就开始通过动作与世界互动。但Genie 3模型并不知道目标是什么,对吧?这就使它像一个真正的模拟;而如果你把SIMA智能体试图实现的目标告诉Genie 3,可能会让体验变得不真实,对吧?因为它可能会以不正确的方式让目标实现。
And then you can give it an image from the world, and it can take the first action. And from that point onwards, it's interacting in the world through actions. But the Genie three model does not know what the goal is, right? So that makes it a genuine simulation, rather than if you were to tell Genie three the goal that the SIMA agent is trying to achieve, which might make the experience not authentic, right? Because it might make it happen in an incorrect way
对它来说。
For it.
是的,完全正确。
Yeah, exactly.
我想如果你在现实世界中有一个机器人,这个世界并不会帮助这个机器人。
I guess if you've got a robot in the real world, the world isn't helping the robot.
没错。就像你不能说,好吧,机器人必须去拿红色方块,然后它向左看就有一个红色方块,对吧?这里有点像是在
Exactly. Like you can't say, okay, the robot has to go fetch the red cube, and then it looks to the left and there is a red cube, right? There's a bit of a
区别。是的,这是我们在其他场景中遇到的问题,但如果你在智能体和环境之间有这种分离,就不会真正遇到这个问题。这正是与专注于构建真正强大智能体的团队合作的美妙之处,然后让它们访问我们的环境,就像访问任何其他环境一样,对吧?所以Genie三创建的新世界看起来与同一智能体训练时使用的现有世界是一样的。
difference. Yeah, this is a problem that we've encountered in other scenarios, but you don't really get this problem if you have this separation between the agent and the environment. And that's the really nice thing about working with agent teams that are focused on building really capable agents, and then having them access our environment as if it's any other environment, right? So the new worlds that are created by Genie three look the same as the existing worlds that the same agent was trained on.
但如果没有红色方块怎么办?如果它永远搜索下去却找不到红色方块呢?
But then what if there isn't a red cube? What if it searches around forever and there is no red cube?
所以,是的,我认为这正是我们正在研究的部分内容,即我们如何逐步添加更多细节。对吧?比如,如果你把智能体放在某个房间里,然后它可能需要打开一个抽屉并在里面找到东西。对吧?所以我们希望能够向世界中注入事件并控制它。
So, yeah, I think that's part of the things that we're looking into is how we can add more details as we go. Right? Like, for example, if you put the agent in some room and then it has to open maybe a drawer and find something in there. Right? So we want to be able to inject events into the world and control it.
所以我认为这有点像有趣的前沿领域,就是你如何让这个世界看起来非常真实,嗯。同时还能以仍然合理的方式控制世界上发生的事情。所以我认为我们看到的是,我们有可提示的世界事件。我们可以添加世界上发生的事情。但如果你只是想让某物突然出现在世界上,那就不一定可信。
So I think this is the interesting frontier here: how you can make the world look very realistic, Mhmm, and also control what's happening in the world in a way that still makes sense. So I think what we've seen is that we have our promptable world events; we can add things that happen in the world. But if you would just want something to pop into the world, then it's not necessarily plausible.
我觉得,就像,如果你看——我不知道。如果你在沙漠中,然后你要问它,好吧,现在我想看到一头大象,那么这头大象会从哪里来?所以也许当你向左看时,它会从侧面过来。所以我认为在'改变一个世界意味着什么'这个问题上,有一些非常有趣的东西。
I think, like, if you look at, I don't know, if you're in the desert and then you ask it, okay, now I want to see an elephant, then where is this elephant going to come from? So maybe it's going to come from the side when you look to the left. So I think there is something really interesting in what it means to change a world.
因为模型有一些假设,而在智能体方面,这绝对是一个重要能力——能够将新事件注入到世界中。
Because the model has some assumptions, and when it comes to agents, this is definitely an important capability to be able to inject this new event into the world.
所以你们已经做到了?
So you've done this already?
是的,我们已经看到了这方面的初步成果。虽然还不能说我们已经建立了完整的智能体训练循环,尚未在这些环境中进行大规模训练,但我们已经可以测试智能体并观察它们的表现,对吧?我认为非常了不起的是,尽管这些系统并非共同开发,我们只需将智能体放入环境,它就能开始执行任务。你可以想象现在能将其用于各种不同的应用场景。
Yeah, we've got signs of life for this. I wouldn't really say we've got a full agent training loop where we're already doing large-scale training in these environments, but what we can already do is test the agents in them and see how they do, right? And I think it's quite remarkable that, given these things weren't developed together, we just drop the agent in and it can already do things. And you can imagine all the different things you could now use this for.
具体说明一下,给我举个例子。
Color that in for me. Give me an example.
比如你有一个工厂,里面有一些机器人,你想引入——可能是个很无聊的例子——但比如新机器,对吧,而这个机器原本不存在。或者你以某种方式改变了建筑结构,你想在实际将机器人放入新建筑前测试它们。对吧?所以这就像是你可以模拟世界。这基本上是对智能体过去可能见过的场景的变体,然后观察它是否会出错。
If you have a factory where you have some robots, and you want to introduce, maybe a very boring example, a new machine, right, that wasn't there before. Or you change somehow the structure of the building, and you want to test the robots before you actually put them in the new building, right? So, again, you can simulate the world. That's basically a variant of what the agent has maybe seen in the past, and you see if it breaks.
对吧?所有这些都可以在模拟环境中发生,而不必真的损坏你的新机器。对吧?所以这算是一个例子,我觉得。
Right? And all of this can happen in a simulated environment, and not necessarily break your new machine, right? So that's one example, I would say.
找出意想不到的后果。
Find the unintended consequences.
是的。是的。还有模型的评估。所以这甚至不是训练一个像智能体那样的模型,只是测试它如何适应环境的新变化。
Yeah. Yeah. And evaluation of the model. So that's not even training a model like the agent, just testing how well it adapts, maybe, to a new variation of the environment.
到目前为止你给出的所有例子都是智能体有指定目标的情况,我知道这某种程度上是SEMA目前的重点。但是,如果智能体没有目标会怎样?我遇到一句很好的引述:几乎所有重大发明的先决条件都不是以该发明为目标而创造的。你能想象未来某个时刻,你让智能体在这些环境中自由活动而不为它们指定目标吗?
All of these examples you've given so far are where the agent has a specified objective, which I know is sort of the point of SIMA thus far. Yeah. But what about if you had agents that didn't have an objective? There was a really nice quote that I came across, which is that almost no prerequisite to any major invention was made with that invention in mind. Can you imagine a point in the future where you are letting agents loose in these environments without specifying an objective for them?
是的,完全正确。这句引述来自《伟大无法被计划》,作者是肯·斯坦利和乔尔·莱曼。这是一本很棒的书。其中的核心思想是:探索有趣性实际上可能比直接优化实际目标本身更能带来对实践目标更有用的成果。显然,领域和探索空间越大,可能发生的有趣事情就越多。
Yeah, exactly. So the quote comes from Why Greatness Cannot Be Planned, which is from Ken Stanley and Joel Lehman. It's a great book. And the general idea there is that searching for interestingness might actually lead to things that are more useful for practical goals than if you just directly optimize for the practical goals themselves. And clearly the bigger the domain and the space for discovery, the more interesting things could happen.
他们很早以前在名为Pic Breeder的论文中有一个很好的例子,基本上让人们选择图像并通过组合它们来创建这些图像变异后的新图像。人们并没有直接优化特定的最终目标,但通过选择他们认为有趣的内容,最终发现了非常酷的结构化图片,如骷髅或蝴蝶,这些从起点出发如何达到并不明显。过程中的一些垫脚石看起来并不像最终目标,如果你心中有那些目标,显然不会选择去追求它们。
So they had this really nice example quite a while ago with this paper called Picbreeder, where essentially they allowed people to select images and combine them to create new images that were mutations of those. People weren't directly optimizing for specific end goals, but by just choosing what they found interesting, they ended up discovering really cool structured pictures, like a skull or a butterfly, that weren't obvious how you would reach from the starting points. And some of the stepping stones along the way didn't really look much like the final goal, and they wouldn't have been things you would obviously have chosen to go for if you had those goals in mind.
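The Picbreeder-style loop can be caricatured in a few lines. In this toy, "interesting" is crudely approximated by novelty (distance to anything seen before), since there is no human making the choices; everything below is an illustration of the principle, not the original system.

```python
import random

def mutate(genome, rng):
    # Tweak one gene at random; "genomes" here are just small tuples of ints.
    i = rng.randrange(len(genome))
    return genome[:i] + (genome[i] + rng.choice([-1, 1]),) + genome[i + 1:]

def novelty(candidate, archive):
    # Distance to the nearest thing already found: a crude, automated
    # stand-in for the human judgment of "that's interesting".
    return min(sum(abs(a - b) for a, b in zip(candidate, seen)) for seen in archive)

rng = random.Random(0)
archive = [(0, 0, 0)]                    # the starting "image"
for _ in range(50):
    parent = rng.choice(archive)         # branch from any stepping stone
    children = [mutate(parent, rng) for _ in range(5)]
    archive.append(max(children, key=lambda c: novelty(c, archive)))

print(len(archive))  # 51 stepping stones, none of them chosen for an end goal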
现实世界中有很多这样的例子。比如,如果你想登月,你不会去造一个更大的梯子。所以沿着一个维度优化,采用可能像贪婪短视的方法,并不总能让你实现这些巨大飞跃。
And there are lots of examples of this in the real world. If you are, for example, trying to reach the moon, you wouldn't build a bigger ladder. So optimizing along one dimension with a greedy, myopic approach doesn't always lead you to make these big leaps.
嗯,我是说,进化本身,对吧?就是没有目标的迭代的经典例子。嗯。
Well, I mean, evolution itself, right? Is like the classic example of iteration without objective. Mhmm.
是的。我们在研究中经常看到这种情况。
Yeah. And we see this a lot in research.
是的。我的观点是,我认为我们人类在某种程度上决定了什么是有趣的。甚至有一个例子,整个数学的演进都是由人们决定下一步研究什么、什么有趣、什么不有趣来引导的。仅仅因为问题困难并不意味着它有趣,对吧?
Yeah. My perspective on that is that I think we as humans, we kinda decide what's interesting. I think there is even an example where the entire evolution of mathematics was guided by people deciding what's next, what's interesting, what's not interesting. And just the problem being hard doesn't mean that it's interesting at all. Right?
而且我认为,在某种程度上,当我们思考在科学中生成新事物时,其中可能包含一些来自我们的美感或兴趣的方面,模型也许能学会模拟这一点。但我认为非常重要的是要记住,最终是我们作为人类的偏好决定了什么是有趣的。
And I think, in a way, when we think about generating new things in science, there are some aspects of maybe beauty or interest that are coming from us, and models might maybe learn to simulate that. But I think it's really important to remember that what's interesting is ultimately decided by our preferences as people.
我的意思是,你还没有那个所谓的'第37步',在这个...
I mean, you haven't had the Move 37, as it were, in this, in this
即使在这个例子中,围棋的目标或游戏规则也是以某种方式设计的,让人们觉得玩起来有趣,对吧?否则,就像...而且这是一个非常古老的游戏。我只是想说,即使是登月这样的目标,也是由人们设定的。
Even in this case, right, the goal of Go, or the game, was designed in a certain way that people find interesting to play, right? Otherwise... and it's a very ancient game, right? So I'm just saying that the setup, even when there is a goal like getting to the moon, this goal was made by people.
所以我想说的是,我认为在某种程度上仍然存在由我们设定的更广泛的约束。如果机器提出了一个要解决的问题,我们仍然需要判断:这是一个有趣的问题吗?因为否则我们就会说:好吧,我不关心那个。所以我认为任何开放式的事物都仍然存在审美层面的考量。
So I am just saying, I think there are still broader constraints set by us, in a way. And if a machine comes up with a problem to solve, we still have to say: is this an interesting problem? Because otherwise, we'll just say, okay, I don't care about that. So I think there is still an aesthetic part to anything that's open-ended.
是的。我记得德米斯(Demis)之前有个关于创造力层次的引述。第一层是插值,就像你看到一只新猫能认出它是猫。第二层是外推,就像给定围棋规则,你能发现像第37步这样的新着法吗?第三层则是生成完全全新的东西。
Yeah. There was this quote from Demis a while ago about the levels of creativity. Interpolation is one: you see a new cat and you can identify it as a cat. Extrapolation is the next: given the rules of Go, can you discover a new move like Move 37? And then the third level is generating completely new things.
就像你能否真正发明围棋这样的游戏
Like could you actually invent Go is what
他说。
he said.
实际上我们最初就把这作为Genie项目的动力。就像在问,你能创造出全新的东西吗?我认为我们开始看到这种情况发生了。由于这是一种全新的模型类型,团队中有人会做类似创造某种世界的事情,然后团队其他成员立即会说,这真的很有趣。然后他们自己开始发展这个想法。
And we actually had this as like a motivation for the Genie project at the beginning. It was like, can you create completely new things? And I think we're starting to see that happen. Since it's a completely new kind of model, someone on the team will do something like create a certain kind of world and then immediately other members of the team are like, that's really interesting. And then they start sort of evolving that idea themselves.
然后我们把它发布在社交媒体上,看到对一些事物的反应,我们就知道那是有趣的。这样我们就以那种方式创造新事物。而这只是在非常有限的模型访问权限下。所以很明显你可以看到,如果我们未来在这方面更开放一些,它可能会带来某种开放式的创造力。
And then we post it on social media and we see the reaction to some things and then we know that's interesting. So then we create new things that way. And that's just with a very limited access to the model. So clearly you can see if we open this up a bit more in the future that it could lead to some sort of an open ended creativity that way.
这就像是一种进化,标准是人们觉得有趣。
It's like an evolution with the criteria that people find interesting.
是的,完全正确。有人在循环中指导着有趣性。
Yeah, exactly. There's people in the loop guiding the like interestingness.
那么好吧,你能不能,我只是回想起我们在这个系列中早些时候与Dave Silver的对话,他当时在说,你知道,实际上有一种方法可以将人类从等式中移除,实际上你可能会得到更令人惊讶的结果。你能不能,好吧,就陪我玩一会儿,但你能不能达到这样一个点,你只是模拟第一个,我不知道,单细胞生物,并允许它在Genie内部进化,你知道,实际上观察进化过程在虚拟环境中发生?
Well then, okay, could you take, I'm just thinking back to the conversation that we had with Dave Silver earlier in this series, where he was sort of saying, you know, actually there's one way where you remove humans from the equation and you potentially get even more surprising results. Could you, okay, just play along with me here for a moment, but could you get to a point where you just simulate the first, I don't know, single-celled organism and allow it to evolve inside of Genie, you know, and actually watch the process of evolution happen in a virtual environment?
这是个很好的问题。这有点像人工生命开放式进化社区的梦想。我认为也许我们创造的世界还不够完全丰富,但我绝对认为我们正沿着那条道路前进,对吧?所以开放式进化和人工生命领域,他们一直在设计通常用代码来促进这种事情的世界。所以这可能是一种替代方法,来获得可能更丰富的现实世界模拟。
That's a great question. And that's kind of the dream of the ALife, open-ended evolution community. I think maybe the worlds that we create are not fully rich enough yet, but I definitely think that we're getting along that path, right? So open-ended evolution and ALife, they've been designing worlds that could facilitate this kind of thing, typically in code. And so this could be an alternative approach to getting maybe richer, real-world simulations.
所以理论上来说,我的意思是,我们确实取得了相当快的进展。但如果你要让模拟完全像真实世界一样,并且具备那些导致这些进化步骤的目标和约束条件,那么这绝对是可能的,但我还不能说已经完全实现了。
And so in theory, I mean, we've been making quite a lot of progress pretty fast. But if you've got the simulation to be fully like the real world and it had the kind of objectives and constraints that lead to these kind of evolutionary steps, then it's definitely plausible, but I can't say it's definitely there yet.
这不是一个直接的回答,但我实际上尝试过,认为可能有一个非常基础的生命游戏例子。嗯。对吧?嗯。它有四个康威规则。
It's not a direct answer, but I actually tried. I think there is maybe a very basic example: the Game of Life. Mhmm. Right? Mhmm. It has the four Conway rules.
是的。
Yeah.
它有四条规则。对吧?然后我实际上尝试用Veo来模拟它。对吧?你给它一张图片,但它并不奏效。
It has four rules, right? And then I actually tried using Veo to simulate it, right? You give it an image, and it doesn't work.
看起来它确实在进化。如果你不知道规则,你会觉得,是的,看起来挺合理的,不同的像素点亮又熄灭,但它并不遵循游戏的四条规则。对吧?我认为这是一个很好的例子,说明了我们当前模型能做什么,以及它们在遵循特定规则方面的局限性。但要真正进化出生命形式,我认为需要更强的能力,既要能模拟物理世界,又要能以非常精确的方式遵循一些基本的物理规则。
Like, it does look like it's evolving. And if you don't know the rules, you would think, yeah, it looks reasonable, different pixels light up and die out, but it doesn't follow the four rules of the game, right? I think this is a good example of what our current models are able to do, and where they are limited in their ability to follow specific rules. But to actually evolve, maybe, life forms, I think you need much more ability to do both: to simulate the physical world, but also to follow some basic rules of physics in a very accurate way.
受约束的
Constrained
方式。是的,以受约束的方式。我认为我们还没有达到,我们看到了一些迹象,但离在GPU上实现进化还非常遥远。
way. Yes, in a constrained way. I think we see some glimpses of it, but it's definitely very far from being able to have evolution on a GPU.
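For reference, the four Conway rules mentioned above are short enough to write out exactly; this is the ground truth that a pixel-by-pixel video model would have to reproduce frame by frame.

```python
from collections import Counter

def step(live_cells):
    """One tick of Conway's Game of Life on an unbounded grid of (x, y) cells."""
    # Count live neighbours for every cell adjacent to a live cell.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in counts.items()
        # Rules 1-3: a live cell survives with exactly 2 or 3 neighbours
        # (fewer dies of underpopulation, more of overpopulation).
        # Rule 4: a dead cell with exactly 3 neighbours becomes alive.
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A "blinker" oscillates with period 2 -- trivial to verify against the rules,
# and exactly the kind of hard constraint a pixel-level video model can break.
blinker = {(0, -1), (0, 0), (0, 1)}
print(step(step(blinker)) == blinker)  # True
```

Patterns like the blinker make the failure mode easy to spot: a generated video can look plausibly "alive" while violating the rules on the very next frame.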
好吧,感谢你刚才陪我进行了一番哲学思考。我很享受这个时刻。不过让我们还是回到现实吧,因为这里确实存在安全影响。你主要担心哪些方面?
Well, thank you for going philosophical with me for a moment there. I enjoyed that. Let's come back down to Earth with a thump, though, because there are safety implications with this. What are your main concerns?
我认为担忧有不同的层次,对吧?有些是已知的事情,而且相当明显。比如暴力之类的事情,我们可能不希望它以新的方式在世界上发生。这些我们已经可以开始着手解决了。但可能还有一些灰色地带,我们实际上不太确定对这些事情的感受,比如历史背景设置。
I think there's different levels of concern, right? There's sort of the known things and they're quite obvious. I mean, things like violence, maybe we wouldn't want to occur in the world in new ways. And that's something that we can already start addressing. But there's also maybe some more gray areas where we're not actually really sure how we feel about these type of things like historical settings, for instance.
其中一些可能因为微妙的原因而令人不快。这些是我们团队能够看得很清楚的事情。但我觉得可能还有一些我们没有想到的事情。我们宁愿通过限制早期访问和收集反馈来正确处理这些事情,这也是我们正在做的。我们已经从几周前邀请的人员以及仍在互动的人员那里学到了很多。
Some of them may be unsavory for subtle reasons. And those are just things that we as a team can see quite clearly. But I think there's probably also some things that we haven't considered. And we'd rather get those things right by limiting our early access and getting feedback, which we're already doing. And we've already learned a lot from the folks that we brought in a few weeks ago and the folks that we're still interacting with.
比如什么?
Like what?
我们收到了很多我没想到的新用例。职业培训实际上可能很有影响力,对吧?比如很多人无法亲身经历消防员这样的角色。在没有实际体验的情况下,亲临其境是什么感觉?能够提前模拟可能会让你受益匪浅。
We've gotten a lot of new use cases that I didn't think of. Vocational training could actually be quite impactful, right? Lots of people can't go into the role of things like firefighting, for instance. What does it feel like to actually be there without having a visceral sort of experience? It's probably something that you would benefit a lot from to be able to simulate in advance.
即使从模拟的角度来看并不完美,但其中可能有一些元素——仅仅是感受身处特定情境的感觉——能够提前模拟是很好的。
Even if it's not perfectly correct from a simulation perspective, there may be elements to it, just getting a sense of what it's like to be situated in that specific circumstance, that are quite nice to be able to simulate in advance.
除了高温之外。
Minus the heat.
高温与烟雾。
The heat and the smoke.
还有真实的危险。是的。但你刚才提出了另一个有趣的观点,因为你说即使模拟不够完美,模拟与现实之间的差距是否还存在另一种危险?如何尽可能缩小这种差距?这个问题,模拟与现实的差距,我们在播客中已经多次讨论过。
And genuine jeopardy. Yes. But actually you raised another interesting point there, because you said even if it's not perfectly realistic. Is there also another danger in this gap between what's simulated and what's real? How do you make that as small as possible? This, the sim-to-real gap, is something we've spoken about on this podcast lots of times before.
假设你有一个工厂里的机器人作为例子。
Let's say you've got your example where you've got a robot in a factory.
哦。
Oh.
它正在四处移动。它有点
And it's moving around. It's sort of
在可靠性方面。
in terms of reliability.
进行交互。你不能直接将其映射到现实世界,对吧?
Interacting. You can't directly take that and map it onto the real world, right?
是的。所以我认为随着时间的推移,我们会看到更多的控制能力,比如能够将真实环境映射到模型中,这样模型就可以基于真实环境进行生成。我们已经在一定程度上看到了这一点,从图像或视频开始。现在的问题是,它会和现实世界完全一样吗?可能不会。
Yeah. So I think over time we'll see more control, basically being able to take maybe a real environment and map it into the model, so the model can base its generation on the real environment. And we see it to an extent already with starting from an image or starting with a video. And now the question is, would it be perfectly the same as the real world? Probably not.
我认为这甚至没有一个明确的定义。这到底意味着什么?但我认为差距肯定在缩小,所以我们可以获取环境,你知道,过去我们知道很多强化学习环境看起来与任何照片级真实或现实世界的东西相去甚远。但现在我们可以更接近了。但肯定仍然存在差距,我们必须看看这意味着什么,而且当然,我们还没有将其用于任何现实世界的部署。
I don't think that's even well defined. What would it mean? But I think the gap is definitely narrowing. You know, in the past a lot of the RL environments looked very far from anything photorealistic or real-world, but now we can get much closer. Still, a gap definitely remains, and we'll have to see what the implications are. And, of course, we're still not using it for any real-world deployment.
是的。
Yeah.
是的。我认为这是一种迭代的方法。我不认为我们现在是说有了Genie 3,我们就解决了任何可能的具身任务的模拟问题。但我认为我们可以做的是将其与其他技术结合起来。所以我们仍然会以我们之前没有这个技术时的方式训练我们的智能体。
Yeah. I think it's kind of an iterative approach. I don't think we're saying that, now we've got Genie 3, we've solved simulation for any possible embodied task. But I think what we can do is combine it with other techniques. So we would still train our agents in the same way we already did without this.
我们用它来增强训练过程。另一个要素是,它具有一定的多样性非常重要,对吧?所以如果它总是以同样的方式出错,那么智能体可能会学会利用这种不准确性。而如果模型能够生成相当多样化的不同世界,那么我们可以真正测试智能体能力的广度,并确保没有它做出严重错误行为的场景。这实际上可能是一个优势,对吧?
And we use it to augment the training process. Another element to it is that it's really important that it has some diversity, right? If it's always wrong in the same way, then agents might learn to exploit that inaccuracy. Whereas if the model can generate quite diverse worlds, then we can really test the breadth of the agent's capabilities and make sure there's no scenario where it does something really wrong. That might actually be a strength, right?
所以在Sim2Real中也是一样,我们想要进行领域随机化。也许通过拥有一个生成模型,能够搜索可能性的空间并检查所有智能体都做出合理的行为,这可能是一件好事。但这并不意味着你希望完全将其训练成这就是现实世界。也许你想用它来使其更具对抗鲁棒性,而不是学习具体细节。
So it's the same in sim-to-real: we wanna do domain randomization. Maybe by having a generative model, being able to search the space of possibilities and check that the agent does something sensible in all of them, might be a good thing. But that doesn't mean you want to train it as though that is exactly the real world. Maybe you want to use it to make the agent more adversarially robust rather than have it learn specifics.
让我确认一下我是否理解了。所以如果它是错误的,但是以不可预测的方式错误,那么实际上这最终可能会让智能体在长期运行中更加鲁棒。
Let me make sure I understand that then. So if it's wrong, but wrong in unpredictable ways, then actually that might end up making the agent more robust in the long run.
是的,所以你想要做的类似于领域随机化,就是要确保在任何看似合理的情境下,智能体都不会做出真正不安全的行为。这与那种你有一个具体错误场景然后告诉智能体该如何应对的目标完全不同。相反,它更像是要让智能体在任何可能的未来世界中都能做出明智的行为。
Yeah, so what you want to do, similar to domain randomization, is make sure there's no plausible scenario where the agent could do something really unsafe. It's quite a different objective from having one specifically incorrect scenario and telling the agent exactly how to behave in it. Instead, you sort of make it so that in any possible future world, the agent should be able to do something sensible.
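As a rough illustration of the domain-randomization idea being described here, training typically resamples world parameters at the start of every episode, so the agent never gets to overfit to one simulator's specific inaccuracies. This is only a sketch of the general technique; the parameter names and the `make_env`/`agent_update` hooks are hypothetical placeholders, not Genie 3's actual interface:

```python
import random

# Hypothetical parameter ranges; real setups randomize physics
# (friction, masses, latency) and visuals (lighting, textures).
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "motor_strength": (0.8, 1.2),
    "sensor_noise": (0.0, 0.05),
}

def sample_params(rng):
    """Draw one random world from the plausible distribution."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def train(agent_update, make_env, episodes, seed=0):
    """Run each episode in a freshly randomized environment, so the
    agent has to act sensibly across all of them rather than
    exploiting any single simulator's quirks."""
    rng = random.Random(seed)
    for _ in range(episodes):
        env = make_env(sample_params(rng))
        agent_update(env)
```

A generative world model extends this idea: instead of jittering a handful of hand-chosen parameters, it can propose qualitatively different worlds in which to stress-test the agent.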
这太有趣了,因为我原本以为你们是在试图推动这个系统变得更真实,获得更可靠的结果,缩小虚拟与现实的差距。但听你这么描述,似乎不一定是这样。
That's so interesting then, because I was imagining that you were trying to nudge this towards being more realistic, towards getting more reliable outcomes, towards closing that sim-to-real gap. But the way you're describing it, that's not necessarily the case.
我认为问题在于你所说的'可靠'是什么意思?对我来说,可靠性很大程度上是指遵循我们提供的指令,对吧,就是模型要遵循指令。如果我们希望模型模拟特定环境并且提供了详细描述,我们就希望模型遵循这些描述。即使描述中存在不太合理的内容,模型仍然应该遵循。我认为挑战在于,有时候我们作为人类,出于各种原因,反而对不太合理的情境更感兴趣。
I think the question is what you mean by reliable. Because reliable, to me, is probably a lot about following the instructions that we provide the model with, right? So if we want the model to simulate a specific environment and we describe it with a lot of detail, we want the model to follow that. If there is something not so plausible in that description, the model should still follow it. And I think the challenge is that sometimes we as people, for various reasons, are interested in the less plausible scenarios.
对吧?比如今天我们看到的一些例子,你想要一些伏特加和小牛肝。对吧?这并不太常见,如果你从西伯利亚所有可能的餐桌中随机抽样,这可能不在分布的中心。所以我认为,对我来说,可靠性主要来自于遵循我们提供给模型的描述,并以接近该描述的方式模拟世界。
Right? For example, in some of the examples we saw today, you wanted some vodka and the calf liver. Right? That's not very probable; if you just sampled from all of the possible tables in Siberia, that's probably not in the middle of the distribution. So I think that reliability, to me, comes mostly from following the description that we provide the model with and simulating the world in a way that is close to that.
我觉得这其实是个非常好的观点。我认为这是在规范不足的环境中的情况。
I think that's a really good point, actually. I think it's in under-specified environments.
你
You
需要多样性,因为你希望能够适应合理分布内的任何情况。但如果你有一个非常明确规范的环境,那么你就希望它准确。我认为我们在这两个维度上都看到了进步,但可能还没有完全达到目标。
want diversity, because you want to be able to adapt to anything within the plausible distribution. But if you have a very well-specified environment, then you want it to be accurate. And I think we're seeing improvement on both of those dimensions, but we're probably not fully there yet.
请允许我回到那个AGI问题,如果可以的话。这是大家总是想问的终极问题。你认为这是朝着AGI迈出的一步吗,杰克?
Let me go back to that AGI question, if I may. The end question that everyone always wants to ask. Do you think that this is a step towards it, Jack?
我认为AGI本身是一个相对主观的概念,人们对AGI的含义有不同的理解。所以我觉得说我们的模型是整个领域实现AGI的关键可能有些夸大。但对我来说,一个AGI需要具身化并能够在物理世界中行动。这才真正让我兴奋。我认为这能够真正改善世界上任何地区任何人群的生活质量。
I think AGI is itself something relatively subjective, and people have different interpretations of what is meant by AGI. So I think it would maybe be quite grandiose to say our model is the key thing in the whole field that will enable AGI. But for me, an AGI needs to be embodied and able to act in the physical world. That's what really excites me. I think that could really improve people's quality of life, in any demographic, anywhere in the world.
基于这个框架,我确实认为这是一个重要工具。我无法想象一个具身化的AGI如何能够在无法模拟、收集经验并从自身经验学习的情况下在世界任何场景中运作。因为这是我们在其他领域获得超人类能力甚至只是稳健能力的范式。所以我坚信我们需要模拟,我也坚信我们无法通过其他方式构建真实世界的模拟器。
And so with that framing, I definitely think this is an important tool. I can't see how an embodied AGI would be able to operate in any scenario in the world without being able to simulate it, to gather experience and learn from its own experience. Because that's the paradigm we've used in other settings to get superhuman capabilities, or even just robust capabilities. So I believe very strongly that we need simulation, and I also believe very strongly that we won't be able to build a simulator of the real world any other way.
所以当你把这两点结合起来,我认为是的,这对我所理解的AGI来说是一个重大进步。
So when you combine those two things, I think, yes, it is a big step for my version of AGI.
是的。我认为这是一个非常好的回答。除此之外,我想说我们当前的AI世代主要局限于数字世界。要让AI真正为我们所用,它必须要有某种现实世界的交互能力。
Yeah. And I think that's a really good answer. On top of that, I would just say that our current generation of AI is limited to the digital world. For AI to be truly useful for us, it definitely has to have some kind of real-world interaction.
嗯。
Mhmm.
所以我认为,这再次是朝着具身化AI迈出的一小步,要达到那个目标肯定还有很多差距。我们需要更好的信号,比如机器人在世界中行走时获得的信号。它们需要获得一些物理反馈,仅仅有视觉输入和输出是不够的。
So I think, again, this is a small step towards that embodied AI, and there are definitely a lot of gaps to get there. I think we need much better signals, like the ones robots get while they walk through the world. Right? They need to get some physical response; it's not enough just to have a visual input and output.
所以我认为这绝对是朝着那个愿景迈出的一步,没错。
So I just think that this is definitely a step towards that vision, yeah.
因为它还有很多做不到的事情。我的意思是,额外的传感器是其中之一,但目前它处理人的能力也不太好,对吧Jack?
Because there is still a lot that this can't do. I mean, the additional sensors being one of them, but it also doesn't handle people that well at the moment, does it, Jack?
完全正确。我认为关键点在于,我既认为这是实现社交型和具有社会意识的机器人及具身代理最有前途的技术之一,但同时我也认为这可能是当前模型版本最大的限制——它在这方面做得不够完美,因为我们的标准提高了,现在我们认为这是不够好的地方。但我认为拥有这种能力确实至关重要,对吧?因为即使我们的机器人和具身代理完全理解物理,物理在世界各地是相当一致的,但人不是,对吧?
Exactly. And I think that's really the key thing. I both think this is one of the most promising technologies for achieving sociable and socially aware robots and embodied agents, and also that this is probably the biggest limitation of the current iteration of the model: it doesn't do this perfectly, because our standards have risen, and that's the thing we now consider not good enough. But I think it's really critical that we do have that, right? Because even if our robots and embodied agents fully understand physics, physics is fairly consistent around the world, but people are not, right?
我们希望这些代理、机器人,无论它们以何种形式出现,都能够真正增强人类的能力,与人类合作,让我们的生活质量变得更好。因此,它们需要理解人类的思维、工作和互动方式,并能够与我们合作完成事情。所以,我认为我们真正感到兴奋的是,这可能是我们的模型能够实现的事情之一。
And we want these agents, robots, whatever form factor they come in, to be able to really augment humans and work with humans to make our quality of life better. So they need to understand how humans think, work, and interact, and be able to work with us on things. And that's one of the things we're really excited about that might be enabled by our model.
是的。我认为在生成质量方面肯定还有很多局限性,但我对发展的速度感到非常兴奋。想想看,比如去年12月我们有了Genie 2、Veo 2。我们确实感受到了这种速度,它对我们个人生活的影响,但这个领域的发展就是很快。如果我们还记得,不到两年前,生成的图像还有六根手指之类的问题,那在当时是件大事,但现在没人再提这个了。
Yeah. I think there are definitely a lot of limitations in terms of the quality of the generation, but I'm very excited about the pace. If you think about it, in December we had Genie 2 and Veo 2. We definitely feel the pace, the impact of it on our personal lives, but the field is just moving fast. And if we remember, what, less than two years ago, we had images generated with six fingers, and that was a big thing, and nobody is speaking about that anymore.
所以我不认为我们会无法以更高的保真度生成人物,并随之实现所有相关功能。
So I don't see why we won't be able to generate people at much higher fidelity, with everything that follows from that.
那么,这里的目标是建立一个基础模型,本质上为模拟世界做大语言模型为语言所做的事吗?
Is the goal here then to have a foundational model to essentially do for simulated worlds what LLMs have done for language?
是的,完全正确。我觉得你比我表达得更好。我认为这确实是一个基础模型在广度、通用性和能力方面的重大变革。我认为这可能类似于Shlomi最近提到的图像领域的发展,从最初明显存在手指问题到现在已经变得相当不可思议。我们在过去一年里也看到了视频领域的类似情况,一旦有了像Veo 2这样的技术,现在看起来已经相当惊艳了。
Yeah, exactly. I think you've put it better than I could. I think this really is a step change as a foundation model in terms of breadth, generality, and capabilities. And I think it's probably similar to what Shlomi alluded to with images recently, going from obvious issues like the fingers to them now being, I mean, pretty incredible. We saw the same thing with video maybe in the past year, where once we had something like Veo 2, it started looking pretty amazing.
我们在三四年前也看到了语言模型的类似发展,那时它们开始变得真正强大。我们一直希望这种新型基础模型——一种自回归世界模型——也能达到那个水平。现在我们做到了,有大量不同的潜在应用领域和影响力。目前我们还处于相当早期的阶段。
And we saw this with language models maybe three or four years ago, when they started to get really capable. We wanted to get to that point for this new kind of foundation model, sort of an autoregressive world model. Now we're there, and there's a whole host of different potential things it could be used for and have impact on. And we're still fairly early in that right now.
但模拟还有其他方面的元素,对吧?比如,你认为将来是否能够利用这种理念来重现生活体验,而不仅仅是视觉体验?
But there are other elements of simulation too, right? Like, do you think that you will ever be able to use this kind of idea to recreate a lived experience rather than just a visual one?
我认为我们有很多很多甚至没有意识到的感官,对吧?比如本体感觉,我们基本上能感觉到自己的位置,我们对自身在世界中的位置有这种感知。我认为当我们考虑真正让人沉浸在模拟环境中时,这是非常重要的一部分。基本上,仅限于视觉和可能还有音频的约束仍然限制太大。我认为这绝对有潜力,但需要通过我们必须要构建的多种技术才能实现。
So I think that we have many, many senses that we're not even aware of, right? For example, proprioception: we feel where we are; we have this notion of where we are in the world. And when we think about actually putting people in a simulation so they really feel immersed in it, I think this is a huge part of it. Basically, the constraint to visual and maybe audio is still too much of a constraint. There is definitely potential for that, but it goes through multiple technologies that we have to build to actually get there.
所以在此之前,我预计人们将能够与照片级真实的环境进行交互连接,但仍然需要通过某种界面,可能是一种混合界面,比如通过手套之类的设备感受到一些触觉。
So before that, I expect people to be able to connect and interact with photorealistic environments, but still through some kind of interface, maybe a hybrid interface where they can feel some sensation, like through gloves or something.
我认为还需要说的是,实时交互的特性确实对体验产生了巨大影响。我们团队有成员表示,他们访问了童年地点,实际上获得了一种从图像或视频中无法真正获得的感受。所以从这类模型中已经可以获得某种程度的体验了。显然我们正在努力使其在未来成为一个更强大的模型,所以这种体验可能会进一步扩展。
I think I would also say that there is definitely something about being interactive in real time that does make a big difference for the experience. And we've had members of the team say that they visited childhood locations, for example, and did actually get like a sense for it that you couldn't really get from an image or a video. So there is already some sort of a degree of experience that you can gain from this kind of model already. And obviously we're working hard to make it an even more capable model in the future. So maybe that will extend.
是的,这实际上让我想起了今年早些时候的一个项目,谷歌团队使用Veo帮助早期痴呆症患者回到他们的童年记忆并重建这些记忆。所以我可以想象它或许也能成为一种治疗工具,人们不仅可以观看视频,也许还能真正重温或记起一些童年的事情。所以我认为,我们不需要走得很远,这些事情就能对世界产生积极影响。
Yeah, actually, it reminds me of a project we had earlier in the year where the team at Google used Veo to help people with early-onset dementia to go back to their childhood memories and reconstruct them. So I can imagine that this might also be a potential therapeutic tool, where people can not only look at the video but maybe actually relive or remember some things from their childhood. So I think we don't need to go very far for these things to have a positive impact on the world.
太棒了。这绝对令人着迷。非常感谢。
Amazing. That was absolutely fascinating. Thank you so much.
谢谢邀请我们。
Thanks for having us.
谢谢邀请我们,汉娜。我
Thanks for having us, Hannah. I
认为这其中最令人印象深刻的部分不是你在屏幕上看到的内容,而是它是如何生成的。这个模型所代表的转变是从创造现实世界的逼真图像或视频——仿佛它们是时间中冻结的瞬间——转变为能够真正处理时间的方式,就像我们体验时间那样,箭头只指向一个方向,结果紧随原因之后,构建这个一致向前推进的世界,现在直接是过去的结果,这就是为什么我认为这是更大事物的早期迹象。这不仅仅是一种设计游戏或美丽环境的新方式,这里是机器能够真正规划和推理我们世界的基础。你一直在收听的是由我,汉娜·弗莱教授主持的谷歌DeepMind播客。
think the most impressive part of this is not what you're looking at on the screen; it's how that is generated. It's the change this model represents: from creating realistic images or videos of the real world, as though they were frozen moments in time, into something that can actually handle time the way we experience it, with an arrow pointing in only one direction, where effect follows cause, building a consistent, forward-moving world in which the present is a direct result of the past. And that is why I think this is an early hint of something much bigger. This is not just a new way to design games or beautiful environments; this is the bedrock for machines that can genuinely plan and reason about our world. You have been listening to Google DeepMind: The Podcast, with me, Professor Hannah Fry.
现在我们将在这个夏天稍作休息,但我们将在秋天从加利福尼亚的谷歌总部带来更多剧集。与此同时,请务必浏览我们广泛的往期目录,涵盖了从创作者工具到用于药物发现的AI等所有内容。很快再见。
Now we're going to take a little bit of a pause over the summer, but we're going to be back with more episodes in the autumn from Google HQ in California. And in the meantime, do take a look at our extensive back catalog, which covers everything from tools for creators to AI for drug discovery. See you soon.