我能想象一个未来场景:我的孙辈会问我们,'嘿,在你们那个时代真的需要手动驾驶汽车吗?'
I could picture a future in which my grandkids ask us, Hey, is it true that in your day we used to drive by hand?
因此我们很可能正迈向这样一个未来:手动驾驶体验将不再是常态,大部分驾驶将由自动系统完成。
So it's entirely possible that we're going towards a future where the experience of driving by hand is no longer the norm, and most of the driving happens automatically.
欢迎回到Google DeepMind播客,我是主持人汉娜·弗莱教授。
Welcome back to Google DeepMind: The Podcast with me, your host, Professor Hannah Fry.
数十年来,自动驾驶汽车一直是科幻小说中的梦想,如今已成为现实。
Now the idea of having autonomous vehicles has been this science fiction dream for decades, and now they are a reality.
我现在正坐在Waymo无人驾驶汽车的后座,这种车辆已在美国多个城市运营,包括我所在的旧金山、洛杉矶、凤凰城和亚特兰大。
I join you from the back of a Waymo, a driverless car that is operating in numerous US cities: in San Francisco where I am, in LA, in Phoenix, and Atlanta.
它们在街道上非常显眼。
And they're very noticeable on the streets.
这些大型白色汽车顶部装有众多传感器,最关键的是——方向盘后面空无一人。
They're these big white cars with lots of sensors on top and, crucially, nobody sitting behind the steering wheel.
但要让汽车能在无需人工干预的情况下载客上路,并确保其可靠性和安全性,这段发展历程异常复杂。
But getting to this stage where cars can be out on the roads with passengers without the need for human intervention, and doing it so it's reliable and safe has been an incredibly complex journey.
今天我很荣幸能与Waymo的杰出工程师文森特·范·胡克探讨这个话题。
So today I get to talk about that with a distinguished engineer from Waymo, Vincent Vanhoucke.
欢迎来到播客,文森特。
Welcome to the podcast, Vincent.
谢谢邀请。
Thanks for having me.
我知道你已在谷歌工作多年,之前从事机器人技术研究。
I know you've worked for Google for a number of years, previously for robotics.
无人驾驶汽车问题与常规机器人问题有何不同?
How does the driverless car problem differ from a more generic robotics problem?
从某些方面来说,自动驾驶问题其实是最简单的机器人问题。
Well, in some ways, the autonomous driving problem is the simplest robotics problem.
你基本上需要做两件事。
You have basically two things you need to do.
你必须知道是要左转还是右转。
You have to know if you're going to turn left or right.
对。
Yeah.
这是一个数字。
That's one number.
然后你还得知道是要加速还是减速。
And then you have to know if you're going to accelerate or decelerate.
这是两个数字。
That's two numbers.
没错。
Yeah.
在大多数机器人问题中,你需要预测数百个数字才能计算出机器人的所有自由度。
In most robotics problems, you have to predict, you know, hundreds of numbers to figure out all the degrees of freedom of your robot.
这是最简单的机器人,只有两个自由度,但这掩盖了实际问题中的所有复杂性。
This is the simplest robot that has only two degrees of freedom, but that hides all the complexity of the actual problem.
预测这两个数字实际上是一个非常深奥且困难的问题。
Predicting those two numbers is actually a very deep and hard problem.
你必须理解环境。
You have to understand the environment.
你必须了解周围的行人,预测他们的行为,以及未来环境会是什么样子。
You have to understand the people that are around you, around the car, how they're going to behave, what the environment is going to look like in the future.
你必须预测交通规则,什么是允许做的,什么是禁止做的。
You have to predict the rules of the roads, what you're allowed to do, what you're not allowed to do.
所有这些因素混合在一起使得问题变得困难。
And the mix of all this makes the problem hard.
从概念上讲,这是一个机器人学问题。
Conceptually, it is a robotics problem.
那些是机器人,但它们是高度社交化的机器人。
Those are robots, but they are very social robots.
而且它们还嵌入在现实世界中,我认为这可能会让人感到谦卑。
And they're also embedded in the real world, which I imagine could be quite humbling.
现实世界极具挑战性,许多机器人应用场景的预期是:你拥有的机器人处于一个你或多或少能控制的环境中,或者你能合理预测该环境中其他智能体的行为。
The real world is extremely challenging to work in. The expectation in a lot of robotics contexts is that you have a robot in an environment that you more or less control, or that you have a reasonable expectation about what the other agents in that environment will do.
比如
Like a
以工厂车间为例,那里你拥有完全控制权。
factory floor, for example, where you have total control.
是的。
Yeah.
而在自动驾驶汽车场景中,我们必须基本理解并融入环境,尊重当地居民,尽可能与环境融为一体,这样才能服务公众,让我们能够自由行驶和运营。
And in autonomous car contexts, we have to basically understand and mesh with the environment, be respectful of the people that live in it, blend into the environment as best we can so that we can serve the public, right, and to enable us to drive and have the freedom to operate.
好的。
Okay.
那么关于选择这两个数字,首先汽车必须感知周围环境,然后才能规划下一步行动。
So in terms of choosing those two numbers, as you put it, I mean, first, the car has to perceive the world around it before it plans what to do next.
所以如果从这里开始讨论感知系统,Waymo配备了多种不同的传感器。
So I guess if we start there then, in terms of the perception, I mean, Waymo has a number of different sensors.
包括摄像头、激光雷达和毫米波雷达。
You've got cameras, LiDAR and radar.
每种传感器各有什么优势?可能在哪些方面表现较弱?
What are the benefits of each of those and perhaps where do they struggle more as well?
没错,不同传感器各有优缺点。
Yeah, the different sensors have different strengths and weaknesses.
相机基本上就像你的眼睛,对吧?
A camera is basically like your eye, right?
你看到的世界和人眼看到的一样,但在安装多个摄像头之前,它提供的深度信息可能稍显不足,之后你才能推断深度。
You see the world as a human would see it, but it gives you maybe slightly less information about depth until you actually use multiple cameras, and then you can reason about depth.
相比之下,激光雷达在感知深度方面非常出色。
In contrast, LiDAR is very good about sensing depth.
这就是它的作用。
That's what it does.
激光雷达本质上是一束发射出去的激光,它从物体上反射回来,让你能估算出物体的距离。
A LiDAR is basically a laser that you shoot out and bounces off of objects and bounces back and gives you an estimate of how far the objects are.
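Editor's sketch: the time-of-flight idea described here can be written in a few lines. This is a minimal illustration assuming a single idealized pulse; the function name `lidar_range_m` is hypothetical and nothing here reflects Waymo's actual sensor software.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def lidar_range_m(round_trip_time_s: float) -> float:
    """Estimate distance to a surface from a laser pulse's round-trip time."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse travels to the object and back, so halve the total path.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    # A pulse that returns after 200 nanoseconds hit something about 30 m away.
    print(round(lidar_range_m(200e-9), 2))
```

Note that the result is purely geometric: the calculation says how far away the surface is, not what colour it is or what kind of object it belongs to.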
它们看不到颜色,对吧?
They don't see in color, right?
所以它们只提供几何信息。
So they only give you geometric information.
它们提供的场景语义信息要少得多。
They give you a lot less about the semantics of the scene.
那些激光也很容易从物体上反射回来,不是吗?
Those lasers also bounce off things quite easily, don't they?
是的。
Yes.
它们会从抛光金属之类的东西表面反射回来。
They reflect off of things like, you know, polished metal and things like this.
而道路上这类东西相当多。
Which there's quite a lot of on the road.
道路上确实有很多这样的东西。
There's quite a lot of it on the road.
所以这可能是个劣势,因为它会给信号带来噪音,但也意味着你能看到拐角后面的情况。
So that can be a disadvantage in the sense that it adds noise to the signal, or it can also mean you can get to see behind corners.
我们没讨论过雷达的事。
We didn't talk about the radar.
雷达在测速方面非常出色。
Radar is very good at sensing speed.
因此,其他物体与车辆之间的相对速度是我们需要理解的重要信号,它能告诉我们是否存在碰撞风险。
And so the relative speed between the other agents and the car is a really important signal for us to understand, are we at risk of colliding with something?
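Editor's sketch: one simple way to turn relative speed into a collision-risk signal is a constant-speed time-to-collision estimate. The function name and the constant-speed assumption are illustrative only; the conversation does not say how Waymo computes risk.

```python
import math


def time_to_collision_s(gap_m: float, closing_speed_m_per_s: float) -> float:
    """Seconds until the gap closes, assuming the closing speed stays constant.

    closing_speed_m_per_s is positive when the other agent is getting closer.
    Returns infinity when the gap is not shrinking.
    """
    if gap_m < 0:
        raise ValueError("gap cannot be negative")
    if closing_speed_m_per_s <= 0:
        return math.inf
    return gap_m / closing_speed_m_per_s


if __name__ == "__main__":
    # 40 m ahead, closing at 8 m/s: five seconds to react.
    print(time_to_collision_s(40.0, 8.0))
```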
雷达能探测到相当远的距离,对吧?我是说,雷达的探测范围很广。
And radar, I mean, you can use radar to sense quite far out as well, right?
与视觉判断不同,后者只是主观看法。
As opposed to a visual which is just in your opinion.
对。
Yeah.
摄像头的视野范围要短得多,容易被前方车辆和场景遮挡。
Radars have a much longer range. Cameras are going to be obstructed by, you know, the cars in front of you and the scene.
雷达的探测范围要远得多。
The radars can look much, much further afield.
我认为更重要的是它为场景提供了不同的信息维度。
What I think is more important is that it adds a different piece of information to the context.
当你需要融合不同来源的信息时,就需要获取来自不同渠道的数据。
And when you want to fuse information from different sources, you want to have information that comes from different places.
对吧?
Right?
打个比方,就像人有两只眼睛。
The analogy is you have two eyes.
如果两只眼睛提供完全相同的信息,你就无法感知深度。
If your two eyes gave you the same information exactly, you would not be able to perceive depth.
正是两者间的差异为你提供了额外的信息。
It's the discrepancy between the two that gives you that extra bit of information.
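Editor's sketch: the two-eyes analogy corresponds to the standard pinhole stereo relation, depth = focal length x baseline / disparity. The function below is a minimal illustration with hypothetical parameter names, assuming two identical, rectified cameras.

```python
def stereo_depth_m(focal_length_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by two rectified cameras.

    disparity_px is the horizontal shift of the point between the two images.
    """
    if disparity_px <= 0:
        # Identical views (zero disparity) carry no depth information.
        raise ValueError("disparity must be positive to recover depth")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # 1000 px focal length, cameras 0.5 m apart, 20 px shift between views.
    print(stereo_depth_m(1000.0, 0.5, 20.0))
```

The error raised for zero disparity mirrors the point made above: if both views were exactly the same, there would be nothing to compute depth from.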
同理,激光雷达、摄像头和雷达各自提供的信息片段截然不同,各有优劣。
So similarly with LiDAR and camera and radar, they give you very different pieces of information with different strengths and weaknesses.
而人工智能的作用就是将这些信息融合成对环境的完整认知。
And then the role of the AI is to fuse them into a cohesive picture of the environment.
有趣的是,这三种感知方式各有所长,也各有所短。
There is something interesting there that you have these three different senses, all of which have strengths and weaknesses.
它们都存在某种缺陷。
All of them are flawed in some way.
但正因它们的缺陷各不相同,你才能构建出更全面的场景画面。
And yet because they're flawed in different ways, you can build a bigger picture of the scene.
没错。
Yeah.
但当它们出现分歧时该怎么办?
But what do you do when they disagree?
比如,这三者之间谁拥有最终决定权?
Like, who gets the ultimate say between those three?
它们又不是在进行投票表决。
It's not like they are voting.
对吧?
Right?
这是不同信息的融合过程。
It's a merger of the different information.
我喜欢的例子还是用眼睛来比喻——左眼告诉大脑鼻子在右侧,
The example I like is, again, going back to your eyes, your left eye basically tells your brain that your nose is on the right.
右眼则告诉大脑鼻子在左侧。
Your right eye tells your brain that your nose is on the left.
这可不是哪只眼睛能赢得较量的比赛。
It's not like one eye is going to win that contest.
对吧?
Right?
你大脑中会将那些略有冲突的不同信息融合起来,实际上为你呈现出眼前场景的整体画面。
You basically fuse in your brain the different pieces of information that are slightly conflicting, but that actually give you a global picture of the scene in front of you.
一个很好的例子就是夜晚发生的情况。
A great example is, you know, what happens at nighttime.
想象一下外面非常黑暗。
Imagine it's really dark out there.
你的相机只能看到一片漆黑。
Your camera just sees a wall of black.
你获得的信息非常有限。
You don't have a lot of information.
这正是激光雷达真正发挥作用的地方,因为它不在乎白天还是黑夜。
That's a place where the LiDAR is really useful because the LiDAR doesn't care if it's day or night.
它只管持续发射激光并测算物体的距离。
It just keeps shooting its lasers and figuring out how far objects are.
正是这种互补性确保了安全性。
So that complementarity really is what drives the safety there.
安全很大程度上依赖于冗余机制,对吧?
Safety is a lot about redundancy, right?
安全的本质在于整合不同信息源,从不完全信任单一来源,而是根据获得的各类信息线索综合判断,从而构建出一个可信度更高的整体系统。
Safety really comes from taking different sources of information, never entirely trusting them 100%, and merging the evidence based on the different pieces of hints of information that you get, such that you can have an overall system that you can trust that has a much higher degree of fidelity.
我们常说正确的道路只有一条。
We often say there's only one way to be right.
但犯错的方式却有很多种。
There are many ways to be wrong.
如果你的不同传感器以不同方式出错,你就知道获得的信息存在异常。
If your different sensors are wrong in different ways, you know that there is something not right about the information that you get.
但当传感器开始对世界的描述达成一致时,你基本可以确定——结果会是正确的。
But once the sensors start agreeing about a picture of the world, you're pretty sure, you know, it's gonna be right.
所以这几乎就像汽车在不断更新它的'信念',可以这么说吗?即随时更新对周围场景的认知?
So is it almost like the car is sort of updating its belief, as it were, about what the scene is around it at any time?
严格来说确实如此。数学公式就是信念更新的体现,而且可以证明,来自不同渠道的信息越多,你对整体信念的估计就会越准确。
It's very literally that, in the sense that the mathematical formulation is the belief update, and you can prove that the more information you add that comes from different sources, you only improve the overall estimate of your belief.
即使信息存在噪声,本质上你还是在增加信息量。
Even if the information is noisy, you're really just adding information.
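Editor's sketch: one textbook form of this belief update is inverse-variance weighting of independent estimates. It assumes each sensor reports a mean and a variance with independent Gaussian errors; that assumption, and the function name `fuse`, are illustrative and are not taken from the conversation.

```python
def fuse(estimates):
    """Combine independent (mean, variance) estimates of the same quantity.

    Each estimate is weighted by its precision (1 / variance), so noisier
    sensors count for less but still contribute something.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    precisions = [1.0 / variance for _, variance in estimates]
    total_precision = sum(precisions)
    fused_mean = sum(mean * p for (mean, _), p in zip(estimates, precisions)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


if __name__ == "__main__":
    two_sensors = [(10.0, 4.0), (12.0, 4.0)]
    print(fuse(two_sensors))
    # Adding a third, very noisy reading still shrinks the variance.
    print(fuse(two_sensors + [(20.0, 100.0)]))
```

Under these assumptions the fused variance is always smaller than the smallest individual variance, which is the formal version of "even if the information is noisy, you're really just adding information".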
不同传感器之间的融合技术,正是构建高安全性感知系统的关键所在。
And the fusion that happens between the different sensors is really the crux of enabling a very safe perception stack.
比如我们可以遮挡其中一个摄像头。
We can, for example, hide one of the cameras.
系统仍能正常运行。
The system will be okay with that.
即使某个摄像头积了灰尘,系统仍能理解环境——我们需要的不是脆弱的系统,不会因为单个传感器出错就崩溃。
If there is, like, dirt that accumulates on one camera, we can still understand the environment. You don't want to have a system that's brittle, that will just collapse if a single sensor is providing erroneous information.
我们需要的是稳健的系统。
You want something robust.
而多样性正是带来稳健性的关键。
And the diversity is really what brings the robustness.
关于同时配备三种传感器的问题——毕竟有些无人驾驶方案确实没使用激光雷达。
That idea of having all three, I mean, the LiDAR, because there are people who do work on driverless cars without the addition of LiDAR.
你认为要实现完全自动驾驶,这三者都是绝对必要的吗?
Do you think it's absolutely necessary to get fully autonomous vehicles that you need all three?
目前我们掌握的实证表明,仅靠摄像头似乎就能达到人类水平的驾驶表现。
The current state of evidence that we have is that it looks like you can get to human level performance by just using a camera.
嗯。
Mhmm.
证明这一点的是,人们用眼睛观察,大脑里并没有精密的雷达,但他们开车已经足够好了。
And the proof to that is that, you know, people use their eyes, and they don't have a fancy radar in their brain, and they can drive just well enough.
我们看到的是,人们希望这些汽车的安全性超越普通驾驶员的水平,因为这是一项新技术,也因为我们有能力做到。
What we're seeing is that people want to know that those cars are safe beyond what the average driver would be able to provide because it's a new technology and also because we can.
我们已经证明可以提高道路上汽车的安全性能。
We have proven that we can improve the safety posture of cars on the roads.
所以让我们行动起来吧。
So let's do it.
这实际上是我们为社会增添的一项宝贵贡献。
Like, this is actually a valuable thing that we're adding to society in general.
让我们暂时回到汽车对周围环境形成认知这个概念。
Just going back for a minute to that idea of the car having a belief of what's going on around it.
它是否在行驶过程中构建了世界的三维模型?
Is it constructing a 3D model of the world as it goes?
它确实在构建世界的三维模型。
It is constructing 3D models of the world.
它通过不同传感器获取几何信息进行分析。
It's looking at the geometric information that it obtains from its different sensors.
这对两件事特别有用。
This is very useful for two things.
一是规划更容易,因为你有一个可以用来推理的三维空间。
One is planning is easier when you have a 3D space that you can use to reason about.
你会想说,嘿。
You wanna be saying, hey.
我要避开这个。
I want to avoid hitting this.
我要在这里右转,因为这是交通规则要求的,或者这是我想走的路线。
I want to turn right here because that's what the rules of the road are telling me, or this is the route that I wanna take.
此外,拥有世界的三维表征能让你模拟环境。
Also, having a 3D representation of the world enables you to simulate the environment.
这是关键所在,我们非常依赖模拟来验证我们的驾驶员是否安全并按预期行事。
And that's a critical piece, is we are leaning really hard on simulation as a way to validate that our driver is safe and behaving the way we want.
我们已在模拟中行驶了数十亿英里,比实际道路行驶多出好几个数量级。
We've driven billions of miles in simulation, many orders of magnitude more than we've driven on the roads.
拥有高度逼真、接近现实的模拟器,使我们能够快速推进技术发展,并在上路测试前离线验证技术。
And having a simulator that's very faithful and close to reality is what enables us to make the technology advance fast and validate it offline before we actually have to do the testing on the road.
好的。
Okay.
所以我明白了在模拟中你们是如何预测汽车下一步行动,并让它与模拟环境互动的。
So I see how in simulation you're sort of making a prediction about what the car will do next and then have it interacting with your simulated environment.
但当汽车实际在路上行驶时,你们仍需预测下一时间帧可能发生的情况。
But inside the car itself as it's out on the roads, you're still making predictions about what will happen at the next time frame.
这其中是否也包含某些模拟元素?
Is there some elements of simulation in that too?
是的。
Yes.
预测道路上其他交通参与者会或可能采取的行动是非常重要的一环。
So predicting what other agents on the road will do or might do is a very important piece.
比如你需要判断人行道上的行人是否会突然冲到车前,还是继续直行。
You want to know that, for example, a pedestrian that is on the sidewalk, is it likely that they're going to jump in front of the car, or are they just walking straight?
他们是准备过斑马线,还是因为没轮到自己而停在原地?
Are they trying to cross the crosswalk, or are they staying there because it's not their turn?
其他车辆会不会在停车标志前试图通过十字路口?
Other cars, are they going to try to drive through an intersection at a stop sign or not?
所有这些对其他交通参与者的行为推理,都是汽车自主决策的重要组成部分。
All of this reasoning about the other agents is a very important part for the car to be able to make its own decisions.
驾驶本质上是一种社交行为。
Driving is inherently a social thing.
有趣的是,我们倾向于将这些互动建模为小型对话片段。
What's interesting is that we tend to model those interactions as little bits of conversations.
确切地说,这是视觉动作的对话。
Literally, it's visual movement conversations.
现在我向前移动。
Now I move forward.
这辆其他车会怎么做?
What will this other car do?
这辆车停下了。
This car stops.
好的。
Okay.
我可以走了。
I can go.
或者这辆车先走,那我就得停下。
Or this car goes, then I'm gonna have to stop.
对吧?
Right?
这实际上被建模为视觉或动作对话,与你在对话代理中的操作非常相似。
It's literally modeled as a visual or motion conversation, very similar to what you would do in a conversational agent.
因此,在对话式AI和自动驾驶问题之间存在许多相似之处,我们可以借鉴、学习并基于此改进技术。
So there are lots of parallels between the conversational AI and the autonomous driving problem that we can leverage and learn from and improve the technology based on that.
因为这始终是个大问题:如果两辆自动驾驶汽车同时到达停车标志,会发生什么?
Because that was that was always one of the big questions was, if you have two autonomous cars that come to a stop sign at precisely the same time, what happens?
哪辆车会让行?
Which one yields?
你知道,你最终会陷入这种僵局吗?
You know, do you end up in this situation of a stalemate?
解决方案在于,你要预测其他车辆的行动,进行小幅干预,并持续更新你对局势的判断。
And the solution is that you're predicting what the other car will do and making a small intervention and continuing to update your belief about the situation.
是的。
Yeah.
这很重要,因为自动驾驶AI最有趣的方面之一就是闭环问题。
And it's important because one of the very interesting aspects of AI for autonomous driving is the closed loop problem.
我的意思是,你不能孤立地学习行为模式。
And by that I mean you cannot learn your behavior in isolation.
你必须结合道路上其他智能体的情境来学习行为。
You have to learn your behavior in the context of the other agents in the road.
因此唯一的方法就是模拟你的行动,在仿真器中推演世界会如何响应,然后将这些信息反馈回来。
And so one of the only ways to do that is to imagine what you would do, unroll what the world would do in response in a simulator, and then feed that information back.
通过在闭环中评估和训练模型,你就能学会现实环境中实际会观察到的行为模式。
So evaluating and training your model in a closed loop enables you to learn the kind of behaviors that you would actually observe in a real world environment.
我们从机器人学中知道,普遍存在所谓的'DAgger问题'。
We know from robotics that, in general, there's what's called the DAgger problem.
DAgger问题在于,如果你只针对开环行为进行优化,系统最终会试图在每个步骤做到最好,但它犯的每个小错误都会随时间累积。
The DAgger problem is that very often, if you just optimize for open-loop behavior, you end up in a place where your system tries to do the best it can at every step, but every little error that it makes just accumulates over time.
没错。
Right.
所以如果你想以不累积错误的方式学习,就必须模拟整个环境并将其反馈到系统中。
And so if you wanna learn in a way that doesn't have an accumulation of error, you have to simulate the entire environment and feed that back into your system.
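Editor's sketch: a toy rollout can show why small per-step errors matter once they are unrolled over time. This is not the DAgger algorithm itself and not Waymo's simulator; the lane-offset model, the function name `rollout_offset`, and the numbers are all hypothetical.

```python
def rollout_offset(steps: int, per_step_error: float, correction_gain: float = 0.0) -> float:
    """Lateral offset from lane centre after unrolling a policy for `steps` steps.

    per_step_error: small bias the policy adds at every step.
    correction_gain: fraction of the current offset the policy removes each
        step; 0.0 means the policy never reacts to where it has ended up.
    """
    offset = 0.0
    for _ in range(steps):
        offset += per_step_error - correction_gain * offset
    return offset


if __name__ == "__main__":
    # No feedback: a 1 cm error per step becomes about 1 m after 100 steps.
    print(rollout_offset(100, 0.01))
    # With feedback on the visited state, the offset stays bounded near 2 cm.
    print(rollout_offset(100, 0.01, correction_gain=0.5))
```

A policy judged one step at a time looks nearly perfect in both cases; only the unrolled, closed-loop view reveals that the first one drifts out of its lane.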
这让事情变得非常非常复杂。
That makes it very, very complex.
类似地,语言模型也存在这种情况:如果进行长对话却没有正确训练系统,对话可能会偏离到非常奇怪的特定领域,因为模型没有被训练保持在主题上。
Again, there is a similar analogy in language models that if you have a long conversation and you don't train your system right, your conversation might veer into really weird idiosyncratic territory because the model is not being trained to stay on topic.
嗯。
Mhmm.
在长时间对话中保持话题不偏离,与以稳定驾驶方式最终到达目的地,本质上是非常相似的问题。
And it's very much the same problem of staying on topic when you're having a long conversation as it is to drive in a way that is stable and gets you to the place that you want eventually.
我是说,这确实带有一些游戏元素。
I mean, it does have those elements of a game to it.
我在想下国际象棋的例子。
I mean, I'm thinking about playing chess.
独自学习下棋毫无意义,因为没有对手的对弈在很多方面都显得毫无价值。
There's no point in learning how to play chess without an opponent, because playing in isolation is sort of worthless in a lot of ways.
只不过你是在同时与无数玩家对弈。
Except that you're playing with many, many, many players simultaneously.
考虑到这个问题的复杂性,你能取得现在的进展某种程度上真是个奇迹。
It is sort of a wonder that you've got as far as you have when you consider how complex this problem is.
和往常一样,环境模拟一直是问题研究和开发的核心。
As always, simulating the environment has been at the core of the development and the study of the problem.
构建一个高度还原现实世界的仿真系统,使你能对这些闭环问题进行推演,是实现目标的关键要素。
Building a simulation that is very faithful to the real world and that enables you to reason about these closed loop problems is a key component to making it work.
你对该空间内其他智能体行为的预测,有多少取决于你对这些智能体的分类?
How much of your prediction about what another agent in this space is going to do comes down to your categorisation of what that agent is?
比如今早我看视频时,有辆Waymo经过一只猫。
I mean, I'm thinking, for example, I was watching some videos this morning and there was a Waymo going past a cat.
那只猫蜷缩着,看起来可能像个足球,对吧?
And the cat was sort of curled up, could have been a football, right?
在预测物体行为前,你们需要先对其进行分类吗?
Like, do you have to categorise what it is before you predict how it might behave?
我们不一定需要精确分类,但如果知道是猫,就能更准确地预测其行为。
We don't necessarily have to categorise it exactly, but if we know it's a cat, we can make better predictions about what its behavior might be.
猫咪会突然改变方向,随意乱跑。
A cat will change direction very quickly and go in random directions.
它们不会走人行横道过马路。
They will not get on the crosswalk to cross the street.
因此对于所有不同的行为体,我们尽可能地进行分类,从根本上预测它们的行为。
So all of the different agents, we try to categorize them as best we can and to predict essentially their behavior.
这也很重要,这样我们才能决定应该制定哪些行为规则,对吧?
It's also important so that we can make decisions about how we should behave in terms of rules, right?
所以我们想知道这是辆自行车,这是骑自行车的人,他们在自行车道上,我们可以超车。
So we want to know this is a bike, this is somebody on a bicycle, they're in the bike lane, we can pass them.
我们需要留出足够的安全距离才能超车。
We need to give them that much width to pass them safely.
因此关于行为体的所有语义信息都非常重要。
So all the semantic information about the agents is very important.
我认为这某种程度上是自动驾驶十五年来的长期经验——场景语义的重要性。
I think that's been kind of the long term learning from fifteen years of autonomous driving has been the semantics of a scene.
行为体的属性对推理至关重要。
What the agents are is extremely important to reason about.
我们要知道汽车不只是路上的立方体。
We want to know that a car is not just a cube on the road.
汽车可能是应急车辆,我们需要避让。
A car can be an emergency vehicle, and we need to yield to them.
如果某处有应急车辆,我们可能不想经过那里,因为可能有警察活动之类的情况。
If there are many emergency vehicles somewhere, probably we don't want to go through there because there's police activity or something like that.
所有这些深层语义信息对车辆运行都至关重要。
So all this kind of deep semantics really matters to being able to operate the car.
是啊,我记得DARPA挑战赛,那是无人驾驶汽车的早期阶段。
Yeah, because I remember the DARPA challenges, sort of the earliest days of driverless cars.
我记得风滚草曾经是个大问题。
There was a big problem with tumbleweed, I seem to remember.
哦,是的。
Oh yes.
不知道那是什么,只是看到前方有个物体,你本可以开车穿过去。
Not knowing what it was, just seeing that there was an object in front that you could have driven through.
你知道的,它看起来不像是会造成潜在损害的东西。
You know, it wasn't like potentially going to be damaging.
你现在还会遇到这种情况吗?
Do you still end up in that situation?
比如说,我现在想到的是雪。
I'm thinking about snow here, for example.
你是否遇到过这样的情况:有些障碍物挡在路上,但它们本不应该影响车辆的行驶规划?
Like, do you find yourself in situations where there are objects that are in the way, but they're not things that should necessarily affect the planning of the vehicle?
是的。
Yeah.
雪就是个很好的例子。
So snow is a great example.
这是我们认真研究的问题,因为我们正试图向更北方的城市扩展业务。
It's something that we started working on seriously because we're trying to expand in the more northern cities.
过去一年我们在塞拉山脉做了大量测试,雪就是个典型例子——它是体积庞大、可能出现在道路上的物体,但你必须判断并决定:这是雪,所以我应该开车穿过它,除非遇到无法通过的大雪堆。
So we've done a lot of testing in the Sierras over the last year, and snow is a typical example of here is something that is big, massive, potentially on the roads, but you have to reason about and say, okay, this is snow, so the right thing for me to do is to drive through it, unless it's, you know, a big pile of snow and you can't.
但如果只是正常驾驶中遇到的普通雪堆,你会想要轧过去。
But if it's a reasonable pile of snow, just what you experience in normal driving, you want to cross through that snow.
如果是石头就不会这么做了。
If it were a rock, you wouldn't do that.
因此像这样精细分类物体,理解哪些能通过哪些不能,确实是驾驶决策的必要组成部分,也是在特殊条件下安全行驶的关键。
So categorizing things at a fine grain like this and understanding what you can or cannot do is really part of the equation and is necessary to be able to drive in those conditions.
我是说,这些关于无人驾驶汽车的雄心壮志,其实早在大语言模型带来重大变革之前就已存在。
I mean, these ambitions to have driverless cars, they really predated the big changes of large language models.
我们这里讨论的是对场景上下文和语义的理解。
And we're talking here about like understanding the context of a scene and the semantics a bit.
多模态模型取得的进展有多少直接应用到了无人驾驶汽车领域?
How much have the advances that have been made in multimodal models fed directly into driverless cars?
相当多。
Quite a bit.
有意思的是,
It's funny.
几周前我还在思考,你知道第一次横跨大陆的自动驾驶是什么时候实现的吗?
I was reflecting a few weeks ago on, do you know when the first transcontinental autonomous drive happened?
那次横跨大陆的驾驶实现了约98%的自主性,几乎完全自动驾驶,时速约60英里。
There was a transcontinental drive where they were at about 98% autonomy, so almost fully autonomous, driving about 60 miles per hour.
那是三十年前的事了。
That happened thirty years ago.
哇。
Wow.
那是在1995年。
That was in 1995.
从自动驾驶概念验证到我们今天所处的位置,花了三十年时间。
It took thirty years from the proof of concept of autonomous driving to where we are today.
令人着迷的是,这不仅需要大量工作,还需要经历几代机器学习和人工智能的发展,才能真正达到所需的性能水平。
And what's fascinating is that it took basically not only a lot of work, but multiple generations of machine learning and AI to really get to a level of performance that was necessary.
我们过去几年看到的现代AI革命只是这个发展历程的最后阶段。
The modern AI revolution that we've been seeing for the last few years is but the last of those.
对吧?
Right?
过去几十年里,我们已经历并见证了多次这样的发展。
There have been several over the past decades that we've experienced and gone through.
现代AI世界为自动驾驶带来的突破在于,它能够理解场景的语义,本质上实现零样本学习。嗯。
The modern AI world, what it's opening up for autonomous driving is really this idea that you can get at the semantics of a scene essentially zero-shot. Mhmm.
无需针对特定场景专门训练模型。
Without having to train the model specifically for it.
如果向Gemini展示一张道路事故现场的照片,Gemini会告诉我们这是事故现场。
If I show to Gemini a picture of an accident scene on the road, Gemini will tell us this is an accident scene.
这不需要我进行专门训练。
This is not something I need to train specifically.
这种能力可以从我们所说的世界知识中学习获得。
This is something that can be learned from, you know, what we refer to as world knowledge.
或是更常见的知识,比如东京或伦敦的应急车辆长什么样。
Or more prosaic things, like what do emergency vehicles look like in Tokyo or in London.
我们的自动驾驶系统未必先验地内置这些知识,因为我们直到最近才在这些区域行驶。
We don't necessarily have that knowledge built into our driver a priori because we've never driven in those areas until very recently.
但那些大型AI模型已内置了这些知识。
But those large AI models have that knowledge built in.
关键在于如何稳健地利用这些知识,为车辆运行提供恰到好处的信息支持。
So the key is how do you leverage that in a way that is robust, in a way that basically provides the right level of information for the car to be able to operate.
那么你们具体是如何...这是我接下来要问的明显问题。
Well, "how do you?" is the next obvious question.
因为这里涉及很多不同的要素。
Because, I mean, you're talking about lots of different elements here.
包括传感器数据的整合、场景语义的分类,以及对不同物体未来行为的预测。
You've got like the integration of the sensor data, you've got the categorisation of, you know, the semantics of the scene, you've got the prediction of how different objects might behave in the future.
你们是如何整合所有这些要素的?
I mean, how do you put all of that together?
是的。
Yeah.
所以一个取巧的方法是我们可以先把所有处理放在云端完成。
So one cheat is that we can do all of that in the cloud first.
这样我们就能在云端构建一个庞大的驾驶模型,整合所有信息——包括传感器数据、数百万英里驾驶经验,以及来自各渠道提供世界认知的数据。
So we can build essentially a very large driver in the cloud, very large model that incorporates all that information, all the sensor information, all the experience that we have from driving millions of miles, all the data that comes from various sources that provides us with world knowledge.
这样做的好处是,云端处理不受车载系统的运行限制约束。
And the benefit of that is that when you do that in the cloud, you don't really have the same operational constraints that you would have on a car.
处理速度可以慢。
It can be slow.
可以占用大量内存和算力。
It can take a lot of memory, take a lot of compute.
也不必完全满足车辆所需的实时性要求。
It can also not necessarily meet all the real time constraints that the car requires.
但一旦有了这个'教师驾驶模型',就能用它来训练车载系统——通过云端模型提供的监督学习,将所有知识蒸馏到车载系统上,而车载系统本身可以有不同的运行限制、算力限制和架构限制等等。
But once you have that teacher driver, you can use that to teach the onboard system based on that supervision that you provide from the cloud based driver and distill all that information onto the onboard system, which itself can have different operational constraints, different compute constraints, different architectural constraints, so on and so forth.
这就是将强大AI能力引入车辆,而不必把所有东西都硬塞进车里的一个途径。
So that's one path to, you know, bringing the power of very powerful AI onto the car without really having to just shove it all in the car.
让我确认下是否理解正确。
So let me make sure I understand that then.
你们在云端有个巨型模型,它提供了类似'优秀驾驶方案空间'的参考标准。
So you have this sort of giant model in the cloud that sort of gives you like a solution space, as it were, of like, this is what good driving looks like.
而车辆只需在行驶时判断当前处于这个方案空间的哪个具体位置。
And then when you're in the car, just have to work out which bit of that solution space you're in, in that particular moment in time.
需要澄清的是,我们并非实时进行这种交互。
So we don't do that in real time, to be clear, right?
这不是说车辆在路上行驶时会随时询问云端'我下一步该怎么做'。
This is not, you know, the car is on the road, it's going to ask the driver in the cloud, you know, what do I do next?
不,不,这样不行。
No, no, that doesn't work.
我们需要让驾驶系统完全自包含在车上,实现完全的自主和独立。
We need to have the driver be self-contained on the car and, you know, be completely autonomous and independent.
不依赖互联网。
Not rely on the Internet.
对。
Right.
依赖互联网连接并不是一个安全的好策略。
Relying on Internet connectivity would not be a good safety posture.
但我们离线处理这个问题,即在训练车载模型时,我们将云端的大模型作为‘先知’进行查询,获取理想操作方案,然后通过反向传播将这些知识整合到车载系统中。
But we do that offline, meaning that when we train the models that are used in the car, we query that large model in the cloud as an oracle to tell us, you know, what would be the ideal thing to do, then essentially backpropagate that onto the onboard system.
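Editor's sketch: the offline teacher-to-student idea can be illustrated with a tiny distillation loop. Everything here is a stand-in: `teacher_steering` is a hypothetical placeholder for a large cloud model, the student is a two-parameter linear model, and plain gradient descent on squared error substitutes for whatever training setup is actually used.

```python
def teacher_steering(curvature: float) -> float:
    """Hypothetical stand-in for a large, slow model queried offline."""
    return 3.0 * curvature + 1.0


def distill(inputs, teacher, epochs: int = 2000, learning_rate: float = 0.1):
    """Fit a small student (w * x + b) to the teacher's outputs."""
    # Query the teacher once, offline; the student never calls it at run time.
    targets = [teacher(x) for x in inputs]
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            error = (w * x + b) - y
            # Gradients of the mean squared error with respect to w and b.
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    student_w, student_b = distill([-1.0, -0.5, 0.0, 0.5, 1.0], teacher_steering)
    print(round(student_w, 3), round(student_b, 3))
```

The design point the sketch tries to capture is the separation of concerns: the expensive model is only consulted during training, and the deployed student runs on its own with no network dependency.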
听起来你们正在接近Gemini版本的模式——有一个核心大模型,所有人都从这个主模型中获取支持。
I mean, sounds like you're stepping closer towards the sort of Gemini version where there's like, there is the giant model and everyone's sort of tapping into the main model.
这是不是主要目标?
Is that sort of the big aim?
是的。
Yeah.
更接近这个方向。
It's closer to this.
在这个过程中有趣的是,正如我之前提到的,驾驶问题在概念上与LLM解决的对话问题相差并不远。
What's been interesting in the journey is that, as I mentioned earlier, the driving problem is not that far removed conceptually from the dialogue problem that LLMs solve.
这是一种视觉对话或多智能体运动对话。
It is a visual dialogue or a motion dialogue with multiple agents.
因为我们可以这样定义问题,实际上数学原理是完全相同的,对吧?
And because we can frame it that way, I mean, literally the math is the same, right?
所以我们训练的模型具有与Gemini模型高度相似的特性,并能运用相同的技术手段来解决问题,包括如何扩展规模。
So we basically train a model that has very much the same properties as a Gemini model, and we are able to leverage all the same techniques that a Gemini model applies to the problem, including how do you scale it?
你如何提供适当的监督水平?
How do you provide it the right level of supervision?
而且所有这些问题都非常、非常、非常相似。
And all those questions are very, very, very similar.
我也在想,端到端是目前人们讨论很多的话题。
I'm also wondering, I mean, end to end is something that people are talking about quite a lot now.
是否有可能像标记语言那样对传感器数据进行标记化处理?
Is there an ambition that you could tokenize sensor data in the same way that you can tokenize language?
这已经非常像Gemini这样的多模态模型所做的事情了,对吧?
That's already very much what a multimodal model like Gemini does, right?
你将所有图像和传感器输入标记化,然后作为抽象标记传递给语言模型,这些标记基本上就像词语一样运作。
You tokenize all the images and the sensor inputs and pass them on to a language model as, you know, abstract tokens that basically act like words.
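Editor's sketch: the simplest possible picture of "tokenizing" sensor data is to quantize continuous readings into a fixed vocabulary of integer ids. Real multimodal models typically use learned encoders rather than uniform bins, so treat this as a simplification; the function name and parameters are hypothetical.

```python
def tokenize_readings(readings, low: float, high: float, vocab_size: int):
    """Map continuous readings to integer token ids via uniform binning."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    tokens = []
    for value in readings:
        # Clip out-of-range readings so every value maps to a valid token.
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range belongs to the last bin.
        tokens.append(min(int(fraction * vocab_size), vocab_size - 1))
    return tokens


if __name__ == "__main__":
    # Range readings in metres, mapped onto an 8-token vocabulary.
    print(tokenize_readings([0.0, 12.5, 49.9, 100.0, 250.0], 0.0, 100.0, 8))
```

The trade-off raised in the conversation shows up even here: coarse bins give compact, abstract tokens, while finer bins keep more of the concrete signal.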
因此,每种感知系统底层的机制从根本上都与标记化的理念兼容。
So the machinery under every sort of perception system, fundamentally, is compatible with this idea of tokenization.
那么问题来了,你想传递什么样的标记?
And then the question is, what kind of tokens do you wanna pass around?
你是想要非常抽象的标记,还是想传递非常具体的信息?
Do you want tokens that are very abstract, or do you wanna pass around information that is very concrete?
抽象信息在某些方面可能更丰富。
The abstract information can potentially be richer in some ways.
更具体的信息能让你以更直接的方式模拟那种状态。
Information that is more concrete gives you the power of simulating that state in a much more direct way.
实际上,模拟到像素级别是触手可及的。
Actually, simulating down to the pixel level is something that is within reach.
嗯。
Mhmm.
目前有很多工作正在朝着世界模型的方向发展。
There's a lot of work right now going towards world models.
一旦你能以可控的方式生成传感器数据,就能模拟自动驾驶汽车的整个部署过程。
So once you can do sensor generation in a controllable way, that opens up the capability to simulate the entire rollout of an autonomous car.
而这还处于起步阶段。
And that's nascent.
这大致是许多技术发展的方向。
That's kind of where a lot of the technology is headed.
它能解决全部问题吗?
Is it going to solve the entire problem?
要知道,时间会证明一切。
You know, time will tell.
但这确实是当前研究的最前沿。
But that's really what the edge of the research is at right now.
实际上,我们播客邀请过开发Genie项目的研究人员,他们做的开放世界模型正像你描述的那样。
We actually, we had on the podcast the people who were working on Genie, the open world models exactly as you're describing.
不过说到这些模拟,我在想机器人专家经常提到的'模拟与现实差距'问题。
Those simulations though, I'm thinking here about the sim to real gap that roboticists in particular talk about a lot of the time.
我是说,你们创建的这些模拟有多真实?
I mean, how realistic are those simulations that you're creating?
因为环境中显然存在某些对驾驶来说无关紧要的元素。
Because presumably there'll be certain things in an environment that are broadly not necessary for the purposes of driving.
比如Genie的开放世界模型中,有个特别惊艳的例子展示猫移动时的光影变化——虽然视觉效果很美,但对驾驶没什么意义,除非某些情况下光线变化可能影响视觉判断。
I'm thinking about, I don't know, the open world models with Genie, how there was a really amazing example about how the light changed as a cat moved through it, which looks very beautiful but is not that relevant for driving, except that maybe in some situations the light changing might affect the visuals.
是的。
Yeah.
模拟的保真度非常非常重要,但未必是你说的这种视觉保真度。
The fidelity of the simulation is very, very important, but not necessarily this kind of visual fidelity that you're talking about.
我们需要的是几何保真度。
We want geometric fidelity.
我们希望确保环境的物理特性得到尊重,但视觉效果同样至关重要。
We want to make sure that the physics of the environment are respected, but the visuals really matter too.
我们模拟器的一个功能是能够采集驾驶实例,然后在不同条件下进行模拟。
One of the things that we do with our simulator is that we're able to take examples of driving and then we simulate it in different conditions.
无论是夜间、雪天、清晨还是傍晚,只需借助AI调整视觉效果,就能将春日场景转变为冬日、夏日或夜景,从而大幅增加我们可模拟的环境数据量和不同条件。
At night, in snow, in the morning, in the evening, by simply just using AI to adapt the visuals and turn a spring scene into a winter scene, turn it into a summer scene or a night scene, and really augment the amount of data and the different conditions in which we can simulate the entire environment.
那么你们是否有这样的愿景:未来某天无人驾驶汽车能仅凭传感器数据直接输出结果,而不需要那些可能并非中间步骤、但能让你了解幕后过程的额外信息?
So is there an ambition then that at some point in the future you will be able to have a driverless car that just purely takes in the sensor data and just rather than sort of spitting out all these, maybe not intermediate steps, but things that give you some way of exploring what's going on behind the scenes, but just purely outputs.
这个故事还有另一个角度,那就是安全性和验证。
There is another angle to this story, which is the safety angle and the validation.
现有的安全规则是我们可以理解的。
So the safety rules that exist are ones that you and I can relate to.
基本上就是诸如不要碰撞任何物体这类规则。
It's basically things like don't collide with anything.
不要闯红灯。
It's don't run a red light.
遵守优先权和通用交通规则。
It's respect priority and road rules in general.
所有这些规则都非常具体。
All of those rules are very concrete.
它们不是抽象符号空间里的标记,对吧?
They're not abstract tokens in token space, right?
不是。
No.
因此你需要能够以可推理、可明确、可保证的方式来执行这些道路规则和安全规则。
So you want to be able to enforce those rules, both the road rules and the safety rules in a way that can be reasoned about, can be explicit, can be guaranteed.
对环境状态和驾驶员行为建立具体表征,在构建安全边界方面具有巨大优势——如果车辆的所有决策都在完全抽象的方式下进行,你将无法获得这种优势。
So having a concrete representation of the environment, of what the driver is going to do, has huge amount of benefit in terms of providing a way of expressing a safety envelope that you wouldn't have necessarily if all of the reasoning of the car happened in a completely abstract fashion.
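The point about concrete representations can be sketched in code: when a planned step is expressed in explicit quantities, rules such as "don't collide" and "don't run a red light" become simple checks. The sketch below is hypothetical and is not Waymo's safety framework; the `PlannedStep` fields, the rule names, and the 0.5 m clearance threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlannedStep:
    """One step of a planned trajectory, in concrete, checkable terms."""
    gap_to_nearest_object_m: float
    light_is_red: bool
    crosses_stop_line: bool


def violations(step, min_gap_m=0.5):
    """Return the names of explicit rules that this step would break."""
    found = []
    if step.gap_to_nearest_object_m < min_gap_m:
        found.append("collision_risk")
    if step.light_is_red and step.crosses_stop_line:
        found.append("runs_red_light")
    return found


def plan_is_safe(plan):
    """A plan passes only if every step breaks no rule."""
    return all(not violations(step) for step in plan)


plan = [
    PlannedStep(gap_to_nearest_object_m=4.0, light_is_red=True, crosses_stop_line=False),
    PlannedStep(gap_to_nearest_object_m=3.5, light_is_red=True, crosses_stop_line=True),
]
print(plan_is_safe(plan))
print(violations(plan[1]))
```

A check like this can be reasoned about line by line, which is much harder if the only available state is an opaque vector of abstract tokens.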
就是关于具体事项的想法,你列出的那些安全规则对吧?
Just on the idea about things being concrete, those safety rules that you listed there, right?
不能发生碰撞。
No collisions.
你有什么答案?
What are the answers you have?
不要闯红灯。
Don't run a red light.
不要闯红灯,当然。
Don't run a red light, sure.
对,没错。
Yeah, exactly.
这些规则是硬编码在Waymo车辆里的吗?
Are those hard coded then into the Waymos?
它们是我们提供给车辆的指导原则。
They're provided as the guidance that we provide to the car.
我们有一个安全框架,基本上编码了关于什么是正确驾驶的信息。
We have a safety framework that basically encodes the kind of information about what it is to do proper driving.
作为工程师,我们需要确信这些规则的编码能确保驾驶员在任何时刻都遵守。
And we want to be convinced as engineers that the encoding of those rules is something that the driver would meet at every point in time.
让我再深入探讨一下人类驾驶方面。
Let me just dig into the human side a little bit more.
自动驾驶汽车应该表现得像人类驾驶员吗?
Should an autonomous vehicle behave like a human driver?
这是个非常好的问题。
It's a really good question.
多年来我们学到的一点是:你基本上要成为道路上最正常的车辆。
One thing that we've learned over the years is that you want to be basically the most normal car on the road.
你不一定希望比路上其他司机更胆小,因为人们会察觉到这种差异并实际欺负这辆车——倒不一定是出于恶意,只是如果他们知道Waymo汽车总是比他们更胆小,就会想开到它前面去。
You don't necessarily want to be more timid than other drivers on the road because then people will pick up on that difference and actually abuse the car, not necessarily out of mischief or anything like that, just because, you know, if they know that a Waymo car will always be more timid than they are, they will want to drive in front of it.
反之如果你比普通司机更具攻击性,就会干扰交通流或违背他人的预期。
If on the opposite side you're more aggressive than the average driver, then you're disruptive to the flow of traffic or you violate other people's expectations.
人们不会预期自动驾驶汽车自己发狂。
They don't expect an autonomous car to be raging on their own.
对吧?
Right?
所以最佳状态是表现得像路上最无聊、最普通的司机。
So the sweet spot is really if you act like the most boring, normal driver on the road.
事实证明这也是最安全的。
It turns out it's also the safest.
这是最佳的安全姿态。
It's the best safety posture to have.
我们不一定想复现人类司机的所有行为。
We don't necessarily want to reproduce everything a human driver does.
事实上我们发现人类通常不太擅长风险分析。
In fact, what we find is that very often humans are not very good at risk analysis.
他们会做出一些经过计算其实并不安全的行为。
They will do things that if you do the math, are not necessarily safe.
举个例子。
Give an example.
很多人会以远低于分析专家推荐的距离紧跟前车。
I mean, a lot of people will tailgate at distances that are much smaller than what is recommended by people who've done the analysis.
比如说,在某个速度下你应该与前车保持多少米的距离。
Say, you know, you should be at this many meters away from the rear of the car in front of you if you're at that speed.
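The relationship between speed and recommended following distance mentioned here can be made concrete with a time-gap rule: distance equals speed multiplied by a fixed time gap. The two-second default below is a commonly cited guideline and is used here as an assumption; the transcript does not say which figure Waymo applies.

```python
def following_distance_m(speed_kmh, time_gap_s=2.0):
    """Distance to keep from the car ahead, using a fixed time gap.

    time_gap_s is an assumed guideline value, not a figure from the interview.
    """
    if speed_kmh < 0 or time_gap_s < 0:
        raise ValueError("speed and time gap must be non-negative")
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms * time_gap_s


for speed in (30, 50, 100):
    print(f"{speed} km/h -> {following_distance_m(speed):.1f} m")
```

At 100 km/h a two-second gap works out to roughly 56 m, which helps explain why such guidelines feel extreme to drivers used to much shorter gaps.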
我很确定大多数人看到这些指南时都会觉得:这太极端了。
And I'm very sure that most of those guidelines, people look at them and like, that feels extreme.
这感觉太过分了。
That feels too much.
实际上,人们在路上未必会想这么做。
And it's not what in practice people will wanna do on the road necessarily.
但当我们查看数据时,我们明白有些事情需要更加谨慎对待。
But when we look at the data, we understand that there are things that we should be a lot more cautious about.
人们也不一定会去思考那些不在眼前可能发生的事情。
People also don't necessarily reason about what could happen when it's not in front of their eye.
比如一辆大卡车挡住了人行横道。
So you have a big truck occluding a pedestrian crossing.
你必须考虑到很有可能有行人正在通过那里。
You have to think about it's very possible that a pedestrian will be crossing through there.
如果看不见,人们不总是会考虑所有可能发生的情况。
If it's not in sight, it's not always that people think about all the possibilities of what could be happening.
因此我们需要采取的安全立场以及对这类情况的思考方式,往往比人类更为保守。
So the safety posture that we need to have and how we reason about those kinds of situations tend to be a bit more conservative than humans.
我认为这就是为什么数据显示,在相同的驾驶环境下,我们的安全性比人类驾驶员高出大约一个数量级。
And I think that's part of why the numbers today show that, compared to human drivers in the same kinds of environments that we're driving in, we're basically around an order of magnitude safer.
Waymo汽车导致严重伤害的事故率降低了约88%。
The rate of accidents that lead to severe injury is about eighty eight percent lower with Waymo cars.
而这个差距,我们确实可以通过不仅做每个人都会做的事,不仅达到平均水平,还要采取更保守的安全姿态来实现。
And that gap, really, we can attain by not just doing what every human would do, not by hitting the average, but also having a more conservative safety posture.
回到普通驾驶员的讨论,我想问驾驶方式会因地区不同而变化吗?
Back on the idea of the average driver, I mean, does driving change depending on where you are?
我记得你们目前正在日本进行一些测试,对吧?
I'm thinking you're doing some testing in Japan at the moment, aren't you?
不同国家的驾驶风格有差异吗?
Is there a different style of driving in different countries?
是的,预期会因你所在的位置而改变。
Yes, expectations are going to change depending on where you are.
再次强调,本着尽可能低调的原则,我们需要让驾驶员适应当地情况,因为各地的交通规则确实不同。
And again, in the spirit of being as inconspicuous as possible, we will need to adapt the driver to the local conditions simply because the rules of the road are different in different places.
比如在美国,有些州允许红灯右转,有些州则不行。
Like, even in the US, some states let you turn right on red, some states don't.
而且不仅交通规则,许多预期还建立在日常驾驶习惯上。
And there is a lot of expectations that are built into not just the road rules, but also common practices.
实际上,至少在美国我们看到的是,人们夸大了驾驶方式的地域差异。
In practice, what we've seen, at least in the US, is that people make the differences in driving out to be a lot bigger than they are in reality.
真的吗?
Really?
我觉得很多人都以本地驾驶方式为荣,或者喜欢说他们那儿的司机技术差或开车很野之类的话。
I think a lot of people are proud of their local way of driving or proud to say that people in their local environment are terrible drivers or are aggressive drivers and things like that.
归根结底,在美国大陆开车仍是个高度同质化的环境。
At the end of the day, the continental US is still a very homogeneous kind of environment to be driving in.
明白了。
I see.
明白了。
I see.
凤凰城对比洛杉矶。
Phoenix versus LA.
日本的情况是,首先他们靠左行驶。
So Japan is, you know, first of all, driving on the other side of the road.
这个简单但重大的交通规则变化会影响你的驾驶行为,也会影响行人行为——因为你会习惯性先看左边而不是右边。
A very simple, massive road rule change that has an impact on how you behave and on how pedestrians behave because, you know, you will look left versus looking right.
你不能
Can you not
就切换一下,是的。
just switch? Yeah.
粗略地说,这是世界的镜像。
To a first degree approximation, it's a mirror image of the world.
但我认为还有更多因素,比如驾驶时的预期。
But there is also a lot more, I think, the expectations from driving.
虽然没有确切数据,但从经验来看,日本道路上用手势示意车辆的交通协管员要多得多。
Anecdotally, I don't have hard data on this, but anecdotally, there are a lot more agents on the roads gesturing at cars in Japan.
这更像是交通中的常态部分。
It's a lot more of a normal part of traffic.
不仅限于紧急情况,准确理解这些手势的含义及其可能略有不同的编码方式也是个重要因素。
It's not just in emergency situations and understanding those gestures well and what they are and maybe how they're codified slightly differently may be a factor.
我想这很大程度上取决于你所驾驶车辆的物理尺寸、道路形状以及交通密度。
Well a lot of that I guess comes down to the physical size of the cars that you're in, the shape of the roads, the kind of density of traffic.
不过我在想这些手势信号,因为即便我在日本开始开车,我觉得自己也能凭直觉理解手势的含义。
I am wondering about the hand signals though, because even if I started driving in Japan, I think I would have an intuitive sense of what hand signals might mean.
如何把这些手势信号整合到车辆设计中呢?
How do you build those into the car?
特别是手势信号本身相当复杂,不是吗?
I mean, hand signals in particular are quite difficult, aren't they?
它们本身就是个独立的话题。
They are a topic in and of itself.
而且我们常说手势信号,实际上它涉及整个肢体语言。
And often, you know, we say hand signals, but really it's a whole body language.
传递的信息远不止手的动作那么简单。
There is a lot of information that gets conveyed that is not just how the hand moves.
比如夜间会有人用灯光指引方向之类的。
You have, for example, people at night shining lights to tell you where to go or things like that.
我们拥有大量此类数据,可以通过数据驱动的方法和启发式算法,评估我们如何解读这些数据、衡量我们理解手势信号的有效性,并改进系统。
We have a lot of that data and we can evaluate how we interpret that data and measure our effectiveness at understanding hand signals and improve on the system using data driven methods, using heuristics.
我刚到这儿时,看到一个建筑工人用手势拦停了一辆Waymo,因为后面有卡车在倒车。
I was just coming here, I saw a construction worker stopping a Waymo with their hand in front because there was a truck that was backing up.
人们非常确信Waymo能正确理解那个手势信号。
And people were very confident that that hand signal was going to be interpreted properly by the Waymo.
这是公众对系统寄予的巨大信任,我们必须不辜负这份信任。
That's a huge amount of trust that people are placing in the system, and we need to meet that trust.
是的,完全同意。
Yeah, absolutely.
不过我很好奇,在无人驾驶汽车需要表现得像普通司机的情况下,你们是如何进行教学的?
So I am curious though, how do you teach a driverless car in the situations where it's acting like the average driver?
你们如何将其作为指令使用?
How do you use that as an instruction?
我的意思是,你们追求什么样的奖励函数?
I mean, what's the reward function that you're going for here?
你们在优化什么目标?
What are you optimising?
存在不同的优化函数。
There are different optimization functions.
一个基本的优化函数就是模仿学习。
One basic optimization function is just imitation learning.
这是每个机器人最初都要掌握的核心学习方式。
So this is the bread and butter learning that every robot starts from.
对吧?
Right?
通过这种方法建立基线系统,本质上就是采集人类司机的数据,精确模仿他们的平均行为。
It's how you build a baseline system that basically you take in the data from human drivers and imitate exactly what they do on average.
对吧?
Right?
这让你达到某种水平。
That takes you to some level.
就像对于大语言模型,单纯对人类文本进行模仿学习是构建语言模型的起点基线。
Just like for LLMs, just doing imitation learning on human text is the baseline that you start from when you build a language model.
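The imitation-learning baseline described above, "imitate exactly what they do on average", can be sketched in its simplest tabular form: for each situation, predict the mean of the actions human drivers took in it. The situation labels and speeds below are invented for illustration, and a real system would use a learned model over rich sensor inputs rather than a lookup table.

```python
from collections import defaultdict


def fit_average_policy(demonstrations):
    """Behaviour cloning in its simplest form: per-situation mean action.

    demonstrations: iterable of (situation, action) pairs from human drivers.
    Returns a dict mapping each situation to the average demonstrated action.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for situation, action in demonstrations:
        sums[situation] += action
        counts[situation] += 1
    return {s: sums[s] / counts[s] for s in sums}


# Hypothetical demonstrations: (situation label, chosen speed in km/h).
demos = [
    ("school_zone", 20.0),
    ("school_zone", 30.0),
    ("arterial", 50.0),
]
policy = fit_average_policy(demos)
print(policy["school_zone"])  # 25.0
```

The limitation the speaker goes on to describe is visible here: the policy can only be as good as the average demonstration, so anything better needs extra signal beyond imitation.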
在此基础上,通常会加入强化学习等方法。
On top of that, you tend to have things like reinforcement learning.
通常会采用偏好数据学习、人工标注学习等手段。可以设想类似的架构:除了人类模仿,我们还会向驾驶员提供信号,指出我们认为人类不够优化的领域,并提示更好的解决方案。
You tend to have learning from preference data, learning from human annotations and things like that. So you can imagine a similar setup in which, beyond just the human imitation, we would provide signals to the driver about areas where we don't think humans are optimal and hints as to what is a better solution.
然后我们可以对此进行模拟。
And then we can simulate that.
我们做了大量模拟。
We do a lot of simulations.
我们还模拟了许多永远不希望在实际道路上看到的极端情况。
We also simulate a lot of the extreme cases that we never want to observe on the road.
这是另一个非常重要的角度:有很多情况我们希望永远不要发生。
So that's another angle that is very important, is there are a lot of things that we hope to never see happen.
因此在那些场景中我们无法向人类学习,因为我们根本看不到那些情况。
And so we cannot learn from humans in those scenarios because we just don't see them.
比如什么?像多车连环相撞这类情况?
Like what? Kind of multi-car pileups?
我就是这个意思。
That's what I'm saying.
没错。
Right.
正是如此。
Exactly.
对。
Right.
那么如果前方发生多车连环相撞,你会怎么做?
So if there is a multi car pileup in front of you, what do you do?
是的。
Yeah.
我们不会有那些数据。
We won't have that data.
当然。
Sure.
我们希望永远不要有那些数据。
We hope to never have that data.
但我们会模拟这类连环事故,从中学习车辆应有的正确行为。
But we simulate those kinds of pileups, and then we learn from that what is the right behaviour for the car.
那让我问问关于安全性的问题。
Let me ask you about safety then.
有没有发生过涉及Waymo车辆的事故?
Have there been incidents involving the Waymos?
是的,我们并不完美,周围的世界也不完美。
Yeah, we are not perfect, and the world around us is not perfect.
所以确实发生过事故。
So there have been incidents.
每当事故发生,我们都会仔细研究。
Whenever there is an incident, we look at it very carefully.
我们会在模拟中重现它。
We replay it in simulation.
我们始终在思考:是否有更好的处理方式?
We always look at are there ways that we could have done things better?
是否存在不论责任归属都能缓解问题的方法?
Are there ways we could have mitigated the issue irrespective of responsibility?
我们尝试构建更多防御性措施。
And we try to build in a lot more defensive measures.
即使没有事故发生,只要存在可能引发事故的隐患,我们也会尽力消除。
Even when there is not an incident, if there is something that makes us think that potentially there could be an incident, we try to mitigate that too.
模拟是绝佳工具,它能帮助我们验证反事实场景。
Simulation is a great tool for this because it enables us to provide counterfactuals.
比如我们会模拟:如果当时对方司机酒驾,反应速度比实际慢很多的情况。
So we simulate what happens if, in this instance, the other driver was drunk, and so their reaction time was a lot slower than what it was.
我们能否成功化解危机?
Would we have mitigated the issue?
我们的系统能否通过应对避免事故发生?
Would we have been able to perform in a way that would have avoided an incident in general?
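The counterfactual described above, replaying an event with a slower-reacting driver, can be sketched with basic stopping-distance kinematics: distance travelled during the reaction time plus braking distance v²/(2a). This is a deliberately simplified stand-in for a full simulator; the speed, gap, and deceleration values are assumptions chosen for illustration.

```python
def stopping_distance_m(speed_ms, reaction_time_s, decel_ms2):
    """Reaction distance plus constant-deceleration braking distance."""
    if decel_ms2 <= 0:
        raise ValueError("deceleration must be positive")
    return speed_ms * reaction_time_s + speed_ms ** 2 / (2.0 * decel_ms2)


def collides(gap_m, speed_ms, reaction_time_s, decel_ms2=6.0):
    """True if the driver cannot stop within the available gap."""
    return stopping_distance_m(speed_ms, reaction_time_s, decel_ms2) > gap_m


# Hypothetical logged event: other car at 15 m/s with 45 m to the hazard.
# Replay it with progressively slower reaction times (the counterfactuals).
for reaction_time_s in (1.0, 1.5, 2.5):
    outcome = "collision" if collides(45.0, 15.0, reaction_time_s) else "stops in time"
    print(f"reaction {reaction_time_s:.1f} s -> {outcome}")
```

Sweeping one parameter of a logged event in this way is what lets a team ask whether the outcome would still have been acceptable under worse conditions than the ones actually observed.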
因此我们高度重视这类经验,并持续将其整合到驾驶系统中。
So we take those kinds of learnings very seriously and incorporate it in the driver in a continuous way.
那么这些模拟测试,是用来验证车辆使用的模型是否处于最佳状态吗?
So these simulations then, you're using them, I guess, to validate that the model that the car is using is the best that it can be?
这不仅仅是验证。
It goes beyond just validating.
我们还通过模拟为驾驶系统提供反馈。
We also use simulation as a way of providing feedback on the driver.
整个训练循环会利用模拟数据,并将信息反馈用于优化驾驶系统本身。
So there is a training loop that basically leverages simulation and feeds back this information to improve the driver itself.
我们不仅被动检查是否达到模拟设定的性能标准,更会将信号反向传播到模型中,实现驾驶系统的自我进化。
So it's not just passively looking at whether we meet the performance criteria that the simulation is providing us with, but it's also using that signal as something we can backpropagate into the model so that the driver itself improves.
听说你们即将进军伦敦市场,在新地区部署时有什么计划?
When it comes to rolling out into a new location, I know there's rumors of you going to London quite soon.
在将无人驾驶汽车投入道路之前,你们需要进行多少地图测绘工作?
How much mapping do you have to do before you can put cars without drivers on the roads?
我们倾向于使用地图,因为地图能让我们预先了解应该期待看到什么。
We like to have a map because a map gives us a prior on what we should expect to be there.
通常还有一些对我们更相关的信息,这些信息不一定直接在谷歌地图上获取。
There's usually information that is also more relevant to us that is not necessarily directly available on Google Maps, for example.
我们喜欢标注减速带的位置。
We like to map where there are speed bumps.
对。
Right.
对吧?
Right?
我们会尝试观察减速带并减速,但如果它在夜间等情况下不可见,我们仍会知道那里可能有减速带。
We will try to see that there is a speed bump and slow down, but if it's not visible, for example, if it's at night or something like that, we have a knowledge that there might be a speed bump.
所有这些都让我们能更有把握地了解车辆在该环境中应该看到的信息。
And all of that gives us a way of having a higher level of confidence about the information that the car should expect to see in an environment.
归根结底,车辆会自主做出决策。
At the end of the day, the car makes its own decisions.
我们从不假设地图是绝对正确的。
We don't ever assume that the map is correct.
但这就像是额外的证据,帮助它调整对下一步最佳行动的判断。
But it's like an extra piece of evidence that it's adding to change its belief of what the best thing to do next is.
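Treating the map as a prior that is combined with what the sensors report, rather than as ground truth, can be illustrated with a single Bayes update. All probabilities below are made-up values for the speed bump example, and the function name is hypothetical; the interview does not describe the actual inference method used.

```python
def posterior_speed_bump(prior, p_detect_given_bump, p_detect_given_none, detected):
    """Bayes update of the belief that a speed bump is present.

    prior: belief before looking (e.g. high if the map marks a bump here).
    p_detect_given_bump: chance the sensors report a bump when one exists.
    p_detect_given_none: chance of a false report when none exists.
    detected: what the sensors actually reported.
    """
    if detected:
        like_bump, like_none = p_detect_given_bump, p_detect_given_none
    else:
        like_bump, like_none = 1.0 - p_detect_given_bump, 1.0 - p_detect_given_none
    numerator = like_bump * prior
    evidence = numerator + like_none * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("observation is impossible under the given model")
    return numerator / evidence


# At night the sensors often miss the bump (assumed 30% detection rate).
with_map = posterior_speed_bump(0.9, 0.3, 0.05, detected=False)
without_map = posterior_speed_bump(0.1, 0.3, 0.05, detected=False)
print(f"map marks a bump: {with_map:.2f}")
print(f"no map entry:     {without_map:.2f}")
```

With these assumed numbers, a missed detection at night barely dents the belief when the map marks a bump, so slowing down remains the sensible choice, while the same observation without a map entry leaves the belief low.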
那么,这些不同区域的差异有多大呢?
So then, how different are these different areas?
我在想高速公路驾驶与城市内驾驶的区别。
I'm thinking of like freeway driving compared to inner city driving.
不同类型的道路会面临不同的挑战。
The different types of roads have different challenges.
举个简单的例子,如果你在普通街道上行驶,当发生意外时,通常可以直接停车。
A simple example is if you're on the surface streets, if something wrong happens, typically, you can just stop.
这可能不是最佳选择。
It may not be optimal.
这可能不是正确的做法。
It may not be the right thing to do.
但万不得已时,确实可以选择靠边停车。
But if push comes to shove, there is the option of pulling over and stopping.
在高速公路上,你绝不想在车流中突然停下。
On a freeway, you don't want to be stopping in the middle of traffic.
这不是最安全的首选方案。
This is not the safe first option.
因此你需要用完全不同的思维方式来应对突发状况。
So you want to be able to reason about exceptional circumstances in a very different way.
速度改变了一切。
Speed changes everything.
对吧?
Right?
车辆高速行驶的事实意味着你必须提前预判更远距离可能发生的情况。
The fact that you have a car that is at much higher velocity means that you really need to anticipate much further ahead what may be happening.
你必须以不同的方式考虑车道问题。
You have to reason about traffic lanes in a different way.
经常会遇到车流堆积的情况。
You often have, like, traffic stacks.
高速公路出口。
Exits of freeways.
这些都不是你在普通街道上会经常遇到的情形。
It's not necessarily the kind of thing that you experience on surface streets.
最近我经常在高速公路上乘车,体验了Waymo在湾区的高速公路驾驶。
I've been taking a lot of rides on freeways lately and experienced the Bay Area freeways in a Waymo.
这
The
因为那还没对公众开放吗?
Because that's not available to the public?
目前尚未对公众开放,但这是我们正在努力实现的一个方向。
It's not available to the public yet, but this is one of the dimensions that we're working on to enable.
我是说,我有偏见对吧?
I mean, I'm biased, right?
所以我要问问你关于伦敦的情况。
So I'm going to ask you about London.
但我觉得伦敦至少在感觉上,从汽车和交通的密度、道路使用者的数量、街道的狭窄程度等方面来看,与美国街道有质的差异。
But I think sort of London feels, well, at least qualitatively very different from the American streets in terms of the density of the cars and traffic, in terms of the like number of road users, in terms of like how narrow the streets are.
我的意思是,是不是直接把这里的技术搬到新地点就行了?
I mean, is it just a case of picking up what you've got here and then moving it to a new location?
还是说会遇到一系列在美国城市不会出现的边缘案例需要处理?
Or are there a whole other range of edge cases that you have to be concerned about that just don't come up when you're in the American cities?
旧金山有趣的地方在于,信不信由你,它实际上是美国人口密度第二高的城市。
So what's interesting about San Francisco is that it's actually the second densest city in the US, believe it or not.
仅次于纽约?
After New York?
仅次于纽约。
After New York.
对。
Right.
所以按照美国标准,这实际上是个相当复杂且密集的环境。
So by US standards, it's actually a pretty complex and dense environment to be in.
它能和伦敦相提并论吗?
Is it on par with London?
我知道不是的。
I know it's not.
伦敦的道路要狭窄复杂得多。
London has a lot more very narrow and complex streets.
问题在于,是形式上的难度更高,还是本质上相同只是容错标准不同这类因素?
The question is, is it formally harder or is it just that it's the same thing just with different tolerances and things like this?
是否存在本质上和形式上都有显著差异的地方?
Are there things that are materially different and formally different about it?
而且伦敦的交通糟透了。
And the traffic is awful in London.
你根本没法快速开车去任何地方。
Like, you can't drive anywhere fast.
不。
No.
我是说,确实如此。
I mean, that's true.
这是个车速很慢的问题。
It's a very slow-moving problem.
缓慢。
Slow.
我正在读几本以伦敦为背景的精彩小说,书里角色有一半时间都堵在路上。
I'm reading some fantastic novels right now that are set in London, and they spend half of their time in the novel in traffic.
真的很可悲。
It's really sad.
你认为要实现完全自动驾驶的主流化,必须取得哪些关键突破?
What breakthroughs do you think have to happen then before fully autonomous driving becomes fully mainstream?
还存在哪些障碍?
What are the barriers that remain?
我认为并不需要形式上新的突破。
I don't think there is any need for any formally new breakthroughs.
我觉得我们正处于正确的技术时代。
I think we're in the right generation of technology.
我并不是说我们已经解决了这个问题。
I'm not saying we have solved it.
我是说自动驾驶的历史可以追溯到大约三十年前。
I'm saying autonomous driving has a history that dates back, you know, thirty years.
我想说,它一路走来想必已经经历了五代不同的技术。
It has to have gone through, I wanna say, five different generations of technology along the way.
最初是从'自动驾驶汽车就是机器人'的视角出发的。
It started from the perspective that an autonomous car is a robot.
让我们用机器人学的方法来解决它。
Let's solve it the robotics way.
然后机器学习、计算机视觉出现了,接着是transformer模型、行为建模,再到基础模型。
And then, you know, machine learning and computer vision came in, transformers came in, behavior modeling, foundation models came in.
我感觉当下就是那个时刻,我们正处在正确的时机。
I feel that today's the day, like we are in the right moment.
有很多创新正在涌现,比如世界模型、大规模基础模型这类技术。
There is a lot of innovations that are coming down, like things like world models, things like large scale foundation models.
但从某种意义上说,这些已经是现在进行时了——它们已经在地平线上清晰可见。
But that's the present in some ways, in the sense that it's on the horizon.
我不认为还需要另一次技术飞跃才能让它真正实用化。
I don't think we need another jump for it to become practical in the real world.
所以我认为现在就是那个时刻。
So I think it's the moment.
我真的觉得自动驾驶正在成为现实,我们必须让它实现。
Like, I really feel like autonomous driving is happening, and we've got to make it happen.
那么你认为,可以说现在就是那个时刻。
And then do you think let's say this is the moment.
你认为自动驾驶汽车会无处不在,还是认为仍会有人类驾驶员的空间?
Do you think that autonomous cars will be ubiquitous or do you think that there will still be room for human drivers?
我能想象一个未来,那时我们的孙辈会问我们:'嘿,在你们那个年代真的需要手动驾驶吗?'
I could picture a future in which your grandkids, my grandkids ask us, Hey, is it true that in your day we used to drive by hand?
听起来很可怕。
That sounds scary.
听起来很危险。
That sounds dangerous.
哦。
Oh.
所以,我们完全可能走向一个未来,在那里手动驾驶不再是常态,大部分驾驶都是自动完成的。
So, it's entirely possible that we're going towards a future where the experience of driving by hand is no longer the norm and most of the driving happens automatically.
我认为这个未来能否实现取决于许多因素。
Whether this future will be realized, I think, depends on a number of factors.
我认为成本和普及程度是其中之一。
I think costs and accessibility is one of them.
基础设施的适应。
Adapting the infrastructure.
我认为这不会很快实现,但这是一个可能的未来。
I don't think it's going to be soon, but it's a possible future.
绝对令人着迷。
Absolutely fascinating.
文森特,非常感谢你的参与。
Vincent, thank you so much for joining me.
感谢邀请。
Thanks for having me.
文森特最后的那番评论特别能说明问题,自动驾驶汽车在全球范围内运行的道路已经铺就。
It's particularly telling the comment that Vincent made at the end there that the path to autonomous vehicles working all over the world is laid out.
当然,我们还有很多工作要做,但不需要再来一场人工智能的大革命了。
And, okay, sure, there's loads more work to do, but we don't need another big revolution in artificial intelligence to get there.
垫脚石已经就位。
The stepping stones are in place.
我们已经实现了对场景的语义理解。
We've got semantic understanding of a scene.
我们有了优秀的模拟系统。
We've got good simulations.
如果你开始以正确的方式整合所有这些学习成果,不仅能复制人类驾驶,
And if you start to put all of that learning together in the right way, you won't just replicate human driving.
还能超越它。
You'll surpass it.
您正在收听的是由主持人汉娜·弗莱为您带来的《谷歌DeepMind》播客。
You have been listening to Google DeepMind, the podcast, with me, your host, Hannah Fry.
如果您喜欢与我一起乘坐这颠簸的Waymo之旅,请记得在您收听播客的平台点赞订阅。
And if you have enjoyed this journey with me in a very bumpy Waymo, then please do like and subscribe wherever you get your podcasts.
和往常一样,本系列后续还会带来更多精彩话题。
And as ever, we have got plenty more incredibly interesting topics coming up later in the series.
欢迎再次加入我们。
So please do join us again.
谢谢。
Thank you.
下车后请将门完全关闭。
After you exit, please close doors all the way.
谢谢。
Thanks.