本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
以下是与杨立昆的对话,这是他第二次做这个播客。
The following is a conversation with Yann LeCun, his second time on the podcast.
他是Meta(前身为Facebook)的首席人工智能科学家,纽约大学教授,图灵奖得主,机器学习和人工智能历史上的奠基性人物之一,他聪明且富有见解,以最棒的方式表达观点,因此和他交谈总是很有趣。
He is the chief AI scientist at Meta, formerly Facebook, professor at NYU, Turing Award winner, one of the seminal figures in the history of machine learning and artificial intelligence, and someone who is brilliant and opinionated in the best kind of way, and so it's always fun to talk to.
现在简要介绍一下每位赞助商。
And now a quick few second mention of each sponsor.
请在简介中查看他们。
Check them out in the description.
这是支持这个播客的最佳方式。
It's the best way to support this podcast.
首先是Public Goods,我用它购买家用产品的在线商店。
First is Public Goods, an online shop I use for household products.
其次是Indeed,一个招聘网站。
Second is Indeed, a hiring website.
第三是ROKA,我最喜欢的太阳镜和处方眼镜。
Third is ROKA, my favorite sunglasses and prescription glasses.
第四是NetSuite,一款用于管理人力资源、财务及其他业务细节的商业软件。
Fourth is NetSuite, business software for managing HR financials and other details.
第五是Magic Spoon,一种低碳水、适合生酮饮食的麦片。
And fifth is Magic Spoon, low carb, keto friendly cereal.
所以你的选择是商业、健康,还是风格。
So the choice is business, health, or style.
明智地选择吧,朋友们。
Choose wisely, my friends.
现在进入完整的广告播报。
And now onto the full ad reads.
和往常一样,中间没有广告。
As always, no ads in the middle.
我努力让内容有趣,但如果你跳过了,请依然去支持我们的赞助商。
I try to make this interesting, but if you skip them, please still check out our sponsors.
我喜欢他们的产品。
I enjoy their stuff.
也许你也会喜欢。
Maybe you will too.
本节目由Public Goods赞助,这里是您购买实惠、可持续、健康家居用品的一站式商店。
This show is brought to you by Public Goods, the one stop shop for affordable, sustainable, healthy household products.
我使用他们的洗手液、牙膏和牙刷。
I use their hand soap, toothpaste, and toothbrush.
我想我可能还用了其他很多东西,但目前想到的就是这些。
I think I use a bunch of other stuff too, but that's what comes to mind.
他们的产品常常采用极简的黑白设计,我简直觉得美极了。
Their products often have this minimalist black and white design that I just absolutely find beautiful.
我非常喜欢。
I love it.
我热爱极简的设计风格。
I love minimalism in design.
它不会过度夸张。
It doesn't go over the top.
它没有那些你不需要的多余东西和功能,只有最基本的必需品。
It doesn't have all these extra things and features that you don't need, just the essentials.
我觉得这很难解释,但正是这些缺席的东西减少了你的注意力分散,让你能真正专注于生活中重要的事。
I think it's hard to explain, but there's something about the absence of things that can take up your attention that allows you to truly be attentive to what matters in life.
总之,前往 publicgoods.com/lex 或在结账时使用代码 Lex,即可享受首单 15 美元折扣,还可免费选择一包竹制吸管或可重复使用的食品收纳膜。
Anyway, go to publicgoods.com/lex or use code Lex at checkout to get $15 off your first order, plus you will receive your choice of either a free pack of bamboo straws or reusable food storage wraps.
访问 publicgoods.com/lex 或在结账时使用代码 Lex。
Visit publicgoods.com/lex or use code Lex at checkout.
本节目还由招聘网站 Indeed 赞助。
This show is also brought to you by Indeed, a hiring website.
我过去在带领团队时,曾多次使用他们进行招聘。
I've used them as part of many hiring efforts I've done for the teams I've led in the past.
这些招聘大多针对工程和研究岗位。
Most of those were for engineering, for research efforts.
他们提供诸如 Indeed Instant Match 这样的工具,能立即为你匹配简历符合职位描述的优质候选人。
They have tools like Indeed Instant Match, giving you quality candidates whose resumes on Indeed fit your job description immediately.
在过去几个月里,我一直在进行一项组建团队的工作,以帮助我。
For the past few months, I've been going through this process of building up a team of folks to help me.
我进行了大量的招聘。
I've been doing quite a bit of hiring.
这是一个充满风险但也令人兴奋的过程,因为你有机会结识一些朋友。
It's a treacherous and an exciting process because you get to meet some friends.
所以这是一个美好的过程,但我认为这是人生中最重要的过程之一。
So it's a beautiful process, but I think it's one of the most important processes in life.
这是选择与你共度每一天的那群人。
It's selecting the group of people with whom you spend your days.
因此,你必须为这项工作使用最好的工具。
And so you gotta use the best tools for the job.
Indeed,我认为是一个极好的工具。
Indeed, I think is an excellent tool.
现在,你可以在indeed.com/lex上获得75美元的免费赞助职位额度,以提升你的职位发布。
Right now, you can get a free $75 sponsored job credit to upgrade a job post at indeed.com/lex.
条款和条件适用。
Terms and conditions apply.
前往 indeed.com/lex。
Go to indeed.com/lex.
本节目还由 ROKA 赞助推出,ROKA 制造的眼镜和太阳镜,我非常喜欢它们的设计、佩戴舒适感以及在材料光学和防滑方面的创新。
This show is also brought to you by ROKA, the makers of glasses and sunglasses that I love wearing for their design, feel, and innovation on material optics and grip.
ROKA 由两位来自斯坦福大学的美国游泳健将创立,其诞生源于对运动表现的极致追求。
ROKA was started by two all American swimmers from Stanford, and it was born out of an obsession with performance.
我喜欢它们的佩戴感受。
I like the way they feel.
我喜欢它们的外观。
I like the way they look.
无论我进行快速跑步——比如每英里八分钟或更快——还是慢跑,比如在炎热或寒冷中沿着河边以每英里九到十分钟的速度奔跑,或者只是穿着西装出门,不管这句话该怎么表达,我也不太确定。
Whether I'm doing a fast paced run, we're talking about eight minute mile or faster, or a slow paced run, nine, ten minute mile along the river in the heat or in the cold, or if I'm just wearing my suit out on the town, however that expression goes, I'm not sure.
但它们搭配西装看起来很有品位。
But they look classy with a suit.
它们在运动装备上看起来酷极了。
They look badass in running gear.
这就是我平时戴的太阳镜。
It's just my go to sunglasses.
在 roka.com 上查看他们的处方眼镜和太阳镜,使用代码 Lex 可享受首单 20% 折扣。
Check them out for both prescription glasses and sunglasses at roka.com and enter code Lex to save 20% on your first order.
访问 roka.com 并使用代码 Lex。
That's roka.com and enter code Lex.
本节目还由 NetSuite 赞助。
This show is also brought to you by NetSuite.
NetSuite 让你能够在一个平台上管理财务、人力资源、库存、电子商务以及更多业务相关细节。
NetSuite allows you to manage financials, human resources, inventory, ecommerce, and many more business related details all in one place.
我不确定我刚才为什么用升调说那句话。
I'm not sure why I was doing up speak on that sentence.
也许是因为我对 NetSuite 非常兴奋。
Maybe because I'm very excited about NetSuite.
总之,经营公司时有很多棘手的事情需要处理好。
Anyway, there's a lot of messy things you have to get right when running a company.
如果你是企业家、创业者或初创公司创始人,这是你必须考虑的事情。
If you're a business owner, if you're an entrepreneur, if you're a founder of a startup, this is something you have to think about.
你必须使用最好的工具来完成工作,确保所有运营企业所需的繁琐事务都为你处理妥当,这样你就能专注于你最擅长的事情,让你的才华得以闪耀。
You have to use the best tools for the job to make sure all the messy things required to run a business are taken care of for you, so you can focus on the things that you're best at, where your brilliance shines.
如果你正在创业,我祝你一切顺利。
If you are starting a business, I wish you the best of luck.
这是一段艰难的旅程,但值得付出。
It's a difficult journey, but it's worth it.
总之,现在特别融资方案又回来了。
Anyway, right now, special financing is back.
请访问 netsuite.com/lex 获取他们独一无二的融资计划。
Head to netsuite.com/lex to get their one of a kind financing program.
网址是 netsuite.com/lex。
That's netsuite.com/lex.
netsuite.com/lex。
Netsuite.com/lex.
本集由Magic Spoon赞助,它是这个播客的元老级赞助商,虽然算不上最早,但我非常喜爱。
This episode is also brought to you by Magic Spoon, the OG, not quite OG, but really old school sponsor of this podcast that I love.
这是一种低碳水、生酮友好的麦片。
It's a low carb keto friendly cereal.
它含有零克糖。
It has zero grams of sugar.
它非常美味。
It's delicious.
我还没说过足够多遍。
I don't say that enough.
它真的非常美味。
It really is delicious.
考虑到它含糖量为零,它竟然如此美味,这非常令人惊讶。
Given that it's zero grams of sugar, it's very surprising how delicious it is.
每份含有13到14克蛋白质,仅4克净碳水化合物,以及140卡路里。
13 to 14 grams of protein, only four net grams of carbs, and 140 calories in each serving.
你可以自行搭配一盒,或者选择包含可可、水果味、糖霜味、花生酱、蓝莓和肉桂口味的混合装。
You could build your own box or get a variety pack with available flavors of cocoa, fruity, frosted, peanut butter, blueberry, and cinnamon.
可可味是我最喜欢的。
Cocoa is my favorite.
这是冠军的味道。
It's the flavor of champions.
我不知道为什么我总这么说,但好像确实如此。
I don't know why I keep saying that, but it seems to be true.
总之,Magic Spoon提供100%满意保证。
Anyway, Magic Spoon has a 100% happiness guarantee.
如果你不喜欢,他们会全额退款。
So if you don't like it, they will refund it.
还有谁会给你100%的满意保证?
Who else will give you a 100% happiness guarantee?
前往 magicspoon.com/lex 并在结账时使用代码 lex,即可享受 5 美元优惠。
Go to magicspoon.com/lex and use code lex at checkout to save $5 off your order.
就是 magicspoon.com/lex,使用代码 lex。
That's magicspoon.com/lex and use code lex.
这是 Lex Fridman 播客,以下是我和 Yann LeCun 的对话。
This is the Lex Fridman podcast, and here's my conversation with Yann LeCun.
你与他人合著了《自监督学习:智能的暗物质》这篇文章。
You co wrote the article, self supervised learning, the dark matter of intelligence.
顺便说一句,这个标题真棒,是和 Ishan Misra 一起写的。
Great title, by the way, with Ishan Misra.
那么让我问一下,什么是自监督学习?为什么它被称为智能的暗物质?
So let me ask, what is self supervised learning, and why is it the dark matter of intelligence?
我先从暗物质这部分说起。
I'll start by the dark matter part.
显然,人类和动物正在做的一种学习方式,目前我们还无法用机器或人工智能很好地复现。
There is obviously a kind of learning that humans and animals are doing that we currently are not reproducing properly with machines or with AI.
对吧?
Right?
所以当今最主流的机器学习方法,或者说我应该说是范式,是监督学习和强化学习。
So the most popular approaches to machine learning today, or paradigms I should say, are supervised learning and reinforcement learning.
而且它们效率极低。
And they are extremely inefficient.
监督学习需要大量样本才能学会任何东西,而强化学习则需要海量的试错次数,系统才能完成任何任务。
Supervised learning requires many samples for learning anything, and reinforcement learning requires a ridiculously large number of trials and errors for, you know, a system to learn anything.
这就是为什么我们还没有自动驾驶汽车。
And that's why we don't have self driving cars.
从一个到另一个,这可是巨大的跳跃。
That's a big leap from one to the other.
好吧。
Okay.
因此,要解决复杂的问题,你必须为监督学习提供大量的人工标注。
So that to solve difficult problems, you have to have a lot of human annotation for supervised learning to work.
而要用强化学习解决这些难题,你必须有一种方式来模拟这个问题,以便进行强化学习所需的大量学习。
And to solve those difficult problems with reinforcement learning, you have to have some way to maybe simulate that problem such that you can do that large scale kind of learning that reinforcement learning requires.
对。
Right.
那么,为什么大多数青少年只需大约二十小时的练习就能学会开车呢?
So how is it that, you know, most teenagers can learn to drive a car in about twenty hours of practice?
而即使有数百万小时的模拟练习,嗯。
Whereas even with millions of hours of simulated practice Mhmm.
自动驾驶汽车却仍然无法真正学会自己开车。
A self driving car can't actually learn to drive itself properly.
所以,显然我们遗漏了某些东西。
And so, obviously, we're missing something.
对吧?
Right?
对很多人来说,这很明显,你经常听到的直接回应是:人类是利用他们的背景知识来更快地学习的。
And it's quite obvious for a lot of people, you know, the immediate response you get from many people is, well, humans use their background knowledge to learn faster.
他们说得对。
And they're right.
那么,这种背景知识是如何获得的呢?
Now how was that background knowledge acquired?
这才是关键问题。
And that's the big question.
所以现在你必须问,婴儿在出生后的头几个月是如何学会理解世界运作的?
So now you have to ask, you know, how do babies in the first few months of life learn how the world works?
他们主要通过观察来学习,因为他们几乎无法在世界上主动行动,却能学到大量关于世界的背景知识,这可能是我们所说的常识的基础。
Mostly by observation because they can hardly act in the world, and they learn an enormous amount of background knowledge about the world, that may be the basis of what we call common sense.
这种学习不是为了完成某个任务,也不是因为受到奖励而学习,仅仅是通过观察世界并理解它的运作方式。
This type of learning is not learning a task, it's not being reinforced for anything, it's just observing the world and figuring out how it works.
构建世界模型,学习世界模型。
Building world models, learning world models.
我们该如何做到这一点?
How do we do this?
那么我们如何在机器中重现这一点呢?
And how do we reproduce this in machines?
所以自监督学习,可以说是尝试重现这种学习方式的一个例子或一次尝试。
So self supervised learning is, you know, one instance or one attempt at trying to reproduce this kind of learning.
好的。
Okay.
所以你关注的是纯粹的观察,甚至不包括儿童的互动部分。
So you're looking at just observation, so not even the interacting part of a child.
就是坐在那里看着爸爸妈妈走来走去,拿起东西,诸如此类。
It's just sitting there watching mom and dad walk around, pick up stuff, all of that.
这就是我们所说的背景知识。
That's what we mean by background knowledge.
可能甚至不是看着爸爸妈妈,只是,你知道的,看着世界运转。
Perhaps not even watching mom and dad, just, you know, watching the world go by.
仅仅是睁开眼睛或闭上眼睛,或者睁开眼睛和闭上眼睛这个动作本身,世界就出现和消失,所有这些基本信息。
Just having eyes open or having eyes closed or the very act of opening and closing eyes that the world appears and disappears, all of that basic information.
你说为了学会开车,人类之所以能快速学会,有些人更快,是因为具备了背景知识。
And you're saying in order to learn to drive, like, the reason humans are able to learn to drive quickly, some faster than others, is because of the background knowledge.
他们在学会开车之前的很多年里,一直在观察汽车在现实世界中的运行,以及基本物体的物理特性,所有这些知识。
They were able to watch cars operate in the world in the many years leading up to it, the physics of basic objects, all that kind of stuff.
我的意思是,关于物体的基本物理知识,你甚至不需要意识到自己知道这些,也不需要了解汽车是如何工作的。
I mean, basic physics of objects, you don't even know you know it; you don't even need to know, you know, how a car works.
对吧?
Right?
因为这些知识,你可以学得相当快。
Because of that, you can learn fairly quickly.
我经常举的例子是,当你开车靠近悬崖时,你会提前知道,因为你对直觉物理的理解,如果你向右打方向盘,车就会向右偏,冲下悬崖,坠落下去,结果肯定不好。
I mean, the example I use very often is you're driving next to a cliff, and you know in advance because of your, you know, understanding of intuitive physics that if you turn the wheel to the right, the car will veer to the right, will run off the cliff, fall off the cliff, and nothing good will come out of this.
对吧?
Right?
但如果你是一个所谓的‘白板’式强化学习系统,没有对世界的认知模型,你就得反复掉下悬崖数千次,才能明白这是个坏主意,然后再重复几千次,才学会如何避免,接着还要再重复数百万次,才能在所有遇到的情境中都避免这样做。
But if you are a sort of, you know, tabula rasa reinforcement learning system that doesn't have a model of the world, you have to repeat falling off this cliff thousands of times before you figure out it's a bad idea, and then a few more thousand times before you figure out how to not do it, and then a few more million times before you figure out how to not do it in every situation you ever encounter.
所以自监督学习仍然需要有人告诉它一些真实的信息。
So self supervised learning still has to have some source of truth being told to it by somebody.
因此,你需要找到一种方法,在不依赖人类帮助,或至少不依赖大量人类帮助的情况下,从世界中获取这些真实信息。
So you have to figure out a way, without human assistance or without a significant amount of human assistance, to get that truth from the world.
那么这里的问题是:世界能提供多少信号?有多少真相是世界本身给予你的?无论是人类世界,比如你观看YouTube之类的内容,还是更自然的世界。
So the mystery there is how much signal is there, how much truth is there that the world gives you, whether it's the human world, like you watch YouTube or something like that, or it's the more natural world.
那么,究竟有多少信号存在呢?
So how much signal is there?
这里有个关键点。
So here's the trick.
在自监督学习设置中,存在的信号远多于监督学习或强化学习设置。
There is way more signal in sort of a self supervised setting than there is in either a supervised or reinforcement setting.
这让我想到那个蛋糕的类比。
And this is going to my, you know, analogy of the cake.
是的。
Yes.
所谓的‘LeCake’,正如有人所称,当你试图确定让机器预测多少信息、以及在每次试验中给予机器多少反馈时,在强化学习中,你只给机器一个标量值。
You know, the LeCake, as someone has called it, where you try to figure out how much information you ask the machine to predict and how much feedback you give the machine at every trial. In reinforcement learning, you give the machine a single scalar.
你告诉机器:你做得好,你做得糟,而且你只是偶尔才向机器提供这种反馈。
You tell the machine you did good, you did bad, and you only tell this to the machine once in a while.
当我说‘你’时,也可能是指宇宙在向机器传递信息。
When I say you, it could be the universe telling the machine.
对吧?
Right?
但这只是一个标量值。
But it's just one scalar.
因此,结果就是,如果没有大量、大量的试验,以及大量这种类型的反馈,你根本不可能学会任何复杂的东西。
So as a consequence, you cannot possibly learn something very complicated without many, many trials where you get many feedbacks of this type.
在监督学习中,你在每个样本上都会给机器提供若干比特的信息。
In supervised learning, you give a few bits to the machine at every sample.
假设你正在用ImageNet训练一个图像识别系统,共有1000个类别,每个样本提供的信息量略低于10比特。
Let's say you're training a system on recognizing images on ImageNet: there are 1,000 categories, so that's a little less than 10 bits of information per sample.
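The arithmetic here is easy to check: picking one of 1,000 equally likely categories conveys log2(1000) bits. A quick illustration in Python:

```python
import math

# One ImageNet-style label picks one of 1,000 categories, so a single
# label carries log2(1000) bits of information, a little under 10.
bits_per_label = math.log2(1000)
print(round(bits_per_label, 2))  # → 9.97
```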
这里的自监督学习是一种设定。
The self supervised learning here is a setting.
理想情况下,我们还不知道如何实现这一点,但理想状态下,你可以给机器展示一段视频,然后暂停视频,让机器预测接下来会发生什么。
Ideally, we don't know how to do this yet, but ideally you would show a machine a segment of a video and then stop the video and ask the machine to predict what's going to happen next.
于是你让机器进行预测,然后让时间继续推进,再向机器展示实际发生的情况,希望机器能学会在下一次更好地进行预测。
And so you let the machine predict, and then you let time go by, show the machine what actually happened, and hope the machine will, you know, learn to do a better job at predicting next time around.
你给了机器大量的信息,因为这是一整段视频片段,也就是在你最初输入给机器的视频片段之后的未来内容。
There's a huge amount of information you give the machine, because it's an entire video clip of the future after the clip you fed it in the first place.
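The training setup described here, show a clip, predict what comes next, then reveal what actually happened, can be sketched with a toy linear predictor on synthetic "frames". Everything below, the 4-dimensional frames and the linear dynamics, is an illustrative assumption, not how a real video model is built:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": each frame is a 4-dim vector and the true dynamics are a
# fixed linear map, frame[t+1] = A @ frame[t]. A real video model would
# be a large neural net over pixels, and real futures are not this
# predictable; this only illustrates the prediction objective.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = 0.9 * Q                      # contraction keeps frames bounded
frames = [rng.normal(size=4)]
for _ in range(30):
    frames.append(A @ frames[-1])
frames = np.array(frames)

# Self-supervised pairs: input is the current frame, target is the frame
# that actually came next. No human labels are involved.
X, Y = frames[:-1], frames[1:]

# Fit a linear predictor W (the learner's "world model") by least squares.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# On this deterministic toy world, the learned model predicts the next
# frame almost exactly.
error = np.abs(X @ W - Y).max()
```

The design point is that the supervision signal is just the future itself, which is exactly why so much information comes for free.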
因此,无论是语言还是视觉领域,都存在一种微妙的、看似微不足道的结构,但或许这正代表了创造智能所必需的东西——填补空白。
So both for language and for vision, there's a subtle, seemingly trivial construction, but maybe that's representative of what is required to create intelligence, which is filling the gap.
填补空白。
Filling the gaps.
听起来很傻,但也许通过这种方式真的能解决所有智能问题。
It sounds dumb, but it is possible you could solve all of intelligence in this way.
对于语言来说,就是给一句话,然后让它接着写下去,或者给一句话,中间某些词被遮住,让你填上应该是什么词。
For language, you just give a sentence and continue it, or give a sentence with a gap in it, some words blanked out, and you fill in what words go there.
对于视觉来说,就是给一连串图像,预测接下来会发生什么,或者填补中间缺失的部分。
For vision, you give a sequence of images and predict what's gonna happen next, or you fill in what happened in between.
你认为,仅凭这种作为自监督学习信号的表述,就能解决视觉和语言的智能问题吗?
Do you think it's possible that formulation alone as a signal for self supervised learning can solve intelligence for vision and language?
我认为这是我们目前最好的机会。
I think that's our best shot at the moment.
所以,这最终能否带我们达到人类水平的智能,或者只是猫级别的智能,还不清楚。
So whether this will take us all the way to, you know, human level intelligence or something or just cat level intelligence Mhmm.
还不明确。
Is not clear.
但在人们提出的所有可能方法中,我认为这是我们最有希望的途径。
But among all the possible approaches that people have proposed, I think it's our best shot.
所以我认为,一个智能系统通过填补空白——无论是预测未来、推断过去,还是填补缺失的信息——这个想法很有潜力。
So I think this idea of an intelligent system filling in the blanks, either, you know, predicting the future, inferring the past, filling in missing information.
举个例子,我现在正在填补你头后方的空白,想象你头部背面的样子,因为我对人类的身体构造有基本的了解。
You know, I'm currently filling in the blank of what is behind your head and what your head looks like from the back, because I have, you know, basic knowledge about how humans are made.
我不知道你接下来会说什么词,会在哪个时刻开口,是否会这样或那样动头,会朝哪个方向看。
And I don't know, you know, what word you're gonna say, at which point you're gonna speak, whether you're gonna move your head this way or that way, or which way you're gonna look.
但我知道你不会突然消失,然后在三米外的走廊重新出现,因为我知道根据直觉物理,什么是可能的,什么是不可能的。
But I know you're not gonna just dematerialize and reappear three meters down the hall, you know, because I know what's possible and what's impossible according to intuitive physics.
但你有一个关于什么是可能、什么是不可能的模型,如果发生了不可能的事情,你会非常惊讶,并不得不重构你的模型。
But you have a model of what's possible and what's impossible, and then you'd be very surprised if the impossible happens, and then you'll have to reconstruct your model.
对。
Right.
所以,那就是世界模型。
So that's the model of the world.
它就是告诉你如何填补空白的东西。
It's what tells you, you know, what fills in the blanks.
因此,基于你的感知所给出的关于世界状态的部分信息,你的世界模型会填补缺失的信息,这包括预测未来、回溯过去,以及填补你没有直接感知到的东西。
So given your partial information about the state of the world given by your perception, your model of the world fills in the missing information, and that includes predicting the future, retrodicting the past, you know, filling in things you don't immediately perceive.
而这不必是纯粹通用的视觉信息或通用的语言。
And that doesn't have to be purely generic vision or visual information or generic language.
你可以针对具体情况,比如预测你在车道上驾驶时会做出什么控制决策。
You can go to specifics, like predicting what control decision you make when you're driving in a lane.
你有来自车辆的一系列图像,如果你把过程录制下来,你就知道汽车最终开到了哪里,因此你可以回到过去,根据视觉信息预测汽车的去向。
You have a sequence of images from a vehicle, and if you recorded it on video, you have information about where the car ended up going, so you can go back in time and predict where the car went based on the visual information.
这非常具体,是特定领域的。
That's very specific, domain specific.
对。
Right.
但问题是,我们能否找到一种通用的方法,来训练机器进行这种预测或填补空白。
But the question is whether we can come up with sort of a generic method for, you know, training machines to do this kind of prediction or filling in the blanks.
目前,这种做法在自然语言处理领域取得了难以置信的成功。
So right now, this type of approach has been unbelievably successful in the context of natural language processing.
所有现代自然语言处理系统都是通过自监督方式预训练,以填补空白。
Every modern natural language processing is pre trained in self supervised manner to fill in the blanks.
你给它一段词序列,去掉其中10%,然后训练一个庞大的神经网络来预测那些缺失的词。
You show it a sequence of words, you remove 10% of them, and then you train some gigantic neural net to predict the words that are missing.
是的。
Mhmm.
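The masking procedure LeCun describes, essentially BERT-style pre-training, can be sketched in a few lines. The whitespace tokenization, the `[MASK]` token, and the `mask_tokens` helper below are illustrative assumptions, not any particular library's API:

```python
import random

def mask_tokens(tokens, mask_rate=0.1, mask_token="[MASK]", seed=0):
    """Hide a fraction of the tokens; the hidden originals become the
    targets the network is trained to predict."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {i: masked[i] for i in positions}  # position -> true word
    for i in positions:
        masked[i] = mask_token
    return masked, targets

sentence = "the cat is chasing the mouse in the kitchen".split()
masked, targets = mask_tokens(sentence)
print(" ".join(masked))
print(targets)
```

The network never needs a human label: the removed words themselves are the supervision.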
一旦预训练好这个网络,就可以将其内部学到的表示作为输入,用于你后续训练的监督模型或其他任何模型。
And once you've pre trained that network, you can use the internal representation learned by it as input to, you know, something that you train supervised, or whatever.
这种方法已经取得了巨大的成功。
That's been incredibly successful.
但在图像领域还不太成功,尽管正在取得进展,而且目前主要依赖于手动的数据增强。
Not so successful in images, although it's making progress, and it's based on sort of manual data augmentation.
我们稍后再深入这个话题。
We can go into this later.
但迄今为止尚未成功的是从视频中进行训练。
But what has not been successful yet is training from video.
让机器通过仅仅观看视频来学习表征视觉世界。
So getting a machine to learn to represent the visual world, for example, by just watching video.
还没有人真正成功实现这一点。
Nobody has really succeeded in doing this.
好的。
Okay.
那我们先做一个高层次的概述吧。
Well, let's kinda give a high level overview.
视觉和语言在本质和难度上有什么不同?
What's the difference in kind and in difficulty between vision and language?
你说人们还没能真正破解视觉在自监督学习方面的难题,但这可能并不是因为视觉本质上更困难。
So you said people haven't been able to really kinda crack the problem of vision open in terms of self supervised learning, but that may not be necessarily because it's fundamentally more difficult.
也许,当我们谈论在完全意义上通过图灵测试时,语言方面可能比视觉更难。
Maybe, like, when we're talking about passing the Turing test in the full spirit of the Turing test, language might be harder than vision.
这并不明显。
That's not obvious.
那么在你看来,哪个更难?或者当我们越来越接近解决这两个问题时,是否会发现它们本质上是同一个问题?
So in your view, which is harder? Or are they just the same problem, where the farther we get in solving each, the more we realize it's all the same thing?
它们都是同一个蛋糕。
It's all the same cake.
我所寻找的是能让它们看起来本质上是同一个蛋糕的方法,但目前它们还不是。
I think what I'm looking for are methods that make them look essentially like the same cake, but currently they're not.
学习世界模型或预测模型的主要问题是,预测从来不是单一的,因为世界并非完全可预测。
And the main issue with learning world models or learning predictive models is that the prediction is never a single thing because the world is not entirely predictable.
是的。
Yeah.
它可能是确定性的,也可能是随机的。
It may be deterministic or stochastic.
我们可以就此展开哲学层面的讨论。
We can get into the philosophical discussion about it.
但即使它是确定性的,也并非完全可预测。
But even if it's deterministic, it's not entirely predictable.
所以,如果我播放一段短视频,然后让你预测接下来会发生什么,这段视频会有许多种合理的后续发展,而且你要求系统预测的时间跨度越长,可能的后续数量就越多。
And so if I play a short video clip and then I ask you to predict what's going to happen next, there's many, many plausible continuations for that video clip, and the number of continuations grows with the interval of time that you're asking the system to make a prediction for.
因此,自监督学习中的一个重大问题是:如何表示这种不确定性,如何表示多个离散结果,以及如何表示连续的可能性范围等等。
And so one big question with self supervised learning is how you represent this uncertainty, how you represent multiple discrete outcomes, how you represent a sort of continuum of possible outcomes, etc.
如果你是一个典型的机器学习从业者,你会说:哦,你只需要表示一个分布就行了。
And, you know, if you are a classical machine learning person, you say, oh, you just represent a distribution.
对吧?
Right?
当我们预测文本中缺失的词语时,我们知道该如何做,因为神经网络可以为词典中的每个词打分。
And that we know how to do when we're predicting missing words in a text, because you can have a neural net give a score for every word in the dictionary.
这是一长串数字,大概有十万左右。
It's a big list of numbers, you know, maybe 100,000 or so.
你可以将它们转化为概率分布,这样当我念出一句话时,比如‘猫正在厨房里追逐空白’,你就能知道。
And you can turn them into a probability distribution that tells you, when I say a sentence, you know, the cat is chasing the blank in the kitchen.
你知道,那里只有几个词是合理的。
You know, there are only a few words that make sense there.
对吧?
You know?
可能是老鼠,也可能是激光点,或者类似的东西。
It could be a mouse or it could be a laser spot or, you know, something like that.
对吧?
Right?
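Turning a score per dictionary word into the probability distribution being described is just a softmax over the vocabulary. A minimal sketch with a toy four-word vocabulary; the scores are invented for illustration:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability, then normalize
    # the exponentiated scores so they sum to 1.
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Made-up scores for "the cat is chasing the ___ in the kitchen".
scores = {"mouse": 5.0, "laser": 3.0, "ball": 1.0, "philosophy": -5.0}
probs = softmax(scores)
# Plausible fillers take nearly all of the probability mass;
# implausible ones get almost none.
```

In a real system the dictionary has on the order of 100,000 entries, but the normalization is the same.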
如果我说‘空白’在草原上追逐‘空白’,这两个空缺也有很多合理的选项。
And if I say the blank is chasing the blank in the savannah, you also have a bunch of plausible options for those two words.
对吧?
Right?
嗯哼。
Mhmm.
这是因为你有一种潜在的现实基础,可以用来填补这些空白。
That's because you have kind of a, you know, underlying reality that you can refer to to sort of fill in those blanks.
所以在非洲大草原上,你无法确定到底是狮子、猎豹还是别的什么动物。
So you cannot say for sure in the savannah if it's, you know, a lion or a cheetah or whatever.
你也不知道到底是斑马还是牛羚,或者别的什么。
You cannot know if it's a zebra or a gnu or, you know, whatever.
角马也是同样的情况。
Wildebeest, the same thing.
但你可以用一长串数字来表示这种不确定性。
But you can represent the uncertainty by just a long list of numbers.
如果我用同样的方法处理视频,让你预测一段视频片段,那就不是一组离散的可能帧了。
Now if I do the same thing with video and I ask you to predict a video clip, it's not a discrete set of potential frames.
你需要一种方式来表示在高维连续空间中,多个帧的无限多种合理延续。
You have to have some way of representing a sort of infinite number of plausible continuations of multiple frames in a, you know, high dimensional continuous space.
而我们根本不知道该如何正确地实现这一点。
And we just have no idea how to do this properly.
有限的、高维的。
Finite, high dimensional.
所以,就像……
So, like,
它是有限的、高维的。
It's finite, high dimensional.
是的。
Yes.
就像单词一样。
Just like the words.
他们试图将其缩减为一个很小的有限集合,比如少于一百万之类的。
They try to get it down to a small finite set of, like, under a million, something like that.
差不多就是这样。
Something like that.
我的意思是,我们对语言中每一个可能的单词都做分布,这有点荒谬,但它却有效。
I mean, it's kinda ridiculous that we're doing a distribution over every single possible word for language, and it works.
这感觉像是一个非常愚蠢的做法。
It feels like that's a really dumb way to do it.
好像应该有一种更紧凑的方式来表示词语的分布。
Like, it seems like there should be some more compressed representation of the distribution of the words.
你说得对。
You're right about that.
所以,我同意。
And so, I agree.
你有没有什么有趣的想法,能用一种压缩的方式表示整个现实,从而对其形成分布?
Do you have any interesting ideas about how to represent all of reality in a compressed way so that you can form a distribution over it?
这是一个关键问题,你知道的,你该怎么做到这一点呢?
That's one of the big questions, you know, how do you do that?
对吧?
Right?
我的意思是,当前NLP文本领域自监督学习方法中,另一个相当幼稚的问题——我不该说‘愚蠢’,但确实很简化——那就是,你不仅要表示一个庞大的词语分布,而且对于多个缺失的词语,这些分布本质上是彼此独立的。
I mean, another thing that is, I shouldn't say stupid, but, like, simplistic about current approaches to self supervised learning in NLP, in text, is that not only do you represent a giant distribution over words, but for multiple words that are missing, those distributions are essentially independent of each other.
而且你知道,你为这种做法付出的代价并不大。
And you know, you don't pay too much of a price for this.
所以,你瞧,就像我前面说的那句话,如果系统给‘狮子’和‘猎豹’分配了某种概率,又给‘羚羊’、‘角马’和‘斑马’分配了另一种概率。
So the system, in the sentence that I gave earlier, gives a certain probability for a lion and a cheetah, and then a certain probability for, you know, gazelle, wildebeest, and zebra.
这两个概率是相互独立的。
Those two probabilities are independent
嗯。
Mhmm.
彼此独立。
Of each other.
但事实上,这些东西并不是独立的。
And it's not the case that those things are independent.
狮子实际上会攻击比猎豹更大的动物。
Lions actually attack, like, bigger animals than cheetahs.
因此,这个过程中存在一个巨大的独立性假设,而这个假设实际上并不成立。
So there's a huge independence hypothesis in this process which is not actually true.
造成这种情况的原因是我们不知道如何正确表示符号的组合序列分布,因为随着符号长度的增加,可能的组合数量会呈指数级增长。
The reason for this is that we don't know how to represent properly distributions over combinatorial sequences of symbols, essentially because the number of combinations grows exponentially with the length of the sequence.
所以我们不得不使用一些技巧来应对,但这些方法往往干脆回避这个问题。
And so we have to use tricks for this, but those techniques kind of just get around the problem; they don't really deal with it.
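The scaling pressure behind that independence assumption is easy to make concrete. A sketch using the roughly 100,000-word dictionary size mentioned earlier; the numbers are only illustrative:

```python
# A full joint distribution over n missing words needs one entry per
# combination of words, while the factorized (independence) shortcut
# needs only n separate per-word distributions.
VOCAB = 100_000

def joint_entries(n_blanks):
    return VOCAB ** n_blanks          # grows exponentially with n

def factorized_entries(n_blanks):
    return n_blanks * VOCAB           # grows linearly with n

print(joint_entries(2))       # 10,000,000,000 entries for just two blanks
print(factorized_entries(2))  # 200,000 entries under independence
```

This is why current models pay the price of the independence assumption: the honest joint distribution is simply too large to represent explicitly.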
所以核心问题是:是否存在某种抽象的潜在文本表示,能够意识到当我把‘狮子’换成‘猎豹’时,也必须把‘斑马’换成‘羚羊’?
So the big question is, like, would there be some sort of abstract latent representation of text that would say that, you know, when I switch lion for cheetah, I also have to switch zebra for gazelle?
嗯。
Mhmm.
是的。
Yeah.
所以这个独立性假设,让我抛出一个我常听到的批评,看看你怎么回应。
So this independence assumption let me throw some criticism at you that I often hear and see how you respond.
这种填空的方式只是统计而已。
So this kind of filling in the blanks is just statistics.
你并没有学到任何深层的根本概念。
You're not learning anything, like, the deep underlying concepts.
你只是在模仿过去的东西。
You're just mimicking stuff from the past.
你没有学到任何新的东西,无法用它来对世界进行泛化。
You're not learning anything new such that you can use it to generalize about the world.
或者好吧。
Or okay.
让我直接说个直白的说法,这纯粹就是统计。
Let me just say the crude version, which is is just statistics.
这不是智能。
It's not intelligence.
你对此有什么要说的?
What do you have to say to that?
如果你经常听到这种说法,你通常会怎么回应?
What do you usually say to that if you kinda hear this kind of thing?
我不参与这些讨论,因为它们有点毫无意义。
I don't get into these discussions because they're kind of pointless.
所以首先,很有可能智能就只是统计学。
So first of all, it's quite possible that intelligence is just statistics.
它只是某种特定类型的统计数据。
It's just statistics of a particular kind.
是的。
Yes.
这就是那个哲学问题。
This is the philosophical question.
这是否可能,智能本质上就是统计学?
It's kind of, is it possible that intelligence is just statistics?
是的。
Yeah.
但具体是哪种统计呢?
But what kind of statistics?
所以,如果你在问一个问题:我们所学习的关于世界的模型,是否包含某种因果关系的概念?
So if you are asking the question, the models of the world that we learn, do they have some notion of causality?
是的。
Yes.
如果有人批评说,当前的机器学习系统并不关心因果关系——顺便说一句,这种说法是错误的,但我同意他们的观点。
So if the criticism comes from people who say, you know, current machine learning systems don't care about causality, which by the way is wrong, you know, I agree with them.
对。
Yeah.
你的世界模型应该把你的行为作为输入之一,这将促使你学习因果模型,从而明白世界中的哪些干预会导致何种结果。
Your model of the world should have your actions as one of the inputs. That will drive you to learn causal models of the world, where you know what intervention in the world will cause what results.
或者,你也可以通过观察其他智能体在世界中的行为及其影响来实现这一点,比如观察其他人类。
Or you can do this by observation of other agents acting in the world and observing the effect, other humans for example.
所以我认为,在某种描述层面上,智能就是统计学。
So I think, you know, at some level of description, intelligence is just statistics.
但这并不意味着你不会拥有对事物运作机制的深层次解释模型。
But that doesn't mean you won't have models that have, you know, deep mechanistic explanations for what goes on.
问题在于你如何学习它们?
The question is how do you learn them?
这就是我感兴趣的问题。
That's the question I'm interested in.
因为,你知道,很多提出批评的人实际上认为这些机制模型必须来自其他地方。
Because, you know, a lot of people who actually voice their criticism say that those mechanistic models have to come from someplace else.
它们必须来自人类设计者。
They have to come from human designers.
它们必须来自某个我不确定的地方。
They have to come from I don't know what.
显然,我们是通过学习获得它们的。
And obviously, we learn them.
或者,如果我们作为个体没有学会,大自然会通过进化为我们习得。
Or if we don't learn them as individuals, nature learned them for us using evolution.
所以,无论你怎么想,这些过程都是通过某种方式被学习到的。
So regardless of what you think, those processes have been learned somehow.
所以,如果你观察人类的大脑,就像我们人类内省大脑如何工作时那样,当我们思考什么是智能时,我们往往会想到高层次的东西,比如我们构建的模型、认知科学这样的概念,以及记忆和推理模块之类的高级模块。
So if you look at the human brain, when we humans introspect about how the brain works, it seems like when we think about what intelligence is, we think about the high-level stuff: the models we've constructed, concepts from cognitive science, concepts like memory and reasoning modules, almost like these high-level modules.
这个类比是否恰当呢?
Is this a good analogy?
我们是不是忽略了暗物质,也就是那些基础的低层次机制,就像我们忽视操作系统的工作原理一样?
Like, are we ignoring the dark matter, the basic low-level mechanisms, just like we ignore the way the operating system works?
我们只是在使用高层次的软件。
We're just using the high-level software.
我们忽略了在低层次上,神经网络可能正在做一些类似统计的事情。
We're ignoring that at the low level, the neural network might be doing something like statistics.
抱歉,我可能用这个词不准确、太粗略了,但这种学习方式就像是填补空白,不断更新模型,嗯。
Like, sorry to use this word probably incorrectly and crudely, but doing this kind of fill-in-the-gap learning and just updating the model constantly. Mhmm.
为了能够支持原始感官输入并进行预测,当预测错误时再进行调整。
In order to process the raw sensory input, predict it, and then adjust when the prediction is wrong.
但当我们从高层次观察大脑时,感觉就像我们在下棋一样。
But when we look at our brain at the high level, it feels like we're playing chess.
我们正在使用高级概念进行组合,把它们整合进长期记忆中。
Like, we're playing with high-level concepts and stitching them together, and putting them into long-term memory.
但实际上底层发生的是我们无法内省的东西——一个简单的大型神经网络,正在不断填补空白。
But really, what's going on underneath is something we're not able to introspect: this kind of simple, large neural network that's just filling in the gaps.
是的。
Right.
好吧。
Well, okay.
所以有很多问题。
So there's a lot of questions.
那里也有很多答案。
There's a lot of answers there.
首先,神经科学领域,特别是计算神经科学,有一整套理论支持预测编码,这与我之前谈到的自监督学习概念密切相关。
So first of all, there's a whole school of thought in neuroscience, computational neuroscience in particular, that likes the idea of predictive coding, which is really related to the idea I was talking about in self supervised learning.
一切都关乎预测,智能的本质就是预测能力,而大脑所做的一切都是试图从一切事物中预测一切。
So everything is about prediction, the essence of intelligence is the ability to predict, and everything the brain does is trying to predict everything from everything else.
好的。
Okay.
这实际上是底层的原则,自我监督学习试图重现这种预测作为任务无关学习的核心机制的理念。
And that's really the underlying principle: self-supervised learning is trying to reproduce this idea of prediction as an essential mechanism of task-independent learning, if you want.
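As a toy illustration of that prediction principle (my own sketch, not anything described in the conversation): hide part of a signal and fit a predictor that fills it in from the surrounding context, so the supervision comes from the data itself. The signal, the context size, and the linear model are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Signal: a noisy sine wave. The "pretext task" is to predict the
# middle sample of a window from its surrounding context; no human
# labels are involved, the targets come from the data itself.
t = np.linspace(0, 20 * np.pi, 5000)
signal = np.sin(t) + 0.05 * rng.standard_normal(t.size)

ctx = 4  # samples of context on each side of the masked sample
X, y = [], []
for i in range(ctx, signal.size - ctx):
    window = np.concatenate([signal[i - ctx:i], signal[i + 1:i + 1 + ctx]])
    X.append(window)
    y.append(signal[i])
X, y = np.array(X), np.array(y)

# Linear least-squares predictor of the masked sample from its context.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
mse = np.mean((pred - y) ** 2)
print(f"masked-sample prediction MSE: {mse:.4f}")
```

The residual error ends up near the noise floor of the signal, which is the point: predicting the hidden part from the visible part forces the model to capture the signal's structure.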
下一步是你想重现哪种智能?
The next step is what kind of intelligence are you interested in reproducing?
当然,我们都想着要重现人类的高级认知过程。
And of course, we all think about trying to reproduce high level cognitive processes in humans.
但就机器而言,我们甚至还没达到重现猫脑学习过程的水平。
But with machines, we're not even at the level of even reproducing the learning processes in a cat brain.
最智能的系统也没有家猫那么多的常识。
The most intelligent systems don't have as much common sense as a house cat.
那么,猫是怎么学习的呢?
So how is it that cats learn?
而且猫并不会做太多推理。
And cats don't do a whole lot of reasoning.
它们确实拥有因果模型,因为许多猫能够弄清楚如何作用于世界以获得它们想要的东西。它们对直觉物理有着极佳的模型,不仅包括对自己身体动态的理解,也包括对猎物等事物的把握。
They certainly have causal models, because many cats can figure out how to act on the world to get what they want. They certainly have a fantastic model of intuitive physics, certainly of the dynamics of their own bodies, but also of prey and things like that.
对吧?
Right?
所以它们相当聪明。
So they're pretty smart.
它们仅用大约八亿个神经元就做到了这一点。
They only do this with about 800,000,000 neurons.
我们离重现这种能力还差得很远。
We are not anywhere close to reproducing this kind of thing.
因此,在某种程度上,我可以说:在我们弄清楚能否重现猫的行为之前,根本不必去操心人类那种高级认知、长期规划和推理能力。
So to some extent, I could say, let's not even worry about the high level cognition and long term planning and reasoning that humans can do until we figure out, can we even reproduce what cats are doing?
不过话说回来,这种学习世界模型的能力,我认为是实现能够推理的机器的关键。
Now that said, this ability to learn world models, I think, is the key to the possibility of learning machines that can also reason.
所以每当我做演讲时,都会提到机器学习中的三个主要挑战。
So whenever I give a talk, I say there are three main challenges in machine learning.
第一个挑战是让机器学会表征世界,并提出自监督学习。
The first one is getting machines to learn to represent the world and proposing self supervised learning.
第二个挑战是让机器以与基于梯度的学习本质上兼容的方式进行推理,因为这正是深度学习的核心所在。
The second is getting machines to reason in ways that are compatible with essentially gradient based learning, because this is what deep learning is all about, really.
第三个挑战是我们目前完全不知道如何解决,至少我不知道如何解决,那就是:我们能否让机器学会对行动计划进行分层表征?
And the third one is something we have no idea how to solve, at least I have no idea how to solve, is can we get machines to learn hierarchical representations of action plans?
我们知道如何训练它们学习感知的分层表征,比如通过卷积网络、变换器等类似技术。
We know how to train them to learn hierarchical representations of perception, you know, with convolutional nets and things like that and transformers.
但是行动计划呢?
But what about action plans?
我们能否让它们自发地学习到良好的分层动作表征?
Can we get them to spontaneously learn good hierarchical representations of actions?
也是基于梯度的。
Also gradient based.
是的。
Yeah.
所有这些都需要某种程度上的可微分性,以便应用基于梯度的学习,而这正是深度学习的核心。
All of that needs to be somewhat differentiable so that you can apply gradient-based learning, which is really what deep learning is about.
所以这是背景知识,一种能够以可微分方式推理的能力,这种能力与背景知识有某种深层关联,或建立在背景知识之上,然后基于这些背景知识,能够制定出层次化的计划,对吧。
So it's background knowledge, the ability to reason in a way that's differentiable, that is somehow deeply integrated with that background knowledge or builds on top of it, and then, given that background knowledge, being able to make hierarchical plans. Right.
在现实世界中。
In the world.
如果你看看经典的最优控制理论,其中有一种叫做模型预测控制的方法。
So if you take classical optimal control, there's something in classical optimal control called model predictive control.
它早在20世纪60年代初就已经存在了。
And it's been around since the early sixties.
NASA用它来计算火箭的轨迹。
NASA uses that to compute trajectories of rockets.
其基本思想是,你有一个相当准确的预测模型,比如火箭,或者你打算控制的任何系统,这个模型能够根据系统在时间t的状态以及你对系统采取的行动——比如火箭的推力和所有可能的控制输入——预测出系统在时间t+Δt的状态。
And the basic idea is that you have a pretty accurate predictive model of the rocket, let's say, or whatever system you intend to control, which, given the state of the system at time t and given an action that you're taking on the system, so for a rocket the thrust and all the controls you have, gives you the state of the system at time t plus delta t.
对吧?
Right?
所以基本上就是一个微分方程,类似这样的东西。
So basically a differential equation, something like that.
如果你有这个模型,并且这个模型是以某种神经网络或一组可进行反向传播的公式形式存在,你就可以实现所谓的模型预测控制或基于梯度的模型预测控制。
And if you have this model, and you have this model in the form of some sort of neural net or some sort of set of formula that you can backpropagate gradient through, you can do what's called model predictive control or gradient based model predictive control.
因此,你可以将这个模型在时间上展开,输入一个假设的动作序列,然后有一个目标函数来衡量在轨迹结束时系统是否成功达到了你期望的目标。
So you can unroll that model in time, you feed it a hypothesized sequence of actions, and then you have some objective function that measures how well at the end of the trajectory the system has succeeded or matched what you want it to do.
比如，如果是机械臂，
You know, if it's a robot arm,
你有没有抓到你想抓的物体?
Have you grasped the object you want to grasp?
如果是火箭,你是否在空间站附近的正确位置?
If it's a rocket, you know, are you at the right place near the space station?
诸如此类的事情。
Things like that.
通过时间反向传播——同样,这在20世纪60年代就由最优控制理论家们发明了——你可以算出最优的动作序列,从而让系统达到最佳的最终状态。
And by backpropagation through time, and again, this was invented in the nineteen sixties by optimal control theorists, you can figure out the optimal sequence of actions that will get your system to the best final state.
所以这是一种推理形式。
So that's a form of reasoning.
这本质上是规划,而机器人领域的许多规划系统实际上都基于此。
It's basically planning, and a lot of planning systems in robotics are actually based on this.
你可以将这视为一种推理形式。
And you can think of this as a form of reasoning.
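A minimal sketch of the gradient-based model predictive control just described, on a hand-written 1-D linear model (a toy "rocket" with position and velocity). The dynamics, horizon, cost, and step sizes are my illustrative assumptions, not any real solver: the model is unrolled through time, and backpropagation through time gives the gradient of the terminal cost with respect to the whole action sequence.

```python
import numpy as np

# Known world model: a 1-D point mass with position and velocity.
# State x = [pos, vel]; action u = thrust. x_{t+1} = A @ x + B * u.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([0.0, 0.1])
T = 50
target = np.array([1.0, 0.0])  # reach position 1 with zero velocity

def rollout(u):
    """Unroll the model through time over an action sequence u."""
    x = np.zeros(2)
    for t in range(T):
        x = A @ x + B * u[t]
    return x

def grad(u):
    """Backpropagation through time, written out by hand for the
    linear model: dC/du_t = B^T (A^T)^(T-1-t) dC/dx_T."""
    xT = rollout(u)
    g_x = 2.0 * (xT - target)   # gradient of the terminal cost
    g_u = np.zeros(T)
    for t in reversed(range(T)):
        g_u[t] = B @ g_x        # gradient w.r.t. action at step t
        g_x = A.T @ g_x         # push the gradient one step back
    return g_u

u = np.zeros(T)
for _ in range(500):
    u -= 0.05 * grad(u)         # gradient descent on the action plan

final = rollout(u)
cost = np.sum((final - target) ** 2)
print(f"final state: {final}, cost: {cost:.6f}")
```

The optimized action sequence drives the toy system to the target state, which is exactly the "unroll the model, backpropagate, improve the plan" loop described above.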
所以,以青少年开车为例,你对汽车的动力学模型有相当好的理解。
So, you know, to take the example of the teenager driving a car again, you have a pretty good dynamical model of the car.
它不需要非常精确。
It doesn't need to be very accurate.
但你知道,如果你向右打方向盘,而旁边是悬崖,你就会冲下悬崖。
But you know, again, that if you turn the wheel to the right and there is a cliff, you're gonna run off the cliff.
对吧?
Right?
你不需要一个非常精确的模型就能预测到这一点。
You don't need to have a very accurate model to predict that.
你可以在脑海中运行这个过程,并因此决定不去做,因为你能提前预测到结果会很糟糕。
And you can run this in your mind and decide not to do it for that reason, because you can predict in advance that the result is going to be bad.
因此,你可以想象不同的场景,然后选择最有利的场景并采取第一步,再重复规划的过程。
So you can imagine different scenarios and then employ or take the first step in the scenario that is most favorable and then repeat the process of planning.
这被称为滚动时域模型预测控制。
That's called receding horizon model predictive control.
所以,所有这些概念都有几十年的历史了。
So all those things have names going back decades.
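Receding-horizon control as described, imagine several candidate scenarios, execute only the first action of the most favorable one, then replan, can be sketched like this. The toy dynamics and the random-shooting planner are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dynamics again: drive a 1-D point mass to position 1, velocity 0.
def step(x, u):
    pos, vel = x
    return np.array([pos + 0.1 * vel, vel + 0.1 * u])

def imagine(x, plan):
    """Run the world model forward over a hypothesized action sequence."""
    for u in plan:
        x = step(x, u)
    return x

def plan_one_action(x, horizon=10, n_scenarios=256):
    """Imagine random scenarios, score their final states, and return
    the first action of the most favorable scenario."""
    plans = rng.uniform(-1, 1, size=(n_scenarios, horizon))
    goal = np.array([1.0, 0.0])
    costs = [np.sum((imagine(x, p) - goal) ** 2) for p in plans]
    return plans[int(np.argmin(costs))][0]

# Receding-horizon loop: plan, execute only the first action, replan.
x = np.zeros(2)
for _ in range(100):
    x = step(x, plan_one_action(x))

print(f"final state: {x}")
```

Because only the first action of each imagined plan is ever executed, the controller keeps correcting itself from the state it actually reaches, which is the "repeat the process of planning" part.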
在经典最优控制中,世界的模型通常不是通过学习得到的。
So in classical optimal control, the model of the world is not generally learned.
有时你需要识别一些参数,这被称为系统辨识。
Sometimes a few parameters you have to identify, that's called systems identification.
但一般来说,模型大多是确定性的,并且主要由人工构建。
But generally, the model is mostly deterministic and mostly built by hand.
因此,人工智能的重大问题,我认为是未来十年人工智能的主要挑战:我们如何让机器运行能够处理不确定性并应对现实世界复杂性的世界预测模型?
So the big question of AI, I think the big challenge of AI for the next decade, is how do we get machines to run predictive models of the world that deal with uncertainty and deal with the real world in all this complexity?
这不仅仅是火箭的轨迹,你可以将它简化为一些原理。
So it's not just the trajectory of a rocket, which you can reduce to principles.
甚至也不仅仅是机器人手臂的轨迹,你同样可以通过精细的数学建模来描述它。
It's not even just the trajectory of a robot arm, which again you can model by careful mathematics.
但它涵盖了世界上所有其他事物,我们所观察到的一切。
But it's everything else, everything we observe in the world.
人类、行为、涉及集体现象的物理系统,比如水、树、树上的枝杈,或者那些复杂的事物——人类能轻松地为它们建立抽象表征和预测模型,但我们仍不知道如何让机器做到这一点。
People, behavior, physical systems that involve collective phenomena like water, or trees and branches in a tree, or complex things that humans have no trouble developing abstract representations and predictive models for, but that we still don't know how to do with machines.
在这三个层面中,你在哪里融入了这个世界的博弈论特性?你的行动不仅对世界的动态环境做出反应,还会对其产生影响。
Where do you put, in these three, maybe in the planning stages, the game-theoretic nature of this world, where your actions not only respond to the dynamic nature of the world, the environment, but also affect it?
所以,如果有其他人类参与,这是第四个层面,还是在你的行动层次化表征中以某种方式整合进去了?
So if there are other humans involved, is this point number four, or is it somehow integrated into the hierarchical representation of action, in your view?
我认为它是被整合的。
I think it's integrated.
只是现在,你对世界的模型必须应对……这使得问题变得更加复杂。
It's just that now your model of the world has to deal with that, which just makes it more complicated.
对吧?
Right?
人类复杂且难以预测,这使得你对世界的模型变得更加复杂,复杂得多。
The fact that humans are complicated and not easily predictable makes your model of the world that much more complicated.
嗯,下棋就是一个类比。
Well, I suppose chess is an analogy.
蒙特卡洛树搜索。
So Monte Carlo tree search.
我走一步，你走一步。
I mean, it's: I go, you go.
卡帕西最近在麻省理工学院做了一场关于汽车门的演讲。
Like, Karpathy recently gave a talk at MIT about car doors.
我觉得里面也有些机器学习,但主要是关于汽车门。
I think there's some machine learning too, but mostly car doors.
汽车本身具有动态性,比如有人在开门时会进行确认。
And there's a dynamic nature to the car, like the person opening the door checking.
我的意思是,他谈论的并不是这个。
I mean, he wasn't talking about that.
他谈的是感知问题,即什么定义了车门,这是一个宏大的哲学问题。
He was talking about the perception problem, the ontology of what defines a car door, this big philosophical question.
但我觉得这很有趣,因为很明显,开车门的人是想下车,比如在纽约,你减速会传递某种信号,加速也会传递某种信号,这就像一种默契的互动。
But to me, it was interesting because it's obvious that the person opening the car door is trying to get out, like here in New York. You slowing down is going to signal something, you speeding up is going to signal something, and that's a dance.
这就像一场不同步的象棋对局。
It's an asynchronous chess game.
我不确定。
I don't know.
所以感觉这不仅仅是,我的意思是,也许你可以把所有这些微小互动整合进一个庞大的模型中。
So it feels like it's not just... I mean, I guess you can integrate all of that into one giant model, the entirety of these little interactions.
因为这并没有象棋那么复杂。
Because it's not as complicated as chess.
这就像一场小小的舞蹈。
It's just like a little dance.
我们一起跳一段小舞,然后就弄明白了。
We do like a little dance together, and then we figure it out.
在某些方面,这比国际象棋复杂得多,因为它是连续的,并且以一种持续的方式充满不确定性。
Well, in some ways it's way more complicated than chess, because it's continuous, and it's uncertain in a continuous manner.
感觉并没有更复杂。
It doesn't feel more complicated.
但感觉不更复杂,是因为这是我们进化出来解决的问题。
But it doesn't feel more complicated because that's what we've evolved to solve.
这是我们进化出来解决的这类问题。
This is the kind of problem we've evolved to solve.
所以我们擅长这个,因为大自然让我们擅长应对它。
And so we're good at it because, you know, nature has made us good at it.
大自然并没有让我们擅长下国际象棋。
Nature has not made us good at chess.
我们在国际象棋上完全不行。
We completely suck at chess.
是的。
Yeah.
事实上,我们设计它作为游戏,正是因为它具有挑战性。
In fact, that's why we designed it as a as a game, is to be challenging.
嗯。
Mhmm.
最近国际象棋和围棋的进展让我们意识到,人类在这些方面真的很差,差得离谱。
And if there's something that recent progress in chess and Go has made us realize, it's that humans are really terrible at those things, like really bad.
你知道吗?
You know?
在AlphaGo出现之前,有一个说法,当时最顶尖的围棋选手认为,理想中的玩家——他们称之为‘神’——比他们多出两三颗子。
There was a story, right, before AlphaGo, that the best Go players thought they were maybe two or three stones behind an ideal player that they would call God.
嗯。
Mhmm.
事实上,并不是。
In fact, no.
落后九到十个子。
They're nine or ten stones behind.
我的意思是,我们就是很菜。
I mean, we're just bad.
是的。
Yeah.
所以我们不擅长,这是因为我们的工作记忆有限。
So we're not good at and it's because we have limited working memory.
我们不太擅长那种树状探索,而计算机在这方面比我们强得多。
We're not very good at doing this tree exploration that computers are much better at than we are.
但我们更擅长学习世界的可微分模型。
But we are much better at learning differentiable models of the world.
我的意思是,我说可微分,其实也不是指我们真的会反向传播,而是说我们的大脑具有一些估算某种梯度的机制。
I mean, I say differentiable loosely. I should say, not differentiable in the sense that we run backprop through it, but in the sense that our brain has some mechanism for estimating gradients of some kind.
是的。
Yeah.
而这正是让我们高效的原因。
And that's what, you know, makes us efficient.
所以,如果你有一个智能体,它包含一个世界模型——在人脑中,这基本上就是你大脑的前半部分——以及一个目标函数,在人类身上,这个目标函数由两部分组成。
So if you have an agent that consists of a model of the world, which in the human brain is basically the entire front half of your brain, and an objective function, which in humans is a combination of two things.
其中一个是你的内在动机模块,位于基底神经节,也就是你大脑的底部。
There is your intrinsic motivation module, which is in the basal ganglia, at the base of your brain.
这个模块负责衡量疼痛、饥饿以及类似的感觉和情绪,也就是即时的感受。
That's the thing that measures pain and hunger and things like that, like immediate feelings and emotions.
然后,还有一个模块,类似于强化学习中人们所说的‘评论家’,它能够预测某种情境的未来结果。
And then there is the equivalent of what people in reinforcement learning call a critic, which is a module that predicts ahead what the outcome of a situation will be.
因此,它不是一个成本函数,也不是一个目标函数,而是一个经过训练的预测器,用于预测最终目标函数的结果。
So it's not exactly an objective function, but a trained predictor of the ultimate objective function.
而这个模块也是可微分的。
And that also is differentiable.
如果所有这些——你的代价函数、你的评论家、你的世界模型——都是可微分的,那么你就可以使用基于梯度的方法来进行规划、推理和学习,也就是我们希望智能体能够做到的所有事情。
And so if all of this is differentiable, your cost function, your critic, your world model, then you can use gradient-based methods to do planning, to do reasoning, to do learning, to do all the things that we'd like an intelligent agent
去做。
to do.
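A scalar toy version of the agent architecture just described, a differentiable world model, a hardwired intrinsic cost, and a critic trained to predict future cost, with actions chosen by gradient descent through all three. Every function and constant here is an illustrative assumption of mine, not a description of any real system:

```python
import numpy as np

# A scalar toy agent:
#   world model:     x' = f(x, u)      (differentiable)
#   intrinsic cost:  c(x) = x**2      ("discomfort" at state x)
#   critic:          V(x) = w * x**2   (trained predictor of future cost)
# Everything is differentiable, so an action can be chosen by gradient
# descent through model + intrinsic cost + critic.

def f(x, u):   # world model: the action pushes the state around
    return 0.9 * x + u

def c(x):      # hardwired intrinsic cost
    return x ** 2

# Fit the critic by regression on observed discounted future cost
# under a random policy (a stand-in for how a critic could be trained).
rng = np.random.default_rng(0)
gamma = 0.9
feats, returns = [], []
for _ in range(200):
    x0 = rng.uniform(-2, 2)
    g, disc, xi = 0.0, 1.0, x0
    for _ in range(30):
        xi = f(xi, rng.uniform(-0.5, 0.5))
        g += disc * c(xi)
        disc *= gamma
    feats.append(x0 ** 2)
    returns.append(g)
w = np.dot(feats, returns) / np.dot(feats, feats)  # V(x) = w * x^2

def act(x, steps=100, lr=0.05):
    """Choose u by gradient descent on c(f(x,u)) + gamma * V(f(x,u))."""
    u = 0.0
    for _ in range(steps):
        xn = f(x, u)
        # d/du [xn^2 + gamma*w*xn^2] = 2*xn*(1 + gamma*w), since dxn/du = 1
        u -= lr * 2.0 * xn * (1.0 + gamma * w)
        u = float(np.clip(u, -1.0, 1.0))
    return u

x = 2.0
for _ in range(10):
    x = f(x, act(x))
print(f"state after acting: {x:.4f}")
```

The agent drives its state toward the minimum of its intrinsic cost, with the critic supplying the "predicted future cost" part of the objective; all three pieces are used only through their gradients.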
基于梯度的学习,你的直觉是什么?
And gradient-based learning: what's your intuition?
这可能是解决智能的核心,因此在你看来,你并不需要基于逻辑的推理。
That's probably at the core of what can solve intelligence, so you don't need, like, logic based reasoning in your view.
我不知道如何让基于逻辑的推理与高效学习兼容。
I don't know how to make logic based reasoning compatible with efficient learning.
是的。
Yeah.
好吧。
And okay.
我的意思是,这里有一个很大的问题,或许是一个哲学问题。
I mean, there is a big question, perhaps a philosophical question.
我的意思是,这并不那么哲学,但我们能问的是,我们从工程和计算机科学中知道的所有学习算法都是通过优化某个目标函数来实现的。
I mean, it's not that philosophical, but what we can ask is this: all the learning algorithms we know from engineering and computer science proceed by optimizing some objective function.
是的。
Yeah.
对吧?
Right?
所以我们可以问一个问题:大脑的学习是否在最小化某个目标函数?
So one question we may ask is: does learning in the brain minimize an objective function?
它可能是多个目标函数的组合,但本质上仍然是一个目标函数。
It could be a composite of multiple objective functions, but it's still an objective function.
第二,如果它确实优化了某个目标函数,那么它是通过某种梯度估计来实现的吗?
Second, if it does optimize an objective function, does it do it by some sort of gradient estimation?
它不需要是反向传播,但必须是一种高效估算梯度的方式,其复杂度与实际执行推理的复杂度相当。
It doesn't need to be backprop, but some way of estimating the gradient in an efficient manner, whose complexity is on the same order of magnitude as actually running the inference.
因为你无法承担通过扰动大脑中的权重来观察其效果这样的操作,也无法通过扰动来估算梯度。
Because you can't afford to do things like perturbing a weight in your brain to figure out what the effect is; you can't estimate gradients by perturbation.
在我看来,大脑使用某种零阶黑箱无梯度优化方式似乎非常不可能,因为这种方式比梯度优化效率低得多。
To me, it seems very implausible that the brain uses some sort of zeroth-order, black-box, gradient-free optimization, because it's so much less efficient than gradient optimization.
所以大脑必须有一种估算梯度的方法。
So it has to have a way of estimating gradient.
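The efficiency argument can be made concrete: estimating a gradient by perturbing each parameter separately costs one function evaluation per parameter, while an analytic gradient costs on the order of one extra pass. A small sketch with an arbitrary least-squares objective of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Objective: least-squares loss of a linear model with n parameters.
n = 200
Xd = rng.standard_normal((500, n))
yd = rng.standard_normal(500)

evals = {"count": 0}
def loss(w):
    evals["count"] += 1
    r = Xd @ w - yd
    return r @ r

w = rng.standard_normal(n)

# Zeroth-order: perturb each weight separately (finite differences).
eps = 1e-6
f0 = loss(w)
g_fd = np.zeros(n)
for i in range(n):
    wp = w.copy()
    wp[i] += eps
    g_fd[i] = (loss(wp) - f0) / eps
fd_evals = evals["count"]   # n + 1 evaluations of the objective

# Analytic gradient: one pass, same cost order as computing the loss.
g_exact = 2.0 * Xd.T @ (Xd @ w - yd)

print(f"finite-difference evaluations: {fd_evals}")
print(f"max gradient error: {np.max(np.abs(g_fd - g_exact)):.3e}")
```

Both estimates agree, but the perturbation scheme needed n + 1 full evaluations, and a brain has on the order of 10^14 synapses, which is the implausibility being pointed at.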
有没有可能,某种基于逻辑的推理会在局部出现,就像你所说的,如果大脑有目标函数,也许它就是一种创建目标函数的机制。
Is it possible that some kind of logic-based reasoning emerges in pockets as useful? Like you said, if the brain has an objective function, maybe it's a mechanism for creating objective functions.
它是一种创建知识库的机制,例如,这些知识库之后可以被查询。
It's a mechanism for creating knowledge bases, for example, that can then be queried.
也许它是一种以梯度为基础的方式学习到的高效知识表示,或者类似的东西。
Like, maybe it's like an efficient representation of knowledge that's learned in a gradient based way or something like that.
我认为存在许多不同类型的智能。
Well, so I think there is a lot of different types of intelligence.
首先,我认为我们所理解的那种逻辑推理——可能源自上世纪七八十年代的经典人工智能——人类其实很少使用,而且也不太擅长。
So first of all, I think the type of logical reasoning that we think about, maybe stemming from classical AI of the nineteen seventies and eighties, humans use relatively rarely and are not particularly good at.
但我们却根据彼此解决这些罕见问题的能力来评判对方。
But we judge each other based on our ability to solve those rare problems.
这叫智商测试。
It's called an IQ test.
我觉得是的。
I think so.
比如,我下棋并不擅长。
Like, I'm I'm not very good at chess.
是的。
Yes.
我一直在评判你,因为实际上我们。
I've been judging you this whole time, because, well, we actually...
以你的背景,我相信你下棋很厉害。
With your, you know, heritage, I'm sure you're good at chess.
不。
No.
刻板印象。
Stereotypes.
并非所有刻板印象都是正确的。
Not all stereotypes are true.
我下棋很差。
Well, I'm terrible at chess.
所以,你知道,但我认为我拥有的另一种智力是这种能力——通过推理,当然还有数据,来构建对世界的模型。
So, you know, but I think perhaps another type of intelligence that I have is this ability to build models of the world from reasoning, obviously, but also from data.
而这些模型通常更偏向于类比性质。
And those models are generally more analogical.
对吧?
Right?
所以这是一种通过模拟和类比进行的推理,你用一个模型去应用到新情境中。
So it's reasoning by simulation and by analogy, where you take one model and apply it to a new situation.
即使你从未遇到过那种情况,你也能将它与你之前经历过的某种情境联系起来。
Even though you've never seen that situation, you can sort of connect it to a situation you've encountered before.
而你的推理更类似于某种内在的模拟。
And your reasoning is more akin to some sort of internal simulation.
所以当你在建造,比如一个木盒子时,你其实是在模拟正在发生的事情。
So you're kind of simulating what's happening when you're building, I don't know, a box out of wood or something.
对吧?
Right?
你会提前想象,如果以这种方式切割木材,结果会是什么样?
You kind of imagine in advance what the result would be of cutting the wood in this particular way.
你会用螺丝、钉子,还是别的什么?
Are you gonna use, you know, screws or nails or whatever?
当你与人互动时,你也会对这个人形成一个模型,并带着这个模型去与对方互动,从而告诉对方你觉得对他们有用的信息。
When you're interacting with someone, you also have a model of that person, and you interact with them keeping this model in mind, to tell them what you think is useful to them.
所以我认为,构建世界模型的能力,本质上就是智能的核心。
So I think this ability to construct models of the world is basically the essence of intelligence.
当然,运用这种能力来规划实现特定目标的行为,也是必不可少的。
And the ability to use them to plan actions that will fulfill a particular criterion is, of course, necessary as well.
所以我会像之前一样,问你一系列不可能的问题。
So I'm going to ask you a series of impossible questions, as I've been doing.
所以,如果这是智能的暗物质基础——即构建背景模型的能力,你认为需要多少知识呢?
So if that's the fundamental dark matter of intelligence, this ability to form a background model, what's your intuition about how much knowledge is required?
你知道,我觉得暗物质就像宇宙组成中可以量化的一个百分比,有多少是暗物质,有多少是暗能量。
You know, with dark matter, you can put a percentage on the composition of the universe: how much of it is dark matter, how much of it is dark energy.
你认为要成为一只家猫,需要多少信息呢?
How much information do you think is required to to be a house cat?
所以当你看到一个盒子时,你要能钻进去;当你看到一个人时,要能计算出最恶劣的行动。
So you have to be able to, when you see a box, go in it, and when you see a human, compute the most evil action.
如果有什么东西靠近边缘,你就把它推下去。
If there's a thing that's near an edge, you knock it off.
所有这些,再加上你提到的额外内容——即对自己身体和世界的物理规律有极强的自我意识。
All of that, plus the extra stuff you mentioned, which is a great awareness of the physics of your own body and of the world.
你认为要解决这些问题,需要获取多少知识呢?
How much knowledge has to be acquired, do you think, to solve it?
我甚至不知道该如何衡量这个问题的答案。
I don't even know how to measure an answer to that question.
我不确定怎么衡量,但不管是什么,它大约只需要80万个神经元,抱歉,是8亿个。
I'm not sure how to measure it, but whatever it is, it fits in about 800,000 neurons... 800,000,000 neurons, sorry.
这种表示方式吗?
The representation does?
所有东西。
Everything.
所有知识。
All knowledge.
所有东西。
Everything.
对吧?
Right?
不到十亿个。
It's less than a billion.
狗有20亿个,但猫少于10亿个。
A dog is 2,000,000,000, but a cat is less than 1,000,000,000.
将这个数字乘以一千,你就得到了突触的数量。
And so multiply that by a thousand and you get the number of synapses.
我认为几乎所有东西都是通过一种自监督学习方式学会的。
And I think almost all of it is learned through a sort of self supervised learning.
尽管有一小部分是通过强化学习学会的,而经典的监督学习则更少,甚至在生物世界中,监督学习究竟是如何运作的都不清楚。
Although I think a tiny sliver is learned through reinforcement learning, and certainly very little through classical supervised learning, although it's not even clear how supervised learning actually works in the biological world.
所以我认为几乎所有东西都是自监督学习。
So I think almost all of it is self supervised learning.
但它是由猫或人类大脑底层固有的目标函数所驱动的,这些函数决定了它们的行为。
But it's driven by the sort of ingrained objective functions that a cat or a human has at the base of their brain, which kind of drive their behavior.
所以,大自然告诉我们:你饿了。
So, you know, nature tells us you're hungry.
但它并没有告诉我们如何进食。
It doesn't tell us how to feed ourselves.
这需要我们大脑的其他部分去自行摸索。
That's that's something that the rest of our brain has to figure out.
好吧?
Alright?
这很有趣,因为可能还存在更深层的底层目标函数。
Well, it's interesting because there might be more, like, deeper objective functions underlying the whole thing.
所以饥饿可能只是某种神经生物学层面的东西。
So hunger may be some kind of... now you get into, like, neurobiology.
它可能只是大脑在试图维持稳态。
It might be just the brain trying to maintain homeostasis.
因此,饥饿只是大脑对当前状态不满时,人类能够感知到的众多症状之一。
So hunger is just one of the human perceivable symptoms of the brain being unhappy with the way things are currently.
对。
Right.
因为这可能只是核心处一个非常原始的目标函数。
Because it could be just, like, one really dumb objective function at the core.
但这就是行为被驱动的方式。
But that's how that's how behavior is is is driven.
事实上,基底神经节驱使我们做出与猩猩或猫截然不同的行为,这正是人类本性、猩猩本性和猫本性之间的区别所在。
The fact that, you know, our basal ganglia drives us to do things that are different from, say, an orangutan or certainly a cat is what makes human nature versus orangutan nature versus cat nature.
例如,我们的基底神经节驱使我们寻求与他人相处。
So for example, our basal ganglia drives us to seek the company of other humans.
这是因为自然演化发现,我们必须成为社会性动物,物种才能生存,这一点在许多灵长类动物中都是成立的。
And that's because nature has figured out that we need to be social animals for our species to survive, and it's true of many primates.
但这对猩猩来说并不成立。
It's not true of orangutans.
猩猩是独居动物。
Orangutans are solitary animals.
它们不会主动寻求他人的陪伴。
They don't seek the company of others.
事实上,它们会避开他人。
In fact, they avoid them.
事实上,当别人靠得太近时,它们会尖叫,因为它们有领地意识。
In fact, they scream at others when they come too close, because they're territorial.
是的。
Mhmm.
因为对于它们的生存而言,进化已经找到了最佳方式。
Because for their survival, you know, evolution has figured it out.
这才是最好的方式。
That's the best thing.
我的意思是,它们偶尔也会社交,比如为了繁殖之类的事情。
I mean, they're occasionally social, of course, for reproduction and stuff like that.
但它们大部分时间都是独居的。
But they're mostly solitary.
所以,所有这些行为都不属于智力的范畴。
So all of those behaviors are not part of intelligence.
人们常说,你永远不可能创造出智能机器,因为人类的智力是社会性的。
You know, people say, oh, you're never gonna have intelligent machines because, you know, human intelligence is social.
但如果你看看猩猩,看看章鱼,
But then you look at orangutans, you look at octopuses.
章鱼从来不知道自己的父母。
Octopuses never know their parents.
嗯。
Mhmm.
它们几乎不与其他个体互动,却能在不到一年、甚至半年内变得非常聪明。
They barely interact with any others, and they get to be really smart in less than a year, in like half a year.
你知道,一年内它们就成年了。
You know, in a year, they're adults.
两年内它们就死了。
In two years, they're dead.
因此,我们人类认为某些事物与智力密切相关,比如社交互动、语言,但我觉得我们过分强调了语言作为智力载体的重要性,因为我们总觉得自己的推理与语言密不可分。
So there are things that we, as humans, think are intimately linked with intelligence, like social interaction, like language. I think we give way too much importance to language as a substrate of intelligence, because we think our reasoning is so linked with language.
所以,要解决家猫的智力问题,你觉得在一座荒岛上就能做到吗?
So to solve the house-cat intelligence problem, you think you could do it on a desert island?
你可以有
You could have
差不多就是这样。
Pretty much.
你可以让一只猫坐在那里观察海浪,然后自己领悟出很多东西。
You could just have a cat sitting there looking at the waves, the ocean waves, and it would figure a lot of it out.
它需要具备某种恰当的驱动力,你知道,让它去做那些事并学习合适的东西。
It needs to have, you know, the right set of drives to get it to do the thing and learn the appropriate things.
对吧?
Right?
但是,比如说,你知道,人类婴儿天生就有学习站立和行走的动力。
But, for example, you know, baby humans are driven to learn to stand up and walk.
好的。
Okay.
你知道,这种欲望本身某种程度上是与生俱来的。
You know, this desire is kind of hardwired.
具体如何操作则不是这样。
How to do it precisely is not.
这是后天学会的。
That's learned.
嗯。
Mhmm.
但这种想要
But the desire to
走路?
To walk?
四处活动和站起来的欲望。
Move around and stand up.
嗯。
Mhmm.
这种欲望某种程度上是
That's sort of
天生固有的。
hardwired.
确实是
It is
把这种东西固化下来非常简单。
very simple to hardwire this kind of stuff.
什么?哦,比如这种欲望?嗯,这很有趣。
What oh, like the desire to well, that's interesting.
你天生就想要走路。
You're hardwired to wanna walk.
这背后一定有更深层的需要促使人走路。
That's not there's gotta be a deeper need for walking.
我认为,是社会强加了这种观念,认为你必须走路,所有其他两足生物都如此。
I think it was probably socially imposed, that you need to walk, like all the other bipedal...
不,比如,很多简单的动物,你知道,它们就直接开始了。
No, like, a lot of simple animals, you know, they just start.
很可能根本不需要观察同类,就会走路。
Probably walk without ever watching any other members of the species.
这似乎是一件可怕的事,因为刚开始学双足行走时你做得并不好。
It seems like a scary thing to do, because you suck at bipedal walking at first.
爬行看起来安全多了,更舒服,你干嘛这么着急?
It seems crawling is much safer, much more comfortable. Like, why are you in a hurry?
因为有一种驱动力促使你去做这件事,你知道的,这属于人类发展的一部分。
Well, because you have this thing that drives you to do it, you know, which is sort of part of human development.
实际上,人们是否理解这一点呢?
Is it actually understood what...
并不完全理解。
Not entirely.
是的。
No.
为什么要站起来用两条腿走路呢?
What's the reason to get on two feet?
这真的很难。
It's really hard.
大多数动物都不会用两条腿走路。
Like, most animals don't get on two feet.
为什么呢
Why do
用四条腿。
on four feet.
你知道,许多哺乳动物都是用四条腿走路的。
You know, many mammals get on four feet.
是的。
Yeah.
它们很快就学会了。
They learn very quickly.
有些甚至极其迅速。
Some of them extremely quickly.
但我记得,上次我跟桌子互动时,桌子比两条腿的东西稳多了。
But, you know, from the last time I interacted with a table, that's much more stable than a thing on two legs.
这真是个非常难的问题。
It's just a really hard problem.
是的。
Yeah.
有多少种鸟用两只脚实现了这一点?
How many birds have figured it out with two feet?
从技术上讲,我们可以讨论一下本体论。
Well, technically, we can go into ontology.
它们有四只。
They have four.
我想它们有两只脚。
I guess they have two feet.
它们有
They have
两只脚。
two feet.
鸡。
Chickens.
你知道吗,很多恐龙都是两只脚的。
You know, dinosaurs had two feet, many of them.
据说如此。
Allegedly.
我刚刚才知道霸王龙吃的是草,而不是其他动物。
I'm just now learning that T. rex was eating grass, not other animals.
霸王龙可能是个友善的宠物。
T. rex might have been a friendly pet.
你觉得呢?我不确定你有没有看过弗朗索瓦·肖莱设计的通用智力测试。
What do you think about... I don't know if you've looked at the test for general intelligence that François Chollet put together.
嗯嗯。
Mhmm.
我不确定你有没有机会看过这类东西。
I don't know if you got a chance to look at that kind of thing.
比如,你对如何解决类似智商测试的问题有什么直觉?
Like, what what's your intuition about how to solve, like, an IQ type of test?
我不知道。
I don't know.
我觉得这完全不在我的关注范围内,我认为在短期内并不相关。
I think it's so outside of my radar screen that it's not really relevant, I think, in the short term.
好吧,也许换一种问法,更贴近你的工作的是,你怎么用很少的样本数据来解决MNIST问题?
Well, I guess one way to ask, perhaps closer to your work, is: how do you solve MNIST with very little example data?
没错。
That's right.
而这个问题的答案很可能是自监督学习。
And that's the answer to this probably is self supervised learning.
只需学习如何表示图像,然后在此基础上学习识别手写数字就只需要少量样本了。
Just learn to represent images, and then learning, you know, to recognize handwritten digits on top of this will only require a few samples.
我们在人类身上也观察到这种现象。
And we observe this in humans.
你给一个小孩看一本画册,里面只有几张大象的图片,就这样。
You show a young child a picture book with a couple of pictures of an elephant, and that's it.
孩子就知道什么是大象了。
The child knows what an elephant is.
今天我们也在实际系统中看到这种情况:我们用海量的图像来训练图像识别系统,这些图像要么是完全自监督的,要么是弱监督的。
And we see this today with practical systems, we train image recognition systems with enormous amounts of images, either completely self supervised or very weakly supervised.
例如,你可以训练一个神经网络来预测人们在Instagram上输入的标签。
For example, you can train a neural net to predict whatever hashtag people type on Instagram.
对吧?
Right?
你可以用数十亿张图像来做这件事,因为每天都有数十亿张图像出现。
And you can do this with billions of images because there's billions per day that are showing up.
因此,可用的训练数据量几乎是无限的。
So the amount of training data there is essentially unlimited.
然后,你可以取系统学习过程中中间几层的输出表示,将其作为分类器的输入,用于识别你想要的任何物体,效果相当不错。
And then you take the output representation, a couple of layers down from the output of what the system learned, and feed it as input to a classifier for any object in the world that you want, and it works pretty well.
这就是迁移学习。
So that's transfer learning.
明白吗?
Okay?
或者说是弱监督迁移学习。
Or weakly supervised transfer learning.
人们也在这种自监督学习的场景中取得了非常快速的进展。
People are making very fast progress using self supervised learning in this kind of scenario as well.
我猜测,这将是未来的趋势。
And my guess is that that's going to be the future.
对于自监督学习,你认为需要多少数据清洗来过滤恶意信号?或者用什么更好的术语?
For self supervised learning, how much cleaning do you think is needed for filtering malicious signal or what's a better term?
但很多人在Instagram上使用标签来获得不错的搜索引擎优化效果,嗯。
But, like, a lot of people use hashtags on Instagram to get, like, good SEO Mhmm.
但这并不能完全代表图片的内容。
That doesn't fully represent the contents of the image.
比如,他们发一张猫的照片,却打上科学、很棒、有趣这样的标签。
Like, they'll put a picture of a cat and hashtag it with, like, science, awesome, fun.
我不确定。
I don't know.
各种各样的标签,我不知道为什么要打上科学?
All kinds. I don't know why you would put science.
嗯,所以
Well, so
这种做法并不是很好的搜索引擎优化。
That's not very good SEO.
几年前,我在Facebook(现在是Meta AI)工作的同事们处理这个问题时,只选择了大约17,000个与物理物体或场景相关的标签,比如那些具有视觉内容的标签。
The way my colleagues who worked on this project at Facebook, now Meta AI, a few years ago dealt with this is that they only selected something like 17,000 tags that correspond to kind of physical things or situations, like, you know, that have some visual content.
所以,你知道,你不会用像#TBT这样的标签之类的。
So, you know, you wouldn't have, like, #TBT or anything like that.
哦,所以你的意思是他们只保留了一小部分标签?
Oh, so they they keep a very select set of hashtags is what you're saying?
是的。
Yeah.
好的。
Okay.
但数量仍然在一万到两万左右,所以还是相当大的。
But it's still on the order of, you know, 10,000 to 20,000, so it's fairly large.
好的。
Okay.
你能跟我讲讲数据增强吗?
Can you tell me about data augmentation?
数据增强到底是什么?它是怎么用于视频的对比学习的?
What the heck is data augmentation, and how is it used, maybe contrastive learning for for video?
这里有哪些有趣的想法?
What are some cool ideas here?
对。
Right.
数据增强,首先,数据增强是指通过以不改变图像本质的方式对现有图像进行变换,来人为地扩大你的训练集。
So data augmentation I mean, first, data augmentation, you know, is the idea of artificially increasing the size of your training set by distorting the images that you have in ways that don't change the nature of the image.
对吧?
Right?
比如,以MNIST为例。
So you take, say, MNIST.
你可以在MNIST上进行数据增强。
You can do data augmentation on MNIST.
人们从上世纪九十年代就开始这么做。
And people have done this since the nineteen nineties.
对吧?
Right?
你拿一个MNIST数字,稍微移动它、改变大小、旋转或扭曲它,你知道的,
You take an MNIST digit and you shift it a little bit, or you change the size, rotate it, skew it, you know,
等等。
etcetera.
添加噪声。
Add noise.
添加噪声,等等。
Add noise, etcetera.
而且效果更好。
And it works better.
如果你用增强后的数据训练一个监督分类器,你会得到更好的结果。
If you train a supervised classifier with augmented data, you're gonna get better results.
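上面描述的经典数据增强可以用几行Python示意:对一个合成的28×28"数字"做小幅平移并加噪声(旋转、缩放等从略)。这个玩具数字只是演示假设。
The classic data augmentation described above can be sketched in a few lines of Python: a small shift plus additive noise applied to a synthetic 28x28 "digit" (rotation, scaling, etc. omitted for brevity). The toy digit is an assumption for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Return a randomly distorted copy that keeps the digit's identity:
    a small translation plus pixel noise (rotation/scaling omitted)."""
    dy, dx = rng.integers(-2, 3, size=2)           # shift by up to 2 pixels
    out = np.roll(img, (dy, dx), axis=(0, 1))      # small translation
    out = out + 0.05 * rng.normal(size=out.shape)  # additive noise
    return out

# Toy 28x28 "digit": a bright bar, standing in for an MNIST image.
digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0

# Artificially enlarge the training set: distorted copies, same label.
augmented = [augment(digit, rng) for _ in range(8)]
print(len(augmented), augmented[0].shape)
```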
在过去的几年里,这变得非常有趣,因为许多用于预训练视觉系统的自监督学习技术都基于数据增强。
Now it's become really interesting over the last couple years because a lot of self supervised learning techniques to pre train vision systems are based on data augmentation.
这些基本技术最初受到我在九十年代初以及杰夫·辛顿也在九十年代初所研究的技术的启发。
And the basic techniques are originally inspired by techniques that I worked on in the early nineties and Geoff Hinton worked on also in the early nineties.
当时有一些并行的研究。
There was sort of parallel work.
我以前称这个为孪生网络。
I used to call this Siamese network.
基本上,取两个相同网络的副本,它们共享相同的权重,然后展示同一物体的两种不同视角。
So basically, take two identical copies of the same network, they share the same weights, and you show two different views of the same object.
这两种不同视角可能是通过数据增强获得的,也可能是你移动摄像头后对同一场景在不同时间拍摄的两种视角,或者类似的情况,又或者是同一个人的两张照片,诸如此类。
Either those two different views may have been obtained by data augmentation, or maybe it's two different views of the same scene from a camera that you moved or at different times or something like that, right, or two pictures of the same person, things like that.
然后你训练这个神经网络——这两个相同的网络副本——生成一种输出表示(向量),使得这两张图像的表示尽可能接近、尽可能相同。
And then you train this neural net, those two identical copies of this neural net, to produce an output representation, a vector, in such a way that the representation for those two images are as close to each other as possible, as identical to each other as possible.
对吧?
Right?
因为你想让系统学习一个函数,这个函数具有不变性,即当以这些特定方式变换输入时,其输出不会改变。
Because you want the system to basically learn a function that will be invariant, whose output will not change when you transform those inputs in those particular ways.
所以这很容易实现。
So that's easy to do.
复杂的地方在于,如何确保当输入两张不同的图片时,系统会产生不同的结果?
What's complicated is how do you make sure that when you show two images that are different, the system will produce different things?
因为如果你不对此做特别设计,系统在训练时就会忽略输入。
Because if you don't have a specific provision for this, the system will just ignore the inputs when you train it.
它最终会忽略输入,只为所有输入生成一个相同的恒定向量。
It will end up ignoring the input and just produce a constant vector that is the same for every input.
这被称为坍缩。
That's called a collapse.
那么,如何避免坍缩呢?
Now how do you avoid collapse?
有两种思路:一种是我和贝尔实验室的同事简·布罗梅利以及其他几位在90年代初提出的,我们现在称之为对比学习,即使用负样本。
So there's two ideas: one idea that I proposed in the early 90s with my colleagues at Bell Labs, Jane Bromley and a couple other people, which we now call contrastive learning, which is to have negative examples.
所以,你会使用已知不同的图片对,将它们输入网络的两个副本,并将两个输出向量彼此推远。
So you have pairs of images that you know are different, and you show them to the network, those two copies, and then you push the two output vectors away from each other.
这样最终就能保证语义相似的输入产生相似的表示,而不同的输入则产生不同的表示。
And it will eventually guarantee that things that are semantically similar produce similar representations, and things that are different produce different representations.
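下面是这种带负样本的对比损失的一个示意实现,体现上文描述的思想:共享权重的两份网络、拉近正样本、以一个间隔(margin)推远负样本。线性嵌入器和具体的margin取值都是演示假设,并非原始论文的精确形式。
Below is a sketch of this contrastive loss with negative pairs, in the spirit of what is described above: two shared-weight copies of a network, positive pairs pulled together, negative pairs pushed at least a margin apart. The linear embedder and the margin value are demo assumptions, not the exact form of the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(8, 4))  # weights shared by both "twins"

def embed(x):
    """One copy of the Siamese network; both inputs share the weights W."""
    return x @ W

def contrastive_loss(x1, x2, same, margin=1.0):
    """Pull positive pairs together; push negative pairs at least
    `margin` apart in embedding space."""
    d = np.linalg.norm(embed(x1) - embed(x2))
    if same:
        return d ** 2                    # positive pair: shrink the distance
    return max(0.0, margin - d) ** 2     # negative pair: push apart

a = rng.normal(size=8)
a_view = a + 0.01 * rng.normal(size=8)   # a slightly distorted view of a
b = rng.normal(size=8)                   # a genuinely different input

print("positive-pair loss:", contrastive_loss(a, a_view, same=True))
print("identical-but-negative loss:", contrastive_loss(a, a, same=False))
```

第二个打印值展示了防坍缩项:如果两个被标记为不同的输入产生完全相同的表示,损失达到margin平方的最大值。
The second printed value shows the anti-collapse term: if two inputs labeled as different produce identical representations, the loss hits its maximum of margin squared.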
我们实际上是为了一个签名验证的项目提出了这个想法。
We actually came up with this idea for a project of doing signature verification.
我们会收集同一个人的多个签名,然后训练一个神经网络来生成相同的表示,同时迫使系统为不同的签名生成不同的表示。
So we would collect from like, multiple signatures from the same person, and then train a neural net to produce the same representation, and then, you know, force the system to produce different representation from different signatures.
这个问题实际上是当时AT&T的子公司NCR提出的,他们有兴趣将签名的表示存储在信用卡磁条的80字节中。
The problem was actually proposed by people from what was a subsidiary of AT&T at the time, called NCR, and they were interested in storing a representation of the signature on the 80 bytes of the magnetic strip of a credit card.
于是我们想到了一个拥有80个输出的神经网络,我们将这些输出量化为字节,以便能够编码签名的表示。
So we came up with this idea of having a neural net with 80 outputs, you know, that we would quantize into bytes so that we could encode the representation of the signature.
然后这种编码被用来判断签名是否匹配。
And that encoding was then used to compare whether the signature matches or not.
没错。
That's right.
所以你会让签名通过神经网络,然后将输出向量与卡上存储的内容进行比较。
So then you would, you know, sign, it would run through the neural net, and then you would compare the output vector to whatever is stored on your card.
有意思。
Interesting.
它真的有效吗?
Did it actually work?
它有效,但他们最终没有使用,因为根本没人在意。
It worked, but they ended up not using it because nobody cares actually.
我的意思是,美国的金融支付系统在这方面比欧洲宽松得多。
I mean, the American financial payment system is incredibly lax in that respect compared to Europe.
哦,指的是签名吗?
Oh, with the signatures?
签名到底有什么用呢?
What's the purpose of signatures anyway?
这非常
This is very
没人看这些签名。
Nobody looks at them.
没人关心。
Nobody cares.
你知道的吧?
You know?
就是这样。
It's yeah.
是的。
Yeah.
不。
No.
所以这就是对比学习。
So that that's contrastive learning.
对吧?
Right?
所以你需要正样本和负样本对。
So you need positive and negative pairs.
而这个方法的问题在于,虽然最初的论文是我写的,但我其实对它并不太有信心,因为它在高维情况下并不有效。
And the problem with that is that, you know, even though I had the original paper on this, I'm actually not very positive about it because it doesn't work in high dimension.
如果你的表示是高维的,那么两个事物不同的方式就会太多。
If your representation is high dimensional, there's just too many ways for two things to be different.
因此你需要大量的、大量的负样本对。
And so you would need lots and lots and lots of negative pairs.
所以有一个相对近期的实现方法,来自谷歌多伦多团队,杰夫·辛顿是那里的资深成员。
So there is a particular implementation of this which is relatively recent from actually the Google Toronto group, where Geoff Hinton is the senior member there.
它叫做SimCLR,拼写是s-i-m-c-l-r。
It's called SimCLR, s-i-m-c-l-r.
基本上是一种用特定目标函数实现对比学习思想的方法。
Basically a particular way of implementing this idea of contrastive learning with a particular objective function.
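作为参考,下面是SimCLR所用的那类温度缩放目标函数(常称NT-Xent)的一个粗略示意:这里简化为单个正样本对与一组显式负样本之间的交叉熵;真正的SimCLR是在整个批次上对称计算的,这里的嵌入维度等均为演示假设。
For reference, here is a rough sketch of the temperature-scaled objective of the kind SimCLR uses (often called NT-Xent), simplified to one positive pair against a list of explicit negatives; real SimCLR computes it symmetrically over a whole batch, and the embedding sizes here are demo assumptions.

```python
import numpy as np

def nt_xent(z_i, z_j, negatives, temperature=0.5):
    """NT-Xent-style loss for one positive pair (z_i, z_j) against a
    list of negatives: cosine similarities, temperature scaling, and
    cross-entropy with the positive pair as the correct "class"."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    sims = np.array([cos(z_i, z_j)] + [cos(z_i, n) for n in negatives])
    logits = sims / temperature
    # Softmax cross-entropy toward index 0 (the positive pair).
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=16)                          # embedding of an image
z_pos = z + 0.01 * rng.normal(size=16)           # embedding of its augmented view
negs = [rng.normal(size=16) for _ in range(10)]  # other images in the batch

print("NT-Xent loss:", nt_xent(z, z_pos, negs))
```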
如今我更感兴趣的是非对比方法,也就是其他确保不同输入的表示会不同的方式。
Now what I'm much more enthusiastic about these days is non contrastive methods, so other ways to guarantee that the representations will be different for different inputs.
这实际上基于杰夫·辛顿在九十年代初与他当时的学生苏·贝克尔提出的一个想法。
And it's actually based on an idea that Geoff Hinton proposed in the early nineties with his student at the time Sue Becker.
它的核心思想是最大化两个系统输出之间的互信息。
And it's based on the idea of maximizing the mutual information between the outputs of the two systems.
你只展示正样本对,也就是你已知具有一定相似性的图像对,并训练两个网络使其具有信息量,同时尽可能相互具有信息量。
You only show positive pairs, you only show pairs of images that you know are somewhat similar, and you train the two networks to be informative, but also to be as informative of each other as possible.
换句话说,一个表示必须能从另一个表示中预测出来。
So basically one representation has to be predictable from the other, essentially.
你知道,他早在九十年代初就提出了这个想法,发表了若干篇论文,但此后几十年都没人继续研究。
You know, he proposed that idea had, you know, a couple of papers in the early nineties and then nothing was done about it for decades.
我和我在FAIR的博士后们重新振兴了这个想法,特别是博士后斯特凡·德尼(Stéphane Deny),他现在是芬兰阿尔托大学的初级教授。
And I kind of revived this idea together with my postdocs at FAIR, particularly a postdoc called Stéphane Deny, who's now a junior professor in Finland at Aalto University.
我们提出了一种叫做Barlow Twins的方法,这是一种通过某些假设来最大化向量信息量的特定方式。
We came up with something that we call Barlow Twins, and it's a particular way of maximizing the information content of a vector using some hypotheses.
我们现在还推出了一个更近期的版本,叫做VICReg,拼写是v-i-c-r-e-g。
And we have kind of another version of it that's more recent now, called VICReg, v-i-c-r-e-g.
它的意思是方差-不变性-协方差正则化。
That means variance-invariance-covariance regularization.
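这三项可以用NumPy粗略示意如下:不变性项拉近同一输入的两个视图,方差项保持每个维度的离散度(防坍缩),协方差项去相关各维度。等权重求和与玩具化的嵌入都是演示假设;真实方法使用调过的系数和深度编码器。
The three terms can be roughly sketched in NumPy as follows: the invariance term pulls the two views of each input together, the variance term keeps each dimension spread out (anti-collapse), and the covariance term decorrelates the dimensions. The equal-weight sum and toy embeddings are demo assumptions; the real method uses tuned coefficients and deep encoders.

```python
import numpy as np

def vicreg_loss(z_a, z_b, var_target=1.0, eps=1e-4):
    """Sketch of the three VICReg terms for two batches of embeddings
    z_a, z_b (shape: batch x dim), one batch per view of the same inputs."""
    n, d = z_a.shape
    # Invariance: the two views of each input should embed the same.
    invariance = ((z_a - z_b) ** 2).mean()
    # Variance: hinge that keeps each dimension's std above a target.
    std = np.sqrt(z_a.var(axis=0) + eps)
    variance = np.maximum(0.0, var_target - std).mean()
    # Covariance: penalize off-diagonal covariance between dimensions.
    zc = z_a - z_a.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance = (off_diag ** 2).sum() / d
    return invariance + variance + covariance  # equal weights, for the demo

rng = np.random.default_rng(0)
z_a = rng.normal(size=(32, 8))              # embeddings of view 1
z_b = z_a + 0.1 * rng.normal(size=(32, 8))  # embeddings of view 2

collapsed = np.ones((32, 8))  # every input mapped to the same vector
print(vicreg_loss(z_a, z_b) < vicreg_loss(collapsed, collapsed))
```

注意,坍缩的嵌入仅凭正样本对就会被方差项惩罚——这正是"非对比"的含义。
Note that collapsed embeddings are penalized by the variance term using positive pairs alone, which is what "non-contrastive" means.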
这是我过去十五年来在机器学习领域最兴奋的成果。
And it's the thing I'm the most excited about in machine learning in the last fifteen years.
我的意思是,我对此真的非常兴奋。
I mean, I'm really, really excited about this.
对于这种非对比学习方法,哪些数据增强方式是有用的?
What kind of data augmentation is useful for that non contrastive learning method?
我们说的是这并不那么重要吗?
Are we talking about does that not matter that much?
或者这似乎是步骤中非常重要的部分,是的。
Or it seems like a very important part of the step Yeah.
你如何生成那些相似但又足够不同的图像。
How you generate the images that are similar but sufficiently different.
是的。
Yeah.
没错。
That's right.
这是一个重要的步骤,但也是一个令人头疼的步骤,因为你必须了解哪些数据增强方式不会改变对象的本质。
It's an important step, but it's also an annoying step, because you need to have that knowledge of what data augmentation you can do that does not change the nature of the object.
是的。
Yeah.
因此,标准的做法是——你知道,这个领域很多研究者都在用——就是使用某种类型的失真。
And so the standard scenario, which, you know, a lot of people working in this area are using, is to use the following types of distortion.
基本上,你进行几何失真。
So basically, you do geometric distortion.
也就是说,只是稍微移动一下图像。
So one basically just shifts the image a little bit.
这叫做裁剪。
It's called cropping.
另一种方法是稍微改变一下尺度。
Another one kinda changes the scale a little bit.
还有一种是旋转图像。
Another one kinda rotates it.
另一种是改变颜色。
Another one changes the colors.
你知道,你可以调整色彩平衡之类的。
You know, you can do a shift in color balance or something like that.
饱和度。
Saturation.
另一个是让它变得模糊。
Another one sort of blurs it.
另一个是添加噪声。
Another one adds noise.
所以你有一套标准的变换方法,人们会为不同算法使用相同的变换,以便进行比较。
So you have, like, a catalog of kind of standard things, and people try to use the same ones for different algorithms so that they can compare.
但有些算法,特别是某些自监督算法,实际上能应对更强大、更激进的数据增强,而有些则不能。
But some algorithms, some self supervised algorithm actually can deal with much bigger, like, more aggressive data augmentation, and some don't.
所以这使得整个事情变得复杂。
So that kinda makes the whole thing difficult.
但这就是我们所说的失真类型。
But that's the kind of distortions we're talking about.
因此,你用这些扭曲方式来训练,然后切断网络的最后一层或几层,将提取的特征作为分类器的输入,在ImageNet或其他数据集上训练分类器,并评估性能。
And so you train with those distortions, and then you chop off the last layer or a couple layers of the network, and you use the representation as input to a classifier, you train the classifier on ImageNet, let's say, or whatever, and measure the performance.
有趣的是,那些在消除图像间无关信息(即这些扭曲)方面表现优异的方法,确实能很好地去除这些干扰。
And interestingly enough, the methods that are really good at eliminating the information that is irrelevant, which is the distortions between those images, do a good job at eliminating it.
因此,你无法在这些系统中使用这些特征来进行目标检测和定位,因为这些信息已经丢失了。
And as a consequence, you cannot use the representations in those systems for things like object detection and localization, because that information is gone.
所以,你需要采用哪种数据增强方式,取决于你最终希望系统完成的任务;而我们今天常用的标准化数据增强方法,仅适用于目标识别或图像分类任务。
So the type of data augmentation you need to do depends on the task you want eventually the system to solve, and the type of data augmentation, standard data augmentation that we use today are only appropriate for object recognition or image classification.
它们并不适用于诸如
They're not appropriate for things like
你能帮我理解一下吗?你说定位效果不好,是因为它在区分负样本方面表现不佳,所以才不能用于定位?
Can you help me out understand why the localization is so you're saying it's just not good at the negative, like at classifying the negative, so that's why it can't be used for the localization?
不是的。
No.
你训练系统时,会给它一张图像,然后再给它同一张图像经过平移和缩放后的版本。
It's just that you train the system, you know, you give it an image, and then you give it the same image shifted and scaled.
是的
Yeah.
你告诉它这是同一张图片。
And you tell it that's the same image.
是的
Yeah.
所以系统基本上被训练成消除关于位置和大小的信息。
So the system basically is trained to eliminate the information about position and size.
所以现在你想用这个
So now you want to use that
哦,我明白了,比如,
Oh, I see, to, like,
找出一个物体的位置和大小。
figure out where an object is and what size it is.
明白了。
Got it.
就像一个边界框。
Like a bounding box.
比如,他们实际上能够做到。
Like, they'd be able to actually okay.
它仍然能够找到图像中的物体。
It could still find it could still find the object in the image.
只是它不太擅长确定该物体的精确边界。
It's just not very good at finding the exact boundaries of that object.
有意思。
Interesting.
有意思。
Interesting.
这其实是一个有趣的哲学问题。
Which, you know, that's an interesting sort of philosophical question.
目标定位到底有多重要呢?
How important is object localization anyway?
我们简直着迷于测量图像分割。
We're, like, obsessed by measuring, like, image segmentation.
着迷于完美地掌握物体的边界,但事实上,这对于我们理解场景内容来说未必那么重要。
Obsessed by perfectly knowing the boundaries of objects, when arguably that's not that essential to understanding the contents of the scene.
另一方面,我认为从进化角度看,动物最早的视觉系统基本上都是关于定位的,识别则非常少。
On the other hand, I think evolutionarily, the first vision systems in animals were basically all about localization, very little about recognition.
在人类大脑中,你有两个独立的通路分别用于识别场景或物体的性质,以及定位物体。
And in the human brain, you have two separate pathways for recognizing the nature of a scene or an object and for localizing objects.
所以,你使用第一个通路,即腹侧通路,来判断你正在看的是什么。
So you use the first pathway called the ventral pathway for, you know, telling what you're looking at.
另一个通路,背侧通路,则用于导航、抓取以及所有其他活动,你知道,许多生存所需的能力都依赖于定位和检测。
The other pathway, the dorsal pathway, is used for navigation, for grasping, for everything else. And, you know, basically, a lot of the things you need for survival rely on localization and detection.
相似性学习、对比学习或这些非对比方法,是否等同于理解事物?
Is similarity learning or contrastive learning or these non contrastive methods the same as understanding something?
仅仅因为你知道一个失真的猫和一个正常的猫是同一个东西,就代表你理解了‘猫’的含义吗?
Just because you know a distorted cat is the same as a nondistorted cat, does that mean you understand what it means to be a cat?
在某种程度上。
To some extent.
我的意思是,这显然只是一种表面的理解。
I mean, it's a superficial understanding, obviously.
但你觉得这种方法的上限在哪里?
But, like, what is the ceiling of this method, do you think?
这仅仅是实现自监督学习道路上的一个技巧,还是我们可以走得
Is this just one trick on the path to doing self supervised learning, or can we go
是的。
Yeah.
真的非常远吗?
Really, really far?
我认为我们可以走得很远。
I think we can go really far.
如果我们能弄清楚如何使用这类技术——也许方式非常不同,但核心是通过视频训练系统来实现视频预测。
So if we figure out how to use techniques of that type, perhaps very different, but of that nature, to train a system from video to do video prediction, essentially.
我认为我们会有一条路径,你知道,朝着某种程度上的、我不会说是无限的,但朝着机器具备某种物理常识的方向发展。
I think we'll have a path, you know, towards, I wouldn't say unlimited, but a path towards some level of, you know, physical common sense in machines.
而且我也认为,通过像视觉这样的高吞吐量通道来学习世界如何运作的能力
And I also think that that ability to learn how the world works from a sort of high throughput channel like vision
嗯。
Mhmm.
是迈向真正人工智能的必要一步。
Is a necessary step towards sort of real artificial intelligence.
换句话说,我相信基于现实的智能。
In other words, I believe in grounded intelligence.
我不认为我们可以仅通过文本来训练机器变得智能。
I don't think we can train a machine to be intelligent purely from text.
因为我认为,与世界相关的信息量在文本中所包含的,相比我们需要知道的,是微乎其微的。
Because I think the amount of information about the world that's contained in text is tiny compared to what we need to know.
举个例子——而且,是的,我知道三十年来人们一直在尝试这样做,对吧,比如Cyc项目之类的,基本上就是把所有已知的事实写下来,希望某种常识能够从中浮现出来。
So for example, and, yeah, I know people have attempted to do this for thirty years, right, the Cyc project and things like that, right, of basically kind of writing down all the facts that are known and hoping that some sort of common sense would emerge.
我觉得这基本上是不可能的。
I think it's basically hopeless.
但让我举个例子。
But let me take an example.
你拿一个物体。
You take an object.
我向你描述一种情况。
I I describe a situation to you.
我拿一个物体,把它放在桌子上,然后推桌子。
I take an object, I put it on the table, and I push the table.
你很清楚,物体将会随着桌子一起被推走。
It's completely obvious to you that the object will be pushed with the table.
对吧?
Right?
因为它就放在桌子上。
Because it's sitting on it.
我相信,世界上没有任何文本能够解释这一点。
There's no text in the world, I believe, that explains this.
所以,即使你训练一个像GPT-5000那样强大的机器,它也永远学不会这一点。
And so if you train a machine as powerful as it could be, you know, your GPT-5000 or whatever it is, it's never gonna learn about this.
这些信息根本不会出现在任何文本中。
That information is just not present in any text.
好吧,就像Cyc项目那个梦想一样,我认为目标是拥有大约一千万条这样的事实,为你提供一个起点,就像父母引导你一样。
Well, the question, like, with the Cyc project, the dream, I think, is to have, like, 10,000,000, say, facts like that that give you a head start, like a parent guiding you.
但我们人类并不需要父母告诉我们,桌子会动——哦,抱歉,手机会随着桌子一起动。
Now we humans don't need a parent to tell us that the table will move oh, sorry, the smartphone will move with the table.
但我们通过其他方式获得了大量指导,因此我们或许可以给它一个快速捷径。
But we get a lot of guidance in other ways, so it's possible that we can give it a quick shortcut.
那猫呢?
And what about cat?
猫是知道这一点的。
The cat knows that.
不。
No.
但它们是进化而来的。
But they evolved.
所以
So
不。
No.
它们像我们一样学习。
They learn like us.
对不起。
Sorry.
物质的物理规律?
The physics of stuff?
是的。
Yeah.
嗯,是的。
Well yeah.
所以你的意思是,你把很多智能归因于后天培养,而不是先天本性。
So you're saying it's so you're putting a lot of intelligence onto the nurture side, not the nature.
是的。
Yeah.
因为我们似乎经历了一个非常低效、可以说相当低效的进化过程,才从细菌发展到今天的我们。
Because there's a very inefficient, arguably, process of evolution that got us from bacteria to who we are today.
从底层开始,现在我们在这里了。
Started at the bottom, now we're here.
确实如此。
So true.
问题是,这究竟怎么样。
The question is how, okay.
所以你看,问题在于,这种硬件的本质究竟有多根本?如果它是根本的,有没有办法绕过它?
So you see, the question is how fundamental is that, the nature of the hardware, and then is there any way to shortcut it if it's fundamental?
如果不是这样,如果我们所讨论的大部分智能、大部分精彩的内容主要来自后天培养、通过观察世界习得,那我们只需静坐就能构建出你所说的那个宏大、优美、吸引人的背景模型。
If it's not, if it's most of intelligence, most of the cool stuff we've been talking about is mostly nurture, mostly trained, we figure it out by observing the world, we can form that big, beautiful, sexy background model that you're talking about just by sitting there.
那么,好吧,也许这一切本质上都是监督学习。
Then, okay, like, maybe it is all supervised learning all the way down.
或者说,是自监督学习。
Self supervised learning, say.
无论是什么让人类智能区别于其他动物——很多人认为是语言和逻辑推理这类能力——
Whatever it is that makes, you know, human intelligence different from other animals, which, you know, a lot of people think is language and logical reasoning and this kind of stuff.
它不可能太复杂,因为这种能力只在最近一百万年内才出现。
It cannot be that complicated because it only popped up in the last million years.
是的。
Yeah.
而且,它只涉及基因组中不到1%的差异,也就是人类基因组与黑猩猩或其他物种之间的区别。
And, you know, it only involves, you know, less than 1% of the genome, right, which is the difference between the human genome and chimps or whatever.
所以它不可能那么复杂。
So it can't be that complicated.