本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
Anthropic 打造出色的产品。
Anthropic makes great products.
Claude Code 非常棒。
Claude Code is fantastic.
Cowork 非常棒。
Cowork is fantastic.
但它们只是做矩阵乘法的硅粒。
But they are grains of silicon doing matrix multiplication.
它们没有意识。
They don't have consciousness.
它们没有内在的独白。
They don't have an inner monologue.
你拿一个大语言模型,用1916年或1911年之前的物理学数据训练它,看看它能否推导出相对论。
You take an LLM and train it on pre-1916 or pre-1911 physics and see if it can come up with the theory of relativity.
如果它能做到,那我们就有了通用人工智能。
If it does, then we have AGI.
顺便说一句,就在今天,达里奥声称你不能排除它们有意识的可能性。
Just today, by the way, Dario allegedly said that you can't rule out their consciousness.
你可以排除它们的意识。
You can rule out their consciousness.
我觉得。
I think.
一般来说。
In general.
要达到所谓的AGI,我认为需要发生两件事。
To get to what is called AGI, I think there are two things that need to happen.
五年前,维沙尔·米斯拉让GPT-3将自然语言翻译成它从未见过的领域特定语言。
Five years ago, Vishal Misra got GPT-3 to translate natural language into a domain specific language it had never seen before.
它成功了。
It worked.
他根本不知道为什么。
He had no idea why.
所以他着手建立一个数学模型,以解释大语言模型的实际运作方式。
So he set out to build a mathematical model of how LLMs actually function.
结果如何?
The result?
一系列论文表明,Transformer 模型以精确且数学上可预测的方式更新其预测。
A series of papers showing that transformers update their predictions in a precise, mathematically predictable way.
在受控实验中,这些模型几乎完美地匹配了理论上的正确答案。
In controlled experiments, the models match the theoretically correct answer almost perfectly.
但模式匹配并不等于智能。
But pattern matching is not intelligence.
大语言模型学习的是相关性。
LLMs learn correlation.
它们并不构建因果关系的模型。
They don't build models of cause and effect.
为了实现通用人工智能,米斯拉认为,我们需要在训练后持续学习的能力,并从相关性转向因果性。
To get to AGI, Misra argues, we need the ability to keep learning after training and the move from correlation to causation.
马丁·卡萨多采访了哥伦比亚大学计算与人工智能的教授兼副院长维沙尔·米斯拉。
Martin Casado speaks with Vishal Misra, professor and vice dean of computing and AI at Columbia University.
维沙尔,很高兴再次邀请你来。
Vishal, it's good to have you in again.
很高兴能回来。
Great to be back.
这是我最喜爱的话题之一,即大型语言模型究竟是如何工作的。
This is one of my favorite topics, which is how do LLMs actually work.
嗯。
Mhmm.
而且在我看来,你在建模这方面做得最好。
And I think that and in my opinion, you've done kind of the best work on this modeling it out.
谢谢。
Thank you.
对于那些没看过原始视频的人,也许值得先简单回顾一下是什么让你走到这一步,然后我们再深入讨论你目前的研究工作。
For those that did not see the original one, maybe it's probably worth doing just a quick background on kind of what led you to this point, and then we'll just go into the current work that you've been doing.
五年前,当GPT-3刚发布时,哦。
Five years ago, when GPT-3 was first released Oh.
我获得了早期访问权限,开始尝试使用它,当时我想解决一个关于查询板球数据库的问题。
I got early access to it, and I started playing with it, and I was trying to solve a problem related to querying a cricket database.
是的。
Yeah.
我让GPT-3进行了上下文学习和少样本学习,这至少对我来说,是已知的首个RAG(检索增强生成)实现;我用它来解决让GPT-3把自然语言翻译成可用于查询数据库的指令这个问题,而这个数据库是GPT-3完全不了解的。
And I got GPT-3 to do in-context learning, few-shot learning, and it was kind of the first at least to me, it was the first known implementation of RAG, retrieval augmented generation, which I used to solve this problem of getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about.
我无法访问GPT-3的内部机制,但依然成功用它解决了这个问题。
I had no access to GPT-3's internals, but I was still able to use it to solve that problem.
所以效果非常好。
So it worked beautifully.
我们在2021年9月将这套方案部署到了ESPN的生产环境中。
We deployed this in production at ESPN in September 2021.
但是
But
你在2021年就实现了RAG的首个版本?
You did the first implementation of RAG in 2021?
不是。
No.
不是。
No.
不是。
No.
是2020年。
In 2020.
2020年。
2020.
2020年,我就让它运行起来了。
2020, I got it working.
等你跟ESPN的所有律师沟通并将其投入生产时,花了一段时间。
And by the time you talk to all the lawyers at ESPN and productionize it, it took a while.
哇。
Wow.
但在2020年10月,我已经有了这个想法。
But October 2020, we had, well, I had this Yeah.
架构已经运行起来了。
Architecture working.
但当我让它运行起来后,我惊讶于它真的能工作。
But after I got it to work, I was amazed that it worked.
我想弄清楚它是如何工作的。
I wanted to understand how it worked.
是的。
Yeah.
我翻阅了注意力机制的论文,以及其他各类深度学习架构的论文,却无法理解它为什么能工作。
And I looked at the attention papers and all the other sort of deep learning architecture papers, and I couldn't understand why it worked.
是的。
Yeah.
于是我开始深入构建一个数学模型。
So then I started getting sort of deep into building a mathematical model.
对。
Yeah.
现在你已经发表了一系列论文。
And now you've published a series of papers.
我读的第一篇是你那个矩阵的论文。
The first one that I read was the one where you had kind of your matrix Yeah.
某种抽象。
Kind of abstraction.
所以也许我们会先谈这个,然后再谈最近的。
So maybe we'll talk about that, and then we'll talk about the more recent Yeah.
工作。
Work.
所以也许我们先从第一个开始,你试图建立一个大型语言模型如何工作的数学模型。
So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLM works.
是的。
Yeah.
这对我非常有帮助。
And you have which was very helpful to me.
当时,你实际上在试图弄清楚上下文学习是如何运作的。
And at the time, you were actually trying to figure out how in context learning was working.
是的。
Yes.
对。
Yeah.
你为大型语言模型提出了一种抽象,本质上是一个非常大的矩阵,并用它来描述。
And you came up with an abstraction for LLMs, which is basically a very large matrix, and you used that to describe it.
所以也许你可以快速地介绍一下这项工作。
So maybe you can kind of walk through that work very quickly.
当然。
Sure.
对。
Yeah.
所以你的做法是想象一个巨大的矩阵,矩阵的每一行对应一个提示。
So what you do is you imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt.
对。
Yeah.
这些大语言模型的工作方式是:给定一个提示,它们会生成下一个词的概率分布。
The way these LLMs work is given a prompt, they construct a distribution of probabilities of the next token.
下一个词就是下一个单词。
Next token is next word.
所以每个大语言模型都有一个词汇表。
So every LLM has a vocabulary.
GPT及其变体的词汇表大约包含5万个词元。
GPT and its variants have a vocabulary for about 50,000 tokens.
是的。
Yeah.
所以给定一个提示,模型会生成下一个词元的概率分布,然后所有这些模型都会从该分布中采样。
So given a prompt, it'll come up with a distribution of what the next token should be, and then all these models sample from
这个分布。
that distribution.
对。
Yeah.
这就是后验分布。
So that's the posterior distribution.
这就是后验分布。
That's the posterior distribution.
对。
Right.
对吧?
Right?
这就是大语言模型的工作原理。
That's how LLMs work.
因此,这个矩阵的思想是,对于每一个可能的词元组合(即提示),都有一行。
And so the idea of this matrix is for every possible combination of tokens, which is a prompt, there's a row.
是的。
Yeah.
而列则是对词汇表的分布。
And the columns are a distribution over the vocabulary.
没错。
Yep.
如果你的词汇表中有5万个可能的词元,那就是对这5万个词元的分布。
So if you have a vocabulary of 50,000 possible tokens, it's a distribution over those 50,000 tokens.
而分布指的就是概率。
And by distribution, it's just the probability.
就是概率。
Just the probability.
抱歉。
Sorry.
是的。
Yeah.
就是下一个词应该是这个的概率。
Just the probability that the next token should be this Yeah.
而不是那个。
Versus that.
对。
Yep.
所以这就是大致的想法。
So that's sort of the idea.
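The abstraction described above can be sketched in a few lines. Everything here is invented purely for illustration: a real model's "matrix" is never stored explicitly, and the prompts, tokens, and probabilities below are toy stand-ins.

```python
import random

# Toy version of the "giant matrix": each row is a prompt (a tuple of
# tokens), and the row holds a probability distribution over the next
# token. All entries are made up for illustration.
MATRIX = {
    ("protein",): {"synthesis": 0.5, "shake": 0.5},
    ("protein", "shake"): {"recipe": 0.6, "gym": 0.4},
    ("protein", "synthesis"): {"ribosome": 0.7, "mRNA": 0.3},
}

def next_token_distribution(prompt):
    # Look up the row for this prompt; prompts not in the matrix get no mass.
    return MATRIX.get(tuple(prompt), {})

def sample_next(prompt, rng=None):
    # An LLM samples the next token from the row's distribution.
    rng = rng or random.Random(0)
    dist = next_token_distribution(prompt)
    tokens = sorted(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

dist = next_token_distribution(["protein"])
token = sample_next(["protein"])
```

Appending the sampled token to the prompt selects a different row, which is exactly the "protein shake" versus "protein synthesis" branching discussed next.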
当你以这种方式看待它时,至少能让像我这样想建模的人更清楚地理解正在发生什么。
And when you start viewing it that way, it makes things at least clearer to people like me who want to model it, what's happening.
所以具体来说,假设你的提示只有一个词,比如‘protein’。
So concretely, let's say you have an example that let's say your prompt is just one word, protein.
是的。
Yeah.
所以,如果你看一下在那个词之后的下一个词或下一个标记的分布,大部分概率会是零,但你可能会对两个词赋予非零且不可忽视的概率。
So if you look at the distribution of the next word, next token after that, most of the probability would be zero, but you'd have nonzero, nontrivial probabilities on, let's say, two words.
一个是合成。
One is synthesis.
另一个是摇动。
The other is shake.
是的。
Yeah.
对吧?
Right?
现在,大语言模型会从这些选项中采样下一个标记,可能会选择“合成”或“摇动”。
And now the LLM is going to sample this next token and might pick synthesis or shake.
是的。
Yeah.
或者,作为人类,你会给出提示:蛋白粉摇饮,是的。
Or you, as a human, will give the prompt protein shake Yep.
或者蛋白质合成。
Or protein synthesis.
现在,取决于你选择的是合成还是摇饮,接下来的内容会完全不同。
Now depending on whether you pick synthesis or shake, the next that that row looks very different.
对吧?
Right?
如果你选择蛋白质合成,那么高概率出现的词都会与生物学相关。
If you pick protein synthesis, the terms that would have a high probability would be all concerned with biology.
对吧?
Right?
但如果你选择蛋白粉摇饮,接下来的内容就会全是关于健身房、锻炼和健美之类的话题。
But if you pick protein shake, it'll all be about gym, then exercise, and all bodybuilding stuff.
所以,选择合成还是摇饮,完全改变了后续的内容。
So that synthesis or shake completely changes what comes next.
是的。
Yeah.
所以这是一个所谓的贝叶斯更新的例子。
So this is an example of, you can say, Bayesian updating.
你从蛋白质开始。
You start with protein.
你有一个先验假设:在蛋白质之后,会发生这种情况。
You have a prior that after protein, this is going to happen.
一旦你获得新的证据,下一个词就是合成或奶昔。
As soon as you get new evidence, then the next term is synthesis or shake.
你完全更新了概率分布。
You completely update the distribution.
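One hedged way to read the protein example as Bayesian updating: keep a prior over what the text is about, then renormalize when the next word arrives. The topics and likelihood numbers below are invented for illustration.

```python
# Prior belief over topics, before seeing the word after "protein".
priors = {"biology": 0.5, "fitness": 0.5}

likelihood = {  # P(next word | topic), illustrative numbers only
    "synthesis": {"biology": 0.9, "fitness": 0.1},
    "shake": {"biology": 0.1, "fitness": 0.9},
}

def bayes_update(prior, word):
    # Bayes' rule: posterior is proportional to likelihood times prior.
    unnorm = {t: prior[t] * likelihood[word][t] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

after_shake = bayes_update(priors, "shake")          # belief shifts to fitness
after_synthesis = bayes_update(priors, "synthesis")  # belief shifts to biology
```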
所以现在你可以想象,整个大语言模型就是一个巨大的矩阵。
So now you can imagine that the whole the entirety of LLMs is this giant matrix.
是的。
Yeah.
你拥有每一行。
Where you have every row.
蛋白质奶昔、蛋白质合成,猫坐在了Yeah上。
Protein shake, protein synthesis, the cat sat on the Yeah.
亨普蒂·邓普蒂, blah blah blah。
Humpty, dumpty, blah blah blah.
是的。
Yeah.
考虑到这些大语言模型的词汇量,比如5万个,以及上下文窗口。
Given the vocabulary of these LLM, let's say 50,000, and the context window.
比如GPT,也就是第一版ChatGPT,其上下文窗口为8000个标记。
So GPT, for instance ChatGPT, the first version had a context window of 8,000 tokens.
对。
Yep.
如果你考虑所有8000个标记和5万个词汇的可能组合,这个矩阵的行数将超过所有星系中电子的总数。
If you look at all possible combinations of 8,000 tokens and 50,000 vocabulary, the number of rows in this matrix is more than the number of electrons across all galaxies.
对吧?
Right?
就这样。
That's all.
所以这些大语言模型不可能精确地表示它。
So there's no way that these LLMs can represent it exactly.
幸运的是,这个矩阵非常稀疏。
Fortunately, this matrix is very sparse.
为什么?
Why?
因为这些标记的任意组合都是胡言乱语。
Because an arbitrary combination of these tokens is gibberish.
是的。
Yeah.
我们在现实生活中根本不会用到那些组合。
We're never gonna use that in real life.
是的。
Yeah.
此外,列也主要是零。
Also, the columns are also mainly zero.
是的。
Yeah.
对吧?
Right?
如果你有蛋白质,那么你就不会有很多,你知道的,不会在那之后出现任意的数字或任意的词语。
If you have protein, then you won't have lots of you know, you won't have arbitrary numbers or arbitrary words after that.
它在行和列上都非常稀疏。
It's very sparse both in rows and in columns.
所以,从某种抽象的角度来说,所有这些大语言模型所做的,就是为这个矩阵找到一个压缩表示。
So in kind of an abstract way, what all these LLMs are doing is coming up with a compressed representation of this matrix.
对。
Right.
当你给出一个提示时,它们会尝试逼近真实分布应该是怎样的,是的。
And when you give a prompt, they try to approximate what the true distribution should have been Yeah.
并尝试生成它。
And try to generate it.
这至少在我心目中,归根结底就是如此。
That's what, in my mind, at least, it boils down to.
根据我的理解,如果你有一行是蛋白质,然后另一行是蛋白粉,嗯。
And just from my understanding, so if you have a row of protein and then you have one with protein shake Mhmm.
蛋白粉是蛋白质的子集,还是说它们不同?
Is protein shake a subset of protein, or is it different?
它们是不同的。
It's different.
它是从……延续下来的。
It's a continuation from.
我明白了。
I see.
是的
Yeah.
对
Right.
不
No.
但我只是想说,实际的后验分布,它是子集吗?
But I'm just saying, like, the the actual posterior distribution, is that a subset?
你可以说它是子集。
You can say it's a subset.
对吧?
Right?
如果你有蛋白质,那么蛋白粉和蛋白质合成都是从蛋白质延伸出来的。
If you have protein, then protein shake and protein synthesis are all continuations from protein.
所以合成和蛋白粉都有非零的概率。
So both synthesis and shake have nonzero probabilities.
所以你可以,是的,可以把这看作某种子集。
So you can, yeah, you can think of it as somewhat a subset.
对吧?
Right?
你用这种方法来描述上下文学习是如何工作的吗?
You use this approach to describe how in context learning works?
所以也许先解释一下什么是上下文学习,是的。
And so maybe first describe what in context learning is Yeah.
然后再说说你从中学到的结论。
And then kind of the conclusion that you came from that.
上下文学习是指你向大语言模型展示一些它以前从未见过的东西。
So in context learning is when you show the LLM something it has kind of never seen before.
你给它几个例子,说明这就是你想要做的。
You give it a few examples of this is what you're trying to do.
然后你再给出一个新问题,这个新问题与我展示的例子相关。
Then you give a new problem, which is related to the examples that I have shown.
是的
Yeah.
大语言模型会实时学习它应该做什么,并解决问题。
And the LLM learns in real time what it's supposed to do and solves a problem.
顺便说一下,我第一次看到这个时,简直震惊了。
By the way, the first time I saw this, it absolutely blew my mind.
我确实用过你的DSL。
I actually used your DSL Mhmm.
就是我刚开始学习这个概念的时候。
By when I was, like, first learning about it.
所以也许吧。
So maybe Yeah.
所以,我觉得如果这个DSL真的能起作用,那简直太疯狂了。
So, like, the DSL thing is just crazy if this works at all.
它真的能工作,这简直令人难以置信。
It's absolutely mind blowing that it works.
回到那个板球问题,九十年代中期,我参与了一个团队,创建了一个名为Cricinfo的板球门户网站。
And so going back to that cricket problem was, you know, in the mid nineties, I was part of a group that had created this cricket portal called Cricinfo.
是的。
Yeah.
板球是一项明星云集的运动。
Cricket is a very star rich sport.
你可以想象棒球放大一千倍,而且有各种各样的统计数据。
You think baseball multiplied by a thousand, and it's at all kinds of stats.
我们创建了一个名为StatsGuru的在线可搜索数据库,你可以查询任何与板球相关的统计数据,它自2000年起就一直可用。
And we had created this online searchable database called StatsGuru, where you could search for anything, any stat related to cricket, and it has been available since 2000.
是的。
Yeah.
但由于你可以查询任何内容,所有信息都被公开了。
But because you can query for anything, everything was made available.
那么,如何将这样的系统提供给普通大众呢?
And how do you make something like that available to the general public?
嗯,是的。
Well Yeah.
他们不会写SQL查询。
They're not gonna write SQL queries.
当时最好的替代方案是创建一个网页表单。
The next best thing at that time was to create a web form.
不幸的是,所有内容都被塞进了那个网页表单里。
Unfortunately, everything was crammed into that web form.
因此,你有大约20个下拉菜单、15个复选框和18个不同的文本字段。
So as a result, you had, like, 20 dropdowns, 15 checkboxes, 18 different text fields.
它看起来像是一个非常复杂、令人望而生畏的界面。
It looked like a very complicated, daunting interface.
因此,尽管它能够解决或回答任何查询,但几乎没有人使用它。
So as a result, even though it could solve or it could answer any query, almost no one used it.
只有极少数板球爱好者使用它,因为它看起来太吓人了。
A vanishingly small percentage of cricket fans use it because it it just looked intimidating.
然后ESPN在2007年收购了那个网站。
And then ESPN bought that site in 2007.
我至今仍认识一些运营这个网站的人,我一直跟他们说,你们为什么不用一下Stats Guru呢?
I still know people who run the site, and I've always told them, you know, why don't you do something with Stats Guru?
到了2020年1月,Cricinfo的主编,他是我的朋友。
And in January 2020, the editor in chief of Cricinfo, he's a friend.
所以他来纽约时,我们一起出去喝了点酒。
So he came to New York, and we had gone out for drinks.
我又跟他说,你们为什么不用一下Stats Guru呢?
And, again, I told him, you know, why don't you do something with Stats Guru?
他看着我说:那你为什么不用Stats Guru呢?
He looks at me and says, why don't you do something with Stats Guru?
你开玩笑吧。
You're joking.
但这个想法一直留在了我的脑海里。
But that idea kind of stayed with me.
当GPT-3发布时,我想也许可以用GPT-3给StatsGuru做一个前端。
And when GPT-3 was released, I thought maybe I could use GPT-3 to create a front end for StatsGuru.
明白了。
Got it.
于是我设计了一种领域特定语言(DSL),并把关于板球统计数据的自然语言查询转换为这种DSL。
And so what I did was I designed a DSL, a domain specific language, and converted natural language queries about cricket stats into this DSL.
现在
Now
但要明确一点,这是你独立开发的。
And to be clear, you created this.
这并不是任何在线训练数据的一部分。
It wasn't, like, part of any training data online.
是的。
Yeah.
就像你可能看到过的那样。
Like, could have seen.
是的。
Yeah.
GPT不可能见过这些。
Nothing GPT could have seen.
是我创建的。
I created it.
我想,好吧。
I thought, okay.
这说得通。
This makes sense.
是的。
Yeah.
所以我设计了那个DSL,然后做了少样本学习。
So I designed that DSL, and then I did the few-shot learning thing.
于是我创建了一个数据库,包含大约1500个自然语言查询及其对应的DSL表达式。
So I created a database of, I would say, about 1,500 natural language queries and the DSL corresponding to each query.
所以当一个新的查询进来时,有人用英语问了一个统计问题,我会去浏览这些自然语言查询,进行语义搜索,选出最匹配的前几个,是的。
So when a new query came in, somebody is asking a stats question in English, what I would do is I would go through the natural language queries, do a semantic search, pick the most closely matching top few Yeah.
然后使用那个自然语言查询及其对应的DSL,并将其作为前缀发送。
And then use that natural language query and its DSL and send that as a prefix.
现在,如果你还记得,GPT-3的上下文窗口只有2000个token。
Now GPT-3, if you recall, had a context window of only 2,000 tokens.
是的。
Yeah.
所以你必须非常谨慎地选择要使用的示例。
So you had to be very judicious about which examples that you picked.
但你选好之后,再发送新的查询,GPT-3就会用我设计的DSL来补全它——而就在几毫秒前,它还从未见过这种DSL。
But you pick that, and then you send the new query, and GPT-3 would complete it in the DSL that I had designed, which until milliseconds ago, it had never seen.
是的。
Yeah.
而且我无法访问GPT-3的内部机制。
And I had no access to the internals of GPT-3.
我无法访问模型权重。
I had no access to the weights.
是的。
Yeah.
但即便如此,它还是奏效了。
But still, it worked.
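The retrieval pipeline just described (store natural-language/DSL pairs, pick the closest matches to a new question, pack them into a small context window as a few-shot prefix) can be sketched as follows. The bag-of-words similarity, the example queries, and the word-count token budget are all stand-ins; the real system used proper semantic search and a real tokenizer.

```python
def embed(text):
    # Crude bag-of-words "embedding"; a stand-in for a learned embedder.
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(new_query, examples, budget_tokens=2000):
    # Rank stored examples by similarity to the new query, then pack as
    # many as fit the context budget, most similar first.
    q = embed(new_query)
    ranked = sorted(examples, key=lambda ex: cosine(q, embed(ex["nl"])),
                    reverse=True)
    prefix, used = [], 0
    for ex in ranked:
        cost = len(ex["nl"].split()) + len(ex["dsl"].split())
        if used + cost > budget_tokens:
            break
        prefix.append(ex["nl"] + "\n" + ex["dsl"])
        used += cost
    return "\n\n".join(prefix + [new_query])

# Hypothetical examples; the DSL syntax is invented for illustration.
examples = [
    {"nl": "most runs by a batsman in 1999", "dsl": "FILTER year=1999 TOP runs"},
    {"nl": "highest score at Lords", "dsl": "FILTER ground=Lords MAX score"},
]
prompt = build_prompt("most runs by a batsman in 2005", examples)
```

The model then completes the prompt in the DSL, purely from the in-context examples.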
所以事情就是这样。
So that's how it went.
所以根据你举的矩阵例子,我不太明白。
So it's not obvious to me given your matrix example
是的。
Yeah.
关于提示和分布这样的东西,上下文学习这类能力会如何起作用。
Of, like, a prompt and then a distribution, how something like in-context learning would work.
所以,我觉得你的第一篇论文解决了这个问题。
And so, like, I think your first paper tackled this problem.
对。
Right.
所以,也许你能解释一下,你是如何看待大语言模型如何进行上下文学习的?
And so maybe you could walk through your understanding of how LLMs do in-context learning Yeah.
当你思考上下文学习是什么时,就像你看到证据一样,比如在第一篇论文中,我也用了这个板球DSL的例子。
So when you think about what in-context learning is, it's that as you see evidence so, you know, in the first paper, what I also did was I took this cricket DSL example.
是的。
Yeah.
我展示了模型在看到越来越多示例时的下一个词概率。
And I depicted the next token probabilities Mhmm.
当模型被展示越来越多的示例时,其概率的变化情况。
Of the model as it was shown more and more examples.
当你第一次向模型展示这种DSL、自然语言和DSL的组合时,DSL标记的概率非常低。
So the first time you show it this DSL, the natural language and the DSL, the probabilities of the DSL tokens were extremely low.
因为GPT-3从未见过这个东西,当它看到板球问题时,它在心里试图用英文回答来延续它。
Because GPT-3 had never seen this thing, when it saw the cricket question, in its mind, it was trying to continue it with an English answer.
所以概率较高的都是英文单词。
So the probabilities that were high were all English words.
是的。
Yeah.
一旦它解决了我提供的包含问题和DSL的提示,下一次当我把问题放在下一行时,DSL标记的概率就开始上升。
Once it solved my prompt where I had the question and the DSL, the next time I had the question in the next row, the probabilities of the DSL token started going up.
每次给出一个例子,概率都会上升。
With every example, it went up.
最后,当我给出新的查询时,它几乎有100%的概率生成正确的标记。
And finally, when I gave the new query, it was like it had almost 100% probability of getting the right token.
对。
Yeah.
所以这是一个实时更新模型后验概率的实例。
So this is an example of, in real time, the model was updating its posterior probability.
它正在更新自己的知识,意识到:好吧,我已经看到了证据。
It was updating its knowledge that, okay, I've seen evidence.
这就是我应该做的。
This is what I'm supposed to do.
这是一种口语化的说法,用来表达贝叶斯。
Now this is a colloquial way of saying what Bayesian Yeah.
推断。
inference is.
贝叶斯更新基本上是从一个先验概率开始。
Bayesian updating basically is you start with a prior.
当你看到新的证据时,你会更新你的后验概率。
When you see a new evidence, you update your posterior.
这就是数学上的定义。
That's the mathematical definition.
但在英语中,简单来说就是:你看到某事,看到新证据,然后更新你对正在发生事情的信念。
But in in English, it's basically you see something, you see new evidence, you update your belief about what's happening.
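Spelled out as a formula, the verbal description above is just Bayes' rule:

```latex
P(\text{belief} \mid \text{evidence}) =
  \frac{P(\text{evidence} \mid \text{belief})\, P(\text{belief})}
       {P(\text{evidence})}
```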
是的
Yeah.
对吧?
Right?
因此,我清楚地意识到,大语言模型正在做某种类似于贝叶斯更新的事情。
So it was clear to me that LLMs are doing something which resembles Bayesian updating.
所以在那篇第一篇论文中,我提出了一个矩阵公式,并展示了它实际上在做什么。
So in that first paper, I had this matrix formulation, and I showed that, you know, what it's doing.
它看起来就像贝叶斯更新。
It looks like Bayesian updating.
是的
Yeah.
然后我们可以进入下一组论文。
Then we can come to the sort of next series of papers.
没错。
That's right.
所以,好吧。
So okay.
我的意思是,当时我觉得这已经相当有结论性了,但之后你有一段时间没了动静。
So I mean, it it it seemed pretty conclusive to me at that time, and then you went quiet for a while.
然后我仍然记得那条WhatsApp信息。
And then I I still remember the WhatsApp text.
你说:马丁,我现在完全知道这些东西是如何运作的了。
You said, Martin, I know exactly how these things are working now.
是的。
Yeah.
然后,你接连发布了一系列论文,简直引爆了互联网。
Well and then and then listen, you dropped a series of papers that kind of broke the Internet.
你在Twitter上彻底火了。
You went super viral on Twitter.
我的意思是,真的引起了广泛关注。
I mean, people really noticed.
我想稍后就谈这一点,但在那之前,我记得当你的第一篇论文发表时,人们会说,这些东西肯定不是贝叶斯的。
And so I want to get to that in just a second, but before that, I remember when your first paper came out, people would be like, you know, these things are definitely not Bayesian.
就像,你知道,它们或许可以被视为贝叶斯的,但其实不是。
Like, you know, could be considered to be Bayesian, but they're not.
你觉得为什么人们对这种新事物会有这样的反应——它们不是贝叶斯的?
Like, why do you think that there was this reaction of, like, you know, there's something new, they're not Bayesian.
我的感觉是,这种反应几乎是一种反弹,仅仅因为被贴上了贝叶斯的标签。
I mean, I felt like there's almost kind of a backlash just because of being characterized as Bayesian.
是的。
Yeah.
我认为,在概率和机器学习这个领域里,一直存在着贝叶斯学派和频率学派的阵营。
I think this whole world of probability and machine learning, that there have been camps of Bayesian and frequentists.
对。
Yes.
我不想卷入这种政治性的争斗,但‘贝叶斯’这个词已经让人产生了某种情绪反应,它成了这场论战的一部分。
And I don't wanna get in the middle of that sort of political battle, but Bayesian has become like almost like people had a reaction to that it's it's part of that war.
我明白了。
I see.
所以这就像以前那种情况。
So it's it's like the old Yeah.
贝叶斯与频率学派之间的那种论战。
Bayesian versus frequentist type battle.
是的。
Yeah.
所以人们只是觉得,哦,不。
So the people just had, oh, no.
你可以说任何东西都是贝叶斯的。
You can say anything is Bayesian.
对吧?
Right?
是的。
Yeah.
所以我说,好吧。
So I I said, okay.
也许他们有道理。
Maybe they have a point.
也许我们所说的其实并不是贝叶斯的。
Maybe what we are saying is not really Bayesian.
我们如何证明它是贝叶斯的?
How do we prove that it's Bayesian?
对。
Right.
所以首先,我要为此感谢你和Andreessen Horowitz。
So then, first, I have to thank you and Andreessen Horowitz for this.
你知道,我在第一篇论文里展示了这些概率。
You know, when I said that in my first paper, I showed these probabilities.
是的。
Yeah.
那是因为OpenAI当时在其聊天界面中提供了显示这些概率的选项,后来他们停掉了。
It was because OpenAI had in its chat interface this option to display those probabilities, then they stopped.
哦。
Oh.
所以我们无法窥探内部发生了什么。
So we could not peer inside what's going what's happening.
不知什么原因,OpenAI停掉了这个功能。
For some reason, OpenAI stopped it.
我不会深入讨论这个。
I'm not gonna get into Yeah.
关于开环和闭环,但他们确实停止了。
The open and closed loop, but but but they stopped.
是的。
Yeah.
于是我们开发了自己的界面,不仅可以查看概率,还能查看下一个词元的熵。
So then we developed our own interface, which could let you look not only at the probability, but also the entropy of the next token.
这是基于
Was this
在开源模型之上吗?
on top of an open source model?
是的。
Yeah.
是的。
Yeah.
所以你可以加载任何类型的开源模型。
So so you can load any sort of open source model.
但你知道,作为学术界,是的。
But, you know, being an academia Yeah.
我们没有计算资源。
We didn't have access to compute.
感谢你。
Thanks to you.
哦,是的。
Oh, yeah.
你的慷慨捐赠。
Your generous donation.
我们得到了集群,是的。
We got the clusters Yeah.
用来运行一个叫Token Probe的工具。
To run what's called Token Probe.
你可以访问 tokenprobe.cs.columbia.edu。
So you can go to tokenprobe.cs.columbia.edu.
这个还在运行吗?
Is this still running?
还在运行。
It's still running.
还在运行,而且有人会来使用。
It's still running, and people come to it.
我在我课堂上使用它来让学生完成作业。
I use it in my classes to get students to do assignments.
他们编写自己的领域特定语言,并说这确实帮助他们理解了这些大语言模型的工作原理。
They write their own DSLs, and they say that it really helps them understand how these LLMs work.
我的大语言模型知识实际上就是来自Token Probe。
So literally, my understanding of LLMs came from Token Probe.
你只需要坐在那里,看着在填写提示时的分布情况。
You know, sit there and just look at the distribution as you filled out a prompt.
这实际上非常、非常有启发性。
It's actually very, very enlightening.
所以对于正在收听的各位,网址又是什么来着?
So for those of you that are listening, what's the URL again?
Tokenprobe.cs.columbia.edu。
Tokenprobe.cs.columbia.edu.
是的。
Yeah.
去试试吧。
Check it out.
这实际上是一种非常有用的方式,可以直观地看到在填写提示时概率分布是如何变化的。
Actually a very, very useful way to actually see how the probability distribution gets updated as you fill out a prompt.
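What a viewer like Token Probe displays can be approximated in a few lines: convert a model's raw logits into a next-token distribution and report that distribution's entropy in bits. The logits below are invented; a real tool reads them out of an open-weights model.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities; subtract the max for stability.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def entropy_bits(dist):
    # High entropy: the model is unsure; near zero: one token dominates.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Two strong candidates after "protein", one weak one: entropy near 1 bit.
dist = softmax({"synthesis": 2.0, "shake": 2.0, "the": -3.0})
uncertainty = entropy_bits(dist)
```

Watching this entropy fall as few-shot examples are added is exactly the effect described earlier with the DSL tokens.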
没错。
Right.
但我作弊了。
But then I cheated.
哦?
Oh?
当时它在运行,但我也能访问到支撑它的GPU。
I you know, it was running, but I also had access to the GPUs that were powering it.
然后,我和哥伦比亚大学的同事们,其中一位现在在DeepMind,开始思考如何真正证明它符合贝叶斯原理。
And then along with colleagues at Columbia, and one of them now is is at DeepMind, we started to sort of think about how do you really prove that it's Bayesian.
要证明…… 你能
To prove Can you
直接解释一下吗?
just explain it?
其实我真的不知道这个问题的答案。
I actually I I actually don't know the answer to this.
是的。
Yeah.
在我看来,你们在第一篇论文里已经证明了。
It seemed to me you proved it in the first paper.
那缺少了什么?
Like, what was missing?
嗯,在第一篇论文中,我们展示了这一点。
Well, in the first paper, we showed it.
那是经验性的。
It was empirical.
你能看到,我明白了。
And you could see I see.
我明白了。
I see.
你可以看到
You could see
需要数学证明?因为对我来说它并不…… 不。
A mathematical proof? Because to me it wasn't No.
它确实是。
It's yeah.
甚至对我来说都很明显。
It was even obvious to me.
但要说服,我明白了。
But to convince I see.
你可以说,你知道的,那些忽视它的人,等等。
You you could say, you know, people who dismiss it, oh, wait.
任何事情都可以是贝叶斯的。
Anything can be Bayesian.
我明白了。
I see.
我明白了。
I see.
我们必须精确地用数学方法证明它。
We had to show it precisely mathematically.
明白了。
Got it.
明白了。
Got it.
于是我们提出了这个想法。
So then we came up with this idea.
我的同事,纳马纳·杰拉拉和西达哈特·达拉尔,我们与他们合著了一系列论文。
You know, my colleagues, Namana Gerala and Siddharth Dalal; the series of papers were written with them.
我们提出了一个贝叶斯风洞的想法。
We came up with this idea of a Bayesian wind tunnel.
好的。
K.
那么什么是风洞?
So what's a wind tunnel?
在航空航天领域,风洞是一种在隔离环境中测试飞机的装置。
Well, wind tunnel in the aerospace industry is where you test an aircraft in an isolated environment.
你不会真的让它飞行,而是测试它在各种气动压力下的表现。
You don't fly it, and you test test it against all sorts of, you know, aerodynamic pressure.
然后观察它在不同高度、压力等条件下会有什么反应。
Then you see what it can withstand, what kind of altitude, pressure, blah blah blah.
你不想在空中进行这些测试。
And you don't want to do it up in the air testing.
是的。
Yeah.
我们说,好吧。
We said, okay.
为什么我们不创建一个环境,在里面测试这些架构?我们测试了Transformer、Mamba、LSTM、MLP,所有的架构。
Why don't we create an environment where we take these architectures, and we tested transformers, Mamba, LSTMs, MLPs, all architectures.
我们说,为什么不拿一个空白架构,给它一个任务,让这个架构根本不可能记住该任务的正确答案?
We said, why don't we take a blank architecture and give it a task where it's impossible for the architecture to memorize what the solution to that task should be.
在给定参数数量的情况下,这个空间在组合上大到不可能记住,而且我们用的是非常小的模型。
The space is combinatorially too large to memorize given the number of parameters, and we took very small models.
因此,难度足够高,使得它们无法通过记忆来解决。
So it's difficult enough that they cannot memorize it.
是的。
Yeah.
但难度又足够低,让我们能精确知道贝叶斯后验分布应该是什么。
But it's tractable enough that we know precisely what the the Bayesian posterior should be.
你可以用解析方法计算出来。
You can calculate it analytically.
于是,我们给这些模型提供了一系列任务,在这些任务中,我们再次证明了记忆是不可能的。
So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize.
我们训练这些模型,发现Transformer与精确贝叶斯后验的差距小到只有10的负三次方比特。
We train these models, and we found that the transformer got the precise Bayesian posterior down to 10 to the power minus three bits accuracy.
它完美地匹配了分布。
It was matching the distribution perfectly.
因此,在数学意义上,它确实在对给定任务做贝叶斯推断。哇。
So it is actually doing Bayesian inference in the mathematical sense given a task Wow.
它必须更新自己的信念。
Where it has to update its belief.
Mamba的表现也相当不错。
Mamba also does it reasonably well.
LSTM可以完成其中一部分任务。
LSTMs can do one of the things.
因此,在论文中,我们对贝叶斯任务进行了分类。
So the the in the papers, we have a taxonomy of Bayesian task.
变换器能够完成所有任务。
Transformer does everything.
Mamba 做了其中大部分。
Mamba does most of it.
LSTM 只能部分完成,而 MLP 则完全失败。
LSTMs do only partially, and MLPs fail completely.
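A minimal sketch of the wind-tunnel idea, with a task simple enough to solve analytically: infer which of two candidate coin biases produced a sequence of flips, under a uniform prior, then measure how far a model's output is from the exact posterior in bits via KL divergence. The candidate biases and the stand-in "model output" below are invented; the actual papers use their own task families.

```python
import math

def exact_posterior(flips, biases=(0.3, 0.7)):
    # P(bias | flips) with a uniform prior over the candidate biases.
    likes = []
    for b in biases:
        like = 1.0
        for f in flips:
            like *= b if f == 1 else (1 - b)
        likes.append(like)
    z = sum(likes)
    return [l / z for l in likes]

def kl_bits(p, q):
    # KL divergence in bits: extra bits paid for predicting q instead of p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

posterior = exact_posterior([1, 1, 0, 1])  # three heads, one tail
model_output = [0.156, 0.844]  # stand-in for what a trained network emits
gap = kl_bits(posterior, model_output)
```

A trained architecture "passes" the wind tunnel when this gap stays tiny across tasks, which is the sense in which the transformers matched the posterior to within about 10^-3 bits.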
这是反映了它所训练的数据,还是更多反映了其机制本身?
So is this a reflection of the data that it's trained on, or is it more a reflection of the mechanism?
这是机制的问题。
It's the mechanism.
这是架构的问题。
It's the architecture.
数据决定它学习什么任务。
The data decides what task it learns.
对。
Right.
在第一篇论文中,我们有一个贝叶斯风洞实验,并展示了它确实完成了这项任务。
So in the first paper, we had this Bayesian wind tunnel, and we showed that, you know, it's doing the job.
我们有不同的任务。
We had different tasks.
在第二篇论文中,我们解释了为什么它能做到这一点。
In the second paper, we show why it does it.
因此,我们研究了Transformer模型和梯度,展示了梯度如何塑造这种几何结构,从而实现贝叶斯更新。
So we look at the transformers, we look at the gradients, and we show how the gradients actually shape this geometry, which enables this Bayesian updating to happen.
在第三篇论文中,我们采用了前沿的开源权重大语言模型,以便能够深入观察它们。
Then in the third paper, what we did, we take we took these frontier production LLMs, which have open weights so that we could look inside them.
我们进行了测试,发现我们在小模型中观察到的几何结构,在参数达到数亿的模型中依然存在。
And we did our testing, and we saw that the geometries that we saw in the small models persisted in models which are, you know, hundreds of millions of parameters.
相同的特征依然存在。
The same signature existed.
唯一的问题是,由于它们在各种各样的数据上进行了训练,数据显得有些杂乱或混乱。
The only thing is that because they are trained on all sorts of data, it's a little bit dirty or messy.
是的。
Yeah.
但你可以看到相同的结构。
But you can see the same structure.
因此,贝叶斯风洞的整个理念与这些生产型大语言模型不同,后者你并不知道它们是在什么数据上训练的,对吧。
So the whole idea behind the Bayesian wind tunnel was, unlike these production LLMs where you don't know what they have been trained on Right.
因此,你无法数学上计算后验分布。
So you cannot mathematically compute the posterior.
那么,你如何证明这一点呢?
So, again, how do you prove it?
我的意思是,它看起来像是贝叶斯的,你知道,从
I mean, it looks Bayesian, you know, from
从第一篇论文开始。
From the first paper.
从第一篇来看,它看起来像是贝叶斯的,但风洞帮我们解决了这个问题。
From the first it looks Bayesian, but, you know, so the wind tunnel sort of solved that problem for us.
我们说,好吧。
We said, okay.
让我们从一个空白的架构开始。
Let's start with a blank architecture.
给它一个我们已知答案的任务。
Give it a task where we know what the answer is.
它不能记住这个答案。
It cannot memorize it.
看看它会怎么做。
Let's see what it does.
是的。
And yeah.
所以,你认为这能提供任何关于人类思维方式的线索吗?还是你觉得这些完全是独立的?
So do you think this provides any sort of, like, indication of how humans think, or do you think that these things are totally independent?
不。
No.
不。
No.
它确实提供了。
It it does provide.
对吧?
Right?
所以,你知道,人类在看到新证据时也会更新我们的信念。
So, you know, human beings also update our beliefs as we see new evidence.
对吧?
Right?
因此,我们在某种意义上确实进行了贝叶斯更新,但我们也做了更多。
So we do, in some sort of in some sense, Bayesian updating, but we do something more than that.
我会谈到这一点。
I'll come to that.
但这些Transformer,甚至Mamba,也会进行这种贝叶斯更新。
But these transformers, or even Mamba, do this Bayesian updating.
是的。
Yeah.
但与人类的不同之处在于,你知道,我们在看到新证据时会更新自己的后验概率。
But the difference with humans is, you know, we'll update our posterior when we see some new evidence.
但经过数亿年的进化,我们的大脑所追求的优化目标就是不要死掉并繁衍后代。
But the way our brains have evolved over hundreds of millions of years is our optimization objective has been don't die and reproduce.
对吧?
Right?
这一直是主要的驱动力,我们的大脑也因此学会了调整。
That's been sort of the driving force, and our brains have learned to adjust.
所以当我们听到灌木丛中有些沙沙声时,那可能意味着危险。
And so when we see some danger, something rustling in that bush.
别靠近。
Don't go near.
我们知道该如何应对这种危险。
We know how to react to that danger.
我们知道该如何保护自己。
We know how to save ourselves.
我们内化了这种学习,大脑细胞或突触在我们一生中都保持可塑性。
We internalize that learning, and our brain cells or our synapses remain plastic throughout our lifetime.
大型语言模型的情况是,一旦训练完成,这些权重就被冻结了。
What happens with LLMs is once the training is done, those weights are frozen.
当你进行推理时,比如在上下文学习或任何对话中,你确实在进行贝叶斯推理,但之后你会忘记。
When you're doing an inference over, for instance, in context learning or anything, during that conversation, okay, you're doing Bayesian inference, but then you forget.
当下一次新的对话以零上下文开始时,你不会保留之前实例中发生的任何学习。
The next time a new conversation starts with zero context, you don't retain any learning that happened in the previous instance.
所以,比如我之前做的板球项目,每次调用都是全新的。
So so for instance, with the the cricket days that I was doing, every invocation of it was fresh.
它不会记得我上一次发送查询时DSL是什么样子。
It did not remember the last time I sent a query what the DSL looked like.
因此,这是人类使用贝叶斯更新方式的一个区别:我们一生都保持可塑性。
So that's one difference between how humans use sort of Bayesian updating, which is we remain plastic all our lives.
是的。
Yeah.
是的。
Yeah.
而大语言模型是冻结的。
Whereas LLMs are frozen.
还有另一个差异,如果你想让我展开讲的话
And there's another sort of difference, which, if you want me to get into
告诉我。
Tell me.
是的。
Yeah.
是的。
Yeah.
是的。
Yeah.
是的。
Yeah.
所以另一个区别是,首先,我们的目标是别死掉。
So so the other difference is well, first, you know, our objective is don't die.
繁衍。
Reproduce.
LLM的目标是尽可能准确地预测下一个词元。
LLMs objective is predict the next token as accurately as possible.
对吧?
Right?
所以你读到的那些可怕故事,比如LLM试图欺骗人类,或者试图阻止自己被关闭。
So all these scary stories that you you read about that, oh, the LLM tried to deceive, and it tried to prevent itself from being shut down.
这并不是架构的特性。
That's not a function of the architecture.
而是训练数据的问题。
That's a function of the training data.
这就是数据的问题。
That's the data.
它被喂了大量的Reddit帖子、阿西莫夫的小说或其他类似内容。
It has been fed, you know, posts on Reddit or Asimov stories or whatever.
我的意思是,
I mean,
顺便说一句,今天确实如此,是的。
just today, by the way Yeah.
达里奥据称说过,不能排除它们具有意识的可能性。
Dario allegedly said that you can't rule out that they're conscious.
你可以排除它们有意识的可能性。
You can rule out that they're conscious.
我的意思是,拜托。
I think I mean, come on.
正如我所说,Anthropic做出了很棒的产品。
As I said, you know, Anthropic makes great products.
Claude Code非常出色。
Claude Code is fantastic.
Cowork非常棒。
Cowork is fantastic.
但它们只是在进行矩阵乘法的硅颗粒。
But they are grains of silicon doing matrix multiplication.
是的。
Yeah.
它们没有意识。
They don't have consciousness.
它们没有内在的独白。
They don't have an inner monologue.
它们没有被相同的目标函数驱动。
They're not driven by the same objective function.
别死。
Don't die.
繁衍。
Reproduce.
对吧?
Right?
它们的目标是不要在下一个词元上出错,而这完全由训练数据驱动。
They're driven by don't make a mistake on the next token, and that's driven entirely by the training data.
对吧?
Right?
你用阿西莫夫的小说或Reddit上的故事来训练大语言模型,你知道,为了生存,它会这么做或那么做。
You train the LLM with stories from Asimov or Reddit where, you know, to survive, it's going to do this or that.
它会复现这些行为。
It'll reproduce that.
所以这仅仅是一种反映。
So it's a reflection.
它不是一种心智。
It's not a mind.
而且结果,再强调第十遍
And the results, just to say it for the tenth time
是的
Yeah.
完全符合贝叶斯规律
Are perfectly Bayesian
完全正确。
Perfectly.
是的
Yeah.
精确到每一位数字。
To the digit.
精确到每一位数字。
To the digit.
是的
Yeah.
是的
Yeah.
我的意思是,我训练了15万步,准确率达到了10的负三次方位。
I mean, I trained it for 150,000 steps, and the accuracy was 10 to the power minus three bits.
这听起来
That sounds
你知道,这在你们为 token prop 提供的基础设施上半小时就跑完了。
I could have trained it you know, this ran in half an hour on the infrastructure that you provided for token prop.
在后台,我本可以使用那些GPU进行训练。
In the background, I I could use those GPUs to train.
但没错。
But That's right.
所以再次感谢你提供这些支持。
So thank you again for that.
但话说回来,回到人类这个话题。
But, coming back to human beings.
我们是贝叶斯的。
We we are Bayesian.
是的
Yeah.
但我们还会做别的事。
But we do something else.
你知道吗,当我把这支笔扔向你时,你会怎么做?
You know, when I throw this pen at you, what will you do?
躲开它或者
Dodge it or
对。
Yeah.
你为什么要躲开它?
Why will you dodge it?
为了避免被击中。
To avoid being hit.
避免被击中。
Avoid being hit.
是的。
Yeah.
但你的大脑并没有在进行贝叶斯计算,比如:这支笔正在飞过来。
But your head is not doing a Bayesian calculation of, okay, this pen is coming.
它击中我的概率有多大,会造成多大的疼痛等等。
The probability that it hits me, it'll cause this much pain or all that.
没错。
Correct.
你大脑中实际上正在进行一种模拟。
What you're essentially doing in your head is you're doing a simulation.
你看到这支笔飞过来,就知道它会飞过来并击中你。
You see the pen coming, and you know that it'll come and hit you.
你的大脑进行模拟,然后你躲开了。
Your mind simulates, and you dodge it.
对吧?
Right?
所以所有的深度学习都在进行相关性分析。
So all of deep learning is doing correlations.
它并不进行因果推断。
It's not doing causation.
是的。
Yeah.
因果模型才是能够进行模拟和干预的模型。
Causal models are the ones that are able to do simulations and interventions.
所以,你知道,朱迪亚·珀尔提出了完整的因果层次理论,没错。
So, you know, Judea Pearl has this whole causal hierarchy. Yep.
第一层是关联,也就是你构建这些相关性模型。
Where the first level is association, which is you build these correlation models.
深度学习很美妙。
Deep learning is beautiful.
它极其强大。
It it's extremely powerful.
我的意思是,你每天都能看到这些模型表现得异常出色。
I mean, you see every day, all these models are, like, amazingly good.
是的。
Yeah.
它们进行关联。
They do association.
第二层是干预层次。
The second is intervention in the hierarchy.
是的。
Yeah.
深度学习模型做不到这一点。
Deep learning models do not do that.
第三是反事实。
Third is counterfactual.
所以干预和反事实,你可以想象这是一种模拟。
So both intervention and counterfactual, you can imagine it it it's some sort of simulation.
你建立一个关于正在发生事情的因果模型,然后就能进行模拟。
You you build a model of causal model of what's happening, and then you are able to simulate.
我们的大脑就是这样做的。
So our brains do that.
当前的架构做不到这一点。
The current architectures don't do that.
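The gap between the first two rungs can be made concrete with a toy structural causal model (variables and numbers invented for illustration): a hidden cause Z drives both X and Y, so association says X predicts Y, while an intervention on X, Pearl's do-operator, changes nothing about Y.

```python
import random

def sample(n, do_x=None, seed=0):
    """Sample from a toy structural causal model Z -> X, Z -> Y.

    If do_x is given, it overrides X's mechanism: Pearl's do-operator
    (rung 2), which cuts the Z -> X edge."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5            # hidden common cause
        x = z if do_x is None else do_x   # intervention replaces the mechanism
        y = z                             # Y depends only on Z, never on X
        rows.append((x, y))
    return rows

def p_y_given_x(rows, x):
    """Empirical P(Y = 1 | X = x)."""
    match = [y for xi, y in rows if xi == x]
    return sum(match) / len(match)

obs = sample(10_000)
# Rung 1 (association): observing X = True makes Y look certain ...
print(p_y_given_x(obs, True))     # 1.0: X perfectly predicts Y

# Rung 2 (intervention): forcing X = True does nothing to Y.
forced = sample(10_000, do_x=True)
print(p_y_given_x(forced, True))  # close to 0.5
```

A purely correlational learner fit on the observational rows would happily predict Y from X; only a model that represents the Z → X edge can answer the do-question.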
另一个例子,我认为能说清楚的是香农熵和科尔莫戈罗夫复杂度之间的区别。
Another example, I think, which will make it clear, is the difference between, I'll use these technical terms, Shannon entropy. Mhmm.
科尔莫戈罗夫复杂度。
And Kolmogorov complexity.
当然。
Sure.
所以如果你看圆周率数字的香农熵,是的。
So if you look at the Shannon entropy of the digits of pi Yeah.
它是无限的。
It's infinite.
当然。
Sure.
不可能预测或学习下一个数字会是什么。
It's impossible to predict and learn what digit will come after.
没错。
Yep.
这就是香农熵的定义。
So that's the definition of Shannon entropy.
香农熵试图建立一种相关性。
And Shannon entropy sort of tries to build a correlation.
它试图学习这种相关性。
It tries to learn the correlation.
深度学习实现了香农熵。
Deep learning does the Shannon entropy.
是的。
Yeah.
柯尔莫哥洛夫复杂度则是指能够生成所讨论字符串的最短程序的长度。
Kolmogorov complexity, on the other hand, is the length of the shortest program. Yep.
这个程序会复现你所关注的字符串。
Which will reproduce the string that is in question.
是的。
Yep.
现在,计算π的数字的程序非常短。
Now the programs to get the digits of pi are very small.
是的。
Yeah.
多亏了。
Thanks to.
你知道吗?
You know?
有很多非常小的程序可以精确地重现它。
There are all sorts of really small programs that can reproduce it exactly.
所以pi的柯尔莫哥洛夫复杂度非常小。
So the Kolmogorov complexity of pi is very small.
香农熵是无限的。
Shannon entropy is infinite.
是的。
Yeah.
我认为深度学习仍然处于香农熵的范畴。
I think deep learning is still in the Shannon entropy world.
它还没有跨越到柯尔莫哥洛夫复杂度和因果世界。
It has not crossed over to the Kolmogorov complexity and the causal world.
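The pi example can be made literal. The sketch below, an integer evaluation of Machin's formula (one of many tiny pi programs), is a program a few lines long, a low Kolmogorov complexity witness, even though the digit stream it emits looks patternless to any correlation-based predictor.

```python
def pi_digits(n):
    """First n decimal digits of pi from a few lines of integer arithmetic.

    The program is tiny (low Kolmogorov complexity), yet its output is,
    digit by digit, unpredictable to a correlation learner (high Shannon
    entropy per digit)."""
    scale = 10 ** (n + 10)                 # 10 guard digits absorb rounding

    def arctan_inv(x):
        # arctan(1/x) * scale via the Taylor series, all in integers
        total, term, k, sign = 0, scale // x, 1, 1
        while term:
            total += sign * (term // (2 * k - 1))
            term //= x * x
            k += 1
            sign = -sign
        return total

    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi = 16 * arctan_inv(5) - 4 * arctan_inv(239)
    return str(pi)[:n]

print(pi_digits(20))   # 31415926535897932384
```

Formally, Kolmogorov complexity is the length of the shortest such program; a program like this merely witnesses an upper bound, and in general no algorithm can find the shortest one.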
真有趣。
How interesting.
那么,你认为这在多大程度上为我们提供了改进前沿技术的研究方向呢?
So to what extent do you think this provides us research directions to kind of improve the state of the art?
让我给你一个具体的例子。
So let me just give you a specific example.
你提到人类实际上并不会更新那个矩阵。
You talked about human beings don't actually update, you know, the matrix.
他们也不会去更新自己的权重。
They don't kind of update their weights.
但现在,关于持续学习的研究非常多。
But right now, there's a lot of research on continual learning.
是的。
Yeah.
那么,你的研究是否为如何解决这些问题提供了某些指导?
So does your work provide some guidance of how you might approach those problems?
特别是,我一直以来都有个疑问:我们使用了如此多的数据和计算资源,是的。
And in particular, I've always had this question, which is we use so much data and so much compute Yeah.
为了创建这些模型。
To create these models.
比如,你真的认为可以更新权重并实时产生有意义的影响吗?
Like, is it even reasonable to think that you could update the weights and actually have a meaningful impact, you know, in in real time.
我的意思是,要做到这一点,你似乎需要多得多的数据。
I mean, it just seems like you would just need so much more data in order to do that.
所以能不能
So can
你开始回答这些问题吗?
you start answering these questions?
你可以开始回答其中一些问题了。
You you can start answering some of these questions.
而当今存在的一种误解是,规模能解决一切问题。
And and one of the misconceptions that exist today is that scale will solve everything.
规模并不能解决一切问题。
Scale will not solve everything.
你需要一种不同的勇气。
You need a different kind of courage.
这种持续学习是一个难题。
And this continual learning is a difficult problem.
你必须在学习新东西和灾难性遗忘的风险之间取得平衡。
You have to balance the fact that you will learn something new against the risk of catastrophic forgetting.
对。
Right.
对吧?
Right?
对。
Right.
如果你更新了权重,却忘记了哪些是重要的、已经学过的内容,那你其实并没有取得进展。
If you update the weights and you forget what was important and what you have already learned, then, you know, you're not making progress.
那样的话,它就只会成为一个随机混乱的模型。
Then it'll just be some sort of random chaotic model.
所以要解决这个问题很难。
So to solve that problem is difficult.
这是其中的一个方面。
That's one aspect of it.
所以,你知道,要实现所谓的通用人工智能,我认为需要发生两件事。
So, you know, to get to what is called AGI, I think there are two things that need to happen.
一是这种可塑性,必须通过持续学习来实现。
One is this plasticity, which has to be implemented through continual learning.
是的。
Yeah.
其次,我们必须从相关性转向因果性。
Secondly, we have to move from correlation to causation.
是的。
Yeah.
这意味着,我
That means, I
这和杨立昆所说的因果性、规划有多相似呢?是的。
How much is this similar to what Yann LeCun talks about with, you know, causality, planning? Yeah.
你知道,预测你的行动会如何
You know, predicting how, like, how your action would
这确实有关联。
It is it is related.
你知道,他切入这个问题的角度和朱迪亚·珀尔的模型不同。
You know, he is coming at it from a different angle than the Judea Pearl model.
对。
Right.
但确实有关联。
But it is related.
另一件事是,我第一次上这个播客时,就提到过这个AGI测试。
The other thing is, you know, the first time I came on this podcast, I mentioned this test of AGI.
是的。
Yeah.
爱因斯坦测试。
The Einstein test.
我不记得了。
I don't remember.
所以我说,你把一个大语言模型用1916年或1911年之前的物理学数据来训练,对吧。
So I said, you know, you take an LLM and train it on pre-1916 or pre-1911 physics. Right.
然后看看它是否能推导出相对论。
And see if it can come up with the theory of relativity.
是的。
Yeah.
如果它能做到,那我们就有了通用人工智能。
If it does, then we have AGI.
我的意思是,这要求很高,但我们也应该设定高标准。
I mean, it's a high bar, but, you know, we should have high bars.
它做不到。
It won't.
这正是我认为德米斯几周前在印度人工智能峰会上提到的同一个测试。
And this is the same test that I think Demis mentioned at the India AI summit a couple of weeks ago.
它引发了大量的新闻报道。
It's created a lot of news.
但为什么会这样?这和香农与柯尔莫哥洛夫的对比有什么关系呢?
But why is that, and how is that related to this idea of Shannon versus Kolmogorov?
在爱因斯坦的时代,人们已经发现牛顿力学存在一些缺失的东西。
So at the time of Einstein, there were a lot of clues that something was missing in Newtonian mechanics.
是的。
Yeah.
对吧?
Right?
人们知道水星的轨道不太合理。
People knew that Mercury's orbit didn't make sense.
它的某些地方有问题。
There was something off about it.
然后进行了迈克尔逊-莫雷实验,试图找出光传播所依赖的介质——以太。
Then there were these experiments done, the Michelson Morley experiments, where they were trying to figure out this medium called the ether through which light travels.
他们认为,如果你从不同方向反射光,光速可能会发生变化,他们可以检测到光速的变化。
And they felt that if, you know, you bounce light in different directions, the speed might change, and they they could detect a change in the speed of light.
他们进行了多次实验。
They tried several experiments.
他们使用了非常精密的仪器来测量光速,但什么都没发现。
They had really precise instruments which could measure the speed, and they found nothing.
他们发现光速完全没有任何变化。
They found that the speed of light did not change at all.
然后还有黑洞这一整个问题,是的。
Then there was this whole issue of black holes. Yeah.
然后是引力透镜效应。
Then gravitational lensing.
因此,有很多迹象表明牛顿力学并不能完全解释所有现象。
So there are a lot of these signs that Newtonian mechanics is not really explaining everything.
是的
Yeah.
但在爱因斯坦提出时空连续体的新理论之前
But until Einstein came up with a new representation of the space time continuum
对
Right.
我们陷入了僵局
We were stuck.
是的
Yeah.
如果你有一个只关注相关性的模型,看到所有这些零散的证据并将其拼凑起来,它也不可能得出爱因斯坦提出的那个美妙的方程,你知道的,我记不清具体是什么了。
So if you had a model that just looked at correlations and saw all of these pieces of individual evidence put together, it would not have come up with the beautiful equation that Einstein came up with, you know, I'm forgetting exactly what it is.
Gμν等于8πTμν,大概是这样的。
G mu nu equals eight pi T mu nu, something like that.
是的
Yeah.
是的。
Yeah.
对。
Yeah.
就是那个,你知道的,描述时空连续体的相对论方程,那个张量。
Where, you know, the equation of relativity for the space time continuum, the tensor.
所以他提出了一个全新的理论,对吧?
So he came up with a new representation. Right?
无论你谈论的是引力波、黑洞、水星,还是GPS的工作原理。
Whether you're talking about gravitational waves or black holes or mercury or how GPS works.
你知道,我们每天在手机上用的GPS,就是用相对论的方程。
You know, GPS the GPS that we use every day in our phones, it uses the equation of relativity.
所以,这会不会导致你几乎必须忽略大部分先前的数据才能做到这一点,而大语言模型却做不到,因为它们是在大部分先前数据上训练的。
So does this end up becoming, like, you almost have to ignore the majority of previous data in order to do it, which LLMs can't, because they trained on the majority of previous data?
这就像是有一种数据引力在把你往回拉。
It's like you almost have like this kind of data gravity that's pulling you back.
就好像每个人都说它是x。
It's like it's like everybody said it's x.
嗯。
Mhmm.
有一些证据表明它是y,但因为大家都说它是x,所以语言模型总会说它是x。
There's a little bit of evidence that it's y, but because everybody said it's x, like, the LLM will always say it's x.
它们总会说x。
It's they'll always say x.
它会把y当作异常打印出来。
It'll print that y as an anomaly.
实际上,这是一种非常不错的表达方式。
But actually, this is a very nice way to say it.
是的。
Yeah.
就像我现在才明白一样。
Just like, it's like, I just now, okay.
现在我懂了,你讲的是香农熵和柯尔莫哥洛夫之间的区别。
Now I get your Shannon entropy versus Kolmogorov. Yeah.
其中一个是指,总的信息量总是受限于现有的总信息量,这正是目前发生的情况。
Like, one of them is, you're always bound by the total amount of information there, which is what happens right now
是的。
Yeah.
你可以找到另一种表示,用新的表示可以更简洁地描述一切,这会是一种完全不同的概念,就像这样,是的。
Where you can actually find another notion where you can describe everything with a shorter description with the new representation, which would be a totally different notion, which would be like Yeah.
你需要一个新的表示方式。
You need a new representation.
对吧?
Right?
是的。
Yeah.
另一种我一直思考这些的方式是,我觉得你上次我们讨论时表达得很好,那就是宇宙是一个非常非常复杂的空间。
Know, another way that I've always thought about these, I thought you articulated it well in the last time we talked about it, which is the universe is this very, very complex space.
然后,你知道,人类不知怎的把它映射到一个流形上,嗯。
And then, you know, somehow humans map it into a manifold Mhmm.
这个空间更简单。
That's less complex.
是的。
Yeah.
然后这被写了下来。
And then that gets kind of written down.
然后大语言模型,所以这某种程度上是一种分布。
And then the LLM so that's kind of some some distribution.
一些你知道的,它仍然是一个很大的空间,但它是有界的。
Some you know, it's still a very large space, but it's it's a bounded space.
而大语言模型学会了这个流形。
And the LLM learned that manifold.
然后它们基本上使用贝叶斯推断在这个流形上上下移动,但它们被限制在这个流形之内。
And then they kind of use, you know, Bayesian inference to move up and down that manifold, but they're kind of bound to that manifold.
是的。
Yeah.
然后,我不想把话硬塞到你嘴里,但它们做不到的是生成一个新的流形,对吧?
And then, again, I don't want to put words in your mouth, but what they can't do is generate a new manifold, right?
这需要理解宇宙的运作方式,然后创造出一种新的表达方式。
Which requires understanding the way that the universe works and then coming up with a new representation
宇宙的表达方式。
of the universe.
这就是相对论。
And this is what relativity is.
对吧?
Right?
是的
Yeah.
没错
Exactly.
爱因斯坦必须创造一个新的流形。
Einstein had to create a new manifold.
是的
Yeah.
是的
Yeah.
是的
Yeah.
如果你只坚持牛顿物理的旧流形,对吧。
If you just stuck with the old manifold of the Newtonian physics Right.
那么你就会看到这些相关性,但无法提出一个能解释它们的流形。
Then you would see these correlations, but you could not come up with a manifold that explained them.
所以你需要提出一种新的表示方式。
So you need to come up with a new representation.
是的。
Yeah.
所以在我看来,有很多关于通用人工智能的定义。
So to me, you know, there are lots of definitions of AGI.
你知道,图灵测试,我们已经通过了。
You know, Turing test, we have already passed that.
你知道,每天都能完成有经济价值的工作,你看,大语言模型已经在做这件事了。
You know, performing economically useful work every day, see, you know, LLMs are doing that.
我们真的做到了吗?
Do we?
我不确定。
I don't know.
没有。
No.
意思是,它们确实可以。
Mean, they are.
我的意思是,没有人类干预?
I mean, without human intervention?
不。
No.
不。
No.
所以这是不同的。
So that's different.
好吧。
Okay.
但即便如此,你知道,汽车跑得比人快。
But still, you know, it's like a car can run faster than humans.
对吧?
Right?
我的意思是,那确实是一个是的。
I mean, that's a that's a Yeah.
那确实是一个是的。
That's a Yeah.
这是一个非常肤浅的定义。
Very shallow definition.
对。
Yeah.
所以这些定义都有其用处。
So all these definitions are useful.
你知道,也许六个月后,你就能在没有人为干预的情况下,让Claude或Gemini完成那些定义清晰、范围明确的编程任务。
You know, maybe, in six months, you'll have Claude or Gemini do, without intervention, coding tasks that are well defined, well scoped.
这是有可能的。
That's possible.
但对我来说,通用人工智能的实现,当这两个问题被解决时就会发生。
But to me, AGI will happen when these two problems get solved.
可塑性,也就是正确地实现持续学习,以及以更高数据效率的方式构建因果模型。
Plasticity, continual learning done properly, and building a causal model in a more data efficient manner.
最近几天,我们听到人们在谈论已经看到了通用性,比如唐纳德·克努特的例子。
We are hearing people now talking about seeing generality, like with Donald Knuth, for example, in the last few days.
对吧?
Right?
据说他有了一个那样的时刻,在X平台上迅速走红。
He had this, you know, moment apparently that kind of went viral on X.
所以你认为这表明我们正在见证通用性吗?
So do you think that that suggests that we're seeing generality?
不。
No.
不。
No.
所以,实际上,这印证了我长期以来一直在说的观点。
So so that actually I mean, to me, it validates what I've been talking about for a while now.
怎么说?
How so?
所以,如果你读一下他和一位同事合作所做的工作,他会让大语言模型解决寻找哈密顿回路这个特定问题。
So if you read what he did with the help of, you know, a colleague, he got the LLMs to solve this particular problem of finding Hamiltonian cycles.
奇数,我们就不深入了。
Odd numbers, we wouldn't get into that.
他让大语言模型一个接一个地解决奇数问题。
And he got the LLMs to keep solving for one odd number after the other.
对吧?
Right?
他还让大语言模型在找到某个特定m值的解后,将其在解决该问题过程中学到的内容更新到自己的记忆中。
What he also did is, after it found a solution for a particular value of m, he made the LLM update its memory with exactly what it learned in solving that problem.
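For concreteness, the problem Knuth had the LLMs attack, finding Hamiltonian cycles, looks like this in its plain brute-force form. This is a generic backtracking sketch for illustration, not Knuth's or the LLMs' actual method.

```python
def hamiltonian_cycle(adj):
    """Backtracking search for a Hamiltonian cycle in an undirected graph.

    adj maps each vertex to the set of its neighbours. Returns one cycle
    as a vertex list (start repeated at the end), or None if none exists."""
    vertices = list(adj)
    start = vertices[0]

    def extend(path, seen):
        if len(path) == len(vertices):
            # Every vertex visited once: close the cycle if possible.
            return path + [start] if start in adj[path[-1]] else None
        for nxt in sorted(adj[path[-1]]):
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    return extend([start], {start})

# A 4-cycle with one chord: edges 0-1, 1-2, 2-3, 3-0, plus 0-2.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(hamiltonian_cycle(graph))   # [0, 1, 2, 3, 0]
```

The search space is exponential in the number of vertices, which is why the memory-update loop he describes, keeping whatever heuristics worked for smaller cases, matters.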
所以大语言模型尝试了各种不同的方法。
So the LLMs tried many different things.
是的。
Yeah.
你知道,如果有什么奏效了,就更新内存。
You know, if something worked, update the the memory.
所以这有点像是拼凑出一种可塑性。
So that's kind of like hacking together plasticity.
是的。
Yeah.
对吧?
Right?
它在过程中学习自己做过的事情。
It's learning what it has done as it went along.
再说一遍,这只是一个临时的解决方案。
Again, it's it's a hack version of it.
你并没有改变权重。
You're not changing the weights.
你只是在某种程度上优化了上下文。
You're just sort of improving the context.
对。
Right.
对吧?
Right?
但随着你的学习,即使在那之后,整个哈密顿回路及其相关数学领域在这些大语言模型所训练的流形中都有很好的体现。
But even after that, this whole space of Hamiltonian cycles and the associated math is well represented in the manifolds that these LLMs have been trained on.
对。
Right.
你只需要找到正确的关联。
You just had to find the right connection.
而大语言模型,我知道,只要你投入足够的计算资源,它们就能找到正确的关联。
And LLMs, you know, if you throw enough compute, they will find the right connection.
所以他能够查看大语言模型的各种尝试,最终需要他把看到的内容整合成一个解决方案。
So he was able to follow the LLMs' attempts, and eventually, it needed him to put together what he saw into a solution.
这确实帮助他得出了答案,但他必须创建一个新的流形才能找到解决方案。
It definitely helped him get to the solution, but he had to create the new sort of manifold to come to the solution.
过了一段时间,大语言模型陷入了僵局。
The LLMs were, after a while, stuck.
对吧?
Right?
你可以去读他写的东西。
You read what he has written.
我的意思是,这大概两天前才见诸报道。
I mean, it just hit the press, I think, two days ago.
两天前。
Two days ago.
是的。
Yeah.
是的。
Yeah.
几天前。
Days ago.
但最终,他采用了这个解决方案,并得出了证明。
But eventually, he used the solution, and he came up with the proof.
是的。
Yeah.
对吧?
Right?
所以,就像爱因斯坦看到了所有这些证据,然后他思考,什么能解释这些现象,于是提出了一个因果模型。
So it's like Einstein saw all this evidence, then he thought, what will explain it, and he came up with a causal model.
是的。
Yeah.
所以克努特和他的大脑某种程度上是
So Knuth and his brain is sort of
他在做柯尔莫哥洛夫的那部分。
He's doing the Kolmogorov part.
柯尔莫哥洛夫那部分由人类来完成。
The Kolmogorov part is the human.
对。
Right.
而大语言模型在完成香农部分时极其高效。
And the LLMs are extremely efficient at doing the Shannon part of it.
它通过尝试各种方法找到了所有这些解,一路上学到的越来越多。
It found all the solutions by trying, you know, various things, learning more and more as it went.
分解这个问题的方式真聪明。
Clever way to decompose it.
我在想,你认为这再次提供了关于下一个要解决的问题的某种洞见吗?我会再问一遍同样的问题:你认为这提供了什么洞见吗?
I'm wondering, and I'm gonna ask the same question again, do you think this provides some sort of insight on the next problem to tackle?
比如,是否存在一种机制能达到柯尔莫哥洛夫复杂度?
Like, is there a mechanism that will get to Kolmogorov complexity or not?
比如,这
Like, is this
它告诉我们应该朝哪个方向努力。
It tells us which direction to pursue.
但显然不知道该如何去做。
But clearly not how to do it.
不是怎么做。但即使是柯尔莫哥洛夫复杂度本身,在很大程度上也仍是一个理论构想。
Not how to do it. But even Kolmogorov complexity has largely remained sort of a theoretical construct.
是的。
Yeah.
当然。
For sure.
根本没有算法。
There's no algorithm.
还没有实际可行的方法来找到最短的程序。
There haven't been practical implementations of finding the shortest program.
是的。
Yeah.
我们知道它是存在的。
We know it exists.
你知道吗?
You know?
你可以对此进行争论。
You can argue about it.
但这就是我认为的,是的。
But so that's where I think Yeah.
这是我的偏见。
It's my bias.
我们的精力应该集中在这一点上,而不是用更多token的更大模型。
That's where our energy should be focused, not larger models with more tokens.
你能把这两者联系起来吗?
Can you tie the two things together?
比如,这与进行模拟有什么关系?还是说模拟完全是独立的?
Like, how does that pair with doing simulation, or is that simulation totally orthogonal?
不。
No.
模拟是相关的。
Simulation is related.
对吧?
Right?
所以你的意思是,基本上你做模拟,而这种方式某种程度上是朝着计算柯尔莫哥洛夫复杂度迈进的一步?
So you think it like, basically, you do simulation, and somehow that is a step towards doing the Kolmogorov complexity?
模拟器就是我们创建的程序。
The simulator is the program that we create.
它可能不是完美的程序。
It may not be the perfect program.
哦,我明白了。
Oh, I see.
但在我们脑海中,我们会构建这样一个模拟器:当我扔笔的时候,你知道它会朝你飞来。
But in our heads, we create this simulator: when I'm throwing the pen, you know that it's coming at you.
是的
Yeah.
对吧?
Right?
然后你躲开。
And you duck.
所以你并不是在过程中计算概率。
So you're not computing the probabilities as it goes.
但你其实会构建出一个
But you have, you know, you build
一个算法,而我们更偏向于概念层面的讨论。
an algorithm versus we are talking more conceptually.
概念上。
Conceptually.
但它是,而且你觉得这是同一个机制吗?
But but it's a And you think it's in the same mechanism?
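The pen example can be written as an actual forward simulation: roll a small causal model of ballistics forward in time and act on the predicted outcome, rather than regressing over past pen trajectories. All names and numbers below are invented for illustration.

```python
def will_hit(p0, v0, target_x, target_band, g=9.8, dt=0.005):
    """Forward-simulate a thrown pen (position p0, velocity v0, simple
    ballistics) and ask: when it reaches my x-coordinate, is it within
    the vertical band where my head is?

    This is the 'mental simulation' move: run the causal model forward
    and decide from the predicted outcome."""
    x, y = p0
    vx, vy = v0
    while x < target_x and y > 0.0:
        x += vx * dt          # Euler integration of the trajectory
        y += vy * dt
        vy -= g * dt          # gravity acts on vertical velocity
    lo, hi = target_band
    return x >= target_x and lo <= y <= hi

# Pen thrown from 1.5 m height at 6 m/s toward a head 2 m away, 1.4-1.7 m up.
dodge = will_hit(p0=(0.0, 1.5), v0=(6.0, 2.0),
                 target_x=2.0, target_band=(1.4, 1.7))
print("duck!" if dodge else "stay put")   # duck!
```

The decision comes from an intervention-style rollout of the model, not from any statistics over previously observed pens, which is the distinction between the causal and the correlational picture drawn above.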