Machine Learning Street Talk (MLST) - ICLR 2020: Yann LeCun and Energy-Based Models

ICLR 2020:杨立昆与基于能量的模型

ICLR 2020: Yann LeCun and Energy-Based Models

Episode Description

This week, Connor Shorten, Yannic Kilcher, and Tim Scarfe react to Yann LeCun's keynote from this year's just-concluded ICLR conference. ICLR is the second-largest machine learning conference in the world, and this year it was fully open, with all sessions freely accessible over the internet. LeCun spent most of his talk discussing self-supervised learning, energy-based models (EBMs), and manifold learning. If you haven't heard of EBMs before, don't worry, neither had we! Thanks for watching! Please subscribe!

Paper links:
ICLR 2020 keynote: https://iclr.cc/virtual_2020/speaker_7.html
A Tutorial on Energy-Based Learning: http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
Concept Learning with Energy-Based Models (Yannic's explanation): https://www.youtube.com/watch?v=Cs_j-oNwGgg
Concept Learning with Energy-Based Models (paper): https://arxiv.org/pdf/1811.02486.pdf
Concept Learning with Energy-Based Models (OpenAI blog): https://openai.com/blog/learning-concepts-with-energy-functions/

#deeplearning #machinelearning #iclr #iclr2020 #yannlecun

Bilingual Subtitles

Text subtitles only; no Chinese audio is included. To listen while reading along, use the Bayt podcast app.

Speaker 0

基于能量的模型这个术语,既意味着一切,同时又什么都不是。

The term energy based model means both everything and nothing at the same time.

Speaker 0

几乎任何机器学习问题都可以表述为基于能量的模型问题,而几乎任何基于能量的模型问题也可以转化为机器学习问题。

Almost any machine learning problem can be phrased as an energy based model problem, and almost any energy based model problem can be turned into a machine learning problem.

Speaker 0

你现在看到的是一个基于能量的模型,它通过左侧的演示学习形状的概念。

What you're seeing here is an energy based model that learns the concept of a shape from a demonstration on the left.

Speaker 0

在左侧,你可以看到从形状(这些案例中是圆形或方形)中采样的数据点演示,以及模型据此推断出的相应能量函数。

On the left, you can see a demonstration of data points sampled from a shape, in these cases circles or squares, and then the corresponding energy function that the model infers from that.

Speaker 0

然后它就可以利用那个能量函数在右侧复制那个形状。

And then it can replicate that shape on the right using that energy function.

Speaker 1

Yann LeCun是一位法裔美国计算机科学家,主要研究机器学习、计算机视觉和计算神经科学领域。

Yann LeCun is a French American computer scientist working primarily in the fields of machine learning, computer vision, and computational neuroscience.

Speaker 1

LeCun于1960年出生于巴黎郊区。

LeCun was born in the suburbs of Paris in 1960.

Speaker 1

他于1987年获得计算机科学博士学位,期间提出了神经网络反向传播学习算法的早期形式。

He received his PhD in computer science in 1987, during which he proposed an early form of the back propagation learning algorithm for neural networks.

Speaker 1

1987年至88年期间,他在多伦多大学杰弗里·辛顿的实验室担任博士后研究员。

He was a postdoctoral research associate in Geoffrey Hinton's lab at the University of Toronto from 1987 to '88.

Speaker 1

如今,他是纽约大学的教授,同时也是Facebook的副总裁兼首席科学家。

These days, he's a professor at NYU and vice president and the chief scientist at Facebook.

Speaker 1

他是受生物学启发的图像识别模型——卷积神经网络的奠基人,该模型通过截断感受野,使得训练极深网络变得可行,从而引发了深度学习革命。

He's the founding father of a biologically inspired model of image recognition, convolutional neural networks, which led to the deep learning revolution, making the training of extremely deep networks tractable by truncating the receptive field.

Speaker 1

他还因其在深度学习方面的工作,与杰弗里·辛顿和约什·本吉奥共同获得了2018年的图灵奖。

He was also the co-recipient of the 2018 Turing Award for his work in deep learning, together with Geoffrey Hinton and Yoshua Bengio.

Speaker 1

这三人组被一些人称为人工智能之父,或者更确切地说是深度学习之父。

This gang of three are referred to by some as the godfathers of AI or indeed the godfathers of deep learning.

Speaker 1

国际学习表征会议,简称ICLR,是机器学习领域排名第二的国际学术会议,仅次于NeurIPS,领先于ICML。

The International Conference on Learning Representations, or ICLR, is the number two international academic conference in machine learning, behind NeurIPS and in front of ICML.

Speaker 1

这些会议是根据高影响力的机器学习和人工智能研究进行排名的。

These conferences are ranked in terms of high impact machine learning and artificial intelligence research.

Speaker 1

LeCun发表了一场发人深省的主题演讲,今天,我们将对其进行外科手术般的剖析。

LeCun presented a thought-provoking keynote speech, and today, we're going to dissect it surgically.

Speaker 1

但我要提醒你,我们将讨论基于能量的模型,你可能没听说过这个概念。

But let me warn you, we'll be talking about energy based models, which you might not have heard of.

Speaker 1

而且我们提到'流形'这个词多达145次,即使经过大量剪辑后也是如此。

And we say the word manifold 145 times, even after heavy editing.

Speaker 1

所以准备好在本期节目中了解关于基于能量的模型和流形的知识,可能比你原本想知道的还要多。

So expect to learn more about energy based models and manifolds in today's episode than you ever wanted to know.

Speaker 0

那么能量函数本质上就是一个函数:当你输入看起来像数据的东西时它会'高兴',而输入不像数据的东西时它就会'不高兴'。

So an energy function is basically just a function that is happy when you input something that looks like data and is not happy when you input something that doesn't look like data.

Speaker 0

这几乎可以应用于你能想到的任何事物。

This can be applied to almost anything you can think of.

Speaker 0

数据可以是正确标注的图像。

Data can be correctly labeled images.

Speaker 0

数据可以是看起来像自然图像的图片。

Data can be images that look like natural images.

Speaker 0

应用是无穷无尽的,因此这个话题也是无穷无尽的。

The applications are endless and therefore also the topic is endless.

Speaker 0

最终,杨·莱库恩所做的只是将来自不同机器学习领域的诸多概念重新纳入同一个框架中。

Ultimately, what Yann LeCun does here is just reframe a bunch of things from different machine learning areas into the same framework.

Speaker 0

因此,基于能量的模型并不是什么新东西,它只是对已有的机器学习方法的一种不同表述方式。

So energy based models are not something new, it's just a different way of formulating already existing machine learning things.

Speaker 2

我觉得这场演讲包含了许多有趣的想法。

I thought this talk contained a lot of interesting ideas.

Speaker 2

首先,我很期待讨论婴儿概念习得的示意图。

Firstly, I'm excited to discuss the presented chart of concept acquisition in infants.

Speaker 2

自从研究了诗人算法以来,我一直对课程学习或通往复杂行为过程中的阶梯式任务着迷。

Ever since studying the POET algorithm, I've been fascinated by curriculum learning, or stepping-stone tasks along the way to complex behavior.

Speaker 2

我很期待听听蒂姆和扬尼克对婴儿感知到的课程安排的看法,比如从面部追踪到物体恒常性再到形状恒常性。

I'm excited to see what Tim and Yannic think of the perceived curriculum in infants, such as going from face tracking to object permanence and shape constancy.

Speaker 2

杨·莱库恩提出了深度学习必须解决的三个挑战。

Yann LeCun presents three challenges that deep learning must solve.

Speaker 2

其中之一是用更少的标注样本或更少的试验次数进行学习。

The first of these is learning with fewer labeled samples and/or fewer trials.

Speaker 2

这个问题的答案可能是自监督学习。

The answer to which may be self supervised learning.

Speaker 2

我们在像CURL或SIM CLR这样的论文中看到了一个明显的例子,但另一篇最近的论文通过增强数据进行强化学习,凭借更多的数据增强而非多任务自监督学习超越了CURL。

We've seen an obvious example of this in papers like CURL or SimCLR, but another recent paper, Reinforcement Learning with Augmented Data, surpasses CURL with more data augmentation rather than multitask self supervised learning.

Speaker 2

我认为在利用较少标注样本进行学习方面,有许多研究分支正在发挥作用。

I think there are a lot of branches of research at play with learning with fewer labeled samples.

Speaker 2

首先想到的最热门领域是迁移学习,但其他领域如元学习、多任务学习、课程学习或持续学习都将在这方面发挥作用。

The most popular area which comes to mind is transfer learning, but additional fields like meta, multitask, curriculum, or continual learning will all have a part to play in this.

Speaker 2

LeCun提出的另外两个挑战是学习推理和学习规划复杂的行动序列。

The next two challenges LeCun presents are learning to reason and learning to plan complex action sequences.

Speaker 2

本次演讲深入探讨了能量函数的技术性讨论,以及它们如何构建数据流形。

This talk dives into the technical discussion of energy functions and how they construct the data manifold.

Speaker 0

Yann LeCun在这里讨论的是非常具体的应用类型,基本上就是通过学习将数据点向下推、非数据点向上推,从而描绘出数据流形,以此创建一个能量景观。

Yann LeCun is talking about very specific types of application here, where basically you're learning to trace out a data manifold by pushing points that are data down and pushing points that are not data up, thereby creating an energy landscape.

Speaker 0

由于这次演讲几乎涵盖了所有可能的内容,你会看到我们三个人在理解Yann LeCun到底在讲什么、没讲什么方面相当费力。

Now since this talk is pretty much about everything there could ever be, you'll see the three of us rather struggle with grasping what Yann LeCun is and isn't talking about.

Speaker 0

所以我们正在试图一次性理解整个机器学习领域。

So we're trying to make sense of pretty much all of machine learning in one go.

Speaker 0

我对这个过程有些困惑,但也非常有趣。

And I had some trouble with this, but also lots of fun.

Speaker 0

对我们来说,这是一场有点不同的讨论,因为我们三人都不是这方面的专家,但我们尽力了,结果就是这样。

It's a bit of a different talk for us because none of us are really experts at it, but we tried and this is how it turned out.

Speaker 0

一点都不有趣。

Not fun.

Speaker 0

是的。

Yeah.

Speaker 0

你需要了解的第一件事是能量函数或基于能量的模型。

The first thing you need to know are energy functions or energy based models.

Speaker 0

什么是能量函数?

What is an energy function?

Speaker 0

能量函数,有时记作 e,就是一个有一个或多个输入的函数,我们称这些输入为 x,如果能量函数对 x 感到满意,它的输出值就是零。

An energy function, sometimes written E, is simply a function with one or multiple inputs, let's call them x. If the energy function is happy with x, it will output the value zero.

Speaker 0

如果能量函数对x不满意,它就会输出一个高值,比如大于零。

And if the energy function is not happy with x, it will output a high value, larger than zero.

Speaker 0

所以这是满意的状态,这是不满意的状态。

So this is happy, this is not happy.
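To make the happy/not-happy picture concrete, here is a toy sketch in Python. The unit circle standing in for the data manifold is purely an invented example, not something from the talk.

```python
import numpy as np

# Toy illustration (not from the talk): an energy function over 2-D inputs
# that is zero on the unit circle (our stand-in "data manifold") and grows
# as points move away from it.
def energy(x):
    radius = np.linalg.norm(np.asarray(x, dtype=float))
    return (radius - 1.0) ** 2  # 0 = "happy" (looks like data), >0 = "not happy"

on_manifold = energy([1.0, 0.0])   # a point that looks like data
off_manifold = energy([3.0, 0.0])  # a point that does not
```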

Speaker 0

那么,我们来举一些这方面的例子。

So let's give some examples of this.

Speaker 0

我们几乎可以将任何机器学习问题用能量函数的形式来表达。

We can formulate almost any machine learning problem in terms of an energy function.

Speaker 0

假设我们有一个分类器。

Let's say we have a classifier.

Speaker 0

分类器。

Classifier.

Speaker 0

这个分类器接收一张输入图像,可能是一只猫的图像,以及一个标签。

The classifier takes as input an image here, maybe of a cat, and a label.

Speaker 0

所以,如果标签是猫,那么当能量函数正常工作时,能量值就会是零。

So if the label is cat, then the energy will be zero if the energy function is working correctly.

Speaker 0

而如果我们给能量函数输入同样的图像,但给它一个错误的标签,比如狗,那么能量值就会非常高。

And if we give the energy function the same image, but we give it a wrong label, dog, then the energy is very high.

Speaker 0

在分类器的情况下,我们当然可以直接将损失函数作为能量函数,这样我们就自动得到了一个基于能量的模型。

In the case of the classifier, of course, we can simply take the loss function as the energy function, and we automatically get an energy based model.

Speaker 0

所以这里的损失函数会是类似正确类别的负对数概率这样的东西。

So the loss function here will be something like the negative log probability of the correct class.
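As a small sketch of this, with invented softmax numbers standing in for a trained classifier's output on a cat image, the loss-as-energy idea looks like:

```python
import math

# Sketch: reading a classifier's loss as an energy function. `probs` stands
# in for a trained classifier's softmax output on a cat image; the numbers
# are invented for illustration.
probs = {"cat": 0.95, "dog": 0.05}

def energy(label):
    # negative log-probability of the proposed label:
    # near zero for the correct label, large for a wrong one
    return -math.log(probs[label])

e_cat = energy("cat")  # close to zero: the energy function is "happy"
e_dog = energy("dog")  # large: the energy function is "not happy"
```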

Speaker 0

但无论如何,它都会是一个很大的数值。

But in any case, it is just going to be a high number.

Speaker 0

我们称之为十的九次方。

Let's call it 10 to the nine.

Speaker 0

所以能量函数表示,这非常糟糕。

So the energy function says, this is very bad.

Speaker 0

你输入的整个内容。

The entire thing you input.

Speaker 0

它暂时还不会告诉你具体哪里有问题。

It won't tell you yet what's bad about it.

Speaker 0

所以这也意味着你可以改变这两者中的任何一个来让分类器满意。

So that also means you can change any of the two things to make the classifier happy.

Speaker 0

通常我们关注的是改变标签。

Now usually, we're concerned with changing the label.

Speaker 0

对吧?

Right?

Speaker 0

就像是,告诉我需要输入哪个其他标签才能让你满意?

It's like, tell me which other label do I need to input to make you happy?

Speaker 0

如果我们让标签变得可微分,当然,我们从不输入真实标签,实际上我们输入的是类似一个分布,即标签上的softmax分布,这是可微分的。

And if we make the labels differentiable: of course, we never input the true label; we actually input a distribution, a softmax distribution over labels, and that's differentiable.

Speaker 0

我们可以使用梯度下降来更新狗这个标签。

We can use gradient descent to update the dog label.

Speaker 0

我们可以使用梯度下降来找到一个能让能量函数更满意的标签。

We can use gradient descent to find a label that would make the energy function more happy.

Speaker 0

所以,如果我们有一个好的分类器,就可以使用梯度下降来得到猫的级别。

So we could use gradient descent to get the cat label if we had a good classifier.
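A minimal sketch of that label-side gradient descent, with a made-up two-class classifier output (index 0 for cat, index 1 for dog). The gradient is just the derivative of the expected negative log-probability with respect to the softmax logits:

```python
import numpy as np

# Sketch (hypothetical numbers): start from a "dog" label and run gradient
# descent on a soft, differentiable label to find the label a fixed, trained
# classifier is happiest with. `log_p` stands in for the classifier's output
# on a cat image; index 0 is "cat", index 1 is "dog".
log_p = np.log(np.array([0.95, 0.05]))  # the classifier strongly prefers cat

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def energy(z):
    # expected negative log-probability under the proposed soft label
    return -(softmax(z) * log_p).sum()

def grad(z):
    # analytic gradient of the energy with respect to the logits z
    y = softmax(z)
    return y * (-log_p - energy(z))

z = np.array([0.0, 5.0])                # initial soft label: almost all "dog"
for _ in range(200):
    z = z - 0.5 * grad(z)               # gradient descent on the label, not the image

best = int(np.argmax(z))                # 0: gradient descent recovered "cat"
```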

Speaker 0

但我们也可以优化图像,使其与狗的标签兼容。

But we can also optimize the image to make it compatible with the dog label.

Speaker 0

如果你见过Deep Dream之类的东西,那些模型正是这样做的。

If you ever saw Deep Dream or something like this, those models do exactly that.

Speaker 0

它们会为特定标签优化输入图像,在这里你可以将整个神经网络(包括损失函数)视为能量函数。

They optimize the input image for a particular label and there you can view the entire neural network including the loss function as the energy function.

Speaker 0

那另一个例子是什么?

So what's another example?

Speaker 0

另一个例子是,假设你有一个K均值模型,其能量函数很简单:我们输入一个数据点,然后对于这个数据点,你需要找到最近的聚类索引,也就是在多个聚类中找到最小的K值,你的数据点可能就在这里。

Another example is, let's say you have a K means model, and for the energy function, we simply input a data point, and what you're going to do is find the nearest cluster index, the min over k; you have your multiple clusters here, and your data point might be here.

Speaker 0

所以你会找到距离最近的聚类,而这个距离d就是该数据点的能量值。

So you're going to find the cluster that's closest and then the distance here, distance d will be the energy of that.

Speaker 0

当你的数据点来自某个聚类中心时,模型会非常满意;但当数据点远离聚类中心时,模型就不满意了,这正是K均值函数的成本函数。

So the model is very happy when your data point comes from one of the clusters, but your model is not happy when the data point is far away, and that would be the cost function of the K means algorithm.
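A sketch of that K means energy, with invented cluster centers:

```python
import numpy as np

# Sketch: the K-means cost as an energy function. The cluster centers are
# made up; the energy of a point is its distance to the nearest center.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

def energy(x):
    # min over clusters k of the distance d between x and center k
    return float(np.min(np.linalg.norm(centers - np.asarray(x, dtype=float), axis=1)))

near = energy([0.1, 0.0])  # close to a center: low energy, the model is happy
far = energy([5.0, 5.0])   # far from every center: high energy
```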

Speaker 0

所以这也是一个基于能量的模型。

So that's an energy based model too.

Speaker 0

目前,基于能量的模型通过诸如GANs(生成对抗网络)或任何形式的噪声对比估计等方法变得流行起来。

Now currently energy based models have come into fashion through things like GANs or any sort of noise contrastive estimation.

Speaker 0

所以在GAN中,你有一个判别器,该判别器基本上会学习一个函数来区分数据和非数据。

So in a GAN what you have is you have a discriminator and the discriminator will basically learn a function to differentiate data from non data.

Speaker 0

所以这本身就是一个能量函数。

So that by itself is an energy function.

Speaker 0

因此判别器将学习一个函数,该函数在判别器认为有数据的地方会输出较低的值。

So the discriminator will learn a function and that function will be low wherever the discriminator thinks there is data.

Speaker 0

对吗?

Right?

Speaker 0

所以它通常会在数据点周围这样做。

So it will usually do this around the data points.

Speaker 0

因此数据点在这里形成了谷底。

So the data points form the valleys right here.

Speaker 0

然后生成器基本上会利用那个判别器函数,尝试推断出同样位于这些谷底的点,从而生成同样位于谷底的点。

And then the generator will basically take that discriminator function and will try to infer points that are also in these valleys, to produce points that are also in the valleys.

Speaker 0

然后你就基本上有了一个能量学习的竞争。

And then you basically have an energy learning competition.

Speaker 0

判别器现在试图在真实数据所在的位置降低能量,在生成数据所在的位置提高能量,这将使未来的能量函数更加陡峭。

The discriminator now tries to push down on the energy where the true data is and push up on the energy where the generated data is, and that will give you basically a steeper energy function in the future.
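A deliberately tiny sketch of the push-down/push-up dynamic. The one-parameter quadratic energy here is an invented stand-in for a real discriminator network:

```python
# Sketch: push down the energy at a real point and push it up at a generated
# point, on a 1-D toy energy E(x) = (w - x)^2 with a single scalar parameter w.
# The points and the energy form are invented for illustration.
real_x = 2.0    # where the true data lives
fake_x = -1.0   # where the generator currently puts its samples
w = 0.0

def energy(w, x):
    return (w - x) ** 2

# Minimize E at the real point and maximize it at the generated point.
# This particular loss is unbounded, so we just take a fixed number of steps.
for _ in range(100):
    grad_real = 2 * (w - real_x)   # d/dw of E(w, real_x): push down here
    grad_fake = 2 * (w - fake_x)   # d/dw of E(w, fake_x): push up here
    w = w - 0.01 * (grad_real - grad_fake)

e_real = energy(w, real_x)   # lower than e_fake after training
e_fake = energy(w, fake_x)
```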

Speaker 0

我希望在这种情况下,判别器神经网络就是能量函数,而生成器只是试图产生与该能量函数兼容的数据。

So in this case, the discriminator neural network is the energy function, and the generator just tries to produce data that is compatible with that energy function.

Speaker 0

所以我希望能量函数的概念已经清楚了。

So I hope that the concept of an energy function is clear.

Speaker 0

任何机器学习问题都可以用能量函数来表述。

Any machine learning problem can be formulated in terms of an energy function.

Speaker 3

刚才Yannic解释能量模型的那段内容来自他的个人YouTube频道。

That last snippet with Yannic explaining energy based models was taken from his personal YouTube channel.

Speaker 3

他一个小时前刚发布了一个视频,名为《基于能量模型的概念学习》,其中介绍了OpenAI的一篇非常精彩的论文。

He's just dropped a video about an hour ago called Concept Learning with Energy-Based Models, and he covers the really cool paper from OpenAI.

Speaker 3

这篇论文的作者是Igor Mordatch。

It was by Igor Mordatch.

Speaker 3

这是一段非常棒的视频,我强烈推荐你们也去观看一下。

It was a really good video, so I thoroughly recommend you guys go and check that out as well.

Speaker 3

欢迎回到机器学习街谈,与我的两位伙伴Yannic Kilcher和Connor Shorten一起。

Welcome back to machine learning street talk with my two compadres, Yannic Kilcher and Connor Shorten.

Speaker 3

我们上周参加了ICLR会议,这是顶级的机器学习会议之一。

We had ICLR last week, and it's one of the top machine learning conferences.

Speaker 3

有趣的是,这次会议完全开放并在线上进行,你可以自由地观看任何演讲和查阅论文。

And what's interesting this time around is that it was completely open and on the Internet, so you can freely go in and watch any of the talks and look at the papers.

Speaker 3

Yann LeCun和Yoshua Bengio的主旨演讲非常精彩。

And that was a really, really good kind of keynote presentation from Yann LeCun and Yoshua Bengio.

Speaker 3

你们看了Yann的主旨演讲后,觉得怎么样?

So what what did you guys think about it when you watched Yann's keynote?

Speaker 4

我第一次看的时候,感觉信息量太大了,脑子有点转不过来。

And the first time I watched it, my head was like, this is too much to process.

Speaker 4

我不得不暂停一下,好好思考一下。

And I had to step away and think about it.

Speaker 4

但其中确实有很多有趣的想法。

But there's definitely a lot of interesting ideas in it, I think.

Speaker 0

是的。

Yeah.

Speaker 0

如果你不熟悉他所谈论的内容,一开始可能会觉得信息量很大,直到你能提炼出他真正想表达的核心观点。

I think if you're not familiar with what he's talking about, it just seems like a lot of information, until you can kind of distill it down to what he's actually saying.

Speaker 0

然后它变得更像是一种...起初,你会觉得他在介绍新东西,但渐渐地你会意识到,他其实只是在用一种统一的方式描述已经存在的事物,我觉得这挺酷的。

And then, at first, you think he's presenting something new, but more and more you realize he's basically just describing what already is, right, in a sort of unified manner, which I find pretty cool.

Speaker 0

是的。

Yeah.

Speaker 3

我同意这一点。

I would agree with that.

Speaker 3

我第一次看这个演讲时,感觉信息量太大了。

When I first saw the presentation, it was information overload.

Speaker 3

我一直很喜欢Yann LeCun的演讲,他是那种喜欢简化事物并做出全面概括的人,他让我对深度学习的工作原理有了深刻的直觉。

And what I've always liked about Yann LeCun's presentations is he's one of these guys that likes to simplify things and make sweeping generalizations, and he's the kind of guy that gives me deep intuitions about how deep learning works.

Speaker 3

所以一开始,这不符合他的风格,但了解了基于能量的模型后,我的意思是,我们直接切入正题吧。

So at first, this was out of character for him, but having looked into energy based models I mean, let's just cut to the chase.

Speaker 3

他整个演讲都在谈论基于能量的模型,并介绍了许多基于能量模型的新概念。

He spends the entire presentation talking about energy based models and talking about lots of new things in terms of energy based models.

Speaker 3

我完全坦白地说。

And I'll be completely honest.

Speaker 3

我承认。

I'll hold my hands up.

Speaker 3

我之前从未听说过基于能量的模型。

I've never heard of energy based models before.

Speaker 3

我...我这么说感觉有点不好意思。

I I I feel embarrassed to say this.

Speaker 3

然后我去谷歌了一下,Yann LeCun有一篇很棒的论文,是关于基于能量模型的教程,他大概在2006年写的。

And when I googled it, Yann LeCun had a wonderful paper, which was a tutorial on energy based models, and he wrote it in about 2006.

Speaker 3

所以,你知道,那差不多是十五年前的事了。

So, you know, getting on for fifteen years ago.

Speaker 3

那个教程里的所有内容基本上就是他在ICLR这里所说的。

And everything in that tutorial was basically what he was saying here at ICLR.

Speaker 3

所以在那段时间里,其实并没有什么变化。

So nothing has changed really in that time.

Speaker 4

是的。

Yeah.

Speaker 4

比如,当我研究GANs时,遇到Wasserstein GAN时我有点卡住了,因为当时我还没接触到这种平滑函数的概念,就是这种标量评分所实现的。

Like, when I was studying GANs, I kind of hit a wall when I came to the Wasserstein GAN, because at the time I hadn't been introduced to this idea of the smooth function that this kind of scalar scoring enables.

Speaker 4

我当时完全不明白Wasserstein GAN为什么不同,为什么它特别。

Like, I had no idea why the Wasserstein GAN was different, why it was special.

Speaker 4

所以我觉得这真的也帮我更好地理解了很多东西。

So I think this really helped me understand that a lot as well too.

Speaker 3

有意思。

Interesting.

Speaker 3

我的意思是,另一个真正出现的问题是,传统概率方法和深度学习方法之间一直存在拉锯战。

I mean, another thing that really came up is that there's always been a tug of war between traditional probabilistic approaches and the kind of deep learning approaches.

Speaker 3

有个叫克里斯·毕晓普的人写了PRML这本书,他是基于模型的机器学习的坚定倡导者。

There's a guy called Chris Bishop who wrote the PRML book, and he was a huge advocate of model based machine learning.

Speaker 3

我曾经在微软研究院面试过,他们当时极力推销这个理念。

I once interviewed at Microsoft Research, and and they were pitching it hard.

Speaker 3

那是在深度学习革命之前。

This was before the deep learning revolution.

Speaker 3

他们描述的方式是,他们总是引用‘没有免费的午餐’定理,并说每个问题都需要自己的机器学习算法。

And the way they described it was, they always cite the no free lunch theorem, and they say that every single problem needs its own machine learning algorithm.

Speaker 3

当然,这类模型几乎是在考虑到特定领域的情况下创建的,它们具有这些特征因子图,其中包含潜在变量,你可以对变量之间的依赖关系进行建模,并在这个概率空间中进行近似推理。

And, of course, these types of models were almost created with the domain in mind, and they had these characteristic factor graphs where you had latent variables, and you could model dependencies between the variables and do approximate inference in this probabilistic space.

Speaker 3

而Yann LeCun对此提出了一个推论,但他说在他的因子图中,是确定性和概率性函数的混合。

And Yann LeCun draws a kind of a corollary to that, but he says in his factor graphs, it's a mixture of deterministic and probabilistic functions.

Speaker 0

我猜其中一部分发展成了今天的因果推断社区,他们明确强调:好吧,我们知道数据是如何生成的。

I would guess part of that developed into what today is the causality community where they put explicit weight on, okay, we know how the data is generated.

Speaker 0

我们知道世界在某种程度上是如何运作的,并让这基本上决定我们的模型,然后我们在此基础上学习其余部分。

We know how the world works in some manner, and we let that basically determine our model, and then we learn the rest to it.

Speaker 0

我认为这是一种非常有效的方法,但并不总是你一定能实现的。

I think it's a very valid approach, but it's not always one that you can necessarily achieve.

Speaker 0

这是理想化的。

It's idealistic.

Speaker 4

在基于能量学习的教程摘要中,它向你推销的一大要点是:概率模型必须被正确归一化,这有时需要评估所有可能变量配置空间上的难解积分。

One of the big things it sells you on in the abstract of A Tutorial on Energy-Based Learning is that probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations.

Speaker 4

所以这句话似乎是在对比,说概率模型是一回事,而基于能量的模型是另一回事。

So it seems like that sentence is drawing a contrast, saying probabilistic models are one thing and energy based models are the other thing.

Speaker 4

关键在于你不需要对这些基于能量的模型进行归一化处理。

And the key is that you don't have to normalize these energy based models.

Speaker 4

那么你有什么想法吗?比如,不进行归一化处理意味着什么?

So do you have any, like, so what does this mean to not normalize it?

Speaker 0

基于能量的模型是指最终输出一个单一数值的模型,对吧,就是能量值。

An energy based model is any model where at the end you have a single number, right, the energy.

Speaker 0

如果能量值低,那就说明无论你输入什么,模型都表示满意,而无论你没输入什么,模型都不满意。

And if the energy is low, that tells you whatever you put in, the model is happy with, and whatever you didn't put in, the model is not happy with.

Speaker 0

你可以将任何概率方法解释为基于能量的方法,只需将概率的倒数作为你的能量值即可。

And you can interpret any probabilistic method as an energy based method simply by having the inverse probability as your energy.

Speaker 0

对吧?

Right?

Speaker 0

所以,如果你有一个概率模型,那么你就知道某个特定点在世界上发生的精确概率,比方说。

So if if you have a probabilistic model, then you know the exact probability that a given point is occurring, let's say, in the world.

Speaker 0

所以,基于能量的模型的不同之处在于,你只能判断某个东西是更好还是更差。

So the the the difference here with the energy based model, you can only tell if something is better or worse.

Speaker 0

而对于概率模型,你可以确切地知道它有多好,某个事物在整个世界或你的基础流形中出现的频率是多少。

Whereas with a probabilistic model, you know exactly how good it is, how often something occurs in the entirety of the world or of whatever your base manifold is.

Speaker 0

对吧?

Right?

Speaker 0

所以,也许我们可以区分一下,例如,如果你拿一个语言模型,给它一个句子,然后语言模型可以告诉你这个句子在英语中出现的概率恰好是0.00000324。

So maybe we can make a distinction: if you take, for example, a language model, and you give it a sentence, then the language model could tell you that the probability this sentence will occur in the English language is exactly 0.00000324.

Speaker 0

而如果这是一个能量函数,它可能只是告诉你,它更喜欢这一个而不是另一个。

Whereas if this is an energy function, it could simply tell you, I am happier with this one than with this other one.

Speaker 3

不过我认为有一种方法可以将能量函数转化为概率,这正是杨所谈论的内容之一,即如何将概率函数与深度学习函数结合起来。

I think there is a way, though, of transforming an energy function into a probability, and this is one of the things that Yann talks about: how you compose probabilistic functions and deep learning functions.

Speaker 3

他通过使用吉布斯分布进行归一化来实现这一点。

And the way he does it is by normalizing using this Gibbs distribution.

Speaker 3

所以这是一种有点任意的分布,但如果你在积分上对其进行归一化,也就是在你的y值域上进行归一化,就可以将任何能量函数转化为概率分布。

So it's a bit of an arbitrary distribution, but if you normalize it over the integral, so over the domain of your y's, you can turn any energy function into a probability distribution.
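A sketch of that Gibbs normalization over a small, made-up discrete domain, where the integral becomes a simple sum:

```python
import math

# Sketch: turning an energy function over a small, finite domain into a
# probability distribution with the Gibbs distribution
#   p(y) = exp(-beta * E(y)) / Z,   Z = sum over y of exp(-beta * E(y)).
# The domain and energy values here are invented for illustration.
energies = {"cat": 0.1, "dog": 2.0, "car": 5.0}
beta = 1.0

Z = sum(math.exp(-beta * e) for e in energies.values())  # the normalizer
probs = {y: math.exp(-beta * e) / Z for y, e in energies.items()}

total = sum(probs.values())  # a properly normalized distribution sums to 1
```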

Speaker 0

是的。

Yes.

Speaker 0

没错。

Exactly.

Speaker 0

这多年来一直是许多模型、尤其是NLP模型的主要关注点,即实现这些归一化。

This has been the main focus of many models, especially NLP models, over the last years: to have these normalizations.

Speaker 0

以及之前一些方法,比如我不确定具体名称,但在深度学习兴起之初流行的图模型、条件随机场等等。

And also previous methods, like these graphical models that were popular around the advent of deep learning, conditional random fields, and so on.

Speaker 0

如果你使用概率方法,你就需要能够计算这个概率,而概率就是所有正例除以所有可能的情况,对吧?

If you have a probabilistic method, you need to be able to compute this probability, and the probability is simply all the positive cases divided by all the possible cases, right?

Speaker 0

所以如果你有一个语言模型,那么你当前的句子就是正例,你必须用它除以所有可能的句子,对吧?

So if you have a language model, then your current sentence is your positive case and you have to divide this by every sentence that is ever possible, right?

Speaker 0

因此,你基本上得问你的模型:你觉得这个、这个、这个和这个怎么样?

So you have to basically ask your model, what do you think of that and that and that and that?

Speaker 0

而且你必须对英语中的每一个句子都这样做。

And you have to do this for every single sentence in the English language.

Speaker 0

实际上,是对于任何可能由英语词汇组成的句子都要这样做,然后用这个总数来除。

Actually, every single sentence that's even possible with any of the words in the English language, and then you have to divide by that.

Speaker 0

所以为了得到概率,你需要这种归一化。

So to get a probability, you need this normalization.

Speaker 0

而对于许多模型来说,这正是主要问题。

And this is, for many models, the main problem.

Speaker 0

你该如何进行这种归一化?

How do you do this normalization?

Speaker 0

这通常是一个难以处理的积分。

This is an intractable integral usually.

Speaker 0

已经有大量、海量的工作投入于近似计算这个积分。

And much, much work has gone into just approximating this.

Speaker 0

我们如何、或者说我们怎样才能进行因子分解呢?

How can we factorize it?

Speaker 0

所以,条件随机场或类似的方法将这个积分分解为仅涉及单个变量或两个变量乘积的形式。

So conditional random fields or things like this factorize this integral into just single variables or two variable products.

Speaker 0

所以你可以通过某种前向-后向图消息传递算法来计算这个。

So you can compute this with some sort of forward backward graph message passing algorithms.

Speaker 0

或者像在自然语言处理中的其他模型,它们通过简单的采样来进行归一化。

Or other models like in NLP have normalized it by simply sampling.

Speaker 0

对吧?

Right?

Speaker 0

所以,我们不是除以英语中的每一个句子,而是只采样大约10个,然后我们说,好吧,这10个就是你的基础分布。

So instead of dividing by every single sentence in the English language, we just sample like 10, and we say, okay, these 10 are your base distribution.
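A sketch of that sampling trick, with a toy energy function and made-up "sentences" standing in for the space of all English sentences:

```python
import math
import random

# Sketch: approximating the intractable normalizer by sampling, as described
# above. Instead of summing exp(-E) over every possible "sentence", we
# normalize against 10 sampled alternatives. The energy function and the
# sentence space are toy stand-ins.
random.seed(0)

def energy(sentence):
    return float(len(sentence))  # toy energy: shorter strings score lower

positive = "abc"
negatives = ["".join(random.choice("abcd") for _ in range(random.randint(1, 8)))
             for _ in range(10)]

# approximate normalizer: the positive case plus the 10 sampled negatives
z_hat = math.exp(-energy(positive)) + sum(math.exp(-energy(s)) for s in negatives)
p_hat = math.exp(-energy(positive)) / z_hat  # approximate probability
```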

Speaker 0

所以,是的,这其中存在关联。

So, yeah, there's a connection.

Speaker 0

我认为每个概率模型都是基于能量的模型,而每个基于能量的模型都可以通过归一化转化为概率模型。

I would say every probabilistic model is an energy based model, and every energy based model can be turned into a probabilistic model by normalizing.

Speaker 5

机器学习和人工智能的未来是自监督的。

Future of machine learning and AI is self supervised.

Speaker 5

多年来我一直在问自己一个问题:人类和动物是如何学习的,特别是他们为何能如此快速地学习,似乎不需要任何监督或只需要极少监督,并且几乎不与世界互动。

One question I've been asking myself for many years is how do humans and animals learn, in particular, how do they learn so quickly, seemingly requiring no supervision, or very little, and almost no interaction with the world?

Speaker 5

这张图表由米歇尔·杜普伊整理,展示了婴儿在什么年龄学习基本概念,比如物体恒存性、稳定性、直觉物理、惯性、重力等等。

This is a chart put together by Emmanuel Dupoux that shows at what age babies learn basic concepts like object permanence, stability, and intuitive physics, inertia, gravity, and things like this.

Speaker 5

这似乎几乎不需要与世界互动就能学会,主要是通过观察。

This seemingly is being learned almost with no interaction with the world, mostly by observation.

Speaker 5

幼小的婴儿直接与世界互动的能力非常有限。

The young babies have very little ability to interact directly with the world.

Speaker 5

而谜团在于这是如何发生的,以及动物身上又是如何发生的?

And the mystery is how does that happen and how does it happen in animals as well?

Speaker 5

这可能是婴儿和动物学习大量关于世界背景信息的途径,比如直觉物理学之类的知识。

This is probably the vehicle through which baby animals and humans learn massive amounts of background information about the world such as intuitive physics and things of that type.

Speaker 5

也许这种知识的积累构成了常识的基础。

Perhaps the accumulation of this knowledge forms the basis of common sense.

Speaker 3

我得去查一下物体恒存性。

I had to look up object permanence.

Speaker 3

这意味着如果婴儿看到一栋建筑,当婴儿再也看不到它时,婴儿仍会知道这栋建筑依然存在。

So that means if a baby sees a building, it means the baby will know that the building is still in existence when the baby can no longer see it.

Speaker 0

没错。

That is correct.

Speaker 0

所以捉迷藏之所以有效,就是这个原因。

That's why peekaboo works.

Speaker 0

这就是为什么和婴儿玩躲猫猫很有趣,因为对他们来说,把手拿开后你还在那里,简直就像个奇迹。

That's why it's a fun game with babies because to them it's like a miracle that you're still there once the hands come off.

Speaker 0

没错。

Exactly.

Speaker 0

所以当我听到这个时,我的想法是,我不太确定这里的类比是否真的恰当。

So what I thought when I heard this was something like, I'm not sure here if the analogy is really a good one.

Speaker 0

因为他在这里基本上是说,婴儿是通过观察来学习这些东西的,而我强烈认为,在数百万年的进化过程中,不仅是人类,几乎所有动物都必须对重力有直觉感知,必须有空间感,必须习得客体永久性。

Because basically what he's saying here is that it is through observation that babies learn these things, where I would strongly argue that there are millions of years of evolution where not only humans but pretty much all animals had to have an intuitive sense of gravity, had to have a sense of spatiality, had to have learned object permanence.

Speaker 0

所以,我认为这可能更像是大脑中一个与生俱来的模块,它只是在发育的特定阶段被激活,而不是像婴儿出生那样。

So I would argue that it might much rather simply be an innate module in the brain that gets switched on at that particular phase during development, rather than something the babies learn from observation.

Speaker 0

问题是,如果一个婴儿在零重力环境中长大,他/她是否还能形成对重力的直观理解?

The question is, if you had a baby in zero gravity, would it develop an intuitive understanding of gravity or not?

Speaker 0

我猜由此引申出的推论是,我们应该去测量那些不可能通过进化获得的能力,比如与计算机互动之类的。

And I guess the corollary to that is that what you should rather measure is things that could not have evolved, like maybe interacting with a computer or something like this.

Speaker 0

我不知道。

I don't know.

Speaker 0

你觉得这个怎么样?

What do you think of this?

Speaker 3

说得太对了。

That's so true.

Speaker 3

我的意思是,当我们开始对深度学习进行拟人化,并做出这些与人类发展相关的危险比较时,情况就变得非常危险了。

I mean, it's very dangerous when we get into this anthropomorphization of deep learning and these comparisons with human development.

Speaker 3

我认为这种谬误存在的原因是,婴儿出生时看起来没有任何认知能力或智力。

I think the reason this fallacy exists is because when a baby is born, the baby appears to have no cognitive capabilities or intelligence whatsoever.

Speaker 3

这些能力都是在发育期间习得的。

They are all learned during the developmental period.

Speaker 3

但正如你所说,大脑中已经存在如此多的归纳先验。

But as you say, there are so many inductive priors in the brain already.

Speaker 3

我认为Yann在这里想说的是,婴儿并非以监督学习的方式在学习。

I think what Yann's trying to say here is that the baby is not learning in a supervised way.

Speaker 3

我、我、我认为他这里的核心观点是,深度学习的未来将是自我监督的,而不是强化学习。

I think the thrust of his message here is that the future of deep learning will be self supervised, and it won't be reinforcement learning.

Speaker 4

那么,你认为这些任务的学习顺序有什么讲究吗?

And do you think there's anything to, like, the curriculum of how these tasks are learned?

Speaker 4

就像我们一开始讨论的Poet,它具备这种自动课程学习的能力。

Like, we kicked off talking about POET and how that has this automatic curriculum learning.

Speaker 4

你认为这些被发现的序列事物有什么意义吗?

Do you think there is anything to these sequence of things that are being discovered?

Speaker 4

关于这个想法,我觉得另一个有趣的点是它们学得很快。

Another thing that I think is interesting about the idea of this is them learning quickly.

Speaker 4

快速学习的一个例证是,两个月内你实际上能获取大量的视觉数据。

This is a demonstration of quick learning: in two months, you can actually get a lot of visual data.

Speaker 4

假设你每秒获取30帧,每分钟60秒,那就是1800帧,即使你看的是时长可变的视频数据,我认为在两个月内确实能获取大量的视觉数据。

If you assume that you get, like, 30 frames per second, and then sixty seconds a minute, that's 1,800 frames a minute, and even though you're looking at video data which could be variable length, I think you actually could get a lot of visual data in two months.
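The back-of-the-envelope arithmetic, with an assumed (not from the episode) eight waking hours a day:

```python
# The arithmetic behind the point above: 30 frames per second is 1,800 frames
# a minute; assuming (hypothetically) 8 waking hours a day for about 60 days,
# the total is on the order of tens of millions of frames.
frames_per_second = 30
frames_per_minute = frames_per_second * 60          # 1,800, as in the episode
minutes = 60 * 8 * 60                               # 60 min/h * 8 h/day (assumed) * 60 days
frames_in_two_months = frames_per_minute * minutes  # 51,840,000 frames
```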

Speaker 0

是的。

Yeah.

Speaker 0

我认为他的观点基本上就是,你确实能获得所有这些数据,但你得不到标签。

I think that's pretty much his point here: you do get all of this data, but you do not get the labels.

Speaker 0

对吧?

Right?

Speaker 0

你得不到任何标签。

You don't get any labels.

Speaker 0

另外,我认为他之所以拿婴儿作为例子,是因为正如他所说,婴儿并不进行互动。

Also, I think the reason why he takes babies as examples is because, as he says, they don't interact.

Speaker 0

所以这也不是强化学习。

So it's also not reinforcement learning.

Speaker 0

既不是监督学习,也不是强化学习。

It's not supervised and it's not reinforcement.

Speaker 0

只是通过消费大量未标注的数据,它们就能学会一些东西。

It's just something about consuming large amounts of unlabeled data that makes them learn things.

Speaker 0

我的意思是,我愿意接受这个类比,尽管我在生物学上并不认同。

I mean, I'm willing to go with the analogy here, even though I disagree on biology.

Speaker 5

几乎不与世界互动,主要通过观察。

Almost with no interaction with the world, mostly by observation.

Speaker 5

婴儿几乎没有直接与世界互动的能力。

The young babies have very little ability to interact directly with the world.

Speaker 5

谜题在于,这是如何发生的?

And the mystery is how does that happen?

Speaker 5

动物身上又是如何发生的呢?

And how does it happen in animals as well?

Speaker 5

这可能是婴儿动物和人类学习大量关于世界背景信息的途径,比如直觉物理学之类的内容。

This is probably the vehicle through which baby animals and humans learn massive amounts of background information about the world, such as intuitive physics and things of that type.

Speaker 5

或许,这类知识的积累构成了常识的基础。

Perhaps the accumulation of this knowledge forms the basis of common sense.

Speaker 5

因此,如果能在机器中复现这种学习方式,将具有极其强大的作用,可以大幅减少对标注样本和试验的需求。

So being able to reproduce this type of learning in machines would be enormously powerful, and would reduce the requirement for labeled samples and trials.

Speaker 5

在我看来,人工智能的下一次革命既不是监督学习,也不是强化学习。

And in my opinion, the next revolution in AI will not be supervised nor reinforced.

Speaker 5

所以确实存在一些挑战。

So there are really kind of challenges.

Speaker 3

我认为我们已经在AI领域经历了一场革命,那就是自监督方法。

I would argue that we've already had a revolution in AI, which is the self supervised approach.

Speaker 3

它在过去几年里彻底改变了语言处理领域。

It has transformed language processing over the last few years.

Speaker 4

是的。

Yeah.

Speaker 4

确实如此。

Definitely.

Speaker 4

尤其是最近,这些对比学习方法似乎也开始崭露头角了。

Especially recently, it seems like these contrastive learning methods are just taking off as well.

Speaker 3

完全同意。

Absolutely.

Speaker 3

尽管Yann后来提出了一些有趣的观点,他认为这些方法在视觉领域的效果没那么好。

Although Yann makes some interesting comments later that he doesn't think they work as well for vision.

Speaker 3

但是

But

Speaker 5

当今的深度学习人工智能和机器学习。

Deep learning AI and machine learning today.

Speaker 5

其一当然是减少对标注样本和强化交互的需求。

One is, of course, diminishing the requirement for labeled samples and reinforcement interactions.

Speaker 5

在我看来,这正如我刚才提到的,是通过自监督学习实现的。

And in my opinion, that goes through self supervised learning as I just mentioned.

Speaker 5

自监督学习本质上是在变量之间建立依赖关系,学习填补空白,学习表征世界,学习进行预测。

Self-supervised learning really is learning dependencies between variables, learning to fill in the blanks, learning to represent the world, learning to predict.

Speaker 5

第二点是学习推理,超越丹尼尔·卡尼曼所说的系统一:不是在前馈神经网络中执行固定数量的步骤,而是能够通过找到满足若干约束条件、最小化某种能量或最大化某种似然的变量配置来进行推理。

The second one is learning to reason, going beyond Daniel Kahneman's system one: not going through a fixed number of steps in a feed-forward neural net, but being able to sort of reason, perhaps by finding a configuration of variables that satisfies a certain number of constraints, or minimizes some sort of energy, or maximizes some likelihood.

Speaker 5

第三点是学习规划复杂的行动序列。

And the third one is learning to plan complex action sequences.

Speaker 5

遗憾的是,关于这一点我没什么可说的。

And I don't have much to say about this, unfortunately.

Speaker 3

你注意到他提到‘能量’这个词了吗?

Did you notice he dropped the energy word in there?

Speaker 3

所以这是我们得到的第一个线索。

So that was the the first hint that we've got.

Speaker 3

这将是一场关于能量基础模型的演讲。

This is going to be a talk about energy based models.

Speaker 4

在我看来,这更像是从其他东西进行迁移学习。

To me, it looks more like transfer learning from another thing.

Speaker 4

是的。

Yeah.

Speaker 4

我认为是自监督学习。

I think self-supervised.

Speaker 3

我认为两者都是,因为自监督学习并不能解决样本效率问题。

I think it's both because self supervised learning doesn't help with the sample efficiency problem.

Speaker 3

它只是意味着别人替你做了这件事。

It just means that someone else does it for you.

Speaker 4

我认为有太多不同的研究领域,你可以把它们拆开来看,然后说,这就是迁移学习。

I think there's so many different areas of research that you can then pull apart and be like, it's transfer learning.

Speaker 4

哦,然后你有了所有这些任务,现在就成了多任务学习。

Oh, and then you have all these tasks, and now it's multitask learning.

Speaker 4

但别忘了其中任何一个任务。

But don't forget any of those tasks.

Speaker 4

现在这又成了持续学习。

Now it's continual learning.

Speaker 4

所以你需要一种方法来安排这些任务。

And then so you need a way of scheduling these tasks.

Speaker 4

现在这就是课程学习了。

Now it's curriculum learning.

Speaker 4

就好像有所有这些不同的领域,都与我们如何确切地利用自监督学习任务来更快学习这一理念相关。

It's like there are all these different areas that can relate to this idea of how exactly we're gonna use the self-supervised learning task to learn quicker.

Speaker 4

然后还有元学习。

And then meta learning also.

Speaker 4

你可以说有很多这样的小领域,它们看起来像是深度学习中的独立子集。

There's all these little things you can say that seem like their own subset of deep learning.

Speaker 4

我不太理解‘学习推理’这个概念。

I don't understand this idea of learning to reason.

Speaker 3

一般来说,人工智能系统被称为系统一。

Artificial intelligence systems, in general, are known as system one.

Speaker 3

丹尼尔·卡尼曼在他的《思考,快与慢》一书中谈到了系统一和系统二,系统一指的是那种非常自主的感知型任务。

Daniel Kahneman, in his book Thinking Fast and Slow, he talked about system one and system two, and system one was the very kind of autonomous perception type task.

Speaker 3

系统二指的是那些需要深度思考的任务,没有有意识的思考我无法完成。

System two would be the really deep thinking task that I couldn't do without consciously thinking about it.

Speaker 0

这可能更多是指人类会做和不会做的事情。

It's probably more in the reference to what a human does and doesn't do.

Speaker 0

我会将系统一任务描述为任何你可以用深度学习完成的事情,而系统二任务则是那些你实际上需要编写计算机程序来完成的事情。

I would describe the system one tasks as anything you would do with deep learning and the system two tasks as anything where you would actually write a computer program to do.

Speaker 0

这一点在约书亚·本吉奥同一场次的演讲中解释得更好,他基本上说系统一任务是直觉性的。

And this is better explained in Yoshua Bengio's talk in the same session, where he says basically system one tasks are intuitive.

Speaker 0

这是一种潜意识的,而非有意识的体验。

It's kind of subconscious and not really conscious experience.

Speaker 0

系统二的任务是可以用语言表述的。

System two tasks are ones that you can formulate with language.

Speaker 0

因此,你可以用语言来推理自己在做什么。

So you can reason with language what you are doing.

Speaker 0

所以系统二的任务需要运用逻辑来解决,而系统一的任务则只需学会将输入映射到输出即可。

So system two things would be where you would have to apply logic in order to solve a task, whereas system one tasks would be where you can just learn to map input to output.

Speaker 3

这很难厘清,因为从某种意义上说,深度学习模型确实具备推理能力。

It's quite difficult to wrestle with because in a sense, the deep learning models do reason.

Speaker 3

它们类似于计算机程序。

They are analogous to computer programs.

Speaker 0

是的。

Yeah.

Speaker 0

我认为系统一和系统二很大程度上是人类的概念,因为我认为卡尼曼和本吉奥都提到过,如果你反复练习,系统二的任务也可以变成系统一的任务,从而变得自动化。

I think it's very much a human concept, the system one and system two, because I think both Kahneman and Bengio talk about how a system two task can become a system one task if you simply repeat it and it kind of becomes automated.

Speaker 0

对吧?

Right?

Speaker 0

你学习开一条新路,反复多次后,它就变成了你的肌肉记忆。

You learn to drive a new road and you do it many times, and then it just kind of becomes into your motor memory.

Speaker 0

我认为这真的只是一个关于人类的概念,也是勒昆对他认为这些系统未来应具备能力的一种描述。

I think this is really just a human concept and kind of a description of LeCun of what he thinks these systems should be able to do in the future.

Speaker 4

那么,学习使用逻辑就不能被转化为一个可微分的损失函数吗?

Well, can't learning to use logic be made into a differentiable loss function?

Speaker 0

是的。

Yeah.

Speaker 0

可以,但到目前为止还没有人成功做到。

It can, but no one has done it very successfully so far.

Speaker 3

人们说,如果我们不懂,那就是人工智能。

People say it's AI if we don't understand it.

Speaker 3

也许情况确实有点像这样。

And maybe it's a bit like that.

Speaker 3

这是系统二,如果

It's system two if

Speaker 0

我们还做不到的话。

we can't do it yet.

Speaker 0

在层次化规划的世界里,你会说,我需要去超市买食物,然后将其分解为:我需要去车旁、开车去超市、然后下车。

In the hierarchical planning world, you would say something like, I need to get to the supermarket to buy food, and you would decompose that into I need to get to the car, I need to drive to the supermarket, and I need to get out.

Speaker 0

然后你再进一步分解每一个步骤,比如,我需要去车旁,这意味着我要拿钥匙、走到车旁、打开车门、坐进去。

And then you would decompose each of those again, you know, I need to get to the car, which means I need to grab my keys, walk to the car, open the door, sit in.

Speaker 0

因此,在规划领域中,这指的是这种层次化的分解。

And so in the planning world, that means the kind of hierarchical decomposition.

Speaker 0

你的系统二所做的,就是构建这些大型计划,然后逐层分解,直到系统一可以接手的层级。

The fact that what your system two does is it builds these big plans, and then it breaks them down until the level where the system one can take over.

Speaker 0

比如,走到车旁,是的,我知道怎么做。

Like walk to the car, yeah, I know how to do that.

Speaker 5

第三个是学习规划复杂的动作序列。

And the third one is learning to plan complex action sequences.

Speaker 5

遗憾的是,我对这个没什么可说的。

And I don't have much to say about this, unfortunately.

Speaker 5

那么,什么是自监督学习?

So what is self supervised learning?

Speaker 5

自监督学习就是学习填补空白。

Self supervised learning is learning to fill fill in the blanks.

Speaker 5

我们以一个视频为例。

Let's take an example of a video.

Speaker 5

机器假装不知道视频中的某一部分,然后训练自己根据已知的部分来预测它假装不知道的那部分。

The machine pretends not to know a piece of that video, and trains itself to predict the piece that it pretends not to know from the piece that it knows.

Speaker 5

比如,从过去预测未来,从底部预测顶部,预测相同帧,或者文本中缺失的单词,诸如此类。

So for example, predicting the future from the past, predicting the top from the bottom, predicting the same frames, things like that, or missing words in the text.

Speaker 5

这当然正变得越来越流行。

This is, of course, becoming very popular.

Speaker 3

简而言之,这就是自监督学习。

In a nutshell, that is self supervised learning.

Speaker 3

简单来说,就是能够假装不知道某些信息,然后预测那个信息本身、其周边内容,或者从底部预测顶部。

Just being able to pretend you don't know things, and predict either that thing or something in the vicinity, or the top from the bottom.
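To make "fill in the blanks" concrete, here is a deliberately tiny sketch: predicting a masked word from its left neighbor using bigram counts. The corpus is made up, and real self-supervised systems learn representations rather than raw counts, but the training signal is the same idea, the data supervises itself:

```python
from collections import Counter, defaultdict

# Minimal "fill in the blank" example: predict a masked word from its left
# neighbor using bigram counts. The corpus is hypothetical.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def fill_blank(prev_word):
    """Predict the masked word that most often follows prev_word."""
    return bigrams[prev_word].most_common(1)[0][0]

# "the [MASK]" -> the most frequent continuation of "the" in this corpus
print(fill_blank("the"))   # cat
```

Note that the distribution over continuations of "the" has several possibilities, which is exactly the multimodality issue raised later in the talk.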

Speaker 0

嗯,他给自己留了相当大的回旋余地,以便将几乎所有事情都纳入这个框架,因为现在我可以这么说:监督式标注任务其实就是我不知道输入中的某部分——也就是标签。

Well, he leaves himself quite a bit of wiggle room to formulate pretty much everything into that, because now I can say, okay, a supervised labeling task is simply: I don't know part of the input, which is the label.

Speaker 0

K均值聚类问题其实就是我不知道聚类分配结果。

The k means clustering problem is simply I don't know the cluster assignment.

Speaker 0

所以我认为这里的定义很宽泛,而且是故意宽泛的,以便你之后可以将其表述为基于能量的方法,因为它是包罗万象的。

So I think the definition here is broad and is intentionally broad such that you can formulate it later as energy based methods because it's all encompassing.

Speaker 0

对吗?

Right?

Speaker 5

我正在预测缺失的帧,诸如此类,或者文本中缺失的词语,这种预测必须是多模态的。

I'm predicting missing frames, things like that, or missing words in the text that the prediction must be multimodal.

Speaker 5

不存在单一的预测能与视频的初始片段保持一致,视频的多种未来都是可能的。

There's no single prediction that would be consistent with an initial segment of video; multiple futures of the video are possible.

Speaker 5

因此,我们不能仅仅使用一个神经网络,也就是这里用圆角蓝色块表示的确定性函数G(X),它只做出单点预测。

So we cannot use just a neural net, which is basically a deterministic function, symbolized by this sort of rounded-shape blue block here, G of X, which makes a single point prediction.

Speaker 5

我们必须用能够做出多重预测的东西来替代它。

We have to replace this by something that can make multiple prediction.

Speaker 5

实现这一点的一种方法是,通过某种隐式函数来度量我们观察到的变量x与需要预测的变量y之间的兼容性。

And one way to do this is to go through some implicit function that basically measures the compatibility between the variable we observe x and the variable we need to predict y.

Speaker 3

现在我们开始切入正题了。

Now we're getting into the meat of it.

Speaker 3

现在他引入了基于能量的模型,并直接向我们介绍了这种新型因子图,它是传统概率因子图与确定性函数的结合体。

Now he's introducing energy based models and straight away he's telling us about this new type of factor graph, which is a combination of the old school probabilistic factor graph, but now with deterministic functions as well.

Speaker 3

他正在引入这个观点,即我们需要进行多模态预测。

And he's introducing this idea that we need to have multimodal predictions.

Speaker 3

因此,我们需要能够根据能量函数给出多种预测的函数。

So we need to have functions that can give us many, many predictions subject to an energy function.

Speaker 3

这个关于信号和标签的能量函数需要被优化并保持平滑,使得在流形上或附近的正确标签具有低能量,而远离流形的任何错误标签则具有高能量。

And this energy function of signals and labels needs to be optimized and smooth, such that the correct label on or near the manifold has a low energy, and any Ys that are away from the manifold have a high energy.

Speaker 4

是的。

Yeah.

Speaker 4

确实,我觉得思考GANs很有意思,我认为StyleGAN2模型的一个关键点在于它们如何引入随机噪声。

Definitely. Like, thinking about GANs, I think that's a big thing in the StyleGAN 2 model: how they put in that random noise.

Speaker 4

这些噪声直接注入到中间特征中,帮助模型实现更丰富的生成效果——它不再是确定性生成器,即每次对z采样都产生完全相同的面孔。

It just gets injected into the intermediate features, and that helps it do more; it's not like a deterministic generator where, for every sampling of the z, it produces the exact same face.

Speaker 4

你可以通过将这个潜在向量z添加到预测过程中来实现这一点。

And you can do that by adding this sample latent vector z into the forecast.

Speaker 5

因此,这个函数f(x, y)在x和y相互兼容时取值较低,在y与x不兼容时取值较高。

So this function f of x, y will take low values if x and y are compatible with each other and higher values if y is incompatible with x.

Speaker 5

例如,如果它不是视频的合理延续。

If it's not a good continuation for the video, for example.

Speaker 5

我在这里使用的符号与图模型中的因子图非常相似,只是额外加入了确定性函数的符号。

The symbolism I'm using here is very similar to factor graphs in graphical models except for this extra symbol of deterministic function.

Speaker 5

现在,我将提倡使用基于能量的模型,它本质上通过这个能量函数来衡量x和y之间的兼容性。

Now I'm going to advocate to use energy based models, which, you know, basically measure the compatibility between x and y through this energy function.

Speaker 5

同样,当x和y兼容时能量值较低,不兼容时则较高。

Again, it takes low values if x and y are compatible and higher values if they're not.

Speaker 5

推理过程是,对于给定的x,找到能使能量最小化的y。

Inference is performed, for a given x, by finding the y's that minimize the energy.

Speaker 5

可能对应多个y。

It could be multiple y's.

Speaker 5

这是一种无需借助概率就能处理不确定性的方法。

And this is a way of handling uncertainty without resorting to probabilities.
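A minimal sketch of what this means in code. The energy function below is invented for illustration (compatibility here means y is close to x squared); the point is only that it is an unnormalized score, low for compatible pairs, and that inference is a search for the minimizing y:

```python
# A toy unnormalized energy function: E(x, y) is low when y is a "compatible"
# continuation of x. Compatibility here is a made-up rule: y should be x**2.
# Note there is no normalization anywhere; energies need not sum to 1.
def energy(x, y):
    return (y - x**2) ** 2

# Inference: for a given x, find the y that minimizes the energy.
x = 3.0
candidates = [i * 0.5 for i in range(0, 41)]   # candidate y's in [0, 20]
y_star = min(candidates, key=lambda y: energy(x, y))
print(y_star)   # 9.0 -- the most compatible y for x = 3
```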

Speaker 3

所以,无需借助概率?

So without resorting to probabilities?

Speaker 0

是的。

Yeah.

Speaker 0

所以,这里的区别在于,你只是想要一个函数,它能告诉你何时对某个输入感到满意,这种满意会用一个较低的数字来表示,而不满意时则用较高的数字。

So again, the difference here is that you simply want a function that tells you when it is happy with some input, and it indicates that by a lower number than when it is not happy.

Speaker 0

较高的数字就表示不满意。

That would be indicated by a higher number.

Speaker 0

然后他基本上说的是推理。

And then he basically says inference.

Speaker 0

现在,如果我得到一个能量函数,对吧,那么我可以——假设我有一个视频,我想预测下一帧,并且我得到了一个能量函数,我能做的就是简单地找到在给定输入下最小化该能量函数的帧。

Now, if I'm given an energy function, right, so if I have, let's say, a video and I want to predict the next frame, and I am given an energy function, what I can do is simply find the frame that minimizes that energy function given my input.

Speaker 0

所以这基本上是对他所做工作的重新表述。

So that's kind of a reformulation of what he's doing.

Speaker 0

尽管他接下来要讨论的模型并非都符合这个特定标准。

Though not all the models that he is gonna talk about fit that particular criteria.

Speaker 3

是的。

Yeah.

Speaker 3

起初,我觉得他提出可能存在多个可能的下一帧视频画面这种说法很傻,但他可能是在谈论电脑游戏之类的东西,如果你以不同的方式与之互动,那么下一帧就会不同。

At first, I thought that it was silly suggesting that there were multiple possible video frames that come next, but he might be talking about a computer game or something where, if you interact with it differently, then the next frame will be different.

Speaker 3

但正如你所说,这是一个非常通用的框架。

But but this is a very general framework as you say.

Speaker 3

所以他的意思是,看看这个能量曲面,然后直接最小化。

So he's saying, look at this energy surface and just minimize.

Speaker 3

也就是说,在所有那些点中,找到最小的那个,并将其作为我预测的下一帧。

So across all of those points, find the one which is the smallest, and use that as my predicted next frame.

Speaker 0

但即使对于视频来说,我也明白你的意思。

But even for the video, I get what you mean.

Speaker 0

你的意思是,在已录制的视频中,只有一个下一帧。

You mean that in the video that was recorded, there is only one next frame.

Speaker 0

是的。

Yeah.

Speaker 0

但如果你只是剪掉最后部分,只看这个序列,你就无法知道摄像师是要往左转还是往右转。

But if you simply cut away the last part and just look at the sequence, you don't know whether the camera person is going to veer left or right.

Speaker 0

所以在真实世界中,存在多种可能的后续情况。

So there are multiple, in the true world, there are multiple continuations that are possible.

Speaker 0

你的能量函数就是要捕捉这一点。

Your energy function is supposed to capture that.

Speaker 0

当然,你会通过提供样本来训练它,但你希望它能泛化,告诉你真实世界中存在这么多可能的后续,而其他那些纯粹是胡言乱语的后续则不好。

Of course, you're going to train it by giving it the samples, but you hope that it is going to generalize into telling you there are all of these continuations that are possible in the real world, and all of these other continuations that are just gibberish are not good.

Speaker 4

当y的取值变得如此高维,可以对应大量不同的y值与我们的x配对时,我们该如何采样?如何用我们的能量函数来搜索这些y值?

So once y becomes this high-cardinality, you know, it can take on tons of different values to pair with our x, how do we sample, how are we gonna search for the y's with our energy function?

Speaker 4

是的

Yep.

Speaker 4

我认为这就是z发挥作用的地方。

That's I think that that's where the z comes into play.

Speaker 0

而且我认为,他几乎把机器学习中的所有内容都重新表述为能量函数的形式。

Well, I think he's gone on to rephrase pretty much everything in machine learning in terms of an energy function.

Speaker 0

我认为,如何实现这一确切目标,正是每个模型一直以来试图解决的问题。

And I think the question of how we do this exact thing is what every model has ever attempted to do.

Speaker 0

对吧?

Right?

Speaker 0

而且他将讨论一些具体的例子。

So and and he's going to talk about specific ones.

Speaker 0

但归根结底,这正是该方法本身所要解决的问题。

But ultimately, that is the the problem of the method itself.

Speaker 0

为什么y会有如此高的基数?

Why is y so high cardinality?

Speaker 0

你打算怎么做这件事?

How are you going to do this?

Speaker 0

可以通过训练一个生成器来实现,对吧,让它直接预测一个能量较低的y。

It could be by training a generator, right, to simply predict a y that has a low energy.

Speaker 0

也可以通过梯度下降进行优化,找到能量最低的那个y。

It could be by really optimizing with gradient descent to find the y that has the lowest possible energy.

Speaker 0

还可以采用消息传递的方法。

It could be by a message passing method.

Speaker 0

可能是通过很多方式实现的。

It could be, you know, via many things.

Speaker 5

如果F在y空间中是平滑的,就可以通过基于梯度的优化算法或其他一些推理方法来完成。

If f is smooth in y space, this can be done through gradient-based optimization algorithms or some other inference methods.

Speaker 5

当然,如果y是离散的,就容易得多,我们不必处理那个问题。

Of course, if y is discrete, it's much easier and we don't have to deal with that.

Speaker 0

我的意思是,他说如果y是连续的,我们可以通过基于梯度的优化方法找到一个好的y。

I mean, he says if y is continuous, we can find a good y through gradient based optimization method.

Speaker 0

这正是我们所说的。

That's what we said.

Speaker 0

如果我们有一个能量函数,就可以直接用梯度下降来最小化它。

If we have an energy function, we can just minimize it using gradient descent.
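A sketch of that gradient-based inference, using the same kind of made-up smooth energy; the model parameters stay fixed and only y is updated:

```python
# If the energy is smooth in y, inference can be done by gradient descent on y
# (model parameters held fixed). The energy function here is illustrative.
def energy(x, y):
    return (y - x**2) ** 2

def grad_y(x, y, eps=1e-5):
    # numerical (central-difference) gradient of the energy with respect to y
    return (energy(x, y + eps) - energy(x, y - eps)) / (2 * eps)

x, y = 3.0, 0.0                 # start inference from an arbitrary y
for _ in range(200):
    y -= 0.1 * grad_y(x, y)     # descend the energy surface in y

print(round(y, 3))              # 9.0 -- the minimum-energy prediction
```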

Speaker 0

然后他说,如果 y 是离散的,那当然要简单得多。

Then he says, If y is discrete, of course, that's much easier.

Speaker 0

我不确定,因为通常离散优化问题很难。

I am not sure, because usually discrete optimization problems are hard. Yeah.

Speaker 0

所以一方面,如果你考虑一个监督分类问题,就可以这样来表述。

So on the one hand, if you think of a supervised classification problem, then you can phrase it like this.

Speaker 0

这样一来就非常简单了。

And then it's super easy.

Speaker 0

你只需尝试每一个类别,哪个类别的能量最低,就意味着哪个类别的可能性最高,会被输出为标签。

You just try every one of the classes, and whichever one has the lowest energy, meaning whichever one has the highest likelihood, you're gonna output as the label.
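That enumeration strategy fits in a few lines. The per-class energies here are hypothetical:

```python
# With a discrete y (e.g. class labels), inference is exhaustive: evaluate the
# energy for every class and return the argmin.
def classify(x, energy_fn, classes):
    return min(classes, key=lambda c: energy_fn(x, c))

# Hypothetical per-class energies for one input.
energies = {"cat": 0.3, "dog": 1.7, "bird": 2.4}
label = classify(None, lambda x, c: energies[c], list(energies))
print(label)   # cat -- the lowest-energy (most compatible) class
```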

Speaker 0

但在类似语言模型这样的场景中,这极其困难,因为你必须尝试每一个可能的句子,找出能量最低的那个。

But in something like a language model, it is extremely hard, because you're gonna have to try every single possible sentence and find the one with the lowest energy.

Speaker 0

我对他说的内容感到困惑。

I'm confused by what he said.

Speaker 3

但如果y是离散的,这是否意味着能量函数f(x,y)不是平滑的?

But if y is discrete, does that imply that f, the energy function, f of x y is not smooth?

Speaker 0

这是个难题,就像是一年级的数学题一样。

It's a difficult question. These are like first-year math questions.

Speaker 3

我不太确定,这并不意味着那样。

I'm not so sure; it doesn't imply that.

Speaker 0

这不可能,这不可能,这绝对不行。

It can't it can't be it cannot.

Speaker 0

嗯,这取决于它是如何定义的。

Well, it depends on how it is defined.

Speaker 0

如果它定义在连续空间上,但集合y恰好是离散的,那么它可以是平滑的。

If it is defined on the continuous space, but simply the set y happens to be discrete, then it can be smooth.

Speaker 0

但如果它定义在离散集合上,那么我非常确定平滑性就没有意义了。

But if it is defined on the discrete set, then I'm pretty sure smoothness makes no sense.

Speaker 4

我也是这么想的。

That's what I think too.

Speaker 4

如果是离散的,你只是在点与点之间跳跃,没有任何连续性。

If it's discrete, you're just jumping from point to point with no connection at all.

Speaker 3

但函数本身可能在学习某种插值,只是可能不光滑。

But the function itself might be learning some kind of interpolation, but it might not be smooth.

Speaker 5

推理方法。当然,如果y是离散的,就容易得多,我们不必处理那个问题。

Inference methods. Of course, if y is discrete, it's much easier and we don't have to deal with that.

Speaker 5

基于能量的模型分为条件版本和无条件版本。

There are conditional and unconditional versions of energy based models.

Speaker 5

在条件版本中,变量x是始终已知的,而y是需要预测的那个。

In conditional version, the variable x is the one that's always known and y is the one that sort of needs to be predicted.

Speaker 5

无条件版本的技巧在于训练机器根据y的其他部分来预测y的某一部分,但我们永远不知道哪部分是已知的,哪部分是未知的。

The unconditional version, the trick here is to train the machine to predict part of y from other parts of y, but we never know which one is known and which one is unknown.

Speaker 5

所以这大致捕捉了变量之间的相互依赖关系,正如左下角的图示所象征的那样,它代表了这种情况下的能量函数,类似于k均值,训练样本绘制在这条紫色小曲线上。

So this is sort of capturing the mutual dependencies between the variables, as symbolized by the drawing here on the bottom left, which represents an energy function, in this case akin to k-means, where the training samples are drawn on this little purple curve.

Speaker 3

因为我认为在几乎所有的机器学习应用场景中,它都是条件性能量模型,因为我们学习的是x和y之间的依赖关系。

Because I think in almost all machine learning use cases, it is a conditional EBM because we're learning a dependency between x's and y's.

Speaker 0

是的。

Yeah.

Speaker 0

我是说,他哈哈,他立刻就把这跟k均值算法联系起来了。

I mean, he goes and immediately makes the connection to k-means.

Speaker 0

对吧?

Right?

Speaker 0

这又回到了我们之前所说的,它最终将涵盖几乎所有的机器学习,而这里我们看到它是如何涵盖k均值的。

And that's again, we were saying before, this is going to ultimately encompass pretty much every all of machine learning, and here we see how it encompasses k means.

Speaker 0

所以在k均值中,我仅仅是想通过计算这些均值,创建一个对数据分布中任何点都满意的函数。

So in k means, I simply want to create a function that is happy with any point in the data distribution by doing these means.

Speaker 0

但这本质上只是在学习数据的分布,因为我在k均值中没有进行归一化,所以它是一种基于能量的方法,而非概率方法。

But ultimately this is just learning the distribution of the data; because I'm not normalizing in k-means, it is an energy-based method and not a probabilistic method.
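The k-means-as-energy view can be sketched directly: the energy of a point is its squared distance to the nearest centroid, a minimum over the latent cluster assignment, with no normalization anywhere. The centroids below are made up:

```python
# k-means viewed as an energy-based model: the energy of a point is its
# squared distance to the nearest centroid. Minimizing over which cluster the
# point belongs to is a minimization over the latent variable z. There is no
# normalization, so this is an energy, not a probability density.
centroids = [(0.0, 0.0), (10.0, 10.0)]   # made-up cluster centers

def kmeans_energy(point):
    return min((point[0] - cx) ** 2 + (point[1] - cy) ** 2
               for cx, cy in centroids)

print(kmeans_energy((1.0, 0.0)))    # 1.0  -- near a centroid: low energy
print(kmeans_energy((5.0, 5.0)))    # 50.0 -- far from both: high energy
```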

Speaker 0

因此,在这里你可以看到基于能量的方法和概率方法之间的区别。

So here you can see the difference between energy based and probabilistic.

Speaker 0

对于概率方法,这可能是高斯混合模型。

For a probabilistic method, this might be a Gaussian mixture model.

Speaker 0

但即使高斯混合模型使用的是高斯分布,其全局归一化通常也困难得多。

But the Gaussian mixture models are usually much harder to globally normalize even though it's Gaussian.

Speaker 0

所以这仍然相当简单。

So it's still pretty easy.

Speaker 5

不错。

Cool.

Speaker 5

处理多输出的一种方法是使用潜在变量。

So one way to handle multiple outputs is to is through the use of a latent variable.

Speaker 5

如果我们用确定性函数构建机器,那么让机器对单一输入产生多个输出的方法,就是通过潜在变量来参数化输出集合。

So if we're going to build our machine out of deterministic functions, the way to allow the machine to produce multiple outputs for a single input is to parameterize the set of outputs through a latent variable.

Speaker 5

因此,典型的架构会类似于这样。

So the typical architecture would look something like this.

Speaker 5

你有一个输入变量x,它经过一个预测器,提取出该x变量的表示。

You have an x variable that goes through a predictor that extracts a representation of that x variable.

Speaker 5

这个表示与一个潜变量一起通过解码器,从而产生预测。

And that representation together with a latent variable go through a decoder, which produces the prediction.

Speaker 5

当你让潜变量在一个集合上变化时,它会使预测在一个相似维度的集合上变化。

When you vary the latent variable over a set, it makes the prediction vary over a set of similar dimension.

Speaker 5

当然,关键在于找到、构建并训练这个机器,使得潜变量能够代表输出变化的独立解释因素。

And the trick of course is to find, build the machine and train it in such a way that the latent variable represent independent explanatory factors of variation of the output.

Speaker 3

所以他的意思是,我们如何能采用一个确定性模型?

So he's saying how can we take a deterministic model?

Speaker 3

这有点像变分自编码器,因为想象一下你有一个数据流形,你想设计一种架构,使得潜在变量能够描述该流形的定义域。

This is kind of like variational autoencoders because imagine you have some data manifold and you want to design an architecture such that the latent variable can describe the domain of the manifold.

Speaker 0

是的。

Yeah.

Speaker 0

这几乎就是变分自编码器的示意图了。

This almost exactly a drawing of a variational autoencoder now.

Speaker 0

而且,没错,基本上就是通过这个潜在变量来控制输出的生成方式。

And, yeah, it's basically you have this latent variable that controls how the output is made.

Speaker 0

所以你正在生成的是一个完整的流形,这在某种意义上,又是一件非常相似的事情,你只是在学习从潜在变量到流形的这种映射。

So what you're producing is an entire manifold, which in some sense, again, is a very similar thing where you're learning just this mapping from the latent variable to the manifold.

Speaker 0

所以那将是一种不依赖于x的条件。

So that would be sort of not conditional on the x.

Speaker 0

所以那可能是一个之前的模型,它只是一个关于y的函数f。

So that might be a model before where it's just an f of y.

Speaker 3

有意思。

Interesting.

Speaker 3

你是说,需要最小化潜变量的信息容量。

You're saying that the information capacity of the latent variable needs to be minimized.

Speaker 3

否则,所有的信息都会进入其中。

Otherwise, all of the information would go into that.

Speaker 4

那么,编码器接收数据并将其放入这个向量中,然后随机变量是如何在进入解码器之前被添加进去的呢?

So the encoder takes the data and puts it into this vector, and then so how is the random variable added to that before it hits the decoder?

Speaker 0

所以,随机变量是从潜变量所描述的分布中采样得到的。

So the random variable is sampled from the distribution that is described by the latent variable.

Speaker 0

现在,你无法通过这个进行反向传播。

Now, you can't backpropagate through this.

Speaker 0

对吧?

Right?

Speaker 0

你无法通过参数化分布并从中采样的操作进行反向传播,但针对某些分布,采样操作等同于——这正是关键所在——你实际上是从一个均值为零、标准差为一的高斯分布中采样。

You can't backpropagate through the operation of parameterizing a distribution and sampling from it, but for certain distributions, sampling from it is equivalent to the following, and this is exactly where this comes in: what you technically do is you sample from a Gaussian with zero mean and a standard deviation of one.

Speaker 0

然后你乘以编码器给出的标准差,并加上编码器给出的均值。

And then you multiply by the standard deviation that comes in from the encoder and you add the mean that comes in from the encoder.

Speaker 0

而这个操作是可以进行反向传播的。

And that operation you can backpropagate through.

Speaker 0

这就是你如何将潜在变量(即来自标准高斯分布的样本)与编码器提供的信息结合起来的方法。

So that is how you combine the latent variable, which is the sample from the Gaussian distribution, the standard Gaussian, with what the encoder gives you.

Speaker 0

这被称为重参数化技巧。

It's called the reparameterization trick.
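A minimal sketch of the reparameterization trick as just described; mu and sigma would normally come from the encoder, and the values used here are placeholders:

```python
import random

# The reparameterization trick: instead of sampling z directly from
# N(mu, sigma^2) (an operation you cannot backpropagate through), sample
# eps ~ N(0, 1) and compute z = mu + sigma * eps, which is a deterministic,
# differentiable function of mu and sigma.
random.seed(0)

def sample_latent(mu, sigma):
    eps = random.gauss(0.0, 1.0)   # noise from a fixed standard Gaussian
    return mu + sigma * eps        # differentiable in mu and sigma

# mu and sigma are placeholders for encoder outputs.
z = sample_latent(mu=2.0, sigma=0.5)
print(z)   # one sample from N(2.0, 0.25), via the reparameterized form
```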

Speaker 0

我刚开始读博士时,这个技巧非常重要。

It was very big when I started my PhD.

Speaker 5

现在,许多基于能量的模型实际上都是使用潜变量构建的,你可以通过边缘化或对潜变量进行最小化,将一个带有潜变量的基于能量的模型简化为不带潜变量的模型。

Now, many energy based models are actually built using latent variables and you can reduce a latent variable energy based model to one that doesn't have one by either marginalizing or minimizing with respect to the latent variable.

Speaker 5

因此,推理过程当然是通过同时对目标变量 y 和潜变量 z 最小化基本能量函数来实现的。

So inference, of course, takes place by minimizing the elementary energy function with respect to both y and z, the variable to be predicted and the latent variable.

Speaker 5

你可以通过最小化基本能量函数 e 关于 z,或者通过边缘化(这等价于计算此处所指的某种自由能)来重新定义能量函数 f,即对指数负能量的积分取对数,其中积分在 z 的定义域上进行。

You can simply redefine the energy function f by minimizing the elementary energy function e with respect to z or by marginalizing, which is equivalent to computing some sort of free energy as indicated here, the logarithm of the integral of exponential minus the energy where the integral takes place over the domain of z.
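Written out, the two reductions LeCun describes are usually given as follows, where beta is an inverse temperature and the minimization is recovered as beta goes to infinity (a sketch of the standard forms, not a quote from the slides):

```latex
F_\infty(x, y) = \min_{z} E(x, y, z)

F_\beta(x, y) = -\frac{1}{\beta} \log \int_{z} \exp\bigl(-\beta \, E(x, y, z)\bigr) \, dz
```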

Speaker 5

当你确实拥有一个带潜变量的 EBM 时,如果你希望

When you do have a latent variable EBM, if you want

Speaker 3

仅用 x 和 y 来表述它,而不使用潜变量,你可以选择对潜变量进行最小化,或者进行边缘化。

to formulate it just in terms of the x and the y without the latent variable, you can either minimize with respect to the latent variable or you can marginalize.

Speaker 3

而边缘化就是对 z 的整个定义域进行求和,我们之前讨论过吉布斯分布,就是对 z 的所有可能取值进行求和。

And marginalizing is where you kind of sum over, you know, we were talking about the Gibbs distribution earlier, over the domain of zed.

Speaker 3

因此,你实际上是在对所有可能的 z 组合进行求和,然后再进行归一化。

So you're kind of summing over all of the possible combinations with the z and then normalizing.

Speaker 0

是的。

Yeah.

Speaker 0

所以这里有一个非常简单的例子,好吧,虽然它不完全符合x和y的情况,但如果我们只有一个f(y),这又回到了k均值和高斯混合模型之间的区别。

So here, a very simple example of this, okay, it doesn't fit the x and y, but if we just had an f of y, would be, again, the distinction between k-means and a Gaussian mixture model.

Speaker 0

两者都是聚类模型,但一个是硬分配。

Both are clustering models, but one is just the hard assignment.

Speaker 0

所以上面这个例子适用于k均值算法,其中你的潜在变量是数据点所属的聚类。

So the top one would be an example for k means, where your latent variable is the cluster that the data point comes from.

Speaker 0

因此当你有一个新数据点,并询问你的能量函数(即k均值函数)对这个数据点的满意度时,它会找到最近的聚类中心,然后到该聚类中心的距离就是能量值。

So when you have a new data point and you ask your energy function, your k means function, how happy are you with this data point, what it will do is it will find the closest cluster center, and then the distance to that one cluster center is the energy.

Speaker 0

这就是模型对这个特定数据点的满意程度。

That's how happy the model is with this particular data point.

Speaker 0

所以它只能告诉你这些。

So it can only tell you that.

Speaker 0

但如果你有一个高斯混合模型,并且有一个新的数据点,你需要遍历该混合模型的每一个分量,并询问它们:你给这个数据点分配了多大的概率密度?

But if you have a Gaussian mixture model and you have a new data point, you need to go through every single component of that mixture and ask them what probability density do you assign to that data point?

Speaker 0

然后你需要对所有分量进行积分,才能得到整个模型对你的数据点的看法。

And then you need to integrate across all of them in order to get an answer of what the whole model thinks of your data point.
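The contrast between the hard minimum and the marginalization can be sketched numerically; the mixture below is made up, and the marginalized ("free") energy is the negative log of the integrated density:

```python
import math

# Contrast with the k-means hard minimum: a Gaussian mixture marginalizes over
# the latent component, weighting every component by its density at the point.
# The mixture parameters here are made up for illustration.
means = [0.0, 10.0]
weights = [0.5, 0.5]
sigma = 1.0

def gmm_energy(y):
    density = sum(w * math.exp(-(y - m) ** 2 / (2 * sigma ** 2))
                  / (sigma * math.sqrt(2 * math.pi))
                  for w, m in zip(weights, means))
    return -math.log(density)     # marginalized ("free") energy

print(gmm_energy(0.0))   # low energy near a component mean
print(gmm_energy(5.0))   # much higher energy between the modes
```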

Speaker 3

没错。

Exactly.

Speaker 3

所以接着刚才说的。

So just just to carry on from that.

Speaker 3

所以我们这里看到的这个 f,现在展示了整个流形。

So this f that we see here, this is now showing us the the entire manifold.

Speaker 3

它在一定程度上展示了这个流形在潜在变量所有点上的表现方式。

So it's kind of showing us how this is represented over all of the points of the the latent variable.

Speaker 4

那么 f 无穷大和 f 贝塔分别代表什么?

So what's the f sub infinity and f sub beta represent?

Speaker 0

在我的例子中,上面是 k 均值的代价函数,下面是高斯混合模型的代价函数,或者在这个情况下是能量函数。

In my example, the cost function of k-means on top and the cost function of, like, a Gaussian mixture model on the bottom, or the energy function in this case.

Speaker 0

它是在对所有这些进行积分。

It's integrating across them.

Speaker 0

对吧?

Right?

Speaker 0

它是在遍历每一个Z变量,它会获取那个特定Z的能量,并试图用该能量进行加权。

It's going through each value of the variable zed, and it takes the energy of that particular zed and tries to weight it by that energy.

Speaker 0

所以,这更像是对所有不同Z值进行积分。

So it's more like an integration across all of the different zeds.

Speaker 0

这就像,也许你可以把它解释为在一场扑克游戏中,你有一个同花听牌,然后你问自己,我应该跟注吗?我应该为了继续游戏而跟注吗?

It's like, maybe you can interpret it as in a poker game: you have a flush draw and you ask yourself, should I call in order to continue?

Speaker 0

你想要做的是考虑所有可能的未来,也就是所有可能的潜在变量。

What you want to do is you want to think of all the possible futures, which are all the possible latent variables.

Speaker 0

所以潜变量就是下一张牌是什么。

So the latent variable is which card comes next.

Speaker 0

所以你需要对所有可能性进行积分。

So you want to integrate across all of that.

Speaker 0

而对于每一种情况,你都需要问自己:我对这个特定结果有多满意?

And for each of these, you want to ask yourself how happy am I with that particular outcome?

Speaker 0

这将是该情况下一个合适的能量函数。

And that would be an appropriate energy function for that.

Speaker 0

而如果你玩的是国际象棋,那么你不需要想出任何特定的走法。

Whereas if you played the game of chess, then you don't need to come up with any particular move.

Speaker 0

你只想知道对手的最佳走法是什么,以及我的走法与之相比如何。

You just want to know what's the best move my opponent plays and how is my move compared to that.

Speaker 0

是的,对于每个Y,你都要找到最小的Z,也就是对手能做出的最佳回应走法,然后你想据此找到自己的最佳走法。

Right, for each Y, you're gonna find the minimum Z, which is the best move your opponent can play in response, and you want to find your best move according to that.

Speaker 5

这个潜变量的信息容量必须受到限制或正则化,这是我稍后要讨论的一个主要问题。

The information capacity of this latent variable must be limited or regularized, and this is a main issue that I will discuss later.

Speaker 5

但这可能最终会变得不切实际、难以处理,或者只能通过变分方法进行近似。

But this may turn out to be impractical or intractable, or only approximated through variational methods.

Speaker 5

举个例子,假设潜在变量的数据流形是一个椭圆。

So an example of a latent variable: let's say the data manifold is an ellipse.

Speaker 5

当我们找到一个数据点时,需要通过找到流形上离它最近的点来计算其能量,从而测量到该流形的距离。

When we find a data point, we need to compute its energy by finding the point on the manifold that is closest to it, so that we measure the distance to the manifold.

Speaker 5

而潜在变量就是导致该点的角度,即流形上最近点的角度。

And the latent variable would be the angle that leads to the point, the closest point on that manifold.

Speaker 5

在这个简单案例中,当然你可以明确地写出来,但在更复杂的情况下,我们显然需要找到这个流形,而其参数化并非易事。

Now in this simple case, of course, you can write it explicitly, but in more complex cases we obviously need to find this manifold, and the parametrization is not trivial.
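
LeCun's ellipse example can be written down concretely. A hypothetical sketch, using a simple grid search over the angle as the (normally non-trivial) inference step over the latent variable:

```python
import numpy as np

def ellipse_energy(y, a=2.0, b=1.0, n_angles=1000):
    # The manifold is the ellipse (a*cos z, b*sin z). Inference over
    # the latent z is a search for the closest manifold point; the
    # energy of y is its distance to that point.
    zs = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    pts = np.stack([a * np.cos(zs), b * np.sin(zs)], axis=1)
    dists = np.linalg.norm(pts - y, axis=1)
    i = int(np.argmin(dists))
    return float(dists[i]), float(zs[i])  # (energy, inferred latent angle)
```

A point on the ellipse gets energy zero; a point off it gets its distance to the manifold, with the latent angle recovered as a by-product of inference.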

Speaker 3

我认为这实际上非常有启发性。

I thought this is really instructive actually.

Speaker 3

现在我喜欢数据流形这个类比,因为很多人可能根本不会想到他们的狗狗图片其实是拟合在某种高维流形上的,但实际上确实如此。

Now I like the analogy that there is a data manifold, because a lot of people probably don't even think of their pictures of dogs as something sitting on some kind of high dimensional manifold, but of course they do.

Speaker 3

尽管这是一个人为构造的例子,但它表明在这个椭圆案例中,你的潜在变量z实际上是一个角度。

And even though it's a contrived example, it's showing that your latent variable z is actually an angle in in this ellipse case.

Speaker 3

所以椭圆就是这个流形,而你的潜变量只是一个可以用来将数据点推送到这个流形上任意位置的东西。

So the ellipse is the manifold, and your latent variable is just something that you can use to push your data point onto any position on this manifold.

Speaker 3

如果出现一个新的样本,它的能量就简单地取决于它距离流形有多近。

And if a new example comes along, its energy is simply a function of how close it is to the manifold.

Speaker 3

所以如果它位于流形上,能量就是零。

So if it's if it sits on the manifold, the energy is zero.

Speaker 3

而如果这个新样本位于流形之外,那么能量就会更高。

And if the if the new example sits away from the manifold, then the energy will be higher.

Speaker 0

是的。

Yeah.

Speaker 0

所以引入这个潜变量,仅仅是为了能够拥有不止一个能量为零的点。

So the introduction of this latent variable is simply to be able to have more than one point where the energy is zero.

Speaker 0

然后,这些点中的每一个都会被分配一个不同的潜变量。

And then each of these points will have a different latent variable assigned.

Speaker 0

我确实最喜欢那句话:能量就是到椭圆的距离。

And I do like that line the most, the one that says energy is the distance to the ellipse.

Speaker 0

对吧?

Right?

Speaker 0

所以在这种情况下,你的能量函数就是你的点距离椭圆有多远。

So your energy function is how far away is your point to the ellipse in this case.

Speaker 4

所以这个隐变量实际上并不像一个椭圆。

So really, though, it's not like an ellipse.

Speaker 4

它更像是这样一条弯弯曲曲的线,你知道,不是那种可以通过旋转角度就能持续命中点的东西。

It's like this squiggly line, you know, not something like the angle, where you could just rotate it and keep hitting points.

Speaker 4

对吧?

Right?

Speaker 4

它会不会像一个波浪形的圆呢?

Wouldn't it be like a, you know, squiggly shaped circle?

Speaker 0

嗯,现在要看情况了。

Well, now it depends.

Speaker 0

对吧?

Right?

Speaker 0

在这种情况下,你的真正能量函数是椭圆。

Your true energy function is the ellipse in this case.

Speaker 0

现在,如果你只有若干数据点,并希望学习一个能量函数,你可能会通过某种插值方法来近似它。

Now if you just have a sample of data points and you want to learn an energy function, what you're going to do is probably approximate that through some interpolation.

Speaker 0

于是你得到了一个学习到的能量函数,那么是的,它就会像你所说的那种波浪形的圆。

And then you have a learned energy function, and then yes, that would be, like, your squiggly circle.

Speaker 0

没错,但我想他想表达的是,这里的真正能量函数应该是椭圆本身,作为一个流形概念。

It's true, but I guess what he wants to say is that your true energy function here would be the ellipse itself, as a concept, as a manifold.

Speaker 0

现在,我想他稍后会讨论我们如何学习能量函数,以及对比方法。

And now, I think later he's gonna talk about how do we learn energy functions and there's the contrastive methods.

Speaker 0

然后还有一些方法是对事物进行正则化的。

And then there are the methods where you regularize things.

Speaker 0

在这种情况下,如果我已经知道它是一个椭圆,我可以正则化我的模型,规定它只能生成椭圆。

And in this case, if I already know it's an ellipse, I could regularize my model to simply say you're only allowed to produce ellipses.

Speaker 0

那么,如果给我这组数据点,它就不会变成一个歪歪扭扭的圆圈。

And then if I'm given this set of data points, it would not turn out to be a squiggly circle.

Speaker 0

实际上它会非常接近这个椭圆。

It would actually come to be pretty close to this ellipse.

Speaker 4

对我来说,有点像,我在整个深度学习的学习过程中一直看到数据流形这个术语。

For me, kind of like, I've been seeing this term data manifold, like, throughout my entire study of deep learning.

Speaker 4

我觉得这张图终于帮我理解了数据流形到底是什么。

I feel like this picture is finally helping me understand what the heck a data manifold is.

Speaker 4

所以它就像是连接数据之间的这条高维空间中的路径。

So it's like it's like this path in this high dimensional space that is connecting the data to each other.

Speaker 0

是的

Yeah.

Speaker 0

流形只是"子空间"的一种花哨说法,但通常子空间会让人联想到线性或其他类似的东西。

Manifold is just a fancy way of saying subspace, but usually subspace is associated with it being linear or something like this.

Speaker 0

但流形可以是任何你想要的形式。

But manifold can be whatever you want.

Speaker 0

它就是,嗯,这里有数据,那里也有数据,数据在这里到处都是,或者可以出现在任何地方。

It just says, well, here's data and here's data, and data is everywhere here, or can be anywhere here.

Speaker 3

这是一个美妙的想法,因为流形在各个地方都会出现。

It's a beautiful idea because manifolds come up all over the place.

Speaker 3

例如,如果你对向量进行 L2 归一化,那么它们都会存在于一个叫做单位超球面的流形上。

If you, for example, do an L2 normalization on your vectors, then they all exist on a manifold called the unit hypersphere.

Speaker 3

因为如果你仔细想想,它就是所有长度为一的可能向量的集合。

Because if you think about it, it's all the possible vectors that have a length of one.

Speaker 3

在机器学习中,有很多例子都是固定流形的情况。

And there are many examples in machine learning where it's a fixed manifold.

Speaker 3

如果你看地理数据,例如,流形就是一个球面,而卷积神经网络(CNN)则是在平面流形上工作。

If you look at geo data, for example, the manifold is a sphere, and CNNs work on the planar manifold.

Speaker 3

实际上,如果你仔细想想,在某些更高维的空间中,存在一个几乎所有数据都位于其上的流形,而且许多类型的分析和机器学习,甚至像t-SNE和UMAP这样的方法,都是从数据所在的流形角度来思考的。

And, actually, if you think about it, in some higher dimensional space there exists a manifold that almost all data sits on, and many types of analysis and machine learning, even methods like t-SNE and UMAP, think about data in terms of the manifold that it sits on.

Speaker 4

是的。

Yeah.

Speaker 4

不过我觉得,在高维意义上,它看起来太复杂了,这样思考的意义何在呢?

I guess, though, it just seems to me like in the high dimensional sense, it's so complicated that, what's the point in thinking of it like that?

Speaker 4

比如,如果它是一个单位超球面,比如说像L2归一化的参数向量,它们具有如此巨大的维度,我只是不明白这样思考的意义何在。

Like, if it's a unit hypersphere of, say, L2 normalized parameter vectors that have this massive dimensionality, I just don't get what the point is of thinking about it like this.

Speaker 0

嗯,这实际上与能量函数有非常紧密的联系,通常你假设你的流形在某种程度上是连续且平滑的,而能量函数基本上就是你到那个流形的距离,是一一对应的。

Well, it's actually a very close connection to energy functions, in that usually you assume your manifold is somewhat continuous and smooth, and the energy function is basically one to one with your distance to that manifold.

Speaker 0

所以它之所以极其通用,原因与能量函数极其通用是一样的。

So it is incredibly general for the same reason that the energy function is incredibly general.

Speaker 0

它只是简单地说,当数据良好时能量函数表现良好,当数据不佳时则不然。

It simply says the energy function is happy when the data is good and not when the data isn't.

Speaker 3

我真的很想找出一些例子,说明这种思路在我们感兴趣的模型上是如何运作的,因为这个流形是人为构造的。

I'm I'm really interested to come up with examples of how this works on the kind of models that we get excited about because this is a contrived manifold.

Speaker 3

当我们谈论自然语言处理和BERT这样的东西时,甚至语言本身也位于某个高维流形上,我认为思考哪些例子位于这个流形上、哪些不位于其上,以及能量如何围绕这些例子被提升以进行学习,是非常有启发性的。

When we talk about something like natural language processing and and BERT, even language fits on some higher dimension manifold, and I think it's quite instructive to think of examples that do and don't sit on that manifold, and energy is being pushed up around those examples as a way of learning.

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是,归根结底,我们可能永远无法直接描述这个流形本身,但我们可以构建这些能量函数,告诉你离流形有多远。

I mean, it comes down to the fact that we probably will never be able to describe the manifold itself, but what we can do is build these energy functions that tell you basically how far you are away from the manifold.

Speaker 0

所以,如果你能构建出这样的能量函数,至少你就可以通过让能量函数变低,以合理的概率接近这个流形。

So if you can build an energy function like this, then at least you can hit the manifold with a reasonable probability by simply making the energy function happy.

Speaker 5

不错。

Cool.

Speaker 5

而潜在变量就是指向流形上最近点的角度。

And the latent variable would be the angle that leads to the point, the closest point on that manifold.

Speaker 5

当然,在这个简单的情况下,你可以显式地写出它,但在更复杂的情况下,我们需要找到这个流形,而参数化绝非易事。

Now in this simple case, of course, you can write it explicitly, but in more complex cases, of course, we need to find this manifold and the parametrization is not trivial.

Speaker 5

好的。

Okay.

Speaker 5

那么,我们如何训练基于能量的模型呢?

So how do we train energy based models?

Speaker 5

我们需要确保数据样本的能量低于数据流形之外的能量。

What we need to do is make sure the energy for data samples is lower than the energy outside of the data manifold.

Speaker 5

为此有两种方法:对比方法会明确降低数据点的能量,同时提高数据流形之外点的能量,或者对这些点的提升较弱。

And there are two types of methods for this: contrastive methods, which explicitly push down on the data points and push up on other points outside the data manifold, or perhaps push up on those less strongly.

Speaker 5

另一种是正则化或架构方法,本质上是限制y空间中可取低能量的区域体积,从而自动地像收缩包装一样贴近数据流形,而无需主动提升能量。

And then there are regularized or architectural methods that essentially limit the volume of space in y-space that can take low energy, and therefore kind of shrink-wrap the data manifold automatically without having to push up.

Speaker 3

这里有一些非常有趣的地方。

There's some really interesting things here.

Speaker 3

当我们训练这个能量函数 f 时,我们希望对于给定的 y,其能量低于训练集中所有其他 y 的能量。

So when we train this energy function f, we want it to be a lower energy for the given y than all of the other y's in the training set.

Speaker 3

这看起来是合理的。

So that seems to make sense.

Speaker 3

我们希望函数足够平滑,以便能够使用基于梯度的方法。

We want the function to be smooth so that we can use gradient based methods.

Speaker 3

然后他谈到了两类学习方法,这部分变得非常有趣。

And then he talks about two classes of learning methods, and this is where it gets really interesting.

Speaker 3

他讨论了对比方法,以及所谓的正则化和架构方法,后者指的是像PCA、k均值聚类等等。

He talks about contrastive methods and so called regularized and architectural methods by which he means things like PCA and k means and and so on.

Speaker 3

我认为这里有一个真正的使命。

And there's a real mission here, I think.

Speaker 3

他并没有讨论传统的那种监督分类神经网络。

He doesn't talk about traditional kind of supervised classification neural networks.

Speaker 0

是的。

Yeah.

Speaker 0

所以,首先,它实际上比那还要更普遍。

So first, it is actually even more general than that.

Speaker 0

你不仅希望让你的数据点比数据集中的其他点能量更低,而是比任何其他点都低,对吧?

You don't just want to make your point have a lower energy than every other point in your dataset, but than any other point, right?

Speaker 0

这只是不同方法之间的区别,而且我们之前也打算讨论的,就是他们是如何提出这种对比度量的。

It is just the difference between the different methods, and that's what he was going to talk about as well: how they come up with this contrastive measure.

Speaker 0

所以如果你想想GAN,被压低的是真实数据集中的点,而被抬高的是生成器能想出来试图欺骗判别器的所有东西,对吧?

So if you think of a GAN, the points that are pushed down are the points in the true dataset, and the points that are pushed up are everything the generator can come up with to try to fool the discriminator, right?

Speaker 0

所以它甚至不完全是数据集本身的点。

So it's not even points in the dataset per se.

Speaker 0

另一件事是,你实际上可以以这种方式来思考传统的监督学习。

And the other thing is you can actually think of the traditional supervised learning in this way.

Speaker 0

所以,如果x是你的输入,y是你的标签,你在做什么呢?

So if x is your input and y is your label, what are you doing?

Speaker 0

你是在提升正确的标签,同时压低所有其他的,比如说所有逻辑回归的输出,因为你这是通过一个softmax分类器来运行的。

You're pushing up the label that is correct, and you're pushing down all the other logits, let's say, because you run this through a softmax classifier.

Speaker 0

所以,通过提升一个标签,你立刻就压低了其他标签。

So immediately by pushing one label up, you push the others down.

Speaker 0

这样你就得到了你的能量。

And there you have your energy.

Speaker 0

简单来说,这个的负值就是监督学习的能量函数。

Simply the negative of that is your energy function for supervised learning.
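
This softmax-classifier-as-EBM reading can be made explicit. A small sketch (my own illustration, not the speakers' code) treating the negative logit as the energy E(x, y):

```python
import numpy as np

def softmax_nll(logits, label):
    # Read E(x, y) = -logit_y: minimizing this loss pushes down the
    # energy of the correct label, while the log-partition term pushes
    # up the energies of all the other labels at the same time.
    energies = -np.asarray(logits, dtype=float)
    log_z = np.log(np.sum(np.exp(-energies)))  # log partition function
    return float(energies[label] + log_z)      # equals -log softmax(logits)[label]
```

The returned value is exactly the usual cross-entropy loss, just rearranged to expose the energy-plus-normalizer structure.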

Speaker 3

是的。

Yeah.

Speaker 3

所以,我们用一种智力框架来讨论许多机器学习算法中的现象,这真的很有趣。

So it's really interesting that we're using a kind of intellectual scaffold to talk about what's happening in a lot of machine learning algorithms.

Speaker 3

正如你所说,你刚才提到的方法中,能量被推高或推低,但杨立昆进一步指出,在这种架构方法中,这是一种略有不同的范式。

And as you say, the approach you just spoke about, energy is being pushed up and being pushed down, but Yann goes on to say that in this architectural method, it's a slightly different paradigm.

Speaker 3

也就是说,不是通过推高或推低,而是像收缩包装一样,将函数空间围绕流形包裹起来,这样就不需要推高了。

That's when, rather than pushing things up and down, you kind of shrink-wrap the functional space around the manifold so you don't need to push up.

Speaker 5

数据样本的能量低于数据流形之外的能量。

The energy for data samples is lower than the energy outside of the data manifold.

Speaker 5

对于这种对比方法,有两种类型:一种显式地推低数据点,推高其他点。

And there are two types of methods for this: contrastive methods that explicitly push down on the data points and push up on other points.

Speaker 5

我不会逐条阅读这些内容,但有一大类经典方法可以在这个框架下被解释为对比方法或架构方法。

I'm not gonna read through all of this, but there's a big list of classical methods that can be interpreted in this context either as contrastive methods or architectural methods.

Speaker 5

最大似然法在处理难以归一化的分布时,实际上属于对比方法的一部分。

Maximum likelihood, in the case of distributions that are not easily normalized, is actually a part of contrastive methods.

Speaker 5

这正是我首先想要讨论的内容。

And this is what I'm I'm gonna talk about first.

Speaker 5

因此,概率方法存在一个问题。

So there is an issue with probabilistic methods.

Speaker 3

这非常有趣,因为我们最近一直在大量讨论对比方法。

This is really interesting because we've been talking a lot about contrastive methods recently.

Speaker 3

几周前,我们讨论过用于强化学习的对比无监督表征,本质上是在谈像MoCo和SimCLR这样的方法。

We had the Contrastive Unsupervised Representations for Reinforcement Learning chap on the other week, and essentially that was talking about things like MoCo and SimCLR.

Speaker 3

这些方法会选取对比样本,对正样本对进行压制,同时对正负样本对进行提升。

And what those approaches do is they take these contrastive examples, and they kind of push down on the positive-positive pair, and they push up on the positive-negative pair.

Speaker 3

通过在函数空间中这样做,你实际上是在学习所有图像所处的流形位置。

And by doing so in the functional landscape, you're kind of learning where the manifold of all of the images is.

Speaker 4

那么,为什么我们要如此严格地限制潜在向量z的信息容量呢?

So why do we wanna limit the information capacity of that latent z vector so much?

Speaker 3

我认为这并不一定与此相关。

Well, I think that doesn't necessarily come into this.

Speaker 3

我认为z的作用是在你希望在整个流形上生成新样本时使用的。

I think the z is when you want to be able to generate new examples across the manifold.

Speaker 3

但在Siamese网络或SIMCLR类型的算法中,其实不需要潜在变量。

But I think in the case of these Siamese networks or the SimCLR type algorithms, there's no need for a latent variable.

Speaker 0

我认为这里的关键是,你的能量函数不应显式依赖于潜在变量。

Well, I think the thing here is that your energy function should not, let's say, depend explicitly on the latent variable.

Speaker 0

这意味着,如果你在玩扑克,你的出牌应该独立于下一张牌真正会是什么。

That means if you're in a poker game, your move should be independent of what the next card is really, truly going to be.

Speaker 0

如果你在下棋,你的能量函数仅由你所下的棋步决定,因为z只是对手的最小值。

And if you're in a chess game, your energy function is just determined by what move you're making, because the z is just going to be whatever is the minimum for the other player.

Speaker 0

我认为这仅仅是一种表达方式,意思是这些信息不应该泄露到你的代价函数中。

I think it's just an expression to say that this information shouldn't basically leak into this cost.

Speaker 3

有意思。

Interesting.

Speaker 3

他还谈到了BERT,我认为这是目前许多人非常感兴趣的:将流形外的点映射到流形上的点。

He talks about BERT as well, which I think is of huge interest to so many people at the moment, and that is mapping points off the manifold to points which are on the manifold.

Speaker 3

而那些流形外的点,不过是我们已知在流形上的点的噪声版本。

And the points that are off the manifold are just noised versions of ones that we know are on the manifold.

Speaker 3

所以轻微地将它们推离流形,这确实引发了关于应该将它们推离流形多远的问题。

So slightly pushing them off the manifold, it does raise questions about how far should they be pushed off the manifold.

Speaker 3

而如果你把它们推得太远,是否在某个时刻会变得有问题呢?

And if you push them too far, does that at some point become problematic?

Speaker 4

如果噪声强度太高,就意味着离流形太远了,大概是这个意思。

If the noise strength is too high, it's too far off the manifold, that kind of idea.

Speaker 3

是不是因为他举了混合模型的例子,这些模型实际上很糟糕,因为它们想要一个像峡谷一样的流形,一旦偏离流形,能量几乎就变得无限高。

Would it be because he gave the example of these mixture models actually being really bad, because they want to have a manifold which is like a canyon, almost infinitely high in energy as soon as you get off the manifold?

Speaker 3

我认为这里的总体讨论是,我们需要拥有能更好地泛化到未见数据的模型,并且具有平滑的能量函数,以某种方式描述流形,这种方式能代表我们数据分布中可能合理预期的任何类型数据。

I think that the general discussion here is that we need to have models that generalize better to previously unseen data and have smooth energy functions that describe the manifold in a way that represents any type of data that we could, you know, reasonably expect to see in our data distribution.

Speaker 3

但正如我们所知,深度学习的风险在于模型只会内插,而不会外推。

But just as we all know, because of the the perils of deep learning are that the models interpolate, they don't extrapolate.

Speaker 3

如果你在流形上制造一个凹陷,你希望的是在流形上形成大量凹陷,从而在你希望流形所在的位置周围形成一个峡谷。

If you make a depression in the manifold, what you want is lots and lots of depressions in the manifold to form a canyon around where you want your manifold to be.

Speaker 3

但如果凹陷很小,没有在你的流形周围形成一个良好的峡谷,如果问题空间过于稀疏,那么你实际上并没有学到关于流形的任何东西。

But if the depressions are small and they don't form a nice canyon around your manifold, if the problem space is too sparse, then you're not really learning anything about your manifold.

Speaker 0

是的,我认为这正是所有这类问题的症结所在。

Well, exactly. I think that is the exact problem with any of these things.

Speaker 0

因此,在监督学习方面,一方面,我们确切知道哪些东西需要被推高或压低。

So in supervised learning, on the one hand, we have exact knowledge of which things are there to push up and down.

Speaker 0

这就像随便什么,你的10个类别或者ImageNet中的一千个类别。

It's just whatever, your 10 classes or your thousand classes in ImageNet.

Speaker 0

但是,比如说,在完全无监督学习中,你必须考虑每一个可能存在的数据点。

But in, let's say, fully unsupervised learning, you have to consider every single possible data point there is.

Speaker 0

如果你只是想学习一个语言模型,就像我们一开始说的,你必须考虑每一个可能的标记序列。

If you just want to learn a language model, as we said at the beginning, you have to consider every possible sequence of tokens there is.

Speaker 0

而我们的模型就是没有能力去学习那个。

And our models just don't have the capacity to learn that.

Speaker 0

所以我们试图做的是,我们只是想告诉它们:看,这里有些地方不太对。

So what we're trying to do is we're just trying to tell them, look, here is something that's slightly wrong.

Speaker 0

请把它纠正过来。

Please make it correct.

Speaker 0

这样它就能做的,是在真正的语言内容和这些被破坏的输入之间形成一道沟壑。

So what it can do is form that valley between what's truly in the language and these corrupted inputs.

Speaker 0

我们只是希望这就足够了。

And we're just kind of hoping that that's enough.

Speaker 0

对吧?

Right?

Speaker 0

我们不考虑那之外的任何东西。

We don't consider anything outside of that.

Speaker 0

我们不考虑任何胡言乱语或任何其他东西。

We don't consider any gibberish or anything.

Speaker 0

是的。

Yeah.

Speaker 0

正如你所说,多远算太远,多近算太近。

As you said, how far is too far and how close is too close.

Speaker 0

所以如果我们现在开始只是替换单个单词,那甚至可能是一个原本正确的句子。

So if we now start just replacing single words, that could even be a sentence that is actually correct.

Speaker 0

而且,你知道,所以

And, you know, so

Speaker 3

让我们把BERT的类比再稍微推进一步。

Let's let's push the BERT analogy just a little bit further.

Speaker 3

所以我们已经学习了这个语言流形,并且使用一个BERT模型,它接收比如说500个标记,实际上少于500个,因为通常有一个分隔符标记。

So we've learned this language manifold, and with a BERT model that takes in, let's say, 500 tokens, it's less than 500 because, generally, there's a separator token.

Speaker 3

但该流形上的每一个点都是一段有效的语言。

But every single point on that manifold is a piece of valid language.

Speaker 4

所以我认为这又回到了课程学习中垫脚石的概念,我们有这些不同的流形,或者说这个高维空间中的子集,我们就像是把它交给模型,然后说:'来,试着理解这个庞大流形或者说整个空间的这个子集。'

So I think it comes back to this idea of stepping stones in curriculum learning and that we have these different manifolds or like subsets of this high dimensional space that we're like giving it and we're like here, try to understand this subset of this massive manifold or like entire space.

Speaker 4

所以你试图找到一条路径,将你看到的各个流形彼此连接起来。

So you try to find a path that connects the manifold that you're seeing to each other.

Speaker 4

所以,也许这就是为什么你会先进行一个预训练任务,然后接着下一个,再下一个,因为在这个庞大的空间中连接这些流形会更容易。

So maybe that's why like you do one pre training task and then the next one and the next one because it's easier to connect these manifolds in this like massive space.

Speaker 0

是的。

Yeah.

Speaker 0

我同意。

I I agree.

Speaker 0

因为你基本上所做的就是,通过每个预训练任务,你都在寻找一种不同的方式来定义流形上或流形外的点。

Because what you're basically doing is, with each pretraining task, you're finding a different way of defining points off and on the manifold.

Speaker 0

对吧?

Right?

Speaker 0

所以,如果你的预训练任务是遮蔽部分标记,那么你在流形之外找到的点就是那些仅仅缺少一些词语的句子。

So if you have the pretraining task of masking some of the tokens, the points that you find off the manifold are ones that are just missing some words.

Speaker 0

但另一个任务则是交换一些标记。

But then the other task is swapping some tokens.

Speaker 0

所以,这为你提供了另一种寻找流形之外点的方法,因为那些无效的负样本——想象一下,如果你的BERT输入仅仅是随机抽取500个标记,然后直接输入进去。

So that just gives you a different way of finding points off the manifold, because here's what's not gonna work as negative samples: just imagine if your BERT input were just 500 randomly sampled tokens, and then you give it that.

Speaker 0

然后你会说,你能多好地从这个重建这个训练样本呢?

And then you say, How well can you reconstruct this training example right here from this?

Speaker 0

这离流形太远了。

That is too far off the manifold.

Speaker 0

你打算怎么做?

How are you gonna do that?

Speaker 0

而且,这是一个有趣的点,因为现在你有了你的训练样本。

And also, this is an interesting point because now you have your training samples.

Speaker 0

所以从技术上讲,你应该输入并问:你能多好地重建我数据集中的任何一个样本?

So technically what you should do is you should give the input and say, How well can you reconstruct any of the samples in my dataset?

Speaker 0

任何一个。

Any.

Speaker 0

但因为我们取了一个样本并掩码了其中的词,我们可以相当确定流形上最近的点就是这个特定的样本。

But since we took one and we masked out words from this one, we can be pretty sure that the closest point on the manifold is actually that particular sample.
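
The corruption process being discussed can be sketched as a toy masking function; the `[MASK]` token and the 15% rate are the usual BERT conventions, used here purely as an illustration:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    # Corrupt a sentence by masking tokens at random. The result is a
    # point slightly off the data manifold whose nearest on-manifold
    # point is, with high probability, the original sentence.
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t for t in tokens]
```

The key property is the one made in the discussion: the corruption is mild enough that the training target (the uncorrupted sample) is almost surely the closest point on the manifold.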

Speaker 0

但如果你离流形更远,情况就不再是这样了。

But that's no longer the case if you go further off the manifold.

Speaker 0

你不再知道该用哪个训练样本了,因此在训练这个模型时会遇到问题,因为你不清楚应该让它朝哪个方向学习。

You don't know anymore which of your training samples to aim for, so you're gonna have a problem with training this thing, because you don't know what to train it towards.

Speaker 0

你必须让模型针对数据集中的每一个样本进行训练。

You would have to train it towards every single thing in your dataset.

Speaker 3

我对BERT也有些困惑,因为它有两个目标,第一个我认出是掩码或去噪自编码器,因为它说的是:这里有一个输入,我会随机掩码掉一些词,然后让你重建它。

I'm also a bit confused with BERT because it has two objectives, and the first one I recognize as being the masked or the denoising autoencoder because it's saying here is an input, I'm going to mask out some random words, and then I want you to reconstruct it.

Speaker 3

这是一种情况。

That's one thing.

Speaker 3

而下一个任务是下一句预测,这完全是另一回事。

And then the next thing is the next sentence prediction, which is completely different.

Speaker 3

这根本不是自编码器。

That's not an autoencoder.

Speaker 3

所以你是在动态地在这两个任务之间切换,难道这两个任务不会在内部构建出完全不同的流形吗?

So you're switching between those two tasks dynamically, and wouldn't those two tasks build a completely different manifold internally?

Speaker 4

也许下一句预测任务会把这些中间标记都抓取过来,以某种方式把它们全部聚拢在一起,我不太确定。

Maybe it the next sentence prediction task grabs all these intermediate tokens, pushes them all up together in a way I'm not sure.

Speaker 4

但也许关键在于形成,因为我的意思是,我只是在三维空间里思考这个问题。

But maybe it's about how things form. I mean, I'm just thinking about it in, like, a three-dimensional space.

Speaker 4

就像,有一个立方体,立方体里有数据的子集。

Like, there's this cube and there's subsets of data in the cube.

Speaker 4

如果我能抓住一大堆点并把它们一起移动,也许这比单独移动更有帮助,如果这说得通的话。

If I get to grab a whole bunch of points and move them together, maybe that helps more than just moving them individually, if that makes any sense.

Speaker 0

是的。

Yeah.

Speaker 0

我觉得这个类比相当好。那么,再说一次,作为输入。好。

I think the analogy is pretty good. So, again, as an input. Okay.

Speaker 0

我们的输入空间现在是两个连续的句子,对吧?

Our input space is now two consecutive sentences, right?

Speaker 0

我们已经讨论过,掩码语言模型只会告诉你这两个句子中哪个...这样你就可以将这两个句子重构为一个。

We've already discussed that the masked language model will simply tell you which of the two so you can reconstruct these double sentences as one.

Speaker 0

但现在你有了另一种从流形外取点的方式,那就是构建一个非连续的双句。

But now you have a different way of making points off the manifold and that is to construct a non consecutive double sentence.

Speaker 0

所以这只是另一种构建偏离流形点的方式。

So it's just another way of constructing points that are off the manifold.

Speaker 0

也就是说,你在学习一个能量函数,它表明如果两个句子相互衔接,你应该非常满意;如果两个句子不连贯,你就不应该满意。

So you're learning an energy function that says you should be very happy if two sentences follow each other, and you should be not happy if two sentences don't follow each other.

Speaker 0

因此,你正在为那些偏离流形的特定点学习一个能量函数,将这些点推高,同时将真正的双句对推低。

So you're learning an energy function for those particular points off of the manifold, push those up, and push the true double sentences down.

Speaker 0

并且你希望你的模型能够学到关于语言的一些有意义的东西。

And you hope that your model will learn something meaningful about language.

Speaker 3

但是,你不觉得这两个任务之间存在分歧吗?还是你认为它们

But don't you think that there is a divergence in the two tasks, or do you think that they

Speaker 0

是的。

yeah.

Speaker 0

对。

Yeah.

Speaker 0

这正是关键所在。

That's that's the point.

Speaker 0

对吧?

Right?

Speaker 0

关键是,每增加一个任务,目的都是找到更多位于流形之外的点。

The point is that with every additional task you introduce, you find more ways to find points off of the manifold.

Speaker 3

但在哪种情况下,你会认为一个任务能帮助另一个任务表现得更好?还是你觉得它们能互相促进?

But in which case do you think that one task makes the other one work better, or do you think that they help each other work well?

Speaker 0

是的。

Yeah.

Speaker 0

它们确实会共享特征。

They do feature sharing.

Speaker 0

最终目标就是让特征得以共享。

That's the ultimate goal: that features are shared.

Speaker 0

语言特征中有一些东西能够同时帮助这两个任务。

That there is something about language features that will help both tasks.

Speaker 0

通过平等对两个任务进行梯度下降,这些特征可能会发展得更好。

And by doing gradient descent on both tasks equally, these features might develop better.

Speaker 0

然后,相同的特征会迁移到你随后进行微调的其他任务上。

And then the same features will transfer to other tasks that you then fine tune on.

Speaker 5

这正是我首先将要讨论的内容。

Which is what I'm I'm gonna talk about first.

Speaker 5

概率方法存在一个问题,当然,你几乎总是可以通过吉布斯分布将一个能量函数转化为概率分布。

So there is an issue with probabilistic methods. You can, of course, almost always turn an energy function into a probability distribution using a Gibbs distribution.

Speaker 5

你可以使用最大似然估计,但如果你想获得密度估计,基本上就必须使用最大似然。

You can do maximum likelihood, but you basically have to do maximum likelihood if you want estimates of densities.

Speaker 5

问题是,估计密度并不一定是个好主意,因为通过最大似然,系统会试图给数据点赋予尽可能低的能量,而给数据流形外的点赋予尽可能高的能量,这会导致系统形成极其深而窄的峡谷。

The problem is that estimating densities is not necessarily a good idea, because by doing maximum likelihood, what the system wants to do is give the lowest possible energy to data points and the highest possible energy to points just outside of the data manifold, which leads the system to create extremely deep, narrow canyons.
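
The Gibbs distribution LeCun refers to, with an inverse-temperature parameter beta, can be sketched in a few lines (my own illustration). Large beta piles probability mass onto the minimum-energy point, producing exactly the sharp, canyon-like concentration he warns about:

```python
import numpy as np

def gibbs(energies, beta=1.0):
    # Turn energies into probabilities: p_i proportional to exp(-beta * E_i).
    # Large beta (low temperature) concentrates mass on the minimum-energy
    # point; small beta flattens the distribution toward uniform.
    p = np.exp(-beta * np.asarray(energies, dtype=float))
    return p / p.sum()
```

This also illustrates the temperature point raised below: how steep the resulting landscape is depends entirely on beta.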

Speaker 5

而这些峡谷对于推理来说并没有特别的用处。

And those are not particularly useful for inference.

Speaker 5

我们需要的是平滑的函数。

We need smooth functions.

Speaker 5

因此,这些函数需要通过先验或其他正则化方法进行正则化。

So those functions would need to be regularized, for example, by a prior or another regularizer.

Speaker 3

所以我的第一个问题是,尽管我看到了峡谷,但在我看来它很平滑。

So my first question is even though I see the Canyon, but it looks smooth to me.

Speaker 0

我想他是想用这些超级陡峭的墙壁来演示,但我明白他的意思,不过看起来这似乎只是那个温度参数的一个特性,这些东西最终会变得多平滑、多陡峭。

I think he wants to demonstrate with these super steep walls, and I get what he's saying, but it seems to be just, you know, a property of that temperature parameter, how smooth and how steep these things are really gonna turn out to be.

Speaker 0

我认为这些东西的主要问题在于,是的,你必须进行全局归一化,这意味着输入空间中的每一个点都必须被赋予某个非零概率,这可就难办了。

I think the main problem with these things is that, again, you have to globally normalize, which means that every single point in your input space must be assigned some nonzero probability, and good luck with that.

Speaker 3

因为它在最后一个要点上提到了,但既然如此,为什么要使用概率模型呢?

Because it it says on the last bullet point, but then why use a probabilistic model?

Speaker 3

所以我认为,显然,Yann LeCun 并不属于概率模型阵营。

So I think, clearly, Yann LeCun is not in the probabilistic models camp.

Speaker 0

是的。

Yeah.

Speaker 0

可能不是。

Probably not.

Speaker 4

对。

Yeah.

Speaker 4

这只是因为你不想对分母上所有那些项进行求和吗?

Is that just because you don't wanna have to sum over all those terms on the bottom, though?

Speaker 4

如果你有一个概率分布,并且你试图将大部分概率分配给其中一个,而其他部分则相应减少,这就会造成一个额外的深谷,我想。

If you have this probability distribution and you're, you know, trying to assign a bunch of probability to one point and take it away from the others, it causes this extra, like, deep valley, I guess.

Speaker 0

是的。

Yeah.

Speaker 0

我想通常发生的情况是,当你的数据点在这里时,它会尽量将大量概率质量分配给该点,这会自动减少其他地方的概率质量,包括该数据点周围的区域。

I guess what usually happens is that when your data point is here, it will just try to assign as much probability mass as possible to that point, and that will automatically take away mass from other places, including things around that data point.

Speaker 0

所以,如果你没有正确正则化,最终只会在数据点上得到狄拉克分布,而点与点之间什么都没有。

So if you don't regularize properly, you're just going to end up with, like, a Dirac distribution on your data points and nothing in between.

Speaker 0

但同样,我认为这取决于这个温度参数。

But again, I think it depends on this temperature thing.

Speaker 5

但那样我们就失去了实际估计密度的优势,我们不再估计密度了。

But then we lose the advantage of actually estimating densities, we're not estimating densities anymore.

Speaker 5

所以为什么不彻底放弃概率框架,直接用能量函数学习依赖关系呢?

So why not throw away the probabilistic framework altogether and just learn dependencies with an energy function?

Speaker 5

因此,放弃概率框架在某种程度上让我们能更自由地选择目标函数。

So throwing away the probabilistic framework sort of allows us to use more freedom in sort of deciding on what objective function to use.

Speaker 5

目标函数的特征是:它必须是数据点能量的增函数,同时是数据流形外点能量的减函数,并且可能通过某种依赖于这两类点的间隔来实现。

The characteristic of the objective function is that it must be an increasing function of the energy of data points and a decreasing function of the energy of points outside the data manifold, and perhaps through some sort of margin that depends on those two points.

Speaker 3

所以他是在说,让我们放弃概率框架。

So he's saying let's throw away the probabilistic framework.

Speaker 3

现在他引入了机器学习领域中常见的损失函数类型。

And now he's introducing the kind of loss functions that we see across the machine learning world.

Speaker 3

所以像合页损失,以及大概还有平方损失之类的函数都会包含在内。

So things like hinge loss and presumably things like squared loss would be in there.

Speaker 3

他所说的非常直观:如果样本远离数据流形,那么它的能量应该更高。

And what he's saying is quite intuitive: if the example falls away from the data manifold, then it should have a higher energy.

Speaker 5

是的。

Yep.

Speaker 5

没错。

True.

Speaker 5

但是,我们能否

But can we

Speaker 4

我们能不能先退一步,把话说清楚,这是在和概率方法做比较。

just, like, backtrack and be very clear that this is in comparison to probabilistic methods.

Speaker 4

具体区别到底在哪里?

What would where exactly is the difference?

Speaker 4

区别仅仅在于不需要对所有可能的 y 进行求和吗?

Is it just this idea of not summing over all of the possible y's?

Speaker 4

这就是关键区别吗?

Is that the key distinction?

Speaker 0

是的。

Yeah.

Speaker 0

事实上,你失去了对数据点可能性进行数值预测的能力。

The fact is here, you lose the ability to make a numerical prediction about how likely a data point is.

Speaker 0

你只是将其与其他数据点进行比较。

You simply compare it to others.

Speaker 0

现在你只能做到这些了。

That's all you can do now.

Speaker 3

这是否触及了频率学派与贝叶斯学派的核心区别?

Does this get to the nub of Frequentist versus Bayesian?

Speaker 3

贝叶斯方法是否包含这样一种观念:将事物置于所有可能发生的情况的背景下,从而能够讨论置信度和可能性等?

Is it that the Bayesian approach has this notion of seeing things in the context of all of the possible things that could occur, and therefore you can reason about confidence and likelihood, etcetera?

Speaker 0

我不确定。

I'm not sure.

Speaker 0

我认为频率学派仍然会归一化他们的分布,但可能可以类比为:概率模型总是将结果与输入的全局空间相关联。

I think a frequentist would still normalize their distributions, but it could maybe be compared to that, in that a probabilistic model will always put it in relation to the global space of inputs.

Speaker 0

而能量函数只是给你一个数值。

Whereas energy functions just give you the number.

Speaker 5

例如,对于这些能量函数,我不会详细展开,但多年来它们已被用于各种场景,比如孪生网络、度量学习、排序或嵌入。

For instance, for those energies, I'm not gonna go through the details, but they've been used in various contexts over the years, either for things like Siamese networks, metric learning, or for ranking or embedding.

Speaker 5

而最近,出现了一种目标函数,它使用的不是一对点,而是一个完整的集合。

And then more recently, there have been objective functions that use not just a pair of points, but a whole set.

Speaker 5

所以很明显,自监督学习如今已有非常成功的应用,尤其是在自然语言处理领域。

So obviously, there's very successful applications of self supervised learning today, in particular in the context of natural language processing.

Speaker 5

大家都知道BERT。

Everybody knows about the BERT.

Speaker 5

在此之前,有一类技术使用了某种形式的去噪自编码器,即你输入一个数据,将其破坏,然后训练系统区分干净版本和破坏后的版本。

This was preceded by the sort of techniques, which used a form of denoising autoencoder where you take an input, you corrupt it, and then you train the system to distinguish between the clean version and the corrupted version.

Speaker 5

在去噪自编码器中,你训练系统将破坏后的版本映射到干净版本。

In denoising auto encoder, you train the system to map corrupted version to clean versions.

Speaker 5

因此,现在对于被破坏点的重构误差,就是被破坏点与这个干净版本之间的距离。

Therefore, now the reconstruction error for corrupted points is the distance between the corrupted point and this clean version.

Speaker 5

所以,你自动获得了一个能量曲面,它会随着到流形(如右下角所示)的距离增大而增长。

And so you have automatically an energy surface that grows with a distance to the manifold as represented here on the bottom right.
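The "reconstruction error as energy" idea can be sketched on a toy manifold. The unit circle, the Gaussian corruption, and the perfect projection-based denoiser below are all stand-ins chosen so the geometry is obvious; a real denoising autoencoder would learn the mapping instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "data manifold": the unit circle in 2-D.
def project_to_manifold(y):
    return y / np.linalg.norm(y)

def corrupt(y, noise=0.3):
    """Knock a point off the manifold with Gaussian noise."""
    return y + rng.normal(0.0, noise, size=y.shape)

def denoise(y_corrupted):
    """Idealized denoiser: maps any point to its nearest manifold point."""
    return project_to_manifold(y_corrupted)

def energy(y):
    """Reconstruction error: distance between a point and its denoised version."""
    return float(np.linalg.norm(y - denoise(y)))

on_manifold = np.array([1.0, 0.0])
off_manifold = np.array([2.0, 0.0])
corrupted = corrupt(on_manifold)
print(energy(on_manifold), energy(off_manifold))  # 0.0 and 1.0
```

For this toy denoiser the energy is exactly the distance to the manifold, which is the surface described as growing away from the data.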

Speaker 5

这代表了由去噪自编码器产生的能量函数的基本梯度场的矢量场。

This represents the vector field of, basically, the gradient field of the energy function produced by the denoising autoencoder.

Speaker 3

所以这真的很有趣。

So this is really interesting.

Speaker 3

我认为正如我们之前所说,我们还没有真正可视化像BERT这样的东西。

I think as we were saying before, we haven't really visualized something like BERT.

Speaker 3

当然,这不是BERT的流形。

Of course, this isn't the manifold for BERT.

Speaker 3

这只是一个简单的流形,但它确实准确地展示了去噪自编码器在这里发生的情况。

This this is just a a simple manifold, but it does show exactly what's happening here with this denoising autoencoder.

Speaker 3

我们只是在寻找偏离流形的例子,并与位于流形上的例子进行比较。

We're just finding examples that are off the manifold and comparing to ones that are on the manifold.

Speaker 0

如果它真的是BERT的流形,那可就太有意思了。

It would be so funny if it actually is the manifold of BERT.

Speaker 0

就像,如果未来某个数学家证明人类语言是一种螺旋结构,那杨立昆简直会被奉为预言之神。

Like, if some mathematician in the future proves that human language is a spiral, it just would be like Yann LeCun would be treated as a god for predicting.

Speaker 0

我也就是在这儿瞎想想。

I'm just just imagining things here.

Speaker 0

是的。

Yeah.

Speaker 0

对。

Yeah.

Speaker 0

归根结底,就是这么回事。

Ultimately, it's exactly that.

Speaker 0

对吧?

Right?

Speaker 0

所以你需要找到一种方法,将点抛出流形,然后自动将它们映射回来,这就得到了能量函数。

So you want to find some method of throwing points off the manifold, and then mapping them back on automatically gives you this energy function.

Speaker 0

而且你找到将点从流形上移除的方法越多,你的能量函数就会越好。

And the more ways you find of knocking points off the manifold, the better your energy function is going to be.

Speaker 3

可能值得再讨论一下这张图。

And it it might be just be worth talking about this diagram again.

Speaker 3

所以他使用了演示开头提到的视觉概念,也就是自监督。

So he's using the visual concept from the beginning of the presentation, which is the self supervision.

Speaker 3

所以在这个空间中,x 有点不连贯,然后它进入一个确定性的预测器解码器。

So the the x is kind of disjointed in this space, and it goes into a deterministic predictor decoder.

Speaker 3

这个c是什么?

What's this c?

Speaker 3

它是在一个红色方框里。

Is this is in a red square.

Speaker 0

那是损失函数。

That's the loss.

Speaker 0

在这种情况下,它接收的是 y hat,也就是 y 波浪线。

In this case, it takes the y hat, or y tilde.

Speaker 0

我能从这里看到它。

I can see it from here.

Speaker 0

它接收的是BERT预测的y。

It is it takes the y that BERT predicts.

Speaker 0

对吧?

Right?

Speaker 0

BERT会给出它认为你在移除所有标记之前的句子是什么,然后将其与你实际移除所有标记之前的句子进行比较。在BERT的情况下,它会对每个标记应用一个分类损失。

BERT says here is what I think the sentence was before you took out all the tokens, and it compares it to the actual sentence before you took out all the tokens, and then it just applies in Bert's case, it applies a classification loss on each token.
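A rough sketch of that per-token classification loss, with a toy four-token vocabulary and hand-picked logits. None of this is BERT's actual implementation; it only shows the mechanics of comparing the predicted tokens to the originals:

```python
import numpy as np

def token_cross_entropy(logits, target_ids):
    """Mean cross-entropy over the masked positions.

    logits: (num_masked, vocab_size) predictions for each masked-out token.
    target_ids: the original token ids from before the corruption.
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# Two masked positions, vocabulary of 4 toy tokens.
logits = np.array([[4.0, 0.1, 0.1, 0.1],   # confident, correct prediction
                   [0.1, 0.1, 4.0, 0.1]])  # confident, correct prediction
targets = np.array([0, 2])
print(token_cross_entropy(logits, targets))  # small loss, near zero
```

In energy-based terms, this per-token loss is the energy: zero only when the decoder lands exactly back on the original (on-manifold) sentence.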

Speaker 3

我明白了。

I see.

Speaker 3

所以如果解码器在流形上预测了某个内容,那么它会降低能量吗?

So if the decoder has predicted something on the manifold, then it will push the energy down?

Speaker 0

如果解码器正好在流形上预测了那个内容,那么损失就会是零。

Well, if the decoder has predicted that exact thing on the manifold, then the loss will be zero.

Speaker 0

对吧?

Right?

Speaker 0

在这种情况下,你正在学习你的预测器和解码器函数,以使能量变低。

And you're learning in this case, you're learning your predictor and decoder function to make the energy low.

Speaker 3

是的。

Yeah.

Speaker 3

所以你可以看到这两个输入。

So so you can see the two inputs.

Speaker 3

所以它说这是提取的文本。

So it says this is a piece of text extracted.

Speaker 3

所以这是被破坏的版本,然后我们希望预测器和解码器能预测出未被破坏的版本。

So this is this is being corrupted, and and then we want the predictor and the decoder to predict the non corrupted version.

Speaker 3

如果比较器发现它们是相同的,就会在流形上该点的内部表示处降低能量。

And if the comparator finds them as being the same, it will push down the energy on the internal representation at that point in the manifold.

Speaker 3

是的。

Yes.

Speaker 4

所以这就像是注入了一个先验,比如:这是你可以将点抛出流形的方式。我在想,真正困扰我的是它如何映射到这个流形里,因为你是在同时训练特征,就像我们讨论的把它推出这个流形。

So it's like we inject this prior of, like, here's how you can throw points off the manifold. And I guess what's really bothering me in thinking about this is how it is mapped into this manifold, because you're training the features while simultaneously, like, talking about pushing it off this manifold.

Speaker 4

所以它是在学习这个流形,通过被告知什么在上面、什么不在上面,在整个训练过程中发生了如此大的转变。

So it's learning this manifold as it's being told what's on or off of it, and that transforms so much throughout the training.

Speaker 4

对吗?

Right?

Speaker 0

这里涉及多个流形吗?

Are there multiple manifolds going on here?

Speaker 0

对吗?

Right?

Speaker 0

我认为在这种情况下,在BERT的例子中,我们只是在考虑自然语言的流形,特别是双句的流形,但本质上是自然语言的流形。

I think in this case, in the BERT case, we're just thinking about the manifold of natural language, and specifically double sentences, but just the manifold of natural language.

Speaker 0

我们正在把它从流形中移开。

And we're throwing it off.

Speaker 0

我们只是通过破坏它,就把某些东西从流形中移开了。

We're throwing something off the manifold by simply corrupting it.

Speaker 0

然后我们学习这个解码器函数,将它映射回流形。

And then we're learning this decoder function to map it back onto the manifold.

Speaker 0

能量函数衡量的是与流形的距离。

And the energy function measures the distance to the manifold.

Speaker 0

对吧?

Right?

Speaker 0

在这种情况下,我们不需要学习这个能量函数,因为它已经给定了。

And in this case, we don't have to learn this energy function because that's a given.

Speaker 0

能量函数就是 y 和 y hat 之间的损失,但现在我们正在学习这个解码器和预测器模型,以最小化这个能量函数。

The energy function is the loss between the y and the y hat, but we're now learning this decoder and predictor model in order to minimize that energy function.

Speaker 0

所以就像我一开始说的,并不总是需要学习能量函数本身。

So it's as I said at the beginning, it's not always that you learn the energy function per se.

Speaker 0

你学习的是那个距离,但有时你实际上学习的是最小化距离的那个东西。

You learn that distance, but sometimes you actually learn the thing that minimizes the distance.

Speaker 3

把这些算法看作是表示学习算法很有意思,但延伸开来,它们也是流形学习算法。

It's interesting to think of these algorithms are representation learning algorithms, but they are also manifold learning algorithms by extension.

Speaker 3

大多数时候,我们并没有限制可以学习的流形类型。

And most of the time, we are not constraining the type of manifold that could be learned.

Speaker 0

我们在这里约束的是将事物从流形上移除的方式。

What we're constraining here is the way we throw things off the manifold.

Speaker 0

也就是说,我们这样做是因为如果不加以约束,我们就无法知道流形上原本是什么。

And we're doing that because if we didn't constrain that, we would have no idea what was on the manifold.

Speaker 0

因为我们能在这里训练任何东西的唯一原因是,我们知道这是流形上的数据样本,并且最接近那个被破坏的y。

Because the only reason we can train anything here is because we know this is the data sample that is on the manifold and that is closest to the y, to the corrupted one.

Speaker 3

但是,它学习的流形在训练过程中发生变化,这一点是否相关呢?

But would it is it relevant that the the manifold that it's learning is changing during training?

Speaker 4

是的,我明白。

Well Yeah.

Speaker 4

我懂了。

I get it.

Speaker 0

你指的是哪个流形?

Which manifold? You tell me.

Speaker 3

情况真的是这样吗?

Well, is that the case?

Speaker 3

当BERT模型收敛时,是否存在一个流形?

Is is it that when when the BERT model converges, there is a manifold?

Speaker 3

但这个流形会变化吗?

But does that manifold change?

Speaker 0

确实存在一个流形。

There is a manifold.

Speaker 0

自然语言的流形永远不会改变。

Manifold of natural language never changes.

Speaker 0

对吗?

Right?

Speaker 0

那只是自然语言的流形,而我们的训练集提供了这个流形上的数据点。

That is just the manifold of natural language, and we have data points on this manifold given in our training set.

Speaker 0

我们所做的就是知道这些点位于流形上,而我们能做的只是稍微扰动它们。

And all we're doing is we know that these points are on the manifold, and all we can do is we're throwing them off a bit.

Speaker 0

我们实际上并不确定,但我们认为我们是在扰动它们。

We don't actually know, but we think we're throwing them off.

Speaker 0

我们相当确定,因为我们屏蔽了这些词,这应该会给你一些不在流形上的东西,然后我们学习将它们映射回去。

We're pretty sure because we mask out these words, and that should give you something that's not on the manifold, and then we're learning to map them back.

Speaker 3

但在这一点上,我同意你的看法。

But on that point, I agree with you.

Speaker 3

很可能存在某种概念上的流形,能完美地代表英语语言。

There may well be some notional manifold which perfectly represents the English language.

Speaker 3

但神经网络中的每一层都将一个流形转换为另一个,我们可以很容易地将语言转换为一个球形流形,只需进行L2归一化即可。

But every single layer in a neural network transforms one manifold into another, and we could quite easily transform language into a spherical manifold just by doing an L2 normalization.
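The L2 normalization being described is a one-liner; this minimal sketch (with made-up embeddings) shows that every output lands on the unit sphere:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Map each embedding onto the unit sphere: every output has norm 1."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

embeddings = np.array([[3.0, 4.0], [0.0, 2.0]])  # arbitrary toy vectors
on_sphere = l2_normalize(embeddings)
print(np.linalg.norm(on_sphere, axis=-1))  # [1. 1.]
```

This is the sense in which the "manifold of language" is not unique: a single deterministic layer like this reshapes whatever manifold the previous layer produced.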

Speaker 3

那么BERT的所有输入都将存在于这个球面上。

So then all of the inputs to BERT would exist on on the sphere.

Speaker 3

所以我想说的是,语言可能存在于许多不同的流形上,而对于BERT这样的典型神经网络,流形在训练过程中会发生变化和演化吗?

So I guess what I'm saying to you is that there are many possible manifolds that language could exist on, and and given a typical neural network for BERT, does the manifold change and evolve during training?

Speaker 3

我认为确实如此。

I think it does.

Speaker 0

BERT所代表的那个肯定是这样的。

The one that it that the BERT represents for sure.

Speaker 0

是的。

Yeah.

Speaker 3

没错。

Yeah.

Speaker 4

是否也值得考虑它在通过网络时也在一定程度上缩小了流形的维度?

Is it also worth thinking that it's shrinking the dimensionality of the manifold as it goes through the network too, a little bit?

Speaker 4

所以,也许把它从流形上抛出去,是在试图告诉我们,在压缩过程中要保持它们某种程度上的重叠?

And so maybe, like, throwing it off the manifold is trying to tell it as we compress it, keep them overlapping in some way?

Speaker 0

这是个好问题。

That's a good question.

Speaker 0

BERT 在一开始能生成的东西,那个空间是更高维的,还是只是不同?

Is the things that BERT can produce at the beginning, is that space somehow higher dimensional, or is it just different?

Speaker 0

对吧?

Right?

Speaker 0

这是个很好的问题,真的是个好问题。

It is a good it is a really good question.

Speaker 0

所以,如果这是它应该学习的语言流形,那么在开始时,它是在学习输出所有可能的东西,还是只是在学习输出其他东西,而我们只是在这么做?

So if this is the manifold of language that it's supposed to learn, at the beginning, is it learning to output every possible thing, or is it just learning to output something else and we're just kinda doing that?

Speaker 0

我完全不知道。

I have no idea.

Speaker 0

谁知道呢?

Who knows?
