The function approximator is doing some form of unrolled planning internally, and it does it during inference. So essentially, is meta-learning a planning algorithm? How do you think about that? Because I can imagine that meta-learning a planning algorithm, in the same way that you meta-learn RL, might actually be a much more powerful technique.
Great idea. And I think it should be better, yes. I do not believe that MCTS is by any means the end point of planning methods. I find it almost unfortunate that we're still using it after all this time, and that we haven't yet been able to build systems that learn how to search more effectively. We've had some little probes in that direction; things like MCTSnets were an attempt to learn how to do better. But I think absolutely, the future of planning will be systems that learn to plan for themselves. And to some extent, if you have an RNN now, all systems are already doing this: any recurrent system is learning some kind of feedback on its actions and its past history.
Do you think this is just a matter of balancing compute during training versus inference, where at inference time we need to give it more planning compute than we do during training? Or do you think we just need deeper networks, so that we can have a deeper unrolled planner? Essentially, this is kind of what I'm trying to build, so I'm curious if you have any intuitions.
Yeah. We faced a lot of those trade-offs ourselves, and I think it comes down to really understanding the scaling laws. You can have one scaling law for what happens if you increase the size of your network, and another scaling law for what happens if you use the same compute to do more planning steps. The question is, what is the relative gradient of those scaling laws? We use that to guide the decision of which is most effective. And you can imagine that it takes longer to learn the planning, but you had better learn the planning algorithm, so you might not even discover a good planning algorithm for a while.
Yeah. Can I ask you about the second lesson you had, the one where you were talking about deletions and so on? The takeaway, right, is: bigger model, more RL, and it will eventually close lots of those holes. Is that because, and this is not to say that Go or Atari or whatever is a simple environment, but it is much simpler than something like the real world, and I very much subscribe to the big world hypothesis. Do you think we can expect the same trend even in much, much more complex environments, where we know we're not going to be able to learn nearly everything?
Yeah. I think we're not going to be able to close every hole, because the world is too big. So inevitably there are going to be some inaccuracies in any model the system builds of how the world works. However, I think it's reasonable to expect that with more RL we should do better and better at closing the holes, so the system should keep improving. I don't think it will ever reach the point where it has a perfect understanding of every aspect of the world, because that would violate the big world hypothesis.
I guess maybe a better way to ask the question is: do you think more RL and bigger models will lead to more graceful degradation when we encounter those holes?
Yes, I do.

Okay. Yeah, thank you.
I have a question about the math at the end of the talk. Did the system have any way of explicitly learning abstractions? When the system discovers, say, a lemma, does it memorize the lemma, or is it all implicit?
Yeah, it's a good question. We explored a bunch of methods that tried to do that, and I would say most of those have been left for future work. At the moment, each problem is being solved largely independently of the others. So most of what's happening is being learned in the weights, rather than, which I guess is what you're asking, memorizing particular bits of specific knowledge and being able to pull those in. That's largely future work, I would say.
So the system is solving the problem at the level of formal mathematics, step by step, using some kind of learned...
Yeah. And I think some of what you're describing happens in the weights. When the model has seen a large number of examples, each of which requires the same kind of lemma to be used, that gets internalized into the weights. But so far we didn't have an explicit mechanism for it.
Yeah. Thank you.
So you mentioned that AlphaProof takes three days, versus humans taking about nine hours. I was wondering what the majority of those three days is spent on. What are the main challenges there, and how do we solve them?
I think the three days could be dramatically improved; we already have ideas that we think will make it much faster. So I don't think we should over-index on the fact that it took a long time. A lot of this work was put together to try and see if we could do something on this particular IMO, and I would hope that we can do things significantly faster in future competitions, or with future methods we want to work on. In terms of particular challenges, I think the hardest is probably a particular kind of problem: the very open-ended problems that you find in combinatorics. Our system is remarkably consistent in solving algebra and number theory. But these very open-ended questions touch on a much broader range of possible intuitions than we've currently managed to encode into the system. So I think there's a big challenge there: can we improve its strength across that dimension?
So is the solution kind of to show more examples of such problems, or some kind of better search algorithm? What's...
I think the problem comes at every step. For example, formalizing these open-ended problems is also harder. Even for humans, formalizing that particular problem I showed, Turbo the snail, is actually very hard; it took humans quite a long time, whereas the other problems were fairly straightforward. Similarly, it is harder for our system to formalize those kinds of open-ended problems, so I think we probably ended up with fewer good examples in our data. And in addition, they're also harder for a machine to solve and get a handle on. So at every level they're a little bit trickier for the kind of approach we took. It will be interesting to see what happens in the future.
I have follow-up questions. When we try to compare humans with AlphaProof, you were saying humans take nine hours to achieve this result, and we achieved it in three days. If you want a direct side-by-side comparison, how do you think we can quantify human intelligence in terms of GPU power? In papers we always talk about using such-and-such computing resources and comparing them. How do you think we can make it a fair comparison?
I think it's very hard, because these are just two completely different computational models. It's really hard to compare the amount of computation a human brain uses with some totally different system, with totally different hardware and software; every layer of the stack is somehow different. So what we've always done historically is to control the only thing we can control, which is wall-clock time. If you really wanted to ask whether this has been done fairly, I think you would say we should do it in the same amount of time as the humans: give humans three days and see what happens, or, the other way around, do it in two times four and a half hours, the way the humans do. At least that tells you clearly that you have two totally different systems. But given a fixed time constraint, have we reached the point where machines are able to achieve, in the same amount of time, the same results that a human could? I think that would be a fair comparison.
Sorry, let's give other people a chance to...
I wanted to ask you a question about the meta-RL work. In thinking about no-free-lunch theorems and results that say you can't do any better than a certain limit: I think that's not generally true, but it seems to me that making progress on meta-RL requires having the system learn something about the kinds of problems we're actually interested in solving. And that seems like a kind of prior training, similar to LLMs. I wonder whether you think there's any connection between what you learn when you're doing meta-RL, trying to get a good grasp of how the world works for that task, and what you learn when you're training an LLM on the texts that have been written, and whether one of those approaches is a better way to get to general reasoning, or whether they need to be combined.
Yeah. I think they're solving quite different problems, LLMs and this meta-learning approach. But what I would say is that what's beautiful about the meta-learning approach is that it tailors itself to the data. Maybe the algorithm we trained isn't good for some particular type of RL environment; but if you were to train in that kind of RL environment and add it into the mix, it would learn how to get better. It would come up with new algorithms tailored to those cases. Whereas at the moment we're in a world where a human has to do that process: as RL practitioners we know that RL algorithms aren't yet at the point where there's a one-size-fits-all approach, the way supervised learning has kind of reached that point. We don't have that in RL, so as humans we're continually tailoring the toolkit we have to the problem we're trying to solve. And now we have something here that addresses that in two ways. First, if we give it a broad enough range of problems, maybe it can find one size that fits all, or at least something that takes the particular problem it's trying to solve as context. And second, if we as humans really are prepared to spend more time on a particular problem in order to do better, well, let's continue to meta-train on that particular problem; we can take that human labor away and come up with something that's really specifically good at that kind of problem. So I think tailoring to the data is really important, because the data tells you what you really want, and we can learn a lot that way.
Do you think you're going to be able to find some set of problems that produces better generalization, or do you think it's a matter of coming up with something reasonably general and then fine-tuning the meta-RL over and over again whenever you get to a new class of problems you want to solve?
The result we've seen so far is that as you broaden the set of training environments we meta-train on, we get strictly improved held-out generalization to other environments.
Does it also improve the previous environments you've worked on, the earlier stuff in the train set?
It at least stays the same, and sometimes improves, depending on the environment. So it seems to be strictly...
So there's some generalization between sets of problems?
Yes, there's generalization. And so the hope would be that we now have a scalable formula that allows us to add more and more problems into the training set and come up with an RL algorithm that's more and more general across them. I don't want to claim that what we did would, for example, work for RLHF or something totally different that's really outside the distribution we used. But maybe if you start to add in those other cases, or do a separate pool of them, or however you want to use it, it gives us a toolkit to learn to do that ourselves. Yeah.
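The outer loop described in this exchange, meta-training one algorithm across a set of environments and checking held-out generalization, can be sketched with a toy stand-in. Here the inner "RL algorithm" is epsilon-greedy on bandit environments and the only meta-parameter is the exploration rate; the environments and candidate values are all invented for illustration and are not the actual method discussed.

```python
import random

random.seed(0)

def run_eps_greedy(arm_means, eps, steps=500):
    """Inner loop: run epsilon-greedy on one bandit environment
    and return the total reward collected."""
    n = len(arm_means)
    counts = [0] * n
    values = [0.0] * n
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(n)                       # explore
        else:
            a = max(range(n), key=lambda i: values[i])    # exploit
        r = random.gauss(arm_means[a], 1.0)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]          # running mean
        total += r
    return total

# Meta-training set: a distribution of bandit environments,
# plus one held-out environment never seen during meta-training.
train_envs = [[0.1, 0.9], [0.4, 0.5], [0.0, 0.3, 0.8]]
held_out = [0.2, 0.6, 0.7]

# Outer loop: pick the meta-parameter that maximizes average
# return across the training environments.
candidates = [0.01, 0.05, 0.1, 0.2, 0.4]
def meta_objective(eps):
    return sum(run_eps_greedy(env, eps) for env in train_envs) / len(train_envs)

best_eps = max(candidates, key=meta_objective)
print("meta-learned epsilon:", best_eps)
print("held-out return:", run_eps_greedy(held_out, best_eps))
```

Broadening `train_envs` corresponds to the "scalable formula" mentioned above: the more environments enter the outer objective, the more general the selected rule has to be.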
David, can you reflect on the reward-is-enough hypothesis? It sounded like you still believe in it a hundred percent, or has your view changed a little bit in the past couple of years?
Yeah, thank you. For those of you who don't know it, the reward-is-enough hypothesis was a hypothesis we wrote to challenge the community to think about what would be required to get to superhuman intelligence. The hypothesis is a deliberately challenging one: maybe all we need to do is put a powerful RL agent into a complex environment and ask it to maximize reward, and maybe that would be sufficient for intelligence, in all its complexity and beauty, to emerge. I would say the hypothesis is still a really powerful guiding light, and I use it as one. I don't know how far it will hold, or whether it will hold fully or perfectly, but that's not really the point of the hypothesis. The point is that it provides a sense of optimism and motivation: inasmuch as the hypothesis holds true, it gives us one really clear strategy to follow that will take us further and further towards AGI, towards human intelligence. And I think that's really exciting and inspirational, because if we don't have that, then we're kind of lost, I think. So having a strategy that may work, or may take us a really long way, is really important. I still believe in that guiding light.
You also asked whether my opinion has changed. I would say one thing that has maybe changed is this: when we wrote Reward Is Enough, we left room for the idea that there could be powerful priors. But I think the advent of foundation models and LLMs has shown that those powerful priors really can take you a long way. So I want to be more explicit and more welcoming of the idea that you can start with whatever you want; as long as you then do a massive amount of RL on top of it, that can take you really far. Why not embrace the idea that there are these really powerful launching points for the agent?

Great.
Another question about... sorry, one and then two. Yeah.
I was wondering how the meta-RL and discovered-RL work you mentioned could be applied to, say, a multi-agent RL problem.
We haven't yet tried it on multi-agent RL problems, but I don't think there's anything in the formulation that is specific to single-agent RL. Well, let me say that differently. At least if you take the simplest and clearest approach to multi-agent RL, which is to say that you just have individual agents, each following their own RL algorithm, then you could take the meta-objective to be something like the overall success of all the agents. Maybe you could do something like that, and I would expect it to learn a good approach that would be effective. It's a great idea; we just haven't tried it yet, but I think it would be super interesting to know how that goes. Yeah.
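The suggestion just made, individual agents each running their own rule while the meta-objective scores the overall success of all agents, can be sketched in toy form. The coordination game, the per-agent update, and the meta-parameter below are all hypothetical stand-ins, not the actual system being discussed.

```python
import random

random.seed(1)

def play_round(actions):
    """Toy cooperative game: reward is earned only when agents coordinate."""
    return 1.0 if len(set(actions)) == 1 else 0.0

def train_agents(lr, episodes=200, n_agents=2, n_actions=2):
    """Each agent independently follows the same simple update rule with
    meta-parameter `lr`; returns the average joint reward over training."""
    prefs = [[0.0] * n_actions for _ in range(n_agents)]
    def sample(p):
        # mostly greedy, with a little exploration
        if random.random() < 0.1:
            return random.randrange(n_actions)
        return p.index(max(p))
    total = 0.0
    for _ in range(episodes):
        acts = [sample(p) for p in prefs]
        r = play_round(acts)
        total += r
        for p, a in zip(prefs, acts):
            p[a] += lr * (r - p[a])   # independent per-agent update
    return total / episodes

# Meta-objective: the overall success of all agents together,
# used here to select the rule's meta-parameter, as an outer
# meta-RL loop would.
best_lr = max([0.01, 0.1, 0.5], key=train_agents)
print("meta-selected lr:", best_lr)
```

The structure mirrors the answer: nothing in the outer loop cares that there are multiple agents; it only sees the joint score.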