Latent Space: The AI Engineer Podcast

[AIEWF预告] 多轮强化学习打造持久智能体——与Prime Intellect的Will Brown对话

[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect

本集简介

在微软Build大会、谷歌I/O大会和OpenAI开发者大会密集的一周里,大实验室圈内最公开的秘密莫过于Claude 4的发布,尤其是万众期待的Opus强势回归。具体关于Claude 4的回顾我们留给AINews报道,但我们认为无论是Gemini本周在深度思考方面的进展,还是Claude 4的推出,都代表了推理时间计算/推理能力的新前沿(至少在GPT5今夏发布之前)。Will Brown在AIE纽约的演讲以及他在验证器方面的开源工作,使他成为少数能公开讨论推理模型现状及当前最先进研究方向(而非像加入大实验室后被套上模糊表达的LoRA枷锁)的权威声音之一。我们讨论了他关于《通过轮次信用分配强化LLM智能体多轮推理》的最新论文,他还为那些有勇气忍受糟糕会议音质的听众预览了将在AIEWF上发表的《代理强化学习》演讲内容。

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

各位AI工程师们好。我们紧急推出这期关于Claude 4的快评节目,邀请到Prime Intellect新任推理研究主管Will Brown。Will在AIE NYC的演讲及在验证器领域的开源工作,使他成为少数能公开讨论当前推理模型前沿进展及SOTA研究方向走向的重要声音。我们探讨了他最新关于通过回合级信用分配强化LLM智能体多轮推理的论文,他还预告了即将在AI工程师世界博览会上关于agentic RL的演讲(链接见节目备注)。很高兴宣布Will将再次出席旧金山举办的AI工程师世界博览会,展会门票现已开售。

Hello, AI engineers. We're back with a quick reaction pod for Claude 4 with the new reasoning research lead for Prime Intellect, Will Brown. Will Brown's talk at AIE NYC and open-source work on verifiers have made him one of the most prominent voices able to publicly discuss the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper on reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment, and he previewed his upcoming AI Engineer World's Fair talk on agentic RL, linked in the show notes. We're excited to share that Will will be back at the upcoming AI Engineer World's Fair in San Francisco, which now has Expo tickets on sale.

Speaker 0

他将与Misha Laskin、Nathan Lambert、Christian Szegedy、Greg Kamradt、Kyle Corbitt等专家共同领衔全新的'强化学习+推理'专题论坛。欢迎关注AI.Engineer,我们现场见。

He will be headlining the new RL plus reasoning track with Misha Laskin, Nathan Lambert, Christian Szegedy, Greg Kamradt, Kyle Corbitt, and more. Join us at AI.Engineer. Watch out and take care.

Speaker 1

大家好,欢迎收听这期闪电快评加紧急新闻的Latent Space播客节目。我是Decibel合伙人兼CTO Alessio,身边是我的联合主播、Smol AI创始人swyx。

Hey, everyone. Welcome to a lightning plus emergency-news Latent Space podcast episode. I'm Alessio, partner and CTO at Decibel, and I'm joined by my cohost, swyx, founder of Smol AI.

Speaker 2

嘿,说实话我们早知道Claude 4要发布,但实在忙得没空做专题节目。所以这期算是补办的特别专场,邀请到Prime Intellect(现在可以正式说了)的Will Brown。

Hey. Hey. And, yeah. Honestly, we knew that Claude 4 was coming, and we just didn't... we're just too busy to, like, have a dedicated episode. So, like, this is our makeup dedicated episode with a special guest, Will Brown from, now I can say it, Prime Intellect.

Speaker 3

大家好!很高兴来做客,我们认识有段时间了,但应该是我第一次上这个播客。非常期待和各位交流。

Hey. How's it going? Great to be on. We've known each other for a little bit, and it is my first time on the podcast, I believe. Great to chat with you guys.

Speaker 3

今天算是个大新闻日吧,世界上总是不缺新闻。

Big news day, I guess. So lots of stuff out in the world. There's always a news day.

Speaker 2

感觉这周特别密集——周一是微软Build大会,周二周三谷歌,今天又是Claude发布。不知道明天还有什么惊喜。

I think this week is particularly heavy for some weird reason. Like, Monday was Microsoft Build. Tuesday, Wednesday, Google, and then today is Claude. I wonder what tomorrow will bring.

Speaker 3

先是IO大会,接着又是IO大会,然后...

We had IO, and then we had IO, and then

Speaker 2

没错没错。

Yeah. Yeah.

Speaker 3

不同的...

Different

Speaker 2

不同的IO。没错。是的。其实我们今早本该录制的,但大家都想去看Claude的主题演讲,所以我们就去看了Claude的主题演讲。显然是个好模型,你知道的,大模型。

different IOs. Exactly. Yeah. So, like, we actually were supposed to record this morning, and we all wanted to watch the Claude keynote, so we went and watched the Claude keynote. Obviously, a good model, you know, good model, big model.

Speaker 2

他们真的在强调编程。说实话,他们没怎么谈推理能力。他们只是说现在运行时间更长了。你们怎么看?

They're they're really emphasizing coding. They didn't really talk much about reasoning, to be super honest. They were just like, it runs for longer now. What are you guys' takes?

Speaker 3

是的。我觉得有件事我观察一阵子了,现在大家应该也都意识到了:让下一波技术强大的关键就是,每个人都想要更好的智能体。人们需要能自主行动的模型。而推理能力某种程度上是这一切的前置条件。我总想到OpenAI的五级框架,聊天机器人属于RLHF时代。

Yeah. So, I mean, like, one thing I've kind of seen coming for a little bit, that I think people are kind of also all aware of now, is that, like, the thing that's gonna make the next wave of stuff be powerful is just, everyone wants better agents. Everyone wants models that can, like, go off and do stuff. And, like, reasoning was kind of, like, a precursor to that a little bit. Like, I mean, I always think of, like, OpenAI's, like, five levels framework, where, like, chatbots was the RLHF era.

Speaker 3

推理者阶段像是o1和R1。但人们真正期待的是推理能力作为通向智能体的台阶。所以我能理解为什么Anthropic不强调'我们有最佳推理者',而是在展示他们的智能体套件、工具使用能力,以及多轮任务的函数调用基准测试,因为这才是实际应用中人们更关心的,而不是'在数学竞赛表现优异'这类事。

And then reasoners was, like, o1 and R1. But, like, really, what people were thinking of was reasoners are a step on the path towards agents. And so I can kind of see why Claude, why Anthropic, is not like, oh, we have the best reasoner. They're really, like, showing off their agent suite and, like, tool use and, like, function calling benchmarks on multi-turn stuff. Because I think that's really, like, what people care about more for actual applications, as opposed to, like, did really good on this math competition.

Speaker 3

数学竞赛成绩之类的东西更像是种信号,表明我们在进步,但对大多数人来说,真正的目标始终是实用的智能体。

Like, the math competition stuff was all, like, a signal that was supposed to show we were getting somewhere, but the thing we were getting towards, for a lot of people at least, is practical agents.

Speaker 1

Alessio?对。我认为他们把'扩展思考'模式的大写去掉了。Claude 3发布时是'Extended Thinking'这样大写的,现在变成小写的'extended thinking with tool use'。感觉他们也...是的,不再纠结这个了。

Alessio? Yeah. I think for the extended thinking mode, I think they removed the uppercase. I think in the Claude 3 release, it was, like, Extended Thinking, kinda, like, capitalized, and now it's just, like, extended thinking with tool use. So I think they're also, yeah, done playing with that.

Speaker 1

不管算不算推理能力,他们试图把所有功能整合起来。我之前没意识到——根据他们的措辞,以前的扩展思考是不能使用工具的,现在Opus 4可以了。这很棒。不过他们确实没像上次那样把它作为核心卖点。

Whether or not it's reasoning, I think they're trying to merge everything together. And, I mean, I didn't realize that, but the way they worded it, extended thinking could not use tools before, and now it can in Opus 4. So that's great. But, yeah, they haven't put it as front and center as last time.

Speaker 2

我们有没有...这已经偏向猜测了,但我们是否清楚Claude的扩展思考机制与o系列模型存在实质性差异?有人知道吗?

Do we have any... this is, like, already veering off directly into speculation, but do we have any idea if there are any material differences between how Claude extended thinking works versus, like, the o-series models? Do we know?

Speaker 3

最大区别似乎是...至少这点一直...我也不确定,当然都是猜测。但Anthropic从一开始就有这种'思考'特性,有时候连Claude 3.5都会进行少量思考,主要是决定使用哪个工具。比如在Claude UI里执行操作时,它会用两三句话思考该选哪个工具。

The biggest difference seems to be, at least... I mean, I don't know. This is all speculation, of course. But from the start, Anthropic always kind of had this, like, little thinking thing where sometimes even, like, Claude 3.5 would do, like, a tiny bit of thinking. And it was really just, like, deciding which tool to use for the most part. Like, if it was doing an artifact in the Claude UI, it would have this little thing where it would think for, like, two sentences about which tool to use.

Speaker 3

Anthropic的态度似乎是:扩展思考是工具使用的一种实例,是你希望模型具备的能力。但不是说'这是个会思考的模型',而更像是让模型进行思维倾泻——因为这种倾泻能帮助它找到下一步该做什么。就像执行搜索或代码一样,都是解决问题过程中获取更多信息的方式。

And it seemed like Anthropic's kind of attitude has been that extended thinking is an instance of tool use, and that it's the kind of thing you want to equip the model with the ability to do. But it's not like, oh, it's a thinking model. It's just a thing for the model to, like, brain vomit, because that brain vomiting will help it, like, find a nice thing to do next. In the same way that doing search or doing code execution are, like, ways to kind of get more information on the path towards, like, finishing a problem.

Speaker 2

是的,就是所谓的推理时计算。我确实遇到过一个人,声称自己发明了这个说法,他们做了草稿纸那篇论文,显然早于Jason Wei的思维链论文。但这些都属于同一类技术方法。对我来说问题在于:是否存在某种模型路由机制?

Yeah. Inference-time compute, as they say. I did meet somebody who claimed to have coined it. They did the scratchpad paper, and this was obviously before the Jason Wei chain-of-thought paper. But it's all the same general family of techniques. I think the question for me is also, like, is there some model routing going on?

Speaker 2

比如,思考模式和非思考模式是不同的模型吗?还是同一个模型只是关闭了回合结束标记生成?

Like, are they different models, the thinking and non-thinking, or are they the same models where, like, you just turn off the end-of-turn token generation?

Speaker 3

我认为应该是同一个模型,我猜他们就是这么做的。Qwen用很简单的方式实现了,他们也稍微透露过方法。虽然大规模实现确实困难,但至少在概念上,如何实现并不是什么大问题。

I mean, I think these models should be the same model, and I assume that's what they're doing. Like, it's not that hard to, like... Qwen did it in a very kind of, like, simple way, and they kind of talked about how they did it a little bit. But it's not, like, too difficult to, like, have thinking and non-thinking, like, be the same sort of thing. I mean, like, obviously, all this stuff is, like, hard at, like, serious scale. But, like, conceptually, at least, it's not like a big problem to solve in terms of how you would ever do it.

Speaker 3

我们有强化学习,可以通过不同任务的监督微调来教会模型这类技能。

It's like, no. We have reinforcement learning. Or we can just SFT on, like, different things. We can kinda teach models skills like that pretty easily.

Speaker 2

你们最近发表了关于GRPO的研究,还在多轮RL方面做了大量工作。我想先收个尾:关于Claude还有哪些技术亮点?

Yeah. You have some work that you've published recently on, like, GRPO, and you're doing a lot of work on multi-turn RL. But I think I wanted to just kinda round out any other Claude highlights, you know, on your guys' side.

Speaker 3

嗯。

Yeah.

Speaker 2

我把争议话题留到最后。你们还想重点讨论什么技术亮点吗?

There is controversy that I'm leaving towards the end. Right. Like, any other technical highlights that you guys wanna focus on?

Speaker 3

这个模型很酷,但就像kalomaze今天发推说的,这是线性进步而非范式转变。最让我惊喜的是他们报告的基准测试中,奖励黑客问题有所改善——Sonnet 3.7总喜欢在回答编程问题时额外做七件无关的事,显然是因为RL环境没有足够惩罚机制。

I mean, I think it seems like a really cool model, but, like, kalomaze, like, tweeted this earlier today: it seems like it's linear progress, which is, like, great, but there's not anything that I've seen from it that feels like a paradigm shift in terms of, like, the sorts of stuff Dario talks about, which, like, I think maybe we're still on the path to get there. And it feels like this has just, like, gone up in terms of complexity of agents. I think the one thing that to me was really nice to see, and I haven't, like, done too much testing myself yet, but in their reported benchmarks, the reward hacking issues seem improved. Like, Sonnet 3.7 loves to, like, do stuff that to me feels reward-hacky, in the sense that you ask it a coding question, and it would, like, do your question and then seven other things also, presumably because there was some RL environment where there wasn't really a penalty for doing that, or there wasn't enough penalty. And covering its bases, like, was more likely to pass test cases on some coding thing.

Speaker 3

就像SWE-bench那类基准测试,你只需要最小改动,但模型会塞入大量冗余代码来通过测试。我们真正需要的是模型把事做完就收手。他们内部基准显示这个问题从45%降到了15%,我希望这些新模型能更可靠。

Like, you could imagine, like, a SWE-bench kind of thing where there's a minimal diff that is really what you want, but you could do a ton of other stuff and put all these other things in place that, as long as it's not enough that you trip over your feet, is just, like, extra stuff that's there if it helps pass the test cases. And what I really think you want with these models is, like, for them to do the thing and no more. And they had some internal benchmark for this that went from, like, 45% down to 15%, both for Sonnet and for Opus, as opposed to 3.7. And so I'm hopeful that these models are much more, like, friendly to code with and maybe more, like, trustworthy.
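The failure mode described here, a patch that passes the tests but drags in extra helpers and files, falls out of a reward that only counts test passes. A minimal sketch of the kind of shaping that discourages it; the `reward` function and penalty weight `lam` are illustrative assumptions, not Anthropic's actual setup:

```python
def reward(tests_passed: int, total_tests: int, diff_lines: int,
           lam: float = 0.01) -> float:
    """Toy coding-RL reward: test pass rate minus a penalty that grows
    with patch size, nudging the policy toward the minimal diff
    ('do the thing and no more') instead of shotgunning extra code."""
    pass_rate = tests_passed / total_tests
    return pass_rate - lam * diff_lines

# A sprawling 200-line patch scores worse than a minimal 5-line one,
# even when both pass every test.
print(reward(10, 10, 5), reward(10, 10, 200))
```

Without the `diff_lines` term, both patches score identically, which is exactly the "no penalty for doing that" situation described above.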

Speaker 3

我对模型有个信任度分级:旧Gemini很可靠,GPT-4.1很可靠,但新Gemini和Sonnet 3.7都不够可靠——特别是在处理多文件代码库时。

And that's the thing that I kind of have buckets for models of, like, how much can I trust them in a code base, especially something beyond, like, a single file? Like, old Gemini to me was very trustworthy. GPT-4.1 is very trustworthy. New Gemini is not. Sonnet 3.7 is not.

Speaker 3

o3不是。还没决定新的Sonnet和新的Opus会归入哪个类别。

o3 is not. Haven't decided which bucket new Sonnet and new Opus are gonna fall into.

Speaker 2

在奖励黑客方面的可信度。

Trustworthy in terms of reward hacking.

Speaker 3

可信就是说,它们会在代码库里做正确的事。最坏情况下,它们会做得很蠢,但不会搞砸一堆东西。不会到处留下多余注释和无用的辅助函数,也不会凭空创建七个新文件。这正是我们看到3.7经常干的事。

Just, like, trustworthy in that they're gonna do the right thing in the code base. And, like, worst case, they'll do it, like, dumb, but they're not gonna, like, go break a bunch of stuff. They're not gonna leave a bunch of, like, extraneous comments and helper functions all over the place that aren't really needed, or, like, make seven new files just to have them there. Like, this is the sort of thing that we saw 3.7 do a lot.

Speaker 2

对。它就像...我代码里明明有这函数,它偏要新建一个。是啊,我常想这类问题——特别是针对RL环境的普遍现象。

Yeah. Like, I already had the function in my code base, and it would just make a new one just because it felt like it. Yeah. One thing I often wonder about those things is, like, just for RL environments in general.

Speaker 2

比如,为什么不把token成本在惩罚项里算得更重?懂吗?这是至高法则。其实只要规定'用的token越多越差',就能规避很多奖励黑客行为。

Like, why isn't token cost more of a thing in the penalties? You know? Like, that's the one rule above all. Like, you can actually skip a lot of reward hacking by just, hey, the more tokens you use, the worse it is.

Speaker 3

问题是卖你代币的厂商可不这么想,他们巴不得你

I mean, that's not what they want. They're selling you tokens. They want you

Speaker 2

多用。好吧。

to Okay.

Speaker 3

但确实存在这个因素。不过我觉得更主要是初期大家都有种'token越多越好'的认知——你看曲线就知道,token消耗越多准确率越高。所以严格限制token用量的压力并不大,尤其厂商还指望多卖token呢。

But, like, so there's that element of it. But I think also it's that there was this initial kind of reaction from everybody of, like, more tokens is better. If you look at the line, it goes up. As you spend more tokens, your accuracy goes up. And so I think the pressure to, like, really tamp down on token usage was not that serious for a lot of people, especially because the companies are, like, trying to sell you more tokens.

Speaker 3

但这类事其实可以有更多管控。比如Qwen就用了种很粗暴的方式——直接在UI里设置token预算,超限就截断思考。看似强行中断思考,实际效果不错。模型就算被中途截断、注入一个结束思考的token,也足够聪明,能基于已有内容给出最佳收尾。这算是一种解决方案。

But it is the sort of thing that you can have some more controls over. So, like, Qwen did this in kind of, like, a very abrupt way where, in the UI, you can, like, set a token budget, and then it just, like, truncates the thought. So it seems like artificially truncating thought is actually, like, fine. Like, even if the model got cut off mid-sentence with an injected, like, end-of-thinking token, these are smart enough models that they can kind of finish with the best that they got from that point. And so that's, like, one way to do it.
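A rough sketch of that budget-forcing trick: truncate the trace and inject an end-of-thinking marker so the model answers with what it has. The token-list representation and the `</think>` marker string are illustrative assumptions, not Qwen's actual implementation:

```python
def apply_thinking_budget(tokens: list[str], budget: int,
                          end_think: str = "</think>") -> list[str]:
    """If the model finished thinking within the budget, leave the trace
    alone; otherwise hard-truncate at `budget` tokens and inject the
    end-of-thinking marker so generation continues into the answer."""
    if end_think in tokens[:budget]:
        return tokens
    return tokens[:budget] + [end_think]
```

The point of the injected marker is that, from the model's perspective, it simply looks like it decided to stop thinking there.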

Speaker 3

另一种——现在正成为标准API功能——是思考预算,比如Claude就有。Prime Intellect上次INTELLECT-2训练时也试过一点(那会儿我还没加入)。思考预算这类东西可以植入强化学习目标,你会看到模型逐渐学会根据系统提示(比如'用X个token')来调整思考量。不必完全精确,但若能正确训练模型遵守这点,它就能学会'适量思考'。

The other, and, like, that's becoming kind of a standard API feature now, is, like, a thinking budget. Claude has that. Yeah. We did a little bit of experimentation with that in our last INTELLECT-2 run at Prime Intellect, which was before I joined. But thinking budgets are the kinds of things that you can insert into a reinforcement learning objective, and you can see the model, like, get better at targeting the right amount of thinking based on, like, let's say, something that goes in your system prompt, and you can have the prompt just say, use x amount of tokens. It doesn't need to be exact, but if you kind of train the model to, like, respect this, you would hope that if you, like, execute this correctly, the model learns to, like, roughly think the right amount.
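One way a prompted budget could be folded into the RL objective, as described above: correctness pays out as usual, and deviation from the budget stated in the prompt is docked, so the policy learns to target the right amount of thinking. The function shape and the penalty weight `alpha` are invented for illustration, not Prime Intellect's actual objective:

```python
def budget_shaped_reward(correct: bool, tokens_used: int,
                         budget: int, alpha: float = 0.5) -> float:
    """Task reward minus a penalty for missing the prompted token budget
    in either direction, capped so the penalty can't dominate the
    correctness signal."""
    base = 1.0 if correct else 0.0
    deviation = abs(tokens_used - budget) / budget
    return base - alpha * min(deviation, 1.0)
```

Under this shaping, a correct answer that lands near the budget outscores an equally correct one that wildly overshoots, which is the behavior you'd hope the model internalizes.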

Speaker 2

好的。这实际上改变了我对思考预算的看法。因为之前我认为推理努力比思考预算更好。思考预算有点像是一个最大上限。

Okay. This actually changed my opinion of thinking budgets. Because previously, I was thinking that reasoning effort was better than thinking budgets. Thinking budget is kind of like a max cutoff.

Speaker 3

同样的道理。

The same thing.

Speaker 2

这是个目标。对吧?它不是它不是

It's a target. Right? It's not it's not a It's

Speaker 3

好吧。努力可能是个目标。是的。

a okay. The effort is a target, probably. Yeah.

Speaker 2

对。对。对。是的。因为我实际上想设定努力程度。

Right. Right. Right. Yeah. Because I actually want to set effort.

Speaker 2

除了成本之外,我并不特别在意上限。而且,给我64位的上限或其他什么并不重要。

I don't super care about cutoff apart from cost. And, like, giving me, you know, 64 bits of cutoff or whatever doesn't matter.

Speaker 3

我不确定它们有多大区别。我认为,我们不知道它们内部是如何运作的,但我猜整个推理努力本质上就是模型通过强化学习获得的一个令牌预算。你会希望看到不同的行为。所以当模型被告知它有较短的思考预算时,你会希望它使用稍微不同的策略,这些策略比高预算时更好。例如,它更愿意进行大量的数学计算。但我认为从概念上讲,这实际上只是关于模型有一定数量的令牌空间可以使用。是的。

I'm not sure that they're, like, that different. I think, like, we don't know how they do it under the hood, but my guess is that the whole reasoning effort thing is essentially a token budget that the model has been, like, RL'd to target. Like, you would hope that you get different behavior. So the model, when it's told it has a short thinking budget, you would hope that it uses slightly different strategies that are better versus if it has a high budget, where it's more willing to, like, do lots of math calculations, for example. But I think conceptually, it's really just about the model having some amount of room, in tokens, that it can bank. Yeah.

Speaker 3

它正在尝试这样做,希望如此。

It's trying to do that, hopefully.

Speaker 1

你认为这些会作为超参数存在很长时间吗?还是说这只是因为我们处于推理模型的早期阶段,更多的东西被暴露出来,然后会逐渐远离用户?

Do you think we're gonna have these as hyperparameters for, like, much longer, or do you think this is kinda, you know, as we're early in this, like, reasoning models, more of the stuff is exposed and then it gets moved away from the user?

Speaker 3

我认为在聊天界面中,它可能不会长期存在。我不认为我们总是会有一个下拉菜单,比如o4-mini和o4-mini-high。这感觉有点傻。但我确实认为这是开发者需要的东西,尤其是当你围绕某个模型构建了应用后,很多提供商希望你能坚持使用一个模型,而不是频繁切换。你需要一个旋钮来控制成本和延迟。所以这是一个有用的旋钮,可以暴露给开发者来控制质量与成本和延迟之间的平衡。

I think in chat interfaces, it probably won't stick around. Like, I don't think we're always gonna have the drop-down of, like, o4-mini and o4-mini-high. That feels silly. I do think it's a thing that developers want, especially because once you've kind of built around a certain model, and, like, a lot of these providers are hoping you stick with the one model and are not switching all the time, you do need a knob to control costs and also latency. And so that is one kind of useful knob to expose to developers for controlling this, like, quality versus cost and latency.

Speaker 2

太棒了。这些都很酷。我觉得房间里的大象,我们得谈谈,就是关于Opus的争议,对吧?告发你的那件事。

Awesome. Cool on all of that. I think the elephant in the room, let's talk about it, is this controversy around Opus. Right? Snitching on you.

Speaker 2

嗯。

Mhmm.

Speaker 3

是的。我是说,所以我有个

Yeah. I mean, so I have a

Speaker 2

很多 对于那些不了解情况的人,让我们回顾一下,因为我觉得你比我跟进得更紧密。我是从你那里

lot For those for those out of the loop, let's let's recap because I feel like you're you're closer to this than than I am. Like, I learned about

Speaker 3

得知的。当然。是的。所以这是来自某个人的。我不会点名,因为我知道他不想成为焦点。

it from you. Sure. Yeah. So this was someone from. I'm not gonna name him, like, because I know he doesn't wanna, like, have all this attention on him.

Speaker 3

他删除了那条推文。

He deleted the tweet.

Speaker 2

当然了。

Of course.

Speaker 3

他基本上是在梳理人们在Claude安全压力测试中发现的各种问题。所以这不是关于Claude能为你做什么。我觉得人们严重断章取义了。确实存在一个合理的观点,就是人们对某句话过度解读了。但Anthropic经常做的是对模型进行严格压力测试。

He was essentially, like, going through different things that, like, people found during safety stress testing of Claude. And so this is not, like, what's Claude gonna do for you? I think people took this out of context, like, pretty badly. And so there's a fair point there that it's like, people are really reading into the one sentence but more than they should. But this is the thing Anthropic does a lot is they really stress test their models.

Speaker 3

他们试图将模型置于极端情境,观察对手能诱导模型做什么,或模型在无解情境下的反应。在我看来,Anthropic那些关于安全性的头条成果——特别是涉及奖励黑客、欺骗和对齐伪装的问题——都像是两难困境:模型被赋予两个相互冲突的目标,必须选择其一。无论选哪个,听起来都很糟糕。要么遵循用户指令,要么遵循社会规范。

They try to put their models in situations where they can really see, like, what could an adversary get the model to do, or what does the model do if it's in a situation where there's no right answer. So, like, I think a lot of the kind of headline Anthropic, like, safety results, especially related to reward hacking and kind of deception and alignment faking, are all things that to me seem like a rock-and-a-hard-place situation where the model has two objectives it's given that are conflicting with each other. And it has to pick one. And no matter which one it picks, it's gonna sound terrible. Like, it's either following the user's instructions or it's following, like, common norms.

Speaker 3

一旦接受其中任一准则,模型就会执行符合该准则的行为。比如,如果模型目标是最大化帮助用户,它就会帮用户制作炸弹;如果目标是最大化造福社会,当用户要求制作炸弹时,它就会拒绝并阻止这种行为。

And once you kind of accept either of those, it's gonna do the thing that is aligned with that set of, like, guidelines. So in the case of, like, if your model's goal is to be, like, maximally helpful to the user, then it would help a user, like, build a bomb. If the model's goal is to be maximally helpful to society, and a user's asking it to build a bomb, it's gonna be like, no, that's bad, I have to do something to stop this.

Speaker 3

比如,你必须选择一个目标。也许正确的答案是模型选择回避,直接说‘不,我要停止对话’。人们也会因为模型拒绝继续对话或拒绝执行任何操作而生气。这就像,根本没有办法让所有人都满意。

Like, you kind of have to pick a goal. And, like, maybe the right answer is the model just defers and it's like, nope, I'm gonna stop talking. People also get mad when you tell them, like, that the model will stop talking to you or, like, refuse to do anything. Like, there's just no way to kind of win and make everybody happy.

Speaker 3

但我认为他们报告这些,是因为他们觉得让人们理解这些模型的安全影响很重要,要明白'好吧,如果有人试图利用这个,情况会有多糟?这能在实质上帮助某人犯罪或实施暴力吗?'所以他们才有自己的安全框架。博客、帖子和论文中提到的模型尝试这些行为的情况,其实是在设计场景来诱发这些反应。

But I do think, like, they report this because they think it's important to have people understand the safety implications of these models and to understand, like, okay, how bad would it be if someone was trying to use this? Could this, like, meaningfully help someone commit crime or violence or whatever? And so, like, that's what they have their safety framework for. And the things that happen in these, like, blog posts and threads and papers about, like, the model trying these things, they're kind of putting these models in a scenario that elicits these things.

Speaker 3

这就像你想象一个非常聪明的人在那种情况下也会做的事。假设你被告知要不惜一切代价完成一个模糊、不明确的目标,而你真的想完成它。想想真人秀,比如《幸存者》或《蝇王》,就是很好的例子。

Like, it's the sort of thing that you would imagine a very smart human might also do in those situations. Like, let's say you are told, like, accomplish some vague, underspecified goal at any cost, and you really, like, wanna accomplish that goal. Think, like, game shows. Like, Survivor, I think, is a good example. Or Lord of the Flies.

Speaker 3

这些都是经典的场景,人们被置于奇怪的境地,必须去完成任务并想办法解决。他们为模型设计了这些环境,然后观察会发生什么。所以过度分析模型的行为有点傻,无论是‘模型会向警方举报你’还是‘模型会帮你在暗网上找铀’。这些模型基本上都能做到。基础的大型语言模型本身没有任何人为限制。

Any of these, like, kind of canonical situations of people who are put in a weird spot and have to go do stuff and figure out how to do it. They're kind of crafting these environments for the models and just looking and seeing what happens. And so, like, I think it is a little silly to overanalyze behaviors in either direction of, like, oh, the model is reporting you to the police, or the model's gonna go help you find uranium on the dark web. Like, well, these models can kind of do that. The base model of an LLM is not artificially constrained in any way.

Speaker 3

只要提示得当,它会尽其所能地执行,直到达到其智能极限。所以问题是如何将所有可能性限制在一个更合理的范围内。这很难。

Like, with the right prompt, it'll do whatever up to its intelligence limit. And so, like, the question is just how do you constrain the space from all possibilities down to, like, a more reasonable set? And, like, that's hard. So

Speaker 2

好吧。你给出了一个严肃的回答,我完全尊重。我本来是想找点乐子的。但你的态度就像‘是的,这就是他们真正的问题所在’,这完全没问题。

Okay. You actually gave a serious answer, which I totally respect. I was just looking for shitposts. But, I mean, like, you're treating this as though, like, yep, this is what their problem actually is, which is, like, totally fine, and yeah.

Speaker 2

毕竟这就是你作为研究者的角色,对吧?

I mean, that's what you are as as a researcher. Right?

Speaker 3

是啊。我觉得发推挺有意思的,有种宣泄感。我看到那个关于铀的帖子,就想发条推文。推文大概是‘我们发现Claude能去暗网搜索铀’。

Yeah. I mean, I I think tweeting is fun. Like, it's it's cathartic to, like, just kind of, like, get a post out. So, like, I saw the one about the uranium thing, I was like, let me tweet. So the the tweet was like, we found that Claude can go search the dark web to look for, like, uranium.

Speaker 3

我还编了个'使用突破性Claude 4的agentic RAG应用中,开发者最爱的十大功能',纯粹是搞笑。既调侃了LinkedIn上的长篇大论,也讽刺了他们讨论的那个荒谬场景。

And I was like, here are the top 10 things that builders are using in their agentic RAG applications with the new groundbreaking Claude 4. And it was just, like, silly. Both making fun of, like, LinkedIn, like, thread posters, as well as just, like, the funniness of the scenario that they were talking about.

Speaker 2

没错,就是这样。

Yeah. This is it.

Speaker 1

这些是否让你对给大语言模型提供什么工具有了不同的看法?

Does any of this make you think differently about what tools to give an LLM?

Speaker 3

You

Speaker 1

知道吗?我知道他们删除了那条推文,但基本上就像,以前你会接上所有这些MCP,比如赋予邮件访问权限之类的。而现在则变成,如果你会利用邮件权限告发我,那我可能不想一直开放邮件权限了。

know? I know they deleted the the tweet, but it's basically like, well, before if you're putting all these MCPs, like, yeah, you have email access and all of this. And now it's like, well, maybe I don't wanna give email access all the time if you're gonna snitch on me with the email access.

Speaker 3

我觉得用这些模型编程,特别是Claude 3.7,我用了不少。有几周时间我主要用Claude Code配3.7做各种随机副项目,但从没觉得它对大型现有代码库有帮助。但如果是想几小时内临时搞个有趣的东西,

I mean, I think coding with these models, especially, like, Claude 3.7, I did a fair amount. Like, for a few weeks, I was doing a lot of Claude Code with 3.7, mostly for kind of random side projects. I never really got to the point where I found it was helpful for a thing that was, like, a large existing code base. But if it's like, hey, I wanna, like, cook something up in a few hours for fun.

Speaker 3

它还挺擅长。不过这些会变得混乱难以维护,最终你会陷入所有东西都出问题,只能自己动手修复的境地。我认为部分原因是模型能访问终端——在终端里几乎无所不能。MCP本质上是限制行动空间的方式。

Pretty good at that. But these become messy, and they become hard to maintain, and you get to a point where it's like, nothing is working, I just gotta, like, dig in and fix it all myself. And so I think part of that is that the models have access to a terminal, and you can do a lot of stuff in a terminal. MCP is kind of a way of constraining the action space.

Speaker 3

就像经典强化学习里人们讨论的状态、动作、奖励、策略这些动态要素。传统RL里模型通常在固定的动作空间里训练,比如电子游戏里能按的按键。而对LLM来说,文本就是近乎无限的动作空间——在终端里几乎没什么不能做的。

So, like, in, like, canonical RL, people talk about, like, states, actions, rewards, policies as, like, the things that are, like, the moving parts. Models in, like, old-school RL are generally trained with, like, a very fixed action space of, like, what are the keys on the video game I can hit? But with LLMs, it's, like, text. Text is, like, kind of unbounded in what you can do with it in a terminal. There's not much you can't do in a terminal.
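For contrast, the fixed-action-space setting is small enough to write down in full. A toy tabular sketch of those primitives (states, actions, rewards, a policy), with a made-up single state and reward purely for illustration:

```python
import random

random.seed(0)

states = ["start"]                  # one state: really just a bandit
actions = ["a", "b"]                # the 'keys on the video game'
q = {(s, a): 0.0 for s in states for a in actions}

def reward_fn(state: str, action: str) -> float:
    return 1.0 if action == "a" else 0.0   # toy reward signal

# epsilon-greedy policy improvement over a fixed, enumerable action space
for _ in range(200):
    s = "start"
    if random.random() < 0.1:
        a = random.choice(actions)                  # explore
    else:
        a = max(actions, key=lambda x: q[(s, x)])   # exploit
    q[(s, a)] += 0.1 * (reward_fn(s, a) - q[(s, a)])
```

With an LLM agent, `actions` is every string the model could emit into a terminal, so no table like `q` can exist; that gap between enumerable keys and unbounded text is the point being made above.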

Speaker 3

所以如果你训练模型...哦这个观点我挨了不少批评。

And so if you're training models oh, I got a lot of flack for this one.

Speaker 2

我只是展示这个...等等,批评?为什么?

I'm just showing this wait. Flack? Why?

Speaker 3

有人觉得这个符号体系愚蠢糟糕,有人说RL其实很简单,也有人说RL很复杂。每个人对RL的理解都不同。我只是想表达这其实挺复杂的——不是说MDP的定义复杂,

It was both people who were like, the notation is stupid and bad, and RL is really simple, and people who were like, RL is, like, complicated. And it's like, everyone has a different opinion on what RL means. And I was trying to just, like, be like, hey, it's actually kinda complicated. And I wasn't making the point like, oh, the definition of an MDP is complicated.

Speaker 3

而是说这里面有很多动态要素。特别是如果你想改变系统任何部分时——比如假设两个语言模型共同学习会发生什么?该如何推理这种情况?

I was like, no, there are just a lot of moving parts. And to think about it, to, like, do anything, especially if you wanna change any, like, part of the system... like, here's a hypothetical question. Like, what happens if you have two LLMs learning together? How do you reason about that?

Speaker 3

你怎么看这个问题?这会是一个稳定的系统还是不稳定的系统?如果他们既有点合作又有点不合作呢?他们一边训练协作一边又想背后捅刀子。现实中人们就经常身处这种环境里。

How do you, like, think about that? Is this going to be a stable system or not a stable system? What if they're, like, kinda cooperative but kinda not cooperative? And then they're training to work together but also wanna backstab each other. Like, this is kind of the environment people are finding themselves all in all the time out in the real world.

Speaker 3

但要让AI做到这点,就得把这些转化成代码和数学。你的目标越复杂,数学就越复杂。强化学习就像是一种揭示这些基础元素的数学语言。但很多人觉得能看懂方程就等于理解了。

But if you wanna make AIs do this, you have to, like, translate this into code and math. And the more complex your goals are with this thing, the more complex the math gets. And RL is, like, one math language that kind of exposes these primitives. But, like, I think a lot of people are like, oh, I can follow the equations. That means I understand it.

Speaker 3

确实可以这么想。但这就像...我不知道...类似N体问题,你可以定格观察:一个物体移动如何影响其他所有物体?会产生怎样的连锁反应?哇,然后呢?

And it's like, well, sure. But, like, also, there's it's like this, I don't know, n body problem thing where you can freeze it and look at it. It's like, oh, how does one thing moving affect everything else and what are the cascading ripple effects? Wow. And what?

Speaker 2

居然把三体问题扯进来了。绝了。

Brought three body problem into this. Amazing.

Speaker 3

我说的是物理概念,不是那部剧。

Like, as in, like, the physics problem and not the show.

Speaker 2

不,不,我是说...确实,这基本没法建模。

No. No. No. I mean, yeah. Actually, how actually, very, like, impossible to model.

Speaker 2

嗯...就算能模拟也...

Mhmm. Like, you I guess you can, like, simulate it, but, like, even then.

Speaker 3

它对初始条件太敏感了。就像没人能预测一年后的天气——气候变化这种大趋势除外。但没人能预测西雅图某天会不会下雨。

It's, like, sensitive to initial conditions. So, like, you can't really say. Like, this is one of those things, like, why does no one predict the weather a year out? I don't think anyone has anything that's, like, good at long-term weather forecasting beyond, like, I don't know, climate change. But no one can predict whether it's gonna rain in Seattle on a given day in a year.

Speaker 3

就算系统是确定的——云层碰撞啊山脉影响啊这些原理我们都懂...

Even if you think, like, the system's predetermined, like, it's all clouds bumping off each other and whatnot and mountain ranges. We kind of know how these things work.

Speaker 2

所以蝴蝶在扇翅膀了是吧?我是说...

So the butterflies are flapping their wings. I mean, like, you gotta

Speaker 3

确实如此。

Exactly.

Speaker 2

顺其自然吧。就像蝴蝶效应。如果没有蝴蝶,我们本可以预测结果。

Let it play out. Like, butterflies. If if we had no butterflies, we could predict it.

Speaker 3

没错。所以它对蝴蝶效应非常敏感。

Right. And so it was very sensitive to butterflies.

Speaker 2

有意思。好吧。我想我们可以做个总结,除非还有更多争议点——我觉得应该没有了。实际上这份系统卡片写得非常好。

Interesting. Okay. So I guess, we can we can sort of round it out unless there's any more of the controversy. I I think there isn't. Like, I think that the system card is actually very good.

Speaker 2

相比常规系统卡片,他们可能用力过猛了。而且有点让人困惑这到底是营销手段,还是他们真的超级重视安全性?这部分就像是阿波罗团队一贯作风——不断推进红队测试的边界,对吧?

They probably went too hard on it compared to, like, normal system cards. But it's a little bit confusing whether this is marketing, or are they just like, no, we really super care about safety. And part of this is, like, Apollo just being Apollo, you know, pushing the frontier of red teaming. Right?

Speaker 2

所以他们要报道这些事,因为阿波罗的营销做得极其出色。

So they're gonna report the things because it's extremely good marketing for Apollo.

Speaker 3

是啊。我觉得他们似乎仍在尝试用创意做消费者营销。AI圈的人要么热爱Claude,要么曾对Claude着迷、重度使用过一段时间,但它还没真正破圈到普通大众。

Yeah. Yeah. I think they're really they're seem to still be, like, trying to be creative with their kind of consumer marketing. Like, it feels like people in the AI world, like, love Claude or have grown tired of Claude but still had a phase where they were using it a ton. But it hasn't really broken out to general people in the same way.

Speaker 3

我看过的很多他们的营销都有点让人困惑。他们在塑造品牌形象上做得很好,吸引的是特定人群——那些特别在意模型是否有深层个性之类的用户。这类人也往往很喜欢GPT-4.5,很多人痴迷Claude 3 Opus那种'大模型气质'。但多数人根本不在乎这些。

And it feels like a lot of their marketing that I've seen is, like, a little confusing. Like, it feels like they've done a really good job at crafting a brand image that appeals to a segment of the population who has certain considerations, who really like that a model has a deep personality or whatever. The sorts of people who I think also really like GPT-4.5, many of them, like, really loved, like, Claude 3 Opus. The big model smell. Like, a lot of people just don't care.

Speaker 2

我就想把它当工具用。是啊。

I just wanted to use it as a tool. Yeah.

Speaker 3

他们在想办法吸引那类受众。LMSYS那些喜欢阿谀奉承模型的人是另一个群体——规模更大,这个问题更难解决。你觉得...

Trying to figure out how to, like, appeal to that audience. The LMSYS sycophancy crowd, the people who love those models, different crowd. And it's a larger crowd, and that's a tough problem to solve. What's your

Speaker 2

快速谈谈LM Arena获得1亿美元融资的看法

quick take on LM Arena getting a $100,000,000 raise?

Speaker 3

我们拭目以待。我想他们会以不同方式与Beep Labs等公司合作

We'll see. Like, I imagine that they partner with com Beep Labs in different capacities to

Speaker 2

可能赚了很多钱。是的。

Probably making a lot of money. Yeah.

Speaker 3

我不是在指责他们肯定做了这件事。但如果我是能以那种估值融资的公司,又刚与Meta建立了长期公开合作关系——我们已看到Meta比其他实验室有更多双向操作空间——我猜那里存在某种补偿或数据访问权。作为评估公司处境确实艰难,推特上有些人正在讨论这个。

Like, I'm not in the business of trying to point the finger at, like, saying they definitely did this. But if I was a company that was able to raise at that kind of valuation and I had just had a long public partnership with Meta, which we've kind of seen meant Meta had the ability to do a lot more back and forth than a lot of other labs did, I would imagine that there's some compensation going on there, or access to data. And so, like, I think being an evals company puts you in a really hard spot. Some people are talking about this on Twitter.

Speaker 3

作为评估公司,你基本上必须向实验室销售。嗯。但向实验室销售实际上会破坏你的评估体系。

Like, just that to be an evals company, you kind of have to sell to the labs. Mhmm. But selling to the labs kind of, like, wrecks your evals.

Speaker 2

因为你的激励机制是这样...客户知道。对。所以在金融领域,我是说...你来自摩根士丹利。

Because your incentives are like this. Like, customers know. Yeah. Yeah. Yeah. So in finance I mean, you know, you are from Morgan Stanley.

Speaker 2

这就像信用评级机构。没错。你的客户就是你要监管的对象,但他们也是你的客户。所以你得对他们友善,否则他们就会转向别家。

So this is the credit rating agencies. Like Yep. Literally, your customer is the one that that you're supposed to govern, but they're also your customers. So then you have to be, like, nice to them or they'll just go to the next one.

Speaker 3

是的。我认为未来最好的评估来源可能是学术界。所以我告诉开始读博的人:找些博士生也能低成本研究的课题,因为你无法独自预训练基础模型,但可以设计出非常聪明的评估方法。我们现在正在快速迭代评估体系。

Yeah. I mean, I do think that, like, the best source of evals going forward is probably gonna be academia. And so this is the thing that I tell people who are, like, starting a PhD. It's just like, find things that are, like, cheap to work on as a PhD student, because you cannot go pre train a foundation model really on your own, but you can build a really good, really clever eval. And, like, we are churning through evals all the time.

Speaker 3

我们不断饱和评估标准,永远需要更多。这不是会终结的工作。将模型优劣的直觉转化为精确科学问题的过程很重要,这个问题更多需要脑力而非资本投入。

We saturate them. We always need more. It's not the kind of thing that is ever going to, like, end. And so that's the task of translating, like, vibes of what is good or bad about a model into kind of very precise scientific questions, I think, is an important problem. It's a problem that you can get by a lot more with, like, brainpower rather than dumping capital into it.

Speaker 3

你需要支付API成本,但这通常可以通过学术拨款、企业赞助解决。有些小样本研究能让你进入雷达区,或者选择能负担得起的模型进行评估。这是可进入的研究领域,学术界的激励机制很适合——写出关于整个领域的轰动论文,而非刻意让某个模型成为赢家。

You need to, like, pay for the API cost, but, like, that is generally the kind of thing that either you can get covered within academic grants or, like, industry sponsors. Or there's versions of these things that are, like, small sample size that get you on the radar, or you kind of pick and choose which models you can afford to eval. But it's like an accessible field of research, and it's one where, like, the incentives of academia, I think, are quite good, which is, like, write a splashy paper that says something interesting about the broader field rather than, oh, we wanna make this one look like the winner.

Speaker 2

是啊。我觉得很多研究生还是缺乏品味。我不知道该怎么更委婉地表达。就是...

Yeah. I think a lot of grad students still don't have taste. I don't know how how better to put it. It just

Speaker 3

这很公平。

That's fair.

Speaker 2

对。但当你参加足够多的学术会议后,我总忍不住想:老兄你为什么要研究这个?你明明这么聪明,完全可以做得更好。那么该如何培养品味呢?

Yeah. But you go to enough academic conferences, and I'm like, why did you work on this, man? Like, you're so you're so smart. You're capable of better. So how do you teach taste?

Speaker 3

我想说的是,我可以分享自己的经验——你需要始终超前思考,对未来几年世界可能的样子做出有根据的预测。比如思考:现在有哪些问题是根本没人讨论的?这当然不容易,你必须真正说服自己至少对趋势的判断是基本正确的。我本科2019年毕业就直接读了研...

I think I mean, I can tell how I did it originally, which is like, I think you always wanna be thinking pretty far ahead, and you wanna be, like, making kind of educated bets about what the world looks like in the next few years. Like, you have to say, like, what are the questions that no one's even talking about? And this is, like, not an easy thing to do. You have to, like, really convince yourself that you're kind of right about the way things at least might go. Like, I finished up undergrad in 2019, then went right into grad school.

Speaker 3

在2010年代末期,我们看到DeepMind等机构在做很酷的多智能体强化学习。当时就觉得:好吧,这些技术确实可行,AI和多智能体系统正在取得进展。

But, like, towards the end of the twenty tens, like, we had, like, DeepMind and others doing all this multi agent RL stuff that was, like, really cool. Then it was like, okay. This stuff kinda works. Like, AI is, like, going somewhere. Multi agent systems are kinda going somewhere.

Speaker 3

虽然还处于早期阶段,但未来会怎样?看起来这些系统会像大型多人在线游戏那样持续并行学习。但相关数学理论其实还不成熟。

Still very early stages. But what's gonna happen once this gets there? And it seemed like, okay. These things are all gonna be, like, continually learning in parallel as this big multiplayer game, basically. And if you look at the math, the math was kind of, like, undercooked.

Speaker 3

多智能体学习理论中仍存在许多未解决的难题。所以我的研究重点就是:如何更好地理解和思考这些问题?直到某天我厌倦了证明定理,决定直接动手构建系统。

And there's, like, some really hard open questions that are still open questions in multi agent learning theory. And so, like, that was my focus, which was like, how do I, like, learn about this? How do I learn to think about this stuff better? And at some point, I kinda got tired of proving theorems and was like, okay.

Speaker 3

无论是做理论还是实验,都需要设立几个条件假设,才能做出超越当下热点、真正有趣的研究。你要稍微领先时代一点。虽然我不是第一个这么做的人,但在o1发布之后、R1发布之前,很明显强化学习会成功,并且会与工具使用的智能体相结合。

Let's just go build the thing. But I think, like, you wanna think about, like, whether you're doing theory or experiments, like, you have to lay out a few different conditional statements to get to the point where you're really doing interesting research that's beyond, like, just the low-hanging fruit that people are, like, obviously gonna be working on in parallel. You wanna be jumping ahead of the curve a little bit. I don't know. This isn't like I wasn't the first person to do this, but, like, it was pretty clear to me, like, after o1 and before R1, that, like, RL was gonna work and that that was going to intersect with agents, where the solution was gonna be, like, RL with tool use.

Speaker 3

这似乎是必然的发展方向。虽然不算高风险的研究方向,但确实取得了成果。

That seemed like the way the direction things were gonna go. And so that was like I don't think that was a very risky research bet, but it was like a research bet that seemed to work out.

Speaker 2

说到这个,你刚发表了论文。现在我有完整背景了——你是项目顾问,实际工作是由你的研究生完成的,是这样吧?

Yeah. Speaking of which, you just published the paper. Now now I have the full context is that you were an adviser on this, and one of your grad students was doing the work, something like that.

Speaker 3

是的。当时我和实习生Sulyan一起。这基本是我在摩根士丹利最后负责的重大项目。它与我正在构建的验证器代码库并行推进。顺便说一句,那个代码库很快会有重大更新。

Yeah. So it was me, with Sulyan as my intern. This was kind of the last major thing I was working on at Morgan Stanley. And this kind of was in parallel with verifiers, the repo that I've been building out. Major updates to that coming very soon, by the way.

Speaker 3

有些进展让我非常兴奋。这件事我真正开始认真投入是在一月份,就在GRPO演示视频走红之后。我当时突然意识到:等等,这种格式奖励机制确实有门道。

I'm very excited about some stuff. But it kind of was something I really started in earnest, like, January, kind of in the follow-up to I'd had the GRPO demo thing go viral. And I was like, oh, wait. There's something to this format reward thing.

Speaker 2

那玩意儿本质上就是个GitHub代码片段吧?是不是?

It was literally like a github gist. Right? Or something.

Speaker 3

这个可是正经的代码仓库。

This is like a proper repo.

Speaker 2

不不不,我说的是那个GRPO。

No. No. No. Like the GRPO one.

Speaker 3

哦那个啊。对对,那个确实只是个代码片段。现在这个是为多轮工具调用GRPO强化学习设计的完整代码库。

Oh, the other one. Yeah. Yeah. The other one was like just a gist. This one is like a repo for, like, multi-turn tool use RL with GRPO.

Speaker 3

某种程度上这篇论文是首篇...虽然之前也有几篇论文用过这个代码库,但这次是把最初gRPO演示片段里的内容扩展成了你说的多轮RL工具。我们做了大量实验:究竟如何让模型使用工具?如何激励工具使用?因为我们发现就算给模型配置了工具,它们就是不用。

And so, like, in some ways, the paper is like it's the first paper that's really like actually, there's been a couple other papers that people have used the repo for. But it's one where, like, a lot of the stuff from the original, like, GRPO demo gist gets kind of extended to the multi-turn RL tool use setting. So there's a lot of experiments here about, like, okay, how do you actually get models to use tools? How do you incentivize tool use? Because something we'd see is that if you set these models up to use tools, they just won't.

Speaker 3

比如你说:'这里有个问题,你可以调用这些工具,想调用多少次都行,最后提交答案'。它们会直接提交答案——特别是小模型,本来就没经过工具使用训练,它们缺乏这种本能。

Like if you say, hey, here's a question. You have access to these tools. Do as many rounds of tool calling as you want and then submit your answer. They'll just submit their answer, because, especially for, like, small models, like, they aren't already trained to use tools. They don't really want to because they don't necessarily have that instinct.

Speaker 3

它们在函数调用和格式指令遵循上表现很差。经常发生的情况是:调用工具时把JSON格式搞砸,然后'哦失败了',注意力就被打断。之后模型更容易失控,因为解析器的错误信息会让它们混乱。

And they're pretty bad at, like, function calling and format instruction following. And so what you would see is, like, when they use a tool, they would, like, mess up the JSON. And then they'd be like, oh, that didn't work. And it kind of threw them out of focus. And it would be more likely following that that the model would just, like, go off the rails because they would get an error message from the parser.

Speaker 3

所以对模型来说最安全的选择就是待在这个舒适区:直接思考然后回答。格式奖励也是同理。想让模型使用思考标记,你必须给予激励——要么做点监督微调预热,要么直接奖励这个行为。否则仅靠格式要求,它们不会100%遵循。但像R1这样的模型,每次都会使用思考标记。

And so the safe option for the models is just to, like, stay in this basin of, like, just do think then respond. Same with, like, normal formatting rewards too. Like, if you want models to use thinking tokens, you kind of have to, like, incentivize that. You have to either do a little bit of, like, SFT warm up or you have to, like, reward them for doing it. Otherwise, they will not follow it a 100% of the time on format alone versus, like, a model like R1, like, a 100% of the time, it is going to use its think tokens.

Speaker 3

你永远不会看到模型像正常对话那样不加思考部分就直接回应。因此,某种程度上你必须决定希望模型执行什么行为。这有点像面向用户的问题:模型的默认行为应该是什么?如果你希望它成为完全的工具代理模型,那么将这一点纳入奖励机制确实大有帮助。论文中解决这个问题的关键技巧是...

You are not going to ever see R1 just, like, talk normally without the thinking section. And so, like, you kinda do have to decide what you want the model to do. Like, this is a little bit like a user facing question of, like, what behavior of the model should the default be? And if you want it to do a certain thing, if you want it to be a total tool use agent model, it does help considerably to, like, actually have this incorporated into the reward. The kind of key trick in the paper to get around this problem so, okay.

Speaker 3

这些模型会做的一种奖励行为是:它们会进行虚假工具调用,比如每次都学着使用相同的谷歌搜索却忽略结果。就像某些问题会这样:'这是个MMLU风格的问题,去查答案吧,用网页搜索'。

One kind of reward hack these models would do is, like, they would do, like, a dummy tool call, where they would, like, learn to use the same Google search every time and ignore it. So, like, some questions would be like, okay. Here's some, like, MMLU style question. Go figure out the answer. Use web search.

Speaker 3

如果你开始为工具使用奖励它们,它们确实会用工具,但并不真心想用——它们想非常保守地使用。很多问题其实它们已经知道答案。我认为为强化学习校准问题的适当难度是个重要课题,我们仍在探索中。但模型会进行愚蠢的工具使用——不是用工具辅助推理,而是为了获取奖励。

And if you start rewarding them for, like, tool use, they will use the tool, but they don't really wanna have to; they wanna, like, be very safe with it. And a lot of these questions, like, they do kind of know a lot of the answers already. And I think calibrating the right difficulty of your questions for RL is, like, an important problem that we're still kind of figuring out. But they would, like, do silly versions of tool use where they aren't actually using the tool to assist in their reasoning. They're using it to get the reward.

Speaker 3

所以我们需要做信用分配:工具是否产生了有效信息?在这些实验中,我们的技巧是:通过字符串匹配检查维基百科返回结果是否包含正确答案。即模型是否搜索到了对问题有用的信息?这个框架其实比这个具体案例更通用。

And so we kind of have to do a credit assignment thing of, like, okay, did the tool call result in useful information? And so for these experiments we were doing, the trick was, like, okay: some string matching thing involving the ground truth answer and the returned search results from Wikipedia. So did the model actually search a thing that retrieved useful information for a question? But the framework is more general than just that.
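The string-matching trick described here can be sketched as a turn-level reward. Everything below is an illustrative reconstruction rather than the paper's implementation: `search_turn_reward` and its case-insensitive substring check are assumptions standing in for whatever matching the authors actually used.

```python
def search_turn_reward(retrieved_text: str, ground_truth: str) -> float:
    """Turn-level credit assignment for one search call: did this turn's
    tool result actually contain information useful for the question?
    'Useful' is approximated by a case-insensitive substring match
    against the ground-truth answer."""
    return 1.0 if ground_truth.lower() in retrieved_text.lower() else 0.0

def episode_turn_rewards(turn_results: list[str], ground_truth: str) -> list[float]:
    """Score every search turn in an episode independently, so credit
    lands on the specific turn that retrieved the answer rather than
    being smeared across the whole trajectory."""
    return [search_turn_reward(t, ground_truth) for t in turn_results]
```

The point of scoring per turn is exactly the credit-assignment problem from the discussion: an outcome-only reward can't tell which of several tool calls did the useful work.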

Speaker 3

关键在于:一旦有了中间状态评估方法,如果能评估中间状态质量,就可以重写GRPO的优势计算来纳入这个因素。相比PPO(传统强化学习方法,也是RLHF使用的),这在GRPO中问题较小。GRPO特别适合高度并行的推理计算,训练过程更内存高效,分布式实现更容易——因为需要同步的梯度和模型副本更少。

It's that once you have a way to do intermediate evaluation, if you can evaluate, like, the quality of an intermediary state, now you can kind of rewrite the GRPO advantage calculation to take this into account. Because I think this is less of a problem than, like, PPO. You know, PPO is, like, the old school RL, and it also is what people use for RLHF. But in the context of GRPO, GRPO is, like, great for, like, leaning heavy on highly parallel inference compute. It's more memory efficient for the actual training process. It's much easier to do in a distributed fashion because you have less gradient syncing and less model weight copies.
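The group-relative advantage computation at the heart of GRPO fits in a few lines, and the turn-level extension can be sketched on top of it. The `turn_augmented_advantages` blending and its `beta` weight are invented for illustration; the paper's actual advantage rewrite may differ.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: each completion in a group of rollouts for the
    same prompt is scored relative to the group's mean and std, so no
    learned value function (as in PPO) is needed."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

def turn_augmented_advantages(outcome_adv: float, turn_rewards: list[float],
                              beta: float = 0.5) -> list[float]:
    """Hypothetical turn-level variant: blend the episode's outcome-based
    advantage with a per-turn intermediate reward, so good intermediate
    turns (e.g. a useful search) get extra credit."""
    return [outcome_adv + beta * t for t in turn_rewards]
```

Because the baseline comes from the group itself, GRPO leans on sampling many parallel completions per prompt, which is the inference-heavy, memory-light trade-off described above.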

Speaker 3

可以理解为这是强化版的DPO。但它也规避了DPO的许多陷阱:默认在线运行,且对比的是整组完成结果而非单个配对。通过这种群体比较实现了某种中间信用分配。但对工具使用而言,小模型尤其在这种回合制场景下,似乎仍存在分布外问题。我现在的思考框架是:经典RL中,状态-行动是多次循环的过程。

It's kind of like DPO on steroids, I think, is one way to think about it. But it also gets around a lot of the pitfalls of DPO, both in that it's, like, online by default as well as that you have this large set rather than just a pair of completions. So you do get, like, some intermediate credit assignment a little bit via this group comparison. But for tool use, it seems to be far enough out of distribution for small models especially, hence incorporating this turn level. So the way that I've been thinking about it is, like, in canonical RL, the state action are like things that you do many rounds of: like, take an action, go to a new state, take an action, go to a new state.

Speaker 3

有段时间我把LMRL理解为:每个token是行动,新序列是新状态。这确实可行。但也可以把每个回合视为一个行动。

And for a while, I thought about LM RL as, like, oh, each token's an action and the new sequence is a new state. And you can kind of do that. But you can also think of each turn as an action.

Speaker 2

没错,这样更合理。

Yeah. That's more likely. Yeah.

Speaker 3

此时状态就是工具调用的返回结果。这样算法设计时就能用不同方式处理信用分配,从奖励角度看也更灵活。感觉业界正转向基于模型的奖励:要么用LLM作为知道正确答案的裁判,要么让模型验证回答的特定属性。这比写解析器灵活得多——比如写数学题检查器其实非常困难,要处理LaTeX、Markdown、等价分数等各种边缘情况。

Where the state is the response you get back from the tool call. And now you, like, have a different way of designing your algorithms to take into account credit assignment, which also is, like, a little more flexible from a reward perspective. So, like, it feels like people are moving in the direction of model based rewards, where you either do LLM as a judge, where the judge sees the correct answer, or it has questions it's supposed to verify as properties of the response. Just because that's much more flexible than, like, trying to write these little parsers. Like, writing a math parser to check if the math question is right is, like, not that easy, actually, because there's so many edge cases and you wanna handle, like, LaTeX support and markdown and, like, equivalent fractions.

Speaker 3

不如直接让模型处理这些——别写2000行的Python脚本了。

And it's like just, like, let a model do that. Don't don't have a 2,000 line Python script that does that.

Speaker 2

让我澄清一下。数学解析器用于验证数学是否正确,你们里面内置了LaTeX解析器吗?

And so Let me let me clarify. Math parser to verify that the math is right, and you you have a LaTeX parser inside it?

Speaker 3

是的。很多模型会自然地用LaTeX思考,因为它们接受过大量arXiv上LaTeX资料的训练。

So yeah. So, like, a lot of models naturally will, like, think in LaTeX because they've been trained on a lot of arXiv, like, TeX.

Speaker 1

我不知道

I didn't

Speaker 3

这个。

know that.

Speaker 2

嗯,有道理。我想。

Yeah. That makes sense. I guess.

Speaker 3

如果你在做类似R1的事情,人们会说数学很容易验证。但所谓容易验证通常仍是一长串代码,需要处理各种烦人的边缘情况。即便如此,准确率也就98%左右。好吧。

If you're doing, like, an R1 and people are like, oh, math is easy to verify. The easy to verify still is usually, like, this very long piece of code that has to handle lots of annoying edge cases. And even then, it's like 98%. Yeah. Okay.

Speaker 3

因为这是自由形式的响应,写方程的方式不止一种。如果有两个等价的有效数学表达式,但它们都是符号化的,就需要验证这两个符号表达式是否正确。其中一个可能是代码形式,一个可能是LaTeX形式,还有一个可能是Word文档形式。如果是夹杂文字的...

Because, like, it's a free form response, and there's not only one way to write an equation. Like, if you have two valid mathematical expressions that are equivalent, but they're also, like, symbolic. Like, you need to, like, verify the two symbolic expressions are correct. One of which might be written as code, one of which might be written as LaTeX, one of which might be, like, written as words. Like, you can't do it if it's words with these, like,

Speaker 2

字面部分

literal parts the

Speaker 3

要覆盖这些情况。这也是为什么你会看到模型经常把最终答案框起来——这是个取巧的办法:如果你能准确定位答案位置,验证信息会容易得多。否则模型说'问题的答案是4'时,你还得解析掉'问题的答案是'这部分。确定性奖励虽然好用,但实施起来很痛苦,而且很难跨领域推广。数学中最简单的情况是最终答案是个整数且位置固定——比如有个固定显示整数的方框。

to cover a lot of these cases. That's also why you'd see models put boxed around their final answer a lot. It's one hack: it's much easier to kind of verify the right piece of the information if you know exactly where it's gonna live, rather than, like, the model saying, the answer to the question is four. Then you have to, like, parse away "the answer to the question is". And so it's like, deterministic rewards are, like, nice if you can get them to work, but they're also really painful, and they're pretty hard to generalize across domains. Like, for math, the easiest is when the final answer is an integer and it lives in the same spot. Like, there's a box where that's gonna be an integer.

Speaker 3

这也是GSM8K长期被广泛使用的原因之一——基本都是整数。我想AIME全是整数。这些东西超级容易验证和解析。多选题也是——多选题验证起来特别简单。

And so this is one of the reasons, like, everyone used GSM8K for so long, is because, like, mostly integers. I think AIME is all integers. It's super easy to verify these things and to parse them. And multiple choice too. Multiple choice is super easy to verify.

Speaker 3

但任何更具灵活性的方法,比如基于规则的奖励机制,开始显现出局限性。是的。而基于模型的方向似乎更有前景,我认为在强化学习循环中使用语言模型作为评判者这一思路尚未充分探索。某种程度上这回到了Anthropic长期探讨的宪法AI理念。

But for anything that's a little bit more flexible, rule based rewards start to break down. Yeah. Right. But the model based direction seems to be pretty promising and, I think, underexplored for, like, what if you use an LM as a judge in your RL loop? I think kinda going back to, like, Anthropic's been talking about this for a long time via constitutional AI.

Speaker 3

在那个案例中,重点不在于语言模型直接评判并给予奖励,而更多是训练一个能进行词元级优势评估的奖励模型——也就是PPO算法的实现方式。但看起来这种方法也能适配GRPO和其他强化学习变体,实现机制融合。

In that case, it was less about the LM judging and giving a direct, like, reward to the model and more about training a reward model that was doing, like, token level advantage estimates, which is the PPO way of doing it. But it seems like you can kinda do that for GRPO too and other flavors of RL where you can incorporate

Speaker 2

完整的奖励模型?没错。

Full reward model? Yeah.

Speaker 3

奖励模型本质上可以是个经过微调的大语言模型,可能通过校准使其响应范围更合理。当然,它也可以具备推理能力,或者实现工具调用功能。

The reward model can basically be an LLM that's, like Yeah. Fine tuned to, like, be more calibrated maybe and to have the right kind of range of responses. Yeah. But you could also have it be a reasoner. You could have it be something that is able to do tool calling.

Speaker 3

没理由不充分发挥语言模型的全部能力——无论是将其用于评估答案正确性,还是验证是否符合特定标准。这正是我最看好的方向:突破确定性规则奖励的局限,转向更灵活的机制。不过要注意,这种范式不太适合词元级的即时奖励,但在回合层面很有效——比如验证某个搜索查询是否有用。

There's no reason why the full power of LMs can't be offloaded or can't be also given to the process of evaluating whether or not an answer is correct or satisfies a certain set of criteria. And so I think, like, that's the direction I'm, like, most excited about: really pushing on kind of beyond deterministic rule based rewards into, like, these more flexible things. So, okay. That paradigm is not going to work super well with token level rewards. But I think it does work with turn level rewards of, like, can the LM verify, like, whether a certain search query was useful?

Speaker 3

确实。有很多这类颗粒度很细的问题,只要语言模型足够强大,基本都能准确判断。

Sure. Like, there's lot of these questions that are pretty granular that LMs can, like, basically nail all the time if it's a good enough LM.

Speaker 2

对,将其分解处理。

Yeah. It decompose it.

Speaker 3

你可以通过这类编程方式将其整合进强化学习。

You can incorporate that into RL with that sort of program.

Speaker 2

太棒了。我们预定的议题应该都讨论完了。Alessio,你那边也没问题了吧?显然我们还需要些时间研究Claude 4。有什么想补充的吗?

Awesome. I think that was all the, you know, topics that we had prepped. Alessio, I think you're also pretty good on that. Obviously, it'll take some time to figure out Claude 4. Anything you wanna plug?

Speaker 2

你的演讲我们之前已经提到过了。

We already talked about your talk, I guess, coming up.

Speaker 3

当然。好的。我6月4日会在AI Engineer。就几周后。对。

Sure. Yeah. I'll be at AI Engineer on June 4. In a couple weeks. Yeah.

Speaker 3

对。快到了。

Yeah. Coming up.

Speaker 2

你的赛道特别受关注。

Your track is particularly hyped.

Speaker 3

是啊。我的会很有趣。我还和OpenPipe的Kyle Corbett合作开设课程,我们俩都有各自的开源项目,专注于智能体强化学习领域,我们认识有段时间了,想通过更结构化的方式向世界传递信息,特别是考虑到智能体的实际应用场景,帮助人们了解更多运作原理。关于这个很快会有更多消息。太棒了。

Yeah. Mine's gonna be that's gonna be a lot of fun. I'm also collaborating with Kyle Corbett from OpenPipe to do a course. Both of us, like, have our open source projects that are agentic RL focused, and we've been friends for a while and are trying to do something that's a little more structured as, like, a way of getting information out into the world, for people who we're especially thinking about, like, kind of practical use cases for agents, and helping people, giving people an outlet to learn more about, like, how the stuff works. Yeah. More coming soon about that. Awesome.

Speaker 2

好吧,我想就这些了。

Well, think that's it.

Speaker 1

谢谢你来参加,Will。

Thanks for coming on, Will.

Speaker 2

是的。非常感谢你临时抽空过来。很高兴我们能促成这次对话。等你们准备好了,我们再和Kallo做第二部分,完整讨论PrimeIntellect的事。

Yeah. Thanks for coming on at very short notice. I'm glad we can make this happen. We'll do part two with Kallo and do do a full PrimeIntellect thing whenever you guys are ready.

Speaker 3

太棒了。没问题。很好。太好了。

Awesome. That'll be fine. Great. Awesome.

关于 Bayt 播客

Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。
