本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
研发方面,比如研究团队关注的重点,我认为现在更集中在训练后阶段,比如从现有模型中挖掘更多性能,因为这是一种更新的范式,而且仍有大量低垂的果实可以采摘。
The R&D, like the focus of the research teams, I think it's more focused nowadays on post-training, like getting more performance out of the existing models, because it's the newer paradigm and there are still low-hanging fruits to be picked.
而在预训练方面,技术已经相当成熟了。
Whereas in pre-training, it's already pretty sophisticated.
如果你使用更多数据、优化数据组合,或者采用多标记预测这类方法,仍然可以获得更好的结果。
You will still get better results if you use more data, optimize the data mix, maybe multi token prediction and these types of things.
但目前最有趣的发展主要发生在推理领域的训练后阶段。
But most of the interesting things are happening now on the post training front in the reasoning realm.
我认为我们在这里会看到更多进展。
I think we will see more there.
好了,各位。
Alright, everyone.
欢迎收听TWIML AI播客的又一期节目。
Welcome to another episode of the TWIML AI podcast.
我是你们的主持人萨姆·查林顿。
I am your host, Sam Charrington.
今天,我邀请到了塞巴斯蒂安·拉施卡。
Today, I'm joined by Sebastian Raschka.
塞巴斯蒂安是一位独立的大型语言模型研究员。
Sebastian is an independent LLM researcher.
在开始之前,请花点时间在您收听本节目的平台订阅我们的节目。
Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show.
塞巴斯蒂安,欢迎再次做客我们的播客。
Sebastian, welcome back to the podcast.
已经有一段时间了。
It's been a little bit.
是的。
Yeah.
谢谢你再次邀请我,萨姆。
Thank you for inviting me back, Sam.
我很高兴能回来,和你聊聊大型语言模型、人工智能,以及你想到的任何话题。
I'm happy to, yeah, be back and to chat about LLMs, AI, and whatever you have in mind.
上次我玩得很开心,所以希望我们这次也能同样有趣和精彩。
I had a lot of fun last time, so I hope we can make it fun and interesting again.
你知道吗,我这时候总爱开个玩笑,虽然有点老套了,就是上次我们谈话还是三年前,没什么变化。
You know, my joke around this time, it's getting a bit old, but it's like, the last time we spoke was three years ago, not much has changed.
对吧?
Right?
我觉得,好事总成三,有这么一句话。
Well, all good things come in threes, I think, there's a saying.
对吧?
Right?
但实际上,变化巨大,我们今天要重点讨论这些变化中最最新、最重要的部分——LLM的最新进展以及2026年LLM的前景。
And in fact, a ton has changed and we're gonna be focusing on the most recent and most important of those changes in particular, what's new with LLMs and what to expect with LLMs in 2026.
这正是你研究和教育工作中投入大量时间的领域。
This is an area that you spend a lot of time focusing on with your research and education work.
也许我们可以先从宏观角度聊聊,你脑子里最直接的想法是什么?和一年前相比,我们现在处于什么位置?你对这个领域的发展有什么总体看法?
You know, maybe we can start with just, you know, kind of top of mind, like if you think about, you know, very big picture where we are now compared to where we were a year ago, you know, what what is your broad reflection about the evolution of the space?
看看今天和一年前相比,这几乎像是DeepSeek的周年纪念:大规模的DeepSeek-V3模型伴随着R1模型一起发布,我称之为打引号的"推理革命"。
Look at today compared to one year ago, it's almost like the anniversary of DeepSeek, the big DeepSeek-V3 model accompanied by the R1 model, the "reasoning revolution," I would say, in quotation marks.
这仍然是LLM。
It's still LLMs.
基础模型还是同一个,但现在我们有了更多构建于其上的技术,让模型在解决更复杂问题时变得更聪明。
It's still the same base model, but we now have more techniques on top of that to make the models smarter in terms of solving more complex problems.
所以从架构上看,我们的LLM架构仍然相对相似,但与去年相比,推理训练是其中一项新内容。
And so, architecture-wise, I would say our LLM architectures still look relatively similar, but the reasoning training is one of the new things if we compare today to last year.
此外,我认为现在对工具使用的重视程度更高了。
And then also I think there's a more heavy focus on tool use.
以前,当ChatGPT刚推出,或者LLM的首个版本时,重点主要放在通用任务上,让LLM回答我们所有好奇的问题,比如从记忆中提取信息:如果我们问它一个数学问题或知识性问题,LLM就会从它的记忆中调取内容并写出答案。
So back then, when ChatGPT was launched, or the first iteration of LLMs, the focus was mainly on general-purpose tasks, having the LLM answer all the things we are curious about from memory: if we ask it a math question or a knowledge question, the LLM would basically draw from its memory and write the answer.
但这种方法并不总是最有效或最准确的。
But that's not always, let's say the most effective or accurate thing to do.
对我们人类来说也是如此。
Similar for us humans.
我的意思是,大语言模型和人类的思维方式不同,但作为人类,如果你问我一个复杂的数学问题,比如两个大数相乘,我会拿出计算器来算。
I mean, LLMs are different from how humans think, but we as humans, if you ask me a complicated math question, like, or just like multiplying two large numbers, I would pull out my calculator and calculate that on a calculator.
我不会在脑子里算。
I wouldn't do that in my head.
我或许能算,但会花很长时间。
I maybe could, but it would take a long time.
这样更容易出错,而且完全没有必要这么做。
It's more error prone and so forth, and there's no need to do that.
同样的,现在有了更现代的工具,让大语言模型使用工具变得越来越普遍。
And the same with LLMs now with more modern tooling, it becomes more and more popular to use or to have the LLM use tools too.
这需要训练大语言模型来使用这些工具。
It requires training the LLM to use those tools.
通过这种方式,我认为可以降低幻觉率——虽然不能完全消除,但能显著减少,并让回答更准确。
With that, I think we can reduce like hallucination rates, not completely getting rid of those, but reducing those, and then also making answers more accurate.
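The tool-use idea described here can be sketched as a tiny dispatch loop: the model emits a structured call instead of answering from memory, and the harness executes it and feeds the result back. The message format, tool names, and function below are invented for illustration, not any particular vendor's API.

```python
import json

# Toy tool registry; a real harness would sandbox execution rather
# than eval arbitrary expressions.
TOOLS = {
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy only
}

def handle(model_output: str) -> str:
    """Execute a hypothetical structured tool call emitted by the model."""
    # e.g. model_output = '{"tool": "calculator", "args": "123*456"}'
    msg = json.loads(model_output)
    result = TOOLS[msg["tool"]](msg["args"])
    # The stringified result would be fed back into the model's context.
    return str(result)

# The model asks the calculator instead of "remembering" the product:
print(handle('{"tool": "calculator", "args": "123*456"}'))  # 56088
```

The point of the sketch is the division of labor: the model only decides *which* tool to call and with what arguments, while the deterministic tool produces the actual answer, which is why hallucination rates drop for tasks like arithmetic.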
而推理能力,本质上是给大语言模型更多时间——用引号说——来思考问题。
And then with reasoning, capabilities, it's essentially giving the LLM more time in quotation marks to think through a problem.
所以我认为,过去一年,尤其是从去年到现在,我们能够调整并取得进展的两个主要关键点就是这些。
So these are, I think, the two main knobs that we could tune and make progress on, if we look particularly at the difference between last year and now.
是的。
Yeah.
我们将深入探讨推理能力与工具使用方面是如何演进的,以及其他一些方面。
We'll dig into the technical aspects of like how we've evolved in reasoning and how we've evolved in tool use, among other things.
但在那之前,我在想,或许从实际角度谈谈,我们今天所处的位置与以往有何不同、发生了哪些转变,会很有趣。
But before we do that, I was thinking it might be interesting to talk a little bit about from a practical perspective, how do you think where we are today is different and has shifted.
这确实非常有趣。
And, you know, it's super interesting.
我们现在讨论的是2026年2月,而今年已经出现了大量新动态、新模型,比如Opus 4.6、OpenAI 5.3。
We're talking in kind of, you know, February and already this year in 2026, there's been a ton of, you know, new news, new models, Opus 4.6, OpenAI 5.3.
还有整个OpenClaw和Moltbot的事情。
You know, there's been the whole OpenClaw and Moltbot thing.
谈谈今年到目前为止我们看到的变化,以及从实际角度如何看待大语言模型的发展,我觉得很有意思。
You know, talk a little bit about, you know, the the what we've seen already this year, but in the context of, like, where you see LLMs are from a practical perspective.
是的。
Yeah.
我会说,没错,这是个很好的观点。
I would say, yeah, that's a a good point.
我们现在才二月,这意味着中国新年甚至还没到,而我认为那时还会有一批新模型发布。
We are just in February, and that means the Chinese New Year hasn't even occurred yet, where I think there will also be another batch of releases.
但我认为,在开源权重方面,现在越来越多的公司正在开发围绕大语言模型的工具,这些工具正变得越来越成熟,同时大语言模型本身也在进步。
But I think, on the open-weight front, that is like a separate thing, where you now have companies developing the tooling around LLMs, which is becoming more and more mature, and then you have better LLMs themselves.
我认为我几乎可以把这两者分开来看。
And I would almost separate those two.
所以我的假设是,如果你把最好的开源大语言模型放进比如Gemini或Claude的界面中,你几乎能得到相同水平的质量和性能——因为如今很多应用场景其实都围绕着围绕大语言模型的工具包装展开。
So my hypothesis is, if you would take the best open-weight LLM and put it into, let's say, a Gemini or Claude interface, you would almost get the same type of quality and performance, where I think a lot of use cases nowadays revolve around the tool wrapper around the LLM.
这种想法大约在去年年底被广泛传播,也就是所谓的 harness engineering。
That's this idea that was popularized toward the end of last year around harness engineering.
所以我认为这也是我们使用LLM方式的改变,以前它只是一个非常简单的聊天界面。
So I think that is also something about how we changed using LLMs, because before it was just a very simple chat interface.
当时只是一个模型。
There was a model.
是的。
Yeah.
是的。
Yeah.
是的。
Yeah.
然后它变得越来越复杂了。
And then it became, you know, more sophisticated.
你可以上传文件和PDF。
You could upload files and PDFs.
就我个人的使用场景而言,我主要用LLM来做校对、检查内容这类事情,虽然听起来有点奇怪。
And so for my personal use case, I use LLMs mostly for like, actually, it sounds weird, but like proofreading, checking things, and these types of things.
所以在录制之前,我刚写完一章内容,想更新目录,于是就把PDF上传到ChatGPT界面,说:嘿,你能给我提取出标题吗?这样我就不用自己手动找了。
So just before recording here, I was finishing writing a chapter, and I wanted to update the table of contents, and then I just uploaded the PDF to the ChatGPT interface and say, hey, can you give me the headers so I don't have to pull that out myself?
然后你还可以再核对一下是否正确,但像这种小便利任务,比如让工作变得更简单,处理这些繁琐的事情。
And then you can just double check also that it is correct, but, like, little convenience tasks, like making work a bit simpler like these tedious things.
但正如你所说,后来还推出了新的Opus模型,接着OpenAI发布了Codex 5.3以及配套的macOS应用。
But then, like you said, there was also the new Opus model, and then OpenAI released Codex 5.3 and a macOS app with that.
我认为这同样是模型能力的一次巨大飞跃。
And I think that is also like yet another leap in terms of what these models are capable of.
我的意思是,以前也有专门用于编程的LLM,使用LLM进行编程变得越来越流行,但它们一直在不断进步,变得越来越好。
I mean, before there were also coding LLMs, and it became more popular to use LLMs for coding, but it's always getting better and better.
在此之前,我用的是Visual Studio Code,因为我一直用这个代码编辑器,可能已经有五到十年了。
And so before, I used Visual Studio Code. I mean, I've just used Visual Studio Code, the code editor, for years, like maybe five or ten years now.
再之前,我用的是VIM和其他工具,但我对它的界面非常熟悉。
And before that, I was using VIM and other things, but I'm very familiar with the UI, basically.
所以我有我的Git代码树。
So I have my Git tree.
我知道终端在哪里,还有那些东西。
I know where I have a terminal inside and that stuff.
所以我其实很喜欢把LLM作为插件集成进去,有时候你可以直接说:‘好吧,我这里有个bug。’
And so I actually liked having the LLM as a plug-in there, where you can sometimes say, okay, I have a bug here.
你能帮我再检查一下吗?
Can you just double check?
这就像在你的工作流程中添加了另一层工具。
You know, it's just like another layer of, tools you add to your workflow.
所以LLM不需要总是摆在最前面。
So the LLM doesn't have to be front and center.
它也可以是一个小小的助手,你知道的,我仍然自己调试代码,但经常让LLM帮忙检查一下会又快又方便。
It can also be this little helper. I mean, I still debug things myself, but often it's actually quite nice and fast to ask the LLM to double-check things.
我喜欢的是,它就像一双额外的眼睛,但又不会完全接管并替你做所有事情。
And what I like about it is when it's like a second pair of eyes, but it's not completely taking over and doing everything.
它确实在某种意义上让我的工作变得更好。
But it's making your work better in a sense.
比如,你可以增加额外的检查,并且可以问:嘿,你能建议一些改进方法,让我的代码更高效吗?
Like, you have, additional checks and you can ask, hey, can you suggest improvements to make my code, let's say, more performant?
但毕竟,作为人,你仍然需要提出正确的问题,也需要亲自运行实验,来验证这些改动是否真的让代码更快了。
Still, I mean, you as the person, you still have to kind of ask the right questions, and you still have to run actually the experiments to see whether it actually makes the code faster.
所以这并不意味着LLM会替你做所有事,但它能提出有用的建议。
So it's it doesn't mean, like, the LLM does everything for you, but it suggests useful things.
我知道很多人也用它来写代码,这种方式也很有效。
I know a lot of people also use it for coding things, so that also works.
有了新的Codex后端,还有Codex应用,现在的情况和一两年前不同了——那时候人们会把代码文件上传到ChatGPT、Gemini或Claude,然后获得一些反馈,再手动整合进去。
With the new Codex back end, let's say, but also the Codex app, what's new is, I mean, a year or two ago people were uploading code files to ChatGPT or Gemini or Claude, getting some feedback, and then you had to manually incorporate that.
而现在,它变得更加一体化了,比如
And now it's more in line, like
好的。
Okay.
大家已经有一段时间没这么做了。
It's been a while since folks have been doing that.
是的
Yeah.
对
Right.
所以现在,你可以直接查看文件差异,这更加原生了。
So that is now more native, where you can see the file diff.
你不需要离开你的编码环境,而且现在,当你在本地运行这些工具时,还可以让它们访问你整个文件夹,比如整个 Git 项目。
You don't have to leave your coding environment, but then on top of that, when you run these tools locally, you can also give it access to your whole folder, let's say your whole Git folder.
这样它就能看到所有文件的上下文。
And then it can see the context of all the files.
你不需要手动上传任何内容。
You don't have to manually upload anything.
此外,如今它还能自己使用工具。
And then on top of that, it can also nowadays use tools itself.
所以你可以授权 LLM 运行某些命令,比如自行运行单元测试之类的操作。
So you can give permission to the LLM to run certain commands, to run a unit test by itself, and these types of things.
这些加在一起,我不会说有什么单一的东西是颠覆性的或改变游戏规则的,但所有这些小改进累积起来,让大语言模型变得更强大,因为它正变得越来越复杂。
And all together, I wouldn't say there is a single thing that is groundbreaking or a game changer, but all these little things add up to make the LLM more capable, because it's getting more and more sophisticated.
我认为这就是我们最近几个月、也许过去几个季度所看到的:人们不断开发这类能力,让模型整体变得更好。
And I think that's what we have been seeing in recent months, and maybe the last few quarters, where people develop these types of capabilities and just make the model better.
所以,通过改进界面,我们就能从大语言模型中获得很多性能提升,基本上就是这样。
So there's a lot of performance we can get from the LLM by making the interface better, basically.
你有没有发现,这些新模型——你刚才说没有突破性的变化,但你有没有在这些模型中发现一些新的能力让你感到惊讶?还是说这些变化对你来说都只是渐进式的?
Have you found, you just said that there are no breakthrough changes there, but did you find yourself surprised by some new capability in either of these two models, or is it very much incremental to what you're already doing?
对我来说,这更多是渐进式的。
For me, it's personally more incremental.
更多的是一种便利性。
It's just more like the convenience.
它们正变得越来越稳健和更好,但我不会说有什么让我觉得特别惊艳的地方。
They're just getting more robust and better, where I wouldn't say there is any wow effect for me.
比如,以前的模型根本做不到 x y z。
Like, where it's like, oh, my previous model was not able to do x y z.
只是好了一点点。
It's just a bit better.
你知道,它变得越来越稳健,越来越好。
You know, it's getting more robust and better.
而且我也对结果更有信心了。
And then I also develop a bit more trust in the results.
我觉得这更像是一种渐进式的改进。
It's more like a gradual improvement, I think.
但有一点仍然是,我们在不同的推理努力程度之间仍然有区别,这就像一个滑块,控制LLM花多少时间来为你生成结果。
The one thing is, we still have the distinction between the different reasoning efforts, like a slider for how much time the LLM should spend on getting you the results.
而且有不同的设置,从低推理努力或无推理努力到高推理努力。
And there are different, you know, settings from low or no reasoning effort to high reasoning effort.
这会影响LLM生成结果所需的时间。
And that changes the time it takes for the LLM to generate results.
我记得半年前或一年前,如果你想获得好的结果,几乎总是得使用最高设置或高推理模式,那可要花很长时间。
And I remember, like half a year ago or a year ago, if you wanted to have good results, you almost always had to use the highest settings or the high reasoning modes, which took forever.
现在,即使是较低的模式,感觉也相当不错。
Nowadays, even the lower modes feel pretty good.
对于大多数任务来说,使用中等或高推理强度就足够了,无需使用超高强度。
Like, for most tasks, it's sufficient to use the medium or high reasoning efforts instead of the extra-high ones.
这样你就能更快地得到结果。
And then you get results faster.
我认为这也是这些模型生活质量的提升:以前你可能只是偶尔运行它们,因为你不想等五分钟,但现在它们已经成为你工作流程中更常规的一部分。
And I think that's also a quality-of-life improvement for these models, where before you ran them maybe occasionally because you didn't want to wait five minutes, but now it becomes more routine that they are part of your workflow, basically.
是的。
Yeah.
对。
Yeah.
好。
Yeah.
我想进一步补充一点,现在大语言模型已经非常擅长自行判断,为我们的查询提供良好答案需要多少努力。
I would expand on that and say that the LLMs have gotten really good at knowing themselves how much effort is required to provide us a good answer to a query.
所以我发现自己绝大多数时候只是在像ChatGPT这样的平台上输入我的提示,而不指定模型或思考级别,让它自己决定。
And so I find myself, in the vast majority of cases, just typing my prompt into ChatGPT, for example, not specifying a model or a level of thinking, and letting it figure it out.
如果我需要更多,我会告诉它我想要更多。
And if I want more, I'll tell it I want more.
但它在判断何时给我快速回答、何时使用搜索工具、何时进行更多思考等方面,做得相当不错。
But it does a fairly good job of determining when to just give me a quick answer, when to use a search tool, when to do more thinking, that kind of thing.
我同意。
I agree.
我在ChatGPT上的设置是自动模式,它会自行决定使用更多还是更少的思考精力。
I have my setting in ChatGPT on auto, the auto mode, where it decides by itself whether it should use more or less thinking effort.
一样的情况。
The same thing.
我唯一仍然使用专业模式的场景是,当我回到之前提到的章节时——比如我写了一个40页的PDF,我会上传它并说:‘嘿,你能检查一下有没有不一致的地方、编号错误之类的吗?’
The only context where I still use the pro mode is, coming back to the chapter I mentioned, when I have a chapter written, like a 40-page PDF, I would upload it there and say, hey, can you check for any inconsistencies, incorrect numbering, and all that type of stuff.
然后我会把它设为专业模式,也就是那个需要二十分钟的模式。
And then I set it to the pro mode, like the one that takes twenty minutes.
我会去吃午饭或晚饭,回来后再查看结果;不过我很少这么做。
I go have lunch or dinner, come back, and look at the results. It's a rare thing that I do that.
我的意思是,每个月我才会完成一章或者写一些重要的东西,那时我希望获得最高质量的检查。
I mean, once a month I finish a chapter or write something important, where I want the maximum quality check, let's say, on that.
但正如你所说,对于大多数任务,使用轻量级的处理就足够了。
But like you said, for most tasks, it's sufficient to use the light effort.
是的。
Yeah.
或者用自动模式,它自己决定如何处理。
Or the automatic one where it decides by itself, essentially.
对。
Yeah.
没错。
Right.
没错。
Right.
我提到了Moltbot以及该工具的发布。
And I mentioned Moltbot and the release of that tool.
你花了很多时间深入研究过它吗?
Have you spent much time digging into that?
嗯,Moltbot,我想它现在叫OpenClaw了。
Well, yeah, Moltbot, I think it's now called OpenClaw.
OpenClaw。
OpenClaw.
是的。
Yeah.
它变化挺大的。
It changed bit quite a bit.
这很有趣。
It's it's interesting.
它就像是一个本地代理,现在人们可以在自己的电脑上运行它,我觉得有趣的是,它能让人对这些事情产生兴趣。
It's like this local agent that people can now run on their own computers, where I think what I find interesting about it is that it gets people excited about things.
这几乎就像当年DeepMind推出AlphaGo的时候,那个下围棋的模型,就像一种棋盘游戏,当时确实很轰动。
It's almost like back then when DeepMind had AlphaGo, the Go-playing model, like a board game model.
它变得非常令人兴奋,因为即使在总体人群中,或者至少在我的圈子里,以前下围棋的人并不多,但AlphaGo与世界冠军对弈时,却让我的家人和所有人都对这种进展感到兴奋。
It got really exciting because, in the grand scheme of things, or at least in my circles, not many people played Go before, but it got people like my family and everyone really excited to see this type of progress when it was playing against the world champion.
我认为Moltbot的情况也类似,它让人们有兴趣去了解和体验这些技术。
I think with Moltbot it's kind of similar, where it gets people interested and excited about checking these things out.
我认为它还有很多真正实用的场景,比如你可以用它来安排日程和管理邮件。
I think there are also a lot of genuine use cases around it, where you can run it to organize your calendar and emails.
就我个人而言,我还没有做过这些。
For me personally, that's something I have not done.
也许我有点信任问题。说实话,就我个人而言,确实是这样。
Maybe I have a little bit of a trust issue. I mean, personally, it's like, yeah.
我不确定自己是否足够信任它,去处理我的财务或日程安排。
Well, I don't know if I trusted enough to do my finances or my calendar.
我仍然对采用这样的技术有点犹豫,但我认为这很好地展示了,对于那些并非从事LLM开发的人来说,这些大模型到底能做什么,以及它们的意义何在。
I'm still a bit hesitant to adopt something like that, but I think it's a cool demonstration, to show someone who is not developing LLMs what these LLMs can do and what their purpose is, in a sense.
我觉得这其实挺酷的。
I think that's actually quite cool.
是的。
Yeah.
还有没有其他主要是大语言模型包装器的工具或服务,是你已经离不开的?还是说你主要还是直接使用模型本身,或者像开发环境这样的东西?
Any other tools or services that are largely wrappers around LLMs that you have come to depend on, or do you find yourself mostly turning to the models themselves or, you know, the dev environments?
是的,对我而言,我的工作流程中大多数情况还是手动操作,还没有任何高度自动化的东西,比如需要以增量方式或代理式环境运行的任务。
Yeah, for my workflows it's mostly still manual. I don't have anything super automated, where I need to run something incrementally or in an agentic type of setting.
不过,我最近做了很多自己开发生产力应用的事情。
What I've been doing a lot, though, is developing my own apps, like productivity apps.
我想想,以前我刚入门时是个程序员,用bash、终端和Python这些工具,自己写各种脚本来自动化任务。
I think back in the day, I grew up as a coder, using bash, the terminal, and Python, and I was writing scripts for myself for all kinds of things to automate tasks.
现在有了大语言模型,我稍微改变了方向,开始开发原生的macOS应用。
And now with LLMs, I kind of changed that a bit towards developing native macOS apps.
我一直想学Swift编程。
Like, I always wanted to learn coding in Swift.
我从来没时间,因为确实如此。
I've never had the time because yeah.
我的意思是,对我而言,还有太多更重要的事情要做,所以这成了一个机会,让我想:嘿。
I mean, there are so many other more important things to do for me where, like, that was an opportunity to say, hey.
我想要这个功能,但选择把它做成一个原生的 macOS 应用,因为这样更方便。
I want this, but instead of as a script, as a native macOS app, because it's just more convenient.
比如就在前几天,我妻子也做了一个播客。
For example, just the other day, my wife also has a podcast.
这是一个读书俱乐部类的播客,我帮她处理各种事务,比如上传内容、编辑,以及整个工作流程,因为她不太懂技术。
It's like a book-club podcast, and I help her with the episodes, basically, like uploading everything and editing, just the workflow in general, because she's not a tech person.
然后我之前写了一个脚本,用来给播客添加章节标记。
And then I had, like, a script to add these chapter marks to the podcast.
就在前几天,我做了一个原生的 macOS 应用,你只需输入时间戳,点击一个按钮,它就会自动给音频文件添加章节标记。
And just the other day, I made a native macOS app where you can just add the timestamps and click a button, and it adds the chapter marks to the audio file.
像这样的小工具。
Like simple things like that.
然后我可以把它分享给她,她现在就能用了。
And then I can share it with her and she can use it now.
这些就是日常生活中那些小小的便利性提升,你不再需要手动去做事情,而是可以自动化完成。
And it's just like these little quality of life things in your everyday life where you, instead of just doing things manually, you can just automate them now.
我的意思是,严格来说,这个应用本身并没有在运行大语言模型,而是利用大语言模型开发出了一个行为确定的工具。
I mean, the app itself is not running an LLM, but it's using the LLM to develop something that behaves deterministically, in a sense.
所以我更像那种喜欢做这类事情的人。
So I'm more like a person who does that kind of thing.
比如,当我浏览社交媒体动态时,作为一名研究者,我主要关注论文,因此经常需要收藏大量链接,比如 arXiv 的 PDF 或摘要。
For example, when I read social media feeds, as a researcher I'm mostly interested in papers, so I often end up bookmarking a lot of arXiv links, links to arXiv PDFs or the abstracts.
然后我有一个 Markdown 表格,里面存了很多这样的链接。
And then I have my markdown sheet where I have a lot of these links.
现在我为自己开发了一个原生 macOS 应用,只需输入这些链接,它就能自动提取标题、日期、作者姓名,并以整洁的格式呈现,让我的生活更轻松,不用一个个点开链接。
And now I wrote myself a native macOS app where I just put in these links, and it pulls out the title, the date, the author names, and the links in a nice format, just making my life easier so I don't have to click on them individually.
我得到一个清晰的列表,能看到标题,还有是的。
I get a nice list and see the titles and yeah.
我觉得像这样的小工具,语言模型非常适合用来开发那些我本来没时间去做的工具。
And I think for little things like that, LLMs are super cool, to develop these tools that I would not have had time to develop otherwise, basically.
是的。
Yeah.
这和我的经历非常相似。
That parallels my experience quite a bit.
我认为在过去一两年里,我从大语言模型中获得的最大好处之一,就是编写一些定制的工作流工具。
I think some of the most benefit I've gotten out of LLMs in the past year or so has been writing custom workflow tools.
主要是围绕播客,比如我们与赞助商合作时,经常需要提取这些数据分析报告,这既重复又耗时。
So, primarily around the podcast. For example, one of the things we do when we work with sponsors is pull these analytics reports, and it was repetitive and time-consuming.
于是我开发了一个基于网页的工具,它可以调用我们获取数据分析的API,提取有关各集节目的信息,你可以选择某一集。
And so I created a web-based tool that hits the API where we get the analytics, pulls information about episodes, and lets you choose an episode.
然后我们会把大量数据导入pandas进行分析,并生成一个电子表格,比如Google文档。
And then it pulls a bunch of data into pandas, does some analysis, and generates a spreadsheet, like a Google doc.
这个应用本身并没有使用大语言模型,但它是借助大语言模型开发出来的。
And that app isn't using an LLM, but an LLM was used to create it.
这只是一个例子,大概还有五六种类似的、对我们的工作流程有重大影响的工具。
And that's one example of probably half a dozen fairly significant tools that have a big impact on our workflow.
是的。
Yeah.
你刚才又提到了一个很好的观点,那就是在这些情况下,大语言模型并没有直接执行常规的工作任务。
And that's a good point you made again, that in these cases the LLM is not doing the regular work, the task itself.
它更多是用于开发执行任务的工具。
It's more developing the tool to do the task.
在我看来,这也是一个重要的观点:大语言模型非常有用且能力强大,但有些任务如果用大语言模型来处理,几乎是浪费资源。
And I think that's also an important point, in my opinion: the LLM is very useful and very capable, but there are tasks where it's almost wasteful to use an LLM.
就像那句老话:如果你手里只有一把锤子,那么所有东西看起来都像钉子。我认为,对于确定性的任务,仍然应该开发确定性的工具。
It's like the saying, if all you have is a hammer, everything looks like a nail. I do think that if you have a deterministic task, it still makes sense to develop a deterministic tool.
你当然可以用大语言模型来解决,但甚至去问它‘一加一等于多少’这样的问题,都几乎是浪费。
You can use an LLM for that, but it's almost wasteful to even ask an LLM what one plus one is, or something like that.
你直接用计算器就行了。
You can use a calculator.
所以,我认为仍然很重要的是,要认清这个问题的本质,以及解决这个问题最合适的工具是什么。
So I think it's still important to recognize what the nature of the problem is and what the best tool for that problem is, basically.
是的。
Yeah.
我也做过一些工具,会使用LLM,就像一个分类器一样,属于非常简单的应用场景。
I've also done some tools where I'll use LLMs almost like a classifier, like a very simple use case.
我有一个工具,就是根据嘉宾的名字,
I have one where there's the name of the guest.
然后从Google Docs API中提取一系列最近的文件夹,找出与这位嘉宾的项目相对应的文件夹。
Then I pull a bunch of recent directories from the Google Docs API and say, find the directory that corresponds to the project for this particular guest.
但正则表达式或文本模式匹配并不总是有效,因为这些名称有时会有所不同。
And a regex or a text pattern match doesn't always work, because the names can sometimes be different.
但LLM可以轻松完成这项任务,而且重复性很高,错误率很低。
But an LLM can do it pretty easily, with a very high level of repeatability and a low error rate.
当需要类似人类的、或非结构化的方法时,LLM就非常适用了。
Yeah, where you need an almost human-like or less structured approach, LLMs are great for that.
我大学时也做过一个类似的项目。
I had actually a similar project as a college student.
我当时做了一个体育预测的副项目,纯粹出于乐趣,比如每日 Fantasy 体育,预测英超联赛周末哪位球员会进球之类的。
I was doing sports prediction as a side project, just for fun, like daily fantasy sports, predicting outcomes like which player scores a goal in Premier League soccer on the weekend.
为此,我还开发了一个非常复杂的东西,从不同网站抓取球员信息,分析谁受伤了、谁状态好之类的因素。
And for that, I was also developing this very sophisticated thing, which was pulling information about the players from different websites and looking at who's injured, who's in good form, and these types of things.
后来,我为了好玩又重新拾起了这个项目,这次用的是 LLM。
And now I've kind of revived that project, just for fun, using an LLM.
这跟你提到的姓名问题有点类似,因为有些球员的名字拼写略有不同。
It's kinda like the same problem you mentioned with the names because some players have like, the spelling of the name is slightly different.
某些字母上有重音符号,有些数据库里球员有中间名,有些则没有,要把这些数据对齐非常困难。
There are accents over certain letters, some people have middle names in certain databases and not in others, and you just have to get them lined up across the databases.
用正则表达式或确定性方法来处理真的很难。
It's really hard with regex or just deterministic approaches.
所以这确实是 LLM 的一个绝佳应用场景——利用非结构化、近乎模糊的数据,进行依赖上下文的解析。
So that is actually a great use case for an LLM: parsing unstructured, almost vague data, where things depend a bit on the context, basically.
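A minimal sketch of this kind of fuzzy name matching, under my own assumptions: a deterministic accent-stripping pass handles the easy cases, and an LLM prompt (the prompt wording and function names here are invented for illustration) handles the rest, with the model acting as a classifier that picks the matching record by index.

```python
import unicodedata

def normalize(name: str) -> str:
    # Deterministic first pass: strip accents and lowercase,
    # e.g. "Mbappé" -> "mbappe". Middle names and alternate
    # spellings still defeat this, which is where the LLM helps.
    return "".join(
        c for c in unicodedata.normalize("NFKD", name)
        if not unicodedata.combining(c)
    ).lower()

def build_match_prompt(query: str, candidates: list[str]) -> str:
    # For the hard cases, ask the model to pick the record by
    # index; the single-integer reply is then parsed by the caller.
    listing = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    return (
        f"Which entry refers to the same player as '{query}'?\n"
        f"{listing}\n"
        "Answer with the number only."
    )
```

The actual LLM call is omitted; any chat-completion client that returns a single integer would slot in after `build_match_prompt`, with the deterministic `normalize` pass used first to avoid paying for the easy matches.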
是的。
Yeah.
对。
Yeah.
所以,回到从实际角度看我们目前使用大语言模型的状况,我认为这次讨论主要得出了两点结论。
So maybe pulling back to where we are with LLMs from a practical perspective, I think two main things came out of this.
第一,我本来想加个限定条件,说前提是你具备开发者思维,但我们已经从"氛围编程"中看到,即使是技术没那么强甚至非技术背景的人,也能通过创建自定义工具来自动化工作流程中的特定环节,从而获得巨大价值。
One, I was going to caveat this by saying if you have a development mindset, but I think we've seen with vibe coding that even less technical or non-technical people can get a lot of value by creating custom tools to automate specific parts of their workflow.
所以,这在过去一年左右对我们两人来说都是非常有影响力的一件事。
So that's a huge thing that I think has been very impactful for both of us over the past year or so.
此外,就是利用模型本身的进步。对我而言,我很难明确说出一套规则,但当我面对某个具体问题时,我心中会有一个大致的判断模型。
And otherwise, just taking advantage of the improvements in the models. For me, I can't really articulate a rule set, but if I'm confronted with a particular thing, I have kind of a soft mental model.
我会先试试ChatGPT,或者这个情况我可能会先用Claude。
I'll think, I'll start with ChatGPT here, or I'll start with Claude for this or that.
你知道的?
You know?
所以我认为关键在于,我们俩都没有经常使用OpenClaw或者任何其他特别花哨的LLM包装代理工具。
So I think the takeaway is that neither of us is using OpenClaw or any particularly slick LLM-wrapper agentic tools with any regularity.
不过,对我而言,可能的例外是像Circleback或Granola这样的工具,用于做会议总结。
Maybe the caveat for me would be something like a Circleback or Granola to do meeting summaries.
但除此之外,主要还是像你描述的那样,通过原生聊天界面和面向开发的使用场景来使用。
But beyond that, it's mostly, like you described, use cases through the native chat interfaces and the development-oriented use cases.
我还可以补充一点。
I could maybe add one more thing.
你提到过,这其实基本上是一个滑块。
You mentioned it's also mostly a slider.
你可以完全不使用LLM。
Like, you can use LLMs not at all.
你仍然可以手动完成所有事情。
You can do still everything manually.
你也可以只使用LLM。
Then you can only use LLMs.
我知道有些人甚至完全靠LLM写的代码来创业,可以叫它"氛围编程",但我觉得"氛围编程"这个词已经不足以形容这种做法了。
Like, I know some people who even build a company just based on LLM-written code. Call it vibe coding, but I think vibe coding doesn't even do it justice anymore.
完全不手动写代码了,全靠LLM来构建网站、产品和所有东西。
They're not doing any manual coding anymore, just using LLMs, having the LLMs build the website, the product, and everything.
所以这两种极端,我觉得我们更处在中间位置,我们会采用LLM,但不会彻底完全依赖LLM。
So those are the two extremes, and I think we are more in the middle, where we adopt LLMs but we're not, let's say, going full LLM.
而且我觉得,对于如今正在学编程的人来说,还有一个问题:学编程到底还值不值得?
And I think, for me also, for people who are nowadays learning how to program, the question is: is it still worthwhile?
我觉得即使有LLM能代劳,学习数学和编程仍然很有价值,因为它依然能让你的生活更高效,也能让你更好地使用这些LLM。举个例子,我之前用LLM给我的网站加暗黑模式。
And I think it is actually still worthwhile to learn math and coding even though there are LLMs that can do that, because it still makes your life more efficient and makes you better at using these LLMs. For example, I was using an LLM to add a dark mode to my website.
那一直是我想要做的事。
That's something I always wanted to do.
我十二年前自己写的这个网站,但那时候我对HTML、CSS和JavaScript的掌握远比现在强。
I wrote the website myself twelve years ago, but I knew HTML, CSS, and JavaScript much better back then than I do now.
我一直拖延着没加暗黑模式按钮,因为我知道,如果自己动手,可能得花上一个月才能做好,类似这样。
And I always procrastinated on adding a dark mode button, because I knew it would take me maybe a month to do it well, or something like that.
这并不是我的主要工作。
And it's not my main, let's say, job.
所以我想,好吧。
So I was like, okay.
我没法花那么多时间在上面。
Well, I can't spend that much time on it.
但后来我想,嘿。
But then I was like, hey.
让我试试用大语言模型来做这件事。
Let me try using an LLM for that.
它在添加功能方面做得很好,但并不完美。
And it did a really good job adding it, but it was not perfect.
所以按钮位置不对,其他地方也有问题。
So the button was misaligned and everything.
然后我又想,嘿。
And then I was like, hey.
再往上移一点。
Make it a bit higher.
再往下移一点。
Make it a bit lower.
再往左移一点,诸如此类。
Move it a bit to the left, and so on.
我当时觉得,这种做法实际上非常低效。
This is actually, I thought, very inefficient.
在这种情况下,我为什么不直接去HTML或CSS文件里调整设置呢?
Why don't I just go into the HTML or CSS file in that case and adjust the settings there?
因为我还懂一点CSS文件,所以我自己进行这些调整比让LLM完全代劳、只是不断指令它‘往那边移’、‘往这边移’更有效率。
And because I still knew a bit about CSS files, it was more effective for me to make these adjustments myself, instead of having the LLM do everything and just brute-force telling the LLM, oh, move it that way, move it this way.
我可以自己直接修改,然后刷新页面查看效果。
And I could just change them myself and refresh the page and see.
我认为从这个角度来说,理解这些东西的工作原理确实有道理,因为有些情况下,自己动手仍然比让LLM重新生成一切更高效。
And I think, in that sense, it does make sense to have an understanding of how these things work, because there are cases where it is just more efficient to do things yourself rather than prompt the LLM to, you know, redo everything.
所以我认为,我想说的是,存在一个中间地带,那就是学习事物的工作原理仍然有其价值。
So what I wanted to say is that there's basically a middle ground, where I do think there's still value in learning how things work.
我想知道你的经验是什么。
I wonder what your experience is.
我经常在社交媒体上看到这些新模型发布时的讨论。
I'll often see, around these new model releases, posts on social media.
我一次就搞定这个。
Oh, I one-shotted this.
我一次就搞定那个。
I one-shotted that.
我想回忆一下上一次我有这种经历的情况。
Like, I'm trying to remember the last time I had this experience.
然后我会再去尝试一次搞定同样的事情。
And then I'll go and try to one-shot the same thing.
但我得到的结果非常糟糕,完全不像社交媒体上所报道的那样。
And the results that I get are horrible, like nothing like what is reported in social media.
你知道吗,嘿,是我的问题吗?还是只是人们为了吸引关注而夸大这些成功案例,其实根本不存在,或者是假的?
And, you know, like, hey, is it me? Or is it just people reporting these successes for engagement, and they're not really there, or they're fake?
你感觉如何?
Like, what's your sense?
你有没有遇到过类似的情况?
Do you experience similar things?
是的。
Yeah.
我觉得是的。
I would say so.
我觉得,我提到过我用的原生 Mac 应用,比如我有个 Mac 应用,只是把一个 PDF 放进去,它就会导出 PNG、WebP 和 PDF 版本,而且要指定特定分辨率,即使当时用了 Codex 5.2,也试了好几次才让所有按钮正常工作。
I think, I mean, I mentioned my native Mac apps. Even something simple: I have a Mac app where I just put in a PDF and it exports PNG, WebP, and PDF versions at a certain resolution, and it took multiple tries, even with what was back then Codex 5.2, to really get everything, all the buttons, working correctly.
就像你说的,根本不是一次就能成功的。
Like you said, it was not one shot at all.
为了让它正常运行,我反复迭代了很多次,即使是这种简单的事情也是如此。
It was multiple iterations to get it to work and even something simple like that.
然后我有时会想,是不是我的指令有问题,或者我表达得不够清楚?
And then I sometimes wonder, are my instructions maybe bad, or maybe I wasn't clear?
也许你得明确说:请彻底测试所有功能,确保每项功能都能正常运行,诸如此类的话。
Maybe you have to say, please test everything thoroughly and make sure everything works and blah blah blah.
也许你得对这一点说得特别明确才行。
Maybe you have to be super explicit about that.
但我们并没有那么明确,因为我们默认它会确保所有功能都正常工作。
And we are not that explicit because we kind of assume it would make sure everything works.
或者,也许我们看到的这些情况只是运气好,你知道的,有时候在某些事情上,它就是恰好能很好地运行。
Or maybe these cases we see are just lucky, you know, like sometimes on certain things, it just happens to work very well.
所以我不确定,但我同意你的看法,当有人向你展示‘我一次就成功了’的时候,事情并不像表面看起来那样。
So I don't know for sure, but I agree with you that it's not all that it seems when someone shows you, oh, I one-shotted this.
我不认为这反映了当今事物的实际运作方式。
I don't think that's reflective of how things work today.
是的。
Yeah.
让我们换个话题,谈谈你预计在未来一年中,大语言模型会在哪些关键领域继续取得创新。
So let's switch gears a little bit and talk through some of the key areas where you expect to see continued innovation around LLMs in the upcoming year.
然后,对于每一个领域,我们会深入探讨一下最近的发展历程,以及你预期未来会如何演变。
And then, you know, for each of them, we'll dig in and talk a little bit about the recent history and where you expect to see things going.
今年的主要趋势是什么?
What are the big themes for the year?
嗯。
Mhmm.
我认为仍然是推理能力。
I would say it's still gonna be the reasoning.
我们可以更详细地探讨一下,因为这是一个非常宽泛的话题。
We can maybe go into more detail there because it's a very broad topic.
所以,比如在后训练阶段更深入地推进推理能力。
So, like, pushing more on the reasoning front through post-training.
第二个我认为是推理扩展,也就是更复杂的技术,这些技术部分与训练相关,但主要涉及训练后如何使用大语言模型。
The second one, I would say, is inference scaling, like more sophisticated techniques that are partly related to training, but mostly about how to use the LLM after training.
然后我也认为我们会看到更多这种代理式应用,因为目前大语言模型主要聚焦于逐轮交互,而人们或公司会更加专注于这种循环——将大语言模型作为循环运行,比如多智能体协作并优化这一过程。
And then I also think we will see more of this agentic type of use, because right now LLMs are mostly focused on turn-by-turn interaction, and people or companies will double down on this loop, basically running the LLM in a loop, like multi-agent setups, and optimizing for that.
我认为这三方面将是企业主要关注的重点。
And I think these three things will be mainly the biggest, I guess, focus areas for companies.
是的。
Yep.
那我们深入探讨一下推理能力,为2026年的发展方向设定背景。
So let's dig into reasoning, to set the stage for where you think we'll be heading in 2026.
你觉得2025年在推理方面有哪些重大进展?
What do you think were the big advancements in 2025 around reasoning?
所以,
So,
是的,最大的突破首先是OpenAI o1,它让所有人都兴奋起来。
yeah, the biggest advancement was, I mean, first OpenAI o1, which got everyone excited about it.
然后,OpenAI o1同时运用了推理扩展,而且虽然没人能确定,因为没有论文公布,但很可能也用了训练技术;接着是DeepSeek R1,他们公布了他们的预训练,抱歉,是推理流程。
And then, OpenAI o1 was using both inference scaling and, I mean, no one knows for sure because there's no paper, but likely also training techniques. But then DeepSeek R1, they published their pre-trained, sorry, their reasoning pipeline.
我认为这确实掀起了一股浪潮,许多其他公司也纷纷加大了这方面的投入,但从整体来看,这仍然非常新。
And I think that that was like, really something that took off where a lot of other companies also doubled down on that, but it's still very new in the grand scheme of things.
它才刚出现一年左右,而我最近正在写一篇关于推理的章节。
It's just about a year old. I was recently working on a chapter on reasoning.
算法上出现了太多改进。
There were so many improvements to the algorithm.
我的意思是,就在前几天,我还整理了一份包含15种不同优化和改进的清单,从把序列级对数概率改为词元级这类基础调整,到NVIDIA的GDPO。
I mean, just the other day I compiled a list of 15 different tweaks and improvements, from basic things like changing sequence-level log probs to token-level ones, to GDPO by NVIDIA.
那里取得了大量进展。
Lots of progress there.
我认为我们会看到更多这样的进展。
I think we will see more of that.
但首先,一个原因是预训练阶段,我们已经看到它仍然有效,而且我认为它仍然是整个训练流程中最大的部分,因为所需数据量巨大且成本极高。
But also, one reason is that with pre-training, we have seen basically that, I mean, it still works, and I think it's still the biggest part of the whole training pipeline, because it's just so much data and very expensive.
但研究团队的研究与开发重点,如今我认为更集中在后训练阶段,即如何从这一阶段获取更多性能。
But the R&D focus of the research teams, I think, is nowadays more on post-training, like getting more performance out of that.
因为这是一种更新的范式,而且仍然有很多低垂的果实可以摘取。
Cause it's more like the newer paradigm and there are still low hanging fruits to be picked.
而在预训练阶段,它已经相当成熟了,是的,你仍然需要大量数据。
Whereas in pre-training, it's already pretty sophisticated, where, yeah, you still need a lot of data.
你仍然需要大量算力,但相比后训练阶段,你很难通过改变算法来显著提升性能。
You still need a lot of compute, but there's nothing much you can really do there, compared to post-training, in terms of changing up the algorithms to get more performance.
当然,你仍然可以这样做,如果使用更多数据、优化数据组合、采用多token预测这类方法,你依然能获得更好的结果。
Of course you can still do that, and you will still get better results if you use more data, optimize the data mix, maybe multi-token prediction and these types of things.
但目前最有趣的发展主要发生在后训练和推理领域,因此我认为我们会在这些方面看到更多进展。
But most of the interesting things are happening now on the post-training front and in the reasoning realm, so I think we will see more there.
在推理方面,去年我听到一个经常被提及的话题,那就是可验证奖励的概念。
On the reasoning front, the one topic that I heard come up quite a bit last year is the idea of verifiable rewards.
我认为这推动了我们在编码模型方面取得的许多进展,或至少对此贡献良多。
And I think that led to a lot of the advancements or contributed to a lot of the advancements that we saw in terms of coding models.
你能谈谈这种范式,以及过去一年里我们在这一领域看到的主要里程碑吗?
Can you talk about that as a paradigm and some of the big milestones that we've seen there over the past year?
是的。
Yeah.
谢谢你的问题。
Thank you for the question.
这确实是一个非常重要且关键的点。
It's actually a really, really important point.
换句话说,推理训练主要基于可验证的奖励机制,也就是说,这些任务的答案是可以被验证的。
Like so the reasoning training is essentially mainly based on verifiable rewards, which means there are tasks where you can verify the answer.
例如,在DeepSeek R1中,可验证的奖励来自编程和数学。
So for example, in DeepSeek r one, the verifiable rewards were coding and math.
以数学为例,你要求模型以LaTeX的框格式输出最终答案。
So with math, for example, you ask the model to output the final answer in a boxed format in LaTeX.
这是一种LaTeX命令,比如boxed。
It's a LaTeX command, like \boxed.
然后你可以用正则表达式或确定性代码来提取答案。
And then you can use a regex or deterministic code to extract the answer.
接着你可以使用WolframAlpha或SymPy等工具,将答案与参考答案进行符号比较。
And then you can use something like WolframAlpha or SymPy to compare this answer symbolically to a reference answer.
例如,如果2/3与2/3匹配,或者4/6与2/3匹配,它们本质上是相同答案,但你可以通过符号验证来确认答案是否正确,并获得奖励信号。
Like if two over three matches two over three or four over six matches two over three, it's essentially the same answer, but you can symbolically double check the answer and get a reward signal whether it's correct or not.
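A rough sketch of such a verifiable math reward, using Python's stdlib `fractions` module as a stand-in for a symbolic checker like SymPy (the function names here are invented for illustration):

```python
import re
from fractions import Fraction

def extract_boxed(text: str):
    """Pull the content of the last \\boxed{...} in a model response."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def math_reward(response: str, reference: str) -> float:
    """Return 1.0 if the boxed answer equals the reference, else 0.0.

    Fraction makes 4/6 and 2/3 compare equal, the way a symbolic
    check with SymPy or WolframAlpha would."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answer counts as incorrect

print(math_reward(r"The result is \boxed{4/6}", "2/3"))  # 1.0
```

Because the check is deterministic and cheap, it can score tens of thousands of sampled answers without any human in the loop.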
这实际上非常棒,因为你可以近乎无限地评估大量答案;在此之前,虽然这仍然是一个重要点,但基于人类反馈的强化学习确实需要人工反馈。
And this is actually great, because you can evaluate an essentially infinite number of answers. Before, with reinforcement learning from human feedback, I mean, it's still an important point, you need human feedback essentially.
你可以训练一个奖励模型来近似这种反馈,这是训练的一部分,为每个答案打分,但这种评分并不像可验证答案那样精确,比如在数学中,答案是绝对正确的。
You can train a reward model to approximate that, and as part of the training you get a score for each answer, but it's not quite as precise as a verifiable answer, where there's an absolute ground truth. It's math.
它要么正确,要么不正确。
It's either correct or not.
如果你有类似这样的方法,能够低成本地确定性地验证答案,就可以让大语言模型生成无限多的答案。
And if you have something like that, where you can verify the answer deterministically and cheaply, you can have the LLM generate unlimited answers.
你可以说,好吧,为这个问题生成六万个答案,然后你可以在极短的时间内计算出所有答案的奖励值。
You can say, okay, generate 60,000 answers for this problem, and then you can calculate the reward on all of them in a very short time.
生成这些答案仍然昂贵,但你不会遇到模糊性,也不需要让人来评估这些答案。
It's still expensive to generate these answers, but you don't have vagueness, and you don't need, let's say, humans evaluating these answers.
所以我认为这有助于扩展这些方法。
And so I think that helps, scaling these things.
代码也是如此,在DeepSeek R1论文中,最初的方法是让代码确保能够成功编译,也就是说,如果代码能正确编译。
And the same with code, where, in the DeepSeek R1 paper, the original approach was basically to make sure that the code compiles correctly.
你也可以使用代码解释器来实现这一点。
And you can use also a code interpreter for that.
我认为这两种方法都很棒,但这只是开始。
I think, I mean, both are great, but I think this is just the beginning.
我的意思是,我们很可能会看到这种机制被扩展,不仅仅局限于正确性奖励。
I mean, we will probably see this being extended to have more than just the correctness reward.
我的意思是,现在已经出现了其他类型的奖励。
I mean, there are already other types of rewards that are being added.
例如,格式奖励,你希望模型使用某种格式,但并不是必须的。
For example, a formatting reward, where you want the model to use a certain format. I mean, it's not strictly required.
有些公司更倾向于在思考文本中保留思考过程。
Some companies prefer to have the thinking inside think tags.
所以它们会使用类似 <think> 的标记来开启,再用类似 </think> 的标记来关闭,就像 HTML 的标签一样。
So they have a &lt;think&gt; token and then a closing &lt;/think&gt; token, like HTML, with an opening and closing tag.
这并不是必须的,但这样做的好处是可以提取出中间内容并加以利用,从而训练模型输出这种结构。
It's not required, but it can be helpful to have it because then you can parse out the intermediate stuff and do something with it where you can train the model to output this structure.
这种机制被称为格式奖励。
And that is like it's called a format reward.
因此,除了正确性奖励之外,你还可以添加多种其他类型的奖励。
So you can have multiple types of rewards added to this thing in in addition to the correctness reward.
我认为我们还可能会看到一些有趣的发展,人们会提出格式奖励或其他辅助奖励,帮助模型更好地学习。
And I think we will maybe also see interesting things there, where people come up with formatting rewards, or auxiliary rewards that help the overall model to learn.
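A minimal sketch of what such a format reward could look like; the tag convention follows DeepSeek-R1-style &lt;think&gt; tags, and the 0.1 weighting is an invented illustration:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response starts with one <think>...</think> block
    followed by a non-empty final answer, else 0.0."""
    pattern = r"^\s*<think>.*?</think>\s*\S"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def total_reward(response: str, correctness: float) -> float:
    # Auxiliary rewards are typically weighted small
    # so they cannot dominate the correctness signal.
    return correctness + 0.1 * format_reward(response)

print(format_reward("<think>4/6 = 2/3</think> The answer is 2/3."))  # 1.0
print(format_reward("The answer is 2/3."))  # 0.0
```

The parseable structure is what makes the reward useful downstream: anything between the tags can be stripped out or inspected separately.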
还有一点是,他们在DeepSeek R1的论文中尝试过,不是只看最终得分或答案是否正确,而是评估推理过程或解释本身是否正确。
And one thing also: they tried, in the DeepSeek R1 paper, to evaluate the answer explanation instead of just looking at the final score or whether it's correct or not, like evaluating whether the reasoning, the explanation, is correct or not.
这是不是一种过程奖励?
Is that a process reward?
没错。
Exactly.
是的。
Yeah.
他们使用的是所谓的过程奖励模型,这是一种额外训练的模型,用于为解释打分。
They use what's called a process reward model, which is basically another model that you train to give a score for this explanation.
但我记得,自从DeepSeek R1的论文发布以来,已经过去一段时间了。
But I remember, I mean, it's been a while since the DeepSeek R1 paper came out.
他们有一个章节将这种方法列为失败的尝试或不成功的尝试。
They had a section that they listed that as a failed attempt or unsuccessful attempt.
所以他们尝试过,但认为这样会增加奖励欺骗的风险。
So they tried it, but they thought, okay, this increases the chance of reward hacking.
然后觉得这并不值得。
And then it was just not worth it.
成本更高了。
It's more expensive.
它导致了奖励欺骗,模型利用了这一点,因为确实如此。
It resulted in reward hacking, the model exploiting it, because, yeah.
对模型来说,这样更容易作弊,误导那个评估模型的系统。
It's easier that way for the model to cheat, to mislead the model that evaluates it, basically.
所以要实现这一点仍然很棘手,但我觉得最近几个月也有一些更有趣的成功案例,比如DeepSeekMath-V2。
And so it is still tricky to do that, but I think also in recent months there were some more interesting success stories, like DeepSeekMath-V2.
他们使用了类似的方法,用评分标准来评估整个答案,并让另一个模型来做这件事。
They use something like that, where they also evaluate the whole answer with a rubric and have another model for that.
然后他们还有另一个模型来评估那个使用评分标准的模型,以此类推。
And then they have another model that evaluates that model that has the rubric and so forth.
这就像多层结构。
It's like multiple levels.
但这种方法似乎有效。
And but that seems to work.
他们还进行了消融研究,证明这确实有帮助。
And they had like ablation studies that showed that this is actually helping.
我认为我们还会看到更多这样的应用。
And I think we will see also more of that.
这本质上是一种全新的范式,让推理训练变得更加复杂。
It's just like a very new paradigm, like, making the reasoning training more sophisticated essentially.
是的。
Yeah.
目前,验证器主要集中在数学和编程上,因为对于给定的回答,存在明确的验证方式。
Right now, the verifiers are focused on math and coding, and that's because, you know, for a given response, there's a concrete ability to verify it.
你认为这种验证范式会扩展到数学和编程之外吗?
Do you see this verification paradigm expanding beyond math and code?
我认为,部分原因是,专注于数学和编程之所以成功,是因为尽管并非所有大语言模型的回应都涉及数学和编程,但这些领域本身就具有内在的逻辑或推理能力。
And I think, you know, in part, the focus on math and code is successful because, even though not all LLM responses are about math and code, those things have an inherent logic or reasoning capability in them.
因此,模型的推理能力可以泛化到非数学和非编程的问题上。
And so the ability for the model to reason generalizes to non-math, non-coding problems.
但你是否认为,这种验证理念会扩展到数学和编程以外的问题类型?
But do you see a focus on expanding this idea of verification beyond math and code types of problems?
是的。
Yes.
这其实是一个非常有趣且重要的观点。
So it's actually a very interesting and important point.
我的意思是,你提到如果让模型在数学问题上进行训练、进行数学推理,它就会在整体推理能力上变得更强。
I mean, you mentioned that if you train the model on math problems, on reasoning about math, it will become better at reasoning in general.
但如果你能针对某个特定领域,专门训练模型在该领域内进行推理,效果会更好。
But then it would be even better if you have a target domain to train the model specifically on that target domain, on reasoning in that target domain.
我觉得你说得对。
I think you're right.
这种情况会越来越多。
There will be more of that.
对我来说,现在需要发挥一些创造力,想出一些答案可以被验证的问题例子。
For me right now, it just takes some creativity to come up with examples of problems that can be verified.
但我认为,甚至可以涉及生物学领域,比如药物设计或蛋白质结构建模,这些领域存在物理约束。
But I would say maybe something even biology related, like pharmaceutical drug design or protein structure modeling, where you have physical constraints.
比如原子之间的角度。
So there are, like, the angles between atoms.
它们只能有特定的角度,因此你可能可以使用某种物理方程来验证生成的分子是否符合这些规范,并在训练模型时将其作为奖励机制。
They can only have certain angles and so forth, where you could probably have a physics-type equation that double-checks whether the generated molecule adheres to these constraints, and then have that as a form of reward when you're training the model.
我的意思是,这可能不算典型的推理案例,因为当你生成一个分子时,推理的解释到底是什么?
I mean, this is maybe not a typical case of reasoning, because, well, what is the reasoning explanation when you're generating a molecule?
对吧?
Right?
但总的来说,类似这样的方法可以应用于其他领域。
I mean but in general, like, something like that for other fields.
在最坏的情况下,你总是可以——这更像是一个粗略的近似——训练另一个模型来提供正确性奖励。
And in the worst case, you can always I mean, this is more like a rough approximation, but you can always train another model that provides the correctness reward.
不过我认为这更具挑战性,因为它容易受到奖励欺骗的影响;回溯到早期的生成对抗网络,生成器很容易崩溃。
I think this is more challenging, though, because it's susceptible to reward hacking, going back to generative adversarial networks back in the day, where it's easy for the generator to collapse.
你有一个判别器,它会判断这张图像是真实的还是生成的。
You have the discriminator which says, is this image real or generated?
然后你设置了一个场景:训练一个生成器去欺骗判别器,而判别器则不断提升区分能力,这几乎是一个类似的设置。
And then it was like the setup where you train a generator to fool the discriminator and the discriminator gets better at distinguishing and you have almost like a similar setup.
你可以用它来决定是否给予奖励,但模型可能会在某个时刻利用它,学会某种技巧。
You can use it to say, give a reward or not, but then the model may exploit it at some point, when it learns a trick.
如果我只生成一个词之类的东西,就能欺骗这个评估器。
If I only generate this one word or something like that, then I fool that evaluator.
但我认为我们可能也会看到更多类似的发展,即构建基于AI的奖励模型,这些模型可以用于其他领域,以训练更好的推理模型。
But I think maybe we'll also see more of that, developing AI-based reward models, essentially, that can be used in other fields to train better reasoning models.
除了加强对验证模型的关注和调整之外,你是否还看到其他领域有助于未来更强的推理能力?
Beyond increased focus and tweaks to the verification models, are there other areas that you see as contributing to stronger reasoning going forward?
是的。
Yeah.
我认为训练只是一部分,另一个是推理扩展:如果使用一些所谓"简单"的技术(打引号的简单),就能获得好得多的性能。
I do think, I mean, the training is one part, but the other one is inference scaling: you can get much better performance if you use simple, let's say not simple, in quotation marks, "simple" techniques.
没有什么是简单的。
Nothing is simple.
如果你花更多算力,比如在训练完成后,当用户使用模型生成答案时,推理扩展本质上就是投入更多计算资源。
Let's say, if you spend more compute. Essentially, inference scaling, the definition, is spending more compute after training, during inference, when someone uses a model to generate the answer.
你可以用多种方式来实现。
And you can do it in multiple ways.
推理模型本身其实已经是一种推理扩展的形式,因为它们生成的词元比普通模型更多。
I mean, reasoning models themselves are already kind of a form of inference scaling, because they generate more tokens than regular models.
它给出的解释比普通模型提供的更长。
The explanation is longer than what a regular model provides.
但它常常能帮助大语言模型得出正确答案。
And, but it helps often the LLM to reach the correct answer.
但这种更像是顺序推理扩展。
But this is more like a sequential inference scaling.
你也可以采用并行的推理扩展方式,即生成多个答案,这被称为自一致性。
You can also have parallel forms of inference scaling where you just generate multiple answers and that's called self consistency.
例如,面对一个数学问题,你可以让大语言模型在不同的温度设置下多次回答这个问题。
So for example, if you have a math problem, you can have the LLM with different temperature settings, answer the question multiple times.
然后你进行多数投票,或者类似的方法。
And then you take a majority vote or something like that.
我的意思是,你可以用多种方式来实现。
I mean, there are different ways you can do it.
还有不同的评分方法,或者其他大语言模型会评估所有答案,并给出最可能正确的答案。
There are also different scoring methods, or other LLMs that look at all the answers and give you the most likely correct answer.
通过这种方式,你还可以提升模型的性能。
And with that, you can also boost the performance of the model.
但成本更高一些。
It's more expensive though.
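A sketch of the self-consistency idea just described; `generate` stands in for any sampled LLM call, and all names here are invented for illustration:

```python
from collections import Counter

def self_consistency(generate, prompt: str, n: int = 8) -> str:
    """Parallel inference scaling: sample n answers (e.g. at
    temperature > 0) and return the majority-vote answer."""
    answers = [generate(prompt) for _ in range(n)]
    answer, _votes = Counter(answers).most_common(1)[0]
    return answer

# Deterministic stand-in for a sampled model that is right 2 out of 3 times.
def noisy_model(prompt, _draws=iter(["42", "42", "41"] * 3)):
    return next(_draws)

print(self_consistency(noisy_model, "What is 6 * 7?", n=9))  # 42
```

The vote aggregates away occasional sampling mistakes, at the cost of n times the generation compute, which is why it is something you switch on only when the extra accuracy is worth it.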
所以它总是这样的。
So it's always like this.
是的。
Yeah.
这不是一种放之四海而皆准的方法。
It's not like a one size fits all.
你不应该一直使用它。
You don't want to use it all the time.
需要的时候再用。
It's a use it when you need it.
但我觉得有趣的是,如何改进判断何时应该使用它的方法。
But I think what will be interesting is to improve the way to tell when it's needed.
我觉得当ChatGPT发布5.1或者5.0版本时,他们有过我们一开始提到的那种自动设置。
I think when ChatGPT, was it 5.1 or 5, launched, they had that automatic setting that we talked about in the beginning.
刚开始的时候非常糟糕,但我觉得经过几个月和几年,它变得好很多了。
It was very bad at the beginning, but I think it got much better over the months and years.
我认为在开源、开放权重的生态系统中,我们还没有类似的东西。
And I'm not quite sure we have anything like that in the open-source, open-weight ecosystem.
也许听众可以纠正我,但我认为这样的功能可能更重要,因为一方面,我们正在开发这些非常昂贵的模型,它们能解决像数学奥林匹克竞赛这样非常困难的问题,但我们并不想一直使用它们,因为它们更慢、更昂贵。
Maybe listeners may correct me here, but I can see something like that being more important, because on the one hand, we are developing these very expensive models that can solve very hard problems, like in this math Olympiad, but we don't want to use them all the time because they are slower and more expensive.
同时,人们也会更加关注更便宜的模型。
And there's also gonna be more like a focus at the same time on cheaper models.
比如就在前几周,Qwen3-Next发布了。
So for example, just the other week, Qwen3-Next came out.
Qwen3是目前最广泛使用的开放权重模型系列之一,因为他们提供了大量不同尺寸的高质量模型。
Qwen3 is one of the most widely used open-weight model families, because they have a lot of really high-quality models in all different types of sizes.
但Next模型本质上是一种混合架构。
But the Next model, it is essentially a hybrid.
它不再是一个纯粹的Transformer了。
It's not like a pure transformer anymore.
它使用了状态空间模型来降低成本,但总是存在这种权衡。
It uses state space models to make things cheaper, but then it's like always this trade-off.
人们正在开发更高精度的模型。
People are developing higher accuracy models.
人们也在开发更便宜的模型。
People develop cheaper models.
我认为是的。
And I think yeah.
一种方式是通过改变架构来控制质量和成本。
I mean, one way would be changing the architecture to control the quality and price.
另一种方式是推理扩展。
The other one is inference scaling.
但我觉得目前在开源权重生态系统中还不是很普及。
But I think right now, in the open-weight ecosystem,
还没有那么受欢迎。
It's not quite as popular, not yet.
所以我认为我们也会在本地工具等领域看到更多这样的做法。
So I think we will also see more more of that in in local tools and so forth.
我不确定是否知道任何开源项目或模型采用了这种做法。
I don't know of any open-source project or model that incorporates this.
但从一些对话中,我感觉到许多围绕Qwen模型和这些开源权重模型构建的公司,通常在其架构中包含一个路由组件,用于评估提示的复杂性或类别,并将其路由到最经济或经过后训练以获得更好响应的模型和提示,诸如此类。
But from conversations, I do get the sense that a lot of companies that are building around, you know, the Qwen models, for example, and these open-weight models, commonly have a router component in their architecture that tries to assess the complexity or category of a prompt and route it to the right model and prompt, whichever is most economical or maybe post-trained for better responses, that kind of thing.
我的感觉是,这正是应对您所描述的挑战的常见方法吗?
My sense is that that's the common approach to addressing this challenge that you're describing?
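As a hedged illustration of that router idea, here is a toy keyword heuristic; real routers often use a small classifier model instead, and the model names and markers below are invented:

```python
def route(prompt: str) -> str:
    """Toy router: send hard-looking prompts to an expensive reasoning
    model, everything else to a cheap, fast one."""
    hard_markers = ("prove", "derive", "debug", "optimize", "step by step")
    if any(marker in prompt.lower() for marker in hard_markers):
        return "large-reasoning-model"
    return "small-fast-model"

print(route("Prove that the sum of two even numbers is even."))
print(route("What's the capital of France?"))
```

The design question the episode raises is exactly where this decision lives: in a heuristic like this, in a learned classifier, or inside the model itself as an "auto" mode.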
你提到这一点后,我想起了另一个例子,那就是OpenAI去年夏天发布的GPT开源模型。
Now that you mentioned that, another example came to mind: the GPT-OSS model, the open-source model by OpenAI, which came out last summer.
在这个模型中,即使你使用非常简单的推理方式,比如Ollama或任何类似的工具,你也可以在系统提示中设置推理强度。
And in that model, even if you use a very simple inference setup, like a simple tool like Ollama or any comparable tool, you can set the reasoning effort in the system prompt.
因此,你可以设定低、中、高三种推理强度,然后根据推理强度来调整推理规模。
So you can say low, medium, or high reasoning effort, and then it scales inference based on the reasoning effort.
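Roughly, with an OpenAI-compatible chat payload, that can look like this; the exact system-prompt phrasing GPT-OSS expects may differ slightly from the `Reasoning: <level>` convention used here, so treat it as a sketch:

```python
def gpt_oss_messages(user_prompt: str, effort: str = "medium") -> list:
    """Build a chat payload that requests a given reasoning effort
    via the system prompt (low, medium, or high)."""
    if effort not in ("low", "medium", "high"):
        raise ValueError("effort must be low, medium, or high")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

print(gpt_oss_messages("Solve 17 * 23.", effort="high")[0]["content"])
```

The same payload works with any OpenAI-compatible local server, so the effort level becomes a single knob you can flip per request.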
但我认为,目前还没有其他技术被自动整合进来,比如自一致性或自优化。
But I don't think there's any other technique really automatically incorporated like self consistency or self refinement.
这主要是作为研究者,你得自己大部分时候来做。
It's mainly that you have to, as the researcher, do it yourself most of the time.
你能再多讲讲自一致性和自优化,以及人们是如何使用这些技术的吗?
Can you talk a little bit more about the self refinement and self consistency and how folks use those techniques?
是的。
Yeah.
自一致性和自优化是两种推理扩展的例子。
So self consistency and self refinement are two examples of inference scaling.
我认为两者最大的区别在于,自一致性是一种并行技术,它会生成多个答案。
I would say the biggest difference between the two is that self-consistency is a parallel technique: it generates multiple answers.
然后你通过多数投票来选择正确答案,或者可以使用一个评分器来评估各个答案。
And you choose, let's say, the correct answer based on majority vote, or you can have a scorer that assesses the answers.
但人们通常把这种技术称为‘最佳n个答案’。
But then people call that technique best of n, like best of n answers.
最佳n选一,或者投票机制之类的。
Best of n, or quorum, that kind of thing.
是的。
Yeah.
对。
Yeah.
这本质上是一种集成技术,几乎就像经典的集成方法。
It's essentially an ensemble technique, like classic ensembling almost.
另一种是自我优化,你让大语言模型生成答案,然后将答案输入另一个大语言模型或它自己,并说:这是答案。
And the other one is self refinement where you have the LLM generate the answer, and then you feed the answer to another LLM or to itself and say, here's the answer.
这是问题。
This is the question.
请写一个总结,判断答案是否可能正确,以及存在哪些弱点,就像一个评分标准。
Write a summary of whether the answer is likely correct and what the weaknesses are, like a rubric.
你几乎提供了一个评分标准,其中包含大语言模型应该检查的某些内容,然后它会返回一份报告,说:这个可以更好。
You almost provide a rubric with certain things that the LLM should check, and then it gives you back a report and says, well, this could be better.
比如,解释与最终答案不一致。
Like, the explanation doesn't match the final answer.
然后你将这个输出反馈给原始的LLM,说:嘿,看看这份报告,根据这份报告来优化你最初的回答。
And then you feed that output back to the original LLM and say, hey, look at that report and refine your original answer based on the report here.
通常这能帮助LLM改进自己的回答。
And often this can lead to the LLM improving its own answer.
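A minimal sketch of that self-refinement loop; `llm` stands in for any chat-completion call, and the prompts and names are invented for illustration:

```python
def self_refine(llm, question: str, rounds: int = 2) -> str:
    """Sequential inference scaling: draft an answer, critique it
    against a rubric, then revise, for a fixed number of rounds."""
    answer = llm(f"Question: {question}\nAnswer the question.")
    for _ in range(rounds):
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Rubric: does the reasoning support the final answer? "
            "List any weaknesses."
        )
        answer = llm(
            f"Question: {question}\nDraft: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the listed weaknesses."
        )
    return answer
```

Each round costs three model calls, and, as noted below, a bad critique can make the answer worse, so the number of rounds is itself a knob to tune.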
这几乎就像一种现象。
It's almost like this phenomenon.
有时候你问ChatGPT某个问题,它给出一个答案,然后你会想:等等,这不可能对。
Sometimes you ask ChatGPT something and it gives you something, and you're like, wait, that can't be right.
我知道你问过某个模型是什么时候发布的,然后你会想:不对,这不可能对。
Like, you ask something about when a certain model was released, and you know, okay, this can't be right.
年份完全错了。
The year is totally wrong.
你告诉它:嘿,你错了。
And you tell it, hey, you are incorrect.
你犯了个错误。
You made a mistake.
然后它说:哦,对,你说得对。
And it says, oh, yeah, you are right.
我犯了个错误。
I made a mistake.
然后它会再试一次,下次就会更好。
And then it tries again and it's better next time.
这几乎就是同样的机制,它自己在不断优化答案。
And it's almost like that same mechanism where, yeah, it itself refines its answers.
根据我的实验,它有时也可能让答案变得更糟?
I mean, based on my experiments, it can also sometimes make answers worse.
比如它会过度思考,或者原本是正确的,但因为反馈奇怪或糟糕,反而让答案变得错误。
Like, it will overthink, or it was originally correct, but the feedback is weird or bad, and then it makes the answer incorrect.
所以这并不是一种万无一失的技术。
So it's not a foolproof technique.
它也有一些注意事项,但在DeepSeekMath-V2论文中,他们以更复杂的方式实现了自我优化,引入了第三个模型来评估评估者,并用一张很直观的图表展示了准确率能提升多少。
It also comes with caveats, but in the DeepSeekMath-V2 paper, they did self-refinement in a more sophisticated way, where they had a third model evaluating the evaluator, and they showed a nice plot of how much the accuracy can improve.
基本上,我不记得具体数字了,但如果他们加大了自优化和自一致性,就能在某些数学竞赛中达到黄金水平的表现,这非常令人印象深刻,因为用的还是之前的同一个模型,只是增加了推理规模。
I don't know the numbers off the top of my head, but if they cranked up the self-refinement and self-consistency, they were able to get gold-level performance in certain math competitions, which was very impressive given it was still the same model as before; they just cranked up the inference scaling, basically.
有一点很有趣,就是反思我们正在讨论的这些主题,发现它们全都紧密相关。
One thing that's interesting kinda reflecting on these themes that we're discussing is just how they're all very interrelated.
你知道吗?
You know?
所以,推理是一个关键主题。
So reasoning is a key theme.
推理是由推理规模推动的。
Reasoning is enabled by inference scaling.
推理规模。
Inference scaling.
我们在讨论这些内容时,经常听到循环、递归这类概念。
A lot of what we're hearing as we talk about this is like loops and recursion and, those kinds of ideas.
而这些概念都是关键所在。
And those are, key ideas.
你提到的第三个关键主题是LLM的代理式应用。
And the third key theme that you mentioned, which is kind of agentic uses of LLMs.
以此为引子,谈谈你迄今为止在代理式应用方面看到的进展,以及你觉得这个领域中哪些地方令人兴奋。
You know, with that as a segue, you know, talk a little bit about, you know, what you've seen thus far around agentic and what you think is exciting in that space.
我会说,是的,代理式用例甚至包括一些看似简单的例子——当然,这里的‘简单’是要加引号的。
I would say, yeah, the agentic use cases, even the, again in quotation marks, "simple" ones.
像Codex或Claude Code这样的工具,它们会通过多次迭代来解决一个问题。
Things like Codex or Claude Code, where it just does multiple iterations to solve a problem.
这并不是一次性的回答。
It's not just like one shot.
它更像是在执行一个任务,而不仅仅是提供一个答案。
It's more like doing a task rather than just providing an answer.
我认为,多智能体系统是代理系统的另一个例子;而且,"代理"这个术语本身几乎没有明确定义,因为不同人对它的使用方式各不相同。
And I think multi-agent systems would be another example of agentic systems. I mean, "agentic" is also, I would say, almost a not-well-defined term, because people use the term differently.
但就这个播客而言,也许我们可以把‘代理’理解为一种在循环中运行的系统。
But for this podcast, maybe we can think of agentic as something that runs in a loop.
而且我认为,确实,最近我们会看到更多这样的应用。
And I think, yeah, that is something we will see more of recently.
Claude Code 和 GPT 5.3 Codex 应用新增了许多任务功能,比如你可以安排某些操作让它定期重复执行。
Claude Code and the GPT 5.3 Codex app, they added a lot of these task features, where you can even schedule something and it runs on a recurring basis, for example.
我认为我们会看到更多这样的功能。
And I think we will see more of that.
这还只是刚开始。
It's just like the beginning.
它会更像插件一样。
It will be more like plugins.
而且说到底,还是同一个大语言模型。
And I mean, it's still the same LLM.
关键在于我们如何使用大语言模型,如何充分利用上下文并进行反馈。
It's just about how we use the LLM and how to get the most out of it, out of the context, feeding back the context.
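That "LLM in a loop" idea can be sketched as follows; the TOOL:/DONE: text protocol and the tool set are invented for illustration:

```python
def agent_loop(llm, tools: dict, task: str, max_steps: int = 10) -> str:
    """Run the LLM in a loop: each turn it either calls a tool
    ('TOOL: name args') or finishes ('DONE: answer'); tool results
    are fed back into the growing context."""
    context = f"Task: {task}"
    for _ in range(max_steps):
        action = llm(context)
        if action.startswith("DONE:"):
            return action[len("DONE:"):].strip()
        if action.startswith("TOOL:"):
            name, _, arg = action[len("TOOL:"):].strip().partition(" ")
            result = tools[name](arg)
            # Feed the observation back so the next turn can use it.
            context += f"\n{action}\nRESULT: {result}"
    return "gave up after max_steps"
```

It is still the same LLM underneath; the agentic behavior comes entirely from the loop, the tools, and the context that accumulates between turns.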
我认为在开源社区中,这方面还没有得到足够的关注。
And I think there has not been that much focus on this in the open weight, open source community.
那里的重点更多放在开发LLM本身,像OpenAI、Claude这样的公司则更倾向于打造这些工具,以便用这些LLM完成更令人印象深刻、更庞大的任务。
The focus there is more on developing the LLM itself, whereas companies like, you know, OpenAI, Claude, they are more like, okay, let's build these tools so we can actually do more impressive, bigger things with these LLMs.
我认为到今年年底,我们将拥有能够可靠地预订假期或旅游目的地的系统,这类功能会变得越来越普遍。
And I think maybe by the end of the year, we will have systems that can reliably book a trip to, you know, some holiday or vacation destination, and this will become more and more common.
我的意思是,已经有一些工具承诺可以做到这一点。
I mean, there were already tools that promised to do that.
我忘了名字,但好像叫Devin之类的东西。
I forgot the names, but I think it was called Devin, something like that.
它可能还存在,但我认为这仅仅是刚开始。
It might still exist, but I think this is just the beginning.
哦,是的。
Oh, yeah.
对。
Right.
是的
Yeah.
对
Yeah.
但我认为这还只是刚开始,而且大多数人并不需要一个能做所有事情的完整系统。
But I think it's just the beginning, and also, most people — I don't think they need a full-blown thing that can do everything.
他们可能只需要一个Excel插件,在特定时间间隔自动更新某些内容,然后Excel表格连接到互联网,获取最新的股票价格之类的信息,本质上是一种循环设置。
They maybe just need, like, a plugin for Excel, where at certain intervals it updates certain things, and the Excel spreadsheet goes onto the Internet and pulls the recent stock price or something like that — but in a type of loop setting, essentially.
是的
Yeah.
在关于LLM的代理用途方面,我们过去一两年听到很多关于多代理系统的概念,即将一个问题分解为具有各自个性的独立代理。
One of the things we heard a lot about in the context of agentic uses of LLMs, I think over the past year, maybe two years, is the idea of like multi agent systems and like decomposing, a problem into independent agents with kind of their own personas and that kind of thing.
我认为,整个OpenClaw的理念——即使在今天,我也看到很多人说:"我创建了我的AI团队",还有这些AI员工。
And I think that, you know, the whole OpenClaw idea — like, even today, I'm seeing a lot of, hey, I created my AI team, and there are these AI employees.
对吧?
Right?
还有这些员工,这个员工、那个员工,他们通过Slack、Moltbot或者其他工具互相交流。
And there's this employee, that employee, that employee, and they talk to each other using, you know, Slack or Moltbot or whatever.
从实际的构建者或技术角度来看,你对这种多智能体用例有什么观察?
Like, what have you seen with regard to this multi-agent use case, from a concrete builder or technical perspective?
你有没有发现,人们从这种用例中获得了不少价值?
Are you finding, you know, folks getting a lot of value out of that?
说实话,我很希望有个很棒或有趣的答案,但我个人并没有深入探索过这个领域。我大部分经验都集中在单一用途场景,即一个LLM提供解决方案或处理特定任务,但很少涉及与其他智能体的互动。
To be honest, I wish I had a really good answer or interesting answer, but this is something I've not explored personally where most of my experiences are like single use case where it's one LLM that provides solutions or tackles a specific task, but it's mostly not interacting with other agents.
不过,我想这里更像一个上下文工程问题——也就是说,LLM本身我认为并不是瓶颈。
I think here, though, it's more like a context engineering problem — the LLMs themselves, I don't think they are the bottleneck.
真正的问题在于,你如何获取结果,并将这些结果传递给另一个LLM。
It's more about how you, let's say, get the results and provide them to another LLM.
从这个角度看,这几乎就像图像或视频生成中的流程:一个LLM解析文本或优化输入,然后将结果传递给生成输出的部分,比如扩散模型或基于Transformer的扩散模块,我觉得这本质上是一种更复杂的类似形式。
I mean, in that sense, it's almost a form of what you do in image or video generation, where you have one LLM parsing the text or improving the input and then passing that to the part of the model that generates the output — the diffusion model part, or the transformer-based diffusion part — and I think this is a more sophisticated form of that.
我们该如何为不同的智能体提供恰当的上下文?
How do we provide the right context to the different agents?
这可以是从基础数据库到使用Slack,其中一个模型输出内容,另一个模型通过API将其接收。
And it could be anything from basic databases to using Slack, where one model outputs something there and, via the API, the other model ingests it.
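That hand-off — one model's output becoming another model's input context, whatever the transport — can be sketched without any framework. Both agents below are hypothetical stand-ins for real LLM calls:

```python
# One agent's output becomes the next agent's input context — the same
# plumbing whether the transport is a database, Slack, or a plain queue.
def pipeline(task, agents):
    """agents: ordered callables, each (task, context) -> new context string."""
    context = ""
    for agent in agents:
        context = agent(task, context)  # pass the previous result forward
    return context

# Hypothetical stand-ins for two LLM-backed agents:
researcher = lambda task, ctx: f"facts about {task}"
writer = lambda task, ctx: f"report based on: {ctx}"
```

The context-engineering question is exactly what goes into `context` at each arrow; the transport between agents is incidental.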
我认为这同样是刚刚起步的事情——Moltbot和OpenClaw也是如此——我相信我们会看到更多这样的应用。
And I think that is also something that is just getting started, also with Moltbot and OpenClaw, and I think we'll be seeing a lot more of that.
但确实如此。
But yeah.
所以关于这一点,我能说的就是这些,因为我个人没有实际经验。
So that's all I can say about it, because I personally don't have any concrete experience.
我还没亲自做过这方面的项目。
I haven't worked on this myself yet.
是的。
Yeah.
你有没有感觉到,在未来一年里,这类智能体应用会在哪些方面集中发力和创新?或者说,为了让它们真正成熟,目前还存在哪些关键问题需要解决?
Do you have a sense for where we'll see focus and innovation around these kind of agentic uses in the upcoming year or, you know, maybe what the gaps are, what really needs to be worked on in order for them to kind of come into their own?
我认为,每个LLM在某个时刻都有自己的失败率,因此进展通常体现在LLM能自主运行多长时间,也就是它们在失败前能持续工作多久。
I do think each LLM still has its own kind of failure rate, so progress is usually measured by how long the LLMs can work autonomously — how long can they work until they fail?
你添加的模型越多,如果它们相互依赖,其中一个出错的风险就越高。
And the more models you add, the higher the risk that one of them fails if they depend on each other.
我认为,改进模型本身也将有助于提升整个系统的性能,这基本上是提升性能的主要方式。
And I think improving the model itself here will also help improve the whole system — that's basically the main way to improve the performance.
但就我所知,根据目前公开的信息,这些仍然是像Claude或其他API中的基础大语言模型,并未专门针对多智能体环境进行训练。
But then, as far as I know, based on what is publicly available, right now these are still the vanilla LLMs that are in Claude or in other APIs; they're not specifically trained to interact in a multi-agent setting.
我认为,从这个角度来看,如果你为多智能体环境下的智能体训练准备数据,比如进行微调,你也能从中获得更好的性能。
And I think in that sense, if you prepare data for training these agents in a multi agent setting, like fine tuning type of situation, I think you can also get more performance out of them.
我们已经看到过这种情况,比如Codex这样相对简单的例子。
And we have seen that, I mean, even for simple things like Codex.
GPT 5.2或5.3 Codex,并不等同于GPT 5.2和5.3。
So GPT 5.2 or 5.3 Codex is not the same as GPT 5.2 and 5.3.
这些模型是从原版分叉出来,然后专门针对Codex应用进行训练的。
These are models that they forked off and then specifically trained to work with the Codex app basically.
我认为,对于这些智能体模型,我们也会看到类似的情况。
And I think something like that we will also see for these agent models.
这对消费者来说更难做到,因为我们没有权限访问这些模型。
It's just harder for, let's say, the consumer to do that, because we don't have access to these models.
所以我们很大程度上依赖于拥有和托管这些大语言模型的人来进行这种训练。
So we're kind of dependent on whoever owns the LLMs, whoever hosts the LLMs, to do this type of training, basically.
所以,我认为企业也会开发类似的东西。
So, I think I can see companies also developing something like this.
我的意思是,如果让我打赌,Claude 和 OpenAI 一定会密切关注对方的动向,并可能推出自己的版本。
I mean, if I had to bet, Claude and OpenAI really pay attention to what Moltbot or OpenClaw is doing, and maybe they'll come up with their own version of that.
这可能会更强大,因为他们掌控着模型,可以针对特定的交互式多智能体环境进行微调。
That is maybe even more capable because they control the model and they can fine tune it for certain interactive multi agent types of environments.
是的。
Yeah.
回顾过去,一个有趣的现象是,我们可能认为过去一两年里从架构角度看的重大突破,其实都是相对渐进的,核心架构本身并没有太大变化。
One of the things that's interesting looking back is that a lot of the, you know, things that we might look back and see as, big advancements over the past year or two years, from an architecture perspective, they're relatively incremental, like the fundamental core architecture.
虽然有一些关于如何超越大语言模型的提议,但核心架构一直相当稳定。
You know, there have been a handful of proposals of like where we might go beyond LLMs, but the the core has been fairly stable.
你知道,你同意这一点吗?
You know, do you do you agree with that?
你认为LLM架构的未来会怎样?
Where do you see, you know, how do you think about the the future of LLM architecture?
是的。
Yeah.
这是个有趣的问题。
That's a interesting question.
我要说,我在这里说的每一句话都带个星号,因为DeepSeek第四版还没发布。
So I would say everything I'm saying here comes with an asterisk, because DeepSeek version four is not out.
它可能会完全改变我所说的这一切。
It might change everything, completely in terms of what I'm saying.
但如果我们只看2025年到二月这段时间,我认为在最先进架构方面并没有根本性的变化。
But if we just look at 2025 up to this February, I don't think there were any fundamental changes in terms of the state-of-the-art architecture.
所以我认为我们必须区分的一点是,有些架构更侧重于以更高效的方式完成相同的事情。
So I think one thing we have to distinguish between is, like, there are architectures that are more geared towards doing the same thing more efficiently.
还有一种架构变更旨在从模型中获得更高的建模性能和准确性。
And there are architecture changes that are geared towards, let's get more modeling performance accuracy out of the model.
如果我们首先看看那些推动前沿建模性能的模型,最近其实并没有太多变化。
First, if we look at those models that push the state of the art, the modeling performance, there haven't been that many changes recently, really.
我的意思是,看看2025年,专家混合模型正在卷土重来。
I mean, looking at 2025, mixture of experts models have been making a comeback.
我的意思是,之前也有像Mixtral和DeepSeekMoE这样的模型,但它们真正流行起来是在DeepSeek V3发布之后,而DeepSeek V3之所以流行,是因为DeepSeek R1——它本质上是DeepSeek V3经过微调或后训练的版本。
I mean, there were other models like Mixtral and DeepSeekMoE before, but they really became popular after DeepSeek version three came out, and DeepSeek version three became popular because of DeepSeek R1, which is basically a fine-tuned, or post-trained, version of DeepSeek version three.
但随后许多公司都采用了这种架构。
But then a lot of companies adopted this architecture.
我认为Kimi直接采用了这种架构,并将参数规模从6700亿扩大到了1万亿;甚至欧洲公司Mistral AI也使用了DeepSeek V3的架构。
I think Kimi straight up used that architecture and scaled it from 670 billion to 1 trillion parameters, and even the European company Mistral AI used the DeepSeek version three architecture.
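For readers who haven't seen a mixture-of-experts layer, the routing idea can be illustrated with a toy NumPy sketch — this is an illustration of the basic mechanism, not DeepSeek's actual implementation:

```python
import numpy as np

# Toy mixture-of-experts layer: a router scores all experts for a token,
# but only the top-k experts actually run, so most parameters stay idle.
def moe_layer(x, router_w, experts, top_k=2):
    """x: (d,) token vector; router_w: (n_experts, d); experts: list of callables."""
    scores = router_w @ x                       # one routing score per expert
    top = np.argsort(scores)[-top_k:]           # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                    # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

This is why a 1-trillion-parameter MoE can be cheap per token: only the selected experts' weights touch each token. DeepSeek-style MoE adds shared experts and load-balancing on top of this basic routing.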
所以很多人,我想说,他们并没有在这种意义上冒险去尝试不同的东西。
So a lot of people, I would say, are not gambling in that sense on "let's try something different."
他们是采用已经有效的方法,然后通过改变数据和算法来取得进展或差异。
Or they take something that works and try to make progress, or differentiate, in terms of changing the data and the algorithms.
但这并不意味着没有新想法。
But that doesn't mean there are no new ideas.
我的意思是,DeepSeek版本三除了MoE(专家混合)之外,还引入了多头潜在注意力机制。
So, I mean, DeepSeek version three, besides MoE, the mixture of experts, did have the multi-head latent attention.
我认为这在之前的论文中也出现过,但多头潜在注意力本质上是对注意力机制的一种改进,它引入了一个中间的、更小的压缩状态来表示键和查询。
I think it was also in one of their previous papers, but multi-head latent attention is essentially a tweak of the attention mechanism where you have an intermediate, smaller, compressed state of the queries and keys.
不。
No.
抱歉。
Sorry.
键和值。
Keys and values.
键和值。
Keys and values.
我的意思是,所有这些其实都是,但键和值尤其需要压缩,因为这样你的KV缓存就会变小。
I mean, all of them actually are, but the keys and values are the important ones to compress, because then your KV cache becomes smaller.
所以你不会在KB缓存中存储完整的键和值,而是存储它们的压缩形式。
So you don't store the full keys and values in the KV cache, but a compressed form.
但在推理时,你会从压缩形式中重建键和值。
But then you reconstruct the keys and values from the compressed form at inference.
所以你实际上是在用计算量换取内存空间。
So you are basically trading off compute with memory.
但为了更好地解释这一点,你可以把它看作LoRA,也就是低秩适应。
But also, maybe to explain this a bit better: you can think of it as LoRA, the low-rank adaptation.
基本上,你先将它投影到一个压缩空间,然后再投影回来。
So basically you project it down into a compressed space and then you project it up again.
这其实就是多潜注意力。
So that's basically multi-head latent attention.
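The LoRA-style down-then-up projection can be shown numerically. This is a toy sketch of the compression idea only — the dimensions are made up, and it omits DeepSeek's exact formulation (decoupled RoPE parts, per-head details):

```python
import numpy as np

# Multi-head latent attention's core trick, in miniature: cache a small
# latent vector per token instead of the full keys and values, and
# reconstruct them at inference — trading a bit of compute for memory.
d_model, d_latent = 8, 2
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_latent))   # compress: d_model -> d_latent
W_up_k = rng.normal(size=(d_latent, d_model))   # reconstruct keys
W_up_v = rng.normal(size=(d_latent, d_model))   # reconstruct values

h = rng.normal(size=(d_model,))  # hidden state for one token
latent = h @ W_down              # this 2-dim vector is all the KV cache stores
k = latent @ W_up_k              # keys rebuilt on the fly at inference
v = latent @ W_up_v              # values rebuilt the same way
```

Here the cache stores 2 floats per token instead of 16 for the full keys plus values — the same down-then-up shape as a LoRA adapter, applied to the KV cache.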
我认为这是2025年人们采用的一个有趣改进。
So that's an interesting tweak, I think, that people adopted in 2025.
然后在DeepSeek版本3.2中,又出现了另一个改进,也就是稀疏注意力。
And then it was again DeepSeek — version 3.2 — that had another, I would say, tweak: sparse attention.
我的意思是,稀疏注意力也不是什么新东西,但一直以来都有很多研究在探讨如何让注意力机制更廉价,因为它的复杂度与序列长度呈二次方增长。
I mean, sparse attention is also not new, but there's always been this, like, research on how we make attention cheaper because it scales quadratically with a sequence length.
已经有数百篇,甚至上千篇论文了。
And there have been hundreds, if not thousands of papers.
但说到论文,我总是有点谨慎,我的意思是,这些想法确实有趣,但我总在等着看到它们真正落地到实际生产环境中。
But, you know, with papers — I mean, the ideas are interesting, but I'm always a bit careful, and I'm always waiting to see them in "production," in quotation marks.
但我的意思是,如果能在像这样的旗舰模型中看到它,你就知道它确实有效了,因为一个想法可能在小模型上表现很好,但一旦模型规模扩大到5000亿、6000亿甚至1万亿参数时,事情可能会崩盘。
But I mean, once you see it in a flagship model — because the idea might work well if you are only focused on a small model, but things may fall apart once you scale the model to 500 billion, 600 billion, 1 trillion parameters.
所以DeepSeek这里是个很好的案例,因为他们确实拥有这样的旗舰模型。
And DeepSeek here is a nice case study, because they do have this flagship model.
如果他们在旗舰模型中采用了某种技术,那就基本可以确定它在大规模场景下是有效的。
And if they use something in that flagship model, you basically know it works at scale.
他们有自己的稀疏注意力版本。
And they have their own version of sparse attention.
我想他们直接称之为DeepSeek稀疏注意力。
I think they literally call it DeepSeek sparse attention.
是的,他们用了一个"闪电索引器",某种意义上是一个小而廉价的模型。
And yeah, they have a lightning indexer — like a small, cheap model, in a sense.
不是对每一个词都关注所有前面的词,而是更有选择性。
Instead of one token paying attention to all the previous tokens, it's more selective.
它会选择关注哪些词。
It selects which tokens it pays attention to.
所以这有点像一个掩码。
So it's kind of like a mask.
你对所有词计算一个掩码,以选出一个子集,从而降低成本,使其基本实现次二次方扩展。
So you are calculating a mask over all the tokens to select a subset, to make it cheaper — to make it scale sub-quadratically, basically.
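That selection step — a cheap scorer choosing which previous tokens a query attends to — can be sketched as a top-k mask. This is a toy version; DeepSeek's actual lightning indexer is a small learned network, not a dot product:

```python
import numpy as np

# Sparse-attention selection in miniature: a cheap scorer ranks the
# previous tokens, and full attention then runs only over the top-k,
# so per-token cost grows with k instead of the whole sequence length.
def select_tokens(query, keys, k=4):
    """query: (d,); keys: (t, d) for the t previous tokens; returns a boolean mask."""
    scores = keys @ query                  # cheap relevance score per previous token
    keep = np.argsort(scores)[-k:]         # the k highest-scoring tokens
    mask = np.zeros(len(keys), dtype=bool)
    mask[keep] = True                      # attention is restricted to this subset
    return mask
```

The expensive softmax attention then only runs where the mask is true, which is what breaks the quadratic growth with sequence length.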
过去也出现过这类调整,但并没有从根本上改变注意力机制的工作方式。
And there have been like these types of tweaks, but it's not fundamentally changing how attention works.
它仍然是相同的注意力机制,只是我们如何让它的成本更低。
It's still the same attention mechanism, but, yeah —
基本上,我们怎么才能让它更便宜?
How do we make it cheaper, basically?
所以我认为,人们现在关注的是当下有效的方法,但也许到2026年,某款旗舰模型会采用一种根本不同的方法。
So I think that is something, where people hone in on what works at the moment, but we will see maybe in 2026, maybe one of the flagship models will have a fundamentally different approach.
抱歉,应该说是一些公司,他们在替代架构方面也做了一些小的改动。
Little — I mean, sorry, companies — but little changes have also been made in terms of alternative architectures.
我们之前提到了Qwen3。
So we mentioned Qwen 3 earlier.
Qwen3是其中一款旗舰模型。
So Qwen 3 is one of the flagship models.
目前它可能不再是榜首了,因为它的版本有点老了。
Right now, it's maybe not at the top anymore because it's a bit older.
它是在夏天发布的,但通常当Qwen3或Qwen系列模型推出时,它们通常都位居排行榜首位。
It came out in summer, but usually when Qwen 3, or the Qwen models, come out, they are usually at the top of the leaderboards.
它们还推出了模型的并行版本。
They had also a parallel version of their model.
它们称之为Qwen3 Next,这个版本尝试了一些不同的东西。
They called it Qwen3-Next, and that one tried something different.
他们采用了一种混合注意力机制,结合了门控的DeltaNet,基本上是为了更接近状态空间模型的方法,这种模式更线性,虽然大家都在尝试,但这并不是他们的旗舰模型。
They had a hybrid attention mechanism, with a gated DeltaNet, basically to have more of a state-space-model-like approach — more linear — where people are trying things, but it's not necessarily their flagship model.
他们同时也在探索其他方向。
Like they are in parallel trying other things.
我认为这很有道理,因为你不应该把所有赌注都押在一个选项上。
And I think this makes sense because, yeah, you don't want to put all your eggs in one basket.
基本上,你先要有一个不错的模型,然后在旁边尝试一些新东西,如果效果好,以后再扩大规模。
Basically, you wanna have a good model and then maybe on the side, try something and then maybe scale it up later if it works well.
是的。
Yeah.
那持续学习呢?
What about continual learning?
持续学习经常被提及为一个机会,尤其是在我们还没能很好地整合工具和搜索能力之前,因为模型会很快过时。
That comes up frequently as an opportunity, you know, particularly before we got really good at incorporating in tools and the ability to do searches because models would get stale, you know, very quickly.
但人们仍然对保持模型训练数据更新很感兴趣。
But, you know, there's still this interest in having a model that, you know, we can, you know, we can keep its training data updated.
我们可以删除一些东西。
We can delete things.
我们可以融入新知识。
We can incorporate new knowledge.
你觉得这个领域会有重大创新吗?
Like, do you foresee significant innovation in that area?
是的。
Yeah.
我认为这可能是最大的梦想之一,即我们如何让模型自我提升?
I think this is like maybe the biggest dream in the sense of like, how can we make the model improve itself?
我认为,如果有人能找到实现这一目标的方法,这将是目前最了不起的成就。
That's, I guess, the biggest achievement that could be made right now — if someone finds a way to make this work.
但我觉得现在还没有实现这一目标的路径。
But I think right now there is no, you know, pathway to this.
比如,现在还没有这样的方法。
Like, there's no yeah.
实际上,还没有什么明确的方法能让你说,哦,这就是能带来可靠或持续学习的关键。
There's nothing really where you would say, oh, that's the thing that will give us reliable or continual learning.
但话又说回来,我认为已经存在一些形式的持续学习了。
But that being said, I think, I mean, there are already forms of continual learning, I would say.
我的意思是,即使是像这样更受控制的方式。
I mean, even something like, well, I mean, it's more like controlled.
比如,不是模型自动更新,而是人们从最近的互联网或最近的任务中收集数据,然后仔细地更新模型。
Like, instead of the model automatically updating itself, people would collect data from, like, the recent Internet or recent tasks and then carefully update the model essentially.
我认为更接近这种情况:我们并不是不更新模型,但也不会完全自动化地去做。
I think it's more like that where, it's not that we don't update models, but we also don't do it fully automatically.
这几乎是一种半自动的方式。
It's like a semi automatic almost type of thing.
而且我认为,这不仅仅是因为这样更可靠。
And I think also it's it's like that not only because that's more reliable.
所以因为是的。
So because yeah.
仅仅根据新数据更新模型是有风险的,但同时也受到资源限制的影响。
It's risky to just update a model on on new data, but it's also because of resource constraints.
比如,我不知道OpenAI有多少份模型副本,但显然不可能为每个用户都配备一份。
Because, for example, I don't know how many copies of the model OpenAI has, but I mean, you definitely can't have a single copy per user.
那将会贵得离谱。
That would be way too expensive.
每个人家里都得有一台超级计算机,或者花十万美金买一台电脑,才能运行大型旗舰模型。
I mean, everyone would have to have a little supercomputer at home, or like a $100,000 computer, to run a big flagship model.
因此,公司不可能为每个用户实时更新所有内容,这基本上是不可行的。
And so companies can't like just update everything on the fly for each user because that would be just infeasible basically.
所以,除非我们有办法让模型只在个人设备上运行,否则我认为我们无法实现真正有效的持续学习,因为确实如此。
And so unless we have ways for the models to run only on the personal device, I don't think we can have really good continual learning, essentially.
另外一点是,你必须非常谨慎地进行更新。
And then the other thing is, yeah, you have to be really careful how you update it.
你不希望让模型变得更差。
You don't wanna make the model worse.
因为这是一个非常重要且昂贵的产品。
Because it's such an important, expensive product.
哪怕你只是设想把数据反馈给OpenAI,然后OpenAI自动更新模型——万一是一次糟糕的更新,
If you just even think about feeding the data back to OpenAI, and then OpenAI automatically updates the model, and maybe there's a bad update —
这会扰乱所有人的使用体验。
it disrupts everything for everyone.
所以我认为这更多是基础设施和安全方面的问题。
And so I think it's more of infrastructure security, that type of issue.
但除此之外,如果我们看看推理训练,也就是我们之前讨论过的、基于可验证奖励的强化学习,
But otherwise, I mean, if you look at the reasoning training — what we talked about, the reinforcement learning with verifiable rewards —
如果你基于正确答案这样的设定持续运行它,这在某种程度上就是一种持续学习,技术上你可以一直运行它,但你其实不希望无差别地运行,你希望更有选择性一些。
If you run this based on correct answers, that type of setting, and you just keep it running, it is kind of a form of continual learning, in the sense that you can technically just keep running it — you just want to be more selective, basically.
你觉得长上下文或更长的上下文能否缓解一些痛苦,或者减少对持续学习的需求?比如在你提到的个性化模型场景中,一种方法是获取新信息并持续对其进行学习。
And do you think that long context, or longer context, kind of alleviates some of the pain, or the need to do continual learning? Like, in your case of the personalized models, one approach is to take new information and kind of continually learn against it.
另一种方法,我认为一些人已经尝试过的是为模型创建个人化的LoRA适配器。
You know, another, that I think folks have played around with, is to create personal LoRA adapters for a model.
但第三种方法是直接将新信息放入上下文中,在推理时使用它。
But then a third is to just put that new information into the context and use it at inference time.
我觉得答案是肯定的,也是否定的。
I I would say yes and no.
我认为,大语言模型的长上下文能力最近带来了巨大改变,以前人们都在构建RAG系统(检索增强生成系统),而现在,我不会说它们已经过时了。
So I do think long-context LLMs have enabled so much recently — before, people were building RAG systems, retrieval-augmented generation systems, and now, I wouldn't say they are obsolete.
如果你有一个固定的大数据库或文档集,并且频繁使用它,它们仍然非常有用。
They are still very useful if you have a fixed, big database or document set and you use it repeatedly.
但如果你是普通用户——即使你有一份上千页的PDF——
But if you're a regular user — even if you have, like, a thousand-page PDF —
我的意思是,上千页可能有点夸张,但比如一个200页的PDF,你完全可以把它放进上下文。
I mean, a thousand pages is maybe stretching it a bit, but, like, a 200-page PDF — you can have it in context.
你不需要对这些数据进行微调来训练大语言模型。
You don't need to train the LLM in terms of fine tuning on that data.
你也不需要搭建RAG系统。
You don't need to have a RAG setup.
你可以在上下文中做很多事情。
You can do a lot of stuff in context.
正如你所说,对于那些你可以直接在上下文中提供所有相关信息的情况,可能也是如此。
And like you said, the same is maybe true for information where you could technically just provide all the relevant new information in context.
但我认为这只能帮你走到这一步,因为作为用户,你还得知道该提供哪些信息。
But I think it only gets you so far, because as a user you also have to know what information to provide.
但如果你将这与工具使用结合起来,比如你询问一个历史事件,而模型的数据截止时间是2025年,你问的是2026年的事件,LLM仍然可以进行网络搜索。
But then if you couple that with tool use, for example, if you ask about a historical event and let's say the data cutoff is 2025 and you ask about a 2026 historical event, the LLM can still use a web search.
它仍然可以使用工具,在网上查找相关信息。
It can still use a tool and look it up on the web.
因此,你并不一定需要为这个特定的历史事件更新LLM。
So you don't necessarily need to update the LLM for that particular historical event.
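The pattern he describes — route post-cutoff questions through a search tool instead of retraining — can be sketched like this. The `llm` and `web_search` callables and the cutoff year are hypothetical placeholders, not any real API:

```python
# Routing a post-cutoff question through a search tool instead of
# updating the model. `llm` and `web_search` are placeholder callables,
# not a real API; the cutoff year is an assumed example value.
KNOWLEDGE_CUTOFF = 2025

def answer(question, event_year, llm, web_search):
    if event_year > KNOWLEDGE_CUTOFF:            # event postdates the training data
        context = web_search(question)           # fetch fresh facts instead
        return llm(f"Using this context: {context}\n{question}")
    return llm(question)                         # the cutoff covers it
```

The limitation he raises next lives in the first branch: the search result supplies isolated facts, not the event's ramifications across everything else the model knows.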
但如果这个历史事件影响深远,并牵涉到许多其他相关事物,这些关联可能就会被忽略。
But then if the historical event has a lot of ramifications and affects a lot of other things around it, that might be missed.
通过工具调用你只能获得某些事实,而无法获取它与其他数据点之间的完整互动关系。
You only get certain facts from a tool call, but not that whole interaction with other data points.
我认为这并没有完全取代更新,但可能让更新的必要性降低了,或者不需要那么频繁地更新,我觉得可能是这样。
And so I think it's not fully replacing the updating, but it maybe makes it less necessary, or not necessary to do quite as often, I think.
所以,你对未来一年这个领域会聚焦在哪些方面有什么宏观看法吗?比如推理、推理时间扩展、智能体这些。
So, you know, your your kind of big picture thoughts on like where the field will be focused over the next year is, you know, again, reasoning, inference time scaling, agents.
你还有什么其他想法或预测冒出来吗?
Any other thoughts or predictions that, you know, come to mind for you?
是的。
Yeah.
我很期待看到,这虽然是个小事,但正如我们之前讨论的,目前还没有能替代Transformer架构的主流方案,不过确实存在一些其他技术,比如文本扩散模型。
I will be curious to see — I mean, it's a little thing, but, you know, like we talked about, there is no big alternative to the transformer architecture, but there are things like text diffusion models.
比如谷歌,他们曾经有一个等待页面。
And Google, for example, they had, like, a waitlist page.
他们计划推出一个文本扩散模型,不是那种小型的,而是一个替代方案,我对此非常好奇。
They are planning to launch a text diffusion model — I mean, not a small one, but an alternative — where I'm really curious.
这更像是我想亲眼看看的东西。
It's just more like something I wanna see.
也许这将取代LLM的免费基础服务。
Maybe that's gonna be replacing, like, the free tier of LLMs.
这可能是一个非常有趣的方向。
Maybe that's a really interesting thing.
我主要对这个感兴趣,是因为过去有很多关于文本扩散模型的研究。
I'm mainly interested in that because there's been a lot of research on text diffusion models.
这是一种不同的思路,不是逐字生成文本。
So it's like a different take instead of generating the text sequentially.
它更像BERT模型,也就是B E R T模型,通过掩码逐步去噪或用文本替换掩码。
It's more like a BERT model — a B-E-R-T model — where you have masks, and then you gradually denoise, or replace the masks with text.
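That gradual mask-replacement can be sketched as a loop. This is a toy illustration only — real text-diffusion models predict all masked positions with a trained network and unmask by confidence, not at random:

```python
import random

# Toy text-diffusion decoding: start from all [MASK] tokens and fill in
# a few positions per step, instead of generating strictly left to right.
def denoise(predict, length, steps=8, seed=0):
    """predict(tokens, i) -> token for position i; a stand-in for the model."""
    rng = random.Random(seed)
    tokens = ["[MASK]"] * length
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == "[MASK]"]
        if not masked:
            break                                  # fully denoised
        # unmask a random half each step (real models pick by confidence)
        for i in rng.sample(masked, max(1, len(masked) // 2)):
            tokens[i] = predict(tokens, i)
    return tokens
```

Because many positions are filled per step, the number of model calls grows with the step count rather than the sequence length — the source of the speed and cost advantage he mentions.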
我只是想看看它在大规模应用时的表现,因为目前大多数模型都还只是研究性质的。
I just wanna see how it performs at scale, because right now most of the models are research models, things like that.
这算不上什么前沿性能上的突破,但我认为它可能会更便宜、更快。
It's nothing, you know, I think people should get excited about in terms of cutting-edge performance, but it will maybe be cheaper and faster.
也许这会带来日常性的改进,比如谷歌摘要和谷歌搜索,它们虽然也基于LLM,但并不是最优秀的。
And maybe that will be, like, an everyday improvement, even for things like the Google summary in Google Search, which is also LLM-based but not the best.
还有这些小的体验优化。另外,我们录制本期节目时正值中国新年前夕,而历史上每逢中国新年前后,总会有很多模型发布,尤其是开放权重模型。
And like these little quality-of-life improvements. I think, well, while we're recording this, it's before the Chinese New Year, and historically, around the Chinese New Year, there have always been a lot of model releases, open-weight model releases.
所以,也许会有一些小惊喜,比如你可能会看到DeepSeek第四版,也许会有更大的变化。
So maybe there's a little surprise in there, where you will see DeepSeek version four, and maybe there is a bigger change.
因此,我挺感兴趣去关注和观察这一点。
So I'm kind of like interested in following that and seeing that.
但说实话,现在我脑子里想的,我觉得我们已经差不多讲全了。
But, yeah, right now, off the top of my head, I think we covered pretty much everything.
让我们稍微换一下话题,跟你聊聊你个人最近在忙什么。
Let's maybe switch gears a little bit and, update us on what you've been working on personally.
你之前提到过这本书的几个章节。
You've kind of referenced chapters of the the book.
谈谈你目前的这本书吧,大家可以在哪里了解更多相关信息。
Talk a little bit about your current book and, you know, where folks can learn more about it.
是的。
Yeah.
所以上次我上你的播客时,我们讨论过我的《从零构建大语言模型》这本书。
So I think last time I was on your podcast, we talked about my Build A Large Language Model From Scratch book.
这本书涵盖了从构建架构、预训练模型,到进行指令微调的整个过程。
So it's basically the whole journey from building the architecture to, pre training a model and then also doing instruction fine tuning.
而这本书的目标并不是让你构建一个能为你在家做所有事情的个人助手。
And the goal of that was not to, let's say, build your personal assistant that does all the things at home for you.
因为那样做可能会花费五万到十万美金,而且工作量巨大。
Because that would cost like $50,000 to $100,000 and be a lot of work.
尽管如今训练自己的大语言模型已经比以前简单了,但依然不是你能在周末随便完成的事情。
I mean, even though it's simpler nowadays to train your own LLM, it's not something you can do routinely on a weekend.
但这本书的目标是教人们理解这个工作流程,了解大语言模型是如何运作的——因为这能帮助你更好地使用大语言模型,理解上下文是什么、上下文的局限在哪里、注意力机制是如何工作的,以及为什么输入变长会导致成本更高?
But the goal of that book was to teach people how that workflow works, to understand how LLMs work — because that helps you use LLMs better: to understand what the context is, what the limitations of the context are, how attention works, and why it's more expensive if my input gets longer.
就像你自己亲手构建一个大语言模型一样,你会获得一种非常清晰的理解,这比仅仅用一种更自由的形式来解释要深刻得多。
And it's just — if you build the LLM yourself, you get a really clear understanding, compared to just explaining it in a more, I would say, free-form way.
因此,很多人很喜欢这本书,现在它也已成为教学中非常受欢迎的教材。
And so, yeah, a lot of people like that and it's like a very popular textbook also for teaching now.
我当时非常兴奋,因为这本书只能涵盖这么多内容,所以我想继续写续作。
And I was then really excited — because it's only one book, it could only cover so much — to work on the sequel.
所以现在我正在写《从零构建推理模型》,这可以说是它的续集。
So, right now I'm working on Build A Reasoning Model from Scratch, which is kind of like the sequel.
这两本书之间没有任何重叠。
There's no overlap between the books.
它完全可以作为一本独立的书阅读,但主要聚焦于我们之前讨论过的推理技术,比如使用可验证奖励的强化学习、GRPO算法、推理扩展等——这些技术都是在你拥有一个预训练LLM之后才用到的。这本书从给定一个预训练LLM开始。
I mean, it can be read as a standalone book, but it's mainly focused on the reasoning techniques we talked about — the reinforcement learning with verifiable rewards, the GRPO algorithm, inference scaling — all these techniques that come in once you have a pre-trained LLM. So the book starts with a given pre-trained LLM.
我们使用最小的Qwen3模型,然后加入推理扩展和强化学习。
We use Qwen 3 — the smallest Qwen 3 model — and then add inference scaling and reinforcement learning.
目前前360页已经进入早期访问阶段,我希望能在今年四月前完成——只剩最后一章了。
So the first 360 pages are already in early access, and I'm hoping to finish — I mean, there's only one more chapter left — by April.
这一章的工作量很大,因为你必须运行所有的实验。
I mean, the chapter is a lot of work because you have to run all the experiments.
所以我一直在运行大量实验,尤其是针对GRPO算法,因为已经出现了很多不同的论文和改进方案,我需要在实践中逐一尝试。
So I've been running a lot of experiments, especially for the GRPO algorithm, because there have been so many different papers and improvements, and I've been trying them out in practice.
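The group-relative part of GRPO — turning verifiable rewards from a group of rollouts into normalized advantages — can be sketched in a few lines. This is a simplified illustration of the idea, not the book's implementation, and it omits the clipped policy-gradient update those advantages feed into:

```python
import statistics

# The group-relative core of GRPO: sample several answers per prompt,
# score each with a verifiable reward (e.g. exact match on a math answer),
# and normalize within the group to get per-sample advantages.
def grpo_advantages(rewards):
    """rewards: verifiable scores for one group of rollouts of the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                   # the whole group tied: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

A rollout that beats its group's mean gets a positive advantage and is reinforced; no learned value model is needed, which is the main simplification over PPO.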
这非常有趣,但也非常辛苦。
It's been a lot of fun, but it's also a lot of work.
是的。
So yeah.
在过去的几周和几个月里,我主要在做实验,是的。
So I've I've been mostly running experiments in the last couple of weeks and months, and yeah.
不。
No.
实际上,这非常令人兴奋。
It's quite exciting, actually.
对。
Yeah.
那么人们可以直接阅读第二本书并开始实践吗?还是你希望人们在开始第二本书之前先读完第一本?
And so can folks pick up the second book and run with that or do you expect folks to have read the entire first book before they start with the second?
我觉得两种方式都可以。
I would say either way works.
如果你直接看第二本书,就不必先读第一本,因为它使用了预训练的LLM。
You don't have to read the first book — the second book uses a pre-trained LLM.
所以你不需要自己预训练一个LLM,也不需要第一本书来为第二本书的LLM提供训练支持。
So you don't have to pre-train your own LLM, and you don't need the first book to train the LLM for the second book.
所以它在这方面是相对独立的。
So it's kind of independent like that.
但第二本书并没有详细解释预训练或架构。
But the second book doesn't explain in detail the pretraining or the architecture.
我的意思是,我附录里有对架构的说明,但不如第一本书那么详细。
I mean, I have an appendix on explaining the architecture, but it's not quite as detailed as the first book.
所以,如果人们想了解LLM从预训练到后训练的完整生命周期,我认为按顺序阅读这两本书会更有意义。
So I think if people want to understand the whole, let's say, the whole life cycle of an LLM from pre training to post training, I think it would make sense to read them sequentially.
但你也可以直接从第二本书开始,学习推理、扩展和推理能力。
But you could also start with a second book, learn about inference, scaling and reasoning.
如果你对预训练感兴趣,之后再补上这些内容也可以。
And then if you're interested in the pre training, you can fill in the gaps later on.
我觉得无论哪种方式都可以。
I think either way works, basically.
太棒了。
Well, very cool.
塞巴斯蒂安,和你重聚真开心,我们得比每三年一次更频繁地见面。
Sebastian, it's been great catching up with you, and, we need to do it more often than every three years.
非常感谢你抽出时间参与,分享了你对当前情况和未来趋势的看法。
But thanks so much for jumping on and sharing a bit of your perspective on where things are and where things are going.
是的。
Yeah.
非常感谢你的邀请,萨姆。
Thank you so much for the invitation, Sam.
我度过了非常愉快的时光。
I had a great time.
我非常喜欢谈论大语言模型和人工智能。
I love talking about LLMs and AI.
这真是一次愉快的交流,谢谢您邀请我。
Well, that was a treat, and thanks for having me on.
谢谢。
Thank you.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。