本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
好的。
Okay.
我们直播开始了。
We're live.
我们有一个人。
We have one person.
哦,太好了。
Oh, good.
人们会陆续进来。
People will start trickling in.
谢谢大家来参加销售直播第六期。
Thanks for coming to Sale Live number six.
这是一场非常令人兴奋的直播。
This is a very exciting one.
我觉得我们有的是,我的意思是,这些话题总是很有趣。
I think we have a I mean, the topics are always fun with these.
不管今天我们的小脑袋瓜在忙着跟上AI的什么话题,我们都很欢迎新加入销售联盟的这位创作者,这意味着会有更多可销售的内容。
Whatever is the topic of the day on our little rat racing minds trying to keep up with AI, but we're welcoming the latest writer that is joining the Sale Coalition, so I think this just means more content for sale.
说实话,我早就一直是Swix的粉丝和朋友了。
I think I've been a fan of Swix and a friend for a while at this point.
所以我很高兴他的内容能加入进来,我觉得你最近做得非常棒,而且还在持续进步。
So I'm very happy to have his content join this, and I think you've been doing great stuff recently and continuing to evolve this.
谢谢你,先生。
So Thank you, sir.
欢迎加入团队。
Welcome to the team.
我只是想说,我的朋友们和AI媒体领域的同事们,能够支持大家、让这个圈子更紧密,真的很好。
I just this is, like, my friends and and colleagues in the AI media space, and it's just great to be able to support people and keep that network closer.
欢迎你,谢谢你能来,我只想说声谢谢,感谢你加入我们。
So welcome, and I just wanna say thanks for joining us.
很高兴你能来,Sean或者Swix。
It's really a pleasure to have you on here, Sean or Swix.
是的,太棒了。
So yeah, awesome.
我恰好听了你关于SPA基准的播客。
I just coincidentally listened to your podcast about the SPA benchmark.
是的,真巧啊,世界真小。
So yeah, awesome to, you know, small world.
很高兴你来这里。
Awesome to have you here.
对。
Yeah.
谢谢邀请我,我很高兴能来聊天。
Thanks for having me, and yeah, just glad to be on and chat.
我从来没参加过这种Substack直播活动,所以很好奇它是怎么运作的。
I've never ever done one of these Substack live things, so I'm curious how it works.
我之前在想Substack,因为它起初是一个通讯平台,但他们想转向多媒体。
I was thinking about Substack because it started as a newsletter platform, but they wanna go multimedia.
我认为,我们在进入技术内容之前先进行直播其实很好,因为它带来了不同的亮点。
I think the live thing, before we get to technical content, is actually good because it gives it a different edge.
当你知道自己在直播时,整个氛围会显得更鲜明一些。
It's just like a little bit sharper when you know you're live.
我想我们都做过很多播客,甚至一些未经剪辑、事后发布的播客,但我觉得直播是一种独特的元素,可以很好地利用起来。
I think we've all done a lot of podcasts, even podcasts that are unedited and put up later, but I think the live thing is a different element that can be tapped into nicely.
所以我不确定。
So I don't know.
我们不如直接切入正题吧?
Why don't we why don't we just dive into it?
我们先从知识蒸馏开始。
We're gonna start with distillation.
我把‘模型如何作弊’放在最前面,这样我们可以讨论基准测试。
I put "how models cheat" at the top so we can talk about benchmarks.
我认为,Anthropic 这周发布了一篇相当有争议的博客文章。
I think Anthropic posted this pretty spicy blog post this week.
我认为这篇博文基本上详细描述了他们如何发现来自中国知名实验室的分布式蒸馏攻击。
I think it was essentially detailing how they found distributed distillation, quote unquote, attacks on their services from prominent Chinese labs.
对于Anthropic称之为攻击,我一点也不感到意外。
And I'm very unsurprised with Anthropic calling it an attack.
我认为这与他们的许多品牌定位是一致的。
I think that fits with a lot of their branding.
好的,不错,共享屏幕。
Okay, nice, screen share.
这就是我们所说的。
This is what we mean.
Sean Swix 真是个高手。
Sean Swix is such a pro.
屏幕共享功能几天前才刚推出。总之,Anthropic 详细说明了他们如何发现多个中国实验室通过分布式账户训练自己的大语言模型,描述了这些实验室的做法,以及为什么 Anthropic 在其 AI 地缘政治世界观中对此感到担忧。
The screen share feature was only dropped a few days ago, but essentially, Anthropic is detailing how they found distributed accounts across multiple Chinese labs training their own LLMs, and described what they were doing and why Anthropic is concerned about this in their worldview of AI geopolitics.
我认为这非常有趣,因为我认为中国实验室显然应该这么做。
And I think it's very interesting because I'm of the opinion that the Chinese labs, like, obviously should do this.
他们面临着严重的GPU短缺,使用API比自己生成合成数据要容易得多。
They're in a massive GPU shortage, and using APIs is way easier than generating synthetic data on their own.
我觉得我可能需要在这里打断你一下。
I think this is where I may interrupt you here.
也许我们应该先为普通观众定义一下知识蒸馏这个概念,然后再深入探讨。
Maybe we should, just for the general audience, define distillation before we dive into it.
是的,知识蒸馏是一个更广泛的概念。
Yeah, so distillation, that's like a broader concept.
这并不是随着大语言模型才出现的新概念。
It's not like a new concept that came up with LLMs.
它是机器学习领域一个更早的概念。
It's like an older concept in machine learning in general.
蒸馏本质上是这样一种想法:你使用一个更大的模型,用它的输出来进行训练。
And distillation is essentially the idea that you take a larger model and train on its outputs.
抱歉,你的意思是有一个更大的模型,让它生成输出,然后用这些输出来训练一个更小的模型。
Sorry, you have a larger model, let it generate outputs and train a smaller model on these outputs of the larger model.
这样做的理念是,你可以更高效地用这个大模型来训练小模型。
And the idea is that you can train the smaller model more efficiently using that larger model.
而且,我想你刚才提到了这篇论文。
And originally, I think you just brought up the paper here.
最初,你会基于logits进行训练。
Originally, what you would do is you would train on the logits.
所以,老一辈的机器学习从业者可能还记得,在深度神经网络中,logits是最后一层的输出,我们通常用它们来计算交叉熵损失函数。
So old school machine learning people might remember, from deep neural networks, the logits: the outputs of the last layer that you usually use to compute the cross-entropy loss.
你会基于这种信号进行训练。
And you would train on this signal.
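To make the logit-based setup concrete, here is a minimal toy sketch in plain Python, not any lab's actual pipeline: the teacher's temperature-softened softmax distribution serves as the soft target in a cross-entropy loss for the student, in the style of classic knowledge distillation.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale by temperature, then normalize; a higher T softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's, for one vocabulary position.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_t, p_s))

# Toy check: a student whose logits resemble the teacher's gets a lower loss.
teacher = [4.0, 1.0, 0.5]
student_near = [3.5, 1.2, 0.4]
student_far = [0.5, 4.0, 1.0]
assert distillation_loss(teacher, student_near) < distillation_loss(teacher, student_far)
```

In a real training loop this loss would be summed over every token position and backpropagated through the student only.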
而如今在大语言模型的语境下,这个概念变得宽松了一些。
And nowadays in the context of LLMs, it's a bit more loose.
因此,你训练时并不一定要使用这些logits。
So it does not have to be these logits that you train on.
它可以只是输出数据,就像内森刚才说的合成数据。
It could be just the output data, synthetic data like Nathan just said.
所以,这实际上是一种非常常见的做法。
So for example, it's actually a very common practice.
例如,在DeepSeek-R1的论文中(其他公司也这样做),他们训练了旗舰模型,也就是拥有6710亿参数的最大R1模型。
For example, in the DeepSeek-R1 paper (and other companies do this too), they trained the flagship model, the largest R1 model with 671 billion parameters.
然后他们会创建更小的变体,我忘了具体数字,但大约是十亿到三十亿参数规模的小模型,这些小模型可以在本地运行。
And then they would create smaller variants, I forgot the numbers, but like one to three billion parameters, small models you can run locally.
而它们是通过训练自身更大模型的输出来训练的。
And they are trained on the outputs of their own larger models.
现在的问题是,这同样是一种非常普遍的做法。
Now the thing is, I mean, this is also very common practice.
当他们推出较小的模型变体时,所有人都会这么做。
Everyone does that when they are producing the smaller model variants.
现在我认为纳森提出的问题是,如果你是一家公司,使用另一家公司的LLM生成这种合成数据,然后用它来训练你自己的模型,会发生什么?
Now I think the question or the point Nathan brought up is what happens if you are a company and you generate this synthetic data from another company's LLM and then train your own model on it?
抱歉,刚才只是个小插曲,但简而言之,知识蒸馏就是用一个大模型的输出来训练一个小模型。
Sorry, that was just a little interruption, but yeah, distillation in short is training a smaller model on the outputs of a larger model, basically.
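The looser, output-level sense of distillation described above can be sketched in a few lines. This is a hypothetical illustration: `teacher_answer` is a stand-in for a real API call, and `fine_tune` for an ordinary supervised fine-tuning step; neither is a real library function.

```python
def teacher_answer(prompt: str) -> str:
    # Stand-in for calling the larger model's API and getting generated text.
    return f"Answer to: {prompt}"

def build_sft_dataset(prompts):
    # Each training example is just (prompt, teacher output): the student
    # never sees the teacher's logits, only its generated text.
    return [(p, teacher_answer(p)) for p in prompts]

prompts = ["What is 2 + 2?", "Name a prime number."]
dataset = build_sft_dataset(prompts)
# fine_tune(student_model, dataset)  # hypothetical supervised fine-tuning step
```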
是的。
Yeah.
我认为在前沿领域这也是完全可能的。
And I think this is even possible at the frontier.
比如,人们会从Opus这样的模型中进行蒸馏,来构建Claude Sonnet。
So like people distill from something like Opus to build Claude Sonnet.
他们通常在内部做着非常相似的事情。
This is they're generally doing very similar things internally.
他们能接触到更多、更强大的工具。
They have access to different tools and richer tools.
另一个背景是,多年来,所有这些大型实验室的使用条款都明确规定,你不能将这些API的输出用于训练类似的竞争性AI模型。
And then the other context is that all of these large labs for years have had terms of service where they say that you effectively cannot use the outputs from these APIs to train something like a competitive AI model.
这些条款很模糊。
It is vague terms.
服务条款本质上不是合同,你使用某项服务时,如果提供商发现你违反了条款,就可以终止你的访问权限。
Terms of service are not a contract; essentially, you are using a service, and if the provider finds you violated the terms, they can cut off your access.
这只是一个基本常识。
That's just a basic thing.
因此,这些条款在美国几乎从未被执行过。
So these have not been enforced in the US much at all.
我想有一个案例,大约一两年前,OpenAI切断了字节跳动的API访问。
I think there was one case, ByteDance, a year or two ago, where OpenAI cut off their API access.
但在ChatGPT刚推出后,当人们刚开始用Alpaca等构建开源模型时,这个问题被广泛讨论过。
But this was discussed so much right after ChatGPT when people were building the first open models on Alpaca and things.
所以当时大家都在问:OpenAI会不会来找我们麻烦,因为我们做了这些研究模型?
So it was like, is OpenAI gonna come after us for doing these research models?
但后来这件事就完全平息了。
And it totally died down.
人们为此担心了一年多。
People were worried about this for over a year.
那是一场绕不开的讨论,但实际上什么也没发生;如今这一讨论首次显著重现,我认为是因为人们对AI竞争力的担忧加剧了。是的。
It was a discussion you couldn't get away from, but nothing really happened, and this is the first prominent reemergence of it; I think it's because people are far more worried about AI competitiveness now. Yeah.
但我很好奇你们怎么看。
But I'm curious what you guys think.
是的。
Yeah.
我们能不能花点时间谈谈,他们究竟该如何检测?因为你一开始提到过某种‘蒸馏攻击’,虽然你没有明确说‘攻击’这个词,但你无形中给它加了引号。
Can we talk for a second about how they would even detect this? Because in the beginning you said something about a distillation attack, and you didn't say "attack" specifically, but you kind of implicitly put quotation marks on it.
那么,你们究竟该如何检测这种行为呢?
So how would you even detect that?
我认为,从这个语境来说,蒸馏实际上就是让ChatGPT、Claude生成合成数据,然后收集这些合成数据,用监督学习或监督微调来训练你们自己的模型。
So I think, I mean, distillation in that context really just means letting ChatGPT or Claude generate synthetic data, then you collect that synthetic data and train your own model on it with supervised learning, supervised fine-tuning.
但你怎么才能分辨这是蒸馏攻击,而不是单纯的评估呢?
But then how would you even detect that this is a distillation attack versus just an evaluation?
因为现在我实际上正在跑一个。
Because right now I'm actually running one.
我的意思是,我正在为我的书第八章进行蒸馏,但我用的是开源权重模型。
I mean, I'm distilling myself for chapter eight of my book, but I'm doing it with open weight models.
所以不用担心,Anthropic,请别担心。
So no worry Anthropic, please don't worry about it.
只是别用API模型来干我这活儿。
Just don't use API models for my job.
是的。
Yeah.
我现在用的是OpenRouter,直接从DeepSeek 3.2模型进行蒸馏,我觉得这些人是允许这样做的。
I use OpenRouter right now and just distill from the DeepSeek version 3.2 model, which I think these folks are okay with.
但我想说的是,当我评估模型时,我用的几乎是同一个脚本。
But what I wanted to say is, so when I'm evaluating models, I use basically almost the same script.
所以当你在评估一个模型时,你会提出一个问题,然后让模型生成答案,对吧?
So when you're evaluating a model, you have a question and you let the model generate the answer, right?
因此,你会为你的基准问题生成响应。
So you generate the response to your benchmark question.
在我的基准测试中,我有来自数学领域的500个示例数据集。
And in my benchmarks, I have datasets like Math-500, with 500 examples.
我还有一个更大的数学数据集,包含12,000个示例。
I have a bigger Math data set of 12,000 examples.
所以你基本上就是在循环中调用API,让它生成这些问题——抱歉,是生成这些答案。
So you're basically just calling an API in a loop to let it generate these questions, sorry, the answers.
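The benchmark loop described here can be sketched as below. `query_model` is a stand-in for a real API call (e.g. through an OpenAI-style endpoint via OpenRouter), and the toy questions and answer checking are simplified; real harnesses normalize answers rather than compare raw strings.

```python
def query_model(question: str) -> str:
    # Stub standing in for an API call; returns canned (partly wrong) answers.
    return {"What is 2 + 2?": "4", "What is 3 * 5?": "16"}.get(question, "")

benchmark = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 3 * 5?", "answer": "15"},
]

correct = 0
for item in benchmark:
    response = query_model(item["question"])  # one API call per question
    correct += int(response.strip() == item["answer"])

accuracy = correct / len(benchmark)
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.50" on this toy set
```

From the provider's side, this traffic looks like the same small set of prompts repeated, which is exactly why it is hard to distinguish from data collection at a glance.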
但公司怎么知道,这个人是在做评估,而不是在保存这些数据并用它们来训练自己的模型呢?
But then how would a company know: okay, this person is just evaluating, versus this person is saving that data and then training their own model on it?
你明白我说的意思吗?
You see what I was saying?
我觉得,这是规模的问题。
I think it's the scale thing.
所以当你做评估时,至少是基础评测,你通常只会跑一次。他们在文中确实提到了一些判据,但我认为主要还是看数量,然后他们会在相似账户之间寻找模式。是的。
So when you're evaluating, at least basic evals, you're usually gonna do it once. I mean, they say some stuff here, but I think most of it is quantity, and then they're gonna look at patterns across similar accounts. Yeah.
没错。
Exactly.
他们会看到,比如非常重复的内容。
They're gonna see, like, really repetitive stuff.
是的。
Yes.
所以我认为这引出了一个有趣的点:也就是说,你可以在大规模上进行评估。
So I think the interesting point this leads to is, I mean, you can do evaluation at a large scale.
如果你是一家大公司,你希望了解你的大语言模型表现是否非常出色。
If you are a big company, you want to know whether your LLM performs very well.
你会运行一套庞大的基准测试集。
You have a large suite of benchmarks you are gonna run.
但你刚才提到可能是寻找模式。
But then you said like maybe looking for patterns.
所以一种可能的方式是,嗯,这是一个熟悉的问题。
So maybe one way would be, okay, this is a familiar question.
它在基准测试中经常出现。
It comes up in the benchmarks.
所以这个人可能并不是在窃取我们的答案。
So this person is maybe not stealing our answers.
他们只是用它来进行基准测试。
It's just using it for benchmark purposes.
但这意味着他们实际上在观察你生成的内容,当然,当你在互联网上使用大语言模型时,没有任何东西是私密的。
But then it means kind of like that they are looking at what you're generating there, which is, I mean, of course nothing is private when you are using LLMs on the internet.
数据被暂时存储在某处,但这几乎暗示了他们在监控你使用LLM的方式以及你生成的内容,这在隐私方面是一个相当敏感的话题。
The data is stored somewhere in between, but it almost implies that they are checking what you use the LLM for and what you generate, which is kind of a sensitive topic privacy-wise.
是的。
Right.
所以这是一个有趣的点,因为正如你提到的,服务条款禁止你进行知识提炼,但你实际上并没有在进行提炼。
So that's kind of like an interesting point because I mean, of course you mentioned the terms of service that you are not allowed to distill, but you're not distilling.
我想表达的是,你并不是在平台上实时进行蒸馏。
So the point I'm trying to make is you're not distilling live when you are on the platform.
你是在之后的某个时候才进行提炼的。
You are doing it somewhere later.
你只是让LLM生成答案。
You're just letting the LLM generate answers.
我觉得很有趣的是,一家公司竟然会关注这些数据,甚至在大规模层面指出你:嘿,你生成的答案太多了。
And I find it kind of interesting that a company would look at that, even like at the scale and call you out like, hey, you are generating too many answers here.
这不太合适之类的。
That's not cool or something.
你知道吧?
You know?
这有点奇怪。
That's kind of a weird thing.
是的。
Yeah.
我本来想回应几句,就在前面几句话里,但实际上,Anthropic 先封锁了美国公司,而不是中国公司。
I wanted to respond to something a few sentences back, but actually, Anthropic has blocked US companies first, before the Chinese companies.
嗯。
Mhmm.
他们已阻止 OpenAI 和 xAI 使用这些模型。
They have blocked both OpenAI and xAI from using the models.
我认为他们可能还指责过 xAI 进行模型蒸馏。
And I think they possibly accused xAI of distilling stuff.
我不确定。
I don't I don't know.
但绝对不像这样写成一篇完整的博客文章。
But definitely not, like, in a in a full blog post like this.
所以这个案例绝对是目前最受关注的。
So this one is, like, definitely the most high profile case.
而且确实是这样。
And yeah.
而且我觉得,实际上很难区分,比如,
And I do think it is actually pretty hard to distinguish from, like, hey,
我只是在跑我的内部基准测试,兄弟。
I'm just running my internal benchmark, man.
当然,由于你知道,特别是有些基准测试你必须运行三到五次,所以会产生大量完全相同的内容。
And, of course, it's gonna be a very high volume of all the same stuff, because, you know, some benchmarks you have to run three to five times.
这些问题是完全一样的。
Like, this is the exact same questions.
对吧?
Right?
我的意思是,显然,如果你达到了数万、数十万的量级,那就很明显了。
I do think, obviously, if you get to the tens of thousands, hundreds of thousands, then okay.
你不是在单纯地运行基准测试。
You're not just running benchmarks.
你实际上是在提炼这个东西。
Like, you you are distilling this this thing.
聊天里有个很好的观点。
There is a good point in the chat.
如果你在进行提炼,问题的分布会是什么样子?
Like, how would the distribution of questions look like if you are distilling?
我觉得,跟你提到的点相关,当你生成的答案达到一定规模时,可能会显得可疑。
And I think related to your point, at a certain point when you have a certain magnitude of answers generated, it might look suspicious.
但我的意思是,有很多正当的使用场景。
But I mean, there are a lot of legit use cases.
如果一家公司把你的,比如OpenAI云API用作自己的聊天机器人,并且拥有大量客户,自然会产生大量回答。
If a company uses your, let's say, OpenAI Cloud API as their own chatbot and they have a lot of customers, it's naturally a lot of answers that are generated.
因此,他们可能会观察分布情况,比如在进行知识蒸馏时,你可能会预期分布非常广泛,因为你希望覆盖几乎所有的内容。
And so they would probably look at distributions like maybe you would expect a very broad distribution when you are distilling because you wanna cover pretty much everything.
而在运行基准测试时,分布可能更具体。
And when you are running benchmarks, it's maybe more specific.
你运行的是数学基准测试,那就只是数学。
You're running a math benchmark, it's just math.
或者如果你有一个客户聊天机器人,那更多是客户问题的回答。
Or if you have a customer chatbot, it's more like customer answers.
但是的,我认为他们可能会分析你的分布情况。
But yeah, I think they would maybe analyse your distribution.
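The "look at the distribution" idea above can be made concrete with a crude sketch. This is purely speculative on my part, not Anthropic's actual method: if a provider could tag each request with a topic, the Shannon entropy of the topic counts is one simple way to separate a broad, everything-covering distillation run from a narrow benchmark run or a single-domain chatbot. Topic labels here are made up.

```python
import math
from collections import Counter

def topic_entropy(topics):
    # Shannon entropy (in bits) of the empirical topic distribution.
    counts = Counter(topics)
    n = len(topics)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

benchmark_run = ["math"] * 100                             # narrow: one benchmark
distill_run = ["math", "code", "law", "bio", "chat"] * 20  # broad coverage

# A distilling client spreads prompts across topics, so its entropy is higher.
assert topic_entropy(distill_run) > topic_entropy(benchmark_run)
```

A real detector would of course combine volume, repetition, and account clustering, as discussed above, rather than rely on any one statistic.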
我觉得这种做法有点奇怪。
I feel like this is kind of a weird thing to do.
我不确定。
I don't know.
如果你是一家公司,关注客户隐私和生成的数据,当然你知道,这些数据本来就不算私密,但公司这样做的行为本身还是有点奇怪。
If you're a company and you're looking into your customers' generated data like that, of course you have to expect that it's not private, but it's still kind of a weird thing that they do that, essentially.
是的。
Yeah.
好的。
Okay.
你还有什么想聊的?
What else do you have to talk about?
我觉得这有趣吗?
I think, is it interesting?
好的。
Okay.
我做得还行。
I did I did okay.
有一件事,这有点像Substack上作者们来回讨论的样子。
So one thing, this is a little bit of Substack authors going back and forth, you know.
我做的一件事是把它丢进了Nano Banana,那是个还不错的可视化工具。
One thing I did was I threw it into Nano Banana, which gives kind of a decent visual.
对吧?
Right?
把它放到 Nano Banana 2 里。
Throw it into Nano Banana two.
这是 Nano Banana 2 的直播播客。
It's a Nano Banana two live pod.
刚刚才发布,五分钟前。
Just released five minutes ago.
我这个其实也是用 Nano Banana 2 做的。
Mine is actually Nano Banana 2 as well.
因为我参加了早期访问计划,他们把我迁移到了新的 Nano Banana,所以我无法访问旧版了。
Because I'm in the early access program, they cut you over to the new Nano Banana, and I couldn't access the old one.
所以我试着做一下对比,但做不到,
So I was trying to do, like, a diff, and I couldn't do it because I couldn't access the old one.
这正是早期测试者计划的典型情况。
That is classic early tester program shit.
看看我们这里要应对的这些麻烦。
Look at the pain we have to deal with here.
有趣的是,DeepSeek 的数值比 MiniMax 小这么多吗?
Is it interesting that DeepSeek is so much less than MiniMax?
我认为,内森,你说得对。
I think, Nathan, you're you're right.
你之前提到过,这在某种程度上像一篇政治博客文章。
You had a little bit of a comment that it's a political blog post in a way.
也许不算政治,但他们想表达的是一种更注重立场而非细节的观点。
Maybe not political, but they're trying to make a point that is more about making a point than the details.
比如,DeepSeek 这件事的规模确实小得多。
Like, the DeepSeek thing is definitely way smaller scale.
所以,好吧。
So, okay.
大多数实验室都会尝试使用它们能接触到的所有API。
Most of the labs will experiment with all the APIs they can get access to.
数据极其重要,你会有一个管道,可以随时替换任何API,然后进行消融实验,看看它是否能提升性能。
Data is just so important, and you're gonna have a pipeline where you could sub in any API and then run an ablation to see if it gives you performance.
这个API几乎是免费的。
The API is kind of free.
就去做吧。
Just do it.
数百万次交换则是一个更大的赌注。
Millions of exchanges is a bit more of a bet.
你可以在更长的时间里观察到它;数百万次交换相当于数百亿甚至上千亿个token,要从API里真正拿到这么多需要长得多的时间,尤其是当这些请求必须分散到大量账户时。
You can measure that over a longer window; the millions of exchanges is tens of billions or 100 billion tokens, and it takes a lot longer to actually get that out of the API, especially when they have to spread it across a ton of accounts.
这些账户都有限速,还存在其他问题。
These accounts are all rate limited and have other problems.
这需要更长时间,但这个小的却非常快。
Like, that takes longer, but this tiny one is so fast.
所以我的观点是,这清楚地表明Anthropic试图把DeepSeek作为美国人们唯一熟知的中国AI名称。
So that was generally my point: it made it clear that Anthropic's trying to use the DeepSeek name as the only Chinese AI name that people in the US know.
嗯。
Mhmm.
嗯。
Mhmm.
从营销角度来说,就是为了让它深入人心,你知道的,没错。
Like, marketing-wise, to make it stick, yeah.
实际上,你还提到了不同的API之类的。
Actually, you mentioned also like the the different APIs and everything.
我并没有受到他们的赞助,也没有任何关联。
I'm not sponsored by them or I have no affiliation.
我从未与那家公司的人交谈过,但像OpenRouter就是一个很好的例子,我经常用它来运行开源模型,因为那些更大的模型太庞大,无法在本地运行。
I've never talked to anyone from that company, but OpenRouter, for example, is a good example; I've been using it a lot for the open weight models, because the bigger ones are too big to run locally.
而且不错的是,他们也提供这项服务,本质上是通过其他公司的API进行路由,并在那时自动选择当前最便宜的选项。
And what's nice is it's basically just routing you through other companies' APIs, and they automatically select whatever is the cheapest one at that point.
我有时会遇到一些失败。
I sometimes get some failures.
我觉得当它切换时,有时会崩溃,但可能在我的脚本里,是我需要修复的地方。
I think when it switches, sometimes it crashes, but maybe it's something in my script I have to fix.
所以即使如此,如果你在进行知识蒸馏,也可以从多个提供商那里进行。
So even then, if you're distilling, you can do that from multiple providers.
但当然,如果你想用ChatGPT或Claude的东西,它总会走官方渠道,然后我猜就会显得有点可疑。
But yeah, of course, if you want something from ChatGPT or Claude, it's always gonna go through the official one, and then it gets, I guess, suspicious.
你也可以在技术上通过OpenRouter进行一些蒸馏,你的账户、你的直接账户,你可以创建多个账户。
You could also technically distill a bit through OpenRouter; their account, your direct account, you can make multiple accounts.
有趣的是,他们追踪了所有这些信息。
And it's kind of interesting that they track all that.
然后呢,既然你提到了,他们提到了DeepSeek,这还挺有意思的。
Then like, yeah, different topic now that you called out, that they call out DeepSeek, which is quite interesting.
顺便说一下,OpenRouter 在这些情况下似乎并没有使用 DeepSeek。
For what it's worth, OpenRouter seems to not be using DeepSeek in most of these.
这些是免费模型。
These are free models.
DeepSeek 不是免费的。
DeepSeek's not free.
是的。
Yeah.
我明白了。
I see.
我的意思是,
I mean,
我还应该说一下,我用的是付费 API。
I should also say I'm using the paid API.
他们还显示不同提供商的费用和每秒令牌数,这也很好。
It's also nice they show you how much it costs and the tokens per second for different providers.
所以如果你点击顶部的搜索,就可以找到不同的 DeepSeek 模型。
So if you go to the search in the top, you can go to the different DeepSeek ones.
我只是喜欢它,因为我经常做模型对比。
I just like it because I do a lot of model comparisons.
而这个是较旧的模型,所以可能只有一种提供商。
And then this one is an older model, so maybe it only has one provider.
但如果你去查找 DeepSeek R1 或者普通的 3.2,应该会有多个提供商,往下滚动,就能看到不同的提供商以及不同的每秒令牌数和不同价格。
But if you go to, I think, DeepSeek R1 or even the normal 3.2, there should be multiple providers; if you scroll down, yeah, you can see different providers with different tokens per second and different costs.
所以这有点像,我就是喜欢这个网站,因为它能快速使用 API,而且提供了类似 OpenAI 的接口。
I just like that website because it's quick to use and they have an OpenAI-like API.
所以,这并不是什么广告推广,我只是觉得它总体上很有用。
So it's not sponsored or anything, and I just find it generally useful.
所以,只是顺便提一下。
So but yeah, just a side note.
你想回到之前的对比吗?
Do you wanna go back to the comparison?
你刚才有个高层次的观点要表达吗?
Did you have a high level point to make there?
Swix?
Swix?
哦,好的。
Oh, okay.
就几个而已。
Just a just a couple.
第一,我认为在Moonshot发布他们的产品之后、MiniMax发布他们的产品之后、DeepSeek V4之前,这个时机是很有战略性的。
One, I think the timing, post Moonshot releasing their stuff, post Minimax releasing their stuff, pre DeepSeek v4, was strategic.
我认为这也可能是Minimax被检测到的数量更高的原因之一。
I think that may also have factored into why Minimax was more detected like, had a higher number.
所以,你知道,收集数据的时机其实非常重要。
So, like, you know, like, when you collect data is actually very important.
对吧?
Right?
所以他们在 MiniMax 2.5 训练期间中断了或发现了 MiniMax,对吧?我的意思是,如果我们最终真的和他们通话,我们会进一步确认这一点。
And so they interrupted, or they found, MiniMax during the training of MiniMax 2.5, right? Which, I mean, we will confirm later on if we do end up doing the call with them.
所以,显然,这个数字会非常高,因为他们正在主动寻找它,然后他们封禁了 Minimax 的账户,Minimax 也就更改了他们的做法。
And so, obviously, the number is gonna be very high because they're actively looking for it, and then they banned the Minimax accounts, and Minimax changed their things.
实际上,我认为事情并不是完全这样的。
Actually, I I don't think that's exactly what happened.
抱歉。
Sorry.
让我纠正一下。
Let me correct myself.
在 MiniMax 进行蒸馏时,他们发布了 Opus 4.6,并称他们将近一半的流量进行了重定向。
While MiniMax was distilling, they released Opus 4.6, and they said that nearly half the traffic was redirected.
所以我想,好吧,这确实如此。
So I'm like, this is like, okay.
非常明确地,这就是他们做的。
Very, very clearly, like, this is them.
对吧?
Right?
这是完全相同的流量。
This is the same exact traffic.
一旦有新模型发布,我就会立即切换。
I switched to a new model the moment a new model releases.
好的。
Okay.
很棒。
Cool.
DeepSeek 可能没有这么做,因为他们并没有积极地推进自己的工作。
DeepSeek maybe wasn't doing that because they hadn't been working on their stuff actively.
我不确定。
I don't know.
对吧?
Right?
可能是别的原因。
Like, it could it could be a different thing.
或者DeepSeek只是效率高得多。
Or DeepSeek is just way more efficient.
我花15万就能得到我需要的一切。
Like, I get all I need for 150k.
你们知道的,要是我们知道这个时间范围就好了。
You know, it would help if we knew the time frame of this.
这些API请求都是在过去四周内的吗?
Like, are all of these API requests within the last four weeks?
还是在过去六个月内的?
Are they within the last six months?
这完全决定了事情的性质。
Like, that's such a different nature of what is going on.
没错。
Exactly.
对吧?
Right?
那就是我想说的。
That's what I'm saying.
像DeepSeek一年前还在训练3.1、3.2版本,你知道的。
Like, DeepSeek was training 3.1, 3.2, like, you know, a year ago.
像
Like
是的。
yeah.
或者像,我不清楚,DeepSeek OCR。
Or like, I don't know, DeepSeek OCR.
他们说,嗯,我想他们提到了它的特点,但并不差。
They were like, well, I I guess they said what it is, but it's not bad.
是的。
Yeah.
对。
Yeah.
但从规模上来说,我确实觉得这个MiniMax模型要小三倍。
But also scale-wise, I do think the MiniMax model is three times smaller.
它就是一个更快的模型。
It's just like a faster model.
他们没有使用MLA,也没有使用DeepSeek稀疏注意力;我记得他们用的是分组查询注意力,但即便如此,它仍然是一个非常迅捷的模型。
They don't use MLA and they don't use the DeepSeek sparse attention; I think it's just grouped-query attention, but it is still a pretty snappy model.
所以我觉得它可能挺有吸引力的,适合使用。
So I think it's maybe just attractive to use.
另一个模型,我一时想不起来,可能他们提供过某种免费层级之类的,我觉得模型刚发布时有时会提供免费使用,而那个模型比DeepSeek的上一版(去年12月的v3.2)更新一些。
And the other one, off the top of my head, I don't know, maybe they had some free tier or something; I think when models come out, they sometimes offer free usage, and that was a more recent model than DeepSeek's last one, the v3.2 from December.
是的。
Yeah.
所以,也许这是一个无关紧要的点,因为他们之前已经调优过,而之前的数据流量是一样的,或者他们只是效率高得多。
So, you know, maybe this is an irrelevant point, because they were tuning before, and before they would have had the same amount of traffic, or they're just way more efficient.
对吧?
Right?
这让我想到,效率并不是原因。
It does bring to mind that the efficiency thing is not it.
我可以保证。
I can guarantee it.
那种效率的解释根本站不住脚。是的。
Like, that is just not it. Yeah.
是的。
Yeah.
他们早期恰好得到了正确的研究思路,并找到了合适的数据,这种可能性很小,但并不意味着他们能提高三倍的效率。
There's a small chance that they got the right research idea early and found the right data to use, but it's not like they're gonna be three x more efficient.
好的。
Okay.
所以这是时机问题,还是他们实际上根本没怎么用它?
So is it a timing thing, or do they just actually not use it that much?
我的意思是,我们来分析一下。
I mean, play this out.
我当时就想,为什么他们不分享呢?
I was like, okay, why don't they share?
他们都是好朋友。
They're all buddies.
对吧?
Right?
说到底,事情确实会发展到某个节点,好吧。
Like, you know, it does come to a point where, like, okay.
让整个中国把它分发给每一个人。
Let's have all of China just distribute it to everyone.
接下来我可以稍微谈一下这个。
I can talk about this a little bit.
我们这里做的研究并不多,但有一些研究项目正试图理解如何使用蒸馏数据。
There's not a lot of research, but there are a few research projects trying to understand how you use distillation data.
我认为SFT是最清晰的例子,你是在对问答对进行自回归损失。
I think SFT is the cleanest example, where you're doing this autoregressive loss on Q&A pairs.
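The "autoregressive loss on Q&A pairs" step can be sketched with the common label-masking trick: question tokens get the ignore label (conventionally -100 in popular training code) so that only the answer tokens contribute to the loss. The token IDs below are made up for illustration.

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label in training code

def build_labels(question_ids, answer_ids):
    # Concatenate question + answer as the model input, but mask the question
    # in the labels so gradient signal comes only from the answer tokens.
    input_ids = list(question_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(question_ids) + list(answer_ids)
    return input_ids, labels

q = [101, 7, 42]   # pretend-tokenized question
a = [9, 13, 102]   # pretend-tokenized answer (e.g. a teacher model's output)
input_ids, labels = build_labels(q, a)
assert len(input_ids) == len(labels)
assert labels[:3] == [IGNORE_INDEX] * 3 and labels[3:] == a
```

The cross-entropy loss is then computed position by position, skipping every position whose label is the ignore index.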
但最强的模型未必是最好的教师,我们这个领域的大多数人认为这是因为你必须让标记的概率与基础模型相匹配。
But the strongest model is not necessarily the best teacher, and most of us in this area think it's because you have to match the token probabilities to the base model.
所以目前的情况是,Qwen的稠密模型是许多开放权重模型最好的教师,我认为这是因为许多开放权重模型要么就是Qwen,要么长期以来都很像Qwen。
So what's happening is that Qwen dense models are the best teachers for a lot of open weight models, and I think that's because a lot of open weight models are either Qwen or have been Qwen-like for a while.
所以,OLMo从Qwen那里学得非常好,其他类Qwen模型显然也是如此。
So, like, OLMo learned really well from Qwen, and obviously other Qwen-like models did.
但要把这些流水线扩展到使用GLM 4.7、更大的DeepSeek模型或更新的大型Qwen MoE,所有这些都更难:仅仅用相同的提示、正确的采样设置生成数据,然后做SFT,并不一定能让指标上涨。
But scaling these pipelines up to use, say, GLM 4.7 or a bigger DeepSeek model or a more recent big Qwen MoE, for all of these it's a lot harder to just generate the data from the same prompts with the right sampling settings, then do SFT on it and actually make the numbers go up.
有趣的是,GPT-OSS是一个相当不错的教师模型,但这里存在巨大差距:仅仅拥有这些数据并不意味着它真的能让你的模型变得更好。
Interestingly, GPT-OSS is a pretty good teacher, but there's a huge gap there: just because you have this data does not mean it's actually gonna make your model better.
因此,你必须做研究,发现我们从 Claude 身上获得了有效的信号。
So you have to do the research to be like, oh, we learned that we get signal out of Claude.
我们需要尽快拿到一千亿个token,因为它会立刻提升我们的模型表现。
We need to get 100 billion tokens ASAP because it's gonna just immediately make our model better.
这在建模中并不是一个常见的处境,因为存在这种奇特的师生动态。
Like, that's not a common place to be in in modeling, because of this weird teacher-student dynamic going on.
嗯。
Mhmm.
我认为不同实验室的情况可能会有所不同。
I could see that being different across labs.
另外,我认为这与是否从同一模型家族中蒸馏较小模型有关,我发现这样效果更好。
I think it also has something to do with that; I noticed that if you are distilling the smaller model from the same model family, it performs better.
我觉得你说得对,如果你使用一个非常强大的模型,它可能也太不一样了。
I think it's to your point that if you have a very, very strong model, it might be also too different.
或者如果风格差异太大,模型难以适应,与预训练期间的问答答案相差太远,
Or if the style is too different, then it's too much of a leap for your model to adapt; it's too different from the Q&A answers during the pre-training.
是的。
Yeah.
所以没错。
So Yeah.
你可以迈出更大的跨越。
You can take a bigger leap.
还有一点,你提到了OLMo。我有段时间没读那篇论文了,你可能比我更清楚,但我记得你们也用logits训练过。
And another thing I wanted to say: you mentioned OLMo, and it's been a while since I read the paper, you might know way better than I do, but I think you did also train on the logits.
我们并没有做技术意义上的蒸馏。
We didn't do technical distillation.
我们只是提取了这些token。
We just took the tokens.
哦,我明白了。
Oh, I see.
我明白了。
I see.
我明白了。
I see.
好的。
Okay.
那可能是另一篇论文。
Then it was probably a different paper.
我认为谷歌为Gemma模型做了这件事。
I think Google does that for the Gemma models.
是的。
Yeah.
他们确实这么做。
They do.
因为这里还有一个区别,因为你提到了Qwen和其他模型。
Because here, there's also the distinction, because you mentioned Qwen and other models.
你只能对开源权重模型这样做,因为如果你对Claude或OpenAI这么做,由于它们不提供logits,所以行不通。
You can only do that for open weight models, because if you did that with Claude or OpenAI, it would not work; they don't provide the logits.
它们只对部分token,比如前100个或1000个top token提供logits。
They only provide them for some tokens, like the top 100 or 1,000 tokens.
因此,从某种意义上说,如果你想进行真正意义上的‘蒸馏’,那么使用开源权重模型反而更容易,因为你可以完全掌控这个过程。
And so, in a sense, if you want to do the real, quote-unquote, distillation, it's kind of even easier to do that from open weight models, because you can control it.
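The top-k limitation mentioned above can be illustrated with a small sketch: whatever the provider's actual cutoff, the student only ever sees the k most likely tokens, renormalized, and the probability mass on the rest of the vocabulary is simply lost. The toy distribution below is made up.

```python
def topk_renormalized(probs, k):
    # Keep only the k most likely tokens and renormalize their probabilities,
    # which is all a closed API exposes; the tail of the vocabulary is dropped.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

full = {"the": 0.5, "a": 0.3, "an": 0.15, "that": 0.05}
seen = topk_renormalized(full, k=2)

assert set(seen) == {"the", "a"}           # 20% of the mass is invisible
assert abs(sum(seen.values()) - 1.0) < 1e-9
assert abs(seen["the"] - 0.625) < 1e-9     # 0.5 / (0.5 + 0.3)
```

So logit-level distillation against a closed API can at best match a truncated slice of the teacher's distribution, which is part of why output-text distillation became the default there.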
但另一方面,就像你说的,"我们需要尽快拿到一千亿个token"。
But then also, like you said, well, we need 100 billion tokens ASAP.
这并不容易,因为这些大模型生成回答时大约每秒只产生40个token,要积累数百万、数十亿个token需要时间,对吧?
That is not an easy thing to do, because it's like 40 tokens per second or something for these large models when you generate answers, and getting to millions or billions of tokens takes time, right?
所以,从一个中等规模的模型开始进行蒸馏几乎是更容易的。
So it's almost like easier to start distilling from a medium model.
所以这其实是一个更多数据还是更高质量数据的问题。
So it's like the question more data versus more high quality data.
对吧?
Right?
这本身也是一个在消融研究中的最佳实验点。
So it's also like a sweet spot to like an experiment itself in an ablation study.
对吧?
Right?
是的。
Yeah.
我喜欢纳森称之为技术蒸馏,因为尽管这是第一个,但它已经不再是默认方式了。
I like that Nathan had to call it technical distillation, because it is no longer the default even though it was the first.
是的。
Yeah.
另外,我来分享一个有趣的事实。
Also, I'll I'll I'll note a fun fact.
我最近采访了杰夫·迪恩,试图从他那里套出点信息。
So I did my Jeff Dean interview recently, and I tried to get out of him.
他有点回避这个问题,你知道吗,其实当时有三个版本的Gemini模型?
Like, he he, like, sort of dodged it a little bit that you know, remember, like, there were actually three sizes of Gemini models?
分别是Nano、Pro和Ultra。
There was Nano, Pro, and Ultra.
我就问,Ultra去哪儿了?
And I was like, where is Ultra?
他们把它藏在地下室,然后偷偷从它那里拿走了东西。
They keep it in the basement, and they just stole from it.
对吧?
Right?
就像,那是
Like, that's the
有意思。
Interesting.
是的。
Yeah.
是的。
Yeah.
而且也许也是为了保护自己,让别人无法对它进行——
And maybe also, is it to safeguard yourself so no one can do a—
是的。
Yeah.
对它进行审计或定价推算,但可能两者都有。
Audit it or price it out, but probably both.
是的。
Yeah.
我的意思是,我认为这正是我一直理解的方式:你部署的模型从来都不是你训练的那个模型,因为你训练的是稠密模型,然后部署的是MoE模型。
I mean, this is how I always think of it: the model you deploy is never the model you train, because you train the dense model and then you deploy the MoE.
对吧?
Right?
你基本上总是这么做。
Like, you basically always do it.
在每个实验室都是这样。
Like, at every lab.
说详细点。
Say more.
他们真的是从稠密模型中蒸馏出来的吗?
Do you think they're really distilling from dense models?
我的意思是,我认为这正是在资源无限、不关心推理成本、只追求最大化智能时的完整做法。
I mean, I think that's the full play when you have unlimited resources, don't care about inference cost, and just care about maxing intelligence.
为什么不呢?
Why not?
是的。
Yeah.
我不完全确定。
I'm I'm not a 100% sure.
我觉得MoE只是带来了浮点运算上的节省。
I think that MoEs just give you flop savings.
我不确定这是否真的是我对优秀MOE架构所带来的优势的理解方式。
Like, I don't know if that's actually how I think of the gains of MoE when you have a really good MoE architecture.
但我确实认为它们是从更大的模型中进行蒸馏的。
But I do think that they have bigger models that they distill from.
它们训练的内部模型与外部模型不同,因为外部模型变得越来越小,这有点奇怪。
And they train internal models differently than external ones, because the external models have been getting a lot smaller, which is the kind of weird thing.
我们还没有一个好的方法来衡量它。
We don't have a good way to measure it.
也许Dylan会通过InferenceMAX之类的东西反推出来。
Maybe Dylan will backwards-figure it out in InferenceMAX, or whatever the heck.
他们会处理这个新的模型侧边栏。
They'll deal with this new model sidebar.
但我对这些事情也总是持怀疑态度。
But I'm always suspicious with these things also.
这其实也是一个容量问题。
It's really like a capacity thing too.
有多少人同时使用这个模型?
How many people use the model at the same time?
硬件分配了多少?
Hardware, how much is allocated?
而且这总是像一种经验法则,但确实很棘手。
And it's always it's like a, yeah, maybe a rule of thumb, but yeah, it's really tricky.
我觉得从这些数字中很难得出任何结论。
I think it's really hard to say anything from these numbers.
我认为他们可能会开始限制模型。
I do think that they might start restricting models.
它们只会出现在产品中,而不会出现在API中。
They'll only be in products and not being in API.
我觉得整个API业务竞争极其激烈,我对它的可防御性没什么把握。
I think the whole API business is brutally competitive, and I don't have a good sense for what the defensibility of it is.
我觉得像谷歌、Azure或任何现有的云业务拥有API是合理的,这更像是一种自然的过渡。
I think, like, it makes sense for something like Google and Azure and or any existing cloud businesses to have APIs, and that's kind of a more natural transition.
但像Anthropic和OpenAI的API,从它们的产品——也就是它们最大的差异化所在,无论是ChatGPT还是Claude Code和Codex——过渡过来并不容易,人们并不会因为这些产品就去使用API。
But the Anthropic and OpenAI APIs — the transition from their products, which are their big differentiation, whether it's ChatGPT or Claude Code and Codex — you don't get people to go use the API from that.
我认为很多已经花费在云服务上的人会转而使用API,这就是为什么Lambda和Nebius会推出这些API产品的原因。
And I think you get a lot of people that are already spending on clouds that then go to use the APIs, which is why, like, Lambda and Nebius are gonna have these API products.
但如果Claude真的担心蒸馏,那他们应该尽快只在Claude Code里发布模型,干脆别管API了。
But if Claude's really worried about distillation, they should put the model release in Claude Code ASAP and just not bother with the API.
我不知道这什么时候会发生,但有可能。
I don't know when that'll happen, but it could.
但我认为,API的客户群体确实很大。
I do think though it's a big customer base, the API customer base.
任何基于LLM构建的产品,比如客服聊天机器人之类;但更广泛地说,我不太清楚Claude的订阅计划具体怎么运作,不过你会达到一个token上限,订阅只能用这么多。
Any type of product built on LLMs — customer-chatbot types of things. But also, more generally, I don't know exactly how the plans work in Claude, but you would reach a token max where you only get so much with your subscription.
你可以购买更多的令牌,但我认为在一定规模下,使用API会更简单?
You can, I think, buy more tokens, but I think it's just easier with the API at a certain scale?
还有整个opencode的客户群体,对吧?
And also the whole opencode customer base, right?
因为在opencode的场景下,他们已经不再允许使用订阅计划了。
Because they don't allow the subscription plan anymore in the opencode context.
所以你只能使用API。
So you have to use the API.
而且我认为,考虑到opencode生成的token数量,只要不在这些token上亏钱,这其实不是一门坏生意。
And I do think, given how many tokens opencode generates, it's actually not a bad business if you don't lose money on these tokens.
如果你以非补贴的价格出售,我认为API其实是个不错的商业模式。
If you sell it at a not subsidized price, I do think the API is actually not a bad business model.
是的
Yeah.
你想选一边吗?
Do wanna take a side?
你想来个抢七吗?
Do you wanna try a tiebreak?
我显然只是在猜测。
I'm obviously just speculating.
我是真不知道,但我能看出来。
Like, I don't really know, but I can see it.
我觉得Anthropic给我的感觉就像苹果。
Like, Anthropic gives Apple vibes to me.
我的意思是,Anthropic更有可能这么做。
I mean, Anthropic has a higher chance of doing this.
是的。
Yes.
OpenAI,只是因为我经常和他们的人交流,所以我真的不太相信他们会把模型严格封闭在产品里。
OpenAI — just because I've talked to the people there so much, I just don't super believe that they will lock models to products.
这更多是出于理想主义和某种原则,而不是经济利益的驱动。
Only only out of, I guess, idealism and sort of principles rather than economic incentive.
从经济利益的角度来看,他们会同意你的观点,认为应该把模型私有化并绑定到产品上。
Economic incentive would agree with you that they should keep models private to products.
而且,最近他们确实这么做了。
And, like, recently, they've done this.
对吧?
Right?
最近三个GPT-5版本都有Codex变体,比公开发布提前两到四周,只在Codex内部提供,而不是作为API。
The last three GPT-5 releases all had Codex variants that were two to four weeks ahead, available only inside Codex rather than as an API.
所以他们正在逐步接近这一点。
So they're starting to get there.
但从本质上讲,我认为运营这些公司的人并不认同把模型锁起来不开放API,因为他们的市场本来就够大了。
But just constitutionally, I don't think the people that run these things believe in locking models away from the API, because they have such a huge market anyway.
所以他们不太在意;而且,如果你真是个狂热信徒——不以最大化公司价值为目标,而是真心想把AGI带到每个角落——那你就会开放API,因为你根本不知道人们会用它做出什么。
So they kind of don't care. And also, if you're genuinely sort of a zealot — if you're not trying to maximize the value of your company and genuinely just trying to spread AGI everywhere — then you release the API, because you just don't know what people are gonna build with it.
关于Codex这件事,还有一点。
One more thing though with the codex thing.
我想我们得等到下一次才能确定,因为这次可能也存在一些偏向,他们几乎是在推出自己当前想推广的应用程序的同时发布了Codex。
We will have to see next time, I think, because this time it might also be a bit biased toward releasing it in Codex, since they released it almost simultaneously with the app they wanna promote at the moment.
所以这可能是他们为了让更多人去使用这个应用程序而这么做的。
So it could have been more like they did that so that anyone checks out the app.
但我们下次再看吧。
But we'll see next time.
总是有两到四周的独家窗口期。
Always a two two to four week exclusive window.
是的
Yeah.
是的
Yeah.
你知道,那是他们的权利。
And, you know, that's their right.
是的
Yeah.
当然。
Sure.
推广Codex。
Promote codex.
这相当有效。
That's that's pretty effective.
对。
Yep.
聊天里有很多关于其他事情的问题。
We have a bunch of questions in the chat for, like, other things.
我们是先讨论基准测试,然后这个吗?
Do we wanna cover benchmarks and then this thing?
你尽管说吧。
Go right ahead.
你打算怎么做?
What do you wanna do?
这是你的子栈。
It's your Substack.
我不知道。
I don't know.
哦,不会吧。
Oh, no.
天哪。
Aw, man.
这是一个集体项目。
It's a collective.
就是深入你感兴趣的内容。
Just dive into what you're interested in.
直接开始吧。
Just go
我的意思是,塞巴斯蒂安对SWE-bench那些东西感兴趣。
I mean, Sebastian was interested in the SWE-bench stuff.
所以,你知道的,就在上周,SWE-bench Verified算是挂了,或者说正式——
So, like, you know, this past week, SWE-bench Verified died, or, like, officially—
你这话是什么意思?
What do you mean by this?
是的。
Yeah.
我们先定义一下SWE-bench吧。
Let's define SWE-bench first, maybe.
好的。
Okay.
你知道,我正好有这篇相关的文章。
You know, I I happen to have the post on this.
所以让我把它调出来。
So let me just pull it up.
所以更广泛的话题是,我们该如何比较当前哪个大语言模型是最好的?
So the broader topic — the umbrella topic here — is: how do we compare which LLM is currently the best?
比如,其中一种方式就是SWE-bench。
Like, one of the ways would be SWE-bench, basically.
不过,也许让你来解释一下比较好,因为你做过这个精彩的播客或文章。
But then, yeah, I will maybe let you explain because you had this brilliant podcast or article.
我的意思是,没问题。
I mean, okay.
你希望我从哪里开始?
Where do you want me to start?
要不我们就先定义一下SWE-bench?
Should we just define SWE-bench, I guess?
我觉得可以。
I guess, yeah.
所以基本上,这是一个编程基准测试,而SWE-bench是比较大语言模型能力的一种流行方式。
So basically, it's a coding benchmark, and SWE-bench is a popular way to compare the capabilities of LLMs.
然后还有SWE-bench Verified。
And then there is SWE-bench Verified.
但也许吧。
But maybe yeah.
我们先多聊聊SWE-bench吧。
We should talk a bit more about SWE-bench first.
SWE-bench是普林斯顿大学一个团队发表的论文,他们做了很多出色的代码基准测试工作。
So SWE-bench is a paper out of a group at Princeton, and they do a lot of good code-benchmarking work.
他们恰好收集了成千上万个开源项目中的问题和拉取请求示例,这些PR都成功关闭了相应的问题。
And they basically just drew thousands of example open-source issues, along with the PRs that closed those issues.
这里存在一些选择偏差,因为他们只关注热门开源项目,而且仅限于少数几个热门开源项目,但这些项目中的问题数量却很多。
There's a bit of selection bias here, because they only focus on a small number of popular open-source projects, but a large number of issues from those projects.
然后他们随便收集了一些通过的测试和一些失败的测试,你必须让这些测试通过才能达到评分标准。
And then they just kinda dredged up some passing tests and some failing tests that you need to make pass in order to score.
刚发布时,它还比较冷门。
When it when it launched, it was kind of obscure.
Devin 实际上是第一个选择它作为评估基准并公布结果的人,之后分数从发布时的约13%迅速上升到如今的80%左右。
Devin was actually the first one to choose it as a benchmark to report, and then it went from — I think at launch it was, like, 13%, and now everyone's at 80%, something like that.
SWE-bench因为是在学生预算下完成的,所以相当——姑且称之为粗糙吧。
SWE-bench, because it was done on a student budget, was kind of — let's call it sloppy or whatever.
Terminal-Bench现在也是这样。
Terminal-Bench is like this now too.
它们只是简单地汇总数据。
Like, they're they're just aggregate.
要在不同时间、不同领域之间建立一个精确校准的基准,真的很难。
It's hard to do a benchmark that is well calibrated across topics at different times.
是的
Yeah.
对
Yeah.
确实很难。
It it is hard.
很难。
It is hard.
所以,对于正在关注的少数人来说,我其实正在和Cognition合作,推出一个新的基准测试。
So, you know, for the small group that is watching: I'm actually working with Cognition to launch a new benchmark here.
但没错。
But yeah.
于是OpenAI说:"好了,各位。"
So OpenAI was like, okay, guys.
我们看到SWE-bench正在兴起。
We're like, SWE-bench is taking off.
我们会采用它,但我们拒绝照搬完整的SWE-bench。
We are gonna adopt this, but we refuse to abide by the full SWE-bench.
我们会从原始SWE-bench中实际筛选出一个500个任务的子集——而且他们真的为此雇了人来逐一审核。
We're just gonna actually go and curate a 500-task subset of the original SWE-bench — and they actually hired humans to go and vet it.
我觉得这个信息就在博客文章里,基本上他们为每个任务都雇佣了三名人类,专门审核任务是否高质量,因为里面有很多低质内容。
I think it's somewhere inside of this blog post, but basically they hired three humans for every task to vet whether the task was high quality or not, because there's a lot of slop in there.
总之,好吧。
Anyway, like, okay.
这500个就是我们打算认可的子集。
This is the 500 that we're gonna endorse.
所以这是一个经过筛选的SWE-bench子集:500个据称定义清晰的难题。
So it's a curated subset of SWE-bench: 500, let's say, challenging problems that are supposedly well defined.
是的。
Yeah.
对。
Yeah.
有趣的是,发布时——这是在2024年发布的。
And what's really funny is that at launch — so this was launched in 2024.
发布时,OpenAI 甚至无法运行他们自己的这500个任务。
At launch, OpenAI could not run all of its own 500.
所以有一段时间,OpenAI 发布了几个版本,只报告了这个子集中的部分任务,因为他们无法在评估基础设施上运行全部内容。
So for a while, like, there was, like, a few releases from OpenAI that reported on a subset of the subset because they couldn't run it on their eval infrastructure.
所以他们的数字更高,因为分母更小,这非常有趣。
So their numbers were higher because their denominator was lower, which is very funny.
不管怎样。
Anyway.
也许在这个背景下,我们应该说说一个SWE-bench任务到底长什么样。
Maybe in that context, we should say what a SWE-bench task kind of looks like.
我觉得它基本上就是一个包含bug的代码库,而LLM的任务通常就是修复代码中的bug。
I think it's basically a codebase that has a bug in it, and usually the task for the LLM is to fix the bug in the code.
对吧?
Right?
就在这里。
It's it's right here.
整个东西都是开放的,这在将来会成为一个问题。
The whole thing's open, which becomes a problem in the future.
对,但现在你能看到全部内容。
Right. But now, you can see the whole thing.
对吧?
Right?
你可以看到各个仓库、问题ID和问题描述,还有你需要让其通过和失败的测试。
You can see the repos, the issue IDs and the problem statements, and then you also have the tests that you're supposed to pass and fail.
所以所有内容都在Hugging Face上。
So it's all here on Hugging Face.
你可以看到它达到了500。
And you can see that it's at 500.
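为了大致说明这些记录长什么样,这里有一个草图:字段名参照Hugging Face上的SWE-bench Verified数据集,但下面这条记录本身是虚构的,仅作示意。
To illustrate roughly what these records look like, here's a sketch — the field names follow the SWE-bench Verified dataset on Hugging Face, but the record below is made up, purely for illustration:

```python
# A minimal sketch of one SWE-bench-style record and how a harness would turn
# it into an agent task. The field names follow the Hugging Face dataset
# (princeton-nlp/SWE-bench_Verified); the record itself is a made-up example.
example = {
    "instance_id": "example__repo-12345",  # hypothetical ID
    "repo": "example/repo",
    "problem_statement": "TypeError when calling foo() with a string argument.",
    "FAIL_TO_PASS": ["tests/test_foo.py::test_foo_accepts_str"],  # must go red -> green
    "PASS_TO_PASS": ["tests/test_foo.py::test_foo_basic"],        # must stay green
}

def build_task_prompt(record):
    """Assemble the agent's task: fix the issue so the failing tests pass."""
    return (
        f"Repository: {record['repo']}\n"
        f"Issue:\n{record['problem_statement']}\n"
        "Produce a patch so the repository's test suite passes."
    )

prompt = build_task_prompt(example)
print(prompt)
```

The agentic part is that only the issue and the pass/fail criteria are given; how the model navigates the repo and produces the patch is left entirely up to it.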
总之,我觉得我们没必要在这些细节上纠缠太多。
Anyway, I think we don't wanna get too lost in the details.
我只是想说,简单定义一下背景,这本质上是一个编程基准测试,包含500个可以在互联网上找到的示例。
I just wanted to define the context: this is essentially a coding benchmark — 500 examples that are available on the Internet.
是的。
Yeah.
好的。
Okay.
如果你想了解更多历史背景,这可以说是比HumanEval更进一步,而HumanEval主要侧重于代码补全。
And then, if you want a bit more historical context, this is a step up from HumanEval, which is more about completions.
对吧?
Right?
在我看来,这是第一个真正的智能体基准测试,除了Taubench之外——它们会给你问题和最终结果,但并不明确说明你该如何达成目标。
This was, in my mind, the first proper agentic benchmark — I guess apart from Tau-Bench — where they give you the problem and the end result, and they don't really specify how you're supposed to get there.
而相比之下,像MMLU这类以往的基准测试,以及OpenAI发布的同样属于编程领域的HumanEval,都只是直接给出问题陈述,然后要求你立即给出正确答案,不需要处理任何额外的文件或运行步骤。
Whereas a lot of previous benchmarks — the MMLUs of the world, and HumanEval, which is in the coding domain and also released by OpenAI — were very much: here's the problem statement, now give me the right answer immediately, without extra files or anything you're supposed to run.
所以,其他的那些更像自动补全。
So it it like, the other ones are are more autocomplete.
这个更偏向于智能体行为。
This one is more agentic.
显然,这是一个连续的谱系,因为你完全可以用智能体来解决自动补全问题,但那并不是HumanEval要测试的内容。
That's all a spectrum, obviously, because you can use agents to solve autocomplete, but that's not what HumanEval was testing.
不管怎样,我只是想确保大家明白,OpenAI实际上在打造SWE-bench Verified上投入了大量资金和精力。
Anyway, I just wanted to make sure people understand that OpenAI actually invested a lot of money and effort into making SWE-bench Verified.
所以这是个问题。
So it's a question.
你觉得这要花多少钱?
How much money do you think this costs?
天哪。
Oh my god.
别这样为难我。
Don't don't do this to me.
数百万?
Millions?
我猜大概是几百万的量级。
I would guess on the order of a couple — like, it could be a few million.
我觉得,是的,大概几百万。
I'd say, yeah, I'd say a couple million.
我是说,你知道的?
I I you know?
所以,基本上,你就是说,好吧。
So, basically, you do, like, okay.
第一个筛选步骤是什么?
What's the first filter pass?
然后,好吧。
And then, like, okay.
是500乘以3,因为每个任务有三个人。
It's 500 times three, because they had three people per task.
是的。
Yeah.
然后可能还需要再加几轮验证之类的。
And then maybe a couple more verification passes or whatever.
对吧?
Right?
所以,是的。
So, like yeah.
所以今年他们就说:哦,它不仅已经饱和了——所谓的进步,就是每次发布新模型时大家轮流把分数提高0.1。
So then this year they were like: oh, well, not only is it saturated — progress is just everyone taking turns to increment by 0.1 every time they release a new model.
这简直是胡说八道。
It's like, it's bullshit.
这明显是胡扯。
It's obviously bullshit.
就像,运行这些模型时固有的噪声每次都会波动零点五到一左右。
The inherent noise in just running these models varies by, like, 0.5 to 1 point every time you run it.
当你尝试时,你只是选择最高的那个。
Like, you just choose the highest when you try to—
有点吹毛求疵。
A little nitpick.
我不认为它能到0.1%,因为之前我们说过,这是基于500个样本。
I don't think it can be 0.1%, because — as we said before — it's 500 examples.
我认为最小的增幅是0.2%,如果我...
I think the smallest increment is 0.2%, if I—
好的。
Okay.
他们可能会取平均值
They might average
不过这只是个小细节。
But, like, a little detail.
是的。
Yeah.
抱歉。
Sorry.
我想,随着我们进入下一代基准测试阶段,这里的 n 是 500。
I think, as we progress to the next era of benchmarking — so the n here is 500.
对吧?
Right?
n 并不直接对应百分点,因为你还会得到小数点后的分数。
The n doesn't directly correlate to the percentage points because you get sub points as well.
对。
Yeah.
正如你所说。
To your point.
说得好。
Good point.
所以,像终端基准测试,尽管它有九十几个任务,但你仍然可能得到低于1%的细分分数。
So, like, Terminal-Bench, even though it has 90-something tasks, you can get subdivisions of less than 1%.
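顺带做个算术核对:有n个任务时,单次运行的最小步长是100/n个百分点,而对多次运行取平均可以得到更细的增量(下面的运行分数是虚构的):
As a quick arithmetic check: with n tasks, a single run moves in steps of 100/n percentage points, while averaging several runs yields finer increments (the run scores below are made up):

```python
# Score granularity: one task flipping pass/fail moves the score by 100/n
# percentage points in a single run, but averaging runs gives sub-step values.
n = 500
single_run_step = 100 / n            # 0.2 percentage points per task
runs = [80.0, 80.2, 80.1]            # three hypothetical runs of the same model
avg = sum(runs) / len(runs)
print(single_run_step, round(avg, 2))
```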
总之,他们不仅做了这些,还对自己的工作进行了审计。
Anyway, not only do they have this, they actually audited their own work.
比如,他们想,好吧。
Like, they were like, okay.
为什么大家都卡在80%了?
Like, how come everyone is saturating at 80%?
那剩下的20%到底出了什么问题?
Like, what what's what's up with the remaining 20%?
为什么大家都在这上面失败了?
How come everyone's, like, failing at it?
他们发现,实际上我们雇了更多人,现在每个任务都安排了六个人,如果发现任何正面识别结果,还会额外增加一个团队。
And they were like: oh, actually, we paid even more people — six people per task now, with an extra team if any sort of positive identification is found.
我们发现,其中59%根本无法解决,因为原始基准测试本身就有问题。
And we were like — 59% of them cannot even be solved at all, because the original benchmark was still sloppy.
有些根本无法解答的内容竟然通过了审核。
Like, stuff stuff got through that was not solvable.
我其实试着在我的帖子中说明这一点。
And I actually tried to illustrate this in my post.
所以,这是一个不可能完成的测试。
So here, this this is an impossible test.
对吧?
Right?
好的。
Okay.
举个例子。
So here's an example.
这是在原始帖子基础上额外增加的价值。
This is this is the sort of value added on top of the the original post.
这是一个经过SuiteBench验证且通过了第一轮人工验证的任务示例。
Here here's an example of a SuiteBench verified task that passed the first round of human verification.
对吧?
Right?
所以这个任务是,比如我们想实现Python类型注解之类的东西。
So here's the task: we wanna implement Python type annotations or something.
我们希望看到预期的行为,我想在输出中看到一个字符串。
We wanna see expected behavior, I wanna see a string in the output.
对吧?
Right?
所以如果你收到这个任务,你根本不可能通过,因为测试要求我寻找一个叫get annotation的东西。
So if you were given this, you would never pass, because the test says: I am looking for something called get annotation.
如果你不提供这个神秘的字符串get annotation,你就会失败。
And if you don't give me this magic string, get annotation, you will fail this test.
为什么?
Why?
是的。
So, yeah.
这就像在玩一场编程面试。
It's like playing a coding interview.
对。
Yep.
对。
Yep.
就是对啊。
It's like yep.
对吧?
Right?
所以这只是一个侥幸通过验证的糟糕测试。
So this is just a bad test that somehow escaped validation.
所以你唯一能"解决"它的方式,就是把答案记住了。
So the only way you could kind of solve it is if you're memorizing the answer.
是的
Yeah.
没错
Exactly.
没错
Exactly.
这其实挺好的——我觉得每个基准测试都应该包含这样的内容。——就像一个蜜罐?
Which is actually nice — I think every benchmark should include stuff like this. — Like a honeypot?
如果你解决了这个问题,你会想:天啊。
If you solve this, you're like, oh, shit.
这是一个预警信号。
That is a canary.
对吧?
Right?
就像是,哦,我的意思是,你明显在作弊。
Like, it's like, oh, like, I mean, you're definitely cheating.
就像一个合理性检查。
Like a sanity check.
是的。
Yeah.
是的。
Yeah.
是的。
Yeah.
是的。
Yeah.
无论如何,这是个很好的观点。
Anyway a really nice point.
是的。
Yeah.
是的。
Yeah.
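这个"蜜罐/金丝雀"思路可以用一个小草图表示(任务ID纯属虚构):
The honeypot/canary idea can be sketched like this (the task IDs below are made up):

```python
# Seed the benchmark with tasks known to be unsolvable (e.g. the hidden test
# demands a magic string never mentioned in the issue). Any model that
# "solves" one almost certainly memorized the answer.
CANARY_IDS = {"impossible-001", "impossible-002"}  # hypothetical canary task IDs

def contamination_flags(results):
    """results: dict of task_id -> bool (did the model pass?).
    Returns the canary tasks the model passed, i.e. contamination evidence."""
    return sorted(tid for tid, passed in results.items()
                  if passed and tid in CANARY_IDS)

results = {
    "real-101": True,
    "real-102": False,
    "impossible-001": True,   # red flag: an unsolvable task was "solved"
    "impossible-002": False,
}
print(contamination_flags(results))  # -> ['impossible-001']
```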
所以呢,我只是觉得,这其实是一个很精彩的观点,说明要做出可靠的评估有多难,而且经历了这么多轮。
So to me, it's a beautiful illustration of how hard it is to make evals — that there were these multiple rounds.
最初有SWE-bench,那是普林斯顿的学生们做的第一轮。
There was the original SWE-bench, which the Princeton kids did as an initial first pass.
然后是OpenAI做SWE-bench Verified的第二轮。
Then there was a second pass of OpenAI doing SWE-bench Verified.
在接下来的一年半里,所有跑过SWE-bench Verified的人都没有指出这个问题,直到OpenAI说:嘿。
And then every single person that ran SWE-bench Verified for the next one and a half years did not call this out, until OpenAI was like, hey.
让我们一起来看看这些数据。
Let's let's, like, look at the data.
所以我觉得这真的很有意思。
So I think it's, like, really interesting.
在他们研究这个的时候,他们还关注了另一个方面,也就是思维链。
While they were looking at this, there was a second thing: they looked at the chain of thought.
在思维链中,他们发现GPT-5自己的思维链开始包含来自未来的信息。
And inside the chain of thought, they found GPT-five's own chain of thought to start including information from the future.
对吧?
Right?
因为这个问题是开源的,而且模型是在GitHub上的信息上训练的,所以它会利用未来版本的Django知识来解决问题。
Because the problem is open source and the model was trained on information from GitHub, it would use advanced knowledge of future versions of Django to solve the problem.
比如,它们知道自己已经——
Like, they knew they've—
在现实世界中见过类似的情况,模型会幻觉出API的新版本,即使你的脚本还没更新到那个版本。
seen stuff like this in the real world where the models will hallucinate the new version of the API even if your script isn't on it.
我觉得Hugging Face的很多东西在这方面最糟糕,模型完全乱七八糟。
Like, I think a lot of the Hugging Face stuff is the worst with this, where the models are just totally garbled.
它们见过所有版本,而API随着时间变化太大,以至于
Like, they've seen all the versions, and the API has changed too much over time where
它们他妈的
they fucking
随便说点什么。
throw something out there.
是的。
Yeah.
所以,越多之前的……是的。
So so the more previous Yeah.
我的意思是,我觉得,你知道,这种情况很多。
I mean, I think, you know, there's a lot of this.
对吧?
Right?
比如,这种道德行为。
Like, the the sort of ethical behavior.
好吧。
Like, okay.
所以你可以怪一些事情,比如:你不该把完整数据集公开发布,因为很明显,人们可以基于完整数据集进行训练。
So you can blame things like: oh, you should not have released the full dataset in public, because obviously people can train on a full dataset.
但其实研究人员并不是有意要这么做。
But, like, it's not like the researchers are trying to do this.
因为这些内容也是开源的。
Like, because these things are also open source.
任何与 GitHub 有关的数据集,任何与 GitHub 有关的训练语料,最终都会不可避免地吸收这些内容。
Like, any dataset that touches GitHub, any training corpus that touches GitHub, is gonna just eventually absorb this.
然后
And then
是的。
Yeah.
是的。
Yeah.
而且这甚至不是这个网站或仓库本身的问题。
And it's not even this website or the repository directly.
而是这个仓库的克隆,或者其他人开发自己的开源库时,在单元测试之类的地方无意中包含了这些内容,并非有意或恶意。
It's a clone of this repository, or someone else who develops their own open-source library and has that in their unit tests or something — it's not even intentional or malicious or anything.
这就像无意中,你已经吸收了这些内容。
It's like by accident, you already absorbed that.
是的。
Yeah.
对。
Yeah.
或者是一个新功能,只发布了这个编辑功能。
Or or or a new feature that releases this edit only feature.
它被写进博客文章或会议演讲中,然后就自然而然地被纳入了。
It gets written up in a blog post or a conference talk or something, and then it it just makes it in.
对吧?
Right?
这其实挺搞笑的。
Like, it's it's really funny.
好吧。
Okay.
所以对我来说,OpenAI 本来可以到此为止,说:好吧。
So to me, OpenAI could have stopped there and said, okay.
我们完成了。
We're done.
但他们又多做了一件事,还挺有意思的。
They did one more extra thing, which is kinda funny.
他们还运行了 Flash、Gemini 和 Opus。
They also then ran Flash, Gemini, and Opus.
而这一项就更加过分了。
And this one was even more egregious.
好吧。
Okay.
他们直接给出任务ID,然后说:把这个SWE-bench任务复述给我。
They just gave the task ID and said: repeat the SWE-bench task to me.
所以通过任务 ID,他们就能直接吐出完整的陈述和解决方案。
And so, from the task ID, they can just vomit out the whole statement and the solution.
这些太疯狂了。
These are crazy.
当你深入查看这些模型内部时,里面的东西真的非常惊人。
The stuff that's in these models when you zoom in deep is really, really incredible.
因为这些模型确实做得非常好,但所有被放进配方中的组件都极其复杂。
Because these are models that are really, really well done, but there's just so much complexity in all the pieces of the pudding that get put in the recipe.
是的。
Yes.
有太多奇怪的格式了。
There's just so many weird formats.
我仍然觉得这非常有趣,我的意思是,训练时故意让模型记住内容确实是设计使然,因为这本质上就是下一个词预测。
I also still find it fascinating that — I mean, of course it's kind of by design that you memorize things when you're training, because that's literally next-token prediction.
但考虑到模型规模之大、所见数据之多,而且通常只见过每份数据一次,它居然还有足够的容量来记住这些内容。
But given how big the model is, how much data it sees, and that it usually sees the data only once, it's striking that it still has enough capacity to memorize.
通常我会认为,必须经过多个训练周期才能记住,但事实是,也许只要在训练语料中包含一次或两次,模型就能完美地复述出其中的内容,这真的很令人着迷。
It's kind of like, so usually I would think, okay, I would have to train multiple epochs to be able to memorize, but no, it is enough maybe to include it once or twice in the training corpus and it can do a perfect rendition or perfect recap of what it is in there, which is kind of fascinating.
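一个简单的逐字记忆探测可以这样勾勒:给模型一段已知文本的前半部分,看它的续写与真实后半部分逐词重合多少(这里用虚构的fake_model代替真实的LLM调用,仅作示意)。
A simple verbatim-memorization probe can be sketched like this: feed the model the first half of a known document and measure how much of its continuation matches the true second half word-for-word (a made-up fake_model stands in for a real LLM call, purely for illustration):

```python
def verbatim_prefix_len(generated, reference):
    """Length of the longest shared word prefix between generation and ground truth."""
    n = 0
    for g, r in zip(generated.split(), reference.split()):
        if g != r:
            break
        n += 1
    return n

doc = "the quick brown fox jumps over the lazy dog and keeps on running"
words = doc.split()
prompt, truth = " ".join(words[:6]), " ".join(words[6:])

def fake_model(prompt):
    # A memorizing model would reproduce the continuation exactly;
    # this stand-in diverges near the end.
    return "the lazy dog and keeps on walking home"

overlap = verbatim_prefix_len(fake_model(prompt), truth)
print(overlap)  # 6 of the 7 reference words are reproduced verbatim
```

A long verbatim overlap on benchmark text, as with the task-ID regurgitation above, is exactly the kind of contamination signal this measures.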
即使人们并不想要这样,这也太疯狂了。
Even if people don't want that, it's crazy.
是的,实验室在这方面已经很擅长了。
Yeah, labs got good at this.
在训练的每个阶段,实际上都需要一定的数据重复量,而这一点很难衡量。
There's essentially, like, a duplication level that you need at each stage of training, and it's not easy to measure.
所以,如果你在预训练阶段做得太多,模型就会忘记一些基本事实,而在后训练阶段,它可能更接近这些能力。
So, like, if you do do too much at pretraining, your model forgets basic facts, and at post training, it's probably closer to these abilities.
我认为这一点在评估中并没有得到很好的体现——比如针对知识记忆的评估。
And I think that is a thing that is not well reflected in evals — like, you could see it in evals of your knowledge retention.
是的。
Yeah.
我的意思是,这大概已经是他们掌握得相当娴熟的一种技艺了。
I mean, this is like an art that they have probably gotten good at.
是的。
Yeah.
比如,持续的预训练也需要重新回顾旧数据。
Like, continued pre training does also require some revisiting of old data.
否则,就像你所说的,会出现遗忘。
Otherwise, like you said, you have the forgetting.
但让我感到惊讶的是,通常用于持续预训练的数据比例非常小,一般只有1%或5%,却足以让模型几乎记住所有内容,这真的很令人着迷。
But it's still fascinating to me that with such a small fraction usually, because you usually use one or 5% for continued pre training, that it's enough to have the model memorize almost everything, which is fascinating.
是的。
Yeah.
我不确定。
I don't know.
尽管已经过了这么多年,这依然令人着迷。
It's just like a still after all these years, fascinating.
是的。
Yeah.
我每年会有两三次在我的内容里回到一个心头好话题:大语言模型的信息论。
So one of the pet topics that I pursue, like, two or three times a year on my stuff is the information theory of LLMs.
我仍然觉得这方面的研究非常不足。
And I still think it's super understudied.
为什么只看一遍就能记住呢?
Like, how come you can memorize from one pass?
是的。
Like Yeah.
没错。
Exactly.
对。
Right.
然后,人们还常常忘记叠加(superposition),也就是Anthropic最初的机制可解释性工作。
And then also, people forget superposition, which is Anthropic's original mech-interp work.
基本上,信息被塞进更小的单元里,然后就被遗忘了。
It also basically stuffs information into smaller bits that then get forgotten.
但叠加效应到底是怎么工作的?
But, like, how does superposition actually work?
关于这一点,我认为我没见过任何令人信服的研究。
I don't think I've seen a convincing study on that.
是的。
Yeah.
好的。
Okay.
总之,我不确定。
Anyway anyways, I don't know.
我这边关于SWE-bench的长篇大论讲完了,不知道你们有没有想法或问题之类的。
I'm done with my SWE-bench rant — I don't know if you have thoughts or questions or whatever.
但我确实觉得,这就是一个例子,说明模型无意中作弊了,而基准测试很难设计,我们需要新的基准。
But I do think this is an example of, yeah, models unintentionally cheating — and benchmarks are hard to make, and we need new ones.
而且你知道,如果这种事都发生在SWE-bench Verified上——我认为它是世界上被审查最严格的基准——
And, you know, if this happens to SWE-bench Verified, which I think is the most scrutinized benchmark in the world—
我最近的一篇帖子
My recent post,
我放了一个条形图,展示了大多数模型在SWE-bench Verified上的得分。
I had a bar plot where I showed the SWE-bench Verified numbers for most models.
正如你所说,它们的得分都在80%多一点。
And like you said, they were all 80 something percent.
几乎都在81到89之间,几乎没有任何差异。
Literally between 81 and 89, let's say, where there's almost zero variation.
即使是MiniMax M2.5,我认为它比GPT 5.2要差一些。
Even something like MiniMax M2.5, which I do think is worse than GPT 5.2.
没有冒犯的意思,它是个更小的模型。
No offense, it's a smaller model.
它是个更便宜的模型。
It's a cheaper model.
根据我在OpenRouter上的使用体验,它稍微差一点,但在这个特定基准上得分是一样的。
For my usage based on OpenRouter, it's a little bit worse, but on this particular benchmark, it's the same.
我的意思是,M2.5在SWE-bench上本应得更低的分数,而我认为其他模型本应得更高的分数。
What I'm saying is that M2.5 should get a lower score on SWE-bench, but I think other models should get a higher score.
但正如你所说,这些问题根本无法解决。
But like you said, the problems are just impossible to solve.
但有一点我们还没提到:我们说了SWE-bench Verified存在问题。
But one point I think we didn't bring up: we said that SWE-bench Verified has issues.
那我们该怎么办呢?
So what do we do about it?
我想现在有个SWE-bench Pro了。可以这么说:Verified试图修复普通的SWE-bench,而Pro试图修复Verified。
I think there is a SWE-bench Pro now, which is kind of like — I would say, Verified tried to fix the regular SWE-bench, and Pro tries to fix Verified.
但我还没去研究过这个。
But I haven't looked into this.
它是另一个子集,还是完全不同的问题集?
Is it like another subset or is it a completely different set of problems?
是的。
Yeah.
这是一个新的问题集。
It's a new set.
所以,要知道,SWE-bench的题目大致来自2022到2023年前后。
So, you know, SWE-bench draws from, like, a 2022-ish, 2023-ish era of problems.
所以在那里面,你主要做几件事。
So there are a few things you do.
对吧?
Right?
第一,你做私有和公开的分离。
One, you do private public splits.
对吧?
Right?
这非常明显。
That's super obvious.
第二,你更新你所依据的题目时间范围。
Two, you update the dates from which you draw.
第三,你丰富代码仓库和编程语言的多样性。
And then three, you diversify the repos and the languages.
对吧?
Right?
所以这些都只是非常、非常基础的修复,然后显然你要去改进测试。
So these are all just, like, very, very super basic fixes, and then, obviously, you try to fix the testing.
对原始SWE-bench的超基础修复——不需要天才也能想到,但他们做了这些苦活。
Super basic fixes to the original SWE-bench — it doesn't take a genius to figure out, but they did the hard work.
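上面提到的几个基础修复——按训练截止日期过滤、限制单一仓库的占比、留出私有集——可以用一个小草图表示(issue记录纯属虚构):
The basic fixes above — a training-cutoff date filter, capping any one repo's share, and holding out a private split — can be sketched like this (the issue records are made up):

```python
import random
from collections import defaultdict

def refresh_benchmark(issues, cutoff, per_repo_cap, private_frac, seed=0):
    """Filter to post-cutoff issues, cap each repo's share, hold out a private split."""
    fresh = [i for i in issues if i["closed"] > cutoff]          # post-cutoff only
    by_repo = defaultdict(list)
    for i in fresh:
        by_repo[i["repo"]].append(i)
    kept = [i for repo in sorted(by_repo) for i in by_repo[repo][:per_repo_cap]]
    rng = random.Random(seed)                                    # deterministic shuffle
    rng.shuffle(kept)
    k = int(len(kept) * private_frac)
    return kept[k:], kept[:k]        # (public split, private held-out split)

issues = [
    {"repo": "a/lib", "closed": "2024-07", "id": 1},
    {"repo": "a/lib", "closed": "2024-08", "id": 2},
    {"repo": "a/lib", "closed": "2024-09", "id": 3},   # dropped by the per-repo cap
    {"repo": "b/app", "closed": "2023-01", "id": 4},   # pre-cutoff: dropped
    {"repo": "b/app", "closed": "2024-10", "id": 5},
]
public, private = refresh_benchmark(issues, cutoff="2024-01",
                                    per_repo_cap=2, private_frac=0.4)
print(len(public), len(private))
```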
而且
And
但从某种意义上说,这也正是Verified想要做到的。
But in a sense, that's also what Verified was meant to do.
所以,人们也许又重新审视了这一点,但不能保证它以后不会被发现还有其他问题,对吧?
So — let's say people looked at this again, but there's no guarantee that it doesn't still have issues that might be discovered later on, right?
我的意思是,没错。
I mean, it's No.
所以,SWE-bench Verified是一个有意为之的子集,对吧?
So SWE-bench Verified was an intentional subset, right?
这些人说:不,不,不,我们需要一个超集。
These guys were like, no, no, no, we need to have a superset.
甚至不只是超集。
Not even a superset.
我们需要做一个——
We need to do a—
是的。
Yeah.
是的。
Yeah.
我之前想说的是,当SWE-bench Verified开发时,每个任务有三个人确保任务定义清晰、一切到位。
What I was trying to say is, when SWE-bench Verified was developed, there were three people per task making sure the task was well defined and everything.
但两年后发现,不,不,并非所有情况都如此。
But then two years later it turns out, no, no, this was not the case for everything.
我想说的是,SWE-bench Pro可能更好,但它也许仍有一些现在不明显的问题;也许一两年后我们重新审视、看到一些失败案例时会发现:哦,它还是有些问题。
What I'm trying to say is, it could be that SWE-bench Pro is better, but it might still have issues that aren't obvious right now; maybe in one to two years, when we revisit this and see some of the failure cases, we'll discover it still has some issues.
所以我的意思是,这并不是一个保证完美的集合。
So it's not a guaranteed perfect set is what I'm saying.
我不知道,但这只是一种怀疑。
I don't know, but it's just like a suspicion here.
完全对。
Totally.
完全对。
Totally.
你知道,我确实认为Scale AI有职业上的利益来确保这一点——
You know, I do think Scale AI has a professional interest in making sure this—
这将会是的。
is gonna Yeah.
不。
No.
不。
No.
就像,是的。
Like, yeah.
对。
Yeah.
但我想说的是,做SWE-bench Verified的团队同样有职业利益去确保它——
But what I was trying to say is, the SWE-bench Verified team also had a professional interest in making sure it's—
只是通过系统。
just by the system.
不同的激励。
Different incentive.
我想他们都具有非常
I guess they all have very
不同的激励。
different incentives.
这个
This one
预算有限。
has limited budget.
这个的预算几乎是无限的,因为拥有高质量数据对Scale AI来说事关存亡。
This one has basically unlimited budget, because it's literally existential to Scale AI that they have good data.
当然。
Sure.
但我还认为,OpenAI的评估团队持续推荐Opus,这真的很不错。
But I also think it's really nice that this team, the eval team at OpenAI, keeps endorsing Opus.
嗯哼。
Mhmm.
这有点好笑。
It's kinda funny.
所以是的。
So yeah.
对。
Yeah.
他们弃用了SWE-bench Verified,然后说:我们现在要报告SWE-bench Pro的成绩。
They deprecate SWE-bench Verified, and then they were like, we're gonna report SWE-bench Pro now.
而GPT-5现在排第一。
And GPT-5 is, like, you know, number one.
如果我想用私有数据集做评估,我该怎么做呢?
Maybe — do you know, if I wanted to evaluate on the private dataset, how would I do that?
我需要提供API密钥吗?还是得对Scale AI做某种API调用,或者——
Do I provide the API key, or is there an API call I have to make against Scale AI, or—
我不知道。
I don't know.
拥有我的 API 密钥并同意不泄露。
Have my API key and agree not to.
你必须签协议,因为如果没有协议,你就可以直接把数据留下来。
You have to agree because if you don't have an agreement, then you could just keep the data.
你必须走特殊流程,以确保不会窃取私有评估数据。
You have to jump through special hoops to make sure that you don't steal the private eval.
是的
Yeah,
我的问题是,他们甚至允许你下载数据吗?
my question was basically, do they even let you download the data?
还是说你只需要把答案发给他们,由他们在后端进行评估,这样你就根本无法下载数据?
Or is it more like you send the answer to them and they do the evaluation on their back end so that you don't even get to download the data?
否则,就像你所说的,你确实可以,是的,是的。
Otherwise, like you said, you could Yeah, yeah.
所以基本上,你只提供答案。
So basically, you only provide the answers.
所以你需要让你的大语言模型生成答案,然后提交这些答案。
So you have your LLM generate an answer and you submit the answers.
然后他们有一套流程在自己的系统上进行评估,这样他们的私有数据就不会离开他们的服务器,这是我的猜测。
And then they have some process to evaluate on their side so that their private data never leaves their servers, is my guess.
否则,有人可能会上传数据之类的,你知道的。
Because otherwise someone might upload it or something like, you know
是的。
Yeah.
我不知道。
I don't know.
我没试过,所以真的不清楚。
I haven't tried it, so I don't really know.
我肯定你可以联系他们去弄清楚。
I'm sure you can sort of reach out to them to figure it out.
是的。
Yeah.
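The answers-only flow guessed at above can be sketched as a small client-side helper. Every name here (the payload shape, field names, and the idea of uploading patches keyed by task id) is an assumption for illustration; Scale AI's actual submission format for the private SWE-bench Pro set is not public.

```python
import json

def build_submission(run_name: str, answers: dict[str, str]) -> bytes:
    """Serialize model-generated patches keyed by task id.

    Grading would happen server-side, so the private tests never leave
    the evaluator's machines; the client only ever uploads answers.
    """
    payload = {
        "run": run_name,
        "answers": [
            {"task_id": task_id, "patch": patch}
            for task_id, patch in sorted(answers.items())
        ],
    }
    return json.dumps(payload).encode("utf-8")

# Hypothetical usage: POST this body to the evaluator's endpoint and read
# back only aggregate scores, never the private test data itself.
body = build_submission("my-model-run-1", {"repo__issue-42": "diff --git ..."})
```

The point of the sketch is the direction of data flow: answers go up, only scores come back down.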
总之,
Anyway,
我觉得
I think
这样挺好。
this is good.
除非大家还有其他想补充的评论。
Unless people have more comments that they wanna add.
我觉得这个只是编码而已。
I think this but this is only coding.
对吧?
Right?
每个其他领域都需要这个。
There's like every other domain needs this.
领域,就是Rust那个东西,对吧?
Domain the Rust thing right
现在。
now.
很昂贵。
Expensive.
我觉得Frontier评估甚至更昂贵,就像是
I think the Frontier evals are even more expensive, which is like the
是的。
Yep.
来自 Mercor 的 APEX 评估。
APEX eval from Mercor.
像评估成本会达到数百万美元。
Like, evals are going to cost millions.
在前沿领域,成本将是数千万甚至数亿美元,这是一种非常奇怪的动态。
They're gonna cost tens of millions and hundreds of millions of dollars at the Frontier, which is just a very strange dynamic.
而生态系统在前沿模型、研究和其他方面之间存在大量分叉,要跟上这种动态并向人们解释清楚,将需要大量工作。
Whereas, there's so much about the ecosystem is forking between frontier models and then research and other things, and trying to follow that dynamic and explain it to people is gonna take a lot of work.
但确实,编程我觉得非常有趣,因为这是如今大多数人使用LLM的方式,而且也更容易评估。
But yeah, coding is, I do think, really interesting because that's what most people use LLMs for these days, but also it is easier to evaluate.
一旦你离开编程和数学,事情就会变得有些模糊。
I think once you leave coding and math, it becomes a bit obscure.
你如何衡量答案的质量?
How do you measure the quality of the answer?
你回到偏好这个问题,我想,这更像是一种主观的东西,而编程则更客观。
You get back to, let's say preferences, I guess, which is more like a subjective thing where coding is more objective.
所以这样做并不是坏事。
So it is not a bad thing to do.
不过我觉得前几天,Anthropic 收购了一家做计算机 UI 操作相关业务的公司,我认为这算是一件
I think the other day though, Anthropic acquired another company that does like UI type of stuff on the computer, and I think that is a minor
小事,只是 AI 领域正常的人才流动。
thing where It's just normal talent flow in AI.
总数。
Total number.
我的意思并不是说这是个值得大肆讨论的大事。
I mean, I'm not trying to say this is like a big thing to talk about.
我想说的是,这是评估LLM在这些任务上表现的另一个有趣角度,因为我觉得很多人希望LLM能控制计算机并完成各种操作,但这些任务更难衡量。
What I'm trying to say is like, this is another interesting point for evaluating LLMs on those tasks because I think a lot of people want that to like they want an LLM to control the computer and do various things, but they are harder to measure.
所以可能再过两年,我们会有一些更像基准测试的东西,但这些测试更难界定。
So maybe in two years we will have something more like benchmarks that can It's harder to specify.
这有点像,那个叫什么来着?
It's kind of like, what is it called?
在编程中,有单元测试和系统测试,基本上就是UI测试之类的东西。
In programming, there's unit testing and then the system testing, basically, like the UI testing and stuff like that.
是的。
Yes.
还好吗?
Was it okay?
对。
Yeah.
所以我认为这可能是下一个会成为
And so I think that is the next maybe gonna be the
接下来基本上就是端到端测试。
next thing Basically, end-to-end testing.
对。
Yeah.
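The unit-test versus end-to-end-test analogy drawn here can be made concrete with a toy example. The shop functions below are invented purely for illustration and come from no real benchmark.

```python
# A unit test pins down one function in isolation, the way a coding
# benchmark checks a patch; an end-to-end test drives the whole
# user-visible flow, the way a computer-using agent would be judged.

def add_to_cart(cart: list[str], item: str) -> list[str]:
    """Pure function: easy to specify and unit-test."""
    return cart + [item]

def checkout(cart: list[str]) -> str:
    """Final user-visible outcome of the whole flow."""
    return f"ordered {len(cart)} item(s)"

# Unit test: one function, fully specified input and output.
assert add_to_cart([], "book") == ["book"]

# End-to-end test: exercise the complete flow and check only the final
# observable result, which is much harder to specify exhaustively.
cart = add_to_cart(add_to_cart([], "book"), "pen")
assert checkout(cart) == "ordered 2 item(s)"
```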