Latent Space: The AI Engineer Podcast - 更好的数据就是一切所需——阿里·莫科斯,数据学

更好的数据就是一切所需——阿里·莫科斯,数据学

Better Data is All You Need — Ari Morcos, Datology

本集简介

我们与阿里(Ari)的对话揭示了数据整理是人工智能中最具影响力却投资不足的领域。他指出,当前对模型架构和计算规模扩展的普遍关注忽视了"苦涩的教训"——模型的表现取决于它们所消耗的数据。高效的数据整理——一个包含过滤、再平衡、排序(课程设计)以及合成数据生成的复杂过程——能够训练出更快、更好且更小的模型。莫科斯(Morcos)分享了他从关注以模型为中心的归纳偏置,到意识到数据质量才是突破简单规模法则收益递减的主要杠杆的个人历程。Datology的使命是自动化这一复杂的整理过程,使最先进的数据对任何组织都触手可及,并推动人工智能发展的新范式:数据效率而不仅仅是原始规模驱动进步。

时间戳

00:00 引言
00:46 什么是Datology?通过数据整理训练更快、更好、更小模型的使命。
01:59 阿里的背景:从神经科学到认识AI的"苦涩教训"。
05:30 关键洞见:随着数据规模增加,架构带来的归纳偏置变得不那么重要甚至有害。
08:08 论点:数据是AI研究中相对于其影响最投资不足的领域。
10:15 为何数据工作在研究界和工业界被文化性低估。
12:19 自监督学习如何改变一切:从数据稀缺到数据丰富的体制转变。
17:05 为何自动化整理优于人工参与,引用DCLM研究。
19:22 "大象与狗"的类比,用于管理数据冗余和复杂性。
22:46 关键数据集简史与评论(Common Crawl、GitHub、Books3)。
26:24 通过提高数据质量打破简单规模法则,保持高边际信息增益。
29:07 Datology的实证影响:12倍速达到基线性能。
34:19 数据业务:Datology的护城河及其与开源数据集的关系。
39:12 合成数据解析:风险高的"全新创造"与强大的"重述"之间的区别。
49:02 课程学习的复兴:为何在欠拟合体制下数据排序至关重要。
52:55 训练的未来:优化预训练数据以使后训练更有效。
54:49 谁在训练自己的模型及原因(主权AI、大型企业)。
57:24 "训练更小的模型":为何推理成本使小型专业化模型成为企业的终极目标。
01:00:19 模型剪枝的问题及为何数据端解决方案是互补的。
01:03:03 关于为给定能力寻找最小可能模型的探讨。
01:06:49 从Arcee基础模型合作中汲取的关键经验,证明数据整理"可叠加"。
01:09:46 闪电问答:人人想要的数据及谁应加入Datology。
01:14:24 对Meta超级智能努力及Yann LeCun角色的评论。

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

大家好,欢迎收听Latent Space播客。我是Alessio,Decibel的合伙人兼首席技术官,今天和我一起的是Smol AI的创始人swyx。

Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.

Speaker 1

大家好。非常高兴能在演播室与Datology的联合创始人兼首席执行官Ari Morcos一起。欢迎你。

Hello. Hello. And we're so excited to be in the studio with Ari Morcos, CEO, cofounder of Datology. Welcome.

Speaker 2

非常感谢邀请我。

Thank you so much for having me.

Speaker 1

Ari,我最初注意到你,我是说,Datology算是相对而言比较受关注或炒作的新创公司,至少在融资和你们招聘的高知名度人才方面。我是在你们参与Arcee的工作之后联系预约这次采访的,我甚至不知道该怎么发音,是读作R-C吗?

Ari, so you first came across my radar. I mean, I guess Datology is like a relatively, I guess, exciting or well hyped startup, at least with the fundraising and the higher profile of the people that you hire. I reached out to book this interview after you worked on Arcee, I don't even know how to pronounce it, R-C?

Speaker 2

是Arcee,没错。Arcee。它的名字来自《变形金刚》里一个真实存在的角色Arcee。

Arcee, yeah. Arcee. It's inspired by a real Transformer that was called Arcee.

Speaker 1

对,Arcee基础模型。你们在数据方面做了很多工作。你会如何描述现在的Datology?

Yeah, the Arcee Foundation Models. And you guys have been doing a lot of data work. How would you describe Datology today?

Speaker 2

是的,Datology的使命是处理机器学习中与数据相关的所有环节,明白吗?就是从你有一堆数据存储在某个地方开始,你要通过数据加载器将其输入模型。在这个过程中你会做出大量选择,包括如何过滤数据、如何排序数据、是否生成合成数据、如何分批数据等等。这些选择将极大影响你基于这些数据训练的模型性能。我最喜欢的一句话是:模型如其食。

Yeah, so our mission at Datology is to take everything around the data side of machine learning, right? So going from you have a bunch of data sitting in storage, you're going to feed it into a model, you know, via a data loader. There are a ton of choices you would make in that process, ranging from how you're going to filter the data, how you're going to sequence the data, what synthetic data you're going to generate, if any, how you're going to batch the data, all of those things. And those will have a tremendous impact on the performance of the model that you train on the data. One of my favorite catchphrases is: models are what they eat.
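The choices Ari lists — filter, sequence, generate synthetic data, batch — can be sketched as a toy pipeline. Everything below (function names, the length-based filter, the optional rephrase hook) is illustrative, not Datology's actual stack:

```python
from typing import Callable, Iterator, List, Optional

# Toy sketch of the curation choices just described: filter, sequence
# (curriculum), optional synthetic rephrasing, and batching. Every
# function name and rule here is illustrative, not Datology's product.

def curate(docs: List[str],
           keep: Callable[[str], bool],
           order_key: Callable[[str], float],
           rephrase: Optional[Callable[[str], str]] = None,
           batch_size: int = 2) -> Iterator[List[str]]:
    kept = [d for d in docs if keep(d)]           # 1. filter
    kept.sort(key=order_key)                      # 2. sequence / curriculum
    if rephrase is not None:
        kept += [rephrase(d) for d in kept]       # 3. synthetic augmentation
    for i in range(0, len(kept), batch_size):     # 4. batch for the loader
        yield kept[i:i + batch_size]

docs = ["short", "a longer document", "xx", "the longest document of all"]
batches = list(curate(docs, keep=lambda d: len(d) > 3, order_key=len))
```

Each stage is a swappable function, which matches the later point that curation is "50 different things" that have to compose.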

Speaker 2

如果你给模型展示高质量数据,它们就会表现优异;如果给它们低质量数据,表现就会同样低劣。但这是个前沿研究问题:如何有效实现这一点?如何大规模自动化处理?

If you show them great data, they're going be really high quality. If you show them low quality data, they're going to be low quality. But this is a frontier research problem. How do you actually do this effectively? How do you do this automatically at scale?

Speaker 2

必须自动化才能处理数万亿的token、数十亿的图像等。这就是Datology的使命——将整个流程变得极其简单,让任何人都能获得最先进的数据管理技术,而无需自身成为专家。通过这种方式,帮助合作伙伴更快训练出性能更优的模型,甚至用更小的模型达到相同或更好的效果——我认为这才是未来最激动人心的方向。根本上说,Datology就是帮助人们优化数据,从而训练出更快、更好、更小的模型。

It has to be automatic to be able to process trillions of tokens, billions of images, things like that. And that's our mission at Datology, is to take that whole process, make it really easy so that anybody can get access to state of the art data curation without needing to be an expert themselves. And in doing so, help the folks we work with to train models much faster to much better performance, and to also help them train much smaller models to the same or better performance, which I actually think is some of the most exciting stuff going forward. But fundamentally, that's what we do at Datology is help people curate their data so they can train models faster, better, smaller.

Speaker 1

所以关键词就是:数据管理即服务、数据效率这些术语。在我们开始录音前的闲聊中,你提到有个关于你如何最初进入数据领域的精彩故事。对吧?你曾在GDM工作,也担任过Meta的研究科学家。能说说这个兴趣是怎么形成的吗?

So the keywords for that: data curation as a service, data efficiency, all those terms. In the pre chat before we started recording, you mentioned that there's a cool story around how you got into data in the first place. Right? You were at GDM, you were at Meta as a research scientist. Describe how, like, that became an interest.

Speaker 2

我的博士学位实际上是神经科学方向的。所以我更多是来自实证科学的背景。我确实花时间试图教会老鼠计数,然后分析它们在计数时大脑中数千个神经元的活动,试图理解这究竟是如何发生的,是哪些神经动力学机制在起作用。这实际上是我最初接触机器学习的方式——作为分析神经数据集的手段。我于2011年开始攻读博士,随后AlexNet和Atari DQN相继问世。

My PhD is actually in neuroscience. So I come much more from an empirical science sort of background. I actually spent time trying to teach mice how to count and then analyze the activity of thousands of neurons in the brain while mice did count, and try to understand how did that actually happen, what were the neural dynamics that enabled that. And that's actually initially how I got into machine learning, was as a means to analyze my neural data sets. I also started my PhD in 2011, so AlexNet came right after that, the Atari DQN right after that.

Speaker 2

大量证据表明AI将变得非常令人振奋,这促使我转向该领域。但由于我接受的是实证科学家而非计算机科学家的训练,这种不同的背景导致我加入AI后的首要使命是尝试建立深度学习的科学基础。至今在很多情况下,深度学习仍是一门实证科学,但大多数具有计算机科学背景的人更习惯于理论分支的思维模式,对吧?一切都追求可证明性。这实际上正是深度学习最初遭遇的阻力——人们认为其中没有任何可证明的内容。

Lots of evidence that AI was going to be very, very exciting, which led to me transitioning. But as a result, because I had this kind of somewhat different background of being trained as an empirical scientist rather than as a computer scientist, my real first mission when I joined AI was to try to build more of a science of deep learning. Something that I think is still true today in many cases is that deep learning is an empirical science, but most people that have computer science backgrounds were trained more in the context of a branch of theory, right? Everything was very provable. That was the initial pushback to deep learning actually, that you couldn't prove anything in it.

Speaker 2

但深度学习本质上是门实证科学,对吧?我们必须进行大规模实验。我们理解设计这些系统的规则,但当我们在海量数据上训练它们时,涌现出的特性往往是意外且不可预测的。因此我一直希望撰写这样的论文:前半部分试图理解为什么某种表征是理想或不理想的?为什么模型表现优劣?

But deep learning is at its core an empirical science, right? We have to run large experiments. We understand the rules for how we design these systems, but the properties that come out of them when we actually train them on a ton of data are emergent and unexpected. So I always really wanted to write these papers where they had two halves, where the first half of the paper was trying to understand why is this representation desirable or undesirable? Why is a model good or bad?

Speaker 2

然后基于这种理解来改进模型。这始终是我的目标,也是理想论文的范式。比起随意尝试碰运气,我们能够真正理解失败原因并据此改进。但事实证明,完成前半部分(理解系统)不算太难,真正困难的是如何运用这种理解来改进系统。

And then understand that and then use that understanding to then improve the model. And that was always my goal. That was kind of the perfect paper. Rather than just throwing spaghetti against the wall and seeing what stuck, we were able to really understand why something didn't work and then use that understanding to improve it. Unfortunately, it turns out that it's not so difficult to do the first half of that, try to understand the system, but really, really difficult to actually use that understanding to improve the system.

Speaker 2

常见的情况是:你针对某个变量进行优化,发现'看啊,这个表征特性能让模型变好'。但当你真正优化它时,却发现那只是个相关变量而非因果变量——根本不起作用。我可能写过30篇完成前半部分的论文,但只有三四篇实现了后半部分。这总是让我感到沮丧和不满。

A lot of times what would happen is you go to optimize for this variable, you find, hey, here's this property of representations that makes models good. You go and you optimize for that and then it turns out that wasn't a causal variable. That was a correlate and it doesn't actually work. So I maybe wrote 30 papers where we did that first half and maybe only three or four where we did that second half. And that was always kind of frustrating and dissatisfying to me.

Speaker 2

直到2020年左右,几篇论文同时给了我当头棒喝——数据才是唯一关键。这些研究最初都聚焦于归纳偏置:如何通过改变目标函数或调整架构(这也是领域主流方向,现在大型会议上多数论文仍关于架构改进)来增强模型。但这些论文都明确揭示:唯有数据至关重要。举个实例——

And then around 2020, I had several papers that all kind of slapped me in the face at the same time with the same insight, which is that all that really matters is the data. And I came into all of these papers very much focused on inductive biases: how do we put better inductive biases into models, either through changing the objective or through changing the architecture? Which is where most of the field was, and still, a lot of the papers you see at the big conferences are about architectures and various tweaks to architectures. But I had these multiple papers, all of which made this clear takeaway that the data is the only thing that matters. I'll give you one example.

Speaker 2

我们有篇叫ConViT的论文,其核心是将视觉Transformer初始化为卷积神经网络。这样模型可以从卷积的归纳偏置出发,但有权选择摒弃它。这是种柔性归纳偏置,不同于CNN的硬性偏置——在CNN里你无法摆脱卷积特性。

There's a paper we had called ConViT, where the idea was to take a vision transformer and initialize it as if it was a convolutional neural network. And that way you could actually start with this inductive bias of convolution, but the model could choose to unlearn it if it wanted to. So the idea was it was a soft inductive bias, not a hard inductive bias. ConvNets have a hard inductive bias. You can't not be convolutional in a ConvNet.

Speaker 2

这种情况下,你以特定方式初始化Transformer后,模型可以自主决定是否保留该特性。我们的设想是:这种初始偏置会很有帮助,但模型有权在不需要时舍弃它。

But in this case, you initialize the transformer that way and then if it wants, the model could learn not to be that. And the idea here was that this would be really helpful for models to give them this inductive bias, but then they could learn not to use it if they didn't want to.

Speaker 1

追问一下,这是否意味着卷积网络与Transformer存在权重层面的一一映射关系?

Just to follow-up, there's a one to one mapping of a ConvNet to a transformer and you can map it directly onto the weights?

Speaker 2

完全正确。事实证明可以实现精确映射。比如对于一个3x3卷积核,你可以用9个注意力头——每个头对应核的不同位置,然后通过初始化实现完全对应。就像是……

Exactly. You can map it exactly correctly, it turns out. Say you have a three by three kernel: you can have nine heads, each head corresponds to a different part of that kernel, and then you can initialize it so it matches exactly. So it's like a...

Speaker 1

一种非常粗糙的东西,可以通过训练逐步优化。

very coarse thing that can then be refined as with training.

Speaker 2

没错。然后它可以选择调整权重,从而解除你通过这种方式施加的权重绑定。我们后续的一篇论文表明,你甚至可以把一个训练好的CNN直接实例化为视觉Transformer(ViT)。事实证明,在小数据场景下——这里的小数据指的是少于50万个数据点——

Exactly. And then it can choose to change its weights so that it can undo the weight tying that you impose on it this way. We actually had a follow-up paper which showed you could take a trained network and actually instantiate a trained CNN as a ViT as well. So there's a way to do this. It turns out in the small data regime, and when I say small data here, I mean say less than 500,000 data points.
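A minimal sketch of the initialization trick being described, assuming a square grid of patches and one attention head per position of a 3x3 kernel. The real ConViT work uses gated positional self-attention that the model can learn to turn off; this toy version only builds the near-one-hot attention maps at initialization:

```python
import numpy as np

# Toy version of the ConvNet-to-ViT mapping discussed above: one
# attention head per position of a 3x3 kernel, with attention logits
# initialized so each head (softly) attends to one fixed spatial offset.
# Simplified sketch only -- no gating, no learned unlearning.

def positional_attention(grid: int, offsets, strength: float = 10.0):
    n = grid * grid
    coords = [(i // grid, i % grid) for i in range(n)]
    heads = []
    for dy, dx in offsets:  # one head per kernel position
        logits = np.full((n, n), -strength)
        for q, (y, x) in enumerate(coords):
            ty, tx = y + dy, x + dx
            if 0 <= ty < grid and 0 <= tx < grid:
                logits[q, ty * grid + tx] = strength  # peak at the offset
        attn = np.exp(logits)
        heads.append(attn / attn.sum(axis=-1, keepdims=True))  # softmax rows
    return np.stack(heads)  # (num_heads, n, n) attention maps

offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
attn = positional_attention(4, offsets)
```

Because the peak is soft (a finite logit gap, not a hard constraint), training can move the weights away from it — which is exactly the "soft inductive bias" the conversation contrasts with a ConvNet's hard one.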

Speaker 2

这是在图像自监督学习的背景下。在这种小数据场景中,这种方法极其有效。这篇论文被引用的领域主要是一些数据稀缺的细分科学问题,比如火山预测——可能只有1500个数据点。但随着数据量增加,这种软归纳偏置的优势会衰减,最终反而会产生负面影响。

And this was in the context of image self supervised learning. So in that small data regime, this is super helpful. And where this paper has actually been cited is a whole bunch of kind of niche scientific problems where there's very little data. For example, volcano prediction, where you have like 1,500 data points or things like that. But the advantage of using this soft inductive bias decays as the data size increases and eventually actually becomes harmful.

Speaker 2

当数据量达到约100万点时(以当前模型标准并不算多),这个临界点就会出现。超过这个阈值后,软归纳偏置不仅不再有益,反而会产生轻微危害。我和其他几篇论文都论证了同一个观点:当规模足够大时,归纳偏置根本无关紧要,真正重要的是从数据分布中学到的后验概率——这才决定一切。

So if you see enough data, and the threshold at which this changes is around a million data points, so it's not massive by any stretch by our current standards. So basically once you get past a million data points, that soft inductive bias no longer helps you and it actually now is mildly harmful. So I had this paper and a couple other papers that all kind of made this same point: basically, you know, when you get to enough scale, inductive biases matter not at all. All that really matters is the learned posterior from the data distribution. And that's really what defines everything.

Speaker 2

而Transformer的崛起恰恰证明:从架构层面减少内置归纳偏置才是正途。这些因素叠加起来对我冲击极大——毕竟过去六年我都在研究归纳偏置。现在多篇论文都表明:你钻研的方向其实没那么重要。真是苦涩的一课啊,货真价实的苦涩教训。

And then of course the rise of the Transformer really showed that actually starting with models that have fewer inductive biases built into their architecture, you know, is the right thing. So we had this combination of factors which ultimately, like, actually was very, very confronting for me, because I had spent the last six years of my career working on inductive biases. And now I'm faced with several different papers, all of which show me that, hey, what you've been working on isn't actually really that important. The bitter lesson indeed.

Speaker 2

这堂苦涩之课确实令我痛彻心扉。最终我不得不承认这个教训的正确性,并思考在新范式下该何去何从。显然有两个合理选择:要么去研究如何让GPU飙速运转(可我不是硬件工程师),要么投身数据研究。

So, you know, the bitter lesson was indeed very bitter for me. And, you know, that was really my, you know, inculcation in it, I suppose, where at the end I kind of thought to myself, okay, clearly the bitter lesson is true here. What should I do in this new world? And it became clear to me that there are really two options that made a ton of sense: either go work on making GPUs go brrr, and I'm not a hardware engineer, I don't know how to make GPUs go faster, or work on data.

Speaker 2

由于种种原因,数据领域的投入与其重要性严重不匹配。我反复强调:数据是研究领域中投资最不足的板块,且差距悬殊。这涉及机器学习文化、激励机制等多重因素。即便是Kaplan和Chinchilla的缩放定律研究,都荒谬地假设数据是独立同分布的。

And for a whole bunch of reasons, data has been dramatically under invested in relative to its impact. Something I've said before and I'll say again is that data is the most under invested in area of research relative to its impact and I don't think it's even close. There are a whole bunch of reasons for this which we can go into, some of which have to do with the culture of machine learning, some of which had to do with the incentives that have been set up. But data has systematically generally not been considered. And even if you go and you look at the scaling laws work from Kaplan and Chinchilla and all these other things, they all assume IID data, which is insane.

Speaker 2

我们都知道'垃圾进垃圾出'是计算机科学最古老的箴言,但现有缩放定律却假设所有数据价值均等,这完全说不通。正是这点驱使我开始研究数据问题。而数据研究还有个绝妙之处——

We know that all data are not created equal, that garbage in garbage out is like the oldest adage in computer science. And yet all these scaling laws assume that all data is created equal. That makes no sense whatsoever. That's what led me to start working on this problem. And it turns out that there's a really cool thing about data research.
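For concreteness, the Chinchilla parametric fit from Hoffmann et al. models loss as L(N, D) = E + A/N^α + B/D^β, where D counts tokens with no notion of quality — that is the "all data is created equal" assumption being criticized. One illustrative (not fitted) way to express the critique is to treat curation as an "effective data" multiplier:

```python
# The Chinchilla parametric fit (Hoffmann et al., 2022):
#     L(N, D) = E + A / N**alpha + B / D**beta
# D counts raw tokens; every token is weighted equally. Treating
# curation as an effective-data multiplier m (curated tokens behave
# like m raw tokens each) is an illustrative framing, not a fitted
# quantity from any paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # constants from the paper's fit

def chinchilla_loss(n_params: float, n_tokens: float,
                    quality_multiplier: float = 1.0) -> float:
    d_eff = n_tokens * quality_multiplier  # "effective" token count
    return E + A / n_params**ALPHA + B / d_eff**BETA

raw = chinchilla_loss(7e9, 1e12)           # 7B params, 1T raw tokens
curated = chinchilla_loss(7e9, 1e12, 2.0)  # same tokens, 2x effective data
```

Under this framing, a 2x effective-data multiplier is indistinguishable from doubling the raw token count — which is what "curation as a compute multiplier" means later in the conversation.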

Speaker 2

除了投入产出比极高(使其成为绝佳的研究方向和更棒的创业领域)之外,与表征研究不同:在表征领域,科学上'为何有效'的探索常与实际应用脱节。但数据研究中,理解数据点价值的过程本身就能直接优化数据集,从而提升模型性能。

In addition to it being something that's impactful relative to the investment, which makes it a great research area and makes it an even better company. What I'd said previously was that with representations, you have this disconnect where there are the questions which are kind of scientifically interesting, about understanding why a representation is good, and then the questions that are practically relevant: how do I use this to improve it? And I think what was so frustrating to me early in my career was that those were different questions a lot of the time. The questions that I wanted to ask, which were curiosity driven and really interesting to me as a scientist, ended up often not being the questions that were practically relevant downstream. But it turns out with data, this is no longer true.

Speaker 2

这意味着数据研究中,科学意义与实际价值的问题高度重合——这在科研领域极为罕见。我们既能探索科学家最感兴趣的问题,又能确信答案将帮助我们构建训练更快、性能更强、参数更少的模型。以上就是我如何走上数据研究之路的心路历程,以及为此经历的阵痛。

With data, if you can understand what makes a given data point useful or what makes a given data point not informative, you can almost always use that insight to make a data set better and therefore make a model better. So what this means is that the set of questions which are scientifically interesting and the set of questions which are practically relevant in data research are largely the same questions. And that's really rare to find in research period. And what this means is that we can ask the questions, which as scientists are extremely motivating to us, but then have very high confidence that the answers to those questions are going to help us to build models that train much faster, that train to much better performance, and that can train with far fewer parameters. So that's a little bit of a high level of kind of how I got into the data problem and I think the pain that I had to go through to get there in the first place.

Speaker 0

你提到数据中的激励机制存在不对齐的问题。能否详细解释一下?因为从外部看,像Scale这样的公司显然已经取得了巨大成功,人们投入了大量资金。但你本质上是在说,这个市场规模应该有4万亿美元,而Scale远未达到这个量级。那么你认为这种效率低下的原因是什么?

You mentioned something about the incentives in the data not being aligned. Can you unpack that? Because I think from the outside you have companies like Scale that obviously have become super successful, so people are investing a good amount of money. But what you're basically saying is, the market is like $4 trillion and Scale is not $4 trillion. So why do you think there's that inefficiency?

Speaker 2

好的,首先我们需要将研究界与工业界分开讨论,我认为两者差异很大。总体而言,工业界对数据工作的重视程度一直远高于研究界。首要原因在于,数据工作常被视为二等公民式的工作——是苦力活,是管道工程。

Okay, so first off, we have to divide the research community from the industrial community, because I think they're very different. And I think in general, data work has been far more valued in industry consistently than it had been in the research community. First and foremost, part of this is that data work has just often been considered second class citizen sort of work. It's the grunt work. It's the plumbing.

Speaker 2

这类工作,你知道的,那些自视甚高的科学家根本不屑于碰。最近还有些推文在流传,说数据清洗很无聊,是低价值工作。但如果你询问最优秀的AI研究人员成功的秘诀,他们大多会告诉你关键在于观察数据。模型本质上是你所展示数据的映射,虽然过程可能枯燥艰难,但确保数据质量至关重要。所以我认为首先存在这种普遍认知,觉得这是低质量——或者说声望较低——的工作。

It's the stuff that, you know, you don't want to work with as a, you know, super hoity toity scientist. There are even some tweets recently going around with people saying, you know, data cleaning is boring, it's low value work. Whereas I think what you'd find is that if you talk to the most talented AI researchers and you ask them what's the secret to your success, they'll largely tell you that they look at the data. Ultimately, models are a reflection of the data that you show them, and yeah, it can be tedious, it can be challenging, but it is so critical to get this right. So I think first off, there's this general perception that this is lower quality work, or not quality, but lower prestige work.

Speaker 2

这种观念由来已久。部分原因与研究激励机制的设置方式有关。数据集被视为既定条件——回想2018年左右的研究范式:给定ImageNet数据集,在验证集或测试集上最大化性能,对吧?但ImageNet作为给定条件是不容修改的。

And that's been there for a long time. I think part of this had to do with the way that research incentives were set up. The data set was viewed as the given. So if you think about research circa, say, 2018: given ImageNet, maximize performance on the val set or on the test set, right? But the dataset, ImageNet, was given as something you don't change.

Speaker 2

Kaggle也采用这种框架:给定数据集,去优化模型。人们可能会尝试自助法之类的技巧,但基本假设始终是通过改进模型而非改善数据集来提升效果。在监督学习时代,这种思路确实有其合理性,不是吗?

Even Kaggle had this framework, right? Given the dataset go and make this better. People might try things like bootstrapping or stuff like that. But generally the assumption was you're going to improve the model through better modeling, not through improving the data set. And part of this also was just in the supervised learning era, this made sense, right?

Speaker 2

我们通常不受计算能力限制,而是严重受限于数据不足,对吧?数据极其稀缺。比如要构建ImageNet数据集,你得通过亚马逊土耳其机器人(MTurk)雇佣大量人工标注。即便如此,数据质量总有个底线,对吧?

We generally weren't compute limited. We were generally very data limited, right? Data was very scarce. Like if you want to assemble ImageNet, you have to go to MTurk and get a whole bunch of people to label the data set. And then there's generally some quality floor, right?

Speaker 2

因为这个数据集的每个数据点都经过人工审核。即使仍存在许多错误,至少不会像直接网络爬取的数据那么糟糕。但到了2019年,这个领域发生了巨大变革——我们掌握了无监督训练的方法。我个人有个颇具争议的观点:Transformer确实是重大突破,但它只是众多等效优秀架构中的一种。

Because a human has looked at every data point in this data set. Even if there are still a lot of errors there, at least it's not going to be as bad as raw internet scrapes. But then in 2019, the field underwent this pretty massive change, right? We figured out how to train without labels. And one of my more controversial viewpoints is that the transformer is a great advance, to be sure, but I think it's one of a very large set of equivalently good architectures that we could have found.

Speaker 2

我们完全可以通过其他多种架构达到同等性能。但如果没有自监督学习和利用无标签数据的能力,就绝不可能取得今天的成就。在我看来,这才是真正推动能力突飞猛进的关键——比如掩码预测任务,但远不止于此。

And there are many, many ways we could get to the same performance without the transformer. But I do not think there's any way we could get to where we are today without self supervised learning and the ability to train on unlabeled data. That was the real advance to my mind that enabled us to get these incredible increases in capabilities. Which is like the mask objective. It's not just masking objectives.

Speaker 2

掩码语言建模是其中一种方式,还有下一个词元预测等。核心思想是:无需人工标注,让模型根据数据的其他部分预测某个特征。这极具革命性——想象一下,我们从此前百万量级的ImageNet数据,短短几年就跃升至数万亿词元规模,数据量暴增百万倍,这史无前例。

I think mask language modeling objective is one, but even next token prediction, right? But generally this notion that, hey, instead of having to get an external label from a human, we can ask the model to predict one aspect of a data point from other parts of that data. And that is really powerful because think about it, right? That meant that we went from ImageNet, a million data points, to literally trillions of tokens, a million fold increase in data quantity in a matter of like several years. That's completely unheard of.
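The two objectives just named can be made concrete in a few lines; in both, the "label" is carved out of the data point itself, with no human annotation. The token lists and mask rate below are toy values:

```python
import random

# Minimal illustration of the two self-supervised objectives mentioned:
# next-token prediction and masked language modeling. Targets come from
# the data itself -- no external labels.

def next_token_pairs(tokens):
    # next-token prediction: each prefix predicts the token that follows it
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, mask_rate=0.15, mask_id="[MASK]", seed=0):
    # masked LM: hide a random subset; the hidden originals are the targets
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append(mask_id)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets

pairs = next_token_pairs(["the", "cat", "sat"])
inputs, targets = mask_tokens(["tok"] * 20, mask_rate=0.5)
```

This is what unlocks the "million-fold increase in data quantity" he describes: any raw text produces training targets for free.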

Speaker 2

这彻底改变了一切:数据从稀缺且质量可控,突然变成海量规模。现在所有模型基本都处于欠拟合状态,而以前我们在图像数据集上要做160个epoch(通常都会过拟合)。如今进入欠拟合时代,质量底线不复存在,随之而来的是冗余数据、低质量、低信息增益等问题——这些都是海量无标签数据集带来的新挑战。

And that also changed everything because now we went from data being scarce and having a high quality floor to now all of a sudden data is absolutely massive. All of our models are basically always underfitting the data, whereas previously we would do 160 epochs on an image data set, right, where they would all be overfitting the data generally. So now we move to this underfitting the data regime. There's no more quality floor. And now we have all of these problems with redundancy, with low quality, with low information gain, all these various things that come with these massive unlabeled data sets.

Speaker 2

因此我认为从2010年代到2020年代,问题本身也发生了翻天覆地的变化。这恰恰使其成为一个令人兴奋的科学议题——因为在2020年之前研究这个根本没有意义。但现在它变得极其重要,我认为我们必须解决这个问题,才能让这些模型持续进步,同时提升其成本效益,避免它们沦为只有斥资数亿美元才能实现的奢侈品。优化数据可以带来巨大的算力倍增效应,能让单位美元的性能提升数个数量级。

So I think the problem also changed pretty dramatically from the 2010s to the 2020s. And I think that's what makes it so exciting as a scientific question, is that this didn't really make sense to study prior to 2020. But now this makes tremendous sense and is I think absolutely critical for us to solve in order for us to enable these models to continue to improve and also to enable the cost effectiveness of these models so that they don't just stay as something that's only possible to achieve if you have hundreds and hundreds of millions of dollars. Making the data better can be a massive compute multiplier. It can change the performance per dollar by orders of magnitude.

Speaker 2

从很多方面来说,我们的核心目标就是如何让这件事对所有人都变得简单高效。

And in many ways, that's our whole goal is how do we make that easy and effective for everyone.

Speaker 0

完全同意。你从2018年到2023年9月就职于Meta,经历了Llama一代和二代。在Meta内部是何时开始意识到这些经验的?比如'我们应该投入资源研究这个'。你提到2020年,所以我想知道是否...

Totally. And you were at Meta from 2018 to September '23, which is both during Llama one and Llama two. At what point inside of Meta did maybe some of these learnings become apparent? Like, okay, we should start to spend resources working on this. You mentioned 2020, so I'm wondering if that was

Speaker 1

我认为Llama一代已经是个重大突破了。

I think Llama one was already a big breakthrough.

Speaker 2

没错。Llama一代确实比其他多数模型更注重数据过滤,这确实开启了变革。但即便如此,即便在我离开Meta时,通过数据筛选来识别高质量高价值内容的概念仍未被充分重视。如果你与前沿实验室数据团队的成员交流,会发现他们实际上在爬虫系统上投入巨大,通常致力于开发更好的爬虫工具来净化数据源——这很合理。

Yeah. Llama one definitely put more effort into data filtering, I think, than many others and definitely started to change this. But even then, I would say that actually, you know, even when I left Meta, this was still fairly underappreciated: the idea of actually curating the data to figure out what's the high quality, high value data. If you talk to a lot of the folks on the data teams within the big frontier labs, what you'll find is that they've actually invested really heavily in crawling. Oftentimes they've really worked on getting better crawlers and trying to clean up the source of the data that's coming in, which makes sense.

Speaker 2

但归根结底,关键是要建立这样的视角:基于模型已接触的所有数据,面对潜在候选数据集时,下一个数据点能带给模型最大学习价值的是什么?这完全是另一种问题思考框架。虽然各大实验室对此都有保密研究,但这确实是个极其困难的尖端科研难题。

But ultimately, I think what you really need to do is take this perspective of: given everything that the model has seen so far and given a potential candidate set of data, what data point is going to teach the model the most the next time it sees a data point? And that's a pretty different framing for how to think about this problem. And there's certainly been some great work done, although it's all secretive within, I think, the bigger labs. But that's a really hard problem. That's a frontier research problem.
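One simple proxy for that framing — explicitly not Datology's method, just a common heuristic from the data-selection literature — is to score candidates by the current model's loss and take the highest-loss examples, on the theory that the model has the most left to learn from them:

```python
from typing import Callable, List

# Loss-based candidate selection: rank candidates by the current model's
# loss on them and keep the top k. High loss ~ not yet learned. A real
# system must also discount noisy or unlearnable data, which shows up
# as high loss too -- that caveat is exactly what makes this a frontier
# problem rather than a solved one.

def select_most_informative(candidates: List[str],
                            loss_fn: Callable[[str], float],
                            k: int) -> List[str]:
    return sorted(candidates, key=loss_fn, reverse=True)[:k]

# toy stand-in "model": pretend loss grows with document length
toy_loss = lambda doc: float(len(doc))
picked = select_most_informative(["aa", "a long novel sentence", "abcd"],
                                 toy_loss, k=2)
```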

Speaker 2

我认为我们至今仍未找到解决方案。数据筛选本身就是个棘手的'伪命题',因为它不存在万能钥匙——没有哪种技巧能立竿见影。实际情况是存在50种不同方法,每种单独使用效果有限,但若能找到协同方式,就能产生巨大增益。

I don't think we still know how to solve that. I think data curation also is a hard problem to solve, quote unquote, because it's not one where there's a single silver bullet. There's not just do this one trick and all of a sudden things work. It's rather here are these 50 different things that you can do, each of which provides a pretty modest gain on its own. But then if you can figure out how to make them combine, you then get a really big gain.

Speaker 2

但首先需要明确要采取哪些措施,其次要解决措施间的兼容性问题——因为它们天然存在冲突。

But you have to figure out first off what are all of these different things you want to do and then two, how do you make them play nice with each other? Because by default they don't play nice with each other.

Speaker 1

没错。关于你提到的自监督学习,我有个简要观察:彻底摆脱人工标注确实很棒,或者说让系统自主生成标签,对吧?

Yeah. I'll make a quick observation on, you mentioned self supervised learning. I definitely agree with that. Just getting rid of labels altogether is great, or forming your own labels. Right?

Speaker 1

我有个更宏观的发现:这个逻辑可以延伸到非学习领域。比如自监督自动化、自监督神经架构搜索、自监督数据筛选。如果能实现全自动化才是关键——让机器处理一切,因为如果我们必须手动标注所有数据,人类自身就会成为效率瓶颈。

And I have a general observation that I think extends to things that are not just learning. So self supervised, I don't know, automation, self supervised neural architecture search, self supervised curation. If you can just automate everything, I think that's the lesson really, like just get the machines to do it, because we are the rate limiters if we must label everything.

Speaker 2

是的。我认为这一点非常正确。实际上我经常思考一个问题:我们是否正在重蹈"苦涩教训"的覆辙,试图通过人工指导的方式进行数据筛选?目前最优秀的开源数据筛选项目是DCLM(DataComp-LM),由斯坦福大学教授Ludwig Schmidt领导,联合了来自多所院校的约30名学生共同完成。

Yeah. I think this is very true. It's actually something I think about a lot: are we actually falling prey to the bitter lesson again here by trying to have human guided methods of data curation? Probably the best open effort on data curation is DCLM, DataComp-LM. It was led by Ludwig Schmidt, a professor at Stanford, and about 30 students across many different institutions.

Speaker 2

这是个非常出色的努力,旨在整理类似Common Crawl风格的数据集。

Really wonderful effort to kind of curate Common Crawl style data sets.

Speaker 1

是的,我们播客之前确实报道过DataComp和DCLM项目。

Yeah, we've actually covered DataComp and DCLM on the podcast.

Speaker 2

太棒了。但DCLM论文最后有个非常酷的研究,我认为它得到的关注远远不够。他们让30名研究生花了两年时间,试图设计出这些模型的最佳过滤标准,并建立了一个相当不错的系统。然后他们让所有学生预测这个系统会如何运作。

Awesome, great. But DCLM had a really cool study at the end of the paper that I don't think gets nearly as much attention as it should. They have these 30 grad students spend two years basically trying to design the optimal filtering criteria for these models, right? And they built a system that's pretty good at this. So then they asked all those students to predict what that system is going to do.

Speaker 2

给定一个数据点,系统会保留还是拒绝它?这些学生理论上是你所能聘请到的最佳专家——他们整整两年都在研究NLP数据。但他们的预测准确率并不比随机猜测高。这让我经常被问到:怎么可能在没有人工参与的情况下完成这项工作?

So given a data point, is the system going to say keep the data point or is it going to say reject the data point? These are nominally the best experts you could ever hire to do this. These are students who have just spent all of their time looking at NLP data for two years. They could not predict what the DCLM classifiers would say above chance. So this comes up a lot of times where people often ask me, how can you possibly do this without a human in the loop?
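For reference, DCLM's best-performing filter was a trained quality classifier whose keep threshold is chosen to retain a target fraction of documents. The sketch below keeps that shape but swaps in a crude stand-in scorer, since the real system is a trained fastText classifier, not a hand-written rule:

```python
import math
from typing import Callable, List

# Shape of a DCLM-style classifier filter: score every document, keep
# the top fraction. The alpha_score below is a crude stand-in for the
# actual trained classifier -- illustrative only.

def classifier_filter(docs: List[str],
                      score: Callable[[str], float],
                      keep_fraction: float) -> List[str]:
    ranked = sorted(docs, key=score, reverse=True)
    n_keep = max(1, math.floor(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# stand-in quality cue: fraction of alphabetic/space characters
alpha_score = lambda d: sum(c.isalpha() or c == " " for c in d) / max(len(d), 1)
docs = ["Clean prose about dogs.", "$$$ CLICK 1234 !!!", "Reasonable text here"]
kept = classifier_filter(docs, alpha_score, keep_fraction=0.67)
```

The point of the grad-student study is that even experts could not predict the learned `score` function's keep/reject decisions above chance — the scorer captures cues humans don't articulate.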

Speaker 2

看似不可能,似乎必须有人工来评估数据。但这项研究的启示是(其他证据也支持这点):我们必须自动化,因为人类根本无法处理数十亿数据点、数万亿token。即便能做到,我们实际上也不应该这样做。

It just seems impossible. You need to have a human to actually rate these data. But I think the takeaway from that study, and there's a number of other pieces of evidence that also suggest this, is that obviously we have to be automated, because humans just can't scale to billions of data points, trillions of tokens. It's not possible. But even if we could, we actually wouldn't want that.

Speaker 2

人类并不擅长这个任务。要理解原因,最简单的方式是:数据点的价值不仅取决于其本身,更取决于它与训练集中其他所有数据点的关系。比如我有1万份略有差异的《哈姆雷特》摘要,我不需要全部。但单独看任何一份时,我可能觉得它质量很高、准确清晰——问题在于我不需要1万份。

Humans are not good at this task. And to give an intuition as to why humans aren't good at this task, I think the easiest way to think about this is that the value of a data point is not just a function of that data point itself. It's rather a function of how that data point relates to every other data point in the training set, right? So if I have 10,000 copies of slightly variable summaries of Hamlet, I don't need all of those. But if I were to look at any one of those individual summaries, I might say, hey, this is really high quality, this is really accurate, it tracks all the characters, it's well written, it's clear, but I don't need 10,000 of those.

Speaker 2

这是人类永远无法完成的任务,因为人脑显然无法记住整个数据集。所以即便能用人力达到这种规模,你也不会想这么做。

And that's just a task that a human would never be able to do because a human can't keep the whole dataset in their head obviously. So even if you could have this scale with humans, you wouldn't want to.
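The Hamlet example is exactly what embedding-based near-duplicate pruning targets (in the spirit of methods like SemDeDup): a point's value is judged against what has already been kept, which is the whole-dataset context no human can hold in their head. A minimal greedy version with toy hand-built embeddings:

```python
import numpy as np

# Greedy near-duplicate pruning: keep a document only if its embedding
# is not too similar to anything already kept. Toy embeddings below; a
# real pipeline would embed documents with a trained encoder.

def prune_near_duplicates(embs: np.ndarray, threshold: float = 0.95):
    # embs: (n, d), rows unit-normalized; returns indices of kept rows
    kept = []
    for i in range(len(embs)):
        if all(float(embs[i] @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

embs = np.array([[1.0, 0.0, 0.0],     # original Hamlet summary
                 [0.999, 0.01, 0.0],  # near-duplicate rephrasing
                 [0.0, 1.0, 0.0]])    # genuinely different document
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
kept = prune_near_duplicates(embs)
```

Each of the 10,000 summaries looks high quality in isolation; only the pairwise comparison reveals that 9,999 of them add almost nothing.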

Speaker 0

那么在一到一万之间,正确的数字是多少?

But so what's the right number between one and ten thousand?

Speaker 2

虽然令人不满,但正确答案确实是'视情况而定'。这取决于概念的复杂程度。冗余其实很有用——完全消除冗余反而糟糕。如果去除所有冗余,我就只能通过单一场景来理解比如金毛犬这个概念。

The unsatisfying answer is it depends, but it's also the right answer. So it depends on how complex the concept is. So redundancy is really useful, right? And like removing all redundancy is a bad thing. If I remove all redundancy, then I'd only be able to understand, say, a golden retriever in the one situation that I've ever seen it in before.

Speaker 2

我无法一概而论,那样会很糟糕,对吧?所以适度的冗余是好的,但我们都有这种直觉——无限的冗余并不好,反而有害。那么不同概念的界限在哪里呢?我喜欢用大象和狗来举例。大象的形象相当固定。

I wouldn't be able to generalize and that would be bad, right? So some redundancy is good, but I think we all have the intuitive understanding that infinite redundancy is not good, it's bad. So where is this line for different concepts? Well, one example I like to give for this is elephants versus dogs. So elephants are pretty stereotyped.

Speaker 2

世界上有两种大象:亚洲象和非洲象。它们都是灰色的,都有下垂的耳朵,都有长鼻子和象牙。

There are two kinds of elephants in the world. There are Asian elephants and African elephants. They're all gray. They all have floppy ears. They all have a trunk and some tusks.

Speaker 2

它们皮肤都皱巴巴的。非洲象比亚洲象体型更大,但总体上非常相似,差异性不大。因此我不需要太多数据或冗余就能完整理解大象这个概念。但狗就完全不同了,对吧?

They all have wrinkly skin. African elephants are bigger than Asian elephants, but largely they're all pretty similar. There's not too much variability. So I don't need that much data or that much redundancy to understand the concept of elephants fully and completely. But dogs, on the other hand, are totally different, right?

Speaker 2

狗的变异性极大。有数百个品种,更不用说各种混种犬。它们有不同形状、大小、毛发质地和颜色等。要正确理解狗这个概念所需的数据量,远高于理解大象所需的数据量。这就带来了实际操作中的挑战——在数据过滤时,你首先面对的不是现成的分类数据集,比如标注好的'这些是狗'或'那些是大象'。

Dogs are super variable. There are hundreds of breeds, not to mention all the mixes of different dog breeds. There are different shapes, sizes, textures, colors, all of these different things. The amount of data that I need in order to properly understand dogs is going to be a lot higher than the amount of data I need to understand elephants. So this gets to some of the challenge when you're actually trying to do this sort of curation, at least on the filtering side: first off, you don't get a data set where you're told, hey, these are a bunch of dogs, these are a bunch of elephants.

Speaker 2

你得到的只是一堆未标注数据。所以首先需要以无监督方式发现这些概念,根据概念本身的特性推断其复杂程度,从而判断需要多少数据来理解它。比如判断某个概念非常复杂时,就应该保留更多冗余数据;若概念简单则不需要太多冗余,然后据此做出取舍。我认为这正是许多挑战的根源,也是设计这类系统时必须考虑的因素。

Instead, you just get: here's a bunch of data, right? So first off, you have to, in an unsupervised way, discover what these concepts are. Then use something about each concept to make some inference about how complicated or complex it is, and therefore how much data you need to understand it. Figure out, okay, this is a really complicated concept, I probably should keep a lot of redundancy.

Speaker 2

这确实是个难题。但这些都是设计系统时必须要考虑的关键因素。

This is a really simple concept, I don't need that much redundancy, and then make the appropriate choice of what you want to remove. So this is, I think, where a lot of the challenge comes from, but these are the sorts of factors that you have to keep in mind when you're trying to design these systems.
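The elephants-versus-dogs intuition can be sketched in code. This is a hypothetical illustration, not Datalogy's algorithm: it assumes an upstream unsupervised step (for example, k-means over embeddings) has already assigned each unlabeled example to a concept cluster, and then allocates the keep-budget in proportion to within-cluster variance as a crude complexity proxy.

```python
# Hypothetical sketch of complexity-aware curation, not Datalogy's algorithm.
# Assume an upstream unsupervised step (e.g. k-means over embeddings) already
# assigned each example to a concept cluster; allocate the keep budget in
# proportion to within-cluster variance, a crude proxy for concept complexity.
import numpy as np

def curation_budget(embeddings: np.ndarray, labels: np.ndarray, keep_total: int):
    """Sample more points from high-variance (dog-like) clusters and fewer
    from low-variance (elephant-like) ones."""
    rng = np.random.default_rng(0)
    clusters = np.unique(labels)
    # Mean within-cluster variance as the complexity proxy.
    variances = np.array([embeddings[labels == c].var(axis=0).mean() for c in clusters])
    weights = variances / variances.sum()
    budgets = np.maximum(1, np.round(weights * keep_total).astype(int))
    kept = []
    for c, b in zip(clusters, budgets):
        idx = np.flatnonzero(labels == c)
        kept.extend(rng.choice(idx, size=min(int(b), len(idx)), replace=False))
    return sorted(int(i) for i in kept)

# Toy data: a tight "elephant" cluster and a spread-out "dog" cluster.
data_rng = np.random.default_rng(1)
elephants = data_rng.normal(0.0, 0.05, size=(100, 8))  # stereotyped: low variance
dogs = data_rng.normal(5.0, 1.00, size=(100, 8))       # variable: high variance
emb = np.vstack([elephants, dogs])
lab = np.array([0] * 100 + [1] * 100)
kept = curation_budget(emb, lab, keep_total=50)
# Nearly all of the budget goes to the high-variance "dog" cluster (indices >= 100).
```

In real systems the complexity proxy, the clustering, and the budget rule would all be far more sophisticated, but the shape of the decision (more redundancy for variable concepts, less for stereotyped ones) is the one described above.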

Speaker 0

但概念界限该如何划定呢?比如从大象和狗延伸到哺乳动物,再延伸到更广的范畴...这或许就是需要Datalogy的原因?因为确实很难界定。

How do you draw the line of a concept, though, right? Because then it's like, well, the elephant and the dog, but what about mammals, and then what about, you know what I mean? It's like, how should people think about it? Maybe that's why you need Datalogy, because it's hard to talk about.

Speaker 2

确实如此。这本质上是个实证问题——就像所有事情一样。每个数据集都可以选择不同粒度,最终这是个可调节的超参数:你需要在创建新概念和保持概念统一之间找到平衡点。正如你所说,这正是我们进行数十万次实验的原因,需要通过大量实验才能掌握其中的规律。

Yeah, no, I think that's right to some extent. I mean, look, it's an empirical question, like all things are, right? With every data set you can choose a different level of granularity; ultimately it's a hyperparameter, a knob that you can tune for how aggressive you're going to be with respect to creating new concepts versus keeping concepts together. And to your point, it's why we've run hundreds and hundreds of thousands of experiments to try to figure this out. I think this is something that just requires a lot of experimentation to understand how to do.

Speaker 2

我们面临的挑战不仅在于使系统在单一数据集上有效,还要构建能自动适应任意数据分布、在新分布上实现零样本推断的系统。因此我们需要解决两个问题:一是如何推进数据策展的前沿,二是如何实现分布外泛化——确保优秀的数据策展方法能推广到新数据分布。

And I think one of the challenges we have is not only do we have to make this so that this works on one data set, but we also have to build a system that can automatically adapt to any arbitrary data distribution and be able to make the appropriate inferences in zero shot on a new data distribution. So we kind of have these two sets of questions. First off is like, how do we push the frontier of data curation forward? And then second of all, how do we do out of distribution generalization where we say, hey, we have this great data curation approach. How do we make sure that this generalizes to a novel data distribution?

Speaker 1

不知道现在是否合适,但我想简要了解数据集的发展史。可能话题太大?我们之前做过'数据集101'专题,那算是我们最早期的节目之一,因为想让人们了解数据集的重要性。

I don't know if this is like a good time, but I was gonna ask for like a brief history of datasets. It might be too much. I don't know. I'll just list them off, because we've done the Datasets 101 episode. I think it was, like, one of our earliest episodes by far, because we want people to know the datasets.

Speaker 1

我认为大家最初都是从Common Crawl开始的。我觉得每个实验室都有自己的网页抓取数据。你觉得是这样吗?还是他们直接从Common Crawl起步?

And I think everyone starts with Common Crawl. I think every lab has their own web scrape. Would you say that's true, or do they start from Common Crawl?

Speaker 2

目前来说,是的。正如我所说,大多数实验室确实把主要时间和精力都投入在——没错——为自己构建更好的Common Crawl版本上。

At this point, yeah. Like I said, this is where most of the labs, I think, have actually invested most of their time and effort: in building better versions of Common Crawl for themselves.

Speaker 1

好的。我来列举几个来源,你有见解可以随时补充。GitHub是代码来源,可能还有Stack Overflow,尽管最近被限制了,我不太确定。

Yeah. I'll just name check some of these. If you have commentary, just, you know, chime in. GitHub, the source of code, maybe Stack Overflow even though that's cut off these days. I don't know.

Speaker 1

人们还会从其他渠道获取代码吗?

Do people get code from anywhere else?

Speaker 2

显然有些地方可以购买代码数据。但公开代码的话,这些是最常见的。我个人发现一些有趣现象:项目星标数并不能预测数据对模型是否有用——这倒不意外。

I mean, I think there are obviously places where you buy code data. But for public code, I think those are the most common. There are some interesting things about those that I just personally find surprising. Stars are not a good predictor of whether data is useful for models or not. Not surprising.

Speaker 2

最受欢迎的项目库未必质量更高,至少在提升模型编程能力方面是这样。StarCoder论文和其他几篇研究都证实了这点,有些数据筛选的反直觉现象确实令人惊讶。

I think the most popular repos are not necessarily higher quality, at least with respect to improving models' coding capabilities. We've validated this. I haven't done it myself; the StarCoder paper has done it, and there have been a couple of other papers that have all shown that. It's something that I just consistently found to be a little bit surprising. There are a lot of things that are kind of counterintuitive about data curation.

Speaker 1

这说明我没仔细读论文——他们发现过什么能标志优质代码的特征吗?

This shows that I haven't read the paper, but did they find anything that was, like, a sign of a good code

Speaker 2

没有特别显著的预测指标。说实话,有些简单启发式规则比如代码长度反而更有效,但都没有绝对区分度。

base? There wasn't anything that was super predictive. Oh, man. Honestly, in some ways, some simple heuristics, like length, actually ended up being better, but nothing was super discriminative

Speaker 2

这还挺有意思的。

There, which is kind of interesting.

Speaker 1

明白。继续往下说,ArXiv相当于论文界的GitHub。图书数据有books one、two,当然还有颇具争议的books three。

Okay. Cool. I'm gonna keep going. ArXiv, which is, you know, GitHub for papers. Books: Books one, Books two, and obviously Books three, controversial.

Speaker 1

我认为Anthropic会因为Books Three被起诉。

I think Anthropic's getting sued over Books three.

Speaker 2

是的。我觉得很多人都会被起诉。Meta也因为Books Three惹上了官司。

Yeah. I think a bunch of people are getting sued. Meta's also getting sued over Books three.

Speaker 1

某种程度上说,我们能不能...直接翻篇?我不确定。书籍应该属于转换性使用范畴吧。不知道你对这事怎么看?

In some sense, like, can we just, like, look past it? I I don't know. It's like books are a transformative use. Like, I don't know if you have a view on this.

Speaker 2

最近那个判决很有意思——虽然是上诉法院的裁决,估计后续还会打到更高法院。他们判定只要购买了书籍就属于合理使用。所以你不能下载盗版Books Three来训练模型,那属于盗版侵权。但如果你合法购买了所有书籍,用它训练就构成合理使用。我觉得这个界限划得挺合理。Books Three还有个有趣之处是包含大量成人内容,如果你仔细翻看的话。

Well, I think the recent ruling was interesting, although it was an appellate court ruling, so presumably it's going to go to a higher court afterwards. But what they ruled was that it's fair use so long as you purchase the book. So, you know, you can't download Books three and then use it, because that's piracy, and you've stolen the books in the first place. But if you bought a copy of all of those books, then you can train on it and it counts as fair use, which I think is an interesting, and to me pretty reasonable, line to draw. One fun thing about Books three is that it also has, like, a lot of not-safe-for-work stuff in it, which is kind of interesting if you actually go and look through it.

Speaker 0

应该给Books Three弄个Stripe一键结账功能,买下Books Three后直接从仓库调货...我在想这要花多少钱,肯定有人算过,我去查查。

There should be a Stripe one-click checkout for Books three. Just buy Books three out of the warehouse and get it all shipped. I wonder what the cost would be. I'm sure somebody's run the numbers. I'll look it up.

Speaker 1

不知道你能否评论这个——在Meta的诉讼案里,我记得有封内部研究科学家讨论Books Three的邮件,扎克直接说'尽管用,这都是公开资料'对吧?

I don't know if you can comment on this at all, but in the Meta lawsuit, I remember there was, like, an email thread with some of the research scientists inside of Meta talking about Books three, and Zuck was like, just do it. This is public. Right?

Speaker 2

对,那封邮件应该是公开的,也作为了诉讼证据。

Yeah. That was, I think, public and part of the lawsuits.

Speaker 1

嗯。有什么想法吗?

Yeah. Any reflections, comments?

Speaker 2

我只能说在Meta时,数据集的法律合规问题确实越来越棘手。很多时候只有特定人员才有权限批准...

All I can say is that when I was at Meta, certainly legal stuff around data sets was very challenging and becoming increasingly challenging. And there are a number of situations where the only person that could approve things

Speaker 1

was

Speaker 2

我认为扎克是因为风险的规模。但这确实让Meta在后期发布内容时变得更加困难,尤其是在我们能对任何数据集做什么方面,因为现实地说,像Meta、OpenAI和Tropic这样的公司是这些诉讼的主要目标。

Zuck, because of the scale of the risk, I think. But it definitely made publishing at Meta near the end more challenging, around just what we could do with any dataset, because, I mean, realistically, companies like Meta and OpenAI and Anthropic are big targets for these lawsuits.

Speaker 1

是啊。所以我对Llama四发生了什么事的阴谋论是律师介入了。律师们对数据集动了手脚。

Yeah. So my conspiracy theory for what happened to Llama 4 is the lawyers got to it. The lawyers got to the datasets.

Speaker 2

所以他们不得不改变使用的数据?

And they had to change what they use?

Speaker 1

他们不能,是的。他们就像双手被绑在背后,而其他实验室没有,仅仅因为Meta有一个活跃的诉讼。

They couldn't, yeah. They just had their hands tied behind their back when other labs did not, just because Meta had an active lawsuit.

Speaker 2

我认为这是可能的。我想可能更多是因为持续扩展和将其作为目标的挑战。实际上,这也是我进入数据领域并创立Datalogy的很多原因——扩展定律一直很糟糕。扩展定律论文显示的是存在一种可预测的关系,Kaplan那篇。对,Kaplan那篇。

I think that's possible. I I I think probably more of it just has to do with the challenges of just continuing to scale and having that be the goal. Is actually a lot of the reason why I got into data and started Datalogy was that the scaling laws always were terrible. What the scaling laws papers showed was that there was a predictable relationship The Kaplan one. Yeah, the Kaplan one.

Speaker 2

性能和计算数据之间存在可预测的关系,对吧?这很有用。但这是个糟糕的可预测关系。幂律衰减扩展非常糟糕。这意味着每次数据量增加10倍,性能的边际回报就会递减。

There's a predictable relationship between performance and compute and data, right? That's really useful. But it's a bad predictable relationship. Power-law scaling is terrible. It means that every time you 10x your data, you get a diminishing marginal return on performance.

Speaker 2

这就是为什么会有那些预言:哦,GPTN训练要花费一万亿美元。因为你拿着那条扩展曲线,就天真地外推它。我认为我们在某种程度上看到了超大模型的失败,比如4.5和Llama四等。我认为挑战在于继续天真地这样做,你必须想办法打破它。

This is why you had these prognostications: oh, GPT-N is going to cost a trillion dollars to train. It's because you take that scaling curve and just naively extrapolate it out. I think that's what we've seen to some extent with the failure of the mega models, right, with 4.5 and Llama 4 and others. I think there's a challenge in just continuing to do that naively, and you have to figure out how to break it.

Speaker 2

我认为有几种理论方法可以打破它,而且它们并不互相排斥。我打赌数据质量是一个重要的方法。实际上,在很多方面,那篇为Datalogy奠定基础的论文《超越神经扩展定律》很幸运获得了NeurIPS的最佳论文。那篇论文表明,如果你正确使用数据,你实际上可以改变扩展定律本身。一个有趣的技术部分是,我提到我们真正关心的是从下一个数据点中学到多少新信息。

I think there are a number of theories of ways to break it, and I don't think they're mutually exclusive. My bet is that data quality is a massive way to do this. And in many ways, actually, the foundational paper for Datalogy is called Beyond Neural Scaling Laws, and it was fortunate to get a best paper award at NeurIPS. What that paper showed was that if you use your data correctly, you can actually bend the scaling laws themselves. And an interesting technical part of this: I mentioned that what we really care about is how much new information you learn from the next data point.

Speaker 2

技术上来说,这是每个数据点的边际信息增益。困惑度是它的另一个变体。它们之间存在对偶性。事实证明,至少在感知器中我们可以证明这一点,因为通常你只能证明这么多,Nathan。在小规模上,这项工作由Ben Sorscher领导,他是一位非常出色的研究生,我和他一起完成了这篇论文。

So technically that's the marginal information gain per data point. Perplexity is another variant of it. There's a duality between them. It turns out that we were able to prove this in perceptrons at least, because that's generally all you can ever prove. So at small scale, and this work was led by Ben Sorscher, who was a really fantastic grad student I worked with on this paper.

Speaker 2

他证明了幂律衰减扩展与边际信息增益也以幂律形式衰减之间存在直接的对偶性。这就是为什么你会得到幂律扩展,因为每个连续的数据点教给你的越来越少,而且遵循幂律。所以你的性能也会以幂律形式衰减。如果你能让它保持平坦,那么你就改变了扩展定律,突然间你学习的速度会大幅提升,因为你学习的信息量不会随着数据集大小而衰减。这些在理论上都是可以实现的,我们提出了几个指标,让我们朝这个方向迈进了一步。

And what he showed was that there's a direct duality between power-law scaling and the fact that the marginal information gain per data point also decays as a power law. And that's why you get power-law scaling: every successive data point is teaching you less and less, and it follows a power law, so your performance decays as a power law as well. If instead you can keep that flat, then you bend the scaling law, and now all of a sudden you learn dramatically faster, because the amount of information you're learning is not decaying with dataset size. Now, that was all in theory what you could accomplish, and we proposed a couple of metrics that got us one step there.
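A back-of-envelope calculation makes the stakes of "bending" a power law concrete. The exponents below are invented for illustration and are not taken from the paper; the point is only how sensitive the data requirement is to the exponent when you invert the scaling relation.

```python
# Back-of-envelope illustration (exponents are invented, not from the paper):
# under power-law scaling L(N) = l0 * N**(-alpha), each 10x of data buys a
# shrinking improvement, so reaching a loss target is exponentially expensive
# in 1/alpha. A curation method that effectively raises alpha reaches the
# same loss with far less data.
def tokens_to_reach(target_loss: float, l0: float = 10.0, alpha: float = 0.1) -> float:
    """Invert L(N) = l0 * N**(-alpha) for the dataset size N."""
    return (l0 / target_loss) ** (1.0 / alpha)

naive = tokens_to_reach(3.0, alpha=0.095)    # uncurated exponent (assumed)
curated = tokens_to_reach(3.0, alpha=0.150)  # curated data, steeper decay (assumed)
print(f"data needed, naive:   {naive:.3e}")
print(f"data needed, curated: {curated:.3e}")
print(f"speedup from curation: {naive / curated:.0f}x")
```

Even a modest change in the effective exponent translates into orders of magnitude less data for the same target loss, which is the sense in which flat marginal information gain "bends" the curve.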

Speaker 2

但在许多方面,可以说数据学的核心意义在于如何实现那篇论文所展示的潜力?我们如何真正将其变为现实?我认为从根本上说,如果我们想要良好地实现规模化,就需要在这方面做得更好。

But in many ways, would actually say that the whole point of datalogy is how do we realize the potential that was shown in that paper? How do we actually make that a reality? And I think fundamentally, if we want to get scaling to work well fundamentally, we need to do a better job here.

Speaker 0

你们是否在持续评估这些开放数据集的质量?最新的开放数据集相比旧的是有明显提升,还是只有边际改善?

Are you measuring the quality of these open data sets over time? Are the most recent open data sets better than the older ones at a good rate or like just marginal?

Speaker 2

它们确实在改进,但相对于发展空间和潜力而言,我认为进步还不够。Nemotron的质量其实与DCLM非常接近,它晚六个月发布,包含更多独特标记,团队当时对此大做文章。

They do get better, but not relative to the headroom and potential, I would say. Nemotron is actually pretty similar in quality to DCLM. It came out about six months later. It has more unique tokens; they made a really big deal about it having more unique tokens.

Speaker 2

但平均质量相当接近。当思考我们在Datalogy能实现什么时,通常会沿着我提到的三个维度:训练更快、训练更好、训练更小。第一个问题通常是训练速度——给定某个基线数据集,我们能用多少更少的标记量、以多快的速度达到相同性能?

But on average, the quality is pretty comparable. So when we think about what we're able to accomplish at Datalogy, we usually think along these three axes I mentioned: train faster, train better, train smaller. So typically the first question is train faster: given a certain baseline data set, how much faster can we achieve the same performance, and in how many fewer tokens?

Speaker 2

目前我们已能比DCLM快12倍达到同等性能,用不到10%的标记量就能匹配完全收敛训练的效果。

So we're able to now get to the same performance as DCLM about 12x faster. So, you know, in fewer than 10% of the tokens, we can match what you get from training to convergence.

Speaker 1

你说的性能是指GPQA这类指标,还是损失值?

And when you say performance, you mean like GPQA or you mean loss?

Speaker 2

我们通常采用15个与模型规模相关的标准基准任务的准确率,比如MMLU、ARC、RACE等。

Yeah, we typically take the accuracy across 15 standard benchmark tasks that are relevant for a given model size. So your MMLUs, your ARCs, your RACEs, etcetera.

Speaker 1

这些指标的问题在于是否存在测试集过拟合?你们肯定清楚这点。

The problem with those is, like, are you training to the test? I'm sure you know this.

Speaker 2

这正是我们极其谨慎的地方,因为很容易过度拟合这些基准,最终得到非常脆弱的模型。特别是在使用合成数据时——这是我们数据学工作的重要部分。如果方法正确,合成数据能带来显著提升,但错误使用的方式也很多。

And that's something that we're super careful about because it's really easy to overfit to these benchmarks of course and then end up with models that are really brittle. I think this is something that we've seen, especially with synthetic data. And synthetic data is a big part of what we do with Datalogy. We found that it can drive like pretty dramatic gains if you do it correctly. There are lots of ways to do synthetic data incorrectly.

Speaker 2

我们见过不少模型在大量合成数据上训练后,基准测试表现优异但实际通不过用户体感测试,最终无人使用。为此我们采取多重措施:最重要的是保留完全不接触的测试集,仅偶尔查看;同时也不参与其他大量评估,确保模型后续接受真实检验。这就是我们的根本衡量方式。

We've seen a number of models, right, that are trained on a lot of synthetic data and end up doing really well on benchmarks, but then kind of don't pass vibe checks and people don't really use them. So we do a lot to try to prevent this. First and foremost, we keep a held-out set of test sets that we only look at very occasionally. And we also don't evaluate on a whole bunch of other evals that our models end up getting evaled on later, to try to really ensure this. But yeah, this is fundamentally how we measure.

Speaker 2

我们会参考一系列基准的平均值,试图判断在当前能力范围内什么是公平合理的。这通常是我们首要考虑的因素。接着我们关注训练效果提升——在相同计算预算下,特定数据集能带来多大改进。我们已能在不同数据集和评估标准上超越最佳开源数据集4到5个百分点,某些评估指标的差距甚至远超这个平均值。

We look at an average of benchmarks, just trying to think about what's fair and reasonable with respect to what we can do. So that's the first thing we typically look at. Then we look at train better, of course: under the same compute budget, how much better can you do with a given data set? We're able to beat the best open data sets by anywhere from four to five points, depending on the specific data set and eval. Some of the evals actually are much bigger than that; it's four to five points on average.

Speaker 2

这些是绝对数值差距。我们发现要达到同等性能水平,在基础数据集上需要延长5到10倍训练时间,因为每提升一个准确度百分点的难度是指数级增加的。最后是模型精简:在保持性能前提下,我们能做出多小的参数规模模型?目前已经实现参数减半、训练更快,且大幅超越基于未筛选/其他筛选数据集训练的大模型。

And those are absolute points. We generally find that in order to get that same performance from training longer on the baseline data sets, you'd have to train on those baseline data sets at least five to 10 times longer to try to match that performance because every successive point of accuracy, of course, gets harder and harder to achieve. And then finally, smaller. Basically say, Okay, holding performance constant, what's the smallest parameter count model that we can get to outperform? We can already get models that have fewer than half the parameters and also train faster and also outperform the larger models trained on the uncurated or alternatively curated data sets by a large margin.

Speaker 2

绕了这么大圈子其实是想回答:开源数据集是否跟上了这种进步节奏?我们团队规模很小(现在约30人,此前成果多由不足20人团队完成),按常规标准计算资源也不多(远超学术界但远不及前沿实验室),却取得了显著成果。我认为这归功于巨大的改进空间——目前已实现10倍增益。

So this is a big roundabout way of getting to the answer of whether the open data sets have kept up with this improvement. We have a fairly small team, we're now about 30, and most of the results I've discussed were achieved with a team of under 20, because we've grown quite a bit in the last couple of months. And with not that much compute by common standards, more than academics certainly, but nowhere close to the frontier labs, we've been able to achieve, I think, pretty dramatic results. I think the reason for this is because there's so much headroom here. We've already been able to get 10x gains.

Speaker 2

我认为至少还有100倍提升空间。当前许多应做之事尚未开展,而已实施环节也远未优化:合成数据生成方式、过滤方法、基于模型的筛选、嵌入向量筛选等环节都有巨大改进潜力。但挑战在于开源数据集社区缺乏足够动力——毕竟最具动力的实验室往往最不愿分享成果。

I think there's at least another 100x behind this that are still to be done. There's so much stuff that we're just not even doing right now that I know makes sense to do, let alone all the things that we are doing that I know we can be doing better, that we're still very suboptimal with respect to how we're doing this. Like I know that the way we do our synthetic data right now could be much better, that the way we do our filtering could be much better, the way we do our model based filtering, our embedding based filtering, all these different aspects could be much stronger. So I think there's just so much headroom here. I think the challenge is that there's not a huge incentive to do this in the open data set community.

Speaker 2

这个重担就落在艾伦研究所、DCLM、Hugging Face等机构肩上。但这个问题确实需要专注于此的完整公司来解决。前沿实验室都设有数据团队,但据我了解这些团队普遍资源不足,总在为争取关注而挣扎——这现象在Meta、DeepMind等机构都存在,也是我选择创立Datalogy而非在Meta内部推进的原因。

I mean, the labs which have the biggest incentives obviously have strong incentives not to share anything with respect to that. So you're left to kind of the Allen Institute, things like DCLM, Hugging Face, etcetera, to make progress there. But I do think that this is a hard enough problem that it really demands a whole company that is really focused on this. I think what you see in all the Frontier Labs is that they have data teams. And if you talk to the folks that work on those data teams, what you'll kind of systematically hear is that typically they're under resourced relative to the gains that they're delivering, that they're always having to fight for attention.

Speaker 2

数据问题本身应该成为终极目标,而非只是实现目标的手段。这需要组建由真正热爱数据研究的顶尖人才组成的团队——而这类人才本就稀缺。作为数据团队难以实现,但作为数据公司则大有可为,这正是我创立Datalogy的核心原因。

And this is just like a fundamental thing that I saw at Meta, I saw at DeepMind, and I've heard at all these other places. It was a big part of why I decided to start Datalogy instead of doing this within Meta. I had the opportunity to start a data team there and that was to try to centralize this. But fundamentally, I think that this is such an important problem that it's a problem that needs to be the end itself, not just the means to the end, which I think is what you see in many of these big groups. You need to have a large team of really talented people who are really passionate about looking at the data and there aren't that many people who are that passionate about it to just focus on how do we build the best possible data sets for model training.

Speaker 2

我认为以数据团队形式难以达成目标,而数据公司模式具有真正优势——这也是我创立Datalogy的重要考量。

I think it's hard to do this as a data team. I think there's a real benefit of being a data company. And that's a lot of why I started Datalogy.

Speaker 0

您如何看待开源数据集领域的经济生态演变?现有开源数据集虽好但未必适合生产系统,而像贵司这样的企业在此基础上优化。是否会出现某种决裂——比如质疑为何私有化改进开源数据却不回馈?贵司有计划开源其他数据集吗?

How do you think the almost economics of the open source datasets world evolve? Because you basically have these open source datasets that are good, but maybe they're not quite as good to make production data systems. And then you have companies like yourselves that are sitting on top of it. Do you think at some point there's gonna be some sort of rupture between like, hey, why are you just taking my open source dataset and making it better in private for people without contributing back? And do you guys have plans to then open source other sets?

Speaker 0

这本质上是个开放性问题:这些改进成果真正适合开源,还是应该保持私有?

I I think there's like kind of this open question of, are these things actually useful in the open or should you just do it in private?

Speaker 2

这是个好问题,我们深思过。首先需要说明:虽然我们服务开源模型训练者,但产品设计主要面向同时使用开源和专有数据的客户。这些专有数据可能是企业十年积累的业务数据,或是从标注服务商采购的数据——有些客户同时拥有三种数据源。

Yeah, it's a great question, and one that we've thought a lot about. So first off, one thing to note is that while we do work with folks who are just training on open data, in general we really built our product and designed it to work with companies that are training on a combination of open source and proprietary data. And that proprietary data could just be data they've been collecting as a matter of business for the last decade, or it could be data that they've sourced from a data annotator or another data provider. And some folks we work with have all three.

Speaker 2

你们会使用公开数据,他们会使用自己获取的数据,然后还会利用业务中已有的数据。我认为这正是我们关注的焦点所在,当然我们也非常期待与那些基于更开放数据集进行训练的团队合作。我发表相关研究已超过十年,这对我而言意义重大,也是我们在Datalogy深思熟虑的课题。我认为当今创建初创企业——尤其是以科学为核心的初创企业——面临的挑战之一,正如我提到被吸引创办Datalogy的原因,正是这种矛盾张力,对吧?

You're going to use open data, they're going to use data that they've acquired, and then they're going to use data that's part of their business to begin with. And that's, I think, a lot of where our focus goes, although of course we are excited about working with lots of folks who are training on more open data sets. I published in the open for a decade, more than that even. This was very near and dear to my heart, and it's something that we thought a lot about at Datalogy. I think one of the challenges of building a startup today, especially a startup for which science is a critical component, which, as I mentioned, is one of the things that really attracted me to starting Datalogy, is this tension, right?

Speaker 2

本质上我们必须建立可持续的商业模式。为此,必须构筑竞争壁垒。我认为竞争壁垒主要来自三个方面:其一是科学专业知识,其二是工程基础设施及自主实现的难度。

Fundamentally, we have to build a business. In order to do that, we have to have a moat. And you can think about three places where a moat could come from. One is science know-how. One is engineering infrastructure and the challenge of just implementing this yourself.

Speaker 2

最后还存在品牌壁垒,这是最终可能达到的境界。目前我们离品牌壁垒还很遥远。我期待未来当人们想到数据与AI时,会立即联想到Datalogy——'这就是我的首选'。但现阶段我们必须依赖前两种壁垒:科学专业知识和工程基础设施。

And then finally, there's a brand moat that you can eventually reach. We're very far from a brand moat at this point in our journey. Eventually I would love to have a brand moat where whenever anyone thinks data and AI, they think datalogy and oh, that's where I should go first. I hope that we get to that point. But in the meantime, we have to rely on the other two moats, on the science know how and the engineering infrastructure.

Speaker 2

在公开数据方面,我们发现工程基础设施确实能形成壁垒,但科学专业知识壁垒同样至关重要。现有证据表明这点非常关键。例如客户常问的第一个问题就是:'与最佳开源数据集相比如何?'如果我们完全公开构建最优数据集所需的一切,有些人就会直接转向开源方案。这正是我们面临的挑战所在。

I think on the open data side, what we've seen is that the engineering infrastructure definitely can be a moat, but unfortunately the science know-how moat is actually pretty important. And a lot of the evidence we've seen so far has suggested that it's meaningful. As an example, with many of the customers we talk to, one of the first things they'll ask is, hey, compare to the best open source data set. So if we were giving away everything we needed to in order to build that best open source data set, some folks would just go there. So I think that's where our challenge has been.

Speaker 2

目前我们的应对策略——对此我感到满意——是通过博客文章大量分享我们的方法论原理,但不达到完全可复现的程度。这已比大多数大型实验室开放得多。比如Gemini技术报告的数据章节只用一段话概括:'数据质量是打造优秀模型的最关键因素,我们使用了算法和启发式方法'。

Now what we've tried to do, and I think we've done a good job of it, and I'm generally happy with the balance we've struck, is try to, in the blog posts that we put out, give a lot of intuition as to kind of what we're doing and how it works without necessarily getting to that point of reproducibility. That's, I think, much more open than you see most of the big labs be. If you look at like the data section of like the Gemini Tech Report, it basically says like data quality was the single most important thing for making a great model. One paragraph. We used algorithms and heuristics.

Speaker 2

最近有人指出,作为合成数据使用方法的改写技术正受到更多关注。

It's like, great. I think some people were even pointing out recently that there's been a lot more attention on rephrasing as a method for using synthetic data.

Speaker 1

是苹果那篇论文吗?

Was it the Apple paper?

Speaker 2

苹果论文和Kimi论文都提到了这点,还有其他多篇。有人指出我们在去年11月的博客就详细讨论过这个技术。改写技术的首创者Pratyush Maini是我们的首批员工,我们已对此进行了显著改进并拓展了新应用。坦白说,本可以完全不提这些技术细节的。

The Apple paper, the Kimi paper has mentioned this, a bunch of others, and some folks recently pointed out that, hey, in our blog post from November we were talking a lot about that. That's something that we do a lot of. Pratyush Maini, the guy who first came up with rephrasing, was one of our first employees. So we've improved on that pretty dramatically and taken it to new places. But I think there would have been an incentive to just not even talk about that

Speaker 0

抱歉打断——这是否典型体现了你说的现象?你们在数据层面讨论时无人关注,直到Kimi论文带着模型出现,人们才意识到改写技术的重要性。但你们早就阐述过这点,只是当时没有模型佐证。你认为这种'无模型不重视'的现象是否仍是开放科学的局限?

at all. Sorry. Just on that, do you feel like this is a great example of that? You were talking about it in the data, and then the Kimi paper comes out with a model, and then people are like, oh, the rephrasing is important. But you're like, hey, I was telling you that before, I just didn't have a model to show you that it was important. Do you think that's still, even in open science, a limiter for people, that if you don't have a model, people don't care?

Speaker 0

DeepSeek也是如此。论文里很多内容早已被知晓,但直到实际应用人们才真正关注。

Same with DeepSeek. A lot of the things in the paper were kind of known, but then once you have them applied, people care.

Speaker 2

我认为这确实是存在的现象,也印证了我们之前讨论的那种文化激励机制——人们往往将其视为达到目的的手段。我完全理解这种心态。毕竟当我们销售更好的数据时,最终是在销售更优质的模型,性价比更高的模型。但人们除非被现实狠狠打脸才会重视这一点,我认为这既是悲剧也是机遇。虽然我希望情况并非如此,但既然现状如此,这正是Datalogy公司看到能真正产生影响的机会所在。

I think that's certainly something that happens, and I think it speaks to the same sort of cultural incentives we talked about earlier, where people tend to think about this as ultimately a means to an end. And I understand why that is, of course. Ultimately, when we sell better data, we're selling a better model at the end of it, a more cost-effective model. But the fact that people don't care about it as much unless they're smacked in the face with it is, I think, both a tragedy and an opportunity. I would love it if that weren't the case, but given that it is, that's the opportunity we see at Datalogy to really make an impact here.

Speaker 1

可能有点跑题,但你提到了合成数据和改写重组,我觉得现在正是讨论的好时机。我原以为Datalogy主要做数据过滤工作,但合成数据似乎略有不同?它确实属于提升数据质量的范畴,但又不同于过滤。用改写重组来创建合成数据是否正确?还是说在你看来合成数据还包含其他部分?

This might be a little bit of a tangent, but you mentioned synthetic data, you mentioned rephrasing, so I figured now's a good time to go into it. You know, I figured that most of the work of Datalogy is filtering, but I see synthetic data as something slightly different. It is in the general domain of improving data quality, but it's different than filtering. Yeah. Am I right to equate synthetic data with rephrasing, or are there other parts to synthetic data in your mind?

Speaker 2

没错。合成数据包含不同方面,主要有两部分。不过请允许我先谈谈过滤与其他方式的区别——我过去常用'数据过滤'或'数据修剪'这样的术语。

Yes. I think there are different parts of synthetic data. There are two parts. But let me first actually just comment on the filtering versus curation distinction. I used to actually use the words data filtering or data pruning.

Speaker 2

实际上我提到的那篇NeurIPS论文,标题里就用了'数据修剪'这个词,《如何通过数据修剪克服规模损失》。创办Datalogy后,我刻意将术语改为'数据策展'而非'数据修剪'或'过滤'。因为策展的内涵远不止过滤——过滤只是判断'这个数据点不好,我们要剔除它'。

And actually, that paper I mentioned that was at NeurIPS has data pruning in the title; it's about how you beat power-law scaling through data pruning. When I started Datalogy, I really changed the language to be data curation over data pruning or data filtering. And that's because curation is a lot more than just filtering. Filtering, saying, hey, this is a bad data point

Speaker 2

这当然是我们工作的重要部分,但还包括数据重新平衡:对某些数据分布进行上采样,对其他分布下采样。这不一定是过滤,可能只是调整权重系数。数据呈现顺序也影响重大——我们现在通过离散课程学习在多阶段训练中看到了这点,这与过滤无关。数据批处理方式也很关键。

And we want to get rid of it. That's absolutely an important part of what we do. But curation is also about rebalancing datasets: upsampling certain data distributions and downsampling others. That might not mean filtering; it might just mean changing the weighting with which you sample. The order in which you present data can be really impactful, curricula; we've now seen this with discrete curricula for multi-phase training, and that's not filtering. The way you batch the data can be an important factor.

Speaker 2

合成数据可以是重要因素。数据源混合方式等所有这些都超越了单纯过滤。虽然过滤始终是我们重视的核心工作,但范畴要广阔得多。回到合成数据问题,我认为高层次来看有两种途径。我们更侧重改写重组这种,但另一种也存在机遇。

Synthetic data can be an important factor, and so can the way you mix sources. All of these things go beyond just filtering. So filtering is a very important part of what we do, and it will always be something we care a lot about, but curation is much more than that. Okay, so now to the question about synthetic data. At a high level, I think there are two approaches to synthetic data, and we have focused more on one of them, the rephrasing one, than the other, although I think there is opportunity in the other one as well.
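The rebalancing idea above can be sketched concretely. A toy illustration (the source names and weights here are hypothetical, not Datology's actual mixture): instead of deleting data, curation can simply change how often each source is sampled.

```python
import random

# Hypothetical source weights for illustration; real curated mixtures
# are far more fine-grained than this.
mixing_weights = {"web": 0.5, "code": 0.3, "academic": 0.2}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Draw one source according to the mixing weights: upsampling and
    downsampling distributions without filtering anything out."""
    sources, w = zip(*weights.items())
    return rng.choices(sources, weights=w, k=1)[0]

rng = random.Random(0)
draws = [sample_source(mixing_weights, rng) for _ in range(10_000)]
print(draws.count("code") / len(draws))  # close to the 0.3 target weight
```

Reweighting like this keeps every document available to the model; only the effective epoch count per source changes.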

Speaker 2

第一种途径是创建新数据,其中蕴含的知识主要来自生成该合成数据的模型本身。

So the first approach is create new data where the knowledge that's in that data is largely coming from the model that's generating that synthetic data.

Speaker 1

啊,那这就是蒸馏了。

Oh, that's distillation then.

Speaker 2

算是蒸馏的变体。这种合成数据可视为伪装的蒸馏,是非常典型的案例。当人们批评合成数据会导致模型坍塌等问题时,这些批评主要针对的就是这种从模型内部生成全新数据的方式。这是第一种类型。

It's a version of distillation. This kind of synthetic data can be construed as distillation in disguise, and it's a very clear case of it. When you think about the criticisms of synthetic data around model collapse and things like that, I think they largely apply to this version, where you have net-new data creation coming out of these models. So that's approach one.

Speaker 1

我插一句——还有模型隐写术,可以把偏好隐藏在模型里再进行蒸馏。

I'll slip one in there. There's also model steganography where you can sort of hide preferences in a model and distill it down.

Speaker 2

确实如此。现在我们看到了近期围绕OWL的那些动态。

Absolutely. And now we've seen the recent owl results around that.

Speaker 1

如果人们搜索Anthropic的OWL,就能找到相关信息。

If people search Anthropic OWLs, you'll see it.

Speaker 2

没错。另一种方式是这种重述、改写的方法。这些信息实际上源自你最初用于调整重述的数据。模型所做的只是重新格式化数据,或以新方式呈现,可能更便于模型学习。

Yeah, exactly. The other way is this rephrasing, rewriting approach. Here, the information in the data is actually coming from the data you're conditioning the rephrasing on in the first place. All the model is doing is reformatting the data, or presenting it in a new way that may be easier for a model to learn.
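The distinction Morcos draws here can be sketched in code. `llm` below is a hypothetical text-generation callable, not a real API; the point is only where the information comes from in each regime.

```python
from typing import Callable

def net_new(llm: Callable[[str], str], topic: str) -> str:
    # Regime 1: the knowledge comes from the generator model itself,
    # which is why this regime behaves like distillation in disguise.
    return llm(f"Write an article about {topic}.")

def rephrase(llm: Callable[[str], str], source_doc: str) -> str:
    # Regime 2: the knowledge comes from source_doc; the model only
    # reformats it, so even a weak model can produce useful training data.
    return llm(
        "Rewrite the following text clearly, preserving every fact:\n"
        + source_doc
    )
```

In the second function the source document is part of the conditioning, which is why rephrasing is bounded by the quality of the data rather than by the capability of the model.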

Speaker 1

对,就是清洗数据,对吧?

Yeah, cleaning, right?

Speaker 2

某种程度的清洗...可能是清理数据,也可能是让信息更易获取,或是将信息转换成更能代表模型下游将面对场景的格式。所以我确实认为合成数据带来的一个明确变化是:我们将更多训练后阶段的数据引入了...

Cleaning it in some sense. It could be cleaning it, it could be making the information more accessible, or it could be putting that information in a format that's more representative of what the model is going to be faced with downstream. So I do think one of the things that definitely happens with synthetic data is that we're bringing more post-training-like data into pre-training.

Speaker 1

听起来像是监督微调(SFT)。

Sounds like SFT.

Speaker 2

总的来说,我的一个观点是:我们多数训练后阶段的工作,其实在训练前和训练中期做效果更好。这只是规模问题。

And in general, one of my beliefs is that most of what we do in post-training is better done earlier, in pre- and mid-training. It's just a question of scale.

Speaker 1

你们之前没有那种规模条件。

You don't have that scale until now.

Speaker 2

正是。如果你假设预训练极其昂贵且只能极低频进行,而训练后阶段成本低廉,这种范式就成立。但一旦打破这个假设——我认为DeepSeek已经证明,只需边际成本几百万美元就能获得前沿模型。随着技术进步和算力降价,现在成本应该更低。我相信对多数机构而言,获得前沿模型的成本应控制在百万美元以内,至少在专业领域如此。

Exactly. If you assume the paradigm where pre-training is incredibly expensive and something you can only do very, very rarely, and post-training is cheap, then it makes sense. But as soon as you break that assumption: I think DeepSeek already showed you can get a frontier model for a marginal cost of a couple million dollars, and that's gone down since then, because we've gotten better at it and compute has come down in price. I believe that getting to a frontier model should now cost a million dollars or less for most organizations, at least in a specialized domain, right?

Speaker 2

考虑到企业需求,他们通常不需要全能模型,而是需要以尽可能低的推理成本,在限定领域实现极高准确率的模型。我认为很快就能以不足百万美元实现,这将彻底改变现有格局。

And when you think about what enterprises need, that's generally what they need. They don't need a model that can do everything. They need a model that can do a constrained set of things to very high accuracy for as low an inference cost as possible. And I think that will be under a million dollars very, very soon. That changes a lot of these dynamics.

Speaker 2

但回到关于这两种合成数据的问题。我认为其中一种是面向全新创造的,那里存在很大风险。正是这里会出现模型坍塌的担忧——当我在给定数据分布上训练生成模型时,它会过度拟合模态而欠拟合尾部。因此如果让它生成大量数据,结果将更集中于模态而缺乏尾部多样性。

But going back to the synthetic data question of these two different types. So I think there's one towards this net new creation. I think that's where you have a lot of risk. That's where you get the model collapse concerns where I train a model, I train a generative model on a given data distribution, it overfits the modes and it underfits the tails. So then if I have it generate a bunch of data, it's going to be more mode and less tail.

Speaker 2

如此反复多次后,最终会形成一个尖峰。

Then I do that a bunch of times and eventually I get a spike.

Speaker 1

得到德尔塔函数。只剩单一模态。

Get a delta function. Only mode.

Speaker 2

只剩模态。没错。这种现象的成因很合理。需要指出的是,如果在每个节点后过滤数据,这就变成了信息注入,可能打破整个循环并防止模型坍塌。某种程度上——

Only mode. Exactly. It makes sense why that happens. I will note that if you filter the data after each step, that's now information injection, and that can break the loop and, I think, prevent model collapse.
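The mode-sharpening loop described here is easy to simulate. A minimal sketch, assuming the "generator" systematically underfits the tails (modeled crudely as shrinking the spread it reproduces each generation):

```python
import random
import statistics

def fit_and_sample(data, n, tail_shrink=0.8):
    """Fit a Gaussian to the data, then sample from it with the tails
    underrepresented (tail_shrink < 1 stands in for a generator that
    overfits the mode and underfits the tails)."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data) * tail_shrink
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
for _ in range(20):                # repeatedly train on your own outputs
    data = fit_and_sample(data, 10_000)

# The spread collapses geometrically toward a spike at the mode.
print(round(statistics.stdev(data), 4))
```

Filtering between generations, as the conversation notes, injects outside information into the loop and can arrest this geometric collapse.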

Speaker 1

强化学习正是如此。

That's a bit what RL is.

Speaker 2

某种程度上强化学习确实如此。完全可以从这个角度理解。实际上很多研究表明,强化学习本质上是在激发预训练模型的潜能——无论是随机奖励还是单个示例,它只是在调整分布,相当于对齐模型原有的内在分布,我认为这与这种思路高度吻合。

Which is a little bit what RL is, yes. I think you can absolutely view it that way. And a lot of the work suggesting that RL is really just eliciting the capabilities of pretrained models (random rewards, or training on a single example) is, I think, very much in line with that way of thinking about it: the RL step is just aligning the model to a distribution it had in the first place.

Speaker 1

你从完美模型(环境或验证器之类)中进行蒸馏,然后将其精华注入目标模型。这过程精妙绝伦。

You're distilling from a perfect model, which is the environment or the verifier or whatever, and then you're distilling that into the thing. It's amazing. It's beautiful.

Speaker 2

但改写技术的精妙之处在于:执行改写的模型只需掌握改写技巧,无需理解内容本身。这意味着可以用较弱模型进行改写,却能生成可训练出远优于改写者模型的数据。对于这种伪装式蒸馏,我总体上怀疑能否通过全新数据创造获得超越合成数据生成者的模型——除非在大模型上实施严格拒绝采样,因为当你判定哪些合成输出优劣时,本质上注入了新信息。

But the cool thing about rewriting is that the model doing the rephrasing just needs to know how to rephrase; it doesn't need to know anything about the content itself or understand it. That means you can use a pretty weak model to do the rephrasing and have it generate data that can teach a model much better than the rephraser. With this distillation in disguise, by contrast, I'm generally quite skeptical that you can get a model better than the teacher generating the synthetic data when you do net-new data creation. You possibly could through some sort of heavy rejection sampling on the big model, because you're effectively inserting new information when you judge which of the synthetic outputs is good or bad, right?

Speaker 2

这里引入了新的监督信号。不过我对此持怀疑态度。相比之下,我们即将在未来一两周发布关于合成数据生成(我们称之为'超越网络')的博客,其中包含精彩科学实验,旨在探索既能分享科研成果又能保持商业可持续性的平衡点。我们实际证明:通过这种方法获得的模型表现,远优于直接使用所有原始token数据训练的结果。

There's some new supervision coming in there. But I'm generally skeptical of that. Whereas for rephrasing, we actually have a blog post coming out in the next week or two about our synthetic data generation, which we call BeyondWeb. It will have some cool scientific experiments in it too, to our earlier point of trying to find the balance where we can share some of the science but do so in a way that's sustainable for our business. And one of the things we show there is that by doing this, you can actually get a model that does much, much better than if you had trained on all of the raw tokens in the first place.

Speaker 2

因此通过有效改写,你实际上能突破数据壁垒,获得优于任何数据生成模型的成果。我认为这在改写场景极具可行性,因为绝大多数信息源自数据本身而非模型。

So by doing this rephrasing effectively, you can actually break through the data wall and get models that are better than either of the models that generated the data. With rephrasing, I think this is super possible because most of the information is coming from the data, not from the model itself.

Speaker 1

关于这一点我有几个后续问题,一直很好奇。教科书就是你们所需的全部吗?

A couple of follow ups on that, just things I've always wondered. Are textbooks all you need?

Speaker 2

不,教科书并非全部所需。我认为教科书很棒,里面包含大量优质内容和高质量数据点。但显然教科书的数据分布也非常狭窄。关于数据质量,这次访谈中你们最该记住的就是多样性。就像很多方面一样,对吧?

No, they are not all you need. I think textbooks are great, and there's a lot of really great content and high-quality data points in them. But textbooks are obviously also a very narrow data distribution. If there's only one thing you take away from this entire interview about what makes for good data quality, it's diversity. In many ways, right?

Speaker 2

我曾做过分布外泛化的研究,我们进行过许多严谨实验——比如刻意构建数据分布的某个角落,保留某些从未见过的组合来测试模型泛化能力。后来大语言模型和现代训练方法出现了,它们提出:如果根本不存在分布外数据呢?如果我们训练所有数据让一切都在分布内呢?

I used to do all this work on out of distribution generalization and we had all of these like, you know, very careful studies where we would say, Okay, let's, you know, make this corner of the data distribution, then we leave this held out where it's never seen this combination of things and let's see if it can generalize. And then LLMs and the modern way of training models came along and said, Hey, what if nothing was out of distribution? What if we just made it so that we train everything and everything's now in distribution?

Speaker 1

顺便说,这符合AGI的理念对吧?所以你们不妨...

And by the way, that is in line with AGI, right? So you might as well.

Speaker 2

我们基本上就是这么做的,效果惊人得好,远超任何人——或者说大多数人的预期。我本人就非常震惊,当初我坚决认为仅靠规模扩展不可能实现组合性。但事实证明,可以。

And that's basically what we've done and it's worked. It's worked shockingly well, like way beyond anyone, I think, or most people would have expected. I certainly was shocked by it. I made a strong bet that there is no way you could get compositionality just from scaling. And well, you can.

Speaker 2

当规模足够大时确实可行。我...

Turns out, it does work when you get big enough.

Speaker 1

其实我指的是微软那几篇Phi论文,一二三四。很多论文都用教科书格式重述或改写内容。我总觉得有点盲目崇拜——难道因为模仿维基百科或教科书文体,模型就学得更好?这未经证实,不能想当然。

What I was really referencing was the Microsoft Phi papers, right, one, two, three, four. A lot of them do their rephrasing or rewriting in textbook format. And I feel like there's a little bit of cargo culting there, like, oh, just because you write like Wikipedia or write like a textbook, the models learn better. That's not automatically proven to be the case.

Speaker 2

这可能也是那些模型基准测试与实际应用存在差距的原因——数据分布太窄了。合成数据的根本问题就在于总会存在偏差。虽然我们投入大量精力通过多种文体格式改写来提升多样性,但风险始终存在:过度狭窄的分布会导致模型输出分布尖锐化,反而降低多样性。

I think that's also probably part of the reason you see a big difference between the benchmark scores of those models and their real-world usefulness: they went to too narrow a distribution. And I think that's the fundamental problem with synthetic data, that you're always going to have some bias. You can do a lot to make it more diverse, and we've put a lot of effort into finding ways to do that. For example, we rephrase into many, many different styles and formats.

Speaker 2

不过教科书万能论有个观点是正确的:重复高质量标记几乎总比接触未知质量的新标记更好。用高质量数据迭代训练,永远优于获取等量未知质量或平均质量的数据——这里平均质量指的是未经筛选的网络数据,即便经过一定过滤。

That's really important for getting stuff that's good. But I think this is the risk, right? You go to way too narrow a distribution, models are always going to be fairly peaky in their output distribution, and that actually ends up reducing diversity. That said, there is one takeaway from "textbooks are all you need" that I think is correct: repeating higher-quality tokens is almost always better than seeing net-new lower-quality tokens. So epoching over higher-quality data is almost always better than getting the same amount of new data of unknown or average quality, average in this case being what you get from an internet dump, even a reasonably filtered one.

Speaker 2

高质量数据永远更优。

It's always better.

Speaker 1

我所做的修改或想要委托进行的研究是,与其在高质量数据上再增加一个训练周期,不如找到高质量数据后先进行改写,然后基于改写后的数据进行训练。这可能会带来额外收益。我还没见过有论文专门探讨这个方向。

The modification I made, or the study I would want to commission out of that, is: instead of doing another epoch on high-quality data you've found, go and paraphrase it, then train on that. Maybe that gets you additional gains. I don't think I've seen any papers to that effect.

Speaker 2

Kimi论文其实做过相关实验,他们尝试增加多个训练周期,观察对每条数据的改写次数,并得出了一些有趣的结果。

The Kimi paper actually had an experiment to that effect, where they tried adding multiple epochs, looked at how many rephrasings they did of each, and had some interesting results there.

Speaker 1

有意思。另一个问题是关于课程学习的。课程学习曾风评不佳,为什么现在又复兴了?发生了什么变化?

Amazing. And then the other question was more on curriculum. Curriculum learning had a bad rep for a while. How come it's back? What's changed?

Speaker 2

确实有多方面原因。这很有趣,因为我在2023年中决定创办Datology并融资时,曾向初期团队成员强调课程学习将非常重要,但当时很多人都说'课程学习根本没用'。

Yeah, so a bunch of things. And this is really interesting, because when I was out initially deciding whether to start Datology, raising, and talking to various initial recruits, it was mid-2023. At the time I was saying curricula are going to be a really important aspect, and a lot of people basically just said, no, curricula don't work.

Speaker 2

他们说尝试过很多次都失败了。但我觉得课程学习这类理念注定会成功——它从逻辑上就站得住脚。斯坦福有篇精彩论文用图论解释:把每个节点看作要学习的概念,边代表概念间的依赖关系。

Like, we tried this a bunch of times and curricula don't work. But curricula are one of those ideas that, I think, always had to work, in the sense that it just made too much sense; it might be hard to figure out how to make it work well, but it had to work. There was actually a really cool paper from Stanford with a nice way of conceptualizing this: imagine a graph where each node is a different concept or idea you want the model to understand, and the edges are the dependencies between those concepts, right?

Speaker 2

如果概念A有助于学习概念B,就画条A→B的边。想象这是包含世间所有概念及其关系的巨型图——如果图是空的,说明知识间毫无关联,课程学习确实没意义;如果图是完全连接的,说明所有知识互为前提,课程学习同样无效。

So if concept A helps you learn concept B, there would be an edge from concept A to concept B. So now this is the graph. Imagine this graph of all concepts in the world and all the different edges between them, right? Huge graph. If that graph is empty, then it would mean that nothing is helpful for learning anything else.

Speaker 2

但显然现实世界既非空图也非完全图。有些依赖关系非常明确(比如不懂加减法就学不会乘除法),有些则比较模糊。我一直相信课程学习终将奏效,之前的困境在于:当数据已被充分饱和训练时,课程学习除非能突破学习瓶颈,否则优势有限。

And then curricula would not make any sense; you should just randomly order things. If the graph were complete, with an edge of equal weight between every pair of nodes, it would similarly mean everything is equally useful for learning everything else, and again curricula wouldn't help. For any other graph besides those two, curricula make sense. And I think it's pretty obvious that neither of those is the graph of the actual world we live in.
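The graph argument above maps directly onto topological ordering: any graph between the two degenerate extremes admits a non-trivial curriculum. A minimal sketch, using a hypothetical four-concept dependency graph rather than anything from the Stanford paper itself:

```python
from graphlib import TopologicalSorter

# An edge A -> B means "concept A helps you learn concept B".
# TopologicalSorter takes each node mapped to its prerequisites.
prerequisites = {
    "addition": set(),
    "subtraction": {"addition"},
    "multiplication": {"addition"},
    "division": {"multiplication", "subtraction"},
}

# A curriculum is any ordering consistent with the dependency edges.
curriculum = list(TopologicalSorter(prerequisites).static_order())
print(curriculum)

# The two degenerate cases from the conversation: an empty graph makes
# every ordering equally good, and a complete graph privileges none, so
# curricula only pay off in between, which is where the real world sits.
```

Here `static_order` guarantees every concept appears after all of its prerequisites, which is exactly the property a data curriculum tries to approximate.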

Speaker 2

在监督学习时代,数据集已被充分训练,课程学习或许能加快收敛速度,但这并非关键瓶颈。所以没人愿意投入精力设计复杂课程——毕竟把ImageNet训练周期从160轮缩减到80轮意义不大。

Clearly the world does have dependencies, some very obvious, like the fact that it would be hard for me to do division and multiplication if I didn't understand addition and subtraction, and some much vaguer. I have always believed this has to work, and the challenge has largely been that if you're fully saturating your data, there's really no advantage to a curriculum unless you couldn't learn the task otherwise. Generally, the idea behind curricula is that they make you much more efficient. But in the supervised learning world, we were fully saturating these datasets. So maybe a curriculum would get you there faster, but that wasn't the bottleneck or the limiting factor.

Speaker 2

但现在情况彻底改变:所有模型都面临数据欠拟合。此时设计优质课程可能意味着节省十倍训练成本,涉及数亿美元差异。离散课程已明确显示出巨大影响,我们说的'中期训练'本质上就是离散课程的后期阶段。

So there wasn't a clear incentive to actually run the hard experiments to figure out how to make a good curriculum, because who cares if I can get you to ImageNet performance in 80 epochs instead of 160? That's nice, but it's not a big deal in the first place. But now we're in a totally different world where all of our models are underfitting the data. This is super important, and getting a curriculum right could literally make the difference between spending ten times as much on a model training run, potentially hundreds of millions of dollars. All of a sudden, curricula make a ton of sense.

Speaker 2

甚至后期微调也可视为课程部分。目前Datalogy主要聚焦预训练和中期训练,但我特别期待探索后期阶段——这可能是未来重要方向。

So I think that's why it didn't make sense to put a lot of effort into this problem previously. Now we've seen pretty clearly with discrete curricula that this makes a big impact. And largely, what we call mid-training is really just a later phase of a discrete curriculum, which is another way of thinking about it. You could even think of post-training as part of a curriculum. In fact, one of the things I'm really excited about is that we've mostly focused on pre- and mid-training at Datology so far.

Speaker 2

我们所有客户最一致的要求之一就是,能否在训练后阶段做更多工作?能否帮助我们筛选训练后的数据?因此我们正开始大力投入这一领域。真正让我兴奋的是将整个流程——从训练前、训练中到训练后——视为一个完整的整体过程。然后提出诸如‘如何优化训练前数据以使训练后更有效’之类的问题。

One of the most consistent asks from every one of our customers has been: can you do more on post-training? Can you also help us curate the post-training data? So we're starting to invest pretty heavily there. And one of the things I'm really excited about is viewing this whole thing, from pre-training to mid-training to post-training, holistically as a single process, and then asking questions like: how do we optimize our pre-training data to make post-training more effective?

Speaker 2

我认为这些都是非常激动人心的问题,甚至在大实验室也看不到这种情况,因为他们有完全独立的团队——训练前团队、训练中团队、训练后团队各自为政。训练中团队是训练前团队的客户,训练后团队又是训练中和训练前团队的客户,但要让信号在所有环节间传递实际上相当困难。因此这确实是个令人振奋的领域。

These are, I think, really exciting questions and something that you don't see happen even at the big labs because they have entirely separate teams, right? There's a pre training team, there's a mid training team, there's a post training team. And like the mid training team is a customer of the pre training team and the post training team is like a customer of the mid training and pre training team, but it's quite hard to actually have signals propagate through all these. So I think this is a really exciting area.

Speaker 1

我想对此稍加追问。主流观点认为训练后阶段只是激发训练前已具备的能力。那么你能建立哪些反馈机制来影响训练前阶段?

I'll push you a bit on this. I think a popular view is that post-training is elicitation of capabilities you already trained in pre-training. So what dependencies can you have that feed back into pre-training?

Speaker 2

我倾向于认同这个观点。这种观点会强烈导向一个结论:你应该优化训练前数据来提升训练后流程的效果。比如思考如何优化训练前数据,使得测试时计算曲线的斜率或强化学习曲线的斜率尽可能陡峭;或者如何优化训练前数据让越狱曲线的斜率尽可能平缓。从根本上说,我认为训练后的对齐并非长久之计。

So I'm inclined to agree with that view. And that view leads very strongly to the conclusion that you should be optimizing your pre-training data to make post-training processes more effective. You should try to figure out how to optimize your pre-training data so that the slope of the test-time compute curve, or the slope of the RL curve, is as steep as it can possibly be. Or alternatively, how do I optimize my pre-training data so that the slope of the jailbreaking curve is as shallow as possible? Fundamentally, I don't think alignment in post-training makes sense as a long-term solution.

Speaker 2

如果通过训练后能轻松对齐模型,同样也能轻易使其失准。易入则易出,难入则难出——这是模型的铁律。如果在训练前阶段就做好对齐,最终得到的模型将极难被误导,除非注入海量异常数据。

If you can easily align a model through post training, you can easily misalign a model through post training. If it's easy to put it in, it's easy to take it out. If it's really hard to put it in, it's really hard to take it out. That's just like a truism of models, right? So if you do alignment during pre training, you'll actually end up with models that are I think largely impossible to misalign without putting a massive amount of data into them.

Speaker 2

这样做有很多好处。我们已经看到相关证据:比较Llama和Qwen在训练后调整的难易程度,对Qwen进行强化学习比对Llama容易得多。这可能因为Qwen在训练数据中注入了大量合成推理轨迹。

I think there are a lot of benefits to that. And I think we've also seen evidence for this, looking at the difference between Llama and Qwen with respect to their ability to be post-trained. It's much easier to RL Qwen than it is Llama. Likely that has to do with the fact that Qwen put a lot of synthetic reasoning traces into their training data.

Speaker 1

即使是用错误示例。

Even with wrong examples.

Speaker 2

没错,但即便是错误示例也印证了关键点——这很惊人不是吗?这清楚表明起决定作用的是基础模型本身,而非你提供的奖励信号。如果随机奖励仍能让模型学习,那显然不是奖励在起作用。

Yeah, but even with wrong examples, that's still the point here, which is wild, right? But I think that pretty clearly shows that it's the base model that's doing it. It's not the rewards you're giving. If you give random rewards and the model still learns, it's probably not the reward signal that's doing it.

Speaker 1

这很酷。

That's cool.

Speaker 0

我好奇客户使用情况。现在有多少人做训练后调整?当然目前还没有,因为你们尚未提供该功能。但当客户咨询时,他们主要是想对开源模型还是OpenAI模型进行训练后调整?他们具体需求是什么?

I'm just curious on the customer usage. How many people are doing post training? Obviously nobody today because you don't have it. But when people come to you, are people looking mostly to do post training on open models, on OpenAI models, or what do they ask for?

Speaker 2

是的,我们通常合作的客户要么是从零开始训练自己的模型,要么是在开源模型基础上,利用他们独有的领域特定数据进行持续预训练。我们主要服务于那些投入高昂成本进行训练的企业,通常这意味着至少处理数百亿token的数据,往往更多。因此,对于标准的小规模微调场景,我们关注较少。不过很多人一直问我们:到底谁在真正训练自己的模型?

Yeah, so we usually work with folks who are either training their own models from scratch or doing continued pre-training on an open model with a bunch of domain-specific data that's unique to their use cases and their business. We typically focus on folks doing training at a significant cost, which usually means at least a couple tens of billions of tokens, oftentimes more. So we don't focus as much on the standard small-scale post-training fine-tuning case. That said, a lot of people have consistently asked us: who's actually training their own models?

Speaker 2

比如,为什么不直接依赖这些开源模型呢?我认为人们选择自研模型有几个原因。首先,主权AI已成为重要需求领域——

Like, why don't I just rely on the open models? And I think there are a number of reasons why we see people do this. So first off, Sovereign AI has been a pretty big source of demand.

Speaker 1

这是我们重点关注的领域。

A big focus for us.

Speaker 2

许多国家希望拥有专属自己语言文化的模型,这当然需要他们具备出色的数据治理能力才能有效实现。

Lots of countries want models that they own, that are unique to their language and their culture, and that requires them to have really good data curation, of course, in order to do it effectively.

Speaker 1

容我追问,国家拥有模型这件事我其实不太了解。比如我来自新加坡,我们有SEA-LION模型,但它并非国家所有,我举不出其他由国家持有模型的例子。

Just to double click, countries owning models isn't actually a thing that I know about. You know, I'm from Singapore; we have the SEA-LION model, but it's not owned by the country, and I can't name any other country that owns a model.

Speaker 2

确实如此。目前主要是公私合作模式,政府提供巨额资助——

Yeah, I think that's actually correct. What you see right now is largely these public-private partnerships where governments are making pretty large grants.

Speaker 1

阿联酋的TII算是最接近的案例。

TII in the UAE is, like, the closest.

Speaker 2

对,这类情况存在。还有些资金来源模糊的案例。通常看到的是国家通过向私营企业拨款或建立公私合作关系来推进。此外,拥有海量自有数据的大型企业也是重要客户群体。

Yeah, I think you have those. You also have these places where the funding comes from the country and it becomes a little unclear where it's from. Usually what you see is countries making big grants to private companies, or public-private partnerships, to go build these things. So that's one big category. We've also seen a lot of larger enterprises that have a lot of their own data and want to do this.

Speaker 2

分析需求时,我们发现三大价值点:训练更快、效果更好、规模更小。何时何者重要?训练速度最易量化——比如原本需1千万美元的模型,现在80万美元就能训练完成。

When you think about this, ultimately what we see is that, across those three value buckets (train faster, train better, train smaller), which matters and when? Train faster is, in principle, the easiest one to quantify. I say, okay, this model would have cost you $10,000,000 to train, and I get it to you for a million dollars, or for 800,000, right?

Speaker 2

理论上这能省巨资。但现实中,没人想只花100万美元去训练一个1000万美元级别的模型。

Great. I saved you a ton of money. In practice, though, nobody wants to train a $10,000,000 model for a million dollars.

Speaker 1

但他们已经有了模型。

But they already have the model.

Speaker 2

他们已经拥有那个了。他们想用1000万美元训练一个价值1亿美元的模型。要知道,他们想训练得更好。所以从‘嘿,这个模型现在便宜多了’的角度看,训练速度通常不太重要。但从‘你能更快迭代’的角度看,这就重要多了,对吧?

They already have that. They want to train a $100,000,000 model for $10,000,000. You know, they want to train better. So train faster usually doesn't matter so much from the perspective of "hey, this model is now a lot cheaper." It matters a lot more from the perspective that you can iterate much faster, right?

Speaker 2

因为当你想到大多数机器学习工程师的工作流程时,你开始训练,然后坐着干等训练完成。你会找点别的事做,但很大程度上你在等待,你的迭代速度受限于训练时长。如果你能把一个模型从需要十天训练缩短到一夜之间,那么现有团队的效率会大幅提升,能进行更多迭代之类的工作。所以这才是我们通常认为最有价值的地方。大多数人最关心的是训练得更好,对吧?

Because when you think of the workflow of most ML engineers, you start a training, you go and you sit on your hands until the training finishes. You find something else to do, but largely you're waiting and your iteration is bounded by how long that takes. If you can take something from taking ten days for a model to finish training to being overnight, now your existing team is way more productive and can do far more iterations and stuff like that. So that's where we usually see that matter the most. Most people care the most about train better, right?

Speaker 2

我可以用相同的计算资源得到更好的模型,而我们完全可以通过数据实现这一点。数据实际上是计算的倍增器。因为所有模型都未能充分利用其数据集,如果你能让模型更高效地利用数据,你实际上提高了计算的价值。因为如果把计算看作投入一定金额获得一定性能回报,那么使用更好的数据意味着每美元投入能获得更高回报,这样计算就更有价值了。所以我认为‘训练得更好’才是最有意义的。

I can get a better model for the same compute, and we can absolutely deliver that through data. Data is effectively a compute multiplier. Because all models are underfitting their datasets, if you can make your model more data-efficient, you effectively make your compute more valuable. Think of compute as injecting a certain number of dollars and getting a certain amount of performance back: with better data, you get more performance back per dollar invested, and now your compute is more valuable. So that's where train better tends to be the most meaningful thing.

Speaker 2

但有趣的是,对于AI转型最前沿的大多数公司来说,‘训练更小的模型’才是最有实际意义的。因为当你考虑这些模型的总拥有成本时,推理成本将占绝对大头。全是推理成本。想象一家公司每年在推理上花费5000万美元——这在全局中不算多,对吧?如果你部署了一个规模是实际需求两倍的模型,第一年就会多花2500万美元。

But interestingly, for the companies that are most advanced in their AI transformation journey, train smaller is the one that I think actually means the most. When you think about the total cost of ownership of these models, it's going to be very, very heavily weighted toward inference. It's all inference. Think about a company spending, say, $50,000,000 a year on inference, which in the scheme of things is not very much, right? If you deploy a model that's twice as big as it needs to be, that's going to cost you an extra $25,000,000 in year one.

Speaker 2

而训练一个参数少一半但在你特定用例中表现相当甚至更好的模型,成本可能只要200万或300万美元。如果能轻松做到,这根本不用犹豫,对吧?如果非常困难,那确实不会考虑。但如果能轻松实现且一次成功,这就是明摆着的选择。毕竟未来五千万美元根本不算什么,对吧?我们都知道这些产品的用户群还只是最终规模的极小一部分。

The cost to train a model that has fewer than half the parameters but is just as good or even better at your particular use cases might be, say, $2,000,000 or $3,000,000. That's a no-brainer if you can do it easily, right? If it's really hard, then you're never going to do it. But if you can do it easily and get it right on the first try, it's a no-brainer. And soon, fifty million a year is not going to be very much, right? We know that all of these products have a tiny, tiny fraction of what their eventual user bases will be.
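The back-of-envelope economics of this exchange, using the illustrative figures from the conversation (these are round numbers for intuition, not real customer data):

```python
# Illustrative numbers from the discussion, not real customer figures.
annual_inference_spend = 50_000_000   # $/year serving the oversized model
oversize_factor = 2.0                 # model has 2x the parameters it needs
retrain_cost = 3_000_000              # one-time cost of a right-sized model

# Serving cost scales roughly with parameter count, so an oversized model
# wastes the fraction of spend attributable to unnecessary parameters.
waste_per_year = annual_inference_spend * (1 - 1 / oversize_factor)
first_year_net_savings = waste_per_year - retrain_cost

print(waste_per_year)           # 25000000.0
print(first_year_net_savings)   # 22000000.0
```

Under these assumptions the retraining pays for itself many times over within the first year, and the gap widens as inference volume grows.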

Speaker 2

我们仍处于非常早期的阶段。听这个播客的人都在不停使用AI,但世界其他地区还没有。所以这些模型的推理成本将飙升。如果你使用一个通用模型然后限制它说‘这个模型知道所有事,但现在只做这一件事’,这个模型会包含大量不必要的参数,将极大增加服务成本。因此我认为,当考虑企业需要‘一寸宽一英里深’的用例时——能完美执行少量任务,达到99.999%可靠性且成本最低——

We're still very much in the first inning here. Everyone listening to this podcast is using AI nonstop, but the rest of the world is not yet. So inference costs for these models are going to skyrocket. And if you take a general-purpose model and constrain it, saying "this model knows about everything, but now only do this one thing," that model is going to have a ton of parameters that don't need to be there and that massively increase the cost of serving it. So when you think about the use case of an enterprise, they need a model that's an inch wide and a mile deep: it can do a small handful of things, but it does them really, really effectively, to five nines of reliability, for as low a cost as possible.

Speaker 2

从经济学角度看,如果能轻松做到,自己训练模型确实非常合理。我们认为存在两大障碍:首先要搞定训练,然后要搞定数据。三年前训练确实非常困难,对吧?但Mosaic最早意识到简化这个过程存在巨大机遇。

The economics make it so that it really makes a lot of sense to do this yourself, if you can do it easily. The way we think about it, there were two big barriers: first you have to get training right, and then you have to get data right. On the training side, three years ago this was super hard, right? But Mosaic was the first to really recognize there was a huge opportunity in making it easy.

Speaker 2

如今训练已基本被SageMaker、Together等平台商品化,很多公司都能提供训练支持。但在数据方面,门槛依然高不可攀。这正是我们Datalogy的使命——如何降低这个门槛,让任何想训练模型的人第一次尝试就能使用最优质数据。他们不必像在沙漠中徘徊四十年,也不必先失败一百次——如果没有相关经验就一定会这样。

And now the training side has largely been commoditized by things like SageMaker and Together and lots of different folks who help you there. But on the data side, the barrier is just as high as ever. In many ways, that's our mission at Datology: bringing that barrier down so that anyone who wants to train a model can do so with the best-quality data on their first try. They don't have to go spend forty years in the desert, and they don't have to get it wrong a hundred times first, which is what will happen if you don't have this experience.

Speaker 2

相反,他们第一次尝试就能获得非常出色的模型。

But instead, on the first shot, they get a really great model.

Speaker 1

是的。关于训练更小模型的后续问题。我完全同意,我认为这是很多人正在投入的方向。你们主要是在数据层面开展工作,数据剪枝——现在这个词可能不太受欢迎了,或者说数据筛选,随便怎么称呼。

Yeah. Just a follow-up question on train smaller. I fully agree, and I think this is something a lot of people are investing in. You're primarily doing work on the data side: data pruning, which maybe is a bad word now, data curation, whatever.

Speaker 1

我想很多人,你知道乔纳森·弗兰克尔很早就在播客上提到过,但很多人当时赌的是对模型本身进行剪枝。比如你有一个正常尺寸的可用模型,然后直接砍掉超过某个阈值的部分。这种方法现在被证实彻底行不通了吗?

I think a lot of people, you know, Jonathan Frankle was on the podcast very early on, but a lot of people were betting on pruning the model itself. Like, you have a working model at a given size and you just lop off anything above, like, a certain epsilon. Is that confirmed to just be dead?

Speaker 2

说起来很有趣。乔纳森其实在我在Meta时曾和我一起实习,我们共同研究过这个。他提出了彩票假说,那真是篇精彩的论文。

So it's funny. Jonathan actually interned with me when I was at Meta, and we worked on this stuff together. You know, he had the lottery ticket hypothesis, which is a really beautiful paper.

Speaker 1

他现在基本上完全否认了这个理论。

Which he now largely disowns.

Speaker 2

你知道吗,当年我和乔纳森共事时,我们想创造一种彩票券初始化方法——就是采样时直接获得这种完美中奖券式的初始权重。但后来我们发现根本问题在于彩票券其实是数据依赖性的。只要数据分布稍有变化,中奖券就会发生巨大改变。我不认为剪枝技术已死。

You know, I had this whole idea when Jonathan and I worked together that we wanted to create a lottery ticket initialization. It would just be an initialization you'd sample from for initializing the weights that would then be one of these perfect winning-ticket initializations. But we actually found that the problem was that the lottery ticket was data dependent. And that was where the fundamental problem came in: as soon as you changed the data distribution a little bit, the winning tickets changed in a really big way. I don't think pruning is dead.

Speaker 2

参数剪枝绝对仍有价值,但实现其潜力确实存在挑战。需要明确的是,非结构化剪枝——即随机剪裁权重,把所有权重视作自助餐随机剔除——效果很好,能大幅削减参数量。但问题是非结构化剪枝无法带来明显的计算优势,因为你需要用稀疏矩阵表示,而稀疏矩阵乘法存在巨大开销。

Parameter pruning still absolutely has a place, but certainly we found it challenging to really realize its potential. One of the big tricks with parameter pruning, just to be clear, was that unstructured pruning, where you prune individual weights in any position, so you view all the weights as a smorgasbord, worked really well. And you could remove massive quantities of the weights with unstructured pruning. The problem is that unstructured pruning doesn't really give you a clear compute advantage, because you now need a sparse matrix to represent the result, and there's a pretty huge overhead to sparse matrix multiplies.

Speaker 2

GPU处理稀疏矩阵乘法并不擅长,虽然现在有些支持。

GPUs are not very good at sparse matrix multiplies. Like there's some support for them now.

Speaker 1

这方面有些硬件优化方案。

There's some hardware optimizations for that.

Speaker 2

确实有些专用硬件。人们讨论过制造擅长非结构化剪枝的ASIC芯片,但目前还没有特别成功的案例。如果有人能做出真正适配非结构化剪枝模型的方案,可能会很有效。而结构化剪枝——比如直接移除整个神经元单元——虽然实现简单,但效果差很多。所以我认为这个领域仍有潜力。

And there's some hardware. People have talked about building ASICs that would be really good at unstructured pruning, but I don't think I've seen one that works super well. If someone did make something that worked really well for models pruned in an unstructured way, that could be effective. Structured pruning, where you just remove a whole unit, a whole neuron, is really easy to make faster, but it just doesn't work nearly as well. So I think there's still potential here.
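
The distinction Ari draws here can be sketched in a few lines of NumPy. This is a toy illustration, not anything from Datalogy: the 8x8 matrix, the 50% unstructured ratio, and the magnitude and L2-norm criteria are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # weights of one toy dense layer

# Unstructured pruning: zero the half of the individual weights with the
# smallest magnitude. The matrix keeps its shape, so a dense matmul costs
# the same; you only win by switching to a sparse representation, which
# carries the overhead discussed above.
threshold = np.median(np.abs(W))
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: drop the four output neurons (rows) with the lowest
# L2 norm. The layer genuinely shrinks, so dense compute gets cheaper.
norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(norms)[4:])  # indices of the 4 strongest neurons
W_structured = W[keep]

print(W_unstructured.shape)  # (8, 8): same shape, half the entries zeroed
print(W_structured.shape)    # (4, 8): a genuinely smaller dense layer
```

The trade-off in the conversation falls out directly: the unstructured matrix still multiplies at full dense cost unless specialized sparse kernels or hardware pick up the zeros, while the structured matrix is smaller but, as Ari notes, tends to hurt quality more.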

Speaker 2

不过它并非我和许多人曾期待的那种万能解药。但有意思的是,用优质数据训练小模型的方法可以与其他推理优化技术互补。剪枝和量化显然仍将在加速推理中发挥重要作用,这些都能叠加在我们的工作之上,我觉得这很酷。

I don't think it's the panacea that I, and I think many others, had hoped for. That said, one thing that's cool about using better data to train smaller models is that it's complementary with any other approach for optimizing inference. So pruning and quantization obviously still have a lot of a role to play in helping inference go faster, and that would stack on top of anything that we're doing, which I think is kind of cool.

Speaker 1

是的。我认为还有一个宏大的挑战,一个极具价值的问题——无论是对于你个人还是普遍而言——那就是:对于给定的能力水平,最小可能的模型规模是多少?你对此有何见解?我曾与康奈尔大学的杰克·莫里斯做过一期播客,他提到似乎存在某种信息极限,我记得他的结论大概是每个参数需要八比特左右的数据量。

Yeah. Another one, I think, is kind of a grand challenge, golden question that would be very valuable for you, or just in general: this idea of, what is the smallest possible model for a given capability? Do you have any insights on that? I did a podcast with Jack Morris, who's out of Cornell. And I think there's some information limit, and I think he had some answer, like, it's eight bits per parameter or something like that.

Speaker 1

我记不清具体结论是什么了。

I I forget what the the conclusion was.

Speaker 2

没错。虽然我不确定能否给出具体数字,但可以肯定地说,比我们当前训练的模型规模要小得多。我们离这个极限还非常遥远。我坚信三年后绝大多数人使用的模型都会是十亿参数以下甚至更小的规模。这个趋势已经非常明显了。

Yeah. I'm not sure if I would put out a specific number, but I would definitely say far, far smaller than what our current models are trained to be. Like we are nowhere close to this. Like I am generally of the belief that most of the models that the vast majority of people will be using in, say, three years will be single digit B or smaller. I think we've seen this very clearly.

Speaker 2

看看Llama系列模型(不包括第四代),从Llama一到三代就能清晰发现:新一代的70亿参数版本性能已接近上一代的700亿参数版本,即便尚未完全持平。这种趋势在Qwen模型上也得到印证——某些小型Qwen模型的性能相比一年前的顶尖水平已经强得惊人。

Like, you look at just the Llama series; if you want to exclude Llama 4, do so. But with Llama 1 through 3, you can see pretty clearly that the 7B variant of generation N plus one is pretty close to the 70B variant from the prior generation, if not quite there. But there's still a very clear trend here. We're seeing this with the Qwen models, right? You look at some of these small Qwen models and they're just incredibly performant relative to what state of the art was a year ago.

Speaker 2

显然当前模型都过于庞大。我个人不看好万亿参数模型会成为下一个前沿,相反我们会重点优化推理成本。测试时计算范式也推动模型小型化——因为解决问题的总成本等于推理成本乘以思考步骤数,当需要大量思考步骤时,降低单次推理成本就变得至关重要。任何能加速单步推理模型的技术都能极大提升测试时计算的效率。

I think it's pretty clear that these models are way too big. I personally would bet against the next frontier being trillion-parameter models, and rather that we're going to really optimize the inference cost. I think test-time compute as a paradigm also really pushes you towards smaller models, right? Because if your cost of solving a problem is the cost of one inference step times the number of thinking steps, and you have to do a lot of thinking steps, then minimizing the cost of inference becomes really important. And anything we can do to make the inference model that is doing one step of thinking a lot faster enables test-time compute to be a lot more effective.
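
The arithmetic behind this is worth making explicit. A hypothetical sketch with made-up numbers, not figures from the episode: total cost is per-step inference cost times the number of thinking steps, so per-step savings compound over long reasoning traces.

```python
def total_cost(cost_per_step: float, thinking_steps: int) -> float:
    """Cost of answering one query under a test-time-compute paradigm."""
    return cost_per_step * thinking_steps

# Assume, purely for illustration, that a 10x-smaller model is
# roughly 10x cheaper per reasoning step.
big_model = total_cost(cost_per_step=1.0, thinking_steps=64)
small_model = total_cost(cost_per_step=0.1, thinking_steps=64)

# The per-step saving multiplies across the whole reasoning trace,
# so long thinking traces amplify the value of a cheaper model.
print(big_model, small_model)  # 64.0 6.4
```

With a single-step answer the absolute saving is small; with 64 thinking steps the same ratio saves most of the serving bill, which is the pressure toward smaller models Ari describes.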

Speaker 1

对。安德烈·卡帕西提出的'认知核心'概念是另一个方向——模型本身不存储知识,但能熟练使用工具来获取信息。从信息论角度确定这类模型的最小可行规模会很有帮助,比如在GPQA上得零分但在BrowserConf拿满分的那种。

Yeah. I think there's another version of this, which is the sort of Andrei Karpathy cognitive core concept of a model that doesn't know anything, but can use tools a lot to to find figure out. Again, another information theoretical limit that would be very helpful to figure out is what is the minimal viable model for that stuff? Like, zero on GPQA, 100 on BrowserConf.

Speaker 2

我非常认同这个理念,这完全可行因为知识存储确实占用大量容量。

I I really like that idea, and I think it's very possible to do that because, like, knowledge storing takes a lot of capacity.

Speaker 1

确实。

Yeah.

Speaker 2

需要消耗大量存储空间,但其实没必要。我早期有篇论文就论证过:当用随机标签训练模型时(这是当时验证模型是否死记硬背的常用方法),模型确实能完美记忆所有乱序标签——2017年ICLR最佳论文就展示了这个让人震惊的现象,当时人们难以相信模型能记住整个ImageNet数据集。

It takes a lot of capacity. But you don't need it. Actually, one of the first papers I ever wrote was about showing this. Training models on randomized labels was a common test at the time; it was one way you could prove that a model was memorizing, because once you randomize all the labels there's no true association left, so the model would have to memorize. And models could do this really well. There was an ICLR best paper from 2017 that showed this, and people were really surprised that models could memorize all of ImageNet.
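
The randomized-label test is easy to sketch. In this toy NumPy illustration a 1-nearest-neighbor classifier stands in for an over-parameterized network, since it is the archetypal pure memorizer; the data sizes are arbitrary assumptions and this is not the referenced paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features with labels assigned at random: there is no true
# input-label relationship, so any fit must be memorization.
X_train = rng.normal(size=(200, 16))
y_random = rng.integers(0, 10, size=200)

def nn1_predict(X_tr, y_tr, X):
    """1-nearest-neighbor: predict the label of the closest training point."""
    dists = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[dists.argmin(axis=1)]

# Perfect accuracy on random labels = pure memorization...
train_acc = (nn1_predict(X_train, y_random, X_train) == y_random).mean()
print(train_acc)  # 1.0

# ...but on fresh random data, accuracy collapses to chance level (0.1).
X_test = rng.normal(size=(200, 16))
y_test = rng.integers(0, 10, size=200)
test_acc = (nn1_predict(X_train, y_random, X_test) == y_test).mean()
print(test_acc)
```

This is exactly the diagnostic logic described: fitting shuffled labels proves memorization capacity, and the gap to held-out performance shows that the capacity bought no generalization.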

Speaker 2

现在看这很荒谬(模型当然能记住整个互联网),但当时人们觉得记住百万级标签简直不可思议。我们发现如果删除已记忆模型的某些单元,会对模型性能造成严重破坏。

Now this seems crazy, because of course models can memorize the whole internet. But at the time it was like, wait, they could just memorize a million labels? That's wild. And what we found there was that if you went and deleted units from a model that had memorized, it would be really damaging to that model.

Speaker 2

但一个真正学会泛化解决方案的模型,可以删除大量单元而依然保持稳健。这实际上非常清晰地证明了这一概念:你记忆的内容越多,占用的容量就越大。

But with a model that actually learned a generalizing solution, you could delete a lot of units and it would be pretty robust to that. So it's actually a very clear demonstration of exactly this concept: the more you memorize, the more capacity you're using.

Speaker 1

Dropout正则化。

Dropout regularization.

Speaker 2

Dropout具有多重特性。我认为有理由说,Dropout有助于防止记忆过度,促进学习更具泛化性的解决方案,这也是它效果良好的部分原因。不过,我认为实现这一点完全可能。我们正在这些模型中浪费大量容量,用于存储它们根本不需要的知识。

There are a lot of sides to dropout. And I think there's an argument to be made that dropout helps prevent memorization and helps learn more generalizable solutions, and that's part of why it worked well. But yeah, I think it's very possible to do this. I think we're wasting a ton of capacity in these models on knowledge that is just totally unnecessary for them to have.

Speaker 0

结束前,既然我们以Arcee模型开场却未深入讨论——最让我惊讶的是他们最初使用了23万亿个数据标记,而你们协助将其缩减到6.6万亿。从中有什么经验启示吗?这是个45亿参数的模型,与Gemma 4B相当,略逊于Qwen 3但大体同级。其他训练模型的团队应该借鉴哪些经验或做法?

Before we wrap, just because we started with the Arcee models and then we never talked about them: I think the most interesting thing to me was that they started with 23,000,000,000,000 tokens of data, and then you helped them get down to 6,600,000,000,000. Any learnings from that? And this is a 4.5B model, which is on par with Gemma 4B and a little worse than Qwen 3, but roughly the same. Any learnings there, experiences, things that other model builders should adopt?

Speaker 2

确实如此。我们最初整合了DCLM、Nemotron和FineWeb的数据集,简单拼接后总量约25万亿标记,最终产出7万亿。对我们而言,最振奋的是观察到模型的学习速度。

So yeah. For that one, we started with a combination of DCLM, Nemotron and FineWeb. We basically just concatenated them all together; it's about 25,000,000,000,000 tokens combined across all of those, and we produced 7,000,000,000,000 out of that. I think what was exciting to us about that was, in general, seeing the speed at which the model learned.

Speaker 2

在突破1万亿token之前,它一直稳定超越Gemma的表现,这确实很酷,我认为这在多方面突显了更高质量数据如何能让你更快获得更优异的性能。总体见解或收获,我想,作为我们首个公开讨论的真实客户案例——Arcee是我们公司成立以来首个公开宣传的客户,这显然是个激动人心的时刻。但更广泛地说,这很好地证明了结合所有这些不同技术能带来巨大提升。虽然我们一直强调这点,但能有实际案例展示还是很棒。这不是靠合成数据或简单过滤能达到的成果。

So it was beating Gemma pretty consistently before the 1,000,000,000,000-token mark, which was pretty cool to see, and I think it really highlighted how higher quality data can get you much better performance much more quickly. As for general insights or takeaways: it was exciting for us as one of our first real customer stories. Arcee is the first customer that we're talking about publicly since starting the company, so obviously that was an exciting moment. But more generally, it's a good showcase of the fact that combining all of these different techniques can give you a really big gain. That's one of the things we've been saying, but it's nice to have a real demonstration of it. This is not something where it was synthetic data alone taking us here, or filtering alone taking us here.

Speaker 2

关键在于我们如何真正整合这些技术。我们持续发现的一个事实是:当你试图让不同技术协同工作时,它们通常不会自然配合。虽然可以实现协同,但难度很大。因此最让我们兴奋的是证明了这种可能性。与此同时,人们往往首先会认为数据筛选不能叠加使用。

It was really about thinking about how do we actually combine all of these techniques. And one of the things we've consistently found actually is that when you take these different techniques and you try to make them work together, they don't generally. You can make them work together, but it's quite hard to do so. So I think what was quite exciting for us there was showing that that's possible. And then combined with that, I think people first off tend to think that you can't stack curation.
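
To make "combining techniques" concrete, here is a hypothetical toy pipeline composing just two of the stages discussed, exact deduplication and quality filtering. Every function, score, and threshold is invented for illustration; none of this is Datalogy's actual method.

```python
import hashlib

# Toy corpus: (text, quality_score) pairs. In practice the scores would
# come from a learned quality model; here they are made up.
docs = [
    ("the cat sat on the mat", 0.9),
    ("the cat sat on the mat", 0.9),   # exact duplicate
    ("buy cheap pills now!!!", 0.1),   # low quality
    ("transformers use attention", 0.8),
]

def dedup(docs):
    """Drop exact duplicates by hashing normalized text."""
    seen, out = set(), []
    for text, score in docs:
        h = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append((text, score))
    return out

def quality_filter(docs, threshold=0.5):
    """Keep only documents scored above a quality threshold."""
    return [(t, s) for t, s in docs if s >= threshold]

# Stacking is where it gets hard: order and interactions matter. Here we
# dedup first so duplicated junk doesn't waste filter compute, then filter.
curated = quality_filter(dedup(docs))
print([t for t, _ in curated])  # ['the cat sat on the mat', 'transformers use attention']
```

Even in this toy, the two stages interact (a duplicate-heavy corpus changes what the filter sees), which gestures at why combining real curation techniques "doesn't generally" work out of the box.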

Speaker 2

我们以部分最优筛选的开放数据集为起点,还能显著提升其质量,这个事实充分说明该领域仍有巨大发展空间。我们不需要依赖Common Crawl获取那些token——当然我们也在进行相关工作并认为有改进空间。但仅从现有数据集出发,我们现在已能构建更大规模的数据集。仅以该语料库为基础,我们就能扩展到15万亿规模,同时保持几乎相同的质量,这非常难得。

I think the fact that we started with some of the best curated open data sets and were able to make them dramatically better is a pretty good insight to the fact that there's still a ton of headroom left here. We didn't need to go to Common Crawl to get those tokens. We are, of course, doing work on that and we think there's a lot we can do to improve there. But just starting from that and we actually now are making bigger data sets from that. I think we can get up to 15,000,000,000,000 just starting from that corpus and still have pretty identical quality to that, which is pretty neat.

Speaker 2

这证明了技术可扩展性。我们另一个持续发现的规律是:当我们将筛选技术先后应用于DCLM和FineWeb时,原始数据集间的性能差距会延续到筛选后的版本——两者都有显著提升,但经Datalogy处理的DCLM始终优于处理的FineWeb。这说明我们还有大量可操作空间,这也是我最想强调的重点。

So I think showing that you can get there and that it really stacks. One of the other things we consistently find is that if we apply our curation on top of, say, DCLM and then we apply it on top of FineWeb, the gap between FineWeb and DCLM is maintained as the gap between Datalogy-curated DCLM and Datalogy-curated FineWeb. They both get a lot better, but Datalogy-curated DCLM is still better than Datalogy-curated FineWeb. So there really is a lot that we can do here. And I think that would be the biggest thing that I would say.

Speaker 2

我们仍有大量工作待完成,目前只是触及表面。这些结果令我们非常振奋——我们已拥有比Arcee训练时更好的数据集(该模型主要在五月份训练),并对即将开展的更大规模训练充满期待。

There's so much still left to do here. We're just scratching the surface. We're pretty excited about what these results showed. We already have better data sets than what Arcee trained on, because that model was largely trained in May, and we're pretty excited about all the next trainings that will go even bigger.

Speaker 1

我还有几个快速提问。根据你们的客户对话,大家最想要哪些数据?哪些数据是大家渴望却难以获取的?

I have a couple more lightning fun questions. What data does everyone want based on your customer conversation? What data does everyone want but is really hard to get?

Speaker 2

我认为领域专家的数据显然是大家最关注的。不过也要指出,大多数人其实并不清楚自己真正需要获取哪些数据。

I mean, I think expert data is the pretty obvious thing: domain expertise. That said, I would also note that most people don't know what data they actually should be getting.

Speaker 1

他们总是带着手头现有的数据就来咨询

They just show up with whatever they have and

Speaker 2

没错。我们经常震惊地发现,有些客户花费数百万美元筹备训练计划,精心设计模型架构,却在训练启动前两周才突然意识到:'我们需要优质数据集,能帮忙吗?'其实数据才应该是最优先考虑的事项。

Yeah. And something we've actually found shockingly frequently is that we talk to folks who have been planning a really expensive training run, a millions-and-millions-of-dollars training run. They've been thinking about the architecture they're going to use. They've been thinking about all this stuff. And then they reach out to us and they're like, hey, we realized we need a good data set, and we're planning to kick off training in two weeks.

Speaker 2

最令人惊讶的是,很多人对优质数据根本没有概念。他们眼中的'好数据'往往并不合格——这正印证了我们之前提过的DCLM观点。

Can you help us? And a lot of it's like, hey, you probably should be thinking about your data set before all the other things. If anything, that's actually the most important thing. So I think I would say the most surprising thing is maybe how often people don't even have a conception of what good data is. And oftentimes I think what they think is good data often isn't, which goes to the DCLM point I think that we mentioned in the past.

Speaker 2

数据质量判断非常反直觉,人类很难准确区分优劣。

It's very counterintuitive and really hard for humans to identify this is high quality, this is low quality.

Speaker 1

这算是个招聘问题:如果有人能解答数据效率难题,应该立即加入Datalogy。

This is a little bit of a recruiting question. What data efficiency question, if somebody had an answer to it, should make them join Datalogy immediately?

Speaker 2

如果你总是不自觉地反复检查数据,能说出C4数据集中最喜欢和最讨厌的样本——欢迎加入我们这个痴迷数据的极客团队。在Datalogy工作是否开心,很大程度上取决于你自主分析数据的热情。许多优秀研究者反而很少这样做,这很令人意外。

The first thing I would say is: if you are one of these people that keeps finding yourself just staring at the data, you keep going back into the dataset, if you can tell me what your favorite and least favorite C4 example is, you belong here. You should come join us and join a bunch of other nerds that love doing that exact same thing. In many ways that's the single biggest predictor of whether someone is going to be really happy at Datalogy: how much do you just look at the data in your own work? Because I think you'd be surprised by how many really talented researchers don't do it very often; they really just view it as a given. It's been pretty surprising across the board. That said, there are so many questions on the science side that I'm just super excited about.

Speaker 2

我们特别关注预训练与后期训练的交互作用。产品核心是让数据整理自动适应新分布——比如企业不愿提供专有数据时,我们必须让系统在其数据上自主调整。

I mentioned the interaction between pre- and post-training. That's definitely one that we're really excited about. One of the things that we really care a lot about is making it so that our product and curation automatically adapt to novel data distributions, right? It has to be fully automated. And we didn't talk about this too much, but one of our challenges often is that if we're working with an enterprise that has a lot of proprietary data, they obviously don't want to give that to us. So we bring our curation to their data, and this means it has to adapt automatically.

Speaker 2

这实际上是个棘手的分布外泛化问题。没有绝对完美的数据整理方案,只有针对特定下游任务的最优解。我们需要根据模型所需的XYZ能力,动态调整数据整理策略以确保相关性。

We have pretty limited access into going and looking at that data. So that's actually a really hairy and interesting out of distribution generalization problem. But it's also really important because there's no golden curation. A curation is only optimal with respect to a given set of downstream use cases or tasks, right? So we need to be able to define based off of if the model needs to be able to do X, Y, Z, how should we use that information to adjust the curation that we do to make sure that we're giving the data that's most relevant for solving tasks XYZ?

Speaker 2

这需要自动实现。我们为此开发了多种技术手段,但这是个广泛而基础的问题——我们希望将其应用于流程的每个环节,使合成数据的生成方式能根据下游用途动态调整。无论是数据过滤还是其他环节,我们的方法都会随之变化。这确实是个令人振奋的课题。本质上,我们试图解答的是:如何根据目标需求评估数据价值?

And that needs to happen automatically. So we have a number of ways that we can do that for a number of our techniques, but that's a very broad and general question that we want to apply to every part of our pipeline so that the way we do synthetic data differs based off of the downstream use case. So the way we're doing this, the way we're doing every different part, filtering, etcetera, is going to change based off of that. So that's another question that we're just really excited about. And fundamentally, anything about really trying to answer this question about how do you value data with respect to a target?

Speaker 2

谈及Datalogy的核心竞争力——每家公司都需要有难以复制的优势。对我们而言,我希望Datalogy能成为(事实上我认为已是)全球最擅长根据下游应用评估数据价值的团队。某种程度上,这堪称AI领域的NP完全问题。若能攻克此难题,便能无所不能。这正是我们专注的方向。

When I think of Datalogy and our core competency: I think every company needs to have an unfair advantage, some core competency that they do better than anyone else. For us at Datalogy, I want us to be, and I think we already are, the best in the world at valuing data with respect to a downstream use case. In many ways I think that's kind of the NP-complete problem of AI. If you can do that, you can kind of do anything. And that's the thing that we're really focused on.

Speaker 2

数据筛选显然是该核心能力的直接应用。但长期来看,公司愿景在于探索如何将这一核心技能拓展至更多领域。这里存在无数可能性,而解答这个根本性问题存在诸多切入点。

And of course curation is like the very obvious direct application of that core competency. But when we think about kind of the vision for the company in the long term, it's about saying what are all the other ways we can operationalize that same core skill set? And I think there are tons of really interesting ways things you can do there. But that's the fundamental question that we really want to answer. And then there are tons of different entry points to that question.

Speaker 2

若这个问题令你兴奋,若你曾在其他机构感受到数据团队被边缘化的困境,渴望加入一家真正以数据为使命的企业——正如公司名Datalogy(数据科学)所示,这正是我们存在的意义——那么你绝对应该联系我们。

But if that's a question that excites you, if you have been working on data somewhere else and you have felt this pain of being a second class citizen or having the data team be kind of dismissed and you want to be in a place where literally the only reason that the company exists is because data is all we care about, I mean, the name of the company, Datalogy, the science of data, that's why we're here, then you should absolutely talk to us.

Speaker 0

精彩。聊点八卦——谈谈Meta与超级智能。备注里提到你们融资时吸引了Yann LeCun、Geoffrey Hinton、Jeff Dean等顶尖投资人。当Ari声称他们拥有科学上的护城河时,这话可信度很高。

Awesome. And just to wrap on some gossip, let's talk about Meta and superintelligence. Just from the notes: when you talk about Datalogy's moat and whatnot, you raised a lot of money from very prominent people. You have Yann LeCun as one of your investors, Geoffrey Hinton, Jeff Dean. So when Ari says that they have a science moat, believe him.

Speaker 0

既然Yann是投资人,这问题可能有些敏感:你如何看待Meta的超级智能团队?Yann在LinkedIn表示他专注下一代AI而非当前技术,但有人会质疑为何十年前不先完善现有技术?

So maybe since you have Yann as an investor, this is more of a touchy question. But what do you make of the whole Meta superintelligence team? Yann was also on LinkedIn, and he was like, hey, at FAIR we're focused on the next generation of AI, not on this current generation, so my role is the same. But then maybe people might say, well, why didn't you do the current generation ten years ago?

Speaker 0

你如何看待Meta这一战略转变?考虑到其庞大平台和用户基础,这是否是个有趣的方向?首先关于...

What do you make of the whole change and whether or not you think this is an interesting direction for Meta, especially given the large platform and user base that they have? Well, first, with

Speaker 2

Yann本人,他无疑是卓越的科学家。但其志趣始终在科研而非管理。FAIR创立初期他仅运营一两年便交由Joelle Pineau和Antoine Bordes负责——特别是Joelle,在我任职期间她堪称FAIR的科学守护者,是位非凡的领导者。

respect to Yann specifically, I mean, Yann's an incredibly talented scientist, of course. But I think his preference has always been to do science rather than to run an organization. He ran FAIR organizationally for a year or two right at the very beginning, but pretty quickly he handed that off to other people. When I was there, it was Joelle Pineau and Antoine Bordes, and then Joelle for most of it, who really ran FAIR, and she was an incredible leader.

Speaker 2

我对她怀有深切敬意,再难找到比她更优秀的FAIR科学倡导者了。

I really respect her deeply and couldn't have asked for a better kind of advocate for science within FAIR.

Speaker 1

她离职时人们都说'FAIR要完了'。

When she left, people were saying like, this is the end of FAIR.

Speaker 2

我希望这不是真的,我也有过这种担忧。但我想Jan一直真心希望亲自投身科研,在我任职FAIR的大部分时间里,他基本是带着自己的小团队工作——几名博士后和访问学者,再加上通过纽约大学带的几个学生,在那里做自己的研究。所以我认为他从未,至少从一开始就没有,担任过为Meta制定AI战略的角色。我觉得这从来不是他想要的职位,他真正渴望的是专注研究本身。

I hope that's not true; I also had that concern. But I think Yann always really wanted to just actually do the science himself. For most of the time I was at FAIR, he operated with his own group, a couple of postdocs and visiting scientists, plus a couple of students through NYU, and he would do his own research there. So I don't think he was ever, or at least not since the beginning, in a role where he was defining AI strategy for Meta. I don't think that's the role he wanted at any point. I think he really wanted to be doing that research.

Speaker 2

因此我不认为他的角色会发生重大变化,毕竟他过去没参与战略制定,这也不是他的志向。不过这件事有个很酷的看点:它彰显了数据的重要性——Meta居然愿意为Scale AI这种"类收购、非收购"的交易投入如此巨资。

So I don't think that his role is changing very significantly, in the sense that he wasn't doing that previously and I don't think it was what he wanted to do. One thing that's pretty cool about it, obviously, is that it showcases the importance of data: that Meta is willing to spend quite this much on the Scale AI kind-of-acquisition, non-acquisition that we're seeing today.

Speaker 1

Alex Wang绝不会低估数据的价值,这么说吧。

Alex Wang is not going to underrate data, let's put it that way.

Speaker 2

没错,他深知数据的重要性。我们在这方面的工作确实与数据标注公司的模式截然不同——他们更侧重数据采集,而我们专注数据优化与整理。这个领域还有很大探索空间,此事无疑会引发关注。另外容我说句实话:每当扎克伯格押下重注,和他对赌通常不是明智之举。

Yes, he's not going to underrate the importance of data. And I do think this is an area where the stuff we've done is quite different from what we've seen from the data annotators, which have been more focused on collecting the data versus actually optimizing and curating it. There's quite a bit you can do on top of those things. So I think it definitely draws some attention to that. I will also just say, generally, when Zuck makes a very big bet, it's not proven wise to bet against him.

Speaker 2

历史经验表明如此。他多数豪赌最终都取得了成功,目前唯一存疑的是元宇宙项目。但我认为长期来看这个赌注终将兑现,雷朋智能眼镜就很惊艳,Reality Labs的许多技术积累都会融入其中。

Just historically, that's been the case. Like most of the big bets, I think, have panned out. I think the one that's still really up in the air is the metaverse. But I would actually argue that I think that's going to end up paying off in the long run. I think the Ray Ban glasses are pretty darn cool and a lot of the foundations of what was in Reality Labs will go into those.

Speaker 2

其实FAIR在重组后曾隶属Reality Labs约一年半。最初并不属于,后来被划归进去。如果没记错,我离职时FAIR官方仍属Reality Labs。至少有一年半到两年是这样的。所以部分奠定基础的AI投资,最初就来自元宇宙的投入。

Also, FAIR was part of Reality Labs for about a year and a half after one reorg. Initially FAIR wasn't, and then it got reorged into Reality Labs. So when I left, I think FAIR was officially part of Reality Labs, if I recall correctly. There was at least a one-and-a-half, two-year period where that was the case. So some of the AI investment that laid the foundations actually came out of that metaverse investment in the first place.

Speaker 2

话说回来,我们总说数据是算力的乘数,人才显然也是。考虑到他们在算力上的巨额投入,重金招揽人才完全合理。我很期待他们的动作,希望他们能高度重视数据建设。

That said, I think we talk about data as being a compute multiplier all the time. Talent, I think, obviously is a compute multiplier. And given the amounts that they're spending on compute, I think you can make a good argument as to why spending a crazy amount on talent is also worth it. So I'm excited to see what they do. Hope that they put a lot of focus on data.

Speaker 0

然后成为客户。没错。太棒了。好的,谢谢

And become customers. Yes. Awesome. Well, thank

Speaker 1

你抽空来聊天并坚持线下见面,因为你本人真的很有魅力。很高兴你这么做。

you so much for chatting and coming by and insisting on in person because you're actually very charismatic in person. So I'm glad you did this.

Speaker 2

非常感谢邀请,能线下畅谈真是件乐事。

Well, thank you very much. Thanks for having me and a joy to get to chat in real life.

Speaker 1

太棒了。真酷。

Awesome. Cool.

关于 Bayt 播客

Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。
