本集简介
双语字幕
说实话,o1 在解决谜题方面确实表现得很出色,它更像是一个技术演示。而 o3 则像是人工智能发展轨迹上的一次地壳运动般的转变。GPT-5 在某种程度上可以视为 o3.1,而我追求的是能带来下一次重大飞跃的东西。我们深知,历史上只有这一次机会,人工智能正在被构建、部署和发展。
o1, to be perfectly honest, was really mostly good at solving puzzles. It was almost more like a technology demonstration. o3 has really been something like a tectonic shift in the trajectory of AI. GPT-5 in some way can be considered as, like, o3.1, and what I am after is something that would be the next pretty significant jump. We understand that there's only one time in history where AI is being built, deployed, and developed.
我们共同参与这项比我们每个人都更伟大的目标。
We are together in this goal that is larger than every one of us.
大家好,我是 FirstMark 的马特·图尔克,欢迎收听 MAD 播客。今天我的嘉宾是 OpenAI 研究副总裁、全球顶尖 AI 研究者榜单 Metis List 成员杰里·特沃雷克(Jerry Tworek)。本期节目我们将深入探讨模型如何进行实际推理。
Hi, I'm Matt Turck from FirstMark. Welcome to the MAD Podcast. Today, my guest is Jerry Tworek, VP of Research at OpenAI and a member of the Metis List of the world's top AI researchers. In this episode, we go deep on how models actually reason.
我们还将揭秘OpenAI幕后故事:重大项目如何调配人手、为何实行全员信息透明、以及这种文化如何快速演变。请享受与杰瑞的精彩对话。嘿杰瑞,欢迎你。
We also go behind the scenes at OpenAI, how a few big bets get staffed, why everyone knows everything, and how that culture shifts fast. Please enjoy this great conversation with Jerry. Hey Jerry, welcome.
你好,非常高兴来到这里。
Hello, very happy to be here.
本次对话我们会频繁讨论推理这个概念。从宏观层面来说,推理究竟指什么?当我们与ChatGPT交流时,它声称正在思考,这背后实际发生了什么?
We are going to talk about reasoning a lot in this conversation. At a high level, what does reasoning actually mean? When we talk to ChatGPT and ChatGPT says it's thinking, what actually is happening behind the scenes?
我认为"思考过程"至少是个恰当的类比。在 AI 发展的早期,我们就一直怀有教会模型推理的目标和梦想:对一个问题反复思考,投入更多时间以获得更好的结果。当人类面对难题时,很少能立即给出答案:有时需要寻找答案,有时需要进行特定的计算。
I think that "thinking process" is at least a good analogy. In the early days of AI, we always had this goal, this dream, of trying to teach models to reason: thinking about a problem, spending more time to get better results. If a human is posed a hard problem, they very rarely answer straight away. Sometimes they need to find that answer. Sometimes they need to perform certain computations.
有时他们需要查阅一些信息。有时他们需要自学某些东西。这个过程就是推理,是得出一个你还不知道的答案。在某种程度上,它可以被称为搜索,但它并不是一种非常简单的搜索。搜索是一个含义丰富的词。
Sometimes they need to look up some information. Sometimes they need to teach themselves something. That process is reasoning: getting to an answer that you don't yet know. In some way, it can be called search, but it's not really a very naive search. Search is a loaded word.
但推理是得出答案的过程,而你需要做的工作通常比回答一个问题要长。我认为这里的区别在于,回答问题通常意味着你已经知道答案,你只是引出你知道的答案。而推理的过程是得出你不知道的答案。通常,你花在得出这个答案上的时间越长,无论你需要做什么来达到这个目的,结果就会越好。
But reasoning is the process of getting to an answer, and the work you need to do is longer than what is usually considered answering a question. I think the difference here is that answering a question usually means you already know the answer and you just elicit it. The process of reasoning is getting to an answer that you don't know. And usually, the longer you spend on getting to this answer, whatever you need to do to get there, the better it gets. And
自从你们大约一年前(我想是 2024 年 9 月)发布 o1 以来,我们都熟悉了思维链的概念。用外行的话来说,就是当你查询 ChatGPT 时看到的那些小消息,它在展示自己的解题过程,告诉你它在做什么。那实际上是在做什么?是一棵逻辑树吗?它是在一个接一个地排除选项吗?实际上发生了什么?
we've all become familiar, since you guys released o1, I guess a little over a year ago, back in September 2024, with the concept of chain of thought, which is, in layman's terms, the little messages that you see when you query ChatGPT, where it shows its work. It tells you what it does. What does that actually do? Is that a logical tree where it eliminates option after option? What actually happens?
语言模型在最基本的层面上做的事情——它们通常被称为"下一个标记预测机器",这种说法在强化学习时代已不完全准确,但它们仍然主要操作标记,而这些标记大多是文本。如今的语言模型也是多模态的,但主要还是处理文本。简化一下来说,语言模型生成文本,而思维链就是它们的思考过程用人类的语言和概念表达出来。我们看到的这一切之所以可能的魔力在于:当你用整个互联网、用大量人类知识和人类思考过程进行训练时,模型开始以某种方式学会像人类一样思考、像人类一样得出答案——因为它在训练数据所基于的文本中大量见过人类这样做。
What language models do, on their own, at a fundamental level: they are often called next-token prediction machines, which is not completely accurate in the age of reinforcement learning, but they still operate mostly on tokens that are mostly text. Language models these days are also multimodal, but they operate mostly on text. To simplify for a second, language models generate text. And what chain of thought is, is their thinking process verbalized using human words and human concepts. So the magic we are seeing, why this is all possible, is that when you train on all of the internet, on a lot of human knowledge and human thinking process, the model starts learning in some ways to think how humans do and to get to answers how humans do, from seeing humans do it a lot in the text the training data was based on.
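The generate-one-token-at-a-time loop described here can be sketched with a toy bigram model. This is only an illustration of the autoregressive loop, not of how a real LLM works (those use large neural networks over subword tokens); the tiny corpus and greedy decoding are made up for the example.

```python
from collections import defaultdict, Counter

# Toy "next-token prediction": a bigram model over a tiny made-up corpus.
corpus = "the model predicts the next token and the next token again".split()

# Count which token follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, n_tokens):
    """Greedily emit the most likely next token, one at a time."""
    out = [start]
    for _ in range(n_tokens):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(generate("the", 4))
```

Real models sample from a probability distribution rather than always taking the single most likely token, but the loop structure is the same: the text generated so far becomes the input for the next prediction.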
思维链基本上是在引出语言模型中这种像人类一样思考并得出答案的能力。早期思维链工作的很多内容是解决数学难题。最著名的、第一个在语言模型中引出思维链的提示语是所谓的"让我们一步一步来解决"。语言模型有一个非常经典的现象:如果你直接问它们某个数学表达式或某个谜题的答案,它们会尝试直接给出答案——尝试预测下一个标记——但它们会失败。
The chain of thought is basically eliciting that capability in language models, of thinking and getting to an answer like humans do. A lot of the early chain-of-thought work was about solving math puzzles. The first, most famous prompt to elicit chain of thought in language models was the so-called "Let's solve it step by step." There is a very classical result in language models: if you ask them for the value of some mathematical expression, or the answer to some puzzle, they will try to give you an answer straight away. They will try to predict the next token, but they fail.
这是一件困难的事情。它们无法在一个标记的跳跃中计算出来。但如果你要求它们一步一步来做,它们会开始思考,好吧,我不知道答案,但得出答案的第一步是这个。然后它们写出思维链,这是一系列文本,一系列标记,进行计算的第一部分,第二部分,最后一部分,然后它们将这些部分连接起来,然后它们可以得出答案。所以思维链基本上是一个用语言编码的思考过程,就像人类在纸上一步一步从开始到结束解决问题一样。
It's a hard thing. They can't compute it in one token jump. But if you ask them, please do it step by step, they will start thinking, Okay, I don't know the answer, but the first step of getting to the answer is this. And then they write chain of thought, which is a series of text, series of tokens, doing the first part of the computation, the second part of the computation, the last part of the computation, then they connect those things, and then they can get to the answer. So the chain of thought is basically a process of thinking encoded in words, how humans would solve a problem on a piece of paper going step by step from start to the end.
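The "do the first part of the computation, then the second, then connect them" idea can be sketched in code: instead of jumping to the final answer in one step, each intermediate result is written out as text. The function and numbers here are invented for illustration; in a real model, these steps are produced as generated tokens rather than by explicit code.

```python
# Sketch of a step-by-step "chain of thought" for computing a * b + c:
# emit each intermediate result as text, then combine them into the answer.
def chain_of_thought(a, b, c):
    steps = []
    partial = a * b                       # first part of the computation
    steps.append(f"First, {a} * {b} = {partial}.")
    total = partial + c                   # second part, using the first
    steps.append(f"Then, {partial} + {c} = {total}.")
    steps.append(f"So the answer is {total}.")
    return "\n".join(steps), total

cot, answer = chain_of_thought(13, 7, 5)
print(cot)
```

The point of the analogy: each written step is small enough to get right, and later steps can read the earlier ones, which is exactly what the model's generated chain-of-thought tokens provide.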
既然时间——我指的是思考所花费的时间——对推理这个概念如此重要,那么当我们在 ChatGPT 中使用 GPT-5 的自动模式、它说将自动决定思考多长时间时,模型是如何决定思考时长的?那里发生了什么?
And since time, and by that I mean the time spent thinking, is so important to that concept of reasoning, how does the model decide how long to think when we're using GPT-5 in ChatGPT in auto mode and it says it's going to decide automatically how long to think? What happens there?
这本质上是我们优化流程的一部分,部分是为了用户的满意度和预期。因为在有思考流程时,你需要平衡两个因素:一是结果的质量——正如我们所说,随着 o1 的发布,我们展示了相当出色的扩展规律:模型思考的时间越长,得到的结果越好。但另一方面,人们不喜欢等待。
It's basically part of our optimization process, partially for the happiness of the users and what they expect. Because when you have a thinking process, you need to balance two things. One is the quality of the result: as we said, there are those pretty great scaling laws that we demonstrated with the release of o1, where the longer the model thinks, the better the result you get. But also, people don't like waiting.
等待意味着本可用于做其他事情的时间损失。每个人都希望尽快获得结果。有句老话说:便宜、快速、优质三者只能取其二。这条法则同样适用于语言模型。这里存在需要精心权衡的取舍关系。
Waiting is time lost that you could spend doing something else. Everyone wants to get results as quickly as possible. There is this saying: you can have it cheap, fast, or good, and you can pick two. That applies to language models as well. There is a trade-off, and it's delicate.
因此我们也向用户开放了部分这种权衡:你可以选择高推理模式或低推理模式。归根结底这是同一个模型,我们只是调整一个参数,告诉它思考得更久还是更短。我们尝试用一些启发式方法预判:在什么情况下,用户会觉得多思考一会儿、得到更好的答案值得多等一下。但要猜中用户的期待多少有些碰运气——在特定情境下,怎样的思考时长对他们才是合适的?
That's why we also expose some of that trade-off to the users, where you can have a high-reasoning model and a low-reasoning model. In the end, it's the same model; we just tweak the parameter that says we want you to think longer or shorter. We try to encode some heuristics of when we think users will want the model to think a little longer and get to a better answer, when that is worth the wait, but it's a bit of a guess, trying to anticipate what the users want: what's the right amount of thinking for them in this particular situation?
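The "same model, one tweaked parameter" knob is exposed in the OpenAI API as a reasoning-effort setting. A minimal sketch, assuming the Responses-API request shape (`"reasoning": {"effort": ...}`); check the current API documentation before relying on exact parameter names or accepted values.

```python
# Sketch: the caller chooses the thinking budget; the model stays the same.
# Request shape assumed from the OpenAI Responses API; verify against docs.
def build_request(prompt: str, effort: str) -> dict:
    if effort not in ("low", "medium", "high"):
        raise ValueError("effort must be low, medium, or high")
    return {
        "model": "gpt-5",                 # same underlying model either way
        "input": prompt,
        "reasoning": {"effort": effort},  # only the thinking budget changes
    }

quick = build_request("What's 2 + 2?", "low")          # answer fast
deep = build_request("Prove this lemma...", "high")    # think longer
```

The auto mode discussed above amounts to the server picking this value per request with heuristics, instead of the caller fixing it.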
真有意思。所以这更像是用户导向的设计,更侧重于用户体验层面。
Fascinating. So it's more user driven, so it's more like a user experience kind of thing.
归根结底确实如此,因为核心问题在于:你愿意为答案等待多久?理论上等待时间越长,答案质量就会越高。
In the end it is, because the question is: how long do you want to wait for an answer? You can always wait longer and get an even better answer.
自世界上首个推理模型发布已过去一年多,这是一个由您主导的项目。此后的历程是怎样的?先是 o1,然后是 o3,最近则是 GPT-5。您如何评价过去一年间这三个模型在推理能力上的演进?
It's been a little over a year since the release of the world's first reasoning model, which is an effort that you led. What has the journey been since? There was o1, then o3, then most recently GPT-5. How would you characterize the evolution of reasoning specifically across those three models in the last year?
某种程度上,我对强化学习研究项目的演进描述是:我们进行了一系列规模逐步扩大的实验,每次都比前次更具野心。我们总在尝试更大规模、更先进的训练方式,以期获得更优质的模型。当然,并非所有训练模型都会发布——有些会立即面世,有些则需要等待更适合的时机在用户手中绽放光彩。
In some way, how I characterize our reasoning, or scaling-up-reinforcement-learning, research program is that we do a series of scale-up runs that are progressively more and more ambitious. Every time, we try to do something more, something larger scale, something that should result in a better trained model than the last one. And obviously, we don't release all the models that we train. Some we release; some we think need to wait a little longer for the moment when they will have their time to shine in the hands of the users.
但 o1 是我们决定发布的第一个模型,某种程度上是为了向世界展示:这样的模型存在了。坦白说,o1 最擅长的其实是解谜题,或许还能处理一些零散的思考类问题。它还算不上一个真正有用的模型,更像是一次技术演示,而非真正打磨成熟的产品。
But o1 was the first model we decided to release, kind of to demonstrate to the world that those models exist. And o1, to be perfectly honest, was really mostly good at solving puzzles, and maybe a few thinking problems here and there. But it was not yet a very useful model. It was almost more like a technology demonstration than an actually really polished product.
但我们认为这是个很酷的东西,想以 OpenAI 的身份与世界分享。而 o3,我认为带来了重大改变。它是一个真正有用的模型——这话可能有点自卖自夸——但正是从那时起我开始频繁使用 ChatGPT,现在我已经完全沉迷于 ChatGPT 中的推理模型了。
But we were thinking we have something cool, and we wanted to share it with the world as OpenAI. o3, I think, changed that pretty significantly. It is a model that is meaningfully useful, and, a little bit self-servingly, it was the moment when I started using ChatGPT quite a bit. I am basically a user completely hooked on the reasoning models in ChatGPT right now.
我现在几乎只使用推理模型,因为只有这些模型的输出和错误率是我能信任的。o3 运用工具得出答案的能力——整合多源上下文信息并坚持不懈地求解——确实称得上 AI 发展轨迹上的一次地壳运动般的转变。我们在那里取得了非常了不起的成就。某种程度上,GPT-5 可视为 o3.1,是同一概念的一次迭代。而我和我的团队目前追求的,是下一次重大飞跃:打造思考能力更强、思考时间更长、能自主与更多系统和信息源交互的模型。
I use basically exclusively reasoning models, because those are the only models whose output and error rate I trust. And I think o3, its ability to use tools and get to an answer, leveraging a lot of contextual information from various sources and persevering toward it, has really been something like a tectonic shift in the trajectory of AI. I think we did something really, really great there. GPT-5 in some way can be considered as, like, o3.1; it's a bit of an iteration on the same thing, the same concept. What I am after with my team right now is something that would be the next pretty significant jump: models that are even more capable, think even longer, and interact with even more systems and sources of information on their own.
与此同时,我们还在 o3 这一代技术的基础上持续开发许多东西,比如 Codex——我认为编码代理是目前构建在 AI 之上的第一类真正成功的智能体产品。还有计算机使用代理(我想现在叫 ChatGPT Agent)、Deep Research 等项目,我们会继续基于 o3 这一代技术推进这些开发。
But separately, in the meantime, we continue to build a lot of things on top of the o3 generation of technology, like Codex; I think coding agents are at the moment the first really successful agentic products built on top of AI. There are things like the computer-using agent, which is called ChatGPT agent right now, I think, and Deep Research, and a few other things that we will keep on building on top of o3-generation technology.
太棒了。我们稍后会详细探讨这些,但首先聊聊你的经历。你们正在改变世界,这对我们所有人来说都是极其迷人的话题。
Great. Alright. So we're going to go into all of this in much greater detail in a minute. But before we do that, let's talk about your journey. I think it's a super fascinating topic for all of us, that you guys are changing the world.
我很好奇——我们大家都好奇——那些产生如此影响力的人物背后的人文故事。你最初是在波兰长大的对吧?是的,我成长于
So I'm curious, and we're all curious, I think, about the people, the human aspect of who those people are that are having such an impact. So, starting from the beginning, you grew up in Poland, I believe, right? Yes. I grew up
波兰。
in Poland.
请向我们讲述你的成长岁月,以及你是如何开始进入这个领域的。
Walk us through your formative years and how you got to get started in this field.
好的,很乐意分享。有趣的是,这就像晶体生长一样:最初要放入一点点东西作为起点。
Yeah, happy to do that. An interesting fact: it's almost like how a crystal starts from something, and you put a little bit of something in at the beginning.
我认为,在我旅程的起点有一个重要部分,我不知道它从何而来,因为它从我生命最初就伴随着我,在某个我记不清具体时间的时刻。它始终与我同在。我一直认为成为科学家、从事科研是人类最高尚的使命,但我真的不知道这种想法源自何处。也许是我父母在我一岁左右时给我唱了某些摇篮曲?但基本上,从我记事起,我就想成为一名科学家。
There's, I think, one part that was important at the starting point of my journey, where I don't know where it came from, because it was there with me from the very beginning of my life, from a moment I don't really remember. It was just always there with me. I always thought that being a scientist and doing science is the highest calling a human can have, and I don't really know where it came from. My parents were maybe singing the right lullabies to me back when I was one or something like that. But basically, since I can remember, I wanted to be a scientist.
早年时,我还发现自己在这方面有天赋。上学时,我发现自己比周围人理解得更快些——至少在波兰中部普通学校是这样。这让我更热衷于学习数学和科学,因为这种感觉很自然,就像天生适合我。我成长为一个普通孩子,只是稍微书呆子气些,努力平衡对科学、编程、数学的兴趣和社交生活。当然,我的人生也有过派对狂欢的阶段。
In the early years, I also discovered I have talent for those things. I was going to school, and I saw I get things slightly faster than people around me, at least in a regular school in the middle of Poland, which made me do those things, studying maths and science, a little bit more, because it felt good in a way. It felt like this was something that naturally fit me. I grew up as a very regular kid, just a slightly nerdy guy, trying to balance my side of being interested in science, programming, and maths with having some social life. And I definitely had some kind of party arc in my life.
但最重要的转折点是进入华沙大学时,18岁的我决定主修数学。那时我理想的生活就是做个拿着铅笔的数学家,在房间里对着纸张解方程。这就是我18岁时梦想的人生方式。我的个性也在这个过程中逐渐形成,非常崇尚严谨科学、真理追求和卓越工程这些特质。
But I think the most important part, the most important moment, was when I actually went to university, the University of Warsaw, and decided to study mathematics, at around 18. My idea of life was to be a mathematician with a pencil, sitting in a room with a piece of paper and solving equations. This was my 18-year-old dream of how life should be lived and what I wanted to do with mine. And my personality is built in a way of really appreciating solid science, the pursuit of truth, great engineering, all those aspects.
但我骨子里也有点不合群的反叛精神。经过几年数学学习后,我认清了两件事:我确实热爱且擅长数学,但我不太喜欢学术界。我意识到自己不想留在大学体系里,那不是我长期感到快乐和适应的环境。
But I definitely also have a little bit of a misfit, rebellious streak to me. And after a few years of studying mathematics, what I realized about myself and about the world is that I really like maths, and I'm quite good at it, but I didn't like academia that much. I realized I didn't want to stay in academia, to stay at a university, and that this would not be an environment where I would be very happy and fitting long term.
那个环境感觉过于僵化,结构太死板,我不确定自己能否适应。对当时21岁的我来说,这简直是信仰危机,一度失去了人生目标。于是我开始进行最基础的思考:我手握数学学位...
It felt a little too rigid, a little too structured, in a way where I didn't know if I would feel good. And in some way, for young me, I was around 21 years old at that moment, that was a pretty big crisis of faith. I had a moment of lost purpose in life. So I just did some very simple, first-principles thinking: I am graduating with a degree in mathematics.
我需要找份工作来养活自己。我在想,有什么工作能让我运用数学知识?当时大概是2011年或2010年左右,我观察就业市场后决定成为一名交易员,把交易作为职业——这样既能做我喜欢的数学,又能谋生。后来我在摩根大通投行交易部的股票衍生品组做了个短期实习。
I need to get a job to put food on the table. So what job can I do that uses mathematics? Looking at the job market at that moment, which was 2011, I think, or 2010, somewhere around there, I decided to become a trader and trade for a living, as the one way I could do what I like, which is mathematics, and get a career. I did a quick internship at JPMorgan, on the investment bank trading floor, in the equity derivatives group.
我在那里待了六个月,初步了解交易运作方式。完成学业后,我收到摩根大通上司的上级发来的消息说:'嘿Jerry,你是我们带过最优秀的实习生之一,我们很欣赏你。现在我们要离开银行创办新对冲基金。'
I spent six months there learning a little bit about how trading works and what it looks like. I finished my degree. Then I got a message from the boss of my boss at JPMorgan saying, Hey, Jerry, you were one of the best interns we ever had. We really liked working with you. And we are leaving the bank and starting a new hedge fund.
'你愿意加入吗?'对于20到22岁的我来说,这听起来像场令人兴奋的冒险。那里有待解决的有趣问题,同时又能尝试新事物,挑战雄心勃勃的目标——这正是我喜欢的。于是我去了伦敦。
Would you want to come with us? And for a 20-, 21-, or 22-year-old Jerry, that sounded like a cool adventure type of story that I was interested in going for. It had enough interesting problems to be solved, and at the same time this element of trying something new, something ambitious, a type of bet that I generally like. So I went to London.
可惜那家公司没做起来,虽然过程艰难且充满抱负,但并非所有事都能成功。后来我又在阿姆斯特丹与人合伙从零开始创办另一家对冲基金,在那里工作了几年后感到厌倦。交易工作本身充满趣味和挑战,市场非常残酷。
That company didn't really work out, unfortunately. It was hard and ambitious, and not everything works out. I did try that again, starting another hedge fund from scratch with a few other people in Amsterdam. I worked there for a few more years, and eventually I got bored. Generally, working in trading is an interesting and exciting problem. The market is very hard.
钻研模型理解的深度可以无穷无尽,同事也都非常聪明。但干了几年后,我感觉自己停滞不前了。当时我和共事的朋友开始聊起人工智能,真正吸引我的是强化学习——特别是DeepMind团队2013年训练的DQN智能体(虽然我几年后才知晓这些成果)。以我的思维方式来看,2012年ImageNet的成果反而不算重大突破。大学时我学过传统AI知识,那时神经网络并不流行,但我还是了解了相关原理。
The depth you can go to in trying to understand the models is very deep, and I worked with pretty smart people overall, but I stopped feeling I was growing after a few years of doing that. At the same time, together with a friend I was working with, we just started chatting about AI, about artificial intelligence. And what really drew me to artificial intelligence was reinforcement learning, specifically the DQN agent trained by people at DeepMind in 2013, though I think it was a few years later that I actually learned about those results. From my perspective, and again, this is just how my brain works, the 2012 ImageNet results weren't that significant. During my university years, I learned a bunch about classical AI; neural networks weren't very fashionable back then, but I still learned what they are.
我学过支持向量机等各种分类器训练方法。对我来说这很自然:只要有足够参数并精心调参,总能拟合出想要的分类器——这显而易见。
I learned about SVMs and all kinds of methods for how you train classifiers. And for me, it was kind of obvious and natural: if you have enough parameters and tweak it hard enough, you will fit a classifier to whatever you want. It seemed obvious.
但当时我没意识到的是:分类器本身并不'智能'。它只是在学习从输入到输出的映射函数,通过训练不断逼近目标。我忽略的关键在于:当你能越来越精确地拟合任何函数时,就可以塑造行为和策略。这个认知转折点出现在DQN成果中——研究者将ImageNet里验证过的普通规模神经网络(并不特别庞大或惊艳),与经典强化学习结合来解决简单电子游戏。结果发现,简单神经网络配合基础算法竟能学会复杂游戏策略,展现出惊人行为。
What was not obvious to me: I never considered classifiers a smart thing. With a classifier, you learn a function from some set of inputs to some set of outputs, and you can keep training it to approximate better and better. But what I missed back then is that when you can fit any function better and better, you can start shaping behaviors and strategies. Where I really saw that was in the DQN results, where they applied the same things that worked on ImageNet, neural networks, and they weren't particularly big or impressive neural networks, to the classical field of reinforcement learning, to solve simple computer games. It turned out those simple neural networks with a simple learning algorithm started learning pretty complex computer games and exhibiting very interesting behaviors.
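The core of what DQN learns can be sketched with its tabular ancestor, Q-learning, on a toy corridor environment. The environment, constants, and seed below are invented for illustration; DQN's contribution was replacing the Q-table with a neural network (plus experience replay and target networks) so the same update scales to Atari-like games.

```python
import random

# Toy tabular Q-learning on a 5-state corridor: step left or right,
# reward 1 for reaching the right end. The core update is
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
N_STATES, ACTIONS = 5, (-1, +1)
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)  # deterministic for illustration
for _ in range(500):                       # episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        best_next = 0.0 if s2 == N_STATES - 1 else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned policy should be "always step right" in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The "shaping behaviors" point is visible here: nothing tells the agent to go right; a better and better fit of the value function makes the right-stepping strategy emerge.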
我目睹了那些行为,看到了那些成果,当时就想,这就是我余生想做的事——虽然对二十多岁的人来说‘余生’这词有点夸张。但我就认定这是我想做的。该去哪里实现呢?谷歌搜索‘世界上哪里能从事强化学习’。当时像Google DeepMind和OpenAI这类机构刚崭露头角,规模很小知名度也不高...
I saw those behaviors, I saw those results, and I thought, this is what I want to do for the rest of my life, which is not a very long horizon when you're twentysomething, but I was like, this is what I want to do. Where do I do that? Google search: where are the places in this world where you can do reinforcement learning? Google DeepMind and OpenAI came up, which at that moment were pretty small and somewhat known, but they
是——你2019年加入的OpenAI对吧?那时候确实...对,对。还处于非常早期的阶段,基本还是非营利组织性质的时期。
were- Yeah, you joined OpenAI in 2019, right? Yes. So very much in the early days still, very much in the nonprofit era of OpenAI. Yes.
你是怎么和他们取得联系的?
How did you so how did you connect with them?
就是...直接在官网上申请的。世界上最无聊最常规的流程:打开openai.com招聘页面,投简历,然后祈祷回复。幸运的是他们回应了。不知道当时OpenAI收到多少简历,肯定比现在少得多。
I just applied through the website. It was the most boring and uninteresting thing in the world: openai.com, jobs, apply, send a resume, and hope they respond. Luckily enough, they did. I don't know how many resumes OpenAI was getting at that time; definitely much fewer than today.
但我当时就觉得...只要让我做强化学习,具体干什么都无所谓。所以你...
But I was like, you know, it doesn't matter what I do, as long as it's reinforcement learning. So you
你2019年怀着对强化学习的热情加入。那会儿是不是DOTA2项目时期?因为OpenAI在2019年初确实做了很多强化学习相关项目对吧?后来才转向无监督学习和GPT方向,但最初根基确实是强化学习。
joined in 2019 with a passion for reinforcement learning. So was that around the Dota 2 moment? Because OpenAI, interestingly, in those early days around 2019, did a lot of reinforcement-learning-focused work. Right? And then there was a whole unsupervised learning, GPT moment that happened afterwards, but it started from roots in reinforcement learning.
所以你是直接参与那个项目了吗?还是等你加入时项目已经进入高阶阶段了?
Right? So did did you work on that project specifically, or was it too advanced by the time you showed up?
我参与的是 OpenAI 的机器人项目,它与 DOTA 项目共享相同的代码和方法。一方面,DOTA 项目是 OpenAI 向世界展示强化学习规模化能做到什么的方式。某种程度上,它是在 2013 年 DQN 智能体的基础上做大量艰苦工作,把它做得越来越大,解决越来越难的问题。OpenAI 从一开始就意识到——这是一个简单却天才的洞见——要学习真正有趣、复杂的行为,必须建立大规模系统。DOTA 项目正是试图证明:通过规模化强化学习,我们能解决极其复杂的环境。
The project I worked on was the robotics project at OpenAI, which shared the same code and the same methods as the Dota project. On one hand, the Dota project was OpenAI's way to demonstrate to the world what scaling up reinforcement learning can do. In some way, it was taking the 2013 DQN agent and just doing all the hard work of making it bigger and bigger and solving harder and harder problems. OpenAI, generally, from the very beginning was aware, and it was a simple but genius insight, that you need large-scale systems to learn really interesting, complex behaviors. Dota was one way of showing that by scaling up reinforcement learning, we can solve pretty complex environments.
当时还有另一个项目。我记得那时OpenAI有三个强化学习项目。第二个是机器人项目,旨在运用我们已验证能解决复杂电脑游戏的方法,探索这些技术能否解决现实问题。OpenAI始终保持着乐观与雄心,不断尝试突破自身能力的边界。
Then there was another project. There were, I think, three reinforcement learning projects at OpenAI at that time. The second one was robotics, which was applying the same methods that we now knew, or were proving, can solve pretty complex computer games. Can they solve practical problems? OpenAI was always optimistic and ambitious, always trying to see if we could push beyond what we could do at the time.
它能帮我装洗碗机吗?能叠衣服吗?甚至能盖房子吗?这就是我们的研究方向。我负责的项目专注于灵巧操控——这个领域在当时,直到现在仍是训练策略难以攻克的挑战。
Can it load my dishwasher? Can it fold my clothes? Can it build a house? And this is what we are doing. The project I was working on was focused on dexterous manipulation, which was back then and still continues to be an elusive challenge for trained policies.
我们最终成功展示了由神经网络控制的手部能解魔方,这需要完成极其精细复杂的操作。
And we got to a showcase demonstrating that a hand controlled by a neural network was able to solve a Rubik's cube, which is a pretty delicate and complex task.
那么快进到现在,还是关于OpenAI幕后的故事。像你这样的成员典型的一天是怎样的?比如阅读论文、训练模型、管理团队?你每天具体做些什么?
So fast forward to today, still in the same vein of behind the scenes at OpenAI and life there. What's a day in the life of Jerry? What does somebody like you do? You read papers, you train models, you manage teams. What's a day like?
我的日常出奇地规律:早上送孩子上学后早早到办公室。整天的工作基本上就是与其他研究员交流——这是我日复一日唯一在做的事。我收集大家的想法,与不同伙伴碰撞思维、头脑风暴,不断循环这个过程。
Yeah, my days are surprisingly uniform: I come to the office early in the day after driving my kids to school. Then what I do all day is basically talk to other researchers. I talk to other researchers all day, every day, and this is basically exclusively what I do. I take ideas from people, bounce them around, brainstorm with one partner, then move to another one and do the same thing, over and over, and iterate.
通过这种方式持续完善我们的研究计划。有时是小组会议,团队互动也有其独特动态。但核心工作始终不变,唯一变化的是每次会议或不同对象间讨论的研究主题。
And in that way, keep refining our research program. Sometimes those are group meetings, which have their own team dynamics. But that is basically exclusively what I do. The only thing that changes is the topic of research, from meeting to meeting and from person to person.
研究的优先级,也就是这个可能的项目范围,是如何确定的?是自上而下,还是自下而上?是人们提出想法再由他人评估吗?具体是如何运作的?
How are research priorities, this range of possible projects, determined? Is that top down? Is that bottom up? Do people suggest ideas and others vet them? How does that work?
是的。是的。是的。构建、组织和领导研究项目的艺术是我在OpenAI的历程和职业生涯中迅速学会欣赏的东西。如果说我们有什么擅长之处,那就是构建研究项目。
Yeah. The art of structuring, organizing, and leading a research project is something I learned to appreciate very quickly in my OpenAI journey and in my career. If there is something we are good at, it's structuring research projects.
我认为这是一种独特的混合模式。不能说是自上而下,也不能说是自下而上。它是两者的结合,平衡了重要的方面。OpenAI体现并确定的一点是,我们所有人总共只参与非常少量的项目。
I think it's a unique mix. You cannot say it's top down; you cannot say it's bottom up. It's a mix of those two, balancing the important aspects. One thing OpenAI embodies and is determined about is that, all of us together, we work on very few projects in total.
项目数量并不多。OpenAI 并不试图做所有事情,我们不是要建立一个由许多不同押注组成的组合。核心理念始终是:我们把少数核心事项做得非常非常出色,并在那里投入大量精力,这意味着需要许多人共同参与同一个规模宏大、雄心勃勃的项目。
There are not that many projects. OpenAI is not trying to do everything; we are not trying to have a portfolio of multiple different bets. The idea is always that we do a few core things really, really well and put a lot of effort there, which means there need to be a lot of people working together on the same large-scale, large-ambition project.
我们只有少数几个这样的项目,可能是三四个,取决于你怎么称呼它们,仅此而已。从这个角度看,人们没有绝对的自由。不是人们来到OpenAI说,嘿,我想做这个,然后他们就可以做这个,因为你需要为那四个项目之一的目标而努力。在这些项目中,我们尽量保持相对自下而上的方式,只要它再次服务于这些目标。研究领导最重要的部分是确保所有研究人员朝着这个共同目标努力,而不会因各自的思维方式和工作方式而分裂。
And we have a few of those, a small number, probably three or four depending on how you count, and that's it. From that perspective, people don't have ultimate freedom. It's not that people come to OpenAI and say, Hey, I want to do this, and they just do it, because you need to work towards the goal of one of those projects. Within those projects, we try to be relatively bottom-up, as long as the work feeds into those goals. And the most important part of being a research lead is making sure all the researchers work towards this one shared goal and don't fracture into their own ways of thinking and doing things.
所以这是一件极其困难的事情。这是一份非常非常艰难的工作,而且并不总是容易看出它的微妙之处。但这就是它的很大一部分。不要认为,自上而下的研究结构在研究中行不通。我真的不相信研究组织能这样运作,因为你雇佣的是世界上最聪明的一些人,而OpenAI有着极其聪明的人才,不能只是告诉他们该做什么。
So it's an incredibly hard thing. It's a very, very hard job, and it's not always easy to see how delicate it is. But that's a lot of what it is. I really don't believe purely top-down structuring of research works in research organizations, because you are not hiring some of the smartest people in the world, and OpenAI has incredibly, incredibly smart people, just to tell them what to do.
他们需要弄清楚该做什么,但不能在整个可能性空间里随意选择做哪些酷炫的事情。他们需要从项目需求和最能推动OpenAI研究目标的方向中寻找答案。
They need to figure out what to do, but they cannot figure out in the whole space of things what cool things to do. They need to figure out from within the space of what the project needs and what could advance the research goals of OpenAI the most.
你刚才提到的是,这三个或四个项目团队之间是否存在协作?因为设身处地想想,睁大眼睛看的话,可能在普遍希望协作的同时也存在某种张力——毕竟这可能是世界上最重要的知识产权。所以你可能要确保不是所有人都对一切了如指掌。当然也可能不是这样,我只是在推测。
And to that point, is there collaboration between the teams working on those three or four projects at the same time? Because I would imagine, putting myself in your shoes, in OpenAI's shoes, there is probably a tension between wanting to be collaborative in general and the fact that this is probably the most important IP in the world. So you probably want to make sure that not everybody knows everything about everything. Or perhaps not; I'm speculating here.
你如何看待这种协作与知识产权保护之间的平衡?
How do you think about that collaboration versus some protection of IP?
你可能会惊讶,但事实是OpenAI目前不到600人的研究团队里,所有人都真正知晓一切。我们始终保持完全透明。某种程度上,如果有研究人员连了解全局的机会都没有,那相当于自缚手脚——他们无法获得最佳信息来最优完成工作。虽然确实存在知识产权流失风险,但在我看来,因信息闭塞导致研究决策失误或无法开展顶尖研究的风险要高得多,这也是我处理这类事务的个人原则。
You'd be surprised, but the truth is that in research at OpenAI, which is slightly less than 600 people at the moment, everyone knows everything, really. And we have always been fully transparent. In some way, you are shooting yourself in the foot if there is a researcher who doesn't at least have the chance to learn about everything, because then they don't have the best information to do their job in the best way. There is some risk of losing IP, but in my personal opinion, and in how I approach those things, the risk of people not being informed about the research, not making the right research decisions, and not being able to do the best research is much higher.
因此研究部门内部极度透明,我们的运营原则就是竭尽所能做出最优秀的研究,从而训练出最强大的模型。整体文化也高度协作。当然,600人的团队难免存在人际摩擦——有人因奇怪的眼神互生嫌隙,有人嫌弃对方体味,或单纯不认同其观点。这些人类社会的常态确实存在。
So we are extremely transparent internally within research. That is one of our operating principles: the goal is to do the best research we can, and consequently to train the best models we can. And the culture generally is very collaborative. It is always the case, when you have 600 people, groups of people, that one person doesn't like another because they looked at them weirdly, or thinks the other smells bad, or just doesn't like their ideas. That does happen. Those are humans and human things.
但宏观来看,我们坚信集体智慧远胜个人。人工智能领域日益重要,OpenAI的成功远非必然,取决于我们每日的卓越工作。这种共同命运的认知,以及实现使命必须相互依赖的现实,造就了极强的协作氛围。
But generally, at a large scale, I think we really have this belief that we are together in something larger than every one of us. It is a very positive-sum game, because AI seems to be getting only more and more significant. And the success of OpenAI is far from guaranteed; it depends on us doing great work every day. So there is a strong feeling of shared fate, and of the fact that we all need to rely on each other to do our jobs to achieve this shared mission.
所以尽管人性弱点偶尔作祟,但整体而言OpenAI确实保持着极高程度的协作性。
So I generally think, with all the caveats of human nature getting in the way sometimes, that on a large scale, OpenAI is very, very collaborative.
你们是如何维持这种发布节奏的?作为外部观察者,我感觉研究(某种程度上更像是长期工作)与全组织持续交付——包括核心模型迭代(比如从o1、o3到GPT-5仅用一年)——之间似乎存在另一种张力。你们是如何平衡这些的?为何能如此快速地持续交付?
How do you all manage to keep that pace of releases? It seems to me from the outside that there's another kind of tension between research, which in some ways feels like it could be a long-term kind of thing, and on the other hand, you guys seem to be just shipping and shipping and shipping across the organization, including in terms of core models. Again, to the point that you went from o1 to o3 to GPT-5 in, like, a year. How do you balance all of that? Why are you able to ship so quickly?
我 我 我认为根本原因在于,总的来说,OpenAI在我的世界观里某种程度上是一家划时代的公司,我们背后有着惊人的发展势头。我们知道过去做得相当不错,我们需要继续保持。我们拥有极其聪明的人才。实际上,现在全世界最有才华的人都想来OpenAI工作,这意味着每个人的产出效率极高,每个人都贡献巨大。因此我们拥有推动我们前进的强劲动力。
I think the fundamental reason for it is that, in general, OpenAI, at least in my worldview, is a generational company, in the sense that we have incredible momentum behind us. We know that we were doing pretty great in the past, and we need to continue that. We have incredibly smart people. Literally, the most talented people in the world all want to come work at OpenAI right now, which means the output per single person is incredibly high; every single person does a whole lot. So we have momentum that carries us forward.
我们有非常优秀的团队协作。我们有良好的研究运营架构,并能从硅谷借鉴很多快速推进事情的方法。大家普遍对工作充满热情。每个人都感受到我们所做之事的分量与潜力。正因如此,OpenAI的员工往往工作非常努力。
We have really great people who work together. We have a good way of structuring and operating research, and we can borrow a lot from Silicon Valley about how to get things done quickly. And people are generally very excited about the work. Everyone feels the weight and potential of what we are doing and what we are trying to do. Because of that, people at OpenAI tend to work pretty hard.
让优秀人才对他们从事的工作充满热情并保持良好的协作,就能成就许多事情。我们明白历史上只有这一次人工智能被构建、部署和发展的机会,人们希望以尽可能最好的方式来完成它。
Having great people who are excited about what they are doing and who work together reasonably well results in getting a lot done. We understand that there's only one time in history when AI is being built, deployed, and developed, and people want to do it in the best way that is possible.
你们团队会大量使用自己的工具吗?我记得Fiji Simo前几天发推说,今天开发者大会上宣布的很多内容是由Codex撰写的。这是日常工作中的常态吗?你们会用模型来构思新模型的想法吗?会用Codex来编写代码吗?
Do you all use a lot of your own tools? I think Fiji Simo was tweeting the other day that a lot of what you announced at Dev Day today was written by Codex. Is that part of the daily experience? Do you use models to come up with new ideas for models? Do you use Codex to write the code?
具体是怎么运作的?
How does that work?
是的。我们确实经常使用Codex来编写代码,而且效果只会越来越好。就像我说的,我经常使用ChatGPT,不过不出所料,用它来产生新想法的情况倒不多。但对于我遇到的很多问题,我认为我现在是GPT的重度用户,很乐意每月支付200美元使用它。我觉得他们正在...
Yeah. We definitely use Codex a lot for coding, and this is only getting better. As I said, I use ChatGPT a lot, although, not surprisingly, not that much for actually coming up with ideas. But for a lot of the questions I have, I am a pretty heavy user of GPT right now, happily paying $200 a month for it.
还让你付费?
They make you pay?
该怎样就怎样。他们让我付费,而我对此还挺接受的,因为这样你能获得相当宽松的使用限制,基本不会遇到瓶颈问题。
It is what it is. They are making me pay, and I'm pretty okay with this, because then you get pretty generous usage limits and aren't really bottlenecked on it.
感谢分享这些。让我们切换话题,回到这一切如何运作的问题上。那么,理解现代AI系统——我指的是截至2025年10月,相对于九个月前的'旧时代'——的正确方式是否应该将其视为预训练与强化学习的结合?
Thank you for all of that. Let's switch tacks and go back to how all of this works. So, is the right way to think about modern AI systems at OpenAI (by modern, I mean as of October 2025, versus the old days of, you know, nine months ago) as a combination of pre-training and RL?
首先,这种理解方式是否正确?其次,如果是的话,两者在高层次上是如何协同工作的?之后我想深入探讨下强化学习,让听众们能真正学有所获。
First of all, is that the right way to think about it? And second, if so, at a high level, how does the articulation between the two work? After that, I'd love to do a bit of a deep dive on RL to make this very educational for folks.
当今的语言模型基本上可以这样理解:先进行预训练,然后实施强化学习。没有预训练,强化学习就无法开展。同样地,预训练模型存在许多难以解决的局限,除非采用类似强化学习的方法。我认为这两部分都不可或缺且将持续存在。至于它们的结合方式及未来演变,我们不应将其视为教条或固定不变的。
Today's language models can basically be thought of this way: first they are pretrained, then you do reinforcement learning on them. The reinforcement learning would not work without pre-training. And in a similar way, pretrained models have a lot of limitations that are very hard to resolve without doing something that looks like reinforcement learning. So I think both of those pieces are here to stay. The way they are combined will probably evolve in the future; nothing should be treated as dogmatic and fixed.
我们需要持续探索训练更好模型的方法,这正是我们在做的。有趣的是——这要归功于Ilya的远见——当我2019年初加入OpenAI时,在一次全员研究会议上,他阐述了OpenAI的研究计划:训练一个覆盖所有可用数据的大型生成模型,然后对其进行强化学习。这就是2019年初制定的研究路线,而我们现在所做的完全吻合。
And we need to keep figuring out how to train better models, and this is what we are trying to do. The interesting thing, and I have to credit Ilya for how much foresight he had: when I started at OpenAI in early 2019, I remember there was a research all-hands or something like that, where Ilya came on stage and talked about what OpenAI's research program was, what we were trying to pursue. What he said in 2019 was: train a large generative model on all the data we can, and then do reinforcement learning on it. That was the OpenAI research plan at the beginning of 2019, and this is exactly what we are doing today.
算法变了,架构变了。当时他可能都没考虑Transformer架构,GPT还只是有人拿来玩的小玩具。但'训练大型生成模型+全球数据+强化学习'这个目标,早已刻在OpenAI的基因里。而这正是当下正在发生的事情。
The algorithms changed, the architectures changed. I don't think he was even thinking about the Transformer at that moment. There was some GPT, but it was a toy example that someone was playing with. But the goal of training a large generative model on all the data in the world and then doing reinforcement learning on it was already there, in the core DNA of OpenAI. And that's what is happening right now.
那我们来做个'强化学习入门课'吧,让更多听众能理解其中的趣味。用最简单的语言,就像给十岁孩子解释那样:什么是强化学习?
So let's do, if you will, a little bit of Reinforcement Learning 101 to make this really interesting to the broader group of people listening to this. In very simple terms, explain it to me like I'm 10: what is reinforcement learning?
对,对。我我我通常用训练狗来比喻强化学习,非常非常贴切。我十几岁时养过一只狗。
Yeah. Yeah. The metaphor, the analogy I usually have for reinforcement learning is training a dog. It's very, very close. And I used to have a dog when I was a teenager.
甚至甚至我记得我父母当时...我对养狗一窍不通。他们通过朋友的朋友请来一位消防员,他好像是训练服务犬的。他来教我一些训狗的基本方法。大多数有抱负训狗的主人都知道,口袋里随时备着零食袋至关重要。
I even remember what my parents did. I didn't know anything about raising a dog, so through some friend of a friend they invited a fireman who, I think, was working with service dogs. He came to me and told me a little bit about how you train your dog. And what most dog owners who are ambitious about training their dogs know: it is always extremely important to have a bag of treats in your pocket.
这是常规操作。每当狗表现好时,你应该微笑并给予零食奖励;当它行为不当时,你就转移注意力、转身并表现出失望。经过多年驯化,狗会明白这是不良行为的负面反馈。我们现在对模型做的正是同样的事。
That's what you always do. Whenever you see your dog behave well, you should smile and give your dog a treat. Whenever you see your dog do something bad, you take your attention away, turn away, and look sad. After years of domestication, dogs understand that as negative feedback for bad behavior. And this is exactly what we do, but with models.
我们诱导模型产生多种行为,将其置于挑战性情境中。当它们做出我们期望的行为时给予奖励;当它们做出我们不希望、不喜欢的行为时,则给予某种惩罚或负面奖励。做好强化学习的要点在于平衡两者:一半时间给予奖励,一半时间给予惩罚,不过这更像是其中的数学层面。最重要的部分是:诱导行为,奖励好的行为。
We elicit a lot of different behaviors in the models and put them in challenging situations. Then we give them a cookie if they do something we want, if they do a good thing, and give them some kind of punishment, a negative reward, if they do something we don't want and don't like. The good way to do RL is to balance those things, so you give cookies half of the time and punish the other half of the time, but that's almost the mathematical aspect of it. The most important part is: elicit behaviors, reward the good ones.
如此推进,模型会越来越倾向于做出符合预期的行为。这就是如何通过训练引导实际行为——而非简单的下一词元预测。
And then going forward, the model will be more likely to do what you want and less likely to do what you don't want. Through that, it improves. This is the way to train models to elicit actual behaviors. It is not next-token prediction.
预训练模型本质是训练它们预测下一词元。而强化学习则是在完全不同的维度上,针对我们期望获得的成果施加梯度。
When you pretrain a model, you literally train it to predict the next token. RL is a completely different gradient, toward a completely different set of things we want to get out of the model.
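As a rough illustration of that difference, the two gradients can be sketched side by side. This is a toy, not anyone's actual training code; the tiny "model" is just a table of logits, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a table of logits over a 4-token vocabulary, one row per context.
logits = rng.normal(size=(3, 4))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- Pre-training gradient: next-token prediction (cross-entropy) ---
# Push probability mass toward whatever token actually came next in the data.
next_tokens = np.array([2, 0, 1])                 # observed continuations
probs = softmax(logits)
grad_pretrain = probs.copy()
grad_pretrain[np.arange(3), next_tokens] -= 1.0   # d(cross-entropy)/d(logits)

# --- RL gradient (REINFORCE-style): reward-weighted log-probability ---
# Sample a behavior, score it, and push toward or away from it per the reward.
sampled = np.array([rng.choice(4, p=p) for p in probs])
rewards = np.array([+1.0, -1.0, +1.0])            # "cookie" or "punishment"
grad_rl = probs.copy()
grad_rl[np.arange(3), sampled] -= 1.0
grad_rl *= rewards[:, None]                       # good samples reinforced, bad suppressed

logits -= 0.5 * grad_rl                           # one gradient step on the RL objective
```

The key contrast: the pre-training gradient always points at the token the data contained, while the RL gradient points at whatever the model itself sampled, scaled by how good that sample was.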
要让模型按照语义指令行事,你会听到「策略」这个术语。强化学习中有智能体、环境、动作、奖励和策略等概念。多数术语顾名思义,但策略指的是模型的决策逻辑和行为模式吗?
And on getting the model to do what you want, just for some vocabulary and semantics: you sometimes hear the term policy. In RL, you hear terms like agent, environment, action, reward, and policy. I think a lot of those are sort of self-explanatory, but the policy is what, the strategy, the behavior of the model?
是的,策略就是模型的行为,因为模型权重代表了它在面对不同输入时的反应。模型本质上是一个数学对象,你可以将其定义为一种策略,即一个将观察映射到行动的数学函数——你看到什么,然后根据所见采取行动。
Yeah, the policy is the behavior of the model, since the model weights represent what it does when put in different situations. The model, in the end, is a mathematical object, and you can define the policy as a mathematical function that maps observations to actions: what you see, and then what you do with what you see.
没错。所以智能体就是模型,行动是模型的输出,奖励则是评估行为好坏的指标。最近常听到关于为强化学习设计合适环境的讨论,这具体指什么?
Yeah. So the agent is the model, the action is what the model does, and the reward is how you say whether that was good or bad. On environments: you hear a lot these days about designing the right environments for RL. What does that mean?
环境某种程度上就是模型感知到的一切,但强化学习环境与其他监督学习或非监督学习最大的不同在于:强化学习环境需要具有交互性。你希望它能随着模型的行为而演变。就像学吉他时,你会拿起吉他拨动琴弦,听到声音反馈后不断调整——环境本质上就是世界对你行动的反应机制。
The environment is, in some way, everything that the model sees. But the interesting difference between RL environments and most other setups, what you could call supervised or unsupervised learning, is that you want reinforcement learning environments to be interactive. You want them to evolve as the model does things. Similarly, if you want to learn how to play guitar, you take a guitar and you strum it. You hear the sound of that, and then you can learn to play with actual feedback about what is happening with the guitar. In that way, the environment is how the world reacts to your actions.
驱动你行动的主要因素来自环境中的动态变化。要让智能体学会对环境变化做出反应,强化学习几乎是唯一的有效途径。
And a lot of what drives your actions is what is happening in your environment, in your world. Reinforcement learning is basically the only way to really teach agents to react to changes in their environment.
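To ground that vocabulary, here is a deliberately tiny, made-up environment in the spirit of the guitar analogy. The class, actions, and rewards are all illustrative inventions, not any real RL library; the point is only the interaction loop: observe, act, get feedback, repeat:

```python
class GuitarEnv:
    """Toy interactive environment: the 'world' changes state and emits feedback."""
    def __init__(self):
        self.tuned = False

    def step(self, action):
        """React to the agent's action with a new observation and a reward."""
        if action == "tune":
            self.tuned = True
            return "in_tune", 0.0
        if action == "strum":
            return ("nice_sound", 1.0) if self.tuned else ("bad_sound", -1.0)
        return "silence", 0.0

def policy(observation):
    """The policy: a function mapping what the agent sees to what it does next."""
    return "tune" if observation in ("start", "bad_sound") else "strum"

env = GuitarEnv()
obs, total_reward = "start", 0.0
for _ in range(4):                   # a short episode of interaction
    action = policy(obs)
    obs, reward = env.step(action)
    total_reward += reward
```

Because the environment's state (`tuned`) changes in response to the agent, the same action can yield different feedback over time, which is exactly the interactivity that supervised learning on a fixed dataset lacks.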
能否简要概述强化学习的发展历程?现代强化学习与早期版本的主要区别是什么?
Can you give us a bird's-eye view of the evolution of RL over the years? Mainly, how does modern RL differ from historical RL?
当然。虽然强化学习历史不算悠久,但真正的革命性突破发生在神经网络与强化学习的结合。早期的强化学习作为数学优化方法就已存在,用于在数学定义的环境中研究行为优化,这甚至早于神经网络的出现。
Yeah, yeah. Again, even historical RL, you know, is not that old, but the main tectonic shift was combining neural networks with reinforcement learning. Reinforcement learning predates neural networks as a general mathematical method of optimizing behaviors in mathematically defined environments, and as a method of study.
这就是现在众所周知的深度强化学习。
That's what is known as deep reinforcement learning.
是的,就是那个。深度强化学习,基本上像是DeepMind发明的将神经网络与强化学习结合的产物,也就是我之前跟你提到的DQN时刻。从那时起,游戏领域的强化学习曾是一个相当活跃的研究方向。即便在我2019年入行时,强化学习虽不算非常成功,但确实风靡一时。它能解决许多游戏问题。
Yes, that one. Deep reinforcement learning was basically DeepMind's invention of combining neural networks with reinforcement learning, the DQN moment I talked to you about. From there, there was a period when reinforcement learning on games was a pretty active area of research. Even when I started in 2019, reinforcement learning was fashionable at that moment, although not very successful. Reinforcement learning was able to solve a lot of games.
但瓶颈在于模型没有任何预训练。我们训练了大量游戏行为,甚至由此诞生了AlphaGo时刻——这让许多人兴奋不已。可这些模型对行为仍缺乏真正的智能理解,虽经大量强化训练,其智能水平仍像某种原始形态,好比鼹鼠虽经严格训练却算不上真正聪明。
But the bottleneck was that the models were not pretrained in any way. We were training a lot of behaviors by playing games. We even got the AlphaGo moment out of that, which a lot of people got very excited about. But it was still learning behaviors without the models being meaningfully smart about those behaviors. There was still a kind of, you don't want to call it caveman intelligence, but something in that regard: models not being really smart, even though they were pretty heavily reinforced.
这个领域经历了长期研究,产生了许多酷炫成果和RL理论认知,因为当时人们积极研究RL。但某种程度上,无预训练的RL已走入死胡同。当我结束机器人研究转向教语言模型编程时,拥有预训练模型成了重大突破。GPT时代的大规模数据训练让我们得以在此时启动RL研究。
There was long research on that, and a lot of cool results and theoretical understanding of RL come from those days, because people were researching RL actively. But in some way, doing RL without pre-training was a dead end. Then, at the moment when I finished working on robotics, I started working on teaching language models to code. Having pretrained models was a really big deal. The GPT era of scaling, of ingesting lots of data at large scale to really train great models, enabled us already at that moment to start RL.
这几乎是我立即着手的第一件事。每当gptfew训练完成,我就尝试对其做RL。但系统总是笨拙不堪,算法选择也困难重重——哪些才是正确的算法?
That was one of the first things I did almost immediately. As soon as GPT-3 was trained, I tried to do RL on it. And there were always bottlenecks. The systems were kind of clunky. It was hard to figure out what the right algorithms were,
该解决什么问题?该用什么算法训练?我当时开展的探索就像典型研究过程:我们把游戏领域的方案生搬硬套过来,机器人领域也差不多。我在大型Android Mall上做的首个RL实验,用的还是那套万能PPL方法。虽有些成果,但早期RL结果并不惊艳,我们却持续投入了很久。
what the right problems to work on were, and what the right algorithm to train on was. What OpenAI did at that moment, and this is kind of how research goes, is that we cargo-culted a lot of the things that were used for games, and almost the same things as for robotics. The first RL I was doing on a large language model was basically the same PPO we used for everything. It gave some results, but those early RL results weren't completely mind-blowing. There was a long period where we kept on investing in it.
我个人始终相信RL与语言模型终将迎来重大突破。但早期试错并不成功。训练GPT-4时有个有趣现象:如今人人都说GPT-4了不起,可当时内部反响平平。还有几次我们投入资金训练模型,结果却显得很笨——GPT-3已能实现的功能,GPT-4似乎并无显著提升。
And personally, I always believed there would be a really, really big moment for RL and language models. But the early trials and errors weren't super successful. There was an interesting moment when we trained GPT-4. Everyone today thinks, oh, GPT-4 is such a great model, but when we trained GPT-4, we were pretty underwhelmed internally. And there were these other moments of, oh, we trained this model.
我们不禁质疑:它在单token评估中显得聪明,能对复杂问题给出详细回答(只要回答限一token)。但若让其长篇大论,就变得语无伦次,长回答质量堪忧。
We spent a bit of money on it, and it's kind of pretty dumb. At least, we have GPT-3, GPT-3 already does all that stuff, and GPT-4 doesn't really seem to be that much better. And we had this question: it seemed smart on evals that were one token long. It seemed able to give a pretty detailed answer to complex questions, as long as the answer was one token. But if you actually let it speak for longer, it wasn't very coherent and didn't really give good long answers.
我们需要回答这个问题:如何真正打造一个在对话中显得聪明且表现优异的语言模型?正是在那时,一种几年前就已开发的技术真正展现出来,它被称为RLHF(基于人类反馈的强化学习),本质上就是在大语言模型上应用PPO算法,并通过人类对两段文本偏好的反馈作为奖励信号。
We needed to answer this question: how do we make a language model that seems to have some smartness in it actually sound smart, and actually be good to talk to? That was the moment when a technique that had been developed a few years earlier really shone. It was called RLHF: basically doing PPO on large language models, with the rewards given by human preferences between two pieces of text.
点赞和点踩。
Thumbs up and thumbs down.
没错。无论是点赞、点踩还是其他任何形式的人类偏好反馈,这都是非常有效的奖励机制。因为模型可以通过多种方式生成更好的文本,而早期GPT-4生成的文本存在诸多问题。RLHF能够捕捉这些问题并加以修正——它强化优秀表达,惩罚低质内容。最终,GPT-4与RLHF的组合包向世界交付了那个令所有人惊叹的'GPT时刻'。
Yeah. Thumbs up, thumbs down, whatever the human preference signal is. And that's a very good reward, because there are a lot of ways the model can generate better text, and early GPT-4 was generating bad text in a lot of ways. RLHF was able to catch those things and correct them: it reinforced good behaviors, reinforced generating good text, and punished bad text. In the end, GPT-4 plus RLHF together, as a package, delivered that GPT moment to the world that everyone saw.
尽管这是预训练的巨大成功,但RLHF中的强化学习同样取得了重大突破。
And as much as it is a big success of pre-training, it actually was also a pretty big success of RL, in the RLHF form.
太棒了。再深入探讨下,作为用户我们都熟悉的点赞/点踩机制只是界面呈现,实际的RLHF是在训练后阶段进行的对吗?
Amazing. And just to double-click on that: the RLHF we're all familiar with as users is, as I mentioned, the thumbs up and thumbs down. That's on the interface. But the actual RLHF happened in post-training. Is that right?
是的。
Yes.
那么这个实施过程具体是怎样的?是否有一批专业人员坐在模型前持续提供反馈?实际运作机制如何?
And what did that look like as an effort? Did you have a bunch of humans sitting down in front of the model, industry specialists maybe, giving it feedback? How did that actually work?
RLHF(人类反馈强化学习)作为一个研究项目已经在后台运行了一段时间。我记得至少在GPT-2时期就进行了相当长时间的RLHF研究。它早已存在并持续进行着。这个领域本质上会自主收集RLHF所需的数据,并不断思考:什么样的数据最适合训练所有模型?如何选择正确的数据来训练奖励模型以及如何塑造奖励机制。
RLHF was a research program that had already been happening in the background for a while. I think we did RLHF at least as early as GPT-2; I remember GPT-2 being RLHF'd for quite a bit. It was already there and already happening. Gathering data for RLHF is its own research domain, basically, always asking: what is the right data to train the model? What is the right data to train your rewards, and how do you shape your rewards?
这是我们一直在进行的开放式深度研究,涉及多个维度。虽然已有论文阐述RLHF的概念,但其内涵远比表面复杂。简而言之,如今我们称之为AI训练师的专家会评估模型输出并打分,然后基于这些评分建立模型用于训练。
It's research we've been doing that is very open-ended and very deep in many different ways. There are papers written on what RLHF is, but there's a lot of depth to it. The long story short is: you have what we call AI trainers these days, and they look at the outputs of the models and give them scores. Then you learn a model of those scores and use that for training.
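As a sketch of that last step, learning a model of the trainers' scores: one common recipe in the RLHF literature fits a reward model to pairwise preferences with a Bradley-Terry style loss. Everything below is a toy stand-in, not OpenAI's pipeline; the linear model and synthetic feature vectors are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reward model: scores a 5-dim "response feature vector" with a linear map
# (an illustrative stand-in for a neural network).
w = np.zeros(5)

def reward(features):
    return features @ w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Trainers compared pairs of responses: `chosen` was preferred over `rejected`.
# Synthetic data: preferred responses have slightly larger feature values.
chosen = rng.normal(loc=+0.5, size=(64, 5))
rejected = rng.normal(loc=-0.5, size=(64, 5))

# Bradley-Terry objective: maximize log sigmoid(r(chosen) - r(rejected)),
# i.e. make the reward model rank the preferred response higher.
for _ in range(200):
    margin = reward(chosen) - reward(rejected)
    p = sigmoid(margin)                                    # P(chosen preferred | w)
    grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= 0.1 * grad                                        # gradient ascent on log-likelihood

# Fraction of pairs the learned reward model now ranks correctly.
accuracy = (reward(chosen) > reward(rejected)).mean()
```

The learned `reward` function can then stand in for the human: during RL, the policy is reinforced with these predicted scores instead of asking a trainer about every single output.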
对于好奇的听众来说,这其实属于整个数据标注行业的一部分,像Scale AI这样的公司就是专门做这个的,对吧?
And for people who may be curious, that's what the entire data-labeling industry does, Scale AI and a bunch of others, right?
没错。某种程度上,随着模型越来越智能,这种人工标注的方式正逐渐成为过去式。但在几年前,特别是GPT-4时代,这可是主流方法。
Yes. Yes. In a way, I think it's more and more becoming a thing of the past as the models get smarter and smarter; it's becoming less of a thing. But a few years back, and especially in the GPT-4 days, this was the thing.
数据标注行业最耐人寻味的是——虽然不确定我们该不该展开讨论——它必须不断自我革新。因为当AI在某些方面超越人类时,你就不再需要人工标注了。所以这个行业需要持续拓展边界,改变标注数据的类型,就像你们已经通过RLHF完成了前一阶段的工作那样。
The interesting bit about the data-labeling industry, and I'm not sure how far we want to go on that tangent, is that it has to constantly reinvent itself, because the AIs keep getting smarter. At some point, there are certain things you don't want to label with humans if AI can already do them. So you move the frontier and change the type of data you are labeling, once the previous part has already been handled through RLHF.
我们一直在讨论强化学习,但整个流程的第一阶段其实是模型的预训练,也就是无监督学习对吧?为了让更多观众理解,能否再解释下无监督与有监督的区别?预训练的无监督特性与自监督等概念有什么细微差别?
We've been talking about RL, but the first phase of all of this is the creation, the pre-training of the models. That is unsupervised learning, right? Would you, again to make this broadly interesting to people, define unsupervised versus supervised, and in what way is pre-training unsupervised versus self-supervised, or whatever the nuance is?
是的。我认为这些更多是程度差异而非绝对区分。之所以称为无监督预训练,是因为按照某些定义,你不需要为输入数据添加额外标签——直接把原始文本喂给模型就可以了。
Yeah. I think those are nuances, and I don't think they are as stark and sharp as some people like to draw them. Why is pre-training called unsupervised? Because, by some definitions of it, you don't need any extra labels on the data you feed into the model. You just feed in the text as is.
在某种程度上,你可以认为数据已经自带标签,因为它是自我标注的。如果你让模型根据文本预测下一部分内容,这在某种意义上就是一种标签,但它是自我监督的,因为我们没有明确告诉模型什么是对的、什么是错的,或者我们想要什么、不想要什么。我们只是希望它预测数据的其他部分。同样的方法也可以应用于图像,你可以遮盖图像的一部分,然后让模型预测被遮盖的部分。
In some way, you may argue that the data is already labeled, because it is self-labeled. If you tell the model, from this text, predict the next part of the text, that is in some sense a label, but it is self-supervised, because we don't explicitly tell the model what is right or wrong, or what we want or don't want from it. We just want it to predict the other part of the data. You can do the same thing with images: you can mask part of an image and tell the model, predict the next bit of the image.
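The "self-labeled" point can be shown in a couple of lines: the targets for next-token prediction are just the same token stream shifted by one position, so no human labels are needed. The token sequence here is made up:

```python
# Minimal sketch of self-supervision: the labels come from the data itself.
tokens = [3, 7, 7, 1, 4, 9]          # a toy tokenized text
inputs = tokens[:-1]                 # what the model sees at each position
targets = tokens[1:]                 # what it must predict: the next token

# Each training pair is (context token, next token), derived with no labeling.
pairs = list(zip(inputs, targets))
```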
但每当涉及经典机器学习中的目标和标签概念时,我想我们讨论的是分类器。监督学习就是你有某些目标的概念,知道你的目标是什么,以及某些标签的概念。监督学习就像是根据目标预测这些标签,这是一种映射关系。但实际上有趣的是,目标通常比标签包含更多的信息量。
But there is this classic machine-learning notion of inputs and labels; I guess we were talking about classifiers. Supervised learning was: you have some notion of inputs and some notion of labels, and you predict those labels from the inputs. It is some type of mapping. But what's interesting is that there are usually many more bits of information in the inputs than in the labels.
研究目标本身的结构,比学习映射关系能带来更多的学习和智能。因此,将全部计算资源用于学习无标签的数据本身是正确的做法。这通常被称为表征学习,即研究数据及其特性。
And studying the structure of the inputs themselves yields much more learning and much more intelligence than learning the mapping itself. So spending the whole compute budget on just learning the data itself, without the labels, is the right thing to do. That is what is often called representation learning: studying the data and its properties.
好的,明白了。那么回到强化学习(RL),你前几天发推说gRPO在很大程度上加速了大多数美国研究实验室的研究学习进程。那么gRPO到底是什么?
Okay, great. All right. So going back to RL: you tweeted the other day that the GRPO release has, in a large way, accelerated the research programs of most US research labs. So what is GRPO?
是的,这有点半开玩笑的意思。我在这里稍微推测了一下具体发生了什么,因为我并没有去过大多数美国研究实验室,但我对事件经过有些心理模型。长话短说,gRPO是深度求索(DeepSeek)的开源发布。所有长期关注AI讨论的人都知道深度求索时刻——当这家似乎做得非常出色的中国公司发布新模型时。
Yeah, that was a little tongue-in-cheek. I am extrapolating here a little about what exactly happened, because I haven't been in most US research labs, but I have some mental model of what happened and how. Long story short, GRPO was the open-source release from DeepSeek. Everyone who is terminally online and follows the AI discourse knows that DeepSeek moment, when the Chinese company that seemed to be doing really, really great work released a new model.
那也是一个预训练模型,一个推理模型。他们开源了算法,开源了许多成果,整体上是一个非常出色且技术精湛的发布。有很多讨论说他们预训练模型的成本特别低,这是关于深度求索时刻讨论的一部分。另一部分讨论是他们某种程度上公开了推理过程。
It was a pretrained model and a reasoning model. They open-sourced the algorithm, they open-sourced a lot of what they did; overall, a really great and technically excellent release. There was a lot of discourse about how they pretrained their model particularly cheaply, and that was part of the discussion about the DeepSeek moment. But the other part of the discussion was that they essentially released their reasoning recipe.
这个发布距离我们的o1发布并不久。据我所知,我们的o1发布让许多美国实验室措手不及。据我所知,他们基本上没有同样先进的强化学习研究计划,可以说几乎没有。我认为世界上可能只有一家公司...当然我可能不知道很多事情,但有时和人交谈会让你意识到这点。所以这就是我对这个世界的理解版本。
It was not very long after our o1 release. As far as I know, our o1 release mostly caught a lot of US labs by surprise. To my knowledge, they didn't have similarly advanced RL research programs, basically no one did. I think there was maybe only one other company in the world that did, as far as I am aware. There are probably a lot of things I don't know, but you learn things talking to people sometimes. So this is my version of the world.
如果你查阅DeepSeek早期的论文,会发现该公司在某些方面进行的强化学习研究与我们的工作非常相似。需要澄清的是,OpenAI当前的研究并非严格意义上的GRPO,两者存在诸多差异,但部分内容确实具有相似性。最关键的是,它们都属于大规模策略梯度算法。而DeepSeek当时的研究领域与我们略有不同。
If you look at the older papers from DeepSeek, that company was doing RL research that was, in some ways, pretty similar to what we are doing. And I have to clarify: what OpenAI is doing is not exactly GRPO. It is slightly different in many ways, but some parts are definitely similar. Most importantly, those are both large-scale policy-gradient algorithms. DeepSeek was doing research in a slightly adjacent area.
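To give a flavor of the group-relative idea at the heart of GRPO, as described in DeepSeek's release: sample several answers per prompt, then score each answer against its own group's average rather than against a learned value function. The rewards below are invented, and this omits the clipping and KL-penalty terms of the full algorithm:

```python
import numpy as np

# Eight sampled answers to the same prompt, graded right (1.0) or wrong (0.0).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])

# Group-relative advantage: normalize each reward by the group's own statistics.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# In the policy-gradient update, each answer's log-probability is scaled by its
# advantage: above-average answers are reinforced, below-average ones suppressed.
```

The appeal is practical: the group itself provides the baseline, so no separate critic network needs to be trained alongside the policy.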
其实差距并不大。当我们发布O1并向世界展示通过扩大语言模型的强化学习规模可以获得优异成果时,对DeepSeek公司而言,意识到'我们距离类似成果并不遥远'并不需要太大跨越。他们确实做到了——训练出推理模型并公开了方法论,时间上与我们发布L1相差无几。
They were not very far behind. When we released o1 and told the world that you can get those great results by scaling up reinforcement learning on language models, I think it was not a very big hop for DeepSeek to realize: okay, we are not very far from getting similarly good results. And they did it. They trained their reasoning model, they released it, and they told the world how, not much later than we released o1.
对于那些尚未建立推理模型训练研究体系的美国实验室而言,这家中国公司的技术公开具有重大意义。它帮助我们快速启动了推理模型的研发进程,相比完全自主探索,这让我们节省了大量时间。
And I think, for all the US research labs that didn't yet have a research program for how to train reasoning models, they looked and saw: oh, there's this Chinese company, and they released how to do it. It helped those labs kick-start training reasoning models much faster than they would have otherwise, if they had had to find all those bits themselves.
扩大强化学习规模需要哪些条件?OpenAI曾经历专注于预训练的阶段,而过去12-18个月(或类似时间段)重点转向了计划第二阶段——强化学习的规模化。这是否仅意味着为RL提供更多算力、数据及标注?正如我们刚才讨论的,其核心要求究竟是什么?
What does it take to scale RL? There was a phase when OpenAI was very focused on pre-training, and then, if I understand correctly, over the last twelve to eighteen months or whatever the time period is, the emphasis has been on the second part of the plan, which is scaling RL. Is it just a question of giving RL more compute, more data, more labeling, as we were saying? What does it take?
首先必须明确:强化学习本质上是困难的。从概念层面看,预训练极其简单——可以说是最基础的数学运算,经过多年优化已形成成熟的规模化方案。但强化学习在数学复杂度上高出数个量级。
The first thing that is important to know and understand: RL is hard. Conceptually, and there's still a lot of depth to it, but very conceptually, mathematically speaking, pre-training is dead simple. It is about the simplest thing you can do, and a lot of thought and optimization has already gone into doing that very simple mathematical operation very well at very large scale, over a few years. RL is much, much more complex.
强化学习过程中存在更多变量环节,随着规模扩大会出现更多潜在故障点和瓶颈。这是个更为精密的系统,容错空间极小。虽然这个类比可能有些夸张,但就像标准化生产的钢铁厂与半导体制造的差别——全球仅有极少数企业能胜任半导体制造,因为其中存在无数可能出错的环节。
There are many more things going on in a reinforcement learning run, many more things that can go wrong, especially as you scale up, and many more types of bottlenecks and failures. It's a much more delicate thing; there's much more room for error. I don't want to push the parallel too far, because it's a little overblown, but to give some sense of it: you can have a steel factory where the process is relatively standardized, and you make blocks of steel that are uniform, nice, and well defined. Versus building semiconductors, which very, very few companies in the world can do, because there are so many things that can go wrong.
制造优质半导体需要对细节的极致把控,其内部复杂度极高。虽然我不想贬低大规模预训练的技术难度,但强化学习确实涉及更多动态组件——整个技术栈中有更多需要精确协调的要素,才能确保大规模运行的顺利实施。
You have to pay a lot of attention to detail to make a great semiconductor, and it's very complex internally. In many ways it's like that. I don't want to diminish pre-training, because there is a lot of very hard technical difficulty in doing it well at large scale. But there are just many more moving pieces, many more elements of the reinforcement learning stack, that you need to get right for a large-scale run to be successful.
你提到开发搜索智能体,类似AI代理?这如何与工具使用、自主代理与推理强化学习(RL)相协调?能否帮我们理清各自功能及相互影响?
You mentioned working on a search agent, agentic AI. Where does that all fit: tool use, agentic autonomy versus reasoning RL? Help us reconcile what does what, and what impacts what.
我认为关键在于,我坚信AI能通过自动化、问题解决以及实现我们期望的善举,为世界和生活带来积极影响。虽然时间不长,但近两年我们已进入这样一个时代:最初AI能即时回答问题,现在它能思考一两分钟——这感觉很长。但想想它能在此期间解决多少问题?AI在可处理事项上可能稍快,但能力仍有局限。
I think the important thing is this: I believe there can be a lot of positive impact from AI on our world and our lives, through automation, through problem solving, through AI doing good things for us, the things that we want. And for a long time, again, it's not that long, the last two years or so, maybe approaching three, we've been living in a world where we ask questions of the AI, and at the beginning it gave us an answer instantly. Now it can think for a minute or two, which feels long. But in many ways, what can it do in two minutes, if you think of how many problems you might want solved? AI is probably a little faster than us at the things it can solve, but there's still a limit to what it can do.
仍有许多任务需要AI耗费更长时间处理。当我调试编解码器时,它需要运行几分钟。我们内部有许多正在开发的功能允许更长时间运作,只是尚未找到合适的产品形态部署。如今某些任务上,模型已能思考30分钟、1小时甚至2小时以上。
There are still a lot of tasks that, you know, would take an AI much, much longer to do. When I prompt Codex, it works for a while, again, a few minutes. There are a lot of things we have internally, and that we are working on, that allow the model to work for much longer. We still haven't figured out the right product to deploy them in. But the models can think for thirty minutes, an hour, two hours these days on certain types of tasks and problems, even longer than that.
它们普遍具备这种能力。我们需要让这个过程更有实用价值,能真正解决现实问题——无论是编程、旅行预订、制定计划,还是设计房屋或电子设备等任何你希望模型完成的事。这很大程度上依赖于模型长时间独立思考,权衡更多可能性,而非机械执行冗长任务列表。
And they are generally capable of doing so. We need to figure out how to make that process more useful, more able to actually tackle various problems in real life, whatever it is: coding, booking travel, making plans, even designing houses or new electronic devices, or whatever else you would like models to do, we would like them eventually to be able to do for us. A lot of this comes through the models thinking independently for longer periods of time, considering more alternatives, and sometimes just going through a slog of very long lists of tasks.
所以智能体的自主性源于基础推理能力?是否存在在线强化学习的概念,即智能体实时行动并从现实世界学习?
So the agentic part is powered by fundamental reasoning. Is there a concept of, I guess, online RL, where as the agent does something and learns from the real world, the RL happens in real time?
严格来说所有RL都在进行。语言模型涉及的RL大多是在线的,但仍属于训练环节——与用户端分离。最近了解到Cursor等公司尝试让用户参与在线训练模型。理论上ChatGPT等产品都能通过用户反馈进行实时强化训练。
So generally, all of RL is happening online. Most of the RL you hear about with language models is online, but it's done in a way that is still a training run. It's still being trained separately from the user. There have been a few models in the world, and I've learned recently that I think Cursor is trying to train some models online with their users in the loop. And it is theoretically possible to train models, like in ChatGPT or every other product, just responding to the users and reinforcing through whatever rewards you get in there.
但据我所知OpenAI目前未采用这种方式。这虽前景广阔但存在风险——你无法完全控制强化循环中的变量。在建立完善防护机制前,我认为不应在ChatGPT等复杂大规模系统实施此类方案。
But this is not, as far as I am aware, what OpenAI is doing at the moment. And it can be great, but it can also be dangerous, because you are not really controlling very much what you are reinforcing in that loop and what could happen. So at least until we have really good safeguards, I don't think we should do that in anything as complex and large scale as ChatGPT.
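The distinction Jerry draws, RL as a loop that is "online" with respect to its own environment but still separate from users, can be illustrated with a minimal bandit training run. This is a toy sketch of the general RL recipe (act, observe a reward, reinforce), not anything resembling production model training:

```python
import random

def train_bandit(steps=5000, eps=0.1, lr=0.1, seed=0):
    """Minimal RL training run: act, observe a reward, reinforce.
    The 'environment' is a two-armed bandit with hidden payoff rates;
    no user is in the loop, mirroring training done apart from users."""
    rng = random.Random(seed)
    q = [0.0, 0.0]              # learned value estimate for each action
    p_success = [0.2, 0.8]      # hidden ground truth: arm 1 pays off more
    for _ in range(steps):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max(range(2), key=lambda i: q[i])
        reward = 1.0 if rng.random() < p_success[a] else 0.0
        q[a] += lr * (reward - q[a])   # nudge estimate toward observed reward
    return q

q = train_bandit()
assert q[1] > q[0]   # the loop discovers the better action from reward alone
```

Training "online with users in the loop" would mean replacing the fixed `p_success` environment with live user feedback, which is exactly where the control problem Jerry mentions comes in.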
是的,很有趣。正好说到这个,我们稍微讨论一下对齐问题。对齐是强化学习(RL)的范畴吗?我的意思是,你是否通过教导模型什么是对错来创建模型的对齐性?
Yeah, interesting. And very much on that note, talking about alignment for a minute. Is alignment an RL thing? I mean, do you create alignment in the model by teaching it what is right and wrong?
某种程度上算是,但也不完全是。某种程度上,对齐是关于引导模型趋向某些行为,这确实属于RL的范畴和问题。但同时,你也希望模型能知道什么是对错并理解世界。并非所有这些都是推理和RL问题。
Kind of, yeah. It's a little bit of yes and a little bit of no. In a way, alignment is about steering the model toward certain behaviors, and that is definitely an RL thing and an RL problem. But, also, you want the models to know what is right and what is wrong and understand the world. And not all of those are reasoning and RL problems.
很多时候它们只是AI问题。为了达到对齐,模型需要知道对错才能选择正确。我不认为你可以简单地告诉模型,比如展示几个好的行为它就会照做。模型需要深刻理解其行为及后果才能真正选择正确的事情。我认为这是一个永无止境的追求,因为即使对人类来说,定义什么是‘对齐’也并不容易。
Those are very often just AI problems. And to be aligned, the model needs to know right from wrong in order to choose right. I don't think you can just tell the model, like, show it a few good things to do, and it will do them. The model needs to deeply understand its actions and their consequences to really be able to choose the right thing. And I think it's a never-ending pursuit, because even for humans it's not super easy to define what we consider alignment.
随着我们文明的演进,对齐的概念和人类的目标也会不断变化,我们需要不断引导模型趋向这些目标,持续向它解释我们期望它做的事情。但这是任何AI研究项目中非常非常重要且核心的部分。
I think as our civilization evolves, the notion of alignment and the goals of humanity will keep evolving, and we'll need to keep nudging the model towards those things, keep explaining to it the things that we want from it. But it's a very, very important and central part of any AI research program.
是的,这引出了接下来的一系列问题,关于RL在哪些方面高效或低效。看起来它在数学和编程方面特别出色。那么接下来的明显问题是:对于世界的其他部分呢?不过稍微深入探讨一下数学方面——就在九月份,几周前,你们在ICPC世界总决赛上做了一些令人难以置信的事情。
Yeah, which brings a whole next series of questions about where RL is effective versus less so. It seems that it's been particularly good for math and coding. And then the next obvious question is, what about the rest of the world? But going down the rabbit hole a little bit on math: just in September, a few weeks ago, you guys did something unbelievable at the ICPC World Finals.
你愿意谈谈那是什么吗?从模型技术的幕后角度来看,发生了什么?
Do you wanna talk about what that was and what went on from a model technical perspective behind the scenes?
从我们的角度、模型的角度来看,发生的事情出人意料地少。我们只是有一个相当聪明的模型。当我们要求它解决编程问题时,它就能正确解答。这背后有一点背景故事:我们曾长期使用编程谜题作为研究想法的绝佳测试平台。这些是很好的实验问题,但它们从未被视为产品的一部分。
Surprisingly little happened from our perspective, from the model perspective. We just have a pretty smart model, and when we ask it to solve programming problems, its answers are correct. The little bit of backstory is that we used programming puzzles for a while specifically as a very nice research test bed for our ideas. Those are nice problems to experiment with, and they weren't ever considered part of the product.
但这些都是相当复杂的问题,需要大量思考,非常适合给予奖励。因此所有研究人员都喜欢研究这些问题,以此验证他们的想法。你总是需要一个数据集。我会拿一个编程谜题数据集来尝试。我认为正因为如此,我们的模型在竞技编程方面总是表现得非常出色,这算是一种副产品。
But those are pretty complex problems, they require a whole bunch of thinking, and they are very nice to give rewards for. So all of the researchers just liked working on those problems as a way of trying out their ideas. You always need a dataset: I'll take a dataset of programming puzzles and try it. And I think because of that, our models just were always very, very good at competitive programming, as a kind of byproduct.
我们从未刻意追求在这方面表现优异,但研究人员一直在用这些问题测试他们的想法。正因如此,每一轮训练后,无论我们做什么,最终都会在这些类型的谜题上表现得非常出色。之后参加比赛对我们来说就有点形式化了。这主要是向世界展示这些模型的能力水平。但必须承认,并非所有领域(至少与当前人类基准相比)我们都能像在编程竞赛问题上那样表现出色,因为这些问题是经过许多研究人员长期尝试的。
We never tried to be good at it, but researchers were trying their ideas on it. And because of it, every training round, whatever we are doing, just ends up being very, very good at those types of puzzles. And then it was a little bit of a formality for us to go and submit that to a competition. It's largely about demonstrating to the world what the level of capability in those models is. But I think it is important and true to acknowledge that not in all domains, at least compared to the human baseline at this moment, can we be as good as on programming competition problems, in many ways because those were tried for a long time by many, many researchers.
研究人员并不总是像他们本可以的那样,在人们使用GPT或我们模型解决的实际问题上投入足够时间。我希望他们能在这方面多下功夫。
And researchers don't always spend as much time as they could, and as I would like them to, on the very practical problems that people go to ChatGPT or our models with.
太棒了。所以这算是开箱即用的能力,没有经过专门训练。提醒一下,我们讨论的是2025年9月刚在阿塞拜疆巴库举行的ICPC世界总决赛(国际大学生程序设计竞赛),OpenAI在五小时内解决了12道复杂算法问题,实际上相当于在人类队伍面前获得了第一名。这是背景信息。
Great. So it sort of came out of the box; there was no specific training for it. And just to remind people, what I'm referring to, what we were discussing, is the ICPC World Finals that was just in September 2025, the International Collegiate Programming Contest, which happened in Baku, Azerbaijan, where OpenAI solved 12 complex algorithmic problems within the five-hour time limit, effectively taking the equivalent of first place ahead of the human teams. So just for context.
我们参加了一系列比赛:ICPC、今年早些时候的国际信息学奥赛(IOI),以及AtCoder启发式竞赛——我们在后者中获得第二名,输给了一位曾受雇于OpenAI的波兰人,这巧合很有趣。我们一直在寻找一个时机,当我们的模型足够聪明时,就能与那些极具天赋的人类在这些竞赛中一较高下。
We did a little bit of a round tour of various competitions. We did ICPC. We also did IOI, the International Olympiad in Informatics, earlier this year, and the AtCoder Heuristics competition as well, where we came second, behind a single human who is also Polish and used to be employed by OpenAI some time ago. Funny coincidence. But I think we were looking for a moment in time where our models are kind of smart enough to be able to compete in those competitions with some of the incredibly smart and talented humans.
但这从来不是我们的特定目标和焦点。我们认为如果做好训练智能模型的研究,它们自然应该具备这些能力。我们把这个当作里程碑,现在继续前进。希望未来能看到——实际上我们已经看到越来越多实用成果每周或隔周出现在推特上。
But it was never our particular goal and focus. It's kind of like, we think if we are doing good research for training smart models, they should be smart enough to do those things. We kind of take that milestone and now we keep on moving forward. I hope, and I think, we are already seeing more and more practical and tangible things coming out, like every week or every other week, on Twitter.
我确实看到可信的报告,有实际科学家使用我们的推理模型帮助他们进行计算,解决技术难题。这才是我们想要的——参加竞赛很酷,但人们参赛是为了证明能力,最终是要去解决前沿技术问题。这也正是我们对模型的期望。
I do see what I think are credible reports of actual scientists using some of our reasoning models to help them perform calculations and solve hard technical problems. And I think this is where we want to be. Solving competitions is cool, but people solve competitions to prove that they can then go to an actual frontier-level job and solve new technical problems. And this is kind of what we want from our models as well.
我们刚才提到过这一点,至少对我这样的人来说,在概念上我能理解强化学习如何有效用于数学题或编程题的训练。我认为当前的一大问题是,如何将其推广到其他领域和情境中,那里的答案并非简单对错,可能更为模糊。比如你们组织前几天发布的GDPval,就是一种跨行业评估表现的方法。对于强化学习作为全球成功路径的普适性,你们有何见解?
We alluded to this a second ago. Mentally, at least for somebody like me, I understand how RL could be used very effectively to train against math problems or coding problems. I think one of the big questions right now is how you do that for the rest of the world, in contexts and disciplines where the answer is not right or wrong, maybe a little more murky. You guys as an organization came up with GDPval the other day, which is a way of evaluating performance across different industries. What is your thinking in terms of generalization of RL as a path to success for the rest of the world?
我认为简短的回答是:既然人类能学会所有这些,那么只要存在任何评估表现的方式,能判断事情进展对错,并能计算出这种反馈——比如你需要能够以某种方式衡量某件事的好坏——你就可以优化它,进而应用强化学习。可以说,如果连对错概念都没有,人类也无法进步学习,因为必须存在某种学习信号。问题往往在于获取这种反馈的便利程度。
I think the short and quick answer is: somehow humans can learn all those things. And as long as there is any way to evaluate performance and figure out if something is going right or wrong, and you can compute that feedback, like, you need to be able to somehow calculate how well something went, then you can optimize it, and then you can do reinforcement learning with it. There can be an argument that if there is no notion of what is right or what is wrong, then humans are not able to improve and learn either, because there needs to be a learning signal coming from somewhere. It is mostly a question of how convenient and how easy it is to get that feedback.
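The claim here, that any computable feedback signal is enough to optimize against, can be shown with a tiny black-box hill climber. The scoring function below is a made-up stand-in for whatever feedback you can calculate; nothing about it comes from the conversation itself:

```python
import random

def hill_climb(score, dims=5, steps=500, seed=0):
    """Optimize a candidate using only a computable feedback signal:
    propose a small random change, keep it if the score improves.
    No notion of 'right' is needed beyond the score itself."""
    rng = random.Random(seed)
    x = [0.0] * dims
    best = score(x)
    for _ in range(steps):
        cand = [v + rng.gauss(0, 0.1) for v in x]
        s = score(cand)
        if s > best:            # keep any change the feedback signal likes
            x, best = cand, s
    return x, best

# Toy feedback: higher is better, with a peak at all-ones.
target = lambda x: -sum((v - 1.0) ** 2 for v in x)
x, best = hill_climb(target)
assert best > target([0.0] * 5)   # feedback alone was enough to improve
```

This is far cruder than RL on language models, but it makes the point: the hard part is not the optimization loop, it is obtaining a feedback signal that is convenient to compute and actually measures what you care about.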
从事强化学习的人都应努力训练系统处理更复杂有趣的训练信号。这里常会出现所谓的'奖励破解'现象——强化学习过程中经常发生且重要的问题。你设计奖励机制鼓励某些行为,但有时奖励标准与实际目标存在偏差。既要训练系统执行受奖励行为,又要处理预设奖励与实际期望之间的天然错位。
And everyone doing reinforcement learning should strive to be able to train on more and more complex and interesting training signals. Very often there comes up a notion of what is often called reward hacking. It happens a lot when doing reinforcement learning, and it's an important problem. You shape your reward in some way to reward certain behaviors, but sometimes what your reward captures is not what you actually want. On one hand, you need to train the model to do the behaviors you reward, but on the other, there is a natural mismatch between the reward that you give the model and what you actually want.
有时系统会机械执行奖励标准却违背初衷,这时就需要修正。这几乎像育儿挑战。某种程度上可以说这是强化学习的局限,但细想之下,人类系统中同样存在这种现象。我们的激励体系里充斥着类似情况。
And there are sometimes moments where the model does what your reward says, but it's not in the spirit of what you had wanted, and we need to fix it. It's almost like a parenting challenge. And in some way, you can say it's a limitation of reinforcement learning. But when I was thinking about it, I realized a lot of that happens in human systems as well. There are a lot of incentive systems and reward systems.
职场等各类人类组织中,人们获得的奖励并不总是与系统终极目标一致。人类不断以各种方式'破解'奖励机制,制定者则持续玩着'打地鼠'游戏——设置合理奖励后观察系统反应。这在政策制定和激励方案中都是普遍难题。强化学习研究也面临同样的'打地鼠'挑战:如何让奖励机制越来越精准地反映你对模型的真实期望。
And it even happens in workplaces, in all kinds of human groups, that humans have rewards that are not always optimized for the ultimate goals of the system. And they hack rewards constantly in many different ways. And there is a constant whack-a-mole game between setting the right rewards and seeing what the system does with them. That's a huge issue in almost any policy making and any incentives program. And it is the same kind of whack-a-mole game in reinforcement learning research: trying to make sure your rewards better and better represent what you actually want the model to be doing.
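Reward hacking as described above can be reproduced in a few lines: optimize a misspecified proxy and the optimizer picks exactly the behavior you did not want. Both "reward" functions below are invented purely for illustration:

```python
def true_quality(answer: str) -> int:
    # What we actually want: a correct, concise answer
    # (in real RL this is hidden from, or too expensive for, the trainer).
    return int("42" in answer and len(answer) < 50)

def proxy_reward(answer: str) -> int:
    # What we actually reward: longer answers score higher,
    # a flawed proxy for "effort" or "thoroughness".
    return len(answer)

candidates = [
    "42",                   # correct and concise
    "maybe 41? " * 20,      # long and wrong: it hacks the proxy
]

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_quality)
assert best_by_proxy != best_by_truth  # the optimizer exploits the proxy
```

The whack-a-mole game Jerry describes is iterating on `proxy_reward` until its argmax matches `true_quality`'s, for every exploit the optimizer finds.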
好的,或许我们可以宏观总结下:你前几天发推文说'我们集体认为AGI昨天就该被造出来了,之所以还没实现,主要因为某个待修正的简单错误'——这推文很酷。你认为预训练与规模化强化学习的结合能带我们实现AGI吗?
All right, so maybe to zoom out to close this conversation. You said the other day, you tweeted: we all collectively believe AGI should have been built yesterday, and the fact that it hasn't yet is mostly because of a simple mistake that needs to be fixed. Which is super awesome as a tweet. Do you think that the combination of pre-training and scaled RL takes us to AGI?
这里有个有趣的问题:是否存在不依赖预训练的强化学习?界限在哪里?总体我认为当前的预训练是必要的,当前的强化学习也是必要的,未来肯定还会需要更多要素。
There's always an interesting question of whether we consider something to be not pre-training or RL, and where the limit is. I generally think something like the pre-training we are doing today is necessary. I think something like the RL we are doing today is necessary. And there will surely be a few things more.
我们在这些领域有许多雄心勃勃的研究计划。关于研究空间中的距离问题很难一概而论。对某些人而言,我们的目标与规划构建的内容与现有技术相去不远;而对另一些人来说,他们会认为这完全是另一回事,与现有方向截然不同。
And we have a lot of very ambitious research programs on some of those things. The question of distance in research space is hard to answer. For some people, what we want to do and what we are planning to build is not very far from those things. Others will say, oh, it's completely different, and it's very much not the same.
我不想争论这是否相同。但我们正在并希望持续改变训练方式,使其更能体现我们所认为的正确智能形态和最有价值的学习形式,同时不断研究各种课题。至于与通用人工智能(AGI)的距离,这也是个极其复杂的问题。有人曾对我说——我认为这个观点很正确——如果你让十年前的人看到今天的ChatGPT,他们很可能会称之为AGI。但我们今天并不这么认为,因为它仍存在诸多局限。
I don't want to go into debates about whether it's the same or not. But we are, and want to be, constantly changing the way we train the models to better represent what we think the right form of intelligence is and the most useful form of learning is, and we are constantly researching various things. And then the question of what the distance from AGI is, is also a very complex question. Someone said it to me, and I think it is right, that if you talked to someone from ten years ago and showed them ChatGPT from today, they would probably call it AGI. But we don't call it that today, because it still has a lot of limitations.
我们非常清楚这些局限性,也确信能够解决它们。未来模型可能还会出现需要修复的新限制。最终存在一个极难回答的根本问题:何时模型才能在无需大量外部输入、无需人类持续修正的情况下实现自我完善?我认为这是个非常棘手的问题。
And we are all very aware of those limitations, and we are pretty sure we can resolve them. There will probably be some further limitations of the future models that will need to be fixed. There is an ultimate question, which is very hard to answer: when is the moment that the model can improve itself without that much external input, and without humans working on it and fixing it? And I think it is a very hard question.
这是人类必须尝试回答的严肃问题,因为即便到那时,系统很大程度上仍依赖于我们的基础设施,但将能开始自主修复而无需人工干预。关于AI届时真正能实现和解决什么,其预测会比我们现在能做的更加模糊——尽管我认为当前我们已能相当出色地进行这类预测。
It is a serious question that we need to try to answer, that humanity needs to try to answer, because even at that moment the model will still largely depend on our infrastructure and our systems, but it will be able to start fixing itself without us having to fix it. The predictions of what AI will really be able to do and solve at that moment start becoming a little bit murkier than the predictions we can make right now, which I think we can still do pretty well.
从哲学角度,你可能听过理查德·萨顿前几日在Dwarkesh播客中的观点(这期精彩节目值得一听),他本质上认为通向AGI的唯一路径将是纯粹的强化学习(RL),而大语言模型(LLMs)——希望我没有曲解他的意思——本质上是对现实的模仿,RL才是从现实中学习。你对此哲学问题有何看法?
But philosophically, you may have heard Richard Sutton the other day on the Dwarkesh podcast, which is a wonderful episode that people should really listen to, effectively saying that the only path to AGI was going to be pure RL, and that fundamentally LLMs, and I hope I'm characterizing what he said appropriately, were a flawed premise, because effectively that was imitation of reality, whereas RL learns from reality. Do you have any thoughts, philosophically, on that question?
是的。我还没机会完整收听那期节目,因此不完全了解该观点的细节。但可以说,我们目前正在非常严肃地研究RL语言模型。就纯粹RL而言,我认为完全依赖RL并不合理。
Yeah. I haven't had a chance to fully listen to that episode yet, so I don't have all the details of that thought. But what I can say is that we are doing quite serious RL on language models these days. In terms of pure RL, I don't think really pure RL makes sense.
RL需要预训练才能成功,而正如我之前所说,预训练同样需要RL才能成功。若没有RL,我们当前的研究计划就失去意义。OpenAI以及我确信其他AI实验室都在认真对模型进行大量强化学习。当很多人讨论LLMs是通向AGI的入口还是岔路时,他们往往指的就是预训练阶段。
RL needs pre-training to be successful. And I think pre-training, as I said before, needs RL to be successful as well. I don't think the research program we are doing would make sense without RL. But OpenAI is, and I'm pretty sure all other AI labs are as well, very serious about doing a lot of reinforcement learning on our models. And I think when a lot of people ask whether LLMs are an on-ramp or an off-ramp to AGI, very often they really mean pre-training.
但同样明显的是,我们当前的做事方式仍显不足且不够全面。未来还需要对现有架构进行更多调整。不过有时人们会说,哦,如果你在做强化学习,那就不是大语言模型,而是另一回事。有时他们又说,如果在推演和思维链中能编写程序,那就不再是纯粹的神经网络,而是神经符号系统。所以很容易混淆概念。
But it's also clear that the current way we are doing things is not yet enough and not yet everything, and there will need to be further changes to the setup. But sometimes people say, oh, if you are doing RL, it's not an LLM, it's something else. Sometimes they say, oh, if it can write a program in its rollout and in its chain of thought, it's not a neural network only, it's a neuro-symbolic system. So it's easy to get confused.
有些人认为某些技术属于大语言模型范畴,而其他则不是。但就我个人而言,我认为我们已经为下一步发展奠定了良好基础——我们最初用Transformer训练翻译模型,然后进行大规模数据预训练,接着实施基于人类反馈的强化学习(RLHF),现在正开展大规模强化学习。
Some people consider something an LLM and another thing not. But personally, my view is that what we have is a pretty good foundation for the next step. We first had transformers trained for translation. Then we were pre-training them on large-scale data. Then we were doing RLHF on them. Now we are doing large-scale reinforcement learning.
我们将继续推进更复杂的工作。在这个过程中,架构很可能会发生或大或小的改变。我个人认为我们正走在正确的道路上,未来不会是完全的转向,而是持续增添新要素,同时逐步淘汰那些曾帮助我们达到特定智能水平但已不再需要的旧组件。
We'll do a few more, and more complex, things. There is a chance that somewhere along the line the architecture will start changing more or less significantly. And I personally think we are on the right path, and it will feel less like completely turning around and more like keeping on adding more things, and maybe dissolving some old elements that carried us to a particular level of intelligence but are not needed anymore.
看来这是个完美的结束点。您慷慨地分享了宝贵时间和见解,让我们得以窥见OpenAI的内部运作、您的研究重点、预训练的关键环节以及规模化强化学习的幕后细节。这次对话非常精彩。杰瑞,非常感谢您,真的受益匪浅。
Well, this feels like a wonderful place to leave it. You've been very generous with your time and thoughts, giving us a glimpse into OpenAI, what you work on, what it looks like behind the scenes, and the key aspects of pre-training and scaling reinforcement learning. It's been a wonderful conversation. Jerry, thank you so much. Really appreciate it.
非常感谢。我也非常享受这次交流。
Thank you very much. I enjoyed being here a lot too.
大家好,我是马特·特克。感谢收听本期MAD播客。如果您喜欢我们的节目,恳请尚未订阅的听众点击订阅,或在您收听/观看的平台留下好评与评论。这些支持对我们打造优质播客、邀请重磅嘉宾至关重要。感谢聆听,下期再见!
Hi, it's Matt Turk again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already or leaving a positive review or comment on whichever platform you're watching this or listening to this episode from. This really helps us build a podcast and get great guests. Thanks and see you on the next episode.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。