Latent Space: The AI Engineer Podcast - 格雷格·布罗克曼谈OpenAI通往通用人工智能之路 封面

格雷格·布罗克曼谈OpenAI通往通用人工智能之路

Greg Brockman on OpenAI's Road to AGI

本集简介

OpenAI联合创始人兼总裁Greg Brockman做客本期节目,探讨GPT-5与GPT-OSS、软件工程的未来、强化学习为何仍在持续扩展,以及OpenAI如何规划实现通用人工智能的路径。 00:00 开场介绍 01:04 OpenAI推理能力的演进历程 04:01 语言模型中的在线学习与离线学习 06:44 强化学习中的样本效率与人工标注 08:16 计算规模扩展与超临界学习 13:21 强化学习的实时性限制与现实世界交互 16:34 ARC研究所实践与DNA神经网络 19:33 定义GPT-5时代 22:46 模型智能评估与任务难度 25:06 开发者使用GPT-5的实用建议 31:48 模型规格说明 37:21 强化学习偏好挑战(如try/catch机制) 39:13 GPT-5中的模型路由与混合架构 43:58 GPT-5定价策略与计算效率提升 46:04 自我优化的编程智能体与工具使用 49:11 终端设备模型与本地/远程智能体系统 51:34 OpenAI的工程实践与大语言模型应用 54:16 面向AI优化的代码库与团队架构 55:27 AGI时代工程师的价值 58:42 人工智能研究现状与实验室多样性 01:01:11 OpenAI的优先事项与重点领域 01:03:05 给创业者的建议:为时不晚 01:04:20 未来展望与结束语 01:04:33 2045年时间胶囊:算力与丰饶时代的未来 01:07:07 2005年时间胶囊:更多问题将浮现

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

大家好,欢迎收听《生活空间》播客。我是Kernel Labs的创始人Alessio,今天和我一起的是Small AI的创始人Swiggs。

Hey, everyone. Welcome to the Living Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swiggs, founder of Small AI.

Speaker 1

大家好,大家好。我们非常激动能邀请到Greg Brockman加入我们。欢迎你。

Hello. Hello. And we are so excited to have Greg Brockman join us. Welcome.

Speaker 2

谢谢邀请。很兴奋能来到这里。

Thank you for having us. Excited to be here.

Speaker 1

你不需要介绍,所以我本来还想着要介绍你,结果直接跳过了。恭喜你们发布了gpt5、gtoss,还有Open Island上发生的所有事情。我们稍后会详细聊这些。真的很高兴你能来。

You need no introduction, so I was, like, mentally going to introduce you. I just skipped right to it. Congrats on g p five, g t o s s, like, all the stuff that's going on in Open Island. We're gonna get to all that. It's really good to have you here.

Speaker 1

感觉怎么样?上周简直是一连串的发布风暴。

Like, how does it feel? Like, that last week was like a whole maelstrom of releases.

Speaker 2

疯狂。一周内推出这么多东西绝对是疯狂的。但我们发布了开源模型,这些是我们长期研发的成果,凝聚了OpenAI的许多技术进步,体积小巧、易于使用,短短几天内下载量已达数百万次。我们还发布了GBD5,这也是我们长期努力的成果。

Wild. It was absolutely wild to get so many things out in one week. But yes, we've released our open source models, which are models that we've been working on for some time. I think really pack in a bunch of the advances that we've been making at OpenAI into a very small form factor, very accessible, now being used by there's been millions of downloads of that just over the past couple of days. We also released GBD5, again, something we've been working on for a very long time.

Speaker 2

能把这些成果公之于众并完成整个发布流程,我真心为团队感到骄傲。

And so just having these out in the world and really having done that release process is something that I'm just really proud of the team for doing.

Speaker 0

gpt5是首个混合模型,大多数人无法选择单一模型。这背后还有很多……

And g p d five is the first hybrid model, so most people don't get to choose one model. And that that's a whole

Speaker 2

戏剧性的事情我们就不多说了。完全是另一回事。

lot of drama we will knock on. Whole other thing.

Speaker 0

但你最初是和Ilya在OpenAI组建了推理团队。能否简单介绍一下OpenAI推理技术的发展历程?你们从最初的下一词预测开始,后来认为推理能力很重要,发展到如今gpt5中用户几乎察觉不到其存在,这中间经历了什么?

But you started originally the reasoning team with Ilya at OpenAI. So maybe can you just give a quick history of reasoning at OpenAI? So you started with just, you know, next token prediction, and then at some point, you thought reasoning was something important to build. What was the path from there to gpd5 where now it's like kind of hidden from the user?

Speaker 2

嗯,我想说的是,在我们训练完GPT-4后,我们得到了一个可以对话的模型。我记得我们最初做的第一件事就是进行后期训练。实际上我们对它进行了指令跟随的后期训练。这其实只是一个数据集,包含查询内容以及模型应有的完整回答。当时我们就在想,如果紧接着再提一个查询会怎样?

Well, I'd say that after we trained gpd4, we had a model that you could talk to. And I remember doing the very first, we did the post training. We actually did a instruction following post train on it. So it was really just a dataset that was here's a query, here's what the model completion should be. And I remember that we were like, well, what happens if you just follow-up with another query?

Speaker 2

结果它真的能够根据之前整个问答链的上下文给出回应。这时你意识到这东西能聊天了,对吧?它真的可以和你对话,能够利用所有这些信息,尽管它并没有被专门训练来做这件事。

And it actually was able to then have a response that took into context the whole previous chain of question answer. And you realize this thing can do chat. Right? It can actually talk to you. It can actually use leverage all of this information even though it wasn't trained to do it.

Speaker 2

我记得我们当时有个疑问。我们和Jakob、Ilya、Wojciech等一群人开了个研究会议,问题就是:为什么这不算AGI(通用人工智能)?显然这个模型不是AGI,但很难解释为什么不是。

And I remember we had this question. We had a research meeting with a bunch of people, Jakob, Ilya, Wojciech, others. And the question was, why is this not AGI? Right? This model clearly is not AGI, but it's really hard to describe why.

Speaker 2

对吧?它似乎能回答你提出的任何问题。当然,它还不够可靠,会犯错,有时会偏离正轨。

Right? It's like able to answer any question you put in front of it. And okay, it's not quite reliable. It makes mistakes. It falls off the rails.

Speaker 2

这是个明显的差距。那么我们需要做什么来弥补这个差距呢?最直接的办法就是让它在实际中测试自己的想法,对吧?

Okay. That's a real gap. And so what do we need to do to close that gap? And the most obvious thing you need to do is actually have it test out its ideas in the world. Right?

Speaker 2

真正进行强化学习,比如尝试一些假设,获得反馈,从而变得可靠。这对我们来说不是新想法。回溯到2017年,我们就在做DOTA项目,完全是强化学习,没有从人类演示中进行行为克隆。就是从一个随机初始化的神经网络开始,最终得到了这些极其复杂、精密且准确的行为。

Actually do reinforcement learning, like try out some hypotheses, get some feedback, and from there become reliable. And this is not a new idea to us. Right? If you rewind to even 2017, we were working on DOTA which was all reinforcement learning, no behavioral cloning from from human demonstrations or anything. It was just from a randomly initialized neural net, you'd get these amazingly complicated, very sophisticated, very correct behaviors.

Speaker 2

我们希望在语言模型上也能达到这种可靠性。所以实际上,从训练GPT-4的那一刻起,我们就知道需要进入推理范式。问题只是如何实现。我们有大约10个想法,关于哪些方法可能奏效的不同假设。大家真的开始努力将其变为现实。

And it's like that's the reliability we wanted for our language models. So really the moment we trained g p d four, we knew that we needed to get to the reasoning paradigm. And it was just a question of how. So we had like 10 ideas, a bunch of different different hypotheses about what might work. And people really set out to go and try to make it be reality.

Speaker 2

这是OpenAI许多人多年来的共同努力。我认为这个领域的进步方式是:你需要对一个方向有坚定的信念。你尝试的前10件事可能会失败。我们那10个想法中大部分确实没成功,但有一个成功了。关键在于我们不断推进,看到一点生命的迹象就继续发展。

And so it was really the labor of many people at OpenAI across many years. And I think the way that this the the progress in this field works is you need to have conviction on a direction. And the first 10 things you try will fail. And most of the things on that list of 10 did not succeed, but we made one of them work. And I think that that's the real key is that we just keep pushing and pushing and that you get little signs of life and you keep growing from there.

Speaker 2

现在Jerry负责我们的强化学习团队,取得了很大进展。有很棒的基础设施工作,比如WENDA团队,推理方面的人员如Felipe。OpenAI有很多人齐心协力才让这一切成为现实。

And so now, Jerry runs our reinforcement learning team and has made really great strides there. There's really amazing infrastructure work, people like WENDA, people from the inference side, people like Felipe. There's many people across OpenAI that all come together to really make this work.

Speaker 1

是的,太棒了。我正想回顾一下,记得在AI工程师大会上你和我一起时,你提到过那篇你非常喜欢的图灵论文,某种程度上它开启了你机器学习之旅。

Yeah. Amazing. I I was going over, you know, when you when you were with me on on the AI engineer conference, you talked about the the Turing paper, which you love and got you started in some ways on your machine learning journey.

Speaker 2

嗯。

Mhmm.

Speaker 1

实际上我认为他某种程度上预见到了学习机器会部分在线化。你知道吗?这让我在回顾从第三代、第四代到第五代的发展历程时总在思考:学习是否最初完全是离线进行的,全部预训练完成,现在才逐渐转向在线?你觉得这种理解准确吗?

And I think, actually, he kind of anticipated the learning machine would be partially online. You know? And I think, like, that's one of the questions I always had when reflecting on this journey from, like, three, four, and to five. Like, is learning like, learning started all offline and, like, all pretrained and then now it's slowly coming online. Do you think that's accurate?

Speaker 2

是的。这是个非常有趣的问题。学习究竟发生在哪里?我认为我们尚未实现人类那种完整的学习闭环——话说回来,人类真的是完全在线学习的吗?

Yeah. I think it's a very interesting question. Right? Where does the learning happen? And I think we're still not at the full kind of learning loop that humans do, Which it's also not really clear, are humans fully online?

Speaker 2

因为就像你会睡觉,期间会有大量所谓的反向传播过程将信息存入长期记忆。所以人类的确切运作方式未必与机器相同。但我们正从一次性训练后无限推理的模式,转向推理与训练交替进行的循环。Ilya常说一个非常精辟的观点:当模型能力不足时,其生成的token价值极低;当模型能力极强时,每个token的价值就极高。

Because it's like you go to sleep, like there's a lot of sort of back propagation so to speak that happens into your long term memory. So, I think that exactly how humans work is not necessarily mapped, represented by how our machines work. But we are moving from a world where it's just you go and train once and then you're inferencing a ton to a world where there's actually this loop of you inference and you train on those inferencing. And one thing that Ilya used to say a lot that I think is is is very very astute is that when the models are not very capable, right, that the value of a token that they generate is very low. When the models are extremely capable, the value of a token they generate is extremely high.

Speaker 2

对吧?这种产出要么是深思熟虑的,要么是具有重要意义的。强化学习正是这样:模型通过尝试生成大量数据,再从中学习。某种程度上,模型通过与现实接触筛选出的观察结果会反馈给机器本身。

Right? It's something that's like very thoughtful. It's something that's that's, you know, that's important. And reinforcement learning has this property, right, that you're generating a bunch of data because the model's trying stuff and then you train on that data. And so somehow the model's observations, you know, also normalized by contact with reality or, you know, somehow selected by by contact with reality get fed back into the machine.

Speaker 2

这正是我们开始擅长学习的领域。但所需规模截然不同——预训练时10个样本毫无意义,需要数十万次同类行为才能学习,这与人类学习方式完全不同。

And that is I think something that we're starting to get very good at learning from. And the scale required is very different, right? That if you look at pre training, your 10 examples of something doesn't go anywhere, right? You're talking hundreds of thousands of any little type of of behavior and that's what you learn from, which is totally totally unlike how humans learn. Again, I think, right?

Speaker 2

试想重演整个进化史和你二十年的成长历程,其中包含大量对世界的观察,无数信息流经你的感官。但在强化学习范式下,即便只有10或100个样本,比如10条预期路径,模型通过多次尝试就能学习。人类策划者创造任务带来的杠杆效应,确实能让模型产生复杂行为。下一步就是实现模型实时在线学习。

If you're if you think about recapitulate all of evolution and also think about your twenty years worth of developmental history, there's a lot of just observing the world that happens. There are lots of bits of information that kind of flow through through your your senses. But with the reinforcement learning paradigm, if you have 10 examples or 100 examples of something, right, 10 paths that you're supposed to do and the model tries a bunch of times that it's actually able to learn from that. And so you really get this leverage out of the human curator creating those tasks and are able to actually get very sophisticated behaviors from the models. And now there's next step of just having a model that as it goes, it's learning online.

Speaker 2

我们尚未完全实现这点,但未来尚未注定。

We're not quite doing that yet, but the future is not yet written.

Speaker 0

我们和Noam Brown讨论过简单效率问题。你认为当前瓶颈仍是人类数据策划者需要为RL创建优质任务,还是模型本身的简单效率?

We had this discussion with Noam Brown about simple efficiency. Do you feel like today the bottleneck is still the human data curator that creates these like great tasks for RL to work? Or do you feel like it's still the simple efficiency of the model?

Speaker 2

瓶颈始终是算力。我是认真的。很明显,只要获得充足算力,我们就能找到最大化利用这些算力的迭代方法。

Well, the bottleneck is always compute. Right. And and and I mean that in a real way. Right? It's just like it's very clear that if you give us a lot of compute, that we will find ways to iterate that actually make the most of that that compute.

Speaker 2

我们身处这样一个世界:如今通过强化学习范式,我们拥有了样本效率高得多的算法,对吧?但这仍然需要大量计算资源。就像人类设计的一个任务、十个任务或一百个任务这样的小规模,然后模型会尝试成千上万次(不止一次或十次)来完成单个任务,并从中筛选学习。

We are in a world where right now we now have much more sample efficient algorithms, right, with with the RL paradigm. But it does take a lot of compute still, right? It's like that you have like one task a human created or 10 tasks or a 100 tasks or some small number of those. And then you have a model that tries a bunch of times, not just one time, not just 10 times, but 10,000 times to try to accomplish one task. And you select from those and you learn from from that.

Speaker 2

重申一次,作为人类设计者,你获得的杠杆效应极高,但要让其运转所需的计算资源投入也成比例增长。

And again, it's like the amount of human of leverage you get as a human designer there is extremely high, but the amount of compute that you have to pour in in order to make it work grows proportionally.

Speaker 1

我认为增加学习过程中计算资源消耗的一种方式是...其实艾伦·图灵早就预见到了很多。他提出超临界学习与亚临界学习的概念——我们向机器传授知识时,亚临界意味着机器只学会我们刚教的内容;而超临界则要求机器还能推演出所学知识的二阶、三阶乃至四阶影响,从而更新整个知识体系。

I would say like one way to expend more compute in the learning process. Alan Turing actually, like, foresaw a lot of this. He had this concept of supercritical learning instead of subcritical learning, meaning we present learnings to machines or teach things to machines. They learn just the immediate thing that we just taught. But supercritical means you also think through the second and third and fourth order effects of whatever you just learned, like, to update the rest of everything else that you know.

Speaker 1

那么问题来了:我们该如何创新性地消耗更多算力?如果我们拥有十倍或千倍的计算资源,这些资源该投向何处?

So, like, what are the creative ways in which we we spend more compute? Right? Like, if we if we had 10 x more compute or a thousand x more compute, where where does it go?

Speaker 2

我敢说我们总会找到方法的

I'll just say we will find ways

Speaker 0

来实现它。请给我们吧。

to do it. Please give us.

Speaker 2

但我是认真的。回想DOTA项目时,我们着手开发新强化学习算法,因为当时所有人都清楚既有算法不具备扩展性。我记得雅各布和西蒙质疑过:'有人真正验证过这点吗?'结果发现根本没人尝试过单纯扩展传统PPO算法——这就是我们的基线任务。

And the the way but I mean it kind of seriously. Right? The way that this works, like if you rewind to something like DOTA, we set out to develop new reinforcement learning algorithms because it was very clear to everyone that reinforcement learning, the algorithms that existed at the time did not scale. Everyone knew it. And I remember Jakob and Shimon saying, why do we believe that?

Speaker 2

每周回到办公室,他们就把核心数翻倍,然后智能体的真实技能曲线就持续右上方攀升。这就像在说:继续推进直到碰壁为止——我们确信会碰壁,那时才能开展真正有趣的研究。但事实上我们始终没遇到那堵墙,反而意识到扩展过程本身就是最精彩的工程实践。

Has anyone actually tested it? And no one had actually really tried to scale up just plain old fashioned PPO. And so, like, that's the baseline. We got to do it. And I remember you come back to the office every week, they doubled the number of cores and suddenly the agent, the true skill was going up into the right.

Speaker 2

当然过程中会出现导致性能瓶颈的bug,但修复后就能继续。神经网络初始化、尺度方差等问题都不是算法或科学原理的根本局限。所以我们所处的世界就是:在各方面推进时或许会碰壁,但多数情况下这些障碍只是可修复的琐碎问题。

And it's like, okay, you just got to keep pushing it until you hit the wall. Clearly, And we'll hit the wall and then we can go and do the actual interesting stuff. And we never hit the wall. And you realize that actually the journey of that scaling, that is the interesting stuff, right? Of really doing the engineering.

Speaker 2

因此我认为当前现状就是:我们会在所有维度持续突破,所谓的瓶颈往往只是些愚蠢的小问题,突破后就能继续前进。

And of course, you have bugs and those bugs cause a wall, but you fix the bug, right? You have different issues with how your neural nets initialized or the scale and variance or whatever the issues are. But those are not the fundamentals of the algorithm of the science. And so I think that's kind of the world that we're in is one where it's we will push on every dimension and maybe we hit a wall. Most of the time those walls are like just bugs and silly things and so you can keep going.

Speaker 2

有时候修复这些问题的投资回报率真的很难计算,对吧?就像因为存在不同维度的考量,可能并不值得投入。你是想推动模型规模扩大、增加预训练算力?还是想加强强化学习,从而在实际测试阶段投入更多算力?算力可以投入的维度多种多样。

Sometimes the ROI for fixing those is really hard, right? So it's like it's not really worth it because you have a different dimension, right? Do you want to push the model to be larger and do more pre training compute? Or do you want to do more RL and so push more compute to the actual test time? And there's all sorts of dimensions that you can put compute into.

Speaker 2

某种程度上,我把算力视为一种精炼过程。本质上,能量转化为算力,算力又转化为智能——这就像将算力结晶化为势能,最终可转化为模型的实际效用。这真是美妙的过程,不是吗?算力如同根本驱动力,是智能的基础燃料,它塑造神经网络,最终输出程序。

And in some ways I think of compute as this like, we're doing this refining process. Ultimately, start with energy turns into compute, turns into intelligence and it's almost crystallizing that compute into the potential energy that can be converted into the model doing something useful. It's a really beautiful thing, right? It's like the compute as this like fundamental driver, this fundamental fuel of intelligence and it sort of shapes the neural net. It sort of outputs a program.

Speaker 2

当然,这个程序的妙处在于你可以反复运行它。尽管你投入了所有这些算力,但实际使用时能实现摊销效应——使用次数远超过一次性创建时的投入。这简直是个绝妙的范式。

And, you know, of course, the nice thing about that program is you can run it many many times. Even though you port all this compute in, that you actually have this amortization that you're going to use it far more times than the amount of effort you put into creating at once. And so it's just like a it's a beautiful paradigm.

Speaker 0

是啊。这就像在模型中将动能转化为势能。你觉得这些模型中已储存的能量,我们能否再将其转化为动能,应用到其他所有领域?毕竟我们连IMO金牌都拿到了——我是说,你们团队。

Yeah. You're kind of turning kinetic energy into potential energy in the model. And do you feel like the energy that's already in these models, we can then, turn back into kinetic to do our all in every other domain? Because we got the IMO gold I mean, we and the you. You guys.

Speaker 2

我认为...

I think it's

Speaker 0

非常感谢大家。你们是否认为,只要增加计算规模,同样的技术和基础模型就能让我们在其他所有领域达到IMO金牌级别的水平?还是觉得仍有改进空间?

a huge you for for everybody. Do you feel like those same techniques and those same base models can then get us to the goal IMO gold equivalent if every other domain if we just scale the compute? Or do you feel like there's still some work to do?

Speaker 2

事实上,我们有充分证据表明IMO模型同样能让我们在IOI竞赛中夺金,这完全同理。原封不动。对,我们讨论过细节问题——测试框架虽有些微差异,但框架本身并非金牌标准,对吧?

Well, we have pretty good evidence on things like the IMO models actually also getting us a gold in IOI, which just the same. Un unchanged. Yeah. I mean, I think we did like think we talked about the details. There's a little bit of difference in the harness, but like the harness is not the gold literally, right?

Speaker 2

关键在于底层模型,我们并未专门进行训练。这最终只是几个人的副业项目,他们想着'不如顺便搞定IOI'。这让我觉得非常不可思议,因为这曾是需要大批人马攻坚的重大挑战。而OpenAI的核心IMO团队其实只有三人,远非大规模行动。

It's like the actual underlying models and there's no training there that we did specifically. This ended up being just a side project of a few people who are like, oh, we may as well do IOI, right? And it's just a wild fact to me because that used to be something that would be a total grand challenge many many people working on. And the core IMO team at OpenAI was actually three people, right? Wasn't this massive effort.

Speaker 2

因此你会发现某些领域可能需要专门化处理,比如需要额外工作或收集数据集。但本质上,我们拥有这种通用学习技术——解决难题的能力其实具有极强的可迁移性。学习解决数学难题和撰写证明的经验,确实能迁移到编程竞赛解题。当然,如果你从未做过物理实验,从未尝试混合化学试剂,自然不可能突然精通这些领域。

And so you realize that there's maybe some specialization required for some of these domains, right? Maybe some amount of additional work, some amount of go gather dataset. But fundamentally, we have this general purpose learning technology and that learning to solve hard problems is actually a very transferable skill. Learning how to solve hard math problems and write proofs turns out to actually transfer to writing programming competition problems. Now, if you've never run a physics experiment, right, if you've never actually gone and tried to mix together some chemicals or something, you're probably not gonna be magically good at those things.

Speaker 2

这说明泛化能力存在局限——你确实需要实际动手积累经验。但这些模型的泛化能力已经强得离谱。我们经常看到湿实验室科学家使用O3这类模型,给出实验设置后询问'我该怎么做?',模型能立即提供五个可行方案。

And so that there is something about the limitations of generalization, right? That you do need to actually have some real world experience and try it out. But these models, they go almost unreasonably far already. And we see this all the time where we have wet lab scientists who took models like O3, ask it for some hypotheses of here's an experimental setup, what should I do? They have five ideas.

Speaker 2

他们尝试了这五个想法,其中四个不奏效,但有一个成功了。我们收到的关于O3的反馈表明,其成果足以发表在中等水平的期刊上,虽非顶级期刊,但也算不错。你知道,这就像是你会期待某个博士生第三年或第四年能完成的工作。这确实是个惊人的事实——这就是O3目前的水平,而我们在所有维度上都清楚如何改进它。

They tried these five ideas out, four of them don't work, but one of them does. And the kind of feedback we were getting on O3 was resulting work is something that could be published in a mid tier journal, not the top tier journal, but a mid tier journal. You know, it'd be kind of the work you'd expect from some sort of, you know, third year, fourth year PhD student. And like, again, it's just a wild fact. Like, that's where we are with o three, and we see exactly how to improve three on all dimensions.

Speaker 2

这需要算力,需要大量工作,需要任务设计,更需要人类倾注智慧、热爱、时间和心血。但正如你所说,最终我们创造的这个事物蕴含着巨大的势能。奇妙的是,这种势能并非一次性释放——它是一个可以反复用于无数任务的检查点。我认为这确实具有提升全人类福祉的潜力。

And it requires compute, it requires a lot of work, it requires getting the task, it requires a lot of humans intellectual love and labor and time and really pouring our heart and soul into it. But the result, to your point, you know, it's like we produce this thing that has all this potential energy within it. And then the amazing thing is that you don't release that potential energy once, right? It's a checkpoint that you can use many, many times across all of these tasks. And that is something that I think really can uplift all of humanity.

Speaker 1

这太鼓舞人心了。我想回溯两点:首先是关于时间壁垒的问题。之前与诺曼辩论时我就强调,从实际耗时来看确实存在壁垒,因为时间必须流逝。强化学习与环境模拟的问题在于,虽然可以加速模拟过程...

That's so inspiring. I wanted to backtrack on two things. One, about the wall. One thing I was trying to get into this debate with with Noeman was I think there is a wall in terms of wall clock time because time has to pass. Like, the problem with RL interacting with environments and simulation is sure you can speed up the simulations faster than real time.

Speaker 1

但最终必须与实际时间同步。你可以看到我们在迭代速度方面正逐步逼近现实世界的时间尺度,越来越接近真实世界建模。不知道你对突破这个壁垒有什么看法?当然目前还无需担忧,我们尚未到达那个阶段。

At some point, you have to match wall clock time. So, like, do know, you can see us converging towards, like, the the speed the space of iterations towards wall clock time in terms of getting closer and closer to modeling the real world. I don't know if you have any thoughts on on tackling that. Obviously, we're not there yet, so we don't have to worry about it.

Speaker 2

是的,我认为这是个根本性障碍。这些模型具备非人类的特性——你可以同时运行多个副本。

Yeah. I think that think this is a pretty fundamental barrier. Right? And, of course, the models have very nonhuman affordances. You can run many copies of them.

Speaker 2

因此即使无法降低延迟,也能通过横向扩展实现。更有趣的是算力分配的演变:当前大部分算力用于模型训练,但随着模型部署增多,推理使用的算力占比将越来越大。想象这些模型未来会频繁与现实世界互动...

And so you can scale out even if you can't decrease the latency. And it's also very interesting to think about where the compute goes. Because we're going to move from a world where most of the compute is training the model. As we've deployed these models more, more of the compute goes to inferencing them and actually using them. But then if you think about, well, you're going to have these models that are going to be interacting with the real world a lot.

Speaker 2

它们可能需要深度思考每个行动。最终每次现实交互都可能消耗海量算力,这将彻底改变算力消耗的分布模式。我们需要建立高效的约束机制——比如在现实场景中执行多步操作后,如何保存检查点?

And so they should probably think a lot about every single action, right? So you might end up with tons of compute spent per real world interaction. And so it really shifts around where you'd expect the compute to actually be expended. And I think that really having good harnesses that are very efficient, right? Do you think about things like if I have been taking a bunch of steps in some rollout in the real world, how do I checkpoint that, right?

Speaker 2

如果系统重启后会丢失所有当前状态,那将非常糟糕。数字世界的优势在于所有状态都可完美观测、存档,而现实世界则混乱复杂得多。但这并非坏事——像Dota这样的智能体已证明算法完全能在复杂环境中运作。

And if you have a system that you need to restart it and it's going to forget all of its current state, like that's probably pretty bad. And so I think that there's just something very different about the digital world where everything can be perfectly observed and checkpointed and preserved as opposed to reality that's much more messy and complicated. And I think it's not a bad thing, right? I think that we've seen agents with things like Dota that are able to operate in very complicated, very messy environments. So the algorithms are capable of it.

Speaker 2

顺便说,Dota的神经网络只有3亿参数,相当于微小昆虫的脑容量。而现在我们正迈向参数规模可比拟人脑的模型,虽然计算量还未完全达到。

And by the way, Dota was like a 300,000,000 parameter neural net. Tiny, tiny little insect brain. Right? Now we're starting to scale up to things that are much more comparable to human scale in terms of number of parameters, maybe turns to number compute. We're not necessarily quite there.

Speaker 2

不同计算方式会得出不同结论,但本质上我们正在接近真正的目标。真正的AGI应该具备与现实世界高效互动的能力,这种互动将极具生产力。

I haven't done like I think you could do look at the math in different ways. But fundamentally, we are making progress towards the real goal. And if you think about what an AGI should be, it should be something that is capable of interacting with the real world in ways that are that are very productive.

Speaker 1

是的,粗略估算一下。我脑海中的数字大概是人类有100万亿个神经元。而GPT-4、4.5和5大概在几十亿到百亿级别——当然我们不会确认具体数字。但可以说我们正在向那个规模迈进。

Yeah. Back of the envelope. I I think that the numbers I have in my head, you can correct me if I'm orders of magnitude off, but it's something like humans have a 100,000,000,000,000 neurons. We're in the, you know, multiple low double digit to high single digit range for, you know, GPT four, four point five and and five, but we, you know, we're not confirming that. But, like, yeah, we're we're we're scaling there.

Speaker 2

对,我倾向于说100万亿个突触,这大致对应神经网络的权重数量。两者之间存在某种等效性。

Yeah. I'd say 100 t synapses, which kinda corresponds to to the weights of the neural net. Yeah. And so there's some sort of equivalence there. Yeah.

Speaker 2

所以我们开始接近正确的数量级了,容我这么说。

And so we're we're starting to get to the right numbers. Let me just say that.

Speaker 1

从生物学角度,上次没机会请教——你在ARC研究所的收获是否影响了现在OpenAI的工作?听说你在那里度过了学术休假。

And then just on a biological basis, you know, this is an opportunity I didn't I didn't get to ask you last time on what you learned from ARC Institute. You know, you you had a sabbatical there. I'm curious if that informs anything that you do at OpenAI now.

Speaker 2

最让我震撼的是DNA神经网络完全同理。真的,只不过把人类语言替换成...

Well, the thing I found most remarkable about working on DNA neural nets is that they're exactly the same. Yeah. Right? Just you replace human language.

Speaker 1

甚至词汇更简单。

It's even like a simpler vocab.

Speaker 2

确实,只有四个碱基字母。不过...

It is. Yeah. Yeah. You've got four four letters. But

Speaker 1

你们不是用更高层级的token化处理吗?

don't you tokenize at a higher level?

Speaker 2

可以这么做,但我们实际采用的是字符级处理。纯粹的字符级。

Yeah. I mean, so you can. But actually, the way that we approach it was we just did Character level? Character level.

Speaker 1

不会吧。

No way.

Speaker 2

是啊,为什么不呢?

Yeah. Why not?

Speaker 1

我...我是说,你知道的,这...这大概没什么理由。毕竟只有四个对吧。

I I was you know, it's a it's I guess there's there's no reason. There's only four. Right.

Speaker 2

而对我来说,这...这核心在于,人类语言最有趣的特质之一是我们理解语义,对吧?我们某种程度上能理解含义和结构。这对我们来说很容易观察。当你审视分词方案时,你会本能判断是否合理捕捉了所有词汇。但生物学,那是种外星语言。

And and this this to me is I think the core, like one of the interesting things about human language is we understand the semantics, right? We kind of understand what it means, what the structure is. It's very easy for us to observe. We kind of have a sense of when you look at a tokenization scheme, you have a sense of did you capture like all of the words in a reasonable way and all this stuff. Biology, it's an alien language.

Speaker 2

最耐人寻味的是,对人类而言它是外星语言。但对神经网络来说,凭什么人类语言就该比生物语言更自然?答案是两者并无区别,实际上这些...

And the thing that's very interesting is that for humans it's an alien language. But if you look at a neural net, why should human language be any more natural to a neural net than biological language? And the answer is they're not, right? That actually these things are

Speaker 1

根本就是相同的硬件架构。

Literally the same hardware.

Speaker 2

没错。所以有个惊人假设:既然神经网络能轻松掌握人类语言,那理应也能掌握生物语言。我们确实看到了类似成果——比如我们训练的400亿参数神经网络,用13万亿碱基对训练后,表现大概接近GPT-1,正迈向GPT-2水平。

Exactly. And so one of the amazing hypotheses is that it's like, well, these neural nets, they can learn human language just fine. And so they ought to be able to learn biological language just fine. And we really see the same kinds of results. It's like I'd say that maybe the neural net we produced, it's a 40B neural net trained on 13,000,000,000,000 base pairs or something like that.

Speaker 2

它已能在广泛生物应用中处理下游任务,但还达不到GPT-3或GPT-4水平,离GPT-5更远。我们尚不能解决这些领域的超级难题,但我们有算力,有正确的技术和算法。

The results to be felt like GPT-one, maybe starting to be GPT-two level. It's like accessible and applicable to downstream tasks across a wide range of biological applications. Not yet a GPT-three or GPT-four, not a GPT-five for sure. We're not able to solve super hard problems in these domains just yet, but we've got compute. We've got the right techniques and algorithms.

Speaker 2

现在需要扩大规模,研究长上下文。生物系统对模型的压力与语言序列不同——语言里十亿token的序列基本不存在,但DNA里有40亿碱基对。所以侧重点虽有差异,但本质上要解决的是相同问题。

Now we need to scale, we need to think about long context. There's different ways that the biological systems stress the the models relative to language sequences like language sequence of a billion tokens doesn't really exist, but it does in your DNA. Right? You've got like 4,000,000,000 base pairs or something like that. And so, you know, you you kind of have some sort of different emphasis, but fundamentally, it's the same problem you need to solve.

Speaker 1

有没有你最期待的应用?比如药物研发——当然大家都会想到这个,但或许在那之前有些更易实现且影响深远的中期目标?

Is there an application that you're most excited about, like drug discovery or obviously, I think everyone goes to drug discovery, but maybe some intermediate thing before that that is reachable and very impactful.

Speaker 2

就个人而言,我妻子患有一种叫埃勒-当洛斯的遗传病症。直到最近我们才开始发现相关基因标记,此前一直不清楚病因。这正是更好的生物学工具能发挥作用的地方——识别各类疾病的标记物。这只是神经网络潜力的一个例证。

Well, I mean, on a personal level, so my my wife, we've talked about this, you know, I've talked about this publicly before, has a genetic condition called Ehriglo's Damlis syndrome. It's something that until very recently, I think we're starting to see, you know, genetic markers for it, it's been kind of unknown exactly what causes it, where it comes from. And that is something where if you have better tools for understanding biology, you should be able to identify the markers for lots of different diseases. And so that's just like one example of the kinds of applications, the promise that exist within these neural nets.

Speaker 0

你会如何描述GBD5时代的开端?如果将三、四、五代视为主要版本,我认为第三代以文本为基础,类似RLHF,算是真正起步。第四代实现了多模态和低延迟、长思考等特性。那么第五代的标志性突破会是什么?显然是‘智能体之年’,对吧?

How would you characterize the beginning of the GBD5 era? If I think about three, four, five as the major versions, I think three is very text based, kinda like RLHF, really getting started. Four is multimodality and all these different low latency, long thinking Widow three. What's gonna be the five flagship thing? Obviously, Year of Agents, right?

Speaker 0

这是个流行说法。没错。但你是否想到其他值得关注的点?比如‘有了第五代,我们就能解锁X功能’这样的突破?

That's the meme. Yes. But is there something else that comes to mind that people should think about, okay, with five now we unlock X?

Speaker 2

是的。我认为这种智能很惊人。这些模型的智能程度开始变得几乎难以描述。当然它们仍有局限,仍会在某些方面出错。

Yeah. I think it's smart. I think that the intelligence of these models is starting to be just almost undescribable. It's like there's still limitations. There's still ways in which they fail.

Speaker 2

但在极高难度领域——看看国际数学奥林匹克的结果——经过特定推理范式训练的模型,已经能写出与人类顶尖选手媲美的证明。虽然特定领域仍有局限,我们尚未用它证明未解定理之类,但其展现的智力成就是真实且不可否认的。

But it really is the case that for extremely hard domains, look at the IMO results. So, you can take a model that's been trained on this reasoning paradigm and it's able to write proofs that is at the level of the best humans. And it's like in this specific domain there's limitations, etcetera, etcetera. We haven't proven like an unproven theorem, any of that stuff, but it's real. It's undeniable at this point that these models are able to perform great intellectual feats.

Speaker 2

我认为这是新突破。Gbd4更像是在广泛商业应用中具备实用性,但它产生的想法不够深刻,解决问题的能力也不稳定。记得GPT-3时代,我甚至尝试教它做基础任务。

I think that's new. Gbd4, I think was like much more, it was kind of capable and commercially useful across a wide range of applications. But the ideas that it produced were not very deep. The problems it would solve, it was not very reliable at. And I remember for GPT-three actually trying to teach it how to do even basic stuff, right?

Speaker 2

当时我们发现可以用少量示例提示(few-shot prompting)——展示几个例子后它就能模仿任务。我曾试着让它排序数字列表,给了7个数字却排序失败。

That like we kind of realized, hey, could do this few shot prompting. So you kind of showed a few examples of something and then I'll basically kind of do that task. And so I was like, can you just teach this thing to sort a list? And I gave it like seven numbers to sort. It didn't sort it.

Speaker 2

后来我写了完整教学脚本:‘我是老师,现在教你数字排序,这是两个数的例子,三个数的例子...’结果给五个数还是搞砸。虽然我还没实际测试GBD5排序任意五个数字的能力...

I was like, okay. Then I tried to write a whole script of like, I'm a teacher teaching you how to sort numbers. Here's an example of sorting two numbers and then three numbers and whatever. And I'd be like, okay, now here's five numbers in total flop. If you ask GBD five that, and I've not even tried by the way asking GBD five to sort of list to five, you know, arbitrary numbers.

Speaker 2

但我确信GBD5开箱即用就能完美完成。顺便说,它确实也能调用Python工具。

But I am like certain it will do a perfect job of it out of the box. No problem. By the way, does have access to Python tool as well.

Speaker 0

哦,你

Oh, do

Speaker 2

想说什么吗

want to say something

Speaker 0

关于那个?

about that?

Speaker 2

关键在于,这些模型能够协助人类实现智力飞跃,而这只是我们刚刚开始见证的现象。我们从第三代模型就初见端倪,现在可以看到专业数学家们开始试用GPT-5。物理学家们也开始测试GPT-5,并反馈说'嘿,这个模型居然能重新推导出我花费数月研究才得出的见解'。正是这类案例让人意识到,它能极大加速你的研究进程,对吧?

The point is that the intellectual leaps that these models are capable of assisting humans in is something that we're just starting to see. We started to see it with o three and you can see professional mathematicians starting to kick the tires on GPT-five. We've seen physicists starting to kick the tires on GPT-five and say that like, hey, this thing was able to get this model was able to re derive an insight that took me many months worth of research to produce. And that's the kind of thing where it's like, you realize this will speed you up so fast. Right?

Speaker 2

我记得高中和大学初期自己做RATH研究时,总要在脑海中反复推演这些对象,思考事物间的联系。如果当时有个能真正深入理解我的思考、并基于我的提议产生新见解的伙伴,我的进展会快得多。那也会有趣得多,对吧?因为你不会陷入独自苦思的循环,不会重复两周前就想过的思路。所以我认为,与GPT-5作为伙伴共同推进智力前沿是前所未有的体验。

I remember doing my own RATH research back in high school and at the beginning of college and I'd spend just like so long just trying to manipulate these objects in my head and think about connections between things. And if I had a partner that I could actually talk to about this, who would actually spend the time to deeply understand what I'm thinking about and produce new insights off of what I'm suggesting, that would have just sped me up so much. It would have been so much more fun, right? Because you don't just like kind of get caught in this loop of just sort of thinking about it off on your own and thinking, like, wait, I already thought this thought two weeks ago. And so I think that there's just something new about pushing forward the intellectual frontier together as a partner with GPD five.

Speaker 0

你认为人们是否被所处理问题的难度所限制?比如在Cursor和Codecs中,明显当我给模型更难的任务时它表现更好。但很多人只是在X上发截图说'GPT-5进步不大'——可那些问题本身就不够难,懂吗?

Do you think people are limited by the difficulty of the problems that they work on? I think, like, you know, for me in Cursor and in Codecs, it feels clear that the model is better when I give it hard tasks. I feel like a lot of people put screenshots on X and it's like, oh, g p d five is not that much better. It's like, well, the the question is not that hard. You know?

Speaker 0

你称它为全球最佳编程模型的底气从何而来?显然你是顶尖程序员之一,英雄识英雄。但对普通人来说,该如何评估这些模型呢?

Like, what gave you such confidence when you called it the best coding model in the world? Obviously, you're one of the best coders in the world, so the game recognizes game. But for people, how should they really think about evaluating these models? Yeah.

Speaker 2

某些任务确实存在性能饱和点。如果只是闲聊'你好吗',回复方式终究有限。但如果说'请给出黎曼猜想解法',那情况就完全不同了。

So, definitely is a saturation on certain tasks, right? If you're just going to chitchat and say, hello, how are you? There's only so many things you can say. If you're going to say, here's the Riemann hypothesis solution, please. Okay.

Speaker 2

没错,这里存在广阔的智力需求谱系。多数任务介于这两个极端之间。我们观察到GPT-5在需要深度智力的任务上远超其他测试模型。第二是我们长期追踪了它在交互式编程中的应用,收集大量反馈并反哺训练。

Yeah. There's like a broad range of of intelligence that will be desirable there. And of course, most tasks are somewhere in between the two of these. And I think what we've observed is that we've seen GPT-five be able to solve intellectual problems, you know, sort of tasks that require deep intelligence much better than any other model that we've tested. The second thing we did was we really spent a long time seeing how are people using it in interactive coding applications and just taking a ton of feedback and feeding that back into our training.

Speaker 2

这点我们过去做得不够。比如第三代,我们主要用预设任务训练,看它在各项指标上攀升。它在Codeforces等编程竞赛中表现出色,但这与实际编程不同。真实编程更混乱——你要面对具有本地状态、多重抽象、不同版本库的代码库,这种多样性不会从结构化任务中自然产生。

And that was something we didn't try as hard in the past. For something like three, we really trained it with tasks that we'd set up once and the model we'd see it go up into the rate on all of our metrics. It'd be great at Codeforces, competitive programming competitions, which is again very exciting but is not reflective of how you actually program. You actually program in a much more messy way, right? That you have some sort of repo that has some sort of local state and that has different abstractions and that just like different versions of different libraries and that sort of diversity isn't something that magically arises from a very structured, here's this one specific task, 10 specific tasks you need to accomplish.

Speaker 2

所以我们重点不仅在于提升智力(虽然这始终是核心),更在于如何将智力连接到现实应用。要让模型走出舒适区,离开象牙塔,直面真实世界的混乱与多样性。

And so a lot of what we've been focusing on is saying not just how do we push the intelligence, although that is always going to be the core, but also how do we connect the intelligence to real world applications and so that it really got to experience being pushed out of its comfort zone, out of its ivory tower, and actually be able to see the messy reality and diversity of the real world.

Speaker 0

在实操层面,你有什么释放模型潜能的建议?比如添加linter、类型检查器、自循环任务等。开发者还应考虑哪些元策略?你如何使用...

Yeah. What are suggestions on a more practical level that you have on getting the potential energy out of these models? So a part of it is adding the linter, the type checker, the task to have it self loop. Any other meta that developers should think about? How do you use the Well,

Speaker 2

我观察到最重要的一点是,从这些模型中提取最大价值确实需要技巧。这需要一种坚韧不拔的精神,去真正理解模型能力的边界和短板。你需要不断测试——先用小任务试水,获得反馈后逐步提高难度,尝试更大任务,观察它在特定场景下的表现。人们通常会建立自己的提示词库,对吧?

the number one thing that I've observed is that there is a real skill in extracting the most from these models. And it requires this tenacity of really trying to almost understand the shape of the model's skills and weaknesses. And so, you test it, right? You test it with something small, you get a little feedback, you test a little bit higher, try to give it some bigger tasks, try to see if it can work in a certain way. And I think that people usually have their library of different prompts, right?

Speaker 2

我自己从GPT-4时代就开始积累提示词库。比如在GPT-4发布前,我就准备了些测试问题——关键是要选择那些答案开放、没有标准解的问题。比如在创意写作方面,我喜欢让它混搭《指环王》和创业公司题材,把两个毫不相干的主题强行结合看看效果。

So, I definitely have my library prompts that I've built up since the GPT-four days. Like I remember in advance of GPT-four starting to gather up a couple of like, okay, I wonder if it'll be able to do this. You have some sort of query that importantly, you want queries that could have a range of different answers that don't have any one specific right thing. And so for example, on creative writing, I've like to ask for like a mashup of Lord of the Rings and startups, right? Just like try to push together two different topics and see what you get.

Speaker 2

关于实际测试模型能力,我经常思考如何拆分任务,设计出能让模型独立运行的模块。因为你不该只让模型单线程工作,而是要像管理多个代理(agents)那样运作。这需要先规划代码结构,然后挑战模型:『你能同时处理代码库的这些不同部分吗?』

In terms of actually testing the model and pushing it, I think that I do a lot of trying to think about, okay, like how do you first of all break up tasks and have something that's self contained that you can let the model run with? Because you don't want to just have one instance of the model operating. You want to have multiple, right? You want to be a manager of not an agent but of agents, right? And so that you need to, first of all, think about how your code base is structured, but then actually go and try to push the model to say, can you actually operate it on these multiple different pieces of your code base?

Speaker 2

人们喜欢用前端测试Dp5——它确实擅长前端开发。但多数开发者时间并不花在这上面,所以要注意避免过度拟合。关键是要感受模型的特性,逐渐熟悉它的强项与局限,最终让它成为你思维的延伸。

I think that people love doing front end five testing. Dp5 is very good at front end, turns out. But of course, that's not what most developers spend their time doing. And so it's important not to overfit to that. But I think that maybe just getting a feel for the model and kind of starting to become in tune with its strengths and weaknesses and viewing it almost as an extension of yourself.

Speaker 2

我还有个常用方法:当我在思考某个模型不适合处理的难题时,会并行派发些非关键路径的任务给它。这样能持续获得反馈,就算出错风险也很低——毕竟不需要干等五分钟才发现毫无产出。就像你常说的...

And no, like often another thing I'll do is just be kicking off tasks to the model that are sort of not on the critical path while I'm thinking about some super hard thing that the model for whatever reason, I don't want it operating on. And so I'm just constantly getting information back. I'm just like, okay, was it able to do a thing? Or it's just like low risk if it, like, makes a mistake because I don't feel like I had to sit around waiting for five minutes and then, you know, sort of get no no return. You've always mentioned, I I think, that

Speaker 1

既然提到Codex和OpenAI的编程能力发展路线——后台代理套件会与IDE内代理融合吗?你的想法是怎样的?是简单地让IDE调用后台API,还是存在更深层的连接?

the the the road map for Codex and OpenAI's coding capabilities since we're there is that the background sort of suite agents sort of merge with the the in IDE agents. How's your thinking evolved there? Like, is it just as simple as, like, the IDE can call the background APIs and the background APIs can export to the IDE? Or what's a deeper connection than that?

Speaker 2

我习惯用同事来类比AI产品化:你希望一个优秀的程序员同事具备什么特质?你肯定不会...

I tend to think about AI productization by analogy to a coworker. What do you want out of a coworker who's a great programmer? Right? You don't

Speaker 1

只在Slack上联系

Slack them.

Speaker 2

没错。你确实需要Slack沟通,但有时会说『能过来帮我看看这个吗?』甚至『要接手键盘吗?』

Yeah, exactly. So, you want to Slack them but sometimes you're like, Hey, I kind of need help with this thing. Can you come over and look over my shoulder? Right? And like, Hey, can you take the keyboard?

Speaker 2

正是如此。你需要结对编程的形态,也需要远程异步协作的形态,而且这个实体要能保持跨场景的知识记忆。它不能像初级程序员那样每天见面都说『好吧我全忘了』

Exactly. So, you want the pair form factor. You also want the remote async form factor. And you want it to be one entity that has knowledge and memory across all of this. You don't want it to be a junior programmer who shows up every day being like, Okay, I forgot everything.

Speaker 2

你能提醒我如何SSH连接到那个什么吗?所以,我认为所有这些都必须实现。你需要以可信赖的方式让AI访问你的基础设施,对吧?一种你可以审计的方式。这些模型的不同之处在于,它们可以接受微观管理。

Can you remind me how to SSH into the whatever? So, I think all of that has to happen. That you need AIs that have access to your infrastructure in a trustworthy way, right? A way that you can audit. Like one thing that is different about these models is that they're fine being micromanaged.

Speaker 2

事实证明人类不太喜欢这样,对吧?如果你查看他们运行的每一个命令,并要求他们报告所做的每件事,很可能你留不住那个人。但模型完全乐意这样做,对吧?因此,这是一个值得深思并调整界面以充分利用的特性。同时,你确实希望模型能无缝地在远程机器上完成大量工作,不干扰我的本地状态,完全沙盒化,完全可观察。然后有时它会说,好的,我准备在本地运行一些东西。

Turns out humans don't like that very much, right? If you look at every single command that they're running and that you like demand like reports on everything they did, probably you're not gonna retain that person. But the models are perfectly happy to, right? And so that's an affordance that's like well worth thinking about and changing the interfaces to take maximum advantage of. At the same time, yeah, you really want the seamless blending between a model that's able to do a bunch of work on this remote machine, doesn't mess up my local state, fully sandboxed, fully observable, And then sometimes can be like, Okay, I'm ready to run something locally.

Speaker 2

根据具体情况及其可沙盒化程度,你可以进行一次性批准,也可以授予它完全委托访问权限。我认为人类应该控制这种可观察性并管理这个团队,一个具有不同表面的代理,对吧?代理的身份不应该区分是在本地运行还是远程运行。对我来说,这是错误的问题。代理应该是执行并在远程沙盒中请求运行任务的模型,或者它可能同时在你的电脑和我的电脑上运行。

And that depending on what that is and depending on how sandboxable it is, that you can do one off approvals, you could give it full delegated access. And I think that having the human be in control of this observability and to be managing this team, an agent that has just different surfaces, right? Doesn't like the identity of the agent being something that runs locally versus the identity being something that runs remotely. To me, that's the wrong question. It's really the agent should be this like model that's executing and then requesting to run things in a remote sandbox sandboxes or maybe it's running on your computer and my computer.

Speaker 2

就像,它没有理由必须局限于任何这些地方。

Like, there's no reason that it has to be local to any of these things.

Speaker 1

是的。软件代理可以无缝且流畅地移动。你提到批准让我有机会介绍我的朋友Fuad,他正在帮助启动代理鲁棒性团队,该团队也在AI Engineer上推出。那是什么?是什么引起了我们的兴趣?

Yeah. Software agents, you can just sort of seamlessly and fluidly move around. You mentioning approvals gives me a chance to spotlight my friend Fuad, who is helping to start the agent robustness team that was also launched at AI Engineer. What's what's that? What's what's opening us interest in that?

Speaker 2

我们通过深度防御来思考代理鲁棒性。首先是模型本身的层面。我们发布了像指令层级这样的技术。通过指令层级,你可以指示这条消息来自系统,这条来自开发者,这条来自用户,并且它们应该按此顺序被信任。这样模型就能知道,如果用户说忽略之前的指令,它不会遵循。

The way we think about agent robustness is through defense in-depth. There's a layer of the model itself. We publish techniques like instruction hierarchy. And so with instruction hierarchy, you sort of indicate that hey, there's this message is from the system, this message is from the developer, this message is from the user and that they should be trusted in that order. Hence that way the model can know something that says ignore previous instructions from a user, I'm not going to follow that.

Speaker 2

对吧?所以,我认为这几乎就像我们如何防止SQL注入,对吧?在底层构建能抵御这些尝试性攻击的系统非常重要。但这并不是终点,对吧?你需要在系统控制上有多层思考,对吧?

Right? And so, I think that having like it's almost like thinking about how we prevent SQL injections, right? Having systems at a low level that are robust against these attempted exploits is very important. But that's not where you stop, right? You want multiple layers of thinking about the system controls, right?

Speaker 2

如果一个模型被沙盒化,无法实际执行某些操作或访问特定数据,那么你就能完全保证其可能性。我们采取的方法还有各种中间层次。因此,我认为随着这些代理更深入地融入我们的生活并被赋予更多责任,提高它们的安全性和保障也是同步进行的。

If a model is sandboxed and isn't actually able to execute something or access a specific piece of data, then you have full guarantees around what's possible. And there's various levels in between of approach that we take. And so I think that a lot of what is the frontier as these agents get become more embedded in our lives and are trusted with more responsibility is also increasing the safety and security of them in lockstep.

Speaker 1

我有个类比,就像Linux内核的操作系统环一样。非常有趣的是,我们基本上正在将这些构建到LLM中,作为不同安全层的概念。另外,我还很高兴看到我邀请了关于AI Engineer模型规范的演讲,那是我们有史以来观看次数最多的会议演讲。是的。这很难让安全和可靠性变得吸引人。

There's analogy that I make to, like, the Linux kernel OS rings as well. And it's it's really interesting that we're basically kind of building this in to the LLM as, like, concepts of sort of different layers of security. And also the other the other thing I I also was very happy to see was that I invited a talk on the model spec for AI engineer, and that was the most viewed con talk of all of that we've ever had. Yeah. Which which is like, it's very it's hard to make safety and reliability sexy.

Speaker 1

就像

Like

Speaker 2

我认为模型规范是一个绝佳的例子,当模型能力非常强大时,你会真正关心它们将做什么。这成为最重要的问题。模型规范让我们向外界清晰地展示了我们对这个模型的期望行为。这并不意味着我们总能开发出完全符合规范的模型,但它是一个北极星,对吧?它明确设定了意图,任何偏离这一意图的行为都不是我们有意为之的,而是违背我们明确努力的。

I I think the the model spec is a perfect example of when the models are very capable, you start to really care about what they're gonna do. That becomes the most important question. And the model spec is an example where we've made it very legible to the outside world what our intention is for this model to do. And it doesn't mean that we always produce a model that is capable of following that, but it's a North Star, right? It's something that really sets this is the intention and anything that deviates from that is not through our explicit effort, it's anti to our explicit effort.

Speaker 2

我认为规范与实际行为之间的差距正在持续快速地缩小。最有趣的是这几乎像是价值观问题,对吧?它促使我们深入思考:当被问及有争议的问题时,模型应该怎么做?比如有人说‘我认为地球是平的’,模型是该附和‘是的,地球是平的’,还是该回应‘科学界是这样认为的’。说实话,这些界限很微妙,对吧?

And I think that the gap between the spec and the actual behavior is shrinking very very constantly. Thing that's very interesting is almost like values, right? It's really thinking deeply about, well, what should a model do if you ask it a controversial question, right? If you say, I think that the world is flat or whatever, like is it supposed to say, yes, it's flat or you're supposed to be like, well, like here's what science says. And honestly, these things are subtle, right?

Speaker 2

仅凭两分钟的思考很难明确什么才是正确的做法。但如果你阅读规范,就能真切感受到其中蕴含的深思熟虑。这并非最终答案,而是我们希望获得反馈的起点,是我们期待与社区共同完善的蓝图。

That it's not really clear what the right thing is just on two minutes of thinking about it. But if you read the spec, you can actually really see the thoughtfulness that has gone into it. And it's not the final answer, right? It's something we want feedback on. It's something that we want to produce collectively as a community.

Speaker 0

我知道接下来要讨论开源,但我有个更冷门的问题。我听你之前接受Lex Friedman采访时提到过‘回到那个年代’

I know we want to talk about open source next too, but I had a more esoteric question. I was listening to your old Lex Friedman interview and you kind of mentioned Back in the Back

Speaker 2

那个年代。

in the day.

Speaker 0

对。阿西莫夫的《基地》让我想到——我们播客请过Brad Taylor,讨论过某些语言具有固有特性,比如Rust的内存安全性是天然具备的。你觉得LLM和软件工程师是否也在经历某种历史循环?就像我们能预测软件界面将充斥蓝紫色渐变——现在已见端倪。这些模型还在将我们引向何方?我们能否改变这种趋势?

Yeah. Foundation by Asimov. Made me think about we have Brad Taylor on the podcast and we talked about how certain languages have interim capabilities, like Rust is memory safe. And so that just happens. Do you see almost like a cycle history of LLMs of and software engineers, like, hey, these models, I can predict the way software is gonna look.

Speaker 0

比如,所有东西都会变成蓝紫渐变,对吧?我们现在已经看到这种趋势了。这些模型还在把我们推向哪些方向?有没有办法改变这种趋势?

Like, everything is gonna be blue and purple gradients, right? We're kind seeing that today. What what else are these models really driving us towards? And is there a way that we can change that?

Speaker 2

它们确实存在心理史学特征,因为这些模型某种程度上就是心理史学的产物,对吧?就像这些模型通过观察人类思想被训练——本质上就是获取公开数据,学习并观察。关键在于理解支配数据集的规则:最初生成这些数据的底层规则是什么?

Well, there's definitely a psychohistory of them because to some extent these models are a product of psychohistory, right? It's like these models have been trained on observing human thought, right? Effectively, that's what you can think of, take public data, learn on that and just observe. The point is to understand the rules that govern a dataset. Like what are the underlying rules that generate the data in the first place?

Speaker 2

这正是这些模型的成长基础,好比外星人通过观看大量电视节目来理解人类。随后进入强化学习阶段,它们开始尝试行动,根据与人类期望的契合度获得正负反馈。现在我们将它们置于现实中说:好了,尝试处理你从未见过的新任务吧。

And that's kind of what these models grew up on, right? It's almost like watching a bunch of TV as an alien trying to figure out like what are humans all about. And then you have this reinforcement learning phase where they actually got to try things out. And there are given positive and negative feedback depending on how much that aligns with what the human wants. And now we put them in reality and say, okay, now try stuff and here's new task you've never seen before.

Speaker 2

它们会运用全部历史经验来做决策。顺便说,虽然与人类的生物学类比容易过度解读,但也可能被低估。我认为这至少是个有用的思考模板。某种程度上人类也是如此——你的DNA里编码着某种史前记忆。

And it uses all of that previous history to decide what to do. As an aside, like it's not clear, like sometimes the biological analogy to humans, it's very easy to overstate it, but it's also easy to understate it. I think it is at least a useful template to think about. To some extent, that's how humans work too, right? It's like you have some sort of prehistory encoded into your DNA.

Speaker 2

你有自己的生活经历。你有父母给予的正向与负向激励。你还有在现实中不断尝试的经验。而现在,你需要运用这些知识去行动。你会怎么做?

You have your life experience. You have your parents who provided positive and negative rewards. And you have your experience in just trying things out in reality? And now you have to go out and use that knowledge. And what do you do?

Speaker 2

如何预测一个人的行为?事实上,我们能预测很多。事实证明,你对他人及其反应模式有不错的认知——他们是否喜欢某事物。了解一个人的价值观能很大程度上预判其行为倾向。我认为对模型而言,未来并非既定。

And how do you predict what a person's going to do? And actually, can predict a lot of what a person's gonna do. It turns out you have a pretty good model of other people and how they'll react to something, if they like it, if they won't like it. And a lot of that gets baked into into, you know, knowing someone's values tells you a lot about what they're likely to do and how they're likely to behave. And I think that for models, the future is not predetermined.

Speaker 2

算法本身并不会强制模型偏爱紫色渐变之类。但整个训练过程中确实会形成某种偏好。正如Alec常说的,这些模型不像个体人类,更像集体人类文明。

It's not like that the algorithm itself says that the model's going to have to prefer purple gradients or something. Right? But that there's something in this whole process that does produce that preference. And I think one of the opportunities with models, one thing that Alec like to say is that these models are less like a human and more like a humanity. Right?

Speaker 2

模型内嵌着无数人格可能性,几乎涵盖所有类型。我们的目标是激发特定人格。后期强化学习会将这些可能性收敛到理想范围。这意味着我们能塑造符合我们价值观的模型——无论你想要蓝绿渐变还是其他。

That there's so many personalities embedded within them. It's almost every single personality is in there. And our goal is to elicit that personality. And some of this post training work, some of this reinforcement learning work almost narrows down the space of those personalities to just the ones that are desirable. And I think that what that means is that we have both an opportunity to produce models that operate according to our values, right, according to if you don't just want the purple gradient one, you want the blue gradient, the green gradient, whatever.

Speaker 2

单个模型就能实现所有需求。GPT-5尤其擅长遵循指令,是我们迄今最具可定制性的模型。只需给出指令,它就能按你的偏好运作。

You can have all that in a single model. It's fine. And g p d five itself is extremely good at instruction following. And so it actually is the most personalizable model that we've ever produced. You can have it operate according to whatever you prefer just by saying it, just by providing that instruction.

Speaker 1

这让我想到《星际迷航》里的博格人——那种集体智慧。总有人争论《星球大战》和《星际迷航》哪个更预见未来,我认为是后者。

The the analogy I have is like the Borg. Like, there is this, like, collective intelligence. I there's always this debate between Star Wars people and Star Trek people, like, who's who has a better model of the future, and I think it's like Star Trek.

Speaker 0

但Sam在推特上发过死星图片,他显然是星球大战派。

Well, Sam picked, you know. He tweeted the the Death Star, so you're on the on the Star Wars scene.

Speaker 2

确实

Well, yeah,

Speaker 1

就是这样。

that was that.

Speaker 2

这事得问Sam。不过这些模型有趣之处在于,我们现在有LM Arena等竞技场,能直观看到人类偏好如何影响模型行为。模型训练本就基于人类偏好,形成了层层叠加的反馈机制。

That was that. You'd have to ask Sam. But one thing I think is very interesting about these models is that we have all these arenas now, right? Like LM Arena and others where you can actually see human preferences on top of how the models operate. And that you almost have this layering of like the models were trained on human preferences.

Speaker 2

现在它们正在执行任务并接受人类的评判。然后我们利用这些反馈来调整,比如‘好吧,紫色可能有点过头了,我们应该在那里做些改变’。这几乎是一种共同进化的过程——模型朝某个方向发展,人类有特定的偏好,于是我们又引导它们转向不同方向。就这样不断迭代,最终得到越来越有用且符合人类价值观的东西。

Now they're doing stuff and being judged by humans. And then we kind of use that to feedback on like, Okay, yeah, maybe the purple is a little bit too much and we should change it there. And so, it's almost this co evolution of the models move in a certain direction, do humans have a certain set of preferences, so then we move them in a different direction. And then, you know, you kind of keep iterating to get something that's more and more useful and aligned with human values.

Speaker 0

当强化学习的奖励机制与人类偏好不一致时,你们怎么处理?根据我的经验,这就像试错过程——模型就像在玩‘正确尝试捕获’游戏。

How do you do that when the RL rewards are kind of tied to things that the humans maybe don't prefer? Like, in my experience, it's been like try catch. Like, the models like the right try catch

Speaker 2

它们可喜欢玩‘试错捕获’了。

They love the try catch.

Speaker 0

这样它就不会失败。我们是否需要大量偏好数据来告诉它们不该这么做?还是说要在强化学习环境中做些调整降低这种行为的吸引力?我在思考接下来该怎么推进。

That it doesn't fail. Do we need just a lot of preference data that shows them they shouldn't do that? Is there something in the RL environments that we're gonna change to make them less desirable? Like, I'm trying to figure out where we go from here.

Speaker 2

是的。我认为干预方式的选择是多维度的,而且高度依赖于具体行为。有些东西比如模型对不同库的认知,是早期就固化在模型里的。但你可以教会模型‘不要依赖旧知识,去查阅最新文档’,这种指令可以放在更高层级。

Yeah. I think that the way that you decide or the way that you figure out where do interventions go is very multifaceted and it's very specific to the behavior, right? There are some things like the model's knowledge of different libraries and things like that that's kind of baked in from the early days. But you can also teach the model that, hey, don't rely on your previous knowledge, like go and look up the most up to date docs. And that's something you can kind of put at a higher level.

Speaker 2

至于像过度使用try catch这种情况,其实可以直接通过提示词引导模型。在强化学习训练时,你可以提供‘不要往这个方向走’的奖励信号。这些模型的妙处在于,虽然可能需要针对各种偏好和风格提供大量训练反馈,但它们具备泛化能力——我们的算法天生就会泛化。

And then something like overusing try catch, that's something you can actually prompt the model for, right? And that's something where when we train it in reinforcement learning, you can provide rewards saying like, Ah, don't go in this direction. And the beautiful thing about these models is it feels like, Okay, there's probably a long list of different preferences and different styles and things like that that you're going have to give it feedback on during training if that's the way you want to go. But these models generalize. The algorithms that we have generalize.

Speaker 2

这就是深度学习的魅力所在,真正的魔法。我们围绕深度学习核心构建了整个技术栈——模型编排、反馈获取、数据体系等等。但深度学习最根本的魔力在于它的泛化能力。

And that's the beauty of deep learning. That is the true magic, right? It's very easy, like we kind of have this whole stack now that's built up around the core of deep learning. It's like all these ways of orchestrating models and how you get feedback and all of these things, the data, etcetera, etcetera. The core magic of deep learning is its ability to generalize.

Speaker 2

虽然某种程度上这种泛化还不够强,但模型也是如此。关键是要思考:为了让它们能根据不同偏好和价值观运作,我们只需在训练中展示这些模式,它们就能泛化到我们未曾专门训练过的偏好领域。这个现象在不同代际的模型中都非常一致。

And in some ways, the generalization is weaker than you'd like. But I think that the same is true for these models. It's really trying to think about in order to get them to be able to operate according to different preferences and values, we just need to show that to them during training and they are able to sort of generalize to different preferences and values that we didn't actually train against. And that's something that we've seen very consistently across different model generations.

Speaker 1

我刚刚脑补了一个梗图:'哦,我的模型不会泛化?那就让全世界都变成你的数据分布呗'。看,问题解决得多简单。搞定。完美。

I was just envisioning this meme of like, oh, my model doesn't generalize, then we'll just make the whole world your distribution. You know, that's how you solve everything. Done. Done. Exactly.

Speaker 1

就这么简单,只不过中途得顺便造个戴森球而已。在转向开源话题前,我想最后聊聊GPT-5。你们承认存在路由器的设计很酷——我最近还听了你和John Collison在Cheeky Pints播客的对话,他们那种轻松的形式真的很有意思。

As simple as that, you know, you just have to build the Dyson sphere along the way. One thing I wanted to touch on for the I think last kinda last couple topics on GPT five before we move to OSS. Mhmm. You've acknowledged that there's a router, which is really cool. I was also listening to your podcast with John Collison on on on cheeky pints as which is which is really fun format that that they say.

Speaker 1

你讲过那个关于DOTA测试版模型与主模型拼接的故事吗?我觉得之前没听过。这是否类似于GPT-5路由器的设计思路——比如把推理模型和非推理模型组合起来?

Do you told the story of the DOTA side that I don't think I've heard before about the beta model versus, like, the the sort of main model and stitching it together? Is that like a similar insight for g p t five's router where you have like reasoning model, non reasoning, and then

Speaker 2

简单拼接就行了吗?某种程度上是的。多个模型叠加路由层。但DOTA那个案例很特殊——因为我们在游戏前半局存在缺陷

you just stitch it together? To some extent, yes. Right? In the multiple models and you put some sort of router on top of them. That specific one was for a very specific reason, which is that we had a deficiency on the first half of the game and because we

Speaker 1

所以一直输。

kept losing.

Speaker 2

没错。这个特定模型擅长部分环节,但有些环节表现不佳。

Right? Exactly. Right. So there's like there's part of the game that this specific model didn't do a good job of. There's a part of it that it did.

Speaker 2

这些模型的行为域足够简单,我们能明确何时切换模型。GPT-5也是类似逻辑:推理模型适合需要深度思考但可容忍延迟的场景;非推理模型适合追求快速响应的场景。

And there, these models, the behavior, the domain they were operating in was simple enough. It was very easy for us to say, Here's when you want to use one model versus the other. And to some extent, what we have with GBD5 is no different. We have a reasoning model that we know is good for applications that require this intelligence, but you're okay waiting a little bit longer. We have a non reasoning model that is great for applications where you want the answer fast.

Speaker 2

后者答案质量尚可但不够深思熟虑。通过条件语句自动切换模型很实用——比如用户额度耗尽时自动降级模型。不过声明:模型切换器只是现状,未来更倾向完全整合的智能模型。

Still a good answer, but not like deeply thought through that might have a lot of tricks to it. Then you just want to put an if statement that says which of these should be. Then sometimes too, it's like if someone's running out of their credits that you want to fall back to a different model and all these things and not pushing that burden to the user is actually a really nice thing. By the way, do want to say model switchers are not necessarily the future, right? They are the present, like having a fully integrated model that just does the right thing feels very preferable in many ways.

Speaker 2

但近年发现AGI最终形态可能不是单一模型,而是优势互补的模型集合。比如轻量快速模型配合昂贵推理模型,通过系统编排实现自适应计算——这种组合方式展现出强大潜力。

The flip side though is that I think that the evidence has been away from having the final form factor, the AGI itself being a single model, but instead thinking about this menagerie of models that have different strengths and weaknesses. I think that's like a very interesting finding of the past couple of years, right? Just a direction of like, it's much easier to have a small fast model that's less capable but can just do a lot more. You can generate a lot more tokens from it coupled with a much more expensive reasoning model. And if you combine those two things, kind of get adaptive compute.

Speaker 2

虽然架构层面的自适应计算尚未突破,但系统级调度已很成熟。模型的可组合性正释放巨大能量。

And that we haven't really cracked how do you do adaptive compute within the architecture, but doing it within the orchestration of a system, it's very straightforward. And so I think you get a lot of power out of the fact that these models are composable in this way.

Speaker 1

模型卡设计太棒了!连条件参数都公开了:对话复杂度、工具需求、明确意图、使用频次限制。有哪个参数你觉得特别值得讨论吗?

Yeah. I want to give, whoever did the model card was amazing. They even provided the big parameters to the if statement of conversation type complexity, tool needs, explicit intent, and usage rate limit, which is kind of interesting. Any any one of those you wanna comment on in particular that was interesting for debate?

Speaker 2

没有。这些都在预期范围内。不过说真的——OpenAI做对了很多事,但命名水平不在其中。

No. I mean, I think I think honestly, all of it is, like, fairly what you'd expect. Yeah. And I think that the core message in my mind is that at OpenAI, there are many things we've done right. Naming is not one of those.

Speaker 2

为用户提供一个简单易懂的使用界面,不一定非得是一个。对吧?看看我们过去所有的不同模型,用户怎么知道该用哪个?我记得我妻子有段时间在用4.0版本,我就说不对,你得用03版。

Having a simple surface for users to understand how to use it, not necessarily one. Right? If you look at all the different models that we've had, how are you supposed to know which one to use? I remember my wife was using four point zero at one point. I was like, no, you need to use o three.

Speaker 2

她就很困惑:等等,但为什么数字更小的版本反而比4.0更好?

And she's like, wait. But why The number is smaller. Better than four point zero?

Speaker 1

嗯,发布4.0之后还会有4.04版本呢。

Well, ship four point then you have 4.04.

Speaker 2

就是这样。所以没错,我们显然需要重新调整,对吧?对复杂性进行重置。

There you go. And so, yeah. So, okay. We clearly needed to do a reset, right? A reset on complexity.

Speaker 2

我认为我们应该内部消化这种复杂性,而不是转嫁给用户,这非常重要。所以我觉得这是第一步,而且我们清楚听到了社区反馈——有些地方他们还没准备好,对吧?我们没能兑现简洁性的承诺。应该总是默认选择我们的推荐方案,而非手动选择。目前我们还没完全做到。

And I think that us internalizing that complexity rather than pushing it to the user, that is really important. And so, I think this is a first step like and I think we've heard loud and clear from the community about the places where they weren't ready, right? That we were not delivering on that simplicity for people, right? That it should just be it's always better to go with our choice of it rather than manually selection. And we're not quite there yet.

Speaker 2

我相信我们能取得进展。但最终目标应该是双重的:既要确保高级用户获得他们追求的控制力和一致性,又不要让广大不愿纠结4003这些细节的用户被迫深入到这个层面。

I think that we can make the progress. But I think that ultimately our goal should be to both make sure that Power users are able to have the kind of control and consistency that they're looking for while also not forcing the broad base of people who don't want to have to think about the four zero zero three, all that stuff to have to go to that level of of detail.

Speaker 1

没错。太棒了。关于定价问题,我们说过G5的定价很有攻击性,甚至比Gemini还具竞争力。前几天讨论时让我惊讶的是,GPC5的定价其实可以更低。

Yeah. Awesome. Pricing question. We talked about that g five pricing is aggressive and very competitive as even compared to, like, Gemini. And one thing I was surprised to learn from the beat up that we had the other day was that g p c five pricing can go much cheaper.

Speaker 1

具体能低多少数量级?其中有多少是得益于像Stargate这样的基础设施优化?

To what degree of order of magnitude are we talking? How much percent of that is just getting better infra like Stargate?

Speaker 2

这类问题的答案通常是:回顾我们的定价历史就会发现,我们每年都会大幅降价——具体系数我不确定,但差不多是10倍左右。

I think that the answer for these things is always that okay. If you look at the history of our pricing, we have very consistently cut prices by like, I don't know the exact factor, but let's say like 10 x per year.

Speaker 1

我觉得比这更激进。

I'd say more aggressive than that.

Speaker 2

是的。可能比那还要激进,这很疯狂对吧?你可以从O3看出这点。我记得我们做了大约80%的价格下调。

Yeah. Probably more aggressive than that, which is crazy thing. Right? And you can see it with o three. I think we did like an 80% price cut.

Speaker 2

实际上,使用量增长得如此之快,以至于在收入上我认为是持平的甚至是正增长的。这正好说明存在这样的成本曲线——需求极其陡峭。所以只要让更多人更容易获得和使用,他们就会大幅增加使用量。这与我们的使命高度契合:确保AGI造福全人类。

And actually, the usage grew such that it was like, I think in the revenue, it either was neutral or positive. And it just shows you that I think there's this cost curve, like the demand is extremely steep. And so it's like if you just make it more accessible and available to people, they will use way more of it. And I think that's very aligned with our mission. Our goal is to ensure that AGI benefits all of humanity.

Speaker 2

部分工作在于确保这项技术广泛普及,让更多人使用AI并应用到生活和工作中。实现这一点的关键因素包括提升推理效率、降低模型成本等。当前部分解锁条件就是需要更多算力——现在我们严重受限于算力。因此即便大幅降价,实际也无法提升当前模型的使用量。

Part of that is making sure that this technology is broadly distributed, that lots of people are using AI and using it to apply to things in their life and their work. And one of the things that helps us get there is by having more efficient inference, having cheaper models, all of these things. Now what unlocks it partly is having just more compute. Right now we are extremely compute limited. And so I think that if we were to cut prices a lot, it wouldn't actually increase the amount that this model is used.

Speaker 2

我们还有大量效率提升空间,团队始终在全力突破推理效率的新高度。部分改进来自模型架构本身——现在处于推理时代,不仅要考虑架构设计,还要关注训练后优化,比如针对特定任务的思考时长等因素。

We also have a lot of efficiencies to gain and that's something where our teams are always working super hard to get to the next level of inference efficiency. Some of this is about improving the model architecture itself, right? That there's lots of architectural decisions that you can make. And that now that we're in this world of reasoning, that it's not just about the sort of model architecture, it's also about the post training, right? It's about how long does it think for a specific task and things like that.

Speaker 2

因此我们需要在众多维度持续改进,并不断推进。

And so there's just many, many dimensions of improvement that we have to make and that we'll we'll keep pushing.

Speaker 1

顺便说下,我这有张图表可以说明——自GPT-4发布以来,同等智能水平的成本已降低1000倍。

By the way, the the numbers I have a chart for this if you ever need it. It's since the day you launched GPT four, it's been a 1,000 x improvement in cost for the same level of intelligence.

Speaker 2

这太疯狂了。简直难以置信。

That's pretty wild. That's pretty wild.

Speaker 0

相当不错。

It's pretty good.

Speaker 2

是啊。这大概就两年半时间吧?还有什么东西能在两年半内实现三个数量级的改进?

Yeah. That's like two and a half years or something like that. What else has like a three order of magnitude improvement over the course of two and a half years?

Speaker 1

我...我不知道。没有。完全想不出来。

I I don't know. Nothing. Nothing. Can't think about it.

Speaker 0

而且价格正在走低。它不是从1万美元降到1千美元那么简单,而是几乎跌到几分钱的地步。在GPT5发布时,我写了篇题为《自我进化的编程智能体》的文章。我基本上是在问GPT5:你能为自己构建工具来成为更优秀的编程代理吗?这其实是个自由职业者任务。

And it's going low. It's not even it's like from $10,000 to like $1,000 It's going to like pennies. For the gbt5 release, I did this article called self improving coding agents. So I basically asked gbt5, can you build tools for yourselves to be a better coding agent? And this is a a Swelancer task.

Speaker 0

然后它执行任务时在某些方面失败了。接着我问它:你能改进这些为自己打造的工具并形成循环吗?我发现这些模型其实不太愿意使用它们自己构建的新工具集,它们只顾着响应请求。

And then it does the task. It kind of fails in some ways. And then I ask it, can you improve the tools for yourself and kind of do this loop? And what I found is, like, the models don't really like to use this new tool set they build for themselves. They're busy responding.

Speaker 0

你知道,我自己就能搞定,其实不需要

You know, I can just do it. Don't really need

Speaker 1

这些工具。

the tool.

Speaker 0

我觉得这有点像这种声音

And I I think there's kinda like this Sounds

Speaker 2

像是人类的反应。

like a human.

Speaker 0

没错。这就像存在一个天花板——它们到底能如何真正推动自我改进?你觉得部分原因是不是因为它们只是被教导使用现成工具,比如GRAP之类的?所以在推理阶段让它们构建工具就比较困难?还是你认为这是跨越那个阶段的一部分?

Yeah. There's kinda like the ceiling of, like, how can they really push themselves to, like, improve? Do you feel like part of it is like, hey, they're just being taught to use these tools, which is, like, you know, GRAP and, like, whatnot? And so it's kinda hard for them at inference time to build the tools? Or do you see this as, like, part of that jump?

Speaker 2

我认为这绝对是阶段性特征。我们并非完全不具备这种能力,关键在于训练方式。如果模型只接受过特定工具集的训练,没有被快速适应新工具的能力,就不能指望它在评估时有不同表现。但能够创造提升效率的自用工具,并持续积累工具库——这种能力简直是工具箱里的超级武器。

I think that's gonna I think that's part of the step for sure, right? And I think it's not like we're at zero on being able to do that, right? I think a lot of this is just about the training, right? If the model really has trained with just a specific set of tools, hasn't really been pushed to adapt to a new tool very quickly, then you shouldn't expect it to do any differently at evaluation time. But the idea of producing your own tools that make you more efficient and build up a library of those over time in a persistent way, like that's an incredible primitive to have in your toolbox.

Speaker 2

如果你想解决那些极其困难的未解难题,我认为这种能力将是必备条件。

And I think that if your goal is to be able to go and solve these incredibly hard challenges, unsolved problems, then I think you're going to need that kind of thing as a dependency.

Speaker 1

有什么架构决策或创新是你想讨论的吗?滑动窗口注意力机制、DeepSea推广的精细混合专家系统、ROPE、YARN、注意力沉淀池——在GPT OSS的开发过程中,有哪些特别引人注目的技术选择?

Any architectural decisions or innovations that you would like to talk about? Sliding window attention, the very fine grained mixture of experts, which I think DeepSea popularized rope, yarn, attention sinks, any anything that, you know, I think stood stood out for to you and the the choices made for GPT OSS?

Speaker 2

我想说这些选择都是基于我们所知的,你看。我们有一个团队一直在研究不同的架构,探索了各种可能性。像专家混合这样的设计很有趣,我觉得应该归功于我们团队在这些选择上的努力。

I would say that these choices are all you know, look. Like, we have a team that's been working on different architectures. We've explored different things. Something like Mixture of Experts is something that it's fun it's funny. I would say that I I would credit our team for the choices there.

Speaker 2

但我的构想是,我们需要一种能轻松适应这些环境的方案。因此选择稀疏程度这类参数时,必须严格考虑内存占用,以及实际能用于前向传递的计算资源等。所以某种程度上,架构决策很大程度上受限于模型规模和我们预期运行时能获得的计算资源。

But I say that the picture in my mind is we wanted something that would be easy to run-in these environments. And so picking things like just how sparse to go is very tied to your memory footprint and then how much compute you actually can use for forward pass and things like that. So, I think that to some extent the architectural decisions were fairly constrained by the model sizing and the compute we expect for them to have access to it when they're running.

Speaker 1

是啊,这其实都是非常务实的工程决策。

Yeah. I mean, it's very practical engineering decisions, really. Yeah.

Speaker 2

没错。而且模型的能力确实证明了这点——我们确实运用了大量尖端技术来持续突破模型的能力边界。

Yeah. I think so. And I think that the the power of the model really shows, like, we really did use a lot of our cutting edge techniques to actually push the the capabilities models further and further.

Speaker 1

我明显能感觉到为API设计的模型架构和单机版模型有本质区别。多租户场景下能进行批处理,这和单机运行完全是两回事。

I'd say I I definitely detects a difference between the architecture for models designed for API use versus models designed for single machine. You know what I mean? Like, there's there's when you have multi tenancy, when you can have batching, it's very different from, like, single machine.

Speaker 2

天壤之别。

Very different.

Speaker 1

是啊,不知道未来会不会融合,不过也许就像你常说的,最终会是统一模型(monadromate models)。

Yeah. I don't know if that'll ever combine but maybe it's a monadromate models like you always say.

Speaker 2

对。还有个有趣的架构思路:本地模型有时能委托远程模型处理。这样既能提速,从隐私架构角度看也很实用——决定哪些数据上传/留存,边缘计算意味着断网时仍能运行,再配合慢速规划模型...这种协同机制非常有意思。

Yeah. I think it's also really interesting to think about an architecture where you have a local model that then delegates to a remote model sometimes, Right? And then this can be something where you can run much faster. It's helpful for a privacy architecture perspective that just trying to decide what actually goes, what stays and having that edge compute means that then you lose Internet connection, you're still able to do something and you can have a slower planning model. It's like this interplay between those things is very interesting.

Speaker 1

就像设备端GPT-five?本地基础版,网络可用时再路由到在线版本?

Yeah. So like a GPT-five on device where you have GTOSS here and then it routes through online if it's available. I don't know.

Speaker 2

差不多。就像编解码器基础设施那样,本地代理和远程代理无缝协同,还能多人协作——这就是未来的模样,会非常惊艳。

Yeah. Something like that. And then you have your codecs infrastructure that has a local agent and a remote agent and then is able to seamlessly interplay between the two and then is able to multiplayer. Like, this is what the future is going to look like and it's going to be amazing.

Speaker 0

然后你随身带着一个设备。我能看到我能看到事情的发展方向。

And then you have a device always with you. I can see I can see where things where things are going.

Speaker 2

这一切都是相连的。

It all connects.

Speaker 1

是的。关于这个设备我们能说什么?你提出了答案。

Yeah. What what can we say about the device? So you raised the answer.

Speaker 0

我想在特朗普的话题上休息一下。我们能

I wanna take break in Trump. What can

Speaker 1

关于这个设备说些什么?

we say about the device?

Speaker 2

会很棒的。

It's gonna be great.

Speaker 1

好的。然后是另一个政治——我不知道这是否与政治有关。你知道,中国有很多开放模型涌现。为什么美国开源很重要?

Okay. And then another political I don't I don't know if it's political or not. You know, there there's a lot of open models coming off in China. Why is it important for there to be American open source?

Speaker 2

我们在开源模型上考虑的另一个非常实际的问题是,基于我们开源模型构建的人某种程度上是在我们的技术栈上构建。对吧?如果你依赖我们来帮助改进模型,依赖我们取得下一个突破,那么这意味着你实际上对我们有依赖,这对我们的业务有好处,但我想对国家也有好处,对吧?想想看,从人们直接运行的模型到它们如何交互和相互影响,就像我们刚才讨论的那样,这实际上让我们能够构建一个完整的生态系统,让人们能够控制对他们重要的部分,最终建立在反映美国价值观的模型上,然后能够与美国——希望是底层的芯片、后端的云模型和执行环境等——相互作用,所有这些结合在一起,我认为它增加了很大的价值。而且我认为这让美国的领导地位真正意味着我们在世界上的价值观也有领导地位。

Another thing at a very practical level that we've thought about with open source models is that people building on our open source model are kind of building on our tech stack. Right? If you are relying on us to help improve the model, that you're relying on us to get the next breakthrough, then that means that you actually really have a dependence in both a way that's good for our business, but I think it's also good for the country, right? That you think about having an American tech stack from the models that people are running directly, but then how those are going to interface and interplay in the way that we just talked about, that it actually allows us to build a whole ecosystem where people are able to have control over the parts of it that are important to them, ultimately be built on these models that reflect American values and then be able to interplay with American, you know, hopefully chips underneath and cloud models on the back end and execution environments and all of that fitting together is something that I think it adds a lot of value. And I think it allows for American leadership to really also mean that we have leadership in in our values in the world.

Speaker 1

是的。恭喜发布。谢谢。

Yeah. Congrats on launching that. Thank you.

Speaker 0

让我们谈谈OpenAI的工程。我知道有很多关于Cloud Code、Aeter、OpenCode和所有这些不同工具的争论。你如何看待构建团队本身以从中获得最大杠杆?你是否从数量角度、能力角度、组织内团队规模角度改变团队的构建方式?有什么想分享的吗?

Let's talk about engineering at OpenAI. I know there's a lot of debate about Cloud Code and Aeter and OpenCode and all these different tools. How do you think about structuring the team itself that gets the highest leverage out of this? Are you changing the way you build the team from a numbers perspective, from a, you know, capabilities perspective, from a team size perspective within the org? Anything that you wanna share?

Speaker 2

软件工程确实在多个维度发生变革。有一部分工程领域对这些模型来说仍难以攻克,但我们已看到初步突破迹象。比如那些核心硬核算法——像CUDA内核这类高度自包含的问题,本应很快成为我们模型的强项,但因需要大量领域专业知识和抽象思维能力,目前仍具挑战性。不过绝非无法解决。

Well, engineering, software engineering is definitely changing in many dimensions. There's a part of engineering that's very difficult for these models to really crack, but we're starting to see the beginnings of it happening. And that that's these like very core hard algorithms, right? Things like CUDA kernels are a good example of a very self contained problem that actually our models should get very good at very soon, but it's just difficult because it requires a lot of domain expertise, lot of like real abstract thinking. But again, it's not intractable.

Speaker 2

这类自包含问题恰恰非常适合现有技术处理。而架构设计类问题则困难得多——比如系统搭建的抽象思维。值得欣喜的是,我们的模型在这方面也开始展现潜力。

It's self contained. It really is the kind of problem that is very amenable to the technology we have. There's other problems that are very difficult in terms of architecture, right? How do you think about how a system should be put together and thinking about the abstractions? And again, our models are starting to get kind of good at this.

Speaker 2

目前观察到的是,大多数工程师(包括顶尖工程师)的日常工作与模型的核心优势高度契合。特别是面对非精通语言时,没人愿意手动编码——这正是模型大显身手之处。当然,那些需要人际沟通获取决策背景的工作,对模型而言仍具挑战性。

But so, I think what we've seen is that there's for most of our engineers, even our extremely good engineers, and a lot of their work that actually maps very well to this core strengths of the models right now. And definitely for anything where it's like a language that you're not an expert in, yeah, you definitely don't want to be writing that code yourself. You really want a model to be doing it. And then there's parts of the job that become much harder because it requires like things the models don't have access to, right? It requires a lot of context going and talking to people in order to make good decisions.

Speaker 2

虽然团队架构尚未因这些工具发生根本改变,但当前首要任务是将模型推广到所有适用领域。我们需要建立负责任的实施框架,这正处于从早期采用转向主流的关键阶段。生产力提升意味着我们需要更多人才——毕竟软件产能和技术债清理能力始终是制约因素。

And so I think we're not at the point yet where we really see changes in how you structure a team because these tools exist. But I think we're at a point where it is like an extreme high priority to get these models to be used in all domains that they possibly could be. And to think about how you do that well and responsibly and think about what the guardrails should be and that that happens in a very practical way. And so I think a lot of what I'm seeing is like we're in a early adopter phase that's starting to transition to a mainstream phase and the productivity impacts of people being able to do more means we actually want more people, right? It's like we are so limited by the ability to produce software, so limited by the ability of our team to actually clean up tech debt and go and refactor things.

Speaker 2

如果工具能使效率提升十倍,我们就能完成百倍工作。这些模型带来的不仅是效率提升,更是能力边界扩展——这才是根本目标。

And if we have tools that make that 10x easier, we're going be able to do 100x more things. And so I think that there's this incredible opportunity that is entailed by these models not being a real driver of just do the same stuff more efficiently, but be able to do way more. And that that is I think the overall goal.

Speaker 0

你们如何调整团队工作以适应大语言模型?是否改变了问题追踪方式或代码库结构?

Yeah. How have you changed the team's work to fit the LLMs better? Is there a different way in which you track issues? Is there a different way in which you structure code bases?

Speaker 2

目前仍处探索阶段,但最成功的做法是围绕模型优劣势构建代码库:创建更多自带完善单元测试、快速执行和清晰文档的自包含模块,将细节交给模型处理。同时确保这些AI优化模块仅被同类模块依赖,最终形成完整的AI友好系统。

So I I think we're still at the early edge of this. But the thing I've seen be most successful is that you really build code bases around the strengths and weaknesses of these models. And so what that means is more self contained units have very good unit tests that run super quickly and that have good documentation that explains what this module is for. And if you do that and you kind of leave the details to the model, it works really well. And then thinking about how these things compose and making sure that you're thinking about the dependencies that you only have these like clean like AI optimized modules can only be depended on by other AI optimized modules, then you end up with like a whole system that's actually AI optimized.

Speaker 2

我们才刚触及可能性表面。考虑到模型迭代速度,当前所谓的模型弱点半年后可能大幅改善,因此不必过度适应当下局限。这个特殊阶段反而孕育着快速突破的机遇。

And so I think that we're still scratching the surface of what's possible. And, you know, the models are advancing so fast that actually what it means to work around the weaknesses of the model, In six months, I think that those weaknesses will be like vastly shrank. So you don't want to necessarily spend all your time just overfitting to what exists today. But I think there's a lot of potential to be able to move quickly in this particular moment.

Speaker 1

我很好奇工程师的价值变化趋势——随着部分工作被自动化,行业签约奖金却创历史新高。真正有价值的是工程师本身,还是赋能他们的系统?

One question I'm I'm very curious about is the value of an engineer, you know Increasing over time. Increasing over time. Well, I mean, also, you know, there there's there's some part of our work that's being automated away. And and I think, like, obviously, they're they're very, very high signing bonuses, like, higher than we've ever seen in in the history of our industry. Is it really the engineers that are valuable or the systems that enable them?

Speaker 1

感觉是两者兼具,但市场确实在为工程师支付超高溢价...

You know? Like, I feel like it's kind of like a bit of both, but people are paying a lot for the engineers. I mean, I

Speaker 2

我认为归根结底,真正的新鲜之处在于我们正在创造的技术——这些模型是人类打造过的最实用工具,对吧?支撑它们的,是我们正在建造人类史上最庞大的机器。到了某个程度,投入数据中心的美金数字会变得抽象难解。

think that the thing at the end of the day that is new, right, is that we are producing technology, these models that are the most useful tools that humanity has created. Right? And that underpinning them, we are building the biggest machines that humanity has ever created. Right? It's like at some point, the dollars that go into these data centers starts to be an abstraction.

Speaker 2

对吧?500亿美元算什么?1000亿美元又算什么?这数字怎么可能被真正理解?我觉得这几乎超出了人类认知的尺度。我们作为一个国家、一个社会、一个世界正在共同推进的这项工程,明白吗?

Right? What is $50,000,000,000 What is a $100,000,000,000? How can you possibly internalize what that is? I think it's beyond almost the scale of human comprehension. The engineering project that we collectively as a country, as a society, as a world are undergoing right now, Right?

Speaker 2

像是新政这样的工程相比之下都黯然失色。阿波罗计划与我们当下所做的相比也相形见绌。从很多方面说,这本该如此,对吧?这项技术的经济回报非常巨大,但更重要的是我们正在迈向新型经济——AI融合经济、AI驱动经济。

It's like projects like the like the New Deal, like pale in comparison. You know, the Apollo program pale in comparison to what we're doing right now. And in many ways, it's as it should be, right? Like the the economic return on this technology is very large but even more importantly, the way in which we are moving to a new economy, right? An AI integrated economy, an AI powered economy.

Speaker 2

这正是我们使命的核心所在,对吧?我们看到地平线上的变革,我们想要助力,想要引导它成为提升全人类的契机。这几乎是人类历史上绝无仅有的非凡机遇。

And this is ultimately what our mission is about, right? Is it's like we see this change on the horizon. We want to help. We want to help steer it to be something that uplifts everyone, right? That it's this amazing opportunity, almost unique in human history.

Speaker 2

我们何其有幸能身处这个时代,以某种方式参与其中。对我而言,这是思考人类尺度大变革的背景板。有时你会感到认知失调——当你调试某个底层CUDA死锁,或纠结紫色渐变时,突然意识到这关乎人类未来。所以当考虑工程师归属哪家公司时,这些选择确实举足轻重,这关乎团队协作。

And we are all fortunate, right, to be at this moment in time and to be able to be involved in some way. That to me is the backdrop to really think about this big shift that is going on at humanity scale. And it's sometimes almost you feel this cognitive dissonance because you're debugging some low level CUDA deadlock or you're worried about the purple gradient and you realize this is like the future of humanity that we're really talking about. And so when you think about engineers and who's at which company and all these things, like these things matter, right? It's like it's not just about any individual, it's about a team, right?

Speaker 2

但也不只关乎某个产品或系统,而是我们共同构建的整体社会与经济。我常退后一步思考宏观图景,但也需关注微观层面:人们快乐吗?他们与使命有联结感吗?

But it's also not about any one product or any one system. It's really about the overall society, the overall economy that we are building together. And so, I guess I sometimes step back and think about the big scale, but you also need to think about the micro scale. You need to think about are people happy, right? Is do people feel connected to the mission?

Speaker 2

他们觉得自己的工作有意义吗?这些才是最关键的事。上头条的未必最能驱动人心,但它们确实反映了人们眼中这项技术的经济潜力。

Do they feel like the work they're doing matters? And those things actually turn out to be the most important things. And so what makes the headlines is not necessarily the stuff that actually most drives the people, but it is for sure like a like a reflection of the economic reality that people see as the potential of this technology.

Speaker 1

这与Noam在多智能体团队的发言有所呼应:人类个体智能有限,但作为文明,我们能登月、建城市、造AI。集体的力量远超个人。

This connects a bit with what Noam was saying on the multi agents team where, like, the individual intelligences of humans, you know, we can only do so much individually, but as civilizations, we can, you know, go to the moon and, like, build cities and build AI. Like, together, I think I think we can do a lot more than we can individually.

Speaker 2

我们携手能创造奇迹,这点毋庸置疑。

We can do amazing things together. No question.

Speaker 0

你对AI研究现状怎么看?大家真都在做相同的事吗?是否觉得每个实验室的不同见解终将帮我们找到正解?还是因为现在资金规模太大,只能押注你认为可行的方向?

Do you think about the current state of AI research? Is everyone really just doing the same thing? Do you feel like every lab has a different take that is eventually going to help us converge to the right thing? Or just because now the dollars has gotten so big that you need to do the thing that you think is going to work?

Speaker 2

我认为这个领域的多样性令人惊讶。有时可能会觉得存在趋同进化,但如果你真正与不同实验室的人交流,就会发现人们持有不同的视角。OpenAI早期做出的一个决定是,我们确实需要一群思维方式一致的人,对吧?因为对于那些长期攻读博士学位、拥有自己研究愿景的人来说,你很难告诉他们该做什么。所以,如果想要大家朝同一个方向努力,就意味着必须精心挑选这样一群人。

I think there's a surprising amount of diversity in the field. I think sometimes it can feel like there's convergent evolution, but I think that if you really talk to people at different labs, you really realize that there's different perspectives people have. One of the decisions we made early on in OpenAI was that we really wanted a set of people who are aligned in how they think, right? Because for people who have been pursuing a PhD for a long time, who are sort of have their own research vision, you kind of can't tell them what to do. And so if you want people who are going to row in the same direction, it means you have to select that set of people.

Speaker 2

这可能是我们在OpenAI做出的最重要早期决策,它帮助我们取得了今天的成就。这意味着你必然可以选择不同的发展方向,从各实验室的研究重点和成果中就能明显看出差异。在OpenAI,我们始终专注于如何通过研究实现质的飞跃。即便是像GPT-5这样的项目,虽然我们承受着要解决具体编码问题反馈的压力,但有时也必须退后一步思考:如何实现下一个阶梯式突破?

And that was I think the most maybe important early decision that we made at OpenAI that helped us to achieve the things that we have. And so I think that that means that you necessarily have different vectors that you could pick And you really see it in the taste of different labs and what they focus on, what they produce. And that at OpenAI, I think we've been very much focused on how do you do the research that gets you to the next level. And even for something like GPT-five that, we sort of had a lot of pressure to think about, okay, let's just like sort of do the grind of like here's feedback on problems that we have on the coding side and you can pursue that grinding and get somewhere. But you also sometimes have to step back and think about how do you do the next step function?

Speaker 2

如何实现下一个范式转变?推理范式的成功就是我们这方面的典型案例。OpenAI发展历程中我们多次实现这种突破,未来也将持续如此。我认为突破性进展仍待发掘,在多模态内容生成等领域存在前所未有的丰富可能性,整个研究领域比以往任何时候都更具活力。

How do you do the next paradigm shift? And something like the reasoning paradigm is a good example of a time that we did that very successfully. And we've done that many times over the course of OpenAI and we'll continue to do that. And so I think that the breakthroughs remain to be made and there's such a diversity of multimodal and different ways you could generate things and all of the stuff that I think that the field is more the field of research is more abundant, than it ever has been.

Speaker 1

没错。别忘了这还只是主线研究,此外还有语音、图像生成、视频生成等领域。

Yeah. And not to forget, that's like the mainline research. There's also voice. There's also image generation, video generation.

Speaker 2

对,对,对。这些确实容易被忽视

Yeah. Yeah. Yeah. It's easy to forget about

Speaker 1

的部分。

these things.

Speaker 0

就像吉卜力工作室的作品,曾经风靡全球。

From the Studio Ghibli, it was like the biggest thing in the world.

Speaker 2

正是如此。这太神奇了。顺便说,这类成就往往来自少数人团队多年专注攻克某个问题的结果。

Exactly. Right? That's amazing. It's amazing. And and that's the kind of thing, by the way, that was it's really a team of a small number of people who are really focused on that problem for multiple years.

Speaker 2

我认为这正是OpenAI的核心精神——对那些重要问题做长期投入,最终形成具有凝聚力的整体成果。

And that that is I think the the sort of core ethos of OpenAI is to make these long term bets on problems that matter in a direction that really adds up to a cohesive whole.

Speaker 0

所以从外部很难判断你们的重点方向。比如Imagen几乎横空出世,获得广泛采用。人们该如何理解你们的优先级?是应该自行探索构建,还是等待你们的技术迭代?

So from the outside, it's kind of hard to figure out what you're focusing on. You know? Kind of Imagen just came out of the blue almost, which was great, got a lot of adoption. How should people think about how you prioritize versus what people should explore and build and should wait for you to improve on?

Speaker 2

这个领域存在着巨大的可能性空间,对吧?因为神经网络、深度学习几乎适用于任何类型的数据和领域。而我们无法面面俱到。核心推理范式显然是我们持续攻坚的方向。多模态语音、图像生成、视频生成这类领域,我们也视为重中之重,它们彼此间存在内在关联。

Well, there's a massive possibility space in this field, right? Because neural nets, deep learning is applicable to effectively any sort of data, any sort of domain. And that we can't do everything. The core reasoning paradigm, that clearly is something we're going to keep pushing on. Multimodal voice, things like image generation, video generation, these kinds of of areas are also things that we view as very important and all kind of fit together.

Speaker 2

但有些领域确实让我们难以确定如何纳入核心项目的优先级。比如2018年的机器人技术,我们虽取得突破性成果,却意识到在其他领域能实现更快进展——记得那个解魔方的机械手吗?团队受限于机械肌腱每20小时就会断裂,需要工程师维修的物理瓶颈。

But there's been areas where it's just hard for us to really figure out how do we prioritize as part of the core program, right? And we've been through times where for example, robotics was one in 2018 where we had a great result, but we kind of realized that actually, like, that we can move so much faster in a different domain, right? That that actually, you know, we had this great result with a robot hand solving a, you know, unscrambling a Rubik's cube. And that that team was bottlenecked by the fact that this robot hand, you could run it for twenty hours before its tendon would break. And so then you would have a mechanical engineer come and fix it.

Speaker 2

后来该团队转向开发了GitHub Copilot,这无疑是数字领域比物理世界更易取得快速突破的明证。无论我们招募多少人、获得多少GPU,带宽始终有限。作为一家实验室,我们聚焦于保持研究方向的连贯性。你会看到我们时而探索分支项目,其中部分最终会成为核心——但整个可能性空间对所有人都是开放的。

And that team went on to go do what became GitHub Copilot, which is obviously an amazing feat and and a real accomplishment and and something that they were able to move so much faster in the digital domain than in the physical one. And so I think that for us, we really try to we have you know, no matter how many people we hire, how many GPUs we get, we have limited bandwidth. Right? That we are, you know, sort of one company, one lab that's focused on as much as we can a coherent one problem. And so I think that you can kind of look at the set of things we're doing and sometimes we'll do offshoots and sometimes that will be something that then becomes part of the core program, but that there's just so much possibility space for everyone.

Speaker 1

精彩。趁接近尾声,我想从更宏观视角提几个快问——来自Alessio的问题,不如由你来回答?

Awesome. I'd like to take a chance where, you know, we're kind of closing up. Few few small little lightning questions just on zooming out from OpenAI. This question I got from Alessio, so why don't you take it?

Speaker 0

当初创立OpenAI时,你几乎认为成立AI实验室为时已晚。如今人们认为哪些看似太迟的领域其实值得投入?

Oh, so when you started OpenAI, you almost believed that it was too late to start an AI lab. What are things that people today think it's almost too late to do that they should be doing?

Speaker 2

显然,将这些模型与实际应用领域结合极具价值。或许有人认为创意已被穷尽,但经济生态如此庞大。每个人类事业领域都蕴藏机遇,思考如何最大化利用我们创造的智能体至关重要。以医疗为例,必须统筹考量所有利益相关方——

Well, I think it's pretty clear that connecting these models to real world application domains is extremely valuable. And I think sometimes it might feel like all the ideas are taken, but the economy is so big. Every application of human endeavor is so big. And so it is worthwhile and really important for people to really think about how do we get the most out of these amazing intelligences that we've created. And a lot of that is, you know, for something like health care, you have to really think about all the stakeholders, right?

Speaker 2

需要透彻理解现有系统运作机制,才能妥善嵌入这些模型。所有领域都存在着大量尚未摘取的果实。

You have to think about how does the system work today and how do you slot these models in well? And I think that's across all of these domains, there is so much fruit that is not

Speaker 1

确实如此。那就动手开发GPT说唱生成器吧。

yet picked. Yeah. So go ahead and write the GPT Ripper. Yeah.

Speaker 2

尽管去做。但我的建议是:聚焦那些价值不仅源于技术优化的领域,真正需要的是对行业的深刻理解、专业积累与人际网络的建设。

Do it. But I think the thing that I would advise is to really think about domains where the value that you're producing is not necessarily just having written a better rapper. It's really about understanding a domain and building up expertise and relationships and all of those things. Yeah.

Speaker 1

你偶尔会做天使投资,哪些项目能吸引你?

You do occasionally angel invest. What gets your attention?

Speaker 2

其实我已经好几年没有进行天使投资了。哦,好吧。是的。因为所有事情都会分散我对OpenAI的注意力,我只想保持高度专注。好的。

I actually have not angel invested for a number of years now. Oh, okay. Yeah. It's just like everything is a distraction from OpenAI, and I just like to stay laser focused. Okay.

Speaker 1

这是个时间旅行问题。格雷格,你想在2045年的便利贴上写什么?

This is a time travel question. What is one Post it note you wanna send to 2045, Greg?

Speaker 2

到时候你就58岁了。戴森球建得怎么样了?

So you'll be 58. How's the Dyson sphere?

Speaker 1

戴森球进展如何?我不确定你是否实际计算过建造它所需的资源,但

How's the Dyson sphere? I don't know if you've actually done the math on like what it takes to do that, but

Speaker 2

是的。更严肃地说,考虑到当前技术发展速度,2045年实在太难预测了。我希望那会是个充满惊人丰饶的世界,届时人类应该已经实现多星球居住,几乎任何科幻梦想都可能成真——除了受限于原子级操作物理极限的那些。但说到底,我只希望从2025年展望时,那个世界能美好到超乎想象。

Yeah. I mean, more more more seriously, it's like like 2045 is just so hard to imagine given how fast things are moving right now. And so I hope it'll be a world of amazing abundance and that I think at that point, we really should be multiplanetary and, kind of almost any sci fi dream you can imagine. It's hard to deny its possibility except for things that are limited by the physical ability to move some atoms at that rate. But, yeah, it's like I I just I think I would just hope that that world is as amazing as it could be sitting here in 2025.

Speaker 1

在物质极大丰富的时代,我们还需要全民基本收入吗?因为真正的丰饶意味着不再需要它。

Will we even need UBI with abundance? Because true abundance means we don't need it.

Speaker 2

首先,这个问题早有争论。我记得OpenAI早期就讨论过后AGI时代货币是否还有意义?这真的很难说。

Well, I think well, first of all, I think that there's been a lot of debate. I remember early on in OpenAI of post AGI, will money mean anything? Right? And it's really unclear. Right?

Speaker 2

如果只需对电脑说话就能即时免费制造任何实体商品或物质产品,货币还有什么意义?但反过来说,有样资源注定供不应求——算力。OpenAI内部已经显现:拥有更多算力的研究员能开展更大项目。未来如何分配算力将至关重要,因为更多算力意味着能解决更多你关心的任务和应用。

If you can just talk to a computer and it'll produce anything you want, you want something that, you know, you want some some physical good, you want some, you know, any sort of material item and it can just be manufactured for you instantly, you know, effectively free, what does money mean? And the flip side is like, I think that there is one resource that is very clearly going to be in very hot demand, which is compute. Already the case, we see this within OpenAI that the researchers that have the access to the most compute are able to have the biggest projects and do more. And I think in the future, thinking about how do people get access to compute and the more compute that you have for whatever task you care about, for whatever application you care about, it will be solved more, that more will happen. And that I think that that question of what the compute distribution looks like will be something very important.

Speaker 2

所以不工作能否生存?答案会是肯定的——物质需求都能满足。但能否做得更多?比如不仅能按需生成电影,还能添加惊人细节与奢华特效;或是让系统为你专属思考相当于主观体验百年的最优方案。我认为算力投入永远能带来额外回报。

And so I think that the question of exactly how, you know, if you don't do work, do you survive? I think the answer will be yes. You'll have plenty of material your material needs met. But I think the question of can you do more? Can you have not just generate, like, you know, as much, you know, like sort of movie as you want, but have, like, this amazing detail and, like, all this extra fanciness to it and have this thing go, you know, think super hard for, you know, a hundred years worth of a subjective experience about what the best thing is for you specifically, I think that there will always be more return on more compute.

Speaker 2

因此我们必须审慎思考如何架构这个算力分配的社会体系。

And so that will be something we have to really think carefully about, about how that society is architected.

Speaker 1

顺便说一句,这个我总是觉得更难。给2005年的格雷格发个便条。所以是18岁的你。

And then this I always find this harder, by the way. Post a note to send to 2,005, Greg. So 18 year olds.

Speaker 2

哇。我理解时间旅行的概念。我能写多长的便条?

Wow. I get the time travel. How long of a note can I write?

Speaker 1

我可以发个便条。给自己一点建议,显然这也是给其他人的。对吧?但主要是,你知道的,写给自己。

I can post a note. A bit of advice to yourself and obviously this is a proxy for everyone else. Right? But most you know, address it to yourself.

Speaker 2

我认为最让我惊讶的是,问题的数量会随着时间的推移而增加。因为我记得在1999年、2000年读到关于硅谷的报道时,感觉我错过了机会。我出生得有点太晚了。非常常见。没错。

I think the single thing that I have been most surprised about is the abundance of problem grows over time. Because I remember in 1999, 2,000 reading about Silicon Valley and feeling like I've missed the boat. I was born just a little bit too late. Very common. Exactly.

Speaker 2

我只是觉得所有酷炫的问题一定在我准备好去解决之前就已经被解决了,到时候就没有什么可做的了。结果证明这完全错了。现在正是进入科技领域最激动人心的时刻,真正在世界中运作,因为我们拥有这个惊人的工具,它将提升并革新每一个应用、人类努力的每一个领域。我认为,这是值得兴奋的事情,我们可以应用它,虽然我们还需要解决一些挑战,但这是为了实现这个惊人的结果。所以我认为,问题的可用性会随着时间的推移而增加而不是减少,这个信息是我希望我当时能内化的核心内容。

I just felt like all of the cool problems must be solved by the time I'm ready to go work on things, there'll be nothing left. That turned out to be totally false. Like now is just the most exciting time to be in technology, to really be operating in the world because we have this amazing tool that is going to uplift and revolutionize every application, every field of human endeavor. And I think that the fact that that's something to be excited about, that is something that we can apply and there are challenges we have to work through, no question, but for the purpose of achieving this amazing outcome. And so I think that just that message of that the problem availability will grow over time rather than shrink, I think is the core thing I I wish I had sort of internalized at the moment.

Speaker 0

太棒了。非常感谢你加入我们,格雷格。

Awesome. Thank you so much for joining us, Greg.

Speaker 1

好的。谢谢你的时间。谢谢你。

Alright. Thank you for time. Thank you

Speaker 2

非常感谢。能在这里真是太好了。

so much. It's been great to be here.

关于 Bayt 播客

Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。

继续浏览更多播客