RL健身房探秘：从函数调用到模拟宇宙

本集简介

本周的《图灵播客》邀请到图灵公司强化学习训练场专家Anshul Bhagi，深入探讨强化学习环境的构建原理及其当下意义。本期节目将揭示强化学习在AI技术栈中的定位，并解析图灵RL训练场如何助力顶尖实验室实现迭代式能力跃升。内容亮点： - 两类主流强化学习环境及其演进路径 - 优质与卓越RL环境的关键差异及其对训练的影响 - 强化学习技术的适用边界判定 - 未来RL训练场可能实现的商业全流程模拟与个性化智能体培养在AI竞赛中保持领先需要持续强化能力。图灵RL训练场专为此设计，通过研究者、智能体与系统的持续迭代，产出更强大的模型并加速技术突破。如果您正从事复杂模型训练、通用/超级人工智能开发或AI原生系统构建，本期将为您展现AI训练基础设施的未来图景。 (00:00) RL环境导论 (04:14) RL环境类型解析 (07:05) RL环境演进史 (09:59) 人类在RL设计中的角色 (10:54) RL技术避坑指南 (21:40) RL环境精度控制 (24:46) RL环境未来展望 (27:31) RL环境复杂度解析 (30:37) 前沿发展方向

双语字幕

仅展示文本字幕，不包含中文音频；想边听边看，请使用 Bayt 播客 App。

Speaker 0

嗨，欢迎回到图灵播客。

Hi, and welcome back to the Turing podcast.

Speaker 0

今天，您将深入探索强化学习环境——这些是为评估和训练AI模型以完成复杂现实任务而设计的逼真虚拟环境。

Today, you'll be diving into reinforcement learning gyms, realistic virtual environments designed to evaluate and train AI models for complex real world tasks.

Speaker 0

本次技术深度解析的向导是安舒尔·巴吉。

Your guide for this technical deep dive is Anshul Bhagi.

Speaker 0

安舒尔曾深度参与微软、苹果、谷歌、麦肯锡以及他自己的初创公司的产品与团队建设与扩展，如今领导图灵在强化学习环境及其他前沿项目上的工作。

Anshul's been heavily involved in building and scaling products and teams across companies like Microsoft, Apple, Google, McKinsey, and his own startups, and today leads Turing's work on RL gyms and other frontier initiatives.

Speaker 0

在本集中，安舒尔解释了强化学习环境如何推动AI从基础的指令遵循迈向更高级的智能体推理能力。

In this episode, Anshul explains how RL gyms are pushing AI from basic instruction following towards more advanced agentic reasoning capabilities.

Speaker 0

您将了解到，为什么强化学习正成为训练模型处理复杂数字和企业工作流的关键方法，以及图灵在强化学习环境上的方法如何帮助合作伙伴加速迈向人工智能超级智能的进程。

You'll hear why reinforcement learning is emerging as a critical method for training models in complex digital and enterprise workflows, and how Turing's approach to RL environments is helping partners accelerate progress toward artificial superintelligence.

Speaker 0

让我们开始吧。

Let's dive in.

Speaker 0

感谢您今天加入我们。

Thanks for joining us today.

Speaker 0

你好吗？

How are you?

Speaker 1

我很好。

I'm very well.

Speaker 1

这里旧金山的早晨真不错。

It's a nice morning here in SF.

Speaker 1

谢谢你的邀请。

Thanks for having me.

Speaker 0

当然。

Of course.

Speaker 0

很高兴你加入我们。

We're glad to have you on.

Speaker 0

所以我想直接切入主题。

So I just want to dive straight into it.

Speaker 0

你能给我介绍一下我们是如何使用RL训练场的吗？以及我们是怎么走到这一步的？

Could you describe to me what we do with RL gyms and tell me how we got here?

Speaker 1

当然。

Sure.

Speaker 1

Turing 为我们的各类合作伙伴处理不同类型的强化学习环境。

Turing works on different types of RL environments for our various types of partners out there.

Speaker 1

一方面，我们有AI实验室，致力于训练越来越复杂的流程模型。

So we have AI labs on one side that are trying to train models for increasingly complex workflows.

Speaker 1

现在，它涉及跨多种工具、企业级和消费级应用的智能体长期思考。

So now it's agentic long range thinking across various types of tools, enterprise and consumer.

Speaker 1

另一方面，Turing 也为训练自身基础模型的企业客户提供服务，他们关注的问题也类似。

We also at Turing serve enterprise customers that are training their own foundation models, and they care about similar similar things.

Speaker 1

在所有这些客户中，我们发现强化学习环境有一些共同的需求。

And across all of these customers, we've seen a few common denominator needs pop up for RL environments.

Speaker 1

大致上有两个。

There are two broadly.

Speaker 1

第一个是用于UI智能体的强化学习环境。

So the first one is RL environments for UI agents.

Speaker 1

想象一个计算机使用代理或浏览器使用代理。

So think of a computer use agent or a browser use agent.

Speaker 1

这些强化学习环境将包含一个用户界面，模型和运行模型的智能体框架可以与之交互，使用类似于人类与浏览器或计算机交互时的工具调用。

These RL environments will have a UI that the model and the agentic harness running the model can interact with, with the kinds of tool calls that are similar to what a human uses to interact with a browser or computer.

Speaker 1

我们讨论的是鼠标事件和键盘事件。

So we're talking mouse events, key events.

Speaker 1

因此，这是一种用于计算机使用和浏览器使用代理的、带有用户界面的强化学习环境。

So that's one type of RL environment with a UI for computer use and browser use agents.

Speaker 1

第二种类型的强化学习环境不需要用户界面。

The second type of RL environment does not require a UI.

Speaker 1

它更侧重于在更高层次的动作空间中完成任务，你可以将其理解为通用的函数调用。

It is more based on completing tasks in a higher level action space, which you can think of as general function calling.

Speaker 1

你仍然可以在没有用户界面的情况下模拟类似的环境。

You can still model similar environments without a UI.

Speaker 1

例如，你可能有一个带用户界面的亚马逊电商平台环境，但你也可以在没有用户界面的情况下模拟亚马逊，抽象出代理在亚马逊上购物时可能调用的不同API，暴露这些接口，并即使没有界面，也能维护系统状态和数据库。

So for example, you might have Amazon as an e commerce environment with a UI, but you could also model Amazon without the UI and try to abstract out what are the different APIs that an agent might be calling to do shopping on Amazon and expose those and have a system state and a database even without the UI.

Speaker 1

总的来说，这就是两种类型的RL环境。

So broadly, those are the two types of RL environments.

Speaker 1

你其实还没问过这个问题，但我或许可以谈谈为什么世界正在朝这两个方向发展。

And you haven't really asked this, but maybe I'll speak to why the world is going in both of these directions.

Speaker 1

如果我们退一步看，AI实验室和企业客户正在努力提升模型在长期代理推理、任务和函数调用方面的能力，而这一切都需要RL环境，其根本目标是朝着AGI和ASI迈进。

If we take a step back, what is the purpose of all of this work that the AI labs and enterprise customers are doing in trying to advance their model capabilities on long range agentic reasoning and tasks and function calling for which RL environments are needed, broadly we're trying to go towards AGI and ASI.

Speaker 1

而这里所需的模型智能，是能够在一种有时能访问带有函数调用的MCP服务器、有时却无法访问的环境中工作。

And the type of model intelligence that's required here is the ability to work in a world where sometimes you may have access to an MCP server with function calls, sometimes you won't.

Speaker 1

因此，一个能够跨企业与消费场景执行各种工具操作的通用代理，必须既具备像人类一样通过鼠标和键盘与屏幕交互的能力，也具备在API可用时进行API调用的能力。

So a generalized agent capable of taking actions in any tool across enterprise and consumer needs to have both the ability to use the screen and interact with tools the way a human does, with mouse and keyboard, as well as the ability to do API calls when those are available.

Speaker 1

因此，我们发现最大的AI实验室都在同时研究这两种类型的RL环境。

And so we find the largest AI labs out there working on both types of oral environments.

Speaker 1

其中一些人设想的未来是，80%的工具最终会通过某种主导协议（如MCP协议）提供API，而剩下的20%仍需要通过计算机操作来完成。

Some of them envision a world where 80% of tools will have some sort of an API via whatever protocol dominates over time, like an MCP protocol, and 20% will require computer use.

Speaker 1

另一些人则在这一光谱的另一端得出不同的结论，也许是相反的比例。

Some end up at a different spot on that spectrum, maybe the inverse.

Speaker 1

但我们看到，普遍而言，人们设想的未来是代理需要同时通过计算机操作和函数调用来运行。

But what we're seeing is across the board, people envision a world where agents will need to operate both through computer use and through function calling.

Speaker 1

因此，强化学习环境设计的需求也由此而来。

And so the needs in RL environment design are similarly following from that business requirement.

Speaker 0

那么，具体到今天，是什么因素使得这些环境得以构建呢？

What is it about today, more concretely, that has enabled the building of these environments, right?

Speaker 0

因为六个月或八个月前，我们甚至在构建这些环境吗？

Because were we even building these six, eight months ago?

Speaker 1

好问题。

Great question.

Speaker 1

而且如果

And if

Speaker 0

如果没有，今天发生了什么变化？

not, what changed today?

Speaker 1

是的，这与其说是关于构建这些环境的能力，不如说是其他原因。

Yeah, so it's less about the capability to build these environments.

Speaker 1

无论是图灵还是各个实验室，长期以来都具备构建强化学习环境的能力，因为其底层需要软件工程与协调不同类型人才的结合。

Both Turing as well as the labs themselves have long had the ability to build an RL environment because what it requires under the hood is a combination of software engineering and orchestrating different types of talent.

Speaker 1

有时你需要领域专家。

Sometimes you'll need a domain expert.

Speaker 1

例如，为像Salesforce这样的企业工具构建强化学习环境时，你需要工程师与Salesforce专家合作。

So for example, for enterprise tools like a Salesforce where you're building an RL environment for a Salesforce, you need engineers working with Salesforce experts.

Speaker 1

这在过去一直是可以实现的。

That has been possible forever.

Speaker 1

强化学习环境出现演变并引起更多关注的原因，更多在于当前最先进模型的能力和智能水平发生了变化。

The reason there's been an evolution and there's an increase in interest in RL environments has more to do with what type of model capabilities and intelligence the current state of the art is now working on.

Speaker 1

所以，如果你回顾六到十二个月前，甚至让我再往前追溯，谈谈最初GPT-3和GPT-3.5发布时的情况。

So if you look at six to twelve months ago, in fact, let me take an even further step back and talk about from the initial GPT-three, GPT-3.5 launch.

Speaker 1

在相当长的一段时间里，我认为是2023年到2024年，AI实验室主要专注于模型能力的提升，这可以被最好地描述为指令遵循和人类偏好对齐。

For the longest time, I'd say 2023, 2024, the AI labs were working on model capabilities in what could be best described as instruction following and human preference alignment.

Speaker 1

因此，无论是用于编程还是文本补全，我们当时所处理的任务都是通过监督微调，即提供大量静态示例，教导模型模仿这些静态示例中描述的人类行为轨迹，这确实是训练模型实现我们目标能力的绝佳方式。

So whether it's for coding or for text completions, we were working on tasks where supervised fine tuning, where you provide a large number of static demonstrations and you teach the model to mimic the human trajectory described in those static demonstrations, that was actually a great way to train the model for the capability that we were training on.

Speaker 1

同样，在对齐人类价值观和偏好时，使用RLHF（基于人类反馈的强化学习），由人类标注者或评估者在多个模型输出中进行选择，这类数据对于训练这些能力来说是足够且高效的。

And similarly, in the case of aligning with human values and human preferences, RLHF, where human annotators or evaluators choose between multiple model outputs, that kind of data was sufficient and actually efficient for training those kinds of capabilities.

Speaker 1

对于这类工作，使用强化学习会显得过度复杂。

RL for that kind of work would have been overkill.

Speaker 1

为了更清楚地说明这一点，我举个例子。

And just to drive this point home, maybe I'll use an example.

Speaker 1

如果你考虑图像分类任务，比如判断一张图片中是猫还是狗，你并不需要一个能与环境交互的强化学习环境来完成这个任务。

If you think about the task of image classification, identifying whether an image has a cat or a dog, one does not need an environment with the ability to interact with the environment in RL to really do that task.

Speaker 1

这个任务更适合使用监督微调，提供带标签的数据集和静态示例。

That task is better suited for supervised fine tuning, providing a labeled data set, static demonstrations.

Speaker 1

因此，没有人会为这类任务投资构建强化学习环境。

Therefore, nobody was investing in RL environments for those kinds of tasks.

Speaker 1

强化学习的存在时间远早于ChatGPT。

RL has been around for a lot longer than ChatGPT has been around.

Speaker 1

在ChatGPT出现之前，许多研究实验室就已经使用强化学习来训练早期的机器人模型或自动驾驶系统。

Pre ChatGPT, a lot of research labs were using RL for training the earlier generation of robotic models or autonomous vehicle systems.

Speaker 1

最近，我们见证了强化学习的复兴，因为现在我们所研究的模型能力，仅靠监督微调已无法高效完成我们所需的那种训练，以提升模型在代理型推理和任务方面的能力。

It's more recently that we have had a resurgence in RL because now the model capabilities we're working on, it would be intractable to continue to do SFT only to do the kind of training we need to do to improve the model's ability to do agentic type reasoning and tasks.

Speaker 1

举个例子，我们现在的工作，从实现通用人工智能或超级人工智能的目标倒推，是试图构建能够完成人类所有工作的模型，而且最好还能超越人类。

And so just to take an example of this, now what we're working on, again, working backwards from our goal of reaching AGI or ASI, is we're trying to build models that can do all the work that humans do, ideally better than humans.

Speaker 1

因此，这里的重要区别在于，我们不再试图模仿人类。

So the important differences there is we're no longer trying to mimic a human.

Speaker 1

我们正在尝试优化，试图找出最佳策略。

We're trying to optimize, we're trying to figure out the best strategy.

Speaker 1

完成一项复杂任务时，最佳策略未必是人类的精确做法，这是第一点。

It may not necessarily be the exact human strategy to complete a complex task, that's one.

Speaker 1

第二点是，我们现在致力于处理具有极宽动作空间的任务。

And two, are trying to now work on tasks that have a very wide action space.

Speaker 1

例如，在亚马逊平台上的一些任务。

So for example, something on Amazon.

Speaker 1

你可以用十种方式来实现，十只是一个随意的数字。

Are 10 ways in which you can mean, 10 is arbitrary.

Speaker 1

你可以用很多种方式来做这件事。

There many ways in which you can do that.

Speaker 1

甚至更复杂的事情。

Something even more complex.

Speaker 1

如果你想象一下企业中的客户经理或业务开发人员的角色，看看他们每天的工作流程，比如在Salesforce上更新潜在客户，然后切换到邮件发送邮件、添加日历邀请，这里的操作空间——也就是你可以调用的工具调用——有无数种组合、排列和顺序。如果你必须构建静态演示数据供模型通过模仿学习，那么收集这些人类演示数据所需的时间和成本将极其高昂，更何况即便如此，你可能仍无法获得最优方案。

If you imagine the role of an account executive or a business development employee in an org, and you look at the workflows that they do day to day, like updating a lead on Salesforce, and then moving to email and sending out an email, adding calendar invites, the action space here, literally the tool calls that you could call, there's so many combinations, permutations, sequencing of those tool calls that if you had to build static demonstrations that a model could learn from through mimicking, you would literally be spending so much time and money collecting that human data of static demonstrations given how wide this action space is, and even then you may not have the optimal approach.

Speaker 1

因此，我们现在正在研究的这类任务，强化学习能够使模型以更高效的方式发现最优策略，而无需依赖大量静态演示数据。

So we are now working on types of tasks where reinforcement learning allows us to it allows models to discover the optimal strategy in a more efficient manner without the massive static data set that you would require of demonstrations.

Speaker 1

因此，我们现在所研究的模型能力更适用于强化学习。

And so it's more the model capabilities we're working on now are more conducive to RL.

Speaker 1

这就是为什么我们开始构建强化学习环境。

That's why we are starting to build RL environments.

Speaker 1

鉴于模型已经通过监督微调和基于人类反馈的强化学习获得了相当高的智能水平，这一趋势很可能会持续下去，下一阶段将通过越来越复杂的强化学习环境和任务实现巨大突破。

And this is a trend that's likely to continue given the level of intelligence models have already been bootstrapped through SFT and RLHF, the next frontier will be achieving massive gains through increasingly complex RL environments with increasingly complex tasks.

Speaker 0

我想快速纠正一下我对您所说内容的理解：在这些强化学习环境中，我们并不是将模型的决策基于人类会怎么做。

I'd love to correct my understanding on something you said really quickly, which is that in these RL gyms we're not grounding the decision making of the model in what a human would do.

Speaker 0

构建这些训练环境时，必须包含某种人类因素。

There must be some sort of human aspect to the building of these gyms.

Speaker 0

这条界限在哪里？

Where is that line drawn?

Speaker 1

因此，与我们合作设计强化学习环境的AI实验室和企业，依赖于我们能够为各种类型的工作招聘到最优秀的人才。

So the AI labs and enterprises that work with us on RL environment design, they count on curing to be able to hire the best humans for various types of jobs.

Speaker 1

正如几分钟前提到的，这包括前端工程、后端工程，这些是构建用户界面、后端、API调用和工具调用所必需的。

As mentioned a few minutes back, that includes front end engineering, back end engineering, which is required to build the UI or the back end here, the API calls and tool calls.

Speaker 1

如果为构建强化学习环境所使用的工具需要领域专业知识，那么也需要人类参与。

Humans are also required for the domain expertise if the tool that you are building an RL environment for requires that.

Speaker 1

这些人类会展示完成特定任务的正确轨迹。

And those humans demonstrate what are the trajectories, the correct human trajectories for completing a certain task.

Speaker 1

再次以Salesforce这样的企业环境为例，人类可能会在Salesforce上演示如何更新潜在客户。

So again, using the example of an enterprise environment like Salesforce, a human might demonstrate on Salesforce what is the way someone updates a lead.

Speaker 1

这之所以有用，并不是因为模型会模仿这一轨迹，而是因为随后负责映射用户界面界面和组件、系统状态以及工具调用的软件工程团队，会利用这些人类轨迹来设计用户界面流程，从而使实现成为可能。

Now that's useful not because the model is going to mimic that trajectory, but because the software engineering team that's then going to map out the UI screens and UI components to be built and the system state to be built and the tool calls to be built will leverage that human trajectory to map out the UI flows and make that implementation process possible.

Speaker 1

我还要补充一点，我们合作并为其构建环境的实验室有时会向我们索取这些人类轨迹数据，因为在我们创建的环境中运行强化学习之前，甚至在运行强化学习之前，这些实验室通常会利用人类示范数据进行模型的初始训练，也就是通过传统的监督微调，使强化学习阶段更加高效。

I will add that sometimes our labs that we work with and build our own environments for will ask us for that human trajectory data because in addition to running reinforcement learning in the environments we create and in advance of running reinforcement learning in the environments we create, those labs will often bootstrap their models, in other words using traditional supervised fine tuning, on the human demonstrations so that the RL phase is more efficient.

Speaker 1

这解决了在模型从未见过的环境中，某些开放性任务常遇到的冷启动问题。

This is solving the cold start problem that sometimes exists in very open ended tasks in an environment that a model has never seen.

Speaker 1

例如，DeepSeek 的论文表明，在强化学习之前，通过少量的监督微调和静态示范来解决冷启动问题，能够取得更好的效果。

The DeepSeek paper, for example, demonstrated that a little bit of SFT to solve the cold start problem with static demonstrations prior to RL achieves better results.

Speaker 1

因此，我们仍然在收集人类轨迹数据，一方面用于工程流程，另一方面也为需要这些数据进行模型启动的实验室提供支持。

And so we do still collect human trajectory data, both for the engineering process and second, to provide that for labs that need it for bootstrapping purposes.

Speaker 1

所以，监督微调并不会消失。

So SFT does not go away.

Speaker 1

在我们所见的大多数情况下，实验室都会同时采用这两种方法。

In most cases that we have seen, labs are doing both.

Speaker 1

通常会有一个监督微调阶段，然后是强化学习阶段，在训练流程中，这些阶段甚至可以反复循环：监督微调 → 强化学习 → 再次监督微调 → 再次强化学习。

There is an SFT phase and then an RL phase, and you could have, in a training pipeline, you could have these stages just repeating, SFT, then RL, then back to SFT, then RL.

Speaker 1

因此，监督微调不会消失，但我们正努力尽可能将学习任务结构化为强化学习任务。

So SFT is not going away, but we are trying to structure as much of the learning task as we can as an RL task.

Speaker 1

这并不总是可行的，我们稍后可以进一步讨论，但在可能的情况下，强化学习能让模型学会这些复杂任务的最优策略，而如果仅靠纯监督微调来构建数据，将会耗费太多时间和成本。

It's not possible all the time, and we can talk more about that later, but where possible, RL allows models to learn the optimal policy for these complex type of tasks where doing pure SFT type data construction would just be too time consuming and too expensive.

Speaker 0

所以，如果我没理解错的话，问题在于人类在这个过程中何时介入。

So if I'm correct, then there's the question of when the human is involved in this process.

Speaker 0

这其中有两个方面。

There's two parts to it.

Speaker 0

一个是实际的人类参与，无论是人才还是搭建训练环境。

One is the actual physical human involvement, whether that's talent or actually building out the gyms.

Speaker 0

另一个则是将人类决策融入到环境中，换句话说，就是人类轨迹数据。

But then the other is infusing this environment with human decision making, or in other words, the human trajectory data.

Speaker 0

我们允许用户选择是否包含这些数据，但通常监督微调仍然是流程中的重要部分，因此我们通常不会看到它消失。

And we allow users to make the choice of whether or not to include that, but typically SFT remains a large part of the pipeline, so we don't typically see that go away.

Speaker 0

是这样吗

Is that

Speaker 1

你知道，我不知道如何量化‘流程中的重要部分’。

You know, I don't know how to quantify large part of the pipeline.

Speaker 1

这个问题要问我们的合作伙伴，我们通常拿不到这些数据，对吧？

That's a question for our partners and we usually don't get that data, right?

Speaker 1

例如，当OpenAI发布像O3这样的模型时，他们在博客文章中宣布，RL的计算资源占比相比之前的模型显著增加。

So for example, when OpenAI releases a model like O3 and in their blog post they announced that RL compute was significantly a greater percent of overall compute spend versus prior models.

Speaker 1

我们实际上并不知道他们的计算资源中有多少用于SFT，多少用于RL。

We don't actually know how much of their compute spend was on SFT versus RL.

Speaker 1

我们只知道，越来越多的资源正在流向RL。

We have the signal that more and more of it has been going towards RL.

Speaker 1

而且，这取决于任务的类型。

And again, it depends on the types of tasks.

Speaker 1

有些任务根本不需要RL环境。

There are some tasks where you do not need RL environments.

Speaker 1

甚至现在，前沿正转向多模态，我们正处于音频、视频、机器人演示和用于机器人学的VLA模型等新模态的预训练阶段，仅收集静态数据集就需要大量工作。

And even now, the frontier moves to multi modality and we're at the pre training stage of new modalities like audio, video, robotic demonstrations, and the VLA models used for robotics, there is a lot of work to be done there in just collecting static data sets.

Speaker 1

因此，无论是在预训练阶段还是微调阶段。

So both at pre training stage and fine tuning stage.

Speaker 1

所以我不想说RL是所有事情的未来。

So I don't want to make the point that RL is the future for everything.

Speaker 1

有些模型能力目前非常适合使用RL，而有些模型能力我们仍然在使用SFT，仍然在进行RLHF。

There are model capabilities for which RL is ideal today, And there are model capabilities where we still do SFT, we're still doing RLHF.

Speaker 1

在这些前沿领域，我们可能会达到一个阶段：一旦基础模型和微调模型的能力足够好，我们就开始尝试用更困难的任务来训练它，而在这些新模态中，RL又会成为获得进一步提升的首选方法。

And in those frontiers, we might get to a point where once the base model's capability and the fine tuned model's capability is good enough that we start trying to train it on more and more difficult tasks, where then once again in those new modalities, RL becomes the preferred approach to getting further gains.

Speaker 0

那我们来深入探讨一下。

Let's double click on that then.

Speaker 0

那么，目前我们可以为哪些领域构建RL训练环境？它们支持哪些模态？

So what can we build RL gyms for currently and what modalities do they support?

Speaker 1

是的，我认为你还没问过我一个有助于澄清的问题，那就是：Jim，RL到底包含哪些内容？

Yeah, I think one question you haven't asked me, which would be helpful to clarify before we go down that route, is what all is part of an RL, Jim?

Speaker 1

这与你刚才问我‘我们不能在哪些地方构建它’的问题相关。

And this will be related to the question you just asked me on where can we not build it for?

Speaker 1

所以当Turing为我们的合作伙伴开发RL环境时，无论是用于UI代理、计算机使用等，还是用于通用函数调用，通常一个RL环境包会包含环境本身（系统状态）、模型可以调用以修改该系统状态的工具调用，以及一个可选的用户界面。

So when Turing works on RL environments for our partners, both for UI agents, computer use, etcetera, as well as for general function calling purposes, what typically goes into an RL environment package for our partners is the environment itself, which is a system state, tool calls that the model will be able to invoke that modify that system state, and an optional UI.

Speaker 1

让我们把刚才提到的这三样东西称为环境。

Let's call these three things I mentioned the environment.

Speaker 1

然后通常还有提示。

Then there are usually prompts.

Speaker 1

这些是模型或代理在该环境中需要执行的任务或工作流。

These are the tasks or workflows that the model or agent would have to execute in that environment.

Speaker 1

我们通常会将这些提示或任务设计成一个从易到难的课程。

And we usually structure these prompts or tasks in a curriculum that goes from easy to hard.

Speaker 1

我们会与合作伙伴反复迭代，以确定环境中的任务难度，因为太简单或太难的任务对训练都没有帮助。

And we iterate with our partners to get the right difficulty of tasks for the environment because tasks that are too easy and tasks that are too hard are not useful for the training purpose.

Speaker 1

因此，提示设计背后需要大量的思考。

So there's a lot of thought that goes behind the prompt design.

Speaker 1

同时，我们还需要识别该环境中的可能工作流，并生成数百种这些工作流的变体，以确保提示有足够的多样性。

And also how we identify the possible workflows in that environment and generate hundreds of variations of those workflows so that there's enough diversity of prompts.

Speaker 1

因此，提示是构成RL环境的第二部分。

So prompts is the second piece of what goes into an RL environment.

Speaker 1

第三部分，也是非常重要的一部分，这将与我给你的答案相关联，第三部分是奖励函数、评分器或验证器。

The third and very important piece, and this is what will be linked to the answer that I give you, the third piece is a reward function, a grader or a verifier.

Speaker 1

它有不同的名称，但强化学习的基本理念是，你不需要逐个标记地提供轨迹，而是只需对模型完成任务的尝试（即模型的执行结果）进行评分，通常基于最终结果。

This goes by different names, but the basic idea behind RL is rather than having to provide token by token trajectories, you are simply grading a model's rollout, a model's attempt at completing a task, usually based on the outcome.

Speaker 1

我说‘通常’，是因为在某些情况下，你可能不仅想根据最终结果来评分，还想评估模型所遵循的过程。

I'm saying usually because there are different types of situations where you might want to grade the model not just on the final outcome, but the process that the model followed.

Speaker 1

这是可以自定义的，但关键是强化学习需要奖励机制。

This is customizable, but the point is RL requires a reward.

Speaker 1

强化学习得以运作的原因是，你允许模型或智能体在环境中自由地探索。

What allows RL to work is you allow a model or an agent to basically play in an environment freely.

Speaker 1

当模型或智能体完成任务后，你只需对系统的当前状态进行评分，并可选地根据所遵循的过程给予奖励，比如判断这是正确的还是错误的。

And when the model or the agent is done, you simply grade that state of the system, and optionally, things like the process that was followed to give a reward, like is this correct or is this incorrect.

Speaker 1

我们在构建环境时承担的软件工程任务，不仅包括创建用户界面、系统状态和工具调用，还包括编写验证器——这可能是运行逻辑检查的代码，可能是代表系统预期最终状态的JSON对象，也可能是作为裁判的大型语言模型。

Part of the software engineering task that we take on when we work on environments is not just creating the UI or the system state and the tool calls, but also writing the verifiers, could be it could be a piece of code that runs through some logical checks, it could be a JSON object that represents the correct final state of the system that is expected, it could be an LLM as a judge.

Speaker 1

那么，强化学习环境的最后一步，就是某种方式将它打包起来。

Now, then there's the final piece of an RL environment is some way in which we package it.

Speaker 1

例如，这可以是一个Docker容器，以便我们的合作伙伴——即我们为他们构建这些环境的实验室——能够轻松调用并在我们的环境中运行他们的模型和代理。

This could be a Docker container, for example, to enable our partners, the labs that we're building these environments for, to very easily be able invoke, to basically run their models and their agents in our environments.

Speaker 1

因此，为了节省时间，我将简要总结一下不适合使用RL的场景。

So in the interest of time, I'm just going to summarize the areas where you wouldn't use RL.

Speaker 1

首要的一点是，当你没有明确的奖励信号时。

The first and foremost is where you do not have a clear reward signal.

Speaker 1

对于这类情况，仍然建议依赖传统的微调方法。

For places like that, it still makes sense to rely on traditional fine tuning approaches.

Speaker 1

SFT和RLHF仍然是更优选的方案。

SFT and RLHF, those are still preferred options.

Speaker 1

此外，还有许多其他长期存在的原因，我不打算深入探讨，但另一个不使用RL的例子是：使用静态演示更便宜、更快、效果更好。

There's also a long tail of other reasons, which I don't want to get into, but just an example of another reason why you wouldn't use RL is because it's cheaper, faster, better to use static demonstrations.

Speaker 1

RL并不便宜。

RL is not cheap.

Speaker 1

RL的计算成本相当高昂。

There's a significant cost from RL compute.

Speaker 1

当模型在环境中探索时，这并不便宜。

When the model is playing around in an environment, that's not cheap.

Speaker 1

你在上面花费了大量的计算成本。

You're spending a lot of compute dollars on that.

Speaker 1

实际上，仅仅使用一些人类演示数据进行SFT可能更具成本效益。

It might actually be more cost efficient to just have a few human demonstrations that you run SFT on.

Speaker 1

因此，在这种情况下使用RL是愚蠢的。

So it would be silly to use RL there.

Speaker 0

对于那些我们确实使用RL并决定构建RL训练环境的用例，我想以一个计算机使用代理的训练环境为例。

For the use cases where we do use RL and we actually choose to go ahead with building an RL gym, and I want to take a specific example of a gym for a computer use agent.

Speaker 0

你构建的RL训练环境需要多精确才能有效？

How accurate does the RL gym you build need to be to be effective?

Speaker 0

假设你正在做一个类似DoorDash的平台或模拟的DoorDash环境。

Let's take something like say you're making a DoorDash clone or a simulated DoorDash environment.

Speaker 0

假设DoorDash有一些只有实际平台才知道如何运作的特定优惠券，我们是否需要在模拟环境中复现这些功能，以便代理能够复制其行为？

Let's say DoorDash has some really specific coupons that only the actual platform knows how they function, do we need to have that usability in this simulated environment for the agent to be able to replicate its behavior?

Speaker 0

或者，你如何确定仿真环境的准确度达到什么程度时，才能让代理在此环境中被训练到合理且可信的水平？

Or how do you determine at what point the gym is accurate enough for this agent to be trained on to a reasonable and trustworthy level?

Speaker 1

当然。

Sure.

Speaker 1

当然。

Sure.

Speaker 1

我想你举了DoorDash的例子，提到了优惠券，那我就以此为例。

I think you used the example of DoorDash and you said coupons, so I'm just going to go with that.

Speaker 1

简短的回答是：如果你想通过强化学习让模型学会在结账时使用优惠券的概念，那么这个概念就必须成为该强化学习环境中用户界面流程的一部分。

So the short answer is, if you wanted the model through RL to learn about the concept of applying a coupon at checkout, it would need to be part of the UI flow in that particular RL environment.

Speaker 1

优惠券需要和实际DoorDash上的优惠券一模一样吗？

Would the coupon need to look exactly like the coupon on the actual DoorDash?

Speaker 1

不需要，因为你训练的是优惠券存在的普遍原理，以及在这种情况下需要执行哪些UI操作来应用优惠券。

No, because what you're trying to train is the general principle of the coupon being present and what actions are required, what UI actions are required in this case to be able to apply a coupon.

Speaker 1

现在，向我们寻求UI环境的合作伙伴通常要求接近像素级的精确度。

Now, our partners who ask us for UI environments ask us for near pixel perfection.

Speaker 1

但这一原则的背后是，他们并不是只想训练一个仅适用于DoorDash的模型。

But the principle behind this is they're not trying to train a model just for DoorDash.

Speaker 1

DoorDash恰好是一个非常流行的餐饮订购和配送应用，但市场上还有其他应用。

DoorDash happens to be a very popular food ordering and food delivery app, but there are other apps out there.

Speaker 1

我认为，比起仅仅针对DoorDash过拟合，更重要的目标是学习餐饮订购的通用原理，以及如何在不同的餐饮订购应用中通过UI操作来下单。

And I think the even more important goal than just overfitting on DoorDash would be to learn about the general principle of food ordering and how you use UI actions to order food across different food ordering apps.

Speaker 1

因此，我们构建了像素级精确的复制品，但最终目标是这些实验室并不希望对DoorDash过拟合。

So we build Pixel Perfect clones, but the end goal would be those labs don't want to overfit on DoorDash.

Speaker 1

他们理想情况下也希望拥有其他类型餐饮订购工具的复制品。

They would ideally want clones of other types of food ordering tools as well.

Speaker 1

我想强调的是，最重要的是工作流覆盖、系统设计与结构，以及包含模仿真实功能集的特性。

And what I'm trying to get at here is I think what's most important is workflow coverage, the system design and structure, and the feature inclusion to mimic the actual feature set.

Speaker 1

因此，确保优惠券流程的存在，比优惠券在UI上是否与DoorDash的优惠券完全一致更为重要。

So having the coupon flow present is higher importance than the coupons from a UI perspective looking exactly the way that the DoorDash coupons do.

Speaker 1

这些RL训练环境所需的精确度水平因合作伙伴而异，取决于他们具体的训练目标。

The level of perfection required for these RL gyms to be good enough does vary partner to partner, depending on their specific training goal.

Speaker 1

所以我不想划一条不可逾越的界限，我是从第一性原理出发，认为对所有合作伙伴来说，重要的是UI完整性和数据完整性。

So I don't want to draw a line in the sand, I'm using first principles thinking to say that what would be important across all partners is that you have UI completeness and data completeness.

Speaker 1

我再重新定义一下这两个概念。

And I'll just define those again.

Speaker 1

UI完整性意味着，对于你正在训练的流程，所有主要的用户界面流程都得到了体现。

So UI completeness means for the workflows that you are training, all the major UI flows are represented.

Speaker 1

如果你选择从UI中排除某些内容，以DoorDash为例，假设你决定排除配送员的动画，因为这需要复杂的WebSocket交互，以及在训练环境中打包和使用动画库的成本很高。

And if there is something you choose to exclude from the UI, so using the DoorDash example, let's say you choose to exclude the animation of the delivery agent, which may be complex to develop in terms of the web socket interactivity it requires and the animation libraries you end up packaging it and using in the gym.

Speaker 1

你可能会选择排除那个显示配送员接近你的动态地图，因为开发成本较高，但其带来的收益可能并不高，因为这些工作流程并不一定依赖这个交互式动态地图进行任何推理或执行。

Maybe you choose to exclude that animated map of the delivery agent coming to you because it's a high cost of development, but the benefit may not be as high because the workflows don't necessarily use that interactive animated map for any sort of reasoning or execution.

Speaker 1

在这种情况下，你可能会选择排除这个地图。

In that case, you might choose to exclude that map.

Speaker 1

但UI完整性意味着，如果你选择排除某个特定的UI组件，你也必须同时排除所有通往该组件或从该组件引出的UI控件和小部件，以避免出现死链或无用按钮。

But UI completeness means if you choose to exclude a particular UI component, you also exclude the UI widgets and controls that lead to it or that follow from it so that you don't have any dead links and dead buttons.

Speaker 1

你不想看到的是一个UI，因为这个UI将作为模型的训练信号使用。

What you don't want is a UI that so the UI will be used as it has the training signal for the model.

Speaker 1

你不希望模型从界面中得出点击这个按钮没有任何作用的结论。

And you don't want the model to take away from the UI that, hey, clicking this button does nothing.

Speaker 1

因此，界面完整性意味着要包含这些工作流所需的各种界面流程，并且是完全交互式的。

So UI completeness means have UI that captures the different flows, the UI flows required for those workflows, and is interactive, fully interactive.

Speaker 1

也就是说，界面中每个链接、每个按钮都应像真实应用对真实用户的鼠标点击或悬停操作那样做出响应。

As in every link that's there, every button that's there, it's responsive in the same way that the actual app would be responsive to a real user's mouse clicks or mouse hovers, etcetera.

Speaker 1

这种交互性很重要。

That interactivity is important.

Speaker 1

如果你选择排除某些内容，也要一并排除那些指向它们或由它们引出的内容。

And if you choose to exclude certain things, you exclude the things that lead to them or not.

Speaker 1

这就是界面交互性。

That's UI interactivity.

Speaker 1

数据完整性意味着你需要用足够多样且深入的系统状态来填充界面，以确保在各种工作流中，系统能像真实应用对人类用户那样做出响应。

Data completeness means you need to fill that UI with a system state that is diverse enough and deep enough such that across workflows, you will have the kind of responses that the actual app would give to a human.

Speaker 1

以DoorDash为例，假设工作流是食物搜索，比如餐厅搜索。

So just to give an example of that on DoorDash, let's say the workflow is food search, like restaurant search.

Speaker 1

假设工作流程是搜索一家日本餐厅，并为两个人点一份拉面晚餐。

Let's say the workflow is search for a Japanese restaurant and order a ramen dinner for two people.

Speaker 1

这是一个中等复杂度工作流程的例子。

That's an example of a medium complexity workflow.

Speaker 1

你需要确保系统状态中包含足够多的日本餐厅，还有亚洲餐厅以及其他类型的餐厅，因为搜索查询可能涵盖这个广泛的范围，这样你就不会遇到空的搜索结果页面。

You need to have your system state have enough Japanese restaurants, right, as well as Asian restaurants, as well as other types of restaurants, knowing that the search queries might cover this broad spectrum so that you're not ending up with search results pages that are empty.

Speaker 1

对吧？

Right?

Speaker 1

因为真正的工具不会给你一个空页面。

Because the actual tool would not give you an empty page.

Speaker 1

那会对模型发出错误的信号。

That would be the wrong signal for the model.

Speaker 1

这就是数据完整性的含义。

So that's what data completeness means.

Speaker 1

所以对于我们的界面来说，再次回答你的主要问题，我认为你问的是：如何判断一个用于计算机使用代理的RLGIM是否足够好？

So for our UI, so again answering your main question, which I think was how do you know an RLGIM for a computer use agent is good enough?

Speaker 1

你需要界面的完整性以及数据的完整性。

You need UI completeness and data completeness.

Speaker 1

像素完美程度因合作伙伴而异。

The level of pixel perfection varies from partner to partner.

Speaker 0

所以我们正在讨论为特定的使用场景或任务构建训练环境。

So we're chatting about building gyms for specific use cases or tasks.

Speaker 0

我们离为整个世界或沉浸式体验构建训练环境还有多近呢？

How close are we to building gyms for something like an entire world or an immersive experience, right?

Speaker 0

在我们之前的一期播客中，乔纳森和我讨论过可能建立一家类似训练环境公司的事情，我知道你我也聊过这个话题。

In one of our previous podcast episodes, Jonathan and I chatted about potentially building something like a gym company, and I know you and I have talked about that as well.

Speaker 0

那么这会是什么样子呢？

So what does that look like?

Speaker 1

说实话，阿什尼，我也有同样的问题。

I have the same question, honestly, Ashnee.

Speaker 1

我希望我知道我们离目标还有多近。

I wish I knew how close we were.

Speaker 1

许多这类工作都在我们实验室合作伙伴的封闭环境中进行。

A lot of that work is being done behind closed doors of our lab partners.

Speaker 1

不可避免的是，我们正走向一个强化学习环境变得越来越复杂、越来越接近现实世界模拟的时代。

And what inevitable is that we are headed into a world where the RL environment becomes more and more complex and a closer and closer simulation of our reality.

Speaker 1

而奖励函数最终会变成更高层次的东西。

And the reward function ends up being something much higher level.

Speaker 1

以企业环境为例，如今我们的合作伙伴正在向我们索取针对Salesforce的独立强化学习环境、针对HubSpot、ZoomInfo、Calendly、Gmail等的独立环境。

So just in the example of the enterprise environment, today our partners are asking us for individual RL environments for Salesforce, individual environments for HubSpot, for ZoomInfo, for Calendly, for Gmail, etcetera.

Speaker 1

但如果你再往前想一步，如果最终目标是自动化工作，那么解决这个问题的另一种方式是构建一个企业的模拟器，其中包含组织架构、角色，以及销售人员在不同工具间执行这些工作流程，而这里的奖励函数可以是，比如在销售场景中，漏斗顶部的增长，即潜在客户开发。

But if you imagine one level out, if the end goal is to automate work, then another way to approach this problem is to have a simulator of a business in which there is an org chart, there are roles, there are salespeople that do these kinds of workflows across tools, and the reward function there is, it could be something like, in the case of sales, top like of funnel growth, like lead gen.

Speaker 1

你今天开发了多少个潜在客户？

How many leads did you generate today?

Speaker 1

比这更高层次的是收入。

Even higher level than that would be revenue.

Speaker 1

再高一个层次的就是公司的股价。

Even higher level than that would be the company's stock price.

Speaker 1

你可以进一步推演，想象一个世界健身房，或者一个模拟一切的宇宙健身房。

And you could extrapolate further and think about a world gym, or like a universe gym that effectively simulates everything.

Speaker 1

这些是非常有趣的话题值得思考。

Those are really fun topics to think about.

Speaker 1

我有种感觉，这是不可避免的，因为要让模型在智能上实现巨大飞跃，我们需要一个像那样大规模、互联网级别的东西。

And I have the feeling that that's inevitable because we will need, for models to truly have a massive jump in intelligence, we will need something large scale and internet scale like that.

Speaker 1

就像我们在预训练阶段拥有互联网数据集一样，让我们取得了突破。

The same way we had in pre training, the internet data dumps that allowed us to get that.

Speaker 1

为了在强化学习和探索领域实现同样的智能飞跃，我们也需要如此庞大的东西。

We'll need something massive like that to have the same sort of intelligence jumps, even in the world of RL and exploration.

Speaker 1

我不知道我们在通往这一目标的旅程中走到了哪一步。

I don't know where we are on the journey to that.

Speaker 1

目前对图灵的请求还不是：‘帮我构建一个公司模拟器。’

The ask to Turing is not yet, hey, build me a company simulator.

Speaker 1

我可以想象有一些公司正在筹集数亿美元，虽然我不会点名，但我们已经与其中一些公司有过互动，这些公司正在筹集大量资金，以构建这类面向商业环境的强化学习系统。

I can imagine there are companies that are raising hundreds of millions of dollars and I won't name them, but we have interacted with some of them, there are companies that are raising a lot of money to build these kinds of RL for business type environments.

Speaker 1

我

Speaker 0

我认为一个非常有趣的考虑是，也许有一天我们可以为人类建造一个训练场，对吧？

think a really interesting consideration is maybe if we could one day build a gym for a human, right?

Speaker 0

如果我们想构建个性化代理的话。

If we wanted to build personalization agents.

Speaker 0

但这就引出了一个问题：你如何为人类创建一个验证器？

But then that leads to the question of how do you create a verifier for a human?

Speaker 0

我认为这是一个完全不同的复杂思考活动。

And I think that's a whole other complex thinking activity.

Speaker 1

我觉得这非常有趣，对吧？

I think it's super interesting, right?

Speaker 1

因为如果你想想你自己的人生，你可能在生命的不同时期经历过许多不同的奖励函数。

Because if you think about your own human life, you've probably had many different reward functions in different phases of life.

Speaker 1

也许在童年早期，它是父母的愤怒或父母的快乐。

Maybe in early childhood it is the ire of your parents or the happiness of your parents.

Speaker 1

当你还小的时候，你努力让他们开心。

You're trying to make them happy when you're really young.

Speaker 1

然后你不再关心让父母开心，而是开始关注其他奖励信号，比如你喜欢的异性给你的关注，诸如此类的事情。

Then you stop caring about making your parents happy and then you pick your other reward signals like attention you're getting from a girl or guy you like, things like that.

Speaker 1

而这些奖励信号往往会随着时间演变。

And then those reward signals tend to evolve.

Speaker 1

因此，在童年早期，确实会发生大量的监督微调，我们的父母为我们提供了这些静态示范和标签，但随后我们的人生中往往会采用不同的奖励函数。

So we do absolutely, in early childhood, there's a fair bit of supervised fine tuning that happens as our parents are providing us these static demonstrations and these labels, but then we do tend to take on different reward functions in our life.

Speaker 1

我认为这正是我们学会各种事物的原因。

And I do think that that is responsible for us learning all sorts of things.

Speaker 1

甚至我认为，品味的形成，也是在我们接收各种多模态感官输入和不同奖励信号的过程中逐渐发生的。

And even, I would argue that even the development of taste, happens at some point through all of the multimodal sensory input we're taking and all the different reward signals we're getting.

Speaker 1

我们的品味往往会不断演变。

We tend to evolve taste.

Speaker 1

我确实认为你所说的，是我们构建越来越智能的模型时不可或缺的重要因素。

And I do see what you said as an important ingredient for what's going to happen as we build smarter and smarter models.

Speaker 1

无论你是建立人类健身房、商业健身房还是世界健身房，我不确定这个盒子的边界究竟该在哪里。

Whether you're building a human gym or you're building a business gym or a world gym, I don't know exactly where to draw the bounds of that box.

Speaker 1

但没错，我们正朝着环境越来越复杂、工作流程越来越复杂、验证器越来越高级的方向发展。

But yeah, we are headed in the direction where the environments become increasingly complex and the workflows become complex, the verifiers become higher and higher level.

Speaker 0

我认为这是结束这一集的一个非常棒的方式。

I think that is a really fantastic way to end this episode.

Speaker 0

非常感谢你和我进行这次对话，安舒尔。

Thank you so much for this conversation, Anshul.

Speaker 0

我觉得我们聊到了构建强化学习健身房的诸多具体复杂性，比如计算机使用代理健身房和函数调用健身房之间的差异，以及我们目前能用它们做什么、为什么我们会走到这一步、它们如何推动通用人工智能和超级人工智能的发展，当然还有更宏大的问题：这些健身房能复杂到什么程度？它们能否模拟整个世界或宇宙？

I think we chatted about the really specific complexities of building RL gyms from the differences between computer use agent gyms and function calling gyms as well as what can we build them for currently, why we've gotten to this point, how it's powering AGI and ASI, and then of course the bigger questions about how complex can these gyms get and could they mimic entire worlds or universes?

Speaker 1

是的。

Yeah.

Speaker 1

谢谢你，阿什温尼。

Thanks, Ashwini.

Speaker 1

这很有趣。

It was fun.

Speaker 1

嗯。

Yeah.