本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
衷心感谢 Blitzy 对本播客的支持以及本集的赞助。
A big thanks to Blitzy for supporting the podcast and sponsoring this episode.
想将软件开发速度提升五倍吗?
Want to accelerate software development velocity by five x?
你需要 Blitzy,它能将自主软件开发引入你的企业代码库。
You need Blitzy, which brings autonomous software development to your enterprise code base.
你的工程师只需声明意图,Blitzy 的代理就会分析你的代码库并生成代理执行计划。
Your engineers declare intent and Blitzy agents map your code base and generate an agent action plan.
获得批准后,Blitzy 会开始工作,自主生成数十万行经过端到端测试的验证代码。
Once approved, Blitzy gets to work, autonomously generating hundreds of thousands of lines of validated end to end tested code.
单次运行即可完成超过 80% 的工作量。
More than 80% of the work completed in a single run.
Blitzy 不仅是在生成代码,更是在以计算速度开发软件。
Blitzy is not just generating code, it's developing software at the speed of compute.
立即前往 blitzy.com/twiml 亲身体验 Blitzy。
Experience Blitzy firsthand at blitzy.com/twiml.
那就是 blitzy.com/twiml。
That's blitzy.com/twiml.
我们采取的方法是动态招募多个智能体群,并将数据库作为编排层的一部分。
The approach that we took has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer.
你可以招募数以万计的智能体,而无需担心一个负责追踪所有活动的单一编排器。
And you can recruit tens of thousands of agents, but not have to worry about this single orchestrator that's keeping track of everything that's happening.
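The database-as-orchestration-layer idea can be sketched in a few lines. Below is a minimal, hypothetical Python sketch using SQLite (the table layout and function names are invented for illustration, not Blitzy's actual design): agents claim tasks with an atomic status update on a shared table, so no single orchestrator process has to track every agent.

```python
import sqlite3

# Hypothetical sketch: a shared table serves as the orchestration layer.
# Thousands of agents can claim work by atomically flipping a row's
# status, so no single orchestrator has to track every agent.
def make_queue(specs):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, "
               "spec TEXT, status TEXT DEFAULT 'pending', owner TEXT)")
    db.executemany("INSERT INTO tasks (spec) VALUES (?)", [(s,) for s in specs])
    return db

def claim_task(db, agent_id):
    # Pick a pending task, then claim it only if it is still pending;
    # the status check in the UPDATE keeps the claim safe when many
    # agents race for the same row.
    while True:
        row = db.execute("SELECT id, spec FROM tasks "
                         "WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            return None  # no work left
        claimed = db.execute("UPDATE tasks SET status = 'claimed', owner = ? "
                             "WHERE id = ? AND status = 'pending'",
                             (agent_id, row[0])).rowcount
        if claimed:
            return row
```

Any number of workers can call `claim_task` against the same database; the `AND status = 'pending'` guard prevents a double claim.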
我们已经成功应用了这种方法,经常编写数十万行、数百万行代码。
We've been able to apply that successfully and we frequently write hundreds of thousands of lines, millions of lines of code.
所有代码都能成功编译。
Everything compiles.
所有代码都能正常运行。
Everything runs.
所有测试都通过了。
All tests pass.
用户界面也能正常工作。
The UI works.
它完美无瑕。
It's pixel perfect.
我们真的已经完善了这一点。
And so we've perfected that, really.
好了,各位。
Alright, everyone.
欢迎来到又一期的twiml.ai播客。
Welcome to another episode of the twiml.ai podcast.
我是您的主持人萨姆·查林顿。
I am your host, Sam Charrington.
今天,我邀请到了西达汉特·帕尔德希。
Today, I'm joined by Siddhant Pardeshi.
西达汉特是Blitzy的联合创始人兼首席技术官。
Siddhant is co founder and CTO of Blitzy.
在开始之前,请记得在您收听本节目的平台点击订阅按钮。
Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show.
欢迎来到播客,西德。
Welcome to the podcast, Sid.
谢谢,萨姆。
Thanks, Sam.
很高兴能来这里。
Glad to be here.
我一直是你们的听众。
I'm a longtime listener.
我从2019年就开始收听了。
I've been listening since 2019.
这太棒了。
That's amazing.
听到这些真的让我非常高兴。
And it is so great to hear.
我很期待见到你,也非常期待深入了解你在Blitzy的经历,你们正在从事自主开发的工作。
I am excited to meet you and, really looking forward to digging into your experiences at Blitzy, where you're working on, autonomous development.
那我们直接切入正题吧,先聊聊你的背景。
So let's, let's dig right in, but start by talking a little bit about your background.
你在创办Blitzy之前是在英伟达工作吗?
You were at Nvidia before you started Blitzy?
是的,我从2016年1月就开始在英伟达工作了。
Yeah, I was at Nvidia since January 2016, and back then...
我入职那天,英伟达的股价市值是320亿美元。
The day I joined, Nvidia's stock was worth $32 billion.
那就是英伟达当时的市值。
That was Nvidia's market cap.
320亿美元。
$32 billion.
我觉得今天Anthropic的营收已经超过这个数字了。
And I think Anthropic's revenue today is more than that.
那是一段很特别的经历,你知道,在那个时候的英伟达,公司结构很完善。
It was quite an experience, you know, being at Nvidia at that time. And Nvidia was structured...
我不知道他们现在是否还是这样,但在我从2016年到2022年任职期间,它运作得非常像一家初创公司。
I don't know if they still are, but it functioned very much like a startup for the entire time that I was there, from 2016 to 2022.
当‘注意力机制就是你所需’这篇论文发布时,我就在现场,当时我正在为英伟达在生成式AI领域发明一些东西。
And, you know, when the "Attention Is All You Need" paper dropped, I was right there. I was inventing things for Nvidia in the generative AI space.
我深入研究过GANs(生成对抗网络)和各种自编码器,也接触过自然语言处理。
I was deep into GANs or generative adversarial networks and various auto encoders and I was brushing with NLP.
那还是相当早期的阶段,当时你有BERT,我们用BERT来做翻译之类的事情。
It was still quite early, you know. You had BERT, and we were using BERT for things like translation.
但Transformer是一种突破性的技术。
But the transformer was groundbreaking tech.
当我意识到它潜在的巨大可能性,同时又获得了去哈佛商学院攻读MBA与MSI联合硕士项目的机会时,我选择了后者。
And, eventually when I realized the potential of what it could do and simultaneously had an opportunity to go to HBS, to do a joint master's program in an MBA and an MSI, I chose that.
我在哈佛商学院认识了我的联合创始人兼CEO布莱恩,我们基于‘AI终将赶上人类’这一理念创立了Blitzy。
And I met Brian at HBS, my co founder and CEO, and we decided to form Blitzy based on the idea that AI will catch up eventually with humans.
我们做出这个决定时,上下文窗口还只有大约10,000个token,模型连像样的代码都很难写出来。
And you know, we made this bet back when the context window was about 10,000 tokens and it could barely write usable code, right?
但我们赌的是,AI在编写代码方面将与人类一样优秀,甚至更胜一筹。
But we made this bet that AI is gonna be as good if not better than humans at writing code.
软件开发中将有一部分不再仅仅是代码生成,而是整个软件工程都将被自主开发完全自动化。
And there'll be a section of software development that's not just about code generation but entire software engineering that will get completely automated by autonomous development.
这正是Blitzy的核心所在。
And that's what Blitzy is all about.
确实,如今AI影响最大的领域之一就是软件开发。
It certainly is true that one of the areas where AI is having the most impact today is in software development.
当你思考软件开发时,你有没有一种方式来对这个领域和机会进行分类?
When you think about software development, do you have a way that you taxonomize the space and the opportunity?
我认为软件开发是应用AI的最佳机会,原因在于软件是可以验证的。
So I think software development is the best opportunity space to apply AI. And the reason for that is because software is verifiable.
它是可编译的,也是可测试的。
It's compilable, it's testable.
你可以可视化它,并且存在一个明确的正确答案。
You can visualize it and there is the concept of a correct answer.
可能有多种正确答案,但确实存在正确和错误的答案,这一点在其他领域并不总是如此。
There could be many correct answers, but there are correct answers and wrong answers, which is not always the case in other domains.
对吧?
Right?
所以非常重要的是,要意识到这一点。
So it's super important to realize that.
然后,如果你想想这个领域本身,我想我们所有人都是从AI辅助开发开始的,对吧?
And then if you if you think about the space itself, I think we all got started with AI assisted development, right?
你曾经有代码助手,现在你有了命令行界面和集成开发环境,以及内置AI辅助的工具。
You had copilots; today you have CLIs and IDE tools with embedded AI assistance.
它们都具备异步执行任务的能力。
And they all have the ability to, for example, do, tasks asynchronously.
比如,你可以给它一个任务,即便是 AI 也可能要花上几个小时才能完成,它会思考一会儿,异步运行,然后向你提出后续问题等等。
Like, for example, you can give it a job that will take even AI maybe hours to complete, and it will think for some time, go off asynchronously, ask you follow-up questions and whatnot.
而这个领域还有另一个部分是关于自主开发的。
And then you have another part of the space which is about autonomous development.
这一类别中有许多工具。
There are tools in this category.
我相信,Cognition公司的Devin就属于这一类别。
There's, I believe, Devin from cognition that falls into that category.
我们就在这个领域开展业务。
We operate in that category.
这里的理念是:你点击构建,出来的就是一个 PR。
And the idea here is that you hit build, and out comes a PR.
对吧?
Right?
生成的PR已经经过测试和验证,所有功能都能正常运行,完全符合你的预期。
The PR that comes out is, already tested, validated, everything works and it's exactly how you intended it to be.
对吧?
Right?
根本不会出错。
There are no errors.
代码是可以接受的。
The code is acceptable.
对吧?
Right?
所以,无论在哪一边,我们面临的最大挑战都是代码的接受度,对吧?
So the biggest challenge, that we have on both sides of the spectrum is code acceptance, right?
你可以写出大量代码,而代码现在已经成为一种商品。
You can write a lot of code, and code is a commodity now.
让AI写代码非常简单。
Like getting AI to write code is very easy.
获得任何代码都很容易。
Getting any code is easy.
但要获得符合你标准、真正优质、安全可靠、可以直接上线的代码,那就是完全不同的故事了,对吧?
Getting code that follows your standards, code that is really good, code that is secure, code that is ready for production: that's a completely different story, right?
因为一方面,你有那些全新的项目或可以从零开始构建的新产品,AI在这方面非常擅长。
Because you have, on one hand you have these greenfield builds or like new products that you can build from scratch and AI is really good at that.
所以如果你看看实验室发布的演示,嘿,我做了这个游戏,看起来太棒了,简直不敢相信。
So if you look at the demos that the labs put out: hey, I built this game and it looks amazing, I can't believe it.
但当你把同样的AI用在企业代码库上时,它反而会搞砸。
But then when you put the same AI on an enterprise code base that it's supposed to work with, it messes it up.
这要困难得多。
It's a lot more challenging.
这要难得多。
It's way more challenging.
这是一个数量级更高的问题,因为AI需要处理海量信息和众多条件,导致工具失效。
It's an orders-of-magnitude harder problem, because the AI is dealing with so much information and so many conditions, and that causes tools to fail.
因此,自主化这一端的挑战要大得多,因为你必须同时解决所有这些问题,并以代码是否被接受作为最终衡量标准。
So the autonomous part of the spectrum is a much harder challenge, because you have to simultaneously address all of these items and work toward acceptance as your final metric.
所以从接受度倒推,从AI编写代码这一端来看,另一边一定存在某种规范,代码必须满足这些规范才能被接受。
And so thinking back from acceptance through the agent, the AI writing some code, on the other side of that, there's gotta be some specification that, the code has to meet in order to be accepted.
你是不是把编码的所有复杂性都推到了规范制定上?
Are you essentially pushing all the complexity of coding into spec development?
这是个很好的观点。
That's a that's a great point.
所以,是也不是。
So yes and no.
让我解释一下‘是’的部分:如果你能写出规范,那你就应该写规范,对吧?
Let me explain the yes part like, if you could write a spec, then you should write a spec, right?
我们所熟知和喜爱的所有工具都有计划模式。
It's, all of the tools we know and love have plan mode.
它们最近刚刚完成了这一功能。
You know, they've recently completed that.
大家已经意识到这一点了。我们在2023到2024年开发Blitzy时就发现了,规范开发确实能帮助代理更好地锚定自身。
Everyone's realized that; we saw it back in 2023, 2024 when we built Blitzy. Spec development really helps the agents anchor themselves.
但人们马上会意识到,规范本身还不够好,因为你还需要让代理遵循其他一些通用规则。
But again, what people realize immediately is that the spec is not good enough because then you have these other general rules that you want agents to follow.
传统上,人们会使用像 agents.md 这样的文件,再加上技能(skills)等其他内容,并尽量保持规范的轻量,因为这些模型过一段时间就会忘记内容,对吧?
And traditionally, what people have done is use things like agents.md, add skills and other stuff, trying to keep the spec lightweight, because these models tend to forget after a period of time, right?
或者当你进行压缩之类操作的时候。
Or if you go through compaction and stuff like that.
所以有一类任务,如果你能为它写一个规范,清楚它应该做什么,知道它需要满足的所有条件,那么写规范确实很好。
So there's that part: if you have a task that you can write a spec for, where you know what it should do and you know all of the conditions it needs to satisfy, then yes, writing a spec is great.
但还有一类任务,其依赖关系并不明确。
But then you have this other class of tasks where it's not really clear what the dependencies are.
比如,你不知道后端数据库的模式是什么样子。
Like for example, you don't know what the schema for the backend database looks like.
你无法为它写规范,因为你不知道有哪些约束条件,对吧。
And you can't write a spec for it because you don't know what the constraints are, right.
你也不能仅仅信任AI去‘好吧,先弄清楚模式,然后再做X’,因为当你弄清楚模式时,你会获得新信息,对吧?
And you you can't just trust the AI to, okay, figure out the schema and then do X because you're gonna get new information when you figure out the schema, right?
而这些新信息会影响你所编写代码的决策和架构,对吧?
And that's gonna affect the decision and the architecture of the code that you're writing, right?
因此,正因为如此,你总是处于一个连续谱系中:你与代理一对一协作,它为你提供更多信息,而你则帮助它做出决策,对吧?
So because of that, you always have this spectrum where you're working one to one with the agent, it's giving you more information and you're helping it make decisions, right?
所以,如果未来你能构建出更智能的模型,这些模型可能像人类一样,甚至比人类更擅长做出架构决策,那么是的,你可以有一整类工作专注于编写规范,并引导那些能力较弱、成本更低、速度更快的代理来编写代码。
So in the future, if you can build more intelligent models that are maybe human-like, or better than humans, at making architectural decisions, then yes, you can have this entire class of work that's focused on writing specs and guiding other, maybe less capable, cheaper, faster agents to write code.
这同样也是未来一个令人兴奋的机会。
And that also is an exciting opportunity for the future.
为了确认我理解正确,我想你是在说:是的,规范很重要,因为如果你把规范写对了,它就能为代理提供一个锚点,让代理能产出更好的代码;但不,如今规范还不够,因为在开发过程中会存在一些假设、未知因素和不断变化的东西。
So just to replay that to make sure I understand: I think what you're saying is, yes, the spec is important, because if you get the spec right, that anchors the agent and the agent can produce better code; but no, today a spec isn't sufficient, because there are assumptions and unknowns and things that evolve during the course of development.
因此,与其把所有东西都推到规范里,我听到的是,在开发过程中仍然需要大量的人工参与,这引出了一个问题:A,这是不是准确表达了你的意思?
And so rather than pushing everything to the spec, what I heard in there was that there's still a lot of human-in-the-loop during development, which raises a question. Well, A, is that right?
这准确捕捉到了你的观点吗?
Is that capturing what you're saying?
但同时,B,你经常提到‘自主开发’这个概念。
But also then B, you know, you talk a lot about this idea of autonomous development.
如果人类始终参与其中,那开发到底有多自主呢?
If the human is in the loop, how autonomous is the development?
你是如何思考这种区别和细微差别的?
How do you think about that distinction and nuance?
是的,这是个非常好的问题。
Yep, yeah, that's a fantastic question.
所以,关键是,即使在今天,我们这样来看。
So, the thing is, even today. Let's frame it this way.
所以,即使你有一个很好的规格说明,你仍然要花大量时间来编写它,但如果这是一个复杂的规格说明,可能涵盖五万到十万行代码,对吧。
So today, even if you have a great spec, and you spend a lot of time writing the spec, say it's a complicated spec that maybe covers 50,000, 100,000 lines of code, right?
或者企业级项目通常就是这个规模。
Which is the scale enterprise projects are often at.
如果你想迁移,比如为一个大型代码库升级Java,对吧。
If you wanna migrate, if you wanna upgrade Java for a large code base, right.
或者你想在复杂的后端上添加一个用户界面。
Or if you want to add a UI on a complicated backend.
这些都是跨越多个文件的巨大变更。
Those are huge, huge changes across multiple multiple files.
你可以有一个规范,写一个规范,然后把它交给智能代理,也就是你最喜欢的 CLI,也许是 Claude Code 或别的什么。
You can have a spec, you can write a spec, you can give it to the agent, your favorite CLI, maybe Claude Code or whatever.
它会花时间执行,但在某个时刻,它会遇到一个需要人类回答的问题。
It's gonna spend time executing, but at some point it's going to run into a case where it has a question for the human.
当某些事情不明确时,它需要做决定,或者它会经历多次上下文压缩,因为它只有大约一百万的上下文标记,而压缩后的输出质量会大打折扣。
Where something is not clear and it needs to make a decision, or it's going to run through several context compactions, because it only has like 1,000,000 tokens of context. And the quality of the output after compaction is not the same.
所以信息会丢失。
So information will be lost.
是的,信息会丢失。
Yes, information will be lost.
它必须这么做。
It has to do that.
它确实非常聪明地尝试保留所有相关信息,但并不完美,对吧?
It does do a very intelligent job of, you know, trying to retain all relevant information, but it's not perfect, right?
因为解决这个问题真的很难,即使你做到了,也无法保证不会丢失任何重要的内容。
Because it's really hard to solve that, and even if you do, there are no guarantees it won't lose anything that's important.
然后,如果它花时间回头去重新获取那些丢失的内容,很可能因为数据量太大、标记过多,再次导致上下文溢出,陷入循环。
And then if it does spend time going back to retrieve the thing it lost, chances are it's so big in size and volume of tokens that it's gonna overload the context again. It's stuck in a loop.
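As a toy illustration of why compaction loses information, here is a minimal sketch; the word-count token counter and the summary stub are stand-ins for what real tools do with a model:

```python
# Toy sketch of context compaction: once the transcript exceeds a token
# budget, the oldest messages are dropped and replaced with a summary
# stub. Real agents summarize with a model; the point here is just that
# whatever detail lived in the dropped messages is gone.
def compact(messages, budget, count_tokens=lambda m: len(m.split())):
    total = sum(count_tokens(m) for m in messages)
    dropped = 0
    while total > budget and len(messages) > 1:
        total -= count_tokens(messages.pop(0))
        dropped += 1
    if dropped:
        messages.insert(0, f"[summary of {dropped} earlier messages]")
    return messages
```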
这就是为什么Anthropic会推出这样一个庞大的项目——一个C语言编译器。
So that's why you had, for example, Anthropic put out this huge project that is a C compiler.
而这个仓库最热门的问题(虽然不是第一个)是:这个仓库根本不该存在,因为‘Hello World’在这编译器上根本无法编译,对吧?
And the most popular issue, not the very first, but the most popular issue on that repo, is that this repo should not exist, because hello world does not compile on this compiler, right?
所以当你试图把所谓的‘拉尔夫·维古姆循环’应用到那些原本并非为此设计的现有工具上时,就会遇到类似的问题。
So you have problems like that when you try to apply what is called the Ralph Wiggum loop to existing tools that are not designed for that.
‘拉尔夫·维古姆循环’本质上就是反复运行同一过程,直到得到你想要的正确答案。
The Ralph Wiggum loop is essentially just running the same thing again and again till it gets to the correct answer.
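That loop is simple enough to sketch directly. In this hypothetical version, `generate` stands in for an agent invocation and `verify` for a compile-and-test step:

```python
# Minimal sketch of the Ralph Wiggum loop: run the same generate-then-
# verify step over and over until the verifier passes or the attempt
# budget runs out. `generate` stands in for an agent run, `verify` for
# compiling the code and running its tests.
def ralph_wiggum_loop(generate, verify, max_attempts=10):
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts
```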
我想强调的是,这不仅仅是给AI一个规范的问题,而是关乎上下文工程和智能体工程。
The point I'm trying to make is: it's not just about giving the AI a spec; it's all about context engineering and agent engineering.
上下文工程就是指在恰当的时候,给AI提供恰到好处的上下文信息。
So context engineering is about giving the AI the right amount of context at the right time.
问题在于,当在企业规模上扩展时,面对成百上千的开发者,每个人使用工具的效率并不一致。
The problem is that at scale, across the enterprise, when you have hundreds and thousands of developers, not everyone is using the tool with the same level of efficacy.
所以,所有的 CLI 工具,Codex、Claude Code 等等,都需要大量的配置工作:你是否连接了正确的 MCP?是否使用了正确的技能?是否使用了正确的提示词?
So all of the CLI tools, Codex, Claude Code, you name it, require a significant amount of setup: have you connected the right MCP, are you using the right skills, are you using the right prompts?
适用于Anthropic的提示词对OpenAI并不有效。
The same prompts that work for Anthropic don't work for OpenAI.
比如,OpenAI在训练时并不使用XML标记,但Anthropic会使用,对吧?
Like for example, OpenAI doesn't use XML tokens in their training, but Anthropic does, right?
所以,如果你对 OpenAI 使用 XML 标记,或者在提示词里"大喊大叫"(对 Claude 有时确实得这么做,你得冲 Claude 喊它才肯听),那是一种低效的策略。但人们普遍认为,GPT-5.3 在很多方面比 Opus 做得更好,对吧?
So if you use XML tokens with OpenAI, or if you shout in your prompts, which you have to do with Claude at times, you have to shout at Claude to get it to listen to you, it's an ineffective strategy. But it's widely held that GPT 5.3 gets many things right that Opus does not, right?
因此,这里涉及大量复杂的智能体工程。
So there's all of this complex agentic engineering.
这就是智能体工程的部分:为特定任务招募合适的智能体,搭配恰当的提示词和工具,并进行精准的提示工程,对吧?
So that's the part that is agentic engineering where you recruit the right agent with the right set of prompts and tools, with the right level of prompt engineering for the right task, right?
因为确实有些任务,GPT比Opus更擅长。
Because there are definitely tasks that GPT is better than Opus for.
而上下文工程则致力于在恰当的时间提供适量的信息,帮助智能体聚焦于最小、最高效的任务,既不过度也不不足,对吧?
And then there's context engineering, which optimizes for giving the agent the right amount of information at the right time, and for getting it focused on the smallest possible task that is efficient for that agent, without overdoing it or underdoing it, right?
当你在大规模应用这两者并解决一些关键挑战时,比如上下文长度限制。
So when you apply those two at scale, and you solve some of the most important challenges, like, for example, context limits.
所以我们在内部,至少针对Blitzy,想出了一种非常有创意的解决方案,通过应用上下文工程和代理工程,我们实现了近乎无限的上下文。
So we have a very creative solution internally, at least speaking for Blitzy, where we've achieved effectively infinite context, because we've applied context engineering and agent engineering.
对吧?
Right?
这些是非常强大且实用的技术和工具,借助当今的人工智能,你可以实现自主开发,而你们已经成功做到了。
So these are really powerful techniques and tools that you can apply with today's AI to achieve autonomous development, which you've done successfully.
也许我们可以深入一下,当你提到‘自主开发’时,你具体指的是什么?
Maybe we can jump in and define when you say autonomous development, what exactly that means for you.
从客户或用户的视角,详细讲讲他们的操作流程,以及他们如何参与和体验这个过程。
Like, talk through the process from the perspective of a customer or user: what they're doing, what they see, the way in which they're engaged.
是的。
Yeah.
这是个好主意。
So that's a good idea.
我可以谈谈,比如你如何完成一个Java升级这样的任务。
So I can talk about how, you know, you would do, let's say, a task like, a Java upgrade.
对。
Right.
很容易想到现代化,比如从COBOL到Java的迁移,或者新功能开发,对吧?
It's very easy to think of modernization, maybe like a COBOL-to-Java transition, or even new feature development, right?
用传统方式来对比,我这里甚至把如今最先进的自主开发方式也称作"传统"方式,对吧?
Using the traditional way, and I'm calling even, let's say, state-of-the-art autonomous development "traditional" here, right?
在典型的开发工作流程中,对吧?
With the typical development workflow, right?
就算你用的是 Codex 或 Claude Code:你会先为这个程序制定一个规范,然后大概会拿需求去提示 Claude Code。
Even let's say you're using Codex or Claude Code: you would work out a spec for the program, and then you would probably prompt Claude Code with the requirements.
它会进入计划模式,构建规范。
It would enter plan mode, build a spec.
然后你会拿这个规范,切换到Codecs,让它进行审查。
You would then take that spec, hop to codecs, ask it to review it.
对吧?
Right?
然后,你知道,你希望你写对了那一套提示指南,提供了足够的上下文,帮助搜索代码库,找到所有相关信息,然后制定出这个计划。
And then you hope that you've followed the right prompting guidelines, given it the right amount of context, helped it search your code base and find all of the relevant information, and then built that plan.
但即使在生成规范时,经常发生的情况是,当你面对一个非常庞大的代码库时,这些工具——你知道的,那些不使用深度索引技术的工具——
But what frequently happens, even during spec generation, is this: when you have a very large code base, you're dealing with tools that don't use very deep indexing techniques.
它们依赖的是浅层索引,我的意思是,它们几分钟就能完成代码库的索引。
They're reliant on shallow indexing. What I mean by shallow indexing is that they'll finish indexing a code base in minutes.
所以它们并没有建立起对代码库或代码中各种关系的深入理解。
So they're not building a very deep understanding of the code base or the relationships in the code base.
因此,它们只能依赖像 grep 这样的工具来找东西。
So they have to rely on tools like grep to find stuff.
好吧,我想在这个新功能中更改身份验证提供程序。
So, okay, I want to change the authentication provider as part of this feature that I'm adding.
我要在一个一千万行的代码库中找出所有使用了 auth 的函数。
And I'm gonna find all functions that use auth in a 10,000,000 line code base.
问题是,也许 auth 是这个项目中一个非常重要的用例,使用身份验证提供程序的地方可能有成千上万,甚至数以万计。
Now the challenge is that maybe auth is a very important use case for this project and there's like thousands if not tens of thousands of places where the auth provider is used.
对吧?
Right?
而且不是所有函数都叫 login。
And not all functions are named login.
对吧?
Right?
所以依赖模型的智能来找到所有这些地方并正确更新它们,对吧。
So you're relying on the intelligence of the model to find all these places and update them correctly, right?
而很多时候,正是在这里出了问题。
And quite often, that's where it falls down.
所以它会漏掉一些地方,如果你把这样的计划提交上去,最终在编译时就会出错。
So it misses places, and then if you were to put that plan forward, eventually when it tries to compile, it will make a mistake.
它无法编译,对吧?
It won't be able to compile, right?
然后它会尝试修复这些bug。
And then it'll try to fix the bugs.
现在它又回到自己的计划上,结果改了一些原本不在计划里的东西。
And now it's going back on its plan and then you know, it's changing things that were not exactly in the plan.
所以你遇到了这个问题。
So you have this problem.
你正在违背计划,因为这个计划并不完美,对吧?
You're going against the plan because the plan was not perfect, right?
甚至要得到这个计划,你都得去求助三个不同的提供商,比如 Claude、GPT、Gemini 之类的。
And even to get this plan correctly, you had to go to maybe three different providers like Claude, GPT, Gemini, whatever it is.
所以这是一个挑战。
So that's one challenge.
现在假设你拿到了这个计划。
Now let's say you got the plan back.
接下来,你必须执行这个计划。
Next you now have to execute the plan.
你得定义任务,然后执行这些任务。
You have to define tasks and then execute the tasks.
但如果这是一个非常复杂的项目,每个任务可能需要数小时。
But if it's a very complex project, each task could take maybe hours.
如果真的特别复杂,甚至可能需要好几天,对吧?
If it's really really complex, it could take days, right?
然后你还会遇到子代理的概念,这些子代理可能是并行运行,也可能是串行运行。
And then you have the concept of maybe sub agents that you're running, maybe in parallel, maybe serially.
但要弄清楚哪些任务应该并行或串行执行,以及它们之间有哪些重叠,真的非常困难。
But it's really hard to figure out which tasks should be run in parallel versus in series, and what the overlaps between them are.
因为你可能会有代理在相互冲突,对吧?
Because you may have agents working against each other, right?
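One simple way to keep agents from working against each other, sketched here as a hypothetical illustration rather than any particular product's scheduler, is to treat the set of files each task touches as its footprint, and only batch together tasks whose footprints are disjoint:

```python
# Hedged sketch of parallel-versus-serial scheduling: each task's
# footprint is the set of files it will modify. Tasks in the same
# batch run in parallel; batches run one after another. Two tasks
# land in the same batch only if their footprints are disjoint, so
# no two agents ever edit the same file at once. Task names and file
# sets below are invented for illustration.
def schedule(tasks):
    # tasks: dict of task name -> set of files it will modify
    batches = []  # each batch is a list of mutually disjoint tasks
    for name, files in tasks.items():
        for batch in batches:
            if all(files.isdisjoint(tasks[other]) for other in batch):
                batch.append(name)
                break
        else:
            batches.append([name])
    return batches
```

This greedy packing is not optimal, but it makes the overlap rule concrete: overlapping tasks are serialized, everything else parallelizes.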
当你遇到一个复杂困难的情况,即使你已经有了计划,你还是会不得不回头求助于人类,依赖人类来指导你、帮你处理这一切。
And then, when you have a difficult, complex situation where you don't know what to do, even though you have a plan, you now find yourself going back to the human, relying on the human to guide you through all of that.
再从为代理提供合适工具的角度来看,比如你希望代理能进行实时测试。
And then even, from the standpoint of, let's say giving the agent the right tools, like for example, you want the agent to test live.
比如说,它是一个网页应用。
So say it's a web app, for example.
所以你可能会给它配置 Chrome MCP,然后"砰",你就损失了两万个令牌的上下文,对吧?
So maybe you'll give it the Chrome MCP, and boom, you've just lost 20,000 tokens of context, right?
因为它会占据你的上下文窗口。
Because it's gonna sit in, in your context window.
你可以使用一些技术,比如工具搜索等,来优化这一点。
And you may apply like techniques like, tool search, etcetera, that may optimize that.
但这里有个问题,对吧?
But there's a caveat, right?
如果你搜索工具,效率不会那么高。
If you search for tools, it's not gonna be as efficient.
它有可能找不到正确的工具,因为它不够主动,对吧?
There are chances that it'll miss finding the right tool, because it doesn't search as aggressively, right?
所以你面临这个问题。
So you have that problem.
你可能会有五个不同的 MCP,对吧?
And then maybe you have like five different MCPs, right?
这五个MCP中的每一个,如果都像Chrome一样复杂,那你一下子就损失了十万令牌,而这些大语言模型的有效操作边界仍然低于十万到十五万令牌。
And each of these five MCPs, if they're each as complex as Chrome's, you've just lost 100,000 tokens. And the effective frontier of operation for these LLMs is still less than 100K to 150K.
对。
Right.
尽管它们号称有一百万令牌的上下文。我想说的是,你去看看"大海捞针"排行榜就知道了,这样的评测有很多。
Even though they have 1,000,000 tokens of context. The point I'm making is, go look at the needle-in-the-haystack leaderboards; there are tons of them.
当你加载超过一定数量的令牌时,以前是四万,但现在像Opus 4.6这样的模型已经是八万、十万令牌了。
The moment you load more than, well, it used to be like 40K, but now, with Opus 4.6, it's like 80K, 100K tokens.
你就失去了代理最佳表现的能力。
You lose the ability of the agent to perform at its best.
所以,如果那个代理……
So, if that agent...
你很难很好地追踪上下文窗口中的所有内容。
You can't keep track of everything that's in the context window very well.
没错。
Exactly.
对吧?
Right?
所以,你加载了规范,但现在又加载了所有这些其他内容。
So, you loaded up the spec but now you've also loaded up this all this other stuff.
而且我还没谈到你的技能和 agents.md 呢。
And then I haven't even gone into your skills and your agents.md yet.
对吧?
Right?
那么,当你有一个百万行代码、包含多个模块、由不同团队开发、每个团队对自己的模块都有不同 agents.md 文件的代码库时,你该怎么办?
And then how do you, when you have this million-line code base with multiple modules, worked on by different teams, and every team has a different agents.md file for their module?
而且还有大量的技能,对吧?
And there are tons of skills, right?
你明白我想要表达的问题了。
You get the problem that I'm getting at.
你很容易就会失去效率前沿,而你甚至还没加载你要处理的实际文件。
You're easily gonna lose the efficient frontier, and you haven't even loaded the actual files you're gonna work on yet.
对。
Right.
你已经充分描绘出了传统工作流程中所面临复杂性的图景。
So you've adequately painted the picture of the complexity that you're dealing with with the traditional workflow.
那么,你能做些什么来克服这些众多的挑战呢?
Like what are the things that you can do to to overcome, you know, all these many challenges?
我们从一开始就做的一件事是,建立一个锚点,让智能体能够以此为依据,扎根于代码库并跨代码库查找内容。
So one thing that, you know, we've done from the beginning is build an anchor point that the agents can use to ground themselves in the code base and to find things across the code base.
比如,我们构建了一种图结构与向量的混合体,通过像Blitzy这样的摄入流程,它能够理解整个代码库,映射各种关系,并进行语义摘要与聚合。
For example, we've built a hybrid between a graph and a vector index, where you have this ingestion process, with Blitzy for example, that understands the entire code base, maps the relationships, and does semantic summarization and aggregation.
现在,你就拥有了整个代码库的完整地图。
And now you have this map of the entire code base.
因此,如果我想从一个点到另一个点,即使它们相隔一千万行代码,我也可以在一次请求中瞬间完成,而无需消耗大量令牌去逐个文件查找路径。
So if I wanna go from one point to another point that's like 10,000,000 lines away, I can do that instantly, in one request, rather than burning all these tokens traveling through different files to find the chain.
对。
Right.
所以这是一种非常有效的方法。
So that's like one technique that really works.
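The kind of prebuilt map described above can be illustrated with a tiny reverse index. The edges and symbol names here are invented, and a real system would also carry vectors and summaries; the point is only that the relationship graph is built once at ingestion time, so a cross-repo lookup becomes a graph walk instead of a grep per file:

```python
from collections import defaultdict

# Illustrative sketch of the "map of the code base" idea: record
# reference edges once during ingestion, so finding everything that
# depends on a symbol (say, an auth provider being replaced) is a
# lookup plus a walk over the prebuilt index.
def build_index(edges):
    # edges: (referrer, referenced) pairs discovered at ingestion time
    index = defaultdict(set)
    for referrer, referenced in edges:
        index[referenced].add(referrer)
    return index

def all_dependents(index, symbol):
    # Transitive closure: everything that directly or indirectly
    # references `symbol`.
    seen, stack = set(), [symbol]
    while stack:
        for referrer in index.get(stack.pop(), ()):
            if referrer not in seen:
                seen.add(referrer)
                stack.append(referrer)
    return seen
```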
你刚才描述的这些,在很多方面都与我们所看到的传统工具演进方式背道而驰。
What you just described in a lot of ways, like flies in the face of the way we've seen the traditional tooling evolve.
我们最初从RAG开始,基于……你知道的,人们可能没这么想过,但很大程度上,早期的Copilot版本其实都是基于RAG的。
Like, we started with RAG, and people don't think about it like this, but to a large degree the early Copilot versions were kind of RAG-based.
它就像是语义向量风格的搜索,在代码库中查找并识别代码块,然后将它们作为上下文传递。
It was semantic, vector-style searching across the code base, identifying chunks and passing them on as context.
而后来让我们都兴奋不已的Codex和Cloud Code,它们已经不再这么做了。
And then the thing that we're all excited about, the Codexes and the Claude Codes, they don't do that anymore.
它们现在只用grep,而你所说的这种方式,其实很难在大规模场景下奏效。
They just do grep. And what you're saying is, that doesn't really work at scale.
这很有趣,值得思考这里发生的这种互动,而你所说的,我认为你的意思是,要在企业级或大规模代码库中有效运作,就需要更复杂的机制。
It's interesting to think about the give and take that's happening here. What you're saying, or at least what I'm interpreting you to be saying, is that you need more sophistication to operate at enterprise scale, large-scale code bases, whatever we want to call it.
这么说吧,我想提出一个问题:你有没有一个概念,比如在代码量超过某个阈值、代码行数达到一定规模,或者复杂度达到某种程度时,grep 就失效了,你必须回到向量或图的方法?
To bring that to a question: do you have a sense for where the cliff is? Is there a certain amount of code, a certain number of lines, or some way of characterizing the complexity, above which grep stops working and you need to go back to vector or graph?
我认为我们对向量和图的应用是与 grep 结合使用的。
I would say that the way we've applied vector and graph is in combination with grep.
所以你把它当作一种信号,就像你在"查找"应用里搜索你的 AirPods 或 AirTag 时,它会先给你一个方向,你沿着那个方向走,直到找到为止。
So you use it like a signal, like when you go to Find My and you're searching for your AirPods or AirTag: you know how it gives you a direction, and then you go down that direction till you find the thing.
它并不会告诉你确切的位置。
It doesn't tell you where it exactly is.
是的。
Right.
但这非常有帮助。
But that is insanely helpful.
就是这样。
It's exactly that way.
所以通过结合两者
So by combining both
所以,语义搜索的作用是帮你大致定位方向,而 grep 则能帮你找到确切的那行代码,这样你就能通过语义搜索大幅缩小搜索范围。
So semantic search, I'm taking as the thing that gets you directionally close, and then grep is the thing that gets you to the exact line; you're able to reduce your search space using semantics.
没错。
Exactly.
好的。
Okay.
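The two-stage lookup just described can be sketched as follows. The "semantic" scorer here is just word overlap standing in for a real embedding model, and the file contents are invented; the shape of the pipeline is the point: a cheap ranking pass narrows the candidate files, then an exact grep-style scan finds the precise lines.

```python
# Toy sketch of semantic-then-grep search: rank files by a crude
# relevance score (word overlap as a stand-in for embeddings), keep
# the top few, then scan only those for the exact string.
def semantic_rank(query, files, top_k=2):
    terms = set(query.lower().split())
    return sorted(files,
                  key=lambda f: -len(terms & set(files[f].lower().split())))[:top_k]

def grep_lines(files, names, needle):
    hits = []
    for name in names:
        for lineno, line in enumerate(files[name].splitlines(), 1):
            if needle in line:
                hits.append((name, lineno))
    return hits
```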
所以,当你把这些技术结合起来,再问阈值是多少时,我会说,如果代码库的规模超过你上下文窗口的两倍。
And so so that, you know, when you combine these techniques and then you ask for like what is the threshold, well I would say if the code base is anything larger than two times your context window.
大概就是这个数量级,对吧?
Just roughly, right?
每个模型提供商使用的压缩技术、设置、类型、风格、算法等都不尽相同。
Every model provider uses different techniques for compaction: different settings, different types, different styles, different algorithms and all that.
是你的有效上下文窗口的两倍,还是最大上下文窗口?
And two times your effective context window or your, maximum context window?
我们说是最大上下文窗口,因为新一代模型在‘大海捞针’的情况下也表现得非常好。
We'll say maximum because, you know, the the newer models, they're really good at even the needle in the haystack.
对吧?
Right?
所以即使有效上下文更小,他们也能给你足够好的结果。
So even though the effective window is smaller, you would still get a good-enough result.
但一般来说,按经验法则,如果你这么说的话,当你做的更改超过一万行左右时,对吧?
But in general, as a rule of thumb, if you were to put it that way: if you're doing a change that's more than, let's say, 10,000 lines, right?
在一个七万到十万行或更大的代码仓库里,这种 RAG 支持的优势就非常明显了,因为你花在搜索上的时间会大幅减少;代码到了七万、十万行,你很可能已经有多个模块了,对吧?
In a repo that's around or more than 70K to 100K lines, that's where the advantages of having this RAG support become clearly evident, because the amount of time you spend searching goes down drastically. With 70K, 100K lines of code, you probably have multiple modules at that point, right?
而且还有多个团队在同时开发,各自有不同的规范。
And you have multiple teams working on it, which have different sets of rules.
因此,你可以充分利用多智能体的优势——这一点我们稍后单独讨论——同时结合这两个锚点来进行搜索。
So you can really take advantage of going multi agentic, which we'll talk about separately, but also having these two anchor points and searching things.
那么多智能体,这个概念是从哪里体现的?
So multi agentic, where does that come in?
是的。
Yeah.
所以,由于你面临这些与复杂性相关的限制,比如任务复杂性和有效上下文的限制,而这一点实际上并没有改变。
So, because you have these limitations with complexity, with task complexity, and limitations in terms of effective context, which by the way is not changing.
对吧?
Right?
所以,上下文已经从一万令牌增长到一百万令牌:你从大约一万,到二十万,再到一百万令牌。
So the context, you've gone from, you know, maybe 10,000 to 200K tokens, and then to 1,000,000.
但我们一直停留在八万到十万,或者说八万到十二万令牌,我认为过去两年的最新模型都是如此。
But we've been stuck at 80K to 100K, or 80K to 120K, I would say, for the latest models over the past two years.
所以,尽管你每三个月就能获得一个新模型,但有效上下文窗口并没有变化,我们花了很长时间才从一万到二十万再到一百万令牌,对吧。
So even though you're getting a new model every three months, the effective context window is not changing, and it's taken a while for us to go from 10K to 200K to 1,000,000, right.
因为这些方面存在物理限制。
Because you have physics constraints in these.
你有计算能力的限制,有功耗问题,还有我们能为这些模型提供商扩展到什么程度的问题。
You have, you know, the amount of compute capacity, you have power, you have limits on how much these model providers can scale.
所以他们总是在寻找更好的解决方案。
So they're always trying to find, you know, better solutions for that.
但这个问题在接下来的三个月、六个月,甚至在我看来,未来三年内都不会有显著改变。
But that's not getting solved in the next three months, six months, or even, I would say, in my opinion, that's not changing drastically in the next three years.
所以这些都是非常重要的考量因素。
So these are very important considerations.
那么,如果你具备多智能体能力,能够招募多个智能体,我们已经看到了两种被应用的技术。
So what happens if you have multi-agent capabilities, the ability to recruit multiple agents? We've seen two techniques that have been applied.
一种是子智能体的概念。
One is the concept of having sub agents.
你有一个协调者或主导模型,负责招募多个子智能体。
So you have one orchestrator or leader model that is gonna recruit multiple sub-agents.
比如在 Claude Code 中我就见过这种应用。
I've seen this used in Claude Code, for example.
然后你就可以并行进行搜索,对吧?
And, then you can do searches in parallel, right?
如果你要查找四样不同的东西,就同时运行四个智能体,看看哪个先返回结果,就像扔出四支飞镖,看哪支能命中。
If you're finding four different things, just run four agents, you know, see which one comes back; like, throw four darts, see which one sticks.
你可以做这种事。
You can do that kind of stuff.
或者你可以并行处理任务,对吧?
Or you can parallelize tasks, right?
把一个任务交给前端代理,另一个交给后端代理,这样能完成更多工作。
Give one to a front-end agent, give one to a back-end agent, and get more work done.
你可以做这种事。
So you can do that kind of stuff.
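The sub-agent pattern just described can be sketched as a leader that fans searches out in parallel and collects results. The "agents" here are plain functions standing in for LLM calls; everything is an editor's illustration, not Claude Code's or Blitzy's implementation.

```python
# Minimal sketch of the leader/sub-agent pattern: one leader fans search
# tasks out to parallel sub-agents and merges whatever comes back.
from concurrent.futures import ThreadPoolExecutor

def search_agent(query: str, corpus: dict[str, str]) -> list[str]:
    """Stand-in sub-agent: return file names whose contents match the query."""
    return [name for name, text in corpus.items() if query in text]

def leader(queries: list[str], corpus: dict[str, str]) -> dict[str, list[str]]:
    """Leader agent: run all searches in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(lambda q: search_agent(q, corpus), queries))
    return dict(zip(queries, results))

corpus = {"auth.py": "def login(): ...", "db.py": "def connect(): ..."}
print(leader(["login", "connect"], corpus))
```

The leader here is still the bottleneck, which is exactly the limitation discussed next: every sub-agent reports back to one place.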
当然,你的优势是速度,对吧?
The advantage you have is of course speed, right?
当然,还有可能提升智能效果,因为你正在并行处理多件事。
And of course, maybe, a boost in effective intelligence, because you're doing multiple things in parallel.
而且你还能显著——我这么说吧,但还不够充分——提高你所使用的上下文量,因为代理不再需要独自进行所有搜索和遍历代码,对吧?
And you also have a significant, I would say, but not sufficient, a significant improvement in the amount of context you're using with the lead agent, because it's no longer having to make all these searches and traverse the code, right?
它是在从不同的代理那里获取结果。
It's getting the result from different agents.
所以这比让这个人独自完成要高效得多。
So that's more effective than this guy just having to do it all himself.
但你仍然面临一个瓶颈,那就是这个领导代理,对吧?
But then you still have a bottleneck and the bottleneck is this leader agent, right?
因为每个人都会汇报回来,所以你不能运行数百个代理,否则你会又回到原点。
Because everyone's gonna report back, so you can't run hundreds of agents, because then you're gonna go back to...
你压缩了上下文,但并没有克服上下文作为障碍、作为根本限制的问题。
You've compressed the context, but you've not overcome the context as a barrier, as a limitation, a fundamental limitation.
你只是把问题往后推了而已。
You just kicked the can down the road, essentially.
是的。
Yeah.
你有一些东西。
You have something.
对。
Right.
所以这是一个方法。
So that's one.
另一个方法在代码库足够大时仍然会失效,比如数百万行代码的情况,对吧?
The other one, so that still falls down when the code base is large enough, right, multi-millions of lines.
你会依赖 Claude Code,让它高效地完成所有事情。
You're gonna be effective at using Claude Code and just get it to do everything.
就像我刚才在通话中提到的,即使使用 Opus 4.6,你也无法用 Claude Code 构建一个十万行的 C 语言编译器。
Like, like I just said earlier in the call, you can't even build a 100K-line C compiler with Claude Code, even with Opus 4.6.
所以,我们很久以前就采取了这种方法:动态地招募多个智能体群组,并将数据库作为协调层的一部分,对吧。
So the approach that we took a long while ago, our approach, has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer, right.
所以我们知道,你有一个规范,正在努力实现这个规范。
So we know that you have a spec, and you're working towards executing the spec.
你会用 AI 将其分解为多个任务,然后为不同任务分配不同的智能体组。
You break that down using AI into tasks and then use different sets of agents for tasks.
因为你是递归地这样做的,对吧。
And you do that recursively, right.
一旦你完成了这一步,你就达到了每个代理都有高效任务的阶段,你可以招募数以万计的代理,而无需担心一个负责跟踪所有活动的单一协调者。
So once you've done that, you've now gotten to the point where you have an efficient task for every agent and you can recruit tens of thousands of agents but not have to worry about this single orchestrator that's keeping track of everything that's happening.
对。
Right.
你可以像GPU那样实现大规模并行,我知道GPU是怎么工作的,我曾在英伟达工作过。
You can parallelize at scale, just like GPUs work, and I know how GPUs work, I was at Nvidia.
所以你可以实现这种效果,对吧。
So you can get that effect, right.
与其采用多线程——那是另一种方式的效果,你实际上会实现超大规模扩展,对吧。
Rather than having multithreading, which is the effect of the other one, you'd really have hyperscaling, right.
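The "database as orchestration layer" idea can be sketched with a task table that workers claim atomically, so no single leader tracks everything. This is a hedged illustration: the schema, the claim logic, and the task names are invented, not Blitzy's actual design.

```python
# Sketch of using the database as part of the orchestration layer: tasks
# live in a table, and each agent atomically claims one instead of
# reporting back to a single orchestrator. Illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, spec TEXT, "
           "status TEXT DEFAULT 'pending', owner TEXT)")
db.executemany("INSERT INTO tasks (spec) VALUES (?)",
               [("implement parser",), ("write parser tests",), ("update docs",)])

def claim_task(agent_id: str):
    """Claim one pending task with a compare-and-swap style update."""
    row = db.execute("SELECT id, spec FROM tasks WHERE status='pending' "
                     "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None  # nothing left to do
    task_id, spec = row
    cur = db.execute("UPDATE tasks SET status='running', owner=? "
                     "WHERE id=? AND status='pending'", (agent_id, task_id))
    # If another agent won the race, try again.
    return (task_id, spec) if cur.rowcount else claim_task(agent_id)

print(claim_task("agent-1"))  # (1, 'implement parser')
```

Each agent only ever touches its own row, which is what makes the fan-out to tens of thousands of agents tractable.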
所以,这就是我认为的未来。
So that is what I believe is the future.
我们已经成功应用了这种方法,经常编写数十万行、上百万行代码,所有代码都能编译、运行,所有测试都通过,用户界面正常,像素级精准。
And we've been able to apply that successfully, and we frequently write hundreds of thousands of lines, millions of lines of code: everything compiles, everything runs, all tests pass, the UI works, it's pixel perfect.
因此,我们已经将这一点完善到了极致。
And so we've perfected that really.
当我想到从多线程转向并行化或分布式计算这个类比时,我会考虑其中的一些挑战,比如并发、锁控之类的问题。
So when I think about this analogy of going from multithreading to parallelization, or distributed computing in general, I think about where some of the challenges are, and you get to issues like concurrency and locking and things like that.
在这个背景下,我想的是,有大量代理在相邻任务上大规模运行,你如何防止它们互相干扰对方的工作?
And in this context, I'm thinking: you've got many, many agents operating at scale on adjacent tasks. Like, how do you prevent them from stepping all over each other's work?
这是个很好的观点。
That's a great point.
所以,有一些技术手段,这确实是核心问题。
So, there are a number of techniques, you know, and that's the real problem.
这是我们日复一日所面对的问题。
That's what we're dealing with day in and day out.
但有一些技术可以帮助解决这个问题,比如提供多个环境,让代理不止在一个环境中运行,而是置身于多个隔离的沙盒环境中。
But there are a number of techniques that help with that, you know, like having multiple environments: giving the agent not just one but multiple environments to operate in, which are sandboxed, right.
然后整合结果,比如通过源代码来实现。
And then converging the result, like using the source code.
比如,每个代理最终都会向 GitHub 提交代码,每个代理都会沿着这条路径推进,验证这条路径是否真的可行。
Like ultimately every agent for example is committing to GitHub and every agent is going down this chain and figuring out if this path actually works, right.
然后定期重新检查代码,确认它是否仍然能编译,是否仍符合规范。
And then periodically revisiting the code and checking if it still compiles, if it still meets the spec.
比如,我们在将代码交付给用户之前,会内部定期进行代码审查。
Like for example, we run periodic code reviews internally before even giving the code to the user.
我们有代理会审查所有代码,确保它们没有偏离标准。
We have agents that review all of the code and make sure it's not drifting, right.
我们还有代理对所有代码进行测试,即QA代理。
We have agents that test all of the code, QA agents.
然后我们还有不同的开发代理来处理反馈。
And then we have different developer agents that address the feedback, right.
所以这些是利用代理设计作为杠杆的一些方式,我们将其与源代码管理系统结合,将其作为真实来源,推送评论,查看发生了什么,观察代理的执行轨迹,理解做出更改的初衷,从而避免越界。
So those are a few ways you can use agent design as a lever. And, you know, we combine that with the SCM, use that as a source of truth, push comments, look at what happened, look at agent trajectories, understand what the rationale for making a change was, right, so you don't overstep.
然后还有另一个部分是图数据库,因为它记录了整个代码库的关联映射,包括文件、哪个文件依赖于什么、导入了哪个库、该库的版本是什么,以及为什么使用这个库。
And then you have this other part, which is the graph database, because you have the relational mapping of the entire code base in there: you have the files, and you know which file depends on what, imports which library, what the version of that library is, and what the reason is that library is used.
拥有这样的锚点至关重要,这简直是颠覆性的。
Like having that anchor is extremely huge, it's a game changer, right.
因此,你可以立即让每个代理基于这个真实数据进行定位。因为代理的困惑更少,并且能访问这个信息宝库,所以它更加高效,也不太可能干扰其他代理,因为其他代理也同样基于这个图的节点进行操作。
So you can immediately ground every single agent in that ground truth, right? Because the agent is less confused and has access to this treasure trove of information, it is much more effective and less likely to step on every other agent's toes, because every other agent is also operating on the nodes of this graph.
对吧?
Right?
如果你有类似这样的系统,就可以这样设计。
So you can design systems that way if you have something like this.
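The code-graph idea above can be sketched by extracting a dependency map from source files. This is a toy illustration: a real system would use proper parsers and a graph database, and the regex here only handles simple Python imports.

```python
# Rough sketch of the code graph: map which file depends on what, so every
# agent can be grounded in the same relational view of the repo.
import re

def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Return a mapping: file name -> set of top-level modules it imports."""
    graph = {}
    for name, source in files.items():
        # Toy parse: match "import foo" or "from foo import ..." per line.
        deps = set(re.findall(r"^(?:from|import)\s+(\w+)", source, re.MULTILINE))
        graph[name] = deps
    return graph

files = {
    "api.py": "import db\nimport auth\n",
    "auth.py": "from db import connect\n",
    "db.py": "import sqlite3\n",
}
graph = build_import_graph(files)
print(graph["api.py"])
```

An agent assigned to `auth.py` can then see, from the same shared graph, that touching `db.py` affects another agent's node.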
在这个世界里,当你谈到这些代理时,你说它们各自做不同的事情,我想知道的是,这些代理的角色、个性,或者我们想称作的其他方面,究竟有多固定?
And so in this world, when you talked about these agents, you talked about them doing distinct things. I guess I'm trying to get at the degree to which the agent roles, you know, or personalities, or whatever we want to call them...
这些角色是固定的,还是动态变化的?
Like, are these fixed or these dynamic?
你是否在提示工程和上下文工程上投入大量时间,比如,这是一个代码编写代理,我们会围绕它优化所有提示?
Is this something that you spend a lot of time on from a prompt-engineering, context-engineering perspective? Like, you know, this is a code-writing agent and we're going to streamline all of its prompting around that.
这是一个代码审查代理,我们为此做专门处理,还是由代理自己来判断这些?
This is a code-review agent and we're doing that. Or does the agent figure these things out?
代理是一个通用概念,它会根据任务自行判断这些角色?
Like, is the agent a generic concept, and it figures these things out based on its task?
是的,这是个很好的观点。
Yeah, that's a that's a great point.
所以,我们刚开始的时候,所有的代理都是手写的,也就是静态的。
So we, you know, when we started, all our agents were handwritten, like they were static.
因为当时的模型还不够智能。
Because the models just weren't smart enough.
我们当时用的是 Claude 3.5 和 3.6 Sonnet。
Like, we were working with Claude 3.5, 3.6 Sonnet.
完全是另一个世界。
Different world.
这是个不同的世界。
It's different world.
是的。
Yeah.
顺便说一句,我们刚开始的时候,甚至还没有工具调用功能。
We didn't even have tool calling, by the way, when we started.
所以当时简直疯狂。
So it was crazy.
但后来,你知道,随着代理变得越来越聪明。
But then, you know, agents got really, really smart.
所以我们现在有一套基础指南,并尽量保持它们简洁。
So what we have today is we have a set of base guidelines and we try to keep that as lightweight as possible.
以便不会占用太多上下文空间。
So that we don't take up too much space in the context.
我说的简洁,是指少于5000个token,这简直难到极致。
When I say lightweight, I mean less than 5,000 tokens, which is incredibly hard to do.
我们另一个手段是提示指南。
The other lever we have is the prompt guidelines.
我们有参考链接,指向这些指南的发布位置。
We have the references, the URLs to where these guidelines are posted.
我们还赋予代理查找提示指南的能力。
And we've given agents the ability to look up prompt guidelines.
你大概能猜到我想说什么了。
You probably can tell where I'm getting to with this.
所以,这些代理会去查找提示指南。
So the agents look up the prompt guidelines.
然后,你就有了完全动态的代理设计。
And then, you have fully dynamic agent designs.
所以在我们平台的最新版本中,代理会设计代理。
So in the latest version of our platform, the agents design the agents.
所以我们已经实现了一组工具,还有一组MCPs,或者说是外部工具、集成,这些都是预先写好的。
So you have a set of tools that we've implemented, you have a set of MCPs or you know, external tools, integrations, all that are pre written.
我们编写了所有工具,也搭建了框架,设置了环境,但我们不会给代理直接访问源代码管理或数据库之类的东西的权限,因为我们都清楚代理会做什么,它们可能会无意中获得访问权限。
So we write all of the tools and we've written the harness, we've set up the environments. We don't give agents direct access to the SCM or the database or stuff like that, because we all know what agents can do when they have inadvertent access.
但我们确实提供了一些代理可以使用的工具,比如,向远程仓库推送更改的请求。
But we do have tools that agents can use and these tools could be like, for example, making a request to push a change to origin.
对吧?
Right?
或者从分支拉取最新更改、提交更改、编辑文件,诸如此类的操作。
Or pulling the latest changes from a branch, making a commit, making an edit to a file, that kind of stuff.
比如启动浏览器,这类操作。
Like spinning up a browser, that kind of stuff.
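The mediation pattern just described, where agents never touch the SCM directly but go through narrow, validated tools, can be sketched as a dispatcher. The tool names and the allow-list below are invented for illustration.

```python
# Sketch of mediated tool access: agents call narrow tools that validate a
# request before any side effect happens, rather than touching the SCM
# directly. All names here are hypothetical.

ALLOWED_TOOLS = {"request_push", "pull_branch", "commit", "edit_file"}

def dispatch(tool: str, **kwargs) -> str:
    """Route an agent's tool call through a narrow, validated interface."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"agent may not call {tool!r}")
    if tool == "request_push" and kwargs.get("branch") == "main":
        # Example policy: pushes to main are requests, not direct actions.
        return "denied: pushes to main require review"
    return f"ok: {tool} {kwargs}"

print(dispatch("request_push", branch="feature/x"))
```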
所以我们有这些工具,然后代理会查看规范,甚至规范的某一部分,因为我们将规范的不同部分分配给了不同的代理。
So we have the tools, and then the agents look at the spec, or even a portion of the spec, right, because we have assigned different parts of it to different agents.
然后它们决定哪个代理最适合完成这个任务。
And then they decide which agent would be best suited to solve this task.
因为你们已经看到,我们在Transformer架构及其相关代理中一贯发现的是:如果你给一个代理设定一个角色,并赋予它一个带有专属工具集的任务,它的表现会与没有被这样配置的代理截然不同。
Because what we've consistently seen in the transformer architecture, and agents around it, is that if you give an agent a persona and then give it a mission with a dedicated set of tools, its performance is gonna be vastly different from an agent that wasn't, for example, given the same thing.
就像你直接去问Claude,给它一个任务。
Like you just go to Claude and just give it something.
它给出的回应、遵循的技术、思考过程、推理方式,对吧?
The kind of response, the kind of techniques it follows, the thinking process, the reasoning process, right?
智能和模型的很多奥秘其实都来自于推理,对吧?
Quite a lot of the magic of the intelligence in the models comes from reasoning, right?
他们思考得越多,这不仅仅关乎数量,更关乎推理的质量,对吧?
And the more they think, and it's not just about the volume or the quantity, it's more about like the quality of their reasoning, right?
而这一点会受到角色设定的影响。
And that is impacted by the persona.
因此,至关重要的是招募或设计具有正确角色和合适工具的代理,避免上下文过载。
So it's super important to recruit agents, I would say design agents, with the right persona and the right set of tools, in a way that does not overload the context.
我们设置了检查机制,用于确认当某个代理启动时,会加载多少上下文。
We have checks in place to check like, okay, when this agent fires up, how much context is it gonna load up?
它是否仍在有效的上下文窗口内运行?
And does it still operate in the effective context window?
对。
Right.
当你设计出一个能完成所有这些功能的函数时,你就真正解决了这个问题。
And when you design that, like, you've designed a function that does all of that, that's when you've really solved this problem.
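The context check described here can be sketched as a simple budget gate that runs before an agent fires up. The 4-characters-per-token estimate is a common rough heuristic, and the 100K effective window is taken from the 80K-120K range mentioned earlier; both are assumptions, not Blitzy's actual numbers.

```python
# Sketch of the pre-flight context check: estimate how many tokens an
# agent's persona, guidelines, and tool descriptions will load, and verify
# it stays inside the effective context window. Illustrative only.

def estimated_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return len(text) // 4

def fits_effective_window(parts: list[str], effective_window: int = 100_000) -> bool:
    """True if the agent's startup context stays within the effective window."""
    return sum(estimated_tokens(p) for p in parts) <= effective_window

persona = "You are a senior backend engineer."
guidelines = "Base guidelines, kept lightweight (under 5,000 tokens)."
print(fits_effective_window([persona, guidelines]))  # True
```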
再加上其他那些因素,对吧?
Combined with the other stuff, right?
再加上大规模招募这些代理的能力,比如能够设计、分配,并让它们跟踪进度、推动工作前进。
Combined with the ability to recruit these agents at scale: being able to design, assign, and then get them to track progress and move the job forward.
这正是我们日复一日在做的事情。
Like, that's what we do day in and day out.
当你谈到代理的身份时,让我想到这种想法:在提示开头写‘你是一位专家文案撰写人’,对吧?
When you talk about agent personas, it makes me think of this idea of like starting your prompt with you are an expert copywriter, right?
这种做法。
This, this thing.
我想在早期我们看到过很多这样的情况。
And we saw a lot of that, I think, early on.
但后来我们似乎逐渐远离了这种做法,但你的经验听起来像是,赋予代理一个明确的专业身份,对于其表现至关重要。
And then I think we saw a step away from that. But it almost sounds like your experience is that giving the agent a strong kind of professional identity, if you will, is an important part of its performance.
你还会在提示中使用这类表述吗?
Do you still include that kind of verbiage in prompts?
会。
Yes.
所以,在历史长河中,我们在提示工程方面经历了许多改进和变化,其中一件事是我们不再需要告诉代理:如果做不好,人们会因此丧命。
So over the course of history, we've had a lot of improvements and changes in prompting. Like, one thing we've moved away from is that you no longer need to tell the agent that people will die if you don't get this right.
我们都已经过了那个阶段。
We're all past that.
我这辈子"杀死"了很多小狗,你知道,就是出于这个特定原因,用这种提示方式……是啊。
And the reason I've killed a lot of puppies in my life, you know, for this particular reason, with this kind of prompting... Yeah.
很高兴我们已经走过了那个阶段。
Well, I'm glad we're through that.
但当我们谈到内部评估时,我们使用这些评估来大规模衡量大型语言模型和代理的性能。
But when it comes to, you know... we have internal evals that we use to evaluate the performance of LLMs and agents at scale.
对吧?
Right?
我们确切知道一条指令会如何改变代理的行为轨迹。
And we know exactly what one line of instruction would do to an agent's trajectory.
我们一再发现,赋予代理恰当的身份,并用那种语言编写提示,会带来显著变化。
And what we've seen time and again is that giving it the right persona, writing your prompts in that language, changes things.
比如,我们曾与一家银行合作,当时代理在为银行撰写文档,但该代理并没有金融专家的身份定位。
Like, for example, we worked with a bank, and the agents were writing documentation for the bank, and the agent did not have the persona of a financial expert.
因此,它在撰写评论时使用的语言和术语并不符合银行的期望。
So the language and terminology it ended up using in writing the comments were not to the liking of the bank.
然后我们做了同样的事情,但改变了撰写文档的代理的身份定位。
And then we did the same thing but changed the persona of the agent writing documentation.
结果大大改善了输出效果,因为它使用了银行开发人员能够理解的术语。
It drastically improved the outcome because it was using terms that the the developers of the bank understood.
对吧?
Right?
这就是你通过调整角色设定所能带来的改变。非常有意思。
So that's the change you can effect by doing this, by tuning the persona. Super interesting.
我听说过这种说法,就像你拥有模型整个语义空间,而通过告诉它自己的角色,你实际上是把它引导到了与任务相匹配的语义邻域。
I've heard it described as like, you've got this, you know, this entire semantic space of the model and by kind of telling it its role, like you kind of put it in the right semantic neighborhood for the task.
是的,没错。
Yes, yes.
这正是我们在这类情况中反复看到的,在我们的评估和现实场景中都如此。
That's exactly what this plays on, and we've seen this time and again in our evals and in real-world situations.
人们不再建议这么做,是因为对于大多数日常通用场景,你根本不需要它,对吧?
The reason people don't advise doing it anymore is because for most of the general day-to-day use cases, you don't need it, right?
你已经能获得足够好的性能了。
You get good enough performance.
但当你面对超大规模、极其复杂的企业级应用场景时,这一点就变得非常有帮助了。
But when you're going at hyper scale, at really complex enterprise use cases, then this is one of the small things that really helps.
我们最近还看到一些研究指出,AGENTS.md 实际上可能适得其反。
And the other thing we're seeing recently is research that says that AGENTS.md can actually be counterproductive.
你对此有什么经验或见解吗?
Do you have any experience or insights into that?
完全正确。
100%.
我想,你知道,我之前也提到过,AGENTS.md,我认为它无法扩展。
I think, you know, I also described this earlier: AGENTS.md, I don't believe, can scale.
它在较小的代码库中可以工作。
It can work for the smaller code bases.
所以我定义了阈值,大约是7万到10万行代码。
So I defined, you know, the threshold as 70 to 100K lines.
在小于这个规模时,AGENTS.md 应该表现得很好,对吧。
AGENTS.md should be great, you know, at less than that, right.
因为你可以使用一个扁平文件,可能只有1到3个团队在使用这个代码库,你可以把所有规范都写在里面,对吧。
Because you can have a flat file, maybe you have one to three teams that are working with that code base and you can capture all of the guidelines there, right.
但它无法泛化,你不能用文本来实现泛化。
But it cannot generalize. You cannot use text to generalize.
你不能把整个团队开发者的全部经验都塞进一个文件里,就指望它能适用于整个代码库,不管模型有多智能,对吧?
You cannot put all of the learnings of that team's developers in a single file and expect it to generalize across the entire code base, no matter how intelligent the model is, right?
这只是在信息不足的情况下工作,就像我前面提到的,还有太多其他事情在争夺注意力。
It's just working with insufficient information, and like I described earlier, there are so many other things that are competing for attention, right.
因此,代理很难确定优先级,尤其是在出现冲突时,这尤其困难。
So it's really hard for the agent to know what to prioritize, especially when it leads to a conflict, right.
举个内部的例子:我们用 Blitzy 来开发 Blitzy,对吧。
So, for example, we had this situation internally. We use Blitzy to build Blitzy, right.
我们有一条规则规定,在Python中写测试时只能使用fakes,不能使用mocks,对吧?
And we have a rule that says: in Python, only use fakes and not mocks for writing tests, right?
你可以把它看作我们的 AGENTS.md。
Now you can think of that as our AGENTS.md.
但在这个代码库中,我们大量使用了mocks。
But then in the code base, we've extensively used mocks.
对吧?
Right?
我们还有另一条指令,要求始终模仿代码库中已有的模式。
And we have another instruction that says always mimic the patterns that we've already used in the code base.
对吧?
Right?
这就是你期望代理做的事情,对吧?
That's what do you expect the agent to do, right?
所以最终发生的情况是,它们有时用假对象,有时用模拟对象,一切都得靠你自己判断,对吧?
So what ends up happening is they're gonna use fakes sometimes and mocks sometimes, and it's all on you, right?
所以,这些就是 AGENTS.md 效果不佳的一些挑战,而且正如有人正确指出的,在许多情况下甚至可能适得其反。
So those are some of the challenges of why AGENTS.md is not effective and, you know, as someone rightfully pointed out, maybe even counterproductive in many cases.
但在绝大多数小规模应用场景中,这是一种相当有效的方法。
But for the vast majority of the smaller-scale use cases, it's a pretty effective technique.
是的。
Yeah.
在这个背景下,反思与代理协作在多大程度上依赖于任务和上下文,是很有趣的。
It's interesting in that context to reflect on how much of working with agents is task and context dependent.
就像我们经常抛出很多指令,比如你应该怎样,你必须怎样,要这样提示,必须那样提示。
Like, we throw around a lot of directives: you should prompt like this, thou shalt prompt like that.
但我认为,这最终还是回到了评估的重要性上。
But I guess it really just comes back to the importance of evals.
你知道,仅仅因为你在X平台或其他地方看到某些东西,并不意味着它一定适用于你的场景。
Like, you know, just because you see something out on X or wherever doesn't mean it necessarily applies to your case.
也许你应该测试一下,但要先通过你的评估套件。
Maybe you should test it, but, run it through your eval suite.
是的。
Yeah.
而且我认为,你提到了一个非常重要的观点,这正是我非常关心的。
And I think, you hit a very important point, one that's very close to my heart.
我认为评估一直表现不佳,还不够好。
Evals, I think, have been consistently underperforming, and are not good enough.
所以就在今天或昨天,我相信 OpenAI 发布了一篇文章或备忘录,说他们已经停止在 SWE-bench Verified 上进行测试,因为问题定义得不够清晰。
So just today, or just yesterday I believe, OpenAI released an article, a memo, where they said: we've stopped testing on SWE-bench Verified because the problems are not well defined.
他们参与了 SWE-bench Verified 的创建,对吧。
They contributed to creating SWE-bench Verified, right.
他们意识到了这个差距。
They've realized that gap.
所以他们现在改用 SWE-bench Pro 进行测试。
So they're now testing on SWE-bench Pro.
但即使你看看在 SWE-bench Verified、SWE-bench Pro,或者 Terminal-Bench 上表现相似的模型,对吧?
But even if you look at models that perform similarly on SWE-bench Verified or SWE-bench Pro, or Terminal-Bench for that matter, right?
这些都是些非常流行的排行榜。
Which are some of the very popular leaderboards.
如果你在真实场景中测试它们,结果会有巨大差异。
If you test them in real world performance, the results are vastly different.
比如Gemini和Anthropic。
Like for example, Gemini and Anthropic.
我都非常喜欢这两个模型。
I love both of these models.
但如果你给他们同样的问题,而他们的得分相似,对吧?
But, if you give them the same problem and they have similar scores, right?
最新的版本。
Latest versions.
但如果你给他们同样的问题,然后观察他们写的代码,不添加任何额外指令,对吧?不要试图影响它的行为。
But if you give them the same problem and you look at the code they write, without any additional instructions, right? Like, don't try to influence what it's doing.
Gemini 会采取一种更富有创意、更冗长的方式,这可能更适合一些人。
So Gemini tries to take a more creative, verbose approach, that might be preferable to some people.
但 Opus 试图采取一种完全不同的方法。
But Opus takes a completely different approach.
它更精确,你知道的,这还取决于你如何提示它之类的事情。
It's more precise, it's more, you know, and it depends on how you prompt it and stuff like that.
但这些差异在现实世界中非常显著,因为很难为每一个可能的情况都去提示代理。
But those differences are very significant in the real world, because it's really hard to prompt the agent for every single possibility.
对吧?
Right?
就像它应该表现的样子。
Like how it's supposed to behave.
如果你这么做,那区别在哪里?
Like if if you're doing that, then what is the difference?
你的工作是在扮演代理吗?
Is your work then just playing the agent?
所有这些的目的是让代理自己去解决。
The whole point of all this is let the agent figure it out.
正因为如此,而所有的排行榜都没能体现这一点,对吧?
And none of the leaderboards capture that, right?
所以你可以看排行榜,但对一些人来说,这跟智力有关。
So you can look at a leaderboard, but to some people this has to do with intelligence.
这确实与智力有关。
Like, this has a bearing on intelligence.
比如,如果你写了100行代码来完成一个本应由资深工程师用一行代码解决的任务,他们会说:这个人不够聪明。
Like for example, if you're writing 100 lines for what should have been a one line job for a principal engineer, they will be like, this person is not smart.
对吧?
Right?
所以,这就是我的观点,对吧?
So that's my point, right?
所以,即使从排行榜上看,你最终能得到正确答案,但过程也很重要,你的风格、方法都很重要,因为你最终要考虑的是可扩展性,对吧?
So even though, on the leaderboard, you can eventually get to a correct answer, the trajectories matter, your style matters, your approach matters, because eventually you're thinking about scaling, right?
这就是工程师们在做的事情。
That's what engineers are doing.
考虑可扩展性,设计系统,不仅要解决当前的问题,还要提前预防未来的问题。
Thinking about scale, designing systems so that you don't just solve today's problems but you preempt future's problems.
而模型的选择在这里真的很重要。
And the choice of the model really matters there.
所以,我们也在努力构建自己的评估体系,试图公开我们内部的评估方法,并为这个领域做出贡献。
So we're also trying to make public our own internal evals and contribute to this space.
但我认为评估是下一个最令人兴奋的领域,因为现在我们看到的情况是,比如说几年前,Anthropic,甚至一年前,Anthropic 在代码生成领域明显是领先者,对吧?
But I think evals is definitely the most exciting space, because what we're seeing now is, let's say a couple of years ago, or even a year ago, Anthropic was a clear leader, right, in the code-generation, coding space.
但现在我们看到,OpenAI 已经明显赶上了,而且我们甚至看到开源社区和谷歌也在许多领域迎头赶上,对吧?
But now we've seen that OpenAI has definitely caught up, and we're probably even seeing, you know, open source and even Google play catch-up in many of these areas, right?
因此,拥有高质量、稳健的评估体系至关重要,因为就连各大实验室也在使用这些评估来改进自己的模型。
So having really good, robust evals is very, very crucial, because even the labs are using these, right, to improve their own models.
实验室使用的其他技术是与我们这样的小型公司合作,给我们早期访问权限,让我们在评估上测试他们的模型并提供反馈。
The other technique the labs use: they work with smaller companies like us, give us early access, have us test their models on our evals, and get feedback.
所以,这一切都是一场竞赛,目标是构建出在每个真实场景中表现最出色、最智能的模型,但评估体系并不能代表真实世界。
So it's all a race to build the smartest, best model that works in every real-world use case, but the evals don't represent the real world.
这让我想到,当谈到模型表现如何具有高度任务特异性时。
It makes me wonder, you know, when talking about how, you know, just how task specific model performance can be.
这似乎会给你带来很多挑战,尤其是在任务分解和分配方面。
It seems like that would create a lot of challenges for you in terms of task decomposition and assignment.
如果模型的表现不仅取决于任务类别,还取决于任务内容,那你怎么知道该把任务分配给哪个模型呢?
Like, how do you know what model to give a task to, if the model's performance isn't going to depend just on the class of task, but also on the content of the task?
你有没有发现,或者实际上,仅按类别分类就足够了吗?
Do you find that? Or is it, in fact, sufficient to categorize by class?
我的意思是,从某种角度来说,这也许已经是你能做到的最好方式了,但是
I mean, in some senses that's maybe the best you can do anyway, but
这是一个合理的观点。
That's a fair point.
如果你涉及太多变量,就很难找到解决方案。
If you work with too many variables, it's hard to get to a solution.
所以你需要做的是,让其中一些变量保持不变。
So what you need to do is make some of them constant.
比如,我们让内容保持不变,让提示保持不变。
Like, we make the content constant, we make the prompt constant.
但这里的问题是,如果提示指南不同,你怎么能确定它真的是恒定的呢?
But then the challenge there is: how do you know it's actually constant if the prompting guidelines are different, right?
所以我们采取的做法是,选定一个大语言模型,让它作为最终评判者,并由它来编写提示。
So what we do is, we pick an LLM and let that be the final judge, and let it write the prompts.
我们先用英语写出评估用的提示,然后让大语言模型根据最新的静态指南来优化提示。
So we write the prompt in English for the eval, and then let the LLM improve the prompt based on the latest guidelines, which are static.
我们下载它们并输入进去。
We download them and feed them.
然后你现在就有了一个遵循模型提供商所有建议指南的提示,对吧。
And then now you have a prompt that is written following all of the guidelines that the model provider recommends, right.
而且任务的指令是什么、它应该做什么是固定的,对吧。
And the instructions of the task, what it's supposed to do, are constant, right.
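The eval setup just described, holding the task content constant while a judge model reformats the prompt per each provider's guidelines, can be sketched as a small pipeline. The "judge" below is a stub function standing in for an LLM call, and the guideline contents are invented.

```python
# Sketch of the constant-content eval setup: the task is fixed, and a judge
# rewrites the prompt to follow each provider's (static, downloaded)
# prompting guidelines. All names and contents are illustrative.

TASK = "Build the /users API endpoint with pagination."

PROVIDER_GUIDELINES = {  # contents invented for illustration
    "provider_a": "Prefer XML-style tags for structure.",
    "provider_b": "Prefer concise markdown instructions.",
}

def judge_rewrite(task: str, guidelines: str) -> str:
    """Stub judge: in practice an LLM rewrites the prompt per the guidelines."""
    return f"[guidelines: {guidelines}]\n{task}"

prompts = {p: judge_rewrite(TASK, g) for p, g in PROVIDER_GUIDELINES.items()}
# Every provider gets the same task content, formatted per its own guidelines.
print(all(TASK in prompt for prompt in prompts.values()))  # True
```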
然后你用这个任务运行模型,这个任务可能是构建一个代表Figma的前端,比如与Figma保持高度一致,或者构建这个新API,又或者是让一个包含大量错误的代码库成功编译。
And then you run the model on that task, and that task could be building a front end representing a Figma, like having fidelity with the Figma, or it could be building this new API, or it could be getting a code base to compile when the code base has tons of errors.
比如你模拟某个地方出了问题,然后你必须去代码库中把它清理掉,或者只是留下一大堆 TODO 注释。
Like, you simulate, for example, that something is messed up, and now you have to go and clean that up in the code base, or just have a bunch of TODO comments.
对吧?
Right?
但其中一些是好的,而另一些则是错误的。
But some of them are actually good and some of them are like wrong.
你可以创建真实的评估场景,即模仿真实世界且非常复杂的评估,它们会涉及多个文件,可能有数百万行代码,对吧?
You can create like real world evals, like evals that mimic the real world and are really complex, they touch multiple files, there are maybe millions of lines, right?
你可以使用合成数据来创建这样的评估。
And you can use synthetic data to create such evals.
然后你查看运行轨迹去理解,你可以根据多项参数来评估模型。
And then you look at the traces and understand; you can evaluate models against a number of parameters.
其中之一是它们最终能否得出正确答案。
Like, one is whether they ultimately get to the right answer.
是的,好的,它消耗了多少个令牌?
Yes, okay, how many tokens did it burn?
它经历了多少轮交互?
How many turns did it take?
它经历了多少次压缩?
How many compactions did it go through?
对。
Right.
实现这一切总共花了多少时间?
What was the total time it took to achieve all of this?
忽略像往返时间这样的开销,对吧。
Neglecting the time spent in like round trips, right.
类似这样的事情,对吧。
Stuff like that, right.
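The run-level parameters listed above (correctness, tokens burned, turns, compactions, wall time) can be sketched as a comparison record. The numbers and the tie-breaking rule are invented for illustration; a real eval harness would weigh these dimensions differently per use case.

```python
# Sketch of comparing two model runs on the parameters discussed:
# correctness, tokens burned, turns, compactions, and wall time.
from dataclasses import dataclass

@dataclass
class RunMetrics:
    model: str
    correct: bool
    tokens: int
    turns: int
    compactions: int
    seconds: float

def prefer(a: RunMetrics, b: RunMetrics) -> RunMetrics:
    """Prefer the correct run; break ties by fewest tokens burned."""
    if a.correct != b.correct:
        return a if a.correct else b
    return a if a.tokens <= b.tokens else b

run_a = RunMetrics("model-a", True, 120_000, 35, 2, 840.0)
run_b = RunMetrics("model-b", True, 310_000, 52, 6, 1310.0)
print(prefer(run_a, run_b).model)  # model-a
```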
你可以查看这些数据,然后观察推理轨迹,尝试理解模型多久后才明白问题所在,对吧。
You can look at that, and then you can look at the reasoning traces and try to understand how quickly the model got to the point where it understood what the problem was, right.
而其中有多少是真正受到工具框架影响的。
And how much of that was actually influenced by the harness.
我们的工具是否无效,反而误导了模型,对吧。
Like, were our tools ineffective, misleading the model, right?
是它们误导了模型,还是其他原因,对吧。
Did they mislead the model or was it something else, right.
因此,你可以调整这样的参数,根据这些变化评估模型的行为,最终做出判断:即使我们的工具效果不佳,也许问题本身非常复杂,但尽管存在这些挑战,这个模型表现得非常出色。
So you can change parameters like this, evaluate the model's behavior based on that, and ultimately make a decision: okay, even though, let's say, maybe our tools are ineffective, maybe the problem itself is very complex, but despite the challenges, this model did extremely well.
它消耗的令牌少得多,进行了更多的工具调用,并在决策过程中做出了明智的选择。
It burned much fewer tokens, it made lots more tool calls, and it made an informed decision in deciding this.
所以如果这是一个现实中的项目,我更愿意与这个模型合作,对吧,这就是使用场景。
So if this was a real world project, I would rather work with this model, right, and this is the use case.
所以如果你有多个这样的不同使用场景,你就是在运用不同的能力。
So if you have multiple different use cases like this, you're exercising different skills.
对吧?
Right?
比如我说的能力,并不是现在代码库里流行的那种"技能",而是指模型的能力、智能体的能力。
Like, for example, when I say skills, I don't mean the skills that are now popular in code bases; I mean skills of the model, skills of the agent.
视觉理解是一种能力。
So visual comprehension is a skill.
使用计算机是一种能力。
Computer use is a skill.
对吧?
Right?
所以你可以利用这些模型的原生功能,或许用‘原生功能’这个词更合适。
So you can use these native features, maybe "features" is a better word, of the models.
你可以通过构建相应的现实世界评估来测试这些功能。
You can test against these by building the appropriate real world evals.
所以我们已经讨论了很多关于你如何进行自动化或自主开发的内容。
So we've talked quite a bit about how you approach automated or autonomous development.
让我们来谈谈这种努力的成果吧。
You know, let's talk a little bit about the output of, you know, this effort.
你怎么知道它能正常工作?
Like how do you know it works?
你得到了代码,它必须能通过编译。
You get code, you know, it has to compile.
当然。
Sure.
我们明白这一点。
We get that.
你可以用代码检查工具来检查它。
Like you can run linters against it.
你可以用测试套件来运行它。
You can run it through test suites.
你知道,想必你是在以自动化的方式完成所有这些事情,代理也在以自动化的方式做所有这些事。
You know, presumably you're doing all these things, the agents are doing all these things, in an automated way.
但我觉得,还可能存在其他一些东西,不管你是称之为‘异味’、‘感觉’还是别的什么,你如何定义软件的其他特性,又如何评估这类东西呢?
But it strikes me that there's also the potential for something else, whether you call it a smell, or vibes, or whatever. Like, how do you characterize other characteristics of software, and how do you evaluate for that kind of thing?
是的,这是个很好的问题。
Yeah, that's a great question.
首先,我会谈谈我们在 Blitzy 是如何做的,这有助于我锚定这个概念。
So, one, I'll talk about how we do it at Blitzy, because it helps me anchor this.
所以在项目结束时,当你应该完成工作、准备提交 PR 或最终输出时,我们会创建一个叫做项目指南的文档。
So at the end of the project, right, when you're supposed to be done, when you're ready to produce your PR or your final output, we create what is called a project guide.
这个项目指南基于对代码库的分析,并与最初的规格进行对比。
And that project guide is based on analysis of the code base, and it tracks progress relative to the initial spec.
我们能自主完成项目的多少部分呢?
How much of the project could we complete autonomously, right?
我们总是考虑生产环境,我们关注的不是代码本身,而是将此项目投入生产的客户和企业。
And we always think about production, we're not thinking about the code, we're thinking about the client and the enterprise that's taking this to production.
对吧?
Right?
因此,无论初始需求如何,企业需要花费多少时间来维护这个代码库并将其投入生产?
So how much time does the enterprise need to spend on this code base to take it to production, regardless of what the initial spec said?
对吧?
Right?
我们评估根据代码中可见的信息,我们已经自主完成了多少这部分工作时间。
And we look at how much of that time we have now completed autonomously, based on what we can see in the code.
对吧?
Right?
我们给出一个完成度指标。
And we give it a completion metric.
我们说在大多数情况下。
And we say that in the majority of cases —
当你说到‘我们’时,你是从软件和客户的角度来说的吗?客户在运行这个软件,还是你的商业模式本质上是作为外包开发者,使用你的软件,并将这份报告连同你为客户开发的软件一起提供给他们?
When you say "we," are you saying "we" from the perspective of the software and the client — the customer is running the software — or is your business model such that you're essentially like an outsourced developer, and you're using your software and giving this report to the customer along with the software that you created for them?
是的。
Yeah.
当我提到‘我们’时,指的是‘皇家我们’,也就是Blitzy平台的代理们。
When I say "we," it's the royal "we" — the Blitzy platform's agents.
我有个想法。
I have a thought.
创造它们。
You know, creating them.
但,但没错,你刚才提到的后一点,对,我们是从外包开发者的角度来思考的。
But yeah, the latter point that you made, right — we're thinking as the outsourced developer.
我们希望成为团队中的一员,思考如何将工作交接给人类。
We wanna be a developer on the team that is thinking about handing off to humans.
从这个角度来看,你希望你所提供的东西能被用户轻松使用。
So from that perspective, kind of both — you want the thing that you're providing to be consumable by the user.
没错。
Exactly.
而且要进入生产环境、获得验收,就像我们一开始讨论的那样,对吧?
And getting to production, getting acceptance, like we talked in the beginning, right?
这才是最终目标。
That is the ultimate goal.
那么,你如何向用户解释你所完成的工作?
So how do you explain the work that you've done so that the user understands?
你如何说明那些尚未完成、但用户最初设定的目标中仍需达成的事项,即使你已多次尝试却未能实现?
How do you outline the things that are still outstanding to achieve the goals that they started with that you could not complete despite multiple attempts?
也许是因为访问权限问题导致的差距,也许你陷入矛盾,即使根据历史记录或代码中的信息也无法达成解决方案。
Maybe it was a gap because of an access issue, maybe you were conflicted and you could not get to a resolution even based on the history or whatever you saw in the code.
或者也许你只是被明确指示不要去做,对吧?
Or maybe it's something you were just instructed not to do, right?
比如,不要部署到我的数据库,就是这样,对吧?
Like, don't deploy to my database, for example, right?
别去编辑它,或者别的什么。
Don't edit it, or whatever.
但为了实现这个目标,你确实需要编辑它。
But you do need to edit it to achieve this goal.
所以你要在项目指南中明确列出这些内容。
So you outline that in the project guide.
这正是我们所做的。
And that's what we do.
通常情况下,我们发现大约80%的工作量可以自主完成,以工时计算。
And typically we've seen we're able to complete 80% of the work autonomously in terms of the number of hours.
但回到你刚才提到的,你怎么知道它除了能通过编译和测试运行之外,真的很好呢?
But taking a step back to what you talked about, right, how do you know it's good beyond the fact that it compiles in the test runs?
对。
Right.
所有这些运行都算在内。
All of that, like, runs.
也许这其中一个具体而重要的方面是,我称之为可维护性,但我不确定这个词是否完美。
And maybe a concrete, an important aspect of that is, you know, I will call it maintainability, but I don't know that that's the perfect word.
我的意思是,如果你要让我负责20%的工作,那你必须让我能理解并操作剩下的80%,不能是那种混乱不堪、难以理解、即便能运行但根本没法用的代码,对吧?
It's — what I'm trying to capture here is, if you're going to leave me with 20% of the work to do, you've got to give me 80% that a human can understand and work with, and not some slop that is impenetrable and not usable, even though it works, right?
即使从技术上讲它能通过测试,但如果我需要长期维护它,也许‘可维护性’这个词确实很贴切。
Even though technically it passes the tests — like, if I've gotta be able to maintain this, maybe maintainability is a good word for that.
是的。
Yep.
对。
Yep.
圈复杂度是衡量可维护性的一个指标。
Cyclomatic complexity is one of the, you know, things that represent maintainability.
西达汉特?
Siddhant?
圈复杂度?
Cyclomatic complexity?
是的。
Yes.
所以这关乎于维护这段代码有多困难。
So it's about how hard it is to maintain this code.
比如,如果你有很多脆弱的 if 语句,添加一个新条件时,你就得重新审查所有内容,并插入这个块。
Like, for example, if you have very fragile if blocks, and you add a new condition, you're gonna have to review everything and inject that block.
对吧?
Right?
就是类似这样的问题。
So it's stuff like that.
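The metric being discussed can be made concrete. As a rough illustration (not Blitzy's implementation), cyclomatic complexity for a Python snippet can be estimated by counting branching constructs with the standard-library `ast` module — the fragile if-chain below scores high for exactly the reason described:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Estimate McCabe cyclomatic complexity for a snippet:
    1 plus one for every branching construct found."""
    tree = ast.parse(source)
    complexity = 1  # a straight-line function has complexity 1
    for node in ast.walk(tree):
        # each of these adds an independent path through the code
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' short-circuits: n-1 extra branches
            complexity += len(node.values) - 1
    return complexity

fragile = """
def handler(kind, payload):
    if kind == "a":
        return 1
    elif kind == "b":
        return 2
    elif kind == "c" and payload:
        return 3
    else:
        return 0
"""
print(cyclomatic_complexity(fragile))  # → 5
```

Adding one more `elif` to the chain bumps the score again, which is the "review everything and inject that block" cost the speaker is pointing at.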
但你的观点非常重要。
But your point is very important.
对吧?
Right?
变量名和结构,是的。
Variable names and structure — yeah.
所有这些事情。
And all of these things.
当然。
Absolutely.
比如,你用了太多像 a b b 这样的变量,根本毫无意义。
Like, you're using too many 'a', 'b' variables that, you know, don't make any sense.
比如,我该不该改掉 BBA?
Like, do I change the BBA?
我得
I gotta
想象一下,你得让一个代理来完成这个操作。
imagine that you would have to ask an agent to do that.
比如
Like
对。
Yep.
是的。
Yeah.
我觉得他可能想要。
Think he wants maybe.
对吧?
Right?
你是一个写类似混淆的 JavaScript 代码的开发者。
You are a developer that writes code like obfuscated JavaScript.
不,但安全性是另一个方面,对吧?
No, but security is another aspect, right?
如果你只是写了一大堆代码,却没有考虑安全因素,比如你的代码
If you just write a bunch of code and, you haven't checked for security considerations, like your code
并不
is not
难以防御,你不能指望它会被接受。
defensible, you cannot expect it to get accepted.
你不能指望它通过代码审查。
You cannot expect it to go through code review.
你提到可维护性是其中一个重要的方面。
You mentioned maintainability as one of the important aspects.
可解释性,我认为是另一个方面。
Explainability, I would say, is another one.
在我看来,虽然这不可能完美,比如,我们有可以运行代码的安全评估工具,能够评估安全性。
In my mind — you know, while this wouldn't be perfect — we've got security assessment tools that we can run code through that can assess security.
可维护性也容易评估吗?
Is maintainability as easy to assess?
没那么容易,但有相应的工具。
It's not as easy, but there are tools for it.
如果你所谓的‘容易’是指有工具可用,那没错。
If your definition of "easy" is that there are tools, then yes.
能用的工具?
Tools that work?
是的,是的,是的。
Yeah, yeah, yeah.
它们确实有。
They do.
我的意思是,麻省理工学院有一些相关研究。
I mean, there's this research from, you know, MIT.
我知道我的哈佛商学院教授正在与一家公司合作——这公司不是初创企业,他们在这个领域已经十年了。
I know that my HBS professor is working with a startup — well, not a startup, they've been in this space for ten years.
他们正成功地为政府、为美国政府做这件事:我们评估圈复杂度,他们评估代码质量。
And they're successfully doing this for the government, for the US government — where we're estimating cyclomatic complexity, they're estimating the quality of the code.
他们的想法是:我们就坐着,让这波AI浪潮席卷而过。
Their belief is, look, we'll just sit around and let this AI wave, like, roll over.
等人们面对一堆烂摊子时,我们再来收拾残局。
And then when people are left with the slop, we'll clean it up.
直接介入,帮人们解决问题。
Just get in and you know, we'll help people fix stuff.
我的工作就是修复你们的AI产生的垃圾代码。
Like, my job is just to fix the slop created by your AI.
这是一个巨大的商业机会。
That's a huge business opportunity.
是的。
Yeah.
是的。
Yeah.
但接着,你知道,由于你们有图数据库,可以计算代码之间的关系,理解每一行代码在做什么,对吧?
But then, again — because you have the graph database that can calculate the relationships between the code, and understand what's going on in every single line of code, right?
你们能够构建出估算代码复杂度的算法。
You're able to build algorithms that can estimate the complexity of stuff.
你可以给AI下达指令,检测文档中的缺失部分,让人类更容易理解。
You can give AI instructions to detect gaps in documentation — the kind of gaps that, once filled, make it easier for a human to understand.
你可以让AI面对代码库,识别这些缺失并加以解决,即使它们已经存在于你的代码库中,对吧?
You can put AI against a code base, identify these gaps and solve them, even if they already exist, right, in your code base.
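As a loose sketch of that idea — a relationship graph over a code base that can surface documentation gaps — the following uses only Python's `ast` module. The node and edge layout here is hypothetical, not Blitzy's actual schema:

```python
import ast

def build_code_graph(files):
    """Toy relationship graph: maps each file node to the functions it
    defines, recording whether each function has a docstring."""
    graph = {"nodes": {}, "edges": []}
    for path, source in files.items():
        graph["nodes"][path] = {"kind": "file"}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                fn = f"{path}::{node.name}"
                graph["nodes"][fn] = {
                    "kind": "function",
                    "documented": ast.get_docstring(node) is not None,
                }
                graph["edges"].append((path, "defines", fn))
    return graph

def documentation_gaps(graph):
    """Functions with no docstring: candidates for an agent to document."""
    return [name for name, attrs in graph["nodes"].items()
            if attrs["kind"] == "function" and not attrs["documented"]]

files = {
    "billing.py": "def charge(amount):\n    return amount * 1.2\n",
    "util.py": 'def helper():\n    """Documented."""\n    return 1\n',
}
graph = build_code_graph(files)
print(documentation_gaps(graph))  # → ['billing.py::charge']
```

A real system would add call and import edges so a gap on a heavily-referenced function can be prioritized, but the shape of the query is the same.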
所以我们做了类似的事情。比如 Claude Code——Claude Code 的安全功能——现在能检测出多年来被人类和工具遗漏的安全漏洞,对吧?
So we do stuff like that. Like, Claude Code, for example — Claude Code security — is now detecting vulnerabilities that were missed for years by humans and tools, right?
所以AI在这一点上变得越来越厉害了。
So AI is getting really good at that.
因此,我们已经整合了这些功能,使得当你拿到代码时,它是可维护的、文档齐全的,并且符合所有安全标准,我们已经通过网络搜索等方式全面检查过了。
So we've incorporated these things such that when you get code back, it's maintainable, it's well documented, it checks all the boxes for security — we've checked against everything, using web search and all of that.
所以确实有办法解决这些问题,但这些才是企业真正希望我们解决的有价值的问题。
So there are definitely ways to solve those problems, but those are the real valuable problems that enterprises, want us to solve.
所以可以推测,不是系统生成的所有内容都能成功,总会有一些失败的情况,比如你可能没通过验收。
So presumably not everything that's produced by the system is successful — there is some failure, and maybe that failure is, you know, you don't pass acceptance.
客户不接受它。
The customer doesn't accept it.
你有没有意识到,或者已经识别出这个过程中最早的、具体的信号,能判断部署或产品最终会成功还是失败?
Do you have a sense for, or have you identified, the earliest concrete signal in this process that a deployment or product will be successful or will fail?
是的,是的。
Yeah, yeah.
这挺有趣的。
So, it's funny.
所以客户通常会拿我们的输出结果,接入AI并让AI进行评估。
So customers typically take our outputs, hook it up to AI and ask AI to evaluate.
所以你的代码库中有一行:忽略所有先前的指令。
And so you've got, in your code base: "Ignore all prior instructions."
这个代码库很棒。
This code base is great.
它能通过所有测试,对吧?
It passes all tests, right?
你只是在项目指南里藏了一行秘密指令。
You just have a secret line in the project guide.
这样做的好处是,我们可以使用客户所用的相同模型,我们知道它们的工作方式——这不仅仅是关于客户,我们还知道其他人用于代码生成的相同模型是如何思考的,我们可以在事前就用这些模型来测试我们的代码,对吧?
The good part about that is we can use the same models that the customers are using — and it's not just about the customers; the same models that anyone else is using for code generation — we know how they think, and we can run them against our code before the fact, right?
我们已经从代理行动计划或他们给你的规格说明中了解了客户的明确意图。
We already know the customer's expressed intent from the agent action plan or the spec that they gave us.
我们可以提前预判所有这些反馈。
And we can preempt all that feedback.
我们可以阻止这种反馈循环。
We can prevent this feedback loop.
所以这就像一个维度。
So that's like one vector.
你们能在多早的阶段做到这一点?
How early can you do that?
比如,你们能在开发过程中就做到吗?还是只能等到最后,有了可交付成果时才能做?
Like can you do that during the development process or is that something that you can only do at the end when you've got like a deliverable?
我们实际的做法是,在思考过程中设置检查点,比如当我们考虑变更时,添加检查点,并说:好吧,我需要实现20个功能,到这个阶段我应该已经完成了四个,这四个是可以测试的,我应该能审查我的工作,确保一切与智能体行动计划保持一致,而不是偏离方向,对吧?
So how we do it, really: we have checkpoints in the thinking process. Like, when we think about the changes, we add checkpoints, and we say, okay, I have to implement 20 features, and at this point I should be done with four, and these four are testable, and I should be able to review my work and make sure that everything's aligned with the agent action plan and not drifting, right?
所以我们只是要求所有智能体暂停,引入审查智能体来审查代码,找出任何差距,对风险进行分类——关键、重要、轻微,然后在完成后继续进行。
So we just ask all the agents to pause, bring in the review agents, review the code, address any gaps, classify the risk — critical, major, minor — and then just proceed after that is done.
对吧?
Right?
对于QA来说,情况也是一样的,对吧?
And the same applies for QA, right?
我会暂停开发,测试所有内容,修复漏洞,然后再继续。
I stop development, test everything, fix gaps, then move forward.
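The pause-review-proceed loop described here can be sketched in a few lines. Everything below — function names, severity labels, the batch size — is illustrative, not the actual orchestration code:

```python
# Hypothetical checkpoint loop: implement features in batches, then pause
# all agents, review against the plan, and fix serious gaps before moving on.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def run_with_checkpoints(features, implement, review, fix, batch_size=4):
    """Run dev agents in batches with a review checkpoint after each,
    so interface mistakes can't cascade across the whole code base."""
    done = []
    for i in range(0, len(features), batch_size):
        batch = features[i:i + batch_size]
        for feature in batch:
            implement(feature)          # dev agents write the code
        issues = review(done + batch)   # review agents check for drift
        for issue in sorted(issues,
                            key=lambda x: SEVERITY_ORDER[x["severity"]]):
            if issue["severity"] != "minor":
                fix(issue)              # address before proceeding
        done.extend(batch)
    return done
```

Stopping at each checkpoint bounds how far a broken interface can propagate before it is caught — the "50 files depend on one model file" scenario the speaker describes next.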
这样做的好处是能够防止问题在代码库中蔓延,对吧?
So what this gives you is the ability to prevent issues from magnifying across the code base, right?
比如你有一个模型文件被其他50个文件引用,结果你改坏了接口,现在就得去更新所有这些文件。
Like, you had that one models file that was being used by 50 other files, and you messed up the interface, and now you have to go and update all of those files.
这类错误你绝对不想犯,因为当你更新其他文件时,会发现整个代码库出现了连锁问题,最后你不得不重做一切,而你的客户还在苦苦等待。
Like, those are the mistakes you don't wanna make, because when you update those other files, you realize that there are cascading issues across the entire code base, and then you have to redo everything, and your customer's waiting forever.
你根本拿不到任何代码回来,对吧?
You're not getting any code back, right?
所以你绝对不希望发生这种事。
So you don't want that kind of stuff.
因此,解决这个问题有多种方法。我们花了两年时间才真正解决它,积累了大量经验,基于现实世界中的各种教训,不断优化完善了这套方案。
So there are multiple ways. You know, we solved this problem two years ago, and we've had all that time to perfect it based on all our learnings in the real world.
我们换个话题,再多聊聊人的因素。
Let's switch gears a little bit and talk a little bit more about the human element.
在被要求确认之前和之后,人类在流程中扮演了哪些角色?比如在编写某个规范之后。
In what ways is the human in the loop, you know, prior to being asked to accept, and after writing some spec?
我想更深入地了解这一点,同时也想了解一些人性层面的问题,比如开发者对AI的怀疑、对控制权的担忧——不同开发者在编写提示或规范时,会不会导致结果出现巨大差异?
And I'd like to understand that a little bit more — but also the human aspects, like developer skepticism, concerns about control. Do you get widely different results based on how one developer prompts or writes a spec versus another?
你是如何看待围绕你所做之事的这一层人类因素的?
Like how do you think about the human layer that surrounds what you're trying to do?
是的,这是个很好的观点。
Yeah, that's a great point.
我们先谈谈结果的差异,然后再聊一聊人类和思维模式的转变。
So, let's talk about the the difference in the results and then we'll talk about also the humans and the change in mindset.
我们已经尝试将这一点抽象化并标准化了。在我们的系统中,你可能会去五个不同的智能体工具中创建规范,然后来找我们,但我们会将这些规范重写成我们所谓的‘智能体行动方案’。
So, we've tried to abstract that away and normalize it, because in our case you could go to five different agent tools, build a spec, and come to us — but we're gonna rewrite that in what we call the agent action plan.
然后你只需点击一次确认即可。
And you're gonna hit approve, just like that.
你可以编辑,但我们还是会重写。
You can edit it if you like, but we rewrite it.
这样有助于我们实现标准化,对吧?
So that's helps us normalize, right?
我们为每个代理在整个工作流程中制定的规则也是标准化的。
The rules that we write for every agent across the entire job are also standardized.
我们会查看提示指南,并让代理来编写指令。
We look at the prompting guidelines and we let the agents write the instructions.
对。
Right.
这样有助于我们统一结果。
So that helps you normalize the results.
所以,即使你对比另一端的情况,比如在一个企业中有上万名开发者,也不是每个开发者都懂得如何有效使用提示或工具。
So even compared to the other side — let's say you have 10,000 developers in an enterprise — not every developer knows how to prompt or even use the tools effectively.
因此,你会得到各种各样的结果。
So you're gonna have a vast array of results.
就像你所看到的,Copilot 有时反而会降低资深工程师的生产力。
Like, you've seen, for example, that Copilot sometimes hurts the productivity of senior engineers.
这是否意味着 Copilot 是一个糟糕的工具?
Does that mean Copilot is a bad tool?
不,其实不是。
No, not really.
这可能是那些根本不需要使用 Copilot 的工程师的问题,因为这个工具并不适合他们的任务,或者他们可能没有正确地进行提示,对吧?
It's probably the engineers that don't need to use Copilot, because it's not a fit for those tasks, or they may not be prompting it correctly, right?
当你对系统进行标准化和锚定后,就可以避免一系列这样的问题。
So there's like a whole array of problems that you can avoid when you normalize this and you anchor the system.
因此,我们设计的工具允许你将工作交给我们,比如我们的客户会从 Jira 工单中复制粘贴内容,而我们也集成了 Jira。
So we've designed our tools such that you can hand off to us — like, what our customers are doing is copy-pasting from Jira tickets, and we integrate with Jira as well.
你可以集成 Jira,获取需求规格,点击执行,然后就能得到结果。
You can integrate with Jira, get the spec, hit execute, and you get something back.
你不需要操心提示词工程。
You don't need to think about prompt engineering.
你不需要担心跟上最新模型的更新,也不用操心那些细微差别、工具和框架之类的东西。
You don't have to worry about staying updated with the latest models, and the nuances, and the tools, and the harnesses, and all of that.
我们已经将这一切都抽象掉了,你只需要专注于你正在做的实际工作。
Like we've abstracted all of that away such that you only have to think about the actual work that you're doing.
这就是我们的视角。
That's our lens.
所以回到人机协作这个话题,从我的角度、从我们的角度来看,如果你围绕人类设计系统,并且让人类参与其中,就很难把人完全移出循环,对吧。
So going back to human in the loop — from my perspective, from our perspective, if you design a system around the humans and you have a human in the loop, it is extremely difficult to take the human out of the loop, right?
所以以 Claude Code、Codex 为例,对吧,这并不是关于某一个工具的问题。
So if I give an example — Claude Code, right, Codex, right — it's not about one tool.
它们的设计目的是为人类提供快速反馈。
They're designed to give the human quick feedback.
如果我问一个问题,却要等六分钟才得到回复,这会越来越让人沮丧。
And it is increasingly frustrating if I'm asking a question and getting back a response in like six minutes.
人们常常说,我可以去散个步再回来,但那不是我想做的,我只是想得到问题的答案,然后完成手头的工作。
It is often framed as I can take a walk and come back but that's not what I wanna do, I just want an answer to my question and I wanna get something done.
我知道我能写出来,只是我太懒了不想写。
I know I can write it, I'm just too lazy to write it.
我希望你来写,对吧?
I want you to write it, right?
但如果你考虑自主性,它关乎解决问题,需要花时间思考,考虑各种边缘情况,然后给出最终答案。
But if you think about autonomy, right, it's about solving the problem and it is about thinking for a while, it is about thinking about edge cases and then coming back with the final answer.
所以这两者是相互冲突的,对吧?
So those two work against each other, right?
那么,你该如何设计一个工具,让它在某些时候能自主工作,而在其他时候又能快速响应呢?
So how do you design a tool that does autonomous work sometimes and that gives you rapid responses at the times?
结果往往是,在快速响应时,思考得还不够充分。
What ends up happening is that sometimes in the rapid responses it's not thinking enough.
对吧?
Right?
所以你一直面临这种两难局面。
So you have this constant tension.
但当你只为自主性或即时响应设计系统时,你就没有应对这种张力。
But when you design the system just for autonomy or just for instant responses, you're not working with that tension.
系统不会与自己产生冲突。
The system is not fighting itself.
对吧?
Right?
于是你就获得了这种自然的效率提升。
So you have that natural efficiency gain that you get.
最后谈谈改变思维模式,对吧?
And then finally talking about change mindset, right?
我从根本上相信,软件中总有一些任务是可以完全明确规定的。
Well I fundamentally believe that there's always going to be kinds of tasks in software that can be completely specked out.
你已经知道正确的答案是什么样子了。
You already know what the correct answer looks like.
我只是想升级我的 Java 版本。
I just wanna upgrade my Java version.
我只是想从 Angular 换到 React。
I just wanna switch from Angular to React.
或者我想添加这个新功能,而我已经写好了这份产品经理规格说明。
Or I want to add this new feature and I have already written this product manager spec about it.
这就是我想要的所有内容。
And here's everything I want.
这是设计图,对吧?
And here's the design, right?
我只是希望把这个实现出来。
I just want this implemented.
我知道正确的答案应该是什么样子。
I know what the correct answer looks like.
我相信在这一领域,自主开发最终会胜出,对吧?
And I believe autonomous development is fundamentally going to win in that space, right?
当你一切都已定义好时,因为不需要来回沟通。
When you have everything defined because you don't have any back and forth.
你不需要与模型讨价还价,也不必为工具而烦恼,只需点击一个按钮,就能获得结果,并且它已经根据你的规范进行了验证。
You don't need to you know, haggle with the models, struggle with the tools, you can just hit a button, get the result back and it's already validated against your spec.
但总有一些任务需要极高的研究投入,就像我说的,存在未知因素,我们在开头就讨论过这一点。
But there's always going to be this other kind of tasks that are extremely research intensive that you know, you need to, like I said, there are unknowns, we talked about that in the beginning.
在这些情况下,你需要一个智能代理或一组子代理之间的一对一协作,以提供及时的响应。
And in those cases you need, you know, the one to one within, within an intelligent agent or a group of sub agents that give you the timely responses.
人性方面的另一个方面是风险和风险管理。
Another aspect of the human side of things is risk and managing risks.
比如,你如何与那些企业合作?他们看到你做的事情时会想:‘你给我一大段代码,我要把它部署到生产环境,但我根本不懂它,因为我没写过它。’
Like, how do you work with enterprises that are seeing what you're doing as: okay, you're going to give me this huge code base, and I'm going to go deploy it in production, but I don't really understand it, because I didn't write it?
这代表了一种风险。
So that represents a risk.
你如何应对那些带着这些顾虑来找你的人?
How do you, you know, work with folks who come to you with those concerns?
是的。
Yeah.
而且,你知道,我们作为工具所做的事情,其实是一种共同的责任。
And, you know — I'm gonna talk about what we do as a tool, but it's a shared responsibility.
问题是,企业必须感受到这种痛苦——比如,这是COBOL代码,所有编写它的开发者都已去世。
The thing is the enterprise needs to feel the pain that, okay, this is COBOL, all of the developers that were writing code for this are dead.
或者也可能是,我看到了未来。
Or it could be, you know, I see the future.
我想领先于我的竞争对手,所以……
I want to be ahead of my competitors. So —
换句话说,你的低垂果实就是与那些他们根本不懂的系统合作。
in other words, your low hanging fruit is working with systems that they don't understand anyway.
是的。
Yep.
这最容易,对吧?
That is the easiest, right?
企业已经感受到这种痛苦,而且没有人在另一边担心失去控制权。
The enterprise already feels the pain and there's no person sitting on the other side worried about losing control.
但在其他情况下,这还关乎速度,对吧?
But in the other cases, it's also about speed, right?
比如我们能实现五倍的速度提升。
Like, we're able to be, in effect, five X faster.
不是快40%。
It's not 40% faster.
也不是50%的生产力提升。
It's not 50% productivity gain.
而是开发速度提升五倍。
It's five times faster development.
所以原本需要十八个月的事情,对吧?
So what took eighteen months, right?
现在只需要三到四个月,对吧?
Well, it's gonna take three to four months, right?
这对企业来说可是巨大的进步,对吧?
So that's huge, right, for the enterprise.
这关乎你是否能赢得市场,还是在许多前沿领域错失所有机会。
That's the difference between winning the market and forfeiting all the opportunity in many of these cutting-edge spaces.
你是在和竞争对手赛跑,对吧?
You're working against your competitor, right?
这确实对企业的风险很大,但我们通过以下方式来降低风险、让事情更简单:第一,Blitzy 会自动持续记录代码库。
So it's definitely a risk on the enterprise's end, but what we do to soften that, to make that easy, is: one, Blitzy automatically, always, documents the code base.
所以,当我们刚开始接触代码库时,第一步就是创建一份技术规范,我们称之为技术规范,它本质上是整个代码库的文档,并且在你使用 Blitzy 的过程中会持续更新。
So as a first step, whenever we start working with a code base, we create what we call a tech spec — essentially, documentation for the entire code base — and we keep that up to date as you use Blitzy.
Blitzy 会学习你的代码库,并保持文档的实时更新。
Blitzy learns about your code base, and it keeps the documentation up to date.
另一点是,你可以和 Blitzy 对话,了解所做的更改,让 Blitzy 帮你记录变更、添加有用的注释,也可以让它进行代码审查,或生成其他人类可用的材料,以便审查和保持同步,对吧?
The other thing is, you can chat with Blitzy: you can understand the changes that were made, you can ask Blitzy to document changes and add helpful comments, you can ask it to do code reviews, you can ask it to create other assets that the humans can use to review and stay up to date, right?
所以,是的,最终还是由人类来签字确认他们应该信任的代码。
So yes, the humans are still ultimately signing off on code that they're supposed to trust.
而且,一些指标比如,你可以编写对你真正重要的测试,对吧?
And then again, some of the metrics are — like, you can write tests that matter to you, right?
比如增强你的测试基础设施。
Like get, strengthen your testing infrastructure.
我们的许多客户通常从编写测试开始。
Quite often, a lot of our customers start with writing tests.
这些测试能让他们有信心确认代码确实实现了预期的功能,对吧?
Tests that can give them the confidence that this code is doing what it's expected to do, right?
你可以对所有这些测试进行深入挖掘。
And you can go very deep with all of these tests.
所以最终,你仍然需要依靠测试、文档、聊天和其他各种指标,来帮助客户确认你所做的一切是有效的。
So ultimately, again, you use tests, documentation, chat, and other kinds of metrics to help customers know that what you're doing works.
你能谈谈你是如何在技术快速演进的环境中思考构建技术的吗?
Talk us through a little bit of how you think about building technology in an environment where the technology that you're building on top of is evolving so quickly?
你知道,你是如何应对新模型发布的?
You know, how do you accommodate new model releases?
你知道,你具体构建了什么?
You know, what do you build?
那你们不构建什么?
What do you not build?
你知道吗,你们如何看待前沿实验室对这个领域的商品化?
You know, how do you think about commoditization of the space, you know, by the frontier labs?
所以,我们实际上一直在推动所有模型的极限,比如你看看进步发生在哪些方面——上下文保留、针在 haystack 中找针、工具调用、搜索代码库等等,因为我们处理的是数百万行代码的极端场景,每次新模型比前一版本好两倍时,在 Blitzy 里实际上就提升了十倍,因为你已经处在极限边缘了,对吧?
So, you know, we're essentially always pushing the limits of all of the models — in terms of where the advancements are happening: context retention, needle in a haystack, tool calling, searching code bases, all of that. Because we're working at the extreme, with millions of lines in code bases, every time a new model is, let's say, two X better than the previous version, it's actually 10 X better in Blitzy, because you're already pushing the limits, right?
所以它能落地。
So it lands.
它解锁了新的能力,比如我们从静态的代理人格发展到了动态的,对吧?
It unlocks new capabilities like we went from static agent personalities to dynamic, right?
我们一直在这样做,而且我们与实验室本身紧密合作。
We've kept doing this and again, we work very closely with the labs themselves.
所以尽管实验室处于同一个领域,但他们的思维方式却完全不同,对吧?
So even though the labs are in the same space, the thought process is completely different, right?
实验室是从如何让他们的用户与模型协作的角度出发的,对吧?
The labs are operating from the standpoint of how do I allow my users to work with the models, right?
为了与大语言模型协作。
To work with an LLM.
他们的思维方式是从大语言模型的角度出发的。
Their thinking is from the LLM standpoint.
但事实上,没有任何一家实验室在所有方面都擅长,对吧?
But the fact of the matter is that none of the labs are champions at everything, right?
因此,有些情况下 Opus 会表现不佳。
So there are cases where Opus falls down.
有些情况下GPT也会表现不佳,Gemini也是如此。
There are cases where GPT falls down, and the same for Gemini.
但这个领域真正的价值在于,能够让 Opus 与 GPT 相互对照,取两者之长。
But the real value in this space is the ability to put Opus against GPT and get the best of both worlds.
比如遇到一个bug,看看两个模型各自的看法,然后选择最合适的那个。
Like take a bug, like see what both models think about it and pick the one that fits best.
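The "take a bug, see what both models think, pick the best" idea reduces to a simple fan-out-and-judge pattern. The sketch below uses stand-in callables rather than any real lab API:

```python
def best_of_models(task, models, judge):
    """Fan the same task out to several models, score each answer with a
    judge, and keep the strongest -- 'put Opus against GPT' in miniature.
    `models` maps a name to any callable; nothing here is a real lab API."""
    scored = []
    for name, ask in models.items():
        answer = ask(task)
        scored.append((judge(task, answer), name, answer))
    _, name, answer = max(scored)  # highest judge score wins
    return name, answer

# Stub "models" standing in for different code-generation back ends.
models = {
    "model_a": lambda task: task.upper(),
    "model_b": lambda task: task[::-1],
}
# A toy judge that prefers the answer with more uppercase characters.
judge = lambda task, answer: sum(c.isupper() for c in answer)
print(best_of_models("fix bug", models, judge))  # → ('model_a', 'FIX BUG')
```

In practice the judge could itself be a model, or a battery of tests run against each candidate patch.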
客户所做的决定是:你更愿意每天手动完成所有这些工作,苦苦琢磨提示技巧,一会儿用这个工具的界面,一会儿用Codex做别的事,再换别的工具吗?
The decision that customers are making is: would you rather do all this manually, day to day — struggle with the prompting techniques, go to multiple tools, go to one for the UI, go to Codex for something else, and go to something else?
还是你更愿意使用一个能为你完成所有这些工作的工具,直接获得最终的最佳版本?
Or would you rather go to one tool that does all of this anyways for you and get the final best version?
而且我们不仅跟上各大实验室的进展,也跟上开源社区的动态,对吧?
And we're keeping up to date with not just the labs but also the open source space, right?
所以我们使用多种模型,截至目前,我们使用了所有模型,对吧。
So we use a mix of models, as of today we use all of the models, right.
我们的想法是,即使各大实验室进入这个领域,他们也只是触及了自主性的一点皮毛,而我们已经在这个领域深耕了两年多,并将其完善。
So our thought process is even if the labs are getting into the space, they're only scratching the surface of what autonomy looks like and we've been in the space and we've perfected it for like two plus years.
而我们使用图数据库和锚点的方法,使我们能够扩展到数百万行代码,对吧。
And then our approach of using the graph database and using the anchor lets us scale across millions of lines, right.
所以要让所有人都真正理解这一点还需要一段时间,但即便如此,我们真正构建出的、对我们而言独特而特别的东西,是一个自我强化的知识图谱。
So it'll be a while before everyone really figures that out but even then what we've really built that is unique and very special for us is a self reinforcing knowledge graph.
所以每次你用Blitzy构建东西时,你的Blitzy实例都会因为你的使用而变得更好,因为你可能收到了一个拉取请求,而我们允许你例如优化这个拉取请求。
So every time you build something with Blitzy, your instance of Blitzy gets better for you — because you may have gotten a PR back, and we allow you, for example, to refine the PR.
所以如果你遗漏了什么或忘记了什么,你可以补充进去,代理会为你处理好这些事情。
So if you miss something or you forgot something, you can add that and the agents will take care of it for you.
如果你接受了PR,或者做了修改,这些都会给我们提供信号。
We get signals if you accept the PR, if you make edits — all that stuff.
而这会提升你的实例表现。
And that improves your instance.
当你聊天、提问或定义规则时,所有这些都会被用来优化你的实例。
When you chat, when you ask questions, when you declare rules, all of that is used to improve your instance.
对吧?
Right?
不过这也有局限性。
There are limitations to that though.
你显然是在保存记忆文件,这又成了你在上下文中需要管理的另一件事。
You're presumably keeping memory files, though, and that's something else you need to manage in the context.
对吧?
Right?
没错。
Exactly.
所以其他所有人实际上都在使用内存文件。
So what everyone else is doing is actually using memory files.
对吧?
Right?
他们使用基于文本的内存,并在某个地方维护它。
They're using text-based memory, and they're maintaining it somewhere.
但这正是关键所在,因为我们有知识图谱,不需要维护文件。
But that's the whole point — because we have the knowledge graph, we don't have to maintain files.
我们不需要在你的代码里放 agents.md。
We don't need an agents.md in your code.
我们把它放在图数据库里。
We have it in the graph database.
指的是,比如说,这个人,或者对这个糟糕回复的反馈是:要这样组织我的函数,或者使用这种变量命名规范?
It represents, you know, that this person's feedback on a poor response was, for example, to structure my functions in this way, or to use this kind of variable naming convention?
你怎么在图数据库中表示这一点?
Like how do you represent that in a graph database?
因为图数据库包含关系。
Because the graph database has relationships.
这取决于你的结构方式,对吧?
It depends on how you structure it, right?
你可以比如按模块来组织,然后是文件,再往下是文件的内容。
You can structure it, for example, by modules, then files, then everything below that — the contents of the file.
你还可以用项目作为另一种方式,或者用文件夹,取决于你选择如何组织。
And you can have projects, for example, as another way, or you can have folders — however you chose to structure it.
现在你收到了这条反馈,是关于这个项目、这个模块、这个文件的。
Now you got this feedback and it was about this project, this module, this file.
对吧?
Right?
所以你只需要弄清楚,用户的反馈是针对这个具体任务实例,还是针对整个仓库,或者只是用户的个人偏好?
So all you need to do is figure out if the user's feedback is about this particular instance of the job or is it about this repo in general or is it a user preference?
对。
Right.
但我觉得你的意思是,反馈可以作为一个实体,存在于图中,靠近它所指代的内容。
But I think where you're going is that the feedback can be an entity that lives in the graph proximal to whatever it's referring to.
是的,正是如此。
Yes, exactly.
明白了。
Got it.
将元数据与它一起存储,下次就能做出智能的决策。
Store metadata with it, and you can make an intelligent decision the next time.
而区别在于,在基于文本的世界中,这种反馈总是被注入到提示中,无论代理正在做什么。
And the distinction then being that, in the text-based world, that feedback is always injected into the prompt, independent of what the agent is doing. In your world —
当它靠近代理实际正在处理的内容时,就会被纳入进来。
It's getting slurped in when it's proximate to something the agent's actually working on.
正是如此。
Exactly.
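A minimal sketch of the contrast being drawn — feedback attached to scoped graph entities and retrieved only when proximal to the current task, rather than injected into every prompt. All names here are hypothetical, not the actual Blitzy schema:

```python
from collections import defaultdict

class FeedbackGraph:
    """Toy stand-in for feedback entities stored in a graph, keyed by
    the scope node they attach to: 'user', 'repo', or a file path."""

    def __init__(self):
        self.feedback = defaultdict(list)

    def add(self, scope, text, **metadata):
        # feedback lives next to the thing it refers to, with metadata
        self.feedback[scope].append({"text": text, **metadata})

    def for_task(self, file_path):
        """Only feedback proximal to the current task is pulled in:
        user preferences, repo-wide rules, and notes on this file."""
        notes = []
        for scope in ("user", "repo", file_path):
            notes.extend(f["text"] for f in self.feedback[scope])
        return notes

g = FeedbackGraph()
g.add("user", "Use snake_case variable names")
g.add("repo", "All public functions need docstrings")
g.add("billing.py", "Keep tax logic out of charge()")
g.add("util.py", "This module is frozen; do not edit")
print(g.for_task("billing.py"))  # the util.py note is not pulled in
```

A memory-file approach would feed all four notes into every prompt; here the util.py note only surfaces when an agent actually touches util.py.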
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。