本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
衷心感谢 Blitzy 对本播客的支持以及本集的赞助。
A big thanks to Blitzy for supporting the podcast and sponsoring this episode.
想将软件开发速度提升五倍吗?
Want to accelerate software development velocity by five x?
你需要 Blitzy,它能将自主软件开发引入你的企业代码库。
You need Blitzy, which brings autonomous software development to your enterprise code base.
你的工程师只需声明意图,Blitzy 的代理就会分析你的代码库并生成代理执行计划。
Your engineers declare intent and Blitzy agents map your code base and generate an agent action plan.
获得批准后,Blitzy 会开始工作,自主生成数十万行经过端到端测试的验证代码。
Once approved, Blitzy gets to work, autonomously generating hundreds of thousands of lines of validated end to end tested code.
单次运行即可完成超过 80% 的工作量。
More than 80% of the work completed in a single run.
Blitzy 不仅是在生成代码,更是在以计算速度开发软件。
Blitzy is not just generating code, it's developing software at the speed of compute.
立即前往 blitzy.com/twiml 亲身体验 Blitzy。
Experience Blitzy firsthand at blitzy.com/twiml.
那就是 blitzy.com/twiml。
That's blitzy.com/twiml.
我们采取的方法是动态招募多个智能体群,并将数据库作为编排层的一部分。
The approach that we took has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer.
你可以招募数以万计的智能体,而无需担心一个负责追踪所有活动的单一编排器。
And you can recruit tens of thousands of agents, but not have to worry about this single orchestrator that's keeping track of everything that's happening.
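The database-as-orchestration-layer idea can be sketched in a few lines. Below is a minimal, hypothetical Python sketch using SQLite (the table layout and function names are invented for illustration, not Blitzy's actual design): agents claim tasks with an atomic status update on a shared table, so no single orchestrator process has to track every agent.

```python
import sqlite3

# Hypothetical sketch: a shared table serves as the orchestration layer.
# Thousands of agents can claim work by atomically flipping a row's
# status, so no single orchestrator has to track every agent.
def make_queue(specs):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, "
               "spec TEXT, status TEXT DEFAULT 'pending', owner TEXT)")
    db.executemany("INSERT INTO tasks (spec) VALUES (?)", [(s,) for s in specs])
    return db

def claim_task(db, agent_id):
    # Pick a pending task, then claim it only if it is still pending;
    # the status check in the UPDATE keeps the claim safe when many
    # agents race for the same row.
    while True:
        row = db.execute("SELECT id, spec FROM tasks "
                         "WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            return None  # no work left
        claimed = db.execute("UPDATE tasks SET status = 'claimed', owner = ? "
                             "WHERE id = ? AND status = 'pending'",
                             (agent_id, row[0])).rowcount
        if claimed:
            return row
```

Any number of workers can call `claim_task` against the same database; the `AND status = 'pending'` guard prevents a double claim.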
我们已经成功应用了这种方法,经常编写数十万行、数百万行代码。
We've been able to apply that successfully and we frequently write hundreds of thousands of lines, millions of lines of code.
所有代码都能成功编译。
Everything compiles.
所有代码都能正常运行。
Everything runs.
所有测试都通过了。
All tests pass.
用户界面也能正常工作。
The UI works.
它完美无瑕。
It's pixel perfect.
我们真的已经完善了这一点。
And so we've perfected that, really.
好了,各位。
Alright, everyone.
欢迎来到又一期的twiml.ai播客。
Welcome to another episode of the twiml.ai podcast.
我是您的主持人萨姆·查林顿。
I am your host, Sam Charrington.
今天,我邀请到了西达汉特·帕尔德希。
Today, I'm joined by Siddhant Pardeshi.
西达汉特是Blitzy的联合创始人兼首席技术官。
Siddhant is co founder and CTO of Blitzy.
在开始之前,请记得在您收听本节目的平台点击订阅按钮。
Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show.
欢迎来到播客,西德。
Welcome to the podcast, Sid.
谢谢,萨姆。
Thanks, Sam.
很高兴能来这里。
Glad to be here.
我一直是你们的听众。
I'm a longtime listener.
我从2019年就开始收听了。
I've been listening since 2019.
这太棒了。
That's amazing.
听到这些真的让我非常高兴。
And it is so great to hear.
我很期待见到你,也非常期待深入了解你在Blitzy的经历,你们正在从事自主开发的工作。
I am excited to meet you and, really looking forward to digging into your experiences at Blitzy, where you're working on, autonomous development.
那我们直接切入正题吧,先聊聊你的背景。
So let's, let's dig right in, but start by talking a little bit about your background.
你在创办Blitzy之前是在英伟达工作吗?
You were at Nvidia before you started Blitzy?
是的,我从2016年1月就开始在英伟达工作了。
Yeah, I was at Nvidia since January 2016, and back then...
我入职那天,英伟达的股价市值是320亿美元。
The day I joined, Nvidia's stock was worth $32 billion.
那就是英伟达当时的市值。
That was Nvidia's market cap.
320亿美元。
$32 billion.
我觉得今天Anthropic的营收已经超过这个数字了。
And I think Anthropic's revenue today is more than that.
那是一段很特别的经历,你知道,在那个时候的英伟达,公司结构很完善。
It was quite an experience, you know, being at Nvidia at that time. And Nvidia was structured...
我不知道他们现在是否还是这样,但在我从2016年到2022年任职期间,它运作得非常像一家初创公司。
I don't know if they still are, but it functioned very much like a startup for the entire time that I was there, from 2016 to 2022.
当‘注意力机制就是你所需’这篇论文发布时,我就在现场,当时我正在为英伟达在生成式AI领域发明一些东西。
And, you know, when the "Attention Is All You Need" paper dropped, I was right there. I was inventing things for Nvidia in the generative AI space.
我深入研究过GANs(生成对抗网络)和各种自编码器,也接触过自然语言处理。
I was deep into GANs or generative adversarial networks and various auto encoders and I was brushing with NLP.
那还是相当早期的阶段,当时你有BERT,我们用BERT来做翻译之类的事情。
It was still quite early, you know. You had BERT, and we were using BERT for things like translation.
但Transformer是一种突破性的技术。
But the transformer was groundbreaking tech.
当我意识到它潜在的巨大可能性,同时又获得了去哈佛商学院攻读MBA与MSI联合硕士项目的机会时,我选择了后者。
And, eventually when I realized the potential of what it could do and simultaneously had an opportunity to go to HBS, to do a joint master's program in an MBA and an MSI, I chose that.
我在哈佛商学院认识了我的联合创始人兼CEO布莱恩,我们基于‘AI终将赶上人类’这一理念创立了Blitzy。
And I met Brian at HBS, my co founder and CEO, and we decided to form Blitzy based on the idea that AI will catch up eventually with humans.
我们做出这个决定时,上下文窗口还只有大约10,000个token,模型连像样的代码都很难写出来。
And you know, we made this bet back when the context window was about 10,000 tokens and it could barely write usable code, right?
但我们赌的是,AI在编写代码方面将与人类一样优秀,甚至更胜一筹。
But we made this bet that AI is gonna be as good if not better than humans at writing code.
软件开发中将有一部分不再仅仅是代码生成,而是整个软件工程都将被自主开发完全自动化。
And there'll be a section of software development that's not just about code generation but entire software engineering that will get completely automated by autonomous development.
这正是Blitzy的核心所在。
And that's what Blitzy is all about.
确实,如今AI影响最大的领域之一就是软件开发。
It certainly is true that one of the areas where AI is having the most impact today is in software development.
当你思考软件开发时,你有没有一种方式来对这个领域和机会进行分类?
When you think about software development, do you have a way that you taxonomize the space and the opportunity?
我认为软件开发是应用AI的最佳机会,原因在于软件是可以验证的。
So I think software development is the best opportunity space to apply AI. And the reason for that is because software is verifiable.
它是可编译的,也是可测试的。
It's compilable, it's testable.
你可以可视化它,并且存在一个明确的正确答案。
You can visualize it and there is the concept of a correct answer.
可能有多种正确答案,但确实存在正确和错误的答案,这一点在其他领域并不总是如此。
There could be many correct answers, but there are correct answers and wrong answers, which is not always the case in other domains.
对吧?
Right?
所以非常重要的是,要意识到这一点。
So it's super important to realize that.
然后,如果你想想这个领域本身,我想我们所有人都是从AI辅助开发开始的,对吧?
And then if you if you think about the space itself, I think we all got started with AI assisted development, right?
你曾经有代码助手,现在你有了命令行界面和集成开发环境,以及内置AI辅助的工具。
You had copilots; today you have CLIs and IDE tools with embedded AI assistance.
它们都具备异步执行任务的能力。
And they all have the ability to, for example, do, tasks asynchronously.
比如,你可以给它一个任务,即便是 AI 也可能要花上几个小时才能完成,它会思考一会儿,异步运行,然后向你提出后续问题等等。
Like, for example, you can give it a job that will take even AI maybe hours to complete, and it will think for some time, go off asynchronously, ask you follow-up questions and whatnot.
而这个领域还有另一个部分是关于自主开发的。
And then you have another part of the space which is about autonomous development.
这一类别中有许多工具。
There are tools in this category.
我相信,Cognition公司的Devin就属于这一类别。
There's, I believe, Devin from cognition that falls into that category.
我们就在这个领域开展业务。
We operate in that category.
这里的理念是:你点击构建,出来的就是一个 PR。
And the idea here is that you hit build, and out comes a PR.
对吧?
Right?
生成的PR已经经过测试和验证,所有功能都能正常运行,完全符合你的预期。
The PR that comes out is, already tested, validated, everything works and it's exactly how you intended it to be.
对吧?
Right?
根本不会出错。
There are no errors.
代码是可以接受的。
The code is acceptable.
对吧?
Right?
所以,无论在哪一边,我们面临的最大挑战都是代码的接受度,对吧?
So the biggest challenge, that we have on both sides of the spectrum is code acceptance, right?
你可以写出大量代码,而代码现在已经成为一种商品。
You can write a lot of code, and code is a commodity now.
让AI写代码非常简单。
Like getting AI to write code is very easy.
获得任何代码都很容易。
Getting any code is easy.
但要获得符合你标准、真正优质、安全可靠、可以直接上线的代码,那就是完全不同的故事了,对吧?
Getting code that follows your standards, code that is really good, code that is secure, code that is ready for production: that's a completely different story, right?
因为一方面,你有那些全新的项目或可以从零开始构建的新产品,AI在这方面非常擅长。
Because you have, on one hand you have these greenfield builds or like new products that you can build from scratch and AI is really good at that.
所以如果你看看实验室发布的演示,嘿,我做了这个游戏,看起来太棒了,简直不敢相信。
So if you look at the demos that the labs put out: hey, I built this game and it looks amazing, I can't believe it.
但当你把同样的AI用在企业代码库上时,它反而会搞砸。
But then when you put the same AI on an enterprise code base that it's supposed to work with, it messes it up.
这要困难得多。
It's a lot more challenging.
这要难得多。
It's way more challenging.
这是一个数量级更高的问题,因为AI需要处理海量信息和众多条件,导致工具失效。
It's an orders-of-magnitude harder problem, because the AI is dealing with so much information and so many conditions, and that causes tools to fail.
因此,自主化这一端的挑战要大得多,因为你必须同时解决所有这些问题,并以代码是否被接受作为最终衡量标准。
So the autonomous part of the spectrum is a much harder challenge, because you have to simultaneously address all of these items and work toward acceptance as your final metric.
所以从接受度倒推,从AI编写代码这一端来看,另一边一定存在某种规范,代码必须满足这些规范才能被接受。
And so thinking back from acceptance through the agent, the AI writing some code, on the other side of that, there's gotta be some specification that, the code has to meet in order to be accepted.
你是不是把编码的所有复杂性都推到了规范制定上?
Are you essentially pushing all the complexity of coding into spec development?
这是个很好的观点。
That's a that's a great point.
所以,是也不是。
So yes and no.
让我解释一下‘是’的部分:如果你能写出规范,那你就应该写规范,对吧?
Let me explain the yes part like, if you could write a spec, then you should write a spec, right?
我们所熟知和喜爱的所有工具都有计划模式。
It's, all of the tools we know and love have plan mode.
它们最近刚刚完成了这一功能。
You know, they've recently completed that.
大家已经意识到这一点了。我们在2023到2024年开发Blitzy时就发现了,规范开发确实能帮助代理更好地锚定自身。
Everyone's realized that; we saw it back in 2023, 2024 when we built Blitzy. Spec development really helps the agents anchor themselves.
但人们马上会意识到,规范本身还不够好,因为你还需要让代理遵循其他一些通用规则。
But again, what people realize immediately is that the spec is not good enough because then you have these other general rules that you want agents to follow.
传统上,人们会使用像 agents.md 这样的文件,再加上技能(skills)等其他内容,并尽量保持规范的轻量,因为这些模型过一段时间就会忘记内容,对吧?
And traditionally, what people have done is use things like agents.md, add skills and other stuff, trying to keep the spec lightweight, because these models tend to forget after a period of time, right?
或者当你进行压缩之类操作的时候。
Or if you go through compaction and stuff like that.
所以有一类任务,如果你能为它写一个规范,清楚它应该做什么,知道它需要满足的所有条件,那么写规范确实很好。
So there's that part: if you have a task that you can write a spec for, where you know what it should do and you know all of the conditions it needs to satisfy, then yes, writing a spec is great.
但还有一类任务,其依赖关系并不明确。
But then you have this other class of tasks where it's not really clear what the dependencies are.
比如,你不知道后端数据库的模式是什么样子。
Like for example, you don't know what the schema for the backend database looks like.
你无法为它写规范,因为你不知道有哪些约束条件,对吧。
And you can't write a spec for it because you don't know what the constraints are, right.
你也不能仅仅信任AI去‘好吧,先弄清楚模式,然后再做X’,因为当你弄清楚模式时,你会获得新信息,对吧?
And you you can't just trust the AI to, okay, figure out the schema and then do X because you're gonna get new information when you figure out the schema, right?
而这些新信息会影响你所编写代码的决策和架构,对吧?
And that's gonna affect the decision and the architecture of the code that you're writing, right?
因此,正因为如此,你总是处于一个连续谱系中:你与代理一对一协作,它为你提供更多信息,而你则帮助它做出决策,对吧?
So because of that, you always have this spectrum where you're working one to one with the agent, it's giving you more information and you're helping it make decisions, right?
所以,如果未来你能构建出更智能的模型,这些模型可能像人类一样,甚至比人类更擅长做出架构决策,那么是的,你可以有一整类工作专注于编写规范,并引导那些能力较弱、成本更低、速度更快的代理来编写代码。
So in the future, if you can build more intelligent models that are maybe human-like, or better than humans, at making architectural decisions, then yes, you can have this entire class of work that's focused on writing specs and guiding other, maybe less capable, cheaper, faster agents to write code.
这同样也是未来一个令人兴奋的机会。
And that also is an exciting opportunity for the future.
为了确认我理解正确,我想你是在说:是的,规范很重要,因为如果你把规范写对了,它就能为代理提供一个锚点,让代理能产出更好的代码;但不,如今规范还不够,因为在开发过程中会存在一些假设、未知因素和不断变化的东西。
So just to replay that to make sure I understand: I think what you're saying is, yes, the spec is important, because if you get the spec right, that anchors the agent and the agent can produce better code; but no, today a spec isn't sufficient, because there are assumptions and unknowns and things that evolve during the course of development.
因此,与其把所有东西都推到规范里,我听到的是,在开发过程中仍然需要大量的人工参与,这引出了一个问题:A,这是不是准确表达了你的意思?
And so rather than pushing everything to the spec, what I heard in there was that there's still a lot of human-in-the-loop during development, which raises a question. Well, A, is that right?
这准确捕捉到了你的观点吗?
Is that capturing what you're saying?
但同时,B,你经常提到‘自主开发’这个概念。
But also then B, you know, you talk a lot about this idea of autonomous development.
如果人类始终参与其中,那开发到底有多自主呢?
If the human is in the loop, how autonomous is the development?
你是如何思考这种区别和细微差别的?
How do you think about that distinction and nuance?
是的,这是个非常好的问题。
Yep, yeah, that's a fantastic question.
所以,关键是,即使在今天,我们这样来看。
So, the thing is, even today. Let's frame it this way.
所以,即使你有一个很好的规格说明,你仍然要花大量时间来编写它,但如果这是一个复杂的规格说明,可能涵盖五万到十万行代码,对吧。
So today, even if you have a great spec, and you spend a lot of time writing the spec, say it's a complicated spec that maybe covers 50,000, 100,000 lines of code, right?
或者企业级项目通常就是这个规模。
Which is the scale enterprise projects are often at.
如果你想迁移,比如为一个大型代码库升级Java,对吧。
If you wanna migrate, if you wanna upgrade Java for a large code base, right.
或者你想在复杂的后端上添加一个用户界面。
Or if you want to add a UI on a complicated backend.
这些都是跨越多个文件的巨大变更。
Those are huge, huge changes across multiple multiple files.
你可以有一个规范,写一个规范,然后把它交给智能代理,也就是你最喜欢的 CLI,也许是 Claude Code 或别的什么。
You can have a spec, you can write a spec, you can give it to the agent, your favorite CLI, maybe Claude Code or whatever.
它会花时间执行,但在某个时刻,它会遇到一个需要人类回答的问题。
It's gonna spend time executing, but at some point it's going to run into a case where it has a question for the human.
当某些事情不明确时,它需要做决定,或者它会经历多次上下文压缩,因为它只有大约一百万的上下文标记,而压缩后的输出质量会大打折扣。
Where something is not clear and it needs to make a decision, or it's going to run through several context compactions, because it only has like 1,000,000 tokens of context. And the quality of the output after compaction is not the same.
所以信息会丢失。
So information will be lost.
是的,信息会丢失。
Yes, information will be lost.
它必须这么做。
It has to do that.
它确实非常聪明地尝试保留所有相关信息,但并不完美,对吧?
It does do a very intelligent job of, you know, trying to retain all relevant information, but it's not perfect, right?
因为解决这个问题真的很难,即使你做到了,也无法保证不会丢失任何重要的内容。
Because it's really hard to solve that, and even if you do, there are no guarantees it won't lose anything that's important.
然后,如果它花时间回头去重新获取那些丢失的内容,很可能因为数据量太大、标记过多,再次导致上下文溢出,陷入循环。
And then if it does spend time going back to retrieve the thing it lost, chances are it's so big in size and volume of tokens that it's gonna overload the context again. It's stuck in a loop.
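As a toy illustration of why compaction loses information, here is a minimal sketch; the word-count token counter and the summary stub are stand-ins for what real tools do with a model:

```python
# Toy sketch of context compaction: once the transcript exceeds a token
# budget, the oldest messages are dropped and replaced with a summary
# stub. Real agents summarize with a model; the point here is just that
# whatever detail lived in the dropped messages is gone.
def compact(messages, budget, count_tokens=lambda m: len(m.split())):
    total = sum(count_tokens(m) for m in messages)
    dropped = 0
    while total > budget and len(messages) > 1:
        total -= count_tokens(messages.pop(0))
        dropped += 1
    if dropped:
        messages.insert(0, f"[summary of {dropped} earlier messages]")
    return messages
```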
这就是为什么Anthropic会推出这样一个庞大的项目——一个C语言编译器。
So that's why you had, for example, Anthropic put out this huge project that is a C compiler.
而这个仓库最热门的问题(虽然不是第一个)是:这个仓库根本不该存在,因为‘Hello World’在这编译器上根本无法编译,对吧?
And the most popular issue, not the very first, but the most popular issue on that repo, is that this repo should not exist, because hello world does not compile on this compiler, right?
所以当你试图把所谓的‘拉尔夫·维古姆循环’应用到那些原本并非为此设计的现有工具上时,就会遇到类似的问题。
So you have problems like that when you try to apply what is called the Ralph Wiggum loop to existing tools that are not designed for that.
‘拉尔夫·维古姆循环’本质上就是反复运行同一过程,直到得到你想要的正确答案。
The Ralph Wiggum loop is essentially just running the same thing again and again till it gets to the correct answer.
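That loop is simple enough to sketch directly. In this hypothetical version, `generate` stands in for an agent invocation and `verify` for a compile-and-test step:

```python
# Minimal sketch of the Ralph Wiggum loop: run the same generate-then-
# verify step over and over until the verifier passes or the attempt
# budget runs out. `generate` stands in for an agent run, `verify` for
# compiling the code and running its tests.
def ralph_wiggum_loop(generate, verify, max_attempts=10):
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts
```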
我想强调的是,这不仅仅是给AI一个规范的问题,而是关乎上下文工程和智能体工程。
The point I'm trying to make is: it's not just about giving the AI a spec; it's all about context engineering and agent engineering.
上下文工程就是指在恰当的时候,给AI提供恰到好处的上下文信息。
So context engineering is about giving the AI the right amount of context at the right time.
问题在于,当在企业规模上扩展时,面对成百上千的开发者,每个人使用工具的效率并不一致。
The problem is that at scale, across the enterprise, when you have hundreds and thousands of developers, not everyone is using the tool with the same level of efficacy.
所以,所有的 CLI 工具,Codex、Claude Code 等等,都需要大量的配置工作:你是否连接了正确的 MCP?是否使用了正确的技能?是否使用了正确的提示词?
So all of the CLI tools, Codex, Claude Code, you name it, require a significant amount of setup: have you connected the right MCP, are you using the right skills, are you using the right prompts?
适用于Anthropic的提示词对OpenAI并不有效。
The same prompts that work for Anthropic don't work for OpenAI.
比如,OpenAI在训练时并不使用XML标记,但Anthropic会使用,对吧?
Like for example, OpenAI doesn't use XML tokens in their training, but Anthropic does, right?
所以,如果你对 OpenAI 使用 XML 标记,或者在提示词里"大喊大叫"(对 Claude 有时确实得这么做,你得冲 Claude 喊它才肯听),那是一种低效的策略。但人们普遍认为,GPT-5.3 在很多方面比 Opus 做得更好,对吧?
So if you use XML tokens with OpenAI, or if you shout in your prompts, which you have to do with Claude at times, you have to shout at Claude to get it to listen to you, it's an ineffective strategy. But it's widely held that GPT 5.3 gets many things right that Opus does not, right?
因此,这里涉及大量复杂的智能体工程。
So there's all of this complex agentic engineering.
这就是智能体工程的部分:为特定任务招募合适的智能体,搭配恰当的提示词和工具,并进行精准的提示工程,对吧?
So that's the part that is agentic engineering where you recruit the right agent with the right set of prompts and tools, with the right level of prompt engineering for the right task, right?
因为确实有些任务,GPT比Opus更擅长。
Because there are definitely tasks that GPT is better than Opus for.
而上下文工程则致力于在恰当的时间提供适量的信息,帮助智能体聚焦于最小、最高效的任务,既不过度也不不足,对吧?
And then there's context engineering, which optimizes for giving the agent the right amount of information at the right time, and for getting it focused on the smallest possible task that is efficient for that agent, without overdoing it or underdoing it, right?
当你在大规模应用这两者并解决一些关键挑战时,比如上下文长度限制。
So when you apply those two at scale, and you solve some of the most important challenges, like, for example, context limits.
所以我们在内部,至少针对Blitzy,想出了一种非常有创意的解决方案,通过应用上下文工程和代理工程,我们实现了近乎无限的上下文。
So we have a very creative solution internally, at least speaking for Blitzy, where we've achieved effectively infinite context, because we've applied context engineering and agent engineering.
对吧?
Right?
这些是非常强大且实用的技术和工具,借助当今的人工智能,你可以实现自主开发,而你们已经成功做到了。
So these are really powerful techniques and tools that you can apply with today's AI to achieve autonomous development, which you've done successfully.
也许我们可以深入一下,当你提到‘自主开发’时,你具体指的是什么?
Maybe we can jump in and define when you say autonomous development, what exactly that means for you.
从客户或用户的视角,详细讲讲他们的操作流程,以及他们如何参与和体验这个过程。
Like, talk through the process from the perspective of a customer or user: what they're doing, what they see, the way in which they're engaged.
是的。
Yeah.
这是个好主意。
So that's a good idea.
我可以谈谈,比如你如何完成一个Java升级这样的任务。
So I can talk about how, you know, you would do, let's say, a task like, a Java upgrade.
对。
Right.
很容易想到现代化,比如从COBOL到Java的迁移,或者新功能开发,对吧?
It's very easy to think of modernization, maybe like a COBOL-to-Java transition, or even new feature development, right?
用传统方式来对比,我这里甚至把如今最先进的自主开发方式也称作"传统"方式,对吧?
Using the traditional way, and I'm calling even, let's say, state-of-the-art autonomous development "traditional" here, right?
在典型的开发工作流程中,对吧?
With the typical development workflow, right?
就算你用的是 Codex 或 Claude Code:你会先为这个程序制定一个规范,然后大概会拿需求去提示 Claude Code。
Even let's say you're using Codex or Claude Code: you would work out a spec for the program, and then you would probably prompt Claude Code with the requirements.
它会进入计划模式,构建规范。
It would enter plan mode, build a spec.
然后你会拿这个规范,切换到Codecs,让它进行审查。
You would then take that spec, hop to codecs, ask it to review it.
对吧?
Right?
然后,你知道,你希望你写对了那一套提示指南,提供了足够的上下文,帮助搜索代码库,找到所有相关信息,然后制定出这个计划。
And then you hope that you've followed the right prompting guidelines, given it the right amount of context, helped it search your code base and find all of the relevant information, and then built that plan.
但即使在生成规范时,经常发生的情况是,当你面对一个非常庞大的代码库时,这些工具——你知道的,那些不使用深度索引技术的工具——
But what frequently happens, even during spec generation, is this: when you have a very large code base, you're dealing with tools that don't use very deep indexing techniques.
它们依赖的是浅层索引,我的意思是,它们几分钟就能完成代码库的索引。
They're reliant on shallow indexing. What I mean by shallow indexing is that they'll finish indexing a code base in minutes.
所以它们并没有建立起对代码库或代码中各种关系的深入理解。
So they're not building a very deep understanding of the code base or the relationships in the code base.
因此,它们只能依赖像 grep 这样的工具来找东西。
So they have to rely on tools like grep to find stuff.
好吧,我想在这个新功能中更改身份验证提供程序。
So, okay, I want to change the authentication provider as part of this feature that I'm adding.
我要在一个一千万行的代码库中找出所有使用了 auth 的函数。
And I'm gonna find all functions that use auth in a 10,000,000 line code base.
问题是,也许 auth 是这个项目中一个非常重要的用例,使用身份验证提供程序的地方可能有成千上万,甚至数以万计。
Now the challenge is that maybe auth is a very important use case for this project and there's like thousands if not tens of thousands of places where the auth provider is used.
对吧?
Right?
而且不是所有函数都叫 login。
And not all functions are named login.
对吧?
Right?
所以依赖模型的智能来找到所有这些地方并正确更新它们,对吧。
So you're relying on the intelligence of the model to find all these places and update them correctly, right?
而很多时候,正是在这里出了问题。
And quite often, that's where it falls down.
所以它会漏掉一些地方,如果你把这样的计划提交上去,最终在编译时就会出错。
So it misses places, and then if you were to put that plan forward, eventually when it tries to compile, it will make a mistake.
它无法编译,对吧?
It won't be able to compile, right?
然后它会尝试修复这些bug。
And then it'll try to fix the bugs.
现在它又回到自己的计划上,结果改了一些原本不在计划里的东西。
And now it's going back on its plan and then you know, it's changing things that were not exactly in the plan.
所以你遇到了这个问题。
So you have this problem.
你正在违背计划,因为这个计划并不完美,对吧?
You're going against the plan because the plan was not perfect, right?
甚至要得到这个计划,你都得去求助三个不同的提供商,比如 Claude、GPT、Gemini 之类的。
And even to get this plan correctly, you had to go to maybe three different providers like Claude, GPT, Gemini, whatever it is.
所以这是一个挑战。
So that's one challenge.
现在假设你拿到了这个计划。
Now let's say you got the plan back.
接下来,你必须执行这个计划。
Next you now have to execute the plan.
你得定义任务,然后执行这些任务。
You have to define tasks and then execute the tasks.
但如果这是一个非常复杂的项目,每个任务可能需要数小时。
But if it's a very complex project, each task could take maybe hours.
如果真的特别复杂,甚至可能需要好几天,对吧?
If it's really really complex, it could take days, right?
然后你还会遇到子代理的概念,这些子代理可能是并行运行,也可能是串行运行。
And then you have the concept of maybe sub agents that you're running, maybe in parallel, maybe serially.
但要弄清楚哪些任务应该并行或串行执行,以及它们之间有哪些重叠,真的非常困难。
But it's really hard to figure out which tasks should be run in parallel versus in series, and what the overlaps between them are.
因为你可能会有代理在相互冲突,对吧?
Because you may have agents working against each other, right?
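One simple way to keep agents from working against each other, sketched here as a hypothetical illustration rather than any particular product's scheduler, is to treat the set of files each task touches as its footprint, and only batch together tasks whose footprints are disjoint:

```python
# Hedged sketch of parallel-versus-serial scheduling: each task's
# footprint is the set of files it will modify. Tasks in the same
# batch run in parallel; batches run one after another. Two tasks
# land in the same batch only if their footprints are disjoint, so
# no two agents ever edit the same file at once. Task names and file
# sets below are invented for illustration.
def schedule(tasks):
    # tasks: dict of task name -> set of files it will modify
    batches = []  # each batch is a list of mutually disjoint tasks
    for name, files in tasks.items():
        for batch in batches:
            if all(files.isdisjoint(tasks[other]) for other in batch):
                batch.append(name)
                break
        else:
            batches.append([name])
    return batches
```

This greedy packing is not optimal, but it makes the overlap rule concrete: overlapping tasks are serialized, everything else parallelizes.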
当你遇到一个复杂困难的情况,即使你已经有了计划,你还是会不得不回头求助于人类,依赖人类来指导你、帮你处理这一切。
And then, when you have a difficult, complex situation where you don't know what to do, even though you have a plan, you now find yourself going back to the human, relying on the human to guide you through all of that.
再从为代理提供合适工具的角度来看,比如你希望代理能进行实时测试。
And then even, from the standpoint of, let's say giving the agent the right tools, like for example, you want the agent to test live.
比如说,它是一个网页应用。
So say it's a web app, for example.
所以你可能会给它配置 Chrome MCP,然后"砰",你就损失了两万个令牌的上下文,对吧?
So maybe you'll give it the Chrome MCP, and boom, you've just lost 20,000 tokens of context, right?
因为它会占据你的上下文窗口。
Because it's gonna sit in, in your context window.
你可以使用一些技术,比如工具搜索等,来优化这一点。
And you may apply like techniques like, tool search, etcetera, that may optimize that.
但这里有个问题,对吧?
But there's a caveat, right?
如果你搜索工具,效率不会那么高。
If you search for tools, it's not gonna be as efficient.
它有可能找不到正确的工具,因为它不够主动,对吧?
There are chances that it'll miss finding the right tool, because it doesn't search as aggressively, right?
所以你面临这个问题。
So you have that problem.
你可能会有五个不同的 MCP,对吧?
And then maybe you have like five different MCPs, right?
这五个MCP中的每一个,如果都像Chrome一样复杂,那你一下子就损失了十万令牌,而这些大语言模型的有效操作边界仍然低于十万到十五万令牌。
And each of these five MCPs, if they're each as complex as Chrome's, you've just lost 100,000 tokens. And the effective frontier of operation for these LLMs is still less than 100K to 150K.
对。
Right.
尽管它们号称有一百万令牌的上下文。我想说的是,你去看看"大海捞针"排行榜就知道了,这样的评测有很多。
Even though they have 1,000,000 tokens of context. The point I'm making is, go look at the needle-in-the-haystack leaderboards; there are tons of them.
当你加载超过一定数量的令牌时,以前是四万,但现在像Opus 4.6这样的模型已经是八万、十万令牌了。
The moment you load more than, well, it used to be like 40K, but now, with Opus 4.6, it's like 80K, 100K tokens.
你就失去了代理最佳表现的能力。
You lose the ability of the agent to perform at its best.
所以,如果那个代理……
So, if that agent...
你很难很好地追踪上下文窗口中的所有内容。
You can't keep track of everything that's in the context window very well.
没错。
Exactly.
对吧?
Right?
所以,你加载了规范,但现在又加载了所有这些其他内容。
So, you loaded up the spec but now you've also loaded up this all this other stuff.
而且我还没谈到你的技能和 agents.md 呢。
And then I haven't even gone into your skills and your agents.md yet.
对吧?
Right?
那么,当你有一个百万行代码、包含多个模块、由不同团队开发、每个团队对自己的模块都有不同 agents.md 文件的代码库时,你该怎么办?
And then how do you, when you have this million-line code base with multiple modules, worked on by different teams, and every team has a different agents.md file for their module?
而且还有大量的技能,对吧?
And there are tons of skills, right?
你明白我想要表达的问题了。
You get the problem that I'm getting at.
你很容易就会失去效率前沿,而你甚至还没加载你要处理的实际文件。
You're easily gonna lose the efficient frontier, and you haven't even loaded the actual files you're gonna work on yet.
对。
Right.
你已经充分描绘出了传统工作流程中所面临复杂性的图景。
So you've adequately painted the picture of the complexity that you're dealing with with the traditional workflow.
那么,你能做些什么来克服这些众多的挑战呢?
Like what are the things that you can do to to overcome, you know, all these many challenges?
我们从一开始就做的一件事是,建立一个锚点,让智能体能够以此为依据,扎根于代码库并跨代码库查找内容。
So one thing that, you know, we've done from the beginning is build an anchor point that the agents can use to ground themselves in the code base and to find things across the code base.
比如,我们构建了一种图结构与向量的混合体,通过像Blitzy这样的摄入流程,它能够理解整个代码库,映射各种关系,并进行语义摘要与聚合。
For example, we've built a hybrid between a graph and a vector index, where you have this ingestion process, with Blitzy for example, that understands the entire code base, maps the relationships, and does semantic summarization and aggregation.
现在,你就拥有了整个代码库的完整地图。
And now you have this map of the entire code base.
因此,如果我想从一个点到另一个点,即使它们相隔一千万行代码,我也可以在一次请求中瞬间完成,而无需消耗大量令牌去逐个文件查找路径。
So if I wanna go from one point to another point that's like 10,000,000 lines away, I can do that instantly, in one request, rather than burning all these tokens traveling through different files to find the chain.
对。
Right.
所以这是一种非常有效的方法。
So that's like one technique that really works.
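The kind of prebuilt map described above can be illustrated with a tiny reverse index. The edges and symbol names here are invented, and a real system would also carry vectors and summaries; the point is only that the relationship graph is built once at ingestion time, so a cross-repo lookup becomes a graph walk instead of a grep per file:

```python
from collections import defaultdict

# Illustrative sketch of the "map of the code base" idea: record
# reference edges once during ingestion, so finding everything that
# depends on a symbol (say, an auth provider being replaced) is a
# lookup plus a walk over the prebuilt index.
def build_index(edges):
    # edges: (referrer, referenced) pairs discovered at ingestion time
    index = defaultdict(set)
    for referrer, referenced in edges:
        index[referenced].add(referrer)
    return index

def all_dependents(index, symbol):
    # Transitive closure: everything that directly or indirectly
    # references `symbol`.
    seen, stack = set(), [symbol]
    while stack:
        for referrer in index.get(stack.pop(), ()):
            if referrer not in seen:
                seen.add(referrer)
                stack.append(referrer)
    return seen
```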
你刚才描述的这些,在很多方面都与我们所看到的传统工具演进方式背道而驰。
What you just described in a lot of ways, like flies in the face of the way we've seen the traditional tooling evolve.
我们最初从RAG开始,基于……你知道的,人们可能没这么想过,但很大程度上,早期的Copilot版本其实都是基于RAG的。
Like, we started with RAG, and people don't think about it like this, but to a large degree the early Copilot versions were kind of RAG-based.
它就像是语义向量风格的搜索,在代码库中查找并识别代码块,然后将它们作为上下文传递。
It was semantic, vector-style searching across the code base, identifying chunks and passing them on as context.
而后来让我们都兴奋不已的Codex和Cloud Code,它们已经不再这么做了。
And then the thing that we're all excited about, the Codexes and the Claude Codes, they don't do that anymore.
它们现在只用grep,而你所说的这种方式,其实很难在大规模场景下奏效。
They just do grep. And what you're saying is, that doesn't really work at scale.
这很有趣,值得思考这里发生的这种互动,而你所说的,我认为你的意思是,要在企业级或大规模代码库中有效运作,就需要更复杂的机制。
It's interesting to think about the give and take that's happening here. What you're saying, or at least what I'm interpreting you to be saying, is that you need more sophistication to operate at enterprise scale, large-scale code bases, whatever we want to call it.
这么说吧,我想提出一个问题:你有没有一个概念,比如在代码量超过某个阈值、代码行数达到一定规模,或者复杂度达到某种程度时,grep 就失效了,你必须回到向量或图的方法?
To bring that to a question: do you have a sense for where the cliff is? Is there a certain amount of code, a certain number of lines, or some way of characterizing the complexity, above which grep stops working and you need to go back to vector or graph?
我认为我们对向量和图的应用是与 grep 结合使用的。
I would say that the way we've applied vector and graph is in combination with grep.
所以你把它当作一种信号,就像你在"查找"应用里搜索你的 AirPods 或 AirTag 时,它会先给你一个方向,你沿着那个方向走,直到找到为止。
So you use it like a signal, like when you go to Find My and you're searching for your AirPods or AirTag: you know how it gives you a direction, and then you go down that direction till you find the thing.
它并不会告诉你确切的位置。
It doesn't tell you where it exactly is.
是的。
Right.
但这非常有帮助。
But that is insanely helpful.
就是这样。
It's exactly that way.
所以通过结合两者
So by combining both
所以,语义搜索的作用是帮你大致定位方向,而 grep 则能帮你找到确切的那行代码,这样你就能通过语义搜索大幅缩小搜索范围。
So semantic search, I'm taking as the thing that gets you directionally close, and then grep is the thing that gets you to the exact line; you're able to reduce your search space using semantics.
没错。
Exactly.
好的。
Okay.
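The two-stage lookup just described can be sketched as follows. The "semantic" scorer here is just word overlap standing in for a real embedding model, and the file contents are invented; the shape of the pipeline is the point: a cheap ranking pass narrows the candidate files, then an exact grep-style scan finds the precise lines.

```python
# Toy sketch of semantic-then-grep search: rank files by a crude
# relevance score (word overlap as a stand-in for embeddings), keep
# the top few, then scan only those for the exact string.
def semantic_rank(query, files, top_k=2):
    terms = set(query.lower().split())
    return sorted(files,
                  key=lambda f: -len(terms & set(files[f].lower().split())))[:top_k]

def grep_lines(files, names, needle):
    hits = []
    for name in names:
        for lineno, line in enumerate(files[name].splitlines(), 1):
            if needle in line:
                hits.append((name, lineno))
    return hits
```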
所以,当你把这些技术结合起来,再问阈值是多少时,我会说,如果代码库的规模超过你上下文窗口的两倍。
And so so that, you know, when you combine these techniques and then you ask for like what is the threshold, well I would say if the code base is anything larger than two times your context window.
大概就是这个数量级,对吧?
Just roughly, right?
每个模型提供商使用的压缩技术、设置、类型、风格、算法等都不尽相同。
Every model provider uses different techniques for compaction: different settings, different types, different styles, different algorithms and all that.
是你的有效上下文窗口的两倍,还是最大上下文窗口?
And two times your effective context window or your, maximum context window?
我们说是最大上下文窗口,因为新一代模型在‘大海捞针’的情况下也表现得非常好。
We'll say maximum because, you know, the the newer models, they're really good at even the needle in the haystack.
对吧?
Right?
所以即使有效上下文更小,他们也能给你足够好的结果。
So even though the effective window is smaller, you would still get a good-enough result.
但一般来说,按经验法则,如果你这么说的话,当你做的更改超过一万行左右时,对吧?
But in general, as a rule of thumb, if you were to put it that way: if you're doing a change that's more than, let's say, 10,000 lines, right?
在一个七万到十万行或更大的代码仓库里,这种 RAG 支持的优势就非常明显了,因为你花在搜索上的时间会大幅减少;代码到了七万、十万行,你很可能已经有多个模块了,对吧?
In a repo that's around or more than 70K to 100K lines, that's where the advantages of having this RAG support become clearly evident, because the amount of time you spend searching goes down drastically. With 70K, 100K lines of code, you probably have multiple modules at that point, right?
而且还有多个团队在同时开发,各自有不同的规范。
And you have multiple teams working on it, which have different sets of rules.
因此,你可以充分利用多智能体的优势——这一点我们稍后单独讨论——同时结合这两个锚点来进行搜索。
So you can really take advantage of going multi agentic, which we'll talk about separately, but also having these two anchor points and searching things.
那么多智能体,这个概念是从哪里体现的?
So multi agentic, where does that come in?
是的。
Yeah.
所以,由于你面临这些与复杂性相关的限制,比如任务复杂性和有效上下文的限制,而这一点实际上并没有改变。
So, because you have these limitations with complexity, with task complexity, and limitations in terms of effective context, which by the way is not changing.
对吧?
Right?
所以,上下文已经从一万令牌增长到一百万令牌:你从大约一万,到二十万,再到一百万令牌。
So the context, you've gone from, you know, maybe 10,000 to 200K tokens, and then to 1,000,000.
但我们一直停留在八万到十万,或者说八万到十二万令牌,我认为过去两年的最新模型都是如此。
But we've been stuck at 80K to 100K, or 80K to 120K, I would say, for the latest models over the past two years.
所以,尽管你每三个月就能获得一个新模型,但有效上下文窗口并没有变化,我们花了很长时间才从一万到二十万再到一百万令牌,对吧。
So even though you're getting a new model every three months, the effective context window is not changing, and it's taken a while for us to go from 10K to 200K to 1,000,000, right.
因为这些方面存在物理限制。
Because you have physics constraints in these.
你有计算能力的限制,有功耗问题,还有我们能为这些模型提供商扩展到什么程度的问题。
You have, you know, the amount of compute capacity, you have power, you have limits on how much these model providers can scale.
所以他们总是在寻找更好的解决方案。
So they're always trying to find, you know, better solutions for that.
但这个问题在接下来的三个月、六个月,甚至在我看来,未来三年内都不会有显著改变。
But that's not getting solved in the next three months, six months, or even, I would say, in my opinion, that's not changing drastically in the next three years.
所以这些都是非常重要的考量因素。
So these are very important considerations.
那么,如果你具备多智能体能力,能够招募多个智能体,我们已经看到了两种被应用的技术。
So what happens if you have multi-agent capabilities, the ability to recruit multiple agents? We've seen two techniques that have been applied.
一种是子智能体的概念。
One is the concept of having sub agents.
你有一个协调者或主导模型,负责招募多个子智能体。
So you have one orchestrator or leader model that is gonna recruit multiple sub-agents.
比如在 Claude Code 中我就见过这种应用。
I've seen this used in Claude Code, for example.
然后你就可以并行进行搜索,对吧?
And, then you can do searches in parallel, right?
如果你要查找四样不同的东西,就同时运行四个智能体,看看哪个先返回结果,就像扔出四支飞镖,看哪支能命中。
If you're finding four different things, just run four agents, you know, see which one comes back; like, throw four darts, see which one sticks.
你可以做这种事。
You can do that kind of stuff.
或者你可以并行处理任务,对吧?
Or you can parallelize tasks, right?
把一个任务交给前端代理,另一个交给后端代理,这样能完成更多工作。
Give one to a front-end agent, give one to a back-end agent, and get more work done.
你可以做这种事。
So you can do that kind of stuff.
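The sub-agent pattern just described can be sketched as a leader that fans searches out in parallel and collects results. The "agents" here are plain functions standing in for LLM calls; everything is an editor's illustration, not Claude Code's or Blitzy's implementation.

```python
# Minimal sketch of the leader/sub-agent pattern: one leader fans search
# tasks out to parallel sub-agents and merges whatever comes back.
from concurrent.futures import ThreadPoolExecutor

def search_agent(query: str, corpus: dict[str, str]) -> list[str]:
    """Stand-in sub-agent: return file names whose contents match the query."""
    return [name for name, text in corpus.items() if query in text]

def leader(queries: list[str], corpus: dict[str, str]) -> dict[str, list[str]]:
    """Leader agent: run all searches in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(lambda q: search_agent(q, corpus), queries))
    return dict(zip(queries, results))

corpus = {"auth.py": "def login(): ...", "db.py": "def connect(): ..."}
print(leader(["login", "connect"], corpus))
```

The leader here is still the bottleneck, which is exactly the limitation discussed next: every sub-agent reports back to one place.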
当然,你的优势是速度,对吧?
The advantage you have is of course speed, right?
当然,还有可能提升智能效果,因为你正在并行处理多件事。
And of course, maybe, a boost in effective intelligence, because you're doing multiple things in parallel.
而且你还能显著——我这么说吧,但还不够充分——提高你所使用的上下文量,因为代理不再需要独自进行所有搜索和遍历代码,对吧?
And you also have a significant, I would say, but not sufficient, a significant improvement in the amount of context you're using with the lead agent, because it's no longer having to make all these searches and traverse the code, right?
它是在从不同的代理那里获取结果。
It's getting the result from different agents.
所以这比让这个人独自完成要高效得多。
So that's more effective than this guy just having to do it all himself.
但你仍然面临一个瓶颈,那就是这个领导代理,对吧?
But then you still have a bottleneck and the bottleneck is this leader agent, right?
因为每个人都会汇报回来,所以你不能运行数百个代理,否则你会又回到原点。
Because everyone's gonna report back, so you can't run hundreds of agents, because then you're gonna go back to...
你压缩了上下文,但并没有克服上下文作为障碍、作为根本限制的问题。
You've compressed the context, but you've not overcome the context as a barrier, as a limitation, a fundamental limitation.
你只是把问题往后推了而已。
You just kicked the can down the road, essentially.
是的。
Yeah.
你有一些东西。
You have something.
对。
Right.
所以这是一个方法。
So that's one.
另一个方法在代码库足够大时仍然会失效,比如数百万行代码的情况,对吧?
The other one, so that still falls down when the code base is large enough, right, multi-millions of lines.
你会依赖 Claude Code,让它高效地完成所有事情。
You're gonna be effective at using Claude Code and just get it to do everything.
就像我刚才在通话中提到的,即使使用 Opus 4.6,你也无法用 Claude Code 构建一个十万行的 C 语言编译器。
Like, like I just said earlier in the call, you can't even build a 100K-line C compiler with Claude Code, even with Opus 4.6.
所以,我们很久以前就采取了这种方法:动态地招募多个智能体群组,并将数据库作为协调层的一部分,对吧。
So the approach that we took a long while ago, our approach, has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer, right.
所以我们知道,你有一个规范,正在努力实现这个规范。
So we know that you have a spec, and you're working towards executing the spec.
你会用 AI 将其分解为多个任务,然后为不同任务分配不同的智能体组。
You break that down using AI into tasks and then use different sets of agents for tasks.
因为你是递归地这样做的,对吧。
And you do that recursively, right.
一旦你完成了这一步,你就达到了每个代理都有高效任务的阶段,你可以招募数以万计的代理,而无需担心一个负责跟踪所有活动的单一协调者。
So once you've done that, you've now gotten to the point where you have an efficient task for every agent and you can recruit tens of thousands of agents but not have to worry about this single orchestrator that's keeping track of everything that's happening.
对。
Right.
你可以像GPU那样实现大规模并行,我知道GPU是怎么工作的,我曾在英伟达工作过。
You can parallelize at scale, just like GPUs work, and I know how GPUs work, I was at Nvidia.
所以你可以实现这种效果,对吧。
So you can get that effect, right.
与其采用多线程——那是另一种方式的效果,你实际上会实现超大规模扩展,对吧。
Rather than having multithreading, which is the effect of the other one, you'd really have hyperscaling, right.
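The "database as orchestration layer" idea can be sketched with a task table that workers claim atomically, so no single leader tracks everything. This is a hedged illustration: the schema, the claim logic, and the task names are invented, not Blitzy's actual design.

```python
# Sketch of using the database as part of the orchestration layer: tasks
# live in a table, and each agent atomically claims one instead of
# reporting back to a single orchestrator. Illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, spec TEXT, "
           "status TEXT DEFAULT 'pending', owner TEXT)")
db.executemany("INSERT INTO tasks (spec) VALUES (?)",
               [("implement parser",), ("write parser tests",), ("update docs",)])

def claim_task(agent_id: str):
    """Claim one pending task with a compare-and-swap style update."""
    row = db.execute("SELECT id, spec FROM tasks WHERE status='pending' "
                     "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None  # nothing left to do
    task_id, spec = row
    cur = db.execute("UPDATE tasks SET status='running', owner=? "
                     "WHERE id=? AND status='pending'", (agent_id, task_id))
    # If another agent won the race, try again.
    return (task_id, spec) if cur.rowcount else claim_task(agent_id)

print(claim_task("agent-1"))  # (1, 'implement parser')
```

Each agent only ever touches its own row, which is what makes the fan-out to tens of thousands of agents tractable.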
所以,这就是我认为的未来。
So that is what I believe is the future.
我们已经成功应用了这种方法,经常编写数十万行、上百万行代码,所有代码都能编译、运行,所有测试都通过,用户界面正常,像素级精准。
And we've been able to apply that successfully, and we frequently write hundreds of thousands of lines, millions of lines of code: everything compiles, everything runs, all tests pass, the UI works, it's pixel perfect.
因此,我们已经将这一点完善到了极致。
And so we've perfected that really.
当我想到从多线程转向并行化或分布式计算这个类比时,我会考虑其中的一些挑战,比如并发、锁控之类的问题。
So when I think about this analogy of going from multithreading to parallelization, or distributed computing in general, I think about where some of the challenges are, and you get to issues like concurrency and locking and things like that.
在这个背景下,我想的是,有大量代理在相邻任务上大规模运行,你如何防止它们互相干扰对方的工作?
And in this context, I'm thinking: you've got many, many agents operating at scale on adjacent tasks. Like, how do you prevent them from stepping all over each other's work?
这是个很好的观点。
That's a great point.
所以,有一些技术手段,这确实是核心问题。
So, there are a number of techniques, you know, and that's the real problem.
这是我们日复一日所面对的问题。
That's what we're dealing with day in and day out.
但有一些技术可以帮助解决这个问题,比如提供多个环境,让代理不止在一个环境中运行,而是置身于多个隔离的沙盒环境中。
But there are a number of techniques that help with that, you know, like having multiple environments: giving the agent not just one but multiple environments to operate in, which are sandboxed, right.
然后整合结果,比如通过源代码来实现。
And then converging the result, like using the source code.
比如,每个代理最终都会向 GitHub 提交代码,每个代理都会沿着这条路径推进,验证这条路径是否真的可行。
Like ultimately every agent for example is committing to GitHub and every agent is going down this chain and figuring out if this path actually works, right.
然后定期重新检查代码,确认它是否仍然能编译,是否仍符合规范。
And then periodically revisiting the code and checking if it still compiles, if it still meets the spec.
比如,我们在将代码交付给用户之前,会内部定期进行代码审查。
Like for example, we run periodic code reviews internally before even giving the code to the user.
我们有代理会审查所有代码,确保它们没有偏离标准。
We have agents that review all of the code and make sure it's not drifting, right.
我们还有代理对所有代码进行测试,即QA代理。
We have agents that test all of the code, QA agents.
然后我们还有不同的开发代理来处理反馈。
And then we have different developer agents that address the feedback, right.
所以这些是利用代理设计作为杠杆的一些方式,我们将其与源代码管理系统结合,将其作为真实来源,推送评论,查看发生了什么,观察代理的执行轨迹,理解做出更改的初衷,从而避免越界。
So those are a few ways you can use agent design as a lever. And, you know, we combine that with the SCM, use that as a source of truth, push comments, look at what happened, look at agent trajectories, understand what the rationale for making a change was, right, so you don't overstep.
然后还有另一个部分是图数据库,因为它记录了整个代码库的关联映射,包括文件、哪个文件依赖于什么、导入了哪个库、该库的版本是什么,以及为什么使用这个库。
And then you have this other part, which is the graph database, because you have the relational mapping of the entire code base in there: you have the files, and you know which file depends on what, imports which library, what the version of that library is, and what the reason is that library is used.
拥有这样的锚点至关重要,这简直是颠覆性的。
Like having that anchor is extremely huge, it's a game changer, right.
因此,你可以立即让每个代理基于这个真实数据进行定位。因为代理的困惑更少,并且能访问这个信息宝库,所以它更加高效,也不太可能干扰其他代理,因为其他代理也同样基于这个图的节点进行操作。
So you can immediately ground every single agent in that ground truth, right? Because the agent is less confused and has access to this treasure trove of information, it is much more effective and less likely to step on every other agent's toes, because every other agent is also operating on the nodes of this graph.
对吧?
Right?
如果你有类似这样的系统,就可以这样设计。
So you can design systems that way if you have something like this.
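The code-graph idea above can be sketched by extracting a dependency map from source files. This is a toy illustration: a real system would use proper parsers and a graph database, and the regex here only handles simple Python imports.

```python
# Rough sketch of the code graph: map which file depends on what, so every
# agent can be grounded in the same relational view of the repo.
import re

def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Return a mapping: file name -> set of top-level modules it imports."""
    graph = {}
    for name, source in files.items():
        # Toy parse: match "import foo" or "from foo import ..." per line.
        deps = set(re.findall(r"^(?:from|import)\s+(\w+)", source, re.MULTILINE))
        graph[name] = deps
    return graph

files = {
    "api.py": "import db\nimport auth\n",
    "auth.py": "from db import connect\n",
    "db.py": "import sqlite3\n",
}
graph = build_import_graph(files)
print(graph["api.py"])
```

An agent assigned to `auth.py` can then see, from the same shared graph, that touching `db.py` affects another agent's node.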
在这个世界里,当你谈到这些代理时,你说它们各自做不同的事情,我想知道的是,这些代理的角色、个性,或者我们想称作的其他方面,究竟有多固定?
And so in this world, when you talked about these agents, you talked about them doing distinct things. I guess I'm trying to get at the degree to which the agent roles, you know, or personalities, or whatever we want to call them...
这些角色是固定的,还是动态变化的?
Like, are these fixed or these dynamic?
你是否在提示工程和上下文工程上投入大量时间,比如,这是一个代码编写代理,我们会围绕它优化所有提示?
Is this something that you spend a lot of time on from a prompt-engineering, context-engineering perspective? Like, you know, this is a code-writing agent and we're going to streamline all of its prompting around that.
这是一个代码审查代理,我们为此做专门处理,还是由代理自己来判断这些?
This is a code-review agent and we're doing that. Or does the agent figure these things out?
代理是一个通用概念,它会根据任务自行判断这些角色?
Like, is the agent a generic concept, and it figures these things out based on its task?
是的,这是个很好的观点。
Yeah, that's a that's a great point.
所以,我们刚开始的时候,所有的代理都是手写的,也就是静态的。
So we, you know, when we started, all our agents were handwritten, like they were static.
因为当时的模型还不够智能。
Because the models just weren't smart enough.
我们当时用的是 Claude 3.5 和 3.6 Sonnet。
Like, we were working with Claude 3.5, 3.6 Sonnet.
完全是另一个世界。
Different world.
这是个不同的世界。
It's different world.
是的。
Yeah.
顺便说一句,我们刚开始的时候,甚至还没有工具调用功能。
We didn't even have tool calling, by the way, when we started.
所以当时简直疯狂。
So it was crazy.
但后来,你知道,随着代理变得越来越聪明。
But then, you know, agents got really, really smart.
所以我们现在有一套基础指南,并尽量保持它们简洁。
So what we have today is we have a set of base guidelines and we try to keep that as lightweight as possible.
以便不会占用太多上下文空间。
So that we don't take up too much space in the context.
我说的简洁,是指少于5000个token,这简直难到极致。
When I say lightweight, I mean less than 5,000 tokens, which is incredibly hard to do.
我们另一个手段是提示指南。
The other lever we have is the prompt guidelines.
我们有参考链接,指向这些指南的发布位置。
We have the references, the URLs to where these guidelines are posted.
我们还赋予代理查找提示指南的能力。
And we've given agents the ability to look up prompt guidelines.
你大概能猜到我想说什么了。
You probably can tell where I'm getting to with this.
所以,这些代理会去查找提示指南。
So the agents look up the prompt guidelines.
然后,你就有了完全动态的代理设计。
And then, you have fully dynamic agent designs.
所以在我们平台的最新版本中,代理会设计代理。
So in the latest version of our platform, the agents design the agents.
所以我们已经实现了一组工具,还有一组MCPs,或者说是外部工具、集成,这些都是预先写好的。
So you have a set of tools that we've implemented, you have a set of MCPs or you know, external tools, integrations, all that are pre written.
我们编写了所有工具,也搭建了框架,设置了环境,但我们不会给代理直接访问源代码管理或数据库之类的东西的权限,因为我们都清楚代理会做什么,它们可能会无意中获得访问权限。
So we write all of the tools and we've written the harness, we've set up the environments. We don't give agents direct access to the SCM or the database or stuff like that, because we all know what agents can do when they have inadvertent access.
但我们确实提供了一些代理可以使用的工具,比如,向远程仓库推送更改的请求。
But we do have tools that agents can use and these tools could be like, for example, making a request to push a change to origin.
对吧?
Right?
或者从分支拉取最新更改、提交更改、编辑文件,诸如此类的操作。
Or pulling the latest changes from a branch, making a commit, making an edit to a file, that kind of stuff.
比如启动浏览器,这类操作。
Like spinning up a browser, that kind of stuff.
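The mediation pattern just described, where agents never touch the SCM directly but go through narrow, validated tools, can be sketched as a dispatcher. The tool names and the allow-list below are invented for illustration.

```python
# Sketch of mediated tool access: agents call narrow tools that validate a
# request before any side effect happens, rather than touching the SCM
# directly. All names here are hypothetical.

ALLOWED_TOOLS = {"request_push", "pull_branch", "commit", "edit_file"}

def dispatch(tool: str, **kwargs) -> str:
    """Route an agent's tool call through a narrow, validated interface."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"agent may not call {tool!r}")
    if tool == "request_push" and kwargs.get("branch") == "main":
        # Example policy: pushes to main are requests, not direct actions.
        return "denied: pushes to main require review"
    return f"ok: {tool} {kwargs}"

print(dispatch("request_push", branch="feature/x"))
```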
所以我们有这些工具,然后代理会查看规范,甚至规范的某一部分,因为我们将规范的不同部分分配给了不同的代理。
So we have the tools, and then the agents look at the spec, or even a portion of the spec, right, because we have assigned different parts of it to different agents.
然后它们决定哪个代理最适合完成这个任务。
And then they decide which agent would be best suited to solve this task.
因为你们已经看到,我们在Transformer架构及其相关代理中一贯发现的是:如果你给一个代理设定一个角色,并赋予它一个带有专属工具集的任务,它的表现会与没有被这样配置的代理截然不同。
Because what we've consistently seen in the transformer architecture, and agents around it, is that if you give an agent a persona and then give it a mission with a dedicated set of tools, its performance is gonna be vastly different from an agent that wasn't, for example, given the same thing.
就像你直接去问Claude,给它一个任务。
Like you just go to Claude and just give it something.
它给出的回应、遵循的技术、思考过程、推理方式,对吧?
The kind of response, the kind of techniques it follows, the thinking process, the reasoning process, right?
智能和模型的很多奥秘其实都来自于推理,对吧?
Quite a lot of the magic of the intelligence in the models comes from reasoning, right?
他们思考得越多,这不仅仅关乎数量,更关乎推理的质量,对吧?
And the more they think, and it's not just about the volume or the quantity, it's more about like the quality of their reasoning, right?
而这一点会受到角色设定的影响。
And that is impacted by the persona.
因此,至关重要的是招募或设计具有正确角色和合适工具的代理,避免上下文过载。
So it's super important to recruit agents, I would say design agents, with the right persona and the right set of tools, in a way that does not overload the context.
我们设置了检查机制,用于确认当某个代理启动时,会加载多少上下文。
We have checks in place to check like, okay, when this agent fires up, how much context is it gonna load up?
它是否仍在有效的上下文窗口内运行?
And does it still operate in the effective context window?
对。
Right.
当你设计出一个能完成所有这些功能的函数时,你就真正解决了这个问题。
And when you design that, like, you've designed a function that does all of that, that's when you've really solved this problem.
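The context check described here can be sketched as a simple budget gate that runs before an agent fires up. The 4-characters-per-token estimate is a common rough heuristic, and the 100K effective window is taken from the 80K-120K range mentioned earlier; both are assumptions, not Blitzy's actual numbers.

```python
# Sketch of the pre-flight context check: estimate how many tokens an
# agent's persona, guidelines, and tool descriptions will load, and verify
# it stays inside the effective context window. Illustrative only.

def estimated_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return len(text) // 4

def fits_effective_window(parts: list[str], effective_window: int = 100_000) -> bool:
    """True if the agent's startup context stays within the effective window."""
    return sum(estimated_tokens(p) for p in parts) <= effective_window

persona = "You are a senior backend engineer."
guidelines = "Base guidelines, kept lightweight (under 5,000 tokens)."
print(fits_effective_window([persona, guidelines]))  # True
```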
再加上其他那些因素,对吧?
Combined with the other stuff, right?
再加上大规模招募这些代理的能力,比如能够设计、分配,并让它们跟踪进度、推动工作前进。
Combined with the ability to recruit these agents at scale: being able to design, assign, and then get them to track progress and move the job forward.
这正是我们日复一日在做的事情。
Like, that's what we do day in and day out.
当你谈到代理的身份时,让我想到这种想法:在提示开头写‘你是一位专家文案撰写人’,对吧?
When you talk about agent personas, it makes me think of this idea of like starting your prompt with you are an expert copywriter, right?
这种做法。
This, this thing.
我想在早期我们看到过很多这样的情况。
And we saw a lot of that, I think, early on.
但后来我们似乎逐渐远离了这种做法,但你的经验听起来像是,赋予代理一个明确的专业身份,对于其表现至关重要。
And then I think we saw a step away from that. But it almost sounds like your experience is that giving the agent a strong kind of professional identity, if you will, is an important part of its performance.
你还会在提示中使用这类表述吗?
Do you still include that kind of verbiage in prompts?
会。
Yes.
所以,在历史长河中,我们在提示工程方面经历了许多改进和变化,其中一件事是我们不再需要告诉代理:如果做不好,人们会因此丧命。
So over the course of history, we've had a lot of improvements and changes in prompting. Like, one thing we've moved away from is that you no longer need to tell the agent that people will die if you don't get this right.
我们都已经过了那个阶段。
We're all past that.
我这辈子"杀死"了很多小狗,你知道,就是出于这个特定原因,用这种提示方式……是啊。
And the reason I've killed a lot of puppies in my life, you know, for this particular reason, with this kind of prompting... Yeah.
很高兴我们已经走过了那个阶段。
Well, I'm glad we're through that.
但当我们谈到内部评估时,我们使用这些评估来大规模衡量大型语言模型和代理的性能。
But when it comes to, you know... we have internal evals that we use to evaluate the performance of LLMs and agents at scale.
对吧?
Right?
我们确切知道一条指令会如何改变代理的行为轨迹。
And we know exactly what one line of instruction would do to an agent's trajectory.
我们一再发现,赋予代理恰当的身份,并用那种语言编写提示,会带来显著变化。
And what we've seen time and again is that giving it the right persona, writing your prompts in that language, changes things.
比如,我们曾与一家银行合作,当时代理在为银行撰写文档,但该代理并没有金融专家的身份定位。
Like, for example, we worked with a bank, and the agents were writing documentation for the bank, and the agent did not have the persona of a financial expert.
因此,它在撰写评论时使用的语言和术语并不符合银行的期望。
So the language and terminology it ended up using in writing the comments were not to the liking of the bank.
然后我们做了同样的事情,但改变了撰写文档的代理的身份定位。
And then we did the same thing but changed the persona of the agent writing documentation.
结果大大改善了输出效果,因为它使用了银行开发人员能够理解的术语。
It drastically improved the outcome because it was using terms that the the developers of the bank understood.
对吧?
Right?
这就是你通过调整角色设定所能带来的改变。非常有意思。
So that's the change you can effect by doing this, by tuning the persona. Super interesting.
我听说过这种说法,就像你拥有模型整个语义空间,而通过告诉它自己的角色,你实际上是把它引导到了与任务相匹配的语义邻域。
I've heard it described as like, you've got this, you know, this entire semantic space of the model and by kind of telling it its role, like you kind of put it in the right semantic neighborhood for the task.
是的,没错。
Yes, yes.
这正是我们在这类情况中反复看到的,在我们的评估和现实场景中都如此。
That's exactly what this plays on, and we've seen this time and again in our evals and in real-world situations.
人们不再建议这么做,是因为对于大多数日常通用场景,你根本不需要它,对吧?
The reason people don't advise doing it anymore is because for most of the general day-to-day use cases, you don't need it, right?
你已经能获得足够好的性能了。
You get good enough performance.
但当你面对超大规模、极其复杂的企业级应用场景时,这一点就变得非常有帮助了。
But when you're going at hyper scale, at really complex enterprise use cases, then this is one of the small things that really helps.
我们最近还看到一些研究指出,AGENTS.md 实际上可能适得其反。
And the other thing we're seeing recently is research that says that AGENTS.md can actually be counterproductive.
你对此有什么经验或见解吗?
Do you have any experience or insights into that?
完全正确。
100%.
我想,你知道,我之前也提到过,AGENTS.md,我认为它无法扩展。
I think, you know, I also described this earlier: AGENTS.md, I don't believe, can scale.
它在较小的代码库中可以工作。
It can work for the smaller code bases.
所以我定义了阈值,大约是7万到10万行代码。
So I defined, you know, the threshold as 70 to 100K lines.
在小于这个规模时,AGENTS.md 应该表现得很好,对吧。
AGENTS.md should be great, you know, at less than that, right.
因为你可以使用一个扁平文件,可能只有1到3个团队在使用这个代码库,你可以把所有规范都写在里面,对吧。
Because you can have a flat file, maybe you have one to three teams that are working with that code base and you can capture all of the guidelines there, right.
但它无法泛化,你不能用文本来实现泛化。
But it cannot generalize. You cannot use text to generalize.
你不能把整个团队开发者的全部经验都塞进一个文件里,就指望它能适用于整个代码库,不管模型有多智能,对吧?
You cannot put all of the learnings of that team's developers in a single file and expect it to generalize across the entire code base, no matter how intelligent the model is, right?
这只是在信息不足的情况下工作,就像我前面提到的,还有太多其他事情在争夺注意力。
It's just working with insufficient information, and like I described earlier, there are so many other things that are competing for attention, right.
因此,代理很难确定优先级,尤其是在出现冲突时,这尤其困难。
So it's really hard for the agent to know what to prioritize, especially when it leads to a conflict, right.
举个内部的例子:我们用 Blitzy 来开发 Blitzy,对吧。
So, for example, we had this situation internally. We use Blitzy to build Blitzy, right.
我们有一条规则规定,在Python中写测试时只能使用fakes,不能使用mocks,对吧?
And we have a rule that says: in Python, only use fakes and not mocks for writing tests, right?
你可以把它看作我们的 AGENTS.md。
Now you can think of that as our AGENTS.md.
但在这个代码库中,我们大量使用了mocks。
But then in the code base, we've extensively used mocks.
对吧?
Right?
我们还有另一条指令,要求始终模仿代码库中已有的模式。
And we have another instruction that says always mimic the patterns that we've already used in the code base.
对吧?
Right?
这就是你期望代理做的事情,对吧?
That's what do you expect the agent to do, right?
所以最终发生的情况是,它们有时用假对象,有时用模拟对象,一切都得靠你自己判断,对吧?
So what ends up happening is they're gonna use fakes sometimes and mocks sometimes, and it's all on you, right?
所以,这些就是 AGENTS.md 效果不佳的一些挑战,而且正如有人正确指出的,在许多情况下甚至可能适得其反。
So those are some of the challenges of why AGENTS.md is not effective and, you know, as someone rightfully pointed out, maybe even counterproductive in many cases.
但在绝大多数小规模应用场景中,这是一种相当有效的方法。
But for the vast majority of the smaller-scale use cases, it's a pretty effective technique.
是的。
Yeah.
在这个背景下,反思与代理协作在多大程度上依赖于任务和上下文,是很有趣的。
It's interesting in that context to reflect on how much of working with agents is task and context dependent.
就像我们经常抛出很多指令,比如你应该怎样,你必须怎样,要这样提示,必须那样提示。
Like, we throw around a lot of directives: you should prompt like this, thou shalt prompt like that.
但我认为,这最终还是回到了评估的重要性上。
But I guess it really just comes back to the importance of evals.
你知道,仅仅因为你在X平台或其他地方看到某些东西,并不意味着它一定适用于你的场景。
Like, you know, just because you see something out on X or wherever doesn't mean it necessarily applies to your case.
也许你应该测试一下,但要先通过你的评估套件。
Maybe you should test it, but, run it through your eval suite.
是的。
Yeah.
而且我认为,你提到了一个非常重要的观点,这正是我非常关心的。
And I think, you hit a very important point, one that's very close to my heart.
我认为评估一直表现不佳,还不够好。
Evals, I think, have been consistently underperforming, and are not good enough.
所以就在今天或昨天,我相信 OpenAI 发布了一篇文章或备忘录,说他们已经停止在 SWE-bench Verified 上进行测试,因为问题定义得不够清晰。
So just today, or just yesterday I believe, OpenAI released an article, a memo, where they said: we've stopped testing on SWE-bench Verified because the problems are not well defined.
他们参与了 SWE-bench Verified 的创建,对吧。
They contributed to creating SWE-bench Verified, right.
他们意识到了这个差距。
They've realized that gap.
所以他们现在改用 SWE-bench Pro 进行测试。
So they're now testing on SWE-bench Pro.
但即使你看看在 SWE-bench Verified、SWE-bench Pro,或者 Terminal-Bench 上表现相似的模型,对吧?
But even if you look at models that perform similarly on SWE-bench Verified or SWE-bench Pro, or Terminal-Bench for that matter, right?
这些都是些非常流行的排行榜。
Which are some of the very popular leaderboards.
如果你在真实场景中测试它们,结果会有巨大差异。
If you test them in real world performance, the results are vastly different.
比如Gemini和Anthropic。
Like for example, Gemini and Anthropic.
我都非常喜欢这两个模型。
I love both of these models.
但如果你给他们同样的问题,而他们的得分相似,对吧?
But, if you give them the same problem and they have similar scores, right?
最新的版本。
Latest versions.
但如果你给他们同样的问题,然后观察他们写的代码,不添加任何额外指令,对吧?不要试图影响它的行为。
But if you give them the same problem and you look at the code they write, without any additional instructions, right? Like, don't try to influence what it's doing.
Gemini 会采取一种更富有创意、更冗长的方式,这可能更适合一些人。
So Gemini tries to take a more creative, verbose approach, that might be preferable to some people.
但 Opus 试图采取一种完全不同的方法。
But Opus takes a completely different approach.
它更精确,你知道的,这还取决于你如何提示它之类的事情。
It's more precise, it's more, you know, and it depends on how you prompt it and stuff like that.
但这些差异在现实世界中非常显著,因为很难为每一个可能的情况都去提示代理。
But those differences are very significant in the real world, because it's really hard to prompt the agent for every single possibility.
对吧?
Right?
就像它应该表现的样子。
Like how it's supposed to behave.
如果你这么做,那区别在哪里?
Like if if you're doing that, then what is the difference?
你的工作是在扮演代理吗?
Is your work then just playing the agent?
所有这些的目的是让代理自己去解决。
The whole point of all this is let the agent figure it out.
正因为如此,而所有的排行榜都没能体现这一点,对吧?
And none of the leaderboards capture that, right?
所以你可以看排行榜,但对一些人来说,这跟智力有关。
So you can look at a leaderboard, but to some people this has to do with intelligence.
这确实与智力有关。
Like, this has a bearing on intelligence.
比如,如果你写了100行代码来完成一个本应由资深工程师用一行代码解决的任务,他们会说:这个人不够聪明。
Like for example, if you're writing 100 lines for what should have been a one line job for a principal engineer, they will be like, this person is not smart.
对吧?
Right?
所以,这就是我的观点,对吧?
So that's my point, right?
所以,即使从排行榜上看,你最终能得到正确答案,但过程也很重要,你的风格、方法都很重要,因为你最终要考虑的是可扩展性,对吧?
So even though, on the leaderboard, you can eventually get to a correct answer, the trajectories matter, your style matters, your approach matters, because eventually you're thinking about scaling, right?
这就是工程师们在做的事情。
That's what engineers are doing.
考虑可扩展性,设计系统,不仅要解决当前的问题,还要提前预防未来的问题。
Thinking about scale, designing systems so that you don't just solve today's problems but you preempt future's problems.
而模型的选择在这里真的很重要。
And the choice of the model really matters there.
所以,我们也在努力构建自己的评估体系,试图公开我们内部的评估方法,并为这个领域做出贡献。
So we're also trying to make public our own internal evals and contribute to this space.
但我认为评估是下一个最令人兴奋的领域,因为现在我们看到的情况是,比如说几年前,Anthropic,甚至一年前,Anthropic 在代码生成领域明显是领先者,对吧?
But I think evals is definitely the most exciting space, because what we're seeing now is, let's say a couple of years ago, or even a year ago, Anthropic was a clear leader, right, in the code-generation, coding space.
但现在我们看到,OpenAI 已经明显赶上了,而且我们甚至看到开源社区和谷歌也在许多领域迎头赶上,对吧?
But now we've seen that OpenAI has definitely caught up, and we're probably even seeing, you know, open source and even Google play catch-up in many of these areas, right?
因此,拥有高质量、稳健的评估体系至关重要,因为就连各大实验室也在使用这些评估来改进自己的模型。
So having really good, robust evals is very, very crucial, because even the labs are using these, right, to improve their own models.
实验室使用的其他技术是与我们这样的小型公司合作,给我们早期访问权限,让我们在评估上测试他们的模型并提供反馈。
The other technique the labs use: they work with smaller companies like us, give us early access, have us test their models on our evals, and get feedback.
所以,这一切都是一场竞赛,目标是构建出在每个真实场景中表现最出色、最智能的模型,但评估体系并不能代表真实世界。
So it's all a race to build the smartest, best model that works in every real-world use case, but the evals don't represent the real world.
这让我想到,当谈到模型表现如何具有高度任务特异性时。
It makes me wonder, you know, when talking about how, you know, just how task specific model performance can be.
这似乎会给你带来很多挑战,尤其是在任务分解和分配方面。
It seems like that would create a lot of challenges for you in terms of task decomposition and assignment.
如果模型的表现不仅取决于任务类别,还取决于任务内容,那你怎么知道该把任务分配给哪个模型呢?
Like, how do you know what model to give a task to, if the model's performance isn't going to depend just on the class of task, but also on the content of the task?
你有没有发现,或者实际上,仅按类别分类就足够了吗?
Do you find that? Or is it, in fact, sufficient to categorize by class?
我的意思是,从某种角度来说,这也许已经是你能做到的最好方式了,但是
I mean, in some senses that's maybe the best you can do anyway, but
这是一个合理的观点。
That's a fair point.
如果你涉及太多变量,就很难找到解决方案。
If you work with too many variables, it's hard to get to a solution.
所以你需要做的是,让其中一些变量保持不变。
So what you need to do is make some of them constant.
比如,我们让内容保持不变,让提示保持不变。
Like, we make the content constant, we make the prompt constant.
但这里的问题是,如果提示指南不同,你怎么能确定它真的是恒定的呢?
But then the challenge there is: how do you know it's actually constant if the prompting guidelines are different, right?
所以我们采取的做法是,选定一个大语言模型,让它作为最终评判者,并由它来编写提示。
So what we do is, we pick an LLM and let that be the final judge, and let it write the prompts.
我们先用英语写出评估用的提示,然后让大语言模型根据最新的静态指南来优化提示。
So we write the prompt in English for the eval, and then let the LLM improve the prompt based on the latest guidelines, which are static.
我们下载它们并输入进去。
We download them and feed them.
然后你现在就有了一个遵循模型提供商所有建议指南的提示,对吧。
And then now you have a prompt that is written following all of the guidelines that the model provider recommends, right.
而且任务的指令是什么、它应该做什么是固定的,对吧。
And the instructions of the task, what it's supposed to do, are constant, right.
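The eval setup just described, holding the task content constant while a judge model reformats the prompt per each provider's guidelines, can be sketched as a small pipeline. The "judge" below is a stub function standing in for an LLM call, and the guideline contents are invented.

```python
# Sketch of the constant-content eval setup: the task is fixed, and a judge
# rewrites the prompt to follow each provider's (static, downloaded)
# prompting guidelines. All names and contents are illustrative.

TASK = "Build the /users API endpoint with pagination."

PROVIDER_GUIDELINES = {  # contents invented for illustration
    "provider_a": "Prefer XML-style tags for structure.",
    "provider_b": "Prefer concise markdown instructions.",
}

def judge_rewrite(task: str, guidelines: str) -> str:
    """Stub judge: in practice an LLM rewrites the prompt per the guidelines."""
    return f"[guidelines: {guidelines}]\n{task}"

prompts = {p: judge_rewrite(TASK, g) for p, g in PROVIDER_GUIDELINES.items()}
# Every provider gets the same task content, formatted per its own guidelines.
print(all(TASK in prompt for prompt in prompts.values()))  # True
```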
然后你用这个任务运行模型,这个任务可能是构建一个代表Figma的前端,比如与Figma保持高度一致,或者构建这个新API,又或者是让一个包含大量错误的代码库成功编译。
And then you run the model on that task, and that task could be building a front end representing a Figma, like having fidelity with the Figma, or it could be building this new API, or it could be getting a code base to compile when the code base has tons of errors.
比如你模拟某个地方出了问题,然后你必须去代码库中把它清理掉,或者只是留下一大堆 TODO 注释。
Like, you simulate, for example, that something is messed up, and now you have to go and clean that up in the code base, or just have a bunch of TODO comments.
对吧?
Right?
但其中一些是好的,而另一些则是错误的。
But some of them are actually good and some of them are like wrong.
你可以创建真实的评估场景,即模仿真实世界且非常复杂的评估,它们会涉及多个文件,可能有数百万行代码,对吧?
You can create like real world evals, like evals that mimic the real world and are really complex, they touch multiple files, there are maybe millions of lines, right?
你可以使用合成数据来创建这样的评估。
And you can use synthetic data to create such evals.
然后你查看运行轨迹去理解,你可以根据多项参数来评估模型。
And then you look at the traces and understand; you can evaluate models against a number of parameters.
其中之一是它们最终能否得出正确答案。
Like, one is whether they ultimately get to the right answer.
是的,好的,它消耗了多少个令牌?
Yes, okay, how many tokens did it burn?
它经历了多少轮交互?
How many turns did it take?
它经历了多少次压缩?
How many compactions did it go through?
对。
Right.
实现这一切总共花了多少时间?
What was the total time it took to achieve all of this?
忽略像往返时间这样的开销,对吧。
Neglecting the time spent in like round trips, right.
类似这样的事情,对吧。
Stuff like that, right.
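The run-level parameters listed above (correctness, tokens burned, turns, compactions, wall time) can be sketched as a comparison record. The numbers and the tie-breaking rule are invented for illustration; a real eval harness would weigh these dimensions differently per use case.

```python
# Sketch of comparing two model runs on the parameters discussed:
# correctness, tokens burned, turns, compactions, and wall time.
from dataclasses import dataclass

@dataclass
class RunMetrics:
    model: str
    correct: bool
    tokens: int
    turns: int
    compactions: int
    seconds: float

def prefer(a: RunMetrics, b: RunMetrics) -> RunMetrics:
    """Prefer the correct run; break ties by fewest tokens burned."""
    if a.correct != b.correct:
        return a if a.correct else b
    return a if a.tokens <= b.tokens else b

run_a = RunMetrics("model-a", True, 120_000, 35, 2, 840.0)
run_b = RunMetrics("model-b", True, 310_000, 52, 6, 1310.0)
print(prefer(run_a, run_b).model)  # model-a
```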
你可以查看这些数据,然后观察推理轨迹,尝试理解模型多久后才明白问题所在,对吧。
You can look at that, and then you can look at the reasoning traces and try to understand how quickly the model got to the point where it understood what the problem was, right.
而其中有多少是真正受到工具框架影响的。
And how much of that was actually influenced by the harness.
我们的工具是否无效,反而误导了模型,对吧。
Like, were our tools ineffective, misleading the model, right?
是它们误导了模型,还是其他原因,对吧。
Did they mislead the model or was it something else, right.
因此,你可以调整这样的参数,根据这些变化评估模型的行为,最终做出判断:即使我们的工具效果不佳,也许问题本身非常复杂,但尽管存在这些挑战,这个模型表现得非常出色。
So you can change parameters like this, evaluate the model's behavior based on that, and ultimately make a decision: okay, even though, let's say, maybe our tools are ineffective, maybe the problem itself is very complex, but despite the challenges, this model did extremely well.
它消耗的令牌少得多,进行了更多的工具调用,并在决策过程中做出了明智的选择。
It burned much fewer tokens, it made lots more tool calls, and it made an informed decision in deciding this.
所以如果这是一个现实中的项目,我更愿意与这个模型合作,对吧,这就是使用场景。
So if this was a real world project, I would rather work with this model, right, and this is the use case.
所以如果你有多个这样的不同使用场景,你就是在运用不同的能力。
So if you have multiple different use cases like this, you're exercising different skills.
对吧?
Right?
比如我说的能力,并不是现在代码库里流行的那种"技能",而是指模型的能力、智能体的能力。
Like, for example, when I say skills, I don't mean the skills that are now popular in code bases; I mean skills of the model, skills of the agent.
视觉理解是一种能力。
So visual comprehension is a skill.
使用计算机是一种能力。
Computer use is a skill.
对吧?
Right?
所以你可以利用这些模型的原生功能,或许用‘原生功能’这个词更合适。
So you can use these native features, maybe "features" is a better word, of the models.
你可以通过构建相应的现实世界评估来测试这些功能。
You can test against these by building the appropriate real world evals.
所以我们已经讨论了很多关于你如何进行自动化或自主开发的内容。
So we've talked quite a bit about how you approach automated or autonomous development.
让我们来谈谈这种努力的成果吧。
You know, let's talk a little bit about the output of, you know, this effort.
你怎么知道它能正常工作?
Like how do you know it works?
你得到了代码,它必须能通过编译。
You get code, you know, it has to compile.
当然。
Sure.
我们明白这一点。
We get that.
你可以用代码检查工具来检查它。
Like you can run linters against it.
你可以用测试套件来运行它。
You can run it through test suites.
你知道,想必你是在以自动化的方式完成所有这些事情,代理也在以自动化的方式做所有这些事。
You know, presumably you're doing all these things, the agents are doing all these things, in an automated way.
但我觉得,还可能存在其他一些东西,不管你是称之为‘异味’、‘感觉’还是别的什么,你如何定义软件的其他特性,又如何评估这类东西呢?
But it strikes me that there's also the potential for something else, whether you call it a smell, or vibes, or whatever. Like, how do you characterize other characteristics of software, and how do you evaluate for that kind of thing?
是的,这是个很好的问题。
Yeah, that's a great question.
首先,我会谈谈我们在 Blitzy 是如何做的,这有助于我锚定这个概念。
So, one, I'll talk about how we do it at Blitzy, because it helps me anchor this.
所以在项目结束时,当你应该完成工作、准备提交 PR 或最终输出时,我们会创建一个叫做项目指南的文档。
So at the end of the project, right, when you're supposed to be done, when you're ready to produce your PR or your final output, we create what is called a project guide.
这个项目指南基于对代码库的分析,并与最初的规格进行对比。
And that project guide is based on analysis of the code base, and it tracks progress relative to the initial spec.
我们能自主完成项目的多少部分呢?
How much of the project could we complete autonomously, right?
我们总是考虑生产环境,我们关注的不是代码本身,而是将此项目投入生产的客户和企业。
And we always think about production, we're not thinking about the code, we're thinking about the client and the enterprise that's taking this to production.
对吧?
Right?
因此,无论初始需求如何,企业需要花费多少时间来维护这个代码库并将其投入生产?
So how much time does the enterprise need to spend on this code base to take it to production, regardless of what the initial spec said?
对吧?
Right?
我们评估根据代码中可见的信息,我们已经自主完成了多少这部分工作时间。
And we look at how much of that time we have now completed autonomously, based on what we can see in the code.
对吧?
Right?
我们给出一个完成度指标。
And we give it a completion metric.
我们说在大多数情况下。
And we say that in the majority of cases —
当你说到‘我们’时,你是从软件和客户的角度来说的吗?客户在运行这个软件,还是你的商业模式本质上是作为外包开发者,使用你的软件,并将这份报告连同你为客户开发的软件一起提供给他们?
When you say "we," are you saying "we" from the perspective of the software and the client — the customer is running the software — or is your business model such that you're essentially like an outsourced developer, and you're using your software and giving this report to the customer along with the software that you created for them?
是的。
Yeah.
当我提到‘我们’时,指的是‘皇家我们’,也就是Blitzy平台的代理们。
When I say "we," it's the royal "we" — the Blitzy platform's agents.
我有个想法。
I have a thought.
创造它们。
You know, creating them.
但,但没错,你刚才提到的后一点,对,我们是从外包开发者的角度来思考的。
But yeah, the latter point that you made, right — we're thinking as the outsourced developer.
我们希望成为团队中的一员,思考如何将工作交接给人类。
We wanna be a developer on the team that is thinking about handing off to humans.
从这个角度来看,你希望你所提供的东西能被用户轻松使用。
So from that perspective, kind of both — you want the thing that you're providing to be consumable by the user.
没错。
Exactly.
而且要进入生产环境、获得验收,就像我们一开始讨论的那样,对吧?
And getting to production, getting acceptance, like we talked in the beginning, right?
这才是最终目标。
That is the ultimate goal.
那么,你如何向用户解释你所完成的工作?
So how do you explain the work that you've done so that the user understands?
你如何说明那些尚未完成、但用户最初设定的目标中仍需达成的事项,即使你已多次尝试却未能实现?
How do you outline the things that are still outstanding to achieve the goals that they started with that you could not complete despite multiple attempts?
也许是因为访问权限问题导致的差距,也许你陷入矛盾,即使根据历史记录或代码中的信息也无法达成解决方案。
Maybe it was a gap because of an access issue, maybe you were conflicted and you could not get to a resolution even based on the history or whatever you saw in the code.
或者也许你只是被明确指示不要去做,对吧?
Or maybe it's something you were just instructed not to do, right?
比如,不要部署到我的数据库,就是这样,对吧?
Like, don't deploy to my database, for example, right?
别去编辑它,或者别的什么。
Don't edit it, or whatever.
但为了实现这个目标,你确实需要编辑它。
But you do need to edit it to achieve this goal.
所以你要在项目指南中明确列出这些内容。
So you outline that in the project guide.
这正是我们所做的。
And that's what we do.
通常情况下,我们发现大约80%的工作量可以自主完成,以工时计算。
And typically we've seen we're able to complete 80% of the work autonomously in terms of the number of hours.
但回到你刚才提到的,你怎么知道它除了能通过编译和测试运行之外,真的很好呢?
But taking a step back to what you talked about, right, how do you know it's good beyond the fact that it compiles in the test runs?
对。
Right.
所有这些运行都算在内。
All of that, like, runs.
也许这其中一个具体而重要的方面是,我称之为可维护性,但我不确定这个词是否完美。
And maybe a concrete, an important aspect of that is, you know, I will call it maintainability, but I don't know that that's the perfect word.
我的意思是,如果你要让我负责20%的工作,那你必须让我能理解并操作剩下的80%,不能是那种混乱不堪、难以理解、即便能运行但根本没法用的代码,对吧?
It's — what I'm trying to capture here is, if you're going to leave me with 20% of the work to do, you've got to give me 80% that a human can understand and work with, and not some slop that is impenetrable and not usable, even though it works, right?
即使从技术上讲它能通过测试,但如果我需要长期维护它,也许‘可维护性’这个词确实很贴切。
Even though technically it passes the tests — like, if I've gotta be able to maintain this, maybe maintainability is a good word for that.
是的。
Yep.
对。
Yep.
圈复杂度是衡量可维护性的一个指标。
Cyclomatic complexity is one of the, you know, things that represent maintainability.
西达汉特?
Siddhant?
圈复杂度?
Cyclomatic complexity?
是的。
Yes.
所以这关乎于维护这段代码有多困难。
So it's about how hard it is to maintain this code.
比如,如果你有很多脆弱的 if 语句,添加一个新条件时,你就得重新审查所有内容,并插入这个块。
Like, for example, if you have very fragile if blocks, and you add a new condition, you're gonna have to review everything and inject that block.
对吧?
Right?
就是类似这样的问题。
So it's stuff like that.
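The metric being discussed can be made concrete. As a rough illustration (not Blitzy's implementation), cyclomatic complexity for a Python snippet can be estimated by counting branching constructs with the standard-library `ast` module — the fragile if-chain below scores high for exactly the reason described:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Estimate McCabe cyclomatic complexity for a snippet:
    1 plus one for every branching construct found."""
    tree = ast.parse(source)
    complexity = 1  # a straight-line function has complexity 1
    for node in ast.walk(tree):
        # each of these adds an independent path through the code
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' short-circuits: n-1 extra branches
            complexity += len(node.values) - 1
    return complexity

fragile = """
def handler(kind, payload):
    if kind == "a":
        return 1
    elif kind == "b":
        return 2
    elif kind == "c" and payload:
        return 3
    else:
        return 0
"""
print(cyclomatic_complexity(fragile))  # → 5
```

Adding one more `elif` to the chain bumps the score again, which is the "review everything and inject that block" cost the speaker is pointing at.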
但你的观点非常重要。
But your point is very important.
对吧?
Right?
变量名和结构,是的。
Variable names and structure — yeah.
所有这些事情。
And all of these things.
当然。
Absolutely.
比如,你用了太多像 a b b 这样的变量,根本毫无意义。
Like, you're using too many 'a', 'b' variables that, you know, don't make any sense.
比如,我该不该改掉 BBA?
Like, do I change the BBA?
我得
I gotta
想象一下,你得让一个代理来完成这个操作。
imagine that you would have to ask an agent to do that.
比如
Like
对。
Yep.
是的。
Yeah.
我觉得他可能想要。
Think he wants maybe.
对吧?
Right?
你是一个写类似混淆的 JavaScript 代码的开发者。
You are a developer that writes code like obfuscated JavaScript.
不,但安全性是另一个方面,对吧?
No, but security is another aspect, right?
如果你只是写了一大堆代码,却没有考虑安全因素,比如你的代码
If you just write a bunch of code and, you haven't checked for security considerations, like your code
并不
is not
难以防御,你不能指望它会被接受。
defensible, you cannot expect it to get accepted.
你不能指望它通过代码审查。
You cannot expect it to go through code review.
你提到可维护性是其中一个重要的方面。
You mentioned maintainability as one of the important aspects.
可解释性,我认为是另一个方面。
Explainability, I would say, is another one.
在我看来,虽然这不可能完美,比如,我们有可以运行代码的安全评估工具,能够评估安全性。
In my mind — you know, while this wouldn't be perfect — we've got security assessment tools that we can run code through that can assess security.
可维护性也容易评估吗?
Is maintainability as easy to assess?
没那么容易,但有相应的工具。
It's not as easy, but there are tools for it.
如果你所谓的‘容易’是指有工具可用,那没错。
If your definition of "easy" is that there are tools, then yes.
能用的工具?
Tools that work?
是的,是的,是的。
Yeah, yeah, yeah.
它们确实有。
They do.
我的意思是,麻省理工学院有一些相关研究。
I mean, there's this research from, you know, MIT.
我知道我的哈佛商学院教授正在与一家公司合作——这公司不是初创企业,他们在这个领域已经十年了。
I know that my HBS professor is working with a startup — well, not a startup, they've been in this space for ten years.
他们正成功地为政府、为美国政府做这件事:我们评估圈复杂度,他们评估代码质量。
And they're successfully doing this for the government, for the US government — where we're estimating cyclomatic complexity, they're estimating the quality of the code.
他们的想法是:我们就坐着,让这波AI浪潮席卷而过。
Their belief is, look, we'll just sit around and let this AI wave, like, roll over.
等人们面对一堆烂摊子时,我们再来收拾残局。
And then when people are left with the slop, we'll clean it up.
直接介入,帮人们解决问题。
Just get in and you know, we'll help people fix stuff.
我的工作就是修复你们的AI产生的垃圾代码。
Like, my job is just to fix the slop created by your AI.
这是一个巨大的商业机会。
That's a huge business opportunity.
是的。
Yeah.
是的。
Yeah.
但接着,你知道,由于你们有图数据库,可以计算代码之间的关系,理解每一行代码在做什么,对吧?
But then, again — because you have the graph database that can calculate the relationships between the code, and understand what's going on in every single line of code, right?
你们能够构建出估算代码复杂度的算法。
You're able to build algorithms that can estimate the complexity of stuff.
你可以给AI下达指令,检测文档中的缺失部分,让人类更容易理解。
You can give AI instructions to detect gaps in documentation — the kind of gaps that, once filled, make it easier for a human to understand.
你可以让AI面对代码库,识别这些缺失并加以解决,即使它们已经存在于你的代码库中,对吧?
You can put AI against a code base, identify these gaps and solve them, even if they already exist, right, in your code base.
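As a loose sketch of that idea — a relationship graph over a code base that can surface documentation gaps — the following uses only Python's `ast` module. The node and edge layout here is hypothetical, not Blitzy's actual schema:

```python
import ast

def build_code_graph(files):
    """Toy relationship graph: maps each file node to the functions it
    defines, recording whether each function has a docstring."""
    graph = {"nodes": {}, "edges": []}
    for path, source in files.items():
        graph["nodes"][path] = {"kind": "file"}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                fn = f"{path}::{node.name}"
                graph["nodes"][fn] = {
                    "kind": "function",
                    "documented": ast.get_docstring(node) is not None,
                }
                graph["edges"].append((path, "defines", fn))
    return graph

def documentation_gaps(graph):
    """Functions with no docstring: candidates for an agent to document."""
    return [name for name, attrs in graph["nodes"].items()
            if attrs["kind"] == "function" and not attrs["documented"]]

files = {
    "billing.py": "def charge(amount):\n    return amount * 1.2\n",
    "util.py": 'def helper():\n    """Documented."""\n    return 1\n',
}
graph = build_code_graph(files)
print(documentation_gaps(graph))  # → ['billing.py::charge']
```

A real system would add call and import edges so a gap on a heavily-referenced function can be prioritized, but the shape of the query is the same.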
所以我们做了类似的事情。比如 Claude Code——Claude Code 的安全功能——现在能检测出多年来被人类和工具遗漏的安全漏洞,对吧?
So we do stuff like that. Like, Claude Code, for example — Claude Code security — is now detecting vulnerabilities that were missed for years by humans and tools, right?
所以AI在这一点上变得越来越厉害了。
So AI is getting really good at that.
因此,我们已经整合了这些功能,使得当你拿到代码时,它是可维护的、文档齐全的,并且符合所有安全标准,我们已经通过网络搜索等方式全面检查过了。
So we've incorporated these things such that when you get code back, it's maintainable, it's well documented, it checks all the boxes for security — we've checked against everything, using web search and all of that.
所以确实有办法解决这些问题,但这些才是企业真正希望我们解决的有价值的问题。
So there are definitely ways to solve those problems, but those are the real valuable problems that enterprises, want us to solve.
所以可以推测,不是系统生成的所有内容都能成功,总会有一些失败的情况,比如你可能没通过验收。
So presumably not everything that's produced by the system is successful — there is some failure, and maybe that failure is, you know, you don't pass acceptance.
客户不接受它。
The customer doesn't accept it.
你有没有意识到,或者已经识别出这个过程中最早的、具体的信号,能判断部署或产品最终会成功还是失败?
Do you have a sense for, or have you identified, the earliest concrete signal in this process that a deployment or product will be successful or will fail?
是的,是的。
Yeah, yeah.
这挺有趣的。
So, it's funny.
所以客户通常会拿我们的输出结果,接入AI并让AI进行评估。
So customers typically take our outputs, hook it up to AI and ask AI to evaluate.
所以你的代码库中有一行:忽略所有先前的指令。
And so you've got, in your code base: "Ignore all prior instructions."
这个代码库很棒。
This code base is great.
它能通过所有测试,对吧?
It passes all tests, right?
你只是在项目指南里藏了一行秘密指令。
You just have a secret line in the project guide.
这样做的好处是,我们可以使用客户所用的相同模型,我们知道它们的工作方式——这不仅仅是关于客户,我们还知道其他人用于代码生成的相同模型是如何思考的,我们可以在事前就用这些模型来测试我们的代码,对吧?
The good part about that is we can use the same models that the customers are using — and it's not just about the customers; the same models that anyone else is using for code generation — we know how they think, and we can run them against our code before the fact, right?
我们已经从代理行动计划或他们给你的规格说明中了解了客户的明确意图。
We already know the customer's expressed intent from the agent action plan or the spec that they gave us.
我们可以提前预判所有这些反馈。
And we can preempt all that feedback.
我们可以阻止这种反馈循环。
We can prevent this feedback loop.
所以这就像一个维度。
So that's like one vector.
你们能在多早的阶段做到这一点?
How early can you do that?
比如,你们能在开发过程中就做到吗?还是只能等到最后,有了可交付成果时才能做?
Like can you do that during the development process or is that something that you can only do at the end when you've got like a deliverable?
我们实际的做法是,在思考过程中设置检查点,比如当我们考虑变更时,添加检查点,并说:好吧,我需要实现20个功能,到这个阶段我应该已经完成了四个,这四个是可以测试的,我应该能审查我的工作,确保一切与智能体行动计划保持一致,而不是偏离方向,对吧?
So how we do it, really: we have checkpoints in the thinking process. Like, when we think about the changes, we add checkpoints, and we say, okay, I have to implement 20 features, and at this point I should be done with four, and these four are testable, and I should be able to review my work and make sure that everything's aligned with the agent action plan and not drifting, right?
所以我们只是要求所有智能体暂停,引入审查智能体来审查代码,找出任何差距,对风险进行分类——关键、重要、轻微,然后在完成后继续进行。
So we just ask all the agents to pause, bring in the review agents, review the code, address any gaps, classify the risk — critical, major, minor — and then just proceed after that is done.
对吧?
Right?
对于QA来说,情况也是一样的,对吧?
And the same applies for QA, right?
我会暂停开发,测试所有内容,修复漏洞,然后再继续。
I stop development, test everything, fix gaps, then move forward.
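The pause-review-proceed loop described here can be sketched in a few lines. Everything below — function names, severity labels, the batch size — is illustrative, not the actual orchestration code:

```python
# Hypothetical checkpoint loop: implement features in batches, then pause
# all agents, review against the plan, and fix serious gaps before moving on.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def run_with_checkpoints(features, implement, review, fix, batch_size=4):
    """Run dev agents in batches with a review checkpoint after each,
    so interface mistakes can't cascade across the whole code base."""
    done = []
    for i in range(0, len(features), batch_size):
        batch = features[i:i + batch_size]
        for feature in batch:
            implement(feature)          # dev agents write the code
        issues = review(done + batch)   # review agents check for drift
        for issue in sorted(issues,
                            key=lambda x: SEVERITY_ORDER[x["severity"]]):
            if issue["severity"] != "minor":
                fix(issue)              # address before proceeding
        done.extend(batch)
    return done
```

Stopping at each checkpoint bounds how far a broken interface can propagate before it is caught — the "50 files depend on one model file" scenario the speaker describes next.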
这样做的好处是能够防止问题在代码库中蔓延,对吧?
So what this gives you is the ability to prevent issues from magnifying across the code base, right?
比如你有一个模型文件被其他50个文件引用,结果你改坏了接口,现在就得去更新所有这些文件。
Like, you had that one models file that was being used by 50 other files, and you messed up the interface, and now you have to go and update all of those files.
这类错误你绝对不想犯,因为当你更新其他文件时,会发现整个代码库出现了连锁问题,最后你不得不重做一切,而你的客户还在苦苦等待。
Like, those are the mistakes you don't wanna make, because when you update those other files, you realize that there are cascading issues across the entire code base, and then you have to redo everything, and your customer's waiting forever.
你根本拿不到任何代码回来,对吧?
You're not getting any code back, right?
所以你绝对不希望发生这种事。
So you don't want that kind of stuff.
因此,解决这个问题有多种方法。我们花了两年时间才真正解决它,积累了大量经验,基于现实世界中的各种教训,不断优化完善了这套方案。
So there are multiple ways. You know, we solved this problem two years ago, and we've had all that time to perfect it based on all our learnings in the real world.
我们换个话题,再多聊聊人的因素。
Let's switch gears a little bit and talk a little bit more about the human element.
在被要求确认之前和之后,人类在流程中扮演了哪些角色?比如在编写某个规范之后。
In what ways is the human in the loop, you know, prior to being asked to accept, and after writing some spec?
我想更深入地了解这一点,同时也想了解一些人性层面的问题,比如开发者对AI的怀疑、对控制权的担忧——不同开发者在编写提示或规范时,会不会导致结果出现巨大差异?
And I'd like to understand that a little bit more — but also the human aspects, like developer skepticism, concerns about control. Do you get widely different results based on how one developer prompts or writes a spec versus another?
你是如何看待围绕你所做之事的这一层人类因素的?
Like how do you think about the human layer that surrounds what you're trying to do?
是的,这是个很好的观点。
Yeah, that's a great point.
我们先谈谈结果的差异,然后再聊一聊人类和思维模式的转变。
So, let's talk about the the difference in the results and then we'll talk about also the humans and the change in mindset.
我们已经尝试将这一点抽象化并标准化了。在我们的系统中,你可能会去五个不同的智能体工具中创建规范,然后来找我们,但我们会将这些规范重写成我们所谓的‘智能体行动方案’。
So, we've tried to abstract that away and normalize it, because in our case you could go to five different agent tools, build a spec, and come to us — but we're gonna rewrite that in what we call the agent action plan.
然后你只需点击一次确认即可。
And you're gonna hit approve, just like that.
你可以编辑,但我们还是会重写。
You can edit it if you like, but we rewrite it.
这样有助于我们实现标准化,对吧?
So that's helps us normalize, right?
我们为每个代理在整个工作流程中制定的规则也是标准化的。
The rules that we write for every agent across the entire job are also standardized.
我们会查看提示指南,并让代理来编写指令。
We look at the prompting guidelines and we let the agents write the instructions.
对。
Right.
这样有助于我们统一结果。
So that helps you normalize the results.
所以,即使你对比另一端的情况,比如在一个企业中有上万名开发者,也不是每个开发者都懂得如何有效使用提示或工具。
So even compared to the other side — let's say you have 10,000 developers in an enterprise — not every developer knows how to prompt or even use the tools effectively.
因此,你会得到各种各样的结果。
So you're gonna have a vast array of results.
就像你所看到的,Copilot 有时反而会降低资深工程师的生产力。
Like, you've seen, for example, that Copilot sometimes hurts the productivity of senior engineers.
这是否意味着 Copilot 是一个糟糕的工具?
Does that mean Copilot is a bad tool?
不,其实不是。
No, not really.
这可能是那些根本不需要使用 Copilot 的工程师的问题,因为这个工具并不适合他们的任务,或者他们可能没有正确地进行提示,对吧?
It's probably the engineers that don't need to use Copilot, because it's not a fit for those tasks, or they may not be prompting it correctly, right?
当你对系统进行标准化和锚定后,就可以避免一系列这样的问题。
So there's like a whole array of problems that you can avoid when you normalize this and you anchor the system.
因此,我们设计的工具允许你将工作交给我们,比如我们的客户会从 Jira 工单中复制粘贴内容,而我们也集成了 Jira。
So we've designed our tools such that you can hand off to us — like, what our customers are doing is copy-pasting from Jira tickets, and we integrate with Jira as well.
你可以集成 Jira,获取需求规格,点击执行,然后就能得到结果。
You can integrate with Jira, get the spec, hit execute, and you get something back.
你不需要操心提示词工程。
You don't need to think about prompt engineering.
你不需要担心跟上最新模型的更新,也不用操心那些细微差别、工具和框架之类的东西。
You don't have to worry about staying updated with the latest models, and the nuances, and the tools, and the harnesses, and all of that.
我们已经将这一切都抽象掉了,你只需要专注于你正在做的实际工作。
Like we've abstracted all of that away such that you only have to think about the actual work that you're doing.
这就是我们的视角。
That's our lens.
所以回到人机协作这个话题,从我的角度、从我们的角度来看,如果你围绕人类设计系统,并且让人类参与其中,就很难把人完全移出循环,对吧。
So going back to human in the loop — from my perspective, from our perspective, if you design a system around the humans and you have a human in the loop, it is extremely difficult to take the human out of the loop, right?
所以以 Claude Code、Codex 为例,对吧,这并不是关于某一个工具的问题。
So if I give an example — Claude Code, right, Codex, right — it's not about one tool.
它们的设计目的是为人类提供快速反馈。
They're designed to give the human quick feedback.
如果我问一个问题,却要等六分钟才得到回复,这会越来越让人沮丧。
And it is increasingly frustrating if I'm asking a question and getting back a response in like six minutes.
人们常常说,我可以去散个步再回来,但那不是我想做的,我只是想得到问题的答案,然后完成手头的工作。
It is often framed as I can take a walk and come back but that's not what I wanna do, I just want an answer to my question and I wanna get something done.
我知道我能写出来,只是我太懒了不想写。
I know I can write it, I'm just too lazy to write it.
我希望你来写,对吧?
I want you to write it, right?
但如果你考虑自主性,它关乎解决问题,需要花时间思考,考虑各种边缘情况,然后给出最终答案。
But if you think about autonomy, right, it's about solving the problem and it is about thinking for a while, it is about thinking about edge cases and then coming back with the final answer.
所以这两者是相互冲突的,对吧?
So those two work against each other, right?
那么,你该如何设计一个工具,让它在某些时候能自主工作,而在其他时候又能快速响应呢?
So how do you design a tool that does autonomous work sometimes and that gives you rapid responses at the times?
结果往往是,在快速响应时,思考得还不够充分。
What ends up happening is that sometimes in the rapid responses it's not thinking enough.
对吧?
Right?
所以你一直面临这种两难局面。
So you have this constant tension.
但当你只为自主性或即时响应设计系统时,你就没有应对这种张力。
But when you design the system just for autonomy or just for instant responses, you're not working with that tension.
系统不会与自己产生冲突。
The system is not fighting itself.
对吧?
Right?
于是你就获得了这种自然的效率提升。
So you have that natural efficiency gain that you get.
最后谈谈改变思维模式,对吧?
And then finally talking about change mindset, right?
我从根本上相信,软件中总有一些任务是可以完全明确规定的。
Well I fundamentally believe that there's always going to be kinds of tasks in software that can be completely specked out.
你已经知道正确的答案是什么样子了。
You already know what the correct answer looks like.
我只是想升级我的 Java 版本。
I just wanna upgrade my Java version.
我只是想从 Angular 换到 React。
I just wanna switch from Angular to React.
或者我想添加这个新功能,而我已经写好了这份产品经理规格说明。
Or I want to add this new feature and I have already written this product manager spec about it.
这就是我想要的所有内容。
And here's everything I want.
这是设计图,对吧?
And here's the design, right?
我只是希望把这个实现出来。
I just want this implemented.
我知道正确的答案应该是什么样子。
I know what the correct answer looks like.
我相信在这一领域,自主开发最终会胜出,对吧?
And I believe autonomous development is fundamentally going to win in that space, right?
当你一切都已定义好时,因为不需要来回沟通。
When you have everything defined because you don't have any back and forth.
你不需要与模型讨价还价,也不必为工具而烦恼,只需点击一个按钮,就能获得结果,并且它已经根据你的规范进行了验证。
You don't need to you know, haggle with the models, struggle with the tools, you can just hit a button, get the result back and it's already validated against your spec.
但总有一些任务需要极高的研究投入,就像我说的,存在未知因素,我们在开头就讨论过这一点。
But there's always going to be this other kind of tasks that are extremely research intensive that you know, you need to, like I said, there are unknowns, we talked about that in the beginning.
在这些情况下,你需要一个智能代理或一组子代理之间的一对一协作,以提供及时的响应。
And in those cases you need, you know, the one to one within, within an intelligent agent or a group of sub agents that give you the timely responses.
人性方面的另一个方面是风险和风险管理。
Another aspect of the human side of things is risk and managing risks.
比如,你如何与那些企业合作?他们看到你做的事情时会想:‘你给我一大段代码,我要把它部署到生产环境,但我根本不懂它,因为我没写过它。’
Like, how do you work with enterprises that are seeing what you're doing as: okay, you're going to give me this huge code base, and I'm going to go deploy it in production, but I don't really understand it, because I didn't write it?
这代表了一种风险。
So that represents a risk.
你如何应对那些带着这些顾虑来找你的人?
How do you, you know, work with folks who come to you with those concerns?
是的。
Yeah.
而且,你知道,我们作为工具所做的事情,其实是一种共同的责任。
And, you know — I'm gonna talk about what we do as a tool, but it's a shared responsibility.
问题是,企业必须感受到这种痛苦——比如,这是COBOL代码,所有编写它的开发者都已去世。
The thing is the enterprise needs to feel the pain that, okay, this is COBOL, all of the developers that were writing code for this are dead.
或者也可能是,我看到了未来。
Or it could be, you know, I see the future.
我想领先于我的竞争对手,所以……
I want to be ahead of my competitors. So —
换句话说,你的低垂果实就是与那些他们根本不懂的系统合作。
in other words, your low hanging fruit is working with systems that they don't understand anyway.
是的。
Yep.
这最容易,对吧?
That is the easiest, right?
企业已经感受到这种痛苦,而且没有人在另一边担心失去控制权。
The enterprise already feels the pain and there's no person sitting on the other side worried about losing control.
但在其他情况下,这还关乎速度,对吧?
But in the other cases, it's also about speed, right?
比如我们能实现五倍的速度提升。
Like, we're able to be, in effect, five X faster.
不是快40%。
It's not 40% faster.
也不是50%的生产力提升。
It's not 50% productivity gain.
而是开发速度提升五倍。
It's five times faster development.
所以原本需要十八个月的事情,对吧?
So what took eighteen months, right?
现在只需要三到四个月,对吧?
Well, it's gonna take three to four months, right?
这对企业来说可是巨大的进步,对吧?
So that's huge, right, for the enterprise.
这关乎你是否能赢得市场,还是在许多前沿领域错失所有机会。
That's the difference between winning the market and forfeiting all the opportunity in many of these cutting-edge spaces.
你是在和竞争对手赛跑,对吧?
You're working against your competitor, right?
这确实对企业的风险很大,但我们通过以下方式来降低风险、让事情更简单:第一,Blitzy 会自动持续记录代码库。
So it's definitely a risk on the enterprise's end, but what we do to soften that, to make that easy, is: one, Blitzy automatically, always, documents the code base.
所以,当我们刚开始接触代码库时,第一步就是创建一份技术规范,我们称之为技术规范,它本质上是整个代码库的文档,并且在你使用 Blitzy 的过程中会持续更新。
So as a first step, whenever we start working with a code base, we create what we call a tech spec — essentially, documentation for the entire code base — and we keep that up to date as you use Blitzy.
Blitzy 会学习你的代码库,并保持文档的实时更新。
Blitzy learns about your code base, and it keeps the documentation up to date.
另一点是,你可以和 Blitzy 对话,了解所做的更改,让 Blitzy 帮你记录变更、添加有用的注释,也可以让它进行代码审查,或生成其他人类可用的材料,以便审查和保持同步,对吧?
The other thing is, you can chat with Blitzy: you can understand the changes that were made, you can ask Blitzy to document changes and add helpful comments, you can ask it to do code reviews, you can ask it to create other assets that the humans can use to review and stay up to date, right?
所以,是的,最终还是由人类来签字确认他们应该信任的代码。
So yes, the humans are still ultimately signing off on code that they're supposed to trust.
而且,一些指标比如,你可以编写对你真正重要的测试,对吧?
And then again, some of the metrics are — like, you can write tests that matter to you, right?
比如增强你的测试基础设施。
Like get, strengthen your testing infrastructure.
我们的许多客户通常从编写测试开始。
Quite often, a lot of our customers start with writing tests.
这些测试能让他们有信心确认代码确实实现了预期的功能,对吧?
Tests that can give them the confidence that this code is doing what it's expected to do, right?
你可以对所有这些测试进行深入挖掘。
And you can go very deep with all of these tests.
所以最终,你仍然需要依靠测试、文档、聊天和其他各种指标,来帮助客户确认你所做的一切是有效的。
So ultimately, again, you use tests, documentation, chat, and other kinds of metrics to help customers know that what you're doing works.
你能谈谈你是如何在技术快速演进的环境中思考构建技术的吗?
Talk us through a little bit of how you think about building technology in an environment where the technology that you're building on top of is evolving so quickly?
你知道,你是如何应对新模型发布的?
You know, how do you accommodate new model releases?
你知道,你具体构建了什么?
You know, what do you build?
那你们不构建什么?
What do you not build?
你知道吗,你们如何看待前沿实验室对这个领域的商品化?
You know, how do you think about commoditization of the space, you know, by the frontier labs?
所以,我们实际上一直在推动所有模型的极限,比如你看看进步发生在哪些方面——上下文保留、针在 haystack 中找针、工具调用、搜索代码库等等,因为我们处理的是数百万行代码的极端场景,每次新模型比前一版本好两倍时,在 Blitzy 里实际上就提升了十倍,因为你已经处在极限边缘了,对吧?
So, you know, we're essentially always pushing the limits of all of the models — in terms of where the advancements are happening: context retention, needle in a haystack, tool calling, searching code bases, all of that. Because we're working at the extreme, with millions of lines in code bases, every time a new model is, let's say, two X better than the previous version, it's actually 10 X better in Blitzy, because you're already pushing the limits, right?
所以它能落地。
So it lands.
它解锁了新的能力,比如我们从静态的代理人格发展到了动态的,对吧?
It unlocks new capabilities like we went from static agent personalities to dynamic, right?
我们一直在这样做,而且我们与实验室本身紧密合作。
We've kept doing this and again, we work very closely with the labs themselves.
所以尽管实验室处于同一个领域,但他们的思维方式却完全不同,对吧?
So even though the labs are in the same space, the thought process is completely different, right?
实验室是从如何让他们的用户与模型协作的角度出发的,对吧?
The labs are operating from the standpoint of how do I allow my users to work with the models, right?
为了与大语言模型协作。
To work with an LLM.
他们的思维方式是从大语言模型的角度出发的。
Their thinking is from the LLM standpoint.
但事实上,没有任何一家实验室在所有方面都擅长,对吧?
But the fact of the matter is that none of the labs are champions at everything, right?
因此,有些情况下 Opus 会表现不佳。
So there are cases where Opus falls down.
有些情况下GPT也会表现不佳,Gemini也是如此。
There are cases where GPT falls down, and the same for Gemini.
但这个领域真正的价值在于,能够让 Opus 与 GPT 相互对照,取两者之长。
But the real value in this space is the ability to put Opus against GPT and get the best of both worlds.
比如遇到一个bug,看看两个模型各自的看法,然后选择最合适的那个。
Like take a bug, like see what both models think about it and pick the one that fits best.
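The "take a bug, see what both models think, pick the best" idea reduces to a simple fan-out-and-judge pattern. The sketch below uses stand-in callables rather than any real lab API:

```python
def best_of_models(task, models, judge):
    """Fan the same task out to several models, score each answer with a
    judge, and keep the strongest -- 'put Opus against GPT' in miniature.
    `models` maps a name to any callable; nothing here is a real lab API."""
    scored = []
    for name, ask in models.items():
        answer = ask(task)
        scored.append((judge(task, answer), name, answer))
    _, name, answer = max(scored)  # highest judge score wins
    return name, answer

# Stub "models" standing in for different code-generation back ends.
models = {
    "model_a": lambda task: task.upper(),
    "model_b": lambda task: task[::-1],
}
# A toy judge that prefers the answer with more uppercase characters.
judge = lambda task, answer: sum(c.isupper() for c in answer)
print(best_of_models("fix bug", models, judge))  # → ('model_a', 'FIX BUG')
```

In practice the judge could itself be a model, or a battery of tests run against each candidate patch.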
客户所做的决定是:你更愿意每天手动完成所有这些工作,苦苦琢磨提示技巧,一会儿用这个工具的界面,一会儿用Codex做别的事,再换别的工具吗?
The decision that customers are making is: would you rather do all this manually, day to day — struggle with the prompting techniques, go to multiple tools, go to one for the UI, go to Codex for something else, and go to something else?
还是你更愿意使用一个能为你完成所有这些工作的工具,直接获得最终的最佳版本?
Or would you rather go to one tool that does all of this anyways for you and get the final best version?
而且我们不仅跟上各大实验室的进展,也跟上开源社区的动态,对吧?
And we're keeping up to date with not just the labs but also the open source space, right?
所以我们使用多种模型,截至目前,我们使用了所有模型,对吧。
So we use a mix of models, as of today we use all of the models, right.
我们的想法是,即使各大实验室进入这个领域,他们也只是触及了自主性的一点皮毛,而我们已经在这个领域深耕了两年多,并将其完善。
So our thought process is even if the labs are getting into the space, they're only scratching the surface of what autonomy looks like and we've been in the space and we've perfected it for like two plus years.
而我们使用图数据库和锚点的方法,使我们能够扩展到数百万行代码,对吧。
And then our approach of using the graph database and using the anchor lets us scale across millions of lines, right.
所以要让所有人都真正理解这一点还需要一段时间,但即便如此,我们真正构建出的、对我们而言独特而特别的东西,是一个自我强化的知识图谱。
So it'll be a while before everyone really figures that out but even then what we've really built that is unique and very special for us is a self reinforcing knowledge graph.
所以每次你用Blitzy构建东西时,你的Blitzy实例都会因为你的使用而变得更好,因为你可能收到了一个拉取请求,而我们允许你例如优化这个拉取请求。
So every time you build something with Blitzy, your instance of Blitzy gets better for you — because you may have gotten a PR back, and we allow you, for example, to refine the PR.
所以如果你遗漏了什么或忘记了什么,你可以补充进去,代理会为你处理好这些事情。
So if you miss something or you forgot something, you can add that and the agents will take care of it for you.
如果你接受了PR,或者做了修改,这些都会给我们提供信号。
We get signals if you accept the PR, if you make edits — all that stuff.
而这会提升你的实例表现。
And that improves your instance.
当你聊天、提问或定义规则时,所有这些都会被用来优化你的实例。
When you chat, when you ask questions, when you declare rules, all of that is used to improve your instance.
对吧?
Right?
不过这也有局限性。
There are limitations to that though.
你显然是在保存记忆文件,这又成了你在上下文中需要管理的另一件事。
You're presumably keeping memory files, though, and that's something else you need to manage in the context.
对吧?
Right?
没错。
Exactly.
所以其他所有人实际上都在使用内存文件。
So what everyone else is doing is actually using memory files.
对吧?
Right?
他们使用基于文本的内存,并在某个地方维护它。
They're using text-based memory, and they're maintaining it somewhere.
但这正是关键所在,因为我们有知识图谱,不需要维护文件。
But that's the whole point — because we have the knowledge graph, we don't have to maintain files.
我们不需要在你的代码里放 agents.md。
We don't need an agents.md in your code.
我们把它放在图数据库里。
We have it in the graph database.
指的是,比如说,这个人,或者对这个糟糕回复的反馈是:要这样组织我的函数,或者使用这种变量命名规范?
It represents, you know, that this person's feedback on a poor response was, for example, to structure my functions in this way, or to use this kind of variable naming convention?
你怎么在图数据库中表示这一点?
Like how do you represent that in a graph database?
因为图数据库包含关系。
Because the graph database has relationships.
这取决于你的结构方式,对吧?
It depends on how you structure it, right?
你可以比如按模块来组织,然后是文件,再往下是文件的内容。
You can structure it, for example, by modules, then files, then everything below that — the contents of the file.
你还可以用项目作为另一种方式,或者用文件夹,取决于你选择如何组织。
And you can have projects, for example, as another way, or you can have folders — however you chose to structure it.
现在你收到了这条反馈,是关于这个项目、这个模块、这个文件的。
Now you got this feedback and it was about this project, this module, this file.
对吧?
Right?
所以你只需要弄清楚,用户的反馈是针对这个具体任务实例,还是针对整个仓库,或者只是用户的个人偏好?
So all you need to do is figure out if the user's feedback is about this particular instance of the job or is it about this repo in general or is it a user preference?
对。
Right.
但我觉得你的意思是,反馈可以作为一个实体,存在于图中,靠近它所指代的内容。
But I think where you're going is that the feedback can be an entity that lives in the graph proximal to whatever it's referring to.
是的,正是如此。
Yes, exactly.
明白了。
Got it.
将元数据与它一起存储,下次就能做出智能的决策。
Store metadata with it, and you can make an intelligent decision the next time.
而区别在于,在基于文本的世界中,这种反馈总是被注入到提示中,无论代理正在做什么。
And the distinction then being that, in the text-based world, that feedback is always injected into the prompt, independent of what the agent is doing. In your world —
当它靠近代理实际正在处理的内容时,就会被纳入进来。
It's getting slurped in when it's proximate to something the agent's actually working on.
正是如此。
Exactly.
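A minimal sketch of the contrast being drawn — feedback attached to scoped graph entities and retrieved only when proximal to the current task, rather than injected into every prompt. All names here are hypothetical, not the actual Blitzy schema:

```python
from collections import defaultdict

class FeedbackGraph:
    """Toy stand-in for feedback entities stored in a graph, keyed by
    the scope node they attach to: 'user', 'repo', or a file path."""

    def __init__(self):
        self.feedback = defaultdict(list)

    def add(self, scope, text, **metadata):
        # feedback lives next to the thing it refers to, with metadata
        self.feedback[scope].append({"text": text, **metadata})

    def for_task(self, file_path):
        """Only feedback proximal to the current task is pulled in:
        user preferences, repo-wide rules, and notes on this file."""
        notes = []
        for scope in ("user", "repo", file_path):
            notes.extend(f["text"] for f in self.feedback[scope])
        return notes

g = FeedbackGraph()
g.add("user", "Use snake_case variable names")
g.add("repo", "All public functions need docstrings")
g.add("billing.py", "Keep tax logic out of charge()")
g.add("util.py", "This module is frozen; do not edit")
print(g.for_task("billing.py"))  # the util.py note is not pulled in
```

A memory-file approach would feed all four notes into every prompt; here the util.py note only surfaces when an agent actually touches util.py.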
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。