本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
以下是与Cursor团队创始成员Michael Truell、Sualeh Asif、Arvid Lundmark和Aman Sanger的对话。Cursor是一款基于VS Code的代码编辑器,新增了许多强大功能以支持AI辅助编程,引发了编程与AI社区的广泛关注和热情。我认为这是深入探讨AI在编程中角色的绝佳机会。这场技术性极强的对话,其意义远超单一代码编辑器本身。
The following is a conversation with the founding members of the Cursor team, Michael Truell, Sualeh Asif, Arvid Lundmark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities. So I thought this is an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just about one code editor.
这关乎编程的未来,更广泛地说,是人类与AI在设计和构建复杂强大系统时协作的未来。现在快速介绍一下本期赞助商,详情请查看描述栏——这是支持本播客的最佳方式。我们有整合机器学习生态的Encord、知识平台MasterClass、电商工具Shopify、企业管理系统NetSuite,以及健康品牌AG1。
It's about the future of programming, and in general, the future of human AI collaboration in designing and engineering complicated and powerful systems. And now a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast. We got Encord for unifying your machine learning stack, MasterClass for learning, Shopify for selling stuff online, NetSuite for your business, and AG1 for your health.
朋友们请理性选择。若您想联系我、参与调研或提交AMA问题,欢迎访问lexfridman.com/contact。现在进入完整广告时间——我已尽力让它们有趣,但即便跳过也请支持下赞助商们。
Choose wisely, my friends. Also, if you want to get in touch with me for whatever reason, or take a survey or submit questions for an AMA, all of that will be great. Go to lexfridman.com/contact. And now, onto the full ad reads. I try to make them interesting, but if you skip them, please still check out our sponsors.
我个人很喜欢他们的产品,或许您也会。本期由Encord赞助,该平台提供专注于数据的AI工具,涵盖数据标注、管理及模型评估。最让我欣赏的是他们的技术博客——既保持专业深度又避免过度晦涩,比如最近解析OpenAI o1模型的文章就真实呈现前沿技术而非空谈。
I enjoy their stuff. Maybe you will too. This episode is brought to you by Encord, a platform that provides data-focused AI tooling for data annotation, curation, management, and for model evaluation. One of the things I love about these guys is they have a great blog that describes cleanly I mean, it's technical, but it's not too technical, but it's sufficiently technical to where it's actually describing ideas, not BS. Blog posts on sort of the state of the art, like the OpenAI o1 model that was just released.
他们时而会阐述这些技术与Encord平台的整合逻辑,时而不会——这种风格我很欣赏。即便不关注产品,我也推荐他们的博客。当观察前沿模型时,他们总在探索如何融入自家平台。本质上这是个数据管理平台,而数据就是一切。
So sometimes they integrate it into why this is a part of Encord, why this makes sense, and sometimes not. So I love that. I recommend their blog just in general. That said, you know, when they are looking at state of the art models, they are always looking for ways to integrate it into their platform. Basically, it's a place to organize your data, and data is everything.
在Transformer技术爆发前如此,现在依然如此。人类生成的非合成数据极其重要——从生成、组织到利用,从预训练、微调到后期处理,整个数据生命周期都至关重要。因此Encord对数据秉持极致严谨的态度。
This was true before the popularity and the explosion of attention methods of transformers, and it is still very much true now. Sort of the non synthetic, the human generated data is extremely important. How you generate that data, how you organize that data, how you leverage it, how you train on it, how you fine tune on it, the pre training, the post training, all of it, the whole thing. Data is extremely, extremely important. And so Encord takes data very seriously.
欢迎访问encord.com/lex试用Encord创建、标注和管理AI数据。本期也由MasterClass赞助,您可向200多位世界级大师学习,比如卡洛斯·桑塔纳的吉他课——我个人超爱这门。
Anyway, go try out Encord to create, annotate, and manage your AI data at encord.com/lex. That's encord.com/lex. This episode is also brought to you by MasterClass, where you can watch over 200 classes from the best people in the world in their respective disciplines. Carlos Santana on guitar, for example. I loved that one.
他们有好几门吉他课程,汤姆·莫雷罗的也很棒。但卡洛斯的器乐作品《Europa》...虽然我还没尝试弹奏,但已列入待办清单。
There's a few guitar ones, Tom Morello too. Great. Great. Great stuff. But Carlos Santana, his instrumental Europa, I haven't quite tried to play that, but it's on my to do list.
这种曲子你确信一定要弹——它太美了,充满灵魂感。弹奏时你会对吉他产生全新认知。它不属于布鲁斯,也难以归类...
This is sort of one of those things, you know for sure this is a thing I will play because it's too beautiful. It's too soulful. It feels like once you play, you understand something about the guitar that you didn't before. It's not blues. It's not I don't know what it is.
像一场通往迷幻世界的梦境穿越,那音色比我听过的任何声音都温暖,却仍能让吉他如泣如诉。难以言喻的震撼,他是天才。能请到这样的天才分享秘诀,实在是给我们的礼物。
It's some kind of dream like teleportation into a psychedelic world where the tone is warmer than anything else I've ever heard, and still, the guitar can cry. I don't know. I love it. He's a genius. So it's such a gift that you can get a genius like that to teach us about his secrets.
立即无限制访问所有大师课程,并在masterclass.com/lexpod享受年度会员额外15%优惠。网址是masterclass.com/lexpod。本期节目还由Shopify赞助,这是一个让任何人都能在任何地方销售的平台,拥有美观的在线商店或简约的网店,比如我在lexfridman.com/store搭建的那个。那里有几件T恤,感兴趣可以看看。说到T恤,我想起了二手商店,这是我长久以来非常钟爱的地方。
Get unlimited access to every MasterClass and get an additional 15% off an annual membership at masterclass.com/lexpod. That's masterclass.com/lexpod. This episode is also brought to you by Shopify, a platform designed for anyone to sell anywhere with a great looking online store or simple looking online store, like the one I put together at lexfridman.com/store. I have a few shirts on there in case you're interested. And speaking of shirts, I'm reminded of thrift stores, which I very much loved for a long time.
现在依然喜欢。二手商店是淘东西的好去处,比如厨房用品和衣服。你在那里找到的服装其实相当有趣,因为那些T恤在别处根本买不到。如果你有点挑剔又富有创意,那里有很多有趣的时尚单品。尤其是T恤,有些简直滑稽到不行。
I still love. Thrift stores were a nice place to get stuff, like, I don't know, kitchen stuff and clothing. And the kind of clothing you get at thrift stores is actually pretty interesting because there's shirts there, they're just unlike anything else you would get anywhere else. So if you're sort of selective and creative minded, there's a lot of interesting fashion that's there. And in terms of t shirts, there's just like hilarious t shirts.
那些T恤与你人生轨迹相去甚远,或是你从未想过会穿的——比如你喜欢的乐队T恤,却从没想过穿上身。某种程度上,我觉得Shopify就像互联网的二手商店。当然,你可以做得很高端,也可以很花哨,或者超级节俭。一切皆有可能。现在注册可享每月1美元试用期,访问shopify.com/lex。
T shirts that are very far away from the kind of trajectories you have taken in life or are not, but you just haven't thought about it, like a band that you love, but you've never would have thought to wear their t shirt. Anyway, a little bit, I think of Shopify as the Internet's thrift store. Of course, you can do super classy, you can do super fancy, or you can do super thrift. All of it is possible. Sign up for a $1 per month trial period at shopify.com/lex.
网址全小写。立即访问shopify.com/lex,将你的业务提升到新高度。本期节目还由NetSuite赞助,这是一套全能云端商业管理系统。有时我觉得NetSuite赞助这档播客是在调侃我,好像在说:嘿,Lex,
That's all lowercase. Go to shopify.com/lex to take your business to the next level today. This episode is also brought to you by NetSuite, an all in one cloud business management system. Sometimes I think that NetSuite is supporting this podcast because they're trolling me. They're saying, hey, Lex.
你是不是话太多了?也许该多去搞点建设。我同意,NetSuite。我完全同意。所以每次为NetSuite念广告时,都是我与荣格阴影面交锋的机会。
Aren't you doing a little too much talking? Maybe you should be building more. I agree with you, NetSuite. I agree with you. And so every time I do an ad read for NetSuite, it is a chance for me to confront my Jungian shadow.
有些心魔会从潜意识里冒出来,抛出我无法回答的问题。关于生命有限的问题,关于人生苦短的问题,关于人生最充实之事莫过于组建家庭、养育子女的问题——这些我都非常渴望拥有。同时我也清醒地热爱编程,热爱创造,热爱打造人们能用能分享、能让生活更美好的酷东西。所有这些。
Some of the demons emerge from the subconscious and ask questions that I don't have answers to. Questions about one's mortality and that life is short, and that one of the most fulfilling things in life is to have a family, kids, and all of these, things I would very much like to have. And also the reality that I love programming, and I love building. I love creating cool things that people can use and share, and that would make their life better. All of that.
当然我也爱听播客,某种程度上我把这档播客当作自己在听的节目,只是偶尔能通过提问参与其中。当你面对所有热爱的事物时,却不得不思考这个尖锐问题:生命正在流逝,它很短,真的非常短。你想如何度过余下的分分秒秒?
Of course, I also love listening to podcasts, and I kinda think of this podcast as me listening to a podcast where I can also maybe participate by asking questions. So all these things that you love, but you ask the hard question of, like, okay, well, life is slipping away. It's short. It really, really is short. What do you wanna do with the rest of the minutes and the hours that make up your life?
是啊。感谢NetSuite带来的存在主义危机,我很感激。如果你在经营企业,如果你已跃入未知领域创办公司,那就该用正确的工具来管理。事实上已有超过37,000家公司升级使用NetSuite。
Yeah. So thank you for the existential crisis, NetSuite. I appreciate it. If you're running a business, if you have taken the leap into the unknown and started a company, then you should be using the right tools to manage that company. In fact, over 37,000 companies have upgraded to NetSuite.
现在登录netsuite.com/lex即可享受NetSuite灵活融资方案。网址是netsuite.com/lex。本期节目还由美味无比的AG1赞助。这款全能每日饮品能助你保持最佳健康状态和巅峰表现,本质上是让我感觉人生尽在掌控的超级复合维生素。
Take advantage of NetSuite's flexible financing plan at netsuite.com/lex. That's netsuite.com/lex. This episode is also brought to you by the delicious, the delicious AG one. It's an all in one daily drink to support better health and peak performance. It's basically a super awesome multivitamin that makes me feel like I have my life together.
即使其他一切都像要崩塌,至少我还有AG1。至少我的生命有这份营养基础支撑。所有我正在进行的断食、纯肉饮食、体能耐力挑战、熬夜的疯狂用脑,或是正在经历的种种压力——所有这些时刻,AG1都在。至少他们提供了维生素。
Even when everything else feels like it's falling apart, at least I have AG one. At least I have that nutritional foundation to my life. So all the fasting I'm doing, all the carnivore diets, all the physical endurance events, and the mental madness of staying up all night or just the stress of certain things I'm going through, all of that. AG one is there. At least they have the vitamins.
另外,我有时会想,他们以前叫Athletic Greens,现在改名叫AG1了。我总好奇,会不会有AG2?为什么只到1呢?这是个有趣的品牌决策,比如叫AG1。像我这种有点强迫症的程序员类型,就会想,好吧。
Also, I sometimes wonder they used to be called Athletic Greens, and now they're called AG1. I always wonder, is AG2 coming? Like, why is it just one? It's an interesting branding decision, like, AG1. Me as an OCD kind of programmer type, it's like, okay.
这是版本号的意思吗?好吧?这是像AG 0.1 alpha版吗?正式版什么时候发布?总之,我喜欢说也喜欢喝的是AG1。
Is this a versioning thing? Okay? Is this like AG 0.1 alpha? When's the final release? Anyway, the thing I like to say and to consume is AG1.
注册drinkag1.com/lex时他们会送你一个月的鱼油供应。这里是Lex Fridman播客。要支持我们,请查看简介中的赞助商信息。现在,亲爱的朋友们,有请Michael、Sualeh、Arvid和Aman。
They'll give you a one month supply of fish oil when you sign up at drinkag1.com/lex. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid, and Aman.
好的。这太棒了。
Alright. This is awesome.
我们请来了Cursor团队的Michael、Aman、Sualeh和Arvid。第一个问题,有点夸张:代码编辑器的意义是什么?
We have Michael, Aman, Sualeh, Arvid here from the Cursor team. First up, big ridiculous question. What's the point of a code editor?
代码编辑器基本上是你构建软件的地方。长久以来,它意味着你用来编辑正式编程语言的文本编辑器。对于非程序员来说,可以把它想象成程序员专用的超级文字处理器——之所以说是超级版,是因为代码具有很多结构。所以这个所谓的'文字处理器'(即代码编辑器)能为你做很多普通文字处理器在文本编辑领域做不到的事。从直观区分代码中的标记以便快速浏览,到像用超链接在互联网上导航那样在代码库中跳转查看定义,再到错误检查和捕捉基础bug,这些都是传统代码编辑器的功能。而我认为随着软件开发方式的变化,代码编辑器的定义在未来十年将发生巨大改变。
So the the code editor is largely the place where you build software. And today, or for a long time, that's meant the place where you text edit a formal programming language. And for people who aren't programmers, the way to think of a code editor is like a really souped up word processor for programmers, where the reason it's it's souped up is code has a lot of structure. And so the the quote unquote word processor, the code editor, can actually do a lot for you that word processors, you know, sort of in the writing space haven't been able to do for for people editing text there. And so that's everything from giving you visual differentiation of the actual tokens in the code so you can scan it quickly, to letting you navigate around the code base, sort of like you're navigating around the Internet with hyperlinks.
你会跳转到所用内容的定义处,进行错误检查,捕捉初级错误等等。传统意义上这就是代码编辑器的功能。我认为随着软件开发方式的改变,代码编辑器的定义在未来十年将发生巨大变化。
You're going to sort of definitions of things you're using, to error checking, to, you know, to catch rudimentary bugs. And so traditionally, that's what a code editor has meant. And I think that what a code editor is is going to change a lot over the next 10 years as what it means to build software maybe starts to look a bit different.
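As a side note for readers, the "souped-up word processor" features described here (visually differentiating tokens, jumping to definitions, catching rudimentary bugs) all come from the editor exploiting the structure of code. Below is a deliberately minimal sketch of the first of those ideas, classifying tokens so they can be colored differently; real editors use full parsers and language servers, and nothing here reflects any particular editor's implementation:

```python
# Toy illustration of the structure a code editor exploits for highlighting:
# classify each token of a line so keywords, identifiers, and numbers can be
# rendered differently. Deliberately minimal; real editors use full parsers.
import keyword
import re


def classify_tokens(line: str) -> list:
    """Split a line of Python into (token, kind) pairs for highlighting."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", line):
        if keyword.iskeyword(tok):
            kind = "keyword"
        elif tok.isidentifier():
            kind = "identifier"
        elif tok.isdigit():
            kind = "number"
        else:
            kind = "punctuation"
        out.append((tok, kind))
    return out
```

For example, `classify_tokens("def foo(x): return x + 1")` tags `def` and `return` as keywords and `foo` and `x` as identifiers, which is exactly the per-token differentiation a renderer needs.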
我觉得代码编辑器还应该要有趣才行。
I think I think also a code editor should just be fun.
没错。这点非常重要,非常重要。而且这其实是我们决定开发什么功能时被低估的一个方面。我们构建的很多功能,经过实验后会被砍掉,就是因为它们不够有趣。
Yes. That is very important. That is very important. And it's actually sort of an underrated aspect of how we decide what to build. Like, a lot of the things that we build, and then we we try them out, we do an experiment, and then we actually throw them out because they're not fun.
而趣味性的重要组成部分往往是速度。速度快就是有趣。
And and so a big part of being fun is, like, being fast a lot of the time. Fast is fun.
是啊,快就是...没错。这应该印在T恤上。
Yeah. Fast is yeah. Yeah. That should be a t shirt.
从根本上说,我认为吸引很多人投身计算机领域的原因之一,就是这种疯狂的迭代速度。在其他领域,你可能受限于资源或组织大规模协作的能力。而编程的神奇之处在于,只要你和一台电脑,就能快速创造出非常酷的东西。
Like, fundamentally, I think one of the things that draws a lot of people to building stuff on computers is this, like, insane iteration speed where, you know, in other disciplines, you might be sort of gated, capped by resources or even the ability, you know, to get a large group together. And coding is this, like, amazing thing where it's you and the computer, and with that alone, you can build really cool stuff really quickly.
给不了解的人解释下,Cursor是基于VS Code分支开发的超酷新编辑器。想听听你们各自使用编辑器的历程——你们应该都是VS Code+Copilot的忠实用户吧?是怎么从VS Code过渡到Cursor的?
So for people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your kind of explanation of your own journey of editors. How did you I think all of you were big fans of VS Code with Copilot. How did you arrive at VS Code, and how did that lead to your journey with Cursor?
是的。我们大多数人——应该说所有人——最初都是Vim用户。
Yeah. So I think a lot of us well, all of us were originally Vim users.
纯正的Vim。
Pure pure Vim.
纯粹的Vim。不是NeoVim,就是原版终端Vim。就我个人而言,大约是2021年Copilot推出时,我非常想尝试它。
Pure Vim. Yeah. Not Neovim. Just pure Vim in a terminal. And at least for myself, it was around the time that Copilot came out, so 2021, that I really wanted to try it.
于是我开始用当时唯一支持Copilot的VS Code。尽管我热爱Vim,但VS Code+Copilot的体验实在太好,让我决定转用。这个组合就成了我的默认选择,直到我们开始开发Cursor。
So I went into VS Code, the only platform the only code editor in which it was available. And even though I, you know, really enjoyed using Vim, just the experience of Copilot with VS Code was more than good enough to convince me to switch. And so that kind of was the default until we started working on Cursor.
或许该解释下Copilot的功能。它就像个智能自动补全工具——当你开始编写内容时,它会建议1-3行可能的补全代码。那种体验很有趣,就像挚友接上你没说完的话。
And maybe we should explain what Copilot does. It's like a really nice autocomplete. It suggests as you start writing a thing, it suggests one or two or three lines how to complete the thing. And there's a fun experience in that. You know, like when you have a close friendship and your friend completes your sentences?
当它运作良好时,会产生一种亲密感(或许有比'亲密'更贴切的词)。那种'天啊它懂我'的惊艳感觉真的很酷。
Like, when it's done well, there's an intimate feeling. There's probably a better word than intimate, but there's a there's a cool feeling of like, holy shit, it gets me.
没错。确实。
Yeah. Yeah.
是啊。偶尔当它没能理解你的时候,会有种不愉快的感觉。所以存在那种摩擦,但我想对大多数人来说,被理解的感觉盖过了不被理解的时候。
Yeah. Now and then there's an unpleasant feeling when it doesn't get you. And so there's that that kind of friction, but I would say for a lot of people, the feeling that it gets me overpowers that it doesn't.
我认为GitHub Copilot一个被低估的方面是,即使它出错了,也只是有点烦人,但没那么糟糕,因为你只需再输入一个字符,可能它就懂了,或者再输一个字符它就明白了。所以即使错了,也没那么糟。
And I think actually one of the underrated aspects of GitHub Copilot is that even when it's wrong is is like a little bit annoying, but it's not that bad because you just type another character, and then maybe then it gets you or you type another character and then then it gets you. So even when it's wrong, it's not that bad.
对。你可以反复调整修正它。对我来说,Copilot另一个被低估的部分在于它是首个真正的AI产品,首个面向消费者的语言模型产品。
Yeah. You you can sort of iterate iterate and fix it. Yeah. Mean, the other underrated part of Copilot for me sort of was just the first real real AI product. So the first language model consumer product.
所以Copilot可以说是LLMs的第一个杀手级应用。
So Copilot was kinda like the first killer app for Yeah. LLMs.
没错。而且测试版在2021年就发布了。
Yeah. Yeah. And, like, the beta was out in 2021.
对。好的。
Right. Okay.
嗯。
Mhmm.
那么Cursor的起源故事是怎样的?
So what's the the origin story of Cursor?
大约2020年,OpenAI发表了关于缩放定律(scaling laws)的论文。那一刻标志着这个领域出现了清晰可预测的进展——即便没有新思路,只要有更多算力和数据,就能显著提升模型性能。
So around 2020, the scaling laws papers came out from OpenAI. And that was a moment where this looked like clear, predictable progress for the field, where even if we didn't have any more ideas, it looks like you can make these models a lot better if you had more compute and more data.
顺便说,我们可能会花三四个小时讨论缩放定律这个话题。不过简单总结,这是一篇(一系列)论文和观点,指出模型规模和数据集越大可能越好...
By the way, we'll probably talk for three to four hours on the topic of scaling laws. Yes. But just to summarize, it's a paper and a set of papers and a set of ideas that say bigger might be better for model size and data size in
机器学习领域。
the realm of machine learning.
更大则更好,而且是可预测地更好。
It's bigger and better, but predictably better.
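For readers unfamiliar with the idea being summarized: the scaling-laws papers fit loss as a power law in model size (and similarly in data and compute), which is what makes the gains predictable rather than merely hoped for. A minimal numeric sketch, where the constants are purely illustrative placeholders and not values taken from any paper:

```python
# Minimal sketch of the scaling-laws idea: loss falls as a power law in model
# size, so bigger is not just better but *predictably* better.
# N_C and ALPHA below are illustrative placeholders, not published values.

N_C = 8.8e13   # hypothetical scale constant
ALPHA = 0.076  # hypothetical power-law exponent


def predicted_loss(n_params: float) -> float:
    """Power-law fit L(N) = (N_c / N) ** alpha."""
    return (N_C / n_params) ** ALPHA


# Every 10x in parameters cuts predicted loss by the same fixed factor,
# which is what lets you extrapolate before training the big model.
factor = predicted_loss(1e10) / predicted_loss(1e9)
assert factor < 1.0
```

The constant per-decade improvement factor is the point: you can forecast a large model's loss from small-model fits, which is the "clear, predictable progress" described above.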
好的。这是另一个话题了。
Okay. That's another topic of conversation.
是的。那时候对我们一些人来说,围绕这个话题有很多概念性的讨论,比如这会是什么样子,所有这些不同知识工作者领域的故事会是什么,关于这项技术进步如何让他们变得更好。然后我认为有几个时刻,那篇论文中预测的理论收益开始变得非常具体,开始感觉像是一个你可以真正去行动的时刻,而不必非要读个博士才能在AI领域做有用工作。实际上,感觉现在有一整套可以构建的、真正有用的系统。我想我们之前已经稍微讨论过的第一个时刻,就是早期Copilot的试用。
Yeah. So around that time, for some of us, there were, like, a lot of conceptual conversations about what's this gonna look like, what's the story gonna be for all these different knowledge worker fields about how they're gonna be made better by this technology getting better. And then I think there were a couple of moments where the theoretical gains predicted in that paper started to feel really concrete, and it started to feel like a moment where you could actually go and, you know, not do a PhD if you wanted to do useful work in AI. Actually, it felt like now there was this whole set of systems one could build that were really useful. And I think that the first moment we already talked about a little bit, which was playing with the early bit of Copilot.
那真是太棒了,像魔法一样。我认为下一个一切开始串联起来的重要时刻,其实是早期获得GPT-4的访问权限。大概在2022年,我们开始捣鼓那个模型,能力的跃升感觉非常巨大。在此之前,我们一直在做几个不同的项目。因为Copilot,因为缩放定律,因为我们之前对这项技术的兴趣,我们一直在为程序员捣鼓各种工具。
Like, that was awesome and magical. I think that the next big moment where everything kinda clicked together was actually getting early access to GPT-4. So sort of 2022 was when we were tinkering with that model, and the step up in capabilities felt enormous. And previous to that, we had been working on a couple of different projects. We had been because of Copilot, because of scaling laws, because of our prior interest in the technology, we had been tinkering around with tools for programmers.
但都是些非常具体的东西,比如我们在为金融专业人士构建工具,他们必须在Jupyter Notebook中工作,或者尝试用这些模型做静态分析。然后GPT-4的跃升感觉就像是,看,这真的让我们之前预测的理论收益变得具体了。感觉在那一刻你可以立即构建更多东西。而且,如果我们保持一致,这感觉不会只是一个点解决方案,而是整个编程都将通过这些模型流动。
But things that are, like, very specific so, you know, we were building tools for financial professionals who have to work within a Jupyter Notebook or, like, you know, playing around with can you do static analysis with these models? And then the step up in GPT-4 felt like, look, that really made concrete the theoretical gains that we had predicted before. It felt like you could build a lot more just immediately at that point in time. And also, if we were being consistent, it really felt like this wasn't just gonna be a point solution thing. This was gonna be all of programming that was gonna flow through these models.
感觉这需要一种不同类型的编程环境,一种不同类型的编程。于是我们开始围绕这个更大的愿景去构建。
It felt like that demanded a different type of programming environment, a different type of programming. And so we set off to build that that sort of larger vision around that.
有一个我特别记得。我的室友是IMO金牌得主,美国有个叫普特南的比赛,算是大学生的IMO,是个数学竞赛,非常厉害。我记得大概是2022年6月,申通和阿曼打了个赌,关于到2024年6月或7月,模型能不能赢得IMO金牌。
There's one that I distinctly remember. So my roommate is an IMO gold winner, and there's a competition in The US called the Putnam, which is sort of the IMO for college people, and it's this math competition. It's exceptionally good. So Shengtong and Aman, I remember, sort of June 2022, had this bet on whether, like, by June or July 2024, you were going to win a gold medal in the IMO with, like, models.
IMO是国际数学奥林匹克。
IMO is the International Math Olympiad.
是的。IMO是国际数学奥林匹克。阿尔维德和我都参加过,所以有点个人情感在里面。
Yeah. IMO is the International Math Olympiad. And so Arvid and I both, you know, also competed in it. So it was sort of personal.
而且 而且
And and
我记得我当时在想,老兄,这根本不可能发生。虽然我某种程度上相信进步,但我认为,IMO金牌?Aman纯属妄想。是的。说实话,我要明确说,我大错特错了,但那可能是整个团队中最有先见之明的赌注。
I remember thinking, man, this is just this is not gonna happen. Even though I sort of believed in progress, I thought, you know, IMO gold? Like, Aman is just delusional. Yeah. And to be honest, I mean, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.
所以DeepMind的新结果表明你是正确的。这就是...
So the the the new results from DeepMind, it it turned out that you were correct. That's what
那个 我
the I
其实 好吧
was well,
严格来说 严格来说不是。技术上不正确,但
it was it was technically not. Technically incorrect, but
只差一分。阿曼对这类事情非常热衷。是的。阿曼之前有件印着缩放定律的T恤,他总穿着走来走去,上面有那些图表和公式。
one point away. Aman was very enthusiastic about this stuff. Yeah. Before, Aman had this, like, scaling laws t-shirt that he would walk around with, where it had the, like, charts and, like, the formulas on it.
哦,所以你感受到了AGI,还是感受到了缩放效应?
Oh, so you, felt the AGI or you felt the scaling?
是的。我清楚地记得和迈克尔有过一次对话,在那之前我没有特别深入批判性地思考过缩放定律。他提出了一个问题:为什么缩放不是全部所需?或者说为什么缩放不会带来巨大的进步?我想我经历了类似...哀伤的各个阶段。
Yeah. I I distinctly remember there was this one conversation I had with with Michael where before I hadn't thought super deeply and critically about scaling laws. And he kind of posed the question, why isn't scaling all you need? Or why isn't scaling gonna result in massive gains in progress? And I think I went through, like, the like, the stages of grief.
先是愤怒、否认,最后经过思考才接受。从那以后我对技术进步一直持相当乐观的态度。需要说明的是,我认为这也取决于你关注哪些领域。数学是个很好的领域,尤其是形式化定理证明,因为你能获得验证结果正确性的绝佳信号。这意味着像强化学习这样的方法可以发挥极大作用。
There is anger, denial, and then finally at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since. I think one thing I'll caveat is I think it also depends on which domains you're gonna see progress. Math is a great domain because especially formal theorem proving because you get this fantastic signal of actually verifying if the thing was correct. And so this means something like RL can work really, really well.
我认为,你可以拥有在数学上可能非常超人的系统,但从技术上讲仍然不具备通用人工智能(AGI)。
And I think, like, you could have systems that are perhaps very superhuman at math and still not technically have AGI.
好的。那么我们能否直接讨论Cursor?
Okay. So can we take it all the way to Cursor?
嗯。
Mhmm.
Cursor是什么?它是VS Code的一个分支,而VS Code长期以来是最受欢迎的编辑器之一。大家都爱上了它,都离开了Vim。我也因此放弃了Emacs。
And what is Cursor? It's a fork of VS Code, and VS Code is one of the most popular editors for a long time. Like, everybody fell in love with it. Everybody left Vim. I left Emacs for it.
抱歉。所以开发者社区以某种根本性的方式统一了。然后你看看这个领域,看看缩放定律,AI正在变得惊人,你决定,仅仅为VS Code写一个扩展是不够的,因为那有很多限制。如果AI要继续变得更好,我们需要真正重新思考AI如何成为编辑过程的一部分。因此你决定分叉VS Code。
Sorry. So unified in some fun fundamental way, the developer community. And then you look at the space of things, you look at the scaling laws, AI is becoming amazing, and you decided, okay, it's not enough to just write an extension for VS Code because there's a lot of limitations to that. If AI is gonna keep getting better, better, better, we need to really, like, rethink how the AI is gonna be part of the editing process. And so you decided to fork VS Code
是的。
Yeah.
并开始构建许多我们将能讨论的惊人功能。但那个决定是怎样的?因为有很多扩展,包括VS Code的Copilot,都在做类似AI的事情。决定分叉VS Code是怎样的?
And start to build a lot of the amazing features we'll be able to talk about. But what was that decision like? Because there's a lot of extensions Mhmm. Including Copilot for VS Code that are doing sort of AI type stuff. What was the decision like to just fork VS Code?
对我们来说,做一个编辑器的决定似乎是不言而喻的,至少对我们想要实现的目标来说是这样。因为当我们开始开发这个编辑器时,想法是这些模型会变得更好,它们的能力会提升,这将彻底改变你构建软件的方式。不仅会有巨大的生产力提升,而且构建软件的活跃方式也会发生很大变化。如果你只是现有编码环境的一个插件,你对代码编辑器的控制非常有限,我们不想被这些限制所束缚。我们希望能够构建最有用的东西。
So the decision to do an editor seemed kind of self evident to us for at least what we wanted to do and achieve. Because when we started working on the editor, the idea was these models are gonna get much better, their capabilities are gonna improve, and it's gonna entirely change how you build software. Both in a you will have big productivity gains, but also radical in how the active building software is going to change a lot. And so you're very limited in the control you have over a code editor if you're a plug in to an existing coding environment, and we didn't wanna get locked in by those limitations. We wanted to be able to just build the most useful stuff.
好的。那么自然的问题是,你知道,Versus Code和Copilot某种程度上是竞争对手。那么你们如何取胜?是不是基本上就是功能的速度和质量?
Okay. Well, then the natural question is, you know, VS Code with Copilot is kind of a competitor. So how do you win? Is it basically just the speed and the quality of the features?
是的。我认为这个领域非常有趣,可能非常独特,如果你看看以前的技术浪潮,也许有一件主要的事情发生,它开启了一波新的公司。但每一年,每一次模型能力的提升,你现在都能开启一波新的功能,尤其是在编程中。因此我认为在AI编程中,即使只是领先几个月,更不用说一年,也会让你的产品变得非常非常有用。我认为一年后的Cursor,会让今天的Cursor看起来过时。
Yeah. I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies. But every single year, every single model capability or jump you get in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful. I think the Cursor a year from now will need to make the Cursor of today look obsolete.
我认为微软确实做过许多了不起的事情,但他们现在并不处于一个能像初创公司那样持续创新和突破的最佳位置。
And I think, you know, Microsoft has done a number of, like, fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can.
只是快速实现功能。
Just rapidly implementing features.
对,就是推动...像是进行必要的研究实验来真正突破上限。
And and push yeah. Like, and and kind of doing the research experimentation necessary to really push the ceiling.
我不确定是否该用功能来思考,我更倾向于从程序员的能力角度考虑。比如新模型问世时,肯定会有更多不同类型的模型出现——更长上下文、更快速等等。我们可以尝试各种疯狂想法,希望其中10%能变成酷炫实用的东西。我们想让人们更早接触到这些。换个说法,一个被低估的事实是:我们其实是在为自己打造产品。
I don't know if I think of it in terms of features as I think of it in terms of, like, capabilities for programmers. It's that, like, you know, as, you know, the new o1 model came out, and I'm sure there are gonna be more models of different types, like longer context and maybe faster. Like, there's all these crazy ideas that you can try, and hopefully, 10% of the crazy ideas will make it into something kinda cool and useful. And we want people to have that sooner. To rephrase, it's like an underrated fact is we're making it for ourselves.
刚开始做Cursor时,我们深切感受到这种挫败——明明看到模型在进步,但Copilot的体验却一成不变。就像眼睁睁看着天花板升高,却没人创造新东西。他们本应该不断创新,那些前沿功能都去哪了?
When we started Cursor, you really felt this frustration that, you know, models you could see models getting better, but the Copilot experience had not changed. It was like, man, these guys, like, the ceiling is getting higher. Like, why are they not making new things? Like, they should be making new things. They should be like, you know, like, where's where's all the alpha features?
根本没有什么前沿功能。虽然产品卖得不错,生意也很成功,但对我这种渴望尝试新事物的人来说,很长一段时间里都看不到任何创新。
There there were no alpha features. It was like, I I'm sure it it was selling well. I'm sure it was a great business, but it didn't feel I I'm I'm one of these people that really want to try and use new things, and it was like, it's just there's no new thing for, like, a very long while.
确实。这很难用语言形容,但对比Cursor和Copilot时,Copilot不知为何很快就让人感觉过时了。
Yeah. It's interesting. I don't know how you put that into words, but when you compare a cursor with Copilot, Copilot pretty quickly became started to feel stale for some reason.
是的。我认为我们的优势在于把所有环节整合在一起开发。
Yeah. I I think one thing that I think helps us is that we're sort of doing it all in one.
嗯。
Mhmm.
我们在开发用户交互体验的同时,也在优化模型回答质量——比如如何构建提示词,如何为Cursor标签页寻找上下文,如何训练模型。这种端到端的整体开发模式让我们获益良多。
We're we're developing the the UX and the way you interact with the model, at the same time as we're developing, like, how we actually make the model give better answers, so we're like, how you build up the the prompter, or like, how do you find the context and for a cursor tab, like, how do you train the model? So I think that helps us to have all of it, like, sort of like the same people working on the entire experience end to end.
是啊。就像是设计UI的人和训练模型的人,坐得相隔18英尺远似的。
Yeah. It's like the the person making the UI and the person training the model, like, sit to, like, 18 feet away.
所以
So
嗯。很多时候甚至是同一个人。
Mhmm. Often the same person even.
对。经常经常是同一个人。所以你能创造出一些东西,如果你们不交流、不实验,这些东西是不可能实现的。
Yeah. Often often even the same person. So you you can you can create things that are that are sort of not possible if you're not you're not talking, you're not experimenting.
而且你正在用,就像你说的,用Cursor来写Cursor。
And you're using, like you said, cursor to write cursor.
当然。
Of course.
哦,是啊。没错。
Oh, yeah. Yeah.
好吧,我们来谈谈这些功能中的一些。我们来谈谈那个无所不知、无所不能的,赞美Tab键,你知道的,基本上就是打了兴奋剂的自动补全。
Well, let's talk about some of these features. Let's talk about the all knowing, the all powerful, praise be to the tab, for the, you know, autocomplete on steroids, basically.
那么Tab键是怎么工作的?Tab键是什么?为了高度概括,我想说Cursor现在有两件事做得相当好。它还有其他功能。但有两件事对程序员特别有帮助。
So what how does tab work? What is tab? To highlight and summarize at a high level, I'd say that there are two things that Cursor is pretty good at right now. There are there are other things that it does. But two things that it helps programmers with.
一个是这个想法,就像有人在你背后看着你,像一个非常快的同事,能某种程度上跳到你前面,打字并预测你接下来要做什么。这是最初的想法,也是好的自动补全背后的核心,就是预测你接下来要做什么。但你可以让这个概念更雄心勃勃,不仅仅是预测光标后的字符,而是实际上预测你接下来要做的整个更改,下一个差异,你要跳到的下一个位置。Cursor现在做得相当好的第二件事是,有时帮助你跳到AI前面,告诉它该做什么,从指令到代码。在这两方面,我们做了很多工作,使这些功能的编辑体验符合人体工程学,同时也让这些功能更智能、更快速。
One is this idea of looking over your shoulder and being like a really fast colleague who can kind of jump ahead of you and type and figure out what you're what you're gonna do next. And that was the original idea behind that was kind of the kernel of the idea behind good autocomplete was predicting what you're gonna do next. But you can make that concept even more ambitious by not just predicting the characters after your cursor, but actually predicting the next entire change you're gonna make, the next diff, the next place you're gonna jump to. And the second thing Cursor is is pretty good at right now too is helping you sometimes jump ahead of the AI and tell it what to do and go from instructions to code. And on both of those, we've done a lot of work on making the editing experience for those things ergonomic and also making those things smart and fast.
我们真正想要的功能之一是让模型能帮我们编辑代码。这曾是个愿望,在拥有一个能良好编辑代码的模型前,我们进行了多次尝试。后来有了好模型后,团队投入大量精力优化推理速度,确保流畅体验。我们开始整合迈克尔提到的跳转功能——这种跳转需求源于一种感受:当你接受一个编辑后,下一步该去哪里应该显而易见。
One of the things we really wanted was for the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then after we had a good model, I think there's been a lot of effort to, you know, make the inference fast, to have a good experience. And we've been starting to incorporate, I mean, Michael sort of mentioned this, the ability to jump to different places. And that jump to different places, I think, came from a feeling of: once you accept an edit, it was like, man, it should be just really obvious where to go next.
就像我做了这个修改后,模型应该自动知道下一步该跳转到18行后。如果你用VIM编辑器,可能要按18jj之类的快捷键——但为什么非要手动操作?模型应该直接预判到这个需求。
It's like, I'd made this change. The model should just know that the next place to go to is, like, 18 lines down. Like, if you're a Vim user, you could press 18jj or whatever. But why am I even doing this? The model should just know it.
于是我们设计成按Tab键就能下跳18行并展示下一个编辑点,这样你只需连续按Tab键。内部竞赛就是看用户能连续按多少次Tab。抽象来看,关键在于编辑行为的零熵特性——当你表达意图后,如果后续操作没有信息增量却仍需输入字符让计算机理解,模型就该直接读心,用Tab键跳过这些零熵步骤。
And so the idea was: you just press Tab, it would go 18 lines down, show you the next edit, and you would press Tab again. So as long as you could keep pressing Tab. And so the internal competition was, how many tabs can we make someone press? Once you have the idea, more abstractly, the thing to think about is, how are the edits zero-entropy? Once you've expressed your intent and there are no new bits of information needed to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking, then maybe the model should just read your mind, and all the zero-entropy bits should just be tabbed away.
对,基本上就是这样的
Yeah. That was sort of the
抽象概念。有个有趣现象:比较不同领域的语言模型损失时,代码的比特/字节(字符标准化损失)低于自然语言,说明代码中存在大量可预测的token和字符。当预测用户编辑现有代码的行为时,这种可预测性更加显著。光标跳转的目标就是消除编辑器里所有低熵操作。
abstract version of it. There's this interesting thing where, if you look at language model loss on different domains, I believe the bits per byte, which is a kind of character-normalized loss, is lower for code than for language, which means that in general there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable. And this is, I think, even magnified when you're not just trying to autocomplete code, but predicting what the user's going to do next in their editing of existing code. And so the goal with Cursor Tab is: let's eliminate all the low-entropy actions you take inside of the editor.
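为了让"比特/字节"这个指标更直观,下面是一个假设性的小例子(损失值与每token平均字节数均为虚构,并非实测数据),演示如何把交叉熵损失换算为比特/字节:
To make the bits-per-byte metric concrete, here is a small hypothetical sketch (the loss values and bytes-per-token figures are made up, not measured) showing how a cross-entropy loss converts into bits per byte:

```python
import math

def bits_per_byte(loss_nats_per_token: float, avg_bytes_per_token: float) -> float:
    """Convert cross-entropy loss (nats/token) into bits per byte by
    normalizing with the average number of bytes each token covers."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / avg_bytes_per_token

# Hypothetical numbers: code is more predictable per character, so its
# bits/byte comes out lower than natural language.
code_bpb = bits_per_byte(loss_nats_per_token=0.45, avg_bytes_per_token=3.8)
text_bpb = bits_per_byte(loss_nats_per_token=0.70, avg_bytes_per_token=4.0)
print(round(code_bpb, 3), round(text_bpb, 3))
```

比特/字节越低,意味着每个字符越容易预测,这正对应"代码比自然语言更可预测"的说法。
Lower bits per byte means each character is cheaper to predict, which is exactly the sense in which code is more predictable than natural language.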
当用户意图明确时,我们直接让用户快进到下一步操作。
When the intent is effectively determined, let's just jump you forward in time, skip you forward.
那么光标预测的直觉原理和技术细节是什么?那个跳转功能对大多数人来说并不直观吧?
Well well, what's the intuition and what's the technical details of how to do next cursor prediction? The the that jump, that's not that's not so intuitive, I think, to people.
是的。我可以分享些实现细节:这类功能需要极低延迟,因此要训练小型专用模型。它们特别消耗预填充token——意味着需要超长提示词来理解大量代码上下文,但实际生成的token却很少。
Yeah. I think I can speak to a few of the details of how to make these things work. They're incredibly low latency, so you need to train small models on this task. In particular, they're incredibly prefill-token hungry. What that means is they have these really, really long prompts, where they see a lot of your code, and they're not actually generating that many tokens.
因此最适合采用稀疏模型(即MOE混合专家模型),这是我们提升长上下文性能的重大突破。另一项是推测式解码的变体「推测编辑」。这两项技术共同确保了高质量和极速响应。
And so the perfect fit for that is using a sparse model, meaning an MoE model. So that was one breakthrough we made that substantially improved performance at longer context. The other was a variant of speculative decoding that we built out, called speculative edits. These are two, I think, important pieces of what makes it quite high quality and very fast.
明白了。MOE混合专家模型,输入巨大但输出精简。那缓存机制在其中起什么作用?
Okay. So MoE, mixture of experts: the input is huge, the output is small. Yeah. Okay. So, like, what else can you say about how to make it work? Does caching play a role
在这个特定情况下?缓存起着巨大作用。
in this particular? Caching plays a huge role.
嗯。
Mhmm.
因为你处理的是这么多输入标记,如果在给定行中键入的每个按键都要重新运行模型处理所有传入标记,首先会显著降低延迟,其次会让GPU负载过重。所以你需要设计模型使用的实际提示,使其具有缓存意识。然后,是的,你需要在请求间复用KV缓存,以减少工作量,降低计算消耗。
Because you're dealing with this many input tokens, if on every single keystroke that you're typing in a given line you had to rerun the model on all of those tokens passed in, you're just going to, one, significantly degrade latency, and two, kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware. And then you need to reuse the KV cache across requests, just so that you're spending less work, less compute.
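下面是一个极简示意(其中的 PrefixCache 类纯属演示,并非 Cursor 的真实实现):把大而稳定的上下文作为不变前缀,其 KV 状态跨按键复用,每次按键只需预填充新增的后缀:
Here is a minimal sketch of the cache-aware idea (the `PrefixCache` class is purely illustrative, not Cursor's actual implementation): keep the large, stable context as an immutable prefix whose KV state is reused across keystrokes, so each keystroke only prefills the newly typed suffix:

```python
class PrefixCache:
    """Toy stand-in for a KV cache keyed by prompt prefix."""

    def __init__(self):
        self._store = {}   # prefix text -> simulated KV state
        self.prefills = 0  # how many expensive prefill passes we ran

    def get_or_compute(self, prefix: str) -> str:
        if prefix not in self._store:
            self.prefills += 1  # stand-in for prefilling all prefix tokens
            self._store[prefix] = f"kv({len(prefix)} chars)"
        return self._store[prefix]


cache = PrefixCache()
file_context = "def add(a, b):\n    return a + b\n\n"  # stable across keystrokes

# Each keystroke changes only the suffix; the prefix KV state is shared.
for typed in ["def su", "def sub", "def sub("]:
    kv_state = cache.get_or_compute(file_context)
    prompt = file_context + typed  # the model would prefill only `typed`

print(cache.prefills)  # the big prefix was prefilled exactly once
```

关键在于提示词布局:稳定内容放在前面、易变内容放在末尾,缓存才能命中。
The key is prompt layout: stable content first, volatile content last, so the cache can actually hit.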
再问一次,Tab短期内应该具备哪些功能?就是想再确认下这个问题。
Again, what are the things that tab is supposed to be able to do kinda in the near term? Just to, like, sort of linger on that.
生成代码,比如填充空白处,还有
Generate code, like, fill empty space, also
跨多行编辑代码
edit code across multiple lines
嗯。
Mhmm.
对。然后跳转到同一文件内的不同位置。是的。接着比如启动
Yeah. And then jump to different locations inside the same file Yeah. And then, like, launch
还要能跳转到不同文件。如果你在一个文件里做了编辑,可能还需要去另一个文件完成思路。它应该也能跳转到第二个文件。
jump to different files also. So if you make an edit in one file and maybe maybe you have to go maybe you have to go to another file to finish your thought. It should it should go to the second file also.
对。然后它还实现了泛化功能。就像是...嗯...下一个动作预测。比如有时需要在终端运行命令,它应该能根据你写的代码建议命令。或者有时它给出了建议,但你很难判断是否正确,因为实际上需要更多信息来确认,比如需要知道类型才能验证其正确性。
Yeah. And then we generalized it to next action prediction. Like, sometimes you need to run a command in the terminal, and it should be able to suggest the command based on the code that you wrote too. Or sometimes it suggests something, but it's hard for you to know if it's correct, because you actually need some more information. Like, you need to know the type to be able to verify that it's correct.
因此,或许它实际上应该带你到一个类似某事物定义的地方,然后再带你回来,这样你就具备了接受下一个完成项所需的所有知识。
And so maybe it should actually take you to a place that's like the definition of something, and then take you back so that you have all the requisite knowledge to be able to accept the next completion.
也就是为人类提供知识。
So providing the human the knowledge.
是的。
Yes.
对。
Right.
嗯。没错。
Mhmm. Yeah.
你能整合吗?比如我刚认识一个叫Primeagen的人,我相信他搭建了可以通过SSH点咖啡的服务。
Can you integrate, like... I've just gotten to know a guy named the Primeagen, who I believe has it set up so you can order coffee via SSH.
哦,对。哦,我们做过这个。我们实现过。
Oh, yeah. Oh, we did that. We did that.
那模型也能做到吗?比如给你喂食,还有...对,还能提供咖啡因?好吧。这就是总体框架。
So can that also the model do that? Like, feed you and like Yeah. And provide you with caffeine? Okay. So that's the general framework.
是的。还有...对。还有那个
Yeah. And Yeah. And the
神奇时刻在于——编程是个奇怪的领域,有时候(并非总是)接下来的五分钟你要做的事,其实可以从近期操作中预测出来。那么能否实现这样一种场景:要么你抽身让它引导你度过那五分钟,要么你稍微多看到几步它的操作,然后你觉得:行。可以。没问题。
the magic moment would be if it is programming is this weird discipline where sometimes the next five minutes, not always, but sometimes the next five minutes, what you're gonna do is actually predictable from the stuff you've done recently. And so can you get to a world where that next five minutes either happens by you disengaging and it taking you through, or maybe a little bit more of just you seeing step what it's gonna do, and you're like, okay. That's good. That's good. That's good.
那很好。你可以像这样轻点、轻点、轻点地浏览这些大的改动。
That's good. And you can just sort of tap, tap, tap through these big changes.
说到这个,我得提一下,Cursor一个非常酷且显著的特点是它整个差异界面的设计。模型会用红色和绿色标记代码的修改建议,在聊天窗口里你可以应用这些改动,它会显示差异部分供你确认。能详细说说这方面的设计思路吗?
As we're talking about this, I should mention, like, one of the really cool and noticeable things about cursor is that there's this whole diff interface situation going on. So, like, the model suggests with with the red and the green of, like, here's how we're gonna modify the code. And in the chat window, you can apply, and it shows you the diff, and you can accept the diff. So maybe can you speak to whatever direction of that?
我们可能会有四到五种不同的差异显示方式。比如自动补全的差异界面就与审查大段代码时的不同,还有针对多文件操作的另一种优化方案。本质上区别在于:自动补全时的差异应该极速可读——其实所有场景下都应该快速可读。
We'll probably have, like, four or five different kinds of diffs. So we we have optimized the diff for for the autocomplete, so that has a different diff interface than than when you're reviewing larger blocks of code. And then we're trying to optimize another diff thing for when you're doing multiple different files. And and sort of at a high level, the difference is for when you're doing autocomplete, it should be really, really fast to read. Actually, it should be really fast to read in all situations.
但自动补全时,你的视线会高度集中在某个区域,人类无法同时关注太多不同位置。
But in autocomplete, your eyes are really focused in one area; humans can't look in too many different places.
所以你指的是界面设计方面?
So you're talking about on the interface side?
对界面设计。目前侧边有个显示框,当需要删除某处代码并添加新代码时,就会在侧边显示这个框。
On the interface side. So it currently has this box on the side; we have the current box, and if it tries to delete code in some place and add other code, it shows you a box on the side.
如果我们在cursor.com上打开演示,也许能展示下这个效果。就是...
You can maybe show it if we pull it up on cursor.com. This is what
我们讨论的这个侧边框,其实经历了三四次迭代。最初尝试用蓝色交叉轮廓线,后来改用类似谷歌文档的删除线样式显示待删除代码。
we're talking about. So that box was like three or four different attempts at trying to make this thing work. Where first, the attempt was this blue crossed-out line. So before it was a box on the side, it used to show you the code to delete by showing you, Google Docs style, a line through it.
没错。然后你会看到新代码。那种方式非常干扰视线。后来我们又尝试了红色高亮等多种删除标识方案。
Oh, yeah. Then you would see the new code. That was super distracting. And then we tried many different things: there were deletions, there was trying to red highlight.
接下来的迭代方案挺有趣:在Mac上按住option键时,会高亮某个代码区域提示可能有建议。比如这个例子中,input和value都会变蓝,暗示AI有修改建议——不是直接显示具体内容,而是先给出提示。
Then the next iteration of it, which is sort of funny: you would hold, on Mac, the Option button. So it would sort of highlight a region of code to show you that there might be something coming. So maybe in this example, the input and the value would all get blue, and the blue would highlight that the AI had a suggestion for you. So instead of directly showing you the thing, it would just hint that the AI had a suggestion.
如果你真想看到它,可以按住option键,然后就能看到新的建议。松开option键后,又会显示你原来的代码。
And if you really wanted to see it, you would hold the Option button, and then you would see the new suggestion. And if you release the Option button, you would then see your original code.
嗯。
Mhmm.
所以那是通过
So that's by
顺便说,这挺不错的,但你必须知道要按住option键。
the way, that's pretty nice, but you have to know to hold the option button.
是的。所以它会... 顺便说一句,我
Yeah. So it would By the way, I'm
不是Mac用户,但我明白了。Option键。
not a Mac user, but I got it. Option.
它是... 它是
Was it was It's
我猜是你们有的一个按键吧。
a button, I guess, you people have.
你知道,这又... 这真的不够直观。我觉得这是关键问题。而且还有
It's, you know, it's again, it's just it's just non intuitive. I think that's the that's the key thing. And there's a
可能这也不是它的最终版本。
chance this this is also not the final version of it.
我个人对在这一领域做出许多改进感到非常兴奋。我们常将其称为验证问题,这些差异对比对小编辑很有效。但对于大范围编辑,或者涉及多个文件时,审查这些差异实际上有些困难。因此这里有几个不同的想法。比如我们有一个观点是,差异中的某些部分很重要。
I am personally very excited for making a lot of improvements in this area. Like, we we often talk about it as the verification problem, where these diffs are great for small edits. For large edits, or, like, when it's multiple files or something, it's actually a little bit prohibitive to to review these diffs. And so there are, like, a couple of different ideas here. Like, one idea that we have is, okay, you know, like, parts of the diffs are important.
它们包含大量信息。而差异的其他部分则信息熵很低,就像考试中重复出现的内容。或许可以高亮重要部分,淡化不那么重要的部分。或者可以建立一个模型来审查差异,发现潜在错误。
They have a lot of information, while other parts of the diff are just very low entropy, like the same thing over and over again. And so maybe you can highlight the important pieces and gray out the not-so-important pieces. Or maybe you can have a model that looks at the diff and sees, oh, there's a likely bug here.
我会用红色波浪线标记并提示:可能需要审查这部分差异。这类思路让我感到振奋。
I will, like, mark this with a little red squiggly and say, like, should probably, like, review this part of the diff. And ideas in in that vein, I think, are exciting.
是的。这真是个迷人的UX设计工程领域。你们本质上是在引导程序员只阅读必要内容,不多不少,达到最优。
Yeah. That's a really fascinating space of, like, UX design engineering. Yep. So you're basically trying to guide the human programmer through all the things they need to read and nothing more Yeah. Like, optimally.
没错。而且需要智能模型来实现。目前的差异算法只是普通算法,没有智能。虽然设计算法时运用了智能,但算法本身不关心具体内容。我们需要模型来完成这项工作。
Yeah. And you want an intelligent model to do it. Like, currently, diff algorithms are just normal algorithms. There is no intelligence. There's intelligence that went into designing the algorithm, but then there's none in the algorithm itself: it doesn't care whether the change is about this thing or that thing, whereas you want a model to do this.
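作为对照,下面用 Python 标准库 difflib 展示传统 diff 算法的"无智能":它只计算结构化的最小编辑,并不区分哪些改动重要:
As a contrast, here is Python's stdlib `difflib` showing what a classical, non-intelligent diff gives you: a purely structural edit script, with no notion of which changed lines actually matter:

```python
import difflib

old = ["total = 0", "for x in items:", "    total += x", "return total"]
new = ["total = 0", "for x in items:", "    total += x * weight", "return total"]

# unified_diff emits headers, context lines, and +/- lines; it treats a
# semantically critical change and a trivial rename exactly the same way.
diff = list(difflib.unified_diff(old, new, lineterm=""))
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print(changed)
```

算法只知道哪些行变了;"这行改动是否引入bug"则需要模型来判断。
The algorithm only knows which lines changed; judging whether a change likely introduces a bug is the part that needs a model.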
我认为核心问题是:随着模型越来越智能,它们能提出的改动会越来越大。当改动规模不断扩大时,人类需要做的验证工作就越来越多,难度也越来越大。我们需要帮助开发者——毕竟没人想整天审查代码。
So I think the general question is: these models are going to get much smarter. As the models get much smarter, the changes they will be able to propose are much bigger. And as the changes get bigger and bigger, the humans have to do more and more verification work. It gets harder and harder; you need to help them out. I don't wanna spend all my time reviewing code.
能详细说说跨多个文件的差异吗?
Can you say a little more about diffs across multiple files?
GitHub试图通过代码评审解决这个问题。评审时你会查看跨多个文件的差异。但正如Arvid所说,我们可以做得比现有代码评审更好——现在的代码评审体验其实很糟糕。
Yeah. I mean, so GitHub tries to solve this, right, with code review. When you're doing code review, you're reviewing multiple diffs across multiple files. But like Arvid said earlier, I think you can do much better than code review. You know, code review kind of sucks.
你需要花大量时间理解陌生的代码,却往往发现不了多少错误。利用语言模型可以显著改善评审体验,比如运用Art提到的技巧引导关注关键区域。当代码由语言模型生成时,由于不需要考虑代码作者的体验,我们可以完全围绕评审者设计,让评审过程更轻松高效。
Like, you spend a lot of time trying to grok this code that's often quite unfamiliar to you, and it often doesn't even actually catch that many bugs. And I think you can significantly improve that review experience using language models, for example, using the kinds of tricks that Arvid had described, of maybe pointing you towards the regions that actually matter. I think also, if the code is produced by these language models and not by someone else: the code review experience is designed for both the reviewer and the person that produced the code. In the case where the person that produced the code is a language model, you don't have to care that much about their experience, and you can design the entire thing around the reviewer, such that the reviewer's job is as fun, as easy, and as productive as possible.
简单地模仿现有代码评审模式存在问题。我们应该更富创造力,突破现有可能性边界。
And I think that that feels like the issue with just kind of naively trying to make these things look like code review. I think you can be a lot more creative and and push the boundary in what's possible. Just
有一个观点我认为很重要,那就是顺序问题。通常,当你审查一个PR时,你会看到文件列表并从上到下逐一审查。但实际上,你可能需要先理解这部分内容,因为它在逻辑上是先出现的,然后才是下一部分。你不想自己去费心梳理这个顺序,而是希望有个模型来引导你完成整个过程。
one one idea there is, I think ordering matters. Generally, when you review a PR, you you have this list of files and you're reviewing them from top to bottom. But actually, like, you actually want to understand this part first because that came, like, logically first, and then you want to understand the next part. And you don't want to have to figure out that yourself. You want a model to guide you through the thing.
那么这个步骤
And is the step
的创建过程是否会越来越倾向于使用自然语言作为目标,而不是实际编写代码?
of creation going to be more and more natural language is the goal versus with actual writing?
我认为有时候会这样。但我不认为所有编程都会变成自然语言。原因在于,比如当我和Sualeh结对编程,Sualeh在操作键盘时,如果由我主导,我可能会对Sualeh说"嘿,实现这个函数",这样是可行的。但有时候向Sualeh解释我想要他做什么会非常烦人。
I think sometimes. I don't think it's going to be the case that all of programming will be natural language. And the reason for that is, you know, if I'm pair programming with Sualeh, and Sualeh is at the computer and the keyboard, and sometimes, if I'm, like, driving, I want to say to Sualeh, hey, implement this function, and that works. But then sometimes it's just so annoying to explain to Sualeh what I want him to do.
所以实际上我会接过键盘,给他演示——我写出部分示例代码,这样他就明白了。这是最有效的沟通方式。我认为AI也是如此:有时候与AI沟通的最佳方式是展示一个例子,然后它就能在其他地方执行类似操作。比如在制作网站时,向AI展示你想要的效果,最便捷的方式可能不是口头说明,而是通过拖拽或绘图。
And so I actually take over the keyboard and I show him I I write, like, part of the example, and then it makes sense. And that's the easiest way to communicate. And so I think that's also the case for AI. Like, sometimes the easiest way to communicate with the AI will be to show an example, and then it goes and does the thing everywhere else. Or sometimes if you're making a website, for example, the easiest way to show to the AI what you want is not to tell it what to do, but, you know, drag things around or draw things.
是的。或许最终我们会实现脑机接口之类的技术,直接读取思维。因此自然语言会有一席之地,但我确信它不会成为大多数人编程的主要方式。
And yeah. And and, like, maybe eventually we will get to, like, brain machine interfaces or whatever, and you can, like, understand what you're thinking. And so I think natural language will have a place. I think it will not definitely not be the way most people program most of the time.
在这个编辑器中我真的感受到了AGI的存在,感觉底层有大量机器学习在运作。
I'm really feeling the AGI with this editor. Feels like there's a lot of machine learning going on underneath.
说说让它运作的机器学习技术吧。
Tell me about some of the ML stuff that makes it all work.
Cursor实际上是通过我们训练的一系列定制模型与前沿模型协同工作的——后者擅长需要强推理的任务。比如Cursor标签页就是典型案例:在我们设定的任务评估中,经过专门优化的模型表现甚至优于前沿模型。另一个需要定制模型的领域是代码应用——这看似意外但确实必要且效果显著。前沿模型擅长草拟代码计划和生成变更草图,但实际创建差异文件对它们而言很困难。当你用Sonnet或o1等任何前沿模型尝试时,它们会在诸如统计行号这类基础操作上出错,尤其是在处理超大文件时。
Well, Cursor really works via this ensemble of custom models that we've trained, alongside the frontier models that are fantastic at the reasoning-intense things. And so Cursor Tab, for example, is a great example of where you can specialize this model to be even better than frontier models, if you look at evals on the task we set it at. The other domain, where it's kind of surprising that it requires custom models, but it's kind of necessary and works quite well, is in apply. So I think the frontier models are quite good at sketching out plans for code and generating rough sketches of the change, but actually creating diffs is quite hard for frontier models. Like, you try to do this with Sonnet, with o1, any frontier model, and it really messes up stupid things like counting line numbers, especially in super, super large files.
为此我们的解决方案是:先让模型勾勒出标识变更的粗略代码块,再训练专用模型将这些变更应用到文件中。
And so what we've done to alleviate this is we let the model kind of sketch out this rough code block that indicates what the change will be, and we train a model to then apply that change to the file.
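下面是"先勾勒、再应用"这一思路的玩具示意(绝非 Cursor 真实的 apply 模型,他们用的是训练出的模型,这里仅用启发式字符串匹配演示接口形态):模型输出带省略标记的粗略代码块,apply 步骤把它合并回原文件:
Below is a toy illustration of the sketch-then-apply idea (emphatically not Cursor's real apply model, which is a trained model; this uses heuristic string matching just to show the shape of the interface): the model emits a rough block with elision markers, and the apply step merges it into the original file:

```python
ELISION = "# ... existing code ..."

def apply_sketch(original: str, sketch: str) -> str:
    """Splice a rough sketch into the original file. An elision marker means
    'keep the original lines here, up to the next explicit sketch line'."""
    orig = original.splitlines()
    lines = sketch.splitlines()
    out, oi = [], 0  # oi: cursor into the original file's lines
    for j, line in enumerate(lines):
        if line == ELISION:
            nxt = next((l for l in lines[j + 1:] if l != ELISION), None)
            if nxt is not None and nxt in orig[oi:]:
                stop = oi + orig[oi:].index(nxt)
                out.extend(orig[oi:stop])  # keep unchanged original lines
                oi = stop
            else:
                out.extend(orig[oi:])      # trailing elision: keep the rest
                oi = len(orig)
        else:
            out.append(line)               # explicit sketch line wins
            if line in orig[oi:]:          # advance past matched context
                oi += orig[oi:].index(line) + 1
    return "\n".join(out)

original = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b"
sketch = ELISION + "\ndef mul(a, b):\n    return a * b * 2  # changed"
print(apply_sketch(original, sketch))
```

正是这类确定性匹配在重复上下文、代码重排等情况下会失效,所以才需要训练一个专门的 apply 模型来稳健地完成合并。
Exactly this kind of deterministic matching breaks on repeated context lines or reformatted code, which is why a trained apply model is needed to do the merge robustly.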
应该说,apply模式会分析你的代码,然后给出非常棒的新操作建议。而将两者结合这个看似对人类微不足道的步骤,你说其实并不简单。
And we should say that apply is the model looks at your code, it gives you a really damn good suggestion of what new things to do. And the seemingly, for humans, trivial step of combining the two, you're saying is not so trivial.
与普遍认知相反,它并非确定性算法。
Contrary to popular perception. It is not a deterministic algorithm.
是的。我认为...你看浅层复制在其他地方应用时经常失败,因为你以为能做些确定性匹配,但至少有40%概率会失败,这导致极差的产品体验。总体而言,这个模式会让模型越来越智能。apply还有个好处是能用更少token配合最智能的模型——生成大量token既延迟高又成本大。所以你可以先画个非常粗略的草图,然后让小模型去实现,因为实现这种高度简化的代码任务要简单得多。
Yeah. I think, like, you see shallow copies of apply elsewhere, and it just breaks most of the time, because you think you can try to do some deterministic matching, and then it fails at least 40% of the time, and that just results in a terrible product experience. I think in general, this regime will continue where you get smarter and smarter models. And one other thing that apply lets you do is use fewer tokens with the most intelligent models, where generating all those tokens is expensive, both in latency and in cost. So you can give this very, very rough sketch and then have your small models go and implement it, because it's a much easier task to implement this very, very sketched-out code.
我认为这种模式会持续发展:用越来越智能的模型做规划,而实现细节交给低智能模型处理。也许将来会有o1甚至更强大的模型给出更高层的计划,再由Sonnet和apply模型递归执行。
And I think that this regime will continue, where you can use smarter and smarter models to do the planning, and then maybe the implementation details can be handled by the less intelligent ones. Perhaps you'll have, you know, maybe o1, maybe even more capable models, given an even higher-level plan that is kind of recursively applied by Sonnet and then the apply model.
或许我们该讨论如何...如何提速。对,速度总是个有趣的细节。快
Maybe we should we should talk about how to how to make it fast. Yeah. I feel like Yeah. Fast is always an interesting detail. Fast
是好事。
is good.
没错,怎么提速?
Yeah. How do you make it fast?
提速的一大关键是推测性编辑。这是推测性解码的变体,或许先简要说明推测性解码会有所帮助。在语言模型生成受内存限制时,一次性处理多个token比逐个生成要快——这就是为什么提示token的吞吐量远高于生成token。
Yeah. So one big component of making it fast is speculative edits. Speculative edits are a variant of speculative decoding, and maybe it'd be helpful to briefly describe speculative decoding. With speculative decoding, what you do is take advantage of the fact that, most of the time, and I'll add the caveat that it's when you're memory-bound in language model generation, if you process multiple tokens at once, it is faster than generating one token at a time.
我们不像常规推测性解码那样用小模型预测草稿token再让大模型验证。对于代码编辑,我们有极强的先验知识——原始代码本身。所以可以直接将原始代码块喂给模型,模型大多时候会直接原样输出这些代码块,这样就能并行处理所有行。
So this is like the same reason why if you look at tokens per second with prompt tokens versus generated tokens, it's much much faster for prompt tokens. So what we do is instead of using what speculative decoding normally does, which is using a really small model to predict these draft tokens that your larger model will then go in and and verify. With code edits, we have a very strong prior of what the existing code will look like, and that prior is literally the same exact code. So what you can do is you can just feed chunks of the original code back into the into the model, and then the model will just pretty much agree most of the time that, okay, I'm just gonna spit this code back out. And so you can process all of those lines in parallel.
只要分块足够多,最终会遇到分歧点——模型开始生成与原始代码不同的内容。生成这些token后,当匹配足够多原始代码时,我们会重新开始分块推测。最终效果就是代码编辑速度大幅提升,看起来就像模型在超高速重写代码。我们可以继续使用diff相同的接口,但数据流速度会快很多。
And you just do this with sufficiently many chunks, then eventually you'll reach a point of disagreement where the model will now predict text that is different from the ground truth original code. It'll generate those tokens, and then we kind of we'll decide after enough tokens match the original code to restart speculating in chunks of code. What this actually ends up looking like is just a much faster version of normal editing code. So it's just like it looks like a much faster version of the model rewriting all the code. So just we we can use the same exact interface that we use for for diffs, but it will just stream down a lot faster.
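推测式编辑的机制可以用一个极简模拟来说明(纯属示意,这里按"行"而非 token 处理):原始代码本身充当草稿,按块并行验证,只有在分歧点才逐行生成:
The mechanism can be shown with a tiny simulation (purely illustrative; it works on lines rather than tokens): the original code itself serves as the draft, chunks are verified in parallel, and new content is generated one line at a time only at points of disagreement:

```python
def speculative_edit(original_lines, target_lines, chunk=4):
    """Return (result, verify_passes, generated): verify_passes counts
    parallel verification passes over draft chunks; generated counts
    lines produced one at a time at disagreement points."""
    out, i, passes, generated = [], 0, 0, 0
    while i < len(target_lines):
        draft = original_lines[i:i + chunk]
        passes += 1  # one pass verifies a whole chunk in parallel
        matched = 0
        for d, t in zip(draft, target_lines[i:i + chunk]):
            if d != t:
                break
            matched += 1
        out.extend(target_lines[i:i + matched])
        i += matched
        if i < len(target_lines) and matched < chunk:
            out.append(target_lines[i])  # disagreement: generate one line
            generated += 1
            i += 1
    return out, passes, generated

original = [f"line{k}" for k in range(8)]
target = list(original)
target[5] = "line5_edited"  # a single edit in the middle of the file

result, passes, generated = speculative_edit(original, target)
print(result == target, passes, generated)
```

八行里只有一行需要真正"生成",其余都在并行验证中被接受,这就是它看起来像"超高速重写整份代码"的原因。
Only one of the eight lines had to be generated; the rest were accepted during parallel verification, which is why it looks like a much faster version of the model rewriting all the code.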
然后流式传输的优势在于,你可以在代码完成前就开始审阅。没错,这样就没有漫长的加载画面。所以这可能就是部分优势所在。
And then the advantage is that with streaming, you can also just start reviewing the code. Exactly. Before it's done, so there's no big loading screen. So maybe that is part of the advantage.
也就是说人类可以在任务完成前就开始阅读内容。
So the human can start reading before the thing is done.
我觉得这里有趣的观点是,推测如今已成为相当普遍的概念。不仅限于语言模型,CPU中显然存在推测执行,数据库领域也有推测查询,各种场景都在运用推测技术。
I think the interesting riff here is something like like, speculation is a fairly common idea nowadays. It's like not only in language models. I mean, there's obviously speculation in CPUs, and there's there's, like, speculation for databases and speculation all over the place.
让我问个略显荒谬的问题:哪个大语言模型更擅长编程?GPT、Claude,在编程场景下谁更胜一筹?我猜答案会非常复杂,因为听起来每个环节都涉及不同的模型。
Let me ask the sort of the ridiculous question of which LLM is better at coding. GPT, Claude, who wins in the context of programming? And I'm sure the answer is much more nuanced because it sounds like every single part of this involves a different model.
是的。我认为没有哪个模型能在所有重要维度上全面碾压其他模型,这些维度包括速度、代码编辑能力、处理大量代码的能力、长上下文处理,以及编码能力等。目前综合表现最好的应该是Sonnet,这算是行业共识。
Yeah. I think there there's no model that Pareto dominates others, meaning it is better in all categories that we think matter. The categories being speed, ability to edit code, ability to process lots of code, long context, you know, a couple of other things and and kind of coding capabilities. The one that I'd say right now is just kind of net best is Sonnet. I think this is a consensus opinion.
o1很有意思,它的推理能力非常强。如果你给它高难度的编程面试题或LeetCode问题,它能表现得相当好。但它在理解用户模糊意图方面不如Sonnet。其他前沿模型虽然基准测试成绩优异——不是说它们针对测试进行了训练——但在实际应用中的表现与测试成绩存在落差。
o1 is really interesting, and it's really good at reasoning. So if you give it really hard, interview-style programming problems, or LeetCode problems, it can do quite well on them. But it doesn't feel like it understands your rough intent as well as Sonnet does. Like, if you look at a lot of the other frontier models, one qualm I have is that it feels like, and I'm not saying they train on benchmarks, but they perform really well on benchmarks relative to everything that's kind of in the middle. So if you try them on all these benchmarks, and things that are in the distribution of the benchmarks they're evaluated on, they'll do really well.
当你稍微偏离基准测试场景时,Sonnet是少数能保持相同能力的模型。它的基准测试表现与真实编程指导能力基本一致。
But when you push them a little bit outside of that, Sonnet's I think the one that kind of does best at at at kind of maintaining that same capability. Like, you kind of have the same capability in the benchmark as when you try to instruct it to do anything with coding.
再问个夸张的问题:普通编程体验与基准测试之间的差距有多大?你觉得在评估这些模型时,基准测试在哪些方面存在不足?
What another ridiculous question is the difference between the normal programming experience versus what benchmarks represent. Like, where do benchmarks fall short, do you think, when we're evaluating these models?
顺便说,这是个极其关键的问题——基准测试与实际编程的差异程度。真实编程不是面试式编程,人类可能说着半通不通的英语,有时让你'照之前那样做',有时要求'添加这个功能然后修改那个UI元素',很多操作都高度依赖上下文。
By the way, that's like a really, really hard it's like like critically important detail, like, how how difference, like, benchmarks are versus versus, like, real coding. Where real coding, it's not interview style coding. It's you're you're doing these you know, humans are saying, like, half broken English sometimes, and sometimes you're saying, like, oh, do what I did before. Sometimes you're saying, you know, go add this thing and then do this other thing for me and then make this UI element. And then, you know, it's it's just like a lot of things are sort of context dependent.
关键是要理解人类意图并执行,而不是像面试题那样——面试题都有明确定义,重度依赖规范说明,而人类需求往往缺乏明确规范。
You really want to understand the human and then do what the human wants. As opposed to, maybe the way to put it abstractly is: the interview problems are very well specified. They lean a lot on specification, while the human stuff is less specified.
是的。不。我认为这个基准测试问题既因Svali刚才提到的内容而复杂化,也涉及到Aman所探讨的——即使你意识到基准测试与实际编程之间存在偏差问题,这种偏差有时难以量化,因为实际编程过程非常混乱,有时甚至无法明确界定何为正确。而公共基准测试又使问题加倍复杂,因为这些基准常被人为优化。
Yeah. No. I think this benchmark question is both complicated by what Sualeh just mentioned, and also by what Aman was getting into: there's this problem of the skew between what you can actually model in a benchmark versus real programming, and that can sometimes be hard to encapsulate, because real programming is very messy, and sometimes things aren't super well specified as to what's correct or what isn't. But then it's also doubly hard because of this public benchmark problem, and that's both because public benchmarks are sometimes kind of hill-climbed on.
此外,从模型中剔除公共基准的数据也极其困难。比如最流行的智能体基准测试SWE-bench,其数据已深度污染了这些基础模型的训练集。当你要求基础模型解决SWE-bench问题时,即便不提供代码库上下文,它们也能幻觉出正确的文件路径和函数名。因此公共基准本身就存在棘手特性。
Then it's also really, really hard to get the data from the public benchmarks out of the models. And so, for instance, one of the most popular agent benchmarks, SWE-bench, is really, really contaminated in the training data of these foundation models. And so if you ask these foundation models to do a SWE-bench problem, but you actually don't give them the context of a code base, they can hallucinate the right file paths, they can hallucinate the right function names. And so the public aspect of these things is tricky.
确实。这种情况下,模型可能直接训练过原始issue或PR内容。实验室或许已开始或已经做好数据净化工作,但不可能完全排除仓库本身的训练数据。比如SymPy这类热门Python仓库,开发者不可能为了让基准测试更准确而故意削弱模型在这些仓库上的表现。
Yeah. Like, in that case, it could be trained on the literal issues or pull requests themselves. And and maybe the labs will start to do a better job or they've already done a good job at decontaminating those things, but they're not going to omit the actual training data of the repository itself. Like, these are all, like, some of the most popular Python repositories, like Sympy is one example. I don't think they're going to handicap their models on SymPy and all these popular Python repositories in order to get true evaluation scores in these benchmarks.
是的。我认为
Yeah. I think that
鉴于基准测试的匮乏,开发这些系统的机构会采用些有趣的方法来评估方向正确性。很多团队直接让人工测试并给出定性反馈——某些基础模型公司甚至设有专职岗位。我们内部也高度依赖人工定性评估,同时结合私有测试集。
given the dearth of benchmarks, there have been a few interesting crutches that places that build systems with these models, or build these models, actually use to get a sense of whether they're going in the right direction or not. And in a lot of places, people will actually just have humans play with the things and give qualitative feedback on them. Like, at one or two of the foundation model companies, they have people for whom that's a big part of their role. And internally, we also qualitatively assess these models and actually lean on that a lot, in addition to, like, the private evals that we have.
就像氛围检测。
It's like the vibe.
对。氛围。氛围。
Yeah. The vibe. The vibe. The
氛围基准,人工基准。没错,拉人来做个氛围检测。
vibe benchmark, human benchmark. Yeah. The human you pull in the humans to do a vibe check.
嗯。好吧。
Yeah. Okay.
这基本就是我的工作方式——浏览论坛、Reddit和X。虽然不知道如何系统化加载用户观点,人们总说'感觉Claude或GPT变笨了'。有时我也有同感,但不确定是模型问题还是我的问题。
I mean, that's that's kinda what I do, like, just like reading online forums and Reddit and X. Just like well, I don't know how to properly load in people's opinions because they'll say things like, I feel like Claude or GPT's gotten dumber or something. They'll say, I feel like, and then I sometimes feel like that too, but I wonder if it's the model's problem or mine.
你知道,关于Claude有个有趣的观点,我听说AWS使用不同芯片,我怀疑它们的数值计算与NVIDIA GPU略有差异。有人推测Claude性能下降可能与AWS Bedrock上运行的量化版本有关,而非Anthropic自家GPU上的版本。
You know, with Claude, there's an interesting take I heard where, I think, AWS has different chips, and I suspect they have slightly different numerics than NVIDIA GPUs. And someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs.
我采访过很多持阴谋论观点的人,所以很高兴你提到这个阴谋论。
I interview a bunch of people that have conspiracy theories, so I'm glad you spoke to this conspiracy theory.
这与其说是阴谋论,不如说...人类就是人类,总会有各种细节问题。做这些浮点运算时,芯片本身就很复杂,难免会出现bug。说真的,避免bug的难度怎么强调都不为过。
Well, it's not a conspiracy theory as much as, you know, humans are humans and there's these details. Yes. And, you know, you're doing these crazy floating-point operations, and chips are messy, and, man, you can just have bugs. Like, it's hard to overstate how hard bugs are to avoid.
优质提示词在这其中扮演什么角色?既然你提到基准测试都使用结构严谨的提示词。人类应该如何最大化成功率?你在博客《提示设计》中提到的那些要点有多重要?
What's the role of a good prompt in all of this? Since you did mention that benchmarks have really structured, well-formulated prompts, what should a human be doing to maximize success? And what's the importance of what you wrote about in a blog post you called prompt design?
这取决于具体模型,每个模型对提示词的反应都不同。但去年最初的GPT-4等可预测模型对提示词非常敏感,当时上下文窗口也很小。我们需要把所有可能相关的代码库信息都塞进提示词里。
Yeah. I think it depends on which model you're using, and all of them are slightly different, and they respond differently to different prompts. But I think the original GPT-4 and the sort of earlier models from last year, they were quite sensitive to the prompts. They also had a very small context window. And so we have all of these pieces of information around the code base that would maybe be relevant in the prompt.
比如文档、添加的文件、对话历史等。问题在于:如何决定实际放入提示词的内容?即便现在有了长上下文窗口,填满整个窗口会导致响应变慢,有时反而会让模型困惑——某些模型尤其明显。为此我们内部开发了名为preempt的系统来优化这个问题。
Like, you have the docs, you have the files that you add, you have the conversation history. And then there's a problem, like, how do you decide what you actually put in the prompt when you have a limited space? And even for today's models, even when you have long context, filling out the entire context window means that it's slower. It means that sometimes the model actually gets confused, and some models get more confused than others. And we have this one system internally that we call preempt, which helps us with that a little bit.
这个系统最初是为8000token以下的小窗口时代设计的。就像做网站时,你希望它适配手机和桌面端,但动态信息不像纸质杂志排版那样可以精确定位。提示词设计同样面临输入内容动态变化需要适配的问题。
And I think it was built for the era before, where we had 8,000-token context windows. And it's a little bit similar to when you're making a website. You want it to work on mobile, you want it to work on a desktop screen, and you have this dynamic information, which you don't have if you're, for example, designing a print magazine, where you know exactly where you can put stuff. But when you have a website, or when you have a prompt, you have these inputs, and then you need to format them to always work.
当输入内容过大时就需要精简。我们借鉴了React的声明式设计思路:用JSX声明元素优先级和层级关系,然后由渲染引擎(网页是Chrome,我们的是preempt渲染器)自动排版布局。
Even if the input is really big, then you might have to cut something down. And so the idea was, okay, let's take some inspiration. What's the best way to design websites? Well, the thing that we really like is React and the declarative approach, where you use JSX in JavaScript, and then you declare, this is what I want, and I think this has higher priority, or, like, this has higher z-index than something else.
随着发展,这个系统的定位也在演变:最初是为适配小窗口,现在则擅长拆分提示词数据与渲染逻辑。由于保留了原始数据,调试时可以通过修改渲染方式在历史提示词上测试,直观评估改动对整体评估集的效果提升。
And then you have this rendering engine. In web design, it's, like, Chrome, and in our case, it's a preempt renderer, which then fits everything onto the page. You declare what you want, and it figures out how to fit it. And so we have found that to be quite helpful, and I think the role of it has sort of shifted over time, where initially it was to fit to these small context windows. Now it's really useful because, you know, it helps us with splitting up the data that goes into the prompt and the actual rendering of it.
这种设计让调试更便捷——你可以修改提示词渲染方式后,直接在历史原始数据上测试,观察改动是否真正提升了整个评估集的表现。
And so it's easier to debug, because you can change the rendering of the prompt and then try it on old prompts, because you have the raw data that went into the prompt. And then you can see, did my change actually improve it for, like, this entire eval set?
所以你们真的直接用JSX来提示吗?
So do you literally prompt with JSX?
对,没错。是的。所以它看起来有点像React。有组件的概念。
Yes. Yeah. Yes. So it kind of looks like React. There are components.
比如我们有一个文件组件,它会接收光标位置。通常文件中会有一行是光标所在行,那可能是最重要的行,因为那是你正在查看的内容。然后你可以设置优先级——光标行优先级最高,每远离一行就减一。最终渲染时,系统会计算实际能围绕该行显示多少行内容。
Like, we have one component that's a file component, and it takes in, like, the cursor. Like, usually, there is one line where the cursor is in your file, and that's probably the most important line, because that's the one you're looking at. And so then you can give priorities. So, like, that line has the highest priority, and then you subtract one for every line that is farther away. And then, eventually, when it's rendered, it figures out how many lines can actually fit, centered around that line.
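As a rough illustration of the priority scheme described here (this is not Cursor's actual preempt code; all names are hypothetical), a line-level renderer could look like this:

```python
# Sketch of the idea above: the cursor line gets the highest priority,
# and priority decays by one per line of distance. Given a budget
# (a line budget here, for simplicity), keep the highest-priority lines.

def render_file_component(lines, cursor_line, budget):
    """Return the contiguous window of `budget` lines around the cursor.

    Priority of line i is -abs(i - cursor_line); keeping the `budget`
    highest-priority lines yields a window centered on the cursor.
    """
    by_priority = sorted(range(len(lines)), key=lambda i: abs(i - cursor_line))
    keep = sorted(by_priority[:budget])  # restore document order
    return [lines[i] for i in keep]

file_lines = [f"line {i}" for i in range(100)]
window = render_file_component(file_lines, cursor_line=50, budget=5)
```

A real renderer would score tokens rather than lines and mix in other components (docs, conversation history) with their own priorities, but the fitting step is the same idea.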
这太棒了。确实。
That's amazing. Yeah.
你还可以做些更花哨的操作,比如当需要从整个代码库提取多个代码块时,可以通过检索、嵌入和重新排序评分等方式为每个组件添加优先级。
And you can do, like, other fancy things where if you have lots of code blocks from the entire code base, you could use retrieval and things like embedding and re ranking scores to add priorities for each of these components.
那人类提问时是否也该采用类似方式?比如在问题里写JSX会有帮助吗?还是说这套系统本就应该保持松散随意的特性?
So should humans, when they ask questions, also try to use something like that? Like, would it be beneficial to write JSX in the prompt? Or is the whole idea that it should be loose and messy?
我认为我们的目标是让用户以最自然的方式操作。然后我们的工作就是确保...明白吗...如何准确检索相关上下文,让你的输入具有实际意义。
I think our goal is kind of that you should just do whatever is the most natural thing for you. Yeah. Then our job is to figure out Sure. how we actually retrieve the relevant things so that your thing actually makes sense.
嗯,这其实是我...
Well, this is a sort of the discussion I
之前和Perplexity的Arvind讨论过的。他的核心理念就是应该允许用户尽情偷懒
had with Arvind of perplexity. It's like his whole idea is like, you should let the person be as lazy
是的。随心所欲。嗯哼。
Yes. As you want. Yeah. Mhmm.
嗯。但是,是的。那很美。但我
Mhmm. But, like yeah. That's a beautiful thing. But I
觉得你可以对程序员要求更多。
feel like you're allowed to ask more of programmers.
对吧?是的。所以,如果你说只管做你想做的,我是说,人都是懒惰的。在懒惰与提供更多之间有一种张力,几乎像是系统在推动或激励你去表达得更清晰。嗯。
Right? Yes. So, like, if you say just do what you want, I mean, humans are lazy. There's a kind of tension between just being lazy versus providing more, almost like the system pressuring you or inspiring you to be articulate. Uh-huh.
是的。不是句子的语法层面,而是你在提示中传达的思想深度。
Yeah. Not in terms of the grammar of the sentences, but in terms of the depth of thoughts that you convey inside the the prompts.
我认为即使系统接近某种完美程度,当你向模型请求某事时,往往传达的意图不足以知道该做什么。有几种解决这种意图的方法。一种是让模型直接问你,我不确定如何根据你的查询处理这些部分,你能澄清一下吗?另一种可能是,鉴于你查询中存在的不确定性,给出五到六种可能的生成结果。
I think even as the system gets closer to some level of perfection, often when you ask the model for something, not enough intent is conveyed to know what to do. And there are, like, a few ways to resolve that intent. One is the simple thing of having the model just ask you, I'm not sure how to do these parts based on your query, could you clarify that? I think the other could be, maybe, if there are five or six possible generations given the uncertainty present in your query so far.
为什么不直接展示所有这些选项让你选择呢?
Why don't we just actually show you all of those and let you pick them?
模型选择回应的难度有多大,相比之下,处理不确定性更难。嗯。我应该选择要求更多信息来减少模糊性吗?
How hard is it for the model to choose to talk back, sort of, versus that's hard, sort of, like, how to deal with the uncertainty. Mhmm. Do I choose to ask for more information to reduce the ambiguity?
所以,我们最近做的一件事是尝试建议你可以添加的文件。在你输入时,可以猜测不确定性并可能建议,比如,也许你在写API,我们可以利用你之前在同一文件中的提交记录猜测客户端和服务器非常有用。有一个技术难题是如何在所有提交中解析哪些文件对你的当前提示最重要。我们的初始版本已经推出,我相信我们可以让它更准确。这还非常实验性。
So, I mean, one of the things we do, it's like a recent addition, is try to suggest files that you can add. And while you're typing, one can guess what the uncertainty is and maybe suggest that, like, you know, maybe you're writing your API, and we can guess, using the commits that you've made previously in the same file, that the client and the server are super useful. And there's, like, a hard technical problem of how do you resolve it across all commits, which files are the most important given your current prompt. The initial version is rolled out, and I'm sure we can make it much more accurate. It's very experimental.
但想法是我们展示给你,比如,你想添加这个文件、这个文件、还有这个文件,告诉模型为你编辑这些文件吗?因为如果你在制作API,可能也应该编辑使用该API的客户端和服务器,以及解析API的另一部分。这样会很酷,既有编写提示的阶段,也有在你点击回车前,我们可能帮助解决一些不确定性。
But then the idea is we show you, like, do you just want to add this file, this file, this file also, to tell, you know, the model to edit those files for you? Because if, maybe, you're making the API, like, you should also edit the client and the server that is using the API and the other one resolving the API. And so that'll be kinda cool, as both there's the phase where you're writing the prompt, and there's before you even click enter, where maybe we could help resolve some of the uncertainty.
你们在多大程度上使用代理方法?智能体有多大用处?
To what degree do you use agentic approaches? How useful are agents?
我们认为智能体真的非常酷。就像,我我我觉得智能体有点像人类。它某种程度上像是,你能感觉到它,仿佛更接近通用人工智能了,因为你会看到一个演示,它表现得像人类一样,这真的非常酷。我认为智能体目前在很多事情上还不是特别有用。我觉得我们正在接近它们真正发挥作用的阶段。
We think agents are really, really cool. Like, I think agents sort of resemble a human. It's sort of like, you can kind of feel that you're getting closer to AGI, because you see a demo where it acts as a human would, and it's really, really cool. I think agents are not yet super useful for many things. I think we're getting close to where they will actually be useful.
所以我认为在某些类型的任务中,拥有一个智能体会非常棒。比如,我很想有一个智能体。举个例子,如果我们遇到一个bug,有时在聊天输入框里无法使用command c和command v,这个任务非常明确。我只想说,用两句话描述这个问题不工作,请修复它。
And so I think there are certain types of tasks where having an agent would be really nice. Like, I would love to have an agent. For example, we have a bug where you sometimes can't command-C and command-V inside our chat input box, and that's a task that's super well specified. I just want to say, in two sentences, this does not work, please fix it.
然后我希望有一个智能体可以独立去完成这个任务,一天后我回来审查结果。
And then I would love to have an agent that just goes off, does it, and then a day later, I come back and I review the thing.
你是说它会找到正确的文件?
You mean it goes finds the right file?
是的。它找到正确的文件,尝试复现这个bug,修复它,然后验证修复是否正确。这个过程可能会花费很长时间。
Yeah. It finds the right files. It tries to reproduce the bug. It fixes the bug, and then it verifies that it's correct. And this could be a process that takes a long time.
所以我非常希望能有这样的智能体。关于编程,很多人认为智能体会完全取代编程。我们并不这么认为,因为编程的很多价值在于迭代——你通常不想一开始就明确指定所有细节,因为你往往在看到一个初始版本后才知道自己真正想要什么,然后在此基础上迭代并提供更多信息。因此对于很多编程工作,我认为你更需要一个能即时反馈初始版本的系统,然后可以非常快速地迭代。
And so I think I would love to have that. And then, a lot of programming, like, there is often this belief that agents will take over all of programming. I don't think we think that that's the case, because a lot of the value in programming is in iterating, where you don't actually want to specify something upfront, because you don't really know what you want until you've seen an initial version, and then you want to iterate on that, and then you provide more information. And so for a lot of programming, I think you actually want a system that's instant, that gives you an initial version instantly back, and then you can iterate super, super quickly.
最近出现的Replit Agent这类工具怎么样?它还能设置开发环境、安装软件包、配置所有东西、配置数据库,甚至部署应用。这也是你梦想中的功能之一吗?
What about something that recently came out, Replit Agent, that also does, like, setting up the development environment, installing software packages, configuring everything, configuring the databases, and actually deploying the app. Yeah. Is that also in the set of things you dream about?
我想是的。那会非常酷。对于某些类型的编程任务来说,这确实很棒。
I think so. I think that would be really cool. For certain types of programming, it would be really cool.
这在Cursor的规划范围内吗?
Is that within scope of Cursor?
是的。虽然我们现在没有积极开发这个功能,但我们的确想让程序员的工作更轻松有趣。有些步骤非常繁琐,需要经历一系列操作,你希望能委托给智能体处理。还有些场景下,你可以在工作时让智能体在后台运行——比如当你同时处理前后端的PR时,前端工作时可以让后台智能体同步处理相关任务。
Yeah. We aren't actively working on it right now, but it's definitely like, we want to make the programmers' life easier and more fun. And some things are just really tedious, and you need to go through a bunch of steps, and you want to delegate that to an agent. And then some things, you can actually have an agent in the background while you're working. Like, let's say you have a PR that's both back end and front end, and you're working in the front end, and then you can have a background agent that does some work and figure out kind of what you're doing.
然后当你处理PR的后端部分时,你就有了可以迭代的初始代码片段。这样也会非常棒。
And then when you get to the back end part of your PR, then you have some initial piece of code that you can iterate on. And so that would also be really cool.
我们已经讨论过速度问题,但我想是否可以更深入探讨实现高速所涉及的技术细节。光标的每个方面——或者说大多数方面——都感觉非常快。正如我提到的,应用操作可能是最慢的部分,对我来说——
One of the things we already talked about is speed, but I wonder if we can just linger on that some more, on the various places, the technical details involved in making this thing really fast. So every single aspect of Cursor, most aspects of Cursor, feel really fast. Like I mentioned, the apply is probably the slowest thing, and for me, from
抱歉打断。
I'm sorry.
痛点所在。
The pain.
确实是个痛点。我们正在感受这个痛点,也正在努力修复它。
It's a pain. It's a pain that we're feeling, and we're working on fixing it.
是的。当某个操作耗时一、两秒就让人觉得慢时,恰恰说明其他所有环节都极其快速。那么关于如何优化模型、加速聊天响应、快速生成差异对比——有什么技术细节可以分享吗?
Yeah. I mean, it says something that something that takes, I don't know, one second or two seconds feels slow. That actually shows that everything else is just really, really fast. So is there some technical details about how to make some of these models fast, how to make the chat fast, how to make the diffs fast?
有什么特别突出的方案吗?
Is there something that just jumps to mind?
有的。我们可以详细讨论采用的策略。缓存预热是个有趣的技术——当用户输入时,你可以预加载可能用到的上下文内容。正如之前讨论的,复用KV缓存能降低延迟和跨请求成本。
Yeah. I mean, so we can go over a lot of the strategies that we use. One interesting thing is cache warming. And so what you can do is, as the user is typing, you know you're probably going to use some piece of context, and you can know that before the user's done typing. So, you know, as we discussed before, reusing the KV cache results in lower latency and lower cost across requests.
用户开始输入时,就能立即用当前文件内容预热缓存。这样当他们按下回车时,系统只需预填充少量标记就能开始生成,这将显著降低首字节时间(TTFT)。
So as the user starts typing, you can immediately warm the cache with, let's say, the current file contents. And then when they press enter, there are very few tokens it actually has to prefill and compute before starting the generation. This will significantly lower TTFT.
能解释下KV缓存的工作原理吗?
Can you explain how the KV cache works?
是的。所以Transformer的工作原理是这样的
Yeah. So the way transformers work
我喜欢这个。
I like it.
我是说,让Transformer不仅能独立查看每个标记,还能看到先前标记的关键机制之一就是键值注意力。通常,注意力机制是这样运作的:当前标记会生成一个查询,而所有先前标记的键和值——这些是模型内部存储的提示中所有先前标记的某种表示——都会被使用。默认情况下,在进行聊天时,模型必须为每个标记执行一次贯穿整个模型的前向传播。这涉及大量矩阵乘法运算,速度非常非常慢。但如果你已经完成了这些计算并存储了键值对,且将其保留在GPU中,那么假设我已经处理了前n个标记,现在要计算第n+1个标记的输出时,就不需要再将前n个标记通过整个模型处理,因为所有相关的键值对都已存在。
I mean, one of the mechanisms that allows transformers to not just independently look at each token, but see previous tokens, are the keys and values in attention. And generally, the way attention works is, you have at your current token some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt. And, like, by default, when you're doing a chat, the model has to, for every single token, do this forward pass through the entire model. That's a lot of matrix multiplies that happen, and that is really, really slow. Instead, if you have already done that and you stored the keys and values and you keep that in the GPU, then, let's say I have already processed the first n tokens, if I now want to compute the output for the n-plus-one-th token, I don't need to pass those first n tokens through the entire model, because I already have all those keys and values.
因此你只需对最后一个标记执行前向传播。在进行注意力计算时,你复用那些已计算好的键值对——这是Transformer中唯一具有顺序依赖性或者说串行依赖的部分。
And so you just need to do the forward pass through that last token. And then when you're doing attention, you're reusing those keys and values that have been computed, which is the only kind of sequential part or sequentially dependent part of the transformer.
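As a toy sketch of the mechanism just described (tiny hand-made vectors instead of a real model; the class name is made up), a single-head attention step with a KV cache looks roughly like this:

```python
import math

# Each new token appends its own key/value to the cache and attends over
# everything cached, so only one forward step is needed per new token
# instead of reprocessing the whole prefix. Real implementations work on
# batched GPU tensors; this just shows the sequential-dependence point.

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append this token's key/value, then attend over the full cache.
        self.keys.append(k)
        self.values.append(v)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in self.keys]
        weights = [math.exp(s) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]  # softmax over cached keys
        dim = len(v)
        return [sum(w * val[d] for w, val in zip(weights, self.values))
                for d in range(dim)]

cache = KVCache()
out1 = cache.step(q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 0.0])
out2 = cache.step(q=[1.0, 0.0], k=[0.0, 1.0], v=[0.0, 3.0])
```

The second call only computes one new dot product per cached key; nothing earlier in the sequence is recomputed.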
是否存在更高层次的缓存?比如对提示词或类似内容的缓存?
Is there, like, higher level caching, or, like, caching of the prompts, or that kind of stuff?
确实存在其他类型的缓存策略。对于光标Tab功能,有个有趣的实现方式:你可以预先预测用户可能接受建议的情况并发起另一个请求。这样你就实现了推测性缓存——这是推测与缓存的混合体。因为你既在推测用户接受建议后的结果,又缓存了这个预测建议。
Yeah, there's other types of caching you can kind of do. One interesting thing that you can do for Cursor Tab is you can basically predict ahead, as if the user would have accepted the suggestion, and then trigger another request. And so then you cache, you've done a speculative, it's a mix of speculation and caching, right? Because you're speculating what would happen if they accepted it, and then you have this value that is cached, this suggestion.
当用户按下Tab键时,下一个建议就能立即呈现。这是一种巧妙利用高层缓存的启发式技巧,尽管模型本身没有任何改变,却能带来极快的响应体验。
And then when they press tab, the next one would be waiting for them immediately. It's a kind of clever heuristic slash trick that uses higher level caching, and it feels fast despite there not actually being any changes in the model.
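A minimal sketch of this speculation-plus-caching trick, with `model` as a stand-in for the real completion endpoint and all names hypothetical:

```python
# While showing suggestion S, also request the suggestion that would
# follow if the user accepted S, and cache it keyed by the resulting
# buffer. If the user accepts, the next suggestion is an instant hit.

def make_tab_provider(model):
    cache = {}

    def suggest(buffer):
        if buffer in cache:                 # user accepted: instant hit
            suggestion = cache.pop(buffer)
        else:
            suggestion = model(buffer)
        # Speculate: precompute the next suggestion as if this one is accepted.
        accepted = buffer + suggestion
        cache[accepted] = model(accepted)
        return suggestion

    return suggest

# Toy model: suggests the next character of a fixed target string.
target = "print(x)"
toy_model = lambda buf: target[len(buf)] if len(buf) < len(target) else ""

suggest = make_tab_provider(toy_model)
first = suggest("print(")    # computed; the follow-up is now cached
second = suggest("print(x")  # served from the speculative cache
```

In a real system the speculative request would run asynchronously, and the cache would be evicted when the user diverges from the predicted acceptance.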
如果能缩小键值缓存体积,优势之一就是可以进行更多推测。比如预测接下来可能有用的10种情况——相比只展示单一预测,用户命中这10种情况之一的概率会高得多。也许用户输入其他字符时,我们就能命中缓存中的某个选项。
And if you can make the KV cache smaller, one of the advantages you get is, maybe you can speculate even more. Maybe you can guess, here's the 10 things that could be useful, like, predict the next 10, and then it's possible the user hits one of the 10. It's, like, a much higher chance than the user hitting the exact one that you show them. Maybe they type in another character, and we hit something else in the cache.
没错。所有这些技巧背后的核心现象是:模型单次采样可能不够理想,但若预测10种不同情况,正确的概率就会大幅提升。这就是典型的 pass@k 曲线现象。在强化学习中,我们正是利用这种 pass@k 特性来生成多样化预测。
Yeah. So there's all these tricks where the general phenomenon here, which I think is also super useful for RL, is, you know, maybe a single sample from the model isn't very good. But if you predict, like, 10 different things, it turns out that the probability one of the 10 is right is much higher. There's these pass@k curves. And, you know, part of what RL does is exploit this pass@k phenomenon to make many different predictions.
可以这样理解:模型内部其实存在不确定性——它无法确定哪些关键点是正确的,或者说人类更倾向于哪些选择。当我们对光标Tab模型进行强化学习时,本质上是在预测模型生成的100个建议中哪些更符合人类偏好。有些建议可能预测跨度很大,有些则较短,而适中的可能最受欢迎。通过给人类更喜欢的建议给予奖励,对不喜欢的进行惩罚,就能训练模型输出更符合人类偏好的建议。这些利用 pass@k 曲线的强化学习循环非常有用。Aman,或许你可以进一步深入探讨细节。
And one way to think about this: the model sort of internally has some uncertainty over which of the k things is correct, or which of the k things the human wants. When we RL our Cursor Tab model, one of the things we're doing is predicting which of the 100 different suggestions the model produces is more amenable to humans. Like, which of them do humans like more than other things? Maybe there's something where the model can predict very far ahead versus a little bit, and maybe somewhere in the middle, and then you can give a reward to the things that humans would like more and sort of punish the things that they wouldn't like, and then train the model to output the suggestions that humans would like more. You have these RL loops that are very useful that exploit these pass@k curves. Aman, maybe you can go into even more detail.
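The pass@k idea mentioned here has a simple closed form: if a single sample is correct with probability p, then, assuming independent samples, the chance that at least one of k samples is correct is 1 - (1 - p)^k. A quick sketch:

```python
# Pass@k in the sense discussed above: sampling k candidates and keeping
# the best one raises the chance of at least one correct answer, even
# when each individual sample is usually wrong.

def pass_at_k(p, k):
    """Probability at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

single = pass_at_k(0.2, 1)   # about 0.2
ten = pass_at_k(0.2, 10)     # about 0.89: ten samples help a lot
```

This is exactly why predicting 10 suggestions and letting a reward model (or a human) pick among them works so much better than showing a single sample.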
是的。这与速度有些不同。但技术上来说,你可以通过强化学习让小型模型达到与大型模型相同的性能,从而关联起来。就像Sualeh提到的减少KV缓存大小,还有其他对速度有帮助的技术。大约两年前,人们主要使用多头注意力机制。
Yeah. It is a little different than speed. But, I mean, technically, you tie it back in, because you can get away with a smaller model if you RL your smaller model and it gets the same performance as the bigger one. And Sualeh was mentioning stuff about reducing the size of your KV cache; there are other techniques there as well that are really helpful for speed. So back in the day, like, all the way two years ago, people mainly used multi-head attention.
我认为现在正转向更高效的注意力方案,如分组查询或多查询注意力。这对大批量生成令牌非常有帮助。有趣的是,这对首令牌生成时间(预填充速度)没有影响,它影响的是后续令牌生成速度。为什么呢?
And I think there's been a migration towards more efficient attention schemes, like grouped-query or multi-query attention. And this is really helpful, with larger batch sizes, for being able to generate the tokens much faster. The interesting thing here is this has no effect on that time-to-first-token prefill speed. The thing this matters for is generating tokens. And why is that?
因为在生成令牌时,瓶颈不在于并行矩阵乘法运算,而在于长上下文大批量场景下读取缓存键值的速度。这涉及内存带宽问题,如何提速?可以尝试压缩键值大小。多查询注意力是最激进的方式——传统多头注意力有若干注意力头和查询头,而多查询只保留查询头,完全去除键值头。
Because when you're generating tokens, instead of being bottlenecked by doing these super parallelizable matrix multiplies across all your tokens, you're bottlenecked, for long context with large batch sizes, by how quickly you can read those cached keys and values. So that's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. So multi-query attention is the most aggressive of these, where normally, with multi-head attention, you have some number of key/value heads and some number of query heads. Multi-query just preserves the query heads and gets rid of all the key/value heads.
最终只保留一个键值头和多组查询头。分组查询则保留所有查询头,仅减少键值头数量但不缩减至单个。关键是缩小KV缓存体积。此外还有MLA方案...
So there's only one kind of key/value head, and there are all the remaining query heads. With grouped-query, you instead preserve all the query heads, and there are fewer heads for the keys and values, but you're not reducing it to just one. But, anyways, the whole point here is you're just reducing the size of your KV cache. And then there is MLA. Yeah.
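A back-of-the-envelope comparison of KV-cache size under the three schemes, using made-up model dimensions (the formula is the standard one; the specific numbers are illustrative only):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * batch * bytes_per_value.
# Query heads stay the same (32 here) in all three schemes; only the
# number of key/value heads changes, which is what shrinks the cache.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

layers, head_dim = 32, 128
seq_len, batch = 8192, 8

mha = kv_cache_bytes(layers, kv_heads=32, head_dim=head_dim,
                     seq_len=seq_len, batch=batch)  # one KV head per query head
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim,
                     seq_len=seq_len, batch=batch)  # grouped: 4 queries share a KV head
mqa = kv_cache_bytes(layers, kv_heads=1, head_dim=head_dim,
                     seq_len=seq_len, batch=batch)  # single shared KV head
```

The cache shrinks linearly with the number of KV heads, which is exactly the memory-bandwidth win described above for token generation.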
多潜在向量方案更复杂。它将所有注意力头的键值转化为单一潜在向量,在推理时再扩展使用。
Multi-latent attention. That's a little more complicated. And the way this works is it kind of turns the entirety of your keys and values, across all your heads, into one latent vector that is then kind of expanded at inference time.
但MLA来自DeepSeek公司,是个有趣的算法。其核心思想与MQA类似——减少键值头数量。优势在于减少存储,但理论上需要保持键值多样性。解决方案是维护一个共享大向量存储共性,每个令牌仅存储差异化小向量。
But MLA is from this company called DeepSeek. It's quite an interesting algorithm. Maybe the key idea is, in both MQA and in other places, what you're doing is reducing the number of KV heads. The advantage you get from that is there's less of them, but maybe the theory is that you actually want each of the keys and values to be different. So one way to reduce the size is you keep one big shared vector for all the keys and values, and then you have smaller vectors for every single token, so that you can store only the smaller thing.
这本质上是种低秩降维。由于系统受内存带宽限制而计算资源有剩余,最终计算时可将潜在向量重新扩展。这种方案效率更高,比如能将保留向量的尺寸压缩约32倍。
There's some sort of low-rank reduction. And at the end, when you eventually want to compute the final thing, remember that you're memory bound, which means that you still have some compute left that you can use for these things. And so you can expand the latent vector back out, and somehow this is far more efficient, because you're reducing, for example, by maybe 32x or something, the size of the vector that you're keeping.
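A toy sizing sketch of that low-rank idea (the shapes are invented, not DeepSeek's actual configuration): cache one small latent per token instead of full per-head keys and values, and expand at inference time.

```python
# Instead of caching K and V for every head, cache a single small latent
# vector per token. A fixed up-projection of shape
# (latent_dim, 2 * heads * head_dim) would expand it back into per-head
# keys and values at inference time, spending spare compute to save the
# memory bandwidth that bottlenecks token generation.

heads, head_dim, latent_dim = 8, 64, 32

full_kv_per_token = 2 * heads * head_dim   # K and V for every head
mla_cache_per_token = latent_dim           # only the shared latent is cached
compression = full_kv_per_token / mla_cache_per_token
```

With these toy shapes the per-token cache shrinks by 32x, the same order as the reduction mentioned in the conversation.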
确实。保留独立且配对的键值查询集合可能比压缩为单一交互更具信息丰富性。
Yeah. There's perhaps some richness in having a separate set of keys and values and query that kind of pairwise match up versus compressing that all into one and that interaction at least.
明白了。这些方案都是为了解决内存带宽限制问题。
Okay. And all of that is dealing with being memory bound.
没错。
Yeah.
归根结底,我的意思是,这如何映射到用户体验上?正试图理解这一点
And, I mean, ultimately, how does that map to the user experience? Trying to get this
是的。它映射到的两个方面是:首先,由于KV缓存占用的空间减少,你现在可以大幅扩大缓存容量,更激进地缓存更多内容。这样能获得更多缓存命中,有助于缩短首令牌生成时间,原因前文已有所阐述。其次,当你处理越来越多请求、使用越来越大批次进行推理时,令牌生成速度不会出现明显下降。
thing… Yeah. The two things that it maps to are: you can now make your cache a lot larger, because you have less space allocated for the KV cache. You can cache a lot more aggressively, a lot more things. So you get more cache hits, which are helpful for reducing the time to first token, for the reasons that were described earlier. And then the second being, when you start doing inference with more and more requests and larger and larger batch sizes, you don't see much of a slowdown in the speed of generating the tokens.
此外,它还允许你在某些情况下使用更长的提示词。
Well, it also allows you to make your prompt bigger for certain things.
没错。KV缓存的大小等于所有提示词总长度乘以并行处理的提示数量。因此你可以增加其中任一维度——要么扩大批次规模,要么延长提示词长度——而不会影响令牌生成的延迟。
Yeah. Like, the size of your KV cache is the size of all your prompts multiplied by the number of prompts being processed in parallel. So you could increase either of those dimensions, right, the batch size or the size of your prompts, without degrading the latency of generating tokens.
Arvid,你写过一篇题为《影子工作区》的博客文章
Arvid, you wrote a blog post, Shadow Workspace,
是的。
Yes.
关于后台代码迭代的功能。能具体说说吗?
Iterating on code in the background. Yeah. So what's going on?
需要明确的是,我们希望后台能进行大量处理,目前正在试验多种方案。现阶段除了缓存预热或优化指令键提示的上下文外,这类后台处理还不多。但核心理念是:通过后台计算,我们可以在更长的时间维度(比如预测用户未来十分钟的操作,而非仅仅下几行代码)为用户提供帮助。影子工作区就是我们为实现这一目标开发的内部实验工具——要有效利用后台计算,关键在于建立反馈机制。
So, to be clear, we want there to be a lot of stuff happening in the background, and we're experimenting with a lot of things. Right now, we don't have much of that happening, other than, like, the cache warming or, you know, figuring out the right context that goes into your command-K prompts, for example. But the idea is, if you can actually spend computation in the background, then you can help the user maybe at a slightly longer time horizon than just predicting the next few lines that you're gonna make. Actually, in the next ten minutes, what are you going to make? And by doing it in the background, you can spend more computation doing that. And so the idea of the shadow workspace that we implemented, and we use it internally for experiments, is that to actually get an advantage from doing stuff in the background, you want some kind of feedback signal to give back to the model.
因为单纯延长模型思考时间虽能提升性能(比如O1模型就是典型案例),但通过迭代和反馈同样能优化效果。对程序员而言,语言服务器就是至关重要的反馈源——每种主流语言都有专属的语言服务器,它能指出类型错误、支持跳转定义,并理解代码结构。
Because otherwise, like, you can get higher performance by just letting the model think for longer, and so, like, o1 is a good example of that. But another way you can improve performance is by letting the model iterate and get feedback. And so one very important piece of feedback when you're a programmer is the language server, which is this thing that exists for most different languages, and there's, like, a separate language server per language. And it can tell you, you know, you're using the wrong type here, and then it gives you an error. Or it can allow you to go to definition, and it sort of understands the structure of your code.
这些语言服务器由各语言社区开发(比如TypeScript团队开发TS语言服务器,Rust团队开发Rust语言服务器),它们通过语言服务器协议与VS Code对接。这样VS Code无需内置所有语言支持,直接复用现有编译器基础设施即可。
So language servers are extensions developed by, like, there is a TypeScript language server developed by the TypeScript people, a Rust language server developed by the Rust people, and then they all interface over the language server protocol to VS Code. So that VS Code doesn't need to have all of the different languages built into VS Code, but rather you can use the existing compiler infrastructure.
为了代码检查?什么
For linting purposes? What
什么 这是用于代码检查的。用于跳转到定义以及查看你正在使用的正确类型。
what It's for linting. It's for going to definition and for, like, seeing the right types that you're using.
所以它也在做类型检查之类的?
So it's doing, like, type checking also?
是的。类型检查和跳转引用。当你在大型项目中工作时,你确实需要这些功能。如果没有这些,在大型项目中编码会非常困难。
Yes. Type checking and going to references. And when you're working in a big project, you kind of need that. If you don't have that, it's really hard to code in a big project.
你能再说一遍这在Cursor里是怎么使用的吗?那个语言服务器协议,对,通信机制?
Can you say again how that's being used inside Cursor? The language server protocol Yes. communication thing?
它在Cursor中被用来向程序员展示信息,就像在VS Code中一样。但我们的想法是要向AI模型展示同样的信息。
So it's being used in Cursor to show to the programmer, just like in VS Code. But then the idea is, you want to show that same information to the models, the AI models. And
你想要
you want to
以一种不影响用户的方式在后台完成这个。所以影子工作区的概念就是:我们可以生成一个隐藏的Cursor窗口——通过设置Electron的隐藏标志,虽然存在窗口但用户看不见。在这个窗口里,AI代理可以任意修改代码(只要不保存,因为仍是同一文件夹),然后获取代码检查器的反馈、跳转定义,并迭代他们的代码。
do that in a way that doesn't affect the user, because you want to do it in the background. And so the idea behind the shadow workspace was, okay, one way we can do this is we spawn a separate window of Cursor that's hidden. So you can set this flag in Electron that it's hidden. There is a window, but you don't actually see it. And inside of this window, the AI agents can modify code however they want, as long as they don't save it, because it's still the same folder, and then they can get feedback from the linters and go to definition and iterate on their code.
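At a high level, the loop being described might be sketched like this, with `propose_edit` and `lint` as stand-ins for the model and the language server (both hypothetical; this is not Cursor's implementation):

```python
# An agent edits an in-memory copy of the files (never saving to disk),
# asks a linter/language server for diagnostics, and iterates until the
# diagnostics are clean or the step budget runs out.

def shadow_iterate(files, propose_edit, lint, max_steps=5):
    shadow = dict(files)  # in-memory copy; the real folder is untouched
    for _ in range(max_steps):
        diagnostics = lint(shadow)
        if not diagnostics:
            return shadow, True   # clean: iteration converged
        shadow = propose_edit(shadow, diagnostics)
    return shadow, False          # out of budget, still has errors

# Toy stand-ins: the "linter" flags a missing colon, the "model" fixes it.
lint = lambda fs: ["missing ':'"] if ":" not in fs["main.py"] else []
fix = lambda fs, diags: {**fs, "main.py": fs["main.py"].replace("def f()", "def f():")}

result, ok = shadow_iterate({"main.py": "def f()\n    pass"}, fix, lint)
```

The real system would route diagnostics through the language server protocol from the hidden window, but the feedback loop has this shape.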
所以就像在后台运行所有东西,就像...对吧。嗯。甚至可能运行代码?
So, like, literally run everything in the background, as if, right. Yeah. Maybe even run the code?
这是最终版本的目标。
So that's the eventual version.
好的。
Okay.
这正是你想要的。博客文章大部分内容其实都在探讨如何实现这一点,因为这有点棘手。你希望它能在用户机器上运行,以便精确反映用户的环境。在Linux上,你可以做一件很酷的事——镜像文件系统,让AI直接修改文件。AI以为自己在操作实际文件,但实际上这些操作都存储在内存中,你可以通过创建内核扩展来实现这一功能。
That's what you want. And a lot of the blog post is actually about how you make that happen, because it's a little bit tricky. You want it to be on the user's machine so that it exactly mirrors the user's environment. And then, on Linux, you can do this cool thing where you can actually mirror the file system and have the AI make changes to the files. And it thinks that it's operating on the file level, but, actually, that's stored in memory, and you can create this kernel extension to make it work.
而在Mac和Windows上会稍微困难些,不过这是个有趣的技术挑战,所以...
Whereas on Mac and Windows, it's a little bit more difficult, but it's a fun technical problem, so that's why…
有个可能有点取巧但我觉得有意思的点子是锁定保存操作。基本上你可以让语言模型暂时锁定磁盘写入,这样你操作的就不是实际保存到磁盘的原始文件版本,而是之前那个影子工作区里仅存在于内存中的未保存内容——你仍然能获得Linter错误提示并继续编码。当你尝试运行代码时,只会收到一个关于锁定的轻微警告,如果要并发操作,你可以从语言服务器或影子工作区收回锁定权限。
One maybe hacky but interesting idea that I like is holding a lock on saving. So, basically, you can then have the language model kind of hold the lock on saving to disk. And then, instead of operating in the ground-truth version of the files that are saved to disk, you're actually operating in what was the shadow workspace before, and these unsaved things that only exist in memory, which you still get linter errors for and can code in. And then, when you try to maybe run code, there's just a small warning that there's a lock, and then you kind of take back the lock from the language server or from the shadow workspace if you're trying to do things concurrently.
顺便说,这个功能太令人兴奋了。虽然有点跑题,但让模型修改文件对人们来说可能很吓人,不过能让AI代理完成一系列任务,第二天回来就像观察同事工作一样查看结果,这真的很酷。
That's such an exciting feature, by the way. It's a bit of a tangent, but, like, to allow a model to change files. It's scary for people, but it's really cool to be able to just let the agent do a set of tasks, and you come back the next day and kind of observe, like it's a colleague or something like that. Yeah.
是的。我认为可运行性可能有不同版本:对于用户在编程时代理的几分钟内完成的简单操作,适合在本地机器运行;而对于需要更长时间的重大修改,可能需要在远程沙盒环境中进行。这又引出了另一个棘手问题——如何精确(或基本等效地)在远程沙盒中复现用户的运行环境。
Yeah. And I think there may be different versions of runability, where for the simple things, where you're doing things in the span of a few minutes on behalf of the user as they're programming, it makes sense to make something work locally on their machine. For the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably wanna do this in some sandboxed remote environment. And that's another incredibly tricky problem: how do you exactly reproduce, or mostly reproduce to the point of it being effectively equivalent for running code, the user's environment in this remote sandbox?
我很好奇你们想要什么样的编程代理。
I'm curious what kind of agents you want for coding.
哦,为了...
Oh, for
你们需要它们找bug吗?还是实现新功能?具体想要哪种代理?
Do you want them to find bugs? Do you want them to, like, implement new features? Like, what agents do you want?
顺便说,当我思考代理时,不仅限于编程。比如这个播客的实践就涉及视频剪辑。如果你查看Adobe的后台代码——虽然文档很糟糕——但确实可以通过代码与Premiere等软件交互。基本上我在YouTube的所有上传操作,正如你能想象的,全都是通过代码完成的。
So, by the way, when I think about agents, I don't think just about coding. For the purposes of this particular podcast, think video editing. If you look at Adobe, a lot of it has code behind it. It's very poorly documented code, but you can interact with Premiere, for example, using code. And basically all the uploading, everything I do on YouTube, everything, as you could probably imagine, I do all of that through code.
包括翻译、配音等所有这类工作。我预见到所有这些类型的任务。因此自动化许多与编辑不直接相关的任务。这就是我的想法。
Including translation, overdubbing, all of this. So I envision all of those kinds of tasks: automating many of the tasks that don't have to do directly with the editing. Okay, that's what I was thinking about.
但在编码方面,我主要考虑的是错误查找。比如多层次的错误查找,也包括逻辑错误。不是精神层面的错误,而是实现方向上的大问题这类东西。
But in terms of coding, I would be fundamentally thinking about bug finding. Like, many levels of bug finding, and also finding, like, logical bugs. Not logical, like, spiritual bugs or something: ones like big wrong directions of implementation, that kind of stuff.
我们来谈谈错误查找吧。
Let's opine on bug finding.
是的。这些模型在单纯被要求找错误时表现如此糟糕,这真的很有趣。它们的校准度极其差。
Yeah. I mean, it's really interesting that these models are so bad at bug finding when just naively prompted to find a bug. They're incredibly poorly calibrated.
即使是最聪明的模型也这样。确实如此。
Even the smartest models. Exactly.
即使是o1。
Even o1.
这怎么解释?有什么好的直觉理解吗?
How do you explain that? Is there a good intuition?
我认为这些模型很好地反映了预训练数据的分布,随着损失越来越低,它们确实在泛化,但我不认为当前规模的损失足够低,使它们能在代码领域完全泛化。我们使用这些前沿模型最擅长的其实是代码生成和问题解答——这些内容在预训练数据中大量存在,比如GitHub上的代码规模达到数万亿token,还有Stack Overflow和GitHub issues上的问答。但当你尝试处理那些网上几乎不存在的内容时——比如根据已做编辑预测下一个编辑的光标制表目标——模型的脆弱性就显现出来了。
I think these models are a really strong reflection of the pretraining distribution, and I do think they generalize as the loss gets lower and lower, but I don't think the loss is low enough at this scale that they're really fully generalizing on code. The things that we use these frontier models for, that they're quite good at, are really code generation and question answering. And these things exist in massive quantities in pretraining, with all of the code on GitHub on the scale of many, many trillions of tokens, and questions and answers on things like Stack Overflow and maybe GitHub issues. And so when you try to push them on some of these things that really don't exist very much online, like, for example, the Cursor Tab objective of predicting the next edit given the edits done so far, the brittleness kind of shows.
错误检测是另一个很好的例子——实际上检测真实错误并提出修复方案的案例并不多,模型在这方面确实很吃力。但我觉得这是模型迁移的问题。就像预训练模型在代码上的优秀表现能迁移到光标制表目标一样,真正擅长代码的通用模型在错误检测上也会有类似表现,只需要稍微往那个方向引导。
And then bug detection is another great example, where there aren't really that many examples of actually detecting real bugs and then proposing fixes, and the models just kind of really struggle. But I think it's a question of transferring the model. The same way that you get this fantastic transfer from pretrained models, just on code in general, to the Cursor Tab objective, you'll see a very similar thing with generalized models that are really good at code transferring to bug detection. It just takes a little bit of nudging in that direction.
明确地说,我认为它们某种程度上很理解代码。在预训练过程中建立的表征几乎可以肯定——在某个数据流中,模型知道可能有些可疑之处。它某种程度上感知到了可疑性,但真正把这种可疑性提取出来...部分问题在于人类对哪些错误真正重要有精准判断,不仅仅是简单指出'这里有点可疑'。
To be clear, I think they sort of understand code really well. While they're being pretrained, in the representation that's being built up, almost certainly, somewhere in the stream, the model knows that maybe there's something sketchy going on. It sort of has some sense of sketchiness, but actually eliciting the sketchiness... part of it is that humans are really calibrated on which bugs are really important. It's not just saying there's something sketchy.
这就像是,它只是、只是些可疑的琐事。那种可疑程度,比如,你会把服务器搞垮。这就像是,部分原因可能在于文化认知——为什么资深工程师能成为资深工程师?资深工程师之所以优秀,是因为他们知道三年前有人写过一段非常、你知道的、可疑的代码导致服务器崩溃,而不是说,比如,这可能只是个实验。所以,出几个bug也无妨。
It's "this is trivially sketchy" versus "this is sketchy, like, you're gonna take the server down." Part of it is maybe the cultural knowledge: why is the staff engineer a staff engineer? A staff engineer is good because they know that three years ago someone wrote a really, you know, sketchy piece of code that took the server down, as opposed to maybe, you know, this thing is just an experiment, so a few bugs are fine.
就像你只是在尝试实验、感受事物。因此如果模型在你写实验代码时变得特别烦人,那就很糟糕。但如果你在编写超级生产环境的东西,比如写数据库。对吧?你是在为Postgres或Linux这类系统写代码。
You're just trying to experiment and get the feel of the thing. And so if the model gets really annoying when you're writing an experiment, that's really bad. But if you're writing something for super production, you're, like, writing a database, right? You're writing code in Postgres or Linux or whatever.
就像你是Linus Torvalds(Linux创始人)。那种情况下即使出现小问题也几乎不可接受。关键在于校准用户的偏执程度——但即便如此,如果你设置了最高警戒级别,
Like, you're Linus Torvalds. It's sort of unacceptable to have even a small bug in that case. So it's about having the calibration of, like, how paranoid the user is. But even then, if you're putting in maximum paranoia,
它仍然还是,有点,没完全理解到位。
it still just doesn't quite get it.
是啊。确实。
Yeah. Yeah.
对。我是说,但人类也很难理解哪行代码重要、哪行不重要。就像你网站上某个原则说的:如果某段代码可能造成严重破坏,就应该添加注释标明'这行代码很危险'。
Yeah. I mean, it's hard for humans, too, to understand which line of code is important and which is not. I think one of your principles on a website says: if a piece of code can do a lot of damage, one should add a comment that says, "this line of code is dangerous."
还要全大写。
And all caps.
我要重复10遍。
I'll repeat it 10 times.
10遍。不。你说的是,对于函数里的每一行代码都要这样。这其实很有深意——这说明人类的本性,因为工程师会流动,甚至同一个人也可能忘记一个函数就能让泰坦尼克号沉没。
10 times. No. And you say, like, for every single line... Yes ...of code inside the function, you have to. And that's quite profound. It says something about human beings, because the engineers move on, and even the same person might just forget how a single function can sink the Titanic.
就像,你未必能通过看一段代码就直观地清楚意识到这点。
Like, you might not intuit that quite clearly by looking at the single piece of code.
是的。我认为这一点也部分适用于当今的AI模型——如果你在每行代码中都写上‘危险危险危险’,模型会对此更加关注,从而更有可能在该区域发现漏洞。
Yeah. And I think that one is also partially true for today's AI models, where if you actually write "dangerous, dangerous, dangerous" in every single line, the models will pay more attention to that and will be more likely to find bugs in that region.
这实际上就是个绝佳的代码标注实践,能清晰表明这段代码可能造成的危害程度。
That's actually just straight up a really good practice: labeling code by how much damage it can do.
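As a contrived illustration of that labeling practice (the function and table names here are made up), the loud, repeated comment is aimed at both future humans and models skimming the code:

```python
import sqlite3

def purge_all_users(db_path: str) -> None:
    # DANGER DANGER DANGER: this irreversibly deletes every row in `users`.
    # DANGER DANGER DANGER: running this against production takes the product down.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DELETE FROM users")
        conn.commit()
    finally:
        conn.close()
```

The point is not the code itself but the annotation: a plain docstring gets skimmed, while an all-caps repeated warning is hard to miss for a human and, as discussed above, measurably shifts a model's attention.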
没错。虽然这存在争议,有些人觉得这种写法很丑陋,
Yeah. I mean, it's controversial. Like, some people think it's ugly,
比如Sualeh就不喜欢。事实上——这是我从Arvid那里学到的——虽然从审美上我不喜欢这种写法,但确实,这对模型很有用,而人类总是容易遗忘。一个小失误就可能引发服务器崩溃。当然我们会做大量测试,但有些地方必须格外谨慎。
Sualeh doesn't like it. I actually think, in fact, this is one of the things I learned from Arvid: you know, aesthetically I don't like it, but I think there's certainly something where it's useful for the models, and humans just forget a lot. It's really easy to make a small mistake and, like, bring down the server. Of course we test a lot and whatever, but there are always these things that you have to be very careful about.
对。普通文档字符串的问题在于,人们修改代码时常常会快速浏览然后想‘这个我懂’。你必须明确指出来才能避免疏漏。
Yeah. Like, with just normal docstrings, I think people will often just skim them when making a change and think, "oh, I know how to do this." And you kind of really need to point it out to them so that it doesn't slip through.
没错。需要时刻提醒自己可能造成巨大破坏。我们通常不会考虑这个层面,只想着‘怎么理解这段代码来改进它’
Yeah. You have to be reminded that you could do a lot of damage. That's like we don't really think about that. Like Yeah. You think about, okay, how do I figure out how this works so I can improve it?
而忽略了它可能产生的反向影响。
You don't think about the other direction that it could do this.
除非我们实现全面形式化验证。那样你就能为所欲为——只要验证通过就能确定没有引入漏洞。但具体来说,你认为
Until we have formal verification for everything. Then you can do whatever you want, and you know for certain that you have not introduced a bug if the proof passes. But concretely, what do you think
未来会是什么样子?
that future would look like?
我认为人们将不再编写测试。当你写函数时,模型会建议规范说明,你审核规范的同时,智能推理模型会计算证明实现符合规范。这将适用于大多数函数。
I think people will just not write tests anymore. The model will suggest... you write a function, the model will suggest a spec, and you review the spec. And in the meantime, a smart reasoning model computes a proof that the implementation follows the spec. And I think that happens for most functions.
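As a toy illustration of that spec-and-proof workflow, here is what such a pair could look like in Lean 4. The function and theorem are invented for illustration; real tooling would be far more elaborate.

```lean
-- Implementation you might write (or the model might generate).
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Spec the model might propose, which you review: the result is an
-- upper bound of both inputs. A reasoning model then finds the proof.
theorem myMax_spec (a b : Nat) : a ≤ myMax a b ∧ b ≤ myMax a b := by
  unfold myMax
  split <;> omega
```

In this workflow the human reviews only the statement of `myMax_spec`; the proof is machine-checked, so no hand-written tests are needed.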
这触及了你之前谈到的关于为软件指定意图的困难,有时可能是因为意图本身难以明确界定,进而也很难证明软件实际行为是否符合你的意图。
I think this gets at a little bit of the stuff you were talking about earlier, with the difficulty of specifying intent for what you want with software. Where sometimes, because the intent is really hard to specify, it's also then going to be really hard to prove that the software is actually matching whatever your intent is.
你是觉得生成这样的规范很困难吗?
Like, you think that spec is hard to generate?
是的。或者说,对于给定的规范,我认为存在一个问题:你能否真正进行形式化验证?这可行吗?我觉得这里面还有更多值得探讨的地方。但另外——
Yeah. Or just, for a given spec... I think there is a question of, can you actually do the formal verification? Like, is that possible? I think there's more to dig into there. But then also
即使你有了规范?
Even if you have the spec?
如果你有了规范,又该如何...
If you have the spec How do you even
拥有规范?这个规范是用自然语言写的吗?
have the spec? Is the spec written in natural language?
对。它是怎么运作的
Yeah. How does it work
规范会是形式化的。
The spec would be formal.
但那样做会有多容易?我认为你关心的那些东西很难在规范语言中被清晰定义。
But how easy would that be then I think that you care about things that are not going to be easily well specified in the spec language.
明白了。确实如此。
I see. I see. Yeah.
或许反对形式化验证的论点正是你所需要的全部理由。
Maybe an argument against formal verification is all you need.
是啊。令人担忧的是存在这种情况
Yeah. The worry is there's this
大规模替代 取代像单元测试这样的东西。确实如此。
massive... Replacing something like unit tests. Sure.
对,对。我认为或许还可以演进规范语言,以囊括目前它们尚未真正涵盖的一些内容。
Yeah. Yeah. I think you can probably also evolve the spec languages to capture some of the things that they don't really capture right now.
嗯。
Mhmm.
但我不确定。我觉得这非常令人振奋。
But I don't know. I think it's very exciting.
而且你谈论的不只是单个函数,而是整个代码库。
And you're speaking not just about like single functions. You're speaking about entire code bases.
我认为整个代码库更具挑战性,但那正是我梦寐以求的目标。并且我认为这是可能实现的。因为甚至已经有大量最新研究可以做到从硬件层面进行形式化验证——比如先形式化验证C代码,再通过GCC编译器验证,接着通过Verilog验证到硬件层面。这涉及极其庞大的系统,但确实可行。
I think entire code bases is harder, but that is what I would love to have, and I think it should be possible. Because there's a lot of work recently where you can formally verify down to the hardware: you formally verify the C code, and then you formally verify through the GCC compiler, and then through the Verilog down to the hardware. That's an incredibly big system, but it actually works.
我认为大型代码库在某种程度上是相似的,它们都是多层系统。如果能将其分解并逐部分形式化验证,我认为应该是可行的。不过规范制定确实是个实际问题。
And I think big code bases are sort of similar, in that they're multilayered systems. And if you can decompose it and formally verify each part, then I think it should be possible. I think the specification problem is a real problem.
但你们如何处理副作用?或者说,像Stripe API这样的外部依赖要怎么处理?
But how do you handle side effects? Or how do you handle, I guess, external dependencies, like the Stripe API?
也许Stripe会为其API编写规范
Maybe Stripe would write a spec for
关于API。所有事情。比如,你能为你使用的所有东西都这么做吗?比如,如果是语言模型,你怎么处理?也许人们会把语言模型当作他们编写的
the API. For everything. Like, can you do this for everything you use? Like, how do you do it if there's a language model? Maybe people will use language models as primitives in the
程序中的基本构件。
programs they write.
嗯。而且存在依赖关系。那么现在如何将其纳入考虑?
Mhmm. And there's a dependence on it. And, like, how do you now include that?
我认为你或许仍能证明这一点。
I think you might be able to prove that still.
关于语言模型要证明什么?
Prove what about language models?
我觉得有可能证明语言模型是对齐的,例如。或者证明它确实给出了正确答案。
I think it feels possible that you could actually prove that a language model is aligned, for example. Or you can prove that it actually gives the right answer.
那是梦想。
That's the dream.
是的。我是说,确实。如果可能的话,那就是你的'我有一个梦想'演讲。如果可能,那肯定有助于确保代码没有漏洞,确保AI不会毁灭人类文明。从AI安全到漏洞发现的整个范围。
Yeah. I mean, if it's possible, that's your "I have a dream" speech. If it's possible, that will certainly help with, you know, making sure your code doesn't have bugs and making sure AI doesn't destroy all of human civilization. So, the full spectrum from AI safety to just bug finding.
你说模型在漏洞发现方面有困难。那希望在哪里?
So you said the models struggle with bug finding. What's the hope?
你知道,我最初的期望是——当然也可以让Michael补充——应该是这样的:它首先能帮忙解决那些愚蠢的错误。比如,它应该能迅速捕捉到那些低级错误,像是差一错误(off by one),或者有时候你在注释里写一套,代码里却写另一套,这种情况很常见。
You know, my hope initially, and I can let Michael chime in too, was like this: it should, you know, first help with the stupid bugs. It should very quickly catch the stupid bugs, like off-by-one errors. Like, sometimes you write something in a comment and do it the other way. It's very common.
比如我自己就常这样:在注释里写“小于”,代码里却写成“大于”之类的。然后模型就会提醒说:‘这看起来不太对劲,你确定要这么做吗?’
Like, I do this. I write, like, "less than" in a comment and maybe write the greater-than sign or something like that. And the model is like, "yeah, this looks sketchy. Are you sure you wanna do that?"
但最终,它应该也能发现更复杂的错误。
But eventually, it should be able to catch harder bugs too.
没错。而且我认为值得注意的是,拥有优秀的错误检测模型对于实现AI在编程中承担更多工作至关重要。当AI为你构建越来越多的系统时,你不仅需要生成代码,还需要验证代码。缺乏这一点,我们之前讨论过的使用这些模型编程时的问题就会变得难以解决。所以这不仅是为了人类——比如你写了bug,我写了bug,AI能帮忙找出来——更重要的是它能验证AI自己生成的代码,这一点非常关键。
Yeah. And I think it's also important to note that having good bug-finding models feels necessary to get to the highest reaches of having AI do more and more programming for you. If AI is building more and more of the system for you, you need to not just generate, but also verify. And without that, some of the problems that we've talked about before with programming with these models will just become untenable. So it's not just for humans, like, you write a bug, I write a bug, find the bug for me. It's also being able to verify the AI's code and check it. That's really important.
对。那么具体如何实现呢?我们晚餐时经常争论如何训练一个错误检测模型。一个流行的思路是:植入错误比发现错误更容易,所以可以先训练模型在现有代码中植入错误,再用这些合成数据训练一个反向模型来发现错误。这是个可行方案。
Yeah. And then how do you actually do this? We've had a lot of contentious dinner discussions about how you actually train a bug model. But one very popular idea is: it's potentially much easier to introduce a bug than to actually find the bug. And so you can train a model to introduce bugs in existing code, and then you can train a reverse bug model that can find bugs using this synthetic data. So that's one example.
当然,关于具体实现方法还有很多其他思路。
But, yeah, there are lots of ideas for how to
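A toy version of that synthetic-data idea: mutate correct code to manufacture (buggy, fixed) training pairs. Real pipelines would mutate ASTs with much richer edits; flipping a comparison operator is just the simplest possible illustrative mutation, and all names here are invented.

```python
import random
import re

# Map each comparison operator to its flipped counterpart.
# Note: "<=" and ">=" must come before "<" and ">" in the regex so the
# two-character operators are matched whole.
FLIPS = {"<=": ">=", ">=": "<=", "<": ">", ">": "<"}

def inject_bug(source: str, rng: random.Random):
    """Return (buggy_source, description), or None if nothing to mutate."""
    ops = [(m.start(), m.group()) for m in re.finditer(r"<=|>=|<|>", source)]
    if not ops:
        return None
    pos, op = rng.choice(ops)
    buggy = source[:pos] + FLIPS[op] + source[pos + len(op):]
    return buggy, f"flipped '{op}' to '{FLIPS[op]}' at offset {pos}"

correct = "def clamp(x, lo, hi):\n    return lo if x < lo else (hi if x > hi else x)\n"
buggy, label = inject_bug(correct, random.Random(0))
# (buggy, correct, label) is one supervised example for a reverse bug-finding model.
assert buggy != correct
```

Trained at scale, the reverse model sees the buggy version as input and learns to localize and describe the injected defect, the direction the conversation calls "train a reverse bug model on synthetic data."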
你还可以在模型层面之外做很多工作,比如给最强模型提供代码之外的丰富信息。光盯着文件找bug是个难题——对人类来说也是如此,对吧?通常你需要运行代码,查看跟踪信息,逐步调试。
You can also do a bunch of work, not even at the model level, of taking the biggest models and then maybe giving them access to a lot of information that's not just the code. It's kind of a hard problem to stare at a file and be like, "where's the bug?" And that's hard for humans often, right? So often you have to run the code, and being able to see things like traces and step through a debugger helps.
这导向了另一个方向。可能这里存在两种产品形态:一种是专门化的快速模型,在后台运行并试图发现错误;另一种就像Arvid之前提到的恶意输入框漏洞的例子,有时你需要针对特定问题投入海量算力——比如愿意花50美元甚至更多来解决某个具体bug。
There's a whole other direction where it kind of tends toward that. And it could also be that there are two different product form factors here. It could be that you have a really specialty model that's quite fast, that's kind of running in the background and trying to spot bugs. And it might be that sometimes, to Arvid's earlier example about, you know, some nefarious input-box bug, you know there's a bug. You're not just checking hypothesis-free.
这种情况下你不是在无假设检查,而是明确要解决某个具体问题,并愿意为此投入巨额计算资源。
You're, like, "this is a problem, I really wanna solve it," and you zap that with tons and tons and tons of compute. You're willing to put in, like, $50 to solve that bug, or something even more.
你们考虑过将金钱机制整合进来吗?如果你们能找到bug或生成让我惊艳的代码,我很可能愿意支付大笔费用。前几天我用Cursor时,它完美生成了三个与YouTube API交互的函数来处理多语言字幕更新——API文档很差,网上信息也很混乱,我搜索很久都没找到正确答案,但Cursor生成的代码完美无缺。
Have you thought about integrating money into this whole thing? Like, I would probably pay a large amount of money if you found a bug, or even generated code that I really appreciated. I had a moment a few days ago, when I started using Cursor, where it generated perfect, like, perfect three functions for interacting with the YouTube API to update captions, and for localization in different languages. The API documentation is not very good, and the code across... like, I googled it for a while and couldn't find exactly what I needed; there's a lot of confusing information. And Cursor generated it perfectly.
我当时就往后一靠,仔细读了代码。心想,这没问题啊。我测试过了,确实没问题。
And I just sat back. I read the code. I was like, this is correct. I tested it. It's correct.
我当时就想,我想要个小费
I was like, I want a tip
在一个按钮上,就是那种
on a button that goes
是啊。
Yeah.
这里有个5美元按钮。一个纯粹是为了支持公司和界面设计的,另一个则是传递强烈信号——比如'干得漂亮'。明白吗?这比单纯接受代码发出的信号强烈多了,对吧?
Here's $5. One that's really good, just to support the company and support what the interface is. And the other is that it probably sends a strong signal, like, "good job." Right? It's a much stronger signal than just accepting the code.
你实际上是在传递强烈的肯定。至于漏洞发现,显然很多人愿意为漏洞悬赏支付大笔钱,对吧?你们团队考虑过这个吗?
You actually send, like, a strong "good job." And for bug finding, obviously there's a lot of people, you know, that would pay a huge amount of money for a bug, like a bug bounty thing. Right? Do you guys think about that?
是啊,这个想法在公司内部有争议。我觉得某种程度上取决于你对人性的信任度。比如,如果尝试找漏洞不花钱,没找到漏洞你就一分钱也不用花,那会很酷。
Yeah. It's a controversial idea inside the company. I think it sort of depends on how much you believe in humanity, almost. You know? Like, I think it would be really cool if you spend nothing to try to find a bug, and if it doesn't find a bug, you spend $0.
如果它找到了漏洞,你点击接受,旁边会用括号显示(比如1美元),于是你花1美元接受这个漏洞修复。当然也有顾虑——我们耗费了大量计算资源,也许有人会直接复制粘贴。
And then if it does find a bug and you click accept, then it also shows, like, in parentheses, like, $1. And so you spend $1 to accept the bug. And then, of course, there's a worry, like, okay. We spent a lot of computation. Like, maybe people will just copy paste.
这确实是个顾虑。另外还有个担忧是,在产品里引入金钱概念会让它变得...你知道,没那么好玩了。你得考虑钱的事,而大家只想专注代码。或许更合理的做法是单独设置,比如每月付笔费用,然后这些功能就能免费使用。
I think that's a worry. And then there's also the worry that introducing money into the product makes it, kind of, you know, like it doesn't feel as fun anymore. You have to think about money, and all you want to think about is the code. So maybe it actually makes more sense to separate it out: you pay some fee every month, and then you get all of these things for free.
但可以保留打赏组件,这个又不需要成本
But there could be a tipping component, which is not like it costs...
它仍然带着那个美元符号。我觉得没问题,但我也理解有人可能不想引入它的观点。
It still has that, like, dollar symbol. I think it's fine, but I also see the point where, like, maybe you don't want to introduce it.
是的。我正想说,人们通常是在分享时才这样做。当他们有个绝佳案例时,就会想和朋友分享。
Yeah. I was gonna say the moment that feels like people do this is when they share it. When they have a fantastic example, they just kind of share it with their friends.
从技术层面也存在解决方案,比如针对这个诚信机制问题——如果我们能更深入理解系统输出,就像之前讨论的通过LSP进行错误检查以及代码运行。如果能验证"我已修复漏洞",那么悬赏系统或许就不必完全依赖诚信机制了。
There is also a potential world where there's a technical solution to this, like, honor-system problem too. If we can get to a place where we understand the output of the system more, I mean, the stuff we were talking about with, you know, error checking with the LSP, and then also running the code. But if you could get to a place where you could actually somehow verify, "oh, I have fixed the bug," maybe then the bounty system doesn't need to rely on the honor system too.
终端与代码之间有多少交互?运行终端代码能获取多少信息?能否建立循环机制:当代码运行时出现错误,系统会建议如何修改代码?因为目前它们完全是割裂的——虽然我知道可以用Control+K在终端辅助编写代码。
How much interaction is there between the terminal and the code? Like, how much information is gained from running the code in the terminal? Can you do, like, a loop, where it runs the code and suggests how to change the code if the code at runtime gives an error? Because right now they're separate worlds completely. Like, I know you can do Ctrl+K inside the terminal to help you write the code.
你可以在Command+K中使用终端上下文,基本上所有地方都可以。不过循环功能尚未实现,虽然我们认为这类功能会很有意义。问题在于它是前台运行还是像我们讨论的那样后台运行。
You can use terminal context as well, inside of Command+K, kind of everything. We don't have the looping part yet, though we suspect something like this could make a lot of sense. There's a question of whether it happens in the foreground too, or if it happens in the background, like what we've been discussing.
确实。后台运行很酷——可以多种方式执行代码。另外还有数据库层面的问题:如何防止数据库被篡改?不过好吧。
Sure. The background is pretty cool. Like, you can run the code in different ways. Plus, there's a database side to this: how do you protect the database from being modified? But okay.
这方面确实有很酷的解决方案。有个正在开发的新API——不是AWS的,应该是PlanetScale首创的——它实现了数据库分支功能。比如开发新功能时需要测试数据库,但又不愿影响主库时,可以创建分支。原理是通过预写日志(WAL)实现分支。
I mean, there are certainly cool solutions there. There's this new API that is being developed for it. It's not in AWS, but, you know, I think it's in PlanetScale. I don't know if PlanetScale was the first one to add it. It's the ability to sort of add branches to a database. Which is: if you're working on a feature and you wanna test against the prod database, but you don't actually want to test against the prod database, you could sort of add a branch to the database. And the way to do that is to add a branch to the write-ahead log.
当然正确实现需要解决大量技术难题。数据库厂商现在需要新卖点——现有产品已经足够完善。我们使用的Turbopuffer数据库可能也会在预写日志中引入分支功能,这样AI代理或许就能利用分支功能了。
And there's obviously a lot of technical complexity in doing it correctly. I guess database companies need new things to do. They have good databases now. And I think, like, you know, Turbopuffer, which is one of the databases we use, is hopefully going to add maybe branching to the write-ahead log. And so maybe the AI agents will use branching.
它们可以在某个分支上测试,这可能会成为数据库支持分支功能的硬性要求。
They'll, like, test against some branch, and it's sort of gonna be a requirement for the database to, like, support branching or something.
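A toy model of what branching via the write-ahead log means: a branch starts from a snapshot of the parent's log and appends only to its own copy, so an agent can test writes without touching the main database. Class and method names here are invented, and real systems share the log prefix cleverly instead of copying it as this naive sketch does.

```python
class WalDatabase:
    """Toy database whose only state is a write-ahead log of (key, value) writes."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []

    def put(self, key, value):
        # Every write is just an appended log entry.
        self.wal.append((key, value))

    def get(self, key):
        # Replay the log; the last write for a key wins.
        result = None
        for k, v in self.wal:
            if k == key:
                result = v
        return result

    def branch(self):
        # Naive copy-on-write: the branch snapshots the parent's log and
        # appends only to its own copy from then on.
        return WalDatabase(wal=list(self.wal))

main = WalDatabase()
main.put("users:1", "alice")

test_branch = main.branch()          # an agent experiments here
test_branch.put("users:1", "mallory")

assert main.get("users:1") == "alice"          # main database untouched
assert test_branch.get("users:1") == "mallory" # branch sees its own write
```

This is the property the conversation is after: the agent's writes are isolated on the branch, and the branch can simply be discarded after the test.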
如果你们能...
It'd be really interesting if you
可以对文件系统进行分支处理,对吧?是的。
can branch a file system. Right? Yeah.
我觉得所有东西都需要分支处理。就像是...没错。
I feel like everything needs branching. It's like... Yeah.
这就像多元宇宙的问题所在。对吧?如果你对所有东西都进行分支,那数量就太庞大了。
It's like that's the problem with the multiverse. Right? Like, if you branch on everything, that's like a lot.
我的意思是,显然存在这些超级聪明的算法来确保你不会真的占用大量存储空间或CPU资源之类的。
I mean, there's obviously these super clever algorithms to make sure that you don't actually, you know, use a lot of space or CPU or whatever.
好的。这是个不错的切入点
Okay. This is a good place
来聊聊基础设施。你们主要使用AWS,有哪些有趣的细节?遇到过哪些有趣的挑战?为什么选择AWS?
to ask about infrastructure. So you guys mostly use AWS. What are some interesting details? What are some interesting challenges? Why'd you choose AWS?
为什么AWS仍在领跑市场?话题标签。
Why is AWS still winning? Hashtag.
AWS就是非常非常优秀。真的很好用。每次使用AWS产品时,你都知道它肯定能正常运行。虽然配置过程可能像地狱般折磨人。
AWS is just really, really good. It's really good. Like, whenever you use an AWS product, you just know that it's going to work. Like, it might be absolute hell to go through the steps to set it up.
是啊。为什么界面设计这么反人类?
Yeah. Why is the interface so horrible?
因为它实在太好用了,根本不需要
Because it's just so good. It doesn't need to.
关于胜利的本质?
The nature of winning?
我觉得正是如此。这就是赢家的本质。是的。
I think that's exactly it. It's just the nature of winning. Yeah.
是的。但AWS,你永远可以信任,就像,它总能正常工作。如果有问题,那很可能是你的问题。是的。
Yeah. But AWS, you can always trust: it will always work. And if there is a problem, it's probably your problem. Yeah.
好的。你们作为一家相当新的初创公司,在扩展到这么多用户的过程中,有没有遇到一些有趣的挑战?
Okay. Are there some interesting challenges for you guys, a pretty new startup, in scaling to, like, so many people?
是的。我认为每增加一个请求每秒的零,都是一段有趣的旅程。你会遇到所有这些问题,比如,你用于缓存和数据库的通用组件随着规模扩大会出现各种问题,现在我们处于这样的规模,甚至会在表格中出现整数溢出之类的情况。此外,我们构建的一些定制系统,比如用于计算代码库语义索引并回答代码库问题的检索系统,我觉得一直是扩展过程中比较棘手的部分。
Yeah. I think it has been an interesting journey, adding, you know, each extra zero to the requests per second. You run into all of these issues where the general components you're using for caching and databases break as you make things bigger and bigger, and now we're at the scale where we get, like, you know, int overflows on our tables and things like that. And then also, there have been some custom systems that we've built, like, for instance, our retrieval system for computing a semantic index of your code base and answering questions about a code base, that have continually, I feel like, been one of the trickier things to scale.
我有几个非常资深工程师的朋友,他们常说的一句话是,在扩展系统时很难预测哪里会出问题。你可以试着提前预测,但当你增加一个零时,总会发生一些奇怪的事情。你以为考虑了一切,但实际上并没有。但对于那个特定系统,我们具体做的是:显然我们会将你的代码分块上传,发送代码进行嵌入处理,然后将嵌入存储在数据库中,但我们实际上并不存储任何代码。
I have a few friends who are super senior engineers, and one of their lines is: it's very hard to predict where systems will break when you scale them. You can try to predict in advance, but there's always something weird that's gonna happen when you add this extra zero. You thought you thought through everything, but you didn't actually think through everything. But for that particular system, for concrete details: we chunk up all of your code, and then we send up the code for embedding, and we embed the code. And then we store the embeddings in a database, but we don't actually store any of the code.
这样做的原因是为了确保不会引入客户端bug,我们对此非常非常谨慎。我们在服务器上存储了大量细节,所有内容都是加密的。所以一个技术挑战是始终确保本地索引、本地代码库状态与服务器上的状态一致。技术上我们最终实现的方式是:对每个文件保留一个哈希值,对每个文件夹保留其所有子项的哈希,可以递归地这样做直到顶层。
And then there are reasons around making sure that we don't introduce client bugs, because we're very, very paranoid about client bugs. We store much of the details on the server; everything is sort of encrypted. So one of the technical challenges is always making sure that the local index, the local code-base state, is the same as the state that is on the server. And the way we technically ended up doing that is: for every single file, you can keep a hash. And then for every folder, you can keep a hash, which is the hash of all of its children, and you can recursively do that until the top.
为什么要做这么复杂的事?一个简单方法是你可以为每个文件保留哈希,每分钟尝试下载服务器上的哈希,找出服务器上不存在的文件——可能是你刚创建的新文件、删除的文件或切换了新分支——然后尝试协调客户端和服务器的状态。但这会带来巨大的网络开销,不仅对客户端(没人希望我们一直占用他们的WiFi),
Why do something complicated? One thing you could do is keep a hash for every file, and every minute try to download the hashes that are on the server, figure out which files don't exist on the server. Maybe you just created a new file, maybe you just deleted a file, maybe you checked out a new branch, and you try to reconcile the state between the client and the server. But that introduces absolutely ginormous network overhead, both on the client side, I mean, nobody really wants us to hammer their WiFi all the time when they're using Cursor,
而且对数据库也会造成巨大负担。相当于每秒都要读取这个接近20TB的庞大数据集,这简直太疯狂了。你绝对不想这么做。所以我们只尝试协调项目根目录的单个哈希值。
but also it would introduce ginormous overhead on the database. It would be reading this tens-of-terabytes database, sort of approaching, like, 20 terabytes or something, every second. That's just kind of crazy. You definitely don't wanna do that. So what you do instead is just try to reconcile the single hash which is at the root of the project.
如果出现不匹配,就找出所有不一致的地方。可能查看子项哈希是否匹配,如果不匹配就继续查看它们的子项,以此类推。我们只在出现不匹配时才这样做。对大多数人大多数时候来说,哈希值都是匹配的。
And then if something mismatches, you go and find where the things disagree. You look at the children and see if the hashes match; if the hashes don't match, you go look at their children, and so on. But we only do that in the scenario where things don't match. And for most people, most of the time, the hashes match.
所以这有点像一种分层协调机制
So it's a kinda like hierarchical reconciliation
对,差不多是这个意思。
Yeah. Some something like that.
这叫默克尔树。
It's called the Merkle tree.
没错。
Yeah.
是的。
Yeah.
默克尔树。对。我是说...确实。能看透这些问题并思考解决方案很酷。
Merkle tree. Yeah. I mean, yeah, it's cool to see that you kinda have to think through all these problems.
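The hierarchical reconciliation just described can be sketched in a few lines. This is a toy model: real systems hash file bytes and handle deletions, renames, and so on, and all function names here are invented.

```python
import hashlib

def file_hash(contents: str) -> str:
    return hashlib.sha256(contents.encode()).hexdigest()

def tree_hash(tree: dict) -> str:
    # `tree` maps names to either file contents (str) or subtrees (dict).
    # A folder's hash is the hash of all its children's hashes.
    parts = []
    for name in sorted(tree):
        node = tree[name]
        child = tree_hash(node) if isinstance(node, dict) else file_hash(node)
        parts.append(name + ":" + child)
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def diff_paths(local: dict, remote: dict, prefix: str = "") -> list:
    """Return paths where local and remote disagree, descending only into
    subtrees whose hashes mismatch."""
    if tree_hash(local) == tree_hash(remote):
        return []   # the common case: one root comparison and we're done
    out = []
    for name in sorted(set(local) | set(remote)):
        a, b = local.get(name), remote.get(name)
        path = prefix + "/" + name
        if isinstance(a, dict) and isinstance(b, dict):
            out += diff_paths(a, b, path)
        elif a != b:
            out.append(path)
    return out

local = {"src": {"a.py": "print(1)", "b.py": "print(2)"}, "README": "hi"}
remote = {"src": {"a.py": "print(1)", "b.py": "print(3)"}, "README": "hi"}
assert diff_paths(local, remote) == ["/src/b.py"]
```

When the two trees agree, the client and server exchange one root hash instead of one hash per file, which is exactly the network-overhead argument made above.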
关键在于,之所以变得复杂,主要是因为用户基数增长。有些客户的代码库极其庞大——我们当初重构自己的代码库时已经觉得很大了,但比起那些存在二十年的公司,他们的文件数量简直是天文数字。要让这套系统适配程序员规模...构建简单系统很容易,但要扩展到众多企业规模显然是个独立存在的难题。所以我们当前方案的扩展既需要新思路——当然我们正在攻关——还要在最近几周几个月内实现全面扩展。
And I mean, the reason it's gotten hard is just because of, like, the number of people using it, and, you know, some of your customers have really, really large code bases, to the point where we, you know, indexed our own code base, which is big, but it's just not the size of some company that's been there for twenty years with a ginormous number of files, and you sort of want to scale that across programmers. There are all these details where building a simple thing is easy, but scaling it to a lot of people, a lot of companies, is obviously a difficult problem. So part of it is scaling our current solution, and also coming up with new ideas, which obviously we're working on, but then scaling all of that in the last few weeks and months. Yeah.
这个索引系统还包含许多精妙设计。比如成本瓶颈不在于向量数据库的存储,而在于代码嵌入过程。你肯定不想为公司的每个成员重复嵌入相同代码库——即便他们只是分支不同或有些本地修改。既然嵌入是瓶颈,就有个取巧办法:通过代码块哈希值缓存向量计算结果,完全规避处理分支带来的数据库复杂度问题。
And there are a lot of clever things, like additional things that go into this indexing system. For example, the bottleneck in terms of cost is not storing things in the vector database. It's actually embedding the code. And you don't wanna re-embed the code base for every single person in a company that is using the same exact code, except maybe they're on a different branch with a few different files, or they've made a few local changes. And so because embeddings are the bottleneck, one clever trick you can do, without having to worry about the complexity of dealing with branches and the other databases, is to just have a cache on the actual vectors, computed from the hash of a given chunk.
嗯。
Mhmm.
这意味着当公司第N个成员嵌入代码库时,过程会极其迅速。而且全程我们的服务器不存储任何代码数据,只保留向量数据库和向量缓存中的向量。
And so this means that when the nth person at a company goes and embeds their code base, it's really, really fast. And you do all this without actually storing any code on our servers at all. No code data is stored. We just store the vectors in the vector database and the vector cache.
目前从代码库索引中你们获得的最大收益是什么?纯粹出于好奇,用户能从中得到什么好处?长远来看似乎会有越来越多的益处,但短期内仅对代码库提问,这有什么实际用途呢?
What are the biggest gains you get from indexing the code base at this time? Just out of curiosity, what benefit do users have? It seems like longer term there'll be more and more benefit, but in the short term, just asking questions of the code base, what's the usefulness of that?
我认为最显而易见的好处是,当你想在庞大的代码库中定位某个功能实现时——比如模糊记得'我们要找实现X功能的地方',但又不确定具体该搜索什么关键词。这时通过聊天界面输入指令,往往能准确找到你脑海中的那个代码位置。
I think the most obvious one is just that you want to find out where something is happening in your large code base, and you have a fuzzy memory of, okay, I want to find the place where we do X, but you don't exactly know what to search for in a normal text search. So you hit command-enter to ask with the codebase chat, and then very often it finds the right place that you were thinking of.
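Once chunks are embedded, "find the place where we do X" reduces to nearest-neighbor search over those vectors. A toy version, assuming you already have a query vector and chunk vectors — a real system would use an approximate nearest-neighbor index rather than the linear scan shown here:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec: list[float], index, k: int = 1) -> list[str]:
    """index: list of (chunk_name, vector) pairs.
    Returns the top-k chunk names by cosine similarity to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The fuzzy-memory case works because the query and the code chunk only need to be *semantically* close in embedding space; no exact keyword overlap is required, unlike normal text search.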
就像你提到的,我认为未来这项技术会越来越强大。我们正在大力提升检索质量,其潜力上限远比人们想象的要高得多。
I think, like you mentioned, in the future this is only going to get more and more powerful, where we're working a lot on improving the quality of our retrieval. And I think the ceiling for that is really, really much higher than people give it credit for.
这里有个关键问题:你们是否考虑过本地化方案?目前讨论的所有功能——云端处理、缓存机制、大型团队共享代码库的协调难题——实现起来都异常复杂。大多数软件都将这类重型计算放在本地,你们有考虑过本地化嵌入方案吗?
One question that's good to ask here: have you considered, and why haven't you done much of, sort of, local stuff? I mean, everything we just discussed is exceptionally difficult to do. To go to the cloud, you have to think about all these things with the caching and the large code base with a large number of programmers using the same code base. You have to figure out that puzzle, when, you know, most software just does this heavy computational stuff locally. Have you considered doing embeddings locally?
是的,我们考虑过本地方案,理论上很酷但实施难度很大。需要指出的是,虽然部分用户使用最新款MacBook Pro,但超过80%用户使用Windows设备,其中许多性能有限。本地模型只能在最新硬件上运行。
Yeah, we thought about it, and I think it would be cool to do it locally. I think it's just really hard. And one thing to keep in mind is that, you know, some of our users use the latest MacBook Pro, but more than 80% of our users are on Windows machines, many of which are not very powerful. And so local models really only work on the latest computers.
此外构建本地方案需要巨大投入。即便我们有意向,目前也无法聚焦于此。确实有些团队在尝试这个方向,这很棒。但随着模型规模增长,想要实现更复杂功能时,本地化会变得愈发困难。
And it's also a big overhead to build that in. And so even if we would like to do that, it's currently not something that we are able to focus on. There are some people that do that, and I think that's great. But especially as models get bigger and bigger and you want to do fancier things with bigger models, it becomes even harder to do it locally.
这不仅是设备性能问题。比如大企业的代码库,即便用顶配MacBook Pro处理也相当吃力。
Yeah. And it's not just a problem of weaker computers. For example, if you're some big company, you have a big-company code base, and it's just really hard to process a big-company code base even on the beefiest MacBook Pros. Yeah.
即便你不是学生,假设是某大公司的顶尖程序员,纯本地化操作体验也会非常糟糕。勉强可行,但绝对谈不上愉快。
So it's not even a matter of, like, whether you're just a student or something. I think if you're the best programmer at a big company, you're still gonna have a horrible experience if you do everything locally. I mean, you could do it and sort of scrape by, but, like, again, it wouldn't be fun anymore.
没错。近似最近邻搜索这类操作会疯狂吞噬内存和CPU资源。
Yeah. Like, doing approximate nearest neighbors with this massive code base is gonna just eat up your memory and your CPU.
最佳方案还是将其卸载到云端处理。
It's the best to offload that.
而且那就是...就是那样。我们也谈谈建模方面吧。正如Arvid所说,本地模型面临巨大阻力:其一,技术趋势正转向混合专家模型(MoE),其一个好处可能是更受内存带宽限制,这相对于使用GPU(或NVIDIA显卡)而言对本地运行有利。但缺点是这些模型总体更庞大,往往连单个节点都装不下,需要多节点部署,根本不可能塞进哪怕最顶级的MacBook。而且特别是对编程来说,问题不在于模型是否跨过"够用"的门槛然后我们就满足了——那可能是其他问题领域、也许正是本地模型的用武之地。
And that's just that. Like, let's also talk about the modeling side, where, as Arvid said, there are these massive headwinds against local models. One: things seem to be moving towards MoEs (mixtures of experts), where one benefit is maybe they're more memory-bandwidth-bound, which plays in favor of local versus using GPUs, or using NVIDIA GPUs. But the downside is these models are just bigger in total, and, you know, they're often not even gonna fit on a single node, but multiple nodes. There's no way that's gonna fit inside of even a really good MacBook. And I think especially for coding, it's not a question of whether it clears some bar of, like, the model's good enough to do these things and then we're satisfied — which may be the case for other problems, and maybe where local models shine.
但人们永远会追求最强大、最智能、最全能的东西,而这对于绝大多数用户来说几乎不可能在本地实现。
But people are always gonna want the best, the most intelligent, the most capable things, and that's going to be really, really hard to run for almost all people locally.
难道你不想要最强大的模型吗?比如你想要Sonnet?
Don't you want the most capable model? Like, you want Sonnet?
而且还有o1——
And also with o1 --
我喜欢你这推销方式。对,比如o1——
I like how you're pitching me. Yeah. Like, o1 --
你愿意接受次优模型吗?
Would you be satisfied with an inferior model?
听着,我...是的。我就是这类人,但确实存在偏好本地运行的群体,特别是...
Listen, I... yeah, I'm one of those, but there are some people that like to do stuff locally, especially, like...
没错。
Yeah.
显然整个开源运动都在抵制中心化。他们的存在很有价值——毕竟我们需要制衡日益壮大的权力中心
There's a whole, obviously, open source movement that kind of resists centralization. And it's good that they exist, actually, because you wanna resist the power centers that are growing.
其实有个我特别看好的本地模型替代方案,虽然还处于研究阶段:通过同态加密实现语言模型推理。你在本地加密输入数据后上传,服务器能在不解密的情况下用你本地跑不动的大模型处理数据,最后将加密结果传回给你解密。这样既获得云端算力,又保障了数据隐私。
There's actually an alternative to local models that I am particularly fond of. I think it's still very much in the research stage, but you could imagine doing homomorphic encryption for language model inference. So you encrypt your input on your local machine, then you send that up, and then the server can use lots of computation. They can run models that you cannot run locally on this encrypted data, but they cannot see what the data is. And then they send back the answer, you decrypt it, and only you can see the answer.
因此我认为这仍处于研究阶段,所有努力都旨在降低开销,因为目前的开销确实很大。但如果能实现这一点,我认为将非常非常酷,也会产生极其深远的影响。因为令人担忧的是,随着这些模型越来越强大,它们的经济效用会越来越高。于是全球越来越多的信息和数据将流经一两个中心化主体。这就引发了诸如传统黑客攻击的担忧,更可怕的是,如果全球信息都以明文形式通过单一节点流动,就可能出现极其恶劣的监控行为。
So I think that's still very much research, and all of it is about trying to make the overhead lower, because right now the overhead is really big. But if you can make that happen, I think it would be really, really cool, and really, really impactful. Because one thing that's actually kind of worrisome is that as these models get better and better, they're going to become more and more economically useful. And so more and more of the world's information and data will flow through, you know, one or two centralized actors. And then there are worries about, you know, traditional hacker attempts, but it also creates this kind of scary situation where, if all of the world's information is flowing through one node in plain text, you can have surveillance in very bad ways.
有时这种情况最初会出于善意——比如人们希望防止不良行为者滥用AI模型。于是你会加入监控代码,接着其他人介入,就像滑坡效应那样,最终开始用全球数据做坏事。所以我非常期待我们能解决语言模型的同态加密问题。
And sometimes that will happen for, you know, initially, like, good reasons. Like, people will want to try to protect against bad actors using AI models in bad ways. And then you will add in some surveillance code, and then someone else will come in and, you know, you're on a slippery slope, and then you start doing bad things with a lot of the world's data. And so I'm very hopeful that we can solve homomorphic encryption for language modeling.
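Fully homomorphic encryption for LLM inference is, as Arvid says, still research, but the flow he describes — encrypt locally, let the server compute on ciphertexts it cannot read, decrypt locally — can be illustrated with a much simpler *additively* homomorphic scheme, Paillier. The tiny fixed primes below are wildly insecure and purely for demonstration:

```python
import math
import random

def paillier_keygen(p=1789, q=1861):
    """Toy Paillier keys. p and q are small demo primes -- never do this
    for real; production keys use primes of ~1024 bits or more."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                      # standard simplification g = n + 1
    mu = pow(lam, -1, n)           # modular inverse of lambda mod n
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

def add_encrypted(pub, c1, c2):
    """The server adds two plaintexts without ever seeing either one:
    multiplying ciphertexts mod n^2 adds the underlying messages."""
    n, _ = pub
    return (c1 * c2) % (n * n)
```

The server-side `add_encrypted` only ever touches ciphertexts; the homomorphic-inference research Arvid mentions is about extending this kind of property from a single addition to an entire model forward pass, at tolerable overhead.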
做隐私保护的机器学习。但我想说,这是当今所有软件面临的共同挑战。云端能提供众多功能,我们越来越依赖它让生活更美好,但弊端也存在。这就是为什么需要依赖强大的安全防护来抵御基础攻击。
Doing privacy preserving machine learning. But I would say, like, that's the challenge we have with all software these days. It's like, there's so many features that can be provided from the cloud and all of us increasingly rely on it and make our life awesome, but there's downsides. And that's why you rely on really good security to protect from basic attacks. Yeah.
但掌控这些数据的公司屈指可数,它们显然拥有支配权,也可能以各种方式被渗透。这就是我们生活的世界。
But there's also only a small set of companies that are controlling that data, you know, and they they obviously have leverage and they could be infiltrated in all kinds of ways. That's the world we live in. Yeah.
我真正担忧的是这样一个世界:Anthropic制定了负责任扩展政策,我们目前处于较低的ASL(Anthropic安全等级)阶段。但当模型达到所谓ASL-3、ASL-4这种非常强大的级别时,出于大体合理的安全考虑,你会想监控所有提示词。这种出发点可以理解,但说真的,如果全球信息都被如此严密监控,那就太可怕了。
I mean, the thing I'm actually quite worried about is the world where, I mean, Anthropic has this responsible scaling policy, so we're on, like, the low ASLs, which is the Anthropic Safety Level, or whatever, of the models. But as we get to, like, quote unquote, ASL-3, ASL-4, whatever models, which are very powerful, then for mostly reasonable security reasons, you would wanna monitor all the prompts. And I think that's sort of reasonable and understandable, where everyone is coming from. But, man, it'd be really horrible if, like, all the world's information is monitored that heavily.
这过度中心化了。就像在走钢丝:一边要防止模型失控,另一边——老天,人类...我不确定是否该让全球信息都经由三家模型供应商流转。
It's way too centralized. It's, like, this really fine line you're walking, where on the one side you don't want the models to go rogue. On the other side, like, man... I don't know if I trust all the world's information to pass through, like, three model providers. Yeah.
你觉得这和云服务商有什么不同?
Why do you think it's different than cloud providers?
因为这些数据原本根本不会流向云服务商。人们愿意给AI模型提供更多数据,包括那些原本绝不会上传的隐私数据。这还加剧了控制权的集中——当前云服务中你常能使用自己的加密密钥,AWS其实做不了什么。但这里,中心化主体能看到所有信息的明文。
Because I think a lot of this data would never have gone to the cloud providers in the first place. This is often, like, you want to give more data to the AI models; you want to give personal data that you would never have put online in the first place to these companies, or to these models. And it also centralizes control, where right now, for cloud, you can often use your own encryption keys, and, like, AWS can't really do much. But here, it's just centralized actors that see the exact plain text of everything.
说到上下文,这确实是我的痛点。用Python写代码时要导入大量内容,你大概能猜到我想包含哪些上下文。有没有自动识别上下文的方案?难度如何?
On the topic of context, that's actually been a friction point for me. When I'm writing code, you know, in Python, there's a bunch of stuff imported. You could probably intuit the kind of stuff I would like to include in the context. Like, how hard is it to auto-figure-out the context?
这很棘手。未来我们在自动计算上下文方面可以做得更好。但要注意,自动包含上下文存在权衡:模型获取的上下文越多,响应速度就越慢,请求成本越高,意味着后台能进行的模型调用和复杂操作就越少。
It's tricky. I think we can do a lot better at computing the context automatically in the future. One thing that's important to note is there are trade offs with including automatic context. So the more context you include for these models, first of all, the slower they are and the more expensive those requests are, which means you can then do less model calls and do less fancy stuff in the background.
此外,对于
Also, for
许多这类模型来说,如果提示中包含大量信息,它们会感到困惑。因此,对所包含内容的准确性和相关性的要求应该相当高。目前,我们在产品某些环节已实现自动上下文处理,但这显然是我们亟需大幅改进的领域。我认为这里有许多值得尝试的创新思路,包括优化检索系统——比如开发更好的嵌入模型和更高效的重新排序算法。
a lot of these models, they get confused if you have a lot of information in the prompt. So the bar for accuracy and for relevance of the context you include should be quite high. We already do some automatic context in some places within the product. It's definitely something we want to get a lot better at. And I think that there are a lot of cool ideas to try there, both on learning better retrieval systems, like better embedding models and better re-rankers.
我认为学术界也有不少精彩构想。有些我们内部已试验过,但整个领域仍在广泛探讨:能否让语言模型真正掌握新信息库?最常被讨论的方案是能否实现无限上下文窗口?若实现无限窗口后,又能否让模型真正关注无限上下文?在此基础上,为实现可行性,能否对无限上下文进行缓存处理?
I think that there are also cool academic ideas. You know, stuff we've tried out internally, but also the field is grappling with writ large about can you get language models to a place where you can actually just have the model itself understand a new corpus of information. And the most popular talked about version of this is can you make the context windows infinite? Then if you make the context windows infinite, can you make the model actually pay attention to the infinite context? And then after you can make it pay attention to the infinite context, to make it somewhat feasible to actually do it, can you then do caching for that infinite context?
这样就不必频繁重新计算。但还有其他创新尝试,比如更接近微调的方式——将信息直接编码进模型权重。相较于上下文学习,权重层面的理解可能会产生质的不同。目前学界对最终实现路径尚无定论。但现阶段,我们公司对优化检索系统和精准匹配代码库中最相关部分充满热情。
You don't have to recompute that all the time. But there are other cool ideas being tried that are a little bit more analogous to fine-tuning, of actually learning this information in the weights of the model. And it might be that you actually get a qualitatively different type of understanding if you do it more at the weight level than if you do it at the in-context-learning level. I think the jury's still a little bit out on how this is all gonna work in the end. But in the interim, us as a company, we are really excited about better retrieval systems and picking the parts of the code base that are most relevant to what you're doing.
我们在这方面还有很大提升空间。
We could do that a lot better.
比如,直接在权重中学习知识的一个有趣例子就是VS Code。由于我们是VS Code的分支,而VS Code代码完全公开,这些模型在预训练阶段已见过全部代码,可能还见过相关问答,并经过RLHF微调以具备回答通用代码问题的能力。所以当你询问VS Code相关问题时,它有时会产生幻觉,但往往表现不错。
Like, one interesting proof of concept for learning this knowledge directly in the weights is with VS Code. So we're in a VS Code fork, and VS Code, the code is all public. So these models, in pre-training, have seen all the code. They've probably also seen questions and answers about it, and then they've been fine-tuned with RLHF to be able to answer questions about code in general. So when you ask it a question about VS Code, you know, sometimes it'll hallucinate, but sometimes it actually does a pretty good job at answering the question.
我认为当前效果只是偶然达标。但若能专门训练或后训练模型使其真正理解特定代码库呢?这是个开放的研究课题,我们非常感兴趣。另外还存在架构选择的不确定性:究竟该让模型端到端处理所有环节——
And I think, like, this just happens to be okay. But what if you could actually, like, specifically train or post-train a model such that it really was built to understand this code base? It's an open research question, one that we're quite interested in. And then there's also uncertainty of, like, do you want the model to be the thing that, end to end, is doing everything, i.e.,
即在内部完成检索后生成代码答案,还是将检索功能与前沿模型分离?或许几个月后会出现远超当前最佳开源模型的方案,届时可能需要专门训练优质开源模型作为检索器,为大型模型提供上下文输入。
doing the retrieval in its internals and then answering the question, creating the code? Or do you want to separate the retrieval from the frontier model, where maybe, you know, you'll get some really capable models that are much better than the best open source ones in a handful of months, and then you'll want to separately train a really good open source model to be the retriever, to be the thing that feeds in the context to these larger models?
能否详细说明后训练模型理解代码库的具体含义?这是指通过合成数据方向实现吗?还是说——
Can you speak a little more to post-training a model to understand the code base? Like, what do you mean by that? Is this a synthetic data direction? Is this...
没错。实现路径有很多可能性,从不缺乏创意,关键在于逐一实践验证最优方案。最朴素的方法就是复现VS Code和前沿模型的做法。
Yeah. I mean, there are many possible ways you could try doing it. There's certainly no shortage of ideas. It's just a question of going in and, like, trying all of them and being empirical about which one works best. You know, one very naive thing is to try to replicate what's done with VS Code and these frontier models.
那么我们就继续预训练吧,某种包含通用代码数据的持续预训练,但同时也要加入大量你关心的某个特定代码库的数据。然后在后训练阶段,也就是从指令微调开始,你有一个关于代码的常规指令微调数据集,再混入大量关于那个代码库的问题。你可以获取真实标注的问题(可能比较困难),或者采用你暗示的合成数据方法——让模型针对代码的各个新片段提问。比如截取代码片段,提示模型或让模型为该片段生成问题,再将这些作为独特的指令微调数据点。理论上,这能解锁模型回答该代码库问题的能力。
So let's, like, continue pre-training, some kind of continued pre-training that includes general code data but also throws in a lot of the data of some particular repository that you care about. And then in post-training, meaning, let's just start with instruction fine-tuning: you have, like, a normal instruction fine-tuning dataset about code, then you throw in a lot of questions about code in that repository. So you could either get ground-truth ones, which might be difficult, or you could do what you kind of hinted at, or suggested, using synthetic data, i.e., having the model ask questions about various pieces of the code. So you take the pieces of the code, then prompt the model, or have a model propose a question for that piece of code, and then add those as instruction fine-tuning data points. And then, in theory, this might unlock the model's ability to answer questions about that code base.
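The synthetic-data recipe Aman sketches — prompting a model to propose a question for each code chunk, then keeping the (question, chunk) pairs as instruction fine-tuning examples — might look something like this. `ask_model` is a stand-in for a real LLM call, and the prompt wording is purely illustrative:

```python
def make_synthetic_qa(chunks: list[str], ask_model) -> list[dict]:
    """For each code chunk, have a model propose a question about it;
    the resulting (instruction, context) pairs become fine-tuning data.

    ask_model: callable taking a prompt string and returning a string
               (a stand-in for an actual language-model API call).
    """
    dataset = []
    for chunk in chunks:
        question = ask_model(
            "Write one question a developer might ask about this code:\n"
            + chunk
        )
        dataset.append({"instruction": question, "context": chunk})
    return dataset
```

In a real pipeline you would also filter these pairs for quality (for example, checking that the question is actually answerable from the chunk) before mixing them into the instruction fine-tuning set.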
我想请教你关于OpenAI o1的问题。你认为这类测试时计算系统在编程中扮演什么角色?
Let me ask you about OpenAI o one. What do you think is the role of that kind of test time compute system in programming?
我认为测试时计算非常非常有趣。预训练范式会随着数据量和模型规模扩大,在损失函数、下游基准测试和整体性能上持续提升——无论是编码还是其他任务。但我们开始触及数据墙,意味着这个范式的持续扩展将变得困难。而增加测试时计算量是个有趣的替代方案——通过增加推理时的浮点运算量来获得...嗯...
I think test-time compute is really, really interesting. So there's been the pre-training regime, which, as you scale up the amount of data and the size of your model, gets you better and better performance, both on loss and on downstream benchmarks, and just general performance when we use it for coding or other tasks. We're starting to hit a bit of a data wall, meaning it's going to be hard to continue scaling up this regime. And so scaling up test-time compute is an interesting way of increasing the number of inference-time flops that we use.
随着推理时浮点运算量的增加,模型性能会相应提升。传统做法是直接训练更大的模型(始终消耗更多运算资源),但现在我们可以用相同规模的模型通过延长运行时间来获得媲美大模型的结果。最让我兴奋的是:某些问题可能需要万亿参数模型和百万亿token训练出的智能——但这类问题可能只占全部查询的1%甚至0.1%。难道我们要耗费巨量算力训练这种模型却极少使用它吗?
As you increase the number of flops used at inference time, you get corresponding improvements in the performance of these models. Traditionally, we just had to literally train a bigger model that always used that many more flops, but now we could perhaps use the same-size model and run it for longer to get an answer at the quality of a much larger model. And so the really interesting thing I like about this is, there are some problems that perhaps require 100-trillion-parameter-model intelligence trained on 100 trillion tokens. But that's, like, maybe 1%, maybe 0.1%, of all queries. So are you going to spend all of this effort, all this compute, training a model that costs that much and then run it so infrequently?
这简直太浪费了!更合理的方案是:训练能处理99.9%查询的模型,然后为那些真正需要最高智能的用户提供延长推理时间的选项。
It feels completely wasteful, when instead you train the model that is capable of doing the 99.9% of queries, and then you have a way of running it longer at inference time for those few people that really, really want max intelligence.
如何判断什么问题需要哪种级别的智能?能动态决定何时用GPT-4、何时用小模型、何时需要o1吗?
How do you figure out which problem requires what level of intelligence? Is it possible to dynamically figure out when to use GPT-4, when to use a small model, and when you need o1?
这确实是个开放的研究难题。目前没人完美解决这个模型路由问题——虽然我们为Cursor标签页等功能做过初级实现。但在o1和Sonnet这类大模型间切换更复杂:比如判断某问题是否超出模型能力所需的智能水平,可能反而需要o1级模型才能确定。
I mean, yeah, that's an open research problem, certainly. I don't think anyone's actually cracked this model-routing problem quite well. We have, like, initial implementations of this for something like Cursor Tab, but at the level of going between 4o, Sonnet, and o1, it's a bit trickier. Like, there's also the question of what level of intelligence you need to determine if the thing is too hard for the 4o-level model. Maybe you need the o1-level model.
目前还很不明朗。
It's really unclear.
你提到流程分为预训练、后训练和测试时计算三阶段。这样划分合理吗?哪个阶段收益最大?
Because you mentioned this. So there's a pre-training process, then there's post-training, and then there's, like, test-time compute. Is it fair to separate it that way? Where are the biggest gains?
这个问题很微妙——测试时计算本身需要整套训练策略支持。更诡异的是:除了大实验室(可能只有OpenAI),没人真正明白其原理。虽然有些论文暗示他们可能在使用过程奖励模型进行树搜索...但关键在于我们不清楚具体机制,所以很难评价它的定位。
Well, it's weird, because, like, for test-time compute, there's a whole training strategy needed to get test-time compute to work. And the other really weird thing about this is, no one outside of the big labs, and maybe even just OpenAI, no one really knows how it works. Like, there have been some really interesting papers that show hints of what they might be doing. So perhaps they're doing something with tree search using process reward models. But, yeah, I think the issue is we don't quite know exactly what it looks like, so it would be hard to comment on where it fits in.
我会把它放在训练后阶段,但或许,这类为模型获取测试时计算资源所耗费的计算量,最终会令预训练相形见绌。
I would put it in post-training, but maybe, like, the compute spent on getting test-time compute to work for a model is going to dwarf pre-training eventually.
所以我们甚至不知道o1是否只是用了思维链加强化学习。我们不知道他们如何运用这些方法。我们一无所知。
So we don't even know if o1 is using just, like, chain of thought and RL. We don't know how they're using any of these. We don't know anything.
猜测总是有趣的。
It's fun to speculate.
比如,如果要构建一个竞争模型,你会怎么做?
Like, if you were to build a competing model, what would you do?
是的。我认为首先需要训练一个过程奖励模型——或许我们可以先区分结果奖励模型与过程奖励模型。传统的结果奖励模型是为语言建模设计的,它只评估最终结果。比如解数学题时,只看最终答案是否正确并打分。
Yeah. So one thing to do would be, I think you probably need to train a process reward model. So maybe we can get into reward models, and outcome reward models versus process reward models. Outcome reward models are the kind of traditional reward models that people train for language modeling, and they just look at the final thing. So if you're doing some math problem, let's look at that final thing, you've done everything, and let's assign a grade to it.
而过程奖励模型则评估思维链的每个步骤。OpenAI去年夏天有篇初步论文,他们通过人工标注构建了包含数十万条思维链评分的数据集。但迄今为止,除了用它筛选多个样本外,我还没看到过程奖励模型有更创新的应用。
How likely do we think, like, what's the reward for this outcome? Process reward models, instead, try to grade the chain of thought. And so OpenAI had some preliminary paper on this, I think last summer, where they used human labelers to get this pretty large, several-hundred-thousand-example dataset of graded chains of thought. Ultimately, it feels like I haven't seen anything interesting in the ways that people use process reward models, outside of just using them as a means of choosing between a bunch of samples.
目前常见做法是:从语言模型采样多个输出,用过程奖励模型评估这些生成结果(可能结合其他启发式方法),然后选择最佳答案。真正令人期待的是结合过程奖励模型的树搜索——如果能精准评估思维链的每个分支,就能探索多条推理路径,实时判断分支质量。
So like what people do in all these papers is they sample a bunch of outputs from the language model, and then use the process reward models to grade all those generations alongside maybe some other heuristics, and then use that to choose the best answer. The really interesting thing that people think might work and people want to work is tree search with these process reward models, because if you really can grade every single step of the chain of thought, then you can kind of branch out and, you know, explore multiple paths of this chain of thought, and then use these process reward models to evaluate how good is this branch that you're taking.
对,当分支质量与最终结果强相关时。这样你就能建立长期有效的分支评估机制,而不仅是短期判断。
Yeah. When the quality of the branch is somehow strongly correlated with the quality of the outcome at the very end. So, like, you have a good model of knowing which branch to take, not just in the short term, but in the long term.
没错。目前有价值的开源工作主要集中在如何自动化训练过程奖励模型。不过关于如何创造性运用它来实现树搜索,我还没看到特别成功的案例——当然也可能是我遗漏了什么。
Yeah. And, like, the interesting work that has been open-sourced, and that people I think talk about, is how to train the process reward models, maybe in a more automated way. I could be wrong here, and might not be mentioning something, but I haven't seen anything that seems to work really well for using the process reward models creatively to do tree search and code.
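The distinction above — best-of-n reranking versus PRM-guided search — can be made concrete with stubs. Everything here is a toy: `toy_generator` stands in for sampling candidate next steps, `toy_prm` stands in for a trained process reward model that scores *partial* chains of thought, and none of this reflects what OpenAI actually does:

```python
def toy_generator(prefix: list[str]) -> list[list[str]]:
    """Stand-in for sampling candidate next steps of a chain of thought."""
    return [prefix + [step] for step in ("step_good", "step_bad")]

def toy_prm(chain: list[str]) -> float:
    """Stand-in process reward model: scores a partial chain of thought.
    A real PRM would be a trained model; here good steps just score higher."""
    return sum(1.0 if s == "step_good" else 0.1 for s in chain)

def prm_beam_search(depth: int = 3, beam_width: int = 2) -> list[str]:
    """Beam search guided by the PRM: branch every partial chain,
    then keep only the highest-scoring partial chains at each depth."""
    beams = [[]]
    for _ in range(depth):
        candidates = [c for b in beams for c in toy_generator(b)]
        candidates.sort(key=toy_prm, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

Plain best-of-n would instead generate full chains first and only rerank at the end; the point of grading every step, as described above, is that bad branches get pruned early instead of being completed and then discarded.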
这其实涉及AI安全和哲学问题。OpenAI表示他们向用户隐藏思维链是个艰难决定,转而让模型总结推理过程。他们还在后台监控思维链以防模型操纵用户——这种可能性很有意思。你对隐藏思维链的做法怎么看?
This is kind of an AI safety, maybe a bit of a philosophy, question. So OpenAI says that they're hiding the chain of thought from the user, and they've said that was a difficult decision to make. Instead of showing the chain of thought, they're asking the model to summarize it. They're also saying, in the background, that they're going to monitor the chain of thought to make sure the model is not trying to manipulate the user, which is a fascinating possibility. But anyway, what do you think about hiding the chain of thought?
OpenAI的一个考虑因素(这完全是推测性的)可能是他们想让人们难以从他们的模型中提炼出这些能力。实际上,如果你能接触到那些隐藏的思维链,复制这项技术可能会更容易,因为那是相当重要的数据——就像看到模型为得出最终结果所采取的步骤。
One consideration for OpenAI, and this is completely speculative, could be that they wanna make it hard for people to distill these capabilities out of their model. It might actually be easier, if you had access to that hidden chain of thought, to replicate the technology, because that's pretty important data, like seeing the steps that the model took to get to the final result.
所以你或许也可以基于此进行训练。
So you could probably train on that also.
这与某些大型语言模型提供商的情况有些相似,同样纯属推测——部分API曾提供便捷访问其生成的所有令牌的对数概率及提示令牌的对数概率,后来有些API移除了这些功能。再次强调这是猜测,但有人认为移除原因在于:若你能获取类似这种隐藏思维链的对数概率,就能获得更多信息,试图将这些能力从API、从这些大型模型中提炼到你控制的模型里。另外关于我们整合o1的讨论需要补充说明:我认为我们仍在学习如何使用这个模型。
And there was sort of a mirror situation with this with some of the large language model providers, and this is also speculation. Some of these APIs used to offer easy access to log probabilities for all the tokens that they're generating, and also log probabilities over the prompt tokens. And then some of these APIs took those away. And again, complete speculation, but one of the thoughts is that the reason those were taken away is, if you have access to log probabilities, similar to this hidden chain of thought, that can give you even more information to try and distill these capabilities out of the APIs, out of these biggest models, into models you control. As an asterisk on the previous discussion about us integrating o1: I think that we're still learning how to use this model.
我们在Cursor上开放o1是因为获得该模型时非常想试用它,相信很多程序员也会感兴趣。但o1目前完全不是Cursor默认体验的一部分,我们尚未找到将其整合到编辑器中的方式——那种让我们每小时甚至每天都会自然使用的整合方式。关于如何运用这个模型尚无定论,目前也未见明确的使用案例发布,比如‘啊这就是它的应用场景’这样的范例。
So we made o1 available in Cursor because, when we got the model, we were really interested in trying it out. I think a lot of programmers are gonna be interested in trying it out. But o1 is not part of the default Cursor experience in any way yet, and we still haven't found a way to integrate it into the editor in a way that we reach for, sort of, you know, every hour, maybe even every day. And so I think the jury's still out on how to use the model, and we haven't seen examples yet of people releasing things where it seems really clear, like, oh, that's now the use case.
最显而易见的用途或许是它能让你更轻松地运行后台任务——让这些模型处于循环中,具备代理能力。但我们仍在探索阶段。需要澄清的是,我们
The obvious one to turn to is maybe this can make it easier for you to have these background things running, right, to have these models in loops, to have these models be agentic. But we're still discovering. To be clear, we
已有构想。只是需要先尝试开发出极具价值的功能后再公之于众。
have ideas. We just need to try and get something incredibly useful before we put it out there.
但它存在显著局限。即便不考虑能力问题,它不支持流式输出,这意味着当你需要监督输出内容时会极其痛苦——只能等待整段文本突然呈现。此外,当前测试阶段的计算与搜索功能感觉就像非常原始的v0版本,许多细节都不够完善。我猜测在人们增加预训练数据量、扩大模型规模并发现技巧的同时,搜索功能的优化也将成为另一条并行发展的主线。
But it has these significant limitations. Like, even barring capabilities, it does not stream. And that means it's really, really painful to use for things where you want to supervise the output; instead, you're just waiting for the wall of text to show up. Also, it does feel like the early innings of test-time compute and search, where it's very much a v0, and there are so many things that don't feel quite right. And I suspect, in parallel to people increasing the amount of pre-training data and the size of the models in pre-training, and finding tricks there, you'll now have this other thread of getting search to work better and better.
那么请教一下关于Strawberry的问题。GitHub Copilot似乎正以某种方式整合o1,有评论问这是否意味着Cursor要完蛋了?我看到过这样一条评论。
So let me ask you about Strawberry. So it looks like GitHub Copilot might be integrating o1 in some kind of way. And I think some of the comments are saying, does this mean Cursor is done? I think I saw one comment saying that.
我看到有人说"是时候关闭Cursor了"。
I saw, "Time to shut down Cursor."
是时候关闭
Time to shut down
Cursor了。那么
Cursor. So
现在是关闭Cursor的时候了吗?
is it time to shut down Cursor?
我认为这个领域与2010年代过去的软件空间有些不同,这里的上限真的非常高。因此,我认为三四年后最好的产品将比今天最好的产品有用得多。你可以谈论护城河、品牌优势等等,但最终,如果你停止产品创新,你就会失败。这对初创公司来说也是好事。
I think this space is a little bit different from past software spaces of the 2010s, in that the ceiling here is really, really incredibly high. And so I think that the best product in three to four years will just be so much more useful than the best product today. And you can, like, wax poetic about moats this and brand that and, you know, this is our advantage, but I think, in the end, if you stop innovating on the product, you will lose. And that's also great for startups.
这对试图进入这个市场的人来说是好事,因为这意味着你有机会通过构建更好的产品来战胜那些已经拥有大量用户的人。因此,我认为未来几年,关键在于构建最好的产品和系统,这既涉及建模引擎方面,也涉及编辑体验。
That's great for people trying to enter this market, because it means you have an opportunity to win against people who have, you know, lots of users already, by just building something better. And so I think, yeah, over the next few years, it's just about building the best product, building the best system, and that comes down both to the modeling-engine side of things and to the editing experience.
是的。我认为Cursor相比其他产品的额外价值不仅在于快速集成新模型,还在于那些你意识不到的、在产品每个方面为你服务的定制模型的深度,以及每个功能的精心设计的用户体验。
Yeah. I think most of the additional value from Cursor versus everything else out there is not just integrating the new model fast, like o1. It comes from all the depth that goes into these custom models that you don't realize are working for you in every facet of the product, as well as the really thoughtful UX behind every single feature.
好的。从那个深刻的回答中,让我们回到技术层面。你提到你有一个合成数据的分类法。
Alright. From that profound answer, let's descend back down to the technical. You mentioned you have a taxonomy of synthetic data.
哦,是的。
Oh, yeah.
能请你解释一下吗?
Can you please explain?
好的。我认为合成数据主要有三种。首先,什么是合成数据?非合成数据是自然产生的数据,通常来自人类活动的过程。
Yeah. I think there are three main kinds of synthetic data. So first, what is synthetic data? There's normal data, nonsynthetic data, which is just data that's naturally created, i.e., usually it'll be from humans having done things. So from some human process, you get this data.
合成数据的第一种是蒸馏。即让语言模型输出标记或标记的概率分布,然后你可以用这些数据训练一个能力较弱的模型。这种方法不会让你得到一个比原始模型更强大的模型,但如果你想要从某个昂贵、高延迟的模型中提取某些能力,你可以将其蒸馏到某个较小的任务特定模型中。第二种是当问题的某个方向比反向更容易时。
Synthetic data, the first kind, would be distillation. So having a language model output tokens, or probability distributions over tokens, and then you can train some less capable model on this. This approach is not going to get you a model that's more capable than the original one that produced the tokens. But it's really useful if there's some capability you want to elicit from some really expensive, high-latency model; you can distill that down into some smaller, task-specific model. The second kind is when one direction of the problem is easier than the reverse.
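上面描述的蒸馏通常被形式化为:最小化学生模型与教师模型在词表分布上的KL散度。下面是一个玩具示意(函数与数值均为虚构,仅作说明):

The distillation setup described above is usually formalized as minimizing the KL divergence between the teacher's and the student's distributions over the vocabulary. A minimal toy sketch (the functions and logits here are made up for illustration, not from any real training stack):

```python
import math

def softmax(logits, temperature=2.0):
    """Convert logits to a probability distribution at a given temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label distillation objective: KL(teacher || student),
    computed from the two models' logits over the same vocabulary."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * (math.log(ti) - math.log(si)) for ti, si in zip(t, s))

# Toy 4-token vocabulary: the "teacher" is the expensive model,
# the "student" is the small model being trained on its outputs.
teacher = [2.0, 1.0, 0.5, -1.0]
student = [1.8, 1.1, 0.4, -0.8]
loss = distillation_loss(teacher, student)
# The loss is zero exactly when the two distributions match.
```

Training on the full soft distribution, rather than just sampled tokens, is what gives the student more signal per example than ordinary next-token training.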
这方面一个很好的例子就是错误检测,正如我们之前提到的,引入看似合理的错误比实际检测它们要容易得多,这对人类来说可能也是如此。因此,你可以用一个没有经过大量数据训练、不那么智能的模型在代码中引入一堆错误,然后利用这些合成数据来训练一个真正擅长检测错误的模型。最后一个类别,我认为,似乎是大型实验室在合成数据方面主要做的,即用语言模型生成文本,然后可以轻松验证。比如,一个极端的例子是,如果你有一个能检测文本是否达到莎士比亚水平的验证系统,然后让一群猴子在打字机上乱敲,最终你就能获得足够的训练数据来训练一个莎士比亚级别的语言模型。
And a great example of this is bug detection, like we mentioned earlier, where it's a lot easier to introduce reasonable-looking bugs than it is to actually detect them, and this is probably the case for humans too. And so what you can do is get a model that's not trained on that much data, that's not that smart, to introduce a bunch of bugs in code, and then you can use that synthetic data to train a model that can be really good at detecting bugs. The last category, I think, is the main one that it feels like the big labs are doing for synthetic data, which is producing text with language models that can then be verified easily. So, an extreme example of this is: if you have a verification system that can detect whether language is Shakespeare-level, and then you have a bunch of monkeys typing on typewriters, you can eventually get enough training data to train a Shakespeare-level language model.
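其中第二类(引入bug比检测bug容易)可以用一个玩具示意表达:对正确代码做简单变异,得到(干净,有bug)训练对。下面的变异规则纯属虚构示意,真实流程会用模型来注入bug:

The second kind, where injecting bugs is easier than detecting them, can be sketched as mutating correct snippets to produce (clean, buggy) training pairs. The mutation rules below are purely illustrative; a real pipeline would use a model to inject the bugs:

```python
import random

# Each swap turns correct code into a plausible-looking bug.
MUTATIONS = [("<=", "<"), (">=", ">"), ("==", "!="), ("+ 1", "- 1")]

def inject_bug(snippet, rng):
    """Return a buggy variant of the snippet, or None if no rule applies."""
    applicable = [(a, b) for a, b in MUTATIONS if a in snippet]
    if not applicable:
        return None
    a, b = rng.choice(applicable)
    return snippet.replace(a, b, 1)

rng = random.Random(0)
clean = "total = sum(xs[i] for i in range(n) if i <= k)"
buggy = inject_bug(clean, rng)
# (clean, label=0) and (buggy, label=1) become training pairs
# for a bug-detection classifier.
```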
我的意思是,数学领域尤其如此,对于形式语言来说,验证实际上非常容易。你可以让一个一般的模型生成大量推导过程,然后选择那些真正证明了基础定理的实例,再进一步训练。对于代码,你也可以做类似的事情,比如LeetCode类型的问题,如果你有一组测试用例,知道通过这些测试就意味着问题被解决。你可以验证它通过了测试,然后训练那些通过测试的输出模型。不过,要让这种方法在所有领域或普遍适用,可能会有点棘手。
And I mean, this is very much the case for math, where verification is actually really, really easy for formal languages. And then what you can do is have an okay model generate a ton of rollouts, and then choose the ones that have actually proved the ground-truth theorems, and train on those further. There are similar things you can do for code with LeetCode-like problems, where if you have some set of tests that you know correspond to the problem, such that if something passes these tests, it has actually solved the problem, you can do the same thing: verify that it passed the tests and then train the model on the outputs that passed. I think it's going to be a little tricky getting this to work in all domains, or just in general.
比如,对于开放式的杂项任务或更长期的任务,甚至是编码任务,要构建一个完美的验证器感觉非常非常困难。
Like, having a perfect verifier feels really, really hard to do for just open-ended, miscellaneous tasks you give the model, or for more long-horizon tasks, even in coding.
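第三类(生成后用验证器筛选)可以用一个玩具的"生成—验证—保留"循环示意。这里用随机采样的线性函数代替模型采样,用一个真实的单元测试充当验证器;任务和名字均为虚构:

The third kind, generate and then filter by a verifier, can be sketched as a toy generate-verify-keep loop. Random linear functions stand in for model sampling here, and a real unit test plays the verifier; the task and names are made up:

```python
import random

def passes_tests(candidate_fn):
    """Verifier: the task is solved iff the candidate doubles its input."""
    try:
        return all(candidate_fn(x) == 2 * x for x in (0, 1, 5, -3))
    except Exception:
        return False

def sample_candidates(rng, n):
    """Stand-in for model sampling: random small linear functions a*x + b."""
    for _ in range(n):
        a, b = rng.randint(-3, 3), rng.randint(-3, 3)
        yield (a, b), (lambda x, a=a, b=b: a * x + b)

rng = random.Random(0)
verified = [params for params, fn in sample_candidates(rng, 300)
            if passes_tests(fn)]
# Only candidates equivalent to 2*x can survive verification; the
# surviving outputs would form the synthetic training set.
```

The hard part in practice, as noted above, is that outside math and test-backed code, a verifier this clean usually doesn't exist.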
那是因为你没有Arvid那么乐观。不过确实如此。所以,这第三类需要一个验证器。
That's because you're not as optimistic as Arvid. But yeah. So that third category requires having a verifier.
是的。验证在你能确定其正确性时效果最好,而不是用语言模型来验证,而是用测试或形式化系统。
Yeah. Verification feels like it's best when you know for a fact that it's correct. And then it wouldn't be using a language model to verify; it would be using tests or formal systems.
或者直接运行它。像人类那样进行验证,就是手动做质量控制。
Or running the thing too. Doing the human form of verification, where you do manual quality control.
对。对。
Yeah. Yeah.
但如果是语言模型版本的验证,它更像是运行程序并真正理解输出。是的,或者介于两者之间。
But, like, the language model version of that, where it's running the thing and it actually understands the output. Yeah, or somewhere in between.
是的。我认为这个类别最有可能带来巨大的提升。
Yeah. I think that's the category that is most likely to result in massive gains.
那么带反馈的强化学习呢?RLHF与RLAIF相比如何?它们在提升模型性能方面扮演什么角色?
What about RL with feedback? RLHF versus RLAIF. What's the role of that in getting better performance out of the models?
是的。RLHF(人类反馈强化学习)是指你使用的奖励模型是通过收集人类提供的反馈标签来训练的。如果你有能力为这类你关心的任务获取大量人类反馈,我认为这是可行的。RLAIF(人工智能反馈强化学习)有趣的地方在于,你某种程度上依赖于这样一个约束条件:验证实际上比生成要容易得多。因为这感觉像是,好吧,你在做什么呢?
Yeah. So RLHF is when the reward model you use is trained from labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task you care about. RLAIF is interesting, because you're depending on the constraint that verification is actually a decent bit easier than generation. Because otherwise it feels like, okay, what are you doing?
你是用这个语言模型来查看语言模型的输出,然后改进语言模型吗?但其实不是。如果语言模型验证某个解决方案比生成它容易得多,那么这种方法可能确实有效。这样你或许可以实现这种递归循环,但我不认为它会完全像那样。另一种你可以做的,也是我们某种程度上在做的,是混合RLAIF和RLHF,通常模型实际上相当正确,就像在Cursor标签页中,在两种可能的生成结果中选择哪个更好。
Are you using this language model to look at the language model's outputs and then improve the language model? But no, it actually may work, if the language model has a much easier time verifying some solution than it does generating it. Then you could perhaps get this kind of recursive loop, but I don't think it's going to look exactly like that. The other thing you could do, which we kind of do, is a little bit of a mix of RLAIF and RLHF, where usually the model is actually quite correct, and this is in the case of Cursor Tab, at picking, between two possible generations, which is the better one.
然后它只需要一点点人类的推动,大约50到100个例子,就能将模型已有的先验与你想要的东西对齐。这看起来与普通的RLHF或RLAIF不同,后者通常需要训练奖励模型和大量例子。
And then it just needs a little bit of human nudging, with only on the order of 50 to 100 examples, to align the prior the model has with exactly what you want. This looks different from normal RLHF, where you're usually training these reward models on tons of examples.
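上面说的"在两个候选生成之间选更好的那个、只用几十到上百个比较样本对齐"的设置,通常用Bradley-Terry式的偏好损失来形式化。下面是一个玩具示意,用的是线性奖励模型和合成特征(全部为虚构):

The setup described, picking the better of two candidate generations and aligning with only on the order of 50 to 100 comparisons, is usually formalized with a Bradley-Terry preference loss over reward scores. A toy sketch with a linear reward model on synthetic features (all made up):

```python
import math
import random

def reward(w, features):
    """Linear reward model: score = w . features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# ~100 synthetic comparisons: the option with the larger first feature
# is always the one the "labeler" prefers.
rng = random.Random(0)
pairs = []
for _ in range(100):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    chosen, rejected = (a, b) if a[0] > b[0] else (b, a)
    pairs.append((chosen, rejected))

# A few passes of gradient descent on the two weights.
w, lr = [0.0, 0.0], 0.5
for _ in range(50):
    for chosen, rejected in pairs:
        diff = reward(w, chosen) - reward(w, rejected)
        grad = -1.0 / (1.0 + math.exp(diff))  # d(bt_loss)/d(diff)
        for j in range(2):
            w[j] -= lr * grad * (chosen[j] - rejected[j])

avg_loss = sum(bt_loss(reward(w, c), reward(w, r)) for c, r in pairs) / len(pairs)
# The learned reward puts nearly all its weight on the feature that
# actually drove the preferences, and the average loss drops well
# below the untrained baseline of log(2).
```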
当你比较生成和验证,或生成和排序时,你的直觉是什么?排序比生成容易得多吗?
What's your intuition when you compare generation and verification, or generation and ranking? Is ranking way easier than generation?
我的直觉告诉我,是的,应该是这样。这有点像回到如果你相信P不等于NP,那么有一大类问题在给定证明的情况下验证起来比实际证明要容易得多。
My intuition would just say yeah, it should be. This is kind of going back to: if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove.
我在想同样的事情是否会证明P不等于NP或P等于NP。
I wonder if the same thing will prove P not equal to NP, or P equal to NP.
那将会非常酷。
That would be that would be really cool.
那将是AI获得的菲尔兹奖。谁该得到这个荣誉?又是一个开放的哲学问题。
That'd be, like, the Fields Medal won by an AI. Who gets the credit? Another open philosophical question.
我实际上非常好奇,对于AI何时能拿到菲尔兹奖,一个好的赌注会是什么。我其实
I'm actually surprisingly curious what a good bet for when an AI will get the Fields Medal would be. I actually
不知道。
don't know.
我不知道Aman在这里赌的是什么。
I don't know what Aman's bet here is.
哦,抱歉。先拿诺贝尔奖还是菲尔兹奖?
Oh, sorry. Nobel Prize or Fields Medal first?
菲尔兹奖级别。菲尔兹奖先拿。
Fields Medal level. Fields Medal first.
嗯,你当然会这么说。
Well, you would say that, of course.
但这也是一个孤立的系统,你知道的,验证过程。
But it's also this, like, isolated system, you know, where you're verifying.
不。当然。是的。
No. Sure. Yeah.
我甚至不知道。我感觉对这个领域的直觉要少得多。感觉通往IMO的道路稍微清晰一些,因为当时已经能解决一些IMO题目了,而且根据当时的文献,还有很多容易摘取的成果,比如人们可以采取的策略。我想,一方面我对他们现在改进的领域了解少得多,另一方面,对于我们离解决这些极其困难的开放性问题还有多远,我的直觉也较弱。
I don't even know. I feel like I have much less intuition there. It felt like the path to get to IMO was a little bit more clear, because it already could get a few IMO problems, and there was a bunch of low-hanging fruit, given the literature at the time, of what tactics people could take. I think I'm, one, much less versed in the space they're improving in now, and, two, I have less intuition about how close we are to solving these really, really hard open problems.
所以你认为会先拿菲尔兹奖?不会是物理学或
So you think it'll be Fields Medal first? It won't be like in physics or in
哦,百分之百。我认为那更有可能。就像,他们很可能会先拿到菲尔兹奖。是的。
Oh, 100%. I think that's probably more likely. Like, it's probably much more likely that they'll get the Fields Medal first. Yeah.
是啊。嗯,我觉得这就涉及到,不知道,比如BSD,也就是贝赫和斯维讷通-戴尔猜想,或者黎曼猜想,或者其他任何一个真正困难的数学难题。甚至都不清楚通往一个解的路径长什么样。我们连路径是什么样都不知道,更别说...
Yeah. Well, I think it goes to, I don't know, like, BSD, which is the Birch and Swinnerton-Dyer conjecture, or the Riemann hypothesis, or any one of these really hard math problems, where it's sort of unclear what the path to even a solution looks like. Like, we don't even know what a path looks like, let alone
而且你被这种想法所吸引,认为这是一个孤立系统,实际上你可以建立一个良好的奖励机制,感觉这样训练起来更容易。
And you're drawn to this idea that this is an isolated system, where you actually have a good reward signal, and it feels like it's easier to train for that.
我觉得我们可能在实现通用人工智能之前先拿到菲尔兹奖。我是说,
I think we might get the Fields Medal before AGI. I mean,
我会非常高兴。我会非常高兴。但我不确定。我觉得可能是2028年、2030年吧。
I'd be very happy. I'd be very happy. But I don't know. I think 2028, 2030.
2028年,菲尔兹奖。
2028. Fields Medal.
菲尔兹奖。
Fields Medal.
好吧。考虑到事情发展得这么快,现在感觉2030年遥不可及。说到发展速度,我们来聊聊扩展定律吧。可能有些人不太了解,先解释下这个概念——什么是扩展定律?
Alright. It feels like forever from now, given how fast things have been going. Speaking of how fast things have been going, let's talk about scaling laws. So for people who don't know, maybe it's good to talk about this whole idea of scaling laws. What are they?
目前进展如何?你觉得未来会怎样发展?
Where do things stand? And where do you think things are going?
我觉得很有意思。OpenAI最初的扩展定律论文其实有点问题,因为他们学习率调度方案存在缺陷。后来Chinchilla论文给出了更正确的版本。但之后人们又开始偏离计算最优化的方向,因为现在大家更注重在给定推理预算下让模型表现更好。这些曲线涉及的维度远比我们最初考虑的算力、参数量和数据要多得多。
I think it was interesting. The original scaling laws paper by OpenAI was slightly wrong, because of some issues they had with learning rate schedules. And then Chinchilla showed a more correct version. And since then, people have again deviated from doing the compute-optimal thing, because people now optimize more for making the thing work really well given an inference budget. And I think there are a lot more dimensions to these curves than the ones we originally used, of just compute, number of parameters, and data.
比如推理算力是最明显的维度。上下文长度是另一个重要维度。假设你关注推理算力和上下文窗口这两个因素,可能你会想训练某种SSM(状态空间模型),因为它们在超长上下文场景下成本低速度快。即使训练时需要多花10倍算力才能达到同等能力水平,这也是值得的,毕竟你最在意的是长上下文窗口下的推理预算。所以看人们如何在这些维度上做权衡会很有趣。
Inference compute is the obvious one. I think context length is another obvious one. So let's say you care about the two things of inference compute and context window; maybe the thing you want to train is some kind of SSM, because they're much, much cheaper and faster at super long context. And even if it maybe has 10x worse scaling properties during training, meaning you spend 10x more compute to train the thing to get the same level of capability, it's worth it, because you care most about that inference budget for really long context windows. So it'll be interesting to see how people play with all these dimensions.
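作为这里讨论的背景:Chinchilla论文把损失拟合成参数量N与训练标记数D的参数化形式,并在固定算力预算C≈6ND下求最优分配。下面用论文报告的拟合常数做一个网格搜索示意(常数为近似值,仅作说明):

For background on the discussion here: the Chinchilla paper fits loss as a parametric function of parameters N and training tokens D, and then asks for the optimal split under a fixed compute budget C ≈ 6·N·D. A grid-search sketch using the paper's reported fitted constants (approximate values, for illustration only):

```python
def chinchilla_loss(N, D, A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    """Parametric fit from Hoffmann et al. (2022):
    L(N, D) = E + A / N^alpha + B / D^beta,
    with N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def best_allocation(C, steps=400):
    """Grid-search the (N, D) split for a fixed budget C ~ 6*N*D."""
    best = None
    for i in range(1, steps):
        N = 10 ** (7 + 5 * i / steps)  # sweep N from 1e7 to 1e12
        D = C / (6 * N)
        loss = chinchilla_loss(N, D)
        if best is None or loss < best[0]:
            best = (loss, N, D)
    return best

loss, N, D = best_allocation(C=1e21)
# For this budget the optimum lands at a model in the low billions of
# parameters, trained on tens of tokens per parameter.
```

The constants are the paper's published fit; the paper's other estimation approaches give somewhat different optima, which is part of why "compute-optimal" has kept shifting since.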
我是说,你在探讨多个维度。
So I mean, you speak to the multiple dimensions.
显然,
Obviously,
最初的概念只是观察模型规模的变量——以参数数量衡量,以及数据规模——以标记数量衡量,并审视两者的比例。是的。这种认为存在一个数字或至少一个最小值,且似乎正在显现的观点相当引人注目。你现在还相信‘越大越好’这种说法吗?
the original conception was just looking at the variables of the size of the model, as measured by parameters, and the size of the data, as measured by the number of tokens, and looking at the ratio of the two. Yeah. And it's kind of a compelling notion that there is a number, or at least a minimum, and it seems like one was emerging. Do you still believe that, kind of, bigger is better?
我认为,至少在原始性能上,更大确实更好。
I mean, I think bigger is certainly better for just raw performance.
还有原始智能。
And raw intelligence.
以及原始智能。我认为人们可能采取的路径是——我特别看好蒸馏技术。就像,嗯,你能调整多少旋钮,如果我们投入巨资训练,比如得到最具性价比的高效模型?对吧?
And raw intelligence. I think the path that people might take is, I'm particularly bullish on distillation. Like, how many knobs can you turn, if we spend a ton of money on training, to get the most capable, cheap model? Right?
就是尽可能地在乎推理成本。因为,人们已经对Llama模型所做的,或者对7B模型进行过度训练,使用远超一般最优量的标记,这就是尽可能在乎推理时计算的朴素版本。对吧?但如果你真的在乎,也许该做的是Gemma所做的——我们不只是用标记训练,而是直接训练以最小化与Gemma 27B分布的KL散度。
Like, really caring as much as you can. Because the naive version of caring as much as you can about inference-time compute is what people have already done with, like, the Llama models, just overtraining the shit out of 7B models on way, way more tokens than is in general optimal. Right? But if you really care about it, maybe the thing to do is what Gemma did, which is: let's not just train on tokens. Let's literally train on minimizing the KL divergence with the distribution of Gemma 27B.
对吧?这就是知识蒸馏。你投入计算资源,用所有这些标记去训练这个270亿参数的模型,只为得到一个更小的模型。
Right? So, knowledge distillation there. And you're spending the compute of literally training this 27 billion parameter model on all these tokens, just to get out this, I don't know, smaller model.
蒸馏给你的是一个更快的模型。更小意味着更快。
And the distillation gives you just a faster model. Smaller means faster.
是的。理论上,蒸馏是从训练数据中提取更多信号。这或许是另一种方式,虽不能完全克服,但部分缓解数据墙的问题——你只有这么多数据可以训练。让我们用所有这些标记训练一个非常大的模型,然后将其蒸馏成一个小模型。也许我们能从这个更小的模型中,每标记获得比直接训练时更多的信号。
Yeah. Distillation, in theory, is getting more signal out of the data you're training on. And it's perhaps another way of getting over, not completely over, but partially helping with, the data wall, where you only have so much data to train on: let's train this really, really big model on all these tokens, and we'll distill it into a smaller one. And maybe we can get more signal per token for that much smaller model than we would have originally if we had trained it directly.
那么如果我给你10万亿美元,你会怎么花?我是说,你不能买岛之类的。你会如何分配这笔钱来改进大模型,也许支付HF在RLHF中的费用?
So if I gave you $10 trillion, how would you spend it? I mean, you can't buy an island or whatever. How would you allocate it in terms of improving the big model, versus maybe paying for the HF in RLHF?
是的。或者,是的。我认为训练这些大模型有很多秘密和细节,我只是不知道,只有大实验室才知道。问题是,如果我尝试这样做,我会浪费很多钱,因为我不知道那些事情。假设你有足够的专业知识来操作,或者如果你是说,你必须用你现在有限的信息来操作。
Yeah. I think there are a lot of secrets and details about training these large models that I just don't know, and that only the large labs are privy to. And the issue is, I would waste a lot of that money if I even attempted this, because I wouldn't know those things. Suspending a lot of disbelief and assuming you had the know-how to operate it, or, if you're saying, you have to operate with the limited information you have now.
不,不,不。实际上,我会说你突然介入,获取所有信息,所有的小启发式,所有的小参数,所有定义如何训练这个东西的参数。
No, no, no. Actually, I would say you swoop in and you get all the information, all the little heuristics, all the little parameters that define how the thing is trained.
嗯。
Mhmm.
如果我们看看如何在未来五年内投资以最大化你所谓的原始智能。
If we look at how to invest money for the next five years in terms of maximizing what you called raw intelligence.
我是说,答案不是很明显吗?你只需要尽可能多地获取计算资源?归根结底,你只需要购买GPU,然后研究人员就能找到各种方法——你可以调整是训练一个大模型还是一个小模型。
I mean, isn't the answer really simple? You just try to get as much compute as possible? Like, at the end of the day, all you need to buy is the GPUs, and then the researchers can find everything else; you can tune whether you want to train a big model or a small model.
嗯,这就涉及到一个问题:你真正受限的是计算资源和资金,还是其他东西?而我
Well, this gets to the question of, like, are you really limited by compute and money, or are you limited by these other things? And I'm
更倾向于Arvid的观点,即我们某种程度上受限于想法,但总是有
more sympathetic to Arvid's belief that we're sort of idea-limited, but there's always, like...
但如果你有很多计算资源,你可以运行很多实验。
But if you have a lot of compute, you can run a lot of experiments.
所以你会运行很多实验,而不是用那些计算资源来训练一个巨大的模型?
So you would run a lot of experiments, versus using that compute to train a gigantic model?
我愿意,但我确实认为我们在想法方面是有限的。
I would, but I do believe that we are limited in terms of the ideas we have.
我想是的。因为即使拥有所有这些计算资源,以及你能收集到的世界上所有数据,我认为你最终不仅受限于想法,更受限于真正优秀的工程能力。即使拥有世界上所有的资金,你真的能聚集起那些真正能改变局面的人吗?世界上这样的人并不多。而且研究中投入的大量工作纯粹是极其艰巨的工程工作。举个不太严谨的例子,看看最初的Transformer论文,将文献中这些有趣概念整合起来的工作量,与后续编写所有代码(比如CUDA内核或其他)相比如何。
I think, yeah. Because even with all this compute and, like, all the data you could collect in the world, I think you really are ultimately limited not even by ideas, but by really good engineering. Even with all the capital in the world, would you really be able to assemble the team? There aren't that many people in the world who really make the difference here. And there's so much work that goes into research that is just pure, really, really hard engineering work. As a very hand-wavy example, if you look at the original transformer paper, how much of the work was joining together a lot of these really interesting concepts embedded in the literature, versus then going in and writing all the code, like maybe the CUDA kernels, maybe whatever else.
我不知道它最初是在GPU还是TPU上运行的,但它确实达到了GPU的性能极限。对吧?那是靠Noam Shazeer亲自去写完所有这些代码。对吧?而Noam可能是世界上最好的工程师之一。
I don't know if it ran on GPUs or TPUs originally, such that it actually saturated GPU performance. Right? It took getting Noam Shazeer to go in and do all this code. Right? And Noam is probably one of the best engineers in the world.
或者更进一步,比如下一代模型,要实现模型并行并在数千甚至数万台V100上扩展(我想GPT-3可能就是这样),需要投入巨大的工程努力才能使其工作。如果你真的将成本降低——也许不是降到零,而是降低10倍——让那些有绝妙想法的人能立即实现他们梦想中的新架构版本(比如在GPU上达到40%的利用率),我认为这将极大加速研究进展。我是说,我想——
Or maybe going a step further: for the next generation of models, getting things like model parallelism to work and scaling it on thousands, or maybe tens of thousands, of V100s, which I think GPT-3 may have been. There's just so much engineering effort that has to go into all of these things to make them work. If you really brought that cost down, maybe not to zero, but just made it 10x easier, made it super easy for someone with really fantastic ideas to immediately get to the version of the new architecture they dreamed up, that is getting 40% utilization on the GPUs, I think that would just speed up research by a ton. I mean, I
我认为如果你看到明确的改进路径,你应该总是先摘取低垂的果实。对吧?我认为OpenAI和其他实验室选择摘取低垂果实是正确的做法。这里的低垂果实指的是可以扩展到GPT-4.25规模,持续扩展时效果会持续提升。只要一切都在奏效,就没有必要尝试新想法。
think if you see a clear path to improvement, you should always take the low-hanging fruit first. Right? And I think probably OpenAI and all the other labs did the right thing to pick off the low-hanging fruit, where the low-hanging fruit is: you could scale up to a GPT-4.25 scale, and you just keep scaling, and things keep getting better. There's no point in experimenting with new ideas when everything is working.
你应该全力推进,尽可能榨取现有路径的潜力,直到真正需要新想法的时候。我想,如果你要花费10万亿美元,可能就需要重新评估你的想法了。到那时,你的想法可能需要一些调整了。
And you should bang on it and try to get as much juice out of it as possible, and then maybe when you really need new ideas... I think if you're spending $10 trillion, you probably want to, you know, actually reevaluate your ideas a little bit at that point.
我想我们都认同需要新想法才能最终实现AGI。我们也可能都相信存在在小规模测试这些想法并对其效果有相当信心的方式。只是对于当前阶段的实验室来说,在核心方向仍能持续提升性能的情况下,将有限的研究和工程人才投入到探索其他想法上是相当困难的。
I think all of us believe new ideas are probably needed to get, you know, all the way there, to AGI. And all of us also probably believe there exist ways of testing out those ideas at smaller scales and being fairly confident that they'll play out. It's just quite difficult for the labs, in their current position, to dedicate their very limited research and engineering talent to exploring all these other ideas, when there's this core thing that will probably improve performance for some decent amount of time.
是啊。但这些大实验室正在赢。所以他们就会肆无忌惮地继续。
Yeah. But also, these big labs are, like, winning. So they're just going wild.
好吧。
Okay.
那么展望未来,现在你们处于编程世界的中心。你认为编程的本质在未来几个月、一年、两年内会如何变化?
So, a big question, looking out into the future. You're now at the center of the programming world. How do you think programming, the nature of programming, changes in the next few months, in the next year, in the next two years, in
未来五年、十年会怎样?我们非常期待一个程序员长期主导的未来。你们可能听过我们的一些观点——这个未来将强调程序员的速度、自主权和掌控力,能随心修改任何内容,能快速迭代正在构建的东西。这与某些人当前追逐的方向有所不同,比如那种‘能与电脑对话吗?能让它为你编写软件吗?’的流行构想。
the next five years, ten years? I think we're really excited about a future where the programmer's in the driver's seat for a long time. And you've heard us talk about this a little bit, but one that emphasizes speed and agency for the programmer, and control: the ability to modify anything you want to modify, the ability to iterate really fast on what you're building. And this is a little different, I think, from where some people are jumping to in this space, where I think one idea that's captivated people is: can you talk to your computer? Can you have it build software for you?
就像在Slack上与工程部门或工程师交谈那样,仅通过一个孤立的文本框实现。我们不看好这种模式的原因之一在于延迟问题,但更关键的是它会牺牲大量控制权。在文本框中描述需求时很难精确表达,如果只能像与工程部门沟通那样与AI交互,实际上是将无数重要决策权交给了这个机器人——这触及了工程本质的核心。
As if you're talking to an engineering department or an engineer over Slack, and can it just be this isolated text box. And part of the reason we're not excited about that is, you know, some of the stuff we've talked about with latency. But a big piece of the reason we're not excited about it is that it comes with giving up a lot of control. It's much harder to be really specific when you're talking in a text box, and if you're necessarily just going to communicate with the thing like you would communicate with an engineering department, you're actually abdicating tons of really important decisions to this bot. And this gets at, fundamentally, what engineering is.
有些对工程较陌生的人可能认为,需求文档写完工程师只需照搬实现,重点只是用代码让功能落地。但顶尖的工程实践——也是我们热爱的部分——充满无数微观决策:权衡速度、成本等系统要素。只要人类仍是软件设计的主体(而非AI运营公司),我们就坚信人类必须掌握方向盘来主导这些决策。
I think some people who are a little bit more removed from engineering might think of it as: the spec is completely written out, and then the engineers just come and implement. It's just about making the thing happen in code, making the thing exist. But I think a lot of the best engineering, the engineering we enjoy, involves tons of tiny micro-decisions about what exactly you're building, and really hard trade-offs between, you know, speed and cost and all the other things involved in a system. And as long as humans are actually the ones designing the software and specifying what they want built, and it's not just a company run by AIs, we think you'll really want the human in the driver's seat, dictating these decisions.
具体形态尚无定论。有个有趣设想:你可以控制查看代码库的抽象层级,聚焦特定部分;比如用伪代码形式理解代码库,甚至直接编辑伪代码,让变更自动映射到正式编程层。你保留对软件逻辑任何部分的指向能力,保留编程中的流式文本编辑体验。
And so the jury's still out on what that looks like. I think one weird idea for what that could look like is that you can control the level of abstraction at which you view a code base, and you can point at specific parts of a code base. Maybe you digest a code base by looking at it in the form of pseudocode, and you can actually edit that pseudocode too, and then have changes get made down at the formal programming level. You can gesture at any piece of logic in your software, and you keep the text editing component of programming.
你既能深入代码细节,也能在更高抽象层操作,同时获得巨大效率提升。
You keep control: you can even go down into the code, or you can go to higher levels of abstraction, while also getting these big productivity gains.
如果能自由切换抽象层级就太好了。
It'd be nice if you can go up and down the the abstraction stack.
没错。虽然具体实现还有很多细节待探索——这个构想还很模糊,时间会验证其可行性。但我们认为"人类主导""控制力"和"速度"这些原则至关重要。像Arvid之前提到的,对某些编程风格,比如一个明确定义的bug,你可以以聊天机器人的方式把它交出去处理。
Yeah. And there are a lot of details to figure out there; it's sort of a fuzzy idea. Time will tell if it actually works. But these principles of control and speed, with the human in the driver's seat, we think, are really important. We think for some things, like Arvid mentioned before, for some styles of programming, you can kind of hand it off, chatbot style, you know, if you have a bug that's really well specified.
但这不代表大多数编程场景,也不是多数人珍视的编程工作。
But that's not most of programming, and that's also not most of the programming we think a lot of people value.
编程的核心技能呢?现在很多年轻人担忧:我热爱编程,但选择这个职业还有前途吗?你认为编程技能会根本性改变吗?
What about the fundamental skill of programming? There are a lot of people, young people, right now who are kind of scared, because they love programming, but they're scared about, will I be able to have a future if I pursue this career path? Do you think the very skill of programming will change fundamentally?
我认为当下正是构建软件最令人振奋的时代。
I actually think this is a really, really exciting time to be building software.
是啊。
Yeah.
就像我们还记得2013年、2012年那会儿编程是什么样子。那时候有太多冗余代码、模板文件,还有各种需要查阅的复杂问题。虽然现在这些依然存在,但如今的编程比那时有趣多了。
Like, we remember what programming was like in, you know, 2013, 2012, whatever it was. And there was just so much more cruft and boilerplate, and, you know, looking up something really gnarly. And that stuff still exists, it's definitely not at zero, but programming today is way more fun than back then.
现在我们真正触及到了编程乐趣的核心。所有吸引人们编程的特质——比如快速构建的能力、开发速度、个人掌控感——这些都被放大了无数倍。我认为未来对软件开发者来说会是个非常非常有趣的时代,技能要求可能也会改变。人们的审美和创意想法将被放大,而模板文本编辑甚至谨慎性可能变得不那么重要——尽管后者在当下仍然很关键。
It's like we're really getting down to the delight concentration. All the things that really draw people to programming, for instance, this element of being able to build things really fast, and speed, and also individual control, all of those are just being turned up a ton. And so I think it's going to be a really, really fun time for people who build software. I think the skills will probably change too. I think that people's taste and creative ideas will be magnified, and it will be maybe a little bit less about boilerplate text editing, maybe even a little bit less about carefulness, which I think is really important today.
如果你是个程序员,未来会好玩得多。
If you're a programmer, I think it'll be a lot more fun.
你们怎么看?
What do you guys think?
我同意。最近我们想把代码库从Node.js的异步本地存储(众所周知性能不佳)迁移到上下文对象,这个影响整个代码库的大工程,即使借助现在的AI工具,我和Swal也花了五天时间。
I agree. I'm very excited to be able to change things. Like, one thing that happened recently was, we wanted to do a relatively big migration of our code base. We were using async local storage in Node.js, which is known to be not very performant, and we wanted to migrate to a context object. And this is a big migration that affects the entire code base.
我特别期待未来只需要展示几个示例,AI就能自动应用到所有相关位置,遇到新情况时提示'这是个新例子,该怎么处理?',我给出具体方案后,可能十分钟就能完成。这样迭代速度会快得多,不需要前期过度设计,因为试错成本变得极低。
And Sualeh and I spent, I don't know, five days working through this, even with today's AI tools. And I am really excited for a future where I can just show a couple of examples, and then the AI applies that to all of the locations, and then it highlights, oh, this is a new example, what should I do? And then I show exactly what to do there, and then that can be done in, like, ten minutes. And then you can iterate much, much faster. Then you don't have to think as much upfront, standing at the blackboard and thinking, exactly how are we going to do this, because the cost is so high.
你可以先尝试某个方案,发现不符合预期时立即调整。所以没错,未来的编程会非常有趣。
But you can just try something first, and you realize, oh, this is not actually exactly what I want, and then you can change it instantly afterwards. And so, yeah, I think being a programmer in the future is going to be a lot of fun.
对。我特别认同这个观点:编程通常有两种方式——要么前期绞尽脑汁设计完美方案再用有限时间实现;要么直接动手尝试,快速迭代。后者显然更有趣。
Yeah. I really like that point. It feels like, a lot of the time with programming, there are two ways you can go about it. One is, you think really hard, carefully, upfront about the best possible way to do it, and then you spend your limited engineering time actually implementing it. But I much prefer just getting into the code, taking a crack at it, seeing how it kind of lays out, and then iterating really quickly on that. That feels more fun.
没错。光是自动生成模板代码就很棒,这样就能专注处理复杂的设计决策。迁移这个例子很酷——大语言模型似乎能实现编程语言间的转换,或者说广义上的迁移。
Yeah. Just being able to generate the boilerplate is great, so you can focus on the difficult, nuanced design decisions. Migration, I feel like this is a cool one. It seems like large language models are able to basically translate from one programming language to another, or, like, migrate in the general sense of what migrate is.
但那只是当前的情况。我的意思是,这种恐惧源于随着模型越来越强大,人类需要做出的创造性决策会越来越少。未来是否会发展到我们主要在自然语言的设计空间里操作,而自然语言成为主要的编程语言?我想通过请教的方式问这个问题——如果有人现在对编程感兴趣,你认为他们应该学习什么?
But that's in the current moment. So, I mean, the fear has to do with: okay, as these models get better and better, you're making fewer and fewer creative decisions. And is it going to move to a place where you're operating in the design space of natural language, where natural language is the main programming language? And I guess I could ask that by way of advice. Like, if somebody's interested in programming now, what do you think they should learn?
比如说,你们是从Java起步的,还有...我忘了...哦对,还有PHP。
Like, say, you guys started in some Java, and, I forget, oh, some PHP.
PHP?是Objective-C。Objective-C。没错。
PHP? Objective-C. Objective-C. There you go.
是啊。说到底我们都知道JavaScript会胜出。而且不是TypeScript,就是原生的JavaScript。
Yeah. I mean, in the end, we all know JavaScript is going to win. And not TypeScript. It's just going to be, like, vanilla JavaScript.
它就是
It's just
即将
going
吞噬整个
to eat the
世界,可能还会捎带点儿PHP。这也引出了另一个问题——我记得高德纳提出过某个比例的人口是极客,编程需要特定的心理特质。但感觉这个范围正在扩大,未来能做出优秀编程作品的人群类型可能会更广泛。
world, and maybe a little bit of PHP. And, I mean, it also brings up the question of, I think Don Knuth has this idea that some percent of the population is geeks, and there's a particular kind of psychology and mind required for programming. And it feels like, more and more, that expands. The kind of person who can do great programming might expand.
我认为不同的人编程动机各异,但真正顶尖的程序员可能是那些纯粹热爱编程的人。比如我们团队有些人下班回家后,会立即打开Cursor整夜做自己的项目,熬到凌晨三点。他们难过时说'我需要写代码'。这种对编程的痴迷与热爱,我认为才能造就最优秀的程序员。这类人会深入探究事物运作的每个细节。
I think different people do programming for different reasons, but maybe the best programmers are the ones that absolutely love programming. For example, there are folks on our team who, literally, when they get back from work, boot up Cursor and start coding on their side projects for the entire night, and they stay up till 3AM doing that. And when they're sad, they say, "I just really need to code." I think there's that level of programmer, where this obsession and love of programming makes really the best programmers. And these types of people will really get into the details of how things work.
我想问的是——就拿这个程序员举例,当超级Tab(愿Tab神保佑)成功实现后,你不断按Tab键时...
I guess the question I'm asking is, let's think about that exact programmer. When the super tab, the super awesome praise-be-to-the-tab, succeeds, and you keep pressing tab
团队里那个人比谁都爱咒骂Tab键。
That person in the team loves to curse the tab more than anybody else.
对吧?
Right?
没错。而且这不仅仅是按Tab键那么简单——用'按Tab键'来描述其实是个偷懒又抓眼球的说法,懂吧?嗯哼。实际上当你按Tab键时,你是在持续注入意图。有时你会拒绝它,有时又会多输入几个字符。
Yeah. And it's also not just pressing tab. "Pressing tab" is the easy way to say it, the catchy catchphrase, you know? Mhmm. But what you're actually doing when you're pressing tab is injecting intent all the time. Sometimes you're rejecting it; sometimes you're typing a few more characters.
这种方式其实是在塑造正在被创造的内容。我认为编程将会更聚焦于'你到底想创造什么'这个本质问题。
And that's the way you're sort of shaping the thing that's being created. And I think programming will change a lot, toward just: what is it that you want to make?
这像是更高带宽的交流。与计算机的沟通带宽会变得越来越高,相比之下单纯敲键盘传达意图的带宽就低得多。
It's sort of higher bandwidth. The communication to the computer just becomes higher and higher bandwidth, as opposed to just typing, which is much lower bandwidth than communicating intent.
说到这个,正好引出你那篇《工程天才宣言》。我们作为应用研究实验室,正在打造超高效率的人机协作系统。所以这种混合元素...
I mean, this goes to your manifesto, titled "Engineering Genius": "We are an applied research lab building extraordinarily productive human-AI systems." So, speaking to this hybrid element.
嗯哼。
Mhmm.
首先,我们正在构建未来工程师——一个比普通工程师高效十倍的人机混合程序员。这种混合工程师将能毫不费力地掌控代码库,杜绝低效输入。他们能以思考速度在最复杂系统中迭代,结合AI与人类智慧,将超越最纯粹的AI系统。我们是一群研究员和工程师。
"To start, we're building the engineer of the future: a human-AI programmer that's an order of magnitude more effective than any one engineer. This hybrid engineer will have effortless control over their code base and no low-entropy keystrokes. They will iterate at the speed of their judgment, even in the most complex systems. Using a combination of AI and human ingenuity, they will outsmart and out-engineer the best pure-AI systems. We are a group of researchers and engineers.
我们开发软件和模型,探索实用与可能的边界。现有成果已改善数十万程序员的工作生活。在此过程中,我们至少会让编程变得更有趣。感谢今天的对话,谢谢。
We build software and models to invent at the edge of what's useful and what's possible. Our work has already improved the lives of hundreds of thousands of programmers." And on the way to that, we'll at least make programming more fun. So thank you for talking today. Thank you.
谢谢邀请。
Thanks for having us.
谢谢。谢谢。
Thank you. Thank you.
感谢收听本期与Michael、Sualeh、Arvid和Aman的对话。若想支持本播客,请查看简介中的赞助商信息。现在,让我用一句在Reddit上看到的既随机又搞笑、或许还颇具深意的编程名言作为结束语:没有什么比一个能用的临时方案更永久的了。感谢您的收听,期待下次再见。
Thanks for listening to this conversation with Michael, Sualeh, Arvid, and Aman. To support this podcast, please check out our sponsors in the description. And now, let me leave you with a random, funny, and perhaps profound programming quote I saw on Reddit: "Nothing is as permanent as a temporary solution that works." Thank you for listening, and hope to see you next time.