本集简介
双语字幕
有请特斯拉前人工智能总监安德烈·卡帕西。哇,现场人真多。大家好。好的。
Please welcome former director of AI at Tesla, Andrej Karpathy. Wow. A lot of people here. Hello. Okay.
是的。今天我很高兴能在这里与大家探讨人工智能时代的软件。听说在座许多是本科生、硕士生、博士生等即将进入行业的学生。我认为当下正是进入行业极其独特且充满趣味的时刻,根本原因在于软件正在再次发生变革。
Yeah. So I'm excited to be here today to talk to you about software in the era of AI. And I'm told that many of you are students, like bachelor's, master's, PhD, and so on, and you're about to enter the industry. And I think it's actually like an extremely unique and very interesting time to enter the industry right now. And I think fundamentally the reason for that is that software is changing again.
我说'再次'是因为其实我做过这个演讲。但问题是软件持续变化,所以我总有新材料来创作新演讲。我认为这种变革是根本性的——粗略来说,软件在基础层面上已有七十年未曾巨变,而最近几年却经历了约两次快速转型。
And I say again because I actually gave this talk already. But the problem is that software keeps changing, so I actually have a lot of material to create new talks. And I think it's changing quite fundamentally. I think roughly speaking, software has not changed much on such a fundamental level for seventy years. And then it's changed I think about twice quite rapidly in the last few years.
因此现在有海量的工作待完成,无数软件需要编写和重写。让我们看看软件领域的样貌——如果将其视作软件地图,这个名为GitHub地图的工具非常酷,它展示了所有已编写的软件,这些是让计算机在数字空间执行任务的指令集。
And so there's just a huge amount of work to do, a huge amount of software to write and rewrite. So let's take a look at maybe the realm of software. So if we kind of think of this as like the map of software, this is a really cool tool called map of GitHub. This is kind of like all the software that's written. These are instructions to the computer for carrying out tasks in the digital space.
放大后能看到各类代码库,所有现存代码都在这里。几年前我注意到软件正在演变,出现了一种新型软件,当时我称之为软件2.0。其核心在于:软件1.0是人类为计算机编写的代码,而软件2.0本质上是神经网络(特别是其权重参数)——你并非直接编写代码,而是通过调整数据集并用优化器生成神经网络的参数。
So if you zoom in here, these are all different kinds of repositories, and this is all the code that has been written. And a few years ago, I kind of observed that, software was kind of changing, and there was kind of like a new new type of software around, and I called this software two point o at the time. And the idea here was that software one point o is the code you write for the computer. Software two point o are basically neural networks, and in particular, the weights of a neural network. And you're not writing this code directly, you are most, you are more kind of like tuning the data sets and then you're running an optimizer to create the parameters of this neural net.
当时神经网络还被视作决策树之类的分类器,因此这种框架划分更为贴切。如今在软件2.0领域,我们有了类似GitHub的平台——我认为Hugging Face就是软件2.0界的GitHub,还有Model Atlas可以可视化所有模型代码。
And I think at the time, neural nets were kind of seen as just a different kind of classifier, like a decision tree or something like that. And so I think this framing was a lot more appropriate. And now actually what we have is kind of like an equivalent of GitHub in the realm of software two point zero. And I think Hugging Face is basically the equivalent of GitHub in software two point zero. And there's also Model Atlas, and you can visualize all the code written there.
顺带一提,中间那个巨大圆点代表图像生成器Flux的参数。每当有人在Flux模型上微调LoRA时,就相当于在这个空间创建了一次git提交,诞生了新型图像生成器。简言之:软件1.0是编程计算机的代码,软件2.0是编程神经网络的权重。比如这个AlexNet图像识别神经网络。
In case you're curious, by the way, the giant circle, the point in the middle, these are the parameters of Flux, the image generator. And so anytime someone tunes a LoRA on top of a Flux model, you basically create a git commit in this space, and you create a different kind of image generator. So basically what we have is: software one point o is the computer code that programs a computer; software two point o are the weights which program neural networks. And here's an example of AlexNet, an image recognizer neural network.
迄今为止,我们熟悉的神经网络都像是功能固定的计算机(如图像分类器)。但根本性转变在于:大语言模型让神经网络变得可编程。我认为这催生了一种全新计算机形态,值得赋予其软件3.0的新称谓。
Now, so far, all of the neural networks that we've been familiar with until recently were kind of like fixed function computers, like image to categories or something like that. And I think what's changed, and I think it's a quite fundamental change, is that neural networks became programmable with large language models. And so I see this as something quite new and unique; it's a new kind of computer. And so in my mind, it's worth giving it a new designation of software three point o.
本质上,你的提示词现在就是编程LLM的程序。值得注意的是,这些提示词是用英语写的——这成了种非常有趣的编程语言。举例来说,要做情感分类:你可以编写Python代码,可以训练神经网络,也可以提示大语言模型。比如这个少量示例提示,你可以修改它以不同方式'编程'计算机。
And basically, your prompts are now programs that program the LLM. And, remarkably, these prompts are written in English. So it's kind of a very interesting programming language. So maybe to summarize the difference, if you're doing sentiment classification, for example, you can imagine writing some amount of Python to basically do sentiment classification, or you can train a neural net, or you can prompt a large language model. So here, this is a few shot prompt and you can imagine changing it and programming the computer in a slightly different way.
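The three ways of doing sentiment classification mentioned above can be sketched side by side. The word lists and the few-shot prompt text below are illustrative stand-ins, not taken from the talk.

```python
# The same task, sentiment classification, in two of the paradigms.

# Software 1.0: explicit rules a human writes for the computer.
POSITIVE = {"great", "love", "loved", "excellent"}
NEGATIVE = {"bad", "hate", "terrible", "waste"}

def sentiment_1_0(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score >= 0 else "negative"

# Software 3.0: an English few-shot prompt "programs" the LLM.
# You would send the formatted string to any chat/completions endpoint;
# editing the examples below is how you reprogram the classifier.
FEW_SHOT_PROMPT = """Classify the sentiment of each review.

Review: I loved this movie.
Sentiment: positive

Review: Total waste of time.
Sentiment: negative

Review: {review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(review="The plot dragged on forever.")
```

Software 2.0 would sit between these: train a small classifier on labeled examples rather than writing rules or a prompt.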
总结来说:软件1.0和2.0之外,你们可能注意到GitHub上的代码不再纯是代码,还混杂着大量英语。我认为这正在形成新型代码类别——不仅是新编程范式,更震撼的是它用我们的母语英语实现。几年前当我意识到这点时发了条推文,引起广泛关注(至今仍置顶):我们竟然在用英语编程计算机。在特斯拉研发自动驾驶时,我们努力让车辆学会行驶...
So basically we have software one point o, software two point zero, and I think we're seeing, maybe you've seen, that a lot of GitHub code is not just code anymore; there's a bunch of English interspersed with code. And so I think there's a growing category of a new kind of code. So not only is it a new programming paradigm, it's also remarkable to me that it's in our native language of English. And so when this blew my mind, a few years ago now, I tweeted this, and I think it captured the attention of a lot of people, and this is my currently pinned tweet: that remarkably, we're now programming computers in English. Now, when I was at Tesla, we were working on the autopilot and we were trying to get the car to drive.
当时我展示了这张幻灯片,你可以想象汽车的输入在底部,它们通过一个软件栈来生成转向和加速指令。我当时注意到自动驾驶系统中有大量C++代码(即软件1.0版本),同时还有些神经网络负责图像识别。我观察到随着自动驾驶系统的改进,神经网络的能力和规模都在增长。与此同时,所有C++代码正被逐步删除——许多原本由1.0版本实现的功能都迁移到了2.0版本。例如,不同摄像头间跨图像、跨时间的信息缝合工作现在都由神经网络完成,这使得我们能删除大量代码。
And I sort of showed this slide at the time, where you can imagine that the inputs to the car are on the bottom and they're going through a software stack to produce the steering and acceleration. And I made the observation at the time that there was a ton of C plus plus code around in the autopilot, which was the software one point o code, and then there were some neural nets in there doing image recognition. And I kind of observed that over time, as we made the autopilot better, the neural network grew in capability and size, and in addition to that, all the C plus plus code was being deleted. A lot of the capabilities and functionality that was originally written in one point o was migrated to two point o. So as an example, a lot of the stitching up of information across images from the different cameras and across time was done by the neural network, and we were able to delete a lot of code.
因此软件2.0栈实质上吞噬了整个自动驾驶软件栈。当时我觉得这非常了不起。现在我们再次见证类似现象:一种新型软件正在吞噬整个技术栈。我们面临三种完全不同的编程范式。如果你刚进入这个行业,精通所有这些范式会很有优势,因为它们各有优劣——你可能需要根据功能需求选择用1.0、2.0或3.0范式来实现。
And so the software two point o stack quite literally ate through the software stack of the autopilot. So I thought this was really remarkable at the time. And I think we're seeing the same thing again, where basically we have a new kind of software and it's eating through the stack. We have three completely different programming paradigms. And I think if you're entering the industry, it's a very good idea to be fluent in all of them, because they all have slight pros and cons, and you may want to program some functionality in one point o or two point o or three point o.
你是要训练神经网络?还是直接提示大语言模型?或者应该用显式编码实现?我们都需要做这些决策,甚至可能需要在这些范式间灵活切换。接下来我想先讨论大语言模型——如何理解这种新范式及其生态系统形态。
Are you going to train a neural net? Are you going to just prompt an LLM? Should this be a piece of code that's explicit, etcetera? So we all have to make these decisions and actually potentially fluidly transition between these paradigms. So what I want to get into now is first I want to, in the first part, talk about LLMs and how to kind of like think of this new paradigm and the ecosystem and what that looks like.
这种新型计算机究竟是什么样子?生态系统又呈现何种形态?多年前吴恩达的一句话让我印象深刻——他当时说'AI就是新电力'。虽然他现在应该在我之后演讲,但我认为这个比喻确实捕捉到了某些深刻内涵。
Like, what is this new computer? What does it look like? And what does the ecosystem look like? I was struck by this quote from Andrew Ng from actually many years ago now, I think. And I think Andrew is going to be speaking right after me.
目前大语言模型确实具有公用事业的特征。像OpenAI、Gemini、Anthropic等实验室投入资本支出训练模型,这类似于建设电网;然后通过API以运营支出的方式向我们提供智能服务,采用按百万token计费的计量访问模式。
But he said at the time, AI is the new electricity. And I do think that it kind of captures something very interesting, in that LLMs certainly feel like they have properties of utilities right now. So LLM labs like OpenAI, Gemini, Anthropic, etcetera, they spend CapEx to train the LLMs, and this is kind of equivalent to building out a grid. And then there's OpEx to serve that intelligence over APIs to all of us. And this is done through metered access, where we pay per million tokens or something like that.
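The metered-access billing mentioned above is easy to make concrete: cost scales with tokens consumed, like a utility meter reading. The per-million-token prices below are illustrative placeholders, not any provider's actual rates.

```python
# Sketch of metered API access: you pay per token, like a utility meter.
# Prices here are made-up examples; real rates vary by model and vendor.

def api_cost_usd(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# e.g. 50k input + 10k output tokens at $3/M in and $15/M out:
cost = api_cost_usd(50_000, 10_000, 3.0, 15.0)  # 0.15 + 0.15 = 0.30
```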
我们对这个API有很多实用性需求。我们要求低延迟、高可用性、稳定的质量等等。在电力领域,你会有一个转换开关,可以在电网、太阳能、电池或发电机之间切换电源。而在大语言模型领域,我们有 OpenRouter 这类工具,可以轻松在不同类型的大语言模型之间切换。
And we have a lot of demands that are very utility-like demands out of this API. We demand low latency, high uptime, consistent quality, etcetera. In electricity, you would have a transfer switch, so you can transfer your electricity source between grid, solar, battery, or generator. In LLMs, we have maybe OpenRouter, to easily switch between the different types of LLMs that exist.
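The "transfer switch" idea above can be sketched as code: the app talks to one interface, and a single config flip changes which LLM provider serves it. The provider names and the echo backends below are illustrative placeholders; a real setup would wrap each vendor's API client behind the same interface.

```python
# Sketch of a transfer switch for LLM providers. Backends are stubs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    complete: Callable[[str], str]  # prompt -> completion

def make_echo_provider(name: str) -> Provider:
    # Stand-in backend; a real one would call the vendor's API.
    return Provider(name, lambda prompt: f"[{name}] {prompt[:40]}")

PROVIDERS = {
    "gpt": make_echo_provider("gpt"),
    "claude": make_echo_provider("claude"),
    "gemini": make_echo_provider("gemini"),
}

def complete(prompt: str, provider: str = "gpt") -> str:
    # Switching providers is one argument: the transfer switch.
    return PROVIDERS[provider].complete(prompt)
```

Because the interface is shared, the application code above the switch never changes when the provider does, which is what makes the electricity analogy work.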
因为大语言模型是软件,它们不会争夺物理空间。所以基本上可以有六家电力供应商,你可以在它们之间切换,对吧?因为它们不会以如此直接的方式竞争。我觉得特别有趣的是,最近几天很多大语言模型都宕机了,人们有点束手无策无法工作。这让我想到,当最先进的大语言模型宕机时,世界上就像发生了智能停电。
Because the LLMs are software, they don't compete for physical space. So it's okay to have basically like six electricity providers, and you can switch between them, right? Because they don't compete in such a direct way. And I think what's also really fascinating, and we saw this in the last few days actually, is that a lot of the LLMs went down and people were kind of stuck and unable to work. And I think it's kind of fascinating to me that when the state of the art LLMs go down, it's actually kind of like an intelligence brownout in the world.
这有点像电网电压不稳定时,整个星球都变笨了。我们对这些模型的依赖已经非常显著,而且我认为还会继续增长。但大语言模型不仅具有公用事业的特性,我认为它们也有芯片工厂的某些特性。原因是构建大语言模型所需的资本支出实际上相当巨大。
It's kind of like when the voltage is unreliable in the grid, and the planet just gets dumber. And our reliance on these models, which is already really dramatic, I think will continue to grow. But LLMs don't only have properties of utilities. I think it's also fair to say that they have some properties of fabs. And the reason for this is that the CapEx required for building LLMs is actually quite large.
这不仅仅是建造一些发电站那么简单,对吧?你需要投入巨额资金,而且我认为技术树和技术本身正在快速发展。我们处在一个拥有深层技术树、研发和集中在大语言模型实验室内部机密的世界里。但我觉得这个类比也有点模糊,因为正如我提到的,这是软件。软件的防御性较低,因为它具有很高的可塑性。
It's not just like building some power station or something like that, right? You're investing a huge amount of money, and I think the tech tree for the technology is growing quite rapidly. So we're in a world where we have sort of deep tech trees, research and development, and secrets that are centralizing inside the LLM labs. But I think the analogy muddies a little bit also because, as I mentioned, this is software. And software is a bit less defensible because it is so malleable.
所以我觉得这是个值得思考的有趣话题。你可以做很多类比,比如4纳米工艺节点可能类似于具有特定最大浮点运算能力的集群。当你使用英伟达GPU只做软件不做硬件时,这有点像无晶圆厂模式。但如果你像谷歌那样自己建造硬件并在TPU上训练,那就有点像英特尔拥有自己晶圆厂的模式。我认为这里有一些合理的类比。
And so I think it's just an interesting kind of thing to think about potentially. There's many analogies you can make, like a four nanometer process node maybe is something like a cluster with certain max flops. You can think about when you're using NVIDIA GPUs and you're only doing the software and you're not doing the hardware, that's kind of like the fabless model. But if you're actually also building your own hardware and you're training on TPUs if you're Google, that's kind of like the Intel model where you own your fab. So I think there's some analogies here that make sense.
但实际上我认为最贴切的类比或许是,在我眼中大语言模型(LLM)与操作系统有着极强的相似性。它们不仅仅是电力或自来水那样的基础商品,而是日益复杂的软件生态系统。对吧?
But actually I think the analogy that makes the most sense perhaps is that in my mind LLMs have very strong kind of analogies to operating systems. In that, this is not just electricity or water. It's not something that comes out of the tap as a commodity. These are now increasingly complex software ecosystems. Right?
因此它们不像电力那样是简单商品。令我感到有趣的是,这个生态系统的演变方式非常相似——就像存在少数闭源提供商(如Windows或Mac OS)与开源替代方案(如Linux)。对于LLM领域,我们同样有几家竞争的闭源提供商,而Llama生态系统目前可能最接近未来可能发展成Linux般的存在。当然现在还非常早期,因为这些只是简单的语言模型,但我们已经开始看到它们会变得复杂得多。
So they're not just simple commodities like electricity. And it's kind of interesting to me that the ecosystem is shaping up in a very similar kind of way, where you have a few closed source providers like Windows or Mac OS, and then you have an open source alternative like Linux. And I think for LLMs as well, we have a few competing closed source providers, and then maybe the Llama ecosystem is currently a close approximation to something that may grow into something like Linux. Again, I think it's still very early, because these are just simple LLMs, but we're starting to see that these are going to get a lot more complicated.
这不仅关乎LLM本身,还涉及工具使用、多模态交互等整套运作机制。当我前段时间意识到这点时,尝试将其勾勒出来——在我看来LLM就像新型操作系统,而LLM本身则是新型计算机,相当于中央处理器(CPU)的角色。
It's not just about the LLM itself; it's about all the tool use and multimodalities and how all of that works. And so when I sort of had this realization a while back, I tried to sketch it out. And it kind of seemed to me like LLMs are kind of like a new operating system, right? So the LLM is a new kind of computer. It's kind of like the CPU equivalent.
上下文窗口类似于内存,LLM则负责协调内存与计算资源来解决问题。从这个角度看,它确实非常像操作系统。再举些例子:当你想下载应用时,比如访问VS Code官网下载,可以选择在Windows、Linux或Mac上运行。同理,你可以将Cursor这类LLM应用运行在GPT、Claude或Gemini系列上。
The context windows are kind of like the memory, and then the LLM is orchestrating memory and compute for problem solving using all of these capabilities here. And so definitely, if you look at it, it looks very much like an operating system from that perspective. A few more analogies: for example, if you want to download an app, say I go to VS Code and I go to download, you can download VS Code and you can run it on Windows, Linux, or Mac. In the same way, you can take an LLM app like Cursor and you can run it on the GPT, Claude, or Gemini series.
对吧?只是下拉选择而已,这方面也很相似。另一个触动我的类比是:我们正处于类似1960年代的状态——对于这种新型计算机,LLM算力仍非常昂贵。这导致LLM必须集中在云端运行,而我们都是通过网络交互的瘦客户端。
Right? It's just a dropdown. So it's kind of similar in that way as well. More analogies that strike me: we're kind of in this nineteen sixties-ish era, where LLM compute is still very expensive for this new kind of computer. And that forces the LLMs to be centralized in the cloud, and we're all just sort of thin clients that interact with it over the network.
没人能完全占用这些计算机资源,因此采用分时共享模式很合理——我们只是云端批量处理的一个维度。这完全复现了那个时代的计算机形态:操作系统在云端,所有内容流式传输,采用批处理。个人计算革命尚未发生,因为经济上还不划算。不过已有人在做尝试——
And none of us have full utilization of these computers and therefore it makes sense to use time sharing where we're all just a dimension of the batch when they're running the computer in the cloud. And this is very much what computers used to look like during this time. The operating systems were in the cloud, everything was streamed around and there was batching. And so the personal computing revolution hasn't happened yet because it's just not economical, it doesn't make sense. But I think some people are trying.
比如Mac mini意外地适合某些LLM运行,因为如果是批量单次推理,这完全受限于内存容量反而可行。这些可能是个人计算的早期征兆,但尚未真正成型。最终形态尚不明确,或许在座各位将参与定义其形态。还有个类比:当我直接用文本与ChatGPT等LLM对话时,感觉像通过终端与操作系统交流。
And it turns out that Mac Minis, for example, are a very good fit for some of the LLMs, because if you're doing batch-one inference, it's all super memory bound, so this actually works. And I think these are maybe some early indications of personal computing, but this hasn't really happened yet. It's not clear what this looks like. Maybe some of you get to invent what this is or how it works or what this should be. Maybe one more analogy I'll mention is that whenever I talk to ChatGPT or some LLM directly in text, I feel like I'm talking to an operating system through the terminal.
纯文本就是直接访问操作系统的方式,而图形用户界面(GUI)尚未以通用形式出现。ChatGPT需要不同于文字气泡的GUI吗?稍后我们会看到某些应用具备GUI,但尚未形成跨任务的统一界面。LLM在某些独特方面与传统操作系统和早期计算不同,我特别注意到一个本质区别——
Like, it's just text; it's direct access to the operating system. And I think a GUI hasn't really been invented yet in a general way. Like, should ChatGPT have a GUI different than just the text bubbles? Certainly some of the apps that we're going to go into in a bit have GUIs, but there's no GUI across all the tasks, if that makes sense. There are some ways in which LLMs are different from operating systems and from early computing in some fairly unique ways. And I wrote about this one particular property that strikes me as very different this time around.
LLM颠覆了技术扩散的传统方向。无论是电力、加密技术、计算机、航空、互联网还是GPS,历史上突破性技术总是先由政府和企业采用(因为新颖昂贵),之后才普及到消费者。但LLM完全相反——早期计算机用于弹道计算等军事用途,而LLM却在教人怎么煮鸡蛋。
It's that LLMs flip the direction of technology diffusion that is usually present in technology. So, for example, with electricity, cryptography, computing, flight, the Internet, GPS, lots of new transformative technologies, typically it is the government and corporations that are the first users, because the technology is new and expensive, etcetera, and it only later diffuses to consumers. But I feel like LLMs are flipped around. So maybe with early computers it was all about ballistics and military use, but with LLMs it's all about how do you boil an egg or something like that.
这确实是我的主要使用场景。神奇的是我们拥有这种魔法计算机,它却在帮我解决煮鸡蛋的问题,而非协助政府进行弹道计算等尖端科技。实际上企业和政府在这项技术的应用上落后于普通大众,这种倒置现象非常有趣。
This is certainly a lot of my use. And so it's really fascinating to me that we have a new magical computer and it's helping me boil an egg. It's not helping the government do something really crazy like military ballistics or some special technology. Indeed, corporations and governments are lagging behind all of us in the adoption of these technologies. So it's just backwards.
我认为这或许启示了我们应如何运用这项技术,以及最初可能开发哪些应用程序等。总结目前观点:LLM实验室像芯片工厂一样"制造"(fab)大语言模型——我认为这个表述是准确的。但大语言模型实则是复杂的操作系统,它们相当于计算机史上的二十世纪六十年代,我们正在重走计算技术发展之路。目前它们通过分时系统以公用事业般的方式分布式提供。
And I think it informs maybe some of the uses of how we want to use this technology, or what some of the first apps might be, and so on. So in summary so far: LLM labs fab LLMs, and I think that's accurate language to use. But LLMs are complicated operating systems. They're circa nineteen sixties in computing, and we're redoing computing all over again. And they're currently available via time sharing and distributed like a utility.
前所未有的新情况是,它们不再被少数政府和企业垄断,而是掌握在我们每个人手中——因为人人都有电脑,而这本质上只是软件。ChatGPT就像瞬间被传输到数十亿人的电脑里,一夜之间普及开来。这简直疯狂。更让我觉得不可思议的是,现在正是我们进入这个行业、为这些计算机编程的时代,这太魔幻了。
What is new and unprecedented is that they're not in the hands of a few governments and corporations. They're in the hands of all of us, because we all have a computer and it's all just software. And ChatGPT was beamed down to our computers, to billions of people, instantly and overnight. And this is insane. And it's kind of insane to me that this is the case, and now it is our time to enter the industry and program these computers. This is crazy.
因此我认为这非常了不起。在为大语言模型编程前,我们需要花时间思考它们的本质。我特别喜欢探讨它们的心理机制——在我看来,大语言模型就像人类灵魂的随机模拟器。
So I think this is quite remarkable. Before we program LLMs, we have to kind of like spend some time to think about what these things are. And I especially like to kind of talk about their psychology. So the way I like to think about LLMs is that they're kind of like people spirits. They are stochastic simulations of people.
这个模拟器恰巧是自回归 Transformer。Transformer 本质是神经网络,它在词元级别上运作,像齿轮般逐个词元推进,每个词元消耗几乎等量的算力。这个模拟器通过权重参数,基于互联网海量文本数据训练而成。
And the simulator in this case happens to be an autoregressive transformer. So the transformer is a neural net, and it just kind of goes on the level of tokens; it goes chunk, chunk, chunk, chunk, chunk. And there's an almost equal amount of compute for every single chunk. And this simulator, of course, basically has some weights involved, and we fit it to all of the text that we have on the internet and so on.
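The chunk-by-chunk behavior described above is just an autoregressive sampling loop: each token is sampled conditioned on everything so far, and the same forward pass runs at every step. In this minimal sketch, `next_token_probs` is a stand-in for a real transformer forward pass, and the tiny vocabulary is illustrative.

```python
# Minimal sketch of autoregressive generation: one token per step,
# each step conditioned on the whole context so far.
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "down", "."]

def next_token_probs(context):
    # Placeholder for a transformer forward pass. Note that roughly
    # the same amount of compute runs for every token position.
    return [1 / len(VOCAB)] * len(VOCAB)

def generate(prompt_tokens, n_steps=5):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):  # chunk, chunk, chunk...
        probs = next_token_probs(tokens)
        tokens.append(random.choices(VOCAB, weights=probs)[0])
    return tokens

out = generate(["the"])
```

A real model would make `next_token_probs` depend on `context`; the loop structure, and the equal cost per emitted chunk, is the part the talk is pointing at.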
最终形成的模拟器由于基于人类数据训练,会涌现出类人的心理特征。首先你会注意到大语言模型拥有百科全书式的知识记忆,其记忆量远超任何个体人类——这让我想起电影《雨人》,强烈推荐大家观看。
And you end up with this kind of simulator. And because it is trained on humans, it's got this emergent psychology that is human-like. So the first thing you'll notice is, of course, that LLMs have encyclopedic knowledge and memory. And they can remember lots of things, a lot more than any single individual human can, because they read so many things. It actually kind of reminds me of this movie, Rain Man, which I really recommend people watch.
这是部杰作,我挚爱的电影。达斯汀·霍夫曼饰演的自闭症学者拥有近乎完美的记忆力,能记住整本电话簿。我感觉大语言模型与之非常相似。
It's an amazing movie. I love this movie. And Dustin Hoffman here is an autistic savant who has almost perfect memory. So he can read something like a phone book and remember all of the names and phone numbers. And I kind of feel like LLMs are very similar.
它们能轻松记忆SHA哈希值等各类信息,在某些方面确实具备超能力。但也存在认知缺陷:经常产生幻觉,编造内容,缺乏完善的自我认知模型——虽已改进但仍不完美。
They can remember SHA hashes and lots of different kinds of things very, very easily. So they certainly have superpowers in some respects. But they also have a bunch of, I would say, cognitive deficits. So they hallucinate quite a bit and they kind of make up stuff and don't have a very good sort of internal model of self knowledge, not sufficient at least. And this has gotten better but not perfect.
它们展现出锯齿状智能——某些领域超乎人类,却会犯连孩童都不会的错误。比如坚称9.11大于9.9,或说'strawberry'有两个字母r,这些都是经典案例。
They display jagged intelligence. So they're going to be superhuman in some problem solving domains. And then they're going to make mistakes that basically no human will make. Like, they will insist that 9.11 is greater than 9.9, or that there are two r's in strawberry. These are some famous examples.
这些粗糙边缘可能让人栽跟头。它们还患有顺行性遗忘——就像新同事本应随时间积累组织认知,通过睡眠巩固知识,逐步发展专业技能。
But basically, there are rough edges that you can trip on. So that's, I think, also kind of unique. They also kind of suffer from anterograde amnesia. I'm alluding to the fact that if you have a coworker who joins your organization, this coworker will over time learn your organization; they will understand and gain a huge amount of context on the organization, and they go home and they sleep and they consolidate knowledge and they develop expertise over time.
但大语言模型天生不具备这种能力,这在其研发中尚未解决。上下文窗口就像工作记忆,需要直接编程干预——它们不会自动变聪明。建议观看《记忆碎片》和《初恋50次》,两部电影主角的'权重参数'固定,每天早晨'上下文窗口'都会清零。
LLMs don't natively do this, and this is not something that has really been solved in the R and D of LLMs, I think. And so context windows are really kind of like working memory, and you have to program the working memory quite directly, because they don't just get smarter by default. And I think a lot of people get tripped up by the analogies in this way. In popular culture, I recommend people watch these two movies: Memento and 50 First Dates. In both of these movies, the protagonists' weights are fixed and their context windows get wiped every single morning.
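"Programming the working memory directly" can be made concrete with a minimal sketch: you decide what goes into the context window, and anything that doesn't fit is simply gone. The word-count tokenizer and the budget below are crude illustrative stand-ins; real APIs use proper tokenizers and much larger windows.

```python
# Sketch: the context window as working memory you manage yourself.

MAX_CONTEXT_TOKENS = 50  # illustrative budget; real windows are far larger

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(messages, budget=MAX_CONTEXT_TOKENS):
    """Keep the most recent messages that fit the window.
    Whatever is trimmed is forgotten, like the anterograde amnesia in
    the talk: nothing persists unless you explicitly put it back in."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Techniques like summarizing old turns or retrieving notes are all variations on this same manual memory management, because the weights themselves don't update during use.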
当这种情况发生时——而且它们总是频繁遭遇这种情况——去工作或维持人际关系就变得非常棘手。我想指出的另一点是与LLM使用相关的安全性限制。例如,LLM很容易轻信他人,容易受到提示注入攻击的风险,可能会泄露你的数据等等。
And it's really problematic to go to work or have relationships when this happens, and this happens to all of them all the time. I guess one more thing I would point to is security kind of related limitations of the use of LLMs. For example, LLMs are quite gullible. They are susceptible to prompt injection risks. They might leak your data, etcetera.
此外还有许多其他与安全相关的考量。简而言之,你必须同时驾驭这个具有超人能力却又存在诸多认知缺陷的系统。我们该如何编程?如何规避它们的缺陷,同时享受其超凡能力?
And so there are many other security-related considerations. So basically, long story short, you have to simultaneously think through this superhuman thing that has a bunch of cognitive deficits and issues, and yet is extremely useful. So how do we program them? How do we work around their deficits and enjoy their superhuman powers?
现在我想转向讨论如何利用这些模型的机遇,以及其中最大的机会有哪些。以下并非完整清单,只是我认为本次演讲中值得关注的要点。首先让我兴奋的是所谓的'半自主应用'。以编程为例,你固然可以直接使用ChatGPT,通过复制粘贴代码、错误报告等来获取代码片段。
So what I want to switch to now is to talk about the opportunities of how we use these models and what some of the biggest opportunities are. This is not a comprehensive list, just some of the things that I thought were interesting for this talk. The first thing I'm kind of excited about is what I would call partial autonomy apps. So, for example, let's work with the example of coding. You can certainly go to ChatGPT directly and start copy-pasting code around, copy-pasting bug reports and stuff around, and getting code and copy-pasting everything around.
但为什么要这么做?就像你不会直接操作操作系统内核一样,为此专门开发应用才更合理。我想在座很多人都使用Cursor编辑器,我也是。Cursor正是你真正需要的解决方案。
Why would you do that? Why would you go directly to the operating system? It makes a lot more sense to have an app dedicated for this. And so I think many of you use Cursor, I do as well. And Cursor is kind of like the thing you want instead.
你不应该直接使用原始ChatGPT。我认为Cursor是早期LLM应用的优秀范例,它具备所有LLM应用都应具备的特性。特别值得注意的是,它既保留了传统界面让用户手动操作,又通过LLM集成实现了更大规模的自动化处理。这些特性值得重点探讨。
You don't want to just directly go to ChatGPT. And I think Cursor is a very good example of an early LLM app that has a bunch of properties that I think are useful across all the LLM apps. In particular, you'll notice that we have a traditional interface that allows a human to go in and do all the work manually, just as before. But in addition to that, we now have this LLM integration that allows us to go in bigger chunks. And so there are some properties of LLM apps that I think are shared and useful to point out.
第一,LLM承担了大量上下文管理工作;第二,它们能协调多个LLM调用——比如Cursor底层同时使用文件嵌入模型、聊天模型、代码差异应用模型等;第三点可能未被充分重视的,是应用专属GUI的重要性。毕竟没人愿意通过纯文本与操作系统交互。
Number one, the LLMs basically do a ton of the context management. Number two, they orchestrate multiple calls to LLMs, right? So in the case of Cursor, there are, under the hood, embedding models for all your files, the actual chat models, and models that apply diffs to the code, and this is all orchestrated for you. A really big one that I think is maybe not always fully appreciated is the application-specific GUI and the importance of it. Because you don't just want to talk to the operating system directly in text.
纯文本难以阅读解析,某些操作也不适合以文本形式执行。比如查看代码差异时,红绿颜色标注的增删内容远比文本直观;用command+y接受或command+n拒绝修改,也比手动输入高效得多。GUI让人能审计这些易错系统的工作,同时提升效率。
Text is very hard to read, interpret, and understand. And also, you don't want to take some of these actions natively in text. So it's much better to just see a diff as red and green changes, where you can see what's being added or subtracted. It's much easier to just do command Y to accept or command N to reject; I shouldn't have to type it in text. So a GUI allows a human to audit the work of these fallible systems and to go faster.
稍后我会再回到这个观点。最后要强调的是'自主性调节滑块'功能:在Cursor中,你可以仅使用tab补全(保持主导权),选中代码块后按command+k局部修改,按command+l修改整个文件,或是用command+i让AI自主处理整个代码库——这相当于全自主的智能体模式。
I'm going to come back to this point a little bit later as well. And the last feature I want to point out is what I call the autonomy slider. So, for example, in Cursor, you can just do tab completion, where you're mostly in charge. You can select a chunk of code and do command K to change just that chunk of code. You can do command L to change the entire file.
再举个成功案例Perplexity,它具备与Cursor相似的特征:整合多源信息、协调多个LLM、提供审计工作流的GUI(如显示引用来源),以及自主性调节滑块——你可以快速搜索、深度研究,或让系统十分钟后返回完整报告。
Or you can do command I, which just, you know, lets it rip and do whatever it wants in the entire repo. And that's the sort of full-autonomy, agentic version. And so you are in charge of the autonomy slider, and depending on the complexity of the task at hand, you can tune the amount of autonomy that you're willing to give up for that task. Maybe to show one more example of a fairly successful LLM app: Perplexity. It also has very similar features to what I just pointed out in Cursor. It packages up a lot of the information, it orchestrates multiple LLMs, and it's got a GUI that allows you to audit some of its work.
这些不同模式本质上是向工具让渡不同程度的自主权。我不禁思考:未来大多数软件是否都将变得半自主化?这究竟会呈现怎样的形态?
So for example, it will cite sources and you can imagine inspecting them and it's got an autonomy slider. You can either just do a quick search or you can do research or you can do deep research and come back ten minutes later. So this is all just varying levels of autonomy that you give up to the tool. So I guess my question is, I feel like a lot of software will become partially autonomous. I'm trying to think through like what does that look like?
对于许多维护产品和服务的你们来说,如何让产品服务实现部分自主化?LLM能否像人类一样观察所有事物?能否执行人类所有可能的行动?人类能否监督并持续参与这一活动?因为这些系统仍不完美,存在缺陷。
And for many of you who maintain products and services, how are you going to make your products and services partially autonomous? Can an LLM see everything that a human can see? Can an LLM act in all the ways that a human could act? And can humans supervise and stay in the loop of this activity? Because again, these are fallible systems that aren't yet perfect.
Photoshop之类的差异对比功能该如何呈现?当前传统软件充斥着为人类设计的各种开关控件,这些都需要改造以适应LLM的交互需求。我想强调一个未被充分重视的要点:我们正在与AI协同工作。通常AI负责生成内容,而人类负责验证。
And what does a diff look like in Photoshop or something like that? Also, a lot of the traditional software right now has all these switches and all this stuff designed for humans. All of this has to change and become accessible to LLMs. So one thing I want to stress with a lot of these LLM apps, which I'm not sure gets as much attention as it should, is that we're now kind of cooperating with AIs. And usually they are doing the generation and we as humans are doing the verification.
加速这个循环符合我们的利益,能极大提升工作效率。我认为有两个主要实现途径:首先是大幅提升验证速度。图形用户界面(GUI)在这方面至关重要,它能调用我们大脑中的视觉处理能力——阅读文字费神费力,而视觉感知则轻松高效。
It is in our interest to make this loop go as fast as possible, so we're getting a lot of work done. There are two major ways that I think this can be done. Number one, you can speed up verification a lot. And I think GUIs, for example, are extremely important for this, because a GUI utilizes the computer vision GPU in all of our heads. Reading text is effortful and it's not fun, but looking at stuff is fun, and it's kind of like a highway to your brain.
因此GUI在审计系统和可视化呈现方面极具价值。其次我们必须约束AI的行为。现在人们对AI代理过度狂热,但收到上千行代码变更对我毫无意义——我仍是效率瓶颈。即便AI能瞬间生成代码,仍需人工确保其正确性且不引入漏洞。
So I think GUIs are very useful for auditing systems and for visual representations in general. Number two, I would say, is that we have to keep the AI on the leash. I think a lot of people are getting way over-excited with AI agents, and it's not useful to me to get a diff of 1,000 lines of code to my repo. I'm still the bottleneck, right? Even though those 1,000 lines come out instantly, I have to make sure that this thing is not introducing bugs and that it's doing the correct thing, right?
还要排查安全隐患等问题。本质上,我们需要双管齐下:既要极致优化流程效率,又要有效约束AI的过度活跃。就像我进行AI辅助编程时的感受:氛围编程(vibe coding)时一切美好,但实际想完成工作时,过度活跃的代理反而会适得其反。
And that there are no security issues, and so on. So basically, it's in our interest to make this loop go very, very fast, and we have to somehow keep the AI on the leash, because it gets way too overactive. It's kind of like this: this is how I feel when I do AI-assisted coding. If I'm just vibe coding, everything is nice and great.
这张幻灯片可能不够直观,但我和诸位一样正在探索如何将AI代理融入编程流程。实践中我总是畏惧大规模变更,坚持小步迭代——确保每个微调都正确无误,让验证循环高速运转。这种碎片化推进方式可能是许多人与LLM协作的共同经验。
But if I'm actually trying to get work done, it's not so great to have an overactive agent doing all this kind of stuff. So this slide is not very good, I'm sorry. But I guess I'm trying to develop, like many of you, some ways of utilizing these agents in my coding workflow to do AI-assisted coding. And in my own work, I'm always scared to get way too big diffs. I always go in small incremental chunks.
近期我看到不少探讨LLM最佳实践的博客文章。其中一篇提出的技术方案颇具启发性,特别是关于约束AI的方法。例如:当提示词过于笼统时,AI可能偏离预期导致验证失败,迫使你重新调整——这会造成循环卡顿。因此投入时间细化提示词反而能提升验证通过率。
I want to make sure that everything is good. I want to spin this loop very, very fast, and I work on small chunks of a single concrete thing. And so I think many of you are probably developing similar ways of working with LLMs. I also saw a number of blog posts that try to develop these best practices for working with LLMs.
相信我们终将积累此类技术经验。当前我特别关注AI时代的教育形态——如何有效约束教育场景中的AI?直接要求AI
And here's one that I read recently and thought was quite good. It discussed some techniques, and some of them have to do with how you keep the AI on the leash. As an example, if your prompt is vague, then the AI might not do exactly what you wanted, and in that case verification will fail and you're going to ask for something else; if verification keeps failing, you start spinning. So it makes a lot more sense to spend a bit more time being more concrete in your prompts, which increases the probability of successful verification so you can move forward.
教授物理课程是行不通的,AI容易迷失方向。我的解决方案是拆分为两个独立应用:教师端课程创作工具和学生端课程交付系统。通过可审计的课程中间件,我们既能保证教学质量的一致性,又能将AI约束在既定教学大纲和项目进度框架内运作。
And so I think a lot of us are going to end up finding techniques like this. In my own work as well, I'm currently interested in what education looks like now that we have AI and LLMs. And a large amount of thought for me goes into how we keep the AI on the leash. I don't think it just works to go to ChatGPT and be like, hey, teach me physics. I don't think this works, because the AI gets lost in the woods.
这种结构化管控方式显著提升了系统可靠性。
And so for me this is actually two separate apps for example. There's an app for a teacher that creates courses and then there's an app that takes courses and serves them to students. And in both cases, we now have this intermediate artifact of a course that is auditable and we can make sure it's good, we can make sure it's consistent. And the AI is kept on the leash with respect to a certain syllabus, a certain like progression of projects and so on. And so this is one way of keeping the AI on the leash and I think has a much higher likelihood of working.
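As a rough sketch of the kind of auditable intermediate artifact described here (all names and structure below are hypothetical, not from the talk): the course is a plain data structure that a human can review before any AI serves it, and the student-facing AI is only ever prompted with one lesson at a time.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    title: str
    objectives: list[str]  # what the student should be able to do afterwards
    project: str           # concrete deliverable that keeps the AI on a fixed track

@dataclass
class Course:
    name: str
    lessons: list[Lesson] = field(default_factory=list)

    def audit(self) -> list[str]:
        """Return problems a human reviewer should fix before publishing."""
        issues = []
        for i, lesson in enumerate(self.lessons):
            if not lesson.objectives:
                issues.append(f"Lesson {i} ('{lesson.title}') has no objectives")
        return issues

# The teacher-facing app produces a Course; the student-facing app
# only ever prompts the AI with one Lesson at a time.
course = Course("Intro Physics", [
    Lesson("Kinematics", ["define velocity"], "simulate a falling ball"),
    Lesson("Forces", [], "build a spring model"),
])
print(course.audit())  # the lesson with no objectives is flagged for review
```

The point is not the specific fields but that the syllabus lives outside the model, where it can be checked for quality and consistency.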
人工智能不会在森林里迷路。我想再类比一下,我对部分自动驾驶并不陌生,我在特斯拉工作了大约五年。这也是一个部分自动驾驶产品,具有许多共同特性。比如仪表盘上就有自动驾驶的图形界面,显示神经网络所看到的内容等等。
And the AI is not getting lost in the woods. One more kind of analogy I wanted to sort of allude to is I'm no stranger to partial autonomy and I've kind of worked on this I think for five years at Tesla. And this is also a partial autonomy product and shares a lot of the features. But for example right there in the instrument panel is the GUI of the autopilot. So it's showing me what the neural network sees and so on.
我们还有自动驾驶调节滑块,在我任职期间,我们为用户实现了越来越多的自动驾驶功能。简单分享一个故事:2013年我第一次体验自动驾驶汽车,当时有位在Waymo工作的朋友邀请我在帕洛阿尔托试乘。这张照片是用当时的谷歌眼镜拍的。在座很多人可能年轻到不知道这是什么,但那会儿这玩意儿可风靡一时。
And we have the autonomy slider where over the course of my tenure there we did more and more autonomous tasks for the user. And maybe the story that I wanted to tell very briefly is, actually the first time I drove a self driving vehicle was in 2013 and I had a friend who worked at Waymo and, he offered to give me a drive around Palo Alto. I took this picture using Google Glass at the time. And many of you are so young that you might not even know what that is. But yeah, this was like all the rage at the time.
我们坐上车在帕洛阿尔托行驶了约三十分钟,穿梭于高速公路和街道之间。那次驾驶堪称完美,全程零干预。要知道那可是2013年——距今已十二年。这让我深受震撼,因为当时体验完这场完美演示后,我真觉得自动驾驶马上就要实现了,毕竟它运行得如此顺畅。
And we got into this car and we went for about a thirty minute drive around Palo Alto, highways, streets and so on. And this drive was perfect. There were zero interventions. And this was 2013, which is now twelve years ago. And it kind of struck me because at the time when I had this perfect drive, this perfect demo, I felt like, wow, self driving is imminent because this just worked.
这简直不可思议。但十二年后的今天,我们仍在攻克自动驾驶技术,仍在研发驾驶智能体。即便现在,这个问题也未被完全解决。虽然Waymo车辆看似无人驾驶,但实际上仍需要大量远程操控和人工监督。
This is incredible. But here we are twelve years later and we are still working on autonomy. We are still working on driving agents. And even now we haven't actually like fully solved the problem. Like you may see Waymo's going around and they look driverless, but you know there's still a lot of teleoperation and a lot of human in the loop of a lot of this driving.
我们甚至还没能宣告成功——尽管我认为最终定会成功——只是耗时远超预期。软件开发就像驾驶一样充满挑战。所以当我听到'2025是智能体元年'这类说法时非常担忧,这分明会是'智能体十年',且需要很长时间。我们必须保持人机协作,谨慎推进——这可是严肃的软件开发。另一个我常思考的类比是钢铁侠战衣。
So we still haven't even declared success. I think it's definitely going to succeed at this point, but it just took a long time. And so I think software is really tricky, in the same way that driving is tricky. So when I see things like, oh, 2025 is the year of agents, I get very concerned, and I kind of feel like, you know, this is the decade of agents, and this is going to take quite some time. We need humans in the loop, we need to do this carefully. This is software; let's be serious here. One more analogy that I always think through is the Iron Man suit.
我一直热爱钢铁侠,它在技术演进方面极具预见性。钢铁侠战衣的精妙之处在于既是增强装备(托尼·斯塔克可手动操控),也是智能体(某些电影里能自主飞行寻找主人)。这就是自动驾驶调节滑块的理念——我们既能打造增强工具,也能开发自主智能体。
I always love Iron Man. I think it's so correct in a bunch of ways with respect to technology and how it will play out. And what I love about the Iron Man suit is that it's both an augmentation, which Tony Stark can drive, and it's also an agent. In some of the movies, the Iron Man suit is quite autonomous and can fly around and find Tony and all this kind of stuff. And so this is the autonomy slider: we can build augmentations or we can build agents.
现阶段面对存在缺陷的大语言模型,我认为更应该打造的是'钢铁侠战衣'而非'钢铁侠机器人',应聚焦部分自动驾驶产品而非炫目的智能体演示。这些产品需要定制化图形界面,以确保人类验证循环的高效性,同时不遗忘自动化潜力。产品中应该设计自主调节滑块,逐步提升自动化程度——这就是我认为的巨大机遇所在。现在我想转向另一个独特维度。
And we kind of want to do a bit of both. But at this stage, working with fallible LLMs and so on, I would say it's less Iron Man robots and more Iron Man suits that you want to build. It's less about building flashy demos of autonomous agents and more about building partial autonomy products. These products have custom GUIs and UI/UX, and this is done so that the generation-verification loop of the human is very, very fast, while we're not losing sight of the fact that it is in principle possible to automate this work. There should be an autonomy slider in your product, and you should be thinking about how you can slide that slider and make your product more autonomous over time. So I think there are lots of opportunities in these kinds of products. I want to now switch gears a little bit and talk about one other dimension that I think is very unique.
不仅出现了支持软件自主性的新型编程语言,更重要的是它用英语这种自然语言编写。突然间人人都成了程序员,因为大家都会说英语。这极其乐观且前所未有——过去需要五到十年专业学习才能开发软件,如今已非如此。
Not only is there a new type of programming language that allows for autonomy in software, but also, as I mentioned, it's programmed in English, which is this natural interface. And suddenly everyone is a programmer because everyone speaks natural language like English. So this is extremely bullish and very interesting to me and also completely unprecedented, I would say. It used to be the case that you need to spend five to ten years studying something to be able to do something in software. This is not the case anymore.
不知各位是否听说过'氛围编程'?这条推文算是它的起源,现在已成热门梗。有趣的是,我玩推特十五年了,依然无法预测哪条会爆红。本以为这只是条无人问津的洗澡时突发奇想,结果却引发共鸣——它精准命名了人们难以言喻的集体感受。
So I don't know if by any chance anyone has heard of vibe coding. This is the tweet that kind of introduced this, and I'm told that it's now like a major meme. A fun story about this is that I've been on Twitter for like fifteen years at this point, and I still have no clue which tweet will become viral and which tweet fizzles and no one cares. I thought that this tweet was going to be the latter; it was just like a shower thought. But this became a total meme, and I really just can't tell, but I guess it struck a chord and gave a name to something that everyone was feeling but couldn't quite say in words.
现在连维基百科词条都有了,算是我的一大'贡献'吧。Hugging Face的Tom Wolf分享过一段我超爱的视频:孩子们正在进行氛围编程。这段充满温度的影像让我感动不已。
So now there's a Wikipedia page and everything. Yeah, this is like a major contribution now or something like that. So Tom Wolf from Hugging Face shared this beautiful video that I really love. These are kids vibe coding. And I find that this is such a wholesome video; I love this video.
你怎么能看着这个视频还对未来感到悲观呢?未来是美好的。我觉得这最终会成为进入软件开发领域的敲门砖。我对这一代人的未来并不悲观,而且是的,我非常喜欢这个视频。所以我也尝试了一下氛围编程,因为这太有趣了。
Like, how can you look at this video and feel bad about the future? The future is great. I think this will end up being like a gateway drug to software development. I'm not a doomer about the future of this generation, and yeah, I love this video. So I tried vibe coding a little bit as well, because it's so fun.
当你想要构建一个超级定制化、看似不存在的东西,并且因为今天是周六之类的原因就想随心所欲时,氛围编程真是太棒了。于是我开发了这个iOS应用,其实我并不会用Swift编程,但让我震惊的是我居然能做出一个超级基础的APP——虽然我不会解释它,因为它真的很简单,但我只花了一天时间,当天晚些时候它就在我手机上运行了,我当时就想,哇,这太神奇了。我不用先花五天时间学习Swift才能入门。我还用氛围编程做了个叫MenuGen的应用,现在已经上线,你可以在menugen.app试用。
So vibe coding is so great when you want to build something super duper custom that doesn't appear to exist and you just want to wing it because it's a Saturday or something like that. So I built this iOS app, and I can't actually program in Swift, but I was really shocked that I was able to build like a super basic app. I'm not going to explain it, it's really dumb, but this was just like a day of work and it was running on my phone later that day, and I was like, wow, this is amazing. I didn't have to read through Swift for like five days or something like that to get started. I also vibe coded this app called MenuGen, and this is live, you can try it at menugen.app.
我遇到的问题是:每次去餐厅看着菜单,完全不知道那些菜是什么,我需要图片。既然没有这种应用,我就决定用氛围编程做一个。它的功能是这样的:访问menugen.app,拍下菜单照片,就会生成对应的菜品图片。注册就送5美元额度,所以这目前是我生活中的重大成本中心,是个负收益应用,我已经在MenuGen上亏了一大笔钱。
I basically had this problem where I show up at a restaurant, I read through the menu, and I have no idea what any of the things are, and I need pictures. This doesn't exist, so I was like, hey, I'm going to vibe code it. So this is what it looks like: you go to menugen.app, you take a picture of a menu, and then MenuGen generates the images. Everyone gets $5 in credits for free when they sign up, and therefore this is a major cost center in my life; this is a negative-revenue app for me right now. I've lost a huge amount of money on MenuGen.
但MenuGen最让我着迷的是:氛围编程的部分,也就是写代码,反而是最简单的。真正困难的是让它落地——包括用户认证、支付系统、域名配置和Vercel部署这些非代码工作,全是靠我在浏览器里点点点的DevOps操作。
Okay. But the fascinating thing about MenuGen for me is that the vibe coding part, the code, was actually the easy part of vibe coding MenuGen. Most of the hard part came when I tried to make it real, so that you can actually have authentication and payments and the domain name and a Vercel deployment. This was really hard, and all of this was not code. All of this DevOps stuff was me in the browser, clicking stuff.
这些极其枯燥的工作又花了我一周时间。有趣的是,MenuGen的演示版在我笔记本上几小时就能跑通,但让它真正可用却花了一周。原因就在于这些琐事太烦人了——比如想给网页添加Google登录功能(虽然很小),但那个clerk库的集成说明长得离谱,简直像在教我怎么点击网页:打开这个链接,选择下拉框,点这里再点那里...就像电脑在指挥我操作。
And this was an extreme slog and took another week. So it was really fascinating that I had the MenuGen demo basically working on my laptop in a few hours, and then it took me a week because I was trying to make it real. And the reason for this is that it was just really annoying. For example, if you try to add Google login to your web page, I know this is very small, but there's just a huge amount of instructions from this Clerk library telling me how to integrate it. And this is crazy; it's telling me, go to this URL, click on this dropdown, choose this, go to this, and click on that. It's telling me what to do.
明明该由电脑自动完成的事,为什么要我手动操作?见鬼的是我还得一步步跟着做。
Like a computer is telling me the actions I should be taking. Like, you do it. Why am I doing this? What the hell? I had to follow all these instructions.
这太疯狂了。所以我演讲最后想探讨:能不能直接为智能体开发?我不想干这些活,能让智能体代劳吗?谢谢。
This was crazy. So I think the last part of my talk, therefore, focuses on can we just build for agents? I don't want to do this work. Can agents do this? Thank you.
大致来说,数字信息领域出现了新的消费者和操纵者类别。过去只有人类通过图形界面或计算机通过API,现在多了个全新存在——智能体。它们本质是计算机,但又有几分像人对吧?
Okay. So roughly speaking, I think there's a new category of consumer and manipulator of digital information. It used to be just humans through GUIs or computers through APIs, and now we have a completely new thing. And agents are computers, but they are human-like, kind of, right?
它们是网络中的数字精灵,需要与我们的软件基础设施交互。我们能否为它们而设计?这是个新课题。就像域名下的robots.txt可以指导网络爬虫如何行为,
They're people spirits. There's people spirits on the Internet and they need to interact with our software infrastructure. Like, can we build for them? It's a new thing. So as an example, you can have robots.txt on your domain and you can instruct or like advise, I suppose, web crawlers on how to behave on your website.
我们也可以弄个llms.txt文件,用简单Markdown告诉大语言模型这个网站的用途——这对LLM非常易读。如果让它解析网页HTML,既容易出错效果又差。不如直接与LLM对话,这绝对值得。
In the same way, you can have maybe an llms.txt file, which is just simple markdown telling LLMs what this domain is about. And this is very readable to an LLM. If it had to instead get the HTML of your web page and try to parse it, this is very error-prone and difficult, it will screw it up, and it's not going to work. So we can just directly speak to the LLM. It's worth it.
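The llms.txt convention places a short markdown summary at the root of a domain. The example below is invented for illustration, using the MenuGen app from this talk, and the URLs are hypothetical:

```markdown
# MenuGen

> MenuGen turns a photo of a restaurant menu into generated images of each dish.

## Docs

- [How it works](https://menugen.app/docs/how-it-works.md): uploading a menu photo and getting images back
- [Pricing](https://menugen.app/docs/pricing.md): free credits and billing
```

A crawler-facing LLM can read this in one shot instead of parsing the site's HTML.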
目前大量文档是为人类编写的,因此你会看到列表、加粗文字和图片等内容,这些对LLM(大语言模型)来说无法直接理解。所以我看到一些服务正将其文档专门适配LLM。例如Vercel和Stripe是先行者,但还有更多案例。它们提供Markdown格式的文档——这种格式对LLM来说极易解析。
A huge amount of documentation is currently written for people, so you will see things like lists and bold and pictures, and this is not directly accessible by an LLM. So I see some of the services now are transitioning a lot of their docs to be specifically for LLMs. So Vercel and Stripe as an example are early movers here, but there are a few more that I've seen already. And they offer their documentation in markdown. Markdown is super easy for LLMs to understand.
这很棒。我也举个亲身经历的简单例子:可能有人知道3Blue1Brown,他在YouTube上制作精美的动画视频。是的,我超爱这个库。
This is great. Maybe one simple example from my experience as well. Maybe some of you know 3Blue1Brown; he makes beautiful animation videos on YouTube. Yeah, I love this library.
我看到他写了Manim后想自己尝试。虽然有使用Manim的详细文档,但我不想逐字阅读,于是把整篇文档复制给LLM并描述需求——结果它直接生成了我想要的动画效果。当时我就震惊了:如果能让文档对LLM友好,将释放巨大的可能性,这太棒了且应该推广。
I saw that he wrote Manim, and I wanted to make my own. There's extensive documentation on how to use Manim, and I didn't want to actually read through it, so I copy-pasted the whole thing to an LLM and described what I wanted, and it just worked out of the box. The LLM just vibe coded me an animation, exactly what I wanted, and I was like, wow, this is amazing. So if we can make docs legible to LLMs, it's going to unlock a huge amount of use, and I think this is wonderful and should happen more.
另一点需要注意的是:仅仅将文档转为Markdown远远不够。比如文档中出现'点击'这类指令对LLM无效——Vercel正在将所有'点击'替换成等效的curl命令,这样LLM代理就能代为执行。
The other thing I wanted to point out is that, unfortunately, it's not just about taking your docs and making them appear in markdown. That's the easy part. We actually have to change the docs, because any time your docs say "click", this is bad. An LLM will not be able to natively take this action right now. So Vercel, for example, is replacing every occurrence of "click" with the equivalent curl command that your LLM agent could run on your behalf.
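To illustrate the kind of rewrite this implies (the endpoint, flags, and JSON body below are invented for illustration, not Vercel's actual API), a docs step written for a human versus one an agent can execute might look like:

```markdown
<!-- Before: a docs step written for a human -->
1. Open the dashboard and click **Settings > Domains**, then click **Add**.

<!-- After: the same step as an action an agent can run on your behalf -->
Run:

    curl -X POST https://api.example.com/v1/domains \
      -H "Authorization: Bearer $TOKEN" \
      -d '{"name": "menugen.app"}'
```

The human version requires eyes and a mouse; the agent version is a single action an LLM can emit and verify.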
这非常有趣。还有Anthropic的模型上下文协议(MCP),它定义了如何直接与这些数字信息的新消费者(代理)对话。此外,我喜欢那些以LLM友好格式处理数据的工具:比如我的nanoGPT仓库在GitHub上是人类界面,但把URL改成Git Ingest就能将所有文件合并为可粘贴给LLM的文本。
And so I think this is very interesting. And then of course there's the Model Context Protocol from Anthropic, and this is also another way, a protocol for speaking directly to agents as this new consumer and manipulator of digital information. So I'm very bullish on these ideas. The other thing I really like is a number of little tools here and there that help ingest data in very LLM-friendly formats. For example, when I go to a GitHub repo like my nanoGPT repo, I can't feed this to an LLM and ask questions about it, because this is a human interface on GitHub.
更极端的例子是DeepWiki:它不仅提供Devin分析的原始文件,还会为仓库生成完整文档页。想象这多适合粘贴给LLM!所有通过简单修改URL就能让LLM获取内容的工具都令人振奋。另外,虽然未来(甚至现在)LLM能自主点击操作,但主动优化信息获取方式仍很重要——毕竟当前LLM使用成本仍高。
So when you just change the URL from GitHub to Git Ingest, it will concatenate all the files into a single giant text, create a directory structure, etcetera, and this is ready to be copy-pasted into your favorite LLM so you can do stuff. Maybe an even more dramatic example of this is DeepWiki, from Devin, where it's not just the raw content of these files; they have Devin basically do an analysis of the GitHub repo and build up whole docs pages just for your repo. And you can imagine that this is even more helpful to copy-paste into your LLM. So I love all the little tools where you just change the URL and it makes something accessible to an LLM; this is all well and great, and I think there should be a lot more of it. One more note I wanted to make is that it is absolutely possible that in the future, and it's not even the future, this is today, LLMs will be able to go around and click stuff and so on.
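The URL trick mentioned here is literally a string substitution; a minimal sketch, assuming the gitingest.com convention of swapping the hostname in a GitHub repo URL:

```python
def to_ingest_url(github_url: str) -> str:
    """Rewrite a GitHub repo URL to its gitingest.com equivalent,
    which serves the repo flattened into one LLM-friendly text dump."""
    # Replace only the first occurrence, so a repo name containing
    # "github.com" in its path would be left untouched.
    return github_url.replace("github.com", "gitingest.com", 1)

print(to_ingest_url("https://github.com/karpathy/nanoGPT"))
# -> https://gitingest.com/karpathy/nanoGPT
```

The resulting page concatenates the repo's files into a single text blob ready for copy-pasting into an LLM.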
我认为有必要与LLM双向适应:有些软件可能长期不会适配,这时工具就很重要;但对主流场景,找到中间点很值得期待。总结来说:当下正是进入这行业的黄金时代。
But I still think it's very worth basically meeting LLMs halfway and making it easier for them to access all this information, because this is still fairly expensive to use, I would say, and a lot more difficult. And I do think that for lots of software there will be a long tail that won't adapt, because these are not like live-player repositories or digital infrastructure, and we will need these tools. But for everyone else, I think it's very worth meeting at some middle point. So I'm bullish on both, if that makes sense. So in summary, what an amazing time to get into the industry.
我们需要重写大量代码(专业程序员仍不可或缺),LLM就像公共设施或早期操作系统——如同1960年代的操作系统,我们需要学会与这些会犯错的'数字灵魂'协作,为此必须调整基础设施。
We need to rewrite a ton of code, and a ton of code will be written by professionals and by coders. These LLMs are kind of like utilities, kind of like fabs, but they're especially like operating systems. And it's so early; it's like the 1960s of operating systems, and I think a lot of the analogies cross over. These LLMs are kind of like these fallible, you know, people spirits that we have to learn to work with. And in order to do that properly, we need to adjust our infrastructure towards them.
构建LLM应用时,我会分享高效协作的方法和工具,如何快速迭代出半自主产品。当然代理也需要大量定制代码。用钢铁侠战衣类比:未来十年我们将见证滑块从左到右的精彩推移,我迫不及待想与各位共同建设这个未来。谢谢!
So when you're building these LLM apps, I described some of the ways of working effectively with these LLMs, some of the tools that make that possible, and how you can spin this loop very, very quickly and basically create partial autonomy products. And then, yeah, a lot of code also has to be written for the agents directly. But in any case, going back to the Iron Man suit analogy, I think what we'll see over the next decade roughly is that we're going to take the slider from left to right, and it's going to be very interesting to see what that looks like. I can't wait to build it with all of you. Thank you.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。