本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
欢迎收听《信号与线程》,这是来自Jane Street的关于技术栈各层次的深度对话。我是Ron Minsky。非常荣幸邀请到Chris Lattner参加本期节目。通常在《信号与线程》中,我们主要与Jane Street的工程师交流,但有时也会邀请外部嘉宾。Chris是一位杰出的人物,因为他深度参与了许多我们都在使用的基础计算项目——LLVM、Clang、MLIR、OpenCL、Swift以及现在的Mojo。
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. It is my great pleasure to have Chris Lattner on the show. Typically, on Signals and Threads, we end up talking to engineers who work here at Jane Street. But sometimes, we like to grab outside folks, and Chris is an amazing figure to bring on just because he's been so involved in a bunch of really foundational pieces of computing that we all use, LLVM and Clang and MLIR and OpenCL and Swift and now Mojo.
这些成就发生在多个知名机构——苹果、特斯拉、谷歌、SiFive以及现在的Modular。总之,Chris,非常高兴你能加入我们。
And this has happened at a bunch of different storied institutions, Apple and Tesla and Google and SiFive and now Modular. So anyway, it's a pleasure to have you joining us, Chris.
谢谢,Ron。我很高兴来到这里。
Thank you, Ron. I'm so happy to be here.
我想先听听你的成长故事。你是如何进入计算机领域的,又是如何涉足编译器工程和编程语言设计这个世界的?
I guess I wanna start by just hearing a little bit more about your origin story. How did you get into computing, and how did you get into this world of both compiler engineering and programming language design?
我在八十年代长大,那时计算机还不是主流。虽然我们有个人电脑,但它们并不被认为很酷。我迷上了理解计算机的工作原理。那时候一切简单得多,比如我从一个BASIC解释器开始,从书店买书学习。
So I grew up in the eighties, back before computers were really a thing. I mean, we had PCs, but they weren't considered cool. And so I fell in love with understanding how the computer worked. And back then, things were way simpler. I started with a BASIC interpreter, for example, and got a book from the store.
还记得我们有书的年代吗?从书中学东西。你有没有
Remember when we had books? And you learn things from books. Did you do
那种经历,就是买爱好者杂志,然后照着上面的程序清单敲代码?
the thing where you get the hobbyist magazine and copy out the listing of the program from it?
完全正确。所以我们那时没有氛围编程,但我们有书籍。仅仅通过输入代码,你就能理解事物是如何运作的。然后当你把它搞砸时——因为不可避免地,你在输入一些东西却并不真正知道自己在做什么——你必须找出哪里出了问题。这就在一定程度上鼓励了调试。
That's exactly right. And so we didn't have vibe coding, but we did have books. And so just by typing things in, you could understand how things work. Then when you broke it, because inevitably you're typing something in and you don't really know what you're doing, you have to figure out what went wrong. And so it encouraged a certain amount of debugging.
我真的很喜欢电脑游戏。同样地,那时候事情要简单一些。电脑游戏推动了图形、性能等方面的发展。所以我花了一些时间在早期的互联网上,使用一种叫做电子公告板系统的东西,阅读游戏程序员们如何努力突破硬件极限的文章。我就是在那时对性能、计算机和系统产生了兴趣。
I really love computer games. Again, back then, things were a little bit simpler. Computer games drove graphics and performance and things like this. And so I spent some time on these things called bulletin board systems in the early Internet reading about how game programmers were trying to push the limits of the hardware. And so that's where I got interested in performance and computers and systems.
我后来上了大学,在学校里遇到了一位非常棒的教授。向俄勒冈州波特兰市的波特兰大学致敬。他是一个编译器极客。我认为他对编译器的热爱非常有感染力。他叫Steven Vegdahl,这促使我后来去伊利诺伊大学继续研究编译器,在那里我又一次深入了这个编译器、系统的兔子洞,并构建了LLVM。自从我进入编译器世界以来,我就爱上了它。
I went on to college and had an amazing professor at my school. Shout out to the University of Portland in Portland, Oregon. And he was a compiler nerd. And so I think that his love for compilers was infectious. His name was Steven Vegdahl, and that caused me to go on to pursue compilers at the University of Illinois, and there again I continued to fall down this rabbit hole of compilers and systems and built LLVM, and ever since I got into the compiler world, I've loved it.
我喜欢编译器,因为它们是大型系统。有多个不同的组件协同工作。在大学环境中,编译器课程真的很酷,因为与大多数你做作业、交作业、然后就忘记的作业不同,在编译器中,你会完成一个作业,交上去,得到评分,然后在此基础上继续构建。这感觉更真实,更像是软件工程,而不仅仅是为了评分而做一个项目。
I love compilers because they're large scale systems. There's multiple different components that all work together. And in the university setting, it was really cool in the compiler class just because unlike most of the assignments where you do an assignment, turn it in, forget about it. In compilers, you would do an assignment, turn it in, get graded, and then build on it. And it felt much more realistic, like software engineering rather than just doing a project to get graded.
是的。我认为对很多人来说,操作系统课程是他们第一次真正体验到层层构建的经历。我认为这对于人们开始工程师生涯来说是一次极其重要的经历。
Yeah. I think for a lot of people, the OS class is their first real experience of doing a thing where you really are building layer on top of layer. I think it's an incredibly important experience for people as they start engineering.
这也是一个你能用到一些数据结构的课程。我几乎是学术性地学习了这些内容——这是二叉树,那是图。尤其是我学的时候,它是从一个非常数学化的角度教授的,但这确实让它变得有用。所以那实际上真的很酷。我当时就想,哦,这就是我学这些东西的原因。
It's also one where you get to use some of those data structures. I had taken this stuff in an almost academic way: here's what a binary tree is, and here's what a graph is. And particularly when I went through it, it was taught from a very math-forward perspective, but this class really made it useful. And so that was actually really cool. I'm like, oh, this is why I learned this stuff.
所以一个
So one
关于你职业生涯中令我印象深刻的一点是,你一直在编译器工程和语言设计领域之间来回穿梭。而我觉得很多人都是偏向一边的,你知道,他们主要是编译器方向的人,不太关心语言本身,只关注如何让程序运行得更快。还有一些人则专注于语言设计,编译器工作只是实现设计目标的次要手段。而你不仅在这两者之间来回切换,而且你的很多编译器工程工作,特别是从LLVM开始,在某种意义上本身就极具语言前瞻性。LLVM中包含了一种中间语言,你将其作为一种工具呈现给人们使用。
thing that strikes me about your career is that you've ended up going back and forth between compiler engineering and language design space. Whereas I feel like a lot of people are on one side or the other, you know, they're mostly compilers people, and they don't care that much about the language and just how do we make this thing go fast. And there are some people who are really focusing on language design, and the work on the compiler is a secondary thing towards that design. And you've both popped back and forth, and then also a lot of your compiler engineering work, really starting with LLVM, in some sense is itself very language forward. LLVM, there's a language in there that's this intermediate language that you're surfacing as a tool for people to use.
所以我很好奇,想听听你如何看待编译器工程和语言设计之间的这种来回穿梭?
So I'm just curious to hear more about how you think about the back and forth between compiler engineering and language design?
我这样做的原因在于,我的职业生涯本质上是在追随自己的兴趣。而我的兴趣并非一成不变。我想要研究不同类型的问题,解决有用的问题,并构建出有价值的东西。你掌握的技术和能力越多,就能达到更高的高度。以LLVM为例,我们构建并学习了很多关于x86芯片深度代码生成的酷炫技术,包括寄存器分配这类技术。
The reason I do this is that, effectively, my career is following my own interests. And so my interests are not static. I wanna work on different kinds of problems and solve useful problems and build useful things. And so the more technology and capability you have, the higher you can reach. And so with LLVM, for example, I built and learned a whole bunch of cool stuff about deep code generation for an x86 chip, that category of technology, with register allocation and stuff like this.
但这使得我们能够去攻克C++,利用这项技术打造世界上最好的实现方案,让比深度后端代码生成技术更多用户理解和使用的产品成为可能。而Swift的构建则达到了更高层次,我们认为C++可能有些人喜欢,但我们可以做得更好,追求更高目标。我还参与过AI系统开发,构建过帮助儿童学习编程的iPad应用,随着时间的推移涉足了众多不同领域。对我来说,我认为自己最有价值、经验最丰富的领域始终处于硬件与软件的边界上。
But then it made it possible to go say, let's go tackle C++, and let's go use this to build the world's best implementation of something that lots more people use and understand than deep back-end code generation technology. And then with Swift, we built even higher and said, well, C++, maybe some people like it, but I think we can do better, so let's reach higher. I've also been involved in AI systems, been involved in building an iPad app to help teach kids how to code, and so lots of different things over time. And so for me, the place I think I'm most useful and where a lot of my experience is valuable ends up being at this hardware-software boundary.
我很好奇你是如何转向开发Swift的。从我的角度看,Swift像是主流编程语境中一个集大成的产物,融合了许多我长期认为在其他编程语言中非常优秀的理念。我很好奇你是如何从专注于底层技术和编译器优化——比如实现C++(这仍然算是比较底层的)——跃升到更高层次的Swift开发的?整个Swift项目是怎么发生的?
I'm curious how you ended up making the leap to working on Swift. From my perspective, Swift looks from the outside like one of these points of arrival, in a mainstream programming context, of a bunch of ideas that I have long thought are really great ideas in other programming languages. And I'm curious in some ways about the step away from, like, oh, I'm gonna work on really low level stuff and compiler optimization, and a C++ implementation, which is still pretty low level, to, you know, something much higher level. How did the whole Swift thing happen?
很好的问题。对于不熟悉时间线的人来说,LLVM始于2000年。到2005年时,我已经离开大学加入了苹果。那时LLVM还算是一个前沿研究项目。到了2010年左右,LLVM已经成熟很多,我们刚在Clang中发布了C++支持。
Great question. I mean, the time frame, for people that aren't familiar, is that LLVM started in 2000. So by 2005, I had exited university and I joined Apple. And so LLVM was kind of an advanced research project at that point. By the 2010 timeframe, LLVM was much more mature and we had just shipped C++ support in Clang.
因此它能够自举,意味着编译器可以编译自身。它完全用C++编写,能够构建像Boost模板库这样极其复杂的模板库。所以我和团队构建的C++实现是实实在在的。不过在我看来,C++并不是一门优美的编程语言。实现它是一项非常有趣的技术挑战。
And so it could bootstrap itself, which means the compiler could compile itself. It's all written in C++. It could build advanced libraries like the Boost template library, which is super crazy advanced template stuff. And so the C++ implementation that I and the team had built was real. Now, C++, in my opinion, is not a beautiful programming language. And so implementing it is a very interesting technical challenge.
对我来说,很多问题解决最终归结为如何正确地分解系统。Clang 有一些非常酷的功能,使其能够扩展等等。但我也感到精疲力尽。我们刚刚发布了它,这真是太棒了。
For me, a lot of problem solving ends up being how do you factor the system the right way? And so Clang has some really cool stuff that allowed it to scale and things like that. But I was also burned out. We had just shipped it. It was amazing.
我当时想,肯定有更好的东西。所以 Swift 真正开始于 2010 年。它是一个利用晚上和周末时间的项目,并不是自上而下的管理层说‘我们去开发一门新的编程语言吧’。我当时正处于一种焦躁、精疲力尽的状态,并且管理着一个 20 到 40 人的团队。
I'm like, there has to be something better. And so Swift really came starting in 2010. It was a nights-and-weekends project. It wasn't like top-down management said, let's go build a new programming language. It was a crispy, burnt-out me, and I was running a 20 to 40 person team at the time.
白天做工程师,做技术领导,然后需要一个逃避的出口。所以我说,好吧,我觉得我们可以有更好的东西。我有很多好主意。结果发现,编程语言是一个成熟的领域,并不需要你在这个时候去发明模式匹配。
I was being an engineer during the day and being a technical leader, and I needed an escape hatch. And so I said, okay, well, I think we can have something better. I have a lot of good ideas. Turns out, programming languages are a mature space. It's not like you need to invent pattern matching at this point.
C++ 没有好的模式匹配功能,这真是令人尴尬。
It's embarrassing that C++ doesn't have good pattern matching.
我们应该暂停一下。我认为这是一件虽小但至关重要的事情。我认为像七十年代中期 ML 这样的语言所产生的最好的单一特性首先是代数数据类型的这个概念,意思是地球上的每一种编程语言都有一种方式来表示这个、那个和另一个,比如记录、类或元组。
We should just pause for a second. I think this is a small but really essential thing. I think the single best feature coming out of a language like ML in the mid-seventies is, first of all, this notion of an algebraic data type, meaning every programming language on Earth has a way of saying this and that and the other, a record or a class or a tuple.
一门奇怪的编程语言。我想是芭芭拉·利斯科夫?
A weird programming language. I think it was Barbara Liskov?
是的。她做了很多关于抽象数据类型是什么的早期理论研究。但是能够表达这个、那个或另一个,拥有作为数据不同可能形状之并集的数据类型,然后拥有这种模式匹配设施,让你基本上能以可靠的方式进行情形分析,从而分解出各种可能性,这非常有用。而很少有主流语言采纳了它。我的意思是,Swift 又是一个例子,但像 ML、SML、Haskell 和 OCaml 这样的语言则有。
Yeah. And she did a lot of the early theorizing about what abstract data types are. But the ability to do this or that or the other, to have data types that are a union of different possible shapes of the data, and then having this pattern matching facility that lets you basically, in a reliable way, do the case analysis so you can break down what the possibilities are, is just incredibly useful. And very few mainstream languages have picked it up. I mean, Swift, again, is an example, but languages like ML, SML, Haskell, and OCaml have it.
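下面这段示例并非节目原文,仅用于说明这里讨论的特性;使用 Rust 只是因为它是具备代数数据类型与模式匹配的主流语言,其中的类型和名称均为虚构。 As an illustration of the feature being discussed (not from the episode; a minimal sketch, with Rust used only as a convenient mainstream language that has this feature, and the `Shape` type and all names invented for the example): an algebraic data type is a union of different possible shapes of data, and pattern matching does the case analysis over those shapes.

```rust
// A data type that is a union of different possible shapes of data:
// "this OR that OR the other".
enum Shape {
    Circle { radius: f64 },
    Rect { w: f64, h: f64 },
    Point,
}

// Pattern matching performs the case analysis; the compiler checks
// that every variant of `Shape` is handled.
fn area(s: &Shape) -> f64 {
    match s {
        Shape::Circle { radius } => std::f64::consts::PI * radius * radius,
        Shape::Rect { w, h } => w * h,
        Shape::Point => 0.0,
    }
}

fn main() {
    let shapes = [
        Shape::Circle { radius: 1.0 },
        Shape::Rect { w: 2.0, h: 3.0 },
        Shape::Point,
    ];
    let total: f64 = shapes.iter().map(area).sum();
    println!("total area = {total:.2}");
}
```

关键在于穷尽性检查:如果给 `Shape` 增加一个新变体,每个未处理它的 `match` 都会编译失败。The reliability Ron describes comes from exhaustiveness: add a new variant to `Shape`, and every `match` that does not handle it is rejected at compile time.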
没错。SML。Standard ML。它已经存在很长时间了。
That's right. SML. Standard ML. It's been there for a long time.
我的意思是,模式匹配并不是什么奇特的功能。这里我们说的是2010年。C语言没有它。C++没有它。显然,Java也没有它。
I mean, pattern matching, it's not an exotic feature. Here, we're talking about 2010. C didn't have it. C++ didn't have it. Obviously, Java didn't have it.
我认为JavaScript也没有。这些主流语言都没有这个功能,但这很明显。所以我的部分观点是,顺便说一下,我代表工程师身份,实际上我不是数学家。类型理论对我来说太高深了,我真的不太理解。
I don't think JavaScript had it. None of these mainstream languages had it, but it's obvious. And so part of my opinion about that, by the way, is that I identify as an engineer; I'm not actually a mathematician. And so type theory goes way over my head. I don't really understand this.
关于编程语言的学术方法让我感到沮丧的是,人们总是从类型开始,说有这些类型、交集类型等等,而不是从实用性出发。当我学习OCaml时,模式匹配是如此美妙。它让构建简单事物变得如此容易和富有表现力。对我来说,我总是关注实用性。当然,背后有惊人的形式类型理论,这很好。
The thing that gets me frustrated about the academic approach to programming languages is that people approach it by saying there are some types and there are intersection types and there are these types, and they don't start from utility forward. And so pattern matching, when I learned OCaml, it's so beautiful. It makes it so easy and expressive to build very simple things. And so to me, I always connect with the utility. And then, yes, there's amazing formal type theory behind it, and that's great.
这就是为什么它实际上能够很好地组合工作。但将这些特性向前推进,专注于实用性、解决的问题以及如何让人们感到满意,最终我认为这才是推动采用的关键,至少在主流通俗化方面。
And that's why it actually works and composes. But bringing that stuff forward and focusing on utility and the problems it solves and how it makes people happy ends up being the thing that I think moves the needle in terms of adoption, at least in the mainstream.
是的。我认为这是对的。我的方法和我的语言兴趣也很大程度上不是从数学角度出发的。虽然,你知道,我的本科学位是...我很喜欢数学,但我主要是以实践者的角度来对待这些事物。但多年来让我印象深刻的是,这些功能拥有坚实的数学基础的价值在于它们能够更好地泛化,正如你所说的,组合得更好。
Yeah. I mean, I think that's right. My approach, and my interest in languages, is also very much not from the mathematical perspective. Although, you know, my undergraduate degree is in... I like math a lot, but I mostly approach these things as a practitioner. But the thing I've been struck by over the years is that the value of these features having a really strong mathematical foundation is that they generalize, and, as you were saying, compose much better.
如果它们在数学上最终是简单的,那么你更有可能得到一个在实际使用中超出你最初设想的用途的功能。
If they are in the end mathematically simple, you're way more likely to have a feature that actually pans out as it gets used way beyond your initial view as to what the thing was for.
没错。嗯,你看,这其实是我个人的一个缺陷,因为我不像理论上理想的方式那样理解数学。我最终不得不重新发现一些显而易见的真理。就像那个老生常谈的例子,俄罗斯数学家五十年前就发明了它,对吧?
That's right. Well, and see, this is actually a personal defect because I don't understand the math in the way that maybe theoretically would be ideal. I end up having to rediscover certain truths that are obvious. The cliche of the Russian mathematician invented it fifty years ago. Right?
所以我发现很多时候,当事物组合在一起、相互契合时,我能找到真理和美。而且我常常会发现,你知道,它已经被发现了,因为编程语言中的一切都已经被做过了。几乎没有什么新颖的东西,但那种设计过程——让我们把东西整合起来,让我们思考为什么它不太契合,让我们找出如何更好地分解这个问题——仍然存在。
And so a lot of what I find is that I can find truth and beauty when things compose and things fit together. And often I'll find out, you know, it's already been discovered, because everything in programming languages has been done. There's almost nothing novel, but still that design process remains: let's pull things together, let's reason about why it doesn't quite fit together, let's go figure out how to better factor this.
让我们找出如何让它变得更简单。这个过程对我来说,我认为有点像人们研究物理学的感觉。结果变得越简单,就越感觉接近真理。所以我也有这种感觉。也许这更像是设计基因或工程设计的结合,但这可能正是你们数学家天生就懂的东西,而我还没弄明白。
Let's figure out how to make it simpler. That process, to me, I think, is kind of like what I hear about people working on physics. The simpler the outcome becomes, the closer to truth it feels. And so I share that. Maybe it's more a design gene, or an engineer-design combination, but it's probably what you mathematicians actually know inherently, and I just haven't figured it out yet.
当你从工程角度接触某个问题后,你是否会尝试寻找有用的数学见解?你会回头去读论文吗?你有没有其他更偏向数学方向的编程语言同行可以交流?你是如何扩展思维来涵盖那些其他方面的内容的?
Do you find yourself doing things after you come to it from an engineering perspective, trying to figure out whether there are useful mathematical insights? Do you go back and read the papers? Do you have other PL people who are more mathematically oriented who you talk to? How do you extend your thinking to cover some of that other stuff?
你看,问题在于数学对我来说很可怕。所以我一看到希腊字母就跑开了。我确实会关注arXiv之类的东西,上面有编程语言板块。所以我会接触一些,但吸引我的是其中的示例、结果部分和未来展望的部分,所以不一定是方法,而是它的意义。所以我觉得很多这样的内容真的很能引起我的共鸣。
See, the problem is math is scary to me. So I see Greek letters and I run away. I do follow arXiv and things like this, and there's a programming language section on that. And so I get into some of it, but what I get attracted to in that is the examples and the results section and the future-looking parts of it, and so it's not necessarily the how, it's the what-it-means. And so I think a lot of that really speaks to me.
当你谈论语言设计之类的话题时,另一个真正能引起我共鸣的是来自一些我从未听说过的冷门学术编程语言的博客文章。有人会谈论代数效应系统用于这个那个或其他什么非常高级的东西,但他们找到了有用的解释方式。所以当不仅仅是让我向你解释类型系统,而是让我解释这个高级功能能解决什么问题时,我就会兴奋起来。这就是它能引起我共鸣的地方,因为再次强调,我是问题导向的,我欣赏用优美的方式表达和解决问题。
The other thing that really speaks to me when you talk about language design and things like this is blog posts from some obscure academic programming language that I've never heard of. You just have somebody talking about algebraic effect systems for this and that and the other thing or something really fancy, but they figure out how to explain it in a way that's useful. And so when it's not just let me explain to you the type system, but it's let me explain this problem this fancy feature enables, that's where I get excited. And that's where it speaks to me because, again, I'm problem oriented and having a beautiful way to express and solve problems, I appreciate.
我认为论文中详细阐述理论、数学以及它们如何相互契合的工作非常有价值。但是,是的,我认为这些领域充满了来自同一批人的许多有趣的博客文章,这很棒,因为我认为这是另一种形式,它常常鼓励你提炼出更简单、更易于理解的版本。我认为这是一种不同的洞察力,将其呈现出来也很有价值。
I think there's a lot of value in the work that's done in papers of really, like, working out in detail the theory and the math and how it all fits together. But, yeah, I think the fact that the world has been filled with a lot of interesting blog posts from the same people has been great, because I think it's another modality that often encourages you to pull out the simpler and easier-to-consume versions of those ideas. And I think that is just a different kind of insight, and it's valuable to surface that too.
而且当我查看那些博客文章时,有时它们设计上存在瑕疵,尤其是C++社区。有很多很好的工作来改进C++,他们添加了很多功能,但C++永远不会变得更简单。你无法真正移除已有的东西,所以很多挑战在于这是一种受限的问题解决方式。因此,当我阅读这些文章时,我常常会看到——再次强调,这些人非常聪明,他们在做上帝的工作,试图用C++解决问题,祝他们好运。但你看着这些,就会意识到系统中存在一粒本不需要存在的沙子。
And also when I look at those blog posts, sometimes they have a design smell, particularly in the C++ community. There's a lot of really good work to fix C++. They're adding a lot of stuff to it, and C++ will never get simpler. You can't really remove things, so a lot of the challenge there is that it's constrained problem solving. And so often what I'll see when I'm reading one of those posts, and again, these are brilliant people and they're doing God's work trying to solve problems with C++, best of luck with that. But you look at that and you realize there's a grain of sand in the system that didn't need to be there.
所以对我来说,就像如果你移除了那粒沙子,整个系统就会变得松弛,突然之间所有这些约束都消失了,你可以得到一个更简单的东西。例如Swift,它是一个很棒的语言,发展得很好,社区也很出色,但它里面有几粒沙子导致它变得更加复杂。所以这就是为什么我不满足于已经构建出来的东西。LLVM很了不起,非常实用,但它也有很多问题。
And so to me, it's like if you remove that grain of sand, then the entire system gets relaxed and suddenly all these constraints fall away and you can get to something much simpler. And Swift, for example, it's a wonderful language and it's grown really well and the community is amazing, but it has a few grains of sand in it that cause it to get a lot more complicated. And so this is where I'm not just happy with things that got built. LLVM's amazing. It's very practical, but it has lots of problems.
这就是当我有机会构建下一代系统时,我想要从中学习并真正尝试解决这些问题。所以这就是
That's why, when I get a chance to build a next generation system, I wanna learn from that and actually try to solve these problems. So this is
能够从事一门新语言工作的巨大特权,这正是你现在在做的事情,对吧?有一个叫做Mojo的新语言,它是由你共同创立的公司Modular开发的。也许为了让我们更好地理解背景,你能告诉我一些关于Modular的信息吗?它的基本产品是什么?
the great privilege of getting to work on a new language, which is a thing you're doing now. Right? There's this new language called Mojo, and it's being done by this company that you cofounded called Modular. Maybe just so we understand the context a little bit, can you tell me a little bit about what is Modular? What's the basic offering?
商业模式是什么?
What's the business model?
在我谈到那之前,我先分享一下我是如何走到这里的。如果你简化我的背景,我做了LLVM这个东西,它是CPU的基础编译器技术。它帮助统一了很多CPU时代的基础设施,并为Swift、Rust、Julia以及许多其他构建在其之上的系统提供了平台。我认为它真正催化并实现了许多加速编译器技术的酷炫应用。人们在数据库中使用LLVM,用于查询引擎优化,很多很酷的东西。
Before I even get there, I'll share more of how I got here. If you oversimplify my background, I did this LLVM thing, and it's foundational compiler technology for CPUs. It helped unite a lot of CPU-era infrastructure, and it provided a platform for languages like Swift, but also Rust and Julia and many different systems that all got built on top of it. And I think it really catalyzed and enabled a lot of really cool applications of accelerated compiler technology. People use LLVM in databases and for query engine optimization, lots of cool stuff.
也许你把它用于交易或其他什么。我的意思是,这类技术可以有无数不同的应用。然后我用Swift做了编程语言方面的事情。但与此同时,人工智能出现了。人工智能带来了全新一代的计算。
Maybe you use it for trading or something. I mean, there can be tons of different applications for this kind of technology. And then I did programming language stuff with Swift. But in the meantime, AI happened. And AI brought this entirely new generation of compute.
GPU、张量处理单元、大规模AI训练系统、FPGA和ASIC,所有这些复杂的计算设备。而LLVM在这个系统中从未真正发挥作用。因此,我在谷歌时构建的一项工作就是为这类系统开发了一系列基础编译器技术,其中有一种叫做MLIR的编译器技术。MLIR基本上是LLVM的2.0版本。所以,汲取了构建LLVM过程中学到的所有经验并帮助解决了这个问题后,将其引入下一代编译器技术,以期能够统一全球在GPU、AI和ASIC这类领域的计算。
GPUs, tensor processing units, large scale AI training systems, FPGAs and ASICs, and all this complexity for compute. And LLVM never really worked in that system. And so one of the things that I built when I was at Google was a bunch of foundational compiler technology for that category of systems, and there's this compiler technology called MLIR. MLIR is basically LLVM two point zero. And so take everything you learned from building LLVM and helping solve this, but then bring it forward into this next generation of compiler technology so that you can go hopefully unify the world's compute for this GPU and AI and ASIC kind of world.
MLIR取得了惊人的成功,我认为它几乎被用于所有这些AI系统和GPU中。NVIDIA在使用它,谷歌在使用它,基本上这个领域的每个人都在使用它。但其中一个挑战是缺乏统一性。因此,你有了这些非常大规模的人工智能软件平台。有NVIDIA的CUDA,有谷歌的XLA,有AMD的ROCm。
MLIR has been amazingly successful, and I think it's used in roughly every one of these AI systems and GPUs. It's used by NVIDIA, it's used by Google, it's used by roughly everybody in the space. But one of the challenges is that there hasn't been unification. And so you have these very large scale AI software platforms. You have CUDA from NVIDIA, you have XLA from Google, you have ROCm from AMD.
数不胜数。每家公司都有自己的软件栈。我发现并遇到的一个问题,我认为全世界都看到了,是由于每个硬件制造商构建的这些软件栈都完全不同,导致了令人难以置信的碎片化。其中一些比其他的更好用,但无论如何,这是一团巨大的混乱。而且还有像我们大家都喜爱并想使用的PyTorch这样非常酷的高层技术。
It's countless. Every company has their own software stack. And one of the things that I discovered and encountered and I think the entire world sees is that there's this incredible fragmentation driven by the fact that each of these software stacks built by a hardware maker are just all completely different. And some of them work better than others, but regardless, it's a gigantic mess. And there's these really cool high level technologies like PyTorch that we all love and we want to use.
但如果PyTorch建立在完全不同的栈上,并且试图将来自不同供应商的这些庞然大物拼凑在一起,那么要让它正常工作是非常困难的。
But if PyTorch is built on completely different stacks, stitching together these monolithic worlds from different vendors, it's very difficult to get something that works.
没错。它们都涉及围绕不同工具性能的复杂权衡,以及另一套围绕使用难度、在其中编写代码的复杂性,以及每个工具能针对哪些硬件的复杂权衡。而且这些生态系统每一个都在极其快速地变化。总是有新的硬件出现,新的供应商进入新领域,还有新的小型语言不断涌现,这使得整个局面相当难以掌控。
Right. There are both complicated trade-offs around the performance that you get out of different tools, and then also a different set of complicated trade-offs around how hard they are to use, how complicated it is to write something in them, and then what hardware you can target from each individual one. And each of these ecosystems is churning just incredibly fast. There's always new hardware coming out and new vendors in new places, and there are also new little languages popping up into existence, and it makes the whole thing pretty hard to wrangle.
正是如此。而且AI发展得如此之快。就像,每周都有新模型出现。太疯狂了。还有新的应用、新的研究,每个人投入其中的资金量简直不可思议。
Exactly. And AI is moving so fast. Like, there's a new model every week. It's crazy. And new applications, new research, the amount of money being dumped into this by everybody is just incredible.
那么,任何人该如何跟上呢?这是行业中的一个结构性问题。这个结构性问题是,从事这类工作的人,比如为高级GPU等进行代码生成的人,他们都在硬件公司。而硬件公司,每一家都在构建自己的软件栈,因为他们不得不这样做,没有现成的东西可以接入。没有像LLVM那样的东西,但针对AI的版本并不存在。因此,当他们去构建自己的垂直软件栈时,他们当然会专注于自己的硬件。
And so how does anybody keep up? It's a structural problem in the industry. And so the structural problem is that the people doing this kind of work, people doing code generation for advanced GPUs and things like this, they're all at hardware companies. And the hardware companies, every single one of them is building their own stack because they have to. There's nothing to plug into. There's nothing like LLVM but for AI; that doesn't exist. And so as they go and build their own vertical software stack, of course, they're focused on their hardware.
他们有先进的路线图。明年要推出新芯片,对吧?他们正投入精力和时间解决硬件问题。但我们行业里实际上想要别的东西。我们希望拥有能在多种硬件上运行的软件。
They've got advanced roadmaps. They have a new chip coming out next year, right? They're plowing their energy and time into solving for their hardware. But we out in the industry, we actually want something else. We want to be able to have software that runs across multiple pieces of hardware.
因此,如果所有工作都由硬件公司完成,很自然就会出现供应商间的碎片化,因为没有人有动力去合作。即使有动力,他们也没时间去研究别人的芯片。AMD不会花钱去研究NVIDIA的GPU之类的东西。
And so if everybody doing the work is at a hardware company, it's very natural that you get this fragmentation across vendors because nobody's incentivized to go work together. And even if they're incentivized, they don't have time to go work on somebody else's chip. AMD is not gonna pay to work on NVIDIA GPUs or something like this.
当你考虑低级和高级语言之间的这种分裂时,确实如此。NVIDIA有CUDA,AMD有ROCm(基本上是CUDA的克隆)。然后Google的XLA工具在TPU上运行得非常好,等等。不同供应商有不同的东西。然后还有高级工具,如PyTorch、JAX、Triton等各种工具。
That's true when you think about this kind of a split between low level and high level languages. NVIDIA has CUDA, and AMD has ROCm, which is mostly a clone of CUDA. And then the XLA tools from Google work incredibly well on TPUs, and so on and so forth. Different vendors have different things. Then there's, like, the high level tools, PyTorch and JAX and Triton and various things like that.
这些通常实际上不是由硬件供应商制作的。它们是由不同类型的用户制作的。我想Google负责其中一些,他们有时也是硬件供应商。但很多时候,它更退一步。尽管即使在那里,跨平台支持也是复杂、混乱且不完整的。
And those are typically actually not made by the hardware vendors. Those are made by different kinds of users. I guess Google is responsible for some of these, and they are also sometimes a hardware vendor. But a lot of the time, it's a step further back. Although even there, the cross-platform support is complicated and messy and incomplete.
因为它们建立在根本上不兼容的东西之上。所以这是根本性质。因此,再次回到Chris的功能失调和我的奇怪职业选择。我总是最终回到硬件软件边界。还有很多其他人非常擅长添加非常高级的抽象。
Because they're built on top of fundamentally incompatible things. So that's the fundamental nature. And so again, you go back to Chris's dysfunction and my weird career choices. I always end up back at the hardware software boundary. And there's a lot of other folks that are really good at adding very high level abstractions.
如果你回溯几年前,MLOps是热门话题。它是,让我们在TensorFlow和PyTorch之上构建一层Python,构建一个统一的AI平台。但问题是,在两个运行不佳的东西之上构建抽象,无法解决性能、可靠性、管理或其他问题。你只能添加一层胶带。但一旦出现问题,你最终不得不调试这一整套你根本不想了解的疯狂堆栈。
If you go back a few years ago, MLOps was the cool thing. And it was, let's build a layer of Python on top of TensorFlow and PyTorch and build a unified AI platform. But the problem with that is that building abstractions on top of two things that don't work very well can't solve performance or reliability or management or these other problems. You can only add a layer of duct tape. But as soon as something goes wrong, you end up having to debug this entire crazy stack of stuff that you really didn't want to have to know about.
所以它是一个泄漏的抽象。因此,Modular的起源回到这一点,是意识到行业存在结构性问题。没有人有动力去构建一个统一的软件平台并在底层做这项工作。所以我们开始做的是,我们说,好吧,让我们去构建,有不同方式解释这个,你可以说是CUDA的替代品。这是一种夸张的说法,但让我们去构建所有这些技术的继承者,它比硬件制造商构建的更好,并且是可移植的。
And so it's a leaky abstraction. And so the genesis of Modular, to bring it back to this, was realizing there are structural problems in the industry. There is nobody that's incentivized to go build a unifying software platform and do that work at the bottom level. So what we set off to do is we said, okay, let's go build, and there are different ways of explaining this, you could say, a replacement for CUDA. That's a flamboyant way to say it, but let's go build a successor to all of this technology that is better than what the hardware makers are building and is portable.
因此,这需要硬件公司正在做的那些工作。我为团队设定的目标是,我们要做得比例如NVIDIA为他们自己的硬件做得更好。
And so what this takes is doing the work that these hardware companies are doing. And I set the goal for the team of saying, let's do it better than, for example, NVIDIA is doing it for their own hardware.
这绝非易事,对吧?他们拥有很多非常优秀的工程师,而且比任何人都更了解他们的硬件。在他们自己的硬件上超越他们是很困难的。
Which is no easy feat, right? They've got a lot of very strong engineers, and they understand their hardware better than anyone. Beating them on their own hardware is tough.
那确实非常困难。他们有二十年的先发优势,因为CUDA已经有大约二十年历史了。他们拥有所有的势头。如你所说,他们是一家相当大的公司,有很多聪明人。所以那是一个荒谬的目标。
That is really hard. And they've got a twenty year head start because CUDA is about twenty years old. They've got all the momentum. They're a pretty big company, as you say, lots of smart people. And so that was a ridiculous goal.
我为什么这么做?嗯,我的意思是,对技术运作方式有一定程度的信心,对我认为我们能构建的东西和方法有一些押注和洞察直觉,但也意识到这实际上是命运。总得有人来做这项工作。如果我们想建立一个不由单一供应商控制一切的生态系统,如果我们想充分发挥硬件性能,如果我们想要新的编程语言技术,如果我们想在GPU上进行模式匹配——我的意思是,这又不是火箭科学,对吧?那么我们迟早需要做这件事。如果没人愿意做,那我就站出来做。
Why did I do that? Well, I mean, a certain amount of confidence in understanding how the technology worked, having a bet on what I thought we could build and the approach and some insight and intuition, but also realizing that it's actually destiny. Somebody has to do this work. If we ever want to get to an ecosystem where one vendor doesn't control everything, if we want to get the best out of the hardware, if we want to get new programming language technologies, if we want pattern matching on a GPU, I mean, come on, this isn't rocket science, right? Then we need at some point to do this. And if nobody else is gonna do it, I'll step up and do that.
所以Modular就是这样诞生的,我们说,让我们去攻克这个难题。我不知道这会花多长时间,但有时如果对世界有价值,做真正困难的事情是值得的。我们相信这可能会产生深远的影响,并希望能让更多人至少能够使用这种新的计算形式,包括GPU、加速器等等,真正重新民主化AI计算。
And so that's where Modular came from: saying, let's go crack this thing open. I don't know how long it will take, but sometimes it's worthwhile doing really hard things if they're valuable to the world. And the belief was it could be profoundly impactful and hopefully get more people to at least be able to use this new form of compute, with GPUs and accelerators and all this stuff, and just really re-democratize AI compute.
所以你指出了这里存在一个真正的结构性问题。我其实想知道,在商业模式层面上,你打算如何解决这个结构性问题,因为计算的历史如今充满了试图销售编程语言的公司的尸体。这是一个非常困难的生意。Modular是如何设立的,以便有动力以能够成为一个共享平台的方式构建这个平台,而不受制于另一个供应商的锁定?
So you pointed out that there's a real structural problem here. And I'm actually wondering how, at a business model level, you wanna solve the structural problem, because the history of computing is these days littered with the bodies of companies that tried to sell a programming language. It's a really hard business. How is Modular set up so that it's incentivized to build this platform in a way that can be a shared platform that isn't subject to just one other vendor's lock-in?
第一个答案是不要销售编程语言。如你所说,那非常困难。所以我们不这么做。去拿Mojo,免费使用它。我们不是在销售编程语言。
First answer is: don't sell a programming language. As you say, that's very difficult. So we're not doing that. Go take Mojo, go use it for free. We're not selling a programming language.
我们正在做的是投资这项基础技术来统一硬件。我们的观点是,正如我们在许多其他领域所见,一旦你稳固了基础,就能为企业构建高价值的服务。因此,在我们的企业层,我们经常与这些团队交流,他们拥有数百或数千个GPU。这些GPU通常是从云服务商那里租用的,租期为三年。他们有一个平台团队负责值守,需要确保所有这些设备以及生产工作负载持续运行。
What we're doing is investing in this foundational technology to unify hardware. Our view is, as we've seen in many other domains, once you fix the foundation, now you can build high-value services for enterprises. And so at our enterprise layer, we often end up talking to these groups where you have hundreds or thousands of GPUs. Often it's rented from a cloud on a three-year commit. You have a platform team that's carrying pagers, and they need to keep all this stuff running and all the production workloads running.
然后你还有这些产品团队,他们一直在创新。新研究不断涌现,新模型发布后,他们希望将其部署到生产基础设施上,但这些东西实际上都无法正常工作。因此,我们现有的软件生态系统充斥着这些聪明但混乱的开源工具,各种不同版本的CUDA和库,各种不同的硬件,简直是一团糟。因此,帮助平台工程团队解决这个问题——他们需要确保系统正常工作,并希望能够理解它,获得良好的可观测性、可管理性和可扩展性等——我们认为这实际上非常有趣。我们在这方面得到了很多人的积极反馈。
And then you have these product teams that are inventing new stuff all the time. And there's new research, there's a new model that comes out and they want to get it on the production infrastructure, but none of this stuff actually works. And so the software ecosystem we have with all these brilliant but crazy open source tools that are thrashing around, all these different versions of CUDA and libraries, all this different hardware happening, it's just a gigantic mess. And so helping solve this for the platform engineering team that actually needs to have stuff work and want to be able to reason about it and want good observability and manageability and scalability and things like this is actually, we think, very interesting. We've gotten a lot of good response from people on that.
这样做的成本是我们希望它真正能工作。这就是我们进行基础语言编译器、底层系统技术开发,并帮助整合这些加速器的原因,以便我们能够获得最佳性能,例如在AMD GPU上,并确保软件发布与NVIDIA GPU的支持同步。能够实现这一点,再次大大降低了复杂性,从而产生了一个真正可行的产品,这非常酷且新颖。
The cost of doing this is we want to actually make it work. That's where we do fundamental language, compiler, and underlying systems technology and help bring together these accelerators so that we can get, for example, the best performance on an AMD GPU and get it so that the software comes out in the same release train as support for an NVIDIA GPU. And being able to pull that together, again, just multiplicatively reduces complexity, which then leads to a product that actually works, which is really cool and very novel
在AI领域。Mojo在这里的作用是,它基本上让你提供最佳性能,并在多个不同的硬件平台上获得最佳性能。你主要将其视为推理平台,还是训练领域如何融入其中?
in AI. So the way that Mojo plays in here is it basically lets you get the best possible performance across multiple different hardware platforms. Are you primarily thinking about this as an inference platform, or how does the training world fit in?
让我放大视角,解释一下我们的技术组件。我有一系列博客文章,鼓励你和任何观众或听众查看,名为《民主化AI计算》。它回顾了所有系统及其遇到的问题和挑战的历史,并阐述了Modular如何应对。第11部分讨论了架构,其核心是Mojo,一种编程语言。我稍后会解释Mojo。
So let me zoom out and I'll explain our technology components. I have a blog post series, which I encourage you and any viewers or listeners to check out, called Democratizing AI Compute. It goes through the history of all these systems and the problems and challenges that they've run into, and it gets to what Modular is doing about it. So part 11 talks about the architecture, and at the core is Mojo, which is a programming language. I'll explain Mojo in a second.
下一层叫做Max。你可以将Max视为PyTorch或vLLM的替代品,一种可以在单个节点上运行并获得高性能LLM服务的解决方案,适用于那种用例。再往外一层是Mammoth,这是集群管理、Kubernetes类的层。因此,如果你一直放大回Mojo,你会说,以你的经验,你知道编程语言是什么,它们构建起来极其困难且昂贵。为什么你一开始要这么做呢?
Next level out is called Max. So you can think of Max as being a PyTorch replacement or a vLLM replacement, something that you can run on a single node and then get high-performance LLM serving, that kind of use case. And then the next level out is called Mammoth, and this is the cluster management, Kubernetes kind of layer. And so if you zoom all the way back in to Mojo, you say, given your experience, you know what programming languages are, they're incredibly difficult and expensive to build. Why would you do that in the first place?
答案是我们不得不这样做。事实上,当我们开始Modular时,我曾想,我不会发明一种编程语言。我知道那是个坏主意。它耗时太长,工作量大。
And the answer is we had to. In fact, when we started Modular, I was like, I'm not going to invent a programming language. I know that's a bad idea. It takes too long. It's too much work.
你无法说服人们采用一门新语言。我知道所有理由都表明创建一门语言实际上是个糟糕的主意,但事实证明我们被迫这么做,因为没有好的方法来解决这个问题。这个问题就是:如何编写能在不同加速器之间移植的代码?所以这个问题,我想要的移植性是,比如简单起见,在AMD和NVIDIA GPU上都能运行,但你还得考虑到使用GPU是为了追求性能。所以我不想要一个简化削弱版的、运行在GPU上的Java。
You can't convince people to adopt a new language. I know all the reasons why creating a language is actually a really bad idea, but it turns out we were forced to do this because there was no good way to solve the problem. And the problem is: how do you write code that is portable across accelerators? So that problem, I want portability across, to make it simple, AMD and NVIDIA GPUs for example, but then you layer on the fact that you're using a GPU because you want performance. And so I don't want a simplified, watered-down Java that runs on a GPU.
我想要GPU的全部能力。我想要能够提供媲美甚至超越NVIDIA在其自家硬件上的性能。我想要可移植性,并统一这种疯狂的计算环境——你有这些非常花哨的异构系统,有张量核心,硬件平台层正经历着复杂性和创新的爆发。大多数编程语言甚至不知道存在八位浮点数。所以我们四处寻找,我真的很不想这么做,但事实证明确实没有好的答案。
I want the full power of the GPU. I want to be able to deliver performance that meets and beats NVIDIA on their own hardware. I want to have portability and unify this crazy compute world where you have these really fancy heterogeneous systems and you have tensor cores and you have this explosion of complexity and innovation happening in this hardware platform layer. Most programming languages don't even know that an eight-bit floating point format exists. And so we looked around, and I really did not want to have to do this, but it turns out that there really is no good answer.
再次强调,我们决定,嘿,风险很高。我们想做有影响力的事情。我们愿意投入。我知道构建一门编程语言需要什么。这并非火箭科学。
And again, we decided that, hey, the stakes are high. We want to do something impactful. We're willing to invest. I know what it takes to build a programming language. It's not rocket science.
这只是大量非常艰苦的工作,你需要以正确的方式激励团队。但我们决定,是的,就这么做吧。
It's just a lot of really hard work and you need to set the team up to be incentivized the right way. But we decided that, yeah, let's do that.
所以我想多谈谈Mojo及其设计。但在那之前,也许我们先多聊聊现有的环境。我确实读了那系列博客文章。我推荐给大家。认为它真的很棒。
So I wanna talk more about Mojo and its design. But before we do, maybe let's talk a little bit more about the preexisting environment. I did actually read that blog post series. I recommend it to everyone. I think it's really great.
我想稍微谈谈现有语言生态系统的情况。但在此之前,我们能多聊聊硬件吗?人们想要运行这些机器学习模型的硬件空间是什么样的?
And I wanna talk a little bit about what the existing ecosystem of languages looks like. But even before then, can we talk more about the hardware? What is that space of hardware look like that people want to run these ML models on?
是的。所以大多数人关注的是GPU。GPU现在我认为越来越被理解了。但在此之前,你有CPU。现代数据中心中的CPU,通常你会拥有,我的意思是,如今你们可能在写相当大型的代码,但一个CPU有100个核心,主板上有一台配备两到四个CPU的服务器,然后你去扩展它。
Yeah. So the one that most people zero in on is the GPU. And GPUs are, I think, getting better understood now. But if you go back before that, you have CPUs. So modern CPUs in a data center, often you'll have, I mean, today, you guys are probably running quite big iron, but you've got a hundred cores in a CPU, you've got a server with two to four CPUs on a motherboard, and then you go and you scale that.
因此,你有传统的线程化工作负载必须在CPU上运行,我们知道如何为互联网服务器等场景扩展它。当你转向GPU时,架构就发生了变化。它们基本上有这些称为SM(流式多处理器)的东西,现在编程模型变成了将大量中等规模的计算单元通过更高性能的内存结构组合在一起,编程模型也随之转变。例如,真正让CUDA招架不住的变化之一,就是GPU引入了张量核心。你可以把张量核心理解为专门用于矩阵乘法的硬件单元。
And so you've got traditional threaded workloads that have to run on CPUs, and we know how to scale that for Internet servers and things like this. If you get to a GPU, the architecture shifts. They basically have these things called SMs, streaming multiprocessors, and now the programming model is that you have effectively much more medium-sized compute that's put together on much higher performance memory fabrics, and the programming model shifts. And one of the things that really broke CUDA, for example, was when GPUs got this thing called a tensor core. And the way to think about a tensor core is it's a dedicated piece of hardware for matrix multiplication.
那么为什么我们需要这个呢?因为很多AI计算都是矩阵乘法,所以如果你设计硬件来擅长特定工作负载,就可以为它配备专用硅片,从而让计算速度变得非常快。
And so why do we get that? Well, a lot of AI is matrix multiplication, and so if you design the hardware to be good at a specific workload, you can have dedicated silicon for that and you can make things go really fast.
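To make the tensor core idea concrete, here's a rough NumPy sketch (not actual GPU code): a large matrix multiplication decomposes into a grid of small fixed-size tile multiplies, and each small tile product is roughly what a tensor core executes as a single hardware operation. The 16x16 tile size is illustrative only; real tensor core shapes vary by vendor and generation.

```python
import numpy as np

TILE = 16  # illustrative tile size; real tensor core shapes vary

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                # each small product here stands in for one tensor core op
                out[i:i+TILE, j:j+TILE] += (
                    a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
                )
    return out

a = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

The point of the decomposition is that the hardware only needs to be good at one fixed shape; everything else is scheduling tiles onto it.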
GPU领域内部确实存在这两种相当不同的模型。当然,GPU这个名字本身代表图形处理单元,这是它们最初的用途。而这个SM模型非常有趣,它们有所谓warp(线程束)的概念,对吧?
There are really these two quite different models sitting inside of the GPU space. Of course, the name itself, GPU, is graphics processing unit, which is what they were originally for. And then this SM model is really interesting. They have this notion of a warp. Right?
一个warp通常是由32个线程组成的集合,它们以锁步方式协同运行,总是执行相同的操作。这是对所谓SIMD模型(单指令多数据)的轻微变体,但比SIMD更通用一些。大致上你可以认为它们是类似的。你只需要运行大量这样的warp,而这些系统内部有大量硬件使得线程间的切换成本极低。
A warp is a collection of typically 32 threads that are operating together in lockstep, always doing the same thing. It's a slight variation on what's called the SIMD model, same instruction, multiple data; it's a little more general than that, but more or less you can think of it as the same thing. And you just have to run a lot of them, and then there's a ton of hardware inside of these systems basically to make switching between threads incredibly cheap.
所以你花费大量硅片面积来增加额外的寄存器,使得上下文切换超级廉价。这样你就可以并行处理大量任务。每个任务本身就像是32路并行。而且由于你能进行非常快速的上下文切换,可以隐藏很多延迟。
So you pay a lot of silicon to add extra registers, so the context switch is super cheap, so you can do a ton of stuff in parallel. Each thing you're doing is itself like 32-wide parallel. And then because you can do all this very fast context switching, you can hide a lot of latency.
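A loose way to picture the lockstep execution just described, again sketched in NumPy rather than real GPU code: every lane of a warp executes the same instruction on its own data, and a branch is handled by computing both sides and masking lanes on or off, not by lanes taking different code paths. The warp size of 32 matches the typical figure above; the particular branch is just an arbitrary example.

```python
import numpy as np

WARP = 32  # threads per warp, executing in lockstep

def warp_step(x: np.ndarray) -> np.ndarray:
    """One 'instruction stream' applied across all lanes of a warp."""
    assert x.shape == (WARP,)
    mask = x % 2 == 0                          # lanes "diverge" at this branch...
    return np.where(mask, x // 2, 3 * x + 1)   # ...but both sides run, masked

lanes = np.arange(WARP)
result = warp_step(lanes)
assert result[2] == 1 and result[3] == 10  # even lane: 2 // 2; odd lane: 3 * 3 + 1
```

Divergent branches are exactly what this model is bad at: both sides of the `if` cost time on every lane, which is why GPU kernels try to keep warps uniform.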
这种方式在一段时间内很有效。然后我们发现,实际上我们需要更多的矩阵乘法运算。通过这种warp模型你可以实现相对高效的矩阵乘法,但效果并不是特别好。于是出现了一批相当特殊的硬件,它们为了进行这些矩阵乘法,每一代都在改变其性能特征,对吧?
And that worked for a while. And then we're like, actually, we need way more of this matrix multiplication stuff. And you can sort of do reasonably efficient matrix multiplication through this warp model, but not really that well. And then there's a bunch of quite idiosyncratic hardware, which changes its performance characteristics from generation to generation, just for doing these matrix multiplications. Right?
这就是NVIDIA GPU的发展历程。像Volta架构的V100、A100和H100,它们不断演进,每一代在性能特征和内存模型上都发生了相当实质性的变化。
So that's sort of the NVIDIA GPU story. Volta with the V100, then the A100 and the H100. They just keep on going and changing pretty materially from generation to generation in terms of the performance characteristics, and also the memory model, which keeps on changing.
你回到直觉上。CUDA 从未为这个世界设计。CUDA 不是为现代 GPU 设计的。它是为一个简单得多的世界设计的。CUDA 已经有二十年历史了,它并没有真正跟上时代。
You go back to intuition. CUDA was never designed for this world. CUDA was not designed for modern GPUs. It was designed for a much simpler world. And CUDA being twenty years old, it hasn't really caught up.
这非常困难,因为正如你所说,硬件在不断变化。所以 CUDA 是从一个几乎像 C 语言的世界中设计的,它为一个非常简单的编程模型而设计,并期望能够扩展,但随着硬件的变化,它无法适应。现在,如果你超越 GPU,你会看到 Google TPU 和许多其他专用 AI 系统。它们彻底颠覆了这一点,它们说,好吧,让我们摆脱 GPU 上的线程,只保留矩阵乘法单元,并且是真正庞大的矩阵乘法单元,围绕这个构建整个芯片,这样你就能获得更高的专业化程度,但对于那些 AI 工作负载,你能获得更高的吞吐量。
And it's very difficult because, as you say, the hardware keeps changing. So CUDA was designed in a world, almost like C, for a very simple programming model that was expected to scale, but as the hardware changed, it couldn't adapt. Now if you get beyond GPUs, you get to the Google TPU and many other dedicated AI systems. They blow this way out and say, okay, well, let's get rid of the threads that you have on a GPU and let's just have matrix multiplication units, really big matrix multiplication units, and build the entire chip around that. You get much more specialization, but you get much higher throughput for those AI workloads.
回到为什么是 Mojo?嗯,Mojo 是从第一性原理出发设计的,以支持这类系统。正如你所说,这些芯片中的每一个,即使在 NVIDIA 家族内部,从 Volta 到 Ampere 到 Hopper 再到 Blackwell,这些东西彼此并不兼容。实际上,Blackwell 刚刚打破了与 Hopper 的兼容性,所以它不能总是在 Blackwell 上运行 Hopper 的内核。哎呀。
Going back to: why Mojo? Well, Mojo was designed from first principles to support this kind of system. Each of these chips, as you're saying, even within NVIDIA's family, Volta to Ampere to Hopper to Blackwell, these things are not compatible with each other. Actually, Blackwell just broke compatibility with Hopper, so you can't always run Hopper kernels on Blackwell. Oops.
嗯,他们为什么要这样做?因为 AI 软件发展得太快了,他们认为这是正确的权衡取舍。与此同时,我们所有软件人员都需要有能力瞄准这个目标。当你看看其他现有系统,比如 Triton,他们的目标是让 GPU 编程更容易,这我很喜欢。这太棒了。
Well, why are they doing that? Well, AI software is moving so fast, they decided that was the right trade-off to make. And meanwhile, all of us software people need the ability to target this. When you look at other existing systems, like Triton, for example, their goal was: let's make it easier to program a GPU, which I love. That's awesome.
但随后他们说,我们愿意牺牲 20% 的硅性能来实现这一点。等一下。我想要全部的性能。所以,如果我使用 GPU(顺便说一下,GPU 相当昂贵),我想要全部的性能,如果它不能提供与编写 CUDA 相同的质量结果,那么你总是会遇到一个上限,你一开始进展很快,但随后就会碰到天花板,然后不得不切换到另一个系统以获得全部性能。因此,Mojo 正是在这里试图解决这个问题,我们能够获得更高的可用性、更好的可移植性以及硅片的全部性能,因为它是为像张量核心这样古怪的架构设计的。
But then they said, we'll just give up 20% of the performance of the silicon to do it. Wait a second. I want all the performance. And so if I'm using a GPU, and GPUs are quite expensive by the way, I want all the performance, and if it's not gonna be able to deliver the same quality of results you would get by writing CUDA, well, then you're always gonna run into this headroom problem where you get going quickly, but then you hit a ceiling and have to switch to a different system to get full performance. And so this is where Mojo is really trying to solve this problem, where we can get more usability, more portability, and the full performance of the silicon, because it's designed for these wacky architectures like tensor cores.
如果我们看看现有的其他语言,有像 CUDA 和 OpenCL 这样的语言,它们是低级的,通常看起来像是 C++ 传统的变体,是不安全的语言,这意味着你必须遵循很多规则。如果你没有完全遵守规则,你就会进入未定义行为的领域。很难对你的程序进行推理。
And if we look at the other languages that are out there, there's languages like CUDA and OpenCL, which are low level, typically look like variations on C++ in that tradition, and are unsafe languages, which means that there's a lot of rules you have to follow. And if you don't exactly follow the rules, you're in undefined behavior land. It's very hard to reason about your program.
请允许我调侃一下我的 C++ 传承,因为我花了这么多年时间,就像,你只是有一个变量忘记初始化了。它就直接让你栽个大跟头。就像,这对程序员来说是完全不必要的伤害。
And just let me make fun of my C++ heritage, because I've spent so many years there. Like, you just have a variable that you forget to initialize, and it just shoots your foot off. Like, it's just unnecessary violence to programmers.
没错。这样做是为了提升性能,因为C++及其相关语言实际上无法提供足够的信息让你知道何时犯错,而他们希望尽可能有更大的空间来优化获得的程序。所以立场就是:如果你做了任何不被允许的事情,我们没有义务维持该行为的任何合理语义或可调试性,我们只会非常、非常努力地优化正确的程序——这是一个非常奇怪的立场,因为没有人写的程序是完全正确的。几乎任何规模的C++程序都存在bug和未定义行为。因此,在使用编译器系统时,你所获得的保证处于一个非常奇怪的位置。
Right. And it's done in the interest of making performance better, because the idea is that C++ and its related languages don't really give you enough information to know when you're making a mistake, and they want to have as much space as they can to optimize the programs they get. So the stance is: if you do anything that's not allowed, we have no obligation to maintain any kind of reasonable semantics or debuggability around that behavior, and we're just gonna try really, really hard to optimize correct programs. Which is a super weird stance to take, because nobody's programs are correct. There are bugs and undefined behavior in almost any C++ program of any size. And so you're in a very strange position in terms of the guarantees that you get from the compiler system you're using.
嗯,我的意思是,我可以很果断,但我也能理解从事C++工作的人们。毕竟,我在这门语言和这个生态系统中已经沉浸了几十年,为我们构建编译器。我对它了解颇多。挑战在于C++已经确立,所以有大量的代码存在。迄今为止,已经写好的代码才是最有价值的。
Well, so, I mean, I can be decisive, but I can also be sympathetic with people that work on C++. Again, I've spent decades in this language and around this ecosystem, building compilers for it. I know quite a lot about it. The challenge is that C++ is established, and so there's tons of code out there. By far, the code that's already written is the code that's the most valuable.
因此,如果你在构建编译器,或者你有一个新芯片或优化器,你的目标是从现有软件中获取价值。所以你不能发明一种新的编程范式,以更好的方式做事并从根本上消除问题。相反,你必须处理现有的东西。你有一个需要跑得更快的SPEC基准测试,所以你发明了一些疯狂的英雄式黑科技,让某个重要的基准测试跑起来,因为你无法去更改代码。
And so if you're building a compiler, or you have a new chip, or you have an optimizer, your goal is to get value out of the existing software. And so you can't invent a new programming paradigm that's a better way of doing things and defines away the problem. Instead, you have to work with what you've got. You have a SPEC benchmark you're trying to make go fast, and so you invent some crazy hero hack that makes some important benchmark work, because you can't go change the code.
根据我的经验,特别是在AI领域,但我相信在Jane Street内部也是如此,如果某个东西运行缓慢,就去修改代码。你对系统的架构有控制权。因此,我认为世界真正受益的,不同于基准测试技巧,是那些赋予程序员控制力、能力和表达力的语言。我认为,如果你退一步看,你会意识到历史之所以如此,有很多结构性的、非常合理的原因,但这些原因并不适用于这个新的计算时代。没有人拥有可以迁移到明年GPU上的工作负载。
In my experience, particularly for AI, but also, I'm sure, within Jane Street, if something's going slow, go change the code. You have control over the architecture of the system. And so what I think the world really benefits from, unlike benchmark hacking, is languages that give control and power and expressivity to the programmer. And this is something where I think that if you, again, take a step back, you realize history is the way it is for lots of structural and very valid reasons, but those reasons don't apply to this new age of compute. Nobody has a workload that they can pull forward to next year's GPU.
不存在。没有人解决了这个问题。我不知道具体时间框架,但一旦我们解决了那个问题,一旦我们解决了可移植性,你就可以开启这个软件可以真正向前发展的新时代。所以现在对我来说,责任是确保它确实是好的。因此,关于内存安全性的观点,不要让它变得忘记初始化一个变量就会导致严重问题。
Doesn't exist. Nobody solved this problem. I don't know the timeframe, but once we solve that problem, once we solve portability, you can start this new era of software that can actually go forward. And so now to me, the burden is make sure it's actually good. And so to your point about memory safety, don't make it so forgetting to initialize a variable is just going to shoot your foot off.
产生一个好的编译器错误,提示:嘿,忘记初始化变量了,对吧?这些基本的东西实际上非常深刻和重要,这些工具、所有可用性以及这种DNA,这些感受和想法,都融入了Mojo。
Produce a good compiler error saying, hey, you forgot to initialize a variable. Right? These basic things are actually really profound and important, and the tooling and all this usability and this DNA, these feelings and thoughts, are what flow into Mojo.
GPU编程与传统CPU编程是完全不同的世界。仅就基本经济原理和人类参与方式而言,你最终处理的是小得多的程序。你有这些非常小但价值极高的程序,其性能至关重要,最终只有相对较小的一群专家在其中编程。因此,它不断推动你朝着你所说的性能工程方向发展,对吧?
And GPU programming is just a very different world from traditional CPU programming. Just in terms of the basic economics and how humans are involved, you end up dealing with much smaller programs. You have these very small but very high value programs whose performance is super critical, and in the end, a relatively small coterie of experts who end up programming in it. And so it pushes you ever in the direction you're saying of performance engineering. Right?
你希望给予人们所需的控制权,让事物按预期运行,并且要以一种能让人们高效工作的方式来实现。至于需要迁移大量遗留代码的想法,实际上,你并不需要。整个软件世界实际上小得惊人,真正重要的是如何尽可能好地编写这些小型程序。
You wanna give people the control they need to make the thing behave as it should, and you wanna do it in a way that allows people to be highly productive. And the idea that you have an enormous amount of legacy code that you need to bring over, it's like, actually, you kinda don't. The entire universe of software is actually shockingly small, and it's really about how to write these small programs as well as possible.
而且,还有另一个巨大的变化。我认为编程语言社区尚未意识到这一点,但AI编程已经彻底改变了游戏规则。因为现在你可以拿一个CUDA内核说,嘿,Claude,把它转换成Mojo。
And also, there's another huge change. And so this is something that I don't think that the programming language community has recognized yet, but AI coding has massively changed the game. Because now you can take a CUDA kernel and say, hey, Claude, go make that into Mojo.
实际上,你们觉得这种翻译体验的效果如何?
And actually, how good have you guys found the experience of doing that translation?
我们举办黑客马拉松,人们做出了惊人的事情,他们从未接触过Mojo,也从未做过GPU编程。但在一天之内,他们就能实现令人震惊的成果。所以说,AI编程工具并非魔法。你不能仅仅凭感觉就氛围编程出一个DeepSeek R1之类的东西,对吧?但它在学习新语言、掌握新工具以及进入并利用生态系统方面所能做到的令人惊叹。
Well, we do hackathons, and people do amazing things having never touched Mojo, never having done GPU programming. And within a day they can make things happen that are just shocking. Now, AI coding tools are not magic. You cannot just vibe code DeepSeek R1 or something, right? But it's amazing what they can do in terms of learning new languages, learning new tools, and getting into and capitalizing on ecosystems.
所以这就是其中一个方面,回想五到十年前,大家都知道没人能学会新语言,也没人愿意接受新事物,但整个系统已经改变了。
And so this is one of the things where, again, you go back five or ten years, everybody knows nobody can learn a new language and nobody's willing to adopt new things, but the entire system has changed.
那么让我们更详细地谈谈Mojo的架构。Mojo是一种什么样的语言?你们选择了哪些设计元素来让它能够解决这一系列问题?
So let's talk a little bit more in detail about the architecture of Mojo. What kind of language is Mojo, and what are the design elements that you chose in order to make it be able to address this set of problems?
是的。再次说明情况有多么不同,回想我从事Swift开发的时候,需要解决的主要问题之一是Objective C非常难用。你有指针,有方括号,非常奇怪。所以当时的目标是发明新语法,汇集现代编程语言特性来构建一门新语言。快进到今天,实际上,其中一些仍然适用。
Yeah. Again, just to relate how different the situation is, back when I was working on Swift, one of the major problems to solve was Objective C was very difficult for people to use. And you had pointers and you had square brackets, and it was very weird. And so the goal in the game of the day was invent new syntax and bring together modern programming language features to build a new language. Fast forward to today, actually, some of that is true.
AI领域的人不喜欢C++。C++有指针,看起来很丑陋,而且是一门40多年历史的语言,实际上存在Swift当年必须解决的同样问题。但今天情况有所不同,AI从业者实际上非常喜欢一种语言,那就是Python。因此,Mojo的一个重要特点是它属于Python家族。这对某些人来说可能具有争议性,是的,我理解有些人喜欢花括号,但这非常强大,因为AI社区已经很大程度上是Python化的。所以我们从一开始就说,让我们保持Python的语法,只有在有充分理由的情况下才进行偏离。
AI people don't like C++. C++ has pointers, and it's ugly, and it's a 40-plus-year-old language, and it actually has the same problem that Swift had to solve back in the day. But today there's something different, which is that AI people do actually love a thing: it's called Python. And so one of the really important things about Mojo is it's a member of the Python family. And this is polarizing to some, because yes, I get that some people love curly braces, but it's hugely powerful, because so much of the AI community is Pythonic already. And so we start out by saying, let's keep the syntax like Python and only diverge from that if there's a really good reason.
那么什么是充分的理由呢?好的理由就是,正如我们讨论的,我们需要性能、能力以及对系统的完全控制。对于GPU而言,有一些非常重要的事情需要通过元编程来实现。因此,Mojo拥有一个非常精致的元编程系统,某种程度上受到了Zig语言的启发,它将运行时和编译时结合在一起,使得能够设计出非常强大的库。解决张量核心等问题的办法是,让强大的库能够以库的形式在语言中构建,而不是硬编码到编译器中。
But then what are the good reasons? Well, the good reasons are we want, as we're talking about performance, power, full control over the system. And for GPUs, there's these very important things you want to do that require meta programming. And so Mojo has a very fancy meta programming system kind of inspired by this language called Zig that brings runtime and compile time together to enable really powerful library designs. And the way you crack open this problem with tensor cores and things like this is you enable really powerful libraries to be built in the language as libraries instead of hard coding into the compiler.
我们来稍微谈谈元编程的概念。什么是元编程?为什么它尤其对性能很重要?
Let's take a little bit to the metaprogramming idea. What is metaprogramming and why does it matter for performance in particular?
是的。这是个很好的问题。我想你也知道这个问题的答案。
Yeah. It's a great question. And I think you know the answer to this too.
其实,我们也在自己的领域里研究元编程特性。
Secretly, we are also working on metaprogramming features in our own world.
没错。这里的观察是,当你在编程语言中写一个for循环时,通常这个for循环是在运行时执行的。所以你写的代码在程序执行时,是计算机遵循的指令来执行你代码中的算法。但当你进入设计更高级的类型系统时,突然你也希望在编译时能够运行代码。因此,有很多语言都存在这种情况。
Exactly. And so the observation here is when you're writing a for loop in a programming language, for example, typically that for loop executes at runtime. So you're writing code that when you execute the program, it's the instructions that the computer will follow to execute the algorithm within your code. But when you get into designing higher level type systems, suddenly you want to be able to run code at compile time as well. And so there's many languages out there.
其中一些语言有宏系统。C++有模板。最终你在许多语言中得到的是运行时发生的事情与编译时发生的事情之间几乎形成了一种二元性,后者几乎像另一种语言。C++是最典型的例子,因为模板在运行时有一个for循环,但在编译时却有展开的递归模板之类的东西。其实洞察在于,嘿,这两个问题实际上是相同的,它们只是在不同的时间运行。
Some of them have macro systems. C++ has templates. What you end up getting, in many languages, is this duality between what happens at runtime and then almost a different language that happens at compile time. And C++ is the most egregious, because with templates, you have a for loop at runtime, but then you have unrolled recursive templates or something like that at compile time. Well, the insight is: hey, these two problems are actually the same, they just run at different times.
因此,Mojo 所做的是允许你在编译时使用任何在运行时有效的代码。你可以拥有列表、字符串或任何你想要的东西,以及进行内存分配和释放的算法,并且可以在编译时运行这些,从而能够构建非常强大的高级抽象并将其放入库中。那么这为什么很酷呢?原因在于,例如在 GPU 上,你会有一个张量核心。张量核心很奇特。
And so what Mojo does is say: let's allow effectively any code that you would use at runtime to also work at compile time. And so you can have a list or a string or whatever you want, and algorithms that go do memory allocation and deallocation, and you can run those at compile time, enabling you to build really powerful high-level abstractions and put them into libraries. So why is this cool? Well, the reason it's cool is that on a GPU, for example, you'll have a tensor core. Tensor cores are weird.
我们可能不需要深入探讨所有原因。但张量核心使用的索引和布局非常具体,并且因厂商而异。因此,你在 AMD 上拥有的张量核心,或者在不同版本的 NVIDIA GPU 上拥有的张量核心,都非常不同。所以,作为 GPU 程序员,你希望构建一组抽象,以便可以在一个共同的生态系统中推理所有这些事物,并使布局更加高级。这使得能够构建非常强大的库,其中很多逻辑实际上是在编译时完成的,但你可以调试它,因为它与你运行时使用的语言相同。
We probably don't need to deep dive into all the reasons why. But the indexing and the layout that tensor cores use is very specific and very vendor-different. And so the tensor core you have on AMD, or the tensor cores you have on different versions of NVIDIA GPUs, are all very different. And so what you want, as a GPU programmer, is to build a set of abstractions so you can reason about all of these things in one common ecosystem and have the layouts be much higher level. And what this enables is very powerful libraries, where a lot of the logic is actually done at compile time, but you can debug it, because it's the same language that you use at runtime.
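As an illustration of the "layouts as a library abstraction" idea, here is a hypothetical Python sketch, not Mojo's actual layout API: if a layout is just a function from a logical coordinate to a memory offset, the same kernel code can target a plain row-major layout or a vendor-style tiled layout without being rewritten. The names `row_major`, `tiled`, and `fill` are all made up for this example.

```python
from typing import Callable

Layout = Callable[[int, int], int]  # logical (row, col) -> linear offset

def row_major(cols: int) -> Layout:
    return lambda i, j: i * cols + j

def tiled(cols: int, tile: int) -> Layout:
    # elements grouped into tile x tile blocks, blocks stored contiguously,
    # loosely the flavor of layout a matrix unit might expect
    def index(i: int, j: int) -> int:
        bi, bj, ti, tj = i // tile, j // tile, i % tile, j % tile
        blocks_per_row = cols // tile
        return ((bi * blocks_per_row + bj) * tile + ti) * tile + tj
    return index

def fill(buf: list, rows: int, cols: int, layout: Layout) -> None:
    # a "kernel" written against the abstraction, oblivious to the layout
    for i in range(rows):
        for j in range(cols):
            buf[layout(i, j)] = (i, j)

flat = [None] * 16
fill(flat, 4, 4, row_major(4))
blocked = [None] * 16
fill(blocked, 4, 4, tiled(4, 2))
assert flat[4] == (1, 0)     # row-major: offset 4 starts row 1
assert blocked[4] == (0, 2)  # tiled: offset 4 starts the second 2x2 block
```

In a system like the one described above, the layout functions would be resolved and inlined at compile time, so the abstraction costs nothing at runtime.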
这使得语言更加简单、更加强大,并且能够以 C++ 可能的方式扩展到这些复杂性中。但在 C++ 中,你会得到一些疯狂的模板堆栈跟踪,令人抓狂且难以理解。在 Mojo 中,你可以得到非常简单的错误消息,实际上可以使用调试器调试你的代码,等等。
And it makes the language much simpler, much more powerful, and able to scale into these complexities in a way that is possible with C++. But in C++ you get some crazy template stack trace that is maddening and impossible to understand. In Mojo, you can get a very simple error message, you can actually debug your code with a debugger, and things like this.
所以这里的一个重要点可能是,元编程确实是解决这个性能问题的老方法。也许一个好的思考方式是,想象你有一些数据,代表你编写的一个小型嵌入式领域特定语言,你想通过你编写的程序来执行它。你可以以一种很好的高级方式,为该语言编写一个小型解释器,比如一个布尔表达式语言或其他什么。也许它是一个在 GPU 上对张量进行计算的语言。你可以编写一个程序来执行那个小型领域特定语言,并做你想做的事情。
So maybe an important point here is that metaprogramming is really an old solution to this performance problem. Maybe a good way of thinking about this: imagine you have some piece of data that represents a little embedded domain-specific language that you wanna execute via a program that you wrote. You can, in a nice high-level way, write a little interpreter for that language, you know, maybe a Boolean expression language or who knows what else. Maybe it's a language for computing on tensors on a GPU. And you could write a program that just executes that mini domain-specific language and does the thing that you want.
你可以这样做,但它真的很慢。编写解释器本质上是这样的,因为所有这些解释开销,你需要动态决定程序的行为。有时你想要的实际上是直接生成你想要的代码,消除控制结构,只得到执行必要任务的直接机器代码行。各种形式的代码生成让你以更简单的方式绕过所有这些必须在运行时执行的控制结构,而是能够在编译时执行它,并获得这个精简的程序,只做你想要的。所以这是一个非常古老的想法。
And you can do it, but it's really slow. Writing an interpreter is just inherently slow because of all this interpretation overhead, where you are dynamically making decisions about what the behavior of the program is. And sometimes what you want is to actually emit exactly the code that you want and boil away the control structure, and just get the direct lines of machine code that do the thing that's necessary. And various forms of code generation let you get past, in a simpler way, all of this control structure you'd otherwise have to execute at runtime, and instead be able to execute it at compile time and get this minified program that just does exactly the thing that you want. So that's a really old idea.
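To make the interpreter-versus-code-generation contrast concrete, here's a minimal sketch using the Boolean expression language mentioned above: the interpreter re-walks the expression tree on every evaluation, while the "compiled" version walks it once ahead of time and emits closures with the tree dispatch boiled away. In a real system, the compile step would run at compile time and emit machine code; Python closures just stand in for that here.

```python
# Expressions are nested tuples, e.g. ("and", ("var", "x"), ("not", ("var", "y")))

def interpret(expr, env):
    # re-examines the tree on every call: interpretation overhead at runtime
    op = expr[0]
    if op == "var":
        return env[expr[1]]
    if op == "not":
        return not interpret(expr[1], env)
    if op == "and":
        return interpret(expr[1], env) and interpret(expr[2], env)
    raise ValueError(op)

def compile_expr(expr):
    # walks the tree once, ahead of time, emitting a closure with the
    # dispatch on the tree structure boiled away
    op = expr[0]
    if op == "var":
        name = expr[1]
        return lambda env: env[name]
    if op == "not":
        f = compile_expr(expr[1])
        return lambda env: not f(env)
    if op == "and":
        f, g = compile_expr(expr[1]), compile_expr(expr[2])
        return lambda env: f(env) and g(env)
    raise ValueError(op)

e = ("and", ("var", "x"), ("not", ("var", "y")))
fast = compile_expr(e)  # the tree is only walked here, once
for env in ({"x": True, "y": False}, {"x": True, "y": True}):
    assert interpret(e, env) == fast(env)
```

Every evaluation of `fast` skips the `if op == ...` dispatch entirely, which is exactly the "control structure boiled away" effect described above, just at a toy scale.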
它可以追溯到各种编程语言,许多 Lisp 语言做了很多这种元编程。但问题是这些东西超级难以思考、推理和调试。如果你考虑在 C 语言中使用各种 C 预处理器来做这种事情,这肯定是真的。推理起来相当痛苦。然后 C++ 使其更丰富、更具表现力,但仍然很难推理。你编写一个 C++ 模板,在你给它所有输入并让它运行之前,你并不真正知道它会做什么或者它是否会编译。
It goes back to all sorts of programming languages, a lot of Lisps that did a lot of this metaprogramming. But then the problem is this stuff is super hard to think about and reason about and debug. And that's certainly true in C, with all this macro stuff, if you use the C preprocessor to do this kind of thing. It's pretty painful to reason about. And then C++ made it richer and more expressive, but still really hard to reason about. You write a C++ template, and you don't really know what it's going to do, or if it's going to compile, until you give it all the inputs and let it go.
而且
And
是的。在简单情况下感觉很好,但当你进入更高级的案例时,复杂性突然叠加并失控。
Yeah. And it feels good in the simple case, but when you get to more advanced cases, suddenly the complexity compounds and it gets out of hand.
听起来你在Mojo中追求的是让它感觉像一种语言。它有一个类型系统,既涵盖你静态生成的内容,也涵盖你在运行时处理的内容。调试在这两个层面上似乎都以相同的方式工作。但你仍然可以从一种语言中获得你想要的运行时行为,这种语言可以更明确地表示:这就是我想要生成的精确代码。
And it sounds like the thing that you're going for in Mojo is it feels like one language. It has one type system that covers both the stuff you're generating statically and the stuff that you're doing at runtime. Sounds like debugging works in the same way across both of these layers. But you still get the actual runtime behavior you want from a language that you could more explicitly just be like, here's exactly the code that I wanna generate.
并且把元编程作为其重点打磨的高级特性之一。其中一个很酷的特点是它看起来和感觉像Python,但具有真正的类型系统。
And it zeroes in on metaprogramming as one of the fancy features. One of the cool features is it feels and looks like Python, but with actual types.
没错。
Right.
没错。而且我们不要忘记基础。拥有看起来和感觉像Python,但速度快上千倍的东西实际上非常酷。例如,如果你在CPU上,你可以访问SIMD寄存器,这些寄存器允许你同时执行多个操作,即使不使用高级功能也能充分发挥硬件性能,这也很酷。所以这些系统面临的挑战是如何创造出既强大又易于使用的东西。
Right. And let's not forget the basics. Having something that looks and feels like Python, but is a thousand times faster or something, is actually pretty cool. For example, if you're on a CPU, you have access to SIMD, the SIMD registers that allow you to do multiple operations at a time, and being able to get the full power of your hardware even without using the fancy features is also really cool. And so the challenge with any of these systems is: how do you make something that's powerful, but also easy to use?
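As a rough analogy for the SIMD point, with NumPy standing in for real SIMD registers: one vectorized operation processes a whole block of values at once, where the scalar version pays per-element loop overhead. (In Mojo, this would be an explicit SIMD type; here it's just an illustration of the shape of the idea.)

```python
import numpy as np

def scale_scalar(xs, a):
    # one multiply per Python loop iteration: per-element overhead
    return [a * x for x in xs]

def scale_simd(xs: np.ndarray, a: float) -> np.ndarray:
    # whole-array multiply: one vectorized operation, no Python-level loop
    return a * xs

xs = np.arange(8, dtype=np.float32)
assert np.allclose(scale_simd(xs, 2.0), scale_scalar(xs.tolist(), 2.0))
```

The results are identical; the difference is entirely in how the work maps onto the hardware.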
我想你的团队一直在使用Mojo并做一些很酷的事情。我是说,你看到了什么,你的体验如何?
I think your team's been playing with Mojo and doing some cool stuff. I mean, what have you seen, and what's your experience been?
我们都还比较新接触它,但我认为它有很多令人兴奋的特点。我的意思是,首先,它确实提供了你需要的性能所需的编程模型。实际上,在很多方面,这与从CUTLASS或CuTe DSL等工具中获得的编程模型类似,这些都是NVIDIA特定的,有些在C++层面,有些在Python DSL层面。顺便说一下,如今你能想到的每个工具都会在C++和Python中各实现一次。我们不再需要用其他方式实现编程语言了。
We're all still pretty new to it, but I think it's got a lot of exciting things going for it. I mean, the first thing is, yeah, it gives you the kind of programming model you want to get the performance that you need. And actually, in many ways, the same kind of programming model that you get out of something like CUTLASS or the CuTe DSL, which are these NVIDIA-specific things, some at the C++ level, some at the Python DSL level. And by the way, every tool you can imagine nowadays is done once in C++ and once in Python. We don't need to implement programming languages any other way anymore.
它们要么是基于C++的封装,要么是基于Python的封装。但根据你选择的路径不同,无论是走C++路线还是Python路线,都会面临各种复杂的权衡。比如,在C++路径中,编译时间尤其痛苦。你提到的模板元编程问题绝对属实,错误信息非常糟糕。
They're all either skins on C++ or skins on Python. But depending on which path you go down, whether you go the C++ path or the Python path, you get all sorts of complicated trade-offs. Like, in the C++ path, in particular, you get very painful compilation times. The thing you said about template metaprogramming is absolutely true. The error messages are super bad.
如果你看这些更偏向Python的嵌入式DSL,编译时间往往会更好。不过仍然难以推理。Mojo的一个优点是整体设计非常明确。当你想理解某个值最终是在运行时处理,还是在编译时处理时,只需看语法就能明白,非常直观。
If you look at these more Python-embedded DSLs, the compile times tend to be better. It still can be hard to reason about, though. One nice thing about Mojo is the overall discipline seems very explicit. When you want to understand, is this a value that's happening at execution time at the end, or is it a value that you know is going to be dealt with at compile time? It's just very explicit in the syntax; you look and you understand.
而在某些DSL中,你必须主动去探查值并询问它的类型。我认为这种明确性对于性能工程非常重要,能让你轻松理解自己到底在做什么。这种现象不仅出现在底层领域,如果你看更高级的工具如PyTorch也是如此。PyTorch让你编写看似普通的Python程序,但实际上它的执行模型要复杂得多。Python是一个既神奇又糟糕的生态系统——使用Python时你能得到什么保证?
Whereas in some of these DSLs, you have to actively go and poke the value and ask it what kind of value it is. And I think that kind of explicitness is actually really important for performance engineering, making it easy to understand just what precisely you're doing. You actually see this a ton, not even just with these very low-level things, but if you look at PyTorch, which is a much higher-level tool. PyTorch does this thing where you get to write a thing that looks like an ordinary Python program, but really, it's got a much trickier execution model. Python's an amazing and terrible ecosystem in which to do this kind of stuff, because what guarantees do you have when you're using Python?
完全没有。你能做什么?什么都可以。你拥有极大的自由。PyTorch团队尤其巧妙地利用了这种自由,让你能编写看似简单直接但实际上会很慢的Python程序。
None. What can you do? Anything. You have an enormous amount of freedom. The PyTorch people, in particular, have leveraged this freedom in a bunch of very clever ways, where you can write a Python program that looks like it's doing something very simple and straightforward, something that would naively be really slow.
但实际上,它通过精心设计的延迟处理和惰性计算,能够重叠GPU和CPU的计算,让程序运行得非常快。这很棒,但有时候就是行不通。
But, no, it's very carefully delaying and making some operations lazy, so it can overlap compute on the GPU and CPU and make stuff go really fast. And that's really nice, except sometimes it just doesn't work.
这就是陷阱。作为有数十年伤痕经验的编译器工程师,我可以调侃其他编译器同行。存在一个诱人的陷阱,叫做'足够智能的编译器'。你可以让某个东西在演示中看起来很棒。
This is the trap. Again, this is my decades of battle scars now. So as a compiler guy, I can make fun of other compiler people. There's this trap and it's an attractive trap, which is called the sufficiently smart compiler. And so what you can do is you can take something and you can make it look good on a demo.
你会说:看,我让它超级简单,我的编译器超级智能,它会通过魔法处理好所有事情。但魔法并不存在。历史上比如自动并行化技术,只需编写顺序逻辑的C代码,就能自动映射到超级计算机的100个核心上运行。这些技术在简单案例和演示中确实有效,但问题在于:当你使用时,只要改动一个地方,整个系统就可能崩溃。
And you can say, look, I make it super easy and I'm going to make my compiler super smart, and it's going to take care of all this and make it easy through magic, but magic doesn't exist. And so anytime you have one of those sufficiently smart compilers... If you go back in the day, it was auto-parallelization: just write C code as sequential logic, and then we're going to automatically map it onto running on 100 cores on a supercomputer or something like that. They often actually do work. They work in very simple cases and they work in the demos. But the problem is that you go and you're using them and then you change one thing and suddenly everything breaks.
也许编译器崩溃了,它就是无法工作。或者你去修复一个bug,结果不是获得100倍的速度提升,反而因为破坏了编译器而得到100倍的减速。很多AI工具,很多这类系统,特别是这些DSL,都有一个设计理念:让我假装它很简单,然后在幕后处理一切。但当某些东西出错时,你最终不得不查看编译器转储,对吧?这是因为魔法并不存在。
Maybe the compiler crashes, it just doesn't work. Or you go and fix a bug and now instead of a 100 times speed up, you get a 100 times slow down because it foiled the compiler. A lot of AI tools, a lot of these systems, particularly these DSLs have this design point of, let me pretend like it's easy, and then I will take care of it behind the scenes. But then when something breaks, you have to end up looking at compiler dumps, right? And this is because magic doesn't exist.
所以这就是可预测性和控制真正重要的地方,我认为这是关键所在,特别是如果你想充分利用硬件性能,这也是我们最终来到这里的原因。
And so this is where predictability and control is really, I think, the name of the game, particularly if you wanna get the most out of a piece of hardware, which is how we ended up here.
有趣的是,当你观察CPU和GPU之间的差异时,同样会出现底层系统有多聪明的问题。CPU本身试图做一件奇怪的事情:芯片本质上是一个并行基板,它拥有所有这些原则上可以并行运行的电路。然后它却被束缚运行这种极其顺序化的编程语言,只能一个接一个地执行任务。那么这如何以任何合理的效率实际工作呢?
It's funny, the same issue of how clever is the underlying system you're using comes up when you look at the difference between CPUs and GPUs. CPUs themselves are trying to do a weird thing where a chip is a fundamentally parallel substrate. It's got all of these circuits that in principle could be running in parallel. And then it is yoked to running this extremely sequential programming language, which is just trying to do one thing after another. And then how does that actually work with any reasonable efficiency?
嗯,在底层有各种聪明的技巧在运作,它试图预测你要做什么,通过推测允许它连续分派多个指令,猜测你未来要执行的操作。还有像内存预取这样的技术,它使用启发式算法来估计你未来会请求哪些内存,从而可以同时分派多个内存请求。然后如果你看看GPU,我认为甚至更多是TPU,还有完全不同的东西如FPGA(现场可编程门阵列),你在上面基本上放置一个电路设计。这是一种非常不同的软件系统。但在某种意义上,它们都更简单、更确定、更显式地并行。
Well, there's all sorts of clever dirty tricks happening under the covers where it's trying to predict what you're going to do: the speculation that allows it to dispatch multiple instructions in a row by guessing what you're going to do in the future. There's things like memory prefetching, where it has heuristics to estimate what memory you're gonna ask for in the future so it can dispatch multiple memory requests at the same time. And then if you look at things like GPUs, and I think even more TPUs, and then also totally other things like FPGAs, the field-programmable gate array, where you put basically a circuit design on it. It's a very different kind of software system. But all of them are, in some sense, simpler and more deterministic and more explicitly parallel.
比如,当你写下程序时,你必须编写显式并行的程序。这实际上更难编写。我不想过多抱怨CPU。CPU的伟大之处在于它们极其灵活且非常易于使用。而且所有这些黑魔法在很大比例的情况下确实有效。
Like, when you write down your program, you have to write an explicitly parallel program. That's actually harder to write. I don't wanna complain too much about CPUs. The great thing about CPUs is they're extremely flexible and incredibly easy to use. And all of that dark magic actually works a pretty large fraction of the time.
是的。效果出奇地好。但你的观点真的很棒。你说的是,CPU是让顺序代码快速并行运行的魔法盒子。然后我们有新的更显式的机器,编程难度稍大,因为它们不是魔法盒子,但你能从中获得一些东西,获得性能和能效,因为那个魔法盒子并非没有代价,它往往伴随着非常显著的代价——通常是机器消耗的功率。
Yeah. Remarkably well. But your point here, I think, is really great. And what you're saying is, CPUs are the magic box that makes sequential code go in parallel pretty fast. And then we have new, more explicit machines, somewhat harder to program because they're not a magic box, but you get something from it, you get performance and power, because that magic box doesn't come without a cost; it comes with a very significant cost, often in the amount of power that your machine dissipates.
所以它并不高效。因此我们获得这些新加速器的很多原因是人们确实关心它是否快100倍,或者使用更少的功率等等。我从未想过这一点,但你对Triton和Mojo的类比遵循了类似的模式,对吧?Triton试图成为魔法盒子,但它不能给你完整的性能,消耗更多功率等等。所以Mojo在说,看,让我们回归简单。
And so it's not efficient. And so a lot of the reason we're getting these new accelerators is because people really do care about it being 100 times faster or using way less power or things like this. And I'd never thought about it, but your analogy of Triton to Mojo kind of follows a similar pattern, right? Triton is trying to be the magic box, and it doesn't give you the full performance and it burns more power and all that kind of stuff. And so Mojo is saying, look, let's go back to being simple.
让我们给程序员更多控制权。我认为这种更明确的方法非常适合那些构建像你所说的极其先进硬件的人,同时也适合那些希望从现有硬件中获得最佳性能的人。
Let's give the programmer more control. And that more explicit approach is, I think, a good fit for people that are building crazy advanced hardware like you're talking about, but also people that want to get the best performance out of the existing hardware we have.
所以我们讨论了元编程如何通过消除你并不真正需要的控制结构来编写更快的程序。这部分很好。它如何提供可移植的性能?它如何在可移植性方面帮助你?
So we talked about how metaprogramming lets you write faster programs by boiling away this control structure that you don't really need. So that part's good. How does it give you portable performance? How does it help you on the portability front?
是的。这是另一个很好的问题。在这一类足够智能的编译器中,特别是AI编译器,已经有多年的工作,MLIR催化了其中许多工作,构建这些神奇的AI编译器,它们接收TensorFlow甚至新的PyTorch内容,并尝试为某些芯片生成最优代码。所以拿一个PyTorch模型,通过编译器处理,神奇地获得高性能。因此有很多这样的东西,并且这里做了很多很棒的工作。
Yeah. So this is another great question. So in this category of sufficiently smart compilers, and particularly for AI compilers, there's been years of work, and MLIR has catalyzed a lot of this work: building these magic AI compilers that take TensorFlow or even the new PyTorch stuff and try to generate optimal code for some chip. So take some PyTorch model and put it through a compiler and magically get out high performance. And so there's tons of these things and there's a lot of great work done here.
很多人已经证明你可以通过编译器加速内核。这方面的挑战在于人们从未测量芯片的全部性能。因此人们总是从一个不太理想的基线开始测量,然后试图爬得更高,而不是问光速是多少?所以如果你从光速开始测量,突然你会说,好吧,我如何实现几个不同的目标?即使你专注于一块硅片,我如何为一个用例实现最佳性能?
And a lot of people have shown that you can take kernels and accelerate them with compilers. The challenge with this is that people don't ever measure what the full performance of the chip is. And so people always measure from a somewhat unfortunate baseline and then try to climb higher, instead of saying, what is the speed of light? And so if you measure from the speed of light, suddenly you say, okay, how do I achieve several different things? Even if you zero in on one piece of silicon, how do I achieve the best performance for one use case?
然后我如何使我编写的软件即使在领域内也能泛化?例如,以矩阵乘法为例。你可能想处理float32,但然后你想将其泛化到float16。好吧,模板之类的东西是做到这一点的简单方法。然后编程允许你说,好吧,我来解决这个问题。
And then how do I make it so the software I write can generalize even within the domain? And so for example, take a matrix multiplication. Well, you want to work on maybe float32, but then you want to generalize it to float16. Okay, well, templates and things like this are an easy way to do this. And then metaprogramming allows you to say, okay, I will tackle that.
接下来发生的事情是,因为你从float32转到float16,你的有效缓存大小翻倍了,因为如果16位和32位,缓存中可以容纳两倍的元素。好吧,如果是这样,现在访问模式突然需要改变。因此你会得到一大堆条件逻辑,这些逻辑现在以一种非常参数化的方式变化,这是从float32到float16的一个简单变化的结果。现在你向前推进,说,好吧,矩阵乘法是一个递归分层问题。有针对高瘦矩阵的特殊化,以及维度为一等等。
Then the next thing that happens is, because you went from float32 to float16, your effective cache size has doubled, because twice as many elements fit into cache if they're 16 bits instead of 32 bits. Well, if that's the case, now suddenly the access pattern needs to change. And so you get a whole bunch of this conditional logic that now changes in a very parametric way as a result of one simple change, going from float32 to float16. Now you play that forward and you say, okay, well, matrix multiplication is a recursive hierarchical problem. There's specializations for tall and skinny matrices, or when a dimension is one, or something.
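A toy sketch of the kind of parametric logic he's describing (plain Python with made-up numbers; the 32 KiB cache is a stand-in for a real target's L1 data cache): the tile size a kernel can use falls out of the element width, so switching from float32 to float16 changes the blocking automatically rather than through a hand-written special case.

```python
import math

# Derive blocking parameters from the element width, the way a
# parameterized kernel would at compile time.
CACHE_BYTES = 32 * 1024  # assumed cache size, purely illustrative

def elements_in_cache(bits_per_element):
    # Halving the element width doubles the effective cache capacity.
    return CACHE_BYTES // (bits_per_element // 8)

def tile_side(bits_per_element):
    # Largest square tile of a matrix that fits in the cache.
    return math.isqrt(elements_in_cache(bits_per_element))

print(elements_in_cache(32), tile_side(32))  # 8192 90
print(elements_in_cache(16), tile_side(16))  # 16384 128
```

The point is that nothing here is a special case: the float16 access pattern is computed from the same formula as the float32 one, which is what compile-time parameters buy you.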
有所有这些特殊情况。仅仅一个芯片的一个算法就变成了这个非常复杂的子系统,你最终想要进行许多转换,以便可以针对不同的用例进行专门化。因此,Mojo通过元编程允许你解决这个问题。现在你引入其他硬件。所以如今可以把矩阵乘法几乎看作一个操作系统。
There's all these special cases. Just one algorithm for one chip becomes this very complicated subsystem that you end up wanting to do a lot of transformations to so you can go specialize it for different use cases. And so Mojo with the metaprogramming allows you to tackle that. Now you bring in other hardware. And so think of matrix multiplication these days as being almost an operating system.
我是说,有这么多不同的子系统、特殊情况、不同的数据类型,还有疯狂的float4、float6以及其他各种东西在运作。
I mean, there's, like, so many different subsystems and special cases and different dtypes and crazy float4 and float6 and other stuff going on.
总有一天,他们会推出一个如此小的浮点数,以至于它都会成为一个笑话。但每次我以为你只是在开玩笑时,结果却发现它是真实存在的。
At some point, they're gonna come out with a floating point number so small that it will be a joke. But every time I think that you're just kidding, it turns out it's real.
说真的,我听说有人在讨论1.2位浮点数。没错,就像你说的那样。这是个玩笑吗?你不可能认真的吧。
Seriously, I heard somebody talking about 1.2-bit floating point. Right. Exactly like you're saying. Is that a joke? What, you can't be serious.
所以当你引入其他硬件时,其他硬件带来了更多复杂性,因为突然之间,AMD的Tensor Core布局与Nvidia不同。或者就像你提到的warp,一个warp里有64个线程,另一个只有32个。但你会意识到,等一下,这其实与硬件厂商无关。实际上,即使在NVIDIA产品线内部也是如此,因为针对不同的数据类型,Tensor Core也在变化。Tensor Core处理float32的方式与处理float4或其他类型的方式是不同的。
And so now when you bring in other hardware, right, other hardware brings in more complexity, because suddenly the Tensor Core has a different layout on AMD than it does on Nvidia. Or, to your point about warps, you have 64 threads in a warp on one and 32 threads in a warp on the other. But then what you realize is, wait a sec, this really has nothing to do with hardware vendors. This is actually true even within, for example, the NVIDIA line, because across these different data types, the tensor cores are changing. The way the tensor core works for float32 is different than the way it works for float4 or something.
因此,即使在一个厂商内部,你也不得不使用非常强大的元编程来处理这种复杂性,并在单一算法(如矩阵乘法)的框架内实现。现在当你引入其他厂商时,嘿,结果发现它们都有大致类似Tensor Core的东西。所以我们从软件工程的角度来处理这个问题,被迫构建抽象层。我们拥有这个强大的元编程系统,因此我们实际上可以实现这一点。
And so you already within one vendor have to have this very powerful metaprogramming to be able to handle the complexity and do so in the scaffolding of a single algorithm like matrix multiplication. And so now as you bring in other vendors, well, it turns out, hey, they all have things that look roughly like Tensor Cores. And so we're coming at this with a software engineering perspective. And so we're forced to build abstractions. We have this powerful metaprogramming system so we can actually achieve this.
因此,即使对于一个厂商,我们也有这个叫做布局张量(layout tensor)的东西。布局张量是说,好吧,我不仅能够推理数字数组或多维数字数组,还能推理它在内存中的布局以及如何被访问。所以现在我们可以声明式地将这些东西映射到你拥有的硬件上,并且这些抽象可以堆叠。因此,这是一个类型系统运作良好的真正惊人的胜利,它是一个非常重要的基础。我知道你也是类型系统的粉丝。
And so even for one vendor, we get this thing called LayoutTensor. LayoutTensor is saying, okay, well, I have the ability to reason about not just an array of numbers, or a multi-dimensional array of numbers, but also how it's laid out in memory and how it gets accessed. And so now we can declaratively map these things onto the hardware you have, and these abstractions stack. And so it's this really amazing triumph of having a type system that works well, and that's a very important basis. I know you're a fan of type systems also.
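A toy analogue of that idea in plain Python (Mojo's actual LayoutTensor API is richer than this): a layout is just data, shape plus strides, that maps a logical index to a memory offset, so a transformation like transpose becomes a metadata rewrite rather than data movement.

```python
from dataclasses import dataclass

# Toy layout: shape plus strides, mapping logical indices to a linear offset.
@dataclass(frozen=True)
class Layout:
    shape: tuple
    strides: tuple

    def offset(self, *idx):
        # Linear memory offset of a logical element.
        return sum(i * s for i, s in zip(idx, self.strides))

def row_major(rows, cols):
    return Layout((rows, cols), (cols, 1))

def transpose(layout):
    # Transposing rewrites the metadata; no element is moved.
    return Layout(tuple(reversed(layout.shape)), tuple(reversed(layout.strides)))

a = row_major(4, 8)
t = transpose(a)
print(a.offset(2, 3))  # 2*8 + 3 = 19
print(t.offset(3, 2))  # same element viewed transposed: 19
```

Because the layout is an ordinary value, code can be generic over it: the same kernel body can be instantiated for a row-major, column-major, or tiled layout, which is the "declaratively map onto the hardware" part.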
然后你引入元编程,这样你就可以在编译时构建强大的抽象,从而实现零运行时开销。接着你引入整个等式中最重要的部分,那就是理解该领域的程序员。我不会去写一个快速的矩阵乘法算法,抱歉,这不是我的专长。但在那个领域有一些人简直才华横溢。
You then bring in metaprogramming, so you can build powerful abstractions at compile time, so you get no runtime overhead. And then you bring in the most important part of this entire equation, which is programmers who understand the domain. I am not going to write a fast matrix multiplication. I'm sorry, that's not my expertise. But there are people in that space that are just fricking brilliant.
他们完全理解硬件的工作原理。他们了解用例、最新研究以及当下新奇的量化格式,但他们不是编译器专家。Mojo的魔力在于它提供了一个类型系统、元编程能力,以及构建库时所需的完整编译器功能。因此,这些擅长挖掘硬件潜力的人才能够真正实现这一点,编写出既能应对领域复杂性又能跨硬件扩展的软件。
They understand exactly how the hardware works. They understand the use cases and the latest research and the new crazy quantized format of the day, but they're not compiler people. And so the magic of Mojo is it says, hey, you have a type system, you have metaprogramming, you have effectively the full power of a compiler when you're building libraries. And so now these people that are brilliant at unlocking the power of the hardware can actually do this. And now they can write software that scales both across the complexity of the domain, but also across hardware.
对我来说,这正是Mojo令人兴奋和强大的地方——它释放了程序员的力量,而不是像许多早期系统那样试图将这种能力硬塞进编译器中。
And to me, that's what I find so exciting and so powerful about this is it's like unlocking the power of the Mojo programmer instead of trying to put it into the compiler, which is what a lot of earlier systems have tried to do.
所以关键点可能是:你可以构建这些抽象来表示不同类型的硬件,然后根据运行硬件条件执行代码。这不像用ifdef在不同硬件平台间选择,而是通过复杂的结构(如布局值)来指导数据遍历方式。
So maybe the key point here is that you get to build these abstractions that allow you to represent different kinds of hardware, and then you can conditionally have your code execute based on the kind of hardware that it's on. It's not like an ifdef where you're picking between different hardware platforms. There are complicated data structures like these layout values that tell you how you can traverse data.
这有点像树结构。你传递的不是简单整数,而是编译时需要的一个递归层次化树形结构。
Which is kind of a tree. This isn't just a simple int that you're passing around. This is like a recursive hierarchical tree that you need at compile time.
关键在于你能编写出看似单一合成程序、具有可理解行为的代码,但其中部分实际在编译时执行,使得生成的程序专门针对目标平台进行优化。我担心的是:程序的配置空间会变得极其庞大。从工程角度看有两个潜在难点:第一,能否真正创建在程序上下文中隐藏复杂性的抽象,让人们能以模块化思维构建程序?
The critical thing is you get to write a thing that feels like one synthetic program with one understandable behavior. But then parts of it are actually gonna execute at compile time so that the thing that you generate is in fact specialized for the particular platform that you're going to run it on. So one concern I have over this is it sounds like the configuration space of your programs is going to be massive. And I feel like there are two directions where this seems potentially hard to do from an engineering perspective. One is, can you really create abstractions that within the context of the program hide the relevant complexity so it's possible for people to think in a modular way about the program they're building?
这样他们的大脑不会因为可能运行的70种硬件类型而爆炸。第二是如何考虑测试?配置组合如此之多,如何确保在所有环境下都能正常工作?
Their brains don't explode with the 70 different kinds of hardware that they might be running it on. And then the other question is, how do you think about testing? Right? Because there's just so many configurations. How do you know whether it's working in all the places?
因为这种设计似乎赋予了极大的自由度(包括可能出错的情况),你们如何解决这两个问题——既控制抽象复杂度,又建立有效的测试机制?
Because it sounds like it has an enormous amount of freedom to do different things, including wrong things in some cases. How do you deal with those two problems, both controlling the complexity of the abstractions and then having a testing story that works out?
好吧,Ron,我要让你大吃一惊。我知道你可能会抗拒这个想法,但让我说服你类型系统是很酷的。
Okay, Ron, I'm gonna blow your mind. I know you're gonna be resistant to this, but let me convince you that types are cool.
好的。
Okay.
我知道你会跟我争论这个。那么,这又回到了使用Python或C++时的挑战与机遇——Python其实没有真正的类型系统。我是说,它有一些相关的东西,但并没有真正的类型系统。C++有类型系统,但使用起来极其痛苦。Mojo的做法是,再次强调,这不是什么火箭科学,我们在各处都能看到,让我们引入特质(traits)。
I know you're going to fight me on this. Well, so this is, again, you go back to the challenges and opportunities of working with either Python or C++. Python doesn't have types, really. I mean, it has some stuff, but it doesn't really have a type system. C++ has a type system, but it's just incredibly painful to work with. And so what Mojo does is it says, again, it's not rocket science, we see it all around us: let's bring in traits.
让我们引入一种合理的方式来编写代码,这样我们就可以构建领域特定的抽象,并且可以模块化地进行检查。C++的一个大问题是,当你实例化一层又一层的模板时,你会收到错误信息。如果你搞错了某个魔法数字,它就会以无法理解的方式剧烈爆炸。Mojo的做法是,很酷,让我们引入非常类似于Swift中的协议、Rust中的特质或Haskell中的类型类的特质。这并不新奇。
Let's bring in a reasonable way to write code so that we can build abstractions that are domain-specific, and they can be checked modularly. And so one of the big problems with C++ is that you get error messages when you instantiate layers and layers and layers and layers of templates. And so if you get some magic number wrong, it explodes spectacularly in a way you can't reason about. And so what Mojo does is it says, cool, let's bring in traits that feel very much like protocols in Swift or traits in Rust or type classes in Haskell. Like, this isn't novel.
这就像是一种称为特设多态(ad hoc polymorphism)的机制。意思是,我想要一个操作或函数具有某种含义,但实际上它会针对不同的类型以不同的方式实现。这些基本上都是根据你正在做的事情和涉及的类型,查找正确实现你所需功能的机制。
This is like a mechanism for what's called ad hoc polymorphism. Meaning, I want to have some operation or function that has some meaning, but actually it's going to get implemented in different ways for different types. These are basically all mechanisms for, given the thing that you're doing and the types involved, looking up the right implementation that's gonna do the thing that you want.
是的。我是说,一个非常简单的例子是迭代器。Mojo有一个迭代器特质,你可以问,对一个集合的迭代器是什么?嗯,你可以检查是否还有元素,或者获取当前元素的值。然后,当你不断从迭代器中取出东西时,它最终会决定停止。
Yeah. I mean, a very simple case is an iterator. So Mojo has an iterator trait, and you can say, hey, well, what is an iterator over a collection? Well, you can either check to see if there's an element, or you can get the value of its current element. And then, as you keep pulling things out of an iterator, it will eventually decide to stop.
这个概念可以应用于链表、数组、字典或从网络传来的无界数据包序列等。这样你就可以编写跨这些不同集合后端或实现此特质的模型的通用代码。编译器会为你检查,确保在编写通用代码时,你没有使用不起作用的东西。这意味着你可以在不实例化的情况下检查通用代码,这对编译时间有好处,对用户体验也有好处,因为如果你作为程序员犯了错误,这很重要。它有助于推理这些不同子系统的模块化,因为现在你有了一个连接两个组件的接口。
And so this concept can be applied to things like a linked list or an array or a dictionary or an unbounded sequence of packets coming off a network. And so you can write code that's generic across these different collection back ends, or models, that implement this trait. And what the compiler will do for you is it will check to make sure, when you're writing that generic code, you're not using something that won't work. And so what that does is it means that you can check the generic code without having to instantiate it, which is good for compile time. It's good for user experience, because if you get something wrong as a programmer, that's important. It's good for reasoning about the modularity of these different subsystems, because now you have an interface that connects the two components.
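In Python terms (Mojo's trait syntax is different, but the checking story is the same idea), this is roughly what `typing.Protocol` gives you: generic code declares the interface it relies on, so a checker can validate the generic body once, without instantiating it for every conforming type.

```python
from typing import Iterator, Protocol, runtime_checkable

# A trait-like interface: anything that can hand out an iterator of floats.
@runtime_checkable
class FloatSource(Protocol):
    def __iter__(self) -> Iterator[float]: ...

def total(xs: FloatSource) -> float:
    # The body uses only what the interface guarantees, so a static
    # checker can verify it independently of any concrete collection.
    return sum(xs)

# Lists, tuples, and other iterables all conform structurally.
print(total([1.0, 2.0, 3.0]))          # 6.0
print(total((0.25, 0.75)))             # 1.0
print(isinstance([1.0], FloatSource))  # True
```

If `total` tried to call something the protocol doesn't promise, say `xs.append(...)`, a type checker flags it at the definition site rather than deep inside an instantiation, which is exactly the contrast with C++ templates being drawn here.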
我认为这是C++模板方法中一个被低估的问题。C++模板看起来像是一种深层的语言特性,但实际上它们只是一个代码生成功能。
I think it's an underappreciated problem with the C++ templates approach to the world, where C++ templates seem like a deep language feature, but really, they're just a code generation feature.
它们就像C语言的宏。
They're like C macros.
没错。这意味着它们既难以思考和推理,因为乍一看似乎没那么糟糕——你并不真正知道模板展开时是否真的能编译。但随着你开始更深层次地组合内容,情况会变得越来越糟,因为某处的某些东西会失败,而且很难进行推理和理解。而当你拥有保证正确组合且不会崩溃的类型级泛型概念时,你就能降低错误率。这就是超越模板作为语言特性的一个好处。
That's right. It both means they're hard to think about and reason about, because it sort of seems at first glance not to be so bad, this property that you don't really know when your template expands whether it's actually going to compile. But as you start composing things more deeply, it gets worse and worse, because something somewhere is gonna fail, and it's just gonna be hard to reason about and understand. Whereas when you have type-level notions of genericity that are guaranteed to compose correctly and won't just blow up, you just drive that error rate down. So that's one thing that's nice about getting past templates as a language feature.
另一个问题是它实在太慢了。你几乎是在一遍又一遍地生成几乎完全相同的代码。这意味着你无法保存任何编译工作,只能从头开始重新做一遍。
And then the other thing is it's just crushingly slow. You're generating almost exactly the same code over and over and over again. And so that just means you can't save any of the compilation work. You just have to redo the whole thing from scratch.
完全正确。所以这就是我们再次谈论系统中的沙粒问题。这些小问题如果处理不当,就会向前蔓延并导致巨大问题。Mojo中的元编程方法很酷,无论是在可用性、编译时间还是正确性方面。回到你关于可移植性的观点,它对可移植性也很有价值,因为这意味着编译器会通用地解析你的代码,而不知道目标是什么。
That's exactly right. So this is where, again, we're talking about the sand in the system. These little things that, if you get them wrong, they play forward and they cause huge problems. The metaprogramming approach in Mojo is cool, both for usability and compile time and correctness. Coming back to your point about portability, it's also valuable for portability, because what it means is that the compiler parses your code, and it parses it generically and has no idea what the target is.
因此,当Mojo生成代码的第一级中间表示(编译器表示)时,它不会硬编码32位或64位指针,也不会硬编码你是在x86还是其他架构上。这意味着你可以将Mojo中的通用代码放在CPU上,也可以放在GPU上,相同的代码,相同的函数。再次强调,这些Chris痴迷的疯狂编译器技术。这意味着你可以切出想要放到GPU上的代码块,使其看起来像一个分布式系统,但实际上GPU是一个疯狂的嵌入式设备,需要这个小小的代码片段并且需要完全自包含。
And so when Mojo generates the first level of intermediate representation, the compiler representation for the code, it's not hard-coding in whether pointers are 32-bit or 64-bit, or that you're on an x86 or whatever. And what this means is that you can take generic code in Mojo and you can put it on a CPU and you can put it on a GPU: same code, same function. And again, these are crazy compilery things that Chris gets obsessed about. It means that you can slice out the chunk of code that you want to put onto your GPU, in a way where it looks like a distributed system, but it's a distributed system where the GPU is actually a crazy embedded device that wants this tiny snippet of code and wants it fully self-contained.
这些都是普通编程语言甚至没有考虑过的事情。
These are things that normal programming languages haven't even thought about.
那么这是否意味着当我编译一个Mojo程序时,我会得到一个可发布的执行文件,其中包含另一个小型编译器,能够获取Mojo代码并对其进行专门化处理,以生成最终目标所需的实际机器代码?我是否需要在每个Mojo执行文件中捆绑所有可能平台的所有编译器?
So does that mean when I compile a Mojo program, I get a shippable executable that contained within it another little compiler that can take the Mojo code and specialize it to get the actual machine code for the final destination that you need? Do I bundle together all the compilers for all the possible platforms in every Mojo executable?
答案是否定的。世界还没准备好接受这个。即时编译器和类似技术有其应用场景,这很酷。但默认的构建方式,如果你只是运行mojo build,它会给你一个普通的a.out可执行文件。但如果你构建一个Mojo包,Mojo包会保持可移植性。
The answer is no. The world's not ready for that. And there are use cases for JIT compilers and things like this, and that's cool. But the default way of building, if you just run mojo build, is that it will give you just an a.out executable, a normal thing. But if you build a Mojo package, a Mojo package retains portability.
这是一个很大的区别。这就是Java的做法。如果你以完全不同的方式思考Java,在不同的生态系统中有不同的原因,它会在不知道目标平台的情况下解析所有源代码,并生成Java字节码。现在已经不是1995年了。我们的做法完全不同,我们显然不是Java,我们有一个非常不同的类型系统。
This is a big difference. This is what Java does. If you think about Java, in a completely different way and for different reasons in a different ecosystem universe, it parses all of your source code without knowing what the target is, and it generates Java bytecode. And so it's not 1995 anymore. The way we do this is completely different, and we're not Java, obviously, and we have a type system that's very different.
但这个概念是众所周知的,至少像Swift、C++和Rust这样的编译语言世界已经有些遗忘了。
But this concept is something that's been well known, something that at least the world of compiled languages like Swift and C++ and Rust has kind of forgotten.
所以Mojo包是随附了专门化到不同平台所需的编译器技术的。是的。
So the Mojo package is kind of shipped with the compiler technology required to specialize to the different Yes.
是的,领域。所以再次强调,默认情况下,如果你是一个用户,坐在笔记本电脑上编译Mojo程序,你只想要一个执行文件。但编译器技术拥有所有这些强大功能,可以以不同方式使用。这与LLVM类似,LLVM也有即时编译器。
Yes. Domains. And so, again, by default, if you're a user, you're sitting on your laptop and say compile a Mojo program, you just want an executable. But the compiler technology has all these powerful features and they can be used in different ways. And this is similar to LLVM where LLVM had a just in time compiler.
如果你是好莱坞索尼影业,正在为某部华丽电影渲染着色器,这确实很重要,但如果你只是编写需要提前编译的C++代码,你不会想使用这种方式。
And that's really important if you're Sony Pictures and you're rendering shaders for some fancy movie, but that's not what you'd wanna use if you're just writing C++ code that needs to be ahead-of-time compiled.
我的意思是,这里也有一些与英伟达PTX故事相似的回响。英伟达有个他们某种程度上隐藏的东西,它是一种中间表示,但被称为PTX,本质上是一种可移植字节码。多年来,他们在许多、许多代GPU之间保持了兼容性。他们有一个称为汇编器的东西,是驱动程序加载的一部分,但它实际上不是汇编器。它更像是一个真正的编译器,接收PTX并将其编译成SASS,即加速器特定的机器码,他们非常谨慎地没有完全公开文档,因为他们不想泄露所有秘密。
I mean, there's some echoes here also of the PTX story with NVIDIA. NVIDIA has this thing, which they sort of hide is an intermediate representation, called PTX, which is a portable bytecode, essentially. And they, for many years, maintained compatibility across many, many different generations of GPUs. They have a thing called the assembler that's part of the driver, for loading things on, and it's really not an assembler. It's like a real compiler that takes the PTX and compiles it down to SASS, the accelerator-specific machine code, which they very carefully do not fully document because they don't wanna give away all of their secrets.
所以这里有一个内置的可移植性故事,意味着它实际上是为了在未来跨新一代保持可移植性。不过,正如你之前指出的,它实际上并不总是成功,现在有些程序确实无法过渡到Blackwell架构。
And so there's a built in portability story there where it's meant to actually be portable in the future across new generations. Although, as you were pointing out before, it in fact doesn't always succeed, and there are now some programs that will not actually make the transition to Blackwell.
所以这属于我认为像虚拟机的那一类,顺便说一下,是非常低级的虚拟机。因此,当你查看这些系统时,我会问的问题是:类型系统是什么?如果你看PTX,因为正如你所说,你完全正确,它是顶层大量源代码与后端特定SAS硬件之间的抽象层。但类型系统并不很有趣。它是指针、寄存器和内存,对吧?
So that's in the category that I'd consider to be like a virtual machine, very low level virtual machine, by the way. And so when you're looking at these systems, the thing I'd ask is, what is the type system? And so if you look at PTX, because as you're saying, you're totally right, it's an abstraction between a whole bunch of source code on the top end and then the specific SAS hardware thing on the back end. But the type system isn't very interesting. It's pointers and registers and memory, right?
那么Java呢?类型系统是什么?嗯,它通过让类型系统及其字节码暴露对象来实现可移植性。所以它是一个更高层次的抽象,动态虚拟分发,这都是Java生态系统的一部分。它不是字节码,但可移植的表示保持了完整的泛型系统。这使得可以说,好吧,我将获取这段代码,将其编译一次到一个包中,然后为设备专门化和实例化它。
And so Java, what is the type system? Well, it achieves portability by making the type system and its bytecode expose objects. And so it's a much higher level of abstraction, dynamic virtual dispatch; that's all part of the Java ecosystem. It's not a bytecode, but the representation that's portable maintains the full generic system. And so this is what makes it possible to say, okay, well, I'm going to take this code, compile it once to a package, and now go specialize and instantiate this for a device.
所以它的工作方式有点不同,但它实现了,回到你最初关于安全性和正确性的问题,使得所有检查都能以正确的方式进行。
And so the way that works is a little bit different, but it enables, coming back to your original question of safety and correctness, enables all the checking to happen the right way.
对。控制权也有巨大的转变。使用PTX时,如何编译的机器特定细节完全超出了程序员的控制。你可以生成你能生成的最好的PTX,然后它会被编译。怎么编译?
Right. There's also a huge shift in control. With PTX, the machine specific details of how it's compiled are totally out of the programmer's control. You can generate the best PTX you can, and then it's gonna get compiled. How?
以某种方式。不要问太多问题。它会做它该做的事。而在这里,你在可移植对象中保留了程序员驱动的关于专门化如何工作的指令。你只是部分执行了你的编译。
Somehow. Don't ask too many questions. It's gonna do what it's gonna do. Whereas here, you're preserving in the portable object the programmer driven instructions about how the specialization is going to work. You've just partially executed your compilation.
你已经完成了一部分,然后在最后当你实际选择运行位置时,还会有更多工作要做。
You've got partway down, and then there's some more that's gonna be done at the end when you pick actually where you're gonna run
没错。这些都是技术栈中非常硬核的组成部分。但我喜欢的是,如果你把这些封装起来,它就变得易于使用。它能正常工作。
Exactly. And so these are all very nerdy pieces that go into the stack. But the thing that I like is that if you pop up above a lot of that, it's easy to use. It works.
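The "partially executed compilation" idea the two discuss above can be sketched with plain Python closures. This is only an illustration of the staging, not how Mojo, PTX, or any real compiler works, and the function names here are made up:

```python
# Toy sketch of "partially executing" a compilation: a parameter the
# programmer controls is baked in at package time, while the data-dependent
# work is left for the final call site, when you pick where to run.

def build_kernel(tile_size):
    """Stage 1: specialize on a programmer-driven parameter up front."""
    def kernel(data):
        # Stage 2: only the input-dependent part runs at the end.
        return [sum(data[i:i + tile_size])
                for i in range(0, len(data), tile_size)]
    return kernel

# "Package" the partially specialized kernel once...
k4 = build_kernel(4)
# ...then instantiate it on concrete input much later.
print(k4(list(range(8))))  # [6, 22]
```

The point of the sketch is just that specialization decisions stay under the programmer's control in the portable artifact, instead of being deferred entirely to an opaque backend.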
它能提供良好的错误信息,对吧?虽然我不懂那些希腊字母,但我明白这其中投入了大量工程工作。这种技术栈的构建方式,其根本目的是释放计算能力。我们希望新程序员能够进入这个系统。如果他们懂Python,了解一些硬件基础知识,就能有效工作。
It gives good error messages, right? I don't understand the Greek letters, but I do understand that a lot of engineering goes into this. The way this technology stack builds up, the whole purpose is to unlock compute. And we want new programmers to be able to get into the system. And if they know Python and they understand some of the basics of the hardware, they can be effective.
这样他们就不会被限制在80%的性能水平。他们可以不断深入,持续提升技术复杂度。也许不是每个人都想这样做,他们可以停留在80%。但如果你想要走到底,你就能达到目标。
And then they don't get limited to 80% of the performance. They can keep driving and keep growing in sophistication. And maybe not everybody wants to do that. They can stop at 80%. But if you do wanna go all the way, then you can get there.
我很好奇的一点是,你们是如何真正保持简洁性的?你说Mojo旨在保持Python风格,也谈了很多语法问题。但实际上,Python的一个优点在于它在更深层次上是简洁的——默认没有复杂的类型系统和需要思考的复杂类型错误。这虽然存在问题,但对于学习系统的用户来说,这确实是简洁性的重要来源。
So one thing I'm curious about is how do you actually manage to keep it simple? You said that Mojo is meant to be Pythonic, and you talked a bunch about the syntax. But actually, one of the nice things about Python is it's simple in some ways in a deeper sense. The fact that there isn't, by default, a complicated type system with complicated type errors to think about. There's a lot of problems with that, but it's also a real source of simplicity for users who are trying to learn the system.
运行时动态错误在某种程度上更容易理解。我写了一个程序,它尝试执行某个操作,
Dynamic errors at runtime are in some ways easier to understand. I wrote a program and it tried to do a thing,
然后它
and it
在这个特定问题上绊倒了,你可以看到它绊倒的过程。从某种程度上说,这更容易理解。当你转向一种语言时,出于安全和性能原因,需要更精确的类型级别控制。你如何做到这一点,同时仍然保持Pythonic的感觉,即你向用户展现的基础简洁性?
tripped over this particular thing, and you can see it tripping over. And in some ways, that's easier to understand. When you're going to a language which, for both safety and performance reasons, needs much more precise type level control, how do you do that in a way that still feels Pythonic in terms of the base simplicity that you're exposing to users?
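Ron's contrast between runtime trips and type-level control can be seen in a few lines of plain Python. The names here are illustrative; the annotated version would be flagged by a static checker such as mypy before the program runs:

```python
# The dynamic-typing failure mode: the program runs fine until execution
# actually reaches the bad operation, then trips over it at run time.
def total(items):
    return sum(items)

try:
    total([1, 2, "three"])      # fails only when sum() hits the string
except TypeError as e:
    print("tripped at runtime:", e)

# With annotations, a checker (mypy, pyright) reports the same mistake
# statically; CPython itself still does not enforce the annotation.
def total_typed(items: list[int]) -> int:
    return sum(items)

print(total_typed([1, 2, 3]))  # 6
```

The trade-off the question points at is exactly this: the dynamic version is simpler to write and its failure is easy to watch happen, while the typed version moves the error earlier at the cost of a type system the user has to learn.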
我不能给你完美的答案,但我可以告诉你我目前的想法。所以再次强调,要从历史中学习。Swift有很多非常酷的特性,但它逐渐螺旋式发展,随着时间的推移增加了许多复杂性。Swift面临的挑战之一是其团队有薪酬来不断为Swift添加功能。
I can't give you the perfect answer, but I can tell you my current thoughts. So again, learn from history. Swift had a lot of really cool features, but it spiraled and got a lot of complexity that got layered in over time. And also, one of the challenges with Swift is that it had a team that was paid to add features to Swift.
这从来不是一件好事。
It's never a good thing.
嗯,你有C++委员会,C++委员会会做什么?他们会不断为C++添加功能。别指望C++会变得更小。这是常识。因此对于Mojo来说,有几件不同的事情。其中之一是从Python开始。
Well, you have a C++ committee; what is the C++ committee going to do? They're going to keep adding features to C++. Don't expect C++ to get smaller. It's common sense. And so with Mojo, there's a couple of different things. So one of which is start from Python.
Python作为表层语法使我作为管理者能够推后一步说,看,让我们确保实现Python生态系统的全部能力。让我们先拥有列表、列表推导式等等所有这些功能,而不是仅仅因为可能有用就发明随机的东西。但对我个人而言,还有对复杂性的显著反压力。我们如何分解这些东西?例如,我们如何让元编程系统吸收许多原本会存在的复杂性?
So Python being the surface level syntax enables me, as management, to be able to push back and say, look, let's make sure we're implementing the full power of the Python ecosystem. Let's have lists and comprehensions and all this stuff before just inventing random stuff because it might be useful. But there's also, for me personally, a significant back pressure on complexity. How can we factor these things? How can we get, for example, the metaprogramming system to subsume a lot of complexity that would otherwise exist?
有一些基本的东西我希望我们添加,例如受检泛型这类功能,因为它们有更好的用户体验。它们是元编程系统的一部分。它们是我们正在添加的核心版本的一部分。但我不希望Mojo变成仅仅因为对某些人有用就添加所有其他语言有的每一个语言特性。我实际上从Go语言中获得了很多启发和学习。
And there are fundamental things I want us to add, for example checked generics, things like this, because they have a better UX. They're part of the metaprogramming system. They're part of the core additions that we're making. But I don't want Mojo to turn into "add every language feature that every other language has just because it's useful to somebody." I was actually inspired by and learned a lot from Go.
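The checked generics Chris mentions can be illustrated with Python's `typing` module. This is a sketch of the user experience (a container that carries its element type, with uses verified against it), not a statement about Mojo's actual design:

```python
# A generic stack whose element type is part of its type. A static checker
# verifies every use against the chosen instantiation.
from typing import Generic, TypeVar

T = TypeVar("T")

class Stack(Generic[T]):
    def __init__(self) -> None:
        self._items: list[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T:
        return self._items.pop()

s: Stack[int] = Stack()
s.push(1)
# s.push("one")  # a checker rejects this against the Stack[int] instantiation
print(s.pop())  # 1
```

The "better UX" claim is that errors are reported at the generic's declared interface, rather than deep inside an expanded template, which is the usual complaint about template-style metaprogramming.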
这是一个人们可能惊讶听到我谈论的语言。Go,我认为它在Go 1版本中有意约束语言方面做得非常好。他们为此承受了很多压力。他们没有添加泛型系统。每个人,包括我自己,都在问,为什么这个语言甚至没有泛型系统?
And it's a language that people are probably surprised to hear me talk about. Go, I think, did a really good job of intentionally constraining the language with Go 1. And they took a lot of heat for that. They didn't add a generics system. And everybody, myself included, was like, why doesn't this language even have a generics system?
甚至不是一门现代语言。但他们守住了底线,他们明白人们能走多远。然后他们在Go语言中很好地添加了泛型功能,我认为他们做得非常出色。最近我读到一篇博客文章,讨论的是Go语言。
Not even a modern language. But they held the line; they understood how far people could get. And then they did a really good job of adding generics to Go later, in Go 1.18. And I thought they did a great job. There's a recent blog post I was reading talking about Go.
显然他们有一个二八法则。他们说希望用20%的复杂度实现80%的功能,诸如此类。观察发现,这个定位会让所有人都不满,因为每个人都想要81%的功能,但81%的功能可能会带来35%的复杂度。所以关键在于如何划定这条线,学会在何处说不。例如,我们社区中有人要求添加Rust中存在的非常合理的功能。
And apparently they have an eighty-twenty rule. They say they want to have 80% of the features with 20% of the complexity, something like that. And the observation is that that's a point in the space that annoys everybody, because everybody wants 81% of the features, but 81% of the features maybe gives you 35% of the complexity. And so it's about figuring out where to draw that line and figuring out where to say no. For example, we have people in the community that are asking for very reasonable things that exist in Rust.
Rust是一门很棒的语言。我很喜欢它。里面有很多优秀的设计理念,我们毫不客气地从各处汲取好点子。但我不想要那种复杂性。
And Rust is a wonderful language. I love it. There's a lot of great ideas and we pull shamelessly good ideas from everywhere. But I don't want the complexity.
我经常说,语言设计最关键的一点是保持功率重量比。没错。你希望在最小化复杂度的同时,获得大量优秀的功能、强大的能力和良好的用户体验。我认为这确实非常具有挑战性,而且我们发现这也是当前普遍面临的问题。我们也在通过各种方式扩展OCaml,从各种语言(包括Rust)中汲取灵感。
I often like to say that one of the most critical things about a language design is maintaining the power to weight ratio. Yeah. You wanna get an enormous amount of good functionality and power and good user experience while minimizing that complexity. And I think it is a very challenging thing to manage, and I think it's actually a thing that we are seeing a lot as well. We are also doing a lot to extend OCaml in all sorts of ways, pulling from all sorts of languages, including Rust.
同样地,要在扩展的同时保持语言的基本特性和简洁性,这是一个真正的挑战。而且很难判断是否找到了最佳平衡点。在可以反复试错的环境下会更容易些——尝试某些特性,发现可能不适用,然后进行调整。我们正努力以这种模式进行大量迭代,这是在特定条件下可行的做法。
And again, doing it in a way where the language maintains its basic character and maintains its simplicity is a real challenge. And it's kinda hard to know if you're hitting the actual right point on that. And it's easier to do in a world where you can take things back. Try things out and decide that maybe they don't work and then adjust your behavior. And we're trying to iterate a lot in that mode, which is the thing you can do under certain circumstances.
当语言成为被广泛使用的开源项目时,这件事就变得更具挑战性了。
It gets harder as you have a big open source language that lots of people are using.
这个观点非常精彩。我在Swift项目中学到的另一个教训是:早期我就极力推动开放设计流程,让任何人都可以提交提案,由语言委员会评估。如果提案优秀,就会被实现并加入Swift。但要注意:许愿需谨慎——这确实让很多有好想法的人为Swift添加了大量功能。因此作为制衡,在Mojo项目中我真心希望核心团队保持小型化。
That's a really great point. And so one of the other lessons I've learned with Swift is that with Swift, I pushed very early to have an open design process where anybody could come in, write a proposal, and then it would be evaluated by the language committee. And then if it was good, it would be implemented and put into Swift. Again, be careful what you wish for: that enabled a lot of people with really good ideas to add a bunch of features to Swift. And so with Mojo, as a counterbalance, I really want the core team to be small.
我希望核心团队不仅仅是能够添加一大堆可能将来有用的东西,而是要非常审慎地考虑我们如何添加内容、如何演进事物。
I want the core team not just be able to add a whole bunch of stuff because it might be useful someday, but to be really deliberate about how we add things, how we evolve things.
在向前演进的过程中,你如何看待保持向后兼容性保证?
How are you thinking about maintaining backwards compatibility guarantees as you evolve it forward?
我们正在积极讨论和辩论Mojo 1.0会是什么样子。
We're actively debating and discussing what Mojo 1.0 looks like.
所以
And so
我不会给你一个具体的时间表,但希望不会太久远。我特别喜欢语义版本控制这个概念。也就是说,我们将会有1.0版本,然后是2.0、3.0、4.0等等。每个版本都可以不兼容,但它们能够链接在一起。Python生态系统中很多损害来自于Python 2到3的转换,这是一个巨大的挑战。
I'm not going to give you a timeframe, but it will hopefully not be very far away. And what I am fond of is this notion of semantic versioning. And so saying we're going to have a 1.0, and then we're going to have a 2.0, and a 3.0, and a 4.0, etcetera. And each of these will be able to be incompatible, but they can link together. And so one of the big challenges, and a lot of the damage in the Python ecosystem, was from the Python 2 to 3 conversion.
这个过程花了十五年时间,由于多种原因成为了一场英勇的混乱。耗时如此之久的原因是你必须转换整个包生态系统才能达到3.0。相比之下,像C++这样的语言——让我说说C++的好话——他们搞对了ABI。一旦ABI确定,你就可以有一个用C++98构建的包和一个用C++23构建的包,这些包可以互操作并兼容,即使你在未来的语言版本中引入了新关键字或其他东西。
It took fifteen years, and it was a heroic mess for many different reasons. The reason it took so long is because you have to convert the entire package ecosystem before you can be on 3.0. And so if you contrast that to something like C++ (let me say good things about C++), they got the ABI right. And so once the ABI was set, then you could have one package built in C++98 and one package built in C++23, and these things would interoperate and be compatible even if you took new keywords or other things in the future language version.
因此,我对Mojo的愿景更类似于C++生态系统或类似的东西。但这允许我们在代码迁移、修复错误和推动语言发展方面更加积极。但我希望确保Mojo 2.0和Mojo 1.0的包能够协同工作,并且有好的工具——可能是AI驱动的——来从1.0迁移到2.0,并以这种方式管理生态系统。
And so what I see for Mojo is much more similar to maybe the C++ ecosystem or something like this. But that allows us to be a little bit more aggressive in terms of migrating code, and in terms of fixing bugs and moving the language forward. But I want to make sure that Mojo 2.0 and Mojo 1.0 packages work together, and that there's good tooling, probably AI driven, but good tooling to move from 1.0 to 2.0 and be able to manage the ecosystem that way.
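The model Chris describes separates two notions that the Python 2 to 3 transition conflated. A toy sketch, with entirely hypothetical helper names, of the distinction:

```python
# Sketch: source compatibility is promised only within a major version,
# but built packages still link across majors because the ABI stays stable
# (the way C++98 and C++23 objects coexist in one binary).

def source_compatible(a: str, b: str) -> bool:
    """Hypothetical rule: same major version means no source migration needed."""
    return a.split(".")[0] == b.split(".")[0]

def can_link(a: str, b: str) -> bool:
    """Hypothetical rule: with a stable ABI, any released majors interoperate."""
    return True

print(source_compatible("1.2.0", "1.4.1"))  # True
print(source_compatible("1.4.1", "2.0.0"))  # False: code may need migration
print(can_link("1.4.1", "2.0.0"))           # True: built packages still work together
```

The payoff is that the ecosystem never has to convert all at once: each package can migrate its source on its own schedule while continuing to link against neighbors built for other versions.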
我认为类型系统也提供了巨大帮助。Python迁移如此困难的原因之一在于,你无法直接尝试用Python 3构建并查看哪些地方出错,只能通过实际遍历程序的所有执行路径才能发现问题。如果测试覆盖不足,这会非常困难,即使有足够测试,也并不容易。
I think the type system also helps an enormous amount. I think one of the reasons the Python migration was so hard is you couldn't just say, let me try and build this with Python 3 and see what's broken. You could only see what's broken by actually walking all of the execution paths of your program. And if you didn't have enough testing, that would be very hard. And even if you did, it wasn't that easy.
而有了强类型系统,你能获得大量非常精确的指导。实际上,强类型系统与智能编码代理的结合非常强大。我们现在积累了不少实践经验:当你对某个类型做个小改动后,只需对AI系统说‘请排查所有类型错误并修复’,它就能出色地完成任务。
Whereas with a strong type system, you can get an enormous amount of very precise guidance. And actually, the combination of a strong type system and an agentic coding system is awesome. We actually have a bunch of experience of just trying these things out now where you make some small change to the type of something, and then you're like, hey, AI system. Please run down all the type errors. Fix them all.
而且效果出奇地好。
And it does surprisingly well.
我完全同意。还有其他因素——Russ在crate和API的稳定性管理方面做得非常出色。我们会吸收各生态系统的优点,希望能打造出既高效又适合生态的解决方案,在保持扩展性的同时避免一旦发布1.0版本就永远无法修正问题的困境。
I absolutely agree. There's other components to it. So Rust has done a very good job with the stabilization approach with crates and APIs. And so I think that's a really good thing. And so I think we'll take good ideas from many of these different ecosystems and hopefully do something that works well, and works well for the ecosystem, allows us to scale without being completely constrained by never being able to fix something once it gets, you know, once you ship a 1.0.
我很好奇智能编程方面的情况:让AI代理编写优质内核其实相当困难。想知道你们在Mojo上的实践经验。Mojo显然不是这些模型训练集中深度涉及的语言,但你们有强大的类型结构来指导AI代理编写和修改代码。很好奇实际使用中效果如何?
I'm actually curious just to go to the agentic programming thing for a second, which is having AI agents that write good kernels is actually pretty hard. And I'm curious what your experience is of how things work with Mojo. Mojo is obviously not a language deeply embedded in the training set that these models were built on. But on the other hand, you have this very strong type structure that can guide the process of the AI agent trying to write and modify code. I'm curious how that pans out in practice as you try and use these tools.
这正是Mojo开源的意义所在。我们有数十万行公开的Mojo代码——包括GPU内核等精彩内容,还有社区持续贡献。海量代码库让编程工具能直接学习索引,不需要重新训练语言模型,就能实现高质量输出。
You know, so this is why Mojo being open source matters. And so we have hundreds of thousands of lines of Mojo code that are public, with all these GPU kernels, all this other cool stuff, and we have a community of people writing more code. Having hundreds of thousands of lines of Mojo code is fantastic. You can point your coding tool, Cursor or whatever it is, at that repo and say, go learn about this repo and index it. So it's not that you have to train the model to know the language; just having access to it enables it to do good work.
这些工具非常出色。官网页面上有详细配置指南,正确设置使其能索引代码库与否效果天差地别。请务必按照说明文档完成环境配置。
And these tools are phenomenal. And so that's been very, very, very important. And so we have instructions on our webpage for how to set up these tools. And there's a huge difference if you set it up right, so that it can index that, or if you don't. And make sure to follow that markdown file that explains how to set up the tooling.
所以我想稍微谈谈Mojo的未来。我认为目前Modular和你谈论Mojo的方式,至少最近是这样,它是CUDA的替代品,一个从头到尾完整的替代堆栈,用于构建GPU内核,编写在GPU上运行的程序。但这并不是你谈论Mojo的唯一方式。你之前也尝试过,尤其是早期,我认为更多讨论的是Mojo作为Python的扩展和可能的演进,也许最终会取代Python。
So I wanna talk a little bit about the future of Mojo. I think the current way that Modular and you have been talking about Mojo, these days at least, is as a replacement for CUDA, an alternate full top to bottom stack for building GPU kernels, for writing programs that execute on GPUs. But that's not the only way you've ever talked about Mojo. Especially earlier on, I think there was more discussion of Mojo as an extension, and maybe evolution of, and maybe eventually replacement of, Python.
我很好奇你现在怎么看?你在多大程度上认为Mojo是一种受Python启发和借鉴语法的新语言?又在多大程度上希望它随着时间的推移能更深度地集成?
And I'm curious how do you think about that now? To what degree do you think of Mojo as its own new language that takes inspiration and syntax from Python? And to what degree do you want something that's more deeply integrated over time?
所以今天,回到Mojo现在有什么用以及我们如何解释它?如果你想让代码运行得快,Mojo就很有用。如果你有CPU或GPU上的代码,想让它跑得更快,Mojo是个很棒的选择。现在有一个非常酷的功能,虽然还处于预览阶段,但下个月左右会稳定下来,它也是扩展Python的最佳方式。所以如果你有一个大规模的Python代码库,再告诉我这听起来是否熟悉,你正在用Python编码,做很酷的事情,然后它开始变慢了。
So today, to pull it back to what is Mojo useful for today and how do we explain it? Mojo is useful if you want code to go fast. If you have code on a CPU or a GPU and you want it to go fast, Mojo is a great thing. One of the really cool things that is available now, but is in preview and will solidify in the next month or something, is that it's also the best way to extend Python. And so if you have a large scale Python code base, again, tell me if this sounds familiar: you are coding away and you're doing cool stuff in Python, and then it starts to get slow.
通常人们做的是,要么用Rust或C++重写整个东西,要么切出一部分,把那个包的一些部分移到C++或Rust中。这就是NumPy、PyTorch或所有现代大规模Python代码库最终都会做的事情。
Typically what people do is they have to either go rewrite the whole thing in Rust or C++, or they carve out some chunk of it and move some chunk of that package to C++ or Rust. This is what NumPy or PyTorch or all modern large scale Python code bases end up doing.
你看看镜像,看看包含C扩展的程序比例,高得惊人。很大一部分Python东西实际上是部分Python和部分其他语言,几乎总是C和C++,还有一点Rust。
You look up on the mirrors and look at the percentage of programs that have C extensions in them; it's shockingly high. A really large fraction of Python stuff is actually part Python and part some other language, almost always C and C++, and a little bit of Rust.
没错。所以今天,这不是遥远的未来。今天,你可以拿你的Python包,创建一个Mojo文件,然后说,好吧,这四个循环太慢了,移到Mojo里。我们有人,比如做生物信息学和其他我一窍不通的疯狂事情的人说,好吧,我就拿我的Python代码,移到Mojo里。哇,现在我有了类型,得到了这些好处,但没有绑定。
That's right. And so today, this isn't distant future. Today, you can take your Python package and you can create a Mojo file and you can say, okay, well these four loops are slow, move them over to Mojo. And we have people, for example, doing bioinformatics and other crazy stuff I know nothing about saying, okay, well, I'm just taking my Python code, I move it over to Mojo. Wow, now I get types, I get these benefits, but there are no bindings.
PIP体验非常棒。超级简单。你不需要FFI和NanoBind,所有这些复杂性就能做到。你也不是从Python的语法转到花括号和借用检查器之类的疯狂东西,你现在有了一个非常简单无缝的方式来扩展你的Python包。我们有人会说,好吧,我这么做了,首先在CPU上就快了10倍、100倍甚至1000倍。
The pip experience is beautiful. It's super simple. You don't have to have FFIs and nanobind, all this complexity, to be able to do this. You also are not moving from Python with its syntax to curly braces and borrow checkers and other craziness; you now get a very simple and seamless way to extend your Python package. And we have people that say, okay, well, I did that, and first I got it 10x and then 100x and then 1000x faster on CPU.
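The "carve out the hot loop" workflow described above is language-agnostic, and a minimal sketch of it fits in pure Python. Here the "fast" side just uses the stdlib; in the workflow Chris describes it would be a Mojo (or C/Rust) function exposed to Python, and the function names are made up:

```python
# Pattern: profile, find the slow pure-Python kernel, and move only that
# function behind a faster implementation with the same contract.
import math

def norm_python(xs):
    # the original hot loop: straightforward, but slow on big inputs
    total = 0.0
    for x in xs:
        total += x * x
    return total ** 0.5

def norm_carved(xs):
    # the carved-out replacement: same inputs, same result, faster machinery
    return math.sqrt(math.fsum(x * x for x in xs))

print(norm_python([3.0, 4.0]))  # 5.0
print(norm_carved([3.0, 4.0]))  # 5.0
```

The key property is that callers never notice the swap: the rest of the Python package keeps calling the same function name, which is why moving just the hot loops, rather than rewriting the whole package, is so attractive.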
但由于它很简单,我就直接把它放到了GPU上。对我来说这很神奇,因为这些人甚至没想过,如果换成Rust之类的语言,他们永远也不会接触到GPU。我再次解释的方式是:Mojo适合追求性能。如果你想在GPU、CPU上跑得更快,想让Python提速,或者有些人疯狂到想完全从头开始用Mojo写程序——这超级酷。如果快进六到九个月左右,我认为Mojo将成为Rust的一个非常可靠的全栈替代品。
But then because it was easy, I just put it on a GPU. And so to me, this is amazing, because these are people that would never have even thought about getting on a GPU if they had switched to Rust or something like that. Again, the way I explain it is that Mojo is good for performance. It's good if you wanna go fast on a GPU, on a CPU, you wanna make Python go fast, or, I mean, some people are crazy enough to go whole hog and just write entirely from scratch Mojo programs, and that's super cool. If you fast forward six or nine months or something, I think that Mojo will be a very credible top to bottom replacement for Rust.
因此我们需要对泛型系统进行一些扩展。还有一些功能我想再完善一下。Rust中存在的某些动态特性,比如运行时特质(existentials)的能力,在Mojo中还没有。我们会添加一些这类特性。随着这些功能的加入,我认为作为应用级编程语言,这将变得非常有趣,让人们开始关注这类东西。
And so we need a few more extensions to the generic system. And there are a few things I want to bake out a little bit. Some of the dynamic features that Rust has, for example existentials, the ability to use a trait at runtime, are missing in Mojo. And so we'll add a few of those kinds of features. And as we do that, I think that'll be really interesting as an applications level programming language for people who care about this kind of stuff.
快进一下,我甚至不设定具体时间框架,也许一年、十八个月后,这取决于我们的优先级安排,我们会添加类。随着类的加入,对Python程序员来说,它会突然看起来和感觉上熟悉得多。Mojo中的类会有意设计成与Python非常相似。到那时,我们将拥有某种看起来和感觉上像Python 4的东西。它完全是从Python同一个模子里刻出来的。
If you fast forward, and I won't even project a timeframe, maybe a year or eighteen months from now, it depends on how we prioritize things, we'll add classes. And so as we add classes, suddenly it will look and feel much more familiar to a Python programmer. And so the classes in Mojo will be intentionally designed to be very similar to Python. And at that point, we'll have something that looks and feels kind of like a Python 4. It's very much cut from the same mold as Python.
它与Python集成得非常好。扩展Python非常容易。因此它绝对是Python家族的一员,但它与Python不兼容。所以在未来若干年(我无法准确预测需要多久),我们将持续推进:好吧,我们想为这个东西添加多少兼容性?然后我认为在某个时间点,人们会认为它是Python的超集。
It integrates really well with Python. It's really easy to extend Python. And so it's very much a member of the Python family, but it's not compatible with Python. And so what we'll do over the course of n years, and I can't predict exactly how long that is, is continue to run down the line of, okay, well, how much compatibility do we want to add to this thing? And then I think that at some point people will consider it to be a Python superset.
实际上,它会感觉就像是用Python的最佳方式。我认为这需要时间来实现。但回到最初,我希望我们非常专注于Mojo今天有什么用。伟大的主张需要伟大的证明。我们目前还没有证明我们能做到这一点。
Effectively, it will feel just like the best way to do Python in general. And I think that that will come in time. But to bring it all the way back, I want us to be very focused on what Mojo is useful for today. And so great claims require great proof. We have no proof yet that we can do this.
我脑中有愿景和未来,之前也构建过一些语言和规模较大的东西。因此我对实现这个目标有相当高的信心,但我希望人们回归到:如果你在写性能代码、GPU内核或AI应用,如果你有Python代码但想要提速——我们中很少有人有这个问题,那么Mojo会非常有用。希望未来它对更多人来说甚至更加有用。
I have a vision and a future in my brain, and I've built a few languages and some things at scale before. And so I have quite high confidence that we can do this, but I want people to zero back in on: okay, if you're writing performance code, if you're writing GPU kernels or AI, if you have Python code and you want it to go fast (unless you want it to go slow, and few of us have that problem), then Mojo can be very useful. And hopefully it'll be even more useful to more people in the future.
没错。而且我认为实际上短期的实用目标本身已经足够雄心勃勃和令人兴奋了。看起来是个值得专注的好方向。
Right. And I think the practical short term thing is already plenty ambitious and exciting on its own. Seems like a great thing to focus on.
是的。让我们来解决AI中的异构计算问题。这实际上是一件相当有用的事情,对吧?所以
Yeah. Let's solve heterogeneous compute in AI. That's actually a pretty useful thing. Right? So
好的。看来这是个很好的结束点。非常感谢你的参与。
Alright. That seems like a great place to stop. Thank you so much for joining me.
是的。嗯,谢谢你邀请我。我很喜欢和你一起探讨技术,希望这对其他人也有用且有趣。但即使不是这样,我也和你玩得很开心。
Yeah. Well, thank you for having me. I love nerding out with you, and I hope it's useful and interesting to others too. But even if not, I had a lot of fun with you.
你可以在signalsandthreads.com找到本集的完整文字记录,以及节目说明和相关链接。感谢收听。下次再见。
You'll find a complete transcript of the episode along with show notes and links at signalsandthreads.com. Thanks for joining us. See you next time.