本集简介
双语字幕
我们的目标是将VLLM打造成全球的推理引擎,真正推动开源领域的技术边界,并构建一个通用的推理层。
Our goal is to make VLLM the world's inference engine, really push the capabilities on the open source front, and then build a universal inference layer.
这意味着我们将拥有能够为任何新模型、新硬件和新应用提供动力的运行时环境,能够针对极致效率进行优化,并持续支持所有未来的AI工作负载。
That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor that to extreme efficiency, and support all the AI workloads going forward.
我从根本上相信,开源,尤其是VLLM自身的架构方式,对全球AI基础设施至关重要。
I fundamentally believe that open source, especially how VLLM itself is structured, is critical to the AI infrastructure in the world.
而我们希望通过这个基础设施来支持、维护、引导并推动开源生态系统的发展。
And what we want to do with this infra is to support, maintain, steward, and push forward the open source ecosystem.
只有当VLLM成为行业标准,并帮助每个人实现他们的目标时,我们的公司才真正具有意义,并能够支持其周围的每一个人。
It's only when VLLM becomes the standard and helps everybody achieve what they need to do that our company, in a sense, has the right meaning and can support everybody around it.
如果人工智能中最困难的问题不是训练更智能的模型,而是仅仅维持它们的稳定运行呢?
What if the hardest problem in artificial intelligence isn't training smarter models but simply keeping them running?
在计算历史的大部分时间里,一旦系统被构建出来,最难的部分就结束了。
For most of the history of computing, once a system was built, the hard part was over.
你编写了程序,按下运行键,机器就会可预测地运行。
You wrote the program, pressed run, and the machine behaved predictably.
即使是早期的机器学习也遵循这一模式。
Even early machine learning followed that pattern.
输入是标准化的,工作负载是规律的,计算机完成任务后就停止了。
Inputs were standardized, workloads were regular, the computer did its job and stopped.
大型语言模型悄然打破了这一假设。
Large language models quietly broke that assumption.
每个请求都不尽相同。
Every request is different.
提示可以是一句话,也可以是一个完整的档案。
Prompts can be a sentence or an entire archive.
输出可能瞬间结束,也可能无限延续。
Outputs can end instantly or stretch on indefinitely.
成千上万的用户可能同时涌入,各自对同一硬件提出互不兼容的需求。
Thousands of users can arrive at once, each making incompatible demands on the same hardware.
而所有这些都必须在GPU上实时发生,而这些GPU从未为这种不可预测性而设计。
And all of this has to happen in real time on GPUs that were never designed for this kind of unpredictability.
在过去几年里,这个问题已经从边缘走向了核心。
Over the last few years, this problem has moved from obscure to essential.
随着模型变得更大、更多样化,并更深地融入产品中,运行AI系统的挑战开始与构建它们的挑战不相上下。
As models have grown larger, more diverse, and more deeply embedded into products, the challenge of running AI systems has started to rival the challenge of building them.
矛盾就在这里。
That's where the tension lies.
关于AI进步的公众叙事聚焦于更好的模型和更大的突破,但其背后却是一个更安静的系统性问题。
The public story of AI progress is about better models and bigger breakthroughs, but underneath it is a quieter systems problem.
你如何高效地调度混乱的请求?
How do you schedule chaotic requests efficiently?
当你不知道对话何时真正结束时,该如何管理内存?
How do you manage memory when you don't know when a conversation is actually finished?
当AI系统不再像单次交互工具,而是开始像能够思考、暂停并随着时间与世界互动的代理时,会发生什么变化?
And what changes when AI systems stop behaving like single turn tools and start acting like agents that think, pause, and interact with the world over time.
本集聚焦于这一隐藏层面。
This episode focuses on the hidden layer.
我们探讨了推理——即运行已训练的AI模型——为何已成为现代计算中最复杂、最重要的问题之一,以及开源基础设施为何日益成为解决这一问题的核心。
We examine why inference, the act of running trained AI models, has become one of the most complex and important problems in modern computing and why open source infrastructure is increasingly central to solving it.
安德森·霍洛维茨的普通合伙人马特·伯恩斯坦与Infraqt的联合创始人、开源推理引擎VLLM的创建者Simon Mo和Woosuk Kwon进行了对话。
Matt Bornstein, general partner at Andreessen Horowitz, is joined by Simon Mo and Woosuk Kwon, cofounders of Infraqt and creators of the open source inference engine VLLM.
这是一场关于AI底层基础设施的对话,以及为何它可能比模型本身更为重要。
This is a conversation about the infrastructure beneath AI and why it may matter more than the models themselves.
今天我们邀请到了VLLM开源项目的主要贡献者、Infraqt这家新AI推理公司的联合创始人Simon Mo和Woosuk Kwon。
We are here today with Simon Mo and Woosuk Kwon, lead contributors on the VLLM open source project, and cofounders of Infraqt, a new AI inference company.
非常高兴今天能邀请到你们参加节目。
Super excited to have you guys on the show today.
谢谢。
Thank you.
非常感谢你们的到来。
Thank you so much for coming.
我们会聊一聊VLLM这个开源项目。
We're gonna talk a little bit about VLLM, the open source project.
我们会深入讨论推理以及推理技术到底是什么,然后会简单聊聊Infraqt这家新公司。
We're gonna talk a lot about inference and what inference technology really is, and then we'll talk a little bit about Infraqt, the new company.
那么首先,你能谈谈VLLM的起源吗?
So to start, can you talk a little bit about where VLLM came from?
它是什么?
What is it?
你们是怎么开始这个项目的?
How did you start it?
为什么它是一个如此令人兴奋的项目?
And why is it such an exciting project?
谢谢你们的到来。
Thank you for having us.
VLLM项目最初是Woosuk在UC伯克利攻读博士期间的一个原型项目,后来发展成了今天GitHub上面向所有人的开源推理运行时平台。
The VLLM project actually started as Woosuk's prototype project at UC Berkeley during his PhD, and grew into today's open source inference runtime on GitHub for everybody.
也许Woosuk可以简单介绍一下PagedAttention论文。
Maybe Woosuk can talk a little bit about the PagedAttention paper.
哦,是的。
Oh, yeah.
所以,基本上我认为它始于2022年,当时Meta将OPT模型开源。
So, basically, I think it kind of started in 2022 when Meta released the OPT model as open source.
我不确定现在还有多少人记得这个模型,但它确实是最早一批开源权重的大型语言模型之一,能够复现GPT-3。
I'm not actually sure how many people remember the model nowadays, but it was one of the first open-weight large language models to reproduce GPT-3.
我们的实验室尝试创建了一个演示服务来运行该模型,以便向更广泛的受众展示。
And our lab tried to create a demo service to run the model and, you know, demonstrate it for the broader audience.
而且,是的,它确实能运行,但速度非常慢。
And, yeah, like, it was working but super slow.
于是我启动了一个小的副项目来优化这个演示服务。
So I started a small side project to optimize that demo service.
这算是最初的起点。
That was kind of the beginning.
起初我以为只需要几周时间就能端到端地优化这个服务,但结果发现其中存在大量未解决的问题,因为这种自回归语言模型非常不同。
And then initially, I was thinking that it might only take a couple of weeks to optimize the service end to end, but it turned out that it actually had a lot of open problems inside it, because this autoregressive language model is pretty different.
实际上,它与传统的机器学习工作负载有很大不同。
Actually, it was pretty different from other traditional ML workloads.
而且这在当时,至少在这些前沿实验室之外,可以说是全新的。
And it was kind of brand new, at least outside these frontier labs back in the day.
我开始研究它,它逐渐变成一个研究项目,我们写了一篇论文,甚至发展成一个定义清晰的开源项目,因为越来越多的人对它产生了兴趣。
I started to work on it, it became a research project, we wrote a paper, and it even became a pretty well defined open source project as more and more people got interested in it.
所以是2022年。
So 2022.
这显然是在GPT-4之前。
This is pre GPT four, obviously.
这发生在ChatGPT之前。
This is pre ChatGPT.
是的。
Yeah.
在ChatGPT之前。
Pre ChatGPT.
是的。
Yeah.
你可能会想,哦,我就做个推理服务器好了。
And you're thinking like, oh, I'll just, like, work on this inference server.
这应该是个相当简单的问题。
This should be a fairly straightforward problem.
四年之后,实际上你做的工作反而更多了,而不是更少。
Like, four years later, actually, you're, like, doing more work instead of less.
没错。
Exactly.
没错。
Exactly.
是的。
Yeah.
你当初为什么会觉得这个
Why did you think this
当时你觉得这是一个值得投入的问题吗?
is a meaningful problem to work on at the time?
因为我觉得,那时候世界上大多数人把GPT-3看作一种新奇事物,而OPT某种程度上就像是附着在新奇事物上的另一个新奇事物。
Because, like, I would say most people in the world at that time saw GPT-3 as a curiosity in some sense, and OPT was kind of like a curiosity attached to a curiosity in a way.
那是什么让你和你的实验室同伴当时对这个项目如此兴奋?
Like, what made you and your lab mates sort of excited to work on this back then?
我想我最初也是出于好奇。
I think I also started from curiosity.
我并没有觉得当时这是世界上最重要的问题。
I didn't really think it's the most important problem in the world back in the day.
我只是想亲手体验一下它究竟是如何运作的。
I just wanted to have a hands on experience on how this actually works.
我的意思是,我也对模型的规模感到印象深刻。
I mean, I think I'm also impressed by the size of the model.
OPT最大的模型有1750亿个参数,那是当时可用的最大模型。
The largest OPT model had 175 billion parameters, and that was the largest model available.
所以对我来说,这还挺有意义的。
So it's kind of, like, meaningful for me.
就是说,在这么大的模型上工作还是挺有成就感的。
Like, it's kind of pretty rewarding to work on such a large model.
这让我想起我小时候,你知道,我们会自己组装电脑。
This reminds me of when, like, when I was, like, growing up, you know, we would build, like, computers.
那在当时是件很酷的事情。
That was, like, the cool thing to do.
而且每次内存容量的阶段性提升都让我觉得太震撼了,我当时的反应是,天哪。
And each step change in, like, memory capacity was such a big deal. I was like, oh my god.
这个居然有4兆字节的内存。
This one has four megabytes of RAM.
我的天啊。
Oh my god.
这个居然有512兆字节的内存。
This one has 512 megabytes of RAM.
回过头来看,这很傻,但当时确实让人激动,可能因为我们是极客吧,看到这些系统上的数字越来越大,就会情绪高涨。
Looking back, it's silly, but at the time, maybe it's because we're nerds, but you get emotionally excited about the numbers getting bigger on these systems.
对。
Right.
对。
Right.
对。
Right.
是的。
Yeah.
我认为这显然是主要动机之一。
I think that was clearly one of the main motivations.
所以你刚才提到,类似那种
And so you started to say the sort of
与传统机器学习相比,自回归变换器面临的技术问题不同。
technical problem is different for autoregressive transformers compared to traditional machine learning.
你能稍微解释一下吗?就是,那到底是怎么一回事?
Do you mind explaining a little bit, you know, how that is?
而且也跟普通的计算负载做个对比吧,让那些可能不太熟悉AI负载的工程师听众也能理解。
And even compare just to normal kind of computing workloads for, you know, listeners who who, you know, are engineers who may not be familiar with AI workload.
基本上,跟传统负载相比,最明显的区别肯定是GPU。
So, basically, compared to the traditional workload, you know, the clear difference is definitely like GPUs.
对吧?
Right?
现在所有的计算,或者说大部分计算都在GPU上进行,我们必须针对它们进行优化,因为它们——至少在过去——内存比CPU少。
Now all the compute, or most of the compute, happens on GPUs, and we have to optimize for them, which presumably have less memory than CPUs, at least back in the day.
现在GPU的内存大得多,但通常还是比CPU的内存小得多,也许至今仍是这样。
Now GPUs have much larger memory, but typically still much smaller than CPUs, maybe to this day.
而且,所有的计算都在GPU上进行,所以你得用不同的语言编程,并且考虑到不同的并行方式。
And, you know, all the computation happens on GPUs, so you have to write programs in a different language, with a different type of parallelism in mind.
是的。
Yeah.
所以,我认为这与传统的计算密集型工作负载和深度学习工作负载之间存在根本性差异。
So that's kind of like a fundamental difference from the traditional, like, compute heavy workload versus, like, deep learning workload, I would say.
但在深度学习工作负载内部,传统深度学习工作负载与大语言模型推理之间仍然存在巨大差异。
But within the deep learning workload, there's actually still a huge difference between the traditional deep learning workload and large language model inference.
对于传统工作负载,我认为最大的特点是它相当静态,比如在早期的图像模型中,像CNN这样的模型,人们通常会处理不同尺寸的几张图像。
For traditional workloads, I think the biggest characteristic is that they're pretty static. For example, for image models back in the day, like CNNs, what people did is, you know, we may have several images with different sizes.
然后我们会将它们调整大小或裁剪成相同的尺寸,再批量输入模型,一次性运行推理。
Then what we do is resize them or crop them into the same size, batch them, and put them through the model to run the inference at once.
这基本上就是如此。
And this is basically yeah.
由于这种调整大小和裁剪,最终它们都会被压缩成相同尺寸的张量,这实际上让GPU处理起来简单得多。
And because of this resizing and cropping, at the end they're all compressed into same-size tensors, and that actually makes things much simpler for the GPU to handle.
对吧?
Right?
所有形状都相当规则、静态,而且定义明确。
All the shapes are pretty regular and static, and it's kind of well defined.
但对于大语言模型来说,如果你仔细想想,它们是非常动态的。
But for large language model, if you think about it, they are pretty dynamic.
你知道吗?
You know?
你的提示词可以只是一个简单的‘你好’,也可以是一些长达数百页的文档。
Your prompt can be a single word like hello, or it can be a bunch of documents spanning hundreds of pages.
这种动态性是语言模型固有的,这使得整个情况完全处于另一个世界。
And this kind of dynamism exists inherently in the language model, and it puts things in a whole different world.
我们必须把这种动态性当作第一等重要的问题来处理。
Like, we have to handle this dynamism as a first class citizen.
在过去,人们并没有清晰的思路来应对这个问题。
And, yeah, back in the day, people didn't have a clear idea about how to handle it.
幸运的是,我们是最早之一
And, yeah, fortunately, we were one of
率先发现问题并加以解决的团队。
the first to see the problem and solve it.
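As a concrete illustration of why this dynamism matters, here is a small sketch (hypothetical numbers, not vLLM code) of what naive static batching costs when prompt lengths vary wildly:

```python
# Illustrative sketch (made-up numbers, not vLLM code): naive static batching
# pads every sequence to the longest one in the batch, so compute scales with
# the padded size rather than the real token count.
def padded_batch_waste(lengths):
    """Fraction of token slots wasted when padding a batch to the max length."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)
    real_tokens = sum(lengths)
    return 1 - real_tokens / total_slots

# A batch mixing "hello" with a hundreds-of-pages prompt is mostly padding:
waste = padded_batch_waste([1, 8, 4000])  # hypothetical per-request token counts
```

With these made-up lengths, roughly two thirds of the batch is padding, which is exactly the waste that treating dynamism as a first-class citizen avoids.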
这非常有趣。
That's very interesting.
所以,对一批输入进行正则化,听起来像是你们需要解决的首批问题之一。
So so kind of regularizing a batch of inputs was it sounds like one of the first problems you had to solve.
实际上这更多是关于调度和内存管理。
It's actually more about scheduling and memory management.
是的。
Yeah.
是的。
Yeah.
还有。
As well.
是的。
Yeah.
你愿意
Would you like to
说点什么?
say something?
是的。
Yeah.
因此,我们在所有服务系统中之前解决的问题,就是所谓的微批处理。
So the problem we were solving before, in all the serving systems, is what we call microbatching.
在大语言模型出现之前,早期利用CPU的向量化以及早期GPU处理如ResNet这类模型时,核心都是微批处理。
Leveraging CPU vectorization in the early days before LLMs, and then early GPUs for models like ResNet, was all about microbatching.
你把同时到达的四个请求合并在一起。
You put together four requests together that arrive around the same time.
但在大语言模型的世界里,请求是持续不断涌入的,每个请求的形态都不同,你根本无法将它们标准化。
But the change in the LLM world is that you always have requests continuously flowing in, and each request looks different; you just cannot really normalize them.
因此,你必须在大语言模型引擎中引入一个步骤的概念,即同时处理所有请求中的一个token,无论每个请求的输入长度和输出长度如何不同。
So that's why you have to have a notion of a step within the LLM engine to process one token across all the requests at the same time, regardless of each request having different input lengths and output lengths.
输出也是非确定性的。
The output is also non deterministic.
语言模型本身会决定何时停止,而不是像传统机器学习服务那样像钟表一样机械运行。
The language model itself decides when it stops, instead of running like clockwork, the way traditional machine learning serving does.
而在这里,一切都是高度随机的,它始终是流动的、连续的。
And here it is very stochastic: it's always flowing, it is always continuous.
因此,调度是首先要解决的问题,然后是内存管理,这正是PagedAttention出现的原因。
That's why scheduling is the first problem to solve, and then memory management, which is where PagedAttention comes in, is the second problem to solve.
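The step-based scheduling just described can be sketched as a toy loop (an illustration of the idea, not vLLM's actual scheduler): each engine step generates one token for every in-flight request, newly arrived requests join between steps, and a request retires whenever the model decides to stop:

```python
from collections import deque

def run_engine(arrivals, output_lens):
    """Toy continuous-batching loop. arrivals: list of (arrival_step, request id);
    output_lens: how many tokens each request generates before it stops
    (standing in for the model's own stop decision)."""
    pending = deque(sorted(arrivals))
    running, done, step = [], [], 0
    generated = {}
    while pending or running:
        while pending and pending[0][0] <= step:  # admit newly arrived work
            running.append(pending.popleft()[1])
        for r in list(running):                   # one token per request per step
            generated[r] = generated.get(r, 0) + 1
            if generated[r] == output_lens[r]:    # request finished: retire it
                running.remove(r)
                done.append(r)
        step += 1
    return step, done

# "b" arrives mid-flight and is batched with "a" instead of waiting for it:
steps, done = run_engine([(0, "a"), (2, "b")], {"a": 4, "b": 2})
```

The key property is that admission and retirement happen at token granularity, so no request ever waits for a whole batch to drain.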
是的。
Yeah.
那么,西蒙,你是什么时候参与这个项目的?
So when did you get involved in the project, Simon?
嗯,我大约在2023年参与进来的。
Well, I got involved as in around 2023.
最初,Woosuk在Sky Lab的Slack频道里发起了一个号召,说:嘿。
At first, Woosuk issued a call in the Sky Lab Slack channel to say, hey.
我们需要有人和我们一起完成这篇PagedAttention论文和内核。
We need someone to work with us on this PagedAttention paper and kernel.
实际上,令人惊讶的是,我当时正在放春假,我就想,看啊。
Actually, surprisingly, I was on spring break, I was like, look.
别人也可以做这个。
Someone else can do this.
让我整个星期都玩一下GBT吧。
Let me just play with GBT for the entire week.
所以我最后只是在做提示工程。
So I just ended up just playing with prompt engineering.
因此,我实际上并没有和Woosuk合作。
So I actually didn't end up working with Woosuk.
所以这就是Ion Stoica实验室的假期样子:玩一周模型。
And so this is what a vacation looks like in Ion Stoica's lab, playing with models for a week.
他还在卖他的内核。
And he's selling his kernels.
是的。
Yeah.
没错。
Exactly.
所以他正在研究内核。
So he's playing with kernels.
我当时想多做一些提示工程,探索各种早期智能体工作流。
I was trying to do more prompt engineering and explore different kinds of early agentic workflows.
然后到了夏天,特别是八月或九月左右,我们真正开始一起合作——这里就要说到你了。
And then over the summer, especially around August or September, we really got to work together. Actually, this is where you come in.
我们一起筹备了在a16z举办的第一次VLLM聚会;我之前有管理开源项目的经验,也对构建一个服务平台并将其发展为完全开源的项目非常感兴趣。
We got to work together on our very first VLLM meetup, at a16z, where I had experience managing open source projects before, as well as being deeply interested in actually building a serving platform into a fully open source project.
从那时起,我开始参与进来,写了第一行代码,搭建了CI系统,构建了性能基准测试系统,此后就一直与Woosuk紧密合作。
And this is where I started to get involved, wrote my first lines of code, built out the CI system and the performance benchmarking systems, and have very much worked with Woosuk ever since.
我差点忘了这件事。
I had forgotten about that.
所以这是第一次VLLM聚会。
So this was the very first VLLM meetup.
对吧?
Right?
是的。
Yeah.
就在这个办公室里。
It was in this office.
就在这个办公室。
In this office.
就在这一层楼。
On this exact floor.
我们之前预计只有十来个人,可能二十、五十人左右会来,结果注册人数直接超过了预期容量。
I think we were anticipating just ten, twenty, maybe 50 people would show up, and then the registration went well over the anticipated capacity.
人们对这项技术非常感兴趣。
People are extremely interested in this technology.
我记得很清楚,因为我们自己在这里办活动时,总是很难让人来参加。
I remember that very well, because we run events here for ourselves, and it's always very hard to get people to show up.
我们总是手忙脚乱。
We're always scrambling.
结果,我接到安全部门的电话,说有太多人被批准参加这场活动。
And instead, I got a call from our security team saying, too many people have been approved for this event.
我的意思是,我们得缩减一下。
I mean, we have to scale it back.
这不安全。
This isn't safe.
我心想,哦,好吧。
I'm like, oh, okay.
大概别声张。
Probably don't tell.
我觉得我们从来没缩减过,所以别告诉安全部门。
I don't think we ever scaled it back, so don't tell the security team.
当时人非常多。
It was quite crowded.
披萨大概在前十分钟就被拿光了。
The pizza ran out in, like, the first ten minutes.
所以
So
但这可是个大事。
But this is a big deal.
对吧?
Right?
因为这可不是什么普通的消费者应用,对吧?你做的这个东西。
Because this is not, like, a consumer app, right, that you're building.
这主要是从系统工程师那里获取资源,他们大多想学习如何部署LLM并参与贡献,所以能吸引到如此多来自这样一个狭窄而专业群体的兴趣,确实非同小可——这些人通常也不太喜欢和现实生活中的人打交道,至少我是这样。
This is pulling from systems engineers, for the most part, who want to learn how to serve LLMs and contribute. So it's actually a big deal to get so much interest from such a narrow, sophisticated group of people, who don't like meeting other humans in real life that often either, you know, at least speaking for myself.
那你能不能再多讲讲VLLM背后的社区?
So can you talk a little bit more about the community behind VLLM?
现在这个社区有多大?
Like, how big is it now?
它是怎么凝聚起来的?
How did it come together?
而且,随着规模变大,你们是怎么管理的?
And, like, how do you guys manage it as it's gotten big?
是的。
Yeah.
一开始,当然只是几个研究生在做这件事。
So in the beginning, of course, it's just a few grad students working on it.
但随着时间推移,我们逐渐培养出一种非常开放的心态,推动开源发展。
But over time, we started developing this very open mindset, pushing open source forward.
截至目前,我们有50多位常驻全职贡献者,每天都会打开GitHub来参与VLLM的开发。
So as of now, we're looking at 50 or more regular full-time contributors who open up GitHub every single day to work on VLLM.
我们在GitHub上的贡献者数量已突破2000人,是GitHub官方认定的增长最快、排名最高的开源项目之一。
We crossed the 2,000-contributor bar on GitHub, one of the fastest growing top open source projects as ranked by GitHub itself.
而且,这个社区真的非常多元化。
And then this is really a diverse community.
所以,像吴锡和我这样的团队,来自加州大学伯克利分校的研究生时代,还有Meta和Red Hat,都在背后支持这个开源项目。
So there are folks like Woosuk and I, the team from UC Berkeley grad student days, as well as Meta and Red Hat pulling their weight behind this open source project.
此外,当然还有模型的开发者,比如Mistral和Qwen团队,以及任何开发开放权重模型的人,也都加入了我们的社区。
And then, of course, the people who are making the models, the Mistral and Qwen teams, and anyone who's making open-weight models, are participating in our community.
而在硬件方面,NVIDIA、AMD、Google、AWS、Intel都在积极参与,支持这个生态系统。
And then on the hardware side, NVIDIA, AMD, Google, AWS, Intel are all participating to support the ecosystem.
因此,使用VLLM的每个人都可以在不同的硅芯片中选择适合加速计算的方案。
So everyone using VLLM has the ability to choose among different silicon for accelerated computing.
这非常有趣,我认为这是许多成功开源项目的共同特点:人们并非出于相同的原因做出贡献。
Oh, that's very interesting, though, which I think is a property that many successful open source projects have, which is that people aren't all contributing for the same reason.
对吧?
Right?
我肯定有些人纯粹是因为热爱这项技术,但听起来你是在说,模型提供方实际上有动力去参与这个项目,因为他们希望自己的模型能运行良好。
Some people, I'm sure, just love the technology, but it sounds like you're saying the model providers actually have incentives to contribute to the project because they want their models to run well.
芯片提供商也希望它能在他们的芯片上运行顺畅。
The silicon providers want it to run well in their silicon.
基础设施提供商希望优先运行它,以便销售基础设施,诸如此类。
The infra providers wanna have first dibs on running it so they can sell infra, that kind of thing.
是的。
Yeah.
这本质上是解决M乘N问题的经典案例,这样作为模型提供商,你就不用去逐一沟通所有人。
This is kind of the classic M-times-N problem worth solving, so that as a model provider, you don't have to talk to everybody.
作为硬件提供商,你只需接入这一个系统,就能神奇地兼容世界上所有的模型。
And as a hardware provider, you can just go into this one system and then magically, you'll work for all the models out there in the world.
而对于使用VLLM以及基于VLLM构建基础设施的应用方来说,拥有一个所有人都能参与、共同创新的共同基础,会简单得多,成本也低得多,事实上,
And then for applications that are using VLLM, as well as infrastructure built with VLLM, having a common ground where everybody can participate and innovate together is way easier and cheaper, in fact,
最终部署起来也更轻松。
in the end to deploy.
你管理如此庞大的贡献者群体的哲学是什么?
What's your philosophy for managing a pool of contributors this large?
你会告诉他们该做什么吗?
Do you tell them what to do?
是他们自己选择的吗?
Do they choose themselves?
比如,你们是如何保持高代码质量的?
Like, How do you maintain high code quality?
这是一个持续迭代的过程,月复一月,年复一年。
It's a constant sort of iteration, months over months, year after years.
关于这一点,我得回顾一下我之前参与的开源项目:当时我在做一个叫Ray的项目,后来加入了Anyscale,在那里我学到了这种社区驱动的方法,需要有明确的需求、清晰的路线图和设定明确的里程碑。
So for this, I have to go back to my previous open source project: I was working on a project called Ray, and then later Anyscale, where I learned this community-driven approach of having clear requirements, a clear roadmap, and clear milestones being set.
所以我们试图借鉴这一点,同时也深入研究那些非常成功的开源项目。
So we kind of tried to borrow that, but also really studied the really successful open source projects out there.
我一路追溯到了Linux,然后研究了Kubernetes,研究了Postgres。
I went all the way back to Linux and then studied Kubernetes, studied Postgres.
这些社区是如何协同运作的?
How are these communities operating together?
因此在VLLM中,我们采用了一种特殊模式:像任何常规工程组织一样设定清晰的团队范围,同时也明确目标、成果和里程碑,推动并构建不同类型的技术特性。
So in VLLM, we had kind of a special model where we, like any normal engineering organization, set clear team scope, but also clear objectives, results, and milestones for the different technical features we want to push forward and build.
因此,我们每季度都会明确我们的愿景。
So this is where we have set forward our vision every quarter.
同时,我们也邀请社区参与贡献。
And then but also invite the community to contribute.
所以我们说,太好了。
So we're saying, great.
我们正在着手这些工作。
We're working on these.
我们还需要帮助,因为有些项目目前没有人主动负责。
We also need help on these items that we don't have anyone actively working on.
如果你是新手,想加入我们或参与社区,这就是你可以参与的项目。
If you are brand new and want to engage with us or engage with the community, here's what you can work on.
此外,我们对所有人在 GitHub 上提交的拉取请求都保持极其开放的态度,看到新的请求时,我们会想:这个请求不错吗?
And then additionally, we keep an extremely open mind to all the GitHub pull requests that people open up, asking, is this a good request?
这个功能好吗?
Is this a good feature?
此外,还包括RFC(征求意见)流程。
And then as well as request-for-comments processes.
所以这其实是融合了之前从其他开源项目中汲取的所有经验教训。
So it's kind of a blend of all the lessons learned previously from other open source projects.
然后在代码质量方面,有代码审查,但也有大量持续的重构和迭代。
And then code quality wise, code reviews, but also a lot of constant refactoring and iterations.
是的。
Yeah.
是的。
Yeah.
我经常进行重构,大约每六个月就会做一次,确实如此。
I do a lot of refactoring, like every six months or so, yeah.
实际上,还有一点要补充的是,我们会每两个月举办一次线下聚会,而且我们实际上正在向全球扩展,比如有时在欧洲,有时在亚洲的其他地方。
And actually, one thing to add is, you know, we do in-person meetups every two months or so, and we're actually expanding globally, sometimes in Europe, sometimes in other places in Asia.
对。
Yeah.
而且是的。
And yeah.
比如,从我们在a16z的第一次线下聚会开始,我们就发现与这些合作者和用户面对面交流真的非常非常有用。
Like, actually, from the first meetup at a16z, we learned that it's super, super useful to meet those collaborators and users in person.
而且,我们一直在继续这样做。
And, yeah, we are continuing doing that.
这很有趣。
It's funny.
这是另一个教训,你知道,硅谷的工程师们已经把抽象层级提升得太高了,以至于我们现在又在重新学习几千年前就有的道理:不。
It's another one of these lessons where, you know, Silicon Valley engineers have gotten so high up the abstraction stack that we're relearning lessons from a thousand years ago, saying, no.
事实证明,面对面的沟通带宽很高,而且不会出现一致性问题。
It turns out in person communication is high bandwidth and doesn't suffer from consistency problems.
所以,就在你们举办第一次聚会的时候,我们也通过学术实验室向这个项目提供了资助。
So around the time you guys did that first meetup, we also made grant funding to the project through the academic lab.
我想金额不大,但这是我们提供的第一个开源资助。
I think it was a small amount of money, but it was actually the very first open source grant that we made.
所以,你知道,这真的非常有趣且让我们感到欣慰,看到这笔钱确实用在了刀刃上,项目也取得了巨大发展,后来我们甚至还有机会投资了相关的公司。
So it's super fun and kind of gratifying for us to see the money was actually put to good use and the project grew massively, and then we even had a chance to invest in the related company later.
不过,我确实听到一个传言,说在我们提供资助的时候,你们把一部分钱投到了英伟达的股票上。
However, I did hear a rumor that at the time that we made the grant funding that you guys put a portion of the money into NVIDIA stock.
你能确认或否认这一点吗?
Can you confirm or deny?
两者都不是。
Neither did.
是的
Yeah.
不是他
Not him.
所以是收件人列表里的其他人
So someone else in the recipient list.
所以你们大概把我们那笔小额资助变成了十倍的资金,在之前
So you probably turned our tiny grant into ten times as much money before
哦,各种为VLLM提供的资金。
Oh, all sorts of funding for VLLM.
很多这类VLLM资金是我们为项目开发预留的,包括项目开发、测试以及运营这个项目的所有相关事宜。
A lot of this funding for VLLM we set aside for project development, testing, and everything around operating this project.
有一件事我们其实非常感激第一笔资助的是,它实际上开启了一种文化,如今甚至可以说形成了一种传统,让人们真正开始以相当可观的方式赞助开源项目。
And one thing we're actually super grateful for is that the first grant actually kicked off a culture, and nowadays even a tradition, of people really opening up to sponsor open source projects in a quite significant way.
因为比如运行VLLM的CI账单,每月就要超过10万美元。
Because our CI bill for running VLLM, for example, is more than $100K a month.
对某些人来说那可能微不足道,但它会随着时间的推移不断增长。
That could be tiny for some folks, but it keeps growing over time.
这里说的是每年百万美元级别的烧钱速度,抱歉。
This is where we're at a burn of million-dollar amounts. Sorry.
每年。
A year.
一百万。
A million
每年一百万美元。
dollar a year.
对于一个学术项目来说,这确实非常可观。
For an academic project, that's actually very significant. Yeah.
因为我们希望确保每个提交都经过充分测试。
Because we want to make sure every single commit is well tested.
这样人们就能在全球不同环境中部署,不是数千台,而是可能数百万台GPU。
And then this is something that people can deploy at not thousands, but potentially millions of GPUs across the world in different environments.
因此我们想确保它经过充分测试,是可靠的。
So we want to make sure it's well tested, it is reliable.
而现在这些需求、这些基础设施,全都来自大家的贡献和赞助,每个人都在为这个项目添砖加瓦。
And this requirement, this infrastructure, right now all comes from contributions and sponsorship, from everybody chipping in to help on this project.
当然,我们现在也会举办线下聚会,有时聚会的相关费用就直接利用了各位提供的资助。
And now, of course, we also run meetups, and sometimes the expenses associated with meetups directly leverage the grants that you all provided.
是的。
Yeah.
我的意思是,这很合理。
I mean, makes sense.
你知道,对我们和VLLM的其他企业赞助商来说。
You know, for us and for other corporate sponsors of VLLM.
你知道,这对整个生态系统都有好处。
You know, it benefits the whole ecosystem.
对吧?
Right?
所以我认为这非常合理。
So I think it makes a lot of sense.
如果你们没问题的话,我们多谈谈这个问题的技术方面。
Let's talk more about the technical aspects of the problem, if that's okay with you guys.
你介意先准确说明一下什么是推理服务器或推理引擎吗?
Do you mind starting by defining exactly what an inference server or an inference engine is?
当然可以。
Sure.
所以,推理引擎会接收一个已经训练好的模型。
So an inference engine takes an already trained model.
这可以是一个非常小的模型,比如Qwen 1B。
So this can be a very small model like Qwen 1B.
它也可能是一个非常大的模型,比如DeepSeek或Kimi K2,运行在加速计算设备上。
It could be a very big model like DeepSeek or Kimi K2, and run it on an accelerated computing device.
它的任务是充分利用计算设备,生成文本、图像和视频,而这些内容本质上都被分词为一个个独立的标记。
And its job is to fully utilize the computing device to generate text, images, and videos, essentially, all of which gets tokenized into individual tokens.
因此,推理引擎的目标是以极高的效率运行模型,确保我们能以最高效率产出最大量的结果。
So the goal of the inference engine is to run the model at a highly efficient speed, to make sure we can produce maximum output at the highest efficiency.
从宏观层面来看,你能解释一下典型的推理引擎是如何工作的吗?
And just from a high level, can you explain some of the architecture, how sort of a typical inference engine works?
有哪些最关键的部分是人们想进一步了解的?
What are the few most important components that people would be interested to learn more about?
也许我们可以从一个请求的生命周期说起。
Maybe one way is to go through the life of a request.
比如,如果我说‘你好’,VLLM中会发生什么?
Like, if I say hello, what would happen in VLLM?
是的。
Yeah.
对。
Yeah.
基本上,会有一个传统的API服务器,它接收请求,当模型生成输出后,会逐个流式返回这些token。
So, basically, there's a kind of traditional API server that, you know, gets the request. And once the model generates output, it streams back the tokens one by one.
是的。
Yeah.
所以,确实存在一个传统的API服务器层。
So there's, like, definitely a traditional API server layer.
而在其中,我们通常有一个称为分词器(tokenizer)的组件,用于将输入转换为token,也就是一些语言模型可以处理的整数序列。
And inside it, we typically have something called a tokenizer, right, to transform this input into tokens, basically a list of integers that the language model can consume.
而在其中,我们还有一个称为引擎的组件,它包含一个调度器,用于决定如何将传入的请求进行批处理。
And inside it, we have basically what we call an engine, and that includes the scheduler, which decides how to batch the incoming requests.
我们还有一个内存管理器来管理所谓的KV缓存,这是Transformer模型用于大语言模型的核心部分。
And we have a memory manager to manage something called the KV cache, which is kind of the core part of the transformer for LLMs.
我们肯定还有一些工作进程。
And we definitely have some kind of worker.
这是一个非常通用的术语,但实际上它负责初始化模型、运行模型、获取输出,并对输入进行预处理,对模型输出进行后处理。
This is a very generic term, but it basically initializes the model, runs the model, gets the output, and does all the preprocessing for the input and postprocessing for the model output.
是的。
Yeah.
所以没错。
So yeah.
是的。
Yeah.
也就是说,从某种意义上讲,这并不是什么颠覆性的新架构,但每一个组件都针对这种大语言模型推理场景进行了高度优化和专门化。
I mean, yeah, in a sense, it's not like a crazy new architecture, but each component is basically highly optimized and specialized for this LLM inference workload.
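The memory manager mentioned above is where the PagedAttention idea lives. As a rough sketch (illustrative only, not vLLM's implementation): instead of reserving one big contiguous KV-cache buffer per request, memory is carved into fixed-size blocks, each request keeps a small block table, and blocks are allocated only as tokens actually arrive and recycled when the request finishes:

```python
# Minimal sketch of the PagedAttention memory-management idea (illustrative
# only, not vLLM's implementation): the KV cache is carved into fixed-size
# blocks, and each request holds a block table instead of one contiguous
# buffer, so memory is allocated on demand per token and recycled on finish.
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free block ids
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens cached so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        table = self.tables.setdefault(req, [])
        if n == len(table) * self.block_size:  # last block is full: grab another
            table.append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        # Request finished: its blocks go straight back to the pool.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("req-1")  # 5 tokens -> 2 blocks (4 + 1 slots used)
```

Because a request never reserves more than one partially empty block, many concurrent conversations of unknown length can share the same fixed GPU memory pool.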
你觉得随着时间推移,运行推理是变得更容易了还是更难了?
Do you think it's getting easier or harder over time, running inference?
是的。
Yeah.
当然。
Definitely.
我认为随着时间的推移,这确实变得越来越困难了。
I think it is definitely getting much more difficult over time.
说实话,大概一年半前,我根本没觉得推理是个难题。
Like, actually, honestly, maybe one and a half years ago, I wasn't thinking of inference as a hard problem at all, to be very honest.
但现在情况变了。
But now things have changed.
趋势已经发生了巨大变化。
The trend has changed so far.
所以我认为有三个因素。
So I think there are kind of three factors.
一个是规模,另一个是多样性,最后一个则是智能体。
One is scale, another is diversity, and the last one is kind of agents.
关于规模,你知道,模型确实在变得越来越大。
So for scale, you know, like, the models are definitely getting larger.
而且,现在我们已经有了Kimi K2,参数量超过一万亿。
And, you know, right now we have Kimi K2 with more than a trillion parameters.
但我认为今年我们将会看到多万亿参数的开源模型。
But I think we will see multi-trillion parameter open source models this year.
我认为这仍然是一个清晰的趋势,人们会继续训练更大的模型。
And I think that's still clearly a trend: people will keep training larger models.
而且,毫无疑问,与早期LLM时代相比,处理这样的模型要困难得多,那时我们仅仅处理像小规模Llama这样的模型。
And, you know, it's definitely much more challenging to deal with such a model compared to the early days of LLMs, when we only dealt with small Llama models.
随着模型变大,想必你需要更多的节点同时工作。
And with larger models, presumably, you need more nodes working concurrently.
你需要管理更多的内存,这些内存可能无法全部装入每个芯片的可用内存中。
You have more memory to manage, which may or may not fit in each chip's available memory.
你描述了一些关于规模带来的挑战。
You described some of the challenges from scale.
是的
Yeah.
对于这类大型模型,我们确实需要将模型切分并分布到多个GPU和多个节点上。
For these kinds of large models, we definitely need to shard and distribute the model across multiple GPUs and multiple nodes.
对吧?
Right?
然后,是的。
And then, yeah.
然后,确实存在一个问题,就是如何切分和分布这个模型。
Then there's definitely a problem of how to shard, how to distribute this model.
对吧?
Right?
实际上,我们可以从多个维度来切分模型,它们各有不同的权衡。
There are actually many dimensions along which we can shard the model, and they have different trade-offs.
而且,确实存在权衡,比如以这种方式切分模型时,我们需要付出多少通信代价。
And, yeah, trade-offs, for example, in terms of how much communication we have to pay to shard the model in a given way.
此外,在负载均衡方面也存在权衡。
And also there's a trade off in terms of like load balancing.
或者如果我按这个维度进行分片,那么负载不均衡的程度会有多严重?
Or if I shard along this dimension, how significant is the load imbalance?
所以这些都需要在最终性能评估中加以考虑,以获得最佳性能。
So these all need to be taken into account in the final performance estimation to get the best performance.
而且,是的,随着模型变得越来越大,这正逐渐成为一个更大的问题。
And, yeah, it's becoming a bigger and bigger problem as models get larger.
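作为一个极简示意(纯属假设的数字:FP8精度即每参数1字节、1万亿参数的模型、80 GB显存的GPU;真实部署还必须为KV缓存和激活值预留空间,并权衡上面提到的通信开销与负载均衡),可以粗略估算仅为装下权重所需的最小张量并行度:
As a minimal sketch (purely hypothetical numbers: FP8 at one byte per parameter, a 1T-parameter model, 80 GB GPUs; a real deployment must also budget KV cache and activations, and weigh the communication and load-balancing trade-offs mentioned above), here is a rough estimate of the minimum tensor-parallel degree needed just to fit the weights:

```python
# Toy estimate of per-GPU weight memory under tensor parallelism (TP).
# All numbers here are illustrative assumptions, not vLLM defaults.

def min_tp_degree(num_params: int, bytes_per_param: int,
                  gpu_mem_gb: float, weight_budget: float = 0.6) -> int:
    """Smallest power-of-two TP degree whose per-GPU weight shard fits
    within `weight_budget` of each GPU's memory (the rest is reserved
    for KV cache and activations)."""
    weight_bytes = num_params * bytes_per_param
    budget = gpu_mem_gb * 1e9 * weight_budget
    tp = 1
    while weight_bytes / tp > budget:
        tp *= 2
    return tp

# A 1T-parameter FP8 model (1 TB of weights) on 80 GB GPUs:
print(min_tp_degree(1_000_000_000_000, 1, 80))  # -> 32
```

A real engine would not stop at this fit check: as discussed above, each sharding dimension changes communication volume and load balance, so the final choice comes from estimating end-to-end performance, not just memory.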
那么集群规模方面呢?
And what about just cluster scale?
我的意思是,西蒙,VLLM在任何时候会运行在多少个节点上?
I mean, I think, Simon, how many nodes is VLLM running on at any given time?
目前,我们正在查看的是我们使用统计数据的一个非常小的样本,用来帮助我们决定要弃用哪些功能。
Right now, we're looking at a very small sample of our usage statistics, which we use to figure out what features to deprecate.
仅从这一个信号来看,我们就发现有40万到50万个GPU在全天候(24/7)运行vLLM。
Just literally from this one signal, we're looking at 400,000 to 500,000 GPUs running vLLM 24/7.
考虑到全球GPU部署的规模,这确实是一个很大的范围,我们坚信还有更多的潜力存在。
And this is quite a big scale considering the global deployment footprint of GPUs, and we definitely believe there's a lot more out there.
当然,这里涉及的GPU种类、GPU架构以及模型架构都非常多样化。
And of course this is a wide diversity of different kinds of GPUs, GPU architecture as well as model architecture being deployed.
我们并没有看到一种放之四海而皆准的方案,也没有看到人们只把它用于某个单一的用例。
We're not seeing a one-size-fits-all, with people using it for just one singular use case.
我明白了。
I see.
这正是你的观点。
This is sort of your point.
你的第二个观点是,多样性使得这个问题随着时间推移变得更加复杂。
Your second point was about diversity making inference a harder problem over time.
是的。
Yeah.
芯片的多样性,也就是硬件多样性,确实是一个因素。
The chip diversity, the hardware diversity, is definitely one factor.
而且模型本身也在变得越来越多样化,你知道的。
And models are also getting more diverse, you know.
如果你想想,比如一年前,NVIDIA 只发布了几个系列的开源模型,但现在他们每个月都会在不同领域发布许多开源模型。
If you think about, for example, NVIDIA: a year ago, I think they had only released a few series of open source models, but now they're releasing many open source models, like, every month, in different domains.
对吧?
Right?
有些用于视频,有些用于机器人,有些用于语言。
Some are for video, some for robotics, some for language.
而且,这种开源趋势正在不断扩展,人们正在许多不同领域训练各种各样的模型,并且每个月都发布出来。
And, yeah, this open sourcing trend keeps expanding: people are training many different kinds of models in many different domains and releasing them, like, every month.
所以存在模型的多样性。
So there's model diversity.
即使只是文本模型,它们都是基于 Transformer 的,但它们的详细架构仍然非常多样。
And even just for text models: they're all Transformer-based, but their detailed architectures are still very diverse.
而且我们甚至看到它们正在进一步分化。
And we even see them diverging further.
比如,DeepSeek 3.2 使用了稀疏注意力,也就是所谓的稀疏注意力机制。
Like, say, DeepSeek 3.2 was using something called sparse attention.
但像 Qwen 和 Kimi 这样的模型则在探索线性注意力,这是一种不同的注意力机制,它们在内存管理方面也有不同的方式。
But, say, Qwen and Kimi are exploring linear attention, which is a different attention mechanism, and they have different ways to manage the memory.
因此,这种模型架构的分化正变得越来越显著。
So, yeah, this divergence in model architecture is also getting more significant.
那么,这些功能,比如实现稀疏注意力,是否由你们,也就是 vLLM 来负责开发并提供给模型使用呢?
And so is it up to you, meaning vLLM, to implement all of these, to implement sparse attention, for instance, so that it's available for the models to use?
是的。
Yeah.
我们确实主要依赖开源社区。
We basically leverage the open source community, definitely.
因为我们与这些模型厂商合作,经常能得到它们的帮助。
Because we collaborate with these model vendors, we often get help from them.
他们基本上提供了一些内核,或者至少是这些新型操作的参考实现。
They basically provide some kernels, or at least reference implementations, of these new kinds of operations.
是的,我们的工作通常是利用这种合作,使这些技术更加成熟,并适用于更多样化的环境。
And, yeah, our job is often basically to leverage this collaboration, making these implementations more mature and available for more diverse environments.
我记得在开源模型早期,曾经有一些标准化。
I remember early on in open source models, there was some standardization.
比如,大家都使用LAMA。
Like, everyone was kind of using LAMA.
我认为大家当时都使用相同的分词器、相同的输入格式,还有结束标记之类的。
I think everyone was using sort of the same tokenizer, the same input format, the same end-of-stream token, and stuff like that.
现在还是这样吗?还是说每个提供商现在都不一样了?
Is that still the case, or is it, like, is it different for each provider now?
是的。
It is.
对。
Yeah.
过去几年,甚至最近两年,情况已经大不相同了。
It has diverged quite a bit over the last few years, maybe the last couple of years.
是的。
Yeah.
对。
Yeah.
有一点是,模型架构本身已经发生了很大变化,尤其是在注意力机制方面。
One thing is that the model architecture itself has changed a lot, you know, especially on the attention side.
甚至在输入输出处理上也是如此,因为不同的实验室都有自己独特的方式,来构建对话和工具调用等。
And also even for input and output processing, because different labs have their own ways to form the conversation and the tool calls for their own models.
所以现在,这方面已经分化得相当明显了。
So now, like, this has been diverging quite a bit.
我明白了。
I see.
好的。
Okay.
所以是模型规模、模型多样性和硬件部署场景,然后你提到的第三点是智能体,它也让这个问题随时间变得更难。
So scale of models, diversity of models and hardware deployment scenarios, and then agents were the third thing you mentioned making it harder over time.
是的。
Yeah.
是的。
Yeah.
你知道,对于智能体来说,除了推理引擎之外,我们实际上还需要搭建一整套新的环境、一整套新的基础设施,来支持所有的工具调用,以及所有多智能体的功能。
You know, for agents, beyond just the inference engine, we also need to set up a whole new environment, actually a whole new infrastructure, to support all the tool calling and all the multi-agent things.
对。
Yeah.
这部分正逐渐成为整个推理领域一个新兴的挑战。
Like, that part is becoming a kind of new, emerging challenge for inference as a whole.
你认为这意味着随着时间推移,推理层需要管理的状态会越来越多吗?
Do you think this means there will be more state managed in the inference layer over time?
正如之前那样,传统模式一直是文本输入、文本输出,单次请求响应。
As before, the paradigm has been text in, text out, and then just a single request and response.
但随着我们进入智能体的时代,我们看到多轮对话正演变为成百上千轮的交互。
But as we evolve into the year, and the decade, of agents, we're seeing multi-turn conversations turning into hundreds and thousands of turns.
这些场景还涉及外部工具使用,比如与沙箱交互、执行网络搜索、运行Python脚本或其他编程语言,并实现这种长期迭代过程,其中大语言模型参与其中,同时外部环境交互也必不可少。
And these turns also involve external tool use, like interacting with a sandbox, performing web searches, running Python scripts or any programming language, in a long iterative process where the LLM is involved but external environment interaction is involved too.
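上面描述的智能体循环可以用一个简化草图来表示(`call_llm` 和这些工具函数都是假设的占位实现,并非任何真实API):
The agentic loop described above can be sketched roughly as follows (`call_llm` and the tool functions are hypothetical stand-ins, not any real API):

```python
# Sketch of an agent loop: the LLM and external tools alternate for
# many turns. The stub LLM below always answers "done" immediately;
# a real system would call an inference server and real tools.

def call_llm(messages):
    # Stand-in for an inference call; returns the tool to invoke
    # (or None when the model decides the task is finished).
    return {"tool": None, "content": "done"}

TOOLS = {
    "search": lambda q: f"results for {q}",     # could take seconds
    "python": lambda code: f"ran {code}",       # could take minutes
}

def agent_loop(task: str, max_turns: int = 100) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):            # hundreds of turns in practice
        reply = call_llm(messages)
        if reply["tool"] is None:         # model considers itself done
            return reply["content"]
        result = TOOLS[reply["tool"]](reply["content"])  # external env
        messages.append({"role": "tool", "content": result})
    return "max turns reached"

print(agent_loop("summarize this repo"))  # -> done
```

The point of the sketch is the shape of the loop: between any two LLM calls there can be an arbitrarily long pause in the external environment, which is exactly what disrupts the inference engine's assumptions discussed next.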
这真正引发了一场大规模的智能体架构与推理架构协同优化的浪潮。
And this really kicked off a huge wave of co-optimizing agentic architecture with inference architecture.
举个例子,对于vLLM来说,理解对话是否仍在继续至关重要。
Just to give an example: it is very important for vLLM to understand whether or not the conversation is still happening.
如果对话已经结束,我们就可以清除KV缓存。
If the conversation is no longer happening, we can remove the KV cache.
这是与每个文本流、每个补全流相关联的持久化状态。
That is the persistent state associated with each text stream, each completion stream.
但在智能体应用场景中,你其实并不知道智能体是否认为任务已经完成。而之前的交互只是人类在文本框中输入文字。
But in agentic use cases, you actually don't know whether the agent thinks it has finished. And the interaction previously was just a human typing in a text box.
但现在变成了与外部环境的交互。
But now it becomes external environment interaction.
可能只需要一秒钟就能完成一个脚本。
It could be one second just for a single script to finish.
可能需要十秒钟来完成一次搜索或复杂的分析。
It could be ten seconds for a search or a complex analysis to finish.
如果有人参与其中,甚至可能需要几分钟、几小时。
And then it could also be minutes, hours if there's humans in the loop.
面对这种不确定性,我们甚至无法预知请求何时会返回,缓存访问模式和淘汰模式的统一性也因此被新范式严重打乱。
Now with that uncertainty, we actually don't even know when the request is going to come back, and the uniformity of the cache access and eviction patterns got pretty disrupted by the new paradigm.
我明白了。
I see.
我明白了。
I see.
因此你必须更智能地管理缓存,这是其中一个例子。
And so you have to be much smarter about how you manage the cache as one example of that.
是的。
Yeah.
明白了。
Gotcha.
明白了。
Gotcha.
这就像是计算机科学中无法解决的问题之一。
Which is one of the, like, unsolvable problems in computer science.
缓存失效问题。
Cache invalidation.
没错。
Yeah.
正是这样。
So exactly.
所以我能理解,随着时间推移,这会变得越来越难。
So I can see how that would get harder over time.
是的。
Yeah.
我想我知道答案,但你们是否非常支持开源AI,而不是闭源AI?
I think I know the answer to this, but are you guys big believers in open source AI compared to closed source?
你能解释一下,你是怎么看待这个问题的吗?
And can you just explain, like, how you think about that?
我们确实非常支持开源。
We're definitely big believers in open source.
我们相信,多样性终将胜过任何单一的东西。
What we believe is that diversity will triumph over any single anything.
因此,我们支持模型的多样性、芯片架构的多样性。
So that means we believe in diversity in models, diversity in chip architectures.
从根本上说,这是因为世界是复杂的。
Fundamentally, this is because the world is complex.
在应用中,你需要为特定的使用场景找到并定制合适的模型架构与芯片架构。
For your application, you're going to need to find and tailor the right model architecture to the right chip architecture for your exact use cases.
而促进多样性并改进这一点的最佳方式就是开源,因为开源让所有人都能了解彼此的进展,并能在共同基础上提出自己的见解。
And the best way to promote diversity and improve things is through open source, because with open source everybody knows where everybody else is up to and can make their opinionated take based on common ground.
最后,如果你回顾计算机科学的历史:操作系统、集群管理器、数据库,每一个系统领域都是在开始拥有共同标准之后变得更好的,大家在彼此的基础上稍作偏离、相互创新,而不是追随某条专有的、由单一来源控制的路线。
And finally, if you look at the history of computer science, at operating systems, cluster managers, databases, every single systems field got better once it had a common standard and everybody deviated a little bit and innovated on top of each other, versus following a single proprietary, single-source-controlled line.
我明白了。
I see.
这非常有趣。
That's very interesting.
所以你的意思是,OpenAI会针对他们的用例(比如ChatGPT或其他应用)非常紧密地优化他们的技术栈。
So you're almost saying OpenAI will tune their stack very tightly for their use case, which is ChatGPT or whatever other apps they're running.
对于企业或另一家科技公司来说,如果我想要达到同样的优化水平,我不能仅仅使用现成的闭源模型,因为我无法完全控制整个技术栈,而且栈中的不同参与者可能不会那么专注。
For an enterprise or another tech company, if I want that same level of tuning, I can't just use off-the-shelf closed source models, because I don't control the whole stack and the different participants in the stack aren't really paying attention to my use case.
是的。
Yeah.
当然。
Of course.
一部分是数据。
One part is data.
另一部分是模型架构本身,它会影响性能。
One part is the model architecture itself, which will impact the performance.
仅就模型架构而言,你希望模型有多智能?
And then just on the model architecture itself: how smart do you want the model to be?
你希望模型能够处理数百万个token的上下文,还是较短的上下文就完全足够?
Do you want the model to be able to handle millions of token context or just shorter context is totally fine?
然后你还需要将模型针对你的具体计算架构进行优化。
And then you also need to specialize that model to your exact compute architecture.
你使用的是什么芯片?
What chip are you using?
例如,为NVIDIA设计的H100芯片的模型,与为B200芯片设计的模型非常不同。
For example, with NVIDIA, the model you design for an H100 chip is very different from one for a B200 chip.
而对于GB200 NVL 72系统来说,情况又完全不同。
And then it is very different for GB200 NVL 72 system.
再比如,与为TPU设计的模型架构相比,那也是截然不同的。
And compared to, for example, the model architecture you design for a TPU, again, that is also drastically different.
再把它用于视觉模型、视频生成,以及推理、数学和编程,最终当我们审视这些垂直整合的技术栈时,不禁感叹:哇。
And then using it for vision models, video generation, and for reasoning, math, and coding, in the end, when we look at the vertical stack integration, we're like: wow.
它们彼此之间差异巨大。
They're so much different from each other.
我明白了。
I see.
这说得通。
That that makes sense.
你能分享一些关于实际部署vLLM的有趣或重要的案例吗?
Can you share any stories about live vLLM deployments that you thought were particularly interesting or important?
我有几个例子。
I have a few.
一个是,我认为在2024年左右,我们得知亚马逊正在使用vLLM来驱动他们的Rufus助手机器人,这让我们所有人都感到非常惊讶:一方面我们当然相信vLLM可以大规模部署,但看到如此大规模的全球电商将其作为首页功能使用,还是令人震惊。
One is, I think around 2024, we learned that Amazon was running vLLM to power their Rufus assistant bot, which was really surprising to all of us: of course we believed vLLM could be deployed at scale, but seeing it at this massive scale, a global ecommerce company deploying it as a front-page feature, was striking.
这意味着,当每个人打开应用、点击机器人建议,甚至输入搜索查询时,背后都在通过vLLM处理。
That means when everybody opens the app and clicks the bot's suggestion, or even enters a search query, it's going through vLLM.
这在某种程度上是一种前所未有的神奇体验。
And this is kind of the first sort of magical experience in a way.
其中一次体验让我们惊叹:哇。
One of those first experiences was: wow.
我的购买行为此刻正在通过vLLM处理。
My purchase is going through vLLM right now.
这既令人兴奋,又让人有些害怕。
It's kind of exciting, but also scary.
那时候你们还是博士生。
You were, like, PhD students at the time.
是的。
Yeah.
不仅如此,在亚马逊、领英以及所有主要的vLLM部署中,我们都惊讶地发现,他们总是最早采用前沿功能的团队。
Also, across not just Amazon but LinkedIn and every major deployment of vLLM, we were surprised to find out they're always the first adopters of cutting-edge features.
我见过的一个vLLM部署案例来自Character AI:当时我们刚把用于推测解码的n-gram推测功能以单个PR(拉取请求)的形式发布在vLLM中,甚至还没有合并。
So one example I've seen of a vLLM deployment was at Character AI, when we first made n-gram speculation for speculative decoding available as just a single PR, a pull request, in vLLM, not even merged.
而在我们还在迭代这个功能时,我听到Character AI的人说,哦,实际上,我们已经根据你们这个功能的首个版本,将其部署到了数百个GPU上。
And while we were still iterating on that feature, I heard someone from Character AI say: oh, actually, we already rolled it out to hundreds of GPUs at scale, given just your first iteration of this feature.
所以,大家都始终站在vLLM的前沿,这让我们非常兴奋。
So pretty much everybody is staying on the cutting edge of vLLM, and we're quite excited about that.
是的。
Yeah.
好的。
Okay.
那我们来聊聊公司吧,Inferact?
Should we talk about the company then, Inferact?
Inferact是什么?你们为什么决定创办这家公司?
What is Inferact, and why did you guys decide to start the company?
Inferact由vLLM项目的创建者和维护者创立。
So Inferact was created by the creators and maintainers of the vLLM project.
我们的目标是让vLLM成为世界级的推理引擎,真正推动开源领域的能力发展,并构建一个通用的推理层。
Our goal is to make vLLM the world's inference engine, really push the capabilities on the open source front, and then build a universal inference layer.
这意味着我们将拥有运行时环境,能够为任何新硬件上的新模型提供动力,适用于新应用,能够实现极致效率并支持所有未来的AI工作负载。
That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor it to extreme efficiency, and support all AI workloads going forward.
在你刚才所说的内容中,隐含着你们正在为开源项目投入大量资源。
And implicit in what you just said is that you're devoting a lot of resources, I think, to the open source project.
我想确认一下,是这样吗?
Could you I guess, is that right?
你能详细说明一下吗?
Can you expand on that?
是的。
Yeah.
我们坚信,开源,尤其是vLLM自身的架构方式,对全球AI基础设施至关重要。
One thing we believe, and I fundamentally believe, is that open source, especially how vLLM itself is structured, is critical to the AI infrastructure of the world.
我们希望通过Inferact来支持、维护、引导并推动开源生态系统的发展。
And what we want to do with Inferact is to support, maintain, steward, and push forward the open source ecosystem.
只有当vLLM成为标准,并帮助每个人实现他们的目标时,我们的公司才真正具有意义,并能支持周围的所有人。
It is only when vLLM becomes the standard and helps everybody achieve what they need to do that our company, in a sense, has the right meaning and is able to support everybody around it.
因此,开源绝对是目前我们公司的首要任务,事实上,有时甚至是唯一优先事项。
So open source is definitely number one and, in fact, sometimes the only priority of our company right now.
是的。
Yeah.
顺便说一句,你不该告诉你的投资者。
You're not supposed to tell your investors, by the way.
我们确实相信,开源项目某种程度上是一种秘密武器,因为有这个社区共同为开源努力,我们的执行力超越了任何单一实体所能达到的水平。
We do believe the open source project is also kind of a secret weapon, in the sense that with this community all working together on the open source, we have execution beyond what any single entity can have.
我们一遍又一遍地听到这样的说法:人们告诉我们,他们根本跟不上vLLM的发展,所以才选择直接使用vLLM。
This is a theme we've heard over and over again: people just tell us they cannot keep up with vLLM, so that's why they're using vLLM.
我们有自己的内部团队,会维护一个内部分支,也会有自己的内部推理引擎。
We have our internal team, our internal fork, our own internal inference engine.
但开源发展得太快了,唯一能保持领先的方法就是采纳它。
But open source moves so fast that the only way to stay ahead is to adopt it.
这就是我们想要实现的目标。
And that's what we want to make happen.
事实上,这正是我们全力投入开源的原因。
And in fact, this is exactly why we're staying all in on open source.
这太棒了。
That's awesome.
我们之前提到过Ion Stoica,他显然是Databricks的创始人之一。
We mentioned Ion Stoica before, obviously one of the founders of Databricks.
他曾是你们两位在加州大学伯克利分校的博士导师,而且他也会参与Inferact。
He was, I think, both of your PhD advisers at Berkeley, and he's going to be involved in Inferact too.
你能谈谈他将如何参与这家公司吗?
Can you talk about maybe a little bit how he's going to be involved in this company?
更重要的是,作为他的学生,你们从他那里学到了哪些关于创业、分布式系统等方面的经验?
And even more importantly, what have you guys learned from him, as his students, about startups, distributed systems, and all this stuff?
当然。
Sure.
是的。
Yeah.
对。
Yeah.
你说得完全对。
You're exactly right.
Ion 是我们两人共同的导师。
Ion was the adviser for both of us.
我从2017年就开始和Ion合作,当时我还是本科生,正在做我的第一个模型服务方向的开源项目,后来又在Anyscale与他共事,做了我的第二个服务方向的开源项目。
I have actually worked with Ion since 2017, since I was an undergrad working on my first open source project for serving, and then worked with him at Anyscale on my second open source project for serving.
沉迷于伯克利系的开源AI服务公司了。
Addicted to, like, Berkeley-based open source AI serving companies.
是的。
Yeah.
无论在公司内外,Ion 的参与都很深:在公司层面,他将是联合创始人;而在开源项目层面,自项目启动以来他一直在提供指导。
Ion is quite involved both at the company and beyond: as a company, he will be a cofounder, and as an open source project, he has been advising it since its inception.
Ion 对开源项目、学术项目以及行业研究趋势都了如指掌。
Ion knows open source projects, academic projects, and industry research trends inside and out.
从我们的合作来看,Ion 既帮助我们清晰地理解了将开源项目推进到企业最终落地的全部经验教训,也帮助我们了解研究领域正在发生的真实进展。
From what we're working on together, Ion really helps us both to clearly understand all the lessons learned about bringing open source through the final miles of adoption in companies and enterprises, and to see what is actually happening in the research world.
过去几年,Sky Computing 实验室推出了令人惊叹的基础设施和新的研究理念,而 Ion 仍在持续探索这一领域的新前沿。
The Sky Computing Lab has produced amazing infrastructure and new research ideas over the last few years, and Ion continues to explore new frontiers on that front.
我们非常期待能听到这些进展,并共同在开源领域进行创新。
And we're quite excited to hear about that and to innovate on the open source together.
是的。
Yeah.
他还帮助我们招聘,参与了我们所有的招聘流程。
And he also helps a lot with recruiting; he's involved in all of our hiring processes.
他基本上告诉我们,如何识别人才,以及去哪里寻找人才。
He basically teaches us how to recognize talent and where to find it.
这些都极其有帮助。
These are all amazingly helpful.
关于这个话题,你们现在需要解决哪些主要问题?你们在招聘什么样的人来帮助你们解决?
So on that topic: what are some of the big problems you need to solve now, and what type of people are you hiring to help you solve them?
毫无疑问,大规模推理是这个领域最大的挑战之一,不仅对我们如此,对整个行业都是如此。
Definitely, you know, inference at scale is one of the biggest challenges in the field, I think, not only for us, but for the field overall.
因此,我们正在努力招聘更多经验非常丰富的机器学习基础设施工程师。比如,如何充分利用整个GB200、GB300 NVL72机架来运行超大规模开源模型,我认为这仍然是一个待解的问题。
So we are trying to hire more very experienced ML infra engineers overall. For example, what would be the best way to utilize an entire GB200 or GB300 NVL72 rack for a giant open source model? I think that's still an open problem.
学术界和产业界确实有一些探索,但我认为仍有很大的改进空间。
There are definitely some endeavors in academia and industry, but I think there's still room for improvement.
所以,这目前是我们关注的重点之一。
So, yeah, that's one of our focuses at the moment.
从计算机科学的角度来看,这是我的观点。
Here's my pitch from a computer science point of view.
其实很少有人问我这个问题。
It's pretty rare that people ask me this question.
如果你在一家垂直整合的公司工作,这家公司拥有端到端的产品,比如聊天机器人或助手,那你就是在处理问题的垂直切片。
That is, if you're working at a vertically integrated company that has an end product, say a chatbot or an assistant, you are working on a vertical slice of the problem.
在Inferact,你将致力于一个横向的抽象层。
At Inferact, you will be working on an abstraction, a horizontal layer.
这类似于操作系统、数据库,以及多年来人们构建的各种抽象层。
And this is similar to operating system, databases, and different kinds of abstraction that people have built over the years.
操作系统抽象了CPU和内存,数据库和文件系统抽象了存储设备和网络。
Operating systems abstracted the CPU and memory; databases and file systems abstracted storage devices and networking.
对于加速计算,出现了一种全新的物理设备,而推理框架为推理类工作负载抽象掉了其中很大一部分。
For accelerated computing, there's a brand new physical device, and the inference framework abstracts away a large part of it for inference-specific workloads.
当然也包括训练,但我们的专注点完全在推理上。
Of course there's training too, but our singular focus is on inference.
这需要一个软件层,为模型抽象掉GPU和专用计算设备。
And this necessitates a software layer that abstracts away GPUs and accelerated computing devices for models.
在我看来,这与操作系统和数据库领域的抽象构建同样重要,这两个领域也是我们读博时非常热衷的。
And this, from my point of view, is as important as the abstractions the community built for operating systems and databases, both fields we were really passionate about when we were PhD students, too.
因此,机器学习系统本质上是一项新的系统研究和系统部署。
So that's why ML systems is fundamentally a new field of systems research and systems deployment.
所以,你在Inferact将致力于这个层面:它不是垂直切片,而是一个基础运行时,将影响所有未来运行在加速计算设备上的软件。
So you, here at Inferact, will be working on this layer that is not a vertical slice but a fundamental runtime, impacting all future generations of software that will run on accelerated computing devices.
你的工作将涵盖与不同模型合作、与不同应用对接,同时理解不同芯片以及整个集成数据中心系统的优缺点,从而能够判断出,哦,实际上对于这些情况我们应该这样构建抽象层。
And your work will span working with different models and different applications, as well as understanding the pros and cons of different chips and whole integrated data center systems, so you can figure out: oh, actually, for these cases we should build the abstraction this way.
我们会不断移除抽象层、打破抽象层并反复重建,就像操作系统和数据库随着我们掌握的新信息而不断演进创新一样。
And we'll constantly remove abstractions, break abstractions, and build them over and over again, just like operating systems and databases got innovated on over time with the new information at hand.
因此,来到这里你将持续实践构建实际广泛部署的生产系统,这类系统将处于推理技术的前沿。
So you would come here for the constant exercise of building actual, widely deployed production systems at the frontier of inference.
这就是你所说的通用推理层吗?
And this is what you call universal inference layer?
是的
Yeah.
从某种意义上说,这个说法是故意保持模糊的,但我们真正专注的,是从PagedAttention、从服务系统,走向智能所需的整个运行时系统。
It's purposely vague in a way, but what we really focus on is going from PagedAttention, from the serving system, to the whole runtime you need for intelligence.
Woosuk、Simon,非常感谢你们今天的到来。
Woosuk, Simon, thank you so much for being here today.
很高兴你们能做客我们的播客,当然,也很高兴我们能在公司里一起合作。
Thrilled to have you on the podcast, of course, and we're thrilled to be, you know, working together in the company.
感觉已经好几年了。
It feels like it's been a few years.
我们早就开始合作了,但很高兴你们能来,也祝贺你们取得了出色的开局。
We've already been working together, but, yeah, great to have you here, and congratulations on getting off to a great start.
谢谢你们邀请我们。
Thank you for having us.
是的。
Yep.
谢谢。
Thank you.
感谢您收听 a16 z 播客。
Thanks for listening to the a16z podcast.
如果您喜欢本期节目,请在 ratethispodcast.com/a16z 留下评价。
If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z.
我们还有更多精彩的对话即将呈现。
We've got more great conversations coming your way.
下次再见。
See you next time.
提醒一下,本内容仅作信息参考,不应被视为法律、商业、税务或投资建议,也不应用于评估任何投资或证券,且并非面向任何 A16Z 基金的投资者或潜在投资者。
As a reminder, the content here is for informational purposes only, should not be taken as legal business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund.
请注意,A16Z 及其关联方可能仍持有本播客中讨论的公司的投资。
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
如需更多详情,包括我们投资的链接,请访问 a16z.com/disclosures。
For more details, including a link to our investments, please see a16z.com/disclosures.