You're listening to Gradient Dissent, a show about making machine learning work in the real world, and I'm your host, Lukas Biewald.
Today, I'm talking to Max and Sergei.
Max Jaderberg is the chief AI officer at Isomorphic Labs, and Sergei Yakneen is the CTO.
Isomorphic Labs is a drug discovery company spun out of DeepMind, so we talk about the drug discovery process itself and how it's been impacted by recent advances in deep learning.
I hope you enjoy this conversation.
Well, why don't we start with introductions: could you guys introduce yourselves and your company?
Yeah.
Sounds good.
My name is Sergei.
I'm the CTO here at Isomorphic Labs.
I've been with the company for two and a bit years.
Actually, Max and I started maybe a week or two apart.
Max first, me second, but we feel like we've been here since the very beginning.
I've had a long history in tech, building products for companies in a number of sectors: financial technology, risk management, geospatial software, and e-commerce. I've also spent the past ten-plus years working in health care, focusing on research in the cancer genomics landscape, and then took some of that research into industry at a company called SOPHiA GENETICS, where I was CTO. Now, for the last two-plus years, I've been in the drug discovery space.
Awesome.
And, yeah, I'm Max Jaderberg, the chief AI officer at ISO.
Yeah.
I started with Sergei.
I think there were, what, four or five people in the company when we started.
I've been building out all of the machine learning models, the research, and how we apply those models to drug discovery here.
I was at DeepMind beforehand for seven or eight years, working on a lot of core deep learning, generative modeling, and reinforcement learning, these big challenge domains that DeepMind loves. Before that, my background was in computer vision and deep learning, building some of the first deep neural networks for computer vision, and I had a company in that space as well.
So I guess neither of you is approaching this directly from a bio background, then.
Yeah.
I mean, I would say we're learners somewhere along the journey.
I've been in the healthcare and bio space for about ten years now, but I always feel like this is such a deep field, and there are so many different disciplines as well.
You always feel like you know next to nothing.
You get comfortable with being uncomfortable.
Yeah.
There's a lot of leaning into asking the stupid questions, probably repeatedly, a few months apart, and just trying to really take on board this new science, which is fascinating.
And biology in particular is so deep that I don't think you can ever really feel like you've finished.
But we're quite lucky in that we have amazing colleagues, world-expert chemists and biologists around us, who really help us along the way.
Well, that's great.
So you can empathize with me, and probably most of the audience here, and maybe take us along for the journey.
Maybe the place to start is what drug discovery is at a high level and how ML fits into that process today.
Yeah.
Sounds good.
Do you wanna take it away?
Yeah.
I mean, at a very high level:
What you're trying to do with drug discovery is effectively modulate the pathway of a disease.
So there's a disease and a whole patient population out there that we want to help by designing drugs, and these drugs go into the body and modulate, on some functional level, the process of this disease in the body.
When we think about this at ISO, we're really thinking about designing small molecules, the sort of things that you can take as a pill. These molecules are absorbed into the blood, even into the cells, and attach themselves to proteins, which are the functional building blocks of people. By attaching themselves, they either disrupt or modulate the functional behavior of those proteins and so change the disease state of the person.
And so drug design is all about what exactly those molecules are that are, (a), going to attach themselves to the specific proteins involved in disease, a whole suite of proteins, and, (b), good drugs, in the sense that you can actually take them as a pill, they'll get into the bloodstream, they'll get to the right part of the body, and they won't cause any toxic side effects, for example.
And so what's the typical process of doing drug discovery?
What did it look like twenty years ago?
And what does it look like today?
I think so.
Sorry.
I think it's a great question.
I think one thing that Max picked up on there that's really important is something that our CSO, Miles, often tells us: this is a drug design process more than a drug discovery process, in the sense that we don't find these drugs.
We want to be able to design them specifically to solve a problem, and this is a bit emblematic of the journey that we're trying to go on with Isomorphic Labs as well. Over the last decades, one would rely a lot on deep understanding of chemistry by human experts, a lot of intuition, and some understanding of the disease, but it's very much a trial-and-error process: you might form a hypothesis, based on your experience, that this type of molecule might fit really well into the pocket in a protein where you want it to go.
And then you would go and make those molecules.
You'd synthesize them in a lab, and then you would actually test: does it go there?
Does it bind?
Does it have some functional impact on the disease of interest?
And when you measure it, oftentimes you discover: no.
Actually, it doesn't.
And so drug design is actually a very frustrating discipline, in that there's a high degree of failure even in how this process is performed today.
It takes thirteen-plus years on average for one drug to reach the market.
It costs $3 billion-plus per drug on average.
And the process is really rife with toil and failure.
One of our hopes is, of course, to build a technology that can reason over this space in a much more rational manner, so that it becomes less about trial and error and much more about rational design.
And this is one of the things that excites us the most about what we're doing.
So my impression of ML applied to drug discovery is that it had some sort of early hype cycle, maybe five or ten years ago, where people got excited, then a little bit of disappointment, and now it seems to be reemerging with a lot of enthusiasm.
Do you think my outsider perspective is accurate?
And do you think there's some technology that's driving that feeling?
Yeah.
I think there's something to what you're saying.
There was a first wave of companies, I think, that were really taking this approach of using machine learning to aid drug discovery.
And I think there are two fundamental strains of how you can think about machine learning in this space.
It comes down to this question: are you building local models, or global models that generalize?
I think in the earlier days of machine learning for drug discovery, a lot of the focus was on building local models.
What I mean by a local model is that you take a small amount of data, on the order of thousands of data points, that is closely related to the exact problem you want to solve today or tomorrow, and you train a small model, like a small MLP, or even a random forest or SVM, against that data.
And then you apply that model just in the local region around the data that you already know about.
This can be really helpful if, for example, you know the specific part of molecular space that you're designing around. You've got some initial wet lab data, real experimental data about what works and what doesn't, you train the little model, and you use it to interpolate and extrapolate a little bit around this region of molecular space.
And, sorry.
Before I let you go further, what does that data look like?
I'm trying to picture it: what are the rows and columns here?
Yeah.
Think about the problem of whether a small molecule binds to a particular protein.
In the world of local data, those rows and columns would literally just be the chemical formula of that molecule and the activity of that molecule.
In the highest-throughput case, that could be a one or a zero.
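To make the rows-and-columns picture concrete, here is a minimal sketch of a "local model" in Python. It makes several assumptions that are not from the conversation: molecules are written as SMILES strings, the activity labels are made-up placeholders rather than assay results, character n-grams stand in for a real molecular fingerprint, and a nearest-neighbor vote is swapped in for the MLP, random forest, or SVM mentioned above, purely to keep the example dependency-free.

```python
from collections import Counter

def ngram_features(smiles: str, n: int = 2) -> set[str]:
    """Crude stand-in for a molecular fingerprint: the set of character n-grams."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def tanimoto(a: set[str], b: set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_activity(query: str, table: list[tuple[str, int]], k: int = 3) -> int:
    """Majority vote over the k rows of the table most similar to the query."""
    q = ngram_features(query)
    ranked = sorted(table, key=lambda row: tanimoto(q, ngram_features(row[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical assay table for a single protein: (SMILES, active=1 / inactive=0).
# The labels are invented for illustration only.
assay_table = [
    ("CCO", 0),
    ("CCCO", 0),
    ("CCCCO", 0),
    ("c1ccccc1O", 1),
    ("c1ccccc1N", 1),
    ("c1ccccc1C", 1),
]

print(predict_activity("c1ccccc1CO", assay_table))
```

Because the prediction depends only on similarity to rows already in the table, the model says little about molecules far from that data or about any other protein, which is the limitation described for local models.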
And how is that data collected?
Like, they put the molecule in a test tube and shake it or something?
Yeah.
Yeah.
There are a few forms of shaking, and whole protocols for how you shake and for how long, and it depends on the protein that you're measuring against.
Each protein might have a different set of subtleties in terms of how you set it up or how you measure the activity of this molecule against that protein.
But ultimately, we think about this, for example, as a test tube test, an in vitro test.
Got it.
But of course, there's a...
Sorry.
There are millions of potentially different proteins, and there are many, many more different molecules.
And so oftentimes, if you were building this model for a program, you would literally have the one protein that you're interested in and some small number of potentially related molecules, so the coverage of the space was quite small.
Yeah.
Got it.
So in that world, when you train this model on this small amount of data, it's very specific to just the protein that you're targeting, and even to the bit of molecular space that you have data around, which means it's limited in its utility.
It's limited to that particular drug design program. It might be useful, and in many cases it is, but at the end of the day you can't really walk away with that model and reuse it on the next drug design program, because it's too specific.
So you would have to do that process again on the next program.
Mhmm.
And then I think what we've seen since that first wave, maybe in 2015 or 2016, are actually fundamentally new models.
And I think really emblematic of this are AlphaFold and AlphaFold 2.
These are models that I would call global models, in the sense that they train on as wide a variety of data as possible in molecular space.
As a result, and of course as a result of amazing deep learning research, these models, after training, actually generalize outside the distribution they've been trained on, to completely novel spaces in terms of protein sequences and in terms of the molecules that they're able to predict accurately.
It's these global models that we're very much focused on building at ISO because we believe that training on the most data gives us the best support for any drug design program that might come our way.
Of course, we can specialize these models further if there is specific data, but this really gives us the best starting point, and we're able to use this platform again and again on different drug design programs.
And so, just to make sure I'm following what you're saying, the input here is a molecule and the output is some behavior.
Is that what you're saying?
Yeah.
So an example model would have as input a protein and a small molecule that could be a drug, and the output would be:
Do these things interact?
I see.
And how do you feed a molecule or a protein into a model?
Like, do you actually treat it like a string of chemical structure?
Do you put in the name of the protein?
How does that work?
Yeah.
That's the cool thing about these general models: you can actually just specify these inputs as strings, strings of amino acids or what are called SMILES strings for molecules, for example.
There's a lot of knowledge about the actual structure of these objects.
So although it's just a string, in the neural network, or in the way that you process that string, you can start embedding all of the known structure about these objects, the bonds that are there, and even how they might look in 3D, for it to be input into and featurized by the neural network.
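As a rough illustration of specifying inputs as strings, the sketch below turns a SMILES string and an amino acid sequence into integer token ids. Everything here is an assumption made for illustration: the regular expression covers only a simplified subset of SMILES, the peptide is an arbitrary string of one-letter codes, and the function names are hypothetical rather than taken from any model discussed in the conversation.

```python
import re

# Simplified SMILES token pattern (not a full grammar): bracket atoms first,
# then two-letter halogens, then single letters, ring digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[()=#+\-/\\.@%:~*$]")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, rejecting anything the pattern misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

def tokenize_protein(sequence: str) -> list[str]:
    """One token per residue, using one-letter amino acid codes."""
    return list(sequence)

def build_vocab(token_lists: list[list[str]]) -> dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

mol = tokenize_smiles("CC(=O)Cl")     # acetyl chloride
prot = tokenize_protein("MKTAYIAK")   # arbitrary short peptide, for illustration
vocab = build_vocab([mol, prot])
ids = [vocab[t] for t in mol]
print(mol, ids)
```

The ids would then be mapped to embeddings inside a network. Known structure, such as bonds or 3D geometry, would have to be supplied as additional features on top of this, which the sketch does not attempt.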
And how do you feed that in?
Like, do you actually somehow preserve the graph structure, or is this literally just a long, long string specifying all the aspects of how the molecule looks?
It really depends on the application of the model.
A lot of this is very empirical.
So sometimes it's raw strings.
Sometimes you want to embed the graph structure.
Sometimes it's point clouds in 3D.
Yeah.
It really depends on the application here.
And a lot of the work here, like the rest of deep learning, is very empirically driven.
I guess one thing that's also worth mentioning, Lukas, is that drug design isn't really one machine learning problem.
It's actually a whole wide variety of different problems that call for specific representations of these systems.
And when we think about even the moment when AlphaFold 2 was announced, there was a lot of hype around the idea that maybe AlphaFold 2 had solved drug design and we were done, basically.
And we're not done.
In fact, it's opened the door to really amazing breakthroughs, and it's really fundamentally solved the problem of predicting protein structure.
But we feel there are probably on the order of 10 or so AlphaFold-like challenges that need to be solved in order to resolve this really complex set of scientific questions around how we design a drug that is going to solve the ultimate problem and have a whole bunch of other desirable properties as it's administered.
But I think what I heard you saying earlier is that one of the big changes versus 2015 is that solving one task in the space informs other tasks.
So I'm sort of imagining something like the foundation models that we have in language.
Is that a fair analogy?
Yeah.
I think it's a fair analogy, and this is what we're starting to observe and really work out as a community: what are the foundational modeling problems in chemistry and biology that really translate into lots of downstream performance on lots of different tasks?
Mhmm.
It's a really exciting time for this field, because I think there are a few hints of this bubbling up with things like sequence modeling on proteins, sequence modeling on DNA, on base pairs, and structure modeling with AlphaFold and everything that can be done with that model.
If you just take AlphaFold 2, that's a single supervised learning task: predicting the 3D coordinates of proteins and protein-protein interactions.
But if you look at the depth of publications that use this model, hack on this model, or extend it further, you can see there's so much dark knowledge in this model that allows it to be used for many very useful downstream tasks that aren't directly structure prediction, though structure prediction is clearly a precursor to thinking about these other downstream problems.
But doesn't it kind of require a consistent representation?
I would imagine that if the core model had, say, molecular structure encoded in the way you feed in data, it would probably be hard to feed in data as a point cloud or something. Or are these models multimodal in the same way that GPT-4 is?
A lot of these models operate in sequence space.
So whether they're protein sequence models or AlphaFold itself, the input is a sequence.
But as we know from vision and from NLP, there's lots of opportunity to take a model that's been trained in a single modality and extend it into other modalities as well, even without keeping these trunks frozen.
So I think there can be a lot of creativity around how we use these models and how we leverage them outside of their particular training domain.
I think one thing worth mentioning here as well is that, in biology, there's room for several foundation models, I feel, because there are multiple scales and resolutions at which you can look at these different phenomena.
At the highest resolution, you can look at subatomic, quantum-level effects, and then you can step up to atomic and molecular-level interactions.
One level up from that is modeling what is going on inside cells, or how to model the behavior, the function, or the fate of a particular cell.
But then, of course, cells get organized into tissues, tissues get organized into organs, and then there are whole humans, and even how humans interact with their environment.
So I think we have to be quite specific about which level of resolution we're talking about, and we can build models that reason over that space.
There's also, we feel, a lot of opportunity to then think about how we can integrate and reason over multiple levels of resolution.
But this is very much an open research question.
I guess in the NLP space, which is something I'm more familiar with, people used to talk a lot about higher-order patterns, like semantics.
And even the data you would feed in would typically already be chunked into words and sentences.
Then there's been a general trend to be less opinionated about the data you feed in and to operate at a more granular level, just treating information as a set of tokens with fewer priors.
Is a similar thing happening in the realm of biology as the data size increases?
Yeah.
We're seeing a very similar trend, where more and more of the input representation moves to a lower and lower level of granularity.
I think two examples: one is AlphaFold itself. Moving to the latest AlphaFold model that we've been developing with Google DeepMind, we really think about tokenization of the particular elements, and treating things in that homogeneous way allows more scalable architectures to be used.
And with these, your networks actually extract the high-level structure themselves.
It's the same with a recent model like Evo, which does sequence modeling at base-pair resolution.
Again, putting minimal structure in, together with larger amounts of data, can hopefully allow a model to internally extract these high-level representations, which could then be quite generic and be used for many downstream tasks.
So I think it's a similar trend here.
Are there any differences in the way that you model this kind of data?
Like, are you using transformers?
Is it functionally just like the LLMs that come out, or are there key differences in what you have to do?
Yeah.
The big difference, Lukas, is that the scale of data is many orders of magnitude lower than in something like NLP or computer vision.
So although we want to move in this direction, there's still not enough data, in my opinion, to have completely flat sequence models and extract enough about the world of proteins from just sequence data.
You can get far, and there are many protein sequence models that have done very well.
But I also think there's a lot of opportunity to inject priors, whether physical, biological, chemical, or structural, into the way we represent the data, into the architectures that process it, and also on the loss side: the types of loss functions we use, auxiliary losses, or other modeling tasks associated with this.
So I think that's where the difference is today.
Just because of the quantity of data, you don't quite have enough to say, hey, we'll just stick in a super deep, flat transformer and let it run for a month or two.
At the same time, I want to say that in this space the context length, the input size, can actually be very, very large, depending, obviously, on how much of the system you want to be able to describe.
If we think about a problem where we want to predict something about a clinical outcome, or something like a disease prognosis for a person, one might envision wanting to put in the whole genome, which could be hundreds of gigabytes of data for just the genome. But you may also want to add other supplementary information, such as imaging they may have had done, or different interventions: basically, all of the clinical history.
And so when we think about how to model that, we need to be able to model quite large input sizes.
Even though we've seen quite a lot of progress, for example with the latest version of Gemini on really large context lengths, I think there's still a lot to be done in terms of supporting these kinds of large-scale inputs.
Yeah.
And to your point, though, of course we use transformers, an absolute workhorse of our neural network stack today.
And then the question becomes how we transform these inputs so they are best processed by these generic and scalable neural network modules.
It's the same as with ViT, where you have the question of how to transform images in a way that's congruent with transformers.
And there you've got these patch-based mechanisms, which are fairly straightforward for images.
What's the equivalent for proteins, for structure, for genomes?
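For readers unfamiliar with the patch-based mechanism mentioned for images, here is a minimal sketch of the idea in plain Python: a 2D grid is cut into non-overlapping square patches, and each patch is flattened into one token. The function name and the toy 4x4 "image" are illustrative assumptions. A real ViT would additionally apply a learned linear projection and position embeddings to each flattened patch.

```python
def image_to_patch_tokens(image: list[list[int]], patch: int) -> list[list[int]]:
    """Split an H x W grid into non-overlapping patch x patch tiles, each flattened row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# Toy 4x4 single-channel image whose pixel values are just their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, 2)
print(len(tokens), tokens[0])
```

The open question raised in the conversation is what plays the role of the patch when the input is a protein, a structure, or a genome rather than a regular pixel grid.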
Yeah.
These are all things waiting to be discovered. We're on that path of discovery, but I think it's really early days.
Interesting.
So you don't feel like it's resolved.
I guess I'm imagining that with strings you could use a really similar tokenization strategy, but with a graph structure or a point cloud it's less obvious.
Yeah.
Yeah.
Lots of these things could be natively thought of as graph structures.
There are a lot of people working on graph neural networks for this sort of space. But even then, think about how we would transform a graph into something that could be ingested by a transformer.
And does any of this actually make sense when we really think about the biology, or think about this thing as a molecule rather than just a generic graph?
There are a lot of subtleties.
There are things that work, but I just think that, as a field, we're still quite early in this, and there'll be lots of space to innovate.
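One simple way to picture turning a graph into something a transformer could ingest is to treat each atom as a token and pass the bonds as a mask over which tokens may attend to each other. The sketch below assumes that framing. It is only one of several possible options, it is not described in the conversation as the speakers' method, and the function name is hypothetical.

```python
def graph_to_tokens_and_mask(atoms: list[str], bonds: list[tuple[int, int]]):
    """Return node tokens plus a symmetric mask that is True for self and bonded pairs."""
    n = len(atoms)
    mask = [[i == j for j in range(n)] for i in range(n)]
    for i, j in bonds:
        mask[i][j] = mask[j][i] = True
    return list(atoms), mask

# Heavy atoms of ethanol (C-C-O), with bonds given as index pairs.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
tokens, mask = graph_to_tokens_and_mask(atoms, bonds)
for row in mask:
    print(row)
```

A representation like this keeps connectivity but drops bond types, stereochemistry, and 3D geometry, which is the kind of subtlety that arises when a molecule is treated as a generic graph.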
Even when you think about genomes, right, you could think that, ostensibly, it's a linear sequence.
You can follow a very simple tokenization method.
And indeed, many of the current methods do that, but that may ignore the fact that there are massive amounts of structure in the genome that are not linear.
There is secondary and tertiary structure in how those molecules are packed inside a cell, and there's the question of what the interpretable units of information in the genome actually are.
Oftentimes, even different pieces can swing together for a particular event.
And so I think when we think about data representation, it's important to also inform our tokenization strategy with some understanding of the underlying biology and how these molecules actually behave in the real world.
不过,听你这么说,感觉有点有趣,你听起来就像十年前的NLP研究者。
Although it's kind of funny as I listen to you, you sound like, you know, NLP researchers ten years ago.
从我的角度来看,我觉得分词策略正变得越来越简单,越来越痛苦地丢失那些看似至关重要的先验信息。
I kind of you know, I I sort of feel like the tokenization strategies from my perspective, it get, like, simpler and simpler and and sort of, like, more and more, like, painfully losing, you know, information in priors that seem really important.
所以我很好奇,这最终会如何发展。
So I kind of wonder what, well, how this will play out.
我真希望我们能进入那种状态。
I I'd love for us to be in that regime.
我很感兴趣。
I'm curious.
你知道吗,我记得很久以前那篇关于ImageNet的《自然》论文,当时最令人惊叹的一点就是,用在ImageNet上训练的CNN来检测黑色素瘤。
You know, I remember the Nature paper that came out with ImageNet a long time ago, and I think one of the things that really was just so amazing about it, right, was taking CNNs trained on ImageNet and then applying them to detecting melanoma.
这让我忍不住想,会不会有那么一天,你可以把那些在其他数据丰富领域训练的模型拿来用?
And and it it kinda makes me wonder, like, is there ever a world where you would take, you know, models trained in entirely different, you know, domains where there's tons of data?
这些模型对你们现在的领域会有帮助吗?
Would they have any bearing on the domain that you're at?
是的。
Like yeah.
我觉得多任务应用的发展速度,已经远远超出了任何人最疯狂的想象。
I feel like the the, you know, the the multitask applications just seem to be expanding beyond, you know, kind of anyone's wildest dreams.
你能想象这样一个世界吗?GPT或Gemini竟然对你的工作变得重要?
Like, could you imagine a world where where, yeah, sort of GPT or Gemini becomes relevant for what you do?
对。
Yeah.
我的意思是,这真的很有趣。
I mean, it it's really interesting.
有一些论文,你知道的。
There are these, you know, papers.
我想到了伊戈尔·莫尔达奇(Igor Mordatch)的一篇论文,例如,你拿一个预训练的Transformer大型语言模型,然后可以直接将任意的新问题映射到词元空间,并利用大型语言模型内部的推理能力来处理完全不同的模态。
I'm thinking about an Igor Mordatch paper, for example, where you take a pretrained transformer large language model, and then you can just project arbitrary new problems into token space and just bootstrap off the internal reasoning that happens in a large language model for a completely different modality.
我认为我们从未尝试过从自然语言转向蛋白质语言。
I don't think we've ever tried that, you know, going from natural language into protein language.
但我非常好奇,这种现象是否可能出现,或者,如果你开始用DNA训练一个大型语言模型,那么用于下一个词元预测的某些功能,是否能转化为对蛋白质结构的理解?
But I'm really curious to see if this sort of thing could emerge or, similarly, if you start training a large language model on DNA, does any of that functionality for next-token prediction translate into something when you're thinking about, I don't know, protein structure?
是的。
Yeah.
我的意思是,我认为这是可能的,我们可以在其中获得一些迁移效果。
I mean, I I I think, you know, it's possible, and we can get some some transfer there.
但在我看来,把一个问题解决到足以写出一篇有趣论文的程度(我认为这方面可能有很大的空间),和在现实世界中真正解决这个问题,这两者之间是有区别的。
But, you know, in my mind, there's also a difference between solving a problem enough that you can write an interesting paper about it, and I think there's potentially lots of space for that, and solving a problem so that it's solved in the real world.
对我来说,这里有一个很大的区别。
And, you know, this there there's a big difference for me there.
尤其是在ISO,我们最大的重点是解决现实世界中的问题,而我的经验告诉我,这种直接迁移并不一定容易。
And, you know, especially at ISO, our biggest focus is to solve problems in the real world, and my experience tells me that there isn't necessarily, like, an easy lift and shift like that.
你必须花大量时间真正接触数据和领域,才能获得真正彻底的解决方案。
You need to spend a lot of time actually with the data and with the domain to still be able to get that that real, like, you know, thorough solution.
你知道吗?
You know?
我给你举个例子。
I'll give you as an example.
我之前在Sophia Genetics的经历中,我们基于MRI和CT扫描对肿瘤进行分割。
My experience at Sophia Genetics previously, where we were doing segmentation of tumors on the basis of, like, MRI and CAT scans.
当然,我们尝试了很多从视觉领域过来的常规方法,但事实上我们发现,肿瘤和你通常训练模型所用的其他物体并不一样。
And, of course, we tried a whole bunch of fairly sort of established vanilla methods from vision, but, you know, what we learned actually is tumors are not like a lot of the other objects that you might normally train your model on.
它们不是猫、狗,也不是互联网上那些常见的图像。
They're not, you know, they're not cats and dogs and sort of other images on the Internet.
因此,肿瘤的形态、所占空间具有很大的变异性,比如它可能是高度球形的,但在器官内发展时也可能非常弥散。
And so, you know, the morphology of a tumor, the space it takes, it's it's, you know, it can have a lot of different variability and it can be, for example, very globular, but it can also be very diffuse as it develops in the organ.
所以我们发现,必须构建非常特定的模型,才能达到我愿意使用该模型来指导临床决策这样的性能水平。
And so, you know, we found that we really need to build quite specific models in order to be able to reach that performance where I'd be willing to start using that model to actually, you know, guide clinical decision making, for example.
对吧?
Right?
因此,这正是我想强调的差异:什么足以写成一篇论文,而什么又足以用于医疗设备。
And so this is where I'm kind of going with that difference between what's enough to write a paper about versus what's enough to put into a medical device, for example.
而这些内容中有些非常有趣的地方,其实归结于在新颖科学中实际应用这些模型时的训练问题。
And and so some of this really interestingly comes down to that use in, you know, training these models for actual use in novel science.
正是这种‘新颖科学’的部分很有意思,因为机器学习到底是什么?
And it's that that novel science bit, which is quite interesting because with machine learning, what is machine learning?
简单来说,就是你拿训练数据的分布,然后拟合一个统计模型。
Well, you take your training distribution and you try and fit a statistical model to it.
但当你把这个模型应用到新颖科学中时,从定义上讲,你是在尝试使用它处理训练分布的极尾端,甚至完全超出该分布之外的情况。
And but when you're applying this model to novel science, by definition, you're trying to use this maybe in the very, very tails of that training distribution, maybe even completely outside of that.
当我们想到机器学习的应用时,这通常与自动驾驶汽车等场景非常不同。
And that's quite unusual often when we think about the application of machine learning compared to, for example, self driving cars.
因此,在如何转移或设置训练分布、如何设置模型训练、甚至如何将模型应用于实际的前沿科学研究方面,存在许多细微差别。
And so there's a lot of subtleties in how do you transfer or how do you set up your train distribution, how do you set up your model training, how do you even apply that model downstream to actually do novel science.
你认为你的研究与材料科学之间有重叠吗?
Do you think that there is overlap with what you do in research and material science?
我觉得我们已经听了很多人都在谈论做化学相关的工作。
I feel like we've had a lot of folks, you know, come on here talking about, kind of doing chemistry.
这对你来说有相关性吗?
Like, is that is that relevant to you?
你能从化学领域获取数据集吗?还是说生物学完全是另一个世界?
Are there datasets that you can draw on from that, or is biology just kind of completely different, world?
我认为有很多有趣的相似之处,尤其是我之前提到的,训练这些网络以进行外推和发现新事物。
I think there's a, you know, there's a lot of interesting similarities, especially to my previous point on, you know, training these networks to then extrapolate and and and discover new things.
令人有点烦恼的是,如今无机化学似乎与我们所研究的有机化学是截然不同的领域,问题类型和数据类型都不同。
Somewhat annoyingly today, it feels like that inorganic chemistry is kind of a different world to the organic chemistry that we're working with, the types of problems, types of data are different.
话虽如此,从根本上说,这些系统的物理原理是相同的。
Saying that, fundamentally, you know, physics is the same for these systems.
对吧?
Right?
所以在某种程度上,必然存在某种重叠或可以迁移的东西。
So at some level, there has to be some notion of overlap or something we could transfer.
但我还没看到材料科学与有机化学这两个领域之间有太多交集。
But I I haven't seen much of that intersection between those two fields like material science and, for example, organic chemistry.
我想换个话题,你是如何看待你们领域中的开源问题的?
I guess switching gears a little bit, how do you think about, you know, open source in your world?
我的感觉是,关于这些语言模型和多模态模型,现在有个很大的关于开源的争论。
Like, you know, I feel like there's sort of this big debate around open source with these language models and these multimodal models.
而且我注意到,开始有一些开源模型出现了。
And and I think there is starting to be, like, a little bit of open source, you know, models coming out.
在Isomorphic,你是如何看待这个问题的?
Like, how do how do you think about that at, at Isomorphic?
是的。
Yeah.
我认为开源社区是一个巨大的优势,我们所有人都从中受益。
I guess the open source community, I think, is a is a great benefit, and we've all, you know, we've all benefited from that.
而且我认为许多人也参与其中,所以我觉得这是一个非常好的协作生态系统。
And I think many also participate in it, so I think it's a great working ecosystem.
我认为重要的是要在为这个生态系统做贡献的能力,与开发能够真正维持业务的独特能力之间取得平衡。
I think it's important to basically balance the ability to, you know, to contribute to that ecosystem as well with the ability to develop specific unique capabilities that actually sort of sustain a business.
所以我认为,这正是我们在Isomorphic Labs所做的。
And so I think, you know, that's what we're doing here at Isomorphic Labs.
那数据呢?
What about data?
我的意思是,我认为在这里,数据集可能比人们习惯的要更加专有。
I mean, I would think that, you know, here, the datasets are probably much more proprietary than than folks would be used to.
你认为这种情况会如何发展?
Where where do you think that goes?
我认为公共数据集实际上还有很多可挖掘的价值。
Well, I think there's actually quite a lot to be said for public datasets.
正如我们所见,像AlphaFold这样的工具通过在蛋白质数据库(Protein Data Bank)上训练,已经取得了惊人的成果。
As we've seen, tools like AlphaFold have been able to do amazing things by being trained on the Protein Data Bank.
我希望这句话不会轻视蛋白质数据库,因为它实际上是一个了不起的资源,是累计数千年研究工作的成果。
And, you know, I hope that statement doesn't trivialize the Protein Data Bank, because it's actually an amazing resource, the result of, like, thousands of years, collectively, of research.
所以,我认为仍然有很多值得探讨的地方,当然,任何认真进入这一领域的组织,都会首先考虑这些公共商业数据集,但同时也存在大量独特的数据集。
So, you know, I I think there's still a lot to be said, and, certainly, you know, I would imagine any organization that goes into this space, you know, seriously would want to look at these public commercial datasets at the first instance, but there's, of course, also a lot of unique datasets.
一个值得提出的问题是,那些为其他目的生成的数据。
And, you know, one question to be asked is, you know, data that's been generated for a different purpose.
比如,可以想象,作为药物发现项目一部分而生成的数据。
Like, one might imagine, data that's generated, for example, as part of a drug discovery program.
这些数据对于训练我和马克斯所讨论的这类模型有多大的用处呢?
How useful might that data be for training the type of models that, you know, me and Max have been talking about?
如果你打算构建一个覆盖整个化学空间的通用模型,那么你可能会发现,那些在化学空间中彼此非常接近的数据点,其实际效用是有限的。
If you're thinking about building a general model that works across all of chemical space, you know, you can see potentially limited usefulness in in in data points that are very near to each other in that space, actually.
因此,当我们思考数据时,我们会非常谨慎地考虑数据集的多样性,尽可能覆盖广泛的内容。
And so when we think about data, you know, we we we think very carefully about dataset diversity, being able to cover as much as possible.
当然,某些领域,比如所有类药分子的空间,大约有10的60次方种分子之多。
Of course, some of these spaces, you know, the space of, for example, all drug-like molecules is something like 10 to the power of 60 molecules large.
所以,任何关于覆盖这一空间的设想,我认为都需要很长时间来合成相关数据,但在设计数据集时,这仍然非常重要。
So sort of any notion of coverage of that space, I I I think, you know, will will be a long time synthesizing data for that, but it's still quite important when we think about dataset design.
因此,当我们思考数据时,我们不仅需要利用已有的数据集,还要准备好生成新的数据集,以帮助我们进入人类从未尝试过的领域并取得优异表现。
And so when one thinks about data, we basically need to go both to datasets that already exist but also be ready to generate new datasets that will help us actually, you know, go into and and and get great performance on areas that have never been tried by humans before.
你知道,这很有趣。
You know, it's interesting.
我想我不希望把话题带偏,但我还是得提一下。
I guess I don't wanna take this in a weird direction, but I sort of have to to flag.
我的意思是,10的60次方种可能的药物实际上看起来还挺少的。
I mean, 10 to the 60 possible drugs actually seems quite small.
对吧?
Right?
我的意思是,这实际上比我预期的要小得多。
I mean, that that actually is much smaller than I was expecting.
我的意思是,我觉得你可以把这压缩到30个字符以内。
I mean, I feel like you could compress that, you know, into, like, 30 characters or less.
但我想我现在开始疑惑了:我到底有没有真正理解这里所说的可能药物的空间是什么?
But I I think maybe now I'm think wondering, do I actually even understand what the space of possible drugs is here?
我觉得这非常庞大。
I think it's pretty huge.
是的。
Yeah.
我不确定自己能否理性地理解这个10的60次方的空间,当你想到每一个对象究竟是什么的时候。
I'm not sure I can think about rationalizing it, like, the 10 to the 60 space when, you know, you think about, like, what each one of those objects is.
它是一种独特的原子三维排列,而且不仅仅是三维排列本身,还有这种排列在细胞环境中的复杂性,比如蛋白质,而且这些排列并非仅仅是相同的原子,它们带有不同的电荷,这种精确的局部和全局排列改变了整个三维结构周围的电子密度。
It's this unique three d arrangement of atoms, and it's not just about the three d arrangement, but, you know, that arrangement in complex with the environment in the cells, so the protein, and the arrangement you know, these are not just identical atoms, but these are different charges, you know, that precise both local and global arrangement changes that whole electron density around that three d structure.
对。
Right.
完全正确。
So totally.
但你怎么能把这个缩小到10的60次方呢?
So but how do you get that down to 10 to the 60?
我的意思是,我只是在想象,如果我能随意组合60个原子,那数量早就超过10的60次方了。
I mean, I'm just picturing, you know, if I could string together 60 atoms of choice, like, that would be more than 10 to the 60.
对吧?
Right?
到底是什么限制了这个空间呢?
Like, what actually constrains the space?
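The two back-of-envelope claims in this exchange can be checked with integer arithmetic. The sketch below is only an illustration of the counting argument: `min_symbols` is a hypothetical helper, the alphabet sizes are choices made here, and treating a molecule as a flat string of 60 independent atom picks is a deliberate oversimplification that ignores bonding, valence, and 3D arrangement.

```python
SPACE = 10 ** 60  # the quoted size of drug-like chemical space


def min_symbols(alphabet_size: int, n_items: int = SPACE) -> int:
    """Smallest string length whose alphabet_size**length covers n_items."""
    length, capacity = 0, 1
    while capacity < n_items:
        capacity *= alphabet_size
        length += 1
    return length


# How long an index into a 10**60-item space has to be, per alphabet.
bytes_needed = min_symbols(256)          # raw bytes
printable_ascii_needed = min_symbols(94) # printable ASCII, no space
alphanumeric_needed = min_symbols(36)    # 0-9 plus a-z

# "60 atoms of choice": with k element types per position there are
# k**60 strings, which passes 10**60 as soon as k exceeds 10.
strings_with_10_elements = 10 ** 60
strings_with_12_elements = 12 ** 60
```

Under these assumptions an index fits in 25 bytes or 31 printable ASCII characters, so "30 characters or less" is in the right range, and an unconstrained 60-atom string over more than ten element types does exceed 10 to the 60, which is why the drug-likeness constraints matter.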
是的。
Yeah.
我的意思是,10的60次方确实是个巨大的数字,但你可以这样想:人类已经总结出一些关于药物特性的经验法则,这些规则实际上大大限制了分子的范围。
So, I mean, 10 to the 60 is actually, like, a very huge number, but, you know, the way to think about this: there are some rules, heuristics really, that humans have developed for drug-likeness that basically constrain the set of molecules.
确实,你可以创造出无数种分子,但有一些限制条件,比如分子大小。
Indeed, you can create, you know, infinite numbers of molecules, but there's certain constraints around size, for example.
所以,你知道,对于你的药物起效来说,尤其是小分子药物,它必须能够进入细胞。
So, you know, one of the things that needs to happen for your drug to work, especially if it's a small molecule, is it needs to be able to go into the cell.
因此,细胞只能吸收一定大小的分子。
And so the cell is only going to be able to take in molecules of a certain size.
此外,还有其他问题,比如这些分子需要具有溶解性。
And then there's other issues like these molecules need to be soluble.
这些分子还需要能够穿过细胞膜。
These molecules need to be able to go through membranes.
它们还需要具备其他特性。
They need to have other behaviors.
因此,基本上你可以写出一些启发式规则,将这个空间限制在10的60次方以内。
And so, basically, you can write a few heuristics that constrain that space to 10 to the power of 60.
但事实上,最终你可能需要搜索一个大得多的空间,不过这至少能帮助我们大致思考一下哪些分子空间可能是有用且可行的边界。
But in, you know, in fact, you you may need to search a much larger space in the end, but this is just something that helps us, you know, think a little bit around some boundaries around what the usable sort of useful molecule space might look like.
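The speakers do not name the heuristics they have in mind. One widely cited example of a drug-likeness rule set is Lipinski's rule of five, and the sketch below uses it purely as an assumed stand-in to show what "writing a few heuristics" can look like in code. The `Descriptors` class and function names are hypothetical, the descriptor values are approximate and supplied by hand rather than computed from a structure, and allowing one violation is a common convention rather than something stated in the conversation.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many rule-of-five thresholds a molecule exceeds."""
    checks = [
        d.mol_weight > 500,  # size: large molecules struggle to enter cells
        d.logp > 5,          # lipophilicity: relates to solubility
        d.h_donors > 5,      # polarity: relates to membrane permeability
        d.h_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Accept molecules with at most max_violations rule breaks."""
    return lipinski_violations(d) <= max_violations


# Approximate values for aspirin, for illustration only.
aspirin = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
# An invented oversized, greasy molecule that breaks every threshold.
bulky = Descriptors(mol_weight=900.0, logp=7.0, h_donors=6, h_acceptors=12)
```

Filters like this only draw a rough boundary; as noted in the conversation, the space that actually needs searching may be much larger.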
好的。
Okay.
团队可以决定是否保留这部分内容,但对我而言,这实际上很有启发性。
Well, team could decide if they wanna leave that in, but that was actually kind of enlightening, for me.
谢谢。
Thank you.
你们在Isomorphic公司有志于将药物推进到商业可行阶段,还是会在那之前就停止?
Do you, at Isomorphic have an ambition to take the drugs all the way to commercial viability, or do you kind of stop short of that?
这到底是怎么运作的?
How how does it work?
是的。
Yeah.
这其实是一个非常有趣的研究领域,随着我们在这个领域工作了几年,也越来越深入地了解了它。
It's, it's a fascinating space, actually, and and one that's been, you know, really interesting to learn more about as as we have, you know, worked in this space for a couple of years.
但我认为,药物设计流程有不同的阶段,每个药物资产的价值,本质上对应着该资产中仍存在的风险程度。
But I think it's, there are different stages of that drug design pipeline, and a particular asset has an amount of value that is essentially corresponding to the amount of risk that is inherently left in that asset.
所以当你思考这些阶段时,就像是靶点发现。
So when you think about, you know, the stages, it's sort of like target discovery.
当你确定了一个非常棒的靶点时,它本身就已经具备了一定的价值,接着你会进入命中识别、先导化合物优化等阶段。
And so when you've identified a really cool target, there's already some value in that, and then you go into hit ID sort of, lead optimization stages.
在某个阶段,你将进入临床前研究和临床试验,也就是一期、二期、三期试验。
At some point, you're going to go through preclinical studies, clinical trials, you know, one, two, phases one, two, three.
在每一个阶段,你都会解决并管理一定量的风险,因此你的项目价值也随之提升。
At each point in time, you will have addressed and managed certain amount of risk, and so the value of your program goes up.
所以,并没有一个唯一的答案来决定在哪里停止。
And so there isn't actually one single answer as to where where to stop.
这实际上取决于你自身承担风险的意愿,以及你实际利用它的能力。
It really depends on your own appetite to take on the risk and on your own ability to actually do something with that.
因此,当我们思考应该在哪里停止时,我们创建了Isomorphic实验室,以提供一种独特视角,看看机器学习如何助力药物设计的变革,这就是我们应用的一个视角。
And so when we think about where should one stop, you know, we have created, of course, Isomorphic Labs to bring a unique angle of how machine learning can help transform drug design, and so that's one, you know, lens that we apply.
另一个视角则是,我们愿意走多远的流程,以最大化每个分子的价值。
And then the other one is basically how much of sort of that process are we willing to go through in order to maximize the value of each molecule.
因此,在优化我们的产品组合时,我们积极评估:我们认为这个项目应该推进到哪个阶段,这才是该资产的最佳定位。
And so as we rationalize our portfolio, we're basically working very actively to say, well, we think this needs to go over here, and that's a sweet spot for that particular asset.
对于这个项目,我们应该能与某方合作,他们显然更有能力执行临床试验阶段等工作。
And, you know, for this one over here, we should be able to partner with somebody who we think will be much better able to actually execute, for example, the clinical trial stage of that.
因此,与他们合作推进这个分子是个很好的想法。
And so that's a great idea to work with them on that molecule.
很有趣。
Interesting.
你们会自己开展临床试验吗?
Do you would you do your own clinical trials?
这不会偏离你们的核心能力吗?或者,有没有可能让机器学习在其中发挥作用?
Wouldn't that be quite a departure from your core competency, or is there, like, ways that ML could could play a role there?
我觉得机器学习在这一领域大有可为,我们也在思考一些具体的应用方向。
I see lots of opportunities for ML to play a role in that, and, you know, there are ones that we're thinking about as well.
在我看来,最明显的是,实际上存在两个问题。
To me, you know, the most obvious is, you know, when when you have a there's there's kind of two problems.
一个是你要确定一种疾病,然后为患有这种疾病的患者找到合适的分子。
One is you you want to identify a disease, and then you want to find the right molecule for patients with that disease.
但这个问题的另一面是,当你设计出一个分子后,你需要找到最适合这个分子的患者。
But there's also sort of the other side of that problem, which is when you've designed a molecule, you want to find the best patients for that.
这部分问题在药物开发的临床试验阶段部分得到了解决,但即使在分子上市后也是如此。
And so part of that is solved within the clinical trial sort of phase of of of drug development, but but even after that molecule's on the market.
我们知道,像伴随诊断这样的工具就属于这一类。
You know, we have things like companion diagnostics.
我该如何找到具有正确基因组特征的患者,使他们能从这种药物中获得最佳疗效?
How can I find patients with the right, for example, genomic signature to be best fit to benefit from this medication?
因此,我们有很多机会应用机器学习,思考例如一个人的遗传学如何影响其疾病类型,或者我们如何基于分子特征更好地对疾病进行分类。
And so there's plenty of opportunities for us to apply machine learning and thinking about, for example, how does the genetics of of someone influence, you know, the type of disease that they have, or how can we better classify diseases based on molecular signatures.
因此,我们认为这些都是绝佳的机会,可以帮助我们优化临床试验等环节。
And so, you know, we see all of these as great opportunities for us to be able to optimize things like clinical trials.
如果你想想这些数字,卢卡斯,基本上进入临床试验的分子有90%最终都会失败。
If you think about the numbers, Lukas, basically, 90% of molecules that enter clinical trials actually fail.
所以,这些数字并不理想,而这些正是导致药物研发成本高昂、周期漫长的部分原因。
So, you know, these are not great numbers, and these are, you know, some of the ones that are behind what we're seeing as these, like, really high cost, really long time frames for bringing these molecules to market.
所以这些看起来确实是非常值得去争取的绝佳机会,我认为在这个领域工作有很多值得探讨的地方。
And so they seem like really amazing opportunities to actually be able to go after, and so I think there's, you know, a lot to be said for working in that area.
我想从外部来看,可能是个傻问题,但为什么有这么多分子会失败呢?
And I guess dumb question from the outside, but why do so many molecules fail?
我的意思是,你描述的任务听起来似乎挺简单的。
I mean, you you described, like, a pretty simple sounding task.
比如,好吧。
Like, okay.
它会不会结合到某个蛋白质上?
Like, does it, like, you know, bind to a protein or not?
这就是计划。
And, like, that's the plan.
我猜在把任何分子投入活体生物之前,你应该就已经知道它是否能结合到蛋白质上了。
Presumably, you'd know if it binds to the protein before you ever, you know, put it in any living organism.
我的意思是,是不是因为药物最终出现了意想不到的副作用,或者中间还发生了别的什么情况?
I I would think I mean, is it because the drugs turn out to have, like, unanticipated side effects, or or what, what happens along the way?
是的。
Yeah.
我的意思是,确实有副作用,但回到这一点上。
I mean, there's the side effects, but, you know, just back to this point.
好的。
Okay.
当然,你会测试这种药物是否与蛋白质结合,但也许这仅仅是在试管中的孤立环境下的结果。
Of course, you you test that this drug binds to the the protein, but maybe that's just in isolation in a test tube.
现在当它进入细胞时,它真的能进入细胞并与蛋白质结合吗?
Now when it's in a cell, does it actually get into the cell and and bind to the protein?
好的。
Okay.
即便如此,这种结合是否会产生你假设的生物学效应,比如这种结合事件会对整个信号通路产生影响,而这种蛋白质可能是该通路的一部分?
Even then, does that cause the functional effect that you hypothesized this would cause, that this binding event would cause on this whole signaling pathway, for example, that this protein might be part of.
你可能对这种生物学机制的理解本身就是错的,所以,你对信号通路的假设也可能是错误的。
You might have just had this biology wrong, so, you know, even your hypothesis of the signaling pathway might be wrong.
你可能在针对一个完全错误的蛋白质,因此,当你一开始在细胞中测量所有这些疗效时,可能看起来不错。
You might be targeting completely the wrong protein, so that, you know, when you start, you can measure all this efficacy in cells even.
但当你去观察它对疾病的影响时,却根本看不到任何效果。
And when you go into looking at the effect on the disease, you don't see any effect there.
这还只是在你开始考虑你刚才提到的毒性问题之前——比如这种分子如何分解,以及它对身体可能产生的影响。
And that's before you even begin to look at all the toxicity as you were talking about, just how this molecule breaks down, what effect that can have on the body.
但同时,你知道,这些分子往往——实际上更常见的是——它们并不只作用于一个蛋白质靶点,而是会同时影响一系列靶点。
But also, you know, sometimes these these molecules and more often than not, they don't just hit one protein target, they'll actually end up hitting a whole range of targets.
那么,这些完全非预期靶点被影响后,可能会带来一些非常严重的后果。
And so what's the effect of that hitting of completely unintended targets that can have some very serious, you know, consequences.
所以,是的,这个研发流程中有很多环节都可能出错。
So, yeah, there's there's quite a lot that can go wrong, in this pipeline.
当我们讨论这些时,你可以看到,我们还有很多机会去更深入地理解并建模这个微观世界,理解疾病生物学,以及所有这些因素是如何相互关联的。
And, you know, as we talk about this, you can see that there's lots and lots of opportunities to understand and model more and more about this microscopic world, about disease biology, how all of these things connect together.
随着我们对这个拼图中不同部分的理解不断加深,当我们设计分子并选择靶点时,就能以越来越高的信心来预测,我们的假设确实能产生针对目标疾病所期望的效果。
And as we understand more about the different pieces of this puzzle, that means that when we go to design a molecule and we select the target that we're designing against, we we can do that in a way which we have higher and higher confidence in our hypotheses of this will actually produce the desired effect on the disease that we're trying to address.
你们如何决定要针对哪些疾病?
How do you decide what diseases to to target?
比如,会不会有一个实验阶段,你们只是随意思考各种可能性,看看哪些看起来有前景,还是你们会直接选一种疾病,觉得那里市场很大,而且已知的通路明确,然后就在那里寻找新的突破?
Like, is there, like, an experimental phase where you just kind of contemplate, like, anything and sort of, you know, see what looks promising, or do you kind of pick, like, one disease where you feel, like, sure that there's a big market and there's sort of, like, a known kind of pathway, and you just sort of, like, look to, you know, find something new to do there.
是的。
Yeah.
这是个很好的问题,卢卡斯。
That's a great question, Lukas.
事实上,就像决定要把你们的分子推进到什么程度一样,这也是一个相当复杂的问题。
In fact, you know, just like deciding how far to take your your molecules, this is another quite complex problem.
而且,我觉得我们非常幸运,能够在ISO构建这些通用模型,使我们能够针对多种不同的疾病。
And, you know, I feel like we're actually very lucky to be able to build, you know, these general models at ISO that allow us to target many different diseases.
你知道吗?
You know?
这正是我对我们在这里开发的技术最感到兴奋的一点。
This is one of the things that excites me the most about what we're building here in terms of technology.
当我们思考这个领域时,实际上一开始任何方向都值得考虑,然后你需要形成一些假设,关于那里可能存在哪些市场机会。
And so when when we think about that space, you know, actually to begin with, anything is game, and then you need to form some hypotheses indeed around what, you know, what market opportunities exist there.
你需要思考我们拥有的技术,它在哪些方面表现最佳,因此我们需要对哪些方向更可行或更难实现形成一套观点。
You need to be able to think about the technology that we have, where does it actually perform best, and so we need to have a set of opinions about what is going to be more or less tractable.
你知道,每一个药物研发项目都是一段漫长的旅程。
You know, every drug design program is, you know, a long mission.
它需要花费很多年的时间。
It takes, you know, many years.
它需要数百万美元的投入,因此决定启动一个项目是一项重大的决策。
It takes many millions of dollars, and so the commitment to go and do one is actually a substantial decision.
所以你会建立这样一种模型,综合考虑所有这些因素,比如疾病负担是多少?
And so you end up building this kind of model that, you know, builds in all all of these factors around, you know, what is the what is the disease burden?
患者在那里需要什么?
What does the patient need there?
市场机会是什么?技术契合度如何?所有这些不同的因素。
What is the market opportunity, what is the technology fit, all of these different factors.
然后这能让你深入特定的疾病领域,进一步深入研究,你需要找到可能独特且重要的靶点,本质上要决定进入哪个方向。
And then that allows you to go into particular disease areas and then you look deeper and, you know, you need to find potentially unique, important targets, you have to just basically decide where where are you going to enter.
另一个方面是,还有谁在那个领域做着同样的事情呢?
You know, another another aspect of this is, you know, who else is doing what in that space as well?
人们对某种特定疾病的理解已经推进到什么程度了?
You know, how far have people already advanced their understanding about a particular disease?
如果你进入这个领域,你是在开发一种首创药物吗?
And so if you go in there, are you going to be building, for example, a first in class medicine?
目前还没有针对这种特定适应症的药物。
Nobody has a medicine for this particular indication.
还是你打算开发一种最佳药物?
Or are you going to be building a best in class medicine?
已经有药物上市了,而你能否开发出更好的药物?因为这两种情况的经济因素和其他方面都截然不同。
Somebody has already a drug on the market and are you going to be able to build something better because, you know, the economics and and all of it is quite different in these two cases.
如今市场上有没有使用这些机器学习技术发现的药物,是我可能用得上的?
Is is there a medicine out in the market today that that I might use that was discovered using these machine learning techniques?
我认为目前市场上还没有任何药物能直接归功于使用这类技术发现的,至少目前是这样。
I don't believe we have anything on the market at the moment that can be sort of straightforwardly linked to having been discovered using these types of techniques, at the very least.
我的意思是,正如马克斯所说,人们可能在过去二十年里一直在使用各种类型的机器学习方法。
I mean, I think people have been using, as Max has said, you know, various sort of types of machine learning for probably the past two decades.
尽管如此,现在已有不少处于临床试验中的分子,是借助我所说的上一代机器学习方法开发出来的。
That said, there's a number of molecules in clinical trials now that would have been developed sort of with at least what I would call, you know, the last generation of machine learning methods.
我的意思是,这些在研发早期阶段的新进展,是否意味着现在出现了更多看起来有望成功的候选药物,人们也更愿意去尝试?
I mean, do do these new advances in sort of the early part of the funnel mean that there's now lots more candidate medicines that seem likely to work that people would wanna try?
我觉得这些技术在研发流程的前端效果会更好,如果你把整个过程想象成一个漏斗的话。
Like, I would think that these techniques would work much better at kind of the top of the funnel, if you even think of it like a funnel.
我想象中的漏斗是从那些在试管中看起来有效的廉价候选物,一直延伸到通过所有临床试验的药物。
I'm imagining a funnel that sort of goes from, like, you know, kind of cheap candidates that seem to work at a test tube all the way to, like, gone through every, every trial.
现在前端是不是已经堆积了大量候选物?因为我们现在在这方面做得很好,但接下来得弄清楚该选哪些推进下去?
Is there sort of, like, a glut now at the top of the funnel of of, like, now we got really good at that, and we need to figure out, like, which ones to send through?
也就是说,前端的这些努力会不会在下游引发一些变化?
Like, are there gonna be, like, kinda changes downstream coming from all this effort up top?
你知道,我认为这正是机会所在,至少在我们考虑命中物识别时,它看起来确实有点像一个漏斗。
You know, I think this is part of the opportunity. At least when we think about hit identification, you know, it can look a bit like a funnel.
对吧?
Right?
你从大量物质开始,随着时间推移,越来越多的物质被淘汰,因为你发现了它们在特定设计或特定分子上的负面因素。
You start with more stuff and over time, more things drop out because you realize the negative aspects against that particular design or that particular molecule.
但当然,如果你有技术能够发现或设计出越来越多的初始命中物,那么在后续的实验或进一步设计阶段,即使有些物质被淘汰,你仍然会保有大量可以推进的候选物。
But of course, if you have techniques that allow you to discover or design more and more initial hits, then as you go through those stages of further and further experimentation or further design, even though some things might drop out, you still have tons in your funnel that you can take forward.
通过推进更广泛的分子组合,在设计过程的最后阶段,当你考虑该优先推进哪些分子,甚至考虑备选方案时,你有望获得质量更高、选择更多的分子,用于后续的临床研究。
And then by having a wider array of molecules that you're taking forwards, by the end of your design process when you're thinking about, okay, what do we actually lead with, and even thinking about backups, you would hope to have much higher quality molecules there and a lot more to choose from for then subsequent clinical studies.
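The funnel argument can be made concrete with a small calculation. The sketch below takes the roughly 90% clinical failure rate quoted earlier in the conversation and asks how many candidates a program would need to carry forward to have a given chance of at least one success. It assumes each candidate succeeds independently with the same probability, which real programs do not satisfy, and the 20% scenario is an invented comparison point, not a figure from the speakers.

```python
def p_at_least_one_success(n_candidates: int, p_success: float) -> float:
    """Chance that at least one of n independent candidates succeeds."""
    return 1.0 - (1.0 - p_success) ** n_candidates


def candidates_needed(p_success: float, target: float) -> int:
    """Smallest number of candidates whose combined chance reaches target."""
    n = 1
    while p_at_least_one_success(n, p_success) < target:
        n += 1
    return n


# 10% per-candidate success corresponds to the quoted 90% failure rate.
needed_at_10_percent = candidates_needed(p_success=0.10, target=0.90)
# Hypothetical improvement: better prediction doubles the success rate.
needed_at_20_percent = candidates_needed(p_success=0.20, target=0.90)
```

Under these assumptions both levers discussed here help: more high-quality candidates entering the funnel raises the odds directly, and a higher per-candidate success rate sharply reduces how many need to be carried forward.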
我明白了。
I see.
所以你觉得你的成功率会提升,比如超过10%的候选物能通过整个漏斗,因为选择更多了,而且你能建模预测药物在整个生命周期中的表现?
So you think that your your rate will go up, like like, more than 10% will get through the funnel because there's more options and you can you can actually model what's gonna happen through the through the life of the drug?
是的。
Yeah.
我的意思是,我认为我们将来能更好地预测会发生什么。
I mean, I think we're gonna be much better at being able to tell what's going to happen.
事实上,我们现在已经做到了,所以我非常期待这个比例会提高。
And in fact, we already are, and and and so I very much expect that rate to go up.
这其中涉及许多不同的因素,但我想说的是,卢卡斯,这个领域里没有什么是容易的。
There's many different inputs into that, but I think it's worth saying, Lukas, like, nothing is really easy in this space.
这并不是一个低垂的果实领域,好像之前没有重大的技术突破,然后技术一进来就轻松解决了所有问题。
It's not really like a low hanging fruit kind of space where, you know, there hadn't been, like, big technological breakthroughs and then sort of technology comes in and kinda swoops in and just solves it all.
我认为这里的情况并非如此,因为这确实是前沿科学,处于人类当前认知的边缘。
I I don't think it really works like that here because it's really cutting edge science, you know, at the edge of sort of what is known to humanity now.
而且,已经有庞大的产业聚集了世界上最聪明、最有热情的人在思考这个领域。
And there's been a massive industry of, you know, the world's smartest, most impassioned people thinking about the space.
所以,我认为突破将来自于发展这些基础能力。
And and and so, you know, breakthroughs are I think are gonna come by developing these fundamental capabilities.
它们将推动我们的知识进步,然后我们以非常聪明的方式加以应用,以解决最困难的问题。
They're gonna advance our knowledge, and then we're gonna apply them in very smart ways to be able to solve the hardest problem.
所以我认为,这与其说是去挑选一大范围的简单问题并逐一解决,不如说是其他问题。
So I would say it's less of a question of, like, picking up a whole wide range of, like, simple problems and just going to town on them.
实际上,这是在解锁以前无人能做到的事情。
It's actually unlocking things that nobody's been able to do before.
你知道,我们有一种疾病。
You know, we have this disease.
我们有一个靶点。
We have this target.
我们完全不知道该在哪儿结合靶点,也不知道该如何让它起作用。
We have no idea, like, where to bind to the target and how to actually make it work.
而通过能够理性地、快速地在计算机中模拟这一空间,我们应该能够在这类真正根本性的难题上取得突破。
And then by being able to model that space rationally and and, you know, rather quickly in silico, we should be able to make inroads into some of these, like, really fundamentally difficult diseases.
那为什么你说这里没有很多容易摘的果子呢?
Well, why do you say that there's not a lot of low hanging fruit?
我觉得,相比在实验室里实际尝试,能够在计算机中模拟这个任务,效率应该会大幅提升。
Like, I would imagine being able to model this task in silico versus actually trying it in a lab would just be, like, a massive increase in efficiency.
是因为建模不够可靠吗?
Is it because the the modeling isn't reliable?
不是。
No.
不是这个问题。
It's it's not that.
我的意思是,这确实大大提高了效率,但这是一个非常复杂、多方面的难题。
I mean, I think it is a massive increase in efficiency, but it's a very complex, multifaceted problem.
对吧?
Right?
所以,我认为我们不仅能大幅提高速度,还能大幅提升预测的准确性,因此我预计事情会有所收敛。
And so I think we're gonna be able to massively improve the speed and massively improve the accuracy with which we're making predictions, and so I expect things to contract.
但真正从识别靶点到开发出经过验证有效、无毒且具备其他所有特性的药物,整个过程将面临一系列极其艰巨的挑战。
But, actually, the whole problem of designing a drug from identifying a target to sort of having something that's been proven to be efficacious and, you know, nontoxic and have all the other properties, I think, is going to be a series of very, very hard challenges.
因此,从这个角度来看,我们也需要对预期做出相应的调整。
And so, you know, I think we need to set our expectations accordingly as well in that sense.
你们在这个过程中进展到哪一步了?
How far along are you in that journey?
比如,你能谈谈你们进展最远的药物已经走了多少步吗?
Like, can you talk about how many steps have happened with your furthest-along drugs?
是的。
Yeah.
好吧,你看。
Well, look.
要知道,Isomorphic Labs成立才两年多一点时间。
You know, the way to think about Isomorphic Labs is we're only, you know, two and a little bit years in existence.
因此,我们最初的重心自然是开发我们的技术平台。
And so, you know, of course, our first port of focus has been to develop our technology platform.
当然,自创立之初我们就一直在做这件事,依托AlphaFold这一关键赋能技术,为我们进入基于结构的计算机药物设计领域提供了立足点。
And so this is something that we have been doing since the very start, of course, building on top of AlphaFold, which has been sort of a key enabling technology in the space that gives one a foothold into this, you know, structure-based, in silico drug design space.
因此,我们已经开发出多种这样的方法。
And so we have developed a number of these methods.
在过去一年里,我们开始利用我们开发的技术开展实际的药物设计项目。
And then over the course of the past year, we have started doing actual drug design programs using the technology that we have developed.
在药物发现领域,一年的时间相当短暂,因此我们目前有多个项目,既包括我们自己内部的目标,也包括我们与礼来和诺华两家公司合作的项目——这些合作已经通过我们公布的两项合作伙伴关系公告得以确认。
And so, you know, a year in drug discovery is a fairly short period of time, and so we have a number of these programs that are both our own set of our own internal targets as well as the ones that we have partnered for with, you know, the two partnership announcements that we have made with Eli Lilly and Novartis.
所以,我认为这个项目组合还处于非常早期的阶段,但
So I would say it's quite early days for that portfolio, but
我们在这过程中一直保持着稳步的进展。
we have been making pretty steady progress through that.
如今,我们已经看到,过去两年中我们所做的所有建模工作,正在真正改变化学家们日常药物设计的方式。
Already today we're seeing how all of that modelling work that we've been doing over the last two years is actually changing the way that the chemists are approaching the day to day drug design.
这非常令人兴奋,也真正体现了化学家们在长期药物设计方式上的这一重大转折点。
And that's super exciting and really speaks to this longer-term, you know, inflection point in how chemists approach drug design.
化学家们扮演着怎样的角色?
What what is the role that chemists play?
比如,我觉得在当今的语言模型领域,已经没有多少语言学家参与其中了。
Like, I would think, you know, in in, you know, in kind of, like, language modeling today, I don't think there's a lot of linguists involved at this point.
是的
Yeah.
我的意思是,是的
I mean Yeah.
这又是怎么运作的呢?
How does that work?
你知道,你说现在语言学专家参与不多,这很有趣,但事实上,每一个从事自然语言处理的机器学习研究者,都是语言方面的专家。
You know, it it's interesting you say that there's not a lot of linguists involved at this point, but then, you know, the fact of the matter is every single machine learning researcher who works on NLP is is an expert at language.
嗯哼
Mhmm.
对吧?
Right?
仅仅因为是人类而已。
Just by being a human.
计算机视觉也是同样的道理,是的。
It's the same thing with computer vision that Yeah.
每个人实际上都是计算机视觉或人类视觉的专家。
Every single person is an expert actually at computer vision or human vision.
因此,将这些内在的先验知识直接转化为机器学习模型和你的工作流程非常容易。
And so it's very easy to take these internal priors and directly translate them into machine learning models and your workflows.
甚至在产品层面,如何将这些模型应用于你写博客或用大型语言模型编写脚本的方式,我们内心都有这些直觉。
And even on the product side, how you then take those models and change the way you do, you know, you write a blog or you write a script with a a large language model, we all have these intuitions internally.
在化学领域则完全不同,比如我和谢尔盖就不是化学专业的。
It's very different in chemistry space because, you know, me and Sergei, for example, we're not native chemists.
所以我们没有那种对化学如何与这些模型相关联或被其最大化利用的天然直觉。
So we wouldn't have that native intuition about, you know, how chemistry can relate to, or be maximally exploited by, some of these models.
我认为化学家参与这一领域有两种方式。
So I think there's two ways that chemists come into this picture.
一种是帮助我们开发这个平台,协助构建这些模型,并以正确的方式用机器学习和深度学习攻克基础科学问题。
One is on helping us develop this platform, helping us build these models and really attacking this fundamental science in the right way with machine learning and deep learning.
另一种是化学家直接参与药物设计。
And the other place that chemists come in, of course, is on the actual drug design.
你知道,也许会有一个世界,我确信会有,只要按下播放键,你的药物就出来了。
You know, maybe there'll be a world, I'm sure there will be, where we can press play and out pops your drug.
但在那之前,我们当然还需要人类药物设计师和化学家参与其中,与这些模型一起工作,共同创造性地设计分子。
But before that point, we're of course gonna have human drug designers, chemists, in the loop, you know, working with these models, being very creative together in that process to design molecules together.
好的。
Okay.
我们总是以两个问题结束,我想留出一些空间来回答,因为我真的很期待你的答案。
We always end with two questions, and I wanna give some space for them because I'm really interested in your in your answers.
我想听一听的是,从更多偏向研究论文的工作,转向真正实现产品化并适用于你的使用场景的过程中,遇到了哪些意想不到的挑战?
One thing that I I would love to hear about is kind of on that journey from, you know, more research paper oriented work to trying to make this, like, really, you know, productionized and really working for, your use case, what have been the kind of unexpected challenges?
从我的角度来看,正如马克斯刚才提到的主题,数据理解是一个非常困难的问题。
Well, I would say, from my perspective, you know, and and sort of, like, building on the theme that Max has just covered is the data understanding is a really difficult problem, essentially.
对吧?
Right?
我们无法仅仅靠肉眼观察一个化学反应,或者观察细胞内分子层面发生的事情,然后说:‘哦,是的。’
We cannot just eyeball, you know, a chemical reaction or we cannot eyeball something that happens inside a cell at a molecular level and be like, oh, yeah.
这对我来说很有道理。
That makes sense to me.
这是一个很好的数据点。
Like, that's a good data point.
我们应该保留这个。
You know, we should keep that one.
所以我认为,在这个领域里,能否拥有优秀的实验设计、生成高质量的数据集并真正用于训练这些方法,一直是一个长期存在的巨大挑战。
So I think it's, you know, a big perennial challenge in the space to really be able to have great experimental design, to be able to generate really high quality datasets to actually use in training these methods.
因此,这个领域取得进展在很大程度上取决于我们设计这些数据集和理解数据的能力。
And so, you know, basically making progress in the space is is, you know, linked a lot to our ability to design these datasets and understanding.
我认为,这本质上是一项具有挑战性的科学探索,要真正弄清楚许多现象的来龙去脉。
And I think it's, you know, kind of a challenging scientific pursuit essentially to be able to really make heads or tails of a lot of
这些现象。
these phenomena.
是的。
Yeah.
而在模型开发的另一端,我认为这些模型应该如何应用于实际的药物设计和科学研究,始终是一个不断进展、随着模型及其特性变化而持续演进的过程。
And then on the other side of the sort of model development spectrum, I think how these models should be applied to actual drug design, to actual science, I think is always, you know, work in progress and ever evolving as these models and the characteristics change.
但归根结底,我们是在一个通用的数据分布上训练这些模型,然后试图将它们扩展并应用于模型所能认知的最边缘、最前沿领域。
Yet it comes back to this point of, you know, we train these models on a generic distribution of data, and then we try and stretch and apply them to the very tails and the very frontiers of what this model could know.
比如,将像AlphaFold这样的原始能力,应用到某个特定的药物设计挑战上——比如一个完全无人了解的靶点,这正是药物化学最前沿的领域,也带来了诸多挑战。
Of, you know, taking those raw capabilities, a raw capability like AlphaFold, and then applying it to a very particular drug design challenge, you know, target that no one knows anything about, The very frontier of, you know, medicinal chemistry poses lots of challenges.
当然,这个模型中存在着大量我们尚未理解的‘隐性知识’,我们需要找到方法将其提取出来,以适配这种特定的应用场景。
There's of course gonna be lots and lots of, you know, dark knowledge in this model that we need to work out how to extract for this very particular use case.
我在这个领域中一再看到,即使在DeepMind的过往项目中,要提取这种‘精华’、挖掘这些隐性知识以应对如此极具挑战性的应用,都需要大量的创造性工作,而这本身也是一个非常神奇的切入点。
And I've seen again and again in this context, even in previous context at DeepMind that there needs to be a lot of creative work in how you extract that juice, how you extract that dark knowledge for these really particularly challenging applications, and that's really a magic spot as well.
非常令人兴奋,但也极其困难。
Really exciting, but very difficult as well.
你能谈谈你是如何做到这一点,或者如何应对这个问题的吗?
Can you say anything about how you've done that or how you've approached that?
这其中的很多内容,我想可以归类为‘黑客式’的探索,而我用‘黑客’这个词是带着最积极的含义的。
A lot of this would, I'd say, fall under the the hacking sort of moniker, and, you know, I I use that in the best sort of way.
我认为,我们一再看到。
I think, you know, we see again and again.
无论你把深度学习模型应用到哪个领域,只要面对那些热衷于应用它们并充分挖掘其价值的人,人们总会找到令人惊叹的方式,对这些系统进行改造、黑客式调整,甚至将它们以从未被训练过、从未有人想过的方式拼接在一起。
It doesn't matter what domain actually with deep learning that you you put these models in front of people who are really passionate about applying them and really extracting juice out of them, and people will find amazing ways to mold and hack and just completely mash up or pipeline these systems together in a way that they were never trained for, no one ever thought about before.
但当你这样做的时候,你确实能找到某种方式,以一种全新的方式解读这种统计推断,从而真正为人们带来价值。
But actually, you do this and and you can find some way to interpret like, that statistical inference in a completely new way, and it really gives value to people.
我们每天都能看到同样的情况,比如对化学家而言。
And we see the same thing day to day, you know, for chemists, for example.
所以你指的是你正在合作的化学家,还是主要是机器学习工程师在做这些?
So you're talking about the chemists you're working with or the machine learning engineers primarily doing this?
他们会并肩合作,一起完成这些工作。
They'll be side by side doing this together.
是的,化学家可能不会直接进行底层的黑客操作,但机器学习工程师会做,而且他们会一起协作。
So, yeah, the chemist might not be doing the native hacking, but the machine learning engineer will do that, and they'll do it together.
我的意思是,其实我想补充一点:你刚才问的是最困难的事情。
I mean, actually, I do wanna add something. You asked about the hardest things.
而最令人享受的事情之一,就是能够真正地将这些学科整合在一起。
One of the things that has been the most enjoyable is being able to actually integrate these disciplines together.
我们经常听到一些令人担忧的故事,比如某家公司搞AI药物发现,筹钱去开发技术,但技术还没真正奏效。
You know, we often hear about some horror stories where, you know, company x for, you know, AI for drug discovery goes off, raises money to to do this and and build technology, and the technology doesn't quite work yet.
你要做药物发现,结果却导致药物发现团队和开发技术的团队之间出现脱节。
You know, you need to do drug discovery, and so you go off and you end up having this disconnect between essentially your drug discovery teams and your teams building the tech.
我们公司在创立之初就将这种理念融入了DNA:我们共同面对并解决这些问题,这一点也体现在我们项目的组织方式上。
And one of the things that we have built into kind of the DNA of the company is we're in this together to solve these problems together, and so this is reflected in how we've structured projects.
我们让化学家深度参与所有机器学习项目,作为领域专家和团队成员。
You know, we have chemists deeply involved in all of our machine learning projects as subject matter experts, as as folks that are, you know, team members.
我们作为团队坐在一起,彼此交错融合。
We sit together as teams, sort of interspersed with each other.
同样,当我们开展药物发现项目时,我们的机器学习工程师和研究人员也会深度嵌入这些项目,以确保我们能充分发挥技术的优势。
And, you know, similarly, when we think about doing the drug discovery programs, our machine learning engineers and researchers are deeply embedded in those programs to make sure that we're getting the best of the tech.
对我来说,这一直是真正破解这一难题的关键所在。
And so to me, you know, this has been, like, a really integral part of actually how to how to crack this for real.
你知道吗,这挺有趣的。
You know, it's funny.
我真的尽量不在这些对话中提到Weights & Biases,但我实在忍不住,因为我总是好奇人们是否在生物领域的例子中使用了我们的可视化工具。
I really try not to inject Weights & Biases into these conversations at all, but I kinda can't help myself because I always wonder if people are using our visualization stuff for bio examples.
比如,我们有一个分子查看器,演示起来总是很出色,因为看起来就像科学一样。
Like, you know, we have a molecule viewer that always, it demos really well because it just sort of seems like science, I think.
它可能对非生物领域的客户来说演示效果更好,因为确实如此。
It probably demos better to, like, the non bio customers because it yeah.
每个人都幻想自己是个科学家。
Everyone fantasizes about being a scientist.
他们可以观察结构并看到这些。
They can look at structure and and and see that.
但这样直接看到分子,真的是一种有用的数据观察方式吗?
But is that actually, like, a useful way, to look at data to literally, like, see the molecule?
或者,你到底该如何真正地查看数据?
Or or how do you, actually look at the data?
我的意思是,我觉得这比处理文本或图像难多了。
I mean, I guess it's so much harder than with, text or images.
这个过程是怎样的?
What what's the process?
是的。
Yeah.
而且这取决于谁想查看这些数据。
And it'll it'll depend on who wants to view the data.
如果你是个药物化学家,你会乐意坐在那里看一页几百个分子的图。
You know, if if you're a medicinal chemist, you will happily sit there and look at a page of hundreds of molecules
像它们的图片,或者像我化学课上学到的那种科学符号?
Like, like, a picture of them or, like, the scientific, like, notation of them from my chemistry class?
它们的二维结构图示。
The 2D picture, the graph notation of them.
下一个。
Next.
是的。
Yeah.
对。
Yeah.
只需快速浏览这些图,你知道,这些了不起的人,比如我想到以前坐在我旁边的埃里克,他只要看一眼分子,就能凭直觉判断出它的一些真实特性。
Just roll through that and, you know, these amazing people, you know, I think about, like, someone I used to sit next to, Eric, who just looks at the molecule and can, like, pick out some real characteristics about it just intuitively.
但随着我们能够评估或考虑生成的分子数量增加,这可能会达到数千甚至数十万。
But then, of course, as we scale up the number of molecules that we could be, you know, assessing or thinking about generating, this could go into thousands or hundreds of thousands.
你可以想象,为这些分子设计非常独特的可视化方式或‘景观’。
And you could think about, you know, very unique visualizations or landscapes of these things.
随着我们能够测量或预测越来越多的分子特性,我们就可以思考如何呈现这些特性——即每个分子的多维特征,并将其投影到一个化学家可以浏览的空间中,从而真正开始形成直观理解。
As we're able to measure more and more properties or predict more and more properties with our models, we can then think about how how to surface those properties, that whole multidimensional characterization of each individual molecule, and then project that in a space that a chemist can browse through and really start to intuit about.
嗯。
So Mhmm.
有很多机会。
There's lots of opportunity.
实际上,卢卡斯,这也提出了一个很好的观点。
Actually, this raises a great point as well, Lukas.
甚至涉及更广泛的领域,那就是产品:我们可以开发出非常出色的技术。
Even about the wider field, which is product. Like, we can develop really amazing tech.
我们可以构建非常优秀的模型,但我们希望这些模型能解决现实世界的问题。
We can build really great models, but we want those models to solve real world problems.
而产品在其中扮演着至关重要的角色,我觉得在更广泛的科技领域中,这一点是我们已经发展得非常好的一个方面。
And, you know, product plays a really crucial role in that, and I feel like that is a muscle in sort of the wider tech space that we've been able to grow really well.
比如,如何专注于这个问题本身:用户是谁?我们需要构建什么才能真正产生影响。
Like, how to obsess over what that problem is, like, who are the users, what do we need to build to really move the needle.
因此,我们也已经将这一点整合进来了。
And so we've, like, integrated that as well.
我们有一个产品团队,与机器学习团队紧密合作,深入理解这些能力,同时学习并为研究路线图提供反馈。
You know, we have a product team that are working really closely with the machine learning side to really understand these capabilities and both, like, learn and give feedback into that research road map.
但另一方面,我们也与化学家和生物学家紧密合作,了解他们的工作流程。
But then on the other side, like, also working very closely with chemists and biologists to understand those workflows.
所以,你知道,我们不仅仅拥有一堆模型。
And so, you know, we we don't just have a collection of models.
你使用时并不是简单地运行一堆Colab笔记本。
Like, you're not just running a bunch of Colabs as you use this.
你实际上有一个界面。
You you you have sort of an interface.
你有一个完整的、可以登录的产品,可以在里面运行你的药物设计项目,这将所有这些模型融入到你作为化学家的工作场景中,帮助优化你的工作流程,而且你需要能够可视化所有这些内容。
You have a fully formed product that you can log on to, and you can run your drug design project inside, and that puts all of those models into the context of what you're doing as a chemist and, you know, helps with your workflow, and you need to be able to visualize all of this.
所以,我认为这实际上是我们思考如何将这些技术转化为现实影响力的关键方面,我觉得这对整个行业也很重要,因为我们有很多酷炫的技术,正迫切需要真正的产研洞察和产品应用。
And so, you know, I think it's a key, actually, aspect of how we think, you know, the translation of some of this technology into real world impact really, really works, and I feel like that's an important piece even for the wider industry as, you know, we have a lot of cool technology that is looking for that, like, real product insight and product application.
完全正确。
Totally.
如果你有任何关于可视化或你希望我们加入的查询建议,我们非常欢迎这种反馈。
Well, if you have any, suggestions of, you know, visualizations or queries you like to to push into us, we would be overjoyed to get that form of feedback.
如果有一些功能更新的话。
Well, if there's some feature updates.
请说。
Please.
我非常想听听。
I I would love to hear them.
我们可以把这个问题从对话中移开,但我真的特别想听。
We could take it out of this conversation, but I'm dying to hear it.
另外,我的最后一个问题是,如果可以的话,我们换个完全不同的方向——当你思考机器学习时,无论是在你的领域内还是领域外,有没有什么研究课题你觉得被忽视了?或者有没有什么研究结果让你特别兴奋,但似乎没有得到应有的关注?
And, you know, my last question, to take things in a totally different direction, if you don't mind, but this has often been interesting, is, like, when you think about machine learning in general, could be inside your field or outside of your field, is there any kind of research topic you think is underexplored or, like, a research result that really excited you that you think didn't get, you know, the attention it deserved?
或者换种方式想,如果你没有现在这份工作,你会想研究什么有趣的方向?
Or maybe another way of thinking about it is is if you didn't have your current job, is there something interesting that you would wanna look into?
就我而言,我真正热衷的是医疗健康,所以我的研究主题通常都围绕着医疗健康。
Well, on my side, you know, I'm really passionate about health care, so I think my topics are quite often about health care.
但我认为,神经科学是我们理解生物学运作机制时仍留下的最大谜团之一。
But I, you know, I see I see sort of neuroscience as, like, one of the biggest mysteries that is, like, left in our overall understanding of actually how biology works.
我们对大脑如何工作,以及我们如何形成关于世界模型和记忆等更宏观的概念,仍然只有非常基础的认识。
We we, you know, we still have quite basic ideas about how brains work and how we form, like, these larger, you know, larger sort of concepts around our world models and our memories and so on.
所以对我来说,如果我要思考一个我们尚未真正迎来那个时刻的领域,那就是这个领域。
And so to me, you know, if I was to think about a domain where, you know, we we haven't really had that yet, like, that moment yet, that would be the domain.
我认为在这一领域,机器学习大有可为,其他形式的建模和模拟也同样如此。
And I think there's lots to be said for machine learning in that space, but also for, you know, other forms of modeling, simulation, and so on.
但我会把这看作一个关键领域,去真正产生重大影响。
But, you know, I would look for for that as a key, you know, key space to really make impact in.
至于我,我认为核心在于,我非常热爱深度学习。
On my side, I I think at the very core, you know, I love deep learning.
我一直致力于开发新的神经网络模块,探索这些组件如何组合,以及如何为神经网络添加新功能。
I've sort of been at the core always developing new neural network modules and ways that these pieces could be put together or how you, you know, add new functionality into neural networks.
我总是特别喜欢关注最新的层、归一化方法、条件化方式、注意力机制的调制方法,乐此不疲地尝试这些技术,并以全新的方式将它们整合在一起。
And and I just always love seeing the latest layers, normalizations, you know, ways of conditioning, ways of modulating attention mechanisms, and and playing with these things and and putting everything together in new ways.
在Iso,我觉得特别有趣,因为我们能接触到所有这些前沿技术,开发新东西,并将它们应用于完全新颖的数据类型。
I think it's great fun at Iso because we get to, you know, look at all of this, develop new stuff, and leverage all of it for completely new data types.
所以,我们会以原创研究中未曾定义的方式,将这些技术融合在一起。
So mashing up all these things in ways that aren't defined in the original research.
我想,如果我不在Iso这个领域,我会回想起在DeepMind后期所做的关于开放式学习的研究,认真思考如何利用深度学习的基本构建模块,创建出可扩展的学习系统,即使没有人类标注数据,或者仅以最初少量的人类标注数据作为起点,也能在模拟环境或真实世界等环境中不断学习,从而越来越深入地理解世界。
I think if I wasn't in the Iso space, you know, I think back to some of the later research I was doing at DeepMind on open-ended learning, really thinking about, okay, how do we use these fundamental building blocks of deep learning and create these scalable learning systems that can just learn even without human-labeled data, or just bootstrapping perhaps off the initial bit of human-labeled data, against, you know, environments, whether that's simulations or the real world, to learn more and more and more about the world.
最终,这样你就可以在之后给这个智能体任何任务。
Ultimately, so that you can come along and give this this agent any task down the line.
所以我非常喜欢
So I love
这个领域我也很喜欢。
that space as well.
太棒了。
Awesome.
是的。
Yeah.
我也是。
Me too.
有没有什么研究论文你可以推荐给我们,我们可以放到节目笔记里?
Any any, like, research paper you wanna point us to there that we could put in the show notes?
我想推荐一下我们来自DeepMind的出色工作,这些工作是朝着这个方向迈出的小小一步,即如何构建这些环境宇宙,并训练智能体和智能体群体在其中进行学习。
I'll plug the excellent work, which was some of the work from us at DeepMind, which was baby steps along that way of, you know, how do you just create these environment universes and train agents and populations of agents against the space.
归根结底,你会开始得到这些智能体,它们能够展现出对任何新任务进行零样本泛化的能力。
So at the end of the day, you start getting these agents which can exhibit, you know, start to exhibit general capabilities to zero shot to any new task.
太棒了。
Awesome.
如今,这个话题似乎越来越相关了。
Well, that that topic seems more and more relevant, these days.
是的。
Yeah.
如何在语言空间中实现这一点。
How to apply that in the language space.
或者也许在生物领域,你知道,会发生什么。
Or maybe in in in bio, you know, what would happen out.
是的。
Yeah.
老实说,我认为在这里做药物设计有很多类似的思路。
I think there's lots of analogs to how we can do drug design here, to be honest.
非常酷。
Very cool.
非常感谢你抽出时间。
Well, thank you so much, for your time.
这真是一次很愉快的对话。
It was a lot of fun.
能和你交流我很高兴,卢卡斯。
Been a pleasure, Lukas.
是的。
Yeah.
很高兴见到你。
Great to see you.
是的。
Yeah.
很高兴见到你,而且确实很高兴和你们合作。
Good to see you, and and, yeah, great to work with you guys.
关于今后的需求,我是认真的。
And I really mean it on future requests.
你知道,我也觉得医疗健康是机器学习最酷的应用之一,我个人非常希望能为你们提供更好的工具。
You know, I also think health care is one of the coolest ML applications, and I personally would just love to give you guys better tools.
这似乎是其中最难实际查看数据的领域,但我确实觉得,真正查看你的数据是广泛成功的关键。
It seems like the one where it's the hardest to actually look at your data, and I do feel like actually looking at your data is, like, just the key to success broadly.
是的。
Yeah.
随着我们逐步增加对Weights & Biases的使用,我相信会涌现出很多新想法。
I mean, as we ramp up the usage of Weights & Biases, I'm sure there'll be lots of ideas coming.
我真的不介意任何特别具体的小细节。
I really don't mind either, like, really specific tiny stuff.
比如,那些我们从来得不到的东西,但如果有什么让你感到困扰的,尽管说。
Like, feel like that's the stuff we never get, but if anything is just, like, irritating you No.
别害羞。
Just don't be shy.
你跟你的团队说一下,直接发给我就行。
Like, you you know, tell tell your team, just send it to me.
是的。
Yeah.
对。
Yeah.
好。
Yeah.
我的意思是,这是我们技术基础设施的关键部分,
I mean, it's a it's a key piece of our sort of technology infrastructure,
而且
and
我觉得大家都很喜欢使用它。
I think, you know, people have loved working with it.
但总是还有更多事情。
But there's always more things.
总是有更多的扩展需求。
There's always more scaling.
所以我相信请求不会短缺。
And so I'm sure there'll be no shortage of requests.
我们一定会把你的话当真的,卢卡斯。
We'll definitely take you up on your word, Lukas.
好的。
Okay.
太好了。
Great.
非常感谢。
Really appreciate it.
祝你今天愉快。
Have a great day.
非常感谢您收听本期《Gradient Dissent》。
Thanks so much for listening to this episode of Gradient Dissent.
请持续关注未来的节目。
Please stay tuned for future episodes.