欢迎回到DeepMind播客,我是汉娜·弗莱。在本系列节目中,我一直在与人工智能前沿领域的研究人员对话,了解他们的工作内容及其可能对我们所有人产生的影响。过去四期节目中,我们展望了人工智能的长期未来,包括研究人员希望实现通用人工智能的一些构想。但在接下来的两期节目中,我们将与您分享人工智能已在实践中应用的若干案例。有请DeepMind首席执行官德米斯·哈萨比斯。
Welcome back to DeepMind: The Podcast with me, Hannah Fry. For this series, I've been speaking to researchers at the cutting edge of AI to find out what they're working on and the implications it could have for all of us. In the last four episodes, we've been looking ahead to the long-term future of AI, including some of the ideas that researchers hope will bring them to artificial general intelligence. But in the next two episodes, we want to share with you some of the ways that AI is already being put to work along the way. Here's DeepMind CEO, Demis Hassabis.
我个人致力于人工智能研究的初衷,是想将其作为终极工具来加速几乎所有领域的科学发现。我认为AlphaFold就是我们在这方面首个重大范例。
My personal reason for working on AI was to use AI as the ultimate tool to accelerate scientific discovery in almost any field. And I think AlphaFold is our first massive example of that.
如果您收听了第一期节目,就会了解AlphaFold——DeepMind能准确预测蛋白质三维结构的尖端系统。但AlphaFold并非我们实验室里唯一的科研项目。走进这些实验室,您会发现研究人员正在解析DNA以探索生命奥秘,寻找核能利用的新途径,或是在数学领域最具思维拓展性的前沿测试人工智能。还等什么呢?快进来看看吧。
If you heard episode one, you'll know all about AlphaFold, DeepMind's state of the art system that can accurately predict the three-dimensional structures of proteins. But AlphaFold isn't the only science project within these walls. Step inside the labs and you'll find researchers scrutinising DNA to understand the mysteries of life, hunting for new ways to harness nuclear energy or putting AI to the test in some of the most mind expanding areas of mathematics. So what are you waiting for? Come on in.
这里是第六期节目《人工智能助力科学》。普什米特·科利——我们在第一期节目中采访过的嘉宾——负责统筹DeepMind在自然科学领域的人工智能科学项目。若要了解当前研究领域的清单,他正是最佳咨询人选。
This is episode six, AI for Science. Pushmeet Kohli, who we heard from in episode one, oversees DeepMind's AI for science program across the natural sciences. And when it comes to a list of areas that are being worked on, he is exactly the man to ask.
生物学、蛋白质组学与基因组学、量子化学领域(如材料设计)、核聚变、基础数学、生态学、气象学。
Biology, proteomics and genomics, in quantum chemistry, like material design, fusion, fundamental mathematics, ecology, weather.
这份清单确实令人望而生畏。但在接下来约三十分钟里,我将尽可能带您了解其中更多内容,让您感受人工智能改变科学发展的潜力。不过首先,您可能好奇为何这家以开发游戏AI闻名的公司会涉足这些严肃的科研课题——这正是我与普什米特对话的起点。
Now that is a fairly intimidating list. But over the course of the next thirty minutes or so, I'll be walking you through as many of them as we can get to, to give you a sense of the potential for AI to make a difference to science. First though, you might be wondering why a company that made its name getting AI to play computer games became involved in these serious scientific subjects. And that is where I started the conversation with Pushmeet.
科学精神自DeepMind创立之初就根植于我们的基因中。更准确地说,随着我们逐步构建这些系统并在游戏中验证其能力,我们开始思考:现在是时候让它们接受现实科学挑战的真正考验了——这些挑战正是社会当前面临的重大课题。
Science has been in the DNA of DeepMind from the very start. It's more of the case that as we sort of built up these systems and proved them on games, we then started thinking about now is a good time to actually stress test them on the real scientific challenges that society is facing.
对科学团队成员、研究科学家莎拉·简·邓恩而言,DeepMind的科学项目对公司实现'破解智能以推动科学进步、造福人类'的目标至关重要。
For Sarah Jane Dunn, a research scientist on the science team, DeepMind's science programme is fundamental to the company's aim of solving intelligence to advance science and benefit humanity.
这对某些人来说是个相当抽象的概念。关键在于你们掌握的技术能攻克那些数十年来难倒最聪明头脑的难题。我认为,能够向人们展示这些技术真正造福人类的潜力,这正是我们科研工作中最有价值的方面之一。
That's quite an abstract concept to some people. The point is you have techniques that can tackle some of the hardest problems that have eluded the brightest minds for decades. And so I think being able to show people the power of these technologies to really help humanity, that for me is one of the most valuable aspects of what we're doing in science.
AI科学团队的工作就是筛选适合人工智能研究的科学问题,这些问题的突破将产生最深远的影响。
What the AI for Science team does is try to identify suitable scientific problems for AI to work on that will go on to have the biggest impact.
可以将其想象成知识树的概念。有些问题是所谓的根节点问题,它们能解锁下游大量其他问题。蛋白质折叠就是典型案例——一旦理解蛋白质结构,就能推动新药研发或创造分解塑料的新酶。另一个例子是若能获得强大能源,像清洁水获取这类难题就会迎刃而解。
One way of thinking about it is this idea of the tree of knowledge. There are some problems which are the so called root node problems, which unlock so many other problems downstream. Protein folding was one such example. Once you understand the structure of proteins, that has implications for developing new drugs or coming up with new enzymes for breaking down plastics. Another example would be if you had a great source of energy, other problems like access to clean water become much more tractable once you have unlimited energy at your disposal.
当我们筛选出这些根节点问题后,就会思考机器学习和人工智能在其中应扮演的角色。
And then once we have sort of isolated some of these root node problems, we then think about what is the role machine learning and AI has to play.
如今DeepMind在蛋白质折叠领域取得重大进展后,正转向普什米特提到的其他根节点问题,其中之一就是核聚变。核聚变突破可能带来的影响怎么强调都不为过——将两个氢原子聚合,创造无限、清洁且安全的能源,这是全球科学家长期以来的梦想,将彻底终结气候危机。
And now that DeepMind has made significant progress on protein folding, it's turning to some of those other root node problems that Pushmeet mentioned. One of them is nuclear fusion. It's hard to emphasize just how big an impact a breakthrough in nuclear fusion might have. It is the long held dream of scientists around the world that one day we would be able to fuse two hydrogen atoms together, and in doing so, create a totally unlimited, totally clean, and safe supply of energy. It would spell the end of our climate crisis.
但在我们过于乐观之前,必须说明核聚变极其困难。要使原子聚合,首先需要加热至形成等离子体,这种物质状态温度极高,电子会从原子中剥离。下面请德米斯·哈萨比斯来解释。
But before we get too carried away, I should tell you that nuclear fusion is really hard. To get atoms to fuse, you first have to heat them up until they form something called a plasma, a state of matter that is so hot that the electrons are stripped from the atoms. Here's Demis Hassabis to explain.
核聚变研究中最棘手的问题之一是如何约束这种温度堪比太阳的等离子体——它显然会烧毁任何接触到的物质。我们采用磁场来约束它,但问题在于等离子体几乎处于混沌状态。随时可能有部分等离子体朝某个方向喷射,我们必须快速调整磁场作出响应,才能将等离子体稳定约束数秒钟。
One of the really hard things about fusion is how do you contain this plasma that's, like, as hot as the sun, and it would burn anything, obviously, that it touched. So the way you have to contain it is in a magnetic field. The problem is that the plasma is almost in a chaotic regime. So at any moment, a bit of it might just sort of shoot out in a certain direction, and you have to change the magnetic field quick enough to respond to that, to keep hold of the plasma for multiple seconds.
等离子体被约束在一种称为托卡马克的装置中。这个巨大的金属环状结构大到可以让人穿行其中。边缘的磁体产生足够强的磁场,使等离子体悬浮在中央远离侧壁的位置。但等离子体本身具有波动性和不稳定性,一旦接触任何物体就会前功尽弃。
The plasma is held inside something called a tokamak. It's like a giant metal donut big enough to walk through. The magnets around the edge create a field strong enough to suspend the plasma in the middle away from the sides. But the plasma is wobbly and unstable. And as soon as it touches anything, it's game over.
托卡马克相对低温的侧壁会迅速吸收等离子体的能量,其热量几乎瞬间就会消散。
The comparatively cold sides of the tokamak sap the energy from the plasma, and the heat dissipates almost instantly.
迄今为止物理学界的做法是手工编写数学控制器来预测等离子体行为。而我们的系统则通过学习根据等离子体形态预测其行为,几乎能提前调整磁场来应对预期变化。更令人振奋的是,我们实际上在进行等离子体雕塑——可以将它分裂成两部分或拉长形态。
Up till now, what people have done in the physics world is they've handwritten mathematical controllers for what the plasma might do. Whereas our system, what it's learned to do is predict what the plasma might do from the shape of it. And then almost ahead of time, change the magnetic field to react to what it thinks is going to happen. And even more excitingly, we're actually doing plasma sculpting. So we can actually split it into two or elongate it.
这简直像是在用等离子体做冰雕。
It's almost like ice sculpture, but with plasma.
这东西真有太阳那么热?
This is as hot as the sun?
没错,和太阳一样炽热。
Yeah. It's as hot as the sun.
但正如普什米特所解释的,这不仅仅是保持等离子体稳定的问题。
But as Pushmeet explains, it's not just about keeping the plasma stable.
还涉及到你维持的是哪种等离子体构型。有些构型会产生更多热量。而强化学习现在能让你做到的是,你可以直接说‘我想要这种奇特构型,因为我认为它会更稳定或能产生更高热容等等’。AI就会回应‘没问题,我能帮你实现’,而不需要人类花费一年时间思考如何控制线圈中的不同电流、如何维持特定构型所需的电压等等。
It's also about which configuration of the plasma you are keeping stable. There are certain configurations which produce more heat. And what reinforcement learning now allows you to do is basically, you can say, I want this funky configuration because I think that it is going to be more stable or it will produce more heat capacity and so on. And the AI is like, yeah, I can do it for you, rather than sort of a human spending a year of their time sort of thinking about how to control the different current in the coils and how to maintain the right voltages and so on for that particular configuration.
我明白了。所以本质上这是个捷径。
I see. So it's a shortcut, basically.
它加速了整个研究周期。
It accelerates the whole research cycle.
太神奇了。蛋白质折叠和核聚变已经够让AI科学团队忙不过来了,他们还在推进一系列其他项目,包括生态学领域的研究。这使AI走出了科学实验室,进入了一个你意想不到的工作场景。
Amazing. And as if protein folding and nuclear fusion weren't enough to keep the AI for science team busy, they've been pursuing a raft of other projects, including in the field of ecology. This has taken AI out of the science lab and into a setting where you might not expect to find it at work.
我们在塞伦盖蒂的工作区域是典型的《狮子王》式东非地貌。连绵起伏的草原上点缀着标志性的金合欢树,广阔的热带稀树草原一览无余。
Where we're working in the Serengeti is very classic lion king East African landscape. So you have the rolling grasslands. It's dotted with those iconic acacia trees, broad open savannas.
梅雷迪思·帕尔默是新泽西州普林斯顿大学的保护生物学家,但你通常会发现她带着双筒望远镜在野外工作。
Meredith Palmer is a conservation biologist at Princeton University in New Jersey, but you'll usually find her with a pair of binoculars out in the field.
在广袤的塞伦盖蒂生态系统中,约有700种鸟类和50种大型哺乳动物在此迁徙。它们沿着从坦桑尼亚到肯尼亚的无尽循环路线追逐雨季。当迁徙大军回归时,你能听到角马的嘶鸣,以及尾随其后的蝇群嗡响。
You have something like 700 species of birds, 50 species of large mammals that are traversing around the greater Serengeti ecosystem. So down from Tanzania up into Kenya in this kind of endless circle chasing the rains. And when the migration comes back into town, you can hear the bleating of the wildebeest, the sounds of the flies that are following the migration around.
这场被称为'大迁徙'的年度动物盛会,是地球上仅存的奇观之一,也是生态学家对塞伦盖蒂如此着迷的重要原因。
This annual gathering of animals, the Great Migration as it's known, is one of the last in the world and one of many reasons that ecologists are so interested in the Serengeti.
这里有狮子,也有猎豹、野犬、鬣狗和花豹,所有这些物种都在尝试共存。要理解这些野生动物间的复杂关联,我们必须研究整个生物群落。
We have lions, but we also have cheetah and wild dog and hyena and leopard, and all of these species are trying to coexist. And so to understand all of the interconnections between these wildlife, we need to be studying these wildlife communities.
然而不幸的是,气候变化、农业转型以及非法盗猎,都在对当地生态系统造成日益严重的影响。约十年前,生态学家在塞伦盖蒂国家公园1200平方公里的区域内,安装了被称为相机陷阱的监测设备。
Unfortunately, though, climate and agricultural change, as well as illegal poaching, are all having a growing impact on the local ecosystem. About a decade ago, to help them monitor changes in the Serengeti, ecologists installed devices known as camera traps across a 1,200 square kilometer area of Serengeti National Park.
相机陷阱是小型远程摄像机。我们可以将其固定在树上,让它24小时不间断工作数月甚至数年。当动物经过时,其体温和动作会触发相机拍摄。通过分析这些图像,我们能了解出现的动物种类及其行为。
Camera traps are small remote cameras. We can strap them to a tree and leave them running twenty four hours a day, seven days a week for months or even years. The camera traps are triggered by the heat and motion of passing animals to take pictures. We can examine the images to see what animals are in them and what these animals are doing.
正如你所料,这些相机陷阱会拍摄海量照片——既有精彩的抓拍,也有只拍到动物背影或模糊轮廓的废片。就像让四岁小孩拿着你的手机乱拍,最后翻看相册时的感觉。
As you might expect, these camera traps take a lot of pictures, lots of fantastic candid shots and lots that show the back of an animal or some shape that's a bit hard to discern. Imagine letting a four year old run loose with your phone and then scrolling through the camera roll at the end. It's a bit like that.
这些相机陷阱每月产生1万到2万张图像。我们生态学家正面临数据洪流的挑战。关键问题在于,原始图像无法直接用于研究和保护工作。必须有人逐张查看这数十万张照片并记录:'这张图里有三匹成年斑马和一匹幼崽'。我曾计算过,处理一年的图像数据就需要七八年时间。
These camera traps are producing 10,000 to 20,000 images a month. So we as ecologists are really facing this massive deluge of data. And the big issue for us is that we can't use the images straight out of the camera trap to address our research and conservation questions. Someone has to go and look at each and every one of these tens or hundreds of thousands of images and write down, you know, this image contains three adult zebra and one baby zebra. I once calculated that it would take me seven or eight years to process a single year's worth of image data.
面对如此庞大的数据量,梅雷迪思和她的同事们成为了公民科学的早期实践者。
Faced with all this data, Meredith and her colleagues became early adopters of citizen science.
我们基本上将分类工作众包出去,志愿者们可以查看我们的相机陷阱照片,并识别出野外数据中观察到的所有动物及其行为。
We essentially crowdsourced the classification process and you could look at our camera trap pictures and identify all of the animals and behaviors that you saw in our field data.
然而过了一段时间后,问题出现了。随着公民科学项目数量激增,这个名为'塞伦盖蒂快照'的项目再也无法吸引足够志愿者来手动处理所有需要标记的图像。塞伦盖蒂快照项目标记完善的野生动物图像数据集引起了DeepMind研究人员的注意。他们主动联系,提出训练计算机视觉算法来自动分类相机陷阱拍摄的数千张照片,比如识别出照片中出现的具体瞪羚品种。
After a while though, a problem arose. As the number of citizen science projects exploded, the Snapshot Serengeti project, as it was called, could no longer attract enough volunteers to manually go through all of the images that needed labelling. Snapshot Serengeti's impressive dataset of labelled wildlife images caught the attention of researchers at DeepMind. And they got in touch, offering to train a computer vision algorithm that would automatically classify the thousands of photos taken by the camera traps, identifying which particular species of gazelle, for example, appears in a photo.
AI在解决这类问题上展现的惊人能力让我震撼。有些时候我看着一张图片,计算机识别出的物种我第一眼甚至都没注意到。有些动物AI的识别准确率始终保持在99%。但当我们缺乏足够标记图像时,算法表现就会欠佳。特别是那些神秘稀有的小型物种,比如土豚、狞猫、薮猫、麝猫等,当只有几百张图像时,计算机可能会犯些低级错误,因为它不知道要寻找什么特征。
It blows my mind how incredible AI can be for solving these kinds of problems. I've had moments where I've looked at an image and had a computer identify a species I, you know, on first glance didn't even recognize was there. So there's some animals that the AI is 99% on all of the time. However, if we don't have a lot of labeled images, the algorithm doesn't do so well. So there's a lot of these cryptic rare little species, aardvarks, caracals, servals, genets, and when we only have a couple of hundred images of those, the computer can get a little wonky just because it doesn't know what to look for.
计算机视觉系统乃至所有AI系统的表现都完全取决于其训练数据的质量,这也解释了为什么AI有时在识别那些处于动物王国顶端的生物时会遇到困难。
Computer vision systems and pretty much all AI systems come to that are only as good as the data they're trained on, which also explains why the AI sometimes struggles with those creatures at the top of the animal kingdom.
我们在使用志愿者分类数据时发现一个问题:人们实在太想看到狮子了。有些图像里明明是只晒太阳的疣猪,我们的公民科学家却信心十足地将其标记为'包含一头非常威武的雄狮'。很遗憾,那真的只是只疣猪。我们虽然建立了大量制衡机制来确保为AI提供最优质的标记图像数据,但难免会有漏网之鱼。
One issue that we discovered using human volunteers to classify our data is that people really, really, really want to see lions. We've had images where it's a warthog sunning itself, and our citizen scientists have very confidently classified that image as containing one very impressive male lion. And no, I'm sorry, it's not. It's a warthog. We do have a lot of checks and balances for making sure that we're feeding the AI the very best labeled image data that we can, but some of these things slip through the cracks.
塞伦盖蒂的生态学家们已经开始在监测工作中看到成效。坦桑尼亚曾经拥有数量庞大的黑犀牛种群,如今已减少到不足百头。2019年,九头黑犀牛被重新引入格鲁梅蒂禁猎区,这是塞伦盖蒂生态系统的一部分,正通过相机陷阱与AI技术联合进行监测。
The community of Serengeti ecologists are already starting to see the benefits in their monitoring efforts. Tanzania used to have an enormous population of black rhinoceros, but that has dwindled to fewer than 100. In 2019, nine black rhinoceros were reintroduced into the Grumeti Game Reserve, a part of the Serengeti ecosystem that is monitored by camera traps in conjunction with AI technology.
我们将能够观察迁徙路线是否改变,同时记录新的大型食草动物对周边人类社区的影响。我们已经记录到多起犀牛相互交往的案例,因此预计未来一两年内可能会有几头小犀牛诞生。
We'll be able to look at if migration routes change. We'll also be able to document any impacts of new mega herbivore effects on the surrounding human communities. We've documented several of the rhinoceros consorting with each other. So we anticipate that we're probably going to have a couple of baby rhinoceros to look forward to in the next year or two.
对梅雷迪思而言,未来的方向不是让人工智能取代生态学家,而是辅助他们。
For Meredith, the future is not for AI to replace ecologists, but to complement them.
迄今为止的生态学研究方式——拿着双筒望远镜和记事本在野外整天观察一头疣猪——效率实在太低。人工智能的辅助越多,我们就能越快识别导致生态系统变化的根源问题,从而制定更有针对性的保护措施来拯救、保护和重建这些生态系统。
The way we've been doing ecology to date, you know, out there in the field with our binoculars and our notepads watching, you know, a single warthog all day, like, it's not fast enough. And the more that we can get a helping hand from AI, the faster we can identify the problems that are driving change in the ecosystem, the more targeted conservation interventions that we can make to save and protect and rebuild these ecosystems.
我们了解到计算机视觉系统正被用于大规模解析生物和野生动植物数据。但正如AlphaFold所示,DeepMind也在利用人工智能从微观尺度理解生命。实验室里,研究人员正尝试基因组学这一专注于人类基因和DNA的生物学领域。DeepMind的基因组研究仍处于初期阶段,那么普什米特·科利对这项研究的未来有何期许?
We've heard how computer vision systems are being used to make sense of biology and wildlife on a vast scale. But as we heard with AlphaFold, DeepMind is also using AI to understand life at a microscopic scale. Back in the science lab, researchers have been trying their hand at genomics, an area of biology dedicated to understanding human genes and DNA. DeepMind's genomics work is still in its very early stages. So what are Pushmeet Kohli's hopes for the future of this research?
其影响可能非常广泛。如果真正理解生物学,就能掌握所有细胞(包括癌细胞)的行为规律。这样或许能开发出更好的癌症疗法,甚至找到制造移植器官的新方法。
The implications could be very wide ranging. If you truly understand biology, then you understand how all cells behave, even cancer cells. So you might have better treatment for cancer. You might have better ways of creating new organs for transplantation.
你这里描述的其实是整个医疗健康领域潜在的革命性突破。
What you're describing here really though is a potential step change in all of healthcare and medicine.
这正是真正理解生物学将带来的影响。从某种意义上说,我们一直在观察生物学现象,但要真正驾驭生物学规律,人类才刚刚开始认识到其深远意义。
That is the implication of truly understanding biology. We have in some sense been observing biology, but to truly leverage biology, that's something that humanity is only starting to understand the implications of.
那么,你准备好了解最新的科学发现了吗?如果你愿意,我也很乐意参与。
So, are you ready to get up to speed on a brand new bit of science? I'm game if you are.
我最初是数学家出身,但在思考职业方向时,对花费三年研究管道流体流动这个课题实在提不起兴趣。无意冒犯……
I started out life as a mathematician, but when I was thinking about what to do, wasn't so taken with the idea maybe of studying fluid flow down a pipe for three years. No offense to...
冒犯的是我吧,我当年研究的恰恰就是这个。
To me, which is exactly what I did.
抱歉介绍一下,这位是莎拉·简·邓恩,一位从事基因组学项目研究的杰出科学家。
Sorry. This is the delightful Sarah Jane Dunn, a research scientist working on the genomics project.
基因组学是生物学的一个领域,主要研究基因型(即你的DNA序列)与表型之间的关联。这涵盖从你的身高、发色到是否患有特定疾病等所有特征。
Genomics, it's a domain of biology and it's most interested in understanding the connection between your genotype, that's basically your DNA sequence, and phenotype. And that's everything from how tall you are to what color hair you have to maybe whether you have some particular disease or not.
基因型是DNA中呈现的信息,表型是实际表现的结果。我携带MC1R基因型的RS185005位点,这赋予我红发表型;而CFTR基因的delta F508突变则会导致囊性纤维化表型。不过这些都是基因型与表型关联明确且已被证实的案例。
Genotype is what you see in the DNA. Phenotype is what you get. I have the MC1R genotype RS185005, which gives me the phenotype of red hair. Genotype CFTR delta F508 gives the phenotype of cystic fibrosis. Those, however, are the examples where the connection between genotype and phenotype is clear and established.
遗憾的是,这类案例属于例外而非普遍规律。以BRCA1基因为例,虽然它与乳腺癌发病相关,但并非所有携带该突变的人都会患病。目前尚无法判断哪些携带者风险更高,这导致部分BRCA1基因女性选择预防性手术。我们希望通过更深入的基因组学研究,能提供关于患病风险的重要补充信息。
Unfortunately, such examples are the exception rather than the rule. Take, for example, the BRCA1 gene, which is linked to the development of breast cancer. Not all people with the BRCA1 mutation will go on to develop breast cancer, but there's currently no way of knowing which people with the gene are more at risk. In some cases, this has led some women with the BRCA1 gene to have surgery as a precaution. The hope is that a deeper understanding of genomics may provide important additional context about what puts someone at increased risk or not.
如果我们能进一步解析你DNA序列的其他部分如何揭示疾病发展进程,或许就能对此类情况做出更精细的诊断。展望未来,如何在疾病实际发作前进行干预?如何开发最有效的快速治疗方案?这些都是我们接下来要探索的问题。
If we could unpack more about how the rest of your DNA sequence can tell us about the progression of that disease, then we might be able to make more nuanced diagnoses about these kind of things. And then projecting into the future, how might you be able to change that before the disease actually develops? How might you be able to develop the most effective quick treatments? Those are all questions that are open to us after that.
基因组学研究建立在人类基因组计划这一重要基础上。该项目始于1990年,耗时十年对一名匿名人类的完整DNA序列——即其基因组——进行了逐个分子测序。30亿个化学碱基对A、T、C、G的排列组合形成了约2万个基因及更多非编码序列,从而提供了制造人类的完整说明书。但在某些方面,这仅仅是起点。
Research in genomics builds upon an important foundation known as the Human Genome Project. Starting in 1990, the project took the full set of DNA of one anonymous human, their genome, and over the course of a decade went through every molecule one by one. Every single one of the 3 billion chemical base pairs A, T, C and G that combine together to give some 20,000 or so genes and a whole lot more besides. Thus providing the entire manual for how to make a human. But in some ways, this was just the starting point.
人类基因组计划成功汇编了这部人类生命食谱。你买到了这本书,里面写满了文字,却看不懂内容——这正是科学家们正在努力破解的难题。
The Human Genome Project successfully was able to compile this whole sort of cookbook for a human. You have bought this book and there's writing in it, but you don't know how to read it. That's what scientists are sort of struggling with.
顺便说,你可以把《神奇人类食谱》当作真实存在的书。人类基因组计划的早期版本曾作为艺术项目被印刷装订,堆起来足有20本电话簿那么厚,每页都布满了模糊的A、T、C、G字母。问题就在于:我们仍不清楚这些碱基如何组合形成人体细胞、组织和器官。这就像一份无序的食材清单。
You can think of The Amazing Human Cookbook as a literal book, by the way. One of the early versions of the Human Genome Project was printed and bound as part of an art project, and it stacked up as some 20 phone book size volumes, where every page inside was a blur of As, Ts, Cs, and Gs. And therein lies part of the problem. We still have very little idea about how much of it comes together to form the cells, tissues and organs in the human body. It is a list of ingredients, but it's not in order.
而这正是我们试图理解的——如何破译实际的'烹饪过程'。
And that's what we are trying to understand. How do you decipher the actual sort of cooking process?
就像你手头有一端是面粉、鸡蛋、牛奶和一头牛,另一端却变成了牛肉馅饼。你会想:这中间到底发生了什么鬼?
So you've got one end, some flour, eggs, milk, and a cow. Yeah. And at the other end, you've got a beef pie. You're like, what the hell happened in between?
没错,正是如此。
Yeah, exactly.
在人类体内,实际上只有约2%的DNA用于基因编码。剩下98%相当于20多本电话簿厚度的生物指令,被官方称为非编码区。非正式场合下,生物学家们称其为
In humans, only about 2% of our DNA actually goes towards our genes. The remaining 98% of those 20 odd phone books worth of biological instructions are what they officially call the non coding region. Less officially, biologists have referred to it
垃圾DNA,这反映出我们曾经对其多不重视。但人类基因组计划完成后,人们发现这些所谓的垃圾DNA实际上非常重要。
as junk DNA, you know, that shows how much we were interested in it. But after the Human Genome Project was completed, it became obvious that this so called junk DNA was actually really important.
如今我们已了解部分垃圾DNA的功能。有些是结构性物质,比如细胞运作时所需的支架;有些纯粹是无意义的随机序列;但大部分是基因开关——一系列调控基因何时工作的调节器。
By now, we know what some of the junk does. Some of it is structural stuff, like the scaffolding needed for when a cell is working. Some of it is just random nonsense. But a lot of it is on and off switches, a series of regulators that tell genes when to get to work.
有趣的是,虽然你体内所有细胞共享同一套基因组,但它们以不同方式利用这些基因——通过选择性表达某些基因来实现差异化。
What's interesting is that, you know, all of the cells in your body share that genome, but they make use of it in different ways. And they do that by expressing some genes and not others.
每个细胞都包含全部基因的完整副本,但任何时候只需激活少量基因。在已停止生长的组织中,促进细胞分裂的基因若被激活可能导致肿瘤;而负责毛发或牙齿生长的基因在肝脏或心脏细胞中毫无作用。这项研究的希望在于:若能理解这些基因开关机制,结合已知的基因和蛋白质知识,将帮助科学家更深入理解某些疾病在体内的发生机制。以镰状细胞病为例——这是一种影响红细胞的遗传性疾病。
Every cell contains a complete copy of every gene, but will only require a handful of genes to be active at any one time. A gene that prompts cell division is no use in a tissue which has finished growing; otherwise, it could lead to a tumour. A gene that is active in building hair or teeth has no role in the cells that make up the liver or the heart. The hope in all this is that if you can understand the on and off switches, as well as everything we already know about genes and proteins, it can help biologists to better understand the mechanism by which certain diseases take hold in the body. Take, for example, sickle cell disease, an inherited health condition affecting red blood cells.
这种疾病在患者表型上存在特定征兆。在显微镜下观察,患者的红细胞会呈现新月形的特征形态。得益于人类基因组计划,我们还发现镰状细胞贫血患者的DNA存在某些共同变异。基因组学能帮助解决的未知领域是:这些非编码DNA的字母序列如何影响疾病在单个细胞中的发展进程。
There are, of course, certain signs of this disease in someone's phenotype. If you look at the red blood cells under a microscope, those that belong to someone with the disease will take on a characteristic shape, like a crescent moon. Thanks to the Human Genome Project, we now also know that there are certain changes in the DNA that are common among people with sickle cell anemia. What's still unknown, and where genomics can help, is how that sequence of letters in someone's non coding DNA translates into how the disease will progress in individual cells.
若要理解这种疾病在体内的发生机制并最终找到治疗方法,就必须层层剖析这些信息:需要明确DNA序列变异对细胞的具体影响——是否导致某些基因无法表达?这能否为我们提供药物研发的靶点?本质上是从疾病的最初线索出发,追溯其如何从DNA基础层面逐渐显现。
If you're wanting to then understand how this disease actually arises in the body and then ultimately how you might treat it, you need to go through these layers of information. You need to understand, well, what does that change in the DNA sequence actually do within the cell? Does it mean that the cell can no longer express certain genes? Does this give us targets for what we might be able to attack in that cell to develop kind of drug treatments? So it's about understanding from that very first clue about what might give someone a disease, how it kind of bubbles up from that basic level of DNA.
这正是AI与DeepMind在基因组学领域工作的契合点。这方面已有先例可循。科学家们已破解了一系列DNA序列与细胞功能关联的特殊案例,这些研究为DeepMind进军基因组学研究奠定了坚实的生物数据基础。
That is where AI and DeepMind's work on genomics fits in. There is some precedent here already. Scientists have a whole series of special cases where they've already cracked the connection between DNA sequences and cell function. These studies all serve as a solid foundation of biological data for DeepMind's foray into genomics research.
我们正致力于构建能真正理解基因组的AI。事实上,我们现在拥有的数据规模已足以让深度学习大展拳脚。我们已成功构建出能像阅读文本一样解读DNA序列的深度学习模型。
We're trying to build AI that essentially can understand the genome. The fact is we now have the kind of level of data where deep learning can really take hold. And we've been able to build together the kind of deep learning models that are able to read a DNA sequence a bit like you might read a piece of text.
作为基因组学研究的首步,DeepMind采用了机器学习领域较新的Transformer架构。可以将其视为一种翻译器:它读取DNA字母串,根据A/T/C/G的碱基序列,将其转化为基因实际表达方式的预测。就像语言翻译需要聚焦句子关键词那样,这种架构同样应用于DNA——它在遍历代码时,会抓住序列中最可能提供相关上下文的核心片段。
As a first step in its genomics work, DeepMind is using a relatively new idea in the world of machine learning, a type of architecture called a transformer. It's useful to think of it as a sort of translator. It reads in a string of letters from DNA, and based on the sequence of As and Ts and Cs and Gs, it will translate it into a prediction for how the genes will actually manifest themselves. To do language translation well, you need to pay most attention to the keywords in a sentence, and the transformer uses the same idea applied to DNA. As it runs through the code, it holds onto the most important parts of the sequence, the bits that are most likely to provide the relevant bits of context.
而对于DNA中相当于'和''在''定冠词'之类的冗余片段,则会降低关注度。
And it pays less attention to all the filler bits, the DNA equivalent of all of the ands and the ins and the thes in a sentence.
要做出越来越精准的预测就需要更广阔的上下文,因为某些影响因素可能分布在DNA序列更远端的位置。因此我们构建的Transformer架构能处理约十倍于以往的DNA序列长度。
You need more context to be able to make better and better predictions because some of the things that can influence what's happening are positioned further and further along the DNA sequence. And so the transformer architecture that we built is able to swallow in about 10 times as much of that DNA sequence.
这比以往AI模型能读取的DNA序列长度多出十倍。
That's 10 times more of the DNA sequence than previous AI models could read.
通过扩展模型对DNA序列的上下文理解范围,我们成功提高了从序列预测基因表达的准确率。
By being able to extend how much context the model had about the DNA sequence, we were able to increase the accuracy of our predictions about gene expression from sequence.
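[Editor's illustration] The "pay more attention to some parts of the sequence than others" idea can be made concrete with a toy example. What follows is a minimal sketch, not DeepMind's model: it assumes a one-hot encoding of the four bases and uses the raw encodings as queries, keys, and values, whereas a real transformer learns those projections from data. The function names (`one_hot`, `self_attention`) and the example sequence are hypothetical, chosen only for illustration.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (assumed encoding)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys, and values are all the raw input vectors.
    A trained transformer would apply learned projections first.
    """
    scale = math.sqrt(len(vectors[0]))
    weights, outputs = [], []
    for query in vectors:
        # How strongly this position attends to every position in the sequence.
        row = softmax([dot(query, key) / scale for key in vectors])
        weights.append(row)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(len(query))]
        )
    return weights, outputs


sequence = "ACGTAC"  # toy input, far shorter than any real DNA window
weights, outputs = self_attention(one_hot(sequence))
for base, row in zip(sequence, weights):
    print(base, [round(w, 2) for w in row])
```

In this untrained sketch each position simply attends most to positions holding the same base. Extending the context, as described above, corresponds to making `sequence` longer, so each position has more of the DNA to attend over.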
这一切听起来相当积极。但在我们将机器学习或人工智能吹捧为解决生物学和科学领域所有问题的万能良药之前,必须也要发出一些警示。遗憾的是,现实中并不存在一个贴着'机器学习'标签的按钮——你不可能随便收集些陈旧数据,按下按钮就能产出伟大的科学成果。
This all sounds pretty positive. But before we make it sound like machine learning or AI is some sort of silver bullet for all problems in biology and science, it's important to sound a note of caution too. Unfortunately, there isn't a big button labeled machine learning in the corner where you can just gather any old data, hit the button, and get out great science.
要有效运用人工智能,关键在于提出正确的问题。对我而言最大的挑战在于:比如已知领域内存在一个重大问题——如何将祖细胞分化为心脏细胞,因为我们希望在实验室培育它们来治疗心脏病患者。但如何将这类问题与人工智能有效结合?首先需要理解应该提出什么样的问题,其次要明白哪种算法架构能处理这类数据。此外,还要判断模型是在改进还是恶化?
To use AI effectively, you have to ask the right question. That for me is one of the biggest challenges to say, okay, I know that there's a big question in the field about how we can differentiate a progenitor cell into a heart cell because we'd like to be able to make them in the lab so that we can treat people with heart disease. But how can you take something like that and then wield AI effectively? It's being able to understand, first of all, what kind of question you should be asking and then what kind of architecture is going to be able to deal with those data. But then also, how do you know if your model's getting better or worse?
因此这确实仍是一个研究中的领域。
So it very much is still an area of research.
使用人工智能的另一个问题在于:尽管它是精密的模式发现工具,但并不总能解释这些模式出现的原因。我们是否应该对在未真正理解因果机制的情况下做出预测感到担忧?
The other thing about using AI, the sophisticated pattern hunter that it is, is that it doesn't always give you a good answer for why those patterns appear. Do we have to be nervous about making predictions without really understanding the causal mechanisms?
是的,确实如此。但这并不意味着这类工具没有价值。人工智能的优势在于能帮助我们应对生物学带来的复杂性。处理生物数据时,我们常面临充满噪声和实验误差干扰的数据。如果工具能越过这些数据波动提取出真正有意义的模式,那么关键在于我们如何严谨地解读结果。
Yes, I think so. But it doesn't mean that these kinds of tools aren't valuable. The thing I like about AI is that it can allow us to handle some of the complexity that biology throws at us. And often when we're dealing with biological data, we're dealing with very noisy data and data that has been perturbed by just experimental error. And so if we have the tools that can somehow surf over some of those bumps in the data and try to really pull out those meaningful patterns, then the thing that we have to be really rigorous about is how we interpret it.
它并非唯一可用的工具,但如果能为我们指明正确方向,推动研究向前迈进一小步,这本身就很有价值。
And it's not the only tool that should be deployed, but if it gives us pointers in the right direction, if it steps us forward that one piece, then that is still valuable.
但计算机科学家贸然闯入新科学领域试图开疆拓土时,情况就不同了。正如我们所见,AlphaFold等机器学习技术和基因组学研究都严重依赖通过艰苦实验获得的现有科学数据。那么当其他科学家听说AI可能应用于他们的领域时,会作何反应呢?
But it's all very well for computer scientists to turn up in some new scientific domain hoping to plant their flag. As we've heard, machine learning techniques like AlphaFold and this genomics work build heavily on existing scientific data that is often obtained using painstaking experimental methods. So how do other scientists respond when they hear that AI could be applied to their domain?
科学家们虽然竞争激烈,但协作性也很强。如果他们确信你提出的方法值得探索,就会有兴趣倾听你的观点。
Scientists are a very competitive bunch, but they are very collaborative as well. If they are convinced that the approach that you are proposing is an approach worth pursuing, they are interested in listening to you.
不过你有没有遇到过难以说服别人的情况?
Have you ever struggled to persuade people, though?
当我们开始尝试将机器学习和人工智能应用于纯数学领域时,遇到了一些对此非常热衷并成为优秀合作者的数学家。但也有少数人基于原则认为数学是纯粹的人类事业,机器在其中不应有任何角色。
So when we started looking at applying machine learning and AI to pure mathematics, we encountered a number of mathematicians who were very, very keen and became really good collaborators. But there were a few who, on matters of principle, thought that mathematics is a very human endeavor and a machine has no role to play in it.
我太震惊了,普什米特。没想到有些纯粹数学家居然不愿拥抱未来。
I'm shocked, Pushmeet. I'm shocked that some pure mathematicians might not welcome the future with open arms.
你可能好奇那个纯数学项目究竟是什么。首先要知道,在纯数学领域,证明才是至高无上的。比如有人提出'质数有无限个'这样的猜想,但只有通过逻辑推理链证明其永恒正确性,这个猜想才有意义。
You're probably wondering what that pure mathematics project was all about. Well, the first thing that you need to know is that in pure mathematics, proof reigns supreme. Someone might have a conjecture, like there are an infinite number of prime numbers, for instance. But that means nothing until, by a chain of logical statements, it can be proven to be indisputably true for all eternity.
我们关注的是计算机是否具备提出新猜想的直觉能力。我们与一些合作者开始研究纽结理论和拓扑学的不同方面,这些领域被认为存在某种联系,但尚未有人发现具体关系。
So what we were interested in is can a computer have that intuitive ability to propose a new conjecture? What we started looking at with some of our sort of collaborators is about different aspects of knot theory and topology, which was suspected to have some interactions, but nobody had sort of discovered those relationships.
听到纽结理论这个名字你大概不会意外,它就是研究数学意义上的绳结。虽然名称听起来不刺激,但请相信我,这个领域就像用想象中的绳子进行弯曲扭转的奇妙游乐场。描述这些绳结有代数法和几何法两种方式,两者间的转换相当困难。直到人工智能开始寻找其中的联系。
Knot theory, you will not be surprised to hear, is all about the mathematical study of knots. I know the name might not make it sound exciting, but you're just gonna have to take my word for it on this one that the whole field is a glorious playground of bending and twisting loops of imaginary string. There are different ways to describe these knots, an algebraic way and a geometric way. And it's quite hard to translate from one to the other. But then AI started hunting for connections.
这催生了一个前所未见的新猜想。不仅如此,人类数学家随后还成功证明了这个猜想。所以在某种意义上,实际上是人类在证明计算机提出的猜想。
And that resulted in a new conjecture which had never been seen before. And not only that, the human mathematicians then managed to prove that conjecture. So in some sense, it's basically humans who are proving the conjectures that a computer has come up with.
这说服了那些脾气暴躁的纯数学家吗?
Did that persuade the grumpy pure mathematicians?
我想我们没再回头找他们。
I don't think we went back to them.
以上只是DeepMind当前重点攻关的部分科学应用案例。从生态学到基因组学,从蛋白质折叠到核聚变,人工智能即将在短期内改变我们生活的方式已日益清晰。还有更多项目我们无法在本期节目中一一涵盖。若想了解更多关于'AI for Science'计划的信息,请查看节目说明或访问deepmind.com。下期节目我们将继续探索AI应用如何影响现实世界,从天气预测到语音合成。
And that is just a flavour of some of the science applications that DeepMind is currently hard at work on. From ecology to genomics, protein folding to nuclear fusion, the ways in which AI could make a difference to our lives in the near future are increasingly clear. And there are so many more projects that we didn't have time to cover in this episode. So, if you'd like to find out more about the AI for Science programme, do take a look at the show notes or check out deepmind.com. In the next episode, we'll continue to discover how the applications of AI could have an impact on the real world, from weather prediction to voice synthesis.
我是一名痴迷于人工智能的数学家、作家兼播客主持人。
I'm a mathematician, author, and podcaster who's fascinated by artificial intelligence.
真的很棒。我知道。确实很棒。
It's really good. I know. It's good.
我呼吸声有那么重吗?天呐。您正在收听的是DeepMind播客,我是汉娜·弗莱,本系列节目由Whistledown Productions的丹·哈东担任制片人。我们很快会再见面。
Am I that breathy? Bloody hell. You've been listening to DeepMind, the podcast. I'm Hannah Fry, and the series producer is Dan Hardoon at Whistledown Productions. We'll be back soon.