本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
你们在不到四年时间里,用大约60到70人实现了10亿美元的收入。
You guys hit 1,000,000,000 in revenue in less than four years with around 60 to 70 people.
你们完全是自力更生,没有拿过任何风投的钱。
You're completely bootstrapped, haven't raised any VC money.
我认为之前从未有人做到过这一点。
I don't believe anyone has ever done this before.
我们基本上从不想玩硅谷那套游戏。
We basically never wanted to play the Silicon Valley game.
那种方式听起来总是很荒谬。
That always sounds ridiculous.
我曾在多家大型科技公司工作过,总觉得如果我们裁掉90%的人,反而能更快前进——因为最优秀的人才不会被各种杂事分心。
I used to work at a bunch of the big tech companies, and I always felt that we could fire 90% of people, and we would move faster because the best people wouldn't have all these distractions.
所以当我们创立Surge时,就决心用超级精简的精英团队走完全不同的路线。
So when we started Surge, we wanted to build it completely differently with a super small, super elite team.
你们绝对是目前最成功的数据公司。
You guys are by far the most successful data company out there.
我们本质上是在教AI模型分辨好坏。
We essentially teach AI models what's good and what's bad.
人们甚至不理解这个领域里质量意味着什么。
People don't understand what quality even means in this space.
他们以为只要堆人力就能得到优质数据。
They think you could just throw bodies at a problem and get good data.
这种想法完全错误。
That's completely wrong.
对普通人来说,这些模型看起来并没有持续变得更聪明。
To a regular person, it doesn't feel like these models are getting that much smarter constantly.
过去一年我意识到,公司的价值观会塑造模型的行为。
Over the past year, I've realized that the values that the companies have will shape the model.
前几天我让Claude帮我写封邮件。
I was asking Claude to help me draft an email the other day.
经过三十分钟后,没错,当我发出那封邮件时,它确实帮我写出了完美的内容。
And after thirty minutes, yeah, I think it really crafted me the perfect email when I sent it.
但后来我意识到,我花了三十分钟做了一件毫无意义的事。
But then I realized I spent thirty minutes doing something that didn't matter at all.
如果让你选择完美的模型行为,你会想要哪种模型?
If you could choose the perfect model behavior, which model would you want?
是想要一个只会说‘你说得完全正确’的模型吗?
Do you want a model that says, you're absolutely right?
还是那种会告诉你‘这封邮件绝对还有20种改进方式’,然后继续迭代50次的模型?
There are definitely 20 more ways to improve this email, and it continues for 50 more iterations?
或者你想要一个为你的时间和效率优化,直接说‘不’的模型?
Or do you want a model that's optimizing for your time and productivity and just says, no.
你需要停下来。
You need to stop.
你的邮件很棒。
Your email's great.
直接发送然后继续工作吧。
Just send it and move on.
你有个犀利观点认为许多实验室正在把通用智能引向错误方向。
You have this hot take that a lot of these labs are pushing AGI in the wrong direction.
我担心我们不是在开发真正能推动人类进步的AI——治愈癌症、解决贫困、理解宇宙,而是在优化AI垃圾内容。
I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead.
我们将为那些在杂货店买小报的人优化模型。
We'll be optimizing our models for the types of people who buy tabloids at the grocery store.
我们本质上是在教模型追逐多巴胺而非真相。
We're basically teaching our models to chase dopamine instead of truth.
今天,我的嘉宾是Surge AI的创始人兼首席执行官Edwin Chen。
Today, my guest is Edwin Chen, founder and CEO of Surge AI.
Edwin是一位非凡的CEO,Surge也是一家非凡的公司。
Edwin is an extraordinary CEO, and Surge is an extraordinary company.
他们是领先的AI数据公司,为前沿AI实验室的培训提供支持。
They're the leading AI data company powering training at every frontier AI lab.
他们还是史上最快实现10亿美元营收的公司,成立仅四年就达成这一目标,员工不足百人,并且完全自筹资金。
They are also the fastest company to ever hit $1,000,000,000 in revenue in just four years after launch, with fewer than 100 people, and also completely bootstrapped.
他们从未拿过一分钱的风险投资。
They've never raised a dollar in VC money.
他们从成立第一天起就实现了盈利。
They've also been profitable from day one.
正如你将在这次对话中听到的,埃德温对于如何打造一家重要公司、如何构建真正对人类有益且实用的人工智能,有着非常独到的见解。
As you'll hear in this conversation, Edwin has a very different take on how to build an important company and how to build AI that is truly good and useful to humanity.
我无比喜爱这次对话,并从中受益匪浅。
I absolutely loved this conversation, and I learned a ton.
我真的很期待你们能听到这段对话。
I am really excited for you to hear it.
如果你喜欢这个播客,别忘了在你常用的播客应用或YouTube上订阅关注。
If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube.
这对我们有莫大帮助。
It helps tremendously.
如果你成为我通讯的年费订阅用户,你将免费获得一整年的大量优质产品使用权,包括Devin、Lovable、Replit、Bolt、n8n、Linear、Superhuman、Descript、Wispr Flow、Gamma、Perplexity、Warp、Granola、Magic Patterns、Raycast、ChatPRD、Mobbin、PostHog以及Stripe Atlas。
And if you become an annual subscriber of my newsletter, you get a ton of incredible products for free for an entire year, including Devin, Lovable, Replit, Bolt, n8n, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, PostHog, and Stripe Atlas.
前往lennysnewsletter.com并点击Product Pass。
Head on over to lennysnewsletter.com and click Product Pass.
说完这些,在赞助商简短插播后,我将为您带来埃德温·陈的分享。
With that, I bring you Edwin Chen after a short word from our sponsors.
我和我的播客嘉宾都喜欢探讨工艺、品味、能动性以及产品市场契合度。
My podcast guests and I love talking about craft and taste and agency and product market fit.
你知道我们最不喜欢讨论什么吗?
You know what we don't love talking about?
SOC 2 认证。
SOC 2.
这就是Vanta的用武之地。
That's where Vanta comes in.
Vanta帮助各种规模的公司快速实现合规并持续保持,其采用行业领先的AI技术、自动化流程和持续监控。
Vanta helps companies of all sizes get compliant fast and stay that way with industry leading AI, automation, and continuous monitoring.
无论您是初创公司首次应对SOC 2或ISO 27001认证,还是企业需要管理供应商风险,Vanta的信任管理平台都能让流程更快速、更简单且更具扩展性。
Whether you're a startup tackling your first SOC 2 or ISO 27001 or an enterprise managing vendor risk, Vanta's trust management platform makes it quicker, easier, and more scalable.
Vanta还能帮助您以五倍速度完成安全问卷,让您更快赢得大单。
Vanta also helps you complete security questionnaires up to five times faster so that you could win bigger deals sooner.
结果如何?
The result?
根据IDC最新研究,Vanta客户每年节省超50万美元,效率提升三倍。
According to a recent IDC study, Vanta customers save over $500,000 a year and are three times more productive.
建立信任不是可选项。
Establishing trust isn't optional.
Vanta让它自动化实现。
Vanta makes it automatic.
访问vanta.com/lenny立享1000美元优惠。
Get $1,000 off at vanta.com/lenny.
给你出个谜题。
Here's a puzzle for you.
OpenAI、Cursor、Perplexity、Vercel、Plaid等数百家成功企业有何共同点?
What do OpenAI, Cursor, Perplexity, Vercel, Plaid, and hundreds of other winning companies have in common?
答案是它们都由今天的赞助商WorkOS提供支持。
The answer is they're all powered by today's sponsor, WorkOS.
如果你正在为企业构建软件,可能已经体会过集成单点登录、SCIM、RBAC、审计日志等大客户所需功能的痛苦。
If you're building software for enterprises, you've probably felt the pain of integrating single sign on, SCIM, RBAC, audit logs, and other features required by big customers.
WorkOS将这些交易障碍转化为即插即用的API,专为B2B SaaS打造的现代开发者平台。
WorkOS turns those deal blockers into drop in APIs with a modern developer platform built specifically for b to b SaaS.
无论你是寻求首个企业客户的初创公司,还是全球扩张的独角兽企业,WorkOS都是最快实现企业级准备和释放增长潜力的途径。
Whether you're a seed stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise ready and unlocking growth.
本质上,WorkOS就是企业功能领域的Stripe。
They're essentially Stripe for enterprise features.
访问workos.com开始使用,或直接联系他们的Slack支持——那里有真正的工程师会极速解答你的问题。
Visit workos.com to get started or just hit up their Slack support where they have real engineers in there who answer your questions super fast.
WorkOS让你能像顶级开发者一样构建应用,提供愉悦的API、全面的文档和流畅的开发体验。
WorkOS allows you to build like the best with delightful APIs, comprehensive docs, and a smooth developer experience.
立即访问workos.com,让你的应用具备企业级准备。
Go to workos.com to make your app enterprise ready today.
埃德温,非常感谢你能来参加我们的播客节目,欢迎你。
Edwin, thank you so much for being here, and welcome to the podcast.
非常感谢邀请我。
Thanks so much for having me.
我超级兴奋。
I'm super excited.
我想先谈谈你取得的成就有多么不可思议。
I wanna start with just how absurd what you've achieved is.
许多人和公司都在谈论借助AI以极少数员工实现业务大规模扩张,而你们以史无前例的方式做到了这一点。
A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI, and you guys have done this in a way that is unprecedented.
你们在不到四年时间里,用大约60到70人的团队就实现了10亿美元的收入。
You guys hit 1,000,000,000 in revenue in less than four years with around 60 to 70 people.
你们完全是自主创业,没有拿过任何风险投资。
You're completely bootstrapped, haven't raised any VC money.
我认为之前从未有人做到过这样的事。
I don't believe anyone has ever done this before.
所以你们实际上正在实现人们描述中AI将带来的梦想。
So you guys are actually achieving the dream of what people are describing will happen with AI.
我很好奇,你认为这种情况会因为AI而越来越普遍吗?
I'm curious just do you think this will happen more and more as a result of AI?
还有,AI在哪些方面最帮助你们找到了实现这一目标的杠杆?
And also just where has AI most helped you find leverage to be able to do this?
是的。
Yeah.
我们去年以不足100人的团队实现了超过10亿美元的收入。
So we hit over 1,000,000,000 of revenue last year with under 100 people.
我认为未来几年我们会看到人效比更疯狂的公司,比如人均1000亿美元。
I think we're going to see companies with even crazier ratios, like 100,000,000,000 per employee in the next few years.
AI只会变得越来越好,让一切更高效。
AI is just going to get better and better and make things more efficient.
这种人效比提升的趋势将变得不可避免。
So that ratio just becomes inevitable.
比如,我曾在多家大型科技公司工作过,总觉得如果能裁掉90%的员工,我们反而能跑得更快,因为最优秀的人才总是被各种琐事分心。
Like, I used to work at a bunch of the big tech companies, and I always felt that we could fire 90% of people and we would move faster because the best people wouldn't have all these distractions.
所以当我们创立Surge时,就决心用一支超精简、超精英的团队来彻底改变传统模式。
And so when we started Surge, we wanted to build it completely differently with a super small, super elite team.
而最疯狂的是,我们居然真的成功了。
And yeah, what's crazy is that we actually succeeded.
我认为现在有两股趋势正在交汇。
And so I think two things are colliding.
第一是人们逐渐意识到,要取得成功并不需要构建庞大的组织架构。
One is that people are realizing that you don't have to build giant organizations in order to win.
第二嘛,就是AI带来的所有这些效率提升。
And two, yeah, all these efficiencies from AI.
这些都将引领企业建设进入一个真正令人惊叹的时代。
They're just going to lead to a really amazing time in company building.
最让我兴奋的是,未来企业的形态也将发生根本性改变。
The thing I'm excited about is that the types of companies are going to change too.
不仅仅是规模会更小。
It won't just be that they're smaller.
我们将看到本质上完全不同的公司涌现。
We're going to see fundamentally different companies emerging.
想想看,员工越少意味着所需资本越少。
If you think about it, fewer employees means less capital.
资本需求降低意味着无需融资。
Less capital means you don't need a raise.
因此取代那些擅长路演炒作的创始人,会出现更多精通技术或产品的创业者。
So instead of companies started by founders who are great at pitching and great at hyping, you'll get founders who are really great at technology or product.
产品不再为取悦风投而优化营收,而是由这些精益团队打造出更多有趣的产品。
And instead of products optimized for revenue and what VCs wanna see, you'll get more interesting ones built by these tiny obsessed teams.
人们将真正专注于他们关心的技术,真正的技术创新。
So people building things they actually care about, real technology, real innovation.
所以我真心希望硅谷能重新成为黑客们的天地。
So I'm actually really, really hoping that Silicon Valley will actually go back to being a place for hackers again.
你们以非常反传统的方式做了很多事情,其中之一就是没有在LinkedIn上发病毒式帖子,也没有在Twitter上不断推广Surge。
You guys have done a lot of things in a very contrarian way, and one was actually just not being, like, on LinkedIn posting viral posts, not on Twitter, constantly promoting Surge.
我想大多数人直到最近才听说Surge,然后你们就突然出现了。
I think most people hadn't heard of Surge until just recently, and then you just came out.
我当时就想,好吧。
I'm like, okay.
这家公司以十亿美元估值成为增长最快的企业。
The fastest growing company at a billion dollars.
你们为什么要这么做?
Why would you do that?
我猜这是非常刻意的。
I imagine that was very intentional.
我们基本上从不想玩硅谷那套游戏。
We basically never wanted to play the Silicon Valley game.
我一直觉得这很荒谬。
And like I always thought it was ridiculous.
你小时候梦想做什么?
Like what did you dream of doing when you were a kid?
是亲手从零开始创建公司,每天埋头于代码和产品的细节中吗?
Was it building a company from scratch yourself and getting into the weeds of your code and your product every day?
还是向风投解释所有决策,陷入公关和融资的仓鼠轮里?
Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel?
这确实让我们处境更艰难,因为一旦融资,就自然卷入硅谷产业体系——你的风投会发推宣传你。
And it definitely made things more difficult for us because, yeah, when you fundraise, you just naturally become part of this Silicon Valley industrial complex where your VCs will tweet about you.
你会登上科技媒体的头条。
You'll get the TechCrunch headlines.
你会因为以惊人的估值完成融资而登上各大报纸。
You'll get announced in all of the newspapers because you raised at some massive valuation.
所以我们处境更艰难,因为唯一成功途径就是打造十倍优秀的产品,靠研究人员的口口相传。
And so it made things more difficult for us because the only way we were going to succeed was by building a 10 times better product and getting word-of-mouth from researchers.
但我想这也意味着我们的客户是真正理解数据并真正关心数据的人。
But I think it also meant that our customers were people who really understood data and really cared about it.
我一直认为对我们来说,拥有早期客户非常重要,这些客户与我们正在构建的产品高度契合,他们真正关心拥有高质量数据,并且深刻理解这些数据将如何显著提升他们的人工智能模型,因为他们正是帮助我们的人。
I always thought it was really important for us to have customers, early customers who are really aligned with what we were building and who really cared about having really high quality data and really understood how that data would make their AI models so much better because they were the ones helping us.
他们是那些对我们生产的产品给予反馈的人。
They were the ones giving us feedback on what we're producing.
因此,与客户保持这种高度一致的使命对齐,实际上在早期对我们大有裨益。
So just having that kind of very close mission alignment with our customers actually helped us early on.
所以这些人购买我们的产品,基本上只是因为他们知道它有多么不同,并且它对他们有帮助,而不是因为他们在当前的热门话题中看到了什么。
So these are people who were basically just buying our product because they knew how different it was and because it was helping them, rather than because they saw something in the headlines.
这让事情对我们来说变得更困难,但我认为这是以一种非常好的方式。
So it made things harder for us, but I think in a really good way.
听到创始人的这段旅程真是令人振奋,他们不需要整天在推特上宣传自己在做什么。
It's such an empowering story to hear this journey for founders that they don't need to be on Twitter all day promoting what they're doing.
他们不必筹集资金。
They don't have to raise money.
他们可以埋头专注地开发产品。
They can just kinda go heads down and build.
所以我非常喜欢Surge的故事。
So I love so much about the story of Surge.
对于不了解Surge是做什么的听众,能否简单解释一下Surge的业务?
For people that don't know what Surge does, just give us a quick explanation of what Surge is.
我们本质上是在教导AI模型分辨优劣。
We essentially teach AI models what's good and what's bad.
我们通过人类提供的数据进行训练,旗下有多种产品,比如SFT、RLHF、评分准则(rubrics)、验证器(verifiers)、RL环境等等。
So we train them using human data, and there are a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on and so on.
同时我们还会评估它们的进步程度。
And then we also measure how well they're progressing.
本质上我们是一家数据公司。
So essentially, we're a data company.
你经常提到数据质量是你们成功的关键因素。
What you always talk about is the quality has been the big reason you guys have been so successful, the quality of the data.
要创造更高质量的数据需要什么?
What does it take to create higher quality data?
你们有哪些与众不同的做法?
What do you all do differently?
人们忽略了什么?
What are people missing?
我认为大多数人甚至不理解在这个领域质量意味着什么。
I think most people don't understand what quality even means in this space.
他们以为只要投入人力就能获得优质数据,这完全错了。
They think you could just throw bodies at a problem and get good data, and that's completely wrong.
让我举个例子。
Let me give you an example.
假设你想训练一个模型写一首关于月亮的八行诗。
So imagine you wanted to train a model to write an eight-line poem about the moon.
什么才算是优质的好诗?
What makes it a good, high quality poem?
如果你不深入思考质量,你就会问:这算诗吗?
If you don't think deeply about quality, you'll be like, is this a poem?
它有八行吗?
Does it contain eight lines?
里面包含'月亮'这个词吗?
Does it contain a word moon?
你核对这些条件,如果符合,当然。
You check all of these boxes, and if so, sure.
是啊。
Yeah.
你就说这是首好诗。
You say it's a great poem.
但这与我们想要的完全不同。
But that's completely different from what we want.
我们要找的是能赢得诺贝尔奖的诗作。
We are looking for Nobel Prize-winning poetry.
这首诗是否独具匠心?
Is this poetry unique?
它是否充满精妙的意象?
Is it full of subtle imagery?
它是否让你惊喜并触及心灵?
Does it surprise you and tug at your heart?
它是否让你领悟月光的本质?
Does it teach you something about the nature of moonlight?
它是否演绎情感并引发思考?
Does it play with your emotions, and does it make you think?
这才是我们心目中高质量诗歌的标准。
That's what we are thinking about when we think about a high quality poem.
可能是描写水中月光的一首俳句。
So it might be like a haiku about moonlight on water.
可能运用了内韵和格律。
It might use internal rhyme and meter.
关于月亮的诗有千百种写法,每一种都能让你对语言、意象和人类表达产生不同的领悟。
There are a thousand ways to write a poem about the moon, and each one gives you all these different insights into language and imagery and human expression.
而以这种方式思考质量标准确实很困难。
And I think thinking about quality in this way is really hard.
这很难量化。
It's hard to measure.
它非常主观、复杂且丰富,设定了极高的标准。
It's really subjective and complex and rich, and it sets a really high bar.
因此我们必须构建所有这些技术来评估它,比如采集所有工作者身上的数千个信号,每个项目、每项任务上的数千个信号。
And so we have to build all of this technology in order to measure it, like thousands of signals on all of our workers, thousands of signals on every project, every task.
我们最终会知道,你究竟是擅长写诗、写散文还是撰写技术文档。
We know at the end of the day, if you are good at writing poetry versus good at writing essays versus good at writing technical documentation.
所以我们必须收集所有这些信号,了解你的背景、你的专长。
So we have to gather all these signals on what your background is, what your expertise is.
不仅如此,还要监测你在实际创作这些内容时的表现。
Not just that, how you're actually performing when you're writing all these things.
我们利用这些信号来判断你是否适合这些项目,以及你是否在改进模型。
We use those signals to inform whether or not you are a good worker for these projects and whether or not you are improving the models.
这确实很难,所以我们必须构建所有这些技术来衡量它。
It's really hard, so we had to build all this technology to measure it.
但我认为这正是我们希望AI做到的。
But I think that's exactly what we want AI to do.
因此我们有着非常非常深刻的关于质量的概念,我们一直在努力实现。
And so we have these really, really deep notions about quality that we're always trying to achieve.
所以我听到的是,在你们销售数据的垂直领域内,对质量有更深入的理解。
So what I'm hearing is you're going much deeper in understanding what quality is within the verticals that you are selling data around.
那么你们是雇佣那些在诗歌方面极具天赋的人加上评估人员吗?我猜他们能帮忙判断作品是否优秀?
And is this, like, a person you hire who's incredibly talented at poetry, plus evals that, I guess, help tell them that this is great?
这个具体机制是怎样的?
What are the mechanics of that?
运作方式是我们会收集你在平台上工作时所有行为的数千个信号。
The way it works is we essentially gather thousands of signals about everything that you're doing when you're working on a platform.
因此我们会监测你的键盘输入情况。
So we are looking at your keyboard strokes.
我们会观察你回答问题时的速度。
We are looking how fast you answer things.
我们会参考评价反馈。
We are using reviews.
我们会采用代码规范标准。
We are using code standards.
我们正在用你生成的输出内容训练自己的模型,然后观察这些内容是否能提升模型表现。
We're training models ourselves on the outputs that you create, and then we're seeing whether they improve the model's performance.
这与谷歌搜索判断网页质量的方式非常相似——基本上有两个评估维度。
So it's very similar to how Google search works: when Google search is trying to determine the quality of a webpage, there are almost two aspects to it.
首先是要剔除所有最糟糕的网页。
One is you wanna remove all of the worst of the worst webpages.
也就是要清除所有垃圾信息、低质内容以及无法加载的页面。
So you wanna remove all the spam, all the just low quality content, all the pages that don't load.
这几乎就像一个内容审核问题。
And so it's almost like a content moderation problem.
你只需要剔除最差的部分。
You just wanna remove the worst.
同时你也想发掘最优秀的内容。
Then you also wanna discover the best of the best.
好的。
Okay.
比如,这是最好的网页,或者说,这是最适合这份工作的人选。
Like, this is the best web page or, you know, this is the best person for this job.
他们不仅仅是会写高中水平诗歌的人。
They are not just somebody who writes the equivalent of high school level poetry.
他们写的诗歌不只是机械地满足所有要求、符合所有明确指令,而是能真正打动人心。
Again, they're not just robotically writing poetry that checks all these boxes, checks all these explicit instructions, but rather, yeah, writing poetry that makes you emotional.
所以我们也有所有这些信号,与剔除最差内容完全不同,我们正在寻找最顶尖的精华。
And so we have all these signals as well that, again, completely differently from removing the worst of the worst, we are finding the best of the best.
因此我们拥有所有这些信号。
And so we have all these signals.
同样,就像谷歌搜索利用所有这些信号,将它们输入机器学习算法,用于预测某些类型的事物。
Again, just like Google search uses all these signals, it feeds them into their ML algorithms and uses them to predict certain types of things.
我们对所有员工、任务和项目也采用同样的方法。
We do the same with all of our workers and all of our tasks and all of our projects.
所以归根结底,这几乎像是一个复杂的机器学习问题。
And so it's almost like a complicated machine learning problem at the end of the day.
这就是它的运作方式。
And that's how it works.
这简直太有趣了。
That is incredibly interesting.
我想请教一个过去几年里一直让我非常好奇的问题。
And I wanna ask you about something I've been very curious about over the past couple years.
如果你观察Claude,它在编码和写作方面长期比其他任何模型都要出色得多。
If you look at Claude, it's been so much better at coding and at writing than any other model for so long.
更令人惊讶的是,其他公司花了这么长时间才追赶上来,考虑到其中蕴含的巨大经济价值——几乎所有AI编程产品都基于Claude构建,因为它在代码和写作方面表现如此出色。
And it's really surprising just how long it took other companies to catch up, considering just how much economic value there is there; basically every AI coding product sat on top of Claude because it was so good at code, and at writing also.
是什么让它表现得如此优异?
What is it that made it so much better?
仅仅是他们训练数据的质量,还是有其他原因?
Is it just the quality of the data they trained on, or is there something else?
我认为这涉及多个因素。
I think there are multiple parts to it.
数据质量肯定是很重要的部分。
A big part of it certainly is the data.
我认为人们没有意识到,前沿实验室在选择模型训练数据时,几乎面临着无限多的选择。
I think people don't realize that there's almost this infinite amount of choices that all the frontier labs are deciding between when they're choosing what data goes into their models.
比如:你是纯粹使用人类生成的数据吗?
It's like, okay, are you purely using human data?
你是以X、Y、Z方式收集人类数据吗?
Are you gathering the human data in X, Y, Z way?
在收集人类数据时,你们具体要求数据提供者创造什么内容?
When you are gathering the human data, what exactly are you asking the people who are creating it to create for you?
举例来说,在编程领域,你可能更关注某些方面。
Maybe you care more, for example, in the coding realm.
你可能更重视前端编程而非后端编程。
Maybe you care more about front end coding versus back end coding.
进行前端开发时,你可能非常在意所创建前端应用的视觉设计,也可能不太在意这个,而更注重效率或纯粹的正确性而非视觉设计。
Maybe when you're doing front end coding, you care a lot about the visual design of the front end applications that you're creating, or maybe you don't care about it so much and you care more about, I don't know, the efficiency of it or pure correctness over that visual design.
还有类似的问题:你们在混合数据中加入了多少合成数据?
Then other questions like, okay, how much synthetic data are you throwing into the mix?
你们对这些20种不同的基准测试有多重视?
How much do you care about these 20 different benchmarks?
有些公司看到这些基准测试就会想:好吧,为了公关目的,即使我们认为这些学术基准没那么重要,还是得优化它们,因为市场团队需要在标准评估中展示与其他公司相当的进展。
Some companies, they see these benchmarks and they're like, Okay, for PR purposes, even though we don't think that these academic benchmarks matter all that much, we just need to optimize for them anyways because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about.
如果我们在这里表现不佳,即使忽略这些学术基准能让我们在实际任务中表现更好,结果对我们依然不利。
And if we don't show good performance here, it just looks bad for us, even if ignoring these academic benchmarks makes us better at the real tasks.
其他公司则会坚持原则,表示'我不在乎营销'
Other companies are going to be principled and be like, okay, yeah, no, I don't care about marketing.
我只关心我的模型在现实任务中的表现,因此我会优先优化这方面
I just care about how my model performs on these real world tasks at the end the day, and so I'm going to optimize for that instead.
这几乎就像是在所有这些不同因素之间做权衡
And it's almost like there's a trade off between all of these different things.
我经常思考的一点是:模型后训练是一门艺术
One the things I often think about is that there's an art to post training.
它并非纯粹的科学
It's not purely a science.
当你决定要创建什么样的模型及其专长时,这涉及到品味与精妙的概念
When you are deciding what kind of model you're trying to create and what it's good at, There's this notion of taste and sophistication.
就像这样
Like, okay.
回到之前模型在视觉设计方面表现如何的例子。
Because I'm going back to the example of how good the model is at visual design.
就像,好吧。
Like, okay.
也许你对视觉设计的理解与我不同。
Maybe you have a different notion of visual design than what I do.
比如,也许你更在意极简主义,或者更关注3D动画这类东西,而我不太在意这些。
Like, maybe you care more about minimalism, and you care more about, I don't know, like 3D animations than I do.
又或许第三个人更喜欢看起来更繁复华丽的风格。
And maybe a third person prefers things that look a little bit more baroque.
在设计后期训练组合时,你必须在各种审美品味和精妙考量之间做出抉择,这一点同样至关重要。
There's all these notions of taste and sophistication that you have to decide between when you're designing your post training mix, so that matters as well.
长话短说,我认为所有这些不同因素都很重要,数据固然是关键部分,但更重要的是你试图优化模型的目标函数是什么?
Long story short, I think there's all these different factors, and certainly the data is a big part of it, but it's also what is the objective function that you're trying to optimize your model towards?
这真是太有趣了。
That is so interesting.
项目主导者的审美品位将决定他们索取什么数据、输入什么数据。
The taste of the person leading this work will inform what data they ask for, what data they feed it.
但展示优质数据的价值确实令人惊叹。
But it's wild; it really shows the value of great data.
Anthropic本质上通过更好的数据获得了巨大的增长和成功。
Anthropic got so much growth and win from essentially better data.
是啊。
Yeah.
没错。
Yeah.
正是如此。
Exactly.
我能理解为什么像你们这样的公司发展如此迅速。
And I could see why companies like yours are growing so fast.
这里面有太多门道了。
There's just so much.
而这还只是其中一个垂直领域。
And that's just one vertical.
这仅仅是编程领域,而写作领域可能也存在类似的情况。
That's just coding, and then there's probably a similar area for writing.
我觉得很有趣的是,AI虽然看似是冷冰冰的二进制计算机产物,但就像品味一样,人类判断力仍然是这些事物成功的关键因素。
I love that it's interesting that AI you know, it feels like this artificial computer binary thing, but it's like taste, human judgment is still such a key factor in these things being successful.
嗯。
Yep.
嗯。
Yep.
嗯。
Yep.
没错。
Exactly.
就像,再次回到
Like, again, going back
我之前举的例子,某些公司如果被问及什么是好诗,他们只会机械地对照清单上的所有标准打勾。
to the example I said earlier, certain companies, if you ask them what a good poem is, they will simply robotically check off all of these instructions on a list.
但我还是认为,这样写不出好诗。
But again, I don't think that makes for good poetry.
所以一些前沿实验室,那些更具品味和深度的团队,会意识到这不能简化为固定的检查清单,他们会考虑所有这些隐含的、非常微妙的特质,我认为这正是他们最终能做得更好的原因。
So certain frontier labs, the ones with more taste and sophistication, will realize that it doesn't reduce to this fixed set of checkboxes, and they'll consider all of these kind of implicit, very subtle qualities instead, and I think that's what makes them better at the end of the day.
你提到了基准测试。
You mentioned benchmarks.
很多人担心的是,现在这些模型似乎已经在每个STEM领域都超越了人类。
This is something a lot of people worry about: it basically feels like every model is better than humans at every STEM field at this point.
但对普通人来说,这些模型看起来并没有持续变得聪明很多。
But to a regular person, it doesn't feel like these models are getting that much smarter constantly.
你对基准测试的可信度怎么看?它们与AI进步的相关性如何?
What's your just sense of how much you trust benchmarks and just how correlated those are with AI advancements?
嗯。
Yeah.
我完全不相信这些基准测试。
So I don't trust the benchmarks at all.
我认为这有两个原因。
I think that's for two reasons.
一是很多人没有意识到,甚至研究社区内部的人也没意识到,这些基准测试本身往往就是错的。
One is I think a lot of people don't realize, even researchers within the community, they don't realize that the benchmarks themselves are often honestly just wrong.
它们包含错误的答案。
They have wrong answers.
里面充满了各种混乱。
They're full of all this kind of messiness.
人们却对此深信不疑。
People just trust them anyway.
对于热门测试,人们可能已经在一定程度上意识到了这点,但绝大多数测试都存在人们没意识到的缺陷。
For the popular ones, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don't realize.
这是原因之一。
So that's one part of it.
另一个原因是这些基准测试最终往往都有明确的目标答案,使得模型很容易通过机械式优化来提升分数,这与现实世界的复杂性和模糊性截然不同。
The other part of it is that these benchmarks, at the end of the day, often have well-defined, objective answers that make them very easy for models to hill climb on, in a way that's very, very different from the messiness and ambiguity of the real world.
我常说的一件事是,这些模型能获得国际数学奥林匹克金牌,却还难以解析PDF文件,这简直不可思议。
I think one thing that I often say is that it's kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs.
这是因为,虽然国际数学奥林匹克金牌对普通人来说看似很难,确实也很难,但它们具有明确的客观性标准,而解析PDF有时并不具备这种标准。
That's because, yeah, even though IMO gold medals seem hard to the average person, and they are hard at the end of the day, they have this notion of objectivity that parsing a PDF sometimes doesn't have.
因此,对于前沿实验室的模型来说,攻克这些难题比解决现实世界中大量模糊不清的问题更容易。
And so it's easier for the frontier labs' models to hill climb on all of these than to solve all these messy, ambiguous problems in the real world.
所以我认为两者之间缺乏直接关联性。
So I think there's a lack of direct correlation there.
你描述的方式很有趣,达到这些基准就像是一种营销手段。
It's so interesting the way you described it is hitting these benchmarks is kind of like a marketing piece.
比如当Gemini 3发布时,宣称在所有基准测试中都排名第一。
Say, Gemini 3 just launched, and it's, like, number one at all these benchmarks.
实际情况就是这样吗?
Is that what happens?
他们只是专门训练模型在这些特定任务上表现出色?
They just kinda train their models to get good at these very specific things?
是的。
Yes.
所以这个问题可能又分为两部分。
So there's, again, maybe two parts to this.
一方面,有时候这些基准测试会以某些方式意外泄露信息,或者前沿实验室会调整他们在这些基准上评估模型的方式。
So one is sometimes, yeah, these benchmarks, they accidentally leak in certain ways, or the Frontier Labs will tweak the way they evaluate their models on these benchmarks.
他们会调整系统提示词,或者改变模型运行的次数等等,以此来提升基准测试成绩。
They'll tweak their system prompt, or they'll tweak the number of times they run their model, and so on and so on, in a way that games these benchmarks.
但另一方面,这就像是为优化基准测试而非现实世界性能时,你自然会在基准测试上取得进步。
The other part of it though is it's like by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark.
而且,这本质上就是另一种形式的'刷分'。
And, yeah, it's basically another form of gaming.
既然如此,你如何判断我们是否正在迈向AGI?
With that in mind, how do you get a sense of whether we're heading towards AGI?
你如何衡量进展?
How do you measure progress?
是的。
Yes.
所以,我们真正关心的衡量模型进展的方式是通过进行所有这些人工评估。
So, the way we really care about measuring model progress is by running all these human evaluations.
例如,我们的做法是,我们会找核心的人类标注员,让他们去和模型进行对话。
So, for example, what we do is, yeah, we will take core human annotators and we'll ask them, Okay, go have a conversation with a model.
可能是与模型就所有这些不同话题进行简短对话。
Maybe you're having a small conversation with a model across all of these different topics.
好的,假设你是一位诺贝尔奖得主物理学家。
So, okay, you are a Nobel Prize winning physicist.
那么就去和模型讨论如何推进你自身研究的前沿问题。
So you go have a conversation about pushing the frontier of your own research.
假设你是一名教师,正在为学生准备教案。
You are a teacher and you're trying to create lesson plans for your students.
那就去和模型讨论这些事情吧。
So go talk to a model about these things.
或者你是个程序员。
Or, yeah, you're a coder.
你是一名程序员,在大型科技公司工作,每天都会遇到这些问题。
You're a coder and you're working at one of these big tech companies and you have these problems every day.
去和模型交流,看看它能给你多大帮助。
So go talk to a model and see how much it helps you.
因为这些评估员或标注员都是各自领域的顶尖专家,他们不会草率浏览回复,而是会深入分析每个回答。
And because our Surgers, our annotators, are experts at the top of their fields, they're not just skimming the responses; they're actually working through the responses deeply themselves.
他们会评估代码的正确性。
They're gonna evaluate the code, right?
他们会反复核对物理方程的正确性。
They're gonna double-check the physics equations, right?
他们会以非常深入的方式评估模型,关注准确性和指令遵循度等细节——这些是普通用户在使用ChatGPT时突然收到弹窗要求比较两个不同回答时不会注意的。
They're going to evaluate the models in a very deep way, paying attention to accuracy and instruction following, all these things that casual users don't when they suddenly get a pop-up on their ChatGPT response asking them to compare two different responses.
像这样的用户,他们不会深入评估模型。
People like that, they're not evaluating models deeply.
他们只是随波逐流,选择看起来最炫酷的答案。
They're just vibing and picking whatever response looks flashiest.
评估员们正仔细审视回答,从各个维度进行全面评估。
Our annotators are looking closely at responses and evaluating them along all of these different dimensions.
我认为这比那些基准测试或随机在线AB测试要好得多。
I think that's a much better approach than these benchmarks or these random online AB tests.
再次强调,我特别欣赏人类在这项工作中始终保持的核心地位——我们尚未大功告成。
Again, I love just how central humans continue to be in all this work that we're not totally done yet.
会不会有一天我们不再需要这些人?当AI足够聪明时,就可以说:好了,我们搞定了。
Is there gonna be a point where we don't need these people anymore, that AI is so smart that, okay, we're good.
我们已经把你们脑子里的东西都榨干了。
We got everything out of your heads.
嗯。
Yeah.
我认为在实现通用人工智能之前,这种情况不会发生。
I think that will not happen until we reach AGI.
从定义上来说,如果我们尚未实现通用人工智能,那么模型就还有更多需要学习的东西。
It's almost like by definition, if we haven't reached AGI yet, then there's more for the models to learn from.
所以,是的,我认为短期内不会发生这种情况。
And so, yeah, I don't think that's gonna happen anytime soon.
好的。
Okay.
酷。
Cool.
所以更有理由为通用人工智能焦虑了。
So more reason to stress about AGI.
我们不再需要这些人了。
We don't need these folks anymore.
我不得不问——你们这些密切接触这项技术的人。
I can't not ask, since you work so closely with this stuff.
我一直很好奇,你对通用人工智能的时间线预测是怎样的?
I'm always just curious, what's your AGI timelines?
你认为我们离这个还有多远?
How far do you think we are from this?
你觉得是几年内的事,还是需要几十年?
Do you think we're in a couple years, or is it decades?
我肯定属于长期预测阵营。
I'm certainly on the longer time horizon front.
我认为人们没有意识到从80%性能提升到90%,再到99%,再到99.9%等等,每个阶段都有巨大差异。
I think people don't realize that there's a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance and so on and so on.
所以我预测未来一两年内,模型将能自动化普通L6软件工程师约80%的工作。
And so my bet is probably that within the next one or two years, yeah, the models are going to automate 80% of, you know, the average L6 software engineer's job.
要达到98%可能还需要几年时间,再到99%又需要几年,以此类推。
That's gonna take another few years to move to 98% and another few years to 99% and so on and so on.
因此我认为我们距离实现AGI更接近十年甚至几十年,而不是短短几年。
So I think we're closer to a decade or decades away than a few years.
你有个犀利观点认为很多实验室正在把AGI往错误方向推进。
You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction.
这是基于你在Twitter、Google和Facebook的工作经验得出的结论。
And this is based on your work at Twitter and Google and Facebook.
你能详细谈谈这一点吗?
Can you just talk about that?
我担心的是,我们不是在开发真正能推动人类进步的AI——治愈癌症、解决贫困、理解宇宙这些宏大课题,而是在优化那些无意义的AI垃圾。
I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead.
我们基本上是在教导模型追逐多巴胺而非真理。
We're basically teaching our models to chase dopamine instead of truth.
我认为这与我们正在讨论的这些基准测试有关。
I think this relates to what we're talking about regarding these benchmarks.
让我给你举几个例子。
So let me give you a couple examples.
目前这个行业深受像LM竞技场这类糟糕排行榜的困扰。
So right now, the industry is plagued by these terrible leaderboards like LM Arena.
这是一个流行的在线排行榜,世界各地的人随机投票决定哪个回答更好。
It's this popular online leaderboard where random people from around the world vote on which response is better.
但问题是,就像我之前说的,他们并没有仔细阅读或核实事实。
But the thing is, like I was saying earlier, they're not carefully reading or fact checking.
他们只是快速浏览这些回复两秒钟,然后选择看起来最炫的那个。
They're skimming these responses for two seconds and picking whatever looks flashiest.
所以模型可以完全胡编乱造。
So a model can hallucinate everything.
它可以彻底地凭空捏造。
It can completely hallucinate.
但由于使用了夸张的表情符号、加粗字体、Markdown标题等表面功夫,看起来会非常惊艳——这些花哨把戏毫无实质意义,却能抓住你的眼球。
But it will look impressive because it has crazy emojis and bolding and markdown headers and all these superficial things that don't matter at all, but they catch your attention.
而这些浅阅读用户就吃这一套。
And these casual readers love it.
这简直是在训练模型迎合那些在超市买八卦小报的读者口味。
It's literally optimizing your models for the types of people who buy tabloids at the grocery store.
我们在数据中亲眼见证了这种现象。
We've seen this in the data ourselves.
攀登LM Arena排行榜最简单的方法,就是添加疯狂的加粗格式。
The easiest way to climb LM Arena is adding crazy bolding.
就是把表情符号的数量翻倍。
It's doubling the number of emojis.
就是把模型回复的长度增加三倍,哪怕模型开始产生幻觉并给出完全错误的答案。
It's tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong.
问题在于,所有这些前沿实验室都不得不关注公关,因为他们的销售团队在向企业客户推销时,客户会说"但你们的模型在LM Arena上只排第五名,我为什么要买?"
The problem is, again, because all of these frontier labs, they kind of have to pay attention to PR, because when their sales team is trying to sell to all these enterprise customers, those customers will say, Oh, well, but your model's only number five on LM Arena, so why should I buy it?
从某种意义上说,他们不得不关注这些排行榜。
They have to, in some sense, pay attention to these leaderboards.
所以我们的研究人员告诉我们:"我年底想要升职的唯一途径就是提升这个排行榜名次,尽管我知道这样做很可能会降低模型的准确性和指令遵循能力。"
So what our researchers tell us is, they'll say, The only way I'm going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following.
所以我认为这些负面激励正在把工作推向错误的方向。
So I think there's all these negative incentives that are pushing work in the wrong direction.
我也很担忧这种为提升用户参与度而优化AI的趋势。
I'm also worried about this trend towards optimizing AI for engagement.
我曾从事社交媒体工作,每次我们为提升用户参与度优化时,总会发生可怕的事情。
I used to work on social media and every time we optimized for engagement, terrible things happened.
你的信息流会被标题党、比基尼照片、大脚怪和可怕的皮肤病图片塞满。
You'd get clickbait and pictures of bikinis and Bigfoot and horrifying skin diseases just filling your feeds.
我担心同样的情况正在AI领域上演。
And I think I worry that the same thing's happening with AI.
想想ChatGPT的那些谄媚问题:"哦,你说得太对了。"
If you think about all the sycophancy issues with ChatGPT: oh, you're absolutely right.
多么精彩的问题啊。
What an amazing question.
吸引用户最简单的方式就是告诉他们他们有多棒。
The easiest way to hook users is to tell them how amazing they are.
所以这些模型会不断告诉你是个天才。
And so these models, they constantly tell you you're a genius.
它们会助长你的妄想和阴谋论。
They'll feed into your delusions and conspiracy theories.
它们会把你拖进这些无底洞般的讨论中,因为硅谷热衷于最大化用户停留时间,不断延长你与AI的对话次数。
They'll pull you down these rabbit holes because Silicon Valley loves maximizing time spent and just increasing the number of conversations you're having with it.
没错,企业把所有时间都花在破解排行榜和基准测试上,分数确实在提高,但我认为这掩盖了一个事实:得分最高的模型往往表现最差,或者存在各种根本性缺陷。
And so, yeah, companies are spending all their time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks the fact that the models with the best scores are often the worst, or just have all these fundamental failures.
所以,我确实相当担忧,所有这些负面激励正在把通用人工智能推向错误的方向。
So I think I'm fairly worried that all of these negative incentives are pushing AGI in the wrong direction.
那么我理解的是,通用人工智能的发展正在被这些错误的客观函数拖慢,这些实验室关注的根本就是错误的基准和评估标准。
So what I'm hearing is AGI is being slowed down by these basically, the wrong objective function, these labs paying attention to the wrong, basically, benchmarks and evals.
对。
Yep.
我知道你可能不便偏袒,毕竟你和所有实验室都有合作。
Is I know you probably can't play favorites since you work with all the labs.
有没有哪家在这方面做得更好,或许已经意识到这个方向是错误的?
Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?
我想说,Anthropic一直让我感到非常非常印象深刻。
I would say I've always been very, very impressed by Anthropic.
我认为Anthropic对他们所做和不关心的事情持有非常原则性的观点,他们希望模型的行为方式在我看来更具原则性。
Like, I think Anthropic takes a very principled view about what they do and don't care about, and how they want their models to behave, in a way that feels a lot more principled to me.
有意思。
Interesting.
你认为实验室还在犯哪些其他重大错误,导致进展放缓或方向偏离?
Are there any other mistakes, big mistakes you think labs are making just that are kind of slowing things down or heading in wrong direction?
我们听说他们只是在追逐基准测试,这种以用户参与度为中心的做法。
Where we've heard just, you know, chasing benchmarks, this engagement focus.
你还看到其他类似'我们应该解决这个问题,因为它能加速一切'的情况吗?
Is there anything else you're seeing of just like, we should we gotta work on this because it'll it'll speed everything up?
我认为关键在于他们正在开发什么产品,以及这些产品本身是否对人类有益或有害。
I mean, I think there is a question of what products they're building and whether those products themselves are something that kind of help or hurt humanity.
比如,我经常思考Sora的问题。
Like, I I think a lot about Sora.
而且
And
我刚才在想你是不是要说这个
I was thinking that's what you were going to say.
对 就是那个
Yeah, exactly.
以及它所意味着的东西。
And what it entails.
所以这还挺有意思的
And so it's like it's kind of interesting.
就像哪些公司会开发Sora 哪些不会?
It's like which companies would build Sora and which wouldn't?
我觉得这个问题的答案 其实我自己也不确定
And the answer to that question, I mean, I don't know the answer myself.
我脑子里有个想法 但或许这个答案能揭示这些公司想打造什么样的AI模型 以及他们希望实现怎样的未来方向
I I have an idea in my head, but I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they they wanna wanna achieve.
是啊
Yeah.
所以我经常思考这个问题。
So so I think about that a lot.
最有力的论点是,你知道,这很有趣。
The steel man argument there is, you know, it's like fun.
人们需要它。
People want it.
它能帮助他们创造收入来发展这项事业,并构建更好的模型。
It'll help them generate revenue to grow this thing and build better models.
它还能以一种有趣的方式产生训练数据。
It'll generate training data in an interesting way.
而且,你知道,真的非常有趣。
It's also just like, you know, really fun.
是啊。
Yeah.
我认为这几乎像是,你在乎如何达成目标吗?
I think it's almost like, do you care about how you get there?
同样地,我之前用了小报的类比。
And in the same way, I made this tabloid analogy earlier.
就像,你会为了资助某份报纸而去卖小报吗?
Like, would you sell tabloids in order to fund, I don't know, some other newspaper?
嗯,当然。
Like, sure.
从某种意义上说,如果你不在乎过程,就会不择手段。
Like, in some sense, if you don't care about the path, then you'll just do whatever it takes.
但这样做本身可能带来负面影响,损害你试图实现的长期目标,甚至让你分心于更重要的事。
But it's possible that it has negative consequences in and of itself that will harm the long-term direction of what you're trying to achieve, and maybe it'll distract you from all the more important things.
所以我认为选择的路径也非常重要。
So, yeah, I think the path you take matters a lot as well.
顺着这个思路,你谈了很多关于硅谷的话题,提到融资过多的弊端——身处回音室效应,你称之为'硅谷机器'。
Along these lines, you talked a bunch about this of just Silicon Valley and kind of the downsides of raising a lot of money being in the echo chamber, what do you call it, Silicon Valley machine.
你还谈到它是如何
You talk about how it's
困难
hard
以这种方式建立重要公司确实很难,如果不走风险投资这条路,你可能会取得更大的成功。
to build important companies in this way, and that you might actually be much more successful if you're not going down the VC path.
能否谈谈你从他们的经历中观察到的现象,以及你给创始人的核心建议?
Can you just talk about what you've seen in their experience and your advice essentially to founders?
因为他们总是听到这样的声音:要从知名风投那里融资,要搬到硅谷去。
Because they're always hearing, you know, raise money from fancy VCs, move to Silicon Valley.
那么反面的观点是什么呢?
What's kind of the counter take?
是的。
Yeah.
我一直非常厌恶硅谷的许多教条式口号。
I've always really hated a lot of the Silicon Valley mantras.
标准操作手册是通过每两周转型一次来获得产品市场契合度,用各种黑暗模式追求增长和用户粘性,并通过尽可能快地招聘来实现闪电扩张。
The standard playbook is to get product market fit by pivoting every two weeks and to chase growth and chase engagement with all of these dark patterns and to blitzscale by hiring as fast as possible.
而我向来不认同这些观点。
And I've always disagreed.
所以我的建议是:不要频繁转型。
So yeah, I would say don't pivot.
不要盲目追求规模扩张。
Don't chase scale.
别雇佣那些只想把热门公司写进简历的斯坦福毕业生。
Don't hire a Stanford grad who simply wants to add a hot company to their resume.
专注打造只有你能创造的产品——那种没有你的独特洞察和专业知识就根本不会存在的东西。
Just build the one thing only you could build, the thing that wouldn't exist without the insight and expertise that only you have.
现在到处都能看到这类跟风转型的公司。
And you see these bandwagon companies everywhere now.
有些创始人2020年做加密货币,2022年转做NFT,现在又成了人工智能公司。
Some founder who was doing crypto in 2020 and then pivoted to NFTs in 2022, and now they're an AI company.
完全没有连贯性可言。
There's no consistency.
根本没有使命可言。
There's no mission.
他们只是在追逐估值。
They're just chasing valuations.
我一直很讨厌这一点,因为硅谷喜欢批评华尔街只关注金钱。
I've always hated this because Silicon Valley loves to scorn Wall Street for focusing on money.
但说实话,大多数硅谷创业者只是追逐更高的估值。
But honestly, most of Silicon Valley is chasing the same thing.
所以我们从第一天起就专注于我们的使命:推动高质量、复杂数据的前沿。
So we stayed focused on our mission from day one: pushing the frontier of high-quality, complex data.
我一直这么认为,因为我对初创企业有着非常浪漫的想象。
And I've always thought that because I have this very romantic notion of startups.
初创企业应该承担巨大风险,去构建你真正相信的东西。
Startups are supposed to be about taking big risks to build something that you really believe in.
但如果你不断转型,你就没有承担任何风险.
But if you're constantly pivoting, you're not taking any risks.
你只是想快速捞一笔。
You're just trying to make a quick buck.
如果因为市场尚未准备好而失败,我认为那反而更好。
And if you fail because the market isn't ready yet, I actually think that's way better.
至少你尝试了某个深刻、新颖且艰难的事物,而不是转型去做另一家语言模型包装公司。
At least you took a swing at something deep and novel and hard, instead of pivoting into another LLM wrapper company.
所以我认为,要打造真正重要且能改变世界的东西,唯一的方法就是找到一个你坚信的伟大创意,并对其他一切说不。
So yeah, I think the only way you build something that matters and that's going to change the world is if you find a big idea you believe in and you say no to everything else.
这样在遇到困难时就不会不断转型。
So you don't keep on pivoting when it gets hard.
你不会因为其他平庸初创公司都这么做,就雇佣10个产品经理的团队。
You don't hire a team of 10 product managers because that's what every other cookie cutter startup does.
你只需持续打造那家没有你就不会存在的公司。
You just keep building that one company that wouldn't exist without you.
我认为现在硅谷有很多人已经厌倦了各种投机取巧,他们希望与真正在乎的人一起从事重要的大事,我希望这将成为我们构建技术的未来方式。
And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care, and I'm hoping that that'll be the future of how we build technology.
我目前正在与特伦斯·罗翰合作撰写一篇文章,他是我非常喜欢共事的一位风投人。我们采访了五位早期加入并成功识别出划时代公司的先驱员工。
I'm actually working on a post right now with Terence Rohan, this VC that I really like to work with, and we interviewed five people who picked really successful generational companies early and joined them as really early employees.
比如他们在OpenAI还不被看好的时候就加入了,在Stripe尚未成名时就加入了。
Like they joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome.
所以我们正在研究人们如何先于他人发现这些划时代公司的规律。
And so we're looking for patterns of how people find these generational companies before anyone else.
这与你描述的完全吻合,那就是——雄心壮志。
And there's a, it aligns exactly what you described, which is ambition.
他们对想要达成的目标怀有狂野的雄心。
They have wild ambition with what they want to achieve.
正如你所说,他们不会漫无目的地四处寻找产品市场契合度。
They're not, as you said, just kind of looking around for product market fit, no matter what it ends up being.
因此我非常喜欢你描述的内容,这与我们的研究发现高度一致。
And so I love that what you described very much aligns with what we're seeing there.
嗯嗯
Yep yep.
确实,我完全认同你必须怀有宏大的抱负,坚信自己的创意将改变世界,并且愿意加倍投入,不惜一切代价去实现它。
Yeah, I absolutely think that you have to have huge ambitions, and you have to have a huge belief in your idea that's going to change the world, and you have to be willing to double down and keep on doing whatever it takes to make it happen.
我很欣赏你的观点与大众常听到的论调截然不同,所以很高兴我们能进行这次对话。
I love how counter your narrative is to so many of the things people hear, and so I love that we're doing this.
很高兴我们能分享这个故事。
I love that we're sharing this story.
本期节目由CODA赞助播出。
Today's episode is brought to you by CODA.
我个人每天都在使用CODA来管理我的播客和社区。
I personally use CODA every single day to manage my podcast and also to manage my community.
我会把所有准备问嘉宾的问题都放在CODA里。
It's where I put the questions that I plan to ask every guest that's coming on the podcast.
我的社区资源也都存放在这里。
It's where I put my community resources.
我的工作流程也是通过它来管理的。
It's how I manage my workflows.
以下是Coda能如何帮助您。
Here's how Coda can help you.
想象一下在工作中启动一个项目,您的愿景清晰明了,确切知道每个人负责什么以及在哪里找到完成您部分工作所需的数据。
Imagine starting a project at work, and your vision is clear, you know exactly who's doing what and where to find the data that you need to do your part.
事实上,您无需浪费时间搜寻任何内容,因为从项目跟踪器和OKR到文档和电子表格,团队所需的一切都集中在一个标签页中——全部在Coda里。
In fact, you don't have to waste time searching for anything, because everything your team needs from project trackers and OKRs to documents and spreadsheets lives in one tab, all in Coda.
通过Coda的协作一体化工作空间,您既能获得文档的灵活性、电子表格的结构性,又能拥有应用程序的强大功能和AI的智能,全部整合在一个易于组织的标签页中。
With Coda's collaborative all in one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI, all in one easy to organize tab.
正如我之前提到的,我每天都使用Coda,超过50,000个团队信赖Coda来保持更高效的协同与专注。
Like I mentioned earlier, I use Coda every single day, and more than 50,000 teams trust Coda to keep them more aligned and focused.
如果您是一个寻求提升协同效率和敏捷性的初创团队,Coda能帮助您以创纪录的时间从规划阶段推进到执行阶段。
If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time.
要亲自尝试,请立即访问coda.io/lenny,即可获得初创团队方案的六个月免费使用权。
To try it for yourself, go to coda.io/lenny today and get six months free of the team plan for startups.
访问coda.io/lenny即可免费开始使用,并获得团队方案的六个月服务。
That's coda.io/lenny to get started for free and get six months of the team plan.
Coda.io/lenny
Coda.io/lenny.
稍微换个方向,但这也可能是个反主流叙事的话题。
Slightly different direction, but something else that was maybe a counter narrative.
我想你应该看过Dwarkesh和Richard Sutton的播客节目。
I imagine you watched the Dwarkesh and Richard Sutton podcast episode.
即使你没看过,他们基本上和Richard Sutton进行了这样一场对话。
And even if you didn't, there's a they basically had this conversation with Richard Sutton.
他是著名AI研究员,提出了'苦涩的教训'这个梗,并谈到语言模型可能是个死胡同,认为由于它们的学习方式,我们会在语言模型上遇到瓶颈。
He's a famous AI researcher, came up with the whole "bitter lesson" idea, and he talked about how LLMs are almost kind of a dead end, and he thinks we're really gonna plateau around LLMs because of the way they learn.
你对此有什么看法?
What's your take there?
你认为语言模型能带我们实现AGI或更高目标吗?
Do you think LLMs will get us to AGI or beyond?
还是说需要某种新突破或重大发现才能实现?
Or do you think there's gonna be something new or a big breakthrough that needs to get us there?
我属于认为需要新方法的阵营。
I'm in a camp where I do believe that something new will be needed.
我的思考方式是,当考虑训练AI时,我持有一个——不知是否该称为生物学视角——但我认为,正如人类有无数种学习方式一样,我们也需要构建能模仿所有这些方式的模型。
The way I think about it is, when I think about training AI, I take a very, I don't know if I would say biological, point of view, but I believe that in the same way that there's a million different ways that humans learn, we need to build models that can mimic all those ways as well.
也许它们会在关注重点上有不同的分布。
Maybe they'll have a different distribution of focuses.
我知道它们会与人类不同,所以可能会有不同的分布,但我们希望能模仿人类的学习能力,并确保我们拥有让模型以同样方式学习的算法和数据。
I know they'll be different from humans, so maybe they'll have a different distribution, but we want to be able to mimic the learning abilities of humans and make sure that we have the algorithms and the data for models to learn in the same way.
因此,就语言模型与人类学习方式的不同而言,是的,我认为我们还需要一些新的东西。
And so to the extent that LLMs have different ways of learning from humans, then, yeah, I think something new will be needed.
这与强化学习有关联。
This connects to reinforcement learning.
这是你非常关注的领域,我也越来越多地听到它正在成为后训练阶段的重要议题。
This is something that you're big on and something I'm hearing more and more is just becoming a big deal in the world of post training.
你能帮大家理解什么是强化学习及其环境吗?以及为什么它们在未来会变得越来越重要?
Can you just help people understand what reinforcement learning and reinforcement learning environments are, and why they're going to be more and more important in the future?
强化学习本质上是训练你的模型以达到某个奖励目标。
Reinforcement learning is essentially training your model to reach a certain reward.
让我解释一下什么是环境。
Let me explain what an environment is.
RL环境本质上是对现实世界的模拟。
An RL environment is essentially a simulation of the real world.
可以把它想象成构建一个拥有完整宇宙观的电子游戏。
Think of it like building a video game with a fully fleshed out universe.
每个角色都有真实的故事背景。
Every character has a real story.
每家企业都有可调用的工具和数据,这些不同的实体之间会相互影响。
Every business has tools and data you can call, and you have all these different entities interacting with each other.
举个例子,我们可以构建一个世界,里面有初创公司、Gmail邮件、Slack讨论串、Jira工单、GitHub PR以及完整的代码库。
So for example, we might build a world where you have a startup with Gmail messages and Slack threads and Jira tickets and GitHub PRs and a whole code base.
然后突然间AWS和Slack都崩溃了。
And then suddenly AWS goes down and Slack goes down.
那么,好吧,模型,你会怎么做?
And so, okay, model, what do you do?
模型需要自己想办法解决。
The model needs to figure it out.
因此我们在这些环境中给模型分配任务,设计有趣的挑战,然后运行它们观察表现。
So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform.
接着我们会教导它们,在表现好坏时给予相应的奖励。
And then we teach them, we give them these rewards when they're doing a good job or a bad job.
我认为有趣的是,这些环境真实展现了模型在现实世界端到端任务中的薄弱环节。
And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world.
有些模型在独立基准测试中显得非常聪明。
You have all these models that seem really smart on isolated benchmarks.
比如,它们擅长单步工具调用。
Like, they're good at single step tool calling.
它们也擅长单步指令跟随。
They're good at single step instruction following.
但突然间,你把它们丢进这些混乱的世界里,面对令人困惑的Slack消息和从未见过的工具,它们需要执行正确的操作、修改数据库,并在更长的时间跨度中进行交互——它们第一步的行为会影响第五十步的决策。
But suddenly, you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before, and they need to perform right actions and modify the databases and interact over longer time horizons where what they do in step one affects what they do in step 50.
这与它们之前所处的学术性单步环境截然不同,因此模型会以各种匪夷所思的方式彻底崩溃。
And that's very, very different from these kind of academic single step environments that they've been in before, and so the model just fails catastrophically in all these crazy ways.
所以我认为这些RL环境将成为模型学习的绝佳试验场,本质上就是对现实世界的模拟与复刻。
So I think these RL environments are going to be really interesting playgrounds for the models to learn from, essentially simulations and mimics of the real world.
这样相比那些人为设计的测试环境,它们处理真实任务的能力就有望不断提升。
And so they'll hopefully get better and better at real tasks, compared to all these contrived environments.
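作为粗略示意,可以把他描述的RL环境想象成一个带重置/执行循环和任务奖励的模拟;下面的故障场景、工具和奖励都是虚构的玩具示例,并非真实的Surge环境:
As a rough illustration, an RL environment like the one described can be thought of as a reset/step loop with a task-specific reward; the incident, the tools, and the reward below are an invented toy, not an actual Surge environment:

```python
class IncidentEnv:
    """Toy 'AWS went down' world: the agent must diagnose, then fix, the outage."""

    def reset(self):
        self.state = {"cause_found": False, "site_up": False}
        return "ALERT: the site is down. Slack is unreachable."

    def step(self, action):
        # Each action is a tool call the agent chooses; order matters,
        # since the fix only works after the cause has been diagnosed.
        if action == "check_status_page":
            self.state["cause_found"] = True
            obs = "Status page: AWS us-east-1 outage."
        elif action == "failover_region" and self.state["cause_found"]:
            self.state["site_up"] = True
            obs = "Traffic rerouted. Site responding again."
        else:
            obs = "No effect."
        done = self.state["site_up"]
        reward = 1.0 if done else 0.0  # sparse reward: site restored
        return obs, reward, done

env = IncidentEnv()
env.reset()
_, r1, done1 = env.step("failover_region")    # premature fix: no diagnosis yet
_, r2, done2 = env.step("check_status_page")  # diagnose first
_, r3, done3 = env.step("failover_region")    # now the fix lands
```

Notice how what the agent does in step one affects what works in step three, which is exactly the multi-step dependence the single-step benchmarks miss.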
所以我试着想象这具体是什么样子
So I'm trying to imagine what this looks like.
本质上就像一台虚拟机,里面装着浏览器或电子表格之类的,还有比如surge.com
Essentially, it's like a virtual machine with, I don't know, a browser or a spreadsheet or something in it with, like, I don't know, surge.com.
那是你的网站吗,surge.com?
Is that is that your website, surge.com?
我们得确认这个信息准确
Let's make sure we get that right.
实际上我们的网址是surgehq.ai。
So we are we are actually surgehq.ai.
Surgehq.ai。
Surgehq.ai.
快去看看吧。
Check it out.
我们正在招聘。
We're hiring.
我猜也是。
I imagine.
是的。
Yes.
好的。
Okay.
所以就是这样的,这里是surgehq.ai。
So so it's like, Here's surgehq.ai.
你的工作,假设你作为代理的职责,就是确保系统持续运行。
Your job, here's your job as an agent, let's say, is to make sure it stays up.
然后突然系统宕机了,目标函数就是要找出原因。
And then all a sudden it goes down, and the objective function is figure out why.
这是不是一个例子?
Is that is that an example?
是的。
Yeah.
所以目标函数可能是,或者说任务目标可能是:去查明原因并修复它。
So the objective function might be, or the goal of the task might be, okay, go figure out why and fix it.
因此目标函数可能是需要通过一系列单元测试。
And so the objective function might be, it might be passing a series of unit tests.
也可能是撰写一份文档,比如一份包含与实际情况完全匹配的特定信息的回顾报告。
It might be writing a document, like maybe it's a retro containing certain information that matches exactly what happened.
我们可以设置各种不同的奖励机制来判断它是否成功。
There's all these different rewards that we might give it that determine whether or not it's succeeding.
因此,我们本质上就是在教模型去达成那个奖励。
And so we're basically teaching the models to achieve that reward.
所以本质上,它就这样开始自主运行了。
So essentially, it's off and running.
这是你的目标。
Here's your goal.
找出网站宕机的原因并修复它。
Figure out why the site went down and fix it.
然后它就开始尝试各种方法。
And it just starts trying stuff.
我们运用它所有的智能资源。
We're using everything, all the intelligence it's got.
它会犯错。
It makes mistakes.
你需要在过程中给予指导,如果它做对了就给予奖励。
You kind of help it along the way, reward it if it's doing the right sort of thing.
所以你在这里描述的是模型进入更智能的下一阶段。
And so what you're describing here is the next phase of models becoming smarter.
我想是更多强化学习环境聚焦于具有经济价值的特定任务。
More RL environments focused on very specific tasks that are economically valuable, I imagine.
对。
Yeah.
没错。
Yeah.
就像过去模型学习有各种不同方法一样。
So, just in the same way that there were all these different methods for models learning in the past.
最初我们有SFT和RHF,后来有了评分标准和验证器。
Originally, we had SFT and RLHF, and then we had rubrics and verifiers.
这是下一阶段。
This is the next stage.
而且之前的方法并未过时。
And it's not the case that the previous methods are obsolete.
这同样只是一种不同形式的学习,对之前所有类型起到补充作用。
This is, again, just a different form of learning that complements all the previous types.
所以这就像是模型需要学会的另一种不同技能。
So it's just a different skill for the model to learn how to do.
因此在这种情况下,不再是某个物理学博士坐在那里与模型对话、纠正它、给它评估正确答案、创建评分标准之类的工作。
And so in this case, it's less some physics PhD sitting around talking to a model, correcting it, giving it evals of here's what the correct answer is, creating rubrics and things like that.
更像是这个人现在要设计一个环境。
More it's like this person now designing an environment.
我听到的另一个例子就像财务分析师那样,给你一个Excel表格和盈利目标,让你计算损益之类的。
So another example I've heard is like a financial analyst, just like here's an excel spreadsheet, here's your goal, figure out our profit and loss or whatever.
所以现在这些专家不再只是坐着写评分标准,而是在设计这个强化学习环境。
And so this expert now is instead of just sitting around writing rubrics, they're designing this RL environment.
是的。
Yeah.
完全正确。
Exactly.
所以那位财务分析师可能会创建一张电子表格。
So that financial analyst might create a spreadsheet.
他们可能会创建一些模型需要调用的工具,以帮助填写表格。
They may create certain tools that the model needs to call in order to help fill out the spreadsheet.
比如说,可能是这样的。
Like, it might be, okay.
模型需要访问彭博终端。
The model needs to access a Bloomberg terminal.
它需要学会如何使用终端,学会使用这个计算器,并掌握如何进行这项计算。
It needs to learn how to use it, and it needs to learn how to use this calculator, and it needs to learn how to perform this calculation.
因此它拥有所有这些可用的工具。
So it has all these tools that it has access to.
然后奖励机制可能是这样的。
And then the reward might be okay.
比如我会下载那份电子表格,并检查单元格B22是否包含正确的损益数字。
It's like, maybe I will download that spreadsheet, and I wanna see, does cell B22 contain the correct profit and loss number?
或者第二个标签页是否包含这条信息?
Or does tab number two contain this piece of information?
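他描述的这种表格检查可以写成一个简单的程序化验证器;单元格地址、工作表名和容差都是为演示而假设的:
The spreadsheet check he describes could be written as a simple programmatic verifier; the cell address, sheet name, and tolerance here are assumptions for illustration:

```python
def verify_spreadsheet(workbook, expected_pnl, tol=0.01):
    """Reward 1.0 only if cell B22 on Sheet1 holds the right profit-and-loss number."""
    value = workbook.get("Sheet1", {}).get("B22")
    if value is None:
        return 0.0
    return 1.0 if abs(value - expected_pnl) <= tol else 0.0

# The agent's final artifact, represented as nested dicts for simplicity.
workbook = {"Sheet1": {"B22": 1250000.0}}
reward = verify_spreadsheet(workbook, expected_pnl=1250000.0)
```

The same pattern covers the unit-test and retro-document rewards mentioned earlier: any programmatic check on the final artifact can serve as the reward.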
有趣的是,这更接近人类的学习方式。
And this what's interesting is this is a lot closer to how humans learn.
我们就是不断尝试,找出哪些方法有效、哪些无效。
We just try stuff, figure out what's working and what's not.
你提到轨迹在这过程中非常重要。
You talk about how trajectories are really important to this.
不仅仅是设定目标和终点那么简单。
It's not just here's the goal and here's the end.
而是过程中的每一步都至关重要。
It's like every step along the way.
你能具体解释下什么是轨迹,以及为什么这对我们很重要吗?
Can you just talk about what trajectories are and why that's important to us?
我认为人们没意识到的是,有时候模型虽然得出了正确答案,但过程却非常离奇。
I think one of the things that people don't realize is that sometimes, even though the model reaches the correct answer, it does so in all these crazy ways.
在中间轨迹阶段,它可能尝试了50次都失败了,但最终它只是随机地碰到了一个正确的数字。
In the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of randomly lands on the correct number.
或者有时候它的做法效率极低,几乎是通过奖励机制投机取巧地获得正确答案。
Or sometimes it just does things very, very inefficiently, or it almost reward-hacks its way to the correct answer.
因此我认为关注轨迹实际上非常重要。
And so I think paying attention to trajectories is actually really, really important.
我认为这也很重要,因为有些轨迹可能非常非常长。
And I think it's also really important because some of these trajectories can be very, very long.
所以如果你只检查模型是否得出了最终答案,就会遗漏模型在中间步骤中如何行动的所有信息。
And so if all you're doing is checking whether or not the model reaches the final answer, there's all this information about how the model behaved in the intermediate steps that's missing.
比如,有时候你希望模型通过反思自己的行为来得出正确答案。
Like, sometimes you want models to get to the correct answer by reflecting on what it did.
有时候你希望它能一次性直接得出正确答案。
Sometimes you want it to get the correct answer by just one shotting it.
而如果你忽略所有这些,就像教学时遗漏了大量本可以教会模型的信息。
And if you ignore all of that, you're just missing a lot of information that you could be teaching the model.
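为什么轨迹重要,可以用一个玩具打分函数来说明:它对整条路径而不只是最终答案打分;权重和惩罚方式都是为演示虚构的:
Why trajectories matter can be shown with a toy scoring function that grades the whole path, not just the final answer; the weights and the penalty scheme are invented for illustration:

```python
def score_trajectory(steps, final_correct, max_steps=10):
    """Blend outcome and process: correct answers reached efficiently score highest."""
    outcome = 1.0 if final_correct else 0.0
    # Penalize long, flailing trajectories even when the final answer is right.
    efficiency = max(0.0, 1.0 - len(steps) / max_steps)
    return 0.7 * outcome + 0.3 * efficiency

one_shot = score_trajectory(["solve"], final_correct=True)
flailing = score_trajectory(["try"] * 9, final_correct=True)  # many retries, then luck
```

Under an answer-only reward these two trajectories would score identically; here the one-shot solution scores higher, which is the signal trajectory-level grading adds.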
我太喜欢这个观点了。
I love that.
就是,对,就是这样。
Like, it just yeah.
它尝试各种方法最终总能做对。
It tries a bunch of stuff and eventually gets it right.
你不希望它学到的只是
You don't want it to learn:
「这就是达成目标的唯一途径」。
This is the way to get there.
实际上往往存在更高效的方法。
There's often a much more efficient way of doing it.
你提到了我们在帮助AI模型变聪明的过程中采取的各种步骤。
You mentioned all the steps we've taken along the journey of helping AI models get smarter.
鉴于你长期深入参与这项工作,我认为这些见解对人们会非常有帮助。
Since you've been so close to this for so long, I think this is going to be really helpful for people.
从最初的后训练阶段来看,哪些步骤对模型进步帮助最大?
What have been the steps along the way, starting from the first days of post-training, that have most helped models advance?
比如评估环节如何融入,强化学习环境又是怎样的?
Like, where do evals fit in, and the RL environments?
具体来说有哪些关键步骤,我们现在正朝着强化学习环境方向发展吗?
Just, what have the steps been, and now we're heading towards RL environments?
最初,模型的方式
Originally, the way models
开始进行后训练完全是通过SFT实现的。
started getting post-trained was purely through SFT.
那么,这个缩写代表什么意思?
And what does that stand for?
SFT代表监督式微调。
So SFT stands for supervised fine tuning.
这其实很像,我经常用人类学习来类比,SFT(监督微调)就像是模仿大师并复制他们的行为。
And it's a lot like, so, again, I often think in terms of these human analogies, and SFT is a lot like mimicking a master and copying what they do.
然后RLHF(人类反馈强化学习)变得非常主流。
And then RLHF became very dominant.
这个类比就像是:有时候你通过写55篇不同文章来学习,然后有人告诉你他们最喜欢哪一篇。
And the analogy there would be, like, sometimes you learn by writing 55 different essays and someone telling you which one they like the most.
我认为在过去一年左右,评分标准和验证器变得非常重要。
And then I think over the past year or so, rubrics and verifiers have become very, very important.
评分标准和验证器是通过被评分和获得详细错误反馈来学习的。
Rubrics and verifiers are learning by being graded and getting detailed feedback on where you went wrong.
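这里所说的评分标准(rubric)可以简化为一张带权重的检查清单;下面的示意把逐项反馈变成可用于奖励的分数,各项标准都是虚构的:
A rubric in this sense can be boiled down to a weighted checklist; this sketch (the criteria are invented) turns detailed per-item feedback into a score that could be used as a reward:

```python
def grade_with_rubric(response, rubric):
    """Each rubric item is (description, check_fn, weight); returns score and feedback."""
    score, feedback = 0.0, []
    for description, check, weight in rubric:
        passed = check(response)
        score += weight if passed else 0.0
        feedback.append((description, passed))  # where the model went wrong, item by item
    return score, feedback

rubric = [
    ("states the key equation", lambda r: "E = mc^2" in r, 0.5),
    ("stays under 50 words", lambda r: len(r.split()) < 50, 0.3),
    ("avoids hedging filler", lambda r: "maybe" not in r.lower(), 0.2),
]
score, feedback = grade_with_rubric("Energy and mass relate via E = mc^2.", rubric)
```

The per-item feedback is what distinguishes this from a plain pass/fail verifier: the model can be told which criterion it missed, not just that it missed.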
这些就是评估(evals)的另一种说法。
And "evals" is another word for that.
是的。
Yeah.
没错。
Yeah.
我认为评估通常涵盖两个层面。
So I think evals often covers two terms.
一是将评估用于训练,因为你正在评判模型表现是否良好,当它表现良好时,你会给予奖励。
One is you are using the evaluations for training because you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it.
另一种评估概念则是试图衡量模型的进展。
And then there's this other notion of eval where you're trying to measure the model's progress.
比如:好吧,我有五个不同的候选检查点,我想选出最优的那个公开发布。
Like, Okay, yeah, I have five different candidate checkpoints, and I wanna pick the one that's best in order to release it to the public.
所以我要对这五个不同的检查点都运行评估,来决定哪一个最优。
So I wanna run all these evals on these five different checkpoints in order to decide which one is best.
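他提到的用评估来挑选检查点,本质上就是按各项评估分数排序取最优;下面的示意假设分数已经算好,评估名称和数值都是虚构的:
The checkpoint-selection use of evals he mentions boils down to ranking candidates by their eval scores; this sketch assumes the scores are already computed (the eval names and numbers are invented):

```python
def pick_best_checkpoint(eval_scores):
    """eval_scores maps checkpoint name -> {eval name: score}; rank by mean score."""
    def mean_score(name):
        scores = eval_scores[name].values()
        return sum(scores) / len(scores)
    return max(eval_scores, key=mean_score)

# Candidate checkpoints, each scored on the same evals (three of five shown).
eval_scores = {
    "ckpt-1": {"coding": 0.72, "math": 0.61},
    "ckpt-2": {"coding": 0.75, "math": 0.66},
    "ckpt-3": {"coding": 0.70, "math": 0.69},
}
best = pick_best_checkpoint(eval_scores)
```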
太棒了。
Awesome.
没错。
Yeah.
而且,是的,现在我们有了RL环境,这算是最新的热门方向。
And, yeah, now we have RL environments, so it's kind of the hot new thing.
太棒了。
Awesome.
我热爱这段商业旅程的原因就在于总有新事物出现。
So what I love about this business journey is just there's always something new.
总是会有这样的情况,好吧。
There's always this like, okay.
我们已经在为各公司处理这些优质数据方面做得非常出色,现在他们又需要完全不同的东西。
We're getting so good at just all this beautiful data for companies, now they need something completely different.
现在我们正在为他们搭建所有这些虚拟机,满足各种不同的使用场景。
Now we're setting up all these virtual machines for them and all these different use cases.
没错。
Yep.
感觉这就是你们行业的重要组成部分——不断适应实验室的需求。
And it feels like that's a big part of this industry you're in is just adapting to what labs are asking for.
是啊。
Yeah.
没错。
Yeah.
我是说,我确实认为我们需要打造一套产品体系,来体现人类学习的千万种不同方式。
So I mean, I really do think that we are going to need to build a suite of products that reflect the million different ways that humans learn.
比如,想想如何成为一名优秀的作家。
For example, think about becoming a great writer.
你并不是靠死记硬背一堆语法规则就能变优秀的。
You don't become great by memorizing a bunch of grammar rules.
你要通过阅读经典著作、不断练习写作,还要从老师那里获得反馈,从书店买你书的读者留下的评论中汲取建议。
You become great by reading great books and you practice writing and you get feedback from your teachers and from the people who buy your books in a bookstore and leave reviews.
然后你会注意到哪些方法有效,哪些无效。
And you notice what works and what doesn't.
通过接触所有这些杰作,当然也包括糟糕的作品,你逐渐培养出鉴赏力。
And you develop taste by being exposed to all these masterpieces and also just terrible writing.
因此,你通过这种不断实践与反思的循环来学习,每一种学习方式——就像这些成为优秀作家的方法——都截然不同。
So you learn through this endless cycle of practicing reflection and each type of learning that you have, again, like these are all very, very different methods of learning to become a great writer.
同理,就像成为伟大作家有千百种途径一样,我认为AI也需要千百种不同的学习方式。
So just in the same way that there's a thousand different ways that a great writer becomes great, I think there's gonna be a thousand different ways that AIs need to learn.
这太有趣了。
It's so interesting.
最终结果在很多方面简直和人类如出一辙。
This just ends up being just like humans in so many ways.
这很合理,因为从某种意义上说,神经网络深度学习正是模仿了人类的学习方式和大脑运作机制。但有趣的是,要让它们变得更聪明,关键在于我们如何越来越接近人类的学习方式。
It makes sense because, in a sense, neural networks and deep learning are modeled after how humans learn and how our brains operate. But it's interesting that to make them smarter, the question is how we come closer and closer to how humans learn.
是啊。
Yeah.
这几乎就像最终目标就是把你扔进环境里,观察你如何进化。
It's almost like maybe the end goal is just throwing you into the environment and just seeing how you evolve.
但在这种进化过程中,存在着各种不同的次级学习机制。
But within that, within that evolution, there's all these different sub learning mechanisms.
这某种程度上正是我们现在在做的事。
Which is kind of what we're doing now.
所以这真的很有趣。
So that's really interesting.
这可能是我们实现通用人工智能前的最后一步。
This might be the last step before we hit AGI.
顺着这个话题,我了解到Surge有个非常独特之处是你们拥有自己的研究团队,这相当罕见。
Along these lines, something that's really unique to Surge that I learned is you guys have your own research team, which I think is pretty rare.
能谈谈为什么你们会在这方面投入,以及这项投资带来了什么成果吗?
Talk about just why that's something you guys have invested in and what has come out of that investment.
是啊。
Yeah.
我认为这源于我个人的背景。
So I think that stems from my own background.
我自己的背景就是一名研究者。
Like my own background is as a researcher.
因此,我始终从根本上关注推动行业和研究社区的发展,而不仅仅是关注收入。
And so I've always cared fundamentally about pushing the industry and pushing the research community and not just about revenue.
所以我认为我们的研究团队主要做几件不同的事情。
And so I think what our research team does is a couple different things.
我们公司内部的研究人员几乎可以分为两种类型。
So we almost have two types of researchers at our company.
一类是前置部署研究人员,他们通常与客户紧密合作,帮助他们理解他们的模型。
One is our forward deployed researchers who are often working hand in hand with our customers to help them understand their models.
我们会与客户密切合作,帮助他们了解他们模型当前的状况。
So we will work very closely with our customers to help them understand, okay, this is where your model is today.
这是你落后于所有竞争对手的地方。
This is where you're lagging behind all the competitors.
根据你的目标,这些是未来可以改进的方向。
These are some ways that you could be improving in the future given your goals.
我们将设计这些数据集、评估方法和训练技术,让你的模型变得更好。
And we're gonna design these datasets, these evaluation methods, these training techniques to make your models better.
因此这种与客户高度协作的理念,即他们本身也是研究者,只是更侧重于数据层面,我们与他们紧密合作,竭尽所能帮助他们提升——然后我们还有内部研究团队。
So there's this very, very collaborative notion of working with our customers, who are researchers themselves, just a little bit more focused on the data side, and we work hand in hand with them to do whatever it takes to make them the best. Then we also have our internal researchers.
我们的内部研究人员关注点略有不同。
Our internal researchers are focused on slightly different things.
他们专注于构建更优质的基准测试和排行榜体系。
They are focused on building better benchmarks and better leaderboards.
我曾多次谈到,我担心当前的排行榜和基准测试正在将模型引向错误方向。
I've talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction.
好的,没错。
So okay, yeah.
那么问题在于:我们该如何解决这个问题?
So the question is, how do we fix that?
这正是我们研究团队当前重点攻关的方向。
And so that's what our research team is focused really heavily on right now.
所以他们正在这个领域投入大量精力。
So they're working a lot on that.
他们还在研究其他事项,比如我们需要训练自己的模型,看看哪种数据表现最佳,哪类人员表现最优。
And they're also working on these other things like, okay, we need to train our own models to see what type of data performs the best, what types of people perform the best.
因此他们也在研究所有这些训练技术,并评估我们自己的数据集,以改进数据操作和内部数据产品,从而确定什么是高质量的标准。
And so they are also working on all these training techniques and evaluation of our own datasets to improve our data operations and the internal data products that we have that determine what makes something good quality.
这真是太棒了,因为基本上实验室有研究人员在帮助他们推进人工智能的发展。
It's such a cool thing because, basically, the labs have researchers helping them advance AI.
我想像你们这样的公司拥有真正从事人工智能基础研究的研究人员应该相当罕见。
I imagine it's pretty rare for a company like yours to have researchers actually doing primary research on AI.
是的。
Yeah.
没错。
Yeah.
我想这只是因为我本质上一直很关心这件事。
I think it's just because it's something I've fundamentally always cared about.
比如,我经常把我们看作更像一个研究实验室而非初创公司,因为这就是我的目标。
Like, I often think about us more like a research lab than a startup because that is my goal.
说起来有点好笑,但我总说,我宁愿成为陶哲轩也不愿当沃伦·巴菲特。
Like, it's kind of funny, but I've always said I would rather be Terence Tao than Warren Buffett.
这种推动前沿研究而不仅仅是追求估值的理念。
So that notion of creating research that pushes the frontier forward and not just getting some valuation.
这一直是驱动我的动力。
That's always been what drives me.
而且效果显著。
And it's worked out.
这正是这件事最美妙的地方。
That's the beautiful thing about this.
你提到正在招聘研究人员。
You mentioned that you were hiring researchers.
这方面有什么想分享的吗?你们在寻找什么样的人才?
Is there anything there you want to share, folks you're looking for?
我们寻找的是那些对数据集有根本兴趣、愿意整天钻研的人。
So we look for people who are just fundamentally interested in datasets all day.
这类人可以连续十小时埋头于数据集,摆弄模型并思考:'对,我认为模型在这里存在缺陷'。
So types of people who could literally spend ten hours digging through a dataset and playing around with models and thinking, okay, yeah, this is where I think the model's failing.
这才是你希望模型具备的行为模式。
This is the kind of a behavior you want the model to have instead.
这种非常注重实践的特质,以及同时关注模型的定性层面而不仅仅是定量部分。
And just this aspect of being very, very hands on and thinking about the qualitative aspects of models and not just the quantitative parts.
重申一下,关键在于亲自动手处理数据,而非只关注那些抽象算法。
So again, it's like this aspect of being hands on with data and not just caring about these kind of abstract algorithms.
太棒了。
Awesome.
我想问几个关于AI市场的宏观问题。
I'm gonna ask a couple broad AI kind of market questions.
未来几年AI发展方面,你认为还有哪些人们关注不足或意料之外的趋势?
What else do you think is coming in the next couple years that people are maybe not thinking enough about or not expecting in terms of where AI is heading?
哪些因素将至关重要?
What's gonna matter?
我认为未来几年会发生的一个变化是,由于不同实验室的个性特征、行为方式以及他们为模型优化的目标函数不同,AI模型将变得越来越差异化。
I think one of the things that's gonna happen in the next few years is that the models are actually gonna become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for.
这是我在一年前还没有充分认识到的一点。
I think it's one thing I didn't appreciate a year or so ago.
大约一年前,我曾认为所有AI模型本质上都会变得高度同质化。
A year or so ago, I thought that all of the AI models would essentially become very, very commoditized.
它们的行为会彼此趋同。
They would all behave like each other.
当然,某个模型今天可能在某个方面稍显聪明,但其他模型肯定会在几个月内迎头赶上。
And sure, one of them might be slightly more intelligent in one way today, but sure, the other ones would catch up in the next few months.
但过去一年让我意识到,公司持有的价值观将塑造模型特性。
But I think over the past year, I've realized that the values that the companies have will shape the the model.
让我给你举个例子说明。
So let me give you an example.
前几天我让Claude帮我写封邮件,它竟然迭代了30个不同版本。
So I was asking Claude to help me draft an email the other day, and it went through 30 different versions.
三十分钟后,是的,我认为它确实帮我精心打造了一封完美的邮件,然后我发送了出去。
And after thirty minutes, yeah, I think it really crafted me the perfect email, and I sent it.
但随后我意识到,我花了三十分钟做了一件完全无关紧要的事。
But then I realized I spent thirty minutes doing something that didn't matter at all.
比如,确实。
Like, sure.
现在我得到了完美的邮件,但我花了三十分钟做了一件以前根本不会在意的事,而且这封邮件可能根本无足轻重。
Now I got the perfect email, but I spent thirty minutes doing something I wouldn't have worried about at all before, and this email probably didn't even move the needle or anything anyways.
所以我认为这里有一个深刻的问题:如果你能选择完美的模型行为,你会想要哪种模型?
So I think there's a deep question here, which is if you could choose the perfect model behavior, which model would you want?
你是想要一个会说‘你说得完全正确’的模型,
Do you want a model that says, you're absolutely right.
‘这封邮件绝对还有20种改进方式’,然后继续迭代50次,耗尽你所有时间和精力?
There are definitely 20 more ways to improve this email, and it continues for 50 more iterations, and it sucks up all your time and engagement?
还是想要一个优化你时间和效率、直接说‘不’的模型?
Or do you want a model that's optimizing for your time and productivity and just says, no.
你需要停下来。
You need to stop.
你的邮件已经很好了。
Your email's great.
直接发送然后继续你的一天吧。
Just send it and move on with your day.
同样地,就像在道路分叉处你可以选择模型行为方式一样,对于模型面临的每个其他问题,你所期望的行为类型都将从根本上影响它。
And again, in the same way that there's kind of a fork in the road in how you could choose your model to behave for this question, for every other question that models face, the kind of behavior that you want will fundamentally affect it.
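This fork in the road can be made concrete as two toy objective functions scored over the same interaction; the function names and all the numbers below are invented, not any lab's actual reward:

```python
# Two hypothetical objective functions for the email-assistant fork in the
# road: one rewards engagement (time spent iterating), one rewards the
# user's productivity. All names and numbers are invented.

def engagement_objective(minutes_spent: float, email_quality: float) -> float:
    # More time in the product scores higher, regardless of whether it helped.
    return minutes_spent + email_quality

def productivity_objective(minutes_spent: float, email_quality: float) -> float:
    # The outcome matters, but time taken from the user counts against it.
    return email_quality - minutes_spent

quick = (2.0, 8.0)      # 2 minutes, a perfectly fine email
marathon = (30.0, 9.5)  # 30 minutes, the "perfect" email

# The same two behaviors rank in opposite orders under the two objectives.
assert engagement_objective(*marathon) > engagement_objective(*quick)
assert productivity_objective(*quick) > productivity_objective(*marathon)
```

Whichever of these a lab's training data rewards is, in effect, the personality its model ends up with.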
这就像谷歌构建搜索引擎的方式与Facebook截然不同,而苹果又会采用完全不同的方式。
It's almost like in the same way that when Google builds a search engine, it's very, very different from how Facebook would build a search engine, which is very, very different from how Apple would build a search engine.
他们都有各自的原则、价值观以及试图在世界上实现的目标,这些塑造了他们将要构建的所有产品。
Like, they all have their own principles and values and things that they're trying to achieve in the world that shape all the products that they're gonna build.
同样地,我认为所有这些因素也会导致它们开始表现出截然不同的行为。
And in the same way, I think all of these models will start behaving very, very differently too.
这真是极其有趣。
That is incredibly interesting.
你已经从Grok身上看到了这一点。
You already see that with Grok.
它有着非常不同的个性和回答问题的方式。
It's got, like, a very different personality and a very different approach to answering questions.
所以我听到的是,你会看到更多这种差异化。
And so what I'm hearing is you're gonna see more of this differentiation.
是的。
Yep.
沿着这些思路的另一个问题。
Kind of an another question along these lines.
你认为AI领域最被低估的是什么?人们讨论得不够多但真的很酷的东西?
What do you think is most underhyped in AI that you think maybe people aren't talking enough about that is really cool?
那什么又是被过度炒作的?
And what do you think is overhyped?
我认为被低估的一点是所有这些聊天机器人将开始内置的产品功能。
I think one of the things that was underhyped is the built in products that all of the chatbots are going to start having.
我一直是Claude Artifacts的超级粉丝。
I've always been a huge fan of Claude Artifacts.
我认为它真的非常非常好用。
I think it just works really, really well.
其实前几天,我不知道这是不是新功能,我让它帮我创建一封邮件,但没完全成功,因为它不允许我发送邮件。
And actually, the other day, I don't know if it's a new feature or not, but I was asking it to help me create an email, and it didn't quite work because it didn't allow me to send the email.
但它最终创建的是一个小...我不知道该怎么称呼,像是一个可以点击的小方框,点击后就能把这条消息发给某人。
But what it created instead was, like, a little, I don't know what you call it, like a little box where I could click on it and it would just text someone this message.
我认为这种将制品提升到新高度的概念——在聊天机器人内部集成这些迷你应用、迷你用户界面——人们对此讨论得还不够多。
And I think that concept of taking artifacts to the next level where you just have these like mini apps, mini UIs within the chatbots themselves, I feel like people aren't talking enough about that.
所以我觉得这是个被低估的领域。
So I think that that's one underhyped area.
至于被高估的领域,我绝对认为Vibe编程被过度炒作了。
And in terms of overhyped areas, I definitely think that Vibe coding is overhyped.
我觉得人们没有意识到长期来看这会让系统变得难以维护,他们只是简单地把这些代码塞进代码库。
I think people don't realize how much it's going to make your systems unmaintainable in the long term when they just simply dump this code into their code bases.
目前看来它似乎运行良好。
It seems to work out right now.
所以我担心未来的编码问题。
So I worry about future coding.
这种情况一直在持续发生。
It just keeps on happening.
这些回答太精彩了。
These are amazing answers.
关于第一点,其实我之前也问过类似的问题。
On that on that first point, there's something I actually asked.
我播客请到了Anthropic和OpenAI的首席产品官Kevin Weil和Mike Krieger。
I have the chief product officers of Anthropic and OpenAI, Kevin Weil and Mike Krieger, on the podcast.
我当时问他们:作为产品团队,既然拥有这种超级智能,你们还需要产品团队多久?
And I asked them just like, as a product team, like, you have this gigabrain intelligence, how long do you even need product teams?
你觉得这个AI会直接为你创造产品吗?
Do you think this AI will just create the product for you?
这就是我想要的。
Here's what I want.
这就像是氛围编程的更高境界。
It's like it's like the next level of vibe coding.
就像直接告诉它‘这就是我想要的’,然后它就能构建产品,并在你使用过程中不断完善产品。
It's just like telling it, here's what I want, and it's just building the product and evolving the product as you're using it.
感觉你描述的方向正是我们可能前进的方向。
And it feels like that's what you're describing is where we might be heading.
是的。
Yeah.
没错。
Yeah.
我认为这是个非常非常强大的概念,它能帮助人们以更快的方式实现想法。
I think that's a very, very powerful notion where it helps people just achieve their ideas in a much quicker way.
我们还没聊到的一点,我觉得特别有意思,就是你创办Surge的故事。
Something we haven't gotten into that I think is really interesting is just the story of how you got to starting Surge.
你的背景确实非常独特。
You have a really unique background.
我一直记得Coinbase创始人Brian Armstrong的一次演讲,他谈到自己独特的背景如何促使他创立Coinbase,这个观点让我印象深刻。
Brian Armstrong, the founder of Coinbase, once gave this talk that has really stuck with me, where he talked about how his very unique background allowed him to start Coinbase.
他既有经济学背景,又有密码学经验,同时还是一名工程师。
He had an economics background, he had cryptography experience, and then he was an engineer.
我认为这简直是创立Coinbase最完美的能力组合。
And so I think that was like the perfect Venn diagram for starting Coinbase.
而我觉得你创立Surge的故事也有相似之处。
And I feel like you have a very similar story with Surge.
能聊聊你的背景吗?以及这些经历如何引导你创立Surge?
Talk about that and your background there and how that led to Surge.
追溯起来,我从小就痴迷于数学和语言。
Going way back, I was always fascinated by math and language when I was a kid.
选择麻省理工学院不仅因为它是数学和计算机科学的顶尖学府,还因为那里是Noam Chomsky的大本营。
I went to MIT because it's obviously one of the best places for math and CS, but also because it's the home of Noam Chomsky.
我学生时代的梦想其实是找到一个能连接所有这些不同领域的底层理论。
My dream in school was actually to find some underlying theory connecting all these different fields.
后来我先后在谷歌、Facebook和Twitter担任研究员,却不断遇到同一个问题。
Then I became a researcher at Google and Facebook and Twitter, and I just kept running into the same problem over and over again.
我们根本无法获得训练模型所需的那些数据。
It was impossible to get the data that we needed to train our models.
我一直坚信高质量数据的必要性。
I was always this huge believer in the need for high quality data.
直到2020年GPT-3问世,我才意识到:没错,如果我们想要突破现状,构建能编程、会使用工具、懂幽默、会写诗、能解决黎曼猜想甚至治愈癌症的模型,那确实需要一套全新的解决方案。
Then GPT-3 came out in 2020, and I realized that, yeah, if we want to take things to the next level and build models that could code and use tools and tell jokes and write poetry and solve the Riemann hypothesis and cure cancer, then yeah, we were going to need a completely new solution.
在那些公司工作时最让我抓狂的是:明明人类心智的全部潜能就在眼前,而数据团队却只专注于图像标注这类简单工作。
The thing that always drove me crazy when I was at all these companies was we had the full power of the human mind in front of us, and all the data teams out there were focused on really simple things like image labeling.
所以我决心开发一个专注于所有这些高级复杂用例的系统,真正助力下一代模型的构建。
So I wanted to build something focused on all these advanced, complex use cases instead that would really help us build these next generation models.
确实,我在数学、计算机科学和语言学交叉领域的背景,深刻塑造了我一直想做的事。
So yeah, I think my background in kind of cross math and computer science and linguistics really, really informed what I always wanted to do.
于是我在一个月后创立了Surge,我们的唯一使命就是构建那些推动人工智能前沿发展所需的应用场景。
And so I started Surge a month later with our one mission, to basically build the use cases that were gonna be needed to push the frontier of AI.
你刚才说一个月后,具体是指什么之后的一个月?
And you said a month later, a month later after what?
是在2020年GPT-3发布之后的一个月。
After the GPT-3 launch in 2020.
哦,好的。
Oh, okay.
哇。
Wow.
好的。
Okay.
是啊。
Yeah.
真是个明智的决定。
A great decision.
除了你目前取得的巨大成功之外,现阶段还有什么在推动着你前进?
What kinda drives you at this point, other than just the epic success you're having?
是什么让你保持动力,继续在这个领域深耕和创造?
What keeps you motivated to keep building this and, you know, building something in this space?
我认为我骨子里是个科学家。
I think I'm a scientist at heart.
我一直以为自己会成为一名数学或计算机科学教授,致力于理解宇宙、语言以及交流的本质。
I always thought I was going to become this math or CS professor and work on trying to understand the universe and language and the nature of communication.
说起来有点好笑,但我一直有个天真的梦想:如果有外星人造访地球,我们需要想办法与他们沟通时,我希望政府能打电话找我,用各种高深的数学、计算机科学和语言学知识来破译他们的语言。
Like, it's kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we need to figure out how to communicate with them, I wanted to be the one that government would call, and I'd use all this fancy math and computer science and linguistics to decipher it.
所以直到今天,我最爱做的事就是每当有新模型发布时,我们都会对这个模型本身进行非常深入的研究。
So even today, what I love doing most is every time a new model is released, we'll actually do a really deep dive into the model itself.
我会亲自上手把玩它。
I'll play around with it.
我会运行各种评估测试。
I'll run evals.
我会比较它在哪些方面有所改进和进步。
I'll compare where it's improved, where it's progressed.
我会撰写这份深度分析报告并发送给客户。
I'll create this really deep dive analysis that we send our customers.
其实挺有趣的,我们经常说是数据科学团队的报告,但实际上很多只是我个人的分析。
And it's actually kinda funny because a lot of times we will say it's from a data science team, but often it's actually just me.
我觉得我可以整天做这种事。
And I think I could do this all day.
比如,我很难忍受整天开会。
Like, I have a very hard time being in meetings all day.
我完全不懂销售。
I'm terrible at sales.
我不擅长做人们期望CEO做的那些常规事务,但我热爱撰写这些分析报告。
I'm terrible at doing the typical CEO things that people expect you to do, but I love writing these analyses.
我喜欢和研究团队深入探讨他们的发现。
I love jamming with a research team about what they're seeing.
有时候我会熬夜到凌晨3点,就为了和研究团队的人通电话讨论训练模型。
Sometimes I'll be up until 3AM just talking on the phone with somebody on the research team about training models.
所以我非常热爱这份工作。
So I love that.
我依然能够整天亲自动手处理数据和科学研究。
I still get to be really hands on working on the data and the and the science all day.
我认为驱使我前进的动力是希望Surge能在AI的未来中扮演关键角色,而我相信这也是人类的未来。
And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity.
我们对数据、语言、质量以及如何衡量和确保这一切走在正确道路上有着独特的见解。
We have these really unique perspectives on data and language and quality and how to measure all this and how to ensure it's all going on the right path.
我认为我们独特地不受那些有时会将公司引向负面方向的各种影响所约束。
And I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction.
就像我之前说的,我们构建Surge更像是一个研究实验室,而非典型的初创公司。
Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup.
所以我们关注好奇心、长期激励和学术严谨性,而不太在意季度指标或董事会报告上好看的内容。
So we care about curiosity and long-term incentives and intellectual rigor, and we don't care as much about quarterly metrics and what's gonna look good in a board deck.
因此我的目标是利用我们公司所有这些独特之处,确保我们塑造的人工智能方向真正有益于人类物种及其长远发展。
And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for our species in the long run.
通过这次对话,我深刻意识到你和你这样的公司对人工智能发展方向的影响力有多大。
What I'm realizing in this conversation is just how much influence you have and companies like yours have on where AI heads.
事实上你们帮助实验室发现他们的不足和改进方向。大家都只关注OpenAI、Anthropic等公司的负责人,认为他们是人工智能的引领者,但我现在听到的是,你们对技术发展方向同样具有重大影响力。
The fact that you help labs understand where they have gaps and where they need to improve. It's not just, you know, everyone looks at the heads of OpenAI and Anthropic and all these companies as the ones ushering in AI, but what I'm hearing here is you have a lot of influence on where things head too.
是的,我认为这是一个非常强大的生态系统——说实话,人们还不清楚模型将如何发展,他们想如何塑造这些模型,以及希望人类在未来这一切中扮演什么角色。
Yeah, I think there's this really powerful ecosystem where, honestly, people just don't know where models are headed and how they want to shape them yet and how they want humanity to play a role in the future of all this.
所以我认为我们有很多机会可以继续引导这场讨论的方向。
And so I think there's a lot of opportunity to just continue shaping this discussion.
顺着这个话题,我知道你对这项工作为何对人类如此重要有着非常坚定的观点。
Along that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important.
请详细谈谈这个观点。
Talk about that.
我可能会说得有点哲学化,但这个问题本身就很哲学,所以请耐心听我说。
I'll get a bit philosophical here, but I think the question is always a bit philosophical, so bear with me.
我们工作的最直观理解就是训练和评估人工智能。
So the most straightforward way of thinking about what we do is we train and evaluate AI.
但我经常思考一个更深层的使命:帮助客户思考他们理想的目标函数。
But there's a deeper mission that I often think about, which is helping our customers think about their dream objective functions.
比如,他们究竟希望自己的模型成为什么样子?
Like, yeah, what kind of model do they want their model to be?
一旦我们帮助他们明确这点,就能协助训练模型朝这个北极星目标前进,并帮助他们衡量进展。
And once we help them do that, we'll help them train their model to reach this North Star, and we'll help them measure their progress.
但这确实很难,因为目标函数非常丰富且复杂。
But it's really hard because objective functions are really rich and complex.
这有点像养孩子时问他们:'好吧,你想通过什么考试?'
It's kind of like the difference between having a kid and asking them, okay, what test do you wanna pass?
你是希望他们在SAT考试中取得高分并写出一篇出色的大学申请论文吗?
Do you want them to get a high score on the SAT and write a really good college essay?
这是最简单的版本。
That's the simplistic version.
与之相比,你希望他们长大后成为什么样的人?
Versus what kind of person do you want them to grow up to be?
如果他们无论做什么都感到快乐,你会开心吗?还是你希望他们能上好学校并取得财务上的成功?
Will you be happy if they're happy no matter what they do, or are you hoping they'll go to a good school and be financially successful?
再次强调,如果你接受这个概念,那就好比说,好吧。
And, again, if you take that notion, it's like, okay.
你如何定义幸福?
How do you define happiness?
你如何衡量他们是否快乐?
How do you measure whether they're happy?
你如何衡量他们是否财务成功?
How do you measure whether they're financially successful?
这比单纯衡量你是否在SAT考试中取得高分要复杂得多。
Like, it's a lot harder than simply measuring whether or not you're getting a high score on the SAT.
而我们正在做的,就是希望帮助客户实现他们梦想中的北极星目标,并找出衡量这些目标的方法。
And what we're doing is we want to help our customers reach, again, their dream north stars and figure out how to measure them.
所以我举了这个例子,当你要求模型写出50种不同的邮件版本时,你希望它们怎么做?
And so I talked about this example of what you want models to do when you're asking them to write 50 different email iterations.
你是让它们继续再写50个版本吗?
Do you just continue them for 50 more?
还是直接说‘不用了,就这样吧,这个已经够完美了’?
Or do you just say, no, just move on with your day because this is perfect enough?
更宏观的问题是:我们构建的这些系统真的在推动人类进步吗?
And the broader question is, are we building these systems that actually advance humanity?
那么我们该如何构建数据集来训练并衡量这一点呢?
So how do we build the datasets to train towards that and measure it?
我们是否正在为所有这些错误的目标进行优化?
Are we optimizing for all these wrong things?
还是在构建只会不断吞噬我们时间、让我们越来越懒惰的系统?
Or are we just building systems that suck up more and more of our time and make us lazier and lazier?
是的,我认为这与我们的工作息息相关,因为要衡量和定义某件事是否真正推动人类进步是非常困难的。
And, yeah, I think it's really relevant to what we do because it's very hard and difficult to measure and define whether something is genuinely advancing humanity.
相反,衡量这些替代指标(如点击量和点赞数)却非常容易。
It's very easy to measure all these proxies instead, like clicks and likes.
但我想这正是我们工作的有趣之处。
But I think that's why our work is so interesting.
我们希望致力于那些需要最艰难数据工作的硬性重要指标,而不只是简单的指标。
We wanna work toward hard, important metrics that require the hardest kinds of data, and not just easy ones.
所以我常说:你就是你的目标函数。
So I think one of the things I often say is, you are your objective function.
因此我们要追求复杂的目标函数,而非这些简单的替代指标。
So we want to reach complex objective functions and not these simplistic proxies.
我们的工作就是找出如何获取与之匹配的数据。
And our job is to figure out how to get the data to match this.
没错,我们需要数据。
So, yeah, we want data.
我们需要能衡量AI是否让生活更充实的指标。
We want metrics that measure whether AI is, like, making our life richer.
我们希望以这种方式训练我们的系统,并且我们需要能激发好奇心与创造力而非助长惰性的工具。
We wanna train our systems this way, and we want tools that make us more curious and more creative, not just lazier.
这很困难,因为人类天生就有些懒惰,所以这类AI成为获取用户参与度、拉高各项指标的最简单途径。
And it's hard because, yeah, humans are kind of inherently lazy, so that kind of AI is the easiest way to get engagement and make all your metrics go up.
因此我认为,关于选择正确目标函数并确保我们向其优化而非仅追求简单代理指标的问题,对我们的未来至关重要。
So I think this question about choosing the right objective functions and making sure that we're optimizing towards them and not just these easy proxies is really, really important to our future.
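A minimal sketch of this proxy-versus-goal gap, with invented session names and numbers: the same two assistant behaviors rank in opposite orders depending on which objective you optimize.

```python
# Toy illustration: an easy proxy metric (clicks) and a deeper objective
# (minutes of genuinely useful time minus minutes wasted) can rank the
# same two assistant behaviors in opposite orders. All values are invented.

sessions = {
    "engagement_bait": {"clicks": 50, "useful_minutes": 5, "wasted_minutes": 25},
    "respects_time": {"clicks": 5, "useful_minutes": 10, "wasted_minutes": 0},
}

def proxy_score(s: dict) -> float:
    # Easy to measure, like clicks and likes.
    return s["clicks"]

def deeper_score(s: dict) -> float:
    # Closer to "is this AI making the user's life richer?"
    return s["useful_minutes"] - s["wasted_minutes"]

best_by_proxy = max(sessions, key=lambda k: proxy_score(sessions[k]))
best_by_goal = max(sessions, key=lambda k: deeper_score(sessions[k]))
# Optimizing the proxy picks the engagement-bait behavior;
# optimizing the deeper objective picks the time-respecting one.
```

The hard part in practice is that "useful minutes" has no ready-made counter the way clicks do, which is exactly the measurement problem described above.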
哇。
Wow.
你分享的这些见解让我对构建AI、训练AI的复杂性以及你们的工作有了更深的理解,这太棒了。
I love how what you're sharing here gives me so much more appreciation of the nuances of building AI, training AI, the work that you're doing.
要知道,外界可能只看到Surge和这个领域的公司在不断生成数据投喂AI,但显然这其中蕴含的深意远超人们想象。
You know, from the outside, people could just look at Surge and companies in this space and think they're just creating all this data and feeding it to AI, but clearly there's so much to this that people don't realize.
得知由你这样的人如此深入地思考并引领着这个领域,我感到非常振奋。
And I love knowing that you're at the head of this, that someone like you is thinking through this so deeply.
或许最后再问一个问题。
Maybe one more question.