为什么大多数AI产品会失败：来自OpenAI、谷歌和亚马逊50多次AI部署的经验教训

本集简介

Aishwarya Naresh Reganti 和 Kiriti Badam 帮助在 OpenAI、Google、Amazon 和 Databricks 等公司构建并推出了 50 多个企业级 AI 产品。基于这些经验，他们总结出一套构建和扩展成功 AI 产品的最佳实践。本次对话的目标是帮助你和你的团队避免大量痛苦与挫折。我们讨论： 1. AI 产品与传统软件的两个关键差异，以及这些差异为何从根本上改变了其构建方式 2. 成功构建 AI 产品的公司与挣扎中的公司之间的常见模式与反模式 3. 他们从实际经验中开发出的框架，用于迭代构建能形成改进飞轮的 AI 产品 4. 为何对客户信任与可靠性的极致关注是被低估的成功 AI 产品驱动力 5. 为何评估（evals）并非万能药，以及人们对其最常见的误解 6. AI 时代中，对构建者最重要的技能 — 由以下品牌赞助： Merge — 快速集成 220 多种服务的最快方式：https://merge.dev/lenny Strella — AI 驱动的客户研究平台：https://strella.io/lenny Brex — 为初创公司设计的银行解决方案：https://www.brex.com/product/business-account?ref_code=bmk_dp_brand1H25_ln_new_fs — 文字稿：https://www.lennysnewsletter.com/p/what-openai-and-google-engineers-learned — 我的主要收获（付费订阅者专享）：https://www.lennysnewsletter.com/i/183007822/referenced — 使用此链接，可享受 Aishwarya 和 Kiriti 的 Maven 课程《以问题为先构建智能体 AI 应用》15% 折扣：https://bit.ly/3V5XJFp — 如何找到 Aishwarya Naresh Reganti： • LinkedIn：https://www.linkedin.com/in/areganti • GitHub：https://github.com/aishwaryanr/awesome-generative-ai-guide • X：https://x.com/aish_reganti — 如何找到 Kiriti Badam： • LinkedIn：https://www.linkedin.com/in/sai-kiriti-badam • X：https://x.com/kiritibadam — 如何找到 Lenny： • 订阅号：https://www.lennysnewsletter.com • X：https://twitter.com/lennysan • LinkedIn：https://www.linkedin.com/in/lennyrachitsky/ — 本集内容涵盖： (00:00) 介绍 Aishwarya 和 Kiriti (05:03) AI 产品开发的挑战 (07:36) AI 与传统软件的关键差异 (13:19) 构建 AI 产品：从小处着手，逐步扩展 (15:23) 人类控制在 AI 系统中的重要性 (22:38) 避免提示注入与越狱攻击 (25:18) 成功 AI 产品开发的模式 (33:20) 关于评估与生产监控的争论 (41:27) Codex 团队对评估与客户反馈的方法 (45:41) 持续校准与持续开发（CC/CD）框架 (58:07) 新兴模式与校准 (01:01:24) 被过度炒作与被低估的 AI 概念 (01:05:17) AI 的未来 (01:08:41) 构建 AI 产品的技能与最佳实践 (01:14:04) 快问快答与总结 — 参考资料： • LevelUp Labs：https://levelup-labs.ai/ • 为什么你的 AI 产品需要不同的开发生命周期：https://www.lennysnewsletter.com/p/why-your-ai-product-needs-a-different • Booking.com：https://www.booking.com • 关于生产环境中智能体的研究论文（Matei Zaharia 实验室）：https://arxiv.org/pdf/2512.04123 • Matei Zaharia 在 Google Scholar 上的研究：https://scholar.google.com/citations?user=I1EvjZsAAAAJ&hl=en • 即将到来的 AI 安全危机（以及应对措施）| Sander Schulhoff：https://www.lennysnewsletter.com/p/the-coming-ai-security-crisis • Gajen Kandiah 在 LinkedIn 上：https://www.linkedin.com/in/gajenkandiah • Rackspace：https://www.rackspace.com • AI 原生初创公司：5 款产品、七位数收入、100% AI 编写的代码 | Dan Shipper（Every 联合创始人/CEO）：https://www.lennysnewsletter.com/p/inside-every-dan-shipper • 语义扩散：https://martinfowler.com/bliki/SemanticDiffusion.html • LMArena：https://lmarena.ai • Artificial Analysis：https://artificialanalysis.ai/leaderboards/providers • 为什么人类是 AI 的最大瓶颈（以及 2026 年将发生什么）| Alexander Embiricos（OpenAI Codex 产品负责人）：https://www.lennysnewsletter.com/p/why-humans-are-ais-biggest-bottleneck • 航空公司因聊天机器人提供错误建议而被追责——这对旅客意味着什么：https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know • Demis Hassabis 在 LinkedIn 上：https://www.linkedin.com/in/demishassabis • 我们用 20 个 AI 代理取代了销售团队——发生了什么？| Jason Lemkin（SaaStr）：https://www.lennysnewsletter.com/p/we-replaced-our-sales-team-with-20-ai-agents • 苏格拉底的名言：https://en.wikipedia.org/wiki/The_unexamined_life_is_not_worth_living • Noah Smith 的通讯：https://www.noahpinion.blog • 《硅谷》（HBO Max）：https://www.hbomax.com/shows/silicon-valley/b4583939-e39f-4b5c-822d-5b6cc186172d • Clair Obscur: Expedition 33：https://store.steampowered.com/app/1903340/Clair_Obscur_Expedition_33/ • Wisprflow：https://wisprflow.ai • Raycast：https://www.raycast.com • 史蒂夫·乔布斯的名言：https://www.goodreads.com/quotes/463176-you-can-t-connect-the-dots-looking-forward-you-can-only — 推荐书籍： • 《当呼吸化为空气》：https://www.amazon.com/When-Breath-Becomes-Paul-Kalanithi/dp/081298840X • 《三体》：https://www.amazon.com/Three-Body-Problem-Cixin-Liu/dp/0765382032 • 《深空之下》：https://www.amazon.com/Fire-Upon-Deep-Zones-Thought/dp/0812515285 — 制作与营销由 https://penname.co/ 负责。如有关于赞助本播客的咨询，请发送邮件至

Aishwarya Naresh Reganti and Kiriti Badam have helped build and launch more than 50 enterprise AI products across companies like OpenAI, Google, Amazon, and Databricks. Based on these experiences, they’ve developed a small set of best practices for building and scaling successful AI products. The goal of this conversation is to save you and your team a lot of pain and suffering. We discuss: 1. Two key ways AI products differ from traditional software, and why that fundamentally changes how they should be built 2. Common patterns and anti-patterns in companies that build strong AI products versus those that struggle 3. A framework they developed from real-world experience to iteratively build AI products that create a flywheel of improvement 4. Why obsessing about customer trust and reliability is an underrated driver of successful AI products 5. Why evals aren’t a cure-all, and the most common misconceptions people have about them 6. The skills that matter most for builders in the AI era — Brought to you by: Merge—The fastest way to ship 220+ integrations: https://merge.dev/lenny Strella—The AI-powered customer research platform: https://strella.io/lenny Brex—The banking solution for startups: https://www.brex.com/product/business-account?ref_code=bmk_dp_brand1H25_ln_new_fs — Transcript: https://www.lennysnewsletter.com/p/what-openai-and-google-engineers-learned — My biggest takeaways (for paid newsletter subscribers): https://www.lennysnewsletter.com/i/183007822/referenced — Get 15% off Aishwarya and Kiriti’s Maven course, Building Agentic AI Applications with a Problem-First Approach, using this link: https://bit.ly/3V5XJFp — Where to find Aishwarya Naresh Reganti: • LinkedIn: https://www.linkedin.com/in/areganti • GitHub: https://github.com/aishwaryanr/awesome-generative-ai-guide • X: https://x.com/aish_reganti — Where to find Kiriti Badam: • LinkedIn: https://www.linkedin.com/in/sai-kiriti-badam • X: https://x.com/kiritibadam — Where to find Lenny: • Newsletter: https://www.lennysnewsletter.com • X: https://twitter.com/lennysan • LinkedIn: https://www.linkedin.com/in/lennyrachitsky/ — In this episode, we cover: (00:00) Introduction to Aishwarya and Kiriti (05:03) Challenges in AI product development (07:36) Key differences between AI and traditional software (13:19) Building AI products: start small and scale (15:23) The importance of human control in AI systems (22:38) Avoiding prompt injection and jailbreaking (25:18) Patterns for successful AI product development (33:20) The debate on evals and production monitoring (41:27) Codex team’s approach to evals and customer feedback (45:41) Continuous calibration, continuous development (CC/CD) framework (58:07) Emerging patterns and calibration (01:01:24) Overhyped and under-hyped AI concepts (01:05:17) The future of AI (01:08:41) Skills and best practices for building AI products (01:14:04) Lightning round and final thoughts — Referenced: • LevelUp Labs: https://levelup-labs.ai/ • Why your AI product needs a different development lifecycle: https://www.lennysnewsletter.com/p/why-your-ai-product-needs-a-different • Booking.com: https://www.booking.com • Research paper on agents in production (by Matei Zaharia’s lab): https://arxiv.org/pdf/2512.04123 • Matei Zaharia’s research on Google Scholar: https://scholar.google.com/citations?user=I1EvjZsAAAAJ&hl=en • The coming AI security crisis (and what to do about it) | Sander Schulhoff: https://www.lennysnewsletter.com/p/the-coming-ai-security-crisis • Gajen Kandiah on LinkedIn: https://www.linkedin.com/in/gajenkandiah • Rackspace: https://www.rackspace.com • The AI-native startup: 5 products, 7-figure revenue, 100% AI-written code | Dan Shipper (co-founder/CEO of Every): https://www.lennysnewsletter.com/p/inside-every-dan-shipper • Semantic Diffusion: https://martinfowler.com/bliki/SemanticDiffusion.html • LMArena: https://lmarena.ai • Artificial Analysis: https://artificialanalysis.ai/leaderboards/providers • Why humans are AI’s biggest bottleneck (and what’s coming in 2026) | Alexander Embiricos (OpenAI Codex Product Lead): https://www.lennysnewsletter.com/p/why-humans-are-ais-biggest-bottleneck • Airline held liable for its chatbot giving passenger bad advice—what this means for travellers: https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know • Demis Hassabis on LinkedIn: https://www.linkedin.com/in/demishassabis • We replaced our sales team with 20 AI agents—here’s what happened | Jason Lemkin (SaaStr): https://www.lennysnewsletter.com/p/we-replaced-our-sales-team-with-20-ai-agents • Socrates’s quote: https://en.wikipedia.org/wiki/The_unexamined_life_is_not_worth_living • Noah Smith’s newsletter: https://www.noahpinion.blog • Silicon Valley on HBO Max: https://www.hbomax.com/shows/silicon-valley/b4583939-e39f-4b5c-822d-5b6cc186172d • Clair Obscur: Expedition 33: https://store.steampowered.com/app/1903340/Clair_Obscur_Expedition_33/ • Wisprflow: https://wisprflow.ai • Raycast: https://www.raycast.com • Steve Jobs’s quote: https://www.goodreads.com/quotes/463176-you-can-t-connect-the-dots-looking-forward-you-can-only — Recommended books: • When Breath Becomes Air: https://www.amazon.com/When-Breath-Becomes-Paul-Kalanithi/dp/081298840X • The Three-Body Problem: https://www.amazon.com/Three-Body-Problem-Cixin-Liu/dp/0765382032 • A Fire Upon the Deep: https://www.amazon.com/Fire-Upon-Deep-Zones-Thought/dp/0812515285 — Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email

双语字幕

仅展示文本字幕，不包含中文音频；想边听边看，请使用 Bayt 播客 App。

Speaker 0

我们曾经一起合作过一篇客座文章。

We worked on a guest post together.

Speaker 0

他们有一个非常关键的见解，即构建AI产品与构建非AI产品非常不同。

They had this really key insight that building AI products is very different from building non AI products.

Speaker 1

大多数人往往忽视了非确定性。

Most people tend to ignore the nondeterminism.

Speaker 1

你不知道用户可能会如何使用你的产品，也不知道LLM会如何回应用户的操作。

You don't know how the user might behave with your product, and you also don't know how the LLM might respond to that.

Speaker 1

第二个区别是代理控制的权衡。

The second difference is the agency control trade off.

Speaker 1

每次你将决策能力交给代理系统时，你实际上都在放弃一部分自身的控制权。

Every time you hand over decision making capabilities to agentic systems, you're kind of relinquishing some amount of control on your end.

Speaker 0

这极大地改变了你应该构建产品的方式。

This significantly changes the way you should be building product.

Speaker 2

因此，我们建议一步步来构建。

So we recommend building step by step.

Speaker 2

当你从小处着手时，会迫使你思考我究竟要解决什么问题。

When you start small, it forces you to think about what is the problem that I'm gonna solve.

Speaker 2

在所有这些AI进展中，一个容易滑向的误区是不断思考解决方案的复杂性，而忘记了你真正想解决的问题。

In all these advancements of the AI, one easy slippery slope is to keep thinking about complexities of the solution and forget the problem that you're trying to solve.

Speaker 1

这并不是要成为竞争对手中第一个拥有智能代理的公司。

It's not about being the first company to have an agent among your competitors.

Speaker 1

关键在于，你是否已经建立了正确的飞轮机制，以便能够持续改进？

It's about have you built the right flywheels in place so that you can improve over time?

Speaker 0

在成功构建AI产品的企业中，你看到哪些工作方式？

What kind of ways of working do you see in companies that build AI products successfully?

Speaker 1

我曾经与Rackspace的首席执行官共事过。

I used to work with the CEO of now Rackspace.

Speaker 1

他每天早上都会留出一个时间段，标注为‘上午4点到6点跟进AI进展’。

He would have this block every day in the morning, which would say catching up with AI four to 6AM.

Speaker 1

领导者必须重新回归亲力亲为的状态。

Leaders have to get back to being hands on.

Speaker 1

你必须接受这样一个事实：你的直觉可能并不正确，你可能是房间里最不懂的人，而你希望向每个人学习。

You must be comfortable with the fact that your intuitions might not be right, and you probably are the dumbest person in the room, and you wanna learn from everyone.

Speaker 0

你认为明年人工智能会是什么样子？

What do you think the next year of AI is gonna look like?

Speaker 2

坚持极其重要。

Persistence is extremely valuable.

Speaker 2

如今，任何新兴领域中成功的企业都在经历学习、实施并理解哪些方法有效、哪些无效的痛苦过程。

Successful companies right now building in any new area, they are going through the pain of learning this, implementing this, and understanding what works and what doesn't work.

Speaker 2

痛苦才是新的护城河。

Pain is the new moat.

Speaker 3

今天，我的嘉宾是艾什瓦里亚·拉甘蒂和库里蒂·巴塔姆。

Today, my guests are Aishwarya Raghanti and Kuriti Bhattam.

Speaker 3

库里蒂在OpenAI负责编解码器工作，过去十年里一直在谷歌和Kumo构建AI和机器学习基础设施。

Kuriti works on codecs at OpenAI and has spent the last decade building AI and ML infrastructure at Google and at Kumo.

Speaker 3

艾什早年曾在Alexa和微软从事AI研究，并发表了35篇以上的研究论文。

Ash was an early AI researcher at Alexa and Microsoft and has published over 35 research papers.

Speaker 3

他们共同领导并支持了超过50个AI产品在亚马逊、Databricks、OpenAI、Google以及众多初创公司和大型企业中的部署。

Together, they've led and supported over 50 AI product deployments across companies like Amazon, Databricks, OpenAI, Google, and both startups and large enterprises.

Speaker 3

他们还共同教授Maven平台上评分最高的AI课程，向产品负责人传授他们在打造成功AI产品过程中学到的所有关键经验。

Together, they also teach the number one rated AI course on Maven, where they teach product leaders all of the key lessons they've learned about building successful AI products.

Speaker 3

本集的目标是帮你和你的团队节省大量痛苦、煎熬和浪费的时间，避免在构建AI产品时走弯路。

The goal of this episode is to save you and your team a lot of pain and suffering and wasted time trying to build your AI product.

Speaker 3

无论你正在为产品无法运转而挣扎，还是希望提前避免这种困境，这一集都适合你。

Whether you are already struggling to make your product work or want to avoid that struggle, this episode is for you.

Speaker 3

如果你喜欢这个播客，请别忘了在你最喜欢的播客应用或YouTube上订阅和关注。

If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube.

Speaker 3

这会有极大的帮助。

It helps tremendously.

Speaker 3

如果你成为我通讯的年度订阅用户，你将免费获得一系列精彩产品的一年使用权，包括Lovable、Replit、Bold、Gamma、N8M、Linear、Devon、Posthoc、Superhuman、Descript、Perplexity、Warp、Granola、Magic、Pat、Dreycast、Chepard、Emobit和Stripe Atlas的一年免费服务。

And if you become an annual subscriber of my newsletter, you get a year free of a ton of incredible products, including a year free of Lovable, Replit, Bold, Gamma, N8M, Linear, Devon, Posthoc, Superhuman, Descript, Perplexity, Warp, Granola, Magic, Pat, and Dreycast, Chepard, Emobit, and Stripe Atlas.

Speaker 3

请前往lenny'snewsletter.com，点击Product Pass。

Head on over to lenny'snewsletter.com and click product pass.

Speaker 3

好了，接下来有请Eshwarya、Riganti和Kariti Bottom，先让我们听一段赞助商的简短广告。

With that, I bring you Eshwarya, Riganti, and Kariti Bottom after a short word from our sponsors.

Speaker 3

本集节目由Merge赞助播出。

This episode is brought to you by Merge.

Speaker 3

产品负责人最讨厌的就是开发集成。

Product leaders hate building integrations.

Speaker 3

这些集成杂乱无章。

They're messy.

Speaker 3

开发起来非常缓慢。

They're slow to build.

Speaker 3

它们严重拖慢了你的产品路线图。

They're a huge drain on your road map.

Speaker 3

而这些也绝对不是你当初投身产品行业的初衷。

And they're definitely not why you got into product in the first place.

Speaker 3

幸运的是，Merge对集成有着极致的专注。

Lucky for you, Merge is obsessed with integrations.

Speaker 3

通过一个API，B2B SaaS公司可以将Merge嵌入到自己的产品中，几周内即可上线220多个面向客户的集成，而无需等待数个季度。

With a single API, b to b SaaS companies embed Merge into their product and ship 220 plus customer facing integrations in weeks, not quarters.

Speaker 3

可以把Merge想象成B2B SaaS领域的Plaid。

Think of Merge like Plaid, but for everything b to b SaaS.

Speaker 3

Mistral AI、Ramp、Andrata等公司使用Merge，将客户的会计、人力资源、工单、CRM和文件存储系统连接起来，实现从自动入职到AI就绪的数据管道等各种功能。

Companies like Mistral AI, Ramp, Andrata use Merge to connect their customers as accounting, HR, ticketing, CRM, and file storage systems to power everything from automatic onboarding to AI ready data pipelines.

Speaker 3

更棒的是，Merge现在推出了新产品，支持安全地将连接器部署到AI代理中，让你能够安全地利用真实客户数据驱动AI工作流。

Even better, Merge now supports the secure deployment of connectors to AI agents with a new product so that you can safely power AI workflows with real customer data.

Speaker 3

如果你的产品需要从数十个系统中获取客户数据，Merge是最快、最安全的解决方案。

If your product needs customer data from dozens of systems, Merge is the fastest, safest way to get it.

Speaker 3

前往merge.dev/lenny预约并参加一场会议，他们将赠送你一张50美元的亚马逊礼品卡。

Book and attend a meeting at merge.dev/lenny, and they'll send you a $50 Amazon gift card.

Speaker 3

网址是merge.dev/lenny。

That's merge.dev/lenny.

Speaker 3

本集由Strela赞助播出，Strela是为AI时代打造的客户研究工具。

This episode is brought to you by Strela, the customer research built for the AI era.

Speaker 3

关于用户研究，真相是这样的。

Here's the truth about user research.

Speaker 3

它从未像现在这样重要，也从未如此令人痛苦。

It's never been more important or more painful.

Speaker 3

团队希望了解客户为何如此行事。

Teams wanna understand why customers do what they do.

Speaker 3

但招募用户、进行访谈并分析洞察需要数周时间。

But recruiting users, running interviews, and analyzing insights takes weeks.

Speaker 3

等到结果出来时，行动的最佳时机已经过去。

By the time the results are in, the moment to act has passed.

Speaker 3

Strela改变了这一现状。

Strela changes that.

Speaker 3

它是首个利用人工智能自动进行并分析深度访谈的平台，让每个团队都能获得快速且持续的用户研究能力。

It's the first platform that uses AI to run and analyze in-depth interviews automatically, bringing fast and continuous user research to every team.

Speaker 3

Strela的AI主持人会提出真实的跟进问题，在回答模糊时深入探究，并在几小时内而非数周内，从数百场对话中提炼出模式。

Strela's AI moderator asks real follow-up questions, probing deeper when answers are vague, and surfaces patterns across hundreds of conversations all in a few hours, not weeks.

Speaker 3

亚马逊和Duolingo等公司的产品、设计和研究团队已经使用Strela进行Figma原型测试、概念验证和客户旅程研究，能够在一夜之间获得洞察，而无需等待下一个迭代周期。

Product, design, and research teams at companies like Amazon and Duolingo are already using Strela for Figma prototype testing, concept validation, and customer journey research, getting insights overnight instead of waiting for the next sprint.

Speaker 3

如果你的团队希望以你发布产品的速度来了解客户，试试Strela吧。

If your team wants to understand customers at the speed you ship products, try Strela.

Speaker 3

前往 strela.iolenny 开启你的下一项研究。

Run your next study at strela.iolenny.

Speaker 3

是 s t r e l l a。

That's s t r e l l a.

Speaker 3

Iolenny。

Iolenny.

Speaker 0

Ash 和 Kereti，非常感谢你们的到来，欢迎来到本播客。

Ash and Kereti thank you so much for being here and welcome to the podcast.

Speaker 0

谢谢。

Thank you.

Speaker 0

谢谢你们

Thank Thank you for

Speaker 2

邀请我们。

having us.

Speaker 2

非常期待这次对话。

Super excited for this.

Speaker 0

让我为今天我们即将进行的对话做个铺垫。

Let me set the stage for the conversation that we're going to have today.

Speaker 0

你们两个人自己已经打造了不少人工智能产品。

So you two have built a bunch of AI products yourself.

Speaker 0

你们深入接触过许多构建人工智能产品的公司，也见过他们如何挣扎、如何尝试构建AI代理。

You've gone deep with a lot of companies who have built AI products, have struggled to build AI products, build AI agents.

Speaker 0

你们还开设了一门关于成功构建人工智能产品的课程，并且你们的使命就是减少人们在构建AI产品时不断遭遇的痛苦、挫折与失败。

You also teach a course on building AI products successfully that, and you're kind of like on this mission to just reduce pain and suffering and failure that you constantly see people go through when they're building AI products.

Speaker 0

因此，为了给我们今天的对话打下一点基础，你在企业一线看到人们在构建AI产品时，有哪些实际情况？

So to set a little just foundation for the conversation we're gonna have, what are you seeing on the ground within companies trying to build AI products?

Speaker 0

哪些方面进展顺利？

What's going well?

Speaker 0

哪些方面不顺利？

What's not going well?

Speaker 1

我认为2025年与2024年有显著不同。

I think 2025 has been significantly different than 2024.

Speaker 1

首先，怀疑态度已经大大减少。

One, the skepticism has significantly reduced.

Speaker 1

去年有很多领导者可能认为这不过是又一波加密货币热潮，因而持观望态度。

There were tons of leaders last year who probably thought this would be yet another crypto wave and kind of skeptical to get started.

Speaker 1

而我去年看到的许多用例，更多只是把聊天界面简单叠加在数据上。

And a lot of the use cases that I saw last year were more of slap chat on your data.

Speaker 1

对吧？

Right?

Speaker 1

那时他们还自称是AI产品。

And that was, you know, calling themselves an AI product.

Speaker 1

而今年，大量公司正在重新思考他们的用户体验和工作流程，真正意识到必须拆解并重建流程，才能成功构建AI产品。

And this year, a ton of companies are really rethinking their user experiences and their workflows and all of that and really understanding that you need to deconstruct and reconstruct your processes in order to have in order to build successfully AI products.

Speaker 1

是的。

Right.

Speaker 1

这才是真正有价值的部分。

And that's the good stuff.

Speaker 1

但问题在于，执行层面仍然参差不齐。

The bad stuff is the execution is still all over the place.

Speaker 1

你想一下，对吧。

Think of it, right.

Speaker 1

这是一个才三年历史的领域。

This is a three year old field.

Speaker 1

还没有现成的指南。

There are no playbooks.

Speaker 1

也没有教科书。

There are no textbooks.

Speaker 1

所以你必须边走边摸索。

So you really need to figure out as you go.

Speaker 1

与传统软件生命周期相比，AI的生命周期在部署前和部署后有很大不同。

And the AI lifecycle, pre deployment and post deployment, is very different as compared to a traditional software lifecycle.

Speaker 1

因此，传统角色之间（比如产品经理、工程师和数据团队）的旧有合同和交接方式现在已经破裂。

And so a lot of old contracts and handoffs between traditional roles, like say PMs and engineers and data folks, has now been broken.

Speaker 1

人们正在逐渐适应这种新的协作方式，某种程度上共同拥有同一个反馈循环。

And people are really getting adapted to this new way of working together and kind of owning the same feedback loop in a way.

Speaker 1

过去，我觉得产品经理、工程师和所有这些人都有各自独立的反馈循环来优化工作。

Because previously, feel like PMs and engineers and all of these folks had their own feedback loops to optimize.

Speaker 1

而现在，你们可能需要坐在一起，共同查看智能体的追踪记录，并一起决定产品应该如何表现。

And now you need to be probably sitting in the same room, you're probably looking at agent traces together and deciding how your product should behave.

Speaker 1

这是一种更紧密的协作形式。

So it's a tighter form of collaboration.

Speaker 1

因此，公司仍在摸索这一模式。

So companies are still kind of figuring that out.

Speaker 1

这正是我今年在咨询工作中所看到的情况。

That's kind of what I see in my consulting practice this year.

Speaker 0

那让我顺着这个思路继续说下去。

So let me follow that thread.

Speaker 0

我们几个月前一起合作写了一篇客座文章。

We worked on a guest post together that came out a few months ago.

Speaker 0

在撰写这篇文章的过程中，最让我印象深刻、一直铭记在心的是，确实有一个关键见解：构建AI产品与构建非AI产品非常不同。

And the thing that stood out to me most, that stuck with me most after working on that post is, yeah, this really key insight that building AI products is very different from building non AI products.

Speaker 0

而你特别强调要传达的，是两个非常重要的差异。

And the thing that you're big on getting across is there's two very big differences.

Speaker 0

谈谈这两个差异吧。

Talk about those two differences.

Speaker 1

是的。

Yes.

Speaker 1

而且我再次强调，我们要确保准确传达这个核心观点。

And again, I want to make sure that we drive home the right point.

Speaker 1

构建AI系统和软件系统之间也有很多相似之处。

There are tons of similarities of building AI systems and software systems as well.

Speaker 1

但有一些因素从根本上改变了你构建软件系统与AI系统的方式。

But then there are some things that kind of fundamentally change the way you build software systems versus AI systems.

Speaker 1

其中大多数人忽视的一个是非确定性。

And one of them that most people tend to ignore is the non determinism.

Speaker 1

与传统软件相比，你几乎是在使用一个非确定性的API。

You're pretty much working with a non deterministic API as compared to traditional software.

Speaker 1

这意味著什么？

What does that mean?

Speaker 1

为什么这会影响我们？在传统软件中，你几乎拥有一个非常明确的决策引擎或工作流程。

And why does that have to affect us is in traditional software, you pretty much have a very well mapped decision engine or workflow.

Speaker 1

想想Booking.com这样的网站，对吧？

Think of something like booking.com, right?

Speaker 1

你有一个意图，比如想在旧金山预订两晚的住宿等等。

You have an intention that you want to make a booking in San Francisco for two nights, etc.

Speaker 1

产品设计使得你的意图可以被转化为特定的操作，你通过点击一系列按钮、选项、表单等，最终实现你的目标。

The product has kind of been built so that your intention can be converted into a particular action and you kind of are clicking through a bunch of buttons, options, forms, all of that, and you finally achieve your intention.

Speaker 1

但现在，AI产品中的这一层完全被一个非常灵活的界面所取代，这个界面主要是自然语言，这意味着用户可以以无数种方式表达或传达他们的意图，对吧？

But now that layer in AI products is completely being replaced by a very fluid interface, which is mostly natural language, which means the user can literally come up with a ton of ways of saying or communicating their intentions, right?

Speaker 1

这改变了很多事情，因为你不知道用户会如何行为。

And that kind of changes a lot of things because now you don't know how your user is going to behave.

Speaker 1

这是在输入端的情况。

That's on the input side.

Speaker 1

而在输出端，你也在使用一个非确定性的概率性API，也就是你的大语言模型。

And the output is also that you're working with a nondeterministic probabilistic API, which is your LLM.

Speaker 1

大语言模型对提示的措辞非常敏感，而且基本上是黑箱。

And LLMs are pretty sensitive to prompt phasings and they're pretty much black boxes.

Speaker 1

所以你甚至不知道输出会是什么样子，对吧？

So you don't even know how the output surface will look like, right?

Speaker 1

因此，你不知道用户可能会如何与你的产品互动。

So this you don't know how the user might behave with your product.

Speaker 1

而且你也不知道大语言模型会如何回应。

And you also don't know how the LLM might respond to that.

Speaker 1

所以你现在处理的是输入、输出和一个过程。

So you're now working with an input output and a process.

Speaker 1

你对这三者都不太了解。

You don't understand all the three very well.

Speaker 1

你试图预测行为并为此付出代价。

You're trying to kind of anticipate behavior and pay for it.

Speaker 1

而在智能体系统中，这种情况变得更加复杂。

And with agentic systems, this kind of gets even harder.

Speaker 1

这就是我们谈到的第二个差异：代理控制权的权衡比率。

And that's where we talk about the second difference, which is the agency control trade off rate.

Speaker 1

我们说的是什么意思呢？

What do we mean by that?

Speaker 1

我很惊讶，这么多人竟然都不讨论这一点。

And I'm kind of shocked so many people don't talk about this.

Speaker 1

他们极度痴迷于构建自主系统，即能为你做事的智能体。

They're extremely obsessed with building autonomous systems, agents that can do work for you.

Speaker 1

但每次你将决策能力或自主权交给智能代理系统时，你实际上是在放弃自己的一些控制权。

But every time you hand over decision making capabilities or autonomy to agentic systems, you're kind of relinquishing some amount of control on your own trade.

Speaker 1

当你这样做时，你需要确保你的代理已经赢得你的信任，或者足够可靠，可以允许它做出决策。

And when you do that, you want to make sure that your agent has gained your trust, or it is reliable enough that you can allow it to make decisions.

Speaker 1

这就是我们所说的代理与控制之间的权衡：如果你给予你的AI代理或AI系统更多的自主权——即做出决策的能力，你就同时失去了部分控制权，因此你需要确保该代理或AI系统已经通过时间积累赢得了这种能力或信任。

And that's where we talk about this agency control trade off, which is if you give your AI agent or your AI system, whatever it is, more agency, which is the ability to make decisions, you're also losing some control, and you wanna make sure that the agent or the AI system has earned that ability or has built up trust over time.

Speaker 0

所以，为了总结你刚才分享的内容，本质上，人们长期以来一直在构建产品和软件产品。

So just to summarize what you're sharing here, essentially, people have been building product, software products for a long time.

Speaker 0

我们现在所处的世界是，你所构建的软件具有非确定性，可能会以不同的方式行事。

We're now in a world where the software you're building is one, nondeterministic, can just do things differently.

Speaker 0

就像你所说的，你去Booking.com订酒店，每次体验都是一样的，你会看到不同的酒店，但体验是可预测的。

Like, you know, as you said, you go to booking.com, you find a hotel, it's gonna be the same experience every time, you'll see different hotels, but it's a predictable experience.

Speaker 0

而使用AI时，你无法预测它每次都会完全按照你计划的方式运行。

With AI, you can't predict that it's gonna be the exact same thing, the thing that you plan it to be every time.

Speaker 0

另一个则是代理与控制之间的权衡。

And then the other is there's this trade off between agency and control.

Speaker 0

AI应该为你做多少，而人又应该保留多少控制权？

How much will the AI do for you versus how much should the person still be in charge?

Speaker 0

我听到的重点是，这极大地改变了你构建产品的方式，我们接下来会讨论这对产品开发生命周期应产生的影响。

And the what I'm hearing is the big point here is this significantly changes the way you should be building product, and we're gonna talk about the impact on how the product development life cycle should change as a result.

Speaker 0

在我们深入讨论之前，你还有什么想补充的吗？

Is there anything else you want to add there before we get into that?

Speaker 2

是的，当你开始构建时，这种区别在你的脑海中必须明确存在。

Yeah, it's definitely one of the key points that this kind of distinction needs to exist in your mind when you're starting to build.

Speaker 2

比如，假设你的目标是徒步攀登优胜美地的半圆顶。

For example, think about if your objective is to hike a half dome in Yosemite.

Speaker 2

对吧？

Right?

Speaker 2

你不会每天都去徒步攀登，而是从一些小部分开始训练自己，逐步提升，最终达成目标。

You don't start hiking it every day, but you start training yourself in minor parts, and then you slowly improve and then you go to the end goal.

Speaker 2

对吧？

Right?

Speaker 2

我觉得这和构建AI产品非常相似，你不能从第一天就直接用包含公司所有工具和上下文的智能体，指望它能正常运行，或者甚至去调试这个层级。

I feel like that's extremely similar to what you want to build AI products in the sense that when you don't start with, like, agents with all the tools and all the context that you have in the company in day one and expect it to work or, like, you know, even tinker at that level.

Speaker 2

你需要有意识地从影响最小、人工控制更多的地方开始，这样你才能清楚地了解当前的能力范围，以及你能用它们做什么。

You need to be deliberately starting in places where there is minimal impact and more human control so that you have, like, a good grip of what are the current capabilities and what can I do with them?

Speaker 2

然后慢慢逐步增加自主性，减少人工干预。

And then slowly, you know, like, lean into the more agency and lesser control.

Speaker 2

这样你会获得信心，知道：好的，这就是我面临的具体问题，AI可以解决其中的这部分。

So this gives you that confidence that, okay, I can know that, okay, this is the particular problem that I'm facing, and the AI can solve this extent of it.

Speaker 2

接下来，再思考需要引入哪些上下文，添加哪些工具来提升体验。

And then, like, let me next think through what context I need to bring in, what kind of tools I need to add to this to improve the experience.

Speaker 2

对吧？

Right?

Speaker 2

所以我觉得这既是好事也是坏事——好处是你不必被外界复杂的AI智能体技术吓倒，觉得我根本做不到。

So I feel like it's also it's a good and a bad thing in the sense that it's good that you don't have to see the complexity of the outside world of, like, you know, all of this fancy AI agents force and feel like I cannot do that.

Speaker 2

每个人都是从非常简化的结构开始，然后逐步演进的。

It's all everyone is starting from very minimalistic structures and then evolving.

Speaker 2

第二部分是，好的一面是，当你试图将一键式智能体融入公司时，你不会被这种复杂性压垮。

And the second part is, like, it's also good the the bad thing is that as you are, like, you know, trying to build this one click agents into your company, you don't have to be overwhelmed with this complexity.

Speaker 2

你可以逐步升级。

You can, like, slowly graduate.

Speaker 2

这非常重要，我们看到这是一个反复出现的模式。

So that's extremely important, and we see this as a repeating pattern over and over.

Speaker 0

好的。

Okay.

Speaker 0

那我们不妨就按照这个思路来，因为这是你建议人们构建AI产品时一个非常重要的组成部分。

So let's actually follow that the right, because that's a really important component of how you recommend people build AI stuff.

Speaker 0

AI产品。

AI stuff.

Speaker 0

AI产品、AI智能体，所有这些AI相关的东西。

AI products, AI agents, all the AI things.

Speaker 0

所以给我们举个例子，谈谈你所说的这种从低自主性、高控制开始，然后逐步提升的想法。

So give us an example of what you're talking about here, this idea of starting slow with agency and control and then moving kind of up the strung.

Speaker 2

是的

Yeah.

Speaker 2

例如，AI代理一个非常重要且普遍的应用是客户服务，对吧？

For example, a very important or like very prevalent application of AI agents is like customer support, right?

Speaker 2

想象一下，你是一家拥有大量客户服务工单的公司。

Imagine like you are a company who has like a lot of customer support tickets.

Speaker 2

甚至不必想象，OpenAI在推出产品时也面临过同样的问题，当我们推出成功的产品，比如Image或GPT-5时，客服量会急剧上升。

And why even imagine like OpenAI faced the exact same thing when we were launching products and there was like a huge spike of support volume as like you know we launched successful products like Image and or you know like GPT five and things like that.

Speaker 2

你收到的问题类型是不同的。

The kind of questions you get is different.

Speaker 2

客户向你提出的问题类型也是不同的。

The kind of, like, you know, problems that the customers bring to you is different.

Speaker 2

所以这不仅仅是把你们帮助中心的所有文章直接丢给AI代理。

So it's not about just, like, dumping all the list of help center articles that you have into the AI agent.

Speaker 2

你需要理解哪些事情是你能够构建的。

You kind of understand what are the things that you can build.

Speaker 2

因此，最初的第一步是，你拥有客服人员这些人类客服，但你会建议说：这是AI认为应该采取的正确做法。

And so initially, the first step of it would be something like you have your support agents, the human support agents, but you will be suggesting in terms of, okay, this is what the AI thinks that is the right thing to do.

Speaker 2

然后你从人类那里获得反馈，比如：在这个特定情况下，这个建议对我很好，而这个建议则不好。

And then you get that feedback loop from the humans that, okay, this is actually a good suggestion for me in this particular case, and this is a bad suggestion.

Speaker 2

接着你可以回过头去分析：这些缺点是什么？盲点在哪里？该如何修正？

And then you can go back and understand, okay, this is what the drawbacks are, this is where the blind spots are, and then how do I fix that?

Speaker 2

一旦你掌握了这些，就可以提高自主性，也就是说：我不需要再向人类建议了，我会直接向客户展示答案。

And once you get that, you can increase the autonomy to say that, okay, I don't need to suggest to the human, I'll actually show the answer directly to the customer.

Speaker 2

然后我们可以进一步增加复杂性，比如：我之前只是根据帮助中心文章来回答问题，但现在让我添加新功能。

And then we can actually add more complexity in terms of, okay, I was only answering questions based on help center articles, but now let me add new functionality.

Speaker 2

我可以直接为客户办理退款。

I can actually issue refunds to the customers.

Speaker 2

我可以向工程团队提交功能请求，以及所有这些操作。

I can actually raise feature requests with the engineering team and all of these things.

Speaker 2

所以，如果你从第一天就开始做所有这些，控制复杂性会变得极其困难。

So if you start all with all of this on day one, it's incredibly hard to control the complexity.

Speaker 2

所以我们建议一步一步地构建，然后逐步增加。

So we recommend, like, you know, building step by step and then increasing it.

Speaker 0

太棒了。

Awesome.

Speaker 0

实际上我们待会儿会分享一个可视化图表来展示这个过程，但为了复述一下你描述的内容：你提到的这个理念是，一开始要高控制、低自主性——比如支持人员只是提供建议，不能做任何操作，用户仍掌握主导权；当系统表现良好且你确信它能做出正确判断时，就给予它更多自主权，同时减少用户的控制；如果进展顺利，就进一步增加系统的自主权，用户也就无需再过多干预。

And you have a visual actually that we'll share of what this looks like, but just to kind of mirror back what you're describing, this idea of start with high control, low agency in your the example you gave is the support agent is just kind of giving suggestions, is not able to do anything, the user is in charge, and then as that becomes useful and you are confident it's doing the right sort of work, you give it a little more agency and you kind of pull back on the control the user has, And then if that's starting to go well, then you give it more agency and the user needs less control to control it.

Speaker 1

太棒了。

Awesome.

Speaker 1

我认为这里更宏观的理念是，对于AI系统来说，关键在于行为校准。

I think the higher level idea here is with AI systems, it's all about behavior calibration.

Speaker 1

提前预测系统的行为几乎是不可能的。

It's incredibly impossible to predict upfront how your system behaves.

Speaker 1

那该怎么办呢？

Now, what do you do about it?

Speaker 1

你要确保不会破坏客户的体验或最终用户的体验。

You make sure that you don't ruin your customer experience or your end user experience.

Speaker 1

你保持它不变，但减少人类的控制量。

You keep that as is, but then remove the amount of control that the human has.

Speaker 1

没有一种唯一正确的方式去做这件事。

And there is no single right way of doing it.

Speaker 1

你可以决定如何限制这种自主性。

You can decide how to constrain that autonomy.

Speaker 1

另一种限制自主性的方式是预授权用例。

A very I mean, a different example of how you could constrain autonomy is pre authorization use cases.

Speaker 1

保险预授权是一个非常适合AI的用例，因为临床医生花大量时间处理如血液检测、MRI等项目的预授权。

Insurance pre authorization is a very ripe use case for AI because clinicians spend a lot of time pre authorizing things like blood tests, MRIs and things like that.

Speaker 1

有些情况属于低垂的果实，比如MRI和血液检测，因为一旦你知道了患者的信息，就更容易批准，AI可以完成这些工作，而像侵入性手术等则风险更高。

There are some cases which are more of low hanging fruits, for instance, MRIs and blood tests, because as soon as you know, patient's information, it's easier to approve that and AI could do that versus something like an invasive surgery, etc, is more high risk.

Speaker 1

你不希望这些事情完全由机器自主处理。

You don't want to be doing that autonomously.

Speaker 1

因此，你可以判断哪些用例应该经过人工审核层，哪些用例AI可以方便地处理。

So you can kind of determine which of these use cases should go through that human and the loop layer versus which the use cases AI can conveniently handle.

Speaker 1

在整个过程中，你还会记录下人类的操作，对吧？

And then all through this process, you're also logging what the human is doing, right?

Speaker 1

因为你希望构建一个飞轮效应，用来持续优化你的系统。

Because you want to build a flywheel that you could use in order to improve your system.

Speaker 1

所以你本质上是在不破坏用户体验、不削弱信任的同时，记录下人类原本会做的操作，以便不断改进你的系统。

So you're essentially not ruining the user experience, not eroding trust, at the same time logging what humans would otherwise do so that you can continuously improve your system.

Speaker 0

让我再给你举几个你所推荐的这种演进过程的例子。

So let me give you a few more examples of this kind of progression that you recommend.

Speaker 0

我在这里花这么多时间，是因为这是你建议中帮助人们打造更成功AI产品的一个关键部分。

The reason I'm spending so much time here is this is a really key part of your recommendation to help people build more successful AI products.

Speaker 0

这个理念是：从高控制、低自主性开始，逐步推进，一旦你建立起信心，确认它在做正确的事情，就逐步提升。

This idea of start slow with high control and low agency, and then build up over time once you've built confidence that it's doing the right sort of work.

Speaker 0

你在我发的帖子中提到的其他几个例子，我来读一下。

So a few more examples that you shared in your post that I'll just read.

Speaker 0

假设你正在开发一个编程助手。

So say you're building a coding assistant.

Speaker 0

V1 只是提供内联补全和模板片段。

V one would be just suggest inline completion and boilerplate snippets.

Speaker 0

V2 生成更大的代码块，比如测试代码或重构内容，供人类审查。

V two would be generate larger blocks like tests or refactors for humans to review.

Speaker 0

然后 V3 是自动应用更改并创建拉取请求。

And then v three is just apply the changes and open PRs autonomously.

Speaker 0

另一个例子是营销助手。

And then another example is a marketing assistant.

Speaker 0

所以 V1 是起草邮件或社交媒体文案，就像我可能会做的那样。

So v one would be draft emails or social copy just like here's what I would do.

Speaker 0

V2 是构建多步骤营销活动并运行该活动。

V two is build a multistep campaign and run the campaign.

Speaker 0

然后 V3 是直接发布，通过多渠道进行 A/B 测试并自动优化活动。

And then launch and V3 is just launch it, AB tested auto optimize campaigns across channels.

Speaker 3

太棒了。

Awesome.

Speaker 3

是的。

Yeah.

Speaker 0

而且，再次总结一下我们目前的进展，以便给大家提供我们到目前为止分享的建议。

And and again, just to summarize where we're at, just to give people the the the advice we've shared so far.

Speaker 0

首先，重要的是要理解，AI产品是不同的，它们是非确定性的。

One is just important to understand, AI products are different, they're non deterministic.

Speaker 0

他指出了这一点，而我忘了真正回应这一点，无论是输入还是输出方面。

And he pointed out and I forgot to actually mirror back this point, both on the in on the input and the output.

Speaker 0

用户体验是非确定性的，人们会看到不同的内容、不同的输出、不同的聊天对话，甚至如果是在为你设计界面，UI也可能不同。

The user experience is non deterministic, like people will see different things, different outputs, different chat conversations, different maybe UI if it's designing the UI for you.

Speaker 0

而且，输出显然也是非确定性的。

And also the output obviously is going to be nondeterministic.

Speaker 0

所以这是一个问题和挑战。

So that's a problem and a challenge.

Speaker 0

然后

And then

Speaker 1

我的意思是，如果你仔细想想，这也是AI最美好的地方——毕竟，我们更习惯于自然交谈，而不是去点击一堆按钮之类的操作。

I mean, if you think of it, it's also the most beautiful part of AI, which is, I mean, we're all much more comfortable talking than following a bunch of buttons and all of that.

Speaker 1

对。

Right.

Speaker 1

因此，使用AI产品的门槛低了很多，因为你完全可以像和人类交流一样自然。

So the bar to using AI products is much lower because you can be as natural as you would be with humans.

Speaker 1

但这也正是问题所在，因为我们的沟通方式有太多可能性。

But that's also the problem, which is there are tons of ways we communicate.

Speaker 1

你需要确保意图被准确传达，并采取正确的行动，因为你的大多数系统都是确定性的，你希望达成确定的结果，但面对的是非确定性的技术，这就变得有点复杂了。

And you want to make sure that that intent is rightly communicated and the right actions are taken because most of your systems are deterministic and you want to achieve a deterministic outcome, but with nondeterministic technology, and that's where it gets a little messy.

Speaker 0

太棒了。

Awesome.

Speaker 0

好的。

Okay.

Speaker 0

我喜欢这种对‘为什么这很好’的乐观解读。

That's a I love I love the the optimistic version of the why this is good.

Speaker 0

好的。

Okay.

Speaker 0

另一部分是关于在设计时在被动响应与控制之间的权衡。

And then the other piece is this idea of this trade off of batonnie versus control and when you're designing a thing.

Speaker 0

我猜你看到的是，人们一上来就想直接跳到V3版本，结果反而陷入困境。

And what I imagine what you're seeing is people try to jump to the idea like the v three immediately, and that's when they get into trouble both.

Speaker 0

构建这样的系统可能要难得多，而且根本行不通。

It's probably a lot harder to build that, and it just doesn't work.

Speaker 0

然后他们就只是觉得：这失败了。

And then they're just like, This is a failure.

Speaker 0

我们到底在做什么？

What are we even doing?

Speaker 2

没错。

Exactly.

Speaker 2

我觉得在达到V3之前，你实际上需要先对很多事情建立起信心。

I feel there is like a bunch of things that you actually have to get confidence in before you get to v three.

Speaker 2

很容易感到不知所措，觉得自己的AI代理在一百个不同的方面都做错了。

And it's it's easy to get overwhelmed that, oh, my AI agent is like doing these things wrong in like 100 different ways.

Speaker 2

你不可能把所有这些问题都列出来并一一修复。

And you're not going to actually tabulate all of them and fix it.

Speaker 2

对吧？

Right?

Speaker 2

即使你已经学会了如何处理评估方法之类的事情。

Even though you've learned how do you deal with the evaluation practices and stuff like that.

Speaker 2

如果你一开始就走错了方向，之后想要纠正就会非常困难。

If you're starting on the wrong spot, you are actually going to have a hard time correcting things from there.

Speaker 2

当你从小处着手，从一个高度人工控制、低自主性的极简版本开始时，这也会迫使你思考：我究竟要解决什么问题？

And when you start small and when you start with building a very minimalistic version with high human control and low agency, it also forces you to think about what is the problem that I'm going to solve.

Speaker 2

我们用了一个术语叫‘问题优先’。

We use this term called problem first.

Speaker 2

对我来说，这显而易见——我确实需要思考问题本身，但令人惊讶的是，这个观点在当前AI快速发展的背景下，能如此强烈地引起共鸣：人们很容易陷入一味关注解决方案的复杂性，却忘记了自己真正想解决的问题。

And to me, it was, like, obvious in the sense that, yeah, I I do need to think about the problem, but it's incredible how well it resonates with the people that in all these advancements of the AI that we're seeing, one easy slippery slope is to just keep thinking about complexities of the solution and not and forget the problem that you're trying to solve.

Speaker 2

所以当你试图从较低程度的自主性开始时，你会开始认真思考：我究竟想解决什么问题，以及如何将这个问题分解成不同层次的自主性，以便后续逐步构建。

So when you're trying to start at, like, a small at a smaller scale of autonomy, you start to really think about what is the problem that I'm trying to solve and how do I break it down into, like, levels of autonomy that I can build later.

Speaker 2

当我们与每一位交谈对象反复强调这一模式时，这一点显得尤为有用。

So that is incredibly useful when, like and we keep repeating this pattern over and over with everyone we talk to.

Speaker 0

限制自主性还有许多其他好处，因为让系统为你做太多事情也可能带来危险，比如搞乱你的数据库，或者发送出你根本没想到的大量邮件。

And there's so many other benefits to limiting autonomy because there's there's just danger also of the thing doing too much for you and just messing up your, I don't know, your database, sending out all these emails you never expected.

Speaker 0

这其实有太多理由说明这是一个好主意。

Like, there's, like, so many reasons this is a good idea.

Speaker 1

是的。

Yep.

Speaker 1

我最近读了一篇来自加州大学伯克利分校几位研究者的论文。

I I recently read this paper from a bunch of folks at UC Berkeley.

Speaker 1

主要是Mote、Zaharia、Eon Stoiker以及Databricks团队的研究。

Basically, Mote, Zaharia, Eon Stoiker, and the folks at Databricks.

Speaker 1

论文指出，他们接触的约74%至75%的企业，其面临的最大问题都是可靠性。

And it said about 7475% of the enterprises that they had spoken to, their biggest problem was reliability.

Speaker 1

这也正是他们不愿意将产品部署给最终用户、构建面向客户的应用的原因，因为他们只是不确定。

And that's also why they weren't comfortable deploying products to their end users and building customer facing products because they just weren't sure.

Speaker 1

他们就是对这样做感到不安，不愿让自己的用户面临这些风险。

They just weren't comfortable doing that and exposing their users to a bunch of these risks.

Speaker 1

对吧？

Right?

Speaker 1

这也解释了为什么他们认为当今许多AI产品都聚焦于生产力，因为其自主性较低，而不是像能够取代工作流程的端到端代理那样。

And that's also why they think a lot of AI products today have to do with productivity because it's much low autonomy versus, you know, end to end agents that would replace workflows.

Speaker 1

而且，是的，我也很喜欢他们的其他工作，但我觉得这与我们初创公司所观察到的情况非常一致。

And, yeah, I love their work otherwise as well, but I think that's very in line with what at least we're seeing at my startup as well.

Speaker 0

好的。

Okay.

Speaker 0

非常有趣。

Very interesting.

Speaker 0

在本对话之前将发布的一期节目中，我们会深入探讨这个问题所避免的另一个问题，即提示注入和越狱。

There's an episode that'll come out before this conversation where we go deep into another problem that this avoids, which is around prompt injection and jailbreaking.

Speaker 0

对于AI产品来说，这是一个巨大的风险，因为它可能是一个无法解决甚至无法解决的问题。

Oh, big of a risk that is for AI products where it's essentially an unsolved and unsolvable problem potentially.

Speaker 0

我不会深入这个话题，但确实如此。

I'm not gonna go down that track, but that's Yeah.

Speaker 0

我们刚才进行了一场相当令人担忧的对话，那段内容会在这次对话之前发布。

It's a pretty scary conversation we had that it'll be out before this conversation.

Speaker 1

我认为，一旦这些系统走向主流，这将是一个巨大的问题。

I think that will be a huge problem once systems go mainstream.

Speaker 1

我们目前还在忙于构建AI产品，还没来得及担心安全问题，但随着这种非确定性API的普及，这肯定会成为一个巨大的问题。

We're still so busy building AI products that we're not worried about security, but it it will be such a huge problem to kind of especially with this nondeterministic API again.

Speaker 1

对吧？

Right?

Speaker 1

所以你其实陷入了一个困境，因为你的提示中可以注入大量指令，然后，确实，情况会变得非常糟糕。

So you're kind of stuck because there are tons of instructions that you could inject within your prompt and then, yeah, it's it's going really bad.

Speaker 0

好的。

Okay.

Speaker 0

我们实际上应该花点时间聊聊这个，因为我觉得这非常有趣，而且没人讨论这个问题——我们刚才的对话表明，要让AI去做它不该做的事其实很容易，尽管人们设置了各种安全防护机制，但这些防护实际上并不靠谱，总能找到绕过的方法。

I let's actually spend a little time here because it's actually really interesting to me, and no one's talking about this stuff, which is like, the conversation we had is just it's pretty easy to get AI to trick to do stuff it shouldn't do, and there's all these guardrail systems people put in place, but turns out these guardrails aren't actually very good and you can always get around them.

Speaker 0

正如你所说，当代理变得更加自主、机器人更加普及时，让AI去做不该做的事就会变得相当可怕。

And to your point, as agents become more autonomous and robots, it gets pretty scary that you could get AI to do things you shouldn't do.

Speaker 2

我认为这确实是个问题，但就目前客户采用AI的范围来看，企业真正能利用AI来提升效率、优化现有流程的程度，我觉得还处于非常早期的阶段。

I think this is definitely a problem, but I feel in the current spectrum of like customers adopting AI, the extent of which companies can actually get advantage of AI or improve their processes or streamline the existing processes that they have, I feel it's still in the very early stage.

Speaker 2

2025年对AI代理和客户尝试采用AI来说是极其繁忙的一年，但我感觉渗透率仍然远未达到能真正带来优势的水平。

Like, 2025 has been an extremely busy year for AI agents and customers trying to adopt AI, but I feel the penetration is still not as much as you would actually get advantage out of it.

Speaker 2

因此，只要合理设置人类介入的环节，我觉得我们其实可以避免很多问题，更专注于流程的优化。

So with the right set of, you know, human in the group points in here, I feel we can actually avoid a bunch of these things and focus more towards, like, streamlining the processes.

Speaker 2

我本人更偏向乐观，认为你需要先去尝试和采用，而不是一味地只强调AI可能出错的负面方面。

And I I am more on the optimist side in the sense that, like, you need to try and adopt this before actually, like, trying to be only for highlighting the negative aspects of, like, what could go wrong.

Speaker 2

所以我强烈认为，企业必须采纳这项技术。

So I I feel, like, strongly that companies has to adopt this.

Speaker 2

毫无疑问，我们接触过的每一家公司，包括OpenAI，都从未遇到过‘AI帮不上忙’的情况。

They definitely like, no company the OpenAI we talk to is has never had been the case that, oh, AI cannot help me in this case.

Speaker 2

一直以来都是这样，总觉得有一系列事情可以优化，然后让我看看怎么去应用它。

It has always been that, oh, there is this, like, set of things that it can optimize for me, and then let me see how I can adopt it.

Speaker 0

太好了。

Sweet.

Speaker 0

我总是喜欢这种乐观的视角。

I always like the optimistic perspective.

Speaker 0

我很期待你听完这个，看看你怎么想，因为真的很有意思。

I'm excited to for you to listen to this and see what you think because it's really interesting.

Speaker 0

而且你说得对，有很多事情值得关注。

And and to your point, there's a lot of things to focus on.

Speaker 0

这只是众多需要担心和思考的事情之一。

It's one of one of many things to worry about and think about.

Speaker 0

好的。

Okay.

Speaker 0

我们回到正题吧。

Let's get back on track here.

Speaker 0

我们已经分享了很多实用技巧和重要建议。

So we've shared a bunch of pro tips and important piece of advice.

Speaker 0

我想问问，你在那些成功构建AI产品的公司和团队中，还看到哪些其他模式或工作方式？

Let me ask, what other patterns and kind of ways of working do you see in companies that do this well and teams that build AI products successfully?

Speaker 0

那么，人们最容易陷入哪些常见误区呢？

And then just what are the most common pitfalls people fall into?

Speaker 0

所以我们不妨先聊聊，还有哪些其他方式能让公司成功构建AI产品？

So we could just maybe start with what are other ways that companies do this well, build AI products successfully?

Speaker 1

我几乎把它看作一个有三个维度的成功三角形。

I almost think of it as like a success triangle with three dimensions.

Speaker 1

这从来都不只是技术问题。

It's never always technical.

Speaker 1

每一个技术问题，本质上都是人的问题。

Every technology problem is a people problem first.

Speaker 1

对于我们合作过的公司来说，就是这三个维度，对吧？

And with companies that we have worked with, it's these three dimensions, right?

Speaker 1

比如优秀的领导者、良好的文化以及技术能力。

Like great leaders, good culture and technical prowess.

Speaker 1

在领导者方面，我们为许多公司提供AI转型、培训和战略等方面的帮助。

With leaders itself, we work with a lot of companies for their AI transformation, training, strategy and stuff like that.

Speaker 1

我认为，许多公司的领导者在过去十年或十五年里积累了丰富的直觉，这些直觉让他们备受尊重。

And I feel like a lot of companies, the leaders have built intuitions over ten or fifteen years, and they are kind of highly regarded for those intuitions.

Speaker 1

但现在有了AI，这些直觉必须被重新学习，领导者也需要有勇气承认这一点。

But now with AI in the picture, those intuitions will have to be relearned and leaders have to be vulnerable to do that.

Speaker 1

是的。

Right.

Speaker 1

我曾经与Rackspace现在的CEO加扬共事过。

I used to work with the CEO of now Rackspace, Gajan.

Speaker 1

他每天早上都会留出一个时间段，标注为‘跟进AI动态，上午四点到六点’。

So he would have this block every day in the morning, which would say catching up with AI four to six a.

Speaker 1

M。

Speaker 1

他不会安排任何会议或其他事情。

And he would not have any meetings or anything like that.

Speaker 1

那是他专门用来了解最新AI播客和信息的时间。

That was just his time to pick up on the latest AI podcast or information and all of that.

Speaker 1

他还会在周末进行白板编程练习之类的活动。

And he would have weekend white coding sessions and stuff like that.

Speaker 1

所以我认为领导者必须重新回归动手实践。

So I think leaders have to get back to being hands on.

Speaker 1

这并不是因为他们必须亲自实现这些技术，而是为了重建自己的直觉，因为你必须接受自己的直觉可能并不正确的事实。

And that's not because they have to be implementing these things, but more of rebuilding their intuitions, because you must be comfortable with the fact that your intuitions might not be right.

Speaker 1

你可能是房间里最不懂的人，而你希望向每个人学习。

And you probably are the dumbest person in the room, and you want to learn from everyone.

Speaker 1

我看到，那些成功打造产品的公司，这一点是它们的显著特征，因为你采取的是自上而下的方式。

And that I've seen that being a very distinguishing factor of companies that build products, are successful, because you're kind of being in that top down approach.

Speaker 1

几乎不可能自下而上地实现这一点。

It's almost always impossible for it to be bottom up.

Speaker 1

如果工程师们不信任这项技术，或者对技术的期望不一致，你就不能让他们去争取领导层的支持。

You can't have a bunch of engineers go and get buy in from the leader if they just don't trust in the technology, or if they have misaligned expectations about the technology.

Speaker 1

对吧？

Right?

Speaker 1

我听很多正在构建AI系统的人说，我们的领导者根本不清楚AI在解决特定问题上的能力有多强，或者他们轻易地否决某个想法，认为把它投入生产很简单。你真的需要了解当前AI能解决的问题范围，这样才能指导公司内的决策。

I've heard from so many folks who are building that our leaders just don't understand the extent to which AI can solve a particular problem, or they just wipe out something and assume it's easy to take it to production, and you really need to understand the range of what AI can solve today so that you can guide decisions within the company.

Speaker 1

第二点是文化本身。

The second one is the culture itself.

Speaker 1

对吧？

Right?

Speaker 1

而且，我接触过很多企业，AI并不是他们的主业，但他们不得不将AI引入流程，仅仅是因为竞争对手在做，而且确实存在一些非常成熟的应用场景。

And, again, I work with enterprises where AI is not their main thing, and they have they need to bring in AI into their processes just because a competitor is doing it and just because it does make sense because there are use cases that are very ripe.

Speaker 1

在这个过程中，我感觉很多公司都有一种错失恐惧症（FOMO）的文化，担心自己会被取代等等。

Then along the way, I feel a lot of companies have this culture of FOMO and you will be replaced and those kinds of things.

Speaker 1

人们变得非常害怕。

People get really afraid.

Speaker 1

领域专家在构建有效的AI产品中至关重要，因为你必须咨询他们，才能理解AI的行为方式或理想行为应该是怎样的。

Subject matter experts are such a huge part of building AI products that work because you really need to consult them to understand how your AI is behaving or what the ideal behavior should be.

Speaker 1

但我接触过一些公司，那里的领域专家根本不愿意和你交流，因为他们觉得自己的工作正面临被取代的风险。

But then I've spoken to a bunch of companies where the subject matter experts just don't want to talk to you because they think their job is being replaced.

Speaker 1

所以，正如我所说，这问题其实源自领导者本身——你需要营造一种赋能的文化，将AI融入你们的工作流程，从而让你的工作效率提升十倍，而不是说‘如果你不采用AI，很可能就会被取代’之类的话。

So, as I mean, I get this comes from the leader itself, you want to build a culture of empowerment of augmenting AI into your own workflows so that you can 10x and what you're doing instead of saying that, you know, probably you will be replaced if you don't adopt AI and stuff like that.

Speaker 1

这种赋能型的文化总是有帮助的。

So that kind of an empowering culture always helps.

Speaker 1

你希望整个组织齐心协力，让AI成为你的助力，而不是让大家只顾着保护自己的岗位。

You want to make your entire organization be in it together and make AI work for you instead of trying to guard their own jobs, etc.

Speaker 1

而且，AI确实也带来了比以往更多的机会。

And with AI, it's also true that it opens up a lot more opportunities than before.

Speaker 1

你可以让员工做比以前多得多的事情，将他们的生产力提升十倍。

So you could have your employees doing a lot more things than before and 10x their productivity.

Speaker 1

第三个是技术层面，我们之前也讨论过，对吧？

And the third one is the technical part which we talk about, right?

展开剩余字幕（还有 480 条）

Speaker 1

我认为，成功的人对理解自己的工作流程极为专注，他们会仔细判断哪些部分适合交给AI增强，哪些部分仍需要人类参与。

I think folks that are successful are incredibly obsessed about understanding their workflows very well and augmenting parts that could could be right for AI versus the ones that might need human in the loop somewhere, etcetera.

Speaker 1

当你试图自动化工作流程的某个环节时，绝不可能仅靠一个AI代理就能解决所有问题。

Whenever you're trying to automate some part of a workflow, it's never the case that you could use an AI agent and that will kind of solve your problems.

Speaker 1

通常，你会有一个机器学习模型来完成工作的一部分。

It's always you probably have a machine learning model that's going to do some part of the job.

Speaker 1

同时，还会有确定性的代码来完成工作中的另一部分。

You have deterministic code doing some part of the job.

Speaker 1

因此，你必须极度专注于理解整个工作流程，以便为问题选择最合适的工具，而不是一味沉迷于技术本身。

So you really need to be obsessed with understanding that workflow so you can choose the right tool for the problem instead of being obsessed with the technology itself.

Speaker 1

我看到的另一个模式是，人们真正理解了与非确定性API（即你的大语言模型）协作的理念。

And another pattern I see is also folks really understand this idea of working with a non deterministic API, which is your LLM.

Speaker 1

这意味着他们也明白，开发周期与以往大不相同。

And what that means is they also understand the development lifecycle looks very different.

Speaker 1

他们能够快速迭代：能否在不损害客户体验的前提下，快速构建并迭代，同时获取足够多的数据来评估行为？

And they iterate pretty quickly, which is can I build something iterate quickly in a way that it doesn't ruin my customer experience at the same time gives me enough amount of data so that I can estimate behavior?

Speaker 1

所以他们能非常迅速地建立起这个飞轮。

So they build that flywheel very quickly.

Speaker 1

如今，关键不在于你是否是第一个在竞争对手中部署智能代理的公司。

As of today, it's not about being the first company to have an agent among your competitors.

Speaker 1

而在于你是否已经建立了正确的飞轮，从而能够持续改进。

It's about have you built the right flywheels in place so that you can improve over time.

Speaker 1

对吧？

Right?

Speaker 1

当有人告诉我，我们有一个一键式代理，会部署到你的系统中。

When someone comes up to me and says, we have this one click agent, it's going to be deployed in your system.

Speaker 1

然后在两三天内，它就会为你带来显著的收益。

And then in in two or three days, it'll start showing you significant gains.

Speaker 1

我几乎会持怀疑态度，因为这根本不可能。

I would almost be skeptical because it's just not possible.

Speaker 1

这并不是因为模型不够强大，而是因为企业数据和基础设施非常混乱。

And that's not because the models aren't there, but because enterprise data and infrastructure is very messy.

Speaker 1

而且代理需要一点时间来理解这些系统是如何运作的。

And you need a bit to even the agent needs a bit to understand how these systems work.

Speaker 1

到处都存在非常混乱的分类体系。

There are very messy taxonomies everywhere.

Speaker 1

人们常常会做这样的事情：获取客户数据，我们想获取客户数据，我们就去做了，诸如此类。

People tend to do things like get customer data, we want get customer data, we do and these kinds of things.

Speaker 1

所有这些功能都存在，它们被不断调用，但实际上积累了大量陈旧的债务需要处理。

And all those functions exist, and they are being called and basically there's a lot of dead debt that you need to deal with.

Speaker 1

所以大多数情况下，如果你专注于问题本身，并且非常了解你的工作流程，你就会知道如何随着时间推移逐步改进你的代理，而不是简单地随便部署一个代理，就指望它从第一天起就能正常工作。

So most of the times, if you're obsessed with the problem itself and you understand your workflows very well, you will know how to improve your agents over time instead of just slapping an agent and assuming that it will work from day one.

Speaker 1

我甚至可以说，如果有人向你推销一键式代理，那纯粹是营销噱头。

I probably will go as far to say that if someone's selling you one click agents, it's pure marketing.

Speaker 1

你不想被这种说法误导。

You don't want to buy into that.

Speaker 1

我更愿意选择一家说‘我们会为你构建这个流程’的公司。

I would rather go with a company that says, we're going to build this pipeline for you.

Speaker 1

而且它会随着时间推移不断学习，形成一个良性循环，而不是指望一个开箱即用的解决方案。

And that will learn over time and kind of build a flywheel to improve than something that's gonna work out of the box.

Speaker 1

要替换任何关键工作流程，或构建一个能带来显著投资回报的系统，即使你拥有最好的数据层和基础设施层，通常也需要四到六个月的时间。

To replace any critical workflow or to build something that can give you significant ROI easily takes four to six months of work even if you have the best data layer and infrastructure layer.

Speaker 0

太棒了。

Amazing.

Speaker 0

这些观点与我在这档播客中其他对话的很多内容产生了强烈共鸣。

There's a lot there that resonates so deeply with other conversations I've been having on this podcast.

Speaker 0

一家公司要想在AI应用上取得显著成效，创始人兼CEO必须深度参与其中。

One is just for a company to be successful at seeing a lot of impact from AI, the founder CEO has to be deep into it.

Speaker 0

我曾邀请丹·希珀做客这档播客，他帮助多家公司推进AI落地。

I had Dan Shipper on the podcast, and they work with a bunch of companies helping them adopt AI.

Speaker 0

他说，成功最重要的预测指标就是CEO每天多次与ChatGPT、Claude等工具交流。

And he said that's the number one predictor of success is the CEO chatting with Chad GPT, Claude, whatever, many times a day.

Speaker 0

我很喜欢你举的Rackspace首席执行官的例子。

I love this example you gave with the Rackspace CEO.

Speaker 0

就像每天早上花点时间了解最新的AI资讯。

It's like catch up on AI news in the morning every day.

Speaker 0

我原本以为他会直接和聊天机器人聊天，而不是看新闻。

I was imagining he'd be, like, chatting with, like, the chatbot versus, like, reading news.

Speaker 1

以你今天所掌握的信息，其实完全可以做到，当然也要选择合适的渠道，因为每个人都有自己的看法。

With the kind of information you have as of today, you could just I mean, you wanna choose the right channels as well because everybody has an opinion.

Speaker 1

那么，你打算相信谁的观点呢？

So whose opinion do you want to bank on?

Speaker 1

我觉得，拥有一群高质量的信息来源真的非常重要。

I feel like having that good quality set of people that you're listening to really makes sense.

Speaker 1

他就是列了两三个固定的信息来源，每天都看。

So he just has a list of two or three sources that he always looks at.

Speaker 1

然后他再带着一堆问题，去跟一群AI专家讨论，听听他们的看法。

And and then he comes back with a bunch of questions and bounces it around with a bunch of AI experts to see what they think about it.

Speaker 1

而我就是那个群体中的一员，所以我很清楚。

And I was part of that group, so I kind of know

Speaker 0

我非常喜欢这一点。

I love that.

Speaker 1

关于他提出的那些问题。

About the questions that he comes up with.

Speaker 0

这真酷。

So That's cool.

Speaker 1

挺酷的。

It's pretty cool.

Speaker 1

我当时就想，你为什么要做这么多事？

I was like, why are you doing so much?

Speaker 1

然后他说，这些会渗透到我们所做的许多决策中。

And then he says it trickles down into a bunch of decisions that we would take.

Speaker 0

好的。

Okay.

Speaker 0

让我谈谈另一个在本播客中一直很热门的话题。

Let me talk about another topic that's very it's been a hot topic on this podcast.

Speaker 0

Evals 曾经在 Twitter 上热火朝天一段时间。

It was a hot topic on Twitter for a while, Evals.

Speaker 0

很多人对评估非常着迷，认为它们能解决 AI 领域的许多问题。

A lot of people are obsessed with evals, think they're the solution to a lot of problems in AI.

Speaker 0

也有很多人觉得评估被高估了，认为你根本不需要评估，只要凭感觉就能搞定。

A lot of people think they're overrated, that way you don't need evals, can just feel the vibes and you'll you'll be alright.

Speaker 0

你对评估有什么看法？

What's your take on evals?

Speaker 0

在解决你提到的诸多问题时，评估能走多远？

How far does that take people in solving a lot of the problems that you talk about?

Speaker 2

就社区里正在发生的事情而言，我觉得存在一种虚假的二元对立：要么认为评估能解决一切，要么认为在线监控或生产监控能解决一切。

In terms of, like, what is going on in the community, I I feel there's there's this false dichotomy of, like, there's either evolves is going to solve everything or online monitoring or production monitoring is gonna solve everything.

Speaker 2

我找不到任何理由去完全信任其中任何一个极端，比如把我的整个应用都押注在某一种方法上。

And I find no reason to trust, like, one of the extremes in the sense that I will entirely bank my application on this and or like that to solve the thing.

Speaker 2

对吧？

Right?

Speaker 2

如果你退一步想想，评估到底是什么。

So if you take a step back, think of what are evals.

Speaker 2

评估本质上是你对产品的信任性思考，或者说是你对产品的认知，这些会融入到你所构建的数据集中，因为这些对你来说才是重要的。

Evals are basically your trusted product thinking or like your knowledge about the product that is going into this set of datasets that you're going to build in the sense that this is what matters to me.

Speaker 2

比如，这是我希望我的智能体不应该做的问题，让我构建一组数据集，以便在这些方面表现良好。

Like, this is the kind of problems that my agent should not do, and let me build a list of datasets so that I'm going to do well on those.

Speaker 2

至于生产监控，你在这里做的是部署你的应用，然后设置一些关键指标，这些指标会向你反馈客户是如何使用你的产品的。

And in terms of production monitoring, what you are doing doing there is you're deploying your application, and then you're having this some sort of key metrics that actually communicate back to you on how customers are using your product.

Speaker 2

比如，无论你部署的是什么智能体，如果客户对你的互动给出了好评，你最好能知道这一点。

Like, you could be deploying any agent and, like, if the customer is giving a thumbs up for your interaction, you better want to know that.

Speaker 2

这就是生产监控要做的事情。

So that is what production monitoring is going to do.

Speaker 2

对吧？

Right?

Speaker 2

这种生产监控其实早已存在于各种产品中很长时间了。

And this production monitoring has existed for products like for a long time.

Speaker 2

只是现在有了AI代理，你需要监控的粒度要精细得多。

Just that now with AI agents, you will need to be monitoring, like, a lot more granularity.

Speaker 2

不仅仅是客户会给你明确的反馈，还有很多你可以获取的隐性反馈。

It's not just the customer always giving you explicit feedback, but there is many implicit feedback that you can get.

Speaker 2

例如，在ChatGPT中，如果你喜欢某个回答，你可以直接点赞。

For example, in ChatGPT, right, like if you are liking the answer, you can actually give a thumbs up.

Speaker 2

或者如果你不喜欢某个回答，有时客户不会点踩，而是直接重新生成答案。

Or if you don't like the answer, sometimes customers don't give you thumbs down, but actually regenerate the answer.

Speaker 2

这清楚地表明，你生成的初始答案未能满足客户的期望。

So that is a clear indication that the initial answer that you generated is not meeting the customer's expectations.

Speaker 2

对吧？

Right?

Speaker 2

所以这些是你始终需要关注的隐性信号，而且在生产监控中，这类信号的范围一直在扩大。

So these are the kind of implicit signals you always need to think about, and that spectrum has been increasing in terms of production monitoring.

Speaker 2

现在让我们回到最初的话题，好吧。

Now let's come back to the initial topic of, like, okay.

Speaker 2

这是评估还是生产监控？

Is it Eval or is it production monitoring?

Speaker 2

这有什么关系？

What does it matter?

Speaker 2

我觉得，我们又回到了这个以问题为导向的方法：你到底想构建什么？

So I feel, again, we go back to this problem first approach of what is your what is it that you're trying to build?

Speaker 2

比如，你想为客户提供一个可靠的应用，不会做错事。

Like, you're trying to build a reliable application for your customers that's not going to do a bad thing.

Speaker 2

它总是会做正确的事。

Like, it's always going to do the right thing.

Speaker 2

或者如果它做了错事，你也能迅速得到警报。

Or if it is doing a wrong thing, you are you are basically alerted, like, very quickly.

Speaker 2

对吧？

Right?

Speaker 2

所以我把它分为两个部分。

So the I break this down into two parts.

Speaker 2

比如，没有人会在不进行测试的情况下直接部署应用程序。

Like, one is you like, nobody goes into deploying an application without actually, like, you know, just testing that.

Speaker 2

这种测试可能是模糊测试，也可能是我有这10个问题，无论我做任何更改，它们都不能出错，我就构建这些内容，称之为评估数据集。

This testing could be wipes or this testing could be, okay, I had this 10 questions that it should not go wrong no matter what changes I make and let me build this and let's call this an evaluation dataset.

Speaker 2

现在，假设你构建并部署了它，然后你发现，好吧，现在我需要了解它是否在做正确的事。

Now let's say you build this, you deployed this and then you figured, okay, now I need to understand whether it's doing the right thing or not.

Speaker 2

如果你是一个高吞吐量或高交易量的客户，你不可能手动去逐一评估所有追踪记录。

So if you're a high throughput or like a high transaction customer, you cannot tactically sit and evaluate all the traces.

Speaker 2

对吧？

Right?

Speaker 2

你需要一些指标来判断哪些地方是你应该关注的。

You need some indication to understand what are the things that I should look at.

Speaker 2

这时候生产监控就派上用场了——你无法预测代理可能出错的所有情况，但所有这些隐性信号和显性信号都会反馈给你，告诉你哪些追踪记录需要检查。

And this is where production monitoring comes into the picture that you cannot predict your the base in which your agent could be doing wrong, but all of these other implicit signals and explicit signals, those are going to communicate back to you what what are the traces that you need to look at.

Speaker 2

而这就是生产监控的作用所在。

And that is where production monitoring helps.

Speaker 2

一旦你获得了这类追踪数据，就需要分析在这些不同类型的交互中出现了哪些失败模式。

And once you get this kind of traces, you need to examine what are the failure patterns that you're seeing in these different types of interactions.

Speaker 2

有没有哪些事情是我真正关心、绝对不希望发生的？

And is there something that I really care about that should not happen?

Speaker 2

如果出现了这类故障模式，我就需要考虑为它构建一个评估数据集。

And if that kind of failure modes are happening, then I need to think about building an evaluation dataset for it.

Speaker 2

比如说，我为我的代理构建了一个评估数据集，专门用于处理退款请求，而我明确配置了它不应提供退款。

And, okay, let's say I built an evaluation dataset for my agent trying to offer refunds where explicitly I have configured it not to.

Speaker 2

于是我构建了这个评估数据集，然后对工具、提示词或其他内容进行了修改，并部署了产品的第二个版本。

So I built this evaluation data set and then like I made my changes in tools or prompts or whatever, and then I deployed the second version of the product.

Speaker 2

对吧？

Right?

Speaker 2

但现在并不能保证这是你唯一会遇到的问题。

Now there is no guarantee that this is the only problem that you're going to see.

Speaker 2

你仍然需要生产监控来捕捉可能遇到的各种其他问题。

You still need production monitoring to actually have, like you know, catch different kinds of problems that you might encounter.

Speaker 2

所以我觉得评估很重要。

So I feel evals are important.

Speaker 2

生产监控也很重要。

Production monitoring is important.

Speaker 2

但认为仅靠其中一种就能解决所有问题的观点，在我看来完全是不可取的。

But this notion of only one of them is going to solve things for you, that is completely dismissible in my opinion.

Speaker 0

好的。

Alright.

Speaker 0

非常合理的回答。

A very reasonable answer.

Speaker 0

这里的关键并不是简单地两者都做。

And the point here isn't it's not just as simple as do both.

Speaker 0

而是要捕捉不同的问题，单一方法无法覆盖你需要关注的所有方面。

It's more that there are different things to catch, and one approach won't catch all the things you need to be paying attention to.

Speaker 2

没错。

Exactly.

Speaker 0

太棒了。

Awesome.

Speaker 1

我应该退后两步，谈谈在2025年下半年，'eval'这个词承载了多大的重量。

I ought to take two steps back and kind of talk about how much weight the term eval has had to take in the second half of twenty twenty five.

Speaker 1

因为当你去见一家数据标注公司时，他们会告诉你，我们的专家在编写eval。

Because you go meet a data labeling company and they tell you, our experts are writing evals.

Speaker 1

然后你又听到很多人说，产品经理应该编写eval，它们是新的PRD。

And then you have all of these folks saying that PMs should be writing evals, they're the new PRDs.

Speaker 1

接着又有人声称eval几乎就是一切，也就是你应当构建来改进产品的反馈循环。

And then you have folks saying that evals is pretty much everything, which is the feedback loop you're supposed to be building to improve your products.

Speaker 1

现在，作为初学者，退后一步想想，什么是eval？

Now step back as a beginner and kind of think, what are evals?

Speaker 1

为什么人人都在说eval？

Why is everyone saying evals?

Speaker 1

但实际上，这些都是过程中的不同部分。

And these are actually different parts of the process.

Speaker 1

从某种意义上说，没有人是错的，因为这些确实属于评估。

Nobody's wrong in the sense that, yes, these are evals.

Speaker 1

但当一家数据标注公司告诉你，他们的专家在编写评估时，他们实际上指的是错误分析，或者专家只是标注哪些内容应该是正确的。

But when a data labeling company is telling you that our experts are writing evals, they're actually referring to error analysis or experts just leaving notes on what should be right.

Speaker 1

律师和医生也会做评估。

Lawyers and doctors write evals.

Speaker 1

但这并不意味着他们在构建大语言模型的裁判，或构建整个反馈循环。

That doesn't mean they're building LLM judges or they're building this entire feedback loop.

Speaker 1

当你说到产品经理应该编写评估时，并不意味着他们必须创建一个足以投入生产的大语言模型裁判。

And when you say that a PM should be writing evals, doesn't mean they have to write an LLM judge that's good enough for production.

Speaker 1

我认为对此也有非常明确的做法，我支持KD的观点：你无法提前预知是需要构建一个大语言模型裁判，还是只需要利用生产环境监控中的隐式信号等。

I think there's also very prescriptive ways of doing this and plus one to KD, which is you cannot predict upfront if you need to be building an LLM judge versus you need to be using implicit signals from production monitoring, etc.

Speaker 1

我认为马丁·福勒在2000年代曾提出过一个术语叫‘语义扩散’，意思是有人创造了一个术语，大家就开始用自己的定义去扭曲它，最终你反而失去了它原本的含义。

I think Martin Fowler at some point had this term called semantic diffusion back in the 2000s, which kind of means that someone comes up with the term, everybody starts butchering it with their own definitions, and then you kind of lose the actual definition of it.

Speaker 1

这正是如今‘评估’、‘智能体’或AI领域任何术语正在发生的情况——每个人看到的都是它不同的侧面，我想。

That is kind of what is happening to evals or agents or any word in AI as of today, everybody kind of sees a different side to it, I guess.

Speaker 1

但如果你让一群从业者坐在一起，问他们：为AI产品构建可操作的反馈循环重要吗？

But if you make a bunch of practitioners sit together and ask them, is it important to build actionable feedback loop for AI products?

Speaker 1

我想他们都会同意。

I think all of them will agree.

Speaker 1

不过，具体如何实现，真的取决于你的应用场景本身。

Now, how you do that really depends on your application itself.

Speaker 1

当你面对复杂用例时，构建LLM评判器会极其困难，因为你看到大量新兴模式。

When you go to complex use cases, it's incredibly hard to build LM judges because you see a lot of emerging patterns.

Speaker 1

如果你构建一个评判器来测试冗长性之类的问题，结果却发现出现了新的模式，而你的LLM评判器无法捕捉到。

If you build a judge that would test for verbosity or something like that, turns out that you're seeing newer patterns that your LLM judge is not able to catch.

Speaker 1

然后你就不得不构建过多的评估标准。

And then you just end up building too many evals.

Speaker 1

到了这个时候，不如直接观察用户信号，修复问题，检查是否出现倒退，然后继续前进，而不是去构建这些评判器。

And at that point, it just makes sense to look at your user signals, fix them, check if you have regressed and move on instead of actually building these judges.

Speaker 1

所以一切都取决于具体情况。

So it all depends.

Speaker 1

想想看，每个机器学习从业者都会告诉你，这真的取决于上下文。

Think one statement that every ML practitioner will tell you is it really depends on the context.

Speaker 1

别对那些所谓的标准方法过于执着，它们会不断变化。

Don't be obsessed with prescriptions they're going to change.

Speaker 0

这是一个非常重要的观点，尤其是现在‘评估’这个词对不同人意味着这么多不同的东西。

That's such an important point, this idea that especially that evals just means many things to different people now.

Speaker 0

它只是一个涵盖太多事物的术语。

It's just like a term for so many things.

Speaker 0

当你把评估看作数据标注公司提供的东西，或者产品经理所指的内容时，谈论评估本身就变得很复杂。

And it it's complicated to just talk about evals when you're when you see it as the stuff data labeling companies are giving you and things PM are right.

Speaker 0

还有基准测试。

And there's also benchmarks.

Speaker 0

人们有时也会把基准测试称为评估。

People call benchmarks a little bit evals.

Speaker 0

这就像是

It's like

Speaker 1

我最近和一个客户交谈，他们告诉我我们在做评估。

I I recently spoke to a client who told me we do evals.

Speaker 1

是的。

Yeah.

Speaker 1

我当时就想，好吧。

And I was like, okay.

Speaker 1

你能给我看看你们的数据集吗？

Can you show me your dataset?

Speaker 1

他们说，不行。

And say, no.

Speaker 1

我们只是看了 LM Arena 和人工分析。

We just checked LM Arena and artificial analysis.

Speaker 1

这些是你们知道的独立基准，我们知道这个模型最适合我们的用途，而我却说，你们根本没做评估。

These are, you you know, independent benchmarks, and we know that this model is the right one for our use And I'm like, you're not doing eval.

Speaker 1

那不是评估。

That's not eval.

Speaker 1

那是一次模型评估。

It was a model eval.

Speaker 2

但它

But it

Speaker 0

这说得通。

makes sense.

Speaker 3

比如，这个词

Like, the word

Speaker 0

你知道，它可以在那种语境下使用。

know, it could be used in that context.

Speaker 0

我明白为什么人们会这么想。

I get why people think that.

Speaker 0

但，是的，现在反而让事情更混乱了。

But, yeah, now it's just confusing it even more.

Speaker 1

对。

Yep.

Speaker 0

我脑海中还有一条进一步的疑问，我觉得这成为重大争议的原因是云代码（Cloud Code）。

Just like one more line of questioning here that I think that's on my mind is the reason this became kind of a big debate is Cloud Code.

Speaker 0

云代码的负责人鲍里斯，不，不是这样的。

The head of Cloud Code, Boris, like, no.

Speaker 0

我们不在云代码上做评估。

We don't do evals on Cloud Code.

Speaker 0

全靠感觉。

It's all vibes.

Speaker 0

Kriti，你能分享一下Codex和Codex团队是如何进行评估的吗？

What can you share, Kriti, on Codex and the Codex team of how you approach evals?

Speaker 2

对于Codex，我们采取了一种平衡的方法，你需要做评估，同时也必须认真倾听客户的声音。

So Codex, we have, like, this balanced approach of, like, you know, you need to have evals and you need to definitely listen to your customers.

Speaker 2

我认为亚历克斯最近刚上过你的播客，他一直在谈论你们如何极度专注于打造正确的产品。

And I think Alex has been on your podcast recently and he's been talking about how you're extremely focused on building the right product.

Speaker 2

对吧？

Right?

Speaker 2

而其中非常重要的一部分就是倾听客户的需求。

And a part of a big part of it is basically listening to your customers.

Speaker 2

与其它领域的智能体相比，编码智能体具有非常独特的特点，因为它们专为可定制性而设计，面向工程师群体。

And coding agents are extremely unique compared to agents for other domains in the sense that these are actually built for customizability, and these are built for engineers.

Speaker 2

所以，编码智能体并不是一个能解决前五大工作流或前六大工作流之类问题的产品。

So coding agent is not a product which is going to solve, like, these top five workflows or, like, top six workflows or whatever.

Speaker 2

对吧？

Right?

Speaker 2

它的设计目标是支持多种方式的自定义。

It's meant to be customizable in multi different ways.

Speaker 2

这意味着你的产品将被用于各种不同的集成、工具和应用场景。

And the implication of that is that your product is going to be used in different integrations and different kinds of tools and different kinds of things.

Speaker 2

因此，要为客户使用产品时可能产生的所有交互类型构建评估数据集变得极其困难。

So it gets really hard to build an evaluation dataset for all kinds of interactions that your customers are gonna use your product for.

Speaker 2

对吧？

Right?

Speaker 2

但话说回来，你也需要明白，如果我要做改动，至少不能损害产品核心的功能。

But that said, you also need to understand that, okay, if I'm gonna make a change, it's at least not going to damage something that is really core to the product.

Speaker 2

因此，我们为此设置了评估机制。

So we have evaluations for doing that.

Speaker 2

但同时，必须极其谨慎地理解客户是如何使用它的。

But at the same time, take extreme care on understanding how the customers are using it.

Speaker 2

例如，我们最近开发了这个代码审查产品，它获得了极大的用户增长。

For example, we built this code review product recently, and it has been gaining, like, extreme amount of traction.

Speaker 2

我觉得，OpenAI内部以及我们外部客户中的许多错误都已经被它捕捉到了。

And I feel like many, many bugs in OpenAI as well as, like, even our external customers are getting caught with this.

Speaker 2

现在，假设我要对代码审查模型进行调整，或者改变我用来训练它的强化学习机制。

And now let's say if I'm making a model change to the course review or, like, a different kinds of RL mechanism that I trained with it.

Speaker 2

如果我要部署它，我肯定希望进行A/B测试，确认它是否真的能发现正确的错误，以及用户对此的反应如何。

And now if I'm going to deploy it, I definitely do want to AP test and identify whether it's actually finding the right mistakes and are users how are users reacting to it.

Speaker 2

有时候，如果用户因为错误的代码审查而感到烦躁，他们甚至会直接关闭这个产品。

And sometimes, like, if users do get annoyed by your incorrect code reviews, they go to the extent of just switching off the product.

Speaker 2

对吧？

Right?

Speaker 2

所以这些是你需要关注的信号，以确保你的新改动做对了事情。

So those are the signals that you want to look at and make sure that your new changes are doing the right thing.

Speaker 2

对我们来说，提前想到这些场景并为此开发评估数据集是极其困难的。

And it's extremely hard for us to think of these kind of scenarios beforehand and develop evaluation datasets for it.

Speaker 2

所以我觉得两者都有。

So I feel like there's a bit of both.

Speaker 2

比如，有很多直觉，也有很多客户反馈。

Like, there's a lot of vibes and there's a lot of, like, customer feedback.

Speaker 2

我们非常积极地关注社交媒体，了解是否有人遇到某些问题，并迅速修复。

And we are super active on, like, the social media to understand if anybody's having certain types of problems and quickly fix that.

Speaker 2

所以我觉得这该怎么说呢？

So I feel it's it's a how do I put this?

Speaker 2

这就像你在这里做的一系列事情。

It's it's like a domain of things that you do here.

Speaker 0

这太有道理了。

That makes so much sense.

Speaker 0

好的。

Okay.

Speaker 0

我听到的是Codecs pro评估，但这还不够。

What I'm hearing Codecs pro evals, but it's not enough.

Speaker 0

你需要，是的。

You need to Yes.

Speaker 0

但也要观察客户的行为和反馈。

But also just watch customer behavior and feedback.

Speaker 0

而且还有一些直觉，比如，这种感觉好吗？

And also there's some vibes just like, is this feeling good?

Speaker 0

当我使用它时，生成的是让我兴奋的、我们认为很棒的代码吗？

Is this as I'm using it, generating great code that I'm excited about that we think is great.

Speaker 2

我不认为有人会说，我有一套具体的评估标准，可以完全依赖它，然后就不用再考虑其他任何事情了。

I I don't think, like, if anybody is coming and saying that, like, my I have this concrete set of evas that I can, like, bet my life on and then I don't need to think about anything else.

Speaker 2

这根本行不通。

Like, it it's not gonna work.

Speaker 2

每当我们推出新模型时，团队都会聚在一起，测试各种不同的东西。

And every new model that we're gonna launch, we get together as a team and, like, you know, test different things.

Speaker 2

每个人都在专注于不同的事情。

Each each each person is, like, concentrating on something else.

Speaker 2

我们有一份难题清单，会把这些难题交给模型，看看它们的进展如何。

And, like, we have this list of hard problems that we have, and we throw that to the model and see how well they are progressing.

Speaker 2

可以说，每个工程师都有自己的定制评估标准，只是为了了解新产品在新模型中的表现。

So it's, like, custom evals for each engineer, you would say, and just, like, understand what the, product is doing in its new model.

Speaker 3

如果你是创始人，创业最难的部分不是拥有点子。

If you're a founder, the hardest part of starting a company isn't having the idea.

Speaker 3

而是要在不被后台事务压垮的情况下扩大业务。

It's scaling the business without getting buried in back office work.

Speaker 3

这就是Brex的用武之地。

That's where Brex comes in.

Speaker 3

Brex 是为创始人打造的智能金融平台。

Brex is the intelligent finance platform for founders.

Speaker 3

使用 Brex，您可以获得高额度的企业信用卡、便捷的银行服务、高收益的现金管理，以及一支由 AI 代理组成的团队，为您处理繁琐的财务任务。

With Brex, you get high limit corporate cards, easy banking, high yield treasury, plus a team of AI agents that handle manual finance tasks for you.

Speaker 3

它们会替您完成所有不想做的事务，比如报销费用、排查交易中的浪费，并根据您的规则自动生成报告。

They'll do all the stuff that you don't wanna do, like file your expenses, scour transactions for waste, and run reports all according to your rules.

Speaker 3

借助 Brex 的 AI 代理，您能更快推进工作，同时保持完全掌控。

With Brexis AI agents, you can move faster while staying in full control.

Speaker 3

美国每三家初创企业中就有一家已经在使用 Brex。

One in three startups in The United States already runs on brex.

Speaker 3

您也可以在 brex.com 上体验。

You can too at brex.com.

Speaker 0

我们已经聊了将近一个小时，却还没来得及讨论您开发的、在课程中教授的、将我们刚才谈到的所有内容整合成一套逐步构建 AI 产品的强大软件开发流程。

We've been talking for almost an hour already, and we haven't even covered your extremely powerful software development workflow for building AI products that you developed, that you teach in your course, that you basically combined all the stuff we've been talking about into a step by step approach to building AI products.

Speaker 0

您称之为‘持续校准、持续开发框架’。

You call it the Continuous Calibration Continuous Development Framework.

Speaker 0

我们来展示一个视觉化图表，让大家清楚我们到底在说什么，然后一步步解释这个框架是什么、它是如何运作的，以及团队如何转变他们构建AI产品的方式，采用这种方法来避免大量痛苦和麻烦。

Let's pull up a visual to show people what the heck we're talking about, and then just walk us through what this is, how this works, how teams can shift the way they build their AI product to this approach to help them avoid a lot of pain and suffering.

Speaker 1

在解释这个生命周期之前，先讲个小故事，说明为什么基里塔和我提出了这个框架：我们一直在接触很多公司，它们都受到竞争对手的压力，因为大家都在开发智能代理，认为我们也应该构建完全自主的代理。

Before we go about explaining the life cycle, a quick story on why Kirita and I came up with this is because there are tons of companies that we keep talking to that have the pressure from their competitors because they're all building agents, we should be building agents that are entirely autonomous.

Speaker 1

事实上，我和一些客户合作过，为他们构建了端到端的智能代理。

And we I did end up working with a few customers where we built these end to end agents.

Speaker 1

结果发现，由于一开始你并不清楚用户会如何与你的系统互动，也不确定AI会产生怎样的回应或行动，当你面对一个包含四五步、做出大量决策的庞大工作流时，要修复问题会非常困难。

Turns out that because you start off at a place where you don't know how the user might interact with your system and what kind of responses or actions the AI might come up with, It's really hard to fix problems when you have this really huge workflow, which is taking four or five steps, making tons of decisions.

Speaker 1

你最终会陷入大量的调试和临时补丁中，比如我们当时在为一个客户服务场景构建系统——这也是我们在通讯中提到的例子。

Just end up debugging so much and then kind of hot fixing to the point where at a time we were building for a customer support use case, which is the example that we give in the newsletter as well.

Speaker 1

我们不得不关闭了这个产品，因为我们一直在做太多临时修复。

And we had to shut down the product because we were doing so many hot fixes.

Speaker 1

而且我们根本无法统计所有不断涌现的新问题。

And there was no way we could count all the emerging problems that were coming up.

Speaker 1

网上也有不少相关的新闻。

And there's also quite some news online.

Speaker 1

最近，我想加拿大航空就发生过一件事，他们的一个代理预测或虚构了一项退款政策，而这项政策并不在他们原始的规范中。

Recently, I think Air Canada had this thing where one of their agents predicted or hallucinated a policy for a refund, which was not part of their original playbook.

Speaker 1

他们不得不去购买这项服务，因为涉及法律问题。

And they had to go buy it because legal stuff.

Speaker 1

已经发生了许多令人恐惧的事件。

And there have been a ton of really scary incidents.

Speaker 1

这正是这个想法的由来，对吧？

And that's where the idea comes from, right?

Speaker 1

你该如何设计，才能不失去客户信任，同时确保你的代理或AI系统不会做出对公司的安全构成严重威胁的决策？

How can you build so that you don't lose customer trust and you don't end up or your agent or AI system doesn't end up making decisions that are super dangerous to the company itself?

Speaker 1

同时，还要构建一个飞轮效应，让你在过程中不断改进产品。

At the same time, build a flywheel so that you can improve your product as you go.

Speaker 1

是的。

Right.

Speaker 1

正是在这种背景下，我们提出了持续校准和持续开发的理念。

And that's where we came up with this idea of continuous calibration, continuous development.

Speaker 1

这个想法很简单，就是我们有一个循环的右侧，即持续开发，其中你界定能力范围并整理数据，本质上是建立一个数据集，明确你的预期输入和预期输出是什么。

The idea is pretty simple, which is we have this right side of the loop, which is continuous development, where you scope capability and curate data, essentially get a data set of what your expected inputs are and what your expected output should be looking at.

Speaker 1

在开始构建任何AI产品之前，这是一个非常有益的练习，因为很多时候你会发现团队内部对产品应有的行为方式根本没有达成一致。

This is a very good exercise before you start building any AI product, because many times you figured out that a lot of the folks within the team are just not aligned on how the product should behave.

Speaker 1

这时候，你的产品经理和领域专家就能提供更多的信息。

And that's where your PMs can really give in a lot more information and your subject matter experts as well.

Speaker 1

于是你就有了一个AI产品应该表现优异的数据集。

So, you have this data set that your AI product should be doing really well on.

Speaker 1

它并不全面，但足以让你起步。

It's not comprehensive, but it lets you get started.

Speaker 1

然后你搭建应用，并设计合适的评估指标。

And then you set up the application and then design the right kind of evaluation metrics.

Speaker 1

我特意使用‘评估指标’这个术语，尽管我们常说‘evals’，因为我希望明确说明：评估是一个过程，而评估指标是你在过程中需要关注的维度，对吧？

And I intentionally use the term evaluation metrics, although we say evals because I just want to be very specific on what it is because evaluation is a process, evaluation metrics are dimensions that you want to focus on during the process, right?

Speaker 1

接着你就开始部署，运行你的评估指标。

And then you go about deploying, run your evaluation metrics.

Speaker 1

第二部分是持续校准，即你发现最初未曾预料到的行为的部分。

And the second part is the continuous calibration, which is the part where you understand what behavior you hadn't expected in the beginning.

Speaker 1

因为在开发过程中，你正在优化这个数据集。

Because when you start the development process, you have this data set that you're optimizing for.

Speaker 1

但更常见的是，你会意识到这个数据集并不够全面，因为用户会以你未曾预测的方式与你的系统互动。

But more often than not, you realize that that data set is not comprehensive enough because users start behaving with your systems in ways that you did not predict.

Speaker 1

而这就是你需要进行校准的地方。

And that's where you want to do the calibration piece.

Speaker 1

我已经部署了我的系统。

I've deployed my system.

Speaker 1

现在我发现了一些我根本没有预料到的模式。

Now I see that there are patterns that I did not really expect.

Speaker 1

你的评估指标应该能为你提供这些模式的一些洞察。

And your evaluation metrics should give you some insight into those patterns.

Speaker 1

但有时你会发现，这些指标也不够充分。

But sometimes you figure out that those metrics were also not enough.

Speaker 1

你可能会遇到一些之前从未想到的新错误模式。

And you probably have new error patterns that you've not thought about.

Speaker 1

这时你需要分析行为，发现错误模式。

And that's where you analyze your behavior, spot error patterns.

Speaker 1

你不仅要针对发现的问题进行修复，还要设计新的评估指标，以识别这些新兴模式。

You apply fixes for issues that you see, but you also design newer evaluation metrics to figure out that they are emerging patterns.

Speaker 1

但这并不意味着你总是需要设计新的评估指标。

And that doesn't mean you should always design evaluation metrics.

Speaker 1

有些错误你只需修复即可，无需再回头，因为它们只是孤立的问题。

There are some errors that you can just fix and not really come back to because they're very spot errors.

Speaker 1

比如，某个工具调用出错，仅仅是因为你的工具定义得不够完善，诸如此类的情况。

For instance, there's a tool calling error just because your tool wasn't defined well and stuff like that.

Speaker 1

你可以直接修复并继续前进。

You can just fix it and move on.

Speaker 1

对。

Right.

Speaker 1

这基本上就是AI产品生命周期的样子。

And this is pretty much how an AI product lifecycle would look like.

Speaker 1

但我们特别提到的是，在进行这些迭代时，要从低自主性迭代开始，逐步过渡到高控制性迭代。

But what we specifically also mentioned is while you're going through these iterations, try to think of lower agency iterations in the beginning, and higher control iterations.

Speaker 1

这意味着要限制AI系统所能做出的决策数量，确保人类始终参与其中，然后随着时间推移逐步放宽限制，因为你正在构建一个行为飞轮，逐步理解哪些用例正在出现，以及用户如何使用系统。

What that means is constrain the number of decisions your AI systems can make and make sure that they're humans in the loop and then increase that over time because you're kind of building a flywheel of behavior and you're understanding what kind of use cases are coming in or how your users are using the system.

Speaker 1

其中一个例子，我想我们在通讯稿中提到过，就是客户服务。

And one example, I think we give in the newsletter itself is the customer support.

Speaker 1

这是一个很好的图示，展示了如何将自主性和控制力视为两个维度，随着迭代推进，你的AI系统自主决策能力不断提升，而人工控制逐步降低。

This is a nice image that kind of shows how you can think of agency and control as two dimensions, and each of your versions keep on increasing the agency, or the ability of your AI system to make decisions and lower the control as you go.

Speaker 1

我们举的一个例子是客户服务代理，可以将其划分为三个版本。

And one example that we give is that of the customer support agent, where you can break it down into three versions.

Speaker 1

第一个版本只是路由，即你的代理能否正确分类并将特定工单转到合适的部门？

The first version is just routing, which is, is your agent able to classify and route a particular ticket to the right department?

Speaker 1

有时候你看到这里可能会想，只是做路由真的有那么难吗？

And sometimes when you read this, you probably think, is it so hard to just do routing?

Speaker 1

为什么代理不能轻松完成这个任务？

Why can't an agent easily do that?

Speaker 1

当你进入企业场景时，路由本身可能是一个极其复杂的问题。

And when you go to enterprises, routing itself can be a super complex problem.

Speaker 1

任何零售公司，任何你想到的热门零售企业，都有层级化的分类体系。

Any retail company, any popular retail company that you can think of has hierarchical taxonomies.

Speaker 1

大多数情况下，这些分类体系都非常混乱。

Most of the times, the taxonomies are incredibly messy.

Speaker 1

我曾参与过一些案例，其中分类体系可能包含某种层级结构，然后直接列出‘鞋子’，接着是‘女鞋’和‘男鞋’，全都处于同一层级，而理想情况下应该是‘鞋子’作为父类，‘女鞋’和‘男鞋’作为子类。

I have worked in use cases where you probably have taxonomy that says some kind of hierarchy and then that says shoes, and then women's shoes and men's shoes all at the same layer where ideally you should be having shoes and then women's shoes and men's shoes should be sub classes.

Speaker 1

然后你可能会想，好吧，我可以把这些合并起来，但继续深入后你会发现，还有另一个关于鞋子的分类区域写着‘女士用’和‘男士用’，而且这些并没有被整合。

And then you're like, okay, fine, I could just merge that and you go further and you see that there's also another section on the shoes that says for women and for men, and it's just not aggregated.

Speaker 1

不知为何，这个问题一直没有被解决。

It's not fixed for some reason.

Speaker 1

所以如果一个代理看到这种分类体系，它该怎么做？

So if an agent kind of sees this kind of a taxonomy, what is it supposed to do?

Speaker 1

它应该被路由到哪里？

Where is it supposed to route?

Speaker 1

很多时候，我们直到真正开始构建并深入理解时，才会意识到这些问题。

And a lot of the times, we are not aware of these problems until you actually go about building something and understanding it.

Speaker 1

当这类问题出现时，人类代理会知道接下来该检查什么。

And when these kinds of problems or real human agents see these kinds of problems, they know what to check next.

Speaker 1

也许他们会发现，位于‘鞋子’下的‘为女性’和‘为男性’这个节点最后一次更新是在2019年，这意味着它只是一个废弃的、无人使用的节点。

Maybe they realize that the node that says for women and for men that's under shoes was last updated in 2019, which means that it's just a dead node that's lying there and not being used.

Speaker 1

所以他们知道，我们应该去查看另一个节点之类的。

So they kind of know that, okay, we're supposed to be looking at a different node and stuff like that.

Speaker 1

我并不是说代理无法理解，或者模型没有能力理解这些内容。

And I'm not saying agents cannot understand this or models are not capable enough to understand this.

Speaker 1

但企业内部存在许多极其古怪的规则，这些规则从未被记录下来。

But there are really weird rules within enterprises that are not documented anywhere.

Speaker 1

你需要确保代理掌握所有这些背景信息，而不是简单地把问题扔给它去处理。

And you want to make sure that the agents have all of that context instead of just throwing the problem at that tray.

Speaker 1

回到我们之前的版本，路由是一个你拥有高度控制权的环节，因为即使你的代理错误地将问题转接到错误的部门，人类仍然可以介入并撤销这些操作。

Coming back to the versions we had, routing was one where you have really high control, because even if your agent routes to the wrong department, humans can take control and undo those actions.

Speaker 1

同时，在这个过程中，你也会发现你可能面临大量数据问题，需要加以修复，确保你的数据层足够完善，以支持代理正常运行。

And along the way, also figure out that you probably are dealing with a ton of data issues that you need to fix and make sure that your data layer is good enough for the agent to function.

Speaker 1

我们接下来要做的是，正如我们所说的协作者角色：一旦你经过几次迭代确认路由功能正常，并解决了所有数据问题，就可以进入下一步——让代理根据客户支持人员的标准操作流程提供建议，对吧？

We do is what we said of a copilot, which is now that you've figured out routing works fine after a few iterations and you fixed all of your data issues, you could go to the next step, which is, my agent provide suggestions based on some standard operating procedures that we have for the customer support agent, right?

Speaker 1

它只需生成一个草稿，供人类进行修改。

And it could just generate a draft that the human can make changes to.

Speaker 1

当你这样做时，你也在记录人类的行为，也就是说，客户支持人员使用了多少草稿内容，又删减了哪些部分。

And when you do this, you're also logging human behavior, which means that how much of this draft was used by the customer support agent or what was omitted.

Speaker 1

因此，当你这么做时，实际上是在免费获取错误分析，因为你完整记录了用户的所有操作，这些数据可以重新反馈到你的闭环系统中。

So you're actually getting error analysis for free when you do this, because you're literally logging everything that the user is doing that you could then build back into your flywheel.

Speaker 1

然后我们说，在这之后，一旦你发现这些草稿效果良好，而且大多数情况下人类几乎不做修改，直接使用这些草稿，这时你就该转向端到端的解决方案助手，它不仅能起草解决方案，还能自动分类工单。

And then we say, post that, once you figured out that those drafts look good, and most of the times, maybe humans are not making too many changes, they're using these drafts as is, that's when you want to go to your end to end resolution assistant that could draft a resolution that could sort the ticket as well.

Speaker 1

对。

Right.

Speaker 1

这些就是代理能力的各个阶段，你从低代理能力开始，逐步提升到高代理能力。

And those are the stages of agency where you start with low agency and then you go up high rate.

Speaker 1

我们还整理了一个非常不错的表格，列出了每个版本中你需要做什么，以及你能学到什么，从而推动你进入下一步。

We also have this really nice table that we put together, which is what do you do at each version and what you learn that can enable you to go to the next step.

Speaker 1

你能获得哪些信息来反馈到这个循环中，对吧？

And what information do you get that you can feed into the loop, right?

Speaker 1

当你只进行路由时，你会获得更高质量的路由数据。

When you're just doing your routing, you have better quality routing data.

Speaker 1

你也会知道需要构建什么样的提示词来改进路由系统。

You also know what kind of prompts you need to be building to improve the routing system.

Speaker 1

本质上，你正在为上下文工程建立结构，并构建你想要打造的反馈循环。

Essentially, you're figuring out your structure for context engineering and building that flywheel that you want to write.

Speaker 1

在我讲解这些的时候，我也想明确两点。

And while I go through this, I want to also be very clear that two things.

Speaker 1

第一，当你以CCCD的思路来构建时，并不意味着你一次性解决了所有问题。

One is when you build with CCCD in mind, it doesn't mean that you've fixed the problem all for once.

Speaker 1

你可能已经经历了V3阶段，并看到了一种你以前从未想象过的数据分布。

It's possible that you've probably gone through V3 and you see a new distribution of data that you never previously imagined.

Speaker 1

但这只是降低风险的一种方式，即在走向完全自主之前，你已经获得了足够多关于用户如何与你的系统互动的信息。

But this is just one way to lower your risk, which is you get enough information about how users behave with your system before going to a point of complete autonomy.

Speaker 1

第二点是，你实际上也在构建一个隐式的日志系统。

And the second thing is you're also kind of building this implicit logging system.

Speaker 1

很多人过来告诉我们：哦，等等，不是有评估系统吗？

A lot of people come and tell us that, oh, wait, there are evals, right?

Speaker 1

那为什么还需要这样的东西呢？

Why do you need something like this?

Speaker 1

仅仅构建一堆评估指标并将其投入生产的问题在于，这些指标只能捕捉到你已经意识到的错误。

The issue with just building a bunch of evaluation metrics and then having them in production is evaluation metrics catch only the errors that you're already aware of.

Speaker 1

但还有很多新兴模式，只有在将系统投入生产后你才能真正理解。

But there can be a lot of emerging patterns that you understand only after you put things in production.

Speaker 1

因此，对于这些新兴模式，你正在创建一种低风险的框架，以便理解用户行为，而不是陷入大量错误集中爆发、不得不一次性修复的境地。

So for those emerging patterns, you're kind of creating a low risk kind of a framework so that you could understand user behavior and not really be in a position where there are tons of errors and you're trying to fix all of them at once.

Speaker 1

但这并不是唯一的方法。

And this is not the only way to do it.

Speaker 1

有非常多不同的方式。

There are tons of different ways.

Speaker 1

你需要决定如何限制你的自主性。

You want to decide how you constrain your autonomy.

Speaker 1

它可以基于代理所采取的动作数量，这正是我们在本例中所做的。

It could be based on the number of actions that the agent is taking, which is what we do in this example.

Speaker 1

它也可以基于主题。

It could be based on topic.

Speaker 1

有些领域对于某些决策来说，让系统完全自主的风险非常高。

There are just some domains where it's pretty high risk to make a system completely autonomous for certain decisions.

Speaker 1

但对于其他一些主题，根据问题的复杂性，让它们完全自主是可以接受的。

But for some other topics, it's okay to make them completely autonomous and depending on the complexity of the problem.

Speaker 1

这时，你真正需要的是产品经理、工程师和领域专家共同达成一致，如何构建系统并持续改进它。

And that's where you really want your product managers, your, you know, engineers, and subject matter experts to align on how to build the system and continuously improve it.

Speaker 1

这个想法主要是行为校准，同时在进行这种校准时不要失去用户的信任，我想。

The idea is just behavior calibration and not losing user trust as you do that behavior calibration, I guess.

Speaker 0

如果有人想深入了解，我们会把他们引导到这篇实际的文章。

We'll link folks to this actual post if they wanna go really deep.

Speaker 0

你基本上会一步步地经历所有这些步骤，还有很多示例。

You basically go through all of these steps by step, a bunch of examples.

Speaker 0

这里的重点是，正如你所说，你所描述的一切都是为了使其成为持续、迭代的过程，逐步向更高自主性、更少控制的方向推进，甚至称之为持续校准、持续开发，其实就是在传达这是一种迭代的过程。

And the idea here is, as you said, that, like, the reason everything about what you're describing here is about making it continuous and iterative and kind of moving along this progression of higher autonomy, less control, and this idea of even calling it continuous calibration, continuous development is communicating, it's this kind of iterative process.

Speaker 0

为了明确一下，这个命名其实是对持续集成、持续部署（CI/CD）的一种致敬。

And just to be clear, naming is kind of a ode to CI CD, continuous integration, continuous deployment suite.

Speaker 0

这里的理念是，这是适用于人工智能的版本，不是仅仅集成单元测试并持续部署，而是运行评估、查看结果、迭代你所关注的指标，找出问题所在并不断改进。

And the idea here is like that, this is the version of that for AI, where instead of just like integrating to unit tests and deploying constantly, it's running evals, looking at results, iterating on on the metrics you're watching, figuring out where it's breaking and iterating on that.

Speaker 0

太棒了。

Awesome.

Speaker 0

好的。

Okay.

Speaker 0

所以，我们会引导大家去阅读这篇帖子，如果他们想深入了解的话。

So again, we'll point people to this post if they wanna go deeper.

Speaker 0

这是一次非常棒的概述。

That was a a great overview.

Speaker 0

在我转向这个框架相关的其他话题之前，你认为还有什么是人们需要知道的重要内容吗？

Is there anything else before I go into different topic around this framework specifically that you think is important for people to know?

Speaker 1

我们经常被问到的一个问题是：我怎么知道什么时候该进入下一阶段，或者当前的校准已经足够了？

I think one of the most common questions we get is how do I know if I need to go to the next stage or if this is calibrated enough, right?

Speaker 1

其实并没有一套明确的规则可以遵循，但核心在于最小化意外情况。比如说，如果你每一两天就进行一次校准，发现没有出现新的数据分布模式，用户的行为也一直很稳定，那么你获得的信息量就非常有限了。

There's not really a rule book you can follow, but it's all about minimizing surprise, which means, let's say you're calibrating every one or two days and you figured out that you're not seeing new data distribution patterns, your users have been pretty consistent with how they're behaving with the system, then the amount of information you gain is kind of very low.

Speaker 1

这时候你就知道可以进入下一阶段了，对吧？

And that's when you know you can actually go to the next stage, right?

Speaker 1

到了这个阶段，重点就变成了清理工作。

And it's all about the wipes at that point.

Speaker 1

你知道自己准备好了吗？

Like, do you know you're ready?

Speaker 1

你没有收到任何新信息。

You're not receiving any new information.

Speaker 1

但也要理解，有时某些事件会完全打乱你系统的校准。

But also, it really helps to understand that sometimes there are events that could completely mess up the calibration of your system.

Speaker 1

例如，GPD 4.0 已经不存在了，或者将在 API 中被弃用。

An example is GPD four point zero doesn't exist anymore, or it's going to be deprecated in APIs as well.

Speaker 1

因此，大多数使用 4.0 的公司应该切换到 5.0，而 5.0 具有非常不同的特性。

So most companies that were using four or should switch to five, and five has very different properties.

Speaker 1

所以当你发现校准再次偏离时，就需要回到这个过程重新进行。

So that's where your calibration is off again, you want to go back and do this process again.

Speaker 1

有时用户随着时间推移也会以不同方式与系统互动，或者用户行为会发生变化。

Sometimes users start behaving with systems also differently over time or user behavior evolves.

Speaker 1

即使是消费类产品，对吧？

Even with consumer products, right?

Speaker 1

你现在已经不会用两年前的方式和 ChatGPT 交流了，因为它的能力已经提升太多了。

You don't talk to ChatGPT the same way you were talking, say, two years ago, just because, you know, the capabilities have increased so much.

Speaker 1

另外，当这些系统能够完成一项任务时，人们就会感到兴奋。

Also just people get excited when these systems can solve one task.

Speaker 1

他们也想尝试用它来处理其他任务。

They want to try it out on other tasks as well.

Speaker 1

我们曾经为承保人构建了这个系统，对吧？

We built this system for underwriters at some point, right?

Speaker 1

承保是一项繁琐的任务。

Underwriting is a painful task.

Speaker 1

有些协议，比如贷款申请，长达三四十页。

There are agreements that are like, you know, loan applications are like 30 or 40 pages.

Speaker 1

这家银行的想法是构建一个系统，帮助承保人筛选保单和银行相关信息，以便审批贷款。

And the idea for this bank was to build a system that could help underwriters pick policies and information about the bank so that they could approve loans.

Speaker 1

在整整三到四个月里，每个人都对这个系统印象深刻。

For a good three or four months, everybody was pretty impressed with the system.

Speaker 1

我们甚至有承保人反馈，他们节省了大量时间等。

We had underwriters actually report gains in terms of how much time they were spending, etc.

Speaker 1

三个月后，我们意识到他们对这个产品太过兴奋，开始提出一些我们从未预料到的深入问题。

Post three months, we realized that they were so excited with the product that they started asking very deep questions that we never anticipated.

Speaker 1

他们会直接把整个申请文件丢给系统，然后问：对于这样一个案例，以前的核保人员是怎么做的？

They would just throw the entire application document at the system and go like, for a case that looks like this, what did previous underwriters do?

Speaker 1

对用户来说，这看起来像是他们原有工作自然的延伸，但背后的技术实现却需要重大改变。

And for a user, that just seems like a natural extension of what they were doing, but the building behind it should significantly change.

Speaker 1

现在，你需要理解‘对于这样一个案例’在贷款语境中究竟指的是什么。

Now, you need to understand what does for a case like this mean in the context of the loan itself.

Speaker 1

是指特定收入范围的人群吗？

Is it referring to people of a particular income range?

Speaker 1

还是指特定地理区域的人群之类的情况？

Or is it referring to people in a particular geo and stuff like that?

Speaker 1

然后你需要调取历史文档，分析这些文档，再告诉他们：这才是实际情况，而不仅仅是说‘有政策X、Y、Z，你去查一下这个政策’。

And then you need to pick up historical documents, analyze those documents, and then tell them, okay, this is what it looks like versus just saying that there's a policy X, Y, and Z, and you want to look up that policy.

Speaker 1

因此，对终端用户来说看似非常自然的需求，对产品构建者而言可能极其难以实现。

So something that might seem very natural to an end user might be very hard to build as a product builder.

Speaker 1

你会看到，用户行为也会随着时间推移而演变。

You see that user behavior also evolves over time.

Speaker 1

这时你就知道，需要回去重新调整了。

And that's when you know that you want to go back and recalibrate.

Speaker 0

你认为当前人工智能领域哪些东西被过度炒作了？

What do you think is overhyped in the AI space right now?

Speaker 0

更重要的是，你认为哪些东西被低估了？

And even more importantly, what do you think is is underhyped?

Speaker 2

正如我所说，我对人工智能领域正在发生的各种事情持极度乐观的态度。

I am, as I said, like, super optimistic in different things that are going in AI.

Speaker 2

所以我不会说有什么被过度炒作，但我感觉多智能体这个概念被误解了。

So I wouldn't say overhyped, but I feel kind of misunderstood is the concept of multi agents.

Speaker 2

人们有一种观念，认为我面对一个极其复杂的问题。

People have this notion of, like, I have this incredibly complex problem.

Speaker 2

于是我要把它分解成，嘿。

Now I'm gonna break it down into, hey.

Speaker 2

你就是这个代理。

You are this agent.

Speaker 2

负责这个。

Take care of this.

Speaker 2

你就是这个代理。

You are this agent.

Speaker 2

负责这个。

Take care of this.

Speaker 2

如果我现在把所有这些代理连接起来，它们会以为自己进入了代理的乌托邦，但实际上根本不存在真正成功的多代理系统。

And now if I somehow connect all of these agents, they think they're the agent utopia, and it's never the case that there are incredibly successful multi agent systems that are built.

Speaker 2

对吧？

Right?

Speaker 2

这一点毫无疑问。

Like, there's no doubt about that.

Speaker 2

但我认为，关键在于你如何限制系统偏离轨道的方式。

But I feel a lot of it comes in terms of how are you limiting the ways in which the system can go off tracks.

Speaker 2

例如，如果你正在构建一个监督代理，而有一些子代理实际为监督代理执行工作，这是一种非常成功的模式。

And for example, like if you're building a supervisor agent and there are like sub agents that actually do the work for the super agent, supervisor agent, That is a very successful pattern.

Speaker 2

但当你基于功能来划分职责，并期望所有这些部分能通过某种类似谣言传播的协议协同工作时，这种想法被严重误解了。

But coming with this notion of I'm going to divide the responsibilities based on functionality and somehow ex expect all of that to work together in some sort of, like, gossip protocol.

Speaker 2

人们误以为你可以这样去做，这其实是非常错误的。

That is, like, extremely misunderstood that you could do that.

Speaker 2

我认为，以当前的构建方式和模型能力，还远远达不到构建这类应用的水平。

I don't think, like, current ways of building and current, like, model capabilities are, like, right there in terms of, like, building those kind of applications.

Speaker 2

我觉得这种想法被误解了，而不是被高估了。

I feel that is kind of misunderstood than overrated.

Speaker 2

我认为它被低估了，虽然这可能很难让人相信，但我依然觉得编码代理被低估了——因为在推特和Reddit上，你经常能看到大量关于编码代理的讨论。

Underrated, I feel it's hard to probably believe, but I still feel coding agents are underrated in the sense that I feel like you can go on Twitter and you can go on Reddit and you see a lot of chatter about coding agents.

Speaker 2

但如果你去和任何一家公司、尤其是非前沿领域的工程师交谈，你会发现编码代理的实际影响力和渗透率非常低。

But talking to an engineer in, like, any random company, especially outside of the area, you you can see like the amount of impact this coding agent can create and the penetration is very low.

Speaker 2

所以我认为2025年和2026年将会是优化所有这些流程的绝佳年份，我相信这将为人工智能创造巨大的价值。

So I feel like 2025 and 2026 is going to be like an incredible year for optimizing all of these processes and I feel that is going to be creating a lot of value with AI.

Speaker 0

关于第一点，这真的很有意思。

That's really interesting on that first point.

Speaker 0

所以这里的观点是，你构建和使用一个能够自行分配子任务的代理，可能会比使用一堆所谓的Codex代理更成功。

So the idea there is you'll probably be more successful building and using an agent that is able to do its own sub agent splitting of work versus like a bunch of, say, Codex agents.

Speaker 0

所以当你执行这个任务时，就去做那个任务。

So when you do this task, you do that task.

Speaker 2

你可以让代理来做这些事情，而你作为人类来协调它们，或者你可以让一个更大的代理来协调所有这些事情。

You can have agents to do these things, and you as a human can orchestrate it, or you can have like one larger agent that is going to orchestrate all of these things.

Speaker 2

但让代理之间通过点对点协议通信，尤其是在客户服务这样的使用场景中，很难控制哪个代理会回复你的客户，因为你必须在各个地方调整你的安全限制之类的东西。

But letting the agents communicate in terms of peer to peer kind of protocol and then especially doing this in a customer support kind of use case is incredibly hard to control what kind of agent is replying to your customer because you need to shift your guardrails everywhere and things like that.

Speaker 0

是的。

Yeah.

Speaker 0

好的。

Okay.

Speaker 0

很棒的见解。

Great picks.

Speaker 0

好的。

Okay.