Machine Learning Guide - MLA 027 AI Video End-to-End Workflow (Cover)

MLA 027 AI Video End-to-End Workflow

About This Episode

How to maintain character and style consistency in AI video. Prosumers can use Google Veo 3's "high-quality chaining" to quickly produce social media content. Independent filmmakers can achieve narrative coherence by combining Midjourney V7 (style), Kling (lip-synced dialogue), and Runway Gen-4 (camera control), while professional studios use layered ComfyUI pipelines that output multi-layer EXR files for full control in standard VFX compositing.

Links and resources at ocdevel.com/mlg/mla-27
Try a standing desk: stay healthy and sharp while learning to code
Descript: my favorite AI audio/video editor

Choosing AI audio tools
- Music: Suno for complete songs; Udio for professional, editable components.
- Sound effects: ElevenLabs SFX for integrated podcast production; SFX Engine for large licensed libraries for games and film.
- Voice: ElevenLabs for the most realistic output; Murf.ai as an all-in-one studio for marketing; Play.ht for a low-latency developer API.
- Open-source TTS: StyleTTS 2 for lifelike local speech; Coqui XTTS-v2 for few-shot voice cloning; Piper TTS for fast CPU inference.

1. Prosumer workflow: viral videos. Goal: rapidly produce branded short-form video. This method works around the weakness of Veo 3's native "extend" feature.
Tool chain
- Image concepting: GPT-4o (API: GPT-Image-1): strong prompt adherence, text rendering, and conversational refinement.
- Video generation: Google Veo 3: high single-shot quality with built-in ambient audio.
- Soundtrack: Udio: distinctive "viral-style" music.
- Assembly: CapCut: the standard short-form editing feature set.
Workflow
1. Create a character sheet (GPT-4o): generate the main character image with a detailed "locking prompt," then produce variants (poses/expressions) via conversational edits for visual continuity.
2. Generate video (Veo 3): use "high-quality chaining": clip 1 generates 8 seconds from the character sheet image → extract the last frame → clip 2 continues the action with that frame as input → repeat.
3. Produce music (Udio): use manual mode with structured prompts ([genre][mood]) to generate and extend a track.
4. Final edit (CapCut): assemble clips, layer the Udio track over Veo's ambient audio, add text and auto-captions, and export 9:16.

2. Independent filmmaker workflow: narrative shorts. Goal: character continuity and cinematic storytelling through a combination of professional tools.
Tool chain
- Visual foundation: Midjourney V7 (--cref/--sref parameters to lock character and style).
- Dialogue scenes: Kling (top-tier lip sync and character realism).
- B-roll/action: Runway Gen-4 (director-mode camera control plus multi-motion brush).
- Voice generation: ElevenLabs (emotive, high-fidelity voices).
- Edit and grade: DaVinci Resolve (full-featured, cost-effective suite).
Workflow
1. Build the visual foundation (Midjourney V7): generate the main character image, use --cref --cw 100 to maintain character consistency and --sref to carry the style to other shots, and assemble a reference set.
2. Produce dialogue scenes (ElevenLabs → Kling): generate the dialogue audio → feed a closed-mouth character image into Kling → match the audio with its lip-sync feature.
3. Shoot B-roll (Runway Gen-4): use Midjourney reference images; director mode for precise camera moves, or the motion brush to add localized movement.
4. Composite and grade (DaVinci Resolve): assemble footage on the edit page, unify Kling and Runway shots with node-based grading on the color page, then apply a creative look.

3. Professional studio workflow: full pipeline control. Goal: pixel-level control, actor-likeness protection, and integration with standard VFX pipelines via open-source modular tooling.
Tool chain
- Core engine: ComfyUI (with models such as SD3/FLUX).
- VFX compositing: DaVinci Resolve (Fusion page): node-based multi-layer EXR compositing.
Control stack and workflow
1. Train a character LoRA: in ComfyUI, train a custom LoRA on a set of 15-30 images of the actor for faithful likeness.
2. Build the node graph, in order → loaders: base model, character LoRA, and prompts with trigger words → ControlNet stack: chain multiple ControlNets (e.g., OpenPose skeletons, Depth maps) → IPAdapter-FaceID: reinforce facial features with the Plus v2 model → AnimateDiff: apply deterministic camera-motion LoRAs → KSampler → VAE decode to produce the frame sequence.
3. Export multi-layer EXR: output an .exr sequence via the mrv2SaveEXRImage node (32-bit linear color plus lossless PIZ/ZIP compression), preserving render passes such as diffuse/specular/mattes.
4. Fusion compositing: import the EXR sequence into DaVinci, adjust color/highlights/mattes per layer in the node graph, and composite with live-action plates.

Bilingual Subtitles

Text subtitles only; no Chinese audio is included. To listen while you read, use the Bayt podcast app.

Speaker 0

欢迎回到应用机器学习。

Welcome back to machine learning applied.

Speaker 0

这是关于多媒体生成AI(包括图像生成、视频生成及其整合)迷你系列的最后一节。

This is the last segment of the miniseries on multimedia generative AI, image generation, video generation, and stringing them all together.

Speaker 0

这是最重要的一集,我会在社交媒体上重点推荐,因为它极具实践价值。

This is the most important episode, and it's the one that I'm gonna be linking out on socials because it's very practical.

Speaker 0

它会教你如何运用我们讨论过的工具制作电影或广告。

It teaches you how to make a movie or an ad using the tools that we've discussed.

Speaker 0

前两集展示的是示例流程,纯理论性质。

The last two episodes had sample workflows, which were purely theoretical.

Speaker 0

本集将呈现三个真实工作流程,这些是专业人士实际使用的方案。

This episode has three real workflows, workflows used by professionals in the wild.

Speaker 0

事实上,我可能会回头删掉前几集中的示例流程。

In fact, I might go back and remove the sample workflows from the previous episodes.

Speaker 0

我认为它们带来的干扰多过帮助。

I think they were more distracting than helpful.

Speaker 0

如果你首先接触到这一集且未听过前两集,那两集主要介绍了这些工具是什么以及它们之间的比较。

If you land on this episode first and you haven't listened to the last two episodes, those cover what these tools are and how they compare.

Speaker 0

所以只有当你不清楚GPT-4o与Midjourney,或Veo 3与Sora之间的价值差异时,才需要收听那两集。

So you only need to listen to them if you don't know what the value prop difference is between GPT-4o versus Midjourney, or Veo 3 versus Sora.

Speaker 0

特别是如果你对Stable Diffusion生态系统不太了解,因为第三个工作流程将大量涉及Stable Diffusion,你会需要一些关于它的背景信息。

And especially if you don't know much about the Stable Diffusion ecosystem, because workflow three will be Stable Diffusion heavy and you'll want some background information on Stable Diffusion.

Speaker 0

所以关于提示工程,朋友们,恐怕我不得不将其排除在外。

So prompt engineering, my friends, I'm afraid I had to exclude it.

Speaker 0

这集的内容太长了。

The content of this episode got too long.

Speaker 0

而且我的DNA里似乎要求迷你系列必须是三部曲,同时我也不想过度稀释这个播客系列。

And there's something in my DNA that requires mini series to be in threes, as well as I don't want to dilute this podcast series too much.

Speaker 0

整个系列是关于人工智能和机器学习的。

The whole series is about AI and machine learning.

Speaker 0

不想在多媒体领域停留太久。

I don't want to pigeonhole it into multimedia for too long.

Speaker 0

所以我不得不将其排除在外。

So I had to exclude it.

Speaker 0

我非常抱歉。

I'm so sorry.

Speaker 0

未来我可能会制作一期关于跨领域提示工程的超级节目。

I may do a super episode on prompt engineering across various domains in the future.

Speaker 0

这显然会包括图像和视频生成,因为在图像和视频生成领域,提示工程最为重要——适用的提示工程非常微妙且具体,涉及你使用的词汇类型、标志和参数。

This would obviously include image and video generation, because prompt engineering for image and video gen is where it matters most, and what applies is very nuanced and specific: the types of words and flags and parameters you use.

Speaker 0

但我也可能会为氛围编程和一般聊天加入提示工程的内容。

But I would also maybe include prompt engineering for vibe coding and general chat.

Speaker 0

不过在此期间,我已将提示工程指南放在我网站的节目说明中。

In the meantime, though, I put the prompt engineering guide on my website's show notes.

Speaker 0

请访问ocdevel.com/mlg,点击本集节目,你就能在顶部附近找到提示指南。

So go to ocdevel.com/mlg, click this episode, and you can find the prompt guide near the top.

Speaker 0

点击展开后,你可以将其下载为PDF或EPUB格式以便在Kindle上阅读。

Click to expand it and you can download it as a PDF or an EPUB for your Kindle.

Speaker 0

我还将在那里提供几份其他可阅读或下载的长篇指南,因为本集节目涉及大量背景资料。

I'll also have a few other long readable or downloadable guides there because there's so much background material for this episode.

Speaker 0

其中包括:第一,一份关于音频、音乐、文本转语音及音效工具的综合指南,因为本集只会快速概览音频工具。

These will include number one, a big guide on audio, music, text to speech, and sound effects tools because this episode is just going to do a very fast speed run on the audio tools.

Speaker 0

第二,专门针对开源文本转语音工具的详细指南,因为这类工具数量众多且发展迅猛。

And number two, a breakout guide on open source text to speech tools specifically, because there's a lot out there and it's moving really fast.

Speaker 0

其中许多工具在质量上已接近ElevenLabs的水平。

And many of these are nearly as good as ElevenLabs in quality.

Speaker 0

第三点将是一个关于图像和视频生成的提示工程超级指南,正如我所说。

Number three will be a super guide on prompt engineering for image and video generation, like I said.

Speaker 0

当然,节目说明的主体部分将是本集的工作流程。

And then of course, the body of the show notes will be the workflows of this episode.

Speaker 0

如果你想深入了解本集相关内容,或本集之外的延伸材料,请访问网站上的节目说明,而非播客平台上的说明,前往ocudvol.com。

So if you want deep reading material around this episode, or relevant material outside of this episode, then go to the website's show notes, not the podcatcher show notes; go to ocdevel.com.

Speaker 0

节目说明中还将包含所有这些工具的排行榜链接,包括视频生成器、图像生成器和文本转语音工具。

Also in the show notes will be links to leaderboards for all of these tools, video generators, image generators, text to speech tools.

Speaker 0

排行榜通过投票者的AB测试来捕捉工具的质量。

Leaderboards do A/B testing with voters, so they capture the quality of tools.

Speaker 0

因此它们是按质量排序的最佳图像生成器、视频生成器或文本转语音工具的列表。

So they're a ranked list of the best quality image generators or the best quality video generators or text to speech.

Speaker 0

这是选择你的工具的一个好起点。

And this is a good place to start for choosing your weapon.

Speaker 0

当然,这并非全貌。

Of course, that isn't the whole picture.

Speaker 0

正如你将在本期工作流程中看到的,工具选择很重要。

So tooling matters as you'll see in the workflows of this episode.

Speaker 0

但这是了解现有图像、视频和文本转语音工具质量概况的绝佳途径。

But that's a great place to just get a lay of the land of the quality of the image and video and text to speech tools available.

Speaker 0

好的。

Okay.

Speaker 0

在讨论工作流程之前,我们先谈谈音频。

Before we talk workflows, let's talk audio.

Speaker 0

所以对于音乐,你要根据目标来选择。

So for music, you're gonna choose based on your goal.

Speaker 0

如果是为YouTube视频等内容寻找一首完整可用的歌曲,就用Suno(拼写为s u n o),这个AI音乐生成器。

For a complete ready to use song for content like a YouTube video, use Suno, s u n o, the AI music generator.

Speaker 0

而如果需要高质量的音频组件和在专业音频程序中编辑的灵感,就用Udio(拼写为u d i o)。

And then for high quality audio components and inspiration to edit in a professional audio program, use Udio, u d i o.

Speaker 0

现在这两个工具都非常擅长从零开始生成歌曲和创作音乐。

Now both of those tools are very good at generating songs, generative music, from scratch.

Speaker 0

并不是说Udio不能直接给你一首很棒的可直接使用的歌曲。

It's not to say that Udio doesn't give you a great usable song out the gate.

Speaker 0

只是如果我们要区分这两者的话——因为两者在实际使用中都很受欢迎,Suno和Udio都是非常流行的音乐生成器。

It's just that if we had to differentiate the two, because both are very popularly used in the wild, both Suno and Udio are very popular music generators.

Speaker 0

如果非要在它们之间划条界限,这条界限可能是:用Suno可以直接获得成品。

If we had to draw a line between them, that line might look like using Suno to get something right out the gate.

Speaker 0

类比的话就像vo three那样,你一次性就能得到所有想要的东西,简单省事。

And the analogy would be like vo three where you get everything all you want one shot prompt, easy peasy.

Speaker 0

而使用Udio则适用于那些计划更亲力亲为、更注重长期规划,并打算用你选择的数字音频工作站(DAW)进行音乐编辑的情况。

And using Udio if you plan to be a little bit more hands on, thinking longer term, and editing the music in your DAW of choice.

Speaker 0

这类似于我们上期节目中提到的Kling或Runway Gen-4。

And that would be akin to Kling or Runway Gen-4, per the last episode.

Speaker 0

一个简单的记忆方法是:Suno这个词的音节更少。

So an easy way to remember that is Suno has fewer syllables in the word.

Speaker 0

这意味着制作上市音乐所需的步骤也更少。

So there's fewer steps involved for go to market music.

Speaker 0

想要在同一个工具中添加音效和解说的播客制作者,应该使用Eleven Labs集成的音效生成器。

Podcasters who want to add sound effects and narration in the same tool should use Eleven Labs integrated sound effects generator.

Speaker 0

Eleven Labs作为市场上顶级的文本转语音工具,还集成了音效功能。

Eleven Labs, the best in class text to speech tool on the market also has sound effects integrated.

Speaker 0

所以如果你既需要音效又需要文本转语音,就用Eleven Labs。

So if you want both sound effects and text to speech, use Eleven Labs.

Speaker 0

但游戏开发者和电影制作人若需要更庞大的独特授权资源库来进行组件化工作流程,则应使用专业工具。

But game developers and filmmakers who need a larger library of unique licensed assets for a component based workflow, they should use a specialized tool.

Speaker 0

市面上有个叫'Sound Effects Engine'或'SFX Engine'的工具,你可以做些调研,有些音效AI生成器能提供更多控制选项。

There's one out there called SFX Engine (sound effects engine); do a little bit of research, but there are sound effects AI generators that give you a bit more control.

Speaker 0

在语音生成(即文本转语音)领域,Eleven Labs是最佳选择。

For voice generation, that's text to speech, Eleven Labs is the best.

Speaker 0

它确实是最优秀的。

It is the best.

Speaker 0

Eleven Labs胜出。

Eleven Labs wins.

Speaker 0

在纯粹的真实感方面表现最佳,但不同工作流程可能需要不同工具。

The best for pure realism, but different workflows may require different tools.

Speaker 0

Murf.ai为营销团队提供一体化工作室,而Play.ht则为需要实时文本转语音的企业开发者提供低延迟API。

So Murf (murf.ai) offers an all-in-one studio for marketing teams, and Play.ht has a low latency API for enterprise developers who need real time text to speech.

Speaker 0

再次提醒,如需更多信息,请查看我的节目笔记获取深度解析。

And again, go to my show notes to get the deep dive on all of this if you need more information.

Speaker 0

至于开源本地文本转语音方案,市面上有大量模型、工具包、SDK、界面和封装工具可供选择。

For open source local text to speech, there's a bunch of models out there, plus toolkits and SDKs and UIs and wrappers and stuff.

Speaker 0

这是一个自成体系的庞大领域。

It is a giant space of its own.

Speaker 0

其中许多几乎和Eleven Labs一样出色,特别是Kokoro。

And many of them are almost as good as ElevenLabs, in particular Kokoro.

Speaker 0

我最近尝试过这个。

I tried this recently.

Speaker 0

我非常喜欢文本转语音技术。

I love text to speech.

Speaker 0

实际上我会用它生成音频来辅助我研究这些播客内容,而不是直接听播客。

I actually use it to generate audio for my research for these podcast episodes, rather than listening to podcasts.

Speaker 0

所以我经常将深度研究报告或现有EPUB/书籍转换成语音。

So I'll generate deep research reports, or text-to-speech existing EPUBs or books.

Speaker 0

而我最喜欢的是Kokoro(k o k o r o)。

And the one I like the most is Kokoro, k o k o r o.

Speaker 0

听起来简直太棒了。

Just sounds phenomenal.

Speaker 0

在我看来,它的音质几乎和Eleven Labs一样好。

It sounds almost as good as Eleven Labs in my opinion.

Speaker 0

还有风格TTS 2,它能达到人类水平的质量,最适合在没有参考语音的情况下生成自然语音。

There's also StyleTTS 2, and that achieves human level quality and is best for generating natural speech without a reference voice.

Speaker 0

而且只需极少量的音频输入就能进行语音克隆。

And there's voice cloning with very minimal audio input.

Speaker 0

一个叫Coqui(c o q u i)的工具,它使用的模型是XTTS-v2。

One's called Coqui, c o q u i, and the model it uses is XTTS-v2.

Speaker 0

对于需要高速处理且速度优先的应用场景,Piper TTS是轻量级的热门选择。

And for applications that need high speed, where speed is the most important, Piper TTS is a really lightweight and popular choice.
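As a concrete example of the local, hands-on route, here is a minimal sketch of driving Piper TTS from Python. The voice filename is just an example (Piper voices are separately downloaded .onnx files), and CLI flags can vary by Piper version, so treat this as illustrative rather than canonical.

```python
import shutil
import subprocess

def piper_tts(text: str, model_path: str, out_wav: str,
              run: bool = False) -> list[str]:
    """Build a Piper TTS command; optionally execute it.

    Piper reads text on stdin and writes a wav file. With run=False
    this only returns the argv list for inspection, so nothing needs
    to be installed to see what would be executed.
    """
    cmd = ["piper", "--model", model_path, "--output_file", out_wav]
    if run and shutil.which("piper"):  # execute only if piper is on PATH
        subprocess.run(cmd, input=text.encode("utf-8"), check=True)
    return cmd
```

Usage: `piper_tts("Hello there", "en_US-lessac-medium.onnx", "out.wav", run=True)` would synthesize to out.wav on a machine with Piper installed.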

Speaker 0

这些开源语音生成器,你可能会在稳定扩散(Stable Diffusion)的ComfyUI端到端工作流中使用——无论是在实验阶段(避免消耗11 labs的高额信用点),还是在最终产品中(如果发现某个开源方案能满足需求)。

And these open source voice generators you're likely to use in your Stable Diffusion ComfyUI end-to-end workflow, either during experimentation, so you're not breaking the bank on ElevenLabs credit expenditure, or in the final product, if you find that one of these open source alternatives is good enough for your purposes.

Speaker 0

这些本地安装工具与稳定扩散模型类似,其生态系统非常适合构建高吞吐、低延迟且质量上乘的工作流,但需要较强的技术动手能力。

So the local installation tools, similar to the Stable Diffusion model and ecosystem, are great for stringing together high throughput, low latency, and still very good quality, but also very hands on, highly technical workflows.

Speaker 0

关于开源产品的深度研究报告,请再次访问我的网站查看节目笔记。

Again, go to my website's show notes for a deep research report on the open source offerings.

Speaker 0

好的,音频部分就快速讲到这里。

Okay, that's the speed run on audio.

Speaker 0

现在我们来谈谈工作流程。

Now let's talk workflows.

Speaker 0

本集节目将根据用户画像提供三种核心工作流程。

Now, this episode gives three core workflows based on personas.

Speaker 0

你在做什么?

What are you doing?

Speaker 0

你的切入角度是什么?

What's your angle here?

Speaker 0

你的职业角色是什么?

What's your professional role?

Speaker 0

第一种工作流程面向初学者,适用于社交媒体活动、增长营销或广告投放,或者你只是一个人单打独斗的低预算初创公司。

So workflow one is the beginner: a social media campaign, growth marketing or ads, or you're just a one-person-band, low budget startup.

Speaker 0

你想要快速但优质的结果。

You want quick but good results.

Speaker 0

工作流程二是中级水平。

Workflow two is intermediate.

Speaker 0

你是一名独立短片制作人或爱好者,追求专业级成果。

You're an indie short film maker or an enthusiast and you want serious results.

Speaker 0

而工作流程三则是耗资数百万美元的好莱坞大片级别。

And workflow three is a freaking AAA, multimillion dollar budget, box office movie.

Speaker 0

你在好莱坞工作,在这个令人不安的时代想要保持对AI的领先优势。

You work in Hollywood and you wanna stay ahead of AI in this very scary time.

Speaker 0

这些技术就是你的命脉,你决心要深入掌握专业技术。

This stuff is your DNA and you have every intention to be highly technical.

Speaker 0

但在深入这三个工作流程之前,我们先快速了解一下工作流程零。

But before we go into those three workflows, let's quickly talk workflow zero.

Speaker 0

工作流程零适用于追求即时效果的情况。

Workflow zero, you want instantaneous results.

Speaker 0

这不是你的专业领域。

This is not your field.

Speaker 0

你无意将各种工具串联起来,也不想学习这些复杂的系统。

You have no intention of wiring things together, multiple tools and learning these complex systems.

Speaker 0

所以要么你对AI视频感到兴奋——这类人会在TikTok上狂发大脚怪视频或玻璃切割ASMR视频,要么这是你业务的一部分,但你的预算和时间真的非常紧张。

So either you get a kick out of AI videos, these are people who are blasting TikTok with bigfoot vlogs or glass cutting ASMR videos, or it's part of your business, but you're really, really tight on budget and time.

Speaker 0

工作流程零。

Workflow zero.

Speaker 0

首先,v o三,谷歌v o三已经集成了你所需的一切功能。

So firstly, v o three, Google v o three has everything you need all in one.

Speaker 0

配合Flow工具,它能实现顶级的文本转视频,并具备出色的提示遵循能力。

And with the Flow tool, it can do top tier text to video with phenomenal prompt adherence.

Speaker 0

你可以使用"成分"、帧到视频、延长片段等功能,或先用Imagen 4生成帧,这些都内置在Flow中。

You can use ingredients, and frames to video, and extend clips, and generate frames first with Imagen 4, all built into Flow.

Speaker 0

它具备语音功能。

It has voice.

Speaker 0

它拥有完美的唇形同步。

It has perfect lip syncing.

Speaker 0

它甚至还有音乐功能。

It even has music.

Speaker 0

谷歌有个叫Lyria的音乐工具,用来与Udio和Suno竞争。

Google has a music tool called Lyria to compete with Udio and Suno.

Speaker 0

目前它还不如Udio或Suno,但我相信很快就能赶上。

It's not as good as Udio or Suno currently, but I'm sure it will get there.

Speaker 0

我不确定是否可以通过Lyria直接为片段生成音乐,或者这是Flow后期制作的功能,我目前还没看到这个特性。

I'm not sure if you can prompt music via Lyria into a clip, or if it's a post production thing in Flow; I haven't seen the feature in my usage yet.

Speaker 0

现在可别被蒙蔽了。

Now do not be fooled.

Speaker 0

Vio可不是什么大脚怪或ASMR玩具。

Vio is not a bigfoot an ASMR toy.

Speaker 0

我见过用Veo制作的真实广告。

I have seen actual advertisements built with Veo.

Speaker 0

实实在在的现实世界广告。

Literal real world advertisements.

Speaker 0

我说的可不是Reddit上那些为了好玩而做的示例广告。

I'm not talking example advertisements for funsies on Reddit.

Speaker 0

我真的见过用VO3制作的广告,比如互联网上和电视上的。

I've actually seen ads built with VO three, like on the Internet, on TV.

Speaker 0

其中一个是夫妻档做的房地产广告。

One of these was a a real estate ad by a married couple.

Speaker 0

我是说,我无法想象他们会用Midjourney提示工程结合Kling做唇形同步。

I mean, I can't imagine they were breaking out Midjourney prompt engineering piped to lip syncing with Kling.

Speaker 0

所以别小看Veo 3。

So don't underestimate Veo 3.

Speaker 0

毫无疑问,Veo 3是你的起点。

Okay, Veo 3 is your starting point, without question.

Speaker 0

事实上,在你做任何事之前,甚至在听这期节目之前,Veo 3就是你的起点。

In fact, before you do anything, before you even listen to this episode, Veo 3 is your starting point.

Speaker 0

这期节目的其余部分,讲的都是在Veo 3不足之处更先进、更复杂的工具。

The rest of this episode is really all about more advanced, sophisticated tooling for where Veo 3 falls short.

Speaker 0

现在,仍在工作流程零阶段,如果你需要更企业化、更商务化的内容,而不仅仅是带有角色的视频动作场景,你想要品牌标识、文字、配色方案、营销材料,以及比Veo在社交媒体营销和广告中提供的更多端到端控制,但又不像我们接下来要讨论的工作流程那样高度控制,那么你需要的是一种聚合器,一种生成式AI聚合器。

Now, still in workflow zero, if you need something more corporate, more businessy, not just a video clip action scene with characters, you want branding, text, color schemes, marketing material, and a bit more end to end control than Veo offers for social media marketing and advertising, but not so much control as the workflows we'll talk about next, then what you're looking for is an aggregator, a generative AI aggregator.

Speaker 0

因此,这些工具的许多功能是让你能够在不同的图像生成器(如GPT-4o、Midjourney、Imagen 4等)和视频生成器(如Veo 3、Sora、Midjourney等)之间进行选择,并将它们在时间线上串联起来。

So what many of these tools do is they allow you to choose among the different image generators, like GPT-4o, Midjourney, Imagen 4, etcetera, and the video generators, like Veo 3, Sora, Midjourney, etcetera, and string them together on a timeline.

Speaker 0

其中许多工具都配备了辅助功能,用于确保图像与视频步骤间的一致性和分辨率匹配的扩展绘制,以及用于品牌推广等后期制作工具。

Many of these have hand holding tools for consistency and out painting between the image and video steps for resolution conformity, and then post production tools for branding and so forth.

Speaker 0

这一领域的流行工具包括Envato(e n v a t o)、OpenArt、InVideo(i n v i d e o)、Canva(实际上Canva正开始在其原有业务范围外添加大量AI工具),以及Pika Labs(p i k a)。

Popular tools in this category include Envato (e n v a t o), OpenArt, InVideo (i n v i d e o), Canva (actually, Canva is starting to add a lot of AI tooling way outside of their original business focus), and Pika Labs (p i k a).

Speaker 0

在我对上集纯视频生成质量的头对头研究中,Pika并未频繁出现。

Pika didn't come up much in my research for the head to head pure video generation of the last episode in terms of quality.

Speaker 0

这让我很惊讶,因为他们是AI视频生成领域的竞争者。

That surprised me because they're a competitor in generating AI videos.

Speaker 0

他们过去在网上经常被提及。

They used to come up a lot online.

Speaker 0

我不清楚他们是否暂时在视频生成方面落后,还是正朝着这类重型端到端工作流转型——类似本期‘零号工作流’主题所讨论的,不过可以去了解一下他们。

I don't know if they're just temporarily behind on video generation or if they're sort of moving towards these sort of heavy end to end workflows here kind of per this Workflow Zero topic, but give them a gander.

Speaker 0

所以我不会深入探讨这些工具。

So I won't really get into these tools.

Speaker 0

这些都是封装工具。

These are wrapper tools.

Speaker 0

有很多新晋初创公司涌入,而且这个领域只会继续扩张——当然这取决于VO3最终能覆盖多少功能并压制竞争对手。

There's a lot of Johnny-come-lately startups and it's only going to expand, depending, I guess, on how much Flow with Veo 3 ends up covering all these bases and squashing the competition.

Speaker 0

从生成价值与信用消耗比来看,它们相比直接使用原始工具显得有点昂贵。

And they're kind of expensive for what they do in terms of credit to generation value, as opposed to using the raw tools.

Speaker 0

而且我认为学习核心工具很有价值,坦白说。

And I think it's valuable to learn the core tools, frankly.

Speaker 0

另外,你正在收听的是《机器学习指南》,一档关于AI的播客。

Also, you're listening to machine learning guide, a podcast about AI.

Speaker 0

所以我想可以安全地说,你在这里是因为你在AI领域有所投入。

So I think it's safe to say that you're here because you have some skin in the AI game.

Speaker 0

但如果这条路让你感兴趣,你会想在那里做自己的研究。

But if this is a route that interests you, you'll wanna do your own research there.

Speaker 0

所以开始你的研究时,可以从Open Art和Canva入手,我认为这是开启你Google探索之旅的两个不错工具。

So start your research with, open art and Canva, I think would be two good tools to start your Google journey.

Speaker 0

好的,让我们进入工作流程部分。

Okay, let's get into the workflows.

Speaker 0

好的,接下来介绍三种工作流程:第一种是面向初学者或半专业人士的,比如需要快速轻松制作高影响力短视频内容的社交媒体经理或内容创作者,适用于TikTok广告或YouTube Shorts等平台。

Okay, so three workflows. Number one, the beginner or the prosumer: a social media manager or content creator who needs to produce high impact short form content, for example for TikTok, or an advertisement, or YouTube Shorts, quickly and easily.

Speaker 0

第二种是独立电影制作人,专注于创作具有连贯角色和电影质感的短片,愿意处理中等技术复杂度的故事讲述者。

Number two, the independent filmmaker: a storyteller focused on creating short films with consistent characters and a cinematic feel, who is willing to handle moderate technical complexity.

Speaker 0

第三种是专业工作室,需要完全掌控最终画面效果、演员形象,标准视觉特效管线集成的视觉特效师或制作团队。

And then number three, the professional studio, a visual effects artist or a production team that needs absolute control over the final image, actor likeness and integration with standard VFX pipelines.

Speaker 0

第一种工作流程:半专业用户工作流——用Veo 3和高质量链式生成制作病毒视频。

Workflow one, the prosumer workflow: creating a viral video with Veo 3 and high quality chaining.

Speaker 0

这里的"链式"指的是提示遵循与角色一致性。

Chaining as in prompt adherence and character consistency.

Speaker 0

好的。

Okay.

Speaker 0

这个工作流程适用于社交媒体经理和营销人员,他们需要为TikTok、Instagram Reels和YouTube Shorts等平台批量制作视觉吸引力强的短视频内容。

This workflow is for social media managers and marketers who need to produce high volume, visually engaging short form content for platforms like TikTok, Instagram Reels, and YouTube Shorts.

Speaker 0

主要目标是速度和品牌一致性。

The main goals are speed and brand consistency.

Speaker 0

问题是,虽然v o三能生成高质量、带有环境音效和语音的8秒片段,但其原生延长功能较弱。

The problem is that v o three while it generates really high quality eight second clips with integrated ambient audio and voice, it has a weak native extend feature.

Speaker 0

该功能使用了画质较低的v o二模型,会导致视觉一致性明显下降。

This feature uses the lower quality v o two model causing a noticeable drop in visual consistency.

Speaker 0

因此,如果你在Veo 3中创建一个初始片段,一个八秒的片段,无论是文本生成视频还是首帧生成视频,你都能使用Veo 3的高质量模型,得到一个非常高质量的片段。

So if you create an initial clip in Veo 3, an eight second clip, text to video or first frame to video, you'll be able to use the Veo 3 quality model and you'll get a very high quality clip.

Speaker 0

目前(2025年7月13日)你无法通过'成分到帧'功能、'首末帧'功能或'延长片段'功能实现这一点。

You cannot do this currently, 07/13/2025 with the ingredients to frame feature or the first and last frame feature or the extend clip feature.

Speaker 0

如果你尝试这样做,系统会把你降级到Veo 2模型,那是个质量非常差的生成器。

If you try that, it will downgrade you to Veo 2, which is a very bad quality generator.

Speaker 0

所以我们即将讨论的高质量链式方法,通过手动连接高质量片段来规避这个问题。

So the high quality chaining methods we'll discuss avoid this by manually connecting high quality clips.

Speaker 0

你生成的每个片段都是独立的。

Clips where you generate each clip independently.
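One way to implement that chaining step, extracting the last frame of clip N to seed clip N+1, is with ffmpeg. This is a sketch: by default it only builds the command, and it assumes ffmpeg is on your PATH when you actually run it.

```python
import shutil
import subprocess

def last_frame_cmd(clip: str, out_png: str, run: bool = False) -> list[str]:
    """Build an ffmpeg command that grabs the final frame of a clip.

    -sseof -0.1 seeks to ~0.1s before the end of the input,
    -frames:v 1 keeps a single frame, and -update 1 writes it as one
    image file. That frame becomes the starting image of the next clip.
    """
    cmd = ["ffmpeg", "-y", "-sseof", "-0.1", "-i", clip,
           "-frames:v", "1", "-update", "1", out_png]
    if run and shutil.which("ffmpeg"):  # execute only if ffmpeg exists
        subprocess.run(cmd, check=True)
    return cmd
```

Usage: `last_frame_cmd("clip1.mp4", "last.png", run=True)` yields last.png, which you then upload to Veo 3 as the first frame of the next eight second clip.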

Speaker 0

工具链如下:

So here's the tool chain.

Speaker 0

图像生成。

Image generation.

Speaker 0

你将使用GPT-4o的API或platform.openai.com(这是API的网页封装版,而非chatgpt.com)。

You're gonna use GPT-4o, the API, or platform.openai.com, which is a website wrapper of the API, not chatgpt.com.

Speaker 0

我们选择GPT-4o是因为其卓越的提示遵循能力和文本渲染能力。

We choose GPT-4o for its excellent prompt adherence and text rendering ability.

Speaker 0

它能以最少的提示工程准确生成你所需的视觉效果,其对话式界面支持快速视觉优化。

It delivers the specific visual you asked for with minimal prompt engineering, and its conversational interface allows for quick visual refinement.

Speaker 0

OpenAI的Playground还允许你混合搭配图像、编辑之前的优化内容、整合素材等等。

And OpenAI's playground allows you to mix and match images and make edits to previous refinements and incorporate ingredients and so forth.

Speaker 0

它比ChatGPT强大得多。

It's a lot more powerful than ChatGPT.

Speaker 0

你可以通过下拉菜单指定图像的输出尺寸、分辨率和质量等参数。

And you get to specify the output dimensions, resolution, and quality and so forth of the image through drop downs.
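For reference, those same dropdown choices map to plain parameters when you go through the API. A minimal sketch, assuming the official openai Python SDK and the GPT-Image-1 model mentioned in the show notes; the size and quality values shown are examples and the supported options may change.

```python
def gpt_image_params(prompt: str, size: str = "1024x1536",
                     quality: str = "high") -> dict:
    """Assemble a request for the OpenAI Images API (model gpt-image-1).

    Mirrors the Playground dropdowns: size and quality are explicit
    parameters rather than prompt text. Pass the resulting dict to
    client.images.generate(**params) with an OpenAI client.
    """
    return {"model": "gpt-image-1", "prompt": prompt,
            "size": size, "quality": quality}
```

With the SDK this would look roughly like `client.images.generate(**gpt_image_params("photorealistic headshot of Aya"))`, returning base64 image data to save locally.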

Speaker 0

视频生成。

Video generation.

Speaker 0

我们选择使用Google Veo 3,因其单镜头画质出色且自带音频生成功能,能一步到位同时生成视频和环境音效。

We're gonna use Google Veo 3, used for its high single shot quality and native audio generation, which simplifies production by creating video and ambient sound all in one step.

Speaker 0

对于配乐生成,我们使用Udio。

For soundtrack generation, let's use Udio.

Speaker 0

我们将用这个工具快速制作定制化的吸引人配乐。

We'll use this to quickly create custom catchy soundtracks.

Speaker 0

该工具在易用性和手动模式间取得平衡,提供更多控制,能创作出比库存音乐更具特色的病毒式音频。

This tool balances ease of use with a manual mode for more control, allowing for the creation of viral style audio that's more unique than stock music.

Speaker 0

最后剪辑会在CapCut里完成。

And then the final assembly will be in CapCut.

Speaker 0

你也可以选择CapCut,这是一款非常流行且易于使用的视频编辑工具,相比DaVinci Resolve和Adobe系列软件来说更简单。

Optionally, you can choose CapCut, which is a very popular, really easy to use video editing tool compared to things like DaVinci Resolve and the Adobe tools.

Speaker 0

CapCut更简单。

CapCut is simpler.

Speaker 0

我个人使用Descript。

I personally use Descript.

Speaker 0

我之前已经多次提到过。

I've mentioned that plenty in the past.

Speaker 0

我喜欢Descript。

I like Descript.

Speaker 0

它最初是为播客设计的,但后来逐渐扩展了视频编辑功能。

It was built firstly for podcasters, but they built out the video tooling over time.

Speaker 0

但CapCut在实际工作流程中更为流行,所以我在这里提到它。

But CapCut is a lot more popular out there in workflows in the wild, and so I'm mentioning it here.

Speaker 0

它是短视频内容的标准视频编辑器。

It's the standard video editor for short form content.

Speaker 0

它是免费的。

It was free.

Speaker 0

我觉得他们开始大力推行付费模式了。

I think they started monetizing it pretty heavily.

Speaker 0

它是跨平台的,拥有非常直观的界面,特效和文字样式专为TikTok和Reels设计。

It's cross platform, and it has a really intuitive interface with effects and text styles designed for TikTok and Reels.

Speaker 0

好的。

Okay.

Speaker 0

制作阶段:角色与场景概念设计。

The production stage, character and scene concept.

Speaker 0

你将选择工具——通过OpenAI Playground使用GPT-4o,其对专业用户的优势在于强大的提示遵循、文字渲染和对话式优化能力。

You're gonna select a tool: GPT-4o via the OpenAI Playground. And its strengths for the prosumer are strong prompt adherence, text rendering, and conversational refinement.

Speaker 0

选择理由是它能以最少的修改次数产出可预测的视觉效果。

And the rationale is that it delivers predictable visuals with minimal iteration.

Speaker 0

非常适合品牌工作。

Good for brand work.

Speaker 0

视频生成的制作阶段使用的是Veo 3。

The production stage for the video generation is Veo 3.

Speaker 0

它的优势在于高质量的单镜头画面、集成的音频生成,其核心理念是通过结合视频与环境音效来简化制作流程。

Its strengths are high single shot quality, integrated audio generation, and the rationale is that it simplifies production by combining video and ambient sound.

Speaker 0

音乐和配乐方面选用Udio,因其快速生成、流派灵活性以及用于创作hook的手动模式。

For the music and soundtrack it's Udio, for its fast generation, genre flexibility, and manual mode for hooks.

Speaker 0

相比Suno,它让你在将音频整合进视频前能对音效进行更多修改控制。

It gives you a little bit more control than Suno if you want to implement some modifications to the audio before integrating it into your video.

Speaker 0

如果追求极简且只需背景音乐,可以直接使用Suno。

Or if simplicity is key and you just want a track, you could use Suno.

Speaker 0

所以UDO将帮助你创作出脱颖而出的病毒式音频。

So UDO will help you create viral style audio that stands out.

Speaker 0

然后是最终剪辑与合成的制作阶段,CapCut是个热门选择,但如果你已有惯用软件也可以继续使用。

And then the production stage for the final edit and assembly, a popular choice is CapCut, but you can use whatever you use if you already have a favorite.

Speaker 0

但CapCut拥有直观的界面,跨平台兼容,并内置社交媒体文字特效功能。

But CapCut has an intuitive UI, it's cross platform, and it has built in social media text and effects.

Speaker 0

它已成为Reels和TikTok内容创作领域的行业标准。

And it's an industry standard for Reels and TikToks.

Speaker 0

以下是工作流程。

Here is the workflow.

Speaker 0

你需要使用GPT-4o创建一个角色设定表。

You're gonna create a character sheet with GPT-4o.

Speaker 0

为了确保视觉一致性,你需要先创建一组定义角色的参考图像。

So to ensure visual consistency, you're gonna start by creating a set of reference images that define your character.

Speaker 0

你需要编写所谓的锁定提示(locking prompt)。

You're gonna write what's called a locking prompt, l o c k i n g, a locking prompt.

Speaker 0

创建一个详细提示,定义角色不可更改的特征(如面部结构或眼睛颜色),以及可变特征(如服装和表情)。

You create a detailed prompt that defines the character's unchangeable features, like their facial structure or eye color, and variable features, like their clothing and expression.

Speaker 0

例如,一个锁定提示可能是这样的。

For example, a locking prompt might look like this.

Speaker 0

一张写实风格的头像照,主角是25岁的日本女性Aya,拥有棱角分明的下颌线、聪慧的深棕色眼睛,左眼下有一颗小痣,留着不对称的黑色短发,带有一缕电光蓝挑染。

A photorealistic headshot of Aya, a 25 year old Japanese woman with a sharp jawline, intelligent dark brown eyes, and a small beauty mark under her left eye and short asymmetrical black hair with a single streak of electric blue.

Speaker 0

她穿着极简风格的灰色高领毛衣。

She's wearing a minimalist gray turtleneck sweater.

Speaker 0

背景是纯中性灰色。

The background is a solid neutral gray.

Speaker 0

专业影棚灯光,主体表情中性,8K超高清锐利对焦。

Professional studio lighting. Subject has a neutral expression. 8K, hyper detailed, sharp focus.
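The fixed-versus-variable split behind a locking prompt can be captured in a small template. The character description below is the episode's example; the helper function itself is illustrative, not a tool feature.

```python
# Unchangeable features: repeated verbatim in every generation.
LOCKED = ("Aya, a 25 year old Japanese woman with a sharp jawline, "
          "intelligent dark brown eyes, a small beauty mark under her "
          "left eye, and short asymmetrical black hair with a single "
          "streak of electric blue")

def locking_prompt(pose: str, expression: str, wardrobe: str) -> str:
    """Combine the locked character block with per-shot variables.

    Only pose, expression, and wardrobe change between generations,
    which is what keeps the character visually consistent.
    """
    return (f"A photorealistic shot of {LOCKED}. {pose}. "
            f"She is wearing {wardrobe}. Expression: {expression}. "
            "Solid neutral gray background, professional studio "
            "lighting, 8k, hyper detailed, sharp focus.")
```

For example, `locking_prompt("Front facing headshot", "neutral", "a minimalist gray turtleneck sweater")` reproduces the main reference image; swapping in "three fourths angle" and "small confident smile" gives a consistent variant.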

Speaker 0

第二步是生成主图像。

Step two is you'll generate the main image.

Speaker 0

在GPT-4o中使用锁定提示词生成该角色的正面主图像。

Use the locking prompt in GPT-4o to generate the primary front facing image of this character.

Speaker 0

正面朝向,因为我们要在此基础上进行迭代。

Front facing because we'll iterate on that.

Speaker 0

第三步,通过对话式提示进行优化。

Step three, refine with conversational prompts.

Speaker 0

利用GPT-4o的对话记忆功能来创建变体。

So use GPT-4o's conversational memory to create variations.

Speaker 0

你需要在OpenAI的Playground中点击‘编辑这张照片’。

You'll click the edit this photo in the OpenAI playground.

Speaker 0

每张生成的图片上都有各种按钮。

There's there's various buttons on each generated image.

Speaker 0

然后你可以使用后续提示,比如'现在用完全相同的角色,展示她以四分之三角度微微偏离镜头,带着自信的浅笑'。

And then you can use a follow-up prompt like now using this exact same character, show her from a three fourths angle looking slightly off camera with a small confident smile.

Speaker 0

接着你会重复这个操作。

And then you'll do it again.

Speaker 0

下一个镜头保持角色完全一致,但要让她呈现惊讶表情,并用手半掩着嘴。

For the next shot, keep the character identical, but have her look surprised with her hand partially covering her mouth.

Speaker 0

保持影棚灯光和灰色背景不变。

Keep the studio lighting in gray background.

Speaker 0

第四步是整合角色设定表。

And then step four is assemble a character sheet.

Speaker 0

你需要收集三到五张这样的图片。

So you'll collect three to five of these images.

Speaker 0

这套素材将作为所有视频生成的源材料以确保一致性。

And this set will be used as the source material for all video generations to ensure consistency.

Speaker 0

现在这个过程有了第五步。

Now there is a kind of step five in this process.

Speaker 0

GPT-4o模型,无论是chatgpt.com还是playground,输出的图像布局都是3:2或2:3比例。

The GPT-4o model, whether chatgpt.com or the Playground, outputs a 3:2 or 2:3 layout image.

Speaker 0

3:2表示横向,2:3表示纵向。

That three two means landscape, and the two three is portrait.

Speaker 0

也就是每3个像素对应2个像素的比例。

So, you know, three pixels to every two pixels.

Speaker 0

所以图像要么是宽的,要么是高的。

So it's wide or it's tall.

Speaker 0

遗憾的是,Veo 3使用的是16:9或9:16比例,这些是手机等设备的标准尺寸,即宽屏横向或纵向布局。

Unfortunately, Veo 3 uses 16:9 or 9:16, and these are the standard dimensions for phones, for example, so widescreen landscape or portrait layouts.

Speaker 0

即使通过OpenAI的API也无法控制这一点,你无法获得16:9比例的图像。

And you can't control that even through OpenAI's API; you cannot get a 16:9.

Speaker 0

所以你需要想办法让图像符合VO3所需的尺寸。

So you're gonna have to figure out a way to get this in the dimensions that v o three likes.
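To make that concrete, here's the arithmetic behind a centered crop from GPT-4o's 3:2 output toward Veo 3's 16:9 input; a small sketch you could run before uploading (the 1536x1024 size is just an illustrative 3:2 example, not a guaranteed GPT-4o output size).

```python
def center_crop_box(width, height, target_w=16, target_h=9):
    """Return (left, top, right, bottom) for a centered crop of a
    width x height image to the target aspect ratio."""
    target = target_w / target_h
    if width / height > target:
        # Image is wider than the target: trim the sides.
        new_w = round(height * target)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is taller than the target: trim the top and bottom.
    new_h = round(width / target)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

# A 1536x1024 (3:2) image cropped to 16:9 keeps the full width
# and loses 80 pixels from the top and 80 from the bottom.
print(center_crop_box(1536, 1024))  # (0, 80, 1536, 944)
```

For a 9:16 portrait target, pass target_w=9, target_h=16 instead.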

Speaker 0

因为当你将这些照片作为剪辑的帧导入时,v o三会检测到分辨率不匹配,并要求你进行裁剪。

Because when you import these photos as frames for your clip, v o three will detect that the resolution is off, and it will have you crop it.

Speaker 0

然后你会看到图片左右两侧出现非常粗的黑色边框。

And what you'll see is very thick black boundaries on the left and right of your image.

Speaker 0

而你肯定不希望直接缩放到最小,把黑色边框也包含进去。

And you don't wanna just zoom it all the way out and include the black borders.

Speaker 0

v o三不会自动填充那些黑色区域。

V o three won't fill those in.

Speaker 0

它不会覆盖它们。

It won't out paint them.

Speaker 0

它会直接使用你的图像作为场景。

It will use your image as is for the scene.

Speaker 0

它会认为你想要那些黑色边框,并将动画限制在可见窗口内。

It will assume that you wanted those black borders and it will keep the animation constrained within the visible window.

Speaker 0

所以你有一些选择。

So you have a few options.

Speaker 0

最简单的方法其实是在v o three的帧编辑器里直接裁剪。

The easiest is to actually crop it in the v o three frame editor.

Speaker 0

只需放大到角色和场景中一个清晰可见的区域即可。

Just zoom it in to a good visible area of the character and the scene.

Speaker 0

如果这样达不到你想要的效果,你可以在文本转视频的提示中要求镜头快速拉远。

And if that doesn't give you the results you want, you can specify in your text-to-video prompt that the camera zooms out in short order.

Speaker 0

或者在生成图像时,你可以直接告诉GPT-4o,角色已经是远景状态,即角色居中且距离较远。

Or when generating the images, you can specify to GPT four o that the character is zoomed out already, you know, the character is in the center and far away.

Speaker 0

这样当你在v o three里裁剪时,就有充足的调整空间了。

So that when you crop it in v o three, you have a lot of wiggle room to work with.

Speaker 0

这些是最简单的解决方案。

Those will be your easiest options.

Speaker 0

在v o three的裁剪工具限制范围内发挥创意。

Work within the constraints of the cropping tool of v o three and get a little clever there.

Speaker 0

或者如果你熟悉专业图像编辑软件,比如Adobe Photoshop,也可以使用外绘工具。

Or you can use an outpainting tool if you already have and know a good image editing software suite like Adobe Photoshop.

Speaker 0

你可以使用类似Firefly的工具将图像扩展到指定分辨率,或者如果你认为自己在本集结束后可能会学习更多关于稳定扩散的知识并最终使用它,现在就可以开始动手尝试。

You could use something like Firefly to outpaint the image to the specified resolution, or you can start to get your hands dirty with Stable Diffusion, in case you think you might use Stable Diffusion after you learn a little bit more about it by the end of this episode.

Speaker 0

如果你认为自己最终可能会尝试稳定扩散,不妨现在就花点时间学习一下。

If you think you might ever end up playing with stable diffusion, you might as well spend the time now to learn it a little bit.

Speaker 0

如果只想要扩展绘画功能,为了简化操作,你可以使用Automatic1111而非ComfyUI,这样就不必学习整个基于节点的流程工作系统。

You can use automatic eleven eleven instead of Comfy UI for simplicity if all you want is out painting, so you don't have to learn the whole node based pipeline workflow system.

Speaker 0

如果这些都太复杂,你完全可以把本工作流程中的GPT-4o换成下一个工作流程里的Midjourney。

Or if all of this is too much, you can just swap out GPT four o in this workflow with mid journey in the next workflow.

Speaker 0

Mid Journey能够输出16:9和9:16的布局。

Mid journey can output sixteen nine and nine sixteen layouts.

Speaker 0

你只需要多花点功夫摸清门道,因为在实现精准提示词遵循方面,Midjourney比GPT-4o更复杂一些。

You'll just have to learn the ropes a little bit more, because mid journey is a bit more advanced than GPT four o in terms of achieving strong prompt adherence.

Speaker 0

但使用Mid Journey而非GPT-4有其优势,因为它内置了一整套工具系统,能创建具有高度角色一致性的角色设定表。

But there's merit to using mid journey instead of g p t four o because there's a whole system of tools inside of it for creating character sheets with strong character consistency.

Speaker 0

以上就是从GPT-4导出到VO3时获取正确分辨率的几种解决方案。

So those are your options for getting the right resolution from the export of g p t four o to v o three.

Speaker 0

要么选择A方案,在裁剪系统的限制内巧妙操作。

Either a, be clever and work within the confines of the crop system.

Speaker 0

选择B方案,使用诸如Photoshop搭配Firefly或Stable Diffusion搭配Automatic1111等外绘工具;或者选择C方案,直接升级工具链,放弃GPT-4改用Midjourney。

B, use an outpainting tool like Photoshop with Firefly or Stable Diffusion with Automatic1111. Or C, don't use GPT four o; step up your game and use Midjourney instead.

Speaker 0

下一套工作流中你将学习Midjourney角色设定表的创作步骤。

And you'll learn the mid journey character sheet creation steps in the next workflow.

Speaker 0

好的。

Okay.

Speaker 0

回到工作流正题。

Back to the workflow.

Speaker 0

我们已经获得了3到5张角色设定图。

We have our character sheet, three to five of these character images.

Speaker 0

本工作流的第二步是在VO3中使用高质量链式处理方法。

So step two of this workflow is to use a high quality chaining method in v o three.

Speaker 0

该技术能在保持VO3最高质量模型(即VO3 Quality模式)的同时,生成超过8秒时长的视频。

This technique creates a video longer than eight seconds while maintaining v o three's highest quality model, v o three quality it's called.

Speaker 0

它使用一个片段的最后一帧作为下一个片段的起始图像。

It uses the final frame of one clip as the starting image for the next.

Speaker 0

第一步,生成片段一。

Step one, generate clip one.

Speaker 0

在v o三中,选择图像转视频选项。

So in v o three, select the image to video option.

Speaker 0

上传你角色表中Aya的主形象。

Upload the main image of Aya from your character sheet.

Speaker 0

第二步,为片段一编写提示词。

Step two, write the prompt for clip one.

Speaker 0

描述初始动作。

Describe the initial action.

Speaker 0

这里有一个示例提示词。

So here's an example prompt.

Speaker 0

一段Aya的写实风格视频。

A photorealistic video of Aya.

Speaker 0

她正坐在一间现代极简风格的咖啡馆里。

She is sitting in a modern minimalist cafe.

Speaker 0

她直视镜头,然后转头望向窗外雨中的城市街道。

She looks directly at the camera then turns her head to look out a large window at a rainy city street.

Speaker 0

镜头缓缓而微妙地推近她的脸庞。

The camera performs a slow subtle push in on her face.

Speaker 0

包含雨水轻敲窗户的环境音和咖啡馆轻柔的交谈声。

Include the ambient sounds of rain against the window and the soft murmur of a cafe.

Speaker 0

你会注意到这些提示词非常具体。

You'll notice that these prompts are particular.

Speaker 0

它们细致入微。

They're nuanced.

Speaker 0

再次提醒,请访问我网站上的节目笔记获取图片和视频的提示词指南。

Again, go to the show notes on my website to get the prompt guide for images and videos.

Speaker 0

每个工具都有其独特的细微差别。

Each tool has their own nuances.

Speaker 0

第三步,提取最后一帧。

Step three, extract the final frame.

Speaker 0

片段生成后,暂停在最后一帧,截取高分辨率截图或使用帧保存功能。

After the clip generates, pause on the very last frame and take a high resolution screenshot or use a frame saving feature.
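Instead of screenshotting, you can pull the true last frame with ffmpeg. This sketch only builds the command; the `-sseof`/`-update` recipe is a standard ffmpeg trick (seek relative to the end of the file, then keep overwriting one output image so the last decoded frame is what survives).

```python
def last_frame_cmd(video_path, image_path, tail_seconds=1.0):
    """Build an ffmpeg command that writes the last frame of video_path
    to image_path: -sseof seeks relative to the end of the file, and
    -update 1 keeps overwriting a single output image with each decoded
    frame, so the final frame is what remains on disk."""
    return [
        "ffmpeg", "-y",
        "-sseof", f"-{tail_seconds}",  # start this many seconds before the end
        "-i", video_path,
        "-update", "1",                # rewrite one image per frame
        "-q:v", "1",                   # highest JPEG quality
        image_path,
    ]

print(" ".join(last_frame_cmd("clip1.mp4", "clip1_last.jpg")))
```

Run the resulting command with subprocess.run(...) once you've checked it against your local ffmpeg build.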

Speaker 0

第四步,生成片段二,开始新的生成流程,将片段一的最后一帧作为源图像上传。

Step four, generate clip two, start a new generation, upload the final frame of clip one as the source image.

Speaker 0

这样可以实现完美的视觉连贯性。

This creates a perfect visual continuity.

Speaker 0

第五步,为片段二编写提示词,采用'先这样后那样'的方法。

Step five, write the prompt for clip two, the this then that method.

Speaker 0

编写一个能按顺序延续动作的提示词。

Write a prompt that continues the action sequentially.

Speaker 0

这是对V O三最有效的提示方法。

This is the most effective prompting method for v o three.

Speaker 0

这种方法被称为'先这样后那样'法。

It's known as the this then that method.

Speaker 0

提示词是这样的。

The prompt looks like this.

Speaker 0

从这个瞬间继续,Aya将头从窗户转回,面向前方。

Continuing from this exact moment, Aya turns her head back from the window to face forward.

Speaker 0

她脸上突然浮现出恍然大悟的表情。

A look of sudden realization crosses her face.

Speaker 0

她从面前的桌上拿起一个白色陶瓷咖啡杯,举到唇边。

She picks up a white ceramic coffee mug from the table in front of her and brings it to her lips.

Speaker 0

摄像机保持稳定。

The camera remains steady.

Speaker 0

然后第六步是继续这个过程。

And then step six is you continue this process.

Speaker 0

生成片段,提取最后一帧,将该帧作为下一片段的输入,直到达到所需视频长度。这种方法绕过了Veo 3内部质量较低的序列工具。

Generate clip, extract final frame, use the frame as input for the next clip until you reach your desired video length, and this method bypasses Veo 3's lower quality internal sequencing tools.
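That loop is done by hand in the Veo 3 UI, but the control flow is easy to state in code. In this sketch, `generate_clip` and `extract_last_frame` are hypothetical stand-ins for the manual steps (generate an 8-second segment, then save its last frame), not a real Veo 3 API.

```python
def chain_clips(start_image, prompts, generate_clip, extract_last_frame):
    """Sketch of the high-quality chaining loop: each clip starts from
    the previous clip's final frame. generate_clip(image, prompt) and
    extract_last_frame(clip) stand in for the manual Veo 3 steps."""
    clips, frame = [], start_image
    for prompt in prompts:
        clip = generate_clip(frame, prompt)   # one 8-second segment
        frame = extract_last_frame(clip)      # input for the next segment
        clips.append(clip)
    return clips

# Dry run with fake functions just to show the wiring:
fake_generate = lambda image, prompt: (image, prompt)
fake_last_frame = lambda clip: clip[1]
print(chain_clips("aya_hero.png", ["clip 1 prompt", "clip 2 prompt"],
                  fake_generate, fake_last_frame))
```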

Speaker 0

现在我应该提到,Veo 3是最近才发布的。

Now I should mention v o three was only recently released.

Speaker 0

我确信他们很快就会启用VO3质量模型,用于扩展片段以及首尾帧的素材。

I'm sure very soon here they're going to enable the Veo 3 quality model for extending clips, ingredients-to-video, and first-and-last-frame.

Speaker 0

但就目前而言,这个工作流程能给你最佳效果。

But as it stands, this is gonna give you the best results, this workflow.

Speaker 0

现在你会看到网上有人正试图解决如何保持一致性的问题。

Now, you will see on the internet, people are trying to figure out how to maintain consistency.

Speaker 0

他们使用了一些非常取巧的变通方法:先生成喜欢的图像,再让ChatGPT以极度详细的描述该图像,接着描述该图像的变体,最后仅用文生视频来生成每个片段。

They're using these really hacky workarounds where they are generating an image that they like and then having ChatGPT describe that image in excruciating detail, and then having ChatGPT describe a variation of that image in excruciating detail, and then only using text to video to generate each clip.

Speaker 0

那个工作流程的问题在于:首先,效果会很差。

And that workflow is, well, one, you're gonna have bad results.

Speaker 0

其次,你会遇到角色不一致的问题。

You're gonna get character inconsistency.

Speaker 0

文本到视频的转换保真度无法完全保持你心目中角色的精确特征。

The fidelity of conversion of text to video is not going to maintain the exactitude of the character you have in mind.

Speaker 0

所以你真正应该从图像开始,而不是文本。

So you really wanna start with images, not text.

Speaker 0

其次,这是个非常严重的问题。

And two, and this is a huge problem.

Speaker 0

V o三的积分系统极其珍贵。

V o three's credit based system is extremely valuable.

Speaker 0

你可以用专业版或至尊版创建图像片段。

You can create image clips with the pro tier or the ultra tier.

Speaker 0

根据你的目标和用例,你确实需要节省这些积分,避免失误。

And depending on your goals and your use cases, you really want to conserve those credits, and you don't wanna make mistakes.

Speaker 0

你肯定不希望每保留一个片段就要丢弃五个废片。

You don't want to throw away five bad clips for every one keeper.

Speaker 0

所以从图像开始意味着你清楚知道要做什么,剩下的只是为图像动画写个强力提示。

So starting with an image means you know exactly what you're signing up for, and all that remains is a strong prompt for the animation of that image.

Speaker 0

而图像生成成本很低。

And image generation is cheap.

Speaker 0

所以如果你需要反复调整首帧十次才能得到满意的效果,完全没问题。

So if you have to iterate on the first frame, you know, 10 times before you get what you want, that's fine.

Speaker 0

那既经济又快捷。

That's inexpensive and it's fast.

Speaker 0

但你绝对不想在v o three上反复迭代视频片段直到得到满意的结果。

But you do not want to have to iterate over and over and over on video clips in v o three until you get the one you want.

Speaker 0

所以帮我个忙,如果你看到有人分享那些仅通过文本到视频链保持角色一致性的教程,请把这期节目链接发给他们,因为他们在伤害自己,我想纠正他们的做法。

So do me a favor: if you see anybody linking out these tutorials on how to maintain character consistency through just text-to-video chaining, link this episode for me, because they're hurting themselves and I wanna set them straight.

Speaker 0

好的。

Okay.

Speaker 0

回到工作流程。

Back to the workflow.

Speaker 0

第三步,使用Udio创建音轨。

Step three, create a soundtrack with Udio.

Speaker 0

为你的视频制作一条定制的高品质音乐轨道。

Create a custom high quality music track for your video.

Speaker 0

第一步,启用手动模式。

Step one, activate the manual mode.

Speaker 0

使用Udio的手动模式防止AI重写你的提示词。

Use Udio's manual mode to prevent the AI from rewriting your prompt.

Speaker 0

第二步是使用结构化提示。

Step two is use a structured prompt.

Speaker 0

使用标签定义音乐元素。

Use tags to define musical elements.

Speaker 0

示例提示可以是:[genre: 低保真嘻哈、冷波] [mood: 忧郁、沉思、希望、雨天] 等等。

An example prompt would be [genre: lo-fi hip hop, chillwave] [mood: melancholy, pensive, hopeful, rainy days], and so forth.
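If you're batching many tracks, a tiny helper keeps the tag layout uniform; the bracketed [name: value] format here just mirrors the episode's example, not an official Udio schema.

```python
def udio_prompt(**tags):
    """Assemble bracketed tag sections like [genre: ...] [mood: ...]
    in the order the tags are given. Values may be a string or a
    list of descriptors."""
    parts = []
    for name, value in tags.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        parts.append(f"[{name}: {value}]")
    return " ".join(parts)

print(udio_prompt(genre="lo-fi hip hop, chillwave",
                  mood=["melancholy", "pensive", "hopeful", "rainy days"]))
```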

Speaker 0

我想我需要为Udio和Suno制作一个提示指南。

I guess I need to generate a prompt guide for Udio and Suno.

Speaker 0

我想我会做的。

I guess I'll do that.

Speaker 0

希望在你查看节目注释时我已经准备好了。

Hopefully, I'll have it up by the time you check out the show notes.

Speaker 0

否则,就在网上查一下吧。

Otherwise, just look it up online.

Speaker 0

第三步,生成与扩展。

Step three, generate and extend.

Speaker 0

创建一个33秒的片段,选择最佳变体,并使用扩展功能添加前奏和尾声,或生成更多段落以达到所需长度。

Create a thirty three second clip, choose the best variation, and use the extend feature to add an intro and an outro or to generate more sections to reach your desired length.

Speaker 0

对于带歌词的歌曲,直接在界面中填写歌词,并使用诸如'主歌'和'副歌'等标签来构建歌曲结构,最后将曲目下载为最终音频文件。

For songs with lyrics, write them directly into the interface and use tags like verse and chorus to structure the song, and then download the track, as a final audio file.

Speaker 0

如果这些听起来太复杂,你可以使用Suno——它通常能更快出成果,只需输入歌曲要求就能得到成品。

Or if all of that sounds like too much, you could use Suno, which is generally better at quick results and just ask for a song and it'll give you a song.

Speaker 0

第四步,在Cap Cut(拼写为c a p c u t)中进行整合与润色。

Step four, assemble and polish in Cap Cut, c a p c u t.

Speaker 0

将所有素材整合成最终视频。

Combine all assets into a final video.

Speaker 0

第一步,导入素材。

Step one, import the assets.

Speaker 0

打开CapCut并导入你的Veo 3视频片段和Udio音轨。

Open CapCut and import your Veo 3 video clips and the Udio soundtrack.

Speaker 0

第二步,组装时间线。

Step two, assemble the timeline.

Speaker 0

按顺序将视频片段放置到时间线上,确保剪辑无缝衔接。

Place the video clips on the timeline in order, and the cuts should be seamless.

Speaker 0

如有需要,可使用2到4帧的短暂交叉溶解效果来平滑处理微小跳跃,因为VO3中一帧的末尾与下一帧的开头可能无法做到像素级完美匹配。

If needed, use a short two-to-four-frame cross dissolve to smooth any minor jumps, because the end of one clip and the beginning of the next clip in Veo 3 might not be pixel perfect.

Speaker 0

因此你可能需要2到4帧的交叉溶解来平滑过渡。

So you might want a two-to-four-frame cross dissolve to smooth the jumps.

Speaker 0

第三步,混音处理。

Step three, mix the audio.

Speaker 0

将Udio音乐轨道放置在音频层上。

Place the Udio music track on an audio layer.

Speaker 0

调低Veo 3片段原生环境音的音量,使其位于音乐下方,既能提供质感又不会压过配乐。

Lower the volume of the Veo 3 clips' native ambient sounds so that it sits under the music, providing texture without overpowering the soundtrack.

Speaker 0

第四步,添加文本和字幕。

Step four, add text and captions.

Speaker 0

使用文本工具添加品牌标识或行动号召。

Use the text tool for branding or calls to action.

Speaker 0

使用自动字幕功能生成字幕,这对用户在静音状态下观看的平台留存率很重要。

Use the auto captions feature to generate subtitles, which is important for viewer retention on platforms where users watch with sound off.

Speaker 0

如果你是我描述的那种制作TikTok和YouTube短视频的人,你可能已经知道这一点了。

If you're the persona I'm describing who generates TikTok and YouTube shorts videos, you probably already know that.

Speaker 0

但如果你之前没听说过,你会希望字幕直接内嵌在视频里,而不是由Instagram或YouTube短视频自动生成。

But if you haven't heard that before, you want captions actually baked into your videos, not auto generated by Instagram or YouTube shorts.

Speaker 0

你需要将它们直接嵌入视频中,剪映可以帮你实现这一点。

You want them baked into your video, and CapCut can do this for you.

Speaker 0

最后第五步是将最终视频以9:16的宽高比、1080x1920分辨率、每秒24或30帧的格式导出。

And then finally, number five is export the final video in a 9:16 aspect ratio at 1080x1920 resolution, 24 or 30 frames per second.

Speaker 0

我这里没有提到的是,目前vo3仅支持导出16:9比例的MP4文件。

And what I don't have here is that Veo 3 currently only supports 16:9 exports of MP4 files.

Speaker 0

它只能生成横版视频。

It can only generate landscaped videos.

Speaker 0

你无法生成竖屏视频。

You cannot generate portrait videos.

Speaker 0

再次提醒,2025年7月13日。

Again, 07/13/2025.

Speaker 0

这一点可能很快就会改变。

This may change shortly.

Speaker 0

事实上,我知道很快就会改变,因为YouTube已宣布将很快把v o three整合到YouTube短视频平台中,但他们显然需要克服某些技术障碍。

In fact, I know it will change shortly because YouTube has announced that they are going to integrate Veo 3 into the YouTube Shorts platform soon, but they clearly have to get over some sort of a technical hurdle.

Speaker 0

他们需要开发一个具有9:16输出头的v o three版本神经网络。

They need to make a version of v o three that has a nine sixteen output head in the neural network.

Speaker 0

你不能从图像和视频生成器中随意生成任何尺寸的输出。

You can't just generate any dimension output from image and video generators.

Speaker 0

必须是通过特定维度头微调的模型版本,或是具有多头输出层的超级模型,其中神经网络流程中的某种门控技术会决定使用哪个输出头。

It has to be a version of the model fine tuned with a specific dimension head or a supermodel that has a multiheaded output layer where some gating technique in the flow of the neural network will decide which head to use.

Speaker 0

所以当你把这个视频转换成短视频或Reels时,你会需要9:16比例。

So when you convert this video to a short or a reel, you're gonna want it in nine sixteen.

Speaker 0

这些平台就是这样的要求。

That's what those platforms expect.

Speaker 0

所以你要么放大画面,要么就只能压缩到画面中央,然后用品牌文字和字幕来填充黑边区域,或者直接谷歌搜索如何将横屏视频转为竖屏视频。

And so you'll either zoom in, or you'll just have it squashed into the center of the frame and use your branding text and your captions to fill in the black zones, or just Google how to convert a landscape video to a portrait video.
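Here's the arithmetic behind the "squashed into the center" option: a sketch of fitting a 1920x1080 landscape clip into a 1080x1920 portrait frame, and how much black padding your branding text and captions would need to cover.

```python
def fit_into_portrait(src_w, src_h, dst_w=1080, dst_h=1920):
    """Scale a landscape source to fill the portrait frame's width,
    then report the top/bottom padding left over."""
    scale = dst_w / src_w
    new_h = int(round(src_h * scale / 2) * 2)  # keep the height even for codecs
    pad = (dst_h - new_h) // 2
    return new_h, pad

new_h, pad = fit_into_portrait(1920, 1080)
# The 1920x1080 clip becomes 1080x608, leaving 656-pixel bands
# above and below for captions and branding.
print(new_h, pad)  # 608 656
```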

Speaker 0

有各种技术手段。

There's techniques.

Speaker 0

当然这些方法都不完美。

They're always imperfect naturally.

Speaker 0

不同技术各有优缺点。

Pros and cons to different techniques.

Speaker 0

找个适合你的方法就行。

So find one that works for you.

Speaker 0

好了。

There we go.

Speaker 0

工作流程一。

Workflow one.

Speaker 0

现在是工作流程二,独立电影制作人的工作流程。

Now workflow two, the indie film maker workflow.

Speaker 0

顺便说一句,如果这期节目中我听起来有些气喘吁吁,那是因为我是在跑步机办公桌上录制的。

By the way, if I seem out of breath in this episode, it's because I'm recording it at a treadmill desk.

Speaker 0

亚里士多德、尼采、康德和梭罗都认为行走对思考至关重要。

Aristotle, Nietzsche, Kant, and Thoreau all believed walking was essential to thinking.

Speaker 0

尼采说过,只有通过行走获得的思考才有价值。

Nietzsche said only thoughts reached by walking have value.

Speaker 0

我学习、录音、剪辑或工作时会使用步行办公桌

So I use a walking desk while I study, record, edit, or work.

Speaker 0

步行办公能通过促进血液循环和内啡肽分泌来提升专注力、精力、警觉性和情绪。

It improves focus, energy, alertness, and mood by increasing blood flow and endorphins.

Speaker 0

你也应该试试。

You should try one too.

Speaker 0

在学习机器学习或推进你的ML项目的同时实现健身目标。

Meet your fitness goals while you study machine learning or work on your ML projects.

Speaker 0

查看我最喜欢的步行办公桌设置链接:ocdevel.com/walk。

See a link to my favorite walking desk setup at ocdevel.com/walk.

Speaker 0

网址是ocdevel.com。

That's ocdevel.com.

Speaker 0

工作流程二,独立电影制作人流程,一种适用于叙事短片的混合方法。

Workflow two, the indie filmmaker workflow, a hybrid approach for narrative shorts.

Speaker 0

这个流程适用于制作一到五分钟叙事短片的创作者。

This workflow is for creators making narrative short films, one to five minutes.

Speaker 0

重点在于叙事、角色连贯性和电影感视觉效果。

The focus is on storytelling, character consistency, and a cinematic look.

Speaker 0

这比之前的工作流程需要管理更多的技术复杂性。

This requires managing more technical complexity than the previous workflow.

Speaker 0

挑战在于没有一个视频模型能同时擅长所有类型的镜头。

The challenge is that no single video model excels at all types of shots.

Speaker 0

例如对话镜头、动作镜头和场景建立镜头。

For example, dialogue, action, and establishing shots.

Speaker 0

解决方案是采用一种潜在的混合流程,为每项任务配备专门工具。

The solution is a potential hybrid pipeline that uses specialized tools for each task.

Speaker 0

在图像生成方面,我们将使用Midjourney第七版。

So for image generation, we're gonna use mid journey v seven.

Speaker 0

选择Midjourney是为了构建电影的视觉基础。

Mid journey is chosen for creating the film's visual foundation.

Speaker 0

它拥有这些标志性参数:--cref(角色参考)和--sref(风格参考)。

It has these flags, these parameters: --cref for character reference and --sref for style reference.

Speaker 0

这些参数用于建立统一的角色形象和电影化风格,可作为所有其他工具的主参考。

These parameters are used to establish a consistent character and cinematic look that can be used as a master reference across all other tools.

Speaker 0

因此在保持角色和风格一致性方面,Midjourney比GPT-4o提供了更强的控制力。

So you have a lot more control about maintaining character and style consistency with Midjourney than you do with GPT four o.

Speaker 0

Midjourney在创作角色设定表或风格调色板方面表现尤为出色。

Midjourney really shines at coming up with a character sheet or a palette.

Speaker 0

它确实非常擅长把控一致性。

It's really good at wrangling consistency.

Speaker 0

因此你可能会考虑将它用于之前的工作流程。

And so you might consider using it for the previous workflow.

Speaker 0

我只是想在上一步给你一个非常简单的、具体到细节的工作流程。

I just wanted to give you a real easy nuts-and-bolts workflow in the previous step.

Speaker 0

至于视频生成,我们将使用Kling(拼写为k l I n g),因其高质量的唇形同步能力,以及能渲染出具有合理物理效果和微表情的逼真角色,所有对话场景都会用它。

For video generation, we're gonna use Kling, k l I n g, which is used for all dialogue scenes due to its high quality lip syncing and the ability to render realistic characters with plausible physics and micro expressions.

Speaker 0

Kling尤其擅长处理角色表现。

Kling is really good at characters specifically.

Speaker 0

所以每当你有两个人在聊天,或者外星人和大脚怪在对话时,Kling就会是你的首选工具。

So anytime you have two people chatting, or an alien and a Bigfoot chatting, then Kling will be your guy.

Speaker 0

Runway Gen four被用于所有其他镜头,比如场景设定、动作戏和B卷素材。

Runway Gen four is used for all other shots like establishing shots, action, and b-roll.

Speaker 0

它的导演模式提供精确的摄像机控制,如平移、俯仰和变焦,以及用于添加局部运动的多重动态笔刷功能。

Its director mode offers precise camera controls like pan and tilt and zoom and a multimotion brush for adding localized motion.

Speaker 0

现在,Veo 3、Kling和Runway都是这个工作流程中可行的视频生成与编辑工具。

Now, Veo 3, Kling, and Runway are all viable video generation and editing tools in this workflow.

Speaker 0

我将在这里使用Kling和Runway,这样你可以看到一些多样性,并了解何时以及为何会选择其中一个。

I'm gonna use Kling and Runway here so that you get a bit of variety and you can see when and why you might use one or the other.

Speaker 0

但你可以只用一个工具,或者三个都用。

But you can use one tool or you can use all three tools.

Speaker 0

这取决于你。

It's up to you.

Speaker 0

随你喜欢,通过这个工作流程,你应该能清楚每个工具的亮点所在。

Whatever strikes your fancy, by the end of this workflow, you should have a good sense of where each tool shines.

Speaker 0

Veo 3擅长提供集视频、语音和动作于一体的解决方案,且能精准遵循提示。

Veo 3 is great at giving you an all-in-one video, voice and action, and prompt adherence.

Speaker 0

Kling尤其擅长角色表现、对话和唇形同步。

Kling is good at characters especially, and dialogue and lip syncing.

Speaker 0

而Runway第四代在场景控制、镜头操控和动作序列方面表现出色。

And runway gen four is great at scene control and camera control and action sequences.

Speaker 0

所以,你可以随意搭配使用它们。

So mix and match them at your leisure.

Speaker 0

现在,关于语音和对话,我们将使用Eleven Labs。

Now, voice and dialogue, we're gonna use Eleven Labs.

Speaker 0

Eleven Labs用于生成非常高质量、富有情感且能与视频完美同步的合成语音。

Eleven Labs is used to generate very high quality, emotive, synthetic voices that can be convincingly synced in a video.

Speaker 0

如果你还没听过Eleven Labs的文本转语音技术,我强烈建议你去了解一下。

If you have not heard eleven Labs text to speech voices yet, I encourage you to look them up.

Speaker 0

Eleven,拼写是e-l-e-v-e-n,Labs。

Eleven, e-l-e-v-e-n, Labs.

Speaker 0

去搜索并听听那些语音样本。

Look them up and listen to the voices.

Speaker 0

它们简直不可思议。

They are nuts.

Speaker 0

超凡脱俗。

Out of this world.

Speaker 0

而且它们还在不断进步,越来越好。

And they're getting better and better and better.

Speaker 0

这是他们的核心业务,他们正在大放异彩。

This is their bread and butter, and they're really taking the stage.

Speaker 0

他们是语音界的V O三巨头,但也有很棒的开源替代品。

They're the v o three of voice, but there's great open source alternatives.

Speaker 0

Kokoro是我的最爱。

Kokoro is my favorite.

Speaker 0

至于最后的合成、调色和润色环节,你可以使用DaVinci Resolve。

And then for the final assembly, color and finishing, you might use DaVinci Resolve.

Speaker 0

在这个工作流程中,我们选择DaVinci Resolve而非Adobe Premiere Pro,因为其功能强大的免费版本和工作室版的一次性购买方案更经济实惠。

In this workflow, we're picking DaVinci Resolve over Adobe Premiere Pro because the powerful free version and the one time purchase for the studio version are more economical.

Speaker 0

所以如果你预算有限,又想要比Premiere Pro更简单的操作,可能会选择DaVinci Resolve。

So if you're on a budget and you want a bit more simplicity than Premiere Pro, you might choose DaVinci Resolve.

Speaker 0

它在互联网和这类工作流程中非常非常流行。

It's very, very popular on the internet and in these workflows.

Speaker 0

我经常看到它被使用。

I see it all the time.

Speaker 0

DaVinci Resolve。

D a v I n c I Resolve.

Speaker 0

它在处理高要求编解码器时提供更佳性能,并集成了顶级调色套件和基于节点的视觉特效环境Fusion(拼写为f u s I o n),所有这些都在一个应用程序中完成。

It offers better performance with demanding codecs and integrates a top tier color grading suite and node based VFX environment called Fusion, f u s I o n, in one application.

Speaker 0

简单分解一下。

So bit of a breakdown.

Speaker 0

角色和世界观设计将使用Midjourney第七版工具。

For character and world design, the tool will be Midjourney v seven.

Speaker 0

其叙事优势在于角色和风格参考功能,即--cref和--sref参数。

Its strengths for narrative are the character and style reference, dash dash c ref and dash dash s ref.

Speaker 0

它具有电影级画质。

It's cinematic quality.

Speaker 0

它能创建主视觉蓝图,确保所有工具间的一致性。

It creates a master visual blueprint for consistency across all tools.

Speaker 0

对话场景。

Dialogue scenes.

Speaker 0

我们将选用Kling(k l I n g)。

We're gonna pick Kling, k l I n g.

Speaker 0

它在叙事方面的优势在于卓越的口型同步和高保真角色真实感及物理模拟。

Its strengths for narrative are superior lip sync and high fidelity character realism and physics simulation.

Speaker 0

这对于打造可信的对话场景至关重要。

It's essential for believable dialogue scenes.

Speaker 0

电影级花絮镜头与动作场面。

Cinematic b roll and action.

Speaker 0

我们将选择Runway Gen 4。

We're gonna choose Runway Gen four.

Speaker 0

它的优势在于高级摄像机控制、多运动笔刷和导演模式。

Its strengths are advanced camera controls, multi motion brush, director mode.

Speaker 0

它为无对白镜头提供了创作掌控力。

It provides creative control over non dialogue shots.

Speaker 0

至于语音生成,我们会选用Eleven Labs。

For voice generation, we'll choose Eleven Labs.

Speaker 0

它具备高情感保真度、语音克隆和自然节奏。

It has high emotional fidelity, voice cloning, and natural cadence.

Speaker 0

能提供可完美同步的表演效果。

It delivers performances that can be convincingly synced.

Speaker 0

至于剪辑调色和后期制作,我们将选用DaVinci Resolve。

For edit color and finish, we will choose DaVinci Resolve.

Speaker 0

这是一款集剪辑、调色与特效于一体的全流程解决方案。

It's an all in one suite for edit color and fusion.

Speaker 0

它性能出色且成本合理。

It has good performance and good cost.

Speaker 0

它是一款专业的非订阅制编辑器,拥有业界顶尖的色彩工具。

It is a professional non-subscription editor with industry-best color tools.

Speaker 0

好的。

Alright.

Speaker 0

第一步,在Mid Journey中创建视觉基础。

Step one, create a visual foundation in mid journey.

Speaker 0

通过为主角和整体美学风格创建一套完整的参考图像集,确立电影的视觉语言。

Establish the film's visual language by creating a master set of reference images for the main character and aesthetic.

Speaker 0

流程如下。

Here's the process.

Speaker 0

第一步,生成主角图像,h e r o(英雄)。

Step one, generate the hero character image, h e r o.

Speaker 0

这就是我们后续将要使用的主要角色形象。

That's the the main character that we're gonna be working from going forward.

Speaker 0

使用电影风格的提示词,为主角创建一张权威的定妆照。

Create a single definitive image of a main character using a cinematic prompt.

Speaker 0

提示词内容如下。

The prompt looks like this.

Speaker 0

一张电影剧照风格的图像:一位疲惫的男性侦探杰克,四十多岁,胡子拉碴带着五点钟胡茬,眼神疲倦,穿着皱巴巴的棕色风衣,站在夜晚东京霓虹闪烁的雨湿街道上,照片级真实感,变形镜头光晕,35毫米胶片颗粒感,忧郁的黑色电影美学 --ar 16:9。

A cinematic film still, a weary male detective, Jack, mid-forties, unshaven with a five o'clock shadow, tired eyes, wearing a rumpled brown trench coat, standing on a rain-slicked, neon-lit street in Tokyo at night, photorealistic, anamorphic lens flare, 35 millimeter film grain, moody noir aesthetic --ar 16:9.

Speaker 0

这里的--ar表示画幅比例;再加上--style raw和--v 7,分别表示原始风格与第七版模型。

The --ar is for aspect ratio, and --style raw plus --v 7 set the raw style and version seven.

Speaker 0

再次提醒,ocdevel.com上有提示指南。

Again, ocdevel.com for the prompt guides.

Speaker 0

宽高比。

Aspect ratio.

Speaker 0

这就是我一直想找的词。

That was the word I've been looking for.

Speaker 0

我总说分辨率和布局。

I keep saying resolution and layout.

Speaker 0

宽高比。

Aspect ratio.

Speaker 0

第二步,用--c ref建立角色参考。

Step two, establish a character reference with dash dash c ref.

Speaker 0

获取主角图片的URL。

Get the URL of the hero image.

Speaker 0

使用--c ref参数生成相同角色不同姿势的新图片,同时锁定其特征。

Use it with the dash dash c ref parameter to generate new images of the same character in different poses while locking in their features.

Speaker 0

据我所知,双破折号c ref参数需要使用一个URL。

So dash dash c ref uses a URL to my best understanding.

Speaker 0

所以如果你已经有一张想用作主角的图片,你需要把它上传到网上,以便在提示中作为URL引用。

So if you already have an image you wanna use as your hero, you'll wanna put it somewhere online that you can reference as a URL for your prompts.

Speaker 0

示例提示:一张电影剧照风格的图片,同一个男人坐在灯光昏暗的拉面吧台前,看着柜台上的案件档案 --cref [URL] --cw 100 --v 7。

Example prompt: a cinematic film still of the same man sitting in a dimly lit ramen bar looking at a case file on the counter --cref [the URL] --cw 100 --v 7.

Speaker 0

而双破折号c w是角色权重参数,取值范围是0到100。

And the dash dash c w, which is character weight parameter of zero to 100.

Speaker 0

这个参数控制新图像与参考图像的匹配程度。

This controls how closely the new image matches that reference.

Speaker 0

所以如果你使用100,就能保持角色特征的最大一致性。

So if you use a 100, that will maintain maximum consistency for character consistency.
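When you're generating a whole shot list, a small helper keeps the flag syntax uniform. The flag names follow the episode's description; verify the exact syntax against Midjourney's current docs, and the example URL is just a placeholder.

```python
def mj_prompt(text, cref=None, cw=None, sref=None, sw=None,
              ar="16:9", version="7"):
    """Compose a Midjourney prompt string with optional character
    (--cref/--cw) and style (--sref/--sw) reference flags."""
    parts = [text, f"--ar {ar}"]
    if cref:
        parts.append(f"--cref {cref}")
        if cw is not None:
            parts.append(f"--cw {cw}")
    if sref:
        parts.append(f"--sref {sref}")
        if sw is not None:
            parts.append(f"--sw {sw}")
    parts.append(f"--v {version}")
    return " ".join(parts)

print(mj_prompt("a cinematic film still of the same man in a dimly lit ramen bar",
                cref="https://example.com/jack.png", cw=100))
```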

Speaker 0

第三步,使用--sref参数建立风格参考。

Step three, establish a style reference with dash dash s ref.

Speaker 0

使用相同的英雄图像URL配合sref参数,生成具有原图视觉风格的图像。

Use the same hero image URL with the with the s ref parameter to generate images that share the original visual style.

Speaker 0

因此,色彩、光线和氛围都不会包含实际角色。

So color, lighting, and mood without including the actual character.

Speaker 0

示例提示:一张电影剧照,夜晚东京一条空无一人的雨巷。

Example prompt, cinematic film still, an empty rain slicked alleyway at night in Tokyo.

Speaker 0

霓虹招牌的倒影映在地面的水洼中。

Neon signs reflecting in the puddles of the ground.

Speaker 0

--sref [URL链接] --sw 800。

--sref [the URL] --sw 800.

Speaker 0

s w参数(风格权重)控制风格转换的强度。

The s w parameter style weight controls the strength of the style transfer.

Speaker 0

第四步,创建参考集。

Step four, create a reference set.

Speaker 0

生成5到10张静态图像的拍摄清单,包括特写、中景、全景镜头等。

Generate a shot list of five to 10 still images, close ups, medium shots, establishing shots, and so forth.

Speaker 0

这个视觉指南将用于Kling和Runway中以保持一致性。

This visual guide will be used in Kling and Runway to maintain consistency.

Speaker 0

所以你可以看到,你对角色设定表有了更多控制权,以确保角色一致性。

So you can see you have a lot more control over your character sheet for character consistency.

Speaker 0

好的。

Okay.

Speaker 0

工作流程的第二步。

Step two of the workflow.

Speaker 0

使用Eleven Labs到Kling的流程创建对话场景。

Create dialogue scenes with the Eleven Labs to Kling pipeline.

Speaker 0

Eleven Labs到Kling的流程。

Eleven Labs to Kling pipeline.

Speaker 0

为了获得最佳口型同步效果,请分别生成音频和视频后再进行合成。

To get the best lip sync, generate the audio and video separately and then combine them.

Speaker 0

直接用包含对话的提示生成视频,在Kling中往往会得到糟糕的口型同步。

Generating video from a prompt that includes dialogue often results in a poor lip sync in Kling.

Speaker 0

所以第一步是在Eleven Labs中生成你的语音轨道。

So step one is generate your voice track in Eleven Labs.

Speaker 0

将你的对话脚本撰写或上传至Eleven Labs。

Write or upload your dialogue script to Eleven Labs.

Speaker 0

选择一个预设声音或克隆一个自定义声音。

Choose a premade voice or clone a custom one.

Speaker 0

调整稳定性和清晰度滑块以获得自然的情感表现。

Adjust stability and clarity sliders for natural emotive performance.

Speaker 0

将最终对话以高质量音频文件形式下载。

Download the final dialogue as a high quality audio file.

Speaker 0

顺便说一下,Eleven Labs拥有海量声音库,因为你可以提交自己的声音,每次你的声音被用于这类工作流程时都能获得报酬。

And by the way, Eleven Labs has tons of voices because you can submit your own voice and you can get paid for every time your voice is used in in in this kind of a workflow.

Speaker 0

人们每月大约能赚300美元。

People are making about, like, $300 a month.

Speaker 0

所以如果你想出售自己的声音,可以卖给Eleven Labs。

So if you ever wanna sell your voice, you can sell them to Eleven Labs.

Speaker 0

第二步,在Cling中生成一个中性视频。

Step two, generate a neutral video in Cling.

Speaker 0

在Cling的图像转视频功能中,上传你从Midjourney参考集中选择的角色特写或中景镜头。

In Cling's image to video function, upload a close-up or a medium shot of your character from the mid journey reference set.

Speaker 0

编写提示词描述角色的情绪和场景,但需明确指出他们没有在说话。

Write a prompt describing the character's emotion and the scene, but explicitly state that they are not speaking.

Speaker 0

例如,他的嘴巴是闭着的。

For example, his mouth is closed.

Speaker 0

这样可以避免随机出现的嘴部动作。

This prevents random mouth movements.

Speaker 0

这里有一个提示词示例。

Here's an example prompt.

Speaker 0

电影级特写镜头:侦探杰克。

A cinematic close-up of a detective, Jack.

Speaker 0

他正专注地听着屏幕外某人说话。

He listens intently to someone off screen.

Speaker 0

他的表情严肃而若有所思。

His expression is serious and thoughtful.

Speaker 0

他的嘴是闭着的。

His mouth is closed.

Speaker 0

场景光线昏暗,带有微妙的氛围动态。

The scene is dimly lit, subtle ambient motion.

Speaker 0

生成一个五到十秒的片段。

Generate a five to ten second clip.

Speaker 0

第三步,在Kling中应用唇形同步。

Step three, apply the lip sync in Kling.

Speaker 0

选择你刚刚生成的中性视频。

Select the neutral video that you just generated.

Speaker 0

使用Kling中的唇形同步或添加音频功能。

Use the lip sync or add audio feature in Kling.

Speaker 0

上传来自Eleven Labs的对应对话音轨,Kling将处理视频,将音频音素映射到角色面部,创造出高度精准的唇形同步表演。

Upload the corresponding dialogue track from Eleven Labs, and Kling will process the video, mapping the audio phonemes to the character's face to create a highly accurate lip synced performance.

Speaker 0

工作流程的下一步,用Runway Gen-4创建电影级B卷素材。

Next step in the workflow, create cinematic B-roll with Runway Gen-4.

Speaker 0

使用Runway的高级工具处理无对白场景。

Use Runway's advanced tools for scenes without dialogue.

Speaker 0

流程如下。

Here's the process.

Speaker 0

第一步,使用高级镜头控制。

Step one, use advanced camera control.

Speaker 0

要创建特定镜头运动,请将你的Midjourney场景中的参考图片上传至Runway。

To create a specific camera move, upload a reference image from your Midjourney set to Runway.

Speaker 0

使用导演模式镜头控制来精确设置推拉摇移等运动方式。

Use the director mode camera controls to set a precise dolly or pan or any of those movements.

Speaker 0

这比基于文本的运动提示能提供更确定性的控制。

This gives you more deterministic control than text based motion prompts.

Speaker 0

第二步,使用多重动态笔刷。

Step two, use the multi motion brush.

Speaker 0

该工具能为静态镜头增添微妙的生命力。

This tool adds subtle life to static shots.

Speaker 0

上传一张静态图片,比如一辆停在街上的汽车。

Upload a static image like a car parked on a street.

Speaker 0

使用最多五个笔刷来独立为图片的不同部分添加动画效果。

Use up to five brushes to animate different parts of the image independently.

Speaker 0

比如第一个笔刷可以涂抹树叶,并设置轻微的环境运动来模拟微风。

Brush one might paint over tree leaves and set a low ambient motion to simulate a breeze.

Speaker 0

第二个笔刷可以涂抹水洼,并设置另一种微妙的环境运动来产生涟漪效果。

Brush two might paint over a puddle and set a different subtle ambient motion for a ripple effect.

Speaker 0

而第三笔刷可以用极低的环境运动描绘远处的树木,创造出视差效果。

And brush three might paint over distant trees with a very low ambient motion to create a parallax effect.

Speaker 0

这种技术能增添层次分明的真实感和制作价值。

This technique adds layers of realism and production value.

Speaker 0

第三步是保持一致性。

Step three is maintain consistency.

Speaker 0

始终使用你Midjourney参考集中对应的图像作为Runway中生成每个镜头的来源。

Always use a corresponding image from your Midjourney reference set as the source for every shot generated in Runway.
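To keep those reference prompts consistent, the --cref, --cw, and --sref parameters described earlier can be composed programmatically. A small sketch; the flags are Midjourney's documented character/style reference parameters, and the URLs are placeholders:

```python
# Compose a Midjourney prompt that locks character (--cref) and style
# (--sref) to reference images, as described above. URLs are placeholders.

def consistency_prompt(scene: str, cref_url: str, sref_url: str,
                       character_weight: int = 100) -> str:
    """Build a Midjourney prompt with character and style references.

    character_weight (--cw) of 100 preserves face, hair, and clothing.
    """
    return f"{scene} --cref {cref_url} --cw {character_weight} --sref {sref_url}"

# e.g. consistency_prompt("detective Jack in a rainy alley, cinematic",
#                         "https://example.com/jack.png",
#                         "https://example.com/style.png")
```

Keeping a helper like this in your shot list means every B-roll frame pulls from the same character and style references.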

Speaker 0

最后一步是在DaVinci Resolve中进行合成和调色。

And then the final step is assemble and color grade it in DaVinci Resolve.

Speaker 0

让我们整合所有素材并应用专业后期处理。

So let's combine all assets and apply a professional finish.

Speaker 0

流程第一步:将素材导入DaVinci Resolve。

The process is step one, import your assets into DaVinci Resolve.

Speaker 0

包括所有生成的素材、来自Kling的口型同步片段、Runway提供的B卷素材、Eleven Labs的高质量音频,以及您需要的任何音乐或音效。

All the generated assets: lip synced clips from Kling, B-roll from Runway, high quality audio from Eleven Labs, and any music or sound effects that you want.

Speaker 0

第二步是编辑叙事内容。

Step two is edit the narrative.

Speaker 0

在编辑页面时间线上组装视频片段。

Assemble the clips on the edit page timeline.

Speaker 0

将Eleven Labs的纯净音轨放置在独立的音频轨道上,确保完美同步。

Place the clean audio tracks from Eleven Labs on separate audio layers, ensuring they are perfectly synchronized.

Speaker 0

第三步是进行专业调色。

Step three is perform a professional color grade.

Speaker 0

在调色页面,使用DaVinci Resolve的节点工具和示波器来匹配来自Kling和Runway片段的色彩与对比度。

On the color page, use DaVinci Resolve's node based tools and scopes to match the color and contrast of clips from Kling and Runway.

Speaker 0

镜头匹配后,应用最终的创意风格。

Once shots are matched, apply a final creative look.

Speaker 0

对于黑色电影风格,可以降低饱和度、压深黑色色调,并在阴影部分添加冷蓝色调。

For a noir film, you might desaturate colors, crush blacks, and add a cool blue tint to the shadows.

Speaker 0

然后最后是第四步,导出。

And then finally, step four, export.

Speaker 0

在交付页面,选择一个导出预设。

On the deliver page, choose an export preset.

Speaker 0

高比特率H.264或H.265编码适用于网络发布,而专业编码如Apple ProRes更适合电影节提交或存档。

A high bitrate H.264 or H.265 is good for web distribution, while a professional codec like Apple ProRes is better for festival submissions or archival.
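Those same two delivery targets can also be produced outside Resolve with ffmpeg. A hedged sketch that builds the argument lists; the flags are standard ffmpeg options (prores_ks is ffmpeg's ProRes encoder), and the filenames are placeholders:

```python
# Build ffmpeg argument lists for the two delivery targets described above:
# high-bitrate H.264 for the web, and Apple ProRes (ffmpeg's prores_ks
# encoder) for festival submissions or archival. Filenames are placeholders.

def web_h264_args(src: str, dst: str, bitrate: str = "20M") -> list:
    """High-bitrate H.264 with AAC audio for web distribution."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-b:v", bitrate,
            "-pix_fmt", "yuv420p", "-c:a", "aac", dst]

def prores_args(src: str, dst: str, profile: int = 3) -> list:
    """ProRes with uncompressed PCM audio; profile 3 = ProRes 422 HQ."""
    return ["ffmpeg", "-i", src, "-c:v", "prores_ks",
            "-profile:v", str(profile), "-c:a", "pcm_s16le", dst]

# Run with, e.g.:
# import subprocess
# subprocess.run(web_h264_args("final_cut.mov", "web.mp4"), check=True)
```

Building the arguments as a list (rather than a shell string) avoids quoting bugs when filenames contain spaces.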

Speaker 0

好的。

Okay.

Speaker 0

工作流程第三部分。

Workflow number three.

Speaker 0

快速问个问题。

Quick question.

Speaker 0

你有没有想过,如果泰勒是在步行桌上录制的,为什么我听不到跑步机的声音?

At any point did you think, if Tyler's recording this at a walking desk, why don't I hear the treadmill?

Speaker 0

现在你能听到跑步机的声音吗?

Can you hear the treadmill now?

Speaker 0

这就是我的原始录音听起来的样子。

This is what my raw recordings sound like.

Speaker 0

我使用Descript神奇地提升我的音频质量。

I use Descript to magically improve my audio quality.

Speaker 0

这是一个利用AI增强音视频效果、减少诸如删除填充词和重录等繁琐编辑任务的录音工作室。

It's a recording studio that uses AI to enhance audio and video and reduce mundane editing tasks like removing filler words and retakes.

Speaker 0

如果你是内容创作者,可以在ocdevel.com/creator上试用。

If you're a content creator, try it out at ocdevel.com/creator.

Speaker 0

好的,第三个工作流程。

Okay, workflow number three.

Speaker 0

你准备好了吗?

Are you ready?

Speaker 0

朋友们,系好安全带。

Buckle up my friends.

Speaker 0

这个非常复杂。

This one's very complex.

Speaker 0

如果你只关心工作流程一或二,或许可以跳过这部分。

If you really only care about workflows one or two, you can probably skip this.

Speaker 0

但如果你想了解,想掌握真正高级的技巧,这个可能对你有用。

But if you wanna know how to do the really advanced stuff, maybe this will be handy for you.

Speaker 0

Stable Diffusion是个非常强大且实用的工具。

Stable Diffusion is a very powerful and very useful tool.

Speaker 0

如果你未来可能考虑使用更高级的技术,那就继续听下去。

So if you ever think you might consider using more advanced techniques in the future, then listen on.

Speaker 0

这是工作流程三,专业工作室工作流程。

This is workflow number three, the professional studio workflow.

Speaker 0

你需要通过Comfy UI和你选择的视频编辑软件实现全面掌控。

You want full control with Comfy UI and your video editing software of choice.

Speaker 0

因此,这个工作流程专为从事高预算项目的专业视觉特效师和工作室设计。

So this workflow is for professional visual effects artists and studios working on high budget projects.

Speaker 0

这些是专业人士,好莱坞级别的从业者。

These are professionals, Hollywood people.

Speaker 0

其目标是实现每个像素的完全控制、照片级真实感质量、完美的演员相似度,并能融入标准的非破坏性视觉特效流程。

The goals are complete control over every pixel, photorealistic quality, perfect actor likeness, and integration into a standard nondestructive visual effects pipeline.

Speaker 0

另一个潜在目标是成本控制——因为你将在自有硬件上运行,采用基于节点的GPU集群流程,使用开源项目而非必须依赖后端API(当然也可选用OpenAI、Google Vertex或Eleven Labs等接口)。

Another possible goal here is cost control, because you're gonna be running this on your own hardware, a fleet of GPUs in a node based pipeline using open source projects, rather than necessarily relying on back end APIs, although you could still use APIs like OpenAI or Google Vertex or Eleven Labs.

Speaker 0

这样你不仅能掌控每个步骤、质量和一致性,还能控制成本。

So you control not only every step and the quality and the consistency, but also the cost.

Speaker 0

主流工具如Veo、Kling和Runway的挑战在于它们都是黑箱系统。

The challenge with mainstream tools like Veo, Kling, and Runway is that they are black boxes.

Speaker 0

它们输出的MP4等压缩格式不适合专业合成,而这个工作流程规避了这些问题。

They output compressed formats like MP4s, which are unsuitable for professional compositing, and this workflow avoids them.

Speaker 0

相反,我们将采用模块化开源流程和Comfy UI界面,这种方法将AI视为可控的渲染引擎,输出多层开放的EXR文件,这些文件会在专业视觉特效软件(如DaVinci Resolve的Fusion页面)中进行合成。

Instead, we're gonna use a modular open source pipeline and Comfy UI, and this approach treats AI as a controllable render engine, outputting multilayer open EXR files that are composited in a professional VFX application like DaVinci Resolve's Fusion page.

Speaker 0

本次演示我们将继续使用DaVinci Resolve,但如果你是Adobe用户,可以自由选择任何工具作为最终合成阶段的软件。

And we're gonna use DaVinci Resolve for this example yet again, but if you are an Adobe person, you'll use whatever tool you want as the final compositing stage.

Speaker 0

工具链的构成如下。

The tool chain goes like this.

Speaker 0

核心引擎我们将使用稳定扩散生态系统。

For the core engine, we'll use the Stable Diffusion ecosystem.

Speaker 0

这是唯一具备必要开放式模块化架构的平台。

This is the only platform with the necessary open modular architecture.

Speaker 0

本工作流将在Comfy UI界面中使用诸如Stable Diffusion 3或FLUX等模型。

This workflow uses models like Stable Diffusion 3 or FLUX within the Comfy UI interface.

Speaker 0

我们会使用控制层。

We'll use control layers.

Speaker 0

我们将其称为技术栈。

We'll call this the stack.

Speaker 0

通过叠加多个独立控制层来实现精确控制。

Precision is achieved by layering multiple independent controls.

Speaker 0

控制层的第一部分是定制LoRA训练。

Number one of the control layer is custom LoRA training.

Speaker 0

LoRA(低秩适应层)经过演员面部或服装训练,嵌入其特定的视觉数据,确保真实相似度。

A LoRA, a low rank adaptation layer, is trained on an actor's face or costume to embed their specific visual data, ensuring true likeness.

Speaker 0

第二部分,多重控制网络。

Number two, multiple control nets.

Speaker 0

一组控制网络决定场景结构。

A stack of control nets dictates scene structure.

Speaker 0

例如,Open Pose控制角色骨骼。

For example, open pose controls the character's skeleton.

Speaker 0

深度图控制三维布局,线稿保留细节。

A depth map controls three d layout and line art preserves details.

Speaker 0

第三部分,IP适配器面部识别。

Number three, IP adapter face ID.

Speaker 0

在LoRA和ControlNet之后,IP-Adapter FaceID(特别是Plus v2模型)作为最终强化层应用,以锁定面部特征并防止动画过程中的漂移。

After the LoRA and ControlNets, IP-Adapter FaceID, specifically the Plus v2 model, is applied as a final reinforcement layer to lock in facial identity and prevent drift during animation.

Speaker 0

如果这些内容让你感到难以消化,可以回听两期前的图像生成专题节目,并访问播客节目说明页面,那里展示了完整的工作流程,你可以点击链接获取各个方面的更多详细信息。

And, if a lot of this is overwhelming, listen to two episodes back on the image generation episode and go to the podcast show notes page, and you can see this entire workflow laid out, and you can click the links that have more details on these various aspects.

Speaker 0

第四点是AnimateDiff和Motion LoRA技术。

Number four is AnimateDiff and motion LoRAs.

Speaker 0

因此,这套工作流程将使用AnimateDiff模块配合专门训练的Motion LoRA(针对左摇和推拉变焦等精确摄像机运动)来替代文本提示控制,实现确定性的可控电影摄影效果。

So instead of text prompts for motion, this workflow is gonna use the AnimateDiff module with specific motion LoRAs trained on precise camera movements like pan left and dolly zoom for deterministic, controllable cinematography.

Speaker 0

然后在DaVinci Resolve的Fusion页面进行视觉特效合成

And then visual effects compositing in DaVinci Resolve's Fusion page.

Speaker 0

Resolve 20或更高版本在Fusion页面中对多层EXR合成有强大的原生支持

Resolve 20 or later versions have strong native support for multilayer EXR compositing in the Fusion page.

Speaker 0

其基于节点的架构非常适合这种流水线

Its node based architecture is ideal for this kind of a pipeline.

Speaker 0

快速分解一下

So quick breakdown.

Speaker 0

为了完美还原演员形象,我们将使用定制化的LoRA训练。

For perfect actor likeness, we're gonna use custom LoRA training.

Speaker 0

这种方法能在特定演员的面部或服装上进行模型微调,达到商业合同级别的真实相似度。

This fine tunes a model on a specific actor's face or costume, and it achieves true contract level likeness for commercial work.

Speaker 0

为确保身份特征万无一失,我们将采用IP适配器面部识别增强版V2。

For bulletproof identity lock, we're going to use IP-Adapter FaceID Plus v2.

Speaker 0

即IP-Adapter-FaceID Plus v2。

IP-Adapter-FaceID Plus v2.

Speaker 0

该技术能在LoRa处理后逐帧强化面部特征。

This reinforces facial identity on every frame, post-LoRA.

Speaker 0

这是防止动画中身份特征漂移的最终质量检查。

The final quality check to prevent identity drift in animation.

Speaker 0

为实现精确姿态和场景布局,我们将使用多种控制网络如姿势图和深度图。

For precise pose and scene layout, we're going to use multiple control nets like a pose map and depth map.

Speaker 0

这能基于骨骼姿态和三维深度条件来生成画面。

This conditions generation on skeletal pose and three d depth.

Speaker 0

这让艺术家能够精确地指导场景布置。

This allows artists to direct the scene with precision.

Speaker 0

为了实现特定的摄像机运动,我们将使用AnimateDiff和Motion LoRA技术。

For specific camera movement, we're gonna use AnimateDiff and motion LoRAs.

Speaker 0

该技术会应用预设的运动向量来实现平移、轨道移动和变焦效果。

This applies predefined motion vectors for pans, dollies, and zooms.

Speaker 0

这用可控的镜头运动替代了模糊的动作提示。

This replaces vague motion prompts with controllable camera work.

Speaker 0

专业后期制作中,我们会采用多层EXR导出格式。

For professional post production, we'll use a multilayer EXR export.

Speaker 0

这种渲染会将漫反射、高光和遮罩等不同通道合并到一个文件,支持对最终图像进行无损编辑。

This renders separate passes like diffuse, specular, and matte into one file, and this enables nondestructive editing of the final image.

Speaker 0

最后镜头合成阶段,我们将使用达芬奇调色软件的Fusion功能——基于节点的EXR图层合成,将AI素材整合到标准视觉特效流程中。

And then for the final shot assembly, we will use DaVinci Resolve with the fusion feature, which is a node based compositing of EXR layers, and it integrates AI assets into a standard VFX pipeline.

Speaker 0

当然,您也可以选择使用Final Cut Pro或Premiere Pro。

And again, you may prefer Final Cut Pro or Premiere Pro.

Speaker 0

你可以选择自己喜欢的剪辑软件。

You pick your own editing software.

Speaker 0

既然你已经看到这一集了,肯定有你偏爱的软件。

If you're this far in the episode, you definitely have a software of choice.

Speaker 0

你就继续用那个吧。

You go ahead and stick with that.

Speaker 0

好的。

Okay.

Speaker 0

我们开始吧。

Here we go.

Speaker 0

这个工作流程的第一步是在Comfy UI中训练一个角色LoRA模型。

Step one in this workflow is to train a character LoRA in Comfy UI.

Speaker 0

为了达到完全一致的相似度,必须基于目标对象训练一个定制的LoRA模型。

To achieve an identical likeness, a custom Laura must be trained on the subject.

Speaker 0

具体流程是这样的。

The process goes like this.

Speaker 0

第一步,准备数据集。

Step one, prepare the dataset.

Speaker 0

收集15到30张高质量的主题图片,涵盖不同光线、表情和角度。

Gather 15 to 30 high quality images of your subject in various lighting, expressions, and angles.

Speaker 0

这可以是你拍摄的真实人物照片,你需要将所有图片裁剪并调整为统一的正方形分辨率,1024x1024。

This could be a real person that you take photos of, and you'll crop and resize all the images to a consistent square resolution, 1024x1024.
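The crop-then-resize step is simple arithmetic: take the largest centered square, then scale it to 1024x1024. A sketch of the crop-box math in pure Python; the actual cropping and resizing would use an image library such as Pillow, shown in the comments with placeholder filenames:

```python
# Compute the centered square crop for each training image before resizing
# to 1024x1024, as described above. Pure arithmetic here; actual image I/O
# would use a library like Pillow (sketched in comments below).

def square_crop_box(width: int, height: int):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

# With Pillow (assumed installed; filenames are placeholders):
# from PIL import Image
# img = Image.open("actor_01.jpg")
# img.crop(square_crop_box(*img.size)).resize((1024, 1024)).save("train/actor_01.png")
```

Centering the crop keeps faces framed consistently across the 15 to 30 images, which matters more for LoRA quality than raw image count.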

Speaker 0

第二步,自动化标注。

Step two, automate captioning.

Speaker 0

训练图像必须使用不包含触发词的描述进行标注,该触发词将用于激活LoRa。

Training images must be captioned with descriptions that do not include the trigger word you will use to activate the LoRA.

Speaker 0

在Comfy UI中,使用带有BLIP图像标注节点的工作流自动为每张图片生成描述性的TXT文件。

In Comfy UI, use a workflow with a BLIP captioning node to automatically generate a descriptive .txt file for each image.
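Outside of ComfyUI, the same convention is one .txt caption file sitting next to each image. A sketch that writes caption files and enforces the keep-the-trigger-word-out rule described above; the paths and captions are examples:

```python
# Write one .txt caption per image, mirroring what the BLIP captioning
# workflow produces. The rule above -- captions must NOT contain the
# trigger word -- is enforced by stripping it. Paths/captions are examples.
from pathlib import Path

def write_caption(image_path: Path, caption: str, trigger: str) -> Path:
    """Save a caption next to the image, with the trigger token removed."""
    cleaned = " ".join(w for w in caption.split() if w != trigger)
    txt = image_path.with_suffix(".txt")  # actor_01.png -> actor_01.txt
    txt.write_text(cleaned, encoding="utf-8")
    return txt
```

Because the captions describe everything except the subject's identity, the trigger token absorbs the identity during training.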

Speaker 0

第三步,在Comfy UI中训练LoRA。

Step three, train the LoRA in Comfy UI.

Speaker 0

使用专门的LoRA训练工作流,比如GitHub用户Kijai的ComfyUI-FluxTrainer自定义节点包中的示例。

Use a dedicated LoRA training workflow, like the one in the ComfyUI-FluxTrainer custom node pack by Kijai, a user on GitHub who has an example for this.

Speaker 0

配置训练器节点。

Configure the trainer node.

Speaker 0

图像路径需指向图像数据集目录。

The image path will point to the image dataset directory.

Speaker 0

触发词,你需要定义一个独特的标记来激活LoRA,例如character_ohwx。

The trigger word: you define a unique token to activate the LoRA, for example, character_ohwx.

Speaker 0

输出名称,将你的LoRA文件命名为类似my_character_v1这样的格式。

The output name: name your LoRA file something like my_character_v1.

Speaker 0

训练参数设置epochs(即数据集上的训练迭代次数)在50到100次之间,并根据需要调整学习率、网络维度和alpha值。

For the training parameters, set epochs, the number of training passes over the dataset, between 50 and 100, and adjust the learning rate, network dimension, and alpha as needed.

Speaker 0

然后你将执行训练。

And then you'll execute training.

Speaker 0

你将提示加入队列,最终的.safetensors文件会被保存到你的ComfyUI/models/loras文件夹中,随时可用。

You'll queue the prompt, and the final .safetensors file will be saved to your ComfyUI/models/loras folder, ready for use.
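The trainer-node settings from this step can be captured as a plain config dict. Key names vary between trainer node packs, so these are illustrative kohya-style fields rather than an exact schema; the values mirror the ranges discussed above:

```python
# The LoRA trainer settings described above, as a version-controllable
# config dict. Key names vary by trainer node pack; these mirror common
# kohya-style fields and are illustrative, not an exact schema.

lora_training_config = {
    "image_path": "dataset/train",      # captioned 1024x1024 image folder
    "trigger_word": "character_ohwx",   # unique activation token
    "output_name": "my_character_v1",   # -> my_character_v1.safetensors
    "epochs": 80,                       # 50-100 passes over the dataset
    "learning_rate": 1e-4,              # a common LoRA starting point
    "network_dim": 32,                  # LoRA rank
    "network_alpha": 16,                # scaling factor, often dim / 2
}
```

Keeping this dict in the repo alongside the dataset makes each trained LoRA reproducible.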

Speaker 0

所以你正在用自定义数据集训练神经网络。

So you're training a neural network with a custom dataset.

Speaker 0

由于时间有限,我会快速带你们过一遍,让你们了解整个流程是如何串联起来的。

Because I have limited time, I'm just gonna speed run you through all this so you get a sense of how everything strings together.

Speaker 0

但最终,你们还是需要观看详细的分步视频教程。

But ultimately, you'll wanna watch a video tutorial on the real step by step.

Speaker 0

第二步是构建Comfy UI视频处理管线。

Step two is build the Comfy UI video pipeline.

Speaker 0

这是核心生成流程,采用逻辑节点图的形式构建,每个控制系统都接入下一个环节。

This is the core generation process structured as a logical node graph where each control system feeds into the next.

Speaker 0

以下是节点链的逐步说明。

Here's the node chain walk through.

Speaker 0

第一步,加载器。

One, loaders.

Speaker 0

首先加载所有资源。

Load all the assets first.

Speaker 0

你将使用负责检查点加载的节点。

You'll have nodes that do checkpoint loading.

Speaker 0

它们加载基础模型,如flux1-dev-fp8.safetensors。

They load the base model, like flux1-dev-fp8.safetensors.

Speaker 0

你会有一个用于LoRA的加载节点。

You'll have a loader node for the LoRA.

Speaker 0

这会加载你的自定义LoRA,即my_character_v1.safetensors,然后你会有一个用于编码提示词的节点,即CLIP文本编码器。

This will load your custom LoRA, my_character_v1.safetensors, and then you'll have a node for encoding the prompt: the CLIP text encoder.

Speaker 0

我们将在关于扩散模型的MLG剧集中讨论CLIP。

We'll talk about CLIP in the MLG episode on diffusion models.

Speaker 0

创建两个节点,一个用于正面提示词(必须包含你的触发词)。

Create two nodes, one for the positive prompt, which must include your trigger word.

Speaker 0

例如,character_ohwx的照片,另一个用于负面提示词。

For example, photograph of character_ohwx, and one for the negative prompt.

Speaker 0

第二,ControlNet堆栈。

Two, the ControlNet stack.

Speaker 0

你需要将多个应用ControlNet节点串联起来以定义场景的物理结构。

You'll chain multiple Apply ControlNet nodes together to define the scene's physical structure.

Speaker 0

提示节点的条件输出被输入到第一个应用ControlNet节点中。

The conditioning output from the prompt node is fed into the first apply ControlNet node.

Speaker 0

该节点通过加载的ControlNet文件和如DW预处理器等预处理工具,以参考姿势图像为条件进行调节。

This node is conditioned on a reference pose image using a loaded ControlNet file and a preprocessor like DW preprocessor.

Speaker 0

第一个节点的条件输出随后被导入第二个应用ControlNet节点,该节点可以以深度图为条件。

The conditioning output of this first node is then piped into a second Apply ControlNet node, which could be conditioned on a depth map.

Speaker 0

这种层级结构形成了一个高度精确的物理引导框架。

This layering creates a highly specific structural guide.

Speaker 0

第三,IP适配器面部识别。

Three, IP adapter face ID.

Speaker 0

将ControlNet堆栈的最终调节输出接入IP适配器面部识别节点。

Pipe the final conditioning output from the ControlNet stack into an IP adapter face ID node.

Speaker 0

该节点还需一张演员面部的清晰参考照片,并使用IP-Adapter FaceID Plus v2模型在动画前锁定面部特征。

This node also takes a clean reference photo of the actor's face and uses the IP-Adapter FaceID Plus v2 model to lock the facial identity before animation.

Speaker 0

第四,AnimateDiff。

Four, AnimateDiff.

Speaker 0

将完全条件化的模型传递给AnimateDiff加载节点。

Pass the fully conditioned model to an AnimateDiff loader node.

Speaker 0

加载一个基础运动模型检查点文件和一个特定的Motion LoRA检查点文件,以应用可控的相机运动。

Load a base motion model checkpoint file and a specific motion LoRA checkpoint file to apply controllable camera motion.

Speaker 0

五、K采样器。

Five, k sampler.

Speaker 0

将最终条件化的模型提示和一个空白的潜在图像节点输入K采样器,生成图像序列。

Feed the final conditioned model prompts and an empty latent image node into the k sampler to generate the image sequence.

Speaker 0

六、变分自编码器解码与输出。

Six, variational auto encoder decode and output.

Speaker 0

将K采样器输出的潜在图像传递给变分自编码器解码节点,将其转换为像素空间,并将最终图像输出传输至EXR保存节点。

Pass the output latent from the k sampler to a variational auto encoder decode node to convert it into pixel space and pipe the final image output to the EXR saving nodes.

Speaker 0

工作流程的第三步是为视觉效果导出多层EXR序列。

Step three of the workflow is export multilayer EXR sequences for visual effects.

Speaker 0

将输出渲染为一系列Open EXR文件,包含用于后期制作的多层数据。

Render the output as a sequence of OpenEXR files containing multiple layers of data for post production.

Speaker 0

流程第一步:分离渲染通道。

The process is step one, isolate render passes.

Speaker 0

为了便于合成元素分离,可以通过多次微调生成(例如使用绿幕背景渲染一次以创建角色遮罩),或使用自定义节点将生成拆分为漫反射、高光和遮罩等通道,整合到一个工作流程中。

To separate elements for compositing, either run the generation multiple times with slight changes, for example, rendering once with a green screen background to create a character matte, or use custom nodes to split the generation into passes like diffuse, specular, and mask in one workflow.

Speaker 0

第二步是使用专用的EXR保存节点,将变分自编码器解码节点的最终图像输出接入类似Comfy UI HQ图像保存节点包中的save EXR保存节点。

Step two is use a dedicated save EXR node: pipe the final image output from the variational auto encoder decode node into a saving node like Save EXR from the Comfy UI HQ image save node pack.

Speaker 0

您需要指定文件路径、设置图像序列路径、选择压缩方式(无损压缩选项包括PIZ或ZIP)。

You'll specify a file path, set an image sequence path, and for compression, choose a lossless option, PIZ or ZIP.

Speaker 0

SRGB转线性需设为真,专业视觉特效流程采用线性色彩空间以确保正确的光照计算,位深设为32位浮点以保留最大色彩与亮度数据。

SRGB-to-linear you'll set to true, since professional VFX pipelines use a linear color space for correct lighting math, and the bit depth you'll set to 32-bit float to preserve maximum color and luminance data.

Speaker 0

这就是HDR(高动态范围)。

That's HDR.
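The sRGB-to-linear conversion that exporter setting performs is standard, well-defined math, worth seeing explicitly since it explains why linear EXRs look "dark" without a display transform:

```python
# The standard sRGB -> linear transfer function the EXR exporter applies
# when "SRGB to linear" is enabled. Operates per channel on values in [0, 1].

def srgb_to_linear(c: float) -> float:
    """Invert the sRGB gamma curve so lighting math happens in linear space."""
    if c <= 0.04045:
        return c / 12.92                      # linear toe segment
    return ((c + 0.055) / 1.055) ** 2.4       # power-law segment

# Mid-gray sRGB 0.5 lands near 0.214 in linear space, which is why linear
# files look dark when viewed without a display transform.
```

Compositing (adds, blurs, glows) on linear values matches how light actually sums; doing the same math on gamma-encoded sRGB produces wrong edges and halos.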

Speaker 0

接着工作流程的第四步是将镜头合成到达芬奇Resolve Fusion中。

And then step four of the workflow is to composite the shot into DaVinci Resolve Fusion.

Speaker 0

将AI生成的EXR序列组合成最终镜头。

Assemble the AI generated EXR sequence into a final shot.

Speaker 0

第一步是在Resolve的Fusion页面中,通过loader节点或media in节点导入EXR序列。

The process is step one, import the EXR sequence: in Resolve's Fusion page, add a Loader or a MediaIn node and import the EXR sequence.

Speaker 0

第二步是使用Resolve 20及以上版本访问图层。

Step two is access layers with Resolve 20 and above.

Speaker 0

单个media in节点可以包含所有渲染通道。

A single media in node can contain all render passes.

Speaker 0

将工具(例如调色节点)连接到media in节点,然后使用检查器的下拉菜单选择该工具应影响的图层。

Connect a tool, for example a color corrector, to the MediaIn node, then use the inspector's drop-down menu to select which layer, diffuse, specular, or matte, the tool should affect.

Speaker 0

第三步是用节点重新组装镜头。

Step three is reassemble the shot with nodes.

Speaker 0

使用通道布尔节点提取角色遮罩。

Use a channel booleans node to extract the character matte.

Speaker 0

将漫反射通道接入独立的调色节点调整颜色,高光通道接入另一个调色节点微调高光,然后用设置为加法模式的合并节点将它们组合,最后将提取的角色遮罩用作其他效果的遮罩,比如仅对角色施加辉光。

Pipe the diffuse pass into its own color corrector node to adjust colors, and pipe the specular pass into a separate color corrector to tweak highlights. Combine them with a merge node set to plus or add mode, and use the extracted character matte as a mask for other effects, like applying a glow only to the character.
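Per pixel, that Fusion node tree reduces to simple arithmetic: the add-mode merge sums the graded passes, and the matte gates any extra effect. A numpy sketch of the same math, with a hypothetical glow effect standing in for "effects masked by the matte":

```python
# What the Fusion tree above computes per pixel: graded diffuse and
# specular passes are summed (merge in plus/add mode), and the character
# matte gates an extra effect (a hypothetical glow here). Arrays are
# H x W x 3 images in linear space; the matte is H x W x 1.
import numpy as np

def recombine(diffuse, specular, matte, glow_strength=0.0):
    """Add diffuse + specular, then apply a glow only where the matte is set."""
    base = diffuse + specular                 # merge node in add mode
    glow = glow_strength * base * matte       # effect masked by the matte
    return base + glow
```

Because everything is linear-space addition, the recombined frame matches what a single beauty pass would have contained, while each term stays independently gradable.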

Speaker 0

第四步是最终整合。

And step four is final integration.

Speaker 0

将重新组装好的AI角色合并到背景板上。

Merge the reassembled AI character over a background plate.

Speaker 0

添加最终修饰,如摄像机抖动、镜头畸变,以及整个合成的统一色彩分级。

Add final touches like camera shake, lens distortion, and a unifying color grade to the entire composite.

Speaker 0

将最终节点连接到媒体输出节点,将完成的镜头发送到Resolve时间线。这一切提供了完整的创意控制,并将生成资产整合到标准的专业视觉特效流程中。

Connect the final nodes to the MediaOut node to send the finished shot to the Resolve timeline, and all this provides full creative control and integrates generative assets into a standard professional VFX pipeline.

Speaker 0

我为快速过了一遍工作流程三而道歉。

And I apologize for sort of speed run blasting through workflow three.

Speaker 0

这确实不太适合以播客形式呈现。

It's not really something you can do in podcast form.

Speaker 0

所以你需要查看节目说明中的详细步骤,并通过链接观看实际串联这些内容的视频教程。

So you'll wanna look at the detailed steps on the show notes and click through the links to find some video tutorials on actually stringing these things together.

Speaker 0

但核心在于学习Comfy UI及其各种高级节点:为你场景中需要的角色训练自定义LoRA,使用ControlNet姿势将它们摆出不同位置,并用AnimateDiff来动画化或插值关键帧。

But the name of the game is learning Comfy UI and its various advanced nodes: training your own custom LoRA on the characters you want in your scenes, ControlNet poses to put them into various positions, and AnimateDiff to animate or interpolate the key frames.

Speaker 0

最后使用Open EXR格式,以便在视频编辑套件中单独编辑最终输出的不同部分。

And then finally, OpenEXRs, so that you can edit different parts of the final output in isolation within your video editing suite.

Speaker 0

以上就是这些工作流程。

So those are the workflows.

Speaker 0

如果你喜欢这个播客系列,现在就是通过分享这一集来帮助我的最佳时机。

And and if you like this podcast series, now would be the time to help me by sharing this episode.

Speaker 0

我认为由于当前AI视频生成的热度,这一集最具病毒式传播潜力。

I think this episode has the most virality potential due to the popularity of AI video generation right now.

Speaker 0

人们正在寻找Veo 3的一致性技术,他们正在用ChatGPT想出各种古怪的解决方案。

And people are looking for consistency techniques with Veo 3, and they're coming up with all sorts of oddball solutions using ChatGPT.

Speaker 0

所以如果你能分享这一集,我会非常感激。

So I could use the help by you sharing this episode out.

Speaker 0

如果要选一集最能提升播客系列流量的节目,我认为就是这一集。

I think if ever there was a specific episode that would boost traffic to the podcast series, it would be this one.

Speaker 0

感谢收听,我们下期再见——届时我将讲解这些技术的底层原理,比如LoRa控制网络、扩散模型、变分自编码器等你在工作流程三中听到的所有术语。

Thanks for listening, and I'll see you in the next episode, where I will describe the technical underpinnings of these things, like LoRAs, ControlNets, diffusion models, variational auto encoders, all of the words that you heard in workflow three.

Speaker 0

等我发布那期后,你可能需要先听那期,然后再回来重温工作流程三的内容。

Once I release that, you may wanna listen to that episode and then come back and then listen to workflow three again.

Speaker 0

此外,我也接受兼职工作。

Also, I'm available for part time work.

Speaker 0

如果你需要顾问或承包商,请访问ocdevel.com查看我的专业经历和联系方式。

So if you need a consultant or a contractor, go to ocdevel.com for my professional experience and contact information.

Speaker 0

让我们携手合作。

Let's work together.

Speaker 0

您正在收听《机器学习指南》,我是主持人泰勒·雷内利。

You've been listening to Machine Learning Guide, and I'm your host, Tyler Renelle.

Speaker 0

MLG教授机器学习与人工智能的基础知识。

MLG teaches the fundamentals of machine learning and artificial intelligence.

Speaker 0

内容涵盖直觉理解、模型、数学、编程语言、框架等方方面面。

It covers intuition, models, math, languages, frameworks, and more.

Speaker 0

这是从入门到精通的机器学习原理音频综述。

It's a start to finish audio overview of machine learning principles.

Speaker 0

对于需要更深入探讨或更适合可视化学习的主题,我在ocdevel.com/mlg上提供了相关资源大纲。

Then for topics needing more depth or which are best learned visually, I provide a syllabus of resources at ocdevel.com/mlg.

Speaker 0

网址是ocdevel.com。

That's ocdevel.com.

Speaker 0

你会注意到每集标题都以MLG或MLA开头。

You'll notice that episode titles start with either MLG or MLA.

Speaker 0

G代表指南(Guide),教授理论和基础知识;A代表应用(Applied),教授行业实践工具。

G for guide teaches theory and fundamentals, where a for applied teaches industry and practical tools.

Speaker 0

你可以根据目标选择收听其中一类或两类内容。

You can listen to either or both depending on your goals.

Speaker 0

如果想支持节目,请在播客应用(如iTunes、Spotify或YouTube)点赞订阅。

If you wanna support the show, like and subscribe in your podcatcher, whether iTunes, Spotify, or YouTube.
