Introducing V2:
The Image-First Pipeline
Preview every scene. Edit any prompt. Get consistent characters. The biggest upgrade to PeakMV since launch.
V1 asked you to trust the AI blindly. You'd hit generate, wait minutes, and hope the output matched your vision.
V2 flips that. See every frame before a single second of video renders.
What was wrong with V1?
V1 was a straight shot: your audio went in, text-to-video prompts were generated, and the AI rendered clips directly. It worked, but it had three painful limitations:
1. No preview. You couldn't see what the video would look like until it was fully rendered. If you didn't like scene 7, you had to regenerate the entire video.
2. Inconsistent characters. The same "person" looked different in every scene. Brown hair in scene 1, blonde in scene 3, different face entirely in scene 5.
3. No lip sync. Characters on screen couldn't move their lips to your vocals. It felt disconnected.
How V2 works: Image-First
Instead of generating video directly from text, V2 splits the process into two stages. First, the AI generates a still image for each scene — a cinematic keyframe that captures the exact composition, lighting, and character. Then, it animates that frozen frame into motion.
This means you get to see and approve every single scene before any expensive video rendering starts. Don't like the lighting in scene 4? Regenerate just that one image. Want to tweak the camera movement? Edit the motion prompt without touching the visual.
The V2 pipeline
1. Upload & trim your audio: pick your segment, choose 720p or 1080p.
2. AI analyzes audio + generates scene images: genre detection, lyrics extraction, keyframe generation.
3. Preview & edit every scene: regenerate images, tweak prompts, perfect your vision.
4. Render final video: images are animated into motion, lip sync is applied, and the clips are stitched together.
What's new in V2
Scene Preview Board
Every scene generates a still image first. Browse them all in a visual grid, grouped by location. Approve what you love, regenerate what you don't. No more blind rendering.
Per-Scene Regeneration
Hate scene 5 but love the rest? Regenerate just that one. Edit the image prompt, tweak the motion direction, or swap the location entirely. Each scene is independent.
Consistent Main Character
Upload your photo or let the AI generate a fictional star. Either way, the same character appears consistently across every scene. No more identity shifts between cuts.
Lip Sync
Toggle lip sync on and your character's mouth moves to the actual vocals. It's the detail that makes an AI video feel like a real music video instead of a slideshow.
Smart Scene Deduplication
The AI groups visually similar scenes and reuses keyframes with different motion. A 60-second video with 12 scenes might only need 5-6 unique images. Lower cost, faster generation, tighter visual coherence.
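The grouping idea can be shown with a toy sketch. This is illustrative only: here scenes are grouped by a `setting` key for simplicity, whereas the real pipeline clusters by visual similarity.

```python
from collections import OrderedDict

scenes = [
    {"id": 1, "setting": "rooftop at dusk", "motion": "slow push-in"},
    {"id": 2, "setting": "neon alley",      "motion": "handheld pan"},
    {"id": 3, "setting": "rooftop at dusk", "motion": "crane up"},
    {"id": 4, "setting": "neon alley",      "motion": "whip pan"},
    {"id": 5, "setting": "stage close-up",  "motion": "slow zoom"},
]

def group_by_setting(scenes):
    # Scenes sharing a setting reuse one keyframe; each scene
    # still keeps its own motion prompt for the video stage.
    groups = OrderedDict()
    for s in scenes:
        groups.setdefault(s["setting"], []).append(s)
    return groups
```

Here five scenes collapse to three unique keyframes, which is where the cost and speed savings come from: you pay for one image per group, not one per scene.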
V1 vs V2 at a glance
| Feature | V1 Classic | V2 Image-First |
|---|---|---|
| Pipeline | Text-to-Video | Image-to-Video |
| Scene Preview | None | Full preview board |
| Edit Individual Scenes | No | Yes |
| Consistent Character | No | Yes (upload or AI) |
| Lip Sync | No | Yes |
| Quality Options | 720p / 1080p | 720p / 1080p |
| Prompt Control | Single prompt | Image + motion per scene |
| Smart Dedup | No | Yes (~50% fewer images) |
| Wizard Steps | 5 steps | 3 steps |
A simpler wizard
V1 had five steps. V2 has three. We combined upload + settings into one step, merged concept generation with the scene preview, and streamlined checkout into a single render confirmation.
The main character is always present — either upload your face or let the AI design one. No more choosing "No Character." Every video deserves a star.
Lyrics are detected automatically from your audio. No need to paste them manually. The AI extracts the vocals, transcribes them in any language, and uses the result to time each scene to your lyrics.
Under the hood
V2 isn't just a UI refresh. The entire generation backend was rebuilt:
- Dual prompt system — separate image prompts (frozen keyframe composition) and video prompts (motion directives). Each optimized for its model.
- Genre-aware prompting — 8 genre-specific visual vocabularies. Hip-hop gets low-angle power shots. Classical gets slow crane movements. The AI matches the visual language to your sound.
- Face injection pipeline — Nano Banana model composites your face into scene images naturally, not as a crude paste but as a contextual blend that respects lighting and pose.
- Main character system — a detailed physical description is generated once and embedded verbatim into every scene prompt. Same person, every frame.
- LTX-2 for 1080p — new model with 6-second clips at $0.04/second, replacing Kling v2.5. Faster renders, lower cost.
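Two of the ideas above fit in a short sketch: the dual prompt system with the character description embedded verbatim, and the LTX-2 clip cost. The function and constant names are hypothetical; only the $0.04/second rate and 6-second clip length come from the post.

```python
# Generated once per project, then spliced verbatim into every scene's
# image prompt (this example description is invented for illustration).
CHARACTER_DESC = ("woman in her late 20s, shoulder-length auburn hair, "
                  "silver bomber jacket")

def build_scene_prompts(scene_desc: str, motion: str) -> dict:
    # Dual prompt system: one prompt for the frozen keyframe,
    # a separate one for the motion directive.
    return {
        "image_prompt": f"{scene_desc}, featuring {CHARACTER_DESC}",
        "video_prompt": motion,
    }

# LTX-2 cost math: 6-second clips at $0.04/second.
COST_PER_SECOND = 0.04
CLIP_SECONDS = 6
clip_cost = CLIP_SECONDS * COST_PER_SECOND   # $0.24 per clip
```

Embedding the same description string into every prompt, rather than re-describing the character each time, is what keeps the face and wardrobe stable from cut to cut.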
Try V2 now
See your scenes before they render. Upload a track and experience the difference.
Create with V2