PV-VAE
Technical Report, 2026

Video Generation with Predictive Latents

Predictive Video VAE (PV-VAE) improves latent video generation by training the tokenizer to reconstruct observed frames and predict withheld future frames.

Yian Zhao1,2, Feng Wang1,†, Qiushan Guo1, Chang Liu3, Xiangyang Ji3, Jian Zhang2, Jie Chen2
1ByteDance Seed   2Peking University   3Tsinghua University   †Project lead
Figure 1. Overview. PV-VAE achieves faster convergence and better generation metrics, while latent probing and visualizations indicate stronger motion-aware representations.
Motivation

Reconstruction quality alone does not guarantee generation quality.

PV-VAE studies how to make video VAE latents more suitable for diffusion, not merely more faithful for pixel reconstruction.

Modern video generation models rely on video VAEs to compress visual data into compact spatiotemporal latents. However, better reconstruction metrics do not necessarily translate into better latent diffusion training.

PV-VAE introduces a predictive reconstruction objective: randomly discard future frames, encode only the observed past, pad the latent sequence, and decode the complete video. The decoder therefore learns to reconstruct what was seen and infer what happens next.
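A minimal PyTorch sketch of this training step, assuming a causal video VAE with encode/decode methods; the interface names, the drop ratio, and the Gaussian padding choice here are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def predictive_step(vae, video, drop_ratio=0.25, temporal_stride=4):
    """video: (B, C, T, H, W); assumes T is divisible by temporal_stride."""
    B, C, T, H, W = video.shape
    n_latent = T // temporal_stride              # latent frames after temporal compression
    n_drop = max(1, int(n_latent * drop_ratio))  # withheld latent frames at the end
    n_keep = n_latent - n_drop

    # Encode only the observed (past) frames.
    observed = video[:, :, : n_keep * temporal_stride]
    z = vae.encode(observed)                     # (B, c, n_keep, h, w)

    # Pad the withheld latent positions (Gaussian padding as one option).
    pad = torch.randn(B, z.shape[1], n_drop, *z.shape[-2:], device=z.device)
    z_full = torch.cat([z, pad], dim=2)          # (B, c, n_latent, h, w)

    # Decode the complete video: reconstruct what was seen, predict the rest.
    recon = vae.decode(z_full)                   # (B, C, T, H, W)
    return F.mse_loss(recon, video)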

This simple objective encourages latents to encode temporal dynamics, motion priors, and smoother video trajectories, leading to stronger generation and better downstream video understanding features.

Framework

Partial-to-complete video reconstruction.

The training pipeline remains close to standard video VAE training, but the encoder is restricted to observed frames while the decoder reconstructs the full sequence.

Figure 2. PV-VAE randomly discards future frames, encodes observed frames, pads latents, and decodes the complete video with reconstructive and predictive supervision.

Training objective

The loss combines MSE, a temporal-difference reconstruction term, LPIPS, a GAN loss, and KL regularization, without introducing a complex auxiliary pipeline.
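As a rough illustration, the composite loss might be assembled as below; the loss weights are placeholders rather than reported values, lpips_fn can be an off-the-shelf perceptual metric such as lpips.LPIPS(net="vgg"), and disc is any video discriminator.

import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, disc, lpips_fn,
             w_lpips=0.1, w_gan=0.05, w_kl=1e-6):
    # Pixel reconstruction.
    l_mse = F.mse_loss(recon, target)
    # Temporal-difference reconstruction: match frame-to-frame changes.
    l_td = F.mse_loss(recon[:, :, 1:] - recon[:, :, :-1],
                      target[:, :, 1:] - target[:, :, :-1])
    # Perceptual loss, computed frame-wise on (B*T, C, H, W).
    b, c, t, h, w = recon.shape
    l_lpips = lpips_fn(recon.transpose(1, 2).reshape(-1, c, h, w),
                       target.transpose(1, 2).reshape(-1, c, h, w)).mean()
    # Hinge-style generator loss against the discriminator.
    l_gan = -disc(recon).mean()
    # KL regularization toward the standard normal prior.
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l_mse + l_td + w_lpips * l_lpips + w_gan * l_gan + w_kl * l_kl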

Model design

The reported model uses 3D causal convolutions, 16× spatial compression, 4× temporal compression, and 64 latent channels.
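A generic sketch of a 3D causal convolution, which pads only toward the temporal past so the output at frame t never depends on frames after t; this is the standard construction, not necessarily the paper's exact block.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.t_pad = kernel - 1                    # all temporal padding goes in front
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):                          # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.t_pad, 0))  # pad T dimension on the past side only
        return self.conv(x)

Under the reported 16× spatial and 4× temporal compression with 64 latent channels, a 17×256×256 clip maps to roughly a 5×16×16 grid of 64-channel latents.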

Decoder fine-tuning

After predictive training, the encoder is frozen and the decoder is fine-tuned for standard reconstruction to reduce the training-inference gap.
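A sketch of this stage, assuming the model exposes encoder/decoder submodules (all names illustrative):

import torch
import torch.nn.functional as F

def finetune_decoder(vae, loader, steps=10_000, lr=1e-4):
    for p in vae.encoder.parameters():
        p.requires_grad_(False)                    # latents stay fixed
    opt = torch.optim.AdamW(vae.decoder.parameters(), lr=lr)
    for step, video in zip(range(steps), loader):
        with torch.no_grad():
            z = vae.encode(video)                  # full sequence, no frame dropping
        recon = vae.decode(z)
        loss = F.mse_loss(recon, video)
        opt.zero_grad(); loss.backward(); opt.step()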

Experimental Results

Direct comparison with existing video VAEs.

Two settings are reported, generation and reconstruction; each method row is followed by a short interpretation.

Generation performance. 17-frame 256×256 videos. Lower FVD/KVD is better; higher IS is better.

Method Latent config UCF101 FVD ↓ UCF101 KVD ↓ UCF101 IS ↑ RE10K FVD ↓ RE10K KVD ↓ Train speed ↑ Train mem ↓ Params
CogX-VAE t4s8c16 176.90 16.47 64.19 94.12 10.41 0.76 85.93 216M
A strong public video VAE baseline, but the low-compression latent is expensive for training the downstream generator.
IV-VAE t4s8c16 175.74 22.32 64.51 92.37 8.35 1.28 88.34 242M
Improves some unconditional generation metrics, but still trails PV-VAE on the main FVD comparisons.
WF-VAE-L t4s8c16 188.19 33.01 67.49 107.26 12.56 2.52 87.36 317M
Wavelet-driven design is efficient, yet its generation quality is weaker in the reported Latte setup.
Hunyuan-VAE t4s8c16 210.30 52.81 66.40 83.45 13.23 1.64 87.36 246M
Reconstruction-oriented quality does not directly translate into stronger generation in this comparison.
Wan2.1 VAE t4s8c16 167.10 11.54 66.04 83.84 10.64 1.88 86.44 127M
A compact and competitive baseline; PV-VAE still improves the main FVD and IS metrics.
Wan2.2 VAE t4s16c48 180.79 17.80 67.32 87.15 10.11 4.96 30.90 705M
PV-VAE improves UCF101 FVD by 34.42 while keeping comparable high-compression efficiency.
SSVAE t4s16c48 168.68 19.71 66.39 79.08 8.79 3.92 34.00 315M
Spectral constraints help diffusability, while predictive reconstruction gives a stronger overall generation result.
PV-VAE t4s16c64 146.37 14.52 69.72 72.50 4.06 4.40 33.34 661M
PV-VAE achieves the best overall generation quality, supporting the paper's claim that predictive latent structure improves downstream diffusion training.
  • 146.37: best UCF101 FVD among compared methods.
  • 72.50: best RealEstate10K FVD in unconditional generation.
  • 69.72: best UCF101 Inception Score.

Reconstruction performance. Kinetics-400 validation videos at 17×256×256 and 17×512×512. PV-VAE remains competitive while prioritizing generation-ready latents.

Method Config family 256 rFVD ↓ 256 PSNR ↑ 256 SSIM ↑ 256 LPIPS ↓ 512 rFVD ↓ 512 PSNR ↑ 512 SSIM ↑ 512 LPIPS ↓ Infer speed ↑ Infer mem ↓
CogX-VAE t4s8c16 4.90 33.78 0.97 0.027 1.79 36.00 0.99 0.024 0.46 13.64
Strong high-resolution SSIM, but with higher inference memory than PV-VAE.
IV-VAE t4s8c16 2.78 34.08 0.96 0.019 0.97 37.24 0.96 0.016 0.32 5.39
Excellent reconstruction metrics under lower compression, but not the strongest generation result.
WF-VAE-L t4s8c16 3.06 33.48 0.96 0.023 1.08 35.93 0.96 0.023 0.87 5.00
Fast and memory-efficient, with weaker generation quality in Table 1.
Hunyuan-VAE t4s8c16 2.96 34.30 0.97 0.016 0.90 37.13 0.97 0.015 0.50 22.00
Very strong reconstruction, but the paper shows that reconstruction strength alone does not ensure generation strength.
Wan2.1 VAE t4s8c16 2.92 33.21 0.95 0.018 1.02 36.15 0.97 0.017 0.60 6.77
A compact baseline with solid reconstruction and generation, but less temporal predictive structure.
Wan2.2 VAE t4s16c48 3.42 33.78 0.96 0.015 1.22 36.75 0.97 0.015 0.58 9.36
PV-VAE is slightly behind on some reconstruction metrics but substantially ahead on generation FVD.
SSVAE t4s16c48 7.50 31.18 0.96 0.036 2.16 34.45 0.97 0.028 0.64 7.63
Spectral regularization improves diffusability but sacrifices reconstruction more noticeably.
PV-VAE t4s16c64 3.45 32.26 0.95 0.020 1.88 35.03 0.97 0.020 0.69 7.97
PV-VAE keeps reconstruction competitive after decoder fine-tuning, while the generation table shows the main advantage of predictive latents.
  • 3.45: 256-resolution rFVD after decoder fine-tuning.
  • 0.69: inference speed, faster than Wan2.2 VAE in the reported setting.
  • 7.97: inference memory, lower than Wan2.2 VAE and Hunyuan-VAE.
Analysis

Predictive reconstruction makes latents more motion-aware.

The paper supports this with several forms of visual evidence: PCA/optical-flow alignment, correlation and scaling curves, temporal coherence, prediction validation, and downstream probing.

Latent PCA and optical flow. PV-VAE activations align more clearly with moving foreground regions and underlying video dynamics.
Predictive accuracy, scalability, and temporal coherence. Better prediction correlates with stronger generation, and predictive reconstruction scales better than pure reconstruction.
Frame prediction validation. PV-VAE generates plausible withheld future frames conditioned on observed frames.
Diffusion-feature probing. Predictive reconstruction improves optical flow estimation, next-frame prediction, and point tracking.

Motion structure

PCA visualizations show that PV-VAE allocates more representational capacity to dynamic foregrounds.
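One way to reproduce this kind of visualization is to project each latent position onto the top three principal components of the latent channels and render the result as RGB; the recipe below is generic, not the paper's exact procedure.

import torch

def latent_pca_rgb(z):
    """z: (c, t, h, w) latent for one clip; returns (t, h, w, 3) in [0, 1]."""
    c, t, h, w = z.shape
    flat = z.permute(1, 2, 3, 0).reshape(-1, c)    # one row per latent position
    flat = flat - flat.mean(dim=0, keepdim=True)
    # Top-3 principal directions via low-rank PCA.
    _, _, v = torch.pca_lowrank(flat, q=3)
    proj = flat @ v                                # (t*h*w, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(t, h, w, 3)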

Temporal trajectory

Latent temporal distance becomes smoother for adjacent frames and more monotonic across longer intervals.
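A minimal probe for this property measures the mean latent L2 distance as a function of frame gap; the helper below assumes a single clip's latent tensor.

import torch

def temporal_distance_curve(z, max_gap=8):
    """z: (c, t, h, w); returns mean latent distance for each gap 1..max_gap."""
    c, t, h, w = z.shape
    feats = z.permute(1, 0, 2, 3).reshape(t, -1)   # one vector per latent frame
    return torch.stack([
        (feats[g:] - feats[:-g]).norm(dim=1).mean() for g in range(1, max_gap + 1)
    ])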

Downstream features

LVDM features trained on PV-VAE latents improve across optical flow, next-frame prediction, and tracking probes.

Implementation Notes

Code is not open-sourced, but the design is intentionally simple.

Due to confidentiality constraints, we cannot release the production code. The core insight and implementation route, however, are lightweight enough to be reproduced and extended in existing video VAE pipelines.

  • Prediction schedule: vary the dropping ratio, temporal grouping, suffix prediction, or frame infilling strategy.
  • Padding strategy: compare learnable tokens, Gaussian padding, or condition-aware padding for withheld latent groups; see the sketch after this list.
  • Architecture: explore CNN, Transformer, or hybrid video VAEs under the same predictive reconstruction objective.
  • Training recipe: preserve latent diffusability by freezing the encoder during decoder fine-tuning.
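For the padding strategies in particular, the sketch below makes the three options interchangeable; the condition-aware variant shown here (repeating the last observed latent frame) is one simple interpretation, not the paper's design.

import torch
import torch.nn as nn

class LatentPadding(nn.Module):
    def __init__(self, latent_ch, mode="learnable"):
        super().__init__()
        self.mode = mode
        # Single learnable token, broadcast across all withheld positions.
        self.token = nn.Parameter(torch.zeros(1, latent_ch, 1, 1, 1))

    def forward(self, z_observed, n_drop):
        """z_observed: (B, c, n_keep, h, w); appends n_drop padded latent frames."""
        b, c, _, h, w = z_observed.shape
        if self.mode == "learnable":
            pad = self.token.expand(b, c, n_drop, h, w)
        elif self.mode == "gaussian":
            pad = torch.randn(b, c, n_drop, h, w, device=z_observed.device)
        else:  # condition-aware: repeat the last observed latent frame
            pad = z_observed[:, :, -1:].expand(b, c, n_drop, h, w)
        return torch.cat([z_observed, pad], dim=2)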
Citation

BibTeX

@article{zhao2026pvvae,
  title   = {Video Generation with Predictive Latents},
  author  = {Zhao, Yian and Wang, Feng and Guo, Qiushan and Liu, Chang and Ji, Xiangyang and Zhang, Jian and Chen, Jie},
  year    = {2026},
  note    = {Technical report}
}