AI News Hub · by Eigenvector

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

HuggingFace Papers · March 26, 2026 · 8 min read

PackForcing enables efficient long-video generation through hierarchical KV-cache management and spatiotemporal compression while maintaining temporal consistency and reducing memory usage.

Abstract


Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
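The abstract's three pieces of cache bookkeeping — the sink / mid / recent partition, dynamic top-k selection over the compressed mid tokens, and re-aligning positions after dropping tokens — can be sketched as plain index management. This is a minimal illustrative sketch reconstructed from the abstract alone; the function names, partition sizes, and relevance scores are hypothetical, not the paper's actual implementation (see the linked repository for that):

```python
def partition_cache(num_frames, sink=2, recent=4):
    """Split frame indices into sink / mid / recent partitions.

    Sink frames are early anchors kept at full resolution; recent frames
    stay at full resolution for local coherence; everything in between
    becomes the compressed mid partition.
    """
    frames = list(range(num_frames))
    mid_end = max(sink, num_frames - recent)
    return frames[:sink], frames[sink:mid_end], frames[mid_end:]

def select_mid_topk(mid_frames, scores, k):
    """Dynamic top-k: keep only the k most relevant mid frames,
    bounding cache size regardless of video length."""
    ranked = sorted(mid_frames, key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:k])  # restore temporal order after selection

def remap_positions(kept_frames):
    """Continuous re-indexing in the spirit of the Temporal RoPE
    Adjustment: close the position gaps left by dropped frames."""
    return {f: i for i, f in enumerate(kept_frames)}
```

After selection, `remap_positions` assigns the surviving frames contiguous positions, so rotary embeddings see no gaps where mid tokens were evicted — which is how a bounded cache can coexist with position encodings trained on short, gap-free clips.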


Get this paper in your agent:

hf papers read 2603.25730

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

