Video and Image Compression

3 papers

No. 1

Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu, Zehong Chen, Lijian Xu

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

TLDR: This survey reviews recent advances in multimodal computational pathology, covering whole-slide image processing, self-supervised learning, multimodal data generation, parameter-efficient fine-tuning, and multi-agent reasoning. It targets the challenges of high-resolution image computation, annotation scarcity, multimodal fusion, and model interpretability, with the goal of advancing interpretable, safe AI-assisted diagnosis.

Abstract

Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

Motivation

Whole slide imaging has propelled digital pathology forward, but high-resolution images impose a heavy computational burden, expert annotations are scarce, multimodal fusion is difficult, and models remain opaque, all of which limit the application of AI in pathological diagnosis. This paper systematically reviews recent methods in multimodal computational pathology to address these challenges.

Method

A literature review methodology is adopted, systematically analyzing four research directions: (1) self-supervised representation learning and structure-aware token compression for whole-slide images; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. Particular attention is paid to how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" for uncertainty-aware evidence fusion.

Result

The review summarizes key technical advances in multimodal computational pathology across image processing, data augmentation, learning efficiency, and interpretable reasoning, showing that these methods help overcome computational bottlenecks, reduce annotation dependence, and improve model transparency and diagnostic reliability.

Conclusion

Future progress depends on building unified multimodal frameworks that integrate high-resolution visual data with clinical and biomedical knowledge to support interpretable, safe AI-assisted diagnosis. Open challenges include further optimizing model efficiency, strengthening biological interpretability, and ensuring clinical practicality.

No. 2

Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou, Zheng Chen, Yulun Zhang

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

Class: cs.CV · cs.AI

TLDR: Proposes Diff-SIT, which combines sparse encoding with a one-step diffusion model to improve the perceptual quality and temporal consistency of video compression at ultra-low bitrates.

Abstract

Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

Motivation

Traditional end-to-end compression models tend to produce blurry images with poor perceptual quality at ultra-low bitrates, while existing generative compression methods often treat frames independently, ignoring inter-frame temporal correlation and leading to low efficiency and temporal inconsistency.

Method

Diff-SIT comprises a Sparse Temporal Encoding Module (STEM) and a One-Step Video Diffusion with Frame Type Embedder (ODFTE). STEM sparsely encodes the original frames into an information-rich intermediate sequence to save bitrate; ODFTE then processes the intermediate sequence as a whole to exploit temporal correlation, with the Frame Type Embedder (FTE) guiding adaptive reconstruction according to each frame's type.
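
As a toy illustration of the STEM/FTE idea (a minimal sketch assuming a uniform keyframe stride; the actual module learns what information to keep), sparse transmission plus per-frame type labels might look like:

```python
# Toy sketch (not the paper's code): sparse temporal encoding keeps only
# every k-th frame for transmission; the remaining frames are reconstructed
# by the diffusion model, which is told each frame's type.

def sparse_encode(num_frames: int, stride: int):
    """Return transmitted frame indices and a per-frame type label."""
    kept = list(range(0, num_frames, stride))
    types = ["kept" if t % stride == 0 else "generated"
             for t in range(num_frames)]
    return kept, types

kept, types = sparse_encode(num_frames=12, stride=4)
print(kept)                 # transmitted frame indices: [0, 4, 8]
print(1 - len(kept) / 12)   # fraction of frames never transmitted: 0.75
```

Even this crude stride-4 scheme drops 75% of frames from the bitstream; the frame-type labels stand in for the FTE conditioning that tells the decoder how each frame was obtained.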

Result

Experiments on multiple datasets show that Diff-SIT sets a new state of the art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime.

Conclusion

By combining sparse encoding with a diffusion model, Diff-SIT effectively addresses the perceptual-quality and temporal-consistency problems of ultra-low-bitrate video compression and delivers superior performance.

No. 3

LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Source: arXiv

Published: 2026-03-18

Updated: 2026-03-18

Classcs.CV · cs.AI

TLDR: Proposes LRConv-NeRV, an efficient neural video representation that replaces selected dense convolutional layers with low-rank separable convolutions, substantially reducing computational complexity and model size while preserving reconstruction quality.

Abstract

Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.

Motivation

NeRV (Neural Representations for Videos) encodes an entire video into neural network parameters, but its convolutional decoder is computationally expensive and memory intensive, limiting deployment in resource-constrained environments.

Method

LRConv-NeRV replaces selected dense 3x3 convolutional layers in the decoder with structured low-rank separable convolutions trained end-to-end, and applies low-rank factorization progressively from the largest decoder stage toward earlier ones, enabling a controllable trade-off between efficiency and reconstruction quality.
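
A back-of-the-envelope sketch of why such a replacement saves computation (the rank value and the 3x1/1x3 factorization below are illustrative assumptions, not necessarily the paper's exact design):

```python
# Per-pixel multiply-accumulate (MAC) counts for a toy comparison:
# a dense 3x3 conv mapping C_in -> C_out channels costs 9*C_in*C_out MACs,
# while a low-rank separable replacement (3x1 conv down to rank r, then a
# 1x3 conv back up to C_out) costs 3*C_in*r + 3*r*C_out.

def dense_3x3_macs(c_in: int, c_out: int) -> int:
    return 9 * c_in * c_out

def lowrank_sep_macs(c_in: int, c_out: int, rank: int) -> int:
    return 3 * c_in * rank + 3 * rank * c_out

c_in = c_out = 96                     # hypothetical decoder-stage width
dense = dense_3x3_macs(c_in, c_out)
lowrank = lowrank_sep_macs(c_in, c_out, rank=24)
print(f"dense={dense}, lowrank={lowrank}, "
      f"reduction={1 - lowrank / dense:.1%}")
```

With these assumed sizes the factorized layer needs roughly a sixth of the MACs, which is the kind of headroom that lets the paper report large GFLOPs reductions when only the widest stage is factorized.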

Result

Applying LRConv only to the final decoder stage reduces decoder complexity by 68% (from 201.9 to 64.9 GFLOPs) and model size by 9.3%, with negligible quality loss and roughly 9.2% bitrate reduction. Under INT8 post-training quantization, quality stays close to the dense baseline, and compared with existing work the method achieves a more favorable efficiency-quality trade-off, maintaining higher PSNR/MS-SSIM and better temporal stability.

Conclusion

LRConv-NeRV is a promising architecture for efficient neural video decoding in low-precision, resource-constrained settings: low-rank factorization yields substantial savings in computation and storage while preserving reconstruction quality and temporal coherence.

LIC End-to-End

3 papers

No. 1

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang

Source: arXiv

Published: 2026-03-16

Updated: 2026-03-16

Class: cs.CV · cs.AI

TLDR: Proposes an agentic framework that couples a large language model with an explicit 3D scene graph (3DSG), decoupling perception from reasoning to improve spatial reasoning in indoor scenes. Under ideal perceptual conditions, it significantly outperforms prior methods on the static split of VSI-Bench without task-specific fine-tuning.

Abstract

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% to 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.

Motivation

Current vision-language models (VLMs) still struggle with metric and spatial reasoning in indoor scene understanding. Existing approaches rely on end-to-end video understanding or large-scale spatial question-answering fine-tuning, coupling perception with reasoning in a way that may limit reasoning performance. This paper investigates whether decoupling perception from reasoning improves spatial reasoning.

Method

An agentic framework for static 3D indoor scene reasoning is proposed: it grounds a large language model (LLM) in an explicit 3D scene graph (3DSG) built by a dedicated perception module (instantiated from ground-truth annotations in the experiments to isolate reasoning performance). The agent interacts with the scene exclusively through structured geometric tools, such as queries over object dimensions, distances, poses, and spatial relationships, thereby decoupling perception from reasoning.
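
A minimal sketch of the "structured geometric tools" idea, with a hypothetical three-object scene graph (object names, coordinates, and tool signatures invented for illustration):

```python
# Toy scene graph: the agent never sees pixels, only geometric queries.
import math

scene_graph = {                 # object -> (x, y, z) centroid in meters
    "sofa":  (1.0, 0.0, 0.0),
    "table": (4.0, 4.0, 0.0),
    "lamp":  (4.0, 4.0, 1.5),
}

def distance(a: str, b: str) -> float:
    """Geometric tool: Euclidean distance between two object centroids."""
    return math.dist(scene_graph[a], scene_graph[b])

def nearest(obj: str) -> str:
    """Geometric tool: the object closest to `obj`."""
    return min((o for o in scene_graph if o != obj),
               key=lambda o: distance(obj, o))

print(distance("sofa", "table"))   # 5.0
print(nearest("table"))            # lamp
```

An LLM agent would call tools like these in a loop while answering metric questions ("how far is the sofa from the table?"), instead of estimating distances directly from video frames.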

Result

On the static split of VSI-Bench, the method provides an upper bound on spatial reasoning performance under ideal perceptual conditions, significantly outperforming prior work (by up to 16%) without task-specific fine-tuning. Compared to base VLMs, the agentic variant improves average performance by 33% to 50%.

Conclusion

Explicit geometric grounding substantially improves spatial reasoning performance; structured representations such as 3D scene graphs offer a compelling alternative to purely end-to-end visual reasoning, indicating the advantage of decoupling perception from reasoning.

No. 2

Geometric Transformation-Embedded Mamba for Learned Video Compression

Hao Wei, Yanhui Zhou, Chenyang Ge

Source: arXiv

Published: 2026-03-09

Updated: 2026-03-09

TLDR: Proposes a streamlined video compression framework built on a direct transform strategy, combining cascaded Mamba modules and a locality refinement feed-forward network with a conditional channel-wise entropy model, and achieving better perceptual quality and temporal consistency than existing methods at low bitrates.

Abstract

Although learned video compression methods have exhibited outstanding performance, most of them typically follow a hybrid coding paradigm that requires explicit motion estimation and compensation, resulting in a complex solution for video compression. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that effectively utilizes conditional temporal priors to accurately estimate the probability distributions of current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source codes and models will be available at https://github.com/cshw2021/GTEM-LVC.

Motivation

Most existing learned video compression methods follow a hybrid coding paradigm that requires explicit motion estimation and compensation, making the overall solution complex. This paper aims to design a simpler yet effective framework that avoids such complex motion handling.

Method

A direct transform framework based on nonlinear transform, quantization, and entropy coding is proposed. Its core components are a cascaded Mamba module (CMM) that captures long-range spatiotemporal dependencies, a locality refinement feed-forward network (LRFFN) that strengthens local spatial representation via difference convolutions, and a conditional channel-wise entropy model that uses temporal priors to estimate the probability distributions of latent features.
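
The difference-convolution building block can be illustrated in 1-D (a generic central-difference form; the paper's hybrid convolution block may differ in detail). Each kernel tap acts on the difference between a neighbor and the center sample, which satisfies the identity diff_conv(x, w) = conv(x, w) - x_center * sum(w):

```python
# Pure-Python 1-D sketch of a central difference convolution.

def conv1d(x, w):
    """'Valid' correlation with a length-len(w) kernel."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def diff_conv1d(x, w):
    """Each tap weights (neighbor - center), emphasizing local structure."""
    k = len(w)
    c = k // 2
    return [sum(w[j] * (x[i + j] - x[i + c]) for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 4.0, 7.0, 11.0]
w = [0.5, 1.0, 0.25]
lhs = diff_conv1d(x, w)
rhs = [a - sum(w) * center for a, center in zip(conv1d(x, w), x[1:-1])]
print(lhs == rhs)   # True: the identity holds
```

Because the response depends only on local differences, a flat region maps to zero, which is why difference convolutions are used to sharpen local spatial detail alongside the long-range Mamba branch.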

Result

Extensive experiments show that the method outperforms state-of-the-art video compression approaches in perceptual quality and temporal consistency under low-bitrate constraints.

Conclusion

The streamlined framework avoids complex motion estimation; through the direct transform strategy and the new module designs, it achieves high-performance compression. Code and models will be open-sourced.

No. 3

CircuitProbe: Tracing Visual Temporal Evidence Flow in Video Language Models

Yiming Zhang, Zhuokai Zhao, Chengzhang Yu, Kun Wang, Zhendong Chu, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun, Qingsong Wen

Source: arXiv

Published: 2025-07-25

Updated: 2026-03-15

Class: cs.CV · cs.LG

TLDR: Proposes CircuitProbe, a framework that analyzes how temporal evidence is represented and causally used in LVLMs via visual auditing and semantic tracing, and derives an analysis-driven intervention that improves temporal understanding on the TempCompass benchmark.

Abstract

Autoregressive large vision--language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space as continuous visual token embeddings. However, it remains unclear where temporal evidence is represented and how it causally influences decoding. To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages: (i) Visual Auditing, which localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations and controlled substitutions; and (ii) Semantic Tracing, which uses logit-lens probing to track the layer-wise emergence of object and temporal concepts, augmented with temporal frame interventions to assess sensitivity to temporal structure. Based on the resulting analysis, we design a targeted surgical intervention that strictly follows our observations: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval revealed by Semantic Tracing. This analysis-driven intervention yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark, validating the correctness, effectiveness, and practical value of the proposed circuit-level analysis for temporal understanding in LVLMs.

Motivation

Autoregressive large vision-language models project video features into continuous visual token embeddings, but where temporal evidence is represented in the model and how it causally influences decoding remain unclear, hindering improvements to temporal understanding.

Method

CircuitProbe analyzes the model in two stages: Visual Auditing, which localizes object semantics within the video-token sequence and verifies their causal necessity, and Semantic Tracing, which tracks the layer-wise emergence of object and temporal concepts via logit-lens probing and assesses sensitivity to temporal structure through frame interventions. Based on this analysis, a targeted intervention amplifies temporally specialized attention heads within the critical layer interval.
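
Logit-lens probing itself is simple to sketch (toy vocabulary, weights, and hidden state invented for illustration): an intermediate hidden state is decoded through the model's final unembedding matrix to see which token concept has already emerged at that layer:

```python
# Toy logit-lens sketch: decode a layer-k hidden state with the unembedding.
import math

vocab = ["cat", "dog", "ball"]
W_U = [[2.0, 0.1, -1.0],    # unembedding: hidden dim 2 -> 3 vocab logits
       [0.0, 1.5,  0.5]]

def logit_lens(hidden):
    """Project a hidden state to vocab logits, softmax, return top token."""
    logits = [sum(h * W_U[j][i] for j, h in enumerate(hidden))
              for i in range(len(vocab))]
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    top = vocab[max(range(len(vocab)), key=probs.__getitem__)]
    return top, probs

top, probs = logit_lens([1.0, 0.2])   # hypothetical intermediate state
print(top)   # cat
```

Applying this lens layer by layer shows at which depth a concept ("cat", or a temporal notion like "before") first dominates the distribution, which is how Semantic Tracing locates the critical layer interval.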

Result

The analysis-driven intervention yields consistent improvements on the TempCompass benchmark (up to 2.4% absolute), validating the correctness and practical value of circuit-level analysis for temporal understanding in LVLMs.

Conclusion

Circuit-level analysis can effectively reveal the mechanisms of temporal understanding in LVLMs; the proposed framework is both correct and effective, and can guide model improvements on temporally demanding tasks.

VSR Video Super-Resolution

3 papers

No. 1

Improving Image-to-Image Translation via a Rectified Flow Reformulation

Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui

Source: arXiv

Published: 2026-03-20

Updated: 2026-03-20

TLDR: Proposes I2I-RFR, a lightweight plug-in method that recasts standard image-to-image regression networks as continuous-time transport models. Noise-augmented inputs and a weighted pixel loss enable progressive refinement at inference, improving perceptual quality and detail without a complex generative pipeline.

Abstract

In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.

Motivation

Pixel-wise image-to-image regression is simple and stable but often over-smooths ill-posed or multimodal targets, while generative alternatives typically require extra components, task-specific tuning, and more complex training and inference pipelines. This work aims to combine the strengths of both by introducing continuous-time refinement in a lightweight way.

Method

The method concatenates a noise-corrupted version of the ground-truth target to the backbone input channel-wise and optimizes a t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference while largely preserving the standard supervised training pipeline.
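
The rectified-flow view can be sketched with toy scalars (the real model predicts the velocity from its noise-augmented input; here an ideal straight-line field is assumed for illustration). Along the straight path x_t = (1-t)*x0 + t*x1 the induced velocity is v = x1 - x0, so a few explicit Euler ODE steps transport a degraded input toward the target:

```python
# Toy Euler-ODE refinement under an assumed straight-line velocity field.

def euler_refine(x0, velocity, steps=3):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x, t, dt = x0, 0.0, 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, t)   # one explicit Euler step
        t += dt
    return x

x0, x1 = 0.2, 0.9                     # blurry prediction -> sharp target
v_ideal = lambda x, t: x1 - x0        # ideal rectified (straight) field
out = euler_refine(x0, v_ideal, steps=3)
print(out)                            # ~0.9, recovered in 3 steps
```

Because the ideal rectified field is constant along the path, even very few solver steps (the paper reports e.g. 3) land on the target, which is why no distillation is needed at inference.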

Result

Experiments on multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across tasks and backbones, with particularly clear gains in perceptual quality and detail preservation, while requiring only a few inference steps (e.g., 3) and no distillation.

Conclusion

I2I-RFR offers conventional image-to-image models a lightweight way to incorporate continuous-time refinement without a heavy generative pipeline, balancing performance and practicality.

No. 2

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

TLDR: Proposes SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment, balancing semantic modification against motion preservation without relying on external priors, and achieving strong performance both zero-shot and after fine-tuning.

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

Motivation

Existing instruction-guided video editing models struggle to achieve precise semantic modification and faithful motion preservation at the same time, and their reliance on external priors (e.g., VLM features or structural conditions) limits robustness and generalization.

Method

SAMA factorizes video editing into semantic anchoring and motion modeling. Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames for instruction-aware structural planning; Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle) so the model internalizes temporal dynamics directly from raw video. Optimization is two-stage: factorized pre-training that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data.
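
Two of the pretext corruptions are easy to sketch on a toy frame-index sequence (assumed simplified forms for illustration; the paper applies them to spatio-temporal video volumes, and the model is trained to restore the original clip):

```python
# Toy motion-centric pretext corruptions on frame indices.
import random

frames = list(range(8))             # stand-in for 8 video frames

def speed_perturb(seq, factor=2):
    """Keep every `factor`-th frame, simulating a sped-up clip."""
    return seq[::factor]

def tube_shuffle(seq, start=2, length=4, seed=0):
    """Shuffle a contiguous temporal tube seq[start:start+length]."""
    out = list(seq)
    tube = out[start:start + length]
    random.Random(seed).shuffle(tube)  # seeded for reproducibility
    out[start:start + length] = tube
    return out

print(speed_perturb(frames))          # [0, 2, 4, 6]
print(tube_shuffle(frames))           # frames 2..5 permuted, rest intact
```

Restoring the original frame order or speed from such corruptions forces the backbone to model temporal dynamics, without any paired editing supervision.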

Result

The factorized pre-training stage alone already yields strong zero-shot video editing ability; SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems such as Kling-Omni.

Conclusion

By factorizing semantic and motion learning, SAMA effectively balances semantic modification and motion preservation in instruction-guided video editing, reduces reliance on external priors, and improves robustness and generalization, offering a new paradigm for video editing.

No. 3

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Source: arXiv

Published: 2026-03-18

Updated: 2026-03-18

Class: cs.CV · cs.AI · cs.LG

TLDR: Proposes ChopGrad, a truncated backpropagation scheme for video diffusion models that restricts gradient computation to local frame windows, sharply reducing training memory while maintaining global consistency, and thereby making fine-tuning with pixel-wise losses efficient.

Abstract

Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.

Motivation

Existing video diffusion models generate frames recurrently, so training in the pixel domain accumulates activations across the entire sequence, incurring prohibitive memory cost and making pixel-loss fine-tuning intractable for long or high-resolution videos.

Method

ChopGrad truncates backpropagation during video decoding, computing gradients only within local frame windows; a theoretical analysis of this approximation shows that global consistency is maintained, and memory consumption becomes constant in the number of frames.
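
The accuracy/memory trade-off of truncation can be sketched on a scalar recurrence h_t = a*h_{t-1} + x_t (a toy stand-in for a recurrent frame decoder, not the paper's model). Full backpropagation through time carries the gradient through every step; the truncated variant stops after a local window, so only that many activations ever matter:

```python
# Manual forward/backward pass for L = h_T on a toy scalar recurrence,
# with optional gradient truncation to a local window of steps.

def grad_wrt_a(a, h0, xs, window=None):
    # Forward pass (truncation limits how many stored states are needed).
    hs = [h0]
    for x in xs:
        hs.append(a * hs[-1] + x)
    # Reverse pass for L = h_T, i.e. dL/dh_T = 1.
    T = len(xs)
    steps = T if window is None else min(window, T)
    g, dh = 0.0, 1.0
    for t in range(T, T - steps, -1):   # step t computed h_t = a*h_{t-1} + x
        g += dh * hs[t - 1]             # dL/da contribution at step t
        dh *= a                         # carry gradient back to h_{t-1}
    return g

xs = [1.0, 0.5, -0.3, 0.8]
full = grad_wrt_a(0.5, h0=0.0, xs=xs)            # exact: 0.95
trunc = grad_wrt_a(0.5, h0=0.0, xs=xs, window=2)  # window of 2: 0.7
print(full, trunc)
```

With a decaying recurrence (|a| < 1) the dropped terms shrink geometrically, which is the intuition behind truncation approximating the full gradient while keeping memory constant in the number of frames.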

Result

ChopGrad reduces training memory from scaling linearly with the number of frames to constant, and compares favorably with state-of-the-art video diffusion models on video super-resolution, video inpainting, enhancement of neural-rendered scenes, and controlled driving video generation.

Conclusion

ChopGrad removes the memory bottleneck in training video diffusion models, making pixel-loss fine-tuning computationally feasible and providing an efficient solution for high-quality video generation tasks.