Video and Image Compression

3 papers

No. 1

Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu, Zehong Chen, Lijian Xu

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

TLDR: This survey reviews recent advances in multimodal computational pathology, covering whole-slide image processing, self-supervised learning, multimodal data generation, parameter-efficient fine-tuning, and multi-agent reasoning. It targets the challenges of high-resolution image computation, annotation scarcity, multimodal fusion, and model interpretability, with the goal of advancing interpretable, safe AI-assisted diagnosis.

Abstract

Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

Motivation

Whole slide imaging has propelled digital pathology forward, but high-resolution images impose a heavy computational burden, expert annotations are scarce, multimodal fusion is difficult, and models remain opaque, all of which limit the application of AI in pathological diagnosis. This paper systematically reviews recent methods in multimodal computational pathology to address these challenges.

Method

A literature review methodology is adopted, systematically analyzing four research directions: (1) self-supervised representation learning and structure-aware token compression for whole-slide images; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. Particular attention is paid to how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" for uncertainty-aware evidence fusion.

Result

The review summarizes key technical advances in multimodal computational pathology across image processing, data augmentation, learning efficiency, and interpretable reasoning, showing that these methods help overcome computational bottlenecks, reduce annotation dependence, and improve model transparency and diagnostic reliability.

Conclusion

Future progress depends on building unified multimodal frameworks that integrate high-resolution visual data with clinical and biomedical knowledge to support interpretable, safe AI-assisted diagnosis. Open challenges include further optimizing model efficiency, strengthening biological interpretability, and ensuring clinical practicality.

No. 2

Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou, Zheng Chen, Yulun Zhang

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

Class: cs.CV · cs.AI

TLDR: Proposes Diff-SIT, which combines sparse encoding with a one-step diffusion model to improve the perceptual quality and temporal consistency of video compression at ultra-low bitrates.

Abstract

Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

Motivation

Traditional end-to-end compression models tend to produce blurry images with poor perceptual quality at ultra-low bitrates, while existing generative compression methods often treat frames independently, ignoring inter-frame temporal correlation and leading to low efficiency and temporal inconsistency.

Method

Diff-SIT comprises a Sparse Temporal Encoding Module (STEM) and a One-Step Video Diffusion with Frame Type Embedder (ODFTE). STEM sparsely encodes the original frames into an information-rich intermediate sequence to save bitrate; ODFTE then processes the intermediate sequence as a whole to exploit temporal correlation, with the Frame Type Embedder (FTE) guiding adaptive reconstruction according to each frame's type.
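
As a toy illustration of the STEM/FTE idea (a minimal sketch assuming a uniform keyframe stride; the actual module learns what information to keep), sparse transmission plus per-frame type labels might look like:

```python
# Toy sketch (not the paper's code): sparse temporal encoding keeps only
# every k-th frame for transmission; the remaining frames are reconstructed
# by the diffusion model, which is told each frame's type.

def sparse_encode(num_frames: int, stride: int):
    """Return transmitted frame indices and a per-frame type label."""
    kept = list(range(0, num_frames, stride))
    types = ["kept" if t % stride == 0 else "generated"
             for t in range(num_frames)]
    return kept, types

kept, types = sparse_encode(num_frames=12, stride=4)
print(kept)                 # transmitted frame indices: [0, 4, 8]
print(1 - len(kept) / 12)   # fraction of frames never transmitted: 0.75
```

Even this crude stride-4 scheme drops 75% of frames from the bitstream; the frame-type labels stand in for the FTE conditioning that tells the decoder how each frame was obtained.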

Result

Experiments on multiple datasets show that Diff-SIT sets a new state of the art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime.

Conclusion

By combining sparse encoding with a diffusion model, Diff-SIT effectively addresses the perceptual-quality and temporal-consistency problems of ultra-low-bitrate video compression and delivers superior performance.

No. 3

LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Source: arXiv

Published: 2026-03-18

Updated: 2026-03-18

Classcs.CV · cs.AI

TLDR: Proposes LRConv-NeRV, an efficient neural video representation that replaces selected dense convolutional layers with low-rank separable convolutions, substantially reducing computational complexity and model size while preserving reconstruction quality.

Abstract

Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.

Motivation

NeRV (Neural Representations for Videos) encodes an entire video into neural network parameters, but its convolutional decoder is computationally expensive and memory intensive, limiting deployment in resource-constrained environments.

Method

LRConv-NeRV replaces selected dense 3x3 convolutional layers in the decoder with structured low-rank separable convolutions trained end-to-end, and applies low-rank factorization progressively from the largest decoder stage toward earlier ones, enabling a controllable trade-off between efficiency and reconstruction quality.
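
A back-of-the-envelope sketch of why such a replacement saves computation (the rank value and the 3x1/1x3 factorization below are illustrative assumptions, not necessarily the paper's exact design):

```python
# Per-pixel multiply-accumulate (MAC) counts for a toy comparison:
# a dense 3x3 conv mapping C_in -> C_out channels costs 9*C_in*C_out MACs,
# while a low-rank separable replacement (3x1 conv down to rank r, then a
# 1x3 conv back up to C_out) costs 3*C_in*r + 3*r*C_out.

def dense_3x3_macs(c_in: int, c_out: int) -> int:
    return 9 * c_in * c_out

def lowrank_sep_macs(c_in: int, c_out: int, rank: int) -> int:
    return 3 * c_in * rank + 3 * rank * c_out

c_in = c_out = 96                     # hypothetical decoder-stage width
dense = dense_3x3_macs(c_in, c_out)
lowrank = lowrank_sep_macs(c_in, c_out, rank=24)
print(f"dense={dense}, lowrank={lowrank}, "
      f"reduction={1 - lowrank / dense:.1%}")
```

With these assumed sizes the factorized layer needs roughly a sixth of the MACs, which is the kind of headroom that lets the paper report large GFLOPs reductions when only the widest stage is factorized.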

Result

Applying LRConv only to the final decoder stage reduces decoder complexity by 68% (from 201.9 to 64.9 GFLOPs) and model size by 9.3%, with negligible quality loss and roughly 9.2% bitrate reduction. Under INT8 post-training quantization, quality stays close to the dense baseline, and compared with existing work the method achieves a more favorable efficiency-quality trade-off, maintaining higher PSNR/MS-SSIM and better temporal stability.

Conclusion

LRConv-NeRV is a promising architecture for efficient neural video decoding in low-precision, resource-constrained settings: low-rank factorization yields substantial savings in computation and storage while preserving reconstruction quality and temporal coherence.

LIC End-to-End

3 papers

No. 1

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang

Source: arXiv

Published: 2026-03-16

Updated: 2026-03-16

Class: cs.CV · cs.AI

TLDR: Proposes an agentic framework that couples a large language model with an explicit 3D scene graph (3DSG), decoupling perception from reasoning to improve spatial reasoning in indoor scenes. Under ideal perceptual conditions, it significantly outperforms prior methods on the static split of VSI-Bench without task-specific fine-tuning.

Abstract

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% to 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.

Motivation

Current vision-language models (VLMs) still struggle with metric and spatial reasoning in indoor scene understanding. Existing approaches rely on end-to-end video understanding or large-scale spatial question-answering fine-tuning, coupling perception with reasoning in a way that may limit reasoning performance. This paper investigates whether decoupling perception from reasoning improves spatial reasoning.

Method

An agentic framework for static 3D indoor scene reasoning is proposed: it grounds a large language model (LLM) in an explicit 3D scene graph (3DSG) built by a dedicated perception module (instantiated from ground-truth annotations in the experiments to isolate reasoning performance). The agent interacts with the scene exclusively through structured geometric tools, such as queries over object dimensions, distances, poses, and spatial relationships, thereby decoupling perception from reasoning.
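
A minimal sketch of the "structured geometric tools" idea, with a hypothetical three-object scene graph (object names, coordinates, and tool signatures invented for illustration):

```python
# Toy scene graph: the agent never sees pixels, only geometric queries.
import math

scene_graph = {                 # object -> (x, y, z) centroid in meters
    "sofa":  (1.0, 0.0, 0.0),
    "table": (4.0, 4.0, 0.0),
    "lamp":  (4.0, 4.0, 1.5),
}

def distance(a: str, b: str) -> float:
    """Geometric tool: Euclidean distance between two object centroids."""
    return math.dist(scene_graph[a], scene_graph[b])

def nearest(obj: str) -> str:
    """Geometric tool: the object closest to `obj`."""
    return min((o for o in scene_graph if o != obj),
               key=lambda o: distance(obj, o))

print(distance("sofa", "table"))   # 5.0
print(nearest("table"))            # lamp
```

An LLM agent would call tools like these in a loop while answering metric questions ("how far is the sofa from the table?"), instead of estimating distances directly from video frames.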

Result

On the static split of VSI-Bench, the method provides an upper bound on spatial reasoning performance under ideal perceptual conditions, significantly outperforming prior work (by up to 16%) without task-specific fine-tuning. Compared to base VLMs, the agentic variant improves average performance by 33% to 50%.

Conclusion

Explicit geometric grounding substantially improves spatial reasoning performance; structured representations such as 3D scene graphs offer a compelling alternative to purely end-to-end visual reasoning, indicating the advantage of decoupling perception from reasoning.

No. 2

Geometric Transformation-Embedded Mamba for Learned Video Compression

Hao Wei, Yanhui Zhou, Chenyang Ge

Source: arXiv

Published: 2026-03-09

Updated: 2026-03-09

TLDR: Proposes a streamlined video compression framework built on a direct transform strategy, combining cascaded Mamba modules and a locality refinement feed-forward network with a conditional channel-wise entropy model, and achieving better perceptual quality and temporal consistency than existing methods at low bitrates.

Abstract

Although learned video compression methods have exhibited outstanding performance, most of them typically follow a hybrid coding paradigm that requires explicit motion estimation and compensation, resulting in a complex solution for video compression. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that effectively utilizes conditional temporal priors to accurately estimate the probability distributions of current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source codes and models will be available at https://github.com/cshw2021/GTEM-LVC.

Motivation

Most existing learned video compression methods follow a hybrid coding paradigm that requires explicit motion estimation and compensation, making the overall solution complex. This paper aims to design a simpler yet effective framework that avoids such complex motion handling.

Method

A direct transform framework based on nonlinear transform, quantization, and entropy coding is proposed. Its core components are a cascaded Mamba module (CMM) that captures long-range spatiotemporal dependencies, a locality refinement feed-forward network (LRFFN) that strengthens local spatial representation via difference convolutions, and a conditional channel-wise entropy model that uses temporal priors to estimate the probability distributions of latent features.
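
The difference-convolution building block can be illustrated in 1-D (a generic central-difference form; the paper's hybrid convolution block may differ in detail). Each kernel tap acts on the difference between a neighbor and the center sample, which satisfies the identity diff_conv(x, w) = conv(x, w) - x_center * sum(w):

```python
# Pure-Python 1-D sketch of a central difference convolution.

def conv1d(x, w):
    """'Valid' correlation with a length-len(w) kernel."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def diff_conv1d(x, w):
    """Each tap weights (neighbor - center), emphasizing local structure."""
    k = len(w)
    c = k // 2
    return [sum(w[j] * (x[i + j] - x[i + c]) for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 4.0, 7.0, 11.0]
w = [0.5, 1.0, 0.25]
lhs = diff_conv1d(x, w)
rhs = [a - sum(w) * center for a, center in zip(conv1d(x, w), x[1:-1])]
print(lhs == rhs)   # True: the identity holds
```

Because the response depends only on local differences, a flat region maps to zero, which is why difference convolutions are used to sharpen local spatial detail alongside the long-range Mamba branch.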

Result

Extensive experiments show that the method outperforms state-of-the-art video compression approaches in perceptual quality and temporal consistency under low-bitrate constraints.

Conclusion

The streamlined framework avoids complex motion estimation; through the direct transform strategy and the new module designs, it achieves high-performance compression. Code and models will be open-sourced.

No. 3

CircuitProbe: Tracing Visual Temporal Evidence Flow in Video Language Models

Yiming Zhang, Zhuokai Zhao, Chengzhang Yu, Kun Wang, Zhendong Chu, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun, Qingsong Wen

Source: arXiv

Published: 2025-07-25

Updated: 2026-03-15

Class: cs.CV · cs.LG

TLDR: Proposes CircuitProbe, a framework that analyzes how temporal evidence is represented and causally used in LVLMs via visual auditing and semantic tracing, and derives an analysis-driven intervention that improves temporal understanding on the TempCompass benchmark.

Abstract

Autoregressive large vision--language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space as continuous visual token embeddings. However, it remains unclear where temporal evidence is represented and how it causally influences decoding. To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages: (i) Visual Auditing, which localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations and controlled substitutions; and (ii) Semantic Tracing, which uses logit-lens probing to track the layer-wise emergence of object and temporal concepts, augmented with temporal frame interventions to assess sensitivity to temporal structure. Based on the resulting analysis, we design a targeted surgical intervention that strictly follows our observations: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval revealed by Semantic Tracing. This analysis-driven intervention yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark, validating the correctness, effectiveness, and practical value of the proposed circuit-level analysis for temporal understanding in LVLMs.

Motivation

Autoregressive large vision-language models project video features into continuous visual token embeddings, but where temporal evidence is represented in the model and how it causally influences decoding remain unclear, hindering improvements to temporal understanding.

Method

CircuitProbe analyzes the model in two stages: Visual Auditing, which localizes object semantics within the video-token sequence and verifies their causal necessity, and Semantic Tracing, which tracks the layer-wise emergence of object and temporal concepts via logit-lens probing and assesses sensitivity to temporal structure through frame interventions. Based on this analysis, a targeted intervention amplifies temporally specialized attention heads within the critical layer interval.
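
Logit-lens probing itself is simple to sketch (toy vocabulary, weights, and hidden state invented for illustration): an intermediate hidden state is decoded through the model's final unembedding matrix to see which token concept has already emerged at that layer:

```python
# Toy logit-lens sketch: decode a layer-k hidden state with the unembedding.
import math

vocab = ["cat", "dog", "ball"]
W_U = [[2.0, 0.1, -1.0],    # unembedding: hidden dim 2 -> 3 vocab logits
       [0.0, 1.5,  0.5]]

def logit_lens(hidden):
    """Project a hidden state to vocab logits, softmax, return top token."""
    logits = [sum(h * W_U[j][i] for j, h in enumerate(hidden))
              for i in range(len(vocab))]
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    top = vocab[max(range(len(vocab)), key=probs.__getitem__)]
    return top, probs

top, probs = logit_lens([1.0, 0.2])   # hypothetical intermediate state
print(top)   # cat
```

Applying this lens layer by layer shows at which depth a concept ("cat", or a temporal notion like "before") first dominates the distribution, which is how Semantic Tracing locates the critical layer interval.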

Result

The analysis-driven intervention yields consistent improvements on the TempCompass benchmark (up to 2.4% absolute), validating the correctness and practical value of circuit-level analysis for temporal understanding in LVLMs.

Conclusion

Circuit-level analysis can effectively reveal the mechanisms of temporal understanding in LVLMs; the proposed framework is both correct and effective, and can guide model improvements on temporally demanding tasks.

VSR Video Super-Resolution

3 papers

No. 1

Improving Image-to-Image Translation via a Rectified Flow Reformulation

Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui

Source: arXiv

Published: 2026-03-20

Updated: 2026-03-20

TLDR: Proposes I2I-RFR, a lightweight plug-in method that recasts standard image-to-image regression networks as continuous-time transport models. Noise-augmented inputs and a weighted pixel loss enable progressive refinement at inference, improving perceptual quality and detail without a complex generative pipeline.

Abstract

In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.

Motivation

Pixel-wise image-to-image regression is simple and stable but often over-smooths ill-posed or multimodal targets, while generative alternatives typically require extra components, task-specific tuning, and more complex training and inference pipelines. This work aims to combine the strengths of both by introducing continuous-time refinement in a lightweight way.

Method

The method concatenates a noise-corrupted version of the ground-truth target to the backbone input channel-wise and optimizes a t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference while largely preserving the standard supervised training pipeline.
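
The rectified-flow view can be sketched with toy scalars (the real model predicts the velocity from its noise-augmented input; here an ideal straight-line field is assumed for illustration). Along the straight path x_t = (1-t)*x0 + t*x1 the induced velocity is v = x1 - x0, so a few explicit Euler ODE steps transport a degraded input toward the target:

```python
# Toy Euler-ODE refinement under an assumed straight-line velocity field.

def euler_refine(x0, velocity, steps=3):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x, t, dt = x0, 0.0, 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, t)   # one explicit Euler step
        t += dt
    return x

x0, x1 = 0.2, 0.9                     # blurry prediction -> sharp target
v_ideal = lambda x, t: x1 - x0        # ideal rectified (straight) field
out = euler_refine(x0, v_ideal, steps=3)
print(out)                            # ~0.9, recovered in 3 steps
```

Because the ideal rectified field is constant along the path, even very few solver steps (the paper reports e.g. 3) land on the target, which is why no distillation is needed at inference.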

Result

Experiments on multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across tasks and backbones, with particularly clear gains in perceptual quality and detail preservation, while requiring only a few inference steps (e.g., 3) and no distillation.

Conclusion

I2I-RFR offers conventional image-to-image models a lightweight way to incorporate continuous-time refinement without a heavy generative pipeline, balancing performance and practicality.

No. 2

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

Source: arXiv

Published: 2026-03-19

Updated: 2026-03-19

TLDR: Proposes SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment, balancing semantic modification against motion preservation without relying on external priors, and achieving strong performance both zero-shot and after fine-tuning.

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

Motivation

Existing instruction-guided video editing models struggle to achieve precise semantic modification and faithful motion preservation at the same time, and their reliance on external priors (e.g., VLM features or structural conditions) limits robustness and generalization.

Method

SAMA factorizes video editing into semantic anchoring and motion modeling. Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames for instruction-aware structural planning; Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle) so the model internalizes temporal dynamics directly from raw video. Optimization is two-stage: factorized pre-training that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data.
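
Two of the pretext corruptions are easy to sketch on a toy frame-index sequence (assumed simplified forms for illustration; the paper applies them to spatio-temporal video volumes, and the model is trained to restore the original clip):

```python
# Toy motion-centric pretext corruptions on frame indices.
import random

frames = list(range(8))             # stand-in for 8 video frames

def speed_perturb(seq, factor=2):
    """Keep every `factor`-th frame, simulating a sped-up clip."""
    return seq[::factor]

def tube_shuffle(seq, start=2, length=4, seed=0):
    """Shuffle a contiguous temporal tube seq[start:start+length]."""
    out = list(seq)
    tube = out[start:start + length]
    random.Random(seed).shuffle(tube)  # seeded for reproducibility
    out[start:start + length] = tube
    return out

print(speed_perturb(frames))          # [0, 2, 4, 6]
print(tube_shuffle(frames))           # frames 2..5 permuted, rest intact
```

Restoring the original frame order or speed from such corruptions forces the backbone to model temporal dynamics, without any paired editing supervision.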

Result

The factorized pre-training stage alone already yields strong zero-shot video editing ability; SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems such as Kling-Omni.

Conclusion

By factorizing semantic and motion learning, SAMA effectively balances semantic modification and motion preservation in instruction-guided video editing, reduces reliance on external priors, and improves robustness and generalization, offering a new paradigm for video editing.

No. 3

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Source: arXiv

Published: 2026-03-18

Updated: 2026-03-18

Class: cs.CV · cs.AI · cs.LG

TLDR: Proposes ChopGrad, a truncated backpropagation scheme for video diffusion models that restricts gradient computation to local frame windows, sharply reducing training memory while maintaining global consistency, and thereby making fine-tuning with pixel-wise losses efficient.

Abstract

Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.

Motivation

Existing video diffusion models generate frames recurrently, so training in the pixel domain accumulates activations across the entire sequence, incurring prohibitive memory cost and making pixel-loss fine-tuning intractable for long or high-resolution videos.

Method

ChopGrad truncates backpropagation during video decoding, computing gradients only within local frame windows; a theoretical analysis of this approximation shows that global consistency is maintained, and memory consumption becomes constant in the number of frames.
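
The accuracy/memory trade-off of truncation can be sketched on a scalar recurrence h_t = a*h_{t-1} + x_t (a toy stand-in for a recurrent frame decoder, not the paper's model). Full backpropagation through time carries the gradient through every step; the truncated variant stops after a local window, so only that many activations ever matter:

```python
# Manual forward/backward pass for L = h_T on a toy scalar recurrence,
# with optional gradient truncation to a local window of steps.

def grad_wrt_a(a, h0, xs, window=None):
    # Forward pass (truncation limits how many stored states are needed).
    hs = [h0]
    for x in xs:
        hs.append(a * hs[-1] + x)
    # Reverse pass for L = h_T, i.e. dL/dh_T = 1.
    T = len(xs)
    steps = T if window is None else min(window, T)
    g, dh = 0.0, 1.0
    for t in range(T, T - steps, -1):   # step t computed h_t = a*h_{t-1} + x
        g += dh * hs[t - 1]             # dL/da contribution at step t
        dh *= a                         # carry gradient back to h_{t-1}
    return g

xs = [1.0, 0.5, -0.3, 0.8]
full = grad_wrt_a(0.5, h0=0.0, xs=xs)            # exact: 0.95
trunc = grad_wrt_a(0.5, h0=0.0, xs=xs, window=2)  # window of 2: 0.7
print(full, trunc)
```

With a decaying recurrence (|a| < 1) the dropped terms shrink geometrically, which is the intuition behind truncation approximating the full gradient while keeping memory constant in the number of frames.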

Result

ChopGrad reduces training memory from scaling linearly with the number of frames to constant, and compares favorably with state-of-the-art video diffusion models on video super-resolution, video inpainting, enhancement of neural-rendered scenes, and controlled driving video generation.

Conclusion

ChopGrad removes the memory bottleneck in training video diffusion models, making pixel-loss fine-tuning computationally feasible and providing an efficient solution for high-quality video generation tasks.