Multimodal Large Models

3 papers

No. 1

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

TLDR: Proposes Proxy3D, which extracts compact 3D proxy representations from video frames via semantic-aware clustering, replacing conventional pixel-aligned or implicit 3D understanding and achieving state-of-the-art spatial reasoning performance with shorter vision sequences.


Abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

Motivation

Existing VLMs exhibit two problems in 3D spatial reasoning: correspondence-based models lack spatial consistency, and models with 3D geometric priors serialize vision sequences inefficiently. A compact yet comprehensive 3D visual representation is needed.

Method

Given video frames as input, semantic and geometric encoders extract scene features, and semantic-aware clustering yields a set of proxies in 3D space; the SpaceSpan dataset is curated and multi-stage training aligns the proxy representations with the VLM.
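
A minimal sketch of the clustering step as we read it; the feature shapes, the use of plain k-means, and the number of proxies are illustrative assumptions, not the paper's exact semantic-aware clustering.

```python
import numpy as np

def build_3d_proxies(sem_feats, geo_feats, num_proxies=64, iters=10, seed=0):
    """Cluster per-patch features from all frames into a compact set of 3D proxies.

    sem_feats: (N, Ds) semantic features for N patches across all video frames
    geo_feats: (N, Dg) geometric features (e.g., predicted 3D positions) for the same patches
    Returns (num_proxies, Ds + Dg) proxy tokens fed to the VLM instead of per-pixel tokens.
    """
    feats = np.concatenate([sem_feats, geo_feats], axis=1)            # joint descriptor
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), num_proxies, replace=False)]
    for _ in range(iters):                                            # plain k-means as a stand-in
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
        assign = d.argmin(1)
        for k in range(num_proxies):                                  # recompute cluster centers
            members = feats[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    return centers
```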

Result

Achieves competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks while using shorter vision sequences.

Conclusion

Proxy3D effectively balances spatial consistency and sequence efficiency through compact 3D proxy representations, demonstrating that the conventional 2D pipeline can be replaced.

No. 2

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

TLDR: Proposes HFRU, a reinforcement-unlearning framework that performs deep semantic forgetting on the vision encoder, achieving over 98% forgetting and retention performance on object recognition and face identity tasks while introducing almost no object hallucination.


Abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

Motivation

Existing methods mainly fine-tune the language decoder, resulting in superficial forgetting that fails to erase the underlying visual representations and often introduces object hallucination.

Method

A two-stage approach: first disrupt vision-language alignment, then optimize with GRPO using a composite reward function that includes an abstraction reward, encouraging semantically valid substitutions and suppressing hallucinations.
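
A hedged sketch of what such a composite reward could look like for GRPO-style sampling; the term names, weights, and string-matching scorers below are illustrative assumptions, not the paper's exact reward.

```python
def composite_reward(answer, forget_labels, retain_labels, abstraction_terms,
                     w_forget=1.0, w_retain=1.0, w_abstract=0.5):
    """Score one sampled answer for reinforcement unlearning (illustrative only).

    forget_labels:     concepts that must no longer be named (e.g., an erased identity)
    retain_labels:     concepts that must still be described correctly
    abstraction_terms: semantically valid substitutes (e.g., "a person") that avoid hallucination
    """
    text = answer.lower()
    forget = 1.0 if not any(c.lower() in text for c in forget_labels) else 0.0
    retain = sum(c.lower() in text for c in retain_labels) / max(len(retain_labels), 1)
    abstract = 1.0 if any(a.lower() in text for a in abstraction_terms) else 0.0
    return w_forget * forget + w_retain * retain + w_abstract * abstract

# Usage: rewards for a group of sampled answers are normalized into GRPO advantages.
rewards = [composite_reward(a, ["Alice"], ["red car"], ["a person"]) for a in
           ["A person stands next to a red car.", "Alice stands next to a red car."]]
```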

Result

On object recognition and face identity tasks, both forgetting and retention performance exceed 98%, object hallucination is negligible, and the method significantly outperforms prior approaches.

Conclusion

By operating on the vision encoder with reinforcement learning, HFRU achieves deep semantic removal and effectively addresses the incomplete forgetting and hallucination issues of existing methods.

No. 3

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

Class: cs.CV · cs.LG

TLDR: Proposes STARFlow2, which replaces diffusion models with autoregressive normalizing flows (TarFlow) for unified causal generation of text and images, avoiding the structural mismatch and achieving strong performance on multimodal benchmarks.


Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

Motivation

Existing unified multimodal systems combine autoregressive language models with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising.

Method

Built on the Pretzel architecture, a pretrained VLM stream and a TarFlow stream are vertically interleaved via residual skip connections under the same causal mask; a deep-shallow flow design and a unified FAE latent space let both text and visual outputs enter the KV-cache directly without re-encoding.
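
A toy sketch of the interleaving idea only: two causal streams stacked alternately and joined by residual skips under one causal mask. The real Pretzel/TarFlow blocks, the deep-shallow split, the FAE latents, and KV-cache handling are not modeled here; plain TransformerEncoder layers are used as stand-ins.

```python
import torch
import torch.nn as nn

class InterleavedStreams(nn.Module):
    """Illustrative vertical interleaving of a VLM-like stream and a flow-like stream."""
    def __init__(self, d_model=256, n_heads=4, n_blocks=2):
        super().__init__()
        mk = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.vlm_blocks = nn.ModuleList(mk() for _ in range(n_blocks))   # pretrained-VLM stand-in
        self.flow_blocks = nn.ModuleList(mk() for _ in range(n_blocks))  # TarFlow stand-in

    def forward(self, x):
        T = x.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        for vlm, flow in zip(self.vlm_blocks, self.flow_blocks):
            h = vlm(x, src_mask=causal)          # language/VLM stream, left-to-right
            x = x + flow(h, src_mask=causal)     # flow stream added back via a residual skip
        return x

tokens = torch.randn(1, 16, 256)                 # interleaved text + image latents
out = InterleavedStreams()(tokens)
```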

Result

Strong performance on image generation and multimodal understanding benchmarks, validating autoregressive flows as a foundation for unified multimodal modeling.

Conclusion

Autoregressive normalizing flows are a natural paradigm for truly unified multimodal generation, and STARFlow2 demonstrates their feasibility and strength.

Video Multimodal Understanding

3 papers

No. 1

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

Class: cs.CV · cs.AI

TLDR: Proposes the Response-G1 framework, which explicitly aligns video evidence with a query's expected response conditions via scene graphs, enabling accurate response-timing decisions in streaming video understanding.


Abstract

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.

Motivation

Existing Video-LLMs struggle to decide when to respond in streaming video understanding because they model visual evidence implicitly and in a query-agnostic way.

Method

A three-stage, fine-tuning-free framework: online query-guided scene graph generation, memory-based retrieval of semantically relevant historical scene graphs, and retrieval-augmented trigger prompting for per-frame silence/response decisions.
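
A minimal sketch of stages 2 and 3, assuming the scene-graph memory stores (text, embedding) pairs and that the retrieval uses cosine similarity; the memory format, embedding model, and prompt wording are assumptions.

```python
import numpy as np

def retrieve_relevant_graphs(query_emb, memory, top_k=3):
    """Stage 2 (illustrative): rank stored scene-graph embeddings by cosine similarity.

    memory: list of (scene_graph_text, embedding) pairs accumulated from earlier clips.
    """
    sims = [float(np.dot(query_emb, e) /
                  (np.linalg.norm(query_emb) * np.linalg.norm(e) + 1e-8))
            for _, e in memory]
    order = np.argsort(sims)[::-1][:top_k]
    return [memory[i][0] for i in order]

def trigger_prompt(query, current_graph, retrieved_graphs):
    """Stage 3 (illustrative): build the per-frame silence/response prompt for the Video-LLM."""
    context = "\n".join(retrieved_graphs)
    return (f"Query: {query}\nCurrent scene graph: {current_graph}\n"
            f"Relevant history:\n{context}\n"
            "Answer RESPONSE if the query's conditions are now satisfied, else SILENCE.")
```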

Result

Outperforms existing methods on both proactive and reactive task benchmarks, validating the advantage of explicit scene graph modeling and retrieval.

Conclusion

Grounding both evidence and conditions in a shared graph representation yields more interpretable and accurate response-timing decisions.

No. 2

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

Class: cs.CV · cs.CL

TLDR本文发现视频大语言模型在时间箭头任务上表现不佳的原因并非视觉编码器缺乏时间信息,而是投影器(如Q-Former)破坏了时间信息流。通过使用时间感知编码器、时间保持投影器和AoT监督,模型在AoT任务上超越人类,并提升了其他时间推理任务性能。


Abstract

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with temporal-aware video-centric encoder, time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.

Motivation

Humans judge with near-perfect accuracy whether a video plays forward or backward, yet frontier Video-LLMs perform only slightly above chance, prompting the authors to investigate whether the temporal information bottleneck lies in the vision encoder or elsewhere in the architecture.

Method

Isolate the vision encoder and trace temporal information through the encoder, projector, and LLM; compare video-centric and frame-centric encoders; analyze the impact of projector design (Q-Former vs. MLP); perform layer-wise analysis of temporal representation dynamics; and build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision.
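
A minimal sketch of what "time-preserved MLP projection" could mean in code: each frame token is projected independently so the LLM still sees the frames in temporal order. The dimensions and depth below are assumptions, not the paper's exact projector.

```python
import torch
import torch.nn as nn

class TimePreservedProjector(nn.Module):
    """Illustrative MLP projector: maps each frame token independently and keeps frame order."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, frame_tokens):              # (B, T, vision_dim), T in temporal order
        return self.mlp(frame_tokens)             # (B, T, llm_dim), order unchanged

# A Q-Former, by contrast, pools all frames into a fixed set of learned queries,
# which is where the paper locates the loss of arrow-of-time information.
x = torch.randn(2, 32, 1024)                      # 32 frames from a video-centric encoder
y = TimePreservedProjector()(x)
```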

Result

Video-centric encoders encode strong temporal signals, but performance collapses after passing through a standard architecture; Q-Former disrupts temporal information while an MLP projector preserves it; the proposed model reaches 98.1% accuracy on AoT_PPB (surpassing humans) and gains 6.0 and 1.3 points on VITATECS-Direction and TVBench, respectively.

Conclusion

Temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of that information to the LLM; projector design is the key bottleneck.

No. 3

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu, Shuning Wang, Shuo Nie, Naiming Liu, Qifeng Chen, Yangqiu Song, Xiaomeng Li

Source: arXiv

Published: 2026-05-07

Updated: 2026-05-07

TLDR: Proposes the MedHorizon benchmark for evaluating MLLMs on sparse-evidence understanding and multi-hop reasoning over long medical procedure videos; the best model reaches only 41.1% accuracy, exposing retrieval and reasoning bottlenecks.


Abstract

Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.

Motivation

Existing benchmarks assume the evidence has already been localized (e.g., images or short clips) and do not test a model's ability to retrieve sparse evidence from full-length clinical videos and reason over it.

Method

Builds a benchmark of 759 hours of full-length clinical procedure videos and 1,253 evidence-grounded multiple-choice questions; evidence frames account for only 0.166% of frames on average, requiring models to search noisy procedural streams and aggregate findings.

Result

The best model reaches only 41.1% accuracy; performance does not scale reliably with more frames; evidence retrieval and clinical interpretation are the main bottlenecks, rooted in weak procedural reasoning and attention drift under redundancy.

Conclusion

Current MLLMs remain far from robust full-procedure medical video understanding; MedHorizon provides a rigorous testbed for retrieving sparse evidence and reasoning over complete clinical workflows.

Multimodal Retrieval and RAG

3 papers

No. 1

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

Jiaju Han, Chao Li, Chengyin Hu, Qike Zhang, Xuemeng Sun, Xin Wang, Fengyu Zhang, Xiang Chen, Yiwei Wei, Jiahuan Long, Jiujiang Guo

Source: arXiv

Published: 2026-05-08

Updated: 2026-05-08

Class: cs.CV · cs.AI

TLDR本文提出CloudWeb攻击,通过向遥感图像叠加优化的云/雾模式,在检索阶段劫持多模态RAG的证据检索,使检索器返回目标大气相关文本,并影响下游生成。


Abstract

Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.

Motivation

Existing adversarial attacks mainly target the retrieval corpus or end-task predictions; input-space threats to the retrieval stage of remote sensing multimodal RAG remain unexplored.

Method

CloudWeb overlays parameterized cloud/haze patterns on the input image and optimizes them with a retrieval-oriented objective: pulling the adversarial embedding toward target atmospheric evidence, suppressing source-scene evidence, enforcing rank separation, and regularizing naturalness and coverage.
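
A hedged sketch of such a retrieval-oriented objective; the margin, the weights, and the L1 naturalness term are assumptions, and the paper's coverage regularizer is not modeled.

```python
import torch
import torch.nn.functional as F

def cloudweb_style_loss(img_emb, target_txt_embs, source_txt_embs, overlay,
                        margin=0.1, w_nat=0.01):
    """Illustrative retrieval-hijacking objective.

    img_emb:         (D,) embedding of the image with the cloud/haze overlay applied
    target_txt_embs: (Nt, D) embeddings of target atmospheric evidence passages
    source_txt_embs: (Ns, D) embeddings of the original scene's evidence passages
    overlay:         the cloud/haze perturbation tensor being optimized
    """
    img = F.normalize(img_emb, dim=-1)
    tgt = F.normalize(target_txt_embs, dim=-1)
    src = F.normalize(source_txt_embs, dim=-1)
    s_tgt = (img @ tgt.T).mean()                    # pull toward target atmospheric evidence
    s_src = (img @ src.T).mean()                    # suppress source-scene evidence
    rank = F.relu(margin - (s_tgt - s_src))         # enforce rank separation by a margin
    naturalness = overlay.abs().mean()              # keep the overlay small / cloud-like
    return -s_tgt + s_src + rank + w_nat * naturalness
```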

Result

Across seven datasets and five CLIP-style retrievers, CloudWeb consistently outperforms the baselines; on GeoRSCLIP, Weather@5 rises from 0.71% to 43.29%, and downstream generation exhibits weather hallucination and semantic shift.

Conclusion

Natural-looking atmospheric changes can compromise evidence retrieval before generation begins, revealing a practical retrieval-stage failure mode in multimodal RAG.

No. 2

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

Hoin Jung, Xiaoqian Wang

Source: arXiv

Published: 2026-05-07

Updated: 2026-05-07

Class: cs.CL · cs.CV · cs.LG

TLDR本文发现并形式化了多模态大语言模型在检索增强生成中的“再污染”现象:即使引入完全正确的上下文,模型也会放弃原本正确的预测。通过注意力机制诊断,揭示了视觉盲区和位置偏差的双重注意力崩溃机制,并提出无需训练的推理时干预框架BAIR,恢复视觉显著性并惩罚位置偏差,在医疗、公平性和地理基准上提升了多模态基础能力。


Abstract

While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.

Motivation

Prior work focuses on RAG's ability to mitigate hallucinations but overlooks the severe instance-level failure mode that external documents can conceal: the "recorruption" phenomenon, where the model errs even when given perfect context.

Method

A mechanistic diagnosis of the internal attention matrices identifies a two-fold attentional collapse: visual blindness (systemic suppression of visual attention mass and sharpness) and a structural positional bias (the model prioritizes boundary tokens over semantic relevance). The BAIR framework intervenes at inference time, without training, by restoring visual saliency and applying position-aware penalties to textual distractors.
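
A minimal sketch of what such an inference-time intervention could look like on a single attention distribution; the boost/penalty factors and the way boundary tokens are identified are assumptions, not BAIR's actual schedule.

```python
import torch

def bair_style_reweight(attn_row, visual_mask, boundary_mask, alpha=2.0, beta=0.5):
    """Illustrative reweighting of one query token's attention over the context.

    attn_row:      (L,) attention weights over image tokens and retrieved text tokens
    visual_mask:   (L,) bool, True at image-token positions
    boundary_mask: (L,) bool, True at the first/last text tokens of each retrieved document
    """
    w = attn_row.clone()
    w[visual_mask] *= alpha          # restore visual saliency
    w[boundary_mask] *= beta         # position-aware penalty on boundary text tokens
    return w / w.sum()               # renormalize to a valid distribution

attn = torch.tensor([0.05, 0.05, 0.4, 0.3, 0.2])
vis = torch.tensor([True, True, False, False, False])
bnd = torch.tensor([False, False, True, False, True])
print(bair_style_reweight(attn, vis, bnd))
```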

Result

On medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability.

Conclusion

Recorruption in RAG stems from attentional collapse; BAIR, as a lightweight intervention, repairs it effectively without retraining or fine-tuning the model.

No. 3

JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication

Nan Lu, Leyang Li, Yurong Hu, Rui Lin, Shaoyi Xu

Source: arXiv

Published: 2026-02-13

Updated: 2026-05-07

TLDR: Proposes the JARVIS framework, which improves the generalization and interpretability of deceptive review detection through hybrid retrieval and evidence graphs, significantly boosting performance and reducing manual cost.


Abstract

Deceptive reviews refer to fabricated feedback designed to artificially manipulate the perceived quality of products. Within modern e-commerce ecosystems, these reviews remain a critical governance challenge. Despite advances in review-level and graph-based detection methods, two pivotal limitations remain: inadequate generalization and lack of interpretability. To address these challenges, we propose JARVIS, a framework providing Judgment via Augmented Retrieval and eVIdence graph Structures. Starting from the review to be evaluated, it retrieves semantically similar evidence via hybrid dense-sparse multimodal retrieval, expands relational signals through shared entities, and constructs a heterogeneous evidence graph. A large language model then performs evidence-grounded adjudication to produce interpretable risk assessments. Offline experiments demonstrate that JARVIS enhances performance on our constructed review dataset, achieving a precision increase from 0.953 to 0.988 and a recall boost from 0.830 to 0.901. In the production environment, our framework achieves a 27% increase in the recall volume and reduces manual inspection time by 75%. Furthermore, the adoption rate of the model-generated analysis reaches 96.4%.

Motivation

Existing deceptive review detection methods suffer from two major limitations: inadequate generalization and lack of interpretability.

Method

Starting from the review under evaluation, hybrid dense-sparse multimodal retrieval collects semantically similar evidence, shared entities expand the relational signals into a heterogeneous evidence graph, and a large language model performs evidence-grounded adjudication.
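
A hedged sketch of the dense-sparse fusion step only; the min-max normalization and the 0.6/0.4 mix are assumptions, not JARVIS's exact recipe.

```python
import numpy as np

def hybrid_scores(dense_sims, sparse_scores, w_dense=0.6):
    """Illustrative dense-sparse fusion: min-max normalize each score list, then mix.

    dense_sims:    cosine similarities from an embedding retriever
    sparse_scores: lexical scores (e.g., BM25) over the same candidate reviews
    """
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return w_dense * norm(dense_sims) + (1 - w_dense) * norm(sparse_scores)

# Top-ranked candidates become evidence nodes; shared entities (e.g., the same seller or
# shipping address) then link them into the heterogeneous evidence graph for adjudication.
scores = hybrid_scores([0.82, 0.40, 0.77], [12.1, 3.4, 9.8])
top = np.argsort(scores)[::-1]
```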

Result

In offline experiments, precision improves from 0.953 to 0.988 and recall from 0.830 to 0.901; in production, recall volume increases by 27%, manual inspection time drops by 75%, and the adoption rate of model-generated analysis reaches 96.4%.

Conclusion

The JARVIS framework effectively improves the accuracy and interpretability of deceptive review detection and has practical application value.