Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
TLDR: Proposes Proxy3D, which extracts compact 3D proxy representations from video frames via semantic-aware clustering, replacing conventional pixel-aligned or implicit 3D understanding, and achieves SOTA spatial-reasoning performance with shorter vision sequences.
Abstract
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
Motivation
Existing VLMs face two problems in 3D spatial reasoning: correspondence-based models lack spatial consistency, while models with 3D geometric priors serialize vision sequences inefficiently. A compact yet comprehensive 3D visual representation is needed.
Method
Taking video frames as input, semantic and geometric encoders extract scene features; semantic-aware clustering then yields proxy points in 3D space. The SpaceSpan dataset is curated, and multi-stage training aligns the proxy representations with the VLM.
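The clustering step above can be sketched in code. This is a minimal illustration, not the paper's exact algorithm: it assumes each scene point carries a 3D coordinate plus a semantic feature vector, and uses a plain k-means over a jointly normalized geometry-semantics space; the `alpha` balancing knob and the function name `cluster_proxies` are hypothetical.

```python
import numpy as np

def cluster_proxies(xyz, feats, k=16, iters=10, alpha=0.5, seed=0):
    """Semantic-aware clustering sketch: group N scene points into k proxies
    in a joint space of 3D coordinates and semantic features.
    `alpha` (hypothetical knob) balances geometry vs. semantics."""
    rng = np.random.default_rng(seed)
    # Normalize each space so neither dominates the distance metric.
    g = (xyz - xyz.mean(0)) / (xyz.std(0) + 1e-8)
    s = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    joint = np.concatenate([alpha * g, (1 - alpha) * s], axis=1)
    # Standard k-means on the joint features.
    centers = joint[rng.choice(len(joint), k, replace=False)]
    for _ in range(iters):
        dist = ((joint[:, None] - centers[None]) ** 2).sum(-1)  # (N, k)
        assign = dist.argmin(1)
        for c in range(k):
            members = joint[assign == c]
            if len(members):
                centers[c] = members.mean(0)
    # Each proxy = mean 3D centroid + mean semantic feature of its cluster.
    proxy_xyz = np.stack([xyz[assign == c].mean(0) if (assign == c).any()
                          else xyz.mean(0) for c in range(k)])
    proxy_feat = np.stack([feats[assign == c].mean(0) if (assign == c).any()
                           else feats.mean(0) for c in range(k)])
    return proxy_xyz, proxy_feat, assign

# Toy usage: 500 points with 32-dim semantic features -> 16 proxies.
xyz = np.random.default_rng(1).normal(size=(500, 3))
feats = np.random.default_rng(2).normal(size=(500, 32))
px, pf, assign = cluster_proxies(xyz, feats, k=16)
print(px.shape, pf.shape)  # (16, 3) (16, 32)
```

The point of the sketch: the VLM then serializes the 16 proxies instead of hundreds of frame tokens, which is where the shorter-sequence efficiency claim comes from.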
Result
On 3D visual question answering, visual grounding, and general spatial intelligence benchmarks, the method achieves competitive or state-of-the-art performance with shorter vision sequences.
Conclusion
Proxy3D balances spatial consistency and sequence efficiency through compact 3D proxy representations, demonstrating a viable alternative to the conventional 2D pipeline.