cs.CV: 119 papers today
Large Models (15 papers)
【1】VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Link: https://arxiv.org/abs/2510.09607
Comments: Homepage: this https URL
Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement), demonstrating that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
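The stage-1 alignment described in the abstract can be illustrated with a toy sketch. Everything below (the affine projector, the dimensions, the MSE objective) is an assumption for illustration, not the authors' released code:

```python
# Toy sketch (assumed names/dimensions): a linear projector maps a VLM hidden
# state into the small action model's action space, trained to match the
# teacher's action embedding so the pretrained action decoder can be reused.

def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(hidden, weights, bias):
    # simple affine map from hidden_dim inputs to action_dim outputs
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

# toy dimensions: hidden_dim=4 -> action_dim=2
hidden_state = [0.5, -1.0, 0.25, 0.0]   # VLM hidden state at the action token
teacher_embedding = [0.2, -0.3]         # target from the small action model
W = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.2]]
b = [0.0, 0.0]

student_embedding = project(hidden_state, W, b)
loss = mse(student_embedding, teacher_embedding)  # stage-1 alignment loss
```

Minimizing this loss over the projector (and, in stage 2, selected VLM modules) is the distillation signal the framework describes.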
【2】Vision Language Models: A Survey of 26K Papers
Link: https://arxiv.org/abs/2510.09586
Comments: VLM/LLM Learning Notes
Abstract: We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation (prompting, adapters, LoRA) and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
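The normalize / phrase-protect / match pipeline the survey describes can be sketched roughly as follows; the lexicon entries, labels, and normalization rules are invented for illustration and are not the paper's released lexicon:

```python
# Illustrative sketch of lexicon-based topic labeling: normalize text, protect
# multi-word phrases so they match as single units, then assign topic labels.

import re

LEXICON = {  # toy lexicon; the paper's lexicon assigns up to 35 topical labels
    "vision language model": "VLM",
    "gaussian splatting": "3D",
    "diffusion": "generative",
}

def normalize(text):
    # lowercase and collapse punctuation (so "Vision-Language" matches)
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())

def assign_labels(title_abstract):
    text = normalize(title_abstract)
    # phrase protection: join known multi-word phrases with underscores
    for phrase in sorted(LEXICON, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = set(text.split())
    return sorted({label for phrase, label in LEXICON.items()
                   if phrase.replace(" ", "_") in tokens})

labels = assign_labels(
    "A Vision-Language Model with Diffusion Priors for Gaussian Splatting")
```

Phrase protection is what keeps "vision language model" from spuriously firing the standalone "language" or "model" cues.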
【3】D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Link: https://arxiv.org/abs/2510.09473
Abstract: The test-time adaptation paradigm provides flexibility under domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to mitigate dependence on the dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
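A minimal sketch of the core quantity, assuming "dimensional entropy" means the Shannon entropy of a feature's normalized magnitude distribution across dimensions (a reading of the abstract, not the paper's exact definition):

```python
# Dimensional entropy of a feature vector: treat normalized per-dimension
# magnitudes as a probability distribution and compute its Shannon entropy.
# Maximizing it pushes the feature toward uniform magnitudes, curbing reliance
# on a single dominant dimension.

import math

def dimensional_entropy(feature):
    mags = [abs(v) for v in feature]
    total = sum(mags)
    probs = [m / total for m in mags]
    return -sum(p * math.log(p) for p in probs if p > 0)

dominated = [10.0, 0.1, 0.1, 0.1]  # one dominant dimension -> low entropy
uniform = [1.0, 1.0, 1.0, 1.0]     # uniform magnitudes -> maximal entropy log(4)
```

Under this reading, the regularizer would add `-dimensional_entropy(text_feature)` to the test-time tuning loss so that optimization raises the entropy.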
【4】Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Link: https://arxiv.org/abs/2510.09358
Comments: EMNLP 2025. Code is available at this https URL
Abstract: Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have proven to have significant limitations in handling the challenging absent and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant train-test overlap. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
【5】Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Link: https://arxiv.org/abs/2510.09269
Abstract: Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in their presence. Specifically, based on the popular VLA benchmark LIBERO, we introduce BadLIBERO, which incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal on 97% of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.
【6】Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
Link: https://arxiv.org/abs/2510.09256
Comments: Code is available: this https URL
Abstract: To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
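The DSE computation described above reduces to entropy over semantic-cluster frequencies. A sketch, with the entailment-based clustering abstracted away (cluster assignments are given as input rather than derived from bidirectional entailment checks):

```python
# Discrete semantic entropy: sample N answers, group meaning-equivalent ones
# into clusters (done here upstream), and compute Shannon entropy over the
# relative cluster frequencies. High entropy means semantically inconsistent
# answers, so the question is rejected as hallucination-prone.

import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def should_reject(cluster_ids, threshold=0.3):
    # the paper's stricter filter uses DSE > 0.3
    return discrete_semantic_entropy(cluster_ids) > threshold

# 15 sampled answers mapped to semantic clusters after entailment grouping
consistent = ["pneumonia"] * 15                               # one cluster -> DSE = 0
uncertain = ["pneumonia"] * 5 + ["edema"] * 5 + ["effusion"] * 5  # DSE = log(3)
```

The diagnosis strings here are made-up placeholders; only the frequency structure matters for the entropy.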
【7】Zero-shot image privacy classification with Vision-Language Models
Link: https://arxiv.org/abs/2510.09253
Comments: 5 pages, 3 figures, 3 tables. This work has been submitted to ICASSP 2026
Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs according to a privacy benchmark, using task-aligned prompts, and contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
【8】Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
Link: https://arxiv.org/abs/2510.09230
Abstract: Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting people's health worldwide, with a high incidence among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the work into two tasks, action understanding and disease diagnosis, each completed by a separate MLLM. In addition to traditional evaluation metrics, this work proposes a novel metric, the Usability Index, which follows the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries is 79.6% higher than direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
【9】Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation
Link: https://arxiv.org/abs/2510.09224
Comments: Accepted in IEEE Transactions on Consumer Electronics 2025
Abstract: Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose TEMA-LLM (Tag-Enriched Multi-Attention with Large Language Models), a practical and effective framework that integrates Large Language Models (LLMs) for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A Tag-Enriched Multi-Attention mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
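The tag-enriched item representation can be sketched as a simple fusion step; the concatenation-plus-averaging scheme and toy dimensions below are illustrative assumptions, not the paper's exact fusion operator:

```python
# Toy sketch: LLM-generated tag embeddings are fused with the item ID embedding
# and textual/visual features into one enhanced item representation. Here the
# tag embeddings are averaged and all blocks are concatenated (an assumption).

def fuse_item_representation(id_emb, text_feat, visual_feat, tag_embs):
    # average the per-tag embeddings dimension-wise, then concatenate all blocks
    tag_avg = [sum(col) / len(tag_embs) for col in zip(*tag_embs)]
    return id_emb + text_feat + visual_feat + tag_avg

item = fuse_item_representation(
    id_emb=[0.1, 0.2],                      # item identifier embedding
    text_feat=[0.3],                        # textual feature
    visual_feat=[0.4, 0.5],                 # visual feature
    tag_embs=[[1.0, 0.0], [0.0, 1.0]],      # two LLM-generated tag embeddings
)
```

The fused vector is what the Tag-Enriched Multi-Attention mechanism would then attend over within and across domains.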
【10】Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
Link: https://arxiv.org/abs/2510.09094
Comments: Accepted by ICCV 2025
Abstract: Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
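The FFN-to-MoE replacement can be illustrated with a toy top-k routed layer; the router, the form of the experts, and k are assumptions for illustration, not the paper's configuration:

```python
# Toy MoE forward pass: a router scores all experts per token, but only the
# top-k experts are activated, so active parameters drop while total capacity
# (all experts' parameters) is preserved.

def moe_forward(x, experts, router_scores, k=2):
    # pick the top-k experts by router score, mix outputs by normalized score
    top = sorted(range(len(experts)), key=lambda i: router_scores[i],
                 reverse=True)[:k]
    z = sum(router_scores[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        for d in range(len(x)):
            out[d] += (router_scores[i] / z) * y[d]
    return out, top

# 8 toy "experts": each scales the input by a different factor
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
scores = [0.05, 0.1, 0.05, 0.4, 0.1, 0.2, 0.05, 0.05]
out, active = moe_forward([1.0, 2.0], experts, scores, k=2)  # 2 of 8 active
```

With 2 of 8 experts active, only a quarter of the expert parameters run per token, which is the kind of activated-parameter reduction the abstract reports for the FFNs.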
【11】On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Link: https://arxiv.org/abs/2510.09008
Abstract: Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, crucial challenges remain in LVLMs, such as object hallucination: generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently, and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with other existing techniques.
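The perturbation-based identification of uncertain visual tokens can be sketched as follows; the deviation measure (max absolute change) and the threshold are illustrative stand-ins for the paper's proxy method:

```python
# Toy sketch: visual tokens whose representations deviate most under a small
# perturbation are flagged as epistemically uncertain; the resulting mask
# would then exclude them from self-attention in the VE's middle layers.

def deviation(clean, perturbed):
    # worst-case per-dimension representation shift for one token
    return max(abs(c - p) for c, p in zip(clean, perturbed))

def uncertain_token_mask(clean_tokens, perturbed_tokens, threshold):
    return [deviation(c, p) > threshold
            for c, p in zip(clean_tokens, perturbed_tokens)]

clean = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.1]]
perturbed = [[0.1, 0.21], [0.9, 0.1], [0.91, 0.1]]  # token 1 shifts a lot
mask = uncertain_token_mask(clean, perturbed, threshold=0.1)
```

In the actual method the perturbation is adversarial and the masking happens inside attention; here only the flagging logic is shown.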
【12】Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Link: https://arxiv.org/abs/2510.08994
Abstract: As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
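The parallel verify-and-accept step can be illustrated with a toy token-matching version; the actual method uses a probabilistic criterion over model distributions, which is abstracted away here:

```python
# Toy speculative acceptance: a whole draft sequence is proposed, the model's
# parallel predictions are compared against it, and the agreeing prefix plus
# one corrected token is committed, so several tokens land per forward pass.

def accept_prefix(draft, parallel_predictions):
    accepted = []
    for d, p in zip(draft, parallel_predictions):
        if d == p:
            accepted.append(d)      # draft token verified
        else:
            accepted.append(p)      # correct the first mismatch, then stop
            break
    return accepted

draft = [5, 7, 9, 2]
preds = [5, 7, 3, 8]  # model agrees on the first two tokens
accepted = accept_prefix(draft, preds)  # three tokens committed in one pass
```

The unaccepted tail of the draft would then be refined along the denoising trajectory and re-verified in the next Jacobi iteration.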
【13】BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Link: https://arxiv.org/abs/2510.08759
Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
【14】Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Link: https://arxiv.org/abs/2510.08668
Abstract: Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. Medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for the 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question answering, medical report generation, and complex reasoning in multilingual and rare-disease scenarios. By open-sourcing our complete pipeline, we establish that a high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.
【15】Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes
Link: https://arxiv.org/abs/2510.08589
Abstract: The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments. To support continued progress in this area, we have made the code used to fine-tune the models available on our GitHub, enabling future improvements and reuse in related applications.
Transformer (1 paper)
【1】3D Reconstruction from Transient Measurements with Time-Resolved Transformer
链接:https://arxiv.org/abs/2510.09205
摘要:由时间分辨系统捕获的瞬态测量被广泛用于光子有效的重建任务,包括视线(LOS)和非视线(NLOS)成像。然而,由于传感器的低量子效率和高噪声水平,特别是对于长距离或复杂场景,其3D重建仍然存在挑战。为了提高光子高效成像的三维重建性能,我们提出了一种通用的时间分辨Transformer(TRT)架构。与现有的针对高维数据设计的Transformers不同,TRT有两个针对时空瞬态测量的精心设计的注意力。具体而言,时空自注意编码器通过将输入特征拆分或下采样到不同尺度来探索瞬态数据内的局部和全局相关性。然后,时空交叉注意解码器在令牌空间中集成局部和全局特征,从而产生具有高表示能力的深度特征。在TRT的基础上,我们开发了两个任务特定的实施例:用于LOS成像的TRT-LOS和用于NLOS成像的TRT-NLOS。广泛的实验表明,两个实施例在由不同成像系统捕获的合成数据和真实世界数据上显著优于现有方法。此外,我们贡献了一个大规模的,高分辨率的合成LOS数据集与各种噪声水平,并捕捉一组现实世界的NLOS测量使用定制的成像系统,提高了该领域的数据多样性。代码和数据集可在https://github.com/Depth2World/TRT上获得。
摘要:Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross-attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.
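摘要中"将输入特征拆分或下采样到不同尺度"的多尺度思路,可以用下面的Python示意来理解:对瞬态测量体(T×H×W)做逐级2×2空间平均池化,得到供自注意力分别建模局部与全局相关性的金字塔。形状与池化方式仅为本示意的假设,并非论文的具体实现。

```python
import numpy as np

def multiscale_pyramid(transient, num_scales=3):
    """Build a multi-scale pyramid of a transient measurement volume.

    transient: array of shape (T, H, W) -- time bins x spatial dims.
    Each coarser scale halves the spatial resolution via 2x2 average
    pooling, mimicking the downsampling that exposes global context.
    (Shapes and pooling choice are illustrative assumptions.)
    """
    scales = [transient]
    cur = transient
    for _ in range(num_scales - 1):
        T, H, W = cur.shape
        # 2x2 spatial average pooling; time axis is kept intact
        cur = cur.reshape(T, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
        scales.append(cur)
    return scales

vol = np.random.rand(16, 32, 32)   # 16 time bins, 32x32 spatial grid
pyr = multiscale_pyramid(vol)
print([s.shape for s in pyr])      # [(16, 32, 32), (16, 16, 16), (16, 8, 8)]
```

每个尺度的特征随后可各自展平成令牌序列送入对应的自注意力编码器。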
生成|GAN相关(13篇)
【1】Few-shot multi-token DreamBooth with LoRa for style-consistent character generation
标题:具有LoRa的Few-Shot多令牌DreamBooth,用于风格一致的角色生成
链接:https://arxiv.org/abs/2510.09475
摘要:视听行业正在经历一场深刻的变革,因为它正在整合人工智能的发展,不仅使日常任务自动化,而且还激发了新的艺术形式。本文讨论了产生几乎无限数量的新颖角色的问题,这些角色保留了一小部分人类设计的参考角色的艺术风格和共同的视觉特征,从而拓宽了动画、游戏及相关领域的创作可能性。我们的解决方案基于DreamBooth,这是一种成熟的文本到图像扩散模型的微调技术,并使其适应两个核心挑战:捕获文本提示之外的复杂角色细节和训练数据的Few-Shot性质。为了实现这一目标,我们提出了一种多令牌策略,使用聚类将单独的令牌分配给各个角色及其集体风格,并结合基于LoRA的参数高效微调。通过删除特定于类的正则化集并在生成过程中引入随机标记和嵌入,我们的方法允许无限的角色创建,同时保留学习到的风格。我们在五个小型专业数据集上评估了我们的方法,并使用定量指标和人类评估研究将其与相关基线进行了比较。我们的研究结果表明,我们的方法产生了高质量、多样化的角色,同时保留了参考角色的独特美学特征,人类评价进一步证实了其有效性,并突出了我们方法的潜力。
摘要:The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
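多令牌策略中"使用聚类将单独的令牌分配给各个角色"这一步,可以用如下极简的k-means示意来理解。聚类所用特征、令牌格式与数量均为本示意的假设,并非原方法的细节:

```python
import numpy as np

def assign_character_tokens(embeddings, k=2, iters=20, seed=0):
    """Cluster character image embeddings with a tiny k-means and assign
    one placeholder token per cluster, plus a shared style token.

    A minimal sketch of the multi-token idea; the real method's
    clustering features and token vocabulary are assumptions here.
    """
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center, then update centers
        d = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    tokens = [f"<char{j}>" for j in labels]   # hypothetical token names
    return labels, tokens, "<style>"

# two well-separated toy "characters" in a 4-d embedding space
emb = np.vstack([np.zeros((3, 4)), np.ones((3, 4))])
labels, tokens, style = assign_character_tokens(emb)
```

训练时,每张参考图的提示词即可由其所属簇的角色令牌与共享的风格令牌组合而成。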
【2】Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
标题:识别并交互式完善数据可视化代码生成的模糊用户目标
链接:https://arxiv.org/abs/2510.09390
摘要:建立共同的目标是人类与AI沟通的基本步骤。然而,歧义可能导致输出看起来正确,却不能反映说话者的意图。在本文中,我们探讨这个问题,重点放在数据可视化领域,在该领域中自然语言的歧义会影响可视化数据的代码生成。同一上下文的多个视图(例如,预期的图和绘制该图的代码)使我们能够对各种歧义类型进行独特而全面的分析。我们为这项任务中出现的歧义类型建立了一个分类体系,并提出了量化它们的指标。使用DS-1000数据集中的Matplotlib问题,我们证明了我们的歧义度量比不确定性基线更好地与人类注释相关。我们的工作还探讨了多轮对话如何减少歧义,从而通过更好地匹配用户目标来提高代码准确性。我们评估了三种语用模型来指导我们的对话策略:格莱斯合作原则、话语表征理论和讨论中的问题(Questions under Discussion)。一项模拟用户研究揭示了语用对话如何减少歧义并提高代码准确性,突显了多轮交流在代码生成中的价值。
摘要:Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views of the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and therefore improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.
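作为"提出指标来量化歧义"的一个极简示意:把同一提示多次采样得到的不同"解释"的离散分布熵作为歧义度量,熵为0表示无歧义,越高表示歧义越大。如何将生成结果规范化为可比较的解释(如规范化的绘图规格或代码哈希)是本示意的假设,并非论文的具体指标:

```python
import math
from collections import Counter

def ambiguity_entropy(interpretations):
    """Shannon entropy (bits) over distinct interpretations sampled for
    one prompt: 0 means unambiguous, higher means more ambiguous.

    'Interpretations' could be canonicalized plot specs or hashes of the
    rendered code; the canonicalization step is assumed external.
    """
    counts = Counter(interpretations)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = ambiguity_entropy(["bar", "bar", "bar", "bar"])        # one reading
high = ambiguity_entropy(["bar", "line", "bar", "line"])     # two readings
```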
【3】Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
标题:稳定视频无限:具有错误回收的无限长度视频生成
链接:https://arxiv.org/abs/2510.09212
备注:Project Page: this https URL
摘要:我们提出了Stable Video Infinity(SVI),它能够生成具有高时间一致性、合理场景转换和可控流式故事情节的无限长度视频。虽然现有的长视频方法试图通过手工设计的抗漂移手段(例如,修改的噪声调度器、帧锚定)来缓解累积误差,但它们仍然局限于单提示外推,产生具有重复运动的同质场景。我们发现,根本的挑战超出了误差积累,还在于训练假设(看到干净的数据)和测试时自回归现实(以自我生成的、容易出错的输出为条件)之间的关键差异。为了弥合这一假设差距,SVI引入了错误回收微调(Error-Recycling Fine-Tuning),这是一种新型的高效训练方式,它将扩散Transformer(DiT)自我产生的错误纳入监督信号,从而鼓励DiT主动识别和纠正自己的错误。这是通过闭环循环注入、收集和存储错误来实现的,自回归地从错误注入的反馈中学习。具体来说,我们(i)注入DiT产生的历史错误以干预干净的输入,模拟流匹配中的误差累积轨迹;(ii)用一步双向积分高效地近似预测,并用残差计算误差;(iii)动态地将误差存储到跨离散时间步的重放存储器中,并对新输入进行重新采样。SVI能够在不增加推理成本的情况下将视频从数秒扩展到无限时长,同时保持与各种条件(例如,音频、骨架和文本流)的兼容性。我们在一致性、创造性和条件设置三个基准上评估SVI,充分验证了其多功能性和最先进的性能。
摘要:We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
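"动态地将误差存储到跨离散时间步的重放存储器中,并对新输入进行重新采样"这一机制,可以用如下示意性的重放内存来理解。容量、注入比例与分桶粒度均为本示意的假设:

```python
import random
from collections import defaultdict, deque

class ErrorReplayMemory:
    """Bank self-generated residual errors per discretized timestep and
    resample them to perturb future clean inputs.

    A minimal sketch of the closed-loop error-recycling idea; capacity,
    injection scale, and bucketing granularity are assumptions.
    """
    def __init__(self, capacity=100, seed=0):
        self.buckets = defaultdict(lambda: deque(maxlen=capacity))
        self.rng = random.Random(seed)

    def bank(self, timestep, residual):
        # store a residual error observed at this (discretized) timestep
        self.buckets[timestep].append(residual)

    def inject(self, timestep, clean, scale=0.1):
        # perturb a clean input with a resampled stored error
        bucket = self.buckets.get(timestep)
        if not bucket:
            return clean  # no stored error yet: train on the clean input
        err = self.rng.choice(list(bucket))
        return [c + scale * e for c, e in zip(clean, err)]

mem = ErrorReplayMemory()
mem.bank(5, [1.0, -2.0])           # residual collected at timestep 5
noisy = mem.inject(5, [0.0, 0.0])  # error-injected input for training
```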
【4】Instance-Level Generation for Representation Learning
标题:用于表示学习的实例级生成
链接:https://arxiv.org/abs/2510.09171
摘要:实例级识别(ILR)专注于识别单个对象,而不是广泛的类别,在图像分类中提供最高的粒度。然而,这种细粒度的性质使得创建大规模注释数据集具有挑战性,限制了ILR在各个领域的实际应用。为了克服这一点,我们引入了一种新的方法,该方法在不同的条件和背景下从多个域综合生成不同的对象实例,形成大规模的训练集。与之前的自动数据合成工作不同,我们的方法是第一个解决ILR特定挑战的方法,而不依赖于任何真实图像。在生成的数据上微调基础视觉模型,可以显著提高跨多个领域的七个ILR基准的检索性能。我们的方法为广泛的数据收集和管理提供了一种新的,高效的,有效的替代方案,引入了一种新的ILR范式,其中唯一的输入是目标域的名称,解锁了广泛的现实世界的应用程序。
摘要:Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR's real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.
【5】MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation
标题:MSDM:使用多模态条件扩散模型生成特定任务的病理图像,用于细胞和细胞核分割
链接:https://arxiv.org/abs/2510.09121
摘要:注释数据的稀缺性,特别是对于罕见或非典型形态,对计算病理学中的细胞和细胞核分割提出了重大挑战。虽然手动注释是劳动密集型和昂贵的,合成数据提供了一个具有成本效益的替代方案。我们介绍了一个多模态语义扩散模型(MSDM),用于为细胞和细胞核分割生成逼真的像素级精确的图像-掩码对。通过用细胞/细胞核形态(使用水平和垂直图)、RGB颜色特征和BERT编码的测定/指示元数据调节生成过程,MSDM可生成具有所需形态特性的数据集。这些异构模态通过多头交叉注意力集成,从而实现对生成图像的细粒度控制。定量分析表明,合成图像与真实数据非常匹配,在匹配的生物条件下,生成图像和真实图像的嵌入之间的Wasserstein距离较低。加入这些合成样本(以柱状细胞为例)显著提高了分割模型在柱状细胞上的准确性。该策略系统地丰富了数据集,直接针对模型缺陷。我们强调了基于多模态扩散的数据增强对于提高细胞和细胞核分割模型的鲁棒性和可推广性的有效性,从而为生成模型在计算病理学中更广泛的应用铺平了道路。
摘要:Scarcity of annotated data, particularly for rare or atypical morphologies, presents significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasets with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches datasets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models, thereby paving the way for broader application of generative models in computational pathology.
【6】MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
标题:MMAudioSep:驯服视频到音频生成模型以实现视频/文本查询声音分离
链接:https://arxiv.org/abs/2510.09065
备注:4 pages, 4 figures, 2 tables
摘要:我们介绍了MMAudioSep,一个基于预训练视频到音频模型构建的、用于视频/文本查询声音分离的生成模型。通过利用预训练音频生成模型所学到的视频/文本与音频之间关系的知识,我们可以更高效地训练模型,即模型不需要从头开始训练。我们通过将MMAudioSep与现有的分离模型(包括基于确定性和生成方法的模型)进行比较来评估其性能,发现它优于基线模型。此外,我们证明,即使在通过微调获得声音分离功能后,该模型仍保留了原始视频到音频生成的能力。这突出了基础声音生成模型被用于声音相关下游任务的潜力。我们的代码可在https://github.com/sony/mmaudiosep上获得。
摘要:We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
【7】Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
标题:更好、更快的自回归图像生成:从熵的角度
链接:https://arxiv.org/abs/2510.09012
备注:Code is available at this https URL
摘要:在这项工作中,我们首先重新审视当前自回归(AR)图像生成模型中的采样问题,并发现图像令牌与文本令牌不同,表现出较低的信息密度和非均匀的空间分布。因此,我们提出了一种熵引导的解码策略,在实现更快合成速度的同时提升自回归生成质量。具体而言,所提出的方法引入了两个主要创新:1)由令牌分布的空间熵引导的动态温度控制,在不增加额外计算开销的情况下,增强了基于掩码和逐尺度(scale-wise)模型中内容多样性、对齐精度和结构一致性之间的平衡;以及2)推测解码中的熵感知接受规则,以约为传统加速方法85%的推理成本实现近乎无损的生成。使用不同的AR图像生成模型在多个基准测试上进行的广泛实验,证明了我们的方法在提升生成质量和采样速度方面的有效性和通用性。
摘要:In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85\% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
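"由令牌分布的空间熵引导的动态温度控制"可以用如下示意理解:将分布熵线性映射到采样温度,低熵(结构化、高置信)区域保守采样,高熵(纹理、不确定)区域更多样地采样。线性映射与(0.8, 1.2)区间为本示意的假设,并非论文标定值:

```python
import numpy as np

def entropy_temperature(probs, t_low=0.8, t_high=1.2):
    """Map the entropy of a token distribution to a sampling temperature.

    Low-entropy distributions get a temperature near t_low (conservative
    sampling); maximum-entropy distributions get t_high (diverse
    sampling). The linear map and range are illustrative assumptions.
    """
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    h = -np.sum(np.where(p > 0, p * np.log(p), 0.0))  # Shannon entropy
    h_max = np.log(len(p))                            # uniform-dist entropy
    return t_low + (t_high - t_low) * (h / h_max)

peaked = [0.97, 0.01, 0.01, 0.01]   # confident token: near t_low
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain: exactly t_high
```

采样时,该温度再用于缩放logits(logits / T)后做softmax采样。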
【8】SegTrans: Transferable Adversarial Examples for Segmentation Models
标题:SegTrans:细分模型的可转移对抗示例
链接:https://arxiv.org/abs/2510.08922
备注:Accepted by TMM 2025
摘要:分割模型在白盒设置中对对抗性示例表现出显著的脆弱性,但现有的对抗性攻击方法通常在不同分割模型之间表现出较差的可移植性。虽然一些研究人员已经探索了基于转移的对抗性攻击(即,转移攻击)方法,这些模型内复杂的上下文依赖关系以及代理模型和目标模型之间的特征分布间隙导致不能令人满意的转移成功率。为了解决这些问题,我们提出了SegTrans,一种新的传输攻击框架,将输入样本划分为多个局部区域,并重新映射其语义信息以生成不同的增强样本。这些增强的样本取代了用于扰动优化的原始样本,从而提高了对抗性样本在不同分割模型之间的可移植性。与现有方法不同,SegTrans只保留原始输入的局部语义信息,而不是使用全局语义信息来优化扰动。在两个基准数据集PASCAL VOC和Cityscapes、四个不同的分割模型和三个骨干网络上进行的大量实验表明,SegTrans在不引入额外计算开销的情况下显著提高了对抗性传输的成功率。与目前最先进的方法相比,SegTrans在传输攻击成功率方面平均提高了8.55%,计算效率提高了100%以上。
摘要:Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
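"将输入样本划分为多个局部区域,并重新映射其语义信息"的思路,可以用一个简化的网格打乱来示意:增强样本只保留局部语义而打破全局布局。真实方法的区域选择与语义重映射远比简单打乱更丰富,此处仅为假设性示意:

```python
import numpy as np

def remap_local_regions(img, grid=2, seed=0):
    """Split an image into grid x grid local regions and permute them,
    producing an 'enhanced sample' that keeps local semantics while
    discarding the global layout.

    A simplified stand-in for SegTrans's remapping step; region shapes
    and the permutation scheme are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    H, W = img.shape[:2]
    h, w = H // grid, W // grid
    tiles = [img[i*h:(i+1)*h, j*w:(j+1)*w].copy()
             for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(tiles))
    out = img.copy()
    for k, idx in enumerate(order):
        i, j = divmod(k, grid)
        out[i*h:(i+1)*h, j*w:(j+1)*w] = tiles[idx]
    return out

img = np.arange(16).reshape(4, 4)
aug = remap_local_regions(img)   # same pixels, remapped local regions
```

这类增强样本随后替代原始样本参与扰动优化,以避免扰动过拟合代理模型的全局语义。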
【9】SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
标题:SAFER-AiD:跳眼辅助中央凹周边视觉增强重建对抗防御
链接:https://arxiv.org/abs/2510.08761
摘要:对抗性攻击严重挑战了深度学习模型的安全部署,特别是在现实世界的应用中。传统的防御通常依赖于计算密集型优化(例如,对抗性训练或数据增强)来提高鲁棒性,而人类视觉系统通过进化的生物机制来实现对抗性扰动的固有鲁棒性。我们假设,注意力引导的非均匀稀疏采样和预测编码在这种鲁棒性中起着关键作用。为了验证这一假设,我们提出了一个新的防御框架,结合了三个关键的生物机制:中央凹周边处理,扫视眼球运动,皮层填充。我们的方法采用强化学习引导的扫视来选择性地捕获多个中央凹周边瞥见,这些瞥见在分类之前被整合到重建图像中。这种生物启发的预处理有效地减轻了对抗性噪声,保持了语义的完整性,特别是不需要对下游分类器进行重新训练或微调,从而实现了与现有系统的无缝集成。在ImageNet数据集上的实验表明,我们的方法提高了系统在不同分类器和攻击类型中的鲁棒性,同时与生物和非生物启发的防御技术相比,显着降低了训练开销。
摘要:Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention guided non-homogeneous sparse sampling and predictive coding plays a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.
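"中央凹周边处理"的非均匀稀疏采样可以示意如下:在注视点附近保留高分辨率裁剪,周边做4倍下采样再最近邻上采样,从而天然地抹去周边的高频对抗噪声。中央凹尺寸与下采样倍率为本示意的假设:

```python
import numpy as np

def foveated_glimpse(img, center, fovea=8):
    """Non-homogeneous sampling: a sharp foveal crop around the fixation
    point over a 4x-downsampled (then re-upsampled) periphery.

    Fovea size and the downsampling factor are illustrative assumptions;
    the real pipeline integrates several such glimpses via saccades.
    """
    H, W = img.shape
    # coarse periphery: 4x4 average pooling, then nearest-neighbor upsample
    coarse = img.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))
    periphery = coarse.repeat(4, axis=0).repeat(4, axis=1)
    y, x = center
    y0, x0 = max(0, y - fovea // 2), max(0, x - fovea // 2)
    out = periphery
    out[y0:y0 + fovea, x0:x0 + fovea] = img[y0:y0 + fovea, x0:x0 + fovea]
    return out

img = np.arange(1024, dtype=float).reshape(32, 32)
glimpse = foveated_glimpse(img, center=(16, 16))
```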
【10】Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
标题:用相机思考:以相机为中心的理解和生成的统一多模式模型
链接:https://arxiv.org/abs/2510.08673
备注:Project Page: this https URL
摘要:以相机为中心的理解和生成是空间智能的两个基石,但它们通常被孤立地研究。我们提出了Puffin,这是一个以相机为中心的统一多模态模型,可以沿着相机维度扩展空间感知。Puffin集成了语言回归和基于扩散的生成,从任意视点解释和创建场景。为了弥合相机和视觉语言之间的模态鸿沟,我们引入了一种将相机视为语言的新范式,从而实现"用相机思考"。这引导模型在跨几何上下文推理的同时,将空间接地的视觉线索与摄影术语对齐。Puffin在Puffin-4M上训练,这是一个由400万个视觉-语言-相机三元组组成的大规模数据集。我们同时引入全局相机参数和像素级相机图,从而实现灵活可靠的空间生成。实验表明,Puffin在以相机为中心的生成和理解方面优于专门的模型。通过指令调整,Puffin可泛化到各种跨视图任务,如空间想象、世界探索和摄影指导。我们将发布代码、模型、数据集管道和基准测试,以推进多模态空间智能研究。
摘要:Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
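"将相机视为语言"的核心思路,可以用把相机参数离散化为词元的示意来理解:连续的相机参数被分桶成语言式的符号,使多模态模型能像读文本一样"读"相机。分桶数、取值范围与词元格式均为本示意的假设,并非Puffin的实际词表:

```python
def camera_to_tokens(roll, pitch, fov, bins=8):
    """Discretize camera parameters into language-like tokens.

    Ranges (roll in [-180, 180], pitch in [-90, 90], fov in [20, 120]
    degrees), bin count, and token names are assumptions of this sketch.
    """
    def bucket(value, lo, hi):
        # clamp to a valid bin index in [0, bins - 1]
        return min(bins - 1, max(0, int((value - lo) / (hi - lo) * bins)))
    return [f"<roll_{bucket(roll, -180, 180)}>",
            f"<pitch_{bucket(pitch, -90, 90)}>",
            f"<fov_{bucket(fov, 20, 120)}>"]

tokens = camera_to_tokens(roll=0.0, pitch=-30.0, fov=60.0)
```

这样的词元即可与普通文本词元一同参与语言回归或作为扩散生成的条件。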
【11】A 3D Generation Framework from Cross Modality to Parameterized Primitive
标题:从交叉情态到参数化原始的3D生成框架
链接:https://arxiv.org/abs/2510.08656
摘要:AI驱动的3D模型生成的最新进展利用了跨模态,但生成具有光滑表面的模型并最大限度地减少存储开销仍然是挑战。本文介绍了一种新的多阶段框架,用于生成由参数化图元组成的3D模型,由文本和图像输入引导。在该框架下,提出了一种基于参数化基元的模型生成算法,该算法能够识别模型组成元素的形状特征,并将其替换为具有高质量曲面的参数化基元。此外,提出了相应的模型存储方法,在保证模型原始表面质量的同时,只保留参数化基元的参数。在虚拟场景数据集和真实场景数据集上的实验表明了该方法的有效性,在原始参数文件大小约为6 KB的情况下,倒角距离为0.003092,VIoU为0.545,F1-Score为0.9139,NC为0.8369。我们的方法特别适合简单模型的快速原型制作。
摘要:Recent advancements in AI-driven 3D model generation have leveraged cross modality, yet generating models with smooth surfaces and minimizing storage overhead remain challenges. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. Within the framework, a model generation algorithm based on parameterized primitives is proposed, which identifies the shape features of the model's constituent elements and replaces them with parameterized primitives that have high-quality surfaces. In addition, a corresponding model storage method is proposed, which preserves the original surface quality of the model while retaining only the parameters of the parameterized primitives. Experiments on a virtual scene dataset and a real scene dataset demonstrate the effectiveness of our method, achieving a Chamfer Distance of 0.003092, a VIoU of 0.545, an F1-Score of 0.9139 and an NC of 0.8369, with primitive parameter files approximately 6KB in size. Our approach is particularly suitable for rapid prototyping of simple models.
【12】Generating Sizing Fields for Mesh Generation via GCN-based Simplification of Adaptive Background Grids
标题:通过基于GCN的自适应背景网格简化生成网格尺寸场
链接:https://arxiv.org/abs/2510.08645
备注:28 pages, 9 figures, 2 tables
摘要:三角形背景网格上的尺寸场是控制非结构网格生成质量和效率的关键。然而,创建一个最佳的背景网格,是几何一致的,计算轻量级的,并没有像带状文物是一个重大的挑战。本文介绍了一种新的,自适应背景网格简化(ABGS)框架的基础上,图卷积网络(GCN)。我们将网格简化任务重新表述为边得分回归问题,并训练GCN模型以有效地预测最佳边折叠候选。该模型是由一个自定义的损失函数,从整体上考虑几何保真度和尺寸场的准确性。这种数据驱动的方法取代了昂贵的程序评估,加快了简化过程。实验结果表明,我们的框架在不同的和复杂的工程模型的有效性。与初始密集网格相比,我们简化的背景网格实现了74%-94%的元素减少,从而减少了35%-88%的大小字段查询时间。
摘要:The sizing field defined on a triangular background grid is pivotal for controlling the quality and efficiency of unstructured mesh generation. However, creating an optimal background grid that is geometrically conforming, computationally lightweight, and free from artifacts like banding is a significant challenge. This paper introduces a novel, adaptive background grid simplification (ABGS) framework based on a Graph Convolutional Network (GCN). We reformulate the grid simplification task as an edge score regression problem and train a GCN model to efficiently predict optimal edge collapse candidates. The model is guided by a custom loss function that holistically considers both geometric fidelity and sizing field accuracy. This data-driven approach replaces a costly procedural evaluation, accelerating the simplification process. Experimental results demonstrate the effectiveness of our framework across diverse and complex engineering models. Compared to the initial dense grids, our simplified background grids achieve an element reduction of 74%-94%, leading to a 35%-88% decrease in sizing field query times.
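"将网格简化任务重新表述为边得分回归问题"之后,回归出的边得分如何驱动边折叠,可以示意为按得分贪心选择、并跳过与已折叠顶点相邻的边。GCN打分模型本身在此省略,假设得分由外部给定:

```python
def select_collapse_candidates(edges, scores, budget):
    """Greedily select edge-collapse candidates from predicted scores.

    Take edges in descending score order, skipping any edge that touches
    a vertex already involved in an accepted collapse (collapsing two
    adjacent edges in one pass would invalidate the mesh).
    """
    chosen, used = [], set()
    for e, _ in sorted(zip(edges, scores), key=lambda p: -p[1]):
        u, v = e
        if u in used or v in used:
            continue
        chosen.append(e)
        used.update((u, v))
        if len(chosen) == budget:
            break
    return chosen

# a toy quad: 4 vertices, 4 boundary edges, with hypothetical GCN scores
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
scores = [0.9, 0.8, 0.7, 0.1]
picked = select_collapse_candidates(edges, scores, budget=2)
```

实际流程中,每轮折叠后会重新评估受影响边的得分再继续简化。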
【13】A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
标题:用于3D脑肿瘤MRI合成的生物物理条件生成框架
链接:https://arxiv.org/abs/2510.09365
摘要:磁共振成像(MRI)修复支持许多临床和研究应用。我们提出了首个以体素级连续肿瘤浓度为条件、用于合成高保真脑肿瘤MRI的生成模型。对于BraTS 2025修复挑战赛,我们通过将肿瘤浓度设置为零,使这种架构适应健康组织修复这一互补任务。我们的潜在扩散模型同时以组织分割和肿瘤浓度为条件,为肿瘤合成和健康组织修复生成3D空间相干且解剖学一致的图像。对于健康组织修复,我们实现了18.5的PSNR;对于肿瘤修复,我们实现了17.4。我们的代码可从以下网址获得:https://github.com/valentin-biller/ldm.git
摘要:Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
检测相关(8篇)
【1】FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
标题:FSP-DETR:Few-Shot原型寄生虫卵检测
链接:https://arxiv.org/abs/2510.09583
备注:10 pages, 3 Figures, 5 Tables. Under Review
摘要:生物医学环境中的目标检测从根本上受到标记数据稀缺和新类别或稀有类别频繁出现的限制。我们提出了FSP-DETR,一个在单一模型内实现鲁棒的Few-Shot检测、开集识别以及对未见生物医学任务泛化的统一检测框架。该方法建立在类无关的DETR骨干网络之上,从原始支持图像构造类原型,并使用增强视图和轻量级Transformer解码器学习嵌入空间。训练联合优化原型匹配损失、基于对齐的分离损失和KL散度正则化,以在监督稀缺的情况下改善判别性特征学习和校准。与之前孤立处理这些任务的工作不同,FSP-DETR具备推理时的灵活性,无需重新训练即可支持未见类识别、背景拒绝和跨任务适应。我们还介绍了一个包含20个寄生虫类别的新虫卵物种检测基准,并建立了标准化的评估协议。跨虫卵、血细胞和疟疾检测任务的广泛实验表明,FSP-DETR显著优于先前的Few-Shot和基于原型的检测器,特别是在少样本和开集场景中。
摘要:Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
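"从原始支持图像构造类原型"并按相似度匹配查询的做法,可以用如下通用的原型匹配示意来理解。FSP-DETR的具体损失函数与DETR特征未在此复现,温度tau与玩具数据均为假设:

```python
import numpy as np

def prototype_classify(support, support_labels, query, tau=10.0):
    """Build class prototypes as mean L2-normalized support embeddings
    and classify a query by cosine similarity (softmax with scale tau).

    A generic prototypical-matching sketch, not FSP-DETR's exact head.
    """
    support = np.asarray(support, float)
    support /= np.linalg.norm(support, axis=1, keepdims=True)
    classes = sorted(set(support_labels))
    protos = np.stack(
        [support[[i for i, l in enumerate(support_labels) if l == c]].mean(0)
         for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    q = np.asarray(query, float)
    q /= np.linalg.norm(q)
    logits = tau * protos @ q
    p = np.exp(logits - logits.max())
    return classes[int(np.argmax(p))], p / p.sum()

# toy 2-d embeddings for two hypothetical ova classes
support = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
labels = ["ova_A", "ova_A", "ova_B", "ova_B"]
pred, probs = prototype_classify(support, labels, query=[0.95, 0.05])
```

开集场景下,还可对最大相似度设阈值,将低于阈值的查询拒绝为背景或未知类。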
【2】TARO: Toward Semantically Rich Open-World Object Detection
标题:TARO:迈向语义丰富的开放世界对象检测
链接:https://arxiv.org/abs/2510.09173
备注:17 pages, 5 figures
摘要:现代物体检测器在很大程度上局限于"封闭世界"的假设,将它们限制在一组预定义的类中,并且在现实世界中遇到新物体时会带来风险。虽然开集检测方法旨在通过将此类实例识别为"未知"来解决此问题,但这通常是不够的。与其将所有未知对象视为单一类别,不如为它们分配更具描述性的子类别,从而改善安全关键场景下的决策。例如,在自动驾驶中,将物体识别为"未知动物"(需要紧急停车)与"未知碎片"(需要安全换道)远比仅仅识别为"未知"有用。为了弥合这一差距,我们引入了TARO,一种新的检测框架,不仅可以识别未知对象,还可以将它们分类到语义层次结构中的粗粒度父类。TARO采用了独特的体系结构,包括用于建模对象性(objectness)的基于sparsemax的头部、提供辅助监督的层次引导重标注组件,以及学习层次关系的分类模块。实验表明,TARO可以将高达29.9%的未知对象分类为有意义的粗类,显著减少未知和已知类之间的混淆,并在未知召回和已知mAP方面都取得了有竞争力的性能。代码将公开。
摘要:Modern object detectors are largely confined to a "closed-world" assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as 'Unknown', this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an 'Unknown Animal' (requiring an urgent stop) versus 'Unknown Debris' (requiring a safe lane change) is far more useful than just 'Unknown' in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
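摘要中"sparsemax-based head"所依赖的sparsemax(Martins & Astudillo, 2016)是logits到概率单纯形的欧氏投影:与softmax不同,它可以输出精确的零,这正是对象性头部能够把无关类别完全"关掉"的原因。下面是其标准闭式解的一个实现:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of logits onto the probability
    simplex. Unlike softmax it can assign exact zeros, letting an
    objectness head switch irrelevant classes fully off.
    """
    z = np.asarray(z, float)
    zs = np.sort(z)[::-1]            # sort logits in descending order
    cs = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    support = ks[1 + ks * zs > cs]   # indices satisfying the support condition
    k = support[-1]                  # size of the support set
    tau = (cs[k - 1] - 1) / k        # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.0, 0.0, -1.0])      # puts all mass on the first logit
```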
【3】SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
标题:SOS:合成对象片段改进检测、分割和接地
链接:https://arxiv.org/abs/2510.09110
备注:Project website: this https URL
摘要:视觉分组--通过实例分割、视觉定位和目标检测来实现--支撑着从机器人感知到照片编辑的应用。大型带注释的数据集成本高,覆盖范围有偏差,并且难以扩展。合成数据是有前景的,但往往缺乏灵活性、准确性和组合多样性。我们提出SOS,一个简单且可扩展的数据合成管道,基于以对象为中心的组合策略。它使用结构化布局先验和生成式重光照将高质量的合成对象片段粘贴到新图像中,从而生成准确而多样的掩码、边界框和指代表达式。在SOS的100000张合成图像上训练的模型在检测和定位任务上优于在更大的真实图像数据集(如GRIT(20M)和V3Det(200K))上训练的模型,在LVIS检测上实现了+10.9 AP,在gRefCOCO定位上实现了+8.4 $N_{\text{Acc}}$。SOS支持可控的数据集构建,并在低数据和封闭词汇设置中提高泛化能力。使用合成对象片段增强LVIS和COCO可以在各种真实数据规模上产生强大的性能,甚至在真实数据极其有限时获得更大的增益(例如,在LVIS实例分割上+3.83 $AP_{\text{rare}}$,在1%的COCO设置下+6.59 AP)。这种可控性还支持有针对性的数据生成,以应对视觉定位中具有挑战性的类内指代。
摘要:Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
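"将高质量的合成对象片段粘贴到新图像中……生成准确而多样的掩码、框"的合成式标注,可以用如下粘贴与边界框推导来示意:掩码粘贴到哪里,框和掩码标注就"免费"地随之确定。SOS中的布局先验与生成式重光照在此省略:

```python
import numpy as np

def paste_segment(background, segment, mask, top_left):
    """Paste an object segment into a background using its binary mask
    and return the composite plus the tight bounding box (x0, y0, x1, y1).

    A minimal sketch of compositing-based annotation; SOS's layout
    priors and generative relighting are omitted here.
    """
    out = background.copy()
    y, x = top_left
    h, w = mask.shape
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = np.where(mask > 0, segment, region)
    ys, xs = np.nonzero(mask)
    box = (x + xs.min(), y + ys.min(), x + xs.max() + 1, y + ys.max() + 1)
    return out, box

bg = np.zeros((8, 8), dtype=int)
seg = np.full((3, 3), 7)                               # toy "object" pixels
mask = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])     # plus-shaped mask
img, box = paste_segment(bg, seg, mask, top_left=(2, 3))
```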
【4】GL-DT: Multi-UAV Detection and Tracking with Global-Local Integration
标题:GL-DT:全球本地一体化的多无人机检测和跟踪
链接:https://arxiv.org/abs/2510.09092
摘要:随着无人机在军事侦察、环境监测等领域的广泛应用,对精确高效的多目标跟踪技术提出了迫切的需求。然而,复杂的背景、小尺度目标以及频繁的遮挡和相互作用,仍然在检测精度和轨迹连续性方面给现有方法带来挑战。为了解决这些问题,本文提出了全局-局部检测与跟踪(GL-DT)框架。该框架采用时空特征融合(STFF)模块对运动和外观特征进行联合建模,并结合全局-局部协同检测策略,有效提高了小目标检测能力。在此基础上,引入了JPTrack跟踪算法,以缓解ID切换和轨迹碎片化等常见问题。实验结果表明,该方法在保持实时性的同时,显著提高了多目标跟踪的连续性和稳定性,为无人机检测与跟踪技术的进步提供了有力支持。
摘要:The extensive application of unmanned aerial vehicles (UAVs) in military reconnaissance, environmental monitoring, and related domains has created an urgent need for accurate and efficient multi-object tracking (MOT) technologies, which are also essential for UAV situational awareness. However, complex backgrounds, small-scale targets, and frequent occlusions and interactions continue to challenge existing methods in terms of detection accuracy and trajectory continuity. To address these issues, this paper proposes the Global-Local Detection and Tracking (GL-DT) framework. It employs a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy, effectively enhancing small-target detection. Building upon this, the JPTrack tracking algorithm is introduced to mitigate common issues such as ID switches and trajectory fragmentation. Experimental results demonstrate that the proposed approach significantly improves the continuity and stability of MOT while maintaining real-time performance, providing strong support for the advancement of UAV detection and tracking technologies.
【5】Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array
标题:柔性微电极阵列机器人可靠植入的视觉异常检测
链接:https://arxiv.org/abs/2510.09071
备注:Accept by IROS 2025
摘要:柔性微电极(FME)探头具有纤维状的可变形结构,且需与关键生物组织相互作用,因此将FME植入大脑皮层具有挑战性。为确保可靠性和安全性,应仔细监测植入过程。本文基于机器人FME植入系统的显微相机,提出了一种基于图像的异常检测框架。该统一框架在四个检查点使用,分别检查微针、FME探头、钩挂结果和植入点。利用已有的目标定位结果,从原始图像中提取对齐的感兴趣区域(ROI),并输入到预训练的Vision Transformer(ViT)中。考虑到任务特点,我们提出了一种渐进粒度的图块特征采样方法,以解决不同位置上敏感性与容忍度的权衡问题。此外,我们从原始的通用ViT特征中选择一部分信噪比较高的特征通道,为每个特定场景提供更好的描述符。所提方法的有效性在从我们的植入系统收集的图像数据集上得到了验证。
摘要:Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
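摘要中"选择信噪比较高的特征通道"这一步骤可以用如下示意代码勾勒。这里将SNR定义为通道均值绝对值与标准差之比,并在一批正常样本的特征上估计;该定义与 `select_snr_channels` 的接口均为演示性假设,摘要并未给出论文的确切公式:

```python
import numpy as np

def select_snr_channels(feats, k=4):
    """Keep the k feature channels with the highest signal-to-noise
    ratio, here taken as |mean| / std over a set of normal samples.

    feats: (N, C) pooled ViT features from N normal images.
    Returns the indices of the k selected channels.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-12   # avoid division by zero
    snr = np.abs(mu) / sigma
    return np.argsort(snr)[::-1][:k]    # descending by SNR
```

选出的通道索引随后可用于裁剪每个场景专用的特征描述符。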
【6】Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform
标题:使用热成像、预训练的深度学习模型和机器人平台检测泄漏
链接:https://arxiv.org/abs/2510.08770
备注:6 pages
摘要:本文提出了一种实时溢出检测系统,该系统利用预先训练的深度学习模型,结合RGB和热成像,对不同环境中的溢出与无溢出场景进行分类。使用平衡的二进制数据集(4,000张图像),我们的实验证明了热成像在推理速度,准确性和模型大小方面的优势。我们使用VGG19和NasNetMobile等轻量级模型实现了高达100%的准确性,热模型在不同的照明条件下执行得更快,更稳健。我们的系统在消费级硬件(RTX 4080)上运行,推理时间低至44 ms,模型大小低于350 MB,突出了其在安全关键环境中的可部署性。使用真实机器人和测试数据集进行的实验结果表明,在热成像上训练的VGG19模型表现最好。
摘要:This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
【7】Detection of high-frequency oscillations using time-frequency analysis
标题:使用时频分析检测高频振荡
链接:https://arxiv.org/abs/2510.08637
备注:17 pages, 7 figures
摘要:高频振荡(HFO)是一种新的识别致痫区的生物标志物。绘制HFO产生区域可以提高难治性癫痫患者切除部位的精确度。然而,检测HFO仍然具有挑战性,其临床特征尚未完全确定。HFO的视觉识别耗时、劳动密集且主观。因此,开发检测HFO的自动化方法对于研究和临床使用至关重要。在这项研究中,我们开发了一种新方法来检测纹波和快速纹波频段(80-500 Hz)的HFO,并使用对照数据集和癫痫患者的数据进行了验证。我们的方法采用无监督聚类技术,对使用S变换从时频域提取的事件进行分类。所提出的检测器可将HFO事件与棘波、背景活动和伪迹区分开来。与现有检测器相比,我们的方法在对照数据集上实现了97.67%的灵敏度、98.57%的精确率和97.78%的F分数。在癫痫患者中,我们的结果显示与手术结果有更强的相关性,切除触点与非切除触点的HFO发生率之比为0.73。该研究证实了先前的发现,即HFO是癫痫患者致痫性的有希望的生物标志物:去除HFO(特别是快速纹波)可带来癫痫发作消失,而残留的HFO则导致癫痫发作复发。
摘要:High-frequency oscillations (HFOs) are a new biomarker for identifying the epileptogenic zone. Mapping HFO-generating regions can improve the precision of resection sites in patients with refractory epilepsy. However, detecting HFOs remains challenging, and their clinical features are not yet fully defined. Visual identification of HFOs is time-consuming, labor-intensive, and subjective. As a result, developing automated methods to detect HFOs is critical for research and clinical use. In this study, we developed a novel method for detecting HFOs in the ripple and fast ripple frequency bands (80-500 Hz). We validated it using both controlled datasets and data from epilepsy patients. Our method employs an unsupervised clustering technique to categorize events extracted from the time-frequency domain using the S-transform. The proposed detector differentiates HFOs events from spikes, background activity, and artifacts. Compared to existing detectors, our method achieved a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78% on the controlled dataset. In epilepsy patients, our results showed a stronger correlation with surgical outcomes, with a ratio of 0.73 between HFOs rates in resected versus non-resected contacts. The study confirmed previous findings that HFOs are promising biomarkers of epileptogenicity in epileptic patients. Removing HFOs, especially fast ripple, leads to seizure freedom, while remaining HFOs lead to seizure recurrence.
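摘要中"从时频域提取事件特征并做无监督聚类"的思路可以用如下极简示意代码勾勒:这里用加窗短时FFT的带内/带外功率代替S变换,用手写的二类k-means代替论文的聚类方法,两者均为演示性替代,并非论文的原始实现:

```python
import numpy as np

def band_power_features(signal, fs, win=64, step=32, band=(80, 500)):
    """Crude time-frequency features: per-window FFT power inside and
    outside the HFO band (80-500 Hz). A stand-in for the S-transform."""
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    band_idx = (freqs >= band[0]) & (freqs <= band[1])
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        seg = signal[start:start + win] * np.hanning(win)
        power = np.abs(np.fft.rfft(seg)) ** 2
        feats.append([power[band_idx].sum(), power[~band_idx].sum()])
    return np.array(feats)

def two_means(x, iters=20):
    """Minimal unsupervised 2-cluster k-means on feature rows,
    initialized at the rows with the lowest/highest band power."""
    centers = x[[x[:, 0].argmin(), x[:, 0].argmax()]].astype(float)
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels
```

对一段在背景慢波上叠加250 Hz短暂振荡的信号,含振荡的窗口会被分到与背景窗口不同的簇中。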
【8】Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs
标题:利用分层高斯混合模型的认知不确定性进行LiDAR语义分割中的分布外检测
链接:https://arxiv.org/abs/2510.08631
摘要:除了通过对LiDAR点云进行精确的语义分割来准确地理解场景外,检测分布外(OOD)对象(即训练过程中未遇到的实例)对于防止将未知对象错误地分配给已知类别至关重要。虽然有监督的OOD检测方法依赖于辅助OOD数据集,但无监督方法避免了这一要求,通常依赖于预测熵,即通过对集成模型或多个后验权重样本进行平均而获得的预测分布的熵。然而,这些方法往往混淆认知(模型)不确定性和偶然(数据)不确定性,将分布内的模糊区域错误地分类为OOD。为了解决这个问题,我们提出了一种无监督的OOD检测方法,该方法采用对深度神经网络特征空间中高斯混合模型(GMM)参数进行分层贝叶斯建模所得到的认知不确定性。在不需要辅助数据或额外训练阶段的情况下,我们的方法在SemanticKITTI数据集上优于现有的基于不确定性的方法:与先前工作中使用的预测熵方法相比,AUROC提高了18%,AUPRC增加了22%,FPR95减少了36%(从76%降至40%)。
摘要:In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in-distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
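摘要所批评的"预测熵混淆认知与偶然不确定性"可以用标准的互信息分解来说明:认知不确定性等于预测熵减去期望熵。下面的示意代码基于集成预测演示该分解;注意论文本身是从分层贝叶斯GMM参数推导认知不确定性,此处仅展示其动机:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy along the class axis."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def decompose_uncertainty(ens_probs):
    """Split predictive entropy into epistemic and aleatoric parts.

    ens_probs: (K, N, C) class probabilities from K ensemble members
    (or posterior weight samples) for N points.
    Returns (epistemic, aleatoric) per point; epistemic is the mutual
    information typically used as an OOD score.
    """
    mean_p = ens_probs.mean(axis=0)              # (N, C)
    total = entropy(mean_p)                      # predictive entropy
    aleatoric = entropy(ens_probs).mean(axis=0)  # expected entropy
    epistemic = total - aleatoric                # mutual information
    return epistemic, aleatoric
```

对成员间自信但互相矛盾的点,认知项大而偶然项小;对所有成员都输出均匀分布的模糊分布内点,预测熵同样很高,但认知项接近零,因此不会被误判为OOD。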
分类|识别相关(6篇)
【1】SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
标题:SilvaScenes:天然森林树冠下图像的树木分割和物种分类
链接:https://arxiv.org/abs/2510.09458
备注:8 pages, 5 figures
摘要:对森林管理机器人的兴趣正在增长,但在复杂的自然环境中的感知仍然是一个重大障碍。严重遮挡、光照变化和植被密集等条件对自动化系统提出了挑战,而自动化系统对于精确林业、生物多样性监测和林业设备自动化至关重要。这些任务依赖于高级感知能力,例如检测和对单个树木进行细粒度的物种分类。然而,现有的数据集不足以开发这样的感知系统,因为它们往往集中在城市环境或有限数量的物种。为了解决这个问题,我们提出了SilvaScenes,一个新的数据集,例如从树冠下图像中分割树种。在加拿大魁北克省的五个生物气候领域收集,SilvaScenes拥有来自24个物种的1476棵树木,并附有林业专家的注释。我们通过对实例分割的现代深度学习方法进行基准测试,展示了我们数据集的相关性和挑战性。我们的研究结果表明,虽然树木分割很容易,最高平均精度(mAP)为67.65%,但物种分类仍然是一个重大挑战,mAP仅为35.69%。我们的数据集和源代码将在https://github.com/norlab-ulaval/SilvaScenes上提供。
摘要:Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.
【2】Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition
标题:Cattle-CLIP:牛行为识别的多模式框架
链接:https://arxiv.org/abs/2510.09203
备注:16 pages, 10 figures, submitted to Computers and Electronics in Agriculture
摘要:牛的行为是个体动物健康、生产力和整体福祉的重要指标。基于视频的监控与深度学习技术相结合,已成为动物生物识别领域的主流方法,并且可以在某些行为识别任务中提供高准确度。我们提出了Cattle-CLIP,一个用于牛行为识别的多模态深度学习框架,使用语义线索来提高基于视频的视觉特征识别的性能。它是改编自大规模图像语言模型CLIP通过添加一个时间整合模块。为了解决用于预训练模型的Web数据与真实世界的牛监控录像之间的域差距,我们引入了定制的数据增强策略和专门的文本提示。Cattle-CLIP在完全监督和Few-Shot学习场景下进行评估,特别关注数据稀缺行为识别--牲畜监测中一个重要但尚未充分探索的目标。为了评估所提出的方法,我们发布了CattleBehaviours 6数据集,其中包括六种类型的室内行为:喂食,饮水,站立-自我梳理,站立-反刍,躺下-自我梳理和躺下-反刍。该数据集包括从我们的John Oldacre中心奶牛场研究平台收集的1905个片段,该平台容纳200头荷斯坦黑白花奶牛。实验表明,Cattle-CLIP在有监督的环境中对六种行为的总体准确率达到96.1%,对喂食,饮水和站立反刍行为的回忆率接近100%,并在Few-Shot场景中用有限的数据表现出强大的泛化能力,突出了多模态学习在农业和动物行为分析中的潜力。
摘要:Cattle behaviour is a crucial indicator of an individual animal health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition - an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.
【3】Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study
标题:板球击球分类的现代深度学习方法:全面的基线研究
链接:https://arxiv.org/abs/2510.09187
摘要:从视频序列中进行板球击球分类仍然是体育视频分析中具有挑战性的问题,需要对空间和时间特征进行有效建模。本文介绍了第一个全面的基线研究,比较了四种不同研究范式中七种不同的深度学习方法用于板球击球分类。我们在一个统一的基准上实现并系统地评估传统的CNN-LSTM架构、基于注意力的模型、Vision Transformer、迁移学习方法以及现代的EfficientNet-GRU组合。我们研究的一个关键发现是,学术文献中的指标与实际复现结果之间存在显著的性能差距。虽然以前的论文报告的准确率为96%(Balaji LRCN)、99.2%(IJERCSE)和93%(Sensors),但我们的标准化复现分别仅达到46.0%、55.6%和57.7%。我们的现代SOTA方法将EfficientNet-B0与基于GRU的时间模型相结合,达到92.25%的准确率,证明借助现代架构和系统优化可以获得实质性改进。所有实现都遵循基于PyTorch Lightning的现代MLOps实践,提供了一个可复现的研究平台,揭示了标准化评估协议在体育视频分析研究中的关键重要性。
摘要:Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96% (Balaji LRCN), 99.2% (IJERCSE), and 93% (Sensors), our standardized re-implementations achieve 46.0%, 55.6%, and 57.7% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.
【4】Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition
标题:Hi-OSCAR:用于人类活动识别的分层开放集分类器
链接:https://arxiv.org/abs/2510.08635
备注:Accepted at ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)
摘要:在人类活动识别(HAR)中,在生活中执行的活动范围与可以在训练中使用的注释传感器数据集中捕获的活动之间存在不可逾越的差距。不能正确处理看不见的活动会严重破坏任何HAR分类器的可靠性。此外,在HAR中,并非所有类别都是完全不同的,有些类别与其他子活动有很大的重叠或包含其他子活动。基于这些观察,我们安排活动类到一个结构化的层次结构。从那里,我们提出了Hi-OSCAR:一个层次化的活动识别开集分类器,它可以以最先进的精度识别已知的活动,同时拒绝未知的活动。这不仅可以实现开集分类,还可以将未知类定位到最近的内部节点,提供超越二元“已知/未知”分类的洞察力。为了促进这一点和未来的开放集HAR研究,我们收集了一个新的数据集:NFI_FARED。NFI_FARED包含来自多个受试者的数据,这些受试者在一系列背景下执行十九项活动,包括日常生活,通勤和快速移动,这些数据完全公开并可供下载。
摘要:Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier's reliability. Additionally within HAR, not all classes are equally dissimilar, some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition, that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary "known/unknown" classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, which is fully public and available for download.
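摘要中"将未知类定位到层次结构中最近的内部节点"的机制可以用如下玩具示例勾勒:叶节点为已知活动类,内部节点分数为其下所有叶子概率之和,自根向下走,子节点分数低于阈值时即停在当前节点。这里的活动层次结构、节点命名与阈值均为演示性假设,并非论文的分类体系:

```python
import numpy as np

# Toy activity hierarchy: internal node -> children; leaves are classes.
HIERARCHY = {
    "root": ["locomotion", "stationary"],
    "locomotion": ["walk", "run"],
    "stationary": ["sit", "stand"],
}
LEAVES = ["walk", "run", "sit", "stand"]

def node_score(node, leaf_probs):
    """Score of a node = summed probability mass of the leaves below it."""
    if node in LEAVES:
        return leaf_probs[LEAVES.index(node)]
    return sum(node_score(c, leaf_probs) for c in HIERARCHY[node])

def localize(leaf_probs, thresh=0.7):
    """Walk down from the root, greedily following the best child;
    stop at the deepest node whose mass still exceeds `thresh`.
    Unknown activities thus resolve to an internal node, not a leaf."""
    node = "root"
    while node in HIERARCHY:
        best = max(HIERARCHY[node], key=lambda c: node_score(c, leaf_probs))
        if node_score(best, leaf_probs) < thresh:
            break
        node = best
    return node
```

例如,当分类器在"walk"与"run"之间犹豫但确信是移动类活动时,样本会被定位到 "locomotion" 而非被简单标为"未知"。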
【5】Deep Sparse Representation-based Classification
标题:基于深度稀疏表示的分类
链接:https://arxiv.org/abs/1904.11093
备注:None
摘要:我们为基于稀疏表示的分类(SRC)方法提出了一种直推式(transductive)的深度学习公式。所提出的网络由卷积自编码器和全连接层组成。自编码器网络的作用是学习用于分类的鲁棒深度特征。另一方面,位于编码器和解码器网络之间的全连接层负责寻找稀疏表示。然后将估计的稀疏编码用于分类。在三个不同数据集上的各种实验表明,所提出的网络学到的稀疏表示比最先进的SRC方法提供更好的分类结果。源代码可在github.com/mahdiabavisani/DSRC上获得。
摘要:We present a transductive deep learning-based formulation for the sparse representation-based classification (SRC) method. The proposed network consists of a convolutional autoencoder along with a fully-connected layer. The role of the autoencoder network is to learn robust deep features for classification. On the other hand, the fully-connected layer, which is placed in between the encoder and the decoder networks, is responsible for finding the sparse representation. The estimated sparse codes are then used for classification. Various experiments on three different datasets show that the proposed network leads to sparse representations that give better classification results than state-of-the-art SRC methods. The source code is available at: github.com/mahdiabavisani/DSRC.
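该文所深度化的经典SRC方法本身可以用如下示意代码勾勒:先用一个极简的正交匹配追踪(OMP)在训练样本字典上求稀疏编码,再按各类别原子的重构残差最小者分类。实现细节为演示性简化,并非论文代码:

```python
import numpy as np

def omp(D, y, n_nonzero=5):
    """Minimal orthogonal matching pursuit: sparse code x with y ~ D @ x.
    Columns of D are assumed L2-normalized."""
    residual, idx = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))  # best-matching atom
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef                     # re-fit residual
    x[idx] = coef
    return x

def src_classify(D, labels, y, n_nonzero=5):
    """Sparse-representation classification: choose the class whose
    training atoms reconstruct y with the smallest residual."""
    x = omp(D, y, n_nonzero)
    errs = {c: np.linalg.norm(y - D[:, labels == c] @ x[labels == c])
            for c in np.unique(labels)}
    return min(errs, key=errs.get)
```

论文的贡献在于用卷积自编码器学习鲁棒深度特征,并在编码器与解码器之间用全连接层替代上述显式稀疏求解。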
【6】Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training
标题:通过多模态训练提高单模态动态手势识别的性能
链接:https://arxiv.org/abs/1812.06145
备注:None
摘要:我们提出了一种有效的方法,用于利用来自多个模态的知识来训练单峰3D卷积神经网络(3D-CNN),以执行动态手势识别任务。我们提出了一个不同的框架,在这个框架中,我们将多模态的知识嵌入到各个网络中,这样每个单峰网络都可以实现更好的性能,而不是显式地组合多模态信息,这在许多最先进的方法中是常见的。特别是,我们为每个可用的模态提供单独的网络,并强制它们进行协作,学习开发具有通用语义和更好表示的网络。我们引入了“时空语义对齐”损失(SSA)来对齐来自不同网络的特征的内容。此外,我们用我们提出的“焦点正则化参数”来正则化这种损失,以避免负知识转移。实验结果表明,该框架提高了单峰网络的测试时间识别精度,并在各种动态手势识别数据集上提供了最先进的性能。
摘要:We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate separate networks per available modality and enforce them to collaborate and learn to develop networks with common semantics and better representations. We introduce a "spatiotemporal semantic alignment" loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed "focal regularization parameter" to avoid negative knowledge transfer. Experimental results show that our framework improves the test time recognition accuracy of unimodal networks, and provides the state-of-the-art performance on various dynamic hand gesture recognition datasets.
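摘要中的"时空语义对齐"(SSA)损失与"焦点正则化参数"的确切形式并未给出;下面用特征相关性对齐加一个门控权重给出一个高度简化的示意,纯属演示性假设,并非论文的原始公式:

```python
import numpy as np

def ssa_loss(f_a, f_b):
    """Align the correlation structure of two networks' features
    (an illustrative stand-in for the paper's spatiotemporal
    semantic alignment loss).
    f_a, f_b: (N, D) flattened spatiotemporal features."""
    def corr(f):
        f = f - f.mean(axis=0)
        return (f.T @ f) / max(f.shape[0] - 1, 1)
    diff = corr(f_a) - corr(f_b)
    return float((diff ** 2).sum())

def focal_weight(loss_a, loss_b, beta=2.0):
    """Gate knowledge transfer: down-weight alignment toward branch b
    when b performs worse than a, mimicking the idea of avoiding
    negative transfer (illustrative focal-style regularization)."""
    gap = max(loss_b - loss_a, 0.0)   # teacher branch worse -> gap > 0
    return float(np.exp(-beta * gap))
```

直觉上:两个单模态分支的特征相关性被拉近,而当对侧分支的任务损失更差时,对齐强度被指数衰减,避免把劣质知识传给当前分支。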
分割|语义相关(8篇)
【1】A methodology for clinically driven interactive segmentation evaluation
标题:临床驱动的交互式分割评估的方法
链接:https://arxiv.org/abs/2510.09499
备注:10 pages, Medical Image Computing and Computed Assisted Intervention 2025
摘要:交互式分割是一种很有前途的策略,建立强大的,通用的算法体积医学图像分割。然而,不一致和临床上不切实际的评价阻碍了公平的比较,并歪曲了真实世界的性能。我们提出了一个临床接地的方法来定义评估任务和指标,并建立了一个软件框架,用于构建标准化的评估管道。我们评估了跨异构和复杂任务的最先进算法,并观察到(i)在处理用户交互时最大限度地减少信息丢失对模型鲁棒性至关重要,(ii)自适应缩放机制提高了鲁棒性和收敛速度,(iii)如果验证提示行为/预算与训练不同,则性能下降,(iv)2D方法在板状图像和粗糙目标上表现良好,但是3D背景有助于大的或不规则形状的目标,(v)非医学领域模型(例如SAM 2)的性能随着差的对比度和复杂形状而降低。
摘要:Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
【2】Instance-Aware Robust Consistency Regularization for Semi-Supervised Nuclei Instance Segmentation
标题:半监督细胞核实例分割的实例感知鲁棒一致性正则化
链接:https://arxiv.org/abs/2510.09329
摘要:病理图像中的细胞核实例分割对于肿瘤微环境分析等下游任务至关重要。然而,注释数据的高成本和稀缺性限制了全监督方法的适用性,而现有的半监督方法无法在实例级别充分规范一致性,缺乏对病理结构固有先验知识的利用,并且易于在训练期间引入噪声伪标签。在本文中,我们提出了一个实例感知的鲁棒一致性正则化网络(IRCR-Net),用于精确的实例级细胞核分割。具体来说,我们引入匹配驱动的实例感知一致性(MIAC)和先验驱动的实例感知一致性(PIAC)机制来细化教师和学生子网的核实例分割结果,特别是对于密集分布和重叠的核。我们将病理图像中细胞核的形态学先验知识,并利用这些先验知识来评估从未标记数据生成的伪标签的质量。低质量的伪标签被丢弃,而高质量的预测被增强以减少伪标签噪声并有利于网络的鲁棒训练。实验结果表明,该方法显着提高了半监督核实例分割性能在多个公共数据集相比,现有的方法,甚至超过完全监督方法在某些情况下。
摘要:Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi-supervised methods fail to adequately regularize consistency at the instance level, lack leverage of the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo-labels during training. In this paper, we propose an Instance-Aware Robust Consistency Regularization Network (IRCR-Net) for accurate instance-level nuclei segmentation. Specifically, we introduce the Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation result of the teacher and student subnetwork, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo-labels generated from unlabeled data. Low-quality pseudo-labels are discarded, while high-quality predictions are enhanced to reduce pseudo-label noise and benefit the network's robust training. Experimental results demonstrate that the proposed method significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.
【3】Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
标题:探索不完美标签下基于LiDAR的语义分割的单域泛化
链接:https://arxiv.org/abs/2510.09035
摘要:准确的感知对于车辆安全至关重要,而LiDAR是自动驾驶的关键推动因素。为了确保在环境、传感器类型和天气条件下的鲁棒性能,而无需昂贵的重新注释,基于LiDAR的3D语义分割中的域泛化至关重要。然而,由于传感器缺陷、遮挡和人为错误,LiDAR注释通常是有噪声的。这种噪声降低了分割精度,并在域偏移下进一步放大,威胁系统可靠性。虽然噪声标签学习在图像中得到了很好的研究,但其在领域泛化下对3D LiDAR分割的扩展在很大程度上尚未探索,因为点云的稀疏和不规则结构限制了2D方法的直接使用。为了解决这一差距,我们引入了新的任务Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels(DGLSS-NL),并通过调整从图像分类到3D分割的三种代表性噪声标签学习策略来建立第一个基准。然而,我们发现现有的噪声标签学习方法对LiDAR数据的适应性很差。因此,我们提出了DuNe,一个具有强分支和弱分支的双视图框架,该框架强制执行特征级一致性,并基于置信度感知的预测过滤应用交叉熵损失。我们的方法通过在10%对称标签噪声下在SemanticKITTI上实现56.86%mIoU,在nuScenes上实现42.28%,在SemanticPOSS上实现52.58%,显示了最先进的性能,总体算术平均值(AM)为49.57%,调和平均值(HM)为48.50%,从而在DGLSS-NL任务中展示了鲁棒的域泛化。代码可以在我们的项目页面上找到。
摘要:Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
【4】FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation
标题:FOLK:通过标签引导知识提炼的快速开放词汇3D实例分割
链接:https://arxiv.org/abs/2510.08849
摘要:开放式词汇表3D实例分割试图对注释标签空间之外的实例进行分割和分类。现有方法通常将3D实例映射到2D RGB-D图像,然后采用视觉语言模型(VLM)进行分类。然而,这样的映射策略通常会引入来自2D遮挡的噪声,并且在推理期间会产生大量的计算和存储成本,从而减慢推理速度。针对上述问题,提出了一种基于标签引导知识蒸馏(FOLK)的快速开放词汇三维实例分割方法。我们的核心思想是设计一个教师模型,提取高质量的实例嵌入,并将其开放词汇知识蒸馏到3D学生模型中。通过这种方式,在推理过程中,蒸馏得到的3D模型可以直接从3D点云中分类实例,避免了遮挡引起的噪声,并显著加快了推理过程。具体来说,我们首先设计一个教师模型,为每个3D实例生成一个2D CLIP嵌入,结合可见性和视点多样性,作为蒸馏的学习目标。然后,我们开发了一个3D学生模型,直接为每个3D实例生成3D嵌入。在训练过程中,我们提出了一种标签引导的蒸馏算法,将开放词汇知识从标签一致的2D嵌入蒸馏到学生模型中。FOLK在ScanNet200和Replica数据集上进行了实验,在ScanNet200数据集上实现了最先进的性能,AP50得分为35.7,同时运行速度比以前的方法快约6.0倍至152.2倍。所有代码将在论文被接受后发布。
摘要:Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
【5】Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation
标题:脑肿瘤分割中数据增强和损失函数的可重复性评价
链接:https://arxiv.org/abs/2510.08617
备注:Code and results available at this https URL
摘要:脑肿瘤分割对于诊断和治疗计划至关重要,但类不平衡和有限的模型泛化等挑战继续阻碍进展。这项工作提出了一个可重复的评估U-Net分割性能的脑肿瘤MRI使用焦点损失和基本的数据增强策略。在公开的MRI数据集上进行实验,重点关注焦点损失参数调整并评估三种数据增强技术的影响:水平翻转,旋转和缩放。具有焦点损失的U-Net实现了90%的精度,与最先进的结果相当。通过公开所有代码和结果,这项研究建立了一个透明的,可重复的基线,以指导未来的研究增强策略和损失函数设计在脑肿瘤分割。
摘要:Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
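文中用于缓解类不平衡的焦点损失(focal loss)的标准二分类形式为 FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t);下面给出其逐像素平均的示意实现。当 gamma=0、alpha=0.5 时,它退化为缩放的交叉熵;参数取值为常见默认值,并非论文调优后的设置:

```python
import numpy as np

def focal_loss(p, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss averaged over pixels. Down-weights easy
    examples so that rare (e.g. tumor) pixels dominate the loss.

    p:       (N,) predicted foreground probabilities
    targets: (N,) binary ground-truth labels
    """
    p_t = np.where(targets == 1, p, 1.0 - p)          # prob of true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)).mean())
```

直觉上,(1 - p_t)^gamma 因子使已被正确高置信预测的背景像素几乎不产生梯度,从而把训练重心压到难分的肿瘤边界上。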
【6】Rewiring Development in Brain Segmentation: Leveraging Adult Brain Priors for Enhancing Infant MRI Segmentation
标题:大脑分割的重新布线发展:利用成人大脑先验增强婴儿MRI分割
链接:https://arxiv.org/abs/2510.09306
摘要:婴儿脑MRI的准确分割对于研究早期神经发育和诊断神经系统疾病至关重要。然而,由于受试者的解剖结构不断演变,运动伪影以及高质量标记数据的稀缺性,它仍然是一个根本性的挑战。在这项工作中,我们提出了LODi,一种新的框架,利用先验知识从成人大脑MRI分割模型,以提高婴儿扫描的分割性能。鉴于公开可用的成人大脑MRI数据丰富,我们在大型成人数据集上预训练分割模型作为起点。通过迁移学习和领域自适应策略,我们逐步使模型适应0-2岁的人群,使其能够解释婴儿扫描典型的解剖和成像变异性。成人模型的适应是使用婴儿大脑扫描的弱监督学习进行的,利用FreeSurfer获得的银标准地面真实标签。通过引入一种新的训练策略,该策略集成了分层特征细化和多级一致性约束,我们的方法可以实现快速,准确,年龄自适应的分割,同时减轻扫描仪和站点特定的偏见。在内部和外部数据集上进行的大量实验表明,我们的方法优于传统的监督学习和特定领域的模型。我们的研究结果突出了利用成人大脑先验作为年龄灵活的神经成像分析基础的优势,为整个生命周期更可靠和更普遍的大脑MRI分割铺平了道路。
摘要:Accurate segmentation of infant brain MRI is critical for studying early neurodevelopment and diagnosing neurological disorders. Yet, it remains a fundamental challenge due to continuously evolving anatomy of the subjects, motion artifacts, and the scarcity of high-quality labeled data. In this work, we present LODi, a novel framework that utilizes prior knowledge from an adult brain MRI segmentation model to enhance the segmentation performance of infant scans. Given the abundance of publicly available adult brain MRI data, we pre-train a segmentation model on a large adult dataset as a starting point. Through transfer learning and domain adaptation strategies, we progressively adapt the model to the 0-2 year-old population, enabling it to account for the anatomical and imaging variability typical of infant scans. The adaptation of the adult model is carried out using weakly supervised learning on infant brain scans, leveraging silver-standard ground truth labels obtained with FreeSurfer. By introducing a novel training strategy that integrates hierarchical feature refinement and multi-level consistency constraints, our method enables fast, accurate, age-adaptive segmentation, while mitigating scanner and site-specific biases. Extensive experiments on both internal and external datasets demonstrate the superiority of our approach over traditional supervised learning and domain-specific models. Our findings highlight the advantage of leveraging adult brain priors as a foundation for age-flexible neuroimaging analysis, paving the way for more reliable and generalizable brain MRI segmentation across the lifespan.
【7】SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation
标题:SAM2-3dMed:赋能SAM2进行3D医学图像分割
链接:https://arxiv.org/abs/2510.08967
摘要:3D医学图像的准确分割对于疾病评估和治疗计划等临床应用至关重要。虽然Segment Anything Model 2(SAM2)通过利用时间线索在视频对象分割方面取得了显著的成功,但其直接应用于3D医学图像面临两个根本性的领域差距:1)切片之间的双向解剖连续性与视频中的单向时间流形成鲜明对比;2)对形态分析至关重要的精确边界划分在视频任务中通常探索不足。为了弥合这些差距,我们提出了SAM2-3dMed,即SAM2在3D医学影像上的适配版本。我们的框架引入了两个关键创新:1)切片相对位置预测(SRPP)模块通过引导SAM2以自监督的方式预测不同切片的相对位置,显式地对双向切片间依赖性进行建模;2)边界检测(BD)模块增强了沿关键器官和组织边界的分割准确性。在三个不同的医学数据集(医学分割十项全能(MSD)数据集中的肺、脾和胰腺)上进行的大量实验表明,SAM2-3dMed的性能明显优于最先进的方法,在分割重叠度和边界精度方面表现卓越。我们的方法不仅提升了3D医学图像分割的性能,还为将以视频为中心的基础模型适配到空间体数据提供了一个通用范例。
摘要:Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
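The SRPP idea above reduces to a simple self-supervised target. A minimal sketch (the signed, normalized offset used here is an illustrative assumption, not the paper's exact formulation, and the function name is hypothetical):

```python
import numpy as np

def relative_position_targets(anchor_idx, slice_indices, num_slices):
    """Self-supervised targets for slice relative position prediction:
    signed offsets of each slice from an anchor slice, normalized to
    [-1, 1]. The sign encodes direction along the volume, capturing the
    bidirectional ordering that a unidirectional video model never sees."""
    offsets = np.asarray(slice_indices) - anchor_idx
    return offsets / max(num_slices - 1, 1)

targets = relative_position_targets(10, [0, 5, 10, 15, 20], num_slices=21)
```

A model trained to regress such targets from slice features must learn inter-slice anatomy in both directions, which is exactly the gap with unidirectional video pretraining that SRPP addresses.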
【8】Progressive Uncertainty-Guided Evidential U-KAN for Trustworthy Medical Image Segmentation
标题:渐进不确定性引导的证据U-KAN用于可信医学图像分割
链接:https://arxiv.org/abs/2510.08949
摘要:可信医学图像分割的目标是为临床决策提供准确可靠的结果。由于其计算效率和理论鲁棒性,大多数现有方法采用证据深度学习(EDL)范式。然而,基于EDL的方法往往忽视了利用富含注意线索的不确定性图来改善模糊边界分割。为了解决这个问题,我们提出了一种渐进式证据不确定性引导注意力(PEUA)机制,引导模型专注于困难区域的特征表示学习。与传统方法不同,PEUA使用不确定性图逐步细化注意力,同时采用低秩学习对注意力权重去噪,增强对具有挑战性区域的特征学习。同时,标准EDL方法通过Kullback-Leibler(KL)正则化不加区别地抑制错误类别的证据,损害了模糊区域的不确定性评估,从而扭曲了相应的注意力引导。因此,我们引入了一种语义保持证据学习(SAEL)策略,集成了一个语义平滑的证据生成器和一个保真度增强的正则化项,以保留关键语义。最后,通过将PEUA和SAEL嵌入最先进的U-KAN,我们提出了Evidential U-KAN,这是一种可信医学图像分割的新型解决方案。在4个数据集上进行的大量实验表明,与竞争方法相比,该方法具有更高的准确性和可靠性。代码可在\href{https://anonymous.4open.science/r/Evidence-U-KAN-BBE8}{github}获得。
摘要:Trustworthy medical image segmentation aims to deliver accurate and reliable results for clinical decision-making. Most existing methods adopt the evidential deep learning (EDL) paradigm due to its computational efficiency and theoretical robustness. However, EDL-based methods often neglect leveraging uncertainty maps rich in attention cues to refine ambiguous boundary segmentation. To address this, we propose a progressive evidence uncertainty guided attention (PEUA) mechanism to guide the model to focus on the feature representation learning of hard regions. Unlike conventional approaches, PEUA progressively refines attention using uncertainty maps while employing low-rank learning to denoise attention weights, enhancing feature learning for challenging regions. Concurrently, standard EDL methods suppress evidence for incorrect classes indiscriminately via Kullback-Leibler (KL) regularization, impairing the uncertainty assessment in ambiguous areas and consequently distorting the corresponding attention guidance. We thus introduce a semantic-preserving evidence learning (SAEL) strategy, integrating a semantic-smooth evidence generator and a fidelity-enhancing regularization term to retain critical semantics. Finally, by embedding PEUA and SAEL with the state-of-the-art U-KAN, we propose Evidential U-KAN, a novel solution for trustworthy medical image segmentation. Extensive experiments on 4 datasets demonstrate superior accuracy and reliability over the competing methods. The code is available at \href{https://anonymous.4open.science/r/Evidence-U-KAN-BBE8}{github}.
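The Dirichlet quantities that PEUA's uncertainty maps build on are standard in evidential deep learning. A minimal numpy sketch (the evidence head that would produce `evidence` from an image is omitted):

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """From non-negative per-class evidence e: Dirichlet parameters
    alpha = e + 1, Dirichlet strength S = sum(alpha), belief b_k = e_k / S,
    and vacuity u = K / S. Beliefs and u sum to 1, so a high u flags
    ambiguous pixels that uncertainty-guided attention can focus on."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    belief = evidence / strength
    uncertainty = evidence.shape[-1] / strength
    return belief, uncertainty.squeeze(-1)

# one confident pixel (evidence 9 vs 1) and one with no evidence at all
belief, u = dirichlet_uncertainty([[9.0, 1.0], [0.0, 0.0]])
```

With zero evidence the vacuity is exactly 1, which is the "hard region" signal that PEUA progressively feeds back into attention.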
Zero/Few Shot|迁移|域适配|自适应(3篇)
【1】TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control
标题:TC-LoRA:用于自适应扩散控制的时间调制条件LoRA
链接:https://arxiv.org/abs/2510.09561
备注:10 pages; NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)
摘要:当前的可控扩散模型通常依赖于固定的架构,该架构修改中间激活以注入以新模态为条件的引导。这种方法使用静态调节策略来进行动态的多阶段去噪过程,从而限制了模型随着生成从粗略结构向精细细节的演变而调整其响应的能力。我们引入了TC-LoRA(时间调制条件LoRA),这是一种新的范例,通过直接调节模型的权重来实现动态的上下文感知控制。我们的框架使用超网络来动态生成LoRA适配器,根据时间和用户的条件在每个扩散步骤中为冻结的骨干定制权重修改。这种机制使模型能够学习和执行显式的自适应策略,以便在整个生成过程中应用条件指导。通过对各种数据域的实验,我们证明了这种动态的参数控制与静态的基于激活的方法相比,显着提高了生成保真度和对空间条件的遵守。TC-LoRA建立了一种替代方法,通过对其权重进行更深层次的功能调整来修改模型的条件策略,从而使控制能够与任务和生成阶段的动态需求保持一致。
摘要:Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model's ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model's weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user's condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model's conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
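The core mechanism, a hypernetwork emitting LoRA factors per denoising step, can be sketched in a few lines. The stand-in `hypernet` below is a fixed random projection purely for illustration (the real one is learned), and all names, dimensions, and scales are assumptions:

```python
import numpy as np

d, r = 8, 2                               # hidden size and LoRA rank (toy values)
rng = np.random.default_rng(0)
W_frozen = rng.standard_normal((d, d))    # frozen backbone weight

def hypernet(t, cond):
    """Toy stand-in for the learned hypernetwork: maps (timestep,
    condition) to LoRA factors A (r x d) and B (d x r)."""
    seed = int(1000 * t) * 31 + sum(ord(c) for c in cond)
    g = np.random.default_rng(seed)
    return g.standard_normal((r, d)) * 0.01, g.standard_normal((d, r)) * 0.01

def tc_lora_forward(x, t, cond):
    # Effective weight at this step: frozen W plus a step- and
    # condition-specific low-rank update B @ A.
    A, B = hypernet(t, cond)
    return x @ (W_frozen + B @ A).T

x = np.ones(d)
y_early = tc_lora_forward(x, t=0.9, cond="edge-map")  # coarse-structure phase
y_late = tc_lora_forward(x, t=0.1, cond="edge-map")   # fine-detail phase
```

The design's point is visible even in the toy: the same frozen backbone responds differently at different timesteps because the weight delta, not just an activation, depends on t.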
【2】Structured Output Regularization: a framework for few-shot transfer learning
标题:结构化输出正则化:一种少样本迁移学习框架
链接:https://arxiv.org/abs/2510.08728
摘要:传统的迁移学习通常通过冻结一部分权重并添加特定于任务的层来重用大型预训练网络。虽然这种方法在计算上很高效,但它限制了模型适应领域特定特征的能力,并且在数据非常有限时仍可能导致过拟合。为了解决这些限制,我们提出了结构化输出正则化(SOR),这是一个简单而有效的框架,它冻结内部网络结构(例如,卷积滤波器),同时使用组套索和$L_1$惩罚的组合。该框架以最少的附加参数使模型适配特定数据,并且易于应用于各种网络组件(例如卷积滤波器或神经网络中的各类模块),因而可广泛适用于迁移学习任务。我们在三个少样本医学成像分类任务上评估了SOR,使用DenseNet121和EfficientNetB4骨干网络取得了与既有基准相比具有竞争力的结果。
摘要:Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks, enabling broad applicability for transfer learning tasks. We evaluate SOR on three few-shot medical imaging classification tasks and achieve competitive results using DenseNet121 and EfficientNetB4 backbones compared to established benchmarks.
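The penalty itself is compact. A sketch under the assumption that the trainable coefficients are grouped per frozen structure (e.g., one group per frozen convolutional filter); names and toy values are illustrative:

```python
import numpy as np

def sor_penalty(coeffs, lam_group, lam_l1):
    """Group lasso + elementwise L1. The group term (an L2 norm per row)
    can zero out a whole frozen filter's contribution at once; the L1
    term sparsifies the surviving individual coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    group_norms = np.sqrt((coeffs ** 2).sum(axis=1))
    return lam_group * group_norms.sum() + lam_l1 * np.abs(coeffs).sum()

# two coefficient groups: one already pruned to zero, one active
coeffs = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
penalty = sor_penalty(coeffs, lam_group=0.5, lam_l1=0.1)
```

Note the pruned group contributes nothing to either term, which is what makes the group norm a structured, filter-level selector rather than a plain weight decay.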
【3】PhyDAE: Physics-Guided Degradation-Adaptive Experts for All-in-One Remote Sensing Image Restoration
标题:PhyDAE:用于一体化遥感图像恢复的物理引导退化自适应专家
链接:https://arxiv.org/abs/2510.08653
摘要:遥感图像在获取过程中不可避免地会受到各种因素的影响,包括大气干扰、传感器限制和成像条件。这些复杂和异质的退化对图像质量和下游解释任务提出了严峻的挑战。针对现有的所有功能于一身的恢复方法,过度依赖于隐式的功能表示和缺乏明确的建模退化物理的局限性,本文提出了物理引导的退化自适应专家(PhyDAE)。该方法采用两级级联架构,将退化信息从隐式特征转换为显式决策信号,从而能够精确识别和区分处理多个异构退化,包括雾度、噪声、模糊和低光照条件。该模型采用渐进式退化挖掘和开发机制,其中残差流形投影器(RMP)和频率感知退化分解器(FADD)从流形几何和频率角度全面分析退化特性。引入物理感知专家模块和温度控制稀疏激活策略,以提高计算效率,同时确保成像物理一致性。对三个基准数据集(MD-RSID、MD-RRSHID和MDRS-Landsat)进行的广泛实验表明,PhyDAE在所有四个恢复任务中均实现了卓越的性能,全面优于最先进的方法。值得注意的是,PhyDAE大大提高了恢复质量,同时实现了参数计数和计算复杂性的显着减少,从而导致显着的效率增益相比,主流方法,并实现性能和效率之间的最佳平衡。代码可在https://github.com/HIT-SIRS/PhyDAE上获得。
摘要:Remote sensing images inevitably suffer from various degradation factors during acquisition, including atmospheric interference, sensor limitations, and imaging conditions. These complex and heterogeneous degradations pose severe challenges to image quality and downstream interpretation tasks. Addressing limitations of existing all-in-one restoration methods that overly rely on implicit feature representations and lack explicit modeling of degradation physics, this paper proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE). The method employs a two-stage cascaded architecture transforming degradation information from implicit features into explicit decision signals, enabling precise identification and differentiated processing of multiple heterogeneous degradations including haze, noise, blur, and low-light conditions. The model incorporates progressive degradation mining and exploitation mechanisms, where the Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) comprehensively analyze degradation characteristics from manifold geometry and frequency perspectives. Physics-aware expert modules and temperature-controlled sparse activation strategies are introduced to enhance computational efficiency while ensuring imaging physics consistency. Extensive experiments on three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat) demonstrate that PhyDAE achieves superior performance across all four restoration tasks, comprehensively outperforming state-of-the-art methods. Notably, PhyDAE substantially improves restoration quality while achieving significant reductions in parameter count and computational complexity, resulting in remarkable efficiency gains compared to mainstream approaches and achieving optimal balance between performance and efficiency. Code is available at https://github.com/HIT-SIRS/PhyDAE.
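The temperature-controlled sparse activation can be sketched in isolation. The four-expert layout and all hyperparameters below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def sparse_expert_weights(logits, temperature=0.5, top_k=2):
    """Sharpen degradation logits with a low temperature, keep only the
    top-k experts, and renormalize their weights -- so a hazy input
    activates the dehazing expert without paying for the rest."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                      # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    keep = np.argsort(p)[-top_k:]
    w = np.zeros_like(p)
    w[keep] = p[keep] / p[keep].sum()
    return w

# hypothetical logits for [haze, noise, blur, low-light]
w = sparse_expert_weights([2.0, 0.1, 0.0, -1.0])
```

Only two of the four experts receive nonzero weight, which is where the claimed efficiency gain over densely activating every expert comes from.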
半弱无监督|主动学习|不确定性(2篇)
【1】Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
标题:混合粒度特征聚合和从粗到细的语言引导用于自监督单目深度估计
链接:https://arxiv.org/abs/2510.09320
备注:ICCV 2025
摘要:目前的自监督单目深度估计(MDE)方法由于语义-空间知识提取不足而遇到性能限制。为了应对这一挑战,我们提出了Hybrid-depth,这是一个系统地集成基础模型(例如,CLIP和DINO)以提取视觉先验并为MDE获取足够上下文信息的新框架。我们的方法引入了一个由粗到细的渐进式学习框架:1)首先,我们在对比语言指导下聚合CLIP(全局语义)和DINO(局部空间细节)的多粒度特征,并设计了一个比较远近图像块的代理任务,使用文本提示来执行深度感知特征对齐;2)接下来,在粗略特征的基础上,我们集成相机姿态信息和像素级语言对齐来细化深度预测。该模块作为即插即用的深度编码器与现有的自监督MDE管道(例如,Monodepth2、ManyDepth)无缝集成,增强了连续深度估计。通过语言指导聚合CLIP的语义上下文和DINO的空间细节,我们的方法有效地解决了特征粒度不匹配的问题。在KITTI基准上进行的大量实验表明,我们的方法在所有指标上都明显优于SOTA方法,并确实有利于BEV感知等下游任务。代码可在https://github.com/Zhangwenyao1/Hybrid-depth上获得。
摘要:Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also indeed benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.
【2】Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning
标题:证据深度学习中用于动态不确定性校准的双层元策略控制
链接:https://arxiv.org/abs/2510.08938
摘要:传统的证据深度学习(EDL)方法依赖于静态超参数进行不确定性校准,限制了其在动态数据分布中的适应性,导致在高风险决策任务中校准和泛化能力较差。为了解决这个问题,我们提出了元策略控制器(MPC),一个动态的元学习框架,调整KL发散系数和Dirichlet先验强度的最佳不确定性建模。具体而言,MPC采用双层优化方法:在内环中,通过动态配置的适应当前训练状态的损失函数更新模型参数;在外环中,策略网络基于平衡预测准确性和不确定性质量的多目标奖励来优化KL发散系数和类特定Dirichlet先验强度。与之前使用固定先验的方法不同,我们的可学习Dirichlet先验能够灵活地适应类分布和训练动态。大量的实验结果表明,MPC显着提高了各种任务中模型预测的可靠性和校准,提高了不确定性校准,预测精度和基于置信度的样本拒绝后的性能保留。
摘要:Traditional Evidential Deep Learning (EDL) methods rely on static hyperparameters for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
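The bi-level structure can be illustrated on a scalar toy problem. The quadratic "fit" and "KL" gradients and the greedy coefficient update below are crude stand-ins for the actual model and policy network:

```python
def inner_update(theta, beta, grad_fit, grad_kl, lr=0.1):
    """Inner loop: one gradient step on loss = fit + beta * KL, with the
    KL coefficient beta supplied by the outer level."""
    return theta - lr * (grad_fit(theta) + beta * grad_kl(theta))

def outer_update(beta, reward, prev_reward, step=0.05):
    """Outer loop (stand-in for the policy network): nudge beta up if
    the reward improved, down otherwise, never below zero."""
    return max(0.0, beta + (step if reward >= prev_reward else -step))

# toy problem: the fit term pulls theta toward 2, the KL term toward 0
grad_fit = lambda th: 2.0 * (th - 2.0)
grad_kl = lambda th: 2.0 * th
theta, beta, prev_reward = 0.0, 1.0, float("-inf")
for _ in range(50):
    theta = inner_update(theta, beta, grad_fit, grad_kl)
    reward = -(theta - 2.0) ** 2         # accuracy-style reward
    beta = outer_update(beta, reward, prev_reward)
    prev_reward = reward
```

Even in this toy, beta stops being a fixed hyperparameter and becomes a trajectory chosen in response to training progress, which is the paper's departure from static EDL calibration.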
时序|行为识别|姿态|视频|运动估计(12篇)
【1】StreamingVLM: Real-Time Understanding for Infinite Video Streams
标题:StreamingVLM:无限视频流的实时理解
链接:https://arxiv.org/abs/2510.09608
备注:The first two authors contributed equally to this work
摘要:视觉语言模型(VLM)可以为实时助理和自主代理提供支持,但它们面临着一个关键挑战:在不增加延迟和内存使用的情况下理解近乎无限的视频流。以全注意力处理整个视频会导致二次方的计算成本,并且在长视频上性能低下。同时,简单的滑动窗口方法也有缺陷,因为它们要么破坏连贯性,要么由于冗余的重新计算而遭受高延迟。在本文中,我们介绍了StreamingVLM,一个为无限视觉输入的实时、稳定理解而设计的模型。我们的方法是一个使训练与流式推理保持一致的统一框架。在推理过程中,我们通过重用注意力汇的状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维持一个紧凑的KV缓存。这种流式能力是通过一个简单的监督微调(SFT)策略训练得到的,该策略在短的重叠视频块上应用全注意力,有效地模仿了推理时的注意力模式,而无需在过长的上下文上进行训练。为了进行评估,我们构建了Inf-Streams-Eval,这是一个新的基准,其视频平均超过两个小时,需要帧和文本之间每秒级的密集对齐。在Inf-Streams-Eval上,StreamingVLM在与GPT-4o mini的对比中取得了66.18%的胜率,并在单台NVIDIA H100上以高达8 FPS的速度保持稳定的实时性能。值得注意的是,我们的SFT策略还在无需任何特定于VQA的微调的情况下增强了通用VQA能力,将LongVideoBench和OVOBench Realtime的性能分别提高了+4.30和+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm上获得。
摘要:Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
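The described cache layout (attention sinks plus a short vision window and a long text window) amounts to an eviction policy. A sketch with cache entries as (position, modality) pairs; the window sizes are illustrative:

```python
def evict(cache, num_sinks, vision_window, text_window):
    """Keep the first `num_sinks` entries as attention sinks, the last
    `vision_window` vision tokens, and the last `text_window` text
    tokens; evict everything else and restore positional order."""
    sinks = cache[:num_sinks]
    rest = cache[num_sinks:]
    vision = [e for e in rest if e[1] == "vision"][-vision_window:]
    text = [e for e in rest if e[1] == "text"][-text_window:]
    return sinks + sorted(vision + text)

# a toy stream: every third token is text, the rest vision
stream = [(i, "vision" if i % 3 else "text") for i in range(20)]
cache = evict(stream, num_sinks=2, vision_window=4, text_window=6)
```

Because the cache size is now a constant independent of stream length, per-step attention cost stops growing with video duration, which is what makes "infinite" streams feasible.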
【2】Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement
标题:基于动态权重的时间聚合用于低光视频增强
链接:https://arxiv.org/abs/2510.09450
摘要:由于噪声、低对比度和颜色退化,低光照视频增强(LLVE)具有挑战性。基于学习的方法提供了快速的推理,但在真实的低光场景中仍然会遇到严重的噪声,这主要是由于有效利用时间信息的局限性。在本文中,我们解决这个问题与DWTA网,一种新的两阶段框架,联合利用短期和长期的时间线索。第一阶段采用视觉状态空间块进行多帧对齐,恢复亮度,颜色和局部一致性的结构。第二阶段引入了一个经常性的细化模块,具有由光流引导的基于动态权重的时间聚合,自适应地平衡静态和动态区域。纹理自适应损失进一步保留精细细节,同时提高平坦区域的平滑度。对真实世界低光视频的实验表明,DWTA-Net有效地抑制了噪声和伪影,与最先进的方法相比,提供了更好的视觉质量。
摘要:Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
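Per pixel, the dynamic weighting of Stage II can be sketched as a softmax over frames whose logits fall with flow magnitude; the exact weighting function below is an assumed concrete form, not the paper's:

```python
import numpy as np

def temporal_aggregate(frames, flow_mag, sharpness=4.0):
    """Blend T aligned frames per pixel with softmax weights that
    decrease with flow magnitude: static pixels average many frames
    (denoising), moving pixels lean on fewer frames (less ghosting).
    frames, flow_mag: arrays of shape (T, H, W)."""
    logits = -sharpness * flow_mag
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    return (w * frames).sum(axis=0)

frames = np.stack([np.full((2, 2), v) for v in (0.0, 1.0)])
static = temporal_aggregate(frames, np.zeros((2, 2, 2)))   # equal blend
moving_flow = np.zeros((2, 2, 2))
moving_flow[1] = 10.0                                      # frame 1 is moving
dynamic = temporal_aggregate(frames, moving_flow)          # leans on frame 0
```

In the static case every frame contributes equally (maximal denoising); once one frame shows large flow, its contribution collapses, which is the adaptive static/dynamic balance the abstract describes.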
【3】Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
标题:Mono4DEditor:通过语言嵌入高斯的点级定位从单目视频进行文本驱动的4D场景编辑
链接:https://arxiv.org/abs/2510.09438
备注:19 pages, 9 figures
摘要:基于文本提示编辑从单眼视频重构的4D场景是一项有价值但具有挑战性的任务,在内容创建和虚拟环境中具有广泛的应用。关键的困难在于在复杂的动态场景的局部区域中实现语义上精确的编辑,同时保持未编辑内容的完整性。为了解决这个问题,我们介绍了Mono4DEditor,一个灵活和准确的文本驱动的4D场景编辑的新框架。我们的方法增强了3D高斯与量化CLIP功能,形成一个语言嵌入的动态表示,使任意空间区域的高效语义查询。我们进一步提出了一个两阶段的点级定位策略,首先通过CLIP相似性选择候选高斯,然后细化其空间范围,以提高精度。最后,使用基于扩散的视频编辑模型在局部区域上执行有针对性的编辑,并具有确保空间保真度和时间一致性的流和涂鸦指导。大量的实验表明,Mono4DEditor可以在不同的场景和对象类型中进行高质量的文本驱动编辑,同时保留未编辑区域的外观和几何形状,并在灵活性和视觉保真度方面超越现有方法。
摘要:Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
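The two-stage point-level localization can be sketched on raw arrays; the bounding-box expansion used for stage 2 below is a simplified stand-in for the paper's spatial-extent refinement:

```python
import numpy as np

def localize(features, positions, text_emb, sim_thresh=0.6, margin=1.0):
    """Stage 1: select candidate Gaussians by cosine similarity between
    their embedded CLIP features and the text embedding. Stage 2: also
    keep low-similarity Gaussians inside the candidates' bounding box
    (expanded by `margin`), recovering parts the text match missed."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sims = f @ t
    candidates = sims >= sim_thresh
    lo = positions[candidates].min(axis=0) - margin
    hi = positions[candidates].max(axis=0) + margin
    inside = np.all((positions >= lo) & (positions <= hi), axis=1)
    return candidates | inside

features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
positions = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0]])
mask = localize(features, positions, text_emb=np.array([1.0, 0.0]))
```

The second Gaussian is kept despite a poor text match because it sits inside the refined spatial extent; the far-away third one stays untouched, which is how edits remain local.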
【4】MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
标题:MomentSeg:以时刻为中心的采样,增强视频像素理解
链接:https://arxiv.org/abs/2510.09274
摘要:参考视频对象分割(RefVOS)试图在自然语言描述的指导下分割视频中的目标对象,需要时间推理和细粒度的视觉理解。基于LLM的方法的现有采样策略通常依赖于手工制作的算法或外部关键帧模型。前者往往忽略了必要的时间线索,而后者增加了系统的复杂性。为了解决这个问题,我们提出了一个统一的框架,共同优化时间句接地(TSG)和RefVOS,自然结合关键时刻接地能力。在训练过程中,我们引入了一种新的TSG范式,该范式采用专用的\texttt{[FIND]}令牌,通过时间令牌相似性匹配进行关键时刻识别,从而避免了对外部时间戳编码的需要。对于推理,我们设计了一个以矩为中心的采样(MCS)策略,密集采样信息时刻,而稀疏采样非必要的帧,同时保留运动细节和全局上下文。为了进一步增强跟踪稳定性,我们开发了双向锚点更新传播(BAP),它利用最相关的时刻作为高质量掩模初始化的起点,并在采样点动态更新,以减轻累积误差。代码和模型可在以下网址获取:https://github.com/Dmmm1997/MomentSeg
摘要:Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
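MCS reduces to a stride schedule around the grounded moment; the strides below are illustrative values, not the paper's:

```python
def moment_centric_sample(num_frames, moment, dense_stride=1, sparse_stride=8):
    """Sample frames inside the key moment [start, end] densely and the
    rest of the video sparsely, keeping motion detail where it matters
    while still covering global context."""
    start, end = moment
    idxs, i = [], 0
    while i < num_frames:
        idxs.append(i)
        i += dense_stride if start <= i <= end else sparse_stride
    return idxs

frames = moment_centric_sample(32, moment=(8, 12))
```

Every frame of the grounded moment is kept while the rest of the clip is covered at a coarse stride, so the frame budget concentrates where the referred action happens.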
【5】Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption
标题:在线Video Depth Anything:低内存消耗的时间一致深度预测
链接:https://arxiv.org/abs/2510.09182
摘要:从单目视频中进行深度估计已经成为许多现实世界计算机视觉系统的关键组成部分。最近,视频深度任意(VDA)在长视频序列上表现出强大的性能。然而,它依赖于批量处理,禁止其在在线设置中使用。在这项工作中,我们克服了这一限制,并介绍了在线VDA(oVDA)。关键的创新是采用大型语言模型(LLM)的技术,即在推理过程中缓存潜在特征,并在训练时屏蔽框架。我们的oVDA方法在准确性和VRAM使用率方面优于所有竞争的在线视频深度估计方法。低VRAM使用率对于在边缘设备上部署尤为重要。我们演示了oVDA在NVIDIA A100上以42 FPS运行,在NVIDIA Jetson边缘设备上以20 FPS运行。我们将同时发布代码和编译脚本,使oVDA易于在低功耗硬件上部署。
摘要:Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch-processing which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both, code and compilation scripts, making oVDA easy to deploy on low-power hardware.
【6】mmJoints: Expanding Joint Representations Beyond (x,y,z) in mmWave-Based 3D Pose Estimation
标题:mmJoints:在基于mmWave的3D姿势估计中将关节表示扩展到(x,y,z)之外
链接:https://arxiv.org/abs/2510.08970
摘要:在基于毫米波的姿态估计中,稀疏信号和弱反射通常会导致模型从统计先验而不是传感器数据推断身体关节。虽然先验知识有助于学习有意义的表示,但过度依赖它会降低手势和活动识别等下游任务的性能。在本文中,我们介绍了mmJoints,这是一个框架,它使用额外的联合描述符来增强预训练的黑盒基于mmWave的3D姿态估计器的输出。mmJoints不是减轻偏差,而是通过估计关节被感知的可能性及其预测位置的可靠性来明确偏差。这些描述符增强了可解释性并提高了下游任务的准确性。通过在13个姿态估计设置中使用超过115,000个信号帧的广泛评估,我们表明mmJoints估计描述符的错误率低于4.2%。与最先进的方法相比,mmJoints还将关节位置精度提高了12.5%,并将活动识别提高了16%。
摘要:In mmWave-based pose estimation, sparse signals and weak reflections often cause models to infer body joints from statistical priors rather than sensor data. While prior knowledge helps in learning meaningful representations, over-reliance on it degrades performance in downstream tasks like gesture and activity recognition. In this paper, we introduce mmJoints, a framework that augments a pre-trained, black-box mmWave-based 3D pose estimator's output with additional joint descriptors. Rather than mitigating bias, mmJoints makes it explicit by estimating the likelihood of a joint being sensed and the reliability of its predicted location. These descriptors enhance interpretability and improve downstream task accuracy. Through extensive evaluations using over 115,000 signal frames across 13 pose estimation settings, we show that mmJoints estimates descriptors with an error rate below 4.2%. mmJoints also improves joint position accuracy by up to 12.5% and boosts activity recognition by up to 16% over state-of-the-art methods.
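The two descriptors can be approximated directly from radar returns. The radius, saturation count, and reliability heuristic below are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

def joint_descriptors(points, joints, radius=0.15, full_support=10):
    """For each predicted joint: a sensed-likelihood proxy (count of
    radar points within `radius`, saturating at `full_support`) and a
    reliability proxy that discounts joints whose nearby points sit far
    away on average. A joint with no nearby returns was likely inferred
    from priors rather than sensed."""
    descs = []
    for j in joints:
        d = np.linalg.norm(points - j, axis=1)
        near = d[d <= radius]
        sensed = min(len(near) / full_support, 1.0)
        reliability = float(1.0 - near.mean() / radius) if len(near) else 0.0
        descs.append((sensed, reliability))
    return descs

points = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [2.0, 2.0, 2.0]])
descs = joint_descriptors(points, joints=np.array([[0.0, 0.0, 0.0],
                                                   [1.0, 1.0, 1.0]]))
```

The second joint gets (0.0, 0.0): no radar support, so a downstream activity recognizer can discount it instead of trusting a prior-driven guess.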
【7】RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
标题:RO-Bench:使用文本驱动反事实视频对MLLM进行大规模稳健性评估
链接:https://arxiv.org/abs/2510.08936
摘要:最近,多模态大型语言模型(MLLM)在各种视频理解任务中表现出了显着的性能。然而,它们的鲁棒性,特别是当面对操纵的视频内容时,仍然在很大程度上未被探索。在本文中,我们介绍了Ro-Bench,第一个基准评估MLLM的动态分布(OOD)反事实视频测试集。Ro-Bench通过编辑风格、对象、背景及其组成,整合了高质量、多样化和时间相关的视频数据。我们评估了8个最近的视频MLLM,发现当前模型在暴露于反事实视频内容时,在Ro-Bench上表现出显著的性能下降。此外,我们证明了使用反事实数据微调MLLM增强了鲁棒性,在Ro-Bench上实现了21.73%的性能提升,在MVBench数据集中的20个任务中实现了12.78%的性能提升。这些发现强调了反事实数据在增强MLLM的视频理解能力方面的有效性。代码和数据将很快发布。
摘要:Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
【8】D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
标题:D-CoDe:通过动态压缩和问题分解将图像预训练的VLM扩展到视频
链接:https://arxiv.org/abs/2510.08818
备注:This paper has been accepted to EMNLP 2025
摘要:视频大语言模型(Vid-LLM),在不同的视频语言任务中表现出色,可以通过适应图像预训练的视觉语言模型(VLM)来有效地构建。然而,这种适应仍然具有挑战性,因为它需要处理超过基于图像的模型的能力的密集和时间扩展的视觉输入。本文确定的感知瓶颈和令牌过载的关键挑战,在基于图像的VLMs扩展到视频域。为了解决这些问题,我们提出了D-CoDe,一个无需训练的自适应框架,它结合了动态压缩和问题分解。具体而言,动态压缩通过自适应选择代表性帧和空间标记的内容感知聚合来消除感知瓶颈,从而在保留信息内容的同时减少冗余。同时,问题分解通过将原始查询重新定义为子问题来减轻令牌过载,引导模型专注于视频的不同方面,并实现更全面的理解。实验表明,D-CoDe有效地提高了各种基准的视频理解。此外,在具有挑战性的长视频基准测试中的出色表现凸显了D-CoDe在处理复杂视频语言任务方面的潜力。代码可在https://github.com/hukcc/D-CoDe上获得。
摘要:Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
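One concrete way to realize "adaptive selection of representative frames" is greedy farthest-point sampling over frame features; this is an assumed instantiation for illustration, not necessarily the paper's:

```python
import numpy as np

def select_frames(features, k):
    """Start from frame 0, then repeatedly add the frame whose feature
    is farthest from everything already selected (max-min distance), so
    near-duplicate frames are compressed away. features: (T, D)."""
    chosen = [0]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(features - features[c], axis=1) for c in chosen],
            axis=0)
        dists[chosen] = -1.0             # never re-pick a chosen frame
        chosen.append(int(dists.argmax()))
    return sorted(chosen)

# frames 0 and 1 are near-duplicates; frames 2 and 3 are distinct content
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
picked = select_frames(feats, k=3)
```

The near-duplicate frame 1 is skipped in favor of the two distinct ones, which is the redundancy-reduction behavior the abstract describes.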
【9】Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization
标题:Q-Router:使用专家模型路由和伪影定位的代理式视频质量评估
链接:https://arxiv.org/abs/2510.08789
摘要:视频质量评估(VQA)是一项基本的计算机视觉任务,旨在预测给定视频的感知质量与人类判断一致。现有的通过直接评分监督训练的高性能VQA模型存在以下问题:(1)在各种内容和任务中的泛化能力较差,从用户生成的内容(UGC),短视频到AI生成的内容(AIGC),(2)可解释性有限,以及(3)缺乏对新用例或内容类型的可扩展性。我们提出了Q-Router,一个代理框架的通用VQA与多层模型路由系统。Q-Router集成了一组不同的专家模型,并采用视觉语言模型(VLM)作为实时路由器,动态推理,然后根据输入的视频语义集成最合适的专家。我们建立了一个多层次的路由系统的计算预算的基础上,最重的一层涉及一个特定的时空工件本地化的可解释性。这种代理设计使Q-Router能够结合专业专家的互补优势,实现灵活性和鲁棒性,在异构视频源和任务中提供一致的性能。大量的实验表明,Q-Router在各种基准测试上都能匹配或超越最先进的VQA模型,同时大大提高了泛化能力和可解释性。此外,Q-Router在基于质量的问答基准测试Q-Bench-Video中表现出色,凸显了其作为下一代VQA系统基础的前景。最后,我们证明了Q-Router能够定位时空伪影,显示出作为后训练视频生成模型的奖励函数的潜力。
摘要:Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision--language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
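The tiered routing can be sketched as plain control flow; the expert names, tags, and budget thresholds here are hypothetical:

```python
def route(video_tags, budget, experts):
    """Pick a tier from the compute budget, then select every expert in
    that tier whose declared tags overlap the video's semantics; the
    heaviest tier additionally runs spatiotemporal artifact localization
    for interpretability."""
    tier = "light" if budget < 1.0 else ("mid" if budget < 3.0 else "heavy")
    picked = [name for name, (etier, tags) in experts.items()
              if etier == tier and tags & set(video_tags)]
    if tier == "heavy":
        picked.append("artifact_localizer")
    return tier, picked

EXPERTS = {
    "ugc_expert": ("light", {"ugc", "shortform"}),
    "aigc_expert": ("mid", {"aigc"}),
    "aigc_expert_xl": ("heavy", {"aigc"}),
}
tier, picked = route(["aigc"], budget=4.0, experts=EXPERTS)
```

In the full system a VLM, not a tag lookup, produces the routing decision, but the budget-then-semantics structure is the same.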
【10】Re-Identifying Kākā with AI-Automated Video Key Frame Extraction
标题:利用人工智能自动视频关键帧提取重新识别Kākā
链接:https://arxiv.org/abs/2510.08775
摘要:准确识别和重新识别个体动物对于成功的野生动物种群监测至关重要。传统的方法,如鸟类的腿带,是耗时和侵入性的。人工智能的最新进展,特别是计算机视觉,为智能保护和高效自动化提供了令人鼓舞的解决方案。这项研究提出了一种独特的管道,用于从新西兰一种受威胁的森林栖息鹦鹉kākā(Nestor meridionalis)的视频中提取高质量的关键帧。关键帧提取在行人重识别中得到了广泛的研究,然而,它在野生动物识别中的应用还很有限。我们使用定制喂食器上的视频记录来提取关键帧并评估管道的重新识别性能。我们的无监督方法结合了使用YOLO和Grounding DINO的对象检测、光流模糊检测、使用DINOv2的图像编码以及聚类方法来识别代表性的关键帧。结果表明,我们提出的关键帧选择方法产生的图像集在kākā重新识别中实现了高精度,为未来在更多样化和更具挑战性的环境中收集媒体的研究提供了基础。通过使用人工智能和计算机视觉,我们的非侵入性和有效的方法为传统的物理标记方法提供了一种有价值的替代方法,用于识别kākā个体,从而改善对种群的监测。这项研究有助于开发新的野生动物监测方法,并应用于生态学和保护生物学。
摘要:Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification; however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.
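A cheap sharpness score of the kind used to filter blurred key-frame candidates is the variance of the Laplacian; the pipeline above uses optical-flow-based blur detection, so this is an illustrative stand-in:

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian response: sharp frames have
    strong edges and score high, blurred frames score low."""
    lap = (-4 * gray[1:-1, 1:-1] + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

sharp = np.zeros((8, 8))
sharp[:, 4:] = 1.0                                  # hard vertical edge
blurry = np.linspace(0.0, 1.0, 64).reshape(8, 8)    # smooth ramp
score_sharp = laplacian_variance(sharp)
score_blurry = laplacian_variance(blurry)
```

Frames scoring below a chosen threshold would be dropped before the DINOv2 encoding and clustering stages.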
【11】Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models
标题:Human-VDM:从视频扩散模型学习单图像3D人体高斯飞溅
链接:https://arxiv.org/abs/2409.02851
备注:14 Pages, 8 figures, Project page: this https URL
摘要:从单个RGB图像生成逼真的3D人体仍然是计算机视觉中一项具有挑战性的任务,因为它需要精确的几何建模,高质量的纹理和看似真实的不可见部分。现有的方法通常使用多视图扩散模型来生成三维人体,但它们经常面临视图不一致的问题,这阻碍了高质量的三维人体生成。为了解决这个问题,我们提出了Human-VDM,一种新的方法,用于从一个单一的RGB图像,使用视频扩散模型生成三维人体。Human-VDM使用高斯溅射为3D人体生成提供时间一致的视图。它由三个模块组成:视图一致的人类视频扩散模块,视频增强模块和高斯飞溅模块。首先,将单个图像馈送到人类视频扩散模块中以生成连贯的人类视频。接下来,视频增强模块应用超分辨率和视频插值来增强所生成的视频的纹理和几何平滑度。最后,3D人类高斯飞溅模块在这些高分辨率和视图一致性图像的指导下学习逼真的人类。实验表明,Human-VDM从一幅图像中获得了高质量的3D人体,在生成质量和数量上都优于现有的方法。项目页面:https://human-vdm.github.io/Human-VDM/
摘要:Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D human from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/
【12】Interlaced dynamic XCT reconstruction with spatio-temporal implicit neural representations
标题:利用时空隐式神经表示的交错动态XCT重建
链接:https://arxiv.org/abs/2510.08641
摘要:在这项工作中,我们研究了使用时空隐式神经表示(INR)的动态X射线计算机断层扫描(XCT)重建下隔行扫描采集计划。所提出的方法相结合的ADMM为基础的优化与INCODE,一个条件框架,结合先验知识,使有效的收敛。我们评估我们的方法在不同的采集场景下,不同的严重程度的全球欠采样,空间复杂性(通过空间信息量化),和噪声水平。在所有设置中,我们的模型都实现了强大的性能,并优于基于时间交错模型的迭代重建(TIMBIR),这是一种最先进的基于模型的迭代方法。特别是,我们表明,感应偏置的INR提供了良好的鲁棒性,以适度的噪声水平,并通过加权最小二乘数据保真度项引入显式噪声建模显着提高性能更具挑战性的制度。这项工作的最后一部分探索了实用重建框架的扩展。我们证明了我们的方法的模块化明确建模检测器的非理想性,将环伪影校正直接在重建过程中。此外,我们提出了一个概念验证的4D体积重建,通过联合优化批量轴向切片,这种方法开辟了大规模并行化的可能性,这是处理大规模数据集的关键功能。
摘要:In this work, we investigate the use of spatio-temporal Implicit Neural Representations (INRs) for dynamic X-ray computed tomography (XCT) reconstruction under interlaced acquisition schemes. The proposed approach combines ADMM-based optimization with INCODE, a conditioning framework incorporating prior knowledge, to enable efficient convergence. We evaluate our method under diverse acquisition scenarios, varying the severity of global undersampling, spatial complexity (quantified via spatial information), and noise levels. Across all settings, our model achieves strong performance and outperforms Time-Interlaced Model-Based Iterative Reconstruction (TIMBIR), a state-of-the-art model-based iterative method. In particular, we show that the inductive bias of the INR provides good robustness to moderate noise levels, and that introducing explicit noise modeling through a weighted least squares data fidelity term significantly improves performance in more challenging regimes. The final part of this work explores extensions toward a practical reconstruction framework. We demonstrate the modularity of our approach by explicitly modeling detector non-idealities, incorporating ring artifact correction directly within the reconstruction process. Additionally, we present a proof-of-concept 4D volumetric reconstruction by jointly optimizing over batched axial slices, an approach which opens up the possibilities for massive parallelization, a critical feature for processing large-scale datasets.
医学相关(4篇)
【1】Lesion-Aware Post-Training of Latent Diffusion Models for Synthesizing Diffusion MRI from CT Perfusion
标题:潜在扩散模型的损伤感知后训练,用于从CT灌流合成扩散MRI
链接:https://arxiv.org/abs/2510.09056
备注:MICCAI 2025, Lecture Notes in Computer Science Vol. 15961
摘要:图像到图像转换模型可以帮助缓解医学图像采集中固有的各种挑战。潜在扩散模型(LDM)利用压缩潜在空间中的高效学习,构成了最先进的生成图像模型的核心。然而,这种效率是有代价的,可能会损害高保真医学图像所必需的关键像素级细节。当生成临床上重要的结构(例如病变)时,这种限制变得尤为关键,因为这些结构通常仅占据图像的一小部分。未能准确重建这些区域可能会严重影响诊断的可靠性和临床决策。为了克服这一局限性,我们通过纳入病变感知的医学像素空间目标,为医学图像到图像转换中的LDM提出了一种新颖的后训练框架。这种方法必不可少,因为它不仅提高了整体图像质量,还提高了病变勾画的精度。我们在急性缺血性卒中患者的脑CT到MRI转换上评估了我们的框架,在该场景中,早期而准确的诊断对于最佳治疗选择和改善患者预后至关重要。虽然弥散MRI是卒中诊断的金标准,但其临床实用性往往受到高成本和低可及性的限制。使用817例患者的数据集,我们证明了我们的框架在从CT灌注扫描合成DWI和ADC图像时提高了整体图像质量并增强了病变勾画,优于现有的图像到图像转换模型。此外,我们的后训练策略很容易适配预训练的LDM,并在各种医学图像转换任务中展现出更广泛的应用潜力。
摘要:Image-to-Image translation models can help mitigate various challenges inherent to medical image acquisition. Latent diffusion models (LDMs) leverage efficient learning in compressed latent space and constitute the core of state-of-the-art generative image models. However, this efficiency comes with a trade-off, potentially compromising crucial pixel-level detail essential for high-fidelity medical images. This limitation becomes particularly critical when generating clinically significant structures, such as lesions, which often occupy only a small portion of the image. Failure to accurately reconstruct these regions can severely impact diagnostic reliability and clinical decision-making. To overcome this limitation, we propose a novel post-training framework for LDMs in medical image-to-image translation by incorporating lesion-aware medical pixel space objectives. This approach is essential, as it not only enhances overall image quality but also improves the precision of lesion delineation. We evaluate our framework on brain CT-to-MRI translation in acute ischemic stroke patients, where early and accurate diagnosis is critical for optimal treatment selection and improved patient outcomes. While diffusion MRI is the gold standard for stroke diagnosis, its clinical utility is often constrained by high costs and low accessibility. Using a dataset of 817 patients, we demonstrate that our framework improves overall image quality and enhances lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models. Furthermore, our post-training strategy is easily adaptable to pre-trained LDMs and exhibits substantial potential for broader applications across diverse medical image translation tasks.
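A lesion-aware pixel-space objective of the kind described above can be sketched as a standard reconstruction loss plus an extra term restricted to the lesion mask, so that small lesions are not averaged away by the large background. The L1 form and the weight `lam` are our illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def lesion_aware_l1(pred, target, lesion_mask, lam=5.0):
    """Global L1 term plus an extra L1 term restricted to the lesion mask,
    so a small lesion is not drowned out by the large background region."""
    err = np.abs(pred - target)
    global_term = err.mean()
    m = lesion_mask.astype(bool)
    lesion_term = err[m].mean() if m.any() else 0.0
    return float(global_term + lam * lesion_term)

pred = np.zeros((4, 4))
target = np.zeros((4, 4)); target[1, 1] = 1.0        # a single-pixel "lesion"
mask = np.zeros((4, 4), dtype=bool); mask[1, 1] = True
print(lesion_aware_l1(pred, target, mask))  # 1/16 + 5.0 * 1.0 = 5.0625
```

The 1-pixel error contributes only 1/16 to the global term but a full 1.0 to the lesion term, which is exactly the imbalance such an objective is meant to correct.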
【2】The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective
标题:公平人工智能在医学图像预测中的界限:因果角度
链接:https://arxiv.org/abs/2510.08840
备注:Accepted at NeurIPS 2025
摘要:随着机器学习(ML)算法越来越多地用于医学图像分析,人们开始担心它们对某些社会群体的潜在偏见。尽管已经提出了许多方法来确保ML模型的公平性,但大多数现有工作仅关注医学图像诊断任务(如图像分类和分割),而忽视了预后场景,即预测病情随时间的可能结局或进展。为了弥补这一空白,我们引入了FairTTE,这是第一个用于评估医学成像中事件发生时间(TTE)预测公平性的综合框架。FairTTE涵盖多种成像模态和TTE结局,集成了最前沿的TTE预测和公平性算法,可对医学图像预后的公平性进行系统和细粒度的分析。利用因果分析技术,FairTTE发现并量化了医学成像数据集中嵌入的不同偏倚来源。我们的大规模评估表明,偏倚在不同的成像模态中普遍存在,而目前的公平性方法只能提供有限的缓解。我们进一步证明了潜在偏倚来源与模型差异之间的强关联,强调需要针对所有形式偏倚的整体方法。值得注意的是,我们发现公平性在分布偏移下变得越来越难以维持,凸显了现有解决方案的局限性以及对更稳健、更公平的预后模型的迫切需要。
摘要:As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
【3】Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
标题:对齐、挖掘与融合:面向医学视觉问答的硬负样本挖掘与选择性知识融合的表示对齐
链接:https://arxiv.org/abs/2510.08791
备注:CVPR2025 Paper
摘要:医学视觉问答(Med-VQA)是一项具有挑战性的任务,需要对医学图像和文本问题有深入的理解。尽管最近利用医学视觉语言预训练(Med-VLP)的工作在Med-VQA任务上表现出了很强的性能,但仍然没有统一的模态对齐解决方案,硬负样本问题也尚未得到充分探索。此外,Med-VQA常用的知识融合技术可能会引入不相关的信息。在这项工作中,我们提出了一个框架,通过三个关键贡献来解决这些挑战:(1)一个跨层级、模态、视图和阶段的异构模态对齐的统一解决方案,利用对比学习和最优传输理论等方法;(2)一种硬负样本挖掘方法,采用软标签进行多模态对齐,并强化对硬负样本对的判别;(3)用于Med-VQA的门控交叉注意模块(Gated Cross-Attention Module),其将答案词表作为先验知识并从中选择相关信息。我们的框架在RAD-VQA、SLAKE、PathVQA和VQA-2019等广泛使用的Med-VQA数据集上优于先前的最新技术。
摘要:Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.
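The gated cross-attention idea, attending over an answer-vocabulary memory and gating how much of it is mixed into each query token, can be sketched as follows. The sigmoid-gated residual form and all shapes are our assumptions for illustration, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(q, vocab, Wg, bg):
    """q: (n, d) fused image-question tokens; vocab: (m, d) answer-vocabulary
    embeddings used as both keys and values. A sigmoid gate decides, per token
    and per channel, how much vocabulary knowledge to mix back in."""
    d = q.shape[-1]
    attn = softmax(q @ vocab.T / np.sqrt(d))     # (n, m) relevance over the vocabulary
    attended = attn @ vocab                      # (n, d) retrieved prior knowledge
    gate = 1.0 / (1.0 + np.exp(-(q @ Wg + bg)))  # values in (0, 1)
    return q + gate * attended                   # gated residual fusion

rng = np.random.default_rng(0)
n, m, d = 2, 5, 8
q = rng.normal(size=(n, d))
vocab = rng.normal(size=(m, d))
out = gated_cross_attention(q, vocab, 0.1 * rng.normal(size=(d, d)), np.zeros(d))
print(out.shape)  # (2, 8)
```

The gate is what makes the fusion selective: a token whose gate saturates near zero ignores the vocabulary prior entirely, addressing the "irrelevant information" problem the abstract raises.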
【4】FS-RWKV: Leveraging Frequency Spatial-Aware RWKV for 3T-to-7T MRI Translation
标题:FS-RWKV:利用频率空间感知RWKV进行3T至7T MRI转换
链接:https://arxiv.org/abs/2510.08951
备注:Accepted by BIBM 2025
摘要:超高场7T MRI提供增强的空间分辨率和组织对比度,能够检测神经系统疾病的细微病理变化。然而,由于高昂的基础设施成本和技术要求,7T扫描仪的有限可用性限制了其广泛的临床采用。从易于获得的3T采集合成7T质量图像的计算方法为这一可及性挑战提供了可行的解决方案。现有的CNN方法受限于有限的空间覆盖,而Transformer模型需要过多的计算开销。RWKV架构为医学图像合成中的全局特征建模提供了一种高效的替代方案,将线性计算复杂度与强大的长程依赖捕获相结合。在此基础上,我们提出了频率空间RWKV(FS-RWKV),一个基于RWKV的3T到7T MRI转换框架。为了更好地应对解剖细节保留和全局组织对比度恢复的挑战,FS-RWKV包含两个关键模块:(1)频率-空间全向移位(FSO-Shift),其执行离散小波分解,随后在低频分支上进行全向空间移位,以增强全局上下文表示,同时保留高频解剖细节;(2)结构保真度增强块(SFEB),一个通过频率感知特征融合自适应地强化解剖结构的模块。在UNC和BNU数据集上的综合实验表明,FS-RWKV在T1w和T2w两种模态上始终优于现有的基于CNN、Transformer、GAN和RWKV的基线,实现了卓越的解剖保真度和感知质量。
摘要:Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
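The FSO-Shift recipe, wavelet decomposition followed by omnidirectional shifting of the low-frequency branch, can be illustrated with a one-level 2D Haar transform and circular shifts. This is a simplified stand-in, not the paper's module: the Haar wavelet, shift step, and averaging over four directions are our assumptions.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform: returns (LL, LH, HL, HH) subbands."""
    a  = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical averages
    dv = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical details
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (dv[:, 0::2] + dv[:, 1::2]) / 2.0
    hh = (dv[:, 0::2] - dv[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def omnidirectional_shift(ll, step=1):
    """Average circular shifts of the LL band along four directions, spreading
    low-frequency context while the high-frequency bands stay untouched."""
    shifted = [np.roll(ll, step, axis=0), np.roll(ll, -step, axis=0),
               np.roll(ll, step, axis=1), np.roll(ll, -step, axis=1)]
    return sum(shifted) / len(shifted)

img = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar_dwt2(img)
print(ll.shape, omnidirectional_shift(ll).shape)  # (4, 4) (4, 4)
```

Shifting only the LL band is the key design point: global context is spread at half resolution while LH/HL/HH, which carry the fine anatomical edges, are left untouched.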
自动驾驶|车辆|车道检测等(1篇)
【1】Towards Safer and Understandable Driver Intention Prediction
标题:迈向更安全、更容易理解的驾驶员意图预测
链接:https://arxiv.org/abs/2510.09200
备注:10 pages
摘要:自动驾驶(AD)系统处理复杂任务的能力越来越强,这主要归功于深度学习和人工智能的最新进展。随着自动驾驶系统与人类之间交互的增加,驾驶系统中决策过程的可解释性对于确保安全驾驶操作变得越来越重要。成功的人机交互需要理解环境和驾驶任务的底层表示,这在基于深度学习的系统中仍然是一个重大挑战。为此,我们引入了在驾驶操作发生之前对其进行可解释预测的任务,即驾驶员意图预测(DIP),它对驾驶员安全至关重要,并在AD系统中起关键作用。为了促进可解释DIP的研究,我们构建了可解释驾驶行为预期数据集(DAAD-X),这是一个新的多模态、以自我为中心的视频数据集,为驾驶员的决策提供分层的高层文本解释作为因果推理。这些解释同时来自驾驶员的眼动注视和自车视角。接下来,我们提出了视频概念瓶颈模型(VCBM),一个无需依赖事后技术、本身即可生成时空连贯解释的框架。最后,通过在DAAD-X数据集上对所提出的VCBM进行广泛评估,我们证明了基于Transformer的模型比传统的基于CNN的模型具有更高的可解释性。此外,我们引入了一种多标签t-SNE可视化技术,以展示多个解释之间的解耦与因果关联。我们的数据、代码和模型可在以下网址获得:https://mukil07.github.io/VCBM.github.io/
摘要:Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
OCR|文本相关(1篇)
【1】Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models
标题:调整初始噪声以缓解文本到图像扩散模型中的记忆化
链接:https://arxiv.org/abs/2510.08625
摘要:尽管文本到图像扩散模型具有令人印象深刻的生成能力,但它们通常会记忆并复制训练数据,这引发了对隐私和版权的严重担忧。最近的工作将这种记忆化归因于一个吸引盆地,即在该区域内应用无分类器引导(CFG)会将去噪轨迹引向记忆化的输出,并提出推迟CFG的应用,直到去噪轨迹逃离这个盆地。然而,这种延迟通常会导致未被记忆的图像与输入提示不一致,突出了促进更早逃离的必要性,以便可以在去噪过程中更早地应用CFG。在这项工作中,我们表明,初始噪声样本在决定这种逃离何时发生方面起着至关重要的作用。我们凭经验观察到,不同的初始样本导致不同的逃离时间。基于这一认识,我们提出了两种缓解策略,通过整体或单独地调整初始噪声,来寻找并利用能够促使更早逃离盆地的初始样本。这些方法在保持图像-文本对齐的同时显著减少了记忆化。
摘要:Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin-a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs-and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise-either collectively or individually-to find and utilize initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.
Attention注意力(1篇)
【1】LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
标题:LinearSR:释放线性注意力,实现稳定有效的图像超分辨率
链接:https://arxiv.org/abs/2510.08771
备注:19 pages, 9 figures, 6 tables
摘要:用于图像超分辨率(SR)的生成模型日益强大,但它们对自注意力的二次复杂度(O(N^2))的依赖造成了主要的计算瓶颈。线性注意力提供了O(N)的解决方案,但其在真实感SR上的潜力在很大程度上尚未被开发,历史上受到一系列相互关联且此前未解决的挑战的阻碍。本文介绍了LinearSR,一个首次系统性克服这些关键障碍的整体框架。具体而言,我们利用新颖的基于"拐点"的提前停止引导微调(ESGF)策略,解决了导致灾难性模型发散的根本性训练不稳定问题。此外,我们用专门的基于SNR的混合专家(MoE)架构缓解了经典的感知-失真权衡。最后,我们基于"精度优先于体量"的原则,建立了一个有效且轻量的引导范式TAG。我们最终的LinearSR模型同时提供最先进的感知质量和卓越的效率:其核心扩散前向传递(1-NFE)达到SOTA级别的速度,而整体多步推理时间仍然极具竞争力。这项工作为在真实感SR领域应用线性注意力提供了第一个稳健的方法,为未来高效生成式超分辨率的研究建立了基础范式。
摘要:Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental, training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
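The O(N) mechanism this abstract builds on can be illustrated with generic kernelized linear attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V), computed right-to-left so the N x N attention matrix is never materialized. The sketch below uses the common elu+1 feature map and is not LinearSR's exact attention variant.

```python
import numpy as np

def elu_plus_one(x):
    """Positive feature map phi(.) commonly used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N) attention: phi(Q) @ (phi(K)^T V), normalized per query token.
    The (d, d_v) summary kv is independent of N, so the N x N attention
    matrix is never formed."""
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    kv = Kf.T @ V                 # (d, d_v) key-value summary
    z = Qf @ Kf.sum(axis=0)       # (N,) per-query normalizer
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 1024, 16
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 16)
```

As with softmax attention, each output row is a convex combination of rows of V (the weights are positive and sum to one), but the cost is O(N d^2) instead of O(N^2 d).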
人脸|人群计数(1篇)
【1】Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling
标题:用眼睛觅食:人类视觉凝视的动力学和深度预测建模
链接:https://arxiv.org/abs/2510.09299
摘要:动物通常通过莱维行走(Lévy walk)觅食,这是一种步长呈重尾分布的随机轨迹,针对资源稀疏的环境进行了优化。我们发现,人类在扫视图像时的视觉注视遵循类似的动态。虽然传统模型强调基于图像的显著性,但眼动背后的时空统计特性仍未得到充分探索。理解这些动态在注意力建模和基于视觉的界面中有着广泛的应用。在这项研究中,我们进行了一项大规模的人类受试者实验,40名参与者在不受约束的条件下观看50张不同的图像,并使用高速眼动仪记录了超过400万个注视点。对这些数据的分析表明,人眼的注视轨迹也遵循类似于动物觅食的莱维行走。这表明人眼以最优高效的方式搜寻视觉信息。此外,我们训练了一个卷积神经网络(CNN),仅从图像输入预测注视热图。该模型准确地再现了新图像中的显著注视区域,表明注视行为的关键组成部分可以仅从视觉结构中学习。我们的研究结果提供了新的证据,表明人类视觉探索遵循类似于自然觅食的统计规律,并为通过生成和预测框架对注视进行建模开辟了途径。
摘要:Animals often forage via Levy walks stochastic trajectories with heavy tailed step lengths optimized for sparse resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Levy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
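A Lévy walk's defining ingredient, heavy-tailed step lengths p(x) proportional to x^(-1-alpha), can be drawn with inverse-CDF (Pareto) sampling. A minimal sketch; the exponent alpha=1.5 and scale are illustrative choices, not values fitted to the paper's gaze data.

```python
import numpy as np

def levy_steps(n, alpha=1.5, x_min=1.0, seed=0):
    """Sample heavy-tailed step lengths p(x) ~ x^(-1-alpha) for x >= x_min
    via inverse-CDF sampling: x = x_min * u^(-1/alpha), u ~ Uniform(0, 1)."""
    u = np.random.default_rng(seed).uniform(size=n)
    return x_min * u ** (-1.0 / alpha)

steps = levy_steps(100_000, alpha=1.5)
# heavy tail: the largest step dwarfs the median (theoretical median = 2^(2/3) ~ 1.587)
print(round(float(np.median(steps)), 2), bool(steps.max() > 100 * np.median(steps)))
```

Compared with Gaussian (Brownian) steps, the occasional enormous step is the signature that such tail analyses look for in saccade-amplitude distributions.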
图像视频检索|Re-id相关(1篇)
【1】Hierarchical Scheduling for Multi-Vector Image Retrieval
标题:多载体图像检索的分层调度
链接:https://arxiv.org/abs/2510.08976
备注:Under Review
摘要:为了有效地利用特定于用户的数据,多模态大语言模型(MLLM)应用中采用了检索增强生成(RAG)。然而,传统的检索方法往往检索精度有限。多向量检索(MVR)的最新进展通过分解查询并与分割后的图像进行匹配来提高准确性,但其准确性和效率仍不理想:忽视了查询与各异的图像对象之间的对齐,且细粒度图像片段存在冗余。在这项工作中,我们提出了一个高效的图像检索调度框架HiMIR。首先,我们引入了一种新颖的分层范式,针对不同的图像对象采用多个中间粒度来增强对齐。其次,我们利用跨层级相似性一致性和层级稀疏性来减少不必要的匹配计算,从而最大限度地减少检索中的冗余。此外,我们为每个数据集自动配置参数,以便在不同场景中具有实用性。我们的实证研究表明,HiMIR不仅实现了显著的精度提升,而且与现有MVR系统相比,计算量最多可减少3.5倍。
摘要:To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that, HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
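Multi-vector retrieval scoring of the kind HiMIR builds on is commonly done with late interaction: each query sub-vector takes its best cosine match among image-segment vectors, and the per-query maxima are summed. The sketch below shows this generic MaxSim score; HiMIR's hierarchical granularities and redundancy pruning are omitted, so treat it as background rather than the paper's method.

```python
import numpy as np

def maxsim_score(query_vecs, seg_vecs):
    """Late-interaction score: each query sub-vector matches its best
    image segment; the per-query maxima are summed."""
    qn = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sn = seg_vecs / np.linalg.norm(seg_vecs, axis=1, keepdims=True)
    return float((qn @ sn.T).max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])      # two query sub-vectors
img_a = np.array([[1.0, 0.1], [0.1, 1.0]])  # segments covering both query aspects
img_b = np.array([[1.0, 0.0], [1.0, 0.0]])  # segments covering only one aspect
print(maxsim_score(q, img_a) > maxsim_score(q, img_b))  # True
```

The cost of this score grows with the number of segment vectors per image, which is exactly why HiMIR's pruning of redundant fine-grained segments pays off.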
表征学习(1篇)
【1】PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
标题:PHyCLIP:双曲因子的$\ell_1$-乘积统一视觉语言表示学习中的层次性与组合性
链接:https://arxiv.org/abs/2510.08919
备注:23 pages
摘要:视觉语言模型在从大规模视觉场景与语言描述对中进行多模态表示学习方面取得了显著成功。然而,它们仍然难以同时表达两种不同类型的语义结构:概念族内部的层次结构(例如,狗$\preceq$哺乳动物$\preceq$动物)和跨概念族的组合性(例如,"一只狗在车里"$\preceq$狗、车)。最近的工作通过采用双曲空间来应对这一挑战,双曲空间能有效地捕捉树状层次结构,但其对表示组合性的适用性仍不清楚。为了解决这一难题,我们提出了PHyCLIP,它在双曲因子的笛卡尔积上采用$\ell_1$-乘积度量。在我们的设计中,族内层次结构出现在各个双曲因子中,而跨族组合由$\ell_1$-乘积度量捕获,类似于布尔代数。在零样本分类、检索、层次分类和组合理解任务上的实验表明,PHyCLIP优于现有的单空间方法,并在嵌入空间中提供更可解释的结构。
摘要:Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
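The $\ell_1$-product metric can be written down directly: the distance between two embeddings is the sum, over factors, of Poincaré-ball geodesic distances $d(u,v) = \operatorname{arccosh}\!\big(1 + 2\|u-v\|^2 / ((1-\|u\|^2)(1-\|v\|^2))\big)$. A minimal numeric sketch, with factor dimensions and points chosen purely for illustration:

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball (requires ||u||, ||v|| < 1)."""
    du, dv = 1.0 - u @ u, 1.0 - v @ v
    delta = u - v
    x = 1.0 + 2.0 * (delta @ delta) / max(du * dv, eps)
    return float(np.arccosh(x))

def l1_product_dist(xs, ys):
    """ell_1-product metric: sum of per-factor hyperbolic distances."""
    return sum(poincare_dist(x, y) for x, y in zip(xs, ys))

# two embeddings, each a tuple of two 2-D hyperbolic factors
a = [np.array([0.1, 0.0]), np.array([0.0, 0.2])]
b = [np.array([0.1, 0.0]), np.array([0.0, 0.5])]
print(l1_product_dist(a, a) == 0.0, l1_product_dist(a, b) > 0.0)  # True True
```

Because the total distance is a plain sum, a concept can be close in one factor (its own family hierarchy) while differing in another, which is how the design separates intra-family hierarchy from cross-family composition.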
蒸馏|知识提取(1篇)
【1】Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation
标题:通过特征空间扰动防御图像恢复中的未授权蒸馏
链接:https://arxiv.org/abs/2510.08925
摘要:知识蒸馏(KD)攻击使对手能够利用教师模型的输出来训练学生网络,对深度模型的知识产权构成重大威胁。虽然最近图像分类中的防御方法已经通过扰动输出概率成功地破坏了KD,但将这些方法扩展到图像恢复是困难的。与分类不同,恢复是一项生成任务,具有连续的高维输出,依赖于空间一致性和精细细节;微小的扰动往往不够,因为学生仍然可以学到底层映射。为了解决这个问题,我们提出了自适应奇异值扰动(ASVP),一种为图像恢复模型量身定制的运行时防御。ASVP使用奇异值分解(SVD)对教师的内部特征图进行操作,放大前k个奇异值以注入结构化的高频扰动,破坏蒸馏所需的对齐。这在保持教师输出质量的同时阻碍了学生的学习。我们在五个图像恢复任务上评估ASVP:超分辨率、低光增强、水下增强、去雾和去雨。实验表明,ASVP可使学生模型的PSNR降低多达4 dB、SSIM降低60-75%,而对教师性能的影响可忽略不计。与现有方法相比,ASVP提供了更强、更一致的防御。我们的方法为保护开源恢复模型免受未经授权的知识蒸馏提供了一个实用的解决方案。
摘要:Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model's outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying mapping. To address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the topk singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher's output quality. We evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher's performance. Compared to prior methods, ASVP offers a stronger and more consistent defense. Our approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
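The core ASVP operation, amplifying the top-k singular values of an internal feature map, is easy to sketch with a plain SVD. Illustrative only: k, the gain gamma, and the 2-D feature-map shape are our assumptions, and the real defense applies this inside the network at runtime rather than to a standalone matrix.

```python
import numpy as np

def asvp_perturb(feat, k=2, gamma=3.0):
    """Sketch of ASVP's core step: amplify the top-k singular values of a
    2-D feature map by gamma and reconstruct, injecting a structured
    perturbation aligned with the dominant components."""
    U, S, Vt = np.linalg.svd(feat, full_matrices=False)
    S_pert = S.copy()
    S_pert[:k] *= gamma
    return U @ np.diag(S_pert) @ Vt

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 16))
pert = asvp_perturb(feat, k=2, gamma=3.0)
print(bool(np.linalg.norm(pert) > np.linalg.norm(feat)))  # True
```

Because only the dominant singular directions are scaled, the perturbation is structured rather than random noise, which is what makes it hard for a student to average away during distillation.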
视觉解释|视频理解VQA|caption等(1篇)
【1】CapGeo: A Caption-Assisted Approach to Geometric Reasoning
标题:CapGeo:一种字幕辅助的几何推理方法
链接:https://arxiv.org/abs/2510.09302
备注:preprint, under review
摘要:几何推理仍然是多模态大型语言模型(MLLM)的核心挑战。即使是最先进的闭源系统,如GPT-O3和Gemini-2.5-Pro,尽管在国际数学奥林匹克(IMO)等任务上表现出强大的文本推理能力,仍然难以可靠地解决几何问题。这一差距表明,瓶颈在于理解几何图形,而非推理本身。由于几何图形通常可以用简洁的文本形式被忠实地描述,将视觉内容转换为字幕提供了一个有前途的方向。基于这一洞察,我们提出了CapGeo,一个连接视觉与文本模态的字幕辅助推理框架。实验表明,当模型配备字幕时,性能有实质性的提升:Qwen2.5-VL-72B从8.6%(仅视觉)提高到59.0%,而Claude-Opus-4从44.8%提高到73.0%。为了系统地评估和识别高质量的几何字幕模型,我们进一步提出了CapGeo-Bench,一个包含4,641个精选图形-字幕对的数据集。至关重要的是,CapGeo-Bench采用了与下游CapGeo性能密切相关的基于关键点的评估指标,从而能够可靠地评估几何字幕能力。总之,我们的框架和基准为推进MLLM的几何推理开辟了一条新途径。
摘要:Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
超分辨率|去噪|去模糊|去雾(1篇)
【1】SkipSR: Faster Super Resolution with Token Skipping
标题:SkipSR:通过令牌跳过更快的超级分辨率
链接:https://arxiv.org/abs/2510.08799
备注:14 pages, 7 figures
摘要:基于扩散的超分辨率(SR)是视频生成和视频恢复中的关键组件,但速度慢且开销大,限制了向更高分辨率和更长视频的扩展。我们的关键见解是,视频中的许多区域本质上细节较少,从细化中获益甚微,但目前的方法对所有像素一视同仁。为了利用这一点,我们提出了SkipSR,一个简单的视频SR加速框架:直接从低分辨率输入中识别低细节区域,完全跳过对它们的计算,只对需要细化的区域进行超分辨率处理。这种简单而有效的策略在标准和一步扩散SR模型中都保持了感知质量,同时显著减少计算。在标准SR基准测试中,我们的方法在720p视频上的端到端延迟比先前模型最多快60%,且没有可感知的质量损失。视频演示可在https://rccchoudhury.github.io/skipsr/上获得
摘要:Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
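The skip decision can be approximated with a per-patch detail score computed on the low-resolution input, for example tile variance: flat tiles are skipped, detailed tiles go to the SR model. The variance criterion, tile size, and threshold below are illustrative assumptions, not SkipSR's actual detail detector.

```python
import numpy as np

def detail_mask(lr, patch=4, thresh=1e-3):
    """Split a low-res frame into patch x patch tiles and flag the ones whose
    variance exceeds thresh; flat tiles can be skipped by the SR model."""
    h, w = lr.shape
    tiles = lr.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.var(axis=(2, 3)) > thresh

lr = np.zeros((8, 8))
lr[0:4, 0:4] = np.arange(16, dtype=float).reshape(4, 4)  # one detailed tile
mask = detail_mask(lr, patch=4)
print(int(mask.sum()))  # 1: only the top-left tile needs refinement
```

In a full pipeline the flat tiles would be upscaled with a cheap interpolation while only the flagged tiles pass through the diffusion SR network, which is where the latency savings come from.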
点云|SLAM|雷达|激光|深度RGBD相关(2篇)
【1】Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification
标题:Minkowski-MambaNet:具有选择性状态空间模型的森林生物量量化点云框架
链接:https://arxiv.org/abs/2510.09367
摘要:准确的森林生物量量化对于碳循环监测至关重要。虽然机载激光雷达在捕捉3D森林结构方面表现出色,但由于难以对区分树木所需的长程依赖关系进行建模,直接从点云估计木材体积和地上生物量(AGB)具有挑战性。我们提出了Minkowski-MambaNet,一种直接从原始激光雷达估计体积和AGB的新型深度学习框架。其主要创新是将Mamba模型的选择性状态空间模型(SSM)集成到Minkowski网络中,从而有效地编码全局上下文和长程依赖关系,以改进树木区分;并引入跳跃连接以增强特征并加速收敛。在丹麦国家森林资源清查LiDAR数据上的评估表明,Minkowski-MambaNet明显优于最先进的方法,提供更准确、更稳健的估计。关键的是,它不需要数字地形模型(DTM),并且对边界伪影具有鲁棒性。这项工作为大规模森林生物量分析提供了一个强大的工具,推进了基于激光雷达的森林清查。
摘要:Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish trees. We propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model's Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate convergence. Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.
【2】MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling
Link: https://arxiv.org/abs/2510.09088
Comments: 11 pages, 12 figures
Abstract: We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose a Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.
Multimodal (3 papers)
【1】Spotlight on Token Perception for Multimodal Reinforcement Learning
Link: https://arxiv.org/abs/2510.09285
Comments: 31 pages, 10 figures, project page: this https URL
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
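VPPO's dual mechanism (trajectory-level advantage reweighting plus token-level focusing) can be sketched as per-token gradient weights. The top-fraction rule for picking "perceptually pivotal" tokens and the function name are illustrative assumptions, not the paper's exact criterion:

```python
# Hypothetical sketch: zero out the learning signal on non-pivotal tokens and
# scale the remaining tokens by the trajectory's overall visual dependency.

def vppo_token_weights(token_dependency, traj_dependency, top_frac=0.25):
    """Per-token weights for the policy update: keep only the most visually
    dependent tokens (assumed top-fraction rule), scaled by the trajectory's
    overall visual dependency."""
    k = max(1, int(len(token_dependency) * top_frac))
    cutoff = sorted(token_dependency, reverse=True)[k - 1]
    return [traj_dependency if d >= cutoff else 0.0 for d in token_dependency]

# Four tokens, two with high visual dependency; half of them are kept.
weights = vppo_token_weights([0.9, 0.1, 0.2, 0.8],
                             traj_dependency=0.5, top_frac=0.5)
```

Only the two visually grounded tokens receive a non-zero update weight, and that weight carries the trajectory-level reweighting factor.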
【2】Unleashing Perception-Time Scaling to Multimodal Reasoning Models
Link: https://arxiv.org/abs/2510.08964
Abstract: Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
【3】Deep Multimodal Subspace Clustering Networks
Link: https://arxiv.org/abs/1804.06498
Abstract: We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages - multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder's reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layers corresponding to different modalities are enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
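The self-expressiveness property (each point reconstructed from the other points, X ≈ XC with a zero diagonal on C) can be sketched numerically. The paper learns C as a network layer by backpropagation; the plain gradient descent, ridge penalty, learning rate, and helper names below are illustrative assumptions:

```python
# Sketch: gradient descent on ||X - XC||_F^2 + lam * ||C||_F^2 with
# diag(C) = 0, the objective the self-expressive layer enforces.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def self_expressive(X, lam=0.01, lr=0.1, steps=500):
    """X is d x n (columns are points); returns the n x n coefficient C."""
    d, n = len(X), len(X[0])
    C = [[0.0] * n for _ in range(n)]
    Xt = [list(r) for r in zip(*X)]  # n x d
    for _ in range(steps):
        # residual R = X - XC, then the gradient building block G = X^T R
        R = [[X[i][j] - sum(X[i][k] * C[k][j] for k in range(n))
              for j in range(n)] for i in range(d)]
        G = matmul(Xt, R)
        for i in range(n):
            for j in range(n):
                if i != j:  # keep the zero-diagonal constraint
                    C[i][j] += lr * (2 * G[i][j] - 2 * lam * C[i][j])
    return C

# Two coincident points and one orthogonal point: C pairs up the duplicates,
# so the affinity |C| + |C|^T comes out block-diagonal (two clusters).
X = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
C = self_expressive(X)
```

The affinity matrix built from this C is what a spectral clustering step would consume downstream.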
3D | 3D Reconstruction (3 papers)
【1】Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
Link: https://arxiv.org/abs/2510.09364
Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
【2】Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction
Link: https://arxiv.org/abs/2510.08839
Abstract: Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high-quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
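The core of each cooperative agent is a standard tabular Q-learning update. The state/action encodings and the reward value below are placeholders (the paper's reward balances latency against reconstruction quality), so this is only a sketch of one agent's update rule:

```python
# Sketch of one online Q-learning step on a dict-backed table Q[(state, action)].
# State names, server names, and the reward are hypothetical placeholders.

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]

# Server-selection agent: one interaction with the edge environment.
Q = {}
servers = ["edge-a", "edge-b"]
q_update(Q, state="high-load", action="edge-a", reward=1.0,
         next_state="low-load", actions=servers)
```

The camera-selection agent would run the same update over its own state/action space, with both agents learning entirely online as the abstract describes.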
【3】Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba
Link: https://arxiv.org/abs/2407.09646
Comments: NeurIPS 2024; Project Website: this https URL
Abstract: 3D Hand reconstruction from a single RGB image is challenging due to the articulated motion, self-occlusion, and interaction with objects. Existing SOTA methods employ attention-based transformers to learn the 3D hand pose and shape, yet they do not fully achieve robust and accurate performance, primarily due to inefficiently modeling spatial relations between joints. To address this problem, we propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling. Our core idea is to reformulate Mamba's scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens. This enables us to efficiently learn the spatial relationships between joints for improving reconstruction performance. Specifically, we design a Graph-guided State Space (GSS) block that learns the graph-structured relations and spatial sequences of joints and uses 88.5% fewer tokens than attention-based methods. Additionally, we integrate the state space features and the global features using a fusion module. By utilizing the GSS block and the fusion module, Hamba effectively leverages the graph-guided state space features and jointly considers global and local features to improve performance. Experiments on several benchmarks and in-the-wild tests demonstrate that Hamba significantly outperforms existing SOTAs, achieving a PA-MPVPE of 5.3mm and an F@15mm of 0.992 on FreiHAND. At the time of this paper's acceptance, Hamba holds the top position (Rank 1) on two competition leaderboards for 3D hand reconstruction. Project Website: https://humansensinglab.github.io/Hamba/
Other Neural Networks | Deep Learning | Models | Modeling (8 papers)
【1】Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
Link: https://arxiv.org/abs/2510.09577
Abstract: Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error" - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
【2】Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark
Link: https://arxiv.org/abs/2510.09343
Comments: This paper has been accepted by NeurIPS 2025
Abstract: We engage in the relatively underexplored task of thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model's features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model's handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details but also enhances the overall contrast of the thermal image. Furthermore, we introduce a high-quality, multi-scenario infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.
【3】Training Feature Attribution for Vision Models
Link: https://arxiv.org/abs/2510.09135
Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
【4】Modeling Time-Lapse Trajectories to Characterize Cranberry Growth
Link: https://arxiv.org/abs/2510.08901
Comments: Accepted to ICCV Workshops 2025
Abstract: Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github.com/ronan-39/tlt/).
【5】Sparse components distinguish visual pathways & their alignment to neural networks
Link: https://arxiv.org/abs/2510.08858
Abstract: The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.
【6】Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
Link: https://arxiv.org/abs/2510.08713
Comments: 18 pages, 11 figures, code: this https URL
Abstract: Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
【7】FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching
Link: https://arxiv.org/abs/2510.08669
Comments: 15 pages, 11 figures
Abstract: The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveals that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa), which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on their continuity. Besides, we further propose to cache the Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Code is available in the supplementary materials and will be released on GitHub.
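The two caching rules can be sketched on a 1-D feature: reuse the cached low band as-is, and predict the high band by second-order extrapolation from its history. The moving-average band split and the function names are assumptions; FreqCa operates on diffusion-transformer features, uses a Hermite interpolator, and caches a cumulative residual rather than raw bands:

```python
# Illustrative sketch of frequency-aware caching on a 1-D feature vector.

def split_bands(x, k=3):
    """Split a feature into a low band (moving average) and a high residual."""
    low = [sum(x[max(0, i - k // 2): i + k // 2 + 1]) /
           len(x[max(0, i - k // 2): i + k // 2 + 1]) for i in range(len(x))]
    high = [xi - li for xi, li in zip(x, low)]
    return low, high

def predict_high(h_prev2, h_prev1, h_curr):
    """Second-order extrapolation through three equally spaced timesteps:
    the quadratic fit gives next = 3*curr - 3*prev1 + prev2, per element."""
    return [3 * c - 3 * p1 + p2 for p2, p1, c in zip(h_prev2, h_prev1, h_curr)]

# A high band following a smooth quadratic trend (t^2) is predicted exactly.
pred = predict_high([0.0, 0.0], [1.0, 1.0], [4.0, 4.0])
low, high = split_bands([1.0, 1.0, 1.0])
```

The cached low band (high similarity) is reused directly at the next timestep, while `predict_high` supplies the continuous-but-dissimilar high band without running the transformer.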
【8】Dynamic Mixture-of-Experts for Visual Autoregressive Model
Link: https://arxiv.org/abs/2510.08629
Abstract: Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows trading compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.
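Scale-aware thresholding can be sketched as a router that keeps only experts whose probability clears a resolution-dependent bar. The linear schedule `base_thresh * scale` is an assumed stand-in for the paper's thresholding, shown for illustration only:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, scale, base_thresh=0.2):
    """Keep experts whose routing probability clears a scale-aware threshold;
    larger scales (later, high-resolution VAR stages) get a stricter bar,
    so fewer experts fire per token. Assumed schedule, not the paper's."""
    probs = softmax(logits)
    chosen = [i for i, p in enumerate(probs) if p >= base_thresh * scale]
    if not chosen:  # always keep at least the top-1 expert
        chosen = [max(range(len(probs)), key=probs.__getitem__)]
    return chosen

# Same token logits at two VAR stages: the later (larger) scale keeps fewer
# experts, which is where the FLOP savings come from.
early = route([2.0, 1.0, 0.1, 0.1], scale=1)
late = route([2.0, 1.0, 0.1, 0.1], scale=2)
```

Because the threshold is a pure function of the routing probabilities and the scale, this behavior needs no additional training, matching the abstract's claim.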
Other (22 papers)
【1】SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Link: https://arxiv.org/abs/2510.09606
Comments: Project Page: this https URL
Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km.
【2】STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging
标题:STaTS:通过统计窗口合并的结构感知时间序列摘要
链接:https://arxiv.org/abs/2510.09593
备注:10 pages, 5 figures, 4 tables. Under Review
摘要:时间序列数据通常包含潜在的时间结构,例如局部平稳状态之间的转换、重复出现的模体以及突发的变化,这些在标准表示学习管道中很少被利用。现有模型通常在原始或固定窗口序列上操作,将所有时间步视为同等信息量,导致在长序列或噪声序列中效率低下、鲁棒性差且可扩展性有限。我们提出了STaTS,一个轻量级的无监督结构感知时间摘要框架,可自适应地将单变量和多变量时间序列压缩为紧凑且保留信息的令牌序列。STaTS使用基于BIC的统计散度准则在多个时间分辨率上检测变化点,然后使用均值等简单函数或GMM等生成模型对每个片段进行摘要。该过程实现了高达30倍的序列压缩,同时保留了核心时间动态。STaTS作为与模型无关的预处理器运行,可以与现有的无监督时间序列编码器集成而无需重新训练。在150多个数据集上的大量实验,包括UCR-85、UCR-128和UEA-30档案上的分类任务,以及ETTh1、ETTh2、ETTm1和Electricity上的预测任务,表明STaTS能够达到全模型性能的85-90%,同时大幅降低计算成本。此外,STaTS提高了噪声下的鲁棒性并保留了判别结构,优于均匀压缩和基于聚类的压缩基线。这些结果将STaTS定位为高效、结构感知的时间序列建模的有原则的通用解决方案。
摘要:Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
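摘要中的核心步骤(基于BIC的变化点检测加分段摘要)可以用下面的最小Python示意来理解。这并非论文的实现:此处仅用单段/两段高斯模型的BIC做递归二分切分,并以均值作为每段的摘要令牌,数据为合成序列,参数均为假设。

```python
import numpy as np

def gaussian_bic(x):
    # 单段高斯模型的BIC:均值、方差共2个参数
    n = len(x)
    var = max(float(np.var(x)), 1e-12)
    log_lik = -0.5 * n * (np.log(2 * np.pi * var) + 1.0)
    return -2.0 * log_lik + 2.0 * np.log(n)

def split_by_bic(x, min_len=8):
    # 若一分为二能降低BIC,则在最优处递归切分;返回片段列表
    n = len(x)
    best_bic, best_k = gaussian_bic(x), None
    for k in range(min_len, n - min_len):
        bic = gaussian_bic(x[:k]) + gaussian_bic(x[k:])
        if bic < best_bic:
            best_bic, best_k = bic, k
    if best_k is None:
        return [x]
    return split_by_bic(x[:best_k], min_len) + split_by_bic(x[best_k:], min_len)

rng = np.random.default_rng(0)
# 两个均值不同的局部平稳段拼接成一条序列
series = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
segments = split_by_bic(series)
tokens = [float(seg.mean()) for seg in segments]   # 用均值摘要每段
print(len(series), "->", len(tokens))
```

示例中,含一次均值跳变的200点序列被压缩为少量令牌,且令牌的取值范围覆盖了两段的真实均值(约0与约5)。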
【3】FLOWING: Implicit Neural Flows for Structure-Preserving Morphing
标题:FLOWING:用于结构保持变形的隐式神经流
链接:https://arxiv.org/abs/2510.09537
备注:10 pages main paper; 9 pages references and appendix
摘要:变形(morphing)是视觉和计算机图形学中一个长期存在的问题,需要用于特征对齐的时间相关扭曲和用于平滑插值的混合。最近,多层感知器(MLP)因其无网格性和可微性而被探索为用于建模此类变形的隐式神经表示(INR);然而,从标准MLP中提取连贯而准确的变形通常依赖于昂贵的正则化,这往往导致训练不稳定并妨碍有效的特征对齐。为了克服这些局限,我们提出了FLOWING(FLOW morphING),该框架将扭曲重铸为微分向量流的构造,通过将结构化流属性直接编码到网络架构中,自然地保证连续性、可逆性和时间一致性。这种以流为中心的方法产生有原则且稳定的变换,从而实现对2D图像和3D形状的精确且保持结构的变形。在一系列应用(包括人脸和图像变形以及高斯溅射(Gaussian Splatting)变形)上的大量实验表明,FLOWING以更快的收敛速度实现了最先进的变形质量。代码和预训练模型可在http://schardong.github.io/flowing上获得。
摘要:Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.
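FLOWING将扭曲建模为向量流,可逆性来自对同一速度场的正反向积分。下面是一个与论文无关的最小数值示意(速度场取解析的刚性旋转,积分用欧拉法,步数为假设):正向积分得到变形,交换积分端点即可近似恢复原始点,步数越多越精确。

```python
import numpy as np

def integrate_flow(points, velocity, t0, t1, steps=256):
    """用欧拉法沿速度场积分点坐标;交换 t0、t1 反向积分即得近似逆变换。"""
    t = t0
    dt = (t1 - t0) / steps
    x = points.astype(float).copy()
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

# 假设的解析速度场:绕原点的刚性旋转(真实逆变换已知,便于检验)
velocity = lambda x, t: np.stack([-x[:, 1], x[:, 0]], axis=1)

pts = np.array([[1.0, 0.0], [0.0, 2.0]])
warped = integrate_flow(pts, velocity, 0.0, np.pi / 2)        # 正向变形
recovered = integrate_flow(warped, velocity, np.pi / 2, 0.0)  # 反向恢复
print(np.abs(recovered - pts).max())
```

这演示了"把变形表示为流"为何天然带来(数值上的)可逆性,而逐点回归位移场的MLP则没有这一性质。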
【4】PRNet: Original Information Is All You Have
标题:PRNet:原始信息就是您所拥有的一切
链接:https://arxiv.org/abs/2510.09531
摘要:航空图像中的小目标检测在特征提取过程中由于有限的像素表示而遭受严重的信息退化,其中浅层空间细节无法与语义信息有效地对齐,导致频繁的遗漏和误报。现有的基于FPN的方法试图通过后处理增强来减轻这些损失,但重建的细节往往偏离原始图像信息,阻碍了它们与语义内容的融合。为了解决这个问题,我们提出了PRNet,一个实时检测框架,优先保存和有效利用原始的浅层空间特征,以增强小对象的表示。PRNet通过两个模块实现了这一点:渐进式细化颈部(PRN)通过骨干重用和迭代细化进行空间语义对齐,以及增强型SliceSamp(ESSamp)通过优化重排和卷积在下采样期间保留浅层信息。在VisDrone、AI-TOD和UAVDT数据集上进行的大量实验表明,在可比的计算约束下,PRNet的性能优于最先进的方法,实现了卓越的准确性与效率的权衡。
摘要:Small object detection in aerial images suffers from severe information degradation during feature extraction due to limited pixel representations, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods attempt to mitigate these losses through post-processing enhancements, but the reconstructed details often deviate from the original image information, impeding their fusion with semantic content. To address this limitation, we propose PRNet, a real-time detection framework that prioritizes the preservation and efficient utilization of primitive shallow spatial features to enhance small object representations. PRNet achieves this via two modules: the Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution. Extensive experiments on the VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.
【5】Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions
标题:三星图像中的对角伪影:PRNU挑战与解决方案
链接:https://arxiv.org/abs/2510.09509
摘要:我们研究了几款三星智能手机拍摄的图像中存在的对角伪影及其对基于PRNU的相机来源验证的影响。我们首先展示某些Galaxy S系列型号共享一种导致指纹碰撞的共同模式,在某些Galaxy A型号中也发现了类似问题。接下来,我们证明对于支持PRO模式和原始(raw)拍摄的设备,可靠的PRNU验证仍然可行,因为原始图像绕过了引入伪影的处理管道。但是,此选项不适用于中档A系列型号,也不适用于无法获取原始图像的取证案件。最后,我们概述了对角伪影的潜在取证应用,例如减少HDR图像中的误检,以及定位人像模式图像中受合成散景影响的区域。
摘要:We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.
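基于PRNU的相机来源验证通常依赖噪声残差与相机指纹之间的归一化互相关(NCC)。下面用合成数据示意这一核心统计量:同相机残差(含指纹成分)的NCC明显高于异相机残差。指纹与噪声模型均为假设,仅用于说明原理,与论文的实验无关。

```python
import numpy as np

def noise_ncc(residual, fingerprint):
    # 归一化互相关:PRNU验证的核心检验统计量
    a = residual - residual.mean()
    b = fingerprint - fingerprint.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 64))                    # 模拟的相机PRNU指纹
same_cam = K + rng.normal(size=K.shape) * 2.0    # 同相机:残差含指纹成分
other_cam = rng.normal(size=K.shape) * 2.0       # 异相机:纯噪声残差

ncc_same = noise_ncc(same_cam, K)
ncc_other = noise_ncc(other_cam, K)
print(f"same: {ncc_same:.3f}, other: {ncc_other:.3f}")
```

摘要所述的对角伪影相当于多台设备共享了一个人为的周期性成分:它会同时抬高"同相机"与"异相机"的相关值,从而造成指纹碰撞。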
【6】PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
标题:PhysToolBench:MLLM物理工具理解基准
链接:https://arxiv.org/abs/2510.09507
摘要:使用、理解和创造工具的能力是人类智能的标志,使人能够与物理世界进行复杂的交互。对于任何通用智能代理来说,要实现真正的多功能性,它还必须掌握这些基本技能。虽然现代多模态大型语言模型(MLLM)利用其广泛的常识知识在具身AI和下游视觉-语言-动作(VLA)模型中进行高级规划,但其对物理工具的真实理解程度仍未被量化。为了弥合这一差距,我们提出了PhysToolBench,第一个致力于评估MLLM对物理工具理解的基准。我们的基准是一个由超过1,000个图像-文本对组成的视觉问答(VQA)数据集。它评估三个不同难度级别的能力:(1)工具识别:要求识别工具的主要功能。(2)工具理解:测试掌握工具运作基本原理的能力。(3)工具创造:在常规选项不可用时,挑战模型利用周围物体制作新工具。我们对32个MLLM(涵盖专有模型、开源模型、专用具身模型以及VLA中的骨干网络)的全面评估,揭示了它们在工具理解方面的显著不足。此外,我们提供了深入的分析,并提出了初步的解决方案。代码和数据集均已公开。
摘要:The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs-spanning proprietary, open-source, specialized embodied, and backbones in VLAs-reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
【7】Utilizing dynamic sparsity on pretrained DETR
标题:在预训练的DETR上利用动态稀疏性
链接:https://arxiv.org/abs/2510.09380
备注:6 pages 4 figures and 4 tables , accepted for 2025 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, AUG. 31 to SEP. 3, 2025, ISTANBUL, TURKEY
摘要:使用基于Transformer的模型进行高效推理仍然是一个挑战,特别是在物体检测等视觉任务中。我们分析了DETR中MLP层的固有稀疏性,并介绍了两种无需重新训练即可利用它的方法。首先,我们提出了基于静态指标的稀疏化(SIBS),这是一种根据固定激活模式预测神经元不活跃性的启发式方法。虽然简单,但由于稀疏性依赖于输入,SIBS带来的增益有限。为了解决这个问题,我们引入了微门控稀疏化(MGS),这是一种在预训练DETR之上训练的轻量级门控机制。MGS使用一个小型线性层预测动态稀疏性,实现了高达85%至95%的激活稀疏度。在COCO数据集上的实验表明,MGS在显著减少计算量的同时保持甚至提高了性能。我们的方法提供了一种实用的、输入自适应的稀疏化方案,无需完整的模型重新训练即可高效部署预训练的视觉Transformer。
摘要:Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves up to 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
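MGS的思想是用一个小线性层为MLP隐藏神经元预测动态开关。下面是一个与论文无关的numpy示意:门控权重、MLP权重与阈值均为随机假设,仅演示"小线性层预测掩码、只保留被选中神经元"如何产生约85%至95%量级的激活稀疏度。

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256

# 假设这是预训练MLP的权重(此处用随机数代替)
W1 = rng.normal(0, 0.1, (d_model, d_ff))
W2 = rng.normal(0, 0.1, (d_ff, d_model))
# 轻量门控:一个小线性层,为每个隐藏神经元打分
Wg = rng.normal(0, 0.1, (d_model, d_ff))

def mlp_gated(x, threshold=1.0):
    gate = (x @ Wg) > threshold            # 输入相关的动态稀疏掩码
    # 实际部署时可只计算被选中的列;此处为简洁先算后掩
    h = np.maximum(x @ W1, 0) * gate
    sparsity = 1.0 - gate.mean()
    return h @ W2, float(sparsity)

x = rng.normal(0, 1, (4, d_model))
y, s = mlp_gated(x)
print(f"activation sparsity: {s:.2f}")
```

阈值决定稀疏度与精度的折衷;论文中的门控层是在冻结的预训练DETR之上训练得到的,而非此处的随机权重。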
【8】BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
标题:BLINK-Twice:你看到了,但你观察了吗?视觉感知的推理基准
链接:https://arxiv.org/abs/2510.09361
备注:Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks
摘要:近年来,多模态大型语言模型(MLLM)取得了快速进展,特别是在增强其推理能力方面。然而,现有的推理基准仍主要评估基于语言的推理,往往将视觉输入视为可替换的上下文。为了解决这一差距,我们引入了BLINK-Twice,一个以视觉为中心、基于挑战性感知任务的推理基准。我们的任务不依赖外部知识,而是要求模型仅从视觉内容进行推理,将重点从基于语言的推理转移到基于图像的推理。与之前的感知基准相比,它超越了浅层感知("看"),需要细粒度的观察和分析性推理("观察")。BLINK-Twice集成了三个核心组件:用于测试视觉推理的七类视觉挑战、强制依赖视觉内容的自然对抗图像对,以及用于对推理过程(而不仅是最终答案)进行细粒度评估的带注释推理链。我们评估了20个领先的MLLM,包括12个基础模型和8个推理增强模型。BLINK-Twice对当前模型构成重大挑战。虽然语言空间中现有的推理策略(如思维链或自我批评)可以提高性能,但它们往往导致不稳定和冗余的推理。我们观察到,重复观察图像可提升各模型的性能,而o3等模型所展示的主动视觉交互,则凸显了对视觉推理新范式的需求。该数据集可在https://github.com/PicoTrex/BLINK-Twice上公开获取
摘要:Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space-such as chain-of-thought or self-criticism can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice
【9】Efficient Bayesian Inference from Noisy Pairwise Comparisons
标题:从有噪声的成对比较中进行高效的贝叶斯推理
链接:https://arxiv.org/abs/2510.09333
摘要:评估生成模型具有挑战性,因为标准指标往往无法反映人类偏好。人工评估更可靠,但成本高且噪声大,因为参与者在专业知识、注意力和勤勉程度上各不相同。成对比较提高了一致性,但将其汇总为整体质量分数需要仔细建模。基于Bradley-Terry的方法从比较中更新项目分数,但现有方法要么忽略评分者的可变性,要么缺乏收敛保证,限制了鲁棒性和可解释性。我们引入BBQ,一种显式建模评分者质量的贝叶斯Bradley-Terry变体,它对不可靠的参与者降权或剔除,并通过期望最大化(EM)算法提供有保证的单调似然收敛。实证结果表明,与基线Bradley-Terry模型相比,即使面对有噪声或众包的评分者,BBQ也能实现更快的收敛、校准良好的不确定性估计,以及更鲁棒、可解释的排名。该框架能够对生成模型进行更可靠且更具成本效益的人工评估。
摘要:Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
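BBQ在Bradley-Terry之上显式建模评分者质量并用EM保证单调收敛;作为背景,下面示意基础Bradley-Terry模型如何通过标准MM迭代把成对胜负汇总为全局分数。这不是论文的BBQ实现,胜负矩阵为虚构数据。

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j]:项目 i 战胜 j 的次数;用标准MM迭代估计强度并归一化。"""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / max(den, 1e-12)
        p /= p.sum()
    return p

# 虚构的三个生成模型之间的成对胜负:0 强于 1,1 强于 2
wins = np.array([[0., 8., 9.],
                 [2., 0., 7.],
                 [1., 3., 0.]])
scores = bradley_terry(wins)
print(np.round(scores, 3))
```

论文的改进点在于为每个评分者引入质量参数,对不可靠评分者的比较降权,这正是上面这种朴素汇总所缺失的。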
【10】RadioFlow: Efficient Radio Map Construction Framework with Flow Matching
标题:RadioFlow:具有流匹配的高效无线电地图构建框架
链接:https://arxiv.org/abs/2510.09314
摘要:准确和实时的无线电地图(RM)生成对于下一代无线系统至关重要,然而基于扩散的方法通常存在模型尺寸大、迭代去噪缓慢和推理延迟高的问题,阻碍了实际部署。为了克服这些限制,我们提出了RadioFlow,一种新颖的基于流匹配的生成框架,通过单步高效采样实现高保真RM生成。与传统扩散模型不同,RadioFlow学习噪声和数据之间的连续传输轨迹,使训练和推理都能显著加速,同时保持重建精度。综合实验表明,与领先的基于扩散的基线(RadioDiff)相比,RadioFlow以最多减少至1/8的参数量和超过4倍的推理速度实现了最先进的性能。这一进展为未来6G网络的可扩展、节能和实时电磁数字孪生提供了一条有前景的途径。代码发布于GitHub:https://github.com/Hxxxz0/RadioFlow。
摘要:Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbf{RadioFlow}, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbf{up to 8$\times$ fewer parameters} and \textbf{over 4$\times$ faster inference} compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \href{https://github.com/Hxxxz0/RadioFlow}{GitHub}.
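流匹配的训练目标是在噪声与数据的插值路径上回归速度场;学到速度场后,单步(或少步)数值积分即可采样,这正是摘要所称加速的来源。下面是一个与论文无关的最小示意,数据分布与批大小均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_batch(x1, rng):
    """构造流匹配训练样本:在线性路径 x_t=(1-t)x0+t*x1 上回归速度 x1-x0。"""
    x0 = rng.normal(size=x1.shape)           # 噪声端点
    t = rng.uniform(size=(x1.shape[0], 1))   # 随机时间
    xt = (1.0 - t) * x0 + t * x1             # 路径上的点(模型输入)
    v_target = x1 - x0                       # 回归目标:恒定速度场
    return xt, t, v_target

x1 = rng.normal(loc=3.0, size=(256, 2))      # 假设的"数据"样本
xt, t, v = flow_matching_batch(x1, rng)

# 若模型精确学到速度场,从噪声出发沿速度积分一步即落在数据上。
# 这里直接验证线性路径的恒等式:x_t + (1-t)*v = x1
print(bool(np.allclose(xt + (1.0 - t) * v, x1)))
```

对比之下,扩散模型需要几十到上千步迭代去噪,这解释了摘要中"单步采样、推理提速"的论证逻辑。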
【11】Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
标题:畅通的道路,清晰的视野:智能交通多天气恢复的进展
链接:https://arxiv.org/abs/2510.09228
备注:This work has been submitted to IEEE for possible publication
摘要:雾霾、雨雪等恶劣天气条件会显著降低图像和视频的质量,对依赖视觉输入的智能交通系统(ITS)构成严峻挑战。这些退化会影响自动驾驶、交通监控和监视等关键应用。本综述全面回顾了为减轻天气引起的视觉退化而开发的图像和视频恢复技术。我们将现有方法分为传统的基于先验的方法和现代数据驱动模型,后者包括CNN、Transformer、扩散模型和新兴的视觉语言模型(VLM)。恢复策略进一步按其范围分类:单任务模型、多任务/多天气系统,以及能够处理多种退化的一体化框架。此外,我们还讨论了白天和夜间的恢复挑战、基准数据集和评估协议。本综述最后深入讨论了当前研究的局限性,并概述了未来方向,如混合/复合退化恢复、实时部署和代理式AI框架。这项工作旨在为推进智能交通环境中的天气鲁棒视觉系统提供有价值的参考。最后,为跟上该领域的快速发展,我们将在https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration上定期更新最新的相关论文及其开源实现
摘要:Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration
【12】Online Topological Localization for Navigation Assistance in Bronchoscopy
标题:支气管镜导航辅助的在线拓扑定位
链接:https://arxiv.org/abs/2510.09144
摘要:视频支气管镜检查是呼吸医学中的一项基本操作,医学专家在患者的支气管树中导航以进行诊断或手术。外科医生在内窥镜穿过气道时需要确定其位置,直到到达感兴趣区域。由于支气管树结构复杂以及医生经验和培训的差异,这项任务对从业者来说非常具有挑战性。在手术过程中定位支气管镜的导航辅助可以改善手术结果。目前用于导航引导的技术通常依赖患者先前的CT扫描来获得气道的3D模型,随后借助附加传感器或图像配准来跟踪内窥镜。这些方法可以获得准确的位置,但意味着额外的设置、扫描和训练。精确的度量定位并非总是必需的,相对于通用气道模型的拓扑定位通常足以辅助外科医生导航。我们提出了一种基于图像的支气管镜拓扑定位管道,在手术过程中提供导航辅助,无需患者CT扫描。我们的方法仅在体模(phantom)数据上训练,免除了真实数据标注的高成本,并表现出良好的泛化能力。所得结果超越了现有方法,尤其是在真实数据测试序列上。
摘要:Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
【13】Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation
标题:用于高效正交旋转不变图像表示的极可分离变换
链接:https://arxiv.org/abs/2510.09125
备注:13 pages, 10 figures, 4 Tables
摘要:基于正交矩的图像表示是计算机视觉的基础,但经典方法存在计算复杂度高和高阶数值不稳定的问题。例如,Zernike矩和伪Zernike矩需要耦合的径向-角向处理,无法进行有效的因式分解,导致在$N\times N$图像上计算$n$阶矩的复杂度为$\mathcal{O}(n^3N^2)$至$\mathcal{O}(n^6N^2)$,条件数按$\mathcal{O}(N^4)$缩放。我们引入了PSepT(极可分离变换),这是一种可分离的正交变换,克服了极坐标中的不可分离性障碍。PSepT通过离散余弦变换(DCT)径向基与傅立叶谐波角向基的张量积构造实现完整的核因子分解,从而实现独立的径向和角向处理。这种可分离设计将计算复杂度降低到$\mathcal{O}(N^2 \log N)$,内存需求降低到$\mathcal{O}(N^2)$,条件数缩放降低到$\mathcal{O}(\sqrt{N})$,相对多项式方法是指数级的改进。PSepT具有正交性、完备性、能量守恒和旋转协变性。实验结果表明其具有更好的数值稳定性和计算效率,在结构化数据集上具有竞争力的分类性能,同时保持精确重建。该可分离框架使经典方法此前不可行的高阶矩分析成为可能,为鲁棒图像分析应用开辟了新的可能性。
摘要:Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in $\mathcal{O}(n^3N^2)$ to $\mathcal{O}(n^6N^2)$ complexity and $\mathcal{O}(N^4)$ condition number scaling for the $n$th-order moments on an $N\times N$ image. We introduce \textbf{PSepT} (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to $\mathcal{O}(N^2 \log N)$, memory requirements to $\mathcal{O}(N^2)$, and condition number scaling to $\mathcal{O}(\sqrt{N})$, representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
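摘要所述的张量积构造可以直接数值验证:取正交归一的DCT-II矩阵作径向基、酉DFT矩阵作角向基,它们的Kronecker积给出一个可分离且正交完备的变换核。以下为与论文无关的最小示意,极坐标采样数Nr、Na为假设:

```python
import numpy as np

def dct_basis(n):
    """正交归一化的DCT-II基矩阵(每行是一个基向量)。"""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    B = np.cos(np.pi * (m + 0.5) * k / n)
    B[0] *= np.sqrt(1.0 / n)
    B[1:] *= np.sqrt(2.0 / n)
    return B

def fourier_basis(n):
    """酉的离散傅立叶基矩阵(傅立叶谐波)。"""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

Nr, Na = 8, 16                 # 径向、角向采样数(假设)
R = dct_basis(Nr)              # 径向:DCT基
A = fourier_basis(Na)          # 角向:傅立叶谐波基
T = np.kron(R, A)              # 张量积 -> 可分离的极坐标变换核

# 验证正交完备性:T 的共轭转置是其逆
err = np.abs(T.conj().T @ T - np.eye(Nr * Na)).max()
print("orthonormality error:", err)
```

由于核是两个小矩阵的张量积,变换可按"先径向、后角向"分两步实施,这正是复杂度降到$\mathcal{O}(N^2 \log N)$的来源。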
【14】A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
标题:用于识别CT扫描中微妙病理特征的新型多分支ConvNeXt架构
链接:https://arxiv.org/abs/2510.09107
摘要:医学影像的智能分析在辅助临床诊断,特别是识别细微病理特征方面起着至关重要的作用。本文介绍了一种新的多分支ConvNeXt架构,专为医学图像分析的细微挑战而设计。虽然在这里应用于COVID-19诊断的特定问题,但该方法提供了一个可推广的框架,用于对来自CT扫描的各种病理进行分类。该模型包含了严格的端到端管道,从细致的数据预处理和增强到有效利用迁移学习的纪律性两阶段训练策略。该体系结构独特地集成了从三个并行分支提取的特征:全局平均池化,全局最大池化和一个新的注意力加权池化机制。该模型在来自两个不同数据集的2,609个CT切片的组合数据集上进行了训练和验证。实验结果表明,验证集上的性能优越,最终ROC-AUC为0.9937,验证准确度为0.9757,COVID-19病例的F1得分为0.9825,优于该数据集上所有先前报道的模型。这些发现表明,现代的多分支架构,加上仔细的数据处理,可以实现与当代最先进的模型相当或超过当代最先进模型的性能,从而证明了先进的深度学习技术在强大的医疗诊断中的有效性。
摘要:Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
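三分支池化(全局平均、全局最大、注意力加权)再拼接的结构可以用几行numpy示意。注意:论文的注意力加权池化细节未在摘要中给出,此处用"线性打分+softmax按空间位置加权"作为一种假设性实现,特征图与打分权重均为随机数据。

```python
import numpy as np

def multi_branch_pool(feat, w_attn):
    """三分支池化并拼接。feat: (H, W, C) 特征图;w_attn: (C,) 假设的打分权重。"""
    hw_c = feat.reshape(-1, feat.shape[-1])     # (H*W, C)
    gap = hw_c.mean(axis=0)                     # 分支1:全局平均池化
    gmp = hw_c.max(axis=0)                      # 分支2:全局最大池化
    scores = hw_c @ w_attn                      # 每个空间位置一个注意力分数
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                          # softmax 归一化
    awp = (attn[:, None] * hw_c).sum(axis=0)    # 分支3:注意力加权池化
    return np.concatenate([gap, gmp, awp])      # 拼接为 3C 维表示

rng = np.random.default_rng(0)
feat = rng.normal(size=(7, 7, 32))
out = multi_branch_pool(feat, rng.normal(size=32))
print(out.shape)
```

三个分支分别强调整体统计、最显著响应和可学习的空间聚焦,拼接后再送入分类头。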
【15】OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching
标题:OSCAR:流匹配中尊重对齐的多样性正交随机控制
链接:https://arxiv.org/abs/2510.09060
摘要:基于流的文本到图像模型遵循确定性轨迹,迫使用户反复采样以发现不同的模式,这是一个成本高且低效的过程。我们提出了一种免训练的推理时控制机制,使流本身具有多样性意识。我们的方法通过特征空间目标鼓励轨迹之间的横向扩展,同时通过按时间调度的随机扰动重新引入不确定性。至关重要的是,这种扰动被投影为与生成流正交,这一几何约束使其能够在不降低图像细节或提示保真度的情况下增加变化。我们的过程无需重新训练或修改基础采样器,并与常见的流匹配求解器兼容。理论上,我们的方法被证明单调增加一个体积替代量,同时由于其几何约束,近似保持边缘分布。这为生成质量何以稳健保持提供了有原则的解释。实证上,在固定采样预算下的多个文本到图像设置中,我们的方法相比强基线持续提升了Vendi Score和BRISQUE等多样性指标,同时保持了图像质量和对齐。
摘要:Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
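摘要的关键几何约束是把随机扰动投影到生成流速度的正交补上。下面用numpy示意这一投影(速度场与噪声尺度均为假设数据):投影后的扰动与每个样本的流方向内积为零,因此不会沿流方向改变生成轨迹本身。

```python
import numpy as np

def orthogonal_perturbation(v, sigma, rng):
    """在流速度 v 的正交补上注入噪声:减去与 v 平行的投影分量。"""
    eps = rng.normal(size=v.shape) * sigma
    v_flat = v.reshape(v.shape[0], -1)
    e_flat = eps.reshape(v.shape[0], -1)
    coef = (e_flat * v_flat).sum(-1, keepdims=True) / \
           ((v_flat * v_flat).sum(-1, keepdims=True) + 1e-12)
    return (e_flat - coef * v_flat).reshape(v.shape)

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3, 8, 8))        # 每个样本的(假设)流速度场
delta = orthogonal_perturbation(v, sigma=0.1, rng=rng)

# 验证扰动与流方向正交:逐样本内积应为零(浮点误差内)
dots = (delta.reshape(4, -1) * v.reshape(4, -1)).sum(-1)
print(np.abs(dots).max())
```

实际方法中,扰动幅度还会按时间调度,并与特征空间的轨迹分散目标联合使用;此处只演示正交投影本身。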
【16】Auto-scaling Continuous Memory for GUI Agent
标题:用于GUI代理的自动扩展连续记忆
链接:https://arxiv.org/abs/2510.09038
摘要:我们研究如何为GUI代理赋予可扩展的记忆,帮助其在不熟悉的界面和长程任务中泛化。先前的GUI代理将过去的轨迹压缩为文本标记,这使上下文长度膨胀,并遗漏了决定性的视觉线索(例如,小部件的精确尺寸和位置)。我们提出一种连续记忆,使用VLM本身作为编码器,将每条GUI轨迹编码为固定长度的连续嵌入序列;这些嵌入直接插入骨干网络的输入层,大幅降低上下文成本,同时保留细粒度的视觉信息。随着记忆容量和检索深度的增加,性能单调提升,这与随长提示而退化的文本记忆不同。为了以低成本扩充记忆,我们引入了一个自动扩展的数据飞轮:(i)通过搜索发现新环境,(ii)用开源VLM合成任务,(iii)用代理执行轨迹(rollout),(iv)用同一VLM验证成功。借助该管道,我们以约4000美元收集了超过10万条轨迹,并仅用1,500个样本微调记忆编码器(Q-Former上的LoRA,1.2%的参数)。在真实世界的GUI基准测试中,我们的记忆增强代理在长程任务和分布偏移下持续提高成功率。值得注意的是,Qwen-2.5-VL-7B加上连续记忆实现了与最先进闭源模型(例如GPT-4o、Claude-4)相当的性能。
摘要:We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
【17】Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
标题:不可着色的示例:通过感知感知色度限制扰动防止未经授权的AI着色
链接:https://arxiv.org/abs/2510.08979
摘要:基于AI的彩色化在从灰度输入生成逼真的彩色图像方面表现出了卓越的能力。然而,它也带来了侵犯版权的风险--例如,未经授权的彩色化和单色漫画和电影的转售。尽管存在这些问题,但目前还没有有效的方法来防止这种滥用。为了解决这个问题,我们引入了第一个防御范例,不可着色的例子,它嵌入到灰度图像的不可察觉的扰动,使未经授权的着色无效。为了确保真实世界的适用性,我们建立了四个标准:有效性,不可感知性,可转移性和鲁棒性。我们的方法,感知感知色度限制性扰动(PAChroma),通过使用拉普拉斯滤波器优化不可感知的扰动以保持感知质量,并在优化期间应用不同的输入变换以增强跨模型的可转移性和对常见后处理的鲁棒性(例如,压缩)。在ImageNet和Danbooru数据集上的实验表明,PAChroma在保持视觉外观的同时有效地降低了着色质量。这项工作标志着保护视觉内容免受非法AI着色的第一步,为生成媒体中的版权意识防御铺平了道路。
摘要:AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
【18】HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images
标题:HandEval:迈出生成图像中手部质量评估的第一步
链接:https://arxiv.org/abs/2510.08978
摘要:虽然最近的文本到图像(T2I)模型显著提高了生成图像的整体视觉质量,但它们在复杂局部区域(尤其是人手)生成准确细节方面仍然困难。生成的手常常表现出结构扭曲和不真实的纹理,即使身体其他部分生成良好,这些问题也十分显眼。然而,手部区域的质量评估在很大程度上仍被忽视,限制了下游任务的性能,例如以人为中心的生成质量优化和AIGC检测。为了解决这一问题,我们提出了第一个针对生成手部区域的质量评估任务,并展示了其丰富的下游应用。我们首先介绍用于训练手部质量评估模型的HandPair数据集。它由高质量与低质量手部图像对构成,共计48k张图像,可实现低成本、高效的监督,无需人工注释。在此基础上,我们开发了HandEval,一个精心设计的手部专用质量评估模型。它利用多模态大型语言模型(MLLM)强大的视觉理解能力,并结合手部关键点的先验知识,获得了对手部质量的敏锐感知。我们进一步构建了一个人工注释的测试集,包含来自各种最先进(SOTA)T2I模型的手部图像,以验证其质量评估能力。结果表明,HandEval比现有SOTA方法更符合人类判断。此外,我们将HandEval集成到图像生成和AIGC检测管道中,分别显著提升了生成手部的真实感和检测精度,证实了其在下游应用中的普遍有效性。代码和数据集即将发布。
摘要:Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.
【19】Denoised Diffusion for Object-Focused Image Augmentation
标题:目标聚焦图像增强的去噪扩散
链接:https://arxiv.org/abs/2510.08955
备注:None
摘要:现代农业运营越来越依赖于综合监测系统,这些系统结合多种数据源进行农场优化。基于无人机的空中动物健康监测是其中的关键组成部分,但面临数据可用性有限的问题,并受特定场景因素(如体型小、被遮挡或仅部分可见的动物)的影响。迁移学习方法通常无法解决这一限制,因为难以获得反映特定农场条件(包括动物品种、环境和行为差异)的大型数据集。因此,有必要针对这些独特挑战制定面向具体问题、以动物为中心的数据增强策略。为填补这一空白,我们提出了一个专为数据受限场景下动物健康监测设计的以对象为中心的数据增强框架。我们的方法将动物从背景中分割出来,并通过变换和基于扩散的合成对其进行增强,以创建逼真、多样的场景,从而提升动物检测与监测性能。初步实验表明,在动物检测任务上,使用增强数据集训练的模型优于基线模型。通过生成特定领域的数据,我们的方法即使在数据稀缺的情况下也能支撑实时动物健康监测解决方案,弥合了有限数据与实际应用之间的差距。
摘要:Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
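The segment-then-composite step the abstract describes can be illustrated with a toy cut-and-paste routine on nested-list "images". This is a deliberately simplified stand-in for the paper's pipeline (real segmentation masks plus diffusion-based synthesis); the function name and the horizontal flip transform are assumptions for illustration.

```python
# Illustrative object-focused augmentation: cut a masked animal out of an
# image, apply a simple transform (horizontal flip), and paste it onto a
# new background. Images are small nested lists of pixel values.

def cut_paste(image, mask, background, flip=True):
    h, w = len(image), len(image[0])
    patch = [row[:] for row in background]   # copy background, row by row
    for y in range(h):
        for x in range(w):
            sx = w - 1 - x if flip else x    # flip is the stand-in transform
            if mask[y][sx]:                  # only foreground (animal) pixels
                patch[y][x] = image[y][sx]
    return patch

image = [[1, 2, 3], [4, 5, 6]]
mask = [[0, 0, 1], [0, 0, 1]]                # rightmost column is "animal"
background = [[9, 9, 9], [9, 9, 9]]
aug = cut_paste(image, mask, background)     # → [[3, 9, 9], [6, 9, 9]]
```

In the actual framework, the pasted patch would come from a segmentation model and the background or patch could be re-synthesized by a diffusion model, but the compositing logic is the same idea.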
【20】Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
标题:深入兔子壳(Rabbit Hull):从DINO中的任务相关概念到闵可夫斯基几何
链接:https://arxiv.org/abs/2510.08638
摘要:DINOv2通常被用于识别物体、场景和动作,但它究竟感知到了什么仍然未知。作为工作基线,我们采用线性表示假设(LRH),并使用SAE将其付诸实践,得到一个包含32,000个单元的字典,作为本研究可解释性分析的支柱;研究分为三部分。第一部分分析不同下游任务如何从学习到的字典中调用概念,揭示出功能专门化:分类利用在目标对象之外处处激活的"别处(Elsewhere)"概念,实现学习到的否定;分割依赖于构成连贯子空间的边界检测器;深度估计利用三种不同的单眼深度线索,与视觉神经科学原理相吻合。基于这些功能性结果,我们进一步分析SAE所学概念的几何与统计特性:表示是部分稠密的,而非严格稀疏;字典朝更高的相干性演化,偏离了最大正交的理想结构(格拉斯曼框架);在单幅图像内,token占据一个低维、局部连通的集合,且在移除位置信息后依然存在。这些迹象表明,表示的组织方式超出了单纯的线性稀疏性。综合这些观察,我们提出一个更精细的观点:token由原型(archetype)的凸混合组合而成(例如动物中的兔子、颜色中的棕色、质地中的绒毛)。这一结构既植根于Gardenfors的概念空间理论,也契合模型自身的机制:多头注意力产生凸混合的加和,从而定义由原型界定的区域。我们由此提出闵可夫斯基表示假设(MRH),并考察其经验特征及其对理解视觉Transformer表示的意义。
摘要:DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
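The "convex mixtures of archetypes" view in the abstract can be made concrete in a few lines: a token embedding is modeled as a convex combination of archetype vectors, with nonnegative weights summing to one. The archetype vectors and weights below are toy values, not actual DINOv2 directions.

```python
# Toy sketch of the Minkowski Representation Hypothesis building block:
# a token as a convex combination of archetype vectors.

def convex_mixture(archetypes, weights):
    """Combine archetype vectors with convex weights (>= 0, summing to 1)."""
    assert all(w >= 0 for w in weights), "weights must be nonnegative"
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    dim = len(archetypes[0])
    return [sum(w * a[i] for w, a in zip(weights, archetypes)) for i in range(dim)]

# Hypothetical archetypes in a 2-D space (e.g. "rabbit", "brown", "fluffy").
archetypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token = convex_mixture(archetypes, [0.5, 0.25, 0.25])  # → [0.75, 0.5]
```

Geometrically, every such token lies inside the convex hull of the archetypes, which is the sense in which the paper's regions are "bounded by archetypes".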
【21】The Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images
标题:数字镜子:人工智能生成图像中的性别偏见和职业刻板印象
链接:https://arxiv.org/abs/2510.08628
摘要:生成式人工智能为创建图形、视频和图像等可视化内容提供了巨大机会。然而,近期关于人工智能生成可视化内容的研究主要集中在创建过程和图像质量上,忽视了表征偏见。本研究针对这一空白,在职业情境中检验人工智能生成图片中的表征偏见,并比较了DALL-E 3和Ideogram这两种人工智能图像生成工具。此外,本研究还讨论了人工智能生成图像中的年龄与情绪等主题。随着人工智能图像工具的使用日益广泛,识别并缓解有害的性别偏见对于确保媒体和职业环境中的多元化表征至关重要。本研究共提示生成了超过750张职业相关的人工智能图像。主题分析结果显示,DALL-E 3和Ideogram都在生成的图像中强化了传统的性别刻板印象,尽管程度不同。这些发现表明,人工智能可视化工具存在固化狭隘表征的风险。在讨论部分,我们为从业者、个人和研究人员提出建议,以在生成具有可见性别特征的图像时提升表征多样性。
摘要:Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies around AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representation biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools are becoming more widely used, addressing and mitigating harmful gender biases becomes essential to ensure diverse representation in media and professional settings. In this study, over 750 AI-generated images of occupations were prompted. The thematic analysis results revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals and researchers to increase representation when generating images with visible genders.
【22】Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization
标题:转录前先看:具有视觉锚定策略优化的端到端SlideASR
链接:https://arxiv.org/abs/2510.08618
摘要:自动语音识别(ASR)系统往往难以处理特定领域的术语,尤其是在学术讲座等专业场景中。为此,我们定义了SlideASR任务,利用演示幻灯片中丰富的视觉信息来提升转录准确率。现有的流水线方法往往较为复杂且表现不佳。虽然全模态大型语言模型(OLLM)提供了一个有前景的端到端框架,但它们在实践中经常退化为简单的光学字符识别(OCR)系统。为克服这一问题,我们提出视觉锚定策略优化(VAPO),一种旨在控制模型推理过程的新型后训练方法。借鉴思维链推理范式,VAPO使用格式强制执行结构化的"转录前查看"程序
摘要:Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model's reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured "Look before Transcription" procedure using a
机器翻译由腾讯交互翻译提供,仅供参考

