
计算机视觉与模式识别学术速递[10.21]

2025-10-21
导读:cs.CV 方向,今日共计211篇





大模型相关(20篇)

【1】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
标题:VERA-V:越狱视觉语言模型的变分推理框架
链接:https://arxiv.org/abs/2510.17759

作者:Qilin Liao, Anamika Lochab, Ruqi Zhang
备注:18 pages, 7 Figures,
摘要:视觉语言模型(VLM)通过视觉推理扩展了大型语言模型,但其多模态设计也引入了新的、尚未被充分研究的漏洞。现有的多模态红队方法主要依赖于脆弱的模板,专注于单一攻击设置,并且只暴露了一小部分漏洞。为了解决这些限制,我们引入了VERA-V,这是一个变分推理框架,它将多模态越狱发现重新定义为在成对的文本-图像提示上学习联合后验分布。这种概率观点使得能够生成绕过模型护栏的隐蔽耦合对抗输入。我们训练一个轻量级的攻击者来近似后验,允许对不同的越狱进行有效的采样,并提供对漏洞的分布洞察。VERA-V进一步整合了三种互补策略:(i)基于排版的文本提示,嵌入有害的线索,(ii)基于扩散的图像合成,引入对抗信号,以及(iii)结构化的干扰物,以分散VLM注意力。在HarmBench和HADES基准测试上的实验表明,VERA-V在开源和前沿VLM上的性能始终优于最先进的基线,在GPT-4o上攻击成功率(ASR)比最佳基线最高高出53.75%。
摘要:Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.


【2】MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
标题:MT-Video-Bench:用于评估多轮对话中多模态LLM的整体视频理解基准
链接:https://arxiv.org/abs/2510.17722

作者:Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu
备注:Project Website: this https URL
摘要:多模态大型语言模型(MLLM)的最新发展显着提高了AI理解视觉模态的能力。然而,现有的评估基准仍然局限于单轮问题回答,忽视了现实世界场景中多轮对话的复杂性。为了弥合这一差距,我们引入了MT-Video-Bench,这是一个整体视频理解基准,用于评估多轮对话中的MLLM。具体来说,我们的MT-Video-Bench主要评估专注于感知力和互动性的六项核心能力,包括来自不同领域的987个精心策划的多轮对话。这些功能与现实世界的应用程序严格一致,例如交互式体育分析和基于多轮视频的智能辅导。通过MT-Video-Bench,我们广泛评估了各种最先进的开源和闭源MLLM,揭示了它们在处理多轮视频对话时的显著性能差异和局限性。该基准将公开提供,以促进未来的研究。
摘要:The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.


【3】ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling
标题:ShapeCraft:用于结构化、纹理化和交互式3D建模的LLM代理
链接:https://arxiv.org/abs/2510.17603

作者:Shuyuan Zhang, Chenhan Jiang, Zuoou Li, Jiankang Deng
备注:NeurIPS 2025 Poster
摘要:从自然语言生成3D具有减少专家手动建模工作、提升3D资产可获取性的巨大潜力。然而,现有方法往往产生非结构化的网格且交互性较差,难以用于艺术工作流程。为了解决这些局限性,我们将3D资产表示为形状程序,并介绍了ShapeCraft,一种新颖的文本到3D生成多智能体框架。其核心是我们提出的基于图的程序化形状(GPS)表示,它将复杂的自然语言分解为结构化的子任务图,从而帮助LLM准确理解和解释空间关系与语义形状细节。具体来说,LLM代理分层解析用户输入以初始化GPS,然后迭代地优化程序化建模和绘制,以产生结构化、带纹理且可交互的3D资产。定性和定量实验表明,与现有的基于LLM的代理相比,ShapeCraft在生成几何准确和语义丰富的3D资产方面具有卓越的性能。我们通过动画和用户自定义编辑的例子进一步展示了ShapeCraft的多功能性,突出了它在更广泛的交互式应用中的潜力。
摘要:3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.


【4】From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
标题:从空间到动作:基于空间基础先验的视觉-语言-动作模型接地
链接:https://arxiv.org/abs/2510.17439

作者:Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
备注:Project page: this https URL
摘要:现有的视觉-语言-动作(VLA)模型在3D现实世界中行动,但通常建立在2D编码器上,留下了限制泛化和适应性的空间推理差距。最近为VLA集成3D信息的技术要么需要专门的传感器且跨模态迁移性差,要么只注入缺乏几何信息的弱线索并损害视觉-语言对齐。在这项工作中,我们介绍了FALCON(From Spatial to Action),一种将丰富的3D空间令牌注入动作头的新范式。FALCON利用空间基础模型仅从RGB提供强大的几何先验,并包含一个具身空间模型(Embodied Spatial Model),可在深度或位姿可用时选择性地融合它们以获得更高保真度,而无需重新训练或更改架构。为了保持语言推理能力,空间令牌由空间增强的动作头消耗,而不是被拼接进视觉-语言主干。这些设计使FALCON能够解决空间表示、模态可迁移性和对齐方面的限制。在三个模拟基准和11个现实世界任务的综合评估中,我们提出的FALCON实现了最先进的性能,始终超过有竞争力的基线,并在杂乱环境、空间提示条件以及物体尺度和高度变化下保持稳健。
摘要:Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.


【5】Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
标题:高效流媒体视频LLM的基于循环注意的令牌选择
链接:https://arxiv.org/abs/2510.17364

作者:Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
备注:NeurIPS 2025
摘要:视频大语言模型(Video-LLM)擅长在上下文中理解视频,前提是它们在回答查询时可以完全访问视频。然而,这些模型在流媒体场景中面临挑战:长达一小时的视频必须在线处理,并且问题需要及时响应。在这项工作中,我们提出了一种与标准视频LLM兼容的免训练方法,利用三个关键概念:1)LLM引导的视觉标记选择,用于识别LLM已关注且有助于其理解每个短片的标记。我们基于注意力的选择允许我们以最小的性能损失丢弃高达约95%的不重要视觉标记;2)对过去选择的标记进行循环处理,以生成对每个已处理剪辑在时间上连贯的理解;3)基于字幕的问答,以获得轻量级且准确的响应。我们的方法在流媒体视频基准上实现了最先进的性能,在效率和有效性之间取得了平衡。
摘要:Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
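
下面给出一个示意性的PyTorch草图(并非该论文的官方实现),用来说明"基于注意力分数只保留少量关键视觉标记"这一思路;其中 attn、visual_idx、keep_ratio 等名称与取值均为假设。

```python
# 示意性草图:按视觉标记获得的注意力高低保留前 k 个标记
import torch

def select_visual_tokens(attn: torch.Tensor,
                         visual_idx: torch.Tensor,
                         keep_ratio: float = 0.05) -> torch.Tensor:
    """attn: [num_heads, num_query, num_key] 某层注意力权重;
    visual_idx: 视觉标记在 key 维度上的索引;返回被保留的视觉标记索引。"""
    scores = attn.mean(dim=(0, 1))          # 对注意力头与查询位置求平均,得到每个 key 的总体注意力
    visual_scores = scores[visual_idx]      # 仅统计视觉标记
    k = max(1, int(keep_ratio * visual_idx.numel()))
    top = torch.topk(visual_scores, k).indices
    return visual_idx[top]

# 用法示例(随机张量,仅演示接口)
attn = torch.rand(32, 128, 1024).softmax(dim=-1)
visual_idx = torch.arange(64, 1024)         # 假设前 64 个为文本标记
kept = select_visual_tokens(attn, visual_idx, keep_ratio=0.05)
print(kept.shape)
```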


【6】Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models
标题:使用即插即用多模式大型语言模型增强运动预测
链接:https://arxiv.org/abs/2510.17274

作者:Katie Luo, Jingwei Ji, Tong He, Runsheng Xu, Yichen Xie, Dragomir Anguelov, Mingxing Tan
备注:In proceedings of IROS 2025
摘要:目前的自动驾驶系统依赖于专门的模型来感知和预测运动,这些模型在标准条件下表现出可靠的性能。然而,以具有成本效益的方式推广到不同的现实世界场景仍然是一个重大挑战。为了解决这个问题,我们提出了Plug-and-Forecast(PnF),一种用多模态大语言模型(MLLM)增强现有运动预测模型的即插即用方法。PnF建立在这样的洞察之上:自然语言提供了一种更有效的方式来描述和处理复杂场景,能够快速适应目标行为。我们设计提示从MLLM中提取结构化场景理解,并将这些信息蒸馏到可学习的嵌入中,以增强现有的行为预测模型。我们的方法利用MLLM的zero-shot推理能力来实现运动预测性能的显著改善,同时不需要微调,使其便于实际采用。我们使用Waymo Open Motion Dataset和nuScenes Dataset在两个最先进的运动预测模型上验证了我们的方法,证明了两个基准测试上一致的性能改进。
摘要:Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning -- making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.


【7】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
标题:VisiPruner:解码不连续的跨模态动态以实现高效的多模态LLM
链接:https://arxiv.org/abs/2510.17205

作者:Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen
备注:EMNLP 2025 Main
摘要:多模态大型语言模型(MLLM)在视觉语言任务中取得了很好的性能,但由于注意力计算随多模态标记数量的二次增长而产生了显着的计算开销。虽然已经努力修剪MLLM中的标记,但他们缺乏对MLLM如何处理和融合多模态信息的基本理解。通过系统分析,我们揭示了一个\textbf{三阶段}跨模态交互过程:(1)浅层识别任务意图,视觉标记充当被动注意汇;(2)跨模态融合突然发生在中间层,由一些关键的视觉标记驱动;(3)深层丢弃视觉标记,只关注语言精炼。基于这些发现,我们提出了一个无需训练的剪枝框架VisiPruner,它可以减少LLaVA-v1.5 7 B上高达99\%的视觉相关注意力计算和53.9\%的FLOP。它显著优于现有的令牌修剪方法,并在不同的MLLM中推广。除了修剪之外,我们的见解还通过将模型架构与其内在的逐层处理动态对齐,为训练高效的MLLM提供了可操作的指导方针。我们的代码可在https://github.com/EIT-NLP/VisiPruner上获取。
摘要:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.


【8】ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
标题:ZSPAPrune:视觉语言模型的Zero-Shot提示感知令牌修剪
链接:https://arxiv.org/abs/2510.17197

作者:Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang
摘要:随着视觉语言模型(VLM)能力的提高,它们可以处理越来越大的输入,这与LLM不同,会产生大量的视觉标记冗余,并导致令人望而却步的推理成本。虽然许多方法旨在通过删减视觉标记来降低这些成本,但现有方法,无论是基于注意力还是基于多样性,通常都忽略了文本提示的指导,因此未能优先考虑任务相关性。在这项工作中,我们提出了一种新颖的zero-shot方法,通过引入提示感知的视角重新定义该问题,将视觉令牌修剪显式建模为任务相关性与信息多样性之间的平衡。我们的分层方法首先选择一组核心的任务相关视觉令牌,然后补充多样性令牌以保留更广泛的上下文。在多个模型和基准测试上的实验表明,即使修剪多达90%的标记,该方法也仅有极小的准确率损失,性能达到或超过现有最先进方法。此外,这些优势还伴随着GPU内存占用和推理延迟的显著降低。
摘要:As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
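
下面是一个示意性草图(非ZSPAPrune的官方实现),演示"先按与提示嵌入的相似度选任务相关标记、再补充多样性标记"的提示感知修剪思路;最远点采样只是多样性选择的一种可能做法,函数与变量名均为假设。

```python
# 示意性草图:提示感知的"相关性 + 多样性"视觉标记选择
import torch
import torch.nn.functional as F

def prompt_aware_prune(vis: torch.Tensor, prompt: torch.Tensor,
                       n_rel: int, n_div: int) -> torch.Tensor:
    """vis: [N, D] 视觉标记;prompt: [D] 文本提示嵌入;返回保留标记的索引。"""
    rel = F.cosine_similarity(vis, prompt.unsqueeze(0), dim=-1)   # 任务相关性
    rel_idx = torch.topk(rel, n_rel).indices
    chosen = set(rel_idx.tolist())
    remaining = torch.tensor([i for i in range(vis.size(0)) if i not in chosen])
    selected = rel_idx.tolist()
    for _ in range(n_div):                                        # 最远点采样补充多样性标记
        sim = F.cosine_similarity(vis[remaining].unsqueeze(1),
                                  vis[selected].unsqueeze(0), dim=-1)
        far = int(sim.max(dim=1).values.argmin())                 # 与已选集合最不相似者
        selected.append(int(remaining[far]))
        remaining = torch.cat([remaining[:far], remaining[far + 1:]])
    return torch.tensor(selected)

idx = prompt_aware_prune(torch.randn(576, 256), torch.randn(256), n_rel=40, n_div=18)
print(idx.shape)
```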


【9】Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding
标题:哪里,而不是什么:促使视频LLM学习用于3D接地的几何因果关系
链接:https://arxiv.org/abs/2510.17034

作者:Yutong Zhong
摘要:多模态3D接地在视觉语言模型(VLM)\cite{yin2025spatial}中获得了相当大的关注,用于推进复杂环境中的空间推理。然而,这些模型遭受严重的"2D语义偏差":由于过度依赖2D图像特征进行粗略定位,在很大程度上忽视了3D几何输入,并导致次优的融合性能。在本文中,我们提出了一种新的训练框架,称为What-Where Representation Re-Forming(W2R2),通过解耦表示学习和有针对性的捷径抑制来解决这个问题。我们的方法从根本上重塑了模型的内部空间,将2D特征指定为"什么"识别的语义信标,将3D特征指定为"哪里"定位的空间锚点,从而在不修改推理架构的情况下实现精确的3D接地。关键组件包括一个双目标损失函数:对齐损失使用自适应交叉熵来监督融合预测以实现多模态协同,而伪标签损失通过基于间隔(margin)的机制来惩罚过于强势的2D主导伪输出。在ScanRefer和ScanQA上进行的实验证明了W2R2的有效性,在定位精度和鲁棒性方面有显著的提高,特别是在杂乱的户外场景中。
摘要:Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.


【10】Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
标题:丰富与检测:利用多模态LLM实现视频时序定位
链接:https://arxiv.org/abs/2510.17023

作者:Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
备注:ICCV 2025 (Highlights)
摘要:我们介绍了ED-VTG,一种利用多模态大语言模型的细粒度视频时序定位方法。我们的方法利用多模态LLM联合处理文本和视频的能力,通过两阶段过程在视频中有效定位自然语言查询。语言查询并非被直接定位,而是先被改写为包含缺失细节和线索的丰富句子,以辅助定位。在第二阶段,这些丰富后的查询由一个轻量级解码器进行定位,该解码器专门在丰富查询的上下文表示条件下预测准确的边界。为了减轻噪声并降低幻觉的影响,我们的模型使用多实例学习目标进行训练,该目标为每个训练样本动态选择查询的最佳版本。我们在时序视频定位和段落定位设置下的各种基准上展示了最先进的结果。实验表明,我们的方法显著优于所有先前提出的基于LLM的时序定位方法,并且优于或媲美专门模型,同时在zero-shot评估场景中相对它们保持明显优势。
摘要:We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes at predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.


【11】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
标题:Res-Bench:对多模式大型语言模型与动态分辨率输入的鲁棒性进行基准测试
链接:https://arxiv.org/abs/2510.16926

作者:Chenxu Li, Zhicai Wang, Yuan Sheng, Xingyu Zhu, Yanbin Hao, Xiang Wang
备注:23 pages,19 figures
摘要:多模态大型语言模型(MLLM)越来越多地支持动态图像分辨率。然而,目前的评估范式主要评估语义性能,忽略了分辨率鲁棒性的关键问题-性能是否在不同的输入分辨率下保持稳定。为了解决这一差距,我们引入了\textbf{Res-Bench},这是一个综合性的基准测试,包括12个分辨率级别和6个核心能力维度的14,400个样本。我们设计了一个新的评估框架,超越了传统的准确性指标,以捕捉性能稳定性。该框架引入了多个鲁棒性指标:斯皮尔曼的相关性评估分辨率性能趋势,绝对/相对连续误差(ACE/RCE)测量性能波动。使用这些指标,我们对领先的MLLM进行了大规模评估。我们的分析包括:(1)以模型为中心和以任务为中心的鲁棒性检查,(2)包括填充和超分辨率的预处理策略的研究,以及(3)用于稳定性增强的微调的探索。
摘要:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
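
下面的小例子展示了文中提到的用Spearman相关系数刻画"分辨率-性能"趋势的做法;其中的波动度量只是一个简化代理,并非论文中ACE/RCE的确切定义,数据亦为虚构。

```python
# 示意性草图:分辨率鲁棒性指标的最小计算示例
import numpy as np
from scipy.stats import spearmanr

resolutions = np.array([224, 336, 448, 560, 672, 784])            # 假设的分辨率等级
accuracy    = np.array([0.61, 0.66, 0.70, 0.69, 0.71, 0.68])      # 假设的模型精度

rho, p = spearmanr(resolutions, accuracy)          # 趋势:分辨率升高时性能是否单调变化
volatility = np.abs(np.diff(accuracy)).mean()      # 简化的波动代理:相邻分辨率间平均绝对变化
print(f"Spearman rho={rho:.3f} (p={p:.3f}), volatility={volatility:.4f}")
```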


【12】Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding
标题:通过基于fMRI的神经编码发现视觉语言模型中的类脑分层模式
链接:https://arxiv.org/abs/2510.16870

作者:Yudan Ren, Xinlong Wang, Kexin Wang, Tian Xia, Zihan Ma, Zhaowei Li, Xiangrong Bi, Xiao Li, Xiaowei He
备注:14 pages, 7 figures
摘要:虽然受大脑启发的人工智能(AI)已经展示出有希望的结果,但目前对人工神经网络(ANN)与人脑处理之间相似性的理解仍然有限:(1)单模态ANN研究未能捕获大脑固有的多模态处理能力;(2)多模态ANN研究主要集中在高层模型输出上,忽视了单个神经元的关键作用。为了解决这些局限性,我们提出了一种新的神经元级分析框架,透过人脑活动的视角研究视觉语言模型(VLM)中的多模态信息处理机制。我们的方法独特地结合了细粒度人工神经元(AN)分析与基于fMRI的体素编码,以考察两个架构上不同的VLM:CLIP和METER。我们的分析揭示了四个关键发现:(1)AN能够成功预测生物神经元(BN)在多个功能网络(包括语言、视觉、注意力和默认模式网络)中的活动,证明了共享的表征机制;(2)AN和BN都通过重叠的神经表征表现出功能冗余,反映了大脑的容错和协作信息处理机制;(3)AN表现出与BN平行的极性模式,相反激活的BN在VLM各层中显示出镜像的激活趋势,反映了神经信息处理的复杂性和双向性;(4)CLIP和METER的架构驱动不同的BN:CLIP的独立分支显示出特定于模态的专门化,而METER的跨模态设计产生统一的跨模态激活,突出了架构对ANN类脑特性的影响。这些结果在神经元水平上为VLM的类脑分层处理提供了令人信服的证据。
摘要:While brain-inspired artificial intelligence(AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain's inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict biological neurons (BNs) activities across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain's fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel the BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP's independent branches show modality-specific specialization, whereas METER's cross-modal design yields unified cross-modal activation, highlighting the architecture's influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.


【13】Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
标题:分割作为冻结多模态LLM的即插即用能力
链接:https://arxiv.org/abs/2510.16785

作者:Jiazhen Liu, Long Chen
摘要:将多样的视觉能力集成到一个统一的模型中是多模态大型语言模型(MLLM)的一个重要趋势。其中,纳入分割能力带来了一系列独特的挑战。为了使MLLM具备像素级分割能力,主流方法需要微调模型以产生与掩码解码器兼容的特定输出。这个过程通常会改变模型的输出空间,并损害其内在的泛化能力,从而有悖于构建统一模型的目标。我们介绍LENS(Leveraging kEypoiNts for MLLMs' Segmentation),一种新颖的即插即用解决方案。LENS将一个轻量、可训练的头部连接到完全冻结的MLLM上。通过细化注意力图中蕴含的空间线索,LENS提取关键点并将其描述为与掩码解码器直接兼容的逐点特征。大量实验验证了我们的方法:LENS的分割性能与基于再训练的方法相当或更优。至关重要的是,它在做到这一点的同时完整保留了MLLM的泛化能力,而微调方法会显著损害这种能力。因此,LENS的可附加设计为扩展MLLM建立了一个高效而强大的范例,为真正多才多艺的统一模型铺平了道路。
摘要:Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.


【14】MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
标题:MultiVerse:用于评估大型视觉和语言模型的多轮对话基准
链接:https://arxiv.org/abs/2510.16641

作者:Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, Ho-Jin Choi
备注:Project website: this https URL
摘要:视觉和语言模型(VLM)在单回合基准测试中表现出了令人印象深刻的能力,但现实世界的应用程序往往需要更复杂的多回合对话。现有的多轮数据集(例如,MMDU,ConvBench)仅部分捕获用户遇到的会话场景的广度和深度。在这项工作中,我们介绍了MultiVerse,一种新颖的多回合对话基准,具有647个对话-每个平均四个回合-来自一组不同的12个流行的VLM评估基准。MultiVerse有484个任务和484个交互目标,涵盖了广泛的主题,从事实知识和感知到数学和编码等高级推理任务。为了促进强大的评估,我们提出了一种基于检查表的评估方法,利用GPT-4 o作为自动评估器,测量37个关键方面的性能,包括感知准确性,语言清晰度和事实正确性。我们在MultiVerse上评估了18个VLM,发现即使是最强的模型(例如,GPT-4 o)在复杂的多轮对话中仅获得50%的成功率,这凸显了数据集的挑战性。值得注意的是,我们发现,提供完整的对话上下文显着提高性能较小或较弱的模型,强调在上下文学习的重要性。我们相信MultiVerse是一个评估VLM多回合交互能力的平台。
摘要:Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g, MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues - each averaging four turns - derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse is a landscape of evaluating multi-turn interaction abilities for VLMs.
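
下面给出一个基于检查表的自动评估流程草图:judge_fn代表任意LLM评审接口(论文中为GPT-4o),这里作为外部传入的可调用对象以避免绑定具体API;检查项与提示词均为示意。

```python
# 示意性草图:检查表式(checklist-based)自动评估循环
from typing import Callable, Dict, List

def checklist_eval(dialogue: str,
                   checklist: List[str],
                   judge_fn: Callable[[str], str]) -> Dict[str, bool]:
    """对每个检查项询问评审模型,返回 {检查项: 是否满足}。"""
    results = {}
    for item in checklist:
        prompt = (f"请判断以下多轮对话中模型的回复是否满足要求:{item}\n"
                  f"对话内容:\n{dialogue}\n只回答 yes 或 no。")
        results[item] = judge_fn(prompt).strip().lower().startswith("yes")
    return results

# 用法示例:用一个恒定返回 "yes" 的桩函数演示接口
demo = checklist_eval("User: ... Assistant: ...",
                      ["感知准确性", "语言清晰度", "事实正确性"],
                      judge_fn=lambda p: "yes")
print(sum(demo.values()) / len(demo))   # 检查项通过率
```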


【15】VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
标题:VisionSelector:用于高效多模态LLM的端到端可学习视觉令牌压缩
链接:https://arxiv.org/abs/2510.16598

作者:Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha
备注:22 pages, 8 figures
摘要:多模态大型语言模型(MLLM)面临由高分辨率图像或多图像输入产生的海量视觉标记所带来的显著计算和内存瓶颈。以往的令牌压缩技术通常受限于启发式规则,有丢弃关键信息的风险,并且可能受到注意力汇(attention sink)等偏差的影响,在激进的压缩比下导致性能急剧下降。为了解决这些限制,我们将令牌压缩重新表述为一个轻量级的即插即用框架,将其转化为端到端可学习的决策过程。具体来说,我们提出了VisionSelector,一个与MLLM主干解耦的评分器模块,它采用可微的Top-K机制和课程退火策略来弥合训练-推理差距,从而在各种任意压缩率下实现高效且自适应的令牌选择。VisionSelector非常轻量,仅有12.85M可训练参数,展示了对各种压缩率的泛化能力并能自适应地识别关键令牌。这带来了在所有压缩预算下的卓越性能:在30%的保留预算下保持100%的MME准确率,在10%的保留预算下比先前方法高出12.14%,并使预填充速度翻倍。我们的代码可在https://github.com/JulietChoo/VisionSelector上获得。
摘要:Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process. To be specific, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection various arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector demonstrates generalization across various compression rates and adaptively identifying critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with 30% retention budget, outperforming prior methods by 12.14% at 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .
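
下面用直通估计器(straight-through)给出"可微Top-K标记选择"的一个最小示意实现;这只是该类机制的一种常见写法,未必与VisionSelector论文中的具体设计一致。

```python
# 示意性草图:前向用硬性 Top-K 掩码,反向让梯度经由软分数传播
import torch

def differentiable_topk_mask(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """scores: [N] 每个视觉标记的可学习分数;返回 [N] 的近似二值保留掩码。"""
    soft = torch.sigmoid(scores / tau)                    # 可反向传播的软分数
    hard = torch.zeros_like(scores)
    hard[torch.topk(scores, k).indices] = 1.0             # 不可微的硬选择
    return hard + soft - soft.detach()                    # 前向值等于 hard,梯度走 soft

scores = torch.randn(576, requires_grad=True)
mask = differentiable_topk_mask(scores, k=int(0.3 * 576))
loss = (mask * torch.randn(576)).sum()
loss.backward()                                           # scores.grad 非空,说明分数可学习
print(scores.grad is not None)
```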


【16】NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
标题:NavQ:学习用于前瞻性视觉和语言导航的Q模型
链接:https://arxiv.org/abs/2510.16457

作者:Peiran Xu, Xicheng Gong, Yadong MU
备注:ICCV 2025
摘要:在这项工作中,我们专注于目标导向的视觉和语言导航(VLN)的任务。现有的方法往往基于历史信息做出决策,忽略了行动的未来影响和长期结果。相反,我们的目标是开发一个有远见的代理。具体来说,我们利用Q学习训练一个Q模型,使用大规模的未标记的轨迹数据,以学习有关室内场景内的布局和对象关系的一般知识。该模型可以为每个候选动作生成类似于传统Q网络中的Q值的Q特征,该特征描述了采取特定动作后可能观察到的潜在未来信息。随后,跨模态未来编码器将任务不可知的Q特征与导航指令相结合,以产生一组反映未来前景的动作分数。这些分数与基于历史的原始分数相结合时,促进了A* 风格的搜索策略,以有效地探索更有可能通向目的地的区域。在广泛使用的面向目标的VLN数据集上进行的大量实验验证了该方法的有效性。
摘要:In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
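
下面的草图演示了把"历史得分 + 未来前景得分"组合进A*风格优先级搜索的一般形式;Q特征如何映射为未来分数、权重alpha等均为假设,与NavQ的具体实现无关。

```python
# 示意性草图:结合历史得分与未来前景得分的优先级搜索(节点需可哈希)
import heapq

def a_star_style_search(start, expand, history_score, future_score, is_goal,
                        alpha=0.5, max_steps=1000):
    """expand(node) -> 后继节点列表;得分越高越优先,返回找到的目标节点或 None。"""
    frontier = [(-(history_score(start) + alpha * future_score(start)), 0, start)]
    visited, tie = set(), 1
    while frontier and max_steps > 0:
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        if node in visited:
            continue
        visited.add(node)
        for nxt in expand(node):
            priority = -(history_score(nxt) + alpha * future_score(nxt))  # 取负实现最大堆
            heapq.heappush(frontier, (priority, tie, nxt))
            tie += 1
        max_steps -= 1
    return None
```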


【17】EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
标题:EDVD-LLaMA:通过多模态大语言模型推理的可解释Deepfake视频检测
链接:https://arxiv.org/abs/2510.16442

作者:Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang
摘要:Deepfake视频技术的快速发展不仅促进了艺术创作,也使错误信息更容易传播。传统的deepfake视频检测(DVD)方法面临着一些问题,例如其原理缺乏透明度,以及泛化能力不足,无法应对不断发展的伪造技术。这凸显了对能够识别伪造内容并提供可验证推理解释的检测器的迫切需求。本文提出了可解释deepfake视频检测(EDVD)任务,并设计了EDVD-LLaMA,一个多模态大语言模型(MLLM)推理框架,它在给出准确检测结果和可信解释的同时提供可追溯的推理过程。我们的方法首先引入时空细微信息令牌化(ST-SIT)来提取并融合全局和局部的跨帧deepfake特征,为MLLM推理提供丰富的时空语义信息输入。其次,我们构建了细粒度多模态思维链(Fg-MCoT)机制,该机制在推理过程中引入面部特征数据作为硬约束,以实现像素级的时空视频定位、抑制幻觉输出,并提高思维链的可靠性。此外,我们构建了可解释推理FF++基准数据集(ER-FF++set),利用结构化数据来注释视频并确保质量控制,从而支持推理和检测的双重监督。大量实验表明,EDVD-LLaMA在检测精度、可解释性以及处理跨伪造方法和跨数据集场景的能力方面具有出色的性能和鲁棒性。与以前的DVD方法相比,它提供了更具可解释性且更优越的解决方案。源代码和数据集将公开提供。
摘要:The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal, a large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.


【18】OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
标题:OpenLVLM-MIA:揭示大型视觉语言模型成员推断攻击局限性的受控基准
链接:https://arxiv.org/abs/2510.16295

作者:Ryoto Miyamoto, Xin Fan, Fuyuko Kido, Tsuneo Matsumoto, Hayato Yamana
摘要:OpenLVLM-MIA是一个新的基准测试,它突出了针对大型视觉语言模型(LVLM)评估成员推理攻击(MIA)时的基本挑战。虽然先前的工作报告了很高的攻击成功率,但我们的分析表明,这些结果往往来自于检测到数据集构建过程中引入的分布偏差,而不是识别出真实的成员状态。为了解决这个问题,我们引入了一个包含6,000张图像的受控基准,其中成员和非成员样本的分布经过仔细平衡,并且在三个不同的训练阶段提供了真实的成员标签。使用OpenLVLM-MIA的实验表明,在无偏条件下,最先进的MIA方法的性能收敛到随机水平。通过提供透明且无偏的基准,OpenLVLM-MIA澄清了针对LVLM的MIA研究的当前局限性,并为开发更强的隐私保护技术提供了坚实的基础。
摘要:OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6{,}000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.


【19】Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
标题:Cerberus:基于级联视觉语言模型的实时视频异常检测
链接:https://arxiv.org/abs/2510.16290

作者:Yue Zheng, Xiufang Shi, Jiming Chen, Yuanchao Shu
摘要:视频异常检测(VAD)随着视觉语言模型(VLM)的发展而迅速进步。虽然这些模型提供了卓越的zero-shot检测能力,但其巨大的计算成本和不稳定的视觉接地性能阻碍了实时部署。为了克服这些挑战,我们介绍了Cerberus,一个为高效而准确的实时VAD设计的两级级联系统。Cerberus离线学习正常的行为规则,并在在线推理过程中将轻量级过滤与细粒度VLM推理相结合。Cerberus的性能提升来自两个关键创新:运动掩码提示和基于规则的偏差检测。前者将VLM的注意力引导到与运动相关的区域,而后者将异常识别为与学习到的规范的偏离,而不是枚举可能的异常。对四个数据集的广泛评估表明,Cerberus在NVIDIA L40S GPU上平均达到57.68 fps,实现151.79倍的加速,准确率达97.2%,与最先进的基于VLM的VAD方法相当,使其成为实时视频分析的实用解决方案。
摘要:Video anomaly detection (VAD) has rapidly advanced by recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
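
下面是一个两级级联推理骨架的示意(非Cerberus官方代码):轻量打分先过滤掉绝大多数正常片段,仅少量可疑片段进入开销较大的VLM细判;函数名与阈值均为假设。

```python
# 示意性草图:两级级联的实时异常检测骨架
from typing import Callable, List, Tuple

def cascade_vad(clips: List[object],
                light_score: Callable[[object], float],
                vlm_reason: Callable[[object], bool],
                threshold: float = 0.5) -> List[Tuple[int, bool]]:
    """返回 (片段索引, 是否异常);仅轻量打分超过阈值的片段才调用 VLM。"""
    results = []
    for i, clip in enumerate(clips):
        if light_score(clip) < threshold:          # 绝大多数正常片段在此被过滤
            results.append((i, False))
        else:
            results.append((i, vlm_reason(clip)))  # 少量可疑片段交给 VLM 按学到的规则判偏差
    return results

# 用法示例:用桩函数演示接口
print(cascade_vad(["clip0", "clip1"], light_score=lambda c: 0.9, vlm_reason=lambda c: False))
```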


【20】IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection
标题:IAD-GPT:在多模态大语言模型中推进视觉知识,用于工业异常检测
链接:https://arxiv.org/abs/2510.16036

作者:Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen
备注:Accepted by IEEE Transactions on Instrumentation and Measurement (TIM)
摘要:多模态大语言模型(MLLM)强大的因果推理能力使其具有在工业异常检测(IAD)中检测缺陷对象的潜力。然而,大多数传统的IAD方法缺乏提供多轮人机对话和详细描述的能力,例如对象的颜色、异常的形状或特定类型的异常。与此同时,基于大型预训练模型的方法尚未充分激发大模型在异常检测任务中的能力。在本文中,我们探索将丰富的文本语义与图像的图像级和像素级信息相结合,并提出IAD-GPT,一种基于MLLM的新型IAD范式。我们采用异常提示生成器(APG)为特定对象生成详细的异常提示。这些来自大语言模型(LLM)的特定提示用于激活预训练视觉语言模型(即CLIP)的检测和分割功能。为了增强MLLM的视觉接地能力,我们提出了文本引导增强器,其中图像特征与正常和异常文本提示交互以动态选择增强路径,这使得语言模型能够专注于视觉数据的特定方面,增强其准确解释和响应图像中异常的能力。此外,我们设计了一个多掩码融合模块,将掩码作为专家知识引入,增强了LLM对像素级异常的感知。在MVTec-AD和VisA数据集上的大量实验证明了我们在自监督和Few-Shot异常检测与分割任务上的最先进性能。代码可在\href{https://github.com/LiZeWen1225/IAD-GPT}{https://github.com/LiZeWen1225/IAD-GPT}上获取。
摘要:The robust causal capability of Multimodal Large Language Models (MLLMs) hold the potential of detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods lack the ability to provide multi-turn human-machine dialogues and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pre-trained models have not fully stimulated the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose Text-Guided Enhancer, wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate mask as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks, such as MVTec-AD and VisA datasets. The codes are available at \href{https://github.com/LiZeWen1225/IAD-GPT}{https://github.com/LiZeWen1225/IAD-GPT}.


Transformer(5篇)

【1】ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
标题:ZACH-ViT:具有ShuffleStrides数据增强的零令牌视觉Transformer,用于稳健的肺部超声分类
链接:https://arxiv.org/abs/2510.17650

作者:Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
备注:14 pages, 6 figures, 2 tables. Primary subject: cs.LG (Machine Learning) Cross-listed to: cs.CV (Computer Vision and Pattern Recognition), eess.IV (Image and Video Processing). Code available at: this https URL Installation: pip install zachvit Paper licensed under CC BY-NC-ND 4.0. Code released under Apache 2.0 License
摘要:在肺部超声(LUS)视频中区分心源性肺水肿(CPE)与非心源性和结构正常的肺仍然具有挑战性,因为非心源性炎症模式(NCIP/ARDS样)、间质性肺病和健康肺的视觉变异性很高。这种异质性使自动分类复杂化,因为重叠的B线和胸膜伪影很常见。我们介绍ZACH-ViT(零令牌自适应紧凑分层Vision Transformer),一个0.25M参数的Vision Transformer变体,它去除了位置嵌入和[CLS]令牌,使其完全置换不变,适用于无序的医学图像数据。为了提高泛化能力,我们提出了ShuffleStrides数据增强(SSDA),它在保持解剖有效性的同时置换探头视图序列和帧顺序。ZACH-ViT在来自95名重症患者的380个LUS视频上进行了评估,并与9个最先进的基线进行了比较。尽管非心源性组存在异质性,但ZACH-ViT实现了最高的验证和测试ROC-AUC(0.80和0.79),并具有均衡的灵敏度(0.60)和特异性(0.91),而所有竞争模型都退化为平凡分类。它比Minimal ViT(0.62M参数)训练快1.35倍,参数少2.5倍,支持实时临床部署。这些结果表明,使架构设计与数据结构相匹配可以在小数据医学成像中胜过规模扩展。
摘要:Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
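
下面给出一个"去掉位置编码与[CLS]标记、用均值池化分类"的最小置换不变Transformer示意,用于说明零标记设计为何对补丁顺序不敏感;网络超参数与ZACH-ViT论文无关。

```python
# 示意性草图:无位置编码、无 [CLS] 的置换不变 Transformer 分类器
import torch
import torch.nn as nn

class PermutationInvariantViT(nn.Module):
    def __init__(self, dim=64, depth=2, heads=4, num_classes=2, patch_dim=3 * 16 * 16):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                    # 补丁嵌入,不加位置编码
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=128,
                                           dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):                                   # patches: [B, N, patch_dim]
        x = self.encoder(self.embed(patches))
        return self.head(x.mean(dim=1))                           # 均值池化替代 [CLS]

model = PermutationInvariantViT().eval()
x = torch.randn(2, 196, 3 * 16 * 16)
perm = torch.randperm(196)
with torch.no_grad():
    out1, out2 = model(x), model(x[:, perm])                      # 打乱补丁顺序
print(torch.allclose(out1, out2, atol=1e-4))                      # 输出应基本一致
```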


【2】Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation
标题:超越RGB:利用视觉Transformer进行热成像武器分割
链接:https://arxiv.org/abs/2510.16913

作者:Akhila Kambhatla, Ahmed R Khaled
备注:9 Images with 1 figure and 3 Tables. This is a preprint submitted to arXiv
摘要:热成像武器分割对于监视和安全应用至关重要,可以在基于RGB的系统失效的低光照和视觉遮蔽条件下实现鲁棒检测。虽然卷积神经网络(CNN)在热成像分割文献中占主导地位,但它们捕获长程依赖和精细结构细节的能力有限。Vision Transformer(ViT)凭借其全局上下文建模能力,在RGB分割任务中取得了最先进的结果,但其在热成像武器分割中的潜力仍有待开发。这项工作调整并评估了四个基于Transformer的架构SegFormer、DeepLabV3+、SegNeXt和Swin Transformer,用于在自定义热成像数据集上进行二值武器分割,该数据集包括从真实世界监控视频中收集的9,711张图像,并使用SAM2自动注释。我们在MMSegmentation框架中采用标准的增强策略,以确保稳健的模型训练和公平的架构比较。实验结果表明分割性能得到了显著改善:SegFormer-b5实现了最高的mIoU(94.15%)和像素精度(97.04%),而SegFormer-b0提供了最快的推理速度(98.32 FPS)和有竞争力的mIoU(90.84%)。SegNeXt-mscans提供85.12 FPS和92.24% mIoU的均衡性能,DeepLabV3+ R101-D8在29.86 FPS时达到92.76% mIoU。这些Transformer架构在低光照和遮挡的热成像环境中展示了强大的武器检测泛化能力,并具有适合各种实时安全应用的灵活精度-速度权衡。
摘要:Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under lowlight and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures SegFormer, DeepLabV3\+, SegNeXt, and Swin Transformer for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15\%) and Pixel Accuracy (97.04\%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84\%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24\% mIoU, and DeepLabV3\+ R101-D8 reaches 92.76\% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.
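
文中报告的mIoU与像素精度可以按如下方式计算;这是按通用定义写出的示意实现,与论文使用的评测脚本无关。

```python
# 示意性草图:二值分割的 mIoU 与像素准确率
import numpy as np

def binary_seg_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred、gt 为取值 {0,1} 的同形状掩码;返回 (mIoU, pixel_accuracy)。"""
    ious = []
    for cls in (0, 1):                                   # 背景与武器两类分别求 IoU 后取平均
        inter = np.logical_and(pred == cls, gt == cls).sum()
        union = np.logical_or(pred == cls, gt == cls).sum()
        ious.append(inter / union if union > 0 else 1.0)
    pixel_acc = (pred == gt).mean()
    return float(np.mean(ious)), float(pixel_acc)

pred = np.random.randint(0, 2, (512, 640))
gt = np.random.randint(0, 2, (512, 640))
print(binary_seg_metrics(pred, gt))
```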


【3】ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
标题:ArmFormer:用于实时多类别武器分割和分类的轻量级Transformer架构
链接:https://arxiv.org/abs/2510.16854

作者:Akhila Kambhatla, Taminul Islam, Khaled R Ahmed
备注:9 pages with 4 figures and 5 tables. This is a preprint submitted to arXiv
摘要:与武器有关的暴力威胁不断升级,需要能够实现像素级精度的自动检测系统,以便在实时安全应用中进行准确的威胁评估。传统的武器检测方法依赖于对象检测框架,该框架仅提供粗略的边界框定位,缺乏全面威胁分析所需的细粒度分割。此外,现有的语义分割模型要么为了计算效率而牺牲准确性,要么需要与边缘部署场景不兼容的过多计算资源。本文介绍了ArmFormer,这是一个轻量级的基于transformer的语义分割框架,它将卷积块注意力模块(CBAM)与MixVisionTransformer架构战略性地集成在一起,以实现卓越的准确性,同时保持适用于资源受限边缘设备的计算效率。我们的方法结合CBAM增强的编码器骨干与注意力集成的汉堡包解码器,使多类武器分割在五个类别:手枪,步枪,刀,左轮手枪,和人。综合实验表明,ArmFormer实现了最先进的性能,80.64%的mIoU和89.13%的mFscore,同时保持82.26 FPS的实时推理。ArmFormer仅具有4.886G FLOP和3.66M参数,性能优于需要高达48倍计算的重量级模型,使其成为分布式安全基础设施中部署便携式安全摄像机,监控无人机和嵌入式AI加速器的最佳解决方案。
摘要:The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
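
ArmFormer将CBAM(卷积块注意力模块)集成进编码器骨干;下面给出CBAM的标准形式(通道注意力+空间注意力)作为参考实现示意,具体插入位置与超参数为假设,并非ArmFormer的官方代码。

```python
# 示意性草图:标准 CBAM 模块(通道注意力 + 空间注意力)
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                     # 通道注意力的共享 MLP(1x1 卷积实现)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # 通道注意力:对全局平均池化与最大池化结果分别过共享 MLP 后相加
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # 空间注意力:沿通道维取均值与最大值,拼接后卷积得到空间权重
        s = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 64, 56, 56)
print(CBAM(64)(feat).shape)     # torch.Size([2, 64, 56, 56])
```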


【4】UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid
标题:UKANFormer:通过Kolmogorov-Arnold网络-Transformer混合体进行珊瑚礁制图的抗噪语义分割
链接:https://arxiv.org/abs/2510.16730

作者:Tianyang Dou, Ming Li, Jiangying Qin, Xuan Liao, Jiageng Zhong, Armin Gruen, Mengyi Deng
摘要:珊瑚礁是重要但脆弱的生态系统,需要准确的大规模制图才能进行有效保护。虽然Allen Coral Atlas等全球产品提供了前所未有的全球珊瑚礁分布覆盖,但其预测的空间精度和语义一致性往往有限,特别是在需要细粒度边界划定的区域。为了解决这些挑战,我们提出了UKANFormer,一种新颖的语义分割模型,旨在在来自Allen Coral Atlas的噪声监督下实现高精度制图。UKANFormer建立在UKAN架构之上,在解码器中集成了全局-局部Transformer(GL-Trans)块,从而能够同时提取全局语义结构和局部边界细节。在实验中,UKANFormer实现了67.00%的珊瑚类IoU和83.98%的像素准确率,在相同的噪声标签设置下优于传统基线。值得注意的是,该模型产生的预测在视觉和结构上都比用于训练的噪声标签更准确。这些结果挑战了数据质量直接限制模型性能的观念,表明架构设计可以减轻标签噪声,并在不完善的监督下支持可扩展制图。UKANFormer为缺乏可靠标签的生态监测提供了基础。
摘要:Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distribution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel semantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and support scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.


【5】VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction
标题:VM-BeautyNet:用于人脸美感预测的Vision Transformer与Mamba协同集成
链接:https://arxiv.org/abs/2510.16220

作者:Djamel Eddine Boukhari
摘要:人脸美感预测(FBP)是一项复杂而具有挑战性的计算机视觉任务,旨在建模人类审美感知的主观且复杂的本质。虽然深度学习模型,特别是卷积神经网络(CNN),已经取得了重大进展,但它们往往难以捕捉对人类判断至关重要的全局、整体面部特征。Vision Transformer(ViT)通过有效建模长距离空间关系来解决这个问题,但其二次复杂度可能成为瓶颈。本文介绍了一种新型的异构集成架构\textbf{VM-BeautyNet},它协同融合了Vision Transformer与基于Mamba的视觉模型(状态空间模型(SSM)的最新进展)的互补优势。ViT骨干擅长捕捉全局面部结构和对称性,而Mamba骨干以线性复杂度高效建模长程依赖关系,专注于序列特征和纹理。我们在基准SCUT-FBP5500数据集上评估了我们的方法。我们提出的VM-BeautyNet实现了最先进的性能,取得了\textbf{Pearson Correlation(PC)0.9212}、\textbf{Mean Absolute Error(MAE)0.2085}和\textbf{Root Mean Square Error(RMSE)0.2698}。此外,通过Grad-CAM可视化,我们提供了可解释性分析,确认了两个主干的互补特征提取,为模型的决策过程提供了新的见解,并为计算美学提供了一个强大的新架构范例。
摘要:Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, \textbf{VM-BeautyNet}, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a \textbf{Pearson Correlation (PC) of 0.9212}, a \textbf{Mean Absolute Error (MAE) of 0.2085}, and a \textbf{Root Mean Square Error (RMSE) of 0.2698}. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.


生成|GAN相关(22篇)

【1】GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
标题:GAS:通过广义对抗求解器改进扩散常微分方程的离散化
链接:https://arxiv.org/abs/2510.17699

作者:Aleksandr Oganov, Ilya Bykov, Eva Neudachina, Mishan Aliev, Alexander Tolmachev, Alexander Sidorov, Aleksandr Zuev, Andrey Okhotin, Denis Rakitin, Aibek Alanov
摘要:尽管扩散模型实现了最先进的生成质量,但它们仍然受制于计算上昂贵的采样。最近的工作通过基于梯度的优化方法解决这一问题:从完整的采样过程中蒸馏出几步的ODE扩散求解器,从而将函数求值次数从几十次减少到仅仅几次。然而,这些方法通常依赖于复杂的训练技巧,并且没有明确关注细粒度细节的保留。在本文中,我们介绍了广义求解器(Generalized Solver):一种对ODE采样器的简单参数化,它不需要额外的训练技巧,并且相比现有方法提升了质量。我们进一步将原始蒸馏损失与对抗训练相结合,从而减轻伪影并增强细节保真度。我们将所得方法称为广义对抗求解器(Generalized Adversarial Solver),并证明了其在相似资源约束下优于现有求解器训练方法的性能。代码可在https://github.com/3145tttt/GAS上获得。
摘要:While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS.
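
下面是一个带可学习离散化参数的几步Euler采样器草图,用来说明"把ODE求解器的离散化变成可训练对象、再用蒸馏/对抗损失优化"的一般思路;它并非GAS论文中Generalized Solver的确切参数化,eps_model亦仅为占位。

```python
# 示意性草图:可学习时间网格与步长系数的几步 ODE 采样器
import torch
import torch.nn as nn

class DummyEps(nn.Module):
    def forward(self, x, t):
        return torch.zeros_like(x)          # 占位:真实场景为预训练扩散模型的噪声预测

class FewStepEulerSampler(nn.Module):
    def __init__(self, eps_model: nn.Module, num_steps: int = 4):
        super().__init__()
        self.eps_model = eps_model
        # 可学习的时间网格(经 sigmoid 约束到 (0,1))与每步缩放系数
        self.raw_times = nn.Parameter(torch.linspace(2.0, -2.0, num_steps + 1))
        self.step_scale = nn.Parameter(torch.ones(num_steps))

    def forward(self, x_T: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.raw_times)
        x = x_T
        for i in range(len(self.step_scale)):
            eps = self.eps_model(x, t[i])
            dt = t[i + 1] - t[i]
            x = x + self.step_scale[i] * dt * eps   # 参数化的 Euler 更新
        return x

# 用法:实际训练时通常以教师多步采样的输出为目标做蒸馏(可再叠加对抗损失)
sampler = FewStepEulerSampler(DummyEps(), num_steps=4)
print(sampler(torch.randn(1, 3, 32, 32)).shape)
```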


【2】CaMiT: A Time-Aware Car Model Dataset for Classification and Generation
标题:CaMiT:一个用于分类和生成的时间感知汽车车型数据集
链接:https://arxiv.org/abs/2510.17626

作者:Frédéric LIN, Biruk Abere Ambaw, Adrian Popescu, Hejer Ammar, Romaric Audigier, Hervé Le Borgne (Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France)
备注:To be published in NeurIPS 2025 Track on Datasets and Benchmarks
摘要:人工智能系统必须适应不断变化的视觉环境,尤其是在物体外观随时间变化的领域。我们介绍了CaMiT(Car Models in Time),一个捕捉汽车车型(一类具有代表性的技术制品)时间演变的细粒度数据集。CaMiT包括190个车型(2007-2023)的787K标注样本和510万未标注样本(2005-2023),支持监督学习和自监督学习。在域内数据上进行静态预训练可以以更少的资源获得与大规模通用模型相当的性能,但当模型在跨年份测试时准确率会下降。为了解决这个问题,我们提出了时间增量分类设置,这是一个包含新出现、不断演变和逐渐消失类别的现实持续学习场景。我们评估了两种策略:更新骨干网络的时间增量预训练,以及仅更新最后一层的时间增量分类器学习,二者都提高了时间鲁棒性。最后,我们探索了在训练过程中利用时间元数据的时间感知图像生成,产生了更逼真的输出。CaMiT为细粒度视觉识别和生成中的时间适应研究提供了丰富的基准。
摘要:AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer, both improving temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.


【3】ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input
标题:ImaGGen:基于语言和图像输入的共语音语义手势的Zero-Shot生成
链接:https://arxiv.org/abs/2510.17617

作者:Hendric Voss, Stefan Kopp
摘要:人类的交流将语言与富有表现力的非语言线索(如手势)结合起来,手势承担着多种交流功能。然而,目前的生成式手势生成方法仅限于简单、重复的节拍手势,这些手势伴随说话的节奏,但无助于传达语义意义。本文解决了协同语音手势合成中的一个核心挑战:生成与言语话语语义一致的象似性(iconic)或指示性(deictic)手势。这种手势无法仅从语言输入中获得,因为语言输入本身缺乏通常由手势自主承载的视觉意义。因此,我们引入了一个zero-shot系统,该系统从给定的语言输入生成手势,并额外由图像输入提供信息,而无需手动注释或人为干预。我们的方法集成了一个图像分析管道,提取形状、对称性和对齐等关键对象属性,以及一个将这些视觉细节与口语文本关联起来的语义匹配模块。然后,逆运动学引擎合成象似性和指示性手势,并将它们与共同生成的自然节拍手势相结合,以实现连贯的多模态交流。一项全面的用户研究证明了我们方法的有效性。在语音本身存在歧义的情况下,我们的系统生成的手势显著提高了参与者识别对象属性的能力,证实了其可解释性和交流价值。虽然在表示复杂形状方面仍然存在挑战,但我们的研究结果强调了上下文感知语义手势对于创建富有表现力和协作性的虚拟代理或化身的重要性,标志着朝着高效、强大、具身的人-代理交互迈出了实质性一步。更多信息和示例视频请访问:https://review-anon-io.github.io/ImaGGen.github.io/
摘要:Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/


【4】Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation
标题:通过手势传达意义:语义共言语手势生成的研究
链接:https://arxiv.org/abs/2510.17599

作者:Hendric Voss, Lisa Michelle Bohnenkamp, Stefan Kopp
摘要:本研究探讨了两个框架的协同语音手势生成,AQ-GT及其语义增强的变体AQ-GT-a,以评估他们的能力,通过手势和人类如何感知所产生的运动来传达意义。使用句子从SAGA空间通信语料库,上下文相似的句子,和新的运动为重点的句子,我们进行了以用户为中心的概念识别和人性化的评价。结果显示,语义注释和性能之间的微妙关系。原始的AQ-GT框架,缺乏明确的语义输入,在其训练域中传达概念的效率更高。相反,AQ-GT-a框架表现出更好的泛化能力,特别是在新的背景下表示形状和大小。虽然参与者认为AQ-GT-a的手势更有表现力和帮助,但他们并不认为它们更像人类。这些研究结果表明,明确的语义丰富并不能保证改进的手势生成,其有效性是高度依赖于上下文,表明专业化和泛化之间的潜在权衡。
摘要:This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.


【5】WP-CrackNet: A Collaborative Adversarial Learning Framework for End-to-End Weakly-Supervised Road Crack Detection
标题:WP-CrackNet:一个用于端到端弱监督道路裂缝检测的协作对抗学习框架
链接:https://arxiv.org/abs/2510.17566

作者:Nachuan Ma, Zhengfei Song, Qiang Hu, Xiaoyu Tang, Chengxi Zhang, Rui Fan, Lihua Xie
摘要:道路裂缝检测对于智慧城市中的智能基础设施维护至关重要。为了减少对昂贵的像素级注释的依赖,我们提出了WP-CrackNet,这是一种端到端的弱监督方法,仅使用图像级标签进行像素级裂缝检测。WP-CrackNet集成了三个组件:生成类激活图(CAM)的分类器,测量特征可推断性的重建器,以及生成像素道路裂缝检测结果的检测器。在训练过程中,分类器和重建器交替进行对抗性学习,以鼓励裂纹CAM覆盖完整的裂纹区域,而检测器则从处理后的裂纹CAM中获得的伪标签中学习。三个分量之间的这种相互反馈提高了学习稳定性和检测精度。为了进一步提高检测性能,我们设计了一个路径感知注意模块(PAAM),通过对空间和通道依赖关系进行建模,将分类器的高级语义与重建器的低级结构线索融合在一起。此外,中心增强CAM一致性模块(CECCM)提出了改进裂纹CAM使用中心高斯加权和一致性约束,使更好的伪标签生成。我们创建了三个图像级数据集,广泛的实验表明,WP-CrackNet实现了与监督方法相当的结果,并优于现有的弱监督方法,显著提高了可扩展的道路检测。源代码包和数据集可在https://mias.group/WP-CrackNet/上获得。
摘要:Road crack detection is essential for intelligent infrastructure maintenance in smart cities. To reduce reliance on costly pixel-level annotations, we propose WP-CrackNet, an end-to-end weakly-supervised method that trains with only image-level labels for pixel-wise crack detection. WP-CrackNet integrates three components: a classifier generating class activation maps (CAMs), a reconstructor measuring feature inferability, and a detector producing pixel-wise road crack detection results. During training, the classifier and reconstructor alternate in adversarial learning to encourage crack CAMs to cover complete crack regions, while the detector learns from pseudo labels derived from post-processed crack CAMs. This mutual feedback among the three components improves learning stability and detection accuracy. To further boost detection performance, we design a path-aware attention module (PAAM) that fuses high-level semantics from the classifier with low-level structural cues from the reconstructor by modeling spatial and channel-wise dependencies. Additionally, a center-enhanced CAM consistency module (CECCM) is proposed to refine crack CAMs using center Gaussian weighting and consistency constraints, enabling better pseudo-label generation. We create three image-level datasets and extensive experiments show that WP-CrackNet achieves comparable results to supervised methods and outperforms existing weakly-supervised methods, significantly advancing scalable road inspection. The source code package and datasets are available at https://mias.group/WP-CrackNet/.


【6】MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
标题:MUG-V 10 B:大型视频生成模型的高效训练管道
链接:https://arxiv.org/abs/2510.17519

作者:Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
备注:Technical Report; Project Page: \href{this https URL}
摘要:近年来，面向视觉内容(如图像、视频和3D对象/场景)的大规模生成模型取得了显著进展。然而，由于跨模态的文本-视频对齐、所涉及的长序列以及复杂的时空依赖，训练大规模视频生成模型仍然格外困难且资源密集。为了应对这些挑战，我们提出了一个围绕四大支柱进行优化的训练框架：(i)数据处理，(ii)模型架构，(iii)训练策略，(iv)面向大规模视频生成模型的基础设施。这些优化在数据预处理、视频压缩、参数扩展、基于课程学习的预训练和以对齐为重点的后训练等所有阶段都带来了显著的效率提升和性能改进。我们的最终模型MUG-V 10B整体上与最近最先进的视频生成器相当，并且在面向电子商务的视频生成任务中，在人类评估中超过了领先的开源基线。更重要的是，我们开源了完整的技术栈，包括模型权重、基于Megatron-Core的大规模训练代码以及用于视频生成和增强的推理管道。据我们所知，这是首个公开发布的、利用Megatron-Core实现高训练效率和接近线性多节点扩展的大规模视频生成训练代码，详细信息见我们的网页：https://github.com/Shopee-MUG/MUG-V。
摘要:In recent years, large-scale generative models for visual content (\textit{e.g.,} images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling, details are available in \href{https://github.com/Shopee-MUG/MUG-V}{our webpage}.


【7】Latent Spaces Beyond Synthesis: From GANs to Diffusion Models
标题:超越合成的潜在空间:从GANs到扩散模型
链接:https://arxiv.org/abs/2510.17383

作者:Ludovica Schaerf
备注:Presented and published at Ethics and Aesthetics of Artificial Intelligence Conference (EA-AI'25)
摘要:本文研究了生成视觉模型中内部表示的演变性质，重点关注从GANs和VAE到基于扩散的架构的概念与技术转变。借鉴Beatrice Fazi将合成理解为分布式表示之合并的观点，我们提出了"严格意义上的合成"与"广义上的合成"之间的区分：前者指一个紧凑的潜在空间完全决定生成过程，后者则刻画表示工作分布于各个层之间的模型。通过细读模型架构并设计干预分层表示的针对性实验，我们展示了扩散模型如何将表示的负担碎片化，从而挑战了统一内部空间的假设。通过将这些发现置于媒体理论框架内，并批判性地审视"潜在空间"和"柏拉图式表示假说"等隐喻，我们主张重新定位对生成式人工智能的理解：不是内容的直接合成，而是专门化过程的涌现性配置。
摘要:This paper examines the evolving nature of internal representations in generative visual models, focusing on the conceptual and technical shift from GANs and VAEs to diffusion-based architectures. Drawing on Beatrice Fazi's account of synthesis as the amalgamation of distributed representations, we propose a distinction between "synthesis in a strict sense", where a compact latent space wholly determines the generative process, and "synthesis in a broad sense," which characterizes models whose representational labor is distributed across layers. Through close readings of model architectures and a targeted experimental setup that intervenes in layerwise representations, we show how diffusion models fragment the burden of representation and thereby challenge assumptions of unified internal space. By situating these findings within media theoretical frameworks and critically engaging with metaphors such as the latent space and the Platonic Representation Hypothesis, we argue for a reorientation of how generative AI is understood: not as a direct synthesis of content, but as an emergent configuration of specialized processes.


【8】A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World
标题:一套对抗服装打破了物理世界中的多种防御方法
链接:https://arxiv.org/abs/2510.17322

作者:Wei Zhang, Zhanhao Hu, Xiao Li, Xiaopei Zhu, Xiaolin Hu
备注:13 pages, 8 figures
摘要:近年来，针对物理世界中基于深度学习的目标检测器的对抗攻击引起了广泛关注。为了防御这些攻击，研究人员提出了多种针对对抗补丁（一种典型的物理可实现攻击形式）的防御方法。然而，我们的实验表明，简单地扩大补丁尺寸就可以使这些防御方法失效。受此启发，我们评估了各种防御方法对覆盖人体大部分区域的对抗性服装的防御效果。对抗性服装为针对基于补丁的攻击的对抗防御提供了一个很好的测试案例：它们不仅尺寸大，而且看起来比人身上的大补丁更自然。实验表明，所有防御方法在数字世界和物理世界中面对对抗性服装时表现都很差。此外，我们制作了一套服装，可以在Faster R-CNN上攻破多种防御方法：该套服装在物理世界中对无防御检测器的攻击成功率(ASR)达到96.06%，对9个防御模型的攻击成功率均超过64.84%，揭示了现有对抗防御方法面对对抗性服装时的共同漏洞。代码可从以下网址获得：https://github.com/weiz0823/adv-clothes-break-multiple-defenses。
摘要:In recent years, adversarial attacks against deep learning-based object detectors in the physical world have attracted much attention. To defend against these attacks, researchers have proposed various defense methods against adversarial patches, a typical form of physically-realizable attack. However, our experiments showed that simply enlarging the patch size could make these defense methods fail. Motivated by this, we evaluated various defense methods against adversarial clothes which have large coverage over the human body. Adversarial clothes provide a good test case for adversarial defense against patch-based attacks because they not only have large sizes but also look more natural than a large patch on humans. Experiments show that all the defense methods had poor performance against adversarial clothes in both the digital world and the physical world. In addition, we crafted a single set of clothes that broke multiple defense methods on Faster R-CNN. The set achieved an Attack Success Rate (ASR) of 96.06% against the undefended detector and over 64.84% ASRs against nine defended models in the physical world, unveiling the common vulnerability of existing adversarial defense methods against adversarial clothes. Code is available at: https://github.com/weiz0823/adv-clothes-break-multiple-defenses.


【9】Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
标题:生成然后重建:通过两阶段采样加速掩蔽自回归模型
链接:https://arxiv.org/abs/2510.17171

作者:Feihong Yan, Peiru Wang, Yao Zhu, Kaiyu Pang, Qingyan Wei, Huiqi Li, Linfeng Zhang
备注:12 pages, 6 figures
摘要:掩蔽自回归(MAR)模型凭借并行生成能力有望比自回归(AR)模型在视觉生成上更高效，但其加速潜力仍受限于在单步内对空间相关视觉令牌建模的复杂性。为了解决这个问题，我们引入了"生成然后重建"(GtR)，这是一种免训练的分层采样策略，将生成分解为两个阶段：先由结构生成建立全局语义骨架，再由细节重建高效补全剩余令牌。基于"从零创建图像比在基本图像框架上补全图像更困难"这一假设，GtR通过快速计算重建阶段来实现加速，同时通过缓慢计算生成阶段来保持生成质量。此外，我们观察到图像细节处的令牌往往比显著区域中的令牌携带更多语义信息，因此进一步提出频率加权令牌选择(FTS)，依据高频信息的能量定位图像细节处的令牌，并为其分配更多计算预算。在ImageNet类条件生成和文本到图像生成上的大量实验表明，GtR使MAR-H加速3.72倍，同时保持相当的质量(例如FID:1.59、IS:304.4，原始为1.59、299.1)，在各种模型规模和生成任务上大幅优于现有加速方法。我们的代码将在https://github.com/feihongyan1/GtR上发布。
摘要:Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in https://github.com/feihongyan1/GtR.
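摘要中的频率加权令牌选择(FTS)依据高频能量为图像细节处的令牌分配更多计算预算。下面给出一个按patch计算高频能量并据此排序的极简示意；patch 大小、cutoff 等超参数均为假设值，仅用于说明思路，并非论文的官方实现。

```python
import torch

def high_freq_energy_per_token(image, patch=16, cutoff=0.25):
    """按 patch 计算高频能量，作为令牌重要性的排序依据(机制示意)。
    image: (C, H, W) 张量；patch 与 cutoff 均为假设的超参数。"""
    C, H, W = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, h, w, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)                          # (h, w, C, p, p)
    spec = torch.fft.fftshift(torch.fft.fft2(patches), dim=(-2, -1))
    power = spec.abs() ** 2
    ys, xs = torch.meshgrid(torch.arange(patch, dtype=torch.float32),
                            torch.arange(patch, dtype=torch.float32),
                            indexing="ij")
    center = (patch - 1) / 2.0
    radius = ((ys - center) ** 2 + (xs - center) ** 2).sqrt()
    high_mask = (radius > cutoff * patch / 2).to(power.dtype)         # 距频谱中心较远者视为高频
    energy = (power * high_mask).sum(dim=(-3, -2, -1))                # 每个 patch 的高频能量
    return energy.flatten()                                           # (num_tokens,)

# 用法示意：高频能量大的令牌(细节区域)获得更多计算预算
img = torch.randn(3, 256, 256)
order = torch.argsort(high_freq_energy_per_token(img), descending=True)
```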


【10】Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition
标题:黑盒人脸识别中预处理的对抗鲁棒性研究
链接:https://arxiv.org/abs/2510.17169

作者:Roland Croft, Brian Du, Darcy Joseph, Sharath Kumar
备注:Accepted for publication in DICTA 2025
摘要:人脸识别(FR)模型已被证明容易受到对抗性示例的影响,这些示例会微妙地改变良性的面部图像,暴露这些系统中的盲点,并保护用户隐私。端到端FR系统首先从不同的面部图像中获得预处理的面部,然后计算深度特征嵌入的相似性。虽然人脸预处理是FR系统的关键组成部分,因此也是对抗性攻击的关键组成部分,但我们观察到,这种预处理在黑盒设置中经常被忽视。我们的研究旨在研究几种开箱即用的最先进对抗性攻击在黑盒设置中使用的不同预处理技术时对FR的可转移性。我们观察到,人脸检测模型的选择可以降低攻击成功率高达78%,而在下采样期间选择插值方法的影响相对较小。此外,我们发现,面部预处理的要求,甚至降低攻击强度在白盒设置,由于无意中产生的噪声向量对人脸检测模型的相互作用。基于这些发现,我们提出了一个预处理不变的方法,使用输入变换,提高了高达27%的研究攻击的可转移性。我们的研究结果强调了FR系统中预处理的重要性,以及需要考虑改善面部对抗性示例的对抗性概括。
摘要:Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.
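摘要提出用输入变换来提升攻击在不同预处理(人脸检测与插值)下的可迁移性。下面给出一个与DI-FGSM思路类似的通用示意：在每次计算梯度前对输入做随机缩放、随机插值并填充回原尺寸。其中 model、target_emb 等均为假设对象，步长与缩放范围为示例值，并非论文的具体方法。

```python
import random
import torch
import torch.nn.functional as F

def diverse_input(x, low=0.85):
    """随机缩小、随机选择插值方式并零填充回原尺寸的输入变换(示意)。"""
    _, _, h, w = x.shape
    scale = random.uniform(low, 1.0)
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    mode = random.choice(["bilinear", "bicubic", "nearest"])
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    x_small = F.interpolate(x, size=(nh, nw), mode=mode, **kwargs)
    pad_h, pad_w = h - nh, w - nw
    top, left = random.randint(0, pad_h), random.randint(0, pad_w)
    return F.pad(x_small, (left, pad_w - left, top, pad_h - top))

def attack_step(model, x_adv, target_emb, eps_step=2 / 255):
    """单步对抗更新：在随机变换后的输入上求梯度(与 DI-FGSM 思路类似的通用写法)。"""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    emb = model(diverse_input(x_adv))
    loss = -F.cosine_similarity(emb, target_emb).mean()   # 冒充式目标：拉近目标身份
    loss.backward()
    return (x_adv - eps_step * x_adv.grad.sign()).clamp(0, 1).detach()
```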


【11】GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
标题:GACO-CAD:从单张图像生成几何增强和简洁优化的CAD模型
链接:https://arxiv.org/abs/2510.17157

作者:Yinghui Wang, Xinyu Zhang, Peng Du
摘要:从单个图像生成可编辑的参数化CAD模型具有降低工业概念设计障碍的巨大潜力。然而,由于空间推理能力有限,当前的多模态大型语言模型(MLLM)仍然难以从2D图像准确地推断3D几何形状。我们通过引入GACO-CAD(一种新型的两阶段后培训框架)来解决这一限制。它旨在实现一个共同的目标:同时提高生成的CAD模型的几何精度,并鼓励使用更简洁的建模程序。首先,在监督微调期间,我们利用深度和表面法线映射作为密集的几何先验,将它们与RGB图像组合以形成多通道输入。在单视图重建的背景下,这些先验提供了互补的空间线索,帮助MLLM更可靠地从2D观测恢复3D几何形状。其次,在强化学习过程中,我们引入了一个组长度奖励,在保持高几何保真度的同时,促进了更紧凑和更少冗余的参数化建模序列的生成。采用简单的动态加权策略来稳定训练。在DeepCAD和Fusion 360数据集上的实验表明,GACO-CAD在相同的MLLM主干下实现了最先进的性能,在代码有效性、几何准确性和建模简洁性方面始终优于现有方法。
摘要:Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.


【12】KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation
标题:KineBiff 3D:用于类别级关节物体形状重建和生成的运动学感知扩散
链接:https://arxiv.org/abs/2510.17137

作者:WenBo Xu, Liu Liu, Li Zhang, Ran Zhang, Hao Wu, Dan Guo, Meng Wang
摘要:笔记本电脑、抽屉等铰接物体由于其多部件几何结构和可变的关节配置(不同状态下呈现出结构多样性)，给三维重建和姿态估计带来了显著挑战。为了应对这些挑战，我们提出了KineDiff3D(用于类别级铰接物体形状重建与生成的运动学感知扩散)，这是一个从单视图输入重建多样化铰接实例并估计姿态的统一框架。具体来说，我们首先通过一种新颖的运动学感知VAE(KA-VAE)，将完整几何(SDF)、关节角度和部件分割编码到一个结构化的潜在空间中。此外，我们采用两个条件扩散模型：一个用于回归全局姿态(SE(3))和关节参数，另一个用于从部分观测生成运动学感知的潜在编码。最后，我们提出了一个迭代优化模块，在保持铰接约束的同时，通过倒角距离最小化双向细化重建精度和运动学参数。在合成、半合成和真实世界数据集上的实验结果证明了我们的方法在准确重建铰接物体及估计其运动学属性方面的有效性。
摘要:Articulated objects, such as laptops and drawers, exhibit significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and pose estimation from single view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we produce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.


【13】Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation
标题:作为高斯化调节器的矩阵自由能:增强高斯码生成的自动编码器
链接:https://arxiv.org/abs/2510.17120

作者:Rishi Sonthalia, Raj Rao Nadakuditi
摘要:我们提出了一种基于矩阵自由能的自编码器正则化新方案。我们的方法以码矩阵(码维度×批量大小)的奇异值定义了一个可微的损失函数。从自由概率与随机矩阵理论的观点来看，当码矩阵的奇异值分布与具有独立同分布高斯项、经适当构造的随机矩阵的奇异值分布一致时，该损失达到最小。经验模拟表明，通过标准的随机梯度训练最小化负的矩阵自由能，可以产生类高斯的编码，并能在训练集与测试集之间泛化。在此基础上，我们提出了一个最大化矩阵自由能的自编码器，它能可靠地产生高斯编码，并展示了其在欠定逆问题中的应用。
摘要:We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability and random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random matrix with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricial free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.
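作为理解摘要思路的参考，下面给出一个简化的"高斯化"正则项示意：把码矩阵的奇异值谱与同尺寸 i.i.d. 高斯参考矩阵的奇异值谱对齐。注意这只是一个替代性的近似写法，并非论文中矩阵自由能的实际定义；encoder、lam 等均为假设。

```python
import torch

def gaussianizing_penalty(code):
    """简化的"高斯化"正则项示意：把码矩阵的奇异值谱与同尺寸 i.i.d. 高斯参考
    矩阵的奇异值谱对齐。这是一个替代性的近似写法，并非论文中矩阵自由能的定义。
    code: (码维度, 批量大小) 的张量。"""
    d, b = code.shape
    code_c = code - code.mean(dim=1, keepdim=True)
    s = torch.sort(torch.linalg.svdvals(code_c / b ** 0.5)).values
    with torch.no_grad():
        ref = torch.randn_like(code)                     # 同尺寸的 i.i.d. 高斯参考矩阵
        s_ref = torch.sort(torch.linalg.svdvals(ref / b ** 0.5)).values
    return ((s - s_ref) ** 2).mean()

# 训练中的用法示意(encoder 与权重 lam 均为假设)：
# codes = encoder(x)                                     # (batch, dim)
# loss = recon_loss + lam * gaussianizing_penalty(codes.T)
```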


【14】Conditional Synthetic Live and Spoof Fingerprint Generation
标题:有条件合成实时和欺骗指纹生成
链接:https://arxiv.org/abs/2510.17035

作者:Syed Konain Abbas, Sandip Purnapatra, M. G. Sarwar Murshed, Conor Miller-Lynch, Lambert Igene, Soumyabrata Dey, Stephanie Schuckers, Faraz Hussain
摘要:大型指纹数据集虽然对训练和评估很重要，但收集起来既耗时又昂贵，而且需要严格的隐私措施。研究人员正在探索使用合成指纹数据来解决这些问题。本文提出了一种生成合成指纹图像(欺骗指纹与活体指纹)的新方法，以解决生物特征数据收集中与隐私、成本和可获取性相关的问题。我们的方法利用条件StyleGAN2-ADA和StyleGAN3架构，以特定手指身份(拇指到小指)为条件生成高分辨率的合成活体指纹。此外，我们使用CycleGAN将其转换为逼真的欺骗指纹，模拟多种呈现攻击材料(例如EcoFlex、Play-Doh)。这些合成的欺骗指纹对于开发强大的欺骗检测系统至关重要。通过这些生成模型，我们创建了两个合成数据集(DB2和DB3)，每个数据集包含所有十个手指的1,500张指纹图像，每个手指有多个印记，并包括八种材料类型的相应欺骗指纹。结果显示了强大的性能：我们的StyleGAN3模型实现了低至5的Fréchet Inception Distance(FID)，生成的指纹在0.01%的错误接受率下实现了99.47%的真实接受率；StyleGAN2-ADA模型在相同的0.01% FAR下实现了98.67%的TAR。我们使用标准指标(NFIQ2、MINDTCT)评估指纹质量；值得注意的是，匹配实验证实没有明显的身份泄露证据，证明我们的合成数据集具有很强的隐私保护特性。
摘要:Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fr\'echet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate of 99.47% at a 0.01% False Accept Rate. The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments confirm strong privacy preservation, with no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.
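摘要中报告了在0.01% FAR下的TAR等指标。下面给出由真匹配/假匹配分数计算"给定FAR下的TAR"的通用示意代码(这是与论文无关的标准评测写法)，分数数据为随机生成，仅作演示。

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far_target=1e-4):
    """根据真匹配/假匹配分数计算给定 FAR(如 0.01%)下的 TAR 的通用写法。
    约定分数越高越相似；输入为一维分数数组。"""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]
    k = max(1, int(np.floor(far_target * len(impostor))))
    threshold = impostor[k - 1]                # 约有 k 个假匹配达到或超过该阈值
    return float(np.mean(genuine >= threshold)), float(threshold)

# 用法示意(随机分数仅作演示)
rng = np.random.default_rng(0)
tar, thr = tar_at_far(rng.normal(0.8, 0.1, 10000), rng.normal(0.2, 0.1, 200000))
print(f"TAR@FAR=0.01%: {tar:.4f}, threshold={thr:.3f}")
```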


【15】From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display
标题:从人体模型到人类:逼真服装展示的姿势感知和身份保护视频生成框架
链接:https://arxiv.org/abs/2510.16833

作者:Xiangyu Mu, Dongliang Zhou, Jie Hou, Haijun Zhang, Weili Guan
摘要:基于人体模型的服装展示提供了一个具有成本效益的替代真实模型展示的在线时装展示,但缺乏现实主义和表现力的细节。为了克服这一限制,我们引入了一个新的任务,称为人体模型到人类(M2H)的视频生成,其目的是合成身份可控,逼真的人体视频从人体模型的镜头。我们提出了M2HVideo,一个姿势感知和身份保护的视频生成框架,解决了两个关键挑战:头部和身体运动之间的不对齐,以及由时间建模引起的身份漂移。特别是,M2HVideo结合了一个动态的姿态感知头部编码器,融合了面部语义与身体姿态,以产生跨帧的一致身份嵌入。为了解决由于潜在空间压缩而导致的精细面部细节的丢失,我们通过基于去噪扩散隐式模型(DDIM)的一步去噪引入了在像素空间中应用的镜像损失。此外,我们设计了一个分布感知适配器,对齐身份和服装特征的统计分布,以提高时间的一致性。在UBC时尚数据集,我们自己构建的ASOS数据集和现场捕获的新收集的MannequinVideos数据集上进行的大量实验表明,与最先进的方法相比,M2HVideo在服装一致性,身份保护和视频保真度方面具有卓越的性能。
摘要:Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.


【16】EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
标题:EMRRG:用于放射学报告生成的高效微调预训练X射线Mamba网络
链接:https://arxiv.org/abs/2510.16776

作者:Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, Xiao Wang
摘要:基于X射线图像的医疗报告生成(MRG)是人工智能的一个关键领域,可以显着减少临床医生的诊断负担和患者的等待时间。现有的MRG模型主要依赖于大型语言模型(LLM)来改进报告生成,对预训练的视觉基础模型或高级微调技术的探索有限。主流框架要么避免微调,要么利用像LoRA这样的简单方法,往往忽视了增强交叉注意机制的潜力。此外,虽然基于transformer的模型主导了视觉语言任务,但非transformer架构(如Mamba网络)在医疗报告生成方面仍未得到充分探索,为未来的研究提供了一个有希望的途径。在本文中,我们提出了EMRRG,这是一种新的X射线报告生成框架,它使用参数有效的方法微调预训练的Mamba网络。具体来说,X射线图像被划分为补丁,标记化,并由基于SSM的视觉骨干进行处理以进行特征提取,部分LoRA产生最佳性能。带有混合解码器的LLM生成医疗报告,实现端到端训练,并在基准数据集上获得强大的结果。在三个广泛使用的基准数据集上进行的大量实验充分验证了我们提出的X射线MRG策略的有效性。本文的源代码将在https://github.com/Event-AHU/Medical_Image_Analysis上发布。
摘要:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.


【17】Filtering of Small Components for Isosurface Generation
标题:过滤小成分以生成等值面
链接:https://arxiv.org/abs/2510.16684

作者:Devin Zhao, Rephael Wenger
备注:8 pages, 6 figures, 5 tables
摘要:设 $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ 是一个标量场。等值面是对某个 $\sigma \in \mathbb{R}$ 的水平集 $f^{-1}(\sigma)$ 的分段线性近似，由 $f$ 的某种规则网格采样构建。从扫描数据(如CT或MRI)构建的等值面通常包含极小的连通分量，它们会干扰可视化，且不会成为由数据生成的任何几何模型的一部分。对数据进行简单的预滤波即可去除这些小分量，同时对构成可视化主体的大分量没有影响。我们给出了这种滤波的实验结果。
摘要:Let $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ be a scalar field. An isosurface is a piecewise linear approximation of a level set $f^{-1}(\sigma)$ for some $\sigma \in \mathbb{R}$ built from some regular grid sampling of $f$. Isosurfaces constructed from scanned data such as CT scans or MRIs often contain extremely small components that distract from the visualization and do not form part of any geometric model produced from the data. Simple prefiltering of the data can remove such small components while having no effect on the large components that form the body of the visualization. We present experimental results on such filtering.
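下面给出一个示意性的实现思路：先对体数据做连通分量分析去除过小的分量，再用 marching cubes 提取等值面。阈值 min_voxels 以及"以 sigma 为阈值取上水平集"的做法均为假设，并非论文的具体滤波方法。

```python
import numpy as np
from scipy import ndimage
from skimage import measure

def filtered_isosurface(volume, sigma, min_voxels=100):
    """先去除阈值之上的小连通分量，再用 marching cubes 提取等值面(示意实现)。
    min_voxels 为假设的体素数阈值。"""
    mask = volume >= sigma                                     # 以 sigma 的上水平集为对象
    labels, num = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, num + 1))
    keep_ids = np.nonzero(sizes >= min_voxels)[0] + 1
    small = mask & ~np.isin(labels, keep_ids)
    filtered = volume.astype(float).copy()
    filtered[small] = volume.min()                             # 把小分量压到阈值以下，使其不再产生等值面
    return measure.marching_cubes(filtered, level=sigma)

# 用法示意：verts, faces, normals, values = filtered_isosurface(ct_volume, 300.0, 500)
```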


【18】TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement
标题:TokenAR:通过自回归Token-level增强的多主题生成
链接:https://arxiv.org/abs/2510.16332

作者:Haiyue Sun, Qingdong He, Jinlong Peng, Peng Tang, Jiangning Zhang, Junwei Zhu, Xiaobin Hu, Shuicheng Yan
摘要:自回归模型(AR)在条件图像生成方面取得了显著的成功。然而，这些方法在多参考图像生成中难以解耦不同的参考身份。在这项工作中，我们提出了TokenAR框架，聚焦于一种简单而有效的令牌级增强机制，以解决参考身份混淆问题。这种令牌级增强包括三个部分：1) 令牌索引嵌入对令牌索引进行聚类，以更好地表示同一张参考图像；2) 指令令牌注入充当额外的视觉特征容器，为参考令牌注入细节与补充先验；3) 身份-令牌解纠缠策略(ITD)显式引导令牌表示独立地刻画每个身份的特征。这一令牌增强框架显著增强了现有基于AR的条件图像生成方法的能力，在保持高质量背景重建的同时实现了良好的身份一致性。在追求多主体生成的高质量与高多样性的目标驱动下，我们引入了InstructAR数据集，这是第一个开源、大规模、多参考输入、开放域的图像生成数据集，包含28K训练对，每个样例包括两个参考主体、一条相对提示以及一个带掩码标注的背景，用于多参考图像生成的训练与评估。综合实验验证，我们的方法在多参考图像生成任务上超越了当前最先进的模型。实现代码和数据集将公开，参见https://github.com/lyrig/TokenAR
摘要:Autoregressive Model (AR) has shown remarkable success in conditional image generation. However, these approaches for multiple reference generation struggle with decoupling different reference identities. In this work, we propose the TokenAR framework, specifically focused on a simple but effective token-level enhancement mechanism to address reference identity confusion problem. Such token-level enhancement consists of three parts, 1). Token Index Embedding clusters the tokens index for better representing the same reference images; 2). Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens; 3). The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.This token-enhancement framework significantly augments the capabilities of existing AR based methods in conditional image generation, enabling good identity consistency while preserving high quality background reconstruction. Driven by the goal of high-quality and high-diversity in multi-subject generation, we introduce the InstructAR Dataset, the first open-source, large-scale, multi-reference input, open domain image generation dataset that includes 28K training pairs, each example has two reference subjects, a relative prompt and a background with mask annotation, curated for multiple reference image generation training and evaluating. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in multiple reference image generation task. The implementation code and datasets will be made publicly. Codes are available, see https://github.com/lyrig/TokenAR


【19】DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution
标题:DiffusionX:具有多轮提示演化的高效边缘云协作图像生成
链接:https://arxiv.org/abs/2510.16326

作者:Yi Wei, Shunpu Tang, Liang Zhao, Qiangian Yang (College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China)
摘要:扩散模型的最新进展推动了图像生成的显着进步。然而，生成过程仍然是计算密集型的，并且用户通常需要迭代地细化提示以实现期望的结果，这进一步增加了延迟并给云资源带来了沉重的负担。为了应对这一挑战，我们提出了DiffusionX，这是一个云边缘协作框架，用于高效的多轮、基于提示的生成。在这个系统中，一个轻量级的设备上的扩散模型通过快速生成预览图像与用户交互，而一个高容量的云模型在提示完成后执行最终的细化。我们还引入了一个噪声水平预测器，可以动态平衡计算负载，优化延迟和云工作负载之间的权衡。实验表明，与Stable Diffusion v1.5相比，DiffusionX将平均生成时间缩短了15.8%，同时保持了相当的图像质量。此外，它仅比Tiny-SD慢0.9%，图像质量显著提高，从而以最小的开销展示了效率和可扩展性。
摘要:Recent advances in diffusion models have driven remarkable progress in image generation. However, the generation process remains computationally intensive, and users often need to iteratively refine prompts to achieve the desired results, further increasing latency and placing a heavy burden on cloud resources. To address this challenge, we propose DiffusionX, a cloud-edge collaborative framework for efficient multi-round, prompt-based generation. In this system, a lightweight on-device diffusion model interacts with users by rapidly producing preview images, while a high-capacity cloud model performs final refinements after the prompt is finalized. We further introduce a noise level predictor that dynamically balances the computation load, optimizing the trade-off between latency and cloud workload. Experiments show that DiffusionX reduces average generation time by 15.8% compared with Stable Diffusion v1.5, while maintaining comparable image quality. Moreover, it is only 0.9% slower than Tiny-SD with significantly improved image quality, thereby demonstrating efficiency and scalability with minimal overhead.


【20】Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention
标题:Scale-DiT:具有分层局部注意力的超高分辨率图像生成
链接:https://arxiv.org/abs/2510.16325

作者:Yuyao Zhang, Yu-Wing Tai
备注:22 pages
摘要:超高分辨率文本到图像生成既需要细粒度的纹理合成，也需要全局一致的结构，但由于注意力的二次复杂度过高以及原生4K训练数据稀缺，目前的扩散模型仍局限于低于1K×1K的分辨率。我们提出了Scale-DiT，这是一个新的扩散框架，引入了带有低分辨率全局指导的分层局部注意力，从而在超高分辨率下实现高效、可扩展且语义连贯的图像合成。具体来说，高分辨率潜变量被划分为固定大小的局部窗口，将注意力复杂度从二次降到近线性，而配备缩放位置锚点的低分辨率潜变量则注入全局语义。轻量级的LoRA适配在去噪期间桥接全局与局部路径，确保结构和细节的一致性。为了最大限度地提高推理效率，我们将令牌序列重排为希尔伯特曲线顺序，并实现了一个跳过掩码操作的融合内核，形成GPU友好的设计。大量实验表明，与密集注意力基线相比，Scale-DiT实现了超过2倍的推理加速和更低的内存占用，同时无需额外的高分辨率训练数据即可可靠地扩展到4K×4K分辨率。在定量基准(FID、IS、CLIP评分)和定性比较中，Scale-DiT都提供了更优的全局一致性和更清晰的局部细节，匹配或优于依赖原生4K训练的最先进方法。总之，这些结果表明，带有低分辨率锚点指导的分层局部注意力是推进超高分辨率图像生成的一种有前途且有效的方法。
摘要:Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-$1K \times 1K$ resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native $4K$ training data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we repermute token sequence in Hilbert curve order and implement a fused-kernel for skipping masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage compared to dense attention baselines, while reliably scaling to $4K \times 4K$ resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach for advancing ultra-high-resolution image generation.
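摘要中分层局部注意力的核心是把高分辨率潜变量划分成固定大小的窗口、只在窗口内做自注意力。下面给出窗口划分与窗口内自注意力的极简PyTorch示意；窗口大小、头数等均为假设值，低分辨率全局引导与LoRA桥接未包含在内，并非论文的官方实现。

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """在固定大小窗口内做自注意力的极简示意：
    复杂度由 O((HW)^2) 降为约 O(HW * win^2)。dim/win/heads 为假设超参数。"""
    def __init__(self, dim=128, win=8, heads=4):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)，H、W 需能被 win 整除
        B, H, W, C = x.shape
        w = self.win
        t = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        t = t.reshape(-1, w * w, C)            # (B*窗口数, 窗口内令牌数, C)
        out, _ = self.attn(t, t, t)            # 仅窗口内部交互
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

# 用法示意
y = LocalWindowAttention()(torch.randn(2, 64, 64, 128))
```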


【21】Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
标题:Stroke2 Sketch:利用笔画属性来生成免训练草图
链接:https://arxiv.org/abs/2510.16319

作者:Rui Yang, Huining Li, Yiyi Long, Xiaojun Wu, Shengfeng He
备注:ICCV 2025
摘要:生成由参考样式引导的草图需要精确传输笔画属性,例如线条厚度、变形和纹理稀疏性,同时保留语义结构和内容保真度。为此,我们提出了Stroke2Sketch,这是一种新型的免训练框架,它引入了跨图像笔划注意力,这是一种嵌入在自我注意力层中的机制,可以建立细粒度的语义对应关系,并实现准确的笔划属性传输。这允许我们的方法自适应地将参考笔划特征集成到内容图像中,同时保持结构完整性。此外,我们开发了自适应对比度增强和语义集中的注意力,以加强内容保护和前景强调。Stroke2Sketch有效地合成了风格上忠实的草图,与手工制作的结果非常相似,在表达性笔画控制和语义一致性方面优于现有方法。代码可在https://github.com/rane7/Stroke2Sketch上获得。
摘要:Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence. Codes are available at https://github.com/rane7/Stroke2Sketch.


【22】ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
标题:ESCA:通过场景图生成将预定代理上下文化
链接:https://arxiv.org/abs/2510.15963

作者:Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik
备注:Accepted as a Spotlight Paper at NeurIPS 2025
摘要:多模态大型语言模型(MLLM)正在朝着通用的体现代理快速发展。然而,目前的训练管道主要依赖于高级视觉-声音-文本对,并且在像素级视觉内容和文本语义之间缺乏细粒度的结构化对齐。为了克服这一挑战,我们提出了ESCA,一个新的框架,通过结构化的时空理解上下文体现代理。它的核心是SGClip,一种新的基于CLIP的、开放域的、可扩展的场景图生成模型。SGClip通过神经符号学习管道在87 K+开放域视频上进行训练,该管道利用来自视频字幕对和结构化推理的模型驱动的自我监督,从而消除了对人工标记场景图注释的需求。我们证明,SGClip支持基于推理和特定任务的微调,在场景图生成和动作本地化基准中表现出色。带有SGClip的ESCA持续改进开源和商业MLLM,在两个具体环境中实现最先进的性能。值得注意的是,它显著减少了代理感知错误,并使开源模型能够超越专有基线。
摘要:Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.


检测相关(24篇)

【1】Signature Forgery Detection: Improving Cross-Dataset Generalization
标题:签名伪造检测:改进跨数据集通用
链接:https://arxiv.org/abs/2510.17724

作者:Matheus Ramos Parracho
备注:Undergraduate thesis (preprint)---submitted to Escola Politécnica, Universidade Federal do Rio de Janeiro (POLI/UFRJ). The final version will include official signatures and defense approval
摘要:自动签名验证是银行、身份认证和法律文件中使用的关键生物识别技术。尽管深度学习方法取得了显著进步，但离线签名验证中的大多数方法仍难以跨数据集泛化，因为笔迹风格和采集协议的变化通常会降低性能。这项研究考察了签名伪造检测的特征学习策略，重点是提高跨数据集的泛化能力，即在一个数据集上训练并在另一个数据集上测试时的模型鲁棒性。我们使用三个公共基准(CEDAR、ICDAR和GPDS Synthetic)开发了两条实验管道：一条基于原始签名图像，另一条采用称为shell预处理的预处理方法。我们识别并分析了若干行为模式，但并未确定两种方法之间孰优孰劣。结果表明，原始图像模型在各基准上取得了更高的性能，而基于shell的模型经进一步改进后有望用于鲁棒的跨域签名验证。
摘要:Automated signature verification is a critical biometric technique used in banking, identity authentication, and legal documentation. Despite the notable progress achieved by deep learning methods, most approaches in offline signature verification still struggle to generalize across datasets, as variations in handwriting styles and acquisition protocols often degrade performance. This study investigates feature learning strategies for signature forgery detection, focusing on improving cross-dataset generalization -- that is, model robustness when trained on one dataset and tested on another. Using three public benchmarks -- CEDAR, ICDAR, and GPDS Synthetic -- two experimental pipelines were developed: one based on raw signature images and another employing a preprocessing method referred to as shell preprocessing. Several behavioral patterns were identified and analyzed; however, no definitive superiority between the two approaches was established. The results show that the raw-image model achieved higher performance across benchmarks, while the shell-based model demonstrated promising potential for future refinement toward robust, cross-domain signature verification.


【2】Improving Cross-Patient Generalization in Parkinson's Disease Detection through Chunk-Based Analysis of Hand-Drawn Patterns
标题:通过基于块的手绘模式分析提高帕金森病检测的跨患者概括性
链接:https://arxiv.org/abs/2510.17703

作者:Mhd Adnan Albani, Riad Sonbol
备注:19 pages, 2 figures, 9 tables
摘要:帕金森病(PD)是一种神经退行性疾病，影响约1%的60岁以上人群，造成的运动障碍会妨碍书写和绘画等手部协调活动。许多方法试图基于手绘图像支持帕金森病的早期检测；然而，我们在相关工作中发现了两个主要局限：(1)缺乏足够的数据集，(2)处理未见过患者数据时的鲁棒性不足。在本文中，我们提出了一种新的帕金森病检测方法，包含两个阶段：第一阶段根据绘画类型(圆形、曲折线、螺旋)进行分类，第二阶段从图像中提取所需特征并检测帕金森病。我们通过分块策略克服了上述两个局限：将每张图像划分为2x2共四个块，在提取特征和识别帕金森病指标时对每个块单独处理，最终分类则用集成方法合并各块的决策。评估表明，我们的方法优于性能最好的现有方法，尤其是在未见过的患者上。在NewHandPD数据集上，我们的方法对已见患者达到97.08%的准确率，对未见患者达到94.91%，两者仅相差2.17个百分点，而先前工作中观察到的下降为4.76个百分点。
摘要:Parkinson's disease (PD) is a neurodegenerative disease affecting about 1% of people over the age of 60, causing motor impairments that impede hand coordination activities such as writing and drawing. Many approaches have tried to support early detection of Parkinson's disease based on hand-drawn images; however, we identified two major limitations in the related works: (1) the lack of sufficient datasets, (2) the robustness when dealing with unseen patient data. In this paper, we propose a new approach to detect Parkinson's disease that consists of two stages: The first stage classifies based on their drawing type(circle, meander, spiral), and the second stage extracts the required features from the images and detects Parkinson's disease. We overcame the previous two limitations by applying a chunking strategy where we divide each image into 2x2 chunks. Each chunk is processed separately when extracting features and recognizing Parkinson's disease indicators. To make the final classification, an ensemble method is used to merge the decisions made from each chunk. Our evaluation shows that our proposed approach outperforms the top performing state-of-the-art approaches, in particular on unseen patients. On the NewHandPD dataset our approach, it achieved 97.08% accuracy for seen patients and 94.91% for unseen patients, our proposed approach maintained a gap of only 2.17 percentage points, compared to the 4.76-point drop observed in prior work.
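摘要中的分块策略可以用很短的代码说明：把图像切成2x2共四块，对每块单独预测后再做集成。下面是一个极简示意，其中 chunk_classifier 为假设的单块分类函数，集成方式采用简单平均(论文中的具体集成方法可能不同)。

```python
import numpy as np

def split_into_chunks(image):
    """把图像均分为 2x2 共 4 个块(image 为 HxW 或 HxWxC 数组)。"""
    h, w = image.shape[:2]
    hh, ww = h // 2, w // 2
    return [image[:hh, :ww], image[:hh, ww:], image[hh:, :ww], image[hh:, ww:]]

def ensemble_predict(image, chunk_classifier):
    """对每个块分别预测患病概率，再做简单平均集成。
    chunk_classifier 为假设的函数：输入图像块，输出 0~1 概率。"""
    probs = [chunk_classifier(chunk) for chunk in split_into_chunks(image)]
    p = float(np.mean(probs))
    return {"probability": p, "prediction": int(p >= 0.5)}
```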


【3】Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
标题:用于暴力检测的节俭联邦学习:LoRA调谐的VLM和个性化CNN的比较
链接:https://arxiv.org/abs/2510.17651

作者:Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah
备注:7 pages, 1 figure, FLTA 2025
摘要:我们通过比较两种互补策略来研究用于暴力检测的节俭联邦学习方法:(i)视觉语言模型(VLM)的zero-shot和联邦微调,以及(ii)紧凑型3D卷积神经网络(CNN 3D)的个性化训练。使用LLaVA-7 B和65.8M参数CNN 3D作为代表性案例,我们评估了实际非IID设置下的精度、校准和能源使用情况。   这两种方法的准确率都超过90%。CNN 3D在ROC AUC和对数损失方面略优于低秩自适应(LoRA)调整的VLM,同时使用更少的能量。VLM仍然有利于上下文推理和多模态推理。我们通过训练和推理量化能源和CO2排放,并分析部署的可持续性权衡。   据我们所知,这是LoRA调整的视觉语言模型和个性化CNN用于联邦暴力检测的第一次比较研究,重点是能源效率和环境指标。   这些发现支持一种混合模型:用于常规分类的轻量级CNN,用于复杂或描述性场景的选择性VLM激活。由此产生的框架为视频监控中负责任的、资源感知的AI提供了一个可复制的基线,并扩展到实时、多模式和生命周期感知系统。
摘要:We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings.   Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment.   To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics.   These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.


【4】One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection
标题:一个Dinomaly2检测全部:全光谱无监督异常检测的统一框架
链接:https://arxiv.org/abs/2510.17611

作者:Jia Guo, Shuai Lu, Lei Fan, Zelin Li, Donglin Di, Yang Song, Weihang Zhang, Wenbing Zhu, Hong Yan, Fang Chen, Huiqi Li, Hongen Liao
备注:Extended version of CVPR2025
摘要:无监督异常检测(UAD)已经从建立专门的单类模型发展到统一的多类模型，但现有的多类模型明显落后于最先进的一对一专用模型。此外，该领域已经分裂成针对特定场景(多类、3D、Few-Shot等)定制的专门方法，造成部署障碍，并突出了对统一解决方案的需求。在本文中，我们介绍了Dinomaly2，这是全光谱图像UAD的第一个统一框架，它弥合了多类模型中的性能差距，同时无缝扩展到不同的数据模态和任务设置。在"少即是多"的哲学指导下，我们证明了五个简单元素的编排在标准的基于重建的框架中即可实现卓越的性能。这种方法论上的极简主义使其无需修改即可自然扩展到不同任务，表明简单性是真正通用性的基础。在12个UAD基准上进行的大量实验证明了Dinomaly2在多模态(2D、多视图、RGB-3D、RGB-IR)、任务设置(单类、多类、推理统一多类、Few-Shot)和应用领域(工业、生物、户外)的全谱优势。例如，我们的多类模型在MVTec-AD和VisA上分别实现了前所未有的99.9%和99.3%的图像级(I-)AUROC。对于多视图和多模态检测，Dinomaly2以最小的调整展示了最先进的性能。此外，每个类仅使用8个正常样本，我们的方法就超过了以前的全样本(full-shot)模型，在MVTec-AD和VisA上分别实现了98.7%和97.4%的I-AUROC。极简设计、计算可扩展性和普遍适用性的结合使Dinomaly2成为现实世界异常检测应用的统一解决方案。
摘要:Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the "less is more" philosophy, we demonstrate that the orchestration of five simple element achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2's full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
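作为理解"基于重建的异常检测框架"的参考，下面给出一个通用的打分示意：逐层比较编码器特征与解码器重建特征的余弦相似度，以 1 减相似度作为异常图。这是该类方法的常见写法，并非 Dinomaly2 的官方实现，特征层数与上采样尺寸均为假设。

```python
import torch
import torch.nn.functional as F

def anomaly_map_from_features(enc_feats, dec_feats, out_size=(256, 256)):
    """基于重建框架的通用异常打分示意：逐层比较编码器特征与解码器(重建)特征的
    余弦相似度，1 - 相似度作为异常图，上采样后取平均；再以像素最大值作为图像级分数。"""
    maps = []
    for fe, fd in zip(enc_feats, dec_feats):               # 每项形如 (B, C, H, W)
        amap = 1.0 - F.cosine_similarity(fe, fd, dim=1)    # (B, H, W)
        amap = F.interpolate(amap.unsqueeze(1), size=out_size,
                             mode="bilinear", align_corners=False)
        maps.append(amap)
    anomaly_map = torch.stack(maps).mean(dim=0)            # (B, 1, H, W)
    image_score = anomaly_map.flatten(1).max(dim=1).values
    return anomaly_map, image_score
```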


【5】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
标题:MIRAGE:基于网络证据推理的多模态错误信息检测代理框架
链接:https://arxiv.org/abs/2510.17590

作者:Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma
备注:16 pages, 3 tables, 1 figure
摘要:错误信息借助每天数十亿条图文结合的多模态帖子在网络平台上传播，远超人工事实核查的处理能力。有监督的检测模型需要特定领域的训练数据，且难以在多样的操纵手段之间泛化。我们提出了MIRAGE，一个推理时使用、模型可插拔的代理框架，它将多模态验证分解为四个顺序模块：视觉真实性评估检测AI生成的图像；跨模态一致性分析识别脱离上下文的再利用；检索增强的事实核查通过迭代生成问题，将论断锚定在网络证据中；校准的判断模块整合所有信号。MIRAGE通过有针对性的网络检索来编排视觉语言模型推理，输出结构化、带引用链接的依据。在MMFakeBench验证集(1,000个样本)上，使用GPT-4o-mini的MIRAGE实现了81.65%的F1和75.1%的准确率，比最强的zero-shot基线(使用MMD-Agent的GPT-4V，F1为74.0%)高出7.65个点，同时将假阳性率保持在34.3%，而仅用判断模块的基线为97.3%。测试集结果(5,000个样本)证实了其泛化能力，F1为81.44%，准确率为75.08%。消融研究表明，视觉验证贡献了5.18个F1点，检索增强推理贡献了2.97个点。我们的结果表明，结合网络检索的分解式代理推理无需特定领域训练即可达到有监督检测器的性能，使得在标注数据仍然稀缺的各类模态上进行错误信息检测成为可能。
摘要:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.


【6】Monitoring Horses in Stalls: From Object to Event Detection
标题:监控马厩中的马:从对象到事件检测
链接:https://arxiv.org/abs/2510.17409

作者:Dmitrii Galimzianov, Viacheslav Vyshegorodtsev, Ivan Nezhivykh
备注:12 pages, 4 figures, 4 tables
摘要:监测马厩中马匹的行为对于及早发现健康和福利问题至关重要，但这项工作仍然费时费力。在这项研究中，我们提出了一个基于视觉的监控系统原型，利用目标检测和多目标跟踪技术自动检测并跟踪马厩内的马匹和人员。该系统采用YOLOv11和BoT-SORT进行检测与跟踪，并根据马厩内目标的轨迹和空间关系推断事件状态。为了支持开发，我们在基础模型CLIP和GroundingDINO的辅助下构建了一个自定义数据集。该系统区分五种事件类型，并考虑了摄像机的盲区。定性评估表明，系统在与马相关的事件上表现可靠，同时也凸显了由于数据稀缺在检测人员方面的局限性。这项工作为马匹养护设施的实时行为监测奠定了基础，对动物福利和马厩管理具有意义。
摘要:Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera's blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.
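摘要中的检测与跟踪环节可以用 ultralytics 库大致复现：加载一个YOLO检测模型并启用 BoT-SORT 跟踪器，逐帧读取轨迹ID与类别，再在其上实现事件推断。以下为示意代码，权重文件、视频路径与类别设置均为假设，事件判定逻辑未包含。

```python
# 示意：用 ultralytics 的 YOLO 检测模型 + BoT-SORT 跟踪马厩视频中的目标。
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                       # 假设使用(或已微调的)YOLO11 权重
results = model.track(
    source="stall_camera.mp4",                   # 假设的马厩摄像头视频
    tracker="botsort.yaml",                      # ultralytics 自带的 BoT-SORT 配置
    stream=True,
    persist=True,
)

for frame in results:
    for box in frame.boxes:
        track_id = int(box.id) if box.id is not None else -1
        cls_name = frame.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # 后续可根据 (track_id, cls_name, 边界框与马厩区域的空间关系) 推断事件状态
        print(track_id, cls_name, round(x1), round(y1), round(x2), round(y2))
```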


【7】Fair and Interpretable Deepfake Detection in Videos
标题:视频中公平且可解释的Deepfake检测
链接:https://arxiv.org/abs/2510.17264

作者:Akihito Yoshii, Ryosuke Sonoda, Ramya Srinivasan
备注:10 pages (including References)
摘要:现有的deepfake检测方法通常存在偏见,缺乏透明度,无法捕获时间信息,导致不同人口群体的决策和结果不可靠。在本文中,我们提出了一个公平感知的deepfake检测框架,该框架集成了时间特征学习和人口统计感知数据增强,以增强公平性和可解释性。我们的方法利用基于序列的聚类对deepfake视频和概念提取进行时间建模,以提高检测可靠性,同时也为非专家用户提供可解释的决策。此外,我们还引入了一种人口统计感知的数据增强方法,该方法可以平衡代表性不足的群体,并应用频域变换来保留deepfake伪影,从而减轻偏差并提高泛化能力。使用最先进的(SoTA)架构(Xception,ResNet)对FaceForensics++,DFD,Celeb-DF和DFDC数据集进行了广泛的实验,证明了所提出的方法在获得公平性和准确性之间的最佳权衡方面的有效性。
摘要:Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.


【8】Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring
标题:浮游生物识别的分布外检测基准:海洋生态监测先进方法的系统评估
链接:https://arxiv.org/abs/2510.17179

作者:Yingzi Han, Jiakai He, Chuanlong Xie, Jianping Li
摘要:由于训练数据和测试数据之间的分布变化(分布外，OoD)，自动浮游生物识别模型在实际部署中面临重大挑战。这源于浮游生物复杂的形态、巨大的物种多样性以及新物种的不断发现，导致推理过程中出现不可预测的错误。尽管近年来OoD检测方法进展迅速，浮游生物识别领域仍缺乏对最新计算机视觉进展的系统整合，以及用于大规模评估的统一基准。针对这一问题，本文基于DYB-PlanktonNet数据集\cite{875n-f104-21}精心设计了一系列模拟各种分布变化场景的OoD基准，并对22种OoD检测方法进行了系统评估。大量实验结果表明，ViM \cite{wang2022vim}方法在我们构建的基准上显著优于其他方法，尤其在Far-OoD场景中关键指标有大幅提升。这一综合评估不仅为自动浮游生物识别中的算法选择提供了可靠参考，也为今后浮游生物OoD检测研究奠定了坚实基础。据我们所知，这是浮游生物识别领域首次对分布外数据检测方法进行大规模、系统的评估与分析。代码可在https://github.com/BlackJack0083/PlanktonOoD上获得。
摘要:Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton's complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset \cite{875n-f104-21}, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM \cite{wang2022vim} method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at https://github.com/BlackJack0083/PlanktonOoD.
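对各类OoD检测方法的比较通常基于同一套指标。下面给出一个通用的评估示意：给定ID与OoD样本的打分(可来自MSP、能量分数或ViM等任一方法)，计算AUROC与FPR@95TPR。这只是常见的评测写法，并非本文基准的官方实现。

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_ood(id_scores, ood_scores):
    """给定 ID 与 OoD 样本的打分(约定分数越高越像 ID)，计算 AUROC 与 FPR@95TPR。"""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    scores = np.concatenate([id_scores, ood_scores])
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.searchsorted(tpr, 0.95, side="left")       # TPR 首次达到 95% 的位置
    return {"AUROC": float(auroc), "FPR@95TPR": float(fpr[min(idx, len(fpr) - 1)])}
```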


【9】GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
标题:GOOD:免训练引导扩散采样用于分布外检测
链接:https://arxiv.org/abs/2510.17131

作者:Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si
备注:28 pages, 16 figures, conference
摘要:最近的进展探索了用文本到图像扩散模型合成分布外(OOD)样本，大大提高了OOD检测的性能。然而，现有的方法通常依赖于扰动文本条件嵌入，导致语义不稳定和偏移多样性不足，这限制了对真实OOD的泛化能力。为了解决这些挑战，我们提出了GOOD，一种新颖而灵活的框架，直接使用现成的分布内(ID)分类器将扩散采样轨迹引导至OOD区域。GOOD采用了双层级引导：(1)图像级引导，基于对数配分函数的梯度降低输入似然，将样本驱动到像素空间的低密度区域；(2)特征级引导，源自分类器潜在空间中的k-NN距离，促进在特征稀疏区域中的采样。因此，这种双引导设计能够实现更可控和多样化的OOD样本生成。此外，我们引入了一个统一的OOD分数，自适应地结合图像和特征差异，提高检测的鲁棒性。我们进行了全面的定量和定性分析来评估GOOD的有效性，证明使用GOOD生成的样本进行训练可以显著提高OOD检测性能。
摘要:Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance based on the gradient of log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier's latent space, promotes sampling in feature-sparse regions. Hence, this dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
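
代码示意：针对摘要中的“特征级引导”（分类器潜在空间中的k-NN距离），下面给出一个脱离扩散采样流程的极简示意：直接对输入求k-NN距离的梯度并沿上升方向更新，以说明该引导信号的构造方式。其中 encoder、ID特征库、步长 eta 与 k 均为占位假设，并非论文设定；在GOOD中该梯度是用来引导扩散采样轨迹的。

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # 占位的ID分类器特征提取部分
id_bank = torch.randn(500, 128)                                     # 占位的ID样本特征库

def knn_guidance_step(x, k=10, eta=0.05):
    """计算样本特征到ID特征库的k-NN平均距离，并沿增大该距离的方向更新输入，
    即把样本推向特征稀疏区域。"""
    x = x.clone().requires_grad_(True)
    feat = encoder(x)                        # (1, 128)
    d = torch.cdist(feat, id_bank)           # (1, 500)
    knn_dist = d.topk(k, largest=False).values.mean()
    knn_dist.backward()
    with torch.no_grad():
        return x + eta * x.grad

x0 = torch.randn(1, 3, 32, 32)
x1 = knn_guidance_step(x0)
```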


【10】Towards a Generalizable Fusion Architecture for Multimodal Object Detection
标题:面向多模式对象检测的可推广融合架构
链接:https://arxiv.org/abs/2510.17078

作者:Jad Berjawi, Yoann Dupas, Christophe Cérin
备注:8 pages, 8 figures, accepted at ICCV 2025 MIRA Workshop
摘要:多模态目标检测通过利用来自多个传感器模态的互补线索来提高在挑战性条件下的鲁棒性。我们提出了过滤多模态交叉注意力融合(FMCAF)，一种旨在增强RGB与红外(IR)输入融合的预处理架构。FMCAF结合了频域滤波模块(Freq-Filter)来抑制冗余的频谱特征，并结合基于交叉注意力的融合模块(MCAF)来改善模态间特征共享。与针对特定数据集定制的方法不同，FMCAF以通用性为目标，在不需要针对数据集调参的情况下提升在不同多模态挑战上的性能。在LLVIP(低光行人检测)和VEDAI(航拍车辆检测)上，FMCAF优于传统融合(拼接)，在VEDAI上实现+13.9% mAP@50，在LLVIP上实现+1.1%。这些结果支持FMCAF作为未来检测管道中鲁棒多模态融合的灵活基础的潜力。
摘要:Multimodal object detection improves robustness in challenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi-Modal Cross Attention Fusion (FMCAF), a preprocessing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency-domain filtering block (Freq-Filter) to suppress redundant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset-specific tuning. On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.
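
代码示意：下面用PyTorch勾勒摘要中两个模块的思路：频域滤波(Freq-Filter)用可学习掩码抑制部分频谱分量，交叉注意力融合(MCAF)在RGB与IR令牌之间做多头交叉注意力。这是按摘要描述写的结构示意，通道数、分辨率等均为假设，并非论文官方实现。

```python
import torch
import torch.nn as nn

class FreqFilter(nn.Module):
    """对特征图做rFFT，再用可学习的软掩码逐频率抑制，最后逆变换回空间域。"""
    def __init__(self, channels, h, w):
        super().__init__()
        self.mask = nn.Parameter(torch.ones(channels, h, w // 2 + 1))

    def forward(self, x):                        # x: (B, C, H, W)
        f = torch.fft.rfft2(x, norm="ortho")
        f = f * torch.sigmoid(self.mask)
        return torch.fft.irfft2(f, s=x.shape[-2:], norm="ortho")

class CrossAttentionFusion(nn.Module):
    """以RGB令牌为query、IR令牌为key/value的交叉注意力融合（残差式）。"""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, ir_tokens):    # (B, N, dim)
        fused, _ = self.attn(rgb_tokens, ir_tokens, ir_tokens)
        return fused + rgb_tokens

# 用法示意
rgb, ir = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
filt = FreqFilter(64, 16, 16)
to_tokens = lambda t: t.flatten(2).transpose(1, 2)   # (B, HW, C)
out = CrossAttentionFusion(dim=64)(to_tokens(filt(rgb)), to_tokens(filt(ir)))
```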


【11】Registration is a Powerful Rotation-Invariance Learner for 3D Anomaly Detection
标题:配准是3D异常检测的强大旋转不变性学习器
链接:https://arxiv.org/abs/2510.16865

作者:Yuyang Yu, Zhengwei Chen, Xuemiao Xu, Lei Zhang, Haoxin Yang, Yongwei Nie, Shengfeng He
摘要:点云数据中的三维异常检测对于工业质量控制至关重要,其目的是以高可靠性识别结构缺陷。然而,目前的内存银行为基础的方法往往遭受不一致的功能转换和有限的判别能力,特别是在捕捉局部几何细节和实现旋转不变性。当配准失败时,这些限制变得更加明显,导致不可靠的检测结果。我们认为,点云注册起着至关重要的作用,不仅在对齐几何结构,但也在引导特征提取旋转不变和局部判别表示。为此,我们提出了一个注册诱导,旋转不变的特征提取框架,集成了点云注册和基于内存的异常检测的目标。我们的关键见解是,这两项任务都依赖于对局部几何结构进行建模,并利用样本之间的特征相似性。通过将特征提取嵌入到注册学习过程中,我们的框架共同优化了对齐和表示学习。这种集成使网络能够获得对旋转具有鲁棒性并且对异常检测非常有效的特征。在Anomaly-ShapeNet和Real 3D-AD数据集上的大量实验表明,我们的方法在有效性和通用性方面始终优于现有方法。
摘要:3D anomaly detection in point-cloud data is critical for industrial quality control, aiming to identify structural defects with high reliability. However, current memory bank-based methods often suffer from inconsistent feature transformations and limited discriminative capacity, particularly in capturing local geometric details and achieving rotation invariance. These limitations become more pronounced when registration fails, leading to unreliable detection results. We argue that point-cloud registration plays an essential role not only in aligning geometric structures but also in guiding feature extraction toward rotation-invariant and locally discriminative representations. To this end, we propose a registration-induced, rotation-invariant feature extraction framework that integrates the objectives of point-cloud registration and memory-based anomaly detection. Our key insight is that both tasks rely on modeling local geometric structures and leveraging feature similarity across samples. By embedding feature extraction into the registration learning process, our framework jointly optimizes alignment and representation learning. This integration enables the network to acquire features that are both robust to rotations and highly effective for anomaly detection. Extensive experiments on the Anomaly-ShapeNet and Real3D-AD datasets demonstrate that our method consistently outperforms existing approaches in effectiveness and generalizability.


【12】An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting
标题:用于机器人采摘的荔枝检测和成熟度分类的RGB-D图像数据集
链接:https://arxiv.org/abs/2510.16800

作者:Zhenpeng Zhang, Yi Wang, Shanglei Chai, Yingying Liu, Zekai Xie, Wenhao Huang, Pengyu Li, Zipei Luo, Dajiang Lu, Yibin Tian
摘要:荔枝是一种高价值的亚热带水果。采用基于视觉的采摘机器人可以显著提高生产率，同时减少对劳动力的依赖。高质量的数据对于开发这类采摘机器人至关重要。然而，目前还没有标注一致且全面、以自然生长环境中果实为对象的开源荔枝数据集。为了解决这个问题，我们构建了一个数据集，以促进荔枝检测和成熟度分类。彩色(RGB)图像采集于不同天气条件和一天中的不同时段，涵盖糯米糍、妃子笑、黑叶、怀枝等多个荔枝品种。该数据集涵盖三个不同的成熟阶段，包含11,414张图像，其中包括878张原始RGB图像、8,780张增强RGB图像和1,756张深度图像。这些图像标注了9,658对用于荔枝检测和成熟度分类的标签。为了提高标注一致性，三人独立标注数据，然后由第四位评审员汇总和验证他们的结果。我们还对数据集进行了详细的统计分析。最后，我们使用三个有代表性的深度学习模型进行实验以评估该数据集。该数据集已公开供学术研究使用。
摘要:Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reducing reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of labels for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic research.


【13】Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution
标题:面向图像超分辨率的显著性感知伪影检测与数据集
链接:https://arxiv.org/abs/2510.16752

作者:Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin
摘要:生成式图像超分辨率(SR)在视觉质量和细节恢复方面进步迅速。然而，随着SR模型容量的扩大，它们产生伪影的趋势也在增加：不正确的、视觉上令人不适的细节会降低感知质量。至关重要的是，这些伪影的感知影响各不相同：一些伪影几乎不明显，而另一些则会严重降低图像质量。我们认为，伪影应根据其对人类观察者的显著性来刻画，而不是被视为统一的二元缺陷。基于这一动机，我们提出了一个新数据集，包含来自11种当代图像SR方法的1302个伪影示例，其中每个伪影都配有一个众包得到的显著性得分。在此数据集的基础上，我们训练了一个轻量级回归器，它可以生成空间显著性热图，并在检测显著伪影方面优于现有方法。我们发布了数据集和代码，以促进对SR伪影的显著性感知评估和缓解。
摘要:Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.


【14】Fit for Purpose? Deepfake Detection in the Real World
标题:适合目的?现实世界中的Deepfake检测
链接:https://arxiv.org/abs/2510.16556

作者:Guangyu Lin, Li Lin, Christina P. Walker, Daniel S. Schiff, Shu Hu
摘要:在生成对抗网络、扩散模型和多模态大型语言模型的进步的推动下,人工智能生成的内容迅速扩散,使得合成媒体的创建和传播变得毫不费力,这增加了错误信息的风险,特别是歪曲真相、破坏对政治机构信任的政治深度伪造。反过来,政府、研究机构和行业也大力推动deepfake检测计划作为解决方案。然而,大多数现有的模型都是在实验室控制的合成数据集上训练和验证的,这限制了它们对社交平台上传播的影响公众的真实政治深度伪造的普遍性。在这项工作中,我们介绍了基于政治Deepfakes事件数据库的第一个系统基准,该数据库是自2018年以来在社交媒体上分享的真实政治Deepfakes的精选集合。我们的研究包括对学术界、政府和工业界最先进的deepfake探测器进行系统评估。我们发现来自学术界和政府的检测器表现相对较差。虽然付费检测工具比免费访问模型实现了相对更高的性能,但所有评估的检测器都很难有效地推广到真实的政治deepfake,并且容易受到简单的操纵,特别是在视频领域。研究结果表明,有必要建立政治背景下的deepfake检测框架,以便在现实世界中更好地保护公众。
摘要:The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.


【15】OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary Tasks
标题:OOS-DSD:使用辅助任务改进零售图像中的缺货检测
链接:https://arxiv.org/abs/2510.16508

作者:Franko Šikić, Sven Lončarić
摘要:缺货(OOS)检测是一个非常重要的零售核查环节，旨在推断货架指定区域内的产品是否缺货。在本文中，我们介绍了OOS-DSD，一种通过辅助学习推进OOS检测的新型深度学习方法。特别是，我们在成熟的YOLOv8目标检测架构上扩展了额外的卷积分支，以同时检测OOS、分割产品并估计场景深度。OOS检测和产品分割分支使用真值数据进行训练，而深度估计分支则使用由最先进(SOTA)深度估计模型Depth Anything V2产生的伪标签标注进行训练。此外，由于上述伪标签深度估计为相对深度，我们提出了一种适当的深度归一化流程来稳定训练过程。实验结果表明，所提方法在平均精度均值(mAP)上超过SOTA OOS检测方法1.8%。此外，消融研究证实了辅助学习和所提深度归一化流程的有效性，前者使mAP提高3.7%，后者提高4.2%。
摘要:Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.
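
代码示意：摘要提到Depth Anything V2给出的伪标签是相对深度，需要适当归一化以稳定训练。下面给出一种常见的逐图分位数归一化示意（论文的具体归一化方式可能不同），分位数阈值为假设取值。

```python
import numpy as np

def normalize_relative_depth(depth, lo_pct=2.0, hi_pct=98.0):
    """逐图归一化：按分位数截断后线性缩放到[0, 1]，作为深度分支的回归目标。"""
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    d = np.clip(depth, lo, hi)
    return (d - lo) / max(hi - lo, 1e-6)

pseudo_depth = np.random.rand(480, 640) * 37.5   # 任意尺度的相对深度伪标签
target = normalize_relative_depth(pseudo_depth)
```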


【16】Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance
标题:通过各向异性高斯边界盒和Bhattacharyya距离增强旋转对象检测
链接:https://arxiv.org/abs/2510.16445

作者:Chien Thai, Mai Xuan Trang, Huong Ninh, Hoang Hiep Ly, Anh Son Le
备注:Neurocomputing
摘要:准确有效地检测旋转物体是计算机视觉中的一个重大挑战，特别是在航空成像、遥感和自动驾驶等应用中。尽管传统的目标检测框架对于轴对齐的目标是有效的，但由于在捕获方向变化方面的局限，它们在涉及旋转目标的场景中通常表现不佳。本文介绍了一种改进的损失函数，利用高斯边界框表示和Bhattacharyya距离来提高检测的准确性和鲁棒性。此外，我们主张使用各向异性高斯表示来解决类方形物体中各向同性方差带来的问题。我们提出的方法通过引入旋转不变的损失函数来应对这些挑战，有效地捕捉旋转目标的几何特性。我们将该损失函数集成到最先进的基于深度学习的旋转目标检测器中，大量实验表明，与现有方法相比，平均精度均值(mAP)指标有了显著改进。这些结果突出了我们的方法在旋转目标检测中树立新基准的潜力，并对需要不受方向影响的精确可靠目标定位的广泛应用具有意义。
摘要:Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and Bhattacharyya distance. In addition, we advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this proposed loss function into state-of-the-art deep learning-based rotated object detection detectors, and extensive experiments demonstrated significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.
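
代码示意：下面给出“旋转框转各向异性二维高斯、再计算两高斯之间Bhattacharyya距离”这一核心计算的示意实现；由距离映射到最终损失的方式以论文为准。

```python
import numpy as np

def rbox_to_gaussian(cx, cy, w, h, theta):
    """旋转框 (cx, cy, w, h, theta) -> 二维高斯：均值为中心，协方差由宽高与旋转角决定。"""
    mu = np.array([cx, cy], dtype=np.float64)
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    s = np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])
    return mu, r @ s @ r.T

def bhattacharyya_distance(mu1, sigma1, mu2, sigma2):
    """D_B = 1/8 (mu1-mu2)^T Sigma^{-1} (mu1-mu2)
           + 1/2 ln( det(Sigma) / sqrt(det(Sigma1) det(Sigma2)) ),
       其中 Sigma = (Sigma1 + Sigma2) / 2。"""
    sigma = 0.5 * (sigma1 + sigma2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(sigma, diff)
    term2 = 0.5 * np.log(np.linalg.det(sigma)
                         / np.sqrt(np.linalg.det(sigma1) * np.linalg.det(sigma2)))
    return term1 + term2

# 用法示意：细长目标在角度略有差异时的距离
g1 = rbox_to_gaussian(10, 10, 8, 2, np.pi / 6)
g2 = rbox_to_gaussian(11, 10, 8, 2, np.pi / 4)
print(bhattacharyya_distance(*g1, *g2))
```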


【17】iWatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance
标题:iWatchRoadv2:坑洞检测、地理空间映射和智能道路治理
链接:https://arxiv.org/abs/2510.16375

作者:Rishi Raj Sahoo, Surbhi Saswati Mohanty, Subhankar Mishra
备注:Under review
摘要:道路坑洼构成了重大的安全隐患和维护挑战，特别是在印度多样化和维护不足的道路网络上。本文介绍了iWatchRoadv2，这是一个全自动的端到端平台，用于实时坑洞检测、基于GPS的地理标记和使用OpenStreetMap(OSM)的动态道路健康可视化。我们策划了一个包含7,000多个仪表盘摄像头帧的自注释数据集，捕捉了不同的印度道路状况、天气模式和照明场景，并用其微调Ultralytics YOLO模型以实现准确的坑洞检测。该系统将OCR提取的视频时间戳与外部GPS日志同步，以精确定位每个检测到的坑洞，并通过全面的元数据丰富检测结果，包括路段属性和通过优化的后端数据库管理的承包商信息。iWatchRoadv2引入了智能治理功能，使当局能够通过安全的登录界面将路段与合同元数据联系起来。当道路健康状况恶化时，该系统会自动向承包商和官员发送警报，支持自动问责制和保修执行。直观的Web界面为利益相关者和公众提供可操作的分析，促进证据驱动的维修规划、预算分配和质量评估。我们的经济高效且可扩展的解决方案简化了帧处理和存储，同时支持城市和农村部署的无缝公众参与。通过自动化从检测到维修验证的完整坑洞监测生命周期，iWatchRoadv2实现了数据驱动的智能城市管理、透明治理和道路基础设施维护的可持续改进。该平台和现场演示可在https://smlab.niser.ac.in/project/iwatchroad上访问。
摘要:Road potholes pose significant safety hazards and maintenance challenges, particularly on India's diverse and under-maintained road networks. This paper presents iWatchRoadv2, a fully automated end-to-end platform for real-time pothole detection, GPS-based geotagging, and dynamic road health visualization using OpenStreetMap (OSM). We curated a self-annotated dataset of over 7,000 dashcam frames capturing diverse Indian road conditions, weather patterns, and lighting scenarios, which we used to fine-tune the Ultralytics YOLO model for accurate pothole detection. The system synchronizes OCR-extracted video timestamps with external GPS logs to precisely geolocate each detected pothole, enriching detections with comprehensive metadata, including road segment attribution and contractor information managed through an optimized backend database. iWatchRoadv2 introduces intelligent governance features that enable authorities to link road segments with contract metadata through a secure login interface. The system automatically sends alerts to contractors and officials when road health deteriorates, supporting automated accountability and warranty enforcement. The intuitive web interface delivers actionable analytics to stakeholders and the public, facilitating evidence-driven repair planning, budget allocation, and quality assessment. Our cost-effective and scalable solution streamlines frame processing and storage while supporting seamless public engagement for urban and rural deployments. By automating the complete pothole monitoring lifecycle, from detection to repair verification, iWatchRoadv2 enables data-driven smart city management, transparent governance, and sustainable improvements in road infrastructure maintenance. The platform and live demonstration are accessible at https://smlab.niser.ac.in/project/iwatchroad.


【18】MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization
标题:MIRAD -用于大规模个性化的全面现实世界稳健异常检测数据集
链接:https://arxiv.org/abs/2510.16370

作者:Pulin Li, Guocheng Wu, Li Yin, Yuxin Zheng, Wei Zhang, Yanjie Zhou
备注:this https URL
摘要:社会制造利用社区协作和分散的资源,实现现代工业的大规模个性化。然而,这种范式转变也给质量控制带来了巨大的挑战,特别是在缺陷检测方面。主要困难来自三个方面。首先,产品通常具有高度定制的配置。其次,生产通常涉及零散的小批量订单。第三,分布式站点的映像环境差异很大。为了克服现实世界数据集和定制算法的稀缺性,我们引入了大规模个性化鲁棒异常检测(MIRAD)数据集。作为第一个明确为社会制造中的异常检测而设计的基准,MIRAD捕获了该领域的三个关键维度:(1)具有较大类内差异的多样化个性化产品,(2)从六个地理上分散的制造节点收集的数据,以及(3)大量的成像异质性,包括照明,背景和运动条件的变化。然后,我们进行广泛的评估国家的最先进的(SOTA)异常检测方法的MIRAD,涵盖一类,多类,和zero-shot的方法。结果显示,与传统基准相比,所有模型的性能都有显着下降,突出了现实世界个性化生产中缺陷检测的未解决复杂性。通过将工业需求和学术研究联系起来,MIRAD为开发工业5.0所必需的强大质量控制解决方案提供了现实的基础。该数据集可在https://github.com/wu33learn/MIRAD上公开获得。
摘要:Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. The dataset is publicly available at https://github.com/wu33learn/MIRAD.


【19】Scaling Laws for Deepfake Detection
标题:Deepfake检测的缩放定律
链接:https://arxiv.org/abs/2510.16320

作者:Wenhao Wang, Longqi Cai, Taihong Xiao, Yuxiao Wang, Ming-Hsuan Yang
摘要:本文对深度伪造检测任务的缩放律进行了系统研究。具体来说，我们分析了模型性能随真实图像域数量、deepfake生成方法数量和训练图像数量的变化。由于现有的数据集不符合这项研究的规模要求，我们构建了ScaleDF，这是该领域迄今为止最大的数据集，其中包含来自51个不同数据集(域)的580多万张真实图像和由102种deepfake方法生成的880多万张假图像。使用ScaleDF，我们观察到类似于大型语言模型(LLM)中所示的幂律缩放。具体来说，随着真实域数量或deepfake方法数量的增加，平均检测误差遵循可预测的幂律衰减。这一关键观察结果不仅使我们能够预测达到目标性能所需的额外真实域或deepfake方法的数量，还激励我们以数据为中心的方式对抗不断发展的deepfake技术。除此之外，我们还研究了预训练和数据增强在缩放下的深度伪造检测中的作用，以及缩放本身的局限性。
摘要:This paper presents a systematic study of scaling laws for the deepfake detection task. Specifically, we analyze the model performance against the number of real image domains, deepfake generation methods, and training images. Since no existing dataset meets the scale requirements for this research, we construct ScaleDF, the largest dataset to date in this field, which contains over 5.8 million real images from 51 different datasets (domains) and more than 8.8 million fake images generated by 102 deepfake methods. Using ScaleDF, we observe power-law scaling similar to that shown in large language models (LLMs). Specifically, the average detection error follows a predictable power-law decay as either the number of real domains or the number of deepfake methods increases. This key observation not only allows us to forecast the number of additional real domains or deepfake methods required to reach a target performance, but also inspires us to counter the evolving deepfake technology in a data-centric manner. Beyond this, we examine the role of pre-training and data augmentations in deepfake detection under scaling, as well as the limitations of scaling itself.
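
代码示意：摘要的关键结论是平均检测误差随真实域数或伪造方法数呈可预测的幂律衰减。下面用对数-对数线性回归给出“拟合幂律并外推所需规模”的示意（数据为随机构造，仅演示用法）。

```python
import numpy as np

def fit_power_law(n, err):
    """拟合 err ≈ a * n^(-b)：对 log(err) 与 log(n) 做一次线性回归。"""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(intercept), -slope            # 返回 (a, b)

def required_scale(a, b, target_err):
    """反推达到目标误差所需的规模 n（真实域数或deepfake方法数）。"""
    return (a / target_err) ** (1.0 / b)

# 假设数据，仅演示幂律拟合与外推
n = np.array([4, 8, 16, 32, 64])
err = 0.30 * n ** -0.35 * (1 + 0.02 * np.random.randn(5))
a, b = fit_power_law(n, err)
print(a, b, required_scale(a, b, target_err=0.05))
```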


【20】Designing a Convolutional Neural Network for High-Accuracy Oral Cavity Squamous Cell Carcinoma (OCSCC) Detection
标题:卷积神经网络在口腔鳞状细胞癌高精度检测中的应用
链接:https://arxiv.org/abs/2510.16235

作者:Vishal Manikanden, Aniketh Bandlamudi, Daniel Haehn
摘要:口腔鳞状细胞癌(OCSCC)是头颈部最常见的癌症类型。由于其早期阶段的微妙性质,发展的深层和隐藏区域以及缓慢的生长,OCSCC通常未被发现,导致可预防的死亡。然而,经过适当训练的卷积神经网络(CNN)具有精确的图像分割技术和应用核矩阵修改图像RGB值以进行准确图像模式识别的能力,将是早期检测OCSCC的有效手段。将该神经网络与图像捕获和处理硬件配对将提高OCSCC检测的效率。我们项目的目的是开发一个经过训练的卷积神经网络来识别OCSCC,并设计一个物理硬件系统来捕获和处理详细的图像,以确定准确预测所需的图像质量。在由良性和恶性肿瘤以及阴性样本组成的4293个训练图像上训练CNN,并评估其在OCSCC预测中的精确度、召回率和平均平均精确度(mAP)。选择癌性、非癌性和阴性图像的随机分类图像的测试数据集,并且改变每个图像以代表5种常见分辨率。该测试数据集由CNN进行了彻底分析,并根据准确性对预测进行了评分。设计的增强硬件用于捕获详细的图像,并对其影响进行评分。开发了一个应用程序,以促进测试过程,并使CNN开放访问。分辨率增加的图像在对数尺度上导致更高的准确度预测,证明了更高像素计数的收益递减。
摘要:Oral Cavity Squamous Cell Carcinoma (OCSCC) is the most common type of head and neck cancer. Due to the subtle nature of its early stages, deep and hidden areas of development, and slow growth, OCSCC often goes undetected, leading to preventable deaths. However, properly trained Convolutional Neural Networks (CNNs), with their precise image segmentation techniques and ability to apply kernel matrices to modify the RGB values of images for accurate image pattern recognition, would be an effective means for early detection of OCSCC. Pairing this neural network with image capturing and processing hardware would allow increased efficacy in OCSCC detection. The aim of our project is to develop a Convolutional Neural Network trained to recognize OCSCC, as well as to design a physical hardware system to capture and process detailed images, in order to determine the image quality required for accurate predictions. A CNN was trained on 4293 training images consisting of benign and malignant tumors, as well as negative samples, and was evaluated for its precision, recall, and Mean Average Precision (mAP) in its predictions of OCSCC. A testing dataset of randomly assorted images of cancerous, non-cancerous, and negative images was chosen, and each image was altered to represent 5 common resolutions. This test data set was thoroughly analyzed by the CNN and predictions were scored on the basis of accuracy. The designed enhancement hardware was used to capture detailed images, and its impact was scored. An application was developed to facilitate the testing process and bring open access to the CNN. Images of increasing resolution resulted in higher-accuracy predictions on a logarithmic scale, demonstrating the diminishing returns of higher pixel counts.


【21】StripRFNet: A Strip Receptive Field and Shape-Aware Network for Road Damage Detection
标题:StripRFNet:用于道路损坏检测的带状接收场和形状感知网络
链接:https://arxiv.org/abs/2510.16115

作者:Jianhan Lin, Yuchu Qin, Shuai Gao, Yikang Rui, Jie Liu, Yanjie Lv
摘要:维护良好的道路网络对于实现可持续发展目标(SDG)11至关重要。路面破损不仅威胁交通安全，而且阻碍城市的可持续发展。然而，由于损伤形状多样、难以捕获高长宽比的细长裂缝，以及小尺度损伤识别错误率高，准确检测仍然具有挑战性。为了解决这些问题，我们提出了StripRFNet，这是一种新型深度神经网络，包括三个模块：(1)形状感知模块(SPM)，通过多尺度特征聚合中的大可分离核注意力(LSKA)来增强形状辨别；(2)条带感受野模块(SRFM)，采用大条带卷积和池化来捕获细长裂缝的特征；以及(3)小尺度增强模块(SSEM)，利用高分辨率P2特征图、专用检测头和动态上采样来改进小目标检测。在RDD2022基准测试上的实验表明，StripRFNet优于现有方法。在中国子集上，F1分数、mAP50和mAP50:95分别较基线提高4.4、2.9和3.4个百分点。在完整数据集上，与CRDDC'2022参与者和ORDDC'2024第2阶段结果相比，它实现了最高的F1分数80.33%，同时保持了有竞争力的推理速度。这些结果表明，StripRFNet实现了最先进的准确性和实时效率，为智能道路维护和可持续基础设施管理提供了一个有前途的工具。
摘要:Well-maintained road networks are crucial for achieving Sustainable Development Goal (SDG) 11. Road surface damage not only threatens traffic safety but also hinders sustainable urban development. Accurate detection, however, remains challenging due to the diverse shapes of damages, the difficulty of capturing slender cracks with high aspect ratios, and the high error rates in small-scale damage recognition. To address these issues, we propose StripRFNet, a novel deep neural network comprising three modules: (1) a Shape Perception Module (SPM) that enhances shape discrimination via large separable kernel attention (LSKA) in multi-scale feature aggregation; (2) a Strip Receptive Field Module (SRFM) that employs large strip convolutions and pooling to capture features of slender cracks; and (3) a Small-Scale Enhancement Module (SSEM) that leverages a high-resolution P2 feature map, a dedicated detection head, and dynamic upsampling to improve small-object detection. Experiments on the RDD2022 benchmark show that StripRFNet surpasses existing methods. On the Chinese subset, it improves F1-score, mAP50, and mAP50:95 by 4.4, 2.9, and 3.4 percentage points over the baseline, respectively. On the full dataset, it achieves the highest F1-score of 80.33% compared with CRDDC'2022 participants and ORDDC'2024 Phase 2 results, while maintaining competitive inference speed. These results demonstrate that StripRFNet achieves state-of-the-art accuracy and real-time efficiency, offering a promising tool for intelligent road maintenance and sustainable infrastructure management.
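
代码示意：针对摘要中用大条带卷积捕获高长宽比细长裂缝的思路(SRFM)，下面给出一个条带卷积模块的极简示意；卷积核大小等均为假设值，并非论文配置。

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """用 1xk 与 kx1 的长条形深度卷积近似细长裂缝的感受野，再残差式叠加回原特征。"""
    def __init__(self, channels, k=11):
        super().__init__()
        self.h_strip = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.v_strip = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x + self.fuse(self.h_strip(x) + self.v_strip(x))

feat = torch.randn(2, 64, 80, 80)
out = StripConvBlock(64)(feat)       # 输出形状保持 (2, 64, 80, 80)
```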


【22】InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects
标题:InfraGPT智能基础设施：用于检测和管理城市缺陷的端到端基于VLM的框架
链接:https://arxiv.org/abs/2510.16017

作者:Ibrahim Sheikh Mohamed, Abdullah Yahya Abdullah Omaisan
摘要:智能城市的基础设施越来越多地受到闭路电视(CCTV)摄像头网络的监控。道路、桥梁和隧道会出现裂缝、坑洞和液体泄漏，威胁公共安全，需要及时修复。人工检测成本高且危险，现有的自动化系统通常只针对单一缺陷类型，或提供无法直接指导维护人员的非结构化输出。本文提出了一个综合管道：利用街道CCTV视频流，使用YOLO系列目标检测器进行多缺陷检测与分割，并将检测结果传递给视觉语言模型(VLM)进行场景感知的总结。VLM生成JSON格式的结构化行动计划，其中包括事件描述、推荐工具、尺寸、修复计划和紧急警报。我们回顾了坑洞、裂缝和泄漏检测方面的文献，重点介绍了QwenVL和LLaVA等大型视觉语言模型的最新进展，并描述了我们早期原型的设计。对公共数据集和采集的CCTV片段的实验评估表明，该系统能准确识别各种缺陷并生成连贯的摘要。最后，我们讨论了将系统扩展到城市级部署所面临的挑战和方向。
摘要:Infrastructure in smart cities is increasingly monitored by networks of closed circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city wide deployments.


【23】CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
标题:CrossRay3D:高效多模式3D检测的几何和分布指导
链接:https://arxiv.org/abs/2510.15991

作者:Huiming Yang
备注:13 pages
摘要:稀疏跨模态检测器比其对应的鸟瞰图(BEV)检测器具有更多优势，特别是在下游任务的适应性和计算成本节省方面。然而，现有的稀疏检测器忽略了令牌表示的质量，使其前景质量欠佳、性能受限。在本文中，我们发现保留几何结构和类别分布是提高稀疏检测器性能的关键，并提出了稀疏选择器(Sparse Selector, SS)。SS的核心模块是射线感知监督(Ray-Aware Supervision, RAS)，它在训练阶段保留了丰富的几何信息，以及类平衡监督(Class-Balanced Supervision)，它自适应地重新加权类语义的显著性，确保在令牌采样期间保留与小物体相关的令牌，从而在令牌表征方面优于其他稀疏多模态检测器。此外，我们设计了射线位置编码(Ray PE)，以解决激光雷达模态和图像之间的分布差异。最后，我们将上述模块集成到端到端稀疏多模态检测器中，称为CrossRay3D。实验表明，在具有挑战性的nuScenes基准测试中，CrossRay3D以72.4 mAP和74.7 NDS实现了最先进的性能，同时运行速度比其他领先方法快1.84倍。此外，即使在LiDAR或相机数据部分或完全缺失的情况下，CrossRay3D也表现出强大的鲁棒性。
摘要:The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the geometric structure preserved and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. Thereby, outperforming other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned module into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84 faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.


【24】Detecting streaks in smart telescopes images with Deep Learning
标题:利用深度学习检测智能望远镜图像中的条纹
链接:https://arxiv.org/abs/2510.17540

作者:Olivier Parisot, Mahmoud Jaziri
备注:19 pages, preprint submitted to the Springer CCIS Special Issue on DATA 2024 (currently under editorial processing)
摘要:卫星在夜空中的可见度所产生的日益严重的负面影响正在影响业余和专业天文学和天体摄影的实践。这些卫星的存在会在天文观测期间捕获的图像中引入条纹,需要应用额外的后处理来减轻不希望的影响,无论是数据丢失还是外观原因。在本文中,我们展示了如何测试和调整各种深度学习方法,以检测2022年3月至2023年2月期间使用智能望远镜捕获的原始天文数据中的条纹。
摘要:The growing negative impact of the visibility of satellites in the night sky is influencing the practice of astronomy and astrophotography, both at the amateur and professional levels. The presence of these satellites has the effect of introducing streaks into the images captured during astronomical observation, requiring the application of additional post processing to mitigate the undesirable impact, whether for data loss or cosmetic reasons. In this paper, we show how we test and adapt various Deep Learning approaches to detect streaks in raw astronomical data captured between March 2022 and February 2023 with smart telescopes.


分类|识别相关(16篇)

【1】Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion
标题:迈向可解释的皮肤癌分类:具有病变分割和临床元数据融合的双网络注意力模型
链接:https://arxiv.org/abs/2510.17773

作者:Md. Enamul Atiq, Shaikh Anowarul Fattah
备注:15 pages, 7 Figures, 3 Tables
摘要:皮肤癌是一种危及生命的疾病，早期发现可显著改善患者的预后。由于较高的类内变异性和细微的类间差异，从皮肤镜图像进行自动诊断具有挑战性。许多深度学习模型作为"黑匣子"运行，限制了临床信任。在这项工作中，我们提出了一个双编码器、基于注意力的框架，利用分割后的病变和临床元数据，以提高皮肤病变分类的准确性和可解释性。首先采用具有双注意力门(DAG)和Atrous空间金字塔池化(ASPP)的新型Deep-UNet架构来分割病变。分类阶段使用两个DenseNet201编码器：一个处理原始图像，另一个处理分割后的病变，其特征通过多头交叉注意力融合。这种双输入设计引导模型专注于显著的病理区域。此外，基于Transformer的模块将患者元数据(年龄、性别、病变部位)并入预测中。我们在HAM10000数据集以及ISIC 2018和2019挑战上评估了我们的方法。与基线模型相比，该方法实现了最先进的分割性能，并显著提高了分类精度和平均AUC。为了验证模型的可靠性，我们使用梯度加权类激活映射(Grad-CAM)来生成热图。这些可视化证实了我们模型的预测是基于病变区域的，不同于依赖虚假背景特征的模型。这些结果表明，将精确的病变分割和临床数据与基于注意力的融合相结合，可以产生更准确和可解释的皮肤癌分类模型。
摘要:Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders-one on the original image and another on the segmented lesion whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.
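
代码示意：下面用PyTorch勾勒“双编码器特征经多头交叉注意力融合，再并入患者元数据”的整体结构；特征维度、元数据编码方式与分类头均为假设，并非论文官方实现。

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """原图特征作query、病变区域特征作key/value做交叉注意力，
    池化后与元数据嵌入拼接，送入分类头。"""
    def __init__(self, feat_dim=256, meta_dim=3, num_classes=7, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
        self.head = nn.Linear(feat_dim + 32, num_classes)

    def forward(self, img_feat, lesion_feat, meta):
        # img_feat / lesion_feat: (B, N, feat_dim)；meta: (B, meta_dim)
        fused, _ = self.cross_attn(img_feat, lesion_feat, lesion_feat)
        pooled = fused.mean(dim=1)
        return self.head(torch.cat([pooled, self.meta_mlp(meta)], dim=-1))

logits = DualEncoderFusion()(torch.randn(2, 49, 256), torch.randn(2, 49, 256), torch.randn(2, 3))
```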


【2】Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition
标题:视觉位置识别的矩阵分解联合多条件表示建模
链接:https://arxiv.org/abs/2510.17739

作者:Timur Ismagilov, Shakaiba Majeed, Michael Milford, Tan Viet Tuyen Nguyen, Sarvapali D. Ramchurn, Shoaib Ehsan
备注:13 pages
摘要:我们研究多参考视觉位置识别(VPR)，即利用在不同条件下采集的参考集来提高定位性能。虽然大规模训练的深度学习提高了鲁棒性，但数据多样性和模型复杂性的增加会在训练和部署过程中带来大量计算成本。通过投票或聚合进行的描述子级融合无需训练，但通常针对多传感器设置，或依赖于在外观和视角变化下增益有限的启发式方法。我们提出了一种免训练、与描述子无关的方法，通过将多个参考描述子经矩阵分解为基表示来对地点进行联合建模，从而实现基于投影的残差匹配。我们还提出了SotonMV，一个用于多视角VPR的结构化基准。在多外观数据上，我们的方法将Recall@1较单参考提高了约18%，并在外观和视角变化下优于多参考基线；在非结构化数据上获得了约5%的增益，在保持轻量的同时表现出强大的泛化能力。
摘要:We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
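
代码示意：下面给出“把同一地点在多条件下的参考描述子做矩阵分解得到基表示、再用查询描述子的投影残差做匹配”这一思路的极简示意；基的秩(rank)与归一化方式为假设，并非论文的具体实现。

```python
import numpy as np

def place_basis(ref_descs, rank=4):
    """对某地点的多条件参考描述子做SVD，取前rank个右奇异向量作为该地点的正交基。"""
    _, _, vt = np.linalg.svd(ref_descs, full_matrices=False)
    return vt[:rank]                            # (rank, D)

def residual_score(query, basis):
    """查询描述子投影到地点基上，残差范数越小越匹配。"""
    q = query / np.linalg.norm(query)
    return np.linalg.norm(q - basis.T @ (basis @ q))

rng = np.random.default_rng(0)
refs = {p: rng.normal(size=(6, 512)) for p in range(100)}   # 100个地点、每个6个条件
bases = {p: place_basis(d) for p, d in refs.items()}
query = rng.normal(size=512)
best_place = min(bases, key=lambda p: residual_score(query, bases[p]))
```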


【3】Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging
标题:基于多通道流式细胞仪成像的循环血细胞群自动分类
链接:https://arxiv.org/abs/2510.17716

作者:Suqiang Ma, Subhadeep Sengupta, Yao Lee, Beikang Gu, Xianyan Chen, Xianqiao Wang, Yang Liu, Mengjia Xu, Galit H. Frydman, He Li
摘要:含有红细胞(RBC)、白细胞(WBC)和血小板的循环血细胞簇(CCC)是与血栓形成、感染和炎症等疾病相关的重要生物标志物。流式细胞术与荧光染色配对,通常用于分析这些细胞簇,揭示细胞形态和蛋白质谱。虽然基于机器学习的计算方法已经推进了单细胞流式细胞术图像的自动分析,但缺乏构建自动分析包含CCC的图像的工具的努力。与单个细胞不同,细胞团通常表现出不规则的形状和大小。此外,这些细胞簇通常由异质细胞类型组成,这需要多通道染色来鉴定簇内的特定细胞类型。这项研究介绍了一种新的计算框架,用于分析CCC图像和识别簇内的细胞类型。我们的框架使用两步分析策略。首先,它通过微调You Only Look Once(YOLOv11)模型将图像分类为细胞簇和非簇组,该模型的性能优于传统的卷积神经网络(CNN)、Vision Transformers(ViT)。然后,它通过将聚类轮廓与来自多通道荧光染色的区域重叠来识别细胞类型,从而提高准确性,尽管存在细胞碎片和染色伪影。该方法在聚类分类和表型鉴定中均达到了95%以上的准确率。总之,我们的自动化框架有效地分析了来自流式细胞术的CCC图像,利用了亮场和荧光数据。最初在血细胞上进行测试,它具有更广泛的应用潜力,例如分析免疫和肿瘤细胞簇,支持各种疾病的细胞研究。
摘要:Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells(WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once(YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs), Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.


【4】Beyond Real Faces: Synthetic Datasets Can Achieve Reliable Recognition Performance without Privacy Compromise
标题:超越真实面孔:合成数据集可以在不损害隐私的情况下实现可靠的识别性能
链接:https://arxiv.org/abs/2510.17372

作者:Paweł Borsukiewicz, Fadi Boutros, Iyiola E. Olatunji, Charles Beumier, Wendkûuni C. Ouedraogo, Jacques Klein, Tegawendé F. Bissyandé
摘要:面部识别系统的部署带来了一个道德困境：实现高准确性需要在未经同意的情况下收集大量真实面部数据集，导致数据集被撤回，并根据GDPR等法规承担潜在的法律责任。虽然合成面部数据是一种很有前景的隐私保护替代方案，但该领域缺乏证明其可行性的全面经验证据。这项研究通过对合成面部识别数据集的广泛评估来填补这一关键空白。我们进行了系统的文献综述，确定了25个合成面部识别数据集(2018-2025)，并结合严格的实验验证。我们的方法考察了隐私保护合成数据的七个关键要求：身份泄露预防、类内可变性、身份可分性、数据集规模、合乎道德的数据来源、偏差缓解和基准可靠性。通过涉及超过1000万个合成样本的实验，并与在五个标准基准上报告的结果进行比较，我们首次对合成数据取代真实数据集的能力进行了全面的实证评估。性能最好的合成数据集(VariFace、VIGFace)分别实现了95.67%和94.91%的识别准确率，超过了包括CASIA-WebFace(94.70%)在内的真实数据集。虽然这些图像仍未公开，但公开可用的替代品Vec2Face(93.52%)和CemiFace(93.22%)紧随其后。我们的研究结果表明，它们在保持身份可分性的同时确保了适当的类内变异。人口统计偏差分析表明，即使合成数据继承了有限的偏差，它也通过生成参数为偏差缓解提供了前所未有的控制。这些结果确立了合成面部数据作为面部识别研究中科学上可行且在伦理上必要的替代方案。
摘要:The deployment of facial recognition systems has created an ethical dilemma: achieving high accuracy requires massive datasets of real faces collected without consent, leading to dataset retractions and potential legal liabilities under regulations like GDPR. While synthetic facial data presents a promising privacy-preserving alternative, the field lacks comprehensive empirical evidence of its viability. This study addresses this critical gap through extensive evaluation of synthetic facial recognition datasets. We present a systematic literature review identifying 25 synthetic facial recognition datasets (2018-2025), combined with rigorous experimental validation. Our methodology examines seven key requirements for privacy-preserving synthetic data: identity leakage prevention, intra-class variability, identity separability, dataset scale, ethical data sourcing, bias mitigation, and benchmark reliability. Through experiments involving over 10 million synthetic samples, extended by a comparison of results reported on five standard benchmarks, we provide the first comprehensive empirical assessment of synthetic data's capability to replace real datasets. Best-performing synthetic datasets (VariFace, VIGFace) achieve recognition accuracies of 95.67% and 94.91% respectively, surpassing established real datasets including CASIA-WebFace (94.70%). While those images remain private, publicly available alternatives Vec2Face (93.52%) and CemiFace (93.22%) come close behind. Our findings reveal that they ensure proper intra-class variability while maintaining identity separability. Demographic bias analysis shows that, even though synthetic data inherits limited biases, it offers unprecedented control for bias mitigation through generation parameters. These results establish synthetic facial data as a scientifically viable and ethically imperative alternative for facial recognition research.


【5】Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition
标题:野生动物开放集识别的最近类平均值和Logits协议
链接:https://arxiv.org/abs/2510.17338

作者:Jiahao Huo, Mufhumudzi Muthivhi, Terence L. van Zyl, Fredrik Gustafsson
摘要:目前最先进的野生动物分类模型是在封闭世界设定下训练的。当接触到未知类别时，它们对自己的预测仍然过于自信。开集识别(OSR)的目标是对已知类别进行分类，同时拒绝未知样本。已有多种OSR方法被提出，通过观察特征、logit或softmax概率空间来建模闭集分布。许多现有方法的一个显著缺点是需要使用OSR特定策略重新训练预训练的分类模型。本研究提出了一种后处理OSR方法，用于度量模型特征与预测logits之间的一致性。我们提出了一个基于输入到其最近类均值(NCM)距离的概率分布，然后将基于NCM的分布与来自logit空间的softmax概率进行比较，以度量NCM与分类头之间的一致性。我们提出的策略在两个评估数据集上均排名前三，在两个数据集上表现一致；相比之下，目前最先进的方法仅在单个数据集上表现出色。我们在非洲动物和瑞典动物数据集上的AUROC分别为93.41和95.35。代码可以在https://github.com/Applied-Representation-Learning-Lab/OSR上找到。
摘要:Current state-of-the-art Wildlife classification models are trained under the closed world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the models' features and predicted logits. We propose a probability distribution based on an input's distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals. The code can be found https://github.com/Applied-Representation-Learning-Lab/OSR.
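
代码示意：下面给出“基于NCM距离的分布与分类头softmax分布求一致性”这一后处理打分的示意实现；温度参数与一致性度量（此处简单用两个分布的内积）均为假设，论文采用的具体度量可能不同。

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ncm_distribution(feat, class_means, tau=1.0):
    """由输入特征到各类均值的距离构造概率分布：距离越近概率越大。"""
    d = np.linalg.norm(class_means - feat, axis=1)
    return softmax(-d / tau)

def agreement_score(feat, logits, class_means):
    """一致性分数低说明特征与分类头分歧大，可作为未知类（开集）信号。"""
    return float(ncm_distribution(feat, class_means) @ softmax(logits))

rng = np.random.default_rng(0)
class_means = rng.normal(size=(10, 128))
print(agreement_score(rng.normal(size=128), rng.normal(size=10), class_means))
```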


【6】SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation
标题:SG-CLDFF:自动白细胞分类和分割的新型框架
链接:https://arxiv.org/abs/2510.17278

作者:Mehdi Zekriyapanah Gashti, Mostafa Mohammadpour, Ghasem Farjamnia
摘要:显微图像中白细胞(WBC)的准确分割和分类对于许多血液疾病的诊断和监测至关重要,但由于染色变异性,复杂的背景和类别不平衡,仍然具有挑战性。在本文中,我们介绍了一种新的显着性引导的跨层深度特征融合框架(SG-CLDFF),该框架将显着性驱动的预处理与多尺度深度特征聚合紧密集成,以提高WBC分析的鲁棒性和可解释性。SG-CLDFF首先计算显著性先验以突出候选WBC区域并指导后续特征提取。一个轻量级的混合主干(EfficientSwin风格)产生多分辨率表示,这些表示由ResNeXt-CC启发的跨层融合模块融合,以保留来自浅层和深层的互补信息。该网络在具有并发分割和细胞类型分类头的多任务设置中进行训练,使用类感知加权损失和显着对齐正则化来减轻不平衡并抑制背景激活。可解释性通过Grad-CAM可视化和显着性一致性检查来执行,允许在区域级别检查模型决策。我们在标准公共基准(BCCD,LISC,ALL-IDB)上验证了该框架,与强大的CNN和Transformer基线相比,在IoU,F1和分类准确性方面报告了一致的增益。消融研究也证明了显着性预处理和跨层融合的个人贡献。SG-CLDFF为临床工作流程中更可靠的自动化WBC分析提供了一条实用且可解释的途径。
摘要:Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.


【7】EndoCIL: A Class-Incremental Learning Framework for Endoscopic Image Classification
标题:EndoCIL:内窥镜图像分类的类增量学习框架
链接:https://arxiv.org/abs/2510.17200

作者:Bingrong Liu, Jun Shi, Yushan Zheng
摘要:内窥镜图像分析的类增量学习(CIL)对于现实世界的临床应用至关重要，诊断模型应不断适应不断变化的临床数据，同时保持在先前所学数据上的性能。然而，由于内窥镜成像中固有的严重域差异和类别不平衡，现有的基于回放的CIL方法未能有效缓解灾难性遗忘。为了应对这些挑战，我们提出了EndoCIL，一种专门为内窥镜图像诊断量身定制的新型统一CIL框架。EndoCIL包含三个关键组件：基于最大均值差异的回放(MDBR)，采用与分布一致的贪婪策略来选择多样化和有代表性的样本；先验正则化类平衡损失(PRCBL)，通过将先验类分布和平衡权重集成到损失函数中来减轻阶段间和阶段内的类别不平衡；以及全连接层梯度校准(CFG)，调整分类器梯度以减轻对新类别的偏向。在四个公共内窥镜数据集上进行的广泛实验表明，在不同的缓冲区大小和评估指标下，EndoCIL总体上优于最先进的CIL方法。所提出的框架有效地平衡了终身内窥镜诊断中的稳定性和可塑性，显示了临床可扩展性和部署的潜力。
摘要:Class-incremental learning (CIL) for endoscopic image analysis is crucial for real-world clinical applications, where diagnostic models should continuously adapt to evolving clinical data while retaining performance on previously learned ones. However, existing replay-based CIL methods fail to effectively mitigate catastrophic forgetting due to severe domain discrepancies and class imbalance inherent in endoscopic imaging. To tackle these challenges, we propose EndoCIL, a novel and unified CIL framework specifically tailored for endoscopic image diagnosis. EndoCIL incorporates three key components: Maximum Mean Discrepancy Based Replay (MDBR), employing a distribution-aligned greedy strategy to select diverse and representative exemplars, Prior Regularized Class Balanced Loss (PRCBL), designed to alleviate both inter-phase and intra-phase class imbalance by integrating prior class distributions and balance weights into the loss function, and Calibration of Fully-Connected Gradients (CFG), which adjusts the classifier gradients to mitigate bias toward new classes. Extensive experiments conducted on four public endoscopic datasets demonstrate that EndoCIL generally outperforms state-of-the-art CIL methods across varying buffer sizes and evaluation metrics. The proposed framework effectively balances stability and plasticity in lifelong endoscopic diagnosis, showing promising potential for clinical scalability and deployment.
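
代码示意：针对摘要中先验正则化类平衡损失(PRCBL)的思想，下面给出一种常见近似的示意写法：按有效样本数加权的交叉熵，并叠加先验类分布的logit偏置。该写法并非论文原始公式，beta 等超参数为假设值。

```python
import torch
import torch.nn.functional as F

def prior_balanced_ce(logits, targets, class_counts, prior=None, beta=0.999):
    """类平衡加权交叉熵示意：少数类权重更大；prior 给出先验类分布时叠加logit调整。"""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    weights = (1.0 - beta) / (1.0 - beta ** counts)      # “有效样本数”反比权重
    weights = weights / weights.sum() * len(counts)
    if prior is not None:
        logits = logits + torch.log(torch.as_tensor(prior, dtype=torch.float32))
    return F.cross_entropy(logits, targets, weight=weights)

loss = prior_balanced_ce(torch.randn(8, 5), torch.randint(0, 5, (8,)),
                         class_counts=[500, 120, 60, 30, 10],
                         prior=[0.69, 0.17, 0.08, 0.04, 0.02])
```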


【8】Person Re-Identification via Generalized Class Prototypes
标题:基于广义类原型的行人重识别
链接:https://arxiv.org/abs/2510.17043

作者:Md Ahmed Al Muzaddid, William J. Beksi
备注:18 pages, 11 figures, and 4 tables
摘要:先进的特征提取方法极大地推动了行人重识别任务的性能提升。此外，针对目标函数的改进也被提出以进一步提高性能。尽管如此，选择更好的类代表是一个尚未被充分探索的研究方向，同样可以带来重识别性能的提升。虽然过去的工作在训练过程中尝试使用图库图像类的质心，但只有少数工作在检索阶段研究了替代表示。在本文中，我们证明了这些先前的技术在重识别指标上产生次优结果。为了解决重识别问题，我们提出了一种广义的选择方法，即选择不限于类质心的表示。我们的方法在准确率和平均精度均值之间取得了平衡，从而实现了超越现有技术的改进。例如，每个类的实际表示数量可以根据具体应用需求进行调整。我们将我们的方法应用于多种重识别嵌入之上，在所有情况下它都显著改善了当前的结果。
摘要:Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results


【9】CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
标题:CARE:从事件触发的传感器流进行ADL识别的对比对齐
链接:https://arxiv.org/abs/2510.16988

作者:Junhao Zhao, Zishuai Liu, Ruili Fang, Jin Lu, Linghan Zhang, Fei Dou
摘要:从事件触发的环境传感器中识别日常生活活动(ADL)是环境辅助生活中的一项重要任务，但现有方法仍然受到表示层面的限制。基于序列的方法保留了传感器激活的时间顺序，但对噪声敏感且缺乏空间感知；而基于图像的方法能捕获全局模式和隐式空间相关性，但会压缩细粒度的时间动态并扭曲传感器布局。朴素融合(例如特征拼接)无法在基于序列和基于图像的表示视图之间强制对齐，未能充分利用它们的互补优势。我们提出了面向事件触发传感器流的ADL识别对比对齐(CARE)，这是一个端到端框架，通过序列-图像对比对齐(SICA)与交叉熵分类联合优化表示学习，确保跨表示对齐和任务相关的可辨别性。CARE集成了(i)时间感知、抗噪声的序列编码与(ii)融入空间信息且对频率敏感的图像表示，并采用(iii)联合对比-分类目标，端到端地学习对齐且有判别力的嵌入。在三个CASAS数据集上的评估表明，CARE实现了最先进的性能(米兰89.8%、开罗88.9%、Kyoto7 73.3%)，并证明了对传感器故障和布局变化的鲁棒性，突出了其在智能家居中进行可靠ADL识别的潜力。
摘要:The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
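
代码示意：下面给出“序列-图像对比对齐(SICA)+交叉熵分类”联合目标的极简示意：同一样本的两种表示视图互为正例做对称InfoNCE，批内其他样本为负例；温度 tau 与权重 lam 为假设超参数，并非论文设定。

```python
import torch
import torch.nn.functional as F

def sica_infonce(seq_emb, img_emb, tau=0.07):
    """对称InfoNCE：对齐同一样本的序列视图与图像视图嵌入。"""
    s = F.normalize(seq_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    logits = s @ v.t() / tau
    labels = torch.arange(len(s))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def care_loss(seq_emb, img_emb, cls_logits, targets, lam=0.5):
    """联合目标：分类交叉熵 + lam * 对比对齐。"""
    return F.cross_entropy(cls_logits, targets) + lam * sica_infonce(seq_emb, img_emb)

loss = care_loss(torch.randn(16, 128), torch.randn(16, 128),
                 torch.randn(16, 10), torch.randint(0, 10, (16,)))
```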


【10】Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis
标题:Class-N-Diff:分类诱导的扩散模型可实现公平的皮肤癌诊断
链接:https://arxiv.org/abs/2510.16887

作者:Nusrat Munia, Abdullah Imran
备注:EMBC 2025
摘要:生成模型,特别是扩散模型,在生成包括医学图像在内的高质量合成数据方面表现出了卓越的能力。然而,传统的类条件生成模型通常难以生成准确表示特定医学类别的图像,从而限制了其在皮肤癌诊断等应用中的实用性。为了解决这个问题,我们提出了一个分类诱导扩散模型,即,Class-N-Diff,同时生成和分类皮肤镜图像。我们的Class-N-Diff模型在扩散模型中集成了一个分类器,以根据其类别条件指导图像生成。因此,该模型可以更好地控制类条件图像合成,从而生成更逼真和更多样化的图像。此外,分类器表现出更好的性能,突出了其对下游诊断任务的有效性。我们的Class-N-Diff中的这种独特集成使其成为增强基于扩散模型的合成皮肤镜图像生成的质量和实用性的强大工具。我们的代码可在https://github.com/Munia03/Class-N-Diff上获得。
摘要:Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at https://github.com/Munia03/Class-N-Diff.


【11】ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
标题:ReefNet:大规模、分类学丰富的数据集和硬珊瑚分类基准
链接:https://arxiv.org/abs/2510.16822

作者:Yahia Battach, Abdulwahab Felemban, Faizan Farooq Khan, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny
摘要:由于气候变化等人为压力，珊瑚礁正在迅速退化，这突出表明迫切需要可扩展的自动化监测。我们介绍了ReefNet，一个大型公共珊瑚礁图像数据集，其点标注被映射到世界海洋物种名录(WoRMS)。ReefNet汇集了来自76个精选CoralNet来源以及红海Al Wajh一个额外站点的图像，总计约925,000条属级硬珊瑚标注，并附有专家验证的标签。以往数据集通常在规模、地理范围或标签粒度上受限，且并非面向机器学习即用，与之不同，ReefNet在全球尺度上提供映射到WoRMS的细粒度分类标签。我们提出了两种评估设置：(i)源内基准，对每个来源的图像进行划分以进行局部化评估；(ii)跨源基准，保留整个来源作为测试集以检验域泛化能力。我们分析了ReefNet上的有监督和zero-shot分类性能，发现虽然源内有监督性能很有前景，但有监督性能在跨域时急剧下降，且zero-shot模型的性能普遍较低，特别是对于稀有和视觉上相似的属。这提供了一个具有挑战性的基准，旨在促进领域泛化和细粒度珊瑚分类的进展。我们将发布数据集、基准代码和预训练模型，以推动稳健、自适应的全球珊瑚礁监测和保护。
摘要:Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.


【12】Watch Where You Move: Region-aware Dynamic Aggregation and Excitation for Gait Recognition
标题:Watch Where You Move:用于步态识别的区域感知动态聚合和激励
链接:https://arxiv.org/abs/2510.16541

作者:Binyuan Huang, Yongdong Luo, Xianda Guo, Xiawu Zheng, Zheng Zhu, Jiahui Pan, Chengju Zhou
摘要:基于深度学习的步态识别在各种应用中取得了巨大成功。准确的步态识别的关键在于考虑不同运动区域中独特而多样的行为模式,特别是当协变量影响视觉外观时。然而,现有的方法通常使用预定义的区域进行时间建模,其中固定或等效的时间尺度分配给不同类型的区域,这使得难以对随时间动态变化并适应其特定模式的运动区域进行建模。为了解决这个问题,我们引入了一个区域感知的动态聚合和激励框架(GaitRDAE),自动搜索运动区域,分配自适应的时间尺度,并应用相应的注意力。具体来说,该框架包括两个核心模块:区域感知动态聚合(RDA)模块,它动态地搜索每个区域的最佳时间感受野,以及区域感知动态激发(RDE)模块,它强调学习包含更稳定行为模式的运动区域,同时抑制对更容易受到协变量影响的静态区域的注意。实验结果表明,GaitRDAE在几个基准数据集上达到了最先进的性能。
摘要:Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.


【13】RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
标题:RefAtomNet++:使用基于语义检索的多轨迹Mamba推进参考原子视频动作识别
链接:https://arxiv.org/abs/2510.16444

作者:Kunyu Peng, Di Wen, Jia Fu, Jiamin Wu, Kailun Yang, Junwei Zheng, Ruiping Liu, Yufan Chen, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
备注:Extended version of ECCV 2024 paper arXiv:2407.01872. The dataset and code are released at this https URL
摘要:引用原子视频动作识别(RAVAR)旨在根据自然语言描述识别特定目标人物的细粒度原子级动作。与传统的动作识别和检测任务不同,RAVAR强调精确的语言引导动作理解,这对于复杂多人场景中的交互式人类动作分析尤为重要。在这项工作中,我们将之前介绍的RefAVA数据集扩展为RefAVA++,其总共包含超过290万帧和超过7.51万个带注释的人物。我们使用来自多个相关领域的基线(包括原子动作定位、视频问答和文本视频检索)以及我们早期的模型RefAtomNet对该数据集进行基准测试。尽管RefAtomNet通过引入代理注意力来突出显著特征从而超越其他基线,但其对齐和检索跨模态信息的能力仍然有限,导致在定位目标人物和预测细粒度动作方面表现欠佳。为克服上述局限,我们提出了RefAtomNet++,这一新框架将多层次语义对齐的交叉注意机制与部分关键词、场景属性和整体句子三个层次上的多轨迹Mamba建模相结合,以推进跨模态令牌聚合。特别地,在部分关键词和场景属性层次上,扫描轨迹通过在每个时间步动态选择最近的视觉空间令牌来构建。此外,我们设计了多层次语义对齐的交叉注意策略,使不同语义层次的空间和时间令牌能够更有效地聚合。实验表明,RefAtomNet++取得了新的最先进结果。数据集和代码在https://github.com/KPeng9510/refAVA2上发布。
摘要:Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.


【14】StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
标题:StretchySnake:灵活的SSM训练解锁跨时空尺度的动作识别
链接:https://arxiv.org/abs/2510.16209

作者:Nyle Siddiqui, Rohit Gupta, Sirnam Swetha, Mubarak Shah
摘要:状态空间模型(SSM)已经成为Transformer在各类任务中有竞争力的替代方案。其线性复杂度和隐状态递归使其特别适合长序列建模,而注意力的开销则随序列长度二次增长。然而,目前的视频理解训练方法是为Transformer量身定制的,未能充分利用SSM的独特属性。例如,视频模型通常以固定的分辨率和视频长度进行训练,以在注意力开销的二次增长与性能之间取得平衡。因此,当在训练期间未见过的空间和时间分辨率的视频上评估时,这些模型的性能会下降;我们称这种性质为时空不灵活性。在动作识别的背景下,这严重限制了模型在短视频和长视频上同时保持性能的能力。因此,我们提出了一种灵活的训练方法,利用并增强SSM固有的适应性。我们的方法在训练过程中以不同的时间和空间分辨率采样视频,并动态插值模型权重以适应任意时空尺度。这为我们的SSM(我们称之为StretchySnake)注入了时空灵活性,使其能够无缝处理从短而细粒度的片段到长而复杂的活动的各类视频。我们介绍并比较了五种不同的灵活训练变体,并确定了对视频SSM最有效的策略。在短动作(UCF-101、HMDB-51)和长动作(COIN、Breakfast)基准上,StretchySnake比Transformer和SSM基线最高高出28%,并对细粒度动作(SSV2、Diving-48)表现出很强的适应性。因此,我们的方法提供了一个简单的即插即用训练配方,使视频SSM在不同的动作识别场景中更加鲁棒、分辨率无关且高效。
摘要:State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention becomes quadratically expensive. However, current training methods for video understanding are tailored towards transformers and fail to fully leverage the unique attributes of SSMs. For example, video models are often trained at a fixed resolution and video length to balance the quadratic scaling of attention cost against performance. Consequently, these models suffer from degraded performance when evaluated on videos with spatial and temporal resolutions unseen during training; a property we call spatio-temporal inflexibility. In the context of action recognition, this severely limits a model's ability to retain performance across both short- and long-form videos. Therefore, we propose a flexible training method that leverages and improves the inherent adaptability of SSMs. Our method samples videos at varying temporal and spatial resolutions during training and dynamically interpolates model weights to accommodate any spatio-temporal scale. This instills our SSM, which we call StretchySnake, with spatio-temporal flexibility and enables it to seamlessly handle videos ranging from short, fine-grained clips to long, complex activities. We introduce and compare five different variants of flexible training, and identify the most effective strategy for video SSMs. On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines alike by up to 28%, with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore, our method provides a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.
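针对"训练时随机采样时空分辨率并动态适配任意尺度"的描述,下面给出一个对时空位置编码做三线性插值的示意。这只是一种可能的实现方式,原始网格尺寸等均为示例假设,并非论文的权重插值原始代码。

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_size, new_size):
    """把形状为 (1, T*H*W, C) 的时空位置编码插值到新的网格尺寸。
    old_size / new_size: (T, H, W)。"""
    t0, h0, w0 = old_size
    t1, h1, w1 = new_size
    _, n, c = pos_embed.shape
    assert n == t0 * h0 * w0
    grid = pos_embed.reshape(1, t0, h0, w0, c).permute(0, 4, 1, 2, 3)   # (1,C,T,H,W)
    grid = F.interpolate(grid, size=(t1, h1, w1),
                         mode="trilinear", align_corners=False)
    return grid.permute(0, 2, 3, 4, 1).reshape(1, t1 * h1 * w1, c)

# 训练时每个 batch 可随机选择一种 (帧数, 空间网格) 组合再插值位置编码
pe = torch.randn(1, 8 * 14 * 14, 192)
pe_new = interpolate_pos_embed(pe, (8, 14, 14), (16, 10, 10))
print(pe_new.shape)  # torch.Size([1, 1600, 192])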


【15】Data-Driven Analysis of Intersectional Bias in Image Classification: A Framework with Bias-Weighted Augmentation
标题:图像分类中交叉偏差的数据驱动分析:具有偏差加权增强的框架
链接:https://arxiv.org/abs/2510.16072

作者:Farjana Yesmin
备注:18 pages
摘要:在不平衡数据集上训练的机器学习模型通常会表现出交叉偏差,即由对象类别和环境条件等多个属性相互作用产生的系统误差。本文提出了一个数据驱动的框架,用于分析和减轻图像分类中的这类偏差。我们引入了交叉公平性评估框架(IFEF),它结合定量公平性指标和可解释性工具,系统地识别模型预测中的偏差模式。在此分析的基础上,我们提出了偏差加权增强(BWA),一种根据子组分布统计自适应调整变换强度的新型数据增强策略。在Open Images V7数据集上针对五个对象类的实验表明,BWA将代表性不足的类别-环境交叉子组的准确率最多提高24个百分点,同时将公平性指标差异降低35%。对多次独立运行的统计分析证实了改进的显著性(p < 0.05)。我们的方法为分析和解决图像分类系统中的交叉偏差提供了一种可复现的途径。
摘要:Machine learning models trained on imbalanced datasets often exhibit intersectional biases-systematic errors arising from the interaction of multiple attributes such as object class and environmental conditions. This paper presents a data-driven framework for analyzing and mitigating such biases in image classification. We introduce the Intersectional Fairness Evaluation Framework (IFEF), which combines quantitative fairness metrics with interpretability tools to systematically identify bias patterns in model predictions. Building on this analysis, we propose Bias-Weighted Augmentation (BWA), a novel data augmentation strategy that adapts transformation intensities based on subgroup distribution statistics. Experiments on the Open Images V7 dataset with five object classes demonstrate that BWA improves accuracy for underrepresented class-environment intersections by up to 24 percentage points while reducing fairness metric disparities by 35%. Statistical analysis across multiple independent runs confirms the significance of improvements (p < 0.05). Our methodology provides a replicable approach for analyzing and addressing intersectional biases in image classification systems.
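下面用几行代码示意"按子组分布统计自适应调整增强强度"的基本思路:越稀少的(类别, 环境)交叉子组得到越强的增强。线性映射规则为示例假设,并非论文BWA的原始公式。

from collections import Counter

def bwa_intensities(samples, min_scale=0.2, max_scale=1.0):
    """samples: [(image_id, cls, env), ...]
    返回 {(cls, env): 增强强度}, 取值在 [min_scale, max_scale] 之间。"""
    counts = Counter((c, e) for _, c, e in samples)
    n_max, n_min = max(counts.values()), min(counts.values())
    intensities = {}
    for key, n in counts.items():
        # 线性映射: 最常见子组 -> min_scale, 最稀有子组 -> max_scale
        ratio = (n_max - n) / max(1, n_max - n_min)
        intensities[key] = min_scale + ratio * (max_scale - min_scale)
    return intensities

# 例: 再把强度传给数据增强器, 控制旋转角度、色彩抖动幅度等变换参数
demo = [("a", "car", "night"), ("b", "car", "day"), ("c", "bus", "night"), ("d", "car", "day")]
print(bwa_intensities(demo))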


【16】Lung Cancer Classification from CT Images Using ResNet
标题:基于ResNet的CT图像肺癌分类
链接:https://arxiv.org/abs/2510.16310

作者:Olajumoke O. Adekunle, Joseph D. Akinyemi, Khadijat T. Ladoja, Olufade F.W. Onifade
备注:9 pages,4 figures, 3 tables
摘要:肺癌是一种起源于肺组织的恶性肿瘤,通常使用医学成像技术,特别是计算机断层扫描(CT)进行诊断和分类。尽管集成了机器学习和深度学习方法,但从CT图像进行肺癌分类的自动化系统的预测功效仍然低于临床采用的期望阈值。现有的研究主要集中在二元分类,区分恶性和良性肺结节。在这项研究中,介绍了一种新的基于深度学习的方法,旨在改进多类分类,从CT图像中识别肺癌的各种亚型。利用预先训练的ResNet模型,肺组织图像被分为三个不同的类别,其中两个表示恶性肿瘤,一个表示良性肿瘤。使用包含来自LC25000组织病理学图像的15,000张肺部CT图像的数据集,ResNet50模型在10,200张图像上进行训练,在2,550张图像上进行验证,并在剩余的2,250张图像上进行测试。通过在ResNet架构上加入自定义层和细致的超参数微调,测试准确率达到了98.8%,较先前模型在同一数据集上的性能有显著提升。
摘要:Lung cancer, a malignancy originating in lung tissues, is commonly diagnosed and classified using medical imaging techniques, particularly computed tomography (CT). Despite the integration of machine learning and deep learning methods, the predictive efficacy of automated systems for lung cancer classification from CT images remains below the desired threshold for clinical adoption. Existing research predominantly focuses on binary classification, distinguishing between malignant and benign lung nodules. In this study, a novel deep learning-based approach is introduced, aimed at an improved multi-class classification, discerning various subtypes of lung cancer from CT images. Leveraging a pre-trained ResNet model, lung tissue images were classified into three distinct classes, two of which denote malignancy and one benign. Employing a dataset comprising 15,000 lung CT images sourced from the LC25000 histopathological images, the ResNet50 model was trained on 10,200 images, validated on 2,550 images, and tested on the remaining 2,250 images. Through the incorporation of custom layers atop the ResNet architecture and meticulous hyperparameter fine-tuning, a remarkable test accuracy of 98.8% was recorded. This represents a notable enhancement over the performance of prior models on the same dataset.
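下面给出在预训练ResNet50上叠加自定义分类头的常见写法,作为文中迁移学习设置的参考示意。具体层宽、dropout等超参数为假设值,并非论文的精确配置。

import torch.nn as nn
from torchvision import models

def build_lung_ct_classifier(num_classes=3, dropout=0.3):
    """预训练 ResNet50 + 自定义分类层(层配置为示例假设)。"""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    in_feats = backbone.fc.in_features  # ResNet50 的特征维度为 2048
    backbone.fc = nn.Sequential(
        nn.Linear(in_feats, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(512, num_classes),    # 三类: 两类恶性 + 一类良性
    )
    return backbone

model = build_lung_ct_classifier()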


分割|语义相关(15篇)

【1】Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model
标题:智能通信混合专家助推医学图像分割基础模型
链接:https://arxiv.org/abs/2510.17684

作者:Xinwei Zhang, Hu Chen, Zhe Yuan, Sukun Tian, Peng Feng
摘要:医学图像分割的基础模型已经取得了显著的效果。对自然图像分割基础模型进行自适应微调对于医学图像分割任务至关重要。然而,现有的微调方法存在一些局限性:1)高级特征的表示不足;2)微调过程破坏了预训练权重的结构完整性。针对这些关键问题,我们提出了一种智能通信专家混合增强的医学图像分割基础模型,命名为IC-MoE,包含两方面的思路:1)我们构造基本专家、语义专家和自适应专家,并实现了像素概率自适应投票策略,通过标签一致性和负载均衡来实现专家选择与融合。这种方法初步增强了高级特征的表示能力,同时保持了预训练权重的结构完整性。2)针对对比学习中监督信号薄弱的问题,我们提出了一种语义引导的对比学习方法。该方法进一步增强了高级特征的表示能力,同时保持了预训练权重的结构完整性。在三个公共医学图像分割数据集上的广泛实验表明,IC-MoE优于其他SOTA模型。因此,所提出的IC-MoE有效地为基础医学图像分割模型补充了高级特征并保留了预训练结构完整性。我们还验证了IC-MoE在不同医学图像分割场景中的优越泛化能力。
摘要:Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.
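下面是"像素概率自适应投票"这一思想的极简示意:每个专家输出分割概率,按像素级置信度加权融合。示例中未包含论文的标签一致性与负载均衡项,属于简化假设,并非IC-MoE的原始实现。

import torch

def pixel_adaptive_voting(expert_logits):
    """expert_logits: (E, B, K, H, W), E 个专家的分割 logits。
    以每个专家在该像素上的最大类别概率作为置信度, softmax 归一化后加权融合。"""
    probs = expert_logits.softmax(dim=2)                 # (E,B,K,H,W)
    conf = probs.max(dim=2).values                       # (E,B,H,W) 每像素置信度
    weights = conf.softmax(dim=0).unsqueeze(2)           # (E,B,1,H,W) 专家维归一化
    fused = (probs * weights).sum(dim=0)                 # (B,K,H,W)
    return fused

# 用随机张量验证形状: 3 个专家, batch=2, 4 个类别, 8x8 特征图
fused = pixel_adaptive_voting(torch.randn(3, 2, 4, 8, 8))
print(fused.shape)  # torch.Size([2, 4, 8, 8])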


【2】4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
标题:4DSegStreamer:通过双线程实现流式4D全景分割
链接:https://arxiv.org/abs/2510.17664

作者:Ling Liu, Jun Tian, Li Yi
摘要:流媒体环境中的4D全景分割对于高度动态的环境至关重要,例如疏散密集人群和复杂场景中的自动驾驶,在这些场景中,在有限的时间预算内实现实时、细粒度的感知至关重要。在本文中,我们介绍了4DSegStreamer,一种新的框架,采用双线程系统,有效地处理流帧。该框架是通用的,可以无缝集成到现有的3D和4D分割方法,以实现实时能力。与现有的流式感知方法相比,它还表现出卓越的鲁棒性,特别是在高FPS条件下。该系统由预测线程和推理线程组成。预测线程利用历史运动和几何信息来提取特征并预测未来动态。推理线程通过与最新内存对齐并补偿自我运动和动态对象运动来确保对传入帧的及时预测。我们在室内HOI4D数据集和室外SemanticKITTI和nuScenes数据集上评估了4DSegStreamer。综合实验表明,我们的方法的有效性,特别是在准确地预测复杂场景中的动态对象。
摘要:4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
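下面用Python线程与队列给出"预测线程 + 推理线程"双线程流水线的结构性示意,其中的特征更新与运动补偿均以占位计算代替真实网络,仅用于说明两个线程如何共享最新记忆,并非论文系统的原始实现。

import queue
import threading
import time

frame_q, result_q = queue.Queue(maxsize=4), queue.Queue()
memory = {"features": None}          # 预测线程维护的最新记忆
lock = threading.Lock()

def predictive_thread(stop):
    """利用历史信息持续更新特征/未来动态预测(此处用计数器占位)。"""
    state = 0
    while not stop.is_set():
        state += 1
        with lock:
            memory["features"] = state
        time.sleep(0.01)

def inference_thread(stop):
    """新帧到达时与最新记忆对齐后立即输出预测, 保证流式实时性。"""
    while not stop.is_set():
        try:
            frame = frame_q.get(timeout=0.1)
        except queue.Empty:
            continue
        with lock:
            feat = memory["features"]
        result_q.put((frame, feat))   # 真实系统在此处做自车运动/动态目标补偿与分割

stop = threading.Event()
threads = [threading.Thread(target=predictive_thread, args=(stop,)),
           threading.Thread(target=inference_thread, args=(stop,))]
for th in threads:
    th.start()
for i in range(5):
    frame_q.put(f"frame_{i}")
time.sleep(0.3)
stop.set()
for th in threads:
    th.join()
print(result_q.qsize(), "帧已处理")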


【3】Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation
标题:集成BMI和基于无人机的摄影测量,实现自动3D结构模型分割
链接:https://arxiv.org/abs/2510.17609

作者:Siqi Chen, Shanyue Guan
摘要:无人机技术的进步使高效、非接触的结构健康监测成为可能。结合摄影测量,无人机可以捕获高分辨率扫描并重建基础设施的详细3D模型。然而,一个关键的挑战仍然是从这些模型中分割特定的结构组件,这是一个传统上依赖于耗时且容易出错的手动标记的过程。为了解决这个问题,我们提出了一个基于机器学习的框架,用于自动分割3D点云。我们的方法使用现实世界的无人机扫描点云和建筑信息建模(BIM)生成的合成数据的互补优势,以克服与手动标记相关的限制。在铁路轨道数据集上的验证表明,在识别和分割主要部件(如铁轨和枕木)方面具有很高的精度。此外,通过使用补充BIM数据的较小规模数据集,该框架显着减少了训练时间,同时保持了合理的分割精度。这种自动化方法提高了3D基础设施模型分割的精度和效率,并促进了结构健康监测和基础设施管理中无人机和BIM技术的集成。
摘要:The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models-a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.


【4】Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset
标题:在水中暴露伪装:水下伪装实例分割和数据集
链接:https://arxiv.org/abs/2510.17585

作者:Chuhong Wang, Hua Li, Chongyi Li, Huazhong Liu, Xiongxin Tang, Sam Kwong
摘要:随着水下探测和海洋保护的发展,水下视觉任务日益广泛。退化的水下环境具有颜色失真、低对比度和模糊等特点,这使伪装实例分割(CIS)在准确分割与周围环境高度融合的目标时面临更大挑战。传统的伪装实例分割方法在以陆地场景为主、水下样本有限的数据集上训练,在水下场景中可能表现不足。为了解决这些问题,我们引入了第一个水下伪装实例分割(UCIS)数据集,简称UCIS4K,其中包括3,953幅带有实例级注释的水下伪装海洋生物图像。此外,我们提出了一种基于Segment Anything模型的水下伪装实例分割网络(UCIS-SAM)。我们的UCIS-SAM包括三个关键模块。首先,通道平衡优化模块(CBOM)增强通道特性以改进水下特征学习,有效缓解模型对水下环境理解不足的问题。其次,提出频域真积分模块(FDTIM)以强调目标的内在特征、减少伪装图案的干扰,从而提升与周围环境融合的伪装目标的分割性能。最后,设计多尺度特征频率聚合模块(MFFAM),在多个频带上强化低对比度伪装实例的边界,提高模型实现更精确分割的能力。在所提出的UCIS4K和公共基准上的大量实验表明,我们的UCIS-SAM优于最先进的方法。
摘要:With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.


【5】MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation
标题:MambaX-Net:用于纵向MRI分割的双输入Mamba增强交叉注意网络
链接:https://arxiv.org/abs/2510.17529

作者:Yovin Yahathugoda, Davide Prezzi, Piyalitt Ittichaiwong, Vicky Goh, Sebastien Ourselin, Michela Antonelli
摘要:主动监测(AS)是一种用于管理低风险和中等风险前列腺癌(PCa)的治疗选择,旨在避免过度治疗,同时通过连续MRI和临床随访监测疾病进展。准确的前列腺分割是自动化这一过程的重要初步步骤,可以实现PCa的自动检测和诊断。然而,现有的深度学习分割模型通常是在单个时间点和专业注释的数据集上训练的,这使得它们不适合纵向AS分析,其中多个时间点和缺乏专家标签阻碍了它们的有效微调。为了解决这些挑战,我们提出了MambaX-Net,这是一种新型的半监督,双扫描3D分割架构,通过利用MRI和上一个时间点的相应分割掩模来计算时间点t的分割。我们引入了两个新的组件:(i)Mamba增强的交叉注意模块,它将Mamba块集成到交叉注意中,以有效地捕获时间演化和长距离空间依赖性,以及(ii)形状提取器模块,它将先前的分割掩模编码为潜在的解剖表示,用于细化区域划分。此外,我们引入了一种半监督自训练策略,该策略利用从预训练的nnU-Net生成的伪标签,实现了无需专家注释的有效学习。在纵向AS数据集上对MambaX-Net进行了评估,结果表明,它的性能明显优于最先进的U-Net和基于Transformer的模型,即使在有限和嘈杂的数据上进行训练,也能实现出色的前列腺区域分割。
摘要:Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delination. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.


【6】Taming Modality Entanglement in Continual Audio-Visual Segmentation
标题:连续音视频分割中的模态纠缠抑制
链接:https://arxiv.org/abs/2510.17234

作者:Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
摘要:近年来,多模态持续学习取得了重大进展,其目标是在多模态环境中顺序学习新任务,同时保持先前所学任务的性能。然而,现有方法主要集中在粗粒度任务上,在处理细粒度持续学习设置中的模态纠缠方面存在局限。为弥补这一差距,我们引入了一种新颖的连续视听分割(CAVS)任务,旨在以音频为引导连续分割新类别。通过综合分析,我们确定了两个关键挑战:1)多模态语义漂移,即发声对象在后续任务中被标注为背景;2)共现混淆,即频繁共现的类别容易被混淆。在这项工作中,我们设计了基于冲突的多模态排练(CMR)框架来应对这些挑战。具体而言,针对多模态语义漂移问题,提出了多模态样本选择(MSS)策略,选择模态一致性高的样本进行排练;针对共现混淆问题,设计了基于冲突的样本排练(CSR)机制,允许在训练过程中提高易混淆类别的排练样本频率。此外,我们构建了三个视听增量场景来验证方法的有效性。综合实验表明,我们的方法显著优于单模态持续学习方法。
摘要:Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding objects is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequent co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during training process. Moreover, we construct three audio-visual incremental scenarios to verify effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.


【7】Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
标题:点击、预测、信任:知识到行动框架内用于基于CT的肺癌预后的临床医生在环人工智能分割
链接:https://arxiv.org/abs/2510.17039

作者:Mohammad R. Salmanpour, Sonya Falahati, Amir Hossein Pouria, Amin Mousavi, Somayeh Sadat Mehrnia, Morteza Alizadeh, Arman Gorji, Zeinab Farsangi, Alireza Safarian, Mehdi Maghsudi, Carlos Uribe, Arman Rahmim, Ren Yuan
备注:13 pages, 2 figures, and 2 tables
摘要:肺癌仍然是癌症死亡的主要原因,CT成像在筛查、预后和治疗中居于核心地位。手动分割结果多变且耗时,而深度学习(DL)可实现自动化,但在临床落地上仍面临障碍。在知识到行动框架的指导下,本研究开发了一个临床医生在环的DL流程,以提高可重复性、预后准确性和临床信任度。使用5个DL模型(3D Attention U-Net、ResUNet、VNet、ReconNet、SAM-Med3D)分析了来自12个公共数据集的999例患者的多中心CT数据,并以完整图像和点击点裁剪图像上的专家轮廓为基准。利用497个由PySERA提取的放射组学特征,通过Spearman相关性、ICC、Wilcoxon检验和MANOVA评估分割的可重复性;预后建模则在38种降维策略和24种分类器上比较了监督学习(SL)和半监督学习(SSL)。六位医生从临床意义、边界质量、预后价值、信任度和工作流程整合等七个维度对分割掩模进行了定性评估。VNet在SSL下取得了最佳性能(Dice = 0.83,IoU = 0.71)、放射组学稳定性(平均相关性= 0.76,ICC = 0.65)和预测准确性(准确率= 0.88,F1 = 0.83)。SSL在各模型中的表现始终优于SL。放射科医生更认可VNet对瘤周区域的刻画和更平滑的边界,并倾向于将AI生成的初始掩模用于细化而非直接替换人工标注。这些结果表明,将VNet与SSL相结合可产生准确、可重复且临床可信的基于CT的肺癌预后,为以医生为中心的AI转化提供了可行路径。
摘要:Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.


【8】BARL: Bilateral Alignment in Representation and Label Spaces for Semi-Supervised Volumetric Medical Image Segmentation
标题:BARL:半监督体积医学图像分割的表示和标签空间中的双边对齐
链接:https://arxiv.org/abs/2510.16863

作者:Shujian Gao, Yuan Wang, Zekuan Yu
备注:14 pages, 5 figures
摘要:半监督医学图像分割(SSMIS)寻求匹配完全监督的性能,同时大幅降低标注成本。主流的SSMIS方法依赖于标签空间一致性,但它们忽略了同样重要的表示空间对齐。如果不协调潜在特征,模型很难学习既有区别又有空间连贯性的表示。为此,我们引入了\textbf{Bilateral align in Representation and Label spaces(BARL)},这是一个统一的框架,它耦合了两个协作分支,并在两个空间中强制对齐。对于标签空间对齐,受联合训练和多尺度解码的启发,我们设计了\textbf{双路径正则化(DPR)}和\textbf{渐进认知偏差校正(PCBC)}来施加细粒度的跨分支一致性,同时减轻从粗到细尺度的错误积累。对于表示空间对齐,我们在分支之间进行区域级和病变实例匹配,明确捕获医学图像中常见的碎片化,复杂的病理模式。在四个公共基准和专有CBCT数据集上进行的大量实验表明,BARL始终优于最先进的SSMIS方法。消融研究进一步验证了每个组件的贡献。代码将很快发布。
摘要:Semi-supervised medical image segmentation (SSMIS) seeks to match fully supervised performance while sharply reducing annotation cost. Mainstream SSMIS methods rely on \emph{label-space consistency}, yet they overlook the equally critical \emph{representation-space alignment}. Without harmonizing latent features, models struggle to learn representations that are both discriminative and spatially coherent. To this end, we introduce \textbf{Bilateral Alignment in Representation and Label spaces (BARL)}, a unified framework that couples two collaborative branches and enforces alignment in both spaces. For label-space alignment, inspired by co-training and multi-scale decoding, we devise \textbf{Dual-Path Regularization (DPR)} and \textbf{Progressively Cognitive Bias Correction (PCBC)} to impose fine-grained cross-branch consistency while mitigating error accumulation from coarse to fine scales. For representation-space alignment, we conduct region-level and lesion-instance matching between branches, explicitly capturing the fragmented, complex pathological patterns common in medical imagery. Extensive experiments on four public benchmarks and a proprietary CBCT dataset demonstrate that BARL consistently surpasses state-of-the-art SSMIS methods. Ablative studies further validate the contribution of each component. Code will be released soon.


【9】Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry
标题:通过场景几何实现自动驾驶的无监督单目道路分割
链接:https://arxiv.org/abs/2510.16790

作者:Sara Hatami Rostami, Behrooz Nasihatkon
备注:7 pages, 3 figures
摘要:本文提出了一种完全无监督的二值道路分割方法(道路与非道路),消除了对昂贵的人工标注数据集的依赖。该方法利用场景几何和时间线索来区分道路与非道路区域。首先从几何先验生成弱标签,将地平线以上的像素标记为非道路,将车辆前方的预定义四边形标记为道路。在细化阶段,通过跨帧跟踪局部特征点,并使用互信息最大化惩罚不一致的标签分配来施加时间一致性约束,从而同时提高精度和时间稳定性。在Cityscapes数据集上,该模型取得了0.82的交并比(IoU),以简洁的设计实现了较高的准确率。这些结果表明,在自动驾驶中结合几何约束与时间一致性进行可扩展的无监督道路分割具有很大潜力。
摘要:This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.
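下面用几行代码示意文中几何先验弱标签的生成方式:地平线以上记为非道路、车前四边形记为道路、其余记为忽略。地平线位置与四边形坐标均为示例假设,并非论文使用的具体数值。

import numpy as np
import cv2

def geometric_weak_labels(h, w, horizon_frac=0.45):
    """生成 (H,W) 弱标签图: 0=非道路, 1=道路, 255=忽略。"""
    label = np.full((h, w), 255, dtype=np.uint8)
    horizon = int(h * horizon_frac)
    label[:horizon] = 0                          # 地平线以上: 非道路
    top = horizon + int(0.2 * (h - horizon))
    quad = np.array([[int(0.35 * w), top],       # 车前梯形(近似四边形), 坐标为 (x, y)
                     [int(0.65 * w), top],
                     [int(0.90 * w), h - 1],
                     [int(0.10 * w), h - 1]], dtype=np.int32)
    cv2.fillPoly(label, [quad], 1)               # 车前四边形: 道路
    return label

weak = geometric_weak_labels(512, 1024)
print((weak == 1).sum(), "个像素被标为道路")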


【10】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
标题:Xiaoice:通过自监督的语义特征时空聚类实现免训练的视频理解
链接:https://arxiv.org/abs/2510.16781

作者:Shihao Ji, Zihui Song
摘要:大规模视觉语言模型(VLM)在静态图像上展现的显著zero-shot推理能力尚未完全迁移到视频领域。传统的视频理解模型通常依赖在带注释数据集上进行大量任务特定训练,这一过程既昂贵又难以扩展。本文介绍了一种新颖的免训练视频理解框架,通过将预训练VLM丰富的语义先验与用于模式发现的经典机器学习算法协同结合,从而规避端到端训练。我们的核心思想是将视频理解重新表述为高维语义特征空间内的自监督时空聚类问题。所提出的流程首先使用预训练VLM的冻结视觉编码器,将视频流转换为语义特征轨迹。随后,我们采用核时间分割(KTS)这一稳健的机器学习技术,将连续的特征流划分为离散的、语义连贯的事件片段。然后对这些片段进行无监督的基于密度的聚类,以识别整个视频中反复出现的宏观场景和主题。通过从每个发现的聚类中选取代表性关键帧,并利用VLM的生成能力产生文字描述,我们的框架可自动生成视频内容的结构化多模态摘要。该方法为视频内容的zero-shot自动结构分析提供了一条有效、可解释且与模型无关的途径。
摘要:The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
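下面的示意代码展示"帧级语义特征轨迹 -> 无监督密度聚类 -> 选取代表关键帧"这条流程的骨架。为简化,这里省略了论文使用的KTS时序分段,直接在归一化特征上用DBSCAN聚类,属于简化假设;帧特征假定已由冻结的VLM视觉编码器提取。

import numpy as np
from sklearn.cluster import DBSCAN

def summarize_trajectory(feats, eps=0.35, min_samples=3):
    """feats: (N, D) 帧级语义特征; 返回每帧的簇标签和每簇的代表帧索引。"""
    x = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(x)
    keyframes = {}
    for c in sorted(set(labels)):
        if c == -1:                              # 噪声帧不作为场景
            continue
        idx = np.where(labels == c)[0]
        center = x[idx].mean(axis=0)
        # 距离簇中心最近的帧作为代表帧, 可再交给 VLM 生成文字描述
        keyframes[c] = int(idx[np.argmin(np.linalg.norm(x[idx] - center, axis=1))])
    return labels, keyframes

labels, keyframes = summarize_trajectory(np.random.rand(120, 64).astype(np.float32))
print(keyframes)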


【11】Region in Context: Text-condition Image editing with Human-like semantic reasoning
标题:上下文中的区域:具有类人语义推理的文本条件图像编辑
链接:https://arxiv.org/abs/2510.16772

作者:Thuy Phuong Vu, Dinh-Cuong Hoang, Minhhuy Le, Phan Xuan Tan
摘要:最近的研究在基于文本的图像区域定位和编辑方面取得了重大进展。然而,大多数方法孤立地处理这些区域,仅依赖局部线索,而不考虑每个部分如何对整体视觉和语义构图做出贡献。这通常会导致不一致的编辑、不自然的过渡或图像整体连贯性的损失。在这项工作中,我们提出了Region in Context(上下文中的区域),一种文本条件图像编辑的新框架,它在视觉与语言之间执行多层次语义对齐,其灵感来自人类能够结合整个场景来推理编辑的能力。我们的方法鼓励每个区域理解其在全局图像语境中的角色,从而实现精确而协调的修改。该框架的核心是双层引导机制:各区域以完整图像上下文进行表示,并与详细的区域级描述对齐;同时整幅图像与由大型视觉语言模型生成的综合场景级描述相匹配。这些描述作为预期内容的显式文字参照,同时指导局部修改和全局结构。实验表明,该方法能产生更连贯且更符合指令的结果。代码可在:https://github.com/thuyvuphuong/Region-in-Context.git
摘要:Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git


【12】Self-Supervised Learning to Fly using Efficient Semantic Segmentation and Metric Depth Estimation for Low-Cost Autonomous UAVs
标题:使用高效的语义分割和度量深度估计进行低成本自主无人机的自我监督学习飞行
链接:https://arxiv.org/abs/2510.16624

作者:Sebastian Mocanu, Emil Slusanschi, Marius Leordeanu
摘要:本文提出了一种用于小型无人机在受控室内环境中的纯视觉自主飞行系统。该系统将语义分割与单目深度估计相结合,以实现避障、场景探索和自主安全着陆操作,而无需GPS或昂贵的传感器(如LiDAR)。一个关键的创新是自适应比例因子算法,该算法通过利用语义地平面检测和相机内部参数将非度量单目深度预测转换为精确的度量距离测量,实现了14.4 cm的平均距离误差。该方法使用知识蒸馏框架,其中基于颜色的支持向量机(SVM)教师为能够实时语义分割的轻量级U-Net学生网络(1.6M参数)生成训练数据。对于更复杂的环境,SVM教师可以用最先进的分割模型代替。测试是在一个受控的5×4米的实验室环境中进行的,有八个纸板障碍物模拟城市结构。在真实世界环境中进行的30次飞行测试和在数字孪生环境中进行的100次飞行测试的广泛验证表明,组合分割和深度方法增加了监视期间的行程距离,缩短了任务时间,同时保持了100%的成功率。该系统通过端到端学习进一步优化,其中一个紧凑的学生神经网络从我们最佳性能方法生成的演示数据中学习完整的飞行策略,实现了87.5%的自主任务成功率。这项工作推进了结构化环境中基于视觉的实用无人机导航,展示了度量深度估计和计算效率挑战的解决方案,这些解决方案可以在资源受限的平台上进行部署。
摘要:This paper presents a vision-only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color-based Support Vector Machine (SVM) teacher generates training data for a lightweight U-Net student network (1.6M parameters) capable of real-time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state-of-the-art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real-world environment and 100 flight tests in a digital-twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates. The system is further optimized through end-to-end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best-performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision-based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource-constrained platforms.
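下面给出"利用语义地面与相机内参把非米制单目深度换算为米制深度"的一种可能实现示意。这里假设地面水平、相机俯仰近似为零,fy、cy、相机高度等参数需按实际标定提供;它只是对自适应比例因子思想的简化演示,不一定等同于论文算法本身。

import numpy as np

def metric_scale_from_ground(rel_depth, ground_mask, fy, cy, cam_height_m):
    """rel_depth: (H,W) 单目网络输出的非米制深度; ground_mask: (H,W) bool 地面掩码。
    返回换算后的近似米制深度图。"""
    vs, us = np.where(ground_mask)            # vs: 行号(v), us: 列号(u)
    valid = vs > cy + 1                       # 只用位于主点以下的地面像素
    vs, us = vs[valid], us[valid]
    assert len(vs) > 0, "需要足够的地面像素来估计比例因子"
    # 针孔几何(平地、零俯仰假设): 地面像素的前向距离 Z = 相机高度 * fy / (v - cy)
    z_metric = cam_height_m * fy / (vs - cy)
    scale = np.median(z_metric / (rel_depth[vs, us] + 1e-6))
    return scale * rel_depth

rel = np.random.rand(240, 320).astype(np.float32) + 0.1
mask = np.zeros((240, 320), dtype=bool)
mask[180:, :] = True                          # 假设画面下部为地面
metric = metric_scale_from_ground(rel, mask, fy=250.0, cy=120.0, cam_height_m=0.3)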


【13】Instance-Aware Pseudo-Labeling and Class-Focused Contrastive Learning for Weakly Supervised Domain Adaptive Segmentation of Electron Microscopy
标题:实例感知伪标记和类聚焦对比学习用于电子显微镜弱监督域自适应分割
链接:https://arxiv.org/abs/2510.16450

作者:Shan Xiong, Jiabao Chen, Ye Wang, Jialin Peng
摘要:Annotation-efficient segmentation of the numerous mitochondria instances from various electron microscopy (EM) images is highly valuable for biological and neuroscience research. Although unsupervised domain adaptation (UDA) methods can help mitigate domain shifts and reduce the high costs of annotating each domain, they typically have relatively low performance in practical applications. Thus, we investigate weakly supervised domain adaptation (WDA) that utilizes additional sparse point labels on the target domain, which require minimal annotation effort and minimal expert knowledge. To take full use of the incomplete and imprecise point annotations, we introduce a multitask learning framework that jointly conducts segmentation and center detection with a novel cross-teaching mechanism and class-focused cross-domain contrastive learning. While leveraging unlabeled image regions is essential, we introduce segmentation self-training with a novel instance-aware pseudo-label (IPL) selection strategy. Unlike existing methods that typically rely on pixel-wise pseudo-label filtering, the IPL semantically selects reliable and diverse pseudo-labels with the help of the detection task. Comprehensive validations and comparisons on challenging datasets demonstrate that our method outperforms existing UDA and WDA methods, significantly narrowing the performance gap with the supervised upper bound. Furthermore, under the UDA setting, our method also achieves substantial improvements over other UDA techniques.


【14】REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
标题:REALM:基于高斯溅射的开放世界3D推理分割与编辑的MLLM-Agent框架
链接:https://arxiv.org/abs/2510.16410

作者:Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu
摘要:Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.


【15】DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization
标题:DuetMatch:通过解耦分支优化协调半监督脑部MRI分割
链接:https://arxiv.org/abs/2510.16146

作者:Thanh-Huy Nguyen, Hoang-Thien Nguyen, Vi Vu, Ba-Thinh Lam, Phat Huynh, Tianyang Wang, Xingjian Li, Ulas Bagci, Min Xu
备注:The paper is under review at CMIG
摘要:The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.
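下面用PyTorch写一个"每个分支只优化编码器或解码器、另一部分冻结"的异步优化最小示意。网络结构与伪标签来源均为占位假设,并非DuetMatch的原始实现,仅用于说明冻结与梯度流的搭配方式。

import torch
import torch.nn as nn

def make_branch():
    encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
    decoder = nn.Sequential(nn.Conv2d(16, 2, 1))
    return encoder, decoder

enc_a, dec_a = make_branch()   # 分支A: 只优化编码器, 解码器冻结
enc_b, dec_b = make_branch()   # 分支B: 只优化解码器, 编码器冻结

for p in dec_a.parameters():
    p.requires_grad_(False)
for p in enc_b.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(list(enc_a.parameters()) + list(dec_b.parameters()), lr=1e-4)

x = torch.randn(2, 1, 64, 64)
pseudo = torch.randint(0, 2, (2, 64, 64))          # 假设来自冻结教师模型的伪标签
loss = nn.functional.cross_entropy(dec_a(enc_a(x)), pseudo) + \
       nn.functional.cross_entropy(dec_b(enc_b(x)), pseudo)
opt.zero_grad()
loss.backward()                                    # 梯度穿过冻结模块流向可训练部分
opt.step()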


Zero/Few Shot|迁移|域适配|自适应(7篇)

【1】Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
标题:Zero-Shot视频摘要的上下文感知伪标签评分
链接:https://arxiv.org/abs/2510.17501

作者:Yuanli Wu, Long Zhang, Yue Du, Bin Li
摘要:With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of \textbf{57.58} and \textbf{63.05}, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.


【2】SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
标题:SparseWorld:由稀疏动态查询驱动的灵活、自适应且高效的4D占用世界模型
链接:https://arxiv.org/abs/2510.17482

作者:Chenxu Dang, Haiyan Liu, Guangjun Bao, Pei An, Xinyue Tang, Jie Ma, Bingchuan Sun, Yan Wang
备注:Under Review
摘要:Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their ``in-place classification" over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios.In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, We specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at https://github.com/MSunDYY/SparseWorld.


【3】Facial Expression-based Parkinson's Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing
标题:通过特征融合和自适应类平衡进行基于面部表情的帕金森病严重程度诊断
链接:https://arxiv.org/abs/2510.17373

作者:Yintao Zhou, Wei Huang, Zhengyu Li, Jing Huang, Meng Pang
备注:3 pages, 2 figures, accepted by MIND 2025
摘要:Parkinson's disease (PD) severity diagnosis is crucial for early detecting potential patients and adopting tailored interventions. Diagnosing PD based on facial expression is grounded in PD patients' "masked face" symptom and gains growing interest recently for its convenience and affordability. However, current facial expression-based approaches often rely on single type of expression which can lead to misdiagnosis, and ignore the class imbalance across different PD stages which degrades the prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
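下面给出"根据类别分布与分类难度动态调整样本权重"这一思想的示意实现,采用常见的逆频率乘以难度因子的写法,并非论文原始的自适应类平衡策略。

import torch

def adaptive_sample_weights(logits, targets, class_counts, gamma=2.0):
    """权重 = 类别逆频率 * 难度因子 (1 - p_正确类)^gamma, 再做均值归一化。"""
    probs = logits.softmax(dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)           # 正确类概率
    inv_freq = class_counts.sum() / (len(class_counts) * class_counts.float())
    weights = inv_freq[targets] * (1.0 - p_true).pow(gamma)
    return weights / (weights.mean() + 1e-8)

logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
counts = torch.tensor([500, 120, 30])                                   # 各严重程度等级的样本数
w = adaptive_sample_weights(logits, targets, counts)
loss = (torch.nn.functional.cross_entropy(logits, targets, reduction="none") * w).mean()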


【4】HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery
标题:HIDISC:面向广义类别发现的领域泛化双曲框架
链接:https://arxiv.org/abs/2510.17188

作者:Vaibhav Rathore, Divyam Gupta, Biplab Banerjee
备注:Accpeted at NeurIPS (2025) Main Conference
摘要:Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories** -- available during training -- or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training and arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing targetdomain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss -- combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion -- **facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS , Office-Home , and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.


【5】Domain Generalizable Continual Learning
标题:领域可泛化的持续学习
链接:https://arxiv.org/abs/2510.16914

作者:Hongwei Yan, Guanglong Sun, Zhiqi Kang, Yi Zhong, Liyuan Wang
备注:25 pages
摘要:To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.


【6】Robust Cross-Domain Adaptation in Texture Features Transferring for Wood Chip Moisture Content Prediction
标题:用于木片含水率预测的纹理特征迁移中的鲁棒跨域自适应
链接:https://arxiv.org/abs/2510.16832

作者:Abdur Rahman, Mohammad Marufuzzaman, Jason Street, Haifeng Wang, Veera G. Gude, Randy Buchanan
摘要:Accurate and quick prediction of wood chip moisture content is critical for optimizing biofuel production and ensuring energy efficiency. The current widely used direct method (oven drying) is limited by its longer processing time and sample destructiveness. On the other hand, existing indirect methods, including near-infrared spectroscopy-based, electrical capacitance-based, and image-based approaches, are quick but not accurate when wood chips come from various sources. Variability in the source material can alter data distributions, undermining the performance of data-driven models. Therefore, there is a need for a robust approach that effectively mitigates the impact of source variability. Previous studies show that manually extracted texture features have the potential to predict wood chip moisture class. Building on this, in this study, we conduct a comprehensive analysis of five distinct texture feature types extracted from wood chip images to predict moisture content. Our findings reveal that a combined feature set incorporating all five texture features achieves an accuracy of 95% and consistently outperforms individual texture features in predicting moisture content. To ensure robust moisture prediction, we propose a domain adaptation method named AdaptMoist that utilizes the texture features to transfer knowledge from one source of wood chip data to another, addressing variability across different domains. We also proposed a criterion for model saving based on adjusted mutual information. The AdaptMoist method improves prediction accuracy across domains by 23%, achieving an average accuracy of 80%, compared to 57% for non-adapted models. These results highlight the effectiveness of AdaptMoist as a robust solution for wood chip moisture content estimation across domains, making it a potential solution for wood chip-reliant industries.
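下面示意一种基于调整互信息(AMI)的模型保存准则:以相邻训练轮次目标域预测之间的AMI作为稳定性信号,预测趋于稳定时保存检查点。这是对论文准则的一种猜测性简化,具体定义请以原文为准。

from sklearn.metrics import adjusted_mutual_info_score

class AMISaver:
    """无标签目标域上的检查点保存准则示意(简化假设, 非 AdaptMoist 原始准则)。"""
    def __init__(self):
        self.prev_preds, self.best_ami = None, -1.0

    def step(self, target_preds):
        save = False
        if self.prev_preds is not None:
            ami = adjusted_mutual_info_score(self.prev_preds, target_preds)
            if ami > self.best_ami:
                self.best_ami, save = ami, True
        self.prev_preds = list(target_preds)
        return save

saver = AMISaver()
print(saver.step([0, 1, 1, 2]), saver.step([0, 1, 1, 2]))  # 第二轮预测一致时返回 True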


【7】Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization
标题:连接领域与对比样本:领域泛化的阶梯
链接:https://arxiv.org/abs/2510.16704

作者:Tianxin Wei, Yifan Chen, Xinrui He, Wenxuan Bao, Jingrui He
备注:Accepted by KDD 2025
摘要:Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge thereby motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. It is intuitive to conceive the class-separated representations learned in contrastive learning (CL) are able to improve DG, while the reality is quite the opposite: users observe directly applying CL deteriorates the performance. We analyze the phenomenon with the insights from CL theory and discover lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through https://github.com/weitianxin/DCCL


半弱无监督|主动学习|不确定性(8篇)

【1】Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives
标题:使用高分辨率DEM衍生数据绘制历史景观中考古石墙的自监督预训练
链接:https://arxiv.org/abs/2510.17644

作者:Zexian Huang, Mashnoon Islam, Brian Armstrong, Kourosh Khoshelham, Martin Tomko
摘要:Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia's densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.


【2】Closed-Loop Transfer for Weakly-supervised Affordance Grounding
标题:弱监督可供性定位(Affordance Grounding)的闭环迁移
链接:https://arxiv.org/abs/2510.17384

作者:Jiajin Tang, Zhengxuan Wei, Ge Zheng, Sibei Yang
备注:Accepted at ICCV 2025
摘要:Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.


【3】Exploring Structural Degradation in Dense Representations for Self-supervised Learning
标题:探索自监督学习中密集表示的结构退化
链接:https://arxiv.org/abs/2510.17299

作者:Siran Dai, Qianqian Xu, Peisong Wen, Yang Liu, Qingming Huang
备注:Accepted by NeurIPS 2025
摘要:In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at https://github.com/EldercatSAM/SSL-Degradation.
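摘要提到 DSE 包含一个“有效维度”度量。下面给出常见的有效秩(effective rank)计算示意,即特征协方差特征值谱熵的指数;论文的具体定义可能不同,此处仅作一般性参考:

```python
import torch

def effective_dimensionality(feats: torch.Tensor) -> torch.Tensor:
    """密集特征的有效秩示意。

    feats: (N, D) 的 patch/像素嵌入。
    返回归一化特征值谱的熵的指数,取值范围 [1, D],
    方差分布得越均匀,有效维度越大。"""
    feats = feats - feats.mean(dim=0, keepdim=True)        # 去中心化
    cov = feats.T @ feats / (feats.shape[0] - 1)           # (D, D) 协方差
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=1e-12)  # 非负特征值谱
    p = eigvals / eigvals.sum()                            # 归一化为分布
    entropy = -(p * p.log()).sum()
    return entropy.exp()
```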


【4】Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity
标题:景观中的针:标签稀缺条件下考古遗址发现的半监督伪标签方法
链接:https://arxiv.org/abs/2510.16814

作者:Simon Jaxy, Anton Theys, Patrick Willett, W. Chris Carleton, Ralf Vandam, Pieter Libin
摘要:Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental, cultural, and geospatial variables. We address this challenge using a deep learning approach but must contend with structural label scarcity inherent to archaeology: positives are rare, and most locations are unlabeled. To address this, we adopt a semi-supervised, positive-unlabeled (PU) learning strategy, implemented as a semantic segmentation model and evaluated on two datasets covering a representative range of archaeological periods. Our approach employs dynamic pseudolabeling, refined with a Conditional Random Field (CRF) implemented via an RNN, increasing label confidence under severe class imbalance. On a geospatial dataset derived from a digital elevation model (DEM), our model performs on par with the state-of-the-art, LAMAP, while achieving higher Dice scores. On raw satellite imagery, assessed end-to-end with stratified k-fold cross-validation, it maintains performance and yields predictive surfaces with improved interpretability. Overall, our results indicate that semi-supervised learning offers a promising approach to identifying undiscovered sites across large, sparsely annotated landscapes.


【5】SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation
标题:SDPA++:利用补丁聚合进行自我监督降噪的通用框架
链接:https://arxiv.org/abs/2510.16702

作者:Huy Minh Nhat Nguyen, Triet Hoang Minh Dao, Chau Vinh Hoang Truong, Cuong Tuan Nguyen
备注:2025 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
摘要:Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method's potential for improving image quality and diagnostic outcomes in clinical practice.


【6】SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
标题:SSL4RL:重新审视自我监督学习作为视觉语言推理的内在奖励
链接:https://arxiv.org/abs/2510.16416

作者:Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
摘要:Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives-such as predicting image rotation or reconstructing masked patches-into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors-such as task difficulty, model scale, and semantic alignment with the target domain-that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
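下面用一个极简示意说明“将自监督任务(如预测图像旋转)转化为可验证奖励”的思路;其中 model.predict_rotation 为假设的接口,并非论文官方实现:

```python
import random
import torch

def rotation_reward(model, image: torch.Tensor) -> float:
    """可验证的自监督奖励示意:把图像旋转 k*90 度,策略正确预测 k 即得奖励 1。
    model.predict_rotation 为假设接口,返回 {0,1,2,3} 中的整数;
    image 假定为 (C, H, W) 张量。"""
    k = random.randint(0, 3)
    rotated = torch.rot90(image, k, dims=(1, 2))  # 绕 H/W 两个维度旋转
    pred = model.predict_rotation(rotated)
    return 1.0 if pred == k else 0.0
```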


【7】C-arm Guidance: A Self-supervised Approach To Automated Positioning During Stroke Thrombectomy
标题:C形臂引导:中风血栓切除术期间自动定位的自我监督方法
链接:https://arxiv.org/abs/2510.16145

作者:Ahmad Arrabi, Jay hwasung Jung, J Le, A Nguyen, J Reed, E Stahl, Nathan Franssen, Scott Raymond, Safwan Wshah
摘要:Thrombectomy is one of the most effective treatments for ischemic stroke, but it is resource and personnel-intensive. We propose employing deep learning to automate critical aspects of thrombectomy, thereby enhancing efficiency and safety. In this work, we introduce a self-supervised framework that classifies various skeletal landmarks using a regression-based pretext task. Our experiments demonstrate that our model outperforms existing methods in both regression and classification tasks. Notably, our results indicate that the positional pretext task significantly enhances downstream classification performance. Future work will focus on extending this framework toward fully autonomous C-arm control, aiming to optimize trajectories from the pelvis to the head during stroke thrombectomy procedures. All code used is available at https://github.com/AhmadArrabi/C_arm_guidance


【8】ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles
标题:ObjectTransforms:面向自动驾驶车辆基于视觉感知的不确定性量化与降低
链接:https://arxiv.org/abs/2510.16118

作者:Nishad Sahu, Shounak Sural, Aditya Satish Patil, Ragunathan (Raj) Rajkumar
备注:Accepted at International Conference on Computer Vision (ICCV) 2025 Workshops
摘要:Reliable perception is fundamental for safety critical decision making in autonomous driving. Yet, vision based object detector neural networks remain vulnerable to uncertainty arising from issues such as data bias and distributional shifts. In this paper, we introduce ObjectTransforms, a technique for quantifying and reducing uncertainty in vision based object detection through object specific transformations at both training and inference times. At training time, ObjectTransforms perform color space perturbations on individual objects, improving robustness to lighting and color variations. ObjectTransforms also uses diffusion models to generate realistic, diverse pedestrian instances. At inference time, object perturbations are applied to detected objects and the variance of detection scores are used to quantify predictive uncertainty in real time. This uncertainty signal is then used to filter out false positives and also recover false negatives, improving the overall precision recall curve. Experiments with YOLOv8 on the NuImages 10K dataset demonstrate that our method yields notable accuracy improvements and uncertainty reduction across all object classes during training, while predicting desirably higher uncertainty values for false positives as compared to true positives during inference. Our results highlight the potential of ObjectTransforms as a lightweight yet effective mechanism for reducing and quantifying uncertainty in vision-based perception during training and inference respectively.
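摘要中“在推理期对检测到的目标施加扰动、以检测分数方差作为不确定性”的思路,可用如下示意代码理解;detector.score_region 为假设接口,扰动方式与次数亦为示例取值:

```python
import numpy as np

def detection_uncertainty(detector, image, box, n_perturb=8, seed=0):
    """对单个检测框的区域做颜色空间扰动,用置信度分数的方差估计不确定性。
    detector.score_region(image, box) 为假设接口,返回该框在(可能被扰动的)
    图像上的置信度;image 假定为 HxWx3 的 uint8 数组,box 为 (x0, y0, x1, y1)。"""
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = box
    scores = []
    for _ in range(n_perturb):
        perturbed = image.astype(np.float32).copy()
        gain = rng.uniform(0.9, 1.1)    # 仅对目标区域做亮度/对比度抖动
        bias = rng.uniform(-10, 10)
        perturbed[y0:y1, x0:x1] = np.clip(perturbed[y0:y1, x0:x1] * gain + bias, 0, 255)
        scores.append(detector.score_region(perturbed.astype(np.uint8), box))
    return float(np.var(scores))   # 方差越大,检测越不确定
```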


时序|行为识别|姿态|视频|运动估计(15篇)

【1】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
标题:UltraCUA:具有混合动作的计算机使用代理的基础模型
链接:https://arxiv.org/abs/2510.17790

作者:Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
摘要:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.


【2】Can Image-To-Video Models Simulate Pedestrian Dynamics?
标题:图像到视频模型可以模拟行人动态吗?
链接:https://arxiv.org/abs/2510.17731

作者:Aaron Appelle, Jerome P. Lynch
备注:Appeared in the ICML 2025 Workshop on Building Physically Plausible World Models, July 2025, this https URL
摘要:Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
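评估生成行人轨迹的常用定量指标是 ADE/FDE(平均/最终位移误差);论文实际采用的行人动力学度量可能更丰富,这里仅给出最基础的示意:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """轨迹预测的标准指标:pred 与 gt 形状均为 (T, 2) 的二维坐标序列。
    ADE 为逐时刻欧氏误差的均值,FDE 为末帧误差。"""
    dists = np.linalg.norm(pred - gt, axis=-1)  # 每个时刻的位移误差
    return dists.mean(), dists[-1]
```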


【3】PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
标题:PAGE-4D:面向4D感知的解耦姿态与几何估计
链接:https://arxiv.org/abs/2510.17568

作者:Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang
摘要:Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.


【4】LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
标题:LongInsightBench:评估以人为本的长视频理解全模态模型的综合基准
链接:https://arxiv.org/abs/2510.17305

作者:ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang
备注:Submitted to ARR Rolling Review
摘要:We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results shows that Omni-modal models(OLMs) still face challenge in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code is available at https://anonymous.4open.science/r/LongInsightBench-910F/.


【5】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
标题:从偏好到偏见:对齐调整在塑造视频扩散模型社会偏见中的作用
链接:https://arxiv.org/abs/2510.17247

作者:Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu
摘要:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.


【6】Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis
标题:利用视频分析的战术特征预测VALORANT回合结果
链接:https://arxiv.org/abs/2510.17199

作者:Nirai Hayakawa, Kazumasa Shimari, Kazuma Yamasaki, Hirotatsu Hoshikawa, Rikuto Tsuchida, Kenichi Matsumoto
备注:Accepted to IEEE 2025 Conference on Games
摘要:Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.


【7】Capturing Head Avatar with Hand Contacts from a Monocular Video
标题:从单目视频中捕捉带手部接触的头部化身
链接:https://arxiv.org/abs/2510.17181

作者:Haonan He, Yufeng Zheng, Jie Song
备注:ICCV 2025
摘要:Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions.   There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results.   We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.


【8】Video Reasoning without Training
标题:无需训练的视频推理
链接:https://arxiv.org/abs/2510.17045

作者:Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague
摘要:Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
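作为参考,模型输出分布的逐 token 熵可按如下方式计算;论文中用于调整 value cache 的小型可训练控制器此处省略,仅展示作为优化目标的熵信号本身:

```python
import torch
import torch.nn.functional as F

def output_entropy(logits: torch.Tensor) -> torch.Tensor:
    """生成序列的平均逐 token 熵。

    logits: (T, V),T 个已生成 token 在词表 V 上的 pre-softmax 分数。
    熵越低,模型在该步的输出分布越确定。"""
    logp = F.log_softmax(logits, dim=-1)
    p = logp.exp()
    return -(p * logp).sum(dim=-1).mean()
```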


【9】DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation
标题:DINO-CVA:用于自主导管导航的多模态目标条件视觉-动作模型
链接:https://arxiv.org/abs/2510.17038

作者:Pedram Fekri, Majid Roshanfar, Samuel Barbeau, Seyedfarzad Famouri, Thomas Looi, Dale Podolsky, Mehrdad Zadeh, Javad Dargahi
摘要:Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, more radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheterbased therapies.


【10】An empirical study of the effect of video encoders on Temporal Video Grounding
标题:视频编码器对时间视频接地影响的实证研究
链接:https://arxiv.org/abs/2510.17007

作者:Ignacio M. De la Jara, Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Felipe Bravo-Marquez
摘要:Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.


【11】Training-free Online Video Step Grounding
标题:免训练的在线视频步骤定位(Video Step Grounding)
链接:https://arxiv.org/abs/2510.16989

作者:Luca Zanella, Massimiliano Mancini, Yiming Wang, Alessio Tonioni, Elisa Ricci
备注:NeurIPS 2025. Project website at this https URL
摘要:Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
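BaGLM 的贝叶斯滤波思想可以用一次标准的“预测—更新”步骤来理解:转移矩阵来自大语言模型、观测似然来自 LMM 对当前帧的逐步打分(两者均为假设的输入,下面仅为通用示意):

```python
import numpy as np

def bayes_step_update(belief: np.ndarray, transition: np.ndarray,
                      likelihood: np.ndarray) -> np.ndarray:
    """对“当前正在执行哪个步骤”做一次贝叶斯滤波更新。

    belief:      (S,)  当前对各步骤的后验分布
    transition:  (S, S) 步骤转移矩阵,例如由 LLM 提取
    likelihood:  (S,)  当前帧属于各步骤的似然,例如来自 LMM 的零样本打分"""
    predicted = transition.T @ belief      # 预测:沿转移矩阵传播先验
    posterior = predicted * likelihood     # 更新:乘以观测似然
    return posterior / posterior.sum()     # 重新归一化
```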


【12】GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation
标题:GS2POSE:将高斯飞溅与6D物体姿态估计结合起来
链接:https://arxiv.org/abs/2510.16777

作者:Junbo Li, Weimin Yuan, Yinuo Wang, Yue Zeng, Shihao Shu, Cai Meng, Xiangzhi Bai
摘要:Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4\%, 2.8\% and 2.5\% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.
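摘要提到利用李代数构建姿态可微的渲染管线并迭代优化位姿。下面给出用 se(3) 增量更新位姿的通用示意(并非 GS2POSE 的具体实现,增量 xi 的来源与步长均为假设):

```python
import torch

def se3_update(pose: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
    """用 exp(xi^) 左乘更新 4x4 位姿矩阵,xi = (omega, v) 属于 se(3)。
    这类李代数步进可被姿态可微渲染管线反向传播;此处仅为通用示意。"""
    omega, v = xi[:3], xi[3:]
    twist = torch.zeros(4, 4, dtype=pose.dtype)
    # 旋转部分:omega 的反对称矩阵
    twist[0, 1], twist[0, 2], twist[1, 2] = -omega[2], omega[1], -omega[0]
    twist[1, 0], twist[2, 0], twist[2, 1] = omega[2], -omega[1], omega[0]
    twist[:3, 3] = v                      # 平移部分
    return torch.linalg.matrix_exp(twist) @ pose
```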


【13】SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
标题:SPLite Hand:稀疏感知的轻量级3D手部姿态估计
链接:https://arxiv.org/abs/2510.16396

作者:Yeh Keng Hao, Hsu Tzu Wei, Sun Min
备注:Accepted to AICCC 2025
摘要:With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.


【14】Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
标题:Cataract-LMM:手术视频分析中深度学习的大规模、多源、多任务基准
链接:https://arxiv.org/abs/2510.16371

作者:Mohammad Javad Ahmadi, Iman Gandomi, Parisa Abdi, Seyed-Farzad Mohammadi, Amirhossein Taslimi, Mehdi Khodaparast, Hassan Hashemi, Mahdi Tavakoli, Hamid D. Taghirad
备注:20 pages, 11 figures, 11 tables. Data descriptor for the Cataract-LMM benchmark dataset. Source code and dataset are available
摘要:The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on the established competency rubrics like the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available in Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).


【15】NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?
标题:NEBULA:我们正确评估视觉-语言-动作代理了吗?
链接:https://arxiv.org/abs/2510.16263

作者:Jierui Peng, Yanyan Zhang, Yicheng Duan, Tuo Liang, Vipin Chaudhary, Yu Yin
备注:Homepage: this https URL
摘要:The evaluation of Vision-Language-Action (VLA) agents is hindered by the coarse, end-task success metric that fails to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce \textbf{NEBULA}, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained \textit{capability tests} for precise skill diagnosis with systematic \textit{stress tests} that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.


医学相关(4篇)

【1】Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis
标题:医学图像分析的基础模型:系统回顾和荟萃分析
链接:https://arxiv.org/abs/2510.16973

作者:Praveenbalaji Rajendran, Mojtaba Safari, Wenfeng He, Mingzhe Hu, Shansong Wang, Jun Zhou, Xiaofeng Yang
摘要:Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.


【2】A Deep Learning Framework for Real-Time Image Processing in Medical Diagnostics: Enhancing Accuracy and Speed in Clinical Applications
标题:医疗诊断中实时图像处理的深度学习框架:提高临床应用的准确性和速度
链接:https://arxiv.org/abs/2510.16611

作者:Melika Filvantorkaman, Maral Filvan Torkaman
备注:20 pages, 4 figures
摘要:Medical imaging plays a vital role in modern diagnostics; however, interpreting high-resolution radiological data remains time-consuming and susceptible to variability among clinicians. Traditional image processing techniques often lack the precision, robustness, and speed required for real-time clinical use. To overcome these limitations, this paper introduces a deep learning framework for real-time medical image analysis designed to enhance diagnostic accuracy and computational efficiency across multiple imaging modalities, including X-ray, CT, and MRI. The proposed system integrates advanced neural network architectures such as U-Net, EfficientNet, and Transformer-based models with real-time optimization strategies including model pruning, quantization, and GPU acceleration. The framework enables flexible deployment on edge devices, local servers, and cloud infrastructures, ensuring seamless interoperability with clinical systems such as PACS and EHR. Experimental evaluations on public benchmark datasets demonstrate state-of-the-art performance, achieving classification accuracies above 92%, segmentation Dice scores exceeding 91%, and inference times below 80 milliseconds. Furthermore, visual explanation tools such as Grad-CAM and segmentation overlays enhance transparency and clinical interpretability. These results indicate that the proposed framework can substantially accelerate diagnostic workflows, reduce clinician workload, and support trustworthy AI integration in time-critical healthcare environments.


【3】Effect of Reporting Mode and Clinical Experience on Radiologists' Gaze and Image Analysis Behavior in Chest Radiography
标题:报告模式和临床经验对放射科医生胸部X线摄影凝视和图像分析行为的影响
链接:https://arxiv.org/abs/2510.16070

作者:Mahta Khoobi, Marc Sebastian von der Stueck, Felix Barajas Ordonez, Anca-Maria Iancu, Eric Corban, Julia Nowak, Aleksandar Kargaliev, Valeria Perelygina, Anna-Sophie Schott, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung, Robert Siepmann
备注:Preprint version - Under second revision at Radiology (manuscript RAD-25-1348)
摘要:Structured reporting (SR) and artificial intelligence (AI) may transform how radiologists interact with imaging studies. This prospective study (July to December 2024) evaluated the impact of three reporting modes: free-text (FT), structured reporting (SR), and AI-assisted structured reporting (AI-SR), on image analysis behavior, diagnostic accuracy, efficiency, and user experience. Four novice and four non-novice readers (radiologists and medical students) each analyzed 35 bedside chest radiographs per session using a customized viewer and an eye-tracking system. Outcomes included diagnostic accuracy (compared with expert consensus using Cohen's $\kappa$), reporting time per radiograph, eye-tracking metrics, and questionnaire-based user experience. Statistical analysis used generalized linear mixed models with Bonferroni post-hoc tests with a significance level of ($P \le .01$). Diagnostic accuracy was similar in FT ($\kappa = 0.58$) and SR ($\kappa = 0.60$) but higher in AI-SR ($\kappa = 0.71$, $P < .001$). Reporting times decreased from $88 \pm 38$ s (FT) to $37 \pm 18$ s (SR) and $25 \pm 9$ s (AI-SR) ($P < .001$). Saccade counts for the radiograph field ($205 \pm 135$ (FT), $123 \pm 88$ (SR), $97 \pm 58$ (AI-SR)) and total fixation duration for the report field ($11 \pm 5$ s (FT), $5 \pm 3$ s (SR), $4 \pm 1$ s (AI-SR)) were lower with SR and AI-SR ($P < .001$ each). Novice readers shifted gaze towards the radiograph in SR, while non-novice readers maintained their focus on the radiograph. AI-SR was the preferred mode. In conclusion, SR improves efficiency by guiding visual attention toward the image, and AI-prefilled SR further enhances diagnostic accuracy and user satisfaction.


【4】Time-Embedded Algorithm Unrolling for Computational MRI
标题:用于计算MRI的时间嵌入算法展开
链接:https://arxiv.org/abs/2510.16321

作者:Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya
备注:Neural Information Processing Systems (NeurIPS), 2025
摘要:Algorithm unrolling methods have proven powerful for solving the regularized least squares problem in computational magnetic resonance imaging (MRI). These approaches unfold an iterative algorithm with a fixed number of iterations, typically alternating between a neural network-based proximal operator for regularization, a data fidelity operation and auxiliary updates with learnable parameters. While the connection to optimization methods dictate that the proximal operator network should be shared across unrolls, this can introduce artifacts or blurring. Heuristically, practitioners have shown that using distinct networks may be beneficial, but this significantly increases the number of learnable parameters, making it challenging to prevent overfitting. To address these shortcomings, by taking inspirations from proximal operators with varying thresholds in approximate message passing (AMP) and the success of time-embedding in diffusion models, we propose a time-embedded algorithm unrolling scheme for inverse problems. Specifically, we introduce a novel perspective on the iteration-dependent proximal operation in vector AMP (VAMP) and the subsequent Onsager correction in the context of algorithm unrolling, framing them as a time-embedded neural network. Similarly, the scalar weights in the data fidelity operation and its associated Onsager correction are cast as time-dependent learnable parameters. Our extensive experiments on the fastMRI dataset, spanning various acceleration rates and datasets, demonstrate that our method effectively reduces aliasing artifacts and mitigates noise amplification, achieving state-of-the-art performance. Furthermore, we show that our time-embedding strategy extends to existing algorithm unrolling approaches, enhancing reconstruction quality without increasing the computational complexity significantly.
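“在各次展开之间共享同一个近端网络、并用时间嵌入区分迭代步”的思路,可以用如下简化模块来理解(网络结构与维度均为示例假设,并非论文的具体实现):

```python
import math
import torch
import torch.nn as nn

class TimeEmbeddedProx(nn.Module):
    """在所有展开迭代间共享的近端网络示意:通过正弦时间嵌入
    对迭代序号 t 进行条件化,借鉴扩散模型的做法。"""
    def __init__(self, dim=64, emb_dim=32):
        super().__init__()
        self.emb_dim = emb_dim
        self.net = nn.Sequential(nn.Linear(dim + emb_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def time_embedding(self, t: int) -> torch.Tensor:
        half = self.emb_dim // 2
        freqs = torch.exp(-math.log(10000.0) *
                          torch.arange(half, dtype=torch.float32) / half)
        ang = t * freqs
        return torch.cat([ang.sin(), ang.cos()])   # (emb_dim,)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # x: (B, dim);将时间嵌入拼接到特征上再过共享网络
        emb = self.time_embedding(t).expand(x.shape[0], -1)
        return self.net(torch.cat([x, emb], dim=-1))
```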


自动驾驶|车辆|车道检测等(1篇)

【1】DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
标题:DiffVLA++:通过度量引导对齐连接认知推理和端到端驾驶
链接:https://arxiv.org/abs/2510.17148

作者:Yu Gao, Yiru Wang, Anqing Jiang, Heng Yuwen, Wang Shuo, Sun Hao, Wang Jijun
摘要:Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.


OCR|文本相关(4篇)

【1】Glyph: Scaling Context Windows via Visual-Text Compression
标题:Glyph:通过视觉-文本压缩扩展上下文窗口
链接:https://arxiv.org/abs/2510.17800

作者:Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
摘要:Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
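“把长文本渲染成图像后交给 VLM 处理”这一核心操作,大致可以用如下 PIL 示意来理解;字体、换行宽度等布局参数正是论文中遗传搜索要优化的对象,这里的取值仅为示例,并非论文配置:

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(text: str, width=1024, font_size=14,
                         chars_per_line=110) -> Image.Image:
    """将一段长文本渲染为单张图像,供 VLM 以视觉 token 的形式“阅读”。
    布局参数为示例默认值;实际系统会搜索精度与压缩率的最优折中。"""
    lines = textwrap.wrap(text, chars_per_line)
    font = ImageFont.load_default()
    line_h = font_size + 4
    img = Image.new("RGB", (width, max(line_h * len(lines) + 16, 32)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((8, 8 + i * line_h), line, fill="black", font=font)
    return img
```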


【2】Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
标题:通过双向关系推理和对齐实现多语言文本到图像人物检索
链接:https://arxiv.org/abs/2510.17685

作者:Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
备注:Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: this https URL
摘要:Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in https://github.com/Flame-Chasers/Bi-IRRA.


【3】Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries
标题:Patronus:保护文本到图像模型免受白盒对手的侵害
链接:https://arxiv.org/abs/2510.16581

作者:Xinfeng Li, Shengyuan Pang, Jialin Wu, Jiangyi Deng, Huanlong Zhong, Yanjiao Chen, Jie Zhang, Wenyuan Xu
备注:14 pages, 18 figures, 7 tables
摘要:Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection to defend against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while ensuring the decoding performance of benign input features. Furthermore, we strengthen the model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model will not be compromised by malicious fine-tuning. We conduct extensive experiments to validate the intactness of the performance on safe content generation and the effectiveness of rejecting unsafe content generation. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.


【4】Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
标题:通过文本描述的重构和对齐增强CLIP中的组合推理
链接:https://arxiv.org/abs/2510.16540

作者:Jihoon Kwon, Kyle Min, Jy-yong Sohn
备注:Accepted at NeurIPS 2025 (poster). This is the camera-ready version
摘要:Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning -- the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to the contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves the state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying the READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks. Quantitative and qualitative analyses reveal that our proposed objectives -- reconstruction and alignment -- offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.


Attention注意力(2篇)

【1】Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
标题:看到但不相信:探索VLM中视觉注意力和答案正确性之间的脱节
链接:https://arxiv.org/abs/2510.17771

作者:Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong
备注:21 pages, 10 figures, 6 tables
摘要:Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term ``seeing but not believing'' that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
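推理期干预的基本做法,即根据深层注意力保留被关注的图像 patch、遮蔽其余区域,可用如下示意说明(聚合方式与保留比例为示例假设,并非论文的精确实现):

```python
import torch

def evidence_mask_from_attention(attn: torch.Tensor, keep_ratio=0.2) -> torch.Tensor:
    """由深层注意力构造二值 patch 掩码,仅保留被关注程度最高的图像 patch。

    attn: (heads, queries, patches),深层对图像 patch 的注意力权重。
    返回 (patches,) 的布尔掩码,True 表示保留该 patch 作为证据区域。"""
    patch_scores = attn.mean(dim=(0, 1))                 # 对头与查询 token 取均值
    k = max(1, int(keep_ratio * patch_scores.numel()))
    topk = torch.topk(patch_scores, k).indices
    mask = torch.zeros_like(patch_scores, dtype=torch.bool)
    mask[topk] = True
    return mask
```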


【2】M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
标题:M2H:基于高效窗口式跨任务注意力的多任务学习,用于单目空间感知
链接:https://arxiv.org/abs/2510.17363

作者:U.V.B.L Udugama, George Vosselman, Francesco Nex
备注:Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures
摘要:Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.


人脸|人群计数(3篇)

【1】Optimizing DINOv2 with Registers for Face Anti-Spoofing
标题:使用寄存器优化DINOv2以实现面部反欺骗
链接:https://arxiv.org/abs/2510.17201

作者:Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
备注:ICCV 2025 Workshop FAS
摘要:Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by ``The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025'' and SiW dataset.


【2】HumanCM: One Step Human Motion Prediction
标题:HumanCM:一步人类运动预测
链接:https://arxiv.org/abs/2510.16709

作者:Liu Haojie, Gao Suixiang
备注:6 pages, 2 figures, 2 tables
摘要:We present HumanCM, a one-step human motion prediction framework built upon consistency models. Instead of relying on multi-step denoising as in diffusion-based methods, HumanCM performs efficient single-step generation by learning a self-consistent mapping between noisy and clean motion states. The framework adopts a Transformer-based spatiotemporal architecture with temporal embeddings to model long-range dependencies and preserve motion coherence. Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves comparable or superior accuracy to state-of-the-art diffusion models while reducing inference steps by up to two orders of magnitude.


【3】ISO/IEC-Compliant Match-on-Card Face Verification with Short Binary Templates
标题:使用短二进制模板的符合ISO/IEC标准的卡内比对(Match-on-Card)人脸验证
链接:https://arxiv.org/abs/2510.16078

作者:Abdelilah Ganmati, Karim Afdel, Lahcen Koutti
备注:~14 pages, 6 figures, 6 tables. Source uses elsarticle class; all figures included as PNG/PDF. Primary: cs.CV
摘要:We present a practical match-on-card design for face verification in which compact 64/128-bit templates are produced off-card by PCA-ITQ and compared on-card via constant-time Hamming distance. We specify ISO/IEC 7816-4 and 14443-4 command APDUs with fixed-length payloads and decision-only status words (no score leakage), together with a minimal per-identity EEPROM map. Using real binary codes from a CelebA working set (55 identities, 412 images), we (i) derive operating thresholds from ROC/DET, (ii) replay enroll->verify transactions at those thresholds, and (iii) bound end-to-end time by pure link latency plus a small constant on-card budget. Even at the slowest contact rate (9.6 kbps), total verification time is 43.9 ms (64 b) and 52.3 ms (128 b); at 38.4 kbps both are <14 ms. At FAR = 1%, both code lengths reach TPR = 0.836, while 128 b lowers EER relative to 64 b. An optional +6 B helper (targeted symbol-level parity over empirically unstable bits) is latency-negligible. Overall, short binary templates, fixed-payload decision-only APDUs, and constant-time matching satisfy ISO/IEC transport constraints with wide timing margin and align with ISO/IEC 24745 privacy goals. Limitations: single-dataset evaluation and design-level (pre-hardware) timing; we outline AgeDB/CFP-FP and on-card microbenchmarks as next steps.
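卡上以常数时间比较二进制模板的做法可以用如下示意理解:对 64/128 位模板逐字节异或并累计汉明距离、不提前退出,最终只返回接受/拒绝决策而不泄露分数;真实智能卡实现应为原生代码,此处仅为逻辑示意:

```python
def hamming_distance_no_early_exit(a: bytes, b: bytes) -> int:
    """两个等长二进制模板(如 8 或 16 字节)的汉明距离。
    无论中间结果如何都处理完所有字节,不做提前退出,
    使运行路径不依赖于模板差异出现的位置。"""
    assert len(a) == len(b)
    dist = 0
    for x, y in zip(a, b):
        dist += bin(x ^ y).count("1")   # 对异或后的字节做 popcount
    return dist

def verify(a: bytes, b: bytes, threshold: int) -> bool:
    """仅返回决策(接受/拒绝),不输出原始分数,对应论文中 decision-only 的状态字设计。"""
    return hamming_distance_no_early_exit(a, b) <= threshold
```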


跟踪(1篇)

【1】Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors
标题:形状感知惯性姿态估计:使用稀疏惯性传感器对不同体型的人进行运动跟踪
链接:https://arxiv.org/abs/2510.17101

作者:Lu Yin, Ziying Shi, Yinghao Wu, Xinyu Yi, Feng Xu, Shihui Guo
备注:Accepted by SIGGRAPH Asia 2025 (TOG)
摘要:Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at https://github.com/yinlu5942/SAIP.


图像视频检索|Re-id相关(1篇)

【1】When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions
标题:当一个时刻还不够时:具有交叉时刻相互作用的多时刻检索
链接:https://arxiv.org/abs/2510.17218

作者:Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, Sen Wang
备注:Accepted to NeurIPS 2025
摘要:Existing Moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality datasets called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.


裁剪|量化|加速|压缩相关(1篇)

【1】Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch
标题:无需从头训练神经网络的可微分、位移位且可扩展的量化
链接:https://arxiv.org/abs/2510.16088

作者:Zia Badar
摘要:Quantization of neural networks provides the benefits of inference with lower compute and memory requirements. Previous work in quantization lacks two important aspects that this work provides. First, almost all previous work used a non-differentiable approach, and for learning the derivative is usually set manually in backpropagation, which makes the learning ability of the algorithm questionable; our approach is not only differentiable, we also provide a proof of convergence of our approach to the optimal neural network. Second, previous work in shift/logarithmic quantization either avoided activation quantization alongside weight quantization or achieved lower accuracy. Learning logarithmic quantized values of the form $2^n$ requires a quantization function that can scale to more than 1-bit quantization, which is another benefit of our quantization: it provides $n$-bit quantization as well. When tested on ImageNet image classification with ResNet-18 and weight-only quantization, our approach stays within 1 percent of full-precision accuracy while taking only 15 epochs to train using shift-bit quantization, and it achieves accuracy comparable to SOTA approaches with both weight and activation quantization using shift-bit quantization in 15 training epochs, with slightly higher inference cost (only more CPU instructions) compared to 1-bit quantization (without logarithmic quantization) and without requiring any higher-precision multiplication.
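To make the idea of shift (power-of-two) quantization concrete, here is a generic sketch that snaps weights to values of the form ±2^n and uses a straight-through estimator for the backward pass. Note that the STE trick is exactly the common workaround this paper argues against; the paper's genuinely differentiable formulation and convergence proof are not reproduced here.

```python
import torch

# Generic power-of-two (shift) weight quantization with a straight-through
# estimator. Exponent range is an arbitrary illustrative choice.

class PowerOfTwoQuant(torch.autograd.Function):
    """Snap weights to signed powers of two; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        sign = torch.sign(w)
        n = torch.log2(w.abs().clamp_min(1e-12)).round().clamp(-8, 0)
        return sign * torch.pow(2.0, n)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out               # straight-through gradient

w = torch.randn(16, requires_grad=True)
w_q = PowerOfTwoQuant.apply(w)        # weights snapped to +/- 2^n
loss = (w_q ** 2).sum()
loss.backward()                       # gradients flow to w unchanged
```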


表征学习(1篇)

【1】Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning
标题:Fly-CL:一个受果蝇启发的框架,用于在基于预训练模型的持续表示学习中增强高效去相关并减少训练时间
链接:https://arxiv.org/abs/2510.16877

作者:Heming Zou, Yunliang Zang, Wutong Xu, Xiangyang Ji
摘要:Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL's effectiveness in addressing this challenge through a biologically inspired design. Code is available at https://github.com/gfyddha/Fly-CL.


蒸馏|知识提取(1篇)

【1】HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications
标题:HYDRA:用于高通道高光谱相机应用的混合知识蒸馏和光谱重建算法
链接:https://arxiv.org/abs/2510.16664

作者:Christopher Thirgood, Oscar Mendez, Erin Ling, Jon Storey, Simon Hadfield
摘要:Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering a HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficient generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18% boost in accuracy, and faster inference times than current SOTA models at various channel depths.


超分辨率|去噪|去模糊|去雾(1篇)

【1】Unlocking Off-the-Grid Sparse Recovery with Unlimited Sensing: Simultaneous Super-Resolution in Time and Amplitude
标题:利用无限感知解锁离网稀疏恢复:时间和幅度同时超分辨率
链接:https://arxiv.org/abs/2510.16948

作者:Ruiming Guo, Ayush Bhandari
备注:28 Pages, 10 figures. To appear in IEEE Journal of Selected Topics in Signal Processing
摘要:The recovery of Dirac impulses, or spikes, from filtered measurements is a classical problem in signal processing. As the spikes lie in the continuous domain while measurements are discrete, this task is known as super-resolution or off-the-grid sparse recovery. Despite significant theoretical and algorithmic advances over the past decade, these developments often overlook critical challenges at the analog-digital interface. In particular, when spikes exhibit strong-weak amplitude disparity, conventional digital acquisition may result in clipping of strong components or loss of weak ones beneath the quantization noise floor. This motivates a broader perspective: super-resolution must simultaneously resolve both amplitude and temporal structure. Under a fixed bit budget, such information loss is unavoidable. In contrast, the emerging theory and practice of the Unlimited Sensing Framework (USF) demonstrate that these fundamental limitations can be overcome. Building on this foundation, we demonstrate that modulo encoding within USF enables digital super-resolution by enhancing measurement precision, thereby unlocking temporal super-resolution beyond conventional limits. We develop new theoretical results that extend to non-bandlimited kernels commonly encountered in practice and introduce a robust algorithm for off-the-grid sparse recovery. To demonstrate practical impact, we instantiate our framework in the context of time-of-flight imaging. Both numerical simulations and hardware experiments validate the effectiveness of our approach under low-bit quantization, enabling super-resolution in amplitude and time.
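The modulo encoding at the heart of the Unlimited Sensing Framework can be illustrated with the standard centered folding operator below; the threshold λ and the test signal are arbitrary choices for the sketch.

```python
import numpy as np

# Centered modulo operator used in the Unlimited Sensing Framework: samples are
# folded into [-lam, lam) before quantization, so strong and weak components can
# share the same dynamic range. Parameters are illustrative.

def modulo_fold(x: np.ndarray, lam: float) -> np.ndarray:
    """Fold x into the range [-lam, lam)."""
    return np.mod(x + lam, 2.0 * lam) - lam

t = np.linspace(0, 1, 200)
signal = 5.0 * np.sin(2 * np.pi * 3 * t) + 0.1 * np.sin(2 * np.pi * 40 * t)
folded = modulo_fold(signal, lam=1.0)     # bounded measurements, |folded| < 1
print(folded.min(), folded.max())
```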


点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding
标题:ProDAT:用于点云编码的渐进式密度感知尾部丢弃
链接:https://arxiv.org/abs/2510.17068

作者:Zhe Luo, Wenjing Jia, Stuart Perry
摘要:Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progressive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR-D2 on SemanticKITTI and over 18.15% for ShapeNet.
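A hedged sketch of what a density-aware tail-drop might look like on per-point latent features: channels are assumed to be ordered by importance, and a density score sets how much of the tail each point keeps. The keep-ratio rule below is illustrative only; ProDAT's actual density guidance and entropy coding are more involved.

```python
import torch

# Illustrative tail-drop on per-point latents: denser points keep more channels,
# and the tail is zeroed so a decoder can operate at a lower bitrate.

def tail_drop(latent: torch.Tensor, density: torch.Tensor,
              min_keep: float = 0.25) -> torch.Tensor:
    """latent: (N, C) per-point features; density: (N,) scores in [0, 1]."""
    N, C = latent.shape
    keep_ratio = min_keep + (1.0 - min_keep) * density       # denser -> keep more
    keep_counts = (keep_ratio * C).long().clamp(1, C)          # channels kept per point
    idx = torch.arange(C).expand(N, C)
    mask = (idx < keep_counts.unsqueeze(1)).float()
    return latent * mask                                       # zero out the tail

latent = torch.randn(1024, 64)
density = torch.rand(1024)
coarse = tail_drop(latent, density)     # decodable at a reduced rate
```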


多模态(3篇)

【1】MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning
标题:MILES:用于平衡多模态学习的模态感知学习率调度器
链接:https://arxiv.org/abs/2510.17394

作者:Alejandro Guerra-Manzanares, Farah E. Shamout
备注:Accepted and presented at the 2025 International Joint Conference on Neural Networks (IJCNN'25). The paper was awarded an honorable mention (best 4 papers)
摘要:The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
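The general mechanism, separate learning rates per modality rescaled from a utilization signal, can be sketched with per-modality optimizer parameter groups as below. The rebalancing rule and the utilization numbers are placeholders, not MILES' actual schedule.

```python
import torch
import torch.nn as nn

# Illustrative per-modality learning-rate rebalancing: each encoder has its own
# parameter group, and the dominant modality is slowed down.

image_enc = nn.Linear(512, 128)
audio_enc = nn.Linear(128, 128)
fusion = nn.Linear(256, 10)

optimizer = torch.optim.Adam([
    {"params": image_enc.parameters(), "lr": 1e-3, "name": "image"},
    {"params": audio_enc.parameters(), "lr": 1e-3, "name": "audio"},
    {"params": fusion.parameters(),    "lr": 1e-3, "name": "fusion"},
])

def rebalance(optimizer, utilization, base_lr=1e-3):
    """utilization: per-modality scores in (0, 1]; over-used modality gets a smaller lr."""
    mean_u = sum(utilization.values()) / len(utilization)
    for group in optimizer.param_groups:
        u = utilization.get(group["name"])
        if u is not None:
            group["lr"] = base_lr * (mean_u / u)

rebalance(optimizer, {"image": 0.8, "audio": 0.4})   # image encoder now learns slower
```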


【2】PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
标题:PRISMM-Bench:基于同行评审的多模态不一致性基准
链接:https://arxiv.org/abs/2510.16505

作者:Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
摘要:Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.


【3】Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset
标题:Embody 3D:大规模多模态运动和行为数据集
链接:https://arxiv.org/abs/2510.16258

作者:Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, Jake Sandakly, Julia Buffalini, Neham Jain, Steven Krenn, Moneish Kumar, Dejan Markovic, Evonne Ng, Fabian Prada, Andrew Saba, Siwei Zhang, Vasu Agrawal, Tim Godisart, Alexander Richard, Michael Zollhoefer
摘要:The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.


3D|3D重建等相关(5篇)

【1】Raindrop GS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions
标题:Raindrop GS:雨滴条件下3D高斯泼溅的基准
链接:https://arxiv.org/abs/2510.17719

作者:Zhiqiang Teng, Beibei Lin, Tingting Chen, Zifeng Yuan, Xuanyi Li, Xuanyu Zhang, Shunli Zhang
摘要:3D Gaussian Splatting (3DGS) under raindrop conditions suffers from severe occlusions and optical distortions caused by raindrop contamination on the camera lens, substantially degrading reconstruction quality. Existing benchmarks typically evaluate 3DGS using synthetic raindrop images with known camera poses (constrained images), assuming ideal conditions. However, in real-world scenarios, raindrops often interfere with accurate camera pose estimation and point cloud initialization. Moreover, a significant domain gap between synthetic and real raindrops further impairs generalization. To tackle these issues, we introduce RaindropGS, a comprehensive benchmark designed to evaluate the full 3DGS pipeline-from unconstrained, raindrop-corrupted images to clear 3DGS reconstructions. Specifically, the whole benchmark pipeline consists of three parts: data preparation, data processing, and raindrop-aware 3DGS evaluation, including types of raindrop interference, camera pose estimation and point cloud initialization, single image rain removal comparison, and 3D Gaussian training comparison. First, we collect a real-world raindrop reconstruction dataset, in which each scene contains three aligned image sets: raindrop-focused, background-focused, and rain-free ground truth, enabling a comprehensive evaluation of reconstruction quality under different focus conditions. Through comprehensive experiments and analyses, we reveal critical insights into the performance limitations of existing 3DGS methods on unconstrained raindrop images and the varying impact of different pipeline components: the impact of camera focus position on 3DGS reconstruction performance, and the interference caused by inaccurate pose and point cloud initialization on reconstruction. These insights establish clear directions for developing more robust 3DGS methods under raindrop conditions.


【2】Towards 3D Objectness Learning in an Open World
标题:迈向开放世界中的3D对象性学习
链接:https://arxiv.org/abs/2510.17686

作者:Taichi Liu, Zhenyu Wang, Ruofeng Liu, Guang Wang, Desheng Zhang
备注:Accepted by NeurIPS 2025
摘要:Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, We propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.


【3】Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
标题:在3D场景中激发有依据的思维链推理
链接:https://arxiv.org/abs/2510.16714

作者:Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang
备注:Project page: this https URL
摘要:Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.


【4】Structured Interfaces for Automated Reasoning with 3D Scene Graphs
标题:利用3D场景图自动推理的结构化接口
链接:https://arxiv.org/abs/2510.16643

作者:Aaron Ray, Jacob Arkin, Harel Biggie, Chuchu Fan, Luca Carlone, Nicholas Roy
备注:25 pages, 3 figures
摘要:In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.
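A small sketch of the retrieval interface the abstract describes: the scene graph lives in a graph database and the language model emits a Cypher query, which a thin Python wrapper executes. The node labels, properties, and connection details below are assumed for illustration and require a running Neo4j instance; they are not the paper's actual schema.

```python
from neo4j import GraphDatabase

# Illustrative "Cypher as a tool" wrapper: retrieve only the task-relevant
# subgraph instead of serializing the whole 3D scene graph into the context.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (r:Room)-[:CONTAINS]->(o:Object)
WHERE o.category = $category
RETURN r.name AS room, o.name AS object, o.position AS position
LIMIT 20
"""

def retrieve_objects(category: str):
    with driver.session() as session:
        result = session.run(CYPHER, category=category)
        return [record.data() for record in result]

# e.g. ground "bring me a mug" by retrieving candidate mugs and their rooms
print(retrieve_objects("mug"))
```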


【5】HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
标题:HGC-Avatar:可流式传输的动态3D虚拟形象的分层高斯压缩
链接:https://arxiv.org/abs/2510.16463

作者:Haocheng Tang, Ruoke Yan, Xinhui Yin, Qi Zhang, Xinfeng Zhang, Siwei Ma, Wen Gao, Chuanmin Jia
备注:ACM International Conference on Multimedia 2025
摘要:Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.


其他神经网络|深度学习|模型|建模(13篇)

【1】Elastic ViTs from Pretrained Models without Retraining
标题:来自预训练模型的弹性ViT,无需重新训练
链接:https://arxiv.org/abs/2510.17700

作者:Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G.M. Snoek, Yuki M. Asano
备注:Accepted at NeurIPS 2025
摘要:Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/


【2】DeepDetect: Learning All-in-One Dense Keypoints
标题:DeepDetect:学习一体化密集关键点
链接:https://arxiv.org/abs/2510.17422

作者:Shaharyar Ahmed Khan Tareen, Filza Khan Tareen
备注:6 pages, 6 figures, 2 tables, 7 equations
摘要:Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model: ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints, that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
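The label-construction step, fusing several classical detectors and an edge map into a dense mask, can be sketched with OpenCV as below. The particular detectors, disk radius, and fusion rule are illustrative (the paper fuses 7 keypoint and 2 edge detectors), the image path is a placeholder, and an OpenCV build providing SIFT/ORB/BRISK is assumed.

```python
import cv2
import numpy as np

# Illustrative ground-truth mask: union of classical keypoint responses and edges.
def fused_keypoint_mask(gray: np.ndarray) -> np.ndarray:
    mask = np.zeros_like(gray, dtype=np.float32)
    detectors = [cv2.SIFT_create(), cv2.ORB_create(), cv2.BRISK_create()]
    for det in detectors:
        for kp in det.detect(gray, None):
            cv2.circle(mask, (int(kp.pt[0]), int(kp.pt[1])), 2, 1.0, -1)
    edges = cv2.Canny(gray, 100, 200)                    # edge cue
    mask = np.maximum(mask, (edges > 0).astype(np.float32))
    return mask                                          # in [0, 1], used as the label

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)   # path is illustrative
if img is not None:
    label = fused_keypoint_mask(img)
```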


【3】CharDiff: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration
标题:CharDiff:一种具有字符级引导的车牌图像恢复扩散模型
链接:https://arxiv.org/abs/2510.17330

作者:Gyuhwan Park, Kihyun Na, Injung Kim
备注:11 pages, 6 figures
摘要:The significance of license plate image restoration goes beyond the preprocessing stage of License Plate Recognition (LPR) systems, as it also serves various purposes, including increasing evidential value, enhancing the clarity of visual interface, and facilitating further utilization of license plate images. We propose a novel diffusion-based framework with character-level guidance, CharDiff, which effectively restores and recognizes severely degraded license plate images captured under realistic conditions. CharDiff leverages fine-grained character-level priors extracted through external segmentation and Optical Character Recognition (OCR) modules tailored for low-quality license plate images. For precise and focused guidance, CharDiff incorporates a novel Character-guided Attention through Region-wise Masking (CHARM) module, which ensures that each character's guidance is restricted to its own region, thereby avoiding interference with other regions. In experiments, CharDiff significantly outperformed the baseline restoration models in both restoration quality and recognition accuracy, achieving a 28% relative reduction in CER on the Roboflow-LP dataset, compared to the best-performing baseline model. These results indicate that the structured character-guided conditioning effectively enhances the robustness of diffusion-based license plate restoration and recognition in practical deployment scenarios.


【4】CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference
标题:CausalMamba:用于神经因果推理的可扩展条件状态空间模型
链接:https://arxiv.org/abs/2510.17318

作者:Sangyoon Bae, Jiook Cha
摘要:We introduce CausalMamba, a scalable framework that addresses fundamental limitations in fMRI-based causal inference: the ill-posed nature of inferring neural causality from hemodynamically distorted BOLD signals and the computational intractability of existing methods like Dynamic Causal Modeling (DCM). Our approach decomposes this complex inverse problem into two tractable stages: BOLD deconvolution to recover latent neural activity, followed by causal graph inference using a novel Conditional Mamba architecture. On simulated data, CausalMamba achieves 37% higher accuracy than DCM. Critically, when applied to real task fMRI data, our method recovers well-established neural pathways with 88% fidelity, whereas conventional approaches fail to identify these canonical circuits in over 99% of subjects. Furthermore, our network analysis of working memory data reveals that the brain strategically shifts its primary causal hub-recruiting executive or salience networks depending on the stimulus-a sophisticated reconfiguration that remains undetected by traditional methods. This work provides neuroscientists with a practical tool for large-scale causal inference that captures both fundamental circuit motifs and flexible network dynamics underlying cognitive function.


【5】One-step Diffusion Models with Bregman Density Ratio Matching
标题:Bregman密度比匹配的一步扩散模型
链接:https://arxiv.org/abs/2510.16983

作者:Yuanzhi Zhu, Eleftherios Tsonis, Lucas Degeorge, Vicky Kalogeiton
备注:work in progress
摘要:Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
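For reference, the Bregman divergence that the density-ratio matching view is built on has the standard form below; choosing a negative-entropy generator recovers the KL divergence. These are textbook identities, not a statement of Di-Bregman's specific objective.

```latex
% Bregman divergence generated by a strictly convex functional F:
D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle .

% Example: the generator F(p) = \int \big(p\log p - p\big)\,dx gives
D_F(p, q) = \int p \log\frac{p}{q}\,dx - \int p\,dx + \int q\,dx
          = \mathrm{KL}(p \,\|\, q) \quad \text{for normalized } p, q ,
% so KL-type distillation objectives arise as special cases of the family.
```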


【6】WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement
标题:WaMaIR:通过多尺度小波卷积和基于Mamba的通道建模与纹理增强的图像恢复
链接:https://arxiv.org/abs/2510.16765

作者:Shengyu Zhu, Fan, Fuxuan Zhang
备注:Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral
摘要:Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.


【7】Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
标题:视觉自回归模型在推理时缩放上击败扩散模型
链接:https://arxiv.org/abs/2510.16751

作者:Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
摘要:While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
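A compact sketch of beam search over a discrete visual token sequence, the kind of search the abstract argues autoregressive image generators make tractable. The toy model returning random logits is a stand-in; a real verifier or likelihood-based scorer would replace it.

```python
import torch

# Beam search over discrete tokens: keep the `beam_width` highest-scoring
# partial sequences at each step. The toy model below is a placeholder.

@torch.no_grad()
def beam_search(model, vocab_size, seq_len, beam_width=4, device="cpu"):
    beams = torch.zeros(1, 0, dtype=torch.long, device=device)   # (num_beams, t)
    scores = torch.zeros(1, device=device)                        # running log-probs
    for _ in range(seq_len):
        logits = model(beams)                                      # (num_beams, vocab)
        logp = torch.log_softmax(logits, dim=-1)
        cand = scores.unsqueeze(1) + logp                          # expand each beam
        scores, idx = cand.view(-1).topk(beam_width)               # best continuations
        beam_idx = torch.div(idx, vocab_size, rounding_mode="floor")
        tok_idx = idx % vocab_size
        beams = torch.cat([beams[beam_idx], tok_idx.unsqueeze(1)], dim=1)
    return beams[0], scores[0]                                     # best token sequence

class ToyModel(torch.nn.Module):
    def __init__(self, vocab=16):
        super().__init__()
        self.vocab = vocab

    def forward(self, seq):
        return torch.randn(seq.shape[0], self.vocab)               # random next-token logits

tokens, score = beam_search(ToyModel(), vocab_size=16, seq_len=8)
```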


【8】A Comprehensive Survey on World Models for Embodied AI
标题:面向具身人工智能的世界模型全面综述
链接:https://arxiv.org/abs/2510.16732

作者:Xinqing Li, Xin He, Le Zhang, Yun Liu
备注:this https URL
摘要:Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at https://github.com/Li-Zn-H/AwesomeWorldModels.


【9】Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
标题:通过隐式残差世界模型进行以视觉为中心的4D占用预测和规划
链接:https://arxiv.org/abs/2510.16729

作者:Jianbiao Mei, Yu Yang, Xuemeng Yang, Licheng Wen, Jiajun Lv, Botian Shi, Yong Liu
摘要:End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.


【10】Universal and Transferable Attacks on Pathology Foundation Models
标题:对病理学基础模型的通用且可迁移攻击
链接:https://arxiv.org/abs/2510.16660

作者:Yuntian Wang, Xilin Yang, Che-Yung Shen, Nir Pillar, Aydogan Ozcan
备注:38 Pages, 8 Figures
摘要:We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.
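A generic sketch of how a universal perturbation of this kind can be optimized: a single noise tensor under an L_inf budget is trained to push a frozen extractor's features away from their clean values across a dataset. The loss, budget, and loop structure are illustrative assumptions, not UTAP's exact recipe; `feature_extractor` and `loader` are supplied by the caller.

```python
import torch
import torch.nn.functional as F

# Illustrative universal-perturbation training loop: one fixed, weak noise
# pattern that degrades features of many images from a frozen model.

def train_universal_perturbation(feature_extractor, loader, eps=4 / 255,
                                 steps=100, lr=1e-2, device="cpu"):
    delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    feature_extractor.eval()
    it = iter(loader)
    for _ in range(steps):
        try:
            images, _ = next(it)
        except StopIteration:
            it = iter(loader)
            images, _ = next(it)
        images = images.to(device)
        with torch.no_grad():
            clean = feature_extractor(images)
        adv = feature_extractor((images + delta).clamp(0, 1))
        loss = F.cosine_similarity(adv, clean, dim=-1).mean()   # minimize similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                              # keep the noise weak
    return delta.detach()
```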


【11】Image Categorization and Search via a GAT Autoencoder and Representative Models
标题:通过GAT自动编码器和代表性模型进行图像分类和搜索
链接:https://arxiv.org/abs/2510.16514

作者:Duygu Sap, Martin Lotz, Connor Mattinson
备注:10 pages, 22 figures, Under review
摘要:We propose a method for image categorization and retrieval that leverages graphs and a graph attention network (GAT)-based autoencoder. Our approach is representative-centric, that is, we execute the categorization and retrieval process via the representative models we construct for the images and image categories. We utilize a graph where nodes represent images (or their representatives) and edges capture similarity relationships. GAT highlights important features and relationships between images, enabling the autoencoder to construct context-aware latent representations that capture the key features of each image relative to its neighbors. We obtain category representatives from these embeddings and categorize a query image by comparing its representative to the category representatives. We then retrieve the most similar image to the query image within its identified category. We demonstrate the effectiveness of our representative-centric approach through experiments with both the GAT autoencoders and standard feature-based techniques.


【12】Demeter: A Parametric Model of Crop Plant Morphology from the Real World
标题:德墨忒尔:现实世界中农作物形态的参数模型
链接:https://arxiv.org/abs/2510.16377

作者:Tianhang Cheng, Albert J. Zhai, Evan Z. Chen, Rui Zhou, Yawen Deng, Zitong Li, Kejie Zhao, Janice Shiu, Qianyu Zhao, Yide Xu, Xinlei Wang, Yuan Shen, Sheng Wang, Lisa Ainsworth, Kaiyu Guan, Shenlong Wang
备注:ICCV 2025
摘要:Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has showed broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of a plant morphology, including topology, shape, articulation, and deformation into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data is available at https://tianhang-cheng.github.io/Demeter/.


【13】FedPURIN: Programmed Update and Reduced INformation for Sparse Personalized Federated Learning
标题:FedPURIN:稀疏个性化联邦学习的编程更新和简化信息
链接:https://arxiv.org/abs/2510.16065

作者:Lunchen Xie, Zehua He, Qingjiang Shi
摘要:Personalized Federated Learning (PFL) has emerged as a critical research frontier addressing data heterogeneity issue across distributed clients. Novel model architectures and collaboration mechanisms are engineered to accommodate statistical disparities while producing client-specific models. Parameter decoupling represents a promising paradigm for maintaining model performance in PFL frameworks. However, the communication efficiency of many existing methods remains suboptimal, sustaining substantial communication burdens that impede practical deployment. To bridge this gap, we propose Federated Learning with Programmed Update and Reduced INformation (FedPURIN), a novel framework that strategically identifies critical parameters for transmission through an integer programming formulation. This mathematically grounded strategy is seamlessly integrated into a sparse aggregation scheme, achieving a significant communication reduction while preserving the efficacy. Comprehensive evaluations on standard image classification benchmarks under varied non-IID conditions demonstrate competitive performance relative to state-of-the-art methods, coupled with quantifiable communication reduction through sparse aggregation. The framework establishes a new paradigm for communication-efficient PFL, particularly advantageous for edge intelligence systems operating with heterogeneous data sources.


其他(37篇)

【1】ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
标题:ConsistEdit:高度一致和精确的免训练可视化编辑
链接:https://arxiv.org/abs/2510.17803

作者:Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
备注:SIGGRAPH Asia 2025
摘要:Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcraft, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.


【2】Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats
标题:Botany-Bot:利用高斯泼溅对被遮挡和叶下植物结构进行数字孪生监测
链接:https://arxiv.org/abs/2510.17783

作者:Simeon Adebola, Chung Min Kim, Justin Kerr, Shuangyu Xie, Prithvi Akella, Jose Luis Susa Rincon, Eugen Solowjow, Ken Goldberg
备注:2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
摘要:Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed "annotated digital twins" of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmentated Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at https://berkeleyautomation.github.io/Botany-Bot/.


【3】SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
标题:SparseVILA:解耦视觉稀疏性以实现高效VLM推理
链接:https://arxiv.org/abs/2510.17777

作者:Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu
摘要:Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.


【4】PICABench: How Far Are We from Physically Realistic Image Editing?
标题:PICABench:我们距离物理真实图像编辑还有多远?
链接:https://arxiv.org/abs/2510.17681

作者:Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu
摘要:Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.


【5】Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment
标题:分裂-融合-传输:通过双重聚类和最优传输对齐实现无标注显著性
链接:https://arxiv.org/abs/2510.17484

作者:Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, Noman Ali, Usman Ali, Wei Xiang
摘要:Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.


【6】Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS
标题:初始化以泛化:用于稀疏视图3DGS的更强初始化流程
链接:https://arxiv.org/abs/2510.17479

作者:Feng Zhou, Wenkai Guo, Pu Cao, Zhicheng Zhang, Jianqin Yin
备注:A preprint paper
摘要:Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (\emph{i.e.}, the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization's primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at https://github.com/zss171999645/ItG-GS.


【7】Rethinking Nighttime Image Deraining via Learnable Color Space Transformation
标题:通过可学习的颜色空间变换重新思考夜间图像去雨
链接:https://arxiv.org/abs/2510.17440

作者:Qiyuan Guan, Xiang Chen, Guiyue Jin, Jiyu Jin, Shumin Fan, Tianyu Song, Jinshan Pan
备注:Accepted by NeurIPS 2025
摘要:Compared to daytime image deraining, nighttime image deraining poses significant challenges due to inherent complexities of nighttime scenarios and the lack of high-quality datasets that accurately represent the coupling effect between rain and illumination. In this paper, we rethink the task of nighttime image deraining and contribute a new high-quality benchmark, HQ-NightRain, which offers higher harmony and realism compared to existing datasets. In addition, we develop an effective Color Space Transformation Network (CST-Net) for better removing complex rain from nighttime scenes. Specifically, we propose a learnable color space converter (CSC) to better facilitate rain removal in the Y channel, as nighttime rain is more pronounced in the Y channel compared to the RGB color space. To capture illumination information for guiding nighttime deraining, implicit illumination guidance is introduced enabling the learned features to improve the model's robustness in complex scenarios. Extensive experiments show the value of our dataset and the effectiveness of our method. The source code and datasets are available at https://github.com/guanqiyuan/CST-Net.
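One way to read "learnable color space converter" is a 1x1 convolution initialized at a fixed RGB-to-YCbCr matrix, so training starts from the handcrafted transform and can deviate from it. The sketch below takes that reading; the BT.601-style initialization and the interpretation itself are assumptions, not necessarily the paper's CSC.

```python
import torch
import torch.nn as nn

# Learnable color-space transform: a 1x1 conv initialized to RGB -> YCbCr, so
# channel 0 starts out as luminance (Y) and can be refined during training.

class LearnableCSC(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1, bias=True)
        rgb2ycbcr = torch.tensor([[ 0.299,  0.587,  0.114],
                                  [-0.169, -0.331,  0.500],
                                  [ 0.500, -0.419, -0.081]])
        with torch.no_grad():
            self.conv.weight.copy_(rgb2ycbcr.view(3, 3, 1, 1))
            self.conv.bias.copy_(torch.tensor([0.0, 0.5, 0.5]))

    def forward(self, rgb):            # rgb in [0, 1], shape (B, 3, H, W)
        return self.conv(rgb)          # channel 0 approximates Y

x = torch.rand(2, 3, 64, 64)
ycbcr_like = LearnableCSC()(x)         # deraining then operates mainly on channel 0
```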


【8】Leveraging AV1 motion vectors for Fast and Dense Feature Matching
标题:利用AV1运动矢量进行快速密集的特征匹配
链接:https://arxiv.org/abs/2510.17434

作者:Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram
备注:Accepted ICIR 2025, camera-ready version
摘要:We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
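A hedged sketch of the cosine-consistency filtering step: motion vectors from consecutive frames are chained into short tracks, and a track is kept only if each pair of consecutive displacements agrees in direction. Extracting the vectors from the AV1 bitstream itself is assumed to be handled elsewhere:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def filter_tracks(mv_fields, seeds, length=4, thr=0.9):
    """mv_fields: list of (H, W, 2) per-frame motion-vector fields (pixels).
    seeds: (N, 2) integer start positions (x, y) in the first frame."""
    kept = []
    for x0, y0 in seeds:
        pos, disps, ok = np.array([x0, y0], float), [], True
        for t in range(length):
            h, w, _ = mv_fields[t].shape
            xi, yi = int(round(pos[0])), int(round(pos[1]))
            if not (0 <= xi < w and 0 <= yi < h):
                ok = False
                break
            d = mv_fields[t][yi, xi]
            if disps and cosine(disps[-1], d) < thr:   # direction flips -> drop the track
                ok = False
                break
            disps.append(d)
            pos = pos + d
        if ok:
            kept.append((x0, y0))
    return kept

# Toy usage: a constant rightward flow keeps every seed; an inconsistent flow would drop them.
flow = [np.tile(np.array([2.0, 0.0]), (48, 64, 1)) for _ in range(4)]
seeds = np.array([[5, 5], [30, 20], [50, 40]])
print(len(filter_tracks(flow, seeds)))   # -> 3
```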


【9】Exploring The Missing Semantics In Event Modality
标题:探索事件模态中缺失的语义
链接:https://arxiv.org/abs/2510.17347

作者:Jingqian Wu, Shengpeng Xu, Yunbo Jia, Edmund Y. Lam
摘要:Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.


【10】iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA
标题:iDETEX:赋能MLLM实现智能、详细且可解释的IQA
链接:https://arxiv.org/abs/2510.17332

作者:Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng
备注:Accepted to ICCV 2025 Workshop
摘要:Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX-a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.


【11】Machine Vision-Based Surgical Lighting System:Design and Implementation
标题:基于机器视觉的手术照明系统:设计与实现
链接:https://arxiv.org/abs/2510.17287

作者:Amir Gharghabi, Mahdi Hakiminezhad, Maryam Shafaei, Shaghayegh Gharghabi
摘要:Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
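The closing step of such a pipeline can be sketched as follows (all geometry parameters are assumptions): the detected marker's bounding-box centre is converted into pan/tilt angles for the two servomotors using a pinhole-camera model for the viewing ray, assuming the lamp is mounted close enough to the camera that camera-frame angles can drive the servos directly:

```python
import math

def bbox_to_pan_tilt(bbox, img_w, img_h, hfov_deg=60.0, vfov_deg=40.0):
    """bbox = (x1, y1, x2, y2) in pixels; returns (pan, tilt) in degrees,
    zero meaning the marker lies on the optical axis."""
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    # Normalised offsets in [-0.5, 0.5] from the image centre.
    nx = cx / img_w - 0.5
    ny = cy / img_h - 0.5
    pan = math.degrees(math.atan(2.0 * nx * math.tan(math.radians(hfov_deg / 2.0))))
    tilt = -math.degrees(math.atan(2.0 * ny * math.tan(math.radians(vfov_deg / 2.0))))
    return pan, tilt

# A detection from a 1280x720 frame, e.g. the blue marker box returned by the detector.
print(bbox_to_pan_tilt((900, 200, 960, 260), 1280, 720))
```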


【12】FineVision: Open Data Is All You Need
标题:FineVision:开放数据就是您所需要的一切
链接:https://arxiv.org/abs/2510.17269

作者:Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
摘要:The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.


【13】From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh
标题:从像素到人:孟加拉国河岸侵蚀和消失村庄的卫星绘图和量化
链接:https://arxiv.org/abs/2510.17198

作者:M Saifuzzaman Rafat, Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, Jungpil Shin
备注:Submitted to the International Conference on Data and Applied Analytics (IDAA 2025). 15 pages, 5 figures, 4 tables
摘要:The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh's most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM's mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.
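Two pieces of such a pipeline are easy to illustrate with a hedged stand-in: a rough water/land split from a simple colour-channel rule (standing in for the paper's colour-channel analysis) and the IoU / Dice scores used to grade the final masks:

```python
import numpy as np

def rough_water_mask(rgb):
    """rgb: (H, W, 3) float image in [0, 1]; water pixels tend to be blue-dominant."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (b > r) & (b > g)

def iou_dice(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / max(union, 1)
    dice = 2 * inter / max(pred.sum() + gt.sum(), 1)
    return iou, dice

img = np.random.rand(128, 128, 3)          # toy image
gt = img[..., 2] > 0.5                     # toy ground-truth water mask
print(iou_dice(rough_water_mask(img), gt))
```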


【14】Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras
标题:面向消费级相机的基于环境照明的不可感知水印
链接:https://arxiv.org/abs/2510.17114

作者:Hodaka Kawachi, Tomoya Nakamura, Hiroaki Santo, SaiKiran Kumar Tedla, Trevor Dalton Canham, Yasushi Yagi, Michael S. Brown
摘要:This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source's spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system's sensitivity to visible spectra, modern consumer camera sensors' spectral sensitivity, and narrowband LEDs' ability to generate broadband spectra perceived as "white light" (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest-embedding 128 bits within a 10-second video clip-this capacity is sufficient for essential metadata supporting privacy protection and content verification.


【15】Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement
标题:通过条件细化提高基于预训练扩散的弱光图像增强的保真度
链接:https://arxiv.org/abs/2510.17105

作者:Xiaogang Xu, Jian Wang, Yunfan Lu, Ruihang Chu, Ruixing Wang, Jiafei Wu, Bei Yu, Liang Lin
摘要:Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.


【16】GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation
标题:GSPlane:通过结构化表示实现简洁准确的平面重建
链接:https://arxiv.org/abs/2510.17095

作者:Ruitong Gan, Junran Peng, Yang Liu, Chuanchen Luo, Qing Li, Zhaoxiang Zhang
摘要:Planes are fundamental primitives of 3D scenes, especially in man-made environments such as indoor spaces and urban streets. Representing these planes in a structured and parameterized format facilitates scene editing and physical simulations in downstream applications. Recently, Gaussian Splatting (GS) has demonstrated remarkable effectiveness in the Novel View Synthesis task, with extensions showing great potential in accurate surface reconstruction. However, even state-of-the-art GS representations often struggle to reconstruct planar regions with sufficient smoothness and precision. To address this issue, we propose GSPlane, which recovers accurate geometry and produces clean and well-structured mesh connectivity for plane regions in the reconstructed scene. By leveraging off-the-shelf segmentation and normal prediction models, GSPlane extracts robust planar priors to establish structured representations for planar Gaussian coordinates, which help guide the training process by enforcing geometric consistency. To further enhance training robustness, a Dynamic Gaussian Re-classifier is introduced to adaptively reclassify planar Gaussians with persistently high gradients as non-planar, ensuring more reliable optimization. Furthermore, we utilize the optimized planar priors to refine the mesh layouts, significantly improving topological structure while reducing the number of vertices and faces. We also explore applications of the structured planar representation, which enable decoupling and flexible manipulation of objects on supportive planes. Extensive experiments demonstrate that, with no sacrifice in rendering quality, the introduction of planar priors significantly improves the geometric accuracy of the extracted meshes across various baselines.


【17】How Universal Are SAM2 Features?
标题:SAM2特征的通用性如何?
链接:https://arxiv.org/abs/2510.17051

作者:Masoud Khairi Atani, Alon Harell, Hyomin Choi, Runyu Yang, Fabien Racape, Ivan V. Bajic
备注:This work has been accepted for publication in IEEE Picture Coding Symposium (PCS) 2025
摘要:The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2's specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.
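A minimal sketch of the probing setup described above (names and sizes are assumptions): the encoder stays frozen, and only a lightweight neck plus a task head are trained, so the resulting score reflects how adaptable the frozen features are:

```python
import torch
import torch.nn as nn

class LightweightNeck(nn.Module):
    def __init__(self, in_dim=256, out_dim=128, n_classes=10):
        super().__init__()
        self.neck = nn.Sequential(nn.Conv2d(in_dim, out_dim, 1), nn.GELU(),
                                  nn.Conv2d(out_dim, out_dim, 3, padding=1))
        self.head = nn.Linear(out_dim, n_classes)

    def forward(self, feat):                     # feat: (B, C, H, W) frozen features
        x = self.neck(feat).mean(dim=(2, 3))     # global average pool
        return self.head(x)

frozen_encoder = nn.Conv2d(3, 256, 16, stride=16)   # stand-in for a Hiera / SAM2 encoder
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

probe = LightweightNeck()
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
x, y = torch.rand(4, 3, 224, 224), torch.randint(0, 10, (4,))
with torch.no_grad():
    feat = frozen_encoder(x)                     # frozen features, no gradient
loss = nn.functional.cross_entropy(probe(feat), y)
loss.backward()
opt.step()
print(float(loss))
```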


【18】Do Satellite Tasks Need Special Pretraining?
标题:卫星任务需要特殊的预训练吗?
链接:https://arxiv.org/abs/2510.17014

作者:Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Tigran Galstyan, Vahan Huroyan, Naira Hovakimyan, Hrant Khachatrian
摘要:Foundation models have advanced machine learning across various modalities, including images. Recently multiple teams trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least in the small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.


【19】Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data
标题:基于地面可见光相机和飞行监视数据的凝结尾迹-航班归因
链接:https://arxiv.org/abs/2510.16891

作者:Ramon Dalmau, Gabriel Jarry, Philippe Very
摘要:Aviation's non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation's CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.
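One attribution strategy such a framework could accommodate can be sketched as follows (every detail here is illustrative): observed and theoretical contrails are reduced to 2D line segments, a midpoint-plus-orientation distance is computed, and a one-to-one assignment is found with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def seg_distance(a, b, w_angle=50.0):
    """a, b: segments (x1, y1, x2, y2); mixes midpoint distance and orientation gap."""
    ma, mb = (a[:2] + a[2:]) / 2, (b[:2] + b[2:]) / 2
    ang_a = np.arctan2(a[3] - a[1], a[2] - a[0])
    ang_b = np.arctan2(b[3] - b[1], b[2] - b[0])
    dang = np.abs(np.arctan2(np.sin(ang_a - ang_b), np.cos(ang_a - ang_b)))
    dang = min(dang, np.pi - dang)              # contrail segments are undirected
    return np.linalg.norm(ma - mb) + w_angle * dang

def attribute(observed, theoretical):
    cost = np.array([[seg_distance(o, t) for t in theoretical] for o in observed])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost[rows, cols]

obs = np.array([[0, 0, 100, 10], [20, 80, 120, 70]], float)      # detected contrails
theo = np.array([[22, 78, 118, 72], [2, -1, 98, 12], [50, 50, 60, 150]], float)  # advected tracks
print(attribute(obs, theo))
```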


【20】Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
标题:Uniworld-V2:利用扩散负感知微调和MLLM隐式反馈加强图像编辑
链接:https://arxiv.org/abs/2510.16888

作者:Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Li Yuan
摘要:Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
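A hedged sketch of the two ingredients named above, with all specifics assumed rather than taken from the paper: (1) an MLLM judge's output logits for a yes/no verdict are turned into a scalar reward, and (2) groups whose rewards barely differ are dropped, since near-constant groups mostly contribute scoring noise to the policy update:

```python
import torch

def logit_reward(yes_logit, no_logit):
    """Reward = probability mass the judge places on 'yes, the edit follows the instruction'."""
    return torch.softmax(torch.stack([yes_logit, no_logit]), dim=0)[0]

def filter_groups(group_rewards, min_std=0.05):
    """group_rewards: (G, K) rewards for K candidate edits per prompt; keep informative groups."""
    keep = group_rewards.std(dim=1) > min_std
    return group_rewards[keep], keep

# Toy judge outputs for 3 prompts x 4 candidate edits each.
yes = torch.tensor([[2.0, 1.5, -0.5, 0.1], [0.9, 1.0, 1.1, 1.0], [3.0, -2.0, 0.5, 1.5]])
no = torch.zeros_like(yes)
rewards = torch.softmax(torch.stack([yes, no]), dim=0)[0]
kept, mask = filter_groups(rewards)
print(mask)       # the nearly-constant middle group is filtered out
```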


【21】2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting
标题:2DGS-R:重新审视2D高斯飞溅中的法向一致性正则化
链接:https://arxiv.org/abs/2510.16837

作者:Haofan Ren, Qingsong Yan, Ming Lu, Rongfeng Lu, Zunjie Zhu
摘要:Recent advancements in 3D Gaussian Splatting (3DGS) have greatly influenced neural fields, as it enables high-fidelity rendering with impressive visual quality. However, 3DGS has difficulty accurately representing surfaces. In contrast, 2DGS transforms the 3D volume into a collection of 2D planar Gaussian disks. Despite advancements in geometric fidelity, rendering quality remains compromised, highlighting the challenge of achieving both high-quality rendering and precise geometric structures. This indicates that optimizing both geometric and rendering quality in a single training stage is currently unfeasible. To overcome this limitation, we present 2DGS-R, a new method that uses a hierarchical training approach to improve rendering quality while maintaining geometric accuracy. 2DGS-R first trains the original 2D Gaussians with the normal consistency regularization. Then 2DGS-R selects the 2D Gaussians with inadequate rendering quality and applies a novel in-place cloning operation to enhance the 2D Gaussians. Finally, we fine-tune the 2DGS-R model with opacity frozen. Experimental results show that compared to the original 2DGS, our method requires only 1\% more storage and minimal additional training time. Despite this negligible overhead, it achieves high-quality rendering results while preserving fine geometric structures. These findings indicate that our approach effectively balances efficiency with performance, leading to improvements in both visual fidelity and geometric reconstruction accuracy.


【22】Personalized Image Filter: Mastering Your Photographic Style
标题:个性化图像滤镜:掌握您的摄影风格
链接:https://arxiv.org/abs/2510.16791

作者:Chengxuan Zhu, Shuchen Weng, Jiacong Fang, Peixuan Zhang, Si Li, Chao Xu, Boxin Shi
摘要:Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. But learning and transferring photographic style need a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we proposed a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: https://pif.pages.dev/


【23】End-to-end Listen, Look, Speak and Act
标题:端到端的听、看、说与行动
链接:https://arxiv.org/abs/2510.16756

作者:Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang
备注:22 pages, 8 figures
摘要:Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.


【24】Pursuing Minimal Sufficiency in Spatial Reasoning
标题:追求空间推理中的最小充分性
链接:https://arxiv.org/abs/2510.16688

作者:Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
摘要:Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from \textit{expert models}. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.


【25】SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
标题:SHIELD:通过偏差和漏洞防御抑制LVLM编码器中的幻觉
链接:https://arxiv.org/abs/2510.16596

作者:Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu
摘要:Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.
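The contrastive-decoding ingredient can be sketched as below (the token re-weighting and noise-token strategies are omitted, and the exact formulation is our assumption): next-token logits from the clean image are contrasted against logits obtained under an adversarially perturbed image, down-weighting tokens that the perturbed branch also favours:

```python
import torch

def contrastive_decode(logits_clean, logits_perturbed, alpha=1.0, plaus_thresh=0.1):
    probs = torch.softmax(logits_clean, dim=-1)
    # Restrict to a plausibility set so rare tokens cannot be promoted spuriously.
    keep = probs >= plaus_thresh * probs.max()
    scores = (1 + alpha) * logits_clean - alpha * logits_perturbed
    scores = scores.masked_fill(~keep, float("-inf"))
    return scores.argmax(dim=-1)

clean = torch.tensor([2.0, 1.8, 0.2, -1.0])       # token logits given the clean image
pert = torch.tensor([2.1, 0.3, 0.1, -1.2])        # logits given the adversarially perturbed image
print(contrastive_decode(clean, pert))            # picks token 1 instead of token 0
```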


【26】VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion
标题:VIPAMIN:通过嵌入选择和子空间扩展实现视觉提示初始化
链接:https://arxiv.org/abs/2510.16446

作者:Jaekyun Park, Hye Won Chung
备注:NeurIPS 2025
摘要:In the era of large-scale foundation models, fully fine-tuning pretrained networks for each downstream task is often prohibitively resource-intensive. Prompt tuning offers a lightweight alternative by introducing tunable prompts while keeping the backbone frozen. However, existing visual prompt tuning methods often fail to specialize the prompts or enrich the representation space--especially when applied to self-supervised backbones. We show that these limitations become especially pronounced in challenging tasks and data-scarce settings, where effective adaptation is most critical. In this work, we introduce VIPAMIN, a visual prompt initialization strategy that enhances adaptation of self-supervised models by (1) aligning prompts with semantically informative regions in the embedding space, and (2) injecting novel representational directions beyond the pretrained subspace. Despite its simplicity--requiring only a single forward pass and lightweight operations--VIPAMIN consistently improves performance across diverse tasks and dataset sizes, setting a new state of the art in visual prompt tuning. Our code is available at https://github.com/iamjaekyun/vipamin.
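A hedged sketch of the two-part initialization idea, with all specifics being our assumptions: half of the prompts are initialized from semantically informative regions of the embedding space (here, k-means centroids of patch embeddings), and the rest along directions orthogonal to the dominant pretrained subspace found via an SVD:

```python
import torch

def vipamin_style_init(patch_emb, n_prompts=8, subspace_rank=32):
    """patch_emb: (N, D) patch embeddings collected with one frozen forward pass."""
    n_sel = n_prompts // 2
    # (1) Embedding selection: a crude k-means in torch (a few iterations is enough here).
    centroids = patch_emb[torch.randperm(len(patch_emb))[:n_sel]].clone()
    for _ in range(10):
        assign = torch.cdist(patch_emb, centroids).argmin(dim=1)
        for c in range(n_sel):
            if (assign == c).any():
                centroids[c] = patch_emb[assign == c].mean(dim=0)
    # (2) Subspace expansion: directions beyond the top-r pretrained subspace.
    _, _, vh = torch.linalg.svd(patch_emb - patch_emb.mean(0), full_matrices=False)
    novel_dirs = vh[subspace_rank:subspace_rank + (n_prompts - n_sel)]
    novel = novel_dirs * patch_emb.norm(dim=1).mean()       # match the embedding scale
    return torch.cat([centroids, novel], dim=0)             # (n_prompts, D) prompt init

emb = torch.randn(1024, 192)                                 # stand-in ViT patch embeddings
print(vipamin_style_init(emb).shape)                         # torch.Size([8, 192])
```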


【27】LightGlueStick: a Fast and Robust Glue for Joint Point-Line Matching
标题:LightGlueStick:一种用于点线联合匹配的快速鲁棒胶水
链接:https://arxiv.org/abs/2510.16438

作者:Aidyn Ubingazhibov, Rémi Pautrat, Iago Suárez, Shaohui Liu, Marc Pollefeys, Viktor Larsson
备注:Accepted at ICCVW 2025
摘要:Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines are the local feature matchers, establishing correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices.   Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at https://github.com/aubingazhib/LightGlueStick.


【28】Beyond Fixed Anchors: Precisely Erasing Concepts with Sibling Exclusive Counterparts
标题:超越固定锚点:利用同级互斥对应概念精确擦除概念
链接:https://arxiv.org/abs/2510.16342

作者:Tong Zhang, Ru Zhang, Jianyi Liu, Zhen Yang, Gongshen Liu
摘要:Existing concept erasure methods for text-to-image diffusion models commonly rely on fixed anchor strategies, which often lead to critical issues such as concept re-emergence and erosion. To address this, we conduct causal tracing to reveal the inherent sensitivity of erasure to anchor selection and define Sibling Exclusive Concepts as a superior class of anchors. Based on this insight, we propose \textbf{SELECT} (Sibling-Exclusive Evaluation for Contextual Targeting), a dynamic anchor selection framework designed to overcome the limitations of fixed anchors. Our framework introduces a novel two-stage evaluation mechanism that automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts. Extensive evaluations demonstrate that SELECT, as a universal anchor solution, not only efficiently adapts to multiple erasure frameworks but also consistently outperforms existing baselines across key performance metrics, averaging only 4 seconds for anchor mining of a single concept.


【29】On the Provable Importance of Gradients for Language-Assisted Image Clustering
标题:论梯度对语言辅助图像聚类的可证明重要性
链接:https://arxiv.org/abs/2510.16335

作者:Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang
备注:revised and extended version of ICCV2025
摘要:This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed as GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves the state-of-the-art clustering performance on various benchmarks.
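One common instantiation of this kind of gradient-magnitude score can be sketched as follows; the paper's exact target distribution and parameterisation may differ, so every detail below is an assumption: for a candidate noun we form similarity logits against the image set, take the cross-entropy between a target distribution and the softmax output, and use the gradient norm with respect to a probe projection as the positiveness score:

```python
import torch

def gradnorm_score(noun_feat, image_feats, target=None):
    """noun_feat: (D,) text embedding; image_feats: (N, D) image embeddings."""
    w = torch.eye(noun_feat.numel(), requires_grad=True)      # probe projection
    logits = image_feats @ (w @ noun_feat)                    # (N,) similarity logits
    probs = torch.softmax(logits, dim=0)
    if target is None:                                        # default target: uniform
        target = torch.full_like(probs, 1.0 / probs.numel())
    loss = -(target * torch.log(probs + 1e-12)).sum()         # cross-entropy
    (grad,) = torch.autograd.grad(loss, w)
    return grad.norm().item()                                 # gradient magnitude = score

imgs = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
noun_close = torch.nn.functional.normalize(imgs.mean(0), dim=0)    # noun near the image set
noun_far = torch.nn.functional.normalize(torch.randn(128), dim=0)  # unrelated noun
print(gradnorm_score(noun_close, imgs), gradnorm_score(noun_far, imgs))
```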


【30】RL makes MLLMs see better than SFT
标题:RL让MLLM比SFT看得更好
链接:https://arxiv.org/abs/2510.16333

作者:Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
摘要:A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight-namely, the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/


【31】Proactive Scene Decomposition and Reconstruction
标题:主动场景分解和重建
链接:https://arxiv.org/abs/2510.16272

作者:Baicheng Li, Zike Yan, Dong Wu, Hongbin Zha
摘要:Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.


【32】Data-Centric AI for Tropical Agricultural Mapping: Challenges, Strategies and Scalable Solutions
标题:以数据为中心的热带农业制图人工智能:挑战、策略和可扩展的解决方案
链接:https://arxiv.org/abs/2510.16207

作者:Mateus Pinto da Silva, Sabrina P. L. P. Correa, Hugo N. Oliveira, Ian M. Nunes, Jefersson A. dos Santos
备注:5 pages, 1 figure
摘要:Mapping agriculture in tropical areas through remote sensing presents unique challenges, including the lack of high-quality annotated data, the elevated costs of labeling, data variability, and regional generalisation. This paper advocates a Data-Centric Artificial Intelligence (DCAI) perspective and pipeline, emphasizing data quality and curation as key drivers for model robustness and scalability. It reviews and prioritizes techniques such as confident learning, core-set selection, data augmentation, and active learning. The paper highlights the readiness and suitability of 25 distinct strategies in large-scale agricultural mapping pipelines. The tropical context is of high interest, since high cloudiness, diverse crop calendars, and limited datasets limit traditional model-centric approaches. This tutorial outlines practical solutions as a data-centric approach for curating and training AI models better suited to the dynamic realities of tropical agriculture. Finally, we propose a practical pipeline using the 9 most mature and straightforward methods that can be applied to a large-scale tropical agricultural mapping project.


【33】Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
标题:透视大脑:用fMRI解码视觉刺激的新见解
链接:https://arxiv.org/abs/2510.16196

作者:Zheng Huang, Enpei Zhang, Yinghao Cai, Weikang Qiu, Carl Yang, Elynn Chen, Xiang Zhang, Rex Ying, Dawei Zhou, Yujun Yan
摘要:Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.


【34】Cost Savings from Automatic Quality Assessment of Generated Images
标题:通过对生成的图像进行自动质量评估节省成本
链接:https://arxiv.org/abs/2510.16179

作者:Xavier Giro-i-Nieto, Nefeli Andreou, Anqi Liang, Manel Baradad, Francesc Moreno-Noguer, Aleix Martinez
摘要:Deep generative models have shown impressive progress in recent years, making it possible to produce high quality images with a simple text prompt or a reference image. However, state of the art technology does not yet meet the quality standards offered by traditional photographic methods. For this reason, production pipelines that use generated images often include a manual stage of image quality assessment (IQA). This process is slow and expensive, especially because of the low yield of automatically generated images that pass the quality bar. The IQA workload can be reduced by introducing an automatic pre-filtering stage, that will increase the overall quality of the images sent to review and, therefore, reduce the average cost required to obtain a high quality image. We present a formula that estimates the cost savings depending on the precision and pass yield of a generic IQA engine. This formula is applied in a use case of background inpainting, showcasing a significant cost saving of 51.61% obtained with a simple AutoML solution.
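The paper's exact formula is not reproduced here; the sketch below shows one plausible form under simple assumptions (manual review cost dominates and filter inference is free): if y is the raw pass yield of generated images and p is the precision of the pre-filter, a reviewer inspects about 1/y images per accepted image without the filter and about 1/p with it, giving a relative saving of roughly 1 - y/p:

```python
def review_cost_saving(pass_yield, filter_precision):
    """Fraction of manual reviews saved per accepted image (assumed model, see above)."""
    reviews_without = 1.0 / pass_yield
    reviews_with = 1.0 / filter_precision
    return 1.0 - reviews_with / reviews_without   # == 1 - pass_yield / filter_precision

# Example: a 30% raw yield with a 62% precision pre-filter saves about 51.6% of reviews,
# in the same ballpark as the figure quoted above (the paper's actual inputs may differ).
print(review_cost_saving(0.30, 0.62))
```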


【35】Automated C-Arm Positioning via Conformal Landmark Localization
标题:通过保形地标定位自动化C形臂定位
链接:https://arxiv.org/abs/2510.16160

作者:Ahmad Arrabi, Jay Hwasung Jung, Jax Luo, Nathan Franssen, Scott Raymond, Safwan Wshah
摘要:Accurate and reliable C-arm positioning is essential for fluoroscopy-guided interventions. However, clinical workflows rely on manual alignment that increases radiation exposure and procedural delays. In this work, we present a pipeline that autonomously navigates the C-arm to predefined anatomical landmarks utilizing X-ray images. Given an input X-ray image from an arbitrary starting location on the operating table, the model predicts a 3D displacement vector toward each target landmark along the body. To ensure reliable deployment, we capture both aleatoric and epistemic uncertainties in the model's predictions and further calibrate them using conformal prediction. The derived prediction regions are interpreted as 3D confidence regions around the predicted landmark locations. The training framework combines a probabilistic loss with skeletal pose regularization to encourage anatomically plausible outputs. We validate our approach on a synthetic X-ray dataset generated from DeepDRR. Results show not only strong localization accuracy across multiple architectures but also well-calibrated prediction bounds. These findings highlight the pipeline's potential as a component in safe and reliable autonomous C-arm systems. Code is available at https://github.com/AhmadArrabi/C_arm_guidance_APAH
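The conformal part can be sketched with standard split conformal prediction (a simplified stand-in: the paper additionally models aleatoric and epistemic uncertainty, which is omitted here); L-infinity residuals on a calibration set are turned into a calibrated cubic confidence region around each predicted 3D displacement:

```python
import numpy as np

def conformal_box(cal_pred, cal_true, alpha=0.1):
    """cal_pred, cal_true: (N, 3) calibration predictions / ground-truth displacements (mm).
    Returns a half-width r so that the cube pred +/- r covers the truth with prob ~1 - alpha."""
    scores = np.max(np.abs(cal_pred - cal_true), axis=1)       # L-infinity residuals
    n = len(scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)           # finite-sample correction
    return np.quantile(scores, q)

rng = np.random.default_rng(0)
true = rng.normal(size=(500, 3)) * 20.0                        # 3D displacements in mm
pred = true + rng.normal(size=(500, 3)) * np.array([2.0, 3.0, 5.0])
r = conformal_box(pred[:400], true[:400], alpha=0.1)           # calibrate on 400 cases
covered = np.all(np.abs(pred[400:] - true[400:]) <= r, axis=1).mean()
print(r, covered)                                              # empirical coverage near 0.90
```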


【36】GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer
标题:GuideFlow3D:用于外观迁移的优化引导整流流
链接:https://arxiv.org/abs/2510.16136

作者:Sayan Deb Sarkar, Sinisa Stekovic, Vincent Lepetit, Iro Armeni
备注:NeurIPS 2025. Project Page: this https URL
摘要:Transferring appearance to 3D assets using different representations of the appearance object - such as images or text - has garnered interest due to its wide range of applications in industries like gaming, augmented reality, and digital content creation. However, state-of-the-art methods still fail when the geometry between the input and appearance objects is significantly different. A straightforward approach is to directly apply a 3D generative model, but we show that this ultimately fails to produce appealing results. Instead, we propose a principled approach inspired by universal guidance. Given a pretrained rectified flow model conditioned on image or text, our training-free method interacts with the sampling process by periodically adding guidance. This guidance can be modeled as a differentiable loss function, and we experiment with two different types of guidance including part-aware losses for appearance and self-similarity. Our experiments show that our approach successfully transfers texture and geometric details to the input 3D asset, outperforming baselines both qualitatively and quantitatively. We also show that traditional metrics are not suitable for evaluating the task due to their inability of focusing on local details and comparing dissimilar inputs, in absence of ground truth data. We thus evaluate appearance transfer quality with a GPT-based system objectively ranking outputs, ensuring robust and human-like assessment, as further confirmed by our user study. Beyond showcased scenarios, our method is general and could be extended to different types of diffusion models and guidance functions.
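The guidance mechanism can be sketched as follows, with every component a toy stand-in: a pretrained rectified-flow velocity field is integrated with Euler steps, and every few steps the running sample is nudged by the gradient of a differentiable guidance loss:

```python
import torch

def velocity(x, t):                      # toy stand-in for the pretrained flow model
    return -x * (1.0 - t)

def guidance_loss(x, ref_mean):          # toy differentiable guidance (e.g. a feature statistic)
    return ((x.mean(dim=(1, 2, 3)) - ref_mean) ** 2).mean()

def guided_sample(shape, ref_mean, steps=50, guide_every=5, guide_lr=0.5):
    x = torch.randn(shape)
    for i in range(steps):
        t = i / steps
        x = x + velocity(x, t) / steps                      # Euler step along the flow
        if i % guide_every == 0:                            # periodically add guidance
            x = x.detach().requires_grad_(True)
            (grad,) = torch.autograd.grad(guidance_loss(x, ref_mean), x)
            x = (x - guide_lr * grad).detach()
    return x

out = guided_sample((1, 3, 16, 16), ref_mean=torch.tensor(0.25))
print(out.mean().item())
```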


【37】Aria Gen 2 Pilot Dataset
标题:Aria Gen 2试点数据集
链接:https://arxiv.org/abs/2510.16134

作者:Chen Kong, James Fort, Aria Kang, Jonathan Wittmer, Simon Green, Tianwei Shen, Yipu Zhao, Cheng Peng, Gustavo Solaira, Andrew Berkovich, Nikhil Raina, Vijay Baiyya, Evgeniy Oleinik, Eric Huang, Fan Zhang, Julian Straub, Mark Schwesinger, Luis Pesqueira, Xiaqing Pan, Jakob Julian Engel, Carl Ren, Mingfei Yan, Richard Newcombe
摘要:The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia'ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device's ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at projectaria.com, with open-source tools and usage examples provided in Project Aria Tools.


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递

【声明】内容源于网络