
计算机视觉与模式识别学术速递[8.20]

2025-08-20
导读:cs.CV 方向,今日共计118篇

点击阅读原文访问arxivdaily.com,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏等功能!




大模型相关(9篇)

【1】RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
标题:RotBench:评估多模态大型语言模型识别图像旋转
链接:https://arxiv.org/abs/2508.13968

作者:u, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
备注:20 pages. Code and data: this https URL
摘要:We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0{\deg}, 90{\deg}, 180{\deg}, and 270{\deg}. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0{\deg}) images, while certain models are able to identify upside-down (180{\deg}) images. None can reliably distinguish between 90{\deg} and 270{\deg}. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90{\deg} and 270{\deg} rotations, despite substantially improving the identification of 180{\deg} images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
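摘要中提到的"同时展示不同旋转方向的图像并投票"的设置,可以用如下简化示意来说明。纯演示性质:`predict` 是一个假设的 MLLM 查询接口,玩具模型复现了文中"能识别 0°/180°、混淆 90°/270°"的失败模式:

```python
from collections import Counter

ROTATIONS = (0, 90, 180, 270)

def vote_rotation(predict, image):
    """对多次旋转查询的结果做多数投票,估计图像的真实旋转角。

    predict(image, extra_rot) 模拟一次 MLLM 查询:在图像被额外旋转
    extra_rot 度之后,返回模型认为所见图像的旋转角(度)。
    """
    votes = []
    for extra in ROTATIONS:
        guess = predict(image, extra)
        # 减去额外施加的旋转,把猜测映射回原始图像坐标系
        votes.append((guess - extra) % 360)
    # Counter.most_common 在平票时按首次出现顺序取前者
    return Counter(votes).most_common(1)[0][0]

# 玩具模型:总能认出正置(0°)和倒置(180°)视图,
# 但把 90° 与 270° 一律答成 90°,对应上文报告的失败模式
def toy_predict(image, extra_rot):
    true_rot = (image + extra_rot) % 360  # 用整数编码图像的真实旋转角
    return true_rot if true_rot in (0, 180) else 90

print(vote_rotation(toy_predict, 180))  # 多数投票仍能恢复 180
```

即便单次查询混淆 90°/270°,跨旋转投票仍能纠正部分错误,这与摘要中"投票设置可提升较弱模型表现"的结论一致。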


【2】MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
标题:MME-SCI:多模态大型语言模型的全面而具有挑战性的科学基准
链接:https://arxiv.org/abs/2508.13938

作者:Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
备注:9 pages, 6 figures, work in progress
摘要:近年来,多模态大型语言模型(MLLM)在各个领域取得了重大进展,相应的评估基准也不断得到完善和改进。在这一过程中,科学领域的基准在评估MLLM的推理能力方面发挥了重要作用。然而,现有基准仍然面临三个关键挑战:1)对多语言场景下模型推理能力的评估不足;2)对MLLM的全面模态覆盖评估不足;3)缺乏对科学知识点的细粒度注释。为了解决这些差距,我们提出了MME-SCI,一个全面而具有挑战性的基准。我们精心收集了1,019个高质量问答对,涉及3种不同的评价模式。这些对涵盖数学、物理、化学和生物四个科目,并支持五种语言:中文、英文、法文、西班牙文和日文。我们在16个开源模型和4个闭源模型上进行了广泛的实验,结果表明MME-SCI对现有的MLLM具有广泛的挑战性。例如,在仅图像评估模式下,o4-mini在数学、物理、化学和生物方面的准确率分别仅为52.11%、24.73%、36.57%和29.80%,与现有基准相比,难度水平明显更高。更重要的是,利用MME-SCI的多语言和细粒度知识属性,我们深入分析了现有模型的性能,并确定了它们在特定领域的弱点。数据和评估代码可在https://github.com/JCruan519/MME-SCI上获得。
摘要:Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.
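摘要中"利用多语言和细粒度知识属性深入分析模型表现",核心就是按元数据字段分组统计准确率。下面是一个极简示意(纯 Python;字段名与数据均为虚构):

```python
from collections import defaultdict

def accuracy_by(records, key):
    """按某个元数据字段(科目、语言等)分组计算准确率。"""
    hit = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        g = r[key]
        total[g] += 1
        hit[g] += r["correct"]  # correct 取 0/1
    return {g: hit[g] / total[g] for g in total}

# 虚构的评测记录
records = [
    {"subject": "math", "lang": "zh", "correct": 1},
    {"subject": "math", "lang": "en", "correct": 0},
    {"subject": "physics", "lang": "zh", "correct": 0},
    {"subject": "physics", "lang": "en", "correct": 0},
]
print(accuracy_by(records, "subject"))  # {'math': 0.5, 'physics': 0.0}
```

同一份记录换一个 `key`(如 `"lang"`)即可得到按语言的细分结果。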


【3】SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation
标题:SAGA:学习信号对齐分布以改进文本到图像的生成
链接:https://arxiv.org/abs/2508.13866

作者:al, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto
摘要:最先进的文本到图像模型产生了视觉上令人印象深刻的结果,但通常难以与文本提示精确对齐,导致关键元素丢失或不同概念的意外混合。我们提出了一种新的方法,学习以目标提示为条件的高成功率分布,确保生成的图像忠实地反映相应的提示。我们的方法在去噪过程中显式地对信号分量进行建模,提供细粒度的控制,以减轻过度优化和分布外的伪影。此外,我们的框架无需训练,并能与现有的扩散和流匹配架构无缝集成。它还支持额外的条件模态(例如边界框)以增强空间对齐。大量的实验表明,我们的方法优于当前最先进的方法。该代码可在https://github.com/grimalPaul/gsn-factory上获得。
摘要:State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.


【4】Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
标题:通过中间投影器引导增强针对大型视觉语言模型的有针对性对抗攻击
链接:https://arxiv.org/abs/2508.13739

作者:o, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao
摘要:有针对性的对抗性攻击对于在实际部署之前主动识别视觉语言模型中的安全缺陷至关重要。然而,目前的方法在编码器级别扰动图像,以最大化与目标文本或参考图像的全局相似性,将丰富的视觉语义折叠成单一的全局向量。这限制了攻击的粒度,阻碍了细粒度的操作,例如在保留背景的同时修改汽车。此外,这些方法在很大程度上忽略了投影器模块,这是VLM中视觉编码器和语言模型之间的关键语义桥梁,因此无法破坏VLM内完整的视觉语言对齐管道,限制了攻击有效性。为了解决这些问题,我们提出了中间投影器引导攻击(IPGA),这是首个利用投影器模块的中间阶段进行攻击的方法,特别是广泛采用的Q-Former,它将全局图像嵌入转换为细粒度的视觉特征。这使得能够通过对语义上有意义的视觉标记而不是单一全局表示进行操作,来更精确地控制对抗性扰动。具体来说,IPGA利用仅在第一个视觉语言对齐阶段预训练、未经LLM微调的Q-Former,这提高了攻击有效性以及跨不同VLM的可迁移性。此外,我们提出了残差查询对齐(RQA)来保留不相关的视觉内容,从而产生更可控、更精确的对抗操作。大量的实验表明,我们的攻击方法在黑盒环境下的标准全局图像字幕任务和细粒度视觉问答任务中始终优于现有方法。此外,IPGA成功地迁移到多个商业VLM,包括Google Gemini和OpenAI GPT。
摘要:Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the language model in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in black-box environment. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.


【5】Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
标题:针对传统深度学习模型评估用于面部情感识别的开源视觉语言模型
链接:https://arxiv.org/abs/2508.13524

作者:shna Mulukutla, Sai Supriya Pavarala, Srinivasa Raju Rudraraju, Sridevi Bonthu
摘要:面部情绪识别(FER)对于人机交互和心理健康诊断等应用至关重要。这项研究首次将开源视觉语言模型(VLM)(包括Phi-3.5 Vision和CLIP)与传统深度学习模型VGG19、ResNet-50和EfficientNet-B0在具有挑战性的FER-2013数据集上进行了实证比较,该数据集包含7个情感类别的35,887张低分辨率灰度图像。为了解决VLM训练假设和FER数据的噪声性质之间的不匹配,我们引入了一种新的管道,将基于GFPGAN的图像恢复与FER评估相结合。结果显示,传统模型,特别是EfficientNet-B0(86.44%)和ResNet-50(85.72%),显著优于CLIP(64.07%)和Phi-3.5 Vision(51.66%)等VLM,突出了VLM在低质量视觉任务中的局限性。除了使用查准率、查全率、F1分数和准确率进行性能评估外,我们还提供了详细的计算成本分析,涵盖预处理、训练、推理和评估阶段,为部署提供了实用的见解。这项工作强调了需要使VLM适应噪声环境,并为未来的情感识别研究提供了一个可重复的基准。
摘要:Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
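上文使用查准率、查全率和 F1 分数进行评估;由混淆计数计算这些指标的方式可示意如下(数值仅为虚构示例):

```python
def prf1(tp, fp, fn):
    """由单类混淆计数计算查准率、查全率与 F1。"""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 某一情绪类别的虚构计数:80 个真阳性、20 个假阳性、10 个假阴性
p, r, f = prf1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```

对七个情绪类别分别计算后取宏平均,即可得到论文报告的整体 F1。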


【6】STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
标题:STER-VLM:具有增强参考的时空视觉语言模型
链接:https://arxiv.org/abs/2508.13470

作者:Nguyen-Nhu, Triet Dao Hoang Minh, Dat To-Thanh, Phuc Le-Gia, Tuan Vo-Lan, Tien-Huy Nguyen
备注:ICCV Workshop 2025
摘要:视觉语言模型(VLM)已经成为实现自动交通分析的强大工具;然而,目前的方法通常需要大量的计算资源,并且难以实现细粒度的时空理解。本文介绍了STER-VLM,一个计算效率高的框架,通过以下方式提高VLM的性能:(1)字幕分解,分别处理空间和时间信息;(2)结合最佳视图过滤的时间帧选择,以获取充分的时间信息;(3)参考驱动的理解,捕捉细粒度的运动和动态上下文;以及(4)精心设计的视觉/文本提示技术。在WTS \cite{kong2024wts}和BDD \cite{BDD}数据集上的实验结果表明,该方法在语义丰富性和交通场景解释方面均有大幅提升。我们的框架在2025年AI城市挑战赛第二赛道中取得了55.655的测试分数,证明了它在推进面向现实应用的资源高效且准确的交通分析方面的有效性。
摘要:Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, and (3) reference-driven understanding for capturing fine-grained motion and dynamic context and (4) curated visual/textual prompt techniques. Experimental results on the WTS \cite{kong2024wts} and BDD \cite{BDD} datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.


【7】Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies
标题:小语言模型在医学影像分类中的应用,重点关注提示策略
链接:https://arxiv.org/abs/2508.13378

作者:ng, Ziwei Wang, Jiachen Zhong, Di Zhu, Weiyi Li
备注:Under Review
摘要:大型语言模型(LLM)在自然语言处理和多模态理解方面表现出了卓越的能力。然而,它们的高计算成本、有限的可访问性和数据隐私问题阻碍了它们在资源受限的医疗环境中的采用。本研究调查了小语言模型(SLM)在医学成像分类任务中的性能,比较了不同的模型和提示设计,以确定准确性和可用性的最佳组合。使用NIH胸部X射线数据集,我们在三种提示策略下评估了多个SLM对胸部X射线体位(前后位[AP]与后前位[PA])进行分类的任务:基线指令、增量汇总提示和基于纠正的反思提示。我们的研究结果表明,某些SLM通过精心设计的提示实现了具有竞争力的准确性,这表明提示工程可以大幅提高SLM在医疗保健应用中的性能,而无需最终用户具备深厚的AI专业知识。
摘要:Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.
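摘要描述了三种提示策略(基线指令、增量汇总、基于纠正的反思提示)。以下模板仅为示意性假设,并非论文使用的原文提示词:

```python
# 三种策略的示意模板;论文原始措辞未公开,此处措辞为假设
PROMPTS = {
    "baseline": (
        "You are a radiology assistant. Is this chest X-ray view "
        "anteroposterior (AP) or posteroanterior (PA)? Answer AP or PA."
    ),
    "incremental_summary": (
        "Step 1: describe scapula position and heart size.\n"
        "Step 2: summarize which cues suggest AP vs PA.\n"
        "Step 3: answer AP or PA."
    ),
    "reflective_correction": (
        "Give an initial AP/PA answer, then check it against the visible "
        "cues and correct it if the evidence disagrees. Final answer: AP or PA."
    ),
}

def build_prompt(strategy, note=""):
    """按策略名组装分类提示词,可附加额外上下文。"""
    if strategy not in PROMPTS:
        raise ValueError(f"unknown strategy: {strategy}")
    return PROMPTS[strategy] + (f"\nContext: {note}" if note else "")

print(build_prompt("baseline"))
```

实际评测即是把每张 X 光片连同所选模板一起送入 SLM,再比较三种策略下的准确率。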


【8】Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
标题:Prune2Drive:一个即插即用框架,用于加速自动驾驶中的视觉语言模型
链接:https://arxiv.org/abs/2508.13305

作者:ong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang
备注:13 pages, 5 figures
摘要:视觉语言模型(VLM)已经成为自动驾驶(AD)中一个很有前途的范式,通过联合建模视觉输入和自然语言指令,为感知、推理和决策提供了一个统一的框架。然而,它们的部署受到处理高分辨率多视图图像时产生的显著计算开销的阻碍,这是具有六个或更多同步相机的AD系统中的标准设置。这种开销源于编码过程中生成的大量视觉标记,由于自注意力的二次复杂度而增加了推理延迟和内存消耗。为了解决这些挑战,我们提出了Prune2Drive,一个即插即用的视觉标记修剪框架,用于自动驾驶中的多视图VLM。Prune2Drive引入了两个核心创新:(i)受最远点采样启发的多样性感知令牌选择机制,其优先考虑视图之间的语义和空间覆盖,而不是仅仅依赖于注意力分数,以及(ii)视图自适应修剪控制器,其基于各相机视图对下游驾驶任务的重要性来学习每个视图的最佳修剪比率。与以前的方法不同,Prune2Drive不需要模型重新训练或访问注意力图,使其与现代高效的注意力实现兼容。在两个大规模多视图驾驶基准测试DriveLM和DriveLMM-o1上进行的大量实验表明,Prune2Drive在保持或提高任务性能的同时实现了显著的加速和内存节省。当仅保留10%的视觉令牌时,我们的方法在预填充阶段实现了6.40$\times$的加速,仅消耗原始FLOP的13.4%,在DriveLM基准测试中性能仅下降3%。
摘要:Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
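Prune2Drive 的多样性感知令牌选择受最远点采样(FPS)启发:每一步选出距离已保留集合最远的令牌,从而优先覆盖语义/空间多样性。下面是纯 Python 的极简示意(真实系统作用于高维视觉令牌嵌入,这里以二维向量代替):

```python
import math

def farthest_point_sampling(tokens, keep):
    """选出 keep 个令牌下标,使其尽量远离彼此(覆盖最大化)。

    tokens 是嵌入向量列表(每项为浮点数列表)。从第 0 个令牌出发,
    每步加入与"已保留集合"最小距离最大的那个令牌。
    """
    kept = [0]
    # min_dist[i]:令牌 i 到最近的已保留令牌的距离
    min_dist = [math.dist(t, tokens[0]) for t in tokens]
    while len(kept) < keep:
        nxt = max(range(len(tokens)), key=lambda i: min_dist[i])
        kept.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], math.dist(t, tokens[nxt]))
    return sorted(kept)

# 两簇近邻点加一个离群点:FPS 会跳过近邻重复,保留分散的代表
tokens = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [0, 9]]
print(farthest_point_sampling(tokens, 3))  # [0, 3, 4]
```

注意该机制完全不依赖注意力分数,这正是它能与各种高效注意力实现兼容的原因。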


【9】DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
标题:点金-OCR-R1:通过推理和工具交织的视觉语言模型增强OCR能力
链接:https://arxiv.org/abs/2508.13238

作者:, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang
摘要:大型视觉语言模型(LVLM)的最新进展实现了端到端文档图像解析的新范式,在文本、表格和公式识别等光学字符识别(OCR)任务中表现出色。然而,与大型语言模型(LLM)类似,生成式LVLM容易产生幻觉,即生成输入图像中不存在的单词。此外,LVLM是为通用目的而设计的,与在特定领域数据集上训练的专家模型相比,它在OCR任务上的效果往往较差。在本文中,我们提出了DianJin-OCR-R1,这是一个推理增强框架,旨在通过训练推理和工具交织的VLM来解决这些限制。给定识别指令,我们的DianJin-OCR-R1模型首先通过自身的OCR能力识别输入图像中的内容,然后调用其他工具(即其他专家模型)以获得其结果作为参考,最后再次查看图像并重新思考推理过程,给出最终识别的内容。由于专家模型的架构是为特定的OCR任务量身定制的,不易产生幻觉,因此它们的结果可以帮助VLM减轻幻觉。此外,专家模型通常规模较小,易于迭代,从而能够以较低的成本提高VLM的性能。我们在ReST和OmniDocBench上对模型进行了测试,实验结果表明,我们的DianJin-OCR-R1模型的性能始终优于非推理版本和专家OCR模型,证明了该方法的有效性。
摘要:Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations--generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, and then calls other tools (i.e., other expert models) to obtain their results as references, finally looks again the image and rethinks about the reasoning process to provide the final recognized content. Since architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method.


Transformer(3篇)

【1】ViT-FIQA: Assessing Face Image Quality using Vision Transformers
标题:ViT-FIQA:使用Vision Transformers评估面部图像质量
链接:https://arxiv.org/abs/2508.13957

作者:zori, Fadi Boutros, Naser Damer
备注:Accepted at the IEEE/CVF International Conference on Computer Vision Workshops 2025 (ICCVW 2025)
摘要:人脸图像质量评估(FIQA)旨在预测人脸图像在人脸识别(FR)系统中的效用。最先进的FIQA方法主要依赖于卷积神经网络(CNN),这使得Vision Transformer(ViT)架构的潜力未得到充分挖掘。这项工作提出了ViT-FIQA,一种新的方法,扩展标准的ViT骨干,最初为FR优化,通过一个可学习的质量令牌,旨在预测任何给定的人脸图像的标量效用得分。可学习的质量令牌与标准图像块令牌连接,并且整个序列由ViT编码器通过全局自注意处理,以聚合所有块的上下文信息。在主干的输出端,ViT-FIQA分支为两个头:(1)补丁令牌通过完全连接的层,通过边缘惩罚softmax损失来学习有区别的面部表示,以及(2)质量令牌被馈送到回归头,以学习预测面部样本的效用。在具有挑战性的基准测试和几个FR模型上进行的广泛实验,包括基于CNN和ViT的架构,表明ViT-FIQA始终达到顶级性能。这些结果强调了基于transformer的架构在建模人脸图像效用方面的有效性,并突出了ViTs作为未来FIQA研究可扩展基础的潜力。https://cutt.ly/irHlzXUC
摘要:Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research https://cutt.ly/irHlzXUC.


【2】A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
标题:使用放射学报告进行可解释癌症图像分割的完全基于Transformer的多模态框架
链接:https://arxiv.org/abs/2508.13796

作者:dahada, Isabel Sassoon, Kate Hone, Yongmin Li
摘要:我们介绍Med-CTX,一个完全基于Transformer的多模态框架,用于可解释的乳腺癌超声分割。我们集成临床放射学报告以提高性能和可解释性。Med-CTX通过使用结合ViT和Swin Transformer的双分支视觉编码器以及不确定性感知融合来实现精确的病变描绘。使用BI-RADS语义结构化的临床语言由BioClinicalBERT编码,并通过跨模态注意力与视觉特征相结合,使模型能够提供具有临床依据的模型生成解释。我们的方法同时生成分割掩码、不确定性图和诊断依据,提高了计算机辅助诊断的信心和透明度。在BUS-BRA数据集上,Med-CTX实现了99%的Dice得分和95%的IoU,超越了现有基线U-Net、ViT和Swin。消融研究显示,移除临床文本会使Dice评分下降5.4%、CIDEr下降31%,可见临床文本对分割准确性和解释质量起着关键作用。Med-CTX实现了良好的多模态对齐(CLIP评分:85%)和更好的置信度校准(ECE:3.2%),为值得信赖的多模态医疗架构设立了新标准。
摘要:We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cross-modal attention, allowing the model to provide clinically grounded, model generated explanations. Our methodology generates segmentation masks, uncertainty maps, and diagnostic rationales all at once, increasing confidence and transparency in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and Swin. Clinical text plays a key role in segmentation accuracy and explanation quality, as evidenced by ablation studies that show a -5.4% decline in Dice score and -31% in CIDEr. Med-CTX achieves good multimodal alignment (CLIP score: 85%) and increased confidence calibration (ECE: 3.2%), setting a new bar for trustworthy, multimodal medical architecture.


【3】Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
标题:用于肾结石图像分类的视觉Transformer:与CNN的比较研究
链接:https://arxiv.org/abs/2508.13461

作者:s-Amezcua, Francisco Lopez-Tiro, Clement Larose, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz, Christian Daul
摘要:根据内窥镜图像对肾结石进行分类对于个性化治疗和预防复发至关重要。虽然卷积神经网络(CNN)在这项任务中表现出了希望,但它们捕获远程依赖关系的能力有限,可能会阻碍可变成像条件下的性能。本研究对视觉Transformer(ViT)和基于CNN的模型进行了比较分析,评估了它们在两个离体数据集(包括CCD相机和柔性输尿管镜图像)上的性能。在ImageNet-21k上预训练的ViT-base模型在多种成像条件下的表现始终优于ResNet50基线。例如,在视觉上最复杂的子集(内窥镜图像的切片块)中,ViT模型实现了95.2%的准确率和95.1%的F1评分,而ResNet50分别为64.5%和59.3%。在来自CCD相机图像的混合视图子集中,ViT达到87.1%的准确率,而CNN为78.4%。这些改进同样体现在查准率和查全率上。结果表明,基于ViT的架构提供了卓越的分类性能,并为肾结石图像分析提供了传统CNN的可扩展替代方案。
摘要:Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.


生成|GAN相关(10篇)

【1】InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
标题:InfiniteTalk:用于稀疏帧视频配音的音频驱动视频生成
链接:https://arxiv.org/abs/2508.14033

作者:ang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei
备注:11 pages, 7 figures
摘要:最近在视频AIGC方面的突破为音频驱动的人类动画带来了一个变革性的时代。然而,传统的视频配音技术仍然局限于嘴部区域编辑,导致不和谐的面部表情和身体姿势,损害了观众的沉浸感。为了克服这一限制,我们引入了稀疏帧视频配音,这是一种新的范例,它在战略上保留了参考关键帧,以保持身份,标志性手势和相机轨迹,同时实现整体的,音频同步的全身运动编辑。通过批判性分析,我们确定了为什么天真的图像到视频模型在这项任务中失败,特别是他们无法实现自适应条件反射。为了解决这个问题,我们提出了InfiniteTalk,一个流音频驱动的生成器,专为无限长的长序列配音。该架构利用时间上下文帧进行无缝块间转换,并采用了一种简单而有效的采样策略,通过细粒度的参考帧定位来优化控制强度。对HDTF、CelebV-HQ和EMTD数据集的综合评估证明了最先进的性能。定量指标证实了卓越的视觉现实主义,情感连贯性和全身运动同步。
摘要:Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
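InfiniteTalk"利用时间上下文帧实现块间无缝衔接"的思路,可以归结为一种带重叠的分块调度:每个后续块把前一块末尾的若干帧作为上下文重新输入。以下为该思路的假设性示意(块长与上下文帧数均为虚构参数):

```python
def chunk_schedule(n_frames, chunk_len, context_len):
    """把任意长的配音任务切成带重叠的块。

    首块之后的每个块都以前一块最后 context_len 帧作为时间上下文,
    以保持过渡无缝。返回 (start, end) 下标对,end 为开区间端点。
    """
    if context_len >= chunk_len:
        raise ValueError("context_len must be smaller than chunk_len")
    chunks = []
    start = 0
    while start < n_frames:
        end = min(start + chunk_len, n_frames)
        chunks.append((start, end))
        if end == n_frames:
            break
        start = end - context_len  # 相邻块重叠 context_len 帧
    return chunks

print(chunk_schedule(100, 40, 8))  # [(0, 40), (32, 72), (64, 100)]
```

由于每块只依赖前一块的尾帧,这一调度可以流式地推进到任意长度,对应"无限长序列配音"的设定。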


【2】Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment
标题:基于物理的3D模拟用于包装稳定性评估中的合成数据生成和故障分析
链接:https://arxiv.org/abs/2508.13989

作者:ligardi, Pietro Musoni, Eleonora Iotti, Gianluca Contesso, Alessandro Dal Palù
摘要:托盘结构的设计和分析是保证包装运输安全的关键。随着物流行业需求的不断增长,利用先进技术开发自动化系统变得越来越重要。此外,塑料包装的广泛使用促使研究人员研究仍然符合安全标准的环保替代品。我们提出了一个完全可控和准确的物理仿真系统能够复制移动托盘的行为。它具有基于3D图形的虚拟环境,支持各种配置,包括可变包装布局、不同包装材料和不同的动态条件。这种创新的方法减少了物理测试的需求,降低了成本和环境影响,同时提高了托盘动态分析的测量精度。此外,我们还训练了一个深度神经网络来评估模拟器生成的渲染视频,作为托盘配置的碰撞测试预测器,进一步增强了系统在安全分析中的实用性。
摘要:The design and analysis of pallet setups are essential for ensuring safety of packages transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system's utility in safety analysis.


【3】Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation
标题:通过时步级反向传播对尖峰神经网络的时步压缩攻击
链接:https://arxiv.org/abs/2508.13812

作者:ang, Doohyun Kim, Sang-Ki Ko, Jinkyu Lee, Hyeongboo Baek, Brent ByungHoon Kang
备注:8 pages
摘要:最先进的(SOTA)基于梯度的针对尖峰神经网络(SNN)的对抗性攻击在很大程度上依赖于扩展FGSM和PGD框架,面临着一个关键限制:多时间步处理带来的大量攻击延迟,使其不适用于实际的实时应用。这种低效率源于它们被设计为ANN范式的直接扩展,未能利用关键的SNN特性。在本文中,我们提出了时间步压缩攻击(TCA),一种显著降低攻击延迟的新框架。TCA引入了基于对SNN行为关键洞察的两个组件。首先,时间步级反向传播(TLBP)基于我们的发现,即反向传播中用于生成扰动的全局时间信息对攻击的成功并不关键,从而能够进行逐时间步评估以实现提前停止。其次,对抗性膜电位复用(A-MPR)的动机是观察到最初的时间步被低效地用于累积膜电位,这一热身阶段可以预先计算并复用。我们在VGG-11和ResNet-17上使用CIFAR-10/100和CIFAR10-DVS数据集进行的实验表明,与SOTA方法相比,TCA在白盒和黑盒设置中分别将所需的攻击延迟降低了多达56.6%和57.1%,同时保持了相当的攻击成功率。
摘要:State-of-the-art (SOTA) gradient-based adversarial attacks on spiking neural networks (SNNs), which largely rely on extending FGSM and PGD frameworks, face a critical limitation: substantial attack latency from multi-timestep processing, rendering them infeasible for practical real-time applications. This inefficiency stems from their design as direct extensions of ANN paradigms, which fail to exploit key SNN properties. In this paper, we propose the timestep-compressed attack (TCA), a novel framework that significantly reduces attack latency. TCA introduces two components founded on key insights into SNN behavior. First, timestep-level backpropagation (TLBP) is based on our finding that global temporal information in backpropagation to generate perturbations is not critical for an attack's success, enabling per-timestep evaluation for early stopping. Second, adversarial membrane potential reuse (A-MPR) is motivated by the observation that initial timesteps are inefficiently spent accumulating membrane potential, a warm-up phase that can be pre-calculated and reused. Our experiments on VGG-11 and ResNet-17 with the CIFAR-10/100 and CIFAR10-DVS datasets show that TCA significantly reduces the required attack latency by up to 56.6% and 57.1% compared to SOTA methods in white-box and black-box settings, respectively, while maintaining a comparable attack success rate.
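A-MPR 的出发点是:最初若干时间步仅在干净输入上累积膜电位,这一"热身"状态可以预计算一次,在每次攻击迭代中直接复用,而不必重复模拟。下面用一个无重置、无脉冲的泄漏积分神经元做极简验证(beta、步数等参数均为示意):

```python
def lif_membrane(inputs, beta=0.9, v0=0.0):
    """沿时间步做泄漏积分:v <- beta*v + i(简化,无重置、无脉冲)。"""
    v = v0
    trace = []
    for i in inputs:
        v = beta * v + i
        trace.append(v)
    return trace

# A-MPR 思路示意:前 w 步只在干净输入上积累膜电位,
# 其末状态可预计算一次,再作为初值复用于每个扰动样本的剩余时间步
clean = [1.0] * 10
w = 4
warm_state = lif_membrane(clean[:w])[-1]       # 热身段:只算一次
rest = lif_membrane(clean[w:], v0=warm_state)  # 每次迭代从热身状态续算
full = lif_membrane(clean)                     # 对照:完整模拟全部 10 步
print(abs(full[-1] - rest[-1]) < 1e-9)         # True:续算与完整模拟一致
```

由于续算与完整模拟逐项执行同样的运算,二者末状态完全一致,热身段的计算因此可以在攻击迭代间摊销。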


【4】PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
标题:PersonaVlog:具有多智能体协作和迭代自我校正的个性化多模态Vlog生成
链接:https://arxiv.org/abs/2508.13602

作者:u, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu, Wenyue Li, Tianxiang Zheng, Qinglin Lu
备注:Project Page: this https URL
摘要:随着人们对短视频和个性化内容需求的不断增长,自动生成视频日志(Vlog)已成为多模态内容创作的一个关键方向。现有方法大多依赖预定义脚本,缺乏动态性和个性化表达。因此,迫切需要一种能够实现有效多模态协作和高度个性化的自动化Vlog生成方法。为此,我们提出了PersonaVlog,一个自动化的多模态风格化Vlog生成框架,可以根据给定的主题和参考图像生成个性化的Vlog,包括视频、背景音乐和内心独白语音。具体来说,我们提出了一个基于多模态大语言模型(MLLM)的多智能体协作框架。该框架基于用户输入高效生成用于多模态内容创建的高质量提示,从而提高了整个流程的效率和创造性。此外,我们还采用了反馈和回滚机制,利用MLLM评估生成结果并提供反馈,从而实现多模态内容的迭代自校正。我们还提出了ThemeVlogEval,一个基于主题的自动基准框架,提供标准化的指标和数据集以进行公平评估。全面的实验证明了我们的框架相对于多个基线的显著优势和潜力,突出了其在自动化Vlog生成方面的有效性和巨大潜力。
摘要:With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.
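The feedback-and-rollback mechanism reduces to a generic accept-or-rollback loop. In the sketch below, `generate` and `evaluate` are hypothetical stand-ins for the MLLM-based generator and evaluator described above; a revision is kept only if the evaluator scores it higher, otherwise the previous best is restored.

```python
def iterative_self_correct(generate, evaluate, max_rounds=3):
    """Toy feedback-and-rollback loop: accept a revision only if the
    evaluator rates it strictly higher; otherwise roll back to the
    previous best result."""
    best = generate(None)            # initial draft
    best_score = evaluate(best)
    for _ in range(max_rounds):
        candidate = generate(best)   # revise from the current best
        score = evaluate(candidate)
        if score > best_score:       # accept the revision
            best, best_score = candidate, score
        # else: roll back, i.e. keep the previous best
    return best, best_score
```

Here the rollback is implicit: a rejected candidate is simply discarded, so the state never degrades across rounds.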


【5】Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
标题:打破SFT平台期:用于图表到代码生成的多模态结构化强化学习
链接:https://arxiv.org/abs/2508.13587

作者: Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma
备注:technical report
摘要:虽然强化学习(RL)已被证明对视觉语言模型中的一般推理非常有效,但其在需要深入理解信息丰富图像并生成结构化输出的任务中的应用仍未得到充分探索。图表到代码生成正是这一挑战的典型例子,需要对可视化图表进行复杂推理以生成结构化代码。仅靠有监督微调(SFT)往往不够,这突出了对能恰当奖励结构化输出的有效强化学习策略的需求。我们通过大规模实验系统地研究了SFT的性能平台期,并提出了用于图表到代码生成的多模态结构化强化学习(MSRL),它大幅突破了这一平台期。我们构建了迄今最大的训练语料库,包含来自真实arXiv表格的300万个图表-代码对,以缓解先前合成数据的简单模式。尽管达到了最先进的性能,我们的实验表明,扩大SFT数据规模最终会遇到平台期,进一步增加数据只能带来微不足道的改进。我们的MSRL方法利用多粒度结构化奖励系统,使用多模态的文本和视觉反馈:在文本层面,基于规则的奖励验证细粒度的代码细节;在视觉层面,基于模型的奖励通过将生成的代码渲染成图像并采用评估器模型来评估结构相似性。我们在两阶段课程中实施这一方法以确保训练稳定性。结果表明,MSRL显著突破了SFT平台期,在ChartMimic和ReachQA基准上分别将高层指标提高了6.2%和9.9%,达到了与先进闭源模型相当的竞争性能。
摘要:While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
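The multi-granularity reward can be sketched as a weighted sum of a textual term and a visual term. The token-overlap and pixel-agreement scores below are simplified stand-ins for the paper's rule-based code checks and model-based structural similarity; the weights and inputs are illustrative.

```python
def structured_reward(code, rendered, reference_tokens, reference_pixels,
                      w_text=0.5, w_vis=0.5):
    """Toy MSRL-style reward combining two granularities.

    text reward : fraction of reference tokens found in the generated code
                  (stand-in for rule-based fine-grained code checks).
    visual reward: fraction of pixels where the rendered output agrees with
                  the reference image (stand-in for a learned evaluator).
    """
    tokens = set(code.split())
    text_r = len(tokens & reference_tokens) / max(len(reference_tokens), 1)
    agree = sum(a == b for a, b in zip(rendered, reference_pixels))
    vis_r = agree / max(len(reference_pixels), 1)
    return w_text * text_r + w_vis * vis_r
```

The point of the two terms is that code can be textually plausible yet render incorrectly, or vice versa; rewarding both closes that gap.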


【6】Generative Model-Based Feature Attention Module for Video Action Analysis
标题:用于视频动作分析的基于生成模型的特征注意力模块
链接:https://arxiv.org/abs/2508.13565

作者:ng, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang
摘要:视频动作分析是智能视频理解领域的基础技术,特别是在物联网(IoT)中的应用。然而,现有方法在特征提取中忽略了特征语义,而专注于优化动作提议;由于精度方面的限制,这些解决方案不适合在自动驾驶等需要鲁棒且可扩展智能视频分析的高性能物联网应用中广泛采用。为了解决这个问题,我们提出了一种新的基于注意力的生成模型来学习特征语义的关系。具体来说,通过利用动作前景和背景的差异,我们的模型同时学习时间动作特征语义的帧依赖性和片段依赖性,有效地利用了特征提取中的特征语义。为了评估模型的有效性,我们在动作识别和动作检测这两个基准视频任务上进行了广泛实验。在动作检测任务中,我们通过在广泛认可的数据集上进行全面验证,证实了我们方法的优越性。此外,我们还将所提方法有效性的验证扩展到了更广泛的任务:视频动作识别。我们的代码可在https://github.com/Generative-Feature-Model/GAF上获得。
摘要:Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in Internet of Things(IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals, thus these solutions are unsuitable for widespread adoption in high-performance IoT applications due to the limitations in precision, such as autonomous driving, which necessitate robust and scalable intelligent video analytics analysis. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences of actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, which takes advantage of feature semantics in the feature extraction effectively. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video task, action recognition and action detection. In the context of action detection tasks, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of the effectiveness of our proposed method to a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.


【7】Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer
标题:通过具有人工感光层的生物启发类神经元编码生成彩色尖峰数据
链接:https://arxiv.org/abs/2508.13558

作者:ng-Teng, Wang Yuan-Kai
备注:14 pages, 11 figures
摘要:近年来,神经形态计算和尖峰神经网络(SNN)通过与深度学习的集成而迅速发展。然而,SNN的性能仍然落后于卷积神经网络(CNN),主要是由于基于尖峰的数据的信息容量有限。虽然一些研究试图通过使用静态图像等非尖峰输入来训练SNN来提高其性能,但这种方法偏离了神经形态计算的初衷,即强调基于尖峰的信息处理。为了解决这个问题,我们提出了一种神经元编码方法,该方法基于生物神经元的内在操作原理和功能生成尖峰数据。该方法通过引入人工光感受器层进一步增强,使得尖峰数据能够携带颜色和亮度信息,从而形成完整的视觉尖峰信号。使用Integrate-and-Fire神经元模型的实验结果表明,这种生物启发的方法有效地增加了尖峰信号的信息含量,提高了SNN的性能,同时坚持神经形态学原则。我们相信,这一概念具有强大的潜力,为未来的发展,并可能有助于克服目前的限制,在神经形态计算,促进更广泛的应用SNN。
摘要:In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
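The integrate-and-fire mechanism that the Neuron-like Encoding builds on can be sketched as follows. The leak, threshold, and reset values are illustrative, not taken from the paper; a per-channel (R/G/B plus luminance) version of this encoder is the rough shape of a "photoreceptor layer".

```python
def lif_encode(intensities, threshold=1.0, leak=0.9):
    """Toy leaky integrate-and-fire encoder: intensity stream -> spike train.

    The membrane potential decays by `leak` each step, integrates the input,
    and emits a spike (then resets to zero) when it crosses `threshold`.
    Brighter inputs therefore produce denser spike trains.
    """
    v, spikes = 0.0, []
    for x in intensities:
        v = leak * v + x          # leak, then integrate this timestep's input
        if v >= threshold:
            spikes.append(1)      # fire
            v = 0.0               # reset after firing
        else:
            spikes.append(0)
    return spikes
```

Running one such encoder per color channel yields a multi-channel spike tensor carrying both color and luminance information, in the spirit of the paper's complete visual spike signal.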


【8】A Lightweight Dual-Mode Optimization for Generative Face Video Coding
标题:生成式人脸视频编码的轻量级双模式优化
链接:https://arxiv.org/abs/2508.13547

作者:ng, Shanzhi Yin, Bolin Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye
摘要:生成式人脸视频编码(GFVC)通过利用深度生成模型强大的推理能力实现了卓越的率失真性能。然而,其实际部署受到庞大模型参数量和高计算成本的阻碍。为了解决这个问题,我们提出了一个轻量级GFVC框架,引入了结合架构重新设计与操作细化的双模式优化,在降低复杂性的同时保持重建质量。在架构上,我们用更精简、更高效的层取代传统的3x3卷积,在不影响特征表达能力的情况下降低复杂性。在操作上,我们开发了两阶段自适应通道剪枝策略:(1)训练期间通过可学习阈值进行软剪枝以识别冗余通道;(2)训练后使用导出的掩码永久消除这些通道。这种双阶段方法同时确保了训练稳定性和推理效率。实验结果表明,与基线相比,所提出的轻量级双模式优化GFVC可实现90.4%的参数削减和88.9%的计算节省,同时在感知级质量指标方面优于最先进的视频编码标准通用视频编码(VVC)。因此,所提出的方法有望在移动边缘设备等资源受限环境中实现高效的GFVC部署。
摘要:Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization - combining architectural redesign and operational refinement - to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
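The two-stage pruning strategy reduces to a flag-then-drop pattern. In the sketch below the threshold is a fixed number and channels are plain values, whereas the paper trains a learnable threshold and prunes real convolution channels; the structure of the two stages is what the sketch shows.

```python
def soft_prune_mask(channel_importance, threshold):
    """Stage 1 (soft, during training): flag channels whose importance
    falls below the threshold. The paper learns this threshold; here it
    is a fixed illustrative value."""
    return [imp >= threshold for imp in channel_importance]

def hard_prune(channels, mask):
    """Stage 2 (hard, post-training): permanently drop flagged channels
    using the derived mask, shrinking the deployed model."""
    return [c for c, keep in zip(channels, mask) if keep]
```

Keeping the soft stage during training lets gradients still flow through to-be-pruned channels, so the hard removal afterwards does not destabilize optimization.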


【9】EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors
标题:EAvatar:基于生成几何先验的表情感知头部化身重建
链接:https://arxiv.org/abs/2508.13537

作者:ang, Cunjian Chen, Yiqun Wang, Qiuhong Ke, Yong Li
备注:20 pages, 11 figures
摘要:高保真头部化身重建在AR/VR、游戏和多媒体内容创作中起着至关重要的作用。3D高斯溅射(3DGS)的最新进展已经证明了具有实时渲染能力的复杂几何建模的有效性,并且现在广泛用于高保真头部化身重建任务。然而,现有的基于3DGS的方法仍然面临着巨大的挑战,在捕捉细粒度的面部表情和保持局部纹理连续性,特别是在高度可变形的区域。为了减轻这些限制,我们提出了一种新的基于3DGS的框架,称为EAvatar头部重建,是表达感知和变形感知。我们的方法引入了一种稀疏表达控制机制,其中使用少量关键高斯来影响其相邻高斯的变形,从而实现局部变形和精细尺度纹理过渡的精确建模。此外,我们利用来自预训练生成模型的高质量3D先验来提供更可靠的面部几何形状,提供结构指导,提高训练过程中的收敛稳定性和形状准确性。实验结果表明,我们的方法产生更准确和视觉连贯的头部重建与改进的表达可控性和细节保真度。
摘要:High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.


【10】DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
标题:DAASH:一个用于合成有效且隐蔽的对抗示例的元攻击框架
链接:https://arxiv.org/abs/2508.13309

作者:Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty
摘要:在严格的Lp范数约束下,人们已提出许多在白盒设置中生成对抗样本的技术。然而,这种范数有界的样本往往不能很好地与人类感知保持一致,直到最近才有一些方法开始专门探索感知对齐的对抗样本。此外,目前还不清楚Lp约束攻击的见解能否被有效利用以提高感知效果。在本文中,我们介绍了DAASH,这是一个完全可微的元攻击框架,它通过策略性地组合现有的基于Lp的攻击方法来生成有效且感知对齐的对抗样本。DAASH以多阶段方式运行:在每个阶段,它使用学习到的自适应权重从多个基础攻击中聚合候选对抗样本,并将结果传播到下一阶段。一种新的元损失函数通过联合最小化误分类损失和感知失真来指导这一过程,使框架能够在各阶段动态调节每个基础攻击的贡献。我们在CIFAR-10、CIFAR-100和ImageNet上的对抗训练模型上评估了DAASH。尽管仅依赖基于Lp约束的方法,DAASH的性能明显优于AdvAD等先进的感知攻击:实现了更高的攻击成功率(例如提高20.63%)和更优的视觉质量,以SSIM、LPIPS和FID衡量(分别改善约11、0.015和5.7)。此外,DAASH可以很好地推广到未见过的防御,使其成为评估鲁棒性的实用且强大的基线,而无需为每个新防御手工制作自适应攻击。
摘要:Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained based methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD -- achieving higher attack success rates (e.g., 20.63\% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements $\approx$ of 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
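One DAASH stage, aggregating candidate perturbations with adaptive weights, might look like the sketch below. Real perturbations are image-sized tensors and the weights are learned end-to-end through the meta-loss; here perturbations are short lists and the softmax logits are given directly, purely for illustration.

```python
import math

def aggregate_candidates(perturbations, logits):
    """Toy DAASH-style stage: blend candidate perturbations from several
    base attacks using softmax weights over the given logits.

    perturbations: list of equal-length perturbation vectors (one per
    base attack); returns their weighted elementwise combination.
    """
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]           # softmax over attack logits
    n = len(perturbations[0])
    return [sum(w * p[i] for w, p in zip(weights, perturbations))
            for i in range(n)]
```

Because the softmax is differentiable, a meta-loss on the blended result can push gradient back into the logits, which is what lets the framework modulate each base attack's contribution per stage.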


检测相关(8篇)

【1】OmViD: Omni-supervised active learning for video action detection
标题:OmViD:用于视频动作检测的全监督主动学习
链接:https://arxiv.org/abs/2508.13983

作者:na, Akash Kumar, Vibhav Vineet, Yogesh S Rawat
备注:ICCVW'25
摘要:视频动作检测需要密集的时空注释,其获取既具有挑战性又昂贵。然而,真实世界的视频难度各不相同,可能并不需要相同级别的注释。本文分析了每个样本适合的注释类型及其对时空视频动作检测的影响,侧重于两个关键方面:1)如何为视频获得不同级别的注释;2)如何从不同的注释类型中学习动作检测。该研究探讨了视频级标签、点、涂鸦、边界框和像素级掩码。首先,提出了一种简单的主动学习策略来估计每个视频所需的注释类型;然后,引入一种新的时空3D超像素方法从这些注释中生成伪标签,从而实现有效训练。该方法在UCF101-24和JHMDB-21数据集上得到验证,以最小的性能损失显著降低了注释成本。
摘要:Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.


【2】RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection
标题:RICO:面向目标检测增量学习的两个现实基准与深入分析
链接:https://arxiv.org/abs/2508.13878

作者:Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
备注:Accepted to ICCV Workshops 2025
摘要:增量学习(IL)在新数据上连续训练模型而无需完全重新训练,具有隐私、效率和可扩展性方面的优势。IL必须在对新数据的适应性与对旧知识的保留之间取得平衡。然而,评估往往依赖于合成的、简化的基准,掩盖了真实世界的IL性能。为了解决这个问题,我们引入了两个现实的增量目标检测基准(RICO):域RICO(D-RICO)在固定类别集合下呈现域偏移,而扩展类RICO(EC-RICO)在每个IL步骤中同时引入新的域和类别。这两个基准由14个涵盖真实和合成域的多样化数据集构建而成,涉及不同的条件(例如天气、一天中的时间)、相机传感器、视角和标注策略,捕捉到了现有评估中不存在的挑战。我们的实验表明,所有IL方法在适应性和保留方面都表现不佳,而仅重放少量先前数据就已优于所有方法;不过,在全部数据上单独训练仍然更优。我们将这一差距启发式地归因于蒸馏中教师模型的薄弱、单一模型无法应对多样化任务,以及可塑性不足。我们的代码将公开发布。
摘要:Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models' inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.


【3】Uncertainty-Aware Learning Policy for Reliable Pulmonary Nodule Detection on Chest X-Ray
标题:用于在胸部X光片上进行可靠肺结节检测的不确定性感知学习策略
链接:https://arxiv.org/abs/2508.13236

作者:Choi, Jinse Kim, Dong-yeon Yoo, Ju-sung Sun, Jung-won Lee
备注:8 pages, 5 figures
摘要:肺癌的早期发现和快速干预至关重要。尽管如此,确保准确的诊断是具有挑战性的,因为医生解释胸部X光片的能力取决于他们的经验和疲劳程度。尽管医疗人工智能在辅助诊断方面取得了迅速进展,但医生对此类系统的信任仍然有限,阻碍了广泛的临床应用。这种怀疑从根本上源于对其诊断不确定性的担忧。在临床诊断中,医生利用广泛的背景知识和临床经验。相比之下,医疗AI主要依赖于对靶病变的重复学习,仅根据这些数据来生成诊断。换句话说,医疗AI没有足够的知识来进行诊断,导致诊断的不确定性。因此,本研究提出一种不确定性感知学习策略,可以通过学习医生的背景知识以及胸部X射线病变信息来解决知识不足的问题。我们使用了2,517个无病变图像和656个结节图像,均来自Ajou大学医院。所提出的模型达到92%(IoU 0.2 / FPPI 2),与基线模型相比灵敏度提高了10%,同时还将熵作为不确定性的度量降低了0.2。
摘要:Early detection and rapid intervention of lung cancer are crucial. Nonetheless, ensuring an accurate diagnosis is challenging, as physicians' ability to interpret chest X-rays varies significantly depending on their experience and degree of fatigue. Although medical AI has been rapidly advancing to assist in diagnosis, physicians' trust in such systems remains limited, preventing widespread clinical adoption. This skepticism fundamentally stems from concerns about its diagnostic uncertainty. In clinical diagnosis, physicians utilize extensive background knowledge and clinical experience. In contrast, medical AI primarily relies on repetitive learning of the target lesion to generate diagnoses based solely on that data. In other words, medical AI does not possess sufficient knowledge to render a diagnosis, leading to diagnostic uncertainty. Thus, this study suggests an Uncertainty-Aware Learning Policy that can address the issue of knowledge deficiency by learning the physicians' background knowledge alongside the Chest X-ray lesion information. We used 2,517 lesion-free images and 656 nodule images, all obtained from Ajou University Hospital. The proposed model attained 92% (IoU 0.2 / FPPI 2) with a 10% enhancement in sensitivity compared to the baseline model while also decreasing entropy as a measure of uncertainty by 0.2.


【4】MIRAGE: Towards AI-Generated Image Detection in the Wild
标题:MIRAGE:面向真实场景的AI生成图像检测
链接:https://arxiv.org/abs/2508.13223

作者:, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu, Junjun Zheng, Xiangheng Kong, Yuning Jiang, Bo Zheng
摘要:在生成式AI技术进步的推动下,AI生成图像(AIGI)的传播对信息安全和公众信任构成了重大威胁。现有的AIGI检测器虽然在干净的实验室环境中对图像有效,但无法推广到真实场景。这些真实世界的图像充满噪声,从“明显是假的”图像到由多个生成模型产生并经过进一步质量编辑的逼真图像不一而足。本文致力于解决真实场景下的AIGI检测问题。我们介绍了Mirage,一个旨在模拟真实场景中AIGI复杂性的具有挑战性的基准。Mirage由两个来源构建:(1)经人类专家验证的来自互联网的大型AIGI语料库;(2)通过多个专家生成器协作创建的合成数据集,紧密模拟真实场景中的AIGI。在此基准之上,我们提出了视觉语言模型Mirage-R1,它具有从启发式到分析式的推理能力,这是一种用于AIGI检测的反思推理机制。Mirage-R1的训练分为两个阶段:监督微调冷启动,随后是强化学习阶段。通过进一步采用推理时自适应思考策略,Mirage-R1能够给出快速判断或更稳健准确的结论,有效平衡推理速度与性能。大量实验表明,我们的模型在Mirage和公共基准上分别领先最先进的检测器5%和10%。基准和代码将公开提供。
摘要:The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from ``obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.


【5】YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection
标题:YOLO11-CR:用于准确疲劳驾驶检测的轻量级卷积与注意力框架
链接:https://arxiv.org/abs/2508.13205

作者:n, Ligang Dong
摘要:驾驶员疲劳检测在减少道路交通事故中起着至关重要的作用,因而对智能交通系统至关重要。虽然基于生理信号和车辆动力学的方法具有较高的准确性,但它们通常是侵入性的、依赖硬件,并且在现实环境中缺乏鲁棒性。基于视觉的技术提供了一种非侵入性且可扩展的替代方案,但仍然面临诸如小物体或遮挡物体检测能力差、多尺度特征建模有限等挑战。为了解决这些问题,本文提出了YOLO11-CR,一个专为实时疲劳检测设计的轻量级高效目标检测模型。YOLO11-CR引入了两个关键模块:卷积和注意力融合模块(CAFM),它将局部CNN特征与基于Transformer的全局上下文相结合,以增强特征表达能力;矩形校准模块(RCM),它捕获水平和垂直上下文信息以改善空间定位,特别是对于侧脸和手机等小物体。在DSM数据集上的实验表明,YOLO11-CR的精确率为87.17%,召回率为83.86%,mAP@50为88.09%,mAP@50-95为55.93%,显著优于基线模型。消融研究进一步验证了CAFM和RCM模块在提高灵敏度和定位精度方面的有效性。这些结果表明,YOLO11-CR为车载疲劳监测提供了实用且高性能的解决方案,具有很强的实际部署潜力,未来可在时序建模、多模态数据集成和嵌入式优化方面进一步增强。
摘要:Driver fatigue detection is of paramount importance for intelligent transportation systems due to its critical role in mitigating road traffic accidents. While physiological and vehicle dynamics-based methods offer accuracy, they are often intrusive, hardware-dependent, and lack robustness in real-world environments. Vision-based techniques provide a non-intrusive and scalable alternative, but still face challenges such as poor detection of small or occluded objects and limited multi-scale feature modeling. To address these issues, this paper proposes YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue detection. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM), which integrates local CNN features with global Transformer-based context to enhance feature expressiveness; and the Rectangular Calibration Module (RCM), which captures horizontal and vertical contextual information to improve spatial localization, particularly for profile faces and small objects like mobile phones. Experiments on the DSM dataset demonstrated that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models significantly. Ablation studies further validate the effectiveness of the CAFM and RCM modules in improving both sensitivity and localization accuracy. These results demonstrate that YOLO11-CR offers a practical and high-performing solution for in-vehicle fatigue monitoring, with strong potential for real-world deployment and future enhancements involving temporal modeling, multi-modal data integration, and embedded optimization.


【6】MMIS-Net for Retinal Fluid Segmentation and Detection
标题:MMIS-Net用于视网膜液体分割和检测
链接:https://arxiv.org/abs/2508.13936

作者:e Ndipenocha, Alina Mirona, Kezhi Wanga, Yongmin Li
摘要:目的:深度学习方法在医学图像的分割和疾病检测方面显示出了很好的效果。然而,大多数方法都是在来自单一来源、模态、器官或疾病类型的数据上训练和测试的,忽略了组合其他可用注释数据的潜力。来自各种模态、器官和疾病的许多小型注释医学图像数据集是公开可用的。在这项工作中,我们的目标是利用这些数据集的协同潜力来提高在未见数据上的性能。方法:为此,我们提出了一种称为MMIS-Net(多模态医学图像分割网络)的新算法,其特点是相似性融合(Similarity Fusion)块,利用监督和像素级相似性知识选择进行特征图融合。此外,为了解决不一致的类别定义和标签矛盾,我们创建了一个one-hot标签空间,以处理在一个数据集中缺失但在另一个数据集中有注释的类别。MMIS-Net在10个数据集上进行训练,涵盖2种模态下的19个器官,以构建单一模型。结果:该算法在RETOUCH大挑战隐藏测试集上进行了评估,优于用于医学图像分割的大型基础模型和其他最先进算法。我们在液体分割任务中取得了最佳的平均Dice评分0.83和绝对体积差0.035,在液体检测任务中取得了完美的曲线下面积1。结论:定量结果突出了所提模型的有效性,这得益于在网络骨干中引入用于监督和相似性知识选择的相似性融合块,以及使用one-hot标签空间来解决标签类别的不一致和矛盾。
摘要:Purpose: Deep learning methods have shown promising results in the segmentation, and detection of diseases in medical images. However, most methods are trained and tested on data from a single source, modality, organ, or disease type, overlooking the combined potential of other available annotated data. Numerous small annotated medical image datasets from various modalities, organs, and diseases are publicly available. In this work, we aim to leverage the synergistic potential of these datasets to improve performance on unseen data. Approach: To this end, we propose a novel algorithm called MMIS-Net (MultiModal Medical Image Segmentation Network), which features Similarity Fusion blocks that utilize supervision and pixel-wise similarity knowledge selection for feature map fusion. Additionally, to address inconsistent class definitions and label contradictions, we created a one-hot label space to handle classes absent in one dataset but annotated in another. MMIS-Net was trained on 10 datasets encompassing 19 organs across 2 modalities to build a single model. Results: The algorithm was evaluated on the RETOUCH grand challenge hidden test set, outperforming large foundation models for medical image segmentation and other state-of-the-art algorithms. We achieved the best mean Dice score of 0.83 and an absolute volume difference of 0.035 for the fluids segmentation task, as well as a perfect Area Under the Curve of 1 for the fluid detection task. Conclusion: The quantitative results highlight the effectiveness of our proposed model due to the incorporation of Similarity Fusion blocks into the network's backbone for supervision and similarity knowledge selection, and the use of a one-hot label space to address label class inconsistencies and contradictions.
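The one-hot label space idea, giving every class across all datasets a fixed slot so a class absent in one dataset but annotated in another stays consistent, can be sketched as follows. The class names are invented for illustration.

```python
def build_label_space(dataset_classes):
    """Build one shared index over the union of all datasets' class names,
    so every class has a fixed one-hot slot regardless of which dataset
    it was annotated in."""
    all_classes = sorted(set().union(*dataset_classes))
    return {c: i for i, c in enumerate(all_classes)}

def to_one_hot(labels, index):
    """Encode one sample's label set in the shared space; classes the
    sample's source dataset never annotates simply stay zero."""
    vec = [0] * len(index)
    for c in labels:
        vec[index[c]] = 1
    return vec
```

Training a single model against this shared space is what lets MMIS-Net combine datasets whose class definitions disagree.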


【7】Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device
标题:在资源匮乏的环境中通过使用醋酸进行视觉检查自动检测宫颈癌,并在Android设备上部署轻量级深度学习模型
链接:https://arxiv.org/abs/2508.13253

作者:elroy Maben, Keerthana Prasad, Shyamala Guruvare, Vidya Kudva, P C Siddalingaswamy
摘要:宫颈癌是女性中最常见的癌症之一,尽管相对容易治疗,但在低收入和中等收入国家夺去了大量生命。一些研究表明,公共筛查计划可以显著降低宫颈癌的发病率和死亡率。虽然有几种筛查测试,但由于测试的可负担性和操作简便性,使用乙酸的目视检查(VIA)是低资源环境中最可行的选择。VIA需要经过培训的医疗专业人员来解读测试结果,本质上具有主观性。使用人工智能实现VIA自动化可消除主观性,并允许将任务转移给培训较少的卫生工作者。借助人工智能的任务转移将有助于进一步加快低资源环境中的筛查计划。在我们的工作中,我们提出了一种轻量级深度学习算法,其中包括作为感兴趣区域(ROI)检测器的EfficientDet-Lite3和基于MobileNet-V2的分类模型。这些模型将部署在一台可远程操作的安卓设备上,无需训练有素的医疗专业人员、实验室、复杂的基础设施或互联网连接,即可提供几乎即时的结果。该分类模型在测试数据集上的准确率为92.31%,灵敏度为98.24%,特异性为88.37%,是一种很有前景的自动化低资源筛查方法。
摘要:Cervical cancer is among the most commonly occurring cancer among women and claims a huge number of lives in low and middle-income countries despite being relatively easy to treat. Several studies have shown that public screening programs can bring down cervical cancer incidence and mortality rates significantly. While several screening tests are available, visual inspection with acetic acid (VIA) presents itself as the most viable option for low-resource settings due to the affordability and simplicity of performing the test. VIA requires a trained medical professional to interpret the test and is subjective in nature. Automating VIA using AI eliminates subjectivity and would allow shifting of the task to less trained health workers. Task shifting with AI would help further expedite screening programs in low-resource settings. In our work, we propose a lightweight deep learning algorithm that includes EfficientDet-Lite3 as the Region of Interest (ROI) detector and a MobileNet-V2 based model for classification. These models would be deployed on an android-based device that can operate remotely and provide almost instant results without the requirement of highly-trained medical professionals, labs, sophisticated infrastructure, or internet connectivity. The classification model gives an accuracy of 92.31%, a sensitivity of 98.24%, and a specificity of 88.37% on the test dataset and presents itself as a promising automated low-resource screening approach.


【8】Colon Polyps Detection from Colonoscopy Images Using Deep Learning
标题:使用深度学习从结肠镜检查图像中检测结肠息肉
链接:https://arxiv.org/abs/2508.13188

作者:n, Bikash Kumar Paul
备注:17 Pages
摘要:结肠息肉是结直肠癌的前兆,而结直肠癌是全球癌症相关死亡的主要原因之一。早期发现对改善患者预后至关重要。本研究探讨了基于深度学习的目标检测在利用结肠镜图像进行早期息肉识别中的应用。我们使用Kvasir-SEG数据集,应用广泛的数据增强,并将数据划分为训练集(80%)、验证集(训练集的20%)和测试集(20%)。我们评估了YOLOv5架构的三种变体(YOLOv5s、YOLOv5m、YOLOv5l)。实验结果表明,YOLOv5l优于其他变体,实现了85.1%的平均精度均值(mAP),并取得最高的平均交并比(IoU)0.86。这些发现表明,YOLOv5l为结肠息肉定位提供了优异的检测性能,是提高结直肠癌筛查准确性的有前景的工具。
摘要:Colon polyps are precursors to colorectal cancer, a leading cause of cancer-related mortality worldwide. Early detection is critical for improving patient outcomes. This study investigates the application of deep learning-based object detection for early polyp identification using colonoscopy images. We utilize the Kvasir-SEG dataset, applying extensive data augmentation and splitting the data into training (80\%), validation (20\% of training), and testing (20\%) sets. Three variants of the YOLOv5 architecture (YOLOv5s, YOLOv5m, YOLOv5l) are evaluated. Experimental results show that YOLOv5l outperforms the other variants, achieving a mean average precision (mAP) of 85.1\%, with the highest average Intersection over Union (IoU) of 0.86. These findings demonstrate that YOLOv5l provides superior detection performance for colon polyp localization, offering a promising tool for enhancing colorectal cancer screening accuracy.
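The IoU metric reported above is the intersection area of a predicted and a ground-truth box divided by their union area. A minimal reference implementation for axis-aligned `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

IoU ranges from 0 (no overlap) to 1 (identical boxes); detection benchmarks count a prediction as correct when its IoU with a ground-truth box exceeds a chosen threshold.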


分类|识别相关(6篇)

【1】Augmenting cobots for sheet-metal SMEs with 3D object recognition and localisation
标题:通过3D对象识别和定位增强钣金中小企业的协作机器人
链接:https://arxiv.org/abs/2508.13964

作者:ramer, Yanming Wu, David De Schepper, Eric Demeester
备注:13 pages, 25 figures
摘要:由于高混合、低批量的生产模式,如今的钣金车间面临着小批量和多变订单的挑战。由于标准自动化解决方案往往难以满足要求,中小企业不得不求助于重复性的体力劳动,这推高了生产成本,并导致技术熟练的劳动力无法充分发挥潜力。COOCK+ ROBUST项目旨在通过整合3D对象识别和定位等现有技术,将协作机器人转变为移动且可重构的生产助手。本文探讨了在工业环境中使用这些技术增强协作机器人系统的机遇和挑战,概述了该过程中涉及的关键步骤。此外,ACRO研究单位与一家工业合作伙伴合作开展的过往项目的见解,在全文中作为具体的实施示例。
摘要:Due to high-mix-low-volume production, sheet-metal workshops today are challenged by small series and varying orders. As standard automation solutions tend to fall short, SMEs resort to repetitive manual labour impacting production costs and leading to tech-skilled workforces not being used to their full potential. The COOCK+ ROBUST project aims to transform cobots into mobile and reconfigurable production assistants by integrating existing technologies, including 3D object recognition and localisation. This article explores both the opportunities and challenges of enhancing cobotic systems with these technologies in an industrial setting, outlining the key steps involved in the process. Additionally, insights from a past project, carried out by the ACRO research unit in collaboration with an industrial partner, serve as a concrete implementation example throughout.


【2】Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations
标题:基于模型的多目标视觉跟踪:识别和标准模型限制
链接:https://arxiv.org/abs/2508.13647

作者:í, Oliver Kost, Yuxuan Xia, Lennart Svensson, Ondřej Straka
备注:Submitted to FUSION 2025 conference
摘要:本文使用雷达跟踪领域熟知的多目标跟踪方法,基于二维边界框检测来解决行人跟踪问题。采用标准点目标(SPO)模型,并使用泊松多伯努利混合(PMBM)滤波器计算后验密度。讨论了源于连续时间的模型参数的选取,包括出生概率和生存概率。部分参数依据第一性原理选择,其余参数则从数据中辨识,此处使用公开的MOT-17数据集。虽然由此得到的PMBM算法取得了有希望的结果,但也揭示了SPO模型与数据之间的不匹配。基于模型的方法假设:修改导致SPO模型-数据不匹配的问题组件,将在未来的开发中带来更好的基于模型的算法。
摘要:This paper uses multi-object tracking methods known from the radar tracking community to address the problem of pedestrian tracking using 2D bounding box detections. The standard point-object (SPO) model is adopted, and the posterior density is computed using the Poisson multi-Bernoulli mixture (PMBM) filter. The selection of the model parameters rooted in continuous time is discussed, including the birth and survival probabilities. Some parameters are selected from the first principles, while others are identified from the data, which is, in this case, the publicly available MOT-17 dataset. Although the resulting PMBM algorithm yields promising results, a mismatch between the SPO model and the data is revealed. The model-based approach assumes that modifying the problematic components causing the SPO model-data mismatch will lead to better model-based algorithms in future developments.
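The abstract notes that the model parameters, including the survival probability, are rooted in continuous time. One common construction, shown here purely as an assumed illustration (the paper's exact parameterisation may differ), derives a frame-rate-independent survival probability from an exponential lifetime model:

```python
import math

def survival_prob(death_rate, dt):
    """Discrete-frame survival probability from a continuous-time
    exponential death model: P(survive dt) = exp(-rate * dt).
    A common construction, not necessarily the paper's exact one."""
    return math.exp(-death_rate * dt)

# The same underlying rate yields consistent probabilities across frame rates:
p30 = survival_prob(0.1, 1 / 30)  # one step at 30 fps
p15 = survival_prob(0.1, 1 / 15)  # one step at 15 fps
# two 30 fps steps are equivalent to one 15 fps step
```

Tying the discrete probability to a continuous-time rate makes the tracker's parameters portable across videos with different frame rates.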


【3】FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention
标题:FAMNet:通过多任务学习和分层注意力集成2D和3D特征以进行微表情识别
链接:https://arxiv.org/abs/2508.13483

作者:u, Xuecheng Wu, Danlei Huang, Xinyi Yin
备注:8 pages, 6 figures. Accepted to IJCNN 2025
摘要:微表情识别(MER)在很多领域都有重要的应用价值,但微表情(ME)的短时性和低强度给MER带来了巨大的挑战。目前深度学习中的MER方法主要包括三种数据加载方式:静态图像、动态图像序列,以及两种流的组合。如何有效地提取微表情的细粒度和时空特征一直是一个难以解决的问题。本文提出了一种基于多任务学习和分层注意力的新MER方法,通过融合2D和3D CNN,充分提取微表情的全方位特征。融合模型由2D CNN AMNet2D和3D CNN AMNet3D组成,两者结构相似,均由共享骨干网络Resnet18和注意力模块构成。训练时,模型采用不同的数据加载方式分别适配两个特定网络,联合训练MER和面部动作单元检测(FAUD)任务,并采用参数硬共享进行信息关联,进一步提高了MER任务的效果;最终融合的模型称为FAMNet。大量实验结果表明,我们提出的FAMNet显著提升了任务性能。在SAMM、CASME II和MMEW数据集上,FAMNet达到了83.75%(UAR)和84.03%(UF1)。此外,在具有挑战性的CAS(ME)$^3$数据集上,FAMNet实现了51%(UAR)和43.42%(UF1)。
摘要:Micro-expression recognition (MER) has essential application value in many fields, but the short duration and low intensity of micro-expressions (MEs) bring considerable challenges to MER. The current MER methods in deep learning mainly include three data loading methods: static images, dynamic image sequences, and a combination of the two streams. How to effectively extract MEs' fine-grained and spatiotemporal features has remained difficult to solve. This paper proposes a new MER method based on multi-task learning and hierarchical attention, which fully extracts MEs' omni-directional features by merging 2D and 3D CNNs. The fusion model consists of a 2D CNN AMNet2D and a 3D CNN AMNet3D, with similar structures consisting of a shared backbone network Resnet18 and attention modules. During training, the model adopts different data loading methods to adapt to the two specific networks respectively, jointly trains on the tasks of MER and facial action unit detection (FAUD), and adopts parameter hard sharing for information association, which further improves the effect of the MER task; the final fused model is called FAMNet. Extensive experimental results show that our proposed FAMNet significantly improves task performance. On the SAMM, CASME II and MMEW datasets, FAMNet achieves 83.75% (UAR) and 84.03% (UF1). Furthermore, on the challenging CAS(ME)$^3$ dataset, FAMNet achieves 51% (UAR) and 43.42% (UF1).
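UAR and UF1, the two metrics quoted above, are per-class recall and per-class F1 averaged with equal class weight (standard in MER because the classes are imbalanced); a minimal sketch on toy labels:

```python
def uar_uf1(y_true, y_pred):
    """Unweighted Average Recall (UAR) and Unweighted F1 (UF1):
    per-class recall / F1 averaged with equal class weight.
    The labels below are toy data, not from the MER datasets."""
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return sum(recalls) / len(classes), sum(f1s) / len(classes)

uar, uf1 = uar_uf1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Because every class contributes equally regardless of its frequency, a model cannot inflate either metric by favouring the majority class.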


【4】Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification
标题:分层多标签分类的分层一致学习和自适应损失平衡
链接:https://arxiv.org/abs/2508.13452

作者:iang, Mengzhe Liu, Haobing Liu, Yanwei Yu
备注:10 pages, 7 figures, accepted by CIKM 2025
摘要:层次多标签分类(HMC)在保持结构一致性和平衡多任务学习(MTL)中的损失权重方面面临着关键挑战。为了解决这些问题,我们提出了一种基于MTL、集成了原型对比学习和自适应任务加权机制的分类器HCAL。我们的分类器最显著的优点是语义一致性,包括显式建模标签的原型以及从子类到父类的特征聚合。另一个重要优点是自适应损失加权机制,它通过监控各任务的收敛速度动态分配优化资源,有效解决了传统MTL方法中固有的"一强多弱"优化偏差。为了进一步增强鲁棒性,通过向原型注入受控噪声来扩展决策边界,形成原型扰动机制。此外,我们形式化了一个称为层次违规率(HVR)的量化指标,用于评估层次一致性和泛化能力。在三个数据集上的大量实验表明,与基线模型相比,该分类器具有更高的分类精度和更低的层次违规率。
摘要:Hierarchical Multi-Label Classification (HMC) faces critical challenges in maintaining structural consistency and balancing loss weighting in Multi-Task Learning (MTL). To address these issues, we propose a classifier called HCAL, based on MTL integrated with prototype contrastive learning and adaptive task-weighting mechanisms. The most significant advantage of our classifier is semantic consistency, including both prototypes that explicitly model labels and feature aggregation from child classes to parent classes. The other important advantage is an adaptive loss-weighting mechanism that dynamically allocates optimization resources by monitoring task-specific convergence rates. It effectively resolves the "one-strong-many-weak" optimization bias inherent in traditional MTL approaches. To further enhance robustness, a prototype perturbation mechanism is formulated by injecting controlled noise into prototypes to expand decision boundaries. Additionally, we formalize a quantitative metric called the Hierarchical Violation Rate (HVR) to evaluate hierarchical consistency and generalization. Extensive experiments across three datasets demonstrate both the higher classification accuracy and the reduced hierarchical violation rate of the proposed classifier over baseline models.
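The adaptive loss-weighting idea, up-weighting tasks that converge slowly so that no single task dominates, can be sketched as follows; this is one simple instantiation based on loss-decrease ratios, not necessarily HCAL's exact scheme:

```python
def adaptive_task_weights(prev_losses, curr_losses, eps=1e-8):
    """Reweight tasks by their convergence rate: a task whose loss
    is shrinking slowly (ratio near 1) gets a larger weight,
    counteracting the 'one-strong-many-weak' bias. A simple
    instantiation; the paper's exact scheme may differ."""
    rates = [c / (p + eps) for p, c in zip(prev_losses, curr_losses)]
    total = sum(rates)
    return [r / total for r in rates]  # normalised weights, sum to 1

# task 0 converged fast (loss halved), task 1 barely moved:
w = adaptive_task_weights(prev_losses=[1.0, 1.0], curr_losses=[0.5, 0.9])
# the slowly-converging task receives the larger weight
```

The normalisation keeps the total loss scale stable while shifting optimization effort toward the lagging task.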


【5】CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification
标题:CLOE:关于内窥镜图像的课程学习,以实现稳健的MES分类
链接:https://arxiv.org/abs/2508.13280

作者:demir, Hacer Yalim Keles, Omer Ozgur Tanriover
备注:16 pages, 4 figures, 9 tables
摘要:从内窥镜图像估计疾病严重程度对于评估溃疡性结肠炎至关重要,其中Mayo内窥镜子评分(MES)被广泛用于炎症分级。然而,由于观察者间变异性带来的标签噪声以及评分的有序性(标准模型往往忽略这些因素),MES分类仍然具有挑战性。我们提出了CLoE,一个同时考虑标签可靠性和有序结构的课程学习框架。图像质量由一个在波士顿肠道准备量表(BBPS)标签上训练的轻量级模型估计,用作标注置信度的代理,以将样本从简单(干净)到困难(嘈杂)排序。该课程进一步与ResizeMix增强相结合,以提高鲁棒性。使用CNN和Transformer在LIMUC和HyperKvasir数据集上进行的实验表明,CLoE持续优于强大的有监督和自监督基线。例如,ConvNeXt-Tiny在LIMUC上以较低的计算成本达到82.5%的准确率和0.894的QWK。这些结果突出了难度感知训练策略在标签不确定性下改进有序分类的潜力。代码将在https://github.com/zeynepozdemir/CLoE上发布。
摘要:Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.
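The easy-to-hard ordering at the core of the curriculum can be sketched as a sort over a quality proxy; the scorer and scores below are assumptions for illustration (in the paper the proxy comes from a BBPS-trained model):

```python
def curriculum_order(samples, quality_score):
    """Order training samples from easy (high image quality, likely
    clean label) to hard (low quality, likely noisy label), using
    an externally supplied quality proxy. Toy data, not LIMUC."""
    return sorted(samples, key=quality_score, reverse=True)

batch = [("img_a", 0.2), ("img_b", 0.9), ("img_c", 0.6)]
ordered = curriculum_order(batch, quality_score=lambda s: s[1])
# cleanest sample first, noisiest last
```

Training then consumes the ordered stream, so the model fits reliable examples before being exposed to the noisy ones.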


【6】Exploration of Deep Learning Based Recognition for Urdu Text
标题:基于深度学习的乌尔都语文本识别探索
链接:https://arxiv.org/abs/2508.13245

作者:azal, Sheeraz Ahmed
摘要:乌尔都语是一种草书文字,与阿拉伯语和许多其他南亚语言相似。由于其复杂的几何和形态结构,乌尔都语难以分类。如果分割技术有效,字符分类可以进一步进行,但由于乌尔都语的上下文敏感性,基于分割的识别往往导致较高的错误率。我们提出的乌尔都语光学字符识别系统采用基于组件的分类方法,依赖于称为卷积神经网络(CNN)的自动特征学习技术。CNN在乌尔都语文本数据集上进行训练和测试,该数据集通过三字符排列过程生成,并进一步应用连通组件技术丢弃不必要的图像,以便仅获得连字。分层神经网络采用两级结构,以处理三种程度的字符排列和组件分类。我们的模型在组件分类上成功达到了0.99%。
摘要:Urdu is a cursive script language and has similarities with Arabic and many other South Asian languages. Urdu is difficult to classify due to its complex geometrical and morphological structure. Character classification can be processed further if the segmentation technique is efficient, but due to context sensitivity in Urdu, segmentation-based recognition often results in a high error rate. Our proposed approach for an Urdu optical character recognition system is a component-based classification relying on an automatic feature learning technique called a convolutional neural network. The CNN is trained and tested on an Urdu text dataset, which is generated through a permutation process of three characters and further proceeds to discard unnecessary images by applying a connected component technique in order to obtain ligatures only. A hierarchical neural network is implemented with two levels to deal with three degrees of character permutations and component classification. Our model successfully achieved 0.99% for component classification.


分割|语义相关(11篇)

【1】GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
标题:GeoSAM2:释放SAM2的力量进行3D零件分割
链接:https://arxiv.org/abs/2508.14036

作者: Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
备注:this https URL
摘要:现代3D生成方法可以从稀疏或单个视图快速创建形状,但由于计算限制,它们的输出通常缺乏几何细节。我们提出了DetailGen3D,一种专门设计用于增强这些生成的3D形状的生成方法。我们的关键见解是直接通过潜在空间中的数据依赖流来建模粗到精的转换,避免了大规模3D生成模型的计算开销。我们引入了一个令牌匹配策略,确保精确的空间对应在细化,使本地的细节合成,同时保持全球结构。通过仔细设计我们的训练数据,以匹配合成的粗糙形状的特征,我们的方法可以有效地增强各种3D生成和重建方法产生的形状,从单视图到稀疏的多视图输入。大量的实验表明,DetailGen3D实现了高保真的几何细节合成,同时保持训练效率。
摘要:Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.


【2】SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation
标题:SCRNet:用于医学超声图像分割的空间通道调节网络
链接:https://arxiv.org/abs/2508.13899

作者:, Ziliang Wang
备注:8 pages
摘要:医学超声图像分割是计算机视觉领域的一个艰巨挑战。传统方法依赖卷积神经网络(CNN)和基于Transformer的方法来应对医学图像分割的复杂性。然而,固有的局限性仍然存在:基于CNN的方法往往忽略长程依赖关系,而基于Transformer的方法可能忽略局部上下文信息。为了解决这些缺陷,我们提出了一种新的特征聚合模块(FAM),用于处理来自前一层的两个输入特征。这些特征被无缝地引入卷积与交叉注意力并行模块(CCAPM)的两个分支,在每个分支中赋予它们不同的角色,从而帮助在两个输入特征之间建立强连接。这种策略通过明智地将卷积操作与交叉注意力机制融合,使我们的模块能够同时关注长程依赖和局部上下文信息。此外,通过将FAM集成到我们提出的空间通道调节模块(SCRM)中,增强了识别需要更多关注的显著区域和信息特征的能力。进一步地,通过将SCRM集成到UNet架构的编码器块中,我们引入了一个称为空间通道调节网络(SCRNet)的新框架。大量实验结果证明了SCRNet的优越性,与现有方法相比,它始终实现最先进(SOTA)的性能。
摘要:Medical ultrasound image segmentation presents a formidable challenge in the realm of computer vision. Traditional approaches rely on Convolutional Neural Networks (CNNs) and Transformer-based methods to address the intricacies of medical image segmentation. Nevertheless, inherent limitations persist, as CNN-based methods tend to disregard long-range dependencies, while Transformer-based methods may overlook local contextual information. To address these deficiencies, we propose a novel Feature Aggregation Module (FAM) designed to process two input features from the preceding layer. These features are seamlessly directed into two branches of the Convolution and Cross-Attention Parallel Module (CCAPM) to endow them with different roles in each of the two branches to help establish a strong connection between the two input features. This strategy enables our module to focus concurrently on both long-range dependencies and local contextual information by judiciously merging convolution operations with cross-attention mechanisms. Moreover, by integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM), the ability to discern salient regions and informative features warranting increased attention is enhanced. Furthermore, by incorporating the SCRM into the encoder block of the UNet architecture, we introduce a novel framework dubbed Spatial-Channel Regulation Network (SCRNet). The results of our extensive experiments demonstrate the superiority of SCRNet, which consistently achieves state-of-the-art (SOTA) performance compared to existing methods.


【3】Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation
标题:用于半监督医学图像分割的多元化增强协作Mamba
链接:https://arxiv.org/abs/2508.13712

作者:i, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao
摘要:获取高质量的医学图像分割标注数据既繁琐又昂贵。半监督分割技术通过利用未标记数据生成伪标签来减轻这种负担。最近,以Mamba为代表的先进状态空间模型已经显示出对长程依赖的高效处理能力。这促使我们探索其在半监督医学图像分割中的潜力。在本文中,我们提出了一种新的多样性增强协作Mamba框架(即DCMamba)用于半监督医学图像分割,从数据、网络和特征三个角度探索并利用多样性。首先,从数据角度出发,结合Mamba的扫描建模特性,提出了块级弱-强混合增强。此外,从网络角度,我们引入了一个多样扫描协作模块,它可以受益于不同扫描方向产生的预测差异。进一步,从特征角度,我们采用不确定性加权的对比学习机制来提高特征表示的多样性。实验证明,我们的DCMamba显著优于其他半监督医学图像分割方法,例如,在具有20%标注数据的Synapse数据集上,比最新的基于SSM的方法高出6.69%。
摘要:Acquiring high-quality annotated data for medical image segmentation is tedious and costly. Semi-supervised segmentation techniques alleviate this burden by leveraging unlabeled data to generate pseudo labels. Recently, advanced state space models, represented by Mamba, have shown efficient handling of long-range dependencies. This drives us to explore their potential in semi-supervised medical image segmentation. In this paper, we propose a novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for semi-supervised medical image segmentation, which explores and utilizes the diversity from data, network, and feature perspectives. Firstly, from the data perspective, we develop patch-level weak-strong mixing augmentation with Mamba's scanning modeling characteristics. Moreover, from the network perspective, we introduce a diverse-scan collaboration module, which could benefit from the prediction discrepancies arising from different scanning directions. Furthermore, from the feature perspective, we adopt an uncertainty-weighted contrastive learning mechanism to enhance the diversity of feature representation. Experiments demonstrate that our DCMamba significantly outperforms other semi-supervised medical image segmentation methods, e.g., outperforming the latest SSM-based method by 6.69% on the Synapse dataset with 20% labeled data.


【4】Unleashing Semantic and Geometric Priors for 3D Scene Completion
标题:释放语义和几何先验以实现3D场景补全
链接:https://arxiv.org/abs/2508.13601

作者:hen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See, Cong Yang
备注:9 pages, 5 figures, 6 tables
摘要:基于摄像头的3D语义场景补全(SSC)为自动驾驶和机器人导航提供密集的几何和语义感知。然而,现有方法依赖耦合编码器来提供语义和几何先验,这迫使模型在相互冲突的需求之间折衷,限制了其整体性能。为了应对这些挑战,我们提出了FoundationSSC,一个在源和通路两个层面进行双重解耦的新框架。在源层面,我们引入了一个基础编码器,为语义分支提供丰富的语义特征先验,为几何分支提供高保真的立体代价体。在通路层面,这些先验通过专门的解耦通路被细化,从而产生更好的语义上下文和深度分布。我们的双重解耦设计产生解缠且细化的输入,随后通过混合视图变换利用这些输入生成互补的3D特征。此外,我们还引入了一个新颖的轴感知融合(AAF)模块,通过将这些特征各向异性地合并为统一表示,解决了经常被忽视的融合挑战。大量实验证明了FoundationSSC的优势,在语义和几何度量上同时取得改进,在SemanticKITTI上分别以+0.23 mIoU和+2.03 IoU超过先前最佳结果。此外,我们在SSCBench-KITTI-360上实现了最先进的性能,达到21.78 mIoU和48.61 IoU。代码将在论文录用后发布。
摘要:Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU. The code will be released upon acceptance.


【5】Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
标题:基于无噪文本到视频扩散模型的时间条件参考视频对象分割
链接:https://arxiv.org/abs/2508.13584

作者:ang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li
备注:11 pages, 7 figures
摘要:参考视频对象分割(RVOS)旨在根据文本描述分割视频中的特定对象。我们观察到,最近的RVOS方法往往过分强调特征提取和时间建模,而相对忽略了分割头的设计。事实上,在分割头设计方面仍有相当大的改进空间。为了解决这个问题,我们提出了一个时间条件参考视频对象分割模型,它创新地集成了现有的分割方法,有效地提高边界分割能力。此外,我们的模型利用文本到视频扩散模型进行特征提取。在此基础上,我们去除了传统的噪声预测模块,以避免噪声的随机性降低分割精度,从而简化了模型,同时提高了性能。最后,为了克服VAE有限的特征提取能力,我们设计了一个时间上下文掩码细化(TCMR)模块,它显着提高了分割质量,而无需引入复杂的设计。我们在四个公共RVOS基准上评估了我们的方法,在这些基准上,它始终实现了最先进的性能。
摘要:Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.


【6】DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
标题:DictAS:一个通过字典查找实现类可泛化少样本异常分割的框架
链接:https://arxiv.org/abs/2508.13560

作者:Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang, Xingang Wang, Fei Shen, Zhengtao Zhang, Mukesh Prasad, Guiguang Ding
备注:Accepted by ICCV 2025, Project: this https URL
摘要:最近的视觉语言模型(例如CLIP)已经在少样本异常分割(FSAS)中展示了对未见类别的显著类泛化能力,其利用了在已见类别上的有监督提示学习或微调。然而,它们的跨类别泛化在很大程度上依赖于真实已见异常样本的先验知识。在本文中,我们提出了一个新的框架DictAS,它使统一模型能够检测未见对象类别中的视觉异常,无需在目标数据上进行任何再训练,仅采用少量正常参考图像作为视觉提示。DictAS背后的洞察是通过自监督学习将字典查找能力迁移到未见类别的FSAS任务,而不是仅仅记住训练集中的正常和异常特征模式。具体来说,DictAS主要由三个组件组成:(1)**字典构建**——使用正常参考图像的特征模拟真实字典的索引和内容。(2)**字典查询**——通过稀疏查找策略从字典中检索被查询的区域特征;当查询特征无法被检索时,将其分类为异常。(3)**查询判别正则化**——通过使异常特征更难从字典中检索来增强异常判别能力;为此,进一步提出了对比查询约束和文本对齐约束。在七个公开的工业和医疗数据集上进行的广泛实验表明,DictAS始终优于最先进的FSAS方法。
摘要:Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) **Dictionary Construction** - to simulate the index and content of a real dictionary using features from normal reference images. (2) **Dictionary Lookup** - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) **Query Discrimination Regularization** - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
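The lookup-and-score idea, a query region is anomalous if it retrieves poorly from a dictionary of normal features, can be sketched as follows; this is a schematic top-k cosine retrieval, not DictAS's exact sparse lookup, and the vectors are toy data:

```python
def anomaly_score(query, dictionary, top_k=2):
    """Score a query feature by how well it is retrieved from a
    dictionary built from normal reference features: features that
    retrieve poorly are anomalous. Schematic top-k cosine lookup."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    sims = sorted((cos(query, d) for d in dictionary), reverse=True)
    return 1.0 - sum(sims[:top_k]) / top_k  # high when retrieval fails

normal_dict = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2)]
normal_query = anomaly_score((1.0, 0.05), normal_dict)   # near dictionary: low
abnormal_query = anomaly_score((0.0, 1.0), normal_dict)  # far from it: high
```

Because the dictionary is built only from normal reference images, nothing about the unseen category's anomalies needs to be learned in advance.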


【7】AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report
标题:AIM 2025离岸流分割(RipSeg)挑战赛报告
链接:https://arxiv.org/abs/2508.13401

作者:mitriu, Florin Miron, Florin Tatui, Radu Tudor Ionescu, Radu Timofte, Aakash Ralhan, Florin-Alexandru Vasluianu, Shenyang Qian, Mitchell Harley, Imran Razzak, Yang Song, Pu Luo, Yumei Li, Cong Xu, Jinming Chai, Kexin Zhang, Licheng Jiao, Lingling Li, Siqi Yu, Chao Zhang, Kehuan Song, Fang Liu, Puhua Chen, Xu Liu, Jin Hu, Jinyang Xu, Biao Liu
备注:Challenge report paper from AIM2025 Workshop at ICCVW 2025
摘要:本报告概述了AIM 2025 RipSeg挑战赛,该竞赛旨在推进静态图像中离岸流自动分割的技术。离岸流是危险的快速流动的水流,对全球海滩安全构成重大风险,使准确的视觉检测成为一项重要且探索不足的研究任务。该挑战建立在RipVIS(最大的可用离岸流数据集)的基础上,并专注于单类实例分割,其中精确的轮廓描绘对于完整捕获离岸流的范围至关重要。该数据集涵盖不同的地点、离岸流类型和相机朝向,提供了一个现实且具有挑战性的基准。第一届共有75名参与者注册,产生了5份有效的测试提交。各团队根据综合得分进行评估,该得分结合了$F_1$、$F_2$、$AP_{50}$和$AP_{[50:95]}$,以确保可靠且与应用相关的排名。表现最好的方法利用深度学习架构、领域自适应技术、预训练模型和领域泛化策略来提高不同条件下的性能。本报告概述了数据集细节、竞赛框架、评估指标和最终结果,提供了对离岸流分割现状的见解。最后,我们讨论了主要挑战、从提交中汲取的经验教训,以及扩展RipSeg的未来方向。
摘要:This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, 75 participants registered for this first edition, resulting in 5 valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
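The ranking combines F1, F2, AP50, and AP[50:95] into one composite score; the abstract does not state the weighting, so the equal weights below are an assumption for illustration:

```python
def composite_score(f1, f2, ap50, ap50_95, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four challenge metrics into a single ranking score.
    The metric set matches the abstract; the equal weights are an
    assumption, not the official formula."""
    metrics = (f1, f2, ap50, ap50_95)
    return sum(w * m for w, m in zip(weights, metrics))

# invented metric values for one hypothetical submission:
score = composite_score(f1=0.82, f2=0.80, ap50=0.75, ap50_95=0.55)
```

Mixing recall-leaning F2 with the stricter AP[50:95] rewards both finding every rip current and delineating it tightly.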


【8】PreSem-Surf: RGB-D Surface Reconstruction with Progressive Semantic Modeling and SG-MLP Pre-Rendering Mechanism
标题:PreSem-Surf:使用渐进式语义建模和SG-MLP预渲染机制的RGB-D表面重建
链接:https://arxiv.org/abs/2508.13228

作者: Hang Xu, Yanghang Huang, Jiali Huang, Qian Weng
备注:2025 International Joint Conference on Neural Networks (IJCNN 2025)
摘要:本文提出了PreSem-Surf,一种基于神经辐射场(NeRF)框架的优化方法,能够在短时间内从RGB-D序列重建高质量的场景表面。该方法集成了RGB、深度和语义信息,以提高重建性能。具体而言,引入了一种结合PR-MLP(预处理多层感知器)的新型SG-MLP采样结构用于体素预渲染,使模型能够更早地捕获场景相关信息,更好地区分噪声和局部细节。此外,采用渐进式语义建模,以逐步提高的精度提取语义信息,在减少训练时间的同时增强场景理解。在七个合成场景上使用六项评估指标的实验表明,PreSem-Surf在C-L1、F-score和IoU上表现最佳,同时在NC、Accuracy和Completeness上保持竞争力,证明了其有效性和实用性。
摘要:This paper proposes PreSem-Surf, an optimized method based on the Neural Radiance Field (NeRF) framework, capable of reconstructing high-quality scene surfaces from RGB-D sequences in a short time. The method integrates RGB, depth, and semantic information to improve reconstruction performance. Specifically, a novel SG-MLP sampling structure combined with PR-MLP (Preconditioning Multilayer Perceptron) is introduced for voxel pre-rendering, allowing the model to capture scene-related information earlier and better distinguish noise from local details. Furthermore, progressive semantic modeling is adopted to extract semantic information at increasing levels of precision, reducing training time while enhancing scene understanding. Experiments on seven synthetic scenes with six evaluation metrics show that PreSem-Surf achieves the best performance in C-L1, F-score, and IoU, while maintaining competitive results in NC, Accuracy, and Completeness, demonstrating its effectiveness and practical applicability.


【9】A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
标题:一种新型注意力增强小波YOLO系统,用于经颅彩色编码多普勒实时脑血管分割
链接:https://arxiv.org/abs/2508.13875

作者:hang (1), Shuai Li (1), Xinyi Wang (1), Yu Sun (1), Hongyu Kang (1), Pui Yuk Chryste Wan (1), Yong-Ping Zheng (1 and 2), Sai-Kit Lam (1 and 2) ((1), Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China, (2), the Research Institute of Smart Ageing, The Hong Kong Polytechnic University, Hong Kong SAR, China)
摘要:Willis环(CoW)对于确保持续的脑部供血至关重要,与缺血性中风密切相关。准确评估CoW对于识别高危个体和指导适当的临床管理非常重要。在现有的成像方法中,经颅彩色编码多普勒(TCCD)因其无辐射、经济实惠和易于获得而具有独特优势。然而,可靠的TCCD评估在很大程度上依赖于操作者识别解剖标志和执行准确角度校正的专业知识,这限制了其广泛采用。为了应对这一挑战,我们提出了一种人工智能驱动的实时CoW自动分割系统,能够高效捕获脑动脉。此前尚无研究探索使用TCCD的AI驱动脑血管分割。在这项工作中,我们介绍了一种为TCCD数据量身定制的新型注意力增强小波YOLO(AAW-YOLO)网络,旨在为CoW中的脑血管分割提供实时指导。我们前瞻性地收集了包括738个标注帧和3,419个标注动脉实例的TCCD数据,以建立用于模型训练和评估的高质量数据集。所提出的AAW-YOLO在分割同侧和对侧CoW血管方面表现出色,平均Dice评分为0.901,IoU为0.823,精确率为0.882,召回率为0.926,mAP为0.953,每帧推理速度为14.199毫秒。该系统提供了一种实用的解决方案,以减少基于TCCD的脑血管筛查对操作者经验的依赖,有望应用于常规临床工作流程和资源受限环境。未来的研究将探索双侧建模和更大规模的验证。
摘要:The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
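The Dice score and IoU reported above are standard overlap metrics between a predicted and a ground-truth segmentation mask; a minimal sketch on toy binary masks (flattened 0/1 lists, not actual TCCD data):

```python
def dice_iou(pred, gt):
    """Dice coefficient and IoU between two binary masks given as
    flattened lists of 0/1 integers. Toy masks for illustration."""
    inter = sum(p & g for p, g in zip(pred, gt))
    psum, gsum = sum(pred), sum(gt)
    dice = 2 * inter / (psum + gsum)     # 2|A∩B| / (|A|+|B|)
    iou = inter / (psum + gsum - inter)  # |A∩B| / |A∪B|
    return dice, iou

dice, iou = dice_iou([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Dice is always at least as large as IoU for the same pair of masks, which is why papers typically report both.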


【10】subCellSAM: Zero-Shot (Sub-)Cellular Segmentation for Hit Validation in Drug Discovery
标题:subCellSAM:用于药物发现中命中验证的零样本(亚)细胞分割
链接:https://arxiv.org/abs/2508.13701

作者:imann, Daniel Siegismund, Mario Wieser, Stephan Steigele
备注:Accepted at DAGM German Conference on Pattern Recognition (GCPR) 2025
摘要:使用自动显微镜进行高通量筛选是生物制药药物发现的关键驱动力,可以对癌症等疾病的数千种候选药物进行平行评估。传统的图像分析和深度学习方法已被用于分析这些复杂的大规模数据集,细胞分割是提取相关结构的关键步骤。然而,这两种策略通常都需要大量的手动参数调整或特定于域的模型微调。我们提出了一种在zero-shot设置(即,没有微调),由上下文学习策略指导。我们的方法采用了一个三步的过程,细胞核,细胞和亚细胞分割,引入了一个自我提示的机制,编码形态和拓扑先验使用不断增长的掩模和战略性放置的前景/背景点。我们在标准细胞分割基准和行业相关的命中验证试验上验证了我们的方法,证明它准确地分割了生物相关结构,而不需要特定的微调。
摘要:High-throughput screening using automated microscopes is a key driver in biopharma drug discovery, enabling the parallel evaluation of thousands of drug candidates for diseases such as cancer. Traditional image analysis and deep learning approaches have been employed to analyze these complex, large-scale datasets, with cell segmentation serving as a critical step for extracting relevant structures. However, both strategies typically require extensive manual parameter tuning or domain-specific model fine-tuning. We present a novel method that applies a segmentation foundation model in a zero-shot setting (i.e., without fine-tuning), guided by an in-context learning strategy. Our approach employs a three-step process for nuclei, cell, and subcellular segmentation, introducing a self-prompting mechanism that encodes morphological and topological priors using growing masks and strategically placed foreground/background points. We validate our method on both standard cell segmentation benchmarks and industry-relevant hit validation assays, demonstrating that it accurately segments biologically relevant structures without the need for dataset-specific tuning.


【11】PediDemi -- A Pediatric Demyelinating Lesion Segmentation Dataset
标题:PediDemi --儿科脱髓鞘病变分割数据集
链接:https://arxiv.org/abs/2508.13239

作者:a, Gabriela Adriana Visa
摘要:中枢神经系统脱髓鞘疾病可能有多种原因,最常见的是感染、自身免疫反应、遗传或血管病因。脱髓鞘病变的特征在于神经纤维的髓鞘被破坏或损毁的区域。在自身免疫性疾病中,多发性硬化症(MS)是最知名且最具侵袭性的形式。急性播散性脑脊髓炎(ADEM)是另一种类型的脱髓鞘疾病,通常预后较好。磁共振成像(MRI)广泛用于通过检测病变来诊断和监测疾病进展。虽然成人和儿童都可能受到影响,但针对儿科病例以及MS以外的脱髓鞘疾病的公开数据集严重缺乏。本研究首次介绍了一个用于脱髓鞘病变分割的公开儿科数据集。该数据集包括来自13名确诊脱髓鞘疾病的儿科患者的MRI扫描,其中包括3名ADEM患者。除了病变分割掩模之外,数据集还包括广泛的患者元数据,例如诊断、治疗、个人病史和实验室结果。为了评估数据集的质量并证明其相关性,我们评估了一个在现有MS数据集上训练的最先进病变分割模型。结果强调了多样化数据集的重要性。
摘要:Demyelinating disorders of the central nervous system may have multiple causes; the most common are infections, autoimmune responses, and genetic or vascular etiologies. Demyelinating lesions are characterized by areas where the myelin sheath of the nerve fibers is damaged or destroyed. Among autoimmune disorders, Multiple Sclerosis (MS) is the most well-known and aggressive form. Acute Disseminated Encephalomyelitis (ADEM) is another type of demyelinating disease, typically with a better prognosis. Magnetic Resonance Imaging (MRI) is widely used for diagnosing and monitoring disease progression by detecting lesions. While both adults and children can be affected, there is a significant lack of publicly available datasets for pediatric cases and demyelinating disorders beyond MS. This study introduces, for the first time, a publicly available pediatric dataset for demyelinating lesion segmentation. The dataset comprises MRI scans from 13 pediatric patients diagnosed with demyelinating disorders, including 3 with ADEM. In addition to lesion segmentation masks, the dataset includes extensive patient metadata, such as diagnosis, treatment, personal medical background, and laboratory results. To assess the quality of the dataset and demonstrate its relevance, we evaluate a state-of-the-art lesion segmentation model trained on an existing MS dataset. The results underscore the importance of diverse datasets.


Zero/Few Shot|迁移|域适配|自适应(5篇)

【1】DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts
标题:DIME-Net:基于Retinex和Mixture-of-Experts的双照明自适应增强网络
链接:https://arxiv.org/abs/2508.13921

作者:g, Xiaoqin Wang, Dingyi Wang, Qiang Li, Shushan Qiao
备注:Accepted at ACM Multimedia 2025 (ACM MM 2025)
摘要:在现实环境中,通常会遇到由低光和背光场景等复杂照明条件引起的图像退化,从而严重影响图像质量和下游视觉任务。大多数现有方法只关注单一类型的照明退化,缺乏以统一方式处理各种照明条件的能力。为了解决这个问题,我们提出了一个称为DIME-Net的双照明增强框架。我们方法的核心是一个混合专家(Mixture-of-Experts)照明估计模块,其中稀疏门控机制根据输入图像的照明特性自适应地选择合适的S曲线专家网络。通过结合Retinex理论,该模块可以有效地对弱光和背光图像分别执行针对性增强。为了进一步校正光照引起的伪影和颜色失真,我们设计了一个损伤恢复模块,配备了光照感知交叉注意力和顺序状态全局注意力机制。此外,我们通过整合现有数据集构建了混合照明数据集MixBL,使我们的模型能够通过单次训练过程获得强大的照明适应性。实验结果表明,DIME-Net无需任何重新训练即可在合成和真实世界的低光与背光数据集上取得有竞争力的性能。这些结果证明了它的泛化能力及其在多样化复杂光照条件下的实际多媒体应用潜力。
摘要:Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
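作为示意(并非DIME-Net的实现),下面的纯Python草图展示了"稀疏门控选择S曲线专家"这一机制的基本形态:根据输入亮度统计量恰好激活一个专家;专家曲线、阈值与专家数量均为假设。

```python
# Illustrative sketch (not the DIME-Net implementation): a sparse gate that
# routes an image to exactly one S-curve "expert" based on mean luminance.
# The expert set, curve parameters, and thresholds are all made up.
import math

def s_curve(v, alpha):
    """Logistic S-curve tone mapping for a luminance value in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-alpha * (v - 0.5)))

EXPERTS = {
    "low_light": lambda v: v ** 0.45,        # gamma lift for dark inputs
    "backlit":   lambda v: s_curve(v, 6.0),  # contrast S-curve for backlit
    "identity":  lambda v: v,                # well-exposed: leave unchanged
}

def gate(pixels):
    """Sparse gating: pick one expert from an illumination statistic."""
    mean = sum(pixels) / len(pixels)
    if mean < 0.25:
        return "low_light"
    if mean > 0.6:
        return "identity"
    return "backlit"

def enhance(pixels):
    expert = EXPERTS[gate(pixels)]
    return [expert(v) for v in pixels]
```

论文中的门控是可学习的网络且专家为参数化S曲线,此处仅用固定规则示意"每次只激活一个专家"的稀疏路由思想。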


【2】Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems
标题:自感知自适应对齐:实现智能交通系统的准确感知
链接:https://arxiv.org/abs/2508.13823

作者:g, Hongxia Zhao, Fenghua Zhu, Yuanyuan Chen, Yisheng Lv
备注:Domain adaptation, Virtual Reality, Object Detection
摘要:在智能交通检测中实现一流的性能是一个关键的研究领域。然而,在跨域场景中进行检测时,仍有许多挑战需要解决。在本文中,我们提出了一种自感知自适应对齐(SA3)方法,利用高效的对齐机制和识别策略。我们提出的方法采用一个专门的基于注意力的对齐模块,在源域和目标域数据集上训练,以指导图像级特征对齐过程,实现源域和目标域之间的局部-全局自适应对齐。来自这两个域的特征在通道重要性被重新加权后,被送入区域建议网络,这有助于获取显著的区域特征。此外,我们引入了一个特定于目标域的实例到图像级对齐模块,以自适应地缓解域差距。为了评估所提出的方法,我们在流行的跨域目标检测基准上进行了广泛的实验。实验结果表明,SA3取得了优于以往最先进方法的结果。
摘要:Achieving top-notch performance in Intelligent Transportation detection is a critical research area. However, many challenges still need to be addressed when it comes to detecting in a cross-domain scenario. In this paper, we propose a Self-Aware Adaptive Alignment (SA3), by leveraging an efficient alignment mechanism and recognition strategy. Our proposed method employs a specified attention-based alignment module trained on source and target domain datasets to guide the image-level features alignment process, enabling the local-global adaptive alignment between the source domain and target domain. Features from both domains, whose channel importance is re-weighted, are fed into the region proposal network, which facilitates the acquisition of salient region features. Also, we introduce an instance-to-image level alignment module specific to the target domain to adaptively mitigate the domain gap. To evaluate the proposed method, extensive experiments have been conducted on popular cross-domain object detection benchmarks. Experimental results show that SA3 achieves superior results to the previous state-of-the-art methods.


【3】Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency
标题:基于跨域几何一致性校正VFM导出的潜在空间中的有偏分布
链接:https://arxiv.org/abs/2508.13518

作者:a, Wei Dai, Bowei Liu, Jiayi Chen, Wenke Huang, Guancheng Wan, Zhiwu Lu, Junchi Yan
备注:15 pages, CVPR Oral
摘要:尽管深度学习进展迅速,但一个长期存在的挑战是观察到的训练样本与底层真实分布之间的差距。造成这种差距的原因有多种,例如采样偏差、噪声等。在基础模型时代,我们表明,当利用现成的(视觉)基础模型(例如CLIP、DINOv2)进行特征提取时,所得特征分布的几何形状在跨域和跨数据集上表现出显著的可迁移性。为了验证其实用性,我们将几何知识引导的分布校准框架应用于两个流行且具有挑战性的场景:联邦学习和长尾识别。在联邦设置中,我们设计了一种在隐私约束下获取全局几何形状的技术,然后利用这些知识为客户端生成新样本,以弥合本地观测与全局观测之间的差距。在长尾学习中,它利用从样本丰富的类别迁移来的几何知识来恢复样本稀缺的尾部类别的真实分布。综合实验表明,我们提出的几何知识引导的分布校准有效克服了数据异构性和样本不平衡造成的信息缺失,并在多个基准上提升了性能。
摘要:Despite the fast progress of deep learning, one standing challenge is the gap of the observed training samples and the underlying true distribution. There are multiple reasons for the causing of this gap e.g. sampling bias, noise etc. In the era of foundation models, we show that when leveraging the off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique of acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, in the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.


【4】AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
标题:AdaptiveAE:动态场景中HDR捕捉的自适应曝光策略
链接:https://arxiv.org/abs/2508.13503

作者:, Fan Zhang, Boxin Shi, Tianfan Xue, Yujin Wang
备注:Accepted to ICCV 2025
摘要:主流的高动态范围成像技术通常依赖于融合使用不同曝光设置(快门速度和ISO)捕获的多个图像。快门速度和ISO之间的良好平衡对于实现高质量HDR至关重要,因为高ISO值会引入显著的噪声,而长快门速度会导致明显的运动模糊。然而,现有的方法往往忽略了快门速度和ISO之间的复杂的相互作用,并未能考虑动态场景中的运动模糊效果。   在这项工作中,我们提出了AdaptiveAE,这是一种基于强化学习的方法,可以优化快门速度和ISO组合的选择,以最大限度地提高动态环境中的HDR重建质量。AdaptiveAE集成了一个图像合成管道,该管道将运动模糊和噪声模拟纳入我们的训练过程,利用语义信息和曝光直方图。它可以根据用户定义的曝光时间预算自适应地选择最佳ISO和快门速度序列,并找到比传统解决方案更好的曝光时间表。在多个数据集上的实验结果表明,它达到了最先进的性能。
摘要:Mainstream high dynamic range imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is crucial for achieving high-quality HDR, as high ISO values introduce significant noise, while long shutter speeds can lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes.   In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and exposure histograms. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, and find a better exposure schedule than traditional solutions. Experimental results across multiple datasets demonstrate that it achieves the state-of-the-art performance.
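下面是一个假设性的纯Python示意,并非论文的强化学习策略:在总曝光时间预算内贪心挑选(快门, ISO)组合,用简单项惩罚高ISO噪声和长快门运动模糊;候选集、打分函数及其权重均为虚构,仅用于说明"快门与ISO的权衡 + 预算约束"这一问题形态。

```python
# Hypothetical sketch, NOT the paper's RL policy: greedily build an exposure
# schedule under a total shutter-time budget, trading off high-ISO noise
# against long-shutter motion blur. Candidates and weights are invented.
CANDIDATES = [  # (shutter_s, iso)
    (1 / 500, 1600), (1 / 125, 800), (1 / 30, 400), (1 / 8, 100),
]

def score(shutter, iso, motion=1.0):
    noise_penalty = 0.002 * iso                # high ISO -> noisy frame
    blur_penalty = 50.0 * shutter * motion     # long shutter -> motion blur
    light_gain = (shutter * iso) ** 0.5        # proxy for captured light
    return light_gain - noise_penalty - blur_penalty

def schedule(budget_s, motion=1.0):
    """Greedily pick the best-scoring exposures that fit the time budget."""
    chosen, used = [], 0.0
    for shutter, iso in sorted(CANDIDATES,
                               key=lambda c: -score(*c, motion=motion)):
        if used + shutter <= budget_s:
            chosen.append((shutter, iso))
            used += shutter
    return chosen
```

论文用强化学习在语义信息和曝光直方图条件下学习这种调度;此处的贪心规则只是对同一决策问题的最小化演示。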


【5】Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology
标题:GPT-5在放射学和放射肿瘤学中用于Zero-Shot多模态医学推理的基准测试
链接:https://arxiv.org/abs/2508.13192

作者:u, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, Xiaofeng Yang
摘要:放射学、放射肿瘤学和医学物理学需要在高风险条件下集成医学图像、文本报告和定量数据进行决策。随着GPT-5的推出,评估大型多模态模型的最新进展能否在这些安全关键领域转化为可衡量的收益至关重要。我们提出了一个有针对性的zero-shot评估,在三个代表性任务上将GPT-5及其较小的变体(GPT-5-mini、GPT-5-nano)与GPT-4o进行对比:(1)VQA-RAD,放射学视觉问答基准;(2)SLAKE,测试跨模态接地的语义标注多语言VQA数据集;以及(3)一个精选的医学物理委员会考试风格数据集,包含150个多项选择题,涵盖治疗计划、剂量测定、成像和质量保证。在所有数据集中,GPT-5的准确率最高,相比GPT-4o有大幅提升:在具有挑战性的解剖区域(如胸纵隔)中高达+20.00%,在肺部相关问题中为+13.60%,在脑组织解读中为+11.44%。在委员会考试风格的物理问题上,GPT-5达到了90.7%的准确率(136/150),超过了估计的人类通过阈值,而GPT-4o落后于78.0%。这些结果表明,GPT-5在基于图像的推理和特定领域的数值问题求解方面都比GPT-4o提供了一致且通常显著的性能改进,突显了其增强医学成像和治疗物理学专家工作流程的潜力。
摘要:Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o: up to +20.00% in challenging anatomical regions such as the chest-mediastinal area, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
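摘要中报告的考试成绩可以直接核对:136/150确实约为90.7%。

```python
# 核对摘要报告的医学物理委员会考试成绩:GPT-5 答对 136/150。
correct, total = 136, 150
gpt5_accuracy = 100 * correct / total
gpt4o_accuracy = 78.0
print(round(gpt5_accuracy, 1))                    # 90.7,与报告一致
print(round(gpt5_accuracy - gpt4o_accuracy, 1))   # 领先 12.7 个百分点
```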


半弱无监督|主动学习|不确定性(5篇)

【1】Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
标题:通过噪声对齐对自监督对比学习植入后门
链接:https://arxiv.org/abs/2508.14015

作者: Jie Gui, Minjing Dong, Ju Jia, Lanting Fang, Jian Liu
备注:Accepted by ICCV 2025
摘要:自监督对比学习(CL)可以有效地从包含图像或图像-文本对的未标记数据中学习可迁移的表示,但容易受到数据中毒后门攻击(DPCLs)的影响。攻击者可以将有毒图像注入预训练数据集,导致受损的CL编码器在下游任务中表现出有针对性的不当行为。然而,现有的DPCLs效果有限,因为它们依赖于后门与目标对象之间脆弱的隐式共现,并且对有后门图像中判别性特征的抑制不足。我们提出了噪声对齐(Noisy Alignment, NA),一种显式抑制中毒图像中噪声成分的DPCL方法。受训练可控的强大CL攻击的启发,我们识别并提取了噪声对齐的关键目标,并将其有效地适配到数据中毒场景中。我们的方法通过有策略地操纵对比学习的随机裁剪机制来实现噪声对齐,将这一过程表述为一个图像布局优化问题,并通过理论推导得到最优参数。由此产生的方法简单而有效,与现有DPCLs相比实现了最先进的性能,同时保持干净数据的准确性。此外,噪声对齐还展示了对常见后门防御的鲁棒性。代码可在https://github.com/jsrdcht/Noisy-Alignment上找到。
摘要:Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.


【2】Self-Supervised Sparse Sensor Fusion for Long Range Perception
标题:用于远距离感知的自监督稀疏传感器融合
链接:https://arxiv.org/abs/2508.13995

作者:alladin, Samuel Brucker, Filippo Ghilotti, Praveen Narayanan, Mario Bijelic, Felix Heide
摘要:在城市中心之外,自动驾驶汽车和卡车必须掌握在城际高速公路上的驾驶。以超过100 km/h的速度进行安全的长距离高速公路行驶需要至少250 m的感知距离,这大约是城市驾驶中通常解决的50- 100 m的五倍,以允许足够的规划和制动裕度。增加感知范围还可以将自主性从2吨的轻型乘用车扩展到40吨的大型卡车,由于其高惯性,需要更长的规划范围。然而,大多数现有的感知方法专注于较短的范围,并依赖于鸟瞰图(BEV)表示,这会导致内存和计算成本随着距离的增加而呈二次增长。为了克服这一限制,我们建立在稀疏表示的基础上,并引入了多模态和时间特征的有效3D编码,以及一种新的自监督预训练方案,该方案可以从未标记的相机LiDAR数据中进行大规模学习。我们的方法将感知距离扩展到250米,与现有方法相比,目标检测中的mAP提高了26.6%,LiDAR预测中的倒角距离降低了30.5%,距离可达250米。项目页面:https://light.princeton.edu/lrs4fusion/
摘要:Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows autonomy to be extended from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on top of a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP in object detection and a 30.5% decrease in Chamfer Distance in LiDAR forecasting compared to existing methods, reaching distances up to 250 meters. Project Page: https://light.princeton.edu/lrs4fusion/
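摘要提到稠密BEV表示的内存/计算开销随感知距离二次增长,可用如下粗略估算示意(0.5米的栅格尺寸为假设值,仅用于说明量级):

```python
# Back-of-the-envelope illustration of the quadratic BEV cost the abstract
# mentions: a dense BEV grid's cell count grows with the square of the
# perception range. The 0.5 m cell size is an assumed, typical value.
def bev_cells(range_m, cell_m=0.5):
    side = int(2 * range_m / cell_m)  # grid spans [-range, +range] in x and y
    return side * side

ratio = bev_cells(250) / bev_cells(100)
print(ratio)  # 6.25:把感知距离从 100 m 扩到 250 m,稠密BEV内存约增至 6 倍多
```

稀疏表示的开销则大致随实际被占据的单元数增长,这正是论文选择稀疏编码的动机。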


【3】Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
标题:基于街景图像与空间感知视觉聚类的无监督城市树木生物多样性制图
链接:https://arxiv.org/abs/2508.13814

作者:en Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte, Carlo Ratti
备注:26 pages, 7 figures, Nature Format
摘要:城市树木生物多样性对于城市的气候适应能力、生态稳定性和宜居性至关重要,但大多数城市缺乏对其树冠的详细了解。基于现场的库存可以提供对香农和辛普森多样性的可靠估计,但成本高昂且耗时,而监督式人工智能方法需要标记的数据,而这些数据通常无法跨地区推广。我们引入了一个无监督的聚类框架,该框架将街道级图像的视觉嵌入与空间种植模式相结合,以估计没有标签的生物多样性。应用于八个北美城市,该方法恢复属级的多样性模式与高保真度,实现低Wasserstein距离地面真相香农和辛普森指数,并保持空间自相关。这种可扩展的细粒度方法使缺乏详细清单的城市能够进行生物多样性测绘,并提供了一种持续、低成本监测的途径,以支持公平获得绿色植物和城市生态系统的适应性管理。
摘要:Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
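摘要中评估的香农(Shannon)与辛普森(Simpson)多样性指数定义如下(纯Python示意;属计数为虚构示例):

```python
import math

# The two diversity indices evaluated in the abstract, computed from
# per-genus tree counts. The example counts below are made up.
def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """(Gini-)Simpson index D = 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

city = [40, 25, 20, 10, 5]  # hypothetical genus counts for one city
print(round(shannon(city), 3), round(simpson(city), 3))  # 1.415 0.725
```

论文比较的是由无监督聚类得到的计数与实地清单计数之间这两个指数分布的Wasserstein距离。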


【4】CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving
标题:CORENet:用于自动驾驶的跨模态4D雷达降噪网络与LiDAR监督
链接:https://arxiv.org/abs/2508.13485

作者:u, Jilin Mei, Fangyuan Mao, Chen Min, Yan Xing, Yu Hu
备注:8 pages, 5 figures, Accepted to IROS 2025
摘要:基于4D雷达的物体检测因其在恶劣天气条件下的鲁棒性以及在不同驾驶场景中提供丰富空间信息的能力而受到极大关注。然而,4D雷达点云的稀疏和噪声特性对有效感知提出了重大挑战。为了解决这一限制,我们提出了CORENet,一种新型跨模态去噪框架,它利用LiDAR监督来识别噪声模式并从原始4D雷达数据中提取判别性特征。我们的方案设计为即插即用架构,无需修改现有管道即可无缝集成到基于体素的检测框架中。值得注意的是,所提出的方法仅在训练期间利用LiDAR数据进行跨模态监督,而在推理期间保持纯雷达运行。在具有挑战性的Dual-Radar数据集(其特点是噪声水平较高)上的广泛评估证明了我们的框架在提高检测鲁棒性方面的有效性。综合实验验证,与现有主流方法相比,CORENet实现了更优的性能。
摘要:4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches.


【5】RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
标题:RISE:通过自我监督推理增强VLM图像注释
链接:https://arxiv.org/abs/2508.13229

作者:, Wei Hu, Yuhang Su, Fan Zhang
摘要:视觉语言模型(VLM)难以处理复杂的图像注释任务,例如情感分类和上下文驱动的对象检测,这些任务需要复杂的推理。标准监督微调(SFT)仅关注注释结果,忽略了潜在的原理,而视觉强化微调(Visual-RFT)由于在预训练期间缺乏高质量的,经过验证的CoT而产生不一致的思想链(CoT)。我们引入了RISE(理性-启发-加强-专业知识),这是一个克服这些限制的两阶段框架。在推理阶段(RISE-CoT),强化学习驱动的“注释-推理-注释”闭环通过验证其重建原始注释而不直接泄漏的能力来生成视觉上接地的、逻辑上一致的CoT。优化和强化阶段(RISE-R1)利用一个高质量的CoT子集,通过RISE-CoT奖励进行过滤,进行监督微调,然后进行强化微调,以产生可解释的推理和准确的注释,从而在复杂的视觉任务中获得专业知识。在复杂和简单的图像注释任务上进行评估,RISE训练的Qwen 2-VL-2B优于SFT和Visual-RFT,实现了强大的性能和增强的可解释性。RISE提供了一个自我监督的解决方案,用于推进VLM推理,而无需手动注释CoT。
摘要:Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.


时序|行为识别|姿态|视频|运动估计(8篇)

【1】LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
标题:LongSplat:面向随意拍摄长视频的鲁棒无姿态3D高斯溅射
链接:https://arxiv.org/abs/2508.14041

作者: Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
备注:ICCV 2025. Project page: this https URL
摘要:LongSplat解决了新视图合成(NVS)中的关键挑战,这些挑战来自随意捕获的长视频,其特征在于不规则的相机运动,未知的相机姿势和广阔的场景。目前的方法往往遭受姿势漂移,不准确的几何初始化,和严重的内存限制。为了解决这些问题,我们引入了LongSplat,一个强大的无姿态3D高斯溅射框架,其特征在于:(1)增量联合优化,同时优化相机姿态和3D高斯,以避免局部最小值并确保全局一致性;(2)鲁棒的姿态估计模块,利用学习的3D先验;以及(3)有效的八叉树锚点形成机制,其基于空间密度将密集点云转换为锚点。在具有挑战性的基准测试上进行的大量实验表明,LongSplat实现了最先进的结果,与现有方法相比,大大提高了渲染质量,姿态精度和计算效率。项目页面:https://linjohnss.github.io/longsplat/
摘要:LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/
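摘要中"基于空间密度将密集点云转换为锚点"的思路可用如下纯Python示意(并非LongSplat的八叉树实现;体素尺寸与密度阈值均为假设):

```python
# Illustrative sketch (not LongSplat's octree code) of density-based anchor
# formation: bucket points into voxels and keep a centroid anchor only for
# voxels holding at least `min_pts` points. Voxel size/threshold are assumed.
from collections import defaultdict

def anchors(points, voxel=1.0, min_pts=3):
    cells = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel), int(y // voxel), int(z // voxel))
        cells[key].append((x, y, z))
    out = []
    for pts in cells.values():
        if len(pts) >= min_pts:          # density gate: sparse cells dropped
            n = len(pts)
            out.append(tuple(sum(p[i] for p in pts) / n for i in range(3)))
    return out
```

实际方法在此基础上以八叉树层级组织锚点并在其上挂载高斯;此处只演示"密度门控 + 质心锚点"这一步。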


【2】Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
标题:超越简单的编辑:具有密集修改的合成视频检索
链接:https://arxiv.org/abs/2508.14039

作者:wakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
备注:Accepted to ICCV-2025
摘要:合成视频检索是一项具有挑战性的任务,它致力于基于查询视频和详细描述特定修改的文本描述来检索目标视频。标准检索框架通常难以处理细粒度组合查询的复杂性和时间理解的变化,限制了它们在细粒度设置中的检索能力。为了解决这个问题,我们引入了一个新的数据集,它可以捕获不同视频片段中的细粒度和组合动作,从而在检索到的视频内容中实现更详细的组成变化。该数据集名为Dense-WebVid-CoVR,由160万个样本组成,其中密集的修改文本是现有文本的7倍左右。我们进一步开发了一个新的模型,通过使用接地文本编码器的交叉注意力(CA)融合来集成视觉和文本信息,从而实现密集查询修改和目标视频之间的精确对齐。所提出的模型实现了国家的最先进的结果超越现有的方法在所有指标。值得注意的是,它在视觉+文本设置中实现了71.3\%的Recall@1,比最先进的技术高出3.4\%,突出了其在利用详细视频描述和密集修改文本方面的功效。我们建议的数据集、代码和模型可在以下网址获得:https://github.com/OmkarThawakar/BSE-CoVR
摘要:Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results surpassing existing methods on all metrics. Notably, it achieves 71.3\% Recall@1 in visual+text setting and outperforms the state-of-the-art by 3.4\%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at :https://github.com/OmkarThawakar/BSE-CoVR
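摘要中报告的Recall@1指标定义很简单:目标视频出现在检索结果前K位的查询所占比例。下面是一个极简示意(检索结果与目标均为虚构):

```python
# Minimal sketch of the Recall@K metric reported above: the fraction of
# queries whose target video appears among the top-K retrieved results.
def recall_at_k(rankings, targets, k=1):
    hits = sum(1 for ranked, t in zip(rankings, targets) if t in ranked[:k])
    return hits / len(targets)

rankings = [["v3", "v1"], ["v2", "v7"], ["v9", "v5"]]  # toy retrieval output
targets = ["v3", "v7", "v5"]
print(recall_at_k(rankings, targets, k=1))  # 3 个查询中只有 1 个目标排在首位
```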


【3】Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia
标题:使用ConvLSTM预测雾霾事件:南亚气溶胶指数预测的时空方法
链接:https://arxiv.org/abs/2508.13891

作者:an
摘要:南亚烟雾是指每年重复发生的空气污染事件,其特点是污染物水平高,能见度降低,并产生重大的社会经济影响,主要影响11月至2月的印度-恒河平原(IGP)。在过去的十年中,增加的空气污染源,如农作物残留物燃烧,机动车辆和不断变化的天气模式加剧了这些烟雾事件。然而,在区域范围内仍然没有建立颗粒物浓度增加的实时预报系统。气溶胶指数与烟雾形成密切相关,是计算空气质量指数(AQI)的关键组成部分,反映了颗粒物浓度。这项研究使用Sentinel-5 P空气成分数据(2019-2023年)和卷积长短期记忆(ConvLSTM)神经网络预测气溶胶事件,该神经网络比以前的模型更有效地捕获空间和时间相关性。以340-380 nm的紫外线气溶胶指数作为预报因子,结果表明,气溶胶指数可以以5天为间隔进行预报,均方误差为~0.0018,损失为~0.3995,结构相似性指数为~0.74。虽然有效,但可以通过集成其他数据和改进其架构来改进模型。
摘要:The South Asian Smog refers to the recurring annual air pollution events marked by high contaminant levels, reduced visibility, and significant socio-economic impacts, primarily affecting the Indo-Gangetic Plains (IGP) from November to February. Over the past decade, increased air pollution sources such as crop residue burning, motor vehicles, and changing weather patterns have intensified these smog events. However, real-time forecasting systems for increased particulate matter concentrations are still not established at regional scale. The Aerosol Index, closely tied to smog formation and a key component in calculating the Air Quality Index (AQI), reflects particulate matter concentrations. This study forecasts aerosol events using Sentinel-5P air constituent data (2019-2023) and a Convolutional Long-Short Term Memory (ConvLSTM) neural network, which captures spatial and temporal correlations more effectively than previous models. Using the Ultraviolet (UV) Aerosol Index at 340-380 nm as the predictor, results show the Aerosol Index can be forecasted at five-day intervals with a Mean Squared Error of ~0.0018, loss of ~0.3995, and Structural Similarity Index of ~0.74. While effective, the model can be improved by integrating additional data and refining its architecture.
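摘要中使用的ConvLSTM单元通常采用Shi等人(2015)提出的标准形式(论文未给出其具体变体,此处引用标准公式;其中 $*$ 表示卷积,$\circ$ 表示Hadamard积):

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

正是门控中的卷积算子 $*$ 使该模型能同时捕获气溶胶场的空间和时间相关性,这是摘要所述其优于以往模型之处。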


【4】Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing
标题:Sketch 3DVE:基于Sketch的3D感知场景视频编辑
链接:https://arxiv.org/abs/2508.13797

作者:Liu, Shi-Yang Li, Yan-Pei Cao, Hongbo Fu, Lin Gao
备注:SIGGRAPH 2025
摘要:最近的视频编辑方法在风格转移或外观修改方面取得了令人瞩目的结果。然而,编辑视频中3D场景的结构内容仍然具有挑战性,特别是在处理显著的视点变化时,例如大的相机旋转或缩放。关键挑战包括生成与原始视频保持一致的新颖视图内容,保留未编辑区域,以及将稀疏的2D输入转换为逼真的3D视频输出。为了解决这些问题,我们提出了Sketch3DVE,一个基于草图的3D感知视频编辑方法,使详细的本地操作的视频与显着的视点变化。为了解决稀疏输入带来的挑战,我们采用图像编辑方法来生成第一帧的编辑结果,然后将其传播到视频的其余帧。我们利用草图作为精确几何控制的交互工具,同时也支持其他基于蒙版的图像编辑方法。为了处理视点变化,我们对视频中的3D信息进行了详细的分析和操作。具体来说,我们利用密集立体方法来估计输入视频的点云和相机参数。然后,我们提出了一种点云编辑方法,该方法使用深度图来表示新编辑组件的3D几何形状,将它们与原始3D场景有效对齐。为了将新编辑的内容与原始视频无缝合并,同时保留未编辑区域的特征,我们引入了3D感知的掩码传播策略,并采用视频扩散模型来生成逼真的编辑视频。大量的实验证明了Sketch3DVE在视频编辑中的优越性。网址及代码:http://geometrylearning.com/Sketch3DVE/
摘要:Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing. Homepage and code: http://geometrylearning.com/Sketch3DVE/


【5】MR6D: Benchmarking 6D Pose Estimation for Mobile Robots
标题:MR6D:移动机器人6D姿态估计基准
链接:https://arxiv.org/abs/2508.13775

作者:a, Shrutarv Awasthi, Christian Blesing, Lokeshwaran Manohar, Frank Hoffmann, Alice Kirchheim
备注:accepted CVPR 2025 Workshop on Recovering 6D Object Pose (R6D)
摘要:现有的6D姿态估计数据集主要集中在通常由机械臂操纵器处理的小型家用物体上,限制了它们与移动机器人的相关性。移动平台通常在没有操纵器的情况下运行,与较大的物体交互,并面临诸如远距离感知、严重的自遮挡和多变的相机视角等挑战。虽然最近的模型对未见过的物体泛化良好,但评估仍局限于忽视这些因素的家庭式环境。我们介绍了MR6D,一个专为工业环境中移动机器人6D姿态估计设计的数据集。它包括92个真实世界场景,涵盖静态和动态交互中的16个独特物体。MR6D捕捉了移动平台特有的挑战,包括远距离视点、多样的物体配置、较大的物体尺寸以及复杂的遮挡/自遮挡模式。初步实验表明,目前的6D管道在这些设置中表现不佳,2D分割是另一个障碍。MR6D为开发和评估适合移动机器人需求的姿态估计方法奠定了基础。该数据集可在https://huggingface.co/datasets/anas-gouda/mr6d上获得。
摘要:Existing 6D pose estimation datasets primarily focus on small household objects typically handled by robot arm manipulators, limiting their relevance to mobile robotics. Mobile platforms often operate without manipulators, interact with larger objects, and face challenges such as long-range perception, heavy self-occlusion, and diverse camera perspectives. While recent models generalize well to unseen objects, evaluations remain confined to household-like settings that overlook these factors. We introduce MR6D, a dataset designed for 6D pose estimation for mobile robots in industrial environments. It includes 92 real-world scenes featuring 16 unique objects across static and dynamic interactions. MR6D captures the challenges specific to mobile platforms, including distant viewpoints, varied object configurations, larger object sizes, and complex occlusion/self-occlusion patterns. Initial experiments reveal that current 6D pipelines underperform in these settings, with 2D segmentation being another hurdle. MR6D establishes a foundation for developing and evaluating pose estimation methods tailored to the demands of mobile robotics. The dataset is available at https://huggingface.co/datasets/anas-gouda/mr6d.


【6】RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance
标题:RCGNet:具有几何引导的基于RGB的类别级6D物体姿态估计
链接:https://arxiv.org/abs/2508.13623

作者: Di-Hua Zhai, Yuanqing Xia
备注:Accepted by IROS2025
摘要:虽然目前大多数基于RGB-D的类别级物体姿态估计方法具有很强的性能,但它们在缺乏深度信息的场景中面临重大挑战。在本文中,我们提出了一种仅依赖RGB图像的新型类别级物体姿态估计方法。该方法无需深度数据即可在真实世界场景中进行准确的姿态估计。具体而言,我们设计了一个基于Transformer的类别级物体姿态估计神经网络,其中Transformer用于预测和融合目标物体的几何特征。为了确保这些预测的几何特征忠实地捕捉物体的几何形状,我们引入了一种几何特征引导算法,增强了网络有效表示物体几何信息的能力。最后,我们利用RANSAC-PnP算法来计算物体的姿态,解决了姿态估计中与物体尺度可变相关的挑战。在基准数据集上的实验结果表明,与以前的基于RGB的方法相比,我们的方法不仅效率高,而且精度更高。这些有前景的结果为使用RGB图像推进类别级物体姿态估计提供了新的视角。
摘要:While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object's geometry, we introduce a geometric feature-guided algorithm, which enhances the network's ability to effectively represent the object's geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object's pose, addressing the challenges associated with variable object scales in pose estimation. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.
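摘要中提到的RANSAC-PnP依赖"最小样本拟合 + 内点共识"的思想来抵抗错误对应。完整的PnP求解超出本摘要范围;下面是一个通用RANSAC循环的纯Python示意(用带离群点的2D直线拟合代替PnP模型,函数名与阈值均为假设),仅用于说明该机制:

```python
import random

def fit_line(pts):
    """对点集做最小二乘直线拟合 y = a*x + b;退化样本返回 None。"""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    den = n * sxx - sx * sx
    if abs(den) < 1e-12:            # 两点x坐标相同,无法确定斜率
        return None
    a = (n * sxy - sx * sy) / den
    return a, (sy - a * sx) / n

def line_error(model, p):
    if model is None:
        return float("inf")
    a, b = model
    return abs(p[1] - (a * p[0] + b))

def ransac(points, threshold=0.1, iters=200, seed=0):
    """通用RANSAC:反复用最小样本拟合模型,保留内点共识最大的模型。"""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        model = fit_line(rng.sample(points, 2))  # 两点即可确定一条直线
        inliers = [p for p in points if line_error(model, p) < threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_line(best_inliers), best_inliers  # 用全部内点重新拟合

# 8个点位于 y = 2x + 1 上,外加2个粗差外点
data = [(x, 2 * x + 1) for x in range(8)] + [(1.0, 40.0), (5.0, -30.0)]
(a, b), inliers = ransac(data)
```

RANSAC-PnP的做法相同,只是最小样本换成若干2D-3D对应点,误差换成重投影误差。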


【7】MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
标题:MimicFunc:通过功能对应从单个人类视频中模仿工具操作
链接:https://arxiv.org/abs/2508.13534

作者:, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang
备注:Accepted to CoRL 2025
摘要:从人类视频中模仿工具操作为教授机器人提供了一种直观的方法,同时也为视觉运动策略学习中劳动密集型的遥操作数据收集提供了一种有前景且可扩展的替代方案。人类只需观察他人执行一次任务就能模仿工具操作行为,并毫不费力地将技能迁移到不同工具上以完成功能等效的任务,而目前的机器人很难达到这种泛化水平。一个关键挑战在于建立功能层面的对应关系,需要考虑功能相似的工具之间显著的几何差异,即功能内变化(intra-function variations)。为应对这一挑战,我们提出MimicFunc框架,它通过功能坐标系(function frame)——一个以功能为中心、基于关键点抽象构建的局部坐标系——建立功能对应关系,从而模仿工具操作技能。实验表明,MimicFunc能有效地使机器人将技能从单个RGB-D人类视频泛化到操纵新工具以完成功能等效的任务。此外,利用MimicFunc的单样本泛化能力,生成的轨迹(rollouts)可用于训练视觉运动策略,而无需为新物体进行劳动密集型的遥操作数据收集。我们的代码和视频见https://sites.google.com/view/mimicfunc。
摘要:Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.
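MimicFunc基于关键点抽象构建以功能为中心的局部坐标系(function frame)。摘要未给出具体构造方式;下面是一个假设性的最小示意:用三个关键点(功能原点、工具尖端、侧向参考点)通过叉积构造右手正交坐标系,关键点命名仅作说明,并非论文中的实际定义:

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def function_frame(origin_kp, tip_kp, side_kp):
    """假设的构造:x轴指向工具尖端,z轴垂直于三关键点所在平面,y = z × x。"""
    x = norm(sub(tip_kp, origin_kp))
    z = norm(cross(x, sub(side_kp, origin_kp)))
    y = cross(z, x)                 # z、x已单位化且正交,故y也是单位向量
    return origin_kp, [x, y, z]

origin, (x, y, z) = function_frame([0, 0, 0], [2, 0, 0], [0, 3, 0])
```

这样得到的位姿(原点 + 三个轴)即可把源视频中的运动迁移到另一工具的对应坐标系下。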


【8】Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
标题:用于交通视频解释与风险推理的结构化提示和多智能体知识蒸馏
链接:https://arxiv.org/abs/2508.13439

作者:Yang, Ningning Xu, Jidong J. Yang
备注:16 pages, 10 figures, 1 table
摘要:全面的公路场景理解和稳健的交通风险推理对于推进智能交通系统(ITS)和自动驾驶至关重要。传统方法在可扩展性和泛化方面往往表现不佳,尤其是在现实环境复杂多变的条件下。为应对这些挑战,我们引入了一种新颖的结构化提示与知识蒸馏框架,可自动生成高质量的交通场景标注和情境风险评估。我们的框架协调两个大型视觉语言模型(VLM):GPT-4o和o3-mini,使用结构化思维链(CoT)策略产生丰富的多视角输出。这些输出作为富含知识的伪标注,用于对小得多的学生VLM进行监督微调。由此得到的紧凑3B规模模型,名为VISTA(Vision for Intelligent Scene and Traffic Analysis),能够理解低分辨率交通视频并生成语义忠实、具备风险意识的字幕。尽管参数量显著减少,VISTA在与教师模型对比时,在既有字幕指标(BLEU-4、METEOR、ROUGE-L和CIDEr)上实现了强劲性能。这表明有效的知识蒸馏和结构化多智能体监督可以使轻量级VLM捕获复杂的推理能力。VISTA的紧凑架构便于在边缘设备上高效部署,无需大量基础设施升级即可实现实时风险监控。
摘要:Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.


医学相关(6篇)

【1】In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging
标题:用于正则化医学成像深度学习的In-hoc概念表示
链接:https://arxiv.org/abs/2508.13880

作者: Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm, Wilson Silva
备注:13 pages, 13 figures, 2 tables, accepted at PHAROS-AFE-AIMI Workshop in conjunction with the International Conference on Computer Vision (ICCV), 2025. This is the submitted manuscript with added link to the github repo, funding acknowledgments and author names and affiliations, and a correction to numbers in Table 1. Final version not published yet
摘要:医学成像中的深度学习模型通常具有很强的分布内性能,但难以在分布偏移下泛化,常常依赖虚假相关性而非有临床意义的特征。我们引入LCRReg,一种利用潜在概念表示(LCR,例如概念激活向量(CAV))的新型正则化方法,引导模型学习语义上有依据的表示。LCRReg在主训练集中不需要概念标签,而是使用一个小的辅助数据集来合成高质量、解耦的概念样本。我们为预定义的相关特征提取LCR,并引入一个正则化项,引导卷积神经网络(CNN)在与这些概念相关的潜在子空间内激活。我们在合成和真实世界的医疗任务中评估LCRReg。在受控的toy数据集上,它显著提高了对注入虚假相关性的鲁棒性,即使在多概念和多类别设置中也依然有效。在糖尿病视网膜病变二分类任务中,LCRReg在合成虚假扰动和分布外(OOD)泛化两种情形下均提升了性能。与包括多任务学习、线性探测和事后(post-hoc)基于概念的模型在内的基线相比,LCRReg提供了一种轻量级、与架构无关的策略,可在无需密集概念监督的情况下提高模型鲁棒性。代码可在以下链接获得:https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization
摘要:Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization
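LCRReg引导CNN激活与概念方向(CAV)对齐。摘要未给出损失的具体形式;下面是一种可能形式的假设性草图(纯Python):以"1 − 余弦相似度"惩罚激活向量偏离概念方向的程度,作为加到主任务损失上的正则项,函数与权重均为示意:

```python
import math

def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def lcr_alignment_loss(activations, cav, weight=0.1):
    """假设的正则项:鼓励每个激活向量与潜在概念方向(CAV)对齐。
    返回值会(在假设的训练循环中)加到主任务损失上。"""
    penalty = sum(1.0 - cosine(a, cav) for a in activations) / len(activations)
    return weight * penalty

cav = [1.0, 0.0, 0.0]                       # 示意的概念方向
aligned = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]  # 与概念同向的激活:零惩罚
orthogonal = [[0.0, 1.0, 0.0]]                # 与概念正交的激活:最大惩罚
```

实际方法在潜在子空间(而非单一方向)上施加约束,且需在训练中反向传播;此处仅说明"对齐即低损失"的思想。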


【2】Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery
标题:面部整形手术美容结果的自动评估
链接:https://arxiv.org/abs/2508.13363

作者:ghaei, Kiran Abraham-Aggarwal, Manoj T. Abraham, Arun Ross
摘要:我们引入了一个可扩展、可解释的计算机视觉框架,利用正面照片量化面部整形手术的美学效果。我们的管线利用自动关键点检测、几何面部对称性计算、基于深度学习的年龄估计和鼻部形态分析。为进行这项研究,我们首先收集了迄今最大的配对术前/术后面部图像数据集,包括来自1,259名患者的7,160张照片。该数据集包括一个专门的鼻整形子集,由来自366名患者的732张图像组成,其中96.2%的患者在三项鼻部测量中至少有一项显示出改善,且组水平变化具有统计学显著性。在这些患者中,统计学显著性改善最大的(p < 0.001)是鼻翼宽度与面宽之比(77.0%)、鼻长与面高之比(41.5%)和鼻翼宽度与内眦间距之比(39.3%)。在更广泛的正面视图队列(989名经严格筛选的受试者)中,71.3%的受试者在整体面部对称性或感知年龄方面表现出显著改善(p < 0.01)。重要的是,我们的分析表明患者身份在术后保持一致:在0.01%的错误匹配率下,鼻整形特定队列和一般患者队列的真实匹配率分别为99.5%和99.6%。此外,我们还分析了改善率在不同从业者之间的差异。通过提供可重复的定量基准和新的数据集,我们的管线促进了数据驱动的手术规划、患者咨询以及跨诊所的客观结果评估。
摘要:We introduce a scalable, interpretable computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using frontal photographs. Our pipeline leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis. To perform this study, we first assemble the largest curated dataset of paired pre- and post-operative facial images to date, encompassing 7,160 photographs from 1,259 patients. This dataset includes a dedicated rhinoplasty-only subset consisting of 732 images from 366 patients, 96.2% of whom showed improvement in at least one of the three nasal measurements with statistically significant group-level change. Among these patients, the greatest statistically significant improvements (p < 0.001) occurred in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). Among the broader frontal-view cohort, comprising 989 rigorously filtered subjects, 71.3% exhibited significant enhancements in global facial symmetry or perceived age (p < 0.01). Importantly, our analysis shows that patient identity remains consistent post-operatively, with True Match Rates of 99.5% and 99.6% at a False Match Rate of 0.01% for the rhinoplasty-specific and general patient cohorts, respectively. Additionally, we analyze inter-practitioner variability in improvement rates. By providing reproducible, quantitative benchmarks and a novel dataset, our pipeline facilitates data-driven surgical planning, patient counseling, and objective outcome evaluation across practices.
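论文报告的三项鼻部指标都是关键点距离之间的比值。下面是一个假设性的计算示意(纯Python):以像素为单位的距离字典中,键名与数值均为虚构示例,仅说明"比值如何计算、术前术后如何比较":

```python
def nasal_ratios(lm):
    """lm:假设的关键点测量值(像素);键名仅作示意,并非论文中的实际命名。"""
    return {
        "alar_to_face_width": lm["alar_width"] / lm["face_width"],
        "nose_len_to_face_height": lm["nose_length"] / lm["face_height"],
        "alar_to_intercanthal": lm["alar_width"] / lm["intercanthal"],
    }

pre = {"alar_width": 40.0, "face_width": 140.0, "nose_length": 55.0,
       "face_height": 190.0, "intercanthal": 34.0}
post = dict(pre, alar_width=36.0, nose_length=52.0)  # 假想的术后测量

# 每项比值的术后变化量(负值表示比值减小)
change = {k: post_v - pre_v
          for (k, pre_v), post_v in zip(nasal_ratios(pre).items(),
                                        nasal_ratios(post).values())}
```

使用比值(而非绝对距离)可消除拍摄距离和图像分辨率的影响,这也是此类测量通常采用比值的原因。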


【3】UNICON: UNIfied CONtinual Learning for Medical Foundational Models
标题:UNICON:医疗基础模型的统一持续学习
链接:https://arxiv.org/abs/2508.14024

作者:Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed
备注:10 pages, 1 figure
摘要:基础模型在广泛的数据集上进行训练,以捕捉某个领域的总体趋势。然而,在医学成像中,数据的稀缺使得每个领域、模态或任务的预训练都具有挑战性。持续学习提供了一种解决方案,通过在不同的领域或任务上依次微调模型,使其能够集成新知识,而无需每个训练阶段都需要大量数据集。在本文中,我们提出了统一的连续学习医学基础模型(UNICON),一个框架,使基础模型无缝适应不同的领域,任务和模式。与孤立地处理这些变化的传统适应方法不同,UNICON提供了一个统一的、永久可扩展的框架。通过仔细的整合,我们表明,基础模型可以在成像模式,解剖区域和临床目标之间动态扩展,而不会发生灾难性的遗忘或任务干扰。从经验上讲,我们通过调整胸部CT基础模型来验证我们的方法,该模型最初是为了分类而训练的,用于预后和分割任务。我们的研究结果表明,在这两个额外的任务性能提高。此外,我们不断纳入PET扫描,与各自的基线相比,Dice评分提高了5%。这些发现表明,基础模型并不局限于其初始训练范围,而是可以发展,为医学成像的通用AI模型铺平道路。
摘要:Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Continual learning offers a solution by fine-tuning a model sequentially on different domains or tasks, enabling it to integrate new knowledge without requiring large datasets for each training phase. In this paper, we propose UNIfied CONtinual Learning for Medical Foundational Models (UNICON), a framework that enables the seamless adaptation of foundation models to diverse domains, tasks, and modalities. Unlike conventional adaptation methods that treat these changes in isolation, UNICON provides a unified, perpetually expandable framework. Through careful integration, we show that foundation models can dynamically expand across imaging modalities, anatomical regions, and clinical objectives without catastrophic forgetting or task interference. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification to a prognosis and segmentation task. Our results show improved performance across both additional tasks. Furthermore, we continually incorporated PET scans and achieved a 5% improvement in Dice score compared to respective baselines. These findings establish that foundation models are not inherently constrained to their initial training scope but can evolve, paving the way toward generalist AI models for medical imaging.
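摘要中报告的5%提升使用的是分割任务的标准指标Dice系数。下面是该指标在二值掩码上的最小实现(纯Python,掩码展平为0/1序列):

```python
def dice(pred, gt):
    """两个二值掩码之间的Dice系数:2|A∩B| / (|A|+|B|)。
    两个掩码都为空时按惯例记为1.0(完全一致)。"""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

pred = [1, 1, 0, 1, 0, 0]   # 预测掩码(展平)
gt   = [1, 0, 0, 1, 1, 0]   # 真值掩码(展平)
```

此例中交集为2个像素,两掩码各有3个前景像素,故Dice = 2·2/(3+3) = 2/3。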


【4】Comparing Conditional Diffusion Models for Synthesizing Contrast-Enhanced Breast MRI from Pre-Contrast Images
标题:比较条件扩散模型以从预对比图像合成对比增强乳腺MRI
链接:https://arxiv.org/abs/2508.13776

作者: Ibarra, Javier del Riego, Alessandro Catanese, Julian Cuba, Julian Cardona, Nataly Leon, Jonathan Infante, Karim Lekadir, Oliver Diaz, Richard Osuala
备注:13 pages, 5 figures, submitted and accepted to MICCAI Deepbreath workshop 2025
摘要:动态增强(DCE)MRI对乳腺癌的诊断和治疗至关重要。然而,其对造影剂的依赖带来了安全性问题、禁忌症、成本增加和工作流程复杂性。为此,我们提出以对比前图像为条件的去噪扩散概率模型来合成DCE-MRI,并在单侧乳房和全乳房两种设置中引入、评估和比较共22个生成模型变体。为提高病灶保真度,我们引入了肿瘤感知损失函数和显式的肿瘤分割掩码条件。使用公开的多中心数据集并与相应的对比前基线比较,我们观察到基于减影图像的模型在五个互补评估指标上始终优于基于对比后图像的模型。除评估整幅图像外,我们还单独评估感兴趣区域:肿瘤感知损失和分割掩码输入均改善了该区域的评估指标。后者显著增强了捕捉造影剂摄取的定性结果,尽管其假设可以获得肿瘤定位输入,而这在筛查场景中并不一定可用。一项涉及2名放射科医生和4名MRI技师的阅片者研究证实了合成图像的高度真实性,表明生成式对比增强正展现出临床潜力。我们在https://github.com/sebastibar/conditional-diffusion-breast-MRI共享代码库。
摘要:Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis and treatment. However, its reliance on contrast agents introduces safety concerns, contraindications, increased cost, and workflow complexity. To this end, we present pre-contrast conditioned denoising diffusion probabilistic models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of 22 generative model variants in both single-breast and full breast settings. Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions and explicit tumor segmentation mask conditioning. Using a public multicenter dataset and comparing to respective pre-contrast baselines, we observe that subtraction image-based models consistently outperform post-contrast-based models across five complementary evaluation metrics. Apart from assessing the entire image, we also separately evaluate the region of interest, where both tumor-aware losses and segmentation mask inputs improve evaluation metrics. The latter notably enhance qualitative results capturing contrast uptake, albeit assuming access to tumor localization inputs that are not guaranteed to be available in screening settings. A reader study involving 2 radiologists and 4 MRI technologists confirms the high realism of the synthetic images, indicating an emerging clinical potential of generative contrast-enhancement. We share our codebase at https://github.com/sebastibar/conditional-diffusion-breast-MRI.


【5】State of Abdominal CT Datasets: A Critical Review of Bias, Clinical Relevance, and Real-world Applicability
标题:腹部CT数据集的状态:偏差、临床相关性和现实世界适用性的批判性审查
链接:https://arxiv.org/abs/2508.13626

作者:naei, Zahra Dehghanian, Elahe Meftah, Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Faeze Khorasanizade, Hamid R. Rabiee
备注:Preprint. Submitted to IEEE Journal of Biomedical and Health Informatics (under review). 10 pages, 3 figures, 5 tables
摘要:该系统性综述严格评估了公开可用的腹部CT数据集及其在临床环境中应用人工智能(AI)的适用性。我们检查了46个公开可用的腹部CT数据集(50,256项研究)。在所有46个数据集中,我们发现了大量的冗余(59.1%的案例重用)和西方/地理倾斜(75.3%来自北美和欧洲)。对19个数据集(>=100例)进行了偏倚评估;在这个子集中,最普遍的高风险类别是领域转移(63\%)和选择偏倚(57\%),这两者都可能破坏模型在不同医疗环境中的通用性-特别是在资源有限的环境中。为了应对这些挑战,我们提出了有针对性的数据集改进策略,包括多机构合作,采用标准化协议,以及故意纳入不同的患者人群和成像技术。这些努力对于支持开发更公平和临床上更强大的腹部成像AI模型至关重要。
摘要:This systematic review critically evaluates publicly available abdominal CT datasets and their suitability for artificial intelligence (AI) applications in clinical settings. We examined 46 publicly available abdominal CT datasets (50,256 studies). Across all 46 datasets, we found substantial redundancy (59.1% case reuse) and a Western/geographic skew (75.3% from North America and Europe). A bias assessment was performed on the 19 datasets with >=100 cases; within this subset, the most prevalent high-risk categories were domain shift (63%) and selection bias (57%), both of which may undermine model generalizability across diverse healthcare environments -- particularly in resource-limited settings. To address these challenges, we propose targeted strategies for dataset improvement, including multi-institutional collaboration, adoption of standardized protocols, and deliberate inclusion of diverse patient populations and imaging technologies. These efforts are crucial in supporting the development of more equitable and clinically robust AI models for abdominal imaging.


【6】Susceptibility Distortion Correction of Diffusion MRI with a single Phase-Encoding Direction
标题:单一相位编码方向扩散MRI的磁化率失真校正
链接:https://arxiv.org/abs/2508.13340

作者:Dargahi, Sylvain Bouix, Christian Desrosier
摘要:扩散MRI(dMRI)是一种通过分析组织中水分子扩散来绘制脑微结构和连接性的有价值工具。然而,采集dMRI数据需要在短时间内捕获多个3D脑体积,常常导致图像质量上的折衷。一个具有挑战性的伪影是磁化率引起的失真,它会带来显著的几何和强度形变。传统的校正方法(如topup)依赖于获取blip-up和blip-down图像对,限制了它们对仅用单一相位编码方向采集的回顾性数据的适用性。在这项工作中,我们提出了一种基于深度学习的方法,仅使用单次采集(blip-up或blip-down)来校正磁化率失真,从而无需成对采集。实验结果表明,我们的方法性能可与topup媲美,展示了其作为dMRI磁化率失真校正的高效实用替代方案的潜力。
摘要:Diffusion MRI (dMRI) is a valuable tool to map brain microstructure and connectivity by analyzing water molecule diffusion in tissue. However, acquiring dMRI data requires to capture multiple 3D brain volumes in a short time, often leading to trade-offs in image quality. One challenging artifact is susceptibility-induced distortion, which introduces significant geometric and intensity deformations. Traditional correction methods, such as topup, rely on having access to blip-up and blip-down image pairs, limiting their applicability to retrospective data acquired with a single phase encoding direction. In this work, we propose a deep learning-based approach to correct susceptibility distortions using only a single acquisition (either blip-up or blip-down), eliminating the need for paired acquisitions. Experimental results show that our method achieves performance comparable to topup, demonstrating its potential as an efficient and practical alternative for susceptibility distortion correction in dMRI.


自动驾驶|车辆|车道检测等(2篇)

【1】ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
标题:ROVR开放数据集:用于自动驾驶的大规模深度数据集
链接:https://arxiv.org/abs/2508.13977

作者:o, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Mike Horton, Yuan Si, Hao Zhao, Long Chen
摘要:深度估计是自动驾驶、机器人和增强现实中3D场景理解的基本任务。现有的深度数据集,如KITTI,nuScenes和DDAD,已经推进了该领域,但在多样性和可扩展性方面受到限制。随着这些数据集的基准性能接近饱和,越来越需要新一代大规模,多样化和具有成本效益的数据集来支持基础模型和多模式学习的时代。为了解决这些挑战,我们引入了一个大规模,多样化,逐帧连续数据集,用于动态户外驾驶环境中的深度估计,包括20K视频帧,以评估现有方法。我们的轻量级采集管道以低成本确保了广泛的场景覆盖,而稀疏但统计上足够的地面实况可以实现强大的训练。与现有的数据集相比,我们的数据集在驾驶场景中呈现出更大的多样性和更低的深度密度,为泛化带来了新的挑战。标准单目深度估计模型的基准实验验证了数据集的实用性,并突出了在具有挑战性的条件下的巨大性能差距,为推进深度估计研究建立了新的平台。
摘要:Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. Benchmark experiments with standard monocular depth estimation models validate the dataset's utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.


【2】Bridging Clear and Adverse Driving Conditions
标题:弥合晴朗与恶劣驾驶条件之间的差距
链接:https://arxiv.org/abs/2508.13592

作者:iro, Yahia Showgan, Koustav Mullick
摘要:自动驾驶(AD)系统在低光照和降水等恶劣环境条件下性能明显下降。恶劣条件在AD数据集中的代表性不足,使解决这一缺陷颇具挑战。为规避获取和标注恶劣天气数据的高昂成本,我们提出了一种新颖的域自适应(DA)管线,将晴天图像转换为雾、雨、雪和夜间图像。我们系统地开发并评估了多种新的数据生成管线,包括仅模拟、基于GAN以及混合扩散-GAN方法,以从有标注的晴天图像合成逼真的恶劣天气图像。我们利用一个现有的DA GAN,将其扩展以支持辅助输入,并开发了一种同时利用模拟图像和真实图像的新训练方案。模拟图像通过提供完美匹配的图像对实现精确监督,而真实图像有助于弥合模拟到真实(sim2real)的差距。我们还引入了一种方法,通过将Stable Diffusion图像到图像(img2img)输出与其原始图像自适应混合,来缓解其中的幻觉和伪影。我们在合成数据上微调下游模型,并在具有对应关系的恶劣条件数据集(ACDC)上进行评估。我们在语义分割上实现了1.85%的整体提升,在夜间场景上提升4.62%,证明了我们的混合方法在挑战性条件下实现稳健AD感知的有效性。
摘要:Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.
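摘要提出将img2img输出与其原始图像自适应混合以抑制幻觉。具体混合策略未在摘要中给出;下面是一个假设性的最小示意(纯Python,单通道像素列表):逐像素alpha混合,alpha权重的来源(例如幻觉检测图)为假设:

```python
def adaptive_blend(generated, progenitor, alpha):
    """逐像素混合:out = alpha*generated + (1-alpha)*progenitor。
    alpha接近0的位置更信任原图,从而抑制生成伪影;权重来源为假设。"""
    return [a * g + (1 - a) * p
            for g, p, a in zip(generated, progenitor, alpha)]

gen   = [1.0, 0.0, 0.5]   # img2img输出(示意像素)
orig  = [0.0, 1.0, 0.5]   # 原始(progenitor)图像
alpha = [1.0, 0.0, 0.5]   # 每像素混合权重(示意)
out = adaptive_blend(gen, orig, alpha)
```

alpha=1时完全保留生成结果,alpha=0时完全回退到原图,中间值在两者之间线性插值。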


人脸|人群计数(2篇)

【1】HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
标题:HumanPCR:在多样化以人为中心的场景中探测MLLM能力
链接:https://arxiv.org/abs/2508.13692

作者:i, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
摘要:多模态模型的快速发展推动了对人工通用智能的渴望,要求在不同的环境中具有与人类相当的性能。我们提出了HumanPCR,一个评估套件,用于探测MLLM在三个层次上与人类相关的视觉环境的能力:感知,理解和推理(分别由Human-P,Human-C和Human-R表示)。Human-P和Human-C具有超过6,000个人工验证的多项选择题,评估9个维度的大量任务,包括但不限于现有基准经常忽略的基本技能。Human-R提供了一个具有挑战性的手动策划视频推理测试,需要整合多个视觉证据,主动提取问题线索之外的上下文,并应用类似人类的专业知识。每个问题都包括人类注释的思想链(CoT)原理,以及支持进一步研究的关键视觉证据。对30多个最先进的模型进行了广泛的评估,在以人为中心的视觉理解方面表现出了重大挑战,特别是在涉及详细空间感知,时间理解和思维建模的任务中。此外,对Human-R的分析揭示了模型在从不同的人类场景中提取必要的主动视觉证据方面的困难,以及它们对查询引导检索的错误依赖。即使使用像缩放视觉上下文和测试时思考这样的高级技术,也只能产生有限的好处。我们希望HumanPCR和我们的发现将推动多模态模型的开发,评估和以人为中心的应用。
摘要:The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing massive tasks of 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging manually curated video reasoning test that requires integrating multiple visual evidences, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even with advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.


【2】Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics
标题:可学习SMPLify:无优化人体姿势逆运动学的神经解决方案
链接:https://arxiv.org/abs/2508.13562

作者:ng, Linfeng Dong, Wei Wang, Zhihang Zhong, Xiao Sun
摘要:在3D人体姿态与形状估计中,SMPLify仍是一个强大的基线,它通过迭代优化求解逆运动学(IK)。然而,其高计算成本限制了实用性。各领域的最新进展表明,用数据驱动的神经网络取代迭代优化可以在不牺牲精度的情况下显著缩短运行时间。受这一趋势启发,我们提出可学习SMPLify(Learnable SMPLify),一个用单次回归模型取代SMPLify中迭代拟合过程的神经框架。我们框架的设计针对神经IK的两个核心挑战:数据构建和泛化。为实现有效训练,我们提出一种时间采样策略,从连续帧中构建初始化-目标对。为提高对多样运动和未见姿态的泛化能力,我们提出一种以人为中心的归一化方案和残差学习来缩小解空间。Learnable SMPLify支持顺序推理,也可作为插件式后处理来改进现有的基于图像的估计器。大量实验表明,我们的方法确立了一个实用而简单的基线:与SMPLify相比运行速度提高近200倍,能很好地泛化到未见过的3DPW和RICH数据集,并可作为插件工具以模型无关的方式应用于LucidAction。代码可在https://github.com/Charrrrrlie/Learnable-SMPLify获取。
摘要:In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
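摘要提到从连续帧中构建"初始化-目标对"用于训练。具体采样协议未在摘要中给出;一个假设性的示意是:以第t帧的姿态作为初始化,以第t+gap帧的姿态作为回归目标:

```python
def make_pairs(poses, gap=1):
    """假设的时间采样:第t帧姿态作初始化,第t+gap帧姿态作目标。
    poses可以是任意逐帧姿态表示(此处用占位字符串示意)。"""
    return [(poses[t], poses[t + gap]) for t in range(len(poses) - gap)]

poses = ["p0", "p1", "p2", "p3"]    # 4帧的占位姿态序列
pairs = make_pairs(poses, gap=2)
```

相邻帧姿态接近,这样构造出的初始化天然落在目标附近,与迭代优化中"良好初值"的作用类似。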


图像视频检索|Re-id相关(2篇)

【1】Multimodal Data Storage and Retrieval for Embodied AI: A Survey
标题:面向具身智能的多模态数据存储与检索:综述
链接:https://arxiv.org/abs/2508.13901

作者: Hao Tang
摘要:具身AI(Embodied AI, EAI)智能体不断与物理世界交互,产生海量、异构的多模态数据流,传统管理系统难以应对。在本综述中,我们首先系统评估了五种存储架构(图数据库、多模型数据库、数据湖、向量数据库和时间序列数据库),重点关注它们是否适合满足EAI的核心需求,包括物理落地、低延迟访问和动态可扩展性。然后,我们分析了五种检索范式(基于融合策略的检索、基于表示对齐的检索、基于图结构的检索、基于生成模型的检索和基于高效检索的优化),揭示了实现长期语义一致性与保持实时响应之间的根本矛盾。基于这一全面分析,我们识别出关键瓶颈,从基础性的物理落地差距到跨模态集成、动态适应和开放世界泛化等系统性挑战。最后,我们概述了一个前瞻性研究议程,涵盖物理感知的数据模型、自适应存储-检索协同优化和标准化基准,以指导未来面向EAI的原则性数据管理解决方案研究。我们的综述基于对180多项相关研究的全面审阅,为设计下一代自主具身系统所必需的稳健、高性能数据管理框架提供了严谨的路线图。
摘要:Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI's core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.


【2】Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture
标题:农业教育元宇宙内容的分层视觉语言检索
链接:https://arxiv.org/abs/2508.13713

作者:i, Alex Falcon, Giuseppe Serra
备注:Accepted for publication at the 23rd International Conference on Image Analysis and Processing (ICIAP 2025)
摘要:每天都有大量教育内容被上传到网上,涵盖包括农业和园艺在内的不同领域。当这些视频或材料被有意义地组织起来时,可以让学习更容易、更有效。组织和丰富此类内容的一个有前景的方式是元宇宙(Metaverse),它允许用户在交互式沉浸环境中探索教育体验。然而,搜索相关的元宇宙场景并找到符合用户兴趣的场景仍是一项具有挑战性的任务。最近已朝这个方向迈出了第一步,但现有数据集规模很小,不足以训练先进模型。在这项工作中,我们做出两项主要贡献:第一,我们引入了一个包含457个农业主题虚拟博物馆(AgriMuseum)的新数据集,每个都配有文本描述;第二,我们提出了一个分层视觉语言模型,用自然语言查询来表示和检索相关的AgriMuseum。在我们的实验设置中,所提方法达到约62%的R@1和78%的MRR,证实了其有效性,并在现有基准上带来最高6%的R@1和11%的MRR提升。此外,广泛的评估验证了我们的设计选择。代码和数据集可在https://github.com/aliabdari/Agricultural_Metaverse_Retrieval获取。
摘要:Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users' interests remains a challenging task. A first step in this direction has been done recently, but existing datasets are small and not sufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval.
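摘要中报告的R@1和MRR是标准检索指标,给定每个查询的正确结果排名即可计算。下面是两者的最小实现(纯Python,排名为1-based):

```python
def recall_at_1(ranks):
    """R@1:正确结果恰好排在第1位的查询所占比例。"""
    return sum(r == 1 for r in ranks) / len(ranks)

def mrr(ranks):
    """MRR:正确结果排名倒数的平均值。"""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 1, 4]   # 假想的4个查询:正确AgriMuseum的检索排名
```

此例中R@1 = 2/4 = 0.5,MRR = (1 + 1/2 + 1 + 1/4)/4 = 0.6875;MRR对"虽未排第1但排名靠前"的结果给予部分分数,因而通常高于R@1。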


裁剪|量化|加速|压缩相关(1篇)

【1】Distribution-Aware Hadamard Quantization for Hardware-Efficient Implicit Neural Representations
标题:硬件高效隐式神经表示的分布感知Hadamard量化
链接:https://arxiv.org/abs/2508.13478

作者:hou, Jiachen Ren, Taiqiang Wu, Yuxin Cheng, Zhengwu Liu, Ngai Wong
备注:6 pages, 7 figures
摘要:隐式神经表示(INR)使用具有复杂激活函数的多层感知器(MLP)对离散信号进行编码。虽然INR实现了卓越的性能,但它们依赖全精度数值表示来进行精确计算,从而导致显著的硬件开销。先前的INR量化方法主要集中于权重量化,由于缺乏激活量化而仅带来有限的硬件节省。为了充分利用量化的硬件优势,我们提出了DHQ,一种新的分布感知Hadamard量化方案,同时针对INR中的权重和激活。我们的分析表明,第一层和最后一层的权重分布与中间层不同,而最后一层的激活与前面各层差异很大。我们没有为各层单独定制量化器,而是利用Hadamard变换将这些不同的分布标准化为统一的钟形形式(这一做法得到经验证据和理论分析的支持),然后再应用标准量化器。为了展示我们方法的实际优势,我们给出了DHQ的FPGA实现,突出其硬件效率。在不同图像重建任务上的实验表明,DHQ优于以前的量化方法,与全精度方案相比,延迟降低32.7%,能耗降低40.1%,资源占用降低高达98.3%。
摘要:Implicit Neural Representations (INRs) encode discrete signals using Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs achieve superior performance, they depend on full-precision number representation for accurate computation, resulting in significant hardware overhead. Previous INR quantization approaches have primarily focused on weight quantization, offering only limited hardware savings due to the lack of activation quantization. To fully exploit the hardware benefits of quantization, we propose DHQ, a novel distribution-aware Hadamard quantization scheme that targets both weights and activations in INRs. Our analysis shows that the weights in the first and last layers have distributions distinct from those in the intermediate layers, while the activations in the last layer differ significantly from those in the preceding layers. Instead of customizing quantizers individually, we utilize the Hadamard transformation to standardize these diverse distributions into a unified bell-shaped form, supported by both empirical evidence and theoretical analysis, before applying a standard quantizer. To demonstrate the practical advantages of our approach, we present an FPGA implementation of DHQ that highlights its hardware efficiency. Experiments on diverse image reconstruction tasks show that DHQ outperforms previous quantization methods, reducing latency by 32.7\%, energy consumption by 40.1\%, and resource utilization by up to 98.3\% compared to full-precision counterparts.
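下面用一个极简的NumPy草图示意"先做Hadamard变换、再做统一的对称均匀量化"这一思路(仅为示意,并非论文官方实现;矩阵规模、位宽与数据分布均为假设)。归一化后的Hadamard矩阵是正交的,量化误差的L2范数在变换前后保持一致,而变换会把离群值的能量摊匀,使分布更接近钟形,从而允许更小的量化步长:

```python
import numpy as np

def hadamard(n):
    # Sylvester 构造 n×n Hadamard 矩阵(n 须为 2 的幂),并归一化为正交矩阵
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def uniform_quant(x, bits=4):
    # 对称均匀量化:按最大幅值确定步长,缩放到整数网格再还原
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
# 模拟一个分布"不规则"的权重向量:重尾,且带少量大幅离群值
w = rng.laplace(size=64) * np.concatenate([np.ones(60), 20 * np.ones(4)])

H = hadamard(64)
# 直接量化 vs. 先做 Hadamard 变换使分布趋于钟形后再量化
err_direct = np.linalg.norm(w - uniform_quant(w))
err_hadamard = np.linalg.norm(w - H.T @ uniform_quant(H @ w))
print(err_direct, err_hadamard)
```

在带离群值的权重向量上,变换域量化的重建误差通常低于直接量化,这正是用单一标准量化器覆盖多种分布的出发点。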


蒸馏|知识提取(1篇)

【1】DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
标题:DeH4R:一种用于道路网络图提取的解耦混合方法
链接:https://arxiv.org/abs/2508.13669

作者:Gong, Shunping Ji
备注:Under review
摘要:从遥感图像中自动提取完整、精确的道路网络图仍然是地理空间计算机视觉中的一个关键挑战。基于分割的方法虽然在像素级识别中有效,但在矢量化后处理后难以保持拓扑保真度。图生长方法能构建拓扑上更忠实的图,但其迭代ROI裁剪带来难以承受的计算开销。图生成方法先预测全局静态的候选道路网络顶点,再推断顶点之间可能存在的边;它们实现了快速的拓扑感知推理,但限制了顶点的动态插入。为了解决这些挑战,我们提出了DeH4R,一种结合图生成的效率与图生长的动态性的新型混合模型。这是通过将任务解耦为候选顶点检测、相邻顶点预测、初始图构建和图扩展来实现的。这种架构创新支持顶点(边)的动态插入,同时保持快速的推理速度,并增强拓扑保真度和空间一致性。在CityScale和SpaceNet基准上的综合评估表明其达到了最先进(SOTA)的性能。DeH4R在CityScale上比之前的SOTA图生长方法RNGDet++高出4.62 APLS和10.18 IoU,同时快了大约10倍。代码将在https://github.com/7777777FAN/DeH4R公开提供。
摘要:The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference, but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10 $\times$ faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.
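解耦出的"初始图构建 + 图扩展(动态插入顶点)"两个阶段,可以用一个纯Python的邻接表草图来直观说明(数据结构与函数名均为示意性假设,与论文实现无关):

```python
# 用邻接表表示道路图,演示"初始图构建"与"在已有边上动态插入顶点"两个阶段
def build_graph(vertices, edges):
    g = {v: set() for v in vertices}
    for a, b in edges:
        g[a].add(b)
        g[b].add(a)
    return g

def insert_vertex(g, new_v, a, b):
    # 在已有边 (a, b) 上动态插入新顶点 new_v,保持拓扑连通
    g[a].discard(b)
    g[b].discard(a)
    g[new_v] = {a, b}
    g[a].add(new_v)
    g[b].add(new_v)

g = build_graph([0, 1, 2], [(0, 1), (1, 2)])
insert_vertex(g, 3, 0, 1)   # 道路 0-1 上新检测到顶点 3
print(sorted(g[3]))          # [0, 1]
```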


视觉解释|视频理解VQA|caption等(1篇)

【1】BERT-VQA: Visual Question Answering on Plots
标题:BERT-VQA:图表上的视觉问答
链接:https://arxiv.org/abs/2508.13184

作者:obert Yang
摘要:视觉问答一直是自然语言理解领域一个令人兴奋的挑战,因为它要求深度学习模型交换来自视觉和语言两个领域的信息。在这个项目中,我们的目标是解决该问题的一个子任务,即针对图表的视觉问答。为此,我们开发了BERT-VQA,一种基于VisualBERT的模型架构,配备预训练的ResNet 101图像编码器,并可选地加入联合融合。我们以由LSTM、CNN和浅层分类器组成的基线为对照,训练并评估了该模型。最终结果推翻了我们的核心假设,即VisualBERT中的跨模态模块对于将图表组件与问题短语对齐至关重要。因此,我们的工作为图表问答挑战的难度以及不同模型架构解决该问题的适用性提供了有价值的见解。
摘要:Visual question answering has been an exciting challenge in the field of natural language understanding, as it requires deep learning models to exchange information from both vision and language domains. In this project, we aim to tackle a subtask of this problem, namely visual question answering on plots. To achieve this, we developed BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of a LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases. Therefore, our work provided valuable insights into the difficulty of the plot question answering challenge as well as the appropriateness of different model architectures in solving this problem.


多模态(1篇)

【1】MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
标题:MM-BrowseComp:多模式浏览代理的综合基准
链接:https://arxiv.org/abs/2508.13186

作者:i, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
备注:The first two authors contribute equally, 26 pages, repo at this https URL
摘要:具备高级推理和工具使用能力的AI代理在面向深度搜索的网页浏览中表现出令人印象深刻的性能。虽然BrowseComp等现有基准评估了这些浏览能力,但它们主要关注文本信息,忽视了多模态内容的普遍存在。为弥合这一差距,我们引入了MM-BrowseComp,一个包含224个具有挑战性的手工构造问题的新基准,专门用于评估代理的多模态检索与推理能力。这些问题通常在提示中包含图像,而搜索和推理过程中遇到的关键信息也可能嵌入在网页的图像或视频中。因此,仅依赖文本的方法不足以应对我们的基准。此外,我们为每个问题提供经过验证的检查清单,从而可以对多模态依赖关系和推理路径进行细粒度分析。我们在MM-BrowseComp上对最先进模型的全面评估显示,即使是OpenAI o3这样配备工具的顶级模型也只能达到29.02%的准确率,凸显了当前模型次优的多模态能力以及原生多模态推理的缺失。
摘要:AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02\% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.


3D|3D重建等相关(4篇)

【1】Distilled-3DGS:Distilled 3D Gaussian Splatting
标题:蒸馏-3DGS:蒸馏3D高斯飞溅
链接:https://arxiv.org/abs/2508.14037

作者:ang, Xinkai Chen, Jianhuang Lai, Guangcong Wang
备注:Project page: this https URL Code: this https URL
摘要:三维高斯溅射(3DGS)在新视点合成(NVS)中表现出了显著的效果。然而,它有一个明显的缺点:实现高保真渲染通常需要大量3D高斯,导致大量的内存消耗和存储需求。为应对这一挑战,我们提出了首个面向3DGS的知识蒸馏框架,包含多种教师模型:原版3DGS、噪声增强变体和dropout正则化版本。这些教师的输出被聚合起来指导轻量级学生模型的优化。为了蒸馏隐藏的几何结构,我们提出了一种结构相似性损失,以增强学生与教师模型之间空间几何分布的一致性。通过在多个数据集上进行全面的定量与定性评估,所提出的Distilled-3DGS作为一个简单而有效、没有花哨设计的框架,在渲染质量和存储效率方面均取得了颇具竞争力的渲染结果。项目页面:https://distilled3dgs.github.io。代码:https://github.com/lt-xiang/Distilled-3DGS。
摘要:3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher model. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising rendering results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io . Code: https://github.com/lt-xiang/Distilled-3DGS .
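多教师输出聚合与"空间几何分布一致性"约束的思路可以粗略示意如下(仅为示意:这里把结构相似性损失简化为点云均值与协方差的匹配,该简化是示意性假设,并非论文中的具体定义):

```python
import numpy as np

def aggregate_teachers(renders):
    # renders: (T, H, W),多个教师模型对同一视角的渲染;取均值作为学生的监督目标
    return np.mean(renders, axis=0)

def structural_similarity_loss(student_pts, teacher_pts):
    # 用均值与协方差差异近似"空间几何分布一致性"
    mu_s, mu_t = student_pts.mean(0), teacher_pts.mean(0)
    cov_s, cov_t = np.cov(student_pts.T), np.cov(teacher_pts.T)
    return float(np.sum((mu_s - mu_t) ** 2) + np.sum((cov_s - cov_t) ** 2))

rng = np.random.default_rng(0)
renders = rng.random((3, 8, 8))          # 3 个教师的渲染
target = aggregate_teachers(renders)      # 指导学生优化的聚合目标
pts = rng.normal(size=(100, 3))           # 示意用的高斯中心点云
print(structural_similarity_loss(pts, pts))  # 同一点云,损失为 0
```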


【2】Online 3D Gaussian Splatting Modeling with Novel View Selection
标题:具有新颖视图选择的在线3D高斯飞溅建模
链接:https://arxiv.org/abs/2508.14014

作者:n Lee, Junkyu Park, Khang Truong Giang, Soohwan Song
摘要:本研究解决了仅从RGB帧在线生成3D高斯溅射(3DGS)模型的挑战。先前的研究采用密集SLAM技术从关键帧估计3D场景,用于构建3DGS模型。然而,这些方法仅依赖关键帧,不足以覆盖整个场景,从而导致重建不完整。此外,构建可泛化的模型需要纳入来自不同视角的帧,以实现更广的场景覆盖;但在线处理限制了大量帧或大量训练迭代的使用。因此,我们提出了一种通过自适应视图选择来提高模型完整性的高质量3DGS建模新方法。通过在线分析重建质量,我们的方法选择最优的非关键帧进行额外训练。通过整合关键帧与所选的非关键帧,该方法能从不同视角细化不完整区域,显著提升完整性。我们还提出了一个引入在线多视图立体方法的框架,确保3DGS建模全程中3D信息的一致性。实验结果表明,我们的方法优于最先进的方法,在复杂的户外场景中表现出色。
摘要:This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.
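自适应视图选择的核心一步是:在线评估各帧的重建质量,并挑选重建最差的非关键帧补充训练。下面用PSNR打分给出一个极简草图(质量度量与帧数均为假设,论文所用指标未必是PSNR):

```python
import numpy as np

def psnr(a, b, peak=1.0):
    # 峰值信噪比:值越低,说明渲染与真实帧差距越大
    mse = np.mean((a - b) ** 2)
    return float(10 * np.log10(peak ** 2 / mse))

def select_worst_frames(frames, renders, k=2):
    # 选出当前模型重建最差(PSNR 最低)的 k 帧,作为补充训练视图
    scores = [psnr(f, r) for f, r in zip(frames, renders)]
    order = np.argsort(scores)
    return sorted(order[:k].tolist())

rng = np.random.default_rng(0)
frames = [rng.random((4, 4)) for _ in range(5)]
# 用不同强度的噪声模拟各视角的重建误差
sigmas = [0.01, 0.5, 0.02, 0.4, 0.03]
renders = [f + rng.normal(0, s, f.shape) for f, s in zip(frames, sigmas)]
print(select_worst_frames(frames, renders))  # 误差最大的帧 1 和 3 被选中
```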


【3】Real-Time, Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols
标题:通过极低剂量协议实时、基于人群的3D骨模型重建
链接:https://arxiv.org/abs/2508.13947

作者:, Haoran Sun, Yongqing Li, Rabia Aslam, Lung Fung Tse, Tiange Cheng, Chun Sing Chui, Wing Fung Yau, Victorine R. Le Meur, Meruyert Amangeldy, Kiho Cho, Yinyu Ye, James Zou, Wei Zhao, Xiaomeng Li
摘要:患者特定的骨模型对于设计手术导板和术前规划至关重要,因为它们能够可视化复杂的解剖结构。然而,由于CT灵活性低、辐射暴露高,且人工勾画耗时,传统的基于CT的骨模型构建方法仅限于术前使用。在此,我们介绍基于知识蒸馏的半监督重建(SSR-KD),一个快速而准确的AI框架,能在30秒内从双平面X射线重建高质量骨模型,平均误差低于1.0 mm,消除了对CT和人工操作的依赖。此外,专家在重建的骨模型上进行了胫骨高位截骨模拟,证明从双平面X射线重建的骨模型与从CT标注的骨模型具有相当的临床适用性。总体而言,我们的方法加速了流程、减少了辐射暴露、支持术中引导,并显著提高了骨模型的实用性,在骨科领域具有变革性的应用前景。
摘要:Patient-specific bone models are essential for designing surgical guides and preoperative planning, as they enable the visualization of intricate anatomical structures. However, traditional CT-based approaches for creating bone models are limited to preoperative use due to the low flexibility and high radiation exposure of CT and time-consuming manual delineation. Here, we introduce Semi-Supervised Reconstruction with Knowledge Distillation (SSR-KD), a fast and accurate AI framework to reconstruct high-quality bone models from biplanar X-rays in 30 seconds, with an average error under 1.0 mm, eliminating the dependence on CT and manual work. Additionally, high tibial osteotomy simulation was performed by experts on reconstructed bone models, demonstrating that bone models reconstructed from biplanar X-rays have comparable clinical applicability to those annotated from CT. Overall, our approach accelerates the process, reduces radiation exposure, enables intraoperative guidance, and significantly improves the practicality of bone models, offering transformative applications in orthopedics.


【4】InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting
标题:InnerGS:通过分解式3D高斯溅射渲染内部场景
链接:https://arxiv.org/abs/2508.13287

作者:ang, Yihan Xiao, Wenlu Tang
摘要:3D高斯溅射(3DGS)通过将场景表示为各向异性3D高斯的显式集合,在高效场景渲染中广受欢迎。然而,大多数现有工作主要集中于外表面建模。在这项工作中,我们的目标是重建内部场景,这对于需要深入理解物体内部的应用至关重要。通过由内部的3D高斯分布直接建模连续体密度,我们的模型能从稀疏的切片数据中有效重建平滑而细致的内部结构。我们的方法无需相机位姿,即插即用,并且天然兼容任何数据模态。我们在https://github.com/Shuxin-Liang/InnerGS提供CUDA实现。
摘要:3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object's interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modalities. We provide cuda implementation at: https://github.com/Shuxin-Liang/InnerGS.
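由各向异性3D高斯的加权叠加来定义连续体密度场的做法,可以用下面几行NumPy草图说明(高斯个数、协方差与权重均为示意性假设):

```python
import numpy as np

def density(x, means, covs, weights):
    # 连续体密度 = 各向异性高斯的加权叠加,可在任意空间点(如切片位置)采样
    d = 0.0
    for mu, cov, w in zip(means, covs, weights):
        diff = x - mu
        d += w * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
    return d

means = [np.zeros(3), np.array([1.0, 0.0, 0.0])]
covs = [np.diag([0.5, 0.1, 0.1]), np.diag([0.1, 0.5, 0.1])]  # 各向异性协方差
weights = [1.0, 0.5]
# 结构中心处的密度应高于远离结构的位置
print(density(np.zeros(3), means, covs, weights) >
      density(np.array([5.0, 5.0, 5.0]), means, covs, weights))
```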


其他神经网络|深度学习|模型|建模(8篇)

【1】PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
标题:PhysGM:用于前馈4D合成的大型物理高斯模型
链接:https://arxiv.org/abs/2508.13911

作者:, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Changsheng Li
摘要:虽然基于物理的3D运动合成已经取得了重大进展,但目前的方法面临着严重的局限性。它们通常依赖于预重建的3D高斯溅射(3DGS)表示,而物理集成取决于不灵活的手动定义的物理属性或来自视频模型的不稳定的优化指导。为了克服这些挑战,我们引入了PhysGM,这是一个前馈框架,可以从单个图像中联合预测3D高斯表示及其物理属性,从而实现即时的物理模拟和高保真的4D渲染。我们首先建立一个基础模型,通过联合优化高斯重建和概率物理预测。然后使用物理上合理的参考视频对模型进行细化,以提高渲染保真度和物理预测准确性。我们采用直接偏好优化(DPO)将其模拟与参考视频对齐,绕过分数蒸馏采样(SDS)优化,后者需要通过复杂的可微模拟和光栅化反向传播梯度。为了便于训练,我们引入了一个新的数据集PhysAssets,其中包含超过24,000个3D资产,并标注了物理属性和相应的指导视频。实验结果表明,我们的方法有效地生成高保真的4D模拟从一个单一的图像在一分钟内。这代表了一个显着的加速比以前的作品,同时提供逼真的渲染结果。我们的项目页面位于:https://hihixiaolv.github.io/PhysGM.github.io/
摘要:While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate, physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt the Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization which needs back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate the training, we introduce a new dataset PhysAssets of over 24,000 3D assets, annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at:https://hihixiaolv.github.io/PhysGM.github.io/
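论文用DPO使模拟结果与参考视频对齐。DPO偏好损失的标量化极简示意如下(各对数概率取值均为假设的占位数字,仅演示损失的方向性):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_w / logp_l:当前模型对"被偏好"与"不被偏好"样本的对数概率
    # ref_*:冻结参考模型的对数概率;损失为 -log sigmoid(beta * 隐式奖励差)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# 当模型相对参考模型更偏向被偏好的样本时,损失更小
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

这种形式只需对数概率即可计算梯度,正对应摘要中"绕过需要穿过可微模拟与光栅化反向传播的SDS优化"这一动机。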


【2】DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
标题:DiffIER:通过迭代误差减少优化扩散模型
链接:https://arxiv.org/abs/2508.13628

作者:Lihe Ding, Tianfan Xue
摘要:扩散模型在生成高质量样本以及通过无分类器引导(CFG)提升不同领域的性能方面表现出卓越的能力。然而,生成样本的质量对引导权重的选择高度敏感。在这项工作中,我们发现了一个关键的"训练-推理差距",并认为正是这一差距的存在削弱了条件生成的性能,使输出对引导权重高度敏感。我们通过测量推理阶段的累积误差来量化这一差距,并建立了引导权重的选择与最小化该差距之间的关联。此外,为缩小这一差距,我们提出了DiffIER,一种基于优化的高质量生成方法。我们证明,在推理过程中每一步进行迭代误差最小化,可以有效减少累积误差。通过引入这一新颖的即插即用优化框架,我们可以在每个推理步骤中优化误差,提升生成质量。实验结果表明,所提方法在条件生成任务上优于基线方法。此外,该方法在文本到图像生成、图像超分辨率和文本到语音生成中均取得一致的成功,凸显了其多功能性及在未来研究中广泛应用的潜力。
摘要:Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical ``training-inference gap'' and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.
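"在每个推理步内做若干次误差最小化迭代"的思想,可以用一个二次代理误差上的梯度迭代玩具例子来示意(与扩散模型的真实误差项无关,目标函数与步长均为假设性演示):

```python
import numpy as np

def refine_step(x, target, lr=0.2, iters=10):
    # 每个推理步之后,对代理误差 ||x - target||^2 做少量梯度迭代,
    # 抑制误差随采样步数的累积
    for _ in range(iters):
        x = x - lr * 2 * (x - target)
    return x

x = np.ones(4) * 5.0          # 某一步的含误差估计
target = np.zeros(4)          # 该步的理想解(玩具设定)
err_before = float(np.linalg.norm(x - target))
err_after = float(np.linalg.norm(refine_step(x, target) - target))
print(err_before, err_after)  # 迭代后误差明显减小
```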


【3】Towards Efficient Vision State Space Models via Token Merging
标题:通过令牌合并迈向高效的视觉状态空间模型
链接:https://arxiv.org/abs/2508.13599

作者:Park, Minseok Son, Changick Kim
备注:under review
摘要:状态空间模型(SSM)已成为计算机视觉中强大的架构,但提高其计算效率对于实用且可扩展的部署仍然至关重要。虽然令牌削减是提升模型效率的有效途径,但将其应用于SSM需要仔细考虑其独特的序列建模能力。在这项工作中,我们提出了MaMe,一种为基于SSM的视觉模型量身定制的令牌合并策略。MaMe解决了两个关键挑战:量化令牌重要性和保持序列特性。我们的方法利用状态转移参数$\mathbf{\Delta}$作为信息量度量,并引入策略性的令牌排列以保持序列信息流。大量实验表明,MaMe在微调模型和现成模型上都实现了更优的效率-性能权衡。特别是,即便在现有方法性能显著下降的激进令牌削减设置下,我们的方法仍保持鲁棒性。在图像分类之外,MaMe在视频和音频领域也展现出强大的泛化能力,为提升各类SSM应用的效率确立了一种有效方法。
摘要:State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $\mathbf{\Delta}$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.
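以$\mathbf{\Delta}$作为信息量度量、按序合并信息量最低的相邻token的策略,可用如下NumPy草图示意(以平均池化作为合并规则,这一细节是示意性假设):

```python
import numpy as np

def merge_tokens(tokens, delta, keep):
    # tokens: (N, D);delta: (N,),每个 token 的 Δ 值,越小视为信息量越低
    tokens, delta = tokens.copy(), delta.copy()
    while len(tokens) > keep:
        # 只合并相邻 token,保持序列顺序(SSM 的序列建模特性)
        i = int(np.argmin(delta[:-1] + delta[1:]))
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = np.concatenate([tokens[:i], merged[None], tokens[i + 2:]])
        delta = np.concatenate([delta[:i], [(delta[i] + delta[i + 1]) / 2], delta[i + 2:]])
    return tokens

rng = np.random.default_rng(0)
tokens = rng.random((6, 4))
delta = np.array([0.9, 0.1, 0.1, 0.8, 0.7, 0.9])  # 中间两个 token 信息量最低
out = merge_tokens(tokens, delta, keep=4)
print(out.shape)  # (4, 4)
```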


【4】Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models
标题:缩小差距:用单打训练的模型进行双打羽毛球分析
链接:https://arxiv.org/abs/2508.13507

作者: Baek, Jinhyuk Yun
备注:14 pages, 7 figures
摘要:羽毛球被认为是世界上速度最快的球拍运动之一。尽管在国际锦标赛中双打比赛比单打更普遍,但由于数据可用性和多人跟踪方面的挑战,以往的研究主要集中在单打上。为了解决这一空白,我们设计了一种将单打训练的模型迁移到双打分析的方法。我们使用ViT-Pose从ShuttleSet单打比赛数据集中提取关键点,并通过基于ST-GCN的对比学习框架进行嵌入。为了提高跟踪稳定性,我们采用了自定义的多目标跟踪算法,解决了因球员快速、重叠移动导致的ID切换问题。然后,基于Transformer的分类器依据学到的嵌入判定击球事件。我们的研究结果证明了将基于姿态的击球识别扩展到双打羽毛球的可行性,拓宽了分析能力。这项工作为双打专用数据集奠定了基础,以加深对这种占主导地位但研究不足的快速球拍运动形式的理解。
摘要:Badminton is known as one of the fastest racket sports in the world. Despite doubles matches being more prevalent in international tournaments than singles, previous research has mainly focused on singles due to the challenges in data availability and multi-person tracking. To address this gap, we designed an approach that transfers singles-trained models to doubles analysis. We extracted keypoints from the ShuttleSet single matches dataset using ViT-Pose and embedded them through a contrastive learning framework based on ST-GCN. To improve tracking stability, we incorporated a custom multi-object tracking algorithm that resolves ID switching issues from fast and overlapping player movements. A Transformer-based classifier then determines shot occurrences based on the learned embeddings. Our findings demonstrate the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. This work establishes a foundation for doubles-specific datasets to enhance understanding of this predominant yet understudied format of the fast racket sport.


【5】Multi-view Clustering via Bi-level Decoupling and Consistency Learning
标题:通过双层解耦与一致性学习的多视图聚类
链接:https://arxiv.org/abs/2508.13499

作者:ng, Yuhui Zheng, Huiying Xu, Xinzhong Zhu
摘要:多视图聚类已被证明是分析多视图数据中潜在模式的有效方法。通过学习多视图特征之间的一致性和互补性可以提高聚类性能;然而,面向聚类的表示学习往往被忽视。在本文中,我们提出了一种新颖的双层解耦与一致性学习框架(BDCL),进一步探索多视图数据的有效表示,以增强多视图聚类中特征的簇间可区分性和簇内紧凑性。我们的框架包含三个模块:1)多视图实例学习模块通过重建自编码器和对比学习,在对齐一致信息的同时保留视图间的私有特征。2)特征与聚类的双层解耦增强了特征空间和聚类空间的可区分性。3)一致性学习模块将样本的不同视图及其近邻视为正对,学习其聚类分配的一致性,并进一步压缩簇内空间。在五个基准数据集上的实验结果表明,所提方法优于SOTA方法。我们的代码发布在https://github.com/LouisDong95/BDCL。
摘要:Multi-view clustering has shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features, however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore the effective representation for multi-view data to enhance inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of feature space and cluster space. 3) The consistency learning module treats the different views of the sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is published on https://github.com/LouisDong95/BDCL.
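"把样本的其他视图及其近邻视为正对"的一致性学习,可以用一个简化的InfoNCE式损失来示意(相似度定义、温度与样本数值均为假设):

```python
import numpy as np

def info_nce(anchor, positives, negatives, tau=0.5):
    # 正对(另一视图与近邻)相似度越高、负对越低,损失越小
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp([sim(anchor, p) / tau for p in positives]).sum()
    neg = np.exp([sim(anchor, n) / tau for n in negatives]).sum()
    return float(-np.log(pos / (pos + neg)))

a = np.array([1.0, 0.0])
close = [np.array([0.9, 0.1]), np.array([0.95, 0.05])]  # 另一视图与近邻
far = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]     # 其他簇的样本
print(info_nce(a, close, far) < info_nce(a, far, close))  # 正对相似时损失更小
```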


【6】GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis
标题:GaitCrafter:生物特征保持步态合成的扩散模型
链接:https://arxiv.org/abs/2508.13300

作者: Mitra, Yogesh S. Rawat
摘要:步态识别是一项有价值的生物特征识别任务,能够根据行走模式从远处识别个体。然而,它仍然受限于缺乏大规模标注数据集,以及在保护隐私的同时为每个个体收集多样步态样本的困难。为了解决这些挑战,我们提出了GaitCrafter,一个在轮廓域中合成逼真步态序列的基于扩散的框架。与依赖模拟环境或其他生成模型的先前工作不同,GaitCrafter完全基于步态轮廓数据从头训练视频扩散模型。我们的方法能够生成时间上一致且保持身份的步态序列。此外,生成过程是可控的,允许以多种协变量(如衣着、携带物品和视角)作为条件。我们表明,将GaitCrafter生成的合成样本纳入步态识别流程可以提升性能,尤其是在具有挑战性的条件下。此外,我们引入了一种机制,通过对身份嵌入进行插值来生成原始数据集中不存在的合成个体,即新身份。这些新身份表现出独特且一致的步态模式,可用于训练模型,同时保护真实受试者的隐私。总体而言,我们的工作朝着利用扩散模型生成高质量、可控且隐私友好的步态数据迈出了重要一步。
摘要:Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
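通过对身份嵌入插值来合成"新身份"的机制,可用球面线性插值(slerp)给出一个极简草图(论文未指明具体插值方式,slerp仅是一种示意性选择):

```python
import numpy as np

def slerp(e0, e1, t):
    # 在单位球面上对两个身份嵌入做球面线性插值,t ∈ [0, 1]
    e0, e1 = e0 / np.linalg.norm(e0), e1 / np.linalg.norm(e1)
    omega = np.arccos(np.clip(e0 @ e1, -1.0, 1.0))
    if omega < 1e-8:          # 两嵌入几乎同向,退化为直接返回
        return e0
    return (np.sin((1 - t) * omega) * e0 + np.sin(t * omega) * e1) / np.sin(omega)

id_a = np.array([1.0, 0.0, 0.0])   # 真实身份 A 的嵌入(占位数值)
id_b = np.array([0.0, 1.0, 0.0])   # 真实身份 B 的嵌入(占位数值)
new_id = slerp(id_a, id_b, 0.5)    # 介于两个真实身份之间的合成身份
print(np.round(new_id, 4))         # 仍落在单位球面上
```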


【7】Learning to See Through Flare
标题:学会看穿耀斑
链接:https://arxiv.org/abs/2508.13907

作者:Peng, Heath Gemar, Erin Fleet, Kyle Novak, Abbie Watnik, Grover Swartzlander
备注:accepted by ICCVW 2025
摘要:机器视觉系统容易受到激光眩光的影响:不必要的强激光照明会通过传感器像素的过饱和或永久损坏,致盲并扭曲其对环境的感知。我们提出NeuSee,第一个在整个可见光谱范围内实现高保真传感器防护的计算成像框架。它联合学习衍射光学元件(DOE)的神经表示和一个用于图像恢复的频率空间Mamba-GAN网络。NeuSee系统在10万张不同图像上进行端到端对抗训练,可抑制高达传感器饱和阈值$I_{\textrm{sat}}$的$10^6$倍的峰值激光辐照度;在没有DOE的情况下,超过该阈值的照射可能损坏相机传感器。我们的系统利用异构数据与模型并行进行分布式计算,集成高光谱信息和多个神经网络,以实现逼真的仿真和图像恢复。NeuSee考虑了激光波长、强度和位置动态变化的开放世界场景,以及镜头眩光效应、未知环境光照条件和传感器噪声。它优于其他可学习DOE,首次实现全光谱成像和激光抑制,恢复图像质量提高了10.1%。
摘要:Machine vision systems are susceptible to laser flare, where unwanted intense laser illumination blinds and distorts its perception of the environment through oversaturation or permanent damage to sensor pixels. We introduce NeuSee, the first computational imaging framework for high-fidelity sensor protection across the full visible spectrum. It jointly learns a neural representation of a diffractive optical element (DOE) and a frequency-space Mamba-GAN network for image restoration. NeuSee system is adversarially trained end-to-end on 100K unique images to suppress the peak laser irradiance as high as $10^6$ times the sensor saturation threshold $I_{\textrm{sat}}$, the point at which camera sensors may experience damage without the DOE. Our system leverages heterogeneous data and model parallelism for distributed computing, integrating hyperspectral information and multiple neural networks for realistic simulation and image restoration. NeuSee takes into account open-world scenes with dynamically varying laser wavelengths, intensities, and positions, as well as lens flare effects, unknown ambient lighting conditions, and sensor noises. It outperforms other learned DOEs, achieving full-spectrum imaging and laser suppression for the first time, with a 10.1\% improvement in restored image quality.


【8】Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction
标题:使用扩散模型进行心脏体积重建的潜在插值学习
链接:https://arxiv.org/abs/2508.13826

作者:beck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan
摘要:心脏磁共振(CMR)成像是诊断和管理心血管疾病的重要工具,但其效用往往受限于2D短轴切片的稀疏采集,导致体积信息不完整。从这些稀疏切片进行精确的3D重建对于全面的心脏评估必不可少,但现有方法面临诸多挑战,包括依赖预定义的插值方案(例如线性或球面插值)、计算效率低,以及依赖分割标签或运动数据等额外语义输入。为了解决这些局限,我们提出了一种新颖的心脏潜在插值扩散(Cardiac Latent Interpolation Diffusion, CaLID)框架,它引入三项关键创新。首先,我们提出一种基于扩散模型的数据驱动插值方案,能够捕获稀疏切片之间复杂的非线性关系,提高重建精度。其次,我们设计了一种在潜在空间中运行的高计算效率方法,将3D全心上采样时间加快24倍,相比以往方法减少了计算开销。第三,仅以稀疏2D CMR图像为输入,我们的方法即实现了SOTA性能,无需形态学引导等辅助输入,从而简化了工作流程。我们进一步将方法扩展到2D+T数据,从而有效建模时空动态并确保时间一致性。广泛的体积评估和下游分割任务表明,CaLID实现了更优的重建质量和效率。通过解决现有方法的根本局限,我们的框架推进了空间与时空全心重建的最新技术水平,为心血管成像提供了一个稳健且临床实用的解决方案。
摘要:Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel \textbf{Ca}rdiac \textbf{L}atent \textbf{I}nterpolation \textbf{D}iffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatio and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.


其他(25篇)

【1】ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans
标题:ResPlan:包含17,000个住宅平面图的大规模矢量图数据集
链接:https://arxiv.org/abs/2508.14006

作者:bouagour, Eleftherios Garyfallidis
备注:18 pages, 3 figures, 4 tables
摘要:我们介绍ResPlan,一个包含17,000个详细、结构丰富且真实的住宅平面图的大规模数据集,旨在推进空间AI研究。每个平面图都包含对建筑元素(墙、门、窗、阳台)和功能空间(如厨房、卧室和浴室)的精确标注。ResPlan通过提供更高的视觉保真度和更大的结构多样性、反映真实而非理想化的住宅布局,解决了RPLAN(Wu et al., 2019)和MSD(van Engelenburg et al., 2024)等现有数据集的关键局限。作为一个多功能的通用资源,ResPlan支持广泛的应用,包括机器人、强化学习、生成式AI、虚拟与增强现实、仿真和游戏开发。平面图同时以几何和基于图的格式提供,便于直接集成到仿真引擎并实现快速3D转换。一项关键贡献是一个用于几何清理、对齐和标注细化的开源流水线。此外,ResPlan还包括房间连通性的结构化表示,支持基于图的空间推理任务。最后,我们给出了与现有基准的比较分析,并概述了ResPlan所支持的若干开放基准任务。总之,ResPlan在规模、真实性和可用性方面均有显著进步,为开发和基准测试下一代空间智能系统提供了坚实基础。
摘要:We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.
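房间连通性的结构化表示可以直接落成一个图:下面用纯Python邻接表加BFS示意基于图的空间推理(房间名与数据布局均为假设,并非ResPlan的官方schema):

```python
# 一个户型的房间连通图:键为房间,值为与之直接连通(有门/开口)的房间
rooms = {"kitchen": ["living_room"],
         "living_room": ["kitchen", "bedroom", "bathroom"],
         "bedroom": ["living_room"],
         "bathroom": ["living_room"]}

def reachable(graph, start):
    # BFS 求从某房间出发可达的全部房间,可用于连通性/可达性推理
    seen, queue = {start}, [start]
    while queue:
        cur = queue.pop(0)
        for nxt in graph[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable(rooms, "kitchen")))
```

同样的图结构也可以作为强化学习或导航仿真中的离散状态空间使用。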


【2】A Comprehensive Re-Evaluation of Biometric Modality Properties in the Modern Era
标题:现代生物识别形态特性的全面重新评估
链接:https://arxiv.org/abs/2508.13874

作者:Al-Refai, Pankaja Priya Ramasamy, Ragini Ramesh, Patricia Arias-Cabarcos, Philipp Terhörst
摘要:认证系统的快速发展及其为获得更快、更准确的用户验证体验而日益依赖生物识别技术,凸显了迫切需要一个可靠的框架来评估生物识别模态对特定应用的适用性。目前最广为人知的评估框架是1998年的一张比较表,它已不能充分反映生物识别系统的最新技术发展或新出现的脆弱性。为应对这些挑战,本工作通过一项涉及24名生物识别专家的专家调查,重新评估了各生物识别模态。调查结果表明,各模态的属性评级发生了重大变化:例如,人脸识别因技术进步而评级提高,而指纹则因新出现的漏洞和攻击而可靠性下降。对各评估属性上专家一致性水平的进一步分析凸显了所提供评估的一致性,确保了评级的可靠性。最后,我们将专家评估与55个生物识别数据集的数据集级不确定性进行了比较,揭示了大多数模态上的高度一致,并强调了将经验证据与专家见解相结合的重要性。此外,所识别的专家分歧揭示了关键的开放性挑战,有助于指导未来研究加以解决。
摘要:The rapid advancement of authentication systems and their increasing reliance on biometrics for faster and more accurate user verification experience, highlight the critical need for a reliable framework to evaluate the suitability of biometric modalities for specific applications. Currently, the most widely known evaluation framework is a comparative table from 1998, which no longer adequately captures recent technological developments or emerging vulnerabilities in biometric systems. To address these challenges, this work revisits the evaluation of biometric modalities through an expert survey involving 24 biometric specialists. The findings indicate substantial shifts in property ratings across modalities. For example, face recognition, shows improved ratings due to technological progress, while fingerprint, shows decreased reliability because of emerging vulnerabilities and attacks. Further analysis of expert agreement levels across rated properties highlighted the consistency of the provided evaluations and ensured the reliability of the ratings. Finally, expert assessments are compared with dataset-level uncertainty across 55 biometric datasets, revealing strong alignment in most modalities and underscoring the importance of integrating empirical evidence with expert insight. Moreover, the identified expert disagreements reveal key open challenges and help guide future research toward resolving them.


【3】RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems
标题:RED.AI Id-Pattern:基于多智能体系统的石材劣化模式初步结果
链接:https://arxiv.org/abs/2508.13872

作者:orradetti, José Delgado Rodrigues
备注:11 pages, 1 figure, 1 table. Contribution for REEACH 2025 Symposium
摘要:RED.AI项目(Reabilita\c{c}\~ao Estrutural Digital atrav\'es da AI)中的Id-Pattern系统是一个智能体系统,旨在辅助识别石材劣化模式。传统方法基于专家团队的直接观察,虽然准确,但在时间和资源方面成本高昂。本文介绍并评估了一个多智能体人工智能(AI)系统,旨在模拟专家之间的协作,并根据视觉证据自动诊断石材病害。该方法基于一个认知架构,它协调一组专门的AI智能体;在本例中限定为五个:岩性学家、病害专家、环境专家、文物保护修复师和诊断协调员。为了评估该系统,我们选择了28幅涉及多种劣化模式的高难度图像。初步结果显示,与基础模型相比,我们的系统在所有指标上都有大幅提升。
摘要:The Id-Pattern system within the RED.AI project (Reabilita\c{c}\~ao Estrutural Digital atrav\'es da AI) consists of an agentic system designed to assist in the identification of stone deterioration patterns. Traditional methodologies, based on direct observation by expert teams, are accurate but costly in terms of time and resources. The system developed here introduces and evaluates a multi-agent artificial intelligence (AI) system, designed to simulate collaboration between experts and automate the diagnosis of stone pathologies from visual evidence. The approach is based on a cognitive architecture that orchestrates a team of specialized AI agents which, in this specific case, are limited to five: a lithologist, a pathologist, an environmental expert, a conservator-restorer, and a diagnostic coordinator. To evaluate the system we selected 28 difficult images involving multiple deterioration patterns. Our first results showed a huge boost on all metrics of our system compared to the foundational model.


【4】Is-NeRF: In-scattering Neural Radiance Field for Blurred Images
标题:Is-NeRF:模糊图像的内散射神经辐射场
链接:https://arxiv.org/abs/2508.13808

作者:Chenglin Ye, Jiaxu Li, Gang Liu, Bo Wan, Di Wang, Lupeng Liu, Jun Xiao
摘要:神经辐射场(NeRF)因其出色的隐式3D表示和逼真的新视角合成能力而受到广泛关注。现有工作无一例外地采用直线体渲染,难以处理复杂光路场景,并在训练过程中引入几何歧义,在处理运动模糊图像时尤为明显。为了解决这些挑战,本工作提出了一种新的去模糊神经辐射场Is-NeRF,对真实环境中的光路进行显式建模。通过用内散射表示统一六种常见的光传播现象,我们建立了一个可适应复杂光路的、新的散射感知体渲染管线。此外,我们引入了一种自适应学习策略,能够自主确定散射方向和采样间隔,以捕捉更精细的物体细节。所提网络联合优化NeRF参数、散射参数和相机运动,从模糊图像中恢复细粒度的场景表示。全面的评估表明,它能有效处理复杂的真实世界场景,在生成具有准确几何细节的高保真图像方面优于最先进的方法。
摘要:Neural Radiance Fields (NeRF) has gained significant attention for its prominent implicit 3D representation and realistic novel view synthesis capabilities. Available works unexceptionally employ straight-line volume rendering, which struggles to handle sophisticated lightpath scenarios and introduces geometric ambiguities during training, particularly evident when processing motion-blurred images. To address these challenges, this work proposes a novel deblur neural radiance field, Is-NeRF, featuring explicit lightpath modeling in real-world environments. By unifying six common light propagation phenomena through an in-scattering representation, we establish a new scattering-aware volume rendering pipeline adaptable to complex lightpaths. Additionally, we introduce an adaptive learning strategy that enables autonomous determining of scattering directions and sampling intervals to capture finer object details. The proposed network jointly optimizes NeRF parameters, scattering parameters, and camera motions to recover fine-grained scene representations from blurry images. Comprehensive evaluations demonstrate that it effectively handles complex real-world scenarios, outperforming state-of-the-art approaches in generating high-fidelity images with accurate geometric details.
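For context, the "straight-line volume rendering" that Is-NeRF generalises is the standard NeRF compositing rule C = sum_i T_i * alpha_i * c_i with T_i = prod_{j<i}(1 - alpha_j). A minimal sketch with illustrative densities (not the paper's scattering-aware pipeline):

```python
# Standard straight-line volume rendering: alpha compositing of densities
# sampled along a ray. The densities, spacings and colours are toy values.
import numpy as np

def composite(sigmas, deltas, colors):
    """NeRF-style compositing: C = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

sigmas = np.array([0.0, 2.0, 5.0])      # volume densities along the ray
deltas = np.array([0.5, 0.5, 0.5])      # sample spacing
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
rgb, w = composite(sigmas, deltas, colors)
print(rgb, w.sum())  # the weights sum to at most 1
```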


【5】VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
标题:VisionLaw:通过二层优化从视觉观察推断可解释的内在动力学
链接:https://arxiv.org/abs/2508.13792

作者:in, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang
备注:9 pages, 6 figures
摘要:对象的内在动力学决定其在现实世界中的物理行为,在利用3D资产进行物理上合理的交互式仿真方面发挥着关键作用。现有方法试图从视觉观察中推断物体的内在动力学,但通常面临两大挑战:一类工作依赖人工定义的本构先验,难以推广到复杂场景;另一类则用神经网络对内在动力学建模,导致可解释性有限、泛化能力差。为了解决这些挑战,我们提出了VisionLaw,一个从视觉观察中推断内在动力学可解释表达式的双层优化框架。在上层,我们引入了一种LLM驱动的解耦本构演化策略,将LLM提示为知识渊博的物理专家来生成和修改本构律,并借助内置的解耦机制大大降低LLM的搜索复杂度。在下层,我们引入了一种视觉引导的本构评估机制,利用视觉仿真来评估所生成的本构律与底层内在动力学之间的一致性,从而指导上层的演化。在合成和真实世界数据集上的实验表明,VisionLaw能够有效地从视觉观察中推断可解释的内在动力学,显著优于现有最先进方法,并在新场景的交互式仿真中表现出很强的泛化能力。
摘要:The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted as a knowledgeable physics expert to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.


【6】Shape-from-Template with Generalised Camera
标题:广义相机下的从模板恢复形状(Shape-from-Template)
链接:https://arxiv.org/abs/2508.13791

作者:ngupta, Stefan Zachow
备注:Pre-print of the IMAVIS article:   https://www.sciencedirect.com/science/article/abs/pii/S0262885625001672 Code   and data in: https://git.zib.de/asengupta/sft-generalised
摘要:本文提出了一种新方法,用于将3D形状非刚性地配准到由多相机星座观测到的2D关键点。将3D形状非刚性配准到观测的2D关键点,即从模板恢复形状(Shape-from-Template,SfT),在单幅图像下已被广泛研究;而联合利用多相机信息的SfT为扩展已知用例(例如医学成像中的3D形状配准和手持相机配准等)开辟了新方向。我们用广义相机模型表示这种多相机设置,因此可以配准观测任意变形对象的任意透视或正交相机集合。我们为这种SfT提出了多种方法:第一种方法中,对应关键点位于从空间中已知3D点出发的方向向量上;第二种方法中,对应关键点位于从空间中未知3D点出发、但相对于某局部参考系方向已知的方向向量上;第三种方法中,除对应关系外,成像对象的轮廓也已知。这些共同构成了广义相机SfT问题的第一组解决方案。广义相机SfT背后的关键思想是在估计变形形状的同时,利用变形对象多个视图之间相互约束所提供的附加信息来提高重建精度。基于对应关系的方法用凸规划求解,而基于轮廓的方法则是对凸解结果的迭代精化。我们在大量合成与真实数据上验证了所提方法的准确性。
摘要:This article presents a new method for non-rigidly registering a 3D shape to 2D keypoints observed by a constellation of multiple cameras. Non-rigid registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template (SfT), has been widely studied using single images, but SfT with information from multiple-cameras jointly opens new directions for extending the scope of known use-cases such as 3D shape registration in medical imaging and registration from hand-held cameras, to name a few. We represent such multi-camera setup with the generalised camera model; therefore any collection of perspective or orthographic cameras observing any deforming object can be registered. We propose multiple approaches for such SfT: the first approach where the corresponded keypoints lie on a direction vector from a known 3D point in space, the second approach where the corresponded keypoints lie on a direction vector from an unknown 3D point in space but with known orientation w.r.t some local reference frame, and a third approach where, apart from correspondences, the silhouette of the imaged object is also known. Together, these form the first set of solutions to the SfT problem with generalised cameras. The key idea behind SfT with generalised camera is the improved reconstruction accuracy from estimating deformed shape while utilising the additional information from the mutual constraints between multiple views of a deformed object. The correspondence-based approaches are solved with convex programming while the silhouette-based approach is an iterative refinement of the results from the convex solutions. We demonstrate the accuracy of our proposed methods on many synthetic and real data


【7】Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
标题:缓解多图像任务LVLM中的跨图像信息泄露
链接:https://arxiv.org/abs/2508.13744

作者:, Minyoung Lee, Sanghyuk Chun, Junsuk Choe
备注:Source code is available at this https URL
摘要:大型视觉语言模型(LVLM)在单图像任务上表现出强大的性能。然而,我们观察到,它们的性能显着下降时,处理多图像输入。这是因为来自不同图像的视觉线索在模型的输出中纠缠在一起。我们把这种现象称为跨图像信息泄漏。为了解决这个问题,我们提出了FOCUS,一种免训练和与架构无关的解码策略,可以减轻推理过程中的跨图像信息泄漏。FOCUS顺序地用随机噪声屏蔽除一个图像之外的所有图像,引导模型聚焦于单个干净图像。我们在所有目标图像上重复此过程,以获得部分掩蔽上下文下的logits。这些logit被聚合,然后使用仅噪声参考输入进行对比细化,这抑制了泄漏并产生更准确的输出。FOCUS在四个多图像基准测试和各种LVLM系列中不断提高性能。这表明,FOCUS提供了一个通用和实用的解决方案,用于增强多图像推理,而无需额外的培训或架构修改。
摘要:Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.


【8】OmniTry: Virtual Try-On Anything without Masks
标题:OmniTry:无需掩模即可虚拟试穿任何物品
链接:https://arxiv.org/abs/2508.13632

作者:ng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, Bin Wang
摘要:虚拟试穿(VTON)是一项实用且应用广泛的任务,现有研究多以服装为对象。本文介绍了OmniTry,一个将VTON从服装扩展到任何可穿戴物品(例如珠宝和配饰)的统一框架,并采用无掩模设置以实现更实际的应用。在扩展到各种类型的物品时,数据整理面临获取配对图像(即物品图像及对应试穿结果)的挑战。为此,我们提出了一个两阶段管道:第一阶段,我们利用大规模非配对图像(即佩戴任意可穿戴物品的肖像)训练模型实现无掩模定位。具体而言,我们重新利用修复(inpainting)模型,在给定空掩模的情况下自动在合适位置绘制物品。第二阶段,用配对图像进一步微调模型,以传递物品外观的一致性。我们观察到,第一阶段之后的模型即使只有少量配对样本也能快速收敛。OmniTry在一个包含12类常见可穿戴物品、涵盖店内与真实场景图像的综合基准上进行了评估。实验结果表明,与现有方法相比,OmniTry在物品定位和身份保持方面均表现更佳。OmniTry的代码、模型权重和评估基准将在https://omnitry.github.io/上公开。
摘要:Virtual Try-ON (VTON) is a practical and widely-applied task, for which most of existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garment to encompass any wearable objects, e.g., jewelries and accessories, with mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at https://omnitry.github.io/.


【9】TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
标题:TalkVid:用于音频驱动说话头部合成的大规模多元化数据集
链接:https://arxiv.org/abs/2508.13618

作者:hen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
摘要:音频驱动的说话头合成已经达到显著的照片级真实感,但最先进(SOTA)的模型存在一个关键缺陷:它们无法泛化到种族、语言和年龄组等人类多样性的全谱。我们认为,这种泛化差距是现有训练数据局限性的直接症状:其规模、质量和多样性均不足。为了应对这一挑战,我们引入了TalkVid,一个新的大规模、高质量且多样化的数据集,包含来自7729个不同说话者的1244小时视频。TalkVid通过一个有原则的多阶段自动化管道进行整理,严格过滤运动稳定性、美学质量和面部细节,并经人类判断验证以确保其可靠性。此外,我们构建并发布了TalkVid-Bench,一个由500个片段组成的分层评估集,在关键的人口统计和语言维度上经过细致平衡。实验表明,在TalkVid上训练的模型优于在以往数据集上训练的对应模型,表现出更优的跨数据集泛化能力。至关重要的是,我们对TalkVid-Bench的分析揭示了被传统聚合指标所掩盖的子群体间性能差异,凸显了其对未来研究的必要性。代码和数据见https://github.com/FreedomIntelligence/TalkVid
摘要:Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid


【10】Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm
标题:使用改进LBPH算法的双因素认证智能门禁
链接:https://arxiv.org/abs/2508.13617

作者:op, Wan Mohamad Hariz Bin Wan Mohamad Rosdi, Looi Wei Hua, Syarulnaziah Anawar, Nur Fadzilah Othman
摘要:口罩检测近来变得越来越重要,尤其是在COVID-19大流行期间。许多人脸检测模型已被用于基于物联网的智能门禁,但在口罩检测方面的物联网开发仍然欠缺。本文提出了一种用于智能门禁访问控制的双因素身份验证系统,结合人脸识别与密码验证,并包含自动化流程:当检测到陌生人时提醒所有者并激活监控系统,同时可在Raspberry Pi平台上通过Telegram远程控制系统。该系统采用局部二值模式直方图(LBPH)算法进行完整人脸识别,并采用改进的LBPH算法进行遮挡人脸检测。在所有测试用户中,该系统平均达到约70%的准确率、约80%的精确率和约83.26%的召回率。结果表明,该系统能够进行人脸识别和口罩检测,自动完成远程注册用户、开关门锁及通知所有者等操作。在用户验收测试中,样本参与者对其未来使用表示高度接受。
摘要:Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been developed in smart entryways using IoT. However, there is a lack of IoT development on face mask detection. This paper proposes a two-factor authentication system for smart entryway access control using facial recognition and passcode verification and an automation process to alert the owner and activate the surveillance system when a stranger is detected and controls the system remotely via Telegram on a Raspberry Pi platform. The system employs the Local Binary Patterns Histograms for the full face recognition algorithm and modified LBPH algorithm for occluded face detection. On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. The results indicate that the system is capable of conducting face recognition and mask detection, automating the operation of the remote control to register users, locking or unlocking the door, and notifying the owner. The sample participants highly accept it for future use in the user acceptance test.
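The Local Binary Patterns Histograms (LBPH) features underlying the system can be illustrated with textbook 8-neighbour LBP; note this is the standard formulation, not the paper's modified variant for occluded faces:

```python
# Textbook 8-neighbour LBP and its histogram (the "H" in LBPH).
# This is the standard algorithm, not the paper's modified variant.
import numpy as np

def lbp_codes(img):
    """8-neighbour LBP code for each interior pixel of a grayscale image."""
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)  # 1 bit per neighbour
    return codes

def lbp_histogram(img):
    """256-bin normalised histogram of LBP codes."""
    hist = np.bincount(lbp_codes(img).ravel(), minlength=256)
    return hist / hist.sum()

img = np.array([[10, 20, 30],
                [20, 25, 30],
                [30, 30, 40]], dtype=np.uint8)
print(lbp_codes(img))  # single interior pixel -> one 8-bit code
```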


【11】The 9th AI City Challenge
标题:第九届AI城市挑战赛
链接:https://arxiv.org/abs/2508.13564

作者:g, Shuo Wang, David C. Anastasiu, Ming-Ching Chang, Anuj Sharma, Quan Kong, Norimasa Kobori, Munkhjargal Gochoo, Ganzorig Batnasan, Munkh-Erdene Otgonbold, Fady Alnajjar, Jun-Wei Hsieh, Tomasz Kornuta, Xiaolong Li, Yilin Zhao, Han Zhang, Subhashree Radhakrishnan, Arihant Jain, Ratnesh Kumar, Vidya N. Murali, Yuxing Wang, Sameer Satish Pusegaonkar, Yizhou Wang, Sujit Biswas, Xunlei Wu, Zhedong Zheng, Pranamesh Chakraborty, Rama Chellappa
备注:Summary of the 9th AI City Challenge Workshop in conjunction with ICCV 2025
摘要:第九届AI城市挑战赛继续推进计算机视觉和人工智能在交通、工业自动化和公共安全领域的实际应用。2025年版设有四个赛道,参与度增加了17%,来自15个国家的245支队伍在评估服务器上注册。公开发布的挑战数据集迄今已带来超过30,000次下载。赛道1专注于多类别3D多摄像头跟踪,涉及行人、类人机器人、自主移动机器人和叉车,并提供详细的标定和3D边界框注释。赛道2面向交通安全中的视频问答,以3D注视标签丰富多摄像头事件理解。赛道3关注动态仓库环境中的细粒度空间推理,要求AI系统解读RGB-D输入,并回答结合感知、几何和语言的空间问题。赛道1和赛道3的数据集均在NVIDIA Omniverse中生成。赛道4强调鱼眼摄像头下的高效道路物体检测,支持在边缘设备上轻量级实时部署。评估框架限制提交次数并使用部分保留的测试集,以确保基准测试的公平性。最终排名在比赛结束后公布,以促进可复现性并减轻过拟合。多支队伍取得了顶级成绩,在多项任务中刷新了基准。
摘要:The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.


【12】GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
标题:GazeProphet:面向VR中心凹渲染的纯软件注视预测
链接:https://arxiv.org/abs/2508.13546

作者:badulla, Chiraag Mudlapur, Gaurav BV
备注:8 pages, 3 figures
摘要:中心凹渲染(foveated rendering)通过将渲染质量集中在用户注视处,显著降低了虚拟现实应用中的计算需求。现有方法需要昂贵的基于硬件的眼动跟踪系统,由于成本、校准复杂性和硬件兼容性限制,难以广泛采用。本文介绍了GazeProphet,一种无需专用眼动跟踪硬件即可预测VR环境中注视位置的纯软件方法。该方法将处理360度VR场景的球面Vision Transformer与捕捉注视序列模式的基于LSTM的时间编码器相结合。一个多模态融合网络整合空间场景特征与时间注视动态,预测未来注视位置并给出相应的置信度估计。在一个综合VR数据集上的实验评估表明,GazeProphet的中位角误差为3.83度,比传统基于显著性的基线高出24%,同时提供可靠的置信度校准。该方法在不同空间区域和场景类型中保持一致的性能,使其能够在无需额外硬件的情况下实际部署于VR系统。统计分析证实了所有评估指标上改进的显著性。这些结果表明,纯软件的注视预测可用于VR中心凹渲染,使这一性能提升更易为不同的VR平台和应用所用。
摘要:Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction can work for VR foveated rendering, making this performance boost more accessible to different VR platforms and apps.
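The 3.83-degree median angular error reported for GazeProphet is presumably computed as the angle between predicted and ground-truth gaze directions. A sketch with made-up unit vectors:

```python
# Median angular error between predicted and true 3D gaze directions.
# The sample vectors are invented for illustration.
import numpy as np

def angular_errors_deg(pred, true):
    """Angle in degrees between corresponding gaze direction vectors."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(p * t, axis=1), -1.0, 1.0)  # clip guards arccos domain
    return np.degrees(np.arccos(cos))

pred = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
errs = angular_errors_deg(pred, true)
print(np.median(errs))  # 45.0 -> median of [90, 0]
```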


【13】FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
标题:FLAIR:频率和局部感知的隐式神经表示
链接:https://arxiv.org/abs/2508.13544

作者:, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh
备注:Please visit our project page at this https URL
摘要:隐式神经表示(INR)利用神经网络将坐标映射到相应的信号,从而实现连续而紧凑的表示。这一范式推动了各种视觉任务的重大进展。然而,现有INR缺乏频率选择性、空间定位和稀疏表示,导致过度依赖冗余信号分量。因此,它们表现出频谱偏差:倾向于早期学习低频分量,而难以捕捉精细的高频细节。为了解决这些问题,我们提出了FLAIR(频率和局部感知隐式神经表示),包含两项关键创新。第一项是RC-GAUSS,一种在时频不确定性原理(TFUP)约束下为显式频率选择和空间定位而设计的新型激活函数。第二项是小波能量引导编码(WEGE),利用离散小波变换(DWT)计算能量分数,并将频率信息显式地引导至网络。我们的方法在2D图像表示与复原以及3D重建方面始终优于现有INR。
摘要:Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
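The wavelet-energy scores used by WEGE can be illustrated with a one-level 2D Haar transform and per-subband relative energies; FLAIR's actual encoding and guidance mechanism are more involved than this sketch:

```python
# One-level 2D Haar DWT and per-subband energy scores, as a toy stand-in
# for the energy computation WEGE builds on.
import numpy as np

def haar2d(x):
    """One-level 2D Haar DWT; returns (LL, LH, HL, HH) subbands."""
    a, b = x[:, 0::2], x[:, 1::2]                  # pair adjacent columns
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    def rows(y):
        return (y[0::2] + y[1::2]) / 2.0, (y[0::2] - y[1::2]) / 2.0
    ll, lh = rows(lo)
    hl, hh = rows(hi)
    return ll, lh, hl, hh

def energy_scores(x):
    bands = haar2d(x)
    e = np.array([np.sum(b ** 2) for b in bands])
    return e / e.sum()                             # relative energy per subband

flat = np.ones((4, 4))                             # constant image: purely low-frequency
print(energy_scores(flat))  # LL carries all the energy
```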


【14】2D Gaussians Meet Visual Tokenizer
标题:2D高斯遇见视觉代币器
链接:https://arxiv.org/abs/2508.13515

作者:, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wan
摘要:图像分词器(tokenizer)是自回归(AR)图像生成中的关键组件,因为它决定了丰富而有结构的视觉内容如何被编码为紧凑的表示。现有的基于量化的分词器(如VQ-GAN)主要关注纹理和颜色等外观特征,由于其基于补丁的设计,通常忽略几何结构。在这项工作中,我们探索了如何将更多视觉信息纳入分词器,并提出了一个名为视觉高斯量化(VGQ)的新框架:一种通过将2D高斯集成到传统视觉码本量化框架中来显式增强结构建模的新型分词器范式。我们的方法解决了VQ-GAN等朴素量化方法的固有局限,这些方法由于基于补丁的设计以及对纹理和颜色的侧重,难以对有结构的视觉信息建模。相比之下,VGQ将图像潜变量编码为2D高斯分布,通过直接建模位置、旋转和尺度等结构相关参数,有效捕捉几何和空间结构。我们进一步证明,增加令牌内2D高斯的密度会带来重建保真度的显著提升,在令牌效率与视觉丰富度之间提供灵活的权衡。在ImageNet 256x256基准上,VGQ实现了强劲的重建质量,rFID得分为1.00。此外,通过增加令牌内2D高斯的密度,VGQ的重建能力显著提升,达到最先进的重建rFID得分0.556和PSNR 24.93,大幅优于现有方法。代码即将发布。
摘要:The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
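The structure-related parameters VGQ models per token (position, rotation, scale) are exactly those of an oriented 2D Gaussian. Below is a generic anisotropic Gaussian evaluated on a grid, not VGQ's decoder:

```python
# Evaluate an oriented 2D Gaussian parameterised by position, rotation
# and scale -- the structure-related parameters named in the abstract.
import numpy as np

def gaussian_2d(h, w, pos, theta, scale):
    """Evaluate an oriented 2D Gaussian on an h x w pixel grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - pos[0], ys - pos[1]], axis=-1).astype(float)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    cov = rot @ np.diag(np.square(scale)) @ rot.T     # covariance from R and S
    z = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(cov), d)
    return np.exp(-0.5 * z)

g = gaussian_2d(8, 8, pos=(4, 4), theta=np.pi / 6, scale=(2.0, 0.5))
print(g[4, 4])  # 1.0 at the Gaussian centre
```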


【15】ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
标题:ROVER:重复环境下基于轨迹先验的鲁棒闭环验证
链接:https://arxiv.org/abs/2508.13488

作者:u, Jiayi Yang, Anjun Hu, Jiankun Wang, Ping Tan, Hong Zhang
备注:8 pages, 9 figures
摘要:回环检测对同时定位与建图(SLAM)十分重要,它将当前观测与历史关键帧关联起来,实现漂移校正和全局重定位。然而,错误检测的回环可能是致命的,这在重复性环境中尤其困难:由于高度相似,基于外观的特征会失效。因此,回环验证是避免误检的关键步骤。现有的回环验证工作主要集中在学习不变的外观特征,忽略了机器人时空运动线索(即轨迹)的先验知识。在本文中,我们提出了ROVER,一种利用历史轨迹作为先验约束、在具有挑战性的重复环境中拒绝错误回环的回环验证方法。对于每个候选回环,首先将其用于姿态图优化以估计机器人轨迹;然后将该轨迹提交给评分方案,评估其与无回环轨迹(我们称之为轨迹先验)的符合程度,以决定是否接受该候选回环。基准比较和真实世界实验证明了所提方法的有效性。此外,我们将ROVER集成到最先进的SLAM系统中,以验证其鲁棒性和效率。源代码和自采数据集见https://github.com/jarvisyjw/ROVER。
摘要:Loop closure detection is important for simultaneous localization and mapping (SLAM), which associates current observations with historical keyframes, achieving drift correction and global relocalization. However, a falsely detected loop can be fatal, and this is especially difficult in repetitive environments where appearance-based features fail due to the high similarity. Therefore, verification of a loop closure is a critical step in avoiding false positive detections. Existing works in loop closure verification predominantly focus on learning invariant appearance features, neglecting the prior knowledge of the robot's spatial-temporal motion cue, i.e., trajectory. In this letter, we propose ROVER, a loop closure verification method that leverages the historical trajectory as a prior constraint to reject false loops in challenging repetitive environments. For each loop candidate, it is first used to estimate the robot trajectory with pose-graph optimization. This trajectory is then submitted to a scoring scheme that assesses its compliance with the trajectory without the loop, which we refer to as the trajectory prior, to determine if the loop candidate should be accepted. Benchmark comparisons and real-world experiments demonstrate the effectiveness of the proposed method. Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify its robustness and efficiency. Our source code and self-collected dataset are available at https://github.com/jarvisyjw/ROVER.


【16】Enhancing Robustness of Implicit Neural Representations Against Weight Perturbations
标题:增强隐式神经表示对权重扰动的鲁棒性
链接:https://arxiv.org/abs/2508.13481

作者:hou, Yuxin Cheng, Zhengwu Liu, Taiqiang Wu, Chen Zhang, Ngai Wong
备注:4 pages, 7 figures
摘要:隐式神经表示(INR)使用神经网络以连续方式对离散信号进行编码,在各种多媒体应用中展现出重要价值。然而,INR的脆弱性对其实际部署构成严峻挑战,因为网络权重可能受到不可避免的扰动。在这项工作中,我们首次研究了INR的鲁棒性,发现即使是微小的扰动也会导致信号重建质量的大幅下降。为缓解这一问题,我们通过最小化有无权重扰动时损失之间的差异,来形式化INR的鲁棒性问题。此外,我们推导出一种新的鲁棒损失函数,用以调节重建损失关于权重的梯度,从而提高鲁棒性。在多种模态重建任务上的大量实验表明,与噪声条件下的原始INR相比,我们的方法在峰值信噪比(PSNR)上实现了高达7.5 dB的提升。
摘要:Implicit Neural Representations (INRs) encode discrete signals in a continuous manner using neural networks, demonstrating significant value across various multimedia applications. However, the vulnerability of INRs presents a critical challenge for their real-world deployments, as the network weights might be subjected to unavoidable perturbations. In this work, we investigate the robustness of INRs for the first time and find that even minor perturbations can lead to substantial performance degradation in the quality of signal reconstruction. To mitigate this issue, we formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations. Furthermore, we derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights, thereby enhancing the robustness. Extensive experiments on reconstruction tasks across multiple modalities demonstrate that our method achieves up to a 7.5~dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions.
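The robustness objective above minimises the gap between reconstruction loss with and without weight perturbations. A toy numerical check of that gap for a one-layer linear "INR" (all values illustrative, not the paper's setup):

```python
# Measure the loss gap under a small weight perturbation for a toy
# linear model standing in for an INR.
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(-1, 1, size=(64, 2))          # input coordinates
w = rng.normal(size=(2,))                          # "fitted" weights
signal = coords @ w                                # perfectly fitted target

def mse(weights):
    return np.mean((coords @ weights - signal) ** 2)

delta = 0.01 * rng.normal(size=w.shape)            # small weight perturbation
gap = mse(w + delta) - mse(w)                      # the quantity to be regularised
print(mse(w), gap >= 0.0)
```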


【17】AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results
标题:AIM 2025对反向音调映射报告的挑战:方法和结果
链接:https://arxiv.org/abs/2508.13479

作者:, Francesco Banterle, Bin Ren, Radu Timofte, Xin Lu, Yufeng Peng, Chengjie Ge, Zhijing Sun, Ziang Zhou, Zihao Li, Zishun Liao, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Zhijing Sun, Xingbo Wang, Kean Liu, Senyan Xu, Yang Qiu, Yifan Ding, Gabriel Eilertsen, Jonas Unger, Zihao Wang, Ke Wu, Jinshan Pan, Zhen Liu, Zhongyang Li, Shuaicheng Liu, S.M Nadim Uddin
摘要:本文对AIM 2025逆色调映射(ITM)挑战赛进行了全面回顾。该挑战旨在推动从单幅LDR输入重建HDR图像的有效ITM算法的发展,重点关注感知保真度和数值一致性。共有67名参与者提交了319份有效结果,从中选出最佳的五支队伍进行详细分析。本报告综合了他们的方法和性能,前几名中最低的PU21-PSNR达到29.22 dB。分析强调了提高HDR重建质量的创新策略,并建立了强有力的基准,以指导未来的逆色调映射研究。
摘要:This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.


【18】MINR: Efficient Implicit Neural Representations for Multi-Image Encoding
Link: https://arxiv.org/abs/2508.13471

Authors: hou, Taiqiang Wu, Zhengwu Liu, Yuxin Cheng, Chen Zhang, Ngai Wong
Note: 4 pages, 4 figures
Abstract: Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, which shares specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions of several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while keeping the input and output layers input-specific. In addition, we design a novel extra projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR saves up to 60% of parameters while maintaining comparable performance. In particular, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis with various backbones proves the robustness of the proposed MINR.
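A back-of-the-envelope sketch of the sharing idea: keep the input and output layers (plus a small projection) per image and pay for the intermediate layers only once. All layer sizes below are illustrative, not the paper's actual architecture:

```python
# Parameter-count sketch of layer sharing across images (all sizes are
# illustrative, not MINR's architecture): input/output layers and a
# hypothetical projection stay per-image; intermediate layers are stored once.
def mlp_params(dims):
    # fully-connected layers with biases: sum over (in + 1) * out
    return sum((a + 1) * b for a, b in zip(dims, dims[1:]))

dims = [2, 256, 256, 256, 3]   # toy coordinate MLP: (x, y) -> (r, g, b)
n_images = 100

separate = n_images * mlp_params(dims)                    # one full INR per image

shared_mid = mlp_params(dims[1:-1])                       # intermediate layers, once
per_image = mlp_params(dims[:2]) + mlp_params(dims[-2:])  # input + output layers
per_image += (256 + 1) * 256                              # hypothetical projection layer
minr_total = shared_mid + n_images * per_image

print(f"separate: {separate}, shared: {minr_total}")  # sharing is much smaller
```

The exact savings depend on the depth and width of the shared trunk; the paper reports up to 60% on its configurations.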


【19】Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
Link: https://arxiv.org/abs/2508.13460

Authors: iu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin
Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. This paper therefore reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed visual coding field. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; and (3) outline promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured comparison of MLLM token technology and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs.


【20】EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis
Link: https://arxiv.org/abs/2508.13442

Authors: , Bin Ji
Note: 17 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2404.01647
Abstract: Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space of facial features, ensuring that they (a) operate independently without mutual interference and (b) can be preserved to share with different modal inputs; both aspects are often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full-disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing the mouth, pose, eyes, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among the bases and devise an efficient training strategy that allocates motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments demonstrate the effectiveness of EDTalk++.
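The orthogonality constraint on each bank of learnable bases can be sketched as a Gram-matrix penalty. All sizes below are illustrative, and the paper's exact loss may differ:

```python
import numpy as np

# Sketch of an orthogonality constraint on a bank of learnable bases
# (illustrative sizes): penalize off-diagonal entries of the Gram matrix
# B @ B.T so that each basis encodes an independent motion component.
rng = np.random.default_rng(0)
B = rng.normal(size=(8, 64))                       # 8 bases of dimension 64
B /= np.linalg.norm(B, axis=1, keepdims=True)      # unit-norm rows

gram = B @ B.T
ortho_penalty = np.sum((gram - np.eye(8)) ** 2)    # zero iff bases are orthonormal

# For comparison, an orthonormalized bank drives the penalty to ~0.
Q, _ = np.linalg.qr(B.T)                           # columns of Q are orthonormal
Bq = Q.T
print(ortho_penalty, np.sum((Bq @ Bq.T - np.eye(8)) ** 2))
```

During training such a penalty would be added to the main loss so the optimizer pushes the bases toward mutual orthogonality.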


【21】Mitigating Easy Option Bias in Multiple-Choice Question Answering
Link: https://arxiv.org/abs/2508.13428

Authors: , Chen Li, Basura Fernando
Note: Under review
Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in several multiple-choice Visual Question Answering (VQA) benchmarks, including MMStar, RealWorldQA, SEED-Bench, NExT-QA, the STAR benchmark, and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual content than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs fall to near-random accuracy under the (V+O) setting and drop to non-saturated accuracy under the (V+Q+O) setting, providing a more realistic evaluation of VLMs' QA ability. Code and the new annotations will be released soon.
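The vision-option shortcut is easy to reproduce with made-up embeddings: when the correct option is embedded near the image and the negatives are visually unrelated, nearest-cosine "answering" succeeds without ever reading the question:

```python
import numpy as np

# Toy demo of the shortcut (all embeddings are made up): the correct option
# lies close to the image embedding, so similarity matching alone picks it.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
image = rng.normal(size=16)                          # image embedding (toy)
correct = image + 0.1 * rng.normal(size=16)          # aligned with visual content
negatives = [rng.normal(size=16) for _ in range(3)]  # easy, unrelated options

options = [correct] + negatives
picked = max(range(len(options)), key=lambda i: cosine(image, options[i]))
print(picked)  # 0: the correct option wins on vision-option similarity alone
```

GroundAttack removes this shortcut by making the negatives as close to the image in feature space as the correct answer.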


【22】A Surveillance Based Interactive Robot
Link: https://arxiv.org/abs/2508.13319

Authors: avimandan, Pooja Mangal, Devanshi Mehta
Note: 4 pages, 5 figures
Abstract: We build a mobile surveillance robot that streams video in real time and responds to speech, so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential-drive base with a camera, microphone, and speaker, and a central unit that serves the live feed and runs perception. Video is streamed with FFmpeg. Objects in the scene are detected using YOLOv3 to support navigation and event awareness. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech, so the robot can take spoken commands and read responses back in the requested language. A Kinect RGB-D sensor provides visual input and obstacle cues. In indoor tests the robot detects common objects at interactive frame rates on a CPU, recognises commands reliably, and translates them into actions without manual control. The design relies on off-the-shelf hardware and open software, making it easy to reproduce. We discuss limitations and practical extensions, including sensor fusion with ultrasonic range data, GPU acceleration, and adding face and text recognition.


【23】Image2Net: Datasets, Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists
Link: https://arxiv.org/abs/2508.13157

Authors: u, Chengjie Liu, Qihang Wang, Wenhao Huang, Yongjian Xu, Weiyu Chen, Anlan Peng, Zhijun Li, Bo Li, Lei Qi, Jun Yang, Yuan Du, Li Du
Note: 10 pages, 12 figures, 6 tables
Abstract: Large Language Models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their excellence in knowledge abstraction and generalization. However, further development of LLM-based analog IC design heavily relies on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated as image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks face challenges in further application because of limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists remains a challenging task. To this end, this paper constructs and open-sources a new dataset with rich styles of circuit diagrams and a balanced distribution of simple and complex analog ICs, and proposes a hybrid framework, named Image2Net, for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between converted netlists and the ground truth. On our benchmark, Image2Net achieves an 80.77% success rate, 34.62%-45.19% higher than previous works. Specifically, the proposed work shows an average NED of 0.116, which is 62.1%-69.6% lower than the state of the art.
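The abstract does not define NED exactly; a plausible sketch (an assumption, not the authors' formula) is a Levenshtein distance over netlist lines, normalized by the length of the longer netlist:

```python
# Hypothetical netlist edit distance: Levenshtein over netlist lines,
# normalized by the longer netlist's length (an assumed definition; the
# paper's exact NED may differ).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred_lines, gt_lines):
    return levenshtein(pred_lines, gt_lines) / max(len(pred_lines), len(gt_lines))

gt   = ["M1 out in vdd vdd pmos", "M2 out in gnd gnd nmos"]
pred = ["M1 out in vdd vdd pmos", "M2 out in gnd gnd pmos"]
print(ned(pred, gt))  # 0.5: one of the two netlist lines differs
```

Under such a definition, a NED of 0.116 would mean roughly one edit per nine netlist lines on average.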


【24】Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration
Link: https://arxiv.org/abs/2508.13762

Authors: is, Ines P. Machado, Benjamin Zwick, Nuno C. Garcia, Reuben Dorent
Note: Accepted at COLlaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) Workshop - MICCAI 2025
Abstract: Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. A residual 3D U-Net is then trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, halving the mean squared error while introducing negligible computational overhead at inference time. Code available at: https://github.com/tiago-assis/Deep-Biomechanical-Interpolator
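As a concrete example of the kind of simple geometric interpolator the residual U-Net refines, the sketch below densifies sparse keypoint displacements with inverse-distance weighting (a standard interpolator, not necessarily the authors' exact baseline):

```python
import numpy as np

# A standard geometric interpolator of the kind the paper's network refines
# (not necessarily the authors' baseline): inverse-distance weighting turns
# sparse keypoint displacements into a dense displacement estimate.
def idw_interpolate(keypoints, displacements, query, eps=1e-8, power=2):
    # pairwise distances: (n_query, n_keypoints)
    d = np.linalg.norm(query[:, None, :] - keypoints[None, :, :], axis=-1)
    w = 1.0 / (d ** power + eps)
    w /= w.sum(axis=1, keepdims=True)   # normalized weights per query point
    return w @ displacements            # weighted average of displacements

kp = np.array([[0.0, 0.0], [1.0, 0.0]])      # two matched keypoints
disp = np.array([[0.0, 1.0], [0.0, -1.0]])   # opposite displacements
q = np.array([[0.5, 0.0]])                   # midpoint query
print(idw_interpolate(kp, disp, q))          # ~[[0, 0]]: equal weights cancel
```

Such interpolators are purely geometric, which is exactly why the paper adds a biomechanically trained network on top of them.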


【25】Towards Understanding and Harnessing the Transferability of Prognostic Knowledge in Computational Pathology
Link: https://arxiv.org/abs/2508.13482

Authors: Luping Ji, Jiaxiang Gou, Xiangxiang Zeng
Note: 15 pages (13 figures and 5 tables)
Abstract: Whole-Slide Images (WSIs) are an important tool for evaluating the prognosis of cancer patients. Present WSI-based prognosis studies generally follow a conventional paradigm -- cancer-specific model development -- where one cancer disease corresponds to one model, and that model cannot make use of prognostic knowledge from others. Despite its notable success in recent years, this paradigm has inherent limitations and struggles with practical requirements: (i) scaling to rare tumor diseases with very limited samples and (ii) benefiting from generalizable prognostic knowledge in other cancers. To this end, this paper presents the first systematic study of Prognostic Knowledge Transfer in Pathology, called Path-PKT. It comprises three main parts. (1) We curate a large dataset (UNI2-h-DSS) covering 13 cancers and use it to computationally evaluate the transferability of prognostic knowledge between different cancers. (2) We design experiments to understand which factors affect knowledge transfer and what causes positive transfer. (3) Motivated by empirical findings, we propose a new baseline approach (MoE-PKT) with a routing mechanism to utilize generalizable prognostic knowledge from other cancers. Finally, we show the transferability of source models to rare tumor diseases. This study could lay a solid foundation for the study of knowledge transfer in WSI-based cancer prognosis. Source code is available at https://github.com/liupei101/Path-PKT.
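The routing mechanism in MoE-PKT is not detailed in the abstract; the sketch below shows a generic mixture-of-experts gate (all names and shapes are hypothetical): a softmax over per-cancer experts weights their risk predictions for the target slide:

```python
import numpy as np

# Generic mixture-of-experts routing sketch (names and shapes hypothetical;
# the abstract does not detail MoE-PKT's gate): a softmax gate weights
# per-cancer "expert" risk heads, and the prediction is their combination.
def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=32)                            # slide-level feature (toy)
experts = [rng.normal(size=32) for _ in range(4)]  # linear risk head per source cancer
gate = rng.normal(size=(4, 32))                    # routing parameters

w = softmax(gate @ x)                              # expert weights, sum to 1
risk = sum(wi * (e @ x) for wi, e in zip(w, experts))
print(w)  # gate weights over the four source-cancer experts
```

The gate lets a target cancer draw mostly on the source cancers whose prognostic knowledge transfers positively, rather than averaging all of them equally.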


Machine translation provided by Tencent TranSmart; for reference only


[Notice] Content sourced from the internet