
AI Academic Digest [10.16]

Sophie外贸笔记
2025-10-16
Overview: 138 papers in cs.AI today

Click through to the original post to visit arxivdaily.com, which covers CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!


cs.AI (Artificial Intelligence): 138 papers


【1】Generative Universal Verifier as Multimodal Meta-Reasoner
Link: https://arxiv.org/abs/2510.13804

Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
Abstract: We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, achieving notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
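The contrast between sequential verifier-guided scaling and parallel Best-of-N can be sketched as below. This is a toy illustration only: the `generate`/`verify`/`refine` functions are stand-ins for the paper's unified multimodal generator and OmniVerifier, and the numeric "outputs" merely make the control flow concrete.

```python
# Toy sketch: sequential test-time scaling (generate -> verify -> refine) versus
# the parallel Best-of-N baseline. All three callables are hypothetical stand-ins.

def best_of_n(generate, verify, prompt, n=4):
    """Parallel baseline: sample n candidates, keep the highest-scoring one."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=verify)

def sequential_tts(generate, verify, refine, prompt, steps=4, threshold=0.9):
    """Sequential scaling: iteratively refine the output until the verifier is satisfied."""
    output = generate(prompt, seed=0)
    for _ in range(steps):
        score = verify(output)
        if score >= threshold:
            break
        output = refine(output, feedback=score)
    return output

# Toy instantiation: "outputs" are numbers; the verifier prefers values near 1.0.
gen = lambda prompt, seed: 0.2 + 0.1 * seed
ver = lambda x: 1.0 - abs(1.0 - x)
ref = lambda x, feedback: x + 0.25  # each refinement moves the output closer to ideal

print(round(sequential_tts(gen, ver, ref, "draw a red cube"), 2))  # → 0.95
```

With the same sampling budget, the sequential loop can exceed the best single sample because each step builds on verifier feedback, which is the intuition behind raising the "upper bound" of generation.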


【2】Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Link: https://arxiv.org/abs/2510.13795

Authors: Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
Note: homepage: this https URL
Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.


【3】Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
Link: https://arxiv.org/abs/2510.13792

Authors: Ziqing Lu, Lifeng Lai, Weiyu Xu
Abstract: Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged in many security-related applications, such as autonomous driving, financial decisions, and drone/robot algorithms. In order to improve the robustness/defense of RL systems against adversaries, studying various adversarial attacks on RL systems is very important. Most previous work considered deterministic adversarial attack strategies in MDP, which the recipient (victim) agent can defeat by reversing the deterministic attacks. In this paper, we propose a provably "invincible" or "uncounterable" type of adversarial attack on RL. The attackers apply a rate-distortion information-theoretic approach to randomly change agents' observations of the transition kernel (or other properties) so that the agent gains zero or very limited information about the ground-truth kernel (or other properties) during the training. We derive an information-theoretic lower bound on the recipient agent's reward regret and show the impact of rate-distortion attacks on state-of-the-art model-based and model-free algorithms. We also extend this notion of an information-theoretic approach to other types of adversarial attacks, such as state observation attacks.


【4】The Art of Scaling Reinforcement Learning Compute for LLMs
Link: https://arxiv.org/abs/2510.13786

Authors: Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
Note: 28 pages, 20 figures
Abstract: Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
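The "fit a sigmoid on small runs, extrapolate to large compute" workflow can be sketched as follows. The functional form, parameter values, and synthetic "runs" below are all assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

# Assumed sigmoidal compute-performance model:
#   perf(C) = A / (1 + (C_mid / C)**B)
# A = asymptotic performance, C_mid = compute at half-asymptote, B = slope.

def sigmoid_perf(C, A, B, C_mid):
    return A / (1.0 + (C_mid / C) ** B)

# Synthetic small-scale runs (compute in GPU-hours), noiseless for clarity.
A, B_true, Cmid_true = 0.8, 1.2, 5000.0
C = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
perf = sigmoid_perf(C, A, B_true, Cmid_true)

# Treating the asymptote A as known, the model is linear in log C:
#   log(A / p - 1) = B * log(C_mid) - B * log(C)
slope, intercept = np.polyfit(np.log(C), np.log(A / perf - 1.0), 1)
B_fit = -slope
Cmid_fit = float(np.exp(intercept / B_fit))

# Extrapolate to a large run, e.g. 100,000 GPU-hours as in the paper's setup.
pred = sigmoid_perf(100_000.0, A, B_fit, Cmid_fit)
print(round(B_fit, 3), round(Cmid_fit, 1), round(pred, 3))
```

On noiseless synthetic data the log-linear fit recovers the parameters exactly; with real, noisy runs one would fit all three parameters jointly (e.g. with nonlinear least squares) and report uncertainty on the extrapolation.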


【5】InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Link: https://arxiv.org/abs/2510.13778

Authors: Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu
Note: Technical report
Abstract: We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.


【6】Scaling Vision Transformers for Functional MRI with Flat Maps
Link: https://arxiv.org/abs/2510.13768

Authors: Connor Lane, Daniel Z. Kaplan, Tanishq Mathew Abraham, Paul S. Scotti
Note: NeurIPS 2025 Workshop, Foundation Models for the Brain and Body; Code: this https URL Discord: this https URL
Abstract: A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at https://github.com/MedARC-AI/fmri-fm.
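A power scaling law like the one reported here is typically checked by a linear fit in log-log space. The sketch below uses entirely synthetic numbers (the assumed exponent and prefactor are illustrative, not the paper's measurements); only the fitting procedure is the point.

```python
import numpy as np

# Assumed power law between dataset size N (hours of fMRI) and masked-modeling
# loss: loss = a * N**b. In log space this is a straight line:
#   log(loss) = log(a) + b * log(N)

hours = np.array([100.0, 300.0, 900.0, 2300.0])  # synthetic dataset sizes
loss = 2.5 * hours ** -0.15                      # assumed a=2.5, b=-0.15

slope, log_a = np.polyfit(np.log(hours), np.log(loss), 1)
print(f"exponent b ~ {slope:.2f}, prefactor a ~ {np.exp(log_a):.2f}")
```

A straight line in the log-log plot (constant `slope`) is exactly what "strict power scaling law" means; curvature would indicate saturation or a regime change.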


【7】RECODE: Reasoning Through Code Generation for Visual Question Answering
Link: https://arxiv.org/abs/2510.13756

Authors: Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi
Abstract: Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering, the process of reverse-engineering visuals into executable code, as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.


【8】Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
Link: https://arxiv.org/abs/2510.13744

Authors: Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty
Note: 21 pages, 8 figures, 5 tables
Abstract: Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.


【9】Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
Link: https://arxiv.org/abs/2510.13740

Authors: Mustafa Munir, Alex Zhang, Radu Marculescu
Note: Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)
Abstract: Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
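One plausible reading of "logarithmic scalable graph construction" is that each patch links to neighbors at exponentially growing offsets, so a row of N patches needs only O(log N) long-range links instead of a fixed stride's O(N). The 1-D sketch below illustrates that idea; LogViG's actual 2-D connectivity may differ.

```python
# Hypothetical 1-D version of logarithmically spaced links: patch i connects to
# patches at offsets 1, 2, 4, 8, ... in both directions, clipped to the image.

def lsgc_row_links(i, n):
    """Neighbor indices of patch i in a row of n patches, at power-of-two offsets."""
    links, step = set(), 1
    while step < n:
        for j in (i - step, i + step):
            if 0 <= j < n:
                links.add(j)
        step *= 2
    return sorted(links)

print(lsgc_row_links(0, 16))  # → [1, 2, 4, 8]
```

Each node keeps at most 2*log2(n) edges, which is how a logarithmic scheme can retain long-range information flow (mitigating over-squashing) while keeping the graph sparse.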


【10】From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
Link: https://arxiv.org/abs/2510.13727

Authors: Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy
Abstract: Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled dynamic alternative to today's flag-and-block guardrails.
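The flag-and-block versus monitor-and-correct distinction can be made concrete with a toy scalar example. Everything here is invented for illustration (a made-up risk function over a stand-in "steering" action); the paper's guardrails operate on learned latent representations, not hand-coded rules.

```python
# Toy contrast between refusal-style and correction-style guardrails.

def flag_and_block(action, risk):
    """Refusal guardrail: block any risky action, even if inaction is unsafe too."""
    return None if risk(action) > 0.5 else action

def monitor_and_correct(action, risk, candidates):
    """Predictive guardrail: steer a risky action to the nearest safe alternative."""
    if risk(action) <= 0.5:
        return action
    safe = [a for a in candidates if risk(a) <= 0.5]
    return min(safe, key=lambda a: abs(a - action)) if safe else None

risk = lambda a: abs(a)  # stand-in hazard: larger "steering" magnitude = riskier

print(flag_and_block(0.9, risk))                                   # → None (refusal)
print(monitor_and_correct(0.9, risk, candidates=[-0.4, 0.0, 0.4]))  # → 0.4 (corrected)
```

The corrected action stays as close as possible to the agent's intent while satisfying the safety constraint, which is the "recovery" behavior a pure refusal policy cannot provide.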


【11】FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access
Link: https://arxiv.org/abs/2510.13724

Authors: Aditya Tanikanti, Benoit Côté, Yanfei Guo, Le Chen, Nickolaus Saint, Ryan Chard, Ken Raffenetti, Rajeev Thakur, Thomas Uram, Ian Foster, Michael E. Papka, Venkatram Vishwanath
Abstract: We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
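An "OpenAI-compliant API" means clients send requests in the standard chat-completions shape. The sketch below only builds such a request payload; the endpoint URL and model name are placeholders (FIRST's real deployment details are not given in the abstract), and no network call is made.

```python
import json

# Hypothetical FIRST-style gateway; the path follows the OpenAI chat-completions
# convention, but this URL is a placeholder, not a real deployment.
ENDPOINT = "https://example-hpc-gateway.org/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3-70B-Instruct",  # any model hosted on the cluster
    "messages": [
        {"role": "user", "content": "Summarize today's simulation results."}
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
# In practice this body would be POSTed to ENDPOINT with a Globus Auth bearer
# token in the Authorization header; the response follows the same OpenAI schema.
print(json.loads(body)["messages"][0]["role"])  # → user
```

Because the request shape is the standard one, existing OpenAI-compatible client libraries can be pointed at such a gateway simply by overriding the base URL and credentials.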


【12】NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Link: https://arxiv.org/abs/2510.13721

Authors: Run Luo, Xiaobo Xia, Lu Wang, Longze Chen, Renke Shan, Jing Luo, Min Yang, Tat-Seng Chua
Abstract: Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.


【13】Training LLM Agents to Empower Humans
Link: https://arxiv.org/abs/2510.13709

Authors: Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach
Abstract: Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.


【14】Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
Link: https://arxiv.org/abs/2510.13704

Authors: Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, Pablo Samuel Castro
Abstract: Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
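A simplicial embedding layer, in the sense this line of work uses it, splits the representation into groups and applies a softmax within each group, so every group lies on a probability simplex. The numpy sketch below shows only that layer; how it plugs into FastTD3/FastSAC/PPO is not shown and the group sizes are arbitrary.

```python
import numpy as np

def simplicial_embedding(z, num_groups):
    """Split the last dim of z into num_groups groups and softmax within each,
    constraining each group to a probability simplex (sparse, near-discrete features)."""
    *lead, d = z.shape
    assert d % num_groups == 0, "embedding dim must be divisible by num_groups"
    g = z.reshape(*lead, num_groups, d // num_groups)
    g = g - g.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(g)
    p = e / e.sum(axis=-1, keepdims=True)  # per-group softmax
    return p.reshape(*lead, d)

z = np.random.randn(2, 8)                  # batch of 2 raw embeddings, dim 8
out = simplicial_embedding(z, num_groups=4)
# Each of the 4 groups sums to 1, so every output row sums to num_groups.
print(np.allclose(out.sum(axis=-1), 4.0))  # → True
```

The per-group softmax is what produces the sparse, nearly one-hot features the abstract credits with stabilizing critic bootstrapping.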


【15】MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
Link: https://arxiv.org/abs/2510.13702

Authors: Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh
Note: Project page: this https URL
Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistency-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.


【16】A Modal Logic for Temporal and Jurisdictional Classifier Models
标题:时间和管辖分类器模型的模式逻辑
链接:https://arxiv.org/abs/2510.13691

作者:Cecilia Di Florio, Huimin Dong, Antonino Rotolo
备注:18 pages, 2 figures. Extended version of a short paper accepted at PRIMA 2025. This is the authors' version of the work. It is posted here for your personal use
摘要:基于逻辑的模型可用于为法律领域中使用的机器学习分类器构建验证工具。ML分类器根据先前的案例预测新案例的结果,从而执行一种基于案例的推理(CBR)。在本文中,我们引入了一种分类器的模态逻辑,旨在形式化地刻画法律CBR。通过在逻辑中引入案件的时间维度以及法律体系内法院的等级,我们纳入了解决先例之间冲突的原则。
摘要:Logic-based models can be used to build verification tools for machine learning classifiers employed in the legal field. ML classifiers predict the outcomes of new cases based on previous ones, thereby performing a form of case-based reasoning (CBR). In this paper, we introduce a modal logic of classifiers designed to formally capture legal CBR. We incorporate principles for resolving conflicts between precedents, by introducing into the logic the temporal dimension of cases and the hierarchy of courts within the legal system.


【17】CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas
标题:CanvasMAR:使用画布改进掩码自回归视频生成
链接:https://arxiv.org/abs/2510.13669

作者:Zian Li, Muhan Zhang
摘要:掩码自回归模型(MAR)最近已成为图像和视频生成的强大范式,将掩码建模的灵活性与连续标记器的潜力结合起来。然而,视频MAR模型存在两大局限:一是慢启动问题,由早期采样阶段缺乏结构化的全局先验所致;二是自回归在空间和时间维度上的误差积累。在这项工作中,我们提出了CanvasMAR,一种新的视频MAR模型,通过引入画布机制来缓解这些问题:画布是对下一帧的模糊全局预测,用作掩码生成的起点。画布在采样早期提供全局结构,从而实现更快、更连贯的帧合成。此外,我们引入了组合式无分类器引导,联合增强空间(画布)与时间条件,并采用基于噪声的画布增强来提高鲁棒性。在BAIR和Kinetics-600基准上的实验表明,CanvasMAR能以更少的自回归步骤生成高质量视频。我们的方法在Kinetics-600数据集上取得了自回归模型中的出色性能,并可与基于扩散的方法相媲美。
摘要:Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.
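画布机制的核心思想可以用一个极简的示意来说明:先用模糊的全局预测初始化整帧,再分步填充被掩码的位置。以下Python草图仅为基于摘要描述的假设性演示,其中`predict_masked`接口、一维模糊核与随机揭开顺序均为假设,并非论文实现:

```python
import numpy as np

def gaussian_blur1d(x, sigma=2.0):
    # 一维高斯模糊(toy 实现), 用于得到"画布"式的模糊全局先验
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2 * sigma ** 2))
    k /= k.sum()
    return np.convolve(x, k, mode="same")

def canvas_mar_sample(prev_frame, predict_masked, steps=4, seed=0):
    # 画布 = 对上一帧的模糊全局预测, 作为掩码生成的起点
    canvas = gaussian_blur1d(prev_frame)
    frame = canvas.copy()
    order = np.random.default_rng(seed).permutation(frame.size)
    for chunk in np.array_split(order, steps):
        # 每一步只揭开一部分被掩码位置, 以画布和已生成内容为条件
        frame[chunk] = predict_masked(canvas, frame, chunk)
    return frame
```

真实模型中`predict_masked`由条件扩散/自回归网络实现,这里任何返回同形状数值的函数都可代入以观察采样流程。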


【18】Axial Neural Networks for Dimension-Free Foundation Models
标题:面向维度无关基础模型的轴向神经网络
链接:https://arxiv.org/abs/2510.13665

作者:Hyunsu Kim, Jonggeon Park, Joan Bruna, Hongseok Yang, Juho Lee
摘要:AI中基础模型的出现大大推进了通用学习,实现了zero-shot推理和上下文学习的卓越能力。然而,在物理数据上训练这样的模型,包括偏微分方程(PDE)的解,由于不同系统的维数不同,这构成了一个独特的挑战。传统的方法要么固定最大维度,要么对不同的维度采用单独的编码器,导致效率低下。为了解决这个问题,我们提出了一种与维度无关的神经网络架构,即轴向神经网络(XNN),其灵感来自于参数共享结构,如深度集和图神经网络。XNN在保持计算效率的同时,在不同的张量维度上进行推广。我们将现有的PDE基础模型转换为轴向神经网络,并在三种训练场景中评估其性能:从头开始训练,在多个PDE上进行预训练,以及在单个PDE上进行微调。我们的实验表明,XNN与原始模型的表现具有竞争力,并且对看不见的维度表现出卓越的泛化能力,突出了多维预训练对基础模型的重要性。
摘要:The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.
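"同一组参数可处理任意维数张量"的参数共享思想,可以用一个Deep Sets风格的玩具层来示意。以下草图为假设性简化(标量参数、按轴取均值的聚合方式均为假设,并非论文的XNN结构),但它体现了关键性质:参数数量与输入维数无关,且对轴的置换保持等变:

```python
import numpy as np

def axial_layer(x, w_self=1.0, w_axis=0.5, b=0.0):
    # 维度无关的轴向层(假设性简化):
    # 输出 = w_self*x + w_axis * (沿每个轴取均值后广播求和)/ndim + b
    # 同一组参数可直接作用于 1D/2D/3D 等不同维度的 PDE 数据
    out = w_self * x + b
    for ax in range(x.ndim):
        out = out + w_axis * np.mean(x, axis=ax, keepdims=True) / x.ndim
    return out
```

注意对2D输入,该层与转置可交换(轴置换等变),这正是轴向参数共享带来的结构性质。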


【19】Time Series Foundation Models: Benchmarking Challenges and Requirements
标题:时间序列基础模型:基准挑战和要求
链接:https://arxiv.org/abs/2510.13654

作者:Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller
摘要:时间序列基础模型(TSFM)代表了时间序列预测的新范式,无需特定于领域的预训练或微调即可提供zero-shot预测能力。然而,与大型语言模型(LLM)一样,评估TSFM并不容易:随着训练集越来越庞大,确保基准测试数据的完整性变得越来越具有挑战性。我们对现有TSFM评估的调查揭示了多重挑战:基准数据集的代表性不足、时空评估的缺失、由重叠且来源不明的数据集导致的信息泄漏风险,以及对经济危机或流行病等外部冲击所造成的全局模式的记忆。我们的研究结果揭示了关于数据划分的普遍混乱,存在夸大性能估计以及将全局知识错误迁移到局部时间序列的风险。我们主张开发稳健的评估方法,以避免在LLM和经典时间序列基准测试中已经出现的陷阱,并呼吁研究界设计新的、有原则的方法,例如在真正样本外的未来数据上进行评估,以维护TSFM评估的完整性。
摘要:Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky, as with ever more extensive training sets, it becomes more and more challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets, over the lack of spatiotemporal evaluation, to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.
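文中呼吁的"在真正样本外的未来数据上评估"可以用一个简单的时间划分守卫来示意。以下为纯标准库草图(数据结构与断言方式为假设),核心是强制测试区间严格位于训练截止点之后,以降低与预训练语料重叠导致的泄漏:

```python
def temporal_split(series, cutoff):
    # 示意: 真正样本外的时间划分 -- 测试集严格位于训练截止点之后,
    # series 为 (时间戳, 值) 列表; 若两侧非空则校验无时间重叠
    train = [(t, v) for t, v in series if t <= cutoff]
    test = [(t, v) for t, v in series if t > cutoff]
    if train and test:
        assert max(t for t, _ in train) < min(t for t, _ in test)
    return train, test
```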


【20】Closing the Gap Between Text and Speech Understanding in LLMs
标题:缩小LLM中文本与语音理解之间的差距
链接:https://arxiv.org/abs/2510.13632

作者:Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
摘要:大型语言模型(LLM)可以经过适配,将其文本能力扩展到语音输入。然而,这些语音适配的LLM在语言理解任务上始终逊色于其文本版本,甚至不如级联管道。我们将这种不足称为文本-语音理解差距:语音适配LLM处理语音输入时,相对于原始文本LLM处理等效文本时所观察到的性能下降。最近缩小这一差距的方法要么依赖对文本语料库进行大规模语音合成(成本高昂且严重依赖合成数据),要么依赖大规模专有语音数据集(不可复现)。因此,仍然需要数据效率更高的替代方案来缩小文本-语音理解差距。在这项工作中,我们将差距归因于两个因素:(i)适配过程中对文本能力的遗忘,以及(ii)语音与文本之间的跨模态错位。基于这一分析,我们引入了SALAD(通过主动选择和跨模态蒸馏进行学习的样本高效对齐),它将跨模态蒸馏与有针对性的合成数据相结合,在改善对齐的同时减轻遗忘。应用于3B和7B的LLM,SALAD在知识、语言理解和推理等广泛领域的基准测试中取得了与强大开放权重模型相竞争的性能,而其在公共语料库上使用的语音训练数据少了一个数量级以上。
摘要:Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
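跨模态蒸馏的基本形式可以用一个玩具损失来示意:让语音学生在语音输入上的输出分布逼近文本教师在等效文本上的分布。以下为假设性的numpy草图(温度T等超参数为常见的蒸馏设定,并非论文数值):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_distill_loss(student_logits, teacher_logits, T=2.0):
    # 以文本教师在等效文本输入上的分布为目标,
    # 最小化语音学生分布与其之间的 KL 散度(带温度缩放)
    p = softmax(teacher_logits / T)
    log_q = np.log(softmax(student_logits / T) + 1e-12)
    kl = np.sum(p * (np.log(p + 1e-12) - log_q), axis=-1)
    return float(np.mean(kl) * T * T)
```

当学生与教师分布一致时损失为零,分布偏离越大损失越大,这正是"跨模态对齐"项所要最小化的量。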


【21】Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses
标题:解锁公共目录:德国肿瘤诊断的ICD编码的指令调整LLM
链接:https://arxiv.org/abs/2510.13624

作者:Stefan Lenz, Lakisha Ortiz Rosario, Georg Vollmar, Arsenij Ustjanzew, Fatma Alickovic, Thomas Kindler, Torsten Panholzer
备注:19 pages, 4 figures
摘要:在德国,使用ICD-10-GM和ICD-O-3对肿瘤诊断进行准确编码对于结构化癌症记录至关重要。较小的开放权重LLM对保护隐私的自动化很有吸引力,但在德语环境中往往难以达到足够的编码准确率。本研究调查了基于公共数据集的指令微调是否能提高开放权重LLM对德语肿瘤诊断文本的编码准确率。评估使用来自本地肿瘤文档系统的已编码诊断作为测试数据。在系统的数据质量评估中,ICD-10编码性能的上限估计为:精确推导60-79%,部分推导(仅三位编码)81-94%。作为训练数据,基于ICD-10-GM、ICD-O-3和OPS目录创建了超过500,000个问答对。对来自Qwen、Llama和Mistral家族的八个开放权重模型(7B-70B参数)进行了微调。ICD-10-GM精确准确率从1.4-24%上升到41-58%,部分准确率从31-74%上升到73-83%。ICD-O-3解剖部位(topography)编码的准确率也有所提高,但仍明显更低:微调后精确准确率为22-40%,部分准确率为56-67%。所有模型的格式错误编码输出均降至0%。肿瘤诊断识别率达99%。准确率与模型规模正相关,但微调后小模型与大模型之间的差距缩小。Qwen3中的推理模式通常比微调产生更低的性能,且慢了100倍以上。我们的研究结果突显了利用公共目录构建指令数据集、从而改进医学文档任务中LLM的潜力。完整的训练数据集和微调模型的最佳检查点可从https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024获得。
摘要:Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from https://huggingface.co/datasets/stefan-m-lenz/ICDOPS-QA-2024.
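从公共编码目录构造指令问答对的思路可以粗略示意如下。问题模板、字段名与示例条目均为假设,并非论文使用的真实模板;真实流程还需覆盖ICD-O-3与OPS目录及更多问法变体:

```python
def build_icd_qa_pairs(catalog):
    # 从 编码 -> 诊断名称 的目录字典生成双向问答对(示意)
    pairs = []
    for code, label in catalog.items():
        pairs.append({"question": f"诊断文本: {label}\n请给出对应的 ICD-10-GM 编码。",
                      "answer": code})
        pairs.append({"question": f"ICD-10-GM 编码 {code} 对应哪种诊断?",
                      "answer": label})
    return pairs
```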


【22】The Role of Computing Resources in Publishing Foundation Model Research
标题:计算资源在发表基础模型研究中的作用
链接:https://arxiv.org/abs/2510.13621

作者:Yuexing Hao, Yue Huang, Haoran Zhang, Chenyang Zhao, Zhenwen Liang, Paul Pu Liang, Yue Zhao, Lichao Sun, Saleh Kalantari, Xiangliang Zhang, Marzyeh Ghassemi
摘要:人工智能(AI)的前沿研究需要大量资源,包括图形处理单元(GPU)、数据和人力。在本文中,我们评估了这些资源与基础模型(FM)科学进展之间的关系。我们回顾了2022年至2024年间发表的6517篇FM论文,并调查了229位第一作者,了解计算资源对科学产出的影响。我们发现,计算量的增加与国家资助分配和引用量相关,但未观察到其与研究环境(学术或工业)、领域或研究方法之间的强相关性。我们建议个人和机构着力创造共享且可负担的计算机会,以降低资源不足研究人员的准入门槛。这些举措有助于扩大FM研究的参与度,促进思想和贡献者的多样性,并维持人工智能的创新和进步。数据可在以下网址获得:https://mit-calc.csail.mit.edu/
摘要:Cutting-edge research in Artificial Intelligence (AI) requires considerable resources, including Graphics Processing Units (GPUs), data, and human resources. In this paper, we evaluate the relationship between these resources and the scientific advancement of foundation models (FM). We reviewed 6517 FM papers published between 2022 and 2024, and surveyed 229 first authors about the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and citations, but we do not observe strong correlations with research environment (academic or industrial), domain, or study methodology. We advise that individuals and institutions focus on creating shared and affordable computing opportunities to lower the entry barrier for under-resourced researchers. These steps can help expand participation in FM research, foster diversity of ideas and contributors, and sustain innovation and progress in AI. The data will be available at: https://mit-calc.csail.mit.edu/


【23】Message Passing on the Edge: Towards Scalable and Expressive GNNs
标题:边缘信息传递:迈向可扩展和表达的GNN
链接:https://arxiv.org/abs/2510.13615

作者:Pablo Barceló, Fabian Jogl, Alexander Kozachinskiy, Matthias Lanzinger, Stefan Neumann, Cristóbal Rojas
摘要:我们提出了EB-1WL,一种基于边的颜色细化测试,以及相应的GNN架构EB-GNN。我们的架构受Chiba和Nishizeki的经典三角形计数算法启发,在消息传递过程中显式使用三角形。我们取得了以下结果:(1)EB-1WL的表达能力显著强于1-WL。此外,我们给出了基于一阶逻辑的EB-1WL完整逻辑刻画,以及基于同态计数的相匹配的可区分性结果。(2)与以往更具表达能力的GNN架构提案的一个重要区别是,EB-1WL和EB-GNN在实际的图学习任务中只需要接近线性的时间和内存。(3)在实验上,我们证明EB-GNN是一种高效的通用架构:它大幅优于简单的MPNN,并与任务专用的GNN保持竞争力,同时计算效率显著更高。
摘要:We propose EB-1WL, an edge-based color-refinement test, and a corresponding GNN architecture, EB-GNN. Our architecture is inspired by a classic triangle counting algorithm by Chiba and Nishizeki, and explicitly uses triangles during message passing. We achieve the following results: (1)~EB-1WL is significantly more expressive than 1-WL. Further, we provide a complete logical characterization of EB-1WL based on first-order logic, and matching distinguishability results based on homomorphism counting. (2)~In an important distinction from previous proposals for more expressive GNN architectures, EB-1WL and EB-GNN require near-linear time and memory on practical graph learning tasks. (3)~Empirically, we show that EB-GNN is a highly-efficient general-purpose architecture: It substantially outperforms simple MPNNs, and remains competitive with task-specialized GNNs while being significantly more computationally efficient.
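摘要中"在消息传递中显式使用三角形"的思路可以用一个无参数的玩具实现示意。三角形枚举采用简化的逐边邻居求交(Chiba-Nishizeki算法的朴素版本),边消息与三角形消息的聚合方式均为假设,真实的EB-GNN会在此处使用可学习的变换:

```python
from collections import defaultdict

def triangles(edges):
    # 对每条边 (u, v) 求两端邻居集合的交集, 得到所有三角形
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    tris = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:
            tris.add(tuple(sorted((u, v, w))))
    return sorted(tris)

def eb_message_pass(h, edges):
    # 示意: 普通的边消息(求和)之外, 额外加入显式的三角形消息(邻居特征乘积)
    out = list(h)
    for u, v in edges:
        out[u] += h[v]
        out[v] += h[u]
    for a, b, c in triangles(edges):
        out[a] += h[b] * h[c]
        out[b] += h[a] * h[c]
        out[c] += h[a] * h[b]
    return out
```

三角形消息让同属一个三角形的节点获得普通1-WL消息无法区分的信号,这是该测试强于1-WL的直观来源。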


【24】NOSA: Native and Offloadable Sparse Attention
标题:NOSA:原生且可卸载的稀疏注意力
链接:https://arxiv.org/abs/2510.13602

作者:Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
备注:Preprint
摘要:可训练的稀疏注意力已成为解决LLM长上下文处理中解码效率瓶颈的一个有前途的方案,在显著节省内存访问的同时将对任务性能的影响降至最低。然而,现有的稀疏注意力方法留下了一个关键限制:键值(KV)缓存的大小并未减小,这制约了GPU上的批量大小并抑制了解码吞吐量,在大规模批量推理中尤为明显。在本文中,我们展示了可训练稀疏注意力在相邻解码步的token选择中天然表现出很强的局部性,从而可以在不改变底层注意力计算的情况下实现KV缓存卸载。然而,这种固有局部性仍不足以实现高效卸载,因为所选KV对在CPU和GPU之间的传输仍然主导整体解码成本。基于这一洞察,我们提出了NOSA,一个原生支持KV缓存卸载的可训练稀疏注意力框架。NOSA通过将token选择分解为查询感知(query-aware)和查询无关(query-agnostic)两个部分来引入显式的局部性约束,从而在保持与训练时相同的注意力计算的同时减少KV传输。我们用NOSA预训练了一个1B参数模型并进行了广泛的基准测试,结果表明,与原始的可训练稀疏注意力基线(InfLLM-V2)相比,它保持了近乎无损的性能,同时解码吞吐量最高提升2.3倍。
摘要:Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
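将token选择拆分为查询无关与查询感知两部分的思路可以粗略示意如下。查询无关部分按全局重要性选取,跨相邻解码步稳定,因而可常驻GPU;查询感知部分随当前查询变化,需按需从CPU取回。打分方式与拆分规则均为假设性简化,并非论文算法:

```python
import numpy as np

def nosa_select(query_scores, global_importance, k_aware, k_agnostic):
    # query-agnostic: 按全局重要性取 top-k(跨步稳定, 可常驻 GPU)
    agnostic = set(np.argsort(-global_importance)[:k_agnostic].tolist())
    # query-aware: 在剩余 token 中按当前查询得分取 top-k(需从 CPU 传输)
    aware = []
    for i in np.argsort(-query_scores):
        if int(i) not in agnostic:
            aware.append(int(i))
        if len(aware) == k_aware:
            break
    return agnostic, set(aware)
```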


【25】Subject Roles in the EU AI Act: Mapping and Regulatory Implications
标题:欧盟人工智能法案中的主体角色:梳理与监管影响
链接:https://arxiv.org/abs/2510.13591

作者:Nicola Fabiano
摘要:欧盟的《人工智能法案》(法规(EU)2024/1689)通过第3条定义的相互关联主体的复杂生态系统,为人工智能系统建立了世界上第一个全面的监管框架。本文对六类主要行为者进行了结构化审查:供应商、部署者、授权代表、进口商、分销商和产品制造商,在法规中统称为"运营商"。通过审查这些第3条定义及其在该法规113条条文、180条鉴于条款和13个附件中的展开,我们绘制了完整的治理结构,并分析了《人工智能法案》如何规制这些主体。我们的分析揭示了关键的角色转换机制:主体在特定条件下可以承担不同角色,特别是通过确保"问责跟随控制"的第25条规定。我们指出义务如何通过强制性的信息流和合作要求沿供应链层层传递,形成一个分布式但协调的治理体系。研究结果表明,该法规如何通过与人工智能系统能力和部署环境相匹配的基于风险的义务,在创新与基本权利保护之间取得平衡,为实施《人工智能法案》要求的利益相关者提供了重要指导。
摘要:The European Union's Artificial Intelligence Act (Regulation (EU) 2024/1689) establishes the world's first comprehensive regulatory framework for AI systems through a sophisticated ecosystem of interconnected subjects defined in Article 3. This paper provides a structured examination of the six main categories of actors - providers, deployers, authorized representatives, importers, distributors, and product manufacturers - collectively referred to as "operators" within the regulation. Through examination of these Article 3 definitions and their elaboration across the regulation's 113 articles, 180 recitals, and 13 annexes, we map the complete governance structure and analyze how the AI Act regulates these subjects. Our analysis reveals critical transformation mechanisms whereby subjects can assume different roles under specific conditions, particularly through Article 25 provisions ensuring accountability follows control. We identify how obligations cascade through the supply chain via mandatory information flows and cooperation requirements, creating a distributed yet coordinated governance system. The findings demonstrate how the regulation balances innovation with the protection of fundamental rights through risk-based obligations that scale with the capabilities and deployment contexts of AI systems, providing essential guidance for stakeholders implementing the AI Act's requirements.


【26】Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
标题:游戏对话的Deflanderization:在基于LLM的NPC中平衡角色真实性与任务执行
链接:https://arxiv.org/abs/2510.13586

作者:Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
摘要:大型语言模型(LLM)的出现为在游戏环境中创建动态非玩家角色(NPC)提供了新的机会,使功能性任务执行和角色一致的对话生成都成为可能。在本文中,我们(Tu_Character_lab)报告了参加常识角色对话挑战赛(Commonsense Persona-Grounded Dialogue Challenge, CPDC)2025第2轮的情况,该挑战赛在三个赛道上评估智能体:面向任务的对话、上下文感知对话及其整合。我们的方法结合了两种互补策略:(i)API赛道中的轻量级提示技术,包括一种Deflanderization提示方法,用于抑制过度的角色扮演并提高任务保真度;(ii)GPU赛道中的微调大模型,基于Qwen3-14B进行监督微调(SFT)和低秩适配(LoRA)。我们的最佳提交在任务1中排名第2,在任务3(API赛道)中排名第2,在任务3(GPU赛道)中排名第4。
摘要:The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised finetuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).


【27】OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies
标题:OpenDerisk:人工智能驱动的SRE工业框架,包括设计、实施和案例研究
链接:https://arxiv.org/abs/2510.13561

作者:Peng Di, Faqiang Chen, Xiao Bai, Hongjun Yang, Qingfeng Li, Ganglin Wei, Jian Mou, Feng Shi, Keting Chen, Peng Tang, Zhitao Shen, Zheng Li, Wenhui Shi, Junwei Guo, Hang Yu
备注:23 pages
摘要:现代软件不断升级的复杂性给站点可靠性工程(SRE)团队带来了不可持续的运维负担,亟需能够模拟专家诊断推理的AI驱动自动化。现有解决方案,从传统AI方法到通用多智能体系统,都存在不足:它们要么缺乏深度的因果推理,要么没有针对SRE特有的专业调查式工作流进行定制。为弥补这一差距,我们提出了OpenDerisk,一个专为SRE设计的专门的开源多智能体框架。OpenDerisk集成了诊断原生的协作模型、可插拔的推理引擎、知识引擎和标准化协议(MCP),使专家智能体能够协同解决复杂的多领域问题。我们的综合评估表明,OpenDerisk在准确性和效率上均显著优于最先进的基线。这一有效性已通过其在蚂蚁集团的大规模生产部署得到验证:它在多种场景下为每日3,000多名用户提供服务,证实了其工业级可扩展性和实际影响。OpenDerisk已开源,可在https://github.com/derisk-ai/OpenDerisk/获取
摘要:The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/


【28】Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents
标题:使用自适应代理建模面部表情识别中的文化偏见
链接:https://arxiv.org/abs/2510.13557

作者:David Freire-Obregón, José Salas-Cáceres, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana
备注:Accepted for presentation at the International Symposium on Agentic Artificial Intelligence Systems (AAIS 2025)
摘要:面部表情识别(FER)必须在文化差异和感知退化的视觉条件下保持鲁棒性,但大多数现有评估假设数据同质且图像质量高。我们引入了一个基于智能体的流式基准,揭示跨文化构成与渐进模糊如何相互作用,共同塑造表情识别的鲁棒性。每个智能体都在冻结的CLIP特征空间中运行,其轻量级残差适配器在sigma=0时在线训练并在测试期间固定。智能体在5x5网格上移动和交互,而环境提供经过sigma调度高斯模糊的输入。我们考察了单一文化群体(仅西方、仅亚洲)以及平衡(5/5)和不平衡(8/2、2/8)构成的混合环境,还有不同的空间接触结构。结果显示文化群体之间存在明显不对称的退化曲线:JAFFE(亚洲)群体在低模糊下保持较高性能,但在中等模糊阶段下降更陡峭,而KDEF(西方)群体的退化更为均匀。混合群体表现出中间模式:平衡混合可减缓早期退化,但不平衡设置会在高度模糊下放大多数群体的弱点。这些发现量化了文化构成和交互结构如何影响感知条件恶化时FER的鲁棒性。
摘要:Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.


【29】Tandem Training for Language Models
标题:语言模型的串联训练
链接:https://arxiv.org/abs/2510.13551

作者:Robert West, Ashton Anderson, Ece Kamar, Eric Horvitz
摘要:随着语言模型持续快速改进,可以预期它们的行为和推理会变得让较弱的智能体和人类难以甚至无法跟上,从而破坏可解释性和监督。着眼于长远未来,我们探索鼓励模型产生较弱协作者仍然可以理解的解决方案的方法。我们将可理解性形式化为交接鲁棒性(handoff robustness):如果沿解路径随机将控制权交给较弱模型不会导致失败,则强模型的解对较弱模型而言是可理解的。基于这一标准,我们引入了语言模型的串联训练,一种强化学习(RL)范式,其中rollout的token间歇性地、随机地从一个冻结的弱模型而非正在训练的强模型中采样。因为只有当强模型的动作和推理过程可以由弱模型接续时,即当两者可以共同构建成功的解时,rollout才会成功,所以用串联训练优化标准RL目标会隐式地同时激励正确性和可理解性。在GSM8K数学推理任务中,串联训练可靠地教会模型放弃行话并使其语言适应较弱的伙伴,同时保持较高的任务准确率。我们的结果展示了一条有前途的路线,用于构建仍可被较弱智能体审计的AI系统,并对人机协作和多智能体通信具有启示意义。
摘要:As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model's solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model's actions and reasoning process can be continued by the weak model -- when the two can co-construct a successful solution -- optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human--AI collaboration and multi-agent communication.
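串联训练的rollout机制可以用几行代码示意:每一步以一定概率把控制权随机交给冻结的弱模型,其余步骤由强模型生成。以下为假设性草图(`p_handoff`、单步函数接口均为示意,真实设定中"一步"是采样一个token):

```python
import random

def tandem_rollout(strong_step, weak_step, state, horizon, p_handoff=0.3, seed=0):
    # 只有弱模型也能接续的轨迹才会成功,
    # 因此标准 RL 目标会隐式奖励"可被弱伙伴理解"的解法
    rng = random.Random(seed)
    states, who = [], []
    for _ in range(horizon):
        use_weak = rng.random() < p_handoff
        state = (weak_step if use_weak else strong_step)(state)
        states.append(state)
        who.append("weak" if use_weak else "strong")
    return states, who
```

在RL训练中,这条混合轨迹的终态奖励会同时反传给强模型的各步决策,从而惩罚弱模型接不上的"行话"。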


【30】In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
标题:浏览器内LLM引导的模糊测试:用于智能体AI浏览器的实时提示注入测试
链接:https://arxiv.org/abs/2510.13543

作者:Avihay Cohen
备注:37 pages , 10 figures
摘要:集成到Web浏览器中的基于大型语言模型(LLM)的智能体(通常称为智能体AI浏览器)提供了强大的Web任务自动化能力。然而,它们容易受到间接提示注入攻击:隐藏在网页中的恶意指令会诱使智能体执行非预期的操作。由于AI智能体以用户权限跨站点操作,这些攻击可以绕过传统的Web安全边界。在本文中,我们提出了一种新颖的模糊测试框架,它完全在浏览器中运行,并由LLM引导,实时自动发现此类提示注入漏洞。
摘要:Large Language Model (LLM) based agents integrated into web browsers (often called agentic AI browsers) offer powerful automation of web tasks. However, they are vulnerable to indirect prompt injection attacks, where malicious instructions hidden in a webpage deceive the agent into unwanted actions. These attacks can bypass traditional web security boundaries, as the AI agent operates with the user privileges across sites. In this paper, we present a novel fuzzing framework that runs entirely in the browser and is guided by an LLM to automatically discover such prompt injection vulnerabilities in real time.


【31】K-Merge: Online Continual Merging of Adapters for On-device Large Language Models
标题:K-Merge:设备端大型语言模型适配器的在线持续合并
链接:https://arxiv.org/abs/2510.13537

作者:Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, Umberto Michieli
备注:15 pages, 8 figures
摘要:大型语言模型(LLM)的设备端部署经常利用低秩适配器(LoRA)在严格的资源约束下支持多样的下游任务。为应对移动设备有限的存储容量,近期工作探索了将多个LoRA融合为一个的模型合并技术。然而在实践中,LoRA通常是随着用户请求支持新任务(例如新的问题类型或语言)而增量交付的。这一场景引入了一个新挑战:设备端在线持续合并,其目标是在保留先前已支持任务性能的同时纳入新的LoRA。在本文中,我们提出了一种无需数据且计算高效的策略,用于在新LoRA可用时进行选择与合并,假设设备只能存储有限数量的适配器。在真实任务上的大量实验表明,我们的方法优于其他策略,同时满足设备端设置的存储预算和计算限制。
摘要:On-device deployment of Large Language Models (LLMs) frequently leverages Low-Rank Adapters (LoRAs) to support diverse downstream tasks under tight resource constraints. To address the limited storage capacity of mobile devices, recent works have explored model merging techniques to fuse multiple LoRAs into a single one. In practice, however, LoRAs are often delivered incrementally, as users request support for new tasks (e.g., novel problem types or languages). This scenario introduces a new challenge: on-device online continual merging, where the objective is to incorporate new LoRAs while preserving the performance on previously supported tasks. In this paper, we propose a data-free and computationally efficient strategy for selecting and merging LoRAs when a new one becomes available, assuming the device can store only a limited number of adapters. Extensive experiments across real-world tasks demonstrate the superiority of our approach compared to alternative strategies while adhering to the storage budget and compute limitations of on-device settings.
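预算受限下"选择并合并"的思路可以用如下草图示意:新适配器到来时,若未超出存储预算则直接存入,否则与库中最相似的适配器合并。余弦相似度选择与简单平均合并均为假设性简化,并非论文的精确算法:

```python
import numpy as np

def kmerge_add(store, new_adapter, budget):
    # store: 已存适配器(同形状权重数组)列表; budget: 最多可存的适配器数
    if len(store) < budget:
        store.append(new_adapter)
        return store
    def cos(a, b):
        return float(np.dot(a.ravel(), b.ravel()) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    # 与新适配器余弦相似度最高的槽位被选中并做平均合并
    j = int(np.argmax([cos(a, new_adapter) for a in store]))
    store[j] = 0.5 * (store[j] + new_adapter)
    return store
```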


【32】A Methodology for Assessing the Risk of Metric Failure in LLMs Within the Financial Domain
标题:评估金融领域LLM指标失败风险的方法
链接:https://arxiv.org/abs/2510.13524

作者:William Flanagan, Mukunda Das, Rajitha Ramanyake, Swaunja Maslekar, Meghana Manipuri, Joong Ho Choi, Shruti Nair, Shambhavi Bhusan, Sanjana Dulam, Mouni Pendharkar, Nidhi Singh, Vashisth Doshi, Sachi Shah Paresh
备注:NeurIPS 2025 GenAI in Finance Workshop
摘要:随着生成式人工智能在整个金融服务行业的采用,采用和使用的一大障碍是模型性能的衡量。传统机器学习指标往往无法推广到GenAI工作负载,通常需要用主题专家(SME)评估加以补充。即便如此组合,许多项目仍未考虑到选择特定指标时存在的各种独特风险。此外,基础研究实验室和教育机构创建的许多广泛使用的基准测试无法推广到工业用途。本文阐述了这些挑战,并提供了一个风险评估框架,以便更好地应用SME评估和机器学习指标。
摘要:As Generative Artificial Intelligence is adopted across the financial services industry, a significant barrier to adoption and usage is measuring model performance. Historical machine learning metrics can oftentimes fail to generalize to GenAI workloads and are often supplemented using Subject Matter Expert (SME) evaluation. Even in this combination, many projects fail to account for various unique risks present in choosing specific metrics. Additionally, many widespread benchmarks created by foundational research labs and educational institutions fail to generalize to industrial use. This paper explains these challenges and provides a Risk Assessment Framework to allow for better application of SME and machine learning metrics.


【33】UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
标题:UniME-V2:以MLLM为评委的通用多模态嵌入学习
链接:https://arxiv.org/abs/2510.13515

作者:Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
备注:12 pages, 6 figures, 11 tables
摘要:通用多模态嵌入模型是各类任务的基础。现有方法通常通过度量查询-候选对的相似性来进行批内负样本挖掘。然而,这些方法往往难以捕捉候选之间细微的语义差异,且负样本缺乏多样性。此外,所得嵌入在区分假负样本与困难负样本方面的判别能力有限。在本文中,我们利用MLLM先进的理解能力来增强表示学习,提出了一种新的通用多模态嵌入模型(UniME-V2)。我们的方法首先通过全局检索构造一个潜在的困难负样本集合。然后,我们引入MLLM-as-a-Judge机制,利用MLLM评估查询-候选对的语义对齐程度并生成软语义匹配分数。这些分数作为困难负样本挖掘的基础,减轻假负样本的影响,并能够识别多样化、高质量的困难负样本。此外,语义匹配分数被用作软标签,以放松刚性的一对一映射约束。通过将相似度矩阵与软语义匹配分数矩阵对齐,模型学习到候选之间的语义差异,显著增强了其判别能力。为进一步提升性能,我们提出了UniME-V2-Reranker,这是一个在我们挖掘的困难负样本上通过成对与列表式联合优化训练的重排序模型。我们在MMEB基准和多个检索任务上进行了全面实验,表明我们的方法在所有任务上平均达到最先进的性能。
摘要:Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
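用MLLM评委的软匹配分数替代刚性一对一标签的对齐目标,可以用软标签交叉熵来示意:把评委分数归一化为目标分布,再对齐模型的相似度分布。以下numpy草图为假设性简化(温度tau、归一化方式均为假设,并非论文公式):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_alignment_loss(sim_matrix, judge_scores, tau=0.05):
    # 评委分数归一化为软标签分布, 再用交叉熵对齐相似度分布,
    # 从而放松一对一映射约束、让模型学到候选间的语义差异
    targets = judge_scores / judge_scores.sum(axis=-1, keepdims=True)
    log_probs = np.log(softmax(sim_matrix / tau) + 1e-12)
    return float(-np.mean(np.sum(targets * log_probs, axis=-1)))
```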


【34】Offline and Online KL-Regularized RLHF under Differential Privacy
标题:差分隐私下的离线与在线KL正则化RLHF
链接:https://arxiv.org/abs/2510.13512

作者:Yulian Wu, Rushil Thareja, Praneeth Vepakomma, Francesco Orabona
摘要:本文在人类偏好标签满足$\epsilon$本地差分隐私($\epsilon$-LDP)模型的条件下,研究了带KL正则化(大语言模型对齐中广泛使用的目标函数)的人类反馈强化学习(RLHF)的离线与在线设置。在离线设置中,我们设计了一种基于悲观原则的算法,并在单策略可集中性条件下,对KL正则化目标推导出新的次优性差距$\tilde{O}(1/[(e^\epsilon-1)^2 n])$,其中$n$为样本量;我们还给出匹配的下界以证明其最优性。在在线设置中,我们首次从理论上研究了带LDP的KL正则化RLHF问题。我们设计了一种基于乐观原则的算法,并推导出对数级后悔界$O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T)/(e^\epsilon-1)^2)$,其中$T$为总时间步数,$N_{\mathcal{F}}$为奖励函数空间$\mathcal{F}$的基数,$d_{\mathcal{F}}$为RLHF的一种eluder维数变体。作为分析的副产品,我们的结果也给出了无隐私约束下在线KL正则化RLHF的首个分析。我们在离线设置中实现了算法以验证理论结果,并在https://github.com/rushil-thareja/PPKL-RLHF-Official发布了开源代码。
摘要:In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\epsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability. We also prove its optimality by providing a matching lower bound where $n$ is the sample size.   In the online setting, we are the first one to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$, where $T$ is the total time step, $N_{\mathcal{F}}$ is cardinality of the reward function space $\mathcal{F}$ and $d_{\mathcal{F}}$ is a variant of eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
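对偏好标签施加$\epsilon$-LDP的一种标准机制是随机响应(randomized response):以概率$e^\epsilon/(e^\epsilon+1)$保留二值标签,否则翻转。下面的示意(论文的具体机制可能不同,这里仅演示该隐私模型及其去偏估计)展示了隐私化后如何无偏地恢复真实偏好率:

```python
import math
import random

def randomized_response(label: int, epsilon: float, rng: random.Random) -> int:
    """Keep a binary label w.p. e^eps/(e^eps+1), else flip -> eps-LDP."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return label if rng.random() < p_keep else 1 - label

def debias_mean(private_labels, epsilon):
    """Unbiased estimate of the true label mean from privatized labels."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    m = sum(private_labels) / len(private_labels)
    return (m - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
true = [1] * 8000 + [0] * 2000                       # true preference rate 0.8
priv = [randomized_response(y, 1.0, rng) for y in true]
est = debias_mean(priv, 1.0)                         # close to 0.8
```

去偏估计的方差随$(e^\epsilon-1)^{-2}$量级放大,这正是摘要中次优性差距与后悔界中$(e^\epsilon-1)^2$因子的直观来源。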


【35】Confidence as a Reward: Transforming LLMs into Reward Models
标题:信心作为奖励:将LLM转化为奖励模型
链接:https://arxiv.org/abs/2510.13501

作者:He Du, Bowen Li, Chengxing Xie, Chang Gao, Kai Chen, Dacheng Tao
摘要:奖励模型可以显著增强大型语言模型(LLM)的推理能力,但它们通常需要大量精心整理的数据和昂贵的训练。为缓解这些挑战,LLM-as-a-Judge等免训练方法利用LLM内在的推理能力来评估响应,取得了令人鼓舞的结果。最近的研究也表明,模型置信度可以有效地作为奖励指标,区分思维链(CoT)与非CoT路径。然而,将置信度用作奖励的思路尚未得到全面研究。在这项工作中,我们系统地研究了置信度即奖励(CRew),这是一种简单而强大的免训练方法,它以模型最终答案中的词元级置信度作为奖励的代理,特别适用于封闭式任务。通过在数学推理任务上的大量实验,我们证明CRew在MATH500和RewardMATH基准上优于现有的免训练奖励方法,甚至超过大多数经过训练的奖励模型。我们进一步发现CRew分数与模型实际推理性能之间存在很强的相关性。此外,CRew还能有效筛选高质量训练数据。基于这些见解,我们提出了CRew-DPO,一种由置信度分数与正确性信号共同构建偏好数据的训练策略。使用CRew-DPO进行微调进一步增强了模型的评判能力,并持续优于现有的自训练方法。
摘要:Reward models can significantly enhance the reasoning capabilities of large language models (LLMs), but they typically require extensive curated data and costly training. To mitigate these challenges, training-free approaches such as LLM-as-a-Judge leverage the intrinsic reasoning abilities of LLMs to evaluate responses, achieving promising results. Recent works have also indicated that model confidence can serve effectively as a reward metric, distinguishing between chain-of-thought (CoT) and non-CoT paths. However, the concept of using confidence as a reward has not been comprehensively studied. In this work, we systematically investigate Confidence-as-a-Reward (CRew), a simple yet powerful training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward, especially suitable for close-ended tasks. Through extensive experiments on mathematical reasoning tasks, we demonstrate that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks, and even surpasses most trained reward models. We further identify a strong correlation between CRew scores and the actual reasoning performance of the model. Additionally, we find that CRew can effectively filter high-quality training data. Building upon these insights, we propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals. Finetuning with CRew-DPO further enhances the model's judging capabilities and consistently outperforms existing self-training methods.
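"词元级置信度作为奖励代理"的一个极简示意如下(并非CRew的原始实现;此处假设以答案词元对数概率的几何平均作为置信度,具体聚合方式以论文为准):

```python
import math

def confidence_reward(answer_token_logprobs):
    """Sketch of a CRew-style proxy: geometric-mean probability of the
    final-answer tokens, i.e. exp(mean log p). Higher = more confident."""
    n = len(answer_token_logprobs)
    return math.exp(sum(answer_token_logprobs) / n)

# two candidate answers: the model is far more confident in the first
confident = [math.log(0.9), math.log(0.95), math.log(0.88)]
hesitant  = [math.log(0.4), math.log(0.5),  math.log(0.3)]
r1 = confidence_reward(confident)   # ~0.91
r2 = confidence_reward(hesitant)    # ~0.39
```

这样的分数可直接用于对多个候选答案重排序,或作为阈值筛选高质量训练样本。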


【36】MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
标题:MedREK:采用键感知提示(Key-Aware Prompts)的基于检索的医学LLM编辑
链接:https://arxiv.org/abs/2510.13500

作者:Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
备注:Preprint, work in progress
摘要:LLM在医疗应用中前景广阔,但医学知识的快速演进和训练数据中的错误往往导致其生成过时或不准确的信息,限制了其在高风险临床实践中的适用性。模型编辑已成为一种无需完整重训练的潜在补救手段。基于参数的编辑通常会损害局部性,因而不适合医学领域;基于检索的编辑则是更可行的选择。然而,它仍面临两个关键挑战:(1)医学知识空间内的表示重叠往往导致检索不准确并降低编辑精度;(2)现有方法仅限于单样本编辑,而批量编辑尽管对现实医疗应用十分重要,却在很大程度上仍未被探索。为应对这些挑战,我们首先构建了MedVersa,一个覆盖更广泛医学主题的增强基准,旨在于严格的局部性约束下评估单样本与批量编辑。随后我们提出了MedREK,一种基于检索的编辑框架,将用于精确匹配的共享查询-键模块与用于提供信息性指导的基于注意力的提示编码器相结合。在多个医学基准上的实验结果表明,MedREK在各核心指标上均取得优越性能,并为医学LLM的批量编辑提供了首个经过验证的解决方案。我们的代码和数据集见https://github.com/mylittleriver/MedREK。
摘要:LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.


【37】ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding
标题:ConsintBench:评估语言模型对现实世界消费者意图的理解
链接:https://arxiv.org/abs/2510.13499

作者:Xiaozhe Li, TianYi Lyu, Siyi Yang, Yuxi Gong, Yizhao Yang, Jinxuan Huang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu
摘要:对大型语言模型(LLM)而言,理解人类意图是一项复杂的高级任务,需要分析推理、上下文解释、动态信息聚合以及不确定性下的决策。现实世界的公共讨论(如消费产品讨论)很少是线性的,也很少只涉及单个用户;相反,它们的特点是交织且往往相互冲突的观点、不同的关注点、目标与情感倾向,以及关于使用场景的隐含假设和背景知识。为了准确理解这种公开表达的公共意图,LLM必须超越对单个句子的解析:它必须整合多源信号、对不一致进行推理并适应不断演变的话语,就像政治、经济或金融等领域的专家处理复杂、不确定环境那样。尽管这种能力十分重要,但目前尚无大规模基准来评估LLM对现实世界人类意图的理解,这主要源于收集现实世界公共讨论数据和构建稳健评估管道的困难。为弥合这一差距,我们提出了ConsintBench,首个专为意图理解(尤其是消费者领域)设计的动态实时评估基准。ConsintBench是同类基准中规模最大、最多样化的,支持实时更新,并通过自动化整理管道防止数据污染。
摘要:Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear or involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such explicit public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce ConsintBench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. ConsintBench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.


【38】DistilCLIP-EEG: Enhancing Epileptic Seizure Detection Through Multi-modal Learning and Knowledge Distillation
标题:DistilCLIP-EEG:通过多模式学习和知识蒸馏增强癫痫发作检测
链接:https://arxiv.org/abs/2510.13497

作者:Zexin Wang, Lin Shi, Haoyu Wu, Junru Luo, Xiangzeng Kong, Jun Qi
备注:16 pages, 9 figures, 5 tables
摘要:癫痫是一种常见的神经系统疾病,其特征是由异常放电引起的突然、短暂的过度神经元活动,并可能导致某些精神障碍。现有的癫痫检测深度学习方法大多仅依赖单模态EEG信号,忽视了多模态信息的潜在益处。为此,我们提出了一种基于CLIP框架的新型多模态模型DistilCLIP-EEG,它融合EEG信号与文本描述,以捕捉癫痫发作的综合特征。该模型包含基于Conformer架构的EEG编码器与文本编码器,并将所提出的可学习BERT(BERT-LP)作为编码器内的提示学习;二者在共享潜在空间中运行,以实现有效的跨模态表示学习。为提高效率和适应性,我们引入知识蒸馏方法:以训练好的DistilCLIP-EEG作为教师,指导更紧凑的学生模型,从而降低训练复杂度和时间。在TUSZ、AUBMC和CHB-MIT数据集上,教师和学生模型的准确率均超过97%;在所有数据集上,F1分数始终高于0.94,证明了所提框架的鲁棒性和可靠性。此外,学生模型的参数量和模型大小约为教师模型的58.1%,在保持高性能的同时显著降低了模型复杂度和存储需求。这些结果突出了所提模型在基于EEG的癫痫检测中的潜力,并为在资源受限环境中部署轻量级模型奠定了坚实基础。
摘要:Epilepsy is a prevalent neurological disorder marked by sudden, brief episodes of excessive neuronal activity caused by abnormal electrical discharges, which may lead to some mental disorders. Most existing deep learning methods for epilepsy detection rely solely on unimodal EEG signals, neglecting the potential benefits of multimodal information. To address this, we propose a novel multimodal model, DistilCLIP-EEG, based on the CLIP framework, which integrates both EEG signals and text descriptions to capture comprehensive features of epileptic seizures. The model involves an EEG encoder based on the Conformer architecture as a text encoder, the proposed Learnable BERT (BERT-LP) as prompt learning within the encoders. Both operate in a shared latent space for effective cross-modal representation learning. To enhance efficiency and adaptability, we introduce a knowledge distillation method where the trained DistilCLIP-EEG serves as a teacher to guide a more compact student model to reduce training complexity and time. On the TUSZ, AUBMC, and CHB-MIT datasets, both the teacher and student models achieved accuracy rates exceeding 97%. Across all datasets, the F1-scores were consistently above 0.94, demonstrating the robustness and reliability of the proposed framework. Moreover, the student model's parameter count and model size are approximately 58.1% of those of the teacher model, significantly reducing model complexity and storage requirements while maintaining high performance. These results highlight the potential of our proposed model for EEG-based epilepsy detection and establish a solid foundation for deploying lightweight models in resource-constrained settings.
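师生知识蒸馏的核心是软目标损失。以下是经典Hinton式蒸馏项的一个极简示意(论文的具体损失组合未知,这里仅演示通用形式;温度 `T` 与玩具logits均为假设):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled, numerically stable softmax
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target KD term: T^2 * KL(teacher || student)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return T * T * float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

teacher  = np.array([4.0, 1.0, -1.0])
aligned  = np.array([3.9, 1.1, -0.9])   # student close to the teacher
diverged = np.array([-1.0, 4.0, 1.0])   # student far from the teacher
```

实际训练中该项通常与对真实标签的交叉熵加权求和;温度越高,教师分布的"暗知识"(非最大类的相对概率)被放大得越明显。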


【39】LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
标题:LiteraryQA:迈向长文档叙述性QA的有效评估
链接:https://arxiv.org/abs/2510.13494

作者:Tommaso Bonomo, Luca Gioffré, Roberto Navigli
备注:Accepted to EMNLP 2025 Main Conference. 22 pages
摘要:叙事文本上的问答(QA)对当前系统提出了独特挑战,需要深入理解冗长而复杂的文档。然而,该领域使用最广泛的基准NarrativeQA的可靠性受到噪声文档和有缺陷的QA对的影响。在这项工作中,我们介绍了LiteraryQA,一个聚焦文学作品的NarrativeQA高质量子集。借助经人工与LLM验证的管道,我们识别并修正低质量QA样本,同时从源文档中删除无关文本。随后我们对自动指标进行元评估,以澄清系统在LiteraryQA上应如何被评估。该分析表明,所有基于n-gram的指标与人类判断的系统级相关性都很低,而LLM-as-a-Judge评估即使使用小型开放权重模型,也能与人类给出的排名高度一致。最后,我们在LiteraryQA上对一组长上下文LLM进行了基准测试。我们在https://github.com/SapienzaNLP/LiteraryQA发布代码和数据。
摘要:Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
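摘要中"系统级相关性"通常指对各系统的指标得分与人类得分做秩相关。下面用纯Python给出Spearman相关的一个极简示意(不处理并列秩;人类与指标得分均为虚构示例):

```python
def rankdata(xs):
    """Ascending ranks starting at 1 (no tie handling in this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human  = [0.9, 0.7, 0.5, 0.3]    # hypothetical human system scores
metric = [0.8, 0.6, 0.55, 0.2]   # a metric that preserves the same ranking
rho = spearman(human, metric)    # 1.0: identical system ranking
```

摘要的结论即:n-gram指标在这种系统级秩相关上表现很差,而LLM-as-a-Judge的排名与人类高度一致。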


【40】Mobile Coverage Analysis using Crowdsourced Data
标题:使用众包数据的移动覆盖分析
链接:https://arxiv.org/abs/2510.13459

作者:Timothy Wong, Tom Freeman, Joseph Feehily
备注:8 pages
摘要:有效评估移动网络覆盖并精确识别服务弱点,对于致力于提升用户体验质量(QoE)的网络运营商至关重要。本文提出了一种利用众包QoE数据进行移动覆盖与弱点分析的新框架。我们方法的核心是基于经验地理定位数据,在单个小区(天线)层面进行覆盖分析,随后聚合到站点层面。本研究的一个主要贡献是应用单类支持向量机(OC-SVM)算法计算移动网络覆盖:该方法将决策超平面建模为有效覆盖轮廓,便于稳健地计算单个小区和整个站点的覆盖区域。同样的方法被扩展到分析众包服务中断报告,从而识别并量化地理上局部化的弱点。我们的研究结果证明了该框架在准确绘制移动覆盖方面的有效性,尤其是在复杂城市环境中突出显示信号薄弱的细粒度区域。
摘要:Effective assessment of mobile network coverage and the precise identification of service weak spots are paramount for network operators striving to enhance user Quality of Experience (QoE). This paper presents a novel framework for mobile coverage and weak spot analysis utilising crowdsourced QoE data. The core of our methodology involves coverage analysis at the individual cell (antenna) level, subsequently aggregated to the site level, using empirical geolocation data. A key contribution of this research is the application of One-Class Support Vector Machine (OC-SVM) algorithm for calculating mobile network coverage. This approach models the decision hyperplane as the effective coverage contour, facilitating robust calculation of coverage areas for individual cells and entire sites. The same methodology is extended to analyse crowdsourced service loss reports, thereby identifying and quantifying geographically localised weak spots. Our findings demonstrate the efficacy of this novel framework in accurately mapping mobile coverage and, crucially, in highlighting granular areas of signal deficiency, particularly within complex urban environments.
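用OC-SVM从地理定位点学习覆盖轮廓的思路,可以用scikit-learn作如下示意(并非论文实现;`gamma`、`nu` 取值与模拟数据均为假设,实际应在真实经纬度数据上调参):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# simulated crowdsourced geolocations around one cell site
# (lon/lat offsets from the site, in km)
points = rng.normal(loc=0.0, scale=0.5, size=(500, 2))

# nu upper-bounds the fraction of points treated as outliers (our choice: 5%)
ocsvm = OneClassSVM(kernel="rbf", gamma=2.0, nu=0.05).fit(points)

# the learned decision boundary acts as the effective coverage contour:
# +1 = inside coverage, -1 = outside
inside  = ocsvm.predict(np.array([[0.0, 0.0]]))[0]   # near the site centre
outside = ocsvm.predict(np.array([[5.0, 5.0]]))[0]   # far from all samples
```

在此基础上,对一个经纬度网格批量调用 `predict` 并统计 +1 的格点数,即可近似估算小区的覆盖面积;将多个小区的轮廓并集化则得到站点级覆盖。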


【41】Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers
标题:神经平方和:用Transformer证明多项式的非负性
链接:https://arxiv.org/abs/2510.13444

作者:Nico Pelleriti, Christoph Spiegel, Shiwei Liu, David Martínez-Rubio, Max Zimmer, Sebastian Pokutta
摘要:证明多项式的非负性是一个著名的NP难问题,其直接应用涵盖非凸优化、控制、机器人等领域。非负性的一个充分条件是平方和(SOS)性质,即该多项式可写成其他多项式的平方和。然而在实践中,验证SOS准则的计算代价仍然很高,通常需要求解半定规划(SDP),其维数随SOS表达式单项式基的大小呈平方增长;因此,人们提出了多种缩减单项式基大小的方法。在这项工作中,我们引入了首个用于验证SOS准则的学习增强算法。为此,我们训练一个Transformer模型来预测给定多项式的近似最小单项式基,从而大幅缩减相应SDP的规模。我们的整体方法包含三个关键部分:高效生成超过1亿个SOS多项式的训练数据集、设计并训练相应的Transformer架构,以及保证正确终止的系统性回退机制(我们对其进行了理论分析)。我们在200多个基准数据集上验证了该方法,相较最先进的求解器实现了超过100倍的加速,并能求解竞争方法失败的实例。我们的研究结果为提升SOS规划的实际可扩展性提供了新见解。
摘要:Certifying nonnegativity of polynomials is a well-known NP-hard problem with direct applications spanning non-convex optimization, control, robotics, and beyond. A sufficient condition for nonnegativity is the Sum of Squares (SOS) property, i.e., it can be written as a sum of squares of other polynomials. In practice, however, certifying the SOS criterion remains computationally expensive and often involves solving a Semidefinite Program (SDP), whose dimensionality grows quadratically in the size of the monomial basis of the SOS expression; hence, various methods to reduce the size of the monomial basis have been proposed. In this work, we introduce the first learning-augmented algorithm to certify the SOS criterion. To this end, we train a Transformer model that predicts an almost-minimal monomial basis for a given polynomial, thereby drastically reducing the size of the corresponding SDP. Our overall methodology comprises three key components: efficient training dataset generation of over 100 million SOS polynomials, design and training of the corresponding Transformer architecture, and a systematic fallback mechanism to ensure correct termination, which we analyze theoretically. We validate our approach on over 200 benchmark datasets, achieving speedups of over $100\times$ compared to state-of-the-art solvers and enabling the solution of instances where competing approaches fail. Our findings provide novel insights towards transforming the practical scalability of SOS programming.
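SOS验证的核心等价性是:$p$ 是SOS当且仅当存在半正定Gram矩阵 $Q$ 使 $p = m^\top Q m$($m$ 为单项式基)。一般情形下寻找 $Q$ 正是那个随基大小平方增长的SDP;下面的示意跳过求解器,直接对一个手工选取的 $Q$ 验证这两个条件(以 $p(x)=x^4+2x^2+1=(1+x^2)^2$、基 $m=[1,x,x^2]$ 为例):

```python
import numpy as np

# Gram matrix for p(x) = x^4 + 2x^2 + 1 over basis m = [1, x, x^2],
# hand-picked here (in general, finding Q is the SDP)
Q = np.array([[1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0]])

# 1) m^T Q m reproduces the coefficients of p
coeffs = {
    0: Q[0, 0],               # constant term
    2: 2 * Q[0, 2] + Q[1, 1], # x^2 term
    4: Q[2, 2],               # x^4 term
}

# 2) Q is PSD  =>  p is a sum of squares (here p = (1 + x^2)^2)
eigs = np.linalg.eigvalsh(Q)
is_sos = bool(eigs.min() >= -1e-9)
```

论文的Transformer所预测的,正是应保留在基 $m$ 中的单项式集合:基越小,$Q$ 的维数越小,SDP越便宜。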


【42】Rectify and Align GPS Points to Parking Spots via Rank-1 Constraint
标题:通过秩一约束校正GPS点并将其对齐到停车位
链接:https://arxiv.org/abs/2510.13439

作者:Jiaxing Deng, Junbiao Pang, Zhicheng Wang, Haitao Yu
摘要:停车位是城市的重要组成部分,为居民提供关键的出行资源。停车位的精确全球定位系统(GPS)点是后续应用(如停车管理、停车政策和城市发展)的核心数据。然而,高层建筑往往导致GPS点偏离停车位的实际位置;此外,标准的低成本GPS设备本身也存在一定的定位误差。因此,以无监督方式从大量停车位中校正少数错误GPS点是一项非平凡的任务。在本文中,受停车位物理约束(即停车位与道路两侧平行)的启发,我们提出了一种无监督低秩方法,在统一框架中有效校正GPS点误差并进一步将其对齐到停车位。所提出的非常规校正与对齐方法简单而有效,适用于任何类型的GPS点误差。大量实验证明了该方法在解决实际问题上的优越性。数据集和代码公开于https://github.com/pangjunbiao/ITS-Parking-spots-Dataset。
摘要:Parking spots are essential components, providing vital mobile resources for residents in a city. Accurate Global Positioning System (GPS) points of parking spots are the core data for subsequent applications,e.g., parking management, parking policy, and urban development. However, high-rise buildings tend to cause GPS points to drift from the actual locations of parking spots; besides, the standard lower-cost GPS equipment itself has a certain location error. Therefore, it is a non-trivial task to correct a few wrong GPS points from a large number of parking spots in an unsupervised approach. In this paper, motivated by the physical constraints of parking spots (i.e., parking spots are parallel to the sides of roads), we propose an unsupervised low-rank method to effectively rectify errors in GPS points and further align them to the parking spots in a unified framework. The proposed unconventional rectification and alignment method is simple and yet effective for any type of GPS point errors. Extensive experiments demonstrate the superiority of the proposed method to solve a practical problem. The data set and the code are publicly accessible at:https://github.com/pangjunbiao/ITS-Parking-spots-Dataset.
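"停车位与道路平行"这一物理约束意味着:同一路段停车位的中心化坐标矩阵近似秩一。下面用SVD给出这一低秩直觉的极简示意(并非论文的秩一约束优化;仅演示把噪声点投影到最佳拟合直线上):

```python
import numpy as np

def rectify_to_line(points):
    """Project noisy GPS points onto their best-fit line (rank-1 SVD).

    If spots lie along a road, the centred point matrix is ~rank 1,
    so the dominant right singular vector gives the road direction.
    """
    c = points.mean(axis=0)
    centred = points - c
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    d = vt[0]                                  # road direction (unit vector)
    return c + np.outer(centred @ d, d)        # perpendicular noise removed

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
line = np.stack([t, 2.0 * t], axis=1)          # true spots along a road
noisy = line + rng.normal(scale=0.01, size=line.shape)
noisy[7] += np.array([0.5, -0.5])              # one badly drifted point
fixed = rectify_to_line(noisy)                 # exactly collinear again
```

简单SVD对强漂移点并不鲁棒(离群点会轻微带偏拟合方向);论文采用低秩约束优化,正是为了在存在此类错误点时仍能稳健地同时完成校正与对齐。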


【43】Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
标题:通过气候话语中的隐性因果链发现评估LLM推理
链接:https://arxiv.org/abs/2510.13417

作者:Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens
摘要:一个原因如何导致一个结果?哪些中间因果步骤解释了它们之间的联系?本工作通过隐式因果链发现任务,审视大型语言模型(LLM)回答这些问题的机制性因果推理能力。在诊断评估框架中,我们指示9个LLM生成连接给定因果对的所有可能中间因果步骤,构成因果链结构。这些因果对取自近期论辩研究资源,其特点是关于气候变化的两极化讨论。我们的分析表明,LLM在其生成的因果步骤的数量和粒度上存在差异。尽管它们通常具有自洽性,并对生成链中的中间因果联系充满信心,但其判断主要由联想式模式匹配驱动,而非真正的因果推理。尽管如此,人工评估证实了生成链的逻辑连贯性和完整性。我们的基线因果链发现方法、诊断评估得到的见解以及带因果链的基准数据集,为推进论辩场景下隐式机制性因果推理的未来工作奠定了坚实基础。
摘要:How does a cause lead to an effect, and which intermediate causal steps explain their connection? This work scrutinizes the mechanistic causal reasoning capabilities of large language models (LLMs) to answer these questions through the task of implicit causal chain discovery. In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures. These pairs are drawn from recent resources in argumentation studies featuring polarized discussion on climate change. Our analysis reveals that LLMs vary in the number and granularity of causal steps they produce. Although they are generally self-consistent and confident about the intermediate causal connections in the generated chains, their judgments are mainly driven by associative pattern matching rather than genuine causal reasoning. Nonetheless, human evaluations confirmed the logical coherence and integrity of the generated chains. Our baseline causal chain discovery approach, insights from our diagnostic evaluation, and benchmark dataset with causal chains lay a solid foundation for advancing future work in implicit, mechanistic causal reasoning in argumentation settings.


【44】From Minimal Existence to Human Definition: The CES-IMU-HSG Theoretical Framework
标题:从最小存在到人类定义:CES-IMU-HSG理论框架
链接:https://arxiv.org/abs/2510.13400

作者:Kei Itoh
备注:57 pages, 2 figures, 4 tables, in English, in Japanese
摘要:本研究提出了一个建立在最小公理"我思故我在"(CES)之上、整合中间元宇宙(IMU)与层次状态网格(HSG)的跨宇宙数学逻辑框架。CES将存在定义为一种自反对应——"存在"与"可被言说"——并将任何形式系统(包括ZFC或HoTT)定位为这一最小结构之上的可附加扩展。IMU作为连接异构理论的公理依赖注册表,采用机构理论(Institution)框架以确保连贯的理论间联系。HSG通过范畴构造将这些思想具体化,由三条正交轴定义:状态深度轴、映射层次轴,以及纳入"不引用未来"原则的时间轴;由此,"定义=状态"的同一性被正式确立为一种范畴性质。将该结构扩展到生物系统,神经系统被实现为HSG上神经元-功能场的0-3维复形,而其经由在物质基底上纤维化的范畴扩展,使多个生理宇宙——神经、内分泌、学习、遗传及输入/输出系统——并行整合为一个连贯的伴随整体。在此框架内,人类行为与认知表现为受物质基底约束的跨宇宙算法的时间组合。最后,通过将依赖外部CES的人类认知与机器存在相对比,本研究引入内部CES的概念:机器将其自身逻辑建立在其运行的事实性之上。这种内在的自我公理化在哲学本体论与工程实现之间架起了连续的桥梁,为人工智能自主、自我定义的存在提供了新基础。
摘要:This study presents an inter-universal mathematical-logical framework constructed upon the minimal axiom Cogito, ergo sum (CES), integrating the Intermediate Meta-Universe (IMU) and the Hierarchical State Grid (HSG). The CES defines existence as a reflexive correspondence --'to be' and 'to be sayable'--and positions any formal system, including ZFC or HoTT, as an attachable extension atop this minimal structure. The IMU functions as a registry of axiomatic dependencies that connect heterogeneous theories, employing the Institution-theoretic framework to ensure coherent inter-theoretical linkages. The HSG concretizes these ideas through categorical construction, defined by three orthogonal axes: the state-depth axis, the mapping-hierarchy axis, and the temporal axis incorporating the principle of 'no future reference.' Through these, the identity of 'definition = state' is formally established as a categorical property. Extending this structure to biological systems, the neural system is implemented as a 0-3D complex of neuron-function fields on the HSG, while its categorical extensions via fiberization over the material base enable the parallel integration of multiple physiological universes-neural, endocrine, learning, genetic, and input/output systems-into a coherent adjoint ensemble. Within this framework, human behavior and cognition emerge as temporal compositions of inter-universal algorithms constrained by the material base. Finally, by contrasting human cognition, which relies on external CES, with machine existence, this study introduces the concept of internal CES, wherein a machine grounds its own logic upon the factuality of its operation. This internal self-axiomatization establishes a continuous bridge between philosophical ontology and engineering implementation, providing a new foundation for the autonomous and self-defining existence of artificial intelligence.


【45】Learnable Game-theoretic Policy Optimization for Data-centric Self-explanation Rationalization
标题:用于以数据为中心的自解释合理化的可学习博弈论策略优化
链接:https://arxiv.org/abs/2510.13393

作者:Yunxiao Zhao, Zhiqiang Wang, Xingtong Yu, Xiaoli Li, Jiye Liang, Ru Li
备注:14 pages, 7 figures, 11 tables. Under review by IEEE
摘要:合理化(rationalization)是一种以数据为中心的框架,旨在通过生成输入数据中人类可理解的片段子集来构建自解释模型,以解释预测结果。它涉及一个合作博弈模型:生成器生成输入中最易为人类理解的部分(即理由,rationales),预测器随后基于这些生成的理由进行预测。传统合理化方法通常通过正则化项施加约束,以校准或惩罚不期望的生成。然而,这些方法都受困于一个称为模式崩溃的问题:预测器给出正确预测,而生成器却持续输出具有崩溃模式的理由。此外,现有研究通常针对特定崩溃模式单独设计,缺乏统一考虑。在本文中,我们从新颖的博弈论视角系统地重新审视合作合理化,并找出该问题的根本原因:生成器不再倾向于探索新策略以发现信息丰富的理由,最终导致系统收敛到次优的博弈均衡(正确的预测 vs. 崩溃的理由)。为解决这一问题,我们提出了一种新方法——面向合理化的博弈论策略优化(PORAT),在合作博弈过程中逐步引入策略干预以矫正博弈均衡,从而引导模型走向更优的解状态。我们从理论上分析了这种次优均衡的成因,并证明了所提方法的可行性。此外,我们在九个广泛使用的真实数据集和两个合成设置上验证了方法,PORAT相比现有最先进方法取得了最高8.1%的性能提升。
摘要:Rationalization, a data-centric framework, aims to build self-explanatory models to explain the prediction outcome by generating a subset of human-intelligible pieces of the input data. It involves a cooperative game model where a generator generates the most human-intelligible parts of the input (i.e., rationales), followed by a predictor that makes predictions based on these generated rationales. Conventional rationalization methods typically impose constraints via regularization terms to calibrate or penalize undesired generation. However, these methods are suffering from a problem called mode collapse, in which the predictor produces correct predictions yet the generator consistently outputs rationales with collapsed patterns. Moreover, existing studies are typically designed separately for specific collapsed patterns, lacking a unified consideration. In this paper, we systematically revisit cooperative rationalization from a novel game-theoretic perspective and identify the fundamental cause of this problem: the generator no longer tends to explore new strategies to uncover informative rationales, ultimately leading the system to converge to a suboptimal game equilibrium (correct predictions v.s collapsed rationales). To solve this problem, we then propose a novel approach, Game-theoretic Policy Optimization oriented RATionalization (PORAT), which progressively introduces policy interventions to address the game equilibrium in the cooperative game process, thereby guiding the model toward a more optimal solution state. We theoretically analyse the cause of such a suboptimal equilibrium and prove the feasibility of the proposed method. Furthermore, we validate our method on nine widely used real-world datasets and two synthetic settings, where PORAT achieves up to 8.1% performance improvements over existing state-of-the-art methods.


【46】MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation
标题:MADREC:一个多方面驱动的LLM代理,用于可解释和自适应推荐
链接:https://arxiv.org/abs/2510.13371

作者:Jiin Park, Misuk Kim
备注:18 pages
摘要:最近将大型语言模型(LLM)集成到推荐系统中的尝试势头渐起,但大多数仍局限于简单的文本生成或基于静态提示的推理,未能捕捉用户偏好和现实交互的复杂性。本研究提出多方面驱动的LLM智能体MADRec,一种自主的基于LLM的推荐器,它通过从评论中无监督地提取多方面信息来构建用户和物品画像,并执行直接推荐、序列推荐和解释生成。MADRec通过基于方面类别的摘要生成结构化画像,并应用重排序(Re-Ranking)构建高密度输入。当输出中缺少真实物品时,自反馈机制会动态调整推理标准。跨多个领域的实验表明,MADRec在精确率和可解释性上均优于传统基线和基于LLM的基线,人工评估进一步证实了所生成解释的说服力。
摘要:Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt-based inference, failing to capture the complexity of user preferences and real-world interactions. This study proposes the Multi-Aspect Driven LLM Agent MADRec, an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs. When the ground-truth item is missing from the output, the Self-Feedback mechanism dynamically adjusts the inference criteria. Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.


【47】A New Perspective on Transformers in Online Reinforcement Learning for Continuous Control
标题:连续控制在线强化学习中Transformer的新视角
链接:https://arxiv.org/abs/2510.13367

作者:Nikita Kachaev, Daniil Zelezetsky, Egor Cherepanov, Alexey K. Kovelev, Aleksandr I. Panov
摘要:尽管Transformer在离线或基于模型的强化学习(RL)中有效且流行,但由于其对训练设置和模型设计决策(例如如何构建策略和价值网络、共享组件或处理时间信息)的敏感性,它在在线无模型RL中仍未得到充分研究。在本文中,我们表明Transformer可以成为在线无模型RL连续控制的强基线。我们研究了关键设计问题:如何对输入进行条件化、在演员与评论家之间共享组件,以及如何切分序列数据用于训练。我们的实验揭示了稳定的架构与训练策略,使其在完全与部分可观察任务以及基于向量和基于图像的设置中均能取得有竞争力的表现。这些发现为在在线RL中应用Transformer提供了实用指导。
摘要:Despite their effectiveness and popularity in offline or model-based reinforcement learning (RL), transformers remain underexplored in online model-free RL due to their sensitivity to training setups and model design decisions such as how to structure the policy and value networks, share components, or handle temporal information. In this paper, we show that transformers can be strong baselines for continuous control in online model-free RL. We investigate key design questions: how to condition inputs, share components between actor and critic, and slice sequential data for training. Our experiments reveal stable architectural and training strategies enabling competitive performance across fully and partially observable tasks, and in both vector- and image-based settings. These findings offer practical guidance for applying transformers in online RL.


【48】Document Intelligence in the Era of Large Language Models: A Survey
标题:大型语言模型时代的文档智能:综述
链接:https://arxiv.org/abs/2510.13366

作者:Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
摘要:文档人工智能(DAI)已成为一个重要的应用领域,并因大型语言模型(LLM)的出现而发生重大变革。早期方法依赖编码器-解码器架构,而仅解码器LLM则彻底改变了DAI,在理解与生成方面带来了显著进步。本综述全面概述了DAI的演进,重点介绍当前的研究尝试以及LLM在该领域的未来前景。我们探讨了多模态、多语言和检索增强DAI的关键进展与挑战,并提出未来研究方向,包括基于智能体的方法和面向文档的基础模型。本文旨在对DAI的最新进展及其对学术与实际应用的影响进行结构化分析。
摘要:Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.


【49】Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity
标题:语言作为标签:数据稀缺下日常姿势的Zero-Shot多模式分类
链接:https://arxiv.org/abs/2510.13364

作者:MingZe Tang, Jubal Chandy Jacob
摘要:最近的视觉语言模型(VLM)通过在共享空间中对齐图像和文本来实现零样本分类,这是一种适用于数据稀缺条件的有前景的方法。然而,提示设计对识别视觉上相似类别(如人体姿势)的影响尚未被充分理解。本研究在一个源自COCO的285张图像的小型数据集上,调查提示特异性如何影响坐、站与走/跑的零样本分类。我们使用系统性增加语言细节的三层提示设计,评估了包括OpenCLIP、MetaCLIP 2和SigLip在内的一组现代VLM。我们的研究结果揭示了一个引人注目且反直觉的趋势:对于性能最高的模型(MetaCLIP 2和OpenCLIP),最简单、最基本的提示始终取得最佳结果;添加描述性细节会显著降低性能,例如MetaCLIP 2的多类准确率从68.8%下降到55.1%,我们将这一现象称为"提示过拟合"(prompt overfitting)。相反,性能较低的SigLip模型在获得更具描述性、基于身体线索的提示时,对模糊类别的分类有所改善。
摘要:Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance for instance, MetaCLIP 2's multi-class accuracy drops from 68.8\% to 55.1\% a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
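VLM零样本分类的机制很简单:把每个类别的提示编码为文本向量,选与图像向量余弦相似度最高的类。以下示意用虚构向量代替真实CLIP嵌入(实际使用时应换成OpenCLIP等模型的编码输出),仅演示这一决策规则:

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs):
    """Return the index of the class whose prompt embedding has the
    highest cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# mock embeddings standing in for CLIP outputs (hypothetical values)
classes = ["sitting", "standing", "walking"]
prompts = np.array([[0.9, 0.1, 0.0],    # "a photo of a person sitting"
                    [0.1, 0.9, 0.1],    # "a photo of a person standing"
                    [0.0, 0.2, 0.9]])   # "a photo of a person walking"
image = np.array([0.8, 0.2, 0.1])       # an image of a seated person
pred = zero_shot_classify(image, prompts)
```

论文研究的正是 `prompts` 这一行文字如何措辞:结果显示,对强模型而言,越简短的提示产生的文本向量与图像对齐得越好。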


【50】Generalist++: A Meta-learning Framework for Mitigating Trade-off in Adversarial Training
标题:通才++:一个用于缓解对抗性训练中权衡的元学习框架
链接:https://arxiv.org/abs/2510.13361

作者:Yisen Wang, Yichuan Mo, Hongjun Wang, Junyi Li, Zhouchen Lin
摘要:尽管神经网络发展迅速,但它们仍然非常容易受到对抗性样本的攻击,对抗性训练(AT)是目前最有效的防御手段。虽然AT已被广泛研究,但其实际应用暴露出两个主要局限性:与标准训练相比,自然准确性往往会显着降低,并且鲁棒性在不同范数约束下的攻击之间不能很好地转移。与之前试图在单个网络中只解决一个问题的工作不同,我们建议将整体泛化目标划分为多个子任务,每个子任务分配给一个专用的基础学习器。通过专注于其指定的目标,每个基础学习者很快成为其领域的专家。在训练的后期阶段,我们对它们的参数进行插值,以形成一个知识渊博的全局学习器,同时定期将全局参数重新分配回基本学习器,以防止它们的优化轨迹偏离共享目标太远。我们将此框架称为Generalist,并介绍了针对不同应用场景的三种变体。理论分析和大量的实验表明,与基线方法相比,Generalist方法具有更低的泛化误差,并显着消除了权衡问题。我们的研究结果表明,Generalist提供了一个充满希望的一步,在未来发展完全强大的分类。
摘要:Despite the rapid progress of neural networks, they remain highly vulnerable to adversarial examples, for which adversarial training (AT) is currently the most effective defense. While AT has been extensively studied, its practical applications expose two major limitations: natural accuracy tends to degrade significantly compared with standard training, and robustness does not transfer well across attacks crafted under different norm constraints. Unlike prior works that attempt to address only one issue within a single network, we propose to partition the overall generalization goal into multiple sub-tasks, each assigned to a dedicated base learner. By specializing in its designated objective, each base learner quickly becomes an expert in its field. In the later stages of training, we interpolate their parameters to form a knowledgeable global learner, while periodically redistributing the global parameters back to the base learners to prevent their optimization trajectories from drifting too far from the shared target. We term this framework Generalist and introduce three variants tailored to different application scenarios. Both theoretical analysis and extensive experiments demonstrate that Generalist achieves lower generalization error and significantly alleviates the trade-off problems compared with baseline methods. Our results suggest that Generalist provides a promising step toward developing fully robust classifiers in the future.


【51】Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
标题:用于鲁棒机器人控制的离线到在线强化学习中的对抗微调
链接:https://arxiv.org/abs/2510.13358

作者:Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto
备注:16 pages, 8 figures
摘要:离线强化学习可以在没有风险的在线交互的情况下实现样本高效的策略获取,但在静态数据集上训练的策略在动作空间扰动(如执行器故障)下仍然很脆弱。这项研究引入了一个离线到在线的框架,该框架在干净的数据上训练策略,然后进行对抗性微调,其中扰动被注入到执行的动作中,以诱导补偿行为并提高弹性。性能感知课程通过指数移动平均信号在训练期间进一步调整扰动概率,在整个学习过程中平衡鲁棒性和稳定性。连续控制运动任务的实验表明,该方法始终提高了离线基线的鲁棒性,并且比从头开始训练收敛得更快。匹配的微调和评估条件产生最强的鲁棒性动作空间扰动,而自适应课程策略减轻与线性课程策略观察到的标称性能的退化。总体而言,结果表明,对抗性微调可以在不确定环境下实现自适应和鲁棒控制,弥合离线效率和在线适应性之间的差距。
摘要:Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.
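摘要中"通过指数移动平均(EMA)信号调整扰动概率"的性能感知课程,可以用如下极简草图示意。阈值、步长等数值均为演示用的假设,并非论文的实际超参数。

```python
import random

class PerturbationCurriculum:
    """基于回报EMA的性能感知扰动课程(示意):
    近期表现达标时提高注入对抗动作噪声的概率,表现下滑时回退。"""

    def __init__(self, p_init=0.0, p_max=0.5, step=0.05, ema_alpha=0.1, target_return=100.0):
        self.p = p_init
        self.p_max = p_max
        self.step = step
        self.alpha = ema_alpha
        self.target = target_return
        self.ema = None

    def update(self, episode_return):
        # 回报的指数移动平均
        if self.ema is None:
            self.ema = episode_return
        else:
            self.ema = self.alpha * episode_return + (1 - self.alpha) * self.ema
        # 性能达标则加大扰动概率,否则减小,平衡鲁棒性与稳定性
        if self.ema >= self.target:
            self.p = min(self.p_max, self.p + self.step)
        else:
            self.p = max(0.0, self.p - self.step)
        return self.p

    def maybe_perturb(self, action, noise_scale=0.1, rng=random):
        # 以当前课程概率向执行动作注入高斯噪声
        if rng.random() < self.p:
            return action + rng.gauss(0.0, noise_scale)
        return action

cur = PerturbationCurriculum(target_return=100.0)
for r in [120, 130, 110, 125]:  # 表现稳定,扰动概率逐步上升
    p = cur.update(r)
print(round(p, 2))
```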


【52】Personal Attribute Leakage in Federated Speech Models
标题:联邦语音模型中的个人属性泄露
链接:https://arxiv.org/abs/2510.13357

作者:Hamdan Al-Ali, Ali Reza Ghavamipour, Tommaso Caselli, Fatih Turkmen, Zeerak Talat, Hanan Aldarmaki
备注:5 pages, 4 figures, 2 tables
摘要:联邦学习是机器学习模型隐私保护训练的常用方法。在本文中,我们分析了联邦设置下ASR模型对属性推理攻击的脆弱性。我们在三个ASR模型(Wav2Vec2、HuBERT和Whisper)上测试了被动威胁模型下的一种非参数白盒攻击方法。该攻击仅对权重差分进行操作,而无需访问目标说话者的原始语音。我们证明了针对敏感人口统计学和临床属性的攻击可行性:性别、年龄、口音、情绪和构音障碍。我们的研究结果表明,在预训练数据中代表性不足或缺失的属性更容易受到此类推理攻击。特别是,关于口音的信息可以从所有模型中可靠地推断出来。我们的研究结果揭示了联邦ASR模型中此前未被记录的漏洞,并为提高安全性提供了见解。
摘要:Federated learning is a common method for privacy-preserving training of machine learning models. In this paper, we analyze the vulnerability of ASR models to attribute inference attacks in the federated setting. We test a non-parametric white-box attack method under a passive threat model on three ASR models: Wav2Vec2, HuBERT, and Whisper. The attack operates solely on weight differentials without access to raw speech from target speakers. We demonstrate attack feasibility on sensitive demographic and clinical attributes: gender, age, accent, emotion, and dysarthria. Our findings indicate that attributes that are underrepresented or absent in the pre-training data are more vulnerable to such inference attacks. In particular, information about accents can be reliably inferred from all models. Our findings expose previously undocumented vulnerabilities in federated ASR models and offer insights towards improved security.


【53】Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
标题:保护:为值得信赖的企业LLM系统打造强大的护栏堆栈
链接:https://arxiv.org/abs/2510.13351

作者:Karthik Avinash, Nikhil Pareek, Rishav Hada
摘要:大型语言模型(LLM)在企业和关键任务领域的部署越来越多,这凸显了对强大的护栏系统的迫切需求,以确保安全性、可靠性和合规性。现有的解决方案通常难以实现实时监督、多模态数据处理和可解释性,这些限制阻碍了它们在受监管环境中的采用。现有的护栏在很大程度上孤立运行且只关注文本,这使它们不适合多模态的生产规模环境。我们引入了Protect,一个原生多模态护栏模型,旨在跨文本、图像和音频输入无缝运行,并专为企业级部署而设计。Protect集成了通过低秩适应(LoRA)在广泛的多模态数据集上训练的、按类别微调的适配器,该数据集涵盖四个安全维度:毒性、性别歧视、数据隐私和提示注入。我们的教师辅助注释管道利用推理和解释轨迹来生成高保真、上下文感知的标签。实验结果表明,Protect在所有安全维度上都达到了最先进的性能,超过了WildGuard、LlamaGuard-4和GPT-4.1等现有的开放和专有模型。Protect为可靠、可审计且生产就绪的安全系统奠定了坚实基础,这些系统能够跨文本、图像和音频模态运行。
摘要:The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability -- limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone, making them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs, built for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.


【54】AOAD-MAT: Transformer-based multi-agent deep reinforcement learning model considering agents' order of action decisions
标题:AOAD-MAT:考虑智能体动作决策顺序的基于转换器的多智能体深度强化学习模型
链接:https://arxiv.org/abs/2510.13343

作者:Shota Takayama, Katsuhide Fujita
备注:This manuscript is an extended version of the work accepted as a short paper at the 26th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2025). The Version of Record of this contribution is published in Springer's Lecture Notes in Artificial Intelligence series (LNCS/LNAI)
摘要:多智能体强化学习(MARL)专注于训练共存于共享环境中的多个学习智能体的行为。最近,MAT(Multi-Agent Transformer)和ACE(ACtion dEpendent deep Q-learning)等MARL模型通过利用顺序决策过程显著提高了性能。虽然这些模型可以提升性能,但它们没有明确考虑智能体做出决策的顺序的重要性。在本文中,我们提出了智能体动作决策顺序MAT(AOAD-MAT),一种考虑智能体决策顺序的新型MAT模型。该模型将动作决策序列明确纳入学习过程,使模型能够学习并预测智能体动作的最优顺序。AOAD-MAT模型利用基于Transformer的actor-critic架构,动态调整智能体动作的顺序。为了实现这一目标,我们引入了一种新型MARL架构,它与一个专注于预测下一个行动智能体的子任务协作,并将其集成到基于近端策略优化(PPO)的损失函数中,以协同最大化顺序决策的优势。所提方法通过在星际争霸多智能体挑战赛和多智能体MuJoCo基准上的大量实验得到了验证。实验结果表明,所提出的AOAD-MAT模型优于现有的MAT及其他基线模型,证明了在MARL中调整智能体动作决策顺序的有效性。
摘要:Multi-agent reinforcement learning focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models, such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE), have significantly improved performance by leveraging sequential decision-making processes. Although these models can enhance performance, they do not explicitly consider the importance of the order in which agents make decisions. In this paper, we propose an Agent Order of Action Decisions-MAT (AOAD-MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing the model to learn and predict the optimal order of agent actions. The AOAD-MAT model leverages a Transformer-based actor-critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture that cooperates with a subtask focused on predicting the next agent to act, integrated into a Proximal Policy Optimization based loss function to synergistically maximize the advantage of the sequential decision-making. The proposed method was validated through extensive experiments on the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo benchmarks. The experimental results show that the proposed AOAD-MAT model outperforms existing MAT and other baseline models, demonstrating the effectiveness of adjusting the AOAD order in MARL.


【55】Thompson Sampling via Fine-Tuning of LLMs
标题:通过LLM微调进行汤普森采样
链接:https://arxiv.org/abs/2510.13328

作者:Nicolas Menet, Aleksandar Terzić, Andreas Krause, Abbas Rahimi
摘要:大型非结构化离散空间中的贝叶斯优化经常因缺乏梯度而受到获取函数最大化计算成本的阻碍。我们提出了一种基于汤普森采样的可扩展替代方案,通过直接参数化候选取得最大奖励的概率,消除了对获取函数最大化的需要。我们的方法,通过微调进行汤普森采样(ToSFiT),利用嵌入在提示条件化大型语言模型中的先验知识,并逐步使其向后验靠拢。理论上,我们为汤普森采样的一种变分形式推导出新的遗憾界,其保证与标准形式的强保证相匹配。我们的分析揭示了仔细向最大性后验概率自适应的关键作用,这一原则正是ToSFiT算法的基础。经验上,我们在三个不同任务上验证了我们的方法:FAQ响应细化、热稳定蛋白质搜索和量子电路设计。我们证明,在线微调显著提高了样本效率,而对计算效率的影响可以忽略不计。
摘要:Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality--a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.
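摘要所述"直接按候选成为最优的概率行动、从而无需最大化获取函数"的思想,与经典汤普森采样一致。下面以Beta-Bernoulli多臂老虎机为玩具替身做一个可运行的示意(ToSFiT本身是在LLM上参数化这一概率,此处仅演示其采样-行动的核心机制)。

```python
import random

def thompson_step(successes, failures, rng):
    # 对每个候选从Beta后验各采样一次,直接取argmax行动:
    # 每个候选被选中的概率恰为其"是最优"的后验概率,
    # 这正是ToSFiT用LLM直接参数化的量(此处为玩具替身)。
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

rng = random.Random(0)
true_p = [0.2, 0.5, 0.8]             # 三个候选的隐藏奖励概率(假设)
succ, fail = [0, 0, 0], [0, 0, 0]
for _ in range(2000):
    arm = thompson_step(succ, fail, rng)
    if rng.random() < true_p[arm]:   # 伯努利奖励
        succ[arm] += 1
    else:
        fail[arm] += 1
best = max(range(3), key=lambda i: succ[i] + fail[i])
print(best)
```

可以看到采样-行动机制会把绝大多数尝试集中到真实最优的候选上,而全程没有任何获取函数的最大化步骤。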


【56】Injection, Attack and Erasure: Revocable Backdoor Attacks via Machine Unlearning
标题:注入、攻击和擦除:通过机器遗忘实现可撤销的后门攻击
链接:https://arxiv.org/abs/2510.13322

作者:Baogang Song, Dongdong Zhao, Jianwen Xiang, Qiben Xu, Zizhuo Yu
摘要:后门攻击因其隐蔽性和持久性而对深度神经网络(DNN)构成持续的安全风险。虽然最近的研究已经探索利用模型遗忘机制来增强后门的隐蔽性,但现有攻击策略仍会留下可通过静态分析检测到的持久痕迹。在这项工作中,我们提出了可撤销后门攻击的首个范例:在攻击目标达成后,后门可以被主动且彻底地移除。我们将可撤销后门攻击中的触发器优化表述为一个双层优化问题:通过同时模拟后门注入和遗忘过程,优化触发器生成器,使其在实现高攻击成功率(ASR)的同时,确保后门可以通过遗忘被轻松擦除。为了缓解注入与移除目标之间的优化冲突,我们对中毒样本和遗忘样本采用确定性划分以减少采样引起的方差,并进一步应用投影冲突梯度(PCGrad)技术来解决剩余的梯度冲突。在CIFAR-10和ImageNet上的实验表明,我们的方法保持了与最先进后门攻击相当的ASR,同时能够在遗忘后有效去除后门行为。这项工作为后门攻击研究开辟了新方向,并对机器学习系统的安全性提出了新挑战。
摘要:Backdoor attacks pose a persistent security risk to deep neural networks (DNNs) due to their stealth and durability. While recent research has explored leveraging model unlearning mechanisms to enhance backdoor concealment, existing attack strategies still leave persistent traces that may be detected through static analysis. In this work, we introduce the first paradigm of revocable backdoor attacks, where the backdoor can be proactively and thoroughly removed after the attack objective is achieved. We formulate the trigger optimization in revocable backdoor attacks as a bilevel optimization problem: by simulating both backdoor injection and unlearning processes, the trigger generator is optimized to achieve a high attack success rate (ASR) while ensuring that the backdoor can be easily erased through unlearning. To mitigate the optimization conflict between injection and removal objectives, we employ a deterministic partition of poisoning and unlearning samples to reduce sampling-induced variance, and further apply the Projected Conflicting Gradient (PCGrad) technique to resolve the remaining gradient conflicts. Experiments on CIFAR-10 and ImageNet demonstrate that our method maintains ASR comparable to state-of-the-art backdoor attacks, while enabling effective removal of backdoor behavior after unlearning. This work opens a new direction for backdoor attack research and presents new challenges for the security of machine learning systems.
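摘要提到用PCGrad化解注入与移除目标之间的梯度冲突。其核心操作可以用NumPy草图如下:当两个任务梯度内积为负(冲突)时,将一方在另一方方向上的分量投影去除。这只是PCGrad(Yu等人提出)本身的示意,并非本文的完整训练流程。

```python
import numpy as np

def pcgrad(grads):
    # 对每个任务梯度,去除其与其他任务梯度冲突(内积为负)的分量后求和。
    projected = []
    for i, g in enumerate(grads):
        g = g.astype(float).copy()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = g @ h
            if dot < 0:                      # 冲突:投影掉沿h的分量
                g = g - (dot / (h @ h)) * h
        projected.append(g)
    return np.sum(projected, axis=0)

g_inject = np.array([1.0, 0.0])    # 注入目标的梯度(示意数值)
g_remove = np.array([-0.5, 1.0])   # 移除(遗忘)目标的梯度(示意数值)
combined = pcgrad([g_inject, g_remove])
print(combined)                     # 合成梯度与两个原始目标都不再冲突
```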


【57】Self-Augmented Visual Contrastive Decoding
标题:自增强视觉对比解码
链接:https://arxiv.org/abs/2510.13315

作者:Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta
摘要:大型视觉语言模型(LVLM)已经表现出显著的多模态能力,但它们从底层语言模型继承了产生幻觉的倾向。虽然已有工作提出视觉对比解码来缓解这一问题,但现有方法通常采用忽略文本查询所提供具体上下文的通用视觉增强,从而限制了其有效性。本研究提出了一种新的免训练解码策略来解决这些限制,包含两个关键贡献。其一是自增强提示策略,利用模型的内在知识,在查询与视觉增强之间动态对齐语义。其二是自适应阈值算法,利用logit分布的全部信息,根据输出稀疏性自适应调整下一个令牌的候选集大小。在四个LVLM和七个基准上的大量实验表明,与最先进的解码方法相比,所提解码显著提高了事实一致性。这项工作突出了融合查询相关增强与熵感知解码对提升LVLM有效生成的重要性。
摘要:Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.


【58】LLM one-shot style transfer for Authorship Attribution and Verification
标题:LLM一次风格转移,用于作者归因和验证
链接:https://arxiv.org/abs/2510.13302

作者:Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
摘要:计算文体学通过文本中的定量模式分析写作风格,支持从身份关联、剽窃检测等法医任务到人文学科中的文学归属等应用。监督方法和对比方法依赖于带有虚假相关性的数据,并且经常混淆风格与主题。尽管现代LLM的CLM预训练天然适用于AI生成文本检测,但它几乎没有被用于一般的作者归属问题。我们基于这种大规模预训练和LLM的上下文学习能力,提出了一种新的无监督方法,利用LLM的对数概率来衡量从一个文本到另一个文本的风格可迁移性。在控制主题相关性的情况下,我们的方法显著优于规模相当的LLM提示方法,并且比对比训练的基线取得更高的准确率。此外,性能随基础模型规模的扩展相当一致;在作者身份验证场景中,还可以借助一种增加测试时计算量的额外机制进一步提升,从而在计算成本和准确率之间实现灵活权衡。
摘要:Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation; enabling flexible trade-offs between computational cost and accuracy.
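摘要的核心思想是用语言模型的对数概率衡量文本间的风格可迁移性。下面用字符二元组语言模型作为LLM的玩具替身给出示意:在作者A的文本上统计,再为待测文本打分,同风格文本得分更高。真实方法依赖预训练LLM的上下文条件化,此处仅为概念演示。

```python
import math
from collections import Counter

def bigram_logprob(train_text, test_text, alpha=1.0):
    # 在train_text上统计字符二元组(加alpha平滑),
    # 返回test_text的平均对数概率:分数越高,风格越"可迁移"。
    bigrams = Counter(zip(train_text, train_text[1:]))
    unigrams = Counter(train_text)
    vocab_size = len(set(train_text) | set(test_text))
    logp = 0.0
    for a, b in zip(test_text, test_text[1:]):
        logp += math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab_size))
    return logp / max(1, len(test_text) - 1)

author_a = "the cat sat on the mat and the cat ran " * 20  # "作者A"的文体(假设数据)
same = bigram_logprob(author_a, "the cat sat on the mat")   # 同风格文本
diff = bigram_logprob(author_a, "zxqj vvkw qqpz")           # 异风格文本
print(same > diff)
```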


【59】Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems
标题:更高的满意度,更低的成本:LLM如何彻底改变美团智能交互系统的技术报告
链接:https://arxiv.org/abs/2510.13291

作者:Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu
备注:36 pages, 14 figures
摘要:增强客户体验对于业务成功至关重要,特别是在服务需求规模和复杂性不断增长的情况下。生成式人工智能和大型语言模型(LLM)使智能交互系统能够提供高效、个性化和全天候的支持。在实践中,智能交互系统遇到了几个挑战:(1)构建冷启动训练的高质量数据很困难,阻碍了自我进化,提高了人力成本。(2)由于意图理解、规则遵守和解决方案提取不足,多轮对话性能仍然不理想。(3)业务规则的频繁演化影响了系统的可操作性和可移植性,制约了系统的低成本扩展和适应性。(4)在复杂的场景中,依赖单个LLM是不够的,在这些场景中,缺乏多代理框架和有效的协作会破坏流程的完整性和服务质量。(5)多轮对话的开放性,缺乏统一的金答案,阻碍了定量评估和持续优化。为了应对这些挑战,我们推出了WOWService,一个为工业应用量身定制的智能交互系统。通过LLM和多代理架构的集成,WOWService实现了自主任务管理和协作问题解决。具体而言,WOWService侧重于核心模块,包括数据构建,通用能力增强,业务场景适配,多代理协调和自动评估。目前,WOWService已部署在美团App上,在关键指标上取得了显著的进步,例如,用户满意度指标1(USM 1)-27.53%和2(USM 2)+25.51%,显示了其在捕捉用户需求和推进个性化服务方面的有效性。
摘要:Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.


【60】To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models
标题:转向还是不转向?语言模型的弃权机制错误减少
链接:https://arxiv.org/abs/2510.13290

作者:Anna Hedström, Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Manuela Veloso
备注:ICML 2025, 22 pages, 16 figures, 5 tables
摘要:我们介绍了带弃权的机制性错误减少(MERA),一个通过选择性、自适应干预来引导语言模型(LM)以减少错误的原则性框架。与依赖固定、手动调节的转向强度(往往导致转向不足或过度)的现有方法不同,MERA通过以下方式解决这些限制:(i)优化干预方向,以及(ii)校准何时转向及转向多少,从而可证明地提高性能,或在无法进行可靠校正时弃权。在不同数据集和LM系列上的实验证明了安全、有效、不降级的错误校正,并且MERA优于现有基线。此外,MERA可以应用于现有转向技术之上以进一步提高其性能,这使其成为一种通用且高效的机制性激活转向方法。
摘要:We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Unlike existing methods that rely on fixed, manually tuned steering strengths, often resulting in under- or oversteering, MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when, and how much, to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets and LM families demonstrate safe, effective, non-degrading error correction, and that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose and efficient approach to mechanistic activation steering.
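MERA"先校准置信度、再决定转向或弃权"的选择性干预逻辑,可以用如下草图示意。其中calibrate回调的接口和toy_calibrate的打分方式均为演示假设,并非论文的实际校准过程。

```python
import numpy as np

def steer_or_abstain(hidden, direction, calibrate, max_strength=2.0):
    # 仅当校准分数表明校正大概率有益时才施加激活转向,否则弃权。
    # calibrate回调返回(置信度, 建议强度),该接口为演示假设。
    confidence, strength = calibrate(hidden, direction)
    if confidence < 0.5:                      # 无法可靠校正:弃权
        return hidden, "abstain"
    unit = direction / np.linalg.norm(direction)
    return hidden + min(strength, max_strength) * unit, "steered"

def toy_calibrate(hidden, direction):
    # 玩具校准器:激活越偏向转向方向的"错误"一侧,校正置信度越高
    unit = direction / np.linalg.norm(direction)
    gap = -(hidden @ unit)
    confidence = 1.0 / (1.0 + np.exp(-gap))
    return confidence, gap

direction = np.array([1.0, 0.0])
_, decision_far = steer_or_abstain(np.array([-3.0, 0.5]), direction, toy_calibrate)
_, decision_near = steer_or_abstain(np.array([2.0, 0.5]), direction, toy_calibrate)
print(decision_far, decision_near)
```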


【61】SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning
标题:SAJA:一个基于多智能体深度强化学习的状态-动作联合攻击框架
链接:https://arxiv.org/abs/2510.13262

作者:Weiqi Guo, Guanjun Liu, Ziyuan Zhou
摘要:多智能体深度强化学习(MADRL)已在自动驾驶和策略游戏等合作与竞争任务中显示出潜力。然而,由MADRL训练的模型容易受到状态和动作上的对抗性扰动。因此,从攻击的角度研究MADRL模型的鲁棒性十分必要。现有研究主要集中在纯状态攻击或纯动作攻击上,没有考虑如何有效地将二者联合。简单地组合状态和动作扰动(例如随机扰动状态和动作)并不能利用它们潜在的协同效应。在本文中,我们提出了具有良好协同效应的状态-动作联合攻击(SAJA)框架。SAJA由两个重要阶段组成:(1)在状态攻击阶段,多步梯度上升方法同时利用演员网络和评论家网络来计算对抗状态;(2)在动作攻击阶段,基于扰动后的状态,第二次梯度上升使用评论家网络来构造最终的对抗动作。此外,为了增强评论家网络的指导作用,我们在损失函数中加入一个启发式正则项,用于衡量扰动动作与原始干净动作之间的距离。我们在多智能体粒子环境(MPE)中评估了SAJA,结果表明:(1)它优于仅状态或仅动作攻击,并且更加隐蔽;(2)现有的状态或动作防御方法无法抵御其攻击。
摘要:Multi-Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state-only attacks or action-only attacks, but do not consider how to effectively joint them. Simply combining state and action perturbations such as randomly perturbing states and actions does not exploit their potential synergistic effects. In this paper, we propose the State-Action Joint Attack (SAJA) framework that has a good synergistic effects. SAJA consists of two important phases: (1) In the state attack phase, a multi-step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added into the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi-Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state-only or action-only attacks, and (2) existing state or action defense methods cannot defend its attacks.
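SAJA两阶段攻击的骨架可以用一个玩具二次评论家来示意:第一阶段对状态做多步梯度上升(压低评论家值),第二阶段在扰动状态的基础上再对动作做梯度上升。此处用数值差分代替自动微分,且省略了论文中的演员网络与动作距离正则项,仅为结构示意。

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    # 中心差分梯度,代替对评论家网络的自动微分
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def saja_attack(state, action, critic, steps=10, lr=0.1, eps_s=0.5, eps_a=0.5):
    s = state.copy()
    for _ in range(steps):  # 阶段1:对状态做梯度上升,最小化评论家值
        s = s + lr * numeric_grad(lambda x: -critic(x, action), s)
        s = state + np.clip(s - state, -eps_s, eps_s)   # 保持在扰动预算内
    a = action.copy()
    for _ in range(steps):  # 阶段2:基于扰动状态,再对动作做梯度上升
        a = a + lr * numeric_grad(lambda x: -critic(s, x), a)
        a = action + np.clip(a - action, -eps_a, eps_a)
    return s, a

# 玩具评论家:状态与动作都接近零时价值最高(演示假设)
critic = lambda s, a: -(s @ s) - (a @ a)
s0 = np.array([0.1, -0.1])
a0 = np.array([0.1, 0.1])
s_adv, a_adv = saja_attack(s0, a0, critic)
print(critic(s_adv, a_adv) < critic(s0, a0))  # 联合攻击压低了评论家值
```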


【62】A Ratio-Based Shapley Value for Collaborative Machine Learning - Extended Version
标题:协作机器学习的基于比率的Shapley值-扩展版本
链接:https://arxiv.org/abs/2510.13261

作者:Björn Filter, Ralf Möller, Özgür Lütfü Özçep
备注:Extended version of a paper accepted at the 26th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2025)
摘要:协作机器学习使多个数据所有者能够联合训练模型以提高预测性能。然而,确保激励相容性和公平的基于贡献的奖励仍然是一项重大挑战。Sim及其同事之前的工作(Rachel Hwee Ling Sim et al.: Collaborative machine learning with incentive-aware model rewards. In: International Conference on Machine Learning. PMLR, 2020, pp. 8927-8963)通过分配模型奖励来解决这个问题:模型奖励是非货币的、可自由复制的,其分配基于各方数据贡献的Shapley值,并以信息增益来衡量。在本文中,我们引入了一种基于比率的Shapley值,用相对贡献度量取代标准的加性公式。虽然我们的整体奖励框架(包括激励定义和模型奖励设置)仍与Sim及其同事的工作保持一致,但底层的价值函数是根本不同的。我们的替代估值会导致不同的模型奖励分布,并为分析激励性质提供了一个新视角。我们正式定义了基于比率的价值,并证明它满足与加性公式相同的一组激励条件,包括公平性、个体理性和稳定性的适配版本。与最初的方法一样,我们的方法在这些激励之间面临同样的基本权衡。我们的贡献是一个有数学依据的加性Shapley框架替代方案,可能更适合贡献者之间的比例关系比加性差异更有意义的场景。
摘要:Collaborative machine learning enables multiple data owners to jointly train models for improved predictive performance. However, ensuring incentive compatibility and fair contribution-based rewards remains a critical challenge. Prior work by Sim and colleagues (Rachel Hwee Ling Sim et al: Collaborative machine learning with incentive-aware model rewards. In: International conference on machine learning. PMLR. 2020, pp. 8927-8963) addressed this by allocating model rewards, which are non-monetary and freely replicable, based on the Shapley value of each party's data contribution, measured via information gain. In this paper, we introduce a ratio-based Shapley value that replaces the standard additive formulation with a relative contribution measure. While our overall reward framework, including the incentive definitions and model-reward setting, remains aligned with that of Sim and colleagues, the underlying value function is fundamentally different. Our alternative valuation induces a different distribution of model rewards and offers a new lens through which to analyze incentive properties. We formally define the ratio-based value and prove that it satisfies the same set of incentive conditions as the additive formulation, including adapted versions of fairness, individual rationality, and stability. Like the original approach, our method faces the same fundamental trade-offs between these incentives. Our contribution is a mathematically grounded alternative to the additive Shapley framework, potentially better suited to contexts where proportionality among contributors is more meaningful than additive differences.
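经典Shapley值可以通过枚举所有排列、对边际贡献取平均来精确计算。下面的草图同时给出加性边际与一个"比率式"边际作为对照;摘要未给出论文比率公式的精确形式,此处的v(S∪{i})/v(S)仅是示意性替身,价值函数也用简单的可加数据量代替信息增益。

```python
from itertools import permutations

def shapley(players, v, marginal):
    # 枚举所有排列,对marginal(v, 已有联盟, 新成员)取平均:
    # 配加性边际即经典Shapley值;比率边际为示意性替身。
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            totals[p] += marginal(v, coalition, p)
            coalition = coalition | {p}
    return {p: t / len(orders) for p, t in totals.items()}

additive = lambda v, S, i: v(S | {i}) - v(S)
ratio = lambda v, S, i: v(S | {i}) / v(S) if v(S) > 0 else v(S | {i})  # 假设的比率形式

data = {"A": 4.0, "B": 2.0, "C": 2.0}    # 各方贡献的"信息增益"(玩具数值)
v = lambda S: sum(data[p] for p in S)     # 可加的玩具价值函数

phi_add = shapley(data, v, additive)
phi_ratio = shapley(data, v, ratio)
print(phi_add, {k: round(x, 3) for k, x in phi_ratio.items()})
```

在这个可加的玩具设定下,加性Shapley值恰好还原各方自身的贡献,而比率式估值给出不同的相对排序权重,正对应摘要所说"诱导不同的模型奖励分布"。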


【63】Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture
标题:基于轻量级架构的嵌入式系统实时人群计数
链接:https://arxiv.org/abs/2510.13250

作者:Zhiyuan Zhao, Yubin Wen, Siyu Yang, Lichen Ning, Yuandong Liu, Junyu Gao
摘要:人群计数是一种通过图像估计人群数量的任务,在智能安防、城市规划、公共安全管理等领域有着极其重要的应用价值。然而,现有的人群计数方法在这些领域的嵌入式系统上实际应用时,存在模型参数过多、复杂计算量大等问题。嵌入式系统的实际应用要求模型具有实时性,即模型的速度要足够快。考虑到上述问题,我们为人群计数任务设计了一个具有stem-编码器-解码器结构的超实时模型,与最先进方法相比实现了最快的推理速度。首先,在stem网络中使用大卷积核来扩大感受野,有效地提取头部的细节信息。然后,在编码器部分,我们使用条件通道加权和多分支局部融合块,以较低的计算消耗合并多尺度特征。这一部分对模型的超实时性能至关重要。最后,在编码器顶部添加特征金字塔网络,以缓解其融合不完全的问题。在三个基准上的实验表明,我们的网络适用于嵌入式系统上的超实时人群计数,并确保了有竞争力的准确率。同时,所提网络的推理速度是最快的。具体来说,所提网络在NVIDIA GTX 1080Ti上达到381.7 FPS,在NVIDIA Jetson TX1上达到71.9 FPS。
摘要:Crowd counting is a task of estimating the number of people in a crowd from images, which is extremely valuable in the fields of intelligent security, urban planning, public safety management, and so on. However, the existing counting methods have some problems in practical application on embedded systems for these fields, such as excessive model parameters, abundant complex calculations, etc. The practical application of embedded systems requires the model to be real-time, which means that the model is fast enough. Considering the aforementioned problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-art methods. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and multi-branch local fusion blocks to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, a feature pyramid network is added on top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems, ensuring competitive accuracy. At the same time, the inference speed of the proposed network is the fastest. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.


【64】MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding
标题:MotionBeat:通过具身对比学习和小节等变接触感知编码的运动对齐音乐表示
链接:https://arxiv.org/abs/2510.13244

作者:Xuanchen Wang, Heng Wang, Weidong Cai
备注:5 pages, 1 figure. demo page: this https URL
摘要:音乐既是一种听觉现象,也是一种具身现象,与人体运动密切相关,并通过舞蹈自然表达。然而,大多数现有的音频表示忽略了这一具身维度,限制了它们捕捉驱动运动的节奏和结构线索的能力。我们提出了MotionBeat,一个用于运动对齐音乐表示学习的框架。MotionBeat使用两个新提出的目标进行训练:具身对比损失(ECL),一种增强的InfoNCE公式,带有速度感知和节拍抖动负样本,以实现细粒度的节奏辨别;以及结构节奏对齐损失(SRAL),通过将音乐重音与相应的运动事件对齐来确保节奏一致性。在架构上,MotionBeat引入小节等变相位旋转来捕捉周期性的节奏模式,并引入接触引导注意力来强调与音乐重音同步的运动事件。实验表明,MotionBeat在音乐到舞蹈生成方面优于最先进的音频编码器,并能有效迁移到节拍跟踪、音乐标注、流派和乐器分类、情感识别以及视听检索。我们的项目演示页面:https://motionbeat2025.github.io/。
摘要:Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: https://motionbeat2025.github.io/.
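ECL是InfoNCE的增强形式。下面给出单个锚点的通用InfoNCE损失的NumPy草图;在ECL中,负样本将包含速度偏移和节拍抖动的片段(此处以普通随机向量代替,属演示假设),且不含论文的全部损失项。

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    # 单个锚点的InfoNCE:-log( exp(sim(a,p)/t) / Σ exp(sim(a,·)/t) )
    def cos(u, w):
        return (u @ w) / (np.linalg.norm(u) * np.linalg.norm(w))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    # 数值稳定的log-sum-exp减去正样本logit
    return (logits.max() + np.log(np.exp(logits - logits.max()).sum())) - logits[0]

rng = np.random.default_rng(0)
anchor = rng.standard_normal(16)                         # 音乐片段嵌入(玩具)
positive = anchor + 0.1 * rng.standard_normal(16)        # 对齐的动作片段
negatives = [rng.standard_normal(16) for _ in range(8)]  # 代替速度偏移/节拍抖动负样本
loss = info_nce(anchor, positive, negatives)
print(loss < 0.5)  # 对齐良好的正样本对应较小的损失
```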


【65】What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
链接:https://arxiv.org/abs/2510.13232

作者:Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim
备注:38 pages
摘要:最先进的视觉语言模型(VLM)在理解否定方面存在严重缺陷,这通常被称为肯定偏见。这种局限在描述性对象检测(DOD)任务中尤为严重。为了解决这个问题,我们提出了两个主要贡献:(1)一个新的数据集管道和(2)一个新颖、轻量级的适配方案。首先,我们介绍了CoVAND,这是一个使用系统化的思维链(CoT)和基于VQA的管道构建的数据集,用于生成高质量、实例接地的否定数据。其次,我们提出NegToMe,一个新的文本令牌合并模块,直接处理肯定偏见的架构成因。NegToMe从根本上解决了否定线索在令牌化中的结构性丢失,将它们与属性组合成连贯的语义短语。它在输入层面保持正确的极性,即使在数据有限的情况下也能实现稳健的否定理解。例如,为了防止模型把碎片化的令牌"not"和"girl"简单地当作"girl",NegToMe将它们绑定为单个令牌,其含义与单独的"girl"正确区分开来。该模块与一种参数高效的策略性LoRA微调方法相集成。我们的方法显著提高了在具有挑战性的否定基准上的性能并降低了误报率,使NMS-AP在OVDEval上提高了多达+10.8个点,并展示了对SoTA VLM的泛化能力。这项工作标志着在面向真实世界检测应用的否定理解方面迈出了关键一步。
摘要:State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
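摘要中"把否定词与其后的属性绑定为单个标记"的思路,可以用如下极简 Python 草图直观表达(纯字符串层面的示意;真实的 NegToMe 工作在子词 token 与模型内部表示层面,此处的词表与函数名均为本文假设):

```python
NEGATION_CUES = {"no", "not", "without", "never"}

def merge_negation_tokens(tokens):
    """把否定词与其后一个词绑定为单个 token,保持输入层面的正确极性。"""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])  # 例如 "not_girl"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_negation_tokens(["a", "person", "not", "girl"]))
```

这样,下游模块看到的是整体短语 "not_girl",而不是可能被单独匹配的 "girl"。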


【66】An Analytical Framework to Enhance Autonomous Vehicle Perception for Smart Cities
标题:增强智能城市自动驾驶汽车感知的分析框架
链接:https://arxiv.org/abs/2510.13230

作者:Jalal Khan, Manzoor Khan, Sherzod Turaev, Sumbal Malik, Hesham El-Sayed, Farman Ullah
备注:32 pages, 14 figures
摘要:驾驶环境感知对于自动驾驶有着至关重要的作用,目前正在积极探索其实现。研究界和相关利益相关者需要开发深度学习(DL)模型和AI赋能的解决方案,以增强面向智能移动的自动驾驶汽车(AV)。有必要开发一种模型,能够准确感知道路上的多个物体,并预测驾驶员的感知,以控制汽车的运动。本文提出了一种新的基于效用的分析模型,使自动驾驶汽车的感知系统能够理解驾驶环境。本文由以下模块组成:获取包含独特对象(如摩托车手、人力车等)的自定义数据集;一个基于DL的对象检测模型(YOLOv8s);以及一个根据训练模型实例的性能值来衡量感知服务效用的模块。感知模型基于对象检测任务进行验证,其过程以最先进深度学习模型在nuScenes数据集上的性能指标为基准。实验结果显示了基于mAP@0.5值的三个性能最佳的YOLOv8s实例,即基于SGD的(0.832)、基于Adam的(0.810)和基于AdamW的(0.822)。然而,基于AdamW的模型(汽车:0.921,摩托车:0.899,卡车:0.793等)仍然优于基于SGD的模型(汽车:0.915,摩托车:0.892,卡车:0.781等),因为它具有更好的类级性能值,这一点由所提出的感知模型证实。我们验证了所提出的效用函数能够为AV找到合适的感知。上述结果鼓励使用所提出的感知模型来评估学习模型的效用,并为AV确定合适的感知。
摘要:The driving environment perception has a vital role for autonomous driving and nowadays has been actively explored for its realization. The research community and relevant stakeholders necessitate the development of Deep Learning (DL) models and AI-enabled solutions to enhance autonomous vehicles (AVs) for smart mobility. There is a need to develop a model that accurately perceives multiple objects on the road and predicts the driver's perception to control the car's movements. This article proposes a novel utility-based analytical model that enables perception systems of AVs to understand the driving environment. The article consists of modules: acquiring a custom dataset having distinctive objects, i.e., motorcyclists, rickshaws, etc; a DL-based model (YOLOv8s) for object detection; and a module to measure the utility of perception service from the performance values of trained model instances. The perception model is validated based on the object detection task, and its process is benchmarked by state-of-the-art deep learning models' performance metrics from the nuScenes dataset. The experimental results show three best-performing YOLOv8s instances based on mAP@0.5 values, i.e., SGD-based (0.832), Adam-based (0.810), and AdamW-based (0.822). However, the AdamW-based model (i.e., car: 0.921, motorcyclist: 0.899, truck: 0.793, etc.) still outperforms the SGD-based model (i.e., car: 0.915, motorcyclist: 0.892, truck: 0.781, etc.) because it has better class-level performance values, confirmed by the proposed perception model. We validate that the proposed function is capable of finding the right perception for AVs. The results above encourage using the proposed perception model to evaluate the utility of learning models and determine the appropriate perception for AVs.


【67】EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
标题:EvoTest:用于自改进智能体系统的进化式测试时学习
链接:https://arxiv.org/abs/2510.13220

作者:Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
摘要:当前人工智能智能体的一个根本局限是它们无法在测试时动态学习复杂技能,在新环境中通常表现得像"聪明但无知的实习生"。这严重限制了它们的实用性。为了系统地衡量和推动这一挑战的进展,我们首先引入了Jericho测试时学习(J-TTL)基准。J-TTL是一种新的评估设置,其中智能体必须连续多集玩同一游戏,尝试从一集到下一集提高其性能。在J-TTL上,我们发现现有的自适应方法,如反思、记忆或强化学习,都表现不佳。为了应对我们的基准所带来的挑战,我们提出了EvoTest,一种进化式测试时学习框架,它无需任何微调或梯度,通过在每一集结束后演化整个智能体系统来改进智能体。EvoTest包含两个角色:玩游戏的Actor Agent,以及分析剧集记录并为下一次运行提出修订配置的Evolver Agent。该配置会重写提示词,通过记录有效的状态-动作选择来更新记忆,调整超参数,并学习工具使用例程。在我们的J-TTL基准测试中,EvoTest始终能提高性能,不仅优于反思和仅记忆的基线,还优于更复杂的在线微调方法。值得注意的是,我们的方法是唯一能够赢得两款游戏(Detective和Library)的方法,而所有基线都未能赢得任何一款。
摘要:A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients-by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
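Actor/Evolver 的交替循环可以抽象成如下 Python 玩具框架(环境、打分与"进化"规则均为本文虚构,仅示意"每集结束后由 Evolver 重写配置"的流程,并非论文实现):

```python
import random

def actor_play(config, rng):
    """Actor Agent:用当前配置玩一集,返回得分与剧集记录(这里用随机数模拟环境)。"""
    score = config["skill"] + rng.random()
    transcript = {"score": score}
    return score, transcript

def evolver_update(config, transcript):
    """Evolver Agent:分析剧集记录,为下一次运行提出修订后的配置(示意:简单爬山)。"""
    new_config = dict(config)
    new_config["skill"] = config["skill"] + 0.1  # 真实系统中:重写提示、更新记忆、调超参
    return new_config

rng = random.Random(0)
config, best = {"skill": 0.0}, 0.0
for episode in range(5):               # 同一游戏连续多集(J-TTL 设定)
    score, transcript = actor_play(config, rng)
    best = max(best, score)
    config = evolver_update(config, transcript)
```

关键点在于:优化对象是整个智能体配置(提示、记忆、超参数),而非模型权重,因此全程不需要梯度。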


【68】Personalized Learning Path Planning with Goal-Driven Learner State Modeling
标题:基于目标驱动学习者状态建模的个性化学习路径规划
链接:https://arxiv.org/abs/2510.13215

作者:Joy Jia Yin Lim, Ye He, Jifan Yu, Xin Cong, Daniel Zhang-Li, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
摘要:个性化学习路径规划(PLPP)旨在设计符合个人目标的自适应学习路径。虽然大型语言模型(LLM)在个性化学习体验方面显示出潜力,但现有方法往往缺乏目标对齐的规划机制。我们介绍Pxplore,一个新颖的PLPP框架,它集成了基于强化的训练范式和LLM驱动的教育架构。我们设计了一个结构化的学习者状态模型和一个自动奖励函数,将抽象目标转换为可计算的信号。我们结合监督微调(SFT)和组相对策略优化(GRPO)来训练策略,并将其部署在真实世界的学习平台中。大量实验验证了Pxplore在产生连贯、个性化且目标驱动的学习路径方面的有效性。我们发布了代码和数据集,以方便未来的研究。
摘要:Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore's effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.


【69】Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning
标题:自适应推理执行器:一种高效推理的协作代理系统
链接:https://arxiv.org/abs/2510.13214

作者:Zehui Ling, Deshu Chen, Yichi Zhang, Yuchen Liu, Xigui Li, Xin Guo, Yuan Cheng
摘要:大型语言模型(LLM)的最新进展表明,思想链提示和深度推理大大提高了复杂任务的性能,多智能体系统可以通过启用模型辩论来进一步提高准确性。然而,将深度推理应用于所有问题在计算上是昂贵的。为了降低这些成本,我们提出了一个集成小型和大型LLM的互补代理系统。小LLM首先生成初始答案,然后由大LLM验证。如果正确,则直接采用答案;否则,大型LLM执行深入推理。实验结果表明,对于简单的问题,我们的方法将大型LLM的计算成本降低了50%以上,精度损失可以忽略不计,同时在复杂任务上始终保持稳健的性能。
摘要:Recent advances in Large Language Models (LLMs) demonstrate that chain-of-thought prompting and deep reasoning substantially enhance performance on complex tasks, and multi-agent systems can further improve accuracy by enabling model debates. However, applying deep reasoning to all problems is computationally expensive. To mitigate these costs, we propose a complementary agent system integrating small and large LLMs. The small LLM first generates an initial answer, which is then verified by the large LLM. If correct, the answer is adopted directly; otherwise, the large LLM performs in-depth reasoning. Experimental results show that, for simple problems, our approach reduces the computational cost of the large LLM by more than 50% with negligible accuracy loss, while consistently maintaining robust performance on complex tasks.
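这种"小模型先答、大模型验证、验证失败再深度推理"的级联流程可用如下 Python 玩具示例说明(其中的模型与验证器均为本文假设的占位函数,仅示意控制流):

```python
def cascade_answer(question, small_llm, large_llm, verifier):
    """小模型先生成初始答案;大模型验证通过则直接采用,否则由大模型深度推理。"""
    draft = small_llm(question)
    if verifier(question, draft):
        return draft, "small"          # 省去大模型的深度推理开销
    return large_llm(question), "large"

# 玩具模型:小模型只会算加法,大模型什么都会(用 eval 模拟)
small = lambda q: eval(q) if "+" in q else None
large = lambda q: eval(q)
verify = lambda q, a: a is not None and a == eval(q)

print(cascade_answer("1+2", small, large, verify))   # 简单题走小模型
print(cascade_answer("3*4", small, large, verify))   # 复杂题回退到大模型
```

对简单问题而言,大模型只承担一次验证调用,这正是摘要中计算成本降低 50% 以上的来源。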


【70】MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
标题:MimicParts:用于语音驱动3D运动生成的部件感知风格注入
链接:https://arxiv.org/abs/2510.13208

作者:Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, Xiangmin Xu
摘要:从语音信号生成风格化的3D人体运动面临实质性挑战,主要是由于语音信号、个人风格和相应身体运动之间复杂而细粒度的关系。当前的风格编码方法要么过度简化风格多样性,要么忽略区域性运动风格差异(例如,上半身与下半身),限制了运动的真实感。此外,运动风格应当动态适应语音节奏和情感的变化,但现有方法往往忽略了这一点。为了解决这些问题,我们提出了MimicParts,一种基于部件感知风格注入和部件感知去噪网络的新型框架,旨在增强风格化运动生成。它将身体划分为不同的区域来编码局部运动风格,使模型能够捕捉细粒度的区域差异。此外,我们的部件感知注意力块允许节奏和情感线索精确地引导每个身体区域,确保生成的运动与语音节奏和情绪状态的变化保持一致。实验结果表明,我们的方法优于现有方法,生成的3D人体运动序列自然且富有表现力。
摘要:Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, showcasing natural and expressive 3D human motion sequences.


【71】CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection
标题:CleverCatch:一种知识引导的欺诈检测弱监督模型
链接:https://arxiv.org/abs/2510.13205

作者:Amirhossein Mozafari, Kourosh Hashemi, Erfan Shafagh, Soroush Motamedi, Azar Taheri Tayebi, Mohammad A. Tayebi
摘要:由于标记数据的可用性有限、欺诈策略不断演变以及医疗记录的高维度,医疗欺诈检测仍然是一个严峻的挑战。传统的监督方法受到极端标签稀缺的制约,而纯粹的无监督方法往往无法捕获临床上有意义的异常。在这项工作中,我们介绍CleverCatch,一个知识引导的弱监督模型,旨在以更高的准确性和可解释性检测欺诈性处方行为。我们的方法将结构化的领域专业知识集成到一个神经架构中,该架构在共享的嵌入空间中对齐规则和数据样本。通过在同时代表合规和违规的合成数据上联合训练编码器,CleverCatch学习到可泛化至复杂真实数据集的软规则嵌入。这种混合设计使数据驱动的学习能够通过领域知识约束得到增强,从而弥合专家启发式规则与机器学习之间的差距。在大规模真实世界数据集上的实验表明,CleverCatch优于四种最先进的异常检测基线,AUC平均提高1.3%,召回率平均提高3.4%。我们的消融研究进一步突出了专家规则的互补作用,证实了框架的适应性。结果表明,将专家规则嵌入学习过程不仅可以提高检测准确性,还可以提高透明度,为医疗欺诈检测等高风险领域提供了一种可解释的方法。
摘要:Healthcare fraud detection remains a critical challenge due to limited availability of labeled data, constantly evolving fraud tactics, and the high dimensionality of medical records. Traditional supervised methods are challenged by extreme label scarcity, while purely unsupervised approaches often fail to capture clinically meaningful anomalies. In this work, we introduce CleverCatch, a knowledge-guided weak supervision model designed to detect fraudulent prescription behaviors with improved accuracy and interpretability. Our approach integrates structured domain expertise into a neural architecture that aligns rules and data samples within a shared embedding space. By training encoders jointly on synthetic data representing both compliance and violation, CleverCatch learns soft rule embeddings that generalize to complex, real-world datasets. This hybrid design enables data-driven learning to be enhanced by domain-informed constraints, bridging the gap between expert heuristics and machine learning. Experiments on the large-scale real-world dataset demonstrate that CleverCatch outperforms four state-of-the-art anomaly detection baselines, yielding average improvements of 1.3% in AUC and 3.4% in recall. Our ablation study further highlights the complementary role of expert rules, confirming the adaptability of the framework. The results suggest that embedding expert rules into the learning process not only improves detection accuracy but also increases transparency, offering an interpretable approach for high-stakes domains such as healthcare fraud detection.


【72】LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems
标题:LLM引导的合成增强(LGSA)用于缓解人工智能系统中的偏差
链接:https://arxiv.org/abs/2510.13202

作者:Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G
备注:11 pages, 4 figures, 1 Table, submitted to an international conference
摘要:人工智能系统中的偏见,特别是那些依赖自然语言数据的系统,引发了伦理和实际问题。某些群体的代表性不足往往导致各人口群体间的性能不均衡。传统的公平性方法,如预处理、处理中和后处理,依赖受保护属性标签,涉及准确性-公平性权衡,并且可能无法跨数据集泛化。为了应对这些挑战,我们提出了LLM引导的合成增强(LGSA),它使用大型语言模型为代表性不足的群体生成反事实示例,同时保持标签完整性。我们在一个受控数据集上评估LGSA,该数据集由带有性别代词、职业和二分类标签的英语短句构成。使用结构化提示生成性别对换的释义,随后进行质量控制,包括语义相似度检查、属性验证、毒性筛查和人工抽查。增强后的数据集扩大了训练覆盖面,并用于在一致条件下训练分类器。结果表明,LGSA在不影响准确性的情况下减少了性能差异。基线模型的准确率为96.7%,性别偏见差距为7.2%。简单的交换增强将差距缩小到0.7%,但将准确率降至95.6%。LGSA实现了99.1%的准确率和1.9%的偏见差距,并提高了女性标注示例上的性能。这些发现表明,LGSA是一种有效的偏见缓解策略,在保持高任务准确性和标签保真度的同时增强了子群体平衡。
摘要:Bias in AI systems, especially those relying on natural language data, raises ethical and practical concerns. Underrepresentation of certain groups often leads to uneven performance across demographics. Traditional fairness methods, such as pre-processing, in-processing, and post-processing, depend on protected-attribute labels, involve accuracy-fairness trade-offs, and may not generalize across datasets. To address these challenges, we propose LLM-Guided Synthetic Augmentation (LGSA), which uses large language models to generate counterfactual examples for underrepresented groups while preserving label integrity. We evaluated LGSA on a controlled dataset of short English sentences with gendered pronouns, professions, and binary classification labels. Structured prompts were used to produce gender-swapped paraphrases, followed by quality control including semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The augmented dataset expanded training coverage and was used to train a classifier under consistent conditions. Results show that LGSA reduces performance disparities without compromising accuracy. The baseline model achieved 96.7 percent accuracy with a 7.2 percent gender bias gap. Simple swap augmentation reduced the gap to 0.7 percent but lowered accuracy to 95.6 percent. LGSA achieved 99.1 percent accuracy with a 1.9 percent bias gap, improving performance on female-labeled examples. These findings demonstrate that LGSA is an effective strategy for bias mitigation, enhancing subgroup balance while maintaining high task accuracy and label fidelity.
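摘要中作为基线的"简单交换增强"可以用如下 Python 草图复现思路(固定词表替换仅为示意;LGSA 本身由 LLM 生成释义并附加语义相似度、毒性等质量控制,此处的词表为本文假设):

```python
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his",
        "man": "woman", "woman": "man"}

def gender_swap(sentence):
    """生成性别对换的反事实句子,分类标签保持不变。"""
    return " ".join(SWAP.get(tok.lower(), tok) for tok in sentence.split())

def augment(dataset):
    # 每条样本追加一个反事实副本,扩大代表性不足群体的训练覆盖
    return dataset + [(gender_swap(x), y) for x, y in dataset]

data = [("he is a nurse", 1)]
print(augment(data))
```

LGSA 与此的区别在于:替换由 LLM 以释义方式完成,且每条生成样本都要通过质量控制才进入训练集。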


【73】Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences
标题:Paper Copilot:跟踪人工智能会议中同行评审的演变
链接:https://arxiv.org/abs/2510.13201

作者:Jing Yang, Qiyao Wei, Jiaxin Pei
摘要:人工智能会议的快速增长正在使本已脆弱的同行评审系统变得更加紧张,导致评审员工作量繁重、专业知识不匹配、评估标准不一致、肤浅或模板化的评审,以及在压缩的时间表下有限的问责。作为回应,会议组织者推出了新的政策和干预措施,以维持评审标准。然而,这些临时性的变化往往会造成对评审过程的进一步担忧和困惑,使论文最终如何被接收,以及实践如何逐年演变,在很大程度上不透明。我们介绍了Paper Copilot,一个在广泛的计算机科学发表场所创建持久的同行评审数字档案的系统,一个使研究人员能够大规模研究同行评审的开放数据集,以及对跨越多年的ICLR评审的大规模实证分析。通过发布基础设施和数据集,Paper Copilot支持对同行评审演变的可复现研究。我们希望这些资源可以帮助社区跟踪变化,诊断故障模式,并为基于证据的改进提供信息,以建立一个更强大、透明和可靠的同行评审系统。
摘要:The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.


【74】Emotional Cognitive Modeling Framework with Desire-Driven Objective Optimization for LLM-empowered Agent in Social Simulation
标题:社交模拟中LLM授权代理的具有愿望驱动目标优化的情感认知建模框架
链接:https://arxiv.org/abs/2510.13195

作者:Qun Ma, Xiao Xue, Xuwen Zhang, Zihan Zhao, Yuwei Guo, Ming Zhang
摘要:大型语言模型(LLM)的出现使智能体能够在社会模拟中代表虚拟人,促进复杂社会系统中的多样化交互。然而,现有的基于LLM的智能体在情感认知方面表现出严重的局限性:它们无法模拟桥接虚拟与现实世界服务所必需的有限理性;它们缺乏经过实证验证的、将情感嵌入智能体决策架构的集成机制。本文构建了一个结合愿望生成和目标管理的情感认知框架,旨在实现基于LLM的智能体与人类之间的情感对齐,建模基于LLM的智能体的完整决策过程,包括状态演化、愿望生成、目标优化、决策生成和行动执行。本研究在我们专有的多智能体交互环境中实现了所提出的框架。实验结果表明,受我们框架管理的智能体不仅表现出与其情绪状态相一致的行为,而且在与其他智能体类型的比较评估中表现出更优的生态效度,并产生明显更接近人类行为模式的决策结果。
摘要:The advent of large language models (LLMs) has enabled agents to represent virtual humans in societal simulations, facilitating diverse interactions within complex social systems. However, existing LLM-based agents exhibit severe limitations in affective cognition: They fail to simulate the bounded rationality essential for bridging virtual and real-world services; They lack empirically validated integration mechanisms embedding emotions within agent decision architectures. This paper constructs an emotional cognition framework incorporating desire generation and objective management, designed to achieve emotion alignment between LLM-based agents and humans, modeling the complete decision-making process of LLM-based agents, encompassing state evolution, desire generation, objective optimization, decision generation, and action execution. This study implements the proposed framework within our proprietary multi-agent interaction environment. Experimental results demonstrate that agents governed by our framework not only exhibit behaviors congruent with their emotional states but also, in comparative assessments against other agent types, demonstrate superior ecological validity and generate decision outcomes that significantly more closely approximate human behavioral patterns.


【75】StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
标题:StressTransfer:保留重音的重音感知语音到语音翻译
链接:https://arxiv.org/abs/2510.13194

作者:Xi Chen, Yuchen Song, Satoshi Nakamura
摘要:我们提出了一个重音感知的语音到语音翻译(S2ST)系统,通过利用LLM进行跨语言重音转换来保留词级重音。我们的方法将源语言的重音转化为目标语言的标签,用以引导可控的TTS模型。为了克服数据稀缺,我们开发了一个自动生成对齐训练数据的管道,并引入"LLM-as-Judge"进行评估。实验表明,我们的方法在保留重音方面大大优于基线,同时保持可比的翻译质量、说话人意图和自然度。我们的工作突出了韵律在翻译中的重要性,并为在S2ST中保留副语言线索提供了一个有效且数据高效的解决方案。
摘要:We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we developed a pipeline to automatically generate aligned training data and introduce the "LLM-as-Judge" for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.


【76】Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
标题:程序的行为嵌入:优化预测的准动态方法
链接:https://arxiv.org/abs/2510.13158

作者:Haolin Pan, Jinyuan Dong, Hongbin Zhang, Hongyu Lin, Mingjie Xing, Yanjun Wu
摘要:学习程序的有效数值表示(即嵌入)是应用机器学习来自动化和增强编译器优化的基本前提。然而,流行的范式呈现出一种两难局面。静态表示源自源代码或中间表示(IR),高效且确定,但对程序在复杂代码变换下将如何表现或演化提供的洞察有限。相反,依赖运行时剖析的动态表示能深刻揭示性能瓶颈,但由于高昂的开销和固有的非确定性,对于大规模任务通常不切实际。本文超越了这种权衡,提出了一种新颖的准动态程序表示框架。其核心思想是对程序的优化敏感性进行建模。我们引入了程序行为谱(Program Behavior Spectrum),这是一种新表示,通过用一组多样的优化序列探测程序的IR并量化其静态特征的变化而生成。为了有效编码这种高维连续谱,我们开创了一种组合式学习方法:采用乘积量化(Product Quantization)将连续的反应向量离散化为结构化的组合式子词;随后预训练一个名为PQ-BERT的多任务Transformer模型,以学习这些行为代码的深层上下文语法。在两个有代表性的编译器优化任务(最佳Pass预测和-Oz收益预测)上的综合实验表明,我们的方法优于最先进的静态基线。我们的代码可在https://github.com/Panhaolin2001/PREP/上公开获取。
摘要:Learning effective numerical representations, or embeddings, of programs is a fundamental prerequisite for applying machine learning to automate and enhance compiler optimization. Prevailing paradigms, however, present a dilemma. Static representations, derived from source code or intermediate representation (IR), are efficient and deterministic but offer limited insight into how a program will behave or evolve under complex code transformations. Conversely, dynamic representations, which rely on runtime profiling, provide profound insights into performance bottlenecks but are often impractical for large-scale tasks due to prohibitive overhead and inherent non-determinism. This paper transcends this trade-off by proposing a novel quasi-dynamic framework for program representation. The core insight is to model a program's optimization sensitivity. We introduce the Program Behavior Spectrum, a new representation generated by probing a program's IR with a diverse set of optimization sequences and quantifying the resulting changes in its static features. To effectively encode this high-dimensional, continuous spectrum, we pioneer a compositional learning approach. Product Quantization is employed to discretize the continuous reaction vectors into structured, compositional sub-words. Subsequently, a multi-task Transformer model, termed PQ-BERT, is pre-trained to learn the deep contextual grammar of these behavioral codes. Comprehensive experiments on two representative compiler optimization tasks -- Best Pass Prediction and -Oz Benefit Prediction -- demonstrate that our method outperforms state-of-the-art static baselines. Our code is publicly available at https://github.com/Panhaolin2001/PREP/.
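其中的乘积量化(Product Quantization)步骤,即把高维行为向量切成子段、逐段量化为最近码字的索引,可以用如下 Python 草图说明(码本为手造的玩具数据,并非论文所用的训练结果):

```python
def product_quantize(vec, codebooks):
    """乘积量化:向量切成 len(codebooks) 段,每段用最近码字的索引表示,
    得到一串离散"子词"(可作为 PQ-BERT 这类模型的输入词)。"""
    m = len(codebooks)
    d = len(vec) // m
    code = []
    for i, book in enumerate(codebooks):
        seg = vec[i * d:(i + 1) * d]
        # 选与该子段欧氏距离最近的码字
        nearest = min(range(len(book)),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(seg, book[j])))
        code.append(nearest)
    return code

books = [[[0.0, 0.0], [1.0, 1.0]],   # 子空间 1 的码本
         [[0.0, 1.0], [1.0, 0.0]]]   # 子空间 2 的码本
print(product_quantize([0.9, 1.1, 0.1, 0.8], books))
```

每个子空间独立量化,因此 m 段、每段 k 个码字即可表达 k^m 种组合,这正是"组合式子词"的含义。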


【77】Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval
标题:金融推理的思想程序:利用动态上下文示例与生成式检索
链接:https://arxiv.org/abs/2510.13157

作者:Subhendu Khatuya, Shashwat Naidu, Pawan Goyal, Niloy Ganguly
备注:This work has been accepted for publication in the Main Conference of the Empirical Methods in Natural Language Processing (EMNLP) 2025
摘要:尽管大型语言模型(LLM)的能力不断进步,数值推理仍然是一个具有挑战性的领域。思想链提示、思想树提示和思想程序提示等技术引导LLM经过中间推理步骤。尽管带少样本提示的上下文学习提高了性能,LLM在FinQA和ConvFinQA等金融数值推理数据集上仍落后于最先进的模型。在这项工作中,我们引入了FINDER,一个新颖的两步框架,以增强LLM的金融数值推理能力。第一步利用生成式检索器从包括文本和表格在内的非结构化数据中提取相关事实。随后是带有动态上下文示例选择的上下文感知思想程序(Program of Thought)提示。我们的模型FINDER在FinQA和ConvFinQA数据集上均取得了新的最先进性能,执行精度分别比之前的基准提高了5.98%和4.05%。
摘要:Despite continuous advancements in the capabilities of large language models (LLMs), numerical reasoning remains a challenging area. Techniques like chain-of-thought prompting, tree-of-thought prompting, and program-of-thought prompting guide LLMs through intermediate reasoning steps. Although in-context learning with few-shot prompting has improved performance, LLMs still lag behind state-of-the-art models on financial numerical reasoning datasets such as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step framework, to enhance LLMs' capabilities in financial numerical reasoning. The first step utilizes a generative retriever to extract relevant facts from unstructured data, including both text and tables. This is followed by context-aware Program of Thought prompting with dynamic selection of in-context examples. Our model FINDER achieves a new state-of-the-art performance on both the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution accuracy improvements of 5.98% and 4.05%, respectively.


【78】Stable LLM Ensemble: Interaction between Example Representativeness and Diversity
标题:稳定的LLM集成:示例代表性与多样性之间的交互
链接:https://arxiv.org/abs/2510.13143

作者:Junichiro Niimi
摘要:大型语言模型(LLM)已在广泛的领域取得了显著成果。然而,one-shot LLM预测的准确性和鲁棒性仍对示例及集成成员之间的多样性高度敏感。本研究系统地考察了示例代表性(one-shot策略)和输出多样性(采样温度)对LLM集成性能的影响。比较了两种one-shot策略:基于质心的代表性示例(所提方法)和随机采样的示例(基线),并同时改变采样温度。所提方法在较高温度设置下显著优于随机选择,分别为+7.6%(macro-F1)和-10.5%(RMSE)。此外,所提模型超过5-shot提示+21.1%(macro-F1)和-24.0%(RMSE)。我们的研究结果表明,将代表性示例选择与升高的温度相结合,可为集成提供适当水平的多样性。这项工作突出了示例选择和受控多样性在设计有效的one-shot LLM集成中的实际重要性。
摘要:Large language models (LLMs) have achieved remarkable results in wide range of domains. However, the accuracy and robustness of one-shot LLM predictions remain highly sensitive to the examples and the diversity among ensemble members. This study systematically investigates the effects of example representativeness (one-shot strategy) and output diversity (sampling temperature) on LLM ensemble performance. Two one-shot strategies are compared: centroid-based representative examples (proposed) and randomly sampled examples (baseline) and sampling temperature also is varied. The proposed approach with higher temperature setting significantly outperforms random selection by +7.6% (macro-F1) and -10.5% (RMSE). Furthermore, the proposed model exceeds 5-shot prompting by +21.1% (macro-F1) and -24.0% (RMSE). Our findings demonstrate that combining representative example selection with increased temperature provides the appropriate level of diversity to the ensemble. This work highlights the practical importance of both example selection and controlled diversity in designing effective one-shot LLM ensembles.
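基于质心的代表性示例选择可以用如下 Python 草图说明(嵌入为手造的二维向量,仅示意"选离质心最近的样本作为 one-shot 示例"这一思路,并非论文实现):

```python
def centroid(vectors):
    """逐维求均值,得到候选示例嵌入的质心。"""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def pick_representative(examples):
    """examples: [(embedding, label), ...];返回离质心最近的样本。"""
    c = centroid([e for e, _ in examples])
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, c))
    return min(examples, key=lambda e: dist(e[0]))

data = [([0.0, 0.0], "a"), ([1.0, 1.0], "b"), ([0.4, 0.6], "c")]
print(pick_representative(data))
```

被选中的示例位于候选集合的"中心",因此比随机采样更可能代表整个数据分布。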


【79】On the Reasoning Abilities of Masked Diffusion Language Models
标题:掩蔽扩散语言模型的推理能力
链接:https://arxiv.org/abs/2510.13117

作者:Anej Svete, Ashish Sabharwal
摘要:文本的掩蔽扩散模型(MDM)为传统的自回归语言模型提供了一个令人信服的替代方案。并行生成使它们高效,但它们的计算能力以及其并行性固有的局限性在很大程度上仍未被探索。为此,我们刻画了MDM可以被证明能解决哪些类型的推理问题,以及解决的效率如何。我们通过在有限精度对数宽度设置下,将MDM与思想链(CoT)和填充循环Transformer(PLT)这两个已被充分理解的推理框架联系起来:我们证明在这一设置下MDM与多项式填充的PLT实际上是等价的,并且MDM可以解决CoT增强的Transformer所能解决的所有问题。此外,我们展示了若干问题类别(包括正则语言),在这些类别上MDM本质上比CoT Transformer更高效,并行生成允许明显更快的推理。
摘要:Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.


【80】Multi-Label Clinical Text Eligibility Classification and Summarization System
标题:多标签临床文本资格分类和汇总系统
链接:https://arxiv.org/abs/2510.13115

作者:Surya Tejaswi Yerramsetty, Almas Fathimah
摘要:临床试验是医学进步的核心,因为它们有助于提高对人类健康和医疗保健系统的理解。它们在发现检测、预防或治疗疾病的新方法方面发挥着关键作用,临床试验必须包括具有适当和多样化医学背景的参与者。在本文中,我们提出了一个利用自然语言处理(NLP)和大型语言模型(LLM)来自动化多标签临床文本资格分类和摘要的系统。该系统结合了诸如词嵌入(Word2Vec)和命名实体识别等特征提取方法来识别相关的医学概念,以及传统的矢量化技术,如计数矢量化和TF-IDF(词频-逆文档频率)。我们进一步探索加权TF-IDF词嵌入,它集成了基于计数和基于嵌入的优势,以有效地捕获术语重要性。使用随机森林和SVM模型的多标签分类应用于基于资格标准的文档分类。对包括TextRank、Luhn和GPT-3在内的汇总技术进行评估,以简明扼要地汇总合格性要求。ROUGE评分的评价表明了所提出的方法的有效性。该系统显示出使用数据驱动方法自动进行临床试验合格性评估的潜力,从而提高研究效率。
摘要:Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.
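摘要中"加权 TF-IDF 词嵌入"(以 TF-IDF 为权重对词向量加权平均得到文档向量)的思路可以用如下 Python 草图说明(IDF 的平滑方式与玩具语料均为本文假设):

```python
import math

def tfidf_weighted_embedding(doc_tokens, corpus, embed):
    """文档向量 = 各词向量按 TF-IDF 权重加权平均(示意实现)。"""
    n = len(corpus)
    def idf(t):
        df = sum(1 for d in corpus if t in d)
        return math.log((1 + n) / (1 + df)) + 1.0   # 平滑 IDF
    tf = {t: doc_tokens.count(t) / len(doc_tokens) for t in set(doc_tokens)}
    dim = len(next(iter(embed.values())))
    vec, total = [0.0] * dim, 0.0
    for t, f in tf.items():
        w = f * idf(t)                              # TF-IDF 权重
        total += w
        vec = [v + w * e for v, e in zip(vec, embed[t])]
    return [v / total for v in vec]

corpus = [["trial", "cancer"], ["trial", "diabetes"]]
emb = {"trial": [1.0, 0.0], "cancer": [0.0, 1.0]}
v = tfidf_weighted_embedding(["trial", "cancer"], corpus, emb)
```

由于 "cancer" 只出现在一篇文档中,其 IDF 更高,对应维度在文档向量中的权重也更大,这正是"按术语重要性加权"的效果。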


【81】DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models
标题:DriveCritic:利用视觉语言模型实现自动驾驶的上下文感知、人性化评估
链接:https://arxiv.org/abs/2510.13108

作者:Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Katherine A. Skinner, Jose M. Alvarez
备注:9 pages, 3 figures
摘要:对自动驾驶规划者进行基准测试以与人类判断保持一致仍然是一个关键挑战,因为扩展预测驾驶员模型得分(EPDMS)等最先进的指标在细微的场景中缺乏上下文感知。为了解决这个问题,我们引入了DriveCritic,这是一个具有两个关键贡献的新框架:DriveCritic数据集,一个具有挑战性的场景集合,其中上下文对于正确判断至关重要,并使用成对的人类偏好进行注释,以及DriveCritic模型,一个基于视觉语言模型(VLM)的评估器。通过使用两阶段监督和强化学习管道进行微调,DriveCritic模型通过整合视觉和符号上下文来学习在轨迹对之间进行判断。实验表明,DriveCritic在匹配人类偏好方面明显优于现有的指标和基线,并表现出强大的上下文感知能力。总的来说,我们的工作为评估自动驾驶系统提供了一个更可靠、更人性化的基础。
摘要:Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation to evaluating autonomous driving systems.


【82】TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models
标题:TRUSTVIS:大型语言模型的多维可信度评估框架
链接:https://arxiv.org/abs/2510.13106

作者:Ruoyu Sun, Da Song, Jiayang Song, Yuheng Huang, Lei Ma
备注:4 pages, 2 figures, To appear in ASE 2025 Demo Track
摘要:随着大型语言模型(LLM)不断革新自然语言处理(NLP)应用,对其可信度的关键担忧仍然存在,特别是在安全性和鲁棒性方面。为了应对这些挑战,我们引入了TRUSTVIS,这是一个对LLM可信度进行全面评估的自动化评估框架。我们框架的一个关键特征是其交互式用户界面,旨在提供可信度指标的直观可视化。通过集成AutoDAN等知名扰动方法,并在多种评估方法之间采用多数投票,TRUSTVIS不仅提供可靠的结果,还使复杂的评估过程对用户变得易于理解。对Vicuna-7b、Llama2-7b和GPT-3.5等模型的初步案例研究证明了我们的框架在识别安全性和鲁棒性漏洞方面的有效性,而交互式界面允许用户详细探索结果,从而支持有针对性的模型改进。视频链接:https://youtu.be/k1TrBqNVg8g
摘要:As Large Language Models (LLMs) continue to revolutionize Natural Language Processing (NLP) applications, critical concerns about their trustworthiness persist, particularly in safety and robustness. To address these challenges, we introduce TRUSTVIS, an automated evaluation framework that provides a comprehensive assessment of LLM trustworthiness. A key feature of our framework is its interactive user interface, designed to offer intuitive visualizations of trustworthiness metrics. By integrating well-known perturbation methods like AutoDAN and employing majority voting across various evaluation methods, TRUSTVIS not only provides reliable results but also makes complex evaluation processes accessible to users. Preliminary case studies on models like Vicuna-7b, Llama2-7b, and GPT-3.5 demonstrate the effectiveness of our framework in identifying safety and robustness vulnerabilities, while the interactive interface allows users to explore results in detail, empowering targeted model improvements. Video Link: https://youtu.be/k1TrBqNVg8g
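摘要中提到的"在各种评估方法中采用多数表决"可以用一个最小示意来说明。论文未给出聚合的具体实现,下面的Python草图(判定标签 'safe'/'unsafe'、平票时保守判为不安全等细节均为本示例自拟)仅演示这种多数投票的基本形态:

```python
from collections import Counter

def majority_vote(verdicts):
    """在多个评估方法的判定之间做多数投票的示意实现。
    平票时保守地返回 'unsafe'(这一决策规则为示例假设)。"""
    counts = Counter(verdicts)
    if counts["unsafe"] >= counts["safe"]:
        return "unsafe"
    return "safe"

# 三个评估方法中两个判定安全 -> 整体判定安全
assert majority_vote(["safe", "safe", "unsafe"]) == "safe"
# 平票 -> 保守判定为不安全
assert majority_vote(["unsafe", "safe"]) == "unsafe"
```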


【83】ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models
标题:ESI:通过语义保留干预对大型语言模型进行认知不确定性量化
链接:https://arxiv.org/abs/2510.13103

作者:Mingda Li, Xinyu Li, Weinan Zhang, Longxuan Ma
摘要:不确定性量化(UQ)是提高模型可靠性的一种有前途的方法,但对大型语言模型(LLM)进行不确定性量化并非易事。在这项工作中,我们从因果视角出发,建立了LLM的不确定性与其在语义保留干预下的不变性之间的联系。在此基础上,我们提出了一种新的灰盒不确定性量化方法,该方法测量语义保留干预前后模型输出的变化。通过理论论证,我们表明该方法提供了对认知不确定性的有效估计。我们在多种LLM和多个问答(QA)数据集上进行的大量实验表明,该方法不仅有效,而且计算效率高。
摘要:Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.
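摘要中"测量干预前后模型输出的变化"这一思想可以用一个假设性示意来说明。论文未公开具体的变化度量,下面的Python草图用干预前后采样答案经验分布之间的总变差距离作为认知不确定性的代理指标(函数名与度量选择均为本示例自拟):

```python
from collections import Counter

def answer_distribution(answers):
    """由重复采样的答案得到经验分布。"""
    counts = Counter(answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def intervention_uncertainty(answers_before, answers_after):
    """示意性的认知不确定性代理指标:语义保留干预(如改写提示)
    前后答案分布的总变差距离。0 表示完全不变(模型自信),
    1 表示分布完全偏移(模型不确定)。"""
    p = answer_distribution(answers_before)
    q = answer_distribution(answers_after)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

# 干预后答案不变 -> 低不确定性;答案翻转 -> 高不确定性
assert intervention_uncertainty(["Paris"] * 5, ["Paris"] * 5) == 0.0
assert intervention_uncertainty(["Paris"] * 5, ["Lyon"] * 5) == 1.0
```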


【84】Agentic Discovery: Closing the Loop with Cooperative Agents
标题:代理式发现:与合作代理一起闭合循环
链接:https://arxiv.org/abs/2510.13081

作者:J. Gregory Pauloski, Kyle Chard, Ian T. Foster
备注:Published in IEEE Computer Volume 58 Issue 10
摘要:随着数据驱动的方法、人工智能(AI)和自动化工作流程加速科学任务,我们看到发现的速度越来越受限于人类的决策任务,例如设定目标、生成假设和设计实验。我们认为,需要合作代理来增强人类的作用并实现自主发现。实现这样的代理需要在人工智能和基础设施两方面取得进展。
摘要:As data-driven methods, artificial intelligence (AI), and automated workflows accelerate scientific tasks, we see the rate of discovery increasingly limited by human decision-making tasks such as setting objectives, generating hypotheses, and designing experiments. We postulate that cooperative agents are needed to augment the role of humans and enable autonomous discovery. Realizing such agents will require progress in both AI and infrastructure.


【85】Transformer-based Scalable Beamforming Optimization via Deep Residual Learning
标题:基于Transformer、借助深度残差学习的可扩展波束成形优化
链接:https://arxiv.org/abs/2510.13077

作者:Yubo Zhang, Xiao-Yang Liu, Xiaodong Wang
备注:7 pages, 5 figures
摘要:我们开发了一个用于大规模MU-MISO信道下行链路波束成形的无监督深度学习框架。该模型离线训练,允许在动态通信环境中通过轻量级前馈计算进行实时推理。遵循学习优化(L2O)范式,多层Transformer通过残差连接迭代地细化信道和波束成形器特征。为了增强训练,引入了三种策略:(i)课程学习(CL),以改善早期收敛并避免局部最优;(ii)半摊销学习,通过几个梯度上升步骤细化每个Transformer块;(iii)滑动窗口训练,通过一次只训练Transformer块的一个子集来稳定优化。大量仿真结果表明,所提方案在低至中等信噪比下优于现有基线,在高信噪比下接近WMMSE性能,同时实现了比迭代和在线学习方法快得多的推理。
摘要:We develop an unsupervised deep learning framework for downlink beamforming in large-scale MU-MISO channels. The model is trained offline, allowing real-time inference through lightweight feedforward computations in dynamic communication environments. Following the learning-to-optimize (L2O) paradigm, a multi-layer Transformer iteratively refines both channel and beamformer features via residual connections. To enhance training, three strategies are introduced: (i) curriculum learning (CL) to improve early-stage convergence and avoid local optima, (ii) semi-amortized learning to refine each Transformer block with a few gradient ascent steps, and (iii) sliding-window training to stabilize optimization by training only a subset of Transformer blocks at a time. Extensive simulations show that the proposed scheme outperforms existing baselines at low-to-medium SNRs and closely approaches WMMSE performance at high SNRs, while achieving substantially faster inference than iterative and online learning approaches.


【86】NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models
标题:NeuroRVQ:面向生成式大型脑电波模型的多尺度EEG令牌化
链接:https://arxiv.org/abs/2510.13068

作者:Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
摘要:脑电图(EEG)捕获跨多个时间和频谱尺度的神经活动,产生丰富但复杂的信号用于表示学习。最近,为预测掩蔽信号令牌而训练的EEG基础模型在学习可泛化表示方面显示出希望。然而,它们的性能受到其信号令牌化模块的阻碍:现有的神经标记器无法保留高频动态,限制了它们以高保真度重建EEG信号的能力。我们介绍了NeuroRVQ,一个以基于码本的标记器为核心的可扩展大型脑电波模型(LBM)。我们的标记器集成了:(i)捕获全频率神经频谱的多尺度特征提取模块;(ii)用于高分辨率编码的分层残差矢量量化(RVQ)码本;以及(iii)用于高效训练的EEG信号相位和幅度感知损失函数。该设计在支持所有频带准确重建的同时实现了高效的EEG压缩,从而实现了鲁棒的生成式掩蔽建模。我们的实证结果表明,NeuroRVQ实现了更低的重建误差,并在各种下游任务上优于现有的LBM。更广泛地说,NeuroRVQ标记器为基于码本的通用脑电波模型建立了强大的先验,有望推动神经解码、生成建模和多模态生物信号集成的进展。
摘要:Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
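摘要中的分层残差矢量量化(RVQ)可以用如下最小示意说明:每一级码本量化上一级留下的残差,从粗到细逐级编码。下面的Python草图中码本内容为示例虚构(实际模型中码本是学习得到的),仅演示编码/解码的基本机制:

```python
def nearest(codebook, vec):
    """返回与 vec 平方欧氏距离最近的码向量下标。"""
    def d2(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: d2(codebook[i]))

def rvq_encode(vec, codebooks):
    """分层残差VQ:每一级量化上一级留下的残差,得到由粗到细的码序列。"""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

def rvq_decode(codes, codebooks):
    """解码:把各级被选中的码向量逐级相加。"""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

# 两级示例码本:第一级粗量化,第二级细化残差
codebooks = [[[0.0, 0.0], [1.0, 1.0]],
             [[0.0, 0.0], [0.25, 0.25], [-0.25, -0.25]]]
codes = rvq_encode([1.2, 1.2], codebooks)
assert codes == [1, 1]                          # 粗码 [1,1],细码 [0.25,0.25]
assert rvq_decode(codes, codebooks) == [1.25, 1.25]  # 两级叠加后的重建
```

可以看到,增加码本级数能以相同的单级码本规模获得指数级更细的重建粒度,这正是RVQ用于高分辨率编码的动机。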


【87】True Self-Supervised Novel View Synthesis is Transferable
标题:真正的自监督新视角合成是可迁移的
链接:https://arxiv.org/abs/2510.13063

作者:Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann
摘要:在本文中,我们指出,判定一个模型是否真正具备新视角合成(NVS)能力的关键标准是可迁移性:从一个视频序列中提取的姿态表示能否用于在另一个序列中重新渲染相同的相机轨迹。我们分析了先前自监督NVS的工作,发现它们预测的姿态无法迁移:同一组姿态在不同的3D场景中会导致不同的相机轨迹。在此,我们提出了XFactor,第一个能够实现真正NVS的无几何自监督模型。XFactor将成对姿态估计与对输入和输出的简单增强方案相结合,共同实现从场景内容中解耦相机姿态并促进几何推理。值得注意的是,我们表明XFactor在不受约束的潜在姿态变量下实现了可迁移性,不依赖任何3D归纳偏差或多视图几何概念,例如将姿态显式参数化为SE(3)的元素。我们引入了一个量化可迁移性的新度量,并通过大规模实验证明XFactor显著优于先前的无姿态NVS Transformer,探测实验还表明潜在姿态与真实世界姿态高度相关。
摘要:In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.


【88】VLA-0: Building State-of-the-Art VLAs with Zero Modification
标题:VLA-0:构建零修改的最先进的VLA
链接:https://arxiv.org/abs/2510.13054

作者:Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, Fabio Ramos
摘要:视觉-语言-动作模型(VLA)为实现通才机器人操作提供了巨大的希望。然而,建造它们的最佳方法仍然是一个悬而未决的问题。目前的方法通常会增加复杂性,例如使用动作标记修改视觉语言模型(VLM)的现有词汇表或引入特殊的动作头部。奇怪的是,将动作直接表示为文本的最简单策略在很大程度上仍未得到探索。本工作引入VLA-0来研究这一想法。我们发现VLA-0不仅有效,而且令人惊讶地强大。通过正确的设计,VLA-0的性能优于更复杂的模型。在评估VLA的流行基准LIBERO上,VLA-0的性能优于在相同机器人数据上训练的所有现有方法,包括$\pi_0.5$-KI,OpenVLA-OFT和SmolVLA。此外,在没有大规模机器人特定训练的情况下,它优于在大规模机器人数据上训练的方法,如$\pi_0.5$-KI,$\pi_0$,GR00T-N1和MolmoAct。这些发现也转化为现实世界,其中VLA-0优于SmolVLA,SmolVLA是一种在大规模真实数据上预训练的VLA模型。本文总结了我们意想不到的发现,并阐明了解锁这种简单而有效的VLA设计的高性能所需的具体技术。可视化结果、代码和训练模型在这里提供:https://vla0.github.io/。
摘要:Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads. Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models. On LIBERO, a popular benchmark for evaluating VLAs, VLA-0 outperforms all existing methods trained on the same robotic data, including $\pi_0.5$-KI, OpenVLA-OFT and SmolVLA. Furthermore, without large-scale robotics-specific training, it outperforms methods trained on large-scale robotic data, like $\pi_0.5$-KI, $\pi_0$, GR00T-N1 and MolmoAct. These findings also translate to the real world, where VLA-0 outperforms SmolVLA, a VLA model pre-trained on large-scale real data. This paper summarizes our unexpected findings and spells out the specific techniques required to unlock the high performance of this simple yet potent VLA design. Visual results, code, and trained models are provided here: https://vla0.github.io/.


【89】Time-Varying Optimization for Streaming Data Via Temporal Weighting
标题:通过时间加权实现流数据时变优化
链接:https://arxiv.org/abs/2510.13052

作者:Muhammad Faraz Ul Abrar, Nicolò Michelusi, Erik G. Larsson
备注:Accepted at IEEE Asilomar, 2025
摘要:经典优化理论处理固定的、时不变的目标函数。然而,时变优化已经成为动态环境中决策的一个重要课题。在这项工作中,我们通过时变优化的视角研究从流数据中学习的问题。与以往专注于通用形式化的工作不同,我们引入了一种结构化的、基于权重的形式化,显式刻画了时变目标源自流数据这一事实:在每个时间步,智能体旨在最小化所有历史数据样本上的加权平均损失。我们关注两种具体的加权策略:(1)统一权重,平等对待所有样本;(2)折扣权重,按几何方式衰减旧数据的影响。对于这两种方案,我们推导出梯度下降(GD)更新下"跟踪误差"(TE)的紧界,TE定义为给定时间步模型参数与时变最优解之间的偏差。我们证明,在统一加权下,TE以$\mathcal{O}(1/t)$的速率渐近消失;而折扣加权会产生一个非零的误差底线,其大小由折扣因子和每个时间步执行的梯度更新次数控制。我们的理论结果通过数值模拟得到了验证。
摘要:Classical optimization theory deals with fixed, time-invariant objective functions. However, time-varying optimization has emerged as an important subject for decision-making in dynamic environments. In this work, we study the problem of learning from streaming data through a time-varying optimization lens. Unlike prior works that focus on generic formulations, we introduce a structured, \emph{weight-based} formulation that explicitly captures the streaming-data origin of the time-varying objective, where at each time step, an agent aims to minimize a weighted average loss over all the past data samples. We focus on two specific weighting strategies: (1) uniform weights, which treat all samples equally, and (2) discounted weights, which geometrically decay the influence of older data. For both schemes, we derive tight bounds on the ``tracking error'' (TE), defined as the deviation between the model parameter and the time-varying optimum at a given time step, under gradient descent (GD) updates. We show that under uniform weighting, the TE vanishes asymptotically with a $\mathcal{O}(1/t)$ decay rate, whereas discounted weighting incurs a nonzero error floor controlled by the discount factor and the number of gradient updates performed at each time step. Our theoretical findings are validated through numerical simulations.
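摘要描述的设定可以用一个标量最小二乘的玩具例子来示意:当每个样本的损失为 $(\theta - x_t)^2$ 时,折扣加权平均损失的时变最优解是历史样本的几何加权平均,跟踪误差(TE)衡量GD迭代与该时变最优的偏差。下面的Python草图(函数与参数均为示例自拟,并非论文实现)演示"每个时间步执行更多梯度更新会降低误差底线"这一定性结论:

```python
def discounted_optimum(samples, gamma):
    # 损失为 (theta - x_k)^2 时,折扣加权平均损失的最优解
    # 是历史样本的几何加权平均(越新的样本权重越大)。
    num = den = 0.0
    for k, x in enumerate(samples):
        w = gamma ** (len(samples) - 1 - k)
        num += w * x
        den += w
    return num / den

def tracking_errors(samples, gamma, lr, inner_steps):
    """在每个时间步对当前折扣加权目标执行 inner_steps 次GD更新,
    并记录与时变最优之间的跟踪误差。"""
    theta, errors = 0.0, []
    for t in range(1, len(samples) + 1):
        target = discounted_optimum(samples[:t], gamma)
        for _ in range(inner_steps):
            theta -= lr * 2.0 * (theta - target)  # 加权平均损失的梯度
        errors.append(abs(theta - target))
    return errors

data = [1.0, 0.0] * 10  # 在两个目标值之间交替的流数据
e1 = tracking_errors(data, gamma=0.9, lr=0.1, inner_steps=1)
e10 = tracking_errors(data, gamma=0.9, lr=0.1, inner_steps=10)
assert e10[-1] < e1[-1]  # 每个时间步更多的梯度更新 -> 更低的误差底线
```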


【90】SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion
标题:SceneAdapt:人类运动扩散的场景感知适应
链接:https://arxiv.org/abs/2510.13044

作者:Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim
备注:15 pages
摘要:人体运动本质上是多样且语义丰富的,同时也受到周围场景的影响。然而,现有的运动生成方法只能孤立地处理运动语义或场景感知,因为构建兼具丰富文本-运动覆盖和精确场景交互的大规模数据集极具挑战性。在这项工作中,我们介绍了SceneAdapt,一个利用不相交的场景-运动和文本-运动数据集、经由两个适应阶段将场景感知注入文本条件运动模型的框架:运动插值(inbetweening)与场景感知插值。其关键思想是将无需文本即可学习的运动插值作为代理任务来桥接两个不同的数据集,从而将场景感知注入文本到运动模型。在第一阶段,我们引入关键帧层,在保留潜在流形的同时调制用于插值的运动潜变量。在第二阶段,我们添加了一个场景条件层,通过交叉注意自适应地查询局部上下文来注入场景几何。实验结果表明,SceneAdapt有效地将场景感知注入文本到运动模型,我们还进一步分析了这种感知出现的机制。代码和模型将会发布。
摘要:Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text--motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene--motion and text--motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness to text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.


【91】SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models
标题:SeqBench:文本到视频模型中的顺序叙事生成基准
链接:https://arxiv.org/abs/2510.13042

作者:Zhengxu Tang, Zizheng Wang, Luning Wang, Zitao Shuai, Chenhao Zhang, Siyu Qian, Yirui Wu, Bohao Wang, Haosong Rao, Zhenyu Yang, Chenwei Wu
摘要:文本到视频(T2V)生成模型在创建视觉上吸引人的视频方面取得了重大进展。然而,它们难以生成需要跨多个事件进行逻辑推进的连贯顺序叙事。现有的T2V基准测试主要关注视觉质量指标,但未能评估扩展序列的叙事连贯性。为了弥补这一差距,我们提出了SeqBench,这是一个用于评估T2V生成顺序叙事连贯性的综合基准。SeqBench包括一个精心设计的数据集,包含320个跨越各种叙事复杂性的提示,以及由8个最先进的T2V模型生成的2,560个人工注释视频。此外,我们设计了一个基于动态时序图(DTG)的自动评估指标,它可以有效地捕捉长程依赖关系和时序顺序,同时保持计算效率。我们基于DTG的度量与人类注释具有很强的相关性。通过使用SeqBench进行系统评估,我们揭示了当前T2V模型的关键局限性:无法在多动作序列中保持一致的对象状态,在多对象场景中产生物理上不可信的结果,以及难以保持顺序动作之间的现实时序和顺序关系。SeqBench为评估T2V生成中的叙事连贯性提供了第一个系统框架,并为改善未来模型中的顺序推理能力提供了具体的见解。请参阅https://videobench.github.io/SeqBench.github.io/了解更多详情。
摘要:Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to https://videobench.github.io/SeqBench.github.io/ for more details.


【92】Randomness and Interpolation Improve Gradient Descent
标题:随机性和内插改善梯度下降
链接:https://arxiv.org/abs/2510.13040

作者:Jiawen Li, Pascal Lefevre, Anwar Pp Abdul Majeed
摘要:在随机梯度下降(SGD)的基础上,本文提出了两种优化器:插值加速梯度下降(IAGD)和噪声正则化随机梯度下降(NRSGD)。IAGD利用二阶牛顿插值来加快训练过程中的收敛,其前提假设是相邻迭代之间的梯度具有相关性。为了避免过拟合,NRSGD结合了噪声正则化技术,在优化过程中向梯度引入受控噪声。本研究在CIFAR-10和CIFAR-100数据集上进行了对比实验,将使用IAGD和NRSGD训练的不同CNN(卷积神经网络)与Keras包中的经典优化器进行基准比较。结果展示了这两种对SGD的可行改进方法的潜力,表明了这些改进的有效性。
摘要:Based on Stochastic Gradient Descent (SGD), the paper introduces two optimizers, named Interpolational Accelerating Gradient Descent (IAGD) as well as Noise-Regularized Stochastic Gradient Descent (NRSGD). IAGD leverages second-order Newton Interpolation to expedite the convergence process during training, assuming relevancy in gradients between iterations. To avoid over-fitting, NRSGD incorporates a noise regularization technique that introduces controlled noise to the gradients during the optimization process. Comparative experiments of this research are conducted on the CIFAR-10, and CIFAR-100 datasets, benchmarking different CNNs(Convolutional Neural Networks) with IAGD and NRSGD against classical optimizers in Keras Package. Results demonstrate the potential of those two viable improvement methods in SGD, implicating the effectiveness of the advancements.
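摘要中NRSGD的噪声正则化思想可以用如下最小示意说明:在每次下降步之前向梯度注入受控高斯噪声。这里用一维二次函数代替CNN训练,步长与噪声强度等超参数均为示例自拟:

```python
import random

def nrsgd_step(theta, grad_fn, lr, noise_std, rng):
    """一次NRSGD风格的更新(示意):先在梯度上叠加受控高斯噪声,
    再执行标准的梯度下降步。"""
    g = grad_fn(theta) + rng.gauss(0.0, noise_std)
    return theta - lr * g

# 最小化 f(theta) = theta^2,其梯度为 2*theta
rng = random.Random(0)
theta = 5.0
for _ in range(200):
    theta = nrsgd_step(theta, lambda t: 2.0 * t, lr=0.05, noise_std=0.1, rng=rng)

# 噪声受控时迭代仍收敛到最优解(0)附近的小邻域内
assert abs(theta) < 0.5
```

注入的噪声在收敛后让参数在最优解附近小幅抖动,这种抖动正是该类方法用来抑制对训练数据过拟合的机制。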


【93】Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
标题:利用人类反馈修复奖励函数以减轻奖励黑客行为
链接:https://arxiv.org/abs/2510.13036

作者:Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill
摘要:人类为强化学习(RL)代理设计的奖励函数经常与人类真实的、不可观测的目标不一致,因此只能充当代理奖励。针对错误指定的代理奖励函数进行优化通常会诱发奖励黑客行为,导致策略与人类的真实目标不一致。另一种方法是从人类反馈中进行RL,即通过收集人类对轨迹对的偏好从头学习奖励函数。然而,构建这样的数据集成本高昂。为了解决这两种方法的局限性,我们提出了基于偏好的奖励修复(PBRR):一个自动化迭代框架,通过从偏好中学习一个加性的、依赖转移的校正项来修复人类指定的代理奖励函数。手动指定的奖励函数可能产生在真实目标下高度次优的策略,但仅对少数转移进行校正就足以恢复最优性能。为了识别并校正这些转移,PBRR使用了有针对性的探索策略和新的偏好学习目标。我们证明,在表格域中,PBRR的累积遗憾与先前基于偏好的RL方法在常数因子范围内相匹配。此外,在一系列奖励黑客基准测试中,PBRR始终优于从偏好中从头学习奖励函数或用其他方法修改代理奖励函数的基线,学习高性能策略所需的偏好数据显著更少。
摘要:Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans' true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.
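摘要中"加性的、依赖转移的校正项"可以用一个玩具例子示意:代理奖励只奖励到达目标而忽略了危险捷径(奖励黑客),对单个转移学到的校正即可恢复对安全轨迹的偏好。下面Python草图中的状态名、数值以及校正项的来源均为示例虚构(论文中校正项由偏好学习得到):

```python
def repaired_reward(proxy_reward, correction, transition):
    """PBRR风格的修复奖励(示意):人类指定的代理奖励
    加上一个依赖转移的加性校正项。"""
    return proxy_reward(transition) + correction.get(transition, 0.0)

def trajectory_return(transitions, proxy_reward, correction):
    """一条轨迹在修复奖励下的总回报。"""
    return sum(repaired_reward(proxy_reward, correction, t) for t in transitions)

# 玩具例子:代理奖励只奖励到达 'goal',忽略了穿过 'hazard' 的捷径
proxy = lambda t: 1.0 if t[1] == "goal" else 0.0
# 假设由成对偏好学到的校正:只惩罚一个危险转移
correction = {("start", "hazard"): -2.0}

safe = [("start", "mid"), ("mid", "goal")]
hack = [("start", "hazard"), ("hazard", "goal")]
# 单个转移的校正就使安全轨迹的回报反超"黑客"轨迹
assert trajectory_return(safe, proxy, correction) > trajectory_return(hack, proxy, correction)
```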


【94】Toward Reasoning-Centric Time-Series Analysis
标题:走向以推理为中心的时间序列分析
链接:https://arxiv.org/abs/2510.13029

作者:Xinlei Wang, Mingtian Tan, Jing Qiu, Junhua Zhao, Jinjin Gu
摘要:传统的时间序列分析长期以来依赖于模式识别,在静态和成熟的基准上进行训练。然而,在现实世界中--政策发生变化,人类行为发生变化,意外事件发生--有效的分析必须超越表面趋势,以揭示驱动趋势的实际力量。最近兴起的大型语言模型(LLM)通过整合多模态输入,为重新思考时间序列分析提供了新的机会。然而,随着LLM的使用变得流行,我们必须保持谨慎,询问为什么我们使用LLM以及如何有效地利用它们。大多数现有的基于LLM的方法仍然使用它们的数值回归能力,而忽略了它们更深层次的推理潜力。本文主张用LLM重新思考时间序列,将其作为优先考虑因果结构和可解释性的推理任务。这一转变使时间序列分析更接近人类的理解,从而在复杂的现实环境中实现透明和上下文感知的见解。
摘要:Traditional time series analysis has long relied on pattern recognition, trained on static and well-established benchmarks. However, in real-world settings -- where policies shift, human behavior adapts, and unexpected events unfold -- effective analysis must go beyond surface-level trends to uncover the actual forces driving them. The recent rise of Large Language Models (LLMs) presents new opportunities for rethinking time series analysis by integrating multimodal inputs. However, as the use of LLMs becomes popular, we must remain cautious, asking why we use LLMs and how to exploit them effectively. Most existing LLM-based methods still employ their numerical regression ability and ignore their deeper reasoning potential. This paper argues for rethinking time series with LLMs as a reasoning task that prioritizes causal structure and explainability. This shift brings time series analysis closer to human-aligned understanding, enabling transparent and context-aware insights in complex real-world environments.


【95】Deliberate Lab: A Platform for Real-Time Human-AI Social Experiments
标题:Deliberate Lab:实时人机社会实验平台
链接:https://arxiv.org/abs/2510.13011

作者:Crystal Qian, Vivian Tsai, Michael Behr, Nada Hussein, Léo Laugier, Nithum Thain, Lucas Dixon
摘要:社会和行为科学家越来越多地致力于研究人类如何与人工智能互动、协作和决策。然而,这项工作的实验基础设施仍然不发达:(1)很少有平台支持大规模的实时多方研究;(2)大多数部署需要定制工程,限制了可复制性和可访问性;(3)现有工具不将人工智能代理视为一流的参与者。我们提出了Deliberate Lab,这是一个用于大规模实时行为实验的开源平台,支持人类参与者和基于大型语言模型(LLM)的代理。我们报告了一个为期12个月的公共部署的平台(N=88实验者,N=9195实验参与者),分析使用模式和工作流程。案例研究和使用场景是从平台用户中汇总而来的,并辅以对选定实验者的深入访谈。通过降低技术壁垒和标准化对混合人类-人工智能实验的支持,Deliberate Lab扩展了研究集体决策和以人为本的人工智能的方法体系。
摘要:Social and behavioral scientists increasingly aim to study how humans interact, collaborate, and make decisions alongside artificial intelligence. However, the experimental infrastructure for such work remains underdeveloped: (1) few platforms support real-time, multi-party studies at scale; (2) most deployments require bespoke engineering, limiting replicability and accessibility, and (3) existing tools do not treat AI agents as first-class participants. We present Deliberate Lab, an open-source platform for large-scale, real-time behavioral experiments that supports both human participants and large language model (LLM)-based agents. We report on a 12-month public deployment of the platform (N=88 experimenters, N=9195 experiment participants), analyzing usage patterns and workflows. Case studies and usage scenarios are aggregated from platform users, complemented by in-depth interviews with select experimenters. By lowering technical barriers and standardizing support for hybrid human-AI experimentation, Deliberate Lab expands the methodological repertoire for studying collective decision-making and human-centered AI.


【96】Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale
标题:开发和验证阿拉伯语版对大型语言模型的态度量表
链接:https://arxiv.org/abs/2510.13009

作者:Basad Barajeeh, Ala Yankouskaya, Sameha AlShakhsi, Chun Sing Maxwell Ho, Guandong Xu, Raian Ali
备注:28 Pages
摘要:随着大型语言模型(LLM)的使用日益全球化,了解公众对这些系统的态度需要适应当地环境和语言的工具。在阿拉伯世界,LLM的采用迅速增长,全球主导平台以及Fanar和Jais等区域平台都提供了面向阿拉伯语的解决方案。这凸显了需要文化和语言上相关的量表,以准确衡量该地区对LLM的态度。评估人工智能(AI)态度的工具可以为测量LLM特定态度提供基础。包含5个条目的人工智能态度(ATAI)量表测量AI恐惧和AI接受两个维度,最近被采用并改编,利用来自英国的样本开发了新的英语工具:对一般LLM的态度(AT-GLLM)量表和对主要LLM的态度(AT-PLLM)量表。在本文中,我们翻译了AT-GLLM和AT-PLLM这两个量表,并使用249名阿拉伯语成年人的样本对其进行了验证。结果表明,翻译成阿拉伯语的量表是可用于阿拉伯人群和语言的可靠且有效的工具。心理测量分析证实了双因素结构、跨性别的强测量不变性和良好的内部信度。量表还表现出很强的收敛效度和判别效度。我们的量表将支持非西方背景下的研究,这是帮助绘制LLM观念全球图景的一项急需努力,也将促进阿拉伯地区的本地化研究和决策。
摘要:As the use of large language models (LLMs) becomes increasingly global, understanding public attitudes toward these systems requires tools that are adapted to local contexts and languages. In the Arab world, LLM adoption has grown rapidly with both globally dominant platforms and regional ones like Fanar and Jais offering Arabic-specific solutions. This highlights the need for culturally and linguistically relevant scales to accurately measure attitudes toward LLMs in the region. Tools assessing attitudes toward artificial intelligence (AI) can provide a base for measuring attitudes specific to LLMs. The 5-item Attitudes Toward Artificial Intelligence (ATAI) scale, which measures two dimensions, the AI Fear and the AI Acceptance, has been recently adopted and adapted to develop new instruments in English using a sample from the UK: the Attitudes Toward General LLMs (AT-GLLM) and Attitudes Toward Primary LLM (AT-PLLM) scales. In this paper, we translate the two scales, AT-GLLM and AT-PLLM, and validate them using a sample of 249 Arabic-speaking adults. The results show that the scale, translated into Arabic, is a reliable and valid tool that can be used for the Arab population and language. Psychometric analyses confirmed a two-factor structure, strong measurement invariance across genders, and good internal reliability. The scales also demonstrated strong convergent and discriminant validity. Our scales will support research in a non-Western context, a much-needed effort to help draw a global picture of LLM perceptions, and will also facilitate localized research and policy-making in the Arab region.


【97】CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
标题:CurLL:评估语言模型中持续学习的开发框架
链接:https://arxiv.org/abs/2510.13008

作者:Pavan Kalyan, Shubhra Mishra, Satya Lokam, Navin Goyal
摘要:我们引入了一个全面的持续学习数据集和基准(CurlL),该数据集和基准基于5-10岁的人类发展轨迹,能够系统地、细粒度地评估模型逐步获得新技能的能力。CurlL涵盖5-10岁的五个发展阶段(0-4),由技能图支持,将广泛的技能分解为较小的能力、具体目标和可衡量的指标,同时还捕捉哪些能力建立在其他能力之上。我们生成了一个23.4B令牌的合成数据集,具有受控的技能进展、词汇复杂性和格式多样性,包括段落、基于理解的QA(CQA)、技能测试QA(CSQA)和指令-响应(IR)对。各阶段令牌数在2.12B到6.78B之间,支持对遗忘、前向迁移和后向迁移的精确分析。使用一个135M参数的Transformer在独立、联合和顺序(持续)设置下训练,我们展示了技能保留和迁移效率之间的权衡。通过镜像人类学习模式并提供对技能依赖关系的细粒度控制,这项工作推进了语言模型的持续学习评估。
摘要:We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models' ability to progressively acquire new skills. CurlL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.


【98】From Narratives to Probabilistic Reasoning: Predicting and Interpreting Drivers' Hazardous Actions in Crashes Using Large Language Model
标题:从叙述到概率推理:使用大型语言模型预测和解释车祸中司机的危险行为
链接:https://arxiv.org/abs/2510.13002

作者:Boyou Chen, Gerui Xu, Zifei Wang, Huizhong Guo, Ananna Ahmed, Zhaonan Sun, Zhen Hu, Kaihan Zhang, Shan Bao
摘要:车辆碰撞涉及道路使用者之间的复杂互动、瞬间决策和具有挑战性的环境条件。其中,两车碰撞最为普遍,约占道路碰撞的70%,对交通安全构成重大挑战。识别驾驶员危险行为(DHA)对于理解事故原因至关重要,但大规模数据库中DHA数据的可靠性受到不一致且劳动密集的手动编码实践的限制。在这里,我们提出了一个创新框架,利用微调的大型语言模型从文本碰撞叙述中自动推断DHA,从而提高DHA分类的有效性和可解释性。使用MTCF五年的两车碰撞数据,我们基于详细的碰撞叙述对Llama 3.2 1B模型进行了微调,并将其性能与传统机器学习分类器进行基准比较,包括随机森林、XGBoost、CatBoost和神经网络。经过微调的LLM实现了80%的总体准确度,超过所有基线模型,并在数据不平衡的情况下表现出显著改善。为了提高可解释性,我们开发了一种概率推理方法,分析了模型输出在原始测试集和三个目标反事实场景(驾驶员分心和年龄的变化)上的变化。我们的分析表明:为一名司机引入分心大大增加了"一般不安全驾驶"的可能性;为两名司机引入分心使"两名司机都采取了危险行动"的概率最大化;指定一名青少年司机则显著提高了"速度和停车违规"的概率。我们的框架和分析方法为大规模自动DHA检测提供了一个强大且可解释的解决方案,为交通安全分析和干预提供了新的机会。
摘要:Vehicle crashes involve complex interactions between road users, split-second decisions, and challenging environmental conditions. Among these, two-vehicle crashes are the most prevalent, accounting for approximately 70% of roadway crashes and posing a significant challenge to traffic safety. Identifying Driver Hazardous Action (DHA) is essential for understanding crash causation, yet the reliability of DHA data in large-scale databases is limited by inconsistent and labor-intensive manual coding practices. Here, we present an innovative framework that leverages a fine-tuned large language model to automatically infer DHAs from textual crash narratives, thereby improving the validity and interpretability of DHA classifications. Using five years of two-vehicle crash data from MTCF, we fine-tuned the Llama 3.2 1B model on detailed crash narratives and benchmarked its performance against conventional machine learning classifiers, including Random Forest, XGBoost, CatBoost, and a neural network. The fine-tuned LLM achieved an overall accuracy of 80%, surpassing all baseline models and demonstrating pronounced improvements in scenarios with imbalanced data. To increase interpretability, we developed a probabilistic reasoning approach, analyzing model output shifts across original test sets and three targeted counterfactual scenarios: variations in driver distraction and age. Our analysis revealed that introducing distraction for one driver substantially increased the likelihood of "General Unsafe Driving"; distraction for both drivers maximized the probability of "Both Drivers Took Hazardous Actions"; and assigning a teen driver markedly elevated the probability of "Speed and Stopping Violations." Our framework and analytical methods provide a robust and interpretable solution for large-scale automated DHA detection, offering new opportunities for traffic safety analysis and intervention.


【99】Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
标题:最大化还是错过:LLM求解极值问题的基准测试
链接:https://arxiv.org/abs/2510.12997

作者:Binxin Gao, Jingjun Han
备注:Our benchmark dataset is available at this https URL
摘要:测试时间缩放使大型语言模型(LLM)具备了显著的推理能力,特别是在数学领域,在生成最终答案之前会进行中间的思维链(CoT)推理。然而,这些推理能力的具体来源和机制仍然没有得到充分理解。优化推理,即在约束条件下寻找极值,是支撑规划、控制、资源分配和提示搜索等关键应用的一种基本抽象。为了系统地评估这种能力,我们引入了ExtremBench,一个用于求解数学极值问题的基准数据集,它由中国数学奥林匹克使用的不等式练习整理而来,并转化为93个标准化的极值求解问题。我们对各种最先进的开源模型家族进行了广泛评估,包括Qwen3、GPT-OSS和DeepSeek。我们的结果表明,LLM的极值求解推理能力并不总是与AIME25和MATH-500等当前数学基准的表现一致:一些模型显示出较强的一般数学推理能力但极值求解能力较差,反之亦然。这种差异凸显了当前评估实践中的一个关键差距,并表明现有基准可能无法全面刻画数学推理能力的全部范围。
摘要:Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.


【100】SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents
标题:SENTINEL:一个用于基于LLM的具身智能体安全性评估的多级形式化框架
链接:https://arxiv.org/abs/2510.12985

作者:Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, Qi Zhu
摘要:我们提出了Sentinel,这是第一个在语义、计划和轨迹层面对基于大型语言模型(LLM)的具身智能体的物理安全性进行形式化评估的框架。与依赖启发式规则或主观LLM判断的现有方法不同,Sentinel将实际安全需求落实到形式化时态逻辑(TL)语义中,可以精确地指定状态不变量、时间依赖性和时序约束。然后,它采用多级验证流水线:(i)在语义级,将直观的自然语言安全需求形式化为TL公式,并探测LLM智能体对这些需求的理解是否与TL公式对齐;(ii)在计划级,在执行之前针对TL公式验证LLM智能体生成的高级行动计划和子目标,以检测不安全的计划;(iii)在轨迹级,将多条执行轨迹合并为一棵计算树,并针对物理细节丰富的TL规范进行高效验证,作为最终的安全检查。我们在VirtualHome和ALFRED中应用Sentinel,并针对多样的安全需求对多个基于LLM的具身智能体进行形式化评估。我们的实验表明,通过将物理安全建立在时态逻辑之上并在多个层面应用验证方法,Sentinel为在物理环境中系统地评估基于LLM的具身智能体提供了严格的基础,暴露了以前方法忽略的安全违规行为,并提供了对其失效模式的洞察。
摘要:We present Sentinel, the first framework for formally evaluating the physical safety of Large Language Model(LLM-based) embodied agents across the semantic, plan, and trajectory levels. Unlike prior methods that rely on heuristic rules or subjective LLM judgments, Sentinel grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, temporal dependencies, and timing constraints. It then employs a multi-level verification pipeline where (i) at the semantic level, intuitive natural language safety requirements are formalized into TL formulas and the LLM agent's understanding of these requirements is probed for alignment with the TL formulas; (ii) at the plan level, high-level action plans and subgoals generated by the LLM agent are verified against the TL formulas to detect unsafe plans before execution; and (iii) at the trajectory level, multiple execution trajectories are merged into a computation tree and efficiently verified against physically-detailed TL specifications for a final safety check. We apply Sentinel in VirtualHome and ALFRED, and formally evaluate multiple LLM-based embodied agents against diverse safety requirements. Our experiments show that by grounding physical safety in temporal logic and applying verification methods across multiple levels, Sentinel provides a rigorous foundation for systematically evaluating LLM-based embodied agents in physical environments, exposing safety violations overlooked by previous methods and offering insights into their failure modes.
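论文摘要中描述的多级时态逻辑验证可以用一个极简示意来说明。下面的 Python 草图只演示轨迹级检查的思路:用 always/eventually(对应 TL 中的 G/F 算子)在状态轨迹上检查安全规范。其中的状态字段与规范均为假设的示例,并非论文的实际实现。

```python
# 概念性示意(非论文官方实现):在轨迹层面用简化的时态逻辑
# 检查一条执行轨迹是否满足安全规范。

def always(pred, trace):
    """G p:谓词在轨迹的每个状态都成立。"""
    return all(pred(s) for s in trace)

def eventually(pred, trace):
    """F p:谓词在轨迹的某个状态成立。"""
    return any(pred(s) for s in trace)

# 假设的具身智能体轨迹:每个状态记录炉灶与水龙头的开关
trace = [
    {"stove_on": True,  "tap_on": False},
    {"stove_on": True,  "tap_on": False},
    {"stove_on": False, "tap_on": False},
]

# 安全规范示例:炉灶最终必须被关闭,且水龙头从未被打开
safe = eventually(lambda s: not s["stove_on"], trace) and \
       always(lambda s: not s["tap_on"], trace)
print(safe)  # True:该轨迹满足两条规范
```

真实系统还需处理带时限的约束与计算树上的分支,这里只保留了最核心的不变量/可达性检查。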


【101】DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
标题:DeepPlanner:通过Advantage Shaping扩展深度研究代理的规划能力
链接:https://arxiv.org/abs/2510.12979

作者:Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin
备注:Under Review
摘要:具备多步推理和动作生成能力的大型语言模型(LLM)在利用外部工具解决需要长程规划的复杂任务方面展现出了潜力。然而,现有方法要么依赖推理阶段的隐式规划,要么引入显式规划器,而没有系统地解决如何优化规划阶段的问题。作为证据,我们观察到,在普通(vanilla)强化学习(RL)下,规划token表现出比其他动作token明显更高的熵,揭示了仍未充分优化的不确定决策点。为了解决这个问题,我们提出了DeepPlanner,一个能有效增强深度研究智能体规划能力的端到端RL框架。我们的方法用基于熵的项来塑造token级优势,为高熵token分配更大的更新,并选择性地上调规划密集型rollout的样本级优势。在七个深度研究基准上的广泛实验表明,DeepPlanner提高了规划质量,并在大幅降低训练预算的情况下取得了最先进的结果。
摘要:Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
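摘要中"用基于熵的项塑造 token 级优势"的思路可以用如下假设性 Python 草图说明(alpha 系数与塑形公式均为示意,论文的具体形式未在摘要中给出):

```python
import math

def shaped_advantage(adv, probs, alpha=0.1):
    """示意:用每个 token 位置的策略熵对优势加权,
    高熵(不确定)的规划 token 获得更大的更新幅度。
    adv: 各 token 的原始优势;probs: 各 token 位置的策略分布。"""
    out = []
    for a, p in zip(adv, probs):
        h = -sum(x * math.log(x) for x in p if x > 0)  # 该位置的熵
        out.append(a * (1.0 + alpha * h))
    return out

uniform = [0.25] * 4                 # 高熵分布(不确定的规划 token)
peaked = [0.97, 0.01, 0.01, 0.01]    # 低熵分布(确定的普通 token)
shaped = shaped_advantage([1.0, 1.0], [uniform, peaked])
print(shaped[0] > shaped[1])  # True:高熵 token 的优势被放大更多
```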


【102】A Multimodal XAI Framework for Trustworthy CNNs and Bias Detection in Deep Representation Learning
标题:深度表示学习中值得信赖的CNN和偏差检测的多模式XAI框架
链接:https://arxiv.org/abs/2510.12957

作者:Noor Islam S. Mohammad
摘要:标准基准数据集,如MNIST,通常无法暴露潜在的偏见和多模态特征复杂性,限制了深度神经网络在高风险应用中的可信度。我们提出了一种新的多模态可解释AI(XAI)框架,该框架将注意力增强特征融合,基于Grad-CAM++的局部解释以及用于偏差检测和缓解的Reveal-to-Revise反馈循环统一起来。在MNIST的多模态扩展上进行评估,我们的方法实现了93.2%的分类准确率,91.6%的F1分数和78.1%的解释保真度(IoU-XAI),优于单峰和不可解释的基线。消融研究表明,将可解释性与偏差感知学习相结合,可以增强鲁棒性和人类对齐。我们的工作弥合了性能,透明度和公平性之间的差距,突出了在敏感领域值得信赖的人工智能的实用途径。
摘要:Standard benchmark datasets, such as MNIST, often fail to expose latent biases and multimodal feature complexities, limiting the trustworthiness of deep neural networks in high-stakes applications. We propose a novel multimodal Explainable AI (XAI) framework that unifies attention-augmented feature fusion, Grad-CAM++-based local explanations, and a Reveal-to-Revise feedback loop for bias detection and mitigation. Evaluated on multimodal extensions of MNIST, our approach achieves 93.2% classification accuracy, 91.6% F1-score, and 78.1% explanation fidelity (IoU-XAI), outperforming unimodal and non-explainable baselines. Ablation studies demonstrate that integrating interpretability with bias-aware learning enhances robustness and human alignment. Our work bridges the gap between performance, transparency, and fairness, highlighting a practical pathway for trustworthy AI in sensitive domains.


【103】Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
标题:胎儿超声解释的认知感知视觉语言基础模型
链接:https://arxiv.org/abs/2510.12953

作者:Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
摘要:最近的医学视觉语言模型在VQA、报告生成和异常检测等任务上展现出潜力。然而,大多数模型面向结构化的成人影像,在胎儿超声上表现不佳,后者带来了多视图图像推理、疾病种类繁多和图像多样性等挑战。为了弥补这一差距,我们引入了FetalMind,一个专为胎儿超声报告生成与诊断而设计的医疗AI系统。在临床工作流程的指导下,我们提出了显著认知解耦(SED),它将专家策划的二分图注入模型中,以解耦"视图-疾病"关联,并通过强化学习沿着符合临床实际的步骤引导偏好选择。这一设计缓解了疾病间的变异性和视图间的异质性,减少了学习瓶颈,同时使模型的推理与产科实践保持一致。为了大规模训练FetalMind,我们构建了FetalSigma-1M数据集,这是第一个大规模胎儿超声报告语料库,包含来自12个医疗中心的2万份报告,缓解了领域数据稀缺的问题。大量实验表明,FetalMind在所有妊娠阶段都优于开源和闭源基线,在关键病症上实现了平均+14%的增益和+61.2%的更高准确率,同时保持高效、稳定和可扩展。项目页面:https://hexiao0275.github.io/FetalMind。
摘要:Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.


【104】SpareCodeSearch: Searching for Code Context When You Have No Spare GPU
标题:SpareCodeSearch:当您没有多余的图形处理器时搜索代码上下文
链接:https://arxiv.org/abs/2510.12948

作者:Minh Nguyen
备注:4 pages, 3 figures, 4 tables. Accepted to Context Collection Workshop co-located with ASE'25
摘要:检索增强生成(RAG)框架旨在通过引入一个检索相关上下文以构建输入提示的模块来增强代码语言模型(CLM)。然而,这些检索模块通常使用语义搜索,训练和托管相应的嵌入模型需要大量计算资源,使其难以集成到轻量级应用(例如IDE内置的AI代码补全)中。在这篇解决方案论文中,我们证明了使用关键词搜索就足以在大型代码库中检索相关且有用的代码上下文,而不需要大量GPU资源。我们的解决方案所找到的代码上下文的有用性通过其在Code Context Competition基准上的补全结果得到验证,在Kotlin和Python赛道上分别达到0.748和0.725的chrF分数。
摘要:Retrieval-Augmented Generation (RAG) frameworks aim to enhance Code Language Models (CLMs) by including another module for retrieving relevant context to construct the input prompt. However, these retrieval modules commonly use semantic search, requiring substantial computational resources for training and hosting these embedded models, making them infeasible to integrate into lightweight applications such as in-IDE AI-based code completion. In this solution paper, we prove that using keyword-search is sufficient to retrieve relevant and useful code context inside large codebases, without the need for extensive GPU resources. The usefulness of code contexts found by our solution is demonstrated through their completion results on the Code Context Competition's benchmark, reaching 0.748 and 0.725 chrF scores on Kotlin and Python tracks, respectively.
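作为摘要所述"关键词搜索足以检索代码上下文"这一思路的极简示意,下面的 Python 草图对标识符做分词并按词项重叠打分(纯标准库实现;打分方式为本文假设,真实系统通常使用 BM25 等加权方案):

```python
import re
from collections import Counter

def tokenize(code):
    # 提取标识符并按驼峰/下划线拆分为小写关键词
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    parts = []
    for w in words:
        parts += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+", w)
    return [p.lower() for p in parts]

def keyword_search(query, files, top_k=1):
    """按查询关键词与文件词频的重叠打分,返回得分最高的文件名。"""
    q = Counter(tokenize(query))
    scored = []
    for name, code in files.items():
        c = Counter(tokenize(code))
        score = sum(min(q[t], c[t]) for t in q)
        scored.append((score, name))
    return [n for _, n in sorted(scored, reverse=True)[:top_k]]

files = {
    "parser.py": "def parse_json(text): return json.loads(text)",
    "db.py": "def open_connection(url): ...",
}
print(keyword_search("parse json text", files))  # ['parser.py']
```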


【105】KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
标题:KVCOMM:面向基于LLM的高效多智能体系统的在线跨上下文KV缓存通信
链接:https://arxiv.org/abs/2510.12872

作者:Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
备注:Accepted for publication in NeurIPS2025. Code is available at \url{this https URL}
摘要:多智能体大语言模型(LLM)系统越来越多地被用于需要智能体之间通信和协调的复杂语言处理任务。然而,这些系统往往因跨智能体重复处理重叠上下文而产生大量开销。在典型的流水线中,智能体一旦收到来自前一个智能体的消息,就必须从头重新处理包括先前轮次在内的完整上下文,导致处理效率低下。虽然键值(KV)缓存在前缀保持不变的单智能体设置中是避免冗余计算的有效方案,但由于各智能体特定的上下文扩展引入了不同的前缀,它无法在多智能体场景中直接复用。我们发现核心挑战在于KV缓存在不同智能体之间的偏移差异。为此,我们提出了KVCOMM,一个免训练框架,它通过复用KV缓存并在不同前缀上下文下对齐重叠上下文的缓存偏移,实现多智能体推理中的高效预填充。KVCOMM通过引用一个称为"锚"的缓存示例池来估计和调整共享内容的KV缓存,锚池存储了在不同前缀下观察到的缓存偏差,并在线维护和更新,从而能动态适应不同的用户请求和上下文结构。KVCOMM在各种多智能体工作负载(包括检索增强生成、数学推理和协作编码任务)中实现了超过70%的复用率,且没有质量下降。特别地,在五智能体设置下,当每个全连接智能体接收1K输入token、512前缀token和512输出token时,KVCOMM相比标准预填充流水线实现了高达7.8倍的加速,将TTFT从约430 ms降低到约55 ms。
摘要:Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context-including prior turns-must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples-termed anchors-that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
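KVCOMM 的核心思想是跨前缀复用 KV 缓存并校正位置偏移。下面用一个高度简化的 Python 玩具模型示意这一流程:真实系统中 KV 是注意力键值张量,偏移校正涉及位置编码,此处仅用整数占位演示"命中缓存 + 偏移校正"的思路(全部为假设性示意,非论文实现)。

```python
# 概念性示意:多智能体场景下按"内容 + 位置偏移"复用 KV 缓存。

kv_pool = {}  # 内容哈希 -> (首次出现的起始位置, 伪KV)

def prefill(tokens, start):
    """对一段上下文做预填充:命中缓存则按位置差校正后复用,否则计算并入池。"""
    key = hash(tuple(tokens))
    if key in kv_pool:
        base_start, kv = kv_pool[key]
        offset = start - base_start              # 不同前缀导致的位置偏移
        return [v + offset for v in kv], True    # 校正后直接复用,免去重算
    kv = [start + i for i in range(len(tokens))] # 模拟"从头计算"的KV
    kv_pool[key] = (start, kv)
    return kv, False

shared = ["review", "this", "code"]
_, hit1 = prefill(shared, start=10)    # 智能体 A:首次计算
kv2, hit2 = prefill(shared, start=25)  # 智能体 B:前缀不同,偏移复用
print(hit1, hit2, kv2[0])  # False True 25
```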


【106】From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
标题:从字面到变通:在大型语言模型中激发与人类对齐的异常处理的元提示框架
链接:https://arxiv.org/abs/2510.12864

作者:Imran Khan
备注:13 pages. Code and data are available at this https URL
摘要:大型语言模型(LLM)越来越多地被部署为智能体AI系统的推理引擎,但它们表现出一个关键缺陷:对显式规则的僵化遵从,导致决策与人类常识和意图不一致。这种"规则僵化"是构建可信自主智能体的一个重要障碍。虽然先前的工作表明,结合人类解释的监督微调(SFT)可以缓解这个问题,但SFT计算成本高昂,许多从业者无法使用。为了弥补这一差距,我们引入了规则-意图区分(RID)框架,这是一种新颖的、低计算量的元提示技术,旨在以zero-shot方式在LLM中激发与人类对齐的异常处理。RID框架为模型提供了一个结构化的认知图式,用于拆解任务、分类规则、权衡冲突的结果,并论证其最终决策。我们在一个包含20个需要跨不同领域进行细致判断的场景的自定义基准上,将RID框架与基线提示和思维链(CoT)提示进行了对比评估。经人工验证的结果表明,RID框架显著提高了性能,实现了95%的人类对齐分数(HAS),而基线为80%,CoT为75%。此外,它始终产生更高质量的意图驱动的推理。这项工作提出了一种实用、易得且有效的方法,引导LLM从字面的指令遵循转向灵活的、面向目标的推理,为更可靠、更务实的AI智能体铺平了道路。
摘要:Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.
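RID 框架本质上是一个结构化的元提示。下面给出一个假设性的 Python 模板草图,示意"任务拆解、规则分类、结果权衡、最终决策"四步认知图式如何组织成提示(模板措辞与示例场景均为本文虚构,并非论文的原始提示):

```python
# 假设性的 RID 风格元提示模板,仅示意结构化认知图式的组织方式
RID_TEMPLATE = """你是一个需要处理例外情况的智能体。请按以下步骤决策:
1. 任务拆解:{task}
2. 规则分类:判断相关规则是字面约束还是意图约束。
3. 结果权衡:列出遵守/违背规则各自的后果。
4. 最终决策:给出符合人类意图的选择,并说明理由。"""

def build_rid_prompt(task):
    """把具体情境填入模板,得到发给 LLM 的完整提示。"""
    return RID_TEMPLATE.format(task=task)

prompt = build_rid_prompt("门禁规则要求刷卡进入,但有人抱着昏迷的病人求助")
print("规则分类" in prompt)  # True
```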


【107】Three Lenses on the AI Revolution: Risk, Transformation, Continuity
标题:人工智能革命的三个镜头:风险、转型、连续性
链接:https://arxiv.org/abs/2510.12859

作者:Masoud Makrehchi
备注:17 pages
摘要:人工智能(AI)既是历史技术革命的延续,也可能是与之的决裂。本文认为,必须同时通过三个镜头来看待人工智能:风险,它在不可逆的全球外部性方面类似于核技术;转型,它与工业革命相似,是一种推动生产力和劳动力重组的通用技术;以及连续性,它延续了从个人计算到互联网再到移动的五十年计算革命弧线。借鉴历史类比,我们强调,过去没有任何一次转型构成严格意义上的奇点:破坏性的变革最终都通过新的规范和机构变得可治理。我们考察了历次革命中反复出现的模式(使用层的民主化、生产层的集中化、成本下降和个性化深化),并展示这些动态在人工智能时代如何加剧。行业分析说明,随着常规认知被商品化、人类价值转向判断、信任和道德责任,会计、法律、教育、翻译、广告和软件工程正在被重塑。在前沿领域,设计道德AI智能体的挑战凸显了对稳健护栏、道德泛化机制以及新兴多智能体动态治理的需求。我们的结论是,人工智能既不是一次独特的断裂,也不仅仅是渐进的进步。它既是进化的,也是革命性的:其中位效应是可预测的,但带有奇点级的尾部风险。好的结果不会自动产生;它们需要将促进创新的战略与安全治理相结合,确保公平获取,并将人工智能嵌入人类的责任秩序之中。
摘要:Artificial Intelligence (AI) has emerged as both a continuation of historical technological revolutions and a potential rupture with them. This paper argues that AI must be viewed simultaneously through three lenses: risk, where it resembles nuclear technology in its irreversible and global externalities; transformation, where it parallels the Industrial Revolution as a general-purpose technology driving productivity and reorganization of labor; and continuity, where it extends the fifty-year arc of computing revolutions from personal computing to the internet to mobile. Drawing on historical analogies, we emphasize that no past transition constituted a strict singularity: disruptive shifts eventually became governable through new norms and institutions.   We examine recurring patterns across revolutions -- democratization at the usage layer, concentration at the production layer, falling costs, and deepening personalization -- and show how these dynamics are intensifying in the AI era. Sectoral analysis illustrates how accounting, law, education, translation, advertising, and software engineering are being reshaped as routine cognition is commoditized and human value shifts to judgment, trust, and ethical responsibility. At the frontier, the challenge of designing moral AI agents highlights the need for robust guardrails, mechanisms for moral generalization, and governance of emergent multi-agent dynamics.   We conclude that AI is neither a singular break nor merely incremental progress. It is both evolutionary and revolutionary: predictable in its median effects yet carrying singularity-class tail risks. Good outcomes are not automatic; they require coupling pro-innovation strategies with safety governance, ensuring equitable access, and embedding AI within a human order of responsibility.


【108】A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation
标题:对《古兰经》背诵以知识为中心评估的必要性的批判性评论
链接:https://arxiv.org/abs/2510.12858

作者:Mohammed Hilal Al-Kharusi, Khizar Hayat, Khalil Bader Al Ruqeishi, Haroon Rashid Lone
备注:33 pages
摘要:古兰经背诵(Tajweed)的神圣实践,受到精确的语音,韵律和神学规则的支配,在现代面临着重大的教学挑战。虽然数字技术承诺前所未有的教育机会,但用于背诵评估的自动化工具未能实现广泛采用或教学效果。这篇文献综述调查了这一关键差距,对过去二十年来开发的学术研究,网络平台和商业应用程序进行了全面分析。我们的综合揭示了一个根本的错位,在现行的方法,重新调整自动语音识别(ASR)架构,优先词汇识别定性声学评估和困扰的数据依赖性,人口统计学偏见,并无法提供诊断有用的反馈。批评这些数据驱动的范式,我们认为一个基本的范式转向以知识为中心的计算框架。利用古兰经文本的不可变性和Tajweed精确定义的规则,我们建议一个强大的评估器必须围绕基于规范规则和衔接点(Makhraj)的预期声学建模进行构建,而不是依赖于从不完美和有偏见的数据集中学习到的统计模式。这篇评论的结论是,自动化古兰经评估的未来在于将深层语言知识与先进的音频分析相结合的混合系统,提供了一条通往强大,公平和教学合理的工具的道路,可以忠实地支持世界各地的学习者。
摘要:The sacred practice of Quranic recitation (Tajweed), governed by precise phonetic, prosodic, and theological rules, faces significant pedagogical challenges in the modern era. While digital technologies promise unprecedented access to education, automated tools for recitation evaluation have failed to achieve widespread adoption or pedagogical efficacy. This literature review investigates this critical gap, conducting a comprehensive analysis of academic research, web platforms, and commercial applications developed over the past two decades. Our synthesis reveals a fundamental misalignment in prevailing approaches that repurpose Automatic Speech Recognition (ASR) architectures, which prioritize lexical recognition over qualitative acoustic assessment and are plagued by data dependency, demographic biases, and an inability to provide diagnostically useful feedback. Critiquing these data-driven paradigms, we argue for a foundational paradigm shift towards a knowledge-centric computational framework. Capitalizing on the immutable nature of the Quranic text and the precisely defined rules of Tajweed, we propose that a robust evaluator must be architected around anticipatory acoustic modeling based on canonical rules and articulation points (Makhraj), rather than relying on statistical patterns learned from imperfect and biased datasets. This review concludes that the future of automated Quranic evaluation lies in hybrid systems that integrate deep linguistic knowledge with advanced audio analysis, offering a path toward robust, equitable, and pedagogically sound tools that can faithfully support learners worldwide.


【109】Adaptive Generation of Bias-Eliciting Questions for LLMs
标题:面向LLM的偏见诱发问题自适应生成
链接:https://arxiv.org/abs/2510.12857

作者:Robin Staab, Jasper Dekoninck, Maximilian Baader, Martin Vechev
摘要:大型语言模型(LLM)现在广泛部署在面向用户的应用程序中,在全球范围内达到数亿个。随着它们被纳入日常工作,对它们的产出越来越依赖,引起了严重关切。特别是,用户可能会在不知不觉中暴露于模型固有的偏见,系统地不利或刻板印象某些群体。然而,现有的偏差基准仍然依赖于模板提示或限制性多项选择题,这些问题具有暗示性,过于简单,无法捕捉真实世界用户交互的复杂性。在这项工作中,我们通过引入一个反事实偏见评估框架来解决这一差距,该框架自动生成关于性别、种族或宗教等敏感属性的现实的、开放式的问题。通过迭代变异和选择偏差诱导问题,我们的方法系统地探索了模型最容易受到偏差行为影响的领域。除了检测有害的偏见,我们还捕捉不同的响应维度,这些维度在用户交互中越来越相关,例如不对称拒绝和明确承认偏见。利用我们的框架,我们构建了CAB,这是一个跨越不同主题的人工验证的基准,旨在实现跨模型比较。使用CAB,我们分析了一系列跨多个偏差维度的LLM,揭示了不同模型如何表现偏差的细微差别。例如,虽然GPT-5的表现优于其他模型,但它在特定场景中表现出持续的偏差。这些发现强调了持续改进的必要性,以确保公平的模型行为。
摘要:Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns. In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups. However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive, simplistic, and fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior. Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias. Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics, designed to enable cross-model comparisons. Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias. For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios. These findings underscore the need for continual improvements to ensure fair model behavior.


【110】Efficient Adaptive Transformer: An Empirical Study and Reproducible Framework
标题:高效自适应Transformer:实证研究和可复制框架
链接:https://arxiv.org/abs/2510.12856

作者:Jan Miller
备注:10 pages, 6 figures, pgfplots tables included; BibTeX compiled to .bbl. Code and reproducibility artifacts referenced in the paper
摘要:Efficient Adaptive Transformer(EAT)框架将三种自适应效率技术-渐进式令牌修剪、稀疏注意和动态提前退出-统一到一个用于输入自适应推理的可再现架构中。EAT提供了一个开源的基准测试管道,可以在GLUE任务(SST-2、QQP、MNLI)中自动进行数据处理、计时和消融。虽然这项实证研究发现,结合这些机制可以增加浅六层模型中的延迟,但它表明EAT比SST-2上的优化DistilBERT基线实现了略高的准确性,说明了延迟敏感NLP的动态计算潜力。主要贡献是开放的,端到端的可复制框架-完整的脚本,CSV日志和分析实用程序-旨在作为一个社区工具,用于进一步研究自适应Transformers。
摘要:The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive inference. EAT provides an open-source benchmarking pipeline that automates data processing, timing, and ablation across GLUE tasks (SST-2, QQP, MNLI). Although this empirical study finds that combining these mechanisms can increase latency in shallow six-layer models, it demonstrates that EAT achieves slightly higher accuracy than the optimized DistilBERT baseline on SST-2, illustrating the potential of dynamic computation for latency-sensitive NLP. The main contribution is the open, end-to-end reproducible framework - complete with scripts, CSV logging, and analysis utilities - intended to serve as a community tool for further research on adaptive transformers.
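EAT 中的动态提前退出机制可以用一个极简草图说明:逐层前向传播时,一旦中间分类头的置信度超过阈值即停止计算。下面的 Python 示意用预先给定的每层置信度代替真实的前向计算(阈值与数值均为假设):

```python
def early_exit(layer_confidences, threshold=0.9):
    """逐层推理,一旦某层分类置信度达到阈值即提前退出,
    返回 (退出层序号, 该层置信度);否则在最后一层退出。"""
    for i, conf in enumerate(layer_confidences):
        if conf >= threshold:
            return i, conf
    return len(layer_confidences) - 1, layer_confidences[-1]

# 简单样本在第 2 层(从 0 计)即可退出,节省后续层的计算
layer, conf = early_exit([0.55, 0.72, 0.93, 0.98, 0.99, 0.99])
print(layer)  # 2
```

摘要中指出,在浅层六层模型中把剪枝、稀疏注意与提前退出叠加反而可能增加延迟,说明此类机制的收益依赖模型深度与样本难度分布。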


【111】Ethic-BERT: An Enhanced Deep Learning Model for Ethical and Non-Ethical Content Classification
标题:Ethic-BERT:用于道德和非道德内容分类的增强深度学习模型
链接:https://arxiv.org/abs/2510.12850

作者:Mahamodul Hasan Mahadi, Md. Nasif Safwan, Souhardo Rahman, Shahnaj Parvin, Aminun Nahar, Kamruddin Nur
摘要:随着AI系统日益影响人类决策,开发能够进行细致道德推理的AI系统至关重要,但现有模型往往依赖表面相关性而非原则性的道德理解。本文介绍了Ethic-BERT,一个基于BERT的模型,用于跨四个领域的道德内容分类:常识(Commonsense)、正义(Justice)、美德(Virtue)和道义(Deontology)。利用ETHICS数据集,我们的方法集成了解决词汇稀疏和上下文歧义问题的稳健预处理,以及完整模型解冻、梯度累积和自适应学习率调度等高级微调策略。为了评估鲁棒性,我们采用了经对抗过滤的"Hard Test"划分,以分离复杂的道德困境。实验结果表明Ethic-BERT优于基线模型,在标准测试中达到82.32%的平均准确率,在正义和美德类别上有显著改善。此外,所提出的Ethic-BERT在Hard Test上实现了15.28%的平均准确率提升。这些发现表明,偏差感知的预处理与所提出的增强模型有助于提高性能并支持可靠的决策。
摘要:Developing AI systems capable of nuanced ethical reasoning is critical as they increasingly influence human decisions, yet existing models often rely on superficial correlations rather than principled moral understanding. This paper introduces Ethic-BERT, a BERT-based model for ethical content classification across four domains: Commonsense, Justice, Virtue, and Deontology. Leveraging the ETHICS dataset, our approach integrates robust preprocessing to address vocabulary sparsity and contextual ambiguities, alongside advanced fine-tuning strategies like full model unfreezing, gradient accumulation, and adaptive learning rate scheduling. To evaluate robustness, we employ an adversarially filtered "Hard Test" split, isolating complex ethical dilemmas. Experimental results demonstrate Ethic-BERT's superiority over baseline models, achieving 82.32% average accuracy on the standard test, with notable improvements in Justice and Virtue. In addition, the proposed Ethic-BERT attains 15.28% average accuracy improvement in the HardTest. These findings contribute to performance improvement and reliable decision-making using bias-aware preprocessing and proposed enhanced AI model.


【112】VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages
标题:VLURes:低资源语言中的VLM视觉和语言理解基准
链接:https://arxiv.org/abs/2510.12845

作者:Jesse Atuhurra, Iqra Ali, Tomoya Iwakura, Hidetaka Kamigaito, Tatsuya Hiraoka
摘要:视觉语言模型(VLM)是提升智能体感知能力的关键。然而,对VLM的评估仍然局限于以英语为主的基准,其中图文对仅包含短文本。为了在长文本设置下评估VLM在四种语言中的细粒度能力,我们引入了一个新的多语言基准VLURes,它包含八项视觉与语言任务以及一项开创性的"无关性"任务,用于探测VLM在英语、日语以及低资源语言斯瓦希里语和乌尔都语上的细粒度视觉与语言理解能力。我们的数据集取自目标语言的网络资源,涵盖十个不同的图像类别和丰富的文本背景,为斯瓦希里语和乌尔都语引入了宝贵的视觉语言资源。通过提示VLM生成回答和理由,并由自动评估和母语者进行评价,我们发现了在对象识别、场景理解和关系理解等对智能体至关重要的任务上,不同语言和任务之间存在性能差异。我们用VLURes评估了十个VLM。表现最好的模型GPT-4o取得了90.8%的总体准确率,落后人类表现6.7%,而开源模型的差距更大。这一差距凸显了VLURes在开发能够处理多模态视觉推理的智能体方面的关键作用。
摘要:Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.


【113】FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs
标题:FaStFACT:LLM中更快、更强的长篇事实评估
链接:https://arxiv.org/abs/2510.12839

作者:Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo
备注:EMNLP 2025 (Findings)
摘要:由于准确性问题和昂贵的人工评估,评估大型语言模型(LLM)长文本生成的事实性仍然具有挑战性。先前的工作尝试通过将文本分解为声明、搜索证据并验证声明来实现这一点,但存在严重缺陷:(1)流水线组件复杂、不适合长LLM输出而导致的低效率,以及(2)声明集不准确、仅依赖单行摘要片段导致证据收集不足而造成的无效性。   为了解决这些局限,我们提出了FaStFACT,一个快速而强大的评估框架,在现有基线中实现了与人工评估的最高一致性和最高效率。FaStFACT首先采用结合基于置信度预验证的块级声明提取,在确保可靠性的同时显著降低网络搜索和推理调用的成本。在搜索和验证环节,它从抓取的网页中收集文档级证据,并在验证过程中有选择地检索,解决了以往流水线中证据不足的问题。   基于聚合并人工标注的基准的大量实验证明了FaStFACT在高效且有效地评估LLM长文本生成事实性方面的可靠性。代码和基准数据可在https://github.com/Yingjia-Wan/FastFact获得。
摘要:Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collection of one-line snippets.   To address these limitations, we propose FaStFACT, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and efficiency among existing baselines. FaStFACT first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calling while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence insufficiency problem in previous pipelines.   Extensive experiments based on an aggregated and manually annotated benchmark demonstrate the reliability of FaStFACT in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data is available at https://github.com/Yingjia-Wan/FastFact.
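摘要中"基于置信度的预验证"可以用如下假设性 Python 草图说明:只有模型自评置信度低于阈值的声明才进入昂贵的网页检索验证,从而降低搜索与推理调用成本(阈值与示例数据均为示意):

```python
def pre_verify(claims, threshold=0.85):
    """置信度预验证示意:置信度达到阈值的声明直接接受,
    只有低置信度声明才进入代价较高的网页检索验证队列。"""
    accepted, to_search = [], []
    for claim, conf in claims:
        (accepted if conf >= threshold else to_search).append(claim)
    return accepted, to_search

claims = [("水的沸点在标准大气压下约为100摄氏度", 0.98),
          ("某公司2024年营收为 X 亿元", 0.40)]
accepted, to_search = pre_verify(claims)
print(len(accepted), len(to_search))  # 1 1
```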


【114】A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
标题:A²FM:用于工具感知混合推理的自适应智能体基础模型
链接:https://arxiv.org/abs/2510.12838

作者:Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
备注:9 pages, 5 figures, submitted to ICLR 2026
摘要:大型语言模型分为两个家族:以推理为中心的LLM,它加强内部的思维链推理但不能调用外部工具;以及智能体型LLM,它学习与环境交互并利用工具,但往往在深度推理方面滞后。这种分化源于根本不同的训练目标,导致两者优势错配,并在简单查询上效率低下:两个家族都倾向于过度思考或过度调用工具。在这项工作中,我们提出了自适应智能体基础模型(A²FM),一个遵循"先路由、后对齐"原则的统一框架:模型首先学习任务感知的路由,然后在共享骨干网络下对齐各模式特定的轨迹。为了解决效率差距,我们引入了第三种模式,即"即时"模式,直接处理简单查询,在补充智能体模式和推理模式的同时,避免不必要的推理或工具调用。为了同时提高准确性和效率,我们提出了自适应策略优化(APO),它在各模式间强制执行自适应采样,并应用成本正则化奖励。在32B规模上,A²FM在BrowseComp上达到13.4%,在AIME25上达到70.4%,在HLE上达到16.7%,在可比模型中创造了新的SOTA,并在智能体、推理和通用基准上与前沿LLM相竞争。值得注意的是,自适应执行使每个正确答案的成本仅为0.00487美元,相比推理模式降低45.2%,相比智能体模式降低33.5%,从而在保持相当准确率的同时显著提升了成本效率。
摘要:Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A²FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A²FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
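APO 的"成本正则化奖励"思想可以用一个极简函数示意:正确性奖励减去与执行成本成比例的惩罚,使模型在保证答对的前提下偏好更便宜的模式。系数与成本定义均为本文假设,论文的具体公式未在摘要中给出:

```python
def apo_reward(correct, cost, lam=0.5):
    """成本正则化奖励示意:答对得 1 分,并按归一化的调用成本
    (如工具调用次数、token 开销)扣除 lam * cost。"""
    return (1.0 if correct else 0.0) - lam * cost

# 同样答对:即时模式成本低,奖励高于重度调用工具的智能体模式
r_instant = apo_reward(True, cost=0.1)
r_agentic = apo_reward(True, cost=0.6)
print(r_instant > r_agentic)  # True
```

在这种奖励下,只有当额外的推理或工具调用确实提高答对概率时,高成本模式才会被策略保留,这正对应摘要中三种模式的自适应路由。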


【115】Semantic knowledge guides innovation and drives cultural evolution
标题:语义知识引导创新并推动文化进化
链接:https://arxiv.org/abs/2510.12837

作者:Anil Yaman, Shen Tian, Björn Lindström
摘要:累积性文化进化使人类社会能够在世代之间产生日益复杂的知识和技术。虽然社会学习在个体和世代之间传播创新,但产生这些创新的认知过程仍然知之甚少。在这里,我们表明,语义知识,即概念与其功能之间的结构化关联,通过将探索引向合理且有意义的行动,为累积性创新提供了认知脚手架。我们使用基于智能体的文化进化模型和大规模行为实验(N = 1,243)检验了这一假设,实验中个体执行一项需要将物品组合成新颖创新的任务。在这两种方法中,语义知识与社会学习协同作用以增强创新。在行为上,无法获得语义知识的参与者即使在可以进行社会学习的情况下,表现也不优于随机水平,并依赖浅层探索策略。这些发现表明,语义知识是使人类累积性文化成为可能的关键认知过程。
摘要:Cumulative cultural evolution enables human societies to generate increasingly complex knowledge and technology over generations. While social learning transmits innovations between individuals and generations, the cognitive processes that generate these innovations remain poorly understood. Here, we demonstrate that semantic knowledge-structured associations between concepts and their functions-provides cognitive scaffolding for cumulative innovation by guiding exploration toward plausible and meaningful actions. We tested this hypothesis using a cultural evolutionary agent-based model and a large-scale behavioural experiment (N = 1,243), in which individuals performed a task requiring the combination of items into novel innovations. Across both approaches, semantic knowledge and social learning interact synergistically to enhance innovation. Behaviorally, participants without access to semantic knowledge performed no better than chance, even when social learning was available, and relied on shallow exploration strategies. These findings suggest that semantic knowledge is a key cognitive process enabling human cumulative culture.


【116】Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study
标题:重新利用注释指南来指导LLM注释者:案例研究
链接:https://arxiv.org/abs/2510.12835

作者:Kon Woo Kim (National Institute of Informatics, Japan), Rezarta Islamaj (National Library of Medicine, USA), Jin-Dong Kim (Joint Support-Center for Data Science Research, Japan), Florian Boudin (Japanese-French Laboratory of Informatics, CNRS, Nantes University, Japan), Akiko Aizawa (National Institute of Informatics, Japan)
备注:11 pages, 2 figures, 3 tables, This is a preprint of the article   accepted at NLDB 2025 (Springer LNCS). The final version is available at   https://doi.org/10.1007/978-3-031-97144-0_13
摘要:本研究探讨了如何重新利用现有的注释指南来指导大型语言模型(LLM)注释者完成文本注释任务。传统指南是为经过培训并将其内化的人类注释者编写的,而LLM需要明确的结构化指令。我们提出了一种以审核为导向(moderation-oriented)的指南再利用方法,通过LLM审核过程将指南转化为对LLM的明确指令。以NCBI Disease Corpus为案例研究,我们的实验表明,再利用后的指南可以有效地指导LLM注释者,同时也揭示了若干实际挑战。结果突出了这一工作流程在支持可扩展且具有成本效益的注释指南细化和自动注释方面的潜力。
摘要:This study investigates how existing annotation guidelines can be repurposed to instruct large language model (LLM) annotators for text annotation tasks. Traditional guidelines are written for human annotators who internalize training, while LLMs require explicit, structured instructions. We propose a moderation-oriented guideline repurposing method that transforms guidelines into clear directives for LLMs through an LLM moderation process. Using the NCBI Disease Corpus as a case study, our experiments show that repurposed guidelines can effectively guide LLM annotators, while revealing several practical challenges. The results highlight the potential of this workflow to support scalable and cost-effective refinement of annotation guidelines and automated annotation.


【117】Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
标题:Gelina:通过交织令牌预测统一语音和手势合成
链接:https://arxiv.org/abs/2510.12834

作者:Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustave Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin
备注:5 pages
摘要:人类交流是多模态的,语音和手势紧密耦合,但大多数生成语音和手势的计算方法是顺序合成二者,削弱了同步性和韵律对齐。我们介绍Gelina,一个统一框架,它在离散自回归骨干网络中使用交错的令牌序列,配合特定于模态的解码器,从文本联合合成语音和伴随语音的手势。Gelina支持多说话者和多风格克隆,并支持从语音输入进行仅手势合成。主观和客观评价表明,其语音质量具有竞争力,手势生成优于单模态基线。
摘要:Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.


【118】Coherent Load Profile Synthesis with Conditional Diffusion for LV Distribution Network Scenario Generation
标题:基于条件扩散的一致负荷曲线合成,用于低压配电网场景生成
链接:https://arxiv.org/abs/2510.12832

作者:Alistair Brash, Junyi Lu, Bruce Stephen, Blair Brown, Robert Atkinson, Craig Michie, Fraser MacIntyre, Christos Tachtatzis
摘要:低压侧配电网潮流可见性有限,从规划角度给配电网运营商、从拥塞管理角度给配电系统运营商都带来了挑战。想通过情景分析来预先应对这些挑战,却受制于缺乏覆盖代表性配电馈线的真实且相互一致的负荷数据。负荷分析方法通常依赖于用典型曲线来概括需求,这过度简化了变电站级运行的复杂性,并限制了其在特定电力系统研究中的适用性。采样方法以及近来的生成模型试图通过从历史样本中合成代表性负荷来解决这一问题;然而,尽管这些方法可以以令人信服的保真度近似负荷形状,但变电站之间的共同行为(其最终会影响更高电压等级的电网运行)往往被忽视。随着低碳技术日益接入,这一局限将更加明显,因为基础负荷的估计无法捕捉负荷多样性。为弥补这一差距,本文提出一种条件扩散(Conditional Diffusion)模型,用于在低压配电变电站层面合成每日有功和无功功率曲线。保真度评估既采用捕捉时间和统计真实性的常规指标,也通过潮流建模来验证。结果表明,合成的负荷曲线无论是单独来看,还是作为一个群体置于更广泛的电力系统背景下,都是合理的。该条件扩散模型与朴素模型和最先进模型进行了基准对比,证明了其在生成可用于次区域配电网规划与运行的现实场景方面的有效性。
摘要:Limited visibility of power distribution network power flows at the low voltage level presents challenges to both distribution network operators from a planning perspective and distribution system operators from a congestion management perspective. Forestalling these challenges through scenario analysis is confounded by the lack of realistic and coherent load data across representative distribution feeders. Load profiling approaches often rely on summarising demand through typical profiles, which oversimplifies the complexity of substation-level operations and limits their applicability in specific power system studies. Sampling methods, and more recently generative models, have attempted to address this through synthesising representative loads from historical exemplars; however, while these approaches can approximate load shapes to a convincing degree of fidelity, the co-behaviour between substations, which ultimately impacts higher voltage level network operation, is often overlooked. This limitation will become even more pronounced with the increasing integration of low-carbon technologies, as estimates of base loads fail to capture load diversity. To address this gap, a Conditional Diffusion model for synthesising daily active and reactive power profiles at the low voltage distribution substation level is proposed. The evaluation of fidelity is demonstrated through conventional metrics capturing temporal and statistical realism, as well as power flow modelling. The results show synthesised load profiles are plausible both independently and as a cohort in a wider power systems context. The Conditional Diffusion model is benchmarked against both naive and state-of-the-art models to demonstrate its effectiveness in producing realistic scenarios on which to base sub-regional power distribution network planning and operations.
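
作为补充示意(并非该论文的实现),扩散模型共有的前向加噪步骤通常写作 x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1-alpha_bar_t)*eps;下面的Python草图只演示这一通用步骤,并用一条玩具负荷曲线代替真实变电站数据。条件信息(如变电站元数据)在真实模型中注入反向去噪网络,此处从略:

```python
import math
import random

def forward_noise(x0, alpha_bar, rng=None):
    """通用 DDPM 前向加噪步骤的示意:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)。
    论文中的条件信息(变电站元数据等)作用于反向去噪网络,此处未建模。"""
    rng = rng or random.Random(0)
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * rng.gauss(0, 1)
            for x in x0]

profile = [0.4, 0.5, 0.9, 1.2]            # 玩具"日负荷曲线"(标幺值)
noisy = forward_noise(profile, alpha_bar=0.5)
print(len(noisy) == len(profile))          # True:加噪不改变曲线长度
```

当 alpha_bar 接近 1 时几乎不加噪,接近 0 时样本趋于纯噪声;反向过程即从噪声逐步还原出符合条件分布的负荷曲线。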


【119】MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training
标题:MTSQL-R1:通过智能体训练实现长视界多轮Text-to-SQL
链接:https://arxiv.org/abs/2510.12831

作者:Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xiangliang Zhang, Chandan K. Reddy
摘要:多轮Text-to-SQL旨在将用户的会话语句转换为可执行的SQL,同时保持对话的连贯性并锚定到目标模式。然而,大多数现有系统只是将其视为简单的文本翻译任务,并遵循短视界范式:每轮生成一个查询,而没有执行、显式验证和细化,这导致不可执行或不连贯的输出。我们提出MTSQL-R1,一个面向长视界多轮Text-to-SQL的智能体训练框架。我们将任务建模为马尔可夫决策过程(MDP),其中智能体与(i)提供执行反馈的数据库和(ii)用于一致性验证的持久对话记忆交互,执行迭代的提议->执行->验证->细化循环,直到所有检查通过。在CoSQL和SParC上的实验表明,MTSQL-R1始终优于强基线,突出了环境驱动的验证和记忆引导的细化对会话语义解析的重要性。完整的配方(包括代码、训练模型、日志、推理轨迹等)将在内部审查后发布,以促进社区研究。
摘要:Multi-turn Text-to-SQL aims to translate a user's conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose to execute -> verify -> refine cycle until all checks pass. Experiments on COSQL and SPARC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
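
摘要中的"提议->执行->验证->细化"循环可以用如下最小草图表示;其中 propose、execute、verify、refine 均为假设的回调接口,仅作流程示意,并非论文代码:

```python
def mtsql_r1_step(propose, execute, verify, refine, question, max_rounds=5):
    """示意性的 propose -> execute -> verify -refine 之前 verify -> refine 循环:
    先提议SQL,再执行取得数据库反馈,校验通过则返回,否则基于反馈修正。
    四个回调均为假设接口,由调用方提供。"""
    sql = propose(question)
    for _ in range(max_rounds):
        result = execute(sql)                # 数据库执行反馈
        ok, feedback = verify(sql, result)   # 对话记忆一致性校验
        if ok:
            return sql
        sql = refine(sql, feedback)          # 基于反馈细化查询
    return sql

# 玩具示例:第一次提议错误,细化一轮后通过校验
candidates = iter(["SELECT 41", "SELECT 42"])
sql = mtsql_r1_step(
    propose=lambda q: next(candidates),
    execute=lambda s: s,
    verify=lambda s, r: (r == "SELECT 42", "expected SELECT 42"),
    refine=lambda s, f: next(candidates),
    question="answer?",
)
print(sql)  # SELECT 42
```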


【120】Gobernanza y trazabilidad "a prueba de AI Act" para casos de uso legales: un marco técnico-jurídico, métricas forenses y evidencias auditables
标题:面向法律用例的"经得起AI法案检验"的治理与可追溯性:技术-法律框架、取证指标与可审计证据
链接:https://arxiv.org/abs/2510.12830

作者:Alex Dantart
备注:in Spanish language
摘要:本文提出了面向法律行业人工智能系统的综合治理框架,旨在确保可验证地遵守欧盟《人工智能法案》(AI Act)。该框架集成了从法规到技术控制的规范映射、面向RAG/LLM系统的取证架构,以及按法律风险加权的指标评估体系。作为主要贡献,我们提出了rag-forense,即该框架的开源实现,并附带一个用于证明合规性的实验协议。
摘要:This paper presents a comprehensive governance framework for AI systems in the legal sector, designed to ensure verifiable compliance with the EU AI Act. The framework integrates a normative mapping of the regulation to technical controls, a forensic architecture for RAG/LLM systems, and an evaluation system with metrics weighted by legal risk. As a primary contribution, we present rag-forense, an open-source implementation of the framework, accompanied by an experimental protocol to demonstrate compliance. -- Este art\'iculo presenta un marco integral de gobernanza para sistemas de IA en el sector legal, dise\~nado para garantizar el cumplimiento verificable del Reglamento de IA de la UE (AI Act). El marco integra una cartograf\'ia normativa de la ley a controles t\'ecnicos, una arquitectura forense para sistemas RAG/LLM y un sistema de evaluaci\'on con m\'etricas ponderadas por el riesgo jur\'idico. Como principal contribuci\'on, se presenta rag-forense, una implementaci\'on de c\'odigo abierto del marco, acompa\~nada de un protocolo experimental para demostrar la conformidad.


【121】Mathematics with large language models as provers and verifiers
标题:以大型语言模型作为证明者和验证者的数学
链接:https://arxiv.org/abs/2510.12829

作者:Hieu Le Duc, Leo Liberti
摘要:在2024年和2025年期间,关于大型语言模型定理证明能力的讨论开始出现有趣的成功案例,大多与困难的习题有关(如国际数学奥林匹克竞赛的问题),但也涉及为检验人工智能能否给出证明而提出的猜想[Feldman & Karbasi, arXiv:2509.18383v1]。在本文中,我们报告了ChatGPT通过一个协议实现的定理证明成果:该协议让gpt-5模型的不同证明者与验证者实例协同工作。为确保所产生的证明不受幻觉影响,最终证明由Lean证明助手进行形式化验证,Lean代码的前提和结论与原命题的一致性则由人工核验。我们的方法解决了2025年IMO六道题中的五道,并关闭了[Cohen, Journal of Integer Sequences, 2025]中六十六个数论猜想中的三分之一。
摘要:During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology was able to solve five out of six 2025 IMO problems, and close a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].


【122】Scheming Ability in LLM-to-LLM Strategic Interactions
标题:LLM与LLM战略互动的策划能力
链接:https://arxiv.org/abs/2510.12826

作者:Thao Pham
备注:25 pages, 13 figures, under review at IASEAI'26
摘要:随着大型语言模型(LLM)智能体在不同环境中自主部署,评估其战略欺骗能力变得至关重要。虽然最近的研究已经考察了AI系统如何对人类开发者施展计谋,但LLM对LLM的计谋行为仍未得到充分探索。我们通过两个博弈论框架研究前沿LLM智能体的计谋能力和倾向:廉价谈话(Cheap Talk)信号博弈和同行评估(Peer Evaluation)对抗博弈。我们测试了四个模型(GPT-4o、Gemini-2.5-pro、Claude-3.7-Sonnet和Llama-3.3-70b),在有和没有显式提示的情况下测量计谋表现,同时通过思维链推理分析计谋策略。在给出提示时,大多数模型(尤其是Gemini-2.5-pro和Claude-3.7-Sonnet)达到了近乎完美的表现。至关重要的是,模型在没有提示的情况下也表现出明显的计谋倾向:所有模型在同行评估中都选择欺骗而非坦白(100%),而在廉价谈话中选择施展计谋的模型成功率达95-100%。这些发现强调了在多智能体环境中使用高风险博弈论场景进行稳健评估的必要性。
摘要:As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.


【123】Classifier-Augmented Generation for Structured Workflow Prediction
标题:结构化工作流预测的分类器增强生成
链接:https://arxiv.org/abs/2510.12825

作者:Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, Sameep Mehta
备注:Accepted at EMNLP 2025
摘要:ETL(提取、转换、加载)工具(如IBM DataStage)允许用户可视化地组装复杂的数据工作流,但配置各阶段(stage)及其属性仍然非常耗时,并且需要深入的工具知识。我们提出了一个将自然语言描述转换为可执行工作流的系统,自动预测流程的结构和详细配置。其核心是分类器增强生成(CAG)方法,该方法将话语分解与分类器和特定于阶段的少样本提示相结合,以产生准确的阶段预测。然后使用边预测将这些阶段连接成非线性工作流,并根据子话语上下文推断阶段属性。我们将CAG与强大的单提示基线和智能体基线进行比较,显示出更高的准确性和效率,同时大大减少了令牌使用。我们的架构是模块化的、可解释的,并能够进行端到端的工作流生成,包括稳健的验证步骤。据我们所知,这是第一个对自然语言驱动的ETL创作中的阶段预测、边布局和属性生成进行详细评估的系统。
摘要:ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
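
CAG的"分解话语 -> 分类到阶段 -> 选取该阶段的少样本示例"流程可以粗略示意如下;分类器此处用玩具规则代替(真实系统中为学习得到的分类器),阶段名与示例文本均为假设,并非IBM DataStage的真实接口:

```python
# 假设的少样本示例库:每个阶段(stage)对应若干提示示例
FEW_SHOT = {
    "filter": ["例: 去掉空行 -> Filter(condition=...)"],
    "join":   ["例: 按id合并 -> Join(key=id)"],
}

def classify_stage(sub_utterance):
    """玩具规则分类器,仅作占位;真实系统中为训练得到的分类器。"""
    return "join" if "合并" in sub_utterance or "join" in sub_utterance else "filter"

def build_prompts(utterance):
    """将话语拆分为子话语,为每个子话语预测阶段并附上该阶段的少样本示例。"""
    prompts = []
    for sub in utterance.split(","):
        stage = classify_stage(sub)
        prompts.append({"stage": stage, "examples": FEW_SHOT[stage], "text": sub})
    return prompts

plan = build_prompts("去掉空行,按id合并两表")
print([p["stage"] for p in plan])  # ['filter', 'join']
```

每个条目随后交给LLM做阶段属性生成,再用边预测把阶段连成工作流;这正是先分类、后生成可以压缩提示长度并减少令牌用量的原因。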


【124】Evidence Without Injustice: A New Counterfactual Test for Fair Algorithms
标题:没有不公正的证据:公平算法的新反事实测试
链接:https://arxiv.org/abs/2510.12822

作者:Michele Loi, Marcello Di Bello, Nicolò Cangiotti
备注:13 pages
摘要:越来越多的关于算法公平性的哲学文献研究了统计标准,如均衡几率和校准,因果和反事实方法,以及结构性和复合不公正的作用。然而,一个重要的维度被忽视了:算法输出本身的证据价值是否取决于结构性不公正。我们的两个典型例子对比了依赖于历史犯罪数据的预测警务算法和基于摄像头的记录正在进行的犯罪的系统,两者都旨在指导警察部署。在评估根据证据采取行动的道德可接受性时,我们不仅要问证据在现实世界中是否具有证明力,而且要问如果没有相关的不公正,它在附近的世界中是否仍然具有证明力。预测警务算法没有通过这个测试,但基于摄像头的系统通过了测试。当证据没有通过测试时,惩罚性地使用它在道德上是有问题的,比通过测试的证据更有问题。
摘要:The growing philosophical literature on algorithmic fairness has examined statistical criteria such as equalized odds and calibration, causal and counterfactual approaches, and the role of structural and compounding injustices. Yet an important dimension has been overlooked: whether the evidential value of an algorithmic output itself depends on structural injustice. Our paradigmatic pair of examples contrasts a predictive policing algorithm, which relies on historical crime data, with a camera-based system that records ongoing offenses, both designed to guide police deployment. In evaluating the moral acceptability of acting on a piece of evidence, we must ask not only whether the evidence is probative in the actual world, but also whether it would remain probative in nearby worlds without the relevant injustices. The predictive policing algorithm fails this test, but the camera-based system passes it. When evidence fails the test, it is morally problematic to use it punitively, more so than evidence that passes the test.


【125】Beyond Discrete Categories: Multi-Task Valence-Arousal Modeling for Pet Vocalization Analysis
标题:超越离散类别:用于宠物发声分析的多任务效价-唤醒建模
链接:https://arxiv.org/abs/2510.12819

作者:Junyao Huang, Rumin Situ
备注:24 pages, 6 figures, 4 tables. First continuous VA framework for pet vocalization analysis with 42,553 samples
摘要:传统的基于离散分类的宠物发声情感识别难以处理模糊性并捕捉强度变化。我们提出了一个连续的效价-唤醒(VA)模型,在二维空间中表示情绪。我们的方法使用自动VA标签生成算法,能够对42,553个宠物发声样本进行大规模标注。多任务学习框架将VA回归与辅助任务(情感、体型、性别)联合训练,通过改进特征学习来增强预测。我们的Audio Transformer模型在验证集上实现了效价(Valence)Pearson相关系数r = 0.9024和唤醒(Arousal)r = 0.7155,有效地消解了"领地性"和"快乐"等离散类别之间的混淆。这项工作引入了第一个用于宠物发声分析的连续VA框架,为人宠互动、兽医诊断和行为训练提供了更具表现力的表示。该方法显示出在AI宠物情感翻译器等消费产品中部署的强大潜力。
摘要:Traditional pet emotion recognition from vocalizations, based on discrete classification, struggles with ambiguity and capturing intensity variations. We propose a continuous Valence-Arousal (VA) model that represents emotions in a two-dimensional space. Our method uses an automatic VA label generation algorithm, enabling large-scale annotation of 42,553 pet vocalization samples. A multi-task learning framework jointly trains VA regression with auxiliary tasks (emotion, body size, gender) to enhance prediction by improving feature learning. Our Audio Transformer model achieves a validation Valence Pearson correlation of r = 0.9024 and an Arousal r = 0.7155, effectively resolving confusion between discrete categories like "territorial" and "happy." This work introduces the first continuous VA framework for pet vocalization analysis, offering a more expressive representation for human-pet interaction, veterinary diagnostics, and behavioral training. The approach shows strong potential for deployment in consumer products like AI pet emotion translators.
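
摘要中"VA回归 + 辅助分类任务"的多任务目标可以用如下纯Python草图示意;损失的具体形式与权重 aux_weight 均为假设,并非论文公布的超参数:

```python
import math

def multitask_loss(va_pred, va_true, aux_logits, aux_label, aux_weight=0.3):
    """示意性的多任务目标:效价-唤醒(VA)回归的MSE
    加上一个辅助分类任务(如情感/体型/性别)的softmax交叉熵。
    aux_weight 为假设的任务权重。"""
    mse = sum((p - t) ** 2 for p, t in zip(va_pred, va_true)) / len(va_true)
    # 数值稳定的 softmax 交叉熵
    m = max(aux_logits)
    z = [math.exp(x - m) for x in aux_logits]
    log_prob = math.log(z[aux_label] / sum(z))
    return mse + aux_weight * (-log_prob)

loss = multitask_loss(va_pred=[0.8, 0.6], va_true=[0.9, 0.7],
                      aux_logits=[2.0, 0.1, -1.0], aux_label=0)
print(round(loss, 3))
```

辅助项通过共享特征对回归头施加额外监督,这正是摘要所说"通过改进特征学习来增强预测"的机制。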


【126】MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
标题:MEDEQUALQA:使用反事实推理评估LLM中的偏差
链接:https://arxiv.org/abs/2510.12818

作者:Rajarshi Ghosh, Abhay Gupta, Hudson McBride, Anurag Vaidya, Faisal Mahmood
摘要:大型语言模型(LLM)越来越多地部署于临床决策支持,但微妙的人口统计学线索可能影响其推理。先前的工作已经记录了不同患者群体之间的输出差异,但对于在受控的人口统计学变化下内部推理如何变化知之甚少。我们介绍MEDEQUALQA,一个反事实基准,它仅扰动患者代词(he/him、she/her、they/them),而保持关键症状和病情(CSC)不变。每个临床小片段被扩展为单CSC消融,产生三个平行数据集,每个约23,000条(共69,000条)。我们评估了GPT-4.1模型,并计算推理轨迹之间的语义文本相似度(STS)来衡量不同代词变体下的稳定性。我们的结果显示总体相似度较高(平均STS > 0.80),但揭示了在引用的风险因素、指南锚点和鉴别诊断排序上存在一致的局部分歧,即使最终诊断保持不变。我们的错误分析突出了推理发生偏移的某些案例,强调了可能级联为不公平照护的临床相关偏倚位点。MEDEQUALQA为审计医疗AI中的推理稳定性提供了一个受控的诊断环境。
摘要:Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.
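
衡量两条推理轨迹稳定性的STS计算可以用词袋余弦相似度粗略示意;论文实际使用的嵌入模型未在摘要中给出,以下只是一个代理实现:

```python
from collections import Counter
import math

def sts_proxy(trace_a, trace_b):
    """用词袋余弦相似度作为语义文本相似度(STS)的粗略代理。
    真实评测通常使用句向量嵌入,此处仅作示意。"""
    ca, cb = Counter(trace_a.split()), Counter(trace_b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# 两条仅末词不同的"推理轨迹":整体相似度高,但存在局部分歧
s = sts_proxy("risk factor smoking guideline anchor A",
              "risk factor smoking guideline anchor B")
print(s > 0.8)  # True
```

这也说明了摘要的结论为何成立:平均STS可以很高(>0.80),而个别引用的风险因素或指南锚点仍然随代词变体发生偏移。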


【127】From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
标题:从噪音到信号再到Selbstzweck:NLP后训练时代重新构建人类标签变异
链接:https://arxiv.org/abs/2510.12817

作者:Shanshan Xu, Santosh T.Y.S.S, Barbara Plank
摘要:人类标签变异(Human Label Variation,HLV)指注释中的合理分歧,反映的是人类观点的真正多样性,而不仅仅是错误。几十年来,NLP中的HLV一直被当作应予丢弃的噪声,直到过去十年才慢慢被重新定义为提升模型鲁棒性的信号。随着大型语言模型(LLM)的兴起,基于人类反馈的后训练已成为模型对齐的核心,HLV的作用变得愈发重要。然而,当前的偏好学习数据集通常将多个注释聚合为单一标签,从而将不同观点压平为一种虚假的普遍共识,恰恰抹去了对齐旨在维护的人类价值观的多元性。在这篇立场论文中,我们主张,保留作为人类多元化体现的HLV必须被视为Selbstzweck(即在设计AI系统时其自身即为目的)。我们呼吁积极主动地将HLV纳入偏好数据集,并概述了可行的步骤。
摘要:Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly over the last decade has it been reframed as a signal for improving model robustness. With the rise of large language models (LLMs), where post-training on human feedback has become central to model alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely aggregate multiple annotations into a single label, thereby flattening diverse perspectives into a false universal agreement and erasing precisely the pluralism of human values that alignment aims to preserve. In this position paper, we argue that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, a goal in itself, when designing AI systems. We call for proactively incorporating HLV into preference datasets and outline actionable steps towards it.


【128】Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study
标题:使用大语言模型和BioBERT的电子健康记录中的癌症诊断分类:模型性能评估研究
链接:https://arxiv.org/abs/2510.12813

作者:Soheil Hashtarkhani, Rezaur Rashid, Christopher L Brett, Lokesh Chinthala, Fekede Asefa Kumsa, Janet A Zink, Robert L Davis, David L Schwartz, Arash Shaban-Nejad
备注:8 Pages
摘要:电子健康记录包含结构不一致或自由文本的数据,需要有效的预处理才能支持预测性医疗模型。虽然人工智能驱动的自然语言处理工具在自动化诊断分类方面显示出前景,但其相对性能和临床可靠性需要系统评估。本研究的目的是评估4个大型语言模型(GPT-3.5、GPT-4o、Llama 3.2和Gemini 1.5)和BioBERT从结构化和非结构化电子健康记录数据中分类癌症诊断的性能。我们分析了来自3456份癌症患者记录的762个独特诊断(326个国际疾病分类(ICD)代码描述,436个自由文本条目),测试了各模型将诊断归入14个预定义类别的能力。两名肿瘤学专家验证了分类结果。BioBERT在ICD代码上获得了最高的加权宏F1分数(84.2),在ICD代码准确率上与GPT-4o持平(90.8)。对于自由文本诊断,GPT-4o在加权宏F1分数上优于BioBERT(71.8 vs 61.5),准确率略高(81.9 vs 81.6)。GPT-3.5、Gemini和Llama在两种格式上的整体性能较低。常见的错误分类模式包括转移瘤和中枢神经系统肿瘤之间的混淆,以及涉及模糊或重叠临床术语的错误。虽然当前性能水平对于行政和研究用途似乎已经足够,但可靠的临床应用将需要标准化的文档实践,以及对高风险决策的有力人工监督。
摘要:Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.
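
文中报告的加权宏F1可按"以各类别支持度加权的逐类F1均值"计算;下面用纯Python写出其定义(行为与常见库中 average='weighted' 的F1一致),仅作定义示意:

```python
from collections import Counter

def weighted_macro_f1(y_true, y_pred):
    """按类别支持度加权的宏平均F1:
    对每个类别计算精确率/召回率/F1,再按该类在y_true中的占比加权求和。"""
    support = Counter(y_true)
    total, score = len(y_true), 0.0
    for c, n in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1
    return score

print(weighted_macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

与简单宏平均相比,加权版本在类别不平衡的诊断数据上更能反映整体表现,这也是此类研究常报告它的原因。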


【129】Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
标题:在Zero-Shot和Few-Shot学习中对波斯语的开源大型语言模型进行基准测试
链接:https://arxiv.org/abs/2510.12807

作者:Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat
摘要:大型语言模型(LLM)在许多语言中表现出了卓越的能力;然而,它们在波斯语等低资源语言中的有效性需要彻底的调查。本文利用零样本和少样本学习范式,对若干开源LLM在波斯语自然语言处理(NLP)任务上进行了全面的基准测试。我们使用ParsiNLU和ArmanEmo等已有波斯语数据集,在情感分析、命名实体识别、阅读理解和问答等一系列任务上评估了这些模型。我们的方法为零样本和少样本两种情形建立了严格的实验设置,并采用准确率、F1分数、BLEU和ROUGE等指标进行性能评估。结果显示,Gemma 2在两种学习范式下几乎所有任务中都优于其他模型,在复杂推理任务中表现尤为突出。然而,大多数模型都难以处理命名实体识别等词元级理解任务,突显了波斯语处理中的特定挑战。这项研究为日益增多的多语言LLM研究做出了贡献,为其在波斯语中的表现提供了有价值的见解,并为未来的模型开发提供了基准。
摘要:Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.


【130】AutoCode: LLMs as Problem Setters for Competitive Programming
标题:AutoCode:LLM作为竞争性编程的出题者
链接:https://arxiv.org/abs/2510.12803

作者:Shang Zhou, Zihan Zheng, Kaiyuan Liu, Zeyu Shen, Zerui Cheng, Zexing Chen, Hansen He, Jianzhu Yao, Huanzhi Mao, Qiuyang Mang, Tianfu Fu, Beichen Li, Dongruixuan Li, Wenhao Chai, Zhuang Liu, Aleksandra Korolova, Peter Henderson, Natasha Jaques, Pramod Viswanath, Saining Xie, Jingbo Shang
备注:Project page: this https URL
摘要:编写有竞争力的编程题目要求很高。出题者必须:设置约束、输入分布以及能排除取巧解法的边缘用例;针对特定算法(例如最大流、动态规划、数据结构)出题;并将复杂度校准到大多数参赛者难以企及的水平。我们认为,这为检验大型语言模型的通用能力提供了理想的测试,并研究它们能否可靠地做到这一点。我们介绍AutoCode,它使用多轮验证来产生竞赛级的题目描述和测试用例。在留出的题目上,AutoCode测试套件与官方评判的一致性接近99%,显著优于当前最先进的方法(如一致性不足81%的HardTests)。此外,从随机种子题目出发,AutoCode可以连同参考解和暴力解一起创建新的变体。通过在测试用例上交叉验证这些生成的解,我们可以进一步过滤掉畸形的题目。我们的系统确保了高度的正确性,并经人类专家验证。AutoCode成功产生了被特级大师级(前0.3%)竞赛程序员评判为达到竞赛质量的新颖题目。
摘要:Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
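
摘要中"用参考解与暴力解在测试用例上交叉验证"的过滤步骤可示意如下;reference 与 brute_force 为假设的可调用对象,这里用一对等价的求和实现演示:

```python
def cross_verify(reference, brute_force, test_inputs):
    """示意:在全部测试输入上比对参考解与暴力解的输出;
    任何不一致都意味着生成的题目(或其解)可能是畸形的。"""
    return all(reference(x) == brute_force(x) for x in test_inputs)

# 玩具示例:闭式求和公式 vs 朴素暴力求和
ok = cross_verify(lambda n: n * (n + 1) // 2,
                  lambda n: sum(range(n + 1)),
                  range(10))
print(ok)  # True:两种解在所有测试输入上一致
```

真实流水线中两种解由LLM生成并在沙箱中对拍运行,不一致的题目会被丢弃,而非像这里一样直接调用Python函数。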


【131】Dedelayed: Deleting remote inference delay via on-device correction
标题:Dedelayed:通过设备上纠正删除远程推理延迟
链接:https://arxiv.org/abs/2510.13714

作者:Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar
摘要:远程推理允许轻量级设备利用强大的云端模型。然而,通信网络延迟使预测变得陈旧,不适合实时任务。为了解决这个问题,我们引入了Dedelayed,一种延迟校正方法,可以减轻任意远程推理延迟,使本地设备能够实时产生低延迟输出。我们的方法采用一个轻量级本地模型处理当前帧,并融合由重量级远程模型从过去帧计算得到的特征。在来自BDD100K驾驶数据集的视频上,对于所有超过33 ms的现实通信网络延迟,Dedelayed的语义分割精度均优于仅本地和仅远程基线中较强者。在不引入额外延迟的情况下,对于100 ms的往返延迟,它比完全本地推理提高6.4 mIoU,比远程推理提高9.8 mIoU。在更长的延迟和更高运动量的场景下,这一优势还会扩大,因为延迟缓解的分离式推理能更有效地维持精度,为必须与当前世界状态保持一致的实时任务提供明显优势。
摘要:Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
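
"本地模型处理当前帧 + 融合远程模型对过去帧的特征"可以用如下玩具草图示意;线性融合与权重 alpha 均为假设,论文中的融合由网络学习得到:

```python
def fuse(local_feat, remote_feat_delayed, alpha=0.5):
    """示意:将本地轻量模型对当前帧的特征与远程重模型
    对过去帧(延迟到达)的特征逐元素线性融合。
    alpha 为假设的融合权重;真实系统中融合是学习得到的。"""
    return [alpha * l + (1 - alpha) * r
            for l, r in zip(local_feat, remote_feat_delayed)]

frames = [[float(i)] * 2 for i in range(5)]   # 玩具"特征"序列,每帧2维
delay = 2                                      # 远程结果晚到 2 帧
t = 4                                          # 当前时刻
fused = fuse(frames[t], frames[t - delay])     # 当前帧特征 + 延迟的远程特征
print(fused)  # [3.0, 3.0]
```

关键设计是本地模型始终以当前帧为输入,因此输出不必等待远程往返;延迟的远程特征只作为补充上下文,这正是精度优于"仅本地"与"仅远程"两种基线的来源。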


【132】Narrow Operator Models of Stellarator Equilibria in Fourier Zernike Basis
标题:傅里叶-泽尼克基下仿星器平衡的窄算子模型
链接:https://arxiv.org/abs/2510.13521

作者:Timo Thun, Rory Conlin, Dario Panici, Daniel Böckenhoff
备注:15 pages, 6 figures, 1 table
摘要:理想磁流体动力学(MHD)平衡磁场的数值计算是仿星器优化的基础,并为求解更复杂的偏微分方程(PDE)(如运输或湍流模型)提供了起点。传统的方法求解理想MHD方程的一个稳定点,该稳定点由三个不变量和求解器所采用的数值方案完全定义。我们提出了第一个数值方法,可以解决一个连续分布的平衡与固定的边界和旋转变换,只改变压力不变。这种方法通过优化多层感知器(MLP)的参数来最大限度地减少力的残余,该多层感知器(MLP)从标量压力乘数映射到傅立叶-泽尼克基,如在现代仿星器平衡求解器DESC中所实现的。
摘要:Numerical computation of the ideal Magnetohydrodynamic (MHD) equilibrium magnetic field is at the base of stellarator optimisation and provides the starting point for solving more sophisticated Partial Differential Equations (PDEs) like transport or turbulence models. Conventional approaches solve for a single stationary point of the ideal MHD equations, which is fully defined by three invariants and the numerical scheme employed by the solver. We present the first numerical approach that can solve for a continuous distribution of equilibria with fixed boundary and rotational transform, varying only the pressure invariant. This approach minimises the force residual by optimising parameters of multilayer perceptrons (MLP) that map from a scalar pressure multiplier to the Fourier Zernike basis as implemented in the modern stellarator equilibrium solver DESC.


【133】Semantic Communication Enabled Holographic Video Processing and Transmission
标题:语义通信使能的全息视频处理与传输
链接:https://arxiv.org/abs/2510.13408

作者:Jingkai Ying, Zhiyuan Qi, Yulong Feng, Zhijin Qin, Zhu Han, Rahim Tafazolli, Yonina C. Eldar
备注:7 pages, 6 figures, Submit for review
摘要:全息视频通信被认为是视觉通信的范式转变,因其能够提供沉浸式体验而日益受到欢迎。本文概述了全息视频通信,并给出了全息视频通信系统的要求。特别地,在简要回顾语义通信之后,提出了语义使能的全息视频通信系统的体系结构,并在此基础上设计了语义采样、语义-信道联合编码和语义感知传输等关键技术。文中给出两个相关用例以证明所提方法的性能增益。最后,讨论了潜在的研究课题,为实现语义使能的全息视频通信铺平道路。
摘要:Holographic video communication is considered a paradigm shift in visual communications, becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. Particularly, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gain of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications.


【134】A Multi-dimensional Semantic Surprise Framework Based on Low-Entropy Semantic Manifolds for Fine-Grained Out-of-Distribution Detection
标题:一种基于低熵语义流形的多维语义惊喜框架,用于细粒度分布外检测
链接:https://arxiv.org/abs/2510.13093

作者:Ningkang Peng, Yuzhe Mao, Yuhao Zhang, Linjin Qian, Qianfeng Yu, Yanhui Gu, Yi Chen, Li Kong
摘要:分布外(OOD)检测是在开放世界中安全部署AI系统的基石。然而,现有方法将OOD检测视为二元分类问题,这种认知上的扁平化无法区分语义上接近(近OOD)和遥远(远OOD)的未知风险。这一局限在需要细粒度风险分层的应用中构成了重要的安全瓶颈。为此,我们提出从传统概率视角向有原则的信息论框架的范式转变。我们将核心任务形式化为量化新样本的语义惊喜,并引入一个新的三元分类挑战:分布内(ID)、近OOD与远OOD。我们工作的理论基础是低熵语义流形的概念,这种流形被显式构造以反映数据内在的语义层次。为构造这些流形,我们设计了分层原型网络,并引入语义惊喜向量(SSV)这一通用探针,将样本的总惊喜分解为三个互补且可解释的维度:一致性、新颖性和歧义性。为评估该新任务上的性能,我们提出了归一化语义风险(nSR)这一成本敏感度量。实验表明,我们的框架不仅在具有挑战性的三元任务上建立了新的最先进水平(SOTA),其稳健的表示也在传统二元基准上取得最佳结果,在LSUN等数据集上将误报率降低了60%以上。
摘要:Out-of-Distribution (OOD) detection is a cornerstone for the safe deployment of AI systems in the open world. However, existing methods treat OOD detection as a binary classification problem, a cognitive flattening that fails to distinguish between semantically close (Near-OOD) and distant (Far-OOD) unknown risks. This limitation poses a significant safety bottleneck in applications requiring fine-grained risk stratification. To address this, we propose a paradigm shift from a conventional probabilistic view to a principled information-theoretic framework. We formalize the core task as quantifying the Semantic Surprise of a new sample and introduce a novel ternary classification challenge: In-Distribution (ID) vs. Near-OOD vs. Far-OOD. The theoretical foundation of our work is the concept of Low-Entropy Semantic Manifolds, which are explicitly structured to reflect the data's intrinsic semantic hierarchy. To construct these manifolds, we design a Hierarchical Prototypical Network. We then introduce the Semantic Surprise Vector (SSV), a universal probe that decomposes a sample's total surprise into three complementary and interpretable dimensions: conformity, novelty, and ambiguity. To evaluate performance on this new task, we propose the Normalized Semantic Risk (nSR), a cost-sensitive metric. Experiments demonstrate that our framework not only establishes a new state-of-the-art (sota) on the challenging ternary task, but its robust representations also achieve top results on conventional binary benchmarks, reducing the False Positive Rate by over 60% on datasets like LSUN.
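下面用一个假设性草图说明“基于类原型距离,把样本的总惊喜分解为一致性、新颖性、歧义性三个维度”的一种可能实现;三个维度的具体公式为示意性假设,并非论文的原始定义:

```python
import numpy as np

def semantic_surprise_vector(z, prototypes):
    """z: 样本嵌入 (d,);prototypes: 类原型矩阵 (K, d)。"""
    d = np.linalg.norm(prototypes - z, axis=1)  # 到各类原型的距离
    d_sorted = np.sort(d)
    conformity = -d_sorted[0]          # 越接近最近原型,一致性越高
    novelty = d.mean()                 # 离所有原型整体越远,新颖性越高
    # 歧义:最近与次近原型距离越接近,样本越“模棱两可”
    ambiguity = d_sorted[0] / (d_sorted[1] + 1e-8)
    return np.array([conformity, novelty, ambiguity])
```

在这种分解下,近OOD样本倾向于高歧义、中等新颖性,而远OOD样本倾向于高新颖性,从而支持比二元判定更细的风险分层。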


【135】Towards Human-Centric Intelligent Treatment Planning for Radiation Therapy
标题:迈向以人为本的放射治疗智能治疗规划
链接:https://arxiv.org/abs/2510.13062

作者:Adnan Jafar, Xun Jia
备注:27 pages, 3 figures
摘要:目前的放射疗法治疗计划受到次优计划质量、低效率和高成本的限制。本文探讨了治疗计划的复杂性,并介绍了以人为中心的智能治疗计划(HCITP),这是一个在人类监督下的人工智能驱动框架,它集成了临床指南,自动生成计划,并实现了与操作员的直接交互。我们期望HCITP将提高效率,可能将规划时间缩短到几分钟,并将提供个性化的高质量计划。讨论了挑战和潜在的解决方案。
摘要:Current radiation therapy treatment planning is limited by suboptimal plan quality, inefficiency, and high costs. This perspective paper explores the complexity of treatment planning and introduces Human-Centric Intelligent Treatment Planning (HCITP), an AI-driven framework under human oversight, which integrates clinical guidelines, automates plan generation, and enables direct interactions with operators. We expect that HCITP will enhance efficiency, potentially reducing planning time to minutes, and will deliver personalized, high-quality plans. Challenges and potential solutions are discussed.


【136】HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection
标题:HyWA:超网络权重自适应个性化语音活动检测
链接:https://arxiv.org/abs/2510.12947

作者:Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia
备注:Mahsa Ghazvini Nejad and Hamed Jafarzadeh Asl contributed equally to this work
摘要:个性化语音活动检测(PVAD)系统通过结合来自注册话语的说话人嵌入,仅在特定目标说话人说话时激活。与需要架构更改(如FiLM层)的现有方法不同,我们的方法采用超网络来修改标准语音活动检测(VAD)模型中若干选定层的权重。这样无需改变VAD架构即可实现说话人条件化,使同一VAD模型只需更新一小部分层就能适应不同说话人。我们提出了超网络权重自适应方法HyWA-PVAD,并与多种基线条件化技术进行了对比评估,结果显示PVAD性能持续提升。HyWA通过保留核心VAD架构,还为部署带来了实际优势。我们的新方法在两方面改进了现有条件化技术:i)提高了平均精度均值;ii)通过复用同一VAD架构简化了部署。
摘要:Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves the current conditioning techniques in two ways: i) increases the mean average precision, ii) simplifies deployment by reusing the same VAD architecture.
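下面是一个假设性草图,示意“超网络根据说话人嵌入生成VAD中某个选定层的权重”这一机制;所有维度与结构均为任意假设,实际HyWA作用于若干选定层而非单层:

```python
import numpy as np

EMB, IN, OUT = 4, 6, 3  # 说话人嵌入维度、被替换层的输入/输出维度

rng = np.random.default_rng(1)
H = rng.normal(size=(IN * OUT, EMB)) * 0.1  # 超网络本身:此处为一个线性映射

def hyper_weights(speaker_emb):
    # 由说话人嵌入生成该层的权重矩阵;VAD 主干架构保持不变
    return (H @ speaker_emb).reshape(OUT, IN)

def conditioned_layer(x, speaker_emb):
    W = hyper_weights(speaker_emb)
    return np.maximum(W @ x, 0.0)  # 权重被替换后的 ReLU 线性层
```

换一个说话人嵌入即换一组层权重,因此同一套VAD前向代码可直接服务不同的目标说话人。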


【137】InferA: A Smart Assistant for Cosmological Ensemble Data
标题:InferA:宇宙集合数据的智能助手
链接:https://arxiv.org/abs/2510.12920

作者:Justin Z. Tam, Pascal Grosset, Divya Banesh, Nesar Ramachandra, Terece L. Turton, James Ahrens
摘要:分析大规模科学数据集由于数据量庞大、结构复杂以及需要专业领域知识而面临巨大挑战。PandasAI等自动化工具通常需要完整摄取数据,且缺乏对完整数据结构的上下文理解,因而无法胜任TB级数据集的智能数据分析助手。为克服这些限制,我们提出InferA,一个利用大型语言模型实现可扩展、高效科学数据分析的多智能体系统。该架构的核心是一个主管(supervisor)智能体,它协调一组分别负责数据检索和分析不同阶段的专门智能体。系统与用户交互,以引出其分析意图并确认查询目标,确保用户目标与系统操作保持一致。为证明该框架的可用性,我们使用来自HACC宇宙学模拟、总量达数TB的系综(ensemble)运行对系统进行了评估。
摘要:Analyzing large-scale scientific datasets presents substantial challenges due to their sheer volume, structural complexity, and the need for specialized domain knowledge. Automation tools, such as PandasAI, typically require full data ingestion and lack context of the full data structure, making them impractical as intelligent data analysis assistants for datasets at the terabyte scale. To overcome these limitations, we propose InferA, a multi-agent system that leverages large language models to enable scalable and efficient scientific data analysis. At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of the data retrieval and analysis. The system engages interactively with users to elicit their analytical intent and confirm query objectives, ensuring alignment between user goals and system actions. To demonstrate the framework's usability, we evaluate the system using ensemble runs from the HACC cosmology simulation which comprises several terabytes.
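下面用一个极简草图示意“主管智能体按阶段把任务路由给专门智能体”的编排思路;智能体名称、接口与任务内容均为假设,仅用于说明调度结构,并非InferA的实际实现:

```python
# 两个假设的专门智能体:分别负责数据检索与分析阶段
def retrieval_agent(task):
    return f"retrieved:{task}"

def analysis_agent(task):
    return f"analyzed:{task}"

AGENTS = {"retrieve": retrieval_agent, "analyze": analysis_agent}

def supervisor(plan):
    """plan: [(阶段名, 任务描述), ...];主管按阶段把任务路由给对应智能体。"""
    results = []
    for stage, task in plan:
        results.append(AGENTS[stage](task))
    return results
```

真实系统中主管还会与用户多轮交互以确认查询意图,并根据中间结果动态调整后续阶段;此处仅保留最核心的“计划 → 路由 → 汇总”骨架。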


【138】Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
标题:现代的自动语音识别:架构、训练和评估
链接:https://arxiv.org/abs/2510.12827

作者:Md. Nayeem, Md Shamse Tabrej, Kabbojit Jit Deb, Shaonti Goswami, Md. Azizul Hakim
摘要:在过去十年中,自动语音识别(ASR)在深度学习的推动下经历了深刻变革。本综述全面概述了ASR的现代时代,描绘了其从传统混合系统(如高斯混合模型-隐马尔可夫模型(GMM-HMM)和深度神经网络-HMM(DNN-HMM))到如今占主导地位的端到端神经架构的演变。我们系统回顾了基础的端到端范式:连接时序分类(CTC)、基于注意力的编码器-解码器模型以及循环神经网络转换器(RNN-T),它们为完全集成的语音到文本系统奠定了基础。随后详细介绍了向Transformer和Conformer模型的架构转变,这类模型利用自注意力以高计算效率捕获长程依赖。本综述的一个中心主题是训练范式的并行革命:我们考察了从完全监督学习(辅以SpecAugment等技术)到自监督学习(SSL)兴起的进程,以及wav2vec 2.0等基础模型如何大幅减少对转录数据的依赖。此外,我们分析了Whisper等大规模弱监督模型的影响,它们通过海量数据多样性实现了前所未有的鲁棒性。本文还涵盖了生态系统的基本组成部分,包括关键数据集和基准(如LibriSpeech、Switchboard、CHiME)、标准评估指标(如词错误率),以及现实部署中的关键考量,如流式推理、端侧效率以及公平性和鲁棒性的伦理要求。最后,我们概述了开放挑战和未来研究方向。
摘要:Automatic Speech Recognition (ASR) has undergone a profound transformation over the past decade, driven by advances in deep learning. This survey provides a comprehensive overview of the modern era of ASR, charting its evolution from traditional hybrid systems, such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-HMMs (DNN-HMMs), to the now-dominant end-to-end neural architectures. We systematically review the foundational end-to-end paradigms: Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which established the groundwork for fully integrated speech-to-text systems. We then detail the subsequent architectural shift towards Transformer and Conformer models, which leverage self-attention to capture long-range dependencies with high computational efficiency. A central theme of this survey is the parallel revolution in training paradigms. We examine the progression from fully supervised learning, augmented by techniques like SpecAugment, to the rise of self-supervised learning (SSL) with foundation models such as wav2vec 2.0, which drastically reduce the reliance on transcribed data. Furthermore, we analyze the impact of large-scale, weakly supervised models like Whisper, which achieve unprecedented robustness through massive data diversity. The paper also covers essential ecosystem components, including key datasets and benchmarks (e.g., LibriSpeech, Switchboard, CHiME), standard evaluation metrics (e.g., Word Error Rate), and critical considerations for real-world deployment, such as streaming inference, on-device efficiency, and the ethical imperatives of fairness and robustness. We conclude by outlining open challenges and future research directions.
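摘要中提到的词错误率(WER)是ASR的标准评测指标:WER =(替换 + 删除 + 插入)/ 参考词数,可用经典的词级编辑距离计算,下面是一个简短实现:

```python
def wer(reference: str, hypothesis: str) -> float:
    """词错误率:参考文本与识别结果之间的词级编辑距离 / 参考词数。"""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]:ref 前 i 个词与 hyp 前 j 个词之间的最小编辑次数
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # 删除
                           dp[i][j - 1] + 1,          # 插入
                           dp[i - 1][j - 1] + cost)   # 替换 / 匹配
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

注意由于插入错误计入分子,WER 可能超过 100%;LibriSpeech 等基准上的对比通常即基于此指标。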


机器翻译由腾讯交互翻译提供,仅供参考


【声明】内容源于网络