Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!
cs.AI Artificial Intelligence, 144 papers in total
【1】Semantic World Models
Link: https://arxiv.org/abs/2510.19818
Abstract: Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction, the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision-language models. Thus, vision-language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties of the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at https://weirdlabuw.github.io/swm.
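The planning loop this abstract implies (score candidate actions by asking the world model a task-relevant question about the future) can be sketched as follows; the function names, the stub answer model, and the drawer example are invented for illustration and are not from the paper:

```python
def plan_with_semantic_wm(candidate_actions, question, yes_probability):
    """Rank candidate actions by querying a semantic world model with a
    task-relevant yes/no question about the predicted future, instead of
    reconstructing pixels. `yes_probability` stands in for a finetuned
    VLM's answer distribution (hypothetical interface)."""
    scored = [(yes_probability(a, question), a) for a in candidate_actions]
    best_score, best_action = max(scored)
    return best_action, best_score

# Stub model: pretend "pull handle" makes the answer "yes" most likely.
stub = {"pull handle": 0.92, "push handle": 0.18, "wave": 0.05}
action, score = plan_with_semantic_wm(
    list(stub), "will the drawer be open?", lambda a, q: stub[a])
print(action)  # pull handle
```

The key design point is that the planner never compares images; it only compares answer probabilities, which is what lets a pretrained VLM serve as the world model.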
【2】Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Link: https://arxiv.org/abs/2510.19807
Note: Code: this https URL
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
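A minimal sketch of the "advantage collapse" the abstract describes, assuming the standard GRPO group-normalized advantage (the function below is illustrative, not the paper's code):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each rollout's reward, centered by the group
    mean and scaled by the group std. An all-zero (or all-equal) reward
    group produces all-zero advantages, so no gradient signal flows."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes give informative, zero-mean advantages...
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# ...but a "learning cliff" group (all failures) collapses to zeros:
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

This is why Scaf-GRPO's hints matter: once at least one rollout succeeds, the group reward variance becomes nonzero and the problem re-enters the gradient.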
【3】Integrating Transparent Models, LLMs, and Practitioner-in-the-Loop: A Case of Nonprofit Program Evaluation
Link: https://arxiv.org/abs/2510.19799
Abstract: Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque, and standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree's structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
【4】On Controlled Change: Generative AI's Impact on Professional Authority in Journalism
Link: https://arxiv.org/abs/2510.19792
Abstract: Using (generative) artificial intelligence tools and systems in journalism is expected to increase journalists' production rates, transform newsrooms' economic models, and further personalize the audience's news consumption practices. Since its release in 2022, OpenAI's ChatGPT and other large language models have raised alarms inside news organizations, not only for bringing new challenges to news reporting and fact-checking but also for what these technologies would mean for journalists' professional authority in journalism. This paper examines how journalists in Dutch media manage the integration of AI technologies into their daily routines. Drawing from 13 interviews with editors, journalists, and innovation managers in different news outlets and media companies, we propose the concept of controlled change as a heuristic to explain how journalists are proactively setting guidelines, experimenting with AI tools, and identifying their limitations and capabilities. Using professional authority as a theoretical framework, we argue that journalists anticipate and integrate AI technologies in a supervised manner and identify three primary mechanisms through which journalists manage this integration: (1) developing adaptive guidelines that align AI use with ethical codes, (2) experimenting with AI technologies to determine their necessity and fit, and (3) critically assessing the capabilities and limitations of AI systems.
【5】Benchmarking World-Model Learning
Link: https://arxiv.org/abs/2510.19788
Note: 30 pages, 10 figures
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and that scaling compute improves performance only in some environments. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
【6】AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Link: https://arxiv.org/abs/2510.19779
Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize the token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
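For context, the acceptance mechanism AdaSPEC optimizes for can be illustrated with the standard speculative-sampling acceptance rule; this is a generic sketch of speculative decoding, not AdaSPEC's selective-filtering method, and the helper names are invented:

```python
import random

def accept_draft_token(p_target, p_draft, rng=random.random):
    """Standard speculative-sampling acceptance test: keep the draft
    token with probability min(1, p_target / p_draft)."""
    return rng() < min(1.0, p_target / p_draft)

def expected_acceptance(target_probs, draft_probs):
    """Expected acceptance rate of a draft distribution q against a
    target p: sum_x q(x) * min(1, p(x)/q(x)) = sum_x min(p(x), q(x)).
    Better draft/target alignment -> higher acceptance -> more speedup."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

# A draft that matches the target perfectly is (essentially) always accepted,
# while misalignment lowers the expected acceptance rate:
print(expected_acceptance([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))
print(expected_acceptance([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

This makes concrete why distilling toward acceptance rate (rather than full-token KL) is the better-matched objective: only alignment on the tokens the draft actually proposes affects the speedup.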
【7】Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
Link: https://arxiv.org/abs/2510.19771
Abstract: LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity into a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Measuring consistently across frontier LLMs and agents, we find that the best end-to-end performance, 40%, is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems and expose promising future research directions.
【8】SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
Link: https://arxiv.org/abs/2510.19767
Note: Code: this https URL
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of "underthinking", where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
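The perception/intervention loop might look like the following sketch, with a stubbed process reward model and an invented prompt string (illustrative only, not the paper's implementation):

```python
def smartswitch_step(trace, switch_index, prm_score, threshold=0.7,
                     deepening_prompt="Wait, let's explore this idea further."):
    """Illustrative SmartSwitch-style intervention: if the thought being
    abandoned at `switch_index` scores above `threshold` under a process
    reward model, backtrack to it and append a deepening prompt instead
    of letting the model switch. `prm_score` is a stub for a real PRM."""
    abandoned_thought = trace[:switch_index]
    if prm_score(abandoned_thought) >= threshold:
        return abandoned_thought + [deepening_prompt]   # intervene
    return trace                                        # allow the switch

trace = ["Try factoring the polynomial.", "Alternatively, use substitution."]
out = smartswitch_step(trace, switch_index=1, prm_score=lambda t: 0.9)
print(out[-1])  # the deepening prompt replaces the premature switch
```

A low PRM score leaves the trace unchanged, so the intervention only fires on thoughts judged promising, which is what keeps the framework plug-and-play.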
【9】A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Link: https://arxiv.org/abs/2510.19755
Note: 22 pages, 2 figures
Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both the theory and practice of Efficient Generative Intelligence.
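The static-reuse end of this spectrum can be illustrated with a toy caching schedule; `run_denoising`, the interval, and the stand-in block are hypothetical, not taken from any surveyed method:

```python
def run_denoising(num_steps, expensive_block, cache_interval=2):
    """Toy static diffusion-caching schedule: the expensive backbone
    block is recomputed only every `cache_interval` steps; the steps in
    between reuse the cached feature, trading a little fidelity for
    fewer forward passes. No model parameters are modified."""
    cached = None
    recomputes = 0
    outputs = []
    for step in range(num_steps):
        if cached is None or step % cache_interval == 0:
            cached = expensive_block(step)  # full computation
            recomputes += 1
        outputs.append(cached)              # cache hit on other steps
    return outputs, recomputes

# With interval 2, a 10-step loop pays for the block only 5 times:
outs, n = run_denoising(10, expensive_block=lambda t: t * t, cache_interval=2)
print(n)         # 5
print(outs[:4])  # [0, 0, 4, 4]
```

Dynamic-prediction methods replace the fixed `cache_interval` with a learned or measured signal (e.g. how much the feature changed last step), which is the evolution the survey traces.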
【10】Learning Affordances at Inference-Time for Vision-Language-Action Models
Link: https://arxiv.org/abs/2510.19752
Note: 7 pages and appendix
Abstract: Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.
【11】Misalignment Bounty: Crowdsourcing AI Agent Misbehavior
Link: https://arxiv.org/abs/2510.19738
Abstract: Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. This report explains the program's motivation and evaluation criteria, and walks through the nine winning submissions step by step.
【12】Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Link: https://arxiv.org/abs/2510.19732
Note: Accepted for Spotlight Presentation at NeurIPS 2025
Abstract: To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
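The interleaving of periodic summarization tokens with the input stream might be sketched as follows (the token name and period are invented for illustration; Memo's actual tokenization is not specified here):

```python
def interleave_summary_tokens(observations, period=3, summary_token="<SUM>"):
    """Insert a summarization marker after every `period` observations,
    mimicking how a Memo-style recipe interleaves periodic summary
    tokens with model inputs so later steps can attend to compact
    summaries instead of the full raw history (illustrative sketch)."""
    out = []
    for i, obs in enumerate(observations, start=1):
        out.append(obs)
        if i % period == 0:
            out.append(summary_token)
    return out

seq = interleave_summary_tokens([f"o{i}" for i in range(1, 8)], period=3)
print(seq)  # ['o1', 'o2', 'o3', '<SUM>', 'o4', 'o5', 'o6', '<SUM>', 'o7']
```

In a streaming setting, old raw observations can then be dropped while the `<SUM>` positions are retained, which is what makes truncation less damaging than for a full-context transformer.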
【13】Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series
Link: https://arxiv.org/abs/2510.19728
Abstract: We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion- and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce Enhanced TimeAutoDiff, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (the "TRTS gap") by over 70%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72-84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
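As a minimal illustration of the quantity being measured, AUROC can be computed with the Mann-Whitney statistic, and the TRTS gap taken as the absolute difference between evaluation on real and on synthetic test data; this sketch assumes binary labels and is not the paper's code:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    random positive example is scored above a random negative one
    (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def trts_gap(real_auroc, synthetic_auroc):
    """TRTS gap: how far evaluating the same trained model on a
    synthetic test set drifts from evaluating it on real data."""
    return abs(real_auroc - synthetic_auroc)

real = auroc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2])    # real test set
synth = auroc([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1])   # synthetic test set
print(real, synth, trts_gap(real, synth))
```

A small gap means the synthetic cohort can stand in for scarce real data, which is exactly what makes subgroup-level evaluation feasible when real subgroup test sets are tiny.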
【14】RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models
Link: https://arxiv.org/abs/2510.19698
Abstract: Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. Yet many LLM-based approaches ignore interactions among rules, and the opportunity to couple LLMs with probabilistic rule learning for robust inference remains underexplored. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. RLIE has four stages: (1) Rule generation, where an LLM proposes and filters candidates; (2) Logistic regression, which learns probabilistic weights for global selection and calibration; (3) Iterative refinement, which updates the rule set using prediction errors; and (4) Evaluation, which compares the weighted rule set as a direct classifier with methods that inject rules into an LLM. We evaluate multiple inference strategies on real-world datasets. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy. This supports the view that LLMs excel at semantic generation and interpretation but are less reliable for precise probabilistic integration. RLIE clarifies the potential and limitations of LLMs for inductive reasoning and couples them with classic probabilistic rule combination methods to enable more reliable neuro-symbolic reasoning.
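Stage 2 (learning probabilistic weights over rule firings) can be sketched as plain logistic regression over binary rule features; the tiny gradient-descent fit below is an illustration of the idea, not RLIE's implementation:

```python
import math

def predict(weights, bias, rule_firings):
    """Score an example from its binary rule firings: sigmoid of the
    weighted sum, i.e. the 'apply rules directly with learned weights'
    inference mode (illustrative sketch)."""
    z = bias + sum(w * f for w, f in zip(weights, rule_firings))
    return 1.0 / (1.0 + math.exp(-z))

def fit_rule_weights(X, y, lr=0.5, epochs=500):
    """Plain stochastic gradient descent for logistic regression over
    rule-firing features, a stand-in for the calibration stage."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Two hypothetical rules: rule 0 is predictive, rule 1 fires at random.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = fit_rule_weights(X, y)
print(w[0] > abs(w[1]))  # the predictive rule earns the larger weight
```

The learned weights provide global selection and calibration: a rule that fires indiscriminately ends up with a weight near zero, so it contributes little to the final classifier.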
【15】Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
Link: https://arxiv.org/abs/2510.19694
Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
【16】Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary
Link: https://arxiv.org/abs/2510.19692
Note: 5 pages
Abstract: Agentic AI is poised to usher in a seismic paradigm shift in Software Engineering (SE). As technologists rush headlong to make agentic AI a reality, SE researchers are driven to establish agentic SE as a research area. While early visions of agentic SE are primarily focused on code-related activities, early empirical evidence calls for a consideration of a range of socio-technical concerns to make it work in practice. This paper contributes to the emerging community vision by: (a) recommending an expansion of its scope beyond code, toward a 'whole of process' vision, grounding it in SE foundations and evolution and in emerging agentic SE frameworks, (b) proposing a preliminary set of values and principles to guide efforts, and (c) sharing guidance on designing/using well-defined vocabulary for agentic SE. It is hoped that these ideas will encourage community collaborations and steer the SE community towards laying strong foundations for agentic SE, so that it is not only inevitable but also deliberate and desirable in the long run.
【17】Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation
Link: https://arxiv.org/abs/2510.19689
Note: 10 pages, 7 figures, 4 tables. Accepted to IEEE BigData 2025
Abstract: Industrial and government organizations increasingly depend on data-driven analytics for workforce, finance, and regulated decision processes, where timeliness, cost efficiency, and compliance are critical. Distributed frameworks such as Spark and Flink remain effective for massive-scale batch or streaming analytics but introduce coordination complexity and auditing overheads that misalign with moderate-scale, latency-sensitive inference. Meanwhile, cloud providers now offer serverless GPUs, and models such as TabNet enable interpretable tabular ML, motivating new deployment blueprints for regulated environments. In this paper, we present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. The design leverages GPU acceleration for throughput, serverless elasticity for cost reduction, and feature-mask interpretability for IL4/FIPS compliance. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines, while compliance mechanisms add only ~5.7 ms latency with p99 < 22 ms. Interpretability remains stable under peak load, ensuring reliable auditability. Taken together, these findings provide a compliance-aware benchmark, a reproducible Helm-packaged blueprint, and a decision framework that demonstrate the practicality of secure, interpretable, and cost-efficient serverless GPU analytics for regulated enterprise and government settings.
【18】Are Large Language Models Sensitive to the Motives Behind Communication?
标题:大型语言模型对沟通背后的动机敏感吗?
链接:https://arxiv.org/abs/2510.19687
备注:NeurIPS 2025
摘要:人类交流是有动机的:人们说话、写作和创造内容时都有特定的交流意图。因此,大型语言模型(LLM)和人工智能代理处理的信息本质上是由人类的意图和动机构成的。人们善于驾驭这些微妙的信息:我们通常会识别善意或自私的动机,以决定哪些陈述值得信任。为了让LLM在现实世界中发挥作用,它们也必须通过考虑来源的动机来批判性地评估内容,例如,权衡销售宣传中所做声明的可信度。在本文中,我们对LLM是否具备这种动机警惕能力进行了全面研究。我们首先采用认知科学的对照实验来验证LLM的行为与从动机性证词中学习的理性模型是一致的,并发现它们能够以类似人类的方式对来自有偏见来源的信息打折扣。然后,我们将评估扩展到赞助的在线广告,这更自然地反映了LLM智能体的信息生态系统。在这些设置中,我们发现LLM的推断并不能紧密追踪理性模型的预测,部分原因是额外的信息分散了它们对警惕相关考虑的注意力。然而,一个简单的引导(steering)干预,通过提高意图和激励的显著性,大大增加了LLM和理性模型之间的对应关系。这些结果表明,LLM具有对他人动机的基本敏感性,但要推广到新的现实世界设置,还需要进一步改进这些模型。
摘要:Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
【19】Directive, Metacognitive or a Blend of Both? A Comparison of AI-Generated Feedback Types on Student Engagement, Confidence, and Outcomes
标题:指令式、元认知式还是两者融合?人工智能生成的反馈类型对学生参与度、信心和成果影响的比较
链接:https://arxiv.org/abs/2510.19685
摘要:反馈是对学生学习最有力的影响之一,已有广泛研究探讨如何在教育环境中最好地实施它。人工智能(AI)越来越多地用于生成反馈,提供可扩展且自适应的响应。两种被广泛研究的方法是指令式反馈和元认知反馈:前者给出明确的解释并减少认知负荷以加快学习,后者促使学习者反思、跟踪自己的进步并发展自我调节学习(SRL)技能。虽然这两种方法都有明显的理论优势,但它们对参与度、信心和作品质量的相对效果仍未得到充分研究。这项研究呈现了一项为期一个学期的随机对照试验,329名学生在一门设计与编程导论课程中使用自适应教育平台。参与者被分配接受指令式、元认知式或混合式AI生成的反馈,其中混合式融合了前两者的元素。结果表明,不同反馈条件下的修订行为存在差异:与指令式和元认知式相比,混合式反馈促使的修订最多。各条件下信心评级普遍较高,作品质量结果相当。这些发现凸显了AI在提供兼顾清晰性与反思性的反馈方面的前景。特别是混合方法,展现出将用于即时改进的可操作指导与自我反思和元认知成长机会相结合的潜力。
摘要:Feedback is one of the most powerful influences on student learning, with extensive research examining how best to implement it in educational settings. Increasingly, feedback is being generated by artificial intelligence (AI), offering scalable and adaptive responses. Two widely studied approaches are directive feedback, which gives explicit explanations and reduces cognitive load to speed up learning, and metacognitive feedback which prompts learners to reflect, track their progress, and develop self-regulated learning (SRL) skills. While both approaches have clear theoretical advantages, their comparative effects on engagement, confidence, and quality of work remain underexplored. This study presents a semester-long randomised controlled trial with 329 students in an introductory design and programming course using an adaptive educational platform. Participants were assigned to receive directive, metacognitive, or hybrid AI-generated feedback that blended elements of both directive and metacognitive feedback. Results showed that revision behaviour differed across feedback conditions, with Hybrid prompting the most revisions compared to Directive and Metacognitive. Confidence ratings were uniformly high, and resource quality outcomes were comparable across conditions. These findings highlight the promise of AI in delivering feedback that balances clarity with reflection. Hybrid approaches, in particular, show potential to combine actionable guidance for immediate improvement with opportunities for self-reflection and metacognitive growth.
【20】I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs
标题:用模型之眼来观察:视觉搜索作为MLLM的行为测试
链接:https://arxiv.org/abs/2510.19678
备注:Preprint
摘要:多模态大型语言模型(MLLM)在视觉语言任务上实现了强大的性能,但它们的视觉处理是不透明的。大多数黑盒评估测量任务准确率,但很少揭示底层机制。借鉴认知心理学,我们改编了经典的视觉搜索范式(最初为研究人类感知而开发),以测试MLLM是否表现出"弹出"(pop-out)效应,即显著视觉特征的检测与干扰项集合大小无关。通过针对颜色、大小和照明特征的对照实验,我们发现先进的MLLM在基于颜色或大小的析取(单特征)搜索中表现出类似人类的弹出效应,在合取(多特征)搜索中表现出容量限制。我们还发现证据表明,MLLM像人类一样,会将照明方向等自然场景先验纳入对象表示。我们通过有针对性的微调和机制可解释性分析来加强我们的发现。我们的工作表明,视觉搜索可以作为一种认知基础的诊断工具,用于评估MLLM的感知能力。
摘要:Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms -- originally developed to study human perception -- to test whether MLLMs exhibit the ``pop-out'' effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
【21】Study of Training Dynamics for Memory-Constrained Fine-Tuning
标题:内存受限微调的训练动态研究
链接:https://arxiv.org/abs/2510.19675
摘要:随着模型越来越大,而部署环境施加严格的资源限制,深度神经网络的内存高效训练变得越来越重要。我们提出了TraDy,这是一种新颖的迁移学习方案,它利用了两个关键洞见:各层对更新的重要性依赖于架构且可先验确定,而动态随机通道选择提供了优于静态方法的梯度近似。我们引入了一种动态通道选择方法,在预选层内于各训练轮(epoch)之间随机重新采样通道。大量实验表明,TraDy在各种下游任务和架构中实现了最先进的性能,同时保持严格的内存限制,实现了高达99%的激活稀疏度、95%的权重导数稀疏度,以及权重导数计算97%的FLOP削减。
摘要:Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
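The dynamic selection idea above, stochastically resampling which channels receive gradient updates between epochs within preselected layers, can be sketched in a few lines. The layer names, widths, and keep ratio below are illustrative assumptions, not values from the paper.

```python
import random

def sample_channels(num_channels, keep_ratio, rng):
    """Draw a fresh random subset of channel indices to update this epoch."""
    k = max(1, int(num_channels * keep_ratio))
    return sorted(rng.sample(range(num_channels), k))

def training_schedule(layer_widths, epochs, keep_ratio, seed=0):
    """Per-epoch channel masks for preselected layers.

    The sparsity level stays fixed across epochs, but the set of updated
    channels is resampled each epoch (the 'dynamic' part of the scheme).
    """
    rng = random.Random(seed)
    return [
        {name: sample_channels(w, keep_ratio, rng) for name, w in layer_widths.items()}
        for _ in range(epochs)
    ]
```

In an actual training loop, gradients would then be computed and applied only for the channels listed in the current epoch's mask.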
【22】Explainable e-sports win prediction through Machine Learning classification in streaming
标题:通过流媒体中的机器学习分类进行可解释的电子竞技获胜预测
链接:https://arxiv.org/abs/2510.19671
摘要:电子竞技观众和玩家数量的不断增加,以及优化的通信解决方案和云计算技术的发展,推动了网络游戏行业的持续增长。尽管基于人工智能的电子竞技分析解决方案传统上被定义为从相关数据中提取有意义的模式并将其可视化以增强决策,但职业比赛胜负预测的大部分工作都集中在批处理视角下的分类方面,同时也忽略了可视化技术。因此,这项工作贡献了一个流式的可解释胜负预测分类解决方案,其中输入数据由多个滑动窗口控制,以反映相关的游戏变化。实验结果达到了90%以上的准确率,超过了文献中竞争方案的性能。最终,得益于可解释性模块促进了对结果预测的信任,我们的系统可以被排名和推荐系统用于明智的决策。
摘要:The increasing number of spectators and players in e-sports, along with the development of optimized communication solutions and cloud computing technology, has motivated the constant growth of the online game industry. Even though Artificial Intelligence-based solutions for e-sports analytics are traditionally defined as extracting meaningful patterns from related data and visualizing them to enhance decision-making, most of the effort in professional winning prediction has been focused on the classification aspect from a batch perspective, also leaving aside the visualization techniques. Consequently, this work contributes to an explainable win prediction classification solution in streaming in which input data is controlled over several sliding windows to reflect relevant game changes. Experimental results attained an accuracy higher than 90 %, surpassing the performance of competing solutions in the literature. Ultimately, our system can be leveraged by ranking and recommender systems for informed decision-making, thanks to the explainability module, which fosters trust in the outcome predictions.
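The abstract's idea of controlling input data over several sliding windows can be sketched with bounded deques that feed per-window aggregates to a downstream classifier. What counts as an "event value" (gold lead, kill difference, objective count) is an assumption here, not something the paper specifies.

```python
from collections import deque

class SlidingWindowStream:
    """Maintain several sliding windows over a match-event stream and emit
    per-window aggregates as features for a win-prediction classifier."""

    def __init__(self, window_sizes):
        # One bounded deque per window size; deque(maxlen=w) drops old events.
        self.windows = {w: deque(maxlen=w) for w in window_sizes}

    def push(self, event_value):
        for q in self.windows.values():
            q.append(event_value)

    def features(self):
        # Mean of each window: short windows react to recent game swings,
        # long windows reflect the overall game state.
        return {w: sum(q) / len(q) for w, q in self.windows.items() if q}
```

After each pushed event, `features()` yields a fixed-size feature vector, so a streaming classifier can be queried at any point in the match.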
【23】Unraveling Emotions with Pre-Trained Models
标题:用预先训练的模型解开情绪
链接:https://arxiv.org/abs/2510.19668
摘要:Transformer模型极大地推动了情感识别领域的发展。然而,在探索大型语言模型(LLM)的开放式查询时仍然存在开放性挑战。虽然目前的模型提供了很好的结果,在开放文本中的自动情感分析提出了重大的挑战,如上下文歧义,语言的变异性,并难以解释复杂的情感表达。这些限制使得通才模型的直接应用变得困难。因此,这项工作在三种不同的情况下比较了微调和提示工程在情感检测中的有效性:(i)使用简单提示的微调预训练模型和通用LLM的性能;(ii)使用LLM的不同情感提示设计的有效性;以及(iii)情感分组技术对这些模型的影响。实验测试通过微调的预训练情感识别模型达到了70%以上的指标。此外,研究结果强调,LLM需要结构化的提示工程和情感分组,以提高他们的表现。这些进步改善了情感分析,人机交互以及对各个领域用户行为的理解。
摘要:Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
【24】A Graph Engine for Guitar Chord-Tone Soloing Education
标题:用于吉他和弦音独奏教育的图引擎
链接:https://arxiv.org/abs/2510.19666
备注:ICMC 2025
摘要:我们提出了一个基于图的引擎,用于为吉他学生计算和弦音独奏建议。和弦音独奏是在和弦进行上即兴演奏的一项基本练习,演奏者只使用当前和弦中包含的音符。这种练习是所有高级爵士吉他理论的基石,但很难学习和练习。首先,我们讨论生成和弦音琶音的方法。接下来,我们构建一个加权图,其中每个节点表示和弦进行中某个和弦的和弦音琶音。然后,我们根据最优过渡音计算每对相邻和弦的节点之间的边权重。随后,我们找到穿过该图的最短路径,并重建出和弦音独奏线。最后,我们讨论了一个用户友好的系统,为吉他学生处理该引擎的输入和输出,以便练习和弦音独奏。
摘要:We present a graph-based engine for computing chord tone soloing suggestions for guitar students. Chord tone soloing is a fundamental practice for improvising over a chord progression, where the instrumentalist uses only the notes contained in the current chord. This practice is a building block for all advanced jazz guitar theory but is difficult to learn and practice. First, we discuss methods for generating chord-tone arpeggios. Next, we construct a weighted graph where each node represents a chord tone arpeggio for a chord in the progression. Then, we calculate the edge weight between each consecutive chord's nodes in terms of optimal transition tones. We then find the shortest path through this graph and reconstruct a chord-tone soloing line. Finally, we discuss a user-friendly system to handle input and output to this engine for guitar students to practice chord tone soloing.
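The pipeline described above (one node set per chord, edge weights from transition tones, shortest path through the graph) reduces to a layered shortest-path problem. Below is a minimal sketch assuming candidate chord tones are MIDI pitches and the transition cost is absolute pitch distance; the paper's actual arpeggio representation and weighting are surely richer.

```python
def soloing_path(progression, weight=lambda a, b: abs(a - b)):
    """Pick one chord tone per chord minimizing total transition cost.

    progression: list of lists; progression[i] holds candidate chord-tone
    pitches (e.g. MIDI numbers) for chord i. Because edges only connect
    consecutive chords, the graph is layered and dynamic programming
    suffices for the shortest path.
    """
    # best[note] = (minimal cost of reaching `note` at the current chord, path)
    best = {n: (0, [n]) for n in progression[0]}
    for layer in progression[1:]:
        nxt = {}
        for n in layer:
            cost, path = min(
                (c + weight(p, n), path) for p, (c, path) in best.items()
            )
            nxt[n] = (cost, path + [n])
        best = nxt
    return min(best.values())
```

For C major to F major with tones given as MIDI pitches, the engine would favor small melodic steps such as E (64) resolving to F (65).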
【25】AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing
标题:AgentSense:LLM赋能可泛化且可解释的基于Web的参与式城市感知
链接:https://arxiv.org/abs/2510.19661
备注:13 pages, 10 figures
摘要:基于Web的参与式城市感知已经成为现代城市管理的重要方法,利用移动个人作为分布式传感器。然而,现有的城市传感系统在不同的城市场景中的泛化能力有限,决策的可解释性较差。在这项工作中,我们引入了AgentSense,一个混合的,无训练的框架,通过多代理进化系统将大型语言模型(LLM)集成到参与式城市感知中。AgentSense最初采用经典规划器来生成基线解决方案,然后迭代地对其进行优化,以使传感任务分配适应动态的城市条件和异构工人的偏好,同时生成自然语言解释,以增强透明度和信任。在两个大规模的移动数据集和七种类型的动态干扰的广泛实验表明,AgentSense提供了明显的优势,在自适应性和可解释性比传统的方法。此外,与单代理LLM基线相比,我们的方法在性能和鲁棒性方面都优于单代理LLM基线,同时提供更合理和透明的解释。这些结果将AgentSense定位为在网络上部署自适应和可解释的城市传感系统的重大进步。
摘要:Web-based participatory urban sensing has emerged as a vital approach for modern urban management by leveraging mobile individuals as distributed sensors. However, existing urban sensing systems struggle with limited generalization across diverse urban scenarios and poor interpretability in decision-making. In this work, we introduce AgentSense, a hybrid, training-free framework that integrates large language models (LLMs) into participatory urban sensing through a multi-agent evolution system. AgentSense initially employs classical planner to generate baseline solutions and then iteratively refines them to adapt sensing task assignments to dynamic urban conditions and heterogeneous worker preferences, while producing natural language explanations that enhance transparency and trust. Extensive experiments across two large-scale mobility datasets and seven types of dynamic disturbances demonstrate that AgentSense offers distinct advantages in adaptivity and explainability over traditional methods. Furthermore, compared to single-agent LLM baselines, our approach outperforms in both performance and robustness, while delivering more reasonable and transparent explanations. These results position AgentSense as a significant advancement towards deploying adaptive and explainable urban sensing systems on the web.
【26】From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
标题:从预测到规划:用于协作式状态-动作预测的策略世界模型
链接:https://arxiv.org/abs/2510.19654
备注:Accepted by NeurIPS 2025 (Poster)
摘要:尽管驾驶世界模型取得了显著进展,但它们在自主系统中的潜力在很大程度上仍未得到开发:世界模型大多是为了世界模拟而学习的,并与轨迹规划脱钩。虽然最近的努力旨在将世界建模和规划统一在一个框架中,但世界建模对规划的协同促进机制仍需进一步探索。在这项工作中,我们引入了一种名为策略世界模型(Policy World Model,PWM)的新驾驶范式,它不仅将世界建模和轨迹规划集成在统一的架构中,而且能够通过所提出的无动作未来状态预测方案,利用学到的世界知识来辅助规划。通过协作式状态-动作预测,PWM可以模仿人类的预期性感知,产生更可靠的规划性能。为了提高视频预测的效率,我们进一步引入了一个动态增强的并行令牌生成机制,配备了上下文引导的分词器和自适应动态焦点损失。尽管仅利用前置摄像头输入,我们的方法仍匹配或超过了依赖多视图和多模态输入的最先进方法。代码和模型权重将在https://github.com/6550Zhao/Policy-World-Model上发布。
摘要:Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
【27】Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
标题:风格攻击伪装:当字体成为对抗意图的伪装
链接:https://arxiv.org/abs/2510.19641
摘要:随着社交媒体的发展,用户使用风格化字体和类似字体的表情符号来表达个性,创建视觉上有吸引力且保持人类可读的文本。然而,这些字体在NLP模型中引入了隐藏的漏洞:虽然人类很容易阅读风格化文本,但模型将这些字符作为不同的标记处理,从而造成干扰。我们识别了这种人类与模型之间的感知差距,并提出了一种基于风格的攻击,即风格攻击伪装(Style Attack Disguise,SAD)。我们设计了两个版本:追求查询效率的轻量版和追求更强攻击性能的强力版。在传统模型、LLM和商业服务上进行的情感分类和机器翻译实验证明了SAD强大的攻击性能。我们还展示了SAD对包括文本到图像和文本到语音生成在内的多模态任务的潜在威胁。
摘要:With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
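The human-model perception gap exploited above is easy to reproduce: Unicode contains letters that look familiar to humans but are distinct code points to a tokenizer. A minimal sketch using the Mathematical Bold letters block, one of many possible stylistic mappings; the paper's actual substitutions may differ:

```python
def to_math_bold(text):
    """Map ASCII letters to Unicode 'Mathematical Bold' code points.

    The result renders like a stylistic font yet consists of entirely
    different characters, so NLP models tokenize it very differently.
    Non-letters (digits, spaces, punctuation) pass through unchanged.
    """
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))  # bold capitals
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))  # bold lowercase
        else:
            out.append(ch)
    return "".join(out)
```

Feeding `to_math_bold("great product")` to a sentiment classifier illustrates the attack surface: a human reads the same words, while the model sees out-of-vocabulary symbols.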
【28】HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
标题:HSCodeComp:分层规则应用中深度搜索代理的现实和专家级基准
链接:https://arxiv.org/abs/2510.19631
摘要:有效的深度搜索代理不仅必须访问开放领域和特定领域的知识,还必须应用复杂的规则,如法律条款,医疗手册和关税规则。这些规则往往具有模糊的边界和隐含的逻辑关系,使精确的应用程序的智能体的挑战。然而,这一关键能力在很大程度上被当前的代理基准所忽视。 为了填补这一空白,我们引入了HSCodeComp,这是第一个现实的专家级电子商务基准测试,旨在评估分层规则应用中的深度搜索代理。在这个任务中,智能体的深层推理过程是由这些规则指导的,以预测10位数的协调系统代码(HSCode)的产品与嘈杂的,但现实的描述。这些由世界海关组织制定的守则对全球供应链效率至关重要。基于从大型电子商务平台收集的真实数据,我们提出的HSCodeComp包括632个产品条目,涵盖不同的产品类别,这些HSCodeComp由几位人类专家注释。 在几个最先进的LLM、开源和闭源代理上的广泛实验结果揭示了一个巨大的性能差距:最好的代理只能达到46.8%的10位数准确率,远低于人类专家的95.0%。此外,详细的分析表明,层次化规则应用的挑战,测试时间缩放未能进一步提高性能。
摘要:Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules-such as legal clauses, medical manuals and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. Besides, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
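Hierarchical credit for HSCode prediction can be scored by digit-prefix agreement, since the 10-digit code nests the chapter (2 digits), heading (4 digits), and finer levels. A small sketch with made-up codes (the benchmark's own metric may aggregate differently):

```python
def prefix_accuracy(preds, golds, digits):
    """Fraction of predictions whose first `digits` digits match the gold
    HSCode: hierarchical credit at chapter (2) / heading (4) / ... / full (10)."""
    hits = sum(p[:digits] == g[:digits] for p, g in zip(preds, golds))
    return hits / len(golds)
```

Scoring at several prefix lengths shows where an agent's reasoning breaks down: it may pick the right chapter and heading yet miss the final statistical suffix.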
【29】Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
标题:人与智能体协作的论文到网页制作,成本低于0.1美元
链接:https://arxiv.org/abs/2510.19600
摘要:在追求科学进步的过程中,交流研究与发现本身同样重要。然而,研究人员经常被手动、重复地构建项目网页以使其密集的论文易于理解的琐事所困扰。虽然自动化已经解决了静态幻灯片和海报,但网页的动态交互特性仍然是一个尚未解决的挑战。为了弥合这一差距,我们重新定义了该问题,认为解决方案不在于单一命令,而在于一个协作的、分层的过程。我们介绍$\textbf{AutoPage}$,一个体现这一理念的新型多智能体系统。AutoPage将论文到网页的创建分解为从叙事规划到多模态内容生成和交互式渲染的粗到细流水线。为了对抗AI幻觉,专门的"检查器"(Checker)代理根据源论文验证每一步,而可选的人工检查点则确保最终产品与作者的愿景完美一致,将系统从单纯的工具转变为强大的协作助手。为了严格验证我们的方法,我们还构建了$\textbf{PageBench}$,这是这项新任务的第一个基准。实验表明,AutoPage不仅能生成高质量、视觉上吸引人的页面,而且效率惊人:用时不到15分钟,成本不到0.1美元。代码和数据集将在$\href{https://mqleet.github.io/AutoPage_ProjectPage/}{Webpage}$发布。
摘要:In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce $\textbf{AutoPage}$, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated "Checker" agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author's vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct $\textbf{PageBench}$, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than \$0.1. Code and dataset will be released at $\href{https://mqleet.github.io/AutoPage_ProjectPage/}{Webpage}$.
【30】XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
标题:XBench:胸部放射摄影中视觉语言解释的综合基准
链接:https://arxiv.org/abs/2510.19599
摘要:视觉语言模型(VLM)最近在医学图像理解中表现出显著的zero-shot性能,但其视觉定位(grounding)能力,即文本概念与视觉证据相一致的程度,仍然没有得到充分探索。然而,在医学领域,可靠的视觉定位对于可解释性和临床应用至关重要。在这项工作中,我们提出了第一个系统性基准,用于评估7个CLIP风格VLM变体在胸部X射线上的跨模态可解释性。我们使用交叉注意力和基于相似性的定位图生成视觉解释,并定量评估它们与放射科医生在多种病理上标注区域的对齐程度。我们的分析表明:(1)虽然所有VLM变体对大型且边界清晰的病变都表现出合理的定位,但它们对小型或弥漫性病变的性能大幅下降;(2)与在通用领域数据上训练的模型相比,在胸部X射线特定数据集上预训练的模型表现出更好的对齐;(3)模型的整体识别能力与定位能力强相关。这些研究结果强调,目前的VLM尽管识别能力很强,但在临床可靠的视觉定位方面仍有不足,凸显了在医疗实践中部署之前需要有针对性的可解释性基准。XBench代码可在https://github.com/Roypic/Benchmarkingattention上获得
摘要:Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data. (3) The overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
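Similarity-based localization maps of the kind evaluated here are typically built by scoring each image-patch embedding against the text embedding. A pure-Python cosine-similarity sketch (real pipelines work on model tensors and reshape the scores into a spatial heat map):

```python
def similarity_map(patch_embs, text_emb):
    """Score each image-patch embedding against a text embedding.

    Returns one cosine similarity per patch; high values mark the regions
    a CLIP-style model associates with the text concept.
    """
    def cos(a, b):
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    return [cos(p, text_emb) for p in patch_embs]
```

Benchmarks like XBench then compare a thresholded version of this map against radiologist-annotated regions to quantify grounding.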
【31】A Goal-Driven Survey on Root Cause Analysis
标题:一项目标驱动的根本原因分析综述
链接:https://arxiv.org/abs/2510.19593
摘要:根本原因分析(RCA)是大规模云服务中事件管理的一个重要方面。虽然根本原因分析或RCA这一术语已被广泛使用,但不同的研究对该任务的表述各不相同。这是因为"RCA"一词隐含地涵盖了具有不同基本目标的任务。例如,定位故障服务以进行快速分诊的目标,与识别特定功能性缺陷以进行最终修复的目标有根本区别。然而,以前的综述在很大程度上忽略了这些基于目标的区别,习惯上按输入数据类型(例如,基于指标的方法与基于调用链的方法)对论文进行分类。这导致目标迥异的工作被归为一组,从而掩盖了该领域的真正进展和差距。与此同时,RCA综述的典型受众要么是想了解任务目标和全局图景的外行,要么是想梳理同一任务表述下既往研究的RCA研究人员。因此,非常需要一篇按目标组织相关论文的RCA综述。为此,本文提出了一个目标驱动的框架,基于论文的不同目标,对2014年至2025年间云事件管理背景下的135篇RCA论文进行了有效的分类和整合。除了目标驱动的分类之外,本文还讨论了所有RCA论文的最终目标,作为涵盖不同RCA表述的总纲。此外,本文还讨论了RCA中的开放挑战和未来方向。
摘要:Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term "RCA" implicitly covers tasks with distinct underlying goals. For instance, the goal of localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions, conventionally categorizing papers by input data types (e.g., metric-based vs. trace-based methods). This leads to the grouping of works with disparate objectives, thereby obscuring the true progress and gaps in the field. Meanwhile, the typical audience of an RCA survey is either laymen who want to know the goals and big picture of the task or RCA researchers who want to figure out past research under the same task formulation. Thus, an RCA survey that organizes the related papers according to their goals is in high demand. To this end, this paper presents a goal-driven framework that effectively categorizes and integrates 135 papers on RCA in the context of cloud incident management based on their diverse goals, spanning the period from 2014 to 2025. In addition to the goal-driven categorization, it discusses the ultimate goal of all RCA papers as an umbrella covering different RCA formulations. Moreover, the paper discusses open challenges and future directions in RCA.
【32】Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
标题:用大型语言模型检测历史书籍中的拉丁语:多模式基准
链接:https://arxiv.org/abs/2510.19585
备注:Under review. Both the dataset and code will be published
摘要:本文提出了一项新任务:从布局多样的混合语言历史文献中提取拉丁语片段。我们在一个包含724个标注页面的多模态数据集上对大型基础模型的性能进行基准测试和评估。结果表明,使用当代模型进行可靠的拉丁语检测是可以实现的。我们的研究首次全面分析了这些模型在该任务上的能力和局限。
摘要:This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
【33】Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
标题:面向地球观测的多模态协同学习:通过模态协作增强单模态模型
链接:https://arxiv.org/abs/2510.19579
备注:Accepted at the Machine Learning journal, CfP: Discovery Science 2024
摘要:多模态协同学习正在成为机器学习中的一种有效范式,使模型能够从不同模态中协作学习,以增强单模态预测。地球观测(EO)代表了多模态数据分析的典型领域,其中各种遥感器收集数据以感知我们的星球。这种前所未有的数据量带来了新的挑战。具体而言,在培训和推理阶段访问相同的传感器模式变得越来越复杂的基础上影响遥感平台的现实世界的限制。在这种情况下,多模态协同学习提出了一种很有前途的策略,可以利用训练阶段提供的大量传感器数据来改进单模态模型,以进行推理时间部署。目前大多数研究工作集中在为特定的下游任务或推理阶段可用的特定模式设计定制的解决方案。为了解决这个问题,我们提出了一种新的多模态协同学习框架,能够在不针对特定模态进行推理的情况下概括各种任务。我们的方法将对比学习和模态区分学习结合在一起,以引导单模态模型将内部模型流形构建为模态共享和特定于模态的信息。我们评估我们的框架上的四个EO基准跨越不同的传感器模态的分类和回归任务,其中只有一个模态在训练过程中可在推理时间访问。我们的研究结果表明,与最近的机器学习和计算机视觉文献中的最先进方法以及EO特定方法相比,预测性能得到了一致的改善。所获得的研究结果验证了我们的框架中的单模态推理场景在不同范围的EO应用。
摘要:Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
【34】DAIL: Beyond Task Ambiguity for Language-Conditioned Reinforcement Learning
标题:DAIL:超越语言条件强化学习的任务模糊性
链接:https://arxiv.org/abs/2510.19562
备注:Website at: this https URL
摘要:理解自然语言和遵循人类指令是智能代理的关键能力。然而,语言指令的灵活性在语言条件任务中引起了大量的歧义,严重降低了算法性能。为了解决这些限制,我们提出了一种新的方法DAIL(分布式对齐学习),具有两个关键组成部分:分布式策略和语义对齐。具体来说,我们提供的理论结果,价值分布估计机制提高了任务的可区分性。同时,语义对齐模块捕获轨迹和语言指令之间的对应关系。在结构化和视觉观察基准上的大量实验结果表明,DAIL有效地解决了指令歧义,实现了优于基线方法的性能。我们的实施可在https://github.com/RunpengXie/Distributional-Aligned-Learning上获得。
摘要:Comprehending natural language and following human instructions are critical capabilities for intelligent agents. However, the flexibility of linguistic instructions induces substantial ambiguity across language-conditioned tasks, severely degrading algorithmic performance. To address these limitations, we present a novel method named DAIL (Distributional Aligned Learning), featuring two key components: distributional policy and semantic alignment. Specifically, we provide theoretical results that the value distribution estimation mechanism enhances task differentiability. Meanwhile, the semantic alignment module captures the correspondence between trajectories and linguistic instructions. Extensive experimental results on both structured and visual observation benchmarks demonstrate that DAIL effectively resolves instruction ambiguities, achieving superior performance to baseline methods. Our implementation is available at https://github.com/RunpengXie/Distributional-Aligned-Learning.
【35】A Matter of Time: Revealing the Structure of Time in Vision-Language Models
标题:时间问题:在视觉语言模型中揭示时间结构
链接:https://arxiv.org/abs/2510.19559
摘要:CLIP等大规模视觉语言模型(VLM)因其可泛化且富有表达力的多模态表示而广受欢迎。通过利用带有多样文本元数据的大规模训练数据,VLM获得了开放词汇能力,能够解决超出其训练范围的任务。本文考察了VLM的时间意识,评估它们将视觉内容在时间上定位的能力。我们介绍了TIME10k,这是一个包含10,000多幅带有时间真值的图像的基准数据集,并通过一种新方法评估了37个VLM的时间意识。我们的研究表明,时间信息沿着VLM嵌入空间中一个低维、非线性的流形结构化。基于这一认识,我们提出了从嵌入空间导出显式"时间线"表示的方法。这些表示对时间及其时序演进进行建模,从而便于时间推理任务。与基于提示(prompt)的基线相比,我们的时间线方法实现了具有竞争力乃至更优的准确性,同时保持计算高效。所有代码和数据都可以在https://tekayanidham.github.io/timeline-page/上找到。
摘要:Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
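One simple way to derive an explicit "timeline" axis from an embedding space, purely illustrative since the paper models a non-linear manifold, is to project embeddings onto the direction separating early from late examples:

```python
def timeline_axis(embeddings, years):
    """Project embeddings onto a crude linear 'timeline' direction.

    The direction is the difference between the mean embedding of the
    later half and the earlier half of the data (by year). Projections
    along it should increase roughly with time if temporal information
    is linearly recoverable from the space.
    """
    order = sorted(range(len(years)), key=lambda i: years[i])
    half = len(order) // 2
    early, late = order[:half], order[half:]
    dim = len(embeddings[0])

    def mean(idx):
        return [sum(embeddings[i][d] for i in idx) / len(idx) for d in range(dim)]

    m_early, m_late = mean(early), mean(late)
    direction = [b - a for a, b in zip(m_early, m_late)]
    return [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
```

Ranking images by this projection gives a rough chronological ordering; the paper's non-linear manifold methods refine exactly this kind of axis.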
【36】Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data
标题:洞察未知:分子数据的联邦数据多样性分析
链接:https://arxiv.org/abs/2510.19535
摘要:人工智能方法正日益深刻地塑造药物发现。然而,由于这些方法依赖公共数据集,缺乏专有药物数据的规模和多样性,它们向工业应用的转化仍然有限。联邦学习(FL)提供了一种有前景的途径,可以在保护隐私的前提下跨数据孤岛协作训练模型,将私有数据纳入其中。但这种联邦式的数据访问使一些重要的以数据为中心的任务变得复杂,例如估计数据集多样性、进行合理的数据划分,以及理解合并后化学空间的结构。为弥补这一空白,我们研究了联邦聚类方法能在多大程度上解开并表征分布式分子数据。我们在八个不同的分子数据集上对三种方法进行基准测试:联邦kMeans(Fed-kMeans)、联邦主成分分析结合Fed-kMeans(Fed-PCA+Fed-kMeans)和联邦局部敏感哈希(Fed-LSH),并与其集中式对应方法比较。我们的评估同时使用标准数学指标和我们在本文中提出的化学信息指标SF-ICF。大规模基准测试与深入的可解释性分析相结合,表明了通过化学信息指标融入领域知识的重要性,以及在客户端进行可解释性分析对于分子数据联邦多样性分析的价值。
摘要:AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited due to their reliance on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. This federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated Locality-Sensitive Hashing (Fed-LSH), against their centralized counterparts on eight diverse molecular datasets. Our evaluation utilizes both standard mathematical metrics and a chemistry-informed metric, SF-ICF, that we introduce in this work. The large-scale benchmarking combined with an in-depth explainability analysis shows the importance of incorporating domain knowledge through chemistry-informed metrics, and of on-client explainability analyses for federated diversity analysis on molecular data.
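The federated kMeans baseline can be sketched as follows: each client shares only per-cluster sums and counts with the server, never its raw data, and the server's update is then identical to centralized kMeans on the pooled data. This is a generic Fed-kMeans sketch, not the paper's implementation; the toy Gaussian blobs stand in for molecular descriptors.

```python
import numpy as np

def local_stats(X, centroids):
    """Client-side step: assign points to nearest centroid, return sums/counts."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(1)
    k = centroids.shape[0]
    sums = np.zeros_like(centroids)
    counts = np.zeros(k)
    for j in range(k):
        mask = assign == j
        sums[j] = X[mask].sum(0)
        counts[j] = mask.sum()
    return sums, counts

def fed_kmeans(clients, k, iters=20, seed=0):
    """Server loop: aggregate sufficient statistics; raw data never leaves clients."""
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(k, clients[0].shape[1]))
    for _ in range(iters):
        sums = np.zeros_like(centroids)
        counts = np.zeros(k)
        for X in clients:
            s, c = local_stats(X, centroids)
            sums += s
            counts += c
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids

# Two clients, each holding one of two well-separated blobs.
rng = np.random.default_rng(1)
client_a = rng.normal(loc=-5.0, size=(100, 2))
client_b = rng.normal(loc=+5.0, size=(100, 2))
centroids = fed_kmeans([client_a, client_b], k=2)
```

Because only sums and counts cross the network, the update matches centralized kMeans exactly, which is why such methods are natural baselines for federated diversity analysis.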
【37】Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning
标题:优化未知:基于能量的模型和强化学习的黑匣子Bayesian优化
链接:https://arxiv.org/abs/2510.19530
备注:This paper is accepted by 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要:现有的贝叶斯优化(BO)方法通常通过平衡探索与利用来优化代价高昂的目标函数。然而,这些方法往往存在显著的单步偏差,这可能导致收敛到局部最优,并在复杂或高维任务中表现不佳。近年来,黑盒优化(BBO)在各种科学和工程领域取得了成功,特别是在函数评估成本高昂且梯度不可用的情况下。受此启发,我们提出了用于贝叶斯优化的强化能量模型(REBMBO),它将提供局部指导的高斯过程(GP)与捕获全局结构信息的能量模型(EBM)相结合。值得注意的是,我们将每次贝叶斯优化迭代定义为一个马尔可夫决策过程(MDP),并使用近端策略优化(PPO)进行自适应多步前瞻,动态调整探索的深度和方向,以有效克服传统BO方法的局限性。我们在合成和真实世界基准上进行了大量实验,证实了REBMBO的优越性能。对各种GP配置的进一步分析凸显了其适应性和鲁棒性。
摘要:Existing Bayesian Optimization (BO) methods typically balance exploration and exploitation to optimize costly objective functions. However, these methods often suffer from a significant one-step bias, which may lead to convergence towards local optima and poor performance in complex or high-dimensional tasks. Recently, Black-Box Optimization (BBO) has achieved success across various scientific and engineering domains, particularly when function evaluations are costly and gradients are unavailable. Motivated by this, we propose the Reinforced Energy-Based Model for Bayesian Optimization (REBMBO), which integrates Gaussian Processes (GP) for local guidance with an Energy-Based Model (EBM) to capture global structural information. Notably, we define each Bayesian Optimization iteration as a Markov Decision Process (MDP) and use Proximal Policy Optimization (PPO) for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration to effectively overcome the limitations of traditional BO methods. We conduct extensive experiments on synthetic and real-world benchmarks, confirming the superior performance of REBMBO. Additional analyses across various GP configurations further highlight its adaptability and robustness.
【38】From Prototypes to Sparse ECG Explanations: SHAP-Driven Counterfactuals for Multivariate Time-Series Multi-class Classification
标题:从原型到稀疏心电图解释:SHAP驱动的多元时间序列多类别分类的反事实
链接:https://arxiv.org/abs/2510.19514
摘要:在可解释人工智能(XAI)中,基于实例的时间序列解释因其在医疗保健等领域提供可操作、可解释见解的潜力而日益受到关注。为应对最先进模型的可解释性挑战,我们提出了一个原型驱动的框架,用于为12导联ECG分类模型生成稀疏的反事实解释。我们的方法采用基于SHAP的阈值来识别关键信号段并将其转换为区间规则,使用动态时间规整(DTW)和medoid聚类提取代表性原型,并将这些原型与查询样本的R峰对齐,以保持与被解释样本的一致性。该框架生成的反事实仅修改原始信号的78%,同时在所有类别上保持81.3%的有效性,并在时间稳定性上取得43%的改进。我们评估了方法的三种变体(Original、Sparse和Aligned Sparse),各类别的表现从心肌梗死(MI)的98.9%有效性,到仍具挑战的肥大(HYP)检测(13.2%)不等。该方法支持近实时(<1秒)生成临床有效的反事实,并为交互式解释平台奠定了基础。我们的研究结果为基于人工智能的诊断系统中具有生理意识的反事实解释确立了设计原则,并勾勒了面向临床部署的用户可控解释界面的路径。
摘要:In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near real-time generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
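The DTW-plus-medoid prototype step can be sketched with generic implementations (not the paper's code; the toy "heartbeat" series are invented): compute pairwise DTW distances, then pick the series whose total distance to all others is minimal.

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D series, O(len(a)*len(b))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def medoid_prototype(series):
    """Index of the series with minimal total DTW distance to all others."""
    n = len(series)
    totals = [sum(dtw(series[i], series[j]) for j in range(n) if j != i)
              for i in range(n)]
    return int(np.argmin(totals))

# Toy 'beats': a bump, two time-shifted copies, and one noise outlier.
t = np.linspace(0, 1, 50)
bump = np.exp(-((t - 0.5) ** 2) / 0.01)
series = [bump,
          np.roll(bump, 3),
          np.roll(bump, -3),
          np.random.default_rng(2).normal(size=50)]
idx = medoid_prototype(series)
```

DTW tolerates the small time shifts, so the medoid lands on one of the bump-shaped beats rather than the noise outlier, which is the behavior that makes medoids useful as class prototypes.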
【39】Modeling realistic human behavior using generative agents in a multimodal transport system: Software architecture and Application to Toulouse
标题:在多式联运系统中使用生成代理建模现实人类行为:软件架构及其在图卢兹的应用
链接:https://arxiv.org/abs/2510.19497
摘要:模拟真实的人类行为以理解人们的出行方式选择,从而提出个性化的移动出行解决方案,仍然具有挑战性。本文提出了一种在复杂多式联运系统中对真实人类出行行为进行建模的架构,并通过法国图卢兹的案例研究加以演示。我们在基于智能体的仿真中应用大型语言模型(LLM),以捕获真实城市环境中的决策过程。该框架将GAMA仿真平台与基于LLM的生成式智能体相集成,并结合用于公共交通的通用公交数据规范(GTFS)数据和用于多式联运路径规划的OpenTripPlanner。GAMA平台对交互式交通环境进行建模,提供可视化和动态智能体交互,同时免去了从零构建仿真环境的需要。这一设计使我们能够更专注于开发生成式智能体并评估其在交通决策过程中的表现。在为期一个月的仿真中,结果表明智能体不仅能做出上下文感知的交通决策,还会随时间形成习惯。我们的结论是,将LLM与基于智能体的仿真相结合,为推进智能交通系统和个性化多式联运出行解决方案提供了一个有前景的方向。我们还讨论了该方法的一些局限,并展望了未来工作:扩展到更大区域、集成实时数据以及改进记忆模型。
摘要:Modeling realistic human behaviour to understand people's mode choices in order to propose personalised mobility solutions remains challenging. This paper presents an architecture for modeling realistic human mobility behavior in complex multimodal transport systems, demonstrated through a case study in Toulouse, France. We apply Large Language Models (LLMs) within an agent-based simulation to capture decision-making in a real urban setting. The framework integrates the GAMA simulation platform with an LLM-based generative agent, along with General Transit Feed Specification (GTFS) data for public transport, and OpenTripPlanner for multimodal routing. GAMA platform models the interactive transport environment, providing visualization and dynamic agent interactions while eliminating the need to construct the simulation environment from scratch. This design enables a stronger focus on developing generative agents and evaluating their performance in transport decision-making processes. Over a simulated month, results show that agents not only make context-aware transport decisions but also form habits over time. We conclude that combining LLMs with agent-based simulation offers a promising direction for advancing intelligent transportation systems and personalised multimodal mobility solutions. We also discuss some limitations of this approach and outline future work on scaling to larger regions, integrating real-time data, and refining memory models.
【40】CARES: Context-Aware Resolution Selector for VLMs
标题:CARES:面向VLM的上下文感知分辨率选择器
链接:https://arxiv.org/abs/2510.19496
摘要:大型视觉语言模型(VLM)通常以原生或高分辨率处理图像,以在各类任务中保持有效性。这使得视觉标记往往膨胀到总标记数的97-99%,即使低分辨率图像已经足够,也会带来高计算量和高延迟。我们提出CARES(上下文感知分辨率选择器),一个轻量级预处理模块:给定图像-查询对,它预测最小的足够输入分辨率。CARES使用一个紧凑的VLM(350M参数)提取特征,并预测目标预训练VLM的回答何时收敛到其正确作答的峰值能力。虽然CARES是在一组候选分辨率上作为离散分类器训练的,但在推理时可在连续分辨率之间插值以实现细粒度控制。在涵盖文档和自然图像的五个多模态基准以及多种目标VLM上,CARES在保持任务性能的同时将计算量最多减少80%。
摘要:Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
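The discrete-training, continuous-inference recipe can be sketched as below. The candidate resolution set and the expectation-based interpolation are assumptions for illustration, not details confirmed by the abstract.

```python
import numpy as np

# Hypothetical candidate set the classifier is trained over.
RESOLUTIONS = np.array([224, 448, 672, 896])

def select_resolution(probs, resolutions=RESOLUTIONS):
    """Continuous resolution from discrete class probabilities.

    probs[i] is the predicted probability that resolutions[i] is the minimal
    sufficient resolution; taking the expectation over candidates yields a
    continuous value, one plausible way to 'interpolate' at inference.
    """
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()
    return float(probs @ resolutions)

# A query whose probability mass straddles 448 and 672 gets an in-between value.
r = select_resolution([0.0, 0.5, 0.5, 0.0])  # -> 560.0
```

The point of the design is that the selector's cost (a 350M model at low resolution) is far smaller than the savings from running the target VLM below native resolution.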
【41】Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning
标题:利用非专家数据通过离线强化学习增强模仿学习的鲁棒性
链接:https://arxiv.org/abs/2510.19495
摘要:模仿学习已被证明是从人类专家演示中训练机器人执行复杂任务的有效方法。然而,它仍然受限于对高质量、特定任务数据的依赖,难以适应现实世界中多样的物体配置和场景。相比之下,非专家数据(例如游玩数据、次优演示、部分完成的任务,或次优策略的轨迹)可以提供更广的覆盖范围和更低的收集成本。然而,传统的模仿学习方法无法有效利用这类数据。为应对这些挑战,我们提出:只要做出正确的设计决策,离线强化学习就可以作为一种工具,利用非专家数据来提升模仿学习策略的性能。我们表明,在现实世界中常见的稀疏数据覆盖设定下,标准离线RL方法在实际利用非专家数据时可能并不奏效,但简单的算法修改即可在无需显著额外假设的情况下利用这些数据。我们的方法表明,扩大策略分布的支撑集可以让经离线RL增强的模仿算法鲁棒地完成任务,表现出显著增强的恢复与泛化行为。在操纵任务中,纳入非专家数据后,这些创新显著扩大了学得策略能够成功的初始条件范围。此外,我们表明这些方法能够利用包括部分或次优演示在内的所有收集数据,来增强面向任务的策略性能。这凸显了在机器人学中利用非专家数据进行鲁棒策略学习的算法技术的重要性。
摘要:Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data -- such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies -- can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with the right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.
【42】VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
标题:VideoAgentTrek:来自未标记视频的计算机使用预训练
链接:https://arxiv.org/abs/2510.19488
备注:8 pages, 6 figures
摘要:训练计算机使用智能体需要海量的GUI交互数据,但大规模手动标注动作轨迹的成本过于高昂。我们提出VideoAgentTrek,一个可扩展的流水线,能够在网络规模上从公开的屏幕录制视频中自动挖掘训练数据,无需人工标注。我们的方法解决了一个关键挑战:原始视频包含隐式演示,但缺乏显式的动作标签。为此,我们开发了Video2Action,一个由两部分组成的逆动力学模块(IDM):(1)一个视频定位模型,以精确的时间边界和上下文检测并定位GUI动作;(2)一个动作内容识别器,高保真地提取点击坐标和键入文本等结构化参数。将该流水线应用于39,000个YouTube教程视频,自动生成了152万个交互步骤。我们通过持续预训练加监督微调来利用这些数据。在OSWorld-Verified上,我们的方法将任务成功率从9.3%(仅SFT基线)提高到15.8%,相对提升70%。在AgentNetBench上,步骤准确率从64.1%提高到69.3%。我们的结果表明,被动的互联网视频可以转化为计算机使用智能体的高质量监督信号,为昂贵的人工标注提供了可扩展的替代方案。
摘要:Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
【43】Graph Unlearning Meets Influence-aware Negative Preference Optimization
标题:图遗忘遇上影响感知的负偏好优化
链接:https://arxiv.org/abs/2510.19479
摘要:图遗忘(graph unlearning)模型的最新进展通过保持节点表示基本不变来维持模型效用,同时在遗忘集上使用梯度上升来实现遗忘。然而,由于梯度上升的发散速度很快,这种方法会在遗忘过程中导致模型效用急剧下降。本文提出INPO,一个影响感知的负偏好优化(Influence-aware Negative Preference Optimization)框架,专注于减缓发散速度并提高模型效用对遗忘过程的鲁棒性。具体而言,我们首先分析了NPO具有较慢的发散速度,并从理论上提出遗忘高影响力的边可以降低遗忘带来的冲击。我们设计了一个影响感知的消息函数来放大被遗忘边的影响,并减轻遗忘集与保留集之间紧密的拓扑耦合。每条边的影响通过一种基于移除的方法快速估计。此外,我们从拓扑角度提出了拓扑熵损失,以避免遗忘过程中局部结构的信息损失过大。在五个真实数据集上的大量实验表明,基于INPO的模型在所有遗忘质量指标上达到最先进的性能,同时保持模型效用。代码可在 https://github.com/sh-qiangchen/INPO 获得。
摘要:Recent advancements in graph unlearning models have enhanced model utility by preserving the node representation essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce \textbf{INPO}, an \textbf{I}nfluence-aware \textbf{N}egative \textbf{P}reference \textbf{O}ptimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first analyze that NPO has slower divergence speed and theoretically propose that unlearning high-influence edges can reduce impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model's utility. Codes are available at \href{https://github.com/sh-qiangchen/INPO}{https://github.com/sh-qiangchen/INPO}.
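For context, the standard NPO loss that INPO builds on can be written down directly; the key property is that, unlike gradient ascent, it saturates as the model's probability on forget data drops. This is a generic sketch (β and the toy log-probabilities are illustrative, and the graph-specific machinery of INPO is not shown).

```python
import numpy as np

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """Negative Preference Optimization loss on forget-set log-probs.

    L = (2/beta) * mean(log(1 + (pi_theta / pi_ref)^beta)).
    Plain gradient ascent corresponds to maximizing -logp_theta, which
    diverges; NPO's loss instead decays toward 0 as logp_theta falls
    below the reference, slowing divergence during unlearning.
    """
    ratio = np.exp(beta * (logp_theta - logp_ref))
    return (2.0 / beta) * np.mean(np.log1p(ratio))

# Not yet unlearned: model matches the reference on forget data.
high = npo_loss(np.array([-1.0]), np.array([-1.0]))
# Mostly unlearned: model assigns far lower probability than the reference.
low = npo_loss(np.array([-20.0]), np.array([-1.0]))
```

The bounded, vanishing gradient on already-forgotten examples is what the abstract refers to as NPO's "slower divergence speed" relative to gradient ascent.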
【44】A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
标题:迈向基于思维链监控的安全论证的具体路线图
链接:https://arxiv.org/abs/2510.19476
摘要:随着人工智能系统逼近危险能力水平,基于"能力不足"的安全论证将不再充分,我们需要替代方法来确保安全。本文提出了一条在推理模型中基于思维链(CoT)监控构建安全论证的路线图,并概述了我们的研究议程。我们认为,CoT监控既可以支持"控制"类安全论证,也可以支持"可信性"类安全论证。我们提出一个由两部分组成的安全论证:(1)确认模型在不使用CoT运行时缺乏危险能力;(2)确保CoT所启用的任何危险能力都能被CoT监控检测到。我们系统地考察了可监控性面临的两类威胁:神经语(neuralese)和编码推理,并将后者分为三种形式(语言漂移、隐写术和异类推理),分析了其潜在驱动因素。我们评估了维持CoT忠实性的现有与新技术。针对模型产生不可监控推理的情形,我们探讨了从不可监控CoT中提取可监控CoT的可能性。为评估CoT监控安全论证的可行性,我们建立了预测市场,以汇总对影响其可行性的关键技术里程碑的预测。
摘要:As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
【45】HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission
标题:HybridEP:通过混合专家/数据传输将专家并行扩展到跨数据中心场景
链接:https://arxiv.org/abs/2510.19470
摘要:混合专家(Mixture-of-Experts,MoE)已成为扩展大型模型的流行架构。然而,快速增长的模型规模已超出单个数据中心(DC)的训练能力,推动训练向更灵活的跨DC范式转变。在此场景下,由于跨DC带宽有限,MoE的专家并行(EP)面临显著的可扩展性问题。具体而言,现有的EP优化尝试将数据通信与计算重叠,但在低带宽场景下,由于数据通信时间大大拉长,这种做法收益甚微。因此,跨DC的EP扩展正迅速成为MoE模型持续增长的关键障碍。 为解决这一问题,我们提出HybridEP,一个在受限带宽下通过建模指导来优化EP的框架。我们的核心思想是动态变换专家的空间布局,以减少数据通信的流量和频率,从而最小化EP的通信开销。然而,找到最优方案并非易事,因为混合数据通信与专家通信会使原有通信模式复杂化。为此,我们构建了一个基于流的模型来确定最优传输比例。在此指导下,我们采用两种技术:(1)基于域的划分,在GPU级别构建混合模式与特定通信拓扑之间的映射;(2)参数高效迁移,通过降低专家传输开销和扩大域规模进一步细化该拓扑。综合所有这些设计,HybridEP可视为一种可扩展性更好的、更通用的EP。实验结果表明,在受限带宽下,HybridEP比现有最先进的MoE训练系统最高快5.6倍。我们进一步在大规模模拟中比较HybridEP与EP:在不同带宽下,HybridEP在1000个DC的规模下实现最高1.45倍的加速。
摘要:Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, the rapidly growing scale outpaces model training on a single DC, driving a shift toward a more flexible, cross-DC training paradigm. Under this, Expert Parallelism (EP) of MoE faces significant scalability issues due to the limited cross-DC bandwidth. Specifically, existing EP optimizations attempt to overlap data communication and computation, which has little benefit in low-bandwidth scenarios due to a much longer data communication time. Therefore, the trend of cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models.   To address this, we propose HybridEP, a modeling-guided framework to optimize EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency, thereby minimizing EP's communication overheads. However, it is non-trivial to find the optimal solution because it complicates the original communication pattern by mixing data and expert communication. We therefore build a stream-based model to determine the optimal transmission ratio. Guided by this, we incorporate two techniques: (1) domain-based partition to construct the mapping between hybrid patterns and specific communication topology at GPU level, and (2) parameter-efficient migration to further refine this topology by reducing expert transmission overhead and enlarging the domain size. Combining all these designs, HybridEP can be considered as a more general EP with better scalability. Experimental results show that HybridEP outperforms existing state-of-the-art MoE training systems by up to 5.6x under constrained bandwidth. We further compare HybridEP and EP on large-scale simulations. HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.
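The intuition behind a stream-based transmission model can be illustrated with a deliberately simplified cost model; this is not the paper's actual formulation. Suppose a fraction r of the work is served by migrating expert weights and the rest by moving activation data, with the two transfers overlapped; total time is then the slower of the two streams, minimized where they take equal time.

```python
def optimal_transmission_ratio(data_bytes, expert_bytes):
    """Toy stream model: time(r) = max(r*expert_bytes, (1-r)*data_bytes) / bw.

    The max is minimized where the two streams finish together:
    r*expert_bytes = (1-r)*data_bytes  =>  r = data / (data + expert).
    All quantities here are illustrative assumptions, not HybridEP's model.
    """
    return data_bytes / (data_bytes + expert_bytes)

# If moving activations is 4x heavier than moving the experts themselves,
# the model shifts most of the load onto expert migration.
r = optimal_transmission_ratio(data_bytes=4e9, expert_bytes=1e9)  # -> 0.8
```

The real system additionally accounts for topology, migration frequency, and domain size, but the equalize-the-bottleneck-streams principle is the core of why a hybrid ratio beats pure data or pure expert transmission.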
【46】Universal Quantitative Abstraction: Categorical Duality and Logical Completeness for Probabilistic Systems
标题:普适定量抽象:概率系统的范畴对偶性与逻辑完备性
链接:https://arxiv.org/abs/2510.19444
摘要:本文为概率系统提出了一个统一的定量抽象理论,将范畴论、最优传输与定量模态逻辑联系起来。其核心是一个具有泛性质的规范$\varepsilon$-商:在所有$\varepsilon$-抽象中,它是在给定价值损失界约束下信息量最大的一个。该构造在抽象函子与实现函子之间诱导出一个伴随$(Q_{\varepsilon} \dashv R_{\varepsilon})$,经由特殊伴随函子定理建立,揭示了度量结构与逻辑语义之间的范畴对偶性。行为伪度量被刻画为一个Bellman型算子的唯一不动点,其压缩性与Lipschitz性质在余代数框架下得到证明。本文引入了一种定量模态$\mu$-演算,并证明其对可逻辑表示的系统是表达完备的,从而行为距离与最大逻辑偏差相一致。我们分析了接口细化下的组合性,阐明抽象如何跨系统边界交互。在有限马尔可夫决策过程上的精确验证套件证实了压缩性质、价值损失界、扰动下的稳定性、对抗可区分性与可扩展性,展示了鲁棒性与计算可行性。由此得到的框架为状态聚合与表示学习提供了有原则的目标,并为随机域中的价值函数逼近提供了数学上精确的保证。
摘要:A unified theory of quantitative abstraction is presented for probabilistic systems that links category theory, optimal transport, and quantitative modal logic. At its core is a canonical $ \varepsilon $-quotient endowed with a universal property: among all $ \varepsilon $-abstractions, it is the most informative one that respects a prescribed bound on value loss. This construction induces an adjunction between abstraction and realization functors $ (Q_{\varepsilon} \dashv R_{\varepsilon}) $, established via the Special Adjoint Functor Theorem, revealing a categorical duality between metric structure and logical semantics. A behavioral pseudometric is characterized as the unique fixed point of a Bellman-style operator, with contraction and Lipschitz properties proved in a coalgebraic setting. A quantitative modal $ \mu $-calculus is introduced and shown to be expressively complete for logically representable systems, so that behavioral distance coincides with maximal logical deviation. Compositionality under interface refinement is analyzed, clarifying how abstractions interact across system boundaries. An exact validation suite on finite Markov decision processes corroborates the contraction property, value-loss bounds, stability under perturbation, adversarial distinguishability, and scalability, demonstrating both robustness and computational feasibility. The resulting framework provides principled targets for state aggregation and representation learning, with mathematically precise guarantees for value-function approximation in stochastic domains.
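For reference, Bellman-style operators for behavioral pseudometrics typically take the following form. This is the standard bisimulation-metric operator, with reward function $r$, transition kernel $P$, discount $\gamma \in [0,1)$, and Wasserstein lifting $\mathcal{W}_d$ of the current pseudometric $d$ to distributions; the paper's exact operator may differ in weighting.

```latex
(\mathcal{F}d)(s,t) \;=\; \max_{a \in A} \Big[\, \bigl| r(s,a) - r(t,a) \bigr|
    \;+\; \gamma \,\mathcal{W}_{d}\bigl( P(\cdot \mid s,a),\; P(\cdot \mid t,a) \bigr) \Big]
```

Because $\mathcal{W}_d$ is 1-Lipschitz in $d$, the operator $\mathcal{F}$ is a $\gamma$-contraction on the space of bounded pseudometrics, so by Banach's fixed-point theorem the behavioral pseudometric is its unique fixed point $d^\ast = \mathcal{F} d^\ast$, matching the fixed-point characterization claimed in the abstract.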
【47】NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning
标题:NeSyPr:面向高效具身推理的神经符号程序化
链接:https://arxiv.org/abs/2510.19429
备注:Accepted at NeurIPS 2025
摘要:我们研究在动态环境的具身任务中采用语言模型(LM)所面临的挑战:由于延迟、连接性和资源限制,对大规模推理引擎或符号规划器的在线访问受到约束。为此,我们提出NeSyPr,一个新颖的具身推理框架,通过神经符号程序化(neurosymbolic proceduralization)来编译知识,从而赋予基于LM的智能体结构化、自适应且及时的推理能力。在NeSyPr中,首先由符号工具利用其陈述性知识显式生成特定任务的计划;随后将这些计划转换为可组合的程序化表示,编码计划中隐含的产生式规则,使得所得的组合程序能够无缝融入LM的推理过程。这种神经符号程序化将多步的符号化结构路径搜索与推理抽象并泛化为单步的LM推理,类似于人类的知识编译。它支持高效的测试时推理,而无需依赖外部符号指导,因此非常适合部署在对延迟敏感且资源受限的物理系统中。我们在具身基准PDDLGym、VirtualHome和ALFWorld上评估NeSyPr,展示了其相对于大规模推理模型和符号规划器的高效推理能力,同时使用了更紧凑的LM。
摘要:We address the challenge of adopting language models (LMs) for embodied tasks in dynamic environments, where online access to large-scale inference engines or symbolic planners is constrained due to latency, connectivity, and resource limitations. To this end, we present NeSyPr, a novel embodied reasoning framework that compiles knowledge via neurosymbolic proceduralization, thereby equipping LM-based agents with structured, adaptive, and timely reasoning capabilities. In NeSyPr, task-specific plans are first explicitly generated by a symbolic tool leveraging its declarative knowledge. These plans are then transformed into composable procedural representations that encode the plans' implicit production rules, enabling the resulting composed procedures to be seamlessly integrated into the LM's inference process. This neurosymbolic proceduralization abstracts and generalizes multi-step symbolic structured path-finding and reasoning into single-step LM inference, akin to human knowledge compilation. It supports efficient test-time inference without relying on external symbolic guidance, making it well suited for deployment in latency-sensitive and resource-constrained physical systems. We evaluate NeSyPr on the embodied benchmarks PDDLGym, VirtualHome, and ALFWorld, demonstrating its efficient reasoning capabilities over large-scale reasoning models and a symbolic planner, while using more compact LMs.
【48】Neural Variational Dropout Processes
标题:神经变分Dropout过程
链接:https://arxiv.org/abs/2510.19425
备注:Accepted as a Poster at International Conference on Learning Representations (ICLR) 2022 (Apr 25-29, 2022)
摘要:学习推断条件后验模型是实现鲁棒元学习的关键一步。本文提出了一种新的贝叶斯元学习方法,称为神经变分Dropout过程(NVDPs)。NVDPs基于特定任务的dropout对条件后验分布进行建模;利用一个低秩伯努利专家乘积元模型,从少量观察到的上下文中以内存高效的方式映射出dropout率。这使得全局学习、共享的神经网络可以在多任务少样本学习中为新任务快速重新配置。此外,NVDPs利用一种以全部任务数据为条件的新型先验,在摊销变分推理中优化条件dropout后验。令人惊讶的是,这使得能够对特定任务的dropout率进行鲁棒近似,从而处理广泛的函数模糊性与不确定性。我们在一维随机回归、图像修复和分类等少样本学习任务中将所提方法与其他元学习方法进行了比较。结果显示NVDPs具有出色的性能。
摘要:Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional \textit{dropout} posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches in the few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
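The low-rank product-of-Bernoulli-experts idea can be sketched as follows. This is a generic construction, not the paper's exact parameterization; the factor shapes and the product-of-sigmoids form are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_rates(U, V):
    """Per-unit rates from a low-rank product of Bernoulli 'experts' (sketch).

    Each rank-1 factor k contributes an expert sigmoid(outer(U[:, k], V[:, k]));
    the per-unit probability is their product. Storing U (n x r) and V (m x r)
    costs O((n + m) * r) memory instead of O(n * m) for a full rate matrix.
    """
    rates = np.ones((U.shape[0], V.shape[0]))
    for k in range(U.shape[1]):
        rates *= sigmoid(np.outer(U[:, k], V[:, k]))
    return rates

# Rank-2 factors for an 8 x 16 layer: 48 parameters instead of 128.
rng = np.random.default_rng(3)
R = dropout_rates(rng.normal(size=(8, 2)), rng.normal(size=(16, 2)))
```

Since each expert's output lies in (0, 1), the product does too, so R is always a valid matrix of Bernoulli parameters that a context encoder could emit per task.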
【49】MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
标题:MSC-Bench:多服务器工具编排的严格基准
链接:https://arxiv.org/abs/2510.19423
备注:under ACL Rolling Review 2025
摘要:我们提出MSC-Bench,一个大规模基准,用于评估LLM智能体在分层模型上下文协议(MCP)生态系统中的多跳、端到端工具编排能力。现有基准通常孤立地评估工具,忽略了功能重叠和跨服务器编排等挑战,导致评估过于乐观。MSC-Bench通过"等价函数集"构建基准真值来弥补这些不足,从而支持F1分数等客观指标,并减少对"LLM作为评判者"式评估的依赖。该基准组织为五级课程,系统地测试智能体从单工具编排到复杂跨服务器规划的能力,以及对超出范围请求的鲁棒性。实验表明,在缺乏协同设计策略的情况下,僵化的层次结构会阻碍性能,即使是最先进的智能体也在鲁棒性上表现出系统性弱点。MSC-Bench提供了一个诊断框架来暴露这些局限,并指导开发能力更强、效率更高的工具使用智能体。基准与资源公开于 https://github.com/snooow1029/MSC_Bench。
摘要:We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
【50】FairNet: Dynamic Fairness Correction without Performance Loss via Contrastive Conditional LoRA
标题:FairNet:通过对比条件LoRA进行动态公平性纠正,而不会造成性能损失
链接:https://arxiv.org/abs/2510.19421
摘要:确保机器学习模型的公平性是一项关键挑战。现有的去偏方法往往以牺牲性能为代价,依赖静态的校正策略,并且难以应对数据稀疏,尤其是在少数群体中。此外,它们对敏感属性的利用常常并不理想:要么过度依赖完整的属性标注,要么完全忽略这些属性。为克服这些局限,我们提出FairNet,一个用于动态、实例级公平性校正的新框架。FairNet将偏差检测器与条件低秩适配(LoRA)相结合,仅对被识别为有偏的实例选择性地激活公平性校正机制,从而保持无偏实例上的性能。一项关键贡献是用于训练LoRA模块的新型对比损失函数,其专门设计用于最小化不同敏感群体之间的类内表示差异,并有效缓解少数群体的欠拟合问题。FairNet框架可以灵活处理敏感属性标签完整、部分或完全缺失的场景。理论分析证实,在偏差检测器具有中等TPR/FPR的条件下,FairNet可以在不降低整体模型性能的情况下提升最差群体的表现,甚至可能带来轻微的性能提升。跨多种视觉和语言基准的综合实证评估验证了FairNet的有效性。
摘要:Ensuring fairness in machine learning models is a critical challenge. Existing debiasing methods often compromise performance, rely on static correction strategies, and struggle with data sparsity, particularly within minority groups. Furthermore, their utilization of sensitive attributes is often suboptimal, either depending excessively on complete attribute labeling or disregarding these attributes entirely. To overcome these limitations, we propose FairNet, a novel framework for dynamic, instance-level fairness correction. FairNet integrates a bias detector with conditional low-rank adaptation (LoRA), which enables selective activation of the fairness correction mechanism exclusively for instances identified as biased, and thereby preserve performance on unbiased instances. A key contribution is a new contrastive loss function for training the LoRA module, specifically designed to minimize intra-class representation disparities across different sensitive groups and effectively address underfitting in minority groups. The FairNet framework can flexibly handle scenarios with complete, partial, or entirely absent sensitive attribute labels. Theoretical analysis confirms that, under moderate TPR/FPR for the bias detector, FairNet can enhance the performance of the worst group without diminishing overall model performance, and potentially yield slight performance improvements. Comprehensive empirical evaluations across diverse vision and language benchmarks validate the effectiveness of FairNet.
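A minimal stand-in for the intra-class, cross-group objective is sketched below. This is not FairNet's actual loss (a full contrastive formulation would add positive/negative pairs and a temperature); it only illustrates the quantity being minimized: the gap between sensitive-group representations within each class.

```python
import numpy as np

def intra_class_group_gap(Z, y, s):
    """Penalty on gaps between sensitive-group centroids within each class.

    For every class c, compute the mean representation of each sensitive
    group present in c, and sum squared distances between group means.
    Driving this to zero aligns group representations class by class.
    """
    loss = 0.0
    for c in np.unique(y):
        in_c = y == c
        means = [Z[in_c & (s == g)].mean(0) for g in np.unique(s[in_c])]
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                loss += float(((means[i] - means[j]) ** 2).sum())
    return loss

# One class whose two sensitive groups are offset by 1 along the first axis.
Z = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
y = np.array([0, 0, 0, 0])
s = np.array([0, 0, 1, 1])
gap = intra_class_group_gap(Z, y, s)  # -> 1.0
```

In training this term would be added to the task loss and backpropagated only through the LoRA parameters on instances the bias detector flags.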
【51】Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
标题:通过节点评估监控基于LLM的多代理系统防止损坏
链接:https://arxiv.org/abs/2510.19420
摘要:基于大语言模型(LLM)的多智能体系统(MAS)已成为人工智能应用的一种流行范式。然而,MAS的可信性问题仍是关键隐忧。与单智能体系统中的挑战不同,MAS涉及更复杂的通信过程,使其易受损坏攻击。为缓解这一问题,已有若干防御机制基于MAS的图表示而开发:智能体表示为节点,通信构成边。然而,这些方法主要聚焦于静态图防御,试图在固定的图结构中检测攻击,或优化出具有一定防御能力的静态拓扑。为突破这一局限,我们针对MAS图结构提出了一种动态防御范式,持续监控MAS图内的通信,进而动态调整图拓扑,精准切断恶意通信,有效抵御不断演化且多样的动态攻击。在日益复杂和动态的MAS环境中的实验结果表明,我们的方法显著优于现有的MAS防御机制,为其可信应用提供了有效护栏。我们的代码可在 https://github.com/ChengcanWu/Monitoring-LLM-Based-Multi-Agent-Systems 上获得。
摘要:Large Language Model (LLM)-based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications. Our code is available at https://github.com/ChengcanWu/Monitoring-LLM-Based-Multi-Agent-Systems.
【52】ToMMeR -- Efficient Entity Mention Detection from Large Language Models
标题:ToMMeR --来自大型语言模型的高效实体提及检测
链接:https://arxiv.org/abs/2510.19410
备注:Code is available at this https URL
摘要:识别哪些文本跨度指称实体(即提及检测)既是信息抽取的基础,也是已知的性能瓶颈。我们提出ToMMeR,一个从LLM早期层探测提及检测能力的轻量级模型(<300K参数)。在13个NER基准上,ToMMeR以zero-shot方式实现了93%的召回率;在以LLM作为评判时精确率超过90%,表明ToMMeR尽管召回率很高,却很少产生虚假预测。跨模型分析表明,不同架构(14M-15B参数)收敛于相似的提及边界(DICE>75%),证实提及检测自然地从语言建模中涌现。在加上跨度分类头进行扩展后,ToMMeR实现了接近SOTA的NER性能(在标准基准上为80-87% F1)。我们的工作提供了证据,表明结构化实体表示存在于Transformer早期层,并且可以用极少的参数高效地恢复。
摘要:Identifying which text spans refer to entities -- mention detection -- is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
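摘要中用于衡量不同模型提及边界一致性的 DICE 系数,可以按如下方式在 (start, end) 跨度集合上计算(示例跨度为虚构数据,仅作示意):

```python
def dice(spans_a, spans_b):
    # 两组提及跨度之间的 DICE 重合度:2|A∩B| / (|A|+|B|)
    a, b = set(spans_a), set(spans_b)
    return 2 * len(a & b) / (len(a) + len(b))

# 两个假想模型预测的提及边界,仅一个跨度的右边界不同
model_small = [(0, 2), (5, 7), (10, 12), (14, 15)]
model_large = [(0, 2), (5, 7), (10, 13), (14, 15)]

score = dice(model_small, model_large)
assert abs(score - 0.75) < 1e-9   # 2*3 / (4+4) = 0.75
```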
【53】ColorAgent: Building A Robust, Personalized, and Interactive OS Agent
标题:ColorAgent:构建一个健壮的、个性化的、交互式的OS代理
链接:https://arxiv.org/abs/2510.19386
摘要:随着硬件、软件和大型语言模型技术的进步,人类与操作系统之间的交互已经从命令行界面发展到迅速兴起的人工智能代理交互。构建能够执行用户指令并忠实遵循用户期望的操作系统(OS)代理正在成为现实。在本技术报告中,我们介绍了ColorAgent,一种旨在与环境进行长期、稳健交互,同时支持个性化和主动用户交互的OS代理。为了实现与环境的长期交互,我们通过逐步强化学习和自我进化训练来增强模型的能力,并开发了一个量身定制的多智能体框架,以确保通用性、一致性和鲁棒性。在用户交互方面,我们探索个性化的用户意图识别和主动参与,将OS代理定位为不仅仅是一个自动化工具,而是一个热情的协作伙伴。我们在AndroidWorld和AndroidLab基准上评估ColorAgent,分别达到77.2%和50.7%的成功率,创造了新的最先进水平。然而,我们注意到当前的基准不足以全面评估OS代理,并提出了未来工作中值得进一步探索的方向,特别是在评估范式、代理协作和安全等领域。我们的代码可在https://github.com/MadeAgents/mobile-use上获得。
摘要:With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model's capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at https://github.com/MadeAgents/mobile-use.
【54】The Massive Legal Embedding Benchmark (MLEB)
标题:大规模法律嵌入基准(MLEB)
链接:https://arxiv.org/abs/2510.19365
备注:15 pages, 2 figures
摘要:我们提出了大规模法律嵌入基准(MLEB),这是迄今为止最大,最多样化,最全面的法律信息检索开源基准。MLEB由十个专家注释的数据集组成,这些数据集涵盖多个司法管辖区(美国、英国、欧盟、澳大利亚、爱尔兰和新加坡)、文档类型(案例、立法、监管指南、合同和文献)和任务类型(搜索、zero-shot分类和问题回答)。MLEB中的七个数据集是新构建的,以填补开源法律信息检索领域的领域和管辖权空白。我们记录了构建MLEB和创建新的组成数据集的方法,并公开发布我们的代码,结果和数据,以帮助进行可重复的评估。
摘要:We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
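对于 MLEB 这类检索基准,一个常用的评估量是 recall@k;下面给出一个最小示意实现(文档 ID 与相关性标注均为虚构,与 MLEB 的实际评测脚本无关):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # 前 k 个检索结果命中的相关文档占全部相关文档的比例
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# 假想查询的排序结果与相关文档集合
ranked = ["case_7", "act_2", "case_1", "contract_9"]
relevant = ["case_1", "case_7"]

assert recall_at_k(ranked, relevant, 2) == 0.5
assert recall_at_k(ranked, relevant, 3) == 1.0
```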
【55】AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
标题:AgenticMath:通过基于智能体的数学数据生成增强LLM推理
链接:https://arxiv.org/abs/2510.19361
备注:Work in progress
摘要:创建高质量数据集以改进大型语言模型(LLM)的推理仍然是一个重大挑战,因为现有方法常常生成低质量/不正确的答案,且可用数据源的信息丰富度有限。为此,我们提出AgenticMath,一种用于生成高质量数学问答对的新型智能体管道,以增强LLM的监督微调。我们的方法分四个阶段:(1)种子问题过滤,选择信息丰富度、复杂性和清晰度高的问题;(2)智能体问题改写,采用多智能体系统生成多样且逻辑一致的释义;(3)答案增强,利用思维链推理重写答案以提升数值和逻辑正确性,而不依赖人工标注;(4)最终的问答评估,仅保留最优的问答对。大量实验表明,在AgenticMath生成的数据集(仅包含30-60K数学样本)上微调3B-8B参数的LLM,与在多得多的数据(例如400K或2.3M样本)上训练的基线相比,在多个域内和域外数学推理基准上取得了有竞争力或更优的性能。我们的工作表明,相比大规模低质量的替代方案,有针对性的高质量数据生成是改进LLM数学推理的更高效路径。
摘要:The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step where answers are rewritten using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
【56】M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
标题:M3-SLU:评估多模式大型语言模型中的说话者归因推理
链接:https://arxiv.org/abs/2510.19358
备注:Submitted to LREC 2026. 11 pages, 5 figures
摘要:我们提出M3-SLU,一个用于评估多说话人、多轮次口语理解的新的多模态大语言模型(MLLM)基准。尽管近期模型在语音和文本理解方面表现强劲,但它们在说话人归因推理(即理解自然对话中谁在何时说了什么的能力)上仍有困难。M3-SLU基于四个开放语料库(CHiME-6、MELD、MultiDialog和AMI)构建,包含超过12,000个经过验证的实例,包括配对的音频、转录和元数据。它包括两个任务:(1)说话人归因问答;(2)通过话语匹配进行说话人归因。我们为级联管道和端到端MLLM提供了基线结果,并使用LLM评判和准确率指标进行评估。结果表明,虽然模型能够捕捉"说了什么",却常常无法识别"是谁说的",揭示了说话人感知对话理解中的关键差距。M3-SLU提供了一个具有挑战性的基准,以推进说话人感知多模态理解的研究。
摘要:We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
【57】Learning To Defer To A Population With Limited Demonstrations
标题:利用有限演示学习向人群推迟决策
链接:https://arxiv.org/abs/2510.19351
备注:Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
摘要:本文解决了阻碍"学习推迟"(L2D)系统面向人群实际部署的关键数据稀缺问题。我们引入了一个上下文感知的半监督框架,利用元学习仅从少量演示生成专家特定的嵌入。我们展示了一种双重用途机制的有效性:这些嵌入首先用于生成大规模伪标签语料以供训练,随后在测试时支持对新专家的即时适应。在三个不同数据集上的实验结果证实,在这些合成标签上训练的模型能快速接近oracle级性能,验证了我们方法的数据效率。通过解决关键的训练瓶颈,这项工作使自适应L2D系统更加实用和可扩展,为现实环境中的人机协作铺平了道路。为了促进可复现性并补充正文未涉及的实现细节,我们在https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations上提供了源代码和训练配置。
摘要:This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. The experiment results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.
【58】A New Type of Adversarial Examples
标题:新型对抗性例子
链接:https://arxiv.org/abs/2510.19347
摘要:大多数机器学习模型容易受到对抗样本的影响,这给这些模型带来了安全隐患。对抗样本是通过对数据集中的样本施加微妙但有意的最坏情况修改而构造的,使模型对其输出与原始样本不同的答案。在本文中,对抗样本以一种完全相反的方式构造:它们与原始样本差异显著,却得到相同的答案。我们提出了一组新的算法来生成这类对抗样本,包括负迭代快速梯度符号法(NI-FGSM)和负迭代快速梯度法(NI-FGM),以及它们的动量变体:负动量迭代快速梯度符号法(NMI-FGSM)和负动量迭代快速梯度法(NMI-FGM)。用这些方法构造的对抗样本在某些场合可被用于对机器学习系统发起攻击。此外,我们的结果表明,对抗样本并非仅分布在数据集中样本的邻域内;相反,它们广泛分布于整个样本空间。
摘要:Most machine learning models are vulnerable to adversarial examples, which poses security concerns on these models. Adversarial examples are crafted by applying subtle but intentionally worst-case modifications to examples from the dataset, leading the model to output a different answer from the original example. In this paper, adversarial examples are formed in an exactly opposite manner, which are significantly different from the original examples but result in the same answer. We propose a novel set of algorithms to produce such adversarial examples, including the negative iterative fast gradient sign method (NI-FGSM) and the negative iterative fast gradient method (NI-FGM), along with their momentum variants: the negative momentum iterative fast gradient sign method (NMI-FGSM) and the negative momentum iterative fast gradient method (NMI-FGM). Adversarial examples constructed by these methods could be used to perform an attack on machine learning systems in certain occasions. Moreover, our results show that the adversarial examples are not merely distributed in the neighbourhood of the examples from the dataset; instead, they are distributed extensively in the sample space.
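负迭代 FGSM 的核心思想(沿降低损失的方向迭代移动输入,使其远离原样本却保持相同预测)可以在一个玩具线性分类器上用几行代码演示。注意:这只是对该思想的示意草图(线性模型下梯度符号方向恒定),权重与步长均为虚构,并非论文算法的复现:

```python
import numpy as np

# 玩具线性分类器:w·x 的符号决定预测类别
w = np.array([1.0, -2.0, 0.5])

def predict(x):
    return 1 if w @ x > 0 else 0

def ni_fgsm(x, alpha=0.5, steps=20):
    y = predict(x)            # 固定原始预测标签
    x_adv = x.copy()
    for _ in range(steps):
        # 标准 FGSM 沿损失上升方向走;"负"变体反向行进,
        # 使 x 远离原样本的同时,对标签 y 的置信度只增不减
        # (线性模型下该方向恒定;真实网络需逐步重算梯度)
        x_adv = x_adv + alpha * np.sign((2 * y - 1) * w)
    return x_adv

x0 = np.array([0.3, -0.2, 0.1])
x_far = ni_fgsm(x0)
assert predict(x_far) == predict(x0)      # 答案不变
assert np.linalg.norm(x_far - x0) > 5.0   # 但与原样本相距甚远
```

这也直观印证了摘要末尾的观察:得到同一答案的点并不局限于原样本邻域,而是遍布样本空间。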
【59】Foundation Model Forecasts: Form and Function
标题:基础模型预测:形式和功能
链接:https://arxiv.org/abs/2510.19345
备注:28 pages, 3 figures
摘要:时间序列基础模型(TSFM)实现了很强的预测精度,但精度本身并不决定实用价值。预测的形式(点预测、分位数、参数化分布或轨迹集合)从根本上限制了它所能支持的运营任务。我们调研了近期的TSFM,发现其中三分之二只产生点预测或参数化预测,而许多运营任务需要能保持时间依赖性的轨迹集合。我们确立了预测类型何时可以转换、何时不能:轨迹集合可通过边缘化转换为更简单的形式而无需额外假设,但反向转换需要借助Copula或保形方法来施加时间依赖性。我们证明,边缘分布无法确定路径相关的事件概率:无限多个联合分布共享相同的边缘分布,却对运营问题给出不同的答案。我们将六个基本预测任务映射到最小充分的预测类型,并提供了一个与任务对齐的评估框架。我们的分析阐明,区分实际效用的是预测类型,而非精度。
摘要:Time-series foundation models (TSFMs) achieve strong forecast accuracy, yet accuracy alone does not determine practical value. The form of a forecast -- point, quantile, parametric, or trajectory ensemble -- fundamentally constrains which operational tasks it can support. We survey recent TSFMs and find that two-thirds produce only point or parametric forecasts, while many operational tasks require trajectory ensembles that preserve temporal dependence. We establish when forecast types can be converted and when they cannot: trajectory ensembles convert to simpler forms via marginalization without additional assumptions, but the reverse requires imposing temporal dependence through copulas or conformal methods. We prove that marginals cannot determine path-dependent event probabilities -- infinitely many joint distributions share identical marginals but yield different answers to operational questions. We map six fundamental forecasting tasks to minimal sufficient forecast types and provide a task-aligned evaluation framework. Our analysis clarifies when forecast type, not accuracy, differentiates practical utility.
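文中"边缘分布无法确定路径相关事件概率"的论断可以用数值实验直观验证:构造两个逐时刻边缘分布完全相同、但时间依赖结构不同的两步轨迹集合,再比较同一个路径相关事件("两步均为正")的概率。以下为纯示意的合成数据实验:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
z = rng.normal(size=n)

# 两个两步轨迹集合:逐时刻的边缘分布完全相同
ens_dependent = np.stack([z, z], axis=1)                     # 完全时间相关
ens_independent = np.stack([z, rng.permutation(z)], axis=1)  # 时间独立

# 边缘分位数一致……
q_a = np.quantile(ens_dependent, 0.9, axis=0)
q_b = np.quantile(ens_independent, 0.9, axis=0)
assert np.allclose(q_a, q_b)

# ……但路径相关事件"两步均大于 0"的概率不同
p_a = float(np.mean((ens_dependent > 0).all(axis=1)))    # 约 0.5
p_b = float(np.mean((ens_independent > 0).all(axis=1)))  # 约 0.25
assert abs(p_a - 0.5) < 0.05 and abs(p_b - 0.25) < 0.05
```

这同时演示了"轨迹集合通过边缘化降为分位数预测"这一无损方向:`np.quantile(..., axis=0)` 正是对集合做逐时刻边缘化。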
【60】To Use or to Refuse? Re-Centering Student Agency with Generative AI in Engineering Design Education
标题:使用还是拒绝?在工程设计教育中利用生成式人工智能重新聚焦学生能动性
链接:https://arxiv.org/abs/2510.19342
备注:to be published in IEEE TALE 2025
摘要:这项试点研究追踪了学生在新加坡科技设计大学一门为期13周、招收500多名一年级工程和建筑专业学生的基础设计课程中对人工智能使用的反思。该课程是一门由人工智能增强的设计课程,通过多项干预措施使学生具备基于人工智能的设计技能。学生们被要求反思该技术是被用作工具(辅助手段)、队友(协作伙伴),还是两者皆非(有意不使用)。通过凸显这一三分视角,学生学会了利用人工智能进行创新而非仅仅自动化,并反思能动性、伦理和语境,而不只是提示词撰写。证据来自课程作业产出:13份结构化反思电子表格和8份图文简报,并结合教师和研究人员的笔记。对这些材料的定性编码揭示了引入生成式AI后形成的共同实践,包括加速原型设计、快速技能习得、迭代式提示词改进、用户研究期间有目的的"关闭AI",以及识别幻觉的新兴例程。出乎意料的是,学生们不仅利用生成式AI提速,还(得益于"工具-队友-两者皆非"的三分法)学会了拒绝其输出、发明自己的"幻觉应急演练",并将省下的时间投入更深入的用户研究,从而将效率转化为创新。我们所探索方法的启示包括:可以将AI的采用转化为可评估的设计习惯;奖励有选择的不使用能培养具备幻觉意识的工作流程;在实践中,将工具访问、反思、角色标记与通过竞赛奖项获得的公开认可相协调,能让基于AI的教育创新在不损害问责制的前提下规模化。
摘要:This pilot study traces students' reflections on the use of AI in a 13-week foundational design course enrolling over 500 first-year engineering and architecture students at the Singapore University of Technology and Design. The course was an AI-enhanced design course, with several interventions to equip students with AI based design skills. Students were required to reflect on whether the technology was used as a tool (instrumental assistant), a teammate (collaborative partner), or neither (deliberate non-use). By foregrounding this three-way lens, students learned to use AI for innovation rather than just automation and to reflect on agency, ethics, and context rather than on prompt crafting alone. Evidence stems from coursework artefacts: thirteen structured reflection spreadsheets and eight illustrated briefs submitted, combined with notes of teachers and researchers. Qualitative coding of these materials reveals shared practices brought about through the inclusion of Gen-AI, including accelerated prototyping, rapid skill acquisition, iterative prompt refinement, purposeful "switch-offs" during user research, and emergent routines for recognizing hallucinations. Unexpectedly, students not only harnessed Gen-AI for speed but (enabled by the tool-teammate-neither triage) also learned to reject its outputs, invent their own hallucination fire-drills, and divert the reclaimed hours into deeper user research, thereby transforming efficiency into innovation. The implications of the approach we explore show that: we can transform AI uptake into an assessable design habit; that rewarding selective non-use cultivates hallucination-aware workflows; and, practically, that a coordinated bundle of tool access, reflection, role tagging, and public recognition through competition awards allows AI-based innovation in education to scale without compromising accountability.
【61】Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
标题:每一个注意力都很重要:长上下文推理的高效混合架构
链接:https://arxiv.org/abs/2510.19338
备注:20 pages, 13 figures
摘要:在本技术报告中,我们介绍了Ring-linear模型系列,具体包括Ring-mini-linear-2.0和Ring-flash-linear-2.0。Ring-mini-linear-2.0包含16B参数和957M激活,而Ring-flash-linear-2.0包含104B参数和6.1B激活。这两种模型都采用了混合架构,有效地集成了线性注意力和softmax注意力,显著降低了长上下文推理场景中的I/O和计算开销。与320亿参数的稠密模型相比,该系列将推理成本降低到1/10;与原始Ring系列相比,成本也降低了50%以上。此外,通过系统地探索混合架构中不同注意力机制之间的比例,我们确定了当前最优的模型结构。同时,借助我们自主研发的高性能FP8算子库linghe,整体训练效率提高了50%。受益于训练和推理引擎算子之间的高度一致性,模型能够在强化学习阶段进行长期、稳定且高效的优化,在多个具有挑战性的复杂推理基准测试中始终保持SOTA性能。
摘要:In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
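softmax 注意力与线性注意力的复杂度差异可用如下 numpy 草图说明:线性注意力借助矩阵乘法结合律,先计算与序列长度 T 无关的 (d×d) 摘要,从而把逐对打分的 O(T²) 降为 O(T)。这里的特征映射 `phi` 只是常见的示意性选择,并非 Ring-linear 的实际实现:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # O(T^2):为每个 query-key 对计算打分
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # O(T):结合律允许先算出与 T 无关的 (d, d) 摘要
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                   # (d, d) 摘要,可随序列流式累加
    norm = Qf @ Kf.sum(axis=0)      # 每个 query 的归一化因子
    return (Qf @ kv) / norm[:, None]

rng = np.random.default_rng(0)
T, d = 16, 8
Q, K, V = rng.normal(size=(3, T, d))

out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
assert out_soft.shape == out_lin.shape == (T, d)
```

混合架构即在层间交错使用这两种算子:多数层用线性注意力省下 KV 缓存与算力,少数层保留 softmax 注意力以维持精确的长程检索能力。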
【62】Seabed-Net: A multi-task network for joint bathymetry estimation and seabed classification from remote sensing imagery in shallow waters
标题:Seabed-Net:一个多任务网络,用于根据浅水遥感图像进行联合水深估算和海底分类
链接:https://arxiv.org/abs/2510.19329
备注:Submitted to ISPRS Journal of Photogrammetry and Remote Sensing
摘要:准确、详细且定期更新的水深数据,连同丰富的语义内容,对于面临日益加剧的气候和人为压力、测绘不足的浅水环境至关重要。然而,现有的从遥感图像中推导水深或海底类别的方法孤立地处理这些任务,丧失了任务间相互作用的互补收益,并阻碍了深度学习方法的更广泛采用。为了解决这些局限,我们提出了Seabed-Net,一个统一的多任务框架,可从各种分辨率的遥感图像中同时预测水深和基于像素的海底分类。Seabed-Net采用双分支编码器分别进行水深估计和基于像素的海底分类,通过注意力特征融合模块和窗口化Swin-Transformer融合模块整合跨任务特征,并通过动态任务不确定性加权来平衡目标。在两个异质沿海站点的广泛评估中,它始终优于传统经验模型和传统机器学习回归方法,RMSE最多降低75%。与最先进的单任务和多任务基线相比,它还将水深RMSE降低了10-30%,并将海底分类准确率最多提高8%。定性分析进一步表明,其空间一致性更强、栖息地边界更清晰,并纠正了低对比度区域的深度偏差。这些结果证实,将水深与底质及海底栖息地联合建模能产生协同收益,为综合浅水测绘提供了一个稳健、开放的解决方案。代码和预训练权重可在https://github.com/pagraf/Seabed-Net上获得。
摘要:Accurate, detailed, and regularly updated bathymetry, coupled with complex semantic content, is essential for under-mapped shallow-water environments facing increasing climatological and anthropogenic pressures. However, existing approaches that derive either depth or seabed classes from remote sensing imagery treat these tasks in isolation, forfeiting the mutual benefits of their interaction and hindering the broader adoption of deep learning methods. To address these limitations, we introduce Seabed-Net, a unified multi-task framework that simultaneously predicts bathymetry and pixel-based seabed classification from remote sensing imagery of various resolutions. Seabed-Net employs dual-branch encoders for bathymetry estimation and pixel-based seabed classification, integrates cross-task features via an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances objectives through dynamic task uncertainty weighting. In extensive evaluations at two heterogeneous coastal sites, it consistently outperforms traditional empirical models and traditional machine learning regression methods, achieving up to 75% lower RMSE. It also reduces bathymetric RMSE by 10-30% compared to state-of-the-art single-task and multi-task baselines and improves seabed classification accuracy up to 8%. Qualitative analyses further demonstrate enhanced spatial consistency, sharper habitat boundaries, and corrected depth biases in low-contrast regions. These results confirm that jointly modeling depth with both substrate and seabed habitats yields synergistic gains, offering a robust, open solution for integrated shallow-water mapping. Code and pretrained weights are available at https://github.com/pagraf/Seabed-Net.
【63】SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities
标题:SORA-ATMAS:面向未来智慧城市的自适应信任管理与多LLM对齐治理
链接:https://arxiv.org/abs/2510.19327
摘要:智慧城市的快速发展增加了对智能互联服务的依赖,以优化基础设施、资源和市民福祉。智能体AI(Agentic AI)通过支持自主决策和自适应协调,使城市系统能够实时响应动态条件,已成为关键的使能技术。它的优势在交通等领域尤为明显:整合交通数据、天气预报和安全传感器可实现动态改道和对危险的更快响应。然而,在异构智慧城市生态系统中部署智能体AI带来了关键的治理、风险与合规(GRC)挑战,包括去中心化基础设施中的问责制、数据隐私和监管对齐。对SORA-ATMAS与三个领域代理(天气、交通和安全)的评估表明,其治理策略(包括针对高风险场景的回退机制)能有效引导多个LLM(GPT、Grok、DeepSeek)产生领域优化、符合策略的输出,使各代理的平均MAE降低35%。结果显示了稳定的天气监测、对高风险交通平台期(0.85)的有效处理,以及安全/消防场景中的自适应信任调节(0.65)。对三代理部署的运行时剖析证实了可扩展性:吞吐量为每秒13.8-17.2个请求,执行时间低于72 ms,治理延迟低于100 ms;分析性预测表明在更大规模下性能可以保持。跨领域规则确保了安全的互操作性,仅在天气条件经过验证时才允许交通改道。这些发现验证了SORA-ATMAS是一个与法规对齐、上下文感知且可验证的治理框架,能将分布式代理输出整合为可问责的实时决策,为智慧城市管理提供了有弹性的基础。
摘要:The rapid evolution of smart cities has increased the reliance on intelligent interconnected services to optimize infrastructure, resources, and citizen well-being. Agentic AI has emerged as a key enabler by supporting autonomous decision-making and adaptive coordination, allowing urban systems to respond in real time to dynamic conditions. Its benefits are evident in areas such as transportation, where the integration of traffic data, weather forecasts, and safety sensors enables dynamic rerouting and a faster response to hazards. However, its deployment across heterogeneous smart city ecosystems raises critical governance, risk, and compliance (GRC) challenges, including accountability, data privacy, and regulatory alignment within decentralized infrastructures. Evaluation of SORA-ATMAS with three domain agents (Weather, Traffic, and Safety) demonstrated that its governance policies, including a fallback mechanism for high-risk scenarios, effectively steer multiple LLMs (GPT, Grok, DeepSeek) towards domain-optimized, policy-aligned outputs, producing an average MAE reduction of 35% across agents. Results showed stable weather monitoring, effective handling of high-risk traffic plateaus (0.85), and adaptive trust regulation in Safety/Fire scenarios (0.65). Runtime profiling of a 3-agent deployment confirmed scalability, with throughput between 13.8-17.2 requests per second, execution times below 72 ms, and governance delays under 100 ms; analytical projections suggest maintained performance at larger scales. Cross-domain rules ensured safe interoperability, with traffic rerouting permitted only under validated weather conditions. These findings validate SORA-ATMAS as a regulation-aligned, context-aware, and verifiable governance framework that consolidates distributed agent outputs into accountable, real-time decisions, offering a resilient foundation for smart-city management.
【64】Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
标题:平衡文本摘要中的奖励:通过HyperVolume优化的多目标强化学习
链接:https://arxiv.org/abs/2510.19325
摘要:文本摘要是一项需要同时优化多个目标(包括一致性、连贯性、相关性和流畅性)的重要任务,这带来了相当大的挑战。尽管大型语言模型(LLM)在强化学习(RL)的增强下表现出了显著的性能,但很少有研究关注通过基于LLM的RL来优化摘要的多目标问题。在本文中,我们介绍超体积优化(HVO),一种新颖的优化策略,它在RL的奖励过程中使用超体积方法动态调整组间分数。该方法引导模型的优化逐步逼近帕累托前沿,从而生成在多个目标间均衡的摘要。在几个有代表性的摘要数据集上的实验结果表明,我们的方法在总体得分上优于组相对策略优化(GRPO),并在不同维度上表现出更均衡的性能。此外,经HVO增强的7B基础模型在摘要任务中的表现与GPT-4相当,同时保持更短的生成长度。我们的代码可在https://github.com/ai4business-LiAuto/HVO.git上公开获取
摘要:Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model's optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
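超体积(hypervolume)在二维目标下(例如把一致性与连贯性视作两个奖励维度)可以用简单的扫描法计算:它等于点集相对参考点所支配区域的面积,被支配的点不增加超体积。以下实现与示例分数均为示意,并非论文中实际的奖励调整流程:

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    # 二维最大化目标下,点集相对参考点 ref 所支配区域的面积
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:                        # 被支配的点不贡献面积
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

# 假想的一组(目标A得分, 目标B得分)
front = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.9)]
assert abs(hypervolume_2d(front) - 0.39) < 1e-9
# 新增一个被支配的点 (0.4, 0.4) 不会增加超体积
assert abs(hypervolume_2d(front + [(0.4, 0.4)]) - 0.39) < 1e-9
```

以此类标量作为组间分数的调整依据,能奖励整体逼近帕累托前沿的候选,而不是只在单一维度上拔高的候选。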
【65】Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
标题:实现光网络中集体通信的重配置通信重叠
链接:https://arxiv.org/abs/2510.19322
摘要:集体通信(CC)被广泛用于大规模分布式机器学习(DML)训练工作负载。DML可预测的流量模式为光网络技术的应用提供了巨大机遇。现有的基于光互连的CC方案采用"一次性网络重构",即为整个集体操作(有时甚至为完整的一次训练迭代)配置静态的高容量拓扑。然而,在支持现代工作负载所需的更复杂、更高效的CC算法时,这种方法面临显著的可扩展性限制:"一次性"策略要么需要过度的资源超额配置,要么因资源分配僵化而遭受性能下降。为应对这些挑战,我们提出SWOT,一个需求感知的光网络框架。SWOT采用"集体操作内部重构",能够使网络资源与CC流量模式动态对齐。SWOT结合了一种新颖的调度技术,将光交换机重构与正在进行的传输重叠,从而提高通信效率。SWOT还引入了一个轻量级的集体通信适配层(shim),在支持与现有CC库无缝集成的同时,实现光网络配置与传输调度的协同。我们的仿真结果表明SWOT带来了显著的性能提升。
摘要:Collective communication (CC) is widely adopted for large-scale distributed machine learning (DML) training workloads. DML's predictable traffic pattern provides a great opportunity for applying optical network technology. Existing optical interconnects-based CC schemes adopt ``one-shot network reconfiguration'', which provisions static high-capacity topologies for an entire collective operation -- sometimes for a full training iteration. However, this approach faces significant scalability limitations when supporting more complex and efficient CC algorithms required for modern workloads: the ``one-shot'' strategies either demand excessive resource overprovisioning or suffer performance degradation due to rigid resource allocation. To address these challenges, we propose SWOT, a demand-aware optical network framework. SWOT employs ``intra-collective reconfiguration'' and can dynamically align network resources with CC traffic patterns. SWOT incorporates a novel scheduling technique that overlaps optical switch reconfigurations with ongoing transmissions, and improves communication efficiency. SWOT introduces a lightweight collective communication shim that enables coordinated optical network configuration and transmission scheduling while supporting seamless integration with existing CC libraries. Our simulation results demonstrate SWOT's significant performance improvements.
【66】Online Handwritten Signature Verification Based on Temporal-Spatial Graph Attention Transformer
标题:基于时空图注意力Transformer的在线手写签名验证
链接:https://arxiv.org/abs/2510.19321
摘要:手写签名验证是身份认证的一个重要方面,在金融和电子商务等各个领域都有应用。然而,由于用户内部的可变性和伪造风险,在签名验证中实现高准确率仍然具有挑战性。本文提出了一种新的动态签名验证方法:时空图注意力Transformer(TS-GATR)。TS-GATR结合图注意力网络(GAT)和门控循环单元(GRU),对签名数据中的空间和时间依赖性进行建模。TS-GATR将签名表示为图,其中每个节点捕获动态特征(如位置、速度、压力),并利用注意力机制建模它们之间的复杂关系,从而提升验证性能。该方法进一步采用了双图注意力Transformer(DGATR)模块,分别利用k步邻接图和k近邻邻接图来建模局部和全局空间特征。为了捕获长期时间依赖性,该模型集成了GRU,从而增强了其在签名验证过程中学习动态特征的能力。在MSDS和DeepSignDB等基准数据集上进行的综合实验表明,TS-GATR超越了当前最先进的方法,在各种场景中始终取得更低的等错误率(EER)。
摘要:Handwritten signature verification is a crucial aspect of identity authentication, with applications in various domains such as finance and e-commerce. However, achieving high accuracy in signature verification remains challenging due to intra-user variability and the risk of forgery. This paper introduces a novel approach for dynamic signature verification: the Temporal-Spatial Graph Attention Transformer (TS-GATR). TS-GATR combines the Graph Attention Network (GAT) and the Gated Recurrent Unit (GRU) to model both spatial and temporal dependencies in signature data. TS-GATR enhances verification performance by representing signatures as graphs, where each node captures dynamic features (e.g. position, velocity, pressure), and by using attention mechanisms to model their complex relationships. The proposed method further employs a Dual-Graph Attention Transformer (DGATR) module, which utilizes k-step and k-nearest neighbor adjacency graphs to model local and global spatial features, respectively. To capture long-term temporal dependencies, the model integrates GRU, thereby enhancing its ability to learn dynamic features during signature verification. Comprehensive experiments conducted on benchmark datasets such as MSDS and DeepSignDB show that TS-GATR surpasses current state-of-the-art approaches, consistently achieving lower Equal Error Rates (EER) across various scenarios.
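DGATR 模块依赖的两类邻接图(k 步邻接图建模局部时间邻域,k 近邻邻接图建模特征空间中的全局关系)可以按如下方式构造。特征维度与参数均为示意,与原文的具体配置无关:

```python
import numpy as np

def k_step_adjacency(T, k):
    # 局部图:将采样点 i 与时间上相距不超过 k 步的采样点相连
    A = np.zeros((T, T), dtype=int)
    for i in range(T):
        for j in range(max(0, i - k), min(T, i + k + 1)):
            if i != j:
                A[i, j] = 1
    return A

def knn_adjacency(X, k):
    # 全局图:将每个采样点与其特征空间中的 k 个最近邻相连
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # 排除自身
    A = np.zeros_like(d, dtype=int)
    for i in range(len(X)):
        A[i, np.argsort(d[i])[:k]] = 1
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 个采样点,特征示意为 (x, y, 压力)
A_local = k_step_adjacency(10, 2)
A_global = knn_adjacency(X, 3)
assert A_local.sum(axis=1).max() == 4      # 每侧至多 2 个时间邻居
assert (A_global.sum(axis=1) == 3).all()   # 每个节点恰有 k 个特征邻居
```

两张图分别交给各自的注意力分支后再融合,即摘要所述"局部 + 全局"空间建模的基本骨架。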
【67】Continual Knowledge Adaptation for Reinforcement Learning
标题:强化学习的持续知识适应
链接:https://arxiv.org/abs/2510.19314
备注:NeurIPS 2025
摘要:强化学习使智能体能够通过与环境的交互来学习最优行为。然而,现实世界的环境通常是非平稳的,需要智能体不断适应新任务和不断变化的条件。虽然持续强化学习有助于跨多个任务的学习,但现有方法往往存在灾难性遗忘和知识利用低效的问题。为了应对这些挑战,我们提出了面向强化学习的持续知识适应(CKA-RL),它能够积累并有效利用历史知识。具体来说,我们引入了一种持续知识适应策略,通过维护任务特定的知识向量池,并动态利用历史知识使智能体适应新任务。该过程通过保留和调整关键模型参数来缓解灾难性遗忘,并实现跨任务的高效知识迁移。此外,我们提出了一种自适应知识合并机制,通过合并相似的知识向量来应对可扩展性挑战,在确保保留关键知识的同时降低内存需求。在三个基准上的实验表明,所提出的CKA-RL优于最先进的方法,在整体性能上提升4.20%,在前向迁移上提升8.02%。源代码可在https://github.com/Fhujinwu/CKA-RL上获得。
摘要:Reinforcement Learning enables agents to learn optimal behaviors through interactions with environments. However, real-world environments are typically non-stationary, requiring agents to continuously adapt to new tasks and changing conditions. Although Continual Reinforcement Learning facilitates learning across multiple tasks, existing methods often suffer from catastrophic forgetting and inefficient knowledge utilization. To address these challenges, we propose Continual Knowledge Adaptation for Reinforcement Learning (CKA-RL), which enables the accumulation and effective utilization of historical knowledge. Specifically, we introduce a Continual Knowledge Adaptation strategy, which involves maintaining a task-specific knowledge vector pool and dynamically using historical knowledge to adapt the agent to new tasks. This process mitigates catastrophic forgetting and enables efficient knowledge transfer across tasks by preserving and adapting critical model parameters. Additionally, we propose an Adaptive Knowledge Merging mechanism that combines similar knowledge vectors to address scalability challenges, reducing memory requirements while ensuring the retention of essential knowledge. Experiments on three benchmarks demonstrate that the proposed CKA-RL outperforms state-of-the-art methods, achieving an improvement of 4.20% in overall performance and 8.02% in forward transfer. The source code is available at https://github.com/Fhujinwu/CKA-RL.
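自适应知识合并的基本思路(将余弦相似度超过阈值的知识向量合并,以限制向量池规模)可用如下玩具代码勾勒。阈值与合并方式(取平均)均为说明性假设,并非论文的实际机制:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_pool(pool, threshold=0.9):
    # 贪心合并:新向量若与已保留向量足够相似,则取平均合并,
    # 否则作为新的任务知识加入池中
    merged = []
    for vec in pool:
        for i, kept in enumerate(merged):
            if cosine(vec, kept) >= threshold:
                merged[i] = (kept + vec) / 2.0   # 合并相似知识
                break
        else:
            merged.append(vec)                   # 保留差异明显的知识
    return merged

pool = [np.array([1.0, 0.0]),
        np.array([0.99, 0.05]),   # 与第一个向量几乎重复
        np.array([0.0, 1.0])]     # 方向迥异的知识
compact = merge_pool(pool)
assert len(compact) == 2          # 近似重复的向量被合并
```

这样池的规模随任务数次线性增长,同时方向差异大的关键知识仍被完整保留。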
【68】Collaborative penetration testing suite for emerging generative AI algorithms
标题:针对新兴生成式人工智能算法的协作渗透测试套件
链接:https://arxiv.org/abs/2510.19303
摘要:问题空间:AI漏洞与量子威胁。生成式AI漏洞:模型反演、数据中毒、对抗性输入。量子威胁:Shor算法可破解RSA/ECC加密。挑战:保护生成式AI模型,抵御经典和量子网络攻击。提出的解决方案:协作渗透测试套件,含五个集成组件:DAST/SAST(OWASP ZAP、Burp Suite、SonarQube、Fortify);IAST(Contrast Assess,与CI/CD流水线集成);区块链日志(Hyperledger Fabric,用于防篡改日志);量子密码学(基于格的RLWE协议);AI红队模拟(对抗性ML与量子辅助攻击)。集成层:面向AI、网络安全和量子专家的统一工作流。主要结果:在测试环境中识别出300多个漏洞;2周内高严重性问题减少70%;区块链记录漏洞的解决效率达90%;抗量子密码在测试中保持100%的完整性。成果:整合区块链、量子密码学与AI红队的量子AI安全协议。
摘要:Problem Space: AI Vulnerabilities and Quantum Threats. Generative AI vulnerabilities: model inversion, data poisoning, adversarial inputs. Quantum threats: Shor's algorithm breaking RSA/ECC encryption. Challenge: secure generative AI models against classical and quantum cyberattacks. Proposed Solution: a Collaborative Penetration Testing Suite with five integrated components: DAST/SAST (OWASP ZAP, Burp Suite, SonarQube, Fortify); IAST (Contrast Assess integrated with the CI/CD pipeline); Blockchain Logging (Hyperledger Fabric for tamper-proof logs); Quantum Cryptography (lattice-based RLWE protocols); AI Red Team Simulations (adversarial ML and quantum-assisted attacks). Integration Layer: unified workflow for AI, cybersecurity, and quantum experts. Key Results: 300+ vulnerabilities identified across test environments; 70% reduction in high-severity issues within 2 weeks; 90% resolution efficiency for blockchain-logged vulnerabilities; quantum-resistant cryptography maintained 100% integrity in tests. Outcome: a Quantum AI Security Protocol integrating Blockchain, Quantum Cryptography, and AI Red Teaming.
【69】Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties
标题:学会交朋友:引导LLM智能体形成涌现的社交关系
链接:https://arxiv.org/abs/2510.19299
摘要:大型语言模型(LLM)代理能否再现人类在线行为特征的复杂社会动态-由同质性,互惠性和社会验证形成-以及什么记忆和学习机制使这种动态出现?我们提出了一个多智能体LLM模拟框架,在该框架中,智能体反复交互,相互评估,并通过由教练信号加速的上下文学习来调整他们的行为。为了模拟人类的社会行为,我们设计了行为奖励功能,以捕捉在线参与的核心驱动因素,包括社交互动,信息寻求,自我展示,协调和情感支持。这些奖励使代理目标与经验观察到的用户动机相一致,从而能够研究网络结构和群体形成是如何从个人决策中产生的。我们的实验表明,教练LLM代理开发稳定的互动模式,形成新兴的社会关系,产生的网络结构,反映真实的在线社区的属性。通过将行为奖励与情境适应相结合,我们的框架建立了一个原则性的测试平台,用于调查LLM人群中的集体动态,并揭示了人工代理如何近似或偏离人类的社会行为。
摘要:Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior -- shaped by homophily, reciprocity, and social validation -- and what memory and learning mechanisms enable such dynamics to emerge? We present a multi-agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in-context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self-presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision-making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in-context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human-like social behavior.
【70】Knowledge and Common Knowledge of Strategies
标题:策略的知识与共同知识
链接:https://arxiv.org/abs/2510.19298
摘要:大多数现有的策略推理工作简单地采用知情(informed)或不知情(uninformed)语义。我们提出了一个可以在细粒度层面指定策略知识的模型。特别地,该模型可以区分策略的一阶知识、高阶知识和共同知识。我们通过研究游戏《花火》(Hanabi)来说明策略的高阶知识的作用。此外,我们证明策略的共同知识是解决共识问题所必需的。最后,我们研究了模型检测问题的可判定性。
摘要:Most existing work on strategic reasoning simply adopts either an informed or an uninformed semantics. We propose a model where knowledge of strategies can be specified on a fine-grained level. In particular, it is possible to distinguish first-order, higher-order, and common knowledge of strategies. We illustrate the effect of higher-order knowledge of strategies by studying the game Hanabi. Further, we show that common knowledge of strategies is necessary to solve the consensus problem. Finally, we study the decidability of the model checking problem.
【71】Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
标题:通过大数据与集成Few-Shot学习增强早期阿尔茨海默病检测
链接:https://arxiv.org/abs/2510.19282
摘要:阿尔茨海默病是一种严重的脑部疾病,会损害大脑的各个区域,并导致记忆力受损。标记的医疗数据的有限可用性对准确的阿尔茨海默病检测提出了重大挑战。考虑到标记数据的稀缺性、疾病的复杂性以及与数据隐私相关的约束,迫切需要有效的方法来提高阿尔茨海默病检测的准确性。为了应对这一挑战,我们的研究在Few-Shot学习(FSL)和集成学习的框架内以预训练卷积神经网络(CNN)的形式利用了大数据的力量。我们提出了一种基于原型网络(ProtoNet)的集成方法,这是FSL中的一种强大方法,将各种预训练的CNN集成为编码器。这种集成增强了从医学图像中提取的特征的丰富性。我们的方法还包括类别感知损失和熵损失的组合,以确保对阿尔茨海默病进展水平进行更精确的分类。使用两个数据集,Kaggle Alzheimer数据集和ADNI数据集评估了我们方法的有效性,分别达到了99.72%和99.86%的准确率。我们的结果与相关的最先进的研究的比较表明,我们的方法实现了卓越的准确性,并强调了其在早期阿尔茨海默病检测中的真实应用的有效性和潜力。
摘要:Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
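The ProtoNet classification step at the heart of the ensemble above can be sketched as follows. This is a minimal NumPy illustration of distance-to-prototype classification only; the pretrained CNN encoders, the ensembling, and the paper's class-aware/entropy losses are omitted, and all names and data are illustrative:

```python
import numpy as np

def prototype_classify(support, support_labels, queries, n_classes):
    """Assign each query embedding to the class whose prototype (the mean of
    that class's support embeddings) is nearest in Euclidean distance."""
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(n_classes)])
    # Pairwise distances: queries (Q, D) vs prototypes (C, D) -> (Q, C)
    d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)

# Toy 2-D "embeddings" standing in for encoder outputs
support = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.05, 0.1], [4.9, 5.2]])
print(prototype_classify(support, labels, queries, 2))  # [0 1]
```

In the ensemble setting, each pretrained CNN encoder would produce its own embedding space and prototypes, with predictions combined across encoders.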
【72】Social World Model-Augmented Mechanism Design Policy Learning
标题:社会世界模型增强的机制设计策略学习
链接:https://arxiv.org/abs/2510.19270
摘要:设计适应性机制来协调个人和集体利益仍然是人工社会智能的核心挑战。现有的方法通常难以对具有持久潜在特征(例如,技能,偏好)和处理复杂的多智能体系统动态。由于昂贵的真实世界相互作用,对高采样效率的迫切需求加剧了这些挑战。世界模型,通过学习预测环境动态,提供了一个有前途的途径,以加强机制设计的异质性和复杂的系统。在本文中,我们介绍了一种新的方法命名为SWM-AP(社会世界模型增强机制设计策略学习),它学习一个社会世界模型分层建模代理的行为,以提高机制设计。具体来说,社会世界模型推断代理的互动轨迹的特质,并学习基于特质的模型来预测代理的响应部署的机制。机制设计策略通过与社会世界模型交互来收集广泛的训练轨迹,同时在现实世界交互期间在线推断代理的特征,以进一步提高策略学习效率。在不同环境(税收政策设计,团队协调和设施位置)中的实验表明,SWM-AP在累积奖励和样本效率方面优于已建立的基于模型和无模型的RL基线。
摘要:Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi-agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real-world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically modeling agents' behavior to enhance mechanism design. Specifically, the social world model infers agents' traits from their interaction trajectories and learns a trait-based model to predict agents' responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents' traits online during real-world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency.
【73】LAPRAD: LLM-Assisted PRotocol Attack Discovery
标题:LAPRAD:LLM辅助的协议攻击发现
链接:https://arxiv.org/abs/2510.19264
备注:IFIP Networking 2025 Proceedings (Accepted on 05.05.2025)
摘要:为了提高互联网协议的安全性,我们寻求更快的半自动方法来发现DNS,BGP等协议中的新漏洞。为此,我们引入了LLM-Assisted Protocol Attack Discovery(LAPRAD)方法,使具有一定DNS知识的安全研究人员能够有效地发现难以检测的漏洞。 LAPRAD遵循一个三阶段过程。首先,我们咨询了一个LLM(GPT-o 1),该LLM已经在广泛的DNS相关源和以前的DDoS攻击语料库上进行了训练,以识别潜在的漏洞。在第二阶段,不同的LLM使用通过LangChain(DNS区域文件生成)实现的ReACT方法自动构建相应的攻击配置。最后,在第三阶段,我们验证了攻击的功能和有效性。 使用LAPRAD,我们发现了DNS协议上的三种新的DDoS攻击,并重新发现了两种最近报告的未包含在LLM训练数据中的攻击。第一种新的攻击采用诱饵和开关技术来欺骗解析器缓存大型伪造的DNSSEC RRSIG,将其服务容量降低到6%。第二个漏洞利用了具有多个密钥的大型DNSSEC加密算法(RSA-4096),从而绕过了最近实施的默认RRSet限制。第三种方法利用ANY型反应来产生类似的效果。 这些缓存刷新DDoS攻击的变体称为SigCacheFlush,可以规避现有的补丁,严重降低解析器查询容量,并影响主要DNS解析器实现的最新版本。
摘要:With the goal of improving the security of Internet protocols, we seek faster, semi-automatic methods to discover new vulnerabilities in protocols such as DNS, BGP, and others. To this end, we introduce the LLM-Assisted Protocol Attack Discovery (LAPRAD) methodology, enabling security researchers with some DNS knowledge to efficiently uncover vulnerabilities that would otherwise be hard to detect. LAPRAD follows a three-stage process. In the first, we consult an LLM (GPT-o1) that has been trained on a broad corpus of DNS-related sources and previous DDoS attacks to identify potential exploits. In the second stage, a different LLM automatically constructs the corresponding attack configurations using the ReACT approach implemented via LangChain (DNS zone file generation). Finally, in the third stage, we validate the attack's functionality and effectiveness. Using LAPRAD, we uncovered three new DDoS attacks on the DNS protocol and rediscovered two recently reported ones that were not included in the LLM's training data. The first new attack employs a bait-and-switch technique to trick resolvers into caching large, bogus DNSSEC RRSIGs, reducing their serving capacity to as little as 6%. The second exploits large DNSSEC encryption algorithms (RSA-4096) with multiple keys, thereby bypassing a recently implemented default RRSet limit. The third leverages ANY-type responses to produce a similar effect. These variations of a cache-flushing DDoS attack, called SigCacheFlush, circumvent existing patches, severely degrade resolver query capacity, and impact the latest versions of major DNS resolver implementations.
【74】An Argumentative Explanation Framework for Generalized Reason Model with Inconsistent Precedents
标题:针对含不一致先例的广义理由模型的论证解释框架
链接:https://arxiv.org/abs/2510.19263
备注:10 pages, extended version for JURIX 2025 submission
摘要:在人工智能与法律领域,先例约束是基于案例推理的基础之一。它通常假定底层的先例集合必须是一致的。为了放松这一假设,研究者引入了理由模型的广义概念。虽然针对基于传统一致理由模型的先例推理已有多种论证解释方法,但对于这种容纳不一致先例的广义推理框架,尚无相应的论证解释方法。为此,本文研究了推导状态论证框架(DSA框架)的一个扩展,以解释基于广义理由模型概念的推理。
摘要:Precedential constraint is one foundation of case-based reasoning in AI and Law. It generally assumes that the underlying set of precedents must be consistent. To relax this assumption, a generalized notion of the reason model has been introduced. While several argumentative explanation approaches exist for reasoning with precedents based on the traditional consistent reason model, there has been no corresponding argumentative explanation method developed for this generalized reasoning framework accommodating inconsistent precedents. To address this question, this paper examines an extension of the derivation state argumentation framework (DSA-framework) to explain the reasoning according to the generalized notion of the reason model.
【75】ChatGPT Unveils Its Limits: Principles of Law Deliver Checkmate
标题:ChatGPT揭示其局限性:法律原则将其将死
链接:https://arxiv.org/abs/2510.19261
摘要:本研究通过法律领域的实验检验ChatGPT的性能。我们不仅对照人类表现进行评估,还使用正则表达式(Regex)基线比较结果。研究表明,即使ChatGPT能够获得必要的知识和能力,它也无法将这些知识组装起来并进行推理,从而得出详尽的结果。这揭示了ChatGPT的一个重大局限。智能包括分解复杂问题、依据多项所需能力加以解决并给出统一而全面的解决方案的能力。在法律领域,最关键的任务之一是阅读法律判决并提取凝练自法律原则(PoLs)的关键段落,这些段落随后会被法官纳入后续裁决,或被律师纳入辩护文件。在执行这项任务时,人工智能缺乏全局性的理解和推理,这使其存在固有局限。真正的智能,至少在这一特定领域,仍然是人类独有的特质。
摘要:This study examines the performance of ChatGPT with an experiment in the legal domain. We compare the outcome with it a baseline using regular expressions (Regex), rather than focusing solely on the assessment against human performance. The study reveals that even if ChatGPT has access to the necessary knowledge and competencies, it is unable to assemble them, reason through, in a way that leads to an exhaustive result. This unveils a major limitation of ChatGPT. Intelligence encompasses the ability to break down complex issues and address them according to multiple required competencies, providing a unified and comprehensive solution. In the legal domain, one of the most crucial tasks is reading legal decisions and extracting key passages condensed from principles of law (PoLs), which are then incorporated into subsequent rulings by judges or defense documents by lawyers. In performing this task, artificial intelligence lacks an all-encompassing understanding and reasoning, which makes it inherently limited. Genuine intelligence, remains a uniquely human trait, at least in this particular field.
【76】FnRGNN: Distribution-aware Fairness in Graph Neural Network
标题:FnRGNN:图神经网络中的分布感知公平性
链接:https://arxiv.org/abs/2510.19257
备注:None
摘要:图神经网络(GNN)擅长从结构化数据中学习,但回归任务的公平性仍有待探索。现有的方法主要针对分类和表示级去偏,不能完全解决节点级回归的连续性。我们提出了FnRGNN,这是一个基于GNN的节点回归的公平性感知处理框架,它在三个级别上进行干预:(i)结构级边缘重新加权,(ii)通过MMD进行表示级对齐,以及(iii)通过基于Sinkhorn的分布匹配进行预测级归一化。这种多级策略确保了复杂图拓扑下的鲁棒公平性。在四个真实数据集上的实验表明,FnRGNN在不牺牲性能的情况下减少了群体差异。代码可在https://github.com/sybeam27/FnRGNN上获得。
摘要:Graph Neural Networks (GNNs) excel at learning from structured data, yet fairness in regression tasks remains underexplored. Existing approaches mainly target classification and representation-level debiasing, which cannot fully address the continuous nature of node-level regression. We propose FnRGNN, a fairness-aware in-processing framework for GNN-based node regression that applies interventions at three levels: (i) structure-level edge reweighting, (ii) representation-level alignment via MMD, and (iii) prediction-level normalization through Sinkhorn-based distribution matching. This multi-level strategy ensures robust fairness under complex graph topologies. Experiments on four real-world datasets demonstrate that FnRGNN reduces group disparities without sacrificing performance. Code is available at https://github.com/sybeam27/FnRGNN.
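Of the three intervention levels, the representation-level MMD alignment is the easiest to sketch. Below is a generic biased RBF-kernel MMD² estimator between two demographic groups' node representations — an assumed formulation for illustration, not FnRGNN's exact penalty:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel, measuring how far apart two groups' representations are."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

g0 = np.array([[0.0], [0.1], [0.2]])        # group 0 node representations
g1_near = np.array([[0.05], [0.15], [0.25]])  # similar distribution
g1_far = np.array([[3.0], [3.1], [3.2]])      # shifted distribution
print(rbf_mmd2(g0, g1_far) > rbf_mmd2(g0, g1_near))  # True
```

Minimizing such a term during training pushes the two groups' representation distributions together, which is the intent of the representation-level intervention described above.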
【77】See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
标题:查看、思考、行动:使用VLM代理进行在线购物者行为模拟
链接:https://arxiv.org/abs/2510.19245
摘要:LLM最近在模拟在线购物者行为方面表现出强大的潜力。先前的工作通过将SFT应用于带有LLM生成理由的动作轨迹来改进动作预测,并利用RL进一步增强推理能力。尽管取得了这些进展,当前的方法仍然依赖基于文本的输入,忽视了视觉感知在Web GUI交互期间塑造人类决策的重要作用。在本文中,我们利用OPeRA数据集,研究了通过VLM将视觉信息(特别是网页截图)整合到行为模拟中的方法。通过将智能体的决策同时建立在文本和视觉模态之上,我们旨在缩小合成智能体与现实世界用户之间的差距,从而实现对在线购物行为的更符合认知的模拟。具体而言,我们采用SFT进行联合动作预测和理由生成,以完整的交互上下文为条件,其中包括动作历史、过去的HTML观察和当前的网页截图。为了进一步增强推理能力,我们将强化学习与分层奖励结构相结合,并按难度感知因子进行缩放,以优先考虑具有挑战性的决策点。实证研究表明,引入视觉基础带来了实质性收益:文本与图像输入的组合比纯文本输入的精确匹配准确率提高了6%以上。这些结果表明,多模态基础不仅提升了预测准确性,还提高了视觉复杂环境中的模拟保真度,捕捉到了纯文本智能体经常错失的人类注意力与决策的细微差别。最后,我们重新审视了行为模拟框架的设计空间,指出关键的方法局限,并提出了构建高效且有效的人类行为模拟器的未来研究方向。
摘要:LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context, which comprises action history, past HTML observations, and the current webpage screenshot. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor that prioritizes challenging decision points. Empirically, our studies show that incorporating visual grounding yields substantial gains: the combination of text and image inputs improves exact match accuracy by more than 6% over text-only inputs. These results indicate that multi-modal grounding not only boosts predictive accuracy but also enhances simulation fidelity in visually complex environments, which captures nuances of human attention and decision-making that text-only agents often miss. Finally, we revisit the design space of behavior simulation frameworks, identify key methodological limitations, and propose future research directions toward building efficient and effective human behavior simulators.
【78】SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes
标题:SPOT:基于树的马尔可夫决策过程可扩展策略优化
链接:https://arxiv.org/abs/2510.19241
摘要:可解释的强化学习策略对于高风险决策至关重要,但在马尔可夫决策过程(MDP)中优化决策树策略仍然具有挑战性。我们提出了SPOT,一种新的方法来计算决策树的政策,它制定了一个混合整数线性规划(MILP)的优化问题。为了提高效率,我们采用了减少空间的分支定界的方法,从树结构的约束,使高效的并行搜索的MDP动态。与以前的方法相比,这显著提高了运行时间和可扩展性。我们的方法确保每次迭代都产生最优决策树。标准基准测试的实验结果表明,SPOT实现了大幅加速和规模更大的MDP具有显着更高的状态数。由此产生的决策树策略是可解释的和紧凑的,在不影响性能的情况下保持透明度。这些结果表明,我们的方法同时实现了可解释性和可扩展性,提供高质量的政策比现有的方法快一个数量级。
摘要:Interpretable reinforcement learning policies are essential for high-stakes decision-making, yet optimizing decision tree policies in Markov Decision Processes (MDPs) remains challenging. We propose SPOT, a novel method for computing decision tree policies, which formulates the optimization problem as a mixed-integer linear program (MILP). To enhance efficiency, we employ a reduced-space branch-and-bound approach that decouples the MDP dynamics from tree-structure constraints, enabling efficient parallel search. This significantly improves runtime and scalability compared to previous methods. Our approach ensures that each iteration yields the optimal decision tree. Experimental results on standard benchmarks demonstrate that SPOT achieves substantial speedup and scales to larger MDPs with a significantly higher number of states. The resulting decision tree policies are interpretable and compact, maintaining transparency without compromising performance. These results demonstrate that our approach simultaneously achieves interpretability and scalability, delivering high-quality policies an order of magnitude faster than existing approaches.
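SPOT's MILP formulation is beyond a short sketch, but the object it optimizes — a fixed decision-tree policy on a tabular MDP — has a closed-form value, since fixing the policy turns the MDP into a Markov chain. A minimal illustration with a toy MDP and illustrative names (not the paper's solver):

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Exact policy evaluation V = (I - gamma * P_pi)^(-1) R_pi for a fixed
    deterministic policy (e.g. one induced by a decision tree) on a tabular
    MDP. P has shape (A, S, S); R has shape (A, S); policy maps state -> action."""
    n = P.shape[1]
    P_pi = P[policy, np.arange(n)]   # transition matrix of the induced chain
    R_pi = R[policy, np.arange(n)]   # reward per state under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Toy 2-state MDP: action 0 stays put (reward 1), action 1 swaps states (reward 0)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[1.0, 1.0],
              [0.0, 0.0]])
policy = np.array([0, 0])             # a "tree" that always chooses action 0
print(evaluate_policy(P, R, policy))  # [10. 10.] since V = 1 / (1 - 0.9)
```

SPOT's contribution is searching over the space of such tree-induced policies with branch-and-bound guarantees; this snippet only shows how any one candidate tree is scored.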
【79】WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
标题:WebGraphEval:基于图表示的Web代理多轮轨迹评估
链接:https://arxiv.org/abs/2510.19205
备注:39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Multi-Turn Interactions in Large Language Models
摘要:目前对Web代理的评估在很大程度上被简化为二元成功指标或与单一参考轨迹的符合程度,忽略了基准数据集中存在的结构多样性。我们提出了WebGraphEval,一个将来自多个代理的轨迹抽象为统一加权动作图的框架。这种表示与WebArena等基准直接兼容,可利用排行榜运行结果和新收集的轨迹,而无需修改环境。该框架对动作进行规范化编码,合并重复出现的行为,并应用包括奖励传播和成功加权边统计在内的结构分析。对来自六个Web代理的数千条轨迹的评估表明,这种图抽象能够捕捉跨模型的规律性,凸显冗余与低效,并识别被基于结果的指标所忽略的关键决策点。通过将Web交互构建为图结构数据,WebGraphEval建立了一种面向多路径、跨代理和效率感知的Web代理评估的通用方法。
摘要:Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.
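The unified weighted action graph can be approximated in a few lines: merge every consecutive action pair across trajectories into an edge, counting occurrences and how often the edge lies on a successful run (the success-weighted edge statistics mentioned above). A toy sketch with made-up actions, not the benchmark's canonical encoding:

```python
from collections import defaultdict

def build_action_graph(trajectories):
    """Merge agent trajectories into a weighted action graph. Each edge maps
    to (occurrence count, fraction of occurrences on successful runs)."""
    counts = defaultdict(int)
    successes = defaultdict(int)
    for actions, success in trajectories:
        for a, b in zip(actions, actions[1:]):
            counts[(a, b)] += 1
            successes[(a, b)] += int(success)
    return {e: (c, successes[e] / c) for e, c in counts.items()}

trajs = [
    (["open", "search", "click", "buy"], True),
    (["open", "search", "buy"], True),
    (["open", "click", "buy"], False),
]
graph = build_action_graph(trajs)
print(graph[("open", "search")])  # (2, 1.0): seen twice, always on successful runs
print(graph[("click", "buy")])    # (2, 0.5): seen twice, successful once
```

Edges with high counts but low success fractions would surface exactly the redundant or inefficient behaviors the framework is designed to expose.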
【80】An Active Diffusion Neural Network for Graphs
标题:图的主动扩散神经网络
链接:https://arxiv.org/abs/2510.19202
摘要:热扩散的类比增强了我们对图中信息流的理解,并启发了图神经网络(GNN)的发展。然而,大多数基于扩散的GNN模拟被动热扩散,这仍然受到过度平滑的影响,并限制了它们捕获全局图形信息的能力。受宇宙热寂的启发,假设能量分布在封闭系统中随着时间的推移变得均匀,我们认识到,在没有外部输入的情况下,随着扩散的进行,图中的节点表示收敛到相同的特征向量。为了解决这个问题,我们提出了基于主动扩散的图神经网络(ADGNN)。ADGNN通过整合动态影响扩散过程的多个外部信息源实现主动扩散,有效克服了过平滑问题。此外,我们的方法实现了真正的无限扩散,直接计算的封闭形式的积极扩散迭代公式的解决方案。这允许节点保留其独特的特征,同时有效地获得对图的全局结构的全面洞察。我们针对各种图形任务中的几种最先进的GNN模型评估ADGNN。结果表明,ADGNN显着提高了准确性和效率,突出了其在捕获全局图信息和维护节点的独特性的有效性。
摘要:The analogy to heat diffusion has enhanced our understanding of information flow in graphs and inspired the development of Graph Neural Networks (GNNs). However, most diffusion-based GNNs emulate passive heat diffusion, which still suffers from over-smoothing and limits their ability to capture global graph information. Inspired by the heat death of the universe, which posits that energy distribution becomes uniform over time in a closed system, we recognize that, without external input, node representations in a graph converge to identical feature vectors as diffusion progresses. To address this issue, we propose the Active Diffusion-based Graph Neural Network (ADGNN). ADGNN achieves active diffusion by integrating multiple external information sources that dynamically influence the diffusion process, effectively overcoming the over-smoothing problem. Furthermore, our approach realizes true infinite diffusion by directly calculating the closed-form solution of the active diffusion iterative formula. This allows nodes to preserve their unique characteristics while efficiently gaining comprehensive insights into the graph's global structure. We evaluate ADGNN against several state-of-the-art GNN models across various graph tasks. The results demonstrate that ADGNN significantly improves both accuracy and efficiency, highlighting its effectiveness in capturing global graph information and maintaining node distinctiveness.
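The abstract's "closed-form solution of the active diffusion iterative formula" can be illustrated with the generic fixed point of a diffusion recursion that keeps re-injecting an external signal. The recursion and coefficients below are assumptions for illustration (a personalized-PageRank-style propagation), not ADGNN's actual formula:

```python
import numpy as np

def closed_form_diffusion(A, X, alpha=0.1):
    """Fixed point of X_{t+1} = (1 - alpha) * A_norm @ X_t + alpha * X,
    i.e. the limit of infinitely many diffusion steps, computed directly:
    X* = alpha * (I - (1 - alpha) * A_norm)^(-1) @ X."""
    A_norm = A / A.sum(axis=1, keepdims=True)   # row-normalized adjacency
    n = A.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1 - alpha) * A_norm, X)

# Triangle graph with 2-D node features
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X_star = closed_form_diffusion(A, X)
```

The re-injection of X at every step is what prevents the representations from collapsing to a uniform vector — the "heat death" failure mode the abstract describes for purely passive diffusion.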
【81】Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
标题:重新思考驾驶世界模型作为感知任务的合成数据生成器
链接:https://arxiv.org/abs/2510.19195
摘要:驾驶世界模型的最新进展使高质量RGB视频或多模式视频的可控生成成为可能。现有的方法主要集中在发电质量和可控性相关的指标。然而,他们往往忽视了下游感知任务的评估,这对自动驾驶的性能至关重要。现有的方法通常利用一种训练策略,该策略首先对合成数据进行预训练,然后对真实数据进行微调,从而导致与基线(仅真实数据)相比的时间段增加一倍。当我们将基线中的时期加倍时,合成数据的好处变得微不足道。为了充分展示合成数据的好处,我们介绍了Dream 4Drive,这是一种新的合成数据生成框架,旨在增强下游感知任务。Dream 4Drive首先将输入视频分解为几个3D感知的指导地图,然后将3D资产渲染到这些指导地图上。最后,对驾驶世界模型进行微调,以产生编辑后的多视图真实感视频,这些视频可用于训练下游感知模型。Dream 4Drive在大规模生成多视图角落案例方面实现了前所未有的灵活性,显著提升了自动驾驶中的角落案例感知。为了方便未来的研究,我们还贡献了一个名为DriveObj 3D的大规模3D资产数据集,涵盖了驾驶场景中的典型类别,并实现了多样化的3D感知视频编辑。我们进行了全面的实验,以表明Dream 4Drive可以有效地提高下游感知模型在各种训练时期下的性能。项目:$\href{https://wm-research.github.io/Dream4Drive/}{this\ https\ URL}$
摘要:Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are $\mathbf{really\ crucial}$ for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: $\href{https://wm-research.github.io/Dream4Drive/}{this\ https\ URL}$
【82】PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
标题:PruneHal:通过自适应KV缓存修剪减少多模式大型语言模型中的幻觉
链接:https://arxiv.org/abs/2510.19183
摘要:虽然多模态大型语言模型(MLLM)近年来取得了重大进展,但幻觉问题仍然是一个重大挑战。为了减轻这种现象,现有的解决方案或者引入用于进一步训练的附加数据,或者在推理期间并入外部或内部信息。然而,这些方法不可避免地引入额外的计算成本。在本文中,我们观察到MLLM中的幻觉与分配给视觉标记的注意力不足密切相关。特别是,冗余视觉标记的存在分散了模型的注意力,使其无法专注于信息量最大的标记。因此,关键的视觉线索往往得不到重视,这反过来又加剧了幻觉的发生。在此观察的基础上,我们提出\textbf{PruneHal},这是一种无需训练的简单而有效的方法,它利用自适应KV缓存修剪来增强模型对关键视觉信息的关注,从而减轻幻觉。据我们所知,我们是第一个将令牌修剪应用于MLLM中的幻觉缓解的人。值得注意的是,我们的方法不需要额外的训练,并且几乎不会产生额外的推理成本。此外,PruneHal是模型不可知的,可以与不同的解码策略无缝集成,包括那些专门为减轻幻觉而设计的解码策略。我们使用四种主流的MLLM在几种广泛使用的幻觉评估基准上评估PruneHal,取得了强大而出色的结果,突出了我们方法的有效性和优越性。我们的代码将公开发布。
摘要:While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose \textbf{PruneHal}, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method don't require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
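A minimal sketch of attention-guided KV cache pruning: drop the visual tokens that receive the least attention, keeping the cache focused on informative ones. The interface and the fixed keep-ratio here are assumptions for illustration; PruneHal's adaptive criterion is more involved:

```python
import numpy as np

def prune_kv_cache(keys, values, attn_scores, keep_ratio=0.5):
    """Keep only the visual tokens receiving the highest attention scores
    (hypothetical interface; one entry per cached visual token)."""
    k = max(1, int(len(attn_scores) * keep_ratio))
    keep = np.argsort(attn_scores)[-k:]   # indices of the top-k tokens
    keep.sort()                           # preserve positional order
    return keys[keep], values[keep]

keys = np.arange(8).reshape(4, 2).astype(float)   # 4 tokens, head dim 2
values = keys.copy()
scores = np.array([0.1, 0.5, 0.05, 0.35])         # attention mass per token
k2, v2 = prune_kv_cache(keys, values, scores, keep_ratio=0.5)
print(k2.tolist())  # [[2.0, 3.0], [6.0, 7.0]] -- tokens 1 and 3 kept
```

Because this only shrinks the cache, it composes with any decoding strategy, consistent with the abstract's claim of near-zero extra inference cost.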
【83】Interpretable Question Answering with Knowledge Graphs
标题:利用知识图进行可解释问题解答
链接:https://arxiv.org/abs/2510.19181
备注:None
摘要:本文提出了一个问题回答系统,专门操作的知识图检索,而不依赖于检索增强生成(RAG)与大型语言模型(LLM)。相反,一个小的释义器模型被用来释义从查询知识图检索到的实体关系边。拟议的管道分为两个主要阶段。第一阶段涉及预处理文档以生成问答(QA)对集。第二阶段将这些QA转换为知识图,使用嵌入和模糊技术进行基于图的检索。对图进行查询、重新排序和解释,以生成最终答案。这项工作包括在CRAG基准测试中使用LLM作为裁判进行评估,使用LLAMA-3.2和GPT-3.5-Turbo的准确率分别为71.9%和54.4%。
摘要:This paper presents a question answering system that operates exclusively on a knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
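The graph-based retrieval stage can be caricatured with token-overlap scoring of entity-relation edges against the question. Below is a toy stand-in (Jaccard similarity over tokens) for the paper's embedding-plus-fuzzy retrieval; the triples and scoring are illustrative:

```python
def retrieve_edges(triples, query, top_k=1):
    """Rank (subject, relation, object) edges by Jaccard token overlap
    with the query -- a toy stand-in for embedding + fuzzy retrieval."""
    q = set(query.lower().replace("?", "").split())
    def score(t):
        toks = set(" ".join(t).lower().replace("_", " ").split())
        return len(toks & q) / len(toks | q)
    return sorted(triples, key=score, reverse=True)[:top_k]

kg = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "member_of", "EU"),
]
print(retrieve_edges(kg, "What is the capital of France?"))
# [('Paris', 'capital_of', 'France')]
```

In the described pipeline, the retrieved edges would then be re-ranked and handed to the small paraphraser model to produce the final natural-language answer.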
【84】Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
标题:多任务LLM强化学习后训练中的梯度不平衡
链接:https://arxiv.org/abs/2510.19178
摘要:大型语言模型(LLM)的多任务后训练通常通过混合来自不同任务的数据集并联合优化它们来执行。这种方法隐含地假设所有任务都贡献了相似大小的梯度;当这个假设失败时,优化就会偏向于大梯度任务。然而,在本文中,我们证明了这一假设在RL后训练中失败:某些任务会产生明显更大的梯度,从而使更新偏向于这些任务。只有当较大的梯度意味着任务上的较大学习增益时,这种梯度不平衡才是合理的(即,更大的性能改进)--但我们发现这不是真的。大梯度任务可以实现与小梯度任务相似甚至更低的学习增益。进一步的分析表明,这些梯度不平衡不能用典型的训练统计数据(如训练奖励或优势)来解释,这表明它们是由任务之间的固有差异引起的。这警告天真的数据集混合,并呼吁未来的工作原则梯度水平校正LLM。
摘要:Multi-task post-training of large language models (LLMs) is typically performed by mixing datasets from different tasks and optimizing them jointly. This approach implicitly assumes that all tasks contribute gradients of similar magnitudes; when this assumption fails, optimization becomes biased toward large-gradient tasks. In this paper, however, we show that this assumption fails in RL post-training: certain tasks produce significantly larger gradients, thus biasing updates toward those tasks. Such gradient imbalance would be justified only if larger gradients implied larger learning gains on the tasks (i.e., larger performance improvements) -- but we find this is not true. Large-gradient tasks can achieve similar or even much lower learning gains than small-gradient ones. Further analyses reveal that these gradient imbalances cannot be explained by typical training statistics such as training rewards or advantages, suggesting that they arise from the inherent differences between tasks. This cautions against naive dataset mixing and calls for future work on principled gradient-level corrections for LLMs.
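The imbalance the paper measures can be reproduced diagnostically by comparing per-task gradient norms before mixing updates. A minimal sketch with stand-in gradient arrays; in real usage the gradients would be collected from the model's parameters per task batch:

```python
import numpy as np

def per_task_grad_norms(grads_by_task):
    """Compute the overall gradient norm for each task and the max/min ratio;
    a large ratio signals the imbalance that biases naively mixed updates."""
    norms = {t: float(np.linalg.norm(np.concatenate([g.ravel() for g in gs])))
             for t, gs in grads_by_task.items()}
    ratio = max(norms.values()) / min(norms.values())
    return norms, ratio

grads = {
    "math":  [np.full((2, 2), 1.0)],   # stand-in per-parameter gradients
    "logic": [np.full((2, 2), 0.1)],
}
norms, ratio = per_task_grad_norms(grads)
print(round(ratio, 1))  # 10.0 -- "math" updates dominate a mixed batch
```

The paper's point is that such a ratio cannot simply be corrected by reward or advantage statistics, since large gradients do not imply large learning gains.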
【85】The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models
标题:零步思维:模式选择作为推理模型中较难早期退出的实证研究
链接:https://arxiv.org/abs/2510.19176
备注:Accepted by NeurIPS'25 Efficient Reasoning Workshop
摘要:推理模型在数学和逻辑推理等任务中表现出卓越的性能,这主要得益于其在推理过程中进行逐步思考的能力。然而,这往往导致过度思考,造成不必要的计算开销。为了解决这个问题,模式选择(Mode Selection)旨在通过选用思考(Thinking)或非思考(NoThinking)模式,自动在长CoT(思维链)与短CoT之间做出决定。与此同时,提前退出(Early Exit)则确定迭代推理过程中的最佳停止点。两种方法都力求降低计算负担。在本文中,我们首先将模式选择界定为提前退出问题的一个更具挑战性的变体:二者目标相似,但决策时机不同。提前退出侧重于在推理时为简明推理确定最佳停止点,而模式选择必须在推理过程开始时做出决定,依赖预定义的伪思考(fake thoughts)而不进行显式推理过程,这被称为零步思维。通过对九个基线的实证研究,我们观察到,基于提示的方法在仅获得最少人工构造信息时,常因其有限的分类能力而失败。相比之下,利用内部信息的方法在大多数场景中通常表现更好,但仍存在稳定性问题。我们的研究结果表明,在信息有限的场景下,仅依靠模型提供的信息不足以有效解决模式选择问题,这凸显了该任务的持续挑战。我们的代码可在https://github.com/Trae1ounG/Zero_Step_Thinking上获得。
摘要:Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
【86】When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
标题:当事实发生变化:通过evolveQA探索法学硕士知识的发展
链接:https://arxiv.org/abs/2510.19172
备注:Under submission
摘要:LLM通常无法处理时间知识冲突-当事实在其训练数据中随着时间的推移而演变时产生的矛盾。现有的研究通过建立在维基数据等结构化知识库上的基准来评估这种现象,但它们专注于广泛覆盖,易于记忆的流行实体,缺乏公平评估具有不同知识截止日期的LLM所需的动态结构。我们介绍了evolveQA,这是一个专门用于评估LLM随时间变化的知识的基准,它由3个真实世界的时间戳语料库构建:AWS更新,Azure更改和WHO疾病爆发报告。我们的框架识别自然发生的知识演变,并生成问题,并根据不同的LLM知识截止日期量身定制黄金答案。通过对3种知识探测格式的12个开源和闭源LLM的广泛评估,我们证明了与静态知识问题相比,evolveQA的性能下降高达31%。
摘要:LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
【87】X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
标题:X-Ego:通过跨自我中心对比视频表示学习获得团队级战术情景感知
链接:https://arxiv.org/abs/2510.19150
备注:8 pages, 5 figures
摘要:人类的团队战术源自每位球员的个人视角,以及他们预测、解释和适应队友意图的能力。虽然视频理解的进展改进了体育运动中团队互动的建模,但大多数现有工作依赖第三人称转播视角,忽视了多智能体学习同步、以自我为中心的本质。我们介绍了X-Ego-CS,这是一个基准数据集,由热门电子竞技游戏《反恐精英2》45场职业级比赛的124小时游戏画面组成,旨在促进复杂3D环境中多智能体决策的研究。X-Ego-CS提供跨自我中心的视频流,同步捕捉所有玩家的第一人称视角以及状态-动作轨迹。在此资源的基础上,我们提出了跨自我对比学习(CECL),它对齐队友的以自我为中心的视觉流,以从个人视角培养团队层面的战术态势感知。我们在队友-对手位置预测任务上评估CECL,结果表明它能借助最先进的视频编码器,有效提升智能体仅凭单一第一人称视角推断队友和对手位置的能力。X-Ego-CS和CECL共同为电子竞技中的跨自我中心多智能体基准测试奠定了基础。更广泛地说,我们的工作将游戏理解定位为多智能体建模和战术学习的测试平台,并对虚拟和现实世界领域的时空推理和人机协作产生影响。代码和数据集可在https://github.com/HATS-ICT/x-ego上获得。
摘要:Human team tactics emerge from each player's individual perspective and their ability to anticipate, interpret, and adapt to teammates' intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players' first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates' egocentric visual streams to foster team-level tactical situational awareness from an individual's perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent's ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. Code and dataset are available at https://github.com/HATS-ICT/x-ego.
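The heart of CECL is aligning synchronized egocentric streams across teammates so that paired views map to nearby embeddings. As background only, the sketch below shows a symmetric InfoNCE objective over paired frame embeddings; the function and array names are illustrative assumptions, not taken from the X-Ego codebase.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two batches of embeddings.

    z_a, z_b: (N, D) embeddings from two synchronized egocentric streams;
    row i of z_a and z_b form a positive pair, all other rows act as
    in-batch negatives.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (N, N) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (true pair) as the correct class.
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# Aligned pairs (small perturbation) should score a much lower loss
# than randomly matched pairs.
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=(8, 16)))
random_pairs = info_nce_loss(z, rng.normal(size=(8, 16)))
```

In a training loop the two batches would come from two teammates' view encoders at the same timestamps; here random vectors stand in for encoder outputs.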
【88】A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
标题:认知能力的多面分析:在CONSORT清单上评估大语言模型的提示方法
链接:https://arxiv.org/abs/2510.19139
摘要:尽管大型语言模型(LLM)在医疗保健领域迅速扩展,但这些系统依据CONSORT标准评估临床试验报告的能力仍不清楚,特别是在认知和推理策略方面。本研究采用行为与元认知分析方法,结合专家验证的数据,系统比较了三种提示条件下两个具有代表性的LLM。模型在处理各CONSORT条目的方式上出现了明显差异,而提示类型(包括推理风格的转变、显式的不确定性表达和替代解释)则塑造了响应模式。我们的结果凸显了这些系统在临床合规自动化方面的现有局限,并强调了理解其认知适应和策略行为对于开发更可解释、更可靠的医疗AI的重要性。
摘要:Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting according to CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached various CONSORT items, and prompt types, including shifts in reasoning style, explicit uncertainty, and alternative interpretations shaped response patterns. Our results highlight the current limitations of these systems in clinical compliance automation and underscore the importance of understanding their cognitive adaptations and strategic behavior in developing more explainable and reliable medical AI.
【89】InvarGC: Invariant Granger Causality for Heterogeneous Interventional Time Series under Latent Confounding
标题:InvarGC:潜在混杂下异质干预时间序列的不变Granger因果关系
链接:https://arxiv.org/abs/2510.19138
摘要:Granger因果关系被广泛应用于从多变量时间序列数据中发现复杂系统的因果结构。传统的基于线性模型的格兰杰因果关系检验往往无法检测出即使是轻微的非线性因果关系。因此,近年来有大量研究探索非线性Granger因果关系方法,并取得了较好的性能。然而,这些方法通常依赖于两个关键假设:因果充分性和已知的干预目标。因果充分性假设不存在潜在的混杂因素,但它们的存在可能会引入虚假的相关性。此外,真实世界的时间序列数据通常来自异构环境,没有干预的先验知识。因此,在实践中,很难区分干预环境和非干预环境,甚至更难确定哪些变量或时间步长受到影响。为了解决这些问题,我们提出了不变Granger因果关系(InvarGC),它利用跨环境的异质性来减轻潜在混杂的影响,并在边级粒度上区分干预和非干预环境,从而恢复不变的因果关系。此外,我们还建立了在这些条件下的可识别性。在合成数据集和真实数据集上的大量实验表明,与最先进的方法相比,我们的方法具有竞争力的性能。
摘要:Granger causality is widely used for causal structure discovery in complex systems from multivariate time series data. Traditional Granger causality tests based on linear models often fail to detect even mild non-linear causal relationships. Therefore, numerous recent studies have investigated non-linear Granger causality methods, achieving improved performance. However, these methods often rely on two key assumptions: causal sufficiency and known interventional targets. Causal sufficiency assumes the absence of latent confounders, yet their presence can introduce spurious correlations. Moreover, real-world time series data usually come from heterogeneous environments, without prior knowledge of interventions. Therefore, in practice, it is difficult to distinguish intervened environments from non-intervened ones, and even harder to identify which variables or timesteps are affected. To address these challenges, we propose Invariant Granger Causality (InvarGC), which leverages cross-environment heterogeneity to mitigate the effects of latent confounding and to distinguish intervened from non-intervened environments with edge-level granularity, thereby recovering invariant causal relations. In addition, we establish the identifiability under these conditions. Extensive experiments on both synthetic and real-world datasets demonstrate the competitive performance of our approach compared to state-of-the-art methods.
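The abstract contrasts InvarGC with the traditional linear-model Granger test it improves upon. As background, the sketch below implements that classical baseline: fit the target series from its own lags (restricted model) and from the lags of both series (full model), and score the relative drop in residual error. The function name and the variance-reduction score are illustrative, not from the paper.

```python
import numpy as np

def granger_gain(y, x, lag=2):
    """Variance-reduction score for 'x Granger-causes y' under a linear model.

    Returns (RSS_restricted - RSS_full) / RSS_restricted; values near 0
    suggest no linear Granger causality from x to y.
    """
    T = len(y)
    rows = range(lag, T)
    Y = np.array([y[t] for t in rows])
    own = np.array([[y[t - k] for k in range(1, lag + 1)] for t in rows])
    both = np.array([[y[t - k] for k in range(1, lag + 1)] +
                     [x[t - k] for k in range(1, lag + 1)] for t in rows])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        r = Y - X @ beta
        return float(r @ r)

    rss_restricted, rss_full = rss(own), rss(both)
    return (rss_restricted - rss_full) / rss_restricted

rng = np.random.default_rng(1)
x = rng.normal(size=500)                # driver series
z = rng.normal(size=500)                # independent series
y = np.zeros(500)
for t in range(2, 500):                 # y is driven by lagged x
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

gain_from_x = granger_gain(y, x)        # large: x helps predict y
gain_from_z = granger_gain(y, z)        # near zero: z does not
```

This is exactly the kind of test the paper notes can miss non-linear effects and can be fooled by latent confounders, motivating InvarGC.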
【90】A Cross-Environment and Cross-Embodiment Path Planning Framework via a Conditional Diffusion Model
标题:基于条件扩散模型的跨环境、跨机器人本体路径规划框架
链接:https://arxiv.org/abs/2510.19128
备注:20 pages, 9 figures
摘要:机器人系统在高维杂乱环境中的路径规划需要高效、安全,并能适应不同的环境和硬件。传统方法计算时间长,且需要大量参数调整,而现有的基于学习的方法仍无法有效泛化。本研究的主要目标是开发一个无需再训练即可泛化到未见环境和新型机械臂的路径规划框架。我们提出了GADGET(可泛化且自适应的扩散引导环境感知轨迹生成),这是一种基于扩散的规划模型,以体素化场景表示以及起始和目标配置为条件生成关节空间轨迹。其关键创新是GADGET的混合双条件机制:将基于学习场景编码的无分类器引导与分类器引导的控制屏障函数(CBF)安全整形相结合,在去噪过程中直接集成环境感知与实时避障。该设计支持零样本迁移到新环境和新机器人本体而无需再训练。实验结果表明,GADGET在球形障碍物、垃圾箱拾取和货架环境中实现了高成功率和低碰撞强度,CBF引导进一步提高了安全性。此外,比较评估表明其性能优于基于采样和基于学习的基线。GADGET还可在Franka Panda、Kinova Gen3(6/7自由度)和UR5机器人之间迁移,在Kinova Gen3上的物理执行证明了其在现实环境中生成安全无碰撞轨迹的能力。
摘要:Path planning for a robotic system in high-dimensional cluttered environments needs to be efficient, safe, and adaptable for different environments and hardware. Conventional methods face high computation time and require extensive parameter tuning, while prior learning-based methods still fail to generalize effectively. The primary goal of this research is to develop a path planning framework capable of generalizing to unseen environments and new robotic manipulators without the need for retraining. We present GADGET (Generalizable and Adaptive Diffusion-Guided Environment-aware Trajectory generation), a diffusion-based planning model that generates joint-space trajectories conditioned on voxelized scene representations as well as start and goal configurations. A key innovation is GADGET's hybrid dual-conditioning mechanism that combines classifier-free guidance via learned scene encoding with classifier-guided Control Barrier Function (CBF) safety shaping, integrating environment awareness with real-time collision avoidance directly in the denoising process. This design supports zero-shot transfer to new environments and robotic embodiments without retraining. Experimental results show that GADGET achieves high success rates with low collision intensity in spherical-obstacle, bin-picking, and shelf environments, with CBF guidance further improving safety. Moreover, comparative evaluations indicate strong performance relative to both sampling-based and learning-based baselines. Furthermore, GADGET provides transferability across Franka Panda, Kinova Gen3 (6/7-DoF), and UR5 robots, and physical execution on a Kinova Gen3 demonstrates its ability to generate safe, collision-free trajectories in real-world settings.
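The hybrid dual-conditioning idea, classifier-free guidance plus a barrier-function safety gradient inside the denoising loop, has a simple general shape. The toy sketch below applies one such guided update to a 2D waypoint near a circular obstacle; it is a simplified stand-in, not GADGET's actual implementation, and all names, weights, and the hinge penalty form are assumptions.

```python
import numpy as np

def obstacle_grad(x, center=np.zeros(2), radius=1.0, margin=0.5):
    """Gradient of a hinge penalty max(0, radius + margin - dist(x, center)).

    Nonzero only within `margin` of the obstacle surface; subtracting it
    from x pushes the sample away from the obstacle (CBF-style shaping).
    """
    d = x - center
    dist = np.linalg.norm(d)
    if dist >= radius + margin:
        return np.zeros_like(x)
    return -d / max(dist, 1e-9)

def guided_step(x, eps_cond, eps_uncond, barrier_grad,
                w_cfg=2.0, w_cbf=0.5, step=0.1):
    """One illustrative guided denoising-style update.

    Classifier-free guidance amplifies the scene-conditional noise
    prediction; the barrier gradient nudges the sample toward the safe set.
    """
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    return x - step * eps - w_cbf * barrier_grad(x)

zero = np.zeros(2)  # stand-in noise predictions, to isolate the safety term
near = guided_step(np.array([0.8, 0.0]), zero, zero, obstacle_grad)  # pushed outward
far = guided_step(np.array([5.0, 0.0]), zero, zero, obstacle_grad)   # unchanged
```

In a real diffusion planner `eps_cond`/`eps_uncond` come from the denoiser network and the update runs once per denoising timestep over a whole joint-space trajectory.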
【91】Steering Autoregressive Music Generation with Recursive Feature Machines
标题:使用递归特征机引导自回归音乐生成
链接:https://arxiv.org/abs/2510.19127
摘要:可控音乐生成仍是一项重大挑战,现有方法通常需要重新训练模型或引入可闻伪影。我们引入了MusicRFM,这是一个将递归特征机(RFM)用于音乐领域的框架,通过直接引导冻结的预训练音乐模型的内部激活,实现细粒度、可解释的控制。RFM分析模型的内部梯度,以产生可解释的"概念方向",即激活空间中对应于音符或和弦等音乐属性的特定轴。我们首先训练轻量级RFM探针,在MusicGen的隐藏状态中发现这些方向;然后在推理过程中将它们注入模型,实时引导生成过程,而无需逐步优化。我们提出了实现这种控制的先进机制,包括动态的时变调度以及同时施加多个音乐属性的方法。我们的方法成功地在控制与生成质量之间取得平衡:我们可以将生成目标音符的准确率从0.23提高到0.82,同时文本提示遵循度与未引导基线的差距保持在约0.02以内,展示了在对提示保真度影响最小的情况下实现有效控制。我们发布了代码,以鼓励在音乐领域进一步探索RFM。
摘要:Controllable music generation remains a significant challenge, with existing methods often requiring model retraining or introducing audible artifacts. We introduce MusicRFM, a framework that adapts Recursive Feature Machines (RFMs) to enable fine-grained, interpretable control over frozen, pre-trained music models by directly steering their internal activations. RFMs analyze a model's internal gradients to produce interpretable "concept directions", or specific axes in the activation space that correspond to musical attributes like notes or chords. We first train lightweight RFM probes to discover these directions within MusicGen's hidden states; then, during inference, we inject them back into the model to guide the generation process in real-time without per-step optimization. We present advanced mechanisms for this control, including dynamic, time-varying schedules and methods for the simultaneous enforcement of multiple musical properties. Our method successfully navigates the trade-off between control and generation quality: we can increase the accuracy of generating a target musical note from 0.23 to 0.82, while text prompt adherence remains within approximately 0.02 of the unsteered baseline, demonstrating effective control with minimal impact on prompt fidelity. We release code to encourage further exploration on RFMs in the music domain.
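The steering mechanism described, adding a learned concept direction to a frozen model's hidden states at inference time, can be sketched in a few lines. The code below is an illustrative toy on random activations, not MusicGen internals; the function names and the strength parameter `alpha` are assumptions.

```python
import numpy as np

def steer_hidden_states(h, direction, alpha=4.0):
    """Add a unit-norm concept direction to every hidden state.

    h: (T, D) activations of one layer across T generation steps;
    direction: (D,) concept axis (e.g. found by an RFM-style probe);
    alpha: steering strength, trading control against prompt fidelity.
    """
    d = direction / np.linalg.norm(direction)
    return h + alpha * d

rng = np.random.default_rng(2)
h = rng.normal(size=(5, 32))            # stand-in layer activations
concept = rng.normal(size=32)           # stand-in "target note" direction
h_steered = steer_hidden_states(h, concept, alpha=4.0)

# The projection onto the concept axis rises by exactly alpha at every step,
# while all orthogonal components are left untouched.
unit = concept / np.linalg.norm(concept)
proj_before = h @ unit
proj_after = h_steered @ unit
```

A time-varying schedule, as in the paper, would simply make `alpha` a function of the generation step.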
【92】A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx
标题:使用具有注意力机制的U-Net模型和FedProx进行乳腺癌分割的新方法
链接:https://arxiv.org/abs/2510.19118
备注:None
摘要:乳腺癌是全世界妇女死亡的主要原因之一,凸显了早期发现和准确诊断的必要性。超声成像是一种可靠且具有成本效益的检查工具,但医疗数据的敏感性使得开发准确且保护隐私的人工智能模型颇具挑战。联邦学习是一个解决方案:它是在敏感医疗数据上进行分布式机器学习、同时保护患者隐私的一种有前景的技术。然而,在非独立同分布(non-IID)的本地数据集上训练会影响模型的准确性和泛化能力,而这对乳腺癌分割中准确的肿瘤边界划定至关重要。本研究旨在通过将联邦近端(FedProx)方法应用于非IID超声乳腺癌成像数据集来应对这一挑战。此外,我们通过引入带注意力机制的改进U-Net模型来提高肿瘤分割的准确性。我们的方法得到了一个准确率为96%的全局模型,证明了该方法在保护患者隐私的同时提升肿瘤分割准确性的有效性。我们的研究结果表明,FedProx有望成为在非IID本地医疗数据集上训练精确机器学习模型的一种有前景的方法。
摘要:Breast cancer is a leading cause of death among women worldwide, emphasizing the need for early detection and accurate diagnosis. As such Ultrasound Imaging, a reliable and cost-effective tool, is used for this purpose, however the sensitive nature of medical data makes it challenging to develop accurate and private artificial intelligence models. A solution is Federated Learning as it is a promising technique for distributed machine learning on sensitive medical data while preserving patient privacy. However, training on non-Independent and non-Identically Distributed (non-IID) local datasets can impact the accuracy and generalization of the trained model, which is crucial for accurate tumour boundary delineation in BC segmentation. This study aims to tackle this challenge by applying the Federated Proximal (FedProx) method to non-IID Ultrasonic Breast Cancer Imaging datasets. Moreover, we focus on enhancing tumour segmentation accuracy by incorporating a modified U-Net model with attention mechanisms. Our approach resulted in a global model with 96% accuracy, demonstrating the effectiveness of our method in enhancing tumour segmentation accuracy while preserving patient privacy. Our findings suggest that FedProx has the potential to be a promising approach for training precise machine learning models on non-IID local medical datasets.
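FedProx modifies each client's local objective by adding a proximal term (mu/2)·||w - w_global||² that discourages drift from the global model on non-IID data. Below is a minimal sketch on a toy quadratic client loss (all names and constants are illustrative, not from the paper's U-Net setup).

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu=0.1, lr=0.05, steps=50):
    """Local FedProx update: minimize f(w) + (mu/2) * ||w - w_global||^2.

    grad_fn(w) returns the gradient of the local loss f; the proximal
    term keeps the client's weights close to the global model, which
    stabilizes training on non-IID client data.
    """
    w = w_global.copy()
    for _ in range(steps):
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# Toy client whose local optimum (w = target) disagrees with the server.
target = np.array([2.0, -1.0])
grad_fn = lambda w: w - target          # gradient of 0.5 * ||w - target||^2
w_global = np.zeros(2)

w_small_mu = fedprox_local_update(w_global, grad_fn, mu=0.01)  # drifts to target
w_large_mu = fedprox_local_update(w_global, grad_fn, mu=2.0)   # stays near server
```

For this quadratic the minimizer is target/(1 + mu), so a larger mu provably pulls the client back toward the global weights.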
【93】That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
标题:那已被弃用!理解、检测与引导用于代码生成的语言模型中的知识冲突
链接:https://arxiv.org/abs/2510.19116
摘要:本文研究了大型语言模型(LLM)在其参数化知识与提示中包含的冲突信息存在差异时的行为。在先前问答(QA)研究的基础上,我们将知识冲突的调查扩展到代码生成领域。我们提出了一个用于构建和解释此类冲突的领域无关框架,以及一种针对代码冲突场景定制的新评估方法和数据集。我们的实验表明,足够大的LLM在其参数中编码了知识冲突的概念,使我们能够以高达80.65%的准确率检测知识冲突。基于这些洞见,我们表明,激活级引导相对随机基线可将引导成功率最多提高12.6%。然而,其有效性关键取决于模型大小、任务领域和引导方向之间的平衡。实验代码和数据将在论文录用后公开。
摘要:This paper investigates how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. Building on prior question-answering (QA) research, we extend the investigation of knowledge conflicts to the realm of code generation. We propose a domain-agnostic framework for constructing and interpreting such conflicts, along with a novel evaluation method and dataset tailored to code conflict scenarios. Our experiments indicate that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling us to detect knowledge conflicts with up to 80.65% accuracy. Building on these insights, we show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline. However, effectiveness depends critically on balancing model size, task domain, and steering direction. The experiment code and data will be made publicly available after acceptance.
【94】What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning
标题:什么是好的课程?理清数据排序对LLM数学推理的影响
链接:https://arxiv.org/abs/2510.19099
备注:8 pages (main text) + 4 pages (appendix), 4 figures
摘要:课程学习(CL),即将训练数据从易到难排序,已成为改进大型语言模型(LLM)推理的流行策略。然而,先前的工作采用不同的难度指标和训练设置,留下了一些根本性的开放问题:课程何时有帮助?哪个方向(正向还是反向)更好?答案是否取决于我们度量的内容?我们通过一个统一的离线评估框架来回答这些问题,该框架将课程难度分解为五个互补维度:问题难度、模型惊异度、置信度边际、预测不确定性和决策可变性。通过在Llama3.1-8B、Mistral-7B和Gemma3-4B上对数学推理基准进行受控的后训练实验,我们发现:(i)没有普遍占优的课程策略,正向CL与反向CL的相对有效性共同取决于模型能力和任务复杂性;(ii)即使在单一指标内,不同难度水平的样本也会因任务需求不同而产生不同收益;(iii)任务对齐的课程侧重于塑造模型的最终表示和泛化能力,而内部状态课程则调节置信度和不确定性等内部状态。我们的研究结果挑战了通用课程策略的概念,并为不同模型和任务场景提供了可操作的指导;部分指标表明,优先考虑决策不确定的样本可以进一步提升学习效果。
摘要:Curriculum learning (CL) - ordering training data from easy to hard - has become a popular strategy for improving reasoning in large language models (LLMs). Yet prior work employs disparate difficulty metrics and training setups, leaving open fundamental questions: When does curriculum help? Which direction - forward or reverse - is better? And does the answer depend on what we measure? We address these questions through a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Through controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B, we find that (i) no curriculum strategy dominates universally - the relative effectiveness of forward versus reverse CL depends jointly on model capability and task complexity; (ii) even within a single metric, samples at different difficulty levels produce distinct gains depending on task demands; and (iii) task-aligned curricula focus on shaping the model's final representations and generalization, whereas inner-state curricula modulate internal states such as confidence and uncertainty. Our findings challenge the notion of a universal curriculum strategy and offer actionable guidance across model and task regimes, with some metrics indicating that prioritizing decision-uncertain samples can further enhance learning outcomes.
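Mechanically, forward and reverse curricula are the same operation: sorting one training pool by a chosen difficulty score in opposite directions. A minimal sketch using model surprisal (-log p of the correct answer) as the difficulty metric; the sample names and probabilities are made up for illustration.

```python
import math

def order_by_difficulty(samples, difficulty, reverse=False):
    """Return samples sorted easy-to-hard (forward CL) or hard-to-easy (reverse CL)."""
    idx = sorted(range(len(samples)), key=lambda i: difficulty[i], reverse=reverse)
    return [samples[i] for i in idx]

# Difficulty as model surprisal: -log p(correct answer) under the model.
probs = {"q_easy": 0.9, "q_mid": 0.5, "q_hard": 0.1}
samples = list(probs)
surprisal = [-math.log(probs[s]) for s in samples]

forward_cl = order_by_difficulty(samples, surprisal)                 # easy first
reverse_cl = order_by_difficulty(samples, surprisal, reverse=True)   # hard first
```

Swapping `surprisal` for any of the paper's other four metrics (problem difficulty, confidence margin, predictive uncertainty, decision variability) changes the curriculum without changing this ordering machinery.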
【95】Local Guidance for Configuration-Based Multi-Agent Pathfinding
标题:面向基于配置的多智能体寻路的局部引导
链接:https://arxiv.org/abs/2510.19072
备注:10 pages
摘要:引导(guidance)是一个新兴概念,可提高实时次优多智能体寻路(MAPF)方法的经验性能。它为MAPF算法提供额外信息,通过考虑整个工作空间中所有智能体的集体行为,在全局尺度上缓解拥塞。这种全局视角有助于减少智能体的等待时间,从而提高整体协调效率。与之相对,本研究探索了一种替代方法:在每个智能体附近提供局部引导。虽然这种局部化方法需要随智能体移动而重新计算,看似计算开销较大,但我们的实验表明,向规划器提供信息丰富的时空线索可以在不超出适度时间预算的情况下显著提高解的质量。当应用于领先的基于配置的求解器LaCAM时,这种形式的引导为MAPF建立了新的性能边界。
摘要:Guidance is an emerging concept that improves the empirical performance of real-time, sub-optimal multi-agent pathfinding (MAPF) methods. It offers additional information to MAPF algorithms to mitigate congestion on a global scale by considering the collective behavior of all agents across the entire workspace. This global perspective helps reduce agents' waiting times, thereby improving overall coordination efficiency. In contrast, this study explores an alternative approach: providing local guidance in the vicinity of each agent. While such localized methods involve recomputation as agents move and may appear computationally demanding, we empirically demonstrate that supplying informative spatiotemporal cues to the planner can significantly improve solution quality without exceeding a moderate time budget. When applied to LaCAM, a leading configuration-based solver, this form of guidance establishes a new performance frontier for MAPF.
【96】PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
标题:PoSh:使用场景图指导LLM作为评委进行详细的图像描述
链接:https://arxiv.org/abs/2510.19060
备注:24 pages, 9 figures. Metric/benchmark available at this https URL
摘要:虽然视觉语言模型(VLM)已能生成详细的图像描述,但评估仍然是一个挑战。标准度量(如CIDEr、SPICE)是为短文本设计的,并被调校用于识别如今已不常见的错误,例如对象误识别。相比之下,长文本需要对属性和关系的依附关系保持敏感,并需要能将错误定位到特定文本跨度的评分。在这项工作中,我们引入了PoSh,一种用于详细图像描述的度量,它使用场景图作为结构化评分准则来指导作为评委的LLM,产生基于细粒度错误(例如组合理解中的错误)的总分。PoSh可复现、可解释,并且比现有指标(包括作为评委的GPT4o)更好地代理人类评分员。为了验证PoSh,我们引入了一个具有挑战性的新数据集DOCENT。这个新基准包含艺术作品,配以专家撰写的参考描述和模型生成的描述,并附有艺术史学生对其质量的细粒度和粗粒度判断。因此,DOCENT能够在一个具有挑战性的新领域中同时评估详细图像描述度量和详细图像描述本身。我们表明,PoSh与DOCENT中人类判断的相关性比最好的开放权重替代方案更强(斯皮尔曼$\rho$高0.05),对图像类型具有鲁棒性(在现有网络图像数据集CapArena上验证),并且是一个有效的奖励函数,优于标准的监督微调。随后,我们使用PoSh刻画了开放和封闭模型在描述DOCENT中的绘画、素描和雕像时的表现,发现基础模型难以对场景动态丰富的图像实现完整无误的覆盖,从而确立了一项衡量VLM进展的苛刻新任务。通过PoSh和DOCENT,我们希望推动辅助文本生成等重要领域的进展。
摘要:While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
【97】The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMS
标题:MUSE基准:在音频LLMS中探索音乐感知和听觉关系推理
链接:https://arxiv.org/abs/2510.19055
备注:5 pages, 2 figures, 2 tables
摘要:多模态大型语言模型(MLLM)已展现出音频理解能力,但当前的评估可能掩盖其在关系推理上的根本弱点。我们介绍了音乐理解与结构评估(MUSE)基准,这是一个包含10个任务的开源资源,旨在探测基本的音乐感知技能。我们对照大规模人类基线(N=200)评估了四个SOTA模型(Gemini Pro与Flash、Qwen2.5-Omni和Audio-Flamingo 3)。结果揭示了SOTA模型能力的巨大差异,以及与人类专家之间的持续差距。Gemini Pro在基本感知任务上表现良好,而Qwen和Audio-Flamingo 3的表现接近或等同于随机水平,暴露出严重的感知缺陷。此外,我们发现思维链(CoT)提示带来的结果不一致,且往往有害。我们的工作为评估不变的音乐表示提供了关键工具,并推动更鲁棒的AI系统的开发。
摘要:Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio-Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
【98】Rectifying Shortcut Behaviors in Preference-based Reward Learning
标题:纠正基于偏好的奖励学习中的捷径行为
链接:https://arxiv.org/abs/2510.19050
备注:NeurIPS 2025
摘要:在基于人类反馈的强化学习中,基于偏好的奖励模型在使大型语言模型对齐人类期望行为方面发挥核心作用。然而,最近的研究表明,这些模型容易遭受奖励黑客攻击,并且常因过度优化而难以良好泛化。它们通过利用捷径来获得高奖励分数,即利用与训练数据中人类偏好标签相关、却并不真正反映预期目标的虚假特征(例如响应冗长、迎合的语气或奉承)。在本文中,我们没有逐一探讨这些问题,而是将奖励黑客问题更广泛地视为捷径行为,并引入一种有原则且灵活的方法来缓解基于偏好的奖励学习中的捷径行为。受核视角下不变性理论的启发,我们提出了用于缓解捷径的基于偏好的奖励不变性方法(PRISM),它以闭式学习目标学习带特征映射的群不变核。多个基准上的实验结果表明,我们的方法在多种分布外任务上一致提高了奖励模型的准确性,并减少了下游策略模型对捷径的依赖,为基于偏好的对齐建立了一个鲁棒的框架。
摘要:In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
【99】REPAIR Approach for Social-based City Reconstruction Planning in case of natural disasters
标题:面向自然灾害情况下基于社会效益的城市重建规划的REPAIR方法
链接:https://arxiv.org/abs/2510.19048
备注:Accepted at International Journal of Data Science and Analytics
摘要:自然灾害总是对人类生活造成多方面的影响。政府很难在现有资源(主要是预算和时间)内应对这些事件,并重建经济、社会和实体基础设施与设施。政府总是依据法律和政治战略来制定计划和政策,以实现社会效益最大化。破坏的严重程度和恢复正常生活所需的大量资源使重建工作成为一项挑战。本文是我们先前已发表工作的扩展,通过整合额外的深度学习模型以及作为基线的随机智能体,进行了全面的比较分析。我们先前的研究引入了一个使用深度强化学习技术规划灾后城市重建的决策支持系统,在考虑可用资源、满足广大社区利益相关者的需求(如市民的社会福利和政治家的优先事项)并顾及城市结构性约束(如道路与建筑物之间的依赖关系)的同时,最大化重建过程的社会效益。所提出的方法名为灾后重建计划提供器(REPAIR),具有通用性:它可以为地方管理者确定一组备选计划,供其选择理想方案加以实施,并且可以应用于任意范围的区域。我们展示了REPAIR在一个真实用例中的应用,即2009年在大地震中受损的拉奎拉市的重建过程。
摘要:Natural disasters always have several effects on human lives. It is challenging for governments to tackle these incidents and to rebuild the economic, social and physical infrastructures and facilities with the available resources (mainly budget and time). Governments always define plans and policies according to the law and political strategies that should maximise social benefits. The severity of damage and the vast resources needed to bring life back to normality make such reconstruction a challenge. This article is the extension of our previously published work by conducting comprehensive comparative analysis by integrating additional deep learning models plus random agent which is used as a baseline. Our prior research introduced a decision support system by using the Deep Reinforcement Learning technique for the planning of post-disaster city reconstruction, maximizing the social benefit of the reconstruction process, considering available resources, meeting the needs of the broad community stakeholders (like citizens' social benefits and politicians' priorities) and keeping in consideration city's structural constraints (like dependencies among roads and buildings). The proposed approach, named post disaster REbuilding plAn ProvIdeR (REPAIR) is generic. It can determine a set of alternative plans for local administrators who select the ideal one to implement, and it can be applied to areas of any extension. We show the application of REPAIR in a real use case, i.e., to the L'Aquila reconstruction process, damaged in 2009 by a major earthquake.
【100】"Over-the-Hood" AI Inclusivity Bugs and How 3 AI Product Teams Found and Fixed Them
标题:"引擎盖之上"的AI包容性缺陷,以及3个AI产品团队如何发现并修复它们
链接:https://arxiv.org/abs/2510.19033
摘要:虽然许多研究揭示了人工智能"引擎盖之下"的偏见(例如算法、训练数据等),但"引擎盖之上"的包容性偏见又如何呢:即面向用户的人工智能产品中,不成比例地排斥采用某些问题解决方式的用户的障碍?最近的研究已开始报告此类偏见的存在,但它们是什么样子?有多普遍?开发人员如何发现并修复它们?为了找到答案,我们与3个人工智能产品团队进行了实地研究,以调查面向用户的人工智能产品中独有的AI包容性缺陷有哪些,以及AI产品团队是否/如何利用一种现有的(非面向AI的)包容性设计方法来发现和修复它们。这些团队的工作识别出6类AI包容性缺陷,共出现83次,修复了其中47个缺陷实例,并产生了GenderMag包容性设计方法的一个新变体GenderMag-for-AI,后者在检测某些类型的AI包容性缺陷方面尤为有效。
摘要:While much research has shown the presence of AI's "under-the-hood" biases (e.g., algorithmic, training data, etc.), what about "over-the-hood" inclusivity biases: barriers in user-facing AI products that disproportionately exclude users with certain problem-solving approaches? Recent research has begun to report the existence of such biases -- but what do they look like, how prevalent are they, and how can developers find and fix them? To find out, we conducted a field study with 3 AI product teams, to investigate what kinds of AI inclusivity bugs exist uniquely in user-facing AI products, and whether/how AI product teams might harness an existing (non-AI-oriented) inclusive design method to find and fix them. The teams' work resulted in identifying 6 types of AI inclusivity bugs arising 83 times, fixes covering 47 of these bug instances, and a new variation of the GenderMag inclusive design method, GenderMag-for-AI, that is especially effective at detecting certain kinds of AI inclusivity bugs.
【101】CLiVR: Conversational Learning System in Virtual Reality with AI-Powered Patients
标题:CLiVR:结合AI驱动虚拟患者的虚拟现实对话学习系统
链接:https://arxiv.org/abs/2510.19031
摘要:模拟构成了医学和护理教育的基本组成部分,传统上采用标准化患者(SP)和高保真人体模型来开发临床推理和沟通技能。然而,这些方法需要大量的资源,限制了可访问性和可扩展性。在这项研究中,我们介绍了CLiVR,虚拟现实中的会话学习系统,它集成了大型语言模型(LLM),语音处理和3D化身来模拟逼真的医患互动。CLiVR在Unity中开发并部署在Meta Quest 3平台上,使学员能够与虚拟患者进行自然对话。每个模拟都是从综合征症状数据库中动态生成的,并通过情感分析进行增强,以提供对通信音调的反馈。通过一项涉及医学院教师(n=13)的专家用户研究,我们评估了可用性、现实性和感知教育影响。结果表明,用户接受度很高,对教育潜力很有信心,并提供了宝贵的改进反馈。CLiVR为基于SP的培训提供了可扩展的沉浸式补充。
摘要:Simulations constitute a fundamental component of medical and nursing education and traditionally employ standardized patients (SP) and high-fidelity manikins to develop clinical reasoning and communication skills. However, these methods require substantial resources, limiting accessibility and scalability. In this study, we introduce CLiVR, a Conversational Learning system in Virtual Reality that integrates large language models (LLMs), speech processing, and 3D avatars to simulate realistic doctor-patient interactions. Developed in Unity and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in natural dialogue with virtual patients. Each simulation is dynamically generated from a syndrome-symptom database and enhanced with sentiment analysis to provide feedback on communication tone. Through an expert user study involving medical school faculty (n=13), we assessed usability, realism, and perceived educational impact. Results demonstrated strong user acceptance, high confidence in educational potential, and valuable feedback for improvement. CLiVR offers a scalable, immersive supplement to SP-based training.
【102】FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains
标题:FlexiDataGen:用于敏感领域动态语义数据集生成的自适应LLM框架
链接:https://arxiv.org/abs/2510.19025
摘要:数据集的可用性和质量仍然是机器学习中的关键挑战,特别是在数据稀缺、获取成本高或受隐私法规限制的领域。医疗保健、生物医学研究和网络安全等领域经常遇到数据采集成本高、可获取的标注数据有限以及关键事件罕见或敏感等问题。这些问题统称为数据集挑战,阻碍了在这些高风险领域开发准确且可泛化的机器学习模型。为了解决这个问题,我们引入了FlexiDataGen,一个用于敏感领域动态语义数据集生成的自适应大语言模型(LLM)框架。FlexiDataGen自动合成丰富、语义一致且语言多样的数据集,为专业领域量身定制。该框架集成了四个核心组件:(1)句法-语义分析,(2)检索增强生成,(3)动态元素注入,(4)带语义验证的迭代改写。这些组件共同确保生成高质量、领域相关的数据。实验结果表明,FlexiDataGen有效缓解了数据短缺和标注瓶颈,支持可扩展且准确的机器学习模型开发。
摘要:Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently encounter high data acquisition costs, limited access to annotated data, and the rarity or sensitivity of key events. These issues-collectively referred to as the dataset challenge-hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive large language model (LLM) framework designed for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components ensure the generation of high-quality, domain-relevant data. Experimental results show that FlexiDataGen effectively alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
【103】Prior-informed optimization of treatment recommendation via bandit algorithms trained on large language model-processed historical records
标题:利用在大语言模型处理的历史记录上训练的老虎机(bandit)算法对治疗推荐进行先验知情优化
链接:https://arxiv.org/abs/2510.19014
摘要:当前的医疗实践依赖标准化的治疗框架和经验方法,忽视了患者个体差异,导致次优的健康结局。我们开发了一个综合系统,集成大型语言模型(LLM)、条件表格生成对抗网络(CTGAN)、T-learner反事实模型和上下文老虎机方法,以提供定制化、数据驱动的临床建议。该方法利用LLM将非结构化医疗叙述处理为结构化数据集(准确率93.2%),使用CTGAN生成逼真的合成患者数据(经双样本验证准确率55%),部署T-learner预测患者特定的治疗反应(准确率84.3%),并整合先验知情的上下文老虎机,通过有效平衡对新可能性的探索与对现有知识的利用来增强在线治疗选择。在III期结肠癌数据集上的测试表明,我们的KernelUCB方法在5,000轮中获得了0.60至0.61的平均奖励分数,超过了其他参考方法。该综合系统克服了在线学习环境中的冷启动限制,提高了计算效率,是朝着适应患者具体特征的个体化医疗迈出的显著一步。
摘要:Current medical practice depends on standardized treatment frameworks and empirical methodologies that neglect individual patient variations, leading to suboptimal health outcomes. We develop a comprehensive system integrating Large Language Models (LLMs), Conditional Tabular Generative Adversarial Networks (CTGAN), T-learner counterfactual models, and contextual bandit approaches to provide customized, data-informed clinical recommendations. The approach utilizes LLMs to process unstructured medical narratives into structured datasets (93.2% accuracy), uses CTGANs to produce realistic synthetic patient data (55% accuracy via two-sample verification), deploys T-learners to forecast patient-specific treatment responses (84.3% accuracy), and integrates prior-informed contextual bandits to enhance online therapeutic selection by effectively balancing exploration of new possibilities with exploitation of existing knowledge. Testing on stage III colon cancer datasets revealed that our KernelUCB approach obtained 0.60-0.61 average reward scores across 5,000 rounds, exceeding other reference methods. This comprehensive system overcomes cold-start limitations in online learning environments, improves computational effectiveness, and constitutes notable progress toward individualized medicine adapted to specific patient characteristics.
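The contextual-bandit component balances exploring new treatments against exploiting known-good ones via an upper confidence bound. The sketch below is a linear (LinUCB-style) stand-in for the KernelUCB method named in the abstract; the disjoint-arm setup, variable names, and toy reward model are all illustrative assumptions.

```python
import numpy as np

def linucb_choose(contexts, A_list, b_list, alpha=0.5):
    """Pick an arm by a LinUCB-style upper confidence bound.

    contexts: (K, d) feature vector per arm; per-arm A (d x d) and b (d)
    are the usual ridge-regression sufficient statistics.
    """
    scores = []
    for x, A, b in zip(contexts, A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                                  # reward estimate
        scores.append(x @ theta + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    """Fold one observed (context, reward) pair into an arm's statistics."""
    return A + np.outer(x, x), b + reward * x

# Toy run: arm 1 is truly better, so the bandit should favor it over time.
rng = np.random.default_rng(3)
d, K = 3, 2
A_list = [np.eye(d) for _ in range(K)]
b_list = [np.zeros(d) for _ in range(K)]
true_theta = [np.array([0.1, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])]
picks = []
for _ in range(300):
    ctx = np.abs(rng.normal(size=(K, d)))
    k = linucb_choose(ctx, A_list, b_list)
    reward = float(ctx[k] @ true_theta[k] + 0.05 * rng.normal())
    A_list[k], b_list[k] = linucb_update(A_list[k], b_list[k], ctx[k], reward)
    picks.append(k)
```

KernelUCB replaces the linear estimate with a kernel ridge regression, but the choose/observe/update loop has the same shape.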
【104】Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces
标题:多重声音,单一代理:在多用户家庭空间中迈向包容性人工智能
链接:https://arxiv.org/abs/2510.19008
摘要:国内人工智能代理面临着道德、自主性和包容性的挑战,特别是对于被忽视的群体,如儿童、老年人和神经分歧用户。我们提出了多元声音模型(PVM),这是一种新颖的单代理框架,它通过实时价值调整动态协商多用户需求,利用有关心理健康,老年人护理,教育和道德推理的各种公共数据集。PVM使用具有公平意识的场景和道德增强的人类+综合课程设计,确定核心价值观,冲突和可访问性要求,以告知包容性原则。我们以隐私为中心的原型具有自适应安全支架,定制的交互(例如,为神经分歧用户提供逐步指导,为儿童提供简单的措辞),以及公平的冲突解决方案。在初步评估中,PVM在合规性(76% vs. 70%)、公平性(90% vs. 85%)、安全违规率(0% vs. 7%)和延迟方面优于多代理基线。设计创新,包括视频指导,自主滑块,家庭中心和自适应安全仪表板,展示了道德和包容性的国内人工智能的新方向,在多元化的国内环境中建立以用户为中心的代理系统。我们的代码和模型是开源的,可供复制:https://github.com/zade90/Agora
摘要:Domestic AI agents face ethical, autonomy, and inclusion challenges, particularly for overlooked groups such as children, the elderly, and neurodivergent users. We present the Plural Voices Model (PVM), a novel single-agent framework that dynamically negotiates multi-user needs through real-time value alignment, leveraging diverse public datasets on mental health, eldercare, education, and moral reasoning. Using human+synthetic curriculum design with fairness-aware scenarios and ethical enhancements, PVM identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Our privacy-focused prototype features adaptive safety scaffolds, tailored interactions (e.g., step-by-step guidance for neurodivergent users, simple wording for children), and equitable conflict resolution. In preliminary evaluations, PVM outperforms multi-agent baselines in compliance (76% vs. 70%), fairness (90% vs. 85%), safety-violation rate (0% vs. 7%), and latency. Design innovations, including video guidance, autonomy sliders, family hubs, and adaptive safety dashboards, demonstrate new directions for ethical and inclusive domestic AI and for building user-centered agentic systems in plural domestic contexts. Our code and model have been open-sourced and are available for reproduction: https://github.com/zade90/Agora
【105】$Δ$t-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction
标题:$Δ$t-Mamba3D:用于乳腺癌风险预测的时间感知时空状态空间模型
链接:https://arxiv.org/abs/2510.19003
摘要:连续放射图像的纵向分析受到一个基本数据挑战的阻碍:如何有效地对以不规则时间间隔捕获的高分辨率图像序列进行建模。这种数据结构包含了不可或缺的空间和时间线索,目前的方法无法充分利用。模型通常通过将空间信息折叠成向量,或应用计算效率低且与非均匀时间步长不兼容的时空模型来妥协。我们用时间感知的$\Delta$t-Mamba3D来解决这一挑战,这是一种适用于纵向医学成像的新型状态空间架构。我们的模型同时编码不规则的访问间隔和丰富的时空上下文,同时保持计算效率。它的核心创新是一个连续时间选择性扫描机制,明确地将检查之间的真实时间差集成到其状态转换中。这由一个多尺度3D邻域融合模块补充,鲁棒地捕捉时空关系。在使用连续筛查乳房X线检查的综合乳腺癌风险预测基准中,我们的模型显示出优越的性能,与循环网络、Transformer和状态空间模型的已有变体相比,将验证c指数提高了2-5个百分点,并实现了更高的1-5年AUC评分。由于其线性复杂性,该模型可以有效地处理漫长而复杂的乳腺X线照片患者筛查历史,形成纵向图像分析的新框架。
摘要:Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware $\Delta$t-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
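The core mechanism, integrating the true elapsed time between exams into the state transition, can be sketched in one scalar dimension. This is a minimal zero-order-hold discretization of dh/dt = a*h + b*x under our own toy parameters, not the paper's full Mamba3D block:

```python
import math

def time_aware_step(h, x, dt, a=-0.5, b=1.0):
    """One scalar state update with an irregular time gap dt.

    Sketch of the continuous-time idea: the hidden state decays according to
    the *actual* interval between visits, so a 6-month gap discounts old
    information far more than a 1-month gap.
    """
    decay = math.exp(a * dt)
    return decay * h + ((decay - 1.0) / a) * b * x

h = 1.0
h_short = time_aware_step(h, x=0.0, dt=1.0)  # short gap: state mostly retained
h_long = time_aware_step(h, x=0.0, dt=6.0)   # long gap: state strongly decayed
```

A fixed-step recurrence would apply the same decay to both cases, which is exactly the incompatibility with non-uniform time steps the abstract describes.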
【106】Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
标题:通过基于元数据的上下文和特定任务提示实现稳健的驾驶问答
链接:https://arxiv.org/abs/2510.19001
摘要:我们提出了一个两阶段的自动驾驶视觉语言问答系统,回答高层次的感知、预测和规划问题。在第一阶段,一个大型多模态LLM(Qwen2.5-VL-32B)以六个摄像机的输入、一个短的历史时间窗口以及带有Few-Shot样本的思维链提示为条件。自我一致性集成(多个采样推理链)进一步提高了答案的可靠性。在第二阶段,我们使用nuScenes场景元数据(对象注释、自我车辆状态等)和类别特定的问题说明(感知、预测、规划任务的单独提示)增强提示。在驾驶QA基准的实验中,我们的方法显著优于基线Qwen2.5模型。例如,在第一阶段使用5个历史帧和10-shot提示产生65.1%的总体准确度(与zero-shot的62.61%相比);应用自我一致性将其提高到66.85%。第二阶段总体达到67.37%。值得注意的是,该系统在严重的视觉损坏情况下仍能保持96%的准确率。这些结果表明,精心设计的提示和上下文基础可以大大提高预训练视觉语言模型的高级驾驶QA能力。
摘要:We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs.62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
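The self-consistency ensemble mentioned above reduces, at its simplest, to a majority vote over the final answers of independently sampled reasoning chains. A minimal sketch (the normalization and example answers are our own assumptions):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over independently sampled chain-of-thought answers.

    Each sampled chain contributes only its final answer; the most frequent
    answer wins, and the vote share serves as a crude confidence signal.
    """
    tally = Counter(a.strip().lower() for a in answers)
    answer, votes = tally.most_common(1)[0]
    return answer, votes / len(answers)

# e.g. five sampled chains for one driving-QA question
answer, agreement = self_consistency(["Brake", "brake", "accelerate", "Brake", "brake "])
```

In practice the sampled chains come from the same prompt with nonzero temperature; the vote filters out chains that wandered off a correct reasoning path.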
【107】$\nabla$-SDF: Learning Euclidean Signed Distance Functions Online with Gradient-Augmented Octree Interpolation and Neural Residual
标题:$\nabla$-SDF:使用梯度增强八叉树插值与神经残差在线学习欧几里得符号距离函数
链接:https://arxiv.org/abs/2510.18999
摘要:从点云数据中估计符号距离函数(SDF)已被证明有利于许多机器人自主能力,包括定位、映射、运动规划和控制。支持在线和大规模SDF重建的方法往往依赖于离散的体积数据结构,这影响了SDF估计的连续性和可微性。最近,使用隐式特征的神经网络方法已经证明了高保真和可微的SDF重建,但它们往往效率较低,在大环境中可能会经历灾难性遗忘和内存限制,并且通常局限于截断的SDF。这项工作提出了$\nabla$-SDF,一种混合方法,将梯度增强八叉树插值得到的显式先验与隐式神经残差相结合。我们的方法实现了非截断(欧几里得)SDF重建,其计算和内存效率可与体积方法媲美,其可微性和准确性可与神经网络方法媲美。大量实验表明,$\nabla$-SDF在准确性和效率方面优于最先进的技术,为机器人和计算机视觉的下游任务提供了可扩展的解决方案。
摘要:Estimation of signed distance functions (SDFs) from point cloud data has been shown to benefit many robot autonomy capabilities, including localization, mapping, motion planning, and control. Methods that support online and large-scale SDF reconstruction tend to rely on discrete volumetric data structures, which affect the continuity and differentiability of the SDF estimates. Recently, using implicit features, neural network methods have demonstrated high-fidelity and differentiable SDF reconstruction but they tend to be less efficient, can experience catastrophic forgetting and memory limitations in large environments, and are often restricted to truncated SDFs. This work proposes $\nabla$-SDF, a hybrid method that combines an explicit prior obtained from gradient-augmented octree interpolation with an implicit neural residual. Our method achieves non-truncated (Euclidean) SDF reconstruction with computational and memory efficiency comparable to volumetric methods and differentiability and accuracy comparable to neural network methods. Extensive experiments demonstrate that $\nabla$-SDF outperforms the state of the art in terms of accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.
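The "gradient-augmented interpolation + neural residual" split can be illustrated in 1D: when each node stores both an SDF value and its gradient, a cubic Hermite interpolant gives a C1 explicit prior, and a learned residual (here a placeholder) corrects what remains. This is our own 1D simplification; the paper operates on octree cells in 3D.

```python
def gradient_augmented_interp(x, x0, x1, f0, f1, g0, g1):
    """Cubic Hermite interpolation from values AND gradients stored at nodes.

    1D sketch of the explicit prior in a hybrid SDF: storing gradients at the
    nodes makes the interpolant C1-continuous and much more accurate than
    value-only linear blending.
    """
    t = (x - x0) / (x1 - x0)
    h = x1 - x0
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * f0 + h10 * h * g0 + h01 * f1 + h11 * h * g1

def sdf_estimate(x, residual=lambda x: 0.0, **nodes):
    # hybrid estimate: explicit prior + (placeholder for a) neural residual
    return gradient_augmented_interp(x, **nodes) + residual(x)

# a linear SDF f(x) = x - 0.5 is reproduced exactly from two nodes
val = sdf_estimate(0.25, x0=0.0, x1=1.0, f0=-0.5, f1=0.5, g0=1.0, g1=1.0)
```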
【108】Timely Clinical Diagnosis through Active Test Selection
标题:通过主动选择测试及时进行临床诊断
链接:https://arxiv.org/abs/2510.18988
摘要:人们对使用机器学习(ML)来支持临床诊断的兴趣越来越大,但大多数方法依赖于静态的、完全观察到的数据集,并且无法反映临床医生在实践中使用的顺序的、资源感知的推理。诊断仍然很复杂,容易出错,特别是在高压或资源有限的环境中,强调需要框架,帮助临床医生做出及时和具有成本效益的决定。我们提出了ACTMED(通过基于模型的实验设计进行自适应临床测试选择),这是一个诊断框架,它将贝叶斯实验设计(BED)与大型语言模型(LLM)集成在一起,以更好地模拟现实世界的诊断推理。在每个步骤中,ACTMED选择预期对给定患者的诊断不确定性产生最大降低的测试。LLM充当灵活的模拟器,生成合理的患者状态分布并支持信念更新,而不需要结构化的特定任务训练数据。临床医生可以保持在循环中;审查测试建议,解释中间输出,并在整个过程中应用临床判断。我们在真实世界的数据集上评估了ACTMED,并表明它可以优化测试选择,以提高诊断准确性,可解释性和资源使用。这代表了向透明、自适应和临床医生一致的诊断系统迈出的一步,该诊断系统在降低对特定领域数据的依赖的情况下跨设置进行概括。
摘要:There is growing interest in using machine learning (ML) to support clinical diagnosis, but most approaches rely on static, fully observed datasets and fail to reflect the sequential, resource-aware reasoning clinicians use in practice. Diagnosis remains complex and error prone, especially in high-pressure or resource-limited settings, underscoring the need for frameworks that help clinicians make timely and cost-effective decisions. We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a diagnostic framework that integrates Bayesian Experimental Design (BED) with large language models (LLMs) to better emulate real-world diagnostic reasoning. At each step, ACTMED selects the test expected to yield the greatest reduction in diagnostic uncertainty for a given patient. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. Clinicians can remain in the loop; reviewing test suggestions, interpreting intermediate outputs, and applying clinical judgment throughout. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use. This represents a step toward transparent, adaptive, and clinician-aligned diagnostic systems that generalize across settings with reduced reliance on domain-specific data.
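The selection rule "pick the test with the greatest expected reduction in diagnostic uncertainty" is the classic expected-information-gain step of Bayesian experimental design. A minimal discrete sketch (the priors and likelihood tables are invented for illustration; in ACTMED an LLM plays the role of the likelihood simulator):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_info_gain(prior, likelihoods):
    """Expected drop in diagnostic entropy from ordering one test.

    `prior` is a belief over diagnoses; `likelihoods[d][r]` approximates
    P(result r | diagnosis d) for the candidate test.
    """
    n_results = len(likelihoods[0])
    gain = entropy(prior)
    for r in range(n_results):
        p_r = sum(prior[d] * likelihoods[d][r] for d in range(len(prior)))
        if p_r == 0:
            continue
        posterior = [prior[d] * likelihoods[d][r] / p_r for d in range(len(prior))]
        gain -= p_r * entropy(posterior)
    return gain

prior = [0.5, 0.5]                        # two candidate diagnoses
informative = [[0.9, 0.1], [0.1, 0.9]]    # test strongly separates them
useless = [[0.5, 0.5], [0.5, 0.5]]        # test result is independent of diagnosis
```

Scoring every candidate test this way and ordering the argmax yields the greedy, resource-aware policy the abstract describes.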
【109】Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
标题:通过最佳传输进行测试时验证:覆盖率、ROC和次优性
链接:https://arxiv.org/abs/2510.18982
摘要:虽然带验证的测试时间扩展在提升大型语言模型(LLM)性能方面展现了前景,但验证器的作用及其不完善之处仍未得到充分探索。验证的效果通过三个量的相互作用体现:(i)生成器的覆盖率,(ii)验证器的收敛区域(ROC),以及(iii)采样算法的次优性。虽然最近的研究捕捉了这些因素的子集,但仍缺少一个量化其相互作用几何结构的统一框架。我们将可验证的测试时间扩展表述为一个传输问题,以此刻画覆盖率、ROC与次优性之间的相互作用,并揭示次优性-覆盖率曲线呈现三种状态:传输状态,即次优性随覆盖率增加而增加;策略改进状态,即取决于验证器的ROC,次优性可能随覆盖率增加而降低;以及饱和状态,即次优性趋于平稳,不受覆盖率影响。我们进一步提出并分析了两类采样算法(顺序与批量),并研究它们的计算复杂性如何塑造这些权衡。基于Qwen、Llama和Gemma模型的实证结果证实了我们的理论发现。
摘要:While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality--coverage curve exhibits three regimes. A transport regime -- where sub-optimality increases with coverage, a policy improvement regime -- where sub-optimality may decrease with coverage, depending on the verifier's ROC, and a saturation regime -- where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms -- sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
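The two sampling schemes the abstract contrasts can be sketched as selection rules over verifier scores. The toy candidates, qualities, and scores below are our own construction, chosen only to show how an imperfect verifier makes selection sub-optimal even when a correct answer is in the pool (i.e., under full coverage):

```python
def batched_select(candidates, score):
    """Batched verification: score all n candidates, return the best-scored."""
    return max(candidates, key=score)

def sequential_select(candidates, score, threshold):
    """Sequential verification: stop at the first candidate the verifier accepts."""
    for c in candidates:
        if score(c) >= threshold:
            return c
    return candidates[-1]  # fall back to the last sample

# Imperfect verifier: its score mis-ranks the truly best answer "a" below "c".
quality = {"a": 1.0, "b": 0.2, "c": 0.6}
verifier = {"a": 0.7, "b": 0.1, "c": 0.8}.get
best_batched = batched_select(["a", "b", "c"], verifier)
best_sequential = sequential_select(["a", "b", "c"], verifier, threshold=0.65)
suboptimality = max(quality.values()) - quality[best_batched]
```

Here the sequential rule happens to accept the best answer early, while the batched rule is misled by the verifier's ranking; which regime dominates at scale is exactly what the paper's transport analysis characterizes.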
【110】ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
标题:ProfBench:需要专业知识来回答和评判的多领域评分标准
链接:https://arxiv.org/abs/2510.18941
备注:23 pages
摘要:评估大型语言模型(LLM)的进展通常受限于验证响应的挑战,使评估局限于数学、编程和简短问答等任务。然而,许多现实世界的应用需要评估LLM处理专业文档、综合信息并根据用户查询生成全面报告的能力。我们介绍ProfBench:一套超过7000个响应-标准对,由具备物理学博士、化学博士、金融MBA和咨询MBA专业知识的人类专家进行评估。我们通过减轻自我增强偏差并将评估成本降低2-3个数量级,构建了强大且经济的LLM评判器来评估ProfBench的评分标准,使其对更广泛的社区公平且可及。我们的研究结果表明,即使对于最先进的LLM,ProfBench也构成了重大挑战,表现最好的模型如GPT-5-high整体性能仅达到65.9%。此外,我们发现了专有模型与开放权重模型之间的显著性能差异,并深入分析了扩展思维在解决复杂专业领域任务中所起的作用。数据:https://huggingface.co/datasets/nvidia/ProfBench 代码:https://github.com/NVlabs/ProfBench
摘要:Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
【111】NeuroAda: Activating Each Neuron's Potential for Parameter-Efficient Fine-Tuning
标题:NeuroAda:激活每个神经元的潜力进行参数高效微调
链接:https://arxiv.org/abs/2510.18940
摘要:现有的参数有效的微调(PEFT)方法主要分为两类:基于添加和选择性的原位适应。前者,如LoRA,引入了额外的模块来使模型适应下游任务,提供了强大的内存效率。然而,它们的表示能力往往是有限的,使它们不太适合细粒度的适应。相比之下,后者直接微调原始模型参数的精心选择的子集,允许更精确和有效的适应,但代价是显着增加内存消耗。为了调和这种权衡,我们提出了NeuroAda,这是一种新的PEFT方法,可以在保持高内存效率的同时实现细粒度的模型微调。我们的方法首先确定重要参数(即,网络内的连接),然后为这些选择的参数引入旁路连接。在微调期间,仅更新旁路连接,使原始模型参数保持冻结。对23+个自然语言生成和理解任务的实证结果表明,NeuroAda仅用$\leq \textbf{0.02}\%$可训练参数就实现了最先进的性能,同时将CUDA内存使用量减少了60%。我们在这里发布代码:https://github.com/FightingFighting/NeuroAda.git。
摘要:Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq \textbf{0.02}\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
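The NeuroAda idea, attach trainable bypass connections only to the most important weights and freeze everything else, can be sketched with a top-k magnitude mask. This is a simplified illustration (plain Python, our own selection criterion); the paper's importance scoring and training loop are more involved:

```python
def topk_mask(W, k):
    """Keep a trainable bypass only for the k largest-magnitude connections."""
    flat = sorted(
        ((abs(w), i, j) for i, row in enumerate(W) for j, w in enumerate(row)),
        reverse=True,
    )
    mask = [[0] * len(row) for row in W]
    for _, i, j in flat[:k]:
        mask[i][j] = 1
    return mask

def effective_weights(W, delta, mask):
    """W + mask*delta: the frozen base plus bypass updates on selected entries."""
    return [
        [w + m * d for w, d, m in zip(wr, dr, mr)]
        for wr, dr, mr in zip(W, delta, mask)
    ]

W = [[0.9, -0.1], [0.05, -1.2]]   # frozen pretrained weights
mask = topk_mask(W, k=2)          # only 2 of 4 connections get a bypass
delta = [[0.1, 0.1], [0.1, 0.1]]  # bypass values (these would be learned)
W_eff = effective_weights(W, delta, mask)
```

Only the `delta` entries selected by the mask would receive gradients, which is what keeps the trainable-parameter count (and optimizer memory) tiny.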
【112】A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation
标题:计算机教育公平与道德课程的正义视角:LLM辅助的多视角和主题评估
链接:https://arxiv.org/abs/2510.18931
备注:14 pages, 8 figures, In Review
摘要:课程大纲为课程设定基调和期望,为学生和教师塑造学习体验。在计算机课程中,特别是那些涉及人工智能(AI)、机器学习(ML)和算法设计中的公平和道德的课程,我们必须了解如何解决公平结果的障碍。这些期望应该是包容的、透明的,并以促进批判性思维为基础。教学大纲分析提供了一种方法来评估课程的覆盖面,深度,实践和期望。然而,手工评估教学大纲既费时又容易出现不一致。为了解决这个问题,我们开发了一个以公正为导向的评分规则,并要求一个大型语言模型(LLM)通过多视角角色模拟来审查教学大纲。使用这个量规,我们从四个角度评估了24个教学大纲:讲师,系主任,机构审查员,和外部评估。我们还促使法学硕士确定整个课程的主题趋势。研究结果表明,多视角评估有助于我们注意到细微差别,特定于角色的优先事项,利用它们来填补AI/ML课程设计中隐藏的空白,以及专注于公平和道德的相关计算课程。这些见解为改进这些课程中公平、道德和正义内容的设计和提供提供提供了具体的方向。
摘要:Course syllabi set the tone and expectations for courses, shaping the learning experience for both students and instructors. In computing courses, especially those addressing fairness and ethics in artificial intelligence (AI), machine learning (ML), and algorithmic design, it is imperative that we understand how approaches to navigating barriers to fair outcomes are being addressed. These expectations should be inclusive, transparent, and grounded in promoting critical thinking. Syllabus analysis offers a way to evaluate the coverage, depth, practices, and expectations within a course. Manual syllabus evaluation, however, is time-consuming and prone to inconsistency. To address this, we developed a justice-oriented scoring rubric and asked a large language model (LLM) to review syllabi through a multi-perspective role simulation. Using this rubric, we evaluated 24 syllabi from four perspectives: instructor, departmental chair, institutional reviewer, and external evaluator. We also prompted the LLM to identify thematic trends across the courses. Findings show that multi-perspective evaluation aids us in noting nuanced, role-specific priorities, leveraging them to fill hidden gaps in the curriculum design of AI/ML and related computing courses focused on fairness and ethics. These insights offer concrete directions for improving the design and delivery of fairness, ethics, and justice content in such courses.
【113】BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
标题:BAPO:通过自适应裁剪的平衡策略优化来稳定LLM的非策略强化学习
链接:https://arxiv.org/abs/2510.18927
备注:Preprint
摘要:强化学习(RL)最近已经成为对齐和加强大型语言模型(LLM)的核心范式。然而,在非策略设置中应用强化学习(使用过去策略中的陈旧数据进行训练)可以提高样本效率,但仍然具有挑战性:策略熵急剧下降,优化往往变得不稳定,甚至可能崩溃。通过理论和实证分析,我们发现了两个关键的见解:(i)优化中的不平衡,负优势样本主导了策略梯度,抑制了有用的行为并冒着梯度爆炸的风险;以及(ii)导出的熵裁剪规则,其揭示了PPO类目标中的固定裁剪机制系统地阻止熵增加更新,从而推动策略以牺牲探索为代价过度开采。基于这些见解,我们提出了自适应裁剪平衡策略优化(BAPO),这是一种简单而有效的方法,可以动态调整裁剪边界,以自适应地重新平衡积极和消极的贡献,保持熵,并稳定RL优化。在不同的非策略场景中(包括样本回放和部分推出),BAPO实现了快速、稳定和数据高效的训练。在AIME 2024和AIME 2025基准测试中,我们的7B BAPO模型超过了SkyWork-OR1-7B等开源同行,而我们的32B BAPO模型不仅在相同规模的模型中取得了最先进的结果,还优于o3-mini和Gemini-2.5-Flash-Thinking等领先的专有系统。
摘要:Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
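To make "dynamically adjust clipping bounds to rebalance positive and negative contributions" concrete, here is an illustrative rule of our own design, not BAPO's exact update: when negative-advantage samples dominate a batch, widen the upper clip bound (letting more positive, entropy-preserving updates through) and tighten the lower one.

```python
def adaptive_clip_bounds(advantages, base_low=0.2, base_high=0.2, strength=0.5):
    """Asymmetric PPO-style clip bounds driven by advantage imbalance.

    Illustrative sketch only. neg_frac > 0.5 means negatives dominate the
    batch, so the ratio interval [1-low, 1+high] is skewed upward.
    """
    neg_frac = sum(1 for a in advantages if a < 0) / len(advantages)
    imbalance = max(neg_frac - 0.5, 0.0)  # only react when negatives dominate
    high = base_high * (1 + strength * imbalance * 2)
    low = base_low * (1 - strength * imbalance * 2)
    return 1 - low, 1 + high

def clipped_ratio(ratio, advantages):
    lo, hi = adaptive_clip_bounds(advantages)
    return min(max(ratio, lo), hi)

balanced = adaptive_clip_bounds([1.0, -1.0, 0.5, -0.5])    # symmetric batch
skewed = adaptive_clip_bounds([-1.0, -2.0, -0.5, 1.0])     # negatives dominate
r = clipped_ratio(1.4, [-1.0, -2.0, -0.5, 1.0])
```

A fixed symmetric clip would treat both batches identically, which is exactly the behavior the paper's Entropy-Clip Rule identifies as driving over-exploitation.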
【114】Application of Reduced-Order Models for Temporal Multiscale Representations in the Prediction of Dynamical Systems
标题:时间多尺度表示的降阶模型在动力系统预测中的应用
链接:https://arxiv.org/abs/2510.18925
备注:Regular research article, 28 pages, 13 figures
摘要:由于复杂多尺度系统固有的非线性和对初始条件的敏感性,以及传统机器学习方法无法捕捉高频行为的局限性,建模和预测复杂多尺度系统的动态仍然是一个重大挑战。为了克服这些困难,我们提出了三种多尺度学习方法。第一种方法利用与神经网络集成的单位分割(PU)方法,将动态分解为局部分量,并直接预测宏观和微观尺度的行为。第二种应用奇异值分解(SVD)提取主导模式,明确分离宏观和微观尺度的动态。由于在实践中很少能完全访问数据矩阵,我们进一步采用稀疏高阶SVD从有限的测量中重建多尺度动态。总之,这些方法确保了精确捕获粗尺度和细尺度动态,通过对所研究现象中存在的所有时间尺度提供近似和解释,使该框架能有效用于涉及复杂多尺度现象的实际应用,并适用于具有不完整观测的高维系统。
摘要:Modeling and predicting the dynamics of complex multiscale systems remains a significant challenge due to their inherent nonlinearities and sensitivity to initial conditions, as well as limitations of traditional machine learning methods that fail to capture high frequency behaviours. To overcome these difficulties, we propose three approaches for multiscale learning. The first leverages the Partition of Unity (PU) method, integrated with neural networks, to decompose the dynamics into local components and directly predict both macro- and micro-scale behaviors. The second applies the Singular Value Decomposition (SVD) to extract dominant modes that explicitly separate macro- and micro-scale dynamics. Since full access to the data matrix is rarely available in practice, we further employ a Sparse High-Order SVD to reconstruct multiscale dynamics from limited measurements. Together, these approaches ensure that both coarse and fine dynamics are accurately captured, making the framework effective for real-world applications involving complex, multi-scale phenomena and adaptable to higher-dimensional systems with incomplete observations, by providing an approximation and interpretation in all time scales present in the phenomena under study.
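The SVD-based separation of the second approach amounts to splitting a data matrix into its dominant mode (macro-scale) and the residual (micro-scale). A pure-Python sketch using power iteration for the leading singular pair (a stand-in for a full SVD, on a toy nearly-rank-1 matrix of our choosing):

```python
def dominant_mode(X, iters=200):
    """Power iteration for the leading singular triple of a small matrix."""
    n, m = len(X), len(X[0])
    v = [1.0] * m
    for _ in range(iters):
        u = [sum(X[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [sum(X[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(X[i][j] * v[j] for j in range(m)) for i in range(n)]
    sigma = sum(x * x for x in u) ** 0.5
    u = [x / sigma for x in u]
    return sigma, u, v

def split_scales(X):
    """Macro = leading rank-1 mode; micro = everything the mode misses."""
    sigma, u, v = dominant_mode(X)
    macro = [[sigma * u[i] * v[j] for j in range(len(v))] for i in range(len(u))]
    micro = [[X[i][j] - macro[i][j] for j in range(len(v))] for i in range(len(u))]
    return macro, micro

# a rank-1 "macro" signal plus a small "micro" perturbation at one entry
X = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.01]]
macro, micro = split_scales(X)
```

By construction `macro + micro` reconstructs `X` exactly, and the micro component stays on the order of the injected perturbation, mirroring the macro/micro decomposition the abstract describes.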
【115】Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
标题:噪声修正GRPO:从噪声奖励到无偏梯度
链接:https://arxiv.org/abs/2510.18924
摘要:来自人类反馈的强化学习(RLHF)或可验证奖励的强化学习(RLVR)是对齐LLM或构建最近SOTA推理模型的标准范式,对来自不一致或错误奖励的噪声高度敏感。然而,这种噪声和广泛使用的基于组的策略优化方法之间的相互作用仍然没有得到充分的研究。我们引入了噪声鲁棒的组相对策略优化(GRPO)和Done Right GRPO(Dr.GRPO)框架,将奖励污染显式建模为伯努利噪声。我们的方法在估计奖励翻转概率后应用噪声校正来消除学习信号的偏差,从而产生可证明无偏的梯度估计。理论分析表明,基于组的方法本质上减轻了个体水平的噪声,而我们的校正策略放大了这种鲁棒性。从经验上讲,当将我们的噪声校正应用于标准奖励模型使用时,我们观察到数学和代码任务的一致改进:在现实奖励模型条件下,数学任务的准确性提高了多达6.7个百分点,代码任务提高了1.5个百分点。这项工作将监督学习的标签噪声校正与现代RLHF联系起来,为嘈杂的现实世界部署提供了理论见解和实用算法。
摘要:Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
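The debiasing step has a closed form for binary rewards under Bernoulli flip noise: with flip probability rho < 0.5, the corrected signal (r - rho) / (1 - 2*rho) is unbiased in expectation. This is the standard label-noise correction the abstract alludes to; the paper additionally estimates rho, which we simply assume known here:

```python
def corrected_reward(r, rho):
    """Debias a binary reward observed through symmetric Bernoulli noise.

    Standard label-noise correction: for flip probability rho (rho < 0.5),
    the expectation of the corrected signal over flips equals the true reward.
    """
    return (r - rho) / (1.0 - 2.0 * rho)

def expected_corrected(true_r, rho):
    """Expectation over the flip: with prob rho we observe 1 - true_r."""
    return (1 - rho) * corrected_reward(true_r, rho) + rho * corrected_reward(1 - true_r, rho)
```

For example, with rho = 0.2 and a true reward of 1, the observed reward is 1 with probability 0.8 and 0 with probability 0.2, yet the corrected values (4/3 and -1/3) average back to exactly 1, which is what makes the downstream policy-gradient estimate unbiased.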
【116】Benchmarking On-Device Machine Learning on Apple Silicon with MLX
标题:使用MLX在Apple Silicon上对设备上机器学习进行基准测试
链接:https://arxiv.org/abs/2510.18921
备注:19 pages, 6 figures. Presented at the 6th Deep Learning Indaba (DLI 2024), Dakar, Senegal; non-archival presentation. Poster: this https URL
摘要:最近大语言模型(LLM)和机器学习的广泛采用引发了研究兴趣,探索在笔记本电脑和手机等小型设备上部署这些模型的可能性。这就需要能够利用设备上硬件的框架和方法。创建MLX框架就是为了满足这一需求。它是一个针对Apple芯片设备上的机器学习(ML)计算进行优化的框架,便于更轻松地进行研究,实验和原型设计。 本文介绍了MLX的性能评估,重点是Transformer模型的推理延迟。我们比较了MLX中不同Transformer架构实现与Pytorch对应实现的性能。对于这项研究,我们创建了一个名为MLX-transformers的框架,其中包括MLX中的不同Transformer实现,并在pytorch中下载模型检查点并将其转换为MLX格式。通过利用Apple Silicon的高级架构和功能,MLX-Transformers可无缝执行直接源自Hugging Face的Transformer模型,从而消除了在框架之间移植模型时经常需要的检查点转换。 我们的研究在两台Apple Silicon macbook设备上对不同的Transformer模型进行了基准测试,并与NVIDIA CUDA GPU进行了比较。具体来说,我们比较了具有相同参数大小和检查点的模型的推理延迟性能。我们评估了BERT、RoBERTa和XLM-RoBERTa模型的性能,目的是扩展未来的工作,以包括不同模态的模型,从而对MLX的能力进行更全面的评估。这些结果凸显了MLX在Apple生态系统中实现高效且更易于访问的设备上ML应用程序的潜力。
摘要:The recent widespread adoption of Large Language Models (LLMs) and machine learning in general has sparked research interest in exploring the possibilities of deploying these models on smaller devices such as laptops and mobile phones. This creates a need for frameworks and approaches that are capable of taking advantage of on-device hardware. The MLX framework was created to address this need. It is a framework optimized for machine learning (ML) computations on Apple silicon devices, facilitating easier research, experimentation, and prototyping. This paper presents a performance evaluation of MLX, focusing on inference latency of transformer models. We compare the performance of different transformer architecture implementations in MLX with their PyTorch counterparts. For this research we create a framework called MLX-transformers which includes different transformer implementations in MLX, downloads the model checkpoints in PyTorch, and converts them to the MLX format. By leveraging the advanced architecture and capabilities of Apple Silicon, MLX-Transformers enables seamless execution of transformer models directly sourced from Hugging Face, eliminating the need for checkpoint conversion often required when porting models between frameworks. Our study benchmarks different transformer models on two Apple Silicon MacBook devices against an NVIDIA CUDA GPU. Specifically, we compare the inference latency performance of models with the same parameter sizes and checkpoints. We evaluate the performance of BERT, RoBERTa, and XLM-RoBERTa models, with the intention of extending future work to include models of different modalities, thus providing a more comprehensive assessment of MLX's capabilities. The results highlight MLX's potential in enabling efficient and more accessible on-device ML applications within Apple's ecosystem.
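A generic latency-benchmark harness in the spirit of the study above (our own sketch, not the paper's MLX-transformers code): warm up first so lazy compilation and caches settle, then report the median of several timed runs rather than the mean, which outliers skew.

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, runs=10):
    """Median wall-clock latency of fn(*args) after warmup iterations.

    Warmup matters for frameworks like MLX with lazy evaluation: the first
    call may trigger graph construction or compilation that should not be
    counted against steady-state inference latency.
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# stand-in workload; a real benchmark would call model(input_ids) here
latency = benchmark(sum, range(10_000))
```

Comparing the same checkpoint across frameworks then reduces to calling `benchmark` on each framework's forward pass with identical inputs.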
【117】Misinformation Detection using Large Language Models with Explainability
标题:使用具有可解释性的大型语言模型进行错误信息检测
链接:https://arxiv.org/abs/2510.18918
备注:Accepted for publication in the Proceedings of the 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)
摘要:网上平台上错误信息的迅速传播破坏了个人之间的信任,阻碍了知情决策。本文展示了一个可解释的和计算效率高的管道,使用基于transformer的预训练语言模型(PLM)来检测错误信息。我们使用两步策略来优化RoBERTa和DistilBERT:首先,我们冻结骨干并仅训练分类头;然后,我们逐步解冻骨干层,同时应用逐层学习率衰减。在两个真实世界的基准数据集COVID Fake News和FakeNewsNet GossipCop上,我们使用统一的预处理和分层分割协议测试了所提出的方法。为了确保透明度,我们在令牌级别集成了局部可解释模型不可知解释(LIME),以在全局特征属性级别呈现令牌级别的基本原理和SHapley加法解释(SHAP)。它表明,DistilBERT达到了与RoBERTa相当的精度,同时需要更少的计算资源。这项工作有两个关键贡献:(1)它定量地表明,轻量级PLM可以保持任务性能,同时大大降低计算成本,(2)它提出了一个可解释的管道,检索忠实的本地和全局的理由,而不影响性能。结果表明,PLM与原则性微调和可解释性相结合,可以成为一个有效的框架,可扩展的,值得信赖的错误信息检测。
摘要:The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making. This paper shows an explainable and computationally efficient pipeline to detect misinformation using transformer-based pretrained language models (PLMs). We optimize both RoBERTa and DistilBERT using a two-step strategy: first, we freeze the backbone and train only the classification head; then, we progressively unfreeze the backbone layers while applying layer-wise learning rate decay. On two real-world benchmark datasets, COVID Fake News and FakeNewsNet GossipCop, we test the proposed approach with a unified protocol of preprocessing and stratified splits. To ensure transparency, we integrate the Local Interpretable Model-Agnostic Explanations (LIME) at the token level to present token-level rationales and SHapley Additive exPlanations (SHAP) at the global feature attribution level. It demonstrates that DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly less computational resources. This work makes two key contributions: (1) it quantitatively shows that a lightweight PLM can maintain task performance while substantially reducing computational cost, and (2) it presents an explainable pipeline that retrieves faithful local and global justifications without compromising performance. The results suggest that PLMs combined with principled fine-tuning and interpretability can be an effective framework for scalable, trustworthy misinformation detection.
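The second step of the fine-tuning recipe, progressive unfreezing with layer-wise learning-rate decay, boils down to assigning each layer a rate that shrinks geometrically with its distance from the classification head. A minimal sketch (the rates and decay factor are illustrative, not the paper's exact hyperparameters):

```python
def layerwise_lrs(num_layers, head_lr=2e-5, decay=0.9):
    """Per-layer learning rates for layer-wise learning-rate decay.

    The classification head gets the full rate; each transformer layer
    below it is scaled by `decay` once per step down, so layers nearest
    the input barely move and pretrained features are preserved.
    """
    lrs = {"head": head_lr}
    for layer in range(num_layers - 1, -1, -1):
        depth_from_top = num_layers - 1 - layer
        lrs[f"layer_{layer}"] = head_lr * decay ** (depth_from_top + 1)
    return lrs

lrs = layerwise_lrs(num_layers=6)
```

In a framework like PyTorch these values would become per-parameter-group learning rates; the first step of the recipe (frozen backbone, head only) is the special case where every layer rate is zero.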
【118】MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels
标题:MMAO-Bench:多模态一体化基准揭示全模态模型中单模态与全模态之间的组合规律
链接:https://arxiv.org/abs/2510.18915
备注:10 pages, 8 figures. Work in progress
摘要:多模态大型语言模型已经从单模态理解发展到统一视觉、音频和语言模态,统称为全模态模型。然而,单模态和全模态能力之间的相关性仍然不清楚,这需要综合评估来驱动全模态模型的智能进化。在这项工作中,我们提出了一个新颖、高质量且多样化的全模态模型基准,多模态一体化基准(MMAO-Bench),可以有效地评估单模态和全模态的理解能力。该基准由1880个人工策划的样本组成,涵盖44种任务类型,并包含一种创新的多步骤开放式问题类型,可以更好地评估复杂的推理任务。实验结果揭示了单模态与全模态性能之间的组合规律:全模态能力在弱模型上表现为瓶颈效应,而在强模型上表现为协同促进。
摘要:Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni models' intelligence evolution. In this work, we propose a novel, high-quality and diverse omni model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show a compositional law between cross-modal and uni-modal performance: the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
【119】Context-aware Fairness Evaluation and Mitigation in LLMs
标题:LLM中的上下文感知公平性评估和缓解
链接:https://arxiv.org/abs/2510.18914
备注:PrePrint
摘要:大型语言模型通常会显示出嵌入其内部表示中的不良行为,破坏公平性,不一致漂移,有害内容的放大以及在扩展对话和会话期间传播不需要的模式。虽然训练时间或以数据为中心的方法试图减少这些影响,但它们在计算上是昂贵的,一旦部署就不可逆转,并且适应新的会话上下文的速度很慢。基于修剪的方法提供了一种灵活透明的方法,通过调整负责某些行为的神经元来减少偏差。然而,大多数现有的方法都是静态的;一旦删除一个神经元,当对话或上下文发生变化时,模型就失去了适应能力。为了解决这个问题,我们提出了一个动态的,可逆的,基于修剪的框架,检测上下文感知的神经元激活,并应用自适应掩蔽来调节它们在生成过程中的影响。我们的推理时间解决方案提供了细粒度的内存感知缓解,在多语言单轮和多轮对话中具有知识保留的更一致的行为,从而实现了真实世界对话AI中的动态公平控制。
摘要:Large language models often display undesirable behaviors embedded in their internal representations: undermined fairness, consistency drift, amplification of harmful content, and the propagation of unwanted patterns during extended dialogues and conversations. Although training-time or data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods provide a flexible and transparent way to reduce bias by adjusting the neurons responsible for certain behaviors. However, most existing approaches are static; once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation with knowledge-preserving, more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.
【120】ADPO: Anchored Direct Preference Optimization
标题:ADPO:锚定直接偏好优化
链接:https://arxiv.org/abs/2510.18913
摘要:锚定直接偏好优化(Anchored Direct Preference Optimization,ADPO)是一个将直接偏好优化(Direct Preference Optimization,DPO)推广为软偏好、参考策略锚定和分组扩展的统一框架。虽然标准DPO假设硬二进制标签和成对比较,但ADPO引入了:(i)编码不确定性和减轻梯度漂移的软偏好概率;(ii)通过分组移位不变性和隐式KL正则化稳定训练的任意参考策略锚;以及(iii)通过Plackett-Luce分布进行列表偏好建模。我们证明了DPO、Bradley-Terry目标以及Top-1-vs-Rest形式均可作为其特殊情况导出。ADPO产生了三个实用的变体:成对锚定的软DPO,带原始奖励的列表锚定软DPO,以及针对重尾噪声的基于KDE的列表平滑。在上下文多臂老虎机实验中,锚定比标准DPO将WinMass提高了38-63%,而在重尾污染下,KDE平滑达到0.68,标准方法为0.32(相对增益112%)。在顺序强化学习(CartPole,LunarLander)中,锚定将噪声偏好下的性能提高了15-29%,证实了从单步到多步设置的迁移。使用10-256参数模型的实验提供了明确的指导:在干净或中等噪声下使用成对锚定的Soft-DPO,在极端污染下使用基于KDE的列表ADPO。
摘要:Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
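A minimal sketch of the pairwise anchored soft-DPO objective described in the ADPO abstract, assuming the usual DPO-style implicit-reward margin; the function and variable names are illustrative, not the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_anchored_dpo_loss(logp_w, logp_l, anchor_w, anchor_l, p=1.0, beta=1.0):
    """Pairwise soft-DPO loss: log-probs of the chosen (w) and rejected (l)
    responses are anchored to a reference policy, and the hard binary label
    is replaced by a soft preference probability p."""
    margin = beta * ((logp_w - anchor_w) - (logp_l - anchor_l))
    # Soft-label binary cross-entropy on the preference margin;
    # p = 1 recovers the standard hard-label DPO objective.
    return -(p * math.log(sigmoid(margin)) + (1.0 - p) * math.log(sigmoid(-margin)))
```

Setting p to 1 and the anchors to reference-policy log-probabilities recovers standard DPO, while intermediate p encodes label uncertainty.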
【121】Large Connectome Model: An fMRI Foundation Model of Brain Connectomes Empowered by Brain-Environment Interaction in Multitask Learning Landscape
标题:大连接组模型:多任务学习环境中由脑-环境交互驱动的脑连接组的fMRI基础模型
链接:https://arxiv.org/abs/2510.18910
备注:12 pages 6 figures
摘要:功能神经图像的可靠基础模型对于促进临床应用至关重要,因为当前AI模型的性能受到有限样本量的严重阻碍。为此,人们已经做出了巨大的努力,使用可扩展的自监督学习在大量未标记的fMRI数据上预训练大型模型。由于自我监督不一定与大脑与结果的关系一致,因此大多数基础模型对于下游任务(如预测疾病结果)来说都是次优的。通过利用丰富的环境变量和人口统计数据以及前所未有的功能神经图像,我们将大脑建模构建为多任务学习,并提出了一个可扩展的模型架构,用于(i)通过标记多个脑环境交互(BEI)进行多任务预训练,以及(ii)通过分配预训练BEI的伪标签进行半监督微调。我们已经在各种应用上评估了我们的基础模型,包括性别预测、人类行为识别,以及自闭症、帕金森病、阿尔茨海默病和精神分裂症的早期诊断;这些有希望的结果表明,该模型在促进临床常规中的神经成像应用方面具有巨大潜力。
摘要:A reliable foundation model of functional neuroimages is critical to promote clinical applications where the performance of current AI models is significantly impeded by a limited sample size. To that end, tremendous efforts have been made to pretrain large models on extensive unlabeled fMRI data using scalable self-supervised learning. Since self-supervision is not necessarily aligned with the brain-to-outcome relationship, most foundation models are suboptimal for downstream tasks, such as predicting disease outcomes. By capitalizing on rich environmental variables and demographic data along with an unprecedented amount of functional neuroimages, we formulate brain modeling as multitask learning and present a scalable model architecture for (i) multitask pretraining by tokenizing multiple brain-environment interactions (BEI) and (ii) semi-supervised finetuning by assigning pseudo-labels of pretrained BEI. We have evaluated our foundation model on a variety of applications, including sex prediction, human behavior recognition, and early diagnosis of Autism, Parkinson's disease, Alzheimer's disease, and Schizophrenia, where promising results indicate great potential to facilitate current neuroimaging applications in clinical routines.
【122】Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
标题:不同地向最优秀者学习:多元化驱动的数据选择重新思考
链接:https://arxiv.org/abs/2510.18909
摘要:高质量的预训练数据对于大型语言模型至关重要,其中质量捕获事实可靠性和语义价值,多样性确保广泛的覆盖范围和分布异质性。现有的方法通常依赖于基于单维或多维分数的选择。然而,直接选择得分最高的数据通常会降低性能,并且需要从更广泛的范围内进行采样才能恢复结果。上述数据集分数和下游基准结果之间的非单调性揭示了一个根本性的偏差:基于分数的方法会使相关维度崩溃,导致得分最高的数据看起来高质量,而系统地忽略了多样性。我们认为,确保多样性需要将相关指标分解为正交的特征维度,从而可以直接选择各维度上得分最高的数据。因此,我们提出了正交多样性感知选择(ODiS)算法,该算法在数据选择过程中保持质量和多样性。首先,ODiS从多个维度评估数据,包括语言质量、知识质量和理解难度。然后通过主成分分析(PCA)对多维分数进行去相关,产生正交评估维度。对于每个维度,训练基于RoBERTa的评分器将数据回归到PCA投影的分数上,从而实现对大型语料库的可扩展推断。最后,ODiS通过选择每个正交维度内得分最高的数据来构建训练数据集,从而确保质量和多样性。实证结果表明,ODiS选择的数据表现出小于2%的维度间重叠,确认维度之间的正交性。更重要的是,使用ODiS选择的数据训练的模型在下游基准上的表现明显优于其他基线,突出了LLM正交、多样性感知数据选择的必要性。
摘要:High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. The above non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
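The two-stage selection the ODiS abstract describes, PCA decorrelation of correlated quality scores followed by top-k selection along each orthogonal dimension, can be sketched with NumPy; the paper's RoBERTa scorers and exact selection budgets are replaced by assumptions here:

```python
import numpy as np

def odis_select(scores, k_per_dim):
    """scores: (n_samples, n_dims) matrix of correlated quality metrics.
    Returns indices of items that are top-k along any PCA-orthogonalized axis."""
    X = scores - scores.mean(axis=0)            # center each metric
    cov = np.cov(X, rowvar=False)               # correlation structure
    _, eigvecs = np.linalg.eigh(cov)            # orthogonal principal axes
    Z = X @ eigvecs                             # decorrelated scores
    selected = set()
    for d in range(Z.shape[1]):
        top = np.argsort(Z[:, d])[-k_per_dim:]  # top-k on this orthogonal axis
        selected.update(int(i) for i in top)
    return sorted(selected)
```

Because cov(XV) = Vᵀcov(X)V is diagonal when V holds the covariance eigenvectors, the projected dimensions are decorrelated, so "top-scored per dimension" no longer double-counts correlated metrics.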
【123】Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
标题:通过改写改进社交媒体短文本的主题建模:COVID-19相关推文的案例研究
链接:https://arxiv.org/abs/2510.18908
摘要:Twitter(现在的X)等社交媒体平台为分析公共话语提供了丰富的数据,特别是在COVID-19大流行等危机期间。然而,社交媒体短文本的简洁性、非正式性和噪音往往会妨碍传统主题建模的效果,产生不连贯或冗余、往往难以解释的主题。为了应对这些挑战,我们开发了TM-Rephrase,一个与模型无关的框架,利用大型语言模型(LLM)在主题建模之前将原始推文改写为更标准、更正式的语言。使用25,027条与COVID-19相关的Twitter帖子的数据集,我们研究了两种改写策略(通用改写和口语到正式改写)对多种主题建模方法的影响。结果表明,TM-Rephrase改进了衡量主题建模性能的三个指标(即主题一致性、主题唯一性和主题多样性),同时减少了大多数主题建模算法的主题冗余,其中口语到正式策略产生最大的性能增益,对潜在狄利克雷分配(LDA)算法尤其如此。这项研究为增强公共卫生相关社交媒体分析中的主题建模贡献了一种与模型无关的方法,对于更好地理解健康危机以及其他重要领域中的公众话语具有广泛意义。
摘要:Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general- and colloquial-to-formal-rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes a model-agnostic approach to enhancing topic modeling in public-health-related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.
【124】3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
标题:人工智能推理缩放的3D优化:平衡准确性、成本和延迟
链接:https://arxiv.org/abs/2510.18905
摘要:AI推理缩放通常通过1D启发式(固定的推理次数)或2D双变量权衡(例如,性能与计算量)进行调优,二者都未能考虑成本和延迟约束。我们引入了一个3D优化框架,该框架在统一的决策空间内联合校准准确性、成本和延迟,从而实现约束感知的推理缩放。使用蒙特卡洛模拟,在三个代表性场景和九个模拟的大型语言模型上,我们评估了四种优化方法来解决3D多目标优化(MOO)问题。在MOO框架下的推理缩放塑造了一个1D和2D优化无法捕获的可行空间,实现了对推理缩放参数k的环境自适应选择。结果表明,拐点优化实现了最佳平衡,而当精度优先时,准确率最大化仍然是有利的。该框架为不同运行环境下的部署感知推理扩展奠定了理论基础。
摘要:AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
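One common reading of "knee-point optimization" over the three axes is choosing the candidate closest to the normalized ideal corner (maximum accuracy, minimum cost, minimum latency); the sketch below is an illustrative stand-in, not the paper's exact formulation:

```python
def knee_point(candidates):
    """candidates: list of (accuracy, cost, latency) tuples.
    Returns the candidate nearest the ideal corner in normalized space."""
    accs  = [c[0] for c in candidates]
    costs = [c[1] for c in candidates]
    lats  = [c[2] for c in candidates]

    def norm(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)

    best, best_d = None, float("inf")
    for acc, cost, lat in candidates:
        # Euclidean distance to the ideal point (1, 0, 0) after normalization.
        d = ((1 - norm(acc, min(accs), max(accs))) ** 2
             + norm(cost, min(costs), max(costs)) ** 2
             + norm(lat, min(lats), max(lats)) ** 2) ** 0.5
        if d < best_d:
            best, best_d = (acc, cost, lat), d
    return best
```

A pure accuracy-maximizer would pick the most accurate candidate regardless of cost; the knee point instead trades a little accuracy for large cost and latency savings.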
【125】DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
标题:DuoLens:机器生成的多语言文本和代码的稳健检测框架
链接:https://arxiv.org/abs/2510.18904
备注:Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025): 4th Workshop on Deep Learning for Code
摘要:用于生成多语言文本和源代码的大型语言模型(LLM)的流行只会增加机器生成内容检测器在各个领域中准确和高效的必要性。当前的检测器主要利用zero-shot方法,例如Fast DetectGPT或GPTZero,要么导致高计算成本,要么缺乏足够的准确性,通常在两者之间进行权衡,留下进一步改进的空间。为了解决这些差距,我们建议对仅编码器的小语言模型(SLM)进行微调,特别是使用源代码和其他自然语言的专用数据集微调RoBERTa和CodeBERTa的预训练模型,以证明对于二进制分类任务,SLM在仅使用一小部分计算量的同时大大优于LLM。我们的编码器在512个令牌输入下实现了0.97至0.99的AUROC和0.89至0.94的宏F1,同时将延迟降低8至12倍,将峰值VRAM降低3至5倍。在跨生成器转换和对抗转换(释义、反向翻译;代码格式化/重命名)下,性能保持了干净AUROC的92%以上。我们发布了包含随机种子和配置的训练和评估脚本,并附有可复现性清单。
摘要:The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa, using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
【126】Evaluating LLMs for Career Guidance: Comparative Analysis of Computing Competency Recommendations Across Ten African Countries
标题:评估LLM的职业指导:十个非洲国家计算能力建议的比较分析
链接:https://arxiv.org/abs/2510.18902
备注:42 pages, 2 figures, 5 tables. Submitted to Computers & Education Open Access
摘要:雇主越来越希望毕业生在工作场所使用大型语言模型(LLM),但由于各国国情不同,非洲各地计算角色所需的能力仍不清楚。这项研究调查了六个LLM,即ChatGPT 4,DeepSeek,Gemini,Claude 3.5,Llama 3和Mistral AI,如何描述十个非洲国家的入门级计算职业期望。使用计算课程2020框架,并借鉴数字殖民主义理论和Ubuntu哲学,我们分析了60个LLM对标准化提示的响应。云计算和编程等技术技能似乎是一致的,但在模型如何处理非技术能力方面出现了显着差异,特别是道德和负责任的人工智能使用。各种模式在认识到具体国家的因素方面有很大差异,包括当地的技术生态系统、语言要求和国家政策。开源模型表现出更强的情境意识,以及技术和专业技能之间的更好平衡,在十个国家中的九个中获得最高分。尽管如此,所有模型都在文化敏感性和基础设施考虑方面苦苦挣扎,平均只有35.4%的上下文意识。这是对非洲计算机专业学生LLM职业指导的第一次广泛比较,揭示了根深蒂固的基础设施假设和以西方为中心的偏见,在技术建议和当地需求之间造成了差距。与专有替代品(ChatGPT 4:3.90/5; Claude:3.46/5)相比,具有成本效益的开源模型(Llama:4.47/5; DeepSeek:4.25/5)的强劲表现挑战了资源受限环境下人工智能工具质量的假设。我们的研究结果强调了计算能力要求在非洲各地的差异,并强调了在教育中采用非殖民化人工智能方法的必要性,这些方法强调了上下文相关性
摘要:Employers increasingly expect graduates to utilize large language models (LLMs) in the workplace, yet the competencies needed for computing roles across Africa remain unclear given varying national contexts. This study examined how six LLMs, namely ChatGPT 4, DeepSeek, Gemini, Claude 3.5, Llama 3, and Mistral AI, describe entry-level computing career expectations across ten African countries. Using the Computing Curricula 2020 framework and drawing on Digital Colonialism Theory and Ubuntu Philosophy, we analyzed 60 LLM responses to standardized prompts. Technical skills such as cloud computing and programming appeared consistently, but notable differences emerged in how models addressed non-technical competencies, particularly ethics and responsible AI use. Models varied considerably in recognizing country-specific factors, including local technology ecosystems, language requirements, and national policies. Open-source models demonstrated stronger contextual awareness and a better balance between technical and professional skills, earning top scores in nine of ten countries. Still, all models struggled with cultural sensitivity and infrastructure considerations, averaging only 35.4% contextual awareness. This first broad comparison of LLM career guidance for African computing students uncovers entrenched infrastructure assumptions and Western-centric biases, creating gaps between technical recommendations and local needs. The strong performance of cost-effective open-source models (Llama: 4.47/5; DeepSeek: 4.25/5) compared to proprietary alternatives (ChatGPT 4: 3.90/5; Claude: 3.46/5) challenges assumptions about AI tool quality in resource-constrained settings. Our findings highlight how computing competency requirements vary widely across Africa and underscore the need for decolonial approaches to AI in education that emphasize contextual relevance
【127】AI for Distributed Systems Design: Scalable Cloud Optimization Through Repeated LLMs Sampling And Simulators
标题:分布式系统设计的人工智能:通过重复LLM采样和模拟器进行可扩展云优化
链接:https://arxiv.org/abs/2510.18897
备注:Pre-print IAAA workshop submission
摘要:我们通过将大型语言模型(LLM)的随机代码生成与特定领域模拟器中的确定性验证相结合,探索AI驱动的分布式系统策略设计。以函数即服务运行时(Bauplan)及其开源模拟器(Eudoxia)为案例研究,我们将调度器设计构建为迭代的生成-验证循环:LLM提出Python策略,模拟器在标准化轨迹上对其进行评估,结构化反馈引导后续生成。这种设置保留了可解释性,同时支持在大型设计空间内进行有针对性的搜索。我们详细介绍了系统架构,并报告了跨多个模型的吞吐量提升的初步结果。除了早期的收益,我们还讨论了当前设置的局限性并概述了后续步骤;特别是,我们推测,人工智能将通过帮助引导新模拟器的构建,对扩展这种方法至关重要。
摘要:We explore AI-driven distributed-systems policy design by combining stochastic code generation from large language models (LLMs) with deterministic verification in a domain-specific simulator. Using a Function-as-a-Service runtime (Bauplan) and its open-source simulator (Eudoxia) as a case study, we frame scheduler design as an iterative generate-and-verify loop: an LLM proposes a Python policy, the simulator evaluates it on standardized traces, and structured feedback steers subsequent generations. This setup preserves interpretability while enabling targeted search over a large design space. We detail the system architecture and report preliminary results on throughput improvements across multiple models. Beyond early gains, we discuss the limits of the current setup and outline next steps; in particular, we conjecture that AI will be crucial for scaling this methodology by helping to bootstrap new simulators.
【128】CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
标题:CosmoCore情感梦想回放强化学习用于代码生成
链接:https://arxiv.org/abs/2510.18895
备注:12 pages
摘要:我们介绍了CosmoCore,这是一种受神经科学启发的强化学习(RL)架构,它集成了情感信号,以增强大型语言模型(LLM)中的代码生成。受人类和动物学习的启发(错误带来的尴尬促使快速纠正,正如训练小狗在一次责备后避免重复错误时所观察到的那样),CosmoCore使用轻量级多层感知器(MLP)为代码生成轨迹标记效价和惊喜度。高负效价("畏缩")事件,如错误代码输出,在梦境队列(Dream Queue)中被优先考虑,以便在非策略更新期间进行五次重播,而低惊喜度的成功则被剪枝,以防止过度自信和缓冲区膨胀。在HumanEval和BigCodeBench等代码生成基准上进行评估,并使用自定义数据管道环境进行模拟,CosmoCore将幻觉代码(例如语法错误或逻辑错误)减少了48%,并将自我纠正加速了45%。在PySpark环境中使用Hugging Face模型的本地实验验证了这些收益,并提供了代码片段用于复现。消融实验证实,效价标记增强了探索中的好奇心,剪枝则缓解了低效问题。该框架扩展了基于人类反馈的强化学习(RLHF),以构建更具情感意识的代码助手,并可应用于IDE和数据管道。代码和自定义迷你世界模拟已发布。
摘要:We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning where embarrassment from mistakes drives rapid correction, as observed in training a puppy to avoid repeating errors after a single scolding CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48\% and accelerates self-correction by 45\%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm valence tagging boosts curiosity in exploration, and pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) for more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.
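The valence-prioritized Dream Queue and low-surprise pruning described above can be sketched as a small priority queue; the pruning threshold and tuple layout are assumptions, with only the five-fold replay and the pruning behavior taken from the abstract:

```python
import heapq

class DreamQueue:
    """Prioritizes high-negative-valence ('cringe') episodes for replay and
    prunes low-surprise successes to limit overconfidence and buffer bloat."""

    def __init__(self, surprise_floor=0.1):
        self.heap = []
        self.surprise_floor = surprise_floor
        self._counter = 0  # tie-breaker so episode payloads are never compared

    def push(self, episode, valence, surprise):
        if valence > 0 and surprise < self.surprise_floor:
            return  # prune: an unsurprising success adds little learning signal
        # min-heap on valence: the most negative (most 'cringe') pops first
        heapq.heappush(self.heap, (valence, -surprise, self._counter, episode))
        self._counter += 1

    def next_replays(self, n_replays=5):
        _, _, _, episode = heapq.heappop(self.heap)
        return [episode] * n_replays  # five-fold replay per the abstract
```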
【129】CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation
标题:CodeCRDT:多代理LLM代码生成的观察驱动协调
链接:https://arxiv.org/abs/2510.18893
备注:11 pages, 3 figures
摘要:多代理LLM系统由于昂贵的协调而无法实现并行加速。我们提出了CodeCRDT,一种观察驱动的协调模式,其中代理通过监视具有可观察更新和确定性收敛的共享状态进行协调,而不是通过显式消息传递。CodeCRDT使用无冲突复制数据类型(CRDT),支持无锁、无冲突的并发代码生成,并具有强最终一致性。通过600次试验(6个任务,每种模式运行50次)的评估显示了优势和权衡:某些任务的加速高达21.1%,其他任务的减速高达39.4%,并且100%收敛,零合并失败。该研究正式化了随机LLM代理的观察驱动协调,揭示了语义冲突率(5-10%)和质量与性能的权衡,并基于任务结构对并行协调的成败条件给出了经验表征。
摘要:Multi-agent LLM systems fail to realize parallel speedups due to costly coordination. We present CodeCRDT, an observation-driven coordination pattern where agents coordinate by monitoring a shared state with observable updates and deterministic convergence, rather than explicit message passing. Using Conflict-Free Replicated Data Types (CRDTs), CodeCRDT enables lock-free, conflict-free concurrent code generation with strong eventual consistency. Evaluation across 600 trials (6 tasks, 50 runs per mode) shows both benefits and trade-offs: up to 21.1% speedup on some tasks, up to 39.4% slowdown on others, and 100% convergence with zero merge failures. The study formalizes observation-driven coordination for stochastic LLM agents, revealing semantic conflict rates (5-10%) and quality-performance tradeoffs, and provides empirical characterization of when parallel coordination succeeds versus fails based on task structure.
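The convergence guarantee CodeCRDT relies on comes from CRDT merge semantics; the smallest illustration is a grow-only set, whose merge (set union) is commutative, associative, and idempotent, so replicas converge regardless of update order (the paper's document CRDT is richer than this sketch):

```python
class GSet:
    """Grow-only set CRDT: state is a set, merge is set union."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Union is commutative, associative, and idempotent, which yields
        # strong eventual consistency without locks or conflict resolution.
        self.items |= other.items
```

Two agents appending code fragments concurrently can merge in either order and reach the same state, which is why zero merge failures are possible by construction.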
【130】Small Language Models Offer Significant Potential for Science Community
标题:小型语言模型为科学界提供巨大潜力
链接:https://arxiv.org/abs/2510.18890
摘要:自然语言处理的最新进展,特别是大型语言模型(LLM),正在改变科学家处理文献的方式。虽然LLM的采用正在增加,但对潜在信息偏差和计算成本的担忧仍然存在。我没有采用LLM,而是开发了一个框架,以评估使用免费提供的小语言模型(MiniLM)从海量地球科学文献中进行精确、快速且具有成本效益的信息检索的可行性。我构建了一个由大约7700万个高质量句子组成的精选语料库,这些句子提取自95种领先的同行评审地球科学期刊(如Geophysical Research Letters和Earth and Planetary Science Letters)2000年至2024年间发表的论文。MiniLM通过语义搜索技术和句子级索引,以计算高效的方式从这些语料库中提取相关的特定领域信息。与ChatGPT-4等通常产生泛化回答的LLM不同,这种方法擅长从已建立的多学科来源中识别大量经专家验证的信息,特别是具有定量结果的信息。此外,MiniLM通过情感分析刻画情感基调,并通过句子级无监督聚类识别主题簇,为跟踪地球科学界结论、研究重点、进展和新出现问题的演变提供了强大工具。总体而言,MiniLM在地球科学界的事实和图像检索、趋势分析、矛盾分析和教育等应用中具有显著潜力。
摘要:Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences was constructed, extracted from 95 leading peer-reviewed geoscience journals, such as Geophysical Research Letters and Earth and Planetary Science Letters, published from 2000 to 2024. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4 that often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering of sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrieval, trend analyses, contradiction analyses, and educational purposes.
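The sentence-level semantic search underlying the MiniLM framework reduces to cosine ranking over sentence embeddings; treating the embeddings as given (a real pipeline would obtain them from a MiniLM encoder, which is assumed here), the retrieval core looks like:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_search(query_vec, corpus, top_k=3):
    """corpus: list of (sentence_id, embedding) pairs.
    Returns the top_k sentence ids ranked by similarity to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sid for sid, _ in ranked[:top_k]]
```

At the 77-million-sentence scale the abstract describes, the linear scan would be replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.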
【131】Contextual Augmentation for Entity Linking using Large Language Models
标题:使用大型语言模型进行实体链接的上下文增强
链接:https://arxiv.org/abs/2510.18888
摘要:实体链接涉及检测自然语言文本中的实体提及并将其链接到知识图。传统的方法使用两个步骤的过程,具有用于实体识别和消歧的单独模型,这可能是计算密集型的并且效率较低。我们提出了一个微调的模型,共同集成在一个统一的框架中的实体识别和消歧。此外,我们的方法利用大型语言模型来丰富实体提及的上下文,从而在实体消歧中获得更好的性能。我们在基准数据集上评估了我们的方法,并与几个基线进行了比较。评估结果表明,我们的方法在域外数据集上实现了最先进的性能。
摘要:Entity Linking involves detecting entity mentions in natural language texts and linking them to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better performance in entity disambiguation. We evaluated our approach on benchmark datasets and compared it with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.
【132】LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure
标题:LLM Bazaar:一种支持协作学习的服务设计,具有LLM-Powered多方协作基础设施
链接:https://arxiv.org/abs/2510.18877
备注:https://repository.isls.org//handle/1/11832
摘要:近二十年来,会话代理在构建协作学习中的交互、塑造群体动态和支持学生参与方面发挥了关键作用。最近将大型语言模型(LLM)集成到这些代理中,为培养批判性思维和协作解决问题提供了新的可能性。在这项工作中,我们以一个名为Bazaar的开源协作支持架构为起点,集成了一个LLM代理外壳,从而为小组学习引入由LLM驱动的、实时的、上下文敏感的协作支持。这种设计和基础设施为探索量身定制的LLM赋能环境如何重塑协作学习的成果和互动模式铺平了道路。
摘要:For nearly two decades, conversational agents have played a critical role in structuring interactions in collaborative learning, shaping group dynamics, and supporting student engagement. The recent integration of large language models (LLMs) into these agents offers new possibilities for fostering critical thinking and collaborative problem solving. In this work, we begin with an open source collaboration support architecture called Bazaar and integrate an LLM-agent shell that enables introduction of LLM-empowered, real time, context sensitive collaborative support for group learning. This design and infrastructure paves the way for exploring how tailored LLM-empowered environments can reshape collaborative learning outcomes and interaction patterns.
【133】Actor-Free Continuous Control via Structurally Maximizable Q-Functions
标题:通过结构上可最大化的Q-函数实现无参与者连续控制
链接:https://arxiv.org/abs/2510.18828
备注:39th Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要:基于值的算法由于其简单性和训练稳定性而成为非策略强化学习的基石。然而,它们的使用传统上仅限于离散的动作空间,因为它们依赖于估计单个状态动作对的Q值。在连续动作空间中,在整个动作空间上评估Q值在计算上变得不可行。为了解决这个问题,通常采用行动者-批评者方法,其中批评者在非策略数据上被训练以估计Q值,并且行动者被训练以最大化批评者的输出。尽管这些方法很受欢迎,但它们在训练期间往往不稳定。在这项工作中,我们提出了一个纯粹基于值的连续控制框架,重新审视Q函数的结构最大化,引入了一组关键的架构和算法选择,以实现高效和稳定的学习。我们在一系列标准模拟任务上评估了所提出的无演员Q学习方法,展示了与最先进基线相当的性能和样本效率,而无需学习单独演员的成本。特别是,在具有约束动作空间的环境中,其中值函数通常是非光滑的,我们的结构最大化方法优于传统的基于梯度最大化的演员评论家方法。我们已在https://github.com/USC-Lira/Q3C上发布了代码。
摘要:Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.
【134】SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems
标题:SystMoBench:评估人工智能对复杂现实世界系统的形式建模
链接:https://arxiv.org/abs/2509.23130
摘要:形式化模型对于指定大型复杂的计算机系统并验证其正确性至关重要,但众所周知,编写和维护成本高昂。生成式人工智能的最新进展显示出生成某些形式规范的希望。然而,现有的工作主要针对小段代码,而不是完整的系统。目前还不清楚人工智能是否可以处理真实的系统工件,因为这需要将其复杂的行为属性抽象到形式化模型中。我们提出了SysMoBench,这是一个评估AI形式化建模大型复杂系统能力的基准。我们专注于并发和分布式系统,这是当今关键计算基础设施的基石,包括操作系统和云基础设施。我们使用TLA+(并发和分布式系统事实上的规范语言),尽管该基准可以扩展到其他规范语言。我们通过自动化语法和运行时正确性、与系统代码的一致性以及不变式正确性等指标,来解决评估AI生成模型的主要挑战。SysMoBench目前包括九个不同的系统工件:Etcd和Redis的Raft实现,Asterinas OS中的Spinlock和Mutex等;更多工件正在积极添加中。SysMoBench使我们能够了解当今LLM和代理的能力和局限性,将该领域的工具置于坚实的基础上,并开辟有前景的新研究方向。
摘要:Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes nine diverse system artifacts: the Raft implementation of Etcd and Redis, the Spinlock and Mutex in Asterinas OS, etc.; more artifacts are being actively added. SysMoBench enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.
【135】A Unified Formal Theory on the Logical Limits of Symbol Grounding
标题:符号基础逻辑极限的统一形式理论
链接:https://arxiv.org/abs/2509.20409
备注:8 pages, 1 figure. A formal proof on the logical limits of symbol grounding
摘要:本文综合了一系列形式证明,构造了符号接地问题逻辑极限的统一理论。我们通过一个四阶段的论证证明,在一个正式的系统中的意义必须产生于一个外部的、动态的和非算法的过程。首先,我们证明,任何纯粹的符号系统,缺乏外部连接,不能建立一个内部一致的基础上的意义,由于自我指涉的悖论。第二,我们将这种限制扩展到具有任何有限的、静态的预先建立的意义集的系统,证明它们本质上是不完整的。第三,我们证明,非常“行为”的内部符号连接到外部意义不能是系统内的逻辑推理的产物,但必须是一个公理,元级更新。最后,我们证明,任何尝试自动化这个更新过程中使用一个固定的,外部的“判断”算法将不可避免地构建一个更大的,但同样不完整的,符号系统。总之,这些结论正式确立了意义的基础是一个必然开放的、非算法的过程,揭示了任何独立智能系统的基本的、哥德尔式的限制。
摘要:This paper synthesizes a series of formal proofs to construct a unified theory on the logical limits of the Symbol Grounding Problem. We demonstrate through a four-stage argument that meaning within a formal system must arise from a process that is external, dynamic, and non-algorithmic. First, we prove that any purely symbolic system, devoid of external connections, cannot internally establish a consistent foundation for meaning due to self-referential paradoxes. Second, we extend this limitation to systems with any finite, static set of pre-established meanings, proving they are inherently incomplete. Third, we demonstrate that the very "act" of connecting an internal symbol to an external meaning cannot be a product of logical inference within the system but must be an axiomatic, meta-level update. Finally, we prove that any attempt to automate this update process using a fixed, external "judgment" algorithm will inevitably construct a larger, yet equally incomplete, symbolic system. Together, these conclusions formally establish that the grounding of meaning is a necessarily open-ended, non-algorithmic process, revealing a fundamental, Gödel-style limitation for any self-contained intelligent system.
【136】Demonstrating Real Advantage of Machine-Learning-Enhanced Monte Carlo for Combinatorial Optimization
标题:展示机器学习增强蒙特卡洛用于组合优化的真正优势
链接:https://arxiv.org/abs/2510.19544
备注:13 main pages, 6 main figures. 4 supplementary pages, 2 supplementary figures
摘要:组合优化问题是实际应用和优化方法发展的核心。虽然经典算法和量子算法在过去几十年中已经得到了改进,但机器学习辅助方法相对较新,并且尚未始终优于简单的最先进的经典方法。在这里,我们专注于一类二次无约束二进制优化(QUBO)问题,特别是在三维伊辛自旋玻璃中找到最小能量配置的挑战。我们使用全局退火蒙特卡罗算法,该算法将标准局部移动与通过机器学习提出的全局移动相结合。我们表明,当地的举动发挥了至关重要的作用,在实现最佳性能。以模拟退火和群体退火为基准,我们证明了全局退火不仅超过了模拟退火的性能,而且比群体退火具有更强的鲁棒性,在没有超参数调整的情况下保持了问题难度和系统大小的有效性。据我们所知,这些结果提供了第一个明确而有力的证据,证明机器学习辅助优化方法可以在组合优化设置中超过经典最先进技术的能力。
摘要:Combinatorial optimization problems are central to both practical applications and the development of optimization methods. While classical and quantum algorithms have been refined over decades, machine learning-assisted approaches are comparatively recent and have not yet consistently outperformed simple, state-of-the-art classical methods. Here, we focus on a class of Quadratic Unconstrained Binary Optimization (QUBO) problems, specifically the challenge of finding minimum energy configurations in three-dimensional Ising spin glasses. We use a Global Annealing Monte Carlo algorithm that integrates standard local moves with global moves proposed via machine learning. We show that local moves play a crucial role in achieving optimal performance. Benchmarking against Simulated Annealing and Population Annealing, we demonstrate that Global Annealing not only surpasses the performance of Simulated Annealing but also exhibits greater robustness than Population Annealing, maintaining effectiveness across problem hardness and system size without hyperparameter tuning. These results provide, to our knowledge, the first clear and robust evidence that a machine learning-assisted optimization method can exceed the capabilities of classical state-of-the-art techniques in a combinatorial optimization setting.
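The abstract's mix of cheap local moves with machine-learning-proposed global moves fits the standard Metropolis acceptance rule; in this sketch the learned proposal is stubbed as a plain callable, which is an assumption:

```python
import math
import random

def annealing_step(energy, state, propose_local, propose_global, temp, p_global=0.1):
    """One Monte Carlo step: with probability p_global take a (learned)
    global move, otherwise a cheap local move; accept via Metropolis."""
    move = propose_global if random.random() < p_global else propose_local
    candidate = move(state)
    delta = energy(candidate) - energy(state)
    # Always accept downhill moves; accept uphill moves with
    # Boltzmann probability exp(-delta / temp).
    if delta <= 0 or random.random() < math.exp(-delta / temp):
        return candidate  # accept
    return state          # reject, keep the current configuration
```

In the paper's setting, state would be a 3D Ising spin configuration and propose_global a machine-learned resampler; the acceptance rule is what keeps the mixed dynamics a valid sampler.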
【137】KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
Link: https://arxiv.org/abs/2510.19484
Abstract: Molecular large language models have garnered widespread attention due to their promising potential for molecular applications. However, current molecular large language models face significant limitations in understanding molecules, owing to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: https://github.com/yzf-code/KnowMol Huggingface: https://hf.co/datasets/yzf1102/KnowMol-100K
【138】EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Link: https://arxiv.org/abs/2510.19414
Abstract: The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
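The Equal Error Rate (EER) used above to compare the baseline detectors is the operating point where the false-accept and false-reject rates cross. A minimal sketch is below; it uses a simple threshold sweep without interpolation, and the score convention (higher = more likely genuine) is an assumption.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from detection scores.

    scores: higher = more likely genuine; labels: 1 = genuine, 0 = spoof.
    Sweeps a threshold over every observed score and returns the mean of
    FAR and FRR at the point where they are closest.
    """
    n_genuine = sum(labels)
    n_spoof = len(labels) - n_genuine
    best = None
    for thr in sorted(set(scores)):
        # false accepts: spoofed samples scoring at or above the threshold
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr) / n_spoof
        # false rejects: genuine samples scoring below the threshold
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr) / n_genuine
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2
```

A perfectly separating detector yields an EER of 0, while chance-level scores yield an EER near 0.5; lower average EER across datasets is the generalization signal the abstract reports.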
【139】Metadata Extraction Leveraging Large Language Models
Link: https://arxiv.org/abs/2510.19334
Abstract: The advent of Large Language Models has revolutionized tasks across domains, including the automation of legal document analysis, a critical component of modern contract management systems. This paper presents a comprehensive implementation of LLM-enhanced metadata extraction for contract review, focusing on the automatic detection and annotation of salient legal clauses. Leveraging both the publicly available Contract Understanding Atticus Dataset (CUAD) and proprietary contract datasets, our work demonstrates the integration of advanced LLM methodologies with practical applications. We identify three pivotal elements for optimizing metadata extraction: robust text conversion, strategic chunk selection, and advanced LLM-specific techniques, including Chain of Thought (CoT) prompting and structured tool calling. Our experimental results highlight substantial improvements in clause identification accuracy and efficiency. Our approach shows promise in reducing the time and cost associated with contract review while maintaining high accuracy in legal clause identification. The results suggest that carefully optimized LLM systems could serve as valuable tools for legal professionals, potentially increasing access to efficient contract review services for organizations of all sizes.
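Strategic chunk selection, one of the three elements named above, presupposes splitting long contracts into windows an LLM can ingest. A minimal chunker sketch follows; the window size, overlap, and paragraph-boundary heuristic are illustrative assumptions, not the paper's actual settings.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split a long contract into overlapping character windows, preferring
    to break at paragraph boundaries so clauses are not cut mid-sentence.

    Sizes are illustrative defaults, not the paper's configuration.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind("\n\n", start, end)  # prefer a paragraph break
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        # step back by `overlap` so clauses spanning a boundary appear whole
        # in at least one chunk; start+1 guarantees forward progress
        start = max(end - overlap, start + 1)
    return chunks
```

In a pipeline like the one described, each chunk would then be sent to the LLM (with CoT prompting or a structured tool call) and the per-chunk clause annotations merged, with the overlap reducing the chance that a clause straddling a boundary is missed.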
【140】No Intelligence Without Statistics: The Invisible Backbone of Artificial Intelligence
Link: https://arxiv.org/abs/2510.19212
Comments: 37 pages, 6 figures
Abstract: The rapid ascent of artificial intelligence (AI) is often portrayed as a revolution born from computer science and engineering. This narrative, however, obscures a fundamental truth: the theoretical and methodological core of AI is, and has always been, statistical. This paper systematically argues that the field of statistics provides the indispensable foundation for machine learning and modern AI. We deconstruct AI into nine foundational pillars (Inference, Density Estimation, Sequential Learning, Generalization, Representation Learning, Interpretability, Causality, Optimization, and Unification), demonstrating that each is built upon century-old statistical principles. From the inferential frameworks of hypothesis testing and estimation that underpin model evaluation, to the density-estimation roots of clustering and generative AI; from the time-series analysis inspiring recurrent networks to the causal models that promise true understanding, we trace an unbroken statistical lineage. While celebrating the computational engines that power modern AI, we contend that statistics provides the brain (the theoretical frameworks, uncertainty quantification, and inferential goals) while computer science provides the brawn (the scalable algorithms and hardware). Recognizing this statistical backbone is not merely an academic exercise but a necessary step toward developing more robust, interpretable, and trustworthy intelligent systems. We issue a call to action for education, research, and practice to re-embrace this statistical foundation. Ignoring these roots risks building a fragile future; embracing them is the path to truly intelligent machines. There is no machine learning without statistical learning; no artificial intelligence without statistical thought.
【141】News-Aware Direct Reinforcement Trading for Financial Markets
Link: https://arxiv.org/abs/2510.19173
Comments: 9 pages, 4 figures, 3 tables
Abstract: The financial market is known to be highly sensitive to news. Therefore, effectively incorporating news data into quantitative trading remains an important challenge. Existing approaches typically rely on manually designed rules and/or handcrafted features. In this work, we directly use news sentiment scores derived from large language models, together with raw price and volume data, as observable inputs for reinforcement learning. These inputs are processed by sequence models such as recurrent neural networks or Transformers to make end-to-end trading decisions. We conduct experiments using the cryptocurrency market as an example and evaluate two representative reinforcement learning algorithms, namely Double Deep Q-Network (DDQN) and Group Relative Policy Optimization (GRPO). The results demonstrate that our news-aware approach, which does not depend on handcrafted features or manually designed rules, can achieve performance superior to market benchmarks. We further highlight the critical role of time-series information in this process.
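The Double DQN bootstrap rule that gives DDQN its name, and one plausible way to assemble the news-aware observation described above, can be sketched as follows. The feature layout in `make_observation` is an assumption for illustration, not the paper's actual design.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN bootstrap targets.

    The online network picks the next action (argmax) while a separate
    target network evaluates it; this decoupling reduces the value
    overestimation of vanilla DQN.
    rewards, dones: shape (B,); q_*_next: shape (B, n_actions).
    """
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values

def make_observation(price_window, volume_window, sentiment_score):
    """Flat observation vector: normalized prices and volumes plus an
    LLM-derived news sentiment score (this feature layout is an assumption)."""
    p = np.asarray(price_window, dtype=float)
    v = np.asarray(volume_window, dtype=float)
    return np.concatenate([p / p[-1] - 1.0,          # returns relative to latest price
                           v / (v.mean() + 1e-9),    # volumes scaled by window mean
                           [sentiment_score]])       # news sentiment appended last
```

In the full system these observation windows would be fed through a recurrent network or Transformer to produce the Q-values; this sketch covers only the target computation and input assembly.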
【142】StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Link: https://arxiv.org/abs/2510.18938
Comments: 13 pages, 5 figures
Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero achieved a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
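The WER figures quoted above come from the standard word-level Levenshtein distance, which can be sketched as:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason disfluent speech inflates the error rates of conventional ASR systems.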
【143】Prospects for Using Artificial Intelligence to Understand Intrinsic Kinetics of Heterogeneous Catalytic Reactions
Link: https://arxiv.org/abs/2510.18911
Comments: Submitted to "Current Opinion in Chemical Engineering" for peer review
Abstract: Artificial intelligence (AI) is influencing heterogeneous catalysis research by accelerating simulations and materials discovery. A key frontier is integrating AI with multiscale models and multimodal experiments to address the "many-to-one" challenge of linking intrinsic kinetics to observables. Advances in machine-learned force fields, microkinetics, and reactor modeling enable rapid exploration of chemical spaces, while operando and transient data provide unprecedented insight. Yet inconsistent data quality and model complexity limit mechanistic discovery. Generative and agentic AI can automate model generation, quantify uncertainty, and couple theory with experiment, realizing "self-driving models" that produce interpretable, reproducible, and transferable understanding of catalytic systems.
【144】What is Implementation Science; and Why It Matters for Bridging the Artificial Intelligence Innovation-to-Application Gap in Medical Imaging
Link: https://arxiv.org/abs/2510.13006
Abstract: The transformative potential of artificial intelligence (AI) in medical imaging (MI) is well recognized. Yet despite promising reports in research settings, many AI tools fail to achieve clinical adoption in practice. More generally, there is a documented 17-year average delay between evidence generation and implementation of a technology. Implementation science (IS) may provide a practical, evidence-based framework to bridge the gap between AI development and real-world clinical imaging use, helping to shorten this lag through systematic frameworks, strategies, and hybrid research designs. We outline challenges specific to AI adoption in MI workflows, including infrastructural, educational, and cultural barriers. We highlight the complementary roles of effectiveness research and implementation research, emphasizing hybrid study designs and the role of integrated knowledge translation (iKT), stakeholder engagement, and equity-focused co-creation in designing sustainable and generalizable solutions. We discuss the integration of Human-Computer Interaction (HCI) frameworks in MI toward usable AI. Adopting IS is not only a methodological advancement; it is a strategic imperative for accelerating the translation of innovation into improved patient outcomes.

