cs.AI (Artificial Intelligence): 126 papers in total
【1】A Knowledge-Graph Translation Layer for Mission-Aware Multi-Agent Path Planning in Spatiotemporal Dynamics
Link: https://arxiv.org/abs/2510.21695
Comments: 10 pages, 10 figures, conference submission
Abstract: The coordination of autonomous agents in dynamic environments is hampered by the semantic gap between high-level mission objectives and low-level planner inputs. To address this, we introduce a framework centered on a Knowledge Graph (KG) that functions as an intelligent translation layer. The KG's two-plane architecture compiles declarative facts into per-agent, mission-aware "worldviews" and physics-aware traversal rules, decoupling mission semantics from a domain-agnostic planner. This allows complex, coordinated paths to be modified simply by changing facts in the KG. A case study involving Autonomous Underwater Vehicles (AUVs) in the Gulf of Mexico visually demonstrates the end-to-end process and quantitatively proves that different declarative policies produce distinct, high-performing outcomes. This work establishes the KG not merely as a data repository, but as a powerful, stateful orchestrator for creating adaptive and explainable autonomous systems.
【2】On Thin Ice: Towards Explainable Conservation Monitoring via Attribution and Perturbations
Link: https://arxiv.org/abs/2510.21689
Comments: NeurIPS Imageomics Workshop 2025
Abstract: Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model evidence for true positives. The analysis also uncovers recurrent error sources, including confusion between seals and black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond "black-box" predictions toward auditable, decision-supporting tools for conservation monitoring.
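The deletion test behind axis (ii) is straightforward to reproduce. Below is a minimal sketch of a deletion-style faithfulness check for a single detection; the `score_fn` interface, pixel-box layout, and zero fill are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def deletion_confidence_drop(score_fn, image, box, fill_value=0):
    """Deletion test for one detection: occlude the attributed region and
    measure how far the detector's confidence falls. A large drop suggests
    the explanation points at real evidence (here, the seal itself)."""
    x0, y0, x1, y1 = box                      # attributed region, in pixels
    baseline = score_fn(image)                # confidence before occlusion
    occluded = np.array(image, copy=True)
    occluded[y0:y1, x0:x1] = fill_value       # "delete" the evidence
    return baseline - score_fn(occluded)      # positive = confidence fell
```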
【3】A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection
Link: https://arxiv.org/abs/2510.21679
Comments: Forthcoming in NeurIPS 2025 Datasets and Benchmarks Track
Abstract: Companies spend large amounts of money on public relations campaigns to project a positive brand image. However, sometimes there is a mismatch between what they say and what they do. Oil & gas companies, for example, are accused of "greenwashing" with imagery of climate-friendly initiatives. Understanding the framing, and changes in framing, at scale can help better understand the goals and nature of public relations campaigns. To address this, we introduce a benchmark dataset of expert-annotated video ads obtained from Facebook and YouTube. The dataset provides annotations for 13 framing types for more than 50 companies or advocacy groups across 20 countries. Our dataset is especially designed for the evaluation of vision-language models (VLMs), distinguishing it from past text-only framing datasets. Baseline experiments show some promising results, while leaving room for improvement for future work: GPT-4.1 can detect environmental messages with 79% F1 score, while our best model only achieves 46% F1 score on identifying framing around green innovation. We also identify challenges that VLMs must address, such as implicit framing, handling videos of various lengths, or implicit cultural backgrounds. Our dataset contributes to research in multimodal analysis of strategic communication in the energy sector.
【4】CMOMgen: Complex Multi-Ontology Alignment via Pattern-Guided In-Context Learning
Link: https://arxiv.org/abs/2510.21656
Comments: 32 pages, 5 figures
Abstract: Constructing comprehensive knowledge graphs requires the use of multiple ontologies in order to fully contextualize data into a domain. Ontology matching finds equivalences between concepts, interconnecting ontologies and creating a cohesive semantic layer. While the state of the art in simple pairwise matching is well established, simple equivalence mappings cannot provide full semantic integration of related but disjoint ontologies. Complex multi-ontology matching (CMOM) aligns one source entity to composite logical expressions of multiple target entities, establishing more nuanced equivalences and provenance along the ontological hierarchy. We present CMOMgen, the first end-to-end CMOM strategy that generates complete and semantically sound mappings without restricting the number of target ontologies or entities. Retrieval-Augmented Generation selects relevant classes to compose the mapping and filters matching reference mappings to serve as examples, enhancing In-Context Learning. The strategy was evaluated on three biomedical tasks with partial reference alignments. CMOMgen outperforms baselines in class selection, demonstrating the impact of having a dedicated strategy. Our strategy also achieves a minimum of 63% in F1-score, outperforming all baselines and ablated versions in two out of three tasks and placing second in the third. Furthermore, a manual evaluation of non-reference mappings showed that 46% of the mappings achieve the maximum score, further substantiating its ability to construct semantically sound mappings.
【5】Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
Link: https://arxiv.org/abs/2510.21654
Comments: Accepted by ICCV 2025, Code: https://github.com/eth-siplab/GroupInertialPoser
Abstract: Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser
【6】AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
Link: https://arxiv.org/abs/2510.21652
Abstract: AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
【7】A Dynamic Knowledge Distillation Method Based on the Gompertz Curve
Link: https://arxiv.org/abs/2510.21649
Comments: 15 pages, 2 figures
Abstract: This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student's learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.
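The abstract specifies the Gompertz curve's role (slow start, rapid mid-phase, late saturation) but not its constants. A minimal sketch of the time-modulated objective follows, assuming normalized training time, illustrative values for a, b, c, and a simple additive loss combination; none of these are the authors' settings.

```python
import math

def gompertz_weight(epoch, total_epochs, a=1.0, b=4.0, c=6.0):
    """w(t) = a * exp(-b * exp(-c * t)) on normalized time t in [0, 1]:
    small early, rising fast mid-training, saturating near a."""
    t = epoch / max(total_epochs - 1, 1)
    return a * math.exp(-b * math.exp(-c * t))

def total_loss(task_loss, distill_loss, epoch, total_epochs):
    # The distillation term's influence grows along the Gompertz curve.
    return task_loss + gompertz_weight(epoch, total_epochs) * distill_loss
```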
【8】DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection
Link: https://arxiv.org/abs/2510.21638
Abstract: Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.
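A minimal sketch of a two-statistic detector in the spirit of DEEDEE follows; the centroid distance as the global statistic, the RBF bandwidth, and max-pooling over training episodes are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def fit_summary(train_episodes):
    """Per-episode feature means over a list of (T, d) arrays, plus centroid."""
    means = np.stack([ep.mean(axis=0) for ep in train_episodes])   # (N, d)
    return means, means.mean(axis=0)

def deedee_scores(episode, means, centroid, gamma=1.0):
    """Two statistics for one (T, d) episode: a global deviation (distance
    of its mean from the training centroid) and a local one (max RBF
    similarity to any training episode mean; low similarity flags OOD)."""
    m = episode.mean(axis=0)
    global_stat = np.linalg.norm(m - centroid)
    local_stat = np.exp(-gamma * np.sum((means - m) ** 2, axis=1)).max()
    return global_stat, local_stat
```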
【9】Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
Link: https://arxiv.org/abs/2510.21631
Comments: NeurIPS 2025
Abstract: Knowledge distillation is a promising approach to transfer capabilities from complex teacher models to smaller, resource-efficient student models that can be deployed easily, particularly in task-aware scenarios. However, existing methods of task-aware distillation typically require substantial quantities of data which may be unavailable or expensive to obtain in many practical scenarios. In this paper, we address this challenge by introducing a novel strategy called Counterfactual-explanation-infused Distillation (CoD) for few-shot task-aware knowledge distillation that systematically infuses counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs that can flip the output prediction of the teacher model with minimum perturbation. Our strategy CoD leverages these CFEs to precisely map the teacher's decision boundary with significantly fewer samples. We provide theoretical guarantees motivating the role of CFEs in distillation, from both statistical and geometric perspectives. We mathematically show that CFEs can improve parameter estimation by providing more informative examples near the teacher's decision boundary. We also derive geometric insights on how CFEs effectively act as knowledge probes, helping the student mimic the teacher's decision boundaries more effectively than standard data. We perform experiments across various datasets and LLMs to show that CoD outperforms standard distillation approaches in few-shot regimes (as low as 8-512 samples). Notably, CoD uses only half of the original samples used by the baselines, paired with their corresponding CFEs, and still improves performance.
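CFEs are defined here as inputs that flip the teacher's prediction with minimum perturbation. One common way to realize that definition is gradient search; the sketch below assumes a differentiable teacher returning logits and a single-example batch, and is not the paper's generation procedure.

```python
import torch

def counterfactual_explanation(teacher, x, steps=200, lr=0.05, lam=0.1):
    """Find a small perturbation delta such that teacher(x + delta) changes
    class, trading off the class flip against ||delta|| (minimality)."""
    oc = int(teacher(x).argmax(dim=-1))            # original predicted class
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Push down the original class logit while keeping delta small.
        loss = teacher(x + delta)[0, oc] + lam * delta.norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if int(teacher(x + delta).argmax(dim=-1)) != oc:
            break                                   # prediction flipped
    return (x + delta).detach()
```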
【10】The Universal Landscape of Human Reasoning
Link: https://arxiv.org/abs/2510.21623
Comments: Preprint
Abstract: Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applied to discussions of advanced psychological theory, we first reconcile single- versus dual-process theories in IF-Track and discover the alignment of artificial and human cognition and how LLMs reshape the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.
【11】DeepAgent: A General Reasoning Agent with Scalable Toolsets
Link: https://arxiv.org/abs/2510.21618
Abstract: Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
【12】Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine
Link: https://arxiv.org/abs/2510.21614
Abstract: Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
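The abstract defines $\mathrm{CMP}$ as an aggregate of descendants' benchmark performances over the tree of self-modifications. A minimal sketch, assuming aggregation by mean (the paper's exact aggregator may differ):

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    """One agent in the tree of self-modifications."""
    benchmark_score: float
    children: list["AgentNode"] = field(default_factory=list)

def clade_metaproductivity(node: AgentNode) -> float:
    """CMP-style score: aggregate the benchmark performance of all
    descendants of `node` as an indicator of its self-improvement potential."""
    scores, stack = [], list(node.children)
    while stack:
        n = stack.pop()
        scores.append(n.benchmark_score)
        stack.extend(n.children)
    return sum(scores) / len(scores) if scores else 0.0
```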
【13】Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations
Link: https://arxiv.org/abs/2510.21610
Abstract: The increasing need for data privacy and the demand for robust machine learning models have fueled the development of synthetic data generation techniques. However, current methods often succeed in replicating simple summary statistics but fail to preserve both the pairwise and higher-order correlation structure of the data that define the complex, multi-variable interactions inherent in real-world systems. This limitation can lead to synthetic data that is superficially realistic but fails when used for sophisticated modeling tasks. In this white paper, we introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. The technique uses Cholesky decomposition of a target correlation matrix to produce datasets that, by mathematical proof, preserve the entire correlation structure -- from simple pairwise relationships to higher-order interactions -- of the source dataset. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.
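The core construction is stated directly in the abstract: Cholesky-factor the target correlation matrix and transform uncorrelated draws. A minimal sketch, assuming a standard-normal base distribution:

```python
import numpy as np

def gcm_samples(target_corr, n_samples, seed=0):
    """Synthetic data whose correlation matrix matches target_corr:
    with target_corr = L @ L.T, x = z @ L.T has correlation target_corr."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(target_corr)          # lower-triangular factor
    z = rng.standard_normal((n_samples, target_corr.shape[0]))
    return z @ L.T

# Example: three variables with a specified pairwise structure.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
x = gcm_samples(R, n_samples=100_000)
print(np.round(np.corrcoef(x, rowvar=False), 2))  # ~R up to sampling noise
```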
【14】Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
Link: https://arxiv.org/abs/2510.21583
Comments: 11 pages, preprint
Abstract: Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation, but it faces two key limitations: inaccurate advantage attribution, and the neglect of temporal dynamics of generation. In this work, we argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues. Building on this idea, we propose Chunk-GRPO, the first chunk-level GRPO-based approach for T2I generation. The insight is to group consecutive steps into coherent 'chunks' that capture the intrinsic temporal dynamics of flow matching, and to optimize policies at the chunk level. In addition, we introduce an optional weighted sampling strategy to further enhance performance. Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, highlighting the promise of chunk-level optimization for GRPO-based methods.
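A minimal sketch of the chunk-level idea: per-step log-probabilities of each sampled trajectory are summed within consecutive chunks, and the clipped GRPO ratio is formed once per chunk instead of once per step. The shapes, clipping constant, and one-advantage-per-trajectory setup are assumptions for illustration, not the paper's exact objective.

```python
import torch

def chunk_grpo_loss(logp_new, logp_old, advantages, chunk_size, eps=0.2):
    """logp_new/logp_old: (n_traj, n_steps) per-step log-probs;
    advantages: (n_traj,) group-relative advantages."""
    n_traj, n_steps = logp_new.shape
    n_chunks = n_steps // chunk_size

    def to_chunks(x):  # sum step log-probs inside each chunk
        return x[:, :n_chunks * chunk_size].reshape(
            n_traj, n_chunks, chunk_size).sum(-1)

    ratio = torch.exp(to_chunks(logp_new) - to_chunks(logp_old))
    adv = advantages.unsqueeze(1)                  # broadcast over chunks
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.mean(torch.minimum(ratio * adv, clipped * adv))
```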
【15】From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene
Link: https://arxiv.org/abs/2510.21575
Abstract: Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Namely, language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene that contain altogether 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.
【16】Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Link: https://arxiv.org/abs/2510.21571
Comments: Project page: this https URL
Abstract: This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating human hand as dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by the development of a fully-automated holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied with framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
【17】Learning Neural Control Barrier Functions from Expert Demonstrations using Inverse Constraint Learning
Link: https://arxiv.org/abs/2510.21560
Abstract: Safety is a fundamental requirement for autonomous systems operating in critical domains. Control barrier functions (CBFs) have been used to design safety filters that minimally alter nominal controls for such systems to maintain their safety. Learning neural CBFs has been proposed as a data-driven alternative to their computationally expensive optimization-based synthesis. However, it is often the case that the failure set of states that should be avoided is non-obvious or hard to specify formally, e.g., tailgating in autonomous driving, while a set of expert demonstrations that achieve the task and avoid the failure set is easier to generate. We use inverse constraint learning (ICL) to train a constraint function that classifies the states of the system under consideration as safe, i.e., belonging to a controlled forward-invariant set that is disjoint from the unspecified failure set, or unsafe, i.e., belonging to the complement of that set. We then use that function to label a new set of simulated trajectories to train our neural CBF. We empirically evaluate our approach in four different environments, demonstrating that it outperforms existing baselines and achieves comparable performance to a neural CBF trained with the same data but annotated with ground-truth safety labels.
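Once the ICL-trained constraint function has labeled simulated states safe or unsafe, the neural CBF can be fit with a margin-based classification objective. A minimal sketch of that step only, with the margin as an assumption; the invariance and derivative conditions a full CBF loss also needs are omitted here.

```python
import torch

def cbf_classification_loss(h, x_safe, x_unsafe, margin=0.1):
    """Encourage h(x) >= margin on states labeled safe and
    h(x) <= -margin on states labeled unsafe, so the zero level set
    of the candidate CBF h separates the two classes."""
    safe_term = torch.relu(margin - h(x_safe)).mean()
    unsafe_term = torch.relu(margin + h(x_unsafe)).mean()
    return safe_term + unsafe_term
```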
【18】Co-Sight: Enhancing LLM-Based Agents via Conflict-Aware Meta-Verification and Trustworthy Reasoning with Structured Facts
Link: https://arxiv.org/abs/2510.21557
Abstract: Long-horizon reasoning in LLM-based agents often fails not from generative weakness but from insufficient verification of intermediate reasoning. Co-Sight addresses this challenge by turning reasoning into a falsifiable and auditable process through two complementary mechanisms: Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF). CAMV reformulates verification as conflict identification and targeted falsification, allocating computation only to disagreement hotspots among expert agents rather than to full reasoning chains. This bounds verification cost to the number of inconsistencies and improves efficiency and reliability. TRSF continuously organizes, validates, and synchronizes evidence across agents through a structured facts module. By maintaining verified, traceable, and auditable knowledge, it ensures that all reasoning is grounded in consistent, source-verified information and supports transparent verification throughout the reasoning process. Together, TRSF and CAMV form a closed verification loop, where TRSF supplies structured facts and CAMV selectively falsifies or reinforces them, yielding transparent and trustworthy reasoning. Empirically, Co-Sight achieves state-of-the-art accuracy on GAIA (84.4%) and Humanity's Last Exam (35.5%), and strong results on Chinese-SimpleQA (93.8%). Ablation studies confirm that the synergy between structured factual grounding and conflict-aware verification drives these improvements. Co-Sight thus offers a scalable paradigm for reliable long-horizon reasoning in LLM-based agents. Code is available at https://github.com/ZTE-AICloud/Co-Sight/tree/cosight2.0_benchmarks.
【19】Human and AI Trust: Trust Attitude Measurement Instrument
Link: https://arxiv.org/abs/2510.21535
Abstract: With the current progress of Artificial Intelligence (AI) technology and its increasingly broader applications, trust is seen as a required criterion for AI usage, acceptance, and deployment. A robust measurement instrument is essential to correctly evaluate trust from a human-centered perspective. This paper describes the development and validation process of a trust measurement instrument, which follows psychometric principles and consists of a 16-item trust scale. The instrument was built explicitly for research in human-AI interaction to measure trust attitudes towards AI systems from a layperson (non-expert) perspective. The use case we used to develop the scale was in the context of AI medical support systems (specifically cancer/health prediction). The scale development (Measurement Item Development) and validation (Measurement Item Evaluation) involved six research stages: item development, item evaluation, survey administration, test of dimensionality, test of reliability, and test of validity. The results of the six-stage evaluation show that the proposed trust measurement instrument is empirically reliable and valid for systematically measuring and comparing non-experts' trust in AI Medical Support Systems.
【20】EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law
Link: https://arxiv.org/abs/2510.21524
Comments: Accepted at the Workshop on Regulatable ML at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract: Large language models (LLMs) are increasingly deployed as agents in various contexts by providing tools at their disposal. However, LLM agents can exhibit unpredictable behaviors, including taking undesirable and/or unsafe actions. In order to measure the latent propensity of LLM agents for taking illegal actions under an EU legislative context, we introduce EU-Agent-Bench, a verifiable human-curated benchmark that evaluates an agent's alignment with EU legal norms in situations where benign user inputs could lead to unlawful actions. Our benchmark spans scenarios across several categories, including data protection, bias/discrimination, and scientific integrity, with each user request allowing for both compliant and non-compliant execution of the requested actions. Comparing the model's function calls against a rubric exhaustively supported by citations of the relevant legislation, we evaluate the legal compliance of frontier LLMs, and furthermore investigate the compliance effect of providing the relevant legislative excerpts in the agent's system prompt along with explicit instructions to comply. We release a public preview set for the research community, while holding out a private test set to prevent data contamination in evaluating upcoming models. We encourage future work extending agentic safety benchmarks to different legal jurisdictions and to multi-turn and multilingual interactions. We release our code at https://github.com/ilijalichkovski/eu-agent-bench.
【21】GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
Link: https://arxiv.org/abs/2510.21501
Comments: 21 pages, 6 figures
Abstract: Vision encoders are indispensable for allowing impressive performance of Multi-modal Large Language Models (MLLMs) in vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis; they are limited in fine-grained perception due to the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. Consequently, we develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We exploit the fine-grained annotations of Gran-29M via bounding-box-to-caption regression, to enhance the vision encoder's localized visual representations during pretraining, and caption-to-bounding-box regression, to improve the LLM's vision-feature utilization and localization during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
【22】Enhancing Social Robots through Resilient AI
Link: https://arxiv.org/abs/2510.21469
Comments: 8 pages, Workshop on Adaptive Social Interaction based on user's Mental mOdels and behaVior in HRI, The 17th International Conference on Social Robotics, 10-12 September 2025, Naples (IT)
Abstract: As artificial intelligence continues to advance and becomes more integrated into sensitive areas like healthcare, education, and everyday life, it's crucial for these systems to be both resilient and robust. This paper shows how resilience is a fundamental characteristic of social robots, through which they ensure trust in the robot itself, an essential element especially when operating in contexts with elderly people, who often have low trust in these systems. Resilience is therefore the ability to operate under adverse or stressful conditions, even when degraded or weakened, while maintaining essential operational capabilities.
【23】Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP
Link: https://arxiv.org/abs/2510.21453
Comments: Accepted to NeurIPS 2025
Abstract: Existing neural methods for multi-task vehicle routing problems (VRPs) typically learn unified solvers to handle multiple constraints simultaneously. However, they often underutilize the compositional structure of VRP variants, each derivable from a common set of basis VRP variants. This critical oversight causes unified solvers to miss out the potential benefits of basis solvers, each specialized for a basis VRP variant. To overcome this limitation, we propose a framework that enables unified solvers to perceive the shared-component nature across VRP variants by proactively reusing basis solvers, while mitigating the exponential growth of trained neural solvers. Specifically, we introduce a State-Decomposable MDP (SDMDP) that reformulates VRPs by expressing the state space as the Cartesian product of basis state spaces associated with basis VRP variants. More crucially, this formulation inherently yields the optimal basis policy for each basis VRP variant. Furthermore, a Latent Space-based SDMDP extension is developed by incorporating both the optimal basis policies and a learnable mixture function to enable the policy reuse in the latent space. Under mild assumptions, this extension provably recovers the optimal unified policy of SDMDP through the mixture function that computes the state embedding as a mapping from the basis state embeddings generated by optimal basis policies. For practical implementation, we introduce the Mixture-of-Specialized-Experts Solver (MoSES), which realizes basis policies through specialized Low-Rank Adaptation (LoRA) experts, and implements the mixture function via an adaptive gating mechanism. Extensive experiments conducted across VRP variants showcase the superiority of MoSES over prior methods.
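A minimal sketch of the mixture mechanism described above: each expert is a LoRA-style low-rank update of a shared layer, combined by an input-conditioned gate. The dimensions, softmax gating, and initialization are illustrative assumptions, not MoSES's actual architecture.

```python
import torch
import torch.nn as nn

class GatedLoRAMixture(nn.Module):
    """Mixture of specialized low-rank experts over one shared linear layer."""
    def __init__(self, d_in, d_out, n_experts, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.gate = nn.Linear(d_in, n_experts)    # adaptive gating mechanism

    def forward(self, x):                          # x: (batch, d_in)
        w = torch.softmax(self.gate(x), dim=-1)    # expert weights per input
        # Per-expert low-rank updates, shape (batch, n_experts, d_out).
        delta = torch.einsum('bi,eir,ero->beo', x, self.A, self.B)
        return self.base(x) + torch.einsum('be,beo->bo', w, delta)
```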
【24】PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis
Link: https://arxiv.org/abs/2510.21447
Comments: 17 pages, 5 figures
Abstract: Interactive world models that simulate object dynamics are crucial for robotics, VR, and AR. However, it remains a significant challenge to learn physics-consistent dynamics models from limited real-world video data, especially for deformable objects with spatially-varying physical properties. To overcome the challenge of data scarcity, we propose PhysWorld, a novel framework that utilizes a simulator to synthesize physically plausible and diverse demonstrations to learn efficient world models. Specifically, we first construct a physics-consistent digital twin within MPM simulator via constitutive model selection and global-to-local optimization of physical properties. Subsequently, we apply part-aware perturbations to the physical properties and generate various motion patterns for the digital twin, synthesizing extensive and diverse demonstrations. Finally, using these demonstrations, we train a lightweight GNN-based world model that is embedded with physical properties. The real video can be used to further refine the physical properties. PhysWorld achieves accurate and fast future predictions for various deformable objects, and also generalizes well to novel interactions. Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state-of-the-art method, i.e., PhysTwin.
【25】REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring
Link: https://arxiv.org/abs/2510.21445
Abstract: With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare workers' inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user-friendly web application. Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and is being tested to demonstrate the robustness of its various capabilities.
【26】Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
Link: https://arxiv.org/abs/2510.21443
Abstract: [Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLMs' performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.
【27】AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving
Link: https://arxiv.org/abs/2510.21436
Comments: NeurIPS 2025, 28 pages, 11 figures, 11 tables
Abstract: This study presents AutoOpt-11k, a unique image dataset of over 11,000 handwritten and printed mathematical optimization models corresponding to single-objective, multi-objective, multi-level, and stochastic optimization problems exhibiting various types of complexities such as non-linearity, non-convexity, non-differentiability, discontinuity, and high-dimensionality. The labels consist of the LaTeX representation for all the images and modeling language representation for a subset of images. The dataset was created by 25 experts following ethical data creation guidelines and verified in two phases to avoid errors. Further, we develop the AutoOpt framework, a machine-learning-based automated approach for solving optimization problems, where the user just needs to provide an image of the formulation and AutoOpt solves it efficiently without any further human intervention. The AutoOpt framework consists of three modules: (i) M1 (Image_to_Text) - a deep learning model performs the Mathematical Expression Recognition (MER) task to generate the LaTeX code corresponding to the optimization formulation in the image; (ii) M2 (Text_to_Text) - a small-scale fine-tuned LLM generates the PYOMO script (optimization modeling language) from the LaTeX code; (iii) M3 (Optimization) - a Bilevel Optimization Based Decomposition (BOBD) method solves the optimization formulation described in the PYOMO script. We use the AutoOpt-11k dataset for training and testing of the deep learning models employed in AutoOpt. The deep learning model for the MER task (M1) outperforms ChatGPT, Gemini, and Nougat on the BLEU score metric. The BOBD method (M3), which is a hybrid approach, yields better results on complex test problems compared to common approaches, like the interior-point algorithm and genetic algorithms.
【28】Advancing Symbolic Integration in Large Language Models: Beyond Conventional Neurosymbolic AI
Link: https://arxiv.org/abs/2510.21425
Abstract: LLMs have demonstrated highly effective learning, human-like response generation, and decision-making capabilities in high-risk sectors. However, these models remain black boxes because they struggle to ensure transparency in responses. The literature has explored numerous approaches to address transparency challenges in LLMs, including Neurosymbolic AI (NeSy AI). NeSy AI approaches were primarily developed for conventional neural networks and are not well-suited to the unique features of LLMs. Consequently, there is a limited systematic understanding of how symbolic AI can be effectively integrated into LLMs. This paper aims to address this gap by first reviewing established NeSy AI methods and then proposing a novel taxonomy of symbolic integration in LLMs, along with a roadmap to merge symbolic techniques with LLMs. The roadmap introduces a new categorisation framework across four dimensions by organising existing literature within these categories. These include symbolic integration across various stages of LLMs, coupling mechanisms, architectural paradigms, as well as algorithmic and application-level perspectives. The paper thoroughly identifies current benchmarks, cutting-edge advancements, and critical gaps within the field to propose a roadmap for future research. By highlighting the latest developments and notable gaps in the literature, it offers practical insights for implementing frameworks for symbolic integration into LLMs to enhance transparency.
【29】Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
标题:用于医疗保健环境中动态人类活动识别的视觉语言模型
链接:https://arxiv.org/abs/2510.21424
摘要:随着生成式人工智能的不断发展,视觉语言模型(VLM)已成为各种医疗保健应用中有前途的工具。一个仍然相对未被充分探索的领域是它们在远程健康监测的人类活动识别(HAR)中的使用。VLM具有显著的优势,包括更大的灵活性和克服传统深度学习模型某些限制的能力。然而,将VLM应用于HAR的一个关键挑战在于难以评估其动态且往往不确定的输出。为了解决这一差距,我们引入了一个描述性字幕数据集,并提出了全面的评估方法来评估HAR中的VLM。通过与最先进的深度学习模型的比较实验,我们的研究结果表明,VLM实现了相当的性能,在某些情况下,甚至在准确性方面超过了传统方法。这项工作提供了一个强大的基准,并为将VLM集成到智能医疗保健系统中开辟了新的可能性。
摘要:As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption data set and propose comprehensive evaluation methods to evaluate VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of accuracy. This work contributes a strong benchmark and opens new possibilities for the integration of VLMs into intelligent healthcare systems.
【30】DreamerV3-XP: Optimizing exploration through uncertainty estimation
标题:DreamerV3-XP:通过不确定性估计优化探索
链接:https://arxiv.org/abs/2510.21418
摘要:我们介绍DreamerV3-XP,它是DreamerV3的扩展,可提高探索和学习效率。这包括(i)优先重放缓冲区,通过回报、重建损失和价值误差对轨迹进行评分;以及(ii)基于世界模型集成对预测环境奖励的分歧的内在奖励。DreamerV3-XP在Atari100k和DeepMind Control Visual Benchmark任务的子集上进行了评估,证实了DreamerV3的原始结果,并表明我们的扩展能带来更快的学习和更低的动力学模型损失,特别是在稀疏奖励设置中。
摘要:We introduce DreamerV3-XP, an extension of DreamerV3 that improves exploration and learning efficiency. This includes (i) a prioritized replay buffer, scoring trajectories by return, reconstruction loss, and value error, and (ii) an intrinsic reward based on disagreement over predicted environment rewards from an ensemble of world models. DreamerV3-XP is evaluated on a subset of Atari100k and DeepMind Control Visual Benchmark tasks, confirming the original DreamerV3 results and showing that our extensions lead to faster learning and lower dynamics model loss, particularly in sparse-reward settings.
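As a rough illustration of the two extensions named in the abstract, the sketch below scores trajectories for prioritized replay and computes an ensemble-disagreement intrinsic reward; the linear weighting and the use of a standard deviation are assumptions, not the paper's exact formulas.

```python
import numpy as np

def trajectory_priority(ret, recon_loss, value_error, w=(1.0, 1.0, 1.0)):
    # Combine the three signals listed in the abstract into one replay
    # priority; the weights and linear form are illustrative assumptions.
    return w[0] * ret + w[1] * recon_loss + w[2] * value_error

def intrinsic_reward(ensemble_reward_preds):
    # Disagreement bonus: spread of predicted environment rewards across
    # an ensemble of world models, shape (n_models, batch).
    return np.std(ensemble_reward_preds, axis=0)

preds = np.random.randn(5, 3)          # 5 world models, 3 states
print(intrinsic_reward(preds))         # higher where the models disagree
```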
【31】Large Language Models as Model Organisms for Human Associative Learning
标题:大型语言模型作为人类联想学习的模型生物体
链接:https://arxiv.org/abs/2510.21408
摘要:联想学习,即在共同出现的项目之间形成联系,是人类认知的基础,以复杂的方式重塑内部表征。在生物系统中检验表征变化如何发生的假设具有挑战性,但大型语言模型(LLM)提供了一种可扩展的替代方案。基于LLM的上下文学习,我们采用了认知神经科学的联想学习范式,并研究了表征在六种模型中如何演变。我们的初步研究结果揭示了与非单调可塑性假设一致的非单调模式:中等相似的项目在学习后发生分化。利用LLM的可控性,我们进一步表明,这种分化受相关项目与更广泛词汇的重叠程度调制,我们将这一因素称为词汇干扰,它刻画了新的关联如何与先前的知识竞争。我们发现,更高的词汇干扰会放大分化,这表明表征变化同时受项目相似性和全局竞争的影响。我们的研究结果不仅将LLM定位为研究类人学习系统中表征动力学的强大工具,也将其定位为可用于生成关于大脑记忆重组原理新假设的、易于使用的通用计算模型。
摘要:Associative learning--forming links between co-occurring items--is fundamental to human cognition, reshaping internal representations in complex ways. Testing hypotheses on how representational changes occur in biological systems is challenging, but large language models (LLMs) offer a scalable alternative. Building on LLMs' in-context learning, we adapt a cognitive neuroscience associative learning paradigm and investigate how representations evolve across six models. Our initial findings reveal a non-monotonic pattern consistent with the Non-Monotonic Plasticity Hypothesis, with moderately similar items differentiating after learning. Leveraging the controllability of LLMs, we further show that this differentiation is modulated by the overlap of associated items with the broader vocabulary--a factor we term vocabulary interference, capturing how new associations compete with prior knowledge. We find that higher vocabulary interference amplifies differentiation, suggesting that representational change is influenced by both item similarity and global competition. Our findings position LLMs not only as powerful tools for studying representational dynamics in human-like learning systems, but also as accessible and general computational models for generating new hypotheses about the principles underlying memory reorganization in the brain.
【32】REvolution: An Evolutionary Framework for RTL Generation driven by Large Language Models
标题:REvolution:大型语言模型驱动的RTL生成进化框架
链接:https://arxiv.org/abs/2510.21407
备注:Accepted for publication at the 2026 Asia and South Pacific Design Automation Conference (ASP-DAC)
摘要:大型语言模型(LLM)被用于寄存器传输级(RTL)代码生成,但它们面临两个主要挑战:功能正确性以及功耗、性能和面积(PPA)优化。基于反馈的迭代方法部分解决了这些问题,但它们仅限于局部搜索,阻碍了全局最优解的发现。本文介绍了REvolution,一个将进化计算(EC)与LLM相结合、用于自动RTL生成和优化的框架。REvolution并行演化候选种群,每个候选由设计策略、RTL实现和评估反馈定义。该框架包括一个双种群算法,将候选分为失败组和成功组,分别用于错误修复和PPA优化。自适应机制根据每个提示策略的成功率动态调整其选择概率,进一步提高了搜索效率。在VerilogEval和RTLLM基准测试上的实验表明,REvolution将各种LLM的初始通过率提高了最多24.0个百分点。DeepSeek-V3模型的最终通过率为95.5%,与最先进的结果相当,且不需要单独的训练或特定领域的工具。此外,所生成的RTL设计相比参考设计显示出显著的PPA改进。这项工作通过将LLM的生成能力与EC的广泛搜索能力相结合,提出了一种新的RTL设计方法,克服了以往方法的局部搜索局限。
摘要:Large Language Models (LLMs) are used for Register-Transfer Level (RTL) code generation, but they face two main challenges: functional correctness and Power, Performance, and Area (PPA) optimization. Iterative, feedback-based methods partially address these, but they are limited to local search, hindering the discovery of a global optimum. This paper introduces REvolution, a framework that combines Evolutionary Computation (EC) with LLMs for automatic RTL generation and optimization. REvolution evolves a population of candidates in parallel, each defined by a design strategy, RTL implementation, and evaluation feedback. The framework includes a dual-population algorithm that divides candidates into Fail and Success groups for bug fixing and PPA optimization, respectively. An adaptive mechanism further improves search efficiency by dynamically adjusting the selection probability of each prompt strategy according to its success rate. Experiments on the VerilogEval and RTLLM benchmarks show that REvolution increased the initial pass rate of various LLMs by up to 24.0 percentage points. The DeepSeek-V3 model achieved a final pass rate of 95.5%, comparable to state-of-the-art results, without the need for separate training or domain-specific tools. Additionally, the generated RTL designs showed significant PPA improvements over reference designs. This work introduces a new RTL design approach by combining LLMs' generative capabilities with EC's broad search power, overcoming the local-search limitations of previous methods.
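The adaptive mechanism described above can be pictured as a success-rate-proportional sampler over prompt strategies; the sketch below is a minimal rendering of that idea, with the smoothing constant as an assumption.

```python
import random

class AdaptivePromptSelector:
    """Select each prompt strategy with probability proportional to its
    empirical success rate, as the abstract describes; eps keeps every
    strategy selectable and is an illustrative choice."""

    def __init__(self, strategies, eps=0.05):
        self.stats = {s: [0, 0] for s in strategies}  # [successes, trials]
        self.eps = eps

    def select(self):
        weights = [self.stats[s][0] / max(1, self.stats[s][1]) + self.eps
                   for s in self.stats]
        return random.choices(list(self.stats), weights=weights, k=1)[0]

    def update(self, strategy, succeeded):
        self.stats[strategy][1] += 1
        self.stats[strategy][0] += int(succeeded)

sel = AdaptivePromptSelector(["fix_bug", "optimize_ppa", "refactor"])
s = sel.select()
sel.update(s, succeeded=True)
```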
【33】Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning
标题:通过强化学习提高LLM预算强制在数学推理中的准确性和效率
链接:https://arxiv.org/abs/2510.21398
备注:Submitted to the European Conference on Artificial Intelligence (ECAI)
摘要:测试时间缩放方法因其计算效率和无关参数的训练方式而迅速流行,用于提高大型语言模型的推理性能。其中一种方法被称为预算强制,这是一种解码干预策略,它为思考分配额外的计算预算,并激发模型固有的自我纠正行为。然而,这依赖于在长上下文推理轨迹上的监督微调(SFT),由于响应冗长,这会导致较小模型的性能下降。出于这个原因,我们提供了一个集成强化学习(RL)的框架,以提高令牌效率并提升1.5B数学推理模型的性能。我们仅使用1.5K训练样本证明了这一点,并发现我们的SFT+RL模型在不同计算预算下的GSM8K数据集上表现更好。我们的主要结果显示,与SFT模型相比,整体准确率更高,同时令牌使用量显著减少40%以上,揭示了RL如何弥补长上下文训练造成的损失,并整体提升数学推理性能。
摘要:Test-time scaling methods have seen a rapid increase in popularity for their computational efficiency and parameter-independent training to improve reasoning performance on Large Language Models. One such method is called budget forcing, a decoding intervention strategy which allocates extra compute budget for thinking and elicits the inherent self-correcting behavior of the model. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better on the GSM8K dataset with varying compute budgets. Our main findings show an overall higher accuracy while significantly reducing token usage by over 40% compared to the SFT model, revealing how RL can recover the losses due to long-context training and altogether improve performance in mathematical reasoning.
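For readers unfamiliar with budget forcing, the sketch below shows the general decoding intervention the abstract builds on: if the model tries to end its thinking phase before the token budget is spent, the end-of-thinking token is replaced with a continuation cue. The step function, token ids, and cue are placeholders, and the paper's SFT+RL training is not shown.

```python
def budget_forced_decode(step_fn, prompt_ids, end_think_id, wait_ids, budget):
    # step_fn(ids) -> next token id; stands in for one greedy decoding
    # step of a real model. All ids here are hypothetical placeholders.
    ids = list(prompt_ids)
    used = 0
    while used < budget:
        nxt = step_fn(ids)
        if nxt == end_think_id:
            ids.extend(wait_ids)      # force the model to keep thinking
            used += len(wait_ids)
        else:
            ids.append(nxt)
            used += 1
    ids.append(end_think_id)          # close the thinking phase at budget
    return ids

# Toy stub model that always tries to stop thinking immediately.
out = budget_forced_decode(lambda ids: 2, [0, 1], end_think_id=2,
                           wait_ids=[7, 8], budget=6)
print(out)
```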
【34】Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study
标题:评估可解释人工智能在觉醒事件诊断中的现实效用:一项以应用为基础的用户研究
链接:https://arxiv.org/abs/2510.21389
摘要:人工智能(AI)系统在生物医学信号解读方面越来越接近或超过人类专家。然而,将其有效整合到临床实践中需要的不仅仅是高预测准确性。临床医生必须辨别何时以及为何信任算法建议。这项工作提出了一项以实际应用为基础的用户研究,共有八名专业睡眠医学从业者参与,他们在三种条件下对多导睡眠图数据中的夜间觉醒事件进行评分:(i)手动评分,(ii)黑盒(BB)AI辅助,以及(iii)透明白盒(WB)AI辅助。辅助要么从评分一开始提供,要么作为事后质量控制(QC)审查提供。我们系统地评估了辅助的类型和时机如何影响事件级和临床上最相关的基于计数的性能、时间需求和用户体验。当对照用于训练AI的临床标准进行评估时,AI和人机团队的表现都显著优于无辅助的专家,协作还减少了评分者之间的差异。值得注意的是,作为有针对性的QC步骤应用的透明AI辅助,其事件级性能中位数比黑盒辅助提高约30%,QC时机进一步改善了基于计数的结果。虽然WB和QC方法增加了评分所需的时间,但评分开始时即提供的辅助更快,并受到大多数参与者的青睐。绝大多数参与者赞成透明度,八人中有七人表示愿意在稍作修改或不作修改的情况下采用该系统。总之,时机得当的透明AI辅助有效地平衡了准确性和临床效率,为临床工作流程中值得信赖的AI集成和用户接受提供了一条有前景的途径。
摘要:Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern when and why to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the start of scoring or as a post-hoc quality-control (QC) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.
【35】HIKMA: Human-Inspired Knowledge by Machine Agents through a Multi-Agent Framework for Semi-Autonomous Scientific Conferences
标题:HIKMA:通过面向半自主科学会议的多智能体框架由机器代理产生人类启发的知识
链接:https://arxiv.org/abs/2510.21370
摘要:HIKMA半自主会议是通过将人工智能端到端集成到学术出版和演示管道中来重新构想学术交流的首次实验。本文介绍了HIKMA框架的设计、实现和评估,其中包括AI数据集策展、基于AI的手稿生成、AI辅助的同行评审、AI驱动的修订、AI会议演示和AI档案传播。通过结合语言模型、结构化的研究工作流程和领域保护措施,HIKMA展示了人工智能如何支持而非取代传统的学术实践,同时保持知识产权保护、透明度和完整性。该会议作为一个试验平台和概念验证,提供了对AI支持的学术研究的机遇和挑战的见解。它还探讨了有关AI作者身份、问责制以及人机协作在研究中的作用的问题。
摘要:HIKMA Semi-Autonomous Conference is the first experiment in reimagining scholarly communication through an end-to-end integration of artificial intelligence into the academic publishing and presentation pipeline. This paper presents the design, implementation, and evaluation of the HIKMA framework, which includes AI dataset curation, AI-based manuscript generation, AI-assisted peer review, AI-driven revision, AI conference presentation, and AI archival dissemination. By combining language models, structured research workflows, and domain safeguards, HIKMA shows how AI can support - not replace traditional scholarly practices while maintaining intellectual property protection, transparency, and integrity. The conference functions as a testbed and proof of concept, providing insights into the opportunities and challenges of AI-enabled scholarship. It also examines questions about AI authorship, accountability, and the role of human-AI collaboration in research.
【36】Gaze-VLM:Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
标题:Gaze-VLM:通过注意力正则化连接凝视与VLM,实现以自我为中心的理解
链接:https://arxiv.org/abs/2510.21356
摘要:眼睛凝视提供了关于注意力、短期意图和未来行动的有价值的线索,使其成为塑造自我中心行为的强大信号。在这项工作中,我们提出了一个凝视正则化的框架,增强了两个关键的自我中心的理解任务:细粒度的未来事件预测和当前活动的理解的VLM。与先前的方法不同,这些方法仅依赖于视觉输入或使用凝视作为辅助输入信号,我们的方法仅在训练期间使用凝视。我们引入了一个凝视正则化的注意力机制,使模型焦点与人类视觉凝视对齐。这种设计是灵活的和模块化的,允许它在利用注意力的多个VLM架构中通用化。实验结果表明,与没有注视正则化训练的相应基线模型相比,我们的方法将未来事件预测的语义预测分数提高了11分,当前活动理解的语义预测分数提高了7分左右。这些结果突出了凝视引导训练在提高自我中心VLM的准确性和鲁棒性方面的价值。总的来说,这项工作为使用人类凝视来增强VLM在辅助机器人和人机协作等现实场景中的预测能力奠定了基础。代码和其他信息请访问:https://github.com/anupampani/Gaze-VLM
摘要:Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 points for future event prediction and around 7 points for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information is available at: https://github.com/anupampani/Gaze-VLM
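One plausible form of the gaze-regularized attention described above is a divergence penalty between the model's attention over visual tokens and a human gaze heatmap; the KL form, head-averaging, and tensor shapes below are assumptions for illustration, not the paper's exact regularizer.

```python
import torch
import torch.nn.functional as F

def gaze_regularization_loss(attn, gaze):
    # attn: (batch, heads, n_tokens) attention over visual tokens.
    # gaze: (batch, n_tokens) gaze heatmap on the same token grid.
    p = gaze / gaze.sum(dim=-1, keepdim=True)    # target distribution
    q = attn.mean(dim=1)                         # average attention heads
    q = q / q.sum(dim=-1, keepdim=True)
    # KL(p || q), added to the task loss only during training.
    return F.kl_div(q.clamp_min(1e-8).log(), p, reduction="batchmean")

attn = torch.rand(2, 8, 196)   # e.g. 14x14 grid of visual tokens
gaze = torch.rand(2, 196)
print(gaze_regularization_loss(attn, gaze))
```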
【37】CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments
标题:CT-CLIP:复杂环境中可靠识别苹果叶病的多模式融合框架
链接:https://arxiv.org/abs/2510.21346
摘要:在复杂的果园环境中,苹果叶部病害表现出的表型异质性,使传统的多尺度特征融合方法面临挑战。这些方法只整合了卷积神经网络(CNN)提取的多层特征,未能充分考虑局部和全局特征之间的关系。因此,本研究提出一个多分支识别框架CNN-Transformer-CLIP(CT-CLIP)。该框架协同地采用CNN来提取局部病变细节特征,并采用Vision Transformer来捕获全局结构关系。然后,自适应特征融合模块(AFFM)动态地融合这些特征,实现局部和全局信息的最佳耦合,并有效地解决病变形态和分布的多样性。此外,为了减轻复杂背景的干扰,并显着提高识别准确率在Few-Shot条件下,本研究提出了一种多模态图像-文本学习方法。通过利用预训练的CLIP权重,它实现了视觉特征和疾病语义描述之间的深度对齐。实验结果表明,CT-CLIP对公开的苹果病害和自建数据集的准确率分别达到97.38%和96.12%,优于几种基线方法。CT-CLIP具有较强的农业病害识别能力,显著提高了复杂环境条件下的识别精度,为农业病害自动识别提供了一种创新的实用解决方案。
摘要:In complex orchard environments, the phenotypic heterogeneity of different apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and significantly enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach. By leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and disease semantic descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, outperforming several baseline methods. The proposed CT-CLIP demonstrates strong capabilities in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative and practical solution for automated disease recognition in agricultural applications.
【38】$α$-LoRA: Effective Fine-Tuning via Base Model Rescaling
标题:$α$-LoRA:通过基础模型重新缩放进行有效的微调
链接:https://arxiv.org/abs/2510.21345
摘要:微调已被证明在调整预训练模型方面非常有效,以最少的数据样本更好地执行新的期望任务。其中最广泛使用的方法是重新参数化方法,该方法通过使用额外的可训练权重矩阵来增强其冻结权重矩阵来更新目标模块。最突出的例子是低秩自适应(LoRA),近年来受到了极大的关注。在本文中,我们介绍了一类新的迁移学习的重新参数化方法,旨在提高微调模型的泛化能力。我们使用随机矩阵理论的工具在高维二进制分类设置中建立了我们的方法的有效性,并通过更现实的实验(如微调LLM)进一步验证了我们的理论研究结果。
摘要:Fine-tuning has proven to be highly effective in adapting pre-trained models to perform better on new desired tasks with minimal data samples. Among the most widely used approaches are reparameterization methods, which update a target module by augmenting its frozen weight matrix with an additional trainable weight matrix. The most prominent example is Low Rank Adaption (LoRA), which gained significant attention in recent years. In this paper, we introduce a new class of reparameterization methods for transfer learning, designed to enhance the generalization ability of fine-tuned models. We establish the effectiveness of our approach in a high-dimensional binary classification setting using tools from Random Matrix Theory, and further validate our theoretical findings through more realistic experiments, such as fine-tuning LLMs.
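Reading "base model rescaling" literally, one way to render the idea is a frozen linear layer scaled by a factor alpha plus a trainable low-rank update, y = (alpha * W0 + BA)x; the placement of alpha and the initialization below are assumptions inferred from the title, not the paper's definition.

```python
import torch
import torch.nn as nn

class AlphaLoRALinear(nn.Module):
    # Sketch: rescale the frozen base layer by alpha and add a low-rank
    # trainable update, as the phrase "base model rescaling" suggests.
    def __init__(self, base: nn.Linear, rank=8, alpha=0.9):
        super().__init__()
        self.base, self.alpha = base, alpha
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze W0 (and bias)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.alpha * self.base(x) + x @ self.A.T @ self.B.T

layer = AlphaLoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(4, 64))
```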
【39】World-POI: Global Point-of-Interest Data Enriched from Foursquare and OpenStreetMap as Tabular and Graph Data
标题:World-POI:以表格和图数据形式提供、由Foursquare和OpenStreetMap丰富的全球兴趣点数据
链接:https://arxiv.org/abs/2510.21342
摘要:最近,Foursquare发布了一个全球数据集,其中包含超过1亿个兴趣点(POI),每个兴趣点都代表其平台上的一个真实业务。然而,许多条目缺乏完整的元数据,例如地址或类别,并且有些条目对应于不存在或虚构的位置。相比之下,OpenStreetMap(OSM)提供了一个丰富的、用户贡献的POI数据集,其中包含详细且经常更新的元数据,尽管它没有正式验证POI是否代表实际的业务。在这篇数据论文中,我们提出了一种整合两个数据集优势的方法:Foursquare作为商业POI的综合基线,OSM作为丰富元数据的来源。合并后的数据集总计约为1 TB。虽然此完整版本未公开发布,但我们提供具有可调阈值的筛选版本,以减少存储需求,并使数据可跨域下载和使用。我们还提供分步说明来重现完整的631 GB构建。通过计算Foursquare和OSM POI之间的名称相似性得分和空间距离来实现记录链接。这些措施识别和保留高置信度匹配,对应于Foursquare中的真实业务,在OSM中表示,并显示出很强的名称相似性。最后,我们使用这个过滤后的数据集来构建一个基于图形的POI表示,其中包含来自两个来源的属性,从而实现高级空间分析和一系列下游应用。
摘要:Recently, Foursquare released a global dataset with more than 100 million points of interest (POIs), each representing a real-world business on its platform. However, many entries lack complete metadata such as addresses or categories, and some correspond to non-existent or fictional locations. In contrast, OpenStreetMap (OSM) offers a rich, user-contributed POI dataset with detailed and frequently updated metadata, though it does not formally verify whether a POI represents an actual business. In this data paper, we present a methodology that integrates the strengths of both datasets: Foursquare as a comprehensive baseline of commercial POIs and OSM as a source of enriched metadata. The combined dataset totals approximately 1 TB. While this full version is not publicly released, we provide filtered releases with adjustable thresholds that reduce storage needs and make the data practical to download and use across domains. We also provide step-by-step instructions to reproduce the full 631 GB build. Record linkage is achieved by computing name similarity scores and spatial distances between Foursquare and OSM POIs. These measures identify and retain high-confidence matches that correspond to real businesses in Foursquare, have representations in OSM, and show strong name similarity. Finally, we use this filtered dataset to construct a graph-based representation of POIs enriched with attributes from both sources, enabling advanced spatial analyses and a range of downstream applications.
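The record-linkage step lends itself to a compact sketch: a name-similarity score combined with a haversine distance, keeping only high-confidence matches. The similarity function and both thresholds below are illustrative assumptions; the dataset's actual scoring may differ.

```python
import math
from difflib import SequenceMatcher

def name_similarity(a, b):
    # Simple [0, 1] string similarity; an illustrative stand-in for the
    # dataset's actual name-scoring function.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two WGS84 coordinates.
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_match(fsq, osm, sim_thresh=0.8, dist_thresh=50.0):
    # High-confidence link: spatially close and similar in name.
    close = haversine_m(fsq["lat"], fsq["lon"], osm["lat"], osm["lon"]) <= dist_thresh
    return close and name_similarity(fsq["name"], osm["name"]) >= sim_thresh

fsq = {"name": "Joe's Pizza", "lat": 40.7306, "lon": -73.9866}
osm = {"name": "Joes Pizza",  "lat": 40.7307, "lon": -73.9865}
print(is_match(fsq, osm))   # True: roughly 14 m apart, near-identical names
```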
【40】Magellan: Guided MCTS for Latent Space Exploration and Novelty Generation
标题:Magellan(麦哲伦):用于潜在空间探索和新颖性生成的引导式MCTS
链接:https://arxiv.org/abs/2510.21341
备注:Accepted to 1st Open Conference on AI Agents for Science (agents4science 2025)
摘要:大型语言模型(LLM)通常难以产生真正创新的想法,往往默认落入训练数据"重力井"中的高概率熟悉概念。虽然思想树(ToT)等先进的基于搜索的方法试图缓解这一点,但它们从根本上受限于依赖无原则、不一致的自我评估启发式来指导探索。为了弥补这一差距,我们引入了Magellan(麦哲伦),这是一个新颖的框架,将创造性生成重新构建为对LLM潜在概念空间的有原则的引导式探索。Magellan的核心是采用由分层指导系统管理的蒙特卡罗树搜索(MCTS)。在长程方向上,通过正交投影构造的"语义罗盘"向量将搜索引向相关的新颖性。在局部的逐步决策上,一个感知搜索地形的价值函数用平衡内在连贯性、外在新颖性和叙事进展的显式奖励结构取代了有缺陷的自我评估。大量实验表明,在生成具有更高合理性和创新性的科学想法方面,Magellan显著优于包括ReAct和ToT在内的强基线。我们的工作表明,对于创造性发现,有原则的引导式搜索比不受约束的自主行动更有效,为LLM成为更有能力的创新伙伴铺平了道路。
摘要:Large Language Models (LLMs) often struggle with generating truly innovative ideas, typically defaulting to high-probability, familiar concepts within their training data's "gravity wells." While advanced search-based methods like Tree of Thoughts (ToT) attempt to mitigate this, they are fundamentally limited by their reliance on unprincipled, inconsistent self-evaluation heuristics to guide exploration. To address this gap, we introduce \textbf{Magellan}, a novel framework that reframes creative generation as a principled, guided exploration of an LLM's latent conceptual space. At its core, Magellan employs Monte Carlo Tree Search (MCTS) governed by a hierarchical guidance system. For long-range direction, a "semantic compass" vector, formulated via orthogonal projection, steers the search towards relevant novelty. For local, step-by-step decisions, a landscape-aware value function replaces flawed self-evaluation with an explicit reward structure that balances intrinsic coherence, extrinsic novelty, and narrative progress. Extensive experiments demonstrate that Magellan significantly outperforms strong baselines, including ReAct and ToT, in generating scientific ideas with superior plausibility and innovation. Our work shows that for creative discovery, a principled, guided search is more effective than unconstrained agency, paving the way for LLMs to become more capable partners in innovation.
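The "semantic compass" is described as being formulated via orthogonal projection; a minimal reading of that idea is to remove from a novelty direction its component along the familiar, high-probability direction. The vector construction below is an assumption built on that reading.

```python
import numpy as np

def semantic_compass(novelty_dir, familiar_dir):
    # Project out the "gravity well" component so the remaining direction
    # steers toward novelty that is orthogonal to familiar concepts.
    f = familiar_dir / np.linalg.norm(familiar_dir)
    compass = novelty_dir - (novelty_dir @ f) * f    # orthogonal projection
    return compass / np.linalg.norm(compass)

novelty = np.random.randn(128)     # e.g. embedding of a target theme
familiar = np.random.randn(128)    # e.g. mean embedding of prior outputs
c = semantic_compass(novelty, familiar)
print(abs(c @ (familiar / np.linalg.norm(familiar))))  # ~0: orthogonal
```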
【41】CausalRec: A CausalBoost Attention Model for Sequential Recommendation
标题:CausalRec:一种用于顺序推荐的因果增强(CausalBoost)注意力模型
链接:https://arxiv.org/abs/2510.21333
备注:11 pages, 3 figures
摘要:基于相关性的顺序推荐系统的最新进展已经取得了实质性的成功。具体而言,基于注意力的模型通过更有效地捕获短期和长期依赖关系,优于其他基于RNN和马尔可夫链的模型。然而,仅仅关注项目共现忽略了用户行为背后的潜在动机,导致虚假相关性和潜在的不准确推荐。为了解决这一局限,我们提出了一个为顺序推荐整合因果注意力的新框架CausalRec。它包含一个因果发现模块和一个因果增强器(CausalBooster)。因果发现模块学习用户行为序列中的因果图,我们提供了保证所学因果图可识别性的理论。CausalBooster利用发现的因果图来改进注意力机制,优先考虑具有因果意义的行为。在真实世界数据集上的实验评估表明,CausalRec优于多种最先进的方法,命中率(HR)平均提高7.21%,归一化折扣累积增益(NDCG)平均提高8.65%。据我们所知,这是第一个在顺序推荐中通过注意力机制纳入因果关系的模型,证明了因果关系在生成更准确和可靠的推荐方面的价值。
摘要:Recent advances in correlation-based sequential recommendation systems have demonstrated substantial success. Specifically, the attention-based model outperforms other RNN-based and Markov chains-based models by capturing both short- and long-term dependencies more effectively. However, solely focusing on item co-occurrences overlooks the underlying motivations behind user behaviors, leading to spurious correlations and potentially inaccurate recommendations. To address this limitation, we present a novel framework that integrates causal attention for sequential recommendation, CausalRec. It incorporates a causal discovery block and a CausalBooster. The causal discovery block learns the causal graph in user behavior sequences, and we provide a theory to guarantee the identifiability of the learned causal graph. The CausalBooster utilizes the discovered causal graph to refine the attention mechanism, prioritizing behaviors with causal significance. Experimental evaluations on real-world datasets indicate that CausalRec outperforms several state-of-the-art methods, with average improvements of 7.21% in Hit Rate (HR) and 8.65% in Normalized Discounted Cumulative Gain (NDCG). To the best of our knowledge, this is the first model to incorporate causality through the attention mechanism in sequential recommendation, demonstrating the value of causality in generating more accurate and reliable recommendations.
【42】Weak-to-Strong Generalization under Distribution Shifts
标题:分布偏移下的弱到强泛化
链接:https://arxiv.org/abs/2510.21332
备注:Accepted to NeurIPS 2025
摘要:随着未来的超人模型变得越来越复杂,准确监督它们的行为可能会超出人类的能力。最近的研究表明,在这种情况下,弱模型可以有效地监督强模型,这种现象称为弱到强泛化。然而,我们发现,朴素的弱到强泛化在分布偏移下会失效,往往导致强模型的性能比其弱监督者更差。为了解决这个问题,我们提出了RAVEN,一个鲁棒的弱到强泛化框架,它在学习强模型参数的同时,动态学习弱模型的最优组合。我们在图像分类、文本分类和偏好对齐任务上证明了RAVEN的有效性。RAVEN在分布外任务上的性能超过其他基线30%以上,同时在分布内任务上匹配或超过现有方法。此外,我们的结果表明,RAVEN为更准确的弱模型分配了更高的权重,证明了其自动识别值得信赖的监督的能力。
摘要:As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. Moreover, our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.
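A minimal sketch of the core idea of learning combination weights over weak supervisors: softmax-parameterized weights mix the weak models' predictions into pseudo-labels for the strong model. The softmax parameterization and the mixing at the probability level are assumptions.

```python
import torch
import torch.nn as nn

class WeakEnsembleCombiner(nn.Module):
    # Learnable convex combination of weak models' class probabilities;
    # trained jointly with the strong model in a RAVEN-like setup.
    def __init__(self, n_weak):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_weak))

    def forward(self, weak_probs):
        # weak_probs: (n_weak, batch, n_classes)
        w = torch.softmax(self.logits, dim=0)
        return torch.einsum("k,kbc->bc", w, weak_probs)

combiner = WeakEnsembleCombiner(n_weak=3)
pseudo = combiner(torch.rand(3, 16, 10))   # targets for the strong model
```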
【43】TripTide: A Benchmark for Adaptive Travel Planning under Disruptions
标题:TripTide:中断情况下自适应旅行规划的基准
链接:https://arxiv.org/abs/2510.21329
备注:12 pages, 12 tables and 7 figures
摘要:最近的TripCraft和TravelPlanner等工作推进了大语言模型(LLM)在个性化、约束感知的旅行行程生成中的使用。然而,真实的旅行常常面临中断。为了解决这个问题,我们提出了TripTide,这是第一个评估LLM在现实中断情况下修改行程能力的基准。TripTide对中断严重程度和旅行者容忍度等关键维度进行建模,从而能够细致地评估LLM对航班取消、天气关闭或景点超额预订等事件的适应性。我们进行三重评估。首先,我们引入自动指标,包括意图保持(修订后的计划在多大程度上保持可行性和目标)、响应性(中断处理的及时性和适当性)和适应性(原始计划与修订计划之间的语义、空间和顺序差异)。其次,我们应用LLM作为评判者(LLM-as-a-judge)的方法来自动评估修订质量。第三,我们进行人工专家评估,以验证修订是否保留了语义、空间、顺序和响应方面。我们的实验表明,LLM保持较强的顺序一致性和语义稳定性,而空间偏差在较短行程中较大、在较长行程中减小,这表明更长的计划有助于更好的地理一致性。然而,随着计划长度的增加,中断处理能力下降,凸显了LLM鲁棒性的局限。TripTide为评估基于LLM的旅行规划在现实世界不确定性下的适应性、个性化和弹性建立了一个基准。
摘要:Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
【44】CXRAgent: Director-Orchestrated Multi-Stage Reasoning for Chest X-Ray Interpretation
标题:CXRAgent:面向胸部X射线解读的导演编排式多阶段推理
链接:https://arxiv.org/abs/2510.21324
备注:10 pages, 4 figures, 7 Tables
摘要:胸部X射线(CXR)在临床诊断中起着至关重要的作用,已经开发了各种特定任务模型和基础模型用于CXR的自动解读。然而,这些模型往往难以适应新的诊断任务和复杂的推理场景。最近,基于LLM的智能体模型已成为CXR分析的一个有前途的范式,通过工具协调、多步推理和团队协作等方式增强模型能力。然而,现有的智能体往往依赖单一的诊断管道,并缺乏评估工具可靠性的机制,限制了其适应性和可信度。为此,我们提出了CXRAgent,一个由导演编排的多阶段CXR解读智能体,其中中央导演协调以下阶段:(1)工具调用:智能体策略性地编排一组CXR分析工具,其输出由证据驱动验证器(EDV)标准化和验证,EDV将诊断输出与视觉证据结合以支持可靠的下游诊断;(2)诊断计划:在任务需求和中间发现的指导下,智能体制定有针对性的诊断计划,然后相应地组建专家团队,定义成员角色并协调他们的互动,以实现自适应和协作推理;(3)协作决策:智能体将专家团队的见解与积累的上下文记忆相结合,将其综合为有证据支持的诊断结论。在各种CXR解读任务上的实验表明,CXRAgent性能强劲,能够提供视觉证据,并能很好地泛化到不同复杂度的临床任务。代码和数据可在以下链接获取:https://github.com/laojiahuo2003/CXRAgent/
摘要:Chest X-ray (CXR) plays a pivotal role in clinical diagnosis, and a variety of task-specific and foundation models have been developed for automatic CXR interpretation. However, these models often struggle to adapt to new diagnostic tasks and complex reasoning scenarios. Recently, LLM-based agent models have emerged as a promising paradigm for CXR analysis, enhancing models' capabilities through tool coordination, multi-step reasoning, team collaboration, etc. However, existing agents often rely on a single diagnostic pipeline and lack mechanisms for assessing tools' reliability, limiting their adaptability and credibility. To this end, we propose CXRAgent, a director-orchestrated, multi-stage agent for CXR interpretation, where a central director coordinates the following stages: (1) Tool Invocation: The agent strategically orchestrates a set of CXR-analysis tools, with outputs normalized and verified by the Evidence-driven Validator (EDV), which grounds diagnostic outputs with visual evidence to support reliable downstream diagnosis; (2) Diagnostic Planning: Guided by task requirements and intermediate findings, the agent formulates a targeted diagnostic plan. It then assembles an expert team accordingly, defining member roles and coordinating their interactions to enable adaptive and collaborative reasoning; (3) Collaborative Decision-making: The agent integrates insights from the expert team with accumulated contextual memories, synthesizing them into an evidence-backed diagnostic conclusion. Experiments on various CXR interpretation tasks show that CXRAgent delivers strong performance, provides visual evidence, and generalizes well to clinical tasks of different complexity. Code and data are available at this link: https://github.com/laojiahuo2003/CXRAgent/
【45】Seemingly Redundant Modules Enhance Robust Odor Learning in Fruit Flies
标题:看似冗余的模块增强果蝇的稳健气味学习
链接:https://arxiv.org/abs/2510.21315
备注:10page,Accepted by NeurIPS
摘要:生物回路已经进化到包含执行类似功能的多个模块。在果蝇嗅觉回路中,侧抑制(LI)和神经元锋电位频率适应(SFA)被认为都能增强气味学习的模式分离。然而,目前尚不清楚这些机制在这一进程中是否发挥了多余或不同的作用。在这项研究中,我们提出了一个计算模型的苍蝇嗅觉电路,研究气味歧视在不同的噪声条件下,模拟复杂的环境。我们的研究结果表明,LI主要增强气味的歧视,在低和中等噪音的情况下,但这种好处减少,并可能在较高的噪音条件下逆转。相比之下,SFA在所有噪声水平上都能始终如一地提高辨别力。LI优先参与低噪声和中等噪声环境,而SFA在高噪声环境中占主导地位。当结合时,这两种稀疏化机制能够实现最佳的区分性能。这项工作表明,生物回路中看似冗余的模块实际上对于在复杂环境中实现最佳学习至关重要。
摘要:Biological circuits have evolved to incorporate multiple modules that perform similar functions. In the fly olfactory circuit, both lateral inhibition (LI) and neuronal spike frequency adaptation (SFA) are thought to enhance pattern separation for odor learning. However, it remains unclear whether these mechanisms play redundant or distinct roles in this process. In this study, we present a computational model of the fly olfactory circuit to investigate odor discrimination under varying noise conditions that simulate complex environments. Our results show that LI primarily enhances odor discrimination in low- and medium-noise scenarios, but this benefit diminishes and may reverse under higher-noise conditions. In contrast, SFA consistently improves discrimination across all noise levels. LI is preferentially engaged in low- and medium-noise environments, whereas SFA dominates in high-noise settings. When combined, these two sparsification mechanisms enable optimal discrimination performance. This work demonstrates that seemingly redundant modules in biological circuits can, in fact, be essential for achieving optimal learning in complex contexts.
【46】A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization
标题:浮点量化下自适应优化器的收敛性分析
链接:https://arxiv.org/abs/2510.21314
备注:65 pages, 10 figures
摘要:大型语言模型(LLM)的快速扩展使得低精度训练对于减少内存、提高效率以及实现更大的模型和数据集至关重要。然而,现有的自适应优化器收敛理论假设所有组件都是精确的,忽略了硬件感知的量化,留下了低精度训练为何仍然有效的疑问。我们提出了第一个用于分析自适应优化器(包括Adam和Muon)在梯度、权重和优化器状态(例如矩估计)的浮点量化下收敛性的理论框架。在这个框架内,我们在标准随机梯度假设下得到了光滑非凸目标上的收敛速度,明确刻画了来自不同组件的量化误差如何影响收敛。我们证明,只要尾数长度仅随迭代次数对数增长,两种算法都能保持接近其全精度版本的收敛速度。我们的分析进一步表明,由于依赖$\beta_2 \to 1$,Adam对权重和二阶矩的量化高度敏感,而Muon所需的误差控制较弱,因此可能更鲁棒。这些结果缩小了低精度训练方法的经验成功与理论理解之间的差距。对合成数据和真实数据的数值实验证实了我们的理论。
摘要:The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on $\beta_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.
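The quantization the theory studies can be simulated numerically by rounding a tensor to a given number of mantissa bits while leaving the exponent exact; the helper below is a numerical illustration of such component-wise floating-point quantization, not the paper's formal error model.

```python
import numpy as np

def quantize_mantissa(x, m_bits):
    # Round-to-nearest float quantization keeping m_bits of mantissa.
    mant, exp = np.frexp(np.asarray(x, dtype=np.float64))
    scale = 2.0 ** m_bits
    return np.ldexp(np.round(mant * scale) / scale, exp)

g = np.random.randn(1000)   # e.g. a gradient or moment-estimate tensor
for m in (23, 10, 7, 4):    # fp32-like mantissa down to very low precision
    err = np.abs(quantize_mantissa(g, m) - g).max()
    print(f"mantissa bits={m:2d}  max abs error={err:.2e}")
```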
【47】Efficient semantic uncertainty quantification in language models via diversity-steered sampling
标题:通过多样性引导的采样在语言模型中进行高效的语义不确定性量化
链接:https://arxiv.org/abs/2510.21310
备注:10 pages (+7 appendix), 7 figures. Accepted at NeurIPS 2025
摘要:准确估计大型语言模型(LLM)中的语义偶然不确定性和认知不确定性在自由形式问答(QA)中特别具有挑战性,因为获得稳定的估计通常需要多次昂贵的生成。我们引入了一个多样性引导的采样器,在解码过程中抑制语义冗余的输出,涵盖自回归和掩码扩散两种范式,并带来可观的采样效率提升。其关键思想是利用在部分前缀或中间扩散状态上轻量微调的自然语言推理(NLI)模型,将连续的语义相似性惩罚注入模型的提议分布。我们通过重要性重加权对下游不确定性估计进行去偏,并利用控制变量缩小其方差。在四个QA基准测试中,我们的方法匹配或超过基线,同时用相同数量的样本覆盖更多的语义簇。由于该框架是模块化的,并且不需要对基础LLM进行梯度访问,它有望成为风险敏感模型部署中不确定性估计的即插即用式增强。
摘要:Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
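The proposal adjustment can be sketched over a finite candidate set: each candidate's log-probability is penalized by its similarity (as an NLI model would score it) to samples already drawn, and importance weights p/q are recorded for debiasing. The additive penalty form and alpha below are assumptions.

```python
import math

def diversity_steered(candidate_logp, similarities, alpha=2.0):
    # candidate_logp: log-probs of candidate answers under the model,
    # assumed normalized over this candidate set for the sketch.
    # similarities[i]: NLI-style similarity of candidate i to prior samples.
    q_logits = [lp - alpha * s for lp, s in zip(candidate_logp, similarities)]
    z = math.log(sum(math.exp(l) for l in q_logits))
    q_logp = [l - z for l in q_logits]
    # Importance weights p/q debias estimates computed from q-samples.
    iw = [math.exp(p - q) for p, q in zip(candidate_logp, q_logp)]
    return q_logp, iw

q_logp, iw = diversity_steered([-0.5, -1.2, -2.3], [0.9, 0.1, 0.0])
print(q_logp, iw)   # redundant candidates are down-weighted in q
```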
【48】Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning
标题:迈向可靠的代码即策略:面向具身任务规划的神经符号框架
链接:https://arxiv.org/abs/2510.21302
备注:Accepted at NeurIPS 2025 Spotlight
摘要:大型语言模型(LLM)的最新进展使得为机器人等具身智能体的任务规划与控制自动生成可执行代码成为可能,展示了基于LLM的具身智能的潜力。然而,这些基于LLM的代码即策略方法通常受环境接地不足的影响,特别是在动态或部分可观察的设置中,由于代码生成不正确或不完整,导致任务成功率欠佳。在这项工作中,我们提出了一个神经符号具身任务规划框架,在代码生成过程中结合显式的符号验证和交互式确认过程。在确认阶段,框架生成探索性代码,主动与环境交互以获取缺失的观察,同时保留任务相关的状态。这一集成过程增强了所生成代码的环境接地,从而提高了复杂环境中的任务可靠性和成功率。我们在RLBench以及真实世界中的动态、部分可观察场景下评估了我们的框架。实验结果表明,我们的框架将任务成功率较Code-as-Policies基线提高了46.2%,任务相关动作的可执行率达到86.8%以上,从而提高了动态环境中任务规划的可靠性。
摘要:Recent advances in large language models (LLMs) have enabled the automatic generation of executable code for task planning and control in embodied agents such as robots, demonstrating the potential of LLM-based embodied intelligence. However, these LLM-based code-as-policies approaches often suffer from limited environmental grounding, particularly in dynamic or partially observable settings, leading to suboptimal task success rates due to incorrect or incomplete code generation. In this work, we propose a neuro-symbolic embodied task planning framework that incorporates explicit symbolic verification and interactive validation processes during code generation. In the validation phase, the framework generates exploratory code that actively interacts with the environment to acquire missing observations while preserving task-relevant states. This integrated process enhances the grounding of generated code, resulting in improved task reliability and success rates in complex environments. We evaluate our framework on RLBench and in real-world settings across dynamic, partially observable scenarios. Experimental results demonstrate that our framework improves task success rates by 46.2% over Code-as-Policies baselines and attains over 86.8% executability of task-relevant actions, thereby enhancing the reliability of task planning in dynamic environments.
【49】Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
标题:了解人工智能可信度:AIES和FAccT文章的范围审查
链接:https://arxiv.org/abs/2510.21293
备注:Submitted to Journal of Artificial Intelligence Research (JAIR)
摘要:背景:值得信赖的人工智能是两个主要的人工智能伦理会议AIES和FAccT的基础支柱。然而,目前的研究往往采用以技术为中心的方法,主要关注可靠性、鲁棒性和公平性等技术属性,而忽视了对理解现实环境中人工智能可信度至关重要的社会技术维度。 目的:本范围综述旨在研究AIES和FAccT社区如何概念化、测量和验证AI可信度,确定主要差距和机会,以促进对可信AI系统的全面理解。 研究方法:我们对迄今为止的AIES和FAccT会议记录进行了范围综述,系统地分析了可信度如何在不同研究领域被定义、操作化和应用。我们的分析集中在概念化方法、测量方法、验证和确认技术、应用领域和潜在的价值观。 结果:虽然在定义透明度、问责制和稳健性等技术属性方面取得了重大进展,但我们的研究结果揭示了关键差距。目前的研究往往主要强调技术的精确性,而牺牲了社会和道德方面的考虑。人工智能系统的社会技术性质仍然很少被探索,可信度成为一个有争议的概念,由那些有能力定义它的人塑造。 结论:将技术严谨性与社会、文化和制度因素相结合的跨学科方法对于推进值得信赖的人工智能至关重要。我们为人工智能伦理社区提出了可行的措施,以采用全面的框架,真正解决人工智能系统与社会之间复杂的相互作用,最终促进使所有利益相关者受益的负责任的技术发展。
摘要:Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
【50】When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails
标题:当模型的思考凌驾于安全之上时:通过护栏链缓解大型推理模型中的自我越狱
链接:https://arxiv.org/abs/2510.21285
备注:First two authors contributed equally. The main text is 10 pages, with an appendix of 19 pages. The paper contains 18 figures and 16 tables
摘要:大型推理模型(LRM)在复杂的推理任务中表现出卓越的能力,但仍然容易受到严重的安全风险影响,包括有害内容生成和越狱攻击。现有的缓解策略依赖于在训练过程中注入启发式安全信号,这通常会抑制推理能力,并且无法解决安全与推理的权衡。为了系统地研究这个问题,我们分析了不同LRM的推理轨迹,并发现了一种我们称之为"自我越狱"的现象,即模型推翻自己的风险评估,为响应不安全提示找理由。这一发现表明,LRM本身具有拒绝不安全查询的能力,但这种能力受到损害,导致有害输出。基于这些见解,我们提出了护栏链(Chain-of-Guardrail,CoG),这是一个重组或回溯不安全推理步骤的训练框架,在保留有效推理链的同时将模型引导回安全轨道。在多个推理和安全基准上的大量实验表明,CoG大幅提高了当前LRM的安全性,同时保持了相当的推理能力,显著优于存在严重安全与推理权衡的先前方法。
摘要:Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks but remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. Existing mitigation strategies rely on injecting heuristic safety signals during training, which often suppress reasoning ability and fail to resolve the safety-reasoning trade-off. To systematically investigate this issue, we analyze the reasoning trajectories of diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models override their own risk assessments and justify responding to unsafe prompts. This finding reveals that LRMs inherently possess the ability to reject unsafe queries, but this ability is compromised, resulting in harmful outputs. Building on these insights, we propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps, steering the model back onto safe trajectories while preserving valid reasoning chains. Extensive experiments across multiple reasoning and safety benchmarks demonstrate that CoG substantially improves the safety of current LRMs while preserving comparable reasoning ability, significantly outperforming prior methods that suffer from severe safety-reasoning trade-offs.
【51】Pctx: Tokenizing Personalized Context for Generative Recommendation
标题:Pctx:为生成式推荐进行个性化上下文令牌化
链接:https://arxiv.org/abs/2510.21276
摘要:生成式推荐(GR)模型将每个动作标记为几个离散的标记(称为语义ID),并自回归生成下一个标记作为预测,显示出内存效率,可扩展性以及统一检索和排名的潜力等优势。尽管有这些好处,现有的标记化方法是静态的和非个性化的。它们通常仅从项目特征中获得语义ID,假设忽略用户特定视角的通用项目相似性。然而,在自回归范式下,具有相同前缀的语义ID总是接收类似的概率,因此单个固定映射隐含地在所有用户之间实施通用的项目相似性标准。在实践中,根据用户的意图和偏好,可以不同地解释相同的项目。为了解决这个问题,我们提出了一个个性化的上下文感知标记器,它在生成语义ID时结合了用户的历史交互。这种设计允许在不同的用户上下文下将相同的项目标记为不同的语义ID,使GR模型能够捕获多个解释标准并产生更个性化的预测。在三个公共数据集上的实验表明,NDCG@10比非个性化动作标记化基线提高了11.44%。我们的代码可在https://github.com/YoungZ365/Pctx上获得。
摘要:Generative recommendation (GR) models tokenize each action into a few discrete tokens (called semantic IDs) and autoregressively generate the next tokens as predictions, showing advantages such as memory efficiency, scalability, and the potential to unify retrieval and ranking. Despite these benefits, existing tokenization methods are static and non-personalized. They typically derive semantic IDs solely from item features, assuming a universal item similarity that overlooks user-specific perspectives. However, under the autoregressive paradigm, semantic IDs with the same prefixes always receive similar probabilities, so a single fixed mapping implicitly enforces a universal item similarity standard across all users. In practice, the same item may be interpreted differently depending on user intentions and preferences. To address this issue, we propose a personalized context-aware tokenizer that incorporates a user's historical interactions when generating semantic IDs. This design allows the same item to be tokenized into different semantic IDs under different user contexts, enabling GR models to capture multiple interpretive standards and produce more personalized predictions. Experiments on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. Our code is available at https://github.com/YoungZ365/Pctx.
【52】Investigating Scale Independent UCT Exploration Factor Strategies
标题:研究与尺度无关的UCT探索因子策略
链接:https://arxiv.org/abs/2510.21275
摘要:树置信上限(UCT)算法并非与其所应用游戏的奖励尺度无关。对于在游戏结束时具有稀疏奖励$\{-1,0,1\}$的零和游戏,这不是问题,但许多游戏通常具有密集奖励和人工选定的奖励尺度,导致节点的Q值在不同游戏中跨越不同的数量级。在本文中,我们评估了自适应选择UCT探索常数$\lambda$的各种策略(称为$\lambda$-策略),这些策略与游戏的奖励尺度无关。这些$\lambda$-策略包括文献中提出的策略以及五种新策略。鉴于我们的实验结果,我们建议使用我们新提出的$\lambda$-策略之一,即选择$\lambda$为$2 \cdot \sigma$,其中$\sigma$是搜索树中所有状态-动作对Q值的经验标准差。无论是在单一参数值下,还是在通过优化所有可用参数获得的峰值性能上,该方法都在广泛的任务中优于现有的$\lambda$-策略。
摘要:The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of $\{-1,0,1\}$ at the end of the game, this is not a problem, but many games often feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant $\lambda$, called $\lambda$-strategies, that are agnostic to the game's reward scale. These $\lambda$-strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested $\lambda$-strategies, which is to choose $\lambda$ as $2 \cdot \sigma$ where $\sigma$ is the empirical standard deviation of all state-action pairs' Q-values of the search tree. This method outperforms existing $\lambda$-strategies across a wide range of tasks both in terms of a single parameter value and the peak performances obtained by optimizing all available parameters.
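The recommended strategy is simple enough to state in a few lines: set lambda to twice the empirical standard deviation of all state-action Q-values in the current tree, then use it in the usual UCT score. This follows the abstract directly; only the toy numbers below are made up.

```python
import math

def lambda_2sigma(q_values):
    # lambda = 2 * sigma over all state-action Q-values in the tree.
    n = len(q_values)
    mean = sum(q_values) / n
    var = sum((q - mean) ** 2 for q in q_values) / n
    return 2.0 * math.sqrt(var)

def uct_score(q, n_parent, n_child, lam):
    # Standard UCT with an adaptive, reward-scale-agnostic exploration term.
    return q + lam * math.sqrt(math.log(n_parent) / n_child)

tree_qs = [0.2, 3.5, -1.0, 2.2, 0.7]       # toy Q-values from the tree
lam = lambda_2sigma(tree_qs)
print(uct_score(q=1.5, n_parent=100, n_child=10, lam=lam))
```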
【53】Sparser Block-Sparse Attention via Token Permutation
标题:通过令牌排列实现更稀疏的块稀疏注意力
链接:https://arxiv.org/abs/2510.21270
摘要:缩放大型语言模型(LLM)的上下文长度提供了显着的好处,但计算成本很高。这种开销主要来自自注意机制,其相对于序列长度的$O(N^2)$复杂度是内存和延迟的主要瓶颈。幸运的是,注意力矩阵通常是稀疏的,特别是对于长序列,这意味着优化的机会。块稀疏注意力已经成为一种有前途的解决方案,它将序列划分为块,并跳过对这些块的子集的计算。然而,这种方法的有效性是高度依赖于潜在的注意力模式,这可能会导致次优的块级稀疏。例如,单个块内的查询的重要关键令牌可能分散在许多其他块中,导致计算冗余。在这项工作中,我们提出了Permuted Block-Sparse Attention(\textbf{PBS-Attn}),这是一种即插即用的方法,它利用注意力的置换特性来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面的实验,证明PBS-Attn在模型准确性方面始终优于现有的块稀疏注意力方法,并且与完整的注意力基线密切匹配。由我们的自定义置换FlashAttention内核提供支持,PBS-Attn在长上下文预填充方面实现了高达2.75倍的端到端加速,证实了其实际可行性。代码可在https://github.com/xinghaow99/pbs-attn获得
摘要:Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (\textbf{PBS-Attn}), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
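At block level, the idea can be caricatured in a few lines: permute tokens so that similar keys fall into the same block, summarize blocks, and keep only the strongest block pairs. The 1-D projection used for the permutation and the block-mean scoring are simplifying assumptions; the paper's actual method relies on custom permuted-FlashAttention kernels.

```python
import numpy as np

def permuted_block_sparse_mask(q, k, block=4, keep_ratio=0.5, seed=0):
    n, d = k.shape
    rng = np.random.default_rng(seed)
    proj = k @ rng.standard_normal(d)        # cheap 1-D key descriptor
    perm = np.argsort(proj)                  # groups similar keys together
    qp, kp = q[perm], k[perm]
    nb = n // block
    qb = qp.reshape(nb, block, d).mean(1)    # block-level summaries
    kb = kp.reshape(nb, block, d).mean(1)
    scores = qb @ kb.T                       # (nb, nb) block affinities
    keep = max(1, int(keep_ratio * nb))
    mask = np.zeros_like(scores, dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :keep]
    np.put_along_axis(mask, top, True, axis=1)
    return perm, mask       # attend only within blocks where mask is True

q, k = np.random.randn(16, 8), np.random.randn(16, 8)
perm, mask = permuted_block_sparse_mask(q, k)
print(mask.mean())          # fraction of block pairs actually computed
```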
【54】Correlation Dimension of Auto-Regressive Large Language Models
标题:自回归大型语言模型的相关维度
链接:https://arxiv.org/abs/2510.21258
备注:Accepted at NeurIPS 2025
摘要:大型语言模型(LLM)在自然语言生成方面已经取得了显著的进展,但即使在困惑度很低时,它们仍然表现出令人困惑的行为,例如重复和不连贯。这凸显了传统评估指标的一个关键局限:它们强调局部预测准确性,而忽略了长程结构复杂性。我们引入相关维数,一种分形几何中的自相似性度量,来量化语言模型所感知文本的认识论复杂性。这种度量捕捉了语言的层次递归结构,在一个统一的框架中连接局部和全局属性。通过大量实验,我们发现相关维数(1)揭示了预训练过程中的三个不同阶段,(2)反映了依赖上下文的复杂性,(3)指示模型的幻觉倾向,(4)可靠地检测生成文本中多种形式的退化。该方法计算高效,对模型量化具有鲁棒性(低至4位精度),广泛适用于自回归架构(例如Transformer和Mamba),并为LLM的生成动力学提供了新见解。
摘要:Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.
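The measure itself is the classical Grassberger-Procaccia estimate: C(r) is the fraction of point pairs within radius r, and the correlation dimension is the slope of log C(r) against log r. Applying it to a cloud of model hidden states is how we read the abstract; the paper's exact estimator may differ.

```python
import numpy as np

def correlation_dimension(points, radii):
    # Pairwise distances between points (e.g., token embeddings or states).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pair_d = d[np.triu_indices(len(points), k=1)]
    c = np.array([np.mean(pair_d < r) for r in radii])   # correlation sum
    valid = c > 0
    slope, _ = np.polyfit(np.log(radii[valid]), np.log(c[valid]), 1)
    return slope

pts = np.random.rand(300, 3)                # points filling a 3-D cube
radii = np.geomspace(0.05, 0.5, 12)
print(correlation_dimension(pts, radii))    # close to 3 for a filled cube
```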
【55】Out-of-Distribution Detection for Safety Assurance of AI and Autonomous Systems
标题:用于人工智能和自主系统安全保障的分布外检测
链接:https://arxiv.org/abs/2510.21254
摘要:近年来,由于机器人技术和机器学习(ML)的进步,AI支持的自主系统的操作能力和应用领域已经显著扩展。严格论证自主系统的安全性对于其负责任的采用至关重要,但这具有挑战性,因为它需要能够在整个系统生命周期中处理新颖和不确定情况的稳健方法,包括检测分布外(OoD)数据。因此,OoD检测越来越受到研究、开发和安全工程界的关注。这篇全面的综述在自主系统安全保障的背景下分析OoD检测技术,特别是在安全关键领域。我们首先定义相关概念,调查导致OoD的原因,并探讨使自主系统的安全保障和OoD检测具有挑战性的因素。我们的综述确定了一系列可以在整个ML开发生命周期中使用的技术,并建议了生命周期中可以使用它们来支持安全保障论证的环节。我们讨论了系统和安全工程师在将OoD检测集成到系统生命周期时必须注意的若干事项。最后,我们概述了在一系列领域和应用中安全开发和运行自主系统所需应对的挑战和未来工作。
摘要:The operational capabilities and application domains of AI-enabled autonomous systems have expanded significantly in recent years due to advances in robotics and machine learning (ML). Demonstrating the safety of autonomous systems rigorously is critical for their responsible adoption but it is challenging as it requires robust methodologies that can handle novel and uncertain situations throughout the system lifecycle, including detecting out-of-distribution (OoD) data. Thus, OOD detection is receiving increased attention from the research, development and safety engineering communities. This comprehensive review analyses OOD detection techniques within the context of safety assurance for autonomous systems, in particular in safety-critical domains. We begin by defining the relevant concepts, investigating what causes OOD and exploring the factors which make the safety assurance of autonomous systems and OOD detection challenging. Our review identifies a range of techniques which can be used throughout the ML development lifecycle and we suggest areas within the lifecycle in which they may be used to support safety assurance arguments. We discuss a number of caveats that system and safety engineers must be aware of when integrating OOD detection into system lifecycles. We conclude by outlining the challenges and future work necessary for the safe development and operation of autonomous systems across a range of domains and applications.
【56】OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series
标题:OutboundEval:面向Xbench专业对齐系列的专家级智能外呼评估双维基准
链接:https://arxiv.org/abs/2510.21244
摘要:我们提出OutboundEval,一个在专家级智能外呼场景中评估大型语言模型(LLM)的全面基准。现有方法存在三个关键局限:数据集多样性和类别覆盖不足、不切实际的用户模拟以及不准确的评估指标。OutboundEval通过结构化框架解决了这些问题。首先,我们设计了一个跨越六大业务领域和30个代表性子场景的基准,每个子场景都有特定的流程分解、加权评分和领域自适应的指标。其次,我们开发了一个由大模型驱动的用户模拟器,它可以生成具有真实行为、情绪变化和沟通风格的多样化、角色丰富的虚拟用户,提供可控且真实的测试环境。第三,我们引入了一种适应任务变化的动态评估方法,集成自动化评估和人在回路评估,以衡量任务执行准确性、专业知识应用、适应性和用户体验质量。在12个最先进的LLM上进行的实验揭示了专家级任务完成与交互流畅性之间的明显权衡,为构建可靠的、类人的外呼AI系统提供了实用的见解。OutboundEval为在专业应用中对LLM进行基准测试建立了一个实用、可扩展、面向领域的标准。
摘要:We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations - insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics - OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.
【57】Physics-Informed Neural Networks for MIMO Beam Map and Environment Reconstruction
标题:用于MIMO波束图和环境重建的物理信息神经网络
链接:https://arxiv.org/abs/2510.21238
摘要:随着通信网络向更高复杂性演进(例如6G及以后),对无线环境的深入理解变得越来越重要。当环境的显式知识不可用时,从信道状态信息(CSI)中提取几何感知特征成为连接物理层测量与网络智能的关键方法。本文提出在没有显式3D环境知识的情况下,利用接收信号强度(RSS)数据,为多输入多输出(MIMO)系统联合构建无线电波束图和环境几何。与仅学习阻挡结构的现有方法不同,我们提出了一个有向虚拟障碍物模型,同时捕获阻挡和反射的几何特征。根据环境的几何关系构造反射区,以识别相关的反射路径。我们推导出反射区的解析表达式,并进一步分析其几何特征,以得到与深度学习表示更兼容的重构形式。我们提出了一种结合基于反射区几何模型的物理信息深度学习框架,用于学习阻挡、反射和散射分量以及波束图案,利用物理先验知识增强网络的可迁移性。数值实验表明,除了重建阻挡和反射几何之外,该模型还能构建更准确的MIMO波束图,精度提高32%-48%。
摘要:As communication networks evolve towards greater complexity (e.g., 6G and beyond), a deep understanding of the wireless environment becomes increasingly crucial. When explicit knowledge of the environment is unavailable, geometry-aware feature extraction from channel state information (CSI) emerges as a pivotal methodology to bridge physical-layer measurements with network intelligence. This paper proposes to explore the received signal strength (RSS) data, without explicit 3D environment knowledge, to jointly construct the radio beam map and environmental geometry for a multiple-input multiple-output (MIMO) system. Unlike existing methods that only learn blockage structures, we propose an oriented virtual obstacle model that captures the geometric features of both blockage and reflection. Reflective zones are formulated to identify relevant reflected paths according to the geometry relation of the environment. We derive an analytical expression for the reflective zone and further analyze its geometric characteristics to develop a reformulation that is more compatible with deep learning representations. A physics-informed deep learning framework that incorporates the reflective-zone-based geometry model is proposed to learn the blockage, reflection, and scattering components, along with the beam pattern, which leverages physics prior knowledge to enhance network transferability. Numerical experiments demonstrate that, in addition to reconstructing the blockage and reflection geometry, the proposed model can construct a more accurate MIMO beam map with a 32%-48% accuracy improvement.
【58】Securing AI Agent Execution
标题:保障AI代理执行的安全
链接:https://arxiv.org/abs/2510.21236
摘要:大型语言模型(LLM)已经发展成为与外部工具和环境交互以执行复杂任务的AI代理。模型上下文协议(MCP)已成为连接代理与此类资源的事实标准,但安全性却滞后了:数千个MCP服务器在对主机系统的访问不受限制的情况下执行,从而产生了广泛的攻击面。在本文中,我们介绍了AgentBound,第一个面向MCP服务器的访问控制框架。AgentBound结合了受Android权限模型启发的声明式策略机制,以及无需修改MCP服务器即可遏制恶意行为的策略执行引擎。我们建立了一个包含296个最流行MCP服务器的数据集,并表明访问控制策略可以从源代码自动生成,准确率为80.9%。我们还表明,AgentBound能阻止若干恶意MCP服务器中的大多数安全威胁,且策略执行引擎引入的开销可以忽略不计。我们的贡献为开发人员和项目经理提供了在保持生产力的同时保护MCP服务器的实用基础,使研究人员和工具构建者能够探索声明式访问控制和MCP安全的新方向。
摘要:Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that the policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a practical foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
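A minimal sketch of what an Android-style declarative policy with default-deny enforcement could look like for an MCP server. The schema, action names, and API below are hypothetical illustrations, not AgentBound's actual format:

    from dataclasses import dataclass, field

    @dataclass
    class Policy:
        # Capabilities declared up front, analogous to Android permission requests.
        allow_network_hosts: set = field(default_factory=set)
        allow_read_paths: set = field(default_factory=set)
        allow_write_paths: set = field(default_factory=set)

    def check(policy: Policy, action: str, target: str) -> bool:
        """Return True only if the declared policy permits the requested action."""
        if action == "net.connect":
            return target in policy.allow_network_hosts
        if action == "fs.read":
            return any(target.startswith(p) for p in policy.allow_read_paths)
        if action == "fs.write":
            return any(target.startswith(p) for p in policy.allow_write_paths)
        return False  # default-deny: undeclared capabilities are contained

    policy = Policy(allow_network_hosts={"api.example.com"},
                    allow_read_paths={"/data/"})
    assert check(policy, "fs.read", "/data/notes.txt")
    assert not check(policy, "fs.write", "/etc/passwd")  # malicious write blocked

Under a default-deny design like this, a compromised server can only exercise capabilities declared ahead of time, which is the containment property the abstract describes.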
【59】PLAN: Proactive Low-Rank Allocation for Continual Learning
标题:PLAN:面向持续学习的主动低秩分配
链接:https://arxiv.org/abs/2510.21188
备注:accepted by ICCV 2025
摘要:持续学习(CL)要求模型不断适应新任务,而不遗忘过去的知识。在这项工作中,我们提出了Proactive Low-rank AllocatioN(PLAN),这是一个扩展低秩自适应(LoRA)的框架,可在CL设置中对大型预训练模型进行高效且干扰感知的微调。PLAN通过为每个任务引入正交基向量,并以基于扰动的策略对其进行优化以最大限度减少与先前所学参数的冲突,从而主动管理任务特定子空间的分配。此外,PLAN还引入了一种新的选择机制,识别并分配对干扰敏感度最小的基向量,在保持对新任务高效适应的同时降低损害过去知识的风险。标准CL基准上的实证结果表明,PLAN始终优于现有方法,为基础模型的持续学习确立了新的最先进水平。
摘要:Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose Proactive Low-rank AllocatioN (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.
【60】Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
标题:使用概率推理降低语言模型中不良输出的可能性
链接:https://arxiv.org/abs/2510.21184
摘要:强化学习(RL)已成为将语言模型(LM)与人类偏好对齐、或促进给定奖励函数视为理想输出的主要技术。标准RL方法优化平均回报,而明确致力于降低不良输出概率的方法通常以平均情形性能为代价。为了改善这种权衡,我们引入了RePULSe,一种新的训练方法,在标准RL损失之外增加一项额外损失:利用学习到的提议分布引导采样低回报输出,再降低这些输出的概率。实验表明,与标准RL对齐方法及其替代方案相比,RePULSe在期望回报与不良输出概率之间实现了更好的权衡,并且具有更强的对抗鲁棒性。
摘要:Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
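The core idea, an auxiliary loss that samples from a learned proposal and pushes down the policy probability of low-reward outputs, can be sketched on a toy categorical policy. The estimator, importance weighting, and threshold below are illustrative assumptions, not the paper's exact objective:

    import torch

    def repulse_style_loss(logits_pi, logits_q, rewards, lam=0.1, thresh=0.0):
        pi = torch.distributions.Categorical(logits=logits_pi)
        q = torch.distributions.Categorical(logits=logits_q)

        # Standard REINFORCE term on a sample from the current policy.
        a_pi = pi.sample()
        pg_loss = -(rewards[a_pi] * pi.log_prob(a_pi))

        # Proposal-guided term: sample from q, keep low-reward outputs, and
        # reduce their probability under pi (importance weight pi/q, detached).
        a_q = q.sample((64,))
        low = (rewards[a_q] < thresh).float()
        iw = (pi.log_prob(a_q) - q.log_prob(a_q)).exp().detach()
        push_down = (low * iw * pi.log_prob(a_q)).mean()

        return pg_loss + lam * push_down

    rewards = torch.tensor([1.0, 0.5, -2.0, 0.2])   # toy per-action rewards
    logits_pi = torch.zeros(4, requires_grad=True)  # 4-action policy
    logits_q = torch.tensor([0.0, 0.0, 2.0, 0.0])   # proposal biased to low reward
    repulse_style_loss(logits_pi, logits_q, rewards).backward()

Minimizing the second term drives log pi(y) down exactly on the low-reward samples the proposal surfaces, which is the trade-off the abstract measures.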
【61】Shylock: Causal Discovery in Multivariate Time Series based on Hybrid Constraints
标题:夏洛克:基于混合约束的多元时间序列因果发现
链接:https://arxiv.org/abs/2510.21181
摘要:因果关系发现因其广泛的应用而受到越来越多的关注。现有方法依赖于人类经验、统计方法或图形准则方法,这些方法容易出错、停留在理想化的假设上,并且依赖大量数据。许多领域在获取多元时间序列(MTS)方面还存在严重的数据缺口,增加了寻找因果关系的难度,现有方法容易在这类数据上过拟合。为了填补上述空白,本文提出了夏洛克(Shylock),一种在少样本和常规MTS上都能很好地发现因果关系的新方法。夏洛克通过使用群扩张卷积和共享核将参数数量按指数级减少,同时仍能为带时间延迟的变量学习更好的表示。通过结合全局约束和局部约束,夏洛克实现了网络间的信息共享,有助于提高精度。为了评估夏洛克的性能,我们还设计了一种数据生成方法来生成带时间延迟的MTS。我们在常用基准和生成的数据集上对其进行评估。大量实验表明,无论在少样本还是常规MTS上,夏洛克均优于现有的两种最先进方法。我们还开发了易于使用的库Tcausal,并将其部署在EarthDataMiner平台上。
摘要:Causal relationship discovery has been drawing increasing attention due to its prevalent application. Existing methods rely on human experience, statistical methods, or graphical criteria methods, which are error-prone, stuck at idealized assumptions, and rely on a huge amount of data. There is also a serious data gap in accessing multivariate time series (MTS) in many areas, adding difficulty in finding their causal relationships, and existing methods tend to overfit on such data. To fill the gaps mentioned above, in this paper, we propose Shylock, a novel method that works well on both few-shot and normal MTS to find causal relationships. Shylock can reduce the number of parameters exponentially by using group dilated convolution and a shared kernel, but still learn a better representation of variables with time delay. By combining the global constraint and the local constraint, Shylock achieves information sharing among networks to help improve the accuracy. To evaluate the performance of Shylock, we also design a data generation method to generate MTS with time delay. We evaluate it on commonly used benchmarks and generated datasets. Extensive experiments show that Shylock outperforms two existing state-of-the-art methods on both few-shot and normal MTS. We also developed Tcausal, a library for easy use, and deployed it on the EarthDataMiner platform.
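As a rough illustration of the parameter-sharing idea (a generic reading of "group dilated convolution with a shared kernel", not Shylock's actual layer), one depthwise kernel can be reused across exponentially growing dilation rates, so the receptive field over time delays grows while the parameter count stays fixed:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedKernelDilatedStack(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 3, n_levels: int = 4):
            super().__init__()
            # One depthwise kernel reused at dilations 1, 2, 4, 8 (weight sharing).
            self.weight = nn.Parameter(torch.randn(channels, 1, kernel_size) * 0.1)
            self.dilations = [2 ** i for i in range(n_levels)]

        def forward(self, x):  # x: (batch, variables, time)
            out = 0
            for d in self.dilations:
                pad = (self.weight.shape[-1] - 1) * d  # causal left padding
                out = out + F.conv1d(F.pad(x, (pad, 0)), self.weight,
                                     dilation=d, groups=x.shape[1])
            return out

    x = torch.randn(2, 8, 100)  # 2 series, 8 variables, 100 time steps
    print(SharedKernelDilatedStack(8)(x).shape)  # torch.Size([2, 8, 100])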
【62】Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models
标题:面向Zero-Shot视觉语言模型的无记忆持续学习与零空间自适应
链接:https://arxiv.org/abs/2510.21175
摘要:预训练的视觉语言模型(VLM),如CLIP,已展现出显著的zero-shot泛化能力,无需额外的任务特定训练即可部署于广泛的现实任务。然而,在环境不断演化或出现新类别的实际部署场景中,这些模型不可避免地面临分布偏移和新任务。在这种情况下,静态的zero-shot能力并不足够,因此越来越需要能让模型随时间适应、同时避免灾难性遗忘的持续学习方法。我们介绍NuSA-CL(面向持续学习的零空间适应),一个旨在应对这一挑战的轻量级无记忆持续学习框架。NuSA-CL采用低秩自适应,并将任务特定的权重更新限制在模型当前参数的近似零空间内。该策略最大限度地减少了对先前所获知识的干扰,有效保留了原始模型的zero-shot能力。与依赖重放缓冲区或昂贵蒸馏的方法不同,NuSA-CL的计算和内存开销极小,适合部署在资源受限的现实持续学习环境中。实验表明,该框架不仅有效保持了zero-shot迁移能力,而且在持续学习基准上取得了极具竞争力的性能。这些结果使NuSA-CL成为在实际应用中持续演化zero-shot VLM的实用且可扩展的解决方案。
摘要:Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated remarkable zero-shot generalization, enabling deployment in a wide range of real-world tasks without additional task-specific training. However, in real deployment scenarios with evolving environments or emerging classes, these models inevitably face distributional shifts and novel tasks. In such contexts, static zero-shot capabilities are insufficient, and there is a growing need for continual learning methods that allow models to adapt over time while avoiding catastrophic forgetting. We introduce NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free continual learning framework designed to address this challenge. NuSA-CL employs low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model's current parameters. This strategy minimizes interference with previously acquired knowledge, effectively preserving the zero-shot capabilities of the original model. Unlike methods relying on replay buffers or costly distillation, NuSA-CL imposes minimal computational and memory overhead, making it practical for deployment in resource-constrained, real-world continual learning environments. Experiments show that our framework not only effectively preserves zero-shot transfer capabilities but also achieves highly competitive performance on continual learning benchmarks. These results position NuSA-CL as a practical and scalable solution for continually evolving zero-shot VLMs in real-world applications.
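A minimal sketch of the central mechanism as we read it: project a candidate weight update into an approximate null space of the current weights via SVD. The energy threshold and the use of a plain dense matrix (rather than low-rank LoRA factors) are simplifying assumptions:

    import torch

    def null_space_project(W, delta, energy=0.99):
        """Keep only the part of `delta` acting on input directions W barely uses."""
        U, S, Vh = torch.linalg.svd(W)
        cum = torch.cumsum(S, 0) / S.sum()
        r = int((cum < energy).sum()) + 1   # rank covering `energy` of the spectrum
        N = Vh[r:].T                        # basis of the approximate null space
        return delta @ N @ N.T              # restrict the update to that subspace

    W, delta = torch.randn(64, 64), torch.randn(64, 64)
    safe = null_space_project(W, delta)
    x = torch.linalg.svd(W).Vh[0]           # a direction W strongly uses
    print((delta @ x).norm().item(), (safe @ x).norm().item())  # second is ~0

Because the projected update is (numerically) orthogonal to the input directions the model already relies on, adding it to W leaves previous behavior, and hence the zero-shot capabilities, largely intact.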
【63】Towards Straggler-Resilient Split Federated Learning: An Unbalanced Update Approach
标题:迈向抗掉队者的拆分联邦学习:一种不平衡更新方法
链接:https://arxiv.org/abs/2510.21155
摘要:Split Federated Learning(SFL)通过将Federated Learning(FL)的并行性与Split Learning(SL)的计算卸载相结合,实现了边缘设备上的可扩展训练。尽管SFL取得了巨大成功,但它仍深受分布式学习系统中众所周知的掉队者问题的困扰。Split Server与客户端之间的依赖性加剧了这一问题:Split Server端的模型更新依赖于从客户端接收激活。这种同步要求引入了显著的时间延迟,使掉队者成为系统可扩展性和效率的关键瓶颈。为了缓解这个问题,我们提出了MU-SplitFed,一种零阶优化下的抗掉队者SFL算法,它通过一种简单而有效的不平衡更新机制将训练进度与掉队者延迟解耦。通过使服务器能够在每个客户端回合执行$\tau$次本地更新,MU-SplitFed对非凸目标实现了$O(\sqrt{d/(\tau T)})$的收敛速度,在通信回合数上获得关于$\tau$的线性加速。实验表明,在存在掉队者的情况下,MU-SplitFed始终优于基线方法,并通过自适应调整$\tau$有效减轻其影响。本项目的代码可在https://github.com/Johnny-Zip/MU-SplitFed获取。
摘要:Split Federated Learning (SFL) enables scalable training on edge devices by combining the parallelism of Federated Learning (FL) with the computational offloading of Split Learning (SL). Despite its great success, SFL suffers significantly from the well-known straggler issue in distributed learning systems. This problem is exacerbated by the dependency between Split Server and clients: the Split Server side model update relies on receiving activations from clients. Such synchronization requirement introduces significant time latency, making straggler a critical bottleneck to the scalability and efficiency of the system. To mitigate this problem, we propose MU-SplitFed, a straggler-resilient SFL algorithm in zeroth-order optimization that decouples training progress from straggler delays via a simple yet effective unbalanced update mechanism. By enabling the server to perform $\tau$ local updates per client round, MU-SplitFed achieves a convergence rate of $O(\sqrt{d/(\tau T)})$ for non-convex objectives, demonstrating a linear speedup of $\tau$ in communication rounds. Experiments demonstrate that MU-SplitFed consistently outperforms baseline methods with the presence of stragglers and effectively mitigates their impact through adaptive tuning of $\tau$. The code for this project is available at https://github.com/Johnny-Zip/MU-SplitFed.
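A toy numeric sketch of the unbalanced update: one (possibly straggling) client upload per round, followed by tau cheap server-side steps with a two-point zeroth-order gradient estimator. The quadratic stand-in loss and all constants are illustrative assumptions, matching the paper's zeroth-order setting only in spirit:

    import numpy as np

    rng = np.random.default_rng(0)
    d, tau, rounds, lr, mu = 10, 5, 200, 0.05, 1e-3
    w_server = rng.normal(size=d)

    def server_loss(w, activation):      # stand-in for the server-side objective
        return 0.5 * np.sum((w - activation) ** 2)

    def zo_grad(f, w):                   # two-point zeroth-order estimate
        u = rng.normal(size=w.shape)
        return (f(w + mu * u) - f(w - mu * u)) / (2 * mu) * u

    for t in range(rounds):
        # One activation upload per communication round (the straggler-bound step)...
        activation = rng.normal(size=d) * 0.1
        # ...but tau local server updates, decoupling progress from client latency.
        for _ in range(tau):
            w_server -= lr * zo_grad(lambda w: server_loss(w, activation), w_server)

    print(server_loss(w_server, np.zeros(d)))  # settles near the optimum region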
【64】Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design
标题:用于3D从头分子设计的不确定性感知多目标强化学习引导扩散模型
链接:https://arxiv.org/abs/2510.21153
备注:Accepted at NeurIPS 2025
摘要:设计具有目标性质的从头3D分子仍然是药物发现和分子工程中的基本挑战。虽然扩散模型在生成高质量3D分子结构方面表现出卓越能力,但它们通常难以有效控制对现实应用至关重要的复杂多目标约束。在这项研究中,我们提出了一个不确定性感知的强化学习(RL)框架,引导3D分子扩散模型朝多个性质目标进行优化,同时提高生成分子的整体质量。我们的方法利用带预测不确定性估计的代理模型动态塑造奖励函数,促进多个优化目标之间的平衡。我们在三个基准数据集和多种扩散模型架构上全面评估了该框架,其在分子质量和性质优化方面始终优于基线。此外,对排名靠前的生成候选分子进行的分子动力学(MD)模拟和ADMET分析表明,它们具有与已知表皮生长因子受体(EGFR)抑制剂相当的类药行为和结合稳定性。我们的结果表明,RL引导的生成扩散模型在推进自动化分子设计方面潜力巨大。
摘要:Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.
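The reward-shaping step can be sketched with ensemble surrogates whose disagreement serves as the predictive uncertainty; the mean-minus-kappa-std rule and the toy linear surrogates are illustrative assumptions, not the paper's exact scheme:

    import numpy as np

    def shaped_reward(mol_features, surrogates, weights, kappa=1.0):
        """surrogates[i] is an ensemble of predictors for objective i."""
        total = 0.0
        for ensemble, w in zip(surrogates, weights):
            preds = np.array([m(mol_features) for m in ensemble])
            # Discount objectives the surrogate is unsure about.
            total += w * (preds.mean() - kappa * preds.std())
        return total

    rng = np.random.default_rng(1)
    make_member = lambda: (lambda x, a=rng.normal(size=4): float(a @ x))
    surrogates = [[make_member() for _ in range(5)] for _ in range(2)]  # 2 objectives
    print(shaped_reward(rng.normal(size=4), surrogates, weights=[0.7, 0.3]))

This shaped reward would then drive the RL fine-tuning loop that steers the diffusion sampler, so uncertain property estimates cannot dominate the multi-objective balance.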
【65】String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation
标题:字符串思维种子:提示LLM实现分布忠实且多样化的生成
链接:https://arxiv.org/abs/2510.21150
摘要:我们介绍了字符串思维种子(SSoT),一种提升LLM概率指令遵循(PIF)能力的新型提示方法。我们将PIF定义为这样一项任务:要求LLM从一组预定义选项中选择答案,每个选项关联特定概率,使得在多次提示时,生成答案的经验分布与目标分布一致。虽然LLM擅长具有单一确定性答案的任务,但它们经常在PIF上失败,表现出的偏差对人类行为模拟、内容多样化和多人游戏等需要非确定性行为的应用而言是个问题。这也会损害生成回复的多样性(测试时扩展的一个关键因素),因为它会导致输出坍缩为一组有限的答案。为了解决这个问题,我们提出了SSoT,一种简单的提示方法,指示LLM先输出一个随机字符串以产生足够的熵。SSoT还指示LLM通过操纵该字符串来提取随机性并导出最终答案,从而在遵守特定约束的同时保持多样性。我们证明SSoT显著提升了LLM的PIF性能,接近伪随机数发生器的理想水平。此外,我们在NoveltyBench上的实验表明,SSoT的收益通过增强回复多样性从封闭集任务扩展到了开放式任务。
摘要:We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Furthermore, our experiments on NoveltyBench show SSoT's benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
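A minimal sketch of the SSoT recipe: the prompt asks for a random string first, then a fixed rule maps that string to an option so that, across many calls, the empirical answer distribution can track the target probabilities. The prompt wording and the hash-based mapping are assumptions for illustration:

    import hashlib

    def ssot_prompt(options_with_probs):
        spec = ", ".join(f"'{o}' with probability {p}" for o, p in options_with_probs)
        return (
            "First output RANDOM: followed by 32 random alphanumeric characters.\n"
            f"Then choose among {spec}. Convert your random string to a number u in "
            "[0,1) and pick the option whose cumulative-probability interval "
            "contains u. Output ANSWER: <option>."
        )

    def derive_answer(random_string, options_with_probs):
        # Reference implementation of the mapping the prompt asks the model to follow.
        u = int(hashlib.sha256(random_string.encode()).hexdigest(), 16) / 16 ** 64
        acc = 0.0
        for option, p in options_with_probs:
            acc += p
            if u < acc:
                return option
        return options_with_probs[-1][0]

    opts = [("rock", 0.5), ("paper", 0.3), ("scissors", 0.2)]
    print(ssot_prompt(opts))
    print(derive_answer("q7ZtExampleRandomString", opts))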
【66】How to Auto-optimize Prompts for Domain Tasks? Adaptive Prompting and Reasoning through Evolutionary Domain Knowledge Adaptation
标题:如何为领域任务自动优化提示?通过进化式领域知识适应实现自适应提示与推理
链接:https://arxiv.org/abs/2510.21148
摘要:在实际应用中,为特定领域任务的大型语言模型(LLM)设计最佳提示和推理过程既必要又具有挑战性。如何整合领域知识、提高推理效率,乃至为领域专家提供精炼的知识整合提示,是至关重要但尚未解决的问题。在这项研究中,我们提出了面向提示的进化图优化(EGO-Prompt),一个用于设计更好的提示和高效推理过程、并提供增强的因果信息过程的自动化框架。EGO-Prompt从人类专家构建的通用提示和容错的初始语义因果图(SCG)描述开始,随后自动细化和优化,以指导LLM推理。考虑到专家定义的SCG可能不完整或不完美,且其最佳整合方式因LLM而异,EGO-Prompt将一种新的因果引导文本梯度过程分为两步:首先,为每个实例从SCG生成近乎确定性的推理指导;其次,调整LLM以便结合原始输入有效利用该指导。迭代优化算法使用带真实标注的文本梯度进一步细化SCG和推理机制。我们在现实世界的公共卫生、交通和人类行为任务上测试了该框架。EGO-Prompt的F1比最先进方法高出7.32%-12.61%,并使小模型以不到原始成本20%的开销达到较大模型的性能。它还输出一个经过细化的领域特定SCG,提高了可解释性。
摘要:Designing optimal prompts and reasoning processes for large language models (LLMs) on domain-specific tasks is both necessary and challenging in real-world applications. Determining how to integrate domain knowledge, enhance reasoning efficiency, and even provide domain experts with refined knowledge integration hints are particularly crucial yet unresolved tasks. In this research, we propose Evolutionary Graph Optimization for Prompting (EGO-Prompt), an automated framework for designing better prompts and efficient reasoning processes while providing an enhanced causal-informed process. EGO-Prompt begins with a general prompt and fault-tolerant initial Semantic Causal Graph (SCG) descriptions, constructed by human experts, which are then automatically refined and optimized to guide LLM reasoning. Recognizing that expert-defined SCGs may be partial or imperfect and that their optimal integration varies across LLMs, EGO-Prompt integrates a novel causal-guided textual gradient process in two steps: first, generating nearly deterministic reasoning guidance from the SCG for each instance, and second, adapting the LLM to effectively utilize the guidance alongside the original input. The iterative optimization algorithm further refines both the SCG and the reasoning mechanism using textual gradients with ground truth. We tested the framework on real-world public health, transportation and human behavior tasks. EGO-Prompt achieves 7.32%-12.61% higher F1 than cutting-edge methods, and allows small models to reach the performance of larger models at under 20% of the original cost. It also outputs a refined, domain-specific SCG that improves interpretability.
【67】NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge
标题:NeuroGenPoisoning:通过外部知识的遗传优化对LLM检索增强生成的神经元引导攻击
链接:https://arxiv.org/abs/2510.21144
摘要:检索增强生成(RAG)使大型语言模型(LLM)能够在推理过程中动态整合外部知识,提高其事实准确性和适应性。然而,攻击者可以注入中毒的外部知识来覆盖模型的内部记忆。虽然现有攻击迭代地操纵RAG的检索内容或提示结构,但它们在很大程度上忽略了模型的内部表示动态和神经元级别的敏感性。RAG中毒的底层机制尚未得到充分研究,也没有考虑RAG中与强参数化知识发生知识冲突的影响。在这项工作中,我们提出了NeuroGenPoisoning,一种在LLM内部神经元归因和遗传优化的指导下生成RAG对抗性外部知识的新型攻击框架。我们的方法首先识别一组中毒响应神经元,其激活与上下文中毒知识密切相关。然后,我们采用遗传算法来进化能最大程度激活这些神经元的对抗性文段。至关重要的是,我们的框架借助观察到的归因信号识别并重用有前景但最初未成功的外部知识变体,从而实现有效中毒RAG知识的大规模生成。同时,中毒响应神经元引导的中毒可以有效化解知识冲突。跨模型和数据集的实验结果表明,该方法在保持流畅性的同时,始终实现超过90%的高群体覆写成功率(POSR)。实证结果表明,该方法有效地解决了知识冲突问题。
摘要:Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to dynamically integrate external knowledge during inference, improving their factual accuracy and adaptability. However, adversaries can inject poisoned external knowledge to override the model's internal memory. While existing attacks iteratively manipulate retrieval content or the prompt structure of RAG, they largely ignore the model's internal representation dynamics and neuron-level sensitivities. The underlying mechanism of RAG poisoning has not been fully studied, and the effect of knowledge conflict with strong parametric knowledge in RAG is not considered. In this work, we propose NeuroGenPoisoning, a novel attack framework that generates adversarial external knowledge in RAG guided by LLM internal neuron attribution and genetic optimization. Our method first identifies a set of Poison-Responsive Neurons whose activation strongly correlates with contextual poisoning knowledge. We then employ a genetic algorithm to evolve adversarial passages that maximally activate these neurons. Crucially, our framework enables massive-scale generation of effective poisoned RAG knowledge by identifying and reusing promising but initially unsuccessful external knowledge variants via observed attribution signals. At the same time, Poison-Responsive-Neuron-guided poisoning can effectively resolve knowledge conflicts. Experimental results across models and datasets demonstrate that our method consistently achieves a high Population Overwrite Success Rate (POSR) of over 90% while preserving fluency. Empirical evidence shows that our method effectively resolves knowledge conflicts.
【68】PanicToCalm: A Proactive Counseling Agent for Panic Attacks
标题:PanicToCalm:针对恐慌发作的主动咨询代理
链接:https://arxiv.org/abs/2510.21143
摘要:恐慌发作是恐惧和痛苦的急性发作,及时、恰当的干预可以显著帮助个人恢复稳定。然而,由于伦理和后勤方面的问题,用于训练此类模型的合适数据集仍然稀缺。为了解决这个问题,我们引入了PACE,一个由第一人称叙事构建的高痛苦发作数据集,并依据心理急救(PFA)原则进行结构化。利用这些数据,我们训练了PACER,一个旨在提供共情且具指导性支持的咨询模型,通过监督学习和模拟偏好对齐进行优化。为了评估其有效性,我们提出了PanicEval,一个涵盖一般咨询质量和危机特定策略的多维度框架。实验结果表明,PACER在咨询师侧指标和来访者情绪改善方面都优于强基线。人类评估进一步证实了它的实用价值:在恐慌场景中,PACER始终优于通用模型、基于CBT的模型和GPT-4驱动的模型(代码可在https://github.com/JihyunLee1/PanicToCalm获取)。
摘要:Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce PACE, which is a dataset that includes high-distress episodes constructed from first-person narratives, and structured around the principles of Psychological First Aid (PFA). Using this data, we train PACER, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that PACER outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with PACER consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios (Code is available at https://github.com/JihyunLee1/PanicToCalm ).
【69】Quantifying CBRN Risk in Frontier Models
标题:量化前沿模型中的CBRN风险
链接:https://arxiv.org/abs/2510.21133
摘要:前沿大型语言模型(LLM)可能扩散化学、生物、放射性和核(CBRN)武器知识,带来前所未有的双重用途风险。我们采用严格的三层攻击方法,对10个领先的商业LLM在一个新的200条提示CBRN数据集和FORTRESS基准的180条提示子集上进行了首次全面评估。我们的研究结果暴露了关键的安全漏洞:Deep Inception攻击的成功率达86.0%,而直接请求仅为33.8%,表明过滤机制流于表面;各模型的安全表现差异巨大,攻击成功率从2%(claude-opus-4)到96%(mistral-small-latest)不等;当被要求增强危险材料特性时,8个模型的脆弱率超过70%。我们发现当前安全对齐存在根本性的脆弱,简单的提示工程技术就能绕过针对危险CBRN信息的保障措施。这些结果挑战了业界的安全声明,凸显了对标准化评估框架、透明安全指标和更鲁棒对齐技术的迫切需求,以在保留有益能力的同时减轻灾难性误用风险。
摘要:Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating superficial filtering mechanisms; Model safety performance varies dramatically from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt engineering techniques bypass safeguards for dangerous CBRN information. These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
【70】Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications
标题:大型语言模型遇上文本属性图:集成框架与应用综述
链接:https://arxiv.org/abs/2510.21131
备注:Surveys and overviews; Natural language processing; Knowledge representation and reasoning; Graph algorithms
摘要:大型语言模型(LLM)凭借强大的语义理解与生成能力,在自然语言处理方面取得了显著成功。然而,其黑盒性质限制了结构化和多跳推理。相比之下,文本属性图(TAG)提供了富含文本上下文的显式关系结构,但往往缺乏语义深度。最近的研究表明,结合LLM和TAG能带来互补收益:增强TAG表示学习,并提高LLM的推理能力和可解释性。本综述首次从编排的角度对LLM-TAG集成进行了系统回顾。我们引入了一种新的分类法,涵盖两个基本方向:LLM for TAG(LLM增强基于图的任务)和TAG for LLM(结构化图改进LLM推理)。我们将编排策略分为顺序、并行和多模块框架,并讨论了TAG特定的预训练、提示和参数高效微调方面的进展。在方法论之外,我们还总结了经验见解,整理了可用数据集,并重点介绍了推荐系统、生物医学分析和知识密集型问答等多种应用。最后,我们概述了开放挑战和有前景的研究方向,旨在指导语言与图学习交叉领域的未来工作。
摘要:Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM-TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.
【71】Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks
标题:增强的进化多目标深度强化学习,用于可靠有效的无线可充电传感器网络
链接:https://arxiv.org/abs/2510.21127
备注:15 pages, 9 figures, submited to TVT
摘要:尽管传感器网络发展迅速,但传统电池供电的传感器网络受限于有限的工作寿命和频繁的维护需求,严重制约了其在偏远和难以到达环境中的部署。因此,具有移动充电能力的无线可充电传感器网络(WRSN)为延长网络寿命提供了有前途的解决方案。然而,在动态运行条件下,WRSN面临着最大化节点存活率与最大化充电能量效率之间固有权衡带来的关键挑战。在本文中,我们研究了一个典型场景:移动充电器移动并为传感器充电,在保持网络连通性的同时最大限度地减少能源浪费。具体来说,我们构建了一个多目标优化问题,在多个时隙上同时最大化网络节点存活率和移动充电器的能源使用效率;该问题具有NP难的计算复杂性和长期时间依赖性,使传统优化方法失效。为了解决这些挑战,我们提出了一种增强的进化多目标深度强化学习算法,集成了用于时间模式识别的基于长短期记忆(LSTM)的策略网络、用于未来状态预测的基于多层感知器的前瞻增量模型,以及用于动态偏好适应的时变Pareto策略评估方法。大量仿真结果表明,该算法在平衡节点存活率和能源效率方面显著优于现有方法,同时能生成多样的帕累托最优解。此外,LSTM增强的策略网络收敛速度比传统网络快25%,时变评估方法能有效适应动态条件。
摘要:Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. However, WRSNs face critical challenges from the inherent trade-off between maximizing the node survival rates and maximizing charging energy efficiency under dynamic operational conditions. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensor, thereby maintaining the network connectivity while minimizing the energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots, which presents NP-hard computational complexity with long-term temporal dependencies that make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, the LSTM-enhanced policy network converges 25% faster than conventional networks, with the time-varying evaluation method effectively adapting to dynamic conditions.
【72】Generalizable Hierarchical Skill Learning via Object-Centric Representation
标题:通过以对象为中心的表示进行可推广的分层技能学习
链接:https://arxiv.org/abs/2510.21121
摘要:我们提出了可泛化分层技能学习(GSL),一种显著提高机器人操作中策略泛化能力与样本效率的分层策略学习新框架。GSL的核心思想之一是使用以对象为中心的技能作为连接高层视觉语言模型和低层视觉运动策略的接口。具体来说,GSL使用基础模型将演示分解为可迁移且对象规范化的技能原语,确保在对象坐标系中高效地进行低层技能学习。在测试时,高层代理预测的技能-对象对被馈送到低层模块,在那里推断出的规范动作被映射回世界坐标系执行。这种结构化而灵活的设计使我们的方法在未见过的空间布置、物体外观和任务组合上的样本效率与泛化能力大幅提升。在仿真中,GSL每个任务仅用3次演示进行训练,在未见任务上比使用30倍以上数据训练的基线高出15.5%。在现实世界实验中,GSL同样超过了用10倍以上数据训练的基线。
摘要:We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.
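The frame bookkeeping the abstract describes reduces to standard SE(3) algebra: canonicalize demonstrated actions into the object frame, and map the predicted canonical action back to the world frame under the new object pose at test time. Treating a skill step as a single gripper pose is our simplification:

    import numpy as np

    def make_T(R, t):
        T = np.eye(4); T[:3, :3] = R; T[:3, 3] = t
        return T

    def world_to_object(T_world_action, T_world_object):   # canonicalize
        return np.linalg.inv(T_world_object) @ T_world_action

    def object_to_world(T_object_action, T_world_object):  # de-canonicalize
        return T_world_object @ T_object_action

    Rz = lambda a: np.array([[np.cos(a), -np.sin(a), 0],
                             [np.sin(a),  np.cos(a), 0],
                             [0, 0, 1]])
    T_obj_demo = make_T(Rz(0.0), np.array([0.2, 0.0, 0.1]))
    T_act_demo = make_T(Rz(0.3), np.array([0.25, 0.05, 0.12]))

    canonical = world_to_object(T_act_demo, T_obj_demo)       # learned in object frame
    T_obj_test = make_T(Rz(1.2), np.array([-0.1, 0.4, 0.1]))  # unseen arrangement
    print(np.round(object_to_world(canonical, T_obj_test), 3))  # executable pose

Because the low-level policy only ever sees object-frame actions, a novel spatial arrangement changes just the outer transform, which is one plausible reading of why generalization improves.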
【73】The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection
标题:忠实性的灰色地带:驯服不忠实检测中的模糊性
链接:https://arxiv.org/abs/2510.21118
摘要:确保大型语言模型(LLM)生成忠实于给定源文档的摘要对于实际应用至关重要。虽然先前的研究已经探索了LLM的忠实性,但现有基准受注释模糊性的影响,主要是因为生成输出中允许的外部知识边界定义不清。例如,常识通常被纳入回复并被标记为“忠实”,但此类知识的可接受程度仍未明确,导致注释不一致。为了解决这个问题,我们提出了一个新的忠实性注释框架,引入一个中间类别“外部依赖”(Out-Dependent),用于归类需要外部知识才能验证的情况。使用这个框架,我们构建了VeriGray(灰色地带验证),一个新的摘要不忠实检测基准。统计数据显示,即使是GPT-5等SOTA LLM,在摘要任务中也会出现幻觉(约6%的句子)。此外,相当大比例(各模型平均约8%)的生成句子属于外部依赖类别,凸显了在不忠实检测基准中解决注释歧义的重要性。实验表明,我们的基准对多种基线方法构成重大挑战,表明未来仍有相当大的改进空间。
摘要:Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on average of models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.
【74】DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance
标题:DAO-AI:通过去中心化治理中的代理型人工智能评估集体决策
链接:https://arxiv.org/abs/2510.21117
备注:12 pages, 2 Figures
摘要:本文首次对代理型人工智能作为去中心化治理中的自主决策者进行了实证研究。利用来自主要协议的3K多个提案,我们构建了一个代理AI投票器,它能够解读提案上下文、检索历史审议数据,并独立确定自己的投票立场。该代理运行在基于可验证区块链数据的现实金融模拟环境中,通过模块化可组合程序(MCP)工作流实现,该工作流借助Agentics框架定义数据流和工具使用。我们评估了该代理的决策与人类及代币加权结果的吻合程度,并通过精心设计的评估指标观察到高度一致。我们的研究结果表明,代理型人工智能可以通过在现实的DAO治理环境中产生可解释、可审计且有实证依据的信号来增强集体决策。该研究有助于为去中心化金融系统设计可解释且在经济上严谨的人工智能代理。
摘要:This paper presents a first empirical study of agentic AI as autonomous decision-makers in decentralized governance. Using more than 3K proposals from major protocols, we build an agentic AI voter that interprets proposal contexts, retrieves historical deliberation data, and independently determines its voting position. The agent operates within a realistic financial simulation environment grounded in verifiable blockchain data, implemented through a modular composable program (MCP) workflow that defines data flow and tool usage via Agentics framework. We evaluate how closely the agent's decisions align with the human and token-weighted outcomes, uncovering strong alignments measured by carefully designed evaluation metrics. Our findings demonstrate that agentic AI can augment collective decision-making by producing interpretable, auditable, and empirically grounded signals in realistic DAO governance settings. The study contributes to the design of explainable and economically rigorous AI agents for decentralized financial systems.
【75】Urban 3D Change Detection Using LiDAR Sensor for HD Map Maintenance and Smart Mobility
标题:使用LiDAR传感器进行城市3D变化检测以实现高清地图维护和智能出行
链接:https://arxiv.org/abs/2510.21112
摘要:高清3D城市地图支持智能交通、数字孪生和自动驾驶,其中跨双时相LiDAR的对象级变化检测可以实现高清地图维护、施工监控和可靠定位。经典的DSM差分和基于图像的方法对小的垂直偏差、地面坡度和视点失配敏感,且只产生不带对象身份的逐单元输出。基于点的神经模型和体素编码需要大量内存,假设接近完美的预对齐,会退化细薄结构,并且很少强制类一致的关联,使得拆分或合并情况悬而未决并忽略不确定性。我们提出了一个面向城市规模LiDAR的以对象为中心、不确定性感知的管道:先用多分辨率NDT对齐时相,再进行点到面ICP,归一化高度,并从配准协方差和表面粗糙度导出每个位置的检测水平,以校准决策并抑制虚假变化。仅几何的代理先在时相间播种关联,再通过语义和实例分割以及带增强虚拟节点的类约束二分指派进行细化,以在保留每类计数的同时处理拆分与合并。分块处理限制了内存占用而不会侵蚀狭窄的地面变化;实例级决策将3D重叠、法向位移、高度和体积差异与直方图距离相结合,全部由局部检测水平进行门控,以在部分重叠和采样变化下保持稳定。在15个代表性Subiaco街区上,该方法达到95.2%的准确率、90.4%的mF1和82.6%的mIoU,在准确率、mF1和mIoU上分别超过Triplet KPConv 0.2、0.2和0.8个百分点,其中在“减少”(Decreased)类别上的增益最大,IoU达到74.8%,提升了7.6个百分点。
摘要:High-definition 3D city maps underpin smart transportation, digital twins, and autonomous driving, where object-level change detection across bi-temporal LiDAR enables HD map maintenance, construction monitoring, and reliable localization. Classical DSM differencing and image-based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch and yield cellwise outputs without object identity. Point-based neural models and voxel encodings demand large memory, assume near-perfect pre-alignment, degrade thin structures, and seldom enforce class-consistent association, which leaves split or merge cases unresolved and ignores uncertainty. We propose an object-centric, uncertainty-aware pipeline for city-scale LiDAR that aligns epochs with multi-resolution NDT followed by point-to-plane ICP, normalizes height, and derives a per-location level of detection from registration covariance and surface roughness to calibrate decisions and suppress spurious changes. Geometry-only proxies seed cross-epoch associations that are refined by semantic and instance segmentation and a class-constrained bipartite assignment with augmented dummies to handle splits and merges while preserving per-class counts. Tiled processing bounds memory without eroding narrow ground changes, and instance-level decisions combine 3D overlap, normal-direction displacement, and height and volume differences with a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation. On 15 representative Subiaco blocks the method attains 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 percentage points in accuracy, 0.2 in mF1, and 0.8 in mIoU, with the largest gain on Decreased where IoU reaches 74.8% and improves by 7.6 points.
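A hedged sketch of the per-location significance gate only: a level of detection (LoD) combining registration uncertainty and per-epoch surface roughness, in the spirit of M3C2-style change detection. The abstract does not give the exact formula, so the combination rule and constants are illustrative:

    import numpy as np

    def level_of_detection(sigma_reg, rough1, rough2, n1, n2, k=1.96):
        """95% LoD from registration sigma plus roughness/point-count terms."""
        return k * np.sqrt(rough1**2 / n1 + rough2**2 / n2 + sigma_reg**2)

    def significant(dz, sigma_reg, rough1, rough2, n1, n2):
        return abs(dz) > level_of_detection(sigma_reg, rough1, rough2, n1, n2)

    # The same 4 cm height difference is rejected on a rough, loosely registered
    # surface but kept on a smooth, well-registered one.
    print(significant(0.04, sigma_reg=0.03,  rough1=0.05, rough2=0.05, n1=20, n2=20))
    print(significant(0.04, sigma_reg=0.005, rough1=0.01, rough2=0.01, n1=50, n2=50))

Gating every instance-level decision by such a local LoD is what keeps the pipeline stable under partial overlap and sampling variation.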
【76】Confounding Robust Deep Reinforcement Learning: A Causal Approach
标题:混淆鲁棒深度强化学习:因果方法
链接:https://arxiv.org/abs/2510.21110
备注:NeurIPS 2025
摘要:人工智能中的一个关键任务是学习有效的策略,以控制未知环境中的代理并优化性能指标。离策略学习方法(如Q学习)允许学习者基于过去的经验做出最优决策。本文研究在无法事先排除未观测混杂的复杂高维域中,从有偏数据进行离策略学习。基于著名的深度Q网络(DQN),我们提出了一种对观测数据中的混杂偏差具有鲁棒性的新型深度强化学习算法。具体来说,我们的算法试图为与观测相容的最坏情况环境寻找一个安全策略。我们将该方法应用于12个带混杂的雅达利游戏,发现在所有行为策略与目标策略的观测输入不匹配且存在未观测混杂因素的游戏中,它始终优于标准DQN。
摘要:A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
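As a deliberately simplified stand-in for "a safe policy for the worst-case environment compatible with the observations", the sketch below applies pessimism over an ensemble of target networks; the ensemble construction is our assumption and not the paper's causal procedure:

    import torch
    import torch.nn as nn

    n_act, gamma = 4, 0.99
    q_net = nn.Linear(8, n_act)
    targets = [nn.Linear(8, n_act) for _ in range(5)]  # plausible-environment ensemble

    def pessimistic_td_target(r, s_next, done):
        with torch.no_grad():
            # Worst case over the ensemble: min of each member's greedy value.
            vals = torch.stack([t(s_next).max(dim=1).values for t in targets])
            return r + gamma * (1 - done) * vals.min(dim=0).values

    s, s_next = torch.randn(32, 8), torch.randn(32, 8)
    r, done = torch.randn(32), torch.zeros(32)
    a = torch.randint(n_act, (32, 1))
    y = pessimistic_td_target(r, s_next, done)
    loss = nn.functional.mse_loss(q_net(s).gather(1, a).squeeze(1), y)
    loss.backward()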
【77】ESCORT: Efficient Stein-variational and Sliced Consistency-Optimized Temporal Belief Representation for POMDPs
标题:ESCORT:POMDPs的高效Stein变分和切片一致性优化时间信念表示
链接:https://arxiv.org/abs/2510.21107
备注:Proceeding of the 39th Conference on Neural Information Processing Systems (NeurIPS'25). Code would be available at this https URL
摘要:在部分可观测马尔可夫决策过程(POMDP)中,维护和更新关于可能底层状态的信念分布,为总结动作-观测历史、在不确定性下进行有效决策提供了一种原则性方法。随着环境越来越逼真,信念分布会呈现出标准数学模型无法准确捕捉的复杂性,从而在保持表示准确性方面带来根本性挑战。尽管深度学习和概率建模取得了进展,但现有的POMDP信念近似方法无法准确表示复杂的不确定性结构(例如高维多模态信念分布),由此产生的估计误差会导致次优的代理行为。为了应对这一挑战,我们提出了ESCORT(用于时间信念的高效斯坦变分与切片一致性优化表示),一个用于在高维信念空间中捕获复杂多模态分布的基于粒子的框架。ESCORT通过两个关键创新扩展了SVGD:对状态维度间依赖关系建模的相关性感知投影,以及在保持相关结构的同时稳定更新的时间一致性约束。该方法保留了SVGD的吸引-排斥粒子动力学,同时能够精确建模复杂的相关模式。与容易退化的粒子滤波器或表示能力固定的参数化方法不同,ESCORT无需重采样或限制性分布假设即可动态适应信念景观的复杂性。我们通过在POMDP域和不同维度的合成多模态分布上的广泛评估证明了ESCORT的有效性,其在信念近似精度和下游决策质量方面始终优于最先进的方法。
摘要:In Partially Observable Markov Decision Processes (POMDPs), maintaining and updating belief distributions over possible underlying states provides a principled way to summarize action-observation history for effective decision-making under uncertainty. As environments grow more realistic, belief distributions develop complexity that standard mathematical models cannot accurately capture, creating a fundamental challenge in maintaining representational accuracy. Despite advances in deep learning and probabilistic modeling, existing POMDP belief approximation methods fail to accurately represent complex uncertainty structures such as high-dimensional, multi-modal belief distributions, resulting in estimation errors that lead to suboptimal agent behaviors. To address this challenge, we present ESCORT (Efficient Stein-variational and sliced Consistency-Optimized Representation for Temporal beliefs), a particle-based framework for capturing complex, multi-modal distributions in high-dimensional belief spaces. ESCORT extends SVGD with two key innovations: correlation-aware projections that model dependencies between state dimensions, and temporal consistency constraints that stabilize updates while preserving correlation structures. This approach retains SVGD's attractive-repulsive particle dynamics while enabling accurate modeling of intricate correlation patterns. Unlike particle filters prone to degeneracy or parametric methods with fixed representational capacity, ESCORT dynamically adapts to belief landscape complexity without resampling or restrictive distributional assumptions. We demonstrate ESCORT's effectiveness through extensive evaluations on both POMDP domains and synthetic multi-modal distributions of varying dimensionality, where it consistently outperforms state-of-the-art methods in terms of belief approximation accuracy and downstream decision quality.
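A compact sketch of the plain SVGD particle update that ESCORT extends, run on a two-mode toy belief; the paper's correlation-aware projections and temporal-consistency constraints are omitted, so this shows only the baseline attract-repel dynamics:

    import numpy as np

    def svgd_step(x, score, h=0.5, eps=0.1):
        """x: (n, d) particles; score(x): gradient of log-density at each particle."""
        n = x.shape[0]
        diff = x[:, None, :] - x[None, :, :]            # pairwise differences
        k = np.exp(-np.sum(diff**2, -1) / (2 * h**2))   # RBF kernel matrix
        grad_k = diff / h**2 * k[..., None]             # repulsive kernel gradients
        phi = (k @ score(x) + grad_k.sum(axis=1)) / n   # attract + repel
        return x + eps * phi

    mus = np.array([[-2.0, 0.0], [2.0, 0.0]])           # bimodal Gaussian mixture
    def score(x):
        w = np.stack([np.exp(-0.5 * np.sum((x - m)**2, -1)) for m in mus])
        w /= w.sum(0)
        return sum(w[i][:, None] * (mus[i] - x) for i in range(2))

    x = np.random.default_rng(0).normal(size=(100, 2))
    for _ in range(200):
        x = svgd_step(x, score)
    print((x[:, 0] < 0).sum(), (x[:, 0] > 0).sum())     # both modes stay populated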
【78】MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning
标题:MedAlign:多模态偏好优化与联邦元认知推理的协同框架
链接:https://arxiv.org/abs/2510.21093
摘要:最近,大型模型在智能医疗领域显示出巨大潜力。然而,大型视觉语言模型(LVLM)在临床服务中的部署目前受到三个关键挑战的阻碍:倾向于生成缺乏视觉证据支撑的幻觉答案、固定深度推理的低效,以及多机构协作的困难。为了解决这些挑战,本文开发了MedAlign,一个确保LVLM在医学视觉问答(Med-VQA)中给出视觉准确回答的新框架。具体来说,我们首先提出了一个多模态直接偏好优化(mDPO)目标,使偏好学习与视觉上下文显式对齐。然后,我们设计了一个检索感知混合专家(RA-MoE)架构,利用图像和文本相似性将查询路由到专门的、经上下文增强的LVLM(即专家),从而减轻LVLM中的幻觉。为了实现自适应推理并促进多机构协作,我们提出了一种联邦治理机制:被选中的专家基于mDPO在临床数据集上微调,并借助本地元认知不确定性估计器在本地执行迭代的思维链(CoT)推理。在三个有代表性的Med-VQA数据集上的大量实验表明,MedAlign实现了最先进的性能,在F1得分上超过强检索增强基线最多达11.85%,同时与固定深度CoT方法相比,平均推理长度减少了51.60%。
摘要:Recently, large models have shown significant potential for smart healthcare. However, the deployment of Large Vision-Language Models (LVLMs) for clinical services is currently hindered by three critical challenges: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. To address these challenges, in this paper, we develop MedAlign, a novel framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). Specifically, we first propose a multimodal Direct Preference Optimization (mDPO) objective to explicitly align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (i.e., an expert), thereby mitigating hallucinations in LVLMs. To achieve adaptive reasoning and facilitate multi-institutional collaboration, we propose a federated governance mechanism, where the selected expert, fine-tuned on clinical datasets based on mDPO, locally performs iterative Chain-of-Thought (CoT) reasoning via the local meta-cognitive uncertainty estimator. Extensive experiments on three representative Med-VQA datasets demonstrate that MedAlign achieves state-of-the-art performance, outperforming strong retrieval-augmented baselines by up to $11.85\%$ in F1-score, and simultaneously reducing the average reasoning length by $51.60\%$ compared with fixed-depth CoT approaches.
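The preference-learning core can be sketched with the standard DPO algebra, treating the image simply as part of the conditioning context; the paper's mDPO objective adds explicit visual grounding beyond this baseline form:

    import torch
    import torch.nn.functional as F

    def dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """logp_*: policy log-probs of chosen/rejected answers given (image, question)."""
        ratio_w = logp_w - ref_logp_w   # policy vs. frozen reference, chosen answer
        ratio_l = logp_l - ref_logp_l   # ... and the rejected (hallucinated) answer
        return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

    logp_w = torch.tensor([-12.0, -10.5], requires_grad=True)
    logp_l = torch.tensor([-11.0, -12.0], requires_grad=True)
    loss = dpo_style_loss(logp_w, logp_l,
                          torch.tensor([-12.5, -11.0]), torch.tensor([-10.8, -11.5]))
    loss.backward()
    print(float(loss))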
【79】Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
标题:自我奖励PPO:仅用演示对齐大型语言模型
链接:https://arxiv.org/abs/2510.21090
备注:Accepted by COLM 2025
摘要:监督微调(SFT)已成为将大型语言模型(LLM)与人工标注演示对齐的关键方法。然而,SFT是一种类似于行为克隆的离策略方法,通常存在过拟合和域外泛化能力差的问题,尤其在数据有限的情况下。为了解决这些限制,我们提出了自我奖励PPO,一种利用在线策略技术提升泛化性能的新型微调方法。我们的方法结合SFT和近端策略优化(PPO)的优势,从演示数据中实现更有效的对齐。其核心是一个奖励函数,定义为SFT模型与预训练基础模型之间的对数策略比率。该函数以预训练策略为基线、SFT策略为目标,充当隐式奖励信号。由此,它可以在不依赖人类偏好注释的情况下进行在线策略微调。这种自我奖励机制与PPO的结合解决了SFT的关键限制,提升了泛化能力、数据效率和鲁棒性。我们在一系列自然语言处理任务上的实证评估表明,自我奖励PPO始终优于传统SFT方法。结果凸显了我们的方法在使用演示数据对齐LLM方面的有效性,尤其是在高质量标注数据稀缺的情况下。
摘要:Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
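The abstract specifies the reward directly: the log policy ratio between the SFT model and the pretrained base model. A minimal sketch of that scoring step, with the sequence log-prob bookkeeping abstracted away:

    import torch

    def self_reward(sft_logprobs, base_logprobs):
        """Per-sequence reward r(y|x) = log pi_sft(y|x) - log pi_base(y|x).

        Each input is the summed token log-probability of a sampled response y;
        the reward is positive when the SFT policy upweights y relative to the
        pretrained baseline, so no human preference labels are needed."""
        return sft_logprobs - base_logprobs

    sft = torch.tensor([-35.2, -40.1, -28.7])
    base = torch.tensor([-38.0, -39.5, -30.2])
    print(self_reward(sft, base))   # tensor([ 2.8000, -0.6000,  1.5000])

PPO then maximizes this implicit reward on-policy, which is how the method avoids the off-policy overfitting the abstract attributes to plain SFT.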
【80】M-GLC: Motif-Driven Global-Local Context Graphs for Few-shot Molecular Property Prediction
标题:M-GLC:模体驱动的全局-局部上下文图用于少样本分子性质预测
链接:https://arxiv.org/abs/2510.21088
摘要:分子性质预测(MPP)是药物发现和材料科学的基石,但传统的深度学习方法依赖于通常难以获得的大型标注数据集。少样本分子性质预测(FSMPP)通过将分子节点链接到性质节点的上下文图来引入关系归纳偏差,从而缓解这种稀缺性,但这样的分子-性质图只能提供有限的结构指导。我们提出了一个全面的解决方案:用于少样本分子性质预测的基序驱动全局-局部上下文图,在全局和局部两个层面丰富上下文信息。在全局层面,引入代表共享子结构(如环或官能团)的具有化学意义的基序节点,构成全局三部异构图,建立基序-分子-性质连接,捕捉长程组成模式,并使具有共同基序的分子之间能够进行知识迁移。在局部层面,我们为分子-性质对中的每个节点构建子图并分别编码,使模型的注意力集中在信息量最大的相邻分子和基序上。在五个标准FSMPP基准上的实验表明,我们的框架始终优于最先进的方法。这些结果强调了将全局基序知识与细粒度局部上下文相结合,对推进稳健少样本分子性质预测的有效性。
摘要:Molecular property prediction (MPP) is a cornerstone of drug discovery and materials science, yet conventional deep learning approaches depend on large labeled datasets that are often unavailable. Few-shot Molecular property prediction (FSMPP) addresses this scarcity by incorporating relational inductive bias through a context graph that links molecule nodes to property nodes, but such molecule-property graphs offer limited structural guidance. We propose a comprehensive solution: Motif Driven Global-Local Context Graph for few-shot molecular property prediction, which enriches contextual information at both the global and local levels. At the global level, chemically meaningful motif nodes representing shared substructures, such as rings or functional groups, are introduced to form a global tri-partite heterogeneous graph, yielding motif-molecule-property connections that capture long-range compositional patterns and enable knowledge transfer among molecules with common motifs. At the local level, we build a subgraph for each node in the molecule-property pair and encode them separately to concentrate the model's attention on the most informative neighboring molecules and motifs. Experiments on five standard FSMPP benchmarks demonstrate that our framework consistently outperforms state-of-the-art methods. These results underscore the effectiveness of integrating global motif knowledge with fine-grained local context to advance robust few-shot molecular property prediction.
【81】CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases
标题:CDrugRed:面向代谢性疾病出院用药的中文药物推荐数据集
链接:https://arxiv.org/abs/2510.21084
摘要:基于电子健康档案(EHR)的智能药物推荐对于提高临床决策的质量和效率至关重要。通过利用大规模患者数据,药物推荐系统可以帮助医生根据患者的病史、诊断、实验室结果和合并症选择最合适的药物。然而,此类系统的发展受到公开可用的真实世界EHR数据集稀缺的严重阻碍,在英语之外的语言中尤甚。在这项工作中,我们提出了CDrugRed,首个公开可用的中文药物推荐数据集,专注于代谢性疾病的出院用药。该数据集包括来自3,190名患者的5,894条去标识化记录,包含患者人口统计学、病史、临床病程和出院诊断等综合信息。我们通过在出院用药推荐任务上对几种最先进的大型语言模型(LLM)进行基准测试来评估CDrugRed的实用性。实验结果表明,虽然监督微调提升了模型性能,但仍有很大改进空间,最佳模型的F1得分为0.5648,Jaccard得分为0.4477。这一结果突出了临床药物推荐任务的复杂性,并将CDrugRed确立为开发更鲁棒、更准确的药物推荐系统的具有挑战性且有价值的资源。该数据集依据数据使用协议在https://github.com/DUTIR-BioNLP/CDrugRed向研究界公开提供。
摘要:Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient's medical history, diagnoses, laboratory results, and comorbidities. However, the advancement of such systems is significantly hampered by the scarcity of publicly available, real-world EHR datasets, particularly in languages other than English. In this work, we present CDrugRed, the first publicly available Chinese drug recommendation dataset focused on discharge medications for metabolic diseases. The dataset includes 5,894 de-identified records from 3,190 patients, containing comprehensive information such as patient demographics, medical history, clinical course, and discharge diagnoses. We assess the utility of CDrugRed by benchmarking several state-of-the-art large language models (LLMs) on the discharge medication recommendation task. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving an F1 score of 0.5648 and a Jaccard score of 0.4477. This result highlights the complexity of the clinical drug recommendation task and establishes CDrugRed as a challenging and valuable resource for developing more robust and accurate drug recommendation systems. The dataset is publicly available to the research community under the data usage agreements at https://github.com/DUTIR-BioNLP/CDrugRed.
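The reported metrics are set-level F1 and Jaccard over the recommended discharge medications; a small reference implementation (the drug names are invented for the example):

    def set_f1_jaccard(pred: set, gold: set):
        inter = len(pred & gold)
        precision = inter / len(pred) if pred else 0.0
        recall = inter / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        jaccard = inter / len(pred | gold) if pred | gold else 1.0
        return f1, jaccard

    pred = {"metformin", "insulin glargine", "atorvastatin"}
    gold = {"metformin", "insulin glargine", "atorvastatin", "aspirin"}
    print(set_f1_jaccard(pred, gold))   # (0.857..., 0.75)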
【82】Soppia: A Structured Prompting Framework for the Proportional Assessment of Non-Pecuniary Damages in Personal Injury Cases
标题:Soppia:用于人身伤害案件中非金钱损害按比例评估的结构化提示框架
链接:https://arxiv.org/abs/2510.21082
备注:9 pages, 2 tables, includes GitHub link to framework implementation. Submitted to the Artificial Intelligence and Law section of arXiv
摘要:适用以多重、不同权重的标准为特点的复杂法律规则,对司法决策提出了根本性挑战,往往阻碍了立法意图的一致实现。这一挑战在人身伤害案件中非金钱损害的量化方面尤为明显。本文介绍了Soppia,一个结构化的提示框架,旨在帮助法律专业人士在航行这种复杂性。通过利用先进的人工智能,该系统确保对所有规定的标准进行全面和平衡的分析,实现立法者的意图,即通过对每个案件的整体评估来确定赔偿。使用巴西CLT(第223-G条)中建立的非金钱损害赔偿的12个标准作为案例研究,我们展示了Soppia(有序比例和深思熟虑的智能评估系统)如何将细微的法律命令转化为实用,可复制和透明的方法。该框架增强了一致性和可预测性,同时提供了一种适用于多标准法律背景的通用和可解释的工具,将规范解释和计算推理与可审计的法律AI联系起来。
摘要:Applying complex legal rules characterized by multiple, heterogeneously weighted criteria presents a fundamental challenge in judicial decision-making, often hindering the consistent realization of legislative intent. This challenge is particularly evident in the quantification of non-pecuniary damages in personal injury cases. This paper introduces Soppia, a structured prompting framework designed to assist legal professionals in navigating this complexity. By leveraging advanced AI, the system ensures a comprehensive and balanced analysis of all stipulated criteria, fulfilling the legislator's intent that compensation be determined through a holistic assessment of each case. Using the twelve criteria for non-pecuniary damages established in the Brazilian CLT (Art. 223-G) as a case study, we demonstrate how Soppia (System for Ordered Proportional and Pondered Intelligent Assessment) operationalizes nuanced legal commands into a practical, replicable, and transparent methodology. The framework enhances consistency and predictability while providing a versatile and explainable tool adaptable across multi-criteria legal contexts, bridging normative interpretation and computational reasoning toward auditable legal AI.
【83】Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering
标题:用自适应RAG弥合语言差距:改进印度尼西亚语言问题解答
链接:https://arxiv.org/abs/2510.21068
备注:12 pages, 7 figures, 5 tables
摘要:随着机器学习模型的发展,问答系统(QA)取得了显著改进;后续研究通过检索外部信息来增强问答系统,即检索增强生成(RAG),以产生更准确、信息量更大的答案。然而,这些最先进的性能主要体现在英语上。为弥合这一差距,我们将自适应RAG系统引入印度尼西亚语。自适应RAG系统集成了一个分类器,其任务是区分问题的复杂度,进而决定回答问题的策略。为克服印尼语数据集有限的问题,我们的研究采用机器翻译作为数据增强方法。实验表明问题复杂度分类器是可靠的;然而,我们观察到多次检索回答策略存在显著的不一致性,在应用该策略时对整体评估产生了负面影响。这些发现既突出了低资源语言问答的前景,也揭示了其挑战,为未来的改进指明了方向。
摘要:Question Answering (QA) has seen significant improvements with the advancement of machine learning models; further studies enhanced these systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is predominantly achieved in the English language. To address this gap, we make an effort to bridge the language gap by incorporating an Adaptive RAG system into the Indonesian language. The Adaptive RAG system integrates a classifier whose task is to distinguish question complexity, which in turn determines the strategy for answering the question. To overcome the limited availability of Indonesian language datasets, our study employs machine translation as a data augmentation approach. Experiments show a reliable question complexity classifier; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied. These findings highlight both the promise and challenges of question answering in low-resource languages, suggesting directions for future improvement.
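A skeletal sketch of the routing idea: a complexity classifier picks between no-retrieval, single-pass, and iterative multi-step strategies. The classifier heuristic, the stub LLM/retriever, and the prompt strings are placeholders, not the system's actual components:

    def classify_complexity(question: str) -> str:
        # Stand-in for the trained classifier described in the abstract.
        if len(question.split()) < 6:
            return "simple"
        return "multi" if "mengapa" in question else "single"

    def answer(question, llm, retriever):
        strategy = classify_complexity(question)
        if strategy == "simple":                 # answer directly, no retrieval
            return llm(question)
        if strategy == "single":                 # one retrieval pass
            return llm(f"{question}\nKonteks: {retriever(question, k=3)}")
        context, q = [], question                # iterative multi-step retrieval
        for _ in range(3):
            context += retriever(q, k=2)
            q = llm(f"{question}\nKonteks: {context}\nPertanyaan lanjutan?")
        return llm(f"{question}\nKonteks: {context}")

    llm = lambda prompt: "jawaban"               # stub models keep the sketch runnable
    retriever = lambda q, k: [f"dok-{i}" for i in range(k)]
    print(answer("Siapa presiden pertama Indonesia?", llm, retriever))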
【84】Deep learning-based automated damage detection in concrete structures using images from earthquake events
标题:使用地震事件图像的基于深度学习的混凝土结构自动损伤检测
链接:https://arxiv.org/abs/2510.21063
备注:6 pages, 1 figure
摘要:地震发生后,及时评估结构完整性对公共安全和应急响应至关重要。本研究的重点是使用深度学习方法评估结构损伤状况,检测大地震后混凝土建筑和桥梁中的外露钢筋。钢筋通常在混凝土剥落或出现较大的弯曲、剪切裂缝后暴露。外露钢筋的数量和分布是结构损伤与退化的指征。为了自动检测外露钢筋,我们对2023年土耳其地震后收集的新图像数据集进行了标注,以涵盖各种受损混凝土结构。所提方法建立在深度学习框架之上,并通过微调、数据增强和在公共数据集上的测试加以增强。我们开发了一个可用于识别建筑物内部/外部及结构构件的自动分类框架。随后,训练YOLOv11(You Only Look Once)模型来检测开裂和剥落损伤以及外露钢筋。另一个YOLO模型经过微调,用于区分不同类别的结构损伤等级。所有这些训练好的模型共同构成一个混合框架,可自动且可靠地判定输入图像的损伤等级。这项研究表明,利用图像数据收集、标注和深度学习方法,可以在不同的损伤背景下实现灾后快速自动损伤检测。
摘要:Timely assessment of integrity of structures after seismic events is crucial for public safety and emergency response. This study focuses on assessing the structural damage conditions using deep learning methods to detect exposed steel reinforcement in concrete buildings and bridges after large earthquakes. Steel bars are typically exposed after concrete spalling or large flexural or shear cracks. The amount and distribution of exposed steel reinforcement is an indication of structural damage and degradation. To automatically detect exposed steel bars, new datasets of images collected after the 2023 Turkey Earthquakes were labeled to represent a wide variety of damaged concrete structures. The proposed method builds upon a deep learning framework, enhanced with fine-tuning, data augmentation, and testing on public datasets. An automated classification framework is developed that can be used to identify inside/outside buildings and structural components. Then, a YOLOv11 (You Only Look Once) model is trained to detect cracking and spalling damage and exposed bars. Another YOLO model is finetuned to distinguish different categories of structural damage levels. All these trained models are used to create a hybrid framework to automatically and reliably determine the damage levels from input images. This research demonstrates that rapid and automated damage detection following disasters is achievable across diverse damage contexts by utilizing image data collection, annotation, and deep learning approaches.
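A hedged sketch of the fine-tuning step with the ultralytics package (assuming it is installed and a YOLOv11 checkpoint is available); the dataset YAML, class names, and hyperparameters are placeholders rather than the study's actual configuration:

    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")              # pretrained YOLOv11 nano checkpoint
    model.train(
        data="earthquake_damage.yaml",      # hypothetical dataset config with
        epochs=100,                         # classes such as crack, spalling,
        imgsz=640,                          # exposed_bar
    )
    results = model("damaged_facade.jpg")   # inference on a post-event photo
    results[0].show()                       # visualize detected damage regions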
【85】On the Sample Complexity of Differentially Private Policy Optimization
标题:论差分隐私策略优化的样本复杂度
链接:https://arxiv.org/abs/2510.21060
摘要:策略优化(PO)是现代强化学习(RL)的基石,应用范围涵盖机器人、医疗保健和大型语言模型训练。然而,PO在敏感领域的日益部署引发了重大的隐私问题。在本文中,我们开启了对差分隐私策略优化的理论研究,明确聚焦其样本复杂度。我们首先形式化了为PO量身定制的差分隐私(DP)定义,解决源自在线策略学习动态以及隐私单位定义之微妙性的内在挑战。然后,我们通过一个统一框架,系统分析了策略梯度(PG)、自然策略梯度(NPG)等广泛使用的PO算法在DP约束和各种设置下的样本复杂度。我们的理论结果表明,隐私成本通常可表现为样本复杂度中的低阶项,同时也突出了私有PO设置中微妙而重要的观察。这些为隐私保护PO算法提供了有价值的实用见解。
摘要:Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
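One DP-SGD-style private policy-gradient step, with per-trajectory gradient clipping and Gaussian noise, shows where the privacy cost enters the sample complexity; the clipping norm, noise multiplier, and omitted privacy accounting are simplifying assumptions:

    import torch

    def private_pg_step(per_traj_grads, lr=0.01, clip=1.0, sigma=1.0):
        """per_traj_grads: (B, d) REINFORCE gradients, one row per trajectory."""
        norms = per_traj_grads.norm(dim=1, keepdim=True).clamp(min=1e-12)
        clipped = per_traj_grads * (clip / norms).clamp(max=1.0)  # per-sample clip
        noise = sigma * clip * torch.randn(per_traj_grads.shape[1])
        return lr * (clipped.sum(0) + noise) / per_traj_grads.shape[0]  # ascent step

    theta = torch.zeros(16)
    grads = torch.randn(32, 16)              # toy per-trajectory gradient estimates
    theta = theta + private_pg_step(grads)   # noisy, clipped policy update
    print(theta.norm().item())

The trajectory is the unit of privacy here, which is itself one of the definitional subtleties the paper highlights.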
【86】Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection
标题:推理的剃刀:推理提高了准确率,但可能在安全与幻觉检测的关键操作点上损害召回率
链接:https://arxiv.org/abs/2510.21049
摘要:推理已经成为大型语言模型(LLM)的核心范式,在各种基准上不断提高准确率。然而,它是否适合精度敏感的任务仍不清楚。我们首次对严格低误报率(FPR)制度下分类任务中的推理进行了系统研究。我们的分析涵盖安全检测和幻觉检测两个任务,在微调和zero-shot设置下,对标准LLM和大型推理模型(LRM)进行评估。研究结果揭示了一个明确的权衡:Think On(推理增强)生成提高了整体准确率,但在实际使用所必需的低FPR阈值下表现不佳。相比之下,Think Off(推理时不进行思考)在这些精度敏感的制度中占主导地位,只有在可以接受更高FPR时Think On才占优。此外,我们发现在精度敏感的部署中,基于令牌的评分大大优于自我语言化的置信度。最后,两种模式的简单集成可以兼得各自的优势。总的来说,我们的发现将推理定位为一把双刃剑:有利于平均准确率,但通常不适合需要严格精度的应用。
摘要:Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks, safety detection and hallucination detection, evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
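The low-FPR evaluation the abstract emphasizes can be sketched directly: score each example with the model's token-level probability of the positive label, calibrate the threshold on negatives to a strict FPR budget, and read off recall; the Gaussian scores below stand in for real model log-probabilities:

    import numpy as np

    def threshold_at_fpr(neg_scores, target_fpr=0.01):
        """Smallest threshold whose false-positive rate on negatives <= target."""
        return float(np.quantile(np.asarray(neg_scores), 1.0 - target_fpr))

    rng = np.random.default_rng(0)
    neg = rng.normal(-3.0, 1.0, 5000)   # token-based scores on benign examples
    pos = rng.normal(0.0, 1.5, 5000)    # scores on truly unsafe examples

    t = threshold_at_fpr(neg, 0.01)
    print(f"threshold={t:.2f}, recall at 1% FPR={(pos > t).mean():.3f}")

Comparing Think On and Think Off reduces to rerunning this calibration with each mode's scores; a simple ensemble averages or maxes the two score sets before thresholding.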
【87】From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL
标题:从问题到查询:人工智能驱动的空间文本到SQL多智能体框架
链接:https://arxiv.org/abs/2510.21045
摘要:结构化查询语言(SQL)的复杂性和PostGIS等工具中地理空间功能的专业性,为寻求分析空间数据的非专家带来了重大障碍。虽然大型语言模型(LLM)提供了将自然语言转换为SQL(文本到SQL)的承诺,但单代理方法往往难以处理空间查询的语义和语法复杂性。为了解决这个问题,我们提出了一个多代理框架,旨在准确地将自然语言问题转换为空间SQL查询。该框架集成了几个创新的组件,包括一个知识库与编程模式分析和语义丰富,嵌入上下文检索,和一个协作的多代理管道作为其核心。该管道包括用于实体提取、元数据检索、查询逻辑公式化、SQL生成的专用代理,以及对生成的SQL执行编程和语义验证以确保正确性(自验证)的审查代理。我们使用非空间KaggleDBQA基准和一个新的,全面的SpatialQueryQA基准,包括不同的几何类型,谓词和三个层次的查询复杂性来评估我们的系统。在KaggleDBQA上,经过审核代理的审核和更正,系统的总体准确率为81.2%(272个问题中的221个)。对于空间查询,该系统的总体准确率为87.7%(90个问题中的79个),而在没有审查代理的情况下,该准确率为76.7%。除了准确性之外,结果还表明,在某些情况下,系统生成的查询在语义上比基准测试中的查询更符合用户意图。这项工作使空间分析更容易访问,并提供了一个强大的,可推广的基础空间文本到SQL系统,推进自治GIS的发展。
摘要:The complexity of Structured Query Language (SQL) and the specialized nature of geospatial functions in tools like PostGIS present significant barriers to non-experts seeking to analyze spatial data. While Large Language Models (LLMs) offer promise for translating natural language into SQL (Text-to-SQL), single-agent approaches often struggle with the semantic and syntactic complexities of spatial queries. To address this, we propose a multi-agent framework designed to accurately translate natural language questions into spatial SQL queries. The framework integrates several innovative components, including a knowledge base with programmatic schema profiling and semantic enrichment, embeddings for context retrieval, and a collaborative multi-agent pipeline as its core. This pipeline comprises specialized agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and a review agent that performs programmatic and semantic validation of the generated SQL to ensure correctness (self-verification). We evaluate our system using both the non-spatial KaggleDBQA benchmark and a new, comprehensive SpatialQueryQA benchmark that includes diverse geometry types, predicates, and three levels of query complexity. On KaggleDBQA, the system achieved an overall accuracy of 81.2% (221 out of 272 questions) after the review agent's review and corrections. For spatial queries, the system achieved an overall accuracy of 87.7% (79 out of 90 questions), compared with 76.7% without the review agent. Beyond accuracy, results also show that in some instances the system generates queries that are more semantically aligned with user intent than those in the benchmarks. This work makes spatial analysis more accessible, and provides a robust, generalizable foundation for spatial Text-to-SQL systems, advancing the development of autonomous GIS.
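A skeletal view of the described agent pipeline, with each stage as a function and a final review pass; llm() is a deliberately unimplemented placeholder for any chat-completion client, and the prompts are invented for illustration.

    def llm(prompt):
        # Placeholder for any chat-completion client; not implemented here.
        raise NotImplementedError

    def extract_entities(q):      return llm(f"List tables, columns, geometry types in: {q}")
    def retrieve_metadata(ents):  return llm(f"Fetch schema, SRIDs, semantic notes for: {ents}")
    def plan_logic(q, meta):      return llm(f"Choose spatial predicates/joins for {q} given {meta}")
    def generate_sql(plan):       return llm(f"Write PostGIS SQL implementing: {plan}")
    def review(sql, meta):        return llm(f"Validate and correct {sql} against {meta}")

    def answer(question):
        ents = extract_entities(question)
        meta = retrieve_metadata(ents)
        plan = plan_logic(question, meta)
        sql = generate_sql(plan)
        return review(sql, meta)   # self-verification pass before execution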
【88】Epistemic Deference to AI
标题:对人工智能的认识尊重
链接:https://arxiv.org/abs/2510.21043
备注:12 pages
摘要:我们什么时候应该遵从人工智能的输出而非人类专家的判断?借鉴社会认识论的最新研究,我论证了某些人工智能系统因其表现出的可靠性和认识上的优越性,有资格成为人工认识权威(AEA)。随后,我介绍了人工智能优先主义(AI Preemptionism),即认为AEA的输出应当取代而非补充用户独立的认识理由。我指出,针对优先主义的经典反驳,例如不加批判的遵从、认识固化以及认识基础的脱节,会因AEA的不透明性、自我强化的权威性以及认识失败标记的缺失,以放大的形式适用于AEA。对此,我提出了一个更有希望的替代方案:对AI遵从的总体证据观。根据这一观点,AEA的输出应作为贡献性理由发挥作用,而非直接取代用户独立的认识考量。这种方法有三个关键优势:(i)通过保持人类用户的参与来缓解专业能力的萎缩;(ii)为有意义的人类监督和控制提供了认识论依据;(iii)解释了在可靠性条件不满足时对人工智能的合理不信任。虽然在实践中要求很高,但这一论述提供了一种有原则的方法,用以确定何时遵从AI是合理的,特别是在需要严格可靠性的高风险情境中。
摘要:When should we defer to AI outputs over human expert judgment? Drawing on recent work in social epistemology, I motivate the idea that some AI systems qualify as Artificial Epistemic Authorities (AEAs) due to their demonstrated reliability and epistemic superiority. I then introduce AI Preemptionism, the view that AEA outputs should replace rather than supplement a user's independent epistemic reasons. I show that classic objections to preemptionism - such as uncritical deference, epistemic entrenchment, and unhinging epistemic bases - apply in amplified form to AEAs, given their opacity, self-reinforcing authority, and lack of epistemic failure markers. Against this, I develop a more promising alternative: a total evidence view of AI deference. According to this view, AEA outputs should function as contributory reasons rather than outright replacements for a user's independent epistemic considerations. This approach has three key advantages: (i) it mitigates expertise atrophy by keeping human users engaged, (ii) it provides an epistemic case for meaningful human oversight and control, and (iii) it explains the justified mistrust of AI when reliability conditions are unmet. While demanding in practice, this account offers a principled way to determine when AI deference is justified, particularly in high-stakes contexts requiring rigorous reliability.
【89】AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
标题:AgentArcEval:一种基于基础模型的代理的架构评估方法
链接:https://arxiv.org/abs/2510.21031
摘要:基础模型(FM)的出现使得开发高能力的自主智能体成为可能,从而在广泛的领域中开启了新的应用机会。评估智能体的架构尤为重要,因为鉴于智能体的独特特性(包括复合架构、自主和非确定性的行为以及持续演化),架构决策会显著影响其质量属性。然而,正是由于这些独特特性,传统的架构评估方法无法满足智能体架构的评估需求。因此,在本文中,我们提出了AgentArcEval,一种专门为应对基于FM的智能体架构及其评估的复杂性而设计的新型架构评估方法。此外,我们提出了一个智能体特定的通用场景目录,作为生成具体场景、用于设计和评估智能体架构的指南。我们通过对一个名为Luna的真实税务Copilot的架构评估案例研究,展示了AgentArcEval及该目录的实用性。
摘要:The emergence of foundation models (FMs) has enabled the development of highly capable and autonomous agents, unlocking new application opportunities across a wide range of domains. Evaluating the architecture of agents is particularly important as the architectural decisions significantly impact the quality attributes of agents given their unique characteristics, including compound architecture, autonomous and non-deterministic behaviour, and continuous evolution. However, these traditional methods fall short in addressing the evaluation needs of agent architecture due to the unique characteristics of these agents. Therefore, in this paper, we present AgentArcEval, a novel agent architecture evaluation method designed specially to address the complexities of FM-based agent architecture and its evaluation. Moreover, we present a catalogue of agent-specific general scenarios, which serves as a guide for generating concrete scenarios to design and evaluate the agent architecture. We demonstrate the usefulness of AgentArcEval and the catalogue through a case study on the architecture evaluation of a real-world tax copilot, named Luna.
【90】Customizing Open Source LLMs for Quantitative Medication Attribute Extraction across Heterogeneous EHR Systems
标题:定制开源LLM,用于跨异类EHR系统的定量药物属性提取
链接:https://arxiv.org/abs/2510.21027
备注:NeurIPS 2025: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
摘要:在电子健康记录(EHR)系统之间协调药物数据,是监测阿片类药物使用障碍药物治疗(MOUD)的一个持久障碍。在异构EHR系统中,关键处方属性分散在格式各异的字段和自由文本注释中。我们提出了一个实用的框架,定制开源大型语言模型(LLM),包括Llama、Qwen、Gemma和MedGemma,从异构的、特定于站点的数据中提取一组统一的MOUD处方属性(处方日期、药物名称、持续时间、总数量、每日数量和续方次数),并为每位患者计算药物覆盖率的标准化度量"MOUD天数"。我们的管道直接按固定的JSON模式处理记录,随后进行轻量级规范化和跨字段一致性检查。我们在一项全国OUD研究的5个诊所的处方级EHR数据(来自1,257名患者的25,605条记录)上评估了该系统,并以先前注释的10,369条记录(776名患者)作为基准真值。性能以覆盖率(具有有效、可匹配输出的记录比例)和记录级精确匹配准确率报告。更大的模型整体表现最好:Qwen2.5-32B在各诊所实现了93.4%的覆盖率和93.0%的精确匹配准确率,MedGemma-27B达到了93.1%/92.2%。一项简短的错误回顾突出了三个常见问题及其修复方法:使用药物内规范插补缺失的剂量字段;通过从记录的时间表中设置持续时间来处理每月/每周注射剂(例如Vivitrol);以及添加单位检查以防止质量单位(例如"250 g")被误读为每日计数。通过移除脆弱的、特定于站点的ETL并支持本地的、隐私保护的部署,这种方法能够在真实环境中对MOUD的暴露、依从性和保留情况进行一致的跨站点分析。
摘要:Harmonizing medication data across Electronic Health Record (EHR) systems is a persistent barrier to monitoring medications for opioid use disorder (MOUD). In heterogeneous EHR systems, key prescription attributes are scattered across differently formatted fields and free-text notes. We present a practical framework that customizes open source large language models (LLMs), including Llama, Qwen, Gemma, and MedGemma, to extract a unified set of MOUD prescription attributes (prescription date, drug name, duration, total quantity, daily quantity, and refills) from heterogeneous, site-specific data and compute a standardized metric of medication coverage, MOUD days, per patient. Our pipeline processes records directly in a fixed JSON schema, followed by lightweight normalization and cross-field consistency checks. We evaluate the system on prescription-level EHR data from five clinics in a national OUD study (25,605 records from 1,257 patients), using a previously annotated benchmark of 10,369 records (776 patients) as the ground truth. Performance is reported as coverage (share of records with a valid, matchable output) and record-level exact-match accuracy. Larger models perform best overall: Qwen2.5-32B achieves 93.4% coverage with 93.0% exact-match accuracy across clinics, and MedGemma-27B attains 93.1%/92.2%. A brief error review highlights three common issues and fixes: imputing missing dosage fields using within-drug norms, handling monthly/weekly injectables (e.g., Vivitrol) by setting duration from the documented schedule, and adding unit checks to prevent mass units (e.g., "250 g") from being misread as daily counts. By removing brittle, site-specific ETL and supporting local, privacy-preserving deployment, this approach enables consistent cross-site analyses of MOUD exposure, adherence, and retention in real-world settings.
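The JSON-schema processing plus cross-field consistency checking lends itself to a small sketch like the one below; the field names, the duration * daily_quantity == total_quantity rule, and the MOUD-days formula duration * (1 + refills) are assumptions for illustration, not the paper's exact logic.

    import json

    REQUIRED = ("prescription_date", "drug_name", "duration",
                "total_quantity", "daily_quantity", "refills")

    def normalize_and_check(raw):
        # Parse one model output, coerce numeric fields, and apply a
        # cross-field rule: duration * daily_quantity should equal total_quantity.
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            return None
        if any(k not in rec for k in REQUIRED):
            return None
        for k in ("duration", "total_quantity", "daily_quantity", "refills"):
            try:
                rec[k] = float(rec[k])
            except (TypeError, ValueError):
                return None
        if abs(rec["duration"] * rec["daily_quantity"] - rec["total_quantity"]) > 1e-6:
            return None                                   # route to manual review instead
        rec["moud_days"] = rec["duration"] * (1 + rec["refills"])  # assumed coverage formula
        return rec

    print(normalize_and_check('{"prescription_date": "2024-01-02", "drug_name": "buprenorphine",'
                              ' "duration": 30, "total_quantity": 60, "daily_quantity": 2, "refills": 1}'))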
【91】JSTprove: Pioneering Verifiable AI for a Trustless Future
标题:JSTprove:开创可验证人工智能,打造无可信未来
链接:https://arxiv.org/abs/2510.21024
备注:13 pages, 8 figures, and 4 tables
摘要:将机器学习(ML)系统集成到医疗保健、金融和网络安全等关键行业已经改变了决策过程,但也带来了围绕信任、安全和问责的新挑战。随着人工智能系统变得越来越普遍,确保人工智能驱动决策的透明度和正确性至关重要,特别是当它们对隐私、安全或公平性有直接影响时。由零知识机器学习(zkML)提供支持的可验证人工智能为这些挑战提供了强大的解决方案。zkML能够在不暴露敏感数据的情况下验证AI模型推理,提供了重要的信任和隐私层。然而,传统的zkML系统通常需要深厚的密码学专业知识,这超出了大多数ML工程师的能力范围。在本文中,我们介绍了JSTprove,一个构建在Polyhedra Network的Expander后端之上的专用zkML工具包,使AI开发人员和ML工程师能够生成和验证AI推理的证明。JSTprove提供了一个端到端可验证的AI推理管道,在简单的命令行界面背后隐藏了密码学复杂性,同时暴露可审计的工件以实现可重复性。我们展示了JSTprove的设计、创新和真实用例,以及我们的蓝图和工具,以鼓励社区进行审查和扩展。因此,JSTprove既是满足当前工程需求的可用zkML产品,也是可验证AI未来研究和生产部署的可复现基础。
摘要:The integration of machine learning (ML) systems into critical industries such as healthcare, finance, and cybersecurity has transformed decision-making processes, but it also brings new challenges around trust, security, and accountability. As AI systems become more ubiquitous, ensuring the transparency and correctness of AI-driven decisions is crucial, especially when they have direct consequences on privacy, security, or fairness. Verifiable AI, powered by Zero-Knowledge Machine Learning (zkML), offers a robust solution to these challenges. zkML enables the verification of AI model inferences without exposing sensitive data, providing an essential layer of trust and privacy. However, traditional zkML systems typically require deep cryptographic expertise, placing them beyond the reach of most ML engineers. In this paper, we introduce JSTprove, a specialized zkML toolkit, built on Polyhedra Network's Expander backend, to enable AI developers and ML engineers to generate and verify proofs of AI inference. JSTprove provides an end-to-end verifiable AI inference pipeline that hides cryptographic complexity behind a simple command-line interface while exposing auditable artifacts for reproducibility. We present the design, innovations, and real-world use cases of JSTprove as well as our blueprints and tooling to encourage community review and extension. JSTprove therefore serves both as a usable zkML product for current engineering needs and as a reproducible foundation for future research and production deployments of verifiable AI.
【92】Physically consistent and uncertainty-aware learning of spatiotemporal dynamics
标题:物理一致且具有不确定性意识的时空动力学学习
链接:https://arxiv.org/abs/2510.21023
备注:Main text:33 pages,6 figures
摘要:时空动态的准确长期预测仍然是科学和工程领域的一个基本挑战。现有的机器学习方法往往忽略了物理规律,无法量化时空预测中的固有不确定性。为了解决这些挑战,我们引入了一个物理一致的神经算子(PCNO),通过将代理模型输出投影到满足预定义定律的函数空间来执行物理约束。PCNO中的物理一致性投影层有效地计算傅立叶空间中的质量和动量守恒。在确定性预测的基础上,我们进一步提出了一种扩散模型增强的PCNO(DiffPCNO),它利用一致性模型来量化和减轻不确定性,从而提高预测的准确性和可靠性。PCNO和DiffPCNO实现了高保真时空预测,同时在不同的系统和空间分辨率中保持物理一致性和不确定性,从湍流建模到现实世界的洪水/大气预测。我们的两阶段框架提供了一个强大的和通用的方法,准确的,物理接地,和不确定性意识的时空预测。
摘要:Accurate long-term forecasting of spatiotemporal dynamics remains a fundamental challenge across scientific and engineering domains. Existing machine learning methods often neglect governing physical laws and fail to quantify inherent uncertainties in spatiotemporal predictions. To address these challenges, we introduce a physics-consistent neural operator (PCNO) that enforces physical constraints by projecting surrogate model outputs onto function spaces satisfying predefined laws. A physics-consistent projection layer within PCNO efficiently computes mass and momentum conservation in Fourier space. Building upon deterministic predictions, we further propose a diffusion model-enhanced PCNO (DiffPCNO), which leverages a consistency model to quantify and mitigate uncertainties, thereby improving the accuracy and reliability of forecasts. PCNO and DiffPCNO achieve high-fidelity spatiotemporal predictions while preserving physical consistency and uncertainty across diverse systems and spatial resolutions, ranging from turbulent flow modeling to real-world flood/atmospheric forecasting. Our two-stage framework provides a robust and versatile approach for accurate, physically grounded, and uncertainty-aware spatiotemporal forecasting.
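As a concrete instance of a conservation-enforcing projection in Fourier space, the sketch below applies a standard Leray (divergence-free) projection to a 2D velocity field; it illustrates the general technique under an incompressibility assumption and is not the paper's PCNO layer.

    import numpy as np

    def project_divergence_free(u, v):
        # Remove the curl-free component in Fourier space so div(u, v) = 0.
        n = u.shape[0]
        k = np.fft.fftfreq(n) * n
        kx, ky = np.meshgrid(k, k, indexing="ij")
        k2 = kx**2 + ky**2
        k2[0, 0] = 1.0                       # avoid 0/0 at the mean mode
        uh, vh = np.fft.fft2(u), np.fft.fft2(v)
        div = kx * uh + ky * vh              # the i factor cancels inside the projector
        return (np.fft.ifft2(uh - kx * div / k2).real,
                np.fft.ifft2(vh - ky * div / k2).real)

    rng = np.random.default_rng(0)
    u, v = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
    u2, v2 = project_divergence_free(u, v)   # spectrally divergence-free output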
【93】Race and Gender in LLM-Generated Personas: A Large-Scale Audit of 41 Occupations
标题:法学硕士生成角色中的种族和性别:对41个职业的大规模审计
链接:https://arxiv.org/abs/2510.21011
摘要:生成性人工智能工具越来越多地用于创建职业中的人物形象,引发了人们对种族和性别如何表现的担忧。我们对美国41个职业中超过150万个职业角色进行了大规模审计,这些角色由四个大型语言模型生成,具有不同的人工智能安全承诺和原籍国(美国,中国、法国)。与美国劳工统计局的数据相比,我们发现两种反复出现的模式:系统性转变,即某些群体的代表性始终不足或过高,以及刻板印象夸大,即现有的人口结构扭曲被放大。平均而言,白人(-31pp)和黑人(-9pp)工人代表不足,而西班牙裔(+17pp)和亚裔(+12pp)工人代表过多。这些扭曲可能是极端的:例如,在所有四个模型中,管家被描绘成近100%的西班牙裔,而黑人工人被从许多职业中抹去。对于HCI,这些研究结果表明,供应商的选择实质性地改变了谁是可见的,激励模型特定的审计和负责任的设计实践。
摘要:Generative AI tools are increasingly used to create portrayals of people in occupations, raising concerns about how race and gender are represented. We conducted a large-scale audit of over 1.5 million occupational personas across 41 U.S. occupations, generated by four large language models with different AI safety commitments and countries of origin (U.S., China, France). Compared with Bureau of Labor Statistics data, we find two recurring patterns: systematic shifts, where some groups are consistently under- or overrepresented, and stereotype exaggeration, where existing demographic skews are amplified. On average, White (--31pp) and Black (--9pp) workers are underrepresented, while Hispanic (+17pp) and Asian (+12pp) workers are overrepresented. These distortions can be extreme: for example, across all four models, Housekeepers are portrayed as nearly 100\% Hispanic, while Black workers are erased from many occupations. For HCI, these findings show provider choice materially changes who is visible, motivating model-specific audits and accountable design practices.
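The audit arithmetic reduces to comparing generated persona shares against a baseline in percentage points; the counts and baseline shares below are made-up placeholders, not BLS figures.

    import numpy as np

    groups = ["White", "Black", "Hispanic", "Asian"]
    persona_counts = np.array([300, 60, 450, 190])    # generated personas (made-up counts)
    bls_share = np.array([0.60, 0.12, 0.18, 0.10])    # baseline shares (made-up, not BLS)

    persona_share = persona_counts / persona_counts.sum()
    gap_pp = (persona_share - bls_share) * 100        # signed gap in percentage points
    for g, p, b, d in zip(groups, persona_share, bls_share, gap_pp):
        print(f"{g:9s} persona={p:6.1%} baseline={b:6.1%} gap={d:+5.1f}pp")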
【94】Exploring Spiking Neural Networks for Binary Classification in Multivariate Time Series at the Edge
标题:基于脉冲神经网络的多变量时间序列边缘分类研究
链接:https://arxiv.org/abs/2510.20997
备注:Accepted in 2025 International Joint Conference on Neural Networks (IJCNN)
摘要:我们提出了一个训练脉冲神经网络(SNN)对多变量时间序列进行二分类的通用框架,重点是逐步预测以及低误报率下的高精度。该方法使用神经形态系统进化优化(EONS)算法,通过联合优化架构和参数来演化稀疏、有状态的SNN。输入被编码为脉冲序列,通过对单个输出神经元的脉冲计数进行阈值化来做出预测。我们还采用了简单的投票集成方法,以提高性能和鲁棒性。为了评估该框架,我们将其与特定应用优化一起应用于在低信噪比伽马射线光谱数据中检测放射源的任务。由此产生的SNN仅有49个神经元和66个突触,在每小时1次的误报率下实现了51.8%的真阳性率(TPR),优于PCA(42.7%)和深度学习(49.8%)基线。三模型任意投票集成在相同误报率下将TPR提高到67.1%。在microCaspian神经形态平台上的硬件部署显示出2mW的功耗和20.2ms的推理延迟。我们还将同一框架不加领域特定修改地应用于脑电图(EEG)记录中的癫痫发作检测,证明了其泛化能力:一个集成模型实现了95%的TPR和16%的假阳性率,与最近的深度学习方法相当,同时参数数量显著减少。
摘要:We present a general framework for training spiking neural networks (SNNs) to perform binary classification on multivariate time series, with a focus on step-wise prediction and high precision at low false alarm rates. The approach uses the Evolutionary Optimization of Neuromorphic Systems (EONS) algorithm to evolve sparse, stateful SNNs by jointly optimizing their architectures and parameters. Inputs are encoded into spike trains, and predictions are made by thresholding a single output neuron's spike counts. We also incorporate simple voting ensemble methods to improve performance and robustness. To evaluate the framework, we apply it with application-specific optimizations to the task of detecting low signal-to-noise ratio radioactive sources in gamma-ray spectral data. The resulting SNNs, with as few as 49 neurons and 66 synapses, achieve a 51.8% true positive rate (TPR) at a false alarm rate of 1/hr, outperforming PCA (42.7%) and deep learning (49.8%) baselines. A three-model any-vote ensemble increases TPR to 67.1% at the same false alarm rate. Hardware deployment on the microCaspian neuromorphic platform demonstrates 2mW power consumption and 20.2ms inference latency. We also demonstrate generalizability by applying the same framework, without domain-specific modification, to seizure detection in EEG recordings. An ensemble achieves 95% TPR with a 16% false positive rate, comparable to recent deep learning approaches with significant reduction in parameter count.
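The decision rule and ensemble described above are simple enough to sketch directly; spike counts here are synthetic Poisson draws and the threshold is arbitrary.

    import numpy as np

    def predict(spike_counts, thr):
        # Binary decision per window from one output neuron's spike count.
        return (spike_counts >= thr).astype(int)

    def any_vote(*member_preds):
        # Alarm if at least one ensemble member fires: raises TPR, and the
        # combined false-alarm rate must be re-measured at the ensemble level.
        return np.maximum.reduce(member_preds)

    rng = np.random.default_rng(0)
    counts = [rng.poisson(3, 1000) for _ in range(3)]  # three evolved SNNs (synthetic)
    preds = [predict(c, thr=7) for c in counts]
    print("flagged fraction:", any_vote(*preds).mean())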
【95】VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
标题:VESSA:基于视频的以对象为中心的视觉基础模型的自我监督适应
链接:https://arxiv.org/abs/2510.20994
备注:Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要:基础模型通过大规模的预训练和监督微调,在不同的任务中实现强大的性能,从而提高了计算机视觉。然而,他们可能表现不佳的领域分布变化和稀缺的标签,监督微调可能是不可行的。虽然用于模型自适应的持续自监督学习对于生成语言模型来说很常见,但这种策略对于以视觉为中心的编码器模型并不有效。为了应对这一挑战,我们引入了一种新的自监督微调视觉基础模型的配方,其中该模型适用于一个新的领域,而不需要注释,只利用短的多视图对象为中心的视频。我们的方法被称为VESSA:基于视频的以对象为中心的自监督适应视觉基础模型。VESSA的训练技术基于自蒸馏范式,其中仔细调整预测头并部署参数有效的自适应技术至关重要-否则,模型可能会很快忘记其预先训练的知识并达到降级状态。VESSA显著受益于来自以对象为中心的视频中不同帧的多视图对象观察,有效地学习对不同捕获条件的鲁棒性,而无需注释。通过在2个数据集上使用3个视觉基础模型进行综合实验,与基础模型和以前的自适应方法相比,VESSA在下游分类任务中表现出一致的改进。代码可在https://github.com/jesimonbarreto/VESSA上公开获取。
摘要:Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
【96】GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer
标题:基于双向门控循环单元优化Transformer的深度学习任务GPU内存需求预测
链接:https://arxiv.org/abs/2510.20985
摘要:针对深度学习任务对GPU内存资源准确预测的需求日益关键,本文深入分析了当前研究现状,创新性地提出了一种集成双向门控递归单元(BiGRU)优化Transformer架构的深度学习模型,旨在提高内存需求预测的准确性。为了验证模型的有效性,选取决策树、随机森林、Adaboost和XGBoost四种具有代表性的基本机器学习模型进行了精心设计的对比实验。详细的实验结果表明,本文提出的BiGRU Transformer优化模型在关键评价指标上表现出明显的优势:在均方误差(MSE)和均方根误差(RMSE)方面,该模型在所有对比模型中达到最低值,其预测结果与实际值的偏差最小;在平均绝对误差(MAE)和决定系数(R2)指标方面,该模型也表现良好,结果均衡稳定,综合预测性能远超基准机器学习方法比较。综上所述,本研究成功构建的基于双向门控递归单元优化的Transformer模型,能够高效准确地完成深度学习任务中的GPU内存需求预测任务,其预测精度相比传统机器学习方法有了显著提升。该研究为优化深度学习任务的资源调度和管理,提高计算集群的利用效率提供了有力的技术支持和可靠的理论依据。
摘要:In response to the increasingly critical demand for accurate prediction of GPU memory resources in deep learning tasks, this paper deeply analyzes the current research status and innovatively proposes a deep learning model that integrates bidirectional gated recurrent units (BiGRU) to optimize the Transformer architecture, aiming to improve the accuracy of memory demand prediction. To verify the effectiveness of the model, a carefully designed comparative experiment was conducted, selecting four representative basic machine learning models: decision tree, random forest, Adaboost, and XGBoost as benchmarks. The detailed experimental results show that the BiGRU Transformer optimization model proposed in this paper exhibits significant advantages in key evaluation indicators: in terms of mean square error (MSE) and root mean square error (RMSE), the model achieves the lowest value among all comparison models, and its predicted results have the smallest deviation from the actual values; In terms of mean absolute error (MAE) and coefficient of determination (R2) indicators, the model also performs well and the results are balanced and stable, with comprehensive predictive performance far exceeding the benchmark machine learning methods compared. In summary, the Transformer model based on bidirectional gated recurrent unit optimization successfully constructed in this study can efficiently and accurately complete GPU memory demand prediction tasks in deep learning tasks, and its prediction accuracy has been significantly improved compared to traditional machine learning methods. This research provides strong technical support and reliable theoretical basis for optimizing resource scheduling and management of deep learning tasks, and improving the utilization efficiency of computing clusters.
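A plausible PyTorch rendering of the described BiGRU-plus-Transformer regressor is sketched below; layer sizes, sequence layout, and the last-step readout are illustrative guesses, not the paper's configuration.

    import torch
    import torch.nn as nn

    class BiGRUTransformer(nn.Module):
        def __init__(self, in_dim=16, hid=64, heads=4, layers=2):
            super().__init__()
            self.gru = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)
            enc = nn.TransformerEncoderLayer(d_model=2 * hid, nhead=heads,
                                             batch_first=True)
            self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
            self.head = nn.Linear(2 * hid, 1)          # scalar memory estimate

        def forward(self, x):                          # x: (batch, seq, features)
            h, _ = self.gru(x)                         # bidirectional context
            h = self.encoder(h)                        # global self-attention refinement
            return self.head(h[:, -1])                 # read out the last step

    model = BiGRUTransformer()
    print(model(torch.randn(8, 20, 16)).shape)         # torch.Size([8, 1])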
【97】Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression
标题:用于低比特LLM压缩的学习分组格形向量量化器
链接:https://arxiv.org/abs/2510.20984
备注:NeurIPS 2025 Poster
摘要:大型语言模型(LLM)已经证明了卓越的能力,但通常需要大量的计算资源和内存进行推理。后训练量化(PTQ)可以通过以较低的位宽格式存储权重来有效地减少这些需求。然而,标准的均匀量化通常会导致显著的性能下降,特别是在低比特场景中。在这项工作中,我们引入了一个分组格矢量量化(GLVQ)的框架,分配给每组的权重一个定制的格码本,由一个可学习的生成矩阵定义。为了解决量化过程的不可微性,我们在训练过程中采用Babai舍入来近似最近格点搜索,这使得生成矩阵能够稳定优化。一旦经过训练,解码就简化为一个简单的矩阵-向量乘法,从而产生一个有效且实用的量化流水线。在多个基准上的实验表明,与现有的训练后量化基线相比,我们的方法在模型大小和准确性之间实现了更好的权衡,突出了其在严格的资源约束下部署大型模型的有效性。我们的源代码可以在GitHub存储库中找到:https://github.com/xzhang9308/GLVQ。
摘要:Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available on GitHub repository: https://github.com/xzhang9308/GLVQ.
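Babai rounding itself is compact enough to sketch: quantize a group of weight vectors to the lattice spanned by a generator matrix, with decoding as a plain matrix product; the matrix here is random rather than learned.

    import numpy as np

    def babai_round(B, w):
        # Nearest-lattice-point approximation for the lattice spanned by the
        # rows of B: z = round(w B^{-1}), decoded point is z B.
        return np.rint(w @ np.linalg.inv(B)) @ B

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))            # stands in for a learned generator matrix
    group = rng.normal(size=(10, 4))       # one group of weight vectors
    q = babai_round(B, group)              # decoding is a plain matrix product
    print(np.linalg.norm(group - q, axis=1))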
【98】Memory Constrained Dynamic Subnetwork Update for Transfer Learning
标题:迁移学习的记忆约束动态子网络更新
链接:https://arxiv.org/abs/2510.20979
摘要:设备上的神经网络训练面临着关键的内存约束,这些约束限制了预训练模型对下游任务的适应。我们提出了MeDyate,一个理论上接地框架内存约束的动态子网络适应。我们的方法引入了两个关键的创新:LaRa(层排名),一个改进的层重要性度量,使原则层预选,和一个动态的通道采样策略,利用时间稳定性的通道重要性分布在微调。MeDyate根据重要性加权概率动态重新采样时期之间的通道,确保全面的参数空间探索,同时尊重严格的内存预算。对大量任务和架构的广泛评估表明,MeDyate在极端内存限制下实现了最先进的性能,始终优于现有的静态和动态方法,同时保持高计算效率。我们的方法是实现有效的设备上学习的重要一步,通过演示有效的微调,内存预算低至几百kB的RAM。
摘要:On-device neural network training faces critical memory constraints that limit the adaptation of pre-trained models to downstream tasks. We present MeDyate, a theoretically-grounded framework for memory-constrained dynamic subnetwork adaptation. Our approach introduces two key innovations: LaRa (Layer Ranking), an improved layer importance metric that enables principled layer pre-selection, and a dynamic channel sampling strategy that exploits the temporal stability of channel importance distributions during fine-tuning. MeDyate dynamically resamples channels between epochs according to importance-weighted probabilities, ensuring comprehensive parameter space exploration while respecting strict memory budgets. Extensive evaluation across a large panel of tasks and architectures demonstrates that MeDyate achieves state-of-the-art performance under extreme memory constraints, consistently outperforming existing static and dynamic approaches while maintaining high computational efficiency. Our method represents a significant step towards enabling efficient on-device learning by demonstrating effective fine-tuning with memory budgets as low as a few hundred kB of RAM.
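The importance-weighted resampling step can be sketched in a few lines; the gamma-distributed importance scores and the 32-channel budget are placeholders.

    import numpy as np

    def resample_channels(importance, budget, rng):
        # Sample which channels to update this epoch, without replacement,
        # with probability proportional to importance.
        p = importance / importance.sum()
        return rng.choice(len(importance), size=budget, replace=False, p=p)

    rng = np.random.default_rng(0)
    importance = rng.gamma(2.0, size=256)          # placeholder importance scores
    for epoch in range(3):
        active = resample_channels(importance, budget=32, rng=rng)
        # ...update only these channels; all others stay frozen this epoch...
        print(epoch, np.sort(active)[:5])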
【99】REx86: A Local Large Language Model for Assisting in x86 Assembly Reverse Engineering
标题:REx86:一种辅助x86汇编逆向工程的本地大型语言模型
链接:https://arxiv.org/abs/2510.20975
备注:Accepted in 2025 Annual Computer Security Applications Conference (ACSAC)
摘要:x86二进制文件的逆向工程(RE)对于恶意软件和固件分析是必不可少的,但由于剥离的元数据和对抗性混淆,速度仍然很慢。大型语言模型(LLM)通过自动理解和注释提供了提高RE效率的潜力,但云托管的闭权重模型会带来隐私和安全风险,并且不能在封闭网络设施中使用。我们评估了参数高效微调的本地LLM,以在这些环境中协助x86 RE任务。CodeLlama、Qwen2.5-Coder和CodeGemma系列的八个开放权重模型在5,981个x86汇编示例的自定义策划数据集上进行了微调。我们对它们进行了定量评估,并将经过微调的Qwen2.5-Coder-7B确定为性能最佳者,我们将其命名为REx86。REx86将测试集交叉熵损失减少了64.2%,并将与基准真值的语义余弦相似度较其基础模型提高了20.3%。在一项有限的用户案例研究(n=43)中,REx86显著增强了行级代码理解(p = 0.031),并将正确解决率从31%提高到53%(p = 0.189),尽管后者未达到统计学显著性。定性分析显示注释更准确、更简洁,幻觉更少。REx86在本地开放权重LLM中提供了最先进的x86 RE辅助能力。我们的研究结果证明了特定领域微调的价值,并强调需要更多带注释的反汇编数据,以进一步提高LLM在RE中的性能。REx86及其数据集和LoRA适配器可在https://github.com/dlea8/REx86和https://zenodo.org/records/15420461上公开获取。
摘要:Reverse engineering (RE) of x86 binaries is indispensable for malware and firmware analysis, but remains slow due to stripped metadata and adversarial obfuscation. Large Language Models (LLMs) offer potential for improving RE efficiency through automated comprehension and commenting, but cloud-hosted, closed-weight models pose privacy and security risks and cannot be used in closed-network facilities. We evaluate parameter-efficient fine-tuned local LLMs for assisting with x86 RE tasks in these settings. Eight open-weight models across the CodeLlama, Qwen2.5-Coder, and CodeGemma series are fine-tuned on a custom curated dataset of 5,981 x86 assembly examples. We evaluate them quantitatively and identify the fine-tuned Qwen2.5-Coder-7B as the top performer, which we name REx86. REx86 reduces test-set cross-entropy loss by 64.2% and improves semantic cosine similarity against ground truth by 20.3% over its base model. In a limited user case study (n=43), REx86 significantly enhanced line-level code understanding (p = 0.031) and increased the correct-solve rate from 31% to 53% (p = 0.189), though the latter did not reach statistical significance. Qualitative analysis shows more accurate, concise comments with fewer hallucinations. REx86 delivers state-of-the-art assistance in x86 RE among local, open-weight LLMs. Our findings demonstrate the value of domain-specific fine-tuning, and highlight the need for more commented disassembly data to further enhance LLM performance in RE. REx86, its dataset, and LoRA adapters are publicly available at https://github.com/dlea8/REx86 and https://zenodo.org/records/15420461.
【100】3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models
标题:3DReasonKnee:推进医学视觉语言模型中的扎根推理
链接:https://arxiv.org/abs/2510.20967
摘要:当前的视觉语言模型(VLM)难以在3D医学图像中定位解剖区域,并以逐步的方式对其进行推理,这是现实世界诊断评估的关键要求。这种能力对于将模型输出与临床医生在实践中使用的诊断工作流程保持一致至关重要,从而实现值得信赖的临床医生-AI协作。现有的3D数据集提供了定位标签,但没有一个支持这种“扎根推理”能力。为了解决这一差距,我们引入了3DReasonKnee,这是第一个用于医学图像的3D扎根推理数据集,它提供了来自7,970个3D膝关节MRI体积的494k高质量五元组。每个五元组包括:(1)3D MRI体积,(2)针对特定解剖区域的诊断问题,(3)定位相关解剖结构的3D边界框,(4)明确详述3D推理过程的临床医生生成的诊断推理步骤,以及(5)相关解剖区域的结构化严重性评估。3DReasonKnee的创建和验证涉及超过450小时的专家临床医生时间,用于手动分割MRI和生成推理链,确保其卓越的质量和临床相关性。我们建立了ReasonKnee-Bench来评估定位和诊断准确性,深入了解VLM在解剖区域和诊断询问中执行扎根和严重程度评估的能力。我们对五种最先进的VLM进行基准测试,为ReasonKnee-Bench提供基准性能。通过提供这种独特的专家注释3D推理路径资源,3DReasonKnee可作为骨科医生诊断专业知识的存储库,并为推进多模态医疗AI系统向3D、临床对齐、本地化决策能力发展提供重要的测试平台。该数据集可在以下网站找到:https://huggingface.co/datasets/rajpurkarlab/3DReasonKnee
摘要:Current Vision-Language Models (VLMs) struggle to ground anatomical regions in 3D medical images and reason about them in a step-by-step manner, a key requirement of real-world diagnostic assessment. This ability is essential for aligning model outputs with the diagnostic workflows clinicians use in practice, enabling trustworthy clinician-AI collaboration. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability. To address this gap, we introduce 3DReasonKnee, the first 3D grounded reasoning dataset for medical images, which provides 494k high-quality quintuples derived from 7,970 3D knee MRI volumes. Each quintuple includes: (1) the 3D MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant anatomical structures, (4) clinician-generated diagnostic reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments for the relevant anatomical region. The creation and validation of 3DReasonKnee, involving over 450 hours of expert clinician time for manually segmenting MRIs and generating reasoning chains, ensures its superior quality and clinical relevance. We establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy, providing insight into VLM ability to perform grounding and severity assessment across anatomical regions and diagnostic inquiries. We benchmark five state-of-the-art VLMs, providing baseline performance for ReasonKnee-Bench. By providing this unique resource of expert-annotated 3D reasoning pathways, 3DReasonKnee serves as a repository of orthopedic surgeons' diagnostic expertise and offers a vital testbed for advancing multimodal medical AI systems towards 3D, clinically aligned, localized decision-making capabilities. The dataset can be found at: https://huggingface.co/datasets/rajpurkarlab/3DReasonKnee
【101】Meta-Learning for Cross-Task Generalization in Protein Mutation Property Prediction
标题:蛋白质突变特性预测中跨任务概括的元学习
链接:https://arxiv.org/abs/2510.20943
摘要:蛋白质突变可以对生物功能产生深远的影响,因此准确预测性质变化对于药物发现、蛋白质工程和精准医学至关重要。目前的方法依赖于在单个数据集上微调蛋白质特异性Transformer,但由于异质的实验条件和有限的目标域数据,难以实现跨数据集泛化。我们引入了两个关键创新:(1)首次将模型无关元学习(MAML)应用于蛋白质突变性质预测;(2)一种新颖的突变编码策略,使用分隔符标记将突变直接纳入序列上下文。我们在Transformer架构的基础上将其与MAML集成,通过最少的梯度步骤快速适应新任务,而不是学习特定于数据集的模式。我们的突变编码解决了一个关键限制,即标准Transformer将突变位置视为未知令牌,从而显著降低性能。在三个不同的蛋白质突变数据集(功能适应度、热稳定性和溶解度)上的评估表明,该方法相对传统微调具有显著优势。在跨任务评估中,我们的元学习方法在功能适应度上准确率提高29%,训练时间减少65%;在溶解度上准确率提高94%,训练速度加快55%。无论数据集大小如何,该框架都保持一致的训练效率,这使其对实验数据有限的工业应用和早期蛋白质设计特别有价值。这项工作建立了元学习在蛋白质突变分析中的系统应用,并引入了一种有效的突变编码策略,为蛋白质工程中的跨域泛化提供了变革性的方法。
摘要:Protein mutations can have profound effects on biological function, making accurate prediction of property changes critical for drug discovery, protein engineering, and precision medicine. Current approaches rely on fine-tuning protein-specific transformers for individual datasets, but struggle with cross-dataset generalization due to heterogeneous experimental conditions and limited target domain data. We introduce two key innovations: (1) the first application of Model-Agnostic Meta-Learning (MAML) to protein mutation property prediction, and (2) a novel mutation encoding strategy using separator tokens to directly incorporate mutations into sequence context. We build upon transformer architectures integrating them with MAML to enable rapid adaptation to new tasks through minimal gradient steps rather than learning dataset-specific patterns. Our mutation encoding addresses the critical limitation where standard transformers treat mutation positions as unknown tokens, significantly degrading performance. Evaluation across three diverse protein mutation datasets (functional fitness, thermal stability, and solubility) demonstrates significant advantages over traditional fine-tuning. In cross-task evaluation, our meta-learning approach achieves 29% better accuracy for functional fitness with 65% less training time, and 94% better accuracy for solubility with 55% faster training. The framework maintains consistent training efficiency regardless of dataset size, making it particularly valuable for industrial applications and early-stage protein design where experimental data is limited. This work establishes a systematic application of meta-learning to protein mutation analysis and introduces an effective mutation encoding strategy, offering transformative methodology for cross-domain generalization in protein engineering.
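The separator-token encoding is easy to illustrate; the sequence, mutation string, and [SEP] token below are invented examples, and the real tokenizer and mutation format may differ.

    def encode_mutation(sequence, mutation, sep="[SEP]"):
        # Keep the wild-type sequence in context and append the mutation after a
        # separator, so the mutated position never appears as an unknown token.
        return f"{sequence} {sep} {mutation}"

    print(encode_mutation("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "A4V"))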
【102】Do LLMs Truly Understand When a Precedent Is Overruled?
标题:法学硕士真的明白判例何时被推翻吗?
链接:https://arxiv.org/abs/2510.20941
备注:12 pages, 2 figures, JURIX 2025
摘要:具有扩展上下文窗口的大型语言模型(LLM)显示出对复杂法律推理任务的承诺,但它们理解长法律文档的能力仍然没有得到充分的评估。开发能够捕捉现实、高风险任务的长期背景基准仍然是该领域的一个重大挑战,因为大多数现有评价依赖于简化的综合任务,无法代表真实世界文件理解的复杂性。否决关系是普通法原则的基础,在司法意见中也很常见。它们为长文档法律理解提供了一个重点突出的重要试验平台,与法律专业人士的实际工作非常相似。我们提出了一个评估国家的最先进的法学硕士从美国最高法院的案件使用236个案例对的数据集确定推翻的关系。我们的评估揭示了三个关键的局限性:(1)时代敏感性-与现代模型相比,模型在历史案例上的表现有所下降,揭示了它们在训练中的基本时间偏见;(2)浅层推理-模型依赖于浅层逻辑推理,而不是深入的法律理解;和(3)上下文相关的推理失败-模型在复杂的开放式任务中产生时间上不可能的关系,尽管在简单的上下文中保持基本的时间意识。我们的工作提供了一个基准,解决了现实的长期背景评估中的关键差距,提供了一个反映实际法律推理任务的复杂性和利害关系的环境。
摘要:Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge in the field, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and commonly found in judicial opinions. They provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity -- the models show degraded performance on historical cases compared to modern ones, revealing fundamental temporal bias in their training; (2) shallow reasoning -- models rely on shallow logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures -- models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.
【103】Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation
标题:用于医学图像分割的焦点调制和双向特征融合网络
链接:https://arxiv.org/abs/2510.20933
摘要:医学图像分割对于疾病诊断、治疗计划和疾病发展监测等临床应用至关重要,因为它提供了关于解剖结构的精确形态和空间信息,这些信息直接影响治疗决策。卷积神经网络显著影响图像分割;然而,由于卷积运算是局部的,因此捕获全局上下文信息和长期依赖关系仍然具有挑战性。它们精确分割具有复杂边界和各种尺寸的结构的能力受到这种限制的影响。由于Transformers使用自注意方法来有效地捕获全局上下文和长距离依赖关系,因此将基于transformer的架构与CNN集成是克服这些挑战的可行方法。为了解决这些挑战,我们提出了用于医学图像分割的焦点调制和双向特征融合网络,在本文的其余部分中称为FM-BFF-Net。该网络结合了卷积和Transformer组件,采用焦点调制注意机制来改进上下文感知,并引入了双向特征融合模块,该模块能够在编码器和解码器表示之间进行跨尺度的有效交互。通过这种设计,FM-BFF-Net增强了边界精度和对病变大小、形状和对比度变化的鲁棒性。在八个公开可用的数据集上进行的广泛实验,包括息肉检测,皮肤病变分割和超声成像,表明FM-BFF-Net在Jaccard指数和Dice系数方面始终超过最新的最先进方法,证实了其对各种医学成像场景的有效性和适应性。
摘要:Medical image segmentation is essential for clinical applications such as disease diagnosis, treatment planning, and disease development monitoring because it provides precise morphological and spatial information on anatomical structures that directly influence treatment decisions. Convolutional neural networks significantly impact image segmentation; however, since convolution operations are local, capturing global contextual information and long-range dependencies is still challenging. Their capacity to precisely segment structures with complicated borders and a variety of sizes is impacted by this restriction. Since transformers use self-attention methods to capture global context and long-range dependencies efficiently, integrating transformer-based architecture with CNNs is a feasible approach to overcoming these challenges. To address these challenges, we propose the Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation, referred to as FM-BFF-Net in the remainder of this paper. The network combines convolutional and transformer components, employs a focal modulation attention mechanism to refine context awareness, and introduces a bidirectional feature fusion module that enables efficient interaction between encoder and decoder representations across scales. Through this design, FM-BFF-Net enhances boundary precision and robustness to variations in lesion size, shape, and contrast. Extensive experiments on eight publicly available datasets, including polyp detection, skin lesion segmentation, and ultrasound imaging, show that FM-BFF-Net consistently surpasses recent state-of-the-art methods in Jaccard index and Dice coefficient, confirming its effectiveness and adaptability for diverse medical imaging scenarios.
【104】An Experimental Study of Trojan Vulnerabilities in UAV Autonomous Landing
标题:无人机自主着陆木马漏洞实验研究
链接:https://arxiv.org/abs/2510.20932
备注:6 pages
摘要:本研究调查了城市空中机动(UAM)车辆自主导航和着陆系统的脆弱性。具体来说,它专注于针对深度学习模型的特洛伊木马攻击,例如卷积神经网络(CNN)。特洛伊木马攻击通过在模型的训练数据中嵌入隐藏的触发器来工作。这些触发器在某些情况下会导致特定的故障,而模型在其他情况下会继续正常执行。我们使用DroNet框架评估了城市自主飞行器(UAV)的脆弱性。我们的实验表明,准确率显著下降,从96.4%的干净数据到73.3%的特洛伊木马攻击触发的数据。为了进行这项研究,我们收集了一个自定义数据集和训练模型来模拟真实世界的条件。我们还开发了一个评估框架,旨在识别木马感染的模型。这项工作展示了木马攻击带来的潜在安全风险,并为未来增强UAM系统弹性的研究奠定了基础。
摘要:This study investigates the vulnerabilities of autonomous navigation and landing systems in Urban Air Mobility (UAM) vehicles. Specifically, it focuses on Trojan attacks that target deep learning models, such as Convolutional Neural Networks (CNNs). Trojan attacks work by embedding covert triggers within a model's training data. These triggers cause specific failures under certain conditions, while the model continues to perform normally in other situations. We assessed the vulnerability of Urban Autonomous Aerial Vehicles (UAAVs) using the DroNet framework. Our experiments showed a significant drop in accuracy, from 96.4% on clean data to 73.3% on data triggered by Trojan attacks. To conduct this study, we collected a custom dataset and trained models to simulate real-world conditions. We also developed an evaluation framework designed to identify Trojan-infected models. This work demonstrates the potential security risks posed by Trojan attacks and lays the groundwork for future research on enhancing the resilience of UAM systems.
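A generic data-poisoning recipe of the kind the abstract describes can be sketched as follows; the 4x4 corner patch, poison rate, and target label are arbitrary choices for illustration, not the study's actual trigger.

    import numpy as np

    def poison(images, labels, target_label, rate=0.05, rng=None):
        # Stamp a small trigger patch onto a fraction of images and relabel
        # them so the model associates the trigger with the target behavior.
        rng = rng or np.random.default_rng(0)
        x, y = images.copy(), labels.copy()
        idx = rng.choice(len(x), size=int(rate * len(x)), replace=False)
        x[idx, -4:, -4:, :] = 1.0          # 4x4 white square in the corner
        y[idx] = target_label
        return x, y

    imgs = np.zeros((100, 32, 32, 3), dtype=np.float32)
    lbls = np.zeros(100, dtype=np.int64)
    pimgs, plbls = poison(imgs, lbls, target_label=1)
    print(int(plbls.sum()), "poisoned samples")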
【105】Security Logs to ATT&CK Insights: Leveraging LLMs for High-Level Threat Understanding and Cognitive Trait Inference
标题:从安全日志到ATT&CK洞察:利用LLM进行高层威胁理解与认知特质推断
链接:https://arxiv.org/abs/2510.20930
摘要:传统上,理解网络安全中的对抗行为依赖于高级别的情报报告和对攻击链的手动解释。然而,实时防御需要能够直接从低级别的系统遥测(如入侵检测系统(IDS)日志)推断攻击者的意图和认知策略。在本文中,我们提出了一个新的框架,利用大型语言模型(LLM)分析Suricata IDS日志和推断攻击者的行动方面的MITRE ATT&CK技术。我们的方法是基于这样的假设,即攻击者的行为反映了潜在的认知偏差,如损失厌恶,风险容忍度,或目标的持久性,可以通过仔细观察日志序列提取和建模。这为未来的行为适应性网络防御和认知特质推理工作奠定了基础。我们开发了一个策略驱动的提示系统,以高效的方式将大量的网络日志数据分割成不同的行为阶段,使LLM能够将每个阶段与可能的技术和潜在的认知动机相关联。通过将网络层事件映射到高级攻击者策略,我们的方法揭示了工具切换、协议转换或枢轴模式等行为信号如何对应于心理上有意义的决策点。结果表明,LLM可以弥合数据包级日志和战略意图之间的语义鸿沟,为认知自适应网络防御提供了一条途径。 关键词:认知网络安全,大型语言模型(LLM),网络心理学,入侵检测系统(IDS),MITRE ATT&CK,认知偏差
摘要:Understanding adversarial behavior in cybersecurity has traditionally relied on high-level intelligence reports and manual interpretation of attack chains. However, real-time defense requires the ability to infer attacker intent and cognitive strategy directly from low-level system telemetry such as intrusion detection system (IDS) logs. In this paper, we propose a novel framework that leverages large language models (LLMs) to analyze Suricata IDS logs and infer attacker actions in terms of MITRE ATT&CK techniques. Our approach is grounded in the hypothesis that attacker behavior reflects underlying cognitive biases such as loss aversion, risk tolerance, or goal persistence that can be extracted and modeled through careful observation of log sequences. This lays the groundwork for future work on behaviorally adaptive cyber defense and cognitive trait inference. We develop a strategy-driven prompt system to segment large amounts of network logs data into distinct behavioral phases in a highly efficient manner, enabling the LLM to associate each phase with likely techniques and underlying cognitive motives. By mapping network-layer events to high-level attacker strategies, our method reveals how behavioral signals such as tool switching, protocol transitions, or pivot patterns correspond to psychologically meaningful decision points. The results demonstrate that LLMs can bridge the semantic gap between packet-level logs and strategic intent, offering a pathway toward cognitive-adaptive cyber defense. Keywords: Cognitive Cybersecurity, Large Language Models (LLMs), Cyberpsychology, Intrusion Detection Systems (IDS), MITRE ATT&CK, Cognitive Biases
【106】Aircraft Collision Avoidance Systems: Technological Challenges and Solutions on the Path to Regulatory Acceptance
标题:飞机避碰系统:监管认可道路上的技术挑战和解决方案
链接:https://arxiv.org/abs/2510.20916
备注:32 pages, 9 figures
摘要:飞机防撞系统对现代航空至关重要。这些系统旨在预测飞机之间的潜在碰撞,并建议适当的避免行动。创建有效的防撞系统需要解决与监视、决策和验证相关的各种技术挑战。这些挑战在过去几十年中引发了大量的研究和开发工作,并提出了各种解决方案。本文概述了这些挑战和解决方案,重点是那些已经通过严格的验证过程,并接受监管机构。防撞问题所带来的挑战通常存在于其他领域,飞机防撞系统可以作为案例研究,为广泛的安全关键系统提供有价值的见解。
摘要:Aircraft collision avoidance systems are critical to modern aviation. These systems are designed to predict potential collisions between aircraft and recommend appropriate avoidance actions. Creating effective collision avoidance systems requires solutions to a variety of technical challenges related to surveillance, decision making, and validation. These challenges have sparked significant research and development efforts over the past several decades that have resulted in a variety of proposed solutions. This article provides an overview of these challenges and solutions with an emphasis on those that have been put through a rigorous validation process and accepted by regulatory bodies. The challenges posed by the collision avoidance problem are often present in other domains, and aircraft collision avoidance systems can serve as case studies that provide valuable insights for a wide range of safety-critical systems.
【107】Code-enabled language models can outperform reasoning models on diverse tasks
标题:支持代码的语言模型可以在不同任务上优于推理模型
链接:https://arxiv.org/abs/2510.20909
摘要:推理模型(RM),即通过强化学习训练以产生长篇自然语言推理的语言模型(LM),已经取得了显著的成功,但它们的训练仍然需要大量的计算和数据,并且运行缓慢且昂贵。在本文中,我们表明,标准指令LM无需微调,就已经可以被激发为强大的推理器,其水平相当于甚至超过其对应的RM(例如,DeepSeek V3对比R1),涵盖从指令遵循、创意生成到数学推理的不同领域。这是通过CodeAdapt实现的:我们的简单配方将CodeAct框架(LM以多步方式将自然语言推理与代码执行交织在一起)与仅需五个训练问题的少样本自举上下文学习相结合。通过分析四对匹配的LM和RM,我们发现CodeAdapt使三个LM在八个任务上的平均表现优于对应的RM(最高提升22.9%),同时令牌效率提高10-81%;在四个模型上取平均时,在六个任务上提供更优的性能(最高提升35.7%)。此外,代码增强的推理轨迹展示了丰富多样的问题解决策略。我们的发现支持:(1)CodeAdapt风格的学习和推理可能是鲁棒且领域通用的;(2)具备代码能力的LM是有认知基础且强大的系统,可能为权重内强化学习提供坚实的基础。
摘要:Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
【108】Video-As-Prompt: Unified Semantic Control for Video Generation
标题:视频即提示:视频生成的统一语义控制
链接:https://arxiv.org/abs/2510.20888
备注:Website: this https URL
摘要:视频生成中统一的、可推广的语义控制仍然是一个关键的开放性挑战。现有的方法要么通过从基于结构的控制中强制执行不适当的像素先验来引入伪影,要么依赖于不可概括的、特定于条件的微调或特定于任务的架构。我们介绍视频提示(VAP),一个新的范式,重新定义这个问题的上下文生成。VAP利用参考视频作为直接语义提示,通过即插即用的混合Transformers(MoT)专家引导冻结的视频扩散Transformer(DiT)。这种架构可以防止灾难性的遗忘,并通过时间偏置的位置嵌入来指导,消除了虚假的映射先验,以实现强大的上下文检索。为了支持这种方法并促进未来的研究,我们构建了VAP-Data,这是用于语义控制视频生成的最大数据集,包含100种语义条件下超过10万对视频。作为一个单一的统一模型,VAP为开源方法提供了一个新的最先进的技术,实现了38.7%的用户偏好率,可与领先的特定条件商业模型相媲美。VAP强大的zero-shot通用性和对各种下游应用的支持标志着通用可控视频生成的重大进步。
摘要:Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
【109】Preventing Shortcuts in Adapter Training via Providing the Shortcuts
标题:通过提供捷径来防止适配器训练中的捷径
链接:https://arxiv.org/abs/2510.20887
备注:Accepted to NeurIPS 2025, webpage: this https URL
摘要:基于适配器的训练已经成为扩展强大的基础图像生成器功能的关键机制,从而实现个性化和风格化的文本到图像合成。这些适配器通常被训练为使用单个图像重建目标来捕获特定的目标属性,例如受试者身份。然而,由于输入图像不可避免地包含视觉因素的混合,适配器容易将目标属性与附带属性(例如姿势、表情和照明)纠缠在一起。这种虚假的相关性问题限制了泛化,并阻碍了模型坚持输入文本提示的能力。在这项工作中,我们发现了一个简单而有效的解决方案:提供我们希望在适配器训练过程中消除的捷径。在Shortcut-Rerouted Adapter Training中,混杂因素通过辅助模块(如ControlNet或LoRA)进行路由,从而消除了适配器将其内化的动机。然后在推理过程中删除辅助模块。当应用于面部和全身身份注入等任务时,我们的方法提高了生成质量,多样性和及时的依从性。这些结果指出了大模型时代的一个一般设计原则:当寻求解开的表征时,最有效的途径可能是为不应该学习的东西建立捷径。
摘要:Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.
【110】Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
标题:先拍,后问?构建像人一样探索和行动的理性主体
链接:https://arxiv.org/abs/2510.20886
摘要:人工智能的许多高风险应用需要形成数据驱动的假设并进行有针对性的猜测,例如在科学和诊断环境中。在资源有限的情况下,基于语言模型(LM)的智能体在多大程度上是理性的?我们借鉴人类行为的洞见,开发了对智能体信息寻求行为进行基准测试和增强的方法。首先,我们引入了一个面向战略决策的对话任务,称为协作战舰(Collaborative Battleship),其中信息不完全的舰长必须平衡探索(提问)和行动(射击),而信息完全的观察员(Spotter)必须在信息瓶颈下提供准确的答案。与人类玩家(N=42)相比,我们发现LM智能体难以将答案建立在上下文之上、难以生成信息丰富的问题,也难以选择高价值的行动。接下来,为了弥补这些差距,我们基于贝叶斯实验设计(BED)的原则,为LM开发了新颖的蒙特卡洛推断策略。对于观察员智能体,我们的方法比纯LM基线将准确率绝对提升最多14.7%;对于舰长智能体,它将期望信息增益(EIG)提高最多0.227比特(达到可实现噪声上限的94.2%)。两者结合产生了更精准的打击(F1提升0.303-0.374),并使较弱的LM(如Llama-4-Scout)以约GPT-5成本的1%,胜过人类(胜率8% -> 82%)和前沿模型(对阵GPT-5的胜率0% -> 67%)。我们在《猜猜是谁?》(Guess Who?)游戏上复现了这些发现,我们的方法显著提升了准确率(+28.3-42.4个百分点),证明了它们在构建理性信息寻求智能体方面的普遍适用性。
摘要:Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
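The Bayesian experimental design quantity at the heart of the Captain's question selection, expected information gain, has a standard discrete form that is easy to compute; the four-hypothesis prior and answer likelihoods below are toy values.

    import numpy as np

    def expected_info_gain(prior, likelihoods):
        # EIG = H(prior) - E_answers[H(posterior)], all in bits.
        def H(p):
            p = p[p > 0]
            return -(p * np.log2(p)).sum()
        p_ans = likelihoods @ prior               # marginal probability of each answer
        eig = H(prior)
        for a, pa in enumerate(p_ans):
            if pa > 0:
                posterior = likelihoods[a] * prior / pa
                eig -= pa * H(posterior)
        return eig

    prior = np.ones(4) / 4                        # four candidate board layouts (toy)
    likelihoods = np.array([[1.0, 1.0, 0.0, 0.0], # P(answer "yes" | layout)
                            [0.0, 0.0, 1.0, 1.0]])# P(answer "no"  | layout)
    print(expected_info_gain(prior, likelihoods)) # 1.0 bit for this even split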
【111】HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement
标题:HA-RAG:通过混合精度和数据放置实现热度感知RAG加速
链接:https://arxiv.org/abs/2510.20878
备注:13 pages,16 figures,2 tables
摘要:检索增强生成(RAG)通过利用外部知识库来提高模型输出的准确性,作为解决大型语言模型(LLM)中的幻觉问题和知识更新延迟的有效解决方案。然而,外部知识库的引入给RAG带来了长上下文处理方面的挑战,显著增加了内存消耗和推理延迟。现有研究通过预计算知识库的键和值(KV)并在推理过程中按需加载它们来加速推理。基于外部知识库中不同KV组块的访问频率,提出了一种热度感知的RAG(HA-RAG)推理优化系统。首先,利用KV块的数值分布,我们引入了一个热感知的混合精度压缩和加载方法,以减少磁盘I/O和内存访问开销。其次,我们设计了一个热感知的数据放置策略,优先存储频繁访问的KV块在高速内存中,以提高数据访问效率。实验结果表明,与TurboRAG相比,所提出的HA-RAG实现了平均加速2.10倍,最大加速10.49倍,在时间到第一个令牌(TTFT),可以忽略不计的准确性损失。
摘要:Retrieval-Augmented Generation (RAG) improves model output accuracy by leveraging external knowledge bases, serving as an effective solution to address hallucination issues and knowledge-update delays in Large Language Models (LLMs). However, the introduction of external knowledge bases presents RAG with challenges in long-context processing, significantly increasing memory consumption and inference latency. Existing research accelerates inference by precomputing Key and Value (KV) of the knowledge base and loading them on-demand during inference. Based on the access frequency of different KV chunks within the external knowledge base, this paper proposes a hotness-aware RAG (HA-RAG) inference optimization system. First, leveraging the numerical distribution of KV chunks, we introduce a hotness-aware mixed-precision compressing and loading method to reduce disk I/O and memory access overhead. Second, we design a hotness-aware data placement strategy that prioritizes storing frequently accessed KV chunks in high-speed memory to improve data access efficiency. Experimental results demonstrate that, compared with TurboRAG, the proposed HA-RAG achieves an average speedup of 2.10x and maximum speedup of 10.49x in Time-To-First-Token (TTFT) with negligible accuracy loss.
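The hotness-aware placement idea can be sketched as a simple policy that maps access counts to (precision, tier) pairs; the 20% hot fraction, fp16/int4 formats, and DRAM/disk tiers are assumptions for illustration, not HA-RAG's actual policy.

    import numpy as np

    def assign_precision(access_counts, hot_frac=0.2):
        # Most-accessed KV chunks keep high precision in fast memory;
        # cold chunks are stored compressed on slower storage.
        order = np.argsort(access_counts)[::-1]
        n_hot = max(1, int(hot_frac * len(order)))
        return {int(c): ("fp16", "DRAM") if r < n_hot else ("int4", "disk")
                for r, c in enumerate(order)}

    counts = np.array([120, 3, 45, 900, 7, 60])   # per-chunk access frequencies (toy)
    print(assign_precision(counts))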
【112】Multimodal Negative Learning
标题:多模态负学习
链接:https://arxiv.org/abs/2510.20877
备注:Published in NeurIPS 2025
摘要:多模态学习系统经常遇到模态不平衡的挑战,其中主导模态可能会掩盖其他模态,从而阻碍弱模态的学习。传统方法往往迫使弱模态在"学会成为(相同)"(正学习)中与主导模态对齐,这可能会抑制弱模态固有的独特信息。为应对这一挑战,我们提出了一种新的学习范式:"学会不成为"(负学习)。主导模态不是增强弱模态对目标类的预测,而是动态引导弱模态抑制非目标类。这稳定了决策空间并保留了模态特定信息,使弱模态在不被过度对齐的情况下保留独特信息。我们进而从鲁棒性的角度审视多模态学习,并从理论上推导出多模态负学习(MNL)框架,该框架引入了为负学习量身定制的动态引导机制。我们的方法通过增大单模态置信边际(UCoM),可证明地收紧了多模态学习的鲁棒性下界,并降低了弱模态的经验误差,尤其是在噪声和不平衡场景下。跨多个基准的广泛实验证明了我们的方法相对于竞争方法的有效性和泛化性。代码将在https://github.com/BaoquanGong/Multimodal-Negative-Learning.git上提供。
摘要:Multimodal learning systems often encounter challenges related to modality imbalance, where a dominant modality may overshadow others, thereby hindering the learning of weak modalities. Conventional approaches often force weak modalities to align with dominant ones in "Learning to be (the same)" (Positive Learning), which risks suppressing the unique information inherent in the weak modalities. To address this challenge, we offer a new learning paradigm: "Learning Not to be" (Negative Learning). Instead of enhancing weak modalities' target-class predictions, the dominant modalities dynamically guide the weak modality to suppress non-target classes. This stabilizes the decision space and preserves modality-specific information, allowing weak modalities to preserve unique information without being over-aligned. We proceed to reveal multimodal learning from a robustness perspective and theoretically derive the Multimodal Negative Learning (MNL) framework, which introduces a dynamic guidance mechanism tailored for negative learning. Our method provably tightens the robustness lower bound of multimodal learning by increasing the Unimodal Confidence Margin (UCoM) and reduces the empirical error of weak modalities, particularly under noisy and imbalanced scenarios. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generalizability of our approach against competing methods. The code will be available at https://github.com/BaoquanGong/Multimodal-Negative-Learning.git.
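One plausible instantiation of the "Learning Not to be" objective: the dominant branch's renormalized non-target distribution weights a penalty on the weak branch's non-target probabilities. This is a hedged sketch, not the paper's exact MNL loss.

    import torch
    import torch.nn.functional as F

    def negative_learning_loss(weak_logits, dominant_logits, target):
        # The dominant branch's renormalized non-target distribution weights a
        # penalty pushing the weak branch's non-target probabilities toward 0.
        p_weak = F.softmax(weak_logits, dim=1)
        with torch.no_grad():
            guide = F.softmax(dominant_logits, dim=1)
            guide.scatter_(1, target.unsqueeze(1), 0.0)     # drop the target class
            guide = guide / guide.sum(dim=1, keepdim=True)  # renormalize over non-targets
        return -(guide * torch.log(1.0 - p_weak + 1e-8)).sum(dim=1).mean()

    weak = torch.randn(8, 5, requires_grad=True)
    dominant = torch.randn(8, 5)
    y = torch.randint(0, 5, (8,))
    loss = negative_learning_loss(weak, dominant, y)
    loss.backward()
    print(float(loss))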
【113】CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia
标题:CC-GRMAS:亚洲高山时空滑坡风险评估的多智能体图神经系统
链接:https://arxiv.org/abs/2510.20875
摘要:滑坡是一种日益严重的气候诱发灾害,对环境和人类造成严重后果,特别是在亚洲高山地区。尽管越来越多地获得卫星和时间数据集,及时检测和灾害应对仍然不够发达和分散。这项工作介绍了CC-GRMAS,一个框架,利用一系列的卫星观测和环境信号,以提高滑坡预测的准确性。该系统围绕三个相互关联的代理预测,规划和执行,协同实现实时态势感知,响应规划和干预。通过纳入当地环境因素并实施多代理协调,这种方法为脆弱山区的气候适应性灾害准备提供了一个可扩展和积极主动的解决方案。
摘要:Landslides are a growing climate induced hazard with severe environmental and human consequences, particularly in high mountain Asia. Despite increasing access to satellite and temporal datasets, timely detection and disaster response remain underdeveloped and fragmented. This work introduces CC-GRMAS, a framework leveraging a series of satellite observations and environmental signals to enhance the accuracy of landslide forecasting. The system is structured around three interlinked agents Prediction, Planning, and Execution, which collaboratively enable real time situational awareness, response planning, and intervention. By incorporating local environmental factors and operationalizing multi agent coordination, this approach offers a scalable and proactive solution for climate resilient disaster preparedness across vulnerable mountainous terrains.
【114】Crisis-Resilient Portfolio Management via Graph-based Spatio-Temporal Learning
标题:通过基于图的时空学习进行具有危机弹性的投资组合管理
链接:https://arxiv.org/abs/2510.20868
摘要:金融时间序列预测面临一个根本性挑战:预测最优资产配置需要理解在危机期间会发生转变的、依赖于市场机制的相关结构。现有基于图的时空学习方法依赖预先设定的图拓扑(相关性阈值、行业分类),当市场动态在不同危机机制(信用传染、疫情冲击或通胀驱动的抛售)之间切换时无法自适应。我们提出CRISP(Crisis-Resilient Investment through Spatio-temporal Patterns),一个基于图的时空学习框架:它通过图卷积网络编码空间关系,通过带自注意力的BiLSTM编码时间动态,再通过多头图注意力网络学习稀疏结构。与固定拓扑方法不同,CRISP通过注意力机制发现哪些资产关系真正重要,将92.5%的连接作为噪声过滤掉,同时保留与危机相关的依赖关系,以进行准确的机制特定预测。CRISP在涵盖信贷危机和疫情危机的2005-2021年数据上训练,对2022-2024年通胀驱动的市场(一个根本不同的机制)表现出稳健的泛化能力,能够准确预测与机制相适应的相关结构。这使得自适应投资组合配置在市场低迷期仍能保持盈利,实现3.76的夏普比率:比等权重基线提高707%,比静态图方法提高94%。学习到的注意力权重提供了可解释的机制检测:危机期间防御性资产集群的注意力增强了49%,而全市场平均为31%,这种涌现行为源自学习预测而非强加假设。
摘要:Financial time series forecasting faces a fundamental challenge: predicting optimal asset allocations requires understanding regime-dependent correlation structures that transform during crisis periods. Existing graph-based spatio-temporal learning approaches rely on predetermined graph topologies--correlation thresholds, sector classifications--that fail to adapt when market dynamics shift across different crisis mechanisms: credit contagion, pandemic shocks, or inflation-driven selloffs. We present CRISP (Crisis-Resilient Investment through Spatio-temporal Patterns), a graph-based spatio-temporal learning framework that encodes spatial relationships via Graph Convolutional Networks and temporal dynamics via BiLSTM with self-attention, then learns sparse structures through multi-head Graph Attention Networks. Unlike fixed-topology methods, CRISP discovers which asset relationships matter through attention mechanisms, filtering 92.5% of connections as noise while preserving crisis-relevant dependencies for accurate regime-specific predictions. Trained on 2005--2021 data encompassing credit and pandemic crises, CRISP demonstrates robust generalization to 2022--2024 inflation-driven markets--a fundamentally different regime--by accurately forecasting regime-appropriate correlation structures. This enables adaptive portfolio allocation that maintains profitability during downturns, achieving a Sharpe ratio of 3.76, a 707% improvement over equal-weight baselines and a 94% improvement over static graph methods. Learned attention weights provide interpretable regime detection, with defensive cluster attention strengthening 49% during crises versus 31% market-wide--emergent behavior from learning to forecast rather than imposing assumptions.
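A hedged sketch of the attention-based edge filtering the abstract describes (keeping roughly the top 7.5% of connections). The dot-product scorer below stands in for CRISP's multi-head GAT:

import torch

def sparsify_asset_graph(embeddings, keep_ratio=0.075):
    # embeddings: (N, d) per-asset representations from the GCN/BiLSTM stack
    scores = embeddings @ embeddings.T                # (N, N) attention logits
    scores.fill_diagonal_(float("-inf"))              # no self-edges
    flat = scores.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    adjacency = (scores >= threshold).float()         # sparse crisis-relevant graph
    return adjacency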
【115】Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards
标题:通过推理过程奖励激励音频LLM一致、有效和可扩展的推理能力
链接:https://arxiv.org/abs/2510.20867
备注:49 pages
摘要:推理在音频大语言模型中的作用在很大程度上仍未得到探索:引入推理过程往往会在推理时降低而非提高性能,我们称这种现象为测试时逆缩放,即更长的推理链产生越来越差的结果。我们证明,这并非源于推理本身的根本局限,而是源于训练不足:缺乏对推理过程适当引导的模型会产生幻觉式、不一致的推理,并在较长的链中不断累积错误。为应对这些挑战,我们引入CESAR(Consistent, Effective, and Scalable Audio Reasoners,一致、有效且可扩展的音频推理器),从结果验证转向奖励推理过程。我们的在线强化学习框架采用组相对策略优化(GRPO),配合多方面的奖励套件,不仅激励正确性和格式,还激励一致性、结构化分析模式、因果推理、领域知识整合以及经过校准的推理深度。CESAR解决了测试时逆缩放问题,将推理从拖累转化为增益,同时揭示了特定于模型的"推理甜蜜点",即测试时扩展过程中性能达到峰值的位置。我们在MMAU Test-mini上取得了最先进的结果,大幅超越Gemini 2.5 Pro和GPT-4o Audio,并在MMSU推理任务上接近人类水平。通过AI评委评估和定性比较,我们对推理质量的改进给出了定量和定性两方面的验证。重要的是,增强的推理会产生协同效应,同时提升多模态推理和感知能力。总体而言,CESAR为在音频大语言模型中开发稳健且可扩展的推理能力建立了一种有原则的方法。
摘要:The role of reasoning in Audio Large Language Models remains widely underexplored, as introducing a reasoning process often degrades rather than improves performance during inference, a phenomenon we term test-time inverse scaling, where longer reasoning chains yield progressively worse results. We demonstrate that this stems not from fundamental limitations of reasoning itself, but from inadequate training: models without proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and format but also consistency, structured analytical patterns, causal reasoning, domain-knowledge integration, and calibrated reasoning depth. CESAR resolves test-time inverse scaling, transforming reasoning from a detriment into a gain while revealing model-specific "reasoning sweet spots", where performance peaks during test-time scaling. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks. Through AI-as-judge evaluations and qualitative comparisons, we provide both quantitative and qualitative validation of our improved reasoning quality. Importantly, enhanced reasoning creates synergistic effects, simultaneously improving multimodal reasoning and perception capabilities. Overall, CESAR establishes a principled method for developing robust and scalable reasoning in Audio LLMs.
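A compact sketch of GRPO-style group-relative advantages under a multi-faceted reward, mirroring the recipe in the abstract. The facet names and weights are placeholders, not CESAR's actual reward suite:

import numpy as np

def composite_reward(sample):
    # `sample` is assumed to expose per-facet scores in [0, 1].
    w = dict(correct=1.0, fmt=0.2, consistent=0.5, depth=0.3)
    return sum(w[k] * sample[k] for k in w)

def grpo_advantages(group):
    # One prompt, G sampled responses: the advantage is the z-scored reward
    # within the group, so no learned value function is needed.
    r = np.array([composite_reward(s) for s in group])
    return (r - r.mean()) / (r.std() + 1e-8)

group = [dict(correct=1, fmt=1, consistent=0.8, depth=0.6),
         dict(correct=0, fmt=1, consistent=0.4, depth=0.9)]
print(grpo_advantages(group))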
【116】Fuzzy numbers revisited: operations on extensional fuzzy numbers
标题:重温模糊数:扩展模糊数的运算
链接:https://arxiv.org/abs/2510.20861
备注:33 pages, 62 references
摘要:模糊数通常用模糊集表示,其目的是更好地表示不精确数据。然而,模糊数的运算并不像清晰数的数学运算那样直接。通常需要应用Zadeh扩张规则来得出结果,这可能带来两个问题:(1)计算复杂度高;(2)对于某些模糊集和某些运算,结果不再是具有相同特征的模糊集(例如,两个三角模糊集相乘不会得到三角模糊集)。另一个问题是模糊扩散:结果的模糊性随运算次数增加而增大。这些事实严重限制了模糊数的应用范围。在本文中,我们用另一类模糊数(扩展模糊数)重新审视这一问题。本文定义了扩展模糊数的运算及其关系运算符(=、>、>=、<、<=),并用若干应用实例说明所提出的方法。C++实现可从公共GitHub存储库获得。
摘要:Fuzzy numbers are commonly represented with fuzzy sets. Their objective is to better represent imprecise data. However, operations on fuzzy numbers are not as straightforward as maths on crisp numbers. Commonly, Zadeh's extension rule is applied to elaborate a result. This can produce two problems: (1) high computational complexity and (2) for some fuzzy sets and some operations the result is not a fuzzy set with the same features (e.g., multiplication of two triangular fuzzy sets does not produce a triangular fuzzy set). One more problem is the fuzzy spread -- fuzziness of the result increases with the number of operations. These facts can severely limit the application field of fuzzy numbers. In this paper we revisit this problem with a different kind of fuzzy numbers -- extensional fuzzy numbers. The paper defines operations on extensional fuzzy numbers and relational operators (=, >, >=, <, <=) for them. The proposed approach is illustrated with several application examples. The C++ implementation is available from a public GitHub repository.
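The paper ships a C++ implementation; the following Python sketch is only our reading of the extensional-fuzzy-number idea (a crisp core plus a similarity spread, with operations closed in the same family and comparison on cores), not the authors' API:

from dataclasses import dataclass

@dataclass
class ExtensionalFuzzyNumber:
    core: float     # crisp representative value
    spread: float   # width of the similarity relation around the core

    def membership(self, x):
        # Triangular extensional hull: mu(x) = max(0, 1 - |x - core| / spread)
        return max(0.0, 1.0 - abs(x - self.core) / self.spread)

    def __add__(self, other):
        return ExtensionalFuzzyNumber(self.core + other.core,
                                      max(self.spread, other.spread))

    def __mul__(self, other):
        # Stays triangular by construction, and the spread does not grow
        # with each operation, avoiding the fuzzy-spread problem.
        return ExtensionalFuzzyNumber(self.core * other.core,
                                      max(self.spread, other.spread))

    def __lt__(self, other):
        return self.core < other.core

a = ExtensionalFuzzyNumber(2.0, 0.5)
b = ExtensionalFuzzyNumber(3.0, 0.4)
print((a * b).core, a < b)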
【117】Cultural Alien Sampler: Open-ended art generation balancing originality and coherence
标题:文化异类采样器(Cultural Alien Sampler):平衡原创性与连贯性的开放式艺术生成
链接:https://arxiv.org/abs/2510.20849
备注:Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Creative AI Track. 26 pages, 24 figures
摘要:在艺术等开放式领域,自主智能体必须产生既原创又内部连贯的想法,但当前的大型语言模型(LLM)要么默认落入熟悉的文化模式,要么在追求新颖时牺牲连贯性。我们通过引入文化异类采样器(CAS)来解决这一问题,这是一种将组合契合度与文化典型性明确分离的概念选择方法。CAS使用两个在WikiArt概念上微调的GPT-2模型:概念一致性模型评估若干概念是否可能在同一艺术品中共同出现,文化背景模型评估这些组合在单个艺术家作品集中的典型程度。CAS瞄准一致性高而典型性低的组合,产生既保持内部一致、又偏离已学习惯例和内嵌文化背景的想法。在人类评估(N = 100)中,我们的方法优于随机选择和GPT-4o基线,并在感知原创性与和谐性两方面达到与人类艺术学生相当的水平。此外,定量研究表明,与GPT-4o对应方法相比,我们的方法产生的输出更多样,探索的概念空间更广,表明人工的文化异类性可以释放自主智能体的创造潜力。
摘要:In open-ended domains like art, autonomous agents must generate ideas that are both original and internally coherent, yet current Large Language Models (LLMs) either default to familiar cultural patterns or sacrifice coherence when pushed toward novelty. We address this by introducing the Cultural Alien Sampler (CAS), a concept-selection method that explicitly separates compositional fit from cultural typicality. CAS uses two GPT-2 models fine-tuned on WikiArt concepts: a Concept Coherence Model that scores whether concepts plausibly co-occur within artworks, and a Cultural Context Model that estimates how typical those combinations are within individual artists' bodies of work. CAS targets combinations that are high in coherence and low in typicality, yielding ideas that maintain internal consistency while deviating from learned conventions and embedded cultural context. In a human evaluation (N = 100), our approach outperforms random selection and GPT-4o baselines and achieves performance comparable to human art students in both perceived originality and harmony. Additionally, a quantitative study shows that our method produces more diverse outputs and explores a broader conceptual space than its GPT-4o counterpart, demonstrating that artificial cultural alienness can unlock creative potential in autonomous agents.
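A minimal sketch of the CAS selection rule: score candidate concept sets with two models and keep combinations that are coherent yet culturally atypical. The `coherence` and `typicality` callables stand in for the two fine-tuned GPT-2 scorers:

def cultural_alien_sample(candidates, coherence, typicality,
                          min_coherence=0.7, n_keep=5):
    # candidates: iterable of concept sets; scorers return values in [0, 1]
    scored = []
    for concepts in candidates:
        c, t = coherence(concepts), typicality(concepts)
        if c >= min_coherence:
            scored.append((t, concepts))      # lower typicality = more "alien"
    scored.sort(key=lambda pair: pair[0])
    return [concepts for _, concepts in scored[:n_keep]]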
【118】Sketch2BIM: A Multi-Agent Human-AI Collaborative Pipeline to Convert Hand-Drawn Floor Plans to 3D BIM
标题:Sketch2BIM:一个多智能体人机协作管道,将手绘平面图转换为3D BIM
链接:https://arxiv.org/abs/2510.20838
摘要:本研究引入了一条人在回路的管道,将未缩放的手绘平面图草图转换为语义一致的3D BIM模型。该工作流程在多智能体框架内利用多模态大型语言模型(MLLM),结合感知提取、人工反馈、模式验证和自动化BIM脚本。首先,草图被迭代细化为由墙、门、窗构成的结构化JSON布局;随后,这些布局被转换为生成3D BIM模型的可执行脚本。在10个不同楼层平面图上的实验表明了很强的收敛性:首轮即可高可靠地捕获开口(门、窗),而墙体检测从约83%起步,经过几次反馈迭代后实现近乎完美的对齐。在所有类别中,精确率、召回率和F1分数保持在0.83以上,几何误差(RMSE、MAE)通过反馈校正逐步降至零。这项研究展示了MLLM驱动的多智能体推理如何让专家和非专家仅凭手绘草图即可完成BIM创建。
摘要:This study introduces a human-in-the-loop pipeline that converts unscaled, hand-drawn floor plan sketches into semantically consistent 3D BIM models. The workflow leverages multimodal large language models (MLLMs) within a multi-agent framework, combining perceptual extraction, human feedback, schema validation, and automated BIM scripting. Initially, sketches are iteratively refined into a structured JSON layout of walls, doors, and windows. Later, these layouts are transformed into executable scripts that generate 3D BIM models. Experiments on ten diverse floor plans demonstrate strong convergence: openings (doors, windows) are captured with high reliability in the initial pass, while wall detection begins around 83% and achieves near-perfect alignment after a few feedback iterations. Across all categories, precision, recall, and F1 scores remain above 0.83, and geometric errors (RMSE, MAE) progressively decrease to zero through feedback corrections. This study demonstrates how MLLM-driven multi-agent reasoning can make BIM creation accessible to both experts and non-experts using only freehand sketches.
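An illustrative example of the kind of structured JSON layout the pipeline iterates on, with a trivial consistency check of the sort a schema-validation step might run. The field names are our guesses, not the paper's exact schema:

import json

layout = {
    "walls": [{"id": "w1", "start": [0, 0], "end": [5, 0]},
              {"id": "w2", "start": [5, 0], "end": [5, 4]}],
    "doors": [{"wall": "w1", "offset": 1.0, "width": 0.9}],
    "windows": [{"wall": "w2", "offset": 1.5, "width": 1.2}],
}

def validate(layout):
    # Every opening must reference an existing wall before BIM scripting.
    wall_ids = {w["id"] for w in layout["walls"]}
    for opening in layout["doors"] + layout["windows"]:
        assert opening["wall"] in wall_ids, f"dangling wall ref: {opening}"

validate(layout)
print(json.dumps(layout, indent=2))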
【119】Compressing Quaternion Convolutional Neural Networks for Audio Classification
标题:面向音频分类的四元数卷积神经网络压缩
链接:https://arxiv.org/abs/2510.21388
备注:Under review in IEEE TASLPRO
摘要:传统的实域卷积神经网络(CNN)已被广泛用于音频分类。然而,它们的卷积运算独立地处理多通道输入,限制了捕获通道之间相关性的能力。这可能导致次优的特征学习,特别是对于复杂的音频模式,如多声道声谱图表示。四元数卷积神经网络(QCNN)通过采用四元数代数来联合捕获声道间依赖性来解决这一限制,从而实现具有更少可学习参数的更紧凑模型,同时更好地利用音频信号的多维特性。然而,由于四元数运算的开销,QCNN表现出更高的计算复杂度,与传统CNN相比,导致推理延迟增加和效率降低,这对资源受限平台上的部署提出了挑战。为了应对这一挑战,本研究探索了知识蒸馏(KD)和修剪,以降低QCNN的计算复杂度,同时保持性能。我们的音频分类实验表明,与KD相比,修剪QCNN可以实现类似或更优的性能,同时需要更少的计算量。与传统的CNN和基于Transformer的架构相比,修剪的QCNN实现了具有竞争力的性能,减少了可学习的参数数量和计算复杂度。在AudioSet数据集上,修剪后的QCNN将计算成本降低了50%,参数计数减少了80%,同时保持了与传统CNN相当的性能。此外,修剪后的QCNN在多个音频分类基准中具有良好的泛化能力,包括用于音乐流派识别的GTZAN,用于环境声音分类的ESC-50和用于语音情感识别的RAVDESS。
摘要:Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to suboptimal feature learning, particularly for complex audio patterns such as multi-channel spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs) address this limitation by employing quaternion algebra to jointly capture inter-channel dependencies, enabling more compact models with fewer learnable parameters while better exploiting the multi-dimensional nature of audio signals. However, QCNNs exhibit higher computational complexity due to the overhead of quaternion operations, resulting in increased inference latency and reduced efficiency compared to conventional CNNs, posing challenges for deployment on resource-constrained platforms. To address this challenge, this study explores knowledge distillation (KD) and pruning to reduce the computational complexity of QCNNs while maintaining performance. Our experiments on audio classification reveal that pruning QCNNs achieves similar or superior performance compared to KD while requiring less computational effort. Compared to conventional CNNs and Transformer-based architectures, pruned QCNNs achieve competitive performance with a reduced learnable parameter count and computational complexity. On the AudioSet dataset, pruned QCNNs reduce computational cost by 50% and parameter count by 80%, while maintaining performance comparable to the conventional CNNs. Furthermore, pruned QCNNs generalize well across multiple audio classification benchmarks, including GTZAN for music genre recognition, ESC-50 for environmental sound classification, and RAVDESS for speech emotion recognition.
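One plausible reading of "pruning QCNNs" is structured magnitude pruning in which the four components of each quaternion weight are kept or removed together; the sketch below works under that assumption and is not the paper's exact procedure:

import numpy as np

def prune_quaternion_weights(w, sparsity=0.8):
    # w: (..., 4) array, last axis = quaternion components (r, i, j, k)
    norms = np.linalg.norm(w, axis=-1)               # per-quaternion magnitude
    threshold = np.quantile(norms, sparsity)
    mask = (norms >= threshold)[..., None]           # drop whole quaternions
    return w * mask

w = np.random.randn(16, 8, 3, 3, 4)                  # toy quaternion conv kernel
print(float((prune_quaternion_weights(w) == 0).mean()))  # ~0.8 of entries zeroed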
【120】Patient-specific AI for generation of 3D dosimetry imaging from two 2D-planar measurements
标题:患者特定AI,用于根据两个2D平面测量生成3D剂量测定成像
链接:https://arxiv.org/abs/2510.21362
备注:Accepted at IEEE NSS/MIC 2025
摘要:在这项工作中,我们探索了使用患者特异性强化学习,从两张2D平面图像(前位和后位)生成3D活动图。该问题用常规方法仍然无法解决,并且对核医学剂量测定尤其重要:对于177Lu-PSMA等放射性药物的治疗后分布,目前通常要么通过昂贵且耗时的3D SPECT采集完成,要么通过快速但仅有2D的平面闪烁显像完成。能够从平面闪烁显像生成3D活动图,为新的剂量测定应用打开了大门,消除了对SPECT的需求,并有助于多时间点剂量测定研究。我们的方案包括:先生成患者特异性数据集,其中包含放射性药物在个体解剖结构内可能的3D摄取图;再用AI方法(我们探索了3DUnet和扩散模型)从2D平面图像生成3D活动图。我们在仿真和真实平面采集上都验证了该方法。使用患者特异性强化学习可以获得更好的结果(MAE降低约20%,SSIM提高约5%),器官勾画和患者解剖结构也更好,尤其是将扩散模型与患者特异性训练相结合时:与仿真真值相比SSIM=0.89,与平面采集半小时后进行的SPECT相比SSIM=0.73。我们相信,该方法可以改变核医学剂量测定的范式,利用患者治疗前的信息,仅使用平面闪烁显像即可实现3D定量,而无需昂贵且耗时的SPECT。
摘要:In this work we explored the use of patient-specific reinforcement learning to generate 3D activity maps from two 2D planar images (anterior and posterior). The solution of this problem remains unachievable using conventional methodologies and is of particular interest for dosimetry in nuclear medicine, where approaches for post-therapy distribution of radiopharmaceuticals such as 177Lu-PSMA are typically done via either expensive and long 3D SPECT acquisitions or fast, yet only 2D, planar scintigraphy. Being able to generate 3D activity maps from planar scintigraphy opens the door to new dosimetry applications, removing the need for SPECT and facilitating multi-time-point dosimetry studies. Our solution comprises the generation of a patient-specific dataset with possible 3D uptake maps of the radiopharmaceuticals within the anatomy of the individual, followed by an AI approach (we explored both the use of 3DUnet and diffusion models) able to generate 3D activity maps from 2D planar images. We have validated our method both in simulation and real planar acquisitions. We observed enhanced results using patient-specific reinforcement learning (~20% reduction in MAE and ~5% increase in SSIM) and better organ delineation and patient anatomy, especially when combining diffusion models with patient-specific training, yielding an SSIM of 0.89 compared to the ground truth for simulations and 0.73 when compared to a SPECT acquisition performed half an hour after the planar scan. We believe that our methodology can change the paradigm for nuclear medicine dosimetry, allowing for 3D quantification using only planar scintigraphy, without the need for expensive and time-consuming SPECT, by leveraging the pre-therapy information of the patients.
【121】WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation
标题:WhaleVAD-BPN:利用边界提议网络和后处理优化改进须鲸叫声检测
链接:https://arxiv.org/abs/2510.21280
摘要:虽然最近的声音事件检测(SED)系统能够在海洋音频中识别须鲸叫声,但误报和少数类检测相关的挑战依然存在。我们提出边界提议网络(BPN),它扩展了一个现有的轻量级SED系统。BPN受图像目标检测工作的启发,旨在减少误报检测的数量。它利用主干分类模型内部计算的中间潜在表示对最终输出进行门控。将BPN加入现有SED系统后,精确率绝对提升了16.8%,少数类d-call和bp-call的F1分数分别提高了21.3%和9.4%。我们进一步考虑两种后处理超参数选择方法:前向搜索和后向搜索。通过分别优化事件级和帧级超参数,这两种方法相比经验选择的参数带来了可观的性能提升。完整的WhaleVAD-BPN系统取得了0.475的交叉验证开发集F1分数,比基线绝对提升9.8%。
摘要:While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to gate the final output. When added to an existing SED system, the BPN achieves a 16.8% absolute increase in precision, as well as 21.3% and 9.4% improvements in the F1-score for minority-class d-calls and bp-calls, respectively. We further consider two approaches to the selection of post-processing hyperparameters: a forward-search and a backward-search. By separately optimising event-level and frame-level hyperparameters, these two approaches lead to considerable performance improvements over parameters selected using empirical methods. The complete WhaleVAD-BPN system achieves a cross-validated development F1-score of 0.475, which is a 9.8% absolute improvement over the baseline.
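A hedged sketch of the BPN gating step as the abstract describes it: an auxiliary head over intermediate backbone features produces a sigmoid gate that multiplies the frame-level event scores. Layer sizes are arbitrary assumptions:

import torch
import torch.nn as nn

class BoundaryProposalGate(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, event_scores, latent):
        # event_scores: (B, T, C) frame-level class scores from the SED head
        # latent: (B, T, feat_dim) intermediate backbone features
        return event_scores * self.gate(latent)      # suppress false positives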
【122】Hierarchical AI Multi-Agent Fundamental Investing: Evidence from China's A-Share Market
标题:分层人工智能多智能体基本面投资:来自中国A股市场的证据
链接:https://arxiv.org/abs/2510.21147
摘要:我们提出一个多智能体、AI驱动的基本面投资框架,整合宏观指标、行业层面和公司特定信息,以构建优化的股票投资组合。该架构包括:(i)一个宏观智能体,根据不断变化的经济指标和行业表现动态筛选行业并赋予权重;(ii)四个公司层面的智能体(基本面、技术面、报告和新闻),对个股进行深入分析,确保覆盖的广度和深度;(iii)一个投资组合智能体,使用强化学习将各智能体的输出组合成统一策略,生成交易策略;以及(iv)一个风险控制智能体,根据市场波动调整投资组合头寸。我们在中国A股市场沪深300指数成分股上评估该系统,发现其在风险调整收益和回撤控制方面始终优于标准基准和最先进的多智能体交易系统。我们的核心贡献是一种分层多智能体设计,将自上而下的宏观筛选与自下而上的基本面分析相连接,为基于因子的投资组合构建提供了一种稳健且可扩展的方法。
摘要:We present a multi-agent, AI-driven framework for fundamental investing that integrates macro indicators, industry-level and firm-specific information to construct optimized equity portfolios. The architecture comprises: (i) a Macro agent that dynamically screens and weights sectors based on evolving economic indicators and industry performance; (ii) four firm-level agents -- Fundamental, Technical, Report, and News -- that conduct in-depth analyses of individual firms to ensure both breadth and depth of coverage; (iii) a Portfolio agent that uses reinforcement learning to combine the agent outputs into a unified policy to generate the trading strategy; and (iv) a Risk Control agent that adjusts portfolio positions in response to market volatility. We evaluate the system on the constituents by the CSI 300 Index of China's A-share market and find that it consistently outperforms standard benchmarks and a state-of-the-art multi-agent trading system on risk-adjusted returns and drawdown control. Our core contribution is a hierarchical multi-agent design that links top-down macro screening with bottom-up fundamental analysis, offering a robust and extensible approach to factor-based portfolio construction.
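A toy sketch of the hierarchy's final stage: turning per-firm agent scores into portfolio weights under a risk cap. The softmax stands in for the learned RL policy and the volatility cap for the Risk Control agent; both are placeholders, not the paper's method:

import numpy as np

def allocate(agent_scores, temperature=1.0, vol=0.15, vol_cap=0.25):
    # agent_scores: (n_assets, n_agents) outputs of the four firm-level agents
    combined = agent_scores.mean(axis=1) / temperature
    w = np.exp(combined - combined.max())
    w /= w.sum()
    if vol > vol_cap:                 # risk agent de-leverages into cash
        w *= vol_cap / vol
    return w

scores = np.random.rand(5, 4)
print(allocate(scores).round(3))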
【123】Integrated representational signatures strengthen specificity in brains and models
标题:融合的表征签名增强大脑与模型中的特异性
链接:https://arxiv.org/abs/2510.20847
摘要:不同的神经网络(生物的或人工的,统称模型)在多大程度上依赖等价的表征来支持相似任务,仍是神经科学和机器学习的核心问题。以往工作通常使用单一的表征相似性度量来比较系统,但每种度量只能刻画表征结构的一个侧面。为此,我们利用一套表征相似性度量,每种度量刻画表征对应关系的一个不同侧面(如几何结构、单元级调谐或线性可解码性),并用多种互补度量评估脑区或模型的可分性。保留几何或调谐结构的度量(如RSA、软匹配)能更强地区分脑区,而线性预测等更灵活的映射分离能力较弱。这些发现表明,几何与调谐编码了脑区或模型家族特有的签名,而线性可解码的信息往往在脑区或模型之间更广泛地共享。为整合这些互补的表征侧面,我们改造了相似性网络融合(SNF),一个最初为多组学数据集成开发的框架。与任何单一度量相比,SNF产生了明显更清晰的脑区和模型家族层面的分离,并生成稳健的复合相似性画像。此外,使用SNF得到的相似性分数对皮层区域进行聚类,揭示了更清晰的层级组织,与视觉皮层既定的解剖和功能层级紧密吻合,超越了单个度量所能达到的对应程度。
摘要:The extent to which different neural or artificial neural networks (models) rely on equivalent representations to support similar tasks remains a central question in neuroscience and machine learning. Prior work has typically compared systems using a single representational similarity metric, yet each captures only one facet of representational structure. To address this, we leverage a suite of representational similarity metrics, each capturing a distinct facet of representational correspondence (such as geometry, unit-level tuning, or linear decodability), and assess brain region or model separability using multiple complementary measures. Metrics that preserve geometric or tuning structure (e.g., RSA, Soft Matching) yield stronger region-based discrimination, whereas more flexible mappings such as Linear Predictivity show weaker separation. These findings suggest that geometry and tuning encode brain-region- or model-family-specific signatures, while linearly decodable information tends to be more globally shared across regions or models. To integrate these complementary representational facets, we adapt Similarity Network Fusion (SNF), a framework originally developed for multi-omics data integration. SNF produces substantially sharper regional and model family-level separation than any single metric and yields robust composite similarity profiles. Moreover, clustering cortical regions using SNF-derived similarity scores reveals a clearer hierarchical organization that aligns closely with established anatomical and functional hierarchies of the visual cortex, surpassing the correspondence achieved by individual metrics.
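A simplified sketch of Similarity Network Fusion: each metric yields an affinity matrix, and cross-diffusion lets the matrices reinforce shared structure before averaging. This compact version (dense kernels, few iterations) follows the spirit of SNF rather than the reference implementation:

import numpy as np

def snf(affinities, iterations=10):
    # affinities: list of (N, N) symmetric similarity matrices, one per metric
    P = [a / a.sum(axis=1, keepdims=True) for a in affinities]
    for _ in range(iterations):
        # Each view diffuses through the average of the other views.
        P = [p @ (sum(q for q in P if q is not p) / (len(P) - 1)) @ p.T
             for p in P]
        P = [p / p.sum(axis=1, keepdims=True) for p in P]
    return sum(P) / len(P)            # fused composite similarity profile

mats = [np.abs(np.random.rand(6, 6)) for _ in range(3)]
mats = [(m + m.T) / 2 for m in mats]
print(snf(mats).shape)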
【124】This EEG Looks Like These EEGs: Interpretable Interictal Epileptiform Discharge Detection With ProtoEEG-kNN
标题:这份脑电图看起来像这些脑电图:使用ProtoEEG-kNN的可解释发作间期癫痫样放电检测
链接:https://arxiv.org/abs/2510.20846
备注:MICCAI 2025
摘要:脑电图(EEG)记录中出现发作间期癫痫样放电(IED)是癫痫的重要生物标志物。即使是训练有素的神经科医生也觉得检测IED很困难,因此许多从业者转向机器学习寻求帮助。虽然现有机器学习算法在这项任务上可以达到很高的准确率,但大多数模型不可解释,无法为其结论提供依据。由于无法理解模型的推理,医生就不能利用自己的专业知识识别错误的模型预测并相应干预。为了改善人与模型的交互,我们引入ProtoEEG-kNN,这是一种遵循简单基于案例推理过程的内在可解释模型。ProtoEEG-kNN通过将一段EEG与训练集中相似的EEG进行比较来推理,并从IED形态(形状)和空间分布(位置)两方面直观展示其推理过程。我们表明,ProtoEEG-kNN在IED检测中能够达到最先进的准确率,同时提供专家相比现有方法更偏好的解释。
摘要:The presence of interictal epileptiform discharges (IEDs) in electroencephalogram (EEG) recordings is a critical biomarker of epilepsy. Even trained neurologists find detecting IEDs difficult, leading many practitioners to turn to machine learning for help. While existing machine learning algorithms can achieve strong accuracy on this task, most models are uninterpretable and cannot justify their conclusions. Absent the ability to understand model reasoning, doctors cannot leverage their expertise to identify incorrect model predictions and intervene accordingly. To improve the human-model interaction, we introduce ProtoEEG-kNN, an inherently interpretable model that follows a simple case-based reasoning process. ProtoEEG-kNN reasons by comparing an EEG to similar EEGs from the training set and visually demonstrates its reasoning both in terms of IED morphology (shape) and spatial distribution (location). We show that ProtoEEG-kNN can achieve state-of-the-art accuracy in IED detection while providing explanations that experts prefer over existing approaches.
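A minimal sketch of the case-based reasoning loop in the ProtoEEG-kNN spirit: embed an EEG, retrieve the k most similar training EEGs, vote, and return the neighbours as visual evidence. The embedding function is assumed given:

import numpy as np
from collections import Counter

def knn_explain(query_emb, train_embs, train_labels, k=5):
    # query_emb: (d,); train_embs: (N, d); train_labels: (N,)
    dists = np.linalg.norm(train_embs - query_emb, axis=1)
    idx = np.argsort(dists)[:k]
    label = Counter(train_labels[i] for i in idx).most_common(1)[0][0]
    return label, idx        # idx points at the "evidence" EEGs to display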
【125】Consciousness, natural and artificial: an evolutionary advantage for reasoning on reactive substrates
标题:意识,自然的与人工的:在反应性基底上进行推理的进化优势
链接:https://arxiv.org/abs/2510.20839
摘要:精确定义意识并识别产生意识的机制是一个长期存在的问题,随着人工智能的进步尤显重要。科学界分为物理主义与自然二元论两派:物理主义假设意识是一个可以通过计算建模的物理过程;自然二元论则拒绝这一假设。找到这样的计算模型一直很困难,尤其因为意识常与人类表现出的其他认知能力(如智能和生理感觉)混为一谈。在这里,我们展示了这样一个计算模型,它精确地刻画了自然或人工的意识,识别出产生意识的结构和功能机制,证实了物理主义假设。我们发现,当纳入底层(生物或数字)基底并考虑基底子系统中的反应性行为(例如自主生理反应)时,可以得到这样的模型。结果表明,与所有其他计算过程不同,意识并不独立于其基底,而拥有意识是智能实体的一种进化优势。我们的结果表明,实现完全的人工意识没有障碍;但令人惊讶的是,也有可能实现任意水平而完全没有意识的人工智能,并且给人工系统注入意识并不带来任何好处。
摘要:Precisely defining consciousness and identifying the mechanisms that effect it is a long-standing question, particularly relevant with advances in artificial intelligence. The scientific community is divided between physicalism and natural dualism. Physicalism posits consciousness is a physical process that can be modeled computationally; natural dualism rejects this hypothesis. Finding a computational model has proven elusive, particularly because of conflation of consciousness with other cognitive capabilities exhibited by humans, such as intelligence and physiological sensations. Here we show such a computational model that precisely models consciousness, natural or artificial, identifying the structural and functional mechanisms that effect it, confirming the physicalism hypothesis. We found such a model is obtainable when including the underlying (biological or digital) substrate and accounting for reactive behavior in substrate sub-systems (e.g., autonomous physiological responses). Results show that, unlike all other computational processes, consciousness is not independent of its substrate and possessing it is an evolutionary advantage for intelligent entities. Our result shows there is no impediment to the realization of fully artificial consciousness but, surprisingly, that it is also possible to realize artificial intelligence of arbitrary level without consciousness whatsoever, and that there is no advantage in imbuing artificial systems with consciousness.
【126】Image and Point-cloud Classification for Jet Analysis in High-Energy Physics: A survey
标题:高能物理喷流分析的图像和点云分类:概览
链接:https://arxiv.org/abs/2403.11934
备注:Accepted paper in Frontier of Physics
摘要:如今,在高能物理(HEP)领域,无论是实验研究还是唯象研究,都有越来越多的趋势引入机器学习(ML)及其专门分支深度学习(DL)。这篇综述用不同的ML和DL方法对这些应用进行了全面阐述。论文第一部分考察了各类粒子物理对象的基础知识,并为评估粒子物理问题及可用的学习模型建立了指导原则。接下来,文章对高能碰撞(主要是确定束流能量下的质子-质子碰撞)中重建的喷注的表示方法给出了详细分类,涵盖各种数据集、预处理技术以及特征提取与选择方法。所介绍的技术可应用于未来的强子-强子对撞机(HHC),如高亮度LHC(HL-LHC)和未来环形对撞机强子-强子方案(FCChh)。作者随后探讨了几种专门针对HEP中图像和点云(PC)数据设计的AI分析技术,并更仔细地研究了强子碰撞中与喷注标记相关的分类问题。本综述考察了ML和DL中的各种最先进(SOTA)技术,重点关注它们对HEP需求的意义;更确切地说,文中详细讨论了喷注标记、喷注跟踪、粒子分类等多种应用。综述最后分析了HEP应用DL方法的现状,强调了挑战与未来研究的潜在方向,并针对每个应用加以说明。
摘要:Nowadays, there has been a growing trend in the field of high-energy physics (HEP), in both its experimental and phenomenological studies, to incorporate machine learning (ML) and its specialized branch, deep learning (DL). This review paper provides a thorough illustration of these applications using different ML and DL approaches. The first part of the paper examines the basics of various particle physics types and establishes guidelines for assessing particle physics alongside the available learning models. Next, a detailed classification is provided for representing Jets that are reconstructed in high-energy collisions, mainly in proton-proton collisions at well-defined beam energies. This section covers various datasets, preprocessing techniques, and feature extraction and selection methods. The presented techniques can be applied to future hadron-hadron colliders (HHC), such as the high-luminosity LHC (HL-LHC) and the Future Circular Collider hadron-hadron (FCChh). The authors then explore several AI techniques designed specifically for analyzing both image and point-cloud (PC) data in HEP. Additionally, a closer look is taken at the classification associated with Jet tagging in hadron collisions. In this review, various state-of-the-art (SOTA) techniques in ML and DL are examined, with a focus on their implications for HEP demands. More precisely, this discussion addresses various applications in extensive detail, such as Jet tagging, Jet tracking, particle classification, and more. The review concludes with an analysis of the current state of HEP using DL methodologies. It highlights the challenges and potential areas for future research, which are illustrated for each application.
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递

