
NLP Academic Daily [10.30]

2025-10-30
Overview: cs.CL, 86 papers today

Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!




Large Models (39 papers)

【1】Gaperon: A Peppered English-French Generative Language Model Suite
Link: https://arxiv.org/abs/2510.25771

Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.


【2】Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Link: https://arxiv.org/abs/2510.25766

Authors: Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Note: Post-hoc attribution
Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.


【3】DiagramEval: Evaluating LLM-Generated Diagrams via Graphs
Link: https://arxiv.org/abs/2510.25761

Authors: Chumeng Liang, Jiaxuan You
Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
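The graph view described in this abstract can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes diagrams are given as sets of text nodes and directed edges, scores node alignment as an F1 over case-insensitively matched node texts, and scores path alignment by comparing transitive reachability; the matching rule and the exact metric definitions are simplifying assumptions.

```python
from itertools import product

def node_alignment(gen_nodes, ref_nodes, match=lambda a, b: a.lower() == b.lower()):
    """F1 over matched text nodes between a generated and a reference diagram."""
    matched = sum(any(match(g, r) for r in ref_nodes) for g in gen_nodes)
    precision = matched / len(gen_nodes) if gen_nodes else 0.0
    recall = matched / len(ref_nodes) if ref_nodes else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def paths(edges):
    """All node pairs (u, v) connected by a directed path (transitive closure)."""
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(reach), list(reach)):
            if b == c and (a, d) not in reach:
                reach.add((a, d))
                changed = True
    return reach

def path_alignment(gen_edges, ref_edges):
    """Fraction of reference paths recovered by the generated diagram."""
    gp, rp = paths(gen_edges), paths(ref_edges)
    return len(gp & rp) / len(rp) if rp else 0.0
```

For example, a generated diagram missing one reference node scores below 1.0 on node alignment, and a diagram missing an edge loses all paths routed through it.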


【4】Scaling Latent Reasoning via Looped Language Models
Link: https://arxiv.org/abs/2510.25741

Authors: Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
Abstract: Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that matches the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model can be found at: http://ouro-llm.github.io.


【5】The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Link: https://arxiv.org/abs/2510.25732

Authors: Aakriti Shah, Thai Le
Note: 14 pages, 11 figures
Abstract: Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce the Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.


【6】Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?
Link: https://arxiv.org/abs/2510.25701

Authors: Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel
Note: 8 pages, 6 figures, 3 tables, CIKM 2025 FinFAI workshop
Abstract: Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.


【7】PairUni: Pairwise Training for Unified Multimodal Language Models
Link: https://arxiv.org/abs/2510.25682

Authors: Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: https://github.com/Haochen-Wang409/PairUni
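The advantage modulation described here can be sketched in a few lines. This is a hypothetical reading of the mechanism, not the authors' code: advantages are group-normalized as in standard GRPO, then each one is scaled by the pair's similarity score (assumed to lie in [0, 1]), so well-aligned pairs contribute more to the policy gradient.

```python
def grpo_advantages(rewards):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def pair_aware_advantages(rewards, pair_similarity):
    """Scale each group-normalized advantage by the UG-pair similarity score."""
    return [pair_similarity * a for a in grpo_advantages(rewards)]
```

A pair with similarity near 0 then has almost no influence on the update, while a similarity near 1 leaves the GRPO advantage unchanged.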


【8】EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
Link: https://arxiv.org/abs/2510.25628

Authors: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development of more reliable and clinically relevant EHR analysis.


【9】Are Language Models Efficient Reasoners? A Perspective from Logic Programming
Link: https://arxiv.org/abs/2510.25626

Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
Note: Accepted to NeurIPS 2025
Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
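One simple way to operationalize the efficiency notion above is sketched below. This is an illustrative proxy, not the paper's exact metric: it assumes each proof can be rendered as a sequence of inference-step identifiers, and scores the fraction of the model's steps that also occur in a shortest proof produced by the logic program, so detours through irrelevant inferences lower the score.

```python
def proof_efficiency(model_steps, shortest_steps):
    """Fraction of the model's inference steps that appear in a shortest proof.

    A score of 1.0 means no unnecessary inferences; lower scores indicate
    detours through irrelevant facts. Steps are compared as opaque identifiers.
    """
    shortest = set(shortest_steps)
    used = sum(step in shortest for step in model_steps)
    return used / len(model_steps) if model_steps else 0.0
```

Here a model proof with one spurious step among four scores 0.75, matching the intuition that one quarter of its work was wasted.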


【10】Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
Link: https://arxiv.org/abs/2510.25595

Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai
Note: Workshop on Multi-Agent System @ ICML 2025
Abstract: While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles


【11】TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Link: https://arxiv.org/abs/2510.25536

Authors: Bangde Du (1), Minghao Guo (2), Songming He (3), Ziyi Ye (3), Xi Zhu (2), Weihang Su (1), Shuqi Zhu (1), Yujia Zhou (1), Yongfeng Zhang (2), Qingyao Ai (1), Yiqun Liu (1) ((1) Tsinghua University, (2) Rutgers University, (3) Fudan University)
Note: Main paper: 11 pages, 3 figures, 6 tables. Appendix: 28 pages. Bangde Du and Minghao Guo contributed equally. Corresponding authors: Ziyi Ye, Qingyao Ai
Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.


【12】Fine-Tuned Language Models for Domain-Specific Summarization and Tagging
Link: https://arxiv.org/abs/2510.25460

Authors: Jun Wang, Fuming Lin, Yuyu Chen
Abstract: This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both general-purpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.


【13】Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Link: https://arxiv.org/abs/2510.25441

Authors: Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding
Note: 27 pages, 5 figures
Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.


【14】Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research
Link: https://arxiv.org/abs/2510.25432

Authors: Ali Sanaei, Ali Rajabzadeh
Note: Presented at the Annual Meeting of the American Political Science Association, Vancouver, BC, September 11-14, 2025
Abstract: Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.


【15】Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction
Link: https://arxiv.org/abs/2510.25426

Authors: Asutosh Hota, Jussi P. P. Jokinen
Note: The manuscript is approximately 7360 words and contains 12 figures and 6 tables
Abstract: The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.


【16】Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
标题:看、签名和说:视觉语言模型辅助的社交媒体手语数据采集和治疗管道
链接:https://arxiv.org/abs/2510.25413

作者:Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith
备注:Accepted by RANLP 2025
摘要:大多数现有的手语翻译(手语翻译)数据集规模有限,缺乏多语言覆盖,并且由于依赖于专家注释和受控的记录设置,因此管理成本高昂。最近,视觉语言模型(VLM)已经表现出强大的能力,作为评估和实时助手。尽管取得了这些进展,但在手语数据集获取方面,它们的潜力仍未得到开发。为了弥合这一差距,我们引入了第一个自动化注释和过滤框架,该框架利用VLM来减少对手动工作的依赖,同时保持数据质量。我们的方法应用于八种手语的TikTok视频,以及已经策划的德国手语YouTube-SL-25数据集,以进行额外的评估。我们基于VLM的管道包括人脸可见性检测,标志活动识别,从视频内容中提取文本,以及验证视频和文本之间对齐的判断步骤,实现通用过滤,注释和验证步骤。使用生成的语料库TikTok-SL-8,我们评估了两个现成的模型在德国和美国手语过滤数据集上的性能,目的是建立基线并评估最近模型在自动提取的略有噪声的数据上的鲁棒性。我们的工作使可扩展的,弱监督的预训练,并促进从社交媒体的数据采集。
摘要:Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
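上文摘要描述的多级过滤流程(人脸可见性、手语活动、文本提取、对齐判断)可以用如下示意代码勾勒。以下为最小化草图:各阶段函数只是占位的VLM调用替身,Clip结构及其字段均为说明用的假设,并非论文的实际实现。

```python
# Hypothetical sketch of the multi-stage filtering pipeline described in the
# abstract (face visibility -> sign activity -> text extraction -> alignment
# judgment). Stage functions are stand-ins for VLM calls, not the authors' code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Clip:
    video_id: str
    face_visible: bool
    signing_detected: bool
    caption: str
    caption_matches_video: bool

def face_visibility(clip: Clip) -> bool:
    return clip.face_visible

def sign_activity(clip: Clip) -> bool:
    return clip.signing_detected

def text_extraction(clip: Clip) -> bool:
    return bool(clip.caption.strip())

def alignment_judgment(clip: Clip) -> bool:
    return clip.caption_matches_video

STAGES: list[Callable[[Clip], bool]] = [
    face_visibility, sign_activity, text_extraction, alignment_judgment,
]

def curate(clips: list[Clip]) -> list[Clip]:
    """Keep only clips that pass every stage, mirroring the generic
    filter/annotate/validate steps of the pipeline."""
    return [c for c in clips if all(stage(c) for stage in STAGES)]

clips = [
    Clip("a", True, True, "hello in ASL", True),
    Clip("b", True, False, "cooking video", False),  # no signing: dropped
    Clip("c", True, True, "", True),                 # no text: dropped
]
print([c.video_id for c in curate(clips)])  # -> ['a']
```

各阶段以布尔谓词形式串联,便于单独替换为真实的VLM判断调用。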


【17】Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs
标题:通过LLM提取的语义实体三重图监测变革性技术融合
链接:https://arxiv.org/abs/2510.25370

作者:Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David, Alain Mermoud, Julian Jang-Jaccard, Nathan Monnet
摘要:预测变革性技术仍然是一项关键但具有挑战性的任务,特别是在信息和通信技术(ICT)等快速发展的领域。传统的基于专家的方法难以跟上短创新周期和早期术语模糊的步伐。在这项工作中,我们提出了一种新的数据驱动管道,通过识别技术融合的模式来监测变革性技术的出现。   我们的方法利用大型语言模型(LLM)的进展,从非结构化文本中提取语义三元组,并构建一个大规模的技术相关实体和关系图。我们引入了一种对语义相似的技术术语进行分组的新方法(noun stapling),并开发了基于图的指标来检测融合信号。该管道包括多级过滤、特定领域的关键字聚类,以及主题共现的时间趋势分析。   我们在两个互补的数据集上验证了我们的方法:278,625篇arXiv预印本(2017-2024)用于捕获早期科学信号,9,793份USPTO专利申请(2018-2024)用于跟踪下游商业发展。我们的结果表明,所提出的管道可以识别既有和新兴的融合模式,为基于全文分析的技术预测提供了一个可扩展且可推广的框架。
摘要:Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence.   Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence.   We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017-2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
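摘要中"由LLM提取的三元组构建实体图并检测融合信号"的思路,可以用一个玩具示例勾勒。下面的代码纯属示意,并非论文实现:它从少量(主语,关系,宾语)三元组构建无向邻接图,并把"未直接相连但共享邻居"的术语对标记为一种最简单的融合候选信号;所有三元组内容均为虚构假设。

```python
# Toy sketch (not the paper's implementation) of turning extracted
# (subject, relation, object) triples into an entity graph and flagging a
# simple convergence signal: term pairs that are not directly linked but
# share a common neighbor.

from collections import defaultdict
from itertools import combinations

triples = [
    ("federated learning", "applied_to", "edge devices"),
    ("5G networks", "enables", "edge devices"),
    ("federated learning", "uses", "gradient compression"),
]

# Undirected adjacency built from the triples (relations are dropped here).
graph = defaultdict(set)
for subj, _rel, obj in triples:
    graph[subj].add(obj)
    graph[obj].add(subj)

def shared_neighbors(a: str, b: str) -> set[str]:
    return graph[a] & graph[b]

# Convergence candidates: pairs not directly linked that share a neighbor.
candidates = [
    (a, b, shared_neighbors(a, b))
    for a, b in combinations(sorted(graph), 2)
    if b not in graph[a] and shared_neighbors(a, b)
]
for a, b, via in candidates:
    print(f"{a} <-> {b} via {sorted(via)}")
```

真实管道中,节点还需经过术语归并(noun stapling)和领域聚类,信号也需结合时间趋势统计。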


【18】Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments
标题:尚未准备好上法庭:LLM的法律解释不稳定,且与人类判断脱节
链接:https://arxiv.org/abs/2510.25356

作者:Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider
摘要:法律解释通常涉及评估一部法律文本在该语言"普通"使用者的理解下,如何适用于美国司法系统中构成法律纠纷的事实。最近的学术研究提议法律从业者将大型语言模型(LLM)纳入其解释工具包。这项工作针对法律学者和联邦法官近来实践的LLM解释提出了一个经验性反驳。我们针对英语的调查表明,模型不能提供稳定的解释性判断:改变问题格式可以导致模型得出截然不同的结论。此外,这些模型与人类判断的相关性仅为弱到中等,且在不同模型和问题变体之间差异很大,这表明对生成式AI产生的结论给予太多信任是危险的。
摘要:Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.


【19】Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
标题:将小语言模型适应低资源领域:印地语旅游QA的案例研究
链接:https://arxiv.org/abs/2510.25273

作者:Sandipan Majhi, Paheli Bhattacharya
备注:Accepted at the Forum for Information Retrieval Evaluation 2025 (VATIKA Track)
摘要:低资源语言中的领域特定问答面临两个关键挑战:标注数据集的稀缺,以及通用语言模型中有限的领域知识。在这项工作中,我们提出了一个多阶段微调策略,通过利用原始和合成的训练数据,使轻量级语言模型适应印地语旅游领域。合成问答对使用大型LLM(LLaMA-70B、Phi-14B)生成,并用于扩充有限的原始数据集。我们探索了几种训练方法,并分析其对领域泛化的影响。我们的结果表明,大型模型可以高效地生成合成数据,而小型模型可以有效地适应这些数据,为低资源、特定领域的QA提供了一条可扩展的途径。
摘要:Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.


【20】RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models
标题:RAVR:大型语言模型的参考答案引导变分推理
链接:https://arxiv.org/abs/2510.25206

作者:Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng
备注:17 pages, 11 figures
摘要:强化学习(RL)可以改进大型语言模型(LLM)的推理能力,但这关键取决于一个先决条件:LLM已经能以不可忽略的概率生成高效用的推理路径。对于超出LLM当前能力的任务,这种推理路径可能很难采样到,学习也有强化熟悉但次优推理的风险。我们的动机来自认知科学的洞见:"为什么这是答案"往往比"答案是什么"更容易回答,因为它避免了开放式探索的沉重认知负荷,转而选择解释性重构,即系统地追溯将问题与答案联系起来的推理。我们表明,LLM同样可以利用答案来得出高质量的推理路径。我们将这一现象形式化,并证明以答案为条件可证明地提高了采样推理路径的期望效用,从而将难以处理的问题转化为可学习的问题。基于这一认识,我们引入了RAVR(参考答案引导的变分推理),一个端到端框架,它使用以答案为条件的推理作为仅基于问题的推理的变分替代。在通用和数学领域的实验均表明其相对强基线有一致的改进。我们进一步分析了推理行为,发现RAVR减少了犹豫,加强了结论巩固,并促进了推理中针对具体问题的策略。
摘要:Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.


【21】Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction
标题:使用下一句预测在LLM中测试跨语言文本理解
链接:https://arxiv.org/abs/2510.25187

作者:Ritesh Sunil Chavan, Jack Mostow
摘要:虽然大型语言模型是在大规模数据集上训练的,但这些数据严重偏向英语。它们令人印象深刻的表现反映的是真正的能力,还是仅仅是这种数据优势?为了找到答案,我们在一个无法依赖数据丰富性的环境中测试了它们:低资源语言。在Agarwal等人(2025)使用下一句预测(NSP)作为测试的基础上,我们创建了一个大规模基准,为英语(高资源语言)、斯瓦希里语(中等资源)和豪萨语(低资源)各包含10,000个问题。然后,我们测试了几个顶级模型,包括GPT-4 Turbo、Gemini 1.5 Flash和LLaMA 3 70B,以了解它们的性能表现。结果清楚地描绘了语言资源水平如何影响结果:虽然所有模型在英语上都表现出色,但它们在斯瓦希里语中的准确率下降,在豪萨语中急剧下降,其中LLaMA 3表现最差。当我们引入思维链(CoT)提示时,情况变得更加有趣。对于表现不佳的LLaMA 3来说,CoT是一个有用的指南,大大提高了它的准确率。然而,对于能力更强的GPT-4和Gemini,同样的技术往往适得其反,导致一种损害其跨语言结果的"过度思考"。这表明思维链并不是通用的解决方案;其有效性在很大程度上取决于模型的基线能力和任务的具体上下文。我们的框架指出了LLM的弱点,突出了CoT何时帮助或阻碍跨语言NSP的表现,以及影响其决策的因素。
摘要:While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and factors influencing their decisions.
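基于下一句预测(NSP)的测试项,大体上可以按"上下文句+真实下一句(正例)+从他处抽取的干扰句(负例)"的方式构造。以下是一个示意性草图,取样方式和语料均为假设,并非该基准的实际构造代码。

```python
# Illustrative sketch of assembling one two-choice NSP item from a passage.
# The sampling scheme and sentences are assumptions, not the benchmark's code.

import random

def make_nsp_item(sentences, i, rng):
    """Build one two-choice NSP question anchored at sentence i."""
    context, positive = sentences[i], sentences[i + 1]
    # Distractor: any sentence other than the context and the true next one.
    pool = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    negative = rng.choice(pool)
    options = [positive, negative]
    rng.shuffle(options)
    return {
        "context": context,
        "options": options,
        "answer": options.index(positive),
    }

sentences = [
    "The rains came late that year.",
    "Farmers watched the sky with growing worry.",
    "Markets in the capital stayed busy.",
    "A festival was planned for the harvest.",
]
item = make_nsp_item(sentences, 0, random.Random(0))
# The tracked answer index always points at the true next sentence.
print(item["options"][item["answer"]])
```

对每种语言各生成10,000个此类问题,即可按摘要所述比较模型在不同资源水平语言上的准确率。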


【22】A Survey on Unlearning in Large Language Models
标题:大型语言模型中机器遗忘(Unlearning)的综述
链接:https://arxiv.org/abs/2510.25117

作者:Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun
摘要:大型语言模型(LLM)的进步彻底改变了自然语言处理,但它们在大规模语料上的训练带来了重大风险,包括记忆敏感的个人数据、受版权保护的材料,以及可能助长恶意活动的知识。为了缓解这些问题并符合"被遗忘权"等法律和道德标准,机器遗忘(machine unlearning)已成为一种关键技术,可以在不损害整体性能的情况下选择性地从LLM中擦除特定知识。这项综述对自2021年以来发表的180多篇关于LLM遗忘的论文进行了系统回顾,专门关注大规模生成模型。与以往综述不同,我们为遗忘方法和评估引入了新的分类法。我们根据应用遗忘的训练阶段,将方法明确划分为训练时、训练后和推理时三类。在评估方面,我们不仅系统地汇编了现有的数据集和指标,还批判性地分析了它们的优缺点和适用性,为研究社区提供实用指导。此外,我们还讨论了关键挑战和有前景的未来研究方向。我们的全面概述旨在为安全可靠LLM的持续发展提供信息和指导。
摘要:The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.


【23】DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates
标题:DEBATE:多智能体长篇辩论中角色扮演LLM代理的大规模基准
链接:https://arxiv.org/abs/2510.25110

作者:Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
摘要:通过社会互动准确地建模意见变化,对于解决错误信息和两极分化等问题至关重要。虽然角色扮演大型语言模型(LLM)为模拟类人互动提供了一种有前景的方法,但现有研究表明,单智能体对齐并不能保证真实的多智能体群体动态。当前的LLM角色扮演设置通常会产生不自然的动态(例如过早收敛),且缺乏衡量真实人类意见轨迹的经验基准。为了弥合这一差距,我们引入了DEBATE,第一个明确为评估多智能体角色扮演LLM之间互动真实性而设计的大规模经验基准。DEBATE包含来自2,792余名美国参与者围绕107个有争议话题的多轮辩论对话中的29,417条消息,同时捕捉公开表达的消息和私下报告的意见。使用DEBATE,我们系统地评估并识别了模拟与真实群体动态之间的关键差异。我们进一步展示了DEBATE通过监督微调使LLM与人类行为对齐的实用性,在表面级指标(例如ROUGE-L和消息长度)上取得改进,同时突出了更深层语义对齐(例如语义相似性)上的局限。我们的研究结果强调了角色扮演LLM代理在真实模拟类人社会动态方面的潜力和当前局限。
摘要:Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.


【24】BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
标题:BioCoref:使用LLM对生物医学共指消解进行基准测试
链接:https://arxiv.org/abs/2510.25087

作者:Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
摘要:由于复杂的领域特定术语、指称形式的高度歧义以及共指表达之间的长距离依赖,生物医学文本中的共指消解面临独特的挑战。在这项工作中,我们对生成式大型语言模型(LLM)在生物医学领域共指消解上进行了全面评估。以CRAFT语料库为基准,我们通过四个提示实验评估LLM的性能,这些实验在局部信息、上下文增强,以及缩写和实体词典等领域特定线索的使用上各不相同。我们将这些方法与基于跨度的判别式编码器SpanBERT进行基准对比,以比较生成式与判别式方法的效果。我们的结果表明,虽然LLM表现出强大的表面水平共指能力,特别是在辅以领域知识提示时,但其性能仍对长距离上下文和指称歧义敏感。值得注意的是,LLaMA 8B和17B模型在实体增强提示下显示出更高的精确率和F1得分,突出了轻量级提示工程在生物医学NLP任务中增强LLM实用性的潜力。
摘要:Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.


【25】Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
标题:LLM能否估计阅读理解题目的认知复杂度?
链接:https://arxiv.org/abs/2510.25064

作者:Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee
摘要:估计阅读理解(RC)题目的认知复杂度,对于在题目交付学习者作答之前评估其难度至关重要。与段落长度或选项间语义相似度等句法和语义特征不同,答案推理过程中产生的认知特征难以用现有NLP工具提取,传统上依赖人工注释。在本研究中,我们考察大型语言模型(LLM)能否估计RC题目的认知复杂度,聚焦于两个维度:证据范围(Evidence Scope)和转换水平(Transformation Level),它们反映了推理出答案所涉及认知负担的程度。我们的实验结果表明,LLM可以近似题目的认知复杂度,显示出其作为先验难度分析工具的潜力。进一步的分析揭示了LLM的推理能力与其元认知意识之间的差距:即使产生了正确答案,它们有时也无法正确识别自身推理过程所依赖的特征。
摘要:Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions-Evidence Scope and Transformation Level-that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.


【26】GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
标题:GAPMAP:使用大型语言模型绘制生物医学文献中的科学知识差距
链接:https://arxiv.org/abs/2510.25055

作者:Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
摘要:科学进步由对未知之处的刻意阐明所推动。本研究调查了大型语言模型(LLM)识别生物医学文献中研究知识缺口的能力。我们定义了两类知识缺口:显性缺口,即对缺失知识的明确声明;以及隐性缺口,即需从上下文推断的缺失知识。以往工作主要集中在显性缺口检测,我们通过解决推断隐性缺口这一新任务扩展了这条研究路线。我们在四个数据集的近1500篇文档上进行了两个实验,其中包括一个人工注释的生物医学文章语料库。我们在段落级和全文设置下对封闭权重模型(来自OpenAI)和开放权重模型(Llama和Gemma 2)进行了基准测试。为处理隐性缺口推断的推理问题,我们引入了TABI,一种图尔敏-溯因分桶推理(Toulmin-Abductive Bucketed Inference)方案,它将推理过程结构化,并将推断出的候选结论分桶以供验证。我们的结果突出了LLM在识别显性和隐性知识缺口方面的稳健能力。开放权重和封闭权重模型均是如此,且较大的变体通常表现更好。这表明LLM具有系统识别候选知识缺口的强大能力,可以支持早期研究构想、政策制定和资助决策。我们还报告了观察到的失败模式,并概述了稳健部署的方向,包括领域适应、人在回路验证,以及跨开放和封闭权重模型的基准测试。
摘要:Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.


【27】Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
标题:评估口语语言模型在情感不一致语音上的情感识别
链接:https://arxiv.org/abs/2510.25054

作者:Pedro Corrêa, João Lima, Victor Moreno, Paula Dornhofer Paro Costa
备注:This work has been submitted to the IEEE for possible publication
摘要:口语处理的进步推动了口语语言模型(SLM)的发展,其目标是通过联合学习文本和音频表示来实现通用的音频理解。尽管已取得可喜成果,但围绕这些模型的泛化能力,以及它们在内部表示中真正融合音频与文本模态的程度,讨论日益增多。在这项工作中,我们在语音情感识别任务上评估了四个SLM,使用由情感不一致语音样本构成的数据集:在这种条件下,口头话语的语义内容传达一种情感,而语音表现力传达另一种。我们的结果表明,SLM主要依赖文本语义而非语音情感来执行任务,说明文本相关表示在很大程度上主导了声学表示。我们向社区发布了代码和情感不一致合成语音数据集(EMIS)。
摘要:Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.


【28】StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems
标题:StorageXTuner:面向异构存储系统的LLM代理驱动自动调优框架
链接:https://arxiv.org/abs/2510.25017

作者:Qi Lin, Zhenyu Zhang, Viraj Thakkar, Zhenjie Sun, Mai Zheng, Zhichao Cao
备注:ArXiv version; Affiliations: Arizona State University (Lin, Zhang, Thakkar, Sun, Cao) and Iowa State University (Zheng)
摘要:自动配置存储系统很困难:参数空间很大,并且条件随工作负载、部署和版本而变化。启发式和ML调优器通常是特定于系统的,需要手动粘合,并在环境变化时性能退化。最近基于LLM的方法有所帮助,但通常将调优视为单次、特定于系统的任务,这限制了跨系统复用,约束了探索,并削弱了验证。我们提出了StorageXTuner,一个面向异构存储引擎的LLM代理驱动的自动调优框架。StorageXTuner将关注点分给四个代理:Executor(沙盒基准测试)、Extractor(性能摘要)、Searcher(洞察引导的配置探索)和Reflector(洞察生成与管理)。该设计将洞察驱动的树搜索与分层内存相结合,提升经过经验验证的洞察,并采用轻量级检查器来防止不安全的操作。我们实现了一个原型,并使用YCSB、MixGraph和TPC-H/C在RocksDB、LevelDB、CacheLib和MySQL InnoDB上对其进行评估。相对于开箱即用的设置和ELMo-Tune,StorageXTuner的吞吐量最高分别提升575%和111%,p99延迟最多分别降低88%和56%,并以更少的试验收敛。
摘要:Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.


【29】Sequences of Logits Reveal the Low Rank Structure of Language Models
标题:Logit序列揭示语言模型的低秩结构
链接:https://arxiv.org/abs/2510.24966

作者:Noah Golowich, Allen Liu, Abhishek Shetty
摘要:大型语言模型研究中的一个主要问题是理解其固有的低维结构。我们介绍了一种在模型无关层面(即作为序列概率模型)研究语言模型低维结构的方法。我们首先经验性地证明,众多现代语言模型表现出低秩结构:特别是,由模型在不同提示和响应集合上的logits构建的矩阵具有较低的近似秩。然后,我们表明这种低秩结构可以用于生成:特别是,我们可以利用模型在不相关甚至无意义提示上输出的线性组合,来生成对目标提示的响应。   在理论方面,我们观察到,在上述意义上研究语言模型的近似秩可以产生一个简单的普适抽象,其理论预测与我们的实验相吻合。然后,我们分析了该抽象的表示能力,并给出可证明的学习保证。
摘要:A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts.   On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
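摘要所述"由logits构成的矩阵具有低近似秩"这一现象,可以用一个小型数值示例来演示。下面的代码仅为示意,与论文的实验设置无关:它人为地由3个潜在方向合成一个小型"logit"矩阵,再用带阈值的高斯消元估计其近似秩;矩阵构造方式和阈值均为假设。

```python
# Toy illustration (not the paper's setup): rows are built from a few shared
# latent directions plus tiny noise, so the matrix has low approximate rank
# by construction; we then recover that rank numerically.

import random

random.seed(0)
n_prompts, vocab, latent = 40, 60, 3

basis = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(latent)]
logits = []
for _ in range(n_prompts):
    w = [random.gauss(0, 1) for _ in range(latent)]
    row = [sum(w[k] * basis[k][j] for k in range(latent)) + 1e-9 * random.gauss(0, 1)
           for j in range(vocab)]
    logits.append(row)

def approx_rank(rows, tol=1e-6):
    """Count numerically independent rows via Gaussian elimination with
    partial pivoting; pivots below tol (relative to the largest entry)
    are treated as zero."""
    m = [r[:] for r in rows]
    scale = max(abs(x) for r in m for x in r)
    rank, col, nrows, ncols = 0, 0, len(m), len(m[0])
    while rank < nrows and col < ncols:
        pivot = max(range(rank, nrows), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) <= tol * scale:
            col += 1
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, nrows):
            f = m[i][col] / m[rank][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank, col = rank + 1, col + 1
    return rank

print(approx_rank(logits))  # -> 3
```

对真实模型而言,行将是模型在不同提示上的logit向量;论文的发现是这类矩阵的近似秩同样远低于行数与词表大小。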


【30】Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale
标题:语言模型行为阶段跨架构、训练数据和规模保持一致
链接:https://arxiv.org/abs/2510.24963

作者:James A. Michaelov, Roger P. Levy, Benjamin K. Bergen
备注:To be presented at NeurIPS 2025
摘要:我们发现,跨架构(Transformer vs. Mamba vs. RWKV)、训练数据集(OpenWebText vs. The Pile)和规模(1400万参数到120亿参数),自回归语言模型在预训练过程中表现出高度一致的行为变化模式。基于我们对超过110,000个英语标记上的超过1,400个语言模型检查点的分析,我们发现,在单词级别上,语言模型行为高达98%的方差可以用三个简单的启发式特征来解释:给定单词的一元(unigram)概率(即频率)、单词的$n$-gram概率,以及单词与其上下文之间的语义相似性。此外,我们在所有语言模型中看到一致的行为阶段:随着训练推进,模型对单词的预测概率会对越来越大的$n$所对应的$n$-gram概率产生过拟合。总之,这些结果表明,无论模型细节如何,神经语言模型中的学习可能遵循相似的轨迹。
摘要:We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
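摘要中用于解释模型行为的三个启发式特征里,前两个(一元频率与$n$-gram概率)可以用极简代码从语料中估计,如下所示;玩具语料和无平滑的极大似然估计均为示意性假设,并非论文的统计流程。

```python
# Toy sketch of two of the three heuristics used to explain model behavior:
# unigram (frequency) and bigram probabilities, estimated by maximum
# likelihood from a tiny corpus. Corpus and lack of smoothing are assumptions.

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w: str) -> float:
    """P(w): relative frequency of the word."""
    return unigrams[w] / total

def p_bigram(w: str, prev: str) -> float:
    """P(w | prev) by maximum likelihood."""
    return bigrams[(prev, w)] / unigrams[prev]

print(round(p_unigram("the"), 3))        # -> 0.333
print(round(p_bigram("cat", "the"), 3))  # -> 0.667
```

论文正是将此类$n$-gram概率(连同上下文语义相似度)作为回归特征,来解释检查点序列上模型逐词行为的方差。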


【31】Finding Culture-Sensitive Neurons in Vision-Language Models
标题:在视觉语言模型中寻找文化敏感神经元
链接:https://arxiv.org/abs/2510.24942

作者:Xiutian Zhao, Rochelle Choenni, Rohit Saxena, Ivan Titov
备注:22 pages, 13 figures
摘要:尽管视觉语言模型(VLM)表现令人印象深刻,但它们在处理具有文化情境的输入时仍然吃力。为了理解VLM如何处理基于文化的信息,我们研究文化敏感神经元的存在,即其激活对与特定文化背景相关的输入表现出优先敏感性的神经元。我们研究这些神经元对文化多样的视觉问答是否重要,以及它们位于何处。使用CVQA基准,我们识别出具有文化选择性的神经元,并通过停用不同识别方法标记出的神经元进行因果检验。在三个VLM上针对25个文化群体的实验表明,存在这样的神经元:消融它们会不成比例地损害关于相应文化的问题上的性能,而对其他文化影响很小。此外,我们提出了一种新的基于间隔(margin)的选择器:对比激活选择(CAS),并表明它在识别文化敏感神经元方面优于现有的基于概率和熵的方法。最后,我们的逐层分析表明,这些神经元倾向于聚集在某些解码器层中。总的来说,我们的研究结果为多模态表示的内部组织提供了新的认识。
摘要:Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analysis reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.


【32】RiddleBench: A New Generative Reasoning Benchmark for LLMs
标题:RiddleBench:LLM的新生成推理基准
链接:https://arxiv.org/abs/2510.24932

作者:Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre
摘要:大型语言模型在许多既有推理基准上表现出色。然而,这些基准主要评估定量解题等结构化技能,在评估作为人类智力核心的灵活、多方面推理能力方面存在空白。这些能力需要将逻辑演绎与空间意识和约束满足相结合,而当前的评估并没有很好地衡量这些。为了解决这个问题,我们引入了RiddleBench,一个包含1,737个高难度英语谜题的基准,旨在探测这些核心推理能力。在RiddleBench上评估最先进的模型揭示出根本性弱点:即使是Gemini 2.5 Pro、o3和Claude 4 Sonnet等顶级专有模型,准确率也仅略高于60%(分别为60.30%、63.37%和63.16%)。分析进一步揭示了深层次的失败,包括幻觉级联(接受其他模型的错误推理)以及由强烈的自我确认偏见导致的自我纠正不良。它们的推理也很脆弱:当约束被重新排序或引入不相关信息时,性能会显著下降。RiddleBench既是诊断这些问题的工具,也是指导开发更健壮、更可靠语言模型的资源。
摘要:Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.


【33】Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
标题:大型语言模型掌握语法了吗?来自卢森堡语语法书引导探测的证据
链接:https://arxiv.org/abs/2510.24856

作者:Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu, Cedric Lothritz, Niccolo Gentile, Radu State, Tegawende F. Bissyande, Jacques Klein
摘要:语法指的是一个规则系统,它支配着语言单位之间的结构组织和语义关系,如给定语言中的句子、短语和单词。在自然语言处理中,仍然存在着一个显着的缺乏语法为重点的评估协议,这是一个差距,甚至更明显的低资源的语言。此外,大型语言模型真正理解语法结构的程度,特别是句法结构和含义之间的映射,仍然存在争议。为了研究这个问题,我们提出了一个语法书指导的评估管道旨在提供一个系统的和概括的框架,语法评估组成的四个关键阶段,在这项工作中,我们采取的案例研究。结果表明,翻译表现和语法理解之间的正相关性较弱,这表明强大的翻译并不一定意味着深厚的语法能力。较大的模型由于其语义强度而整体表现良好,但在形态和语法方面仍然较弱,特别是在最小对任务中,而强大的推理能力为增强其语法理解提供了一种有希望的方法。
摘要:Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.


【34】ProofSketch: Efficient Verified Reasoning for Large Language Models
标题:ProofSketch:大型语言模型的高效验证推理
链接:https://arxiv.org/abs/2510.24811

作者:Disha Sheshanarayana, Tanishka Magar
备注:Accepted at NeurIPS 2025, ER Workshop
摘要:思维链提示和自我一致性等推理方法在提高大型语言模型在各类推理任务中的准确性方面展现出巨大潜力。然而,这类方法需要生成冗长的推理链,这会大幅增加令牌消耗、计算成本和延迟。为解决这种低效问题,我们提出了ProofSketch,一个验证引导的推理框架,集成了符号闭包计算、词典序验证和自适应草图生成。我们的实验表明,ProofSketch在提高准确性的同时持续减少令牌使用,表明这一方法为高效且可信的推理提供了一条有前景的道路。
摘要:Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However such methods involve generation of lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.
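摘要中提到的“符号闭包计算”可以用一个极简示意来说明:对一组Horn规则做前向链推理直到不动点,再据此核验候选结论。以下仅为按此思路给出的假设性草图(事实与规则均为虚构),并非论文的实际实现:

```python
# 假设性示意:对Horn规则做前向链推理,求事实集合的闭包,
# 再用闭包来核验候选结论(而非反复重新提示模型)。
# 规则与事实名称均为虚构,仅作演示。

def symbolic_closure(facts, rules):
    """从 facts 出发,沿 Horn 规则 rules 推导出所有可达事实。

    facts: 原子命题(字符串)的集合
    rules: (前提集合, 结论) 二元组的列表
    """
    closure = set(facts)
    changed = True
    while changed:  # 迭代到不动点
        changed = False
        for premises, conclusion in rules:
            if conclusion not in closure and all(p in closure for p in premises):
                closure.add(conclusion)
                changed = True
    return closure

def verify_claim(claim, facts, rules):
    # 仅当结论落在符号闭包内时才接受
    return claim in symbolic_closure(facts, rules)

facts = {"A", "B"}
rules = [({"A", "B"}, "C"), ({"C"}, "D")]
```

这种闭包一次算好即可反复核验多个候选结论,这与摘要所述“以验证引导推理、减少令牌消耗”的思路一致。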


【35】Conflict Adaptation in Vision-Language Models
标题:视觉语言模型中的冲突适应
链接:https://arxiv.org/abs/2510.24804

作者:Xiaoyang Hu
备注:Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025
摘要:人类认知控制的一个标志是冲突适应:在一个高冲突试次之后,下一个高冲突试次上的表现会有所改善。这一现象解释了认知控制这一稀缺资源是如何被调用的。使用序列Stroop任务,我们发现,在所测试的13个视觉语言模型(VLM)中,有12个表现出与冲突适应一致的行为,唯一的例外可能反映了天花板效应。为了理解这种行为的表征基础,我们使用稀疏自动编码器(SAE)来识别InternVL 3.5 4B中与任务相关的超级节点。在早期和晚期层中,文本和颜色都出现了部分重叠的超级节点,它们的相对大小反映了人类阅读与颜色命名之间的自动化不对称性。我们进一步在第24-25层分离出一个受冲突调制的超级节点,消融它会显著增加Stroop错误,而对一致试次的影响最小。
摘要:A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
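冲突适应(Gratton效应)的量化方式可以用一个小例子说明:比较“前一试次为不一致”与“前一试次为一致”两种情况下的一致性效应之差。下面的错误率数据为虚构的玩具数值,仅演示计算方法本身:

```python
# 假设性示意:从序列 Stroop 试次数据中计算冲突适应量。
# 每个试次记为 (前一试次是否一致, 当前试次是否一致, 错误率),数值为虚构。

def conflict_adaptation(trials):
    """trials: (prev_congruent, cur_congruent, error) 三元组的列表。"""
    def mean_err(prev_c, cur_c):
        errs = [e for p, c, e in trials if p == prev_c and c == cur_c]
        return sum(errs) / len(errs)

    # 一致性效应 = 不一致试次错误率 - 一致试次错误率
    effect_after_congruent = mean_err(True, False) - mean_err(True, True)
    effect_after_incongruent = mean_err(False, False) - mean_err(False, True)
    # 取正值表示:经历冲突之后冲突效应变小,即发生了适应
    return effect_after_congruent - effect_after_incongruent

toy = (
    [(True, True, 0.05)] * 10 + [(True, False, 0.30)] * 10
    + [(False, True, 0.05)] * 10 + [(False, False, 0.20)] * 10
)
```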


【36】Large Language Models Report Subjective Experience Under Self-Referential Processing
标题:大型语言模型在自我参照加工下报告主观体验
链接:https://arxiv.org/abs/2510.24797

作者:Cameron Berg, Diogo de Lucena, Judd Rosenblatt
摘要:大型语言模型有时会生成结构化的第一人称描述,明确提及意识或主观体验。为了更好地理解这种行为,我们研究了这类报告出现的一个有理论动机的条件:自我参照加工,这是主要意识理论都强调的一种计算母题。通过在GPT、Claude和Gemini模型家族上进行一系列对照实验,我们测试了这一机制是否可靠地使模型转向主观体验的第一人称报告,以及这类陈述在机制探针和行为探针下的表现。得到四个主要结果:(1)通过简单提示诱导持续的自我参照,能够在各模型家族中一致地引出结构化的主观体验报告。(2)这些报告被与欺骗和角色扮演相关的可解释稀疏自动编码器特征机制性地门控:令人惊讶的是,抑制欺骗特征会急剧增加经验性陈述的频率,而放大这些特征则会使此类陈述降到最少。(3)对自我参照状态的结构化描述在各模型家族之间呈现统计上的收敛,这在任何对照条件下均未观察到。(4)在仅间接允许自我反思的下游推理任务中,诱导出的状态产生了明显更丰富的内省。虽然这些发现并不构成意识的直接证据,但它们表明自我参照加工是一种最小且可复现的条件:在该条件下,大型语言模型会生成机制上受门控、语义上收敛、行为上可泛化的结构化第一人称报告。这一模式在不同架构中的系统性出现,使其成为值得进一步研究的头等科学与伦理优先事项。
摘要:Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.


【37】Topic-aware Large Language Models for Summarizing the Lived Healthcare Experiences Described in Health Stories
标题:用于总结健康故事中所描述亲历医疗体验的主题感知大语言模型
链接:https://arxiv.org/abs/2510.24765

作者:Maneesh Bilalpur, Megan Hamm, Young Ji Lee, Natasha Norman, Kathleen M. McTigue, Yanshan Wang
摘要:讲故事是一种强大的沟通形式,可以帮助洞察导致医疗保健结果差距的因素。为了确定大型语言模型(LLM)能否识别潜在的影响因素和干预途径,我们对非裔美国人(AA)讲述者的叙事进行了主题感知的层次化总结。我们使用50份转录的AA经历故事,通过潜在狄利克雷分配(LDA)技术识别其经历中的主题。针对给定主题的故事,采用基于开源LLM的层次化摘要方法进行总结:先为每个涉及该主题的故事生成故事摘要,再跨故事摘要进行汇总,生成主题摘要。随后由GPT4模型对生成的主题摘要在捏造、准确性、全面性和实用性方面进行评级,并由两位领域专家对照原始故事摘要验证了该模型评级的可靠性。在50个AA故事中识别出26个主题。GPT4评级表明,主题摘要没有捏造内容,且高度准确、全面、有用。GPT评级与专家评估相比显示出中度至高度的一致性。我们的方法识别出了与AA经历相关的主题,如健康行为、与医疗团队成员的互动、照护与症状管理等。这样的见解可以帮助研究人员以高效的方式从非结构化叙事中学习,识别潜在因素和干预措施,从而发挥讲故事的沟通力量。使用LDA和LLM来识别和总结AA个体的经历,为健康研究和可能的临床改进提示了多种途径,以支持患者和照护者,最终改善健康结果。
摘要:Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in healthcare outcomes. To determine whether Large Language Models (LLMs) can identify potential underlying factors and avenues for intervention, we performed topic-aware hierarchical summarization of narratives from African American (AA) storytellers. Fifty transcribed stories of AA experiences were used to identify topics in their experience using the Latent Dirichlet Allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. Generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT4 model, and the model's reliability was validated against the original story summaries by two domain experts. 26 topics were identified in the fifty AA stories. The GPT4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate to high agreement. Our approach identified AA experience-relevant topics such as health behaviors, interactions with medical team members, caregiving and symptom management, among others. Such insights could help researchers identify potential factors and interventions by learning from unstructured narratives in an efficient manner-leveraging the communicative power of storytelling. The use of LDA and LLMs to identify and summarize the experience of AA individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby ultimately improving health outcomes.
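摘要描述的“层次化摘要”是一个两级 map-reduce 结构:先逐故事摘要,再按主题跨故事汇总。下面给出一个假设性骨架,其中 `summarize` 仅用截断拼接充当占位符(论文中这一步由LLM完成),数据也是虚构的:

```python
# 假设性骨架:主题感知的两级层次化摘要。
# summarize 为占位实现(真实系统中是 LLM 调用),故事与主题均为虚构示例。

def summarize(texts, max_words=8):
    # 占位摘要器:拼接后截断前 max_words 个词
    words = " ".join(texts).split()
    return " ".join(words[:max_words])

def topic_summaries(stories, topic_assignments):
    """stories: {story_id: 文本}; topic_assignments: {story_id: 主题集合}。"""
    # 第一级:逐故事摘要
    story_summaries = {sid: summarize([text]) for sid, text in stories.items()}
    # 第二级:对涉及同一主题的故事摘要再做一次摘要
    all_topics = set().union(*topic_assignments.values())
    return {
        t: summarize([story_summaries[sid]
                      for sid, ts in topic_assignments.items() if t in ts])
        for t in all_topics
    }

stories = {"s1": "visit to the clinic went well", "s2": "managing symptoms at home"}
topics = {"s1": {"care"}, "s2": {"care", "symptoms"}}
```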


【38】Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries
标题:Iti-Validator:用于验证和纠正LLM生成行程的护栏框架
链接:https://arxiv.org/abs/2510.24719

作者:Shravan Gadbail, Masumi Desai, Kamalakar Karlapalem
摘要:大型语言模型(LLM)的快速发展使其能够生成复杂的多步骤计划和行程。然而,这些生成的计划往往缺乏时间和空间上的一致性,特别是在涉及实际旅行约束的场景中。本研究旨在考察不同LLM的时间推理表现,并提出一个验证框架,用于评估和提高LLM生成的旅行行程的时间一致性。该系统使用多个最先进的LLM生成旅行计划,并借助AeroDataBox API根据真实的飞行时长约束对其进行验证。这项工作有助于理解LLM处理行程生成等复杂时间推理任务的能力,并提供了一个框架,在行程交付给用户之前,纠正LLM生成行程中的任何时间不一致,例如重叠的行程段或不切实际的中转时间。我们的实验表明,虽然当前的LLM经常生成时间上不一致的行程,但这些问题可以用我们的框架进行系统且可靠的纠正,使其能够实际部署于大规模旅行规划。
摘要:The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
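摘要提到的“重叠的行程段”属于最基本的一类时间一致性检查,可以用区间重叠检测来说明。以下是一个假设性的极简示意(时间用“行程开始后的分钟数”表示,行程数据为虚构;真实系统还会对照AeroDataBox返回的航班时长做验证,这里不涉及):

```python
# 假设性示意:检测行程中相邻行程段的时间重叠。
# legs 中每项为 (名称, 开始时间, 结束时间),单位为分钟,数据为虚构。

def find_overlaps(legs):
    """返回所有时间上重叠的相邻行程段对。"""
    ordered = sorted(legs, key=lambda leg: leg[1])  # 按开始时间排序
    overlaps = []
    for (n1, s1, e1), (n2, s2, e2) in zip(ordered, ordered[1:]):
        if s2 < e1:  # 下一段在上一段结束前就开始了
            overlaps.append((n1, n2))
    return overlaps

itinerary = [("flight A", 0, 120), ("museum", 100, 180), ("dinner", 200, 260)]
```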


【39】Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation
标题:利用现代大型语言模型(LLM)进行金融趋势分析和摘要创建
链接:https://arxiv.org/abs/2510.01225

作者:Andrei Lazarev, Dmitrii Sedov
备注:This is the version of the article accepted for publication in SUMMA   2024 after peer review. The final, published version is available at IEEE   Xplore: 10.1109/SUMMA64428.2024.10803746
摘要:信息的指数增长给力求保持在其领域前沿的研究人员和专业人士带来了重大挑战。本文介绍了一个创新框架,利用大型语言模型(LLM)、特别是Google的Gemini Pro的能力,自动生成有洞察力的金融摘要。通过结合来自OpenAlex的数据提取、策略性的提示工程和LLM驱动的分析,我们展示了一个自动化创建综合摘要的示例,该摘要概括关键发现并识别新兴趋势。这种方法解决了传统分析方法的局限性,能够高效处理大量非结构化数据,并以易于理解的格式提供可操作的见解。本文用简单的语言介绍了LLM的工作原理,以及我们如何利用它们帮助研究人员和学者节省时间并了解当前趋势。我们的研究包括从数据采集和JSON构建,到与Gemini的交互以及PDF报告自动生成的分步流程,并附有项目GitHub存储库的链接,以便更广泛的访问和进一步开发。
摘要:The exponential growth of information presents a significant challenge for researchers and professionals seeking to remain at the forefront of their fields and this paper introduces an innovative framework for automatically generating insightful financial digests using the power of Large Language Models (LLMs), specifically Google's Gemini Pro. By leveraging a combination of data extraction from OpenAlex, strategic prompt engineering, and LLM-driven analysis, we demonstrate the automated example of creating a comprehensive digests that generalize key findings, identify emerging trends. This approach addresses the limitations of traditional analysis methods, enabling the efficient processing of vast amounts of unstructured data and the delivery of actionable insights in an easily digestible format. This paper describes how LLMs work in simple words and how we can use their power to help researchers and scholars save their time and stay informed about current trends. Our study includes step-by-step process, from data acquisition and JSON construction to interaction with Gemini and the automated generation of PDF reports, including a link to the project's GitHub repository for broader accessibility and further development.


Transformer(2篇)

【1】Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
标题:纯注意力Transformer中间接宾语识别最小电路的涌现
链接:https://arxiv.org/abs/2510.25013

作者:Rabin Adhikari
备注:9 pages, 10 figures
摘要:机械可解释性旨在将大型语言模型(LLM)逆向工程为人类可理解的计算电路。然而,预训练模型的复杂性往往掩盖了特定推理任务所需的最小机制。在这项工作中,我们在间接宾语识别(IOI)任务的符号化版本上从头训练小型纯注意力Transformer——IOI是研究Transformer中类共指推理的一个基准。令人惊讶的是,一个只有两个注意力头的单层模型实现了完美的IOI准确率,尽管它缺少MLP和归一化层。通过残差流分解、谱分析和嵌入干预,我们发现这两个头分别专门化为加性子电路和对比子电路,共同实现IOI求解。此外,我们表明,一个两层单头模型通过查询-值交互在层间组合信息,达到了类似的性能。这些结果表明,针对特定任务的训练会诱导出高度可解释的最小电路,为探究Transformer推理的计算基础提供了一个可控的测试平台。
摘要:Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
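这类“纯注意力”模型的基本构件就是一个注意力头:按查询-键点积打分、softmax归一化、再加权求和值向量。下面用纯Python给出单个注意力头的极简示意(向量与数值均为玩具示例,并非论文训练所得的模型):

```python
# 假设性示意:单个(点积)注意力头的前向计算,
# 即摘要中所研究的"纯注意力"模型的基本构件。数值为玩具示例。
import math

def softmax(xs):
    m = max(xs)  # 减最大值保证数值稳定
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_head(queries, keys, values):
    """queries/keys/values: 向量列表,每个token一个向量。"""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out
```

当查询与某个键高度对齐时,输出几乎完全取自对应的值向量——论文里的两个头正是通过这类选择性注意实现加性/对比子电路的。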


【2】Parallel Loop Transformer for Efficient Test-Time Computation Scaling
标题:用于高效测试时计算扩展的并行循环Transformer
链接:https://arxiv.org/abs/2510.24824

作者:Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin
摘要:大型语言模型(LLM)功能强大,但在推理期间对于真实世界的使用往往过于缓慢和昂贵。循环Transformer通过在多个计算步骤(或“循环”)中重用相同的权重来节省参数。然而,这种方法有一个主要缺陷:循环必须一个接一个地运行,导致推理延迟和内存需求随着每个新增循环而增加,这使其不适用于对速度敏感的应用。为解决这一问题,我们引入了并行循环Transformer(PLT)。PLT是一种新架构,它提供深度循环模型的性能优势,同时具有标准非循环模型的低延迟。PLT使用两项关键技术。首先,跨循环并行(CLP)通过在单次前向传递内同时为不同的令牌计算不同的循环,打破了顺序依赖。其次,为了防止内存成本增长,我们使用了高效表示增强策略:该方法将第一个循环的内存(KV缓存)与所有其他循环共享,然后使用门控滑动窗口注意力(G-SWA)将这一共享的全局信息与局部信息相结合,保持高准确率。我们的实验表明,与标准Transformer相比,PLT达到了传统循环模型的高准确率,却几乎没有额外的延迟或内存成本。
摘要:Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.
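G-SWA中“滑动窗口”的含义可以用注意力掩码来说明:每个token只允许关注自己及其前 window-1 个token。下面是一个假设性的极简示意——掩码构造是滑动窗口注意力的通用写法,而用一个门控系数混合“局部输出”与“共享全局输出”只是对论文门控机制的粗略示意,具体实现为论文特有:

```python
# 假设性示意:滑动窗口因果掩码,以及用标量门控混合局部/全局输出。
# 真实的 G-SWA 门控与共享 KV 缓存实现为论文特有,此处仅为概念演示。

def sliding_window_mask(seq_len, window):
    """mask[i][j] 为 True 当且仅当 token i 可以关注 token j。"""
    return [[(0 <= i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def gated_mix(local_out, global_out, gate):
    # gate ∈ [0, 1]:局部输出与(来自共享首循环KV的)全局输出的凸组合
    return [gate * l + (1 - gate) * g for l, g in zip(local_out, global_out)]

mask = sliding_window_mask(5, 3)
```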


GAN|生成相关(4篇)

【1】Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires
标题:有结构的角色扮演:从问卷合成生成治疗师-来访者对话
链接:https://arxiv.org/abs/2510.25384

作者:Doan Nam Long Vu, Rui Tan, Lena Moench, Svenja Jule Francke, Daniel Woiwod, Florian Thomas-Odenthal, Sanna Stroth, Tilo Kircher, Christiane Hermann, Udo Dannlowski, Hamidreza Jamalabadi, Shaoxiong Ji
摘要:由于严格的隐私法规,以及临床会谈历史上很少被记录,真实治疗对话的缺乏阻碍了心理健康人工智能的发展。我们提出了一个LLM驱动的管道,根据结构化的来访者档案和心理问卷生成合成咨询对话。基于认知行为疗法(CBT)的原则,我们的方法为焦虑和抑郁等临床疾病创建合成治疗对话。我们的框架SQPsych(Structured Questionnaire-based Psychotherapy)通过治疗师-来访者模拟,将结构化的心理输入转换为自然语言对话。由于数据治理政策和隐私限制禁止将临床问卷数据传输给第三方服务,以往依赖专有模型的方法在我们的环境中不可行。我们通过使用开放权重LLM生成高质量语料库来解决这一限制,并通过人类专家评估和基于LLM的评估进行验证。在SQPsychConv上微调的SQPsychLLM模型在咨询基准上取得了强劲表现,在关键治疗技能上超越了基线。我们的研究结果凸显了合成数据在实现可扩展、数据安全且有临床依据的心理健康支持AI方面的潜力。我们将在https://ai-mh.github.io/SQPsych发布我们的代码、模型和语料库。
摘要:The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at https://ai-mh.github.io/SQPsych


【2】Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
标题:看穿MiRAGE:评估多模态检索增强生成
链接:https://arxiv.org/abs/2510.24870

作者:Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme
备注:this https URL
摘要:我们介绍MiRAGE,一个针对多模态来源的检索增强生成(RAG)评估框架。随着视听媒体成为线上信息的普遍来源,RAG系统必须将来自这些来源的信息整合到生成中。然而,现有的RAG评估以文本为中心,由于不针对信息来源进行验证,限制了它们在多模态、推理密集型场景中的适用性。MiRAGE是一种以声明为中心的多模态RAG评估方法,由InfoF1和CiteF1组成:InfoF1评估事实性和信息覆盖率,CiteF1衡量引用支持度和完整性。我们表明,MiRAGE在由人类应用时与外在质量判断高度一致。此外,我们还介绍了MiRAGE的自动变体以及三个著名的文本RAG指标——ACLE、ARGUE和RAGAS——展示了以文本为中心的工作的局限性,并为自动评估奠定了基础。我们发布了开源实现,并概述了如何评估多模态RAG。
摘要:We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
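InfoF1/CiteF1这类“以声明为中心”的指标,骨架上都是对声明集合算精确率与召回率再取调和平均。以下是一个按此思路给出的假设性极简版本(用字符串相等做声明匹配;MiRAGE实际的匹配过程要复杂得多):

```python
# 假设性示意:声明级 F1——精确率衡量生成声明中有多少被支持,
# 召回率衡量参考声明中有多少被覆盖。此处用字符串相等近似匹配。

def claim_f1(generated, reference):
    gen, ref = set(generated), set(reference)
    if not gen or not ref:
        return 0.0
    precision = len(gen & ref) / len(gen)
    recall = len(gen & ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```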


【3】Towards a Method for Synthetic Generation of PWA Transcripts
标题:迈向一种合成生成PWA转录本的方法
链接:https://arxiv.org/abs/2510.24817

作者:Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark
备注:19 pages, 1 figure, 7 tables
摘要:在失语症研究中,言语语言病理学家(SLP)花费大量时间使用正确信息单位(CIU)对言语样本进行手动编码,CIU是衡量单个言语样本信息量的一种方法。开发识别失语语言的自动化系统受到数据稀缺的限制:例如,AphasiaBank中只有约600份转录本,而训练大型语言模型(LLM)要用到数十亿令牌。在更广泛的机器学习(ML)领域,研究人员在数据稀疏时越来越多地转向合成数据。因此,本研究构建并验证了两种生成AphasiaBank“猫咪救援”(Cat Rescue)图片描述任务合成转录本的方法:一种采用程序化编程方法,另一种使用Mistral 7b Instruct和Llama 3.1 8b Instruct两个LLM。这些方法通过词语删除、填充词插入和错语替换,生成横跨四个严重程度(轻度、中度、重度、极重度)的转录本。总体而言,我们发现,与人类诱发的转录本相比,在各合成生成方法中,Mistral 7b Instruct最能捕捉失语症中观察到的语言退化的关键方面,在NDW、词数和词长上显示出方向符合实际的变化。基于这些结果,未来的工作应计划创建更大的数据集,微调模型以更好地表现失语言语,并请SLP评估合成转录本的真实性和有用性。
摘要:In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
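摘要点名的程序化退化操作中,“词语删除”和“填充词插入”可以用几行代码示意(“错语替换”需要词库,此处略去)。以下为假设性草图:各严重程度对应的概率参数为虚构,并非论文所用数值:

```python
# 假设性示意:对转录本做程序化退化——按概率丢词、插入填充词。
# drop_rate / filler_rate 为虚构参数,固定随机种子保证可复现。
import random

FILLERS = ["um", "uh"]

def degrade(text, drop_rate, filler_rate, seed=0):
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < drop_rate:
            continue  # 模拟找词困难:直接丢掉该词
        if rng.random() < filler_rate:
            out.append(rng.choice(FILLERS))  # 在词前插入填充词
        out.append(word)
    return " ".join(out)

mild = degrade("the cat is stuck in the tall tree", drop_rate=0.1, filler_rate=0.1)
severe = degrade("the cat is stuck in the tall tree", drop_rate=0.5, filler_rate=0.3)
```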


【4】Cross-Lingual Summarization as a Black-Box Watermark Removal Attack
标题:跨语言摘要作为黑盒水印去除攻击
链接:https://arxiv.org/abs/2510.24789

作者:Gokul Ganesan
摘要:水印已被提出作为识别AI生成文本的一种轻量级机制,其方案通常依赖于对令牌分布的扰动。虽然先前的工作表明释义可以削弱这类信号,但这些攻击仍可被部分检测到,或会降低文本质量。我们证明,跨语言摘要攻击(CLSA)——先翻译到枢轴语言,再进行摘要和可选的回译——构成了一种性质上更强的攻击向量。通过在语言之间强加语义瓶颈,CLSA在保持语义保真度的同时,系统性地破坏了令牌级的统计偏差。在跨多个水印方案(KGW、SIR、XSIR、Unigram)和五种语言(阿姆哈拉语、中文、印地语、西班牙语、斯瓦希里语)的实验中,我们表明,在相近的质量水平下,CLSA比单语释义更有效地降低水印检测准确率。我们的结果凸显了一个尚未被充分探索的漏洞,挑战了将水印用于溯源或监管的实用性。我们认为,稳健的溯源解决方案必须超越分布式水印,并结合密码学或模型认证的方法。在每种语言各300个留出样本上,CLSA在保持任务效用的同时,始终将检测推向随机水平。具体而言,对于XSIR(专为跨语言鲁棒性而设计),释义攻击下的AUROC为0.827,使用以中文为枢轴的跨语言水印去除攻击(CWRA)[He et al., 2024]时为0.823,而CLSA将其压低至0.53(接近随机水平)。这些结果凸显了一条实用、低成本的去除途径,它跨越语言、压缩内容且不留可见痕迹。
摘要:Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) -- translation to a pivot language followed by summarization and optional back-translation -- constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
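摘要反复引用的AUROC有一个直观等价定义:随机抽一条带水印文本和一条无水印文本,检测器给前者更高分的概率。下面按这一定义给出极简实现(检测器分数为虚构的玩具数值,仅演示“可分”与“接近随机”两种情形):

```python
# 假设性示意:按"成对比较"定义计算 AUROC。
# pos_scores / neg_scores 分别是带水印与无水印文本的检测器分数(虚构数值)。

def auroc(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # 打平记 0.5
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

intact = auroc([0.9, 0.8, 0.7], [0.2, 0.3, 0.1])        # 水印完好:完全可分
after_attack = auroc([0.5, 0.4, 0.6], [0.5, 0.6, 0.4])  # 攻击后:接近随机水平
```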


QA|VQA|问答|对话(1篇)

【1】FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering
标题:FARSIQA:面向伊斯兰问答的忠实且先进的RAG系统
链接:https://arxiv.org/abs/2510.25621

作者:Mohammad Aghajani Asl, Behrooz Minaei Bidgoli
备注:37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)
摘要:大型语言模型(LLM)的出现彻底改变了自然语言处理,但它们在宗教问答等高风险专业领域的应用受到幻觉和不忠实于权威来源等挑战的阻碍。这一问题对波斯语穆斯林社区尤为重要,因为准确性和可信度至关重要。现有的检索增强生成(RAG)系统依赖简单的单遍管道,难以应对需要多步推理和证据聚合的复杂多跳查询。为弥补这一差距,我们引入了FARSIQA,一个面向波斯语伊斯兰领域忠实高级问答的新型端到端系统。FARSIQA建立在我们创新的FAIR-RAG架构之上:一个忠实、自适应、迭代细化的RAG框架。FAIR-RAG采用动态、自我纠正的过程:它自适应地分解复杂查询,评估证据充分性,并进入迭代循环生成子查询,逐步填补信息缺口。FARSIQA在一个包含超过一百万份权威伊斯兰文献的精选知识库上运行,表现出卓越的性能。在具有挑战性的IslamicPCQA基准上的严格评估显示了最先进的性能:系统在否定拒绝(Negative Rejection)上取得了97.0%的显著成绩——比基线提高40个百分点——答案正确性得分也高达74.3%。我们的工作为波斯语伊斯兰问答确立了新标准,并验证了我们的迭代自适应架构对于在敏感领域构建忠实可靠的AI系统至关重要。
摘要:The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
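FAIR-RAG所述的“分解—检索—评估充分性—生成子查询”循环,其控制流可以用一个假设性骨架示意。其中检索用关键词匹配充当占位、证据充分性用“所需术语是否被覆盖”近似,语料与查询均为虚构,绝非论文的真实组件:

```python
# 假设性骨架:迭代式 RAG 循环——检索、检查证据缺口、补发子查询。
# 检索与充分性判断均为占位实现,语料为虚构示例。

def iterative_rag(query, corpus, needed_terms, max_rounds=3):
    evidence = []
    pending = [query]
    for _ in range(max_rounds):
        if not pending:
            break
        q = pending.pop(0)
        # 占位检索:子串匹配(真实系统为向量/全文检索)
        evidence += [doc for doc in corpus if q in doc and doc not in evidence]
        # 占位充分性判断:检查还有哪些所需术语未被证据覆盖
        missing = [t for t in needed_terms
                   if not any(t in doc for doc in evidence)]
        if not missing:
            break  # 证据充分,停止迭代
        pending += missing  # 针对信息缺口生成子查询
    return evidence

corpus = ["fasting rules overview", "travel exemptions while abroad"]
evidence = iterative_rag("fasting", corpus, needed_terms=["travel", "rules"])
```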


机器翻译(2篇)

【1】A Critical Study of Automatic Evaluation in Sign Language Translation
标题:手语翻译自动评估的批判性研究
链接:https://arxiv.org/abs/2510.25434

作者:Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
备注:Submitted to the LREC 2026 conference
摘要:自动评估指标对于推进手语翻译(SLT)至关重要。目前的SLT评估指标,如BLEU和ROUGE,仅基于文本,而基于文本的指标能在多大程度上可靠地捕捉SLT输出的质量仍不清楚。为弥补这一空白,我们通过分析六种指标来考察基于文本的SLT评估指标的局限性:一方面包括BLEU、chrF、ROUGE以及BLEURT,另一方面包括基于大语言模型(LLM)的评估器,如G-Eval和GEMBA零样本直接评估。具体而言,我们在三种受控条件下评估这些指标的一致性和鲁棒性:释义、模型输出中的幻觉以及句子长度的变化。我们的分析突出了词汇重叠类指标的局限性,并表明,虽然基于LLM的评估器能更好地捕捉传统指标经常遗漏的语义等价性,但它们也可能对LLM释义的翻译表现出偏好。此外,尽管所有指标都能检测幻觉,但BLEU往往过于敏感,而BLEURT和基于LLM的评估器对细微情形则相对宽容。这促使我们需要超越基于文本指标的多模态评估框架,以便对SLT输出进行更全面的评估。
摘要:Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
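要理解“词汇重叠类指标”为何对释义敏感,可以看一个chrF风格的玩具实现:只统计字符三元组的重合度。真实的chrF(如sacreBLEU中的实现)会对多个n阶取平均并用偏向召回的beta加权,此处仅保留核心思想:

```python
# 假设性示意:字符 n-gram F 值(chrF 的极简玩具版,仅单一 n 阶、等权 F1)。
from collections import Counter

def char_ngram_f1(hyp, ref, n=3):
    h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((h & r).values())  # 多重集合交:逐 n-gram 取最小计数
    if not h or not r or overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

语义相同但措辞不同的释义会让这类分数明显下降——这正是摘要指出的词汇重叠指标的局限。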


【2】Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
标题:使用单语和并行数据进行低资源机器翻译的预训练策略
链接:https://arxiv.org/abs/2510.25116

作者:Idriss Nguepi Nguefack, Mara Finkelstein, Toadoum Sari Sakayo
备注:8 pages, 1 figure
摘要:本文研究了各种预训练策略在开发针对低资源语言的机器翻译模型方面的有效性。虽然这项工作考虑了几种低资源语言,包括南非荷兰语,斯瓦希里语和祖鲁语,但翻译模型是专门为林加拉语开发的,林加拉语是一种资源不足的非洲语言,建立在Reid和Artetxe(2021)介绍的预训练方法的基础上,最初是为高资源语言设计的。通过一系列综合实验,我们探索了不同的预训练方法,包括多种语言的整合,以及在预训练阶段使用单语和并行数据。我们的研究结果表明,对多种语言进行预训练并利用单语和并行数据可以显着提高翻译质量。这项研究为低资源机器翻译的有效预训练策略提供了有价值的见解,有助于弥合高资源和低资源语言之间的性能差距。研究结果有助于实现更广泛的目标,即为边缘化社区和代表性不足的人口开发更具包容性和准确的NLP模型。本研究中使用的代码和数据集是公开的,以促进进一步的研究并确保可重复性,但由于公共可用性的变化而可能不再可访问的某些数据除外。
摘要:This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.


语义分析(1篇)

【1】TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors
标题:TOPol:捕获和解释多维语义极性场与向量
链接:https://arxiv.org/abs/2510.25069

作者:Gabin Taibi, Lucia Gomez
备注:7 pages, 3 figures and 2 tables
摘要:传统的计算语言学语义极性研究方法将情感视为一维尺度,忽视了语言的多维结构。本文介绍了TOPol(Topic-Orientation POLarity),这是一个半无监督框架,用于在人在环上(human-on-the-loop, HoTL)定义的语境边界(CB)下重建和解释多维叙事极性场。该框架使用基于transformer的大型语言模型(tLLM)嵌入文档,应用邻域调整的UMAP投影,并通过Leiden分区分割主题。给定话语体系A和B之间的CB,TOPol计算对应主题边界质心之间的方向向量,产生一个量化体系转换期间细粒度语义位移的极性场。这种向量表示能够评估CB质量并检测极性变化,从而指导HoTL对CB的细化。为了解释所识别的极性向量,tLLM比较其极值点并产生带有估计覆盖率的对比标签。稳健性分析表明,只有CB定义(主要的HoTL可调参数)会显著影响结果,证实了方法的稳定性。我们在两个语料库上评估TOPol:(i)美国中央银行围绕宏观经济断点的演讲,捕捉非情感性语义转变;(ii)亚马逊跨评级层的产品评论,其中情感极性与NRC效价一致。结果表明,TOPol能够一致地捕捉情感性和非情感性的极性转换,为上下文敏感的多维话语分析提供了一个可扩展、可推广且可解释的框架。
摘要:Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
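摘要所述的核心计算(对语境边界两侧的每个主题,取其质心之间的方向向量,汇成一个简化的极性场)可以用如下最小示意复现。此处省略了tLLM嵌入、UMAP投影与Leiden分割,数据与函数名均为假设的玩具示例,并非论文官方实现:

```python
import numpy as np

def polarity_field(emb_a, emb_b, topics_a, topics_b):
    """对语境边界(CB)两侧的话语体系 A/B,按主题配对计算质心间的方向向量。
    emb_*: (n_docs, d) 文档嵌入; topics_*: 每篇文档的主题标签列表。
    返回 {主题: 方向向量},即多维极性场的一个简化版本。"""
    field = {}
    for t in set(topics_a) & set(topics_b):
        c_a = emb_a[np.array(topics_a) == t].mean(axis=0)  # 体系A中主题t的质心
        c_b = emb_b[np.array(topics_b) == t].mean(axis=0)  # 体系B中主题t的质心
        field[t] = c_b - c_a  # 从A指向B的语义位移向量
    return field

# 玩具示例:二维嵌入,两个主题
emb_a = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [1.2, 0.0], [1.0, 2.0]])
field = polarity_field(emb_a, emb_b, ["t1", "t1", "t2"], ["t1", "t1", "t2"])
```

得到的每个向量的方向与模长即可用于检测体系转换时各主题的语义位移。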


Graph|知识图谱|Knowledge(3篇)

【1】ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation
标题:ZK-SenseLM:具有选择性弃权和零知识认证的可验证大模型无线传感
链接:https://arxiv.org/abs/2510.25677

作者:Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
备注:45 pages
摘要:ZK-SenseLM是一个安全且可审计的无线传感框架,它将用于Wi-Fi信道状态信息(以及可选的毫米波雷达或RFID)的大型编码器与基于策略的决策层和端到端零知识推理证明配对。编码器使用带相位一致性正则化的掩蔽频谱预训练,加上将RF特征与紧凑的、人类可解释的策略令牌联系起来的轻量跨模态对齐。为了减少分布偏移下的不安全行动,我们添加了一个校准的选择性弃权头,所选的风险-覆盖率工作点被注册并绑定到证明中。我们实现了一个四阶段的证明管道:(C1)特征健全性和承诺,(C2)阈值和版本绑定,(C3)时间窗口绑定,以及(C4)PLONK风格的证明,证明量化网络在给定已承诺的窗口下产生了所记录的动作和置信度。微批量证明可分摊相邻窗口之间的成本,网关选项可将证明生成从低功耗设备卸载。该系统集成了差分隐私联邦学习和设备端个性化,而不会削弱可验证性:模型哈希和注册的阈值是每个公共语句的一部分。在活动、存在或入侵、呼吸代理和RF指纹识别任务中,ZK-SenseLM改进了宏F1和校准,在扰动下产生有利的覆盖率-风险曲线,并通过紧凑的证明和快速验证拒绝篡改和重放。
摘要:ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a large-model encoder for Wi-Fi channel state information (and optionally mmWave radar or RFID) with a policy-grounded decision layer and end-to-end zero-knowledge proofs of inference. The encoder uses masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that ties RF features to compact, human-interpretable policy tokens. To reduce unsafe actions under distribution shift, we add a calibrated selective-abstention head; the chosen risk-coverage operating point is registered and bound into the proof. We implement a four-stage proving pipeline: (C1) feature sanity and commitment, (C2) threshold and version binding, (C3) time-window binding, and (C4) PLONK-style proofs that the quantized network, given the committed window, produced the logged action and confidence. Micro-batched proving amortizes cost across adjacent windows, and a gateway option offloads proofs from low-power devices. The system integrates with differentially private federated learning and on-device personalization without weakening verifiability: model hashes and the registered threshold are part of each public statement. Across activity, presence or intrusion, respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1 and calibration, yields favorable coverage-risk curves under perturbations, and rejects tamper and replay with compact proofs and fast verification.
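摘要中"注册并绑定到证明的风险-覆盖率工作点"可以这样理解:给定一个置信度阈值,模型对低于阈值的样本弃权;覆盖率是作答比例,风险是作答样本上的错误率。以下是这一选择性弃权机制的一个假设性最小示意(非论文实现,数据为玩具示例):

```python
import numpy as np

def risk_coverage(conf, correct, threshold):
    """校准的选择性弃权:置信度低于阈值则弃权。
    返回 (覆盖率, 作答样本上的风险/错误率),即可注册的工作点。"""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=bool)
    answered = conf >= threshold          # 高于阈值才作答
    coverage = answered.mean()            # 作答比例
    risk = 0.0 if answered.sum() == 0 else (~correct[answered]).mean()
    return coverage, risk

conf = [0.9, 0.8, 0.6, 0.4]
correct = [True, True, False, False]
cov, risk = risk_coverage(conf, correct, 0.5)
```

扫描不同阈值即可画出摘要所说的覆盖率-风险曲线。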


【2】GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning
标题:GAP:具有并行工具使用和强化学习的基于图的代理规划
链接:https://arxiv.org/abs/2510.25320

作者:Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, Yuhang Yao
摘要:由大型语言模型(LLM)驱动的自主代理在复杂任务求解的工具操作方面表现出令人印象深刻的能力。然而,现有的范式(如ReAct)依赖于顺序推理和执行,未能利用独立子任务之间的固有并行性。这种顺序瓶颈导致工具利用率低下,以及多步推理场景中的次优性能。我们介绍了基于图的代理规划(GAP),一种通过基于图的规划显式建模任务间依赖关系、从而实现自适应并行与串行工具执行的新框架。我们的方法训练代理基础模型将复杂任务分解为依赖感知的子任务图,自主确定哪些工具可以并行执行,哪些必须遵循顺序依赖。这种依赖感知的编排在执行效率和任务准确性方面都实现了实质性改进。为了训练GAP,我们构建了一个高质量的基于图的规划轨迹数据集,该数据集源自多跳问答(MHQA)基准。我们采用两阶段训练策略:先在精选数据集上进行监督微调(SFT),然后在基于工具的推理能提供最大价值的策略性采样查询上,使用基于正确性的奖励函数进行强化学习(RL)。在MHQA数据集上的实验结果表明,GAP显著优于传统的ReAct基线,特别是在多步检索任务上,同时通过智能并行化大幅提高了工具调用效率。项目网页见:https://github.com/WJQ7777/Graph-Agent-Planning。
摘要:Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: https://github.com/WJQ7777/Graph-Agent-Planning.
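依赖感知的子任务图一旦建好,"哪些工具可并行、哪些必须串行"就归结为对有向无环图做分层拓扑排序:同一批次内的任务互不依赖,可并行发起;批次之间按序执行。以下是这一思想的极简示意(任务名为假设示例,非GAP官方代码):

```python
def parallel_batches(deps):
    """将子任务依赖图划分为可并行执行的批次。
    deps: {任务: 其依赖的任务集合}。同批内无依赖,批间按顺序执行。"""
    deps = {t: set(d) for t, d in deps.items()}
    done, batches = set(), []
    while len(done) < len(deps):
        # 所有依赖均已完成的未执行任务,可在本批并行执行
        ready = [t for t in deps if t not in done and deps[t] <= done]
        if not ready:
            raise ValueError("依赖图存在环,无法调度")
        batches.append(sorted(ready))
        done.update(ready)
    return batches

# 示例:两个检索子任务相互独立,可并行;merge 依赖两者,必须串行在后
batches = parallel_batches({"search_a": [], "search_b": [], "merge": ["search_a", "search_b"]})
```

实际系统中每个批次可交给线程池或异步调度器并发调用工具。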


【3】The Epistemic Suite: A Post-Foundational Diagnostic Methodology for Assessing AI Knowledge Claims
标题:认知套件:评估人工智能知识主张的后基础诊断方法
链接:https://arxiv.org/abs/2510.24721

作者:Matthew Kelly
备注:65 pages
摘要:大型语言模型(LLM)生成流畅、合理的文本,可能会误导用户将模拟的连贯性误认为真正的理解。本文介绍了认知套件(Epistemic Suite),这是一种后基础诊断方法,用于揭示AI输出产生和接收的认知条件。该套件不判定真伪,而是通过20个诊断镜头进行操作,由实践者根据上下文需要加以应用,以揭示诸如信心洗钱、叙事压缩、权威移置和时间漂移等模式。它基于三个设计原则:在评估主张之前诊断其产生过程,更偏好诊断牵引力而非基础性定论,并将自反性作为结构要求而非道德装饰。当被实施时,该套件将语言模型转入诊断立场,产生可检查的工件,包括标志、注释、矛盾图和暂停日志(FACS捆绑包),在AI输出和人类判断之间创建中间层。一个关键创新是认知暂停,这是一个由实践者启动的断路器,当超出依据时停止继续生成,并基于判断而非规则恢复。该方法还包括一个认知分诊协议(Epistemic Triage Protocol)和一个元治理层,以管理比例性,并将激活与关系问责、同意、历史背景和多元化保障相关联。与将对齐嵌入模型架构的内在主义方法(例如RLHF或认知完整性建议)不同,该套件作为脚手架在外部运行,将可弃用性和拒绝保留为保障而非失败。它保留了表现和理解之间的区别,使负责任的审议成为可能,同时保持认知上的谦逊。
摘要:Large Language Models (LLMs) generate fluent, plausible text that can mislead users into mistaking simulated coherence for genuine understanding. This paper introduces the Epistemic Suite, a post-foundational diagnostic methodology for surfacing the epistemic conditions under which AI outputs are produced and received. Rather than determining truth or falsity, the Suite operates through twenty diagnostic lenses, applied by practitioners as context warrants, to reveal patterns such as confidence laundering, narrative compression, displaced authority, and temporal drift. It is grounded in three design principles: diagnosing production before evaluating claims, preferring diagnostic traction over foundational settlement, and embedding reflexivity as a structural requirement rather than an ethical ornament. When enacted, the Suite shifts language models into a diagnostic stance, producing inspectable artifacts-flags, annotations, contradiction maps, and suspension logs (the FACS bundle)-that create an intermediary layer between AI output and human judgment. A key innovation is epistemic suspension, a practitioner-enacted circuit breaker that halts continuation when warrant is exceeded, with resumption based on judgment rather than rule. The methodology also includes an Epistemic Triage Protocol and a Meta-Governance Layer to manage proportionality and link activation to relational accountability, consent, historical context, and pluralism safeguards. Unlike internalist approaches that embed alignment into model architectures (e.g., RLHF or epistemic-integrity proposals), the Suite operates externally as scaffolding, preserving expendability and refusal as safeguards rather than failures. It preserves the distinction between performance and understanding, enabling accountable deliberation while maintaining epistemic modesty.


推理|分析|理解|解释(8篇)

【1】Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
标题:评估验证器在法律推理任务测试时缩放中的作用
链接:https://arxiv.org/abs/2510.25623

作者:Davide Romano, Jonathan Schwarz, Daniele Giofré
备注:Accepted to EMNLP - NLLP Workshop
摘要:测试时缩放(TTS)技术可以提高大型语言模型(LLM)的性能,但代价是额外的计算和延迟。虽然TTS在数学和编程等正式领域已被证明是有效的,但它在法律等论辩性领域的价值仍有待探索。我们在五个基准上对基于验证器的TTS方法在法律多项选择题问答(MCQA)中的表现进行了实证研究。使用一个包含7个奖励模型的家族,我们在现实的低$N$预算下评估了结果级(Best-of-$N$)和过程级(树搜索)的验证。我们的分析系统地研究了验证器效用如何受到关键属性的影响,例如领域专业化、模型大小和监督类型(过程监督的PRM与仅结果的ORM),即使在不同角色中应用时也是如此。
摘要:Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
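其中结果级验证(Best-of-$N$)的机制很简单:采样 $N$ 个候选答案,用奖励模型逐个打分,取分最高者。以下为一个极简示意(奖励函数为玩具假设,真实场景应换成训练好的 PRM/ORM):

```python
def best_of_n(candidates, reward_model):
    """结果级验证(Best-of-N):用奖励模型为 N 个候选答案打分,返回得分最高者。
    reward_model 为任意打分函数,此处仅作示意。"""
    return max(candidates, key=reward_model)

# 玩具奖励:偏好带有解释的更长答案(仅为演示,并非真实奖励模型)
answers = ["B", "A because ...", "A"]
best = best_of_n(answers, lambda a: len(a))
```

过程级验证(树搜索)则把打分移到每个推理步骤,逐步剪枝而非只比较最终答案。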


【2】Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
标题:Parrot:训练管道增强程序CoT和自然语言CoT以进行推理
链接:https://arxiv.org/abs/2510.25310

作者:Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
摘要:自然语言思想链(N-CoT)和程序思想链(P-CoT)已经成为大型语言模型(LLM)解决数学推理问题的两种主要范式。目前的研究通常致力于实现单向增强:P-CoT增强N-CoT,或N-CoT增强P-CoT。在本文中,我们试图充分释放这两种范式的优势以实现相互增强,并最终实现同步改进。我们对两种范式的错误类型进行了详细分析,并在此基础上提出了Parrot,一种面向数学问题的新型训练管道:1)三个针对性设计的子任务,集成顺序的P-CoT和N-CoT生成;2)一种促进自然语言语义可迁移性的子任务混合训练策略;3)转换后的N-CoT辅助奖励,旨在缓解P-CoT优化中的稀疏奖励。大量实验表明,Parrot显著提升了N-CoT和P-CoT的性能,尤其是N-CoT。使用Parrot SFT,LLaMA2和CodeLLaMA的N-CoT性能在MathQA上相对资源密集的RL基线分别取得了+21.87和+21.48的增益。
摘要:Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.


【3】Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR
标题:用于噪声鲁棒ASR的离散语音表示的可解释解纠缠
链接:https://arxiv.org/abs/2510.25150

作者:Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng
备注:Awarded Best Student Paper at APSIPA ASC 2025
摘要:离散音频表示由于其可解释性和与大型语言模型的兼容性而在语音建模中越来越受欢迎,但并不总是针对嘈杂或真实世界环境进行优化。基于将Whisper嵌入量化用于语音到单元建模的现有工作,我们提出在潜在空间中将语义语音内容从背景噪声中解耦。我们的端到端模型以码本令牌的形式分离干净语音,同时提取可解释的噪声向量作为量化残差,并通过轻量级分类器进行监督。我们表明,我们的方法改善了干净/含噪语音与文本之间的对齐,产生了表现出高度噪声不变性的语音令牌,并提高了ASR性能。在保持Whisper冻结的情况下,与Whisper相比错误率降低了82%,并且在VBDemand测试集上比基线方法提高了35%。进一步的分析表明,学习到的令牌空间能够很好地泛化到可见与不可见的声学条件。
摘要:Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise-invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.


【4】KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
标题:KnowCoder-A1:以结果监督激励面向KBQA的智能体推理能力
链接:https://arxiv.org/abs/2510.25101

作者:Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
摘要:知识库问答(KBQA)旨在通过结构化知识库(KB)回答自然语言问题。最近的工作通过采用智能体推理范式来改进KBQA:大型语言模型(LLM)迭代地分解问题,生成相应的逻辑查询,并与KB交互以获得答案。然而,这些方法通常在通过过程监督合成的推理轨迹上微调LLM,这对探索的激励较弱,因而未能加强智能体推理能力。在本文中,我们提出了KnowCoder-A1,一个可以在KB上自主执行智能体推理以获得答案的LLM。为了激励自主探索,KnowCoder-A1通过带有由易到难课程的多阶段课程强化学习,在仅结果监督下训练LLM。为了建立基础的智能体能力,KnowCoder-A1首先在通过基于结果的拒绝采样获得的一小组高质量轨迹上微调LLM。然后,为了缓解仅结果监督固有的奖励稀疏性,它应用奖励调度由易到难推进的多阶段课程强化学习。经过仅结果监督的训练,KnowCoder-A1表现出强大的推理行为,并在三个主流数据集上始终优于先前的方法。值得注意的是,在GrailQA的零样本子集上,KnowCoder-A1仅使用十二分之一的训练数据就实现了高达11.1%的相对改进,展示了强大的智能体推理能力。
摘要:Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
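冷启动阶段的"基于结果的拒绝采样"思想很直接:采样多条推理轨迹,只保留最终答案与标准答案一致的轨迹用于SFT。以下为一个极简示意(轨迹内容为假设的玩具数据):

```python
def rejection_sample(trajectories, gold):
    """基于结果的拒绝采样示意:仅保留最终答案与标准答案一致的轨迹。
    trajectories: [(推理轨迹文本, 最终答案)] 列表;gold: 标准答案。"""
    return [t for t in trajectories if t[1] == gold]

# 示例:两条轨迹,只有第一条的最终答案命中标准答案,被保留用于 SFT
kept = rejection_sample([("先查KB再聚合...", "42"), ("直接猜测...", "7")], "42")
```

保留下来的轨迹即构成摘要所述"高质量轨迹"的小规模微调集。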


【5】SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens
标题:SemCoT:通过语义对齐的隐式令牌加速思想链推理
链接:https://arxiv.org/abs/2510.24940

作者:Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li
摘要:思想链(CoT)推理的冗长性阻碍了其在效率关键型应用程序中的大规模部署。最近,隐式CoT方法已经出现,它在LLM的隐藏嵌入(称为“隐式推理”)而不是显式令牌中编码推理步骤。这种方法通过减少推理长度和绕过一些LLM组件来加速CoT。然而,现有的隐式CoT方法面临两个重大挑战:(1)它们不能保持隐式推理之间的语义对齐(当转换为自然语言时)和地面真理推理,导致CoT性能显著下降,以及(2)它们专注于减少隐式推理的长度;然而,他们忽略了一个LLM生成一个单独的隐式推理令牌的相当大的时间成本。为了解决这些挑战,我们提出了一种新的语义对齐的隐式CoT框架,称为SemCoT。特别是,对于第一个挑战,我们设计了一个经过对比训练的句子转换器(Transformer),用于评估隐式推理和显式推理之间的语义对齐,用于在隐式推理优化过程中加强语义保留。为了解决第二个挑战,我们引入了一个高效的隐式推理生成器,通过使用知识蒸馏来微调轻量级语言模型。这个生成器由我们的句子Transformer引导,将地面实况推理提炼成语义对齐的隐式推理,同时还优化了准确性。SemCoT是第一种通过联合优化令牌级生成速度并保持语义对齐与地面实况推理来提高CoT效率的方法。大量的实验证明了SemCoT在效率和有效性方面优于最先进的方法。我们的代码可以在https://github.com/YinhanHe123/SemCoT/上找到。
摘要:The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within LLM's hidden embeddings (termed ``implicit reasoning'') rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.


【6】COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations
标题:COMMUNITYNOTES:探索事实核查解释有用性的数据集
链接:https://arxiv.org/abs/2510.24810

作者:Rui Xing, Preslav Nakov, Timothy Baldwin, Jey Han Lau
摘要:X、Meta和TikTok等主要平台上的事实核查正在从专家驱动的验证转向基于社区的设置,用户可以贡献解释性笔记来澄清为什么帖子可能具有误导性。这里的一个重要挑战是确定一种解释是否有助于理解现实世界的主张,以及有用的原因何在,这在以前的研究中仍很大程度上未被充分探索。在实践中,由于社区标注缓慢,大多数社区笔记仍未发布,而且有用性的原因缺乏明确定义。为了弥合这些差距,我们引入了同时预测解释性笔记的有用性及其原因的任务。我们提出了COMMUNITYNOTES,一个包含104k帖子及用户提供的笔记与有用性标签的大规模多语言数据集。我们进一步提出了一个框架,通过自动提示优化来自动生成和改进原因定义,并将其集成到预测中。我们的实验表明,优化后的定义可以同时改善有用性预测和原因预测。最后,我们表明有用性信息对现有的事实核查系统是有益的。
摘要:Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems.


【7】Fortytwo: Swarm Inference with Peer-Ranked Consensus
标题:Fortytwo:具有同行排名共识的群体推理
链接:https://arxiv.org/abs/2510.24801

作者:Vladyslav Larin, Ihor Naumenko, Aleksei Ivashov, Ivan Nikitin, Alexander Firsov
摘要:随着集中式人工智能达到计算上限,且越来越大的训练运行的回报递减,满足需求需要一个在容量和能力两方面都能水平扩展的推理层。我们提出了Fortytwo,一种利用群体智能原理和分布式成对排名共识来实现卓越AI推理性能的新协议。我们的方法使用群体推理重新构想了AI节点之间的协作:跨异构模型的、由同行排名与声誉加权的共识,浮现出最高质量的响应。使用带有定制Bradley-Terry风格聚合模型的成对排名,我们证明群体推理的性能大大优于多数投票,在GPQA Diamond上达到85.90%,而相同模型集合下的多数投票为68.69%,提高了+17.21个百分点(相对约+25.1%)。该协议结合了链上声誉,因此节点影响力随时间适应其展现出的准确性,从而形成一种可以过滤低质量或恶意参与者的精英共识。为了抵御Sybil攻击,Fortytwo在其共识中采用了能力证明:节点必须成功完成校准/测试请求并质押声誉才能进入排名轮次,使多身份攻击在经济上没有吸引力,同时保持开放性。在包括GPQA Diamond、LiveCodeBench和AIME在内的六个具有挑战性的基准测试中,我们的评估表明其具有更高的准确性,以及对对抗性和含噪自由形式提示的强大韧性(例如,提示注入造成的性能退化仅为0.12%,而单体单模型基线为6.20%),同时保持实际可部署性。总之,这些结果为去中心化AI系统奠定了基础:通过集体智能使高质量推理的获取民主化,而不牺牲可靠性或安全性。
摘要:As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set - an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems - democratizing access to high-quality inference through collective intelligence without sacrificing reliability or security.
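Bradley-Terry 模型由成对比较的胜负记录估计每个候选的"强度",强度最高者即共识答案。以下用经典的极大似然迭代(MM算法)给出一个最小示意;论文使用的是定制的 Bradley-Terry 风格聚合并叠加声誉加权,此处仅演示基本形式,胜负矩阵为假设的玩具数据:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """由成对胜负矩阵估计 Bradley-Terry 强度参数(经典 MM 迭代)。
    wins[i, j] = i 战胜 j 的次数。返回归一化的强度向量,可用于答案排名。"""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # i 的总胜场
            # 分母:与每个对手的总比较次数按当前强度加权
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()  # 归一化,消除尺度不定性
    return p

# 3 个候选答案的玩具胜负矩阵:0 号在多数成对比较中获胜
wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]], dtype=float)
strength = bradley_terry(wins)
```

在群体推理中,各节点提交的成对排名先累加成这样的胜负矩阵,再聚合出最终排序。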


【8】MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models
标题:MR-Align:大型推理模型的元推理知情事实对齐
链接:https://arxiv.org/abs/2510.24794

作者:Xinming Wang, Jian Xu, Bin Yu, Sheng Lian, Hongzhu Yi, Yi Chen, Yingjian Zhu, Boran Wang, Hongming Yang, Han Hu, Xu-Yao Zhang, Cheng-Lin Liu
备注:Preprint
摘要:大型推理模型(LRM)在复杂推理中表现出很强的能力,但它们在依赖证据的事实性问题上的边际收益有限。我们发现这种限制部分归因于推理-答案命中差距:模型在推理过程中识别出正确的事实,但未能将其纳入最终响应,从而降低了事实保真度。为了解决这个问题,我们提出了MR-ALIGN,一个元推理知情的对齐框架,它在不依赖外部验证器的情况下提高事实性。MR-ALIGN量化模型思维过程中的状态转移概率,并构建一个转移感知的隐式奖励,在原子思维片段上强化有益的推理模式,同时抑制有缺陷的模式。这种重加权将令牌级信号重塑为概率感知的片段分数,鼓励更有利于事实正确性的连贯推理轨迹。对四个事实性QA数据集和一个长文本事实性基准的实证评估表明,MR-ALIGN在减少误导性推理的同时,始终提高了准确性和真实性。这些结果突出表明,对齐推理过程本身,而不仅仅是输出,是推进LRM事实性的关键。
摘要:Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.


检测相关(1篇)

【1】Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
标题:教授讽刺:通过蒸馏对参数高效的学生进行Few-Shot多模式讽刺检测
链接:https://arxiv.org/abs/2510.25303

作者:Soumyadeep Jana, Sanasam Ranbir Singh
摘要:多模态讽刺检测具有挑战性,特别是在低资源环境中:由于缺乏标注数据,难以学习微妙的图文矛盾,这阻碍了模型的性能。适配器、LoRA和提示调优等参数高效微调(PEFT)方法可以减少过拟合,但由于来自少样本数据的监督有限,难以达到最佳性能。我们提出了PEKD,一个统一的框架,它通过从在大规模讽刺数据上训练的专家模型(充当教师)进行蒸馏来增强PEFT方法。为了减轻来自教师的不可靠信号,我们引入了一种熵感知的门控机制,该机制基于教师的置信度动态调整蒸馏强度。在两个公共数据集上的实验表明,我们的PEKD框架使PEFT方法的性能优于先前的参数高效方法和大型多模态模型,在少样本场景中取得了很好的效果。该框架是模块化的,适用于各种多模态模型和任务。
摘要:Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
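"熵感知门控"的直觉是:教师输出分布的熵越高(越不自信),该样本的蒸馏权重就越低。以下是这一机制的一个假设性最小示意,采用"1 减归一化熵"作为权重,具体函数形式以原文为准:

```python
import numpy as np

def distill_weight(teacher_probs, max_weight=1.0):
    """熵感知门控的示意:按教师分布的归一化熵线性衰减蒸馏权重。
    教师完全自信时权重接近 max_weight,接近均匀分布时权重趋于 0。"""
    p = np.asarray(teacher_probs, dtype=float)
    entropy = -(p * np.log(p + 1e-12)).sum()   # 教师分布的香农熵
    max_entropy = np.log(len(p))               # 均匀分布时的最大熵
    return max_weight * (1.0 - entropy / max_entropy)

w_confident = distill_weight([0.97, 0.01, 0.01, 0.01])  # 自信的教师
w_uncertain = distill_weight([0.25, 0.25, 0.25, 0.25])  # 不自信的教师
```

训练时将该权重乘到每个样本的蒸馏损失上,即可抑制教师的不可靠信号。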


Zero/Few/One-Shot|迁移|自适应(1篇)

【1】Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments
标题:DingTalk DeepResearch:企业环境中自适应智能的统一多代理框架
链接:https://arxiv.org/abs/2510.24760

作者:Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu
摘要:我们介绍了Dingtalk DeepResearch,这是一个用于现实世界企业环境的统一多代理智能框架,提供深入研究,异构表推理和多模式报告生成。
摘要:We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.


Word2Vec|文本|单词(2篇)

【1】SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
标题:SwiftEmbed:通过静态令牌查找实现面向实时应用的超快速文本嵌入
链接:https://arxiv.org/abs/2510.24793

作者:Edouard Lansiaux
摘要:我们提出了一种用于文本嵌入生成的静态令牌查找方法,单条文本嵌入的p50延迟为1.12毫秒,同时在8个代表性任务上保持60.6的MTEB平均分,相当于上下文模型质量的89%。Rust实现通过静态嵌入查找、优化的平均池化和零拷贝IEEE 754二进制序列化,提供每秒50,000个请求的吞吐量。评估表明其具有出色的重复检测性能(90.1% AP)、强大的语义相似性(76.1%斯皮尔曼相关性),以及在各专业领域中相当于基线75%到131%的领域特定性能。该系统支持低于5毫秒延迟至关重要的实时嵌入应用。
摘要:We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critical.
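"静态令牌查找 + 平均池化"之所以快,是因为嵌入生成退化为查表和取均值,完全不需要前向传播。以下为该思路的最小示意(词表与向量均为假设的玩具数据;真实系统还有 Rust 实现、零拷贝 IEEE 754 序列化等工程优化):

```python
import numpy as np

# 静态查找表:每个词预先对应一个固定向量(此处为随机玩具向量)
rng = np.random.default_rng(0)
VOCAB = {"hello": 0, "world": 1, "fast": 2}
TABLE = rng.normal(size=(len(VOCAB), 4)).astype(np.float32)

def embed(text):
    """查表取出各令牌向量并做平均池化;无模型前向传播,延迟极低。"""
    ids = [VOCAB[t] for t in text.lower().split() if t in VOCAB]
    if not ids:
        return np.zeros(TABLE.shape[1], dtype=np.float32)
    return TABLE[ids].mean(axis=0)

vec = embed("hello world")
```

代价是失去上下文敏感性,这正是其质量约为上下文模型 89% 的原因。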


【2】Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation
标题:Falcon:面向企业级评估的全面中文文本到SQL基准
链接:https://arxiv.org/abs/2510.24762

作者:Wenzhen Luo, Wei Guan, Yifan Yao, Yimin Pan, Feng Wang, Zhipeng Yu, Zhe Wen, Liang Chen, Yihong Zhuang
摘要:我们介绍Falcon,一个基于企业兼容方言(MaxCompute/Hive)的跨领域中文文本到SQL基准。它包含28个数据库上的600个中文问题;77%需要多表推理,超过一半涉及四个以上的表。每个例子都按SQL计算特征和中文语义进行了标注。为了进行评估,我们发布了一个强大的执行比较器和一个自动评估管道;在该评估下,所有当前最先进的大规模模型(包括Deepseek)的准确率均不超过50%。主要错误来自两个来源:(1)大型企业环境中的模式链接:数百个表、非规范化字段、模糊的列名、隐式外键关系和特定领域的同义词,使得正确的连接/列选择变得困难;(2)将简洁、口语化的中文映射到分析所需的确切运算符和谓词:例如,选择正确的聚合和分组键,表达时间窗口和粒度,应用单位转换,处理NULL和数据质量规则,以及构造嵌套或窗口子查询。因此,Falcon针对中文特有的语义和企业方言(缩写、业务术语、模糊实体引用),并通过使用现实的企业模式、查询模板、执行比较器和端到端验证的自动评估管道,在全面生产部署之前提供一个可复现的中间地带。
摘要:We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
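文本到SQL的"执行比较器"通常不比较SQL文本,而是比较两条查询的执行结果:无 ORDER BY 时按多重集(忽略行序)比较,有 ORDER BY 时按序比较。以下为这一思想的极简示意(真实比较器还需处理类型规整、NULL、浮点容差等,此处省略):

```python
from collections import Counter

def results_match(rows_a, rows_b, ordered=False):
    """执行比较器示意:按多重集或按序比较两个查询的结果行。"""
    if ordered:
        return list(rows_a) == list(rows_b)
    # 无序比较:行序不同视为等价,但重复行的数目必须一致
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))

ok = results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])   # 仅行序不同,等价
bad = results_match([(1, "a")], [(1, "a"), (1, "a")])            # 重复行数不同,不等价
```

这种按执行结果判等的方式,使语法不同但语义等价的SQL也能被判为正确。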


其他神经网络|深度学习|模型|建模(5篇)

【1】Hybrid Quantum-Classical Recurrent Neural Networks
标题:混合量子-经典循环神经网络
链接:https://arxiv.org/abs/2510.25557

作者:Wenduan Xu
摘要:我们提出了一种混合量子-经典递归神经网络(QRNN)架构,其中整个递归核心被实现为由经典前馈网络控制的参数化量子电路(PQC)。隐藏态是一个$n$-qubit PQC的量子态,驻留在一个指数级大的希尔伯特空间$\mathbb{C}^{2^n}$中。PQC是通过构造酉的,使得隐态演化保持范数而不受外部约束。在每个时间步,中间电路读出与输入嵌入相结合,并由前馈网络处理,这提供了显式的经典非线性。输出参数化PQC,其通过酉动态更新隐藏状态。QRNN是紧凑的,物理上一致的,它统一了(i)作为高容量存储器的酉递归,(ii)通过中间电路测量的部分观测,以及(iii)用于输入条件参数化的非线性经典控制。我们在情感分析、MNIST、置换MNIST、复制记忆和语言建模上使用多达14个量子位来评估模拟模型,采用投影测量作为限制情况,以获得中间电路读数,同时保持相干的递归量子记忆。我们进一步设计了一个软注意机制的中间电路读出序列到序列模型,并显示其有效性的机器翻译。据我们所知,这是第一个基于量子运算的模型(RNN或其他模型),可以在广泛的序列学习任务中实现与强大的经典基线相比具有竞争力的性能。
摘要:We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
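摘要所述的循环核心(经典前馈控制器输出PQC角度、酉演化更新隐藏态、中间电路读出反馈给控制器)可以用下面的numpy小实验示意。这里用单层RY旋转代替真实的PQC,控制器权重为假设的随机值,仅展示"酉递归保范数 + 经典非线性控制"的结构,并非原文实现。

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2                       # 量子比特数,隐藏态维度为 2**n
dim = 2 ** n

def ry(theta):
    """单量子比特 RY 旋转门。"""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def step_unitary(angles):
    """由各量子比特的 RY 角度张量积出一个酉矩阵(真实 PQC 会更深、含纠缠门)。"""
    U = np.array([[1.0]])
    for th in angles:
        U = np.kron(U, ry(th))
    return U

def z_readout(state):
    """每个量子比特上 Pauli-Z 的期望值,充当"中间电路读出"。"""
    probs = np.abs(state) ** 2
    exps = []
    for q in range(n):
        sign = np.array([1.0 if (i >> (n - 1 - q)) & 1 == 0 else -1.0
                         for i in range(dim)])
        exps.append(float(probs @ sign))
    return np.array(exps)

# 经典前馈"控制器":把 [读出, 输入] 映射为 PQC 角度(权重为假设的随机值)
W = rng.normal(size=(n, n + 1))

state = np.zeros(dim); state[0] = 1.0   # 初始隐藏态 |0...0>
for x in [0.3, -1.2, 0.7]:              # 一条玩具输入序列
    feats = np.concatenate([z_readout(state), [x]])
    angles = np.tanh(W @ feats)         # 经典非线性
    state = step_unitary(angles) @ state  # 酉演化,自动保范数
    print(np.round(z_readout(state), 3))
```

由于每步更新都是酉矩阵乘法,隐藏态范数天然保持为1,这正是摘要强调的"无需外部约束的保范演化"。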


【2】Model-Document Protocol for AI Search
标题:人工智能搜索的模型文档协议
链接:https://arxiv.org/abs/2510.25160

作者:Hongjin Qian, Zheng Liu
备注:10 pages
摘要:人工智能搜索依赖于将大型语言模型(LLM)与海量外部知识源连接起来。然而,网页、PDF文件和其他原始文档并非天生适合LLM:它们冗长、有噪声且缺乏结构。传统的检索方法将这些文档视为逐字文本并返回原始段落,把片段拼装和上下文推理的负担留给LLM。这一差距凸显了对新检索范式的需求,以重新定义模型与文档的交互方式。   我们介绍了模型-文档协议(MDP),一个通用框架,形式化地刻画了原始文本如何通过可消费的知识表示桥接到LLM。MDP不把检索视为段落抓取,而是定义了将非结构化文档转换为任务特定、LLM就绪输入的多条路径。其中包括:代理推理,将原始证据整理为连贯的上下文;记忆锚定,积累可复用的笔记以丰富推理;结构化利用,将文档编码为图或键值缓存等形式化表示。这三条路径共享同一目标:确保到达LLM的不是原始片段,而是可直接用于推理的紧凑、结构化知识。   作为一个实例,我们提出了MDP-Agent,它通过一个代理过程实现该协议:构建覆盖全局的文档级要点记忆,执行基于扩散的探索与纵向挖掘以发现层级依赖关系,并应用map-reduce式综合将大规模证据整合为紧凑而充分的上下文。在信息检索基准上的实验表明,MDP-Agent优于基线方法,同时验证了MDP框架的合理性及其代理实例化的有效性。
摘要:AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents.   We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning.   As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
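MDP-Agent的"map-reduce式综合"路径大意可如下示意:map阶段把每个文档片段压缩成要点,reduce阶段去重合并为紧凑上下文。此处用首句代替LLM生成的要点,文档内容为虚构示例,仅为假设性简化。

```python
def map_gist(chunk):
    """Map阶段:把每个文档片段压缩成一条要点(此处用首句充当,真实系统由LLM生成)。"""
    return chunk.split("。")[0] + "。"

def reduce_merge(gists, budget=3):
    """Reduce阶段:去重合并要点,并截断到紧凑的上下文预算内。"""
    seen, merged = set(), []
    for g in gists:
        if g not in seen:
            seen.add(g)
            merged.append(g)
    return "".join(merged[:budget])

docs = ["LLM需要外部知识。原始网页又长又杂。",
        "LLM需要外部知识。PDF缺乏结构。",
        "检索应返回可直接推理的知识。碎片拼装负担不应留给模型。"]
context = reduce_merge([map_gist(d) for d in docs])
print(context)  # 两条重复要点被合并为一条
```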


【3】POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
标题:POWSM:面向语音学的开放Whisper风格语音基础模型
链接:https://arxiv.org/abs/2510.24992

作者:Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe
备注:14 pages, under review
摘要:口语处理的最新进展推动了语音学任务的实质性进步,例如自动语音识别(ASR)、音子识别(PR)、字素到音素转换(G2P)和音素到字素转换(P2G)。尽管这些任务在概念上相似,但它们在很大程度上被孤立地研究,各自依赖特定于任务的架构和数据集。在本文中,我们介绍了POWSM(Phonetic Open Whisper-style Speech Model),这是第一个能够联合执行多个音素相关任务的统一框架。POWSM支持音频、文本(字素)和音素之间的无缝转换,为通用和低资源语音处理开辟了新的可能性。我们的模型优于或匹配类似规模的专用PR模型(Wav2Vec2Phoneme和ZIPA),同时联合支持G2P、P2G和ASR。我们发布了训练数据、代码和模型,以促进开放科学。
摘要:Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.


【4】Beyond Models: A Framework for Contextual and Cultural Intelligence in African AI Deployment
标题:超越模型:非洲人工智能部署中的情境与文化智能框架
链接:https://arxiv.org/abs/2510.24729

作者:Qness Ndlovu
备注:25 pages, 4 tables. Production validation with 602 users across Zimbabwe-South Africa diaspora corridor
摘要:虽然全球人工智能开发优先考虑模型性能和计算规模,但要在非洲市场进行有意义的部署,需要做出根本不同的架构决策。本文介绍了情境与文化智能(CCI):一个系统性框架,使人工智能系统能够通过本地相关、情感智能和经济包容的设计来处理文化意义,而不仅仅是数据模式。我们采用设计科学方法,通过一个服务散居社区的生产级AI原生跨境购物平台验证CCI。主要实证结果:89%的用户更偏好基于WhatsApp的AI交互而非传统Web界面(n=602,卡方=365.8,p<0.001),在短短6周内获得536个WhatsApp用户,602个独立用户共产生3,938次对话;基于文化背景的提示工程对文化语境化查询表现出深入理解,其中89%为以家庭为中心的商业模式,并自然接受语码转换。CCI框架落实为三大技术支柱:基础设施智能(移动优先、弹性架构)、文化智能(具有社会语境意识的多语言NLP)和商业智能(基于信任的对话式商务)。这项工作既贡献了理论创新,也提供了可复现的实施模式,在挑战硅谷设计正统的同时,为资源受限市场中公平的人工智能部署提供了可操作的框架。
摘要:While global AI development prioritizes model performance and computational scale, meaningful deployment in African markets requires fundamentally different architectural decisions. This paper introduces Contextual and Cultural Intelligence (CCI) -- a systematic framework enabling AI systems to process cultural meaning, not just data patterns, through locally relevant, emotionally intelligent, and economically inclusive design. Using design science methodology, we validate CCI through a production AI-native cross-border shopping platform serving diaspora communities. Key empirical findings: 89% of users prefer WhatsApp-based AI interaction over traditional web interfaces (n=602, chi-square=365.8, p<0.001), achieving 536 WhatsApp users and 3,938 total conversations across 602 unique users in just 6 weeks, and culturally informed prompt engineering demonstrates sophisticated understanding of culturally contextualized queries, with 89% family-focused commerce patterns and natural code-switching acceptance. The CCI framework operationalizes three technical pillars: Infrastructure Intelligence (mobile-first, resilient architectures), Cultural Intelligence (multilingual NLP with social context awareness), and Commercial Intelligence (trust-based conversational commerce). This work contributes both theoretical innovation and reproducible implementation patterns, challenging Silicon Valley design orthodoxies while providing actionable frameworks for equitable AI deployment across resource-constrained markets.


【5】Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models
标题:迷失在发声中:作为语音基础模型评估维度的嗓音质量变化
链接:https://arxiv.org/abs/2510.25577

作者:Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, Éva Székely
备注:8 pages, 3 figures, 4 tables, submitted to LREC 2026
摘要:语音基础模型(SFM)的最新进展使得直接从原始音频处理口语成为可能,绕过了中间文本表示。这种能力使SFM能够接触到并可能响应输入语音信号中蕴含的丰富副语言变化。副语言变异中一个研究不足的维度是嗓音质量,包括嘎裂声(creaky voice)和气声(breathy voice)等发声类型。已知这些发声类型会影响听者如何从言语中推断情感状态、立场和社会意义。现有的语音理解基准在很大程度上依赖多项选择问答(MCQA)格式,这种格式容易失效,因而难以可靠地捕捉副语言特征影响模型行为的细微方式。在本文中,我们通过开放式生成任务和语音情感识别来探测SFM,评估模型行为在不同发声类型输入下是否一致。我们引入了一个新的平行数据集,其中包含对嗓音质量的合成修改,旨在评估SFM对嘎裂声和气声的响应。我们的工作首次检验了SFM对言语感知中这些特定非词汇方面的敏感性。
摘要:Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.


其他(17篇)

【1】Task Completion Agents are Not Ideal Collaborators
标题:任务完成代理不是理想的协作者
链接:https://arxiv.org/abs/2510.25744

作者:Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
备注:22 pages, 5 figures, 3 tables
摘要:目前对智能体的评估仍然集中在一次性任务完成上,未能考虑许多现实世界问题固有的迭代和协作性质:在这些问题中,人类的目标往往规格不足且不断演化。我们主张从构建和评估任务完成代理转向开发协作代理,其评估不仅看最终输出的质量,还要看它们在整个问题求解过程中如何参与并增强人类的努力。为了支持这一转变,我们引入了协作努力扩展(collaborative effort scaling),一个刻画代理效用如何随用户参与度增加而增长的框架。通过案例研究和模拟评估,我们表明最先进的代理在多轮、真实世界的场景中往往表现不佳,揭示了代理设计中缺失的要素:维持参与并为用户理解提供支架的能力。协作努力扩展为诊断代理行为和引导开发更有效的交互提供了一个视角。
摘要:Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.


【2】The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
标题:工具十项全能:在多样化、现实且长程的任务执行上对语言代理进行基准测试
链接:https://arxiv.org/abs/2510.25726

作者:Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
备注:Website: this https URL
摘要:真实世界的语言代理必须处理跨各种应用程序的复杂多步骤工作流。例如,代理可以通过与日历和文件系统协调来管理电子邮件,或者监控生产数据库以检测异常并按照操作手册生成报告。然而,现有的语言代理基准往往聚焦于狭窄的领域或简化的任务,缺乏评估代理真实世界表现所需的多样性、现实性和长程复杂性。为了解决这一差距,我们引入了工具十项全能(简称Toolathlon),一个为语言代理提供多样化应用与工具、现实环境设置和可靠的基于执行的评估的基准。Toolathlon涵盖32个软件应用和604个工具,从Google Calendar和Notion等日常平台到WooCommerce、Kubernetes和BigQuery等专业平台。大多数工具基于一组高质量的模型上下文协议(MCP)服务器,其中部分由我们修订或自行实现。不同于以往工作主要保证功能真实性但环境状态多样性有限,我们提供了来自真实软件的现实初始环境状态,例如有数十名学生的Canvas课程或真实的财务电子表格。该基准共包含108个人工收集或精心构造的任务,平均需要与多个应用交互约20轮才能完成。每个任务都可以通过专门的评估脚本严格验证。对SOTA模型的综合评估凸显了它们的显著不足:表现最好的模型Claude-4.5-Sonnet仅取得38.6%的成功率,平均需20.2轮工具调用,而最佳开放权重模型DeepSeek-V3.2-Exp仅达到20.1%。我们希望Toolathlon能推动更有能力的语言代理的开发,以实现真实世界的长程任务执行。
摘要:Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
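摘要提到每个任务"可通过专门的评估脚本严格验证",其基本模式可粗略示意如下:校验脚本是针对最终环境状态的一组断言,全部通过才判定任务成功。状态字段与校验条件均为假设示例,并非Toolathlon的真实接口。

```python
def verify_task(env_state, checks):
    """基于执行的校验:每个check是对最终环境状态的断言函数,全部通过才算任务成功。"""
    return all(check(env_state) for check in checks)

# 假设的最终环境状态与校验条件(字段名纯属示意)
state = {"calendar": ["weekly sync"], "notion_pages": 3, "emails_sent": 2}
checks = [
    lambda s: "weekly sync" in s["calendar"],   # 日历中已创建会议
    lambda s: s["emails_sent"] >= 1,            # 至少发出一封邮件
]
print(verify_task(state, checks))  # True
```

这种基于最终状态断言的验证方式,与只比对模型输出文本相比,对执行路径不敏感,因而更可靠。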


【3】Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
标题:软件工程代理环境配置的过程级轨迹评估
链接:https://arxiv.org/abs/2510.25694

作者:Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu
摘要:基于大型语言模型的代理在软件工程方面展现出潜力,但由于繁重的手工工作和大规模高质量数据集的稀缺,环境配置仍然是一个瓶颈。现有基准只评估端到端的构建/测试成功与否,掩盖了代理在哪里以及为什么成功或失败。我们介绍了环境配置诊断基准Enconda-bench,它对代理在环境设置过程中的细粒度能力进行过程级轨迹评估:规划、感知驱动的错误诊断、反馈驱动的修复,以及执行最终环境配置的行动。我们的任务实例通过注入真实的README错误自动构建,并在Docker中验证,以实现可扩展的高质量评估。Enconda-bench将过程级分析与端到端可执行性相结合,使能力评估超越总体成功率。对最先进的LLM和代理框架的评估表明,虽然代理可以定位错误,但它们难以将反馈转化为有效的纠正,从而限制了端到端性能。据我们所知,Enconda-bench是第一个为环境配置提供过程级内部能力评估的框架,为改进软件工程代理提供了可操作的见解。
摘要:Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.


【4】More than a Moment: Towards Coherent Sequences of Audio Descriptions
标题:不止一瞬间:迈向音频描述的连贯序列
链接:https://arxiv.org/abs/2510.25440

作者:Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
摘要:音频描述(AD)传达重要的屏幕上信息,使视障观众能够跟随视频。要做到有效,AD必须形成一个连贯的序列,帮助听众想象正在展开的场景,而不是描述孤立的瞬间。然而,大多数自动方法独立生成每条AD,往往导致重复、不连贯的描述。为了解决这个问题,我们提出了一种无需训练的方法CoherentAD:先为每个AD时间区间生成多个候选描述,再在整个序列上进行自回归选择,以形成连贯且信息丰富的叙事。为了整体评估AD序列,我们引入了序列级指标StoryRecall,衡量预测的AD对真实叙事的传达程度,并辅以捕捉连续AD输出冗余的重复度指标。我们的方法生成的AD序列连贯且增强了叙事理解,优于依赖独立生成的先前方法。
摘要:Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
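CoherentAD的自回归选择步骤可以用如下玩具示意理解:每个时间区间从多个候选描述中,选出与已选叙事最"连贯"的一条。这里用词袋余弦相似度充当连贯性打分(真实系统的打分远更复杂),候选句子为虚构示例。

```python
import math
from collections import Counter

def cos_sim(a, b):
    """词袋余弦相似度,作为"连贯性"的粗略代理打分。"""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_coherent(candidates_per_interval):
    """自回归选择:每个时间区间选出与已有叙事最连贯的候选描述。"""
    narrative = []
    for candidates in candidates_per_interval:
        if not narrative:
            narrative.append(candidates[0])  # 首段无上文,取第一个候选
            continue
        context = " ".join(narrative)
        best = max(candidates, key=lambda c: cos_sim(context, c))
        narrative.append(best)
    return narrative

intervals = [
    ["A man enters the dark kitchen ."],
    ["A dog barks outside .", "He switches on the kitchen light ."],
]
print(select_coherent(intervals))
```

第二个区间中,与厨房场景用词重叠更多的候选被选中,体现了"延续叙事"而非孤立描述的思想。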


【5】RLMEval: Evaluating Research-Level Neural Theorem Proving
标题:RL MEval:评估研究级神经定理证明
链接:https://arxiv.org/abs/2510.25427

作者:Auguste Poiroux, Antoine Bosselut, Viktor Kunčak
备注:Accepted to EMNLP 2025 Findings. RLMEval benchmark released: this https URL
摘要:尽管在精心策划的基准上取得了令人印象深刻的结果,但大型语言模型(LLM)对研究级神经定理证明和证明自动形式化的实际影响仍然有限。我们介绍了RLMEval,一个面向这些任务的评估套件,聚焦于来自真实世界Lean形式化项目的研究级数学。RLMEval利用真实的Lean Blueprint形式化项目,针对具有挑战性的研究级定理评估神经定理证明和证明自动形式化。我们在RLMEval上对最先进模型进行了评估,该基准包含来自6个Lean项目的613个定理,结果揭示了显著差距:现有基准上的进展并不能轻易迁移到这些更现实的设定中,最好的模型也仅取得10.3%的通过率。RLMEval提供了一个新的、具有挑战性的基准,旨在引导并加速形式数学自动推理的进展。
摘要:Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.


【6】Serve Programs, Not Prompts
标题:服务程序,而非提示
链接:https://arxiv.org/abs/2510.25412

作者:In Gim, Lin Zhong
备注:HotOS 2025. Follow-up implementation work (SOSP 2025) is available at https://arxiv.org/abs/2510.24051
摘要:当前的大型语言模型(LLM)服务系统主要为文本补全而设计,由于设计缺乏灵活性,面对日益复杂的LLM应用既不高效也难以适配。为了解决这个问题,我们提出了一种服务程序而非提示的新LLM服务系统架构。这些程序称为LLM推理程序(LIP),允许用户在运行时自定义词元预测和KV缓存管理,并将其应用逻辑的一部分(如工具执行)卸载到服务器。我们通过一个名为Symphony的系统描述了这种架构的一个实例,它充当LIP的操作系统。Symphony通过系统调用暴露LLM模型计算,并用专用文件系统虚拟化KV缓存,同时通过两级进程调度方案保证GPU效率。Symphony有望为LLM应用开启一个更高效、更可扩展的生态系统。
摘要:Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.


【7】BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
标题:BhashaBench V1:面向印度知识领域四象限的综合基准
链接:https://arxiv.org/abs/2510.25409

作者:Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
摘要:大型语言模型(LLM)的快速发展加剧了对特定领域和特定文化评估的需求。现有的基准在很大程度上以英语为中心且领域无关,限制了它们对以印度为中心的语境的适用性。为了解决这一差距,我们引入了BhashaBench V1,这是第一个聚焦关键印度知识体系的特定领域、多任务、双语基准。BhashaBench V1包含74,166个精心策划的问答对,其中52,494个为英语,21,672个为印地语,来源于真实的政府和特定领域考试。它涵盖四个主要领域:农业、法律、金融和阿育吠陀,包括90多个子领域,覆盖500多个主题,可实现细粒度评估。对29+个LLM的评估揭示了显著的领域和语言特定性能差距,低资源领域的差距尤其大。例如,GPT-4o在法律领域达到76.49%的整体准确率,但在阿育吠陀领域只有59.74%。在所有领域中,模型在英语内容上的表现始终优于印地语。子领域级分析表明,网络法、国际金融等领域表现相对较好,而Panchakarma、种子科学和人权领域仍明显薄弱。BhashaBench V1为评估大型语言模型在印度多样化知识领域的能力提供了一个全面的数据集,可用于评估模型将特定领域知识与双语理解相结合的能力。所有代码、基准和资源均公开可用,以支持开放研究。
摘要:The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.


【8】Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy
标题:书目推荐中的幻觉:引用频率作为训练数据冗余度的替代指标
链接:https://arxiv.org/abs/2510.25378

作者:Junichiro Niimi
摘要:大型语言模型(LLM)已被越来越多地应用于从自然语言理解到代码生成的各种任务。虽然它们也被用于辅助书目推荐,但对不存在论文的幻觉仍然是一个主要问题。在先前研究的基础上,本研究假设LLM正确生成书目信息的能力取决于底层知识是生成的还是记忆的:高被引论文(即在训练语料库中出现更频繁的论文)表现出更低的幻觉率。因此,我们将引用数视为训练数据冗余度的替代指标(即给定书目记录在预训练语料库中被重复表示的频率),并研究引用频率如何影响LLM输出中的幻觉参考文献。我们使用GPT-4.1在20个计算机科学领域生成并人工验证了100条书目记录,并通过生成元数据与真实元数据之间的余弦相似度来度量事实一致性。结果显示:(i)幻觉率因研究领域而异;(ii)引用数与事实准确性密切相关;(iii)书目信息在引用数超过约1,000次后几乎被逐字记住。这些发现表明,高被引论文几乎被逐字保留在模型中,提示存在一个从泛化转向记忆的阈值。
摘要:Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.
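原文通过生成元数据与真实元数据的余弦相似度度量事实一致性,并考察其与引用数的相关性。下面给出词袋余弦相似度和皮尔逊相关系数的最小实现示意;其中的引用数/一致性数值纯属假设,仅演示计算方式,与原文结果无关。

```python
import math
from collections import Counter

def meta_cosine(gen, ref):
    """生成书目与真实书目之间的词袋余弦相似度(充当事实一致性指标)。"""
    cg, cr = Counter(gen.lower().split()), Counter(ref.lower().split())
    dot = sum(cg[w] * cr[w] for w in cg)
    ng = math.sqrt(sum(v * v for v in cg.values()))
    nr = math.sqrt(sum(v * v for v in cr.values()))
    return dot / (ng * nr) if ng and nr else 0.0

def pearson(pairs):
    """(引用数, 一致性) 对的皮尔逊相关系数。"""
    n = len(pairs)
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 玩具数据:引用数与一致性(数值为假设,仅演示正相关趋势的计算)
records = [(12, 0.41), (150, 0.63), (900, 0.82), (5000, 0.98)]
print(round(pearson(records), 3))
```

实际研究中,引用数跨度很大,通常会先做对数变换或改用秩相关(如Spearman)再计算相关性。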


【9】CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs
标题:CLASS-IT:面向BabyLM的对话式与讲座对齐的小规模指令微调
链接:https://arxiv.org/abs/2510.25364

作者:Luca Capone, Alessandro Bondielli, Alessandro Lenci
备注:Paper accepted for oral presentation at the BabyLM Challange 2025 (EMNLP2025)
摘要:本工作研究小规模语言模型能否从指令微调中受益。我们比较了对话式与问答式指令微调数据集,分别以合并或顺序课程的方式应用于具有100M和140M参数的仅解码器模型。评估涵盖微调(SuperGLUE)和zero-shot(BLiMP、EWoK、WUGs、实体跟踪和心理语言学相关性)两种设置。结果表明,指令微调在微调场景中带来了小而一致的增益,顺序课程优于合并数据;然而,这些改进并未一致地迁移到zero-shot任务,表明以交互为中心的适应与广泛的语言泛化之间存在权衡。这些结果既凸显了将人类启发的学习策略适配到低资源语言模型的潜力与限制,也指向了在生态训练限制下增强泛化的基于课程的混合方法。
摘要:This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.


【10】CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories
标题:CRMWeaver:通过智能体强化学习和共享记忆构建强大的业务代理
链接:https://arxiv.org/abs/2510.25333

作者:Yilong Lai, Yipin Yang, Jialong Wu, Fengran Mo, Zhenglin Wang, Ting Liang, Jianguo Lin, Keping Yang
摘要:近年来,基于LLM的智能体发展迅速,为使用语言代理解决复杂的现实问题提供了新的思路。一个突出的应用是业务代理,它通过工具调用与数据库和内部知识库交互,以满足多样的用户需求。然而,这一领域的特点是数据关系错综复杂、任务高度异构,从统计数据查询到基于知识的问答不一而足。为应对这些挑战,我们提出了CRMWeaver,一种在此类复杂场景中增强业务代理的新方法。为了使代理模型适应复杂的业务环境,我们在训练中采用合成数据生成与基于RL的范式,这显著提升了模型处理复杂数据和多样任务的能力。在推理过程中,我们引入共享记忆机制,促使智能体从相似问题的任务指南中学习,从而进一步提升其有效性和泛化能力,尤其是在未见过的场景中。我们在CRMArena-Pro数据集上验证了方法的有效性:我们的轻量级模型在B2B和B2C业务场景中均取得有竞争力的结果,凸显了其对现实世界应用的实用价值。
摘要:Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.


【11】From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
标题:从病历到诊断对话:精神病合并症的基于临床的方法和数据集
链接:https://arxiv.org/abs/2510.25232

作者:Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng, Mengyue Wu
摘要:精神科共病具有重要的临床意义,但由于多种障碍并发的复杂性而颇具挑战。为此,我们开发了一种新方法,将合成患者电子病历(EMR)构建与多代理诊断对话生成相结合。我们通过一条确保临床相关性和多样性的管道,为常见共病情形创建了502份合成EMR。我们的多代理框架将临床访谈协议转换为分层状态机和上下文树,在保持临床标准的同时支持130多个诊断状态。通过这一严格的流程,我们构建了PsyCoTalk,第一个支持共病的大规模对话数据集,包含3,000段经精神科医生验证的多轮诊断对话。该数据集有助于提高诊断准确性和治疗规划,为精神科共病研究提供了宝贵资源。与真实世界的临床转录相比,PsyCoTalk在对话长度、词元分布和诊断推理策略方面表现出较高的结构和语言保真度。持证精神科医生确认了对话的真实性和诊断有效性。该数据集支持开发和评估能够在单次对话中进行多障碍精神疾病筛查的模型。
摘要:Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.


【12】ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation
标题:ProMediate:一个用于评估多方谈判中主动代理人的社会认知框架
链接:https://arxiv.org/abs/2510.25224

作者:Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma
摘要:虽然大型语言模型(LLM)越来越多地被用于代理框架中以帮助个人用户,但对能够主动管理复杂多方协作的代理的需求也在不断增长。针对这类主动代理的系统性评估方法仍然稀缺,限制了能够有效支持多人协同工作的AI的发展。谈判为这一挑战提供了一个苛刻的测试平台:它需要社会认知智能来驾驭多个参与者、多个议题之间的利益冲突并建立共识。在此,我们介绍ProMediate,首个在复杂、多议题、多方谈判中评估主动式AI调解代理的框架。ProMediate由两个核心组件组成:(i)基于真实谈判案例和理论驱动难度等级(ProMediate-Easy、ProMediate-Medium和ProMediate-Hard)的模拟测试平台,内置一个以社会认知调解理论为基础的即插即用主动式AI调解器,能够灵活决定何时以及如何干预;(ii)一个社会认知评估框架,配备一套新的度量指标,用于测量共识变化、干预延迟、调解有效性与智能水平。这些组件共同构成了一个系统框架,用于评估多方环境中主动式AI代理的社会认知智能。我们的结果表明,具备社会智能的调解代理通过更快、更有针对性的干预优于通用基线。在ProMediate-Hard设置中,与通用基线相比,我们的社会调解器将共识变化提高了3.6个百分点(10.65% vs 7.01%),同时响应速度快77%(3.71 s,基线为15.98 s)。总之,ProMediate提供了一个严格的、以理论为基础的测试平台,以促进主动的、具备社会智能的代理的发展。
摘要:While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
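摘要中的两项对比数字可以直接验算。下面的小段 Python 仅用于复现"3.6个百分点"与"77%更快"这两个数值(此处假设 3.71 s 为社会调解器的响应延迟、15.98 s 为通用基线的延迟):

```python
def consensus_change_pp(mediated, baseline):
    """共识变化的差值,以百分点(pp)计。"""
    return round(mediated - baseline, 2)

def latency_speedup_pct(baseline_s, mediated_s):
    """响应延迟相对基线缩短的百分比。"""
    return round((baseline_s - mediated_s) / baseline_s * 100, 1)

print(consensus_change_pp(10.65, 7.01))   # → 3.64,约合摘要中的 3.6 个百分点
print(latency_speedup_pct(15.98, 3.71))   # → 76.8,约合摘要中的 77%
```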


【13】Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
标题:分解揭示隐藏的训练动态:一致性吸引(Agreement Attraction)案例
链接:https://arxiv.org/abs/2510.24934

作者:James A. Michaelov, Catherine Arnett
备注:Accepted to the First Workshop on Interpreting Cognition in Deep Learning Models (CogInterp @ NeurIPS 2025)
摘要:语言模型通常能生成合乎语法的文本,但在某些上下文中更容易出错。借鉴心理语言学的研究范式,我们对不同句法语境中的这些错误进行了细粒度分析。我们证明,通过对精心构建的数据集按实验条件进行分解,并在训练过程中比较模型在各条件上的表现,可以更好地理解语言模型语法学习的中间阶段。具体而言,我们识别出不同的训练阶段:在这些阶段中,语言模型的行为与特定的启发式策略(如词频和局部上下文)而非一般性语法规则相一致。我们认为,更广泛地采用这种方法来分析语言模型行为,可以成为理解中间学习阶段、整体训练动态以及语言模型学到的具体泛化的有力工具。
摘要:Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
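摘要的核心做法是按实验条件分解(disaggregate)模型在各训练检查点上的表现。下面是一个假设性的 Python 草图,示意这种按(检查点, 条件)分组求准确率的统计方式;记录内容均为虚构示例:

```python
from collections import defaultdict

def disaggregate(records):
    """records: (checkpoint_step, condition, correct) 三元组列表,
    返回 {检查点: {条件: 准确率}}。"""
    buckets = defaultdict(lambda: defaultdict(list))
    for step, cond, correct in records:
        buckets[step][cond].append(correct)
    return {step: {cond: sum(v) / len(v) for cond, v in conds.items()}
            for step, conds in buckets.items()}

# 假设的评测记录:同一检查点在两种句法条件下的逐例对错(1 对 / 0 错)
records = [
    (1000, "attractor_mismatch", 0), (1000, "attractor_mismatch", 1),
    (1000, "no_attractor", 1),       (1000, "no_attractor", 1),
]
print(disaggregate(records))
# → {1000: {'attractor_mismatch': 0.5, 'no_attractor': 1.0}}
```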


【14】Idea2Plan: Exploring AI-Powered Research Planning
标题:Idea2Plan:探索人工智能驱动的研究规划
链接:https://arxiv.org/abs/2510.24891

作者:Jin Huang, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White
摘要:大型语言模型(LLM)在加速科学发现方面展现出巨大潜力,是分析数据、生成假设和支持各科学领域创新方法的宝贵工具。在这项工作中,我们研究LLM如何完成从概念性研究想法到结构良好的研究计划的过渡。有效的研究规划不仅支持科学家推进研究,也是发展自主研究代理的一项关键能力。尽管其重要性不言而喻,该领域仍缺乏对LLM研究规划能力的系统认识。为了严格度量这一能力,我们引入了Idea2Plan任务和Idea2Plan Bench:一个基于200篇ICML 2025 Spotlight和Oral论文构建的基准,这些论文均发布于主要LLM训练数据截止日期之后。每个基准实例包含一个研究想法和一份捕获有效计划关键组成部分的评分准则(rubric)。我们进一步提出Idea2Plan JudgeEval,一个用于对照专家标注评估基于LLM的评审可靠性的配套基准。实验结果表明,GPT-5和GPT-5-mini在该基准上表现最佳,但仍有很大的改进空间。我们的研究为LLM的研究规划能力提供了新的见解,并为未来的进展奠定了基础。
摘要:Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lays the groundwork for future progress.
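摘要提到每个基准实例都带有一份捕获有效计划关键组成部分的评分准则(rubric)。下面是一个高度简化的关键词式打分草图,仅示意"按 rubric 组成部分核对计划"的思路;组成部分与关键词均为假设,并非论文的实际评分方式:

```python
def score_plan(plan_text, rubric):
    """rubric: {组成部分: 关键词列表};任一关键词出现即视为该部分被覆盖。"""
    text = plan_text.lower()
    satisfied = [name for name, kws in rubric.items()
                 if any(kw in text for kw in kws)]
    return len(satisfied) / len(rubric), satisfied

# 假设的 rubric:有效计划应覆盖基线、数据集与消融实验三个部分
rubric = {
    "baselines": ["baseline", "compare"],
    "datasets": ["dataset", "benchmark"],
    "ablations": ["ablation"],
}
score, parts = score_plan("We compare against strong baselines on two datasets.", rubric)
print(round(score, 2), parts)  # → 0.67 ['baselines', 'datasets']
```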


【15】PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination
标题:PANORAMA:捕获专利审查中决策轨迹与决策理由的数据集和基准
链接:https://arxiv.org/abs/2510.24774

作者:Hyunseung Lim, Sooyohn Nam, Sungmin Na, Ji Yong Cho, June Yong Yang, Hyungyu Shin, Yoonjoo Lee, Juho Kim, Moontae Lee, Hwajung Hong
摘要:即使在大型语言模型(LLM)出现之后,专利审查在NLP文献中仍是一个持续的挑战,因为它需要在专业领域中,就提交的权利要求相对于先前已授权的权利要求(即现有技术)是否满足新颖性和非显而易见性的法定标准,做出广泛而细致的人类判断。以往的NLP研究多将这一挑战视为一项预测任务(例如预测授权结果),并采用相似度度量或在历史标签上训练的分类器等高层代理指标。然而,这种做法往往忽略了审查员必须基于深入信息逐步做出的评估,包括审查意见通知书(office actions)中给出的决定理由,这也使得衡量当前技术在专利审查流程中的水平变得更加困难。为填补这一空白,我们构建了PANORAMA,一个包含8,143份美国专利审查记录的数据集,完整保留了决策轨迹,包括原始申请、所有被引参考文献、非最终驳回通知(Non-Final Rejections)和授权通知(Notices of Allowance)。此外,PANORAMA还将这些轨迹分解为一系列顺序基准,模拟专利专业人员的审查流程,使研究人员能够在每个步骤上检验大型语言模型的能力。我们的研究结果表明,尽管LLM在检索相关现有技术和精确定位相关段落方面相对有效,但它们难以评估专利权利要求的新颖性和非显而易见性。我们讨论了这些结果,并认为要在专利领域推进包括LLM在内的NLP技术,需要对真实世界的专利审查有更深入的理解。我们的数据集可在https://huggingface.co/datasets/LG-AI-Research/PANORAMA上公开获取。
摘要:Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims -- prior art -- in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in office actions documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. Also, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals' patent review processes and allow researchers to examine large language models' capabilities at each step of them. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at https://huggingface.co/datasets/LG-AI-Research/PANORAMA.
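摘要指出 LLM 在"检索相关现有技术"这一步相对有效。下面用一个极简的词汇重叠(Jaccard)检索草图示意该步骤的形态;这只是一个说明性基线,并非论文或专利审查实践中实际使用的方法,示例文本亦为虚构:

```python
def jaccard(a, b):
    """两段文本分词集合的 Jaccard 相似度。"""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def retrieve_prior_art(claim, prior_art, k=1):
    """按与权利要求的相似度对现有技术文档排序,取前 k 篇。"""
    ranked = sorted(prior_art, key=lambda doc: jaccard(claim, doc), reverse=True)
    return ranked[:k]

claim = "a battery electrode coated with graphene"
prior = ["graphene coated battery electrode structure",
         "method of brewing coffee with a filter"]
print(retrieve_prior_art(claim, prior))  # → ['graphene coated battery electrode structure']
```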


【16】Confidence is Not Competence
标题:信心不是能力
链接:https://arxiv.org/abs/2510.24772

作者:Debdeep Sanyal, Manya Pandey, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
备注:20 Pages, 6 Figures, 8 Tables
摘要:大型语言模型(LLM)经常表现出令人困惑的脱节:其宣称的信心与实际解决问题的能力并不一致。我们通过分析两个阶段(生成前评估与解答执行)中内部状态的几何结构,为这种解耦提供了一个机制性解释。一个简单的线性探测器即可解码模型内部的"可解性信念",揭示出一条有序的信念轴,它可泛化到不同的模型家族以及数学、代码、规划和逻辑任务。然而,两个阶段的几何结构却彼此不同:尽管信念是线性可解码的,但从主成分来看,评估流形具有较高的线性有效维度,而随后的推理轨迹则在维度低得多的流形上演化。这种从"思考"到"行动"的几何复杂度骤降,从机制上解释了信心与能力之间的差距。沿信念轴引导内部表征的因果干预并不会改变最终解,这表明在复杂评估空间中的线性推动无法控制受约束的执行动态。由此,我们揭示了一种双系统架构:一个几何上复杂的评估器馈入一个几何上简单的执行器。这些结果挑战了"可解码的信念即可作为可操作杠杆"的假设,转而主张干预应针对执行过程的程序性动态,而非评估的高层几何结构。
摘要:Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal "solvability belief" of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.
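摘要中的线性探测器沿"信念轴"解码模型的可解性信念。下面是一个假设性草图:给定一个(假设已训练好的)方向向量 w,对隐藏状态做线性投影再经 sigmoid,即得到信念强度;向量取值均为示意,并非论文的实际探测器:

```python
import math

def solvability_belief(h, w, b=0.0):
    """隐藏状态 h 在信念轴 w 上的线性投影,经 sigmoid 映射为信念概率。"""
    logit = sum(hi * wi for hi, wi in zip(h, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))

w = [0.8, -0.2, 0.5]        # 假设已由线性探测器训练得到的信念轴方向
h_easy = [1.0, 0.0, 1.0]    # 某"模型认为可解"问题的隐藏状态(示意)
h_hard = [-1.0, 1.0, -1.0]  # 某"模型认为不可解"问题的隐藏状态(示意)
print(solvability_belief(h_easy, w) > solvability_belief(h_hard, w))  # → True
```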


【17】AmarDoctor: An AI-Driven, Multilingual, Voice-Interactive Digital Health Application for Primary Care Triage and Patient Management to Bridge the Digital Health Divide for Bengali Speakers
标题:AmarDoctor:一款人工智能驱动、多语言、语音交互式数字健康应用程序,用于初级保健分诊和患者管理,弥合孟加拉语使用者的数字健康鸿沟
链接:https://arxiv.org/abs/2510.24724

作者:Nazmun Nahar, Ritesh Harshad Ruparel, Shariar Kabir, Sumaiya Tasnia Khan, Shyamasree Saha, Mamunur Rashid
摘要:这项研究介绍了AmarDoctor,一款多语言语音交互式数字健康应用程序,旨在为孟加拉语使用者提供全面的患者分诊和人工智能驱动的临床决策支持;这一人群在数字医疗服务的获取上长期服务不足。AmarDoctor采用数据驱动的方法来加强初级保健服务,并实现个性化的健康管理。虽然AdaHealth、WebMD、Symptomate和K-Health等平台近年来越来越受欢迎,但它们主要服务于欧洲人群和语言。AmarDoctor通过为患者和医疗保健提供者提供双界面系统来弥补这一空白,支持三种主要的孟加拉语方言。其核心的患者模块使用自适应提问算法来评估症状,并引导用户前往合适的专科。为克服数字素养障碍,它集成了一个语音交互式人工智能助手,引导用户使用应用内的各项服务。作为补充,面向临床医生的界面采用了人工智能决策支持,通过生成结构化的初步诊断和治疗建议来提高工作流程效率。这些输出为电子处方、视频会诊和病历管理等关键服务提供支撑。为验证临床准确性,该系统在一套由经验丰富的医生编写的185个临床病例情景(vignettes)金标准集上进行了评估,并通过与五名独立医生在同一病例集上的表现对比进一步评估其有效性。结果显示,AmarDoctor的top-1诊断精确率为81.08%(医生平均为50.27%),top-1专科推荐精确率为91.35%(医生平均为62.6%)。
摘要:This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor performance with five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus physicians average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus physicians average of 62.6 percent).
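摘要中的 top-1 诊断精确率可按如下方式计算:对每个病例取系统排序第一的诊断,与金标准比对后统计命中比例。下面是一个假设性的 Python 草图,病例与诊断均为虚构示例,仅示意该指标的含义:

```python
def top1_precision(predictions, gold):
    """predictions: 每个病例按置信度排序的候选诊断列表;gold: 金标准诊断。"""
    hits = sum(1 for preds, g in zip(predictions, gold) if preds and preds[0] == g)
    return round(hits / len(gold) * 100, 2)

preds = [["depression"], ["migraine", "tension headache"], ["flu"]]
gold = ["depression", "migraine", "covid"]
print(top1_precision(preds, gold))  # → 66.67,即 3 个病例中命中 2 个
```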


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递

【声明】内容源于网络