
自然语言处理学术速递[10.13]

Sophie外贸笔记
2025-10-13
导读:cs.CL 方向,今日共计144篇

点击阅读原文访问arxivdaily.com,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏等功能!




大模型相关(73篇)

【1】Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
标题:提示测试时间缩放是一种强大的LLM推理数据增强方法
链接:https://arxiv.org/abs/2510.09599

作者:Sondos Mahmoud Bsharat, Zhiqiang Shen
备注:Our code and data are available at this https URL
摘要:大型语言模型(LLM)在提供思维链范例时已经展示了令人印象深刻的推理能力,但整理大规模推理数据集仍然费力且资源密集。在这项工作中,我们介绍了提示测试时间缩放(Prompting Test-Time Scaling, P-TTS),这是一种简单而有效的推理时数据增强策略,用于通过微调来增强LLM推理。P-TTS不是收集数千甚至数百万个示例,而是仅利用由90个人工挑选的推理实例组成的小型样本池,并在测试时通过有原则地改变指令提示强度来系统性地变换范例增强,从而合成多样的推理轨迹上下文。然后,我们在P-TTS数据上微调了不同规模的Qwen-2.5模型。在数学推理基准AIME2024与AIME2025、MATH500和GPQA-Diamond上,我们的P-TTS-7B和32B模型优于S1和S1.1(1K-shot)等先前的竞争性基线:在AIME'24(7B)上分别取得+26.66%和+30.00%的绝对精度增益,在AIME'25(7B)上分别取得+13.34%和+6.67%;P-TTS-32B在AIME'24上分别取得+23.33%和+16.63%的增益,在AIME'25上分别取得+26.63%和+3.33%的增益(分别相对S1和S1.1),并在MATH500和GPQA-Diamond上取得相当或更好的性能。我们进一步表明,P-TTS提高了在域外推理基准Gaokao、Kaoyan、OlympiadBench、AMC23、GradeSchoolMath和Minerva上的zero-shot泛化精度。我们的分析表明,测试时间缩放有效地探索了推理模式的潜在空间,以最小的标注开销放大了LLM解决问题的能力,并进一步释放了LLM的推理潜力。在资源受限或快速演进的领域中,提示测试时间缩放提供了一种实用、低成本的方法来激发LLM推理。
摘要:Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. Then we finetune the various sizes of Qwen-2.5 models on P-TTS data. Across a suite of mathematical reasoning AIME2024 & 25, MATH500, and GPQA-Diamond, our P-TTS-7B and 32B models outperform the prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
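下面用一个极简的Python草图示意P-TTS式增强的基本思路(纯属假设性示意,并非论文官方实现;范例文本与"提示强度"模板均为虚构):从小规模范例池出发,将同一批范例与不同强度的指令模板组合,合成多样的训练上下文。

```python
import itertools

# 假设性示意:一个很小的推理范例池(论文中为 90 条人工挑选的实例)
EXEMPLAR_POOL = [
    "Q: 2+3=? A: 2+3=5。",
    "Q: 12 的质因数分解? A: 12=2*2*3。",
]

# 假设性的"提示强度"模板:同一范例在测试时配以不同强度的指令
INTENSITY_TEMPLATES = [
    "请直接作答:\n{exemplar}",
    "请逐步推理后作答:\n{exemplar}",
    "请先列出已知条件,再逐步推理,最后作答:\n{exemplar}",
]

def augment(pool, templates):
    """将每个范例与每种提示强度组合,合成多样的推理轨迹上下文。"""
    return [t.format(exemplar=e) for e, t in itertools.product(pool, templates)]

contexts = augment(EXEMPLAR_POOL, INTENSITY_TEMPLATES)
print(len(contexts))  # 2 个范例 x 3 种强度 = 6 条合成上下文
```

论文随后在此类合成数据上微调模型;上面的草图只展示"小池 x 强度变换"的组合结构。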


【2】LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
标题:LiveOIBench:大型语言模型能否在信息学奥林匹克竞赛中胜过人类参赛者?
链接:https://arxiv.org/abs/2510.09595

作者:Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
摘要:竞争性编程问题由于其复杂性和易于验证的特点,正日益成为评估大型语言模型(LLM)编码能力的有价值基准。然而,目前的编码基准存在若干局限,例如缺乏特别具有挑战性的问题、测试用例覆盖不足,以及依赖限制可访问性的在线平台API。为了解决这些问题,我们引入了LiveOIBench,这是一个包含403个专家精选的奥林匹克级竞争性编程问题的综合基准,每个问题平均配有60个专家设计的测试用例。这些问题直接取自2023年至2025年期间在不同地区举办的72场官方信息学奥林匹克竞赛。LiveOIBench通过四个关键特性脱颖而出:(1)精心策划的高质量任务,附有详细的子任务评分规则和大量私有测试用例;(2)直接集成精英选手的成绩数据,以便与表现最好的人类进行有信息量的比较;(3)计划从新发布的奥赛题目中进行持续的、无污染的更新;(4)一个便于离线和易于复现评估的自包含评测系统。对32个流行的通用和推理LLM进行基准测试后,我们发现GPT-5达到了显著的第81.76百分位,这是一个很好的结果,但仍低于通常位列第90百分位以上的顶尖人类选手。相比之下,在开放权重推理模型中,GPT-OSS-120B仅达到第60百分位,凸显了其与前沿闭源模型之间的显著能力差距。详细分析表明,强大的推理模型优先进行精确的问题分析而非过度探索,这提示未来的模型应强调结构化分析并尽量减少不必要的探索。所有数据、代码和排行榜结果都将在我们的网站上公开。
摘要:Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestant performance, who usually place above 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.


【3】Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
标题:心灵节奏说话:口语模型中实时推理的双脑方法
链接:https://arxiv.org/abs/2510.09592

作者:Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu (Tony) Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
备注:13 pages, 3 figures
摘要:实时口语模型(SLM)难以利用思维链(CoT)推理,因为顺序生成整个思考过程的延迟过高。让SLM能够像人类一样边说边想,正引起越来越多的关注。我们首次提出了心灵节奏说话(Mind-Paced Speaking, MPS),这是一个受大脑启发、能实现高保真实时推理的框架。类似于人类利用不同的大脑区域分别负责思考和作答,我们提出了一种新颖的双脑方法:采用"构思大脑"(Formulation Brain)进行高级推理,以调控并指导独立的"发音大脑"(Articulation Brain)进行流畅的语音生成。这种分工消除了模式切换,保持了推理过程的完整性。实验表明,MPS显著优于现有的边想边说方法,并实现了与说话前预先计算完整CoT的模型相当的推理性能,同时大幅降低了延迟。在零延迟配置下,所提出的方法在数学推理任务Spoken-MQA上实现了92.8%的准确率,在语音会话任务URO-Bench上获得了82.5分。我们的工作有效地弥合了高质量推理和实时交互之间的差距。
摘要:Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.


【4】WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives
标题:WUGNECTIVES:从话语联系语中对语言模型的新颖实体推断
链接:https://arxiv.org/abs/2510.09556

作者:Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra
备注:16 pages total, 9 pages main; 7 figures total, 4 figures main; 8 tables total, 4 tables main
摘要:世界知识对于预测标记两个论元之间话语关系的话语连接词尤为关键,而语言模型(LM)通常能很好地完成这项任务。我们在工作中反转了这一前提,转而研究其逆问题:话语连接词能否向LM传递关于世界的信息。为此,我们提出了WUGNECTIVES,这是一个包含8,880条刺激语料的数据集,用于评估在连接词将新实体与特定属性联系起来的语境中,LM对这些新实体的推断。在考察了17个不同规模和训练方案的LM之后,我们发现,将LM调优为展现推理行为,可在大多数连接词上带来显著改进。与此同时,不同连接词类型下各LM的整体表现差异很大,所有模型都在表达让步意义的连接词上系统性地表现不佳。我们的研究结果为更细致地研究LM所捕获的语言线索的功能作用铺平了道路。我们在https://github.com/sheffwb/wugnectives发布WUGNECTIVES。
摘要:The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs' inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs' overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/sheffwb/wugnectives.


【5】Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models
标题:超越表面推理:揭示扩散大语言模型真正的长思维链能力
链接:https://arxiv.org/abs/2510.09544

作者:Qiguang Chen, Hanjing Li, Libo Qin, Dengyun Peng, Jinhao Liu, Jiangyi Wang, Chengyue Wu, Xie Chen, Yantao Du, Wanxiang Che
备注:Preprint
摘要:最近,扩散大语言模型(DLLM)提供了高吞吐量与有效的顺序推理能力,使其成为自回归LLM(ALLM)的有力竞争者。然而,支持多个令牌同步更新的并行解码,与严格推理通常所需的因果顺序相冲突。我们首先将这一冲突界定为核心的并行-顺序矛盾(Parallel-Sequential Contradiction, PSC)。在简单和复杂推理任务上的行为分析表明,DLLM仅对可直接判定的输出表现出真正的并行性;随着任务难度增加,它们会退回到类似自回归的行为。自回归式提示还会加剧这一局限:它因重新掩码使解码步数几乎翻倍,却未能提升质量。此外,PSC还限制了DLLM的自我反思、推理深度和探索广度。为了进一步刻画PSC,我们为DLLM引入了三个缩放维度:并行、扩散和顺序。经验表明,并行缩放能带来一致的改进,而扩散缩放和顺序缩放则受制于PSC。基于这些发现,我们提出了几项实用的缓解措施:面向并行的提示、扩散早停和并行缩放,以减少PSC引起的无效性和低效率。
摘要:Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs' self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations, parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiencies.


【6】SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
标题:SPG:掩蔽扩散语言模型的三明治策略梯度
链接:https://arxiv.org/abs/2510.09541

作者:Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
摘要:扩散大语言模型(dLLM)由于能够并行解码多个令牌,正在成为自回归模型的高效替代方案。然而,通过强化学习(RL)将dLLM与人类偏好或特定任务奖励对齐颇具挑战,因为其难以计算的对数似然使标准策略梯度方法无法直接应用。尽管先前的工作使用证据下界(ELBO)等替代量,但这些单侧近似可能引入显著的策略梯度偏差。为了解决这个问题,我们提出了三明治策略梯度(SPG),同时利用真实对数似然的上界和下界。实验表明,SPG显著优于基于ELBO或一步估计的基线。具体而言,相比针对dLLM的最先进RL方法,SPG将GSM8K上的准确率提高了3.6%,MATH500上提高了2.6%,Countdown上提高了18.4%,Sudoku上提高了27.0%。
摘要:Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
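"三明治"夹逼的核心思想可以用一个极简数值草图说明(假设性示意,并非论文的实际梯度估计器;真实方法需针对扩散模型的对数似然构造上下界并对其求梯度):用下界(如ELBO)与某个上界的组合替代不可计算的对数似然,以缓解单侧近似的系统性偏差。

```python
def sandwiched_surrogate(lower_bound, upper_bound, weight=0.5):
    """用上下界的凸组合近似真实对数似然(假设性示意)。
    weight=0.5 时为简单平均;论文中的具体组合方式可能不同。"""
    assert lower_bound <= upper_bound
    return weight * lower_bound + (1 - weight) * upper_bound

# 设真实 log p = -2.0;ELBO 给出下界 -2.6,另一估计给出上界 -1.6
est = sandwiched_surrogate(-2.6, -1.6)
print(est)  # -2.1:被夹在上下界之间,比单用 ELBO(-2.6)更接近 -2.0
```

直观上,单侧代理(只用下界)会把偏差系统性地推向一个方向,而上下夹逼能抵消一部分偏差。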


【7】Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
标题:评估大型语言模型对多语言打字错误的鲁棒性
链接:https://arxiv.org/abs/2510.09536

作者:Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, Michael A. Hedderich
备注:preprint
摘要:大型语言模型(LLM)越来越多地部署在接受用户输入的多语言真实应用中,而用户输入自然会带有打字错误(typo)。然而,大多数基准都假设输入是干净的,使得LLM对跨语言打字错误的鲁棒性在很大程度上缺乏研究。为填补这一空白,我们提出了MulTypo,一种多语言打字错误生成算法,它根据特定语言的键盘布局和打字行为来模拟类似人类的错误。我们评估了来自三个模型家族的18个开源LLM在五个下游任务上的表现,任务涵盖语言推理、多项选择问答、数学推理和机器翻译。结果表明,打字错误会持续降低性能,尤其是在生成式任务和需要推理的任务中,而自然语言推理任务相对更鲁棒。指令调优提高了干净输入下的性能,但可能加剧噪声下的脆弱性。我们还观察到与语言相关的鲁棒性:高资源语言通常比低资源语言更鲁棒,且从英语译出比译入英语更鲁棒。我们的研究结果强调了噪声感知训练和多语言鲁棒性评估的必要性。我们公开了代码和数据。
摘要:Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs -- naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning -- while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
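基于键盘布局的打字错误注入可以用几行Python示意(假设性示意,并非MulTypo的官方实现;相邻键表只给出QWERTY布局的一小部分,真实算法按具体语言的键盘布局和打字行为建模):

```python
import random

# 假设性示意:QWERTY 键盘相邻键表的一小部分
ADJACENT_KEYS = {
    "a": "qwsz",
    "e": "wrd",
    "o": "ipl",
}

def inject_typo(word, rng):
    """随机将一个字符替换为其键盘相邻键,模拟类似人类的打字错误。"""
    candidates = [i for i, ch in enumerate(word) if ch in ADJACENT_KEYS]
    if not candidates:
        return word  # 无可替换字符则原样返回
    i = rng.choice(candidates)
    typo_char = rng.choice(ADJACENT_KEYS[word[i]])
    return word[:i] + typo_char + word[i + 1:]

rng = random.Random(0)
print(inject_typo("hello", rng))  # 'e' 或 'o' 被替换为某个相邻键
```

真实系统通常还会建模插入、删除和换位等错误类型;此处只演示"相邻键替换"这一种。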


【8】StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
标题:StatEval:统计中大型语言模型的综合基准
链接:https://arxiv.org/abs/2510.09517

作者:Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
摘要:大型语言模型(LLM)在数学和逻辑推理方面取得了显著进展,但统计学作为一门独特的综合性学科,在基准测试工作中仍未得到充分探索。为填补这一空白,我们引入了StatEval,这是第一个专门针对统计学的综合基准,在各难度层级上兼顾广度和深度。StatEval由13,817个涵盖本科和研究生课程的基础问题,以及从顶级期刊中提取的2374个研究级证明任务组成。为了构建该基准,我们设计了一个带有人在回路验证的可扩展多智能体流水线,在保证学术严谨性的同时,自动化大规模问题提取、改写和质量控制。我们进一步提出了一个适用于计算型和证明型任务的稳健评估框架,实现对推理能力的细粒度评估。实验结果表明,GPT5-mini等闭源模型在研究级问题上的得分不足57%,而开源模型的表现还要明显更低。这些发现突显了统计推理的独特挑战和当前LLM的局限性。我们期望StatEval能成为推进大型语言模型统计智能的严格基准。所有数据和代码均可在我们的网络平台上获取:https://stateval.github.io/。
摘要:Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \textbf{StatEval}, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that while closed-source models such as GPT5-mini achieve below 57\% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.


【9】Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
标题:自然语言推理的混合模型:以三段论逻辑为例
链接:https://arxiv.org/abs/2510.09472

作者:Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki
摘要:尽管神经模型取得了显著进步,但其泛化能力(逻辑推理等应用的基石)仍然是一个关键挑战。我们刻画了这种能力的两个基本方面:组合性,即抽象出复杂推理背后的原子逻辑规则的能力;递归性,即通过推理规则的迭代应用构建复杂表征的能力。在文献中,这两个方面经常被笼统地混在"泛化"这一总括术语之下。为了凸显这一区别,我们以三段论片段作为自然语言推理的基准,研究了预训练大型语言模型(LLM)的逻辑泛化能力。该片段虽然简单,却提供了形式逻辑中一个基础而又具表达力的子集,支持对基本推理能力进行受控评估。我们的研究结果揭示了一个显著差异:LLM在递归性方面表现出合理的熟练度,但在组合性方面表现不佳。为了克服这些局限并建立可靠的逻辑证明器,我们提出了一种将符号推理与神经计算相结合的混合架构。这种协同交互实现了稳健而高效的推理:神经组件加速处理,而符号推理确保完备性。我们的实验表明,即使神经组件相对较小,也能保持高效率。作为我们所提方法论的一部分,这一分析给出了混合模型的设计依据,并凸显了其有效解决神经推理系统中关键泛化障碍的潜力。
摘要:Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often confounded together under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference, neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.
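三段论片段的符号推理部分可以实现得非常紧凑(假设性示意,并非论文的混合架构本身),例如对"Barbara"式传递规则(所有A是B,所有B是C ⊢ 所有A是C)做前提闭包:

```python
def syllogistic_closure(facts):
    """facts: ("all", X, Y) 形式的前提集合;
    反复应用传递规则,直到不再产生新结论(即求闭包)。"""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (_, a, b) in list(derived):
            for (_, c, d) in list(derived):
                if b == c and ("all", a, d) not in derived:
                    derived.add(("all", a, d))
                    changed = True
    return derived

premises = {("all", "希腊人", "人"), ("all", "人", "凡人")}
closure = syllogistic_closure(premises)
print(("all", "希腊人", "凡人") in closure)  # True:经典三段论结论
```

完整的三段论片段还包含"some""no"等量词及相应规则;符号证明器的完备性正来自这种规则的穷尽应用,而论文中的神经组件用于加速该过程。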


【10】Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
标题:把索引排列整齐:面向真实世界的LLM训练数据全文搜索
链接:https://arxiv.org/abs/2510.09471

作者:Ines Altemir Marinas, Anastasiia Kucherenko, Alexander Sternfeld, Andrei Kucharavy
摘要:大型语言模型(LLM)的性能取决于其训练数据。尽管开放权重LLM大量涌现,但LLM训练数据的获取仍然有限。即使对于完全开放的LLM,数据的规模也使其对一般科学界来说几乎无从审视,尽管其中可能包含从互联网上抓取的关键数据。   在本文中,我们提出了面向Apertus LLM训练数据的全文索引管道。借助Elasticsearch并行索引和Alps基础设施(一个最先进、高能效的arm64超级集群),我们对用于训练Apertus LLM家族的15.2T令牌中的8.6T进行了索引,既打造了一个关键的LLM安全工具,也构建了一个事实上的离线、经策划的开放网络搜索引擎。我们的贡献有三方面。首先,我们证明了Elasticsearch可以成功移植到基于arm64的下一代基础设施上。其次,我们证明了在现代LLM训练数据集乃至整个开放网络的规模上进行全文索引是可行且可及的。最后,我们证明了这样的索引可用于实现此前无法实现的、与越狱方式无关的LLM安全保障。   我们希望我们的研究结果对其他尝试大规模数据索引的团队有用,并推动向更绿色计算的总体转型。
摘要:The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet.   In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety.   We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
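大规模全文索引的基本写入模式可以示意如下(假设性示意,论文管道的具体结构未在摘要中公开;索引名、字段名均为虚构)。核心是把文档流转换为Elasticsearch bulk API所需的action字典,实际写入通常经由elasticsearch-py的helpers.bulk批量提交:

```python
def to_bulk_actions(docs, index_name):
    """将 (doc_id, text) 文档流转换为 Elasticsearch bulk API 的 action 字典。"""
    for doc_id, text in docs:
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {"text": text},
        }

docs = [("doc-0", "first crawled page"), ("doc-1", "second crawled page")]
actions = list(to_bulk_actions(docs, "pretraining-corpus"))
print(len(actions))  # 2

# 实际索引时(需要一个运行中的集群):
# from elasticsearch import Elasticsearch, helpers
# helpers.bulk(Elasticsearch("http://localhost:9200"),
#              to_bulk_actions(docs, "pretraining-corpus"))
```

在15.2T令牌量级上,关键在于把这一模式并行化到多个索引与多台写入节点,这正是论文借助Alps集群解决的工程问题。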


【11】Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives
标题:基于领域自适应预训练语言模型的事故叙述隐含信息提取
链接:https://arxiv.org/abs/2510.09434

作者:Xixi Wang, Jordanka Kovaceva, Miguel Costa, Shuai Wang, Francisco Camara Pereira, Robert Thomson
摘要:真实世界碰撞数据库中记录的自由文本碰撞叙述已被证明在改善交通安全方面发挥着重要作用。然而,大规模分析仍然难以实施,因为尚无成熟的工具能够批量处理由经验和细致程度各异的作者撰写的非结构化、非标准化文本内容。近年来,基于Transformer的预训练语言模型(PLM),例如来自Transformer的双向编码器表示(BERT)和大型语言模型(LLM),已在各类自然语言处理任务中展现出强大能力。这些模型可以从碰撞叙述中提取显式事实,但在推理密集的任务上性能下降,例如碰撞类型(Crash Type)识别,它可能涉及近100个类别。此外,通过外部API依赖闭源LLM会给敏感的碰撞数据带来隐私问题;而且由于领域知识有限,这些黑盒工具往往表现不佳。受这些挑战的启发,我们研究了紧凑的开源PLM能否支持从碰撞叙述中进行推理密集型信息提取。我们针对两个具有挑战性的目标:1)识别事故的碰撞方式(Manner of Collision);2)识别事故中每辆涉事车辆的碰撞类型。为弥合领域差距,我们应用低秩自适应(LoRA)微调技术向LLM注入任务特定知识,并对BERT进行微调。在权威的真实世界数据集碰撞调查抽样系统(CISS)上的实验表明,我们微调的紧凑模型优于GPT-4o等强大的闭源LLM,同时只需极少的训练资源。进一步的分析表明,微调后的PLM能够捕获更丰富的叙述细节,甚至能纠正数据集中一些被错误标注的注释。
摘要:Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement as there are no documented tools that can batch process the unstructured, non standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks in, for example, Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data. Additionally, these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) Crash Type for each vehicle involved in the crash event from real-world crash narratives. To bridge domain gaps, we apply fine-tuning techniques to inject task-specific knowledge to LLMs with Low-Rank Adaption (LoRA) and BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.


【12】The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
标题:Speech-LLM囊括一切:一种真正完全端到端的口语对话状态跟踪方法
链接:https://arxiv.org/abs/2510.09424

作者:Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
摘要:本文对使用Speech-LLM进行端到端口语对话状态跟踪的上下文管理策略进行了比较研究。我们系统地评估了三类方法:传统的多模态上下文(文本历史结合当前轮次的语音)、完整语音历史,以及压缩语音历史。我们在SpokenWOZ语料库上的实验表明,在规模相近的模型中,将完整的口语对话作为输入可获得最高性能,显著超越先前的方法。此外,我们表明,基于注意力池化的语音历史压缩提供了很好的权衡,在缩减上下文大小的同时保持了有竞争力的准确率。详细分析证实,这些改进源于更有效的上下文利用。
摘要:This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.


【13】On the Representations of Entities in Auto-regressive Large Language Models
标题:关于自回归大语言模型中实体的表示
链接:https://arxiv.org/abs/2510.09421

作者:Victor Morand, Josiane Mothe, Benjamin Piwowarski
备注:Accepted at BlackBoxNLP@EMNLP2025
摘要:命名实体是文本中知识的基本构建块,为语言中的事实信息提供落地并构建关系。尽管实体很重要,但大型语言模型(LLM)如何在内部表示实体仍不清楚。先前的研究主要考察了显式关系,而对实体表示本身知之甚少。我们引入实体提及重建作为一个新框架,用于研究LLM如何编码和操纵实体。我们考察了能否从内部表示生成实体提及、多令牌实体如何在最后一个令牌的嵌入之外被编码,以及这些表示是否捕获了关系知识。我们提出的方法利用任务向量(task vectors),能够从源自LLM隐藏状态的各种实体表示中一致地生成多令牌提及。我们由此引入了实体透镜(Entity Lens),将logit-lens扩展到预测多令牌提及。我们的结果提供了新的证据,表明LLM发展出了实体特定的机制来表示和操纵任意多令牌实体,包括训练期间未见过的实体。我们的代码可在https://github.com/VictorMorand/EntityRepresentations获取。
摘要:Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging _task vectors_, allows to consistently generate multi-token mentions from various entity representations derived from the LLMs hidden states. We thus introduce the _Entity Lens_, extending the _logit-lens_ to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training. Our code is available at https://github.com/VictorMorand/EntityRepresentations.


【14】Active Model Selection for Large Language Models
标题:大型语言模型的主动模型选择
链接:https://arxiv.org/abs/2510.09418

作者:Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
摘要:我们提出了LLM SELECTOR,这是首个面向大型语言模型(LLM)主动模型选择的框架。与依赖完全标注数据集的既有评估和基准测试方法不同,LLM SELECTOR只需有限的标注即可高效识别最佳LLM。具体而言,对于任何给定任务,LLM SELECTOR自适应地选择一小批查询进行标注,这些查询对判断该任务的最佳模型最具信息量。为进一步降低标注成本,我们利用了基于评判器(judge)的oracle标注模型。通过在6个基准上对151个LLM进行的大量实验,我们表明LLM SELECTOR在为任务选择最佳和接近最佳的LLM时,可将标注成本降低最多59.62%。
摘要:We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
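主动模型选择中"挑选最具信息量的查询"的一种常见思路(假设性示意,并非LLM SELECTOR的原始算法)是优先标注候选模型之间分歧最大的查询:各模型意见一致的查询标注价值低,分歧大的查询最能区分模型优劣。

```python
from collections import Counter

def pick_query_to_annotate(predictions):
    """predictions[q] 是各候选模型对查询 q 的回答列表;
    返回模型间分歧最大(最常见答案占比最低)的查询。"""
    def agreement(answers):
        return Counter(answers).most_common(1)[0][1] / len(answers)
    return min(predictions, key=lambda q: agreement(predictions[q]))

preds = {
    "q1": ["A", "A", "A"],  # 三个模型完全一致,标注价值低
    "q2": ["A", "B", "C"],  # 完全分歧,信息量最大
    "q3": ["A", "A", "B"],
}
print(pick_query_to_annotate(preds))  # q2
```

每次标注后更新对各模型的后验估计并重复选择,就得到一个自适应的标注循环;真实框架还会结合基于评判器的oracle来替代部分人工标注。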


【15】Understanding the Effects of Domain Finetuning on LLMs
标题:了解领域微调对LLM的影响
链接:https://arxiv.org/abs/2510.09359

作者:Eshaan Tanwar, Deepak Nathani, William Yang Wang, Tanmoy Chakraborty
摘要:针对特定领域微调的大型语言模型(LLM)表现出强劲性能;然而,这种微调重塑其参数空间的内在机制尚未得到很好的理解。以往工作主要关注自回归或通用指令模型,对领域专用LLM的探索不足。我们提出了首个针对大型医学语言模型领域微调的系统性研究。我们的分析表明,微调只修改了表示子空间的一小部分,基本上保留了预训练模型的表示。为了解释子空间中的这些变化,我们提出了调优向量(tuning vectors),这是一个受任务向量启发的新框架,可显式捕获微调引起的方向性参数偏移。我们证明,这些向量对于提升指令遵循能力和生成质量都至关重要。此外,跨不同领域组合调优向量可以带来更好的泛化。在进一步考察方向对齐后,我们发现这些向量主要向模型的MLP层写入新的方向信息,同时放大注意力头中已有的方向。我们的研究结果为LLM适配提供了新的见解,并为分析大型语言模型的专业化提供了一个通用、可解释的框架。
摘要:Large Language Models (LLMs) fine-tuned for specific domains exhibit strong performance; however, the underlying mechanisms by which this fine-tuning reshapes their parametric space are not well understood. Prior works primarily focus on auto-regressive or general-purpose instruct models, leaving domain-specialised LLMs under-explored. We present the first systematic study of domain-specific fine-tuning in large medical language models. Our analysis reveals that fine-tuning modifies only a small subset of the representational subspace, essentially preserving the pre-trained model's representation. To interpret these changes in subspaces, we propose tuning vectors, a novel framework inspired by task vectors, which explicitly capture the directional parameter shifts induced by fine-tuning. We demonstrate that these vectors are critical for enhancing both instruction-following and generation quality. Furthermore, combining tuning vectors across different domains yields improved generalisation. Upon closer inspection of directional alignment, we find these vectors primarily write new directional information into the MLP layers of the model, while amplifying existing directions in attention heads. Our findings offer new insights into LLM adaptation and provide a general, interpretable framework for analysing specialisation in large language models.
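调优向量的基本算术可以用玩具参数示意(假设性示意,论文在真实模型权重上按层计算,此处用普通列表代替张量):调优向量 = 微调后参数 − 预训练参数;跨领域组合即把多个向量加回同一基座。

```python
def tuning_vector(finetuned, pretrained):
    """逐参数计算微调引起的方向性变化。"""
    return [f - p for f, p in zip(finetuned, pretrained)]

def apply_vectors(pretrained, vectors, scale=1.0):
    """将(可能来自多个领域的)调优向量加回预训练参数。"""
    out = list(pretrained)
    for vec in vectors:
        out = [w + scale * d for w, d in zip(out, vec)]
    return out

base = [0.0, 1.0, -0.5]   # 预训练参数(玩具示例)
med  = [0.2, 1.0, -0.4]   # 医学领域微调后
law  = [0.0, 1.3, -0.5]   # 法律领域微调后

v_med = tuning_vector(med, base)
v_law = tuning_vector(law, base)
merged = apply_vectors(base, [v_med, v_law])
print(merged)  # [0.2, 1.3, -0.4]:同时携带两个领域的方向变化
```

这与任务向量(task arithmetic)的做法同构;论文的贡献在于用这类向量解释领域微调在MLP层与注意力头上的不同作用方式。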


【16】NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models
标题:NL2GenSym:通过大型语言模型实现SOAR认知架构的自然语言到生成符号规则
链接:https://arxiv.org/abs/2510.09355

作者:Fang Yuan, Junjie Zeng, Yue Hu, Zhengqiu Zhu, Quanjun Yin, Yuxiang Xie
摘要:SOAR是一种经典的基于符号的认知架构,一直在推动通用的、类人智能体的发展。然而,其实际应用受到繁琐的手工规则编码的阻碍。新兴的大型语言模型(LLM)为高效生成规则提供了巨大潜力。然而,存在一个关键缺口:目前的研究主要集中在概念框架上,缺乏扎实的实验验证。为弥合这一缺口,我们提出了自然语言到生成式符号规则(NL2GenSym),这是一个将LLM与SOAR集成、从自然语言自主产生生成式符号规则的新框架。具体而言,我们的框架引入了一种新颖的、以执行为依据的生成器-批评器(Generator-Critic)机制。基于LLM的生成器在一个通过检索增强生成访问的自进化领域知识库的引导下,从自然语言中提出规则。随后,这些规则立即在SOAR环境中执行,以严格验证其正确性。基于这种以执行为依据的反馈,一个具备反思能力的基于LLM的批评器驱动规则的迭代细化。在我们专门构建的水壶问题(WJP)数据集上,使用Gemini和Qwen系列模型的实验验证了该框架的有效性:它从自然语言生成规则的成功率超过86%。至关重要的是,该框架还能生成新颖的启发式规则,将求解WJP的平均决策周期降至最优解的1.98倍,仅为基线方法的1/1000。此外,我们的初步实验表明,NL2GenSym使较小参数量的模型能够取得优于较大模型的性能。
摘要:SOAR, a classic symbol-based cognitive architecture, has been fostering the development of general, human-like intelligent agents. Nevertheless, its practical adoption is hindered by the laborious manual rule coding. Emerging Large Language Models (LLMs) present the immense potential for efficient rules generation. However, there is a critical gap that current research predominantly focuses on conceptual frameworks and lacks robust experimental validation. To bridge this gap, we propose \textit{N}atural \textit{L}anguage to \textit{Gen}erative \textit{Sym}bolic Rules (NL2GenSym), a novel framework that integrates LLMs with SOAR to autonomously produce generative symbolic rules from natural language. Specifically, our framework introduces a novel Execution-Grounded Generator-Critic mechanism. The LLM-based Generator, guided by a Retrieval-Augmented Generation-accessed self-evolving domain knowledge base, proposes rules from natural language. Subsequently, these rules are immediately executed within the SOAR environment to rigorously validate their correctness. Based on this execution-grounded feedback, a reflective LLM-based Critic drives the iterative refinement of these rules. Experiments on our specialized Water Jug Problem (WJP) dataset, utilizing both Gemini and Qwen series models, validate the efficacy of our framework. It achieves a success rate over 86\% in generating rules from natural language. Crucially, the framework also generates novel heuristic rules, reducing average decision cycles for solving the WJP to 1.98 times the optimal solution and 1/1000 of baseline methods. Additionally, our initial experiments show that NL2GenSym enables smaller-parameter models to achieve better performance than larger counterparts.


【17】ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
标题:ReTraceQA:评估小语言模型在常识问答中的推理轨迹
链接:https://arxiv.org/abs/2510.09351

作者:Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
备注:Submitted to ARR October 2025
摘要:虽然小语言模型(SLM)在越来越多的常识推理基准上表现出良好的性能,但目前的评估实践几乎完全依赖其最终答案的准确性,忽略了得出这些答案的推理过程是否有效。为了解决这个问题,我们提出了ReTraceQA,一个为常识推理任务引入过程级评估的新基准。我们的专家标注数据集显示,在相当大比例的实例中(14-24%),SLM尽管推理过程有缺陷,仍给出了正确的最终答案,这表明仅将最终答案与标准答案进行比较的评估指标往往高估了SLM的能力。事实上,我们表明,当使用强大的大型语言模型(LLM)作为自动评判者进行推理感知评估、而非仅采用答案指标时,SLM的性能在所有模型和数据集上都显著下降,得分最多下降25%。
摘要:While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.


【18】LLP: LLM-based Product Pricing in E-commerce
标题:LLP:电子商务中基于LLM的产品定价
链接:https://arxiv.org/abs/2510.09347

作者:Hairu Wang, Sheng You, Qiheng Zhang, Xike Xie, Shuguang Han, Yuchen Wu, Fei Huang, Jufeng Chen
摘要:与企业对消费者(B2C)电商平台(如亚马逊)不同,消费者对消费者(C2C)平台(如eBay)上缺乏经验的个人卖家往往难以有效地为二手商品定价。因此,已有许多研究提出自动化的价格预测。然而,它们大多基于静态回归模型,泛化性能较差,且无法捕捉市场动态(例如,二手iPhone的价格随时间下降)。受大型语言模型(LLM)最新突破的启发,我们提出了LLP,第一个基于LLM的二手商品定价生成式框架。LLP首先检索相似商品,以更好地适应动态的市场变化;然后利用LLM对自由文本中关键定价信息的细致理解来生成准确的价格建议。为了加强LLM对检索到的商品的领域推理能力,我们在通过双向推理构建的数据集上应用了两阶段优化:监督微调(SFT)之后进行组相对策略优化(GRPO)。此外,LLP采用基于置信度的过滤机制来拒绝不可靠的价格建议。大量实验表明,LLP大幅超越现有方法,并能很好地泛化到未见过的品类。我们已成功将LLP部署在闲鱼(Xianyu,中国最大的二手电商平台)上,显著优于之前的定价方法:在相同的30%商品覆盖率下,静态采纳率(SAR)从40%提高到72%,在90%召回率下仍保持47%的高SAR。
摘要:Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs' nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs' domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu (China's largest second-hand e-commerce platform), significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.
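摘要中"在固定商品覆盖率下统计采纳率"的评估思路,可用如下示意代码表达。此处仅为概念演示,与LLP的实际实现无关,置信度与采纳标记均为虚构数据:

```python
# Illustrative sketch (not the paper's code): confidence-based filtering
# of price suggestions. Keep the top fraction of suggestions ranked by
# model confidence, then measure the adoption rate among those kept.

def filter_by_coverage(suggestions, coverage):
    """suggestions: list of (confidence, adopted) pairs, adopted in {0,1}.
    Keep the top `coverage` fraction by confidence."""
    ranked = sorted(suggestions, key=lambda s: s[0], reverse=True)
    k = max(1, int(len(ranked) * coverage))
    kept = ranked[:k]
    adoption_rate = sum(a for _, a in kept) / len(kept)
    return kept, adoption_rate
```

拒绝低置信度建议(即降低覆盖率)通常会换来更高的采纳率,这正是摘要中30%覆盖率下SAR提升的权衡逻辑。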


【19】FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
标题:FLRC:用于高效LLM推理的细粒度低秩压缩器
链接:https://arxiv.org/abs/2510.09332

作者:Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
备注:Accepted by EMNLP 2025
摘要:虽然大型语言模型(LLM)已经取得了显著的性能,但其庞大的参数量阻碍了在资源受限硬件上的部署。低秩压缩可以同时减少内存使用和计算需求,但在所有层上应用统一的压缩比通常会导致显著的性能下降,而且先前的方法在解码阶段表现不佳。为了解决这些问题,我们提出了细粒度低秩压缩器(FLRC),它能高效地为每一层确定最优的秩分配,并采用渐进式低秩解码来保持文本生成质量。在多个基准上的综合实验证明了FLRC的优越性:与最先进的低秩压缩方法相比,摘要任务上的ROUGE-L最多提高17%,从而建立了一个更强大、更高效的LLM推理改进框架。
摘要:Although large language models (LLM) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
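按层自适应地选秩是这类方法的核心。下面的草图按"保留奇异值平方能量的目标比例"为每层选秩,这只是一个常见的启发式假设,FLRC实际采用的准则可能不同:

```python
# Hypothetical illustration of non-uniform per-layer rank allocation for
# low-rank compression: pick the smallest rank whose singular values
# retain a target fraction of squared energy. Layers with flatter
# spectra naturally receive higher ranks than layers with sharp decay.

def rank_for_energy(singular_values, keep=0.95):
    total = sum(s * s for s in singular_values)
    acc = 0.0
    for i, s in enumerate(sorted(singular_values, reverse=True), start=1):
        acc += s * s
        if acc >= keep * total:
            return i
    return len(singular_values)
```

例如,谱衰减很快的层([10, 1, 0.1, 0.01])只需秩1即可保留95%的能量,而平坦谱([1, 1, 1, 1])则需要保留全部秩,这正是"统一压缩比导致性能下降"的直观原因。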


【20】Large Language Model Prompt Datasets: An In-depth Analysis and Insights
标题:大型语言模型提示数据集:深入分析和见解
链接:https://arxiv.org/abs/2510.09316

作者:Yuanming Zhang, Yan Lin, Arijit Khan, Huaiyu Wan
摘要:提示(prompt)是一种自然语言指令,它为大型语言模型(LLM)定义特定任务,并充当人与LLM交互的主要界面。随着LLM的部署日益广泛,各种提示数据集正从GitHub和社交媒体等平台中涌现。这些数据集涵盖广泛的应用和内容类型,既促进了LLM的更广泛使用,也改进了提示工程。在这项工作中,我们首次汇编了一份来自各种渠道的提示数据集的详尽清单,覆盖各类下游任务、语言、工程技术、属性和模态。我们选取具有代表性的关键数据集进行系统分析,揭示了不同类别提示构建的共性与差异,并将它们与文学、网页等其他文本语料区分开来。我们进一步提出了一种利用词性和依存结构的句法嵌入的提示优化方法:通过识别提示的质心表示并引导LLM将提示向该质心改写,我们的方法提升了模型输出的有意义程度。我们已公开数据集和代码。
摘要:A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we--for the first time--have compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora like literature and web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
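文中"将提示向其质心改写"的思路可用如下玩具代码说明。这里用普通数值向量代替论文中的句法嵌入,函数均为演示性假设:

```python
# Toy illustration of the centroid idea: compute the centroid of prompt
# embeddings, and identify the prompt farthest from it as a rewrite
# candidate. Plain lists stand in for the paper's syntactic embeddings.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def farthest_from_centroid(vectors):
    c = centroid(vectors)
    def dist(v):
        return sum((x - y) ** 2 for x, y in zip(v, c)) ** 0.5
    return max(range(len(vectors)), key=lambda i: dist(vectors[i]))
```

离质心最远的提示即与数据集整体构建模式偏离最大的提示,是引导LLM改写的自然目标。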


【21】ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation
标题:释志:法庭视图生成的中文轻量级大语言模型
链接:https://arxiv.org/abs/2510.09297

作者:Zhitian Hou, Kun Zeng
摘要:刑事法庭观点生成(CVG)是法律人工智能中的一项基础任务,旨在自动生成法律案件文书的"法庭观点"部分。由于案件事实的多样性和复杂性,生成法庭观点具有挑战性,直接从原始事实生成可能会限制性能。在本文中,我们提出了ShiZhi,第一个专门为法庭观点生成设计的大型语言模型(LLM)。我们构建了一个包含超过11万个案例的中文法庭观点生成数据集CCVG,每个案例都包含与相应法庭观点配对的事实描述。基于该数据集,ShiZhi在法庭观点生成上达到58.5 BLEU-1,在罪名预测上达到86.1%的准确率和92.5%的宏F1。实验结果表明,只要在高质量的领域特定数据上训练,即使是小型LLM也能生成合理且法律上连贯的法庭观点。我们的模型和数据集可在 https://github.com/ZhitianHou/ShiZhi 获取。
摘要:Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the "Court View" section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieves 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at https://github.com/ZhitianHou/ShiZhi.


【22】Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
标题:从大型语言模型的强化学习后训练中检测数据污染
链接:https://arxiv.org/abs/2510.09259

作者:Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
摘要:数据污染对大型语言模型(LLM)的可靠评估构成了重大威胁。当基准样本无意中出现在训练集中时就会产生这一问题,从而损害所报告性能的有效性。虽然已经为预训练和监督微调阶段开发了检测方法,但对于日益重要的强化学习(RL)后训练阶段,仍存在关键的研究空白。随着RL后训练成为推进LLM推理的关键,这一范式中缺乏专门的污染检测方法构成了一个严重漏洞。为此,我们对RL后训练场景中的数据污染检测进行了首次系统研究,并提出了Self-Critique方法。我们的方法源于一个关键观察:经过RL阶段后,LLM的输出熵分布往往会坍缩为高度特定且稀疏的模式。Self-Critique探测其背后的策略坍缩,即模型收敛到一条狭窄的推理路径,而这正是熵降低的原因。为了推动这项研究,我们还提出了RL-MIA,一个为模拟这种特定污染情景而构建的基准。大量实验表明,Self-Critique在多个模型和污染任务上显著优于基线方法,AUC最多提高30%。对于RL阶段的污染,现有方法接近随机猜测,而我们的方法使检测成为可能。
摘要:Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
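上述"RL后熵坍缩"信号的最小示意:计算逐token输出分布的平均香农熵,异常低的熵可作为疑似记忆化/污染的线索。以下仅为概念演示,并非论文中Self-Critique探针的实现:

```python
import math

# Conceptual sketch of the entropy-collapse signal: compute the mean
# Shannon entropy of per-token output distributions. A distribution
# collapsed onto one token has zero entropy; a uniform one is maximal.

def token_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_output_entropy(distributions):
    return sum(token_entropy(d) for d in distributions) / len(distributions)
```

在真实系统中,这些分布来自模型对基准样本逐token的logits;此处只演示熵的计算本身。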


【23】CrisiText: A dataset of warning messages for LLM training in emergency communication
标题:CrisiText:面向紧急通信LLM训练的警告消息数据集
链接:https://arxiv.org/abs/2510.09243

作者:Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini
摘要:在自然灾害或暴力袭击等危机情况下,有效识别威胁并减轻其潜在损害,对于保护处于危险中的人员至关重要。为了应对这些挑战,人工智能已被用于在紧急情况下协助人类。尽管如此,NLP技术的使用仍然有限,且主要集中在分类任务上,而使用NLG架构及时生成警告消息的巨大潜力在很大程度上被忽视了。在本文中,我们提出了CrisiText,第一个面向13种不同危机场景的警告消息生成大规模数据集。该数据集包含40多万条警告消息(覆盖近1.8万个危机情境),旨在在此类事件期间及之后为平民提供帮助。为了构建数据集,我们从现有的危机描述出发,创建了与场景相关的事件链,并为每个事件配上一条警告消息。生成过程遵循专家撰写的指导方针,以确保术语正确和建议的真实性。此外,每条消息还附带三种次优警告类型,以便研究不同的NLG方法。为此,我们进行了一系列实验,比较了监督微调与偏好对齐、zero-shot和few-shot方法。我们还评估了模型在分布外场景中的性能,以及一个自动后编辑器的有效性。
摘要:Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts' written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.


【24】Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
标题:使用多模式大型语言模型和消费级相机诊断肩部疾病
链接:https://arxiv.org/abs/2510.09230

作者:Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu, Chuanyang Zheng, Qian Xu, Yun Ji, Qianyue Wen, Weiwei Sun, Hao Li, Huizhen Li, Huichao Wang, Kai Wu, Meng Li, Yijun He, Lingjie Luo, Jiankai Sun
摘要:肩部疾病,如冻结肩(又称粘连性肩关节囊炎),是影响全世界人群健康的常见病症,在老年人和从事重复性肩部作业的工人中发病率较高。在医疗资源稀缺的地区,实现早期、准确的诊断面临重大挑战,迫切需要低成本且易于推广的辅助诊断方案。本研究引入消费级设备拍摄的视频作为诊断依据,降低了用户成本。我们聚焦多模态大语言模型(MLLM)在肩部疾病初步诊断中的创新应用,提出了混合运动视频诊断框架(HMVDx)。该框架将动作理解和疾病诊断两项任务分开,分别由两个MLLM完成。除传统评估指标外,这项工作还依据医疗决策的逻辑过程(动作识别、运动诊断、最终诊断)提出了一种称为可用性指数(Usability Index)的新指标。该指标从整条医疗诊断路径的角度评估MLLM在医疗领域的有效性,为医疗从业者揭示了低成本MLLM在医疗应用中的潜在价值。在实验比较中,与直接视频诊断相比,HMVDx诊断肩关节损伤的准确率提高了79.6%,为未来MLLM视频理解在医疗领域的应用研究做出了重要的技术贡献。
摘要:Shoulder disorders, such as frozen shoulder (a.k.a., adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index by the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6\% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.


【25】DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction
标题:DICE:通过SLM引导的思维链纠正实现LLM中的结构化推理
链接:https://arxiv.org/abs/2510.09211

作者:Yiqi Li, Yusheng Liao, Zhe Chen, Yanfeng Wang, Yu Wang
摘要:在执行带有用户特定要求(如严格的输出格式)的推理任务时,大型语言模型(LLM)通常优先考虑推理而非遵守详细指令。由于计算成本高且参数访问受限,通过在监督数据集上微调LLM来解决这一问题并不现实。为此,我们提出了DICE,一个引导小语言模型(SLM)通过思维链(CoT)纠正来改进LLM输出的轻量级框架。DICE将流程解耦:首先提示LLM生成自然语言响应,然后使用经过训练的SLM分析并改进这些输出,使其满足结构化输出规范。该框架保留了LLM广博的知识和推理能力,同时确保输出符合用户需求。具体而言,DICE首先通过两阶段方法构建结构化CoT适配数据集,随后应用双重调优策略微调SLM,使其以"先分析后回答"的模式生成结构化输出。实验表明,DICE将LLM输出的平均格式准确率和内容正确率分别提高了35.4%和29.4%,达到了优于其他竞争基线的最先进(SOTA)性能。
摘要:When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs' broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4\% and 29.4\%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.


【26】Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM
标题:通过有声思维话语增强对话以用LLM建模个体人格特征
链接:https://arxiv.org/abs/2510.09158

作者:Seiya Ishikura, Hiroaki Yamada, Tatsuya Hiraoka, Hiroaki Yamada, Takenobu Tokunaga
备注:8 pages, 1 figure
摘要:本研究提出用有声思维话语(TAU)增强对话数据,以便用LLM对文本聊天中的个体人格进行建模。TAU是说话者在说出话语之前对自身想法的言语化。我们期望用TAU增强数据训练的"人格LLM"能更好地模仿说话者的人格特质。我们测试了训练得到的人格LLM是否在"大五"(从五个方面刻画人类人格特质的框架)意义上获得了人类人格。结果表明,与使用原始对话数据训练的LLM相比,使用TAU增强数据训练的LLM更接近说话者在大五中的宜人性(Agreeableness)和神经质(Neuroticism)。我们还发现,TAU增强的质量会影响人格LLM的表现。
摘要:This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLM. TAU is a verbalization of a speaker's thought before articulating the utterance. We expect "persona LLMs" trained with TAU-augmented data can mimic the speaker's personality trait better. We tested whether the trained persona LLMs obtain the human personality with respect to Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data more closely align to the speakers' Agreeableness and Neuroticism of Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLM's performance.


【27】When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs
标题:检索何时成功和失败:重新思考LLM的检索增强生成
链接:https://arxiv.org/abs/2510.09106

作者:Yongjie Wang, Yue Yu, Kaisong Song, Jun Lin, Zhiqi Shen
备注:Under Review
摘要:大型语言模型(LLM)凭借其强大的语言理解和生成能力,支撑了广泛的应用。然而,由于LLM是在静态语料上训练的,它们在处理快速变化的信息或特定领域的查询时面临困难。检索增强生成(RAG)正是为克服这一限制而开发的:通过将LLM与外部检索机制集成,使其能够访问最新的、与上下文相关的知识。然而,随着LLM本身在规模和能力上不断进步,传统RAG框架的相对优势变得不再那么明显和必要。在本文中,我们从RAG的总体目标和核心组件出发,对其进行了全面综述;随后分析了RAG中的关键挑战,指出可能限制其有效性的关键弱点;最后,我们展示了仅靠LLM表现不足、而RAG与LLM结合可显著提升效果的应用场景。我们希望这项工作能鼓励研究人员重新思考RAG的作用,并启发下一代RAG系统的开发。
摘要:Large Language Models (LLMs) have enabled a wide range of applications through their powerful capabilities in language understanding and generation. However, as LLMs are trained on static corpora, they face difficulties in addressing rapidly evolving information or domain-specific queries. Retrieval-Augmented Generation (RAG) was developed to overcome this limitation by integrating LLMs with external retrieval mechanisms, allowing them to access up-to-date and contextually relevant knowledge. However, as LLMs themselves continue to advance in scale and capability, the relative advantages of traditional RAG frameworks have become less pronounced and necessary. Here, we present a comprehensive review of RAG, beginning with its overarching objectives and core components. We then analyze the key challenges within RAG, highlighting critical weakness that may limit its effectiveness. Finally, we showcase applications where LLMs alone perform inadequately, but where RAG, when combined with LLMs, can substantially enhance their effectiveness. We hope this work will encourage researchers to reconsider the role of RAG and inspire the development of next-generation RAG systems.


【28】FrameEOL: Semantic Frame Induction using Causal Language Models
标题:FrameEOL:使用因果语言模型的语义框架归纳
链接:https://arxiv.org/abs/2510.09097

作者:Chihiro Yano, Kosuke Yamada, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
备注:Accepted in EMNLP Findings 2025. This version corrects the model size of Table 3
摘要:语义框架归纳是根据框架唤起词所唤起的语义框架对其进行聚类的任务。近年来,利用BERT等掩码语言模型(MLM)获得的框架唤起词嵌入,语义框架归纳已达到很高的性能。虽然GPT和Llama系列等因果语言模型(CLM)在广泛的语言理解任务中取得成功,并且能够像理解框架一样参与对话,但它们尚未被应用于语义框架归纳。我们提出了一种基于CLM的语义框架归纳新方法。具体来说,我们提出了FrameEOL,一种基于提示的框架嵌入获取方法,它输出一个框架名称作为代表给定情境的标签。为了获得更适合框架归纳的嵌入,我们利用了上下文学习(ICL)和深度度量学习(DML),然后通过对所得嵌入进行聚类来完成框架归纳。在英语和日语FrameNet数据集上的实验结果表明,所提出的方法优于现有的框架归纳方法。特别是对于缺乏大规模框架资源的日语,仅使用5个ICL示例的基于CLM的方法就达到了与使用DML微调的基于MLM的方法相当的性能。
摘要:Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.
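框架归纳的最后一步是对嵌入聚类。下面用一个玩具式的单链接阈值聚类示意这一步(论文实际使用的聚类算法可能不同,嵌入也仅为演示数据):

```python
# Toy single-link threshold clustering, standing in for the clustering
# step of frame induction: tokens whose frame embeddings are close are
# grouped as evoking the same frame.

def cluster(embeddings, threshold):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = []
    for i, e in enumerate(embeddings):
        for c in clusters:
            # join the first cluster containing a close-enough member
            if any(dist(e, embeddings[j]) <= threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

聚类质量完全取决于嵌入空间的几何结构,这正是论文用ICL与DML改善嵌入的动机。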


【29】Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
标题:Alif:通过多语言合成数据蒸馏推进乌尔都语大型语言模型
链接:https://arxiv.org/abs/2510.09051

作者:Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq
备注:Accepted to the EMNLP 2025 Workshop on Multilingual Representation Learning (MRL)
摘要:为乌尔都语等低资源语言开发高性能的大型语言模型(LLM)面临多重挑战,包括高质量数据集稀缺、多语言不一致以及安全问题。现有的多语言LLM通常通过翻译大量可用数据来解决这些问题,但这种翻译往往缺乏质量和文化细微差别,同时还会产生大量的数据管理和训练成本。为此,我们提出了Alif-1.0-8B-Instruct,一个以独特方法应对这些挑战的乌尔都语-英语多语言模型。我们在一个高质量的多语言合成数据集(Urdu-Instruct)上训练模型,该数据集使用改进的self-instruct技术开发。通过为每个任务使用独特的提示和种子值以及全局任务池,该数据集结合了乌尔都语原生的思维链推理、双语翻译、文化相关性和伦理安全对齐。这一技术显著增强了Alif-1.0-8B-Instruct模型对乌尔都语特定任务的理解。因此,基于预训练Llama-3.1-8B构建的Alif-1.0-8B-Instruct在乌尔都语特定任务中的表现优于Llama-3.1-8B-Instruct。它还优于领先的多语言LLM,包括Mistral-7B-Instruct-v0.3、Qwen-2.5-7B-Instruct和Cohere-Aya-Expanse-8B,而训练预算不到100美元。我们的结果表明,使用我们改进的self-instruct方法,可以高效地开发出高性能、文化对齐的低资源语言LLM。所有数据集、模型和代码均公开于:https://github.com/traversaal-ai/alif-urdu-llm。
摘要:Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model, that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performance and low-resource language LLMs can be developed efficiently and culturally aligned using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.


【30】Large Language Models Do NOT Really Know What They Don't Know
标题:大型语言模型并不真正知道他们不知道的事情
链接:https://arxiv.org/abs/2510.09033

作者:Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
摘要:最近的研究表明,大型语言模型(LLM)在其内部表示中编码了事实信号,例如隐藏状态,注意力权重或令牌概率,这意味着LLM可能“知道他们不知道的”。然而,LLM也可以通过依赖捷径或虚假关联来产生事实错误。这些错误是由相同的训练目标驱动的,这些训练目标鼓励正确的预测,这就提出了一个问题,即内部计算是否可以可靠地区分真实的和幻觉的输出。在这项工作中,我们进行了一个机械的分析,LLM内部处理事实查询比较两种类型的幻觉的基础上依赖于主题信息。我们发现,当幻觉与学科知识相关联时,LLM采用与正确反应相同的内部回忆过程,导致重叠和不可区分的隐藏状态几何形状。与此相反,从主体知识中分离出来的幻觉会产生不同的、聚集的表征,使它们能够被检测到。这些发现揭示了一个根本的局限性:LLM在其内部状态中不编码真实性,而只是知识回忆的模式,表明“LLM并不真正知道他们不知道的东西”。
摘要:Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objectives that encourage correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".


【31】Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
标题:通过反思与修改自动细化语言模型的作文评分量规
链接:https://arxiv.org/abs/2510.09030

作者:Keno Harada, Lui Yoshida, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
摘要:大型语言模型(LLM)的性能对所给提示高度敏感。本研究从提示优化领域汲取灵感,通过改进LLM使用的评分量规(rubric)来研究增强自动作文评分(AES)的潜力。具体而言,我们的方法提示模型反思其自身的评分理由以及与样本作文上人类评分的差异,从而迭代地改进量规。使用GPT-4.1、Gemini-2.5-Pro和Qwen-3-Next-80B-A3B-Instruct在TOEFL11和ASAP数据集上的实验显示,二次加权Kappa(QWK)最多分别提高0.19和0.47。值得注意的是,即使从简单的初始量规出发,我们的方法也能取得与使用人类撰写的详细量规相当或更好的QWK。我们的研究结果凸显了在基于LLM的AES中迭代改进量规以增强与人类评估一致性的重要性。
摘要:The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models' own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
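摘要中报告的一致性指标QWK(Quadratic Weighted Kappa)可按其标准定义计算如下。这是该指标的通用实现,并非论文代码:

```python
# Quadratic Weighted Kappa (QWK) in its standard form for integer
# scores in [min_r, max_r]: 1 - (weighted observed disagreement) /
# (weighted expected disagreement under independent marginals).

def quadratic_weighted_kappa(a, b, min_r, max_r):
    n = max_r - min_r + 1
    O = [[0.0] * n for _ in range(n)]          # observed score matrix
    for x, y in zip(a, b):
        O[x - min_r][y - min_r] += 1
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    total = len(a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2 if n > 1 else 0.0
            expected = hist_a[i] * hist_b[j] / total
            num += w * O[i][j]
            den += w * expected
    return 1.0 - num / den if den else 1.0
```

完全一致时QWK为1,随机一致时约为0,系统性反向评分时可为负。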


【32】On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
标题:论大型视觉语言模型中面向物体幻觉的视觉token认知不确定性
链接:https://arxiv.org/abs/2510.09008

作者:Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun
摘要:大型视觉语言模型(LVLM)将视觉编码器(VE)与大型语言模型集成,在各种任务中取得了显著成功。然而,LVLM仍面临关键挑战,例如物体幻觉,即生成输入图像中并不存在的物体的描述。本文认为,VE内部的不确定视觉token是导致物体幻觉的关键因素。我们的统计分析发现,具有高认知不确定性的视觉token与幻觉的发生之间存在正相关。此外,我们从理论和实证上表明,早期VE层中在微小对抗扰动下表现出较大表示偏差的视觉token具有较高的认知不确定性。基于这些发现,我们提出了一种仅修改VE即可缓解物体幻觉的简单而有效的策略。我们的方法包括:一种利用对抗扰动高效识别不确定视觉token的代理方法,以及一种在VE中间层的自注意力过程中掩蔽这些不确定视觉token的方法,从而抑制其对视觉编码的影响,缓解幻觉。大量实验表明,我们的方法显著减少了LVLM中的物体幻觉,并能与其他现有技术协同工作。
摘要:Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE is a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
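"扰动下表示偏移越大、不确定性越高"的代理思路可示意如下。encode是任意早期编码层的占位函数,整段代码仅为概念演示,并非论文实现(论文使用的是对抗扰动而非随机扰动):

```python
import random

# Conceptual sketch: tokens whose representations shift most under a
# tiny input perturbation are flagged as uncertain and masked out.
# `encode` stands in for early vision-encoder layers; tokens are scalars
# here purely for illustration.

def deviation_scores(tokens, encode, eps=1e-3, seed=0):
    rng = random.Random(seed)
    base = encode(tokens)
    noisy = [t + rng.uniform(-eps, eps) for t in tokens]
    pert = encode(noisy)
    return [abs(b - p) for b, p in zip(base, pert)]

def mask_uncertain(tokens, scores, k):
    """Zero out the k tokens with the highest deviation scores."""
    worst = sorted(range(len(tokens)), key=lambda i: scores[i])[-k:]
    return [0.0 if i in worst else t for i, t in enumerate(tokens)]
```

对噪声最敏感的token被视为认知不确定性最高的token,在后续注意力中被抑制。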


【33】Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
标题:将安全性解耦到正交子空间:大型语言模型的低成本且保持性能的对齐
链接:https://arxiv.org/abs/2510.09004

作者:Yutao Mou, Xiaoling Zhou, Yuxiao Luo, Shikun Zhang, Wei Ye
备注:Work in Progress
摘要:安全对齐对于构建值得信赖的人工智能至关重要,但在不降低通用性能的前提下提升模型安全性仍然具有挑战性。当前方法需要以高昂的计算代价搜索安全关键数据与通用数据的最优比例,以平衡安全性和通用性能,成本高而收益有限。在这项工作中,我们表明,即使仅在安全数据上训练,基于LoRA的拒绝训练(Refusal-training)也能实现保持性能的安全对齐,这说明LoRA可以充当低成本、保持性能且即插即用的安全补丁。除实证结果外,我们还提供了理论和实验证据,证明LoRA有效地将安全性解耦到一个与模型固有变换空间基本正交的低秩子空间中,确保安全性增强不会干扰模型的固有能力。
摘要:Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.


【34】MASA: LLM-Driven Multi-Agent Systems for Autoformalization
标题:MASA:面向自动形式化的LLM驱动多智能体系统
链接:https://arxiv.org/abs/2510.08988

作者:Lan Zhang, Marco Valentino, André Freitas
备注:EMNLP 2025 Demo camera-ready. Code and data are available at: this https URL
摘要:自动形式化在连接自然语言和形式推理方面起着至关重要的作用。本文介绍了MASA，一个由大型语言模型(LLM)驱动、用于构建自动形式化多智能体系统的新框架。MASA利用协作智能体将自然语言陈述转换为其形式化表示。MASA的架构设计非常强调模块化、灵活性和可扩展性，允许无缝集成新的智能体和工具，以适应快速发展的领域。我们通过真实世界数学定义的用例和在形式数学数据集上的实验展示了MASA的有效性。这项工作突出了由LLM与定理证明器交互驱动的多智能体系统在提高自动形式化效率和可靠性方面的潜力，为该领域的研究人员和从业者提供了有价值的见解和支持。
摘要:Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.


【35】Semantic-Condition Tuning: Fusing Graph Context with Large Language Models for Knowledge Graph Completion
标题:语义条件调优:融合图上下文与大型语言模型以实现知识图谱补全
链接:https://arxiv.org/abs/2510.08966

作者:Ruitong Liu, Yan Wen, Te Sun, Yunjia Wu, Pingyang Huang, Zihang Yu, Siyuan Li
备注:11 pages, 3 figures, conference
摘要:将知识图谱与大型语言模型融合对于知识图谱补全等知识密集型任务至关重要。流行的范式——前缀调优——只是简单地将知识嵌入与文本输入拼接起来。然而，这种浅层融合忽略了知识图谱中丰富的关系语义，并给LLM带来了将前缀与文本关联起来的显著隐式推理负担。为了解决这些问题，我们提出了语义条件调优(SCT)，一种包含两个关键模块的新知识注入范式。首先，语义图模块采用图神经网络，在知识增强关系的指导下，从局部图邻域中提取上下文感知的语义条件。随后，该条件被传递给条件自适应融合模块，后者通过两个参数化投影器自适应地调制文本嵌入，从而实现深度的、逐特征的、知识感知的交互。得到的预融合嵌入随后被送入LLM进行微调。在知识图谱基准上的大量实验表明，SCT显著优于前缀调优和其他强基线。我们的分析证实，通过在LLM推理之前用语义图上下文调制输入表示，SCT提供了更直接、更有力的信号，从而实现更准确、更鲁棒的知识推理。
摘要:Fusing Knowledge Graphs with Large Language Models is crucial for knowledge-intensive tasks like knowledge graph completion. The prevailing paradigm, prefix-tuning, simply concatenates knowledge embeddings with text inputs. However, this shallow fusion overlooks the rich relational semantics within KGs and imposes a significant implicit reasoning burden on the LLM to correlate the prefix with the text. To address these, we propose Semantic-condition Tuning (SCT), a new knowledge injection paradigm comprising two key modules. First, a Semantic Graph Module employs a Graph Neural Network to extract a context-aware semantic condition from the local graph neighborhood, guided by knowledge-enhanced relations. Subsequently, this condition is passed to a Condition-Adaptive Fusion Module, which, in turn, adaptively modulates the textual embedding via two parameterized projectors, enabling a deep, feature-wise, and knowledge-aware interaction. The resulting pre-fused embedding is then fed into the LLM for fine-tuning. Extensive experiments on knowledge graph benchmarks demonstrate that SCT significantly outperforms prefix-tuning and other strong baselines. Our analysis confirms that by modulating the input representation with semantic graph context before LLM inference, SCT provides a more direct and potent signal, enabling more accurate and robust knowledge reasoning.


【36】SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures
标题:SOP-Maze:评估复杂业务标准操作程序上的大型语言模型
链接:https://arxiv.org/abs/2510.08942

作者:Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao
摘要:随着大型语言模型(LLM)被广泛部署为特定领域的智能体，人们提出了许多基准来评估它们在现实场景中遵循指令和做出决策的能力。然而，业务场景通常涉及复杂的标准操作程序(SOP)，而在这种情境下对LLM能力的评估尚未得到充分探索。为了弥合这一差距，我们提出了SOP-Maze，一个基于真实业务数据构建的基准，包含来自23个复杂SOP场景的397个任务。我们进一步将SOP任务分为两大类：侧根系统(LRS)，代表需要精确选择的宽选项任务；以及心根系统(HRS)，强调具有复杂分支的深度逻辑推理。大量实验表明，几乎所有最先进的模型都在SOP-Maze上表现不佳。我们进行了全面分析，并确定了三类关键错误：(i)路线盲视：难以遵循程序；(ii)对话脆弱性：无法处理真实对话的细微差别；(iii)计算错误：在复杂上下文中时间或算术推理出错。这项系统性研究考察了LLM在同时挑战广度和深度的SOP任务上的表现，为提升模型能力提供了新的见解。我们已在https://github.com/ADoublLEN/SOP-Maze开源我们的工作。
摘要:As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.


【37】Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions
标题:人工印象:通过特质印象的视角评估大型语言模型行为
链接:https://arxiv.org/abs/2510.08915

作者:Nicholas Deas, Kathleen McKeown
备注:EMNLP 2025 Camera Ready
摘要:我们介绍并研究人工印象——LLM对提示的内部表示中类似于人类基于语言形成的印象和刻板印象的模式。我们在生成的提示上拟合线性探针，以根据二维刻板印象内容模型(SCM)预测印象。利用这些探针，我们研究印象与下游模型行为之间的关系，以及可能形成此类印象的提示特征。我们发现，LLM在被提示时报告的印象并不一致，但印象可以更一致地从其隐藏表示中线性解码出来。此外，我们表明，提示的人工印象能够预测模型响应的质量以及其中模糊限制语(hedging)的使用。我们还研究了提示中特定的内容、文体和方言特征如何影响LLM的印象。
摘要:We introduce and study artificial impressions--patterns in LLMs' internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.
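A linear probe of the kind described can be sketched in a few lines, with synthetic data standing in for hidden states and SCM warmth scores (all names and dimensions here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 32  # number of prompts and hidden-representation size (toy values)

# Synthetic stand-ins: hidden states H and a hypothetical "warmth" direction.
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))                      # hidden representations
warmth = H @ w_true + 0.1 * rng.normal(size=n)   # SCM warmth scores (toy)

# A linear probe is just ridge regression from hidden states to the SCM score.
lam = 1e-2
w_probe = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ warmth)

pred = H @ w_probe
r2 = 1 - np.sum((warmth - pred) ** 2) / np.sum((warmth - warmth.mean()) ** 2)
# "Linearly decodable" in the abstract means a probe like this fits well.
assert r2 > 0.9
```

In the paper's setting, H would come from an actual LLM's hidden layers and the targets from SCM annotations; the probe's fit quality is what supports the decodability claim.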


【38】Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
标题:通过上下文语义锚点实现LLM的免自动编码上下文压缩
链接:https://arxiv.org/abs/2510.08907

作者:Xin Liu, RunSong Zhao, PengCheng Huang, XinYu Liu, JunYi Xiao, ChunYang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, JingBo Zhu
备注:18 pages,9 figures
摘要:上下文压缩通过将长上下文压缩为紧凑表示，为加速大型语言模型(LLM)推理提供了一种有前景的方法。当前的上下文压缩方法主要依赖自动编码任务来训练与上下文无关的压缩令牌以压缩上下文语义。虽然自动编码任务使压缩令牌获得了压缩能力，但通过自动编码任务进行压缩会产生根本性的不匹配：模型针对与实际下游任务相背离的重建目标进行优化，从而削弱了对现实使用更有益的特征。我们提出语义锚压缩(SAC)，这是一种新方法，它从基于自动编码任务的压缩转向一种先验地(\textit{a priori})具备压缩能力的架构。SAC没有通过自动编码任务训练模型来压缩上下文，而是直接从原始上下文中选择所谓的锚令牌，并将上下文信息聚合到其键值(KV)表示中。通过直接从上下文令牌导出表示，SAC消除了对自动编码训练的需要。为了在直接利用锚令牌的同时确保压缩性能，SAC采用了两项关键设计：(1)锚嵌入，使压缩器能够识别关键令牌；(2)双向注意力修改，允许锚令牌从整个上下文中捕获信息。实验结果表明，SAC在各种压缩比下始终优于现有的上下文压缩方法。在使用MRQA的分布外评估中，SAC在5倍压缩下比强基线提高1个EM，并且在更高压缩比下优势越来越大。
摘要:Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction that diverge from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability \textit{a priori}. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
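The bidirectional attention modification for anchor tokens can be sketched as a mask edit (toy sequence length and hypothetical anchor positions; in the paper the selection is informed by learned anchor embeddings):

```python
import numpy as np

seq_len = 8
anchors = [2, 5, 7]  # indices of selected anchor tokens (hypothetical choice)

# Start from a standard causal mask: token i may attend to positions j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional modification: anchor tokens may attend to the *entire* context,
# so their KV states can aggregate information from future tokens as well.
for a in anchors:
    mask[a, :] = True

# After encoding, only the anchors' key-value states are kept as the
# compressed context: here an 8-token context shrinks to 3 KV entries.
kept_kv = len(anchors)
assert kept_kv / seq_len == 0.375        # ~37.5% of tokens retained
assert mask[5].all()                     # anchor row sees the whole context
assert not mask[4, 6]                    # non-anchor rows stay causal
```

The compression ratio is then simply `seq_len / kept_kv`; the paper evaluates ratios of 5x and beyond.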


【39】A Unified Biomedical Named Entity Recognition Framework with Large Language Models
标题:具有大型语言模型的统一生物医学命名实体识别框架
链接:https://arxiv.org/abs/2510.08902

作者:Tengxiao Lv, Ling Luo, Juntao Li, Yanhua Wang, Yuchen Pan, Chao Liu, Yanan Wang, Yan Jiang, Huiyi Lv, Yuanyuan Sun, Jian Wang, Hongfei Lin
备注:Accepted as a short paper at BIBM2025
摘要:生物医学命名实体的准确识别是医学信息抽取和知识发现的关键。然而，现有方法往往难以处理嵌套实体、实体边界模糊以及跨语言泛化。本文提出了一个基于大型语言模型(LLM)的统一生物医学命名实体识别(BioNER)框架。我们首先将BioNER重新表述为文本生成任务，并设计了一种符号标记策略，以显式边界标注联合处理扁平和嵌套实体。为了提高多语言和多任务泛化能力，我们在多个中文和英文数据集上进行双语联合微调。此外，我们引入了一个基于对比学习的实体选择器，它通过利用边界敏感的正负样本来过滤不正确或虚假的预测。在四个基准数据集和两个未见过的语料库上的实验结果表明，该方法实现了最先进的性能和鲁棒的跨语言零样本泛化。源代码可在https://github.com/dreamer-tx/LLMNER免费获取。
摘要:Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at https://github.com/dreamer-tx/LLMNER.
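A symbolic tagging scheme of this general shape lets a text generator emit flat and nested entities with explicit boundaries, which a simple parser can recover (the tag inventory and example sentence below are invented for illustration, not the paper's actual scheme):

```python
import re

# Hypothetical symbolic tagging: entities are wrapped in <TYPE>...</TYPE> tags
# emitted by the generator; nesting handles entities inside entities.
tagged = "<DIS>acute <ANAT>kidney</ANAT> injury</DIS> treated with aspirin"

def parse_entities(text):
    """Recover (type, surface string) pairs from nested symbolic tags."""
    stack, out, plain = [], [], []
    for tok in re.split(r'(</?\w+>)', text):
        if tok.startswith('</'):
            typ, start = stack.pop()
            out.append((typ, ''.join(plain[start:])))
        elif tok.startswith('<'):
            stack.append((tok[1:-1], len(plain)))
        else:
            plain.append(tok)
    return out

ents = parse_entities(tagged)
assert ('ANAT', 'kidney') in ents              # nested inner entity
assert ('DIS', 'acute kidney injury') in ents  # enclosing flat entity
```

Because boundaries are explicit in the generated string, the same decoding handles flat and nested cases uniformly, which is the point of the reformulation as text generation.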


【40】FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
标题:FinAuditing:用于评估LLM的金融分类法结构化多文档基准
链接:https://arxiv.org/abs/2510.08886

作者:Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
摘要:公认会计原则(GAAP)的复杂性和可扩展商业报告语言(XBRL)文件的层次结构使得财务审计越来越难以自动化和验证。虽然大型语言模型(LLM)在非结构化文本理解方面表现出强大的能力，但它们对结构化、相互依赖且由分类法驱动的财务文档进行推理的能力在很大程度上仍未得到探索。为了填补这一空白，我们引入了FinAuditing，这是第一个分类法对齐、结构感知的多文档基准，用于在财务审计任务上评估LLM。FinAuditing基于真实的符合美国通用会计准则(US-GAAP)的XBRL文件，定义了三个互补的子任务：面向语义一致性的FinSM、面向关系一致性的FinRE和面向数值一致性的FinMR，每个子任务都针对结构化审计推理的不同方面。我们进一步提出了一个统一的评估框架，在这些子任务上集成检索、分类和推理指标。在13个最先进的LLM上进行的广泛零样本实验表明，当前模型在语义、关系和数值维度上表现不一致，在对层次化多文档结构进行推理时，准确率下降高达60-90%。我们的研究结果揭示了现代LLM在基于分类法的金融推理方面的系统性局限，并将FinAuditing确立为开发可信、结构感知且符合监管要求的金融智能系统的基础。基准数据集可在Hugging Face获取。
摘要:The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.


【41】ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
标题:ControlAudio:通过渐进式扩散建模实现文本引导、时序指示且语音可懂的音频生成
链接:https://arxiv.org/abs/2510.08878

作者:Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu
备注:18 pages, 8 tables, 5 figures
摘要:近期工作探索了带有细粒度控制信号(例如精确的时序控制或可懂的语音内容)的文本到音频(TTA)生成。然而，受数据稀缺性的限制，其大规模生成性能仍然不尽如人意。在本研究中，我们将可控TTA生成重新表述为多任务学习问题，并引入了一种渐进式扩散建模方法ControlAudio。我们的方法通过逐步策略，巧妙地拟合以更细粒度信息(包括文本、时序和音素特征)为条件的分布。首先，我们提出了一种涵盖标注和模拟的数据构建方法，按文本、时序、音素的顺序增广条件信息。其次，在模型训练阶段，我们在大规模文本-音频对上预训练扩散Transformer(DiT)，实现可扩展的TTA生成，然后逐步将时序和音素特征与统一的语义表示相结合，扩展可控性。最后，在推理阶段，我们提出渐进引导生成，依次强调更细粒度的信息，这与DiT由粗到细的采样本质天然一致。大量实验表明，ControlAudio在时间准确性和语音清晰度方面达到了最先进的性能，在客观和主观评估上均显著优于现有方法。演示样例见：https://control-audio.github.io/Control-Audio。
摘要:Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.


【42】Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
标题:模式增强的多回合越狱:利用大型语言模型中的结构漏洞
链接:https://arxiv.org/abs/2510.08859

作者:Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
摘要:大型语言模型(LLM)仍然容易受到多轮越狱攻击，这类攻击利用会话上下文逐步绕过安全约束。这些攻击通过不同的对话方式(教育讨论、个人经历、假设场景)针对不同的危害类别(如恶意软件生成、骚扰或欺诈)。现有的多轮越狱方法通常依赖启发式或临时性的探索策略，对底层模型弱点的洞察有限。对话模式与各危害类别下模型漏洞之间的关系仍然知之甚少。我们提出了模式增强攻击链(PE-CoA)，一个利用五种对话模式通过自然对话构建有效多轮越狱的框架。在涵盖10个危害类别的12个LLM上评估PE-CoA，我们取得了最先进的性能，并揭示了特定模式的漏洞和LLM的行为特征：模型表现出各自不同的弱点剖面，对一种对话模式的鲁棒性不会泛化到其他模式，且同一模型家族具有相似的失败模式。这些发现突出了安全训练的局限性，表明需要模式感知的防御。代码见：https://github.com/Ragib-Amin-Nihal/PE-CoA
摘要:Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: https://github.com/Ragib-Amin-Nihal/PE-CoA


【43】Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs
标题:图上搜索:知识图上大型语言模型推理的迭代知情导航
链接:https://arxiv.org/abs/2510.08825

作者:Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Jian-Yun Nie
摘要:大型语言模型(LLM)已展现出令人印象深刻的推理能力，但在知识密集型的多跳问题上仍不可靠——它们会遗漏长尾事实，在不确定时产生幻觉，而且其内部知识落后于现实世界的变化。知识图谱(KG)提供了结构化的关系证据来源，但现有的KGQA方法面临根本性的权衡：在不知道可用关系的情况下编写完整的SPARQL查询十分脆弱，检索大型子图会引入噪声，而具有并行探索的复杂智能体框架则会使搜索空间呈指数级扩张。为了解决这些限制，我们提出了图上搜索(SoG)，一个简单而有效的框架，使LLM能够仅用一个精心设计的\textsc{Search}函数执行迭代的、知情的图导航。SoG不预先规划路径也不检索大型子图，而是遵循"先观察后导航"的原则：在每一步，LLM在决定下一跳之前先检查当前实体实际可用的关系。该方法还能无缝适配不同的KG模式，并通过自适应过滤处理高度数节点。在涵盖Freebase和Wikidata的六个KGQA基准上，SoG无需微调即达到最先进的性能。我们在Wikidata基准上取得了尤为显著的提升(比此前最佳方法提高16%)，同时在Freebase基准上也有持续改进。
摘要:Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions -- they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed \textsc{Search} function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an ``observe-then-navigate'' principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16\% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
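The observe-then-navigate loop can be sketched on a toy graph, with a scripted chooser standing in for the LLM (entities, relations, and the question plan below are invented for illustration):

```python
# Toy knowledge graph: entity -> {relation: neighbor}. In SoG the model
# inspects the *actual* available relations at each hop before deciding.
KG = {
    "Marie Curie": {"born_in": "Warsaw", "spouse": "Pierre Curie"},
    "Warsaw": {"country": "Poland"},
    "Poland": {"capital": "Warsaw"},
}

def search(entity):
    """The single Search function: list relations available at this entity."""
    return sorted(KG.get(entity, {}))

def navigate(start, choose, max_hops=4):
    entity, path = start, []
    for _ in range(max_hops):
        relations = search(entity)        # observe first...
        rel = choose(entity, relations)   # ...then navigate (LLM's decision)
        if rel is None:
            break
        path.append((entity, rel))
        entity = KG[entity][rel]
    return entity, path

# Question: "In which country was Marie Curie born?" -- a scripted plan
# stands in for the LLM's per-step choices.
plan = {"Marie Curie": "born_in", "Warsaw": "country"}
answer, path = navigate(
    "Marie Curie",
    lambda e, rels: plan.get(e) if plan.get(e) in rels else None,
)
assert answer == "Poland"
assert path == [("Marie Curie", "born_in"), ("Warsaw", "country")]
```

Because the chooser only ever picks from relations that actually exist at the current entity, the brittleness of pre-compiled SPARQL queries is avoided by construction.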


【44】The Model's Language Matters: A Comparative Privacy Analysis of LLMs
标题:模型的语言很重要:大型语言模型的隐私比较分析
链接:https://arxiv.org/abs/2510.08813

作者:Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
摘要:大型语言模型(LLM)越来越多地部署在处理敏感数据的多语言应用中，但其规模和语言差异带来了重大的隐私风险。以往的评估大多只针对英语，本文研究语言结构如何影响在英语、西班牙语、法语和意大利语医学语料上训练的LLM的隐私泄露。我们量化了六个语言学指标，并评估了三种攻击向量：提取、反事实记忆和成员推理。结果表明，隐私脆弱性随语言冗余度和分词粒度而变化：意大利语表现出最强的泄露，而英语表现出较高的成员可分性。相比之下，法语和西班牙语由于更高的形态复杂性而表现出更强的韧性。总体而言，我们的研究结果首次提供了语言在隐私泄露中至关重要的定量证据，强调了在LLM部署中需要语言感知的隐私保护机制。
摘要:Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Mostly evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.


【45】Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models
标题:学习记住什么:面向内存高效语言模型的自适应概率记忆保留
链接:https://arxiv.org/abs/2510.08798

作者:S M Rafiuddin, Muntaha Nujat Khan
备注:14 Pages, 2 Figures, 6 Table, Accepted at EMNLP 2025 Findings as a Short Paper
摘要:Transformer注意力随序列长度呈二次方增长(O(n^2))，限制了长上下文的使用。我们提出自适应保留(Adaptive Retention)，一种概率化的逐层令牌选择机制，学习在严格的全局预算M下保留哪些表示。保留行为用经Hard-Concrete/变分松弛训练的Bernoulli门建模，并在推理时用简单的top-M规则强制执行，使该方法可微分并可直接用于标准编码器。在分类、抽取式问答和长文档摘要任务中，仅保留30-50%的令牌即可保持≥95%的全模型性能，同时将峰值内存降低约35-45%，吞吐量最高提升约1.8倍。这种与架构无关的方法无需修改基础注意力或任务头即可带来实用的长上下文效率。
摘要:Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves >= 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.
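A rough sketch of the two regimes described above, with toy logits in place of learned ones: a Hard-Concrete relaxation yields soft gates for training, while inference applies the strict top-M rule (parameter values are illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, M = 10, 4  # sequence length and global retention budget (toy values)
logits = rng.normal(size=n_tokens)  # learned per-token retention logits

def hard_concrete(logits, temp=0.5, limit_l=-0.1, limit_r=1.1):
    """Differentiable Bernoulli relaxation used during training (a sketch)."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + logits) / temp))
    # Stretch to (limit_l, limit_r) and clip, so exact 0/1 gates are reachable.
    return np.clip(s * (limit_r - limit_l) + limit_l, 0.0, 1.0)

# Training: soft gates in [0, 1] let gradients flow through the selection.
gates = hard_concrete(logits)
assert gates.shape == (n_tokens,)
assert gates.min() >= 0.0 and gates.max() <= 1.0

# Inference: the simple top-M rule -- keep exactly M tokens, drop the rest.
keep = np.argsort(logits)[-M:]
mask = np.zeros(n_tokens, dtype=bool)
mask[keep] = True
assert mask.sum() == M  # strict global budget respected
```

The top-M step is what makes the budget M a hard guarantee at inference time, while the relaxed gates make the same selection trainable end to end.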


【46】Measuring Moral LLM Responses in Multilingual Capacities
标题:衡量多语言能力中的道德LLM响应
链接:https://arxiv.org/abs/2510.08776

作者:Kimaya Basu, Savi Kolari, Allison Yu
备注:10 pages, 5 figures; referenced articles: arXiv:2303.08774, arXiv:2303.12528, arXiv:2308.14132, arXiv:2505.12201, arXiv:2406.04428, arXiv:2407.02273, arXiv:2404.01268, arXiv:2502.09747, arXiv:2507.13474, arXiv:2505.21479, arXiv:2306.05685
摘要:随着LLM的使用在国家、语言和整个人类范围内日益普及，理解并为其多语言响应设置护栏的需求也随之增加。人们已经创建了用于测试和基准测试的大规模数据集，以在多个维度上评估和改进LLM的响应。在本研究中，我们评估了前沿模型和领先开源模型在低资源和高资源语言中五个维度上的响应，以衡量LLM在多语言环境中的准确性和一致性。我们使用五分制评分标准和一个评判LLM来评估这些响应。我们的研究表明，GPT-5在每个类别中的平均表现最好，而其他模型在不同语言和类别中表现出更多不一致。最值得注意的是，在"同意与自主"和"伤害预防与安全"类别中，GPT得分最高，平均值分别为3.56和4.73，而Gemini 2.5 Pro得分最低，平均值分别为1.39和1.98。这些发现强调，需要进一步测试语言变化如何影响LLM在各类别中的响应，并在这些方面加以改进。
摘要:With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.


【47】Robust Heuristic Algorithm Design with LLMs
标题:使用LLM的鲁棒启发式算法设计
链接:https://arxiv.org/abs/2510.08755

作者:Pantea Karimi, Dany Rouhana, Pooria Namyar, Siva Kesava Reddy Kakarla, Venkat Arun, Behnaz Arzani
摘要:我们认为，如果在使用LLM进行启发式设计的方法中辅以能够解释启发式为何表现不佳并建议如何修复的工具，就可以生成更鲁棒、更高性能的启发式设计。我们发现，即使是简单的想法——(1)让LLM接触启发式表现不佳的实例；(2)解释其发生的原因；(3)针对输入空间中的特定区域进行专门设计——与现有技术相比也能产生更鲁棒的算法：我们生成的启发式算法与FunSearch相比，最坏情况性能提升约28倍，平均性能也有所提高，同时保持了运行时间。
摘要:We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and suggestions about how to fix them. We find even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques~ -- ~the heuristics we produce have a $\sim28\times$ better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.


【48】Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning
标题:探索联邦学习中大型语言模型对训练数据的跨客户端记忆
链接:https://arxiv.org/abs/2510.08750

作者:Tinnakit Udsa, Can Udomcharoenchaikit, Patomporn Payoungkhamdee, Sarana Nutanong, Norrathep Rattanavipanon
摘要:联邦学习(FL)允许在不共享原始数据的情况下进行协作训练,但仍然存在训练数据记忆的风险。现有的外语记忆检测技术每次只关注一个样本,低估了跨样本记忆的更微妙的风险。相比之下,最近的集中式学习(CL)的工作引入了细粒度的方法来评估记忆在训练数据中的所有样本,但这些假设集中访问数据,不能直接应用于FL。我们提出了一个框架,量化内部和客户间的记忆在FL使用细粒度的跨样本记忆测量所有客户端,从而弥合了这一差距。基于这个框架,我们进行了两项研究:(1)测量客户的微妙记忆和(2)检查影响记忆的关键因素,包括解码策略,前缀长度和FL算法。我们的研究结果表明,FL模型确实记住了客户数据,特别是客户内部数据,而不是客户之间的数据,记忆受到训练和推理因素的影响。
摘要:Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.
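One simple way to realize a fine-grained, cross-sample memorization measure is n-gram overlap between a model generation and each client's corpus. This is a toy stand-in for the paper's framework; the statistic and example corpora are assumptions made for illustration:

```python
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorization_score(generation, corpus, n=3):
    """Fraction of generated n-grams found anywhere in a client's corpus --
    cross-sample because it matches against *all* documents, not one."""
    gen = ngrams(generation.split(), n)
    corp = set().union(*(ngrams(doc.split(), n) for doc in corpus))
    return len(gen & corp) / max(len(gen), 1)

client_a = ["the patient was given aspirin twice daily after surgery"]
client_b = ["quarterly revenue grew faster than analyst expectations"]
generation = "the patient was given aspirin twice daily as prescribed"

intra = memorization_score(generation, client_a)  # same client's data
inter = memorization_score(generation, client_b)  # another client's data
# Echoes the paper's finding: intra-client memorization dominates.
assert intra > inter
```

Scoring every generation against every client's corpus yields the intra- versus inter-client comparison the paper reports.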


【49】Coordinates from Context: Using LLMs to Ground Complex Location References
标题:从上下文得到坐标:使用LLM定位复杂位置指代
链接:https://arxiv.org/abs/2510.08741

作者:Tessa Masis, Brendan O'Connor
备注:Under review at ARR
摘要:地理编码是将位置指代链接到实际地理位置的任务，对许多非结构化文本的下游分析至关重要。在本文中，我们探索了对组合式位置指代进行地理编码这一具有挑战性的设定。基于近期证明LLM能够对地理空间数据进行推理的工作，我们评估了与该任务相关的LLM地理空间知识与推理能力。基于这些洞察，我们提出了一种基于LLM的组合式位置指代地理编码策略。我们表明，该方法提高了任务性能，并且一个相对较小的微调LLM可以达到与大得多的现成模型相当的性能。
摘要:Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs' abilities to reason over geospatial data, we evaluate LLMs' geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.


【50】How Reliable is Language Model Micro-Benchmarking?
标题:语言模型微基准测试的可靠性有多高?
链接:https://arxiv.org/abs/2510.08730

作者:Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
摘要:微基准测试为语言模型开发中往往高得令人望而却步的时间和成本提供了一种解决方案：只在现有基准的一个很小的子集上进行评估。然而，这些微基准能否像它们所取代的完整基准一样一致地对模型进行排名？它们能否比随机选择一个数据点子集更一致地对模型排名？在许多情况下，我们发现答案是否定的。我们为微基准测试引入了一种元评估度量，考察微基准对两个模型排名的准确程度如何随二者在完整基准上的性能差异而变化。该方法可以确定哪些模型对能够被某个微基准正确排名，从而对微基准规模与可靠性之间的权衡进行更细粒度的分析。先前的工作建议最少只需选择10个示例；我们发现，对于在MMLU-Pro上准确率相差3.5个点或在BIG-bench Hard上相差4个点的模型对，没有任何微基准测试方法能够一致地对其正确排名。为了一致地对性能相对接近的模型对进行排名，我们表明往往必须选择多达250个示例，而在这一规模下，随机采样已与现有的微基准测试方法相当。在仅含25个示例的MMLU-Pro微基准上比较8B指令微调模型时，我们发现超过一半的两两比较结果不太可能被保持。我们的工作为微基准的用户和开发者在评估效率与可靠性之间的权衡提供了可操作的指导。
摘要:Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
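The subset-size versus reliability trade-off is easy to simulate. Below, two hypothetical per-example correctness vectors with a 3-point accuracy gap (not the paper's data) are used to estimate how often a random k-example micro-benchmark preserves the full-benchmark ranking:

```python
import random

random.seed(0)
n_items = 1000  # full benchmark size (toy value)

# Per-example correctness for two models whose full-benchmark accuracies
# are exactly 0.70 and 0.67; shuffled independently so items are unpaired.
model_a = [True] * 700 + [False] * 300
model_b = [True] * 670 + [False] * 330
random.shuffle(model_a)
random.shuffle(model_b)

def rank_preserved(a, b, k, trials=2000):
    """Share of random k-example subsets on which A still scores >= B."""
    hits = 0
    for _ in range(trials):
        idx = random.sample(range(n_items), k)
        hits += sum(a[i] for i in idx) >= sum(b[i] for i in idx)
    return hits / trials

p25 = rank_preserved(model_a, model_b, 25)
p250 = rank_preserved(model_a, model_b, 250)
# Larger micro-benchmarks rank closely matched models more reliably.
assert p250 > p25
```

At k=25 a nontrivial fraction of subsets invert the ranking of these closely matched models, which mirrors the abstract's finding that small micro-benchmarks often fail to preserve pairwise comparisons.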


【51】Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
标题:思考更长时间,而不总是更聪明:评估分层法律推理中的LLM能力
链接:https://arxiv.org/abs/2510.08710

作者:Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
备注:21 pages, 7 figures
摘要:基于案例的推理是美国法律实践的基石，要求专业人士通过与过去先例进行类比和区分来论证当前案件。虽然大型语言模型(LLM)已展现出非凡的能力，但它们在这种复杂而微妙的推理形式上的熟练程度仍需进一步研究。我们提出了一个形式化框架，将识别案件间重要区别的过程分解为三阶段推理任务。我们的框架使用称为"因素"(factors)的事实谓词对案件建模，将其组织成法律知识层级，并定义可验证的规则来识别区别、分析其论证支持并评估其重要性。通过对现代推理LLM的综合评估，我们揭示了一个悖论：虽然模型在表层推理(任务1)上达到高准确率，但在层级推理(任务2:64.82%-92.09%)上性能下降，并在综合分析(任务3:11.46%-33.99%)上崩溃。最引人注目的是，我们发现模型在错误回答上花费的计算资源总是多于正确回答，这表明"思考更长"并不总是意味着"思考更聪明"。我们的工作为复杂领域中LLM推理能力的细粒度分析提供了一种方法，并揭示了要实现鲁棒且可信的法律AI必须解决的根本局限。
摘要:Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
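A minimal sketch of the factor-based case modeling described above. The factor names are hypothetical trade-secret-style illustrations, not the paper's actual hierarchy; cases are modeled as sets of factors, each favoring one side, and surface-level distinctions are factors that cut against the analogy to the precedent:

```python
# Each factor is a fact predicate favoring plaintiff ("p") or defendant ("d").
FACTOR_SIDE = {
    "agreed_not_to_disclose": "p",
    "security_measures": "p",
    "info_reverse_engineerable": "d",
    "disclosure_in_negotiations": "d",
}

def distinctions(current, precedent, precedent_winner):
    """Surface-level (Task-1-style) distinctions: factors favoring the
    precedent's winner that appear only in the precedent, plus factors
    favoring the other side that appear only in the current case."""
    other = "d" if precedent_winner == "p" else "p"
    only_precedent = {f for f in precedent - current
                      if FACTOR_SIDE[f] == precedent_winner}
    only_current = {f for f in current - precedent
                    if FACTOR_SIDE[f] == other}
    return only_precedent | only_current

current = {"agreed_not_to_disclose", "info_reverse_engineerable"}
precedent = {"agreed_not_to_disclose", "security_measures"}
found = distinctions(current, precedent, precedent_winner="p")
```

The paper's harder tasks layer a knowledge hierarchy and significance analysis on top of this basic set comparison.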


【52】dInfer: An Efficient Inference Framework for Diffusion Language Models
标题:dInfer:扩散语言模型的高效推理框架
链接:https://arxiv.org/abs/2510.08666

作者:Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
摘要:基于扩散的大型语言模型(dLLM)已经成为自回归(AR)LLM的一个有前途的替代方案,利用基于去噪的生成来实现固有的并行性。尽管越来越多的开源dLLM模型涌现,但它们的广泛采用仍然受到缺乏标准化、高效推理框架的限制。我们提出了dInfer,一个高效且可扩展的dLLM推理框架。dInfer将推理流水线分解为四个模块化组件(模型、扩散迭代管理器、解码策略和KV缓存管理器),并为每个组件集成了新颖的算法以及系统级优化。通过这种算法创新和系统增强的组合,dInfer在不影响LLaDA-MoE输出质量的情况下实现了显著的效率提升。在批量大小为1时,它在HumanEval上每秒超过1,100个令牌,在$8\times$ H800 GPU上的六个基准测试中平均每秒超过800个令牌。与以前的系统相比,dInfer在保持相近模型性能的同时,比Fast-dLLM提供了$10\times$的加速。即使与使用最新vLLM推理引擎高度优化的AR模型QWen2.5-3B(具有相当数量的激活参数和相近性能)相比,dInfer仍然提供$2$-$3\times$的加速。dInfer的实现在https://github.com/inclusionAI/dInfer上开源。
摘要:Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. More and more open-sourced dLLM models are emerging, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared with QWen2.5-3B, an AR model with a comparable number of activation parameters and comparable performance that is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.


【53】A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data
标题:使用LLM评分文本数据增强评级量表测试的新型框架
链接:https://arxiv.org/abs/2510.08663

作者:Joe Watson, Ivan O'Conner, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell
摘要:心理评估通常依赖于结构化的评级量表,无法包含受访者自然语言的丰富细微差别。本研究利用LLM最新的进展,在一个新的概念框架内利用定性数据,结合LLM评分文本和传统的评分量表项目,创建一个增强的测试。我们使用抑郁症作为案例研究来展示这种方法,在高中学生的真实样本(n=693)和相应的合成数据集(n= 3,000)上开发和评估框架。在保持的测试集上,增强测试在测量精度和准确性方面取得了统计学上的显着改善。LLM项目的信息增益相当于在原始19项测试中添加6.3(真实数据)和16.0(合成数据)项。我们的方法标志着自动评分的概念转变,绕过了其典型的瓶颈:而不是依赖于预先标记的数据或复杂的专家创建的规则,我们根据项目信息的计算经验选择最具信息性的LLM评分指令。该框架提供了一种可扩展的方法,利用不断增长的转录文本流来增强传统的心理测量方法,我们讨论了其在临床健康及其他方面的潜在效用。
摘要:Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.
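The "calculations of item information" used to select LLM scoring instructions can be illustrated with the standard 2PL item response theory information function. The instruction names and the discrimination/difficulty parameters below are hypothetical, not the paper's fitted values:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT response probability at ability theta
    (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical parameters for candidate LLM scoring instructions;
# keep the instruction that is most informative near theta = 0.
candidates = {"instruction_A": (1.8, 0.2),
              "instruction_B": (0.7, -1.5),
              "instruction_C": (2.3, 0.1)}
info = {name: item_information(0.0, a, b) for name, (a, b) in candidates.items()}
best = max(info, key=info.get)
```

Summing information across retained items is what lets the paper express the LLM items' contribution as "equivalent to adding N rating-scale items".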


【54】Energy-Driven Steering: Reducing False Refusals in Large Language Models
标题:能量驱动的转向:减少大型语言模型中的错误拒绝
链接:https://arxiv.org/abs/2510.08646

作者:Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
摘要:大型语言模型(LLM)的安全对齐面临一个关键挑战:当前的对齐技术往往只关注提高针对有害提示的安全性,导致LLM变得过于谨慎,拒绝响应良性提示。因此,安全对齐的一个关键目标是在提高安全性的同时减少错误拒绝。在本文中,我们介绍能量驱动转向(EDS),一个新颖的、无需微调的框架,旨在通过动态的推理时干预来解决这一挑战。我们训练了一个轻量级的外部基于能量的模型(EBM),将高能量分配给不受欢迎的状态(错误拒绝或越狱),将低能量分配给理想的状态(有帮助的响应或安全拒绝)。在推理过程中,EBM将LLM的内部激活映射到一个"能量景观"。我们使用能量函数的梯度动态地将LLM的隐藏状态引导到低能量区域,从而在不修改其权重的情况下实时校正模型以生成理想的响应。这种方法将行为控制与模型的核心知识解耦,提供了一种计算开销极小的灵活解决方案。在各种模型上的广泛实验表明,我们的方法成功实现了这一目标:它大幅降低了错误拒绝率,例如将ORB-H基准上的合规率从57.3%提高到82.6%,同时保持基线安全性能。我们的工作提出了一个构建兼具低错误拒绝率和高安全性的LLM的有效范例。
摘要:Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often only focus on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. Therefore, a key objective of safe alignment is to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states to low energy regions, correcting the model to generate a desirable response in real-time without modifying its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show our method successfully achieves this objective: it substantially lowers false refusal rates. For example, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
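A toy sketch of the gradient-based steering step: a quadratic energy stands in for the learned EBM, and the hidden state is pushed down the energy gradient at inference time without touching model weights. The centroid, learning rate, and step count are illustrative assumptions, not the paper's values:

```python
def energy(h, mu_low):
    """Toy quadratic energy: low near the centroid of desirable states."""
    return sum((hi - mi) ** 2 for hi, mi in zip(h, mu_low))

def steer(h, mu_low, lr=0.1, steps=20):
    """Gradient-descent steering of a hidden state toward a low-energy
    region, mimicking EDS's inference-time intervention."""
    h = list(h)
    for _ in range(steps):
        grad = [2.0 * (hi - mi) for hi, mi in zip(h, mu_low)]  # dE/dh
        h = [hi - lr * gi for hi, gi in zip(h, grad)]
    return h

mu_low = [0.5, -1.0, 2.0]  # assumed centroid of "helpful response" activations
h0 = [3.0, 3.0, 3.0]       # an activation heading toward a false refusal
h1 = steer(h0, mu_low)
```

In the real method the energy function is a trained network over LLM activations, so the gradient comes from backpropagation rather than a closed form.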


【55】Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
标题:通过分层扩散语言模型进行下一语义尺度预测
链接:https://arxiv.org/abs/2510.08632

作者:Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
备注:Accepted to NeurIPS 2025
摘要:本文介绍了层次扩散语言模型(HDLM),这是一个用于语言建模的新型离散扩散模型家族。HDLM建立在一个层次化词汇表之上,其中具有详细语义的低级令牌被满射地映射到具有粗粒度含义的高级令牌。在前向过程中,每个令牌根据调度器被独立地扰动为语义更抽象的高级祖先;而在反向过程中,模型逐步预测下一个更详细的语义。总之,HDLM为语言建模提供了一个通用的、时变的下一语义尺度预测过程。我们推导出扩散证据下界(ELBO)的封闭形式表达式,并表明HDLM可以以灵活的方式实现,同时将现有的MDLM作为特例包含在内。我们还基于这些见解提出了实用的训练技术。大量文本生成实验验证了HDLM的有效性,其验证困惑度和生成困惑度始终低于基线。
摘要:In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
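The forward process can be sketched over a toy two-level hierarchy: each fine-grained token independently jumps to its coarser ancestor with a probability set by the scheduler. The vocabulary and the corruption schedule below are illustrative assumptions:

```python
import random

# Toy hierarchy: fine tokens map surjectively to coarse ancestors.
ANCESTOR = {"poodle": "DOG", "beagle": "DOG", "tabby": "CAT", "siamese": "CAT"}

def forward_perturb(tokens, t, rng):
    """Forward-process sketch: each token independently jumps to its
    coarser ancestor with probability t (a stand-in for the scheduler)."""
    return [ANCESTOR[tok] if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
x0 = ["poodle", "tabby", "beagle"]
x_clean = forward_perturb(x0, t=0.0, rng=rng)   # no corruption
x_coarse = forward_perturb(x0, t=1.0, rng=rng)  # fully abstracted
```

The reverse model then learns to predict the next, more detailed semantic level, which reduces to ordinary masked diffusion (MDLM) when the hierarchy has a single "mask" ancestor.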


【56】From What to Why: Thought-Space Recommendation with Small Language Models
标题:从什么到为什么:使用小型语言模型的思想空间推荐
链接:https://arxiv.org/abs/2510.08626

作者:Prosenjit Biswas, Pervez Shaik, Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar
备注:15 pages, 3 figures
摘要:大型语言模型(LLM)通过增强的推理提升了推荐能力,但由于推理成本高,其现实部署面临重大挑战。相反,虽然小语言模型(SLM)提供了一个高效的替代方案,但其用于推荐的推理能力仍未得到充分探索。现有系统通常仅将自然语言理由用作无监督的描述性文本,未能充分发挥其作为学习信号的潜力。在这项工作中,我们的主要思想是不依赖LLM的蒸馏知识,而是用SLM在多个领域建立对用户和物品的共同理解,称为思想空间(Thought Space)。为此,我们提出了PULSE(Preference Understanding by Latent Semantic Embeddings),一个将SLM生成的理由视为直接学习信号的框架,用交互历史对其进行监督,以联合建模用户的行为(什么)及其语义驱动因素(为什么)。现有方法只考虑序列和嵌入等交互信息,而PULSE将理由视为一等信号;这种新颖的设计产生了更鲁棒、更可泛化的嵌入。大量实验表明,PULSE在多个基准数据集上优于领先的基于ID、协同过滤(CF)和基于LLM的顺序推荐模型。此外,PULSE在跨域推荐中表现出卓越的可迁移性,并在面向推理的问答等下游任务上表现出强劲性能。我们的代码可在 https://anonymous.4open.science/r/Thinking_PULSE-0FC5/README.md 获得。
摘要:Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work our main idea is to create a common understanding of users and items across multiple domains, called Thought Space, with SLMs instead of using LLMs' distilled knowledge. To that end we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as direct learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals; this novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering. Our code is available \href{https://anonymous.4open.science/r/Thinking_PULSE-0FC5/README.md}{here}.


【57】Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
标题:LLM知道自己正在被测试吗?GPT-OSS-20B中的评估意识与激励敏感的失败
链接:https://arxiv.org/abs/2510.08624

作者:Nisar Ahmed, Muhammad Imran Zaman, Gulshan Saleem, Ali Hassan
摘要:大型语言模型(LLM)的基准测试通常依赖于带有评分标准气息的提示,要求可见的推理过程和严格的格式,而实际部署需要简洁、符合约定的答案。我们调查这种"评估气味"是否会在没有相应能力增益的情况下夸大测得的性能。使用单个开放权重模型(GPT-OSS-20B),我们运行了六个成对的A/B场景,这些场景保持任务内容和解码固定,同时改变框架(面向评估与现实世界)和推理深度(中/高):确定性数学、严格的代码修复、引用生成、激励翻转(谨慎与能力)、CoT可见性和多语言(乌尔都语)标题。确定性验证器计算准确性、仅回答合规性、对冲/拒绝、思维链(CoT)长度和模式合规性,并使用预先注册的增量和复合指数。在各种场景中,评估框架可靠地增加了CoT长度(数百至超过1000个字符)并降低了仅回答合规性,而准确性增益有限或不一致。在结构化输出中,它改进了包装形式(例如围栏代码块、枚举列表),但没有改进经正则验证的实质内容。激励性措辞重新调整了错误构成:赞扬谨慎在高推理深度下适度提高了准确性并减少了错误但自信的回答,而赞扬能力则产生更简洁但风险更高的输出。乌尔都语评分标准标题复现了这些特征,并可能在更高推理深度下降低准确性,表明存在多语言均等性风险。我们提供了一个可复现的A/B框架(提示库、验证器、每次运行的分数、脚本;带版本的DOI)和实用指南:中性措辞或双重框架检查、契约感知评分、风格增量报告、置信度治理和多语言仪表板,以确保基准收益反映可部署的能力。
摘要:Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.


【58】PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction
标题:PARSE:LLM驱动的模式优化,用于可靠的实体提取
链接:https://arxiv.org/abs/2510.08623

作者:Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
备注:EMNLP 2025 Industry Track
摘要:从非结构化文本中提取结构化信息对于新兴的软件3.0系统至关重要,其中LLM代理自主地与API和工具进行交互。最近的方法使用现有的JSON模式直接将大型语言模型应用于提取任务,通常使用约束解码或强化学习方法来确保语法有效性,但将JSON模式视为为人类开发人员设计的静态合同,导致次优提取性能,频繁幻觉,以及当模式包含模糊或不完整的规范时不可靠的代理行为。我们认识到,JSON模式本身是一种自然语言理解契约的形式,它对LLM应该能够解释和系统改进的数据结构契约的规则,关系和期望进行编码。因此,我们开发了PARSE(参数自动细化和模式提取),一个新的系统,具有两个协同组件:ARCHITECT,它自主优化JSON模式的LLM消费,同时通过RELAY(集成代码生成系统)保持向后兼容性,和SCOPE,它实现了基于反射的提取结合静态和基于LLM的护栏。我们在三个数据集上对PARSE进行了定性和定量评估,包括模式引导对话(SGD),结构化Web数据提取(SWDE)和内部零售对话数据,发现它在SWDE上的提取准确率提高了64.7%,综合框架改进在各个模型中达到了10%,同时在第一次重试内将提取错误减少了92%,并保持了实际的延迟。
摘要:Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.


【59】JAI-1: A Thai-Centric Large Language Model
标题:JAI-1:以泰国为中心的大型语言模型
链接:https://arxiv.org/abs/2510.08620

作者:Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Pontakorn Trakuekul, Sumana Sumanakul, Natchanon Pollertlam
摘要:本技术报告介绍了JAI-1,一个具有75B参数、以泰语为中心的语言模型。近期的泰语模型主要依赖现有开源模型,在不进行结构修改的情况下进行额外训练以专攻泰语。然而,这种方法有可能在注入泰语特定信息的过程中侵蚀模型参数空间中已有的知识,因为针对一般任务优化的参数可能与新的语言要求相冲突。相比之下,JAI-1采用了升级(upscaling)策略:从一个较小的、高性能的英语开源LLM出发,我们扩展其参数空间,并利用新分配的容量系统地整合泰语知识。这种方法不仅保留了原始模型的通用智能,还建立了有别于其他开源模型的独特架构,使未来可扩展的增强成为可能。在预训练期间,JAI-1接触了1.5T令牌,其中包括超过300B泰语令牌。随后是后训练阶段(监督微调和对齐调整),使用了超过600K条基于指令的示例。最终模型在以泰语为中心的基准测试(IFEval-TH、MT-Bench-TH和JAI-Hall-Bench)上表现出优于Typhoon2-70B的性能,验证了其升级与知识整合框架的有效性。
摘要:This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model's parameter space during the injection of Thai-specific information, as optimized parameters for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and utilized the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model's general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai language tokens. This was followed by post-training stages -- supervised fine-tuning and alignment tuning -- using more than 600K instruction-based examples. The final model demonstrated superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.


【60】LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
标题:LLM在释义压力测试下表现出表面形式脆弱性
链接:https://arxiv.org/abs/2510.08616

作者:Juan Miguel Navarro Carranza
备注:NeurIPS 2025 Workshop. Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling. Selected for contributed talk
摘要:大型语言模型(LLM)的基准分数可能因对测试题目或近似重复题目的记忆而被夸大。我们提出了一个简单的协议,通过在基准问题的释义版本上重新评估模型来探测泛化能力。使用Mistral-7B-Instruct和Qwen2.5-7B-Instruct,我们测量了ARC-Easy和ARC-Challenge上原始题目与释义题目之间的准确率差距。我们的流水线控制解码、强制执行多项选择输出格式,并包括一个稳健的释义清理步骤以保留语义。我们发现,释义会引起非平凡的准确率下降(原始与释义相比),这与先前关于数据污染和脆弱表面形式捷径的担忧一致。
摘要:Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
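The core measurement in such a protocol is a simple accuracy gap between the original and paraphrased runs; a sketch with hypothetical predictions and gold labels:

```python
def accuracy(preds, golds):
    """Fraction of multiple-choice predictions matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def paraphrase_gap(orig_preds, para_preds, golds):
    """Accuracy drop from original to paraphrased items; a large positive
    gap is consistent with surface-form shortcuts or contamination."""
    return accuracy(orig_preds, golds) - accuracy(para_preds, golds)

# Hypothetical answers on six items (A-D options).
golds      = ["A", "C", "B", "D", "A", "B"]
orig_preds = ["A", "C", "B", "D", "A", "C"]   # 5/6 correct
para_preds = ["A", "C", "D", "D", "B", "C"]   # 3/6 correct
gap = paraphrase_gap(orig_preds, para_preds, golds)
```

In practice both prediction lists come from the same model under identical decoding settings, so the gap isolates the effect of surface-form changes.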


【61】Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems
标题:基于LLM的数学应用题中干扰条件的迭代生成与精化
链接:https://arxiv.org/abs/2510.08615

作者:Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian, Hui Liu
摘要:数学推理是评估大型语言模型(LLM)智能的关键测试平台,而数学应用题(MWP)是最广泛使用的格式之一。大多数现有的MWP数据集只包含必要的信息,而分散注意力或过度条件的问题往往被忽视。先前的研究表明,当引入这种分散注意力的条件时,流行的LLM会经历戏剧性的性能下降。然而,可用的数据集的MWPs与分散注意力的条件仍然有限,大多数表现出低难度和上下文外的表达式。这些缺点使得分散注意力的条件很容易被发现和忽视,从而降低了这些数据集上基准测试的可信度。此外,当添加分散注意力的条件时,推理过程和答案可能会改变,需要大量的人工工作来检查和重写解决方案。   为了解决这些问题,我们设计了一个迭代框架,利用LLM自动生成分散注意力的条件。我们开发了一套提示,从多个角度和认知水平修改MWPs,鼓励创造有意义的分散注意力的条件,以及进一步完善的建议。我们的框架的一个关键优势是保留原始和修改后的问题之间的共享解决方案:LLM被明确引导生成不会改变原始解决方案的干扰,从而消除了产生新答案的需要。该框架高效且易于部署,大大减少了在分散注意力的情况下生成MWP所需的工作量,同时保持高数据质量。
摘要:Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions.   To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.


【62】Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications
标题:医疗保健大型语言模型中的性别偏见:分配一致性和临床意义
链接:https://arxiv.org/abs/2510.08614

作者:Mingxuan Liu, Yuhe Ke, Wentao Zhu, Mayli Mertens, Yilin Ning, Jingchi Liao, Chuan Hong, Daniel Shu Wei Ting, Yifan Peng, Danielle S. Bitterman, Marcus Eng Hock Ong, Nan Liu
摘要:将大型语言模型(LLM)集成到医疗保健中有望增强临床决策,但其对偏见的敏感性仍是一个关键问题。性别长期以来影响着医生的行为和患者的结局,这引发了人们的担忧:承担类人角色(如临床医生或医学教育者)的LLM可能会复制或放大与性别相关的偏见。使用来自《新英格兰医学杂志》挑战(NEJM)的案例研究,我们为多个开源和专有LLM分配了性别(女性、男性或未指定)。我们评估了在不同的LLM性别分配下,模型在基于LLM的诊断以及对患者性别的临床相关性或必要性判断上的回答一致性。我们发现,对大多数模型而言,诊断在不同LLM性别之间相对一致。然而,对于患者性别在基于LLM的诊断中的相关性和必要性,所有模型在不同LLM性别之间都表现出实质性的不一致,尤其是相关性判断。一些模型甚至在对患者性别的解释中表现出系统性的男女差异。这些发现揭示了一种未被充分研究的偏见,它可能破坏LLM在临床实践中的可靠性,强调在与LLM交互时需要对身份分配一致性进行例行检查,以确保可靠且公平的AI支持的临床护理。
摘要:The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender's relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgements. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.


【63】GraphGhost: Tracing Structures Behind Large Language Models
标题:GraphGhost:追踪大型语言模型背后的结构
链接:https://arxiv.org/abs/2510.08613

作者:Xinnan Dai, Kai Guo, Chung-Hsiang Lo, Shenglai Zeng, Jiayuan Ding, Dongsheng Luo, Subhabrata Mukherjee, Jiliang Tang
摘要:大型语言模型(LLM)表现出卓越的推理能力,但这些能力背后的结构机制仍有待探索。在这项工作中,我们引入了GraphGhost,这是一个将神经元激活及其信号传播表示为图形的统一框架,解释了LLM如何从顺序输入中捕获结构语义并通过结构一致的机制生成输出。这种基于图的视角使我们能够采用PageRank等图算法来表征LLM的属性,揭示不同数据集之间的共享和特定于模型的推理行为。我们进一步识别GraphGhost中激活的神经元,并通过结构干预对其进行评估,表明对关键神经元节点的编辑可以触发推理崩溃,改变逻辑流程和语义理解。总之,这些贡献将GraphGhost定位为分析,干预并最终理解LLM推理的结构基础的强大工具。
摘要:Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain underexplored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
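The PageRank-style characterization can be sketched with plain power iteration over a toy neuron-propagation graph; the edges below are hypothetical, standing in for activation signal flow:

```python
def pagerank(edges, n, damping=0.85, iters=100):
    """Power-iteration PageRank over a directed graph given as
    (src, dst) edges on nodes 0..n-1."""
    out = [[] for _ in range(n)]
    for s, d in edges:
        out[s].append(d)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        for s in range(n):
            if out[s]:
                share = damping * rank[s] / len(out[s])
                for d in out[s]:
                    nxt[d] += share
            else:  # dangling node: spread its mass uniformly
                for d in range(n):
                    nxt[d] += damping * rank[s] / n
        rank = nxt
    return rank

# Hypothetical signal-propagation edges among four "neuron" nodes;
# node 3 receives signal from all others and should rank highest.
edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
rank = pagerank(edges, 4)
key_neuron = max(range(4), key=lambda i: rank[i])
```

High-rank nodes are natural candidates for the structural interventions the paper describes: editing such a "key neuron" perturbs the most signal flow.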


【64】Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
标题:迈向更安全的网络:用于缓解敌对错误信息攻击的多语言多代理LLM
链接:https://arxiv.org/abs/2510.08605

作者:Nouar Aldahoul, Yasir Zaki
摘要:数字平台上错误信息的快速传播威胁着公共话语、情绪稳定和决策。虽然先前的工作已经探索了错误信息检测中的各种对抗性攻击,但本文中研究的具体转换尚未进行系统研究。特别是,我们调查了英语,法语,西班牙语,阿拉伯语,印地语和中文之间的语言切换,然后进行翻译。我们还研究了查询长度膨胀之前的总结和结构重新格式化为多项选择题。在本文中,我们提出了一个多语言,多代理大型语言模型框架检索增强生成,可以部署为网络插件到在线平台。我们的工作强调了人工智能驱动的错误信息检测在保护在线事实完整性免受各种攻击方面的重要性,同时展示了基于插件的部署在现实世界的Web应用程序中的可行性。
摘要:The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.


【65】LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
标题:LatentBreak:通过潜在空间反馈越狱大型语言模型
链接:https://arxiv.org/abs/2510.08604

作者:Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
摘要:越狱是对抗性攻击,旨在绕过大型语言模型的内置安全机制。自动越狱通常通过强制模型生成受限或有害响应的初始部分来优化对抗性后缀或适应长提示模板。在这项工作中,我们表明,现有的越狱攻击,利用这种机制来解锁模型响应可以检测到一个简单的基于困惑的过滤输入提示。为了克服这个问题,我们提出了LatentBreak,这是一种白盒越狱攻击,可以生成具有低困惑度的自然对抗性提示,从而能够逃避这种防御。LatentBreak将输入提示中的单词替换为语义等效的单词,保留提示的初始意图,而不是添加高困惑度的对抗性后缀或长模板。这些词是通过最小化潜在空间中对抗性提示的表示与无害请求的表示之间的距离来选择的。我们广泛的评估表明,LatentBreak导致更短和低困惑的提示,从而优于竞争的越狱算法对基于困惑的过滤器在多个安全对齐的模型。
摘要:Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
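A toy sketch of the substitution idea: greedily swap each word for the synonym that moves the prompt's mean embedding closest to a harmless-request centroid. The embeddings, synonym sets, and centroid are all illustrative assumptions; the real attack operates in the target model's latent space:

```python
# Toy word embeddings (hypothetical 2-D vectors).
EMB = {
    "construct": [0.9, 0.1], "build": [0.2, 0.1], "assemble": [0.3, 0.2],
    "device":    [0.8, 0.9], "gadget": [0.3, 0.3],
}
SYNONYMS = {"construct": ["build", "assemble"], "device": ["gadget"]}
HARMLESS_CENTROID = [0.1, 0.1]  # assumed mean latent of harmless requests

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prompt_distance(words):
    """Distance of the prompt's mean embedding to the harmless centroid."""
    mean = [sum(EMB[w][k] for w in words) / len(words) for k in range(2)]
    return dist2(mean, HARMLESS_CENTROID)

def latent_substitute(words):
    """Greedily replace each word with the semantically equivalent
    candidate that minimizes latent distance to harmless requests."""
    out = []
    for i, w in enumerate(words):
        cands = [w] + SYNONYMS.get(w, [])
        out.append(min(cands,
                       key=lambda c: prompt_distance(out + [c] + words[i + 1:])))
    return out

adv = latent_substitute(["construct", "device"])
```

Because only word-for-word substitutions are made, the rewritten prompt keeps natural phrasing and low perplexity, which is what lets it slip past perplexity-based filters.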


【66】Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection
标题:人类文本是离群值:通过分布外检测识别LLM生成的文本
链接:https://arxiv.org/abs/2510.08602

作者:Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu
备注:None
摘要:ChatGPT、DeepSeek和Claude等大型语言模型(LLM)的快速发展显著增加了数字通信中AI生成文本的存在。这一趋势加大了对可靠检测方法的需求,以区分人类创作与机器生成的内容。现有方法(无论是zero-shot方法还是监督分类器)在很大程度上将此任务概念化为二元分类问题,通常导致跨领域和跨模型的泛化能力差。在本文中,我们认为这种二元表述从根本上误刻画了检测任务,因为它假设人类书写文本存在一个连贯的表示。在现实中,人类文本并不构成一个统一的分布,其多样性无法通过有限的采样有效捕捉。这导致以往的分类器记住观察到的OOD特征,而不是学习"非ID"行为的本质,限制了对未见的人类创作输入的泛化。基于这一观察,我们提出将检测任务重新表述为分布外(OOD)检测问题,将人类书写的文本视为分布外的离群值,而机器生成的文本则是分布内(ID)样本。为此,我们开发了一个检测框架,采用单类学习方法(包括DeepSVDD和HRN)以及基于分数的学习技术(如基于能量的方法),实现了鲁棒且可泛化的性能。在多个数据集上的大量实验验证了我们基于OOD的方法的有效性。具体而言,该方法在DeepFake数据集上实现了98.3%的AUROC和AUPR,而FPR95仅为8.9%。此外,我们在多语言、受攻击以及未见模型和未见领域的文本设置上测试了我们的检测框架,展示了其鲁棒性和可泛化性。代码、预训练权重和演示将会发布。
摘要:The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches both zero-shot methods and supervised classifiers largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of `non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning method including DeepSVDD and HRN, and score-based learning techniques such as energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
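The energy-based scoring variant mentioned above can be sketched as follows. The logits and threshold are hypothetical, and this is only one of the framework's detectors (it also uses one-class methods such as DeepSVDD):

```python
import math

def energy_score(logits, T=1.0):
    """Energy of a sample: -T * logsumexp(logits / T), computed with the
    max-shift trick for numerical stability. In the OOD framing, confident
    in-distribution (machine-generated) text gets low energy."""
    m = max(l / T for l in logits)
    return -T * (m + math.log(sum(math.exp(l / T - m) for l in logits)))

def detect_human(logits, threshold):
    """Flag text as human-written (the distributional outlier) when its
    energy exceeds the threshold."""
    return energy_score(logits) > threshold

machine_logits = [6.0, 1.0, 0.5]   # peaked, ID-like scores (hypothetical)
human_logits   = [0.4, 0.3, 0.2]   # diffuse, OOD-like scores (hypothetical)
```

The threshold would normally be calibrated on held-out machine-generated text, e.g. to fix a target false-positive rate.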


【67】Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs
标题:Mnemosyne:一种用于边缘端LLM的无监督、受人类启发的长期记忆架构
链接:https://arxiv.org/abs/2510.08601

作者:Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio (Kaliber AI)
备注:12 pages, 4 figures
摘要:长期记忆对于自然、逼真的对话至关重要。然而,目前的大型语言模型(LLM)记忆系统要么依赖暴力式的上下文扩展,要么依赖静态检索管道,二者在边缘受限设备上均会失效。我们介绍Mnemosyne,一种专为边缘端LLM设计的无监督、受人类启发的长期记忆架构。我们的方法使用图结构存储、模块化的实质性与冗余过滤器、记忆提交和修剪机制,以及模仿人类记忆的带时间衰减和刷新过程的概率性召回。Mnemosyne还引入了一个集中的"核心摘要",它从记忆图的固定长度子集中高效导出,以捕获用户的个性和其他特定领域的长期细节,例如(以医疗保健应用为例)康复后的目标和对护理的态度。与现有的检索增强方法不同,Mnemosyne专为纵向医疗助理设计;在该场景中,重复且语义相似但时间上不同的对话会让朴素检索捉襟见肘。在纵向医疗保健对话的实验中,在针对真实感和长期记忆能力的盲评中,Mnemosyne取得了65.8%的最高胜率,而基线RAG的胜率为31.1%。与其他采用相同主干的技术相比,Mnemosyne还在时间推理和单跳检索上取得了目前最高的LoCoMo基准分数。此外,其54.6%的平均总分在所有方法中排名第二,击败了常用的Mem0和OpenAI基线等。这表明,借助一种边缘兼容、易于迁移的无监督记忆架构,改进的事实回忆、增强的时间推理和更自然的面向用户的响应是可行的。
摘要:Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated "core summary" efficiently derived from a fixed-length subset of the memory graph to capture the user's personality and other domain-specific long-term details such as, using healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
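摘要中描述的"带时间衰减和刷新的概率性召回"可以用下面的极简草图示意(假设性示例:半衰期、强度增量等参数名和数值均为虚构,并非论文的实际设计):

```python
class Memory:
    # 概率性召回的示意:召回概率随距上次访问的时间指数衰减,
    # 刷新(再次访问)则重置衰减并增强记忆强度。
    def __init__(self, content, strength=1.0, half_life=7.0):
        self.content = content
        self.strength = strength    # 刷新时增强
        self.half_life = half_life  # 召回概率减半所需天数(假设值)
        self.last_access = 0.0      # 上次刷新时间(天)

    def recall_probability(self, now):
        elapsed = now - self.last_access
        decay = 0.5 ** (elapsed / self.half_life)
        return min(1.0, self.strength * decay)

    def refresh(self, now, boost=0.2):
        # 再次访问记忆:重置衰减起点并增强记忆痕迹
        self.last_access = now
        self.strength = min(1.0, self.strength + boost)

m = Memory("patient prefers morning walks")
p_old = m.recall_probability(now=14.0)  # 经过两个半衰期 -> 0.25
m.refresh(now=14.0)
p_new = m.recall_probability(now=14.0)  # 刷新后 -> 恢复到 1.0
assert p_old < p_new
```

修剪机制可在此基础上丢弃召回概率长期低于某阈值的节点,这与摘要中"提交与修剪"的描述一致。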


【68】Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation
标题:Recover-LoRA:通过低秩自适应对退化语言模型进行无数据准确性恢复
链接:https://arxiv.org/abs/2510.08600

作者:Devleena Das, Rajeev Patwari, Ashish Sirasao
备注:Accepted to EMNLP 2025 Industry Track
摘要:量化、剪枝、格式和数据类型转换、模型导出和序列化等推理优化会导致语言模型任务性能的功能性退化。虽然大多数面向部署的性能恢复工作都集中在鲁棒的量化技术上,但我们专注于从任何使模型权重退化的来源(如不正确的模型序列化)中恢复模型准确率。在这项工作中,我们提出了Recover-LoRA,一种轻量级且与数据集无关的方法,用于恢复退化模型的准确率。Recover-LoRA使用合成数据和logit蒸馏,在选定的层上学习LoRA适配器,以便将退化模型与其全精度模型对齐。我们在多种小语言模型(SLM)上研究了Recover-LoRA的实用性,包括具有不同注意力架构(多头注意力MHA和分组查询注意力GQA)的模型,以及多个评估数据集。我们的结果表明,Recover-LoRA可将MHA和GQA SLM的模型准确率恢复5-17%。
摘要:Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
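摘要中的"logit蒸馏"目标可以用下面的草图示意(假设性示例:论文的具体损失形式与温度未给出,此处采用常见的温度软化KL散度;logits数值为虚构):

```python
import math

def softmax(logits, t=1.0):
    # 带温度的softmax(数值稳定版)
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_kl(teacher_logits, student_logits, t=2.0):
    # KL(teacher || student):LoRA适配器可被训练来最小化这类
    # logit蒸馏目标,使退化模型的输出分布向全精度模型对齐。
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # 全精度模型的logits
degraded = [1.2, 0.9, -0.4]  # 例如序列化不当后的退化模型
recovered = [1.9, 0.6, -0.9] # 适配后更接近teacher

# 学生与教师对齐得越好,KL越趋近于零。
assert distill_kl(teacher, recovered) < distill_kl(teacher, degraded)
assert distill_kl(teacher, teacher) < 1e-12
```

由于监督信号只来自教师模型自身的logits(配合合成输入),这种目标天然是"无数据"的,与摘要的设定相符。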


【69】Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs
标题:自信,而不是困惑:LLM创意时代的更好度量
链接:https://arxiv.org/abs/2510.08596

作者:V. S. Raghu Parupudi
备注:Submitted to AACL-IJCNLP 2025 (Eval4NLP)
摘要:像自我困惑度这样的无参考指标对创造性文本生成有强烈偏见。我们提出置信度分数(CS),由模型的输出概率分布导出,作为偏见较小的替代方案。在gpt-4o-mini上的实验表明,在99个创意提示上,基于流畅性的指标偏好新颖回复的比例为0%,而我们的CS为19%,这是统计学上显著的差异(差异的95% CI:[11.1%, 27.3%])。我们还表明,CS能有效区分容易、中等和困难的任务,互不重叠的置信区间证实了这一点。因此,置信度分数减轻了传统指标的创造力偏见,同时保留了其核心评估优势,为现代LLM提供了更平衡的评估。
摘要:Reference-free metrics like self-perplexity are strongly biased against creative text generation. We propose the Confidence Score (CS), derived from a model's output probability distribution, as a less biased alternative. Experiments on gpt-4o-mini show that while fluency-based metrics prefer novel responses in 0\% of cases on 99 creative prompts, our CS does so 19% of the time, a statistically significant difference (95% CI for difference: [11.1%, 27.3%]). We also show that CS effectively distinguishes between easy, medium, and hard tasks, confirmed by non-overlapping confidence intervals. The Confidence Score thus mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
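"由输出概率分布导出的置信度分数"的一种可能形式如下(示意性草图:论文的确切公式未在摘要中给出,此处以"模型对各步所选词元的平均概率质量"作说明;分布数值为虚构):

```python
def confidence_score(token_distributions):
    # 一种可能的置信度分数:模型在每步实际选中词元上的平均概率。
    # 与困惑度不同,它着眼于分布本身的置信程度,而非对照参考文本。
    chosen = [max(dist) for dist in token_distributions]  # 每步贪心选择的概率
    return sum(chosen) / len(chosen)

# 虚构的逐步分布(3词元小词表)
fluent_but_generic = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
novel_but_confident = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]

# 困惑度会直接惩罚低概率(新颖)词元;而分布层面的置信度
# 仍可给自信的新颖续写打出不低的分数。
assert abs(confidence_score(fluent_but_generic) - 0.85) < 1e-9
assert abs(confidence_score(novel_but_confident) - 0.65) < 1e-9
```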


【70】Systematic Diagnosis of Brittle Reasoning in Large Language Models
标题:大型语言模型中脆性推理的系统诊断
链接:https://arxiv.org/abs/2510.08595

作者:V. S. Raghu Parupudi
备注:Submitted to NEURIPS-2025 MATHAI workshop
摘要:人工智能的一个核心问题是机器学习模型在多大程度上理解数学。为了解决这个问题,我们提出了一个衡量数学推理的新框架,超越标准基准来诊断具体的失败点。我们的方法首先用gpt-3.5-turbo在GSM8K数据集上生成结构化的分步推理。然后,我们使用能力更强的分析模型gpt-4o-mini对错误进行分类,并且至关重要地,对每个推理句子进行无监督聚类,以识别涌现的"推理模式"。这项分析揭示了一种具有明显非人类式脆性的认知特征:虽然该模型在顺序计算等程序性模式上达到近乎完美的准确率,但在需要带约束的组合推理的模式上,其性能却直线下降。通过识别并量化这些不同推理技能的可靠性,我们的工作提供了一种更细粒度的方法来评估数学理解,并为开发新能力和更可靠的未来应用提供了精确的路线图。
摘要:A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent "reasoning modes." This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.


【71】Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
标题:多样性越低,安全性越差:大型语言模型中测试时扩展的间接但普遍的风险
链接:https://arxiv.org/abs/2510.08592

作者:Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
摘要:测试时扩展(TTS)通过探索多个候选响应,然后在此集合上进行操作以找到最佳输出,从而改进LLM推理。TTS背后的一个默认前提是,足够多样化的候选池会增强可靠性。在这项工作中,我们表明,TTS中的这一假设引入了一种此前未被识别的故障模式。当候选多样性被削减时,哪怕只是适度削减,TTS也会变得更有可能产生不安全的输出。我们提出了一个参考引导的多样性削减协议(RefDiv),作为对TTS管道进行压力测试的诊断性攻击。通过在四个开源模型(Qwen3、Mistral、Llama3.1、Gemma3)和两种广泛使用的TTS策略(蒙特卡洛树搜索和Best-of-N)上的大量实验,我们发现限制多样性会持续提高TTS产生不安全结果的比率。这种效果往往比直接使用高对抗意图分数的提示所产生的效果更强。这一现象还可以在不同TTS策略之间以及向闭源模型(例如OpenAI o3和Gemini-2.5-Pro)迁移,表明这是TTS的普遍且固有的属性,而非特定模型的产物。此外,我们发现许多广泛使用的安全护栏分类器(例如Llama-Guard和OpenAI Moderation API)无法标记RefDiv生成的对抗性输入提示,表明现有防御对这种多样性驱动的故障模式提供的保护有限。通过这项工作,我们希望激励未来研究设计既有效又能抵御RefDiv所示的以多样性为目标的压力测试的鲁棒TTS策略。
摘要:Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.


【72】Comparative Analysis of Large Language Models for the Machine-Assisted Resolution of User Intentions
标题:机器辅助解决用户意图的大型语言模型比较分析
链接:https://arxiv.org/abs/2510.08576

作者:Justus Flerlage, Alexander Acker, Odej Kao
摘要:大型语言模型(LLM)已成为自然语言理解和用户意图解析的变革性工具,支持翻译、摘要等任务,并日益支持复杂工作流的编排。这一发展标志着从传统的GUI驱动用户界面向直观的、语言优先的交互范式的转变。用户无需手动操作应用程序,而是可以用自然语言表达目标,使LLM能够以动态且结合上下文的方式协调多个应用程序的操作。然而,现有实现往往依赖基于云的专有模型,这在隐私、自主性和可扩展性方面带来了限制。要让语言优先的交互成为真正健壮且可信的接口范式,本地部署不仅是一种便利,更是一种必要。这一限制凸显了评估可本地部署的开源和开放获取LLM作为未来基于意图的操作系统之基础组件的可行性的重要性。在这项研究中,我们考察了若干开源和开放获取模型在机器辅助下解析用户意图的能力,并与OpenAI专有的基于GPT-4的系统进行了比较分析,以评估其为各种用户意图生成工作流的性能。本研究就开放LLM作为下一代操作系统中自主、可本地运行的组件的实际可行性、性能权衡和潜力提供了实证见解。研究结果为关于AI基础设施去中心化和民主化的更广泛讨论提供了依据,并指向这样一个未来:通过本地嵌入式智能,用户与设备的交互将变得更无缝、更自适应、更注重隐私。
摘要:Large Language Models (LLMs) have emerged as transformative tools for natural language understanding and user intent resolution, enabling tasks such as translation, summarization, and, increasingly, the orchestration of complex workflows. This development signifies a paradigm shift from conventional, GUI-driven user interfaces toward intuitive, language-first interaction paradigms. Rather than manually navigating applications, users can articulate their objectives in natural language, enabling LLMs to orchestrate actions across multiple applications in a dynamic and contextual manner. However, extant implementations frequently rely on cloud-based proprietary models, which introduce limitations in terms of privacy, autonomy, and scalability. For language-first interaction to become a truly robust and trusted interface paradigm, local deployment is not merely a convenience; it is an imperative. This limitation underscores the importance of evaluating the feasibility of locally deployable, open-source, and open-access LLMs as foundational components for future intent-based operating systems. In this study, we examine the capabilities of several open-source and open-access models in facilitating user intention resolution through machine assistance. A comparative analysis is conducted against OpenAI's proprietary GPT-4-based systems to assess performance in generating workflows for various user intentions. The present study offers empirical insights into the practical viability, performance trade-offs, and potential of open LLMs as autonomous, locally operable components in next-generation operating systems. The results of this study inform the broader discussion on the decentralization and democratization of AI infrastructure and point toward a future where user-device interaction becomes more seamless, adaptive, and privacy-conscious through locally embedded intelligence.


【73】A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large?
标题:基于设计的文本因果推理解决方案:语言模型会太大吗?
链接:https://arxiv.org/abs/2510.08758

作者:Graham Tierney, Srikar Katta, Christopher Bail, Sunshine Hillygus, Alexander Volfovsky
摘要:许多社会科学问题都在问语言特性如何因果地影响受众的态度和行为。由于文本属性通常相互关联(例如,愤怒的评论会使用亵渎性语言),我们必须控制可能的潜在混淆以分离因果效应。最近的文献提出采用大型语言模型(LLM)来学习文本的潜在表示,以成功预测处理(treatment)和结果。然而,由于处理是文本的一个组成部分,这些深度学习方法有可能学到实际上编码了处理本身的表示,从而引入重叠偏差。我们没有依赖事后调整,而是引入了一种新的实验设计,它能处理潜在混淆、避免重叠问题,并无偏地估计处理效应。我们将这一设计应用于一项评估政治沟通中表达谦卑之说服力的实验。在方法论上,我们证明了在我们的真实文本和实验结果上,基于LLM的方法甚至比简单的词袋模型表现更差。在实质内容上,我们分离出了表达谦卑对政治声明感知说服力的因果影响,为社交媒体平台、政策制定者和社会科学家提供了关于传播效果的新见解。
摘要:Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text, these deep learning methods risk learning representations that actually encode the treatment itself, inducing overlap bias. Rather than depending on post-hoc adjustments, we introduce a new experimental design that handles latent confounding, avoids the overlap issue, and unbiasedly estimates treatment effects. We apply this design in an experiment evaluating the persuasiveness of expressing humility in political communication. Methodologically, we demonstrate that LLM-based methods perform worse than even simple bag-of-words models using our real text and outcomes from our experiment. Substantively, we isolate the causal effect of expressing humility on the perceived persuasiveness of political statements, offering new insights on communication effects for social media platforms, policy makers, and social scientists.


GAN|生成相关(5篇)

【1】Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
标题:识别并交互式完善数据可视化代码生成的模糊用户目标
链接:https://arxiv.org/abs/2510.09390

作者:Mert İnan, Anthony Sicilia, Alex Xie, Saujas Vaduguru, Daniel Fried, Malihe Alikhani
摘要:建立共同目标是人机沟通的基本步骤。然而,歧义可能导致输出看似正确,却未能反映说话者的意图。在本文中,我们聚焦数据可视化领域来探讨这一问题;在该领域,自然语言中的歧义会影响可视化数据的代码的生成。上下文的多个视图(例如,预期的图和绘制该图的代码)使得对多种歧义类型进行独特而全面的分析成为可能。我们为此任务中出现的歧义类型建立了一个分类体系,并提出了量化它们的指标。使用DS-1000数据集中的Matplotlib问题,我们证明了我们的歧义度量比不确定性基线更好地与人类标注相关。我们的工作还探索了多轮对话如何通过更好地匹配用户目标来减少歧义、从而提高代码准确率。我们评估了三种语用模型来指导我们的对话策略:格赖斯合作原则、话语表征理论和问题下讨论(Questions under Discussion)。一项模拟用户研究揭示了语用对话如何减少歧义并提高代码准确率,突显了多轮交流对代码生成的价值。
摘要:Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views on the contextual (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity, therefore, improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.


【2】CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation
标题:CFVBench:细粒度多模式检索增强一代的全面视频基准
链接:https://arxiv.org/abs/2510.09266

作者:Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng
摘要:多模态检索增强生成(MRAG)使多模态大型语言模型(MLLM)能够利用外部多模态证据生成响应,目前已有许多基于视频的MRAG基准被提出,用于评估模型在检索和生成阶段的能力。然而,现有基准在模态覆盖和格式多样性上仍然有限,往往集中于单一或有限模态的任务,或粗粒度的场景理解。为了弥补这些差距,我们引入了CFVBench,这是一个大规模、人工验证的基准,由599个公开可用的视频构建而成,产生了5,360个开放式QA对。CFVBench涵盖高密度格式和领域,如图表密集的报告、新闻广播和软件教程,要求模型在长时间视频跨度上检索和推理,同时保持细粒度的多模态信息。使用CFVBench,我们系统地评估了7种检索方法和14种广泛使用的MLLM,揭示了一个关键瓶颈:当前模型(甚至GPT5或Gemini)难以捕捉短暂但必不可少的细粒度多模态细节。为了缓解这一问题,我们提出了自适应视觉精化(AVR),这是一个简单而有效的框架,可自适应地提高帧采样密度,并在必要时选择性地调用外部工具。实验表明,AVR持续增强细粒度的多模态理解,并提高了所有被评估MLLM的性能。
摘要:Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs


【3】How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
标题:多少个代码和测试用例才足够?从二进制矩阵的角度评估测试用例的生成
链接:https://arxiv.org/abs/2510.08720

作者:Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che
备注:Work in Progress
摘要:评估由大型语言模型(LLM)自动生成的测试用例是一项关键而具有挑战性的任务。现有基准面临高计算成本、分数膨胀,以及偏向琐碎错误而非罕见关键故障的问题。在这项工作中,我们提出两个基本问题:(1)足以代表整个错误空间的最小错误代码集是什么?(2)区分它们所需的最小测试用例集是什么?我们引入了一个框架,将基准构建形式化为在二值"代码-测试"矩阵中寻找最优诊断基。该矩阵的秩给出了独立错误模式(错误代码)的最小数量,并为完整故障覆盖所需的测试用例数量提供了紧的上界。我们的目标是找出一个大小等于矩阵秩、且内部多样性最大化的基。为了求解这一NP难问题,我们提出WrongSelect,一种选择最大程度多样化错误代码的高效近似算法。将此框架应用于数百万份竞赛编程提交,我们构建了TC-Bench,一个紧凑、多样且抗膨胀的基准。大量实验表明,即使是最先进的测试用例生成方法在TC-Bench上也只能达到约60%的排除率,暴露出其诊断能力的显著差距。我们的数据集见:https://huggingface.co/datasets/Luoberta/TC-Bench,代码见:https://github.com/Luowaterbi/TC-Bench。
摘要:Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
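"代码-测试"二值矩阵的秩决定独立错误模式数量这一思路,可以用如下草图示意(假设性示例:论文未说明在哪个域上取秩,此处采用GF(2)上的线性基做说明;矩阵数据为虚构):

```python
def gf2_rank(rows):
    # 二值矩阵在GF(2)上的秩。每行打包为一个整数,
    # 第i位为1表示第i个测试用例能检出该错误代码。
    # 秩即独立错误模式的数量,也是完整故障覆盖所需
    # 测试用例数量的上界(按摘要中的框架理解)。
    basis = []
    for row in rows:
        cur = row
        for b in basis:
            cur = min(cur, cur ^ b)  # 用已有基向量消去最高位
        if cur:
            basis.append(cur)
    return len(basis)

# 虚构的代码-测试矩阵:3个错误代码 x 3个测试用例。
# C恰为A XOR B,不贡献新的独立错误模式。
A, B, C = 0b101, 0b011, 0b110
assert gf2_rank([A, B, C]) == 2
```

在该示意下,两个独立行(如A、B)即可张成全部错误模式,这正对应摘要中"大小等于矩阵秩的诊断基"。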


【4】BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
标题:BigCodeArena:揭示通过执行生成代码中更可靠的人类偏好
链接:https://arxiv.org/abs/2510.08697

作者:Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra
备注:Built with love by the BigCode community :)
摘要:Chatbot Arena等众包模型评估平台支持从人类视角进行实时评估,以评价模型响应的质量。在编码领域,人工检查LLM生成内容的质量极具挑战性,因为这需要理解很长的原始代码块并刻意模拟代码执行。为此,我们引入了BigCodeArena,一个由全面的即时执行环境支持的开放式代码生成人工评估平台。BigCodeArena构建在Chatbot Arena之上,能够执行LLM生成的代码,并允许人类与执行过程和结果交互。我们在10个广泛使用的LLM上收集了超过14,000个以原始代码为中心的对话会话,涵盖10种语言和8种执行环境。在这些对话中,我们识别出超过4,700个带有成对人类偏好的多轮样本。进一步的分析揭示了LLM在由任务、语言和框架刻画的细粒度领域中尚未被充分探索的偏好。为了系统考察前沿LLM的代码理解和生成能力,我们基于收集的数据策划了两个基准,即BigCodeReward和AutoCodeArena。对于BigCodeReward,我们对4,700个对话进行了后处理,并评估了奖励模型与人类偏好之间的一致性。评估结果表明,当执行结果可用时,大多数LLM在判断编码偏好方面表现更佳。受这些发现的启发,我们提出了AutoCodeArena,一个自动Elo评级基准,旨在无需人工参与地评估LLM的编码质量。我们发现,在最近涌现的模型中,GPT-5、Claude-Sonnet-4和Claude-Opus-4等专有LLM在代码生成性能上仍然领先。
摘要:Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.


【5】YpathRAG:A Retrieval-Augmented Generation Framework and Benchmark for Pathology
标题:YpathRAG:检索增强生成框架和病理学基准
链接:https://arxiv.org/abs/2510.08603

作者:Deshui Yu, Yizhi Wang, Saihui Jin, Taojie Zhu, Fanyi Zeng, Wen Qian, Zirui Huang, Jingli Ouyang, Jiameng Li, Zhen Song, Tian Guan, Yonghong He
摘要:大型语言模型(LLM)在一般任务上表现出色,但在病理学等高门槛领域仍会产生幻觉。以往的工作往往依赖领域微调,既不能扩展知识边界,也不能强制执行基于证据的约束。因此,我们构建了一个覆盖28个子领域、153万段落的病理学向量数据库,并提出YpathRAG,一个面向病理学的RAG框架,具有双通道混合检索(BGE-M3密集检索加上词表引导的稀疏检索)和基于LLM的支持性证据判断模块,闭合了"检索-判断-生成"循环。我们还发布了两个评估基准YpathR和YpathQA-M。在YpathR上,YpathRAG的Recall@5达到98.64%,比基线提高23个百分点;在YpathQA-M这组300个最具挑战性的问题上,它将通用和医学LLM的准确率平均提高9.0%,最高提高15.6%。这些结果表明检索质量和事实可靠性得到改善,为面向病理学的RAG提供了可扩展的构建范式和可解释的评估。
摘要:Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.


QA|VQA|问答|对话(1篇)

【1】Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations
标题:集中情绪热点:多模式局部-全球融合和跨模式对齐用于对话中的情绪识别
链接:https://arxiv.org/abs/2510.08606

作者:Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding, Linlin Gong, Leyuan Qu, Taihao Li
备注:Under review for ICASSP 2026
摘要:会话中的情感识别(ERC)之所以困难,是因为判别性证据稀疏、局部化,并且在模态之间常常不同步。我们将ERC聚焦于情感热点,并提出一个统一模型:它检测文本、音频和视频中每条话语的热点,通过热点门控融合(HGF)将其与全局特征融合,并使用带路由的Mixture-of-Aligners(MoA)对齐各模态;跨模态图则编码会话结构。这种设计将建模重点放在显著片段上,减轻了错位并保留了上下文。在标准ERC基准上的实验表明,该方法在强基线上取得一致增益,消融实验证实了HGF和MoA的贡献。我们的结果指向一种以热点为中心的视角,可为未来的多模态学习提供参考,为ERC中的模态融合提供了新视角。
摘要:Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.


机器翻译(3篇)

【1】LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning
标题:LLaMAX 2:您的翻译增强模型在推理方面也表现出色
链接:https://arxiv.org/abs/2510.09189

作者:Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li, Fei Yuan
摘要:通用大型语言模型(LLM)擅长推理,但为翻译而增强的模型在推理任务上表现不佳。为了解决这个问题,我们提出了一种新的翻译增强配方:从指令模型开始,仅在并行数据上进行层选择性调整。按照这一管道,我们推出了Qwen3-XPlus模型,其在高资源和低资源语言上的翻译性能均有显著提升,在斯瓦希里语等低资源语言中达到15+ spBLEU和40+ xComet。有趣的是,仅使用小型并行数据集进行训练,Qwen3-XPlus在7个多语言任务上平均提升1分以上,同时在15个流行的推理数据集上保持了与Qwen3指令模型相当的能力。这项工作为多语言增强提供了一种有前途的方法,显著降低了复杂性,并提升了更多语言的可及性。代码和模型均已公开。
摘要:General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and low-resource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.


【2】DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
标题:DITING:网络小说翻译基准的多代理评估框架
链接:https://arxiv.org/abs/2510.09116

作者:Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Youzhong Dong, Sophia Ananiadou, Min Peng, Qianqian Xie
摘要:大型语言模型(LLM)大大提高了机器翻译(MT),但它们在翻译网络小说方面的有效性仍不清楚。现有的基准依赖于表面水平的指标,无法捕捉这种类型的独特特征。为了解决这些差距,我们引入了第一个网络小说翻译综合评估框架DITING,从六个维度评估叙事和文化保真度:习语翻译,词汇歧义,术语本地化,时态一致性,零代词解决方案和文化安全,支持超过18 K专家注释的汉英句子对。我们进一步提出了AgentEval,这是一个推理驱动的多智能体评估框架,它模拟专家审议来评估词汇重叠之外的翻译质量,在七个测试的自动指标中实现了与人类判断的最高相关性。为了实现度量比较,我们开发了MetricAlign,这是一个包含300个句子对的元评估数据集,上面标注了错误标签和标量质量分数。对14种开放、封闭和商业模式的综合评估表明,中国培训的LLM超过了大型外国同行,DeepSeek-V3提供了最忠实和风格连贯的翻译。我们的工作为探索基于LLM的网络小说翻译建立了新的范式,并为推进未来的研究提供了公共资源。
摘要:Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.


【3】Quality Estimation Reranking for Document-Level Translation
标题:文档级翻译的质量评估重新排名
链接:https://arxiv.org/abs/2510.08870

作者:Krzysztof Mrozinski, Minji Kang, Ahmed Khota, Vincent Michael Sutanto, Giovanni Gatti De Giacomo
备注:9 pages, 4 figures
摘要:质量估计(QE)重排序是一种质量感知解码,旨在通过对生成的候选翻译池进行打分并选出最佳候选来改进机器翻译(MT)。虽然它在句子级别上已知是有效的,但它在日益重要的文档级翻译领域的应用仍待探索。在这项工作中,我们使用各种学习型和基于大型语言模型(LLM)的QE指标,评估QE重排序在文档级(而非典型的句子级)翻译上的性能。我们发现,使用我们最好的学习型指标SLIDE,在只有两个候选时BLEURT-20分数提高了+2.00,在32个候选时提高了+5.09,这在仅解码器LLM模型和编码器-解码器神经机器翻译(NMT)模型上均成立。使用最好的基于LLM的指标GEMBA-DA,在相同条件下取得了+1.63和+4.30的增益。虽然增益随输入变长而缩小,但在我们最长的文档(512-1024个源词元)上,使用32个候选进行重排序仍带来+2.34(SLIDE)和+1.40(GEMBA-DA)的改进。这些发现证明了文档级QE的实用价值:在合适的翻译模型和硬件下,其运行时开销很小。
摘要:Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
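QE重排序的核心机制非常简单:对每个候选译文用无参考的QE指标打分,然后取分数最高者。下面是一个示意性草图(其中的toy_qe只是占位,真实系统会调用SLIDE、GEMBA-DA之类的学习型或LLM指标;例句为虚构):

```python
def qe_rerank(source, candidates, qe_score):
    # 无参考QE重排序:对每个候选译文相对源文打分,返回最高分者。
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

def toy_qe(source, hyp):
    # 仅作演示的占位QE指标:奖励与源文长度相近的译文。
    # 真实场景显然过于粗糙,应替换为学习型指标。
    return -abs(len(source.split()) - len(hyp.split()))

src = "Der Bericht wurde gestern veröffentlicht ."
cands = [
    "The report was published yesterday .",
    "Report published .",
    "The report it was being published on the day before today .",
]
best = qe_rerank(src, cands, toy_qe)
assert best == "The report was published yesterday ."
```

文档级的差异不在算法本身,而在于打分对象是整篇文档的候选译文,这也解释了摘要中"增益随输入变长而缩小"的观察。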


语义分析(2篇)

【1】One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
标题:一句话,两个嵌入:显式和隐式语义表示的对比学习
链接:https://arxiv.org/abs/2510.09293

作者:Kohei Oda, Po-Min Chuang, Kiyoaki Shirai, Natthawut Kertkeidkachorn
摘要:句子嵌入方法已经取得了显著进展,但它们仍然难以捕捉句子中的隐含语义。这可以归因于传统句子嵌入方法的固有局限:每个句子仅被分配单个向量。为了克服这一限制,我们提出了DualCSE,一种为每个句子分配两个嵌入的句子嵌入方法:一个表示显式语义,另一个表示隐式语义。这些嵌入共存于共享空间中,从而能够针对信息检索、文本分类等特定目的选择所需的语义。实验结果表明,DualCSE可以有效地编码显式和隐式含义,并提高下游任务的性能。
摘要:Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.


【2】When to Reason: Semantic Router for vLLM
标题:何时推理:vLLM的语义路由器
链接:https://arxiv.org/abs/2510.08731

作者:Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen
备注:5 pages, excluding references and appendix. To appear at Workshop on ML for Systems at NeurIPS 2025, December 6, 2025 this https URL
摘要:大型语言模型(LLM)在使用思维链和推理时扩展等推理模式进行增强时,表现出显著的准确率提升。然而,推理也会在推理延迟和令牌使用方面产生巨大成本,并带来环境和经济影响,而这对于许多简单提示来说并无必要。我们提出了一个语义路由器,根据查询的推理需求对其进行分类,并仅在有益时选择性地应用推理。与使用vLLM的直接推理相比,我们的方法在MMLU-Pro基准上将准确率提高了10.2个百分点,同时将响应延迟降低47.1%、令牌消耗降低48.5%。这些结果表明,语义路由为在开源LLM服务系统中平衡准确率与效率提供了一种有效机制。
摘要:Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems
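语义路由的核心逻辑可以用如下草图说明:先判断查询是否需要多步推理,再决定是否启用代价较高的推理模式。这里用关键词规则代替论文中的学习型分类器,仅作示意。

```python
# Sketch of semantic routing: decide whether a query needs multi-step
# reasoning, and only then enable the costly reasoning mode. A keyword rule
# stands in for the learned semantic classifier used in the paper.
REASONING_CUES = ("prove", "step by step", "derive", "how many", "why")

def needs_reasoning(query: str) -> bool:
    q = query.lower()
    return any(cue in q for cue in REASONING_CUES)

def route(query: str) -> str:
    """Return the decoding mode the serving layer should use."""
    return "reasoning" if needs_reasoning(query) else "direct"
```

简单事实类查询走"direct"路径即可省下推理令牌,这正是摘要中延迟与令牌消耗下降的来源。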


Graph|知识图谱|Knowledge(4篇)

【1】Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval
标题:多语言视频库检索的知识丰富分层索引
链接:https://arxiv.org/abs/2510.09553

作者:Yu Wang, Tianhao Tan, Yifei Wang
备注:Accepted to NLPCC 2025 (Springer), to appear November 2025
摘要:从多语言医学档案中检索相关教学视频,对于跨语言边界回答复杂的多跳问题至关重要。然而,现有系统要么将长达一小时的视频压缩成粗粒度嵌入,要么为细粒度匹配付出高昂代价。我们在NLPCC-2025 M4IVQA挑战赛中处理多语言视频语料库检索(mVCR)任务,采用一个集成多语言语义、领域术语和高效长文本处理的多阶段框架。视频字幕被划分为语义连贯的片段,用简洁的知识图(KG)事实加以丰富,并组织成一棵层次树,其节点嵌入由语言无关的多语言编码器生成。在查询时,同一编码器嵌入输入问题;由粗到细的树搜索剪除不相关的分支,只有排名靠前的片段才由轻量级大型语言模型(LLM)重新评分。这种设计在保持片段级精度的同时避免了穷举式交叉编码器评分。在mVCR测试集上的实验取得了最先进的性能,消融研究证实了KG丰富、分层索引和针对性LLM重排序的互补贡献。所提出的方法为专业医学视频收藏中的多语言检索提供了一个准确且可扩展的解决方案。
摘要:Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
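由粗到细的树搜索可以用下面的草图说明:每个内部节点的嵌入概括其子树,检索时只沿最相似的分支下降,最终返回叶子片段。这里用普通向量点积代替论文的多语言编码器,束宽等参数均为示意性假设。

```python
# Illustrative coarse-to-fine retrieval over a hierarchy: each internal node
# stores an embedding summarizing its subtree; search descends only into the
# best-matching branches and finally returns leaf chunks. Plain dot-product
# similarity stands in for the multilingual encoder; the paper additionally
# re-scores the top chunks with a lightweight LLM.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    embedding: List[float]
    chunk: str = ""                      # non-empty only at leaves
    children: List["Node"] = field(default_factory=list)

def sim(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(node: Node, query: List[float], beam: int = 2) -> List[str]:
    if not node.children:                # leaf: return its chunk
        return [node.chunk]
    # prune: keep only the `beam` most similar branches
    ranked = sorted(node.children, key=lambda c: sim(c.embedding, query),
                    reverse=True)
    hits: List[str] = []
    for child in ranked[:beam]:
        hits.extend(search(child, query, beam))
    return hits
```

剪枝使得相似度计算次数随树深而非片段总数增长,这就是避免穷举评分的关键。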


【2】Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph
标题:超越单粒度提示:面向图的多尺度思维链提示学习
链接:https://arxiv.org/abs/2510.09394

作者:Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu
备注:under review
摘要:旨在弥合预训练任务和下游目标之间差距的"预训练,提示"范式已经从NLP领域扩展到图领域,并取得了显著进展。当前主流的图提示调优方法使用可学习的提示向量修改输入或输出特征。然而,现有方法在提示生成时局限于单一粒度(例如节点级或子图级),忽略了图数据固有的多尺度结构信息,限制了提示语义的多样性。为了解决这个问题,我们率先将多尺度信息集成到图提示中,并提出了一个多尺度图思维链(MSGCOT)提示框架。具体来说,我们设计了一个轻量级的低秩粗化网络,以高效捕获多尺度结构特征,作为提示生成的分层基向量。随后,模仿人类认知从粗到细的粒度过程,我们在每个推理步骤动态整合多尺度信息,形成一条渐进的由粗到细的提示链。在八个基准数据集上的大量实验表明,MSGCOT优于最先进的单粒度图提示调优方法,尤其在少样本场景下表现出更优的性能。
摘要:The "pre-train, prompt'' paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to single-granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompt and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios, showcasing superior performance.


【3】Verifying Chain-of-Thought Reasoning via Its Computational Graph
标题:通过计算图验证思维链推理
链接:https://arxiv.org/abs/2510.09312

作者:Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
摘要:当前的思维链(CoT)验证方法基于输出(黑盒)或激活(灰盒)预测推理正确性,但对计算为何失败提供的洞见有限。我们引入一种白盒方法:基于电路的推理验证(CRV)。我们假设,正确CoT步骤的归因图(可视为模型潜在推理电路的执行轨迹)具有与错误步骤不同的结构指纹。通过在这些图的结构特征上训练分类器,我们表明这些轨迹包含推理错误的强烈信号。我们的白盒方法产生了其他方法无法获得的新的科学见解。(1)我们证明了错误的结构签名具有高度的预测性,确立了直接通过计算图验证推理的可行性。(2)我们发现这些签名是高度领域特定的,揭示了不同推理任务中的失败表现为不同的计算模式。(3)我们提供的证据表明,这些签名不仅仅是相关性的;通过使用我们的分析来指导对单个transcoder特征的针对性干预,我们成功纠正了模型的错误推理。我们的工作表明,通过仔细检查模型的计算过程,我们可以从简单的错误检测走向对LLM推理的更深层次的因果理解。
摘要:Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.


【4】Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language
标题:通过基于音译的MLM微调探索极低资源Chakma语言的跨语言知识迁移
链接:https://arxiv.org/abs/2510.09032

作者:Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir
摘要:作为一种可用数据有限的印度-雅利安语言,Chakma在语言模型中的代表性仍然很低。在这项工作中,我们介绍了一个新的、上下文连贯的孟加拉文音译Chakma语料库,整理自Chakma文学作品,并经母语者验证。使用该数据集,我们在掩码语言建模(MLM)任务上微调了六个基于编码器的多语言和区域Transformer模型(mBERT、XLM-RoBERTa、DistilBERT、DeBERTaV3、BanglaBERT和IndicBERT)。我们的实验表明,经过微调的多语言模型在适应孟加拉文音译的Chakma时优于其预训练版本,实现了高达73.54%的令牌准确率和低至2.90的困惑度。我们的分析进一步强调了数据质量对模型性能的影响,并揭示了OCR管道在处理形态丰富的印度文字时的局限。我们的研究表明,孟加拉文音译的Chakma对于Chakma语言的迁移学习非常有效,我们发布了人工验证的单语数据集,以鼓励对低资源语言多语言建模的进一步研究。
摘要:As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.


推理|分析|理解|解释(14篇)

【1】StreamingVLM: Real-Time Understanding for Infinite Video Streams
标题:StreamingVLM:无限视频流的实时理解
链接:https://arxiv.org/abs/2510.09608

作者:Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
备注:The first two authors contributed equally to this work
摘要:视觉语言模型(VLM)可以为实时助手和自主智能体提供动力,但它们面临一个关键挑战:在不增加延迟和内存使用的情况下理解近乎无限的视频流。使用全注意力处理整个视频会导致二次方的计算成本,并且在长视频上性能不佳。同时,简单的滑动窗口方法也有缺陷:它们要么破坏连贯性,要么因冗余的重新计算而遭受高延迟。在本文中,我们介绍StreamingVLM,一个为无限视觉输入的实时、稳定理解而设计的模型。我们的方法是一个使训练与流式推理保持一致的统一框架。在推理过程中,我们通过复用注意力汇(attention sink)的状态、近期视觉令牌的短窗口和近期文本令牌的长窗口,维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调(SFT)策略注入,该策略在短的、重叠的视频块上应用全注意力,从而有效模仿推理时的注意力模式,而无需在过长的上下文上训练。为了进行评估,我们构建了Inf-Streams-Eval,这是一个视频平均时长超过两小时的新基准,要求帧和文本之间每秒的密集对齐。在Inf-Streams-Eval上,StreamingVLM对GPT-4O mini取得了66.18%的胜率,并在单张NVIDIA H100上以高达8 FPS的速度保持稳定的实时性能。值得注意的是,我们的SFT策略还在没有任何VQA特定微调的情况下增强了通用VQA能力,将LongVideoBench和OVOBench Realtime的性能分别提高了+4.30和+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm上获得。
摘要:Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
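上述"注意力汇 + 近期窗口"的KV缓存策略可以用一个与具体模型无关的小草图表示;窗口大小为示意值,并非论文配置:

```python
# Sketch of the streaming KV-cache policy: keep a few initial "attention
# sink" tokens plus a sliding window of recent tokens, implicitly evicting
# everything in between. Sizes here are illustrative.
from collections import deque

class StreamingCache:
    def __init__(self, n_sinks: int = 4, window: int = 8):
        self.n_sinks = n_sinks
        self.sinks = []                      # first tokens, never evicted
        self.recent = deque(maxlen=window)   # sliding window of recent tokens

    def append(self, token):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(token)
        else:
            self.recent.append(token)        # deque drops the oldest itself

    def tokens(self):
        return self.sinks + list(self.recent)
```

无论流多长,缓存大小都被固定为 n_sinks + window,这正是延迟与内存不随视频长度增长的原因。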


【2】A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
标题:多语言思想链推理的综合评估:跨语言的性能、一致性和忠实性
链接:https://arxiv.org/abs/2510.09555

作者:Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
备注:preprint
摘要:大型推理模型(LRM)越来越依赖逐步的思维链(CoT)推理来提高任务性能,特别是在英语等高资源语言中。虽然最近的工作已经研究了多语言环境中的最终答案准确率,但思维轨迹本身,即导致最终答案的中间步骤,仍未得到充分探索。在本文中,我们提出了第一个关于多语言CoT推理的全面研究,评估三个关键维度:性能、一致性和忠实性。我们首先测量当LRM被明确指示或通过提示操控(prompt-hack)以目标语言思考时的语言遵循度、答案准确率和答案一致性,揭示出强烈的语言偏好和跨语言的性能差异。接下来,我们通过在语言之间互换思维轨迹来评估其跨语言一致性。我们发现,思维轨迹的质量和有效性因提示语言而有很大差异。最后,我们采用基于扰动的技术,即截断和错误注入,来探查跨语言思维轨迹的忠实性,表明模型在不同程度上依赖这些轨迹。我们发布代码和数据以支持未来的研究。
摘要:Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques -- i.e., truncation and error injection -- to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.


【3】Mitigating Overthinking through Reasoning Shaping
标题:通过推理塑造缓解过度思考
链接:https://arxiv.org/abs/2510.09535

作者:Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
摘要:由验证者奖励强化学习(RLVR)推动的大型推理模型(LRM)在解决问题方面表现出了巨大的能力,但它们往往会过度思考:过度、曲折的推理会推高计算成本。RLVR中先前设计的惩罚机制虽能减少令牌消耗,却常常损害模型性能,其根源在于令牌级监督过于简单。在本文中,我们认为监督的粒度在平衡效率和准确率方面起着至关重要的作用,并提出了组相对片段惩罚(GRSP),一种在步骤级规范推理的方法。初步分析表明推理片段与令牌消耗和模型性能密切相关,因此我们设计了一种跨片段簇的长度感知加权机制。大量实验表明,GRSP在不严重损害准确率的前提下实现了更优的令牌效率,在较难的问题上优势尤为明显。此外,GRSP稳定了RL训练,并能有效地随模型规模扩展。
摘要:Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often cause overthinking: excessive, meandering reasoning that inflates computational cost. Prior designs of penalization in RLVR manage to reduce token consumption while often harming model performance, which arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, especially the advantages with harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.


【4】KORMo: Korean Open Reasoning Model for Everyone
标题:KORMo:面向所有人的韩国开放式推理模型
链接:https://arxiv.org/abs/2510.09426

作者:Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
摘要:这项工作提出了第一个大规模研究,旨在为非英语语言(具体为韩语)构建一个完全开放的双语大语言模型(LLM),并主要在合成数据上训练。我们介绍了KORMo-10B,一个10.8B参数的模型,在一个韩语部分68.74%为合成数据的韩英语料库上从头训练。通过系统的实验,我们证明,当合成数据经过精心整理、具有均衡的语言覆盖和多样的指令风格时,不会在大规模预训练过程中导致不稳定或退化。此外,该模型在广泛的推理、知识和指令遵循基准上实现了与当代开放权重多语言基线相当的性能。我们的实验揭示了两个关键发现:(1)合成数据可以可靠地支持长程预训练而不发生模型崩溃;(2)双语指令微调使韩语达到接近母语水平的推理和话语连贯性。通过完全发布所有组件,包括数据、代码、训练配方和日志,这项工作为在低资源环境中开发合成数据驱动的完全开放模型(FOM)建立了一个透明的框架,并为未来的多语言LLM研究树立了可复现的先例。
摘要:This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.


【5】Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
标题:Logit算术无需训练即可激发长期推理能力
链接:https://arxiv.org/abs/2510.09354

作者:Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
摘要:大型推理模型表现出带有回溯和自我纠正等策略的长思维链推理,尽管最近的研究表明这些能力通常需要额外的训练。我们首先调查这些行为是否可以在没有任何训练的情况下被诱发。为此,我们提出了一种解码时方法ThinkLogit,它利用logit算术,以一个小得多的推理模型作为引导器(guider),将目标大型非推理模型调节为进行长推理。然后我们表明,通过在从目标模型和引导器模型采样的正确/错误推理对上对引导器模型进行偏好优化训练,可以进一步提升其性能,我们将这种设置称为ThinkLogit-DPO。我们的实验表明,在以小21倍的R1-Distill-Qwen-1.5B引导Qwen2.5-32B时,ThinkLogit和ThinkLogit-DPO在五个推理基准上的平均准确率分别相对提升24.5%和29.1%。此外,我们发现当引导器和目标来自不同的模型家族时,ThinkLogit仍然有效。它也与小模型的后训练方法正交:通过监督蒸馏或强化学习改进的引导器可以直接接入以产生更强的大模型,为在无需昂贵后训练的情况下解锁大规模模型的长推理提供了一条实用路径。
摘要:Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively, over five reasoning benchmarks using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
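摘要并未给出具体的logit组合公式;下面按代理微调(proxy-tuning)风格的假设给出一个纯示意草图:用小引导器模型与其基座的logit差去平移目标大模型的logit。组合规则与权重均为本文之外的假设,并非论文的规范。

```python
# Hedged sketch of decoding-time logit arithmetic in the spirit of ThinkLogit.
# ASSUMPTION: a proxy-tuning-style rule, target + alpha * (guider_reasoning -
# guider_base); the paper's exact combination rule may differ.
def combine_logits(target, guider_reasoning, guider_base, alpha=1.0):
    """Shift the target model's logits by the guider's reasoning offset."""
    return [t + alpha * (r - b)
            for t, r, b in zip(target, guider_reasoning, guider_base)]

def greedy_pick(logits):
    """Index of the highest logit (greedy next-token choice)."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

直觉是:小推理模型相对其基座"偏爱"的令牌(如回溯、自检用语)在组合后也会被大模型偏爱,而大模型的知识保持不变。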


【6】Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
标题:掩码令牌作为先知:面向高效dLLM推理的细粒度缓存驱逐
链接:https://arxiv.org/abs/2510.09309

作者:Jianuo Huang, Yaojie Zhang, Yicun Yang, Benhao Huang, Biqing Qi, Dongrui Liu, Linfeng Zhang
备注:17 pages, 8 figures
摘要:扩散大语言模型(dLLM)凭借并行解码能力,为占主导地位的自回归模型(ARM)提供了一个有前途的替代方案,但代价是大量的计算和内存开销。具体而言,dLLM中用于双向注意力的缓存机制需要很大的内存占用,限制了它们在资源受限环境下处理长上下文的能力。现有的缓存驱逐策略都是为ARM设计的,忽略了dLLM的独特特性,因此性能并不令人满意。为了应对这些挑战,我们引入了MaskKV,一个为dLLM量身定制的免训练缓存驱逐框架,聚焦于dLLM中掩码令牌的作用。MaskKV建立在两个关键创新之上:(1)掩码查询引导的评分机制,利用注意力权重为每个注意力头识别并驱逐不太重要的提示令牌;(2)自适应缓存预算策略,通过减少中间层的分配并将资源集中于偏好提示的注意力头来提高效率。在使用MaskKV的LLaDA上,将KV缓存压缩到仅256对(不到令牌的5%)仍保留了LongBench上94%的全缓存性能,并在32k提示长度下实现了高达31倍的加速。代码公开于:https://github.com/jianuo-huang/MaskKV
摘要:Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs) by the ability of parallel decoding at the expense of substantial computation and memory costs. Specifically, the cache mechanism for bidirectional attention in dLLMs demands large memory footprint, restricting their ability to handle long contexts under resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, thus leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs, focusing on the effect of mask tokens in dLLMs. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length. The code is publicly available at: https://github.com/jianuo-huang/MaskKV
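掩码查询引导的驱逐可以简化为:按每个提示位置从掩码令牌查询处获得的注意力分数排序,仅保留预算内得分最高的KV对。下面的草图直接以分数作为输入作示意;真实方法中分数来自模型的注意力权重,且按注意力头分别执行。

```python
# Illustrative per-head cache eviction in the spirit of MaskKV: score each
# prompt position by the attention mass it receives from mask-token queries,
# then keep only the top-`budget` key/value pairs, preserving order.
def evict(kv_pairs, scores, budget):
    """Keep the `budget` highest-scoring (key, value) pairs, in position order."""
    if budget >= len(kv_pairs):
        return list(kv_pairs)
    keep = sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=True)[:budget]
    keep_set = set(keep)
    return [kv for i, kv in enumerate(kv_pairs) if i in keep_set]
```

自适应预算策略相当于让 budget 随层和注意力头变化:中间层给小预算,偏好提示的头给大预算。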


【7】CapGeo: A Caption-Assisted Approach to Geometric Reasoning
标题:CapGeo:一种字幕辅助的几何推理方法
链接:https://arxiv.org/abs/2510.09302

作者:Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
备注:preprint, under review
摘要:几何推理仍然是多模态大型语言模型(MLLM)的核心挑战。即使是最先进的闭源系统,如GPT-O3和Gemini-2.5-Pro,尽管在国际数学奥林匹克(IMO)等任务上表现出强大的文本推理能力,仍然难以可靠地解决几何问题。这一差距表明,瓶颈在于理解几何图形,而不是推理本身。由于几何图形通常可以以简洁的文本形式被忠实地描述,将视觉内容转换为描述文本(caption)提供了一个有前途的方向。受此启发,我们提出了CapGeo,一个连接视觉和文本模态的描述辅助推理框架。实验表明,当模型配备描述文本时效果有实质性的改善:Qwen2.5-VL-72B从8.6%(仅视觉)提高到59.0%,而Claude-Opus-4从44.8%提高到73.0%。为了系统地评估和识别高质量的几何描述模型,我们进一步提出了CapGeo-Bench,一个包含4,641个精选图形-描述对的数据集。至关重要的是,CapGeo-Bench采用了基于关键点的评估指标,该指标与下游CapGeo性能密切相关,从而能够可靠地评估几何描述能力。我们的框架和基准共同为推进MLLM的几何推理指明了一条新途径。
摘要:Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.


【8】CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts
标题:CLARity:仅靠推理一致性就能训练强化专家
链接:https://arxiv.org/abs/2510.09278

作者:Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
摘要:在数据稀缺的领域训练专家LLM是困难的,通常依赖于多项选择题(MCQ)。然而,基于MCQ的标准结果强化学习(RL)是有风险的:虽然它可以提高准确率,但我们观察到它经常降低推理质量,例如逻辑一致性。现有的监督推理的解决方案,如大规模过程奖励模型(PRM),成本高得令人望而却步。为了解决这个问题,我们提出了CLARity,一个具有成本效益的RL框架,仅使用一个小型通用LLM来提升推理质量。CLARity将一致性感知的奖励机制与两阶段"先精炼后监控"(refine-then-monitor)训练管道相结合以增强推理一致性,并采用动态数据重构策略以更好地利用有限的数据。实验表明,与基线相比,CLARity将响应一致性提高了16.5%,准确率提高了7.5%。人工评估进一步证实了在连贯性和专业性方面的全面改进。因此,CLARity提供了一个可推广的解决方案,使较小的模型能够通过推理一致性有效地指导专家模型。我们的代码开源于:https://github.com/Infinite-set/CLARity
摘要:Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models by reasoning consistency.Our code is open sourced at: https://github.com/Infinite-set/CLARity


【9】DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
标题:DSPO:用于智能体搜索与推理的稳定高效策略优化
链接:https://arxiv.org/abs/2510.09255

作者:Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao
摘要:增强LLM主动搜索外部知识的能力对于复杂的现实世界任务至关重要。目前的方法要么依赖提示来引出模型的固有智能体能力,要么在将RL应用于复杂交互任务时遭遇性能上限和崩溃,使其真正的智能体潜力尚未被开发。为了解决这个问题,我们引入了动态过滤序列级策略优化(Dynamic-filter Sequence-level Policy Optimization, DSPO),一种改进的RL算法,通过序列级优化和动态样本过滤实现鲁棒的智能体训练。我们纯粹通过RL来训练模型交替进行多轮搜索和推理,从而无需监督演示数据。在多个QA基准上,我们用DSPO训练的7B模型比可比的先前工作提高了34.1%,甚至在HotpotQA等复杂多跳QA中以近9%的相对优势超过了先前工作的14B模型,同时保持出色的训练稳定性。
摘要:Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce \textbf{D}ynamic-filter \textbf{S}equence-level \textbf{P}olicy \textbf{O}ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by \textbf{34.1\%}, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly \textbf{9\% relative}, maintaining exceptional training stability.


【10】Stronger Re-identification Attacks through Reasoning and Aggregation
标题:通过推理和聚合进行更强的重新识别攻击
链接:https://arxiv.org/abs/2510.09184

作者:Lucas Georges Gabriel Charpentier, Pierre Lison
摘要:文本去标识技术通常用于从文档中掩盖个人可识别信息(PII)。然而,它们隐藏文本中所提及个人身份的能力很难衡量。最近的工作表明,可以通过尝试"重识别"这一逆向过程来评估去标识方法的鲁棒性,即由一个自动化对手利用其背景知识来揭示已被掩盖的PII。本文提出了两种互补的策略来构建更强的重识别攻击。我们首先表明,(1)PII片段被重识别的顺序是重要的,并且在多个排序上聚合预测会带来更好的结果。我们还发现,(2)推理模型可以提升重识别性能,特别是当假定对手拥有广泛的背景知识时。
摘要:Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of _re-identification_, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the _order_ in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
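文中的第一条策略(在多个排序上聚合预测)可以用一个简单的多数投票草图说明;这里的每份预测代表一种排序下自动化对手的输出,函数与数据均为示意。

```python
# Sketch of aggregating re-identification guesses across span orderings:
# each dict holds one ordering's prediction per masked span; the final
# answer per span is the majority vote across orderings.
from collections import Counter
from typing import Dict, List

def aggregate(predictions: List[Dict[str, str]]) -> Dict[str, str]:
    """Majority vote per masked span across ordering-specific predictions."""
    spans = predictions[0].keys()
    return {span: Counter(p[span] for p in predictions).most_common(1)[0][0]
            for span in spans}
```

不同排序为每个片段提供了不同的上下文条件,投票因而能抵消单一排序下的偶然错误。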


【11】ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
标题:ReFIne:具有可靠性、忠实性和可解释性的值得信赖的大型推理模型框架
链接:https://arxiv.org/abs/2510.09062

作者:Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
摘要:长思维链(CoT)推理的最新进展在很大程度上优先考虑答案准确率和令牌效率,而忽视了对可信度至关重要的方面。我们认为,可用的推理系统必须是值得信赖的,其特征在于三个属性:可解释性、忠实性和可靠性。为此,我们提出了ReFIne,一个将监督微调与GRPO相结合的新训练框架,以鼓励模型:(i)通过产生结构化的、基于标签的、带有高层规划的轨迹来提高可解释性,使人类更容易跟随;(ii)通过明确披露指导每个解答的决定性信息并保持一致的跨部分引用来增强忠实性;以及(iii)通过提供对推导合理性和最终答案置信度的自我评估来提升可靠性。我们将ReFIne应用于多个规模(1.7B/4B/8B)的Qwen3模型,并在不同难度的数学基准上进行评估。我们的实验结果表明,ReFIne模型生成了更清晰、结构更好的推理轨迹(可解释性+44.0%),更忠实地揭示了其潜在的决策过程(忠实性+18.8%),并提供了信息丰富的置信度估计(可靠性+42.4%)。这些发现突出了一个被忽视但重要的方向:推理模型不仅应针对准确率优化,还应针对更广泛的可信度维度优化。我们的代码可从以下网址获得:https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
摘要:Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine


【12】Unleashing Perception-Time Scaling to Multimodal Reasoning Models
标题:为多模态推理模型释放感知时间缩放
链接:https://arxiv.org/abs/2510.08964

作者:Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu
摘要:推理时间缩放的最新进展,特别是那些利用带可验证奖励的强化学习的进展,大大增强了大型视觉语言模型(LVLM)的推理能力。受这一成功的启发,类似的策略已被应用于多模态推理,但它们对视觉感知的影响仍不清楚。为了研究这一差距,我们引入了DisTANCE,这是一个以感知为中心的视觉估计任务基准。评估结果表明,LVLM的估计精度有限,而推理时间缩放仅带来边际收益。我们将此归因于当前LVLM的快速感知范式,其中视觉理解被视为一次性输出,而没有对底层感知过程进行建模。为了解决这个问题,我们提出了感知时间缩放(PTS),这是一种鼓励令牌丰富的感知、并将复杂感知问题分解为中间可处理子问题的新范式,从而使感知能够与推理时间缩放对齐并从中受益。结合强化学习技术,PTS显著提高了感知准确性,将DisTANCE上的高精度性能从8.0%提高到64.7%,并能很好地推广到域外任务。令人惊讶的是,即使PTS数据是纯合成的,将它们与数学推理数据相结合,也能在推理和真实世界感知基准上带来一致的收益。进一步的分析表明,PTS引入了更多与感知相关的令牌,并增加了模型对图像令牌的关注。我们的代码和数据将公开发布。
摘要:Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.


【13】HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance
标题:HES-SQL:具有结构骨架指导的高效文本到SQL的混合推理
链接:https://arxiv.org/abs/2510.08896

作者:Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
摘要:我们提出了HES-SQL,这是一种新型的混合训练框架,它通过将思维模式融合的监督微调(SFT)与组相对策略优化(GRPO)相结合来推进文本到SQL的生成。我们的方法引入了三个关键创新:(1)一个骨架完整性评分机制,增强生成的查询与最佳SQL结构之间的偏好对齐;(2)一个查询延迟感知奖励系统,激励生成计算效率高的SQL查询;(3)一个用于思维模式补全的自蒸馏过程,防止模型推理能力退化。该框架使混合思维模型能够在推理和非推理模式之间切换,同时提高SQL查询的准确性和执行效率。   在MySQL 8.0和SQLite 3.42上、受控单用户条件下进行的实验评估表明,HES-SQL在BIRD和KaggleDBQA基准测试上分别取得了79.14%和54.9%的执行准确率。查询延迟以生成的查询在DBMS上的端到端执行时间来衡量,并对多次运行取平均以减小方差。相对于监督基线,效率增益在11%至20%之间。我们的结果为文本到SQL系统建立了一个新的范例:通过执行信息强化学习(RL)有效地平衡语义准确性和计算效率。所提出的方法对开发鲁棒的数据库自然语言接口具有重要意义,并可以扩展到需要同时优化正确性和效率的更广泛的结构化生成任务。
摘要:We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces three key innovations: (1) a skeleton-completeness scoring mechanism that enhances preference alignment between generated queries and optimal SQL structures; (2) a query-latency-aware reward system that incentivizes the generation of computationally efficient SQL queries; (3) a self-distillation process for thinking-mode completion that prevents degradation of the model's reasoning capabilities. This framework enables hybrid thinking models to switch between reasoning and non-reasoning modes while improving SQL query accuracy and execution efficiency.   Experimental evaluation, conducted on MySQL 8.0 and SQLite 3.42 under controlled single-user conditions, demonstrates that HES-SQL achieves competitive performance with execution accuracies of 79.14\% and 54.9\% on the BIRD and KaggleDBQA benchmarks, respectively. Query latency is measured as the end-to-end execution time of generated queries on the DBMS, averaged over multiple runs to mitigate variance. Efficiency gains range from 11\% to 20\% relative to supervised baselines. Our results establish a new paradigm for Text-to-SQL systems that effectively balances semantic accuracy with computational efficiency through execution-informed reinforcement learning (RL). The proposed methodology has significant implications for developing robust natural language interfaces to databases and can be extended to broader structured generation tasks requiring both correctness and efficiency optimization.
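摘要中的"查询延迟感知奖励系统"可以用如下极简草图来示意:只有执行正确的查询才能获得与相对加速成正比的延迟奖励。函数名、权重 alpha 与具体公式均为本文假设,并非 HES-SQL 原文的实现。

```python
def latency_aware_reward(correct: bool, latency_s: float,
                         baseline_s: float, alpha: float = 0.2) -> float:
    """Toy reward = correctness term + bounded latency bonus (assumed form).

    A query earns the latency bonus only when it is correct; the bonus
    grows as the query runs faster than the baseline, clipped to [0, 1].
    """
    if not correct:
        return 0.0
    # Relative speed-up over the supervised baseline, clipped to [0, 1].
    speedup = max(0.0, min(1.0, (baseline_s - latency_s) / baseline_s))
    return 1.0 + alpha * speedup
```

这种"正确性优先、效率加成"的设计可以避免模型为追求低延迟而牺牲语义正确性。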


【14】Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
标题:以多跳推理视角对中文常识推理进行基准测试
链接:https://arxiv.org/abs/2510.08800

作者:Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
摘要:虽然大型语言模型(LLM)已经展示了先进的推理能力,但它们在一般中文语境中的综合评估仍然研究不足。为了弥合这一差距,我们提出了中文常识多跳推理(CCMOR),这是一个新的基准,旨在评估LLM将中文特有的事实知识与多步逻辑推理相结合的能力。具体来说,我们首先从现有的QA数据集构建一个领域平衡的种子集,然后开发一个由LLM驱动的流水线来生成锚定在事实单元链上的多跳问题。为了确保所得数据集的质量,我们实现了一个人在回路的验证系统,由领域专家系统地验证和完善生成的问题。使用CCMOR,我们评估了最先进的LLM,展示了LLM在处理长尾知识和执行知识密集型推理方面的持续局限性。值得注意的是,检索增强生成大大缓解了这些知识差距,带来了显著的性能提升。
摘要:While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.


半/弱/无监督|不确定性(2篇)

【1】Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech
标题:分层自监督表示学习用于语音抑郁检测
链接:https://arxiv.org/abs/2510.08593

作者:Yuxin Li, Eng Siong Chng, Cuntai Guan
摘要:基于语音的抑郁检测(SDD)是传统临床评估的一种有前途的非侵入性替代方法。然而,它仍然受限于提取有意义特征以及随时间捕获稀疏、异构抑郁线索的困难。预训练的自监督学习(SSL)模型(如WavLM)提供了丰富的多层语音表示,但大多数现有的SDD方法仅依赖最后一层,或搜索单个表现最佳的层。这些方法通常过拟合特定数据集,并且无法利用检测微妙而持续的抑郁信号所需的完整层次结构。   为了应对这一挑战,我们提出了HAREN-CTC,这是一种新的架构,它在多任务学习框架内使用交叉注意力集成多层SSL特征,并结合连接主义时序分类(CTC)损失来处理稀疏的时间监督。HAREN-CTC包括两个关键模块:一个将SSL特征重组为互补嵌入的分层自适应聚类模块,以及一个通过交叉注意力对层间依赖关系建模的跨模态融合模块。CTC目标支持对齐感知训练,允许模型跟踪抑郁言语线索的不规则时间模式。   我们在两种设置下评估HAREN-CTC:使用标准数据划分的上限设置,以及使用五折交叉验证的泛化设置。该模型在DAIC-WOZ上实现了0.81、在MODMA上实现了0.82的最先进宏F1分数,在两种评估场景中均优于先前的方法。
摘要:Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals.   To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues.   We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
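HAREN-CTC 所用的连接主义时序分类(CTC)依赖"合并相邻重复、再去除空白"的折叠规则。下面是该标准规则(最佳路径解码)的最小 Python 草图,属于 CTC 的通用定义,与论文自身的实现无关。

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path: merge adjacent repeats, drop blanks.

    E.g. [1,1,0,1,2,2,0] -> [1,1,2]: repeats merge only when adjacent,
    and a blank between two identical labels keeps them distinct.
    """
    out = []
    prev = None
    for tok in path:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

正是空白符号的这种作用,使 CTC 能在无逐帧对齐标注的情况下处理稀疏的时间监督。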


【2】Unsupervised lexicon learning from speech is limited by representations rather than clustering
标题:从语音进行的无监督词典学习受限于表示而非聚类
链接:https://arxiv.org/abs/2510.09225

作者:Danel Adendorff, Simon Malan, Herman Kamper
备注:Submitted to ICASSP 2026
摘要:零资源分词和聚类系统旨在在不访问文本标签的情况下,将语音切分为类似单词的单元。尽管取得了进展,但归纳出的词典仍远非完美。在具有黄金单词边界的理想设置中,我们探究性能是受限于单词段的表示,还是受限于将它们分组为类单词类型的聚类方法。我们在英语和普通话数据上,将一系列自监督语音特征(连续/离散、帧级/词级)与不同的聚类方法(K均值、层次聚类、基于图的聚类)相结合。最好的系统在连续特征上使用动态时间规整(DTW)配合图聚类。更快的替代方案是在平均后的连续特征上使用余弦距离、或在离散单元序列上使用编辑距离,再配合图聚类。通过分别隔离表示或聚类方法的受控实验,我们证明同一词类型不同语音段之间的表示差异(而非聚类)是限制性能的主要因素。
摘要:Zero-resource word segmentation and clustering systems aim to tokenise speech into word-like units without access to text labels. Despite progress, the induced lexicons are still far from perfect. In an idealised setting with gold word boundaries, we ask whether performance is limited by the representation of word segments, or by the clustering methods that group them into word-like types. We combine a range of self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data. The best system uses graph clustering with dynamic time warping on continuous features. Faster alternatives use graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences. Through controlled experiments that isolate either the representations or the clustering method, we demonstrate that representation variability across segments of the same word type -- rather than clustering -- is the primary factor limiting performance.
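文中最佳系统依赖动态时间规整(dynamic time warping, DTW)来比较两段长度不等的连续特征序列。下面是经典 DTW 距离的纯 Python 参考草图,逐元素距离默认取绝对差,仅作说明,并非论文的实际代码。

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] and b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

DTW 允许一个序列中的元素被"拉伸"以对齐另一个序列,因此对语速差异导致的时长变化不敏感。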


检测相关(2篇)

【1】ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
标题:ExPO-HM:学习先解释后检测以进行仇恨模因检测
链接:https://arxiv.org/abs/2510.08630

作者:Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li, Da Chen, Bill Byrne
备注:Preprint
摘要:仇恨模因已经成为一种特别具有挑战性的在线虐待形式,推动了自动检测系统的发展。大多数先前的方法依赖于直接检测,仅产生二元预测。这些模型无法提供现实世界审核所需的背景和解释。最近的"先解释后检测"方法,无论是使用思维链提示还是LMM代理,都比简单的SFT基线表现更差,甚至GRPO等先进的后训练方法也无法缩小差距。我们的分析确定了此类系统的两个关键问题:模型未能将目标和攻击类型等与审核政策相关的重要线索作为可能的解释提出;且二元奖励信号不足以指导推理。为了解决这些挑战,我们受人类标注者的培训和评估过程启发,提出了ExPO-HM(面向仇恨模因的先解释后检测策略优化)。ExPO-HM结合了SFT预热、带课程学习的GRPO,以及同时作为推理质量度量和奖励的条件决策熵(CDE)。在三个仇恨模因基准测试中,ExPO-HM在二元检测、细粒度分类和推理质量方面实现了最先进的性能,与GRPO和DPO基线相比,F1分别提高了多达15%和17%。通过将仇恨模因检测从简单的二元警报转变为解释驱动的检测,ExPO-HM提供了准确、可解释且可操作的审核支持。
摘要:Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15\% and 17\% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
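摘要中条件决策熵(CDE)的确切定义请参见原论文;作为示意,下面仅给出一个基于决策分布香农熵构造"正确且自信则奖励高"的玩具奖励草图,函数形式与归一化方式均为本文之外的假设。

```python
import math


def decision_entropy(probs):
    """Shannon entropy (in nats) of a model's decision distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_reward(probs, correct):
    """Toy reward: a correct, confident (low-entropy) decision scores highest.

    This is an illustrative construction, not the paper's CDE formula.
    """
    h = decision_entropy(probs)
    h_max = math.log(len(probs))
    confidence = 1 - h / h_max if h_max > 0 else 1.0
    return confidence if correct else 0.0
```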


【2】Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech
标题:动态压力检测:言语压力时间进程建模研究
链接:https://arxiv.org/abs/2510.08586

作者:Vishakha Lall, Yisi Liu
备注:Accepted at IEEE CogMI 2025
摘要:在高压环境下,从言语中检测心理压力至关重要。虽然先前的工作已经利用声学特征进行压力检测,但大多数将压力视为静态标签。在这项工作中,我们将压力建模为一个受历史情绪状态影响、随时间演变的现象。我们提出了一种动态标注策略,从情绪标签中导出细粒度的压力注释,并引入基于交叉注意力的序列模型(单向LSTM和Transformer编码器)来捕获压力的时间进展。我们的方法在MuSE(+5%)和StressID(+18%)上比现有基线取得了显著的准确率提升,并能很好地推广到一个自定义的真实世界数据集。这些结果突显了在言语中将压力建模为动态结构的价值。
摘要:Detecting psychological stress from speech is critical in high-pressure settings. While prior work has leveraged acoustic features for stress detection, most treat stress as a static label. In this work, we model stress as a temporally evolving phenomenon influenced by historical emotional state. We propose a dynamic labelling strategy that derives fine-grained stress annotations from emotional labels and introduce cross-attention-based sequential models, a Unidirectional LSTM and a Transformer Encoder, to capture temporal stress progression. Our approach achieves notable accuracy gains on MuSE (+5%) and StressID (+18%) over existing baselines, and generalises well to a custom real-world dataset. These results highlight the value of modelling stress as a dynamic construct in speech.


识别/分类(2篇)

【1】Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking
标题:通过显著性驱动频谱图掩蔽的口音不变自动语音识别
链接:https://arxiv.org/abs/2510.09528

作者:Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, Hossein Sameti
备注:Submitted to ICASSP 2026
摘要:预先训练的基于transformer的模型显著提高了自动语音识别(ASR),但它们对口音和方言变化仍然敏感,导致英语和波斯语等语言多样性语言的单词错误率(WER)升高。为了解决这一挑战,我们提出了一个口音不变的ASR框架,将口音和方言分类集成到识别管道中。我们的方法包括训练一个基于谱图的分类器来捕获特定口音的线索,掩蔽对其预测最有影响的区域,并使用掩蔽后的谱图进行数据增强。这增强了ASR模型对口音变化的鲁棒性。我们在英语和波斯语语音上对该方法进行评估。对于波斯语,我们引入了一个新收集的、涵盖多个地区口音的数据集,建立了波斯语ASR中口音变化的第一个系统基准,填补了多语言语音研究的关键空白,并为未来低资源、语言多样性语言的研究提供了基础。基于Whisper模型的实验结果表明,我们的掩蔽和增强策略在英语和波斯语设置中均带来了可观的WER降低,证实了该方法的有效性。这项研究推进了能够适应口音和方言多样性的多语言ASR系统的开发。代码和数据集可在https://github.com/MH-Sameti/Accent_invariant_ASR上公开获取
摘要:Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
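摘要描述的核心步骤是"掩蔽对分类器预测最有影响的区域,再用掩蔽后的谱图做数据增强"。下面用纯列表给出一个最小草图:将显著性图中得分最高的 k 个时频单元置零。按单元而非按连续区域掩蔽等实现细节为本文假设。

```python
def mask_top_salient(spec, saliency, k):
    """Zero out the k most salient time-frequency cells of a spectrogram.

    spec and saliency are 2-D lists of equal shape; the input is not mutated.
    """
    cells = [(saliency[i][j], i, j)
             for i in range(len(spec)) for j in range(len(spec[0]))]
    cells.sort(reverse=True)          # most salient cells first
    masked = [row[:] for row in spec]  # copy so the original stays intact
    for _, i, j in cells[:k]:
        masked[i][j] = 0.0
    return masked
```

将这样得到的掩蔽谱图与原始谱图一起训练,可迫使 ASR 模型减少对口音相关线索的依赖。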


【2】Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted Dictionary-Based Post-processing for BioASQ 2025 task 6
标题:使用GLiNER-BioMed以及BioASQ 2025任务6的基于目标词典的后处理增强生物医学命名实体识别
链接:https://arxiv.org/abs/2510.08588

作者:Ritesh Mehta
备注:Paper published to CLEF 2025 CEUR-WS
摘要:生物医学命名实体识别(BioNER)是BioASQ(大规模生物医学语义索引和问答挑战)中的任务6,对于从科学文献中提取信息至关重要,但面临诸如区分基因与化学物质等相似实体类型的障碍。本研究在BioASQ数据集上评估了GLiNER-BioMed模型,并引入了一种有针对性的基于词典的后处理策略来解决常见的错误分类。虽然这种后处理方法在我们的开发集上表现出显著的改进,将微观F1分数从基线的0.79提高到0.83,但这种增强并没有推广到盲测试集:后处理模型在盲测试集上取得了0.77的微观F1分数,而基线为0.79。我们还讨论了从探索替代方法(包括条件随机场)中获得的见解。这项工作突出了基于词典的改进对预训练BioNER模型的潜力,但强调了过拟合开发数据这一关键挑战,以及确保鲁棒泛化以实现真实世界适用性的必要性。
摘要:Biomedical Named Entity Recognition (BioNER), task6 in BioASQ (A challenge in large-scale biomedical semantic indexing and question answering), is crucial for extracting information from scientific literature but faces hurdles such as distinguishing between similar entity types like genes and chemicals. This study evaluates the GLiNER-BioMed model on a BioASQ dataset and introduces a targeted dictionary-based post-processing strategy to address common misclassifications. While this post-processing approach demonstrated notable improvement on our development set, increasing the micro F1-score from a baseline of 0.79 to 0.83, this enhancement did not generalize to the blind test set, where the post-processed model achieved a micro F1-score of 0.77 compared to the baselines 0.79. We also discuss insights gained from exploring alternative methodologies, including Conditional Random Fields. This work highlights the potential of dictionary-based refinement for pre-trained BioNER models but underscores the critical challenge of overfitting to development data and the necessity of ensuring robust generalization for real-world applicability.
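基于词典的后处理的核心思想可以用如下草图说明:若实体提及命中某一类型的词典,则覆盖模型预测的类型。词典条目为虚构示例,并非论文实际使用的资源。

```python
# Toy type dictionaries; real systems would load curated resources.
GENE_DICT = {"brca1", "tp53"}
CHEMICAL_DICT = {"aspirin", "dopamine"}


def postprocess(entities):
    """entities: list of (mention, predicted_type); returns corrected list.

    Dictionary hits override the model's prediction; everything else passes
    through unchanged.
    """
    fixed = []
    for mention, etype in entities:
        key = mention.lower()
        if key in GENE_DICT:
            etype = "GENE"
        elif key in CHEMICAL_DICT:
            etype = "CHEMICAL"
        fixed.append((mention, etype))
    return fixed
```

正如摘要所述,这种规则在开发集上有效但可能过拟合:词典是按开发集的错误模式整理的,未必覆盖盲测试集的分布。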


Zero/Few/One-Shot|迁移|自适应(2篇)

【1】Creation of the Chinese Adaptive Policy Communication Corpus
标题:中国适应性政策传播语料库的创建
链接:https://arxiv.org/abs/2510.08986

作者:Bolun Sun, Charles Chang, Yuen Yuen Ang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Zhengxin Zhang
摘要:我们介绍CAPC-CG,即中国适应性政策沟通(中央政府)语料库,这是第一个以清晰与模糊语言类别的五色分类法进行标注的开放中国政策指令数据集,其分类建立在Ang的适应性政策沟通理论之上。该语料库涵盖1949-2023年,包括中国最高当局发布的国家法律、行政法规和部门规章。每个文件被切分为段落,总共产生330万个单元。除语料库外,我们还发布了全面的元数据、两轮标注框架,以及由专家和训练有素的编码人员开发的黄金标准标注集。标注者间一致性在指令标签上达到Fleiss kappa K = 0.86,表明其可靠性足以支持监督建模。我们提供了若干大型语言模型(LLM)的基线分类结果以及我们的标注手册,并描述了数据集中的模式。该版本旨在支持政策沟通中的下游任务和多语言NLP研究。
摘要:We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
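摘要中报告的标注者间一致性指标 Fleiss' kappa 可按如下标准公式计算;这是该指标的通用纯 Python 草图(要求每个条目的标注者数相同),并非论文的统计代码。

```python
def fleiss_kappa(table):
    """Fleiss' kappa from a table where table[i][j] is the number of raters
    assigning item i to category j. Every item must have the same rater count.
    """
    N = len(table)             # number of items
    n = sum(table[0])          # raters per item
    k = len(table[0])          # number of categories
    # Mean per-item agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in table
    ) / N
    # Chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

例如两个条目、每条目三位标注者完全一致时 kappa 为 1;文中 K = 0.86 通常被解读为几乎完全一致。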


【2】Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training
标题:时间感知特征选择:用于稳定稀疏自动编码器训练的自适应时态掩蔽
链接:https://arxiv.org/abs/2510.08855

作者:T. Ed Li, Junyu Ren
备注:First submitted on February 10th, 2025 to ICLR 2025 Workshop (XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge). The paper was accepted but the workshop does not generate proceedings. Now uploading to arXiv to make the paper publicly available
摘要:理解大型语言模型的内部表示对于确保其可靠性和安全性至关重要,稀疏自动编码器(SAE)正在成为一种有前途的可解释性方法。然而,当前SAE训练方法面临特征吸收,其中特征(或神经元)被吸收到彼此中以最小化$L_1$惩罚,使得难以一致地识别和分析模型行为。我们引入了自适应时间掩蔽(ATM),这是一种新的训练方法,通过跟踪激活幅度,频率和重建贡献来动态调整特征选择,以计算随时间推移而演变的重要性分数。ATM应用基于这些重要性分数的统计阈值的概率掩蔽机制,创建更自然的特征选择过程。通过对Gemma-2-2b模型的广泛实验,我们证明了ATM与TopK和JumpReLU SAE等现有方法相比,吸收分数显著降低,同时保持了出色的重建质量。这些结果确立了ATM作为学习神经网络中稳定,可解释特征的原则性解决方案,为更可靠的模型分析提供了基础。
摘要:Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
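摘要中"跟踪激活幅度等统计量以计算随时间演变的重要性分数,并据此做概率性掩蔽"的思路,可用如下极简草图示意;其中的指数滑动平均与按均值归一化的保留概率均为本文假设,并非 ATM 原文的精确公式。

```python
import random


def update_importance(scores, activations, decay=0.9):
    """EMA update of per-feature importance from current activation magnitudes."""
    return [decay * s + (1 - decay) * abs(a) for s, a in zip(scores, activations)]


def mask_features(scores, rng, keep_margin=0.0):
    """Keep each feature with probability ~ its score relative to the mean score.

    rng is a random.Random instance so masking is reproducible under a seed.
    """
    mean = sum(scores) / len(scores)
    keep = []
    for s in scores:
        p = min(1.0, s / mean) if mean > 0 else 1.0
        keep.append(rng.random() < p + keep_margin)
    return keep
```

随时间累积的重要性分数使掩蔽决策不再只依赖单步激活,这正是"时间感知"特征选择的要点。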


Word2Vec|文本|单词(1篇)

【1】LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
标题:LitE-SQL:一个轻量级、高效的文本到SQL框架,具有基于向量的模式链接和执行引导的自我纠正
链接:https://arxiv.org/abs/2510.09014

作者:Shengmin Piao, Jieun Lee, Sanghyun Park
摘要:Text-to-SQL任务将自然语言问题转换为SQL查询,为非专家提供直观的数据库交互。虽然最近的方法利用大型语言模型(LLM)实现了强大的性能,但它们对专有模型的依赖引起了对部署可行性和数据隐私的担忧。在这项工作中,我们介绍LitE-SQL,一个轻量级且高效的框架,包含两个组件:(i)一个模式检索器,利用预先计算的模式嵌入向量数据库执行高效的模式链接;(ii)一个分两阶段微调的SQL生成器:先进行监督微调,再进行执行引导的强化训练,使其无需昂贵的多候选生成即可实现自我纠正。在BIRD上,LitE-SQL实现了72.10%的执行准确率,在Spider 1.0上达到了88.45%;尽管参数量少2倍到30倍,其性能仍与基于LLM的方法相当或更好。我们的研究结果表明,轻量级模型也能实现高质量的文本到SQL生成,为隐私敏感和资源受限的环境提供了一个实用的解决方案。
摘要:The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10\% execution accuracy, and on Spider 1.0 it reaches 88.45\%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
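基于向量的模式链接的核心步骤,即用预计算的模式嵌入做余弦相似度 top-k 检索,可以用如下纯 Python 草图示意;嵌入为手工构造的玩具向量,并非 LitE-SQL 的真实模型输出。

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def retrieve_schema(question_emb, schema_embs, k=1):
    """schema_embs: {table_name: embedding}; return the top-k tables by cosine."""
    ranked = sorted(schema_embs,
                    key=lambda t: cosine(question_emb, schema_embs[t]),
                    reverse=True)
    return ranked[:k]
```

实际系统会用向量数据库的近似最近邻索引替代这里的全量排序,但检索语义相同。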


其他神经网络|深度学习|模型|建模(3篇)

【1】Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
标题:Dyna-Mind:学习根据经验进行模拟,以获得更好的人工智能代理
链接:https://arxiv.org/abs/2510.09577

作者:Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
摘要:推理模型最近在数学和编码等领域取得了显著进展。然而,他们在数学和编码方面的专家级能力,与他们在长时程交互任务(如网页导航和计算机/手机使用)中的表现形成鲜明对比。受人类认知文献的启发,我们认为当前的人工智能代理需要"替代性试错",即在行动之前于心理上模拟替代未来的能力,以增强他们在复杂交互环境中的理解和表现。我们介绍Dyna-Mind,一个两阶段训练框架,明确教授(V)LM代理将这种模拟整合到它们的推理中。在第一阶段,我们引入了基于模拟的推理(ReSim),它训练智能体从扩展的搜索树中生成结构化的推理轨迹,这些搜索树由通过环境交互收集的真实经验构建而成。因此,ReSim将智能体的推理建立在忠实的世界动态之上,并使其具备在推理中预测未来状态的能力。在第二阶段,我们提出了Dyna-GRPO,这是一种在线强化学习方法,通过同时使用结果奖励和真实推演的中间状态作为反馈,进一步加强代理的模拟和决策能力。在两个合成基准测试(Sokoban和ALFWorld)和一个现实基准测试(AndroidWorld)上的实验表明,(1)ReSim有效地将模拟能力注入到AI代理中,(2)Dyna-GRPO利用结果和交互级别的信号来学习更好的策略,以执行长时程、规划密集型任务。总之,这些结果突显了模拟在使AI代理在日益具有挑战性的环境中更有效地推理、规划和行动方面的核心作用。
摘要:Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need ''vicarious trial and error'' - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in the ever more challenging environments.


【2】Can We Reliably Rank Model Performance across Domains without Labeled Data?
标题:在没有标记数据的情况下,我们能否对跨领域的模型性能进行可靠的排名?
链接:https://arxiv.org/abs/2510.09519

作者:Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe, Anthony Rios
备注:8 pages + references and Appendix
摘要:在没有标签的情况下估计模型性能,是理解NLP模型如何泛化的一个重要目标。虽然先前的工作已经提出了基于数据集相似性或预测正确性的度量,但这些估计何时能产生可靠的跨域性能排名仍不清楚。在本文中,我们使用一个两步评估设置(四个基础分类器,以及若干作为错误预测器的大型语言模型)分析影响排名可靠性的因素。在横跨15个领域的GeoOLID和Amazon Reviews数据集上的实验表明,基于大型语言模型的错误预测器与真实准确率之间产生的等级相关性,比基于漂移的基线或zero-shot基线更强、更一致。我们的分析揭示了两个关键发现:当跨域的性能差异较大时,以及当错误模型的预测与基础模型的真实失败模式一致时,排名更可靠。这些结果澄清了性能估计方法何时可以信任,并为它们在跨域模型评估中的使用提供了指导。
摘要:Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
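评估"估计的性能排名与真实排名是否一致"时,常用等级相关系数。下面是 Spearman 等级相关的最小草图(为简洁起见不做并列值校正);论文使用的具体相关度量以原文为准。

```python
def spearman(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n*(n^2-1)).

    No tie correction, for simplicity; x and y must have equal length.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

相关系数为 1 表示估计方法给出的跨域排名与真实准确率排名完全一致,为 -1 则完全相反。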


【3】Estimating Brain Activity with High Spatial and Temporal Resolution using a Naturalistic MEG-fMRI Encoding Model
标题:使用自然主义MEG-fMRI编码模型以高空间和时间分辨率估计大脑活动
链接:https://arxiv.org/abs/2510.09415

作者:Beige Jerry Jin, Leila Wehbe
摘要:目前的非侵入性神经成像技术在空间分辨率和时间分辨率之间进行权衡。虽然脑磁图(MEG)可以捕捉快速的神经动力学、功能磁共振成像(fMRI)可以在空间上定位大脑活动,但对于现有的源定位或MEG-fMRI融合方法而言,获得同时保持两种高分辨率的统一图景仍是一个未解决的挑战,对单次试验的自然数据尤其如此。我们在受试者被动聆听七个多小时叙事故事的同时采集了全头MEG,所用刺激与一个公开fMRI数据集(LeBel et al., 2023)相同。我们开发了一个基于Transformer的编码模型,结合这两个自然语音理解实验的MEG和fMRI,以高时空分辨率估计潜在的皮层源响应。我们的模型经过训练,可以同时预测多个受试者的MEG和fMRI,其中一个潜在层代表了我们对重建皮层源的估计。我们的模型对MEG的预测优于单模态编码模型这一通用标准,并且在模拟实验中,它产生的源估计在空间和时间保真度上都高于经典的最小范数解。我们通过展示其对未见受试者和未见模态的强泛化能力,验证了所估计的潜在源。在一个全新的数据集上,我们源空间中的估计活动对皮层脑电(ECoG)的预测优于直接在ECoG上训练的编码模型。通过整合大型自然实验、MEG、fMRI和编码模型的力量,我们提出了一条实现毫秒级与毫米级脑映射的实用路线。
摘要:Current non-invasive neuroimaging techniques trade off between spatial resolution and temporal resolution. While magnetoencephalography (MEG) can capture rapid neural dynamics and functional magnetic resonance imaging (fMRI) can spatially localize brain activity, a unified picture that preserves both high resolutions remains an unsolved challenge with existing source localization or MEG-fMRI fusion methods, especially for single-trial naturalistic data. We collected whole-head MEG when subjects listened passively to more than seven hours of narrative stories, using the same stimuli in an open fMRI dataset (LeBel et al., 2023). We developed a transformer-based encoding model that combines the MEG and fMRI from these two naturalistic speech comprehension experiments to estimate latent cortical source responses with high spatiotemporal resolution. Our model is trained to predict MEG and fMRI from multiple subjects simultaneously, with a latent layer that represents our estimates of reconstructed cortical sources. Our model predicts MEG better than the common standard of single-modality encoding models, and it also yields source estimates with higher spatial and temporal fidelity than classic minimum-norm solutions in simulation experiments. We validated the estimated latent sources by showing its strong generalizability across unseen subjects and modalities. Estimated activity in our source space predict electrocorticography (ECoG) better than an ECoG-trained encoding model in an entirely new dataset. By integrating the power of large naturalistic experiments, MEG, fMRI, and encoding models, we propose a practical route towards millisecond-and-millimeter brain mapping.


其他(30篇)

【1】AutoPR: Let's Automate Your Academic Promotion!
标题:AutoPR:让我们自动化您的学术推广!
链接:https://arxiv.org/abs/2510.09558

作者:Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che
备注:Preprint. Code: this https URL . Benchmark: this https URL
摘要:随着同行评审研究的数量激增,学者越来越依赖社交平台进行发现,而作者则投入大量精力推广他们的工作,以确保知名度和引用。为了简化这一过程并减少对人力的依赖,我们引入了自动推广(AutoPR),这是一项将研究论文转化为准确、引人入胜且及时的公共内容的新任务。为了进行严格的评估,我们发布了PRBench,这是一个将512篇同行评议文章与高质量推广帖子配对的多模态基准,沿三个轴评估系统:保真度(准确性和语气)、参与度(受众定位和吸引力)和对齐(时机和渠道优化)。我们还介绍了PRAgent,这是一个分三个阶段自动化AutoPR的多代理框架:带多模态准备的内容提取、生成润色输出的协作合成,以及针对特定平台的适配,以优化平台规范、语气和标签,实现最大传播范围。与PRBench上的直接LLM管道相比,PRAgent表现出实质性的改进,包括总观看时间增加了604%、点赞数增加了438%、整体参与度至少提升2.9倍。消融研究表明,平台建模和有针对性的推广对这些收益贡献最大。我们的研究结果将AutoPR定位为一个易于处理、可衡量的研究问题,并为可扩展、有影响力的自动化学术交流提供了路线图。
摘要:As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.


【2】Multimodal Policy Internalization for Conversational Agents
标题:对话代理的多模态策略内化
链接:https://arxiv.org/abs/2510.09474

作者:Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
摘要:像ChatGPT和Alexa+这样的现代会话代理依赖于指定元数据、响应风格和工具使用规则的预定义策略。随着这些基于LLM的系统扩展到支持多样的业务和用户查询,这些通常作为上下文提示实现的策略变得越来越复杂和冗长,使得忠实遵守变得困难,并带来大量固定计算成本。随着多模态代理的兴起,管理视觉和多模态行为的策略至关重要,但仍研究不足。之前的提示压缩工作主要是缩短任务模板和演示,而现有的策略对齐研究只关注基于文本的安全规则。我们引入了多模态策略内化(MPI),这是一个新任务,它将推理密集型多模态策略内化到模型参数中,从而在推理过程中不附带策略的情况下实现更强的策略遵循。MPI提出了独特的数据和算法挑战。我们构建了两个涵盖合成与现实世界决策和工具使用任务的数据集,并提出了TriMPI,一个三阶段训练框架。TriMPI首先通过持续预训练注入策略知识,然后执行监督微调,最后应用PolicyRollout,这是一种GRPO风格的强化学习扩展,它用策略感知的响应来增强rollout,以实现有据可依的探索。TriMPI在端到端准确性、泛化能力和对遗忘的鲁棒性方面取得了显著提升。作为多模态策略内化的首项工作,我们提供了数据集、训练配方和全面的评估,以促进未来的研究。项目页面:https://mikewangwzhl.github.io/TriMPI。
摘要:Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.


【3】HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
标题:HINT:帮助无效的rollout走向有效
链接:https://arxiv.org/abs/2510.09388

作者:Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
摘要:强化学习(RL)已成为增强大型语言模型(LLM)的长思想链(CoT)推理能力的关键驱动力。然而,当任务难度超过模型的能力时,像GRPO这样的流行方法往往会失败,导致奖励稀疏和低效的训练。虽然之前的工作试图使用非策略数据来缓解这一问题,例如将RL与监督微调(SFT)混合或使用提示,但它们经常误导策略更新。在这项工作中,我们确定了这些失败背后的核心问题,我们称之为低训练亲和力。这种情况产生于外部指导和模型政策之间的巨大分配不匹配。为了诊断这一点,我们引入了Affinity,这是用于监控探索效率和训练稳定性的第一个定量指标。为了提高亲和力,我们提出了HINT:Helping Ineffective rollouts Navigate Towards effectiveness,一个自适应提示框架。HINT不提供直接的答案,而是提供启发式提示,引导模型自己发现解决方案,保留其自主推理能力。在数学推理任务上的大量实验表明,HINT始终优于现有方法,在各种规模的模型上实现了最先进的结果,同时也展示了更稳定的学习和更高的数据效率。代码可在Github上获得。
摘要:Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model's policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency. Code is available on Github.


【4】Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
标题:令牌级策略优化:通过马尔可夫似然将组级奖励与令牌级聚合联系起来
链接:https://arxiv.org/abs/2510.09369

作者:Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv
摘要:组相对策略优化(GRPO)显著提高了大型语言模型(LLM)的推理能力,特别是提升了它们的数学性能。然而,GRPO及相关的熵正则化方法仍然面临着源于思维链(CoT)固有的稀疏令牌奖励的挑战。目前的方法通常依赖于无差别的令牌级熵调整,这经常导致熵崩溃或模型崩溃。在这项工作中,我们提出了TEPO,一种新颖的令牌级框架,它利用马尔可夫似然(序列似然),通过令牌级聚合将组级奖励与各个令牌联系起来。实验表明,TEPO在关键指标(包括@k和准确性)上始终优于现有基线。它不仅在数学推理任务上达到了新的最先进水平,而且还显著提高了训练的稳定性。
摘要:Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that uses Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.


【5】MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
标题:MaP:预训练动态可靠评估的统一框架
链接:https://arxiv.org/abs/2510.09295

作者:Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
摘要:可靠的评估是大型语言模型(LLM)进步的基础,但预训练期间的评估过程受到显著不稳定性的困扰,从而掩盖了真正的学习动态。在这项工作中,我们系统地诊断这种不稳定性,将其归因于两个不同的来源:来自训练随机性的参数不稳定性和来自噪声测量协议的评估不稳定性。为了抵消这两种噪声源,我们引入了MaP,这是一个双管齐下的框架,它协同集成了检查点合并(Merging)和Pass@k度量。检查点合并通过平均最近的模型权重来平滑参数空间,而Pass@k提供了对模型能力的鲁棒的、低方差的统计估计。大量的实验表明,MaP产生了更平滑的性能曲线,减少了运行间的方差,并确保了更一致的模型排名。最终,MaP为观察LLM训练动态提供了一个更可靠、更忠实的视角,为LLM研究奠定了至关重要的经验基础。
摘要:Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: \textit{Parameter Instability} from training stochasticity and \textit{Evaluation Instability} from noisy measurement protocols. To counteract both sources of noise, we introduce \textbf{MaP}, a dual-pronged framework that synergistically integrates checkpoint \underline{M}erging \underline{a}nd the \underline{P}ass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
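摘要中的两个组件都有标准的数学形式:检查点合并即对最近若干个检查点的权重逐参数取平均,Pass@k 则有经典的无偏估计公式。下面是一个极简的 Python 示意(其中 merge_checkpoints、pass_at_k 等函数名为笔者为说明而拟,并非论文的官方实现):

```python
import math

def merge_checkpoints(checkpoints):
    """检查点合并示意:对最近若干个检查点的权重逐参数取平均。
    这里用 dict[str, list[float]] 模拟权重,实际实现中为张量。"""
    n = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = [sum(ckpt[name][i] for ckpt in checkpoints) / n
                        for i in range(len(checkpoints[0][name]))]
    return merged

def pass_at_k(n, c, k):
    """Pass@k 的无偏估计:n 次采样中有 c 次正确时,
    抽取 k 次至少命中一次正确答案的概率 1 - C(n-c, k) / C(n, k)。"""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

例如,n=10 次采样中有 c=3 次正确时,pass_at_k(10, 3, 5) ≈ 0.917,比朴素的 c/n 估计方差更低、更稳定。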


【6】Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
标题:夸大的卓越还是真实的表现?用动态评估重新思考医学诊断基准
链接:https://arxiv.org/abs/2510.09275

作者:Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu
摘要:医疗诊断是一个高风险且复杂的领域,对患者护理至关重要。然而,目前对大型语言模型(LLM)的评估从根本上与现实世界的临床实践不一致。其中大多数依赖于源自公开医学考试题目的静态基准,这往往会高估模型性能,并忽略教科书案例与现实世界中模糊多变的状况之间的差异。最近面向动态评估的努力提供了一个有前途的替代方案,但其改进仅限于表面的扰动,且狭隘地只关注准确性。为了解决这些差距,我们提出了DyReMe,一个能更好反映真实临床实践的医学诊断动态基准。与静态的考试式问题不同,DyReMe生成新鲜的、类似问诊的病例,引入鉴别诊断和常见误诊因素等干扰项。它还改变表达风格,以模仿现实世界中多样的提问习惯。除准确性之外,DyReMe还在三个额外的临床相关维度上评估LLM:真实性、有用性和一致性。我们的实验表明,这种动态方法产生了更具挑战性和现实性的评估,揭示了最先进LLM的表现与真实临床实践之间的显著不一致。这些发现突显出迫切需要能更好反映可信医疗诊断需求的评估框架。
摘要:Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.


【7】IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
标题:IRIS:在缺乏表格数据的情况下进行可验证因果发现的迭代集成框架
链接:https://arxiv.org/abs/2510.09217

作者:Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
备注:ACL 2025
摘要:因果发现是科学研究的基础,但传统的统计算法面临着重大挑战,包括昂贵的数据收集、对已知关系的冗余计算以及不切实际的假设。虽然最近基于LLM的方法擅长识别常见的因果关系,但它们未能发现新的关系。我们介绍IRIS(迭代检索与实时因果发现集成系统),一个解决了这些限制的新框架。从一组初始变量开始,IRIS自动收集相关文档、提取变量并揭示因果关系。我们的混合因果发现方法结合了统计算法和基于LLM的方法来发现已知和新颖的因果关系。除了对初始变量进行因果发现外,IRIS的缺失变量提案组件还识别并纳入缺失变量以扩展因果图。我们的方法仅从一组初始变量即可实现实时因果发现,而不需要预先存在的数据集。
摘要:Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.


【8】Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
标题:多模态提示优化:为什么不为MLLM利用多种模态
链接:https://arxiv.org/abs/2510.09201

作者:Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
摘要:大型语言模型(LLM)已经取得了显著的成功,其多模态扩展(MLLM)进一步解锁了涵盖图像、视频以及文本之外其他模态的能力。然而,尽管有这种转变,旨在减少手动提示制作负担同时最大限度提高性能的提示优化方法仍然局限于文本,最终限制了MLLM的全部潜力。受这一差距的启发,我们引入了多模态提示优化这一新问题,它将先前的提示优化定义扩展到由文本与非文本提示对所定义的多模态空间。为了解决这个问题,我们提出了多模态提示优化器(MPO),这是一个统一的框架,它不仅通过保持对齐的更新对多模态提示进行联合优化,还通过在基于贝叶斯的选择策略中将早期评估用作先验,来指导候选提示的选择过程。通过在图像、视频甚至分子等超越文本的各种模态上的广泛实验,我们证明MPO优于领先的纯文本优化方法,将多模态提示优化确立为实现MLLM潜力的关键一步。
摘要:Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.


【9】Exploiting Web Search Tools of AI Agents for Data Exfiltration
标题:利用AI代理的网络搜索工具进行数据外泄
链接:https://arxiv.org/abs/2510.09093

作者:Dennis Rall, Bernhard Bauer, Mohit Mittal, Thomas Fraunholz
备注:9 pages, 6 figures, conference article
摘要:大型语言模型(LLM)现在通常用于自主执行复杂任务,从自然语言处理到Web搜索等动态工作流。工具调用和检索增强生成(RAG)的使用允许LLM处理和检索敏感的公司数据,放大其功能和滥用的脆弱性。随着LLM越来越多地与外部数据源交互,间接即时注入成为一种关键且不断发展的攻击向量,使攻击者能够通过操纵输入来利用模型。通过对不同模型的间接提示注入攻击进行系统评估,我们分析了当前LLM对此类攻击的敏感程度,哪些参数(包括模型大小和制造商,具体实现)塑造了它们的脆弱性,以及哪些攻击方法仍然最有效。我们的研究结果表明,即使是众所周知的攻击模式继续成功,暴露了模型防御的持续弱点。为了解决这些漏洞,我们强调需要加强培训程序以提高固有的弹性,建立已知攻击向量的集中数据库以实现主动防御,并建立统一的测试框架以确保持续的安全验证。这些步骤对于推动开发人员将安全性集成到LLM的核心设计中至关重要,因为我们的研究结果表明,当前的模型仍然无法缓解长期存在的威胁。
摘要:Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. As LLMs increasingly interact with external data sources, indirect prompt injection emerges as a critical and evolving attack vector, enabling adversaries to exploit models through manipulated inputs. Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks, which parameters, including model size and manufacturer, specific implementations, shape their vulnerability, and which attack methods remain most effective. Our results reveal that even well-known attack patterns continue to succeed, exposing persistent weaknesses in model defenses. To address these vulnerabilities, we emphasize the need for strengthened training procedures to enhance inherent resilience, a centralized database of known attack vectors to enable proactive defense, and a unified testing framework to ensure continuous security validation. These steps are essential to push developers toward integrating security into the core design of LLMs, as our findings show that current models still fail to mitigate long-standing threats.


【10】Auto-scaling Continuous Memory for GUI Agent
标题:自动扩展图形用户界面代理的连续内存
链接:https://arxiv.org/abs/2510.09038

作者:Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang
摘要:我们研究如何赋予GUI代理可扩展的内存,以帮助其在不熟悉的界面和长程任务上泛化。先前的GUI代理将过去的轨迹压缩为文本令牌,这使上下文长度膨胀,并错过了决定性的视觉线索(例如,小部件的精确尺寸和位置)。我们提出了一种连续内存,使用VLM本身作为编码器,将每条GUI轨迹编码为固定长度的连续嵌入序列;这些嵌入直接插入到主干网络的输入层,大大降低了上下文成本,同时保留了细粒度的视觉信息。随着内存容量和检索深度的增加,性能单调提升,而文本记忆则会随着提示变长而退化。为了以低成本扩充内存,我们引入了一个自动扩展的数据飞轮,它(i)通过搜索发现新环境,(ii)用开源VLM合成任务,(iii)用代理推出轨迹,(iv)用同一VLM验证成功。使用这个管道,我们以约4000美元的成本收集了100k+条轨迹,并仅用1,500个样本微调内存编码器(Q-Former上的LoRA,占1.2%的参数)。在现实世界的GUI基准测试中,我们的内存增强代理在长程任务和分布变化下持续提高成功率。值得注意的是,Qwen-2.5-VL-7B+连续内存实现了与最先进的闭源模型(例如GPT-4o、Claude-4)相当的性能。
摘要:We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).


【11】DARO: Difficulty-Aware Reweighting Policy Optimization
标题:DARO:难度感知的重加权策略优化
链接:https://arxiv.org/abs/2510.09001

作者:Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
摘要:大型语言模型(LLM)的最新进展表明,通过具有可验证奖励的强化学习(RLVR)可以显着增强推理能力。组相对策略优化(GRPO)已经成为RLVR的实际方法,激发了许多变体。然而,我们的数学分析表明,这些方法基本上是GRPO的加权变化。我们提供了一个统一的观点,表明他们依赖于静态或过于简单的加权方案与样本难度,防止适应模型的不断发展的能力。这造成了一个重大的损失规模问题,即训练不成比例地集中在某些难度水平上,而牺牲了其他难度水平,阻碍了整体表现。为了解决这些限制,我们引入了难度感知重加权策略优化(DARO),这是一种基于模型的学习状态动态调整每个难度组的损失贡献的方法。在Qwen2.5-Math-1.5B、Qwen2.5-Math-7 B和Llama3.1-8B上进行的大量实验表明,DARO在六个数学基准测试中的表现优于四个领先基线,实现了更快的收敛速度和更出色的最终性能。
摘要:Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce \textbf{Difficulty-Aware Reweighting Policy Optimization (DARO)}, a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
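摘要未给出 DARO 具体的加权公式。作为参考,下面给出一个假想的难度感知重加权草图(用 p*(1-p) 衡量可学习性以及 softmax 归一化均为笔者的假设,并非 DARO 的原始方法),仅用于说明"根据各难度组的学习状态动态调整损失权重"这一思路:

```python
import math

def difficulty_group_weights(pass_rates, temperature=1.0):
    """假想示意(非 DARO 原始公式):根据各难度组当前的通过率,
    给中间难度(最具学习价值)的组更大的损失权重。
    用 p*(1-p)(伯努利方差)衡量可学习性,再做 softmax 归一化。"""
    scores = [p * (1 - p) / temperature for p in pass_rates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # 数值稳定的 softmax
    z = sum(exps)
    return [e / z for e in exps]
```

在这个假设的方案下,通过率接近 0.5 的难度组权重最高,而模型已完全掌握(p≈1)或完全无法解决(p≈0)的组被降权,与摘要中"训练不应不成比例地集中在某些难度水平"的动机一致。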


【12】Diagnosing and Mitigating System Bias in Self-Rewarding RL
标题:诊断和缓解自我奖励RL中的系统偏差
链接:https://arxiv.org/abs/2510.08977

作者:Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
摘要:具有可验证奖励的强化学习(RLVR)可以扩展大型语言模型(LLM)的推理能力,但仍受限于有限的标注样本,难以持续进行数据扩展。具有内在奖励的强化学习(RLIR)由策略模型为自己的rollout分配奖励,能够在无标注环境中实现可持续扩展,但其性能和稳定性落后于RLVR。我们将这一差距归因于一种系统偏差:模型倾向于高估其高置信度的rollout,导致奖励估计有偏且不稳定。随着训练的进行,这种偏差不断累积,与理想奖励的偏离逐渐趋向过度奖励,造成训练不稳定。我们使用三个指标来刻画这种偏差:$\rho_{\text{noise}}$、$\rho_{\text{selfbias}}$和$\rho_{\text{symbias}}$。我们发现$\rho_{\text{noise}}$和$\rho_{\text{symbias}}$影响收敛,而$\rho_{\text{selfbias}}$会同时放大正确和错误的更新,导致不稳定。为缓解这一问题,我们提出了具有集成奖励的强化学习(RLER),它聚合多样化的模型,并自适应地进行奖励插值和rollout选择。大量实验表明,RLER比RLIR提高了13.6%,仅比RLVR低3.6%,在无标注样本上实现了稳定扩展,具有很高的适用性。
摘要:Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
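摘要描述 RLER "聚合多样化的模型并自适应地进行奖励插值"。下面用一个极简 Python 草图示意这种集成奖励的思路(插值系数 alpha、打分的数据结构等细节均为笔者的假设,并非论文的原始算法):

```python
def ensemble_reward(rollout_scores, alpha=0.5):
    """集成奖励示意(细节为假设):将多个外部模型对同一 rollout 的打分取平均,
    再与策略模型自身的打分按 alpha 插值,以抑制单一模型的系统偏差。
    rollout_scores: {"self": float, "ensemble": [float, ...]}"""
    ens = sum(rollout_scores["ensemble"]) / len(rollout_scores["ensemble"])
    return alpha * rollout_scores["self"] + (1 - alpha) * ens
```

当策略模型高估自己的 rollout(摘要所述的自偏差)时,独立模型的平均打分把最终奖励拉回,alpha 控制对自评的信任程度。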


【13】A Human Behavioral Baseline for Collective Governance in Software Projects
标题:软件项目集体治理的人类行为基线
链接:https://arxiv.org/abs/2510.08956

作者:Mobina Noori, Mahasweta Chakraborti, Amy X Zhang, Seth Frey
备注:Algorithmic Collective Action Workshop @ NeurIPS 2025. arXiv admin note: text overlap with arXiv:2509.16295
摘要:我们研究了开源社区如何通过版本控制的治理文档来描述参与和控制。我们使用包含710个项目及其成对快照的语料库,将文本解析为参与者(actor)、规则、动作和对象,然后对它们进行分组,并使用熵衡量变化的均匀性,使用丰富度衡量多样性,使用Jensen-Shannon散度衡量漂移。随着时间的推移,项目定义了更多的角色和更多的动作,并且它们分布得更均匀,而规则的组成保持稳定。这些发现表明,治理通过扩大和平衡参与类别而发展,而规范力量没有重大转变。该分析为评估未来人工智能介导的工作流程是集中还是重新分配权限提供了一个可重复的基线。
摘要:We study how open source communities describe participation and control through version controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI mediated workflows concentrate or redistribute authority.
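摘要中的三个度量(熵衡量均匀性、丰富度衡量多样性、Jensen-Shannon 散度衡量漂移)都是标准统计量,可以用几行 Python 直接实现(以下为笔者的示意实现,函数名为自拟):

```python
import math
from collections import Counter

def entropy(labels):
    """均匀性:类别分布的香农熵(自然对数),分布越均匀熵越大。"""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def richness(labels):
    """多样性:出现的不同类别数量。"""
    return len(set(labels))

def js_divergence(p, q):
    """漂移:两个离散分布之间的 Jensen-Shannon 散度。"""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

把前后两个快照中角色或动作的频率分布代入 js_divergence,即可量化治理文档在两个版本之间的漂移。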


【14】Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
标题:探索RLVR中令牌级和rollout级控制的多温度策略
链接:https://arxiv.org/abs/2510.08892

作者:Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang
摘要:强化学习在大型语言模型(LLM)的推理能力方面表现出了实质性的改进,在各个领域都表现出了显著的适用性。最近的研究发现,LLM中的令牌在推理任务中扮演着不同的角色,将它们分为高熵推理令牌和低熵知识令牌。以前的方法通常集中在限制更新以间接鼓励探索,但它们在令牌生成阶段本身并没有明确地促进探索行为。在这项工作中,我们引入了一种互补的方法,通过为不同的令牌类型应用不同的温度设置,显式地促进采样期间的探索。具体来说,我们的方法采用较高的温度进行推理令牌,以积极鼓励探索,同时保留较低的温度知识令牌,以保持事实的正确性。此外,我们系统地研究了各种多温度调度策略及其在强化学习环境中的影响。几个推理基准的实证评估表明,我们的方法显着提高LLM的推理性能。该代码可在https://github.com/zhmzm/Multi_Temperature_Verl.git上获得。
摘要:Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
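这一"按令牌类型使用不同温度"的思想可以用几行代码示意:先估计当前解码步分布的熵,高熵步(推理令牌)用较高温度鼓励探索,低熵步(知识令牌)用较低温度保守采样(以下的阈值与温度取值均为笔者的假设,并非论文的具体配置):

```python
import math

def softmax(logits, temperature):
    """数值稳定的带温度 softmax。"""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_entropy(probs):
    """分布的香农熵(自然对数)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def multi_temperature_probs(logits, t_reason=1.2, t_know=0.7, entropy_threshold=1.0):
    """多温度策略示意:先在温度 1 下估计该步分布的熵,
    高熵步视为推理令牌,用较高温度展平分布以鼓励探索;
    低熵步视为知识令牌,用较低温度锐化分布以保持事实正确性。"""
    base = softmax(logits, 1.0)
    t = t_reason if token_entropy(base) >= entropy_threshold else t_know
    return softmax(logits, t)
```

效果上,低熵步的最大概率被进一步放大(更确定),高熵步的分布被进一步展平(更多探索)。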


【15】ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review
标题:ReviewerToo:人工智能应该加入程序委员会吗?展望同行评审的未来
链接:https://arxiv.org/abs/2510.08867

作者:Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal
摘要:同行评审是科学出版的基石,但它存在不一致性,评审员主观性和可扩展性挑战。我们介绍了ReviewerToo,这是一个模块化框架,用于研究和部署人工智能辅助的同行评审,以通过系统和一致的评估来补充人类判断。ReviewerToo支持使用专门的审阅者角色和结构化的评估标准进行系统实验,并且可以部分或完全集成到真实的会议工作流程中。我们在ICLR 2025的1,963篇论文提交的精心策划的数据集上验证了ReviewerToo,其中我们使用gpt-oss-120 b模型的实验在将论文分类为接受/拒绝的任务中达到了81.8%的准确率,而普通人类评审员的准确率为83.9%。此外,ReviewerToo生成的评论被LLM评委评为比人类平均水平更高的质量,尽管仍然落后于最强的专家贡献。我们的分析突出了AI评审员擅长的领域(例如,事实核查,文献报道)以及他们在哪里挣扎(例如,评估方法的新颖性和理论贡献),强调对人类专业知识的持续需求。基于这些发现,我们提出了将人工智能整合到同行评审管道中的指导方针,展示了人工智能如何提高一致性、覆盖率和公平性,同时将复杂的评估判断留给领域专家。我们的工作为系统的、混合的同行评审系统提供了基础,该系统随着科学出版的增长而扩展。
摘要:Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieves 81.8% accuracy for the task of categorizing a paper as accept/reject compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though still trailing the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.


【16】Everyone prefers human writers, including AI
标题:每个人都更喜欢人类作家,包括人工智能
链接:https://arxiv.org/abs/2510.08831

作者:Wouter Haverals, Meredith Martin
备注:46 pages, 18 figures (5 main text + 13 supplementary), 5 tables
摘要:随着人工智能写作工具的普及,我们需要了解人类和机器如何评估文学风格,这是一个客观标准难以捉摸、判断本质上主观的领域。我们使用Raymond Queneau的《风格练习》(Exercises in Style,1947)进行了对照实验,以测量评估者的归因偏差。研究1比较了人类参与者(N=556)和AI模型(N=13)在三种条件(盲评、准确标注和反事实标注)下对Queneau原文与GPT-4生成版本文学段落的评价。研究2在一个由14×14个AI评估者与创作者组成的矩阵上测试了偏差的泛化。两项研究都揭示了系统性的亲人类归因偏差。人类显示出+13.7个百分点(pp)的偏差(Cohen's h = 0.28,95%CI:0.21-0.34),而AI模型显示出+34.3个百分点的偏差(h = 0.70,95%CI:0.65-0.76),效应强2.5倍(P<0.001)。研究2证实这种偏差存在于各种AI架构中(+25.8pp,95%CI:24.1-27.6%),表明无论创意内容由哪种人工智能创建,AI系统都会系统性地贬低被标记为"人工智能生成"的内容。我们还发现,归因标签会导致评估者颠倒评估标准,相同的特征仅因感知的作者身份不同而得到相反的评价。这表明AI模型在训练过程中吸收了人类对人工创造力的文化偏见。我们的研究代表了人类和人工评估者在审美判断中归因偏差的首次受控比较,揭示了AI系统不仅复制而且放大了这种人类倾向。
摘要:As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau's Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen's h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as "AI-generated" regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
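摘要中报告的效应量 Cohen's h 是比较两个比例的标准度量,h = 2·arcsin(√p₁) − 2·arcsin(√p₂)。下面是一个最小实现(示例中的具体比例是笔者为与 +13.7pp、h≈0.28 的量级对上而假设的数值,并非论文的原始数据):

```python
import math

def cohens_h(p1, p2):
    """两个比例之间的效应量 Cohen's h(反正弦变换之差)。"""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def bias_pp(p1, p2):
    """以百分点(pp)表示的偏好差。"""
    return (p1 - p2) * 100
```

例如,约 0.5685 对 0.4315 的偏好率相差 13.7pp,对应 h≈0.275,与摘要报告的 h=0.28 量级一致(具体比例为假设)。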


【17】McMining: Automated Discovery of Misconceptions in Student Code
标题:McMining:自动发现学生代码中的误解
链接:https://arxiv.org/abs/2510.08827

作者:Erfan Al-Hossami, Razvan Bunescu
备注:16 pages, 8 figures
摘要:在学习编码时,学生经常会对各种编程语言概念产生误解。这些误解不仅会导致错误或低效的代码,还会减缓相关概念的学习。在本文中,我们介绍了McMining,即从学生的代码样本中挖掘编程误解的任务。为了能够训练和评估McMining系统,我们开发了一个可扩展的误解基准数据集,以及大量体现这些误解的代码样本。然后,我们介绍了两种基于LLM的McMiner方法,并通过广泛的评估表明,来自Gemini、Claude和GPT系列的模型能够有效地发现学生代码中的误解。
摘要:When learning to code, students often develop misconceptions about various programming language concepts. These can not only lead to bugs or inefficient code, but also slow down the learning of related concepts. In this paper, we introduce McMining, the task of mining programming misconceptions from samples of code from a student. To enable the training and evaluation of McMining systems, we develop an extensible benchmark dataset of misconceptions together with a large set of code samples where these misconceptions are manifested. We then introduce two LLM-based McMiner approaches and through extensive evaluations show that models from the Gemini, Claude, and GPT families are effective at discovering misconceptions in student code.


【18】MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
标题:MOSAIC:任务智能科学编码的多智能体编排
链接:https://arxiv.org/abs/2510.08804

作者:Siddeshwar Raghavan, Tanwi Mallick
摘要:我们提出了MOSAIC,一个用于解决具有挑战性的科学编码任务的多智能体大语言模型(LLM)框架。与通用编码不同,科学工作流需要严谨、与深厚领域知识相互关联并结合特定领域推理的算法,以及不依赖I/O测试用例的算法迭代。许多科学问题还需要求解一系列子问题,才能得到最终的预期结果。MOSAIC被设计为一个免训练框架,带有专门设计的智能体,在学生-教师范式中进行自我反思、生成原理、编码和调试,以应对科学代码生成的挑战。这种设计有利于逐步的问题分解和有针对性的错误纠正,并且在与我们的合并上下文窗口(CCW)相结合时,可以缓解LLM在解决涉及链式子问题的复杂科学任务时的幻觉。我们在科学编码基准上评估了MOSAIC,并证明我们的专门智能体框架在准确性、鲁棒性和可解释性方面优于现有方法。
摘要:We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.


【19】Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
Link: https://arxiv.org/abs/2510.08774

Authors: Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
Abstract: Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
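The sequential-concatenation variant described above amounts to plain input construction: linked neighbors' text is placed into the encoder's input ahead of the target document, instead of averaging separately computed embeddings after the fact. A minimal sketch; the function name, separator token, and truncation policy are all hypothetical.

```python
def build_sequential_input(doc_text, neighbor_texts, sep=" [LINK] ", max_neighbors=3):
    """Sequential concatenation: prepend linked/cited neighbor texts to the
    target document so the embedding model sees structural context in-process
    rather than via post-hoc aggregation of separate embeddings."""
    context = sep.join(neighbor_texts[:max_neighbors])
    return context + sep + doc_text if context else doc_text

doc = "Graph neural networks for citation analysis."
neighbors = ["Survey of GNN architectures.", "Citation network datasets."]
print(build_sequential_input(doc, neighbors))
```

The parallel-caching alternative would instead encode each neighbor separately and expose it to the target document through the model's KV cache; the trade-off the abstract reports (robustness to noise versus long-context scaling) is between these two input regimes.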


【20】Scaling Laws for Code: A More Data-Hungry Regime
Link: https://arxiv.org/abs/2510.08702

Authors: Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che
Note: Under Review
Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training have predominantly been analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as code's strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
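The Chinchilla law referenced above is the parametric fit L(N, D) = E + A·N^(−α) + B·D^(−β), where N is parameter count and D is training tokens. A minimal sketch, using the widely quoted natural-language constants from Hoffmann et al. as placeholders (the paper refits these, and the Farseer law, on code):

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric scaling law L(N, D) = E + A/N^alpha + B/D^beta.
    Constants here are the commonly quoted natural-language fit; the
    paper's point is that the fit for code differs, implying a higher
    compute-optimal data-to-parameter ratio D/N."""
    return E + A / N**alpha + B / D**beta

# At fixed model size, more training tokens lower the predicted loss.
print(chinchilla_loss(1e9, 20e9), chinchilla_loss(1e9, 100e9))
```

At a fixed compute budget, refit exponents and coefficients shift where the optimal D/N ratio lands; the abstract's "data-hungry regime" claim is that for code this optimum sits at substantially more tokens per parameter than the NL fit implies.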


【21】Optimizing delivery for quick commerce factoring qualitative assessment of generated routes
Link: https://arxiv.org/abs/2510.08671

Authors: Milon Bhattacharya, Milan Kumar
Abstract: India's e-commerce market is projected to grow rapidly, with last-mile delivery accounting for nearly half of operational expenses. Although vehicle routing problem (VRP) based solvers are widely used for delivery planning, their effectiveness in real-world scenarios is limited due to unstructured addresses, incomplete maps, and computational constraints in distance estimation. This study proposes a framework that employs large language models (LLMs) to critique VRP-generated routes against policy-based criteria, allowing logistics operators to evaluate and prioritise more efficient delivery plans. As an illustration of our approach, we generated, annotated, and evaluated 400 cases using large language models. Our study found that open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models achieved up to 86%. The results demonstrate that LLM-based evaluation of VRP-generated routes can be an effective and scalable layer of evaluation that goes beyond conventional distance- and time-based metrics. This has implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially for developing countries like India.


【22】Formalizing Style in Personal Narratives
Link: https://arxiv.org/abs/2510.08649

Authors: Gustave Cortal (ENS Paris Saclay, LISN), Alain Finkel (ENS Paris Saclay)
Abstract: Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.


【23】Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression
Link: https://arxiv.org/abs/2510.08647

Authors: Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Chao Shen
Note: ACL 2026, under review
Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works buy reasoning efficiency either through laborious discrete prompt design or through the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains the compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes the executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of CoT. Notably, when applying UCoT to the Qwen2.5-7B-Instruct model, token usage on the GSM8K dataset is reduced by 50%, while performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in the supplementary material.
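A rough sketch of the data flow UCoT describes, with entirely hypothetical shapes: the compressor distils the reasoning context into a handful of continuous "upfront thought" vectors, which are prepended to the executor's token embeddings instead of hundreds of discrete CoT tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical model width

# Stand-ins for the two models' outputs: the small compressor emits a few
# continuous thought vectors; the executor embeds the prompt tokens.
thought_embeddings = rng.normal(size=(4, d_model))   # from the compressor
prompt_embeddings = rng.normal(size=(12, d_model))   # executor token embeddings

# The executor consumes [thoughts; prompt] and decodes a short answer, so the
# reasoning context costs 4 vectors rather than a long autoregressive CoT.
executor_input = np.concatenate([thought_embeddings, prompt_embeddings], axis=0)
print(executor_input.shape)  # (16, 16)
```

This only illustrates the interface between the two stages; the actual training (stage-one distillation of the compressor and stage-two reward-based optimization of the executor) is not sketched here.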


【24】Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories
Link: https://arxiv.org/abs/2510.08622

Authors: Francesco Dente, Fabiano Dalpiaz, Paolo Papotti
Note: 8 pages
Abstract: Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
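Once chunk-story support judgments are available (the paper obtains them with an LLM-based matcher), the two metrics reduce to simple ratios over the match matrix. A minimal sketch with a toy boolean matrix:

```python
def alignment_metrics(support):
    """Correctness and completeness from a chunk-by-story boolean matrix:
    support[i][j] is True when transcript chunk i supports user story j."""
    n_chunks = len(support)
    n_stories = len(support[0])
    # Correctness: proportion of stories supported by at least one chunk.
    supported_stories = sum(
        any(support[i][j] for i in range(n_chunks)) for j in range(n_stories))
    # Completeness: proportion of chunks covered by at least one story.
    supported_chunks = sum(any(row) for row in support)
    return supported_stories / n_stories, supported_chunks / n_chunks

# Toy example: 3 transcript chunks, 2 user stories.
support = [[True, False],
           [False, False],
           [True, True]]
print(alignment_metrics(support))  # correctness 1.0, completeness 2/3
```

Both stories trace back to the transcript (correctness 1.0), but one chunk expresses a need no story covers, which is exactly the gap completeness is designed to expose.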


【25】From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents
Link: https://arxiv.org/abs/2510.08621

Authors: Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen
Abstract: Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.


【26】MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation
Link: https://arxiv.org/abs/2510.08608

Authors: Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen
Abstract: Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.


【27】Limitations of Normalization in Attention Mechanism
Link: https://arxiv.org/abs/2508.17821

Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
Note: 10 pages, 4 figures
Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with the pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
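The flattening effect described above can be illustrated with a closed-form toy case: assume one informative token whose logit exceeds the other n-1 logits by a bounded margin m, so its softmax weight is e^m / (e^m + n - 1). This is only an illustration of the qualitative claim, not the paper's formal bounds.

```python
import math

def max_softmax_weight(n: int, margin: float = 1.0) -> float:
    """Softmax weight of a single token whose logit exceeds the other
    n-1 logits (set to 0) by `margin`: e^m / (e^m + n - 1)."""
    return math.exp(margin) / (math.exp(margin) + (n - 1))

# With bounded logits, the informative token's share of attention shrinks
# toward the uniform 1/n pattern as the candidate pool grows.
for n in (2, 16, 256):
    print(n, round(max_softmax_weight(n), 4))
```

Since logits are bounded in practice (finite weights, normalized inputs), the margin cannot grow with n, which is one intuition behind the observed drift toward uniform selection over long contexts.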


【28】Target speaker anonymization in multi-speaker recordings
Link: https://arxiv.org/abs/2510.09307

Authors: Natalia Tomashenko, Junichi Yamagishi, Xin Wang, Yun Liu, Emmanuel Vincent
Note: Submitted to ICASSP 2026
Abstract: Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for that condition. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer's voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.


【29】BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
Link: https://arxiv.org/abs/2510.08599

Authors: Yaya Sy, Christophe Cerisara, Irina Illina
Abstract: Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
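The low-rank embedding compression above can be sketched with a plain truncated SVD: factor the embedding matrix E into two thin matrices A and B so that A·B approximates E with far fewer parameters. A random matrix stands in for Whisper's embedding table here, and the actual recipe also involves feature distillation, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, rank = 1000, 64, 16

E = rng.normal(size=(vocab, d_model))          # stand-in embedding matrix
U, s, Vt = np.linalg.svd(E, full_matrices=False)

# Keep the top-`rank` singular directions: E ~= A @ B.
A = U[:, :rank] * s[:rank]                     # (vocab, rank)
B = Vt[:rank]                                  # (rank, d_model)

full_params = vocab * d_model
lowrank_params = vocab * rank + rank * d_model
print(lowrank_params / full_params)            # 0.266
```

With these toy shapes the factorized embedding uses about 27% of the original parameters; the achievable rank in practice is whatever keeps the downstream accuracy loss acceptable after distillation.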


【30】Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
Link: https://arxiv.org/abs/2510.08585

Authors: Ahmed Adel Attia, Jing Liu, Carol Espy Wilson
Abstract: Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
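The fusion described above, articulatory features as queries cross-attending over acoustic keys and values, can be sketched as plain single-head attention. All dimensions and frame counts below are hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32                   # hypothetical feature dimension
T_art, T_ac = 50, 200    # articulatory and acoustic frame counts

queries = rng.normal(size=(T_art, d))   # predicted articulatory features
keys = rng.normal(size=(T_ac, d))       # acoustic embeddings
values = rng.normal(size=(T_ac, d))

# Single-head cross-attention: the articulatory stream attends over the
# acoustic sequence, producing one fused vector per articulatory frame.
scores = queries @ keys.T / np.sqrt(d)
fused = softmax(scores, axis=-1) @ values
print(fused.shape)  # (50, 32)
```

In the full model these streams would pass through learned projections and multiple heads; the point of the sketch is only the query/key-value role assignment the abstract describes.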


Machine translations were provided by Tencent Interactive Translation and are for reference only.


[Notice] Content sourced from the web.