cs.CL: 69 papers today
Large language models (32 papers)
【1】Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Link: https://arxiv.org/abs/2509.13316
Note: 34 pages, 6 figures
Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
【2】RepIt: Representing Isolated Targets to Steer Language Models
Link: https://arxiv.org/abs/2509.13281
Abstract: While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.
【3】Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Link: https://arxiv.org/abs/2509.13244
Note: 8 pages, 3 figures
Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
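Since the headline finding above is that all Pearson correlations stay below 0.26, it may help to recall exactly what is being computed. A minimal stdlib sketch, with made-up trait scores rather than the paper's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predicted vs. ground-truth Big Five trait scores (0-1 scale).
predicted = [0.62, 0.55, 0.70, 0.48, 0.66]
truth     = [0.40, 0.71, 0.52, 0.80, 0.45]
r = pearson_r(predicted, truth)
```

A value of r below 0.26 means model predictions explain under about 7% of the variance in the validated trait scores, which is why the paper calls the alignment limited.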
【4】The Few-shot Dilemma: Over-prompting Large Language Models
Link: https://arxiv.org/abs/2509.13196
Note: Accepted for the main track of FLLM
Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
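The TF-IDF selection step can be sketched with the standard library alone. The requirement sentences, query, and scoring details below are illustrative assumptions, not the paper's actual data or pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weight dicts for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_few_shot(examples, query, k):
    """Return the k examples most similar to the query under TF-IDF cosine similarity."""
    vecs = tfidf_vectors(examples + [query])
    qvec = vecs[-1]
    ranked = sorted(enumerate(vecs[:-1]), key=lambda iv: cosine(iv[1], qvec), reverse=True)
    return [examples[i] for i, _ in ranked[:k]]

# Toy software-requirement corpus (hypothetical sentences).
requirements = [
    "The system shall encrypt user passwords at rest.",
    "The UI shall respond to clicks within 200 ms.",
    "The system shall log all failed login attempts.",
]
picked = select_few_shot(requirements, "The system shall lock accounts after failed logins.", k=2)
```

Gradually raising `k` while tracking accuracy, as the paper does, is then a one-line loop over this selector.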
【5】LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals
Link: https://arxiv.org/abs/2509.13154
Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies the Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as a spectral feature. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves an improvement of over 10 percentage points compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
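The spectral-feature step, transforming a per-layer activation signal and keeping the strongest non-DC frequency, can be sketched with a plain stdlib DFT. The synthetic signal below stands in for a real hidden-state trace:

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform; returns the magnitude of each frequency bin."""
    n = len(signal)
    mags = []
    for k in range(n):
        s = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(s))
    return mags

def strongest_non_dc(signal):
    """Index and magnitude of the strongest non-DC bin in the first half of the spectrum."""
    mags = dft_magnitudes(signal)
    half = mags[1 : len(mags) // 2 + 1]  # skip bin 0 (DC); spectrum is symmetric for real input
    k = max(range(len(half)), key=half.__getitem__) + 1
    return k, mags[k]

# Synthetic "hidden-layer signal": a constant offset plus one oscillation at bin 3.
n = 32
signal = [5.0 + math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
peak_bin, peak_mag = strongest_non_dc(signal)
```

The constant offset lands entirely in the DC bin and is discarded, so the feature captures only how the activations oscillate across layers, which is the deviation HSAD looks for.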
【6】Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
Link: https://arxiv.org/abs/2509.13127
Note: Accepted to IJCNN 2025
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.
【7】Multi-Model Synthetic Training for Mission-Critical Small Language Models
Link: https://arxiv.org/abs/2509.13047
Note: 8 pages. Accepted as a full paper to the 3rd International Conference on Foundation and Large Language Models (IEEE FLLM) 2025
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models - when fine-tuned properly - can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.
【8】SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data
Link: https://arxiv.org/abs/2509.12994
Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM's semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.
【9】Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Link: https://arxiv.org/abs/2509.12961
Note: EMNLP 2025 Findings
Abstract: Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural machine translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria -- Cultural Proximity, Cultural Neutrality, and Cultural Genuineness -- to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.
【10】Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
Link: https://arxiv.org/abs/2509.12960
Note: 12 pages, 6 tables, 8 figures
Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.
【11】Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework
Link: https://arxiv.org/abs/2509.12955
Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping the categorized phrases to their locations in the source documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
【12】Jailbreaking Large Language Models Through Content Concretization
Link: https://arxiv.org/abs/2509.12937
Note: Accepted for presentation at the Conference on Game Theory and AI for Security (GameSec) 2025
Abstract: Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier models with less constrained safety filters, then refining them through higher-tier models that process both the preliminary output and the original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5¢ per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.
【13】All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
Link: https://arxiv.org/abs/2509.12908
Note: EMNLP 2025 Main
Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
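One of the graph signals named above, path convergence, can be sketched as follows: sampled reasoning chains become edges of a directed graph, and an answer's confidence is the share of chains that converge on it. This is an illustrative reading with toy chains, not the authors' exact estimator:

```python
from collections import Counter, defaultdict

def build_reasoning_graph(chains):
    """Each chain is a list of reasoning-step strings ending in a final answer.
    Returns (edge multiset, answer counts)."""
    edges = defaultdict(int)
    answers = Counter()
    for chain in chains:
        for a, b in zip(chain, chain[1:]):
            edges[(a, b)] += 1  # repeated edges record how often paths overlap
        answers[chain[-1]] += 1
    return edges, answers

def convergence_confidence(chains):
    """Confidence of the modal answer = share of sampled paths converging on it."""
    _, answers = build_reasoning_graph(chains)
    answer, count = answers.most_common(1)[0]
    return answer, count / len(chains)

# Four hypothetical sampled chains for one question; three roads lead to "60".
chains = [
    ["q", "17*3=51", "51+9=60", "60"],
    ["q", "17*3=51", "9+51=60", "60"],
    ["q", "17*3=54", "54+9=63", "63"],
    ["q", "3*17=51", "51+9=60", "60"],
]
answer, conf = convergence_confidence(chains)
```

Centrality and path weighting would refine this by scoring how often intermediate nodes are shared, rather than only counting final answers.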
【14】Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
Link: https://arxiv.org/abs/2509.12892
Note: EMNLP 2025 Oral
Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which is limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
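The soft-masking mechanism can be read as interpolating between a causal and a bidirectional attention mask over the course of training. The matrix form and linear schedule below are our own illustrative guess at the idea, not the paper's exact formulation:

```python
def soft_mask(seq_len, alpha):
    """Interpolate between a causal mask (alpha=0) and a bidirectional mask (alpha=1).
    Entry [i][j] is the attention multiplier for query position i attending to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = 1.0 if j <= i else 0.0  # lower-triangular: no peeking ahead
            bidirectional = 1.0              # every position visible
            row.append((1 - alpha) * causal + alpha * bidirectional)
        mask.append(row)
    return mask

def alpha_schedule(step, total_steps):
    """Linear schedule: start fully causal, end fully bidirectional."""
    return min(1.0, step / total_steps)

m_start = soft_mask(3, alpha_schedule(0, 100))    # future positions masked to 0.0
m_mid = soft_mask(3, alpha_schedule(50, 100))     # future positions half-visible
m_end = soft_mask(3, alpha_schedule(100, 100))    # fully bidirectional
```

Gradually raising alpha lets the pretrained causal model adapt to the bidirectional regime of embedding training instead of switching masks abruptly.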
【15】The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Link: https://arxiv.org/abs/2509.12886
Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
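The value function over the initial hidden state can be pictured as a lightweight probe, for instance a linear head with a sigmoid, mapping h0 to expected output quality. The probe form, weights, and hidden state below are placeholders of our own, not the paper's trained estimator:

```python
import math

def value_probe(hidden_state, weights, bias):
    """Estimate expected output quality V(h0) in [0, 1] from the initial hidden state."""
    score = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid squashes the score to a probability

def difficulty(hidden_state, weights, bias):
    """Perceived difficulty = 1 - expected quality; no output tokens are generated."""
    return 1.0 - value_probe(hidden_state, weights, bias)

h0 = [0.3, -1.2, 0.8, 0.05]   # stand-in for the LLM's initial hidden state
w = [0.5, -0.4, 1.1, 0.0]     # stand-in probe weights
d = difficulty(h0, w, bias=-0.2)
```

An adaptive strategy such as Best-of-N can then spend extra samples only when d exceeds a threshold, which is how fewer generated tokens buy the same accuracy.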
【16】Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs
Link: https://arxiv.org/abs/2509.12743
Abstract: We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
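The error feedback loop with a time-out can be sketched as below. Here `toy_generator` stands in for the LLM call and the graph "database" is a plain adjacency dict; both are placeholders, not GRRAF's actual components:

```python
import time

def run_with_feedback(generate_query, graph, max_attempts=3, timeout_s=5.0):
    """Ask the code generator for a query, execute it, and feed errors back
    until it succeeds, the attempt budget runs out, or time is exhausted."""
    feedback = None
    start = time.monotonic()
    for attempt in range(max_attempts):
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("time budget exhausted")
        code = generate_query(feedback)
        try:
            scope = {"graph": graph}
            exec(code, scope)          # the generated query must set `result`
            return scope["result"]
        except Exception as exc:       # return the error message to the generator
            feedback = f"attempt {attempt} failed: {exc!r}"
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {feedback}")

# A toy generator that "fixes" its query after seeing the first error,
# mimicking an LLM re-prompted with the traceback.
def toy_generator(feedback):
    if feedback is None:
        return "result = graph['missing_node']"  # first draft fails with KeyError
    return "result = sorted(graph)"              # corrected query: list all nodes

graph = {"a": ["b"], "b": ["c"], "c": []}
nodes = run_with_feedback(toy_generator, graph)
```

Because only the short query code and any error message pass through the loop, the token cost stays flat regardless of how large the stored graph is, which matches the consistent-cost claim above.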
【17】Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
Link: https://arxiv.org/abs/2509.12672
Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attack techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack, and that suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.
【18】Chat-Driven Text Generation and Interaction for Person Retrieval
标题:用于人员检索的聊天驱动文本生成和交互
链接:https://arxiv.org/abs/2509.12662
备注:Accepted by EMNLP 2025. 13 pages, 3 figures
摘要:基于文本的人物搜索(TBPS)能够使用自然语言描述从大规模数据库中检索人物图像,在监控应用中具有重要价值。然而,一个主要的挑战在于获得高质量文本注释的劳动密集型过程,这限制了可扩展性和实际部署。为了解决这个问题,我们引入了两个补充模块:多轮文本生成(MTG)和多轮文本交互(MTI)。MTG通过与MLLM的模拟对话生成丰富的伪标签,无需人工监督即可生成细粒度和多样化的视觉描述。MTI通过动态的、基于对话的推理在推理时细化用户查询,使系统能够解释和解决模糊、不完整或模棱两可的描述--这是现实世界搜索场景中常见的特征。MTG和MTI共同形成了一个统一的无注释框架,显著提高了检索的准确性、鲁棒性和可用性。广泛的评估表明,我们的方法实现了竞争力或优越的结果,同时消除了手动字幕的需要,为TBPS系统的可扩展性和实际部署铺平了道路。
摘要:Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
【19】Don't Change My View: Ideological Bias Auditing in Large Language Models
标题:不要改变我的观点:大型语言模型中的意识形态偏见审计
链接:https://arxiv.org/abs/2509.12652
摘要:随着大型语言模型(LLM)越来越多地嵌入到数百万人使用的产品中,它们的输出可能影响个人的信念,并逐渐塑造公众舆论。如果LLM的行为可以被有意引导到特定的意识形态立场(如政治或宗教观点),那么控制这些系统的人可能对公共话语产生不成比例的影响。虽然LLM能否被可靠地引导到一致的意识形态立场,以及这种引导能否被有效防止,仍是悬而未决的问题,但关键的第一步是开发在此类引导企图发生时加以检测的方法。在这项工作中,我们将一种先前提出的统计方法适配到意识形态偏见审计这一新情境中。我们的方法继承了原始框架与模型无关的设计,不需要访问语言模型的内部。相反,它通过分析模型在与所选主题相关的提示上输出的分布变化来识别潜在的意识形态引导。这种设计使该方法特别适用于审计专有的黑盒系统。我们通过一系列实验验证了该方法,证明了其实用性及其支持对LLM行为进行独立事后审计的潜力。
摘要:As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
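The auditing statistic is model-agnostic, so the underlying idea can be illustrated with a small sketch: a permutation test that checks whether stance scores assigned to the audited model's outputs are distributionally shifted relative to a reference. The scoring function and data below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def permutation_shift_test(scores_ref, scores_audit, n_perm=2000, seed=0):
    """Permutation test on the absolute difference in mean stance scores
    between a reference model's outputs and the audited model's outputs
    on thematically related prompts. Returns an approximate p-value."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_ref, dtype=float)
    b = np.asarray(scores_audit, dtype=float)
    observed = abs(b.mean() - a.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # re-split the pooled scores at random and recompute the shift
        diff = abs(perm[len(a):].mean() - perm[:len(a)].mean())
        count += diff >= observed
    return float((count + 1) / (n_perm + 1))
```

A small p-value indicates the audited outputs are shifted relative to the reference, flagging a potential steering attempt; identical score distributions yield a p-value near 1.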
【20】PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition
标题:PAC:发音感知上下文化基于大语言模型的自动语音识别
链接:https://arxiv.org/abs/2509.12647
备注:Submitted to ICASSP 2026
摘要:本文提出了一个发音感知上下文化(PAC)框架,以解决基于大型语言模型(LLM)的自动语音识别(ASR)系统中的两个关键挑战:有效的发音建模和鲁棒的同音字识别。这两个都是原始或长尾词识别所必需的。所提出的方法采用了两个阶段的学习范式。首先,我们介绍了一种发音引导的上下文学习方法。它采用了一个交错的字形音素上下文建模策略,结合字形只有干扰,鼓励模型利用音素线索准确识别。然后,我们提出了一种带有扰动标签采样的发音判别式强化学习方法,以进一步增强模型区分上下文同音异义词的能力。在公共英语Librispeech和普通话AISHELL-1数据集上的实验结果表明,PAC:(1)与预训练的基于LLM的ASR模型相比,相对单词错误率(WER)分别降低了30.2%和53.8%,(2)与强基线相比,长尾词的偏差WER分别降低了31.8%和60.5%。
摘要:This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
【21】DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models
标题:DaSAThco:通过大型语言模型进行数据感知SAT启发式组合优化
链接:https://arxiv.org/abs/2509.12602
备注:11 pages
摘要:冲突驱动子句学习(CDCL)求解器的性能取决于内部启发式,然而SAT问题的异质性使得单一、普遍最优的配置无法实现。虽然先前的自动化方法可以为特定问题族找到专门配置,但这种特定于数据集的方法缺乏泛化性,并且需要为新的问题类型进行昂贵的重新优化。我们介绍了DaSAThco,该框架通过学习从实例特征到量身定制的启发式集成的可泛化映射来应对这一挑战,从而实现"一次训练、广泛适应"的模型。我们的框架使用大型语言模型,在系统定义的问题原型的指导下,生成一个多样化的专门启发式集成组合,随后学习自适应选择机制以形成最终映射。实验表明,DaSAThco实现了卓越的性能,最值得注意的是,它展示了强大的域外泛化能力,而非自适应方法在此表现出局限性。我们的工作为复杂可配置系统的自动算法设计建立了一条更具可扩展性和实用性的路径。
摘要:The performance of Conflict-Driven Clause Learning solvers hinges on internal heuristics, yet the heterogeneity of SAT problems makes a single, universally optimal configuration unattainable. While prior automated methods can find specialized configurations for specific problem families, this dataset-specific approach lacks generalizability and requires costly re-optimization for new problem types. We introduce DaSAThco, a framework that addresses this challenge by learning a generalizable mapping from instance features to tailored heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework uses a Large Language Model, guided by systematically defined Problem Archetypes, to generate a diverse portfolio of specialized heuristic ensembles and subsequently learns an adaptive selection mechanism to form the final mapping. Experiments show that DaSAThco achieves superior performance and, most notably, demonstrates robust out-of-domain generalization where non-adaptive methods show limitations. Our work establishes a more scalable and practical path toward automated algorithm design for complex, configurable systems.
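The learned mapping from instance features to heuristic ensembles can be caricatured as nearest-archetype routing. The centroids and ensemble names below are hypothetical placeholders; DaSAThco learns this mapping rather than hand-coding it:

```python
import numpy as np

def select_ensemble(instance_features, archetype_centroids, ensembles):
    """Route a SAT instance to the heuristic ensemble of its nearest problem
    archetype (a nearest-centroid stand-in for the learned adaptive selector)."""
    features = np.asarray(instance_features, dtype=float)
    centroids = np.asarray(archetype_centroids, dtype=float)
    distances = np.linalg.norm(centroids - features, axis=1)
    return ensembles[int(np.argmin(distances))]
```

Given feature vectors such as clause/variable ratio and clause-length statistics, the selector returns the specialized heuristic ensemble associated with the closest archetype.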
【22】The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
标题:学得越好,修剪就越聪明:通过可微分标记修剪迈向高效的视觉-语言-动作模型
链接:https://arxiv.org/abs/2509.12594
备注:Under review. Project site: this https URL
摘要:我们提出了LightVLA,一个简单而有效的面向视觉-语言-动作(VLA)模型的可微分标记修剪框架。虽然VLA模型在执行现实世界的机器人任务方面表现出了令人印象深刻的能力,但它们在资源受限平台上的部署通常受限于对大量视觉标记进行的基于注意力的计算。LightVLA通过自适应的、性能驱动的视觉标记修剪来应对这一挑战:它生成动态查询来评估视觉标记的重要性,并采用Gumbel softmax来实现可微分的标记选择。通过微调,LightVLA学会保留信息量最大的视觉标记,同时修剪对任务执行没有贡献的标记,从而同时提升效率和性能。值得注意的是,LightVLA不需要启发式的魔法数字,也不引入额外的可训练参数,使其与现代推理框架兼容。实验结果表明,在LIBERO基准的多样化任务上,LightVLA优于不同的VLA模型和现有的标记修剪方法,以大幅降低的计算开销实现了更高的成功率。具体而言,LightVLA将FLOPs和延迟分别降低了59.1%和38.2%,任务成功率提高了2.9%。同时,我们还研究了带有额外可训练参数的、基于可学习查询的标记修剪方法LightVLA*,其同样取得了令人满意的性能。我们的工作表明,在追求最优性能的过程中,LightVLA会自发地从性能驱动的角度学习修剪标记。据我们所知,LightVLA是首个将自适应视觉标记修剪应用于VLA任务、兼顾效率与性能的工作,标志着向更高效、更强大、更实用的实时机器人系统迈出了重要一步。
摘要:We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
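The differentiable selection step rests on the standard Gumbel-softmax trick, sketched here in NumPy over per-token importance scores. This is a simplification: LightVLA's dynamic queries, straight-through estimation, and training loop are not reproduced:

```python
import numpy as np

def gumbel_softmax_weights(importance_scores, tau=1.0, seed=0):
    """Soft token-selection weights: perturb importance scores with Gumbel
    noise, then apply a temperature-controlled softmax. Lower tau pushes the
    weights toward a near one-hot selection while the operation remains
    differentiable in an autograd framework (NumPy stands in here)."""
    rng = np.random.default_rng(seed)
    # Sample standard Gumbel noise via the inverse-CDF transform
    gumbel_noise = -np.log(-np.log(rng.uniform(size=np.shape(importance_scores))))
    logits = (np.asarray(importance_scores, dtype=float) + gumbel_noise) / tau
    stable = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(stable)
    return weights / weights.sum()
```

Tokens with low resulting weight are the pruning candidates; in training, the soft weights let gradients flow through the selection decision.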
【23】Yet Another Watermark for Large Language Models
标题:大型语言模型的又一个水印
链接:https://arxiv.org/abs/2509.12574
备注:this https URL
摘要:现有的大语言模型(LLM)水印方法主要通过调整标记采样预测或后处理来嵌入水印,缺乏与LLM的内在耦合,这可能显著降低生成的带标记文本的语义质量。基于训练或微调的传统水印方法可以扩展到LLM,然而,它们大多局限于白盒场景,或因LLM的海量参数而非常耗时。在本文中,我们提出了一个新的LLM水印框架:水印通过操纵LLM的内部参数嵌入到LLM中,并且可以在不访问LLM的情况下从生成的文本中提取。与相关方法相比,所提方法将水印与LLM的内部参数纠缠在一起,更好地平衡了水印的鲁棒性和不可感知性。此外,所提方法使我们能够在黑盒场景下提取水印,计算效率高,便于实际使用。实验结果也验证了该方法的可行性、优越性和实用性。本研究提供了一个不同于主流工作的新视角,或可为未来研究带来启发。
摘要:Existing watermarking methods for large language models (LLMs) mainly embed watermark by adjusting the token sampling prediction or post-processing, lacking intrinsic coupling with LLMs, which may significantly reduce the semantic quality of the generated marked texts. Traditional watermarking methods based on training or fine-tuning may be extendable to LLMs. However, most of them are limited to the white-box scenario, or very time-consuming due to the massive parameters of LLMs. In this paper, we present a new watermarking framework for LLMs, where the watermark is embedded into the LLM by manipulating the internal parameters of the LLM, and can be extracted from the generated text without accessing the LLM. Compared with related methods, the proposed method entangles the watermark with the intrinsic parameters of the LLM, which better balances the robustness and imperceptibility of the watermark. Moreover, the proposed method enables us to extract the watermark under the black-box scenario, which is computationally efficient for use. Experimental results have also verified the feasibility, superiority and practicality. This work provides a new perspective different from mainstream works, which may shed light on future research.
【24】Context-Aware Language Models for Forecasting Market Impact from Sequences of Financial News
标题:用于预测财经新闻序列市场影响的上下文感知语言模型
链接:https://arxiv.org/abs/2509.12519
备注:Preprint
摘要:金融新闻在金融市场的信息传播过程中起着关键作用,是股票价格的已知驱动力。然而,每一条新闻中的信息并不一定是自成一体的,往往需要对历史新闻报道有更广泛的理解才能准确解读。此外,识别和合并最相关的上下文信息提出了重大挑战。在这项工作中,我们探索了历史背景在大型语言模型理解金融新闻市场影响的能力中的价值。我们发现,历史背景下提供了一个一致的和显着的改进性能的方法和时间范围。为此,我们提出了一种高效的上下文化方法,该方法使用大型LM来处理主要文章,而小型LM将历史上下文编码为简洁的摘要嵌入,然后与大型模型的表示空间对齐。我们通过多个定性和定量的可解释性测试来探索模型的行为,并揭示对情境化价值的见解。最后,我们证明了历史背景在模型预测中的价值具有现实世界的应用,转化为模拟投资业绩的实质性改善。
摘要:Financial news plays a critical role in the information diffusion process in financial markets and is a known driver of stock prices. However, the information in each news article is not necessarily self-contained, often requiring a broader understanding of the historical news coverage for accurate interpretation. Further, identifying and incorporating the most relevant contextual information presents significant challenges. In this work, we explore the value of historical context in the ability of large language models to understand the market impact of financial news. We find that historical context provides a consistent and significant improvement in performance across methods and time horizons. To this end, we propose an efficient and effective contextualization method that uses a large LM to process the main article, while a small LM encodes the historical context into concise summary embeddings that are then aligned with the large model's representation space. We explore the behavior of the model through multiple qualitative and quantitative interpretability tests and reveal insights into the value of contextualization. Finally, we demonstrate that the value of historical context in model predictions has real-world applications, translating to substantial improvements in simulated investment performance.
【25】Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction
标题:审计推理细化:通过LLM指导的分步评估和纠正微调语言模型
链接:https://arxiv.org/abs/2509.12476
摘要:当直接的人类监督或高质量标签稀缺时,训练特定任务的小型推理模型颇具挑战。然而,具备推理能力的LLM会产生丰富的中间推理痕迹,这些痕迹可以被系统性地细化以创建有效的监督信号。我们提出了Reason-Refine-then-Align(R2tA),它将细化后的模型推理转化为训练特定任务推理模型的监督信号。我们的方法先在特定任务的输入上用开源基础模型生成初始推理和响应,然后细化这些痕迹,修复幻觉和不一致,形成高保真数据集。我们执行两阶段对齐:先进行监督微调(SFT),再通过直接偏好优化(DPO)使模型的中间推理与经人工验证的概念偏好对齐,随后令最终输出以对齐后的推理为条件。作为案例研究,我们将R2tA应用于评估数据库系统设计中的扩展实体关系图(EERD),这是一项结构复杂的任务,仅靠提示(prompt-only)的方法会遗漏错误或产生幻觉。我们构建了一个包含600个EERD变体(训练/测试划分为450/150)的数据集,其中诱导的错误跨越11个类别。实证评估表明,R2tA为数据稀缺领域中可扩展的LLM适配提供了一条实用且具有成本效益的路径,为教育及其他领域提供可复现的AI工具。
摘要:Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
【26】Does Language Model Understand Language?
标题:语言模型理解语言吗?
链接:https://arxiv.org/abs/2509.12459
摘要:尽管在自然语言生成和理解方面取得了进展,语言模型(LM)仍然难以处理时态、否定、语态和情态等细粒度语言现象,而这些正是有效人类交流的核心要素。在联合国可持续发展目标4(SDG 4)的背景下,语言清晰度至关重要,在教育技术中部署LM需要仔细审视。随着LM日益为辅导系统、自动评分和翻译等应用提供支持,它们与人类语言解释的一致性对于有效学习至关重要。在本研究中,我们在英语和孟加拉语这两种具有挑战性的语境下对SOTA语言模型进行了评估。为确保结构化的评估,我们引入了新的"系统环境中认知推理评估路线"(Route for Evaluation of Cognitive Inference in Systematic Environments)指南。我们提出的LUCID数据集由精心构造的英语和孟加拉语句子对组成,专门在否定、时态、语态变化等语言理解的关键方面挑战这些模型。我们使用Pearson相关、Spearman相关和平均绝对误差等标准指标,以及受语言学启发的新指标HCE准确率,评估了包括MISTRAL-SABA-24B、LLaMA-4-Scout-17B、LLaMA-3.3-70B、Gemma2-9B和Compound-Beta在内的SOTA模型的性能。HCE准确率衡量模型预测落在人类平均评分一个标准差之内的频率,从而刻画人类对语言解释可变性的容忍度。我们的研究结果表明Compound-Beta是最均衡的模型,在多样的语言条件下始终取得高相关性和低MAE。它在英语上取得了最高的Pearson相关性,并在混合语言数据上表现稳健,表明其在跨语言场景中与人类判断高度一致。
摘要:Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena such as tense, negation, voice, and modality, which are elements central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct an evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, and voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as a novel, linguistically inspired metric, the HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human-like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross-lingual scenarios.
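The HCE accuracy is defined explicitly in the abstract (the fraction of model predictions within one standard deviation of the mean human rating), so a direct implementation is possible; the per-item rating-matrix format below is an assumption about how the data is organized:

```python
import numpy as np

def hce_accuracy(predictions, human_ratings):
    """HCE accuracy: fraction of model predictions that fall within one
    standard deviation of the mean human rating for each item.

    predictions: shape (n_items,), model scores
    human_ratings: shape (n_items, n_raters), per-item human scores
    """
    predictions = np.asarray(predictions, dtype=float)
    human_ratings = np.asarray(human_ratings, dtype=float)
    mean = human_ratings.mean(axis=1)            # per-item mean human rating
    std = human_ratings.std(axis=1)              # per-item rating spread
    within = np.abs(predictions - mean) <= std   # tolerance band of +/- 1 std
    return float(within.mean())
```

Note that items where all raters agree exactly (zero spread) only count as correct if the model matches the rating exactly, which makes the metric strict on unanimous items.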
【27】MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
标题:MedFact:对中文医学文本上大型语言模型的事实核查能力进行基准测试
链接:https://arxiv.org/abs/2509.12440
摘要:大型语言模型(LLM)在医疗保健领域的部署日益增多,需要对其事实可靠性进行严格评估。然而,现有基准往往受限于狭窄的数据领域,无法刻画真实世界医疗信息的复杂性。为弥补这一关键空白,我们引入了MedFact,一个新的、具有挑战性的中文医学事实核查基准。MedFact由2,116个专家标注的实例组成,这些实例取材于多样的真实文本,涵盖13个医学专科、8种细粒度错误类型、4种写作风格和多个难度级别。其构建采用了人机混合框架:迭代的专家反馈不断改进AI驱动的多标准过滤过程,确保数据的高质量和高难度。我们对20个领先的LLM进行了全面评估,将其在真实性分类和错误定位上的表现与人类专家基线进行对比。结果表明,虽然模型通常能够判断文本是否包含错误,但精确定位错误仍然是巨大挑战,即使表现最好的模型也达不到人类水平。此外,我们的分析揭示了一种常见的"过度批评"现象,即模型倾向于将正确信息误判为错误信息,而多智能体协作和推理时扩展等高级推理技术会加剧这种倾向。通过凸显在医疗应用中部署LLM的这些关键挑战,MedFact提供了一项坚实资源,以推动开发事实上更可靠、更具医学意识的模型。
摘要:The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent ``over-criticism'' phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
【28】SENTRA: Selected-Next-Token Transformer for LLM Text Detection
标题:SENTRA:用于LLM文本检测的选定下一个令牌Transformer
链接:https://arxiv.org/abs/2509.12385
备注:EMNLP Findings 2025
摘要:LLM正变得越来越强大和广泛。因此,其滥用的可能性和现实性也在增加。在这项工作中,我们解决的问题,检测LLM生成的文本,没有明确声明。我们提出了一种新的,通用的,监督LLM文本检测器,选择下一个令牌Transformer(SENTRA)。SENTRA是一个基于transformer的编码器,它利用选择的下一个标记概率序列,并利用大量未标记数据的对比预训练。我们在24个文本域的三个流行的公共数据集上进行的实验表明,SENTRA是一种通用分类器,在域外设置中显著优于流行的基线。
摘要:LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.
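The selected-next-token-probability features that SENTRA's encoder consumes amount to a gather over a scoring LM's per-position next-token distributions, sketched here with placeholder inputs (tokenization and the scoring model itself are omitted, and the exact feature selection in the paper may differ):

```python
import numpy as np

def selected_next_token_probs(next_token_distributions, observed_token_ids):
    """For each position, select the probability the scoring LM assigned to
    the token that actually occurs next; the resulting sequence of
    probabilities is the kind of feature a detector's encoder can consume."""
    probs = np.asarray(next_token_distributions, dtype=float)  # (seq_len, vocab)
    ids = np.asarray(observed_token_ids)
    return probs[np.arange(len(ids)), ids]
```

Intuitively, LLM-generated text tends to consist of tokens the scoring LM itself finds probable, so these sequences carry a detectable signature.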
【29】LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
标题:LLM作为评判者:快速评估检索增强生成的法律文件推荐
链接:https://arxiv.org/abs/2509.12382
备注:Accepted in EARL 25: The 2nd Workshop on Evaluating and Applying Recommender Systems with Large Language Models at RecSys 2025
摘要:随着生成式AI的兴起,推荐系统中的评估瓶颈变得尤为突出:传统指标无法捕捉在法律研究等专业领域至关重要的细微质量维度。我们能信任大型语言模型充当其同类的可靠评判者吗?本文研究将LLM-as-a-Judge作为在法律场景中评估检索增强生成系统的一种有原则的方法,在这一场景中推荐质量的风险极高。 我们解决了决定实际可行性的两个基本问题:哪些评分者间可靠性指标最能刻画LLM与人类评估之间的一致性,以及我们如何在相互竞争的系统之间进行统计上合理的比较?通过系统实验,我们发现Krippendorff's alpha等传统一致性指标在AI系统评估常见的偏态分布下可能产生误导。相反,Gwet的AC2和秩相关系数是选择评判者时更稳健的指标,而带Benjamini-Hochberg校正的Wilcoxon符号秩检验提供了可靠系统比较所需的统计严谨性。 我们的研究结果指出了一条可扩展、具有成本效益的评估路径,该路径保持了法律应用所要求的精确度,将曾经的人力密集型瓶颈转变为自动化但具有统计原则的评估框架。
摘要:The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
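The statistical machinery recommended here, Wilcoxon signed-rank p-values passed through a Benjamini-Hochberg correction, can be sketched as follows. The p-values in the test are illustrative; in practice each would come from a per-comparison Wilcoxon test (e.g. `scipy.stats.wilcoxon`):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level alpha.
    Returns a boolean mask marking which null hypotheses are rejected."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha  # k * alpha / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Reject all hypotheses up to the largest k with p_(k) <= k*alpha/m
        k = int(np.nonzero(below)[0].max())
        rejected[order[:k + 1]] = True
    return rejected
```

The step-up structure is what distinguishes BH from a plain Bonferroni cutoff: a borderline p-value can still be rejected if enough smaller p-values precede it.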
【30】MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
标题:MORABLES:评估带有寓言的法学硕士抽象道德推理的基准
链接:https://arxiv.org/abs/2509.12371
备注:Accepted to EMNLP 2025 Main Conference
摘要:由于LLM在标准阅读理解基准方面表现出色,人们的注意力正在转向评估他们的复杂抽象推理和推理能力。以文学为基础的基准,其丰富的叙事和道德深度,提供了一个令人信服的框架,评估这种更深层次的理解能力。在这里,我们提出了MORABLES,一个人类验证的基准,从历史文献中提取的寓言和短篇小说。主要任务是以道德推理为目标的多项选择题,精心制作的干扰因素,挑战模型超越肤浅,提取问题的回答。为了进一步加强压力测试模型的鲁棒性,我们引入了对抗性变体,旨在暴露LLM漏洞和由于数据污染等问题而导致的捷径。我们的研究结果表明,虽然较大的模型表现优于较小的模型,但它们仍然容易受到对抗性操纵的影响,并且往往依赖于表面模式,而不是真正的道德推理。这种脆弱性导致了显著的自我矛盾,最好的模型在大约20%的情况下反驳了自己的答案,这取决于道德选择的框架。有趣的是,推理增强模型未能弥合这一差距,这表明规模-而不是推理能力-是性能的主要驱动力。
摘要:As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.
【31】LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences
标题:LLMAP:基于用户偏好的LLM辅助多目标路径规划
链接:https://arxiv.org/abs/2509.12273
摘要:大型语言模型(LLM)的兴起使自然语言驱动的路线规划成为一个新兴的研究领域,其中包含丰富的用户目标。目前的研究表现出两种不同的方法:直接路线规划使用LLM作为代理和基于图形的搜索策略。然而,LLM在前一种方法中难以处理大量的地图数据,而后者在理解自然语言偏好方面表现出有限的能力。此外,一个更关键的挑战来自全球用户的高度异质性和不可预测的时空分布。在本文中,我们介绍了一种新的LLM-Assisted路线规划(LLMAP)系统,该系统采用LLM-as-Parser来理解自然语言,识别任务,提取用户偏好并识别任务依赖关系,再加上多步图构造迭代搜索(MSGS)算法作为最佳路线查找的底层求解器。我们的多目标优化方法自适应地调整目标权重,以最大限度地提高兴趣点(POI)的质量和任务完成率,同时最大限度地减少路线距离,受三个关键约束:用户的时间限制,POI开放时间和任务的依赖性。我们使用1,000个路由提示进行了广泛的实验,这些提示在全球14个国家和27个城市中以不同的复杂性进行采样。结果表明,我们的方法实现了优越的性能,保证在多个约束。
摘要:The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, and extract user preferences and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.
【32】MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
标题:MEUV:通过互斥解锁向量在大型语言模型中实现细粒度能力激活
链接:https://arxiv.org/abs/2509.12221
备注:Under Review
摘要:大型语言模型(LLM)通过安全对齐来可靠地拒绝恶意请求,但同样的一刀切防护也阻碍了在警务、国防及其他高风险场景中的合法使用。早期的"拒绝方向"(refusal-direction)编辑可以绕过这些防护层,但它们依赖单一向量,不加区分地解锁所有危险主题,不提供语义层面的控制。我们引入互斥解锁向量(MEUV),一个轻量级框架,它将单一的整体拒绝方向分解为主题对齐、近乎正交的向量,每个向量专用于一种敏感能力。MEUV在单个epoch内以多任务目标学习,该目标融合了差分消融裕度、跨主题与正交性惩罚以及若干辅助项。在双语恶意提示基准上,MEUV在Gemma-2-2B、LLaMA-3-8B和Qwen-7B上取得不低于87%的攻击成功率,同时与最佳单方向基线相比,将跨主题泄漏最多减少90%。用中文训练的向量几乎不加改动即可迁移到英文(反之亦然),表明存在与语言无关的拒绝子空间。结果表明,细粒度、主题级的能力激活可以在极小的效用损失下实现,为在安全敏感领域受控地部署LLM铺平了道路。
摘要:Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier "refusal-direction" edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLMs deployment in security-sensitive domains.
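One of the auxiliary objectives, the orthogonality penalty among topic-aligned unlock vectors, admits a compact sketch. The exact loss form and weighting in MEUV are not given in this summary, so the squared off-diagonal cosine form below is an assumption:

```python
import numpy as np

def orthogonality_penalty(topic_vectors):
    """Penalty encouraging topic-specific unlock vectors to be mutually
    orthogonal: half the sum of squared off-diagonal cosine similarities."""
    V = np.asarray(topic_vectors, dtype=float)
    U = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    G = U @ U.T                                       # pairwise cosine Gram matrix
    off_diagonal = G - np.eye(len(V))
    return float((off_diagonal ** 2).sum() / 2)
```

The penalty is zero for perfectly orthogonal vectors and grows as topic directions overlap, which is what drives the reduction in cross-topic leakage.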
Transformer(2篇)
【1】Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
标题:塑造解释:使用仅编码器Transformer为GRPO进行语义奖励建模
链接:https://arxiv.org/abs/2509.13081
摘要:虽然大型语言模型(LLM)擅长生成类人文本,但将其输出与教学合理性等复杂的定性目标对齐仍是重大挑战。标准的强化学习技术通常依赖缓慢而昂贵的LLM-as-a-judge评估,或依赖ROUGE等脆弱的、基于关键词的指标,而这些指标无法捕捉高质量解释的语义本质。在这项工作中,我们在组相对策略优化(GRPO)框架内提出了一种新的奖励塑造方法。我们的核心贡献是使用一个小型、高效的仅编码器Transformer作为语义奖励模型。该模型基于生成的解释与真实参考之间的余弦相似度提供密集、语义丰富的奖励信号,引导策略生成不仅事实正确,而且在结构和概念上与专家推理一致的解释。我们将该方法应用于训练面向意大利医学院入学考试的模型,训练流程遵循标准的领域自适应持续预训练(CPT)和监督微调(SFT)。我们的结果表明,与强SFT基线相比,采用我们提出的语义奖励的GRPO显著提升了解释的忠实性和清晰度,展示了在复杂生成任务中使用轻量级编码器模型进行细致奖励塑造的能力。
摘要:While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
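The reward design described above can be sketched in a few lines of NumPy: a cosine-similarity semantic reward, plus the group-relative standardization GRPO applies within each sampled group. The embeddings here are placeholders for the encoder-only reward model's outputs:

```python
import numpy as np

def semantic_reward(generated_emb, reference_emb):
    """Reward = cosine similarity between a generated explanation's embedding
    and the ground-truth reference embedding (both from the reward encoder)."""
    g = np.asarray(generated_emb, dtype=float)
    r = np.asarray(reference_emb, dtype=float)
    return float(g @ r / (np.linalg.norm(g) * np.linalg.norm(r)))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group of
    completions, so each completion is scored relative to its peers."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because the advantage is relative to the group, the dense cosine signal needs no absolute calibration: only the ranking of explanations within each group matters.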
【2】A comparison of pipelines for the translation of a low resource language based on transformers
标题:基于转换器的低资源语言翻译管道的比较
链接:https://arxiv.org/abs/2509.12514
备注:9 pages, 4 figures
摘要:这项工作比较了用于训练基于Transformer的神经网络的三条管道,以构建班巴拉语(Bambara)的机器翻译器,班巴拉语是非洲约14,188,850人使用的曼德(Mandé)语言。第一条管道训练一个简单的Transformer将法语句子翻译成班巴拉语。第二条管道使用仅解码器架构微调LLaMA3(3B-8B)指令模型,完成法语到班巴拉语的翻译。前两条管道的模型使用不同的超参数组合进行训练,以提高BLEU和chrF分数,并在测试句子和官方班巴拉语基准上进行评估。第三条管道使用语言蒸馏和学生-教师双神经网络将班巴拉语集成到预训练的LaBSE模型中,该模型提供与语言无关的嵌入,然后在LaBSE上应用BERT扩展以生成翻译。所有管道都在Dokotoro(医疗)和Bayelemagaba(混合领域)上进行了测试。结果表明,第一条管道虽然更简单,但达到了最佳的翻译准确率(在Bayelemagaba上为10% BLEU、21% chrF),与低资源翻译结果一致。在为这项工作创建的Yiri数据集上,它实现了33.81% BLEU和41% chrF。基于指令的模型在单个数据集上的表现优于在聚合集合上的表现,这表明它们能更有效地捕捉特定于数据集的模式。
摘要:This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.
GAN|生成相关(1篇)
【1】InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering
标题:InfoGain-RAG:通过基于文档信息增益的重排序和过滤增强检索增强生成
链接:https://arxiv.org/abs/2509.12765
备注:EMNLP'25 Oral Presentation. Contact: benchen4395@gmail.com
摘要:检索增强生成(RAG)已成为解决大型语言模型(LLM)关键局限(如幻觉、过时知识和缺乏参考)的一种有前途的方法。然而,目前的RAG框架往往难以识别检索到的文档是否对答案生成有实质贡献。这一缺点使得难以过滤掉不相关甚至误导性的内容,从而显著影响最终性能。在本文中,我们提出了文档信息增益(DIG),一种旨在量化检索文档对正确答案生成贡献的新指标。DIG通过计算有无该文档增强时LLM生成置信度的差异来衡量文档的价值。此外,我们引入InfoGain-RAG,一个利用DIG分数训练专门重排序器的框架,该重排序器从精确区分和准确排序的角度对每个检索到的文档进行优先级排序。这种方法可以有效过滤不相关文档,并选择最有价值的文档以获得更好的答案生成。在各种模型和基准上的广泛实验表明,InfoGain-RAG在单检索器和多检索器范式下都能显著优于现有方法。特别是在NaturalQA上,它在精确匹配准确率上分别比朴素RAG、自反思RAG和现代基于排序的RAG提高了17.9%、4.5%和12.5%,甚至在先进的专有模型GPT-4o上于所有数据集平均提升15.3%。这些结果证明了InfoGain-RAG的可行性,它可以在多种应用中为RAG提供可靠的解决方案。
摘要:Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lacking reference. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts the final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document's value by computing the difference of LLM's generation confidence with and without the document augmented. Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which prioritizes each retrieved document from exact distinguishing and accurate sorting perspectives. This approach can effectively filter out irrelevant documents and select the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG can significantly outperform existing approaches, on both single and multiple retrievers paradigm. Specifically on NaturalQA, it achieves the improvements of 17.9%, 4.5%, 12.5% in exact match accuracy against naive RAG, self-reflective RAG and modern ranking-based RAG respectively, and even an average of 15.3% increment on advanced proprietary model GPT-4o across all datasets. These results demonstrate the feasibility of InfoGain-RAG as it can offer a reliable solution for RAG in multiple applications.
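摘要中对DIG的定义(有无文档增强时生成置信度之差)可以用纯Python最小化地示意如下(置信度的聚合方式为假设,对数概率为虚构输入,并非论文实现):

```python
import math

def answer_confidence(token_logprobs):
    # Confidence of the LLM in the gold answer: mean per-token
    # log-probability mapped back to a probability in (0, 1].
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def document_information_gain(logprobs_with_doc, logprobs_without_doc):
    # DIG as described in the abstract: the difference in generation
    # confidence with vs. without the document in the context.
    return (answer_confidence(logprobs_with_doc) -
            answer_confidence(logprobs_without_doc))

def rerank_by_dig(docs_with_logprobs, logprobs_without_doc):
    # Order retrieved documents by information gain, highest first;
    # documents with non-positive DIG can then be filtered out.
    scored = [(doc, document_information_gain(lp, logprobs_without_doc))
              for doc, lp in docs_with_logprobs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

DIG为正说明该文档确实提高了模型对正确答案的置信度,可据此过滤与排序。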
QA|VQA|问答|对话(2篇)
【1】HistoryBankQA: Multilingual Temporal Question Answering on Historical Events
标题:HistoryBankQA:历史事件的多语言时态问题解答
链接:https://arxiv.org/abs/2509.12720
摘要:对历史事件的时间推理是事件提取、历史实体链接、时间问答、时间线摘要、时间事件聚类和时间自然语言推理等自然语言处理任务的关键技能。然而,对大型语言模型(LLM)时间推理能力进行基准测试的工作相当有限。现有的时间推理数据集规模有限,缺乏多语言覆盖,且更多关注当代事件。为了解决这些限制,我们提出了HistoryBank,一个从维基百科时间线页面和文章信息框中提取的、包含1000多万历史事件的多语言数据库。我们的数据库以10种语言在历史深度和语言广度上都提供了前所未有的覆盖范围。此外,我们构建了一个涵盖所有语言的全面时间推理问答基准。该基准涵盖6类时间QA推理任务,我们评估了一套流行的语言模型(LLaMA-3-8B、Mistral-7B、Gemma-2-9b、Qwen3-8B、GPT4o)在这些任务上的性能。正如预期,GPT4o在所有答案类型和语言中表现最好;Gemma-2优于其他小型语言模型。我们的工作旨在为推进对历史事件的多语言、时间感知自然语言理解提供全面的资源。为了促进进一步研究,我们将在本文被接收后公开我们的代码和数据集。
摘要:Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.
【2】MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
标题:MORQA:医学开放式问答评估指标的基准测试
链接:https://arxiv.org/abs/2509.12405
备注:9 pages, 8 tables
摘要:由于对准确性、相关性和特定领域专业知识的严格要求,评估医疗领域的自然语言生成(NLG)系统面临着独特的挑战。传统的自动评估指标(如BLEU、ROUGE和BERTScore)通常无法区分高质量的输出,特别是考虑到医疗问答(QA)任务的开放性,其中可能存在多个有效的回答。在这项工作中,我们介绍了MORQA(医学开放式问答),一个新的多语言基准,旨在在三个英文和中文的医学视觉与文本QA数据集上评估NLG评估指标的有效性。与之前的资源不同,我们的数据集包含由医学专业人员撰写的2-4个以上黄金标准答案,并为三个英文和中文子集提供专家人工评分。我们对传统指标和基于大型语言模型(LLM)的评估器(如GPT-4和Gemini)进行了基准测试,发现基于LLM的方法在与专家判断的相关性上显著优于传统指标。我们进一步分析了推动这种改进的因素,包括LLM对语义细微差别的敏感性和对参考答案之间差异的鲁棒性。我们的结果提供了医疗领域NLG评估的首个全面的多语言定性研究,突出了对与人类判断对齐的评估方法的需求。所有数据集和注释都将公开发布,以支持未来的研究。
摘要:Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs' sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.
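此类基准的核心计算是自动指标得分与专家评分之间的等级相关。下面是一个忽略并列名次的Spearman相关的纯Python示意(仅为说明,并非论文的官方实现):

```python
def spearman(metric_scores, expert_ratings):
    # Spearman rank correlation between an automatic metric's scores
    # and expert human ratings (ties ignored for simplicity).
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(metric_scores), ranks(expert_ratings)
    n = len(rx)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var
```

相关系数越接近1,说明该指标的排序越贴近专家判断。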
语义分析(1篇)
【1】Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
标题:使用抽象语义监督分组损失进行增强抽象表示的对比学习
链接:https://arxiv.org/abs/2509.12771
摘要:人类可以将图像识别为一般概念的实例,而不仅仅是识别其中的对象及其关系。在本文中,我们研究:1)VLM在多大程度上具备这种概念抽象能力;2)在图像中编码这类高层概念信息、从而使最终的VLM模型(CLEAR GLASS模型)在更大程度上具备这种能力的策略。为此,我们引入了一个分组图像-标题数据集(MAGIC),它由若干组图像标题组成,每组配有一组相关图像和更高层的概念标签。我们使用一种新颖的对比损失技术,促使模型在组内每个图像(标题)的表示中编码图像-标题组所有成员共有的信息。我们的主要贡献是一个分组对比损失函数,它基于文本-图像对比组(外部对比损失),并辅以一个度量组内图像-标题实例之间距离的内部损失。我们的训练方法使CLEAR GLASS模型将概念抽象能力作为一种涌现能力获得,因为模型并未接触与每组相关联的高层概念。相反,训练迫使模型为每个图像-标题组创建一个语义表示,使其在潜在语义空间中更接近高层概念的语义表示。我们的实验表明,这种训练方法得到的模型在抽象概念识别方面相比SOTA模型有所改进。
摘要:Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate 1. The extent to which VLMs have this concept abstraction capacity, and 2. Strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM model (CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.
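上述"外部对比损失+组内内部损失"的组合可以粗略示意如下(纯Python;权重lam与具体距离形式均为假设,并非论文原式):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def grouped_contrastive_loss(pairs, groups, lam=0.5):
    # pairs: list of (image_vec, caption_vec); groups: lists of indices
    # belonging to the same image-caption group.
    # Outer term: InfoNCE pulling each image toward its own caption
    # and away from the other captions in the batch.
    outer = 0.0
    for i, (img, _) in enumerate(pairs):
        sims = [math.exp(cosine(img, cap)) for _, cap in pairs]
        outer -= math.log(sims[i] / sum(sims))
    # Inner term: mean (1 - cosine) distance between image-caption
    # instances of the same group, pushing the group toward a shared
    # semantic representation.
    inner, count = 0.0, 0
    for group in groups:
        for a in group:
            for b in group:
                if a < b:
                    inner += 1 - cosine(pairs[a][0], pairs[b][1])
                    count += 1
    return outer / len(pairs) + lam * inner / max(count, 1)
```

图文对齐的批次应获得比打乱配对更低的损失。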
Graph|知识图谱|Knowledge(1篇)
【1】LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
标题:LEAF:具有教师对齐表示的文本嵌入模型知识蒸馏
链接:https://arxiv.org/abs/2509.12539
备注:17 pages, 12 figures
摘要:我们提出了LEAF("轻量级嵌入对齐框架"),一个文本嵌入模型的知识蒸馏框架。其一个关键特征是,蒸馏得到的leaf模型与其教师模型对齐。在信息检索场景中,这允许灵活的非对称架构:文档使用较大的教师模型编码,而查询可以使用较小的leaf模型。我们还表明,只要教师模型具备MRL和对输出量化的鲁棒性,leaf模型无需显式训练即可自动继承这些属性。为了展示我们框架的能力,我们发布了leaf-ir,一个使用LEAF训练的、面向信息检索的23M参数文本嵌入模型,它在BEIR上创造了新的最先进水平(SOTA),在该基准的公共排行榜上位列同规模模型第一。在非对称模式下运行时,其检索性能进一步提高。我们的方案并不局限于信息检索场景:我们通过合成多任务的leaf-mt模型展示了其更广泛的适用性,该模型同样创造了新的SOTA,在公共MTEB v2(英语)排行榜上位列同规模模型第一。LEAF适用于黑盒模型;与其他嵌入模型训练框架相比,它既不需要相关性判定(judgments)也不需要困难负例(hard negatives),并且可以使用小批量进行训练。因此,我们框架对数据集和训练基础设施的要求并不高。我们在宽松的Apache 2.0许可证下公开提供我们的模型。
摘要:We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.
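摘要所述的非对称检索模式可以示意如下:文档离线用教师模型编码,查询在线用对齐后的leaf模型编码(`leaf_encode`为假设的占位函数,向量为虚构输入):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def asymmetric_search(query, doc_index, leaf_encode):
    # Documents were embedded offline with the large teacher model
    # (doc_index: doc id -> vector); the query is embedded online with
    # the small leaf model. Because LEAF aligns the two embedding
    # spaces, the similarities remain comparable. `leaf_encode` is a
    # hypothetical stand-in for the distilled model.
    q = leaf_encode(query)
    return max(doc_index, key=lambda d: cosine(q, doc_index[d]))
```

这正是"大模型离线编码文档、小模型在线服务查询"这一部署方式成立的前提:两个嵌入空间对齐。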
摘要|信息提取(1篇)
【1】ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
标题:ReSum:通过上下文总结解锁长期搜索智能
链接:https://arxiv.org/abs/2509.13313
备注:this https URL
摘要:基于大型语言模型(LLM)的Web代理在知识密集型任务上表现出强大的性能,但受到ReAct等范式中上下文窗口限制的阻碍。涉及多个实体、相互交织的关系和高度不确定性的复杂查询需要大量的搜索周期,在获得完整的解决方案之前,这些搜索周期会迅速耗尽上下文预算。为了克服这一挑战,我们引入了ReSum,这是一种通过定期上下文摘要实现无限期探索的新范式。ReSum将不断增长的交互历史转换为紧凑的推理状态,在绕过上下文约束的同时保持对先前发现的感知。为了适配这一范式,我们提出了ReSum-GRPO,将GRPO与分段轨迹训练和优势广播相结合,使代理熟悉以摘要为条件的推理。在三个基准上对不同规模的Web代理进行的广泛实验表明,ReSum比ReAct平均绝对提高了4.5%,在ReSum-GRPO训练后进一步提升至多8.2%。值得注意的是,仅用1K个训练样本,我们的WebResummer-30B(WebSailor-30B经ReSum-GRPO训练的版本)在BrowseComp-zh上达到33.3% Pass@1,在BrowseComp-en上达到18.3%,超过了现有的开源Web代理。
摘要:Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5\% over ReAct, with further gains of up to 8.2\% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3\% Pass@1 on BrowseComp-zh and 18.3\% on BrowseComp-en, surpassing existing open-source web agents.
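ReSum"定期把交互历史压缩为紧凑推理状态"的循环可以示意如下(`agent_step`与`summarize`均为假设的占位调用,并非论文实现):

```python
def resum_rollout(question, agent_step, summarize, budget=4096, max_steps=50):
    # Sketch of the ReSum loop: when the interaction history nears the
    # context budget, compress it into a compact reasoning state and
    # continue from that state instead of the full history.
    # `agent_step` and `summarize` are hypothetical callables standing
    # in for the policy LLM and the summary tool.
    history = [f"question: {question}"]
    for _ in range(max_steps):
        if sum(len(h) for h in history) > budget:
            history = [f"state: {summarize(history)}",
                       f"question: {question}"]
        action = agent_step(history)
        history.append(action)
        if action.startswith("answer:"):
            return action, history
    return None, history
```

与ReAct把全部历史塞进上下文不同,这里上下文长度被摘要步骤周期性地拉回预算之内。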
推理|分析|理解|解释(3篇)
【1】WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
标题:WebResearcher:在长期代理中释放无限推理能力
链接:https://arxiv.org/abs/2509.13309
备注:this https URL
摘要:深度研究系统的最新进展已经证明了人工智能代理自主发现并综合外部来源知识的潜力。在本文中,我们介绍了WebResearcher,一个通过两个关键组件构建此类代理的新框架:(1)WebResearcher,一种迭代的深度研究范式,将深度研究重新表述为马尔可夫决策过程,其中代理定期将发现整合为不断演进的报告,同时保持聚焦的工作空间,克服困扰现有单上下文方法的上下文窒息和噪声污染;(2)WebFrontier,一个可扩展的数据合成引擎,通过工具增强的复杂度升级生成高质量训练数据,从而能够系统性地创建弥合被动知识回忆与主动知识构建之间差距的研究任务。值得注意的是,我们发现来自我们范式的训练数据即使对传统单上下文方法也能显著增强工具使用能力。此外,我们的范式通过并行思维自然扩展,实现并发多代理探索以获得更全面的结论。在6个具有挑战性的基准上的广泛实验表明,WebResearcher实现了最先进的性能,甚至超越了前沿专有系统。
摘要:Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
【2】ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
标题:ChartGaze:通过眼动跟踪引导的注意力细化增强LVLM中的图表理解
链接:https://arxiv.org/abs/2509.13282
备注:EMNLP 2025
摘要:图表是沟通和表达信息的重要视觉媒介。虽然大型视觉语言模型(LVLM)在图表问答(CQA)方面取得了进展,但这项任务仍然具有挑战性,特别是当模型关注图表的不相关区域时。在这项工作中,我们提出了ChartGaze,一个新的眼动追踪数据集,用于捕获图表推理任务中的人类注视模式。通过对人类和模型注意力的系统比较,我们发现LVLM经常偏离人类注视,导致可解释性和准确性降低。为了解决这个问题,我们提出了一种凝视引导的注意力细化方法,使图像-文本注意力与人类注视点对齐。我们的方法同时提高了答案准确率和注意力对齐度,在多个模型上获得了高达2.56个百分点的提升。这些结果展示了结合人类注视以提高以图表为中心的LVLM的推理质量和可解释性的前景。
摘要:Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
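摘要未给出具体损失形式;下面给出一种符合其思路的假设性示意:用KL散度把模型的图文注意力分布拉向归一化后的人类注视分布(纯属草图,并非论文实现):

```python
import math

def gaze_alignment_loss(attention_map, gaze_map):
    # One plausible refinement objective in the spirit of the abstract:
    # KL divergence from the normalized human-fixation distribution to
    # the model's image-text attention distribution over chart regions.
    # Inputs are non-negative weights per region; both are normalized
    # to probability distributions first.
    a_total, g_total = sum(attention_map), sum(gaze_map)
    a = [x / a_total for x in attention_map]
    g = [x / g_total for x in gaze_map]
    return sum(gi * math.log(gi / ai)
               for gi, ai in zip(g, a) if gi > 0)
```

损失为0表示注意力与注视完全一致;注意力偏向人类不注视的区域时,损失为正。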
【3】Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
标题:像素中的幽默:对大型多模态模型在线漫画理解能力的基准测试
链接:https://arxiv.org/abs/2509.12248
备注:27 pages, 8 figures, EMNLP 2025
摘要:理解幽默是社交智能的核心方面,但对于大型多模态模型(LMM)来说仍然是一个重大挑战。我们介绍PixelHumor,这是一个包含2,800个带注释多格漫画的基准数据集,旨在评估LMM解释多模态幽默和识别叙事序列的能力。使用最先进的LMM进行的实验揭示了巨大的差距:例如,顶级模型在分格排序中仅达到61%的准确率,远低于人类表现。这强调了当前模型在整合视觉和文本线索以实现连贯叙事和幽默理解方面的关键局限。通过提供一个评估多模态语境和叙事推理的严格框架,PixelHumor旨在推动LMM的发展,使其更好地参与自然的、具有社交意识的互动。
摘要:Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.
半/弱/无监督|不确定性(1篇)
【1】Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations
标题:通过不确定性估计缓解情感支持对话中的策略偏好偏差
链接:https://arxiv.org/abs/2509.12661
摘要:情感支持对话(ESC)旨在通过共情对话减轻痛苦,但由于策略规划的准确性较低,大型语言模型(LLM)在提供有效的ESC方面面临持续挑战。此外,模型对特定策略存在相当大的偏好。先前使用微调策略规划器的方法已显示出减少这种偏差的潜力,但LLM中偏好偏差的根本原因尚未得到充分研究。为了解决这些问题,我们首先通过识别LLM在策略规划中的知识边界,揭示偏差的根本原因。然后,我们提出了一种通过具有双重奖励函数的强化学习来减轻偏差的方法,该方法根据知识边界,通过每个区域的准确性和基于熵的置信度来优化策略规划。在具有多个LLM主干的ESCov和ExTES数据集上的实验表明,我们的方法优于基线,证实了我们方法的有效性。
摘要:Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not well been studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.
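双重奖励的具体组合方式摘要未给出;下面是一个符合其描述的假设性示意:在知识边界内奖励基于熵的置信度,在边界外惩罚过度自信(权重w与组合形式均为假设):

```python
import math

def entropy_confidence(strategy_probs):
    # 1 minus normalized Shannon entropy of the predicted strategy
    # distribution: peaked (confident) predictions score near 1,
    # uniform (uncertain) ones near 0.
    h = -sum(p * math.log(p) for p in strategy_probs if p > 0)
    return 1 - h / math.log(len(strategy_probs))

def dual_reward(correct, strategy_probs, inside_knowledge_boundary, w=0.5):
    # Hedged sketch of a dual reward in the spirit of the abstract:
    # inside the model's knowledge boundary, confidence is rewarded;
    # outside it, confidence is penalized to discourage a biased
    # preference for familiar strategies.
    accuracy = 1.0 if correct else 0.0
    confidence = entropy_confidence(strategy_probs)
    return (accuracy + w * confidence if inside_knowledge_boundary
            else accuracy - w * confidence)
```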
Zero/Few/One-Shot|迁移|自适应(2篇)
【1】MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
标题:MAGIC增强的关键词提示:使用CLIP模型实现Zero-Shot音频字幕
链接:https://arxiv.org/abs/2509.12591
备注:Accepted in The 26th International Conference on Web Information Systems Engineering (WISE), scheduled for 15-17 December 2025 in Marrakech, Morocco
摘要:自动音频字幕(AAC)为音频片段生成字幕,但与图像字幕相比,由于数据集有限而面临挑战。为了克服这一点,我们提出了利用预训练模型的zero-shot AAC系统,从而消除了对大量训练的需要。我们的方法使用预训练的音频CLIP模型来提取听觉特征并生成结构化提示,该结构化提示在字幕生成中指导大型语言模型(LLM)。与传统的贪婪解码不同,我们的方法通过音频CLIP模型细化令牌选择,确保与音频内容对齐。实验结果表明,使用WavCaps模型的MAGIC搜索,NLG平均得分(从4.7到7.3)提高了35%。性能受音频文本匹配模型和关键字选择的影响很大,使用单个关键字提示可以获得最佳结果,当不使用关键字列表时,性能下降50%。
摘要:Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.
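与贪婪解码的区别可以用一步解码示意如下:候选词的LM概率被音频-文本匹配分数重新加权(`audio_match`为假设的CLIP式打分函数;alpha=0退化为贪婪解码):

```python
def magic_step(candidates, audio_match, alpha=0.5):
    # One decoding step of a MAGIC-style search: each candidate next
    # token is scored by its LM probability re-weighted by how well the
    # continuation matches the audio clip under a CLIP-like audio-text
    # model. `audio_match` is a hypothetical callable returning a
    # similarity in [0, 1]; the mixing weight alpha is an assumption.
    best_token, best_score = None, float("-inf")
    for token, lm_prob in candidates:
        score = (1 - alpha) * lm_prob + alpha * audio_match(token)
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```

即便LM更偏好某个词,与音频内容更匹配的候选仍可能胜出。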
【2】The Adaptation Paradox: Agency vs. Mimicry in Companion Chatbots
标题:适应悖论:伴侣聊天机器人中的能动性与模仿
链接:https://arxiv.org/abs/2509.12525
备注:31 pages, 17 figures, 2 tables. Submitted to CHI 2026 (under review). Preregistered: this https URL ; Code/Materials: this https URL
摘要:生成式人工智能为越来越多的伴侣聊天机器人提供了动力,但培养真正联系的原则仍未确定。我们测试两种路线:可见的用户创作与隐蔽的语言风格模仿。在预注册的3×2实验(N = 162)中,我们操纵用户控制的化身生成(无、预制、用户生成)和语言风格匹配(LSM)(静态与自适应)。生成化身提升了融洽关系($\omega^2$ = 0.040,p = 0.013),而自适应LSM在个性化和满意度方面表现不如静态风格(d = 0.35,p = 0.009),并且矛盾地被认为适应性较差(t = 3.07,p = 0.003,d = 0.48)。我们称之为适应悖论:当同步性被感知为不连贯、破坏人格稳定时,它会侵蚀联结。为了解释这一点,我们提出了一个稳定性与易读性的解释:可见的创作促进自然互动,而隐蔽的模仿则有不连贯的风险。我们的研究结果表明,设计师应优先考虑清晰的、用户驱动的个性化并限制风格变化,而不是依赖不透明的模仿。
摘要:Generative AI powers a growing wave of companion chatbots, yet principles for fostering genuine connection remain unsettled. We test two routes: visible user authorship versus covert language-style mimicry. In a preregistered 3x2 experiment (N = 162), we manipulated user-controlled avatar generation (none, premade, user-generated) and Language Style Matching (LSM) (static vs. adaptive). Generating an avatar boosted rapport ($\omega^2$ = .040, p = .013), whereas adaptive LSM underperformed static style on personalization and satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive (t = 3.07, p = .003, d = 0.48). We term this an Adaptation Paradox: synchrony erodes connection when perceived as incoherent, destabilizing persona. To explain, we propose a stability-and-legibility account: visible authorship fosters natural interaction, while covert mimicry risks incoherence. Our findings suggest designers should prioritize legible, user-driven personalization and limit stylistic shifts rather than rely on opaque mimicry.
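LSM在心理语言学文献中的标准公式可示意如下(论文的具体实现可能不同):按功能词类别计算1 - |u_c - b_c| / (u_c + b_c + eps)并取平均:

```python
def lsm_score(user_rates, bot_rates, eps=1e-4):
    # Standard Language Style Matching score from the psycholinguistics
    # literature (the paper's exact implementation may differ): per
    # function-word category c, 1 - |u_c - b_c| / (u_c + b_c + eps),
    # averaged over categories. Rates are the fractions of words
    # falling in each category for user and bot messages.
    cats = user_rates.keys()
    return sum(1 - abs(user_rates[c] - bot_rates[c]) /
               (user_rates[c] + bot_rates[c] + eps)
               for c in cats) / len(cats)
```

"静态LSM"对应固定的bot_rates;"自适应LSM"则随用户逐轮调整bot_rates以抬高该分数。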
检索(1篇)
【1】Topic Coverage-based Demonstration Retrieval for In-Context Learning
标题:基于主题覆盖率的演示检索用于上下文学习
链接:https://arxiv.org/abs/2509.12451
备注:EMNLP 2025 Main
摘要:情境学习的有效性在很大程度上依赖于选择能为给定测试输入提供所有必要信息的演示。为了实现这一目标,识别并覆盖细粒度的知识需求至关重要。然而,现有方法往往仅基于嵌入相似性或生成概率检索演示,导致不相关或冗余的示例。在本文中,我们提出了TopicK,一个基于主题覆盖的检索框架,它选择能够全面覆盖与测试输入和模型都相关的主题级知识的演示。具体来说,TopicK估计输入所需的主题,并评估模型对这些主题的知识。然后,TopicK迭代地选择引入尚未覆盖的所需主题、且模型在其上主题知识较低的演示。我们通过在各种数据集以及开源和闭源LLM上的广泛实验验证了TopicK的有效性。我们的源代码可在https://github.com/WonbinKweon/TopicK_EMNLP2025上获得。
摘要:The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
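摘要描述的迭代选择本质上是一个加权贪婪覆盖过程,可以用纯Python示意如下(打分与权重形式为假设,并非论文原式):

```python
def select_demonstrations(required_topics, demos, model_knowledge, k=3):
    # Greedy sketch of the selection loop: iteratively pick the
    # demonstration that adds the most not-yet-covered required topics,
    # weighting each topic by how little the model knows about it.
    # required_topics: topic -> requirement weight for the test input
    # demos: demo id -> set of topics it covers
    # model_knowledge: topic -> estimated knowledge in [0, 1]
    covered, chosen = set(), []
    def gain(demo):
        return sum(required_topics[t] * (1 - model_knowledge.get(t, 0.0))
                   for t in demos[demo]
                   if t in required_topics and t not in covered)
    for _ in range(k):
        remaining = [d for d in demos if d not in chosen]
        if not remaining:
            break
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best)
        covered |= demos[best]
    return chosen
```

模型已熟知的主题贡献的增益小,因此优先选中覆盖模型薄弱主题的演示。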
其他神经网络|深度学习|模型|建模(4篇)
【1】WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
标题:WebSailor-V2:通过合成数据和可扩展的强化学习弥合与专有代理的鸿沟
链接:https://arxiv.org/abs/2509.13305
备注:this https URL
摘要:超越人类的认知局限是LLM训练的关键前沿。像DeepResearch这样的专有代理系统已经在极其复杂的信息搜索基准(例如BrowseComp)上展示了超人的能力,这是以前无法实现的壮举。我们认为,它们的成功取决于开源模型中缺乏的一种复杂推理模式:在浏览大量信息时系统性降低极端不确定性的能力。基于这一见解,我们引入了WebSailor,一种旨在灌输这一关键能力的完整后训练方法。我们的方法包括:通过结构化采样和信息混淆生成新颖的高不确定性任务、RFT冷启动,以及一种高效的代理RL训练算法,即重复采样策略优化(DUPO)。通过这一集成管道,WebSailor在复杂的信息搜索任务中显著优于所有开源代理,性能与专有代理相当,缩小了能力差距。
摘要:Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
【2】A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression
标题:用于预测和治疗致癌突变进展的新型回归神经网络框架
链接:https://arxiv.org/abs/2509.12732
备注:12 pages, 11 figures, work originally done in 2022/2023 and was awarded as one of the Regeneron Science Talent Search Finalists in 2022
摘要:尽管医学取得了重大进步,癌症仍然是第二大死因,在美国每年导致超过60万人死亡。通路分析这一新兴领域很有前景,但仍依赖于人工获得的湿实验室数据,而这类数据获取非常耗时。这项工作为基于人工智能(AI)的通路分析提出了一个高效、有效的端到端框架,可以预测癌症的严重程度和突变进展,从而推荐可能的治疗方法。所提出的技术涉及时间序列机器学习模型与通路分析的新颖组合。首先,从癌症基因组图谱(TCGA)数据库中分离突变序列。然后,采用一种新的预处理算法,根据突变频率过滤关键突变。这些数据被输入到预测癌症严重程度的递归神经网络(RNN)中。接着,该模型以概率方式使用RNN预测、来自预处理算法的信息和多个药物靶标数据库来预测未来的突变并推荐可能的治疗方法。该框架取得了稳健的结果,受试者工作特征(ROC)曲线(一个关键的统计指标)准确率超过60%,与现有的癌症诊断相当。此外,预处理在分离重要突变方面发挥了重要作用,表明所研究的每个癌症阶段可能包含数百个关键驱动突变,与当前研究一致。还生成了基于预测基因频率的热图,突出显示了每种癌症中的关键突变。总的来说,这项工作首次提出了一个高效、具有成本效益的端到端框架,用于预测癌症进展并提供可能的治疗方法,而无需依赖昂贵、耗时的湿实验室工作。
摘要:Despite significant medical advancements, cancer remains the second leading cause of death, with over 600,000 deaths per year in the US. One emerging field, pathway analysis, is promising but still relies on manually derived wet lab data, which is time-consuming to acquire. This work proposes an efficient, effective end-to-end framework for Artificial Intelligence (AI) based pathway analysis that predicts both cancer severity and mutation progression, thus recommending possible treatments. The proposed technique involves a novel combination of time-series machine learning models and pathway analysis. First, mutation sequences were isolated from The Cancer Genome Atlas (TCGA) Database. Then, a novel preprocessing algorithm was used to filter key mutations by mutation frequency. This data was fed into a Recurrent Neural Network (RNN) that predicted cancer severity. Then, the model probabilistically used the RNN predictions, information from the preprocessing algorithm, and multiple drug-target databases to predict future mutations and recommend possible treatments. This framework achieved robust results and Receiver Operating Characteristic (ROC) curves (a key statistical metric) with accuracies greater than 60%, similar to existing cancer diagnostics. In addition, preprocessing played an instrumental role in isolating important mutations, demonstrating that each cancer stage studied may contain on the order of a few-hundred key driver mutations, consistent with current research. Heatmaps based on predicted gene frequency were also generated, highlighting key mutations in each cancer. Overall, this work is the first to propose an efficient, cost-effective end-to-end framework for projecting cancer progression and providing possible treatments without relying on expensive, time-consuming wet lab work.
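按突变频率过滤关键突变的预处理步骤可以示意如下(阈值与基因名仅为说明用途,并非论文参数):

```python
from collections import Counter

def filter_key_mutations(patient_sequences, min_frequency):
    # Sketch of the frequency-based preprocessing step: keep only
    # mutations recurring in at least `min_frequency` patients,
    # discarding rare, likely-passenger mutations before the sequences
    # are fed to the RNN.
    counts = Counter(m for seq in patient_sequences for m in set(seq))
    keep = {m for m, c in counts.items() if c >= min_frequency}
    return [[m for m in seq if m in keep] for seq in patient_sequences]
```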
【3】Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
标题:小模型,大结果:通过分解实现卓越意图提取
链接:https://arxiv.org/abs/2509.12423
摘要:从UI交互轨迹中理解用户意图仍然是智能代理开发中具有挑战性但至关重要的前沿领域。虽然大规模、基于数据中心的多模态大型语言模型(MLLM)拥有更强的能力来处理此类序列的复杂性,但可以在设备上运行、以提供隐私保护、低成本和低延迟用户体验的较小模型却难以进行准确的意图推断。我们通过引入一种新颖的分解方法来解决这些限制:首先,我们进行结构化的交互总结,从每个用户操作中捕获关键信息;其次,我们使用一个在聚合摘要上运行的微调模型进行意图提取。这种方法提高了资源受限模型中的意图理解,甚至超过了大型MLLM的基础性能。
摘要:Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
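"先结构化总结、再对聚合摘要做意图提取"的两阶段分解可以示意如下(字段结构与模型接口均为假设,并非论文方案):

```python
def summarize_action(action):
    # Stage 1: structured interaction summarization -- reduce each UI
    # action to the key fields an intent model needs. The fields used
    # here are illustrative assumptions, not the paper's schema.
    summary = f"{action['type']} on '{action['target']}'"
    if action.get("text"):
        summary += f" with text '{action['text']}'"
    return summary

def extract_intent(actions, intent_model):
    # Stage 2: run the (hypothetical) fine-tuned extractor over the
    # aggregated summaries instead of the raw trajectory.
    return intent_model("; ".join(summarize_action(a) for a in actions))
```

小模型只需处理压缩后的摘要序列,而非完整的多模态轨迹。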
【4】MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch
标题:MTEB-NL和E5-NL:嵌入荷兰人的基准和模型
链接:https://arxiv.org/abs/2509.12340
摘要:最近,包括模型、基准和数据集在内的嵌入式资源已经广泛发布,以支持各种语言。然而,荷兰语的代表性仍然不足,通常只占出版的多语种资源的一小部分。为了解决这一差距,并鼓励荷兰嵌入的进一步发展,我们引入了新的资源来评估和生成。首先,我们介绍了荷兰语海量文本嵌入基准(MTEB-NL),它包括现有的荷兰语数据集和新创建的数据集,涵盖了广泛的任务。其次,我们提供了一个训练数据集,该数据集是从可用的荷兰语检索数据集编译而来的,并辅以大型语言模型生成的合成数据,以扩大检索之外的任务覆盖范围。最后,我们发布了一系列E5-NL模型:紧凑而高效的嵌入模型,在多个任务中表现出强大的性能。我们通过Hugging Face Hub和MTEB软件包公开提供我们的资源。
摘要:Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models: compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.
其他(18篇)
【1】WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
标题:WebWeaver:通过动态大纲构建网络规模证据,以进行开放式深度研究
链接:https://arxiv.org/abs/2509.13312
备注:An agent system for open-ended deep research
摘要:本文解决了开放式深度研究(OEDR),这是一个复杂的挑战,人工智能代理必须将大量的网络规模信息合成为有见地的报告。目前的方法受到双重限制的困扰:静态研究管道将计划与证据获取和一次性生成范式分离,这种范式容易遭受长期背景失败问题,如“中间损失”和幻觉。为了解决这些挑战,我们介绍了WebWeaver,一种新型的双代理框架,模拟人类的研究过程。规划器在一个动态循环中运行,迭代地交错证据获取与大纲优化,以产生一个全面的,基于源的大纲链接到证据的记忆库。然后,编写器执行分层检索和编写过程,逐节编写报告。通过从每个部分的内存库中仅执行必要证据的有针对性的检索,它有效地缓解了长上下文问题。我们的框架在主要的OEDR基准测试中建立了一个新的最先进的水平,包括DeepResearch Bench,DeepConsult和DeepResearchGym。这些结果验证了我们以人为本的迭代方法,证明了自适应规划和重点综合对于生成高质量,可靠和结构良好的报告至关重要。
摘要:This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.
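A hedged sketch of the dual-agent loop described above: the planner alternates evidence retrieval with outline refinement against a memory bank, and the writer composes each section from targeted retrieval only. Every component here (the search tool, both agents, the keyword-match retrieval) is a toy stand-in, not WebWeaver's implementation.

```python
def webweaver_sketch(query, search, refine_outline, write_section, rounds=3):
    """Toy dual-agent loop: planner builds a source-grounded outline,
    then the writer composes the report section by section."""
    memory_bank, outline = [], [query]
    # Planner: interleave evidence acquisition with outline optimization.
    for _ in range(rounds):
        for section in outline:
            memory_bank.extend(search(section))
        outline = refine_outline(outline, memory_bank)
    # Writer: targeted retrieval per section mitigates long-context issues.
    report = []
    for section in outline:
        evidence = [e for e in memory_bank if section.split()[0] in e]
        report.append(write_section(section, evidence))
    return "\n".join(report)

# Toy stand-ins for the search tool and the two agents.
search = lambda q: [f"{q} fact"]
refine_outline = lambda o, m: sorted(
    set(o + [f"sub-{s}" for s in o if not s.startswith("sub-")]))
write_section = lambda s, ev: f"## {s}\n" + "; ".join(ev[:2])

report = webweaver_sketch("quantum computing", search,
                          refine_outline, write_section)
print(report)
```

The point of the shape: the outline is only ever grown from evidence already in the memory bank, and each section's writer call sees a filtered slice of that bank rather than the full context.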
【2】Towards General Agentic Intelligence via Environment Scaling
标题:通过环境缩放迈向通用统计智能
链接:https://arxiv.org/abs/2509.13311
备注:this https URL
摘要:高级代理智能是在实际的现实世界应用中部署大型语言模型的先决条件。多样化的真实世界API需要精确、健壮的函数调用智能,这需要代理通过在不同环境中的交互来开发这些能力。函数调用能力的广度与代理训练环境的多样性密切相关。在这项工作中,我们扩大了环境规模,作为推进通用代理智能的一步。这就产生了两个核心挑战:(i)如何以有原则的方式扩展环境,以及(ii)如何有效地从与这些环境交互所获得的经验中训练代理能力。为了解决这些问题,我们设计了一个可扩展的框架,自动构建完全模拟的异构环境,系统地扩大了函数调用场景的空间。我们进一步采用了两阶段的代理微调策略:首先赋予代理基本的代理能力,然后针对特定领域的情境进行专门化。在代理基准tau-bench、tau2-Bench和ACEBench上进行的大量实验表明,我们训练的模型AgentScaler显著增强了模型的函数调用能力。
摘要:Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
【3】Scaling Agents via Continual Pre-training
标题:通过持续预训练的缩放代理
链接:https://arxiv.org/abs/2509.13310
备注:this https URL
摘要:大型语言模型(LLM)已经发展成为能够自主使用工具和进行多步推理以解决复杂问题的代理系统。然而,建立在通用基础模型上的后训练方法在代理任务中一直表现不佳,特别是在开源实现中。我们确定了根本原因:由于缺乏健壮的代理基础模型,模型被迫在后训练期间同时学习多样的代理行为并将其与专家演示对齐,从而产生根本性的优化张力。为此,我们首次提出将代理持续预训练(Agentic CPT)纳入深度研究代理的训练管道,以构建强大的代理基础模型。基于这种方法,我们开发了一个名为AgentFounder的深度研究代理模型。我们在10个基准测试中评估了AgentFounder-30B,在保持强大工具使用能力的同时实现了最先进的性能,特别是在BrowseComp-en上达到39.9%,在BrowseComp-zh上达到43.3%,在HLE上达到31.5%的Pass@1。
摘要:Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
【4】HARMONIC: A Content-Centric Cognitive Robotic Architecture
标题:HARMONIC:以内容为中心的认知机器人架构
链接:https://arxiv.org/abs/2509.13279
摘要:本文介绍了HARMONIC,一种为人机团队中的机器人设计的认知机器人架构。HARMONIC支持语义感知解释、类人决策和有意图的语言交流。它解决了结果的安全性和质量问题;旨在解决数据稀缺、可解释性和安全性问题;并促进透明度和信任。本文演示了两个基于HARMONIC的概念验证机器人系统,每个系统都在高保真仿真环境和物理机器人平台上实现。
摘要:This paper introduces HARMONIC, a cognitive-robotic architecture designed for robots in human-robotic teams. HARMONIC supports semantic perception interpretation, human-like decision-making, and intentional language communication. It addresses the issues of safety and quality of results; aims to solve problems of data scarcity, explainability, and safety; and promotes transparency and trust. Two proof-of-concept HARMONIC-based robotic systems are demonstrated, each implemented in both a high-fidelity simulation environment and on physical robotic platforms.
【5】Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter
标题:播客作为参与集体行动的媒介:黑人生命也是命的案例研究
链接:https://arxiv.org/abs/2509.13197
备注:11 pages, 5 figures
摘要:我们以“黑人的命也是命”(Black Lives Matter, BLM)运动为案例,研究播客讨论中如何表达对集体行动的参与。虽然对集体行动话语的研究主要集中在基于文本的内容上,本研究通过使用播客文字记录,朝分析音频形式迈出了第一步。使用结构化播客研究语料库(SPoRC),我们调查了口语中对集体行动参与的表达,将其分类为问题-解决、行动号召、意图和执行。我们确定了2020年5月和6月重要BLM相关事件后讨论种族正义的播客片段,并使用改编自先前社交媒体研究的分层框架提取了参与性陈述。我们考察了这些陈述的情感维度,检测出八种关键情绪及其与行动主义不同阶段的关联。我们发现情绪状况因阶段而异,在行动号召、意图和执行阶段凸显出不同的积极情绪。与理论预期相反,我们发现集体行动与负面情绪之间存在负相关。我们的工作有助于更好地理解行动主义如何在口头数字话语中表达,以及情感框架如何取决于讨论的形式。
摘要:We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion.
【6】Textarium: Entangling Annotation, Abstraction and Argument
标题:Textarium:交织注释、抽象与论证
链接:https://arxiv.org/abs/2509.13191
备注:This is the authors' version of the article presented at VIS4DH and published in the proceedings of IEEE VIS 2025
摘要:我们提出了一个基于网络的环境,在文本解释过程中将注释、抽象和论证连接起来。作为用于学术阅读和写作的可视化界面,Textarium将人类分析与轻量级计算处理相结合,以弥合近距离阅读与远距离阅读实践。读者可以高亮文本,将关键词分组为概念,并将这些观察结果作为锚点嵌入文章中。该界面将这些解释性操作呈现为参数化的可视化状态。通过共同创造和迭代原型的推测性设计过程,我们开发了一种读写方法,使解释过程在数字叙事中透明且可共享。
摘要:We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives.
【7】When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning
标题:当反向数据表现出色时:探索多阶段微调中混合数据的陷阱
链接:https://arxiv.org/abs/2509.13079
摘要:现有的工作表明,o1级的性能可以通过有限的数据蒸馏实现,但大多数现有的方法侧重于单向监督微调(SFT),忽略了不同推理模式之间错综复杂的相互作用。在本文中,我们构建了r1k,这是一个高质量的反向推理数据集,通过从s1k中反转1,000个正向示例获得,并研究了SFT和直接偏好优化(DPO)如何在双向推理目标下影响对齐。在评估的基准测试中,r1k上的SFT比s1k的精度提高了1.6%-6.8%。然而,在SFT期间简单地混合正向和反向数据会削弱方向区分。尽管DPO可以部分恢复这种区分,但它也会通过将概率质量转移到不相关的输出来抑制不太受偏好的推理路径。这些发现表明,混合推理数据会引入相互冲突的监督信号,强调需要鲁棒且方向感知的对齐策略。
摘要:Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%--6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.
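The paper's inversion procedure is not detailed in the abstract; the sketch below is a hypothetical illustration of the general idea of turning a forward (question → answer) example into a reverse one by conditioning on the conclusion. The prompt template and field names are assumptions, not r1k's actual format.

```python
def invert_example(example: dict) -> dict:
    """Hypothetical forward-to-reverse inversion: condition on the answer
    and ask the model to recover the originating question or premise."""
    return {
        "prompt": (f"Given the conclusion: {example['answer']}\n"
                   "What question or premise does it follow from?"),
        "target": example["question"],
        "direction": "reverse",
    }

forward = {"question": "What is 7 * 8?", "answer": "56"}
reverse = invert_example(forward)
print(reverse["prompt"])
```

Keeping an explicit `direction` field is one simple way to preserve the directional distinction that, per the abstract, naive data mixing erodes.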
【8】Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety
标题:重新思考对齐方法的评估:对多样性,泛化和安全性的见解
链接:https://arxiv.org/abs/2509.12936
摘要:大型语言模型(LLM)需要仔细的对齐以平衡相互竞争的目标:事实性、安全性、简洁性、主动性和多样性。现有的研究集中在个别技术或具体的维度,缺乏对固有权衡的整体评估。我们提出了一个统一的评估框架,使用分布内和分布外数据集,在这五个轴上比较LLM对齐方法(PPO、DPO、ORPO、KTO)。利用经人类研究验证的专门LLM-as-Judge提示,我们发现DPO和KTO在事实准确性方面表现出色,PPO和DPO在安全性方面领先,而PPO最好地平衡了简洁性与主动性。我们的研究结果提供了对常见对齐方法权衡的见解,指导更平衡、更可靠的LLM的开发。
摘要:Large language models (LLMs) require careful alignment to balance competing objectives - factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
【9】Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
标题:对多媒体文档事件提取的LVLM进行基准测试和改进
链接:https://arxiv.org/abs/2509.12876
备注:Accepted at INLG 2025. Camera-ready version
摘要:多媒体内容的激增需要开发有效的多媒体事件提取(M2E2)系统。虽然大型视觉语言模型(LVLM)已经显示出强大的跨模态能力,但它们在M2E2任务中的实用性仍有待探索。在本文中,我们在M2E2数据集上对代表性LVLM(包括DeepSeek-VL2和Qwen-VL系列)进行了首次系统评估。我们的评估涵盖纯文本、纯图像和跨媒体子任务,在少样本提示和微调设置下进行。我们的主要研究结果突出了以下有价值的见解:(1)少样本LVLM在视觉任务上表现明显更好,但在文本任务上表现不佳;(2)用LoRA微调LVLM可大幅提升模型性能;(3)LVLM在组合模态时表现出很强的协同作用,在跨模态设置中实现了卓越的性能。我们进一步提供了详细的错误分析,以揭示在语义精度、定位和跨模态接地等方面的持续挑战,这些仍然是推进M2E2能力的关键障碍。
摘要:The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.
【10】Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
标题:使用音译和机器翻译的阿拉伯语数据进行马耳他NLP的数据增强
链接:https://arxiv.org/abs/2509.12853
备注:EMNLP Camera-Ready
摘要:马耳他语是一种独特的闪米特语言,受到罗曼语和日耳曼语的广泛影响,特别是意大利语和英语。尽管它的闪米特根源,它的正字法是基于拉丁字母,创造了它与阿拉伯语最接近的语言亲属之间的差距。在本文中,我们探讨阿拉伯语资源是否可以通过跨语言增强技术支持马耳他自然语言处理(NLP)。我们研究了多种策略对齐阿拉伯语文本数据与马耳他语,包括各种音译方案和机器翻译(MT)的方法。作为其中的一部分,我们还介绍了新的音译系统,更好地代表马耳他正字法。我们评估了这些增强对单语和多语模型的影响,并证明了基于阿拉伯语的增强可以显着受益于马耳他NLP任务。
摘要:Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
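The paper's Maltese-oriented transliteration systems are not reproduced in the abstract; the toy rule-based Arabic-to-Latin mapper below only illustrates what such a scheme looks like mechanically. The character table is a simplified assumption (Maltese orthography uses għ, ħ, x, ż, etc.), not the paper's actual mapping.

```python
# Illustrative Arabic-to-Latin character table; a stand-in, not the
# paper's Maltese-oriented transliteration scheme.
AR2LAT = {
    "س": "s", "ل": "l", "ا": "a", "م": "m", "ح": "ħ",
    "ب": "b", "ت": "t", "ر": "r", "ك": "k", "ي": "i",
}

def transliterate(text: str) -> str:
    """Map each Arabic character through the table, keeping unknowns as-is."""
    return "".join(AR2LAT.get(ch, ch) for ch in text)

print(transliterate("سلام"))
```

Real schemes additionally handle digraphs, vowel diacritics, and context-dependent rules, which is where the paper's "systems that better represent Maltese orthography" would differ from a character table.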
【11】ConvergeWriter: Data-Driven Bottom-Up Article Construction
标题:ConvergeWriter:数据驱动的自下而上的文章构建
链接:https://arxiv.org/abs/2509.12811
摘要:大型语言模型(LLM)在文本生成方面表现出了非凡的能力,但基于广泛的外部知识库生成长篇、符合事实的文档仍然是一个重大挑战。现有的“自上而下”方法先生成假设或大纲,然后检索证据,往往受到模型计划与现有知识之间脱节的困扰,导致内容碎片化和事实不准确。为了解决这些限制,我们提出了一种新颖的“自下而上”、数据驱动的框架,颠倒了传统的生成管道。我们的方法基于“先检索知识、再聚类定结构”的策略,在任何生成式规划发生之前,首先确立源语料库的“知识边界”。具体来说,我们对知识库执行详尽的迭代检索,然后采用无监督聚类算法将检索到的文档组织成不同的“知识集群”。这些集群形成了一个客观的、数据驱动的基础,直接指导随后分层大纲和最终文档内容的生成。这种自下而上的过程确保生成的文本受到源材料的严格约束并完全可追溯到源材料,主动适应知识库的有限范围,并从根本上减轻幻觉风险。在14B和32B参数模型上的实验结果表明,我们的方法实现了与最先进基线相当或更优的性能,并有望在要求高保真度和结构一致性的知识受限场景中展现独特优势。我们的工作为生成可靠、结构化的长篇文档提供了一个有效范例,为LLM在高风险、知识密集型领域的更强大应用铺平了道路。
摘要:Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
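The retrieval-first-then-cluster strategy can be sketched with a tiny bag-of-words representation and k-means; a real system would use dense retrievers and stronger clustering. Everything below (the documents, the vectorizer, the deterministic initialization) is a toy stand-in, not ConvergeWriter's pipeline.

```python
import numpy as np

def bow(docs):
    """Bag-of-words count matrix over the retrieved documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.split():
            X[r, idx[w]] += 1
    return X

def kmeans(X, init, iters=10):
    """Plain k-means with fixed initial centers (deterministic for the demo)."""
    centers = X[init].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(len(init)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# Retrieved "documents" on two topics define the knowledge boundary;
# the resulting clusters would then seed the hierarchical outline.
docs = [
    "solar panels energy", "wind turbines energy", "solar energy storage",
    "tax policy reform", "income tax policy", "policy reform debate",
]
labels = kmeans(bow(docs), init=[0, 3])
print(labels)
```

The outline in the paper's framework is derived from clusters like these, so every planned section is guaranteed to have supporting documents before any text is generated.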
【12】Similarity-Distance-Magnitude Activations
标题:相似性-距离-幅度激活
链接:https://arxiv.org/abs/2509.12760
备注:17 pages, 5 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167
摘要:我们在现有的输出幅度(即决策边界)感知之上,加入相似性(即与训练数据正确预测的深度匹配)感知和与训练分布的距离感知,从而提出了神经网络常用的标准softmax激活函数的一种更鲁棒、更可解释的表述。当用作语言模型的最后一层激活时,所得的相似性-距离-幅度(SDM)激活函数在高概率区域对协变量偏移和分布外输入比softmax函数更鲁棒,并通过密集匹配提供基于样例的可解释性。作为预测条件估计的补充,SDM激活能够对按类划分的经验CDF进行分区,以防范选择性分类中较低的类别召回率。这些特性使其更适合选择性分类,即使与softmax上的事后校准方法相比也是如此。
摘要:We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to co-variate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax.
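The abstract does not give SDM's parameterization; one hedged way to read "adding similarity and distance awareness to magnitude awareness" is to modulate the logits before the softmax, as in the illustrative (not the paper's) form below: similarity scales the logit magnitude, and distance from the training distribution raises the temperature.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sdm_activation(logits, similarity, distance, beta=1.0):
    """Illustrative reading of SDM, not the paper's exact form:
    similarity in [0, 1] scales logit magnitude; distance to the
    training distribution flattens the output distribution."""
    return softmax(similarity * logits / (1.0 + beta * distance))

logits = np.array([2.0, 1.0, 0.5])
p_near = sdm_activation(logits, similarity=1.0, distance=0.0)
p_far = sdm_activation(logits, similarity=1.0, distance=4.0)
p_lowsim = sdm_activation(logits, similarity=0.2, distance=0.0)
# Far-from-training or low-similarity inputs yield less confident outputs,
# which is what supports abstention in selective classification.
print(p_near.max(), p_far.max(), p_lowsim.max())
```

Under this reading, confidence is only high when magnitude, similarity, and in-distribution evidence all agree, which is the behavior the abstract claims for selective classification.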
【13】Case-Based Decision-Theoretic Decoding with Quality Memories
标题:具有优质记忆的基于案例的决策理论解码
链接:https://arxiv.org/abs/2509.12677
备注:Accepted at EMNLP2025 main
摘要:最小贝叶斯风险(MBR)解码是文本生成的一种决策规则,它选择最大化期望效用的假设,能比最大后验概率(MAP)解码更稳健地生成更高质量的文本。然而,它依赖于从文本生成模型中抽取的样本文本;因此,很难找到能正确捕获域外知识或信息的假设。为了解决这个问题,我们提出了基于案例的决策理论(CBDT)解码,这是另一种利用领域数据示例来估计期望效用的方法。CBDT解码不仅比MAP解码生成更高质量的文本,而且MBR与CBDT解码的组合在七个领域的De-En和Ja$\leftrightarrow$En翻译任务以及MSCOCO和nocaps数据集上的图像字幕任务中优于MBR解码。
摘要:Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures the knowledge or information of out-of-domain. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De--En and Ja$\leftrightarrow$En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
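MBR decoding itself is standard and can be sketched in a few lines with a toy utility (unigram F1). Passing the model's own samples as references gives classic MBR; passing domain examples instead approximates the CBDT idea described above. The utility function and examples are illustrative assumptions.

```python
def utility(h, r):
    """Toy utility: unigram F1 overlap between hypothesis and reference."""
    hs, rs = set(h.split()), set(r.split())
    common = len(hs & rs)
    if common == 0:
        return 0.0
    p, rec = common / len(hs), common / len(rs)
    return 2 * p * rec / (p + rec)

def mbr_decode(hypotheses, references):
    """Pick the hypothesis maximizing expected utility over the references.
    references = hypotheses  -> classic MBR (consensus decoding);
    references = domain data -> CBDT-style estimation."""
    return max(hypotheses, key=lambda h: sum(utility(h, r) for r in references))

samples = ["the cat sat", "a cat sat down", "dogs run fast"]
best = mbr_decode(samples, samples)
print(best)
```

The practical difference the paper targets: when the model's samples are all off-domain, the `references = samples` estimate is biased, while domain examples anchor the utility estimate outside the model's own distribution.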
【14】Positional Encoding via Token-Aware Phase Attention
标题:通过令牌感知阶段注意力进行位置编码
链接:https://arxiv.org/abs/2509.12635
备注:21 pages
摘要:我们证明,在实际假设下,旋转位置嵌入(RoPE)会在注意力分数中引入内在的、依赖距离的偏差,从而限制RoPE建模长上下文的能力。RoPE扩展方法可以缓解这个问题,但它们通常需要在预训练后进行事后调整,例如重新缩放或超参数重新调整。本文介绍了令牌感知相位注意力(TAPA),一种将可学习相位函数纳入注意力机制的新位置编码方法。TAPA在长范围内保留令牌交互,只需直接而轻量的微调即可扩展到更长的上下文,能外推到未见过的长度,并在长上下文上获得比RoPE系列显著更低的困惑度。
摘要:We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context than RoPE families.
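TAPA's learnable phase function is not specified in the abstract; as a hedged sketch, one can contrast it with RoPE by adding a phase term directly to attention scores rather than rotating queries and keys. The function `phi` below is an arbitrary fixed stand-in for the learned per-position phase, and the additive-cosine form is an assumption for illustration.

```python
import numpy as np

def phase_attention_scores(q, k, phi):
    """Illustrative scores s_ij = q_i . k_j / sqrt(d) + cos(phi(i) - phi(j)).
    phi stands in for TAPA's learnable per-token phase function."""
    d = q.shape[-1]
    base = q @ k.T / np.sqrt(d)
    n, m = len(q), len(k)
    phase = np.cos(phi(np.arange(n))[:, None] - phi(np.arange(m))[None, :])
    return base + phase

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
phi = lambda pos: 0.1 * pos  # toy linear phase; the paper learns this
scores = phase_attention_scores(q, k, phi)
print(scores.shape)
```

With a linear `phi`, the phase term depends only on the relative offset i - j, recovering a relative-position bias; a learned nonlinear `phi` is what would let the model shape that bias instead of inheriting RoPE's fixed distance-dependent decay.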
【15】EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
标题:EcoProver:实现更经济的测试时间缩放以实现自动化定理证明
链接:https://arxiv.org/abs/2509.12603
摘要:大型语言模型(LLM)最近推动了自动定理证明(ATP)领域的发展,通过广泛采用的测试时缩放策略,特别是反思式思维链(CoT)推理和增加采样次数,实现了实质性的性能提升。然而,它们都为推理引入了显著的计算开销。此外,现有的成本分析通常只调节采样次数,而忽略了不同缩放策略引入的采样成本的实质性差异。在本文中,我们系统地比较了ATP模型不同测试时缩放策略的效率,并证明了当前最先进(SOTA)开源方法的低效率。然后,我们研究了在保持原有性能的同时显著减少令牌使用量和采样次数的方法。具体来说,我们提出了两种互补的方法,可以集成到一个统一的EconRL管道中以获得更大的收益:(1)一种旨在减少不必要令牌消耗的动态思维链(CoT)切换机制,以及(2)带有可训练前缀的多样化并行扩展强化学习(RL),以在受限采样次数下提高通过率。miniF2F和ProofNet上的实验表明,我们的EconProver以仅12%的计算成本实现了与基线方法相当的性能。这项工作为在不牺牲性能的情况下部署轻量级ATP模型提供了可操作的见解。
摘要:Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
【16】Match Chat: Real Time Generative AI and Generative Computing for Tennis
标题:比赛聊天:网球实时生成人工智能和生成计算
链接:https://arxiv.org/abs/2509.12592
备注:12 pages, 5 Figures, 4 Tables
摘要:我们提出Match Chat,一个实时、代理驱动的助手,旨在通过对比赛相关查询提供即时、准确的回答来提升网球迷的体验。Match Chat将生成式人工智能(GenAI)与生成式计算(GenComp)技术相结合,在网球单打比赛直播期间综合关键见解。该系统在2025年温布尔登锦标赛和2025年美国公开赛上首次亮相,通过自然语言查询为约100万用户提供了对流媒体和静态数据的无缝访问。该架构基于面向代理的架构(AOA),结合规则引擎、预测模型和代理,在将用户查询传递给GenAI组件之前对其进行预处理和优化。Match Chat系统的回答准确率为92.83%,在高达每秒120个请求(RPS)的负载下平均响应时间为6.25秒。超过96.08%的查询使用交互式提示设计进行引导,有助于打造优先考虑清晰度、响应性和最小操作成本的用户体验。该系统旨在掩盖架构的复杂性,提供无摩擦且直观的界面,无需入门培训或技术背景。在两次大满贯部署中,Match Chat保持了100%的正常运行时间,并支持了近100万独立用户,凸显了平台的可扩展性和可靠性。这项工作介绍了面向消费者的实时AI系统中强调速度、精度和可用性的关键设计模式,为在动态环境中部署高性能代理系统指明了一条实用路径。
摘要:We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability that highlights a practical path for deploying performant agentic systems in dynamic environments.
【17】FunAudio-ASR Technical Report
标题:FunAudio-ASR技术报告
链接:https://arxiv.org/abs/2509.12508
摘要:近年来,自动语音识别(ASR)在三种互补范式的推动下取得了变革性的进步:数据缩放、模型大小缩放以及与大型语言模型(LLM)的深度集成。然而,LLM容易产生幻觉,这会显著降低现实世界ASR应用中的用户体验。在本文中,我们介绍了FunAudio-ASR,这是一个基于LLM的大规模ASR系统,它协同结合了海量数据、大模型容量、LLM集成和强化学习,以在各种复杂的语音识别场景中实现最先进的性能。此外,FunAudio-ASR针对实际部署进行了专门优化,增强了流媒体功能,噪声鲁棒性,代码切换,热词定制,并满足其他实际应用需求。实验结果表明,虽然大多数基于LLM的ASR系统在开源基准测试中表现出色,但它们在真实的行业评估集上往往表现不佳。得益于面向生产的优化,FunAudio-ASR在实际应用数据集上实现了SOTA性能,证明了其在实际环境中的有效性和鲁棒性。
摘要:In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
【18】Exact Coset Sampling for Quantum Lattice Algorithms
标题:量子格点算法的精确陪集抽样
链接:https://arxiv.org/abs/2509.12341
备注:Project Page: this https URL
摘要:我们为最近一个采用复高斯窗的窗口化QFT格算法~\citep{chen2024quantum}中有争议的步骤9“域扩展”给出了一个简单、完全正确且假设极少的替代方案。已发表的步骤9存在周期性/支撑不匹配的问题。我们提出了一种成对移位差分构造,可相干地消除所有未知偏移量,产生$\mathbb{Z}_{P}$上精确均匀的CRT陪集态,然后使用QFT来施加预期的模线性关系。该酉算子是可逆的,使用$\mathrm{poly}(\log M_2)$个门,并保持算法的渐近复杂度。项目页面:https://github.com/yifanzhang-pro/quantum-lattice。
摘要:We give a simple, fully correct, and assumption-light replacement for the contested "domain-extension" in Step 9 of a recent windowed-QFT lattice algorithm with complex-Gaussian windows~\citep{chen2024quantum}. The published Step~9 suffers from a periodicity/support mismatch. We present a pair-shift difference construction that coherently cancels all unknown offsets, produces an exact uniform CRT-coset state over $\mathbb{Z}_{P}$, and then uses the QFT to enforce the intended modular linear relation. The unitary is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves the algorithm's asymptotics. Project Page: https://github.com/yifanzhang-pro/quantum-lattice.
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递

