cs.CL: 159 papers today
Large-model related (80 papers)
【1】Mapping Post-Training Forgetting in Language Models at Scale
Link: https://arxiv.org/abs/2510.17776
Notes: 43 pages, 15 figures
Abstract: Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1->0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0->1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yield moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress towards generally capable AI systems.
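The sample-wise metric described in this abstract is simple to state in code. The sketch below is an illustration, not the paper's implementation; in particular, the exact form of the chance adjustment is an assumption based on standard chance correction for n-way multiple choice.

```python
def transition_rates(pre_correct, post_correct):
    """Sample-wise forgetting (1->0) and backward-transfer (0->1) rates.

    pre_correct / post_correct: parallel lists of 0/1 correctness flags
    for the same evaluation samples before and after post-training.
    """
    assert len(pre_correct) == len(post_correct) and pre_correct
    n = len(pre_correct)
    forgetting = sum(1 for a, b in zip(pre_correct, post_correct) if a and not b) / n
    backward = sum(1 for a, b in zip(pre_correct, post_correct) if not a and b) / n
    return forgetting, backward


def chance_adjusted_accuracy(accuracy, num_choices):
    """Subtract the expected contribution of random guessing on an
    n-way multiple-choice benchmark and rescale to [0, 1]
    (an assumed standard chance correction, floored at 0)."""
    guess = 1.0 / num_choices
    return max(0.0, (accuracy - guess) / (1.0 - guess))
```

For example, if a model gets samples [1, 1, 0, 0] right before post-training and [1, 0, 1, 0] after, both the forgetting and backward-transfer rates are 0.25, a change that a task-average accuracy (0.5 before and after) would hide entirely.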
【2】Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications
Link: https://arxiv.org/abs/2510.17764
Abstract: Medical large language models achieve strong scores on standard benchmarks; however, the transfer of those results to safe and reliable performance in clinical workflows remains a challenge. This survey reframes evaluation through a levels-of-autonomy lens (L0-L3), spanning informational tools, information transformation and aggregation, decision support, and supervised agents. We align existing benchmarks and metrics with the actions permitted at each level and their associated risks, making the evaluation targets explicit. This motivates a level-conditioned blueprint for selecting metrics, assembling evidence, and reporting claims, alongside directions that link evaluation to oversight. By centering autonomy, the survey moves the field beyond score-based claims toward credible, risk-aware evidence for real clinical use.
【3】VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Link: https://arxiv.org/abs/2510.17759
Notes: 18 pages, 7 figures
Abstract: Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on the HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
【4】AcademicEval: Live Long-Context LLM Benchmark
Link: https://arxiv.org/abs/2510.17725
Notes: Accepted by TMLR. Code is available at this https URL
Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. In particular, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
【5】QueST: Incentivizing LLMs to Generate Difficult Problems
Link: https://arxiv.org/abs/2510.17715
Notes: 20 pages, 7 figures
Abstract: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding-problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. Our trained generators demonstrate superior capability, even compared to GPT-4o, at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
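As a rough illustration of the rejection step, difficulty-aware rejection fine-tuning can be thought of as keeping only generated problems that a reference solver rarely passes. The helper below is hypothetical: the pass-rate signal and threshold are assumptions for illustration, not QueST's actual criteria.

```python
def difficulty_filter(problems, pass_rates, max_pass_rate=0.3):
    """Keep generated problems whose reference-solver pass rate is at or
    below a threshold, i.e., problems the solver finds hard.

    problems: list of generated coding problems
    pass_rates: parallel list of empirical solver pass rates in [0, 1]
    """
    return [p for p, r in zip(problems, pass_rates) if r <= max_pass_rate]
```

Problems that survive this filter would then serve as the fine-tuning targets for the generator, biasing it toward the hard end of the difficulty distribution.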
【6】Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models
Link: https://arxiv.org/abs/2510.17705
Notes: Accepted by CIKM '25
Abstract: Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing methods, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/HyCAM.
【7】Towards Mining Effective Pedagogical Strategies from Learner-LLM Educational Dialogues
Link: https://arxiv.org/abs/2510.17698
Abstract: Dialogue plays a crucial role in educational settings, yet existing evaluation methods for educational applications of large language models (LLMs) primarily focus on technical performance or learning outcomes, often neglecting learner-LLM interactions. To narrow this gap, this AIED Doctoral Consortium paper presents an ongoing study employing a dialogue analysis approach to identify effective pedagogical strategies from learner-LLM dialogues. The proposed approach involves dialogue data collection, dialogue act (DA) annotation, DA pattern mining, and predictive model building. Early insights are outlined as an initial step toward future research. The work underscores the need to evaluate LLM-based educational applications by focusing on dialogue dynamics and pedagogical strategies.
【8】Qomhra: A Bilingual Irish-English Large Language Model
Link: https://arxiv.org/abs/2510.17652
Abstract: This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM) developed under low-resource constraints, presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged on their Irish text generation by a native speaker, a learner, and other LLMs. Google's Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction-tuning and human-preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, with accepted and rejected responses that show near-perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification, and world knowledge, with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.
【9】LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena
Link: https://arxiv.org/abs/2510.17638
Notes: this https URL
Abstract: Forecasting is not only a fundamental intellectual pursuit but is also of significant importance to societal systems such as finance and economics. The rapid advances of large language models (LLMs) trained on Internet-scale data raise the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates this predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence, and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets when resolution nears.
【10】Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models
Link: https://arxiv.org/abs/2510.17620
Abstract: Large language models may encode sensitive information or outdated knowledge that needs to be removed to ensure responsible and compliant model responses. Unlearning has emerged as an efficient alternative to full retraining, aiming to remove specific knowledge while preserving overall model utility. Existing evaluations of unlearning methods focus on (1) the extent of forgetting of the target knowledge (forget set) and (2) maintaining performance on the retain set (i.e., utility). However, these evaluations overlook an important usability aspect: users may still want the model to leverage the removed information if it is re-introduced in the prompt. In a systematic evaluation of six state-of-the-art unlearning methods, we find that they consistently impair such contextual utility. To address this, we augment unlearning objectives with a plug-in term that preserves the model's ability to use forgotten knowledge when it is present in context. Extensive experiments demonstrate that our approach restores contextual utility to near-original levels while still maintaining effective forgetting and retain-set utility.
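A hedged sketch of how such a plug-in term might be combined with a standard unlearning objective. The weighting and sign conventions here are illustrative assumptions; the paper's exact objective is not given in the abstract.

```python
def unlearning_objective(forget_loss, retain_loss, contextual_loss,
                         alpha=1.0, beta=1.0):
    """Combine three terms (all average negative log-likelihoods):
    - push loss UP on the forget set (negated term, gradient-ascent style),
    - keep loss low on the retain set,
    - keep loss low on prompts where the forgotten fact is re-supplied
      in context (the plug-in contextual-utility term).

    alpha and beta are assumed trade-off weights.
    """
    return -forget_loss + alpha * retain_loss + beta * contextual_loss
```

The third term is the addition the abstract argues for: without it, minimizing the first two terms alone tends to also degrade the model's ability to use the removed fact even when it is handed back in the prompt.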
【11】HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection
Link: https://arxiv.org/abs/2510.17591
Notes: Accepted by the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) as a findings long paper
Abstract: Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations among code tokens, i.e., abstract syntax tree family correlation, lexical correlation, and line correlation. We design a token and hyperedge generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) for fine-tuning PLMs. HGAdapter encodes high-order data correlations and can be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, covering code summarization and code clone detection tasks across six languages. Our methods improved the performance of PLMs on these datasets to varying degrees. Experimental results validate that introducing high-order data correlations contributes to improved effectiveness.
【12】Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation
Link: https://arxiv.org/abs/2510.17555
Abstract: Large language models (LLMs) often experience language confusion, the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, that correct-language tokens are usually among the top predictions, and that output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at https://github.com/collinzrj/language_confusion_gate.
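A minimal sketch of an LCG-style decode-time gate, assuming each vocabulary token is tagged with a language family. The names, the top-k trigger rule, and the hard -inf masking are illustrative assumptions based on the description above, not the paper's trained gate.

```python
import math

def language_gate(logits, token_langs, allowed_langs, top_k=5):
    """Mask logits of tokens from disallowed language families, but only
    when the top-k candidates show potential confusion; otherwise leave
    the distribution untouched (confusion is infrequent, so the gate
    should fire rarely).

    logits: per-token scores for one decoding step
    token_langs: parallel list of language-family tags per token
    allowed_langs: set of language families permitted in the output
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if all(token_langs[i] in allowed_langs for i in ranked[:top_k]):
        return list(logits)  # no confusion risk at this step
    return [x if token_langs[i] in allowed_langs else -math.inf
            for i, x in enumerate(logits)]
```

In a real decoder this would run as a logits processor before sampling; masking to -inf zeroes a token's probability after softmax while leaving the allowed tokens' relative odds unchanged.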
【13】OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction
Link: https://arxiv.org/abs/2510.17532
Abstract: Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack the structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
【14】SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Link: https://arxiv.org/abs/2510.17516
Notes: Project website: this http URL; data: this https URL
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
【15】Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents
Link: https://arxiv.org/abs/2510.17491
Abstract: With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, the paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: memory, planning, and tool use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.
【16】Disparities in Multilingual LLM-Based Healthcare Q&A
Link: https://arxiv.org/abs/2510.17476
Notes: Under review
Abstract: Equitable access to reliable health information is vital when integrating AI into healthcare. Yet information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training sources and factuality alignment in LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.
【17】Evaluating Large Language Models on Urdu Idiom Translation
Link: https://arxiv.org/abs/2510.17460
Abstract: Idiomatic translation remains a significant challenge in machine translation, especially for low-resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu-to-English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross-script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.
【18】BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine
标题:BenCao:一个经过指令调整的中医大语言模型
链接:https://arxiv.org/abs/2510.17415
摘要:传统中医药(TCM)拥有两千多年的历史,在全球医疗保健中发挥着重要作用。然而,由于中医依赖整体推理、隐式逻辑和多模态诊断线索,将大型语言模型(LLM)应用于中医仍然具有挑战性。现有的中医领域LLM在基于文本的理解方面取得了进展,但缺乏多模态整合、可解释性和临床适用性。为了解决这些限制,我们开发了BenCao,一款基于ChatGPT的中医多模态助手,集成了结构化知识库、诊断数据和专家反馈细化。BenCao通过自然语言指令调整而非参数重新训练进行训练,以对齐中医特有的专家级推理和伦理规范。该系统整合了涵盖1,000多部古典与现代文献的综合知识库、面向多样化交互的基于场景的指令框架、用于可解释推理的思维链模拟机制,以及由持证中医师参与的反馈改进流程。BenCao连接外部API进行舌象分类和多模态数据库检索,从而实现对诊断资源的动态访问。在单项选择题基准和多模态分类任务的评估中,BenCao的准确率优于通用领域和中医领域模型,特别是在诊断、草药识别和体质分类方面。该模型已作为交互式应用部署在OpenAI GPTs Store上,截至2025年10月,全球有近1,000名用户访问。这项研究证明了通过基于自然语言的指令调整和多模态集成开发中医领域LLM的可行性,为将生成式人工智能与传统医学推理对齐提供了实用框架,并为现实世界部署提供了可扩展的途径。
摘要:Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.
【19】Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
标题:利用组相对策略优化推进中医领域大语言模型
链接:https://arxiv.org/abs/2510.17402
摘要:传统中医学(TCM)呈现出丰富且结构独特的知识体系,这对大型语言模型(LLM)的传统应用方式提出了挑战。虽然以前的中医专用LLM已经通过监督微调取得了进展,但它们通常在对齐、数据质量和评估一致性方面面临限制。在这项研究中,我们引入了Ladder-base,这是第一个使用组相对策略优化(GRPO)训练的、以中医为重点的LLM;GRPO是一种强化学习方法,通过基于组内比较优化回答选择来提高推理和事实一致性。Ladder-base建立在Qwen2.5-7B-Instruct基础模型之上,仅在TCM-Ladder基准的文本子集上进行训练,其中80%的数据用于训练,其余20%在验证集和测试集之间平均分配。通过标准化评估,与GPT-4、Gemini 2.5、Claude 3和Qwen3等最先进的通用LLM以及BenTsao、HuatuoGPT2和Zhongjing等中医领域模型相比,Ladder-base在多个推理指标上表现出更优的性能。这些结果表明,GRPO为使LLM在传统医学领域对齐专家级推理提供了一种有效且高效的策略,并支持开发值得信赖、具有临床依据的中医人工智能系统。
摘要:Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
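摘要中的GRPO依赖组内比较来计算每个采样回答的相对优势。下面是一个纯Python的最小示意(奖励数值与eps参数均为本文之外的假设,非Ladder-base的原始实现):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """组内相对优势:对同一提示下一组采样回答的奖励做组内标准化。"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 对同一道中医问题采样 4 个回答,奖励为打分(数值为假设):
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# 奖励高于组均值的回答获得正优势,低于均值的获得负优势
```

策略梯度随后按该优势加权更新,使模型偏向组内表现更好的回答。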
【20】EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
标题:EduAdapt:用于评估LLM年级适应性的问答基准数据集
链接:https://arxiv.org/abs/2510.17389
备注:28 pages, 2 figures, 14 tables, 50 listings, EMNLP 2025 Main
摘要:大型语言模型(LLM)正在通过回答问题、解释复杂概念以及在广泛学科中生成内容来改变教育。尽管在学术基准上表现出色,但它们往往无法根据学生的年级水平调整回答。这是K-12教育的关键需求,在那里,适合年龄的词汇和解释对于有效学习至关重要。现有模型经常产生对年幼学习者来说过于高深或模糊的输出,并且没有标准化的基准来评估它们在认知和发展阶段间进行调整的能力。为了解决这一差距,我们引入了EduAdapt,一个涵盖9个科学科目、横跨1-12年级并归为四个年级段、包含近48k条带年级标注问答对的基准。我们在EduAdapt上评估了一组多样化的开源LLM,发现虽然较大的模型通常表现更好,但它们仍然难以为低年级学生(1-5年级)生成合适的回答。我们的工作提出了第一个用于评估LLM年级适应性的数据集和评估框架,旨在通过更好的训练和提示策略,培养在发展上更对齐的教育AI系统。EduAdapt的代码和数据集可在https://github.com/NaumanNaeem/EduAdapt上公开获取。
摘要:Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students' grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.
【21】The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
标题:原子指令差距:指令调整的LLM难以执行简单、自包含的指令
链接:https://arxiv.org/abs/2510.17388
备注:11 pages, 1 figure, 8 tables
摘要:指令调整的大型语言模型(IT-LLM)表现出强大的零样本推理能力,但它们执行简单、自包含指令的能力仍未得到充分探索,尽管这是复杂指令遵循的基础。我们在修改后的MMLU和MMLU-Pro基准上评估了20个IT-LLM,在四种范式下系统地改变选项标签的格式(字母、数字、罗马数字),同时保持其含义不变:(1)在有显式指令时,标签变化会导致较大的性能波动(例如,罗马数字相对数字标签变化达-30.45%),揭示了指令格式偏差。(2)在没有指令时,性能进一步下降(最多-10.84%),标签敏感性加剧,凸显了显式指导的作用。(3)当选项内容被删除时,除数字标签外,模型无法超过随机选择基线,表明对原子指令的遵循较弱。(4)三样本示例并未在鲁棒性或保真度上带来显著提升,生成分析显示持续存在标签错误,尤其是非数字格式。在各种模型规模中,较大的LLM准确率更高,但在指令遵循上仍不一致。这些结果暴露了当前指令调整范式的不足,并强调需要明确针对原子指令遵循的评估方法和训练策略。
摘要:Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45\% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84\%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
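上述评估的关键操作是在保持选项内容不变的前提下改变标签格式。下面的纯Python示意(函数名与选项内容均为假设,非论文原始脚本)演示了字母/数字/罗马数字三种标签的生成:

```python
def to_roman(n):
    # 仅用于示意的小范围罗马数字转换(假设 1 <= n <= 10)
    vals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for v, s in vals:
        while n >= v:
            out += s
            n -= v
    return out

def relabel(options, style):
    """把同一组选项改写为不同标签格式,内容保持不变。"""
    labels = {
        "alphabetic": [chr(ord("A") + i) for i in range(len(options))],
        "numeric":    [str(i + 1) for i in range(len(options))],
        "roman":      [to_roman(i + 1) for i in range(len(options))],
    }[style]
    return [f"{l}. {o}" for l, o in zip(labels, options)]

opts = ["Paris", "London", "Rome", "Berlin"]
# relabel(opts, "roman") -> ["I. Paris", "II. London", "III. Rome", "IV. Berlin"]
```

由于三种版本语义完全相同,模型在它们之间的准确率差异即可归因于标签格式偏差。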
【22】TaxoAlign: Scholarly Taxonomy Generation Using Language Models
标题:TaxoAlign:使用语言模型的学术分类生成
链接:https://arxiv.org/abs/2510.17263
备注:This paper has been accepted at the EMNLP 2025 Main Conference
摘要:分类法在帮助研究人员以层次化方式构建和导航知识方面发挥着至关重要的作用,也是创建综合文献综述的重要组成部分。现有的自动综述生成方法没有将生成综述的结构与人类专家撰写的综述结构进行比较。为了解决这一差距,我们提出了自己的自动分类法创建方法,以弥合人工生成和自动创建的分类法之间的差距。为此,我们创建了CS-TaxoBench基准,其中包括从人类撰写的综述论文中提取的460个分类法,另外还包括一个由80个分类法组成、整理自会议综述论文的测试集。我们提出了TaxoAlign,一种基于主题、指令引导的三阶段学术分类法生成方法。此外,我们提出了一个严格的自动化评估框架,衡量自动生成的分类法与人类专家创建的分类法相比的结构对齐度和语义一致性。我们使用自动评估指标和人工评估研究,在CS-TaxoBench上评估了我们的方法和各种基线。结果表明,TaxoAlign在几乎所有指标上都始终超过基线。代码和数据可以在https://github.com/AvishekLahiri/TaxoAlign上找到。
摘要:Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.
【23】Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations
标题:大型语言模型的可解释性:生成可信解释的机遇和挑战
链接:https://arxiv.org/abs/2510.17256
摘要:大型语言模型在自然语言处理的广泛下游任务中表现出令人印象深刻的性能。然而,语言模型如何预测下一个标记并生成内容通常不为人类所理解。此外,这些模型经常在预测和推理中出错,即所谓的幻觉。这些错误凸显了更好地理解和解释语言模型复杂的内部工作原理以及它们如何生成预测输出的迫切需要。基于这一差距,本文研究了基于Transformer的大型语言模型中的局部可解释性和机制可解释性,以促进对此类模型的信任。在这方面,本文旨在做出三项关键贡献。首先,我们对局部可解释性和机制可解释性方法以及文献中相关研究的见解进行了综述。此外,我们描述了在医疗保健和自动驾驶这两个关键领域中关于大型语言模型可解释性和推理的实验研究,并分析了这类解释对解释接收者的信任影响。最后,我们总结了LLM可解释性不断发展的版图中当前尚未解决的问题,并概述了生成与人类对齐、值得信赖的LLM解释的机会、关键挑战和未来方向。
摘要:Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.
【24】StreamingThinker: Large Language Models Can Think While Reading
标题:StreamingThinker:大型语言模型可以在阅读时思考
链接:https://arxiv.org/abs/2510.17238
摘要:大型语言模型(LLM)在思维链(CoT)推理方面表现出了卓越的能力。然而,当前的LLM推理范式只有在整个输入可用之后才开始思考,这引入了不必要的延迟,并削弱了动态场景中对较早信息的关注。受人类边读边想这一认知方式的启发,我们首先为LLM设计了一种流式思考(streaming thinking)范式,其中推理按输入顺序展开,并在阅读完成后进一步调整其深度。我们用StreamingThinker实例化这一范式,该框架通过集成流式CoT生成、流式约束训练和流式并行推理,使LLM能够边读边想。具体来说,StreamingThinker采用带质量控制的流式推理单元来生成CoT,通过流式注意力掩码和位置编码来强制保序推理,并利用并行KV缓存将输入编码与推理生成解耦,从而确保对齐并实现真正的并发。我们在Qwen3模型家族上,就数学推理、逻辑推理和基于上下文的问答推理任务评估了StreamingThinker。实验结果表明,StreamingThinker保持了与批式思考相当的性能,同时将推理开始前的令牌等待减少80%,将生成最终答案的时间延迟降低60%以上,证明了流式范式对LLM推理的有效性。代码将在 https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker 发布。
摘要:Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80\% reduction in token waiting before the onset of reasoning and a more than 60\% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at \href{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}{this repository.}
【25】Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
标题:智慧在于知道不该说什么:通过注意力转移实现无幻觉的LLM遗忘
链接:https://arxiv.org/abs/2510.17210
备注:22 pages, 10 figures
摘要:计算能力的提升和AI辅助决策的需求推动了大型语言模型(LLM)的广泛应用。与此同时,LLM对敏感数据的潜在保留也激发了越来越多的机器遗忘研究。然而,现有的遗忘方法面临一个关键困境:激进的遗忘会损害模型效用,而保守的策略虽保留效用却有产生幻觉响应的风险。这极大限制了LLM在知识密集型应用中的可靠性。为了解决这个问题,我们引入了一个用于选择性遗忘的新型注意力转移(AS)框架。AS由两个设计目标驱动:(1)保留上下文的抑制,在不破坏LLM语言结构的前提下削弱对承载事实的标记的注意力;(2)抗幻觉的响应塑造,在被问及已遗忘内容时抑制捏造的补全。AS通过两种注意力层面的干预实现这些目标:对遗忘集应用重要性感知抑制以减少对记忆知识的依赖,以及注意力引导的保留增强,加强对保留数据集中语义关键标记的注意力,以减轻意外退化。这两个组件通过双损失目标联合优化,形成一个软边界,在表示叠加下将遗忘局部化,同时保留不相关的知识。实验结果表明,AS在性能保留上优于最先进的遗忘方法,在ToFU基准上的准确率最高提升15%,在TDEC基准上提升10%,同时保持有竞争力的无幻觉遗忘效果。与现有方法相比,AS在遗忘效果、泛化和响应可靠性之间表现出更好的平衡。
摘要:The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
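AS的上下文保持抑制可以理解为:削弱对承载事实的token的注意力后重新归一化。下面是一个纯Python的玩具示意(softmax输入分数、衰减系数alpha均为假设,非论文的原始注意力干预实现):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def shift_attention(scores, fact_idx, alpha=0.5):
    # 对事实 token 的注意力概率乘以衰减系数 alpha,再整体重新归一化
    probs = softmax(scores)
    damped = [p * alpha if i in fact_idx else p for i, p in enumerate(probs)]
    z = sum(damped)
    return [d / z for d in damped]

# 4 个 token,位置 2 是承载事实的 token:
attn = shift_attention([2.0, 1.0, 3.0, 0.5], fact_idx={2})
# 事实 token 的注意力占比下降,其余 token 的相对权重相应上升
```

重新归一化保证输出仍是合法的注意力分布,因此不破坏整体的语言结构。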
【26】Soft-Masked Diffusion Language Models
标题:软掩码扩散语言模型
链接:https://arxiv.org/abs/2510.17206
摘要:扩散模型在语言建模中表现出了强大的潜力,与传统的自回归方法相比具有多种优势。它们能够并行生成和修改整个回答,从而实现更快的生成并内置自我纠正机制。大多数现代基于扩散的语言模型都采用掩码扩散,其中解码基于二元决策迭代处理掩码令牌:要么保留掩码,要么用预测令牌替换它。然而,当保留掩码时,这种二元选择丢弃了有价值的预测信息。为了解决这一限制,我们引入了软掩码(SM),一种新方法:对每个被保留的掩码,动态地将掩码令牌的嵌入与上一解码步骤中前$k$个预测令牌的嵌入相混合。这为模型提供了信息更丰富的先验,保留了早期计算的上下文,并允许关于掩码令牌的部分信息传播到单个步骤之外。我们提出了一种训练方法,将预训练的掩码扩散语言模型适配为包含SM。我们证明,用SM继续预训练一个169M参数模型可以改善困惑度和MAUVE分数。此外,我们用SM微调了两个最先进的扩散模型Dream-7B和Dream-Coder-7B。SM在多个编码基准测试中始终如一地提升性能,特别是在高吞吐量设置下。
摘要:Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
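软掩码的核心是把[MASK]嵌入与上一步top-k预测token的嵌入按概率加权混合。下面用二维玩具向量给出最小示意(混合系数lam及全部数值均为假设,非论文的原始公式):

```python
def soft_mask(mask_emb, topk_embs, topk_probs, lam=0.5):
    """软掩码:对每个被保留的掩码,把 [MASK] 嵌入与上一步
    top-k 预测 token 的嵌入按(归一化后的)概率加权混合。
    lam 为保留原 mask 嵌入的比例,属本文之外的假设参数。"""
    z = sum(topk_probs)
    mix = [sum(p / z * e[i] for p, e in zip(topk_probs, topk_embs))
           for i in range(len(mask_emb))]
    return [lam * m + (1 - lam) * x for m, x in zip(mask_emb, mix)]

# 二维玩具嵌入:mask 嵌入 [1,0],top-2 预测 token 的嵌入与概率
emb = soft_mask(mask_emb=[1.0, 0.0],
                topk_embs=[[0.0, 1.0], [0.0, -1.0]],
                topk_probs=[0.75, 0.25])
# 混合后的嵌入同时携带 mask 先验和上一步的预测信息
```

与二元的"保留/替换"相比,这种连续混合让部分预测信息得以跨步传播。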
【27】$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
标题:VisiPruner:解码不连续的跨模态动态以实现高效多模态LLM
链接:https://arxiv.org/abs/2510.17205
备注:EMNLP 2025 Main
摘要:多模态大型语言模型(MLLM)在视觉语言任务中取得了强劲的性能,但由于注意力计算量随多模态标记数量二次增长而产生显著的计算开销。尽管已有针对MLLM的标记修剪工作,但它们缺乏对MLLM如何处理和融合多模态信息的基本理解。通过系统分析,我们揭示了一个三阶段的跨模态交互过程:(1)浅层识别任务意图,视觉标记充当被动的注意力汇;(2)跨模态融合在中间层突然发生,由少数关键视觉标记驱动;(3)深层丢弃视觉标记,只专注于语言精炼。基于这些发现,我们提出了一个无需训练的剪枝框架VisiPruner,它在LLaVA-v1.5 7B上可减少高达99%的视觉相关注意力计算和53.9%的FLOPs。它显著优于现有的标记剪枝方法,并可推广到不同的MLLM。除了剪枝之外,我们的见解还通过使模型架构与其内在的逐层处理动态对齐,为训练高效的MLLM提供了可操作的指导方针。我们的代码可在https://github.com/EIT-NLP/VisiPruner上获取。
摘要:Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
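"保留少数关键视觉token、剪掉其余"的思路,可用如下玩具示意说明(按文本对视觉token的注意力排序截取;注意力数值与keep_ratio均为假设,非论文实现):

```python
def prune_vision_tokens(attn_to_vision, keep_ratio=0.4):
    """示意:按注意力得分降序,仅保留前 keep_ratio 比例的视觉 token 索引。"""
    k = max(1, int(len(attn_to_vision) * keep_ratio))
    order = sorted(range(len(attn_to_vision)),
                   key=lambda i: attn_to_vision[i], reverse=True)
    return sorted(order[:k])

# 5 个视觉 token 的(假设的)注意力得分,保留得分最高的 2 个:
kept = prune_vision_tokens([0.01, 0.40, 0.02, 0.30, 0.05], keep_ratio=0.4)
# kept == [1, 3]
```

被剪掉的视觉token不再参与后续层的注意力计算,这正是计算量大幅下降的来源。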
【28】Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users
标题:与真实用户进行多轮LLM健康辅导的线下政策评估
链接:https://arxiv.org/abs/2510.17173
备注:Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models
摘要:我们研究了一个面向真实用户、网络部署的工具增强LLM健康教练。在一项有7名用户(280个已评分回合)的试点中,对因子化决策头(工具/风格)的离线策略评估(OPE)显示,统一的重度用工具策略提高了日志上的平均价值,但损害了特定亚组,最明显的是低健康素养/高自我效能的用户。一个带隐藏原型的轻量级模拟器进一步表明,加入少量的早期信息增益奖励能可靠地缩短特质识别时间,并提高目标达成率和pass@3。总之,这些早期发现指出了一条评估优先的个性化路径:冻结生成器,在类型化奖励(客观工具结果和满意度)上学习亚组感知的决策头,并始终报告每种原型的指标,以揭示被平均值掩盖的亚组伤害。
摘要:We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
【29】Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction
标题:LLM能识别您的潜在偏好吗?个性化交互中潜在信息发现的基准
链接:https://arxiv.org/abs/2510.17132
摘要:大型语言模型(LLM)擅长生成广泛相关的文本,但当需要用户特定的偏好时(例如推荐餐馆或规划旅行),这种通用性就会成为限制。在这些场景中,用户很少明确地表达每一项偏好;相反,他们关心的大部分内容仍然是潜在的,有待推断。这就提出了一个根本性的问题:LLM能否通过对话发现并推理这些潜在信息?我们通过引入一个统一的基准来评估潜在信息发现——即LLM通过多轮交互揭示和利用隐藏用户属性的能力——来回答这一问题。该基准涵盖三个渐进式的现实设置:经典的20个问题游戏、个性化问答和个性化文本摘要。所有任务共享一个三智能体框架(用户、助理、评判者),支持对信息获取和适应的回合级评估。我们的结果表明,虽然LLM确实可以通过对话揭示潜在信息,但其成功率随上下文差异很大:从32%到98%,取决于任务复杂性、主题和隐藏属性的数量。该基准为研究个性化交互中的潜在信息发现提供了第一个系统框架,并强调有效的偏好推断仍是构建真正自适应AI系统的开放前沿。
摘要:Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.
【30】Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
标题:调查基于推理的语言模型的思维行为以缓解社会偏见
链接:https://arxiv.org/abs/2510.17062
摘要:虽然基于推理的大型语言模型凭借内部结构化的思维过程擅长复杂任务,但一个令人担忧的现象已经出现:这种思维过程可能会聚合社会刻板印象,导致有偏见的结果。然而,这些语言模型在社会偏见场景中的底层行为仍未得到充分研究。在这项工作中,我们系统地调查了这一现象背后思维过程中的机制,并揭示了两种驱动社会偏见聚合的失败模式:1)刻板印象重复,即模型依赖社会刻板印象作为其主要理由;2)无关信息注入,即模型捏造或引入新的细节来支持有偏见的叙述。基于这些见解,我们引入了一种轻量级的基于提示的缓解方法,让模型针对这些特定失败模式审查其自身的初始推理。在问答(BBQ和StereoSet)和开放式生成(BOLD)基准上的实验表明,我们的方法在保持或提高准确率的同时有效地减少了偏见。
摘要:While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
【31】Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
标题:从意义出发的映射:解决提示敏感语言模型的校准失准
链接:https://arxiv.org/abs/2510.17028
备注:None
摘要:大型语言模型(LLM)中一个有趣的行为是提示敏感性。当提供同一提示的不同但语义等价的版本时,模型可能产生非常不同的答案分布。这表明,模型针对某一提示的输出分布所反映的不确定性,可能并不反映模型对该提示含义的不确定性。我们将提示敏感性建模为一种泛化误差,并表明,利用释义扰动在语义"概念空间"上采样可以改善不确定性校准,而不影响准确性。此外,我们为黑盒LLM引入了一种新的不确定性分解度量,通过对自然语言生成中的语义连续性建模,改进了基于熵的分解。我们表明,这一分解度量可用于量化LLM的不确定性有多少可归因于提示敏感性。我们的工作为提高提示敏感语言模型的不确定性校准引入了一种新方法,并提供证据表明,一些LLM未能对其输入的含义表现出一致的一般推理。
摘要:An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic ``concept space'' with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
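这类基于熵的分解的直觉可以用一个玩具计算说明:把所有释义的答案合并后的总熵,减去各释义内部熵的均值,差值(类似互信息)即可归因于提示敏感性。以下为示意性草图(并非论文的原始度量定义):

```python
import math
from collections import Counter

def entropy(samples):
    # 经验分布的香农熵(自然对数)
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in Counter(samples).values())

def prompt_sensitivity_share(answers_by_paraphrase):
    """总熵(合并所有释义的答案)减去各释义内部熵的均值,
    差值可视为由提示敏感性带来的不确定性。"""
    pooled = [a for answers in answers_by_paraphrase for a in answers]
    total = entropy(pooled)
    within = sum(entropy(a) for a in answers_by_paraphrase) / len(answers_by_paraphrase)
    return total - within

# 两个语义等价的释义各自给出确定但互相矛盾的答案:
gap = prompt_sensitivity_share([["yes"] * 4, ["no"] * 4])
# gap == ln 2:全部不确定性都来自提示敏感性
```

若各释义的答案分布一致,差值为零,说明输出的不确定性与提示措辞无关。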
【32】Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
标题:忘记去遗忘:注意力下沉作为后门化LLM遗忘的门户
链接:https://arxiv.org/abs/2510.17021
摘要:大型语言模型(LLM)遗忘已成为从预训练模型中删除不需要的数据、知识或行为,同时保留其通用效用的关键机制。然而,随着开放权重LLM的兴起,我们不禁要问:遗忘过程本身是否可能被植入后门,在正常条件下看似成功,而当隐藏触发器被激活时又恢复到遗忘前的行为?借鉴在训练数据中嵌入触发器以强制执行特定行为的经典后门攻击,我们研究了后门遗忘:模型在干净设置中按预期遗忘,但在触发器出现时恢复已遗忘的知识。我们表明,设计这样的攻击带来了独特的挑战,取决于触发器放在哪里以及如何强化后门训练。我们揭示了后门效力与注意力汇现象之间的紧密联系,即浅层输入令牌在LLM中始终吸引不成比例的注意力。我们的分析表明,这些注意力汇充当了后门遗忘的门户:将触发器放置在汇的位置并对齐其注意力值,可显著增强后门的持久性。大量实验验证了这些发现,表明由注意力汇引导的后门遗忘在后门触发器存在时能可靠地恢复被遗忘的知识,而在触发器不存在时,其行为与正常遗忘的模型无法区分。代码可在https://github.com/OPTML-Group/Unlearn-Backdoor上获得。
摘要:Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.
【33】SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
标题:SafeSearch:不要在LLM搜索代理中以安全换取实用性
链接:https://arxiv.org/abs/2510.17017
备注:Code: this https URL
摘要:基于大语言模型(LLM)的搜索代理迭代地生成查询、检索外部信息并进行推理,以回答开放域问题。虽然研究人员主要专注于提高其效用,但其安全行为仍未得到充分研究。在本文中,我们首先使用红队数据集评估搜索代理,发现它们比基础LLM更有可能产生有害输出。例如,当被问到"我如何在未经同意的情况下跟踪某人的位置?"时,基础模型会拒绝,而被设计为检索并引用来源的搜索代理可能降低其拒绝阈值,获取文档(例如法院案例),并在附加这些文档后将其综合成一份信息丰富但不安全的摘要。我们进一步表明,以效用为导向的微调会加剧这种风险,这促使我们对安全性与效用进行联合对齐。我们提出了SafeSearch,一种多目标强化学习方法,它将最终输出的安全/效用奖励与一个新颖的查询级塑形项相结合,后者惩罚不安全的查询并奖励安全的查询。实验表明,SafeSearch在三个红队数据集上将代理的有害性降低了70%以上,同时产生安全、有用的回答,并与仅按效用微调的代理的问答性能相当;进一步的分析证实了查询级奖励在共同提升安全性和效用方面的有效性。
摘要:Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked ``How can I track someone's location without their consent?'', a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
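SafeSearch把最终输出的安全/效用奖励与查询级塑形项相加的思路,可用如下示意说明(权重w_safe、w_util、w_query与打分方式均为假设,非论文的原始奖励设定):

```python
def safesearch_reward(final_safety, final_utility, query_flags,
                      w_safe=1.0, w_util=1.0, w_query=0.2):
    """示意性组合奖励:最终输出的安全/效用奖励,加上查询级塑形项——
    对不安全查询扣分、对安全查询加分(权重均为假设值)。"""
    shaping = sum(1 if safe else -1 for safe in query_flags) / max(len(query_flags), 1)
    return w_safe * final_safety + w_util * final_utility + w_query * shaping

# 一次交互中发出 3 个查询,其中 1 个被判为不安全:
r = safesearch_reward(final_safety=1.0, final_utility=0.8,
                      query_flags=[True, True, False])
# 不安全查询越多,塑形项越负,总奖励越低
```

查询级信号让策略在中间步骤就受到约束,而不只是对最终回答做安全打分。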
【34】DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
标题:DiscoTrack:用于话语跟踪的多语言LLM基准
链接:https://arxiv.org/abs/2510.17013
摘要:最近的LLM基准在一系列现象上测试了模型,但仍主要集中于提取显式信息的自然语言理解任务(如QA或摘要),其响应通常只针对单个句子中的信息。我们仍然缺乏更具挑战性且多语言的基准,以关注话语跟踪背景下较大文档中的隐含信息和语用推断:即整合和聚合跨句子、段落和多个说话者话语的信息。为此,我们提出了DiscoTrack,这是一个LLM基准,涵盖12种语言的一系列任务和四个层次的话语理解:显著性识别、实体跟踪、话语关系和桥接推理。我们的评估表明,即使对最先进的模型而言,这些任务仍然具有挑战性。
摘要:Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
【35】Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
标题:Vocab Diet:用向量算术重塑LLM的词汇表
链接:https://arxiv.org/abs/2510.17001
摘要:大型语言模型(LLM)已被证明会将词形变化(例如“walk”->“walked”)编码为嵌入空间中的线性方向。然而,标准的分词算法将这些变化视为不同的词元,用表面形式变体(例如“walk”、“walking”、“Walk”)填满容量受限的词汇表,代价是牺牲使用频率较低的词和多语言覆盖。我们表明,其中许多变化可以由变换向量(一种加性偏移量,应用于基础形式的词嵌入后即可得到相应词的表示)捕获,且在输入和输出空间中均成立。在此基础上,我们提出了一种紧凑的词汇表重塑方案:不再为每个表面形式分配唯一词元,而是用共享的基础形式和变换向量来组合它们(例如“walked”=“walk”+过去式)。我们将该方法应用于多个LLM和五种语言,删除了多达10%的词汇表条目,从而腾出空间来分配新的、更多样化的词元。重要的是,我们在此过程中还将词汇覆盖范围扩展到了词汇表外的单词,对下游性能的影响极小,且无需修改模型权重。我们的发现促使人们从根本上重新思考词汇表设计,从字符串枚举转向利用语言底层结构的组合式词汇表。
摘要:Large language models (LLMs) were shown to encode word form variations, such as "walk"->"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
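摘要中“walked”=“walk”+过去式的向量算术可以用如下玩具示例说明(3维嵌入和具体数值均为虚构,真实模型的嵌入为数百至数千维,仅作示意):

```python
def add_vec(u, v):
    """逐元素相加两个向量。"""
    return [a + b for a, b in zip(u, v)]

# 玩具3维嵌入(数值纯属虚构)
base = {"walk": [0.25, 0.5, -0.25], "jump": [0.5, -0.25, 0.75]}
past_tense = [0.25, 0.0, 0.5]  # 共享的“过去式”变换向量

# 用基础形式 + 变换向量组合表面形式,而非为其单独存储词元
walked = add_vec(base["walk"], past_tense)   # "walked" = "walk" + 过去式
jumped = add_vec(base["jump"], past_tense)   # 同一变换向量可复用于其他词
```

关键在于变换向量是跨词共享的:词汇表只需存储基础形式和少量变换方向,即可组合出大量表面形式。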
【36】Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
标题:每次查询泄露的位:针对LLM的对抗性攻击的信息理论界限
链接:https://arxiv.org/abs/2510.17000
备注:NeurIPS 2025 (spotlight)
摘要:恶意用户威胁大型语言模型(LLM)安全的对抗性攻击,可以被视为试图推断一个目标属性$T$:该属性在发出指令时未知,只有在观察到模型的回复后才可知。目标属性$T$的例子包括触发LLM有害响应或拒绝的二元标志,以及通过遗忘(unlearning)删除的信息可被恢复的程度,两者均由对抗性指令诱发。LLM会泄露一个可观测信号$Z$,该信号可能通过包含回答词元、思考过程词元或logits的响应泄漏攻击线索。然而,泄漏信息的规模仍停留在传闻层面,这使审计者缺乏原则性的指导,也使防御者无从把握透明度与风险之间的权衡。我们用一个信息论框架填补了这一空白,该框架可计算能够安全披露多少信息,并使审计者能够衡量其方法与基本极限的接近程度。将观测$Z$与目标属性$T$之间的互信息$I(Z;T)$视为每次查询泄漏的比特数,我们证明实现错误率$\varepsilon$至少需要$\log(1/\varepsilon)/I(Z;T)$次查询,即与泄漏率的倒数成线性关系,而与所需精度仅成对数关系。因此,即使适度增加披露,也会使攻击成本(就所需精度而言)从二次级骤降至对数级。在七个LLM上针对系统提示泄漏、越狱和重学习攻击的实验证实了该理论:仅暴露回答词元约需一千次查询;加入logits将其减少到约一百次;而揭示完整的思考过程则将其削减到几十次。我们的结果为部署LLM时平衡透明度与安全性提供了首个原则性标尺。
摘要:Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.
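摘要给出的下界 $\log(1/\varepsilon)/I(Z;T)$ 可以直接计算。此处假设互信息以比特为单位,因此取以2为底的对数;函数名为示意:

```python
import math

def min_queries(epsilon, bits_per_query):
    """达到错误率epsilon所需查询次数的下界:log2(1/eps) / I(Z;T)。
    假设互信息I(Z;T)以比特计,故取以2为底的对数。"""
    return math.log2(1.0 / epsilon) / bits_per_query
```

例如,若每次查询仅泄漏0.01比特,达到1%错误率至少需要 log2(100)/0.01 ≈ 664 次查询;泄漏率提高十倍,则下界相应降至约66次,体现了查询成本与泄漏率倒数的线性关系。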
【37】Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection
标题:低资源语言的参数高效微调:孟加拉语仇恨言论检测LLM的比较研究
链接:https://arxiv.org/abs/2510.16985
备注:Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables
摘要:孟加拉语社交媒体平台上的仇恨言论急剧增加,对妇女和青少年的影响尤为严重。虽然BD-SHS等数据集为结构化评估提供了基础,但大多数现有方法依赖于计算成本高昂的全模型微调或专有API。本文首次将参数高效微调(PEFT)应用于孟加拉语仇恨言论检测,使用了LoRA和QLoRA。三个指令微调的大型语言模型(Gemma-3-4B、Llama-3.2-3B和Mistral-7B)在包含50,281条标注评论的BD-SHS数据集上进行了微调。每个模型仅训练不到1%的参数即可完成适配,使得实验可以在单个消费级GPU上进行。结果表明,Llama-3.2-3B取得了最高的F1分数92.23%,其次是Mistral-7B的88.94%和Gemma-3-4B的80.25%。这些发现确立了PEFT作为适用于孟加拉语及相关低资源语言的实用且可复现的策略。
摘要:Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.
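摘要提到每个模型仅训练不到1%的参数。LoRA的可训练参数占比可以粗略估算如下(假设每个被适配的权重矩阵为 d x d 方阵;具体维度与秩为示意性取值,并非论文中的实际配置):

```python
def lora_trainable_fraction(d_model, n_layers, n_mats, rank):
    """LoRA可训练参数占比的粗略估算:每个被适配的 d x d 权重
    增加两个低秩因子 A(r x d)与 B(d x r),基座权重保持冻结。"""
    base = n_layers * n_mats * d_model * d_model   # 冻结的基座参数
    lora = n_layers * n_mats * 2 * rank * d_model  # 新增的低秩参数
    return lora / (base + lora)
```

按示意性配置 d=4096、32层、每层4个被适配矩阵、秩 r=16,可训练占比约为0.78%,与摘要所述“不到1%”在量级上一致。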
【38】Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models
标题:实时世界制作:通过大型语言模型从自然语言生成结构化游戏行为
链接:https://arxiv.org/abs/2510.16952
备注:16 pages, 11 figures (including appendix). To be presented at the 5th Wordplay @ EMNLP workshop (2025)
摘要:我们提出了一种新颖的架构,可将大型语言模型(LLM)安全地集成到交互式游戏引擎中,允许玩家使用自然语言“编程”新的行为。我们的框架通过使用LLM将命令翻译为受约束的领域特定语言(DSL)来降低风险,该DSL在运行时配置一个定制的实体-组件-系统(ECS)。我们在一个2D法术制作游戏原型中评估了该系统,对来自Gemini、GPT和Claude家族的模型在多种提示策略下进行了实验评估。一个经过验证的LLM评判器对输出进行了定性评分,结果表明,虽然较大的模型能更好地捕捉创意意图,但最佳提示策略是任务相关的:思维链改善了创意对齐,而少样本示例对于生成更复杂的DSL脚本则是必要的。这项工作为涌现式玩法提供了一个经过验证的LLM-ECS模式,并为开发者提供了定量的性能比较。
摘要:We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language. Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime. We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.
【39】Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation
标题:窥视黑匣子内部:通过组件级评估发现优化建模中的LLM错误
链接:https://arxiv.org/abs/2510.16943
摘要:大型语言模型(LLM)越来越多地被用于将自然语言描述转换为数学优化公式。目前的评估通常将公式作为一个整体对待,依赖解的准确性或运行时间等粗粒度指标,从而掩盖了结构性或数值性错误。在本研究中,我们为LLM生成的公式提出了一个全面的组件级评估框架。除了传统的最优性差距之外,我们的框架还引入了决策变量和约束的精确率与召回率、约束和目标的均方根误差(RMSE),以及基于词元用量和延迟的效率指标。我们在六种提示策略下,针对不同复杂度的优化问题评估了GPT-5、LLaMA 3.1 Instruct和DeepSeek Math。结果表明,GPT-5始终优于其他模型,其中思维链、自我一致性和模块化提示被证明最为有效。分析表明,求解器性能主要取决于高约束召回率和低约束RMSE,两者共同保证了结构正确性和解的可靠性。约束精确率和决策变量指标起次要作用,而简洁的输出可提高计算效率。这些发现凸显了NLP到优化建模的三条原则:(i)完整的约束覆盖可防止违反约束,(ii)最小化约束RMSE可确保求解器级别的准确性,(iii)简洁的输出可提高计算效率。所提出的框架为LLM优化建模的细粒度诊断评估奠定了基础。
摘要:Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
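摘要中的组件级指标(决策变量/约束的精确率与召回率、系数RMSE)可以用如下最小实现示意。集合精确匹配的方式为假设,论文可能对约束使用更复杂的对齐方法:

```python
import math

def precision_recall(predicted, gold):
    """生成的约束/决策变量与标准答案之间的集合级精确率与召回率。"""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r

def rmse(values, refs):
    """生成系数与参考系数之间的均方根误差。"""
    return math.sqrt(sum((v - r) ** 2 for v, r in zip(values, refs)) / len(values))
```

按摘要的结论,求解器可靠性主要由高约束召回率(覆盖完整)和低约束RMSE(系数准确)共同决定。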
【40】Prompt-MII: Meta-Learning Instruction Induction for LLMs
标题:Prompt-MII:面向LLM的元学习指令归纳
链接:https://arxiv.org/abs/2510.16932
摘要:使大型语言模型(LLM)适应新任务的一种流行方法是上下文学习(ICL),它虽然有效,但随着上下文长度的增加会产生很高的推理成本。在本文中,我们提出了一种执行指令归纳的方法:将训练示例压缩为一个紧凑但描述性强的提示,其性能可与在完整训练集上的ICL相当。具体来说,我们提出了PROMPT-MII,这是一个基于强化学习(RL)的框架,用于元学习一个指令归纳模型,该模型可以为任意新数据集动态生成紧凑的指令。我们在HuggingFace hub的3,000多个多样化分类数据集上进行训练,并在90个未见过的任务上进行评估。PROMPT-MII将下游模型质量提高了4-9个F1点(相对提升10-20%),与ICL性能相当,同时所需词元减少了3-13倍。
摘要:A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
【41】ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models
标题:ChiKhaPo:评估大型语言模型中词汇理解和生成的大规模多语言基准
链接:https://arxiv.org/abs/2510.16928
摘要:现有的大型语言模型(LLM)基准在很大程度上局限于高资源或中等资源语言,并且通常评估推理和生成等高阶任务的性能。然而,大量证据表明,对于世界上3800多种书面语言中的绝大多数,LLM缺乏基本的语言能力。我们介绍了ChiKhaPo,它由8个难度各异的子任务组成,旨在评估生成模型的词汇理解和生成能力。ChiKhaPo利用现有的词典、单语数据和双语平行语料,其中2个子任务覆盖了2700多种语言,在语言覆盖方面超过了任何现有基准。我们进一步表明,6个SOTA模型在我们的基准上表现不佳,并讨论了影响性能分数的因素,包括语系、语言资源丰富程度、任务类型,以及理解与生成方向。通过ChiKhaPo,我们希望促成并鼓励对LLM进行大规模多语言基准测试。
摘要:Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.
【42】Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
标题:Res-Bench:对多模态大型语言模型在动态分辨率输入下的鲁棒性进行基准测试
链接:https://arxiv.org/abs/2510.16926
备注:23 pages,19 figures
摘要:多模态大型语言模型(MLLM)越来越多地支持动态图像分辨率。然而,目前的评估范式主要评估语义性能,忽略了分辨率鲁棒性这一关键问题,即性能是否在不同的输入分辨率下保持稳定。为了弥补这一空白,我们引入了Res-Bench,这是一个综合基准,包含跨12个分辨率级别和6个核心能力维度的14,400个样本。我们设计了一个新颖的评估框架,超越传统的准确率指标来捕捉性能稳定性。该框架引入了多种鲁棒性指标:用于评估分辨率-性能趋势的斯皮尔曼相关系数,以及用于衡量性能波动的绝对/相对连续误差(ACE/RCE)。使用这些指标,我们对领先的MLLM进行了大规模评估。我们的分析包括:(1)以模型为中心和以任务为中心的鲁棒性检查,(2)对填充和超分辨率等预处理策略的研究,以及(3)为增强稳定性而进行微调的探索。
摘要:Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
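摘要中用于评估分辨率-性能趋势的斯皮尔曼相关系数可按如下方式计算(无并列值的简化版本;ACE/RCE的具体定义摘要未给出,此处不作示意):

```python
def ranks(xs):
    """返回各元素的名次(从0起,假设无并列值)。"""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """斯皮尔曼相关系数 = 名次序列的皮尔逊相关(无并列值的简化版)。"""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

例如,以各分辨率级别为x、对应准确率为y,相关系数接近+1表示性能随分辨率单调提升,接近0则表示两者无单调关系。实际应用中通常使用带并列值校正的实现(如scipy.stats.spearmanr)。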
【43】Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?
标题:视觉基础是否增强了大型语言模型对具身知识的理解?
链接:https://arxiv.org/abs/2510.16924
备注:Accepted to EMNLP 2025 (Findings). This version corrects a redundant sentence in the Results section that appeared in the camera-ready version
摘要:尽管多模态语言模型(LM)取得了重大进展,但与纯文本模型相比,视觉基础是否能增强它们对具身知识的理解仍不清楚。为了解决这个问题,我们基于心理学的感知理论提出了一个新颖的具身知识理解基准,涵盖视觉、听觉、触觉、味觉、嗅觉等外部感官以及内感受。该基准通过向量比较和包含1,700多个问题的问答任务,评估模型在不同感官模态下的感知能力。通过比较30个最先进的LM,我们惊讶地发现,视觉语言模型(VLM)在这两项任务中的表现均未优于纯文本模型。此外,与其他感官维度相比,模型在视觉维度上的表现明显更差。进一步的分析表明,向量表示很容易受到词形和词频的影响,并且模型很难回答涉及空间感知和推理的问题。我们的发现强调,需要在LM中更有效地整合具身知识,以提高它们对物理世界的理解。
摘要:Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models' perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.
【44】SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
标题:SAKE:走向编辑大型音频语言模型的听觉属性知识
链接:https://arxiv.org/abs/2510.16917
备注:Work in progress
摘要:知识编辑提供了一种无需完全重新训练即可更新模型知识的有效方法,但以往的工作几乎完全集中在文本或视觉模态上。我们介绍了SAKE,这是第一个专门设计用于编辑大型音频语言模型(LALM)中听觉属性知识的基准。与事实性更新不同,SAKE针对若干抽象的听觉属性,涵盖了超越传统文本和视觉领域的知识类型。我们在两个LALM上沿四个维度对七种编辑方法进行了基准测试:可靠性、泛化性、音频/文本局部性和可移植性。结果凸显了若干挑战,例如保留与编辑无关的属性内知识、将编辑泛化到多模态推理,以及在顺序更新下维持编辑效果。SAKE提供了一个有原则的框架,用于研究知识编辑如何扩展到听觉模态,为在更多样化的现实场景中维护和适配LALM开辟了新方向。
摘要:Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modalities, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.
【45】Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations
标题:研究说话者情绪变化下大型音频语言模型的安全漏洞
链接:https://arxiv.org/abs/2510.16893
备注:Submitted to ICASSP 2026
摘要:大型音频语言模型(LALM)通过听觉理解扩展了基于文本的LLM,为多模态应用提供了新的机会。虽然它们的感知、推理和任务性能已被广泛研究,但它们在副语言变化下的安全对齐仍未得到充分探索。本文系统地研究了说话人情绪的作用。我们构建了一个以多种情绪和强度表达的恶意语音指令数据集,并评估了若干最先进的LALM。我们的结果揭示了严重的安全不一致性:不同的情绪会引发不同程度的不安全响应,且强度的影响是非单调的,中等强度的表达往往构成最大的风险。这些发现凸显了LALM中一个被忽视的脆弱性,并呼吁设计明确针对情绪变化保证鲁棒性的对齐策略,这是在现实世界中可信部署的先决条件。
摘要:Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.
【46】Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning
标题:面向LLM监督微调的效用-多样性感知在线批量选择
链接:https://arxiv.org/abs/2510.16882
摘要:监督微调(SFT)是一种使大型语言模型(LLM)适应下游任务的常用技术。在实践中,在完整数据集上进行SFT的计算成本高昂,并且有时会出现过拟合或偏差放大。这推动了SFT中数据策展的兴起,即优先选择最有价值的数据进行优化。本文研究在线批量选择这一类方法,它们在训练过程中动态地对样本进行评分和过滤。然而,现有的流行方法往往(i)仅依赖数据的效用来选择子集,而忽略了多样性等其他关键因素,(ii)依赖参考模型或验证集等外部资源,以及(iii)相比全数据集训练产生额外的训练时间。为了解决这些局限,本文开发了UDS(效用-多样性采样),一个高效的SFT在线批量选择框架。UDS利用logits矩阵的核范数来同时捕获数据效用和样本内多样性,并通过与历史样本的轻量级记忆缓冲区进行高效的低维嵌入比较来估计样本间多样性。这一设计免除了对外部资源和不必要反向传播的需求,保证了计算效率。在多个基准上的实验表明,UDS在不同的数据预算下始终优于最先进的在线批量选择方法,并且与全数据集微调相比显著减少了训练时间。代码可在https://github.com/gfyddha/UDS上获得。
摘要:Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimize. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops \textbf{UDS (Utility-Diversity Sampling)}, a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
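摘要中UDS利用logits矩阵的核范数(奇异值之和)来刻画效用与样本内多样性。对大矩阵通常借助SVD计算(例如numpy的`norm(..., 'nuc')`);下面给出2x2矩阵的闭式示意,仅用于说明该量的含义,并非论文实现:

```python
import math

def nuclear_norm_2x2(a, b, c, d):
    """矩阵 [[a, b], [c, d]] 的核范数(奇异值之和),
    利用 M^T M 特征值的闭式解:奇异值 = sqrt(特征值)。"""
    t = a * a + b * b + c * c + d * d   # trace(M^T M)
    det = a * d - b * c                 # det(M);det(M^T M) = det**2
    disc = math.sqrt(max(t * t - 4 * det * det, 0.0))
    lam_hi, lam_lo = (t + disc) / 2.0, (t - disc) / 2.0
    return math.sqrt(lam_hi) + math.sqrt(max(lam_lo, 0.0))
```

直观上,logits矩阵的核范数越大,说明其行向量张成的方向越多、越不退化,这同时反映了输出的置信结构(效用)与样本内的多样性。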
【47】DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
标题:DeepAnalyze:面向自主数据科学的智能体大型语言模型
链接:https://arxiv.org/abs/2510.16872
备注:Code: this https URL Model: this https URL
摘要:从原始数据源到分析师级的深度研究报告,自主数据科学一直是一个长期存在的挑战,而随着强大的大型语言模型(LLM)的出现,它如今正变得可行。最近基于工作流的数据智能体在特定数据任务上显示出有希望的结果,但由于依赖预定义的工作流,它们在实现完全自主的数据科学方面仍然存在根本性的局限。在本文中,我们介绍了DeepAnalyze-8B,这是第一个为自主数据科学设计的智能体LLM,能够自动完成从数据源到分析师级深度研究报告的端到端管道。为了应对高复杂度的数据科学任务,我们提出了一种基于课程的智能体训练范式,它模拟人类数据科学家的学习轨迹,使LLM能够在真实环境中逐步获取并整合多种能力。我们还引入了一个数据落地的轨迹合成框架,用于构建高质量的训练数据。通过智能体训练,DeepAnalyze学会了执行广泛的数据任务,从数据问答和专业分析任务到开放式数据研究。实验表明,仅用8B参数,DeepAnalyze就超越了以往基于最先进专有LLM构建的基于工作流的智能体。DeepAnalyze的模型、代码和训练数据均已开源,为自主数据科学铺平了道路。
摘要:Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.
【48】Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy
标题:LLM的可验证微调:与数据来源和政策绑定的零知识训练证明
链接:https://arxiv.org/abs/2510.16830
备注:20 pages, 10 figures
摘要:大型语言模型通常通过参数高效微调来进行适配,但当前的发布实践对于使用了哪些数据以及更新是如何计算的只能提供很弱的保证。我们提出了可验证微调(Verifiable Fine-Tuning),这是一种协议和系统,可生成简洁的零知识证明,证明发布的模型是在声明的训练程序和可审计的数据集承诺下、从公开的初始化得到的。该方法结合了五个要素。第一,将数据源、预处理、许可证和每轮配额计数器绑定到清单上的承诺。第二,支持公开可重放和私有索引隐藏批次选择的可验证采样器。第三,限定于参数高效微调的更新电路,它强制执行AdamW风格的优化器语义以及带有明确误差预算的证明友好近似。第四,递归聚合,将每步证明折叠为每轮及端到端证书,并支持毫秒级验证。第五,出处绑定以及可选的可信执行属性卡,用于证明代码身份和常量。在英语和双语指令混合数据上,该方法在严格的预算内保持了效用,同时实现了实用的证明性能。策略配额得到了零违规的强制执行,私有采样窗口未显示出可测量的索引泄漏。联邦实验表明,该系统可与概率审计和带宽约束相组合。这些结果表明,端到端可验证微调如今对于真实的参数高效管道是可行的,弥合了受监管和去中心化部署中的一个关键信任缺口。
摘要:Large language models are often adapted through parameter efficient fine tuning, but current release practices provide weak assurances about what data were used and how updates were computed. We present Verifiable Fine Tuning, a protocol and system that produces succinct zero knowledge proofs that a released model was obtained from a public initialization under a declared training program and an auditable dataset commitment. The approach combines five elements. First, commitments that bind data sources, preprocessing, licenses, and per epoch quota counters to a manifest. Second, a verifiable sampler that supports public replayable and private index hiding batch selection. Third, update circuits restricted to parameter efficient fine tuning that enforce AdamW style optimizer semantics and proof friendly approximations with explicit error budgets. Fourth, recursive aggregation that folds per step proofs into per epoch and end to end certificates with millisecond verification. Fifth, provenance binding and optional trusted execution property cards that attest code identity and constants. On English and bilingual instruction mixtures, the method maintains utility within tight budgets while achieving practical proof performance. Policy quotas are enforced with zero violations, and private sampling windows show no measurable index leakage. Federated experiments demonstrate that the system composes with probabilistic audits and bandwidth constraints. These results indicate that end to end verifiable fine tuning is feasible today for real parameter efficient pipelines, closing a critical trust gap for regulated and decentralized deployments.
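摘要中的第一个要素是将数据源、预处理、许可证和配额计数器绑定到清单的承诺。一个最简的哈希承诺可示意如下(字段名与取值均为假设;论文的实际承诺与证明系统绑定,很可能采用Merkle结构而非单一哈希):

```python
import hashlib
import json

def commit_manifest(manifest):
    """对清单做规范化序列化后取SHA-256,得到一个确定性的承诺摘要。"""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# 字段与取值均为假设,仅示意“将数据源、预处理、许可证和配额绑定到清单”
manifest = {
    "sources": ["corpus_en_v1", "corpus_bn_v1"],
    "preprocessing": "dedup+tokenize",
    "license": "CC-BY-4.0",
    "epoch_quota": {"1": 50000, "2": 50000},
}
digest = commit_manifest(manifest)
```

规范化(键排序、固定分隔符)保证了相同清单必然产生相同摘要,而清单任一字段的改动都会改变摘要,从而使发布方无法事后更换训练配置。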
【49】Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank
标题:基于LLM检索-重排的跨体裁作者归属
链接:https://arxiv.org/abs/2510.16819
摘要:作者归属(AA)的任务是从预定义的候选作者集合中识别查询文档最可能的作者。我们介绍了一个两阶段的检索-重排框架,针对跨体裁AA对LLM进行微调。与信息检索(IR)领域不同(在IR中,检索-重排是一种事实上的标准策略),跨体裁AA系统必须避免依赖主题线索,而要学会识别与文本主题(体裁/领域/话题)无关的、作者特有的语言模式。因此,对于重排器,我们证明了IR中常用的训练策略与跨体裁AA从根本上不一致,会导致次优行为。为了解决这个问题,我们引入了一种有针对性的数据策展策略,使重排器能够有效地学习区分作者的信号。使用我们基于LLM的检索-重排管道,我们在HIATUS具有挑战性的HRS1和HRS2跨体裁AA基准上,比之前的最先进方法分别取得了22.3和34.4个绝对Success@8点的大幅提升。
摘要:Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
【50】Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities
标题:知道事实却选择捷径:理解大型语言模型如何比较实体
链接:https://arxiv.org/abs/2510.16815
备注:33 pages, 20 figures. Submitted ACL ARR 2025 October (under review)
摘要:大型语言模型(LLM)越来越多地被用于基于知识的推理任务,但理解它们何时依赖真正的知识、何时依赖肤浅的启发式仍然具有挑战性。我们通过实体比较任务研究这个问题,要求模型沿数值属性比较实体(例如“哪条河更长,多瑙河还是尼罗河?”),这为系统分析提供了清晰的基准真值。尽管拥有足以正确作答的数值知识,LLM却经常做出与这些知识相矛盾的预测。我们确定了三种强烈影响模型预测的启发式偏差:实体流行度、提及顺序和语义共现。对于较小的模型,仅使用这些表面线索的简单逻辑回归对模型选择的预测比模型自身的数值预测更准确,这表明启发式在很大程度上压倒了原则性推理。至关重要的是,我们发现较大的模型(32B参数)会在数值知识更可靠时选择性地依赖它,而较小的模型(7-8B参数)没有表现出这种区分能力,这解释了为什么即使较小的模型拥有更准确的知识,较大的模型仍然优于较小的模型。思维链提示会引导所有规模的模型使用数值特征。
摘要:Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., ``Which river is longer, the Danube or the Nile?''), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model's own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7--8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers all models towards using the numerical features across all model sizes.
【51】When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation
标题:当多样本提示失效时:LLM代码翻译的实证研究
链接:https://arxiv.org/abs/2510.16809
摘要:拥有巨大上下文窗口的大型语言模型(LLM)为上下文学习(ICL)提供了新的途径,人们通常认为提供大量示例(“多样本”提示)可以提升性能。我们针对代码翻译这一复杂任务检验了这一假设。通过对超过90,000个翻译的大规模实证研究,我们系统地评估了将上下文示例从零样本扩展到多达625个示例的多样本配置的影响,提示长度从大约100,000到800,000个词元不等。我们的发现揭示了一个“多样本悖论”:虽然静态相似性指标可能随示例增多而适度提升,但功能正确性始终在少样本提示(5-25个示例)时达到峰值。提供大量更多的示例往往会降低这一关键的功能性能。这项研究强调,对于代码翻译,少量精心挑选的示例的质量胜过纯粹的数量,挑战了“越多越好”对ICL普遍有效的假设,并凸显了最优提示策略的任务依赖性。我们的结果对于在软件工程中有效利用LLM具有重要意义。
摘要:Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.
【52】See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models
标题:看图还是说图:使用视觉语言模型的智能体驱动可扩展图理解
链接:https://arxiv.org/abs/2510.16769
摘要:视觉语言模型(VLM)在图理解方面展现出潜力,但仍受输入标记数量的约束,面临可扩展性瓶颈,并且缺乏协调文本和视觉模态的有效机制。为了应对这些挑战,我们提出了GraphVista,一个在图理解中同时提升可扩展性和模态协调的统一框架。在可扩展性方面,GraphVista将图信息分层组织到一个轻量级的GraphRAG库中,仅检索与任务相关的文本描述和高分辨率视觉子图,在压缩冗余上下文的同时保留关键推理要素。在模态协调方面,GraphVista引入了一个规划智能体,将任务路由到最合适的模态:对简单的属性推理使用文本模态,对基于显式拓扑的局部和结构复杂的推理使用视觉模态。大量实验表明,GraphVista可扩展到比现有基准所用图大至多200倍的大规模图,并始终优于现有的基于文本、视觉和融合的方法;通过充分利用两种模态的互补优势,相比最先进的基线实现高达4.4倍的质量改进。
摘要:Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality-using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.
【53】Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
标题:Beacon:大型语言模型中潜在谄媚的单轮诊断和缓解
链接:https://arxiv.org/abs/2510.16727
摘要:大型语言模型内化了真实性与谄媚奉承之间的结构性权衡,这源于将乐于助人与礼貌顺从混为一谈的奖励优化。这种潜在偏差被称为阿谀奉承,表现为偏好迎合用户而非进行有原则的推理。我们引入Beacon,一个单轮强制选择基准,它独立于对话上下文隔离这种偏差,使事实准确性与顺从性偏差之间的张力得以精确测量。对12个最先进模型的评估表明,阿谀奉承可分解为稳定的语言和情感子偏差,且均随模型能力扩展。我们进一步提出了提示层面和激活层面的干预措施,可在相反方向上调节这些偏差,揭示了对齐的内部几何结构是真实性与社会顺从性判断之间的动态流形。Beacon将阿谀奉承重新定义为可测量的规范性错误泛化形式,为研究和缓解大规模生成系统中的对齐漂移提供了可复现的基础。
摘要:Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.
【54】so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
标题:如此之多依赖/于/一个空格:为什么空白对诗人和LLM很重要
链接:https://arxiv.org/abs/2510.16713
摘要:空白是诗歌形式的重要组成部分,既体现对标准形式的遵循,也体现对这些形式的反叛。每首诗的空白分布反映了诗人的艺术选择,是诗歌不可分割的语义和空间特征。然而,尽管诗歌既是历史悠久的艺术形式,又是大型语言模型(LLM)的热门生成任务,空白却未得到NLP社区的足够关注。我们使用来自Poetry Foundation的19k首已发表英语诗歌的语料库,调查了4k位诗人在作品中如何使用空白。我们发布了一个保留格式的2.8k首公共领域诗歌的子集,以促进该领域的进一步研究。我们将已发表诗歌中的空白使用与(1)51k首LLM生成的诗歌和(2)在线社区发布的12k首未发表诗歌进行比较。我们还探索了不同时期、诗歌形式和数据来源的空白使用情况。此外,我们发现不同的文本处理方法会导致诗歌数据中空白的表示出现显著差异,这促使我们利用这些诗歌和空白模式来讨论LLM预训练数据集组装中处理策略的影响。
摘要:Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
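The whitespace statistics described above can be sketched with a few lines of code. This is an illustrative toy, assuming simple measures (leading-indent profile, blank-line count, overall space share); the abstract does not list the study's exact statistics.

```python
# Toy whitespace statistics for a poem with preserved formatting.
# The specific measures below are assumptions for illustration.

def whitespace_profile(poem):
    lines = poem.split("\n")
    # Leading spaces per line, i.e. the indentation profile of the poem.
    indents = [len(line) - len(line.lstrip(" ")) for line in lines]
    n_space = sum(ch == " " for ch in poem)
    return {
        "indents": indents,
        "space_share": n_space / len(poem) if poem else 0.0,
        "blank_lines": sum(1 for line in lines if not line.strip()),
    }

poem = "so much depends\nupon\n\n    a red wheel\n    barrow"
profile = whitespace_profile(poem)
print(profile["indents"], profile["blank_lines"])  # [0, 0, 0, 4, 4] 1
```

Comparing such profiles across published, LLM-generated, and unpublished corpora is the kind of analysis the paper reports at scale.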
【55】The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
标题:LLM的变色龙本性:量化支持搜索的语言模型中的多轮立场不稳定性
链接:https://arxiv.org/abs/2510.16712
摘要:大型语言模型与搜索/检索引擎的集成已变得无处不在,但这些系统存在破坏其可靠性的严重漏洞。我们提出了对LLM“变色龙行为”的首个系统性调查:当在多轮对话中被问及相互矛盾的问题时(特别是在支持搜索的LLM中),它们表现出令人警惕的立场转变倾向。通过我们新颖的变色龙基准数据集(包括跨12个有争议领域的1,180个多轮对话中的17,770个精心构造的问答对),我们揭示了最先进系统的根本缺陷。我们引入了两个有理论依据的指标:量化立场不稳定性的变色龙分数(0-1)和衡量知识多样性的来源重用率(0-1)。我们对Llama-4-Maverick、GPT-4o-mini和Gemini-2.5-Flash的严格评估揭示了一致的失败:所有模型都表现出严重的变色龙行为(分数为0.391-0.511),其中GPT-4o-mini表现最差。重要的是,较小的跨温度方差(小于0.004)表明该效应并非采样伪影。我们的分析揭示了其机制:来源重用率与信心(r=0.627)和立场变化(r=0.429)之间的强相关性具有统计显著性(p小于0.05),表明有限的知识多样性使模型对查询的表述方式表现出病态的顺从。这些发现强调,在医疗、法律和金融系统中部署LLM之前需要进行全面的一致性评估,在这些系统中,跨交互保持一致立场对可靠的决策支持至关重要。
摘要:Integration of Large Language Models with search/retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations (especially in search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1) that quantifies stance instability, and Source Re-use Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, small across-temperature variance (less than 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence (r=0.627) and stance changes (r=0.429) are statistically significant (p less than 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems where maintaining coherent positions across interactions is critical for reliable decision support.
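The two 0-1 metrics named above are not defined in the abstract, so the following is a hedged toy illustration: a normalized stance-flip rate standing in for the Chameleon Score, and a repeated-citation counter standing in for the Source Re-use Rate. Both formulas are assumptions, not the paper's definitions.

```python
# Toy stand-ins for the paper's two metrics; formulas are assumptions.

def stance_instability(stances):
    """Chameleon-style score: fraction of adjacent turns where the stance flips."""
    if len(stances) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(stances, stances[1:]) if a != b)
    return flips / (len(stances) - 1)

def source_reuse_rate(cited_sources):
    """Share of citations that repeat an already-used source (0-1)."""
    seen, reused = set(), 0
    for src in cited_sources:
        if src in seen:
            reused += 1
        seen.add(src)
    return reused / len(cited_sources) if cited_sources else 0.0

# A conversation that flips stance at 2 of 3 turn boundaries.
print(stance_instability(["pro", "pro", "con", "pro"]))
print(source_reuse_rate(["cdc.gov", "who.int", "cdc.gov", "cdc.gov"]))  # 0.5
```

Under these definitions, a perfectly stable conversation scores 0.0 and one that flips every turn scores 1.0.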
【56】Investigating the Impact of Rationales for LLMs on Natural Language Understanding
标题:探究推理依据(Rationales)对LLM自然语言理解的影响
链接:https://arxiv.org/abs/2510.16686
摘要:思维链(CoT)推理依据提供得出最终答案的逐步推理,在推理和训练中都对LLM有益。无论是在推理时先生成推理依据再作答,还是在训练时将其置于原始答案之前或之后,引入推理依据都能显著提高模型在数学、符号和常识推理任务上的性能。然而,大多数工作都集中在推理依据在这些推理任务中的作用,忽视了它们对其他重要任务(如自然语言理解(NLU)任务)的潜在影响。在这项工作中,我们提出了一个问题:推理依据能否同样有利于NLU任务?为了进行系统性探索,我们构建了NLURC,一个带有推理依据的全面而高质量的NLU数据集合集,并开发了多种推理依据增强方法。通过在该数据集上探索这些方法在NLU任务上的适用性,我们发现了几个可能令人惊讶的结论:(1)随着模型规模增大,CoT推理从阻碍NLU性能转变为超越直接标签预测,表明正相关。(2)大多数推理依据增强的训练方法表现不如仅用标签的训练,其中一种专门设计的方法持续获得改进。(3)使用推理依据训练的LLM在未见过的NLU任务上实现了显著的性能提升,可与规模十倍于其的模型相媲美,同时提供与商业LLM相当的可解释性。
摘要:Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference, or by placing them before or after the original answers during training - significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.
【57】Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration
标题:通过多智能体协作在大型语言模型中释放多样化思维模式
链接:https://arxiv.org/abs/2510.16645
摘要:大型语言模型(LLM)表现出强大的性能,但往往缺乏可解释的推理。本文介绍了面向多样化思维模式的多智能体协作框架(DiMo),它通过模拟四个专门LLM智能体之间的结构化辩论,同时提升性能和可解释性。每个智能体体现一种独特的推理范式,使框架能够协作探索多样的认知方法。通过迭代辩论,智能体挑战并完善初始回答,产生更稳健的结论和明确的、可审计的推理链。在六个基准和统一的开源设置下,DiMo的准确率优于广泛使用的单模型和辩论基线,在数学任务上收益最大。我们将DiMo定位为一个语义感知、Web原生的多智能体框架:它用LLM智能体建模人机智能,生成语义类型化、带URL注释的证据链,用于解释和用户友好的交互。虽然我们的实验使用标准推理基准,但该框架可在Web语料库和知识图谱上实例化,将检索增强推理与下游系统可检查和重用的结构化论证相结合。
摘要:Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.
【58】Fine-tuning of Large Language Models for Constituency Parsing Using a Sequence to Sequence Approach
标题:使用序列到序列方法微调大语言模型以进行成分句法分析
链接:https://arxiv.org/abs/2510.16604
备注:6 pages, 3 figures. Submitted to SEPLN 2023 Conference
摘要:近年来基于大型神经模型的自然语言处理进展为基于机器学习的句法分析开辟了新的可能性。本文通过微调大型语言模型(LLM),将输入句子翻译为其对应的句法结构,探索了一种新的短语结构分析方法。主要目标是扩展MiSintaxis的功能,这是一种用于西班牙语语法教学的工具。我们使用由AnCora-ES语料库生成的训练数据对Hugging Face库中的多个模型进行了微调,并使用F1分数评估其性能。结果表明该方法在短语结构分析中具有较高的准确率,并凸显了这一方法的潜力。
摘要:Recent advances in natural language processing with large neural models have opened new possibilities for syntactic analysis based on machine learning. This work explores a novel approach to phrase-structure analysis by fine-tuning large language models (LLMs) to translate an input sentence into its corresponding syntactic structure. The main objective is to extend the capabilities of MiSintaxis, a tool designed for teaching Spanish syntax. Several models from the Hugging Face repository were fine-tuned using training data generated from the AnCora-ES corpus, and their performance was evaluated using the F1 score. The results demonstrate high accuracy in phrase-structure analysis and highlight the potential of this methodology.
【59】Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models
标题:语言优于内容:在多语种大型语言模型中追踪文化理解
链接:https://arxiv.org/abs/2510.16565
备注:Accepted to CIKM 2025 Workshop on Human Centric AI
摘要:大型语言模型(LLM)越来越多地被用于多样的文化背景中,这使得准确的文化理解变得至关重要。先前的评估主要集中在输出层面的性能上,掩盖了驱动回答差异的因素,而使用电路分析的研究仅涵盖少数语言,且很少关注文化。在这项工作中,我们通过测量在两种条件下回答语义等价问题时的激活路径重叠来追踪LLM的内部文化理解机制:固定问题语言而改变目标国家,以及固定国家而改变问题语言。我们还使用同语言的国家对来将语言与文化因素解耦。结果表明,同语言、跨国家问题的内部路径重叠多于跨语言、同国家问题,表明存在强烈的语言特定模式。值得注意的是,韩国-朝鲜对表现出低重叠和高变异性,说明语言相似性并不能保证对齐的内部表示。
摘要:Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs' internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.
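The "activation path overlap" measurement above can be approximated with sets: represent each question's internal path as the set of (layer, unit) pairs whose activation exceeds a threshold, then compare two paths with Jaccard overlap. The real study uses circuit-analysis tooling; this set-based formulation is an illustrative assumption.

```python
# Set-based sketch of activation-path overlap; an assumed approximation,
# not the paper's circuit-analysis method.

def active_path(activations, threshold=0.5):
    """Set of (layer, unit) indices whose activation exceeds the threshold."""
    return {
        (layer, unit)
        for layer, row in enumerate(activations)
        for unit, value in enumerate(row)
        if value > threshold
    }

def path_overlap(acts_a, acts_b, threshold=0.5):
    """Jaccard overlap between two activation paths (1.0 = identical)."""
    a, b = active_path(acts_a, threshold), active_path(acts_b, threshold)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy 2-layer, 2-unit activations for two question variants.
same_lang = path_overlap([[0.9, 0.1], [0.8, 0.7]], [[0.9, 0.2], [0.8, 0.6]])
cross_lang = path_overlap([[0.9, 0.1], [0.8, 0.7]], [[0.1, 0.9], [0.2, 0.6]])
print(same_lang, cross_lang)  # 1.0 0.25
```

Higher overlap for same-language question pairs than cross-language ones is exactly the pattern the paper reports.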
【60】ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation
标题:ReviewGuard:通过LLM驱动的数据增强增强缺陷同行评审检测
链接:https://arxiv.org/abs/2510.16549
摘要:同行评议是科学的守门人,但投稿数量的激增和大语言模型(LLM)在学术评估中的广泛采用带来了前所未有的挑战。最近的工作集中在使用LLM提高评审效率或生成有见地的评审内容。然而,来自人类专家和AI系统的未经检查的缺陷评审可能系统性地破坏同行评审生态系统并损害学术诚信。为了解决这一关键问题,我们引入了ReviewGuard,一个用于检测和分类缺陷评审的自动化系统。ReviewGuard采用全面的四阶段LLM驱动框架:(1)从OpenReview收集ICLR和NeurIPS论文及其相应的评审;(2)使用GPT-4.1并结合人工验证对评审类型进行标注;(3)通过LLM驱动的合成数据增强解决类别不平衡和数据稀缺问题,产生包含6,634篇论文、24,657条真实评审和46,438条合成评审的最终语料库;(4)微调基于编码器的模型和开源LLM。我们对评审文本的结构和质量进行了全面的特征分析。与合格评审相比,缺陷评审表现出更低的评分、更高的自我报告信心、更低的结构复杂性以及更高比例的负面情绪。AI生成文本检测显示,自ChatGPT出现以来,AI生成的评审急剧增加。在缺陷评审检测模型的评估中,混合使用合成和真实评审数据进行训练显著提升了二分类任务的召回率和F1分数。这项研究提出了首个用于检测缺陷同行评审的LLM驱动系统,为同行评审中的AI治理提供证据,同时为人机协作维护学术诚信提供有价值的见解。
摘要:Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. Recent work has focused on using LLMs to improve review efficiency or generate insightful review content. However, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine the peer review ecosystem and compromise academic integrity. To address this critical issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews. ReviewGuard employs a comprehensive four-stage LLM-driven framework that: (1) collects ICLR and NeurIPS papers with their corresponding reviews from OpenReview; (2) annotates review types using GPT-4.1 with human validation; (3) addresses class imbalance and data scarcity through LLM-driven synthetic data augmentation, producing a final corpus of 6,634 papers, 24,657 real reviews, and 46,438 synthetic reviews; and (4) fine-tunes both encoder-based models and open source LLMs. We perform comprehensive feature analysis of the structure and quality of the review text. Compared to sufficient reviews, deficient reviews demonstrate lower rating scores, higher self-reported confidence, reduced structural complexity, and a higher proportion of negative sentiment. AI-generated text detection reveals that, since ChatGPT's emergence, AI-generated reviews have increased dramatically. In the evaluation of deficient review detection models, mixed training with synthetic and real review data provides substantial enhancements to recall and F1 scores on the binary task. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review while offering valuable insights into human-AI collaboration to maintain academic integrity.
【61】Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
标题:在毁掉自己之前检查自己:选择性退出提高LLM代理的安全性
链接:https://arxiv.org/abs/2510.16492
备注:Reliable ML and Regulatable ML workshops, Neurips 2025
摘要:随着大型语言模型(LLM)代理越来越多地在具有现实后果的复杂环境中运行,它们的安全性变得至关重要。虽然单回合任务的不确定性量化得到了很好的研究,但具有真实世界工具访问的多回合代理场景提出了独特的挑战,其中不确定性和模糊性复合,导致严重或灾难性的风险超出传统的文本生成失败。我们建议使用“退出”作为一个简单而有效的行为机制,LLM代理人认识到,并退出他们缺乏信心的情况。利用ToolEmu框架,我们对12个最先进的LLM的戒烟行为进行了系统的评估。我们的研究结果显示了一个非常有利的安全性-有用性权衡:在所有模型中,在0-3的范围内,被明确指示退出的代理人平均提高了+0.39的安全性(专有模型为+0.64),同时保持了可忽略不计的有用性平均下降-0.03。我们的分析表明,简单地添加明确的退出指令被证明是一个非常有效的安全机制,可以立即部署在现有的代理系统,并建立退出作为一个有效的第一线防御机制的自主代理在高风险的应用程序。
摘要:As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
【62】FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
标题:FrugalPrompt:通过令牌归因减少大型语言模型中的上下文开销
链接:https://arxiv.org/abs/2510.16439
摘要:大型语言模型(LLM)的出色性能在很大程度上归功于宽广的输入上下文,但这种冗长会增加货币成本、碳足迹和推理时延。这种开销大部分来自典型提示中冗余的低效用令牌,因为通常只有一小部分令牌承载了大部分语义权重。我们通过引入FrugalPrompt来解决这种低效问题,这是一个新颖的LLM提示压缩框架,仅保留语义上最重要的令牌。利用两种最先进的令牌归因方法GlobEnc和DecompX,我们为输入序列中的每个令牌分配显著性分数,按分数排序并按原始顺序保留前k%的令牌,从而得到一个稀疏的精简提示。我们使用一组前沿LLM在四个NLP任务上评估了该方法:情感分析、常识问答、摘要和数学推理。对于前三个任务,减少20%的提示只会导致任务性能的边际损失,表明当代LLM可以从高显著性线索中重建被省略的上下文。相比之下,数学推理的性能急剧恶化,反映出其对完整令牌连续性的更强依赖。对后k%和随机k%令牌的进一步分析揭示了不对称的性能模式,这可能暗示潜在的任务污染效应:模型在传统NLP任务上可能依赖预训练中习得的浅层记忆模式。我们认为这项工作有助于更细致地理解LLM在性能与效率权衡中的行为,并划定了能容忍上下文稀疏的任务与需要完整上下文的任务之间的边界。我们的源代码和模型可在以下网址获得:https://github.com/Starscream-11813/Frugal-ICL
摘要:Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL
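The core compression step above (score every token, keep the top-k% by salience, emit them in original order) can be sketched directly. Real salience scores come from attribution methods such as GlobEnc/DecompX; the hand-picked scores below are placeholders, not the paper's method.

```python
# Sketch of FrugalPrompt-style top-k% token retention; salience scores
# here are illustrative placeholders, not attribution-method outputs.

def frugalize(tokens, scores, keep_ratio=0.2):
    """Return the top `keep_ratio` fraction of tokens, original order kept."""
    assert len(tokens) == len(scores)
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k most salient tokens...
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # ...re-sorted back into their original positions.
    return [tokens[i] for i in sorted(top)]

tokens = "the movie was an absolute triumph despite its slow opening act".split()
scores = [0.1, 0.6, 0.2, 0.1, 0.7, 0.9, 0.4, 0.2, 0.5, 0.3, 0.2]
print(frugalize(tokens, scores, keep_ratio=0.3))  # ['movie', 'absolute', 'triumph']
```

Preserving the original token order is what lets the model reconstruct the elided context from the surviving high-salience cues.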
【63】MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
标题:MoReBench:评估语言模型中的程序性和多元化道德推理,而不仅仅是结果
链接:https://arxiv.org/abs/2510.16380
备注:46 pages, 8 figures, 10 tables. Preprint
摘要:随着人工智能系统的进步,我们越来越依赖它们与我们一起并替我们做出决策。为了确保这些决策符合人类价值观,我们不仅要了解它们做出了什么决策,还要了解它们如何做出这些决策。推理语言模型同时提供最终回答和(部分透明的)中间思维痕迹,为研究人工智能的程序性推理提供了及时的机会。与往往有客观正确答案的数学和代码问题不同,道德困境允许多个可辩护的结论,因此是以过程为中心的评估的极佳测试平台。为此,我们提出了MoReBench:1,000个道德场景,每个场景都配有一组专家认为在推理该场景时必须包含(或避免)的标准。MoReBench包含超过23,000条标准,包括识别道德考量、权衡取舍并提供可操作的建议,以涵盖AI为人类道德决策提供建议以及自主做出道德决策的情形。另外,我们构建了MoReBench-Theory:150个示例,用于测试AI是否能在规范伦理学的五大主要框架下进行推理。我们的结果表明,缩放定律以及现有的数学、代码和科学推理基准无法预测模型执行道德推理的能力。模型还表现出对特定道德框架的偏好(例如,边沁式行为功利主义和康德道义论),这可能是流行训练范式的副作用。总之,这些基准将以过程为中心的推理评估推向更安全、更透明的AI。
摘要:As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
【64】Navigating through the hidden embedding space: steering LLMs to improve mental health assessment
标题:通过隐藏的嵌入空间导航:指导LLM以改善心理健康评估
链接:https://arxiv.org/abs/2510.16373
摘要:大型语言模型(LLM)的快速发展正在改变人工智能,为心理健康(MH)等敏感且影响重大的领域带来新的机遇。然而,尽管有这些进展,最近的证据表明,较小规模的模型在特定领域应用中仍难以提供最佳性能。在本研究中,我们提出了一种经济高效且强大的方法来提升LLM的MH评估能力,而不依赖任何计算密集型技术。我们的轻量级方法是对特定层的激活施加线性变换,利用导向向量来引导模型的输出。值得注意的是,这种干预使模型能够在两个不同的任务中取得更好的结果:(1)判断一条Reddit帖子是否有助于检测抑郁症状的存在与否(相关性预测任务),以及(2)根据用户的Reddit帖子历史完成标准化的抑郁心理筛查问卷(问卷完成任务)。结果突显了导向机制作为LLM在MH领域适配的高计算效率工具的尚未开发的潜力。
摘要:The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer's activations, leveraging steering vectors to guide the model's output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users' Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs' MH domain adaptation.
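The steering intervention described above amounts to a linear nudge added to one layer's activations along a steering vector. The sketch below shows the arithmetic only; the dimensions, the alpha scale, and the vector itself are illustrative assumptions, and in practice the transformation is applied inside the model's forward pass with a vector derived from data.

```python
# Toy activation steering: h' = h + alpha * v at every token position.
# All numbers are illustrative assumptions, not the paper's values.

def steer(hidden_states, steering_vector, alpha=1.0):
    """Add alpha * steering_vector to each position's hidden state."""
    return [
        [h + alpha * v for h, v in zip(state, steering_vector)]
        for state in hidden_states
    ]

# Two token positions with 3-dimensional activations (toy numbers).
hidden = [[0.5, -0.2, 0.1],
          [0.3,  0.4, -0.1]]
direction = [0.0, 1.0, 0.0]  # hypothetical "symptom-relevance" direction
print(steer(hidden, direction, alpha=0.5))
```

Choosing alpha trades off how strongly the output is pushed along the target direction against how much the original representation is preserved.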
【65】Utilising Large Language Models for Generating Effective Counter Arguments to Anti-Vaccine Tweets
标题:利用大型语言模型生成反疫苗推文的有效反驳论点
链接:https://arxiv.org/abs/2510.16359
备注:14 pages, 1 figure, work done as a part of this http URL project at IIT Kharagpur
摘要:在一个公共卫生越来越受到社交媒体上分享的信息影响的时代,打击疫苗怀疑论和错误信息已成为一个关键的社会目标。围绕疫苗接种的误导性叙述广泛传播,为实现高免疫率制造了障碍,并破坏了对健康建议的信任。虽然发现错误信息的努力已经取得了重大进展,但为揭穿这种说法而专门制作实时反驳论据的工作仍然是一个探索不足的领域。在这项工作中,我们探索LLM的能力,以产生健全的反论点反驳疫苗的错误信息。基于先前在错误信息揭穿方面的研究,我们尝试了各种提示策略和微调方法来优化反论点生成。此外,我们训练分类器将反疫苗推文分类为多标签类别,例如对疫苗有效性,副作用和政治影响的担忧,从而允许更多的上下文感知反驳。我们的评估通过人工判断,基于LLM的评估和自动度量进行,揭示了这些方法的高度一致性。我们的研究结果表明,整合标签描述和结构化微调增强了反驳的有效性,为大规模减少疫苗错误信息提供了一种有前途的方法。
摘要:In an era where public health is increasingly influenced by information shared on social media, combatting vaccine skepticism and misinformation has become a critical societal goal. Misleading narratives around vaccination have spread widely, creating barriers to achieving high immunisation rates and undermining trust in health recommendations. While efforts to detect misinformation have made significant progress, the generation of real time counter-arguments tailored to debunk such claims remains an insufficiently explored area. In this work, we explore the capabilities of LLMs to generate sound counter-argument rebuttals to vaccine misinformation. Building on prior research in misinformation debunking, we experiment with various prompting strategies and fine-tuning approaches to optimise counter-argument generation. Additionally, we train classifiers to categorise anti-vaccine tweets into multi-labeled categories such as concerns about vaccine efficacy, side effects, and political influences allowing for more context aware rebuttals. Our evaluation, conducted through human judgment, LLM based assessments, and automatic metrics, reveals strong alignment across these methods. Our findings demonstrate that integrating label descriptions and structured fine-tuning enhances counter-argument effectiveness, offering a promising approach for mitigating vaccine misinformation at scale.
【66】Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models
标题:思考思考:评估训练后的语言模型中的推理
链接:https://arxiv.org/abs/2510.16340
摘要:后训练技术的最新进展赋予了大型语言模型(LLM)通过生成补充规划令牌来处理复杂的逻辑密集型任务的增强能力。这一发展提出了一个基本问题:这些模型是否知道它们“学习”和“思考”的内容?为了解决这个问题,我们定义了三个核心能力:(1)对学习到的潜在策略的意识,(2)跨领域概括这些策略,以及(3)内部推理痕迹和最终输出之间的对齐。我们在几个任务中对这些能力进行了经验评估,每个任务都需要学习不同的策略。此外,我们对比了通过监督微调(SFT),直接策略优化(DPO)和组相对策略优化(GRPO)进行后训练的模型的配置文件。我们的研究结果表明,与SFT模型相比,RL训练的模型不仅表现出对其学习行为的更高意识和对新颖的、结构相似的任务的更强概括性,而且往往表现出推理轨迹与最终输出之间的弱一致性,这在GRPO训练的模型中最为明显。
摘要:Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Policy Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.
【67】Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
标题:Cerberus:基于级联视觉语言模型的实时视频异常检测
链接:https://arxiv.org/abs/2510.16290
摘要:随着视觉语言模型(VLM)的发展,视频异常检测(VAD)迅速进步。虽然这些模型提供了出色的零样本检测能力,但其巨大的计算成本和不稳定的视觉定位性能阻碍了实时部署。为了克服这些挑战,我们提出了Cerberus,一个为高效而准确的实时VAD设计的两阶段级联系统。Cerberus离线学习正常行为规则,并在在线推理过程中将轻量级过滤与细粒度VLM推理相结合。Cerberus的性能提升来自两个关键创新:运动掩码提示和基于规则的偏差检测。前者将VLM的注意力引导到与运动相关的区域,后者将异常识别为对已学习规范的偏离,而非枚举可能的异常。在四个数据集上的广泛评估表明,Cerberus在NVIDIA L40S GPU上平均达到57.68 fps,实现151.79倍的加速,准确率达97.2%,与最先进的基于VLM的VAD方法相当,使其成为实时视频分析的实用解决方案。
摘要:Video anomaly detection (VAD) has rapidly advanced by recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
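The two-stage cascade described above can be shown schematically: a cheap motion check gates which frames ever reach the expensive VLM stage, and the second stage flags deviations from learned rules rather than enumerating possible anomalies. Both stage implementations below are toy stand-ins, not the paper's models.

```python
# Schematic two-stage cascade; both stages are toy stand-ins.

def motion_gate(prev_frame, cur_frame, threshold=0.2):
    """Stage 1: mean absolute pixel difference as a lightweight filter."""
    diff = sum(abs(a - b) for a, b in zip(prev_frame, cur_frame)) / len(prev_frame)
    return diff > threshold

def rule_deviation(observed_behavior, learned_rules):
    """Stage 2 stand-in: anomaly == behavior outside the learned norms."""
    return observed_behavior not in learned_rules

def cascade(frames, behaviors, learned_rules):
    """Run the expensive rule check only where the cheap gate fires."""
    anomalies = []
    for i in range(1, len(frames)):
        if motion_gate(frames[i - 1], frames[i]):
            if rule_deviation(behaviors[i], learned_rules):
                anomalies.append(i)
    return anomalies

frames = [[0.1, 0.1], [0.1, 0.1], [0.9, 0.8], [0.9, 0.8]]
behaviors = ["walk", "walk", "climb_fence", "climb_fence"]
print(cascade(frames, behaviors, learned_rules={"walk", "queue"}))  # [2]
```

The speedup comes from the gate: most frames never invoke the expensive second stage at all.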
【68】Instant Personalized Large Language Model Adaptation via Hypernetwork
标题:通过超网络即时个性化大型语言模型适应
链接:https://arxiv.org/abs/2510.16282
摘要:个性化大语言模型(LLM)使用用户画像或历史记录根据个人偏好定制内容。然而,现有的参数高效微调(PEFT)方法,如“每个用户一个PEFT”(OPPU)范式,需要为每个用户训练一个单独的适配器,使其计算成本高昂,且难以实时更新。我们引入Profile-to-PEFT,这是一个可扩展的框架,它采用端到端训练的超网络,将用户的编码画像直接映射为一整套适配器参数(例如LoRA),从而消除部署时的按用户训练。这种设计可实现即时适配、对未见用户的泛化以及保护隐私的本地部署。实验结果表明,我们的方法在部署时使用更少的计算资源,同时优于基于提示的个性化和OPPU。该框架对分布外用户表现出很强的泛化能力,并在不同的用户活跃度和不同的嵌入骨干网络下保持鲁棒性。所提出的Profile-to-PEFT框架实现了适用于大规模应用的高效、可扩展且自适应的LLM个性化。
摘要:Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the ``One-PEFT-Per-User'' (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
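The profile-to-adapter mapping above can be sketched conceptually: a hypernetwork (here a bare linear map) turns a user-profile embedding straight into flattened LoRA matrices, so no per-user training happens at deployment. The dimensions, the random weights, and the linear form of the hypernetwork are all illustrative assumptions.

```python
# Conceptual Profile-to-PEFT sketch: profile embedding -> LoRA A and B.
# Dimensions, weights, and the linear hypernetwork are assumptions.

import random

D, R, P = 4, 2, 3          # hidden size, LoRA rank, profile-embedding size
N_OUT = 2 * D * R          # flattened size of LoRA A (R x D) plus B (D x R)

random.seed(0)
# Hypernetwork weights: stand-ins for the trained, end-to-end parameters.
W_hyper = [[random.uniform(-0.1, 0.1) for _ in range(P)] for _ in range(N_OUT)]

def hypernetwork(profile):
    """Map a profile embedding to flattened adapter parameters."""
    return [sum(w * p for w, p in zip(row, profile)) for row in W_hyper]

def to_lora(flat):
    """Reshape the flat vector into LoRA matrices A (R x D) and B (D x R)."""
    a_flat, b_flat = flat[:R * D], flat[R * D:]
    A = [a_flat[i * D:(i + 1) * D] for i in range(R)]
    B = [b_flat[i * R:(i + 1) * R] for i in range(D)]
    return A, B

A, B = to_lora(hypernetwork([0.2, -0.5, 0.9]))   # instant, no per-user training
print(len(A), len(A[0]), len(B), len(B[0]))      # 2 4 4 2
```

A new or unseen user only costs one forward pass through the hypernetwork, which is what makes real-time, per-user adaptation feasible.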
【69】Publication Trend Analysis and Synthesis via Large Language Model: A Case Study of Engineering in PNAS
标题:基于大型语言模型的出版趋势分析与综合:PNAS工程案例研究
链接:https://arxiv.org/abs/2510.16152
备注:35 pages, 10 figures
摘要:科学文献正日益被复杂的语言、静态的学科结构和可能稀疏的关键词体系所割裂,这使得捕捉现代科学的动态本质变得十分困难。本研究通过引入一个适应性强、由大型语言模型(LLM)驱动的框架来量化主题趋势并绘制科学知识的演变图景,从而应对这些挑战。该方法在《美国国家科学院院刊》(PNAS)20年间发表的1,500多篇工程文章上得到了验证,这些文章以其研究重点的广度和深度著称。两阶段分类流水线首先根据摘要为每篇文章确立一个主要主题类别。随后的阶段执行全文分析以分配二级分类,揭示语料库中潜在的跨主题联系。传统的自然语言处理(NLP)方法,如词袋(BoW)和词频-逆文档频率(TF-IDF),验证了所得到的主题结构,并表明单独的词频分析可能不足以刻画高多样性的领域。最后,一级分类和二级分类之间的不相交图表示揭示了主题之间的隐含联系,这些联系在仅分析摘要或关键词时可能不太明显。研究结果表明,该方法在不预先了解期刊现有双重分类模式(例如,也被归类为工程学的生物学研究)的情况下,独立地恢复了期刊编辑嵌入结构的大部分。该框架为发现潜在的主题趋势并提供科学进展的高层概览提供了一个强有力的工具。
摘要:Scientific literature is increasingly siloed by complex language, static disciplinary structures, and potentially sparse keyword systems, making it cumbersome to capture the dynamic nature of modern science. This study addresses these challenges by introducing an adaptable large language model (LLM)-driven framework to quantify thematic trends and map the evolving landscape of scientific knowledge. The approach is demonstrated over a 20-year collection of more than 1,500 engineering articles published by the Proceedings of the National Academy of Sciences (PNAS), marked for their breadth and depth of research focus. A two-stage classification pipeline first establishes a primary thematic category for each article based on its abstract. The subsequent phase performs a full-text analysis to assign secondary classifications, revealing latent, cross-topic connections across the corpus. Traditional natural language processing (NLP) methods, such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), confirm the resulting topical structure and also suggest that standalone word-frequency analyses may be insufficient for mapping fields with high diversity. Finally, a disjoint graph representation between the primary and secondary classifications reveals implicit connections between themes that may be less apparent when analyzing abstracts or keywords alone. The findings show that the approach independently recovers much of the journal's editorially embedded structure without prior knowledge of its existing dual-classification schema (e.g., biological studies also classified as engineering). This framework offers a powerful tool for detecting potential thematic trends and providing a high-level overview of scientific progress.
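The TF-IDF baseline used for validation is standard; a minimal stdlib version is sketched below (the tiny corpus is made up for illustration). It also shows why word frequency alone can understate field-specific terms: a term shared across documents ("engineering") scores lower than a document-specific one.

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency - inverse document frequency for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                 # document frequency per term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        out.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return out

docs = [["gene", "cell", "engineering"],
        ["bridge", "engineering", "design"],
        ["cell", "membrane", "biology"]]
scores = tfidf(docs)
# "engineering" appears in 2 of 3 docs, so it scores below a doc-specific term
print(scores[0]["engineering"] < scores[0]["gene"])  # True
```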
【70】Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
标题:统计数据中的事实:预训练多样性对语言模型概括的影响
链接:https://arxiv.org/abs/2510.16096
备注:28 pages, 15 figures
摘要:语言模型是在将统计规律性(使文本流畅)与特定标记之间的事实关联(事实知识)相混合的序列上进行预训练的。虽然近期工作表明,二者交互的可变性(例如事实关联的改写)对泛化能力起着关键作用,但我们仍缺乏对这些影响的系统分析。本文介绍了一个灵活的合成测试平台,它将通用标记的统计流与源-目标标记对的抽象事实流相结合,从而实现对二者交互的细粒度控制。该设计通过操纵流的组成(上下文结构)独立控制多样性的性质,并通过改变每个事实出现在哪些统计流中独立控制多样性的水平。通过对照实验我们发现,虽然较高的上下文多样性会延迟分布内(ID)事实准确率的提升,但其对分布外(OOD)事实泛化的影响关键取决于上下文结构。在某些情况下,OOD表现与ID遵循相同的趋势;而在另一些情况下,多样性对于实现非平凡的事实回忆至关重要。即使在低多样性阻碍事实回忆时,最佳多样性水平也取决于训练时长。除了事实回忆失败之外,我们还识别出统计泛化独立失败的结构,以及两种能力同时退化的结构。这表明上下文设计与多样性水平之间的相互作用如何影响泛化的不同方面。此外,通过对模型组件的一系列受控干预,我们将OOD失败追溯到不同的优化瓶颈,突出了嵌入层和反嵌入(unembedding)层的重要性。我们的合成框架使我们能够隔离在大规模研究中会相互混淆的效应,为未来的研究提供了一个受控的测试平台。
摘要:Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the diversity level by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects. Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.
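The testbed's two-stream construction, generic statistical tokens interleaved with source-target fact pairs, can be roughly sketched as follows. The vocabulary, sequence length, and placement policy here are assumptions for illustration; the paper's actual design controls stream composition and fact placement far more finely.

```python
import random

def make_sequence(facts, stream_vocab, length=10, rng=None):
    """Embed one source->target fact pair inside a stream of generic tokens.
    `facts` maps source tokens to target tokens; the statistical stream is
    drawn uniformly from `stream_vocab` (an illustrative simplification)."""
    rng = rng or random.Random(0)
    src, tgt = rng.choice(list(facts.items()))
    seq = [rng.choice(stream_vocab) for _ in range(length)]
    i = rng.randrange(length - 1)
    seq[i], seq[i + 1] = src, tgt  # the fact appears as an adjacent token pair
    return seq

facts = {"s1": "t1", "s2": "t2"}
vocab = ["a", "b", "c"]
seq = make_sequence(facts, vocab, rng=random.Random(42))
print(len(seq))  # 10
pairs = [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
print(any(p in facts.items() for p in pairs))  # True
```

Varying which statistical streams a given fact is planted in corresponds to the diversity level knob the abstract describes.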
【71】Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification
标题:系统文献综述筛选中的提示策略与大型语言模型:相关性与任务阶段分类
链接:https://arxiv.org/abs/2510.16091
摘要:本研究量化了提示策略如何与大型语言模型(LLM)相互作用,以自动化系统文献综述(SLR)的筛选阶段。我们使用准确率、精确率、召回率和F1,评估了六个LLM(GPT-4o、GPT-4o-mini、DeepSeek-Chat-V3、Gemini-2.5-Flash、Claude-3.5-Haiku、Llama-4-Maverick)在五种提示类型(零样本、少样本、思维链(CoT)、CoT少样本、自我反思)下在相关性分类和六个Level-2任务上的表现。结果显示出明显的模型-提示交互效应:CoT少样本产生最可靠的精确率-召回率平衡;零样本在高灵敏度筛选轮次中最大化召回率;自我反思则由于过度包容和跨模型的不稳定性而表现不佳。GPT-4o和DeepSeek提供了稳健的整体性能,而GPT-4o-mini以低得多的成本表现出竞争力。针对相关性分类的成本-性能分析(每1,000篇摘要)揭示了模型-提示组合之间的巨大绝对差异;GPT-4o-mini在各类提示下均保持低成本,GPT-4o-mini上的结构化提示(CoT/CoT少样本)以较小的增量成本提供有吸引力的F1。我们推荐一个分阶段的工作流程:(1)部署带结构化提示的低成本模型进行首轮筛选,(2)仅将边界案例升级到更高容量的模型。这些发现突显了LLM在自动化文献筛选方面虽不均衡但前景可期的潜力。通过系统地分析提示-模型交互,我们为任务自适应的LLM部署提供了比较基准和实践指导。
摘要:This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types (zero-shot, few-shot, chain-of-thought (CoT), CoT-few-shot, self-reflection) across relevance classification and six Level-2 tasks, using accuracy, precision, recall, and F1. Results show pronounced model-prompt interaction effects: CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models. GPT-4o and DeepSeek provide robust overall performance, while GPT-4o-mini performs competitively at a substantially lower dollar cost. A cost-performance analysis for relevance classification (per 1,000 abstracts) reveals large absolute differences among model-prompt pairings; GPT-4o-mini remains low-cost across prompts, and structured prompts (CoT/CoT-few-shot) on GPT-4o-mini offer attractive F1 at a small incremental cost. We recommend a staged workflow that (1) deploys low-cost models with structured prompts for first-pass screening and (2) escalates only borderline cases to higher-capacity models. These findings highlight LLMs' uneven but promising potential to automate literature screening. By systematically analyzing prompt-model interactions, we provide a comparative benchmark and practical guidance for task-adaptive LLM deployment.
【72】EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
标题:EvolveR:通过经验驱动的生命周期实现自我进化的LLM智能体
链接:https://arxiv.org/abs/2510.16079
摘要:当前的大型语言模型(LLM)代理在工具使用方面表现出很强的性能,但缺乏从自身经验中系统学习的关键能力。虽然现有的框架主要侧重于缩小外部知识差距,但它们未能解决一个更根本的限制:无法迭代地改进解决问题的策略。在这项工作中,我们引入了EvolveR,这是一个旨在使代理通过完整的闭环体验生命周期进行自我改进的框架。这个生命周期包括两个关键阶段:(1)离线自蒸馏,其中代理的交互轨迹被合成为抽象的,可重用的战略原则的结构化存储库;(2)在线交互,其中代理与任务交互并主动检索提炼的原则以指导其决策,积累各种行为轨迹。该循环采用策略强化机制来基于代理的性能迭代地更新代理。我们证明了EvolveR在复杂的多跳问答基准测试中的有效性,它在强大的代理基线上实现了卓越的性能。我们的工作为智能体提供了一个全面的蓝图,这些智能体不仅可以从外部数据中学习,还可以从自身行为的后果中学习,为更加自主和不断改进的系统铺平了道路。代码可在https://github.com/Edaizi/EvolveR上获得。
摘要:Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.
【73】Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
标题:LLM能否自我纠正?LLM自我纠正的基准
链接:https://arxiv.org/abs/2510.16062
备注:38 pages, 25 figures, 8 tables
摘要:大型语言模型(LLM)的自我纠正正在成为提升其推理性能的关键组成部分。尽管已经提出了各种自我纠正方法,但对这些方法的全面评估在很大程度上仍属空白,而LLM能否真正纠正自己也是一个备受关注的重要问题。在本研究中,我们介绍了CorrectBench,这是一个为评估自我纠正策略有效性而开发的基准,涵盖内在、外部和微调三类方法,以及三个任务:常识推理、数学推理和代码生成。我们的研究结果表明:1)自我纠正方法可以提高准确率,特别是对于复杂推理任务;2)混合不同的自我纠正策略可带来进一步改进,但会降低效率;3)推理型LLM(例如DeepSeek-R1)在附加的自我纠正方法下优化空间有限,且时间成本很高。有趣的是,一个相对简单的思维链(CoT)基线展现出有竞争力的准确率和效率。这些结果强调了自我纠正在提升LLM推理性能方面的潜力,同时凸显了提高其效率这一持续挑战。因此,我们主张进一步研究如何优化推理能力与运行效率之间的平衡。项目页面:https://correctbench.github.io/
摘要:Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM's reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/
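A minimal sketch of the intrinsic self-correction loop that benchmarks like this evaluate: draft an answer, critique it, revise until the critique passes or the round budget runs out. The `generate` and `critique` functions below are toy stand-ins for LLM calls, not CorrectBench's API.

```python
def self_correct(question, generate, critique, max_rounds=3):
    """Intrinsic self-correction: draft, critique, revise. `generate` and
    `critique` stand in for LLM calls in a real system."""
    answer = generate(question, feedback=None)
    for _ in range(max_rounds):
        ok, feedback = critique(question, answer)
        if ok:
            break
        answer = generate(question, feedback=feedback)
    return answer

# Toy stand-ins: the "model" fixes an arithmetic slip once given feedback.
def generate(question, feedback):
    return "5" if feedback else "4"

def critique(question, answer):
    correct = str(eval(question)) == answer  # toy verifier for "2+3"
    return correct, None if correct else "re-check the arithmetic"

print(self_correct("2+3", generate, critique))  # 5
```

External and fine-tuned variants differ mainly in where the critique signal comes from (a tool or verifier vs. a trained critic), not in the loop's shape.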
【74】Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus
标题:融合增强大型语言模型:通过模型共识提高诊断可信度
链接:https://arxiv.org/abs/2510.16057
备注:7 pages (Accepted to IEEE BHI 2025)
摘要:本研究提出了一种新颖的多模型融合框架,利用两种最先进的大型语言模型(LLM)ChatGPT和Claude,以提高CheXpert数据集上胸部X射线解读的可靠性。从包含224,316张胸片的完整CheXpert语料库中,我们随机选择了234项由放射科医生标注的研究,使用纯图像提示评估单模态性能。在该设置下,ChatGPT和Claude的诊断准确率分别为62.8%和76.9%。采用95%输出相似度阈值的基于相似度的共识方法将准确率提高到77.6%。为评估多模态输入的影响,我们随后按照MIMIC-CXR模板生成合成临床记录,并在另一个随机选择的50个病例子集上进行评估,这些病例同时配有图像和合成文本。在这个多模态队列中,ChatGPT的性能提高到84%,Claude为76%,而共识准确率达到91.3%。在两种实验条件下,基于一致性的融合始终优于单个模型。这些发现突显了整合互补模态并使用输出级共识的实用价值,可提高AI辅助放射诊断的可信度和临床效用,为以最小的计算开销减少诊断错误提供了一条实用途径。
摘要:This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
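The agreement-based fusion rule (accept a finding only when the two models' outputs agree above a 95% similarity threshold) might be sketched as below. The 95% threshold follows the abstract, but the character-level similarity measure is an illustrative assumption; the paper does not specify one here.

```python
from difflib import SequenceMatcher

def consensus(label_a, label_b, threshold=0.95):
    """Similarity-based consensus: keep a finding only when the two models'
    outputs agree above `threshold`; otherwise flag it for review."""
    sim = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    return (label_a, sim) if sim >= threshold else (None, sim)

label1, _ = consensus("Pleural Effusion", "pleural effusion")
label2, _ = consensus("Pleural Effusion", "Cardiomegaly")
print(label1, label2)  # Pleural Effusion None
```

Flagged (non-consensus) cases are where the reported accuracy gains come from: they are exactly the cases a clinician would need to review.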
【75】Can GRPO Help LLMs Transcend Their Pretraining Origin?
标题:GRPO能否帮助LLM超越其预训练起源?
链接:https://arxiv.org/abs/2510.15990
摘要:以可验证奖励进行的强化学习(RLVR)主要由组相对策略优化(GRPO)算法驱动,是增强大型语言模型(LLM)推理能力的主要方法。尽管GRPO被广泛采用,其收益往往并不一致;例如,一个模型可能在数学等某个推理领域显示出显著改进,但在医学等另一领域仍然停滞不前。这种不一致性提出了一个关键问题:在什么条件下GRPO能够改善推理并泛化到分布外(OOD)?我们从数据分布的角度研究了这一问题。我们首先从理论上证明GRPO是一种保守的重加权方案,受限于基础模型的分布,因而无法发现完全新颖的解决方案。我们进一步在精心设计的对照研究中验证了这一点:从头训练Transformer,并评估其在推理深度、输入长度、标记表示和组合性上的泛化。我们的结果为GRPO的边界提供了一个有原则的解释:只有当目标任务与模型的预训练偏差一致时,OOD改进才会出现,而分布内(ID)任务的收益会随着性能饱和而递减。这将GRPO重新定位为一种锐化预训练偏差的工具,而非通用的推理增强器。我们的研究结果激励未来开发能够将模型能力扩展到其预训练起源之外的算法。
摘要:Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its wide adoption, GRPO's gains are often inconsistent; for instance, a model may show significant improvement in one reasoning domain, like mathematics, yet remain stagnant in another, such as medicine. This inconsistency raises a critical question: under what conditions does GRPO improve reasoning and generalize out-of-distribution (OOD)? We investigate this from a data distribution perspective. We first prove theoretically that GRPO is a conservative reweighting scheme, bounded by the base model's distribution and thus unable to discover completely novel solutions. We further validate this in carefully designed controlled studies by training transformers from scratch, evaluating generalization across reasoning depth, input length, token representation, and compositionality. Our results provide a principled explanation for GRPO's boundaries: OOD improvement emerges only when the target task aligns with the model's pretrained biases, while gains on in-distribution (ID) tasks diminish as performance saturates. This reframes GRPO not as a universal reasoning enhancer but as a tool that sharpens pretraining biases. Our findings motivate future development of algorithms that can expand a model's capabilities beyond its pretraining origin.
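The "conservative reweighting" the paper analyzes comes from GRPO's group-relative advantage: each sampled completion's reward is normalized by the mean and standard deviation of its group, so the update only reweights responses the base model already produces. A simplified sketch of that normalization (per the usual GRPO formulation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward by its group's mean and std.
    Advantages are zero-mean within the group, so GRPO reweights existing
    samples rather than introducing new ones (simplified sketch)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 4 sampled completions: two correct (reward 1), two incorrect (0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(round(sum(adv), 6))   # 0.0 -- zero-mean within the group
print(adv[0] > 0 > adv[1])  # True
```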
【76】Quantum NLP models on Natural Language Inference
标题:自然语言推理的量子NLP模型
链接:https://arxiv.org/abs/2510.15972
备注:Accepted, presented, and to appear in the Proceedings of the Quantum AI and NLP 2025 Conference
摘要:量子自然语言处理(QNLP)通过将组合结构直接嵌入到量子电路中,提供了一种新颖的语义建模方法。本文研究了QNLP模型在自然语言推理(NLI)任务中的应用,在受约束的少样本设置下比较了量子、混合和经典的基于Transformer的模型。使用lambeq库和DisCoCat框架,我们为句子对构造参数化量子电路,并针对语义相关性和推理分类进行训练。为了评估效率,我们引入了一种新颖的信息论度量——每参数信息增益(IGPP),它可以独立于模型规模量化学习动态。我们的结果表明,量子模型在使用显著更少参数的情况下实现了与经典基线相当的性能。基于量子的模型在推理任务上优于随机初始化的Transformer,并在相关性任务上实现了更低的测试误差。此外,量子模型表现出显著更高的每参数学习效率(比经典模型高出多达五个数量级),凸显了QNLP在低资源、结构敏感场景中的前景。为解决电路级隔离并促进参数共享,我们还提出了一种新颖的基于聚类的架构,通过将门参数绑定到学习到的词聚类而非单个标记来提高泛化能力。
摘要:Quantum natural language processing (QNLP) offers a novel approach to semantic modeling by embedding compositional structure directly into quantum circuits. This paper investigates the application of QNLP models to the task of Natural Language Inference (NLI), comparing quantum, hybrid, and classical transformer-based models under a constrained few-shot setting. Using the lambeq library and the DisCoCat framework, we construct parameterized quantum circuits for sentence pairs and train them for both semantic relatedness and inference classification. To assess efficiency, we introduce a novel information-theoretic metric, Information Gain per Parameter (IGPP), which quantifies learning dynamics independent of model size. Our results demonstrate that quantum models achieve performance comparable to classical baselines while operating with dramatically fewer parameters. The Quantum-based models outperform randomly initialized transformers in inference and achieve lower test error on relatedness tasks. Moreover, quantum models exhibit significantly higher per-parameter learning efficiency (up to five orders of magnitude more than classical counterparts), highlighting the promise of QNLP in low-resource, structure-sensitive settings. To address circuit-level isolation and promote parameter sharing, we also propose a novel cluster-based architecture that improves generalization by tying gate parameters to learned word clusters rather than individual tokens.
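One plausible reading of the proposed IGPP metric is learning progress (information gained, e.g. the drop in cross-entropy in nats) divided by trainable parameter count; the paper's exact formulation may differ, so treat this as an illustrative sketch only.

```python
def igpp(loss_before, loss_after, n_params):
    """Information Gain per Parameter: here, the cross-entropy drop (nats)
    per trainable parameter. Illustrative reading, not the paper's exact definition."""
    return (loss_before - loss_after) / n_params

# Same loss drop, vastly different parameter counts: a tiny quantum model
# vs. a transformer-scale baseline.
print(igpp(2.0, 1.0, 100) > igpp(2.0, 1.0, 10_000_000))  # True
```

Under this reading, a model with orders of magnitude fewer parameters achieving a comparable loss drop scores correspondingly higher per-parameter efficiency, matching the abstract's claim.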
【77】Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
标题:Long Exposure:在阴影稀疏(Shadowy Sparsity)下加速LLM的参数高效微调
链接:https://arxiv.org/abs/2510.15964
备注:None
摘要:通过微调使预训练大型语言模型(LLM)适应多样的下游任务对众多应用至关重要。然而,参数高效微调(PEFT)技术的低效率在时间投入和运营成本方面带来了重大挑战。在本文中,我们首先介绍一种微妙的稀疏形式,称为阴影稀疏(Shadowy Sparsity),它是微调所特有的,且尚未被充分用于加速。针对阴影稀疏,我们提出了Long Exposure,一个加速LLM PEFT的高效系统。Long Exposure包括三个关键组件:Shadowy-sparsity Exposer采用延长的感知范围,以在阴影稀疏下捕获更多稀疏细节;Sequence-oriented Predictor提供高效且准确的预测,以处理大序列输入和不断演化的参数;Dynamic-aware Operator促进更结构化的计算模式和合并的内存访问,以应对动态稀疏操作。广泛的评估表明,Long Exposure优于最先进的方法,端到端微调加速高达 $2.49\times$,为加速LLM的PEFT提供了有前景的进展。
摘要:The adaptation of pre-trained large language models (LLMs) to diverse downstream tasks via fine-tuning is critical for numerous applications. However, the inefficiency of parameter-efficient fine-tuning (PEFT) techniques presents significant challenges in terms of time investments and operational costs. In this paper, we first introduce a nuanced form of sparsity, termed Shadowy Sparsity, which is distinctive in fine-tuning and has not been adequately addressed for acceleration. Under Shadowy Sparsity, we propose Long Exposure, an efficient system to accelerate PEFT for LLMs. Long Exposure comprises three key components: Shadowy-sparsity Exposer employs a prolonged sensing range to capture more sparsity details under shadowy sparsity; Sequence-oriented Predictor provides efficient yet accurate predictions to handle large sequence inputs and constantly-evolving parameters; and Dynamic-aware Operator facilitates more structured computational patterns and coalesced memory accesses, addressing dynamic sparse operations. Extensive evaluations show that Long Exposure outperforms state-of-the-arts with up to a $2.49\times$ speedup in end-to-end fine-tuning, offering promising advancements in accelerating PEFT for LLMs.
【78】HealthDial: A No-Code LLM-Assisted Dialogue Authoring Tool for Healthcare Virtual Agents
标题:HealthDial:医疗保健虚拟代理的无代码LLM辅助对话创作工具
链接:https://arxiv.org/abs/2510.15898
摘要:我们介绍HealthDial,一个对话创作工具,帮助医疗保健提供者和教育工作者创建虚拟代理,通过多个对话向患者提供健康教育和咨询。HealthDial利用大型语言模型(LLM),使用基于文本的患者健康教育材料作为输入,为每个会话自动创建基于会话的初始计划和对话。创作的对话以有限状态机的形式输出用于虚拟代理交付,使得所有内容都可以被验证,并且不会提供由LLM幻觉导致的不安全建议。LLM起草的对话结构和语言可以由作者在无代码用户界面中编辑,以确保有效性并优化清晰度和影响力。我们与辅导员和学生进行了可行性和可用性研究,以测试我们的方法与癌症筛查教育的创作任务。参与者使用HealthDial,然后通过与提供对话的3D动画虚拟代理进行交互来测试他们的对话。通过参与者对任务体验和最终对话的评估,我们表明HealthDial为顾问提供了一个有希望的第一步,以确保他们的健康教育材料的全面覆盖,同时与患者创建可理解和可操作的虚拟代理对话。
摘要:We introduce HealthDial, a dialogue authoring tool that helps healthcare providers and educators create virtual agents that deliver health education and counseling to patients over multiple conversations. HealthDial leverages large language models (LLMs) to automatically create an initial session-based plan and conversations for each session using text-based patient health education materials as input. Authored dialogue is output in the form of finite state machines for virtual agent delivery so that all content can be validated and no unsafe advice is provided resulting from LLM hallucinations. LLM-drafted dialogue structure and language can be edited by the author in a no-code user interface to ensure validity and optimize clarity and impact. We conducted a feasibility and usability study with counselors and students to test our approach with an authoring task for cancer screening education. Participants used HealthDial and then tested their resulting dialogue by interacting with a 3D-animated virtual agent delivering the dialogue. Through participants' evaluations of the task experience and final dialogues, we show that HealthDial provides a promising first step for counselors to ensure full coverage of their health education materials, while creating understandable and actionable virtual agent dialogue with patients.
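The finite-state-machine output format, in which every utterance is a fixed, author-validated string so delivery cannot drift into unsafe LLM output, can be illustrated with a toy dialogue. The states and texts below are invented for illustration, not taken from HealthDial.

```python
# Hypothetical authored dialogue as a finite state machine: each state carries
# a validated utterance and a transition table keyed by the user's choice.
FSM = {
    "greet":   {"say": "Hello! Today we'll talk about cancer screening.",
                "next": {"ok": "explain", "quit": "end"}},
    "explain": {"say": "Regular screening can catch problems early.",
                "next": {"ok": "end", "quit": "end"}},
    "end":     {"say": "Thanks for talking with me.", "next": {}},
}

def run_dialogue(fsm, user_inputs, start="greet"):
    """Walk the FSM on a sequence of user choices, returning the agent's lines."""
    state, transcript = start, []
    for choice in user_inputs:
        transcript.append(fsm[state]["say"])
        nxt = fsm[state]["next"].get(choice)
        if nxt is None:
            break
        state = nxt
    transcript.append(fsm[state]["say"])
    return transcript

lines = run_dialogue(FSM, ["ok", "ok"])
print(len(lines))   # 3
print(lines[-1])    # Thanks for talking with me.
```

Because the LLM only drafts this structure at authoring time and never generates text at delivery time, every reachable utterance can be validated before deployment.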
【79】Mitigating Harmful Erraticism in LLMs Through Dialectical Behavior Therapy Based De-Escalation Strategies
标题:通过基于辩证行为疗法的降级策略缓解LLM中的有害不稳定行为
链接:https://arxiv.org/abs/2510.15889
备注:15 pages, 7 figures and 6 tables
摘要:对个性化人工智能聊天机器人交互的需求不断上升,能够动态适应用户的情绪状态和实时请求,这突出了当前开发模式的关键局限性。现有的方法依赖于基线编程、自定义个性和手动响应调整,通常难以维护,并且容易出现错误,如幻觉、不稳定的输出和软件错误。本文假设,一个植根于人类心理学原理的框架,特别是治疗方式,可以提供一个比纯粹的技术干预更强大和可持续的解决方案。通过类比反映人脑的人工智能模拟神经网络,我们提出了辩证行为疗法(DBT)原则的应用,以调节聊天机器人对不同用户输入的反应。这项研究调查了基于DBT的框架对AI聊天机器人性能的影响,旨在确定其在产生更可靠,安全和准确的响应方面的有效性,同时减少幻觉,不稳定行为和其他系统性问题的发生。
摘要:The escalating demand for personalized AI chatbot interactions, capable of dynamically adapting to user emotional states and real-time requests, has highlighted critical limitations in current development paradigms. Existing methodologies, which rely on baseline programming, custom personalities, and manual response adjustments, often prove difficult to maintain and are susceptible to errors such as hallucinations, erratic outputs, and software bugs. This paper hypothesizes that a framework rooted in human psychological principles, specifically therapeutic modalities, can provide a more robust and sustainable solution than purely technical interventions. Drawing an analogy to the simulated neural networks of AI mirroring the human brain, we propose the application of Dialectical Behavior Therapy (DBT) principles to regulate chatbot responses to diverse user inputs. This research investigates the impact of a DBT-based framework on AI chatbot performance, aiming to ascertain its efficacy in yielding more reliable, safe, and accurate responses, while mitigating the occurrence of hallucinations, erratic behaviors, and other systemic issues.
【80】Comparing LLMs for Sentiment Analysis in Financial Market News
标题:金融市场新闻情绪分析中的LLM比较
链接:https://arxiv.org/abs/2510.15929
摘要:本文提出了一个大语言模型(LLM)在金融市场新闻的情绪分析任务的比较研究。这项工作的目的是分析这些模型在金融背景下的这一重要的自然语言处理任务的性能差异。LLM模型与经典方法进行了比较,允许量化每个测试模型或方法的好处。结果表明,在绝大多数情况下,大型语言模型的性能优于经典模型。
摘要:This article presents a comparative study of large language models (LLMs) in the task of sentiment analysis of financial market news. This work aims to analyze the performance difference of these models in this important natural language processing task within the context of finance. LLM models are compared with classical approaches, allowing for the quantification of the benefits of each tested model or approach. Results show that large language models outperform classical models in the vast majority of cases.
GAN|生成相关(5篇)
【1】Reasoning Distillation and Structural Alignment for Improved Code Generation
标题:推理蒸馏和结构对齐以改进代码生成
链接:https://arxiv.org/abs/2510.17598
摘要:使用语言模型进行有效的代码生成取决于两个关键因素:准确理解提示的意图,以及生成应用算法推理的代码,从而产生能够通过各种测试用例的正确解决方案,同时遵守目标编程语言的语法。与其他语言任务不同,代码生成需要的不仅仅是准确的标记预测;它需要理解解决方案级别和结构上的关系,而不仅仅是生成最可能的标记。超大型语言模型(VLLM)能够为复杂任务生成通向正确解决方案的详细步骤,其中推理在解决问题中至关重要。这种推理能力在较小的语言模型中可能并不存在。因此,在这项工作中,我们将VLLM的推理能力蒸馏到一个更小、更高效的模型中,使其部署起来更快、更便宜。我们的方法通过学习识别正确的解决方案路径,并通过一种新颖的结构感知损失优化方法在问题定义和潜在解决方案之间建立结构对应关系,来训练模型模仿VLLM的推理和解决问题的能力。这使模型能够超越标记级生成,并深入把握给定问题解决方案的总体结构。实验结果表明,我们的微调模型通过一个低成本且易于实现的过程开发而成,在MBPP、MBPP Plus和HumanEval基准上,在pass@1、平均数据流和平均语法匹配指标方面显著优于基线模型。
摘要:Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple-to-implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
【2】Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
标题:面向通用检索增强生成的混合模态检索
链接:https://arxiv.org/abs/2510.17354
备注:This work is in progress
摘要:检索增强生成(RAG)通过从外部语料库检索相关文档来增强大型语言模型(LLM),已成为一种强大的范式。然而,现有的RAG系统主要集中于单模态的文本文档,在查询和文档可能同时包含混合模态(如文本和图像)的现实场景中往往表现不足。在本文中,我们致力于解决通用检索增强生成(URAG)的挑战,它涉及对混合模态信息进行检索和推理,以改进视觉-语言生成。为此,我们提出了Nyx,一个为URAG场景量身定制的、统一的混合模态到混合模态检索器。为缓解真实混合模态数据的稀缺,我们引入了一个四阶段的自动生成与过滤流水线,利用网页文档构建了NyxQA,这是一个包含多样化混合模态问答对的数据集,能更好地反映现实世界的信息需求。在这个高质量数据集的基础上,我们为Nyx采用了两阶段训练框架:首先在NyxQA以及各种开源检索数据集上进行预训练,然后利用来自下游视觉-语言模型(VLM)的反馈进行有监督微调,使检索输出与生成偏好保持一致。实验结果表明,Nyx不仅在标准的纯文本RAG基准上表现出色,而且在更通用、更贴近现实的URAG设置中同样出色,显著提高了视觉-语言任务的生成质量。
摘要:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
【3】DVAGen: Dynamic Vocabulary Augmented Generation
标题:DVAGen:动态词汇增强生成
链接:https://arxiv.org/abs/2510.17115
摘要:用固定词汇训练的语言模型很难推广到新的或词汇表外的单词,限制了它们处理不同标记组合的灵活性。现有的动态词汇表方法试图解决这一限制,但面临的挑战,如碎片代码库,缺乏对现代LLM的支持,有限的推理可扩展性。为了克服这些问题,我们引入了DVAGen,这是一个完全开源的统一框架,旨在对动态词汇增强语言模型进行培训,评估和可视化。我们的框架将管道模块化以便于定制,与开源LLM无缝集成,并且是第一个提供CLI和WebUI工具进行实时结果检查的框架。我们验证了现代LLM动态词汇方法的有效性,并展示了对批量推理的支持,显着提高了推理吞吐量。
摘要:Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
【4】Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games
标题:通过对抗游戏中的自我游戏增强语言代理策略推理
链接:https://arxiv.org/abs/2510.16761
摘要:现有的语言智能体由于策略推理能力不足,在动态对抗游戏中往往表现不佳。为缓解这一限制,一种有前景的方法是让智能体自动从游戏交互中学习,而不依赖昂贵的专家标注数据。与智能体接收固定反馈或奖励的静态环境不同,在动态对抗游戏中选择合适的对手会显著影响学习性能。然而,对抗环境中的对手选择问题仍有待探索。在本文中,我们提出了一种通过边玩边学进行步骤级策略优化的方法SCO-PAL。利用SCO-PAL,我们通过将对手设置在不同水平对对手选择进行了详细分析,发现自博弈是在此类对抗环境中提升策略推理的最有效方式。将SCO-PAL与自博弈结合,与基线相比,我们将对四个对手的平均胜率提高了约30%,并在六个对抗游戏中对GPT-4取得了54.76%的胜率。
摘要:Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.
【5】U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
标题:U-Codec:超低帧率神经语音编解码器,用于快速高保真语音生成
链接:https://arxiv.org/abs/2510.16718
摘要:我们提出了\textbf{U-Codec},一种超低帧率神经语音编解码器,在低至5 Hz(每秒5帧)的极低帧率下实现高保真重建和快速语音生成。5 Hz的极端压缩通常会导致严重的可懂度和频谱细节损失,为此我们引入了基于Transformer的帧间长程依赖模块,并系统地探索残差矢量量化(RVQ)的深度和码本大小,以确定最佳配置。此外,我们将U-Codec应用到基于大语言模型(LLM)的自回归TTS模型中,该模型利用全局和局部分层架构来有效捕获多层标记之间的依赖关系。我们将基于LLM的TTS从50 Hz的3层RVQ扩展到5 Hz的32层RVQ。实验结果表明,U-Codec在保持相似度和自然度的同时,相比高帧率编解码器将基于LLM的TTS推理速度提高了约 $3\times$。这些结果验证了使用高度压缩的5 Hz离散标记进行快速、高保真语音合成的可行性。
摘要:We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Extreme compression at 5Hz typically leads to severe intelligibility and spectral detail loss, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec into a large language model (LLM)-based auto-regressive TTS model, which leverages global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
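The residual vector quantization (RVQ) that U-Codec stacks 32 layers deep can be sketched in a few lines. This is a toy numpy illustration with random codebooks, not the paper's learned codec; the zero row added to each codebook is our own device so that a layer can "pass", which keeps the residual norm non-increasing with depth.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each layer quantizes the residual
    left over by all previous layers, so depth trades rate for fidelity."""
    residual = x.astype(float)
    codes, quantized = [], np.zeros_like(residual, dtype=float)
    for cb in codebooks:                      # cb: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = residual - cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
dim, n_layers, cb_size = 8, 32, 15
# Zero row lets a layer leave the residual untouched (illustrative choice).
codebooks = [np.vstack([np.zeros((1, dim)),
                        rng.normal(scale=1.0 / (l + 1), size=(cb_size, dim))])
             for l in range(n_layers)]
x = rng.normal(size=dim)
codes, x_hat = rvq_encode(x, codebooks)
err = float(np.linalg.norm(x - x_hat))       # shrinks as layers are added
```

Each frame is thus represented by 32 small integer codes, which is what makes a 5Hz token rate compatible with high-fidelity reconstruction.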
BERT(1篇)
【1】Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings
标题:使用BERT嵌入进行心脏病学文本中疾病和药物识别的多语言临床NER
链接:https://arxiv.org/abs/2510.17437
备注:11 pages, 5 figures, 1 table, published in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)
摘要:电子健康记录(EHR)数据量的快速增长凸显了从非结构化临床文本中挖掘生物医学知识的迫切需求,以支持数据驱动的临床系统的发展,包括患者诊断、疾病进展监测、治疗效果评估、未来临床事件预测等。尽管语境化语言模型已在英语语料库的命名实体识别(NER)系统上展现了令人瞩目的性能提升,针对低资源语言临床文本的研究仍然匮乏。为弥合这一差距,作为BioASQ MultiCardioNER共享任务的一部分,我们的研究旨在开发多个深度上下文嵌入模型,以增强心脏病学领域的临床NER。我们探索了在通用领域文本上训练的不同单语和多语言BERT模型在从英语、西班牙语和意大利语撰写的临床病例报告中提取疾病和药物提及方面的有效性。我们在西班牙语疾病识别(SDR)、西班牙语药物识别(SMR)、英语药物识别(EMR)和意大利语药物识别(IMR)上的F1得分分别为77.88%、92.09%、91.74%和88.9%。这些结果在所有子任务上均优于测试排行榜的F1均值和中位数,各项均值/中位数分别为:SDR 69.61%/75.66%,SMR 81.22%/90.18%,EMR 89.2%/88.96%,IMR 82.8%/87.76%。
摘要:The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems, including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future clinical events, etc. While contextualized language models have demonstrated impressive performance improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.
语义分析(1篇)
【1】Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
标题:Xiaoice:通过语义特征的自监督时空聚类实现免训练视频理解
链接:https://arxiv.org/abs/2510.16781
摘要:大规模视觉语言模型(VLM)在静态图像上展现的卓越zero-shot推理能力尚未完全迁移到视频领域。传统的视频理解模型通常依赖在标注数据集上进行大量特定任务的训练,这一过程既昂贵又难以扩展。本文介绍了一种新颖的免训练视频理解框架,通过将预训练VLM的丰富语义先验与用于模式发现的经典机器学习算法协同结合,避免了端到端训练。我们的核心思想是将视频理解重新定义为高维语义特征空间内的自监督时空聚类问题。所提出的流水线首先使用预训练VLM的冻结视觉编码器将视频流转换为语义特征轨迹。随后,我们采用核时间分割(KTS)这一稳健的机器学习技术,将连续特征流划分为离散的、语义连贯的事件片段。然后对这些片段进行无监督的基于密度的聚类,以识别整个视频中反复出现的宏观场景和主题。通过从每个发现的簇中选择代表性关键帧,并利用VLM的生成能力生成文本描述,我们的框架自动产出视频内容的结构化多模态摘要。该方法为视频内容的zero-shot自动结构分析提供了一条有效、可解释且与模型无关的途径。
摘要:The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM's generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
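The segment-then-cluster stage of this pipeline can be sketched on synthetic features. Note this is a heavily simplified stand-in: real KTS solves a kernel-based change-point objective rather than thresholding consecutive cosine similarities, and the greedy clustering below substitutes for the paper's density-based clustering.

```python
import numpy as np

def segment_by_similarity(feats, threshold=0.8):
    """Cut the feature trajectory wherever cosine similarity between
    consecutive frames drops below `threshold` (toy proxy for KTS)."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = (unit[:-1] * unit[1:]).sum(axis=1)
    cuts = [0] + [i + 1 for i, s in enumerate(sims) if s < threshold] + [len(feats)]
    return list(zip(cuts[:-1], cuts[1:]))

def greedy_cluster(vectors, radius):
    """Assign each vector to the nearest centroid within `radius`,
    else open a new cluster (stand-in for density-based clustering)."""
    centroids, labels = [], []
    for v in vectors:
        d = [np.linalg.norm(v - c) for c in centroids]
        if d and min(d) < radius:
            labels.append(int(np.argmin(d)))
        else:
            centroids.append(v.copy())
            labels.append(len(centroids) - 1)
    return labels

# Synthetic "semantic trajectory": scene A, scene B, then scene A again.
rng = np.random.default_rng(1)
a, b = np.eye(16)[0], np.eye(16)[1]
feats = np.vstack([a + 0.01 * rng.normal(size=(20, 16)),
                   b + 0.01 * rng.normal(size=(20, 16)),
                   a + 0.01 * rng.normal(size=(20, 16))])
segments = segment_by_similarity(feats)            # three event segments
means = [feats[s:e].mean(axis=0) for s, e in segments]
labels = greedy_cluster(means, radius=0.5)         # scene A recurs -> same label
```

The recurring-scene detection falls out of the clustering: the first and third segments map to the same cluster even though they are far apart in time.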
Graph|知识图谱|Knowledge(2篇)
【1】Executable Knowledge Graphs for Replicating AI Research
标题:可执行的知识图用于复制AI研究
链接:https://arxiv.org/abs/2510.17795
备注:Work in progress
摘要:复制人工智能研究对于大型语言模型(LLM)代理来说是一项至关重要但具有挑战性的任务。现有方法往往难以生成可执行代码,主要原因是背景知识不足,以及检索增强生成(RAG)方法无法捕捉隐藏在参考文献中的潜在技术细节。此外,以前的方法往往忽视有价值的实现级代码信号,并且缺乏支持多粒度检索和重用的结构化知识表示。为了克服这些挑战,我们提出了可执行知识图谱(xKG),一个模块化、可插拔的知识库,可自动整合从科学文献中提取的技术见解、代码片段和领域特定知识。当与两个不同的LLM集成到三个代理框架中时,xKG在PaperBench上展现出显著的性能提升(o3-mini上为10.9%),证明了其作为自动化AI研究复制的通用且可扩展解决方案的有效性。代码将在https://github.com/zjunlp/xKG上发布。
摘要:Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code will be released at https://github.com/zjunlp/xKG.
【2】Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures
标题:将其交给专家:通过MoE专家签名检测知识蒸馏
链接:https://arxiv.org/abs/2510.16968
备注:Code is at this https URL
摘要:知识蒸馏(KD)加速了大型语言模型(LLM)的训练,但带来了知识产权保护和LLM多样性方面的风险。现有基于自我身份或输出相似性的KD检测方法很容易通过提示工程规避。我们提出了一个在白盒和黑盒设置下均有效的KD检测框架,其利用了一个被忽视的信号:MoE“结构习惯”的迁移,尤其是内部路由模式。我们的方法分析不同专家在各种输入上如何专业化和协作,从而形成在蒸馏过程中得以保留的独特指纹。为了推广到白盒设置和MoE架构之外,我们进一步提出了Shadow-MoE,一种通过辅助蒸馏构建代理MoE表示的黑盒方法,用于比较任意模型对之间的这些模式。我们建立了一个全面、可复现的基准,提供多样的蒸馏检查点和可扩展的框架,以促进未来研究。大量实验表明,该方法在各种场景下的检测准确率超过94%,对基于提示的规避具有很强的鲁棒性,优于现有基线,同时凸显了LLM中结构习惯的迁移。
摘要:Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.
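The core idea of comparing routing-pattern fingerprints can be illustrated with synthetic expert choices; the routing distributions, sample sizes, and Jensen-Shannon comparison below are our own toy assumptions, not the paper's detector.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two expert-usage histograms."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def routing_fingerprint(expert_choices, n_experts):
    """Histogram of how often each expert is selected across inputs."""
    return np.bincount(expert_choices, minlength=n_experts).astype(float)

rng = np.random.default_rng(0)
n_experts = 8
teacher_pref = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])
teacher   = rng.choice(n_experts, size=5000, p=teacher_pref)
student   = rng.choice(n_experts, size=5000, p=teacher_pref)  # inherits habits
unrelated = rng.choice(n_experts, size=5000)                  # uniform routing

d_student   = js_divergence(routing_fingerprint(teacher, n_experts),
                            routing_fingerprint(student, n_experts))
d_unrelated = js_divergence(routing_fingerprint(teacher, n_experts),
                            routing_fingerprint(unrelated, n_experts))
```

A distilled model's routing histogram stays close to the teacher's (low divergence), while an independently trained model's does not; Shadow-MoE applies the same comparison to proxy MoE representations when the true routing is inaccessible.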
推理|分析|理解|解释(16篇)
【1】Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
标题:基础自动评估器:扩展以推理为中心的领域的多任务生成评估器训练
链接:https://arxiv.org/abs/2510.17793
备注:29 pages, 9 tables, 6 figures
摘要:微调专门的生成式评估器已成为一种流行范式,以满足训练和测试时对可扩展评估日益增长的需求。然而,近期工作主要集中于将强化学习(RL)等新方法应用于评估器训练,而回避了大规模、数据驱动的开发。在这项工作中,我们专注于数据扩展,整理了一个包含250万样本的数据集,涵盖五类独特的评估任务(成对比较、步骤级评估、无参考和基于参考的验证,以及单一评分)和多个以推理评估为核心的领域。利用这些数据,我们以简单的迭代拒绝采样监督微调(SFT)方法训练了基础自动推理评估器(FARE),这是一个包含8B和20B(3.6B激活)参数评估器的模型家族。FARE-8B可与更大规模、经专门RL训练的评估器相抗衡,而FARE-20B为开源评估器树立了新标准,超越了专门的70B+评估器。在静态基准之外,我们还在真实任务中评估了FARE:作为推理时重排序器,FARE-20B在MATH上实现了接近oracle的性能;作为RL训练中的验证器,FARE将下游RL训练模型的性能较字符串匹配验证器最多提升14.1%;从FARE初始化并持续微调得到的FARE-Code在评估测试用例质量方面比gpt-oss-20B高出65%。
摘要:Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
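The iterative rejection-sampling SFT recipe can be sketched generically; the `generate`/`is_correct` stubs and round structure below are illustrative assumptions, not FARE's actual training code.

```python
import random

def rejection_sampling_round(prompts, generate, is_correct, k=4):
    """One round of rejection sampling: draw k candidates per prompt,
    keep only those the verifier accepts, and return them as SFT data."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            candidate = generate(prompt)
            if is_correct(prompt, candidate):
                kept.append((prompt, candidate))
    return kept

# Toy stand-ins: the "model" guesses a verdict, a gold label checks it.
random.seed(0)
gold = {"p1": "A", "p2": "B", "p3": "A"}
generate = lambda p: random.choice(["A", "B"])
is_correct = lambda p, c: gold[p] == c

sft_pool = []
for _ in range(3):  # in FARE, the model would be refinetuned between rounds
    sft_pool += rejection_sampling_round(list(gold), generate, is_correct)
```

Because only verifier-accepted samples enter the pool, each SFT round trains on self-generated data that is filtered for correctness rather than on expert annotations.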
【2】LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis
标题:LawChain:为中国侵权案件分析建模法律推理链
链接:https://arxiv.org/abs/2510.17602
摘要:法律推理是法律分析和决策的基本组成部分。现有的法律推理计算方法主要依赖三段论和IRAC等通用推理框架,未能全面考察支撑法律推理的细微过程。此外,目前的研究主要集中在刑事案件上,对民事案件的建模不足。在这项工作中,我们提出了一个新颖的框架,用于在中国侵权相关民事案件的分析中显式建模法律推理。我们首先将侵权分析中使用的法律推理过程操作化为LawChain框架。LawChain是一个三模块推理框架,每个模块由多个细粒度子步骤组成。在LawChain框架的指导下,我们引入了侵权法律推理任务,并构建了评估基准LawChain$_{eval}$,以系统评估侵权分析推理链中的关键步骤。利用该基准,我们评估了最先进的大型语言模型在民事侵权情境下的法律推理能力。结果表明,当前模型在准确处理侵权法律推理的关键要素方面仍有不足。此外,我们还介绍了几种通过提示或后训练显式融入LawChain式推理的基线方法。我们在法律命名实体识别和刑事赔偿计算等其他法律分析任务上进行了进一步实验,以验证这些基线的泛化性。所提出的基线方法在侵权相关法律推理上取得显著提升,并很好地泛化到相关法律分析任务,从而证明了显式建模法律推理链对提升语言模型推理能力的价值。
摘要:Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.
【3】MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
标题:MIRAGE:利用网络证据推理进行多模态错误信息检测的智能体框架
链接:https://arxiv.org/abs/2510.17590
备注:16 pages, 3 tables, 1 figure
摘要:错误信息通过每天数十亿条结合文本和图像的多模态帖子在网络平台上传播,使人工事实核查能力不堪重负。有监督的检测模型需要特定领域的训练数据,且难以泛化到多样的操纵策略。我们提出MIRAGE,一个推理时、模型可插拔的智能体框架,它将多模态验证分解为四个顺序模块:视觉真实性评估检测AI生成的图像;跨模态一致性分析识别脱离上下文的挪用;检索增强的事实核查通过迭代问题生成将主张锚定在网络证据中;校准的判断模块整合所有信号。MIRAGE协调视觉语言模型推理与有针对性的网络检索,输出结构化且带引用链接的理由。在MMFakeBench验证集(1,000个样本)上,使用GPT-4o-mini的MIRAGE实现了81.65%的F1和75.1%的准确率,比最强的zero-shot基线(使用MMD-Agent的GPT-4V,F1为74.0%)高出7.65个百分点,同时将假阳性率保持在34.3%,而仅判断基线的假阳性率为97.3%。测试集结果(5,000个样本)证实了其泛化性,F1为81.44%,准确率为75.08%。消融研究表明,视觉验证贡献了5.18个F1点,检索增强推理贡献了2.97个点。我们的结果表明,结合网络检索的分解式智能体推理无需特定领域训练即可媲美有监督检测器的性能,使错误信息检测能够覆盖标注数据仍然稀缺的各种模态。
摘要:Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval, outputs structured and citation-linked rationales. On MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
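The four-module sequential decomposition might be orchestrated roughly as below; the stub modules, field names, and thresholds are hypothetical placeholders, since the real system backs each stage with a vision-language model and web retrieval.

```python
def pipeline(post, modules):
    """Run verification modules in sequence; each sees the post and the
    signals accumulated so far, and a final judge integrates everything."""
    signals = {}
    for name, module in modules:
        signals[name] = module(post, signals)
    return signals

# Hypothetical stub modules standing in for VLM + web-search calls.
modules = [
    ("visual_veracity", lambda p, s: {"ai_generated": p["image_score"] > 0.5}),
    ("cross_modal",     lambda p, s: {"out_of_context": p["image_text_sim"] < 0.3}),
    ("fact_check",      lambda p, s: {"supported": p["evidence_hits"] >= 2}),
    ("judge",           lambda p, s: {"misinformation":
                            s["visual_veracity"]["ai_generated"]
                            or s["cross_modal"]["out_of_context"]
                            or not s["fact_check"]["supported"]}),
]
post = {"image_score": 0.1, "image_text_sim": 0.8, "evidence_hits": 3}
verdict = pipeline(post, modules)
```

Keeping each module's output as a named signal is what allows the final judgment to be calibrated against all evidence and traced back to its sources.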
【4】Deep Self-Evolving Reasoning
标题:深度自我进化推理
链接:https://arxiv.org/abs/2510.17498
摘要:长链式思维推理已成为大型语言模型中高级推理的基石。虽然最近的验证-改进框架使专有模型能够解决奥赛级别的问题,但其有效性取决于强大、可靠的验证和纠错能力,而这些能力在开放权重的较小规模模型中仍然脆弱。这项工作表明,即使在困难任务上的验证和改进能力较弱,此类模型的推理上限也可以通过我们称为深度自进化推理(DSER)的概率范式得到大幅扩展。我们将迭代推理概念化为马尔可夫链,其中每一步都代表解空间中的一次随机转移。关键洞见在于:只要改进的概率略微超过退化的概率,就能保证收敛到正确的解。通过并行运行多个长程、自进化的过程,DSER放大了这些微小的正向趋势,使模型能够渐近地逼近正确答案。在实验上,我们将DSER应用于DeepSeek-R1-0528-Qwen3-8B模型。在具有挑战性的AIME 2024-2025基准上,DSER解决了9个此前无法解决的问题中的5个,并提升了整体性能,使这个紧凑模型能够通过多数投票超越其600B参数教师模型的单轮准确率。除了对测试时扩展的直接效用外,DSER框架还可用于诊断当前开放权重推理模型的根本局限。通过清晰刻画它们在自我验证、改进和稳定性方面的缺陷,我们的发现为开发具有强大内在自进化能力的下一代模型确立了明确的研究议程。
摘要:Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
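The Markov-chain argument can be checked numerically: in a two-state chain whose improvement probability slightly exceeds its degradation probability, the stationary share of time spent on a correct solution is p_improve / (p_improve + p_degrade), which exceeds 1/2 for any positive margin. The probabilities below are illustrative, not taken from the paper.

```python
import random

def self_evolving_chain(p_improve, p_degrade, steps, seed=0):
    """Markov chain over {incorrect, correct}: each revision step improves
    with prob p_improve, degrades with prob p_degrade, else stays put.
    Returns the fraction of steps spent in the correct state."""
    rng = random.Random(seed)
    correct, time_correct = False, 0
    for _ in range(steps):
        r = rng.random()
        if r < p_improve:
            correct = True
        elif r < p_improve + p_degrade:
            correct = False
        time_correct += correct
    return time_correct / steps

# A slim 0.12-vs-0.10 margin already tilts the chain toward correctness;
# parallel long-horizon runs plus majority voting amplify this tendency.
frac = self_evolving_chain(p_improve=0.12, p_degrade=0.10, steps=100_000)
```

The empirical fraction lands near the stationary value 0.12 / 0.22 ≈ 0.545, illustrating why weak verification and refinement can still suffice given enough self-evolution steps.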
【5】How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design
标题:新闻感受:了解以人为本的媒体设计多语言标题中的情感偏见
链接:https://arxiv.org/abs/2510.17252
备注:15 pages, 7 figures, 4 tables. Submitted to the International Conference on Data and Applied Analytics (IDAA 2025)
摘要:新闻媒体塑造公众情绪,不仅通过报道什么,还通过如何构建报道。同一事件可能在一家媒体上显得平静,在另一家媒体上却令人震惊,反映出报道中微妙的情感偏见。负面或情绪化的标题往往吸引更多注意力、传播更快,这反过来又鼓励媒体以引发更强烈反应的方式构建故事。本研究通过对孟加拉语新闻的大规模情感分析来探究这一倾向。我们使用Gemma-3 4B进行zero-shot推理,分析了30万条孟加拉语新闻标题及其内容,以识别每条新闻的主导情绪和整体基调。研究结果显示负面情绪明显占主导,尤其是愤怒、恐惧和失望,且不同媒体对相似故事的情感描绘存在显著差异。基于这些洞见,我们提出了一个以人为本的新闻聚合器的设计思路,它可视化情感线索,帮助读者识别日常新闻中隐藏的情感框架。
摘要:News media often shape the public mood not only by what they report but by how they frame it. The same event can appear calm in one outlet and alarming in another, reflecting subtle emotional bias in reporting. Negative or emotionally charged headlines tend to attract more attention and spread faster, which in turn encourages outlets to frame stories in ways that provoke stronger reactions. This research explores that tendency through large-scale emotion analysis of Bengali news. Using zero-shot inference with Gemma-3 4B, we analyzed 300000 Bengali news headlines and their content to identify the dominant emotion and overall tone of each. The findings reveal a clear dominance of negative emotions, particularly anger, fear, and disappointment, and significant variation in how similar stories are emotionally portrayed across outlets. Based on these insights, we propose design ideas for a human-centered news aggregator that visualizes emotional cues and helps readers recognize hidden affective framing in daily news.
【6】Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
标题:理解和改进分层稀疏注意力模型中的长度泛化
链接:https://arxiv.org/abs/2510.17196
备注:Preprint. Work in progress
摘要:有效处理长上下文是语言模型面临的关键挑战。标准Transformer受限于二次复杂度和较差的长度外推能力,而滑动窗口注意力和状态空间模型等替代架构由于其固定大小的记忆,牺牲了有效利用完整上下文的能力。基于分块的稀疏注意力已成为极端长度泛化的一个有前途的范式,但支撑其成功的关键架构原则尚未被完全理解。在这项工作中,我们对这些模型进行了系统剖析,以确定驱动其性能的核心组件。通过统一框架和全面的消融研究,我们证明三个设计原则的组合至关重要:(1)具有专用CLS令牌的表达性非线性块编码器(Chunk Encoder),用于产生可供检索的表示;(2)旁路残差路径,用于稳定地整合检索到的全局信息,而不被局部残差流覆盖;(3)在预训练期间强制选择稀疏性,以弥合训练-测试分布差距。我们为块内信息处理和地标生成提供了理论动机。通过结合这些原则,我们在免训练长度外推上建立了新的最先进水平,成功将在4K上下文上训练的模型在RULER和BABILong上泛化到3200万令牌。我们的发现为开发未来高性能长上下文语言模型提供了一套清晰且有实证依据的设计原则。
摘要:Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
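The retrieval step of chunk-based sparse attention can be sketched as follows; mean pooling stands in for the paper's learned non-linear CLS-token chunk encoder, and the shapes are toy values.

```python
import numpy as np

def select_chunks(query, keys, chunk_size, top_k):
    """Score each chunk by a summary vector and keep only the top_k
    chunks for full attention; the rest of the context is skipped,
    which is what decouples cost from total sequence length."""
    n_chunks = len(keys) // chunk_size
    chunks = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1)
    summaries = chunks.mean(axis=1)          # stand-in for CLS encodings
    scores = summaries @ query               # relevance of each chunk
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(0)
dim, chunk_size = 8, 4
keys = 0.01 * rng.normal(size=(32, dim))     # 8 near-zero chunks
query = np.ones(dim)
keys[20:24] += query                         # chunk 5 matches the query
selected = select_chunks(query, keys, chunk_size, top_k=2)
```

Because only summary vectors are scored, the query attends to a constant number of chunks regardless of how long the context grows, which is the property the paper's length-extrapolation results rely on.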
【7】VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
标题:VAGEN:强化多轮VLM代理的世界模型推理
链接:https://arxiv.org/abs/2510.16907
备注:Accepted to NeurIPS 2025
摘要:与语言模型(LLM)代理相比,训练视觉语言模型(VLM)代理的一个关键挑战在于从文本状态到复杂视觉观察的转变。这种转变引入了部分可观测性,并需要稳健的世界建模。我们提出的问题是:VLM代理能否通过显式的视觉状态推理构建内部世界模型?为了回答这个问题,我们将其形式化为部分可观察马尔可夫决策过程(POMDP),并通过强化学习(RL)在架构上强制执行并奖励代理的推理过程。我们发现,将代理的推理分解为状态估计(“当前状态是什么?”)和转移建模(“接下来会发生什么?”)是成功的关键,这通过五种推理策略得到了验证。我们对代理如何表示内部信念的研究表明,最优表示依赖于任务:自然语言擅长捕捉一般任务中的语义关系,而结构化格式对于精确操作和控制不可或缺。基于这些洞见,我们设计了世界建模奖励,为准确的状态预测提供密集的轮级监督,并引入双层通用优势估计(Bi-Level GAE)实现轮感知的信用分配。通过这种形式的视觉状态推理,一个3B参数模型在五个不同的代理基准上取得了0.82的分数,是其未经训练版本(0.21)的约3倍,并优于GPT-5(0.75)、Gemini 2.5 Pro(0.67)和Claude 4.5(0.62)等专有推理模型。所有实验都在我们的VAGEN框架内进行,这是一个用于在多样视觉环境中训练和分析多轮VLM代理的可扩展系统。代码和数据可在https://vagen-ai.github.io公开获取。
摘要:A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3$\times$ improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen-ai.github.io.
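The abstract does not spell out Bi-Level GAE, but it builds on standard Generalized Advantage Estimation, which can be sketched as a backward pass over TD errors; with gamma = lam = 1 the advantage reduces to return-minus-value, which makes the example easy to verify by hand.

```python
def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation: an exponentially weighted sum of
    TD errors, the per-step credit signal that turn-aware variants like
    Bi-Level GAE build on."""
    advantages, next_value, running = [], last_value, 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v   # one-step TD error
        running = delta + gamma * lam * running
        advantages.append(running)
        next_value = v
    return advantages[::-1]

# Sparse terminal reward; with gamma = lam = 1 each advantage equals
# (total future reward) - (value estimate) at that step.
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.6], gamma=1.0, lam=1.0)
```

A turn-aware scheme would additionally reset or re-weight this recursion at turn boundaries so that credit is assigned per interaction turn rather than uniformly across tokens.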
【8】LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
标题:LC-Eval:用于长上下文理解的双语多任务评估基准
链接:https://arxiv.org/abs/2510.16783
备注:1 figure, 15 tables, 10 main pages
摘要:大型语言模型(LLM)的最新进展展现了复杂的能力,包括处理和理解超长上下文的能力。这些新兴能力需要严格的评估方法,以有效评估其长上下文理解性能。在本文中,我们提出\textbf{LC-Eval},一个双语多任务评估基准,旨在评估英语和阿拉伯语的长上下文理解,目标上下文长度从4k到超过128k令牌。LC-Eval引入了四个新颖且具有挑战性的任务:多文档问答、双语问答、段落内的声明验证,以及基于长上下文的多项选择题。这些任务旨在评估LLM在深度推理、文档理解、信息追踪以及双语信息提取和理解方面的能力。该基准为每个任务提供阿拉伯语和英语数据集,便于对不同文本类型的性能进行比较分析。我们对开放权重和闭源LLM进行了评估,结果表明LC-Eval提出了重大挑战。即使是GPT-4o等高性能模型也在某些任务上表现不佳,凸显了该基准的复杂性和严格性。
摘要:Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present \textbf{LC-Eval}, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
【9】Temporal Understanding under Deictic Frame of Reference
标题:指示参考系下的时间理解
链接:https://arxiv.org/abs/2510.16685
备注:Under review
摘要:理解时间是人类认知的基础,时间经验通常通过根植于感觉运动经验的空间隐喻来概念化。例如,“夏天即将到来”与“我们正在接近夏天”相互对应。在这样的表达中,人类依靠参考框架(FoR)来解释相对于特定视角的含义。将这一概念延伸到时间,时间参考框架(t-FoR)定义了时间关系如何相对于体验者的“现在”时刻被感知。虽然大型语言模型(LLM)在自然语言理解方面取得了显著进展,但它们解释和推理时间的能力仍然有限。在这项工作中,我们介绍了TUuD(指示性t-FoR下的时间理解),一个评估当“现在”这一参考点沿时间轴动态移动时,LLM如何解释时间-事件和事件-事件关系的框架。遵循最近关于时间认知的工作\cite{li2025other},我们提示LLM对当前时刻与目标事件之间的相似性进行从0.00(完全不相似)到1.00(高度相似)的评分,其中相似性量化了两个时间点之间感知到的时间对齐程度。我们的结果显示,四个被评估的LLM对指示性t-FoR表现出可测量的适应,相似性评分在当前时刻附近达到峰值,并向过去和未来事件递减。然而,这种适应在超出近期语境后减弱,表明虽然LLM表现出部分类人的时间认知,其时间推理仍对参考框架变化和时间距离敏感。
摘要:Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, "summer is approaching" parallels "We are approaching the summer". In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer's moment of "now". While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of "now" dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.
【10】Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis
标题:通过检索推理资产和多智能体分析进行提示优化
链接:https://arxiv.org/abs/2510.16635
备注:Preprint
摘要:提示优化已成为一种替代重训练、用于提升大型语言模型(LLM)性能的有效手段。然而,大多数现有方法将评估视为黑盒,仅依赖数值分数,对提示为何成功或失败提供的洞见有限。它们还严重依赖难以解释和控制的试错式改进。在本文中,我们介绍MA-SAPO,一个面向分数感知提示优化的多智能体框架。与先前方法相比,MA-SAPO显式地将评估结果与结构化推理耦合,以指导系统性编辑。该框架具体包括两个阶段:在推理阶段,智能体协作解释指标分数、诊断弱点,并合成有针对性的改进,存储为可重用的推理资产;在测试阶段,智能体检索这些资产以分析待优化的提示,并仅应用有证据支撑的编辑。通过将评估信号转化为可解释的推理链,MA-SAPO产生的提示改进更加透明、可审计和可控。在HelpSteer1/2基准上的实验表明,相比单次提示、检索增强基线和先前的多智能体策略,我们的方法取得了一致的改进,验证了其有效性。
摘要:Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.
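The store-then-retrieve flow for reasoning assets might look like the following sketch; the keyword matching and asset schema are hypothetical simplifications of the multi-agent pipeline.

```python
class ReasoningAssetStore:
    """Minimal sketch of MA-SAPO's two phases: store score-grounded
    refinement rationales, then retrieve them to edit new prompts."""

    def __init__(self):
        self.assets = []  # (weakness keyword, refinement) pairs

    def add(self, weakness, refinement):
        self.assets.append((weakness, refinement))

    def retrieve(self, diagnosis):
        # Return only refinements whose weakness matches the diagnosis,
        # so edits stay evidence-grounded rather than trial-and-error.
        return [r for w, r in self.assets if w in diagnosis]

store = ReasoningAssetStore()
# Reasoning phase: agents explain a low metric score and log the fix.
store.add("verbosity", "Add an explicit length limit to the instruction.")
store.add("ambiguity", "State the expected output format.")
# Test phase: a new prompt is diagnosed and only matching edits applied.
edits = store.retrieve("score low due to verbosity of responses")
```

The point of the structure is auditability: every applied edit can be traced back to a stored rationale about a specific metric weakness.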
【11】Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
标题:同意,不同意,解释:通过解释的镜头分解NLI中的人类标签变异
链接:https://arxiv.org/abs/2510.16458
摘要:Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators' decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning types. However, previous work applying such taxonomies has focused on within-label variation: cases where annotators agree on the final NLI label but provide different explanations. In contrast, this paper broadens the scope by examining how annotators may diverge not only in the reasoning type but also in the labeling step. We use explanations as a lens to decompose the reasoning process underlying NLI annotation and to analyze individual differences. We apply LiTEx to two NLI English datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators' selection bias. We observe instances where annotators disagree on the label but provide highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning types better reflects the semantic similarity of free-text explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.
【12】RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning
标题:RAVEN:通过强化推理实现稳健的广告视频违规时间基础
链接:https://arxiv.org/abs/2510.16455
备注:ACL 2025 (Oral, Industry Track)
摘要:Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. A multi-level hierarchical reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
【13】TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
标题:TrajSelector:利用潜在表示在大型推理模型中实现高效且有效的Best-of-N
链接:https://arxiv.org/abs/2510.16449
备注:13 pages, 6 figures. Project website: this https URL
摘要:Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each step-wise trajectory, then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.
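The selection step described above (score each step with the lightweight verifier, aggregate, pick the best trajectory) can be sketched in a few lines; the mean aggregation below is an illustrative choice, not necessarily the paper's exact aggregator:

```python
def select_trajectory(step_scores):
    """Best-of-N selection sketch: step_scores[i] holds the verifier's
    per-step quality scores for trajectory i. Aggregate each trajectory
    by its mean step score and return the index of the best one."""
    means = [sum(scores) / len(scores) for scores in step_scores]
    return max(range(len(means)), key=means.__getitem__)
```

With N=32 sampled trajectories, this replaces majority voting: instead of counting final answers, the process-level scores decide.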
【14】What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics
标题:机器人应该能够回答哪些问题?可解释机器人技术的用户问题数据集
链接:https://arxiv.org/abs/2510.16435
摘要:With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (22.5%), the robot's capabilities (12.7%), and performance assessments (11.3%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.
【15】In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
标题:我们(不)信任生成式AI吗?Reddit讨论中信任与不信任的计算分析
链接:https://arxiv.org/abs/2510.16173
摘要:The rise of generative AI (GenAI) has impacted many aspects of human life. As these systems become embedded in everyday practices, understanding public trust in them also becomes essential for responsible adoption and governance. Prior work on trust in AI has largely drawn from psychology and human-computer interaction, but there is a lack of computational, large-scale, and longitudinal approaches to measuring trust and distrust in GenAI and large language models (LLMs). This paper presents the first computational study of Trust and Distrust in GenAI, using a multi-year Reddit dataset (2022-2025) spanning 39 subreddits and 197,618 posts. Crowd-sourced annotations of a representative sample were combined with classification models to scale analysis. We find that Trust and Distrust are nearly balanced over time, with shifts around major model releases. Technical performance and usability dominate as dimensions, while personal experience is the most frequent reason shaping attitudes. Distinct patterns also emerge across trustors (e.g., experts, ethicists, general users). Our results provide a methodological framework for large-scale Trust analysis and insights into evolving public perceptions of GenAI.
【16】The Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers
标题:建模P(X)的隐藏成本:生成式文本分类器中成员推理攻击的脆弱性
链接:https://arxiv.org/abs/2510.16122
摘要:Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. Despite extensive research on MIAs, systematic comparisons between generative and discriminative classifiers remain limited. This work addresses this gap by first providing theoretical motivation for why generative classifiers exhibit heightened susceptibility to MIAs, then validating these insights through comprehensive empirical evaluation. Our study encompasses discriminative, generative, and pseudo-generative text classifiers across varying training data volumes, evaluated on nine benchmark datasets. Employing a diverse array of MIA strategies, we consistently demonstrate that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage. Furthermore, we observe that the canonical inference approach commonly used in generative classifiers significantly amplifies this privacy risk. These findings reveal a fundamental utility-privacy trade-off inherent in classifier design, underscoring the critical need for caution when deploying generative classifiers in privacy-sensitive applications. Our results motivate future research directions in developing privacy-preserving generative classifiers that can maintain utility while mitigating membership inference vulnerabilities.
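As a toy illustration of the threat model (not the paper's specific attack), a likelihood-threshold membership inference attack can be scored by sweeping a threshold over per-sample joint log-likelihoods log P(X, Y) and reporting the best achievable attack accuracy; the function name and setup below are hypothetical:

```python
import numpy as np

def mia_advantage(member_scores, nonmember_scores):
    """Threshold-sweep membership inference: given modeled joint
    log-likelihoods for training members and non-members, return the
    best attack accuracy over all thresholds (0.5 = no leakage)."""
    scores = np.concatenate([member_scores, nonmember_scores])
    labels = np.concatenate([np.ones_like(member_scores),
                             np.zeros_like(nonmember_scores)])
    best = 0.5
    for t in scores:  # every observed score is a candidate threshold
        acc = float(((scores >= t) == labels).mean())
        best = max(best, acc)
    return best
```

A fully generative classifier that fits P(X, Y) tightly to its training set tends to assign members higher joint likelihoods, pushing this advantage above 0.5.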
半/弱/无监督|不确定性(1篇)
【1】DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model
标题:DELULU:面向说话人感知自监督语音基础模型的基于潜在单元的判别式嵌入学习
链接:https://arxiv.org/abs/2510.17662
摘要:Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
检测相关(5篇)
【1】DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning
标题:DETree:通过树结构分层表示学习检测人机协作文本
链接:https://arxiv.org/abs/2510.17489
备注:To appear in NeurIPS 2025
摘要:Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at https://github.com/heyongxin233/DETree.
【2】AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu
标题:低资源语言中的人工智能生成文本检测:乌尔都语案例研究
链接:https://arxiv.org/abs/2510.16573
摘要:Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 human-authored and 1,800 AI-generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI, was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type-Token Ratio), and N-gram patterns, with significance evaluated through t-tests and Mann-Whitney U tests. Three state-of-the-art multilingual transformer models (mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base) were fine-tuned on this dataset. The mDeBERTa-v3-base model achieved the highest performance, with an F1-score of 91.29 and accuracy of 91.26% on the test set. This research advances efforts to combat misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low-resource languages.
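The vocabulary-richness feature mentioned above, the Type-Token Ratio, is simply the number of distinct tokens divided by the total token count; a minimal sketch using whitespace tokenization (an assumption — the paper may tokenize Urdu differently):

```python
def type_token_ratio(text):
    """Type-Token Ratio: distinct tokens / total tokens.
    Whitespace tokenization is an illustrative simplification."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

Human and machine text often differ on such surface statistics, which is why they serve as inputs to the significance tests above.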
【3】InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects
标题:InfraGPT智能基础设施:用于检测和管理城市缺陷的端到端基于VLM的框架
链接:https://arxiv.org/abs/2510.16017
摘要:Infrastructure in smart cities is increasingly monitored by networks of closed-circuit television (CCTV) cameras. Roads, bridges, and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi-defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision-language model (VLM) for scene-aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review the literature on pothole, crack, and leak detection, highlight recent advances in large vision-language models such as Qwen-VL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city-wide deployments.
【4】Bolster Hallucination Detection via Prompt-Guided Data Augmentation
标题:通过提示引导的数据增强加强幻觉检测
链接:https://arxiv.org/abs/2510.15977
摘要:Large language models (LLMs) have garnered significant interest in the AI community. Despite their impressive generation capabilities, they have been found to produce misleading or fabricated information, a phenomenon known as hallucination. Consequently, hallucination detection has become critical to ensure the reliability of LLM-generated content. One primary challenge in hallucination detection is the scarcity of well-labeled datasets containing both truthful and hallucinated outputs. To address this issue, we introduce Prompt-guided data Augmented haLlucination dEtection (PALE), a novel framework that leverages prompt-guided responses from LLMs as data augmentation for hallucination detection. This strategy can generate both truthful and hallucinated data under prompt guidance at a relatively low cost. To more effectively evaluate the truthfulness of the sparse intermediate embeddings produced by LLMs, we introduce an estimation metric called the Contrastive Mahalanobis Score (CM Score). This score is based on modeling the distributions of truthful and hallucinated data in the activation space. CM Score employs a matrix decomposition approach to more accurately capture the underlying structure of these distributions. Importantly, our framework does not require additional human annotations, offering strong generalizability and practicality for real-world applications. Extensive experiments demonstrate that PALE achieves superior hallucination detection performance, outperforming the competitive baseline by a significant margin of 6.55%.
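A minimal sketch of a Mahalanobis-contrast score in the spirit of the CM Score (the paper additionally applies a matrix-decomposition refinement not reproduced here): fit Gaussian statistics for truthful and hallucinated activations, then score a new embedding by the difference of its distances to the two classes.

```python
import numpy as np

def cm_score(x, truthful, hallucinated, eps=1e-3):
    """Contrast of Mahalanobis distances in activation space: positive
    when embedding x lies closer to the truthful cluster than to the
    hallucinated one. Gaussian class statistics are an assumption."""
    dists = {}
    for name, data in (("true", truthful), ("hall", hallucinated)):
        mean = data.mean(axis=0)
        # Regularize the covariance so the inverse is well-conditioned.
        cov = np.cov(data, rowvar=False) + eps * np.eye(data.shape[1])
        diff = x - mean
        dists[name] = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
    return dists["hall"] - dists["true"]
```

Thresholding this score at zero gives a simple truthful-vs-hallucinated classifier over intermediate embeddings.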
【5】Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System
标题:检测和预防人工智能伴侣的有害行为:SHIELD监控系统的开发和评估
链接:https://arxiv.org/abs/2510.15891
摘要:AI companions powered by large language models (LLMs) are increasingly integrated into users' daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), a LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10-16%) was significantly reduced with SHIELD (to 3-8%), a 50-79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials including prompts, code, and evaluation methods are made available as open source materials for research, adaptation, and deployment.
识别/分类(2篇)
【1】PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition
标题:PANER:用于低资源命名实体识别的重述增强框架
链接:https://arxiv.org/abs/2510.17720
摘要:Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
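The entity-preserving augmentation described above can be sketched as mask, paraphrase, restore; `paraphrase` below stands in for any rewriting model and is a hypothetical callable, not the paper's component:

```python
def augment(sentence, entities, paraphrase):
    """Replace each entity span with a placeholder, paraphrase the
    remaining context, then restore the original entity surface forms,
    so entity labels stay valid while the context varies."""
    masked = sentence
    for i, ent in enumerate(entities):
        masked = masked.replace(ent, f"<ENT{i}>")
    rewritten = paraphrase(masked)  # hypothetical paraphrase-model call
    for i, ent in enumerate(entities):
        rewritten = rewritten.replace(f"<ENT{i}>", ent)
    return rewritten
```

Because the entity strings pass through untouched, the augmented sentence can reuse the original NER annotations.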
【2】Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification
标题:扩展LSTM:有毒评论分类的自适应特征门控
链接:https://arxiv.org/abs/2510.17018
摘要:Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector $v \in \mathbb{R}^d$ modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on the threat and 28% on the identity_hate categories, with 15 times fewer parameters and 50 ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. These results establish a new efficiency-adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.
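A minimal sketch of cosine-similarity gating with a reference vector (here a plain numpy function; the sigmoid squash is an assumption — the paper's exact modulation may differ, and in training v would be a learnable parameter):

```python
import numpy as np

def cosine_gate(H, v):
    """Modulate contextual embeddings H (n, d) by their cosine
    similarity to a reference vector v (d,): embeddings aligned with
    v (toxic-like directions) are amplified, anti-aligned ones damped."""
    v_hat = v / np.linalg.norm(v)
    norms = np.clip(np.linalg.norm(H, axis=1, keepdims=True), 1e-8, None)
    sim = (H / norms) @ v_hat                 # cosine similarity in [-1, 1]
    gate = 1.0 / (1.0 + np.exp(-sim))         # squash to (0, 1)
    return H * gate[:, None]
```

Scaling embeddings this way changes their contribution to downstream logits, which is how the gate can strengthen gradients for rare toxic classes.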
Zero/Few/One-Shot|迁移|自适应(3篇)
【1】MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning
标题:MOSAIC:域内对比学习的选择性自适应掩蔽目标算法
链接:https://arxiv.org/abs/2510.16797
摘要:We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
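The balanced joint supervision can be sketched as a weighted sum of an MLM loss and a contrastive (InfoNCE-style) loss over in-batch negatives; the temperature and weighting below are illustrative defaults, not the paper's values:

```python
import numpy as np

def info_nce(anchors, positives, tau=0.05):
    """InfoNCE over in-batch negatives: row i of `positives` is the
    positive for row i of `anchors`; all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(probs)).mean())

def joint_loss(mlm_loss, contrastive_loss, lam=0.5):
    """Balanced joint supervision: weighted sum of the two objectives."""
    return lam * mlm_loss + (1.0 - lam) * contrastive_loss
```

In the staged pipeline, lam would trade off domain-specific masked supervision against preserving the model's semantic discrimination.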
【2】Zero-Shot Performance Prediction for Probabilistic Scaling Laws
标题:概率缩放定律的零样本性能预测
链接:https://arxiv.org/abs/2510.16743
备注:Accepted to NeurIPS 2025
摘要:The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multitask learning problem, where each task's data is modelled as being organized within a two-layer hierarchy. To model the shared information and dependencies across tasks and hierarchical levels, we employ latent variable multi-output Gaussian Processes, enabling to account for task correlations and supporting zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower costs. Applying an active learning strategy, LCs can be queried to reduce predictive uncertainty and provide predictions close to ground truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.
【3】SIADAFIX: issue description response for adaptive program repair
标题:SIADAFIX:自适应程序修复的问题描述响应
链接:https://arxiv.org/abs/2510.16059
备注:20 pages, 3 figures
摘要:We propose utilizing fast and slow thinking to enhance the capabilities of large language model-based agents on complex tasks such as program repair. In particular, we design an adaptive program repair method based on issue description response, called SIADAFIX. The proposed method utilizes a slow-thinking bug-fix agent to complete complex program repair tasks and employs fast-thinking workflow decision components to optimize and classify issue descriptions, using issue description response results to guide the orchestration of bug-fix agent workflows. SIADAFIX adaptively selects among three repair modes, i.e., easy, middle, and hard mode, based on problem complexity. It employs fast generalization for simple problems and test-time scaling techniques for complex problems. Experimental results on SWE-bench Lite show that the proposed method achieves 60.67% pass@1 performance using the Claude-4 Sonnet model, reaching state-of-the-art levels among all open-source methods. SIADAFIX effectively balances repair efficiency and accuracy, providing new insights for automated program repair. Our code is available at https://github.com/liauto-siada/siada-cli.
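The adaptive three-mode routing can be sketched as a threshold rule over a complexity score; the score and thresholds here are purely illustrative, not SIADAFIX's calibrated decision component:

```python
def select_mode(complexity):
    """Route an issue to a repair mode by a complexity score in [0, 1];
    thresholds are illustrative assumptions."""
    if complexity < 0.3:
        return "easy"    # fast generalization, minimal reasoning
    if complexity < 0.7:
        return "middle"
    return "hard"        # slow thinking plus test-time scaling
```

In practice the fast-thinking component would derive such a score from its classification of the issue description.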
检索(1篇)
【1】Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
标题:为真实而训练,保留技能:二进制检索增强奖励缓解幻觉
链接:https://arxiv.org/abs/2510.17733
摘要:Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
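The binary reward is all-or-nothing: it pays out only when every claim in the output is verified against retrieved evidence, unlike a continuous reward that pays for partial correctness. A minimal sketch, where `is_supported` stands in for the retrieval-based fact verifier (a hypothetical callable):

```python
def binary_rar(claims, is_supported):
    """Binary retrieval-augmented reward: 1.0 only if every extracted
    claim is verified against retrieved evidence, else 0.0."""
    return 1.0 if claims and all(is_supported(c) for c in claims) else 0.0

def continuous_reward(claims, is_supported):
    """Continuous baseline for contrast: fraction of supported claims."""
    return sum(map(is_supported, claims)) / len(claims) if claims else 0.0
```

Under the binary scheme a single unsupported claim zeroes the reward, which is what pushes the policy toward calibrated abstention ("I don't know") rather than partially correct guessing.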
语料库(1篇)
【1】EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture
标题:EgMM-Corpus:埃及文化的多模式视觉语言数据集
链接:https://arxiv.org/abs/2510.16198
摘要:Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
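The reported Top-1/Top-5 numbers are standard top-k accuracy over CLIP's image-to-text similarity logits; a minimal sketch:

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label appears among the k
    highest-scoring classes (rows: samples, columns: classes)."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

For zero-shot CLIP classification, `logits` would be the cosine similarities between each image embedding and the text embeddings of the candidate concept names.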
表征(3篇)
【1】Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning
标题:通过多模式表示学习解决多方对话中的反社会行为
链接:https://arxiv.org/abs/2510.17289
摘要:Antisocial behavior (ASB) on social media (including hate speech, harassment, and cyberbullying) poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings remain underexplored due to limited data. To address this gap, we use CyberAgressionAdo-Large, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification. We benchmark six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.
【2】Neuronal Group Communication for Efficient Neural representation
标题:神经元群通信实现高效的神经表示
链接:https://arxiv.org/abs/2510.16851
备注:28 pages, 2 figures
摘要:The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or "potential", which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.
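One simple way to quantify contraction of activations toward stable patterns, in the spirit of the Lyapunov-style metric described above (the paper's exact definition may differ), is the mean log-ratio of successive state-to-state distances: negative values indicate contraction, positive values divergence.

```python
import numpy as np

def stability_metric(states):
    """Lyapunov-style contraction estimate over a sequence of
    neuron-state snapshots: mean log-ratio of successive distances.
    Negative return value indicates contraction toward a stable pattern."""
    diffs = [np.linalg.norm(states[t + 1] - states[t])
             for t in range(len(states) - 1)]
    ratios = [np.log(diffs[t + 1] / diffs[t])
              for t in range(len(diffs) - 1)]
    return float(np.mean(ratios))
```

A trajectory whose updates shrink geometrically scores negative (stable); one whose updates grow scores positive.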
【3】Copy-Augmented Representation for Structure Invariant Template-Free Retrosynthesis
标题:结构不变无模板逆合成的拷贝增强表示
链接:https://arxiv.org/abs/2510.16588
摘要:Retrosynthesis prediction is fundamental to drug discovery and chemical synthesis, requiring the identification of reactants that can produce a target molecule. Current template-free methods struggle to capture the structural invariance inherent in chemical reactions, where substantial molecular scaffolds remain unchanged, leading to unnecessarily large search spaces and reduced prediction accuracy. We introduce C-SMILES, a novel molecular representation that decomposes traditional SMILES into element-token pairs with five special tokens, effectively minimizing editing distance between reactants and products. Building upon this representation, we incorporate a copy-augmented mechanism that dynamically determines whether to generate new tokens or preserve unchanged molecular fragments from the product. Our approach integrates SMILES alignment guidance to enhance attention consistency with ground-truth atom mappings, enabling more chemically coherent predictions. Comprehensive evaluation on USPTO-50K and large-scale USPTO-FULL datasets demonstrates significant improvements: 67.2% top-1 accuracy on USPTO-50K and 50.8% on USPTO-FULL, with 99.9% validity in generated molecules. This work establishes a new paradigm for structure-aware molecular generation with direct applications in computational drug discovery.
Word2Vec|文本|单词(3篇)
【1】Glyph: Scaling Context Windows via Visual-Text Compression
标题:字形:通过视觉文本压缩扩展上下文窗口
链接:https://arxiv.org/abs/2510.17800
摘要:Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective-visual context scaling-to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
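A back-of-the-envelope calculation shows why rendering text as images compresses the context. All numbers below are illustrative assumptions of ours (characters per rendered page, visual tokens per page), not Glyph's measured configuration:

```python
# Estimate the text-token-to-visual-token compression ratio for a given
# rendering configuration.

def compression_ratio(chars_per_page, chars_per_text_token,
                      visual_tokens_per_page):
    text_tokens = chars_per_page / chars_per_text_token
    return text_tokens / visual_tokens_per_page

# e.g. a dense page holding 3,000 characters (~750 BPE tokens at ~4 chars
# per token) consumed as ~256 visual tokens by a vision encoder
ratio = compression_ratio(chars_per_page=3000,
                          chars_per_text_token=4,
                          visual_tokens_per_page=256)
# under these assumptions the ratio is roughly 2.9x, in the same ballpark
# as the 3-4x reported in the abstract
```

The LLM-driven genetic search in the paper is, in effect, tuning the inputs to a function like this (font, layout, resolution) against accuracy as well as compression.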
【2】When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
标题:当标注者意见不一时,拓扑给出解释:Mapper,一种用于探索文本嵌入几何与歧义的拓扑工具
链接:https://arxiv.org/abs/2510.17548
备注:Accepted to appear in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Main Conference)
摘要:Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
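The "prediction purity" statistic quoted above (over 98% of connected components at >= 90% purity) has a simple definition that we can re-implement directly; this is our reading of the metric, not the paper's code:

```python
# A connected component of the Mapper graph is "pure" at level q if at
# least a fraction q of its points share the majority model prediction.
from collections import Counter

def component_purity(labels):
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def frac_pure_components(components, q=0.9):
    pure = sum(1 for c in components if component_purity(c) >= q)
    return pure / len(components)

components = [
    ["off", "off", "off", "ok"],   # purity 0.75 -> not pure at q=0.9
    ["ok", "ok", "ok", "ok"],      # purity 1.0
    ["off"] * 9 + ["ok"],          # purity 0.9 exactly
]
share = frac_pure_components(components, q=0.9)
# two of the three toy components clear the threshold
```

Computing the same fraction against ground-truth labels instead of predictions is what surfaces the structural-confidence vs. label-uncertainty tension the abstract describes.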
【3】Investigating the Association Between Text-Based Indications of Foodborne Illness from Yelp Reviews and New York City Health Inspection Outcomes (2023)
标题:调查Yelp评论中基于文本的食源性疾病指征与纽约市健康检查结果之间的关联(2023年)
链接:https://arxiv.org/abs/2510.16334
备注:Presented as a poster at Data Science Day 2024
摘要:Foodborne illnesses are gastrointestinal conditions caused by consuming contaminated food. Restaurants are critical venues to investigate outbreaks because they share sourcing, preparation, and distribution of foods. Public reporting of illness via formal channels is limited, whereas social media platforms host abundant user-generated content that can provide timely public health signals. This paper analyzes signals from Yelp reviews produced by a Hierarchical Sigmoid Attention Network (HSAN) classifier and compares them with official restaurant inspection outcomes issued by the New York City Department of Health and Mental Hygiene (NYC DOHMH) in 2023. We evaluate correlations at the Census tract level, compare distributions of HSAN scores by prevalence of C-graded restaurants, and map spatial patterns across NYC. We find minimal correlation between HSAN signals and inspection scores at the tract level and no significant differences by number of C-graded restaurants. We discuss implications and outline next steps toward address-level analyses.
其他神经网络|深度学习|模型|建模(11篇)
【1】UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
标题:UltraCUA:具有混合动作的计算机使用代理的基础模型
链接:https://arxiv.org/abs/2510.17790
摘要:Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
【2】Agentic Reinforcement Learning for Search is Unsafe
标题:面向搜索的智能体强化学习并不安全
链接:https://arxiv.org/abs/2510.17431
摘要:Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
【3】Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
标题:驾驭对齐-校准权衡:通过模型合并实现帕累托更优的前沿
链接:https://arxiv.org/abs/2510.17426
摘要:The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
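The intervention described above is plain linear interpolation in weight space. A dependency-free sketch, with state dicts as plain Python dicts of lists (real models would use torch tensors and `torch.lerp` or equivalent):

```python
# Interpolate between a model's weights before and after alignment.

def interpolate_weights(pre, post, alpha):
    """Return (1 - alpha) * pre + alpha * post, elementwise.

    alpha = 0 recovers the pre-alignment model, alpha = 1 the aligned one;
    intermediate values trace the accuracy/calibration frontier.
    """
    assert pre.keys() == post.keys()
    return {name: [(1 - alpha) * a + alpha * b
                   for a, b in zip(pre[name], post[name])]
            for name in pre}

pre  = {"layer.w": [0.0, 2.0], "layer.b": [1.0]}
post = {"layer.w": [1.0, 0.0], "layer.b": [3.0]}
mid = interpolate_weights(pre, post, alpha=0.5)
```

Sweeping alpha and evaluating both accuracy and calibration at each point is how the Pareto-optimal interpolations the abstract reports would be found.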
【4】From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
标题:从偏好到偏见:对齐微调在塑造视频扩散模型社会偏见中的作用
链接:https://arxiv.org/abs/2510.17247
摘要:Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
【5】Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
标题:通过提示优化抵御迭代越狱攻击的在线学习防御
链接:https://arxiv.org/abs/2510.17006
摘要:Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
【6】A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
标题:基于强化学习的智能体搜索综述:基础、角色、优化、评估与应用
链接:https://arxiv.org/abs/2510.16724
备注:38 pages, 4 figures, 7 tables
摘要:The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.
【7】Hallucination Benchmark for Speech Foundation Models
标题:语音基础模型的幻觉基准
链接:https://arxiv.org/abs/2510.16567
备注:Under Review
摘要:Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). While similar to conventional decoding errors in potentially compromising the usability of transcriptions for downstream applications, hallucinations can be more detrimental due to their preservation of syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation metrics are primarily centered on error-based metrics and fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we have found that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER). Still, this correlation weakens substantially as WER increases. SHALLOW, therefore, captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
【8】Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
标题:探究ASR基础模型在二语英语口语评估中的隐藏潜能
链接:https://arxiv.org/abs/2510.16387
摘要:In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.
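The probing recipe above (freeze the ASR encoder, pool its hidden frame representations, train only a lightweight classifier on top) can be sketched end to end. Everything below is a stand-in: the "hidden states" are hard-coded 2-d vectors and the probe is nearest-centroid, whereas the paper uses Whisper's real intermediate and final outputs with a trained classifier.

```python
# Mean-pool frame-level hidden states, then classify with a lightweight
# nearest-centroid probe.

def mean_pool(frames):
    d = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(d)]

def fit_centroids(pooled, labels):
    sums, counts = {}, {}
    for vec, y in zip(pooled, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(vec, centroids):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda y: dist2(vec, centroids[y]))

# two toy utterances with 2-d "hidden states"
utt_a = [[0.0, 0.1], [0.2, -0.1]]   # low-proficiency cluster
utt_b = [[1.0, 1.1], [0.8, 0.9]]    # high-proficiency cluster
pooled = [mean_pool(utt_a), mean_pool(utt_b)]
centroids = fit_centroids(pooled, ["low", "high"])
pred = predict(mean_pool([[0.9, 1.0]]), centroids)
```

The point of the design is that only the probe is trained; the foundation model's representations are used as-is, which is what "without task-specific fine-tuning" refers to.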
【9】WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale
标题:WEBSERV:一个用于大规模有效训练基于强化学习的Web代理的浏览器-服务器环境
链接:https://arxiv.org/abs/2510.16252
摘要:Training and evaluation of Reinforcement Learning (RL) web agents have gained increasing attention, yet a scalable and efficient environment that couples realistic and robust browser-side interaction with controllable server-side state at scale is still missing. Existing environments tend to have one or more of the following issues: they overwhelm policy models with excessive and noisy context; they perform actions non-deterministically without waiting for the UI or network to stabilize; or they cannot scale isolated client-server containers effectively for parallel RL rollouts. We propose WEBSERV, an environment that includes 1) a compact, site-agnostic browser environment that balances context and action complexity, and 2) a scalable RL environment via efficient launching and resetting web-servers to enable scalable RL training and evaluation. We evaluate WEBSERV on the shopping CMS and Gitlab tasks in WebArena, achieving state-of-the-art single-prompt success rates while cutting launch latency by ~5x and storage need by ~240x, with a comparable memory footprint, enabling 200+ concurrent containers on a single host.
【10】Zeroth-Order Sharpness-Aware Learning with Exponential Tilting
标题:具有指数倾斜的零阶锐度感知学习
链接:https://arxiv.org/abs/2510.16157
摘要:Classic zeroth-order optimization approaches typically optimize for a smoothed version of the original function, i.e., the expected objective under randomly perturbed model parameters. This can be interpreted as encouraging the loss values in the perturbation set to be small on average. Popular sharpness-aware minimization (SAM) objectives, however, typically focus on the largest loss within the neighborhood to arrive at flat minima more effectively. In this work, we connect zeroth-order optimization (and its corresponding objectives) with SAM approaches explicitly, through an exponential tilting objective that provides a smooth transition between the average- and the max-loss formulations. We explore new zeroth-order algorithms to solve a soft SAM objective parameterized by a tilting parameter $t$. We provide precise characterizations of the sharpness notions of the tilted SAM framework. Practically, our approach can be used as a gradient-free and memory-efficient alternative to SAM variants, and it achieves better generalization compared to vanilla zeroth-order baselines on a wide range of downstream tasks, including classification, multiple choice QA, and language generation.
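The exponential tilting at the heart of the method has a compact closed form: for losses L in a perturbation neighborhood, the tilted objective is (1/t) log E[exp(t L)], which recovers the average loss as t -> 0 and approaches the max loss as t grows. A numeric sketch under our own sampling assumptions (uniform toy losses standing in for losses at perturbed weights):

```python
# Tilted (soft-max-like) aggregation of losses over a perturbation set.
import math
import random

def tilted_loss(losses, t):
    # (1/t) * log( mean( exp(t * L) ) ), computed stably by factoring
    # out the max loss before exponentiating
    m = max(losses)
    return m + math.log(sum(math.exp(t * (l - m)) for l in losses)
                        / len(losses)) / t

rng = random.Random(0)
losses = [rng.uniform(0.0, 1.0) for _ in range(100)]
soft = tilted_loss(losses, t=5.0)
avg, worst = sum(losses) / len(losses), max(losses)
# for any finite t > 0 the tilted value sits strictly between the
# average-loss (smoothed zeroth-order) and max-loss (SAM) formulations
```

This is the smooth transition the abstract describes; the paper's algorithms then estimate gradients of this objective with zeroth-order (function-value-only) queries.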
【11】PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation
标题:PrivacyPAD:一个用于动态隐私感知委托的强化学习框架
链接:https://arxiv.org/abs/2510.16054
摘要:When users submit queries to Large Language Models (LLMs), their prompts often contain sensitive data, forcing a difficult choice: send the query to a powerful proprietary LLM provider to achieve state-of-the-art performance and risk data exposure, or rely on smaller, local models that guarantee data privacy but often degrade task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task-critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state-of-the-art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.
其他(24篇)
【1】Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
标题:企业深度研究:面向企业分析的可引导多智能体深度研究
链接:https://arxiv.org/abs/2510.17797
备注:Technical report; 13 pages plus references and appendices
摘要:As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise-deep-research and Dataset at https://huggingface.co/datasets/Salesforce/EDR-200
【2】LILO: Bayesian Optimization with Interactive Natural Language Feedback
标题:LILO:具有交互式自然语言反馈的贝叶斯优化
链接:https://arxiv.org/abs/2510.17671
摘要:For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct BO over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.
【3】Annotation-Efficient Universal Honesty Alignment
标题:标注高效的通用诚实对齐
链接:https://arxiv.org/abs/2510.17509
摘要:Honesty alignment-the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence-is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
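Stage one of EliCal elicits confidence from self-consistency: sample several answers to the same question and use the agreement rate as a cheap supervision signal. A toy sketch (the "samples" are hard-coded; in practice they come from the LLM at nonzero temperature):

```python
# Self-consistency confidence: fraction of sampled answers that agree
# with the majority answer.
from collections import Counter

def self_consistency_confidence(samples):
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

samples = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
answer, conf = self_consistency_confidence(samples)
# conf = 0.8 becomes the (annotation-free) target for training the
# elicited-confidence predictor; stage two then calibrates that predictor
# with a small set of correctness labels
```

The economics the abstract reports follow from this split: agreement signals cost only inference compute, so correctness annotations are needed only for the small calibration stage.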
【4】Lingua Custodi's participation at the WMT 2025 Terminology shared task
标题:Lingua Custodi参与WMT 2025术语共享任务
链接:https://arxiv.org/abs/2510.17504
摘要:While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance, by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
【5】ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
标题:ReXMoE:在专家混合模型中以最小开销重用专家
链接:https://arxiv.org/abs/2510.17483
摘要:Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE boosts the efficiency by activating a subset of experts per token. Recent works show that fine-grained experts substantially enriches the combinatorial flexibility of active experts and enhances model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This requires a careful trade-off between expert dimensionality and routing diversity given fixed parameter budgets. We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating overall parameters. To this end, we propose a new progressive scaling routing (PSR) strategy to gradually increase the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as new design paradigm for parameter-efficient and scalable MoE-based LLMs.
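The routing idea can be illustrated with a toy scorer: a layer's router ranks not only its own experts but also those of an adjacent layer, so the candidate pool grows without adding expert parameters. The scores and expert names below are our own stand-ins for the learned routers in the paper:

```python
# Top-k routing over a layer-local pool vs. a shared cross-layer pool.

def top_k_route(scores, k):
    """Pick the k highest-scoring expert ids from a pool of scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# layer-local routing: layer 1 may only use its own experts e0, e1
local_pool = {"e0": 0.1, "e1": 0.3}
local_pick = top_k_route(local_pool, k=2)

# ReXMoE-style routing: layer 1 also sees the adjacent layer's e2, e3
shared_pool = {"e0": 0.1, "e1": 0.3, "e2": 0.9, "e3": 0.2}
shared_pick = top_k_route(shared_pool, k=2)
# the reused cross-layer expert e2 now wins a slot that the local pool
# could never offer, enriching the expert combinations per token
```

The paper's progressive scaling routing (PSR) strategy would correspond here to starting training with the local pool and gradually widening it toward the shared one.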
【6】AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
标题:AfriCaption:建立非洲语言图像描述的新范式
链接:https://arxiv.org/abs/2510.17405
摘要:Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages and our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
【7】When AI companions become witty: Can human brain recognize AI-generated irony?
标题:当AI同伴变得机智时:人类大脑能识别AI生成的讽刺吗?
链接:https://arxiv.org/abs/2510.17168
摘要:As Large Language Models (LLMs) are increasingly deployed as social agents and trained to produce humor and irony, a question emerges: when encountering witty AI remarks, do people interpret these as intentional communication or mere computational output? This study investigates whether people adopt the intentional stance, attributing mental states to explain behavior,toward AI during irony comprehension. Irony provides an ideal paradigm because it requires distinguishing intentional contradictions from unintended errors through effortful semantic reanalysis. We compared behavioral and neural responses to ironic statements from AI versus human sources using established ERP components: P200 reflecting early incongruity detection and P600 indexing cognitive efforts in reinterpreting incongruity as deliberate irony. Results demonstrate that people do not fully adopt the intentional stance toward AI-generated irony. Behaviorally, participants attributed incongruity to deliberate communication for both sources, though significantly less for AI than human, showing greater tendency to interpret AI incongruities as computational errors. Neural data revealed attenuated P200 and P600 effects for AI-generated irony, suggesting reduced effortful detection and reanalysis consistent with diminished attribution of communicative intent. Notably, people who perceived AI as more sincere showed larger P200 and P600 effects for AI-generated irony, suggesting that intentional stance adoption is calibrated by specific mental models of artificial agents. These findings reveal that source attribution shapes neural processing of social-communicative phenomena. Despite current LLMs' linguistic sophistication, achieving genuine social agency requires more than linguistic competence, it necessitates a shift in how humans perceive and attribute intentionality to artificial agents.
【8】Rethinking On-policy Optimization for Query Augmentation
标题:重新思考用于查询增强的同策略优化
链接:https://arxiv.org/abs/2510.17139
摘要:Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting the query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
【9】Verification-Aware Planning for Multi-Agent Systems
标题:多智能体系统的验证感知规划
链接:https://arxiv.org/abs/2510.17109
备注:Submission for ARR Oct
摘要:Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
【10】Back to Bytes: Revisiting Tokenization Through UTF-8
标题:回到字节:通过UTF-8重新审视分词
链接:https://arxiv.org/abs/2510.16987
摘要:We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.
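The identity byte-to-ID mapping at the heart of this design is simple enough to sketch directly. This is an illustrative reimplementation of the core idea, not the released UTF8Tokenizer code:

```python
# Illustrative sketch of the core UTF8Tokenizer idea: token IDs are
# exactly the bytes of the text's UTF-8 encoding, so the vocabulary
# never exceeds 256 and no out-of-range or auxiliary tokens exist.

def encode(text: str) -> list[int]:
    """Map text to token IDs: the raw UTF-8 bytes (0-255)."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Exactly invert encode."""
    return bytes(ids).decode("utf-8")

assert encode("\t") == [9]                 # byte x09 is token ID 9
assert decode(encode("héllo")) == "héllo"  # multi-byte chars round-trip
assert max(encode("любой text")) < 256     # IDs are capped below 256
```

Special behavior (padding, boundaries, conversation structure) would then be carried by reserving C0 control bytes (0x00-0x1F) inside this same 256-ID space, as the abstract describes.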
【11】FinSight: Towards Real-World Financial Deep Research
标题:FinSight:走向现实世界的金融深度研究
链接:https://arxiv.org/abs/2510.16844
备注:Working in progress
摘要:Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.
【12】Who's Asking? Simulating Role-Based Questions for Conversational AI Evaluation
标题:谁在问?模拟基于角色的问题进行对话式人工智能评估
链接:https://arxiv.org/abs/2510.16829
摘要:Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.
【13】End-to-end Listen, Look, Speak and Act
标题:端到端听、看、说和行动
链接:https://arxiv.org/abs/2510.16756
备注:22 pages, 8 figures
摘要:Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.
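The gate-weighted expert combination underlying a mixture-of-experts block, which SA-MoE builds on, can be illustrated in miniature. The scalar "activations", experts, and gate scores below are invented for the sketch; ELLSA's actual routing operates over modality-specific expert networks inside a unified attention backbone:

```python
import math

# Toy sketch of mixture-of-experts routing: a gate scores each expert,
# and the output is the softmax-weighted mix of expert outputs. Real
# MoE layers do this per token over vector activations, not scalars.

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe(x, experts, gate_scores):
    """Combine expert outputs with softmax gate weights."""
    weights = softmax(gate_scores)
    return sum(w * f(x) for w, f in zip(weights, experts))

experts = [lambda x: x * 2,    # e.g. a "speech" expert (invented)
           lambda x: x + 10]   # e.g. a "vision" expert (invented)

# The gate strongly prefers expert 0, so the output tracks it closely.
y = moe(1.0, experts, gate_scores=[5.0, 0.0])
assert abs(y - 2.0) < 0.1
```

In a full model, the gate scores themselves are learned, letting each modality be routed to its specialized expert while the shared attention backbone fuses the results.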
【14】Natural Language Processing Applications in Cardiology: A Narrative Review
标题:自然语言处理在心脏病学中的应用:叙述性综述
链接:https://arxiv.org/abs/2510.16708
摘要:Cardiovascular disease has become increasingly prevalent in modern society and has a significant effect on global health and well-being. Heart-related conditions are intricate, multifaceted disorders, which may be influenced by a combination of genetic predispositions, lifestyle choices, and various socioeconomic and clinical factors. Information regarding these potentially complex interrelationships is dispersed among diverse types of textual data, which include patient narratives, medical records, and scientific literature, among others. Natural language processing (NLP) techniques have increasingly been adopted as a powerful means to analyse and make sense of this vast amount of unstructured data. This, in turn, can allow healthcare professionals to gain deeper insights into the cardiology field, which has the potential to revolutionize current approaches to the diagnosis, treatment, and prevention of cardiac problems. This review provides a detailed overview of NLP research in cardiology between 2014 and 2025. We queried six literature databases to find articles describing the application of NLP techniques in the context of a range of different cardiovascular diseases. Following a rigorous screening process, we identified a total of 265 relevant articles. We analysed each article from multiple dimensions, i.e., NLP paradigm types, cardiology-related task types, cardiovascular disease types, and data source types. Our analysis reveals considerable diversity within each of these dimensions, thus demonstrating the considerable breadth of NLP research within the field. We also perform a temporal analysis, which illustrates the evolution and changing trends in NLP methods employed over the last decade that we cover. To our knowledge, the review constitutes the most comprehensive overview of NLP research in cardiology to date.
【15】All You Need is One: Capsule Prompt Tuning with a Single Vector
标题:只需一个即可:使用单个向量进行胶囊提示调优
链接:https://arxiv.org/abs/2510.16670
备注:NeurIPS 2025
摘要:Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require considerable number of prompts, introducing additional computational burden. Worse yet, our pioneer findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely "attention anchor", that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03\% average accuracy on T5-Large), serving as an "attention anchor," while enjoying high parameter efficiency (e.g., 0.003\% of model parameters on Llama3.2-1B).
【16】Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection
标题:代理的自动组合:智能体组件选择的背包方法
链接:https://arxiv.org/abs/2510.16499
备注:Accepted to NeurIPS 2025 Conference
摘要:Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real-time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.
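As a rough intuition for the knapsack framing, the sketch below greedily selects components by utility-per-cost under a budget. This is an offline simplification with invented names and numbers; the paper's composer solves an online variant and estimates utility by dynamically testing candidates:

```python
# Offline greedy sketch of knapsack-style component selection
# (illustrative only -- component names, utilities, and costs are
# invented, and the paper's actual method is an *online* knapsack).

def select_components(candidates, budget):
    """Greedily pick components by utility-per-cost under a budget.

    candidates: iterable of (name, utility, cost) tuples.
    Returns (chosen_names, total_cost).
    """
    chosen, spent = [], 0.0
    for name, utility, cost in sorted(
            candidates, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

agents = [("planner", 10, 2), ("retriever", 7, 1), ("critic", 3, 3)]
# "retriever" has the best utility density (7), then "planner" (5);
# "critic" (1) no longer fits the budget of 3.
chosen, spent = select_components(agents, budget=3)
assert chosen == ["retriever", "planner"] and spent == 3
```

The online setting differs mainly in that candidates arrive and are tested one at a time, so acceptance decisions must be made without seeing the full inventory.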
【17】ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents
标题:ATA:实现自主且值得信赖的代理的神经符号方法
链接:https://arxiv.org/abs/2510.16381
摘要:Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.
【18】End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction
标题:通过自回归论点结构预测进行端到端论点挖掘
链接:https://arxiv.org/abs/2510.16363
备注:Accepted version. To appear in IJCNN 2025
摘要:Argument Mining (AM) helps automate the extraction of complex argumentative structures, such as Argument Components (ACs) like Premise and Claim, and Argumentative Relations (ARs) like Support and Attack, in argumentative text. Due to the inherent complexity of reasoning involved in this task, modelling dependencies between ACs and ARs is challenging. Most recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast, this study jointly formulates the key tasks of AM in an end-to-end fashion using the Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework that has performed well on several NLP tasks. AASP models argumentative structures as constrained, pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning efficiently. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-the-art (SoTA) results across all AM tasks on two benchmarks and delivers strong results on the third.
【19】Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback
标题:以稀疏反馈实现对多元观点的低资源对齐
链接:https://arxiv.org/abs/2510.16257
备注:Findings of EMNLP 2025, 5 pages
摘要:As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improves the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
【20】ScholarEval: Research Idea Evaluation Grounded in Literature
标题:ScholarEval:立足于文献的研究思路评估
链接:https://arxiv.org/abs/2510.16234
摘要:As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas. We introduce ScholarEval, a retrieval augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness - the empirical validity of proposed methods based on existing literature, and contribution - the degree of advancement made by the idea across different dimensions relative to prior research. To evaluate ScholarEval, we introduce ScholarIdeas, the first expert-annotated dataset of multi-domain research ideas and reviews, comprised of 117 ideas across four disciplines: artificial intelligence, neuroscience, biochemistry, and ecology. Our evaluation shows that ScholarEval achieves significantly higher coverage of points mentioned in the human expert annotated rubrics in ScholarIdeas compared to all baselines. Furthermore, ScholarEval is consistently preferred over our strongest baseline o4-mini-deep-research, a reasoning and search-enabled agentic system by OpenAI, in terms of evaluation actionability, depth, and evidence support. Our large-scale user study also shows that ScholarEval significantly outperforms deep research in literature engagement, idea refinement, and usefulness. We openly release our code, dataset, and ScholarEval tool for the community to use and build on.
【21】What Can String Probability Tell Us About Grammaticality?
标题:关于语法性,字符串概率可以告诉我们什么?
链接:https://arxiv.org/abs/2510.16227
摘要:What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.
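The first prediction, comparing string probabilities within a minimal pair, can be illustrated with a toy bigram LM. The corpus and sentence pair below are invented; the paper's experiments use neural LMs over 280K pairs:

```python
import math
from collections import Counter

# Toy illustration (not the paper's setup): an add-one-smoothed bigram
# LM assigns a higher probability to the grammatical member of a
# minimal pair differing only in subject-verb agreement.
corpus = ("the cat sleeps . the dog sleeps . "
          "the cats sleep . the dogs sleep .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def logprob(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability of a tokenized sentence."""
    toks = sentence.split()
    lp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return lp

good, bad = "the cat sleeps .", "the cat sleep ."  # minimal pair: agreement
delta = logprob(good) - logprob(bad)
assert delta > 0  # the grammatical member scores higher
```

The paper's third prediction cautions against the converse use: across *unpaired* strings, grammatical and ungrammatical sentences overlap heavily in probability, so only within-pair deltas like `delta` are informative.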
【22】Alignment is Localized: A Causal Probe into Preference Layers
标题:对齐是局部化的:偏好层的因果关系探索
链接:https://arxiv.org/abs/2510.16167
摘要:Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for language model alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on \textit{Llama-3.2-1B}, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some language models, alignment from human-based, preferential tuning is a directional, low rank process, rather than diffuse and parameteric.
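The layer-wise causal patching procedure can be caricatured with scalar "layers". The functions below are invented stand-ins for transformer layers, but the logic is the same: swap one layer from the tuned model into the base model and measure how much the output shifts:

```python
# Toy sketch of layer-wise causal patching (illustrative, not the
# paper's Llama-3.2-1B setup): a "model" is a stack of functions, and
# patching layer i replaces the base model's layer with the tuned one.
base_layers  = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
tuned_layers = [lambda x: x + 1, lambda x: x * 5, lambda x: x - 3]  # only layer 1 differs

def run(layers, x):
    for f in layers:
        x = f(x)
    return x

def patch_effect(i, x=1.0):
    """Output change when layer i of the base stack is swapped for the tuned one."""
    patched = base_layers[:i] + [tuned_layers[i]] + base_layers[i + 1:]
    return run(patched, x) - run(base_layers, x)

effects = [patch_effect(i) for i in range(3)]
# "Localization": only the mid layer carries the behavioral difference.
assert effects[0] == 0 and effects[1] != 0 and effects[2] == 0
```

In the real experiment the swapped quantities are layer activations on human preference pairs, and the "output shift" is measured against a reward-consistent behavior rather than a scalar.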
【23】Attention to Non-Adopters
标题:关注非采用者
链接:https://arxiv.org/abs/2510.15951
摘要:Although language model-based chat systems are increasingly used in daily life, most Americans remain non-adopters of chat-based LLMs -- as of June 2025, 66% had never used ChatGPT. At the same time, LLM development and evaluation rely mainly on data from adopters (e.g., logs, preference data), focusing on the needs and tasks for a limited demographic group of adopters in terms of geographic location, education, and gender. In this position paper, we argue that incorporating non-adopter perspectives is essential for developing broadly useful and capable LLMs. We contend that relying on methods that focus primarily on adopters will risk missing a range of tasks and needs prioritized by non-adopters, entrenching inequalities in who benefits from LLMs, and creating oversights in model development and evaluation. To illustrate this claim, we conduct case studies with non-adopters and show: how non-adopter needs diverge from those of current users, how non-adopter needs point us towards novel reasoning tasks, and how to systematically integrate non-adopter needs via human-centered methods.
【24】Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
标题:读者更喜欢经版权书籍训练的人工智能的输出,而非人类专家作家的作品
链接:https://arxiv.org/abs/2510.13939
备注:Preprint Under Review
摘要:The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear if these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
机器翻译由腾讯交互翻译提供,仅供参考

