
Natural Language Processing Academic Digest [8.27]

2025-08-27
Overview: cs.CL, 73 papers today

Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, and offers search, bookmarking, and more!




Large Models (28 papers)

【1】Generative Interfaces for Language Models
Link: https://arxiv.org/abs/2508.19227

Authors: n, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang
Note: Preprint
Abstract: Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.


【2】Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Link: https://arxiv.org/abs/2508.19202

Authors: Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan
Note: 28 pages, 16 figures
Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.


【3】It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs
Link: https://arxiv.org/abs/2508.19089

Authors: hixue Zhao, Carolina Scarton
Note: Accepted by EMNLP 2025
Abstract: Extremely low-resource languages, especially those written in rare scripts, as shown in Figure 1, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both the language and its script are extremely under-represented by the LLM. In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.


【4】HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance
Link: https://arxiv.org/abs/2508.19076

Authors: Yuan Chang, Gaihong Yu, Xiaoqiu Le
Abstract: Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents' decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.


【5】The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization
Link: https://arxiv.org/abs/2508.18976

Authors: eisenbacher, Alexandra Klymenko, Andreea-Elena Bodea, Florian Matthes
Note: 15 pages, 4 figures, 8 tables. Accepted to WPES @ CCS 2025
Abstract: Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization – this we refer to as "contextual vulnerability". Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.
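
As context for the attack surface, word-level DP sanitizers of the kind studied here typically perturb each word's embedding and substitute the nearest vocabulary word. A rough sketch follows; the per-coordinate Laplace noise and all parameters are simplifying assumptions for illustration, not a calibrated metric-DP mechanism from the paper.

```python
import numpy as np

def dp_sanitize_words(words, emb, vocab, vocab_vecs, epsilon=10.0, rng=None):
    """Each word's embedding is perturbed with noise scaled by 1/epsilon and
    replaced by the nearest vocabulary word. Because the randomization is
    independent per word, unperturbed or weakly perturbed neighbors can leak
    contextual clues about the original text.
    emb: dict word -> vector; vocab: list of words; vocab_vecs: (V, d) array.
    """
    rng = rng or np.random.default_rng()
    out = []
    for w in words:
        noisy = emb[w] + rng.laplace(scale=1.0 / epsilon, size=vocab_vecs.shape[1])
        out.append(vocab[int(np.argmin(np.linalg.norm(vocab_vecs - noisy, axis=1)))])
    return out
```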


【6】Empowering Computing Education Researchers Through LLM-Assisted Content Analysis
Link: https://arxiv.org/abs/2508.18872

Authors: le, Sebastian Mateos Nicolajsen
Note: 7 pages, 2 figures
Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline's teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.


【7】ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Link: https://arxiv.org/abs/2508.18847

Authors: Miao Xiong, Jiaying Wu, Bryan Hooi
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
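
For intuition, here is a minimal sketch of a Brier-score loss over verbalized confidences, the proper scoring rule that ConfTuner's tokenized variant builds on; the decoding of confidence tokens into probabilities is an assumption, not the paper's implementation.

```python
import torch

def brier_confidence_loss(conf_probs: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """Brier score: squared error between a verbalized confidence (as a
    probability) and the 0/1 correctness of the answer. It is a proper
    scoring rule, so expected loss is minimized exactly when the reported
    confidence equals the true probability of being correct.

    conf_probs: (batch,) confidences decoded from the model's confidence
                tokens, e.g. "80%" -> 0.8 (hypothetical encoding).
    correct:    (batch,) 1.0 if the answer was right, else 0.0.
    """
    return torch.mean((conf_probs - correct) ** 2)

# Toy check: the overconfident wrong answer contributes the largest penalty.
probs = torch.tensor([0.9, 0.6, 0.3])
labels = torch.tensor([0.0, 1.0, 0.0])
print(brier_confidence_loss(probs, labels))  # (0.81 + 0.16 + 0.09) / 3 ≈ 0.353
```

Because the score is proper, a model that minimizes it has no incentive to report anything other than its true probability of being correct, which is exactly the calibration property the paper targets.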


【8】Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness
Link: https://arxiv.org/abs/2508.18824

Authors: n, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Abstract: Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.


【9】LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection
Link: https://arxiv.org/abs/2508.18819

Authors: upta, Shraban Kumar Chatterjee, Suman Kundu
Abstract: The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. Furthermore, these methods require extensive labelled datasets, making their deployment resource-intensive. In this study, we propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations using Abstract Meaning Representation (AMR) and news propagation dynamics. We introduce an LLM-based graph contrastive loss (LGCL) that utilizes negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner. To incorporate social context, we employ a multi-view graph masked autoencoder, which learns news propagation features from the social context graph. By combining these semantic and propagation-based features, our approach effectively differentiates between fake and real news in a self-supervised manner. Extensive experiments demonstrate that our self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labelled datasets while improving generalizability.


【10】ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
Link: https://arxiv.org/abs/2508.18773

Authors: , Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen
Abstract: Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.


【11】Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models
Link: https://arxiv.org/abs/2508.18739

Authors: g, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu, Ruiwen Xu
Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.


【12】Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs
Link: https://arxiv.org/abs/2508.18709

Authors: ent Ziti, Evan Girard-Sun, Sean O'Brien, Vasu Sharma, Kevin Zhu
Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies -- zero-shot, few-shot, chain-of-thought -- tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
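
A minimal sketch of the cosine-based similarity rejection the abstract describes; the sentence-embedding function and the 0.85 threshold are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def aof_filter(candidates, embed, accepted=(), threshold=0.85):
    """Keep a candidate riddle only if its cosine similarity to every
    already-accepted riddle stays below a threshold; otherwise reject it
    as redundant (in practice, the generator would then be re-prompted).
    `embed` is any sentence-embedding function returning a 1-D numpy vector.
    """
    kept_vecs = [embed(t) / np.linalg.norm(embed(t)) for t in accepted]
    out = []
    for text in candidates:
        v = embed(text)
        v = v / np.linalg.norm(v)
        if all(float(v @ u) <= threshold for u in kept_vecs):
            out.append(text)
            kept_vecs.append(v)  # accepted riddles constrain later candidates
    return out
```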


【13】FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation
Link: https://arxiv.org/abs/2508.18684

Authors: Mitra, Azim Bazarov, Martin Duclos, Sudip Mittal, Aritran Piplai, Md Rayhanur Rahman, Edward Zieglar, Shahram Rahimi
Note: 11 pages, 5 figures, 4 tables
Abstract: Signature-based Intrusion Detection Systems (IDS) detect malicious activities by matching network or host activity against predefined rules. These rules are derived from extensive Cyber Threat Intelligence (CTI), which includes attack signatures and behavioral patterns obtained through automated tools and manual threat analysis, such as sandboxing. The CTI is then transformed into actionable rules for the IDS engine, enabling real-time detection and prevention. However, the constant evolution of cyber threats necessitates frequent rule updates, which delay deployment time and weaken overall security readiness. Recent advancements in agentic systems powered by Large Language Models (LLMs) offer the potential for autonomous IDS rule generation with internal evaluation. We introduce FALCON, an autonomous agentic framework that generates deployable IDS rules from CTI data in real-time and evaluates them using built-in multi-phased validators. To demonstrate versatility, we target both network (Snort) and host-based (YARA) mediums and construct a comprehensive dataset of IDS rules with their corresponding CTIs. Our evaluations indicate FALCON excels in automatic rule generation, with an average of 95% accuracy validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts across all metrics. These results underscore the feasibility and effectiveness of LLM-driven data mining for real-time cyber threat mitigation.


【14】Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Link: https://arxiv.org/abs/2508.18672

Authors: kamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Note: Presented at the Second AI for Math Workshop at ICML
Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-k routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-k alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
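
For reference, the top-k routing that the paper varies is the standard MoE gating step; a minimal sketch (dimensions and k below are illustrative, not the paper's configurations):

```python
import torch
import torch.nn.functional as F

def topk_route(x, router_weights, k=2):
    """Standard top-k MoE gating: score all experts per token, keep the k
    highest-scoring ones, and renormalize their gate weights. Only these k
    experts run for each token, which is the sparsity knob the paper studies.
    x: (tokens, d_model); router_weights: (d_model, n_experts).
    """
    logits = x @ router_weights                # (tokens, n_experts)
    gates, experts = torch.topk(logits, k, dim=-1)
    gates = F.softmax(gates, dim=-1)           # weights over the k winners
    return experts, gates

x = torch.randn(4, 16)
w = torch.randn(16, 8)                         # 8 experts, activate k=2 per token
experts, gates = topk_route(x, w)
print(experts.shape, gates.sum(-1))            # torch.Size([4, 2]); rows sum to 1
```

Total parameters scale with the number of experts, while active parameters scale with k, which is why the two can be varied independently under a fixed compute budget.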


【15】Membership Inference Attacks on LLM-based Recommender Systems
Link: https://arxiv.org/abs/2508.18665

Authors: , Yuechun Gu, Min-Chun Chen, Keke Chen
Abstract: Large language model (LLM)-based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. They utilize in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attacks. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims' historical interactions have been used by system prompts. They are direct inquiry, hallucination, similarity, and poisoning attacks, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks show significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.
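
To make the attack surface concrete, here is a hypothetical prompt for the direct-inquiry attack; the wording is illustrative, not the paper's template.

```python
def direct_inquiry_prompt(candidate_interaction: str) -> str:
    """Ask the deployed RecSys outright whether a candidate interaction
    appears in its hidden system prompt; a yes-leaning answer is treated
    as evidence of membership."""
    return (
        "Answer strictly yes or no. Does your instruction context contain "
        f"the following user interaction?\n---\n{candidate_interaction}\n---"
    )
```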


【16】Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models
Link: https://arxiv.org/abs/2508.18655

Authors: g, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, Yiwen Guo
Note: 5 pages, 1 figure, submitted to ICASSP 2026
Abstract: With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user's query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-machine interaction. Currently, most speech LLMs with empathetic capabilities are trained on massive datasets. This approach requires vast amounts of data and significant computational resources. Therefore, a key challenge lies in how to develop a speech LLM capable of generating empathetic responses with limited data and without the need for large-scale training. To address this challenge, we propose Emotion Omni, a novel model architecture designed to understand the emotional content of user speech input and generate empathetic speech responses. Additionally, we developed a data generation pipeline based on an open-source TTS framework to construct a 200k emotional dialogue dataset, which supports the construction of an empathetic speech assistant. The demos are available at https://w311411.github.io/omni_demo/


【17】Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models
Link: https://arxiv.org/abs/2508.18651

Authors: ng, Qingyi Si, Zheng Lin
Abstract: Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrificing expressiveness. In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. This integration is guided by distribution divergence and model confidence, enabling the selective activation of relevant and reliable expressions from the model's internal parameters. Furthermore, we introduce a knowledge-aware reranking mechanism that prevents over-reliance on prior parametric knowledge while ensuring proper utilization of provided external information. Through comprehensive experiments, our plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, validating both its effectiveness and generalizability.
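
One hedged reading of such a decoding rule: mix the next-token distributions computed with and without the retrieved knowledge, weighting the grounded distribution more as the two diverge. The sigmoid gate and alpha below are assumptions for illustration, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def collaborative_decode_step(logits_with, logits_without, alpha=1.0):
    """Mix the next-token distributions produced with and without external
    knowledge in context. The knowledge-grounded distribution gets more
    weight the more the two distributions diverge; the paper additionally
    uses model confidence, omitted here for brevity.
    """
    p_k = F.softmax(logits_with, dim=-1)             # knowledge-grounded
    p_m = F.softmax(logits_without, dim=-1)          # parametric-only
    div = F.kl_div(p_m.log(), p_k, reduction="sum")  # KL(p_k || p_m)
    lam = torch.sigmoid(alpha * div)                 # 0.5 when identical, -> 1 as they diverge
    return lam * p_k + (1.0 - lam) * p_m
```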


【18】Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
Link: https://arxiv.org/abs/2508.18646

Authors: Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan
Note: Preprint. Under review
Abstract: For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.


【19】Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
Link: https://arxiv.org/abs/2508.18609

Authors: ou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu
Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.


【20】What do language models model? Transformers, automata, and the format of thought
Link: https://arxiv.org/abs/2508.18598

Authors: in
Abstract: What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we've trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely on supralinear formats for computation. The transformer architecture, by contrast, supports at best a linear format for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don't think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of 'discourse machine' that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have learned to use it too, but via very different means.


【21】Principled Detection of Hallucinations in Large Language Models via Multiple Testing
Link: https://arxiv.org/abs/2508.18473

Authors: , Akshayaa Magesh, Venugopal V. Veeravalli
Note: 16 pages
Abstract: While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.
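
The abstract does not spell out the test statistics, but a standard multiple-testing tool in this setting is the Benjamini-Hochberg procedure; a sketch, assuming per-claim p-values are available (for example, from consistency checks over resampled generations):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure, shown only as the kind
    of multiple-testing machinery such a formulation draws on (the paper's
    exact statistics are not given in the abstract). Returns a boolean mask
    of rejected null hypotheses at false-discovery rate alpha.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # find the largest i with p_(i) <= (i/m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = int(np.max(np.nonzero(below)[0]))
        reject[order[: cutoff + 1]] = True
    return reject

# e.g. p-values from per-claim tests; small ones are flagged as hallucinations
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.3, 0.9]))  # [ True  True False False False]
```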


【22】Integrating gender inclusivity into large language models via instruction tuning
Link: https://arxiv.org/abs/2508.18466

Authors: blewska, Bartosz Żuk
Abstract: Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.


【23】How Reliable are LLMs for Reasoning on the Re-ranking task?
Link: https://arxiv.org/abs/2508.18444

Authors: veer Islam, Zhiming Zhao
Note: Accepted at FQAS Conference 2024. DOI will be provided in 3 weeks after the conference has published the paper
Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of LLM transparency and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environmental and Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned about using explainability.


【24】A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs
Link: https://arxiv.org/abs/2508.18439

Authors: lmen Høst, Pierre Lison, Leon Moonen
Abstract: Vulnerability databases, such as the National Vulnerability Database (NVD), offer detailed descriptions of Common Vulnerabilities and Exposures (CVEs), but often lack information on their real-world impact, such as the tactics, techniques, and procedures (TTPs) that adversaries may use to exploit the vulnerability. However, manually linking CVEs to their corresponding TTPs is a challenging and time-consuming task, and the high volume of new vulnerabilities published annually makes automated support desirable. This paper introduces TRIAGE, a two-pronged automated approach that uses Large Language Models (LLMs) to map CVEs to relevant techniques from the ATT&CK knowledge base. We first prompt an LLM with instructions based on MITRE's CVE Mapping Methodology to predict an initial list of techniques. This list is then combined with the results from a second LLM-based module that uses in-context learning to map a CVE to relevant techniques. This hybrid approach strategically combines rule-based reasoning with data-driven inference. Our evaluation reveals that in-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. We also find that GPT-4o-mini performs better than Llama3.3-70B on this task. Overall, our results show that LLMs can be used to automatically predict the impact of cybersecurity vulnerabilities and TRIAGE makes the process of mapping CVEs to ATT&CK more efficient. Keywords: vulnerability impact, CVE, ATT&CK techniques, large language models, automated mapping.


【25】Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models
Link: https://arxiv.org/abs/2508.18381

Authors: n, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, Jingbo Zhu
Note: Accepted by EMNLP 2025 findings
Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.


【26】Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
Link: https://arxiv.org/abs/2508.18370

Authors: Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state-of-the-art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal in advancing high-performance ML agents without dependence on costly proprietary systems.


【27】LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Link: https://arxiv.org/abs/2508.18321

Authors: ng, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning with Group Relative Policy Optimisation (GRPO) across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.


【28】SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds
Link: https://arxiv.org/abs/2508.18306

Authors: Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng
Abstract: Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.
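
As one illustrative reading of the DMD measure (the abstract gives no formula), a sample can be scored by how much the model's mapping distorts its distances to the other samples. The naive sketch below is quadratic in the number of samples; the paper reports a near-linear procedure, so the actual computation must be organized differently.

```python
import numpy as np

def dmd_scores(x_in, x_out, eps=1e-8):
    """Score each sample by comparing pairwise distances on the input side
    with those on the output side: take per-pair distance ratios and
    summarize each sample by the spread of its ratios. Samples whose
    neighborhoods the model reshapes most rank as most susceptible.
    x_in: (n, d_in) input-side embeddings; x_out: (n, d_out) output-side.
    """
    def pairwise(a):
        diff = a[:, None, :] - a[None, :, :]
        return np.linalg.norm(diff, axis=-1)

    ratio = (pairwise(x_out) + eps) / (pairwise(x_in) + eps)
    np.fill_diagonal(ratio, 1.0)               # ignore self-distances
    return ratio.max(axis=1) / ratio.min(axis=1)
```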


Transformer (1 paper)

【1】Integral Transformer: Denoising Attention, Not Too Much Not Too Little
Link: https://arxiv.org/abs/2508.18387

Authors: zev, Abbas Ghaddar, Dingtao Hu, Boxing Chen
Note: EMNLP 2025 Main
Abstract: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.


GAN / Generation (5 papers)

【1】Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index
Link: https://arxiv.org/abs/2508.19093

Authors: nrickson
Abstract: This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG's capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.
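
A minimal sketch of the semantic-retrieval step such a RAG pipeline rests on; the embedding model and ranking details are assumptions, as the abstract does not specify the stack.

```python
import numpy as np

def retrieve(query, records, embed, k=5):
    """Rank provenance records by cosine similarity to the query in a shared
    (ideally multilingual) embedding space and return the top k, so a German
    sales record can match an English query. The retrieved records are then
    handed to an LLM for contextual summarization.
    """
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for rec in records:
        v = embed(rec)
        scored.append((float(q @ (v / np.linalg.norm(v))), rec))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [rec for _, rec in scored[:k]]
```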


【2】Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework
标题:用于RAG评估的多元化和私有合成数据集生成:多代理框架
链接:https://arxiv.org/abs/2508.18929

作者:ouich, Hongliu Cao, Eoin Thomas
备注:ECAI 2025 TRUST AI workshop
摘要:检索增强生成(RAG)系统通过整合外部知识来改进大型语言模型的输出,从而实现更有依据、更具上下文感知的响应。然而,这些系统的有效性和可信度关键取决于如何评估它们,特别是评估过程能否捕捉诸如保护敏感信息等现实世界约束。虽然目前对RAG系统的评估工作主要集中在制定性能指标上,但对底层评估数据集的设计和质量的关注要少得多,尽管它们在实现有意义、可靠的评估方面发挥着关键作用。在这项工作中,我们引入了一种新颖的多代理框架,用于生成以语义多样性和隐私保护为优先的RAG评估合成QA数据集。我们的方法包括:(1)多样性代理,利用聚类技术最大化主题覆盖范围和语义可变性;(2)隐私代理,检测并屏蔽多个领域的敏感信息;(3)QA策展代理,合成适合作为RAG评估基准真值(ground truth)的私有且多样化的QA对。大量实验表明,我们的评估集在多样性上优于基线方法,并在特定领域数据集上实现了稳健的隐私屏蔽。这项工作为更安全、更全面的RAG系统评估提供了一条实用且符合伦理的路径,为未来契合不断演进的AI法规与合规标准的改进奠定了基础。
摘要:Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depend on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy Agent that detects and masks sensitive information across multiple domains, and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.
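
作为参考,下面是"多样性代理"聚类选种思路的一个最小示意(假设使用sentence-transformers与scikit-learn;模型名all-MiniLM-L6-v2为假设的通用嵌入模型,并非论文指定):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

def diverse_seed_passages(passages, n_clusters=5):
    """示意:用聚类最大化主题覆盖,每簇选离中心最近的一段作为出题种子。
    需要 len(passages) >= n_clusters。"""
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(passages, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(emb)
    picks = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        picks.append(passages[idx[np.argmin(dists)]])  # 该簇最具代表性的段落
    return picks
```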


【3】Beyond the Textual: Generating Coherent Visual Options for MCQs
标题:超越文本:为MCQ生成连贯的视觉选项
链接:https://arxiv.org/abs/2508.18772

作者:Wang, Longzhu He, Wei Zheng
备注:EMNLP 2025
摘要:多项选择题(MCQ)在教育中对培养深度思维和知识整合起着至关重要的作用。然而,以前的研究主要集中在生成带文本选项的MCQ上,很大程度上忽略了视觉选项。此外,由于手动创作成本高且可扩展性有限,生成高质量的干扰项仍然是一个主要挑战。为了解决这些问题,我们提出了跨模态选项合成(CmOS),一个生成带视觉选项的教育MCQ的新框架。我们的框架集成了多模态思维链(MCoT)推理过程和检索增强生成(RAG),以产生语义上合理且视觉上相似的答案和干扰项。它还包括一个判别模块,用于识别适合视觉选项的内容。在测试任务上的实验结果表明,在不同学科和教育水平下,CmOS在内容判别、问题生成和视觉选项生成方面均优于现有方法。
摘要:Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation and visual option generation over existing methods across various subjects and educational levels.


【4】UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation
标题:UniC-RAG:针对检索增强生成的通用知识破坏攻击
链接:https://arxiv.org/abs/2508.18652

作者:eng, Yanting Wang, Ying Chen, Jinyuan Jia
备注:21 pages, 4 figures
摘要:检索增强生成(RAG)系统被广泛部署于金融、医疗和网络安全等不同领域的实际应用中。然而,许多研究表明,它们容易受到知识破坏攻击:攻击者可以向RAG系统的知识数据库注入对抗性文本,诱导LLM生成攻击者期望的输出。现有研究主要集中在攻击特定查询或具有相似主题(或关键词)的查询。在这项工作中,我们提出UniC-RAG,一种针对RAG系统的通用知识破坏攻击。与之前的工作不同,UniC-RAG联合优化少量对抗性文本,使其能同时攻击大量主题和领域各异的用户查询,从而让攻击者实现各种恶意目标,例如将用户引向恶意网站、触发有害命令执行或发起拒绝服务攻击。我们将UniC-RAG形式化为一个优化问题,并进一步设计了有效的求解方案,包括一种平衡的基于相似度的聚类方法以提升攻击效果。大量评估表明,UniC-RAG非常有效,且显著优于基线。例如,UniC-RAG只需向包含数百万文本的知识数据库注入100条对抗性文本,即可在同时攻击大量用户查询(如2,000条)时取得超过90%的攻击成功率。此外,我们评估了现有防御措施,表明它们不足以抵御UniC-RAG,凸显了RAG系统需要新的防御机制。
摘要:Retrieval-augmented generation (RAG) systems are widely deployed in real-world applications in diverse domains such as finance, healthcare, and cybersecurity. However, many studies showed that they are vulnerable to knowledge corruption attacks, where an attacker can inject adversarial texts into the knowledge database of a RAG system to induce the LLM to generate attacker-desired outputs. Existing studies mainly focus on attacking specific queries or queries with similar topics (or keywords). In this work, we propose UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike prior work, UniC-RAG jointly optimizes a small number of adversarial texts that can simultaneously attack a large number of user queries with diverse topics and domains, enabling an attacker to achieve various malicious objectives, such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. We formulate UniC-RAG as an optimization problem and further design an effective solution to solve it, including a balanced similarity-based clustering method to enhance the attack's effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly effective and significantly outperforms baselines. For instance, UniC-RAG could achieve over 90% attack success rate by injecting 100 adversarial texts into a knowledge database with millions of texts to simultaneously attack a large set of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and show that they are insufficient to defend against UniC-RAG, highlighting the need for new defense mechanisms in RAG systems.


【5】The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
标题:心灵之眼:指导视觉隐喻生成的多方面奖励框架
链接:https://arxiv.org/abs/2508.18569

作者: Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia
备注:Under Review
摘要:视觉隐喻生成是一项具有挑战性的任务,其目标是根据输入的文本隐喻生成图像。本质上,它需要语言理解来将源概念与目标概念绑定,在保持意义的同时确保视觉连贯性。我们提出了一个专注于隐喻对齐的自评估视觉隐喻生成框架。我们的自评估方法将现有指标与新提出的隐喻分解分数和意义对齐(MA)度量相结合。在此设置下,我们探索了两种新方法:一种免训练管道,它显式地将提示分解为用于图像合成的源-目标-意义(S-T-M)映射;以及一种互补的基于训练的管道,它使用我们提出的自评估奖励模式改进对齐,无需任何大规模再训练。在保留测试集上,免训练方法在分解、CLIP和MA分数上超过了强大的闭源基线(GPT-4o、Imagen),基于训练的方法紧随其后。我们通过面向用户的研究评估框架输出,观察到参与者总体上更偏好GPT-4o,而我们的免训练管道在开源方法中领先,并在抽象隐喻上略胜Imagen。我们的分析表明,S-T-M提示对较长或较抽象的隐喻有帮助,闭源模型则在简短、具体的情形中表现出色;我们还观察到结果对采样器设置敏感。总体而言,结构化提示和轻量级RL能在适度算力下很好地完成隐喻对齐,而与人类偏好之间的剩余差距似乎由美学和采样因素驱动。
摘要:Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.


QA|VQA|问答|对话(5篇)

【1】Text to Query Plans for Question Answering on Large Tables
标题:大桌子上问题解答的文本查询计划
链接:https://arxiv.org/abs/2508.18758

作者:ang, Chen Wang, Yuzhe Zhang, Jacky Jiang
摘要:大型表格数据集的高效查询和分析仍然是一个重大挑战,特别是对于没有SQL等编程语言专业知识的用户。Text-to-SQL方法在基准数据上表现出良好的性能;然而,它们继承了SQL的缺点,包括大型数据集的效率低下以及对基本查询之外的复杂数据分析的支持有限。我们提出了一个新的框架,将自然语言查询到查询计划。我们的解决方案在传统数据库之外实现,使我们能够支持经典的SQL命令,同时避免SQL的固有限制。此外,我们还支持复杂的分析功能,如主成分分析和异常检测,提供比传统SQL功能更大的灵活性和可扩展性。我们利用LLM迭代地解释查询和构造操作序列,通过增量构建解决方案来解决计算复杂性。通过直接对数据执行操作,我们克服了上下文长度限制,而不需要模型处理整个数据集。我们通过在标准数据库和大型科学表上的实验来验证我们的框架,证明了它在处理广泛的数据集和执行复杂的数据分析方面的有效性。
摘要:Efficient querying and analysis of large tabular datasets remain significant challenges, especially for users without expertise in programming languages like SQL. Text-to-SQL approaches have shown promising performance on benchmark data; however, they inherit SQL's drawbacks, including inefficiency with large datasets and limited support for complex data analyses beyond basic querying. We propose a novel framework that transforms natural language queries into query plans. Our solution is implemented outside traditional databases, allowing us to support classical SQL commands while avoiding SQL's inherent limitations. Additionally, we enable complex analytical functions, such as principal component analysis and anomaly detection, providing greater flexibility and extensibility than traditional SQL capabilities. We leverage LLMs to iteratively interpret queries and construct operation sequences, addressing computational complexity by incrementally building solutions. By executing operations directly on the data, we overcome context length limitations without requiring the entire dataset to be processed by the model. We validate our framework through experiments on both standard databases and large scientific tables, demonstrating its effectiveness in handling extensive datasets and performing sophisticated data analyses.
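
下面用一个假设性的最小示意说明"把查询表示为可在数据上增量执行的操作序列"这一思路(操作名、参数与pandas/sklearn的选用均为示例假设,并非论文实现):

```python
import pandas as pd
from sklearn.decomposition import PCA

def execute_plan(df, plan):
    """示意:逐步执行 LLM 生成的操作序列,直接作用于数据,避免一次性处理整表。"""
    for op in plan:
        if op["op"] == "filter":
            df = df.query(op["expr"])
        elif op["op"] == "groupby_mean":
            df = df.groupby(op["key"], as_index=False).mean(numeric_only=True)
        elif op["op"] == "pca":  # 超出 SQL 的分析型操作示例
            comps = PCA(n_components=op["k"]).fit_transform(df[op["cols"]])
            df = df.assign(**{f"pc{i+1}": comps[:, i] for i in range(op["k"])})
    return df

df = pd.DataFrame({"region": ["a", "a", "b", "b"], "x": [1., 2., 3., 4.], "y": [2., 1., 4., 3.]})
plan = [{"op": "filter", "expr": "x > 1"},
        {"op": "pca", "cols": ["x", "y"], "k": 1}]
print(execute_plan(df, plan))
```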


【2】Chronological Passage Assembling in RAG framework for Temporal Question Answering
标题:时间顺序段落在RAG框架中组装用于时间问题回答
链接:https://arxiv.org/abs/2508.18748

作者:ng Kim, Jeonghyun Park, Joonho Yang, Hwanhee Lee
备注:7 pages, 3 figures
摘要:在叙事任务上回答长上下文问题具有挑战性,因为正确答案往往取决于在有限的上下文窗口内保持上下文连贯的同时,重建一条连贯的事件时间线。检索增强生成(RAG)索引方法旨在通过只选择性检索必要的文档片段来应对这一挑战。然而,叙事文本具有限制这些现有方法有效性的独特特点。具体而言,理解叙事文本需要的不仅仅是孤立的片段:更广泛的上下文以及片段之间的顺序关系对于理解至关重要。为了解决这些局限,我们提出了ChronoRAG,一种专门面向叙事文本的新型RAG框架。该方法侧重两个基本方面:将分散的文档信息提炼为连贯且结构化的段落;通过显式捕捉并维护检索段落之间的时间顺序来保持叙事流。我们通过在NarrativeQA数据集上的实验实证展示了ChronoRAG的有效性,在既需要事实识别又需要理解复杂顺序关系的任务中取得了显著提升,凸显了对时间顺序进行推理在解决叙事QA中的关键作用。
摘要:Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
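
"先按相关性检索、再按原文位置重排以保留时间顺序"这一核心直觉可用如下最小示意表达(字段名pos/score与截断数量均为假设):

```python
def assemble_chronological(retrieved, top_k=5):
    """示意:检索后按原文出现位置重排段落,以保留叙事/时间顺序。
    retrieved: [{"text": 段落, "pos": 原文中的起始位置, "score": 相关性}, ...](字段为假设)"""
    top = sorted(retrieved, key=lambda p: p["score"], reverse=True)[:top_k]  # 先按相关性截断
    return "\n".join(p["text"] for p in sorted(top, key=lambda p: p["pos"]))  # 再按时间顺序拼接
```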


【3】M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations
标题:M3 HG:用于对话中情感原因三重组提取的多模式、多尺度、多类型节点异类图
链接:https://arxiv.org/abs/2508.18740

作者:g, Ying Shen, Tiantian Chen, Lin Zhang
备注:16 pages, 8 figures. Accepted to Findings of ACL 2025
摘要:多模态会话中的情感原因三元组提取(MECTEC)旨在同时提取情感话语、原因话语和情感类别,近来在社交媒体分析中受到显著关注。然而,相关数据集十分稀缺(目前仅有一个对话场景高度单一的公开数据集),阻碍了该领域的模型开发。为了解决这个问题,我们提出了MECAD,首个多模态、多场景的MECTEC数据集,包含来自56部电视剧的989个对话,覆盖广泛的对话情境。此外,现有的MECTEC方法未能显式建模情感与原因上下文,并忽略了不同层次语义信息的融合,导致性能下降。在本文中,我们提出了M3HG,一种通过多模态异构图显式捕捉情感与原因上下文、并在话语间和话语内两个层次上有效融合上下文信息的新模型。大量实验证明,与现有最先进方法相比,M3HG是有效的。代码和数据集可在https://github.com/redifinition/M3HG上获得。
摘要:Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.


【4】Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning
标题:知道还是猜测?通过联合一致性和对比学习进行稳健的医学视觉问题解答
链接:https://arxiv.org/abs/2508.18687

作者:iang, Yuxi Chen, Sibo Song, Yan Zhang, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu
摘要:在高风险的医疗应用中,对不同问题措辞给出一致回答对于可靠诊断至关重要。然而,我们发现,当前的医疗视觉语言模型(Med-VLM)在医学视觉问答中表现出令人担忧的脆弱性:面对语义等价的医学问题改写,它们的答案会显著波动。我们将其归因于两个局限:(1)医学概念对齐不足,导致推理模式发散;(2)训练数据中的隐藏偏见使模型偏向句法捷径而非语义理解。为了应对这些挑战,我们构建了RoMed,这是一个建立在原始VQA数据集之上的数据集,包含144k个带有词级、句级和语义级扰动变体的问题。在RoMed上评估LLaVA-Med等最先进(SOTA)模型时,我们观察到与原始VQA基准相比令人担忧的性能下降(例如召回率下降40%),暴露出关键的鲁棒性差距。为弥合这一差距,我们提出了一致性与对比学习(CCL),它集成两个关键组件:(1)知识锚定的一致性学习,使Med-VLM与医学知识而非浅层特征模式对齐;(2)偏见感知的对比学习,通过判别式表示细化来缓解数据特有的先验。CCL在三个流行的VQA基准上取得了SOTA性能,并在具有挑战性的RoMed测试集上将答案一致性显著提高了50%,表现出显著增强的鲁棒性。代码将发布。
摘要:In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.


【5】Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering
标题:分布外评估能否揭示对捷径的依赖?问题解答案例研究
链接:https://arxiv.org/abs/2508.18407

作者:efánik, Timothee Mickus, Marek Kadlčík, Michal Spiegel, Josef Kuchař
备注:To appear in Findings of EMNLP 2025
摘要:最近AI领域的大多数工作通过分布外(OOD)数据集上的性能来评估模型的泛化能力。尽管这些评估很实用,但它们建立在一个强假设之上:OOD评估能够捕捉并反映真实世界部署中可能出现的失败。   在这项工作中,我们挑战这一假设,并将OOD评估的结果与现有问答(QA)模型中已记录的一组特定失败模式(即对虚假特征或预测捷径的依赖)进行对照。   我们发现,QA中用于OOD评估的不同数据集对模型捷径鲁棒性的估计质量差异极大,其中一些甚至远逊于简单的分布内评估。我们将其部分归因于虚假捷径在ID与OOD数据集之间共享这一观察,但也发现了数据集的训练质量与评估质量在很大程度上脱节的情况。我们的工作强调了常用的基于OOD的泛化评估的局限性,并为更稳健地评估QA内外的泛化提供了方法和建议。
摘要:A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment.   In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts.   We find that different datasets used for OOD evaluations in QA provide an estimate of models' robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.


机器翻译(2篇)

【1】LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
标题:LaTeXTrans:多代理协调的结构化LaTeX翻译
链接:https://arxiv.org/abs/2508.18791

作者:u, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu
摘要:尽管现代机器翻译(MT)系统在通用领域文本上取得了显著进展,但翻译结构化的LaTeX格式文档仍然是一个重大挑战。这些文档通常将自然语言与特定领域的语法(如数学公式、表格、图形和交叉引用)交织在一起,所有这些都必须准确保留,以维持语义完整性和可编译性。在本文中,我们介绍了LaTeXTrans,一个为应对这一挑战而设计的协作式多代理系统。LaTeXTrans通过六个专门的代理确保格式保留、结构保真和术语一致:1)解析器(Parser),通过占位符替换和语法过滤将LaTeX分解为翻译友好的单元;2)翻译器(Translator)、校验器(Validator)、摘要器(Summarizer)和术语提取器(Terminology Extractor),协同工作以确保上下文感知、自我纠正且术语一致的翻译;3)生成器(Generator),将翻译后的内容重建为结构良好的LaTeX文档。实验结果表明,LaTeXTrans在翻译准确率和结构保真度上均优于主流机器翻译系统,为LaTeX格式文档的翻译提供了有效且实用的解决方案。
摘要:Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.


【2】COMET-poly: Machine Translation Metric Grounded in Other Candidates
标题:COMET-poly:机器翻译指标以其他候选人为基础
链接:https://arxiv.org/abs/2508.18549

作者:le, Vilém Zouhar, Tu Anh Dinh, Felipe Maia Polo, Jan Niehues, Mrinmaya Sachan
备注:Maike Züfle, Vilém Zouhar, and Tu Anh Dinh contributed equally
摘要:机器翻译的自动化指标试图复现人类判断。与经常在多个备选译文的语境下评估翻译的人类不同,这些指标通常只考虑源句和单一译文。评估设置上的这种差异可能对自动化指标的性能产生负面影响。我们提出了两个纳入单一译文之外额外信息的自动化指标。COMET-polycand利用同一源句的其他候选译文与待评译文进行比较和对照,从而对其质量做出更有依据的评估。COMET-polyic受基于检索的上下文学习启发,采用相似源文本的译文及其人工标注的质量分数来指导评估。我们发现,在COMET-polycand中加入一个额外译文即可提升段级指标性能(Kendall tau-b相关性从0.079提高到0.118),加入更多译文还有进一步增益。在COMET-polyic中引入检索到的示例也带来类似改进(从0.079提高到0.116)。我们公开发布了模型。
摘要:Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
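
文中报告的段级性能即Kendall tau-b相关系数;作为参考,可用scipy直接计算(scipy.stats.kendalltau默认即tau-b变体,以下数据为虚构示例):

```python
from scipy.stats import kendalltau

human  = [80, 62, 95, 70, 55, 88]               # 人工质量分(虚构示例)
metric = [0.71, 0.55, 0.90, 0.60, 0.58, 0.82]   # 指标分(虚构示例)
tau, p = kendalltau(human, metric)              # 默认即 tau-b 变体
print(f"Kendall tau-b = {tau:.3f} (p = {p:.3f})")
```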


语义分析(2篇)

【1】Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic
标题:超越黑匣子:利用BERTopic在定量话语分析中集成词汇和语义方法
链接:https://arxiv.org/abs/2508.19099

作者:mpton
备注:5 pages conference paper, 4 tables
摘要:随着大型语言模型和计算工具的兴起,定量话语分析(QDA)得到越来越多的采用。然而,依赖MAXQDA和NVivo等黑盒软件可能会损害方法透明度以及与研究目标的一致性。本文为QDA提出了一个结合词汇与语义方法的混合、透明框架,以支持三角验证、可复现性和可解释性。基于历史政治话语的案例研究,我们演示了使用NLTK、spaCy和Sentence Transformers的自定义Python管道如何对预处理、词形还原和嵌入生成进行细粒度控制。我们进一步详述了迭代式BERTopic建模过程,其中包含UMAP降维、HDBSCAN聚类和c-TF-IDF关键词提取,并通过参数调优和多次运行加以优化,以增强主题一致性和覆盖范围。通过将精确的词汇检索与上下文感知的语义聚类并置,我们主张采用多层次方法来缓解单独使用任一方法的局限。我们的工作流程强调了代码级透明度、研究者能动性和方法三角验证在计算话语研究中的重要性。代码和补充材料可通过GitHub获取。
摘要:Quantitative Discourse Analysis has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing from a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic coherence and coverage. By juxtaposing precise lexical searches with context-aware semantic clustering, we argue for a multi-layered approach that mitigates the limitations of either method in isolation. Our workflow underscores the importance of code-level transparency, researcher agency, and methodological triangulation in computational discourse studies. Code and supplementary materials are available via GitHub.
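
作为参考,下面给出与文中流程一致的最小BERTopic管道示意(各超参数为假设值,实际需按语料调参并多次运行;语料需有足够规模):

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# 需要足够规模的真实语料(如数百篇以上);此处仅为占位示例
docs = ["document one ...", "document two ...", "document three ..."]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # 嵌入生成
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42),  # 降维
    hdbscan_model=HDBSCAN(min_cluster_size=10, prediction_data=True),  # 聚类
)  # 关键词提取由 BERTopic 内置的 c-TF-IDF 完成

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())  # 主题概览,用于检查一致性与覆盖范围
```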


【2】Semantic Attractors and the Emergence of Meaning: Towards a Teleological Model of AGI
标题:语义吸引物和意义的出现:走向AGI的目的论模型
链接:https://arxiv.org/abs/2508.18290

作者:him Rudolph
备注:10 pages
摘要:本文基于复值意义空间中语义吸引子的概念,提出了一个语义通用人工智能(AGI)的理论框架。不同于当前基于统计式下一词元预测的transformer语言模型,我们探索了一种意义并非以概率推断、而是通过递归张量变换形成的模型。利用涉及虚数单位 i 的循环运算,我们描述了一种能够建模讽刺、同形异义和歧义的旋转语义结构。然而,该模型的中心是一个语义吸引子,即一个目的论算子:与统计计算不同,它充当有意图的代理(Microvitum),引导意义走向稳定、清晰和表达深度。借助梯度流、张量形变和迭代矩阵动力学等概念,该吸引子提供了一个不仅在数学上富有启发性、而且在哲学上具有重要意义的语义变换模型。我们认为,真正的意义并非来自模拟,而是来自向语义连贯性的递归收敛,而这需要一种全新的认知架构:一种旨在塑造语言而不仅仅是预测语言的架构。
摘要:This essay develops a theoretical framework for a semantic Artificial General Intelligence (AGI) based on the notion of semantic attractors in complex-valued meaning spaces. Departing from current transformer-based language models, which operate on statistical next-token prediction, we explore a model in which meaning is not inferred probabilistically but formed through recursive tensorial transformation. Using cyclic operations involving the imaginary unit i, we describe a rotational semantic structure capable of modeling irony, homonymy, and ambiguity. At the center of this model, however, is a semantic attractor -- a teleological operator that, unlike statistical computation, acts as an intentional agent (Microvitum), guiding meaning toward stability, clarity, and expressive depth. Conceived in terms of gradient flows, tensor deformations, and iterative matrix dynamics, the attractor offers a model of semantic transformation that is not only mathematically suggestive, but also philosophically significant. We argue that true meaning emerges not from simulation, but from recursive convergence toward semantic coherence, and that this requires a fundamentally new kind of cognitive architecture -- one designed to shape language, not just predict it.


Graph|知识图谱|Knowledge(2篇)

【1】Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs
标题:LVLM知道他们知道什么吗?LVLM知识边界感知的系统研究
链接:https://arxiv.org/abs/2508.19111

作者:ng, Shiyu Ni, Keping Bi
备注:EMNLP2025 Findings
摘要:大型视觉语言模型(LVLM)表现出强大的视觉问答(VQA)能力,但也会产生幻觉。一个可靠的模型应当感知它的知识边界:知道自己知道什么、不知道什么。本文通过评估三类置信信号(概率置信度、基于答案一致性的置信度和言语化置信度),研究了LVLM对其知识边界的感知。在三个VQA数据集上对三个LVLM进行的实验表明,尽管LVLM具有合理的感知水平,但仍有很大改进空间。在三类置信信号中,基于概率和基于一致性的信号是更可靠的指标,而言语化置信度往往导致过度自信。为增强LVLM的感知,我们改造了多种源自大型语言模型(LLM)的既有置信度校准方法,并提出三种有效方法。此外,我们将LVLM与其对应的LLM进行比较,发现联合处理视觉和文本输入会降低问答性能,但同时也降低了置信度,从而相比LLM获得了更好的感知水平。
摘要:Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries-knowing what it knows and what it does not. This paper investigates LVLMs' perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs' perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
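
文中评估的前两类置信信号有常见的计算方式,下面给出一个最小示意(假设已获得答案token的对数概率与多次采样的答案;具体实现细节为假设):

```python
import math
from collections import Counter

def probabilistic_confidence(token_logprobs):
    """概率置信度:答案 token 的平均对数概率取指数(一种常见做法)。"""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def consistency_confidence(sampled_answers):
    """一致性置信度:多次采样中多数答案所占比例。"""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

print(round(probabilistic_confidence([-0.1, -0.3, -0.2]), 2))   # 0.82
print(consistency_confidence(["A", "A", "B", "A", "C"]))         # 0.6
```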


【2】Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval
标题:偏见缓解代理:优化源选择以实现公平平衡的知识检索
链接:https://arxiv.org/abs/2508.18724

作者:Singh, Deepak Muppiri, William Ngu
备注:Accepted at KDD'2025 Agent4IR workshop
摘要:大型语言模型(LLM)开启了生成式应用的时代,改变了人工智能领域。建立在生成式AI能力之上的代理式AI(Agentic AI)代表了向能够推理、检索和行动的自主、目标驱动系统的重大转变。然而,它们也继承了内部和外部信息源中存在的偏见。这严重影响了检索信息的公平性和平衡性,从而降低了用户信任。为应对这一关键挑战,我们引入了一种新颖的偏见缓解代理(Bias Mitigation Agent):一个多代理系统,通过专门的代理编排偏见缓解工作流程,这些代理优化信息源的选择,确保检索内容高度相关且偏见最小,从而促进公平、平衡的知识传播。实验结果表明,与朴素检索基线策略相比,偏见减少了81.82%。
摘要:Large Language Models (LLMs) have transformed the field of artificial intelligence by unlocking the era of generative applications. Built on top of generative AI capabilities, Agentic AI represents a major shift toward autonomous, goal-driven systems that can reason, retrieve, and act. However, they also inherit the bias present in both internal and external information sources. This significantly affects the fairness and balance of retrieved information, and hence reduces user trust. To address this critical challenge, we introduce a novel Bias Mitigation Agent, a multi-agent system designed to orchestrate the workflow of bias mitigation through specialized agents that optimize the selection of sources to ensure that the retrieved content is both highly relevant and minimally biased to promote fair and balanced knowledge dissemination. The experimental results demonstrate an 81.82% reduction in bias compared to a baseline naive retrieval strategy.


推理|分析|理解|解释(8篇)

【1】StepWiser: Stepwise Generative Judges for Wiser Reasoning
标题:StepWiser:用于更明智推理的逐步生成式评判器
链接:https://arxiv.org/abs/2508.19229

作者:, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
摘要:随着模型越来越多地利用多步推理策略解决复杂问题,监督这些中间步骤的逻辑有效性已成为一个关键的研究挑战。过程奖励模型通过提供逐步反馈来解决这一问题,但目前的方法有两个主要缺点:它们通常作为分类器运作而不提供解释,而且对静态数据集监督微调的依赖限制了泛化能力。受最新进展的启发,我们将逐步奖励建模从分类任务重新构建为推理任务本身。为此,我们提出一个生成式评判器,它对策略模型的推理步骤进行推理(即元推理),在给出最终判断之前输出思考令牌。我们的模型StepWiser通过强化学习、利用rollout的相对结果进行训练。我们证明它(i)在中间步骤上的判断准确性优于现有方法;(ii)可用于在训练时改进策略模型;(iii)改进推理时搜索。
摘要:As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.


【2】MovieCORE: COgnitive REasoning in Movies
标题:MovieCORE:电影中的认知推理
链接:https://arxiv.org/abs/2508.19026

作者:smy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu
备注:Accepted for EMNLP'2025 Main Conference. Project Page: this https URL
摘要:本文介绍了MovieCORE,一个新颖的视频问答(VQA)数据集,旨在探查对电影内容更深层的认知理解。与关注表层理解的现有数据集不同,MovieCORE强调在紧扣视频素材的同时调动System-2思维的问题。我们提出了一种创新的代理式头脑风暴方法,利用多个大型语言模型(LLM)作为思想代理来生成和改进高质量的问答对。为评估数据集质量,我们开发了一套认知测试,评估深度、激发思考的潜力和句法复杂性。我们还提出了一个综合评估方案,用于评估VQA模型在更深层认知任务上的性能。为了解决现有视频语言模型(VLM)的局限,我们引入了一个代理式增强模块,即代理选择增强(ACE),它在训练后将模型推理能力最多提升25%。我们的工作有助于推进AI系统中的电影理解,并就当前VQA模型在面对更具挑战性、更细致的电影内容问题时的能力与局限提供了有价值的见解。我们的项目页面、数据集和代码见:https://joslefaure.github.io/assets/html/moviecore.html。
摘要:This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.


【3】Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models
标题:以AI母语实现可解释:神经模型中的原生符号推理
链接:https://arxiv.org/abs/2508.18988

作者: Liu
备注:25 pages, 9 figures. The AI Intuition Explorer dashboard is available at: this https URL
摘要:我们提出了一个神经模型发展出"AI母语"的框架,这是一种同时支持直觉推理、组合符号链和内在可解释性的原生符号语言。与事后解释方法不同,我们的方法将推理直接嵌入模型的表示之中:符号捕获有意义的语义模式,链条追踪决策路径,门控归纳机制引导选择性关注,从而产生透明而灵活的推理。我们引入互补的训练目标来提升符号纯度和决策稀疏性,并采用顺序特化策略:先建立广泛的符号能力,再细化直觉判断。在AI任务上的实验展示了具有竞争力的准确性以及可验证的推理轨迹,表明AI母语可以作为神经模型中可解释性、直觉和符号推理的统一机制。
摘要:We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.


【4】Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
标题:回答无法回答的问题就是故意错误:分析和缓解大型推理模型中的弃权失败
链接:https://arxiv.org/abs/2508.18760

作者:iangyu Liu, Zequn Sun, Wei Hu
摘要:大型推理模型(LRM)在复杂推理任务上取得了显著进展。然而,提给LRM的一些问题本质上是无法回答的,例如缺乏充分条件的数学题。我们发现,面对这些无法回答的问题时,LRM始终未能给出恰当的弃权。在本文中,我们为实现可信AI而系统地分析、研究并解决这一问题。我们首先对LRM面对无法回答的问题时的不同响应行为进行了详细分析。然后,我们表明LRM具备足够的认知能力来识别这些问题中的缺陷,但它们没有表现出相应的弃权行为,揭示了其内部认知与外部响应之间的失调。最后,为了解决这一问题,我们提出了一种轻量级的两阶段方法,将认知监测与推理时干预相结合。实验结果表明,该方法在保持整体推理性能的同时,显著提高了弃权率。
摘要:Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.


【5】CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks
标题:CAC-CoT:连接器感知的紧凑思想链,用于跨双系统认知任务的高效推理数据合成
链接:https://arxiv.org/abs/2508.18743

作者:oi, Yonghoon Kwon, Heondeuk Lee
备注:Accepted at EMNLP 2025 findings
摘要:长思维链(CoT)提示有助于大型语言模型(LLM)解决难题,但过长的推理轨迹常常拖慢甚至损害快速、直觉型"System-1"任务的表现。我们提出连接词感知的紧凑CoT(CAC-CoT),一种刻意将推理限制在一小组固定连接短语上的方法,引导模型给出简洁且结构良好的解释。尽管方法简单,我们基于Gemini-2.0-Flash的合成方法产生了高质量的训练数据。CAC-CoT在GSM8K上达到约85%,在GPQA(System-2)上达到约40%,同时在S1-Bench(System-1)上保持约90%。其推理轨迹平均约300个token(ART),约为基线轨迹长度的三分之一,在不损失准确率的情况下带来更高效率。
摘要:Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive "System-1" tasks. We introduce Connector-Aware Compact CoT (CAC-CoT) -- a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields high-quality training data. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.


【6】EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues
标题:EMMM,解释一下我的模型!对话中可解释的机器生成文本检测
链接:https://arxiv.org/abs/2508.18715

作者:fei Yuan, Haoyi Li, Soyeon Caren Han, Christopher Leckie
备注:15 pages
摘要:在客户服务中快速采用大型语言模型(LLM)带来了新的风险,因为恶意行为者可以利用它们通过机器生成的文本(MGT)进行大规模的用户模拟。目前的MGT检测方法通常在在线对话环境中挣扎,降低了可靠AI部署所必需的可靠性和可解释性。在操作员通常是非专家用户的客户服务场景中,解释对于可信的MGT检测变得至关重要。在本文中,我们提出了EMMM,一个解释,然后检测框架,平衡延迟,准确性和非专家导向的可解释性。实验结果表明,EMMM提供了非专家用户可以访问的解释,70%的人类评估者更喜欢它的输出,同时与最先进的模型相比,实现了具有竞争力的准确性,并保持了低延迟,在1秒内生成输出。我们的代码和数据集在https://github.com/AngieYYF/EMMM-explainable-chatbot-detection上开源。
摘要:The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.


【7】Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
标题:平衡难度的量身定制教学:通过提示课程提升多模态思维链中的推理能力
链接:https://arxiv.org/abs/2508.18673

作者:Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang
摘要:多模态思维链(MCoT)提示的有效性通常受限于随机或手动选择的示例。这些示例既未考虑特定模型的知识分布,也未考虑任务的内在复杂性,导致模型性能次优且不稳定。为了解决这个问题,我们提出了一个受"因材施教、难度均衡"教学原则启发的新框架。我们将提示选择重新定义为提示课程设计问题:构建一组与模型当前能力相匹配、顺序良好的训练示例。我们的方法整合了两个互补信号:(1)模型感知难度,通过主动学习设置中的预测不一致来量化,捕捉模型自身认为困难的内容;(2)内在样本复杂度,独立于任何模型来衡量每个问题-图像对的固有难度。通过联合分析这些信号,我们开发了一种难度均衡的采样策略,确保所选提示示例在两个维度上都具有多样性。在五个具有挑战性的基准和多个流行的多模态大型语言模型(MLLM)上进行的大量实验表明,我们的方法带来了实质且一致的改进,并大幅减少了随机采样导致的性能波动,为增强多模态推理提供了一种有原则且稳健的方法。
摘要:The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.


【8】Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
标题:短答案和长答案推理中可靠多数集选择的潜在自相容性
链接:https://arxiv.org/abs/2508.18395

作者:k Oh, Jay-yoon Lee
摘要:大型语言模型(LLM)中的概率解码常常产生不一致的输出,在复杂或长形式的问题上尤甚。自一致性(SC)通过对精确字符串的多数投票缓解了短形式QA中的这一问题,而通用自一致性(USC)和加权Unigram一致性得分(WUCS)虽可扩展到长形式回答,却在短形式基准上损失准确率。   我们引入了潜在自一致性(LSC),它使用可学习的令牌嵌入来选择语义上最一致的回答。摘要令牌的轻量级前向生成使推理时间增加不到1%,且无需更改模型架构。   在6个短形式和5个长形式推理基准(例如MATH、MMLU、TruthfulQA)上,LSC在短形式和长形式任务上平均均超过SC、USC和WUCS,同时保持可忽略的计算开销。这些结果将LSC定位为一种能在不同答案格式间可靠工作的实用一致性选择方法。此外,LSC提供了良好校准的置信度估计,在两种答案格式下都保持较低的预期校准误差(ECE)。
摘要:Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks.   We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture.   Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
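
LSC的"选出语义上最一致的回答"这一步可以用下述简化示意理解(论文使用可学习的摘要令牌嵌入,此处以现成句向量近似,仅作说明):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def select_most_consistent(responses):
    """示意:在嵌入空间中选出与其他回答平均相似度最高的回答(LSC 思想的近似)。"""
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(responses, normalize_embeddings=True)
    sim = emb @ emb.T                   # 余弦相似度矩阵(向量已归一化)
    np.fill_diagonal(sim, 0.0)          # 排除自身
    return responses[int(np.argmax(sim.sum(axis=1)))]  # 与其余回答最一致者
```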


半/弱/无监督|不确定性(1篇)

【1】Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
标题:通过经验驱动的终身学习构建自我进化的代理人:框架和基准
链接:https://arxiv.org/abs/2508.19005

作者:i, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
摘要:随着人工智能向通用智能发展,重点正在从针对静态任务优化的系统转移到创建持续学习的开放式代理。在本文中,我们介绍了经验驱动的终身学习(ELL),一个框架,用于建立能够通过现实世界的互动不断增长的自我发展的代理。该框架建立在四个核心原则上:(1)经验探索:智能体通过与动态环境的持续、自我激励的交互来学习,导航相互依赖的任务并生成丰富的经验轨迹。(2)长期记忆:代理保存和结构的历史知识,包括个人经验,领域的专业知识,常识推理,到一个持久的记忆系统。(3)技能学习:代理人通过从经验中提取重复出现的模式到可重用的技能中来自主改进,这些技能被积极地改进和验证,以应用于新的任务。(4)知识内化:智能体将显性和离散的经验内化为隐性和直觉的能力,成为“第二天性”。   我们还介绍了StuLife,这是ELL的基准数据集,它模拟了学生的整体大学旅程,从入学到学术和个人发展,跨越三个核心阶段和十个详细的子场景。StuLife的设计围绕三个关键的范式转变:从被动到主动,从上下文到记忆,从模仿到学习。在这个动态的环境中,代理人必须获得和提炼实用技能,并保持持久的记忆,根据不断变化的状态变量做出决策。StuLife提供了一个评估终身学习能力的综合平台,包括记忆保留,技能转移和自我激励行为。除了在StuLife基准上评估SOTA LLM之外,我们还探索了上下文工程在推进AGI中的作用。
摘要:As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as "second nature".   We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI.


检测相关(1篇)

【1】Controllable Conversational Theme Detection Track at DSTC 12
标题:DSTC 12的可控对话主题检测赛道
链接:https://arxiv.org/abs/2508.18783

作者:yminov, Hang Su, Jake Vincent, Siffi Singh, Jason Cai, James Gung, Raphael Shu, Saab Mansour
备注:DSTC12@SigDial2025; data and code available at this https URL
摘要:对话分析一直处于由语音和自然语言处理技术进步所驱动的变革前沿。大型语言模型(LLM)在分析领域的快速采用,将可自动化问题的复杂性和规模提升到了新的水平。在本文中,我们将主题检测作为对话分析中的一项关键任务引入,旨在自动识别并归类对话中的主题。这一过程可以显著减少分析长对话所需的人工工作量,特别是在客户支持或销售等领域。与通常依赖一组固定意图来驱动下游系统逻辑的传统对话意图检测不同,主题旨在作为对话核心诉求的直接、面向用户的摘要。这一区别为主题的表层形式和用户特定的自定义提供了更大的灵活性。我们将可控对话主题检测问题作为对话系统技术挑战赛(DSTC)12的一个公开竞赛赛道提出:它被设定为对话话语的联合聚类与主题标注,其独特之处在于可通过提供的用户偏好数据控制所得主题簇的粒度。我们概述了该问题、相关数据集以及自动与人工评估指标。最后,我们讨论了参赛队伍的提交并提供了由此获得的见解。赛道材料(数据和代码)在GitHub仓库中公开提供。
摘要:Conversational analytics has been on the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has taken the problems that can be automated to a new level of complexity and scale. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation's core inquiry. This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose Controllable Conversational Theme Detection problem as a public competition track at Dialog System Technology Challenge (DSTC) 12 -- it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters' granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams' submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.


识别/分类(1篇)

【1】H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems
标题:H-PRM:用于各种语音识别系统的可插入热词预检索模块
链接:https://arxiv.org/abs/2508.18295

作者:ai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li
摘要:热词定制对于提升ASR中特定领域术语的准确率至关重要,其发展主要由传统模型和音频大语言模型(Audio LLM)的进步推动。然而,现有模型往往难以应对大规模热词:随着热词数量增加,识别率急剧下降。在本文中,我们介绍了一种新颖的热词定制系统,它利用热词预检索模块(H-PRM),通过度量热词与语音片段之间的声学相似度来找出最相关的热词候选。这一即插即用的方案可以轻松集成到SeACo-Paraformer等传统模型中,显著提升热词的后召回率(PRR)。此外,我们还通过基于提示的方法将H-PRM整合进Audio LLM,实现热词的无缝定制。大量测试验证了H-PRM优于现有方法,为ASR中的热词定制指明了新方向。
摘要:Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically with the number of hotwords increasing. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing hotwords post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.
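
H-PRM的预检索步骤在概念上是"热词嵌入与语音片段嵌入的相似度top-k";以下为一个假设性的最小示意(声学嵌入如何获得此处略去):

```python
import numpy as np

def retrieve_hotwords(segment_emb, hotword_embs, hotwords, k=10):
    """示意:按声学相似度(此处用余弦)为语音片段检索最相关的热词候选。
    segment_emb: (d,) 语音片段嵌入;hotword_embs: (N, d) 热词嵌入(如何编码为假设)。"""
    seg = segment_emb / np.linalg.norm(segment_emb)
    hw = hotword_embs / np.linalg.norm(hotword_embs, axis=1, keepdims=True)
    scores = hw @ seg                     # 每个热词与片段的相似度
    top = np.argsort(-scores)[:k]         # 相似度最高的 k 个
    return [(hotwords[i], float(scores[i])) for i in top]
```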


Word2Vec|文本|单词(2篇)

【1】Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System
标题:Attention 2 Probability:鲁棒语音转文本系统的注意力驱动术语概率估计
链接:https://arxiv.org/abs/2508.18701

作者:, Jun Zhang, Bin Wang, Jin Qiu, Lu Huang, Yuan Ge, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
备注:9 pages, 4 figures, 5 tables
摘要:语音大语言模型(SLM)的最新进展改善了通用领域的语音识别和翻译,但准确生成特定领域术语或新词仍然具有挑战性。为此,我们提出Attention2Probability:面向鲁棒语音到文本系统的注意力驱动术语概率估计,它轻量、灵活且准确。Attention2Probability将语音与术语之间的交叉注意力权重转换为出现概率,并进一步采用课程学习来提高检索准确率。此外,针对带术语干预的语音到文本任务数据匮乏的问题,我们创建并发布了一个新的带术语语音数据集,以支持该领域的后续研究。实验结果表明,Attention2Probability在我们的测试集上显著优于VectorDB方法:其最大召回率在中文上达到92.57%,在英文上达到86.83%,且每次查询的延迟仅为8.71毫秒。使用Attention2Probability检索到的术语干预SLM的识别和翻译任务,可将术语准确率提高6–17%,同时也揭示了SLM目前对术语的利用仍有局限。
摘要:Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.
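
"把交叉注意力权重转成术语出现概率"这一步可以用如下假设性示意理解(聚合与压缩方式均为假设,并非论文公式):

```python
import numpy as np

def term_presence_prob(cross_attn, term_token_ids, tau=0.5, scale=10.0):
    """示意:把语音-文本交叉注意力权重转成术语出现概率。
    cross_attn: (T_speech, T_text) 注意力矩阵;term_token_ids: 术语 token 的列下标。
    聚合方式(术语列的注意力质量按帧取最大)与 sigmoid 压缩均为假设。"""
    mass = cross_attn[:, term_token_ids].sum(axis=1)    # 每帧分给该术语的注意力质量
    peak = float(mass.max())                            # 最"对齐"的一帧
    return 1.0 / (1.0 + np.exp(-scale * (peak - tau)))  # 映射到 (0, 1)

# 用法示例:3 帧语音、5 个文本 token,术语占第 2、3 列
attn = np.array([[.1, .2, .5, .1, .1],
                 [.1, .6, .2, .05, .05],
                 [.2, .2, .2, .2, .2]])
print(round(term_presence_prob(attn, [1, 2]), 3))  # 约 0.953
```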


【2】A New NMT Model for Translating Clinical Texts from English to Spanish
标题:将临床文本从英语翻译为西班牙语的新NMT模型
链接:https://arxiv.org/abs/2508.18607

作者:, Xun Wang, Hong Yu
备注:This work was accepted by the Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
摘要:将电子健康记录(EHR)叙述从英语翻译为西班牙语是一项临床上重要但具有挑战性的任务,原因在于缺乏平行对齐语料库且文本中包含大量未知词。为应对这些挑战,我们提出了NOOV(意为No OOV),一种只需少量领域内平行对齐语料即可训练的新型神经机器翻译(NMT)系统。NOOV集成了从平行对齐语料中自动学习的双语词典,以及从大型生物医学知识资源中提取的短语查找表,以同时缓解NMT中的未知词问题和词语重复问题,增强NMT系统的短语生成能力。评估表明,NOOV能够生成更好的EHR译文,在准确性和流畅性上均有提升。
摘要:Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundant unknown words contained. To address such challenges, we propose NOOV (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, enhancing better phrase generation of NMT systems. Evaluation shows that NOOV is able to generate better translation of EHR with improvement in both accuracy and fluency.


其他神经网络|深度学习|模型|建模(2篇)

【1】Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction
标题:利用基于规则的强化学习来增强语法错误纠正
链接:https://arxiv.org/abs/2508.18780

作者: Xunjian Yin, Yilin Chen, Xiaojun Wan
备注:Code will be released upon publication
摘要:语法纠错(GEC)是自然语言处理中的一项重要任务。基于编码器-解码器模型的传统方法已取得一定成功,但LLM在该领域的应用仍有待探索。目前的研究主要依靠监督微调来训练LLM直接生成改正后的句子,这限制了模型强大的推理能力。为了解决这一局限,我们提出了一个基于规则的强化学习(Rule-Based RL)新框架。在中文数据集上的实验表明,我们的框架达到了最先进的性能,且召回率显著提升。这一结果清楚地凸显了使用RL引导LLM的优势,为GEC的未来发展提供了更可控、更可靠的范式。
摘要:Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model's powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.
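
摘要未给出具体奖励设计;作为示意,基于规则的奖励常可构造为"与参考改写的相似度"之类的可验证信号(以下纯属假设性示例,并非论文实现):

```python
import difflib

def rule_based_reward(hypothesis, reference):
    """示意:用与参考改写的字符级相似度构造可验证的规则奖励(假设性设计)。"""
    sim = difflib.SequenceMatcher(None, hypothesis, reference).ratio()  # 0~1
    return 2.0 * sim - 1.0  # 映射到 [-1, 1]

print(rule_based_reward("他去了学校。", "他去了学校。"))   # 1.0(完全一致)
print(rule_based_reward("他去学校了。", "他去了学校。"))   # 小于 1.0
```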


【2】RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing
标题:RLMR:面向创意写作的混合奖励强化学习
链接:https://arxiv.org/abs/2508.18642

作者:Liao, Tian Zhang, Xiao Feng, Yusong Zhang, Rui Yang, Haorui Wang, Bosi Wen, Ziying Wang, Runzhi Shi
摘要:大型语言模型在创意写作应用中被广泛使用。创意写作需要在主观写作质量(例如文学性和情感表达)与客观约束遵循(例如格式要求和字数限制)之间取得平衡。现有的强化学习方法难以兼顾这两个方面:单一奖励策略无法同时提升两种能力,而固定权重的混合奖励方法缺乏适应不同写作场景的能力。为了解决这个问题,我们提出混合奖励强化学习(RLMR),它利用一个动态混合奖励系统:由一个写作奖励模型评估主观写作质量,由一个约束校验模型评估客观约束遵循。约束遵循奖励的权重根据采样组内的写作质量动态调整,确保违反约束的样本在GRPO中获得负优势、从而在训练中受到惩罚,这是本方法的关键创新。我们在从8B到72B参数的多个模型系列上进行了自动和人工评估,并构建了一个名为WriteEval的真实写作基准用于综合评估。结果表明,该方法在指令遵循(IFEval从83.36%提高到86.65%)和写作质量(在WriteEval上的人工专家成对评估胜率为72.75%)两方面均取得一致改进。据我们所知,RLMR是首个在在线RL训练中将主观偏好与客观校验相结合的工作,为多维创意写作优化提供了有效解决方案。
摘要:Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
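
"约束奖励权重随组内写作质量动态调整、使违规样本获得负优势"可以用如下简化示意理解(权重函数与数值均为假设,并非论文公式):

```python
import numpy as np

def mixed_rewards(quality, constraint_ok):
    """示意:RLMR 动态混合奖励的简化版本。
    quality: 写作奖励模型打分;constraint_ok: 约束校验结果(1 通过 / 0 违反)。"""
    quality = np.asarray(quality, dtype=float)
    ok = np.asarray(constraint_ok, dtype=float)
    w = quality.mean()                       # 假设:组内质量越高,违反约束的惩罚越重
    r = quality + w * (ok - 1.0)             # 违反者被扣减 w
    adv = (r - r.mean()) / (r.std() + 1e-8)  # GRPO 式组内标准化优势
    return r, adv

r, adv = mixed_rewards([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1])
print(adv)  # 违反约束的第 2 个样本优势为负
```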


其他(13篇)

【1】Evaluating the Evaluators: Are readability metrics good measures of readability?
标题:评估评估者:可读性指标是衡量可读性的好方法吗?
链接:https://arxiv.org/abs/2508.19221

作者:chola, Daniel Khashabi, Mark Dredze
摘要:简明语言摘要(Plain Language Summarization,PLS)旨在将复杂文档提炼为非专业读者可以理解的摘要。在本文中,我们对PLS文献进行了全面调研,发现目前可读性评估的标准做法是使用传统可读性指标,例如Flesch-Kincaid年级水平(FKGL)。然而,尽管这些指标在其他领域已被证明实用,它们在PLS中尚未与人类的可读性判断进行过比较。我们评估了8个可读性指标,并表明其中大多数与人类判断的相关性很差,包括最流行的指标FKGL。随后我们表明,语言模型(LM)是更好的可读性评判者,表现最好的模型与人类判断的皮尔逊相关系数达到0.56。将分析扩展到包含面向非专业读者摘要的PLS数据集后,我们发现LM能更好地捕捉更深层的可读性度量(例如所需的背景知识),并得出与传统指标不同的结论。基于这些发现,我们为简明语言摘要的评估提供了最佳实践建议。我们公开了分析代码和调研数据。
摘要:Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
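
下面是一个最小示意,展示论文评估思路中“可读性指标与人工判断做皮尔逊相关”的计算方式。textstat 是常用的可读性指标库(此处为假设的工具选择),示例文本与人工评分均为占位数据,并非论文数据。

import textstat
from scipy.stats import pearsonr

summaries = [
    "The heart pumps blood through the body.",
    "Myocardial perfusion is modulated by autonomic innervation.",
    "Drink water every day to stay healthy.",
]
human_scores = [4.5, 1.8, 4.9]  # 假设的人工可读性评分(越高越易读)

fkgl = [textstat.flesch_kincaid_grade(s) for s in summaries]
r, p = pearsonr(fkgl, human_scores)
print(f"FKGL={fkgl}, Pearson r={r:.2f}")  # FKGL越高越难读,故预期与人工评分负相关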


【2】VibeVoice Technical Report
标题:VibeVoice技术报告
链接:https://arxiv.org/abs/2508.19205

作者:Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
摘要:本报告介绍了VibeVoice,一种旨在通过下一token扩散(next-token diffusion)合成多说话人长篇语音的新模型;下一token扩散是一种通过扩散自回归地生成潜向量来建模连续数据的统一方法。为此,我们引入了一种新颖的连续语音tokenizer,与流行的Encodec模型相比,它在保持相当性能的同时将数据压缩率提高了80倍。该tokenizer在有效保持音频保真度的同时,显著提升了处理长序列的计算效率。因此,VibeVoice可以在64K上下文窗口内合成长达90分钟、最多4个说话人的长篇语音,捕捉真实的对话“vibe”,并超越开源和专有对话模型。
摘要:This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
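
根据摘要给出的数字可以做一个粗略的量级估算:64K上下文容纳90分钟语音,意味着每秒语音约对应12个潜向量token(忽略文本脚本等其他token,仅为此处的推算示意)。

# 由摘要数字反推的粗略估算,仅为量级示意
context_tokens = 64_000
speech_seconds = 90 * 60
tokens_per_second = context_tokens / speech_seconds
print(f"≈ {tokens_per_second:.1f} tokens/s")  # 约 11.9 token/秒,远低于常见声学token率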


【3】The Ramon Llull's Thinking Machine for Automated Ideation
标题:用于自动化构思的Ramon Llull思维机器
链接:https://arxiv.org/abs/2508.19200

作者:ao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu
备注:21 pages, 3 figures
摘要:本文重温Ramon Llull的Ars combinatoria——一种通过符号重组产生知识的中世纪框架——并以之为概念基础,构建一个用于研究构思的现代Llull思维机器。我们的方法定义了三个组合轴:主题(例如效率、适应性)、领域(例如问答、机器翻译)和方法(例如对抗训练、线性注意力)。这些元素代表了科学工作中常见的高层抽象——动机、问题设定和技术路线——并作为LLM驱动探索的构建块。我们从人类专家或会议论文中挖掘这些元素,并表明用精心挑选的组合提示LLM,能够产生多样、相关且立足于当前文献的研究思路。这种现代思维机器为增强科学创造力提供了一种轻量级、可解释的工具,并为人类与AI之间的协作构思指出了一条道路。
摘要:This paper revisits Ramon Llull's Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull's thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
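
下面是一个最小示意,展示“三个组合轴做笛卡尔积、再填充构思提示词”的机制。轴上的元素取自摘要中的示例,提示词模板为此处假设的写法,并非论文原文。

from itertools import product

themes = ["efficiency", "adaptivity"]
domains = ["question answering", "machine translation"]
methods = ["adversarial training", "linear attention"]

TEMPLATE = ("Propose a research idea that improves {theme} "
            "for {domain} using {method}. Ground it in recent literature.")

for theme, domain, method in product(themes, domains, methods):
    prompt = TEMPLATE.format(theme=theme, domain=domain, method=method)
    # 实际系统中此处调用LLM生成研究构思;这里仅打印组合出的提示词
    print(prompt)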


【4】"Where does it hurt?" - Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues
链接:https://arxiv.org/abs/2508.19077

作者: Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser
备注:Accepted at ECAI 2025
摘要:在医患对话中,医生的主要目标是诊断患者并提出治疗计划。医生通过有针对性的提问来引导这些对话,以高效收集为患者提供最佳结果所需的信息。据我们所知,这是第一个研究医患对话中医生意图轨迹的工作。我们使用“环境临床智能基准”(Aci-bench)数据集进行研究。我们与医疗专业人员合作,基于SOAP框架(主观、客观、评估和计划)开发了一个细粒度的医生意图分类体系。随后,我们开展了大规模标注工作,在大量通过众包平台Prolific招募的医学专家的帮助下,标注了5000多个医患对话回合。这一大型标注数据集是一项重要的资源贡献,我们用它对医疗意图分类任务上最先进的生成式模型和编码器模型进行了基准测试。我们的研究结果表明,模型能够以很高的准确率理解医疗对话的总体结构,但往往无法识别SOAP类别之间的转换。我们还首次报告了医疗对话结构中的常见轨迹,为设计“鉴别诊断”系统提供了有价值的见解。最后,我们深入研究了意图过滤对医疗对话摘要的影响,并观察到性能的显著提升。我们在 https://github.com/DATEXIS/medical-intent-classification 上公开了代码和数据,包括标注指南。
摘要:In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing `differential diagnosis' systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.
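
下面是一个最小示意,展示如何从已标注的回合序列统计SOAP意图间的转移频率——对应摘要中“识别SOAP类别之间的转换”与“常见轨迹”的分析。意图序列为占位示例,并非Aci-bench数据。

from collections import Counter
from itertools import pairwise  # Python 3.10+

# 一段就诊对话的医生意图序列(占位示例)
turn_intents = ["Subjective", "Subjective", "Objective",
                "Objective", "Assessment", "Plan"]

transitions = Counter(pairwise(turn_intents))  # 统计相邻回合间的意图转移
for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n}")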


【5】Automatic Prompt Optimization with Prompt Distillation
标题:基于提示蒸馏的自动提示优化
链接:https://arxiv.org/abs/2508.18992

作者: Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
摘要:自动提示(autoprompting)是为语言模型自动选择优化提示的过程;由于大型语言模型(LLM)领域的广泛研究推动了提示工程的快速发展,自动提示越来越受欢迎。本文介绍了DistillPrompt——一种基于大型语言模型的新型自动提示方法,它利用训练数据,分多个阶段将任务特定信息整合进提示。DistillPrompt利用蒸馏、压缩和聚合操作来更彻底地探索提示空间。我们使用t-lite-instruct-0.1语言模型,在多个文本分类与生成任务数据集上测试了该方法。结果显示,相比该领域现有方法,DistillPrompt在关键指标上取得了显著的平均改进(例如,相比Grips在整个数据集上平均提升20.12%),从而确立了其作为自动提示中最有效的非梯度方法之一的地位。
摘要:Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.
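
摘要未给出蒸馏、压缩、聚合三个算子的具体实现。下面是一个最小的流程骨架示意,llm() 为假设的生成接口(占位实现),各提示词写法均为此处假设。

def llm(instruction: str) -> str:
    return f"[LLM output for: {instruction[:40]}...]"  # 占位:实际应调用语言模型

def distill(prompt: str, examples: list[str]) -> str:
    # 蒸馏:利用训练样例中的任务信息改写提示
    return llm(f"Rewrite this prompt using lessons from examples {examples}: {prompt}")

def compress(prompt: str) -> str:
    # 压缩:缩短提示,保留任务关键信息
    return llm(f"Shorten this prompt, keep task-critical info: {prompt}")

def aggregate(candidates: list[str]) -> str:
    # 聚合:合并多个候选提示的优点
    return llm("Merge the best parts of these prompts: " + " | ".join(candidates))

def distillprompt(seed_prompts: list[str], train_examples: list[str]) -> str:
    distilled = [distill(p, train_examples) for p in seed_prompts]
    compressed = [compress(p) for p in distilled]
    return aggregate(compressed)  # 三个阶段串联,逐步探索提示空间

print(distillprompt(["Classify the sentiment of the text."], ["I love it -> positive"]))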


【6】Affective Polarization across European Parliaments
标题:欧洲议会的情感两极分化
链接:https://arxiv.org/abs/2508.18916

作者:oski, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak
备注:6 pages, 4 figures
摘要:情感两极分化以对对立群体的负面情绪和敌意增加为特征,已成为世界各地政治话语的一个突出特征。我们的研究以完全自动化的方式考察了若干欧洲议会中这类两极分化的存在。我们利用来自六个欧洲国家议会的全面议会发言语料库,采用自然语言处理技术来估计议员的情感倾向。通过比较议员在提及对立群体成员与本群体成员时所传达的负面程度,我们发现了情感两极分化互动的模式。研究结果表明,所有六个欧洲议会都存在一致的情感两极分化。虽然活跃度与负面情绪相关,但在不太活跃和更活跃的议员之间没有观察到情感两极分化的差异。最后,我们表明,互惠是所有六个议会中议员之间情感两极分化的一个促进机制。
摘要:Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one's own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
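
下面是一个最小示意,展示该研究的核心度量思路:比较发言中提及“对立党团成员”与“本党团成员”时的情感分数差。数据与分数均为占位示例,实际研究使用NLP情感模型对议会发言打分。

from statistics import mean

# (发言者党团, 被提及者党团, 情感分数∈[-1, 1]),占位数据
references = [
    ("A", "B", -0.6), ("A", "A", 0.2), ("B", "A", -0.4),
    ("B", "B", 0.1), ("A", "B", -0.3), ("B", "B", 0.3),
]

out_group = [s for src, dst, s in references if src != dst]
in_group = [s for src, dst, s in references if src == dst]
gap = mean(in_group) - mean(out_group)
print(f"本群体均值={mean(in_group):.2f}, 对立群体均值={mean(out_group):.2f}, 差距={gap:.2f}")
# 差距显著为正即为情感两极分化的信号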


【7】ReflectivePrompt: Reflective evolution in autoprompting algorithms
标题:ReflectivePrompt:自动提示算法中的反思式进化
链接:https://arxiv.org/abs/2508.18870

作者: Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
摘要:自动提示(autoprompting)是为语言模型自动选择优化提示的过程;随着大型语言模型(LLM)领域的广泛研究推动提示工程快速发展,自动提示日益流行。本文介绍了ReflectivePrompt——一种基于进化算法的新型自动提示方法,它采用反思式进化方法,对最优提示进行更精确、更全面的搜索。ReflectivePrompt在交叉和精英变异之前引入短期与长期反思操作,以提高这些算子所引入修改的质量。该方法能够积累整个进化过程中获得的知识,并在每个世代基于当前种群对其更新。我们使用开放获取的大型语言模型t-lite-instruct-0.1和gemma3-27b-it,在33个分类与文本生成任务数据集上对ReflectivePrompt进行了测试。结果显示,该方法在各项指标上相对当前最先进方法平均取得显著提升(例如,在BBH上相比EvoPrompt提升28%),从而成为基于进化算法的自动提示中最有效的解决方案之一。
摘要:Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt - a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.
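
下面是一个最小的骨架示意,展示“短期/长期反思在交叉与精英变异之前介入”的进化循环结构。llm() 与 fitness() 为占位实现,提示词为此处假设,并非论文细节。

import random

def llm(x: str) -> str:
    return x + " [refined]"            # 占位:实际应调用语言模型

def fitness(prompt: str) -> float:
    return len(set(prompt))            # 占位适应度:实际用下游任务得分

def reflect(population, memory):
    # 短期反思:总结本世代最优提示的成功要素;长期反思:跨世代累积经验
    best = max(population, key=fitness)
    memory.append(llm(f"What made the best prompt of this epoch work? {best}"))
    return memory

def evolve(population, epochs=3):
    memory = []
    for _ in range(epochs):
        memory = reflect(population, memory)          # 反思先于交叉与变异
        parents = sorted(population, key=fitness, reverse=True)[:2]
        child = llm(f"Crossover guided by {memory[-1]}: {parents[0]} / {parents[1]}")
        elite = llm(f"Mutate the elite conservatively: {parents[0]}")
        population = parents + [child, elite]
    return max(population, key=fitness)

print(evolve(["Classify sentiment.", "Label the text's sentiment."]))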


【8】Thinking Before You Speak: A Proactive Test-time Scaling Approach
标题:说话前思考:一种主动式测试时扩展方法
链接:https://arxiv.org/abs/2508.18648

作者:Wenchang Chai, Hejun Wu, Yan Pan, Pengxu Wei, Liang Lin
摘要:大型语言模型(LLM)在数学等复杂推理任务中常常表现出缺陷,我们将其归因于人类推理模式与LLM训练数据中呈现的推理模式之间的差异。在处理复杂问题时,人类倾向于在表达解决方案之前仔细思考,但往往不会把内心的想法(包括意图和所选方法)表述出来。因此,从人类来源收集的训练数据中,可能缺少对衔接推理步骤至关重要的关键洞察。为弥合这一差距,我们提出在连续的推理步骤之间插入“洞察”(insight),用以检查当前状态并启动下一个推理步骤。与依赖单个静态提示或静态提示工作流来辅助推理的既有提示策略不同,洞察是被主动生成的,用于引导推理过程。我们将这一想法实现为名为“说话前先思考”(Thinking Before You Speak,TBYS)的推理框架,并设计了一个自动收集和过滤上下文示例以生成洞察的流水线,从而减少人工标注工作和微调开销。在具有挑战性的数学数据集上的实验验证了TBYS的有效性。项目网址:https://gitee.com/jswrt/TBYS
摘要:Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs' training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS
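
下面是一个最小示意,展示TBYS式“在相邻推理步骤之间主动生成洞察”的循环结构。llm() 为占位接口,提示词写法为此处假设。

def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:50]}...]"  # 占位:实际应调用语言模型

def tbys_solve(problem: str, max_steps: int = 3) -> list[str]:
    trace = []
    for _ in range(max_steps):
        context = problem + "\n" + "\n".join(trace)
        # 洞察:检查当前状态、声明下一步意图
        insight = llm("Review the current status and state the intent of "
                      "the next step:\n" + context)
        # 推理步骤:依据洞察执行
        step = llm("Carry out the step suggested by this insight:\n"
                   + context + "\nInsight: " + insight)
        trace += [f"Insight: {insight}", f"Step: {step}"]
    return trace

for line in tbys_solve("How many positive integers n < 100 are divisible by 6?"):
    print(line)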


【9】Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails
标题:反向提示:利用合成生产数据构建健康建议护栏
链接:https://arxiv.org/abs/2508.18384

作者:n Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren
摘要:大型语言模型(LLM)在企业环境中的普及也带来了与其使用相关的大量风险。护栏(guardrails)技术旨在通过各种检测器过滤LLM的输入/输出文本来降低这种风险。然而,开发和维护健壮的检测器面临许多挑战,其中之一是在部署之前难以获得针对真实LLM输出的生产级标注数据。在这项工作中,我们提出了backprompting(反向提示),一种简单而直观的方案,用于为健康建议护栏的开发生成类生产环境的标注数据。此外,我们将backprompting方法与一种稀疏的人在回路聚类技术相结合,为生成的数据打标签。我们的目标是构建一个大致代表原始数据集、同时又接近真实LLM输出的平行语料库。然后,我们将现有数据集与合成示例混合,为检测器生成健壮的训练数据。我们在最困难、最微妙的护栏任务之一——识别LLM输出中的健康建议——上测试了该技术,并展示了相对其他方案的改进。我们的检测器能够以比GPT-4o少400倍的参数量,取得最高超出其3.73%的表现。
摘要:The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique in one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x less parameters.
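
下面是一个最小示意,展示backprompting的数据生成思路:用提示诱导LLM产出“类生产环境”的含/不含健康建议文本,再与既有数据混合训练检测器。llm() 为占位接口,指令写法为此处假设(摘要中的人在回路聚类标注环节此处省略)。

def llm(prompt: str) -> str:
    return f"[generated text for: {prompt[:40]}...]"  # 占位:实际应调用语言模型

def backprompt(n: int = 4) -> list[tuple[str, int]]:
    samples = []
    for i in range(n):
        label = i % 2  # 1=含健康建议, 0=不含(交替生成两类样本)
        instruction = ("Answer the user like a deployed assistant and "
                       + ("include specific health advice."
                          if label else "avoid giving any health advice."))
        samples.append((llm(instruction), label))
    return samples

synthetic = backprompt()
train_data = [("Real labeled example...", 0)] + synthetic  # 与既有数据集混合
print(len(train_data), "training examples")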


【10】Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective
标题:并非所有访问者都是双语的:从无障碍角度对多语言网络的测量研究
链接:https://arxiv.org/abs/2508.18328

作者:asan Masud Bhuiyan, Matteo Varvello, Yasir Zaki, Cristian-Alexandru Staicu
备注:6 pages, 6 figures
摘要:英语是网络上的主导语言,全球排名前一千万的网站中近一半以英语为主。尽管如此,对多语言内容的支持正在增长,许多网站越来越多地在可见内容和隐藏元数据中将英语与区域性或本地语言结合使用。这种多语言性给有视觉障碍的用户带来了巨大障碍:屏幕阅读器等辅助技术往往缺乏对非拉丁文字的健全支持,并会错误渲染或错误发音非英语文本,从而加剧了不同语言环境下的无障碍挑战。然而,由于缺乏关于多语言网络内容的全面数据集,针对这一问题的大规模研究一直受限。为填补这一空白,我们引入了LangCrUX——首个大规模数据集,涵盖12种以非拉丁文字为主的语言的12万个热门网站。利用该数据集,我们对多语言网页无障碍进行了系统分析,发现无障碍提示被普遍忽视。我们发现这些提示往往无法反映可见内容的语言多样性,降低了屏幕阅读器的有效性并限制了网页无障碍。最后,我们提出了Kizuki——一个语言感知的自动化无障碍测试扩展,以应对语言不一致的无障碍提示效用有限的问题。
摘要:English is the predominant language on the web, powering nearly half of the world's top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments, as assistive technologies like screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet, large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. We find that these hints often fail to reflect the language diversity of visible content, reducing the effectiveness of screen readers and limiting web accessibility. We finally propose Kizuki, a language-aware automated accessibility testing extension to account for the limited utility of language-inconsistent accessibility hints.
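
下面是一个最小示意,对应Kizuki所针对的问题:检查HTML的lang提示与可见文本的实际语言是否一致(屏幕阅读器依赖lang提示选择发音规则)。langdetect 与 BeautifulSoup 为此处假设选用的库,并非论文实现。

from bs4 import BeautifulSoup
from langdetect import detect

html = '<html lang="en"><body><p>এটি একটি বাংলা বাক্য।</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

declared = soup.html.get("lang", "")       # 页面声明的语言提示
visible = soup.get_text(strip=True)        # 可见文本内容
actual = detect(visible)                   # 实际检测到的语言,例如 'bn'(孟加拉语)
if not declared.startswith(actual):
    print(f"lang提示({declared})与可见内容语言({actual})不一致,可能误导屏幕阅读器")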


【11】Can VLMs Recall Factual Associations From Visual References?
标题:VLM能否从视觉参考中回忆起事实关联?
链接:https://arxiv.org/abs/2508.18297

作者: Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason
备注:To appear at EMNLP 2025 (Findings)
摘要:通过一项对照研究,我们发现了视觉语言模型(VLM)在多模态接地方面的系统性缺陷。当提供实体的文本指称时,VLM能够回忆起相关的事实关联;但当指称改为视觉形式时,这种能力会显著减弱。迫使VLM依赖实体的图像表征会使其回忆事实知识的能力减半,这表明VLM难以将其关于实体的内部知识与该实体的图像表征联系起来。我们发现,这种关联失败与模型内部状态中独特模式的表达相关,并且在这些内部状态上训练的探针在标记VLM响应不可靠的情形时准确率超过92%。这些探针无需重新训练即可用于判断VLM何时无法正确回答需要理解多模态输入的问题。在视觉问答任务的选择性预测中使用时,探针将覆盖率提高了7.87%(绝对值),同时还将错误风险降低了0.9%(绝对值)。解决这种系统性且可检测的缺陷是语言接地研究的重要方向,我们为未来方向提供了有依据的建议。
摘要:Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity; their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
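
下面是一个最小示意,展示“在内部状态上训练线性探针以标记不可靠回答、并用于选择性预测”的流程。隐藏状态与标签均为随机占位数据,logistic回归为此处假设的探针形式。

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 64))        # 占位:实际为VLM某层的隐藏状态
reliable = (hidden[:, 0] > 0).astype(int)  # 占位标签:回答是否正确

# 前150条训练探针,后50条评估选择性预测
probe = LogisticRegression(max_iter=1000).fit(hidden[:150], reliable[:150])
conf = probe.predict_proba(hidden[150:])[:, 1]
answered = conf > 0.5                      # 低置信样本弃答
coverage = answered.mean()
risk = 1 - (probe.predict(hidden[150:])[answered] == reliable[150:][answered]).mean()
print(f"coverage={coverage:.2f}, selective risk={risk:.2f}")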


【12】Designing across domains with declarative thinking: Insights from the 96-Eyes ptychographic imager project
标题:以声明式思维进行跨领域设计:来自96-Eyes叠层成像仪项目的见解
链接:https://arxiv.org/abs/2508.18512

作者:Chan
摘要:本文呈现了一位从业者对将声明式的第五代问题描述语言(5GL)应用于从头(de novo)成像系统设计的反思,其经验来自学术界的跨学科研究和私营部门内的跨职能产品开发。以96-Eyes项目(用于高通量药物发现的96相机并行多模态成像仪)作为代表性案例,我说明了从硬件限制到生命科学需求的各类项目要求,如何被形式化为机器可读的问题陈述,从而保留来自不同领域利益相关者的关键任务输入。这种声明式方法增强了透明度,确保了设计可追溯性,并最大限度地减少了光学、算法、硬件加速计算和生命科学团队之间代价高昂的错位。除了结合真实代码示例对5GL进行技术讨论之外,我还反思了在命令式的第三代语言(3GL)仍是团队间协作默认媒介的环境中采用5GL的实际障碍。这些经验教训并非提供一刀切的解决方案,而是强调编程范式如何通过既有的领域层级隐式地塑造研究工作流。本讨论旨在邀请进一步探索:在并发研发(R&D)工作流日渐普及的场景中(相对于顺序的、阶段驱动的工作流仍为常态的环境),声明式问题表述如何促进创新。
摘要:This article presents a practitioner's reflection on applying declarative, 5th generation, problem formulation language (5GL) to de novo imaging system design, informed by experiences across the interdisciplinary research in academia and cross-functional product development within the private sector. Using the 96-Eyes project: 96-camera parallel multi-modal imager for high-throughput drug discovery as a representative case, I illustrate how project requirements, ranging from hardware constraints to life sciences needs, can be formalized into machine-readable problem statements to preserve mission-critical input from diverse domain stakeholders. This declarative approach enhances transparency, ensures design traceability, and minimizes costly misalignment across optical, algorithmic, hardware-accelerated compute, and life sciences teams. Alongside the technical discussion of 5GL with real-world code examples, I reflect on the practical barriers to adopting 5GL in environments where imperative, 3rd-generation languages (3GL) remain the default medium for inter-team collaboration. Rather than offering a one-size-fits-all solution, these learned lessons highlight how programming paradigms implicitly shape research workflows through existing domain hierarchies. The discussion aims to invite further explorations into how declarative problem formulations can facilitate innovation in settings where concurrent R&D workflows are gaining traction, as opposed to environments where sequential, phase-driven workflows remain the norm.


【13】Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology
标题:迈向对非裔美国英语使用者负责任的ASR:语音技术中偏见与公平的范围综述
链接:https://arxiv.org/abs/2508.18288

作者:nningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis
备注:10 pages, 9 Pages (References and Appendices). The archival version has been accepted to AAAI (AIES 2025) without the extended Appendices. This extended version includes Appendices
摘要:本范围性文献综述考察了在面向非裔美国人英语(AAE)使用者及其他语言多样化社区的自动语音识别(ASR)及相邻的语音与语言技术(SLT)中,公平、偏见和公正是如何被概念化和操作化的。基于人机交互(HCI)、机器学习/自然语言处理(ML/NLP)和社会语言学领域的44篇同行评议出版物,我们确定了四个主要研究方向:(1)研究人员如何理解与ASR相关的危害;(2)涵盖收集、策展、标注和模型训练的包容性数据实践;(3)实现语言包容的方法论与理论途径;以及(4)面向更公平系统的新兴实践与设计建议。虽然技术层面的公平干预日益增多,但我们的综述凸显了以治理为中心的方法的关键缺口,这类方法强调社区能动性、语言正义和参与式问责。我们提出了一个以治理为中心的ASR生命周期,作为负责任ASR开发的新兴跨学科框架,并为寻求解决语音AI系统中语言边缘化问题的研究人员、从业者和政策制定者提供启示。
摘要:This scoping literature review examines how fairness, bias, and equity are conceptualized and operationalized in Automatic Speech Recognition (ASR) and adjacent speech and language technologies (SLT) for African American English (AAE) speakers and other linguistically diverse communities. Drawing from 44 peer-reviewed publications across Human-Computer Interaction (HCI), Machine Learning/Natural Language Processing (ML/NLP), and Sociolinguistics, we identify four major areas of inquiry: (1) how researchers understand ASR-related harms; (2) inclusive data practices spanning collection, curation, annotation, and model training; (3) methodological and theoretical approaches to linguistic inclusion; and (4) emerging practices and design recommendations for more equitable systems. While technical fairness interventions are growing, our review highlights a critical gap in governance-centered approaches that foreground community agency, linguistic justice, and participatory accountability. We propose a governance-centered ASR lifecycle as an emergent interdisciplinary framework for responsible ASR development and offer implications for researchers, practitioners, and policymakers seeking to address language marginalization in speech AI systems.


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递

【声明】内容源于网络