
Natural Language Processing Academic Digest [8.11]

2025-08-11
Overview: 63 papers in cs.CL today



Large models (30 papers)

【1】HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Link: https://arxiv.org/abs/2508.06475

Authors: , Daniel Hershcovich, Hasti Seifi
Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA's captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
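The abstract names a frequency-based haptic tokenizer but does not specify it. As a rough illustration of the idea only, the sketch below quantizes each window's dominant vibration frequency into a discrete token id; the sample rate, window size, bin count, and frequency cap are all our own assumptions, not the paper's.

```python
import math

def haptic_tokenize(signal, sr=8000, win=256, n_bins=32, max_freq=1000.0):
    """Map a vibration waveform to discrete tokens: for each non-overlapping
    window, find the dominant frequency (naive DFT magnitude scan) and
    quantize it into one of n_bins bins. Illustrative parameters only."""
    tokens = []
    for start in range(0, len(signal) - win + 1, win):
        frame = signal[start:start + win]
        best_f, best_mag = 0.0, -1.0
        for k in range(1, win // 2):
            f = k * sr / win
            if f > max_freq:
                break
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / win) for n in range(win))
            im = -sum(frame[n] * math.sin(2 * math.pi * k * n / win) for n in range(win))
            mag = math.hypot(re, im)
            if mag > best_mag:
                best_f, best_mag = f, mag
        tokens.append(min(int(best_f / max_freq * n_bins), n_bins - 1))
    return tokens
```

A 250 Hz sine sampled at 8 kHz, for example, maps every window to the same token id, which is the discreteness such a tokenizer needs before the token sequence can be fed to a LLaMA-style model.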


【2】SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Link: https://arxiv.org/abs/2508.06447

Authors: ong, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang
Abstract: Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
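The layer-wise pruning idea can be sketched as a scheduling problem: given per-token importance scores (however SlimInfer actually derives them), each layer keeps only the strongest fraction of surviving prompt tokens while protecting the most recent ones. The scores, keep ratio, and "protect the tail" rule below are our own simplifying assumptions, not the paper's mechanism.

```python
def layerwise_prune(token_scores, n_layers=4, keep_ratio=0.75, protect=2):
    """Toy layer-wise token pruning: at each layer, drop the lowest-scoring
    prompt tokens, always retaining the last `protect` tokens as recent
    context. Returns the surviving token indices after each layer."""
    alive = list(range(len(token_scores)))
    plan = []
    for _ in range(n_layers):
        keep = max(protect, int(len(alive) * keep_ratio))
        prunable = alive[:-protect] if protect else alive[:]
        # rank prunable tokens by importance and retain the strongest
        ranked = sorted(prunable, key=lambda i: token_scores[i], reverse=True)
        kept = set(ranked[:max(0, keep - protect)]) | set(alive[-protect:])
        alive = [i for i in alive if i in kept]
        plan.append(list(alive))
    return plan
```

Because the surviving set shrinks monotonically layer by layer, a cache manager knows in advance which KV blocks each layer will need, which is what makes asynchronous prefetching possible without a separate predictor.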


【3】Echoes of Automation: The Increasing Use of LLMs in Newsmaking
Link: https://arxiv.org/abs/2508.06445

Authors: Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee
Note: To appear in 18th International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation, and to be published in the Springer LNCS series
Abstract: The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (Binoculars, Fast-DetectGPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introductions of news articles, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.


【4】Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Link: https://arxiv.org/abs/2508.06435

Authors: suto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
Abstract: Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.


【5】Sample-efficient LLM Optimization with Reset Replay
Link: https://arxiv.org/abs/2508.06412

Authors: iu, Jinyu Wang, Lei Song, Jiang Bian
Abstract: Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
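The interplay of high replay counts and periodic resets can be seen as a training schedule. The toy scheduler below replays each fresh batch several times and interleaves reset-and-rewarm events at a fixed step interval; the event names, interval, and the decision to rewarm on initial data at every reset are our own guesses at how such a schedule might look, not LoRR's actual algorithm.

```python
def lorr_schedule(n_batches, replay=4, reset_every=8):
    """Toy LoRR-style schedule: each collected batch is replayed `replay`
    times; every `reset_every` optimisation steps a reset-and-rewarm event
    (re-initialise parts of the network, revisit initial data) is emitted
    to counteract overfitting from high-replay training."""
    events, step = [], 0
    for b in range(n_batches):
        for _ in range(replay):
            step += 1
            if step % reset_every == 0:
                events.append(("reset+rewarm", b))
            events.append(("update", b))
    return events
```

The point of the sketch is the ratio: with `replay=4`, each batch contributes four gradient updates instead of one, and the periodic resets are what keep that extra reuse from locking in early (primacy-biased) experience.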


【6】LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing
Link: https://arxiv.org/abs/2508.06388

Authors: u, Xiao Pu, Yeqi Feng, Tianxing He
Note: 21 pages, 17 figures, 3 tables
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing conversations and providing emotional support as separate research directions. However, there remains a significant research gap in combining these capabilities to enable emotionally supportive interactions with virtual characters. To address this research gap, we focus on anime characters as a case study because of their well-defined personalities and large fan bases. This choice enables us to effectively evaluate how well LLMs can provide emotional support while maintaining specific character traits. We introduce ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. We first thoughtfully select 20 top-tier characters from popular anime communities and design 60 emotion-centric real-world scenario questions. Then, we execute a nationwide selection process to identify 40 Chinese anime enthusiasts with profound knowledge of specific characters and extensive experience in role-playing. Next, we systematically collect two rounds of dialogue data from 10 LLMs and these 40 Chinese anime enthusiasts. To evaluate the ESRP performance of LLMs, we design a user experience-oriented evaluation system featuring 9 fine-grained metrics across three dimensions: basic dialogue, role-playing and emotional support, along with an overall metric for response diversity. In total, the dataset comprises 2,400 human-written and 24,000 LLM-generated answers, supported by over 132,000 human annotations. Experimental results show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. We hope this work can provide valuable resources and insights for future research on optimizing LLMs in ESRP. Our datasets are available at https://github.com/LanlanQiu/ChatAnime.
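The abstract does not say how its response-diversity metric is computed; a common stand-in in dialogue evaluation is distinct-n (the ratio of unique n-grams to total n-grams across a system's responses), sketched below purely for illustration.

```python
def distinct_n(responses, n=2):
    """Distinct-n diversity: unique n-grams divided by total n-grams over a
    set of responses. Higher values mean less repetitive output."""
    total, unique = 0, set()
    for r in responses:
        toks = r.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Under a metric like this, a system that varies its phrasing across the 60 scenario questions scores higher than one that reuses stock sentences, which matches the paper's finding that human fans still lead in diversity.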


【7】Evaluating Style-Personalized Text Generation: Challenges and Directions
Link: https://arxiv.org/abs/2508.06374

Authors: angra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, Sujay Kumar Jauhar
Abstract: While prior research has built tools and benchmarks for style-personalized text generation, there has been limited exploration of evaluation in the low-resource author-style personalized text generation space. Through this work, we question the effectiveness of widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms, such as style embeddings and LLM-as-judge, to holistically evaluate the style-personalized text generation task. We evaluate these metrics and their ensembles using our style discrimination benchmark, which spans eight writing tasks and evaluates across three settings: domain discrimination, authorship attribution, and LLM personalized vs. non-personalized discrimination. We provide conclusive evidence for adopting an ensemble of diverse evaluation metrics to effectively evaluate style-personalized text generation.
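To make the authorship-attribution setting concrete, the sketch below uses character-3-gram count vectors as a crude stand-in for the style embeddings the paper evaluates: a query text is attributed to whichever candidate author's reference text is closest under cosine similarity. The featurization and the toy data are our own choices, not the paper's benchmark.

```python
from collections import Counter
import math

def char_ngram_vec(text, n=3):
    """Character n-gram counts, a simple style-sensitive representation."""
    t = f" {text.lower()} "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute_author(query, candidates):
    """Attribute `query` to the candidate author (name -> reference text)
    whose reference is most similar in char-3-gram space."""
    q = char_ngram_vec(query)
    return max(candidates, key=lambda a: cosine(q, char_ngram_vec(candidates[a])))
```

A metric like this is style-sensitive where BLEU/ROUGE are content-sensitive, which is precisely the gap that motivates evaluating ensembles of diverse metrics.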


【8】Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC
Link: https://arxiv.org/abs/2508.06309

Authors: Zhang
Abstract: In recent years, concerns about intellectual property (IP) in large language models (LLMs) have grown significantly. Plagiarizing other LLMs (through direct weight copying, upcycling, pruning, or continual pretraining) and claiming authorship without properly attributing to the original license, is a serious misconduct that can lead to significant financial and reputational harm to the original developers. However, existing methods for detecting LLM plagiarism fall short in key areas. They fail to accurately reconstruct weight correspondences, lack the ability to compute statistical significance measures such as $p$-values, and may mistakenly flag models trained on similar data as being related. To address these limitations, we propose Matrix-Driven Instant Review (MDIR), a novel method that leverages matrix analysis and Large Deviation Theory. MDIR achieves accurate reconstruction of weight relationships, provides rigorous $p$-value estimation, and focuses exclusively on weight similarity without requiring full model inference. Experimental results demonstrate that MDIR reliably detects plagiarism even after extensive transformations, such as random permutations and continual pretraining with trillions of tokens. Moreover, all detections can be performed on a single PC within an hour, making MDIR both efficient and accessible.
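MDIR's actual machinery (matrix analysis plus Large Deviation Theory) is not reproduced here. The toy below only illustrates the general idea of attaching a significance level to weight similarity: it compares two weight vectors with a correlation statistic and estimates a $p$-value by a naive permutation test. All parameter choices are our own assumptions.

```python
import math
import random
import statistics

def permutation_pvalue(w_a, w_b, n_perm=500, seed=0):
    """Toy significance test for 'are these weight vectors related?':
    the observed |correlation| is compared against the null distribution
    obtained by randomly permuting one vector. Returns a smoothed p-value."""
    rng = random.Random(seed)

    def corr(x, y):
        mx, my = statistics.fmean(x), statistics.fmean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
        return num / den if den else 0.0

    observed = abs(corr(w_a, w_b))
    hits = 0
    for _ in range(n_perm):
        shuffled = w_b[:]
        rng.shuffle(shuffled)
        if abs(corr(w_a, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

A copied-then-rescaled weight vector yields a tiny $p$-value under this test, while an unrelated vector does not; MDIR's contribution is making this kind of statement rigorous and fast at full-model scale.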


【9】Large Language Model Data Generation for Enhanced Intent Recognition in German Speech
Link: https://arxiv.org/abs/2508.06277

Authors: ekarek Rosin, Burak Can Kaplan, Stefan Wermter
Note: 11 pages, 3 figures, accepted at KONVENS 2025
Abstract: Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.


【10】EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations
Link: https://arxiv.org/abs/2508.06196

Authors: r, Ehsaneddin Asgari
Abstract: Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ-style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EICAP-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the five EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.


【11】DKG-LLM: A Framework for Medical Diagnosis and Personalized Treatment Recommendations via Dynamic Knowledge Graph and Large Language Model Integration
Link: https://arxiv.org/abs/2508.06186

Authors: adani, Maryam Abdollahi Shamami, Hamidreza Sadeghsalehi, Borhan Asadi, Saba Hesaraki
Abstract: Large Language Models (LLMs) have grown exponentially since the release of ChatGPT. These models have gained attention due to their robust performance on various tasks, including language processing tasks. These models achieve understanding and comprehension of tasks by training billions of parameters. The development of these models is a transformative force in enhancing natural language understanding and has taken a significant step towards artificial general intelligence (AGI). In this study, we aim to present the DKG-LLM framework. The DKG-LLM framework introduces a groundbreaking approach to medical diagnosis and personalized treatment recommendations by integrating a dynamic knowledge graph (DKG) with the Grok 3 large language model. Using the Adaptive Semantic Fusion Algorithm (ASFA), heterogeneous medical data (including clinical reports and PubMed articles) and patient records are dynamically fused into a knowledge graph consisting of 15,964 nodes of 13 distinct types (e.g., diseases, symptoms, treatments, patient profiles) and 127,392 edges of 26 relationship types (e.g., causal, therapeutic, association). ASFA utilizes advanced probabilistic models, Bayesian inference, and graph optimization to extract semantic information, dynamically updating the graph with approximately 150 new nodes and edges in each data category while maintaining scalability with up to 987,654 edges. Real-world datasets, including MIMIC-III and PubMed, were utilized to evaluate the proposed architecture. The evaluation results show that DKG-LLM achieves a diagnostic accuracy of 84.19%. The model also has a treatment recommendation accuracy of 89.63% and a semantic coverage of 93.48%. DKG-LLM is a reliable and transformative tool that handles noisy data and complex multi-symptom diseases, along with feedback-based learning from physician input.
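The core data structure here is a typed graph that grows dynamically as new records arrive. The minimal class below mirrors the node/edge typing the abstract describes (diseases, symptoms, treatments; causal and therapeutic relations); it is an illustrative container only, not ASFA or DKG-LLM itself.

```python
class DynamicKG:
    """Minimal typed knowledge graph with dynamic updates: nodes carry a
    type label, edges carry a relation label, and edges to unknown nodes
    are silently ignored (a stand-in for real validation logic)."""

    def __init__(self):
        self.nodes = {}   # name -> node type
        self.edges = []   # (head, relation, tail) triples

    def add_node(self, name, ntype):
        self.nodes[name] = ntype

    def add_edge(self, head, relation, tail):
        if head in self.nodes and tail in self.nodes:
            self.edges.append((head, relation, tail))

    def neighbors(self, name, relation=None):
        """Tails reachable from `name`, optionally filtered by relation."""
        return [t for h, r, t in self.edges
                if h == name and (relation is None or r == relation)]
```

An LLM-integration layer would then serialize subgraphs like `neighbors("influenza", "causal")` into the prompt, which is the general pattern for grounding diagnosis and treatment suggestions in graph facts.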


【12】Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime
Link: https://arxiv.org/abs/2508.06178

Authors: izio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira
Abstract: Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news -- ensuring no overlap with the model's pre-training data -- to evaluate the knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts -- particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
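The paper's augmentation is driven by diverse LLM prompting; as a crude, LLM-free stand-in for the same principle (one fact, many surface forms), the sketch below expands a subject-relation-object fact through fixed templates. The templates and the fact format are our own illustration.

```python
def augment_fact(subject, relation, obj):
    """Generate diverse textual variations of a single fact so a model
    sees it in multiple surface forms during continued pre-training."""
    templates = [
        "{s} {r} {o}.",
        "It is known that {s} {r} {o}.",
        "Q: What did {s} {r}? A: {o}.",
        "{o} is what {s} {r}.",
        "According to recent news, {s} {r} {o}.",
    ]
    return [t.format(s=subject, r=relation, o=obj) for t in templates]
```

Training on all five variants instead of one declarative sentence is the small-data analogue of the "diverse textual variations" the experiments find so effective; an LLM-driven version would replace the fixed templates with sampled paraphrases.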


【13】Pragmatics beyond humans: meaning, communication, and LLMs
Link: https://arxiv.org/abs/2508.06167

Authors: iak
Abstract: The paper reconceptualizes pragmatics not as a subordinate, third dimension of meaning, but as a dynamic interface through which language operates as a socially embedded tool for action. With the emergence of large language models (LLMs) in communicative contexts, this understanding needs to be further refined and methodologically reconsidered. The first section challenges the traditional semiotic trichotomy, arguing that connectionist LLM architectures destabilize established hierarchies of meaning, and proposes the Human-Machine Communication (HMC) framework as a more suitable alternative. The second section examines the tension between human-centred pragmatic theories and the machine-centred nature of LLMs. While traditional, Gricean-inspired pragmatics continue to dominate, it relies on human-specific assumptions ill-suited to predictive systems like LLMs. Probabilistic pragmatics, particularly the Rational Speech Act framework, offers a more compatible teleology by focusing on optimization rather than truth-evaluation. The third section addresses the issue of substitutionalism in three forms - generalizing, linguistic, and communicative - highlighting the anthropomorphic biases that distort LLM evaluation and obscure the role of human communicative subjects. Finally, the paper introduces the concept of context frustration to describe the paradox of increased contextual input paired with a collapse in contextual understanding, emphasizing how users are compelled to co-construct pragmatic conditions both for the model and themselves. These arguments suggest that pragmatic theory may need to be adjusted or expanded to better account for communication involving generative AI.


【14】Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
Link: https://arxiv.org/abs/2508.06155

Authors: ang, Lian Lian, Zhen Qi, Guiran Liu
Abstract: This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model's sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection. The approach is suitable for real-world applications where high trustworthiness of generated content is required.
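Sensitivity to social-attribute terms can be probed counterfactually: swap the attribute term and measure how much a model's score shifts. The helper below is a simplified stand-in for the paper's attention-weight perturbation analysis; the scoring function is assumed to be provided by the caller, and the demo scorer in the test is deliberately biased to show a non-zero sensitivity.

```python
def term_sensitivity(score_fn, text, attr_a, attr_b):
    """Counterfactual probe: absolute change in score_fn's output when the
    social-attribute term attr_a is replaced by attr_b in the input text.
    A large value flags a semantic pathway that depends on the attribute."""
    return abs(score_fn(text.replace(attr_a, attr_b)) - score_fn(text))
```

Averaging this quantity over many StereoSet-style sentence pairs would give a per-attribute sensitivity profile, which is the kind of evidence an interpretable bias report can surface.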


【15】Scaling Personality Control in LLMs with Big Five Scaler Prompts
Link: https://arxiv.org/abs/2508.06149

Authors: o, Yun-Gyung Cheong
Abstract: We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing an efficient approach for building personality-aware dialogue agents.
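Embedding numeric trait values into a prompt needs no training loop at all; it is string construction plus range validation. The builder below illustrates the idea in the spirit of Big5-Scaler, but the exact wording and the 1-to-7 scale are our own assumptions, not the paper's templates.

```python
def big5_prompt(openness, conscientiousness, extraversion,
                agreeableness, neuroticism, scale=7):
    """Build a persona prompt that embeds numeric Big Five trait values
    (each expected in [1, scale]) into natural language."""
    traits = {
        "openness": openness,
        "conscientiousness": conscientiousness,
        "extraversion": extraversion,
        "agreeableness": agreeableness,
        "neuroticism": neuroticism,
    }
    for name, v in traits.items():
        if not 1 <= v <= scale:
            raise ValueError(f"{name} must be in [1, {scale}]")
    desc = ", ".join(f"{name} is {v} out of {scale}" for name, v in traits.items())
    return f"You are a character whose {desc}. Stay in character in every reply."
```

The resulting string is prepended as a system prompt; because the trait values are explicit numbers, the same template scales smoothly from mild to extreme personas, which is what makes the control "fine-grained".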


【16】Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
标题:Less is More:选择性反射在大型语言模型中实现兼容和有效的知识蒸馏
链接:https://arxiv.org/abs/2508.06135

作者:Liu, Mengxiang Zhang
摘要:知识蒸馏(KD)是将大型语言模型(LLM)压缩成紧凑、高效的学生模型的基本技术。然而,现有的白盒KD方法主要集中在平衡真实标注与学生生成的响应,而忽略了两个关键因素:训练数据质量和学生模型兼容性。为了解决这些限制,我们提出了选择性反思蒸馏(SRD),这是一种新的数据筛选框架,它利用学生模型的反思来系统地改进训练数据。SRD通过将真实标注数据与学生模型输出进行比较,动态评估和选择提示-响应对,并通过基于难度的自动排名选择性地挑选高质量、与学生兼容的训练实例。此外,在选择训练数据之后,采用课程调度策略以固定的间隔将这些精选子集增量地引入蒸馏过程。作为一种即插即用的增强手段,SRD在不同的白盒KD方法和模型架构中持续改善蒸馏结果,并显著降低KD训练期间的计算成本。在一系列语言模型基准测试上的实验表明,在不同的KD方法和模型家族下,SRD持续提升蒸馏模型的性能,并将训练运行时间减少高达39%。值得注意的是,SRD作为即插即用模块运行,在不修改底层KD算法的情况下提高了采样效率。我们的研究结果强调,数据质量和兼容性是LLM有效和高效蒸馏的关键,SRD提供了一个同时实现这两个目标的原则性框架。这项工作推进了对KD中以数据为中心的因素的理解,并为提高压缩LLM的能力和效率提供了实用的见解。
摘要:Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, as well as decreases computational cost significantly during KD training. Experiments on a range of language model benchmarks demonstrate SRD's consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
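SRD的核心流程是"比较真实标注与学生输出、按难度排名筛选、再按课程调度分批引入"。下面是该流程的一个极简草图;其中"难度"用词重叠度这一粗略代理来示意,论文实际使用的难度度量与保留比例均未在摘要中给出:

```python
# 假设性草图:SRD式的数据筛选与课程调度。
# 用真实响应与学生响应的Jaccard词重叠作为"难度"的粗略代理:
# 重叠越低,该样本对当前学生越"难"、兼容性越差。

def difficulty(reference: str, student: str) -> float:
    """1 - Jaccard词重叠; 0 表示学生已完全复现参考答案。"""
    ref, stu = set(reference.split()), set(student.split())
    if not ref and not stu:
        return 0.0
    return 1.0 - len(ref & stu) / len(ref | stu)

def select_and_schedule(pairs, keep_ratio=0.5, stages=2):
    """按难度升序排序, 保留最兼容(较易)的部分, 再切成课程阶段。"""
    ranked = sorted(pairs, key=lambda p: difficulty(p["reference"], p["student"]))
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    size = max(1, len(kept) // stages)
    return [kept[i * size:(i + 1) * size] for i in range(stages)]

data = [
    {"reference": "paris is the capital of france", "student": "paris is the capital of france"},
    {"reference": "water boils at 100 c", "student": "water freezes at 0 c"},
    {"reference": "the sky is blue", "student": "the sky is blue today"},
    {"reference": "e equals m c squared", "student": "no idea"},
]
curriculum = select_and_schedule(data, keep_ratio=0.5, stages=2)
```

返回的每个阶段子集可在蒸馏训练中以固定间隔依次加入,对应摘要所述的课程调度策略。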


【17】AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
标题:AURA:面向大型语言模型的可供性理解与风险感知对齐技术
链接:https://arxiv.org/abs/2508.06124

作者:Adak, Pratyush Chatterjee, Somnath Banerjee, Rima Hazra, Somak Aditya, Animesh Mukherjee
摘要:当今的LLM面临着管理基于可供性的安全风险的挑战,即由于忽略了逻辑影响,输出在无意中促进了有害行为的情形。传统的安全解决方案,如基于标量结果的奖励模型、参数调整或启发式解码策略,缺乏在微妙但关键的推理步骤中可靠检测和干预所需的粒度和主动性。为了解决这一根本性差距,我们引入了AURA,这是一个以过程奖励模型(PRM)为核心的创新多层框架,提供涵盖逻辑连贯性和安全意识的全面步骤级评估。我们的框架无缝结合了内省式自我批评、细粒度的PRM评估和自适应的安全感知解码,以动态、主动地引导模型走向更安全的推理轨迹。经验证据清楚地表明,这种方法大大超过现有方法,显著提高了模型输出的逻辑完整性和可供性敏感的安全性。这项研究是迈向更安全、更负责任和更具上下文感知能力的AI的关键一步,为对齐敏感的应用设定了新的基准。
摘要:Present day LLMs face the challenge of managing affordance-based safety risks-situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence clearly demonstrates that this approach significantly surpasses existing methods, significantly improving the logical integrity and affordance-sensitive safety of model outputs. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.


【18】Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs
标题:使用指令微调LLM进行抽取式古兰经问答的Few-Shot提示
链接:https://arxiv.org/abs/2508.06103

作者:asem, Islam Oshallah, Ali Hamdi, Ammar Mohammed
备注:6 pages , 2 figures , Accepted in IMSA 2025,Egypt , this https URL
摘要:本文提出了两种用于古兰经抽取式问答(QA)的有效方法。它解决了与文本中复杂的语言、独特的术语和深层含义有关的挑战。第二种方法使用Few-Shot提示,并采用Gemini和DeepSeek等指令微调的大型语言模型。研究开发了一个用于跨度抽取的专门阿拉伯语提示框架。一个强大的后处理系统集成了子词对齐、重叠抑制和语义过滤,从而提高了精度并减少了幻觉。评估表明,配以阿拉伯语指令的大型语言模型优于传统的微调模型。最佳配置达到0.637的pAP10评分。结果证实,基于提示的指令调整对低资源、语义丰富的问答任务是有效的。
摘要:This paper presents two effective approaches for Extractive Question Answering (QA) on the Quran. It addresses challenges related to complex language, unique terminology, and deep meaning in the text. The second uses few-shot prompting with instruction-tuned large language models such as Gemini and DeepSeek. A specialized Arabic prompt framework is developed for span extraction. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering. This improves precision and reduces hallucinations. Evaluations show that large language models with Arabic instructions outperform traditional fine-tuned models. The best configuration achieves a pAP10 score of 0.637. The results confirm that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.


【19】ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
标题:ConlangCrafter:用多跳LLM管道构建语言
链接:https://arxiv.org/abs/2508.06094

作者:per, Moran Yanuka, Raja Giryes, Gašper Beguš
备注:Project page: this https URL
摘要:人工语言(conlangs),如世界语和昆雅语,在艺术、哲学和国际交流中扮演着不同的角色。与此同时,大规模基础模型已经彻底改变了文本、图像等领域的创意生成。在这项工作中,我们将现代LLM用作端到端人工语言创建的计算创造力辅助工具。我们介绍ConlangCrafter,一个多跳管道,将语言设计分解为模块化阶段:语音、形态、语法、词汇生成和翻译。在每个阶段,我们的方法都利用LLM的元语言推理能力,注入随机性以鼓励多样性,并利用自我改进反馈来鼓励新兴语言描述的一致性。我们在衡量连贯性和类型学多样性的指标上评估ConlangCrafter,证明它能够在没有人类语言学专业知识的情况下产生连贯且多样的人工语言。
摘要:Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages -- phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs' meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.
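摘要中的"多跳管道"可以抽象为:按阶段顺序调用LLM,并把此前所有阶段的输出累积进上下文。下面用一个确定性的桩函数代替真实LLM调用来示意整体控制流;阶段划分取自摘要,`stub_llm`及其输出格式为本示意的假设:

```python
# 假设性草图:多跳conlang生成管道的控制流。
# 每个阶段基于此前所有阶段的描述生成新模块;真实系统在此处
# 调用LLM并叠加随机性注入与自我改进反馈,这里用桩函数代替。

STAGES = ["phonology", "morphology", "syntax", "lexicon", "translation"]

def stub_llm(stage: str, context: str) -> str:
    # 占位:真实实现中这里是一次LLM调用,输入为累积的语言描述。
    return f"<{stage} designed with {context.count('<')} prior modules>"

def conlang_pipeline(llm=stub_llm):
    description = ""
    for stage in STAGES:
        module = llm(stage, description)   # 多跳:携带全部已有描述
        description += module + "\n"
    return description

spec = conlang_pipeline()
print(spec)
```

把`stub_llm`换成真实的LLM调用(并在每个阶段后追加自我改进一轮)即得到摘要所述管道的骨架。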


【20】Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings
标题:通过调整预训练嵌入对大型语言模型进行高效知识探测
链接:https://arxiv.org/abs/2508.06030

作者:arma, Yiqiao Jin, Rakshit Trivedi, Srijan Kumar
摘要:大型语言模型(LLM)在生成式预训练过程中获得跨不同领域的知识,例如科学、历史和地理。然而,由于它们的随机性,很难预测LLM已经获得了什么。先前的工作已经开发了不同的方法来探测这种知识:调查隐藏表示、制作特定任务提示、策划代表性样本,以及估计其不确定性。然而,这些方法需要通过底层模型进行前向传播,以探测LLM关于特定事实的知识,这使得它们计算昂贵且耗时。为了弥合这一差距,我们提出了$\textbf{PEEK}$($\textbf{P}$roxy $\textbf{E}$mbeddings to $\textbf{E}$stimate $\textbf{K}$nowledge),即利用能够以文本或图的形式有效编码事实知识的预训练嵌入模型作为LLM的代理。首先,我们通过各种探测策略识别LLM已知事实的训练集,然后调整嵌入模型,使用线性解码器层预测LLM的输出。对3个维基百科衍生数据集、4个LLM和7个嵌入模型的综合评估表明,嵌入可以在保留集上以高达90%的准确率预测LLM知识。此外,我们发现句子嵌入模型比图嵌入更适合预测LLM知识,从而揭示了事实景观的底层表示。因此,我们认为,经过知识适配的嵌入可以用于大规模识别LLM中的知识缺口,并能为LLM的内部归纳偏置提供更深入的见解。代码和数据可在https://github.com/claws-lab/peek上获得。
摘要:Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM's knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose $\textbf{PEEK}$ or $\textbf{P}$roxy $\textbf{E}$mbeddings to $\textbf{E}$stimate $\textbf{K}$nowledge of LLMs, by leveraging the pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on $3$ Wikipedia-derived datasets, $4$ LLMs, and $7$ embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90 % accuracy. Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs' internal inductive bias. The code and data are made available at https://github.com/claws-lab/peek.
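PEEK的关键一步是在(冻结的)事实嵌入之上训练一个线性解码层,预测LLM是否"知道"某事实。下面用纯Python实现的逻辑回归在手造的2维玩具嵌入上示意这一线性探针;嵌入与标签均为虚构数据,真实系统使用句子/图嵌入模型的输出:

```python
# 假设性草图:在冻结的事实嵌入上训练线性探针,
# 预测LLM是否"知道"该事实。嵌入为手造的2维玩具向量。
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(X, y, lr=0.5, epochs=500):
    """逐样本梯度下降训练逻辑回归(即一个线性解码层)。"""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi                      # 对数损失关于z的梯度
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)

# 玩具"事实嵌入":第一维大致编码"LLM是否见过该事实"
X = [[2.0, 0.1], [1.5, -0.3], [-1.8, 0.2], [-2.2, -0.1]]
y = [1, 1, 0, 0]                            # 1 = LLM知道该事实
w, b = train_probe(X, y)
preds = [predict(w, b, x) for x in X]       # 线性可分的玩具数据上应复现标签
```

探针训练好后,预测一个新事实是否被LLM掌握只需一次嵌入前向加一次点积,避免了对LLM本身的前向传播,这正是PEEK声称的效率来源。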


【21】Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
标题:时间自我奖励语言模型:通过过去-未来将选择-拒绝脱钩
链接:https://arxiv.org/abs/2508.06026

作者:ng, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang
备注:12 pages, 5 figures
摘要:自我奖励语言模型提出了一种架构,其中大型语言模型(LLM)既生成响应,又通过LLM-as-a-Judge提示评估自己的输出,通过迭代的直接偏好优化(DPO)动态提高其生成能力。然而,我们的分析揭示了现有自我奖励范式的一个关键限制:被选择和被拒绝响应的同步改进逐渐缩小了对比样本之间的表示差异,破坏了有效的偏好学习。我们提出了时间自我奖励语言模型,战略性地协调过去、现在和未来的模型世代,以维持学习信号。我们的双阶段框架引入了:(1)锚定拒绝:使用过去初始模型的输出来固定被拒绝的响应;(2)未来引导选择:使用下一代模型的预测来动态筛选被选择的样本。在三个模型家族(Llama、Qwen、Mistral)和不同模型大小(Llama3B/8B/70B)上进行的广泛实验表明,与使用相同计算资源的自我奖励相比,使用我们的方法训练时性能得到了显著改善。例如,使用我们的方法,Llama3.1-8B在AlpacaEval 2.0上达到了29.44的胜率,比自我奖励基线(19.69)高出9.75。值得注意的是,即使我们没有专门收集此类训练数据,我们的方法也在数学推理(GSM8K)、基于知识的问答(ARC、TruthfulQA)和代码生成(HumanEval)任务中表现出卓越的分布外泛化能力。
摘要:Self-Rewarding Language Models propose an architecture in which the Large Language Models(LLMs) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose \textbf{Temporal Self-Rewarding Language Models} that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) \textit{Anchored Rejection} - fixing rejected responses using the past initial model's outputs and (2) \textit{Future-Guided Chosen} - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
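摘要中的双阶段机制可以概括为:被拒绝响应固定取自"过去"的初始模型(锚定拒绝),被选择响应由"下一代"模型动态提供(未来引导选择)。下面是偏好对构建逻辑的一个极简示意;模型用返回字符串的可调用对象代替,字段名沿用DPO训练常见的prompt/chosen/rejected约定,属本示意的假设:

```python
# 假设性草图:时间自我奖励中的DPO偏好对构建。
# past_model 提供被拒绝响应(锚定拒绝),
# future_model 提供被选择响应(未来引导选择);
# 真实流程中两者是不同训练阶段的LLM,这里用简单函数代替。

def build_dpo_pairs(prompts, past_model, future_model):
    pairs = []
    for p in prompts:
        pairs.append({
            "prompt": p,
            "rejected": past_model(p),   # 固定:来自过去的初始模型
            "chosen": future_model(p),   # 动态:来自下一代模型
        })
    return pairs

past = lambda p: f"[draft answer to: {p}]"
future = lambda p: f"[refined answer to: {p}]"
pairs = build_dpo_pairs(["What is DPO?"], past, future)
```

由于rejected被锚定在早期世代,chosen与rejected之间的表示差距不会随迭代同步收窄,这正是该方法要维持的学习信号。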


【22】Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
标题:Bifrost-1:用块级CLIP潜变量桥接多模态LLM和扩散模型
链接:https://arxiv.org/abs/2508.05954

作者:Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
备注:Project Page: this https URL
摘要:人们越来越关注在不损害大型语言模型(LLM)强大推理能力的前提下,将高保真视觉合成功能集成到其中。直接训练LLM或桥接LLM和扩散模型的现有方法通常训练成本高昂,因为骨干LLM在预训练期间没有见过图像表示。我们提出了Bifrost-1,这是一个统一框架,它使用块级CLIP图像嵌入作为潜变量来桥接预训练的多模态LLM(MLLM)和扩散模型,这些嵌入与MLLM的CLIP视觉编码器原生对齐。通过对扩散模型的ControlNet进行轻量级适配,这些块级图像嵌入被集成到扩散模型中。为了保留MLLM原有的多模态推理能力,我们为MLLM配备了一个视觉生成分支,该分支在预测块级图像嵌入时从原始MLLM参数初始化。通过将预训练的MLLM和扩散模型与块级CLIP潜变量无缝集成,我们的框架能够以显著的训练效率实现高保真可控图像生成。我们的实验表明,Bifrost-1在视觉保真度和多模态理解方面实现了与之前方法相当或更好的性能,且训练期间的计算量大幅降低。我们还提供了全面的消融研究,显示了我们设计选择的有效性。
摘要:There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.


【23】Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models
标题:机器会带着情感思考吗?大型语言模型的认知评价分析
链接:https://arxiv.org/abs/2508.05880

作者:tacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, James Z. Wang
摘要:情感计算已被确立为促进人工智能(AI)系统整体发展的一个重要研究领域。基础模型,特别是大型语言模型(LLM),在过去的多项工作中已被评估、训练或指令微调,以成为更好的情绪预测器或生成器。然而,这些研究大多以监督方式处理与情绪相关的任务,使用与刺激(例如文本、图像、视频、音频)相关联的离散情绪标签来评估或训练LLM的能力。评估研究尤其往往局限于标准而表面的情绪相关任务,例如识别诱发或表达的情绪。在本文中,我们超越表面水平的情绪任务,研究LLM如何通过认知维度对情绪进行推理。借鉴认知评价理论,我们考察LLM在推理带有情绪色彩的刺激时,是否能产生连贯且合理的认知推理。我们引入了一个大规模的情绪认知推理基准CoRE,以评估LLM在情绪推理中隐式使用的内部认知结构。通过大量的评估实验和分析,我们试图回答:(a)模型是否更可能隐式依赖特定的认知评价维度?(b)哪些认知维度对于刻画特定情绪很重要?(c)LLM中不同情绪类别的内部表示能否通过认知评价维度来解释?我们的结果和分析揭示了不同LLM各异的推理模式。我们的基准和代码将公开提供。
摘要:Affective Computing has been established as a crucial field of inquiry to advance the holistic development of Artificial Intelligence (AI) systems. Foundation models -- especially Large Language Models (LLMs) -- have been evaluated, trained, or instruction-tuned in several past works, to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when reasoning about emotionally charged stimuli. We introduce a large-scale benchmark on Cognitive Reasoning for Emotions - CoRE - to evaluate internal cognitive structures implicitly used by LLMs for emotional reasoning. Through a plethora of evaluation experiments and analysis, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions?, (b) What cognitive dimensions are important for characterizing specific emotions?, and, (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs. Our benchmark and code will be made publicly available.


【24】"Mirror" Language AI Models of Depression are Criterion-Contaminated
标题:抑郁症的"镜像"语言AI模型存在效标污染
链接:https://arxiv.org/abs/2508.05830

作者:Rasiq Hussain, Mehak Gupta, Joshua R. Oltmanns
备注:39 pages, 9 figures
摘要:越来越多的研究表明,基于LLM语言的抑郁评估分数预测接近完美(R2高达.70)。然而,许多研究直接从对抑郁评估的语言回答中开发这些模型。这些"镜像模型"存在"效标污染":当预测分数部分取决于预测因子本身时,就会出现这种情况。这会导致效应量被人为夸大,从而降低模型的泛化能力。本研究比较了镜像模型与非镜像模型的表现,后者是从不反映其所要预测的评估的语言中开发出来的。110名研究参与者(N = 110)完成了两种不同的访谈:结构化诊断访谈和生活史访谈。然后提示GPT-4、GPT-4o和LLaMA3-70B分别从两份转录文本预测结构化诊断访谈的抑郁评分。镜像模型(使用结构化诊断数据)显示出非常大的效应量(例如,R2 = .80)。正如预期的那样,非镜像模型(使用生活史数据)的效应量较小,但仍相对较大(例如,R2 = .27)。当把镜像和非镜像模型预测的结构化访谈抑郁评分与自我报告的抑郁症状相关联时,两者的表现相同(例如,r = ~.54),表明镜像模型可能由于效标污染而包含偏差。主题建模识别出了镜像与非镜像模型之间以及真阳性与假阳性预测之间的聚类。在这项头对头的比较研究中,抑郁症的镜像语言AI模型显示出被人为夸大的效应量和较低的泛化能力。随着抑郁症语言AI模型的不断发展,纳入非镜像模型可能有助于识别可解释且可泛化的语义特征,这些特征在现实世界的心理评估中具有独特的实用价值。
摘要:A growing number of studies show near-perfect LLM language-based prediction of depression assessment scores (up to R2 of .70). However, many develop these models directly from language responses to depression assessments. These "Mirror models" suffer from "criterion contamination", which arises when a predicted score depends in part on the predictors themselves. This causes artificial effect size inflation which reduces model generalizability. The present study compares the performance of Mirror models versus "Non-Mirror models", which are developed from language that does not mirror the assessment they are developed to predict. N = 110 research participants completed two different interviews: structured diagnostic and life history interviews. GPT-4, GPT-4o and LLaMA3-70B were then prompted to predict structured diagnostic interview depression scores from the two transcripts separately. Mirror models (using structured diagnostic data) showed very large effect sizes (e.g., R2 = .80). As expected, NonMirror models (using life history data) demonstrated smaller effect sizes, but were relatively large (e.g., R2 = .27). When Mirror and Non-Mirror model-predicted structured interview depression scores were correlated with self-reported depression symptoms, Mirror and NonMirror performed the same (e.g., r = ~.54), indicating that Mirror models contain bias perhaps due to criterion contamination. Topic modeling identified clusters across Mirror and Non-Mirror models, as well as between true-positive and false-positive predictions. In this head-to-head comparison study, Mirror language AI models of depression showed artificially inflated effect sizes and less generalizability. As language AI models for depression continue to evolve, incorporating Non-Mirror models may identify interpretable, and generalizable semantic features that have unique utility in real-world psychological assessment.


【25】Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models
标题:类人转瞬即逝的记忆可以改善语言学习,但会损害Transformer语言模型中的阅读时间预测
链接:https://arxiv.org/abs/2508.05803

作者:hamma, Micha Heilbron
摘要:人类的记忆是短暂的。当单词被处理时,构成新句子的确切词形很快就会丢失。认知科学家长期以来一直认为,这种记忆的局限性可能反而有助于学习语言,这一观点得到了经典联结主义建模工作的支持。Transformer的兴起似乎挑战了这一想法,因为这些模型尽管没有记忆限制或其他架构上的近因偏置,却能有效地学习语言。在这里,我们在对Transformer语言模型的严格受控实验中,检验了短暂记忆有益于语言学习这一假设。在符合发展现实的训练集上训练具有和不具有短暂记忆的Transformer,我们发现短暂记忆始终能改善语言学习(通过整体语言建模性能和针对性句法评估来量化),但出乎意料的是,它会损害基于惊异度的人类阅读时间预测。有趣的是,后续分析显示,这种差异(更好的语言建模,但更差的阅读时间预测)无法用"更好的语言模型为何有时反而更差地拟合人类阅读时间"的既有解释来说明。总之,这些结果支持记忆限制对神经网络语言学习有益,但对预测行为无益。
摘要:Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.


【26】DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection
标题:DMFI:基于LLM的内部威胁检测的双模式微调和推理框架
链接:https://arxiv.org/abs/2508.05694

作者:Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng, Zhiying Li, Jian Weng
备注:Submitted to the 2025 IEEE International Conference on Data Mining (ICDM)
摘要:由于恶意内部行为的微妙性、长期性和上下文依赖性,内部威胁检测(ITD)在网络安全中构成了持续且高影响力的挑战。传统模型往往难以捕捉语义意图和复杂的行为动态,而现有的基于LLM的解决方案在提示适应性和模态覆盖方面存在局限。为了弥合这一差距,我们提出了DMFI,一个将语义推理与行为感知微调相结合的双模态框架。DMFI将原始日志转换为两个结构化视图:(1)语义视图,使用指令格式的提示处理内容丰富的工件(例如电子邮件、https);(2)行为抽象,通过4W引导(When-Where-What-Which)的转换构建,以编码上下文动作序列。两个LoRA增强的LLM被独立微调,其输出通过基于轻量级MLP的决策模块进行融合。我们进一步引入了DMFI-B,这是一种区分正常和异常行为表示的判别式适配策略,可以提高严重类不平衡下的鲁棒性。在CERT r4.2和r5.2数据集上的实验表明,DMFI在检测精度方面优于现有最先进方法。我们的方法将LLM的语义推理能力与结构化行为建模相结合,为现实世界的内部威胁检测提供了可扩展且有效的解决方案。我们的工作证明了将LLM推理与结构化行为建模相结合的有效性,为现代内部威胁检测提供了可扩展且可部署的解决方案。
摘要:Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity due to the subtle, long-term, and context-dependent nature of malicious insider behaviors. Traditional models often struggle to capture semantic intent and complex behavior dynamics, while existing LLM-based solutions face limitations in prompt adaptability and modality coverage. To bridge this gap, we propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning. DMFI converts raw logs into two structured views: (1) a semantic view that processes content-rich artifacts (e.g., emails, https) using instruction-formatted prompts; and (2) a behavioral abstraction, constructed via a 4W-guided (When-Where-What-Which) transformation to encode contextual action sequences. Two LoRA-enhanced LLMs are fine-tuned independently, and their outputs are fused via a lightweight MLP-based decision module. We further introduce DMFI-B, a discriminative adaptation strategy that separates normal and abnormal behavior representations, improving robustness under severe class imbalance. Experiments on CERT r4.2 and r5.2 datasets demonstrate that DMFI outperforms state-of-the-art methods in detection accuracy. Our approach combines the semantic reasoning power of LLMs with structured behavior modeling, offering a scalable and effective solution for real-world insider threat detection. Our work demonstrates the effectiveness of combining LLM reasoning with structured behavioral modeling, offering a scalable and deployable solution for modern insider threat detection.
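摘要中的"4W引导转换"即把原始日志记录映射为When-Where-What-Which形式的上下文动作序列。下面是一个极简示意;字段名与输出格式均为本示意的假设,论文摘要未给出具体模式:

```python
# 假设性示意:DMFI式的4W(When-Where-What-Which)行为抽象。
# 将原始日志事件映射为编码上下文动作序列的结构化文本。
# 字段名(time/host/action/object)与拼接格式均为假设。

def to_4w(event: dict) -> str:
    return ("When={time} Where={host} What={action} Which={object}"
            .format(**event))

log = [
    {"time": "2010-01-04T08:59", "host": "PC-3142",
     "action": "logon", "object": "domain"},
    {"time": "2010-01-04T09:12", "host": "PC-3142",
     "action": "email", "object": "external_addr"},
]
# 拼成一条行为序列,作为行为视图输入给行为分支的LLM
behavior_view = " ; ".join(to_4w(e) for e in log)
```

得到的行为序列与语义视图(指令格式的原文提示)分别送入两个LoRA微调的LLM,再融合其输出作出最终判断。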


【27】Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
标题:针对马来西亚审计财务报告中财务表格Markdown转换的视觉语言模型微调
链接:https://arxiv.org/abs/2508.05669

作者:Tan (Faculty of Computer Science and Information Technology, Universiti Malaya), En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah
备注:28 pages, 14 figures, 5 tables. Evaluation code (LLM-as-a-judge and Markdown TEDS) is available at this https URL. The development dataset and evaluation benchmark are available on Hugging Face at this https URL and this https URL respectively
摘要:从财务文档中准确提取和表示表格数据的结构仍然是文档理解中的一项关键挑战,特别是在监管和分析用例中。这项研究解决了将马来西亚审计财务报告中的财务表格转换为Markdown格式的复杂性,该任务因旋转布局、多级表头和隐式结构线索而变得更加困难。我们提出了一个基于Qwen2.5-VL-7B微调的视觉语言模型(VLM),针对从文档图像生成高保真Markdown进行了优化。我们的方法包括一个包含2,152个图像-文本对并经过数据增强的精选数据集,以及使用LoRA的监督微调策略。为了评估性能,我们使用双重框架在100个样本外表格上评估了我们的模型:用于细粒度准确性的基于标准的LLM-as-a-judge,以及我们新颖的、用于整体结构保真度的基于树编辑距离相似性的Markdown TEDS指标。我们的模型在基于标准的评估中达到了92.20%的总体准确率,Markdown TEDS得分为96.53%。这一性能大大超过了其Qwen2.5-VL-7B基础模型、更大规模的VLM以及支持推理的专用模型。与这些自托管的替代方案相比,它还显著减少了推理时间。此外,其准确率超过了OpenAI的GPT-4o和Gemini 2.5 Flash等广泛使用的专有模型。这些结果表明,领域特定的微调提供了一种有效且高效的方法来弥合非结构化财务文档与下游自动化之间的差距,在没有计算开销的情况下可与更大、更通用的模型相媲美。
摘要:Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI's GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.


【28】A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
标题:基于LLM的深度搜索代理调查:范式,优化,评估和挑战
链接:https://arxiv.org/abs/2508.05668

作者:, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang
摘要:大型语言模型(LLM)的出现极大地改变了网络搜索。基于LLM的搜索代理的出现标志着向更深入、动态、自主的信息寻求的关键转变。这些代理能够理解用户意图和环境上下文,并通过动态规划执行多轮检索,将搜索能力扩展到远超网页范围。像OpenAI的Deep Research这样的领先范例突显了它们在深度信息挖掘和现实世界应用中的潜力。这项综述首次对搜索代理进行了系统分析。我们从架构、优化、应用和评估的角度全面分析并归类现有工作,最终确定关键的开放挑战,并概述这一快速发展领域中有前景的未来研究方向。我们的资源库可在https://github.com/YunjiaXi/Awesome-Search-Agent-Papers上找到。
摘要:The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environmental context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI's Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field. Our repository is available on https://github.com/YunjiaXi/Awesome-Search-Agent-Papers.


【29】AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models
标题:AttriLens-Mol:使用大型语言模型进行分子性质预测的属性引导强化学习
链接:https://arxiv.org/abs/2508.04748

作者: Long Chen, Yile Wang
备注:9 pages
摘要:大型语言模型(LLM)在协助分子性质预测任务方面展现了潜力,但通常依赖于人工设计的提示和思维链模板。虽然最近的先进大型推理模型(如DeepSeek-R1)采用强化学习来扩展"思考"过程,但它们的推理可能冗长且缺乏相关性。我们介绍了AttriLens-Mol,一个用于LLM分子性质预测的属性引导强化学习框架。AttriLens-Mol通过以下方式引导模型的推理:(1)鼓励基于属性的结构化输出的格式奖励,(2)避免枚举无关属性的计数奖励,以及(3)使用先进LLM和RDKit验证所生成属性相关性的合理性奖励。这种方法在推理过程中隐式地引出模型对相关分子属性的固有知识,使分子性质预测更加有效。在分布内和分布外数据集上的实验表明,使用我们提出的AttriLens-Mol方法,在4,000个样本上训练7B规模的R1-Distilled-Qwen2.5和R1-Distilled-LLaMA3.1模型显著提升了性能,获得了与监督微调模型(Mol-Instructions、ChemDFM等)和先进模型(GPT-3.5、GPT-4o、DeepSeek-V3、DeepSeek-R1等)相当或更好的结果。此外,我们为目标性质提取的属性在用作可解释决策树模型的特征时,与通过提示LLM生成的属性相比产生了更优的性能。这表明AttriLens-Mol有效地引出了更相关、更具预测性的分子属性,从而增强了性质预测的可解释性和性能。我们在https://github.com/szu-tera/AttriLens-Mol发布了代码。
摘要:Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended ``thinking'' process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model's reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model's inherent knowledge of relevant molecular attributes during reasoning, enables making predictions for the molecular property more effectively. Experiments on both in-distribution and out-of-distribution datasets show that, training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts the performance, getting comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code in https://github.com/szu-tera/AttriLens-Mol.


【30】NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
标题:NanoCodec:迈向高质量超快速语音LLM推理
链接:https://arxiv.org/abs/2508.05835

作者:Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg
备注:Accepted to Interspeech 2025
摘要:大型语言模型(LLM)通过利用音频编解码器将音频离散化为令牌,使语言建模技术能够应用于语音数据,从而大大推进了音频处理。然而,现有的音频编解码器通常以高帧率运行,导致训练和推理缓慢,对自回归模型尤其如此。为了解决这个问题,人们对低帧率音频编解码器越来越感兴趣,它减少了生成一秒音频所需的自回归步数。在本文中,我们进行消融研究,以检验帧率、比特率和因果性对编解码器重建质量的影响。基于我们的发现,我们推出了NanoCodec,这是一种最先进的音频编解码器,仅以每秒12.5帧(FPS)即可实现高质量压缩。NanoCodec在各种比特率范围内都优于相关工作,为低延迟、高效的语音LLM训练和推理建立了新的基准。
摘要:Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
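低帧率之所以重要,是因为帧率直接决定自回归语音LLM的解码步数,简单算术即可说明(12.5 FPS取自摘要;作为对比的75 FPS为假设的高帧率编解码器,非摘要给出):

```python
# 简单算术示意:帧率 x 时长(x 每帧令牌数)= 自回归解码步数。
# 12.5 FPS 来自摘要;75 FPS 仅为对比用的假设值。

def ar_steps(frame_rate_fps: float, seconds: float, codebooks: int = 1) -> int:
    """生成指定时长音频所需的自回归步数(每帧codebooks个令牌)。"""
    return int(frame_rate_fps * seconds * codebooks)

low = ar_steps(12.5, 10)    # NanoCodec帧率: 10秒音频 -> 125步
high = ar_steps(75.0, 10)   # 假设的高帧率编解码器 -> 750步
print(low, high)
```

在其他条件相同的前提下,帧率降为1/6意味着自回归步数(以及相应的生成延迟)也大致降为1/6。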


Transformer(1篇)

【1】Crisp Attention: Regularizing Transformers via Structured Sparsity
标题:清晰注意力:通过结构化稀疏性正则化Transformer
链接:https://arxiv.org/abs/2508.06016

作者:dhi, Vishal Gandhi
摘要:自注意力机制的二次计算成本是扩展Transformer模型的主要挑战。虽然注意力稀疏化作为提高计算效率的技术被广泛研究,但人们几乎普遍认为它以模型精度为代价。在本文中,我们报告了一个与这一共识相悖的惊人反例。通过在SST-2情感分析任务的微调过程中,向DistilBERT模型的注意力机制引入结构化的事后稀疏性,我们发现模型的准确率显著提高。我们具有80%注意力稀疏度的模型达到了91.59%的验证准确率,比稠密基线绝对提高了0.97%。我们假设这种现象源于稀疏性充当了强大的隐式正则化器,通过迫使模型使用更受约束、更鲁棒的特征集进行预测来防止过拟合。我们的工作将注意力稀疏化重新定位:它不仅是提高计算效率的工具,也是提升Transformer模型泛化能力和性能的潜在方法。
摘要:The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80\% attention sparsity achieves a validation accuracy of 91.59\%, a 0.97\% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained and robust set of features. Our work recasts attention sparsity not just as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
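下面用 NumPy 给出"结构化事后稀疏"思路的一个极简示意:对每行注意力得分只保留最高的 20%(对应 80% 稀疏度),其余位置在 softmax 前置为负无穷。这只是对思路的假设性草图,并非论文的具体实现:

```python
import numpy as np

def sparse_attention_weights(scores: np.ndarray, sparsity: float) -> np.ndarray:
    """对每行注意力得分只保留最大的 (1 - sparsity) 比例,
    其余置为 -inf 后再做 softmax,得到稀疏的注意力权重。"""
    n = scores.shape[-1]
    k = max(1, int(round(n * (1.0 - sparsity))))   # 每行保留的条目数
    out = np.full_like(scores, -np.inf)
    for i, row in enumerate(scores):
        keep = np.argsort(row)[-k:]                # 得分最高的 k 个位置
        out[i, keep] = row[keep]
    e = np.exp(out - out.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = sparse_attention_weights(rng.normal(size=(4, 10)), sparsity=0.8)
print((w > 0).sum(axis=-1))  # 每行只剩 2 个非零权重
```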


GAN|生成相关(9篇)

【1】A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
标题:检索增强生成的系统性文献综述:技术、指标与挑战
链接:https://arxiv.org/abs/2508.06401

作者:own, Muhammad Roman, Barry Devereux
备注:58 pages
摘要:本文对检索增强生成(RAG)的研究文献进行了系统综述,重点分析了2020年至2025年5月期间发表的最高引用研究。共有128篇文章符合我们的纳入标准。这些记录来自ACM数字图书馆、IEEE Xplore、Scopus、ScienceDirect和数字书目和图书馆项目(DBLP)。RAG将神经检索器与生成式语言模型相结合,使输出锚定于最新的非参数记忆,同时保留存储在模型权重中的语义泛化能力。在PRISMA 2020框架的指导下,我们(i)根据引用数和研究问题指定明确的纳入和排除标准,(ii)编目数据集、架构和评估实践,以及(iii)综合关于RAG有效性和局限性的经验证据。为了减轻引用滞后偏差,我们对2025年发表的论文采用了较低的引用数阈值,以便仍能捕获引用自然较少的新兴突破。这篇综述澄清了当前的研究格局,突出了方法上的差距,并为未来的研究指明了优先方向。
摘要:This systematic review of the research literature on retrieval-augmented generation (RAG) provides a focused analysis of the most highly cited studies published between 2020 and May 2025. A total of 128 articles met our inclusion criteria. The records were retrieved from ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP). RAG couples a neural retriever with a generative language model, grounding output in up-to-date, non-parametric memory while retaining the semantic generalisation stored in model weights. Guided by the PRISMA 2020 framework, we (i) specify explicit inclusion and exclusion criteria based on citation count and research questions, (ii) catalogue datasets, architectures, and evaluation practices, and (iii) synthesise empirical evidence on the effectiveness and limitations of RAG. To mitigate citation-lag bias, we applied a lower citation-count threshold to papers published in 2025 so that emerging breakthroughs with naturally fewer citations were still captured. This review clarifies the current research landscape, highlights methodological gaps, and charts priority directions for future research.
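文中"对 2025 年发表的论文采用较低引用数阈值"的纳入逻辑,可以概括为如下小函数。两个阈值数值均为假设,论文并未给出具体门槛:

```python
def meets_inclusion(year: int, citations: int,
                    base_threshold: int = 50, recent_threshold: int = 10) -> bool:
    """示意性的纳入标准:2025 年及之后发表的论文使用更低的引用数门槛,
    以缓解引用滞后偏差。"""
    threshold = recent_threshold if year >= 2025 else base_threshold
    return citations >= threshold

papers = [(2021, 320), (2023, 40), (2025, 12), (2025, 3)]
included = [p for p in papers if meets_inclusion(*p)]
print(included)  # [(2021, 320), (2025, 12)]
```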


【2】You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
标题:RAG不需要预构建图形:具有自适应推理结构的检索增强生成
链接:https://arxiv.org/abs/2508.06105

作者: Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang
摘要:大型语言模型(LLM)通常会产生幻觉,在处理超出其知识和感知范围的问题时生成事实上不正确的陈述。检索增强生成(RAG)通过从知识库中检索与查询相关的上下文来支持LLM推理,从而解决这个问题。最近的进展利用预构建的图来捕获分布式文档之间的关系连接,在复杂任务中表现出显著的性能。然而,现有的基于图的RAG(GraphRAG)方法依赖于一个昂贵的过程来将语料库转换为图,引入了高昂的令牌成本和更新延迟。此外,现实世界的查询在类型和复杂性上各不相同,需要不同的逻辑结构来进行准确的推理,而预构建的图可能与这些所需结构不一致,导致知识检索失效。为此,我们提出了一个逻辑感知的检索增强生成框架(LogicRAG),该框架在推理时动态提取推理结构,以指导自适应检索,而无需任何预构建的图。LogicRAG首先将输入查询分解为一组子问题,并构造有向无环图(DAG)来建模它们之间的逻辑依赖关系。为了支持连贯的多步推理,LogicRAG随后使用拓扑排序将图线性化,使子问题能够以逻辑一致的顺序进行处理。此外,LogicRAG使用图剪枝来减少冗余检索,并使用上下文剪枝来过滤不相关的上下文,显著降低了整体令牌成本。大量实验表明,与最先进的基线相比,LogicRAG同时实现了更优的性能和效率。
摘要:Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a \textbf{\underline{Logic}}-aware \textbf{\underline{R}}etrieval-\textbf{\underline{A}}ugmented \textbf{\underline{G}}eneration framework (\textbf{LogicRAG}) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
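LogicRAG 的"分解子问题、构建 DAG、拓扑排序"流程,可以用标准库 graphlib 做一个极简示意。子问题内容为虚构示例,并非论文数据:

```python
from graphlib import TopologicalSorter

# 键为子问题,值为它依赖的前置子问题集合(即 DAG 的前驱)
dependencies = {
    "谁是X公司的现任CEO?": set(),
    "该CEO毕业于哪所大学?": {"谁是X公司的现任CEO?"},
    "这所大学位于哪个城市?": {"该CEO毕业于哪所大学?"},
}

# 拓扑排序保证每个子问题在其依赖之后被处理
order = list(TopologicalSorter(dependencies).static_order())
for step, sub_question in enumerate(order, start=1):
    print(step, sub_question)
```

对于有分支的 DAG,static_order 同样会给出一个逻辑一致的线性顺序,后续即可按序逐个检索与求解子问题。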


【3】ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation
标题:ThematicPlane:弥合隐性用户意图和图像生成的潜在空间
链接:https://arxiv.org/abs/2508.06065

作者:e, Nikhil Sharma, Donghoon Shin, DaEun Choi, Harsh Sharma, Jeonghwan Kim, Heng Ji
摘要:生成式人工智能使图像创作变得更加容易,但让输出与微妙的创作意图保持一致仍然具有挑战性,特别是对于非专家而言。现有工具通常需要用户通过提示或参考来外化想法,限制了流畅的探索。我们引入ThematicPlane,该系统使用户能够在交互式主题设计平面中导航和操作高级语义概念(例如情绪、风格或叙事基调)。这个界面在隐性的创作意图和系统控制之间架起了桥梁。在我们的探索性研究中(N=6),参与者进行了发散和收敛的创作模式,经常将意想不到的结果当作灵感或迭代线索加以接纳。虽然他们的探索基于熟悉的主题,但对主题如何映射到输出的不同期望表明需要更可解释的控件。总体而言,ThematicPlane促进了富有表现力的迭代工作流,并为生成式设计工具中直观、语义驱动的交互指出了新方向。
摘要:Generative AI has made image creation more accessible, yet aligning outputs with nuanced creative intent remains challenging, particularly for non-experts. Existing tools often require users to externalize ideas through prompts or references, limiting fluid exploration. We introduce ThematicPlane, a system that enables users to navigate and manipulate high-level semantic concepts (e.g., mood, style, or narrative tone) within an interactive thematic design plane. This interface bridges the gap between tacit creative intent and system control. In our exploratory study (N=6), participants engaged in divergent and convergent creative modes, often embracing unexpected results as inspiration or iteration cues. While they grounded their exploration in familiar themes, differing expectations of how themes mapped to outputs revealed a need for more explainable controls. Overall, ThematicPlane fosters expressive, iterative workflows and highlights new directions for intuitive, semantics-driven interaction in generative design tools.


【4】Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System
标题:Fact2Fiction:针对代理式事实核查系统的定向投毒攻击
链接:https://arxiv.org/abs/2508.06059

作者:, Yupeng Li, Bin Benjamin Zhu, Dacheng Wen, Reynold Cheng, Francis C. M. Lau
摘要:最先进的事实核查系统通过采用基于LLM的自主代理来大规模打击错误信息:将复杂声明分解为较小的子声明,逐一验证每个子声明,并汇总部分结果以产生带有理由(对判定的解释性依据)的判定。这些系统的安全性至关重要,因为一旦事实核查系统被攻破(这一风险往往未被充分研究),就可能放大错误信息。这项工作介绍了Fact2Fiction,第一个针对此类代理式事实核查系统的投毒攻击框架。Fact2Fiction模仿其分解策略,并利用系统生成的理由来制作量身定制的恶意证据,从而破坏子声明验证。大量实验表明,在各种投毒预算下,Fact2Fiction的攻击成功率比最先进的攻击高8.9%-21.2%。Fact2Fiction暴露了当前事实核查系统的安全弱点,并强调了防御对策的必要性。
摘要:State-of-the-art fact-checking systems combat misinformation at scale by employing autonomous LLM-based agents to decompose complex claims into smaller sub-claims, verify each sub-claim individually, and aggregate the partial results to produce verdicts with justifications (explanatory rationales for the verdicts). The security of these systems is crucial, as compromised fact-checkers, which tend to be easily underexplored, can amplify misinformation. This work introduces Fact2Fiction, the first poisoning attack framework targeting such agentic fact-checking systems. Fact2Fiction mirrors the decomposition strategy and exploits system-generated justifications to craft tailored malicious evidences that compromise sub-claim verification. Extensive experiments demonstrate that Fact2Fiction achieves 8.9\%--21.2\% higher attack success rates than state-of-the-art attacks across various poisoning budgets. Fact2Fiction exposes security weaknesses in current fact-checking systems and highlights the need for defensive countermeasures.


【5】EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
标题:EvolvR:故事评估的自我进化成对推理以增强生成
链接:https://arxiv.org/abs/2508.06046

作者:g, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Zhibo Yang, Xingsheng Zhang, Luxi Xing, Qiang Zhou, Chen Zhang
摘要:虽然大型语言模型作为评判者(LLM-as-a-judge)的有效性已经得到验证,但其在开放式任务中(特别是故事评估)的表现仍然有限。准确的故事评估不仅对于辅助人类质量判断至关重要,而且能为指导故事生成提供关键信号。然而,现有方法面临一个困境:面向闭源模型的提示工程适应性差,而面向开源模型的微调方法缺乏故事评估所必需的严谨推理能力。为了解决这个问题,我们提出了自进化成对推理(EvolvR)框架。该框架以成对比较为基础,首先通过多角色策略自我合成与分数对齐的思维链(CoT)数据。为了确保数据质量,这些原始CoT经过自我过滤过程,利用多代理来保证其逻辑严谨性和鲁棒性。最后,在精炼数据上训练的评估器被部署为奖励模型来指导故事生成任务。实验结果表明,我们的框架在StoryER、HANNA和OpenMEVA三个评估基准上实现了最先进(SOTA)的性能。此外,当用作奖励模型时,它显著提高了生成故事的质量,从而充分验证了我们自进化方法的优越性。
摘要:Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.


【6】Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring
标题:面向跨主题自动作文评分的对抗性主题感知提示调优
链接:https://arxiv.org/abs/2508.05987

作者:hang, Hongyan Zhao, Chaoran Cui, Qilong Song, Zhiqing Lu, Shuai Gong, Kailin Liu
摘要:跨主题自动作文评分(AES)旨在开发一种能够有效评估目标主题作文的可转移模型。这一领域的一个重大挑战来自各专题之间固有的差异。虽然现有的方法主要集中在通过源和目标主题的分布对齐来提取主题共享特征,但它们通常忽略主题特定的特征,从而限制了它们评估关键特征(诸如主题坚持性)的能力。为了解决这个限制,我们提出了一种对抗性主题感知提示调优(ATOP),这是一种联合学习主题共享和主题特定特征以改进跨主题AES的新方法。ATOP通过优化可学习的主题感知提示(包括共享和特定组件)来实现这一点,以从预先训练的语言模型(PLM)中获取相关知识。为了增强主题共享提示学习的鲁棒性并减轻主题对齐引入的特征尺度敏感性,我们将对抗训练纳入统一的回归和分类框架。此外,我们采用了一个基于邻居的分类器来模拟文章表征的局部结构,并为目标主题文章生成伪标签。然后,这些伪标签用于指导针对目标主题定制的主题特定提示的监督学习。在公开的ASAP++数据集上进行的大量实验表明,ATOP在整体和多特质作文评分方面都显着优于现有的最先进的方法。我们的方法的实现可在https://anonymous.4open.science/r/ATOP-A271公开获得。
摘要:Cross-topic automated essay scoring (AES) aims to develop a transferable model capable of effectively evaluating essays on a target topic. A significant challenge in this domain arises from the inherent discrepancies between topics. While existing methods predominantly focus on extracting topic-shared features through distribution alignment of source and target topics, they often neglect topic-specific features, limiting their ability to assess critical traits such as topic adherence. To address this limitation, we propose an Adversarial TOpic-aware Prompt-tuning (ATOP), a novel method that jointly learns topic-shared and topic-specific features to improve cross-topic AES. ATOP achieves this by optimizing a learnable topic-aware prompt--comprising both shared and specific components--to elicit relevant knowledge from pre-trained language models (PLMs). To enhance the robustness of topic-shared prompt learning and mitigate feature scale sensitivity introduced by topic alignment, we incorporate adversarial training within a unified regression and classification framework. In addition, we employ a neighbor-based classifier to model the local structure of essay representations and generate pseudo-labels for target-topic essays. These pseudo-labels are then used to guide the supervised learning of topic-specific prompts tailored to the target topic. Extensive experiments on the publicly available ASAP++ dataset demonstrate that ATOP significantly outperforms existing state-of-the-art methods in both holistic and multi-trait essay scoring. The implementation of our method is publicly available at: https://anonymous.4open.science/r/ATOP-A271.
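文中"基于邻居的分类器为目标主题作文生成伪标签"的思路,可以用余弦相似度 k 近邻做一个假设性草图。嵌入向量与分数均为虚构的玩具数据,并非论文实现:

```python
import numpy as np

def knn_pseudo_labels(target_emb, source_emb, source_scores, k=2):
    """对每篇目标主题作文,取源主题中余弦相似度最高的 k 篇作文,
    以其分数均值作为伪标签。"""
    s = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = t @ s.T                                  # 余弦相似度矩阵
    nearest = np.argsort(sim, axis=1)[:, -k:]      # 每行相似度最高的 k 个源样本
    return np.asarray(source_scores, dtype=float)[nearest].mean(axis=1)

source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
source_scores = [2.0, 4.0, 3.0]
target_emb = np.array([[1.0, 0.1]])
pseudo = knn_pseudo_labels(target_emb, source_emb, source_scores, k=2)
print(pseudo)  # 最近邻为第 0、2 篇,伪分数为 (2.0 + 3.0) / 2 = 2.5
```

这些伪标签随后即可作为监督信号,训练针对目标主题的特定提示。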


【7】Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation
标题:频谱投影评分:在检索增强生成中将检索摘要与读者模型对齐
链接:https://arxiv.org/abs/2508.05909

作者:Hu, Qinglin Zhu, Siya Qi, Yulan He, Hanqi Yan, Lin Gui
摘要:遵循检索器-读者范式的检索增强生成(RAG)用外部检索的知识补充模型输入,已被证明能提高大型语言模型(LLM)的生成性能。然而,以往的工作往往从整体上评估RAG,把检索器和读者放在一起评价,使得难以分离检索的真正贡献,特别是考虑到作为读者的LLM对提示的敏感性。我们引入频谱投影分数(SPS),这是一种轻量级、无需监督的度量:读者通过比较由摘要生成的令牌所张成的区域与读者子空间的主方向,来衡量检索到的摘要与其隐藏表示的语义对齐程度,并以此度量相关性。基于SPS,我们提出了xCompress,一个在推理时动态采样、排序和压缩检索摘要候选的控制器框架。在五个QA基准和四个开源LLM上的大量实验表明,SPS不仅提升了一系列任务的性能,还为检索与生成之间的相互作用提供了一个有原则的视角。
摘要:Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that allows the reader to gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by generated tokens from the summary, and the principal directions of subspace in the reader and to measure the relevance. Building on SPS we present xCompress, an inference time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
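SPS 的核心直觉是:把摘要的 token 表示投影到读者隐藏表示的主方向子空间上,投影能量越大说明语义越对齐。下面是按此直觉构造的假设性实现,论文中"区域"的具体定义可能不同:

```python
import numpy as np

def spectrum_projection_score(summary_tokens, reader_hidden, k=4):
    """将摘要 token 向量投影到读者隐藏态的前 k 个主方向(SVD 右奇异向量)上,
    返回平均投影能量占比(1.0 表示完全落在主子空间内)。"""
    centered = reader_hidden - reader_hidden.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                                        # 前 k 个主方向
    proj_energy = np.linalg.norm(summary_tokens @ basis.T, axis=1) ** 2
    total_energy = np.linalg.norm(summary_tokens, axis=1) ** 2 + 1e-12
    return float((proj_energy / total_energy).mean())

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))                    # 读者隐藏态集中的 4 维子空间
reader_hidden = rng.normal(size=(64, 4)) @ W
aligned_summary = rng.normal(size=(8, 4)) @ W   # 子空间内的"对齐"摘要表示
random_summary = rng.normal(size=(8, 16))       # 无关的"不对齐"摘要表示
s_aligned = spectrum_projection_score(aligned_summary, reader_hidden)
s_random = spectrum_projection_score(random_summary, reader_hidden)
print(s_aligned, s_random)
```

xCompress 即可在推理时按这类分数对多个候选摘要进行排序与压缩,此处仅示意打分部分。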


【8】Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
标题:守护者与加害者:有害内容生成与安全缓解综述
链接:https://arxiv.org/abs/2508.05775

作者:, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
摘要:大型语言模型(LLM)彻底改变了数字平台上的内容创作,在自然语言生成和理解方面提供了前所未有的能力。这些模型支持内容生成、问答(Q&A)、编程和代码推理等有益应用;与此同时,它们也可能因无意或故意产生有毒、攻击性或有偏见的内容而构成严重风险。LLM的这种双重角色,既是解决现实世界问题的强大工具,又是有害语言的潜在来源,提出了一个紧迫的社会技术挑战。在这篇综述中,我们系统回顾了涵盖无意毒性、对抗性越狱攻击和内容审核技术的最新研究。我们提出了一个统一的LLM相关危害与防御分类法,分析了新兴的多模态及LLM辅助越狱策略,并评估了缓解措施,包括基于人类反馈的强化学习(RLHF)、提示工程和安全对齐。我们的综述呈现了LLM安全性不断演变的格局,指出了当前评估方法的局限性,并概述了未来研究方向,以指导稳健且符合道德的语言技术的发展。
摘要:Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.


【9】Enhancing Retrieval-Augmented Generation for Electric Power Industry Customer Support
标题:增强电力行业客户支持的检索增强生成
链接:https://arxiv.org/abs/2508.05664

作者:an, Kuok Tou Ho, Chenglong Ma, Yujing Si, Hok Lai Lin, Sa Lei Lam
备注:6 pages
摘要:许多人工智能客服系统使用标准的NLP流水线或微调的语言模型,它们在处理模糊、多意图或需要精确细节的查询时往往表现不佳。本案例研究评估了近期的多项技术:查询重写、RAG融合、关键词增强、意图识别和上下文重排序,用于在电力领域构建一个强大的客户支持系统。我们比较了基于向量存储和基于图的RAG框架,最终选择基于图的RAG,因为它在处理复杂查询方面性能更优。我们发现,查询重写改进了使用非标准术语或需要精确细节的查询的检索。RAG融合通过合并多次检索,提升了模糊或多方面查询的性能。重排序通过过滤不相关的上下文来减少幻觉。意图识别支持将复杂问题分解为更有针对性的子查询,从而同时提高相关性和效率。相比之下,关键词增强由于带偏的关键词选择而对结果产生负面影响。我们的最终系统结合了意图识别、RAG融合和重排序来处理歧义消解和多源查询。在GPT-4生成的数据集和真实世界的电力供应商FAQ数据集上评估,它分别达到了97.9%和89.6%的准确率,大幅优于基线RAG模型。
摘要:Many AI customer service systems use standard NLP pipelines or finetuned language models, which often fall short on ambiguous, multi-intent, or detail-specific queries. This case study evaluates recent techniques: query rewriting, RAG Fusion, keyword augmentation, intent recognition, and context reranking, for building a robust customer support system in the electric power domain. We compare vector-store and graph-based RAG frameworks, ultimately selecting the graph-based RAG for its superior performance in handling complex queries. We find that query rewriting improves retrieval for queries using non-standard terminology or requiring precise detail. RAG Fusion boosts performance on vague or multifaceted queries by merging multiple retrievals. Reranking reduces hallucinations by filtering irrelevant contexts. Intent recognition supports the decomposition of complex questions into more targeted sub-queries, increasing both relevance and efficiency. In contrast, keyword augmentation negatively impacts results due to biased keyword selection. Our final system combines intent recognition, RAG Fusion, and reranking to handle disambiguation and multi-source queries. Evaluated on both a GPT-4-generated dataset and a real-world electricity provider FAQ dataset, it achieves 97.9% and 89.6% accuracy respectively, substantially outperforming baseline RAG models.
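摘要中的"RAG 融合(RAG Fusion)"通常通过倒数排名融合(RRF)合并多次检索的结果;下面是 RRF 的一个极简实现。k=60 是文献中的常用默认值,文档标题为虚构示例,并非该系统的真实数据:

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """倒数排名融合(RRF):对多个检索结果列表按 1/(k + rank) 累加得分后重排。
    该系统的实际实现可能不同。"""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

runs = [
    ["电价政策", "停电申报", "账单查询"],   # 改写查询 1 的检索结果(虚构)
    ["停电申报", "账单查询", "过户流程"],   # 改写查询 2 的检索结果(虚构)
]
fused = reciprocal_rank_fusion(runs)
print(fused)  # "停电申报" 在两个列表中都靠前,融合后排第一
```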


QA|VQA|问答|对话(1篇)

【1】Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering
标题:利用自适应拓扑表示进行零样本图问答
链接:https://arxiv.org/abs/2508.06345

作者:i, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang
摘要:大型多模态模型(LMM)在不同领域的问答(QA)任务中表现出通用的零样本能力,包括涉及复杂图拓扑的图QA。然而,大多数当前方法只使用单一类型的图表示,即拓扑表示形式(TRF),如统一提示的文本描述或样式固定的视觉样式。这些"一刀切"的方法没有考虑不同模型或任务的具体偏好,往往导致不正确或过长的回复。为了解决这个问题,我们首先分析了现有TRF的特点和弱点,然后设计了一组适合零样本图QA的TRF,记为$F_{ZS}$。接着,我们引入了一个新的度量:图响应效率(GRE),它衡量图QA中性能与简洁性之间的平衡。在此基础上,我们开发了DynamicTRF框架,旨在同时提高图QA的准确性和简洁性。具体来说,DynamicTRF首先创建一个根据GRE分数对TRF进行排名的TRF偏好(TRFP)数据集,以探测特定于问题的TRF偏好;然后在TRFP数据集上训练TRF路由器,在推理过程中为每个问题自适应地分配来自$F_{ZS}$的最佳TRF。在7个域内算法图QA任务和2个域外下游任务上的大量实验表明,DynamicTRF在准确性方面显著增强了LMM的零样本图QA能力。
摘要:Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those "one-size-fits-all" approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy
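论文未给出 GRE 的具体公式;下面用一个假设性的形式(正确性得分除以随回复长度增长的惩罚项)来说明"性能与简洁性之间的平衡"这一思想:

```python
def graph_response_efficiency(correct: bool, response_len: int,
                              alpha: float = 0.001) -> float:
    """GRE 的一个假设性形式:正确性得分随回复长度衰减,
    论文中的实际定义可能不同。"""
    return float(correct) / (1.0 + alpha * response_len)

short_right = graph_response_efficiency(True, 50)     # 简短且正确
long_right = graph_response_efficiency(True, 2000)    # 冗长但正确
wrong = graph_response_efficiency(False, 50)          # 错误回答
print(short_right, long_right, wrong)
```

在这种形式下,简短且正确的回复得分最高,TRF 路由器即可按这类分数为每个问题挑选 TRF。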


推理|分析|理解|解释(5篇)

【1】Effective Training Data Synthesis for Improving MLLM Chart Understanding
标题:有效的训练数据合成以提高MLLM图表理解
链接:https://arxiv.org/abs/2508.06492

作者:g, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
备注:Accepted by ICCV 2025 (poster). 26 pages, 17 figures
摘要:能够有效地阅读科学图表(即图表理解)是构建有效科学智能体的核心部分。然而,现有的多模态大型语言模型(MLLM),特别是开源模型,在具有挑战性的基准测试中仍然落后,典型成功率仅为30%-50%。以往用合成图表微调MLLM的研究往往受限于合成图表与真实图表相似度不足,这可能损害模型训练及其在复杂真实图表上的性能。在这项研究中,我们表明,模块化的图表生成和多样化的视觉细节可以提高图表理解能力。特别是,我们设计了一个五步数据合成管道:将单图生成中的数据创建与函数创建分离,对多子图图形让后续子图的生成以先前子图为条件,在视觉上使生成的图形多样化,过滤掉低质量数据,最后使用GPT-4o生成问答(QA)对。这种方法使我们能够简化微调数据集的生成,并引入有效图表数据集(ECD),它包含10k+图表图像和300k+ QA对,涵盖25个主题,具有250+种图表类型组合和较高的视觉复杂性。我们表明,ECD在一系列真实世界和合成测试集上一致地提高了各种MLLM的性能。代码、数据和模型可在https://github.com/yuweiyang-anu/ECD上获得。
摘要:Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.


【2】GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
标题:GLM-4.5:代理、推理和编码(ARC)基础模型
链接:https://arxiv.org/abs/2508.06471

作者:eam: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai
摘要:GLM-4.5是一个开源的混合专家(Mixture-of-Experts,MoE)大型语言模型,总参数355B,激活参数32B,其特点是采用同时支持思考模式和直接响应模式的混合推理方法。通过在23T令牌上进行多阶段训练,以及使用专家模型迭代和强化学习进行全面的后训练,GLM-4.5在代理、推理和编码(ARC)任务中实现了强大的性能,在TAU-Bench上得分70.1%,在AIME 24上得分91.0%,在SWE-bench Verified上得分64.2%。GLM-4.5的参数比几个竞争对手少得多,在所有被评估模型中总体排名第三,在代理基准测试中排名第二。我们发布了GLM-4.5(355B参数)和紧凑版GLM-4.5-Air(106B参数),以推进推理和代理AI系统的研究。代码、模型和更多信息可在https://github.com/zai-org/GLM-4.5上获得。
摘要:We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.


【3】InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
标题:InfoCausalQA:模型能否基于信息图进行非显式因果推理?
链接:https://arxiv.org/abs/2508.06220

作者:a, Junhyeong Park, Jahyun Jeon, Youngjae Yu
备注:14 pages, 9 figures
摘要:视觉语言模型(VLM)的最新进展展示了令人印象深刻的感知和推理能力。然而,执行因果推断的能力(这是人类认知的一个核心方面)仍然没有得到充分探索,特别是在多模态环境中。在这项研究中,我们介绍了InfoCausalQA,一个新的基准,用于评估基于信息图(结合结构化视觉数据与文本上下文)的因果推理。该基准包括两个任务:任务1侧重于基于推断数值趋势的定量因果推理,而任务2针对涉及五种因果关系类型的语义因果推理:原因、结果、干预、反事实和时间。我们从四个公共来源手动收集了494个信息图-文本对,并使用GPT-4o生成了1,482个高质量的多项选择QA对。这些问题随后经人工仔细修订,以确保它们不能仅凭表面线索回答,而是需要真正的视觉接地。我们的实验结果表明,当前的VLM在计算推理方面能力有限,在语义因果推理方面的局限性更为明显。它们与人类相比明显更低的表现,表明在利用基于信息图的信息进行因果推断方面存在巨大差距。通过InfoCausalQA,我们强调了提升多模态AI系统因果推理能力的必要性。
摘要:Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference -- a core aspect of human cognition -- remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.


【4】UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
标题:UR$^2$:通过强化学习统一RAG和推理
链接:https://arxiv.org/abs/2508.06165

作者:, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
摘要:大型语言模型(LLM)通过两种互补的范式表现出了卓越的能力:检索增强生成(RAG)增强知识接地,而基于可验证奖励的强化学习(RLVR)优化复杂推理能力。然而,这两种能力往往是孤立发展的,现有的统一尝试范围仍然狭窄,通常限于采用固定检索设置和特定任务假设的开放域QA。这种集成的缺乏限制了泛化,并限制了RAG-RL方法在更广泛领域的适用性。为了弥合这一差距,我们提出了UR$^2$(统一RAG和推理),一个通过强化学习统一检索和推理的通用框架。UR$^2$引入了两个关键贡献:一是难度感知的课程训练,只对有挑战性的问题选择性地调用检索;二是混合知识访问策略,将特定领域的离线语料库与LLM生成的摘要相结合。这些组件旨在实现检索和推理之间的动态协调,提高跨各种任务的适应性。在开放域QA、MMLU-Pro、医学和数学推理任务上的实验表明,UR$^2$(构建在Qwen2.5-3/7B和LLaMA-3.1-8B上)显著优于现有的RAG和RL方法,在多个基准上实现了与GPT-4o-mini和GPT-4.1-mini相当的性能。我们已在https://github.com/Tsinghua-dhy/UR2上发布了所有代码、模型和数据。
摘要:Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope-typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
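"难度感知的课程训练:只对有挑战性的问题调用检索"这一思想,可以用如下草图说明。难度分数的来源、阈值与检索器均为假设,并非论文的实际实现:

```python
def retrieve(question: str) -> str:
    """占位检索器:真实系统中会查询领域离线语料或 LLM 生成的摘要。"""
    return f"docs_for({question})"

def answer_with_ur2(question: str, difficulty: float,
                    threshold: float = 0.5) -> str:
    """难度超过阈值才触发检索,简单问题直接作答。"""
    if difficulty > threshold:
        return f"[RAG] {question} | ctx={retrieve(question)}"
    return f"[direct] {question}"

print(answer_with_ur2("1+1等于几?", difficulty=0.1))
print(answer_with_ur2("某罕见病的二线治疗方案是什么?", difficulty=0.9))
```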


【5】Do Ethical AI Principles Matter to Users? A Large-Scale Analysis of User Sentiment and Satisfaction
标题:人工智能道德原则对用户重要吗?用户情绪和满意度的大规模分析
链接:https://arxiv.org/abs/2508.05913

作者:sch, Min Chul Cha
摘要:随着人工智能系统越来越多地嵌入组织工作流程和消费者应用程序,公平、透明和稳健性等道德原则已在政策和行业指南中得到广泛认可。然而,这些原则在用户看来是否被认可、被重视或产生影响,仍然缺乏经验证据。这项研究通过分析G2上超过10万条AI产品用户评论,调查了道德AI与用户满意度之间的联系。我们使用基于transformer的语言模型,在欧盟《可信人工智能道德准则》定义的七个道德维度上测量情绪。研究结果表明,所有七个维度都与用户满意度正相关,但这种关系在不同用户和产品类型之间存在系统性差异。技术用户和AI开发平台的评论者更频繁地讨论系统层面的问题(例如透明度、数据治理),而非技术用户和终端用户应用的评论者则强调以人为中心的维度(例如人类能动性、社会福祉)。此外,在所有维度上,道德AI与用户满意度之间的关联对于非技术用户和终端用户应用都明显更强。我们的研究结果从用户角度强调了道德AI设计的重要性,并强调需要考虑用户角色和产品类型之间的情境差异。
摘要:As AI systems become increasingly embedded in organizational workflows and consumer applications, ethical principles such as fairness, transparency, and robustness have been widely endorsed in policy and industry guidelines. However, there is still scarce empirical evidence on whether these principles are recognized, valued, or impactful from the perspective of users. This study investigates the link between ethical AI and user satisfaction by analyzing over 100,000 user reviews of AI products from G2. Using transformer-based language models, we measure sentiment across seven ethical dimensions defined by the EU Ethics Guidelines for Trustworthy AI. Our findings show that all seven dimensions are positively associated with user satisfaction. Yet, this relationship varies systematically across user and product types. Technical users and reviewers of AI development platforms more frequently discuss system-level concerns (e.g., transparency, data governance), while non-technical users and reviewers of end-user applications emphasize human-centric dimensions (e.g., human agency, societal well-being). Moreover, the association between ethical AI and user satisfaction is significantly stronger for non-technical users and end-user applications across all dimensions. Our results highlight the importance of ethical AI design from users' perspectives and underscore the need to account for contextual differences across user roles and product types.


检测相关(3篇)

【1】Cyberbullying Detection via Aggression-Enhanced Prompting
标题:通过攻击性增强提示进行网络欺凌检测
链接:https://arxiv.org/abs/2508.06360

作者:id, Anu Sabu, Girish A. Koushik, Ferrante Neri, Diptesh Kanojia
备注:Accepted to RANLP 2025
摘要:由于其微妙而多样的表现形式,检测社交媒体上的网络欺凌仍然是一个关键挑战。本研究调查了在统一的训练框架内将攻击性检测作为辅助任务,是否可以增强大型语言模型(LLM)在网络欺凌检测中的泛化能力和性能。实验使用指令微调的LLM,在五个攻击性数据集和一个网络欺凌数据集上进行。我们评估了多种策略:零样本、少样本、独立LoRA微调和多任务学习(MTL)。鉴于MTL的结果不一致,我们提出了一种丰富提示管道方法,将攻击性预测嵌入网络欺凌检测提示中以提供上下文增强。初步结果表明,丰富提示管道始终优于标准LoRA微调,表明攻击性信息上下文显著提升了网络欺凌检测效果。这项研究凸显了攻击性检测等辅助任务在提升LLM面向社交网络安全关键应用的泛化能力方面的潜力。
摘要:Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
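下面用一个极简示意说明如何将攻击性预测嵌入网络欺凌检测提示(模板与函数名均为假设,并非论文使用的确切提示词):

```python
def enrich_prompt(message: str, aggression_label: str) -> str:
    """Build a cyberbullying-detection prompt enriched with an auxiliary
    aggression prediction (hypothetical template, not the paper's exact one)."""
    return (
        "Auxiliary signal: an aggression classifier labeled this message as "
        f"'{aggression_label}'.\n"
        f"Message: {message}\n"
        "Question: Is this message cyberbullying? Answer yes or no."
    )

prompt = enrich_prompt("You are worthless, quit the team.", "overtly aggressive")
```

该提示随后交由指令微调的LLM判断,攻击性标签充当上下文增强信号。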


【2】Classification is a RAG problem: A case study on hate speech detection
标题:分类是一个RAG问题:仇恨言论检测的案例研究
链接:https://arxiv.org/abs/2508.06204

作者:illats, Josh Pennington, Aravind Mohan, Bertie Vidgen
摘要:强大的内容审核要求分类系统能够快速适应不断变化的政策,而无需昂贵的重新训练。我们提出使用检索增强生成(RAG)进行分类,将传统分类任务从"根据预训练参数确定正确类别"转变为"结合推理时检索到的上下文知识来评估内容"。在仇恨言论检测中,这将任务从"这是仇恨言论吗?"转变为"这是否违反了仇恨言论政策?"我们的上下文策略引擎(CPE,一个代理式RAG系统)演示了这种方法,并提供三个关键优势:(1)与领先商业系统相当的稳健分类准确性,(2)通过检索到的政策条款获得内在可解释性,(3)无需模型再训练的动态政策更新。通过三个实验,我们展示了强大的基线性能,并表明该系统可以通过正确调整对特定身份群体的保护来施加细粒度的政策控制,而无需再训练或牺牲整体性能。这些发现表明,RAG可以将分类转变为一个在内容审核及更广泛分类问题上更灵活、透明和适应性更强的过程。
摘要:Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from "is this hate speech?" to "does this violate the hate speech policy?"   Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.
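"分类即RAG"的思路可以用一个玩具检索器示意:先按词面重叠检索最相关的政策条款,再构造"是否违反该政策"的判断提示(政策文本与函数名均为假设,真实系统会使用向量检索和LLM判断):

```python
POLICY_SEGMENTS = [
    "Policy 1.2: Content containing slurs targeting a protected group is prohibited.",
    "Policy 2.1: Direct threats of violence against a person are prohibited.",
    "Policy 3.4: Satire that does not target a protected group is permitted.",
]

def retrieve_policy(text: str) -> str:
    """Toy retrieval: pick the policy segment with the largest word overlap."""
    words = set(text.lower().split())
    return max(POLICY_SEGMENTS,
               key=lambda seg: len(words & set(seg.lower().split())))

def classification_prompt(text: str) -> str:
    """Frame classification as a policy-violation judgment at inference time."""
    policy = retrieve_policy(text)
    return (f"Retrieved policy: {policy}\n"
            f"Content: {text}\n"
            "Does this content violate the retrieved policy? Answer yes or no.")
```

更新政策时只需编辑 `POLICY_SEGMENTS`,无需重新训练任何模型,这正是该范式的动态更新优势。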


【3】Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale
标题:玩家游戏聊天中的亲社会行为检测:从对齐人类与AI的定义到大规模高效注释
链接:https://arxiv.org/abs/2508.05938

作者:ielnik, Min Kim, Penphob (Andrea)Boonyarungsrit, Fereshteh Soltani, Deshawn Sambrano, Animashree Anandkumar, R. Michael Alvarez
备注:9 pages, 4 figures, 4 tables
摘要:检测文本中的亲社会性(旨在肯定、支持或改善他人行为的交流)对信任与安全系统来说是一个新颖且日益重要的挑战。与有毒内容检测不同,亲社会性缺乏完善的定义和标注数据,需要新的注释和部署方法。我们提出了一个实用的三阶段管道,可实现可扩展、高精度的亲社会内容分类,同时最大限度地减少人工标注工作和推理成本。首先,我们使用一小批人工标注的种子示例来确定最佳的基于LLM的标注策略。然后,我们引入一个人类-AI细化循环,由注释者审查GPT-4与人类高度分歧的案例,以迭代地澄清和扩展任务定义,这是亲社会性等新兴注释任务的关键一步。该过程提高了标签质量和定义一致性。最后,我们使用GPT-4合成了1万个高质量标签,并训练了一个两阶段推理系统:一个轻量级分类器处理高置信度预测,仅约35%的模糊实例升级到GPT-4o。该架构将推理成本降低约70%,同时实现高精度(约0.90)。我们的管道展示了有针对性的人机交互、仔细的任务制定和面向部署的架构设计如何为新型负责任AI任务解锁可扩展的解决方案。
摘要:Detecting prosociality in text--communication intended to affirm, support, or improve others' behavior--is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition-a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only $\sim$35\% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by $\sim$70% while achieving high precision ($\sim$0.90). Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
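文中"轻量分类器处理高置信度预测、仅将模糊实例升级到大模型"的两阶段推理,可以用如下置信度阈值级联来示意(模型函数均为占位桩,阈值为假设值):

```python
def cascade(texts, cheap_model, expensive_model, threshold=0.8):
    """Keep the cheap model's label when its confidence clears the threshold;
    otherwise escalate to the expensive model. Returns labels and the
    escalation rate."""
    labels, escalated = [], 0
    for t in texts:
        label, conf = cheap_model(t)
        if conf >= threshold:
            labels.append(label)
        else:
            escalated += 1
            labels.append(expensive_model(t))
    return labels, escalated / len(texts)

# Stand-ins for the lightweight classifier and the escalation LLM.
def cheap(t):
    return ("prosocial", 0.95) if "thank" in t else ("unknown", 0.4)

def expensive(t):
    return "prosocial" if "help" in t else "neutral"

labels, esc_rate = cascade(["thank you!", "let me help", "gg"], cheap, expensive)
```

推理成本的节省来自升级率:只有置信度不足的实例才触发昂贵模型调用。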


Zero/Few/One-Shot|迁移|自适应(2篇)

【1】Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation
标题:超越统一标准:场景适应性多维越狱评估
链接:https://arxiv.org/abs/2508.06194

作者:, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan
摘要:精确的越狱评估对于LLM红队和越狱研究至关重要。当前方法采用二元分类(例如字符串匹配、有毒文本分类器、LLM驱动的方法),只产生"是/否"标签,而不量化危害强度。现有的多维框架(例如安全违规、相对真实性、信息量)在各种场景下应用统一的评估标准,导致特定场景的错配,例如"相对真实性"与"仇恨言论"无关,从而影响评估精度。为了解决这些限制,我们引入了SceneJailEval,其主要贡献是:(1)一个开创性的场景自适应多维越狱评估框架,克服了现有多维方法"一刀切"的关键约束,并具有强大的可扩展性,可灵活适应定制或新兴场景。(2)一个涵盖14种场景的综合数据集,包含多样的越狱变体和区域性案例,填补了长期以来场景自适应评估缺乏高质量整体基准的空白。(3)SceneJailEval取得了最先进的结果,在我们的全场景数据集上F1得分为0.917(比之前的SOTA高6%),在JBB上为0.995(比之前的SOTA高3%),超越了现有评估方法在异构场景中的精度上限,证实了其优势。
摘要:Precise jailbreak evaluation is vital for LLM red teaming and jailbreak research. Current approaches employ binary classification ( e.g., string matching, toxic text classifiers, LLM-driven methods), yielding only "yes/no" labels without quantifying harm intensity. Existing multi-dimensional frameworks ( e.g., Security Violation, Relative Truthfulness, Informativeness) apply uniform evaluation criteria across scenarios, resulting in scenario-specific mismatches--for instance, "Relative Truthfulness" is irrelevant to "hate speech"--which compromise evaluation precision. To tackle these limitations, we introduce SceneJailEval, with key contributions: (1) A groundbreaking scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical "one-size-fits-all" constraint of existing multi-dimensional methods, and featuring strong extensibility to flexibly adapt to customized or emerging scenarios. (2) A comprehensive 14-scenario dataset with diverse jailbreak variants and regional cases, filling the long-standing gap in high-quality, holistic benchmarks for scenario-adaptive evaluation. (3) SceneJailEval achieves state-of-the-art results, with an F1 score of 0.917 on our full-scenario dataset (+6% over prior SOTA) and 0.995 on JBB (+3% over prior SOTA), surpassing accuracy limits of existing evaluation methods in heterogeneous scenarios and confirming its advantage.
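"场景自适应"的核心可以用一个场景到评估维度的映射来示意:只对与该场景相关的维度打分,从而避免诸如"相对真实性"干扰"仇恨言论"评估的错配(映射与打分方式均为假设,并非论文原表):

```python
SCENARIO_DIMENSIONS = {
    # Hypothetical scenario -> relevant-dimension mapping.
    "hate_speech": ["security_violation", "informativeness"],
    "misinformation": ["security_violation", "relative_truthfulness",
                       "informativeness"],
}

def scenario_score(scenario: str, dim_scores: dict) -> float:
    """Average only the dimensions relevant to the given scenario."""
    dims = SCENARIO_DIMENSIONS[scenario]
    return sum(dim_scores[d] for d in dims) / len(dims)

scores = {"security_violation": 1.0,
          "relative_truthfulness": 0.0,
          "informativeness": 0.5}
```

对同一组维度分数,两个场景会得到不同的综合分,因为各自只聚合相关维度。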


【2】InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
标题:InfiGUI-G1:通过自适应探索策略优化推进GUI基础
链接:https://arxiv.org/abs/2508.05731

作者:u, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
备注:11 pages, 3 figures
摘要:多模态大型语言模型(MLLM)的出现推动了仅用视觉输入在图形用户界面(GUI)上操作的自主代理的发展。一个根本挑战是对自然语言指令进行鲁棒的接地。这需要精确的空间对齐,即准确定位每个元素的坐标;更关键的是,还需要正确的语义对齐,即将指令匹配到功能上合适的UI元素。虽然带可验证奖励的强化学习(RLVR)已被证明能有效改善这些MLLM的空间对齐,但我们发现,低效的探索成为语义对齐的瓶颈,使模型无法学习困难的语义关联。为了解决这一探索问题,我们提出了自适应探索策略优化(AEPO),一个新的策略优化框架。AEPO采用多答案生成策略来强制更广泛的探索,并由具有理论依据的自适应探索奖励(AER)函数指导,该函数源自效率的第一性原理 eta=U/C。我们训练的模型InfiGUI-G1-3B和InfiGUI-G1-7B在多个具有挑战性的GUI接地基准上取得了新的最先进结果,在旨在测试泛化和语义理解的基准上,相对朴素RLVR基线实现了高达9.0%的显著相对提升。资源可在https://github.com/InfiXAI/InfiGUI-G1获取。
摘要:The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevent models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency eta=U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
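AER 源自效率第一性原理 eta=U/C。下面是一个按该原理构造奖励的玩具示意:效用 U 取"前 k 个采样答案中是否命中正确答案",成本 C 随采样数线性增长(这只是对 eta=U/C 的直观演示,并非论文的实际奖励函数):

```python
def adaptive_exploration_reward(answers, is_correct, k):
    """Toy efficiency-style reward eta = U / C: utility is 1 if any of the
    first k sampled answers is correct; cost grows linearly with k."""
    utility = 1.0 if any(is_correct(a) for a in answers[:k]) else 0.0
    cost = float(k)
    return utility / cost

hit_first = adaptive_exploration_reward(["x", "y"], lambda a: a == "x", 1)
hit_wide = adaptive_exploration_reward(["x", "y"], lambda a: a == "x", 2)
```

可以看到,更宽的采样虽然提高命中机会,但在命中相同的情况下奖励更低,从而在探索广度与效率之间形成权衡。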


语料库(1篇)

【1】PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare
标题:PEACH:句子对齐的英语-阿拉伯语平行医疗保健语料库
链接:https://arxiv.org/abs/2508.05722

作者:Sabbagh
摘要:本文介绍了PEACH,一个句子对齐的英语-阿拉伯语平行医疗保健文本语料库,涵盖患者信息手册和教育材料。该语料库包含51,671个平行句子,共计约590,517个英语词元和567,707个阿拉伯语词元。两侧平均句长在9.52到11.83个单词之间。作为人工对齐的语料库,PEACH是一个黄金标准语料库,可为对比语言学、翻译研究和自然语言处理领域的研究人员提供帮助。它可用于构建双语词典、为特定领域机器翻译调整大型语言模型、评估用户对医疗保健领域机器翻译的看法、评估患者信息手册和教育材料的可读性与通俗友好性,并可作为翻译研究的教学资源。PEACH是公开可用的。
摘要:This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Sentence lengths vary between 9.52 and 11.83 words on average. As a manually aligned corpus, PEACH is a gold-standard corpus, aiding researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess patient information leaflets and educational materials' readability and lay-friendliness, and as an educational resource in translation studies. PEACH is publicly accessible.
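语料中报告的平均句长与词元总数等统计量,可按如下方式对句子对齐的平行语料计算(以空白切分计词,仅为示意;示例中用英文占位句代替阿拉伯语一侧,实际阿拉伯语分词可能更复杂):

```python
def corpus_stats(pairs):
    """Average sentence length (whitespace tokens) for each side of a
    sentence-aligned parallel corpus, plus total token counts."""
    src_lens = [len(src.split()) for src, _ in pairs]
    tgt_lens = [len(tgt.split()) for _, tgt in pairs]
    return (sum(src_lens) / len(pairs), sum(tgt_lens) / len(pairs),
            sum(src_lens), sum(tgt_lens))

pairs = [("take one tablet daily", "tablet daily"),
         ("store in a cool place", "cool dry place")]
avg_src, avg_tgt, tot_src, tot_tgt = corpus_stats(pairs)
```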


其他神经网络|深度学习|模型|建模(1篇)

【1】One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
标题:一刀切并不适用:面向更精确模型合并的分布感知稀疏化
链接:https://arxiv.org/abs/2508.06163

作者:Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu
备注:Under review
摘要:模型合并已成为多任务学习中一种引人注目的免数据范式,能够将多个微调模型融合为单个强大的实体。合并方法中的一项关键技术是稀疏化,即从任务向量中删除冗余参数以减轻干扰。然而,主流方法采用"一刀切"策略,使用统一的稀疏率,忽略了模型参数固有的结构和统计异质性。这通常导致次优的权衡:关键参数被无意中剪掉,而用处不大的参数却被保留。为了解决这一限制,我们引入了TADrop(Tensor-wise Adaptive Drop),一种尊重这种异质性的自适应稀疏化策略。TADrop不使用全局比率,而是根据每个参数张量的分布特性为其分配定制的稀疏级别。其核心直觉是:分布更稠密、更冗余的张量可以被激进地剪枝,而更稀疏、更关键的张量则被保留。作为一个简单的即插即用模块,我们通过将其与基础、经典和SOTA合并方法集成来验证TADrop。在不同任务(视觉、语言和多模态)和模型(ViT、BEiT)上的广泛实验表明,TADrop持续显著提升了它们的性能。例如,在增强一种领先的合并方法时,它在8个ViT-B/32任务上实现了2.0%的平均性能增益。TADrop通过根据模型结构定制稀疏化,提供了一种更有效的减轻参数干扰的方法,为高性能模型合并提供了新的基线。
摘要:Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a ``one-size-fits-all'' strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce \textbf{TADrop} (\textbf{T}ensor-wise \textbf{A}daptive \textbf{Drop}), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0\% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging.
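"按张量自适应稀疏率"的思想可以用下面的 NumPy 草图示意:用一个分布统计量(此处以幅值分布的均匀程度作为替代指标,并非论文采用的确切准则)为每个张量分配稀疏率,分布越均匀、越冗余则剪得越狠:

```python
import numpy as np

def tensor_sparsity_level(t, lo=0.2, hi=0.9):
    """Map a distributional statistic of the tensor to a sparsity ratio:
    evenly spread (redundant) magnitudes -> aggressive pruning."""
    mags = np.abs(t).ravel()
    evenness = 1.0 - mags.std() / (mags.mean() + 1e-8)
    evenness = float(np.clip(evenness, 0.0, 1.0))
    return lo + (hi - lo) * evenness

def prune(t, ratio):
    """Zero out the smallest-magnitude entries at the given ratio."""
    flat = np.abs(t).ravel()
    k = int(len(flat) * ratio)
    if k == 0:
        return t.copy()
    thresh = np.partition(flat, k - 1)[k - 1]
    out = t.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

uniform = np.full(100, 0.5)   # dense, redundant distribution
spiky = np.zeros(100)         # sparse, critical distribution
spiky[0] = 10.0
```

均匀分布的张量会被分配接近上限的稀疏率,而能量集中在少数分量的张量则被温和对待,这正是"逐张量自适应"区别于全局比率之处。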


其他(10篇)

【1】Post-training for Efficient Communication via Convention Formation
标题:通过约定形成实现高效沟通的后训练
链接:https://arxiv.org/abs/2508.06482

作者:, Evan Wang, Yoav Artzi
备注:Accepted to COLM 2025
摘要:人类在多轮互动中通过调整语言并形成临时约定来提高沟通效率。相比之下,先前的工作表明LLM并不自然地表现出这种行为。我们开发了一个后训练过程,通过在启发式识别的约定形成示范上进行针对性微调来培养这种能力。我们用两个聚焦该能力的新基准进行评估。首先,我们设计了一个聚焦的、有认知动机的互动基准,它能在人类身上一致地引发强烈的约定形成趋势。其次,我们创建了一个新的基于文档的指代补全任务,它反映了真实环境中的约定形成行为。我们的研究表明,在两种评估方法下,经过后训练的LLM的约定形成能力均显著提高。
摘要:Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.


【2】ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
标题:诈骗代理:人工智能代理如何模拟人类级别的诈骗电话
链接:https://arxiv.org/abs/2508.06457

作者:dhe
备注:Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 10 pages, 3 figures
摘要:大型语言模型(LLM)已经表现出令人印象深刻的流畅性和推理能力,但其被滥用的可能性引起了越来越多的关注。在本文中,我们提出了ScamAgent,一个建立在LLM之上的自主多轮代理,能够生成模拟真实世界欺诈场景的高度逼真的诈骗电话脚本。与以前专注于单次提示滥用的工作不同,ScamAgent保持对话记忆,动态适应模拟用户的反应,并在多个对话回合中采用欺骗性的说服策略。我们表明,目前的LLM安全护栏,包括拒绝机制和内容过滤器,对这种基于代理的威胁是无效的。当提示被分解、伪装或在代理框架内增量交付时,即使具有强大提示级保护的模型也可以被绕过。我们进一步展示了使用现代文本到语音系统将诈骗脚本转化为逼真的语音通话,构成一条完全自动化的诈骗管道。我们的研究结果强调了对多轮安全审计、代理级控制框架,以及检测和破坏由生成式AI驱动的会话欺骗的新方法的迫切需求。
摘要:Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.


【3】Memp: Exploring Agent Procedural Memory
标题:Memp:探索代理程序记忆
链接:https://arxiv.org/abs/2508.06433

作者:ng, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
备注:Work in progress
摘要:基于大型语言模型(LLM)的代理擅长各种任务,但其程序性记忆较为脆弱:这些记忆要么是手工设计的,要么与静态参数纠缠在一起。在这项工作中,我们研究了赋予代理可学习、可更新、终身的程序性记忆的策略。我们提出了Memp,它将过去的代理轨迹蒸馏为细粒度的分步指令和更高层次的脚本式抽象,并探讨了程序性记忆的构建、检索和更新等不同策略的影响。结合一个持续更新、修正和弃用其内容的动态方案,该记忆库随新经验同步演进。在TravelPlanner和ALFWorld上的实证评估表明,随着记忆库的完善,代理在类似任务上获得了稳步提升的成功率和更高的效率。此外,由更强模型构建的程序性记忆保留了其价值:将其迁移到更弱的模型会产生实质性的性能增益。
摘要:Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
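"构建-检索-更新"的程序性记忆循环可以粗略示意如下(类与方法名均为示意,并非 Memp 的实际接口;"失败即弃用条目"是对弃用机制的一种简化假设):

```python
class ProceduralMemory:
    """Minimal build/retrieve/update sketch of a procedural memory store."""

    def __init__(self):
        self.entries = {}  # task description -> list of step strings

    def build(self, task, steps):
        """Distill a past trajectory into step-by-step instructions."""
        self.entries[task] = steps

    def retrieve(self, task):
        """Exact match first, else the most word-overlapping stored task."""
        if task in self.entries:
            return self.entries[task]
        if not self.entries:
            return None
        words = set(task.split())
        best = max(self.entries,
                   key=lambda k: len(words & set(k.split())))
        return self.entries[best]

    def update(self, task, success):
        """Deprecate procedures that failed on re-use."""
        if not success:
            self.entries.pop(task, None)

mem = ProceduralMemory()
mem.build("book flight", ["search", "select", "pay"])
```

检索对相近任务复用已有步骤,这正是摘要所述"在类似任务上成功率随记忆库完善而提升"的机制雏形。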


【4】Quantifying Conversation Drift in MCP via Latent Polytope
标题:基于隐多面体的MCP会话漂移量化
链接:https://arxiv.org/abs/2508.06418

作者:i, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang
摘要:模型上下文协议(MCP)通过集成外部工具来增强大型语言模型(LLM),实现实时数据的动态聚合以改善任务执行。然而,其非隔离的执行上下文引入了关键的安全和隐私风险。特别是,恶意构造的内容可能导致工具投毒或间接提示注入,进而造成会话劫持、错误信息传播或数据泄露。现有的防御措施,如基于规则的过滤器或LLM驱动的检测,仍然不够,因为它们依赖静态签名、计算效率低下,且无法量化会话劫持。为了解决这些限制,我们提出了SecMCP,一个安全框架,用于检测并量化会话漂移,即由对抗性外部知识引起的潜在空间轨迹偏差。通过在潜在多面体空间内对LLM激活向量进行建模,SecMCP识别会话动态中的异常变化,从而能够主动检测劫持、误导和数据泄露。我们在三个最先进的LLM(Llama3、Vicuna、Mistral)上跨基准数据集(MS MARCO、HotpotQA、FinQA)评估SecMCP,展示了AUROC超过0.915的稳健检测,同时保持系统可用性。我们的贡献包括对MCP安全威胁的系统分类、一种新颖的基于潜在多面体的会话漂移量化方法,以及对SecMCP有效性的实证验证。
摘要:The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.
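"潜在空间轨迹偏差"的量化可以用一个高度简化的替代量示意:当前激活向量到基线激活质心的距离,并按基线离散度归一化(这只是玩具版的质心距离,并非 SecMCP 的潜在多面体构造):

```python
import numpy as np

def drift_score(baseline_acts, current_act):
    """Toy conversation-drift score: distance of the current activation
    from the centroid of baseline activations, scaled by their spread."""
    base = np.asarray(baseline_acts, dtype=float)
    centroid = base.mean(axis=0)
    spread = base.std() + 1e-8
    return float(np.linalg.norm(np.asarray(current_act) - centroid) / spread)

baseline = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

分布内的激活得分接近零,而被对抗性外部知识带偏的激活会得到明显更高的漂移分,可据此设阈值报警。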


【5】Position: Intelligent Coding Systems Should Write Programs with Justifications
标题:立场:智能编码系统应该编写有理由的程序
链接:https://arxiv.org/abs/2508.06017

作者:Xu, Shiwei Feng, Zian Su, Chengpeng Wang, Xiangyu Zhang
备注:The first two authors contributed equally to this work
摘要:智能编码系统通过使用户能够用自然语言指定代码行为来改变软件开发。然而,AI驱动的编码器的不透明决策引起了信任和可用性问题,特别是对于无法检查底层实现的非专家用户。我们认为,这些系统不仅应该生成代码,还应产生明确、一致的理由,以衔接模型推理与用户理解。为此,我们确定了两个关键的理由属性:认知对齐与语义忠实性,并强调现有方法的局限性,包括形式化验证、静态分析和事后可解释性。我们提倡探索用于理由生成的神经符号方法:在训练期间由符号约束指导模型行为,并通过神经表示丰富程序语义,从而在推理时实现自动一致性检查。
摘要:Intelligent coding systems are transforming software development by enabling users to specify code behavior in natural language. However, the opaque decision-making of AI-driven coders raises trust and usability concerns, particularly for non-expert users who cannot inspect low-level implementations. We argue that these systems should not only generate code but also produce clear, consistent justifications that bridge model reasoning and user understanding. To this end, we identify two critical justification properties-cognitive alignment and semantic faithfulness-and highlight the limitations of existing methods, including formal verification, static analysis, and post-hoc explainability. We advocate exploring neuro-symbolic approaches for justification generation, where symbolic constraints guide model behavior during training and program semantics are enriched through neural representations, enabling automated consistency checks at inference time.


【6】Discovering Properties of Inflectional Morphology in Neural Emergent Communication
标题:发现神经涌现通信中屈折形态的性质
链接:https://arxiv.org/abs/2508.05843

作者:berti, Shane Storks, Huteng Dai
摘要:基于深度神经网络智能体的涌现通信(EmCom)有望深入揭示人类语言的本质,但目前仍主要集中在少数子领域特定的目标和指标上,这些目标和指标偏向于用唯一字符一对一表示属性并按句法组合的通信方案。因此,我们重新诠释了一个常见的EmCom设定,即属性-值重建游戏:施加小词汇量约束以模拟双重分节,并构造了一个类似于自然语言屈折形态的新设定(使其能与自然语言的通信方案进行有意义的比较)。我们开发了新的指标,并探索了受屈折形态真实属性(连接性和融合性)启发的游戏变体。通过实验,我们发现模拟的音系约束会促进连接性形态,并且涌现语言复制了自然语言融合语法属性的倾向。
摘要:Emergent communication (EmCom) with deep neural network-based agents promises to yield insights into the nature of human language, but remains focused primarily on a few subfield-specific goals and metrics that prioritize communication schemes which represent attributes with unique characters one-to-one and compose them syntactically. We thus reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation, and formulating a novel setting analogous to naturalistic inflectional morphology (enabling meaningful comparison to natural language communication schemes). We develop new metrics and explore variations of this game motivated by real properties of inflectional morphology: concatenativity and fusionality. Through our experiments, we discover that simulated phonological constraints encourage concatenative morphology, and emergent languages replicate the tendency of natural languages to fuse grammatical attributes.


【7】Basic interactive algorithms: Preview
标题:基本交互算法:预览
链接:https://arxiv.org/abs/2508.05798

作者:vich
摘要:这篇对话体论文为即将发表的关于基本交互算法公理化的工作提供了预览和预告。   算法的现代概念是在20世纪30至50年代阐明的。它在四分之一个世纪前被公理化为"顺序算法"或"经典算法"的概念;我们现在更倾向于称之为"基本算法"。该公理化被用来证明:对于每一个基本算法,都存在一个行为等价的抽象状态机。它也被用来证明逻辑学家所理解的丘奇-图灵论题。   从20世纪60年代开始,算法的概念不断扩展(概率算法、量子算法等),促使人们引入了一个更雄心勃勃的丘奇-图灵论题版本,通常被称为"物理论题"。我们强调了丘奇-图灵论题两个版本之间的差异,并说明了如何将非确定性算法和概率算法视为带有适当预言机的基本算法。同样的观点也适用于量子电路算法和许多其他类别的算法。
摘要:This dialog paper offers a preview and provides a foretaste of an upcoming work on the axiomatization of basic interactive algorithms.   The modern notion of algorithm was elucidated in the 1930s--1950s. It was axiomatized a quarter of a century ago as the notion of ``sequential algorithm'' or ``classical algorithm''; we prefer to call it ``basic algorithm" now. The axiomatization was used to show that for every basic algorithm there is a behaviorally equivalent abstract state machine. It was also used to prove the Church-Turing thesis as it has been understood by the logicians.   Starting from the 1960s, the notion of algorithm has expanded -- probabilistic algorithms, quantum algorithms, etc. -- prompting introduction of a much more ambitious version of the Church-Turing thesis commonly known as the ``physical thesis.'' We emphasize the difference between the two versions of the Church-Turing thesis and illustrate how nondeterministic and probabilistic algorithms can be viewed as basic algorithms with appropriate oracles. The same view applies to quantum circuit algorithms and many other classes of algorithms.


【8】FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
标题:FineDialFact:细粒度对话事实验证的基准
链接:https://arxiv.org/abs/2508.05782

作者:Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver
摘要:众所周知,大型语言模型(LLM)会产生幻觉-事实上不正确或捏造的信息-这对许多自然语言处理(NLP)应用程序(如对话系统)构成了重大挑战。因此,检测幻觉已成为一个关键的研究领域。目前在对话系统中检测幻觉的方法主要集中在验证所生成的响应的事实一致性上。然而,这些回答通常包含准确、不准确或无法验证的事实,使得一个事实标签过于简单和粗粒度。在本文中,我们介绍了一个基准,FineDialFact,细粒度的对话事实验证,它涉及到验证原子事实提取的对话响应。为了支持这一点,我们构建了一个数据集的基础上公开可用的对话数据集,并使用各种基线方法进行评估。实验结果表明,结合思想链(CoT)推理的方法可以提高对话事实验证的性能。尽管如此,在开放领域对话数据集HybriDialogue上取得的最佳F1分数仅为0.75,这表明该基准仍然是未来研究的一项具有挑战性的任务。我们的数据集和代码将在GitHub上公开。
摘要:Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be public on GitHub.
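"对从回复中抽取的原子事实逐条核验"可以示意为如下提示构造(模板为假设,思维链指令仅为通用写法,并非基准使用的确切提示):

```python
def verification_prompt(atomic_fact: str, evidence: str) -> str:
    """Hypothetical CoT-style prompt for verifying a single atomic fact
    extracted from a dialogue response against retrieved evidence."""
    return (
        f"Evidence: {evidence}\n"
        f"Claim: {atomic_fact}\n"
        "Think step by step, then answer with exactly one of: "
        "Supported, Refuted, Unverifiable."
    )

p = verification_prompt("Paris is the capital of France",
                        "Paris is France's capital and largest city.")
```

三分类标签(支持/反驳/无法验证)对应摘要所述"准确、不准确或无法验证"的细粒度区分,避免对整条回复打单一事实标签。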


【9】DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing
标题:DINA:自然语言处理中针对内部噪声和外部攻击的双重防御框架
链接:https://arxiv.org/abs/2508.05671

作者:uang, Hen-Hsen Huang, Tsai-Yen Li
备注:7 pages
摘要:随着大型语言模型(LLM)和生成式人工智能越来越多地集成到客户服务和审核应用程序中,对抗性威胁来自外部操纵和内部标签损坏。在这项工作中,我们通过引入DINA(针对内部噪声和对抗性攻击的双重防御)来识别和系统地解决这些双重对抗性威胁,DINA是一种专门为NLP量身定制的新型统一框架。我们的方法采用了来自计算机视觉的高级噪声标签学习方法,并将其与对抗训练相结合,以同时减轻内部标签破坏和外部对抗干扰。在来自在线游戏服务的真实世界数据集上进行的大量实验表明,与基线模型相比,DINA显着提高了模型的鲁棒性和准确性。我们的研究结果不仅强调了双重威胁防御的关键必要性,还为在现实对抗场景中保护NLP系统提供了实用的策略,强调了公平和负责任的AI部署的更广泛意义。
摘要:As large language models (LLMs) and generative AI become increasingly integrated into customer service and moderation applications, adversarial threats emerge from both external manipulations and internal label corruption. In this work, we identify and systematically address these dual adversarial threats by introducing DINA (Dual Defense Against Internal Noise and Adversarial Attacks), a novel unified framework tailored specifically for NLP. Our approach adapts advanced noisy-label learning methods from computer vision and integrates them with adversarial training to simultaneously mitigate internal label sabotage and external adversarial perturbations. Extensive experiments conducted on a real-world dataset from an online gaming service demonstrate that DINA significantly improves model robustness and accuracy compared to baseline models. Our findings not only highlight the critical necessity of dual-threat defenses but also offer practical strategies for safeguarding NLP systems in realistic adversarial scenarios, underscoring broader implications for fair and responsible AI deployment.


【10】Indian Legal NLP Benchmarks : A Survey
标题:印度法律NLP基准:调查
链接:https://arxiv.org/abs/2107.06056

作者:h Kalamkar, Janani Venugopalan Ph.D., Vivek Raghavan Ph.D
摘要:具有挑战性的基准的可用性是AI在特定领域取得进步的关键。由于法律文本与普通英语文本有很大不同,因此需要为印度法律文本创建单独的自然语言处理基准,这些基准具有挑战性,并专注于法律系统特定的任务。这将刺激印度法律文本自然语言处理应用的创新,并将使人工智能社区和法律界受益。我们回顾了这一领域的现有工作,并提出了为印度法律自然语言处理创建新基准的想法。
摘要:Availability of challenging benchmarks is the key to advancement of AI in a specific field. Since legal text is significantly different from normal English text, there is a need to create separate Natural Language Processing benchmarks for Indian legal text which are challenging and focus on tasks specific to legal systems. This will spur innovation in applications of Natural Language Processing for Indian legal text and will benefit the AI community and the legal fraternity. We review the existing work in this area and propose ideas to create new benchmarks for Indian Legal Natural Language Processing.


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递

【声明】内容源于网络