
自然语言处理学术速递[9.15]

2025-09-16
导读:cs.CL 方向,今日共计72篇

点击阅读原文访问arxivdaily.com,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏等功能!




大模型相关(32篇)

【1】RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
标题:RefactorCoderQA:面向云边部署中多领域编码问题求解的LLM基准测试
链接:https://arxiv.org/abs/2509.10436

作者:Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
备注:12 pages, 5 figures, submitted to IEEE Transactions on Services Computing
摘要:为了优化大型语言模型(LLM)的推理和解决问题的能力,我们提出了一种新颖的云边协作架构,支持结构化的多代理提示框架。该框架包括三个专门的组件:GuideLLM,一个部署在边缘的轻量级模型,用于提供方法指导;SolverLLM,一个托管在云中的更强大的模型,负责生成代码解决方案;以及JudgeLLM,一个用于评估解决方案正确性和质量的自动评估器。为了在现实环境中评估并展示这种架构的有效性,我们引入了RefactorCoderQA,这是一个全面的基准测试,旨在评估和增强大型语言模型(LLM)在多领域编码任务上的性能。鉴于现有基准测试的局限性,RefactorCoderQA利用来自Stack Overflow的真实编码挑战,系统地涵盖了软件工程、数据科学、机器学习和自然语言处理等多个技术领域。大量实验表明,我们微调的模型RefactorCoder-MoE实现了最先进的性能,以76.84%的总体准确率显著优于领先的开源和商业基线。人工评估进一步验证了所生成解决方案的可解释性、准确性和实际相关性。此外,我们还评估了吞吐量和延迟等系统级指标,以更深入地了解所提架构的性能特征与权衡。
摘要:To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
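摘要中描述的 GuideLLM→SolverLLM→JudgeLLM 三段式流程,可以用如下极简示意代码勾勒数据流向。注意:其中 guide_llm、solver_llm、judge_llm 均为假设的占位函数(用规则和字符串模拟模型调用),并非论文的实际实现:

```python
# 极简示意:摘要所述"指导 -> 求解 -> 评审"的单遍多代理流程。
# guide_llm / solver_llm / judge_llm 均为假设的占位实现,仅演示各组件的分工与数据流。

def guide_llm(question: str) -> str:
    """边缘端轻量模型:给出方法性指导(此处以固定模板占位)。"""
    return f"步骤提示: 先分析'{question}'的输入输出, 再分步实现"

def solver_llm(question: str, guidance: str) -> str:
    """云端较强模型:依据指导生成代码解答(此处以字符串占位)。"""
    return f"# 依据指导[{guidance}]生成的解答\ndef solve():\n    pass"

def judge_llm(question: str, solution: str) -> dict:
    """自动评审器:返回正确性与质量评分(此处以简单启发式占位)。"""
    return {"correct": "def" in solution, "quality": 0.8}

def pipeline(question: str) -> dict:
    guidance = guide_llm(question)             # 1. 边缘端产生方法指导
    solution = solver_llm(question, guidance)  # 2. 云端按指导生成代码
    verdict = judge_llm(question, solution)    # 3. 自动评审正确性与质量
    return {"solution": solution, "verdict": verdict}

result = pipeline("反转一个链表")
```

这一骨架也解释了摘要中为何要分别度量吞吐量与延迟:边缘端与云端的两次模型调用处于同一条关键路径上。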

【2】Long Context Automated Essay Scoring with Language Models
标题:基于语言模型的长上下文自动作文评分
链接:https://arxiv.org/abs/2509.10417

作者:er Ormerod, Gitit Kehat
备注:8 pages, 2 figures, 2 tables
摘要:基于Transformer的语言模型在架构上被限制为只能处理固定最大长度的文本。高年级学生所写的作文经常超过许多流行开源模型允许的最大长度。使用这些模型进行自动作文评分时,解决该问题的常见做法是截断输入文本。这会引起严重的效度问题,因为它削弱了模型完整捕获和评估评分量规中组织性要素的能力,而评估这些要素需要很长的上下文。在本研究中,我们使用Kaggle ASAP 2.0数据集,评估了几个对标准Transformer架构进行修改以克服这些长度限制的模型。本研究考虑的模型包括XLNet、Longformer、ModernBERT、Mamba和Llama模型的微调版本。
摘要:Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model's ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.

【3】Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
标题:丢弃专家,重组神经元:稀疏专家混合LLM的免再训练剪枝
链接:https://arxiv.org/abs/2509.10377

作者:ou, Ziyu Zhao, Dongzhou Cheng, zhiliang wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
备注:Accepted to EMNLP2025
摘要:稀疏专家混合(SMoE)架构因其计算效率而被广泛用于大型语言模型(LLM)。然而,尽管每个令牌只激活少数专家,SMoE仍需加载所有专家参数,导致高内存占用和部署困难。此前的工作试图通过剪枝和合并专家来降低开销,但主要集中在专家级操作上,神经元级结构仍未得到充分探索。我们提出了DERN(Dropping Experts, Recombining Neurons),一个与任务无关且无需再训练的专家剪枝与重建框架。我们观察到,专家在神经元层面往往彼此错位并存在语义冲突,这给直接合并带来了挑战。为解决这一问题,DERN分三步工作:首先使用路由器统计信息剪掉冗余专家;然后将其分解为神经元级的专家片段,把每个片段分配给最相容的保留专家;最后在每个保留专家内部合并这些片段,构建紧凑的表示。在Mixtral、Qwen和DeepSeek SMoE模型上的实验表明,在50%专家稀疏度下,DERN无需额外训练即可在常识推理和MMLU基准上将性能提升5%以上。它还大幅减少了专家数量和内存占用,使SMoE LLM在实践中更易部署。
摘要:Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
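摘要所述的三步流程(按路由统计剪枝冗余专家、将其分解为神经元级片段并分配给最相容的保留专家、在保留专家内合并片段)可以用如下玩具级数值示意来理解。这里用小向量代表"神经元"、用余弦相似度衡量片段相容性、用逐元素平均代表合并,均为示意性假设,并非论文的实际算法细节:

```python
# 玩具级示意:DERN三步流程(剪枝 -> 片段重分配 -> 片段合并)。
# 专家用若干神经元行向量表示;相容性用余弦相似度近似;数值均为虚构。
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

experts = {
    "e0": [[1.0, 0.0], [0.9, 0.1]],
    "e1": [[0.0, 1.0], [0.1, 0.9]],
    "e2": [[0.8, 0.2], [0.2, 0.8]],   # 路由负载最低,将被剪枝
}
router_load = {"e0": 0.5, "e1": 0.4, "e2": 0.1}  # 路由统计(激活频率,虚构)

# 步骤1:按路由统计剪掉负载最低的冗余专家
pruned = min(router_load, key=router_load.get)
retained = [e for e in experts if e != pruned]

# 步骤2:把被剪专家分解为神经元级片段,分配给最相容的保留专家
assignments = {e: [] for e in retained}
for seg in experts[pruned]:
    best = max(retained, key=lambda e: max(cosine(seg, row) for row in experts[e]))
    assignments[best].append(seg)

# 步骤3:在每个保留专家内,将片段与其最相近的神经元合并(逐元素平均)
for e, segs in assignments.items():
    for seg in segs:
        i = max(range(len(experts[e])), key=lambda k: cosine(seg, experts[e][k]))
        experts[e][i] = [(a + b) / 2 for a, b in zip(experts[e][i], seg)]
```

在这个小例子中,偏向第一维的片段[0.8, 0.2]被分给同样偏向第一维的专家e0,另一片段则分给e1,体现了"按神经元级相容性重组"而非整专家合并的思路。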

【4】Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
标题:超越令牌限制:评估语言模型在长文本分类上的性能
链接:https://arxiv.org/abs/2509.10199

作者:bők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille
摘要:社会科学中使用最广泛的大型语言模型(如BERT及其衍生模型,例如RoBERTa)对其可处理并产生预测的输入文本长度有限制。对于一些旨在处理长输入文本的分类任务来说,这是一个尤为紧迫的问题。其中一个领域涉及法律和法律草案(法案),它们的篇幅可达数百页,因此并不适合用只能处理例如512个令牌的模型来处理。在本文中,我们展示了涵盖5种语言、使用XLM-RoBERTa、Longformer、GPT-3.5和GPT-4模型,针对比较议程项目(Comparative Agendas Project)多类分类任务的实验结果,该任务的码本包含从教育到医疗保健的21个政策主题标签。结果显示,专门为处理长输入而预训练的Longformer模型并无特别优势。GPT变体与表现最好的开放模型之间的比较结果是后者占优。对类别层面因素的分析表明,在长文本输入的性能方面,各类别的支持样本数以及特定类别之间的实质性重叠十分重要。
摘要:The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.

【5】Benchmark of stylistic variation in LLM-generated texts
标题:LLM生成文本中文体变异的基准
链接:https://arxiv.org/abs/2509.10179

作者:čka, Anna Marklová, Václav Cvrček
摘要:本研究调查了人类所写文本与大型语言模型(LLM)生成的可比文本之间的语域变异。我们将Biber的多维分析(MDA)应用于人类书面文本样本及作为其对应物生成的AI文本,以找出LLM与人类差异最显著、最系统的变异维度。作为文本材料,我们使用了一个新的LLM生成语料库AI-Brown,它与BE-21(代表当代英式英语的布朗家族语料库)具有可比性。由于除英语之外的所有语言在前沿LLM的训练数据中都代表性不足,我们使用AI-Koditex语料库和捷克语多维模型在捷克语上复现了类似的分析。我们在不同设置和提示下考察了16个前沿模型,重点关注基础模型与指令微调模型之间的差异。在此基础上,我们创建了一个基准,借助它可以在可解释的维度上对模型进行相互比较和排名。
摘要:This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber's multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using AI-Koditex corpus and Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.

【6】Population-Aligned Persona Generation for LLM-based Social Simulation
标题:基于LLM的社交模拟的人口一致角色生成
链接:https://arxiv.org/abs/2509.10127

作者:u, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
摘要:大型语言模型(LLM)的最新进展使类人社会模拟达到了前所未有的规模和保真度,为计算社会科学提供了新的机会。然而,一个关键挑战在于构建能够真实反映现实世界人口多样性和分布的人物角色集。大多数现有的基于LLM的社会模拟研究主要集中在设计代理框架和模拟环境上,往往忽视了人物角色生成的复杂性以及非代表性角色集可能引入的偏见。在本文中,我们提出了一个为LLM驱动的社会模拟合成高质量、人口对齐的人物角色集的系统框架。我们的方法首先利用LLM从长期社交媒体数据中生成叙事式人物角色,随后进行严格的质量评估以过滤掉低保真度的档案。然后,我们应用重要性采样,实现与参考心理测量分布(如大五人格特质)的全局对齐。为满足特定模拟情境的需求,我们进一步引入了一个任务特定模块,将全局对齐的角色集调整到目标子群体。大量实验表明,我们的方法显著降低了人口层面的偏见,并为广泛的研究和政策应用实现了准确、灵活的社会模拟。
摘要:Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
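摘要中"用重要性采样使角色集与参考心理测量分布全局对齐"的思路,可以用如下一维简化示意来说明:对角色池按权重 w(x)=target(x)/proposal(x) 重采样,使某个特质的经验分布逼近参考分布。其中特质取值与两个分布的数值均为虚构假设,与论文数据无关:

```python
# 示意:用重要性采样将角色池中单个特质的分布对齐到参考分布。
# 特质取值(1..3)与分布数值均为虚构,仅演示方法本身。
import random

random.seed(0)

pool = [1] * 50 + [2] * 30 + [3] * 20    # 生成角色池的特质分布: [0.5, 0.3, 0.2]
target = {1: 0.2, 2: 0.3, 3: 0.5}        # 参考心理测量分布(虚构)

# 估计角色池的提议分布,并计算重要性权重 w(x) = target(x) / proposal(x)
counts = {v: pool.count(v) for v in target}
proposal = {v: counts[v] / len(pool) for v in counts}
weights = [target[x] / proposal[x] for x in pool]

# 按权重重采样,得到与参考分布对齐的角色子集
aligned = random.choices(pool, weights=weights, k=10000)
freq = {v: aligned.count(v) / len(aligned) for v in target}
```

重采样后 freq 应接近 target(偏差仅来自有限样本的抽样噪声),这正是"全局对齐"一步要达到的效果;论文的任务特定模块则在此之上再向目标子群体调整。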

【7】Arabic Large Language Models for Medical Text Generation
标题:用于医学文本生成的阿拉伯语大型语言模型
链接:https://arxiv.org/abs/2509.10095

作者:an Allam, Seif Ahmed, Ali Hamdi, Ammar Mohammed
备注:Published in 2025 4th International Conference on Computer Technologies (ICCTech)
摘要:高效的医院管理系统(HMS)在全球范围内对于应对过度拥挤、资源有限和紧急医疗服务可用性差等挑战至关重要。现有方法往往缺乏提供准确、实时医疗建议的能力,特别是对于不规范的输入和代表性不足的语言。为克服这些局限,本研究提出了一种针对阿拉伯语医学文本生成微调大语言模型(LLM)的方法。该系统旨在根据用户输入提供准确的医疗建议、诊断、用药建议和治疗计划,从而帮助患者。研究方法需要从社交媒体平台收集独特的数据集,捕捉患者与医生之间的真实医疗对话。该数据集包含患者主诉及相应的医疗建议,并经过适当的清洗和预处理,以兼顾多种阿拉伯语方言。通过微调Mistral-7B-Instruct-v0.2、LLaMA-2-7B和GPT-2 Medium等最先进的生成模型,优化了系统生成可靠医学文本的能力。评估结果表明,微调后的Mistral-7B模型优于其他模型,其平均BERTScore(基于Transformer的双向编码器表示)在精确率、召回率和F1分数上分别达到68.5%、69.08%和68.5%。比较基准测试和定性评估验证了该系统对非正式输入生成连贯、相关医疗回复的能力。这项研究强调了生成式人工智能(AI)在推进HMS方面的潜力,为全球医疗保健挑战提供了可扩展、可适配的解决方案,特别是在语言和文化多样化的环境中。
摘要:Efficient hospital management systems (HMS) are critical worldwide to address challenges such as overcrowding, limited resources, and poor availability of urgent health care. Existing methods often lack the ability to provide accurate, real-time medical advice, particularly for irregular inputs and underrepresented languages. To overcome these limitations, this study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input. The research methodology required the collection of a unique dataset from social media platforms, capturing real-world medical conversations between patients and doctors. The dataset, which includes patient complaints together with medical advice, was properly cleaned and preprocessed to account for multiple Arabic dialects. Fine-tuning state-of-the-art generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium, optimized the system's ability to generate reliable medical text. Results from evaluations indicate that the fine-tuned Mistral-7B model outperformed the other models, achieving average BERT (Bidirectional Encoder Representations from Transformers) Score values in precision, recall, and F1-scores of 68.5\%, 69.08\%, and 68.5\%, respectively. Comparative benchmarking and qualitative assessments validate the system's ability to produce coherent and relevant medical replies to informal input. This study highlights the potential of generative artificial intelligence (AI) in advancing HMS, offering a scalable and adaptable solution for global healthcare challenges, especially in linguistically and culturally diverse environments.

【8】Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models
标题:既定心理测量问卷与生态有效问卷:重新思考大型语言模型中的心理评估
链接:https://arxiv.org/abs/2509.10078

作者:hoi, Woojung Song, Jongwook Han, Eun-Ju Lee, Yohan Jo
备注:17 pages, 4 figures
摘要:研究人员已将既定的心理测量问卷(例如BFI、PVQ)用于衡量大型语言模型(LLM)回复中反映的人格特质和价值观。然而,将这些为人类设计的问卷应用于LLM引发了担忧。其中一个问题是它们缺乏生态效度,即调查问题在多大程度上充分反映并接近LLM响应用户查询生成文本的真实世界情境。然而,目前尚不清楚既定问卷与生态有效问卷的结果有何差异,以及这些差异可能提供怎样的见解。在本文中,我们对这两类问卷进行了全面的对比分析。我们的分析表明,既定问卷(1)得出的LLM画像与生态有效问卷的结果存在实质性差异,偏离了LLM在用户查询情境中表达的心理特征;(2)题项不足,难以实现稳定测量;(3)造成LLM拥有稳定心理构念的误导性印象;(4)对角色提示的LLM产生夸大的画像。总的来说,我们的工作告诫不要对LLM使用既定的心理问卷。我们的代码将在论文发表时发布。
摘要:Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity--the extent to which survey questions adequately reflect and resemble real-world contexts in which LLMs generate texts in response to user queries. However, it remains unclear how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.

【9】Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs
标题:对话理解中的多意图识别:小型开源LLM之间的比较
链接:https://arxiv.org/abs/2509.10010

作者:ad, Philine Kowol, Stefan Hillmann, Sebastian Möller
摘要:在本文中,我们对使用开源、公开可得且可在消费级硬件上运行的大型语言模型(LLM)进行多标签意图分类做了广泛分析。我们使用对话系统领域的基准MultiWOZ 2.1数据集,研究了三个流行的开源预训练LLM——LLama2-7B-hf、Mistral-7B-v0.1和Yi-6B——的有效性。我们在少样本(Few-Shot)设置下执行分类任务,在提示中给出20个示例并附带一些说明。我们的方法通过在多标签意图分类任务上系统地评估这些模型,关注它们在多个性能指标上的差异。此外,我们以较小的Transformer模型BertForSequenceClassification作为基线,比较了基于指令的微调方法与监督学习的性能。为评估模型性能,我们使用了准确率、精确率和召回率,以及微观、宏观和加权F1分数等评估指标,并报告了推理时间、VRAM需求等。就F分数而言,Mistral-7B-v0.1在14个意图类中的11个上优于其他两个生成模型,加权平均值为0.50。它还具有相对较低的汉明损失(Hamming Loss)和较高的Jaccard相似度,使其成为少样本设置下的最优模型。我们发现,与表现最好的少样本生成式LLM相比,基于BERT的监督分类器性能更优。该研究为小型开源LLM检测复杂的多意图对话提供了一个框架,增强了面向任务的聊天机器人的自然语言理解能力。
摘要:In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run in consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1 outperforms two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find BERT based supervised classifier having superior performance compared to the best performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
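摘要中用于比较模型的两个多标签指标——汉明损失与Jaccard相似度——可按如下方式计算。标签集合为虚构例子,实现为教学示意(与scikit-learn等库的定义一致,但并非论文代码):

```python
# 示意:多标签意图分类的两个常用指标(标签数据为虚构例子)。

def hamming_loss(y_true, y_pred, n_labels):
    """逐标签统计真实与预测标签集合的错配比例,越低越好。"""
    errors = sum(len(t ^ p) for t, p in zip(y_true, y_pred))  # ^ 为对称差
    return errors / (len(y_true) * n_labels)

def jaccard_similarity(y_true, y_pred):
    """样本级交并比的平均值,越高越好(两集合均为空时计为1)。"""
    scores = []
    for t, p in zip(y_true, y_pred):
        scores.append(len(t & p) / len(t | p) if (t | p) else 1.0)
    return sum(scores) / len(scores)

# 虚构示例:每个样本的真实/预测意图标签集合(标签空间共4个标签)
y_true = [{"inform", "request"}, {"book"}, {"inform"}]
y_pred = [{"inform"}, {"book", "recommend"}, {"inform"}]

hl = hamming_loss(y_true, y_pred, n_labels=4)   # 2个错配 / (3样本 x 4标签)
js = jaccard_similarity(y_true, y_pred)         # (1/2 + 1/2 + 1) / 3
```

这解释了摘要中"较低的汉明损失、较高的Jaccard相似度"为何共同指向更优的多标签预测质量:前者惩罚每个标签位上的错配,后者奖励样本级标签集合的整体重合。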

【10】Large Language Models Meet Legal Artificial Intelligence: A Survey
标题:大型语言模型遇上法律人工智能:一项综述
链接:https://arxiv.org/abs/2509.09969

作者:ou, Zihan Ye, Nanli Zeng, Tianyong Hao, Kun Zeng
摘要:近年来,大型语言模型(LLM)极大地推动了法律人工智能(Legal AI)的发展,提高了法律任务的效率和准确性。为了推进基于LLM的方法在法律领域的研究和应用,本文对16个法律LLM系列和47个基于LLM的法律任务框架进行了全面回顾,并收集了15个基准和29个数据集以评估不同的法律能力。此外,我们分析了法律领域基于LLM的方法面临的挑战,并讨论了未来方向。我们希望本文能为初学者提供系统性的介绍,并鼓励该领域的未来研究。资源可在https://github.com/ZhitianHou/LLMs4LegalAI上获得。
摘要:Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in legal domain, this paper provides a comprehensive review of 16 legal LLMs series and 47 LLM-based frameworks for legal tasks, and also gather 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI.

【11】Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks
标题:Vibe Check:了解基于LLM的对话代理的个性和一致性对目标导向任务中用户感知的影响
链接:https://arxiv.org/abs/2509.09870

作者:ahman, Smit Desai
摘要:大型语言模型(LLM)使会话代理(CA)能够表达独特的个性,这引发了此类设计如何塑造用户感知的新问题。本研究探讨了个性表达水平和用户-代理个性对齐如何影响目标导向任务中的感知。在一项被试间实验(N=150)中,参与者与通过我们新颖的"特质调制键"(Trait Modulation Keys)框架控制、在大五特质上表现出低、中、高表达水平的CA一起完成旅行规划。结果显示出倒U型关系:中等表达在智能、愉悦度、拟人化、采用意向、信任和可爱度上获得了最积极的评价,显著优于两个极端。个性对齐进一步改善了结果,其中外向性和情绪稳定性是最具影响力的特质。聚类分析识别出三种不同的兼容性画像,其中"良好对齐"的用户报告了非常积极的感知。这些发现表明,个性表达和策略性的特质对齐构成了CA个性的最佳设计目标,在基于LLM的CA日益普及之际提供了设计启示。
摘要:Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such designs shape user perceptions. This study investigates how personality expression levels and user-agent personality alignment influence perceptions in goal-oriented tasks. In a between-subjects experiment (N=150), participants completed travel planning with CAs exhibiting low, medium, or high expression across the Big Five traits, controlled via our novel Trait Modulation Keys framework. Results revealed an inverted-U relationship: medium expression produced the most positive evaluations across Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability, significantly outperforming both extremes. Personality alignment further enhanced outcomes, with Extraversion and Emotional Stability emerging as the most influential traits. Cluster analysis identified three distinct compatibility profiles, with "Well-Aligned" users reporting substantially positive perceptions. These findings demonstrate that personality expression and strategic trait alignment constitute optimal design targets for CA personality, offering design implications as LLM-based CAs become increasingly prevalent.

【12】LLMs as Agentic Cooperative Players in Multiplayer UNO
标题:LLM作为多人UNO游戏中的智能体合作玩家
链接:https://arxiv.org/abs/2509.09867

作者:no Matinez, Jesse Roberts
摘要:LLM有望帮助人类,不仅仅是回答问题,而是在广泛的任务中提供有用的指导。但这种帮助能走多远?一个基于大型语言模型的代理真的能作为积极参与者帮助某人实现目标吗?我们通过让LLM参与回合制纸牌游戏UNO来检验这个问题,要求它不是自己获胜,而是帮助另一名玩家获胜。我们构建了一个工具,使仅解码器(decoder-only)LLM能够作为代理参与RLCard游戏环境。这些模型接收完整的游戏状态信息,并在两种不同的提示策略下使用简单的文本提示进行响应。我们评估了从小型(1B参数)到大型(70B参数)的模型,并探索模型规模如何影响性能。我们发现,虽然所有模型在玩UNO时都能成功超越随机基线,但很少有模型能够显著帮助另一名玩家。
摘要:LLMs promise to assist humans -- not just by answering questions, but by offering useful guidance across a wide range of tasks. But how far does that assistance go? Can a large language model based agent actually help someone accomplish their goal as an active participant? We test this question by engaging an LLM in UNO, a turn-based card game, asking it not to win but instead help another player to do so. We built a tool that allows decoder-only LLMs to participate as agents within the RLCard game environment. These models receive full game-state information and respond using simple text prompts under two distinct prompting strategies. We evaluate models ranging from small (1B parameters) to large (70B parameters) and explore how model scale impacts performance. We find that while all models were able to successfully outperform a random baseline when playing UNO, few were able to significantly aid another player.

【13】Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
标题:利用LLM进行主题引导强化学习以增强多文档摘要
链接:https://arxiv.org/abs/2509.09852

作者:i, Austin Xu, Shafiq Joty, Giuseppe Carenini
摘要:多文档摘要(MDS)的一个关键挑战是在保持连贯性和主题相关性的同时,有效整合来自多个来源的信息。虽然大型语言模型在单文档摘要上已展现出令人印象深刻的结果,但它们在MDS上的表现仍有改进空间。在本文中,我们提出了一种主题引导的强化学习方法,以改进MDS中的内容选择。我们首先表明,用主题标签显式提示模型可以提高生成摘要的信息量。基于这一发现,我们在组相对策略优化(GRPO)框架内提出了一种新颖的主题奖励,用于衡量生成摘要与源文档之间的主题对齐程度。在Multi-News和Multi-XScience数据集上的实验结果表明,我们的方法始终优于强基线,突显了在MDS中利用主题线索的有效性。
摘要:A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
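摘要所述的"主题奖励"可理解为生成摘要与源文档主题之间的对齐度量。论文中奖励的具体形式未在摘要中给出;下面用主题标签集合的交并比作为一个纯属假设的简化形式,并演示它如何进入GRPO式的组内相对比较:

```python
# 示意:以主题集合的Jaccard重叠作为"主题奖励"的假设形式,
# 并在同一输入的多条候选摘要之间计算组相对优势(GRPO的核心思想)。

def topic_reward(summary_topics, source_topics):
    """主题对齐奖励的简化假设:两个主题集合的交并比。"""
    if not (summary_topics | source_topics):
        return 0.0
    return len(summary_topics & source_topics) / len(summary_topics | source_topics)

# 虚构示例:源文档的主题集合,以及同一输入采样出的三条候选摘要的主题
source = {"economy", "policy", "trade"}
group = [
    {"economy", "policy"},           # 候选1:部分覆盖
    {"sports"},                      # 候选2:主题漂移
    {"economy", "policy", "trade"},  # 候选3:完全对齐
]

rewards = [topic_reward(c, source) for c in group]
mean_r = sum(rewards) / len(rewards)
advantages = [r - mean_r for r in rewards]  # 组内相对优势(此处未做标准差归一)
```

主题漂移的候选获得负优势、对齐最好的候选获得正优势,策略因此被推向主题一致的摘要;这正是摘要中"在GRPO框架内衡量主题对齐"的直观含义。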

【14】HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning
标题:HEFT:从粗到细的层次结构,用于提高语言模型推理的效率和准确性
链接:https://arxiv.org/abs/2509.09801

作者:ill
摘要:大型语言模型(LLM)对专业推理任务的适应从根本上受到计算资源的限制。参数高效微调(PEFT)方法已成为一种强大的解决方案,但这类技术种类繁多,不同方法分别作用于模型的权重空间或其表示空间。本文研究了这样一个假设:这些范式的协同组合可以解锁更优的性能和效率。我们提出了HEFT(分层高效微调),一种新颖的分层自适应策略,以从粗到细的方式组合两种不同的PEFT方法:首先使用低秩自适应(LoRA)在权重空间进行广泛的基础性适应,随后使用表示微调(ReFT)对内部激活进行精确的手术式细化。我们通过在BoolQ基准(一个具有挑战性的推理数据集)上微调Llama-2-7B模型来评估该方法。我们的结果揭示了深刻的协同效应:使用HEFT策略仅微调三个轮次(epoch)的模型达到85.17%的准确率,超过了用仅LoRA(85.05%)或仅ReFT(83.36%)方法训练20个轮次的模型。这项工作表明,深思熟虑地组合PEFT方法是一种强有力的算法创新,为提升语言模型的推理能力提供了更高效、更有效的途径。通过以一小部分计算预算取得更优结果,我们的发现为克服将大规模模型适配到复杂认知任务所固有的障碍提供了一种有原则的方法。
摘要:The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model's weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17\%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05\%) or ReFT-only (83.36\%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.
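HEFT"先在权重空间做LoRA粗调、再在表示空间做ReFT细化"的两步组合,可以用如下玩具级数值示意:LoRA对冻结权重加低秩增量 W' = W + B·A,ReFT则在前向传播中对隐藏表示做加性编辑。矩阵极小、数值虚构,干预函数也是高度简化的假设,仅示意两步作用的先后顺序,而非论文实现:

```python
# 玩具级示意:HEFT的"粗(LoRA,权重空间) -> 细(ReFT,表示空间)"两步组合。
# 所有数值与干预形式均为虚构假设。

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matadd(W, D):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(W, D)]

def outer(b, a):
    return [[bi * aj for aj in a] for bi in b]

W = [[1.0, 0.0], [0.0, 1.0]]        # 冻结的原始权重(2x2)

# 第一步(粗):LoRA在权重空间加上秩1增量 W' = W + B·A
B, A = [0.1, 0.2], [1.0, 1.0]
W_lora = matadd(W, outer(B, A))

# 第二步(细):ReFT风格的干预,在表示空间沿方向r对隐藏状态做加性编辑
def reft_intervention(h, r=(0.0, 0.5)):
    # 高度简化的示意:实际ReFT使用学习到的低秩投影干预
    dot = sum(hi * ri for hi, ri in zip(h, r))
    return [hi + dot * ri for hi, ri in zip(h, r)]

x = [1.0, 1.0]
h = matvec(W_lora, x)            # 经LoRA粗调后的隐藏表示
h_final = reft_intervention(h)   # 经ReFT细化后的最终表示
```

两步互不替代:LoRA改变的是所有输入共享的权重,ReFT则按输入逐个修饰激活,这对应摘要中"广泛的基础性适应"与"精确的手术式细化"的分工。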

【15】Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation
标题:LLM的歧视:决策和总结中的跨语言偏见评估和缓解
链接:https://arxiv.org/abs/2509.09735

作者:ijzer, Jieying Chen
备注:7 pages
摘要:大型语言模型(LLM)快速融入各个领域,引发了对社会不平等和信息偏见的担忧。本研究考察了LLM中与背景、性别和年龄相关的偏见,重点关注其对决策和摘要任务的影响。此外,本研究还考察了这些偏见的跨语言传播,并评估了提示指令式缓解策略的有效性。我们使用Tamkin et al. (2023)数据集翻译成荷兰语的改编版本,为决策任务创建了151,200个独特提示,为摘要任务创建了176,400个。我们在GPT-3.5和GPT-4o上测试了各种人口统计学变量、指令、显著性水平和语言。我们的分析显示,两个模型在决策过程中都存在显著偏见,偏向女性、较年轻的年龄以及某些背景(如非裔美国人背景)。相比之下,摘要任务几乎没有表现出偏见的证据,尽管GPT-3.5在英语中出现了显著的年龄相关差异。跨语言分析表明,英语和荷兰语之间的偏见模式大体相似,但在特定人口统计类别中观察到显著差异。新提出的缓解指令虽然无法完全消除偏见,但显示出减少偏见的潜力:最有效的指令使最有利与最不利人口群体之间的差距平均缩小了27%。值得注意的是,与GPT-3.5相反,GPT-4o在所有英文提示下都表现出偏见降低,表明在较新的模型中基于提示的缓解具有特定潜力。这项研究强调了谨慎采用LLM和进行特定情境偏见测试的重要性,并突显了持续开发有效缓解策略以确保负责任地部署AI的必要性。
摘要:The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27\% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.
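摘要中"最有利与最不利人口群体之间的差距平均缩小27%"这类度量,可以按如下方式计算。各群体的有利结果率均为虚构示例,仅演示差距及其缩小比例的定义:

```python
# 示意:计算人口群体间的结果差距,以及缓解策略带来的差距缩小比例。
# 各群体分数均为虚构数据。

def demographic_gap(scores):
    """最有利与最不利群体之间的结果差距(最大值减最小值)。"""
    return max(scores.values()) - min(scores.values())

# 虚构示例:各人口群体在决策任务中获得有利结果的比率
baseline = {"group_a": 0.62, "group_b": 0.50, "group_c": 0.41}   # 缓解前
mitigated = {"group_a": 0.58, "group_b": 0.52, "group_c": 0.45}  # 缓解后

gap_before = demographic_gap(baseline)              # 约0.21
gap_after = demographic_gap(mitigated)              # 约0.13
reduction = (gap_before - gap_after) / gap_before   # 差距缩小的比例
```

在论文的设定中,这一比例会在所有提示变体上取平均,得到摘要报告的"27%平均缩小"这类汇总数字。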

【16】Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
标题:中国古代文献的视觉语言模型基准:从OCR到知识推理
链接:https://arxiv.org/abs/2509.09731

作者:u, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
摘要:中国古代文献是中国几千年历史和文化的宝贵载体,蕴含着跨领域的丰富知识,但在数字化和理解方面面临挑战:传统方法只能扫描图像,而当前的视觉语言模型(VLM)则难以应对其视觉和语言上的复杂性。现有的文档基准集中在英文印刷文本或简体中文上,在评估VLM处理中国古代文献的能力方面留下了空白。为解决这一问题,我们提出了AncientDoc,第一个面向中国古代文献的基准,旨在评估VLM从OCR到知识推理的能力。AncientDoc包括五个任务(页级OCR、白话翻译、基于推理的问答、基于知识的问答、语言变体问答),涵盖14种文档类型、100多本书、约3,000页。基于AncientDoc,我们使用多种指标评估主流VLM,并辅以与人类对齐的大型语言模型进行评分。
摘要:Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.

【17】A meta-analysis on the performance of machine-learning based language models for sentiment analysis
标题:基于机器学习的情感分析语言模型性能的元分析
链接:https://arxiv.org/abs/2509.09728

作者:de, Jonas Klingwort, Christian Borgs
摘要:本文提出了一项元分析,评估ML模型在Twitter数据情感分析中的性能。该研究旨在估计平均性能,评估研究之间和研究内部的异质性,并分析研究特征如何影响模型性能。我们按照PRISMA指南检索了学术数据库,从20项研究中选取了195项试验,包含12项研究特征。对报告最多的性能指标——总体准确率——使用双反正弦变换和三水平随机效应模型进行了分析。AIC优化模型的平均总体准确率为0.80 [0.76, 0.84]。本文提供了两个关键见解:1)总体准确率被广泛使用,但由于其对类别不平衡和情感类别数量的敏感性,往往具有误导性,凸显了归一化的必要性;2)模型性能的标准化报告(包括报告独立测试集上的混淆矩阵)对跨研究可靠比较ML分类器至关重要,而这似乎远非普遍做法。
摘要:This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.
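上文摘要中的"双反正弦变换"(Freeman-Tukey 变换)用于稳定比例型指标(如总体准确率)的方差,以便后续的随机效应元分析。下面给出一个纯 Python 的最小示意实现(仅为说明该变换的形式,非论文原始代码):

```python
import math

def freeman_tukey(successes: int, n: int) -> float:
    """Freeman-Tukey 双反正弦变换:
    t = asin(sqrt(x/(n+1))) + asin(sqrt((x+1)/(n+1)))
    用于稳定二项比例(如分类准确率)的方差。"""
    return (math.asin(math.sqrt(successes / (n + 1)))
            + math.asin(math.sqrt((successes + 1) / (n + 1))))

# 示例:某项试验在 100 个样本上取得 80 个正确预测
t = freeman_tukey(80, 100)
```

变换后的值域为 $[0, \pi]$,且随准确率单调上升;元分析在该尺度上汇总后再反变换回比例尺度。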

【18】A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
标题:利用大型语言模型(LLM)进行金融教育问题解答的角色感知多代理框架
链接:https://arxiv.org/abs/2509.09727

作者: Yingjun Du
备注:8 pages, 6 figures, Under review
摘要:问答(QA)在金融教育中起着核心作用,但现有的大型语言模型(LLM)方法往往无法捕捉到解决金融问题所需的细致入微和专业化的推理。金融领域需要多步定量推理,熟悉特定领域的术语,并理解现实世界的场景。我们提出了一个多代理框架,利用基于角色的提示来提高特定领域的QA性能。我们的框架包括一个基本生成器、一个证据检索器和一个专家审查代理,三者在单次迭代中协同产生一个精炼的答案。我们在在线学习平台Study.com提供的3,532个专家设计的金融教育问题上评估了该框架。我们利用检索增强生成(RAG)从6本金融教科书中获取上下文证据,并为领域专家审查者设计提示策略。实验表明,基于批评的细化使答案准确率比零样本思维链基线提高了6.6-8.3%,其中Gemini-2.0-Flash表现最佳。此外,我们的方法使GPT-4o-mini能够实现与金融微调的FinGPT-mt_Llama3-8B_LoRA相当的性能。我们的研究结果展示了一种具有成本效益的金融QA增强方法,并为多代理金融LLM系统的进一步研究提供了见解。
摘要:Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.

【19】DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model
标题:DiTTO-LLM:通过大型语言模型发现基于主题的技术机会的框架
链接:https://arxiv.org/abs/2509.09724

作者:Kim, Sujeong Seo, Juhyun Lee
备注:5 figures
摘要:技术机会是关键信息,是技术、产业和创新进步的基础。本文提出了一个框架的基础上的技术之间的时间关系,以确定新兴技术的机会。该框架首先从专利数据集中提取文本,然后映射基于文本的主题以发现技术间的关系。然后通过跟踪这些主题随时间的变化来确定技术机会。为了提高效率,该框架利用大型语言模型来提取主题,并采用基于聊天的语言模型的提示来支持技术机会的发现。该框架使用美国专利和商标局提供的人工智能专利数据集进行评估。实验结果表明,人工智能技术正在演变为促进日常可访问性的形式。这种方法表明了拟议框架在确定未来技术机会方面的潜力。
摘要:Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.

【20】ALIGNS: Unlocking nomological networks in psychological measurement through a large language model
标题:对齐:通过大型语言模型解锁心理测量中的法则网络
链接:https://arxiv.org/abs/2509.09723

作者:rsen, Sen Yan, Roland Müller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson
摘要:心理测量对许多学科都至关重要。尽管测量技术不断进步,但在Cronbach和Meehl提出法则网络(即描述概念与测量如何相互关联以确立效度的理论地图)并将其作为验证基础的70年后,构建法则网络仍然是一个挑战。这一限制带来了实际后果:临床试验可能无法检测到治疗效果,公共政策可能瞄准错误的结果。我们介绍了"潜在指标分析生成法则结构"(Analysis of Latent Indicators to Generate Nomological Structures,ALIGNS),这是一个用经过验证的问卷测量工具训练的、基于大型语言模型的系统。ALIGNS提供了三个全面的法则网络,包含超过550,000个指标,涵盖心理学、医学、社会政策和其他领域。这是首次应用大型语言模型来解决测量验证中的基础性问题。我们报告了用于开发模型的分类准确性测试以及三项评估。在第一项评估中,广泛使用的NIH PROMIS焦虑和抑郁量表被证明收敛于情绪困扰的单一维度。第二项评估考察了儿童气质测量,识别出四个现有框架未能捕捉的潜在维度,并对一个现有维度提出质疑。第三项评估是适用性检查,邀请心理测量学专家评估该系统的重要性、可及性和适用性。ALIGNS可在nomologicalnetwork.org上免费获取,以大规模法则分析补充传统验证方法。
摘要:Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.

【21】The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization
标题:思维治疗师:使用监督微调和比值比策略优化训练大型语言模型以提供接受与承诺治疗
链接:https://arxiv.org/abs/2509.09712

作者:ir
摘要:接受与承诺疗法(ACT)是第三波认知行为疗法,已有证据表明其对多种精神疾病有效。本研究调查了后训练方法和显式推理对小型开放权重大型语言模型(LLM)提供ACT能力的影响。使用由Mistral-Large生成的50组合成ACT转录本,我们以两种不同方法训练Llama-3.2-3b-Instruct:监督微调(SFT)和比值比策略优化(ORPO),每种方法又分别带有或不带明确的思维链(COT)推理步骤。通过将这四个后训练变体与基础Instruct模型进行比较来评估性能。这些模型在模拟治疗会话中进行基准测试,由一个已基于人类评估微调的LLM评委在ACT保真度测量(ACT-FM)和治疗师同理心量表(TES)上定量评分。结果表明,ORPO训练的模型在ACT保真度($\chi^2(5)=185.15, p<.001$)和治疗性同理心($\chi^2(5)=140.37, p<.001$)方面均显著优于其SFT和Instruct对应模型。COT的效果是有条件的:它为SFT模型带来了显著收益,使ACT-FM评分平均提高2.68分($p<.001$),但对更优的ORPO或指令调优变体没有明显优势。我们认为,ORPO的优越性源于其学习治疗"过程"而非模仿"内容"的能力(这是ACT的一个关键方面),而COT则是仅通过模仿训练的模型所必需的支架。这项研究表明,偏好对齐的策略优化可以有效地在小型LLM中灌输ACT能力,且显式推理的效用高度依赖于底层的训练范式。
摘要:Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic `process' over imitating `content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.
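摘要中的比值比策略优化(ORPO)在监督损失之外加入一个比值比惩罚项,鼓励模型提高被偏好回复相对被拒绝回复的比值。下面用纯 Python 示意该惩罚项的核心计算(此处以标量概率代替序列平均对数概率,仅为说明形式,非该论文的实现):

```python
import math

def odds(p: float) -> float:
    """将概率转换为比值:odds(p) = p / (1 - p)。"""
    return p / (1.0 - p)

def orpo_penalty(p_chosen: float, p_rejected: float) -> float:
    """ORPO 的比值比项:-log sigmoid(log(odds_chosen / odds_rejected))。
    当被偏好回复的概率高于被拒绝回复时,该项变小。"""
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```

当两者概率相等时惩罚为 $\log 2$;被偏好回复的优势越大,惩罚越接近 0。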

【22】Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
标题:Psychiatry-Bench:精神病学领域大型语言模型的多任务基准
链接:https://arxiv.org/abs/2509.09711

作者:uda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda
摘要:大型语言模型(LLM)在提高精神病学实践方面具有很大的潜力,从提高诊断准确性到简化临床文档和治疗支持。然而,现有的评估资源严重依赖于小型临床访谈语料库,社交媒体帖子或合成对话,这限制了它们的临床有效性,并且无法捕获精神病学推理的全部复杂性。在这项工作中,我们介绍了PsychiatryBench,这是一个严格策划的基准,完全基于权威,专家验证的精神病学教科书和案例。PsychiatryBench包括11个不同的问答任务,从诊断推理和治疗计划到纵向随访,管理计划,临床方法,顺序病例分析,以及多项选择/扩展匹配格式,总计超过5,300个专家注释项目。我们评估了一系列不同的前沿LLM(包括Google Gemini,DeepSeek,LLaMA 3和QWQ-32)以及领先的开源医学模型(例如,OpenBiloLLM,MedGemma)使用常规度量和“LLM-as-judge”相似性评分框架。我们的研究结果揭示了临床一致性和安全性方面的巨大差距,特别是在多轮随访和管理任务中,强调了对专业模型调整和更强大的评估范式的需求。PsychiatryBench提供了一个模块化的,可扩展的平台,用于基准测试和提高LLM在高风险心理健康应用中的性能。
摘要:Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBiloLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.

【23】Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data
标题:使用受人口普查和土地利用数据影响的大型语言模型生成个人旅行日记
链接:https://arxiv.org/abs/2509.09710

作者:lrokh Amin, Devin Rhoads, Fatemeh Fakhrmoosavi, Nicholas E. Lownes, John N. Ivan
摘要:这项研究介绍了一种用于在基于代理的交通模型中生成个人旅行日记的大语言模型(LLM)方案。传统方法依赖大量专有的家庭出行调查,而本研究提出的方法从开源的美国社区调查(ACS)和智能位置数据库(SLD)数据中随机生成人物角色,然后通过直接提示合成日记。这项研究提出了一种新颖的"一对队列"(one-to-cohort)真实性评分:由四个指标(出行次数评分、时间间隔评分、出行目的评分和出行方式评分)构成的复合指标,并以按人口统计学变量匹配的康涅狄格州全州交通研究(CSTS)日记为基准进行了验证。该验证利用Jensen-Shannon散度来测量生成日记与真实日记之间的分布相似性。与在验证集上校准的经典方法(负二项模型用于出行生成;多项Logit模型用于方式/目的)生成的日记相比,LLM生成的日记实现了相当的整体真实性(LLM平均值:0.485 vs. 0.455)。LLM擅长确定出行目的并表现出更高的一致性(真实性评分分布更窄),而经典模型在出行次数和活动持续时间的数值估计上领先。汇总验证证实了LLM的统计代表性(LLM平均值:0.612 vs. 0.435),证明了LLM零样本生成的可行性,并为未来的合成日记评价系统建立了可量化的日记真实性指标。
摘要:This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM's statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM's zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
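摘要中用 Jensen-Shannon 散度(JSD)度量生成日记与真实日记的分布相似性。下面是一个纯 Python 的最小示意实现(以 2 为底,取值范围 [0, 1];示例中的分布数值为虚构,非论文原始代码):

```python
import math

def _kl(p, q):
    # KL(p || q),以 2 为底;约定 0 * log(0/q) = 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """两个离散分布(如"每日出行次数"直方图)之间的 Jensen-Shannon 散度。"""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

# 示例:生成日记与真实日记的出行次数分布(虚构数据)
generated = [0.10, 0.35, 0.30, 0.15, 0.10]
real      = [0.12, 0.33, 0.28, 0.17, 0.10]
d = js_divergence(generated, real)
```

JSD 对称且有界,分布越相似取值越接近 0;论文中的复合真实性评分由多个此类分布比较汇总而成。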

【24】Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement
标题:使用大型语言模型协助研究提案撰写:评估和细化
链接:https://arxiv.org/abs/2509.09709

作者: Weiqi Wang
摘要:像ChatGPT这样的大型语言模型(LLM)越来越多地用于学术写作,但引用错误或捏造等问题引发了道德担忧。此外,目前的内容质量评估往往依赖主观的人工判断,这不仅劳动密集且缺乏客观性,还可能损害评估的一致性和可靠性。在这项研究中,为了对LLM的研究计划书写作能力进行定量评估并加以提升,我们提出了两个关键评估指标(内容质量和参考文献有效性),以及一种基于这两个指标得分的迭代提示方法。大量实验表明,所提出的指标为评估ChatGPT的写作表现提供了一个客观、定量的框架。此外,迭代提示显著提高了内容质量,同时减少了参考文献的不准确和捏造,解决了学术场景中的关键道德挑战。
摘要:Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics--content quality and reference validity--and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT's writing performance. Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts.

【25】Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
标题:超越"对不起,我做不到":剖析大型语言模型的拒绝行为
链接:https://arxiv.org/abs/2509.09708

作者:u Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
摘要:拒绝有害提示是指令调优的大型语言模型(LLM)的一项关键安全行为,但这种行为的内部成因仍然知之甚少。我们使用在残差流激活上训练的稀疏自编码器(SAE),研究了两个公开的指令调优模型Gemma-2-2B-IT和LLaMA-3.1-8B-IT。给定一个有害提示,我们在SAE潜空间中搜索特征集合,对其进行消融可以使模型从拒绝翻转为遵从,从而证明因果影响并构造越狱。我们的搜索过程分为三个阶段:(1)拒绝方向:找到一个介导拒绝行为的方向,并收集该方向附近的SAE特征;(2)贪婪过滤:修剪至最小集合;(3)交互发现:拟合一个因子分解机(FM),捕获剩余活跃特征与最小集合之间的非线性交互。该流程产生了一组广泛的越狱关键特征,为拒绝行为的机制基础提供了洞察。此外,我们发现了冗余特征存在的证据,这些特征在较早的特征被抑制之前保持休眠。我们的研究结果凸显了通过操纵可解释潜空间对安全行为进行细粒度审计和针对性干预的潜力。
摘要:Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
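摘要中"消融使模型由拒绝翻转为遵从"的操作,本质上是在 SAE 潜空间中将选定特征置零,再线性解码回残差流。下面是一个高度简化的纯 Python 示意(向量维度与权重取值均为假设,非论文原始实现):

```python
def ablate(latents, feature_ids):
    """在 SAE 潜向量中消融(置零)指定编号的特征。"""
    return [0.0 if i in feature_ids else z for i, z in enumerate(latents)]

def decode(latents, decoder_rows):
    """线性解码回残差流:x_hat = sum_i z_i * W_dec[i]。"""
    dim = len(decoder_rows[0])
    out = [0.0] * dim
    for z, row in zip(latents, decoder_rows):
        for d in range(dim):
            out[d] += z * row[d]
    return out

# 示例:消融第 1 号特征后重建残差流(虚构的 3 特征、2 维解码器)
z = [1.0, 2.0, 0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_ablated = decode(ablate(z, {1}), w_dec)
```

实际方法在前向传播中用消融后的重建替换原激活,再观察模型输出是否由拒绝变为遵从。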

【26】LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm
标题:有偏差随机密钥遗传算法背景下的基于LLM的实例驱动启发式偏差
链接:https://arxiv.org/abs/2509.09707

作者:acón Sartori, Martín Isla Pino, Pedro Pinacho-Davidson, Christian Blum
备注:Submitted to a journal for review
摘要:将大型语言模型(LLM)集成到元启发式算法中,为解决复杂的组合优化问题开辟了一条新途径。虽然大多数现有方法利用LLM进行代码生成以创建或细化特定的启发式规则,但它们通常忽略了单个问题实例的结构属性。在这项工作中,我们引入了一个将LLM与偏置随机密钥遗传算法(BRKGA)相结合的新框架,用于解决NP难的最长运行子序列(Longest Run Subsequence)问题。我们的方法扩展了实例驱动的启发式偏置范式,引入人类与LLM协作的过程来共同设计并实现一组计算高效的指标。LLM分析这些实例特定的指标以生成定制的启发式偏置,从而将BRKGA引导至搜索空间中有希望的区域。我们进行了全面的实验评估,包括严格的统计检验、收敛与行为分析以及针对性的消融研究,在1,050个不同复杂度的生成实例上将我们的方法与标准BRKGA基线进行比较。结果表明,我们表现最好的混合方法BRKGA+Llama-4-Maverick相对基线取得了统计显著的改进,在最复杂的实例上尤为明显。我们的研究结果证实,利用LLM产生先验的、实例驱动的启发式偏置,是在复杂优化领域增强元启发式算法的一种有价值的方法。
摘要:Integrating Large Language Models (LLMs) within metaheuristics opens a novel path for solving complex combinatorial optimization problems. While most existing approaches leverage LLMs for code generation to create or refine specific heuristics, they often overlook the structural properties of individual problem instances. In this work, we introduce a novel framework that integrates LLMs with a Biased Random-Key Genetic Algorithm (BRKGA) to solve the NP-hard Longest Run Subsequence problem. Our approach extends the instance-driven heuristic bias paradigm by introducing a human-LLM collaborative process to co-design and implement a set of computationally efficient metrics. The LLM analyzes these instance-specific metrics to generate a tailored heuristic bias, which steers the BRKGA toward promising areas of the search space. We conduct a comprehensive experimental evaluation, including rigorous statistical tests, convergence and behavioral analyses, and targeted ablation studies, comparing our method against a standard BRKGA baseline across 1,050 generated instances of varying complexity. Results show that our top-performing hybrid, BRKGA+Llama-4-Maverick, achieves statistically significant improvements over the baseline, particularly on the most complex instances. Our findings confirm that leveraging an LLM to produce an a priori, instance-driven heuristic bias is a valuable approach for enhancing metaheuristics in complex optimization domains.
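BRKGA 中"实例驱动的启发式偏置"可以这样理解:在把随机密钥解码为候选解(如一个排列)时,将 LLM 基于实例指标给出的偏置与随机密钥加权混合,从而引导搜索方向。以下是一个示意性最小实现(混合方式、alpha 参数与偏置数值均为假设,非论文原始算法):

```python
import random

def biased_decode(keys, bias, alpha=0.5):
    """将随机密钥与启发式偏置按 alpha 加权混合后,按得分升序解码为一个排列。
    alpha=0 为纯随机密钥解码,alpha=1 完全由偏置决定。"""
    scores = [(1.0 - alpha) * k + alpha * b for k, b in zip(keys, bias)]
    return sorted(range(len(keys)), key=lambda i: scores[i])

random.seed(0)
keys = [random.random() for _ in range(6)]   # BRKGA 个体:随机密钥
bias = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3]       # 假设由 LLM 根据实例指标给出
perm = biased_decode(keys, bias, alpha=1.0)
```

取中间的 alpha 值即可在"遵循 LLM 先验"与"保持随机探索"之间折中。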

【27】Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks
标题:Transformer语言模型的差分鲁棒性:对抗性文本攻击下的实证评估
链接:https://arxiv.org/abs/2509.09706

作者:datkar, Oluwaseun Ajao, Matthew Shardlow
备注:8 pages, 4 tables, to appear in proceedings of Recent Advances in Natural Language Processing (RANLP 2025) and ACL Anthology
摘要:该研究评估了大型语言模型(LLM)对对抗性攻击的弹性,特别关注Flan-T5、BERT和RoBERTa-Base。通过使用TextFooler和BERTAttack系统设计的对抗性测试,我们发现模型鲁棒性存在显著差异。RoBERTa-Base和FlanT5表现出卓越的弹性,即使受到复杂攻击也能保持准确性,攻击成功率为0%。相反,BERT-Base显示出相当大的漏洞:TextFooler以93.75%的攻击成功率将模型准确率从48%降至仅3%。我们的研究表明,虽然某些LLM已经发展出有效的防御机制,但这些保障措施通常需要大量的计算资源。这项研究通过识别现有保障方法的优势和劣势,为理解LLM安全做出贡献,并为开发更高效、更有效的防御策略提出了切实可行的建议。
摘要:This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests through TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and FlanT5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies.

【28】The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
标题:小型LLM的不确定性:标准多项选择基准重复试验中答案一致性较低的证据
链接:https://arxiv.org/abs/2509.09705

作者:inhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Yago Primerano
摘要:这项工作探讨了小型LLM(2B-8B参数)多次回答同一问题时的一致性。我们对知名的开源LLM进行了研究:让其对多项选择基准MMLU-Redux和MedQA中的问题各重复作答10次,并考察不同的推理温度、小型与中型(50B-80B)模型、微调与基础模型以及其他参数的影响。我们还研究了要求多次试验答案一致对准确率的影响,以及在决定哪种模型能最好地兼顾两者时所涉及的权衡。为支持这些研究,我们提出了一些新的分析和图形工具。结果表明,能够被一致回答的问题数量在模型之间差异很大,但对小型模型而言,在低推理温度下通常处于50%-80%的范围。此外,一致答案的准确率似乎与整体准确率有合理的相关性。中型模型的结果则显示出高得多的答案一致性水平。
摘要:This work explores the consistency of small LLMs (2B-8B parameters) in answering multiple times the same question. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently vary considerably among models but are typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
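摘要所考察的"答案一致性"可以按如下方式量化:对每道题的多次重复作答统计多数答案的占比,再计算达到某一致性阈值的题目比例。下面是一个示意实现(阈值设定与示例数据均为假设,非论文原始度量):

```python
from collections import Counter

def consistency_rate(trials_per_question, threshold=1.0):
    """trials_per_question: 每题多次重复作答的答案列表的列表;
    返回多数答案占比 >= threshold 的题目比例。"""
    consistent = 0
    for answers in trials_per_question:
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / len(answers) >= threshold:
            consistent += 1
    return consistent / len(trials_per_question)

# 示例:两道题,各重复作答 10 次
q1 = ["A"] * 10                # 完全一致
q2 = ["A"] * 6 + ["B"] * 4     # 多数答案占 60%
```

阈值取 1.0 对应"严格一致",放宽阈值则对应多数投票意义下的一致性。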

【29】Temporal Preferences in Language Models for Long-Horizon Assistance
标题:长期援助语言模型的时间偏好
链接:https://arxiv.org/abs/2509.09704

作者:ki, Mohammad Naghizadeh, Samaneh Ranjkhah Zonouzaghi, Hossein Setareh
摘要:我们研究语言模型(LM)在跨期选择中是否表现出面向未来或面向当下的偏好,以及这些偏好是否可以被系统地操纵。使用改编自人类实验的协议,我们在时间权衡任务上评估了多个LM,并以一组人类决策者样本为基准。我们引入了一个操作性度量"时间取向可操纵性"(Manipulability of Time Orientation,MTO),定义为LM在面向未来与面向当下的提示之间所显示的时间偏好的变化。在我们的测试中,以推理为核心的模型(例如DeepSeek-Reasoner和grok-3-mini)在面向未来的提示下会选择更晚的选项,但只能部分地针对不同身份或地理位置实现决策个性化。此外,能够正确推理时间取向的模型,会将面向未来的取向内化为自身作为AI决策者的取向。我们讨论了对AI助手的设计启示(此类助手应与异质的长期目标保持一致),并概述了关于个性化情境校准和社会意识部署的研究议程。
摘要:We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM's revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.

【30】CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor
标题:CTCC:通过交叉转向上下文相关后门的大型语言模型稳健且隐蔽的指纹识别框架
链接:https://arxiv.org/abs/2509.09703

作者:u, Xixiang Zhao, Xubin Yue, Shengwei Tian, Changting Lin, Meng Han
备注:Accepted by EMNLP2025 MainConference
摘要:大型语言模型(LLM)的广泛部署加剧了人们对知识产权(IP)保护的担忧,因为模型盗窃和未经授权的重新分发变得越来越可行。为了解决这个问题,模型指纹旨在将可验证的所有权跟踪嵌入到LLM中。然而,现有的方法面临着隐蔽性,鲁棒性和可推广性之间的固有权衡,要么通过分布变化可检测,容易受到对抗性修改的影响,要么一旦指纹被发现就很容易失效。在这项工作中,我们介绍了CTCC,一种新的规则驱动的指纹识别框架,编码跨多个对话轮的上下文相关性,如反事实,而不是依赖于令牌级或单轮触发。CTCC支持黑盒访问下的指纹验证,同时减少误报和指纹泄漏,支持在共享语义规则下的连续构建,即使部分触发器被暴露。跨多个LLM架构的广泛实验表明,CTCC始终实现更强的隐身性和鲁棒性比以前的工作。我们的研究结果将CTCC定位为现实世界LLM部署场景中所有权验证的可靠实用解决方案。我们的代码和数据可在 上公开获取。
摘要:The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at .

【31】Creativity Benchmark: A benchmark for marketing creativity for LLM models
标题:创意基准:LLM模型营销创意的基准
链接:https://arxiv.org/abs/2509.09702

作者:t, Kieran Browne, Pip Bingemann
备注:30 Pages, 14 figures
摘要:我们介绍了Creativity Benchmark,一个面向营销创意的大型语言模型(LLM)评估框架。该基准涵盖100个品牌(12个类别)和三种提示类型(洞察、想法、疯狂想法)。678名从业创意人员在11,012次匿名比较中给出的人类成对偏好,经Bradley-Terry模型分析后显示,各模型性能紧密聚集,没有任何模型在所有品牌或提示类型上占据主导:最高与最低评级之差为$\Delta\theta \approx 0.45$,对应约$0.61$的两两对决获胜概率;评级最高的模型仅在约$61\%$的对决中击败评级最低的模型。我们还使用余弦距离分析模型多样性,以捕获模型内和模型间的差异以及对提示重构的敏感性。将三种以LLM为评委的设置与人类排名进行比较,揭示出微弱且不一致的相关性以及评委特有的偏见,强调自动评委无法替代人类评估。传统的创造力测试也只能部分迁移到品牌约束的任务上。总体而言,研究结果强调了对专家人工评估和多样性感知工作流程的需求。
摘要:We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
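摘要中"$\Delta\theta \approx 0.45$ 对应约 $0.61$ 的两两对决获胜概率"可由 Bradley-Terry 模型直接验证:在对数强度参数化下,较强一方的获胜概率是强度差的 logistic 函数。示意如下(仅为复核该数值关系):

```python
import math

def bt_win_prob(delta_theta: float) -> float:
    """Bradley-Terry 模型:强度差为 delta_theta 的两模型对决中,
    较强一方的获胜概率 = sigmoid(delta_theta)。"""
    return 1.0 / (1.0 + math.exp(-delta_theta))

p = bt_win_prob(0.45)   # 约 0.61,与摘要报告一致
```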

【32】Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors
标题:参数内的角色:使用低秩适配器微调小型语言模型以模仿用户行为
链接:https://arxiv.org/abs/2509.09689

作者:Thakur, Eshani Agrawal, Smruthi Mukund
摘要:开发精确推荐模型的一个长期挑战是模拟用户行为,这主要源于用户交互的复杂性和随机性。为此,一条有前景的工作路线是使用大型语言模型(LLM)来模拟用户行为。然而,要将这些通用的大型预训练模型与用户偏好对齐,需要:(i)有效且持续地解析大规模表格式用户-物品交互数据;(ii)克服预训练引入的归纳偏置,以准确学习用户特定的知识;(iii)为数百万用户大规模实现前两点。虽然以往大多数工作都聚焦于用复杂方法提示LLM或在表格交互数据集上对其微调,我们的方法则将重点转向使用冻结的LLM提取鲁棒的文本用户表示,并用微调后的小语言模型(SLM)驱动具有成本效益、资源高效的用户代理进行模拟。此外,我们展示了一种为用户组(即"人物角色",persona)训练多个低秩适配器的方法,在用户行为代理的可扩展性与性能之间取得最佳平衡。我们的实验为方法的有效性提供了令人信服的经验证据,表明用我们的方法开发的用户代理有潜力弥合推荐系统离线指标与真实世界性能之间的差距。
摘要:A long-standing challenge in developing accurate recommendation models is simulating user behavior, mainly due to the complex and stochastic nature of user interactions. Towards this, one promising line of work has been the use of Large Language Models (LLMs) for simulating user behavior. However, aligning these general-purpose large pre-trained models with user preferences necessitates: (i) effectively and continuously parsing large-scale tabular user-item interaction data, (ii) overcoming pre-training-induced inductive biases to accurately learn user specific knowledge, and (iii) achieving the former two at scale for millions of users. While most previous works have focused on complex methods to prompt an LLM or fine-tune it on tabular interaction datasets, our approach shifts the focus to extracting robust textual user representations using a frozen LLM and simulating cost-effective, resource-efficient user agents powered by fine-tuned Small Language Models (SLMs). Further, we showcase a method for training multiple low-rank adapters for groups of users or \textit{persona}, striking an optimal balance between scalability and performance of user behavior agents. Our experiments provide compelling empirical evidence of the efficacy of our methods, demonstrating that user agents developed using our approach have the potential to bridge the gap between offline metrics and real-world performance of recommender systems.
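摘要中为每个用户组("人物角色")训练的低秩适配器,其前向计算形如 y = W x + B(A x),其中基础权重 W 冻结,仅低秩矩阵 A、B 可训练。以下是一个纯 Python 的维度示意(矩阵取值均为虚构,仅说明计算结构):

```python
def matvec(M, v):
    """矩阵-向量乘法(嵌套列表表示)。"""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """LoRA 前向:y = W x + scale * B (A x)。
    W 冻结共享;每个人物角色只需各自存储低秩矩阵 A(r x d)与 B(d x r)。"""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

# 示例:d=2, r=1
W = [[1.0, 0.0], [0.0, 1.0]]   # 冻结的基础权重
A = [[1.0, 1.0]]               # 1 x 2
B = [[1.0], [1.0]]             # 2 x 1
y = lora_forward([1.0, 2.0], W, A, B)
```

由于每个角色只增加 2rd 个参数,数百万用户按角色分组后的存储与训练开销都很小,这正是该方法可扩展性的来源。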

Transformer(2篇)

【1】WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
标题:WhisTLE:预训练语音识别转换器的深度监督、纯文本域自适应
链接:https://arxiv.org/abs/2509.10452

作者:ndey, Karun Kumar, Raphael Tang
备注:5 pages, 2 figures
摘要:预训练的自动语音识别(ASR)模型(如Whisper)表现良好,但仍需进行领域自适应以处理未见过的词汇和用语。在许多现实场景中,收集语音数据并不现实,因此需要纯文本自适应。我们提出了WhisTLE,一种针对预训练编码器-解码器ASR模型的深度监督纯文本自适应方法。WhisTLE训练一个变分自编码器(VAE)从文本建模编码器输出,并使用学到的文本到潜变量编码器微调解码器,还可选择性地与文本到语音(TTS)自适应相结合。在推理时恢复原始编码器,不产生额外的运行时开销。在四个域外数据集和四个ASR模型上,结合TTS的WhisTLE相对于仅TTS自适应将词错误率(WER)降低了12.3%,并在32个场景中的27个场景中优于所有非WhisTLE基线。
摘要:Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.

【2】!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
标题:!MSA在BAREC 2025共享任务:集成阿拉伯语Transformer进行可读性评估
链接:https://arxiv.org/abs/2509.10040

作者:asem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
备注:10 Pages , 8 figures , ArabicNLP 2025 , Co-located with EMNLP 2025
摘要:我们介绍!MSA在BAREC 2025细粒度阿拉伯语可读性评估共享任务中的获奖系统,在全部六个赛道中均获得第一名。我们的方法是四个互补Transformer模型(AraBERTv2、AraELECTRA、MARBERT和CAMeLBERT)的置信度加权集成,每个模型都用不同的损失函数微调,以捕获不同的可读性信号。为了解决严重的类别不平衡和数据稀缺问题,我们采用了加权训练、高级预处理、用最强模型重新标注SAMER语料库,并通过Gemini 2.5 Flash生成合成数据,补充了约10,000个稀有等级样本。一个有针对性的后处理步骤校正了预测分布偏斜,带来6.3%的二次加权Kappa(QWK)提升。我们的系统在句子级别达到87.5%的QWK,在文档级别达到87.4%,证明了模型与损失多样性、置信度感知融合以及智能数据增强对于稳健的阿拉伯语可读性预测的作用。
摘要:We present MSAs winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment, achieving first place in six of six tracks. Our approach is a confidence-weighted ensemble of four complementary transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT) each fine-tuned with distinct loss functions to capture diverse readability signals. To tackle severe class imbalance and data scarcity, we applied weighted training, advanced preprocessing, SAMER corpus relabeling with our strongest model, and synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level samples. A targeted post-processing step corrected prediction distribution skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level, demonstrating the power of model and loss diversity, confidence-informed fusion, and intelligent augmentation for robust Arabic readability prediction.
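该摘要以二次加权Kappa(QWK)报告成绩。QWK按标签差距的平方加权惩罚不一致,其标准计算可用如下纯Python草图说明(示例数据为虚构):

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    # 观测混淆矩阵 O
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1.0
    n = len(y_true)
    hist_t = [sum(row) for row in O]  # 真实标签分布
    hist_p = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]  # 预测分布
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = ((i - j) ** 2) / ((n_classes - 1) ** 2)  # 二次权重
            e = hist_t[i] * hist_p[j] / n                # 期望矩阵 E 的元素
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den

# 完全一致的预测得到 QWK = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4))  # 1.0
```

对角线上的权重为零,因此只要预测与标签完全一致,QWK即为1;预测偏离越远,惩罚按平方增长。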

GAN|生成相关(2篇)

【1】CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
标题:CMHG:中国少数民族语言头条生成的数据集和基准
链接:https://arxiv.org/abs/2509.09990

作者:u, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
摘要:中国的少数民族语言,如藏语、维吾尔语和传统蒙古语,由于其不同于国际标准的独特书写系统而面临重大挑战。这种差异导致相关语料库严重匮乏,对于标题生成等监督任务尤其如此。为了填补这一空白,我们引入了一个新数据集——中国少数民族标题生成(CMHG),包括100,000个藏语条目,以及维吾尔语和蒙古语各50,000个条目,专门为标题生成任务整理。此外,我们提出了一个由母语者标注的高质量测试集,旨在作为该领域未来研究的基准。我们希望这个数据集能够成为推动中国少数民族语言标题生成的宝贵资源,并有助于相关基准的发展。
摘要:Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.

【2】HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
标题:HANRAG:用于多跳问答的启发式准确抗噪检索增强生成
链接:https://arxiv.org/abs/2509.09713

作者:n, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
摘要:检索增强生成(RAG)方法通过将信息检索(IR)技术与大型语言模型(LLM)相结合,增强了问答系统和对话生成任务。这种从外部知识库中检索信息以增强生成模型响应能力的策略已经取得了一定的成功。然而,目前的RAG方法在处理多跳查询时仍然面临许多挑战。例如,一些方法过度依赖迭代检索,在复合查询上浪费了过多的检索步骤。另外,直接使用原始的复杂查询进行检索,可能无法捕获与特定子查询相关的内容,从而引入嘈杂的检索内容;若不对噪声加以管理,还会导致噪声累积的问题。为了解决这些问题,我们提出了HANRAG,一种新颖的基于启发式的框架,旨在高效地解决不同复杂度的问题。在一个强大的revelator的驱动下,HANRAG对查询进行路由,将其分解为子查询,并从检索到的文档中过滤噪声。这增强了系统的适应性和抗噪性,使其能够很好地处理各类查询。我们在多个基准上将所提框架与其他领先的业界方法进行了比较。结果表明,我们的框架在单跳和多跳问答任务中均取得了优异的性能。
摘要:The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
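HANRAG的"路由—分解—滤噪"流程可以用一个极简草图示意;其中的路由规则、分解方式与按词重叠度打分的滤噪均为本文为说明而作的假设,并非论文实现:

```python
def route(query):
    # 假设的路由规则:含连接词的复合问题走分解分支,否则直接检索
    return "decompose" if " and " in query or "以及" in query else "direct"

def decompose(query):
    # 假设:按连接词拆分为子查询
    return [q.strip() for q in query.replace("以及", " and ").split(" and ")]

def filter_noise(sub_query, docs, min_overlap=1):
    # 假设的滤噪:按与子查询的词重叠度过滤检索结果
    terms = set(sub_query.lower().split())
    return [d for d in docs if len(terms & set(d.lower().split())) >= min_overlap]

docs = ["Paris is the capital of France",
        "Tokyo is the capital of Japan",
        "Bananas are yellow"]
query = "capital of France and capital of Japan"
assert route(query) == "decompose"
for sq in decompose(query):
    print(sq, "->", filter_noise(sq, docs, min_overlap=3))
```

逐子查询检索再滤噪,正对应摘要所说的"用原始复杂查询检索会漏掉子查询相关内容、引入噪声"的问题。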

QA|VQA|问答|对话(2篇)

【1】Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations
标题:不一致的积极情绪:当校准失当的积极情绪破坏在线支持性对话时
链接:https://arxiv.org/abs/2509.10184

作者:jed, Abeer ALdayel
备注:This paper is under review
摘要:在情感支持性对话中,善意的积极表达有时会适得其反,导致让人觉得轻蔑、淡化问题或盲目乐观的回复。我们研究这种"不一致的积极性"现象,即人类和LLM生成的回复中对积极支持的校准失当表达。为此,我们从Reddit收集了一系列不同情感强度的真实用户-助手对话,并使用大型语言模型为相同的上下文生成了额外的回复。我们将这些对话按强度分为两个级别:轻度,涵盖关系紧张和一般建议;重度,涵盖悲伤和焦虑话题。这一分级使我们能够比较分析支持性回复在低风险和高风险语境下的差异。分析表明,LLM更容易通过轻蔑和淡化的语气表现出不切实际的积极性,尤其是在高风险语境中。为了进一步研究这一现象的潜在维度,我们在具有强烈和微弱情绪反应的数据集上对LLM进行了微调。此外,我们开发了一个弱监督多标签分类器集成(DeBERTa和MentalBERT),在两类关注级别(轻度和重度)上改进了对不一致积极性类型的检测。我们的研究结果表明,需要超越仅仅生成泛泛的积极回复,转而研究一致的支持措施,以在积极情感与情绪承认之间取得平衡。这一方法为使大型语言模型符合在线支持性对话中的情感期望提供了见解,为上下文感知且保持信任的在线对话系统铺平了道路。
摘要:In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower and higher stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in the online supportive dialogue, paving the way toward context-aware and trust preserving online conversation systems.

【2】Towards Reliable and Interpretable Document Question Answering via VLMs
标题:通过VLM实现可靠且可解释的文档问题解答
链接:https://arxiv.org/abs/2509.10129

作者:hen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
摘要:视觉语言模型(VLM)在文档理解方面表现出强大的能力,特别是在从复杂文档中识别和提取文本信息方面。尽管如此,在文档中准确定位答案仍然是一个重大挑战,限制了可解释性和现实应用。为了解决这个问题,我们引入了\textit{DocExplainerV0},一个即插即用的边界框预测模块,将答案生成与空间定位解耦。这一设计使其适用于现有的VLM,包括无法微调的专有系统。通过系统性评估,我们对文本准确性与空间接地之间的差距给出了定量洞察,表明正确的答案往往缺乏可靠的定位。我们的标准化框架突出了这些缺点,并为未来研究更可解释、更鲁棒的文档信息提取VLM建立了一个基准。
摘要:Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce \textit{DocExplainerV0}, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.

机器翻译(3篇)

【1】Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
标题:通过证明步骤的非形式化和沿证明结构的递归总结进行形式证明的自然语言翻译
链接:https://arxiv.org/abs/2509.09726

作者:tori, Takuya Matsuzaki, Makoto Fujiwara
备注:Submitted to INLG 2025 (accepted)
摘要:本文提出了一种用于机器可验证形式证明的自然语言翻译方法,该方法利用了LLM的非形式化(对形式语言证明步骤的文字化表述)和总结能力。在评估中,我们将该方法应用于依据本科水平教科书中的自然语言证明构建的形式证明数据,并将生成的自然语言证明与原始自然语言证明进行对比以分析其质量。此外,我们通过将该方法应用于Lean证明助手的现有形式证明库,证明它能够输出可读性高且准确的自然语言证明。
摘要:This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we will demonstrate that this method can output highly readable and accurate natural language proofs by applying it to existing formal proof library of the Lean proof assistant.

【2】Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
标题:语音翻译任务中正则化视界处的最优多任务学习
链接:https://arxiv.org/abs/2509.09701

作者:ng, Junhyun Lee
摘要:端到端的语音到文本翻译通常受到配对语音-文本数据稀缺的困扰。克服这一不足的一种方法是利用机器翻译(MT)任务中的双语文本数据并进行多任务学习(MTL)。在本文中,我们从正则化的角度对MTL进行建模,并探索如何在模态内和跨模态对序列进行正则化。通过深入研究一致性正则化(不同模态)和R-drop(相同模态)的效果,我们展示了它们各自对总正则化的贡献。我们还证明MT损失的系数是MTL设置中的另一个正则化来源。基于这三种正则化来源,我们在高维空间中引入了最优正则化轮廓,称为正则化视界。实验表明,在正则化视界内调整超参数可在MuST-C数据集上取得接近最先进的性能。
摘要:End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset.
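摘要中三种正则化来源(跨模态一致性、同模态R-drop、MT损失系数)合成总损失的方式,可用如下假设性的组合草图表示(各系数取值仅为示例,并非论文设置):

```python
def total_loss(st_loss, mt_loss, consistency_loss, rdrop_loss,
               mt_coef=0.5, cons_coef=1.0, rdrop_coef=1.0):
    # st_loss:            语音→文本主任务损失
    # mt_coef * mt_loss:  MT 辅助任务项,其系数本身就是一个正则化来源
    # consistency_loss:   语音/文本两种模态输出间的一致性正则(跨模态)
    # rdrop_loss:         同一输入两次 dropout 前向之间的 KL 正则(R-drop,同模态)
    return (st_loss
            + mt_coef * mt_loss
            + cons_coef * consistency_loss
            + rdrop_coef * rdrop_loss)

print(total_loss(2.0, 1.0, 0.4, 0.2, mt_coef=0.5))  # 3.1
```

在这一视角下,"正则化视界"即是在 (mt_coef, cons_coef, rdrop_coef) 构成的超参数空间中表现最优的轮廓。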

【3】Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation
标题:面向流程挖掘领域的文本到SQL:用于查询翻译的PT-EN数据集
链接:https://arxiv.org/abs/2509.09684

作者: Yamate, Thais Rodrigues Neubauer, Marcelo Fantinato, Sarajane Marques Peres
备注:33 pages
摘要:本文介绍了text-2-SQL-4-PM,一个双语(葡萄牙语-英语)基准数据集,专为过程挖掘领域的文本到SQL任务。文本到SQL的转换促进了数据库的自然语言查询,提高了没有SQL专业知识的用户的可访问性和专家的生产力。text-2-SQL-4-PM数据集是定制的,以解决流程挖掘的独特挑战,包括从事件日志派生的专用词汇表和单表关系结构。该数据集包括1,655个自然语言语句,包括人类生成的释义,205个SQL语句和10个限定词。方法包括专家的手动管理,专业翻译和详细的注释过程,以实现对任务复杂性的细致分析。此外,使用GPT-3.5 Turbo的基线研究证明了该数据集用于文本到SQL应用程序的可行性和实用性。结果表明,text-2-SQL-4-PM支持文本到SQL实现的评估,为语义解析和其他自然语言处理任务提供了更广泛的适用性。
摘要:This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those that are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks.
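下面用标准库sqlite3构造一个假设的单表事件日志,示意流程挖掘场景下text-to-SQL所针对的查询形态(表结构、语句与数据均为本文示例,并非数据集原文):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT, ts TEXT)")
conn.executemany("INSERT INTO event_log VALUES (?, ?, ?)", [
    ("c1", "register", "2024-01-01"),
    ("c1", "approve",  "2024-01-02"),
    ("c2", "register", "2024-01-03"),
])

# 自然语言:"每个案例包含多少个事件?" 对应的 SQL:
rows = conn.execute(
    "SELECT case_id, COUNT(*) FROM event_log GROUP BY case_id ORDER BY case_id"
).fetchall()
print(rows)  # [('c1', 2), ('c2', 1)]
```

单表、按case_id聚合正是摘要所述"从事件日志派生的单表关系结构"带来的典型查询模式。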

语义分析(2篇)

【1】Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery
标题:查询气候知识:科学发现的语义检索
链接:https://arxiv.org/abs/2509.10087

作者:Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
备注:ACM SIGIR 2025 Workshop MANILA
摘要:气候科学文献的复杂性和数量不断增加,使得研究人员越来越难以在模型、数据集、区域和变量中找到相关信息。本文介绍了一个基于气候出版物和更广泛的科学文本构建的特定领域知识图谱(KG),旨在改善气候知识的获取和使用方式。与基于关键字的搜索不同,我们的KG支持结构化的语义查询,帮助研究人员发现精确的连接,例如哪些模型已在特定地区得到验证,或者哪些数据集通常与某些遥相关模式一起使用。我们演示了KG如何使用Cypher查询回答这些问题,并概述了其与RAG系统中的大型语言模型的集成,以提高气候相关问题回答的透明度和可靠性。这项工作超越了KG构建,为气候研究人员,模型开发人员和其他依赖准确,上下文科学信息的人展示了其现实世界的价值。
摘要:The growing complexity and volume of climate science literature make it increasingly difficult for researchers to find relevant information across models, datasets, regions, and variables. This paper introduces a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts, aimed at improving how climate knowledge is accessed and used. Unlike keyword based search, our KG supports structured, semantic queries that help researchers discover precise connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. We demonstrate how the KG answers such questions using Cypher queries, and outline its integration with large language models in RAG systems to improve transparency and reliability in climate-related question answering. This work moves beyond KG construction to show its real world value for climate researchers, model developers, and others who rely on accurate, contextual scientific information.
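文中"哪些模型已在特定地区得到验证"这类结构化语义查询,可用一个基于三元组列表的迷你知识图谱在Python中示意(实体与关系均为虚构;实际系统按摘要所述使用Cypher查询图数据库):

```python
# (主语, 关系, 宾语) 三元组构成的迷你气候知识图谱(内容为虚构示例)
triples = [
    ("ModelA", "validated_in", "Sahel"),
    ("ModelB", "validated_in", "Arctic"),
    ("ModelA", "uses_dataset", "ERA5"),
]

def query(triples, relation, obj):
    # 近似于 Cypher 的 MATCH (m)-[:validated_in]->(:Region {name: obj}) RETURN m
    return [s for s, r, o in triples if r == relation and o == obj]

print(query(triples, "validated_in", "Sahel"))  # ['ModelA']
```

与关键词搜索不同,这类查询沿显式关系边匹配,因此能精确回答"验证于某地区"而不是仅仅"同时提到两个词"。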

【2】How Small Transformation Expose the Weakness of Semantic Similarity Measures
标题:小转换如何暴露语义相似性指标的弱点
链接:https://arxiv.org/abs/2509.09714

作者:nel Nikiema, Albérick Euraste Djire, Abdoul Aziz Bonkoungou, Micheline Bénédicte Moumoula, Jordan Samhi, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande
摘要:这项研究考察了不同方法衡量语义相似性的能力,这对代码搜索、API推荐、自动代码审查和重构工具等各种软件工程应用都很重要。虽然大型语言模型越来越多地被用于这类相似性评估,但它们究竟是真正理解语义关系,还是仅仅识别表面模式,仍然存疑。该研究测试了18种不同的相似性度量方法,包括基于单词的方法、嵌入技术、基于LLM的系统和结构感知算法。研究者构建了一个系统性测试框架,对文本和代码施加受控的变换,以评估每种方法处理不同类型语义关系的能力。结果揭示了常用指标的重大问题。一些基于嵌入的方法将语义对立面错误判定为相似的比例高达99.9%,而某些基于Transformer的方法偶尔会把相反含义评为比同义含义更相似。研究发现,嵌入方法表现不佳往往源于其距离计算方式:从欧氏距离切换到余弦相似度可使结果提升24%到66%。基于LLM的方法在区分语义差异方面表现更好,对真正不同的含义给出较低的相似度分数(0.00至0.29),而嵌入方法则会错误地给不相似的内容打出高分(0.82至0.99)。
摘要:This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods' poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
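摘要指出"从欧氏距离切换到余弦相似度"带来24%到66%的提升。下面的小例子说明其原因:嵌入范数的差异会主导欧氏距离,而余弦相似度只比较方向(向量为构造的玩具示例):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

a = [1.0, 0.0]    # 某句子的嵌入(示例)
b = [10.0, 0.0]   # 语义相同但范数更大的嵌入
c = [0.0, 1.0]    # 语义无关的嵌入

# 欧氏距离误导:a 反而离无关的 c 更"近"(约 1.41 对 9.0)
print(euclidean(a, b), euclidean(a, c))
# 余弦相似度正确:a 与 b 同向(1.0),与 c 正交(0.0)
print(cosine(a, b), cosine(a, c))
```

当嵌入范数携带与语义无关的信息(如句长)时,这种失真在真实嵌入空间中同样会出现。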

Graph|知识图谱|Knowledge(4篇)

【1】DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
标题:DeepDive:利用知识图和多转向RL推进深度搜索代理
链接:https://arxiv.org/abs/2509.10446

作者:henyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
摘要:为大型语言模型(LLM)配备浏览工具,可以大大提升它们作为深度搜索智能体解决复杂现实任务的潜力。然而,由于结合浏览工具的长程推理能力有限,且缺乏足够困难的监督数据,开源LLM在这类场景中仍然表现不佳。为了应对这些挑战,我们提出DeepDive来推进深度搜索智能体。首先,我们提出一种策略,从开放知识图谱中自动合成复杂、困难且难以查找的问题。其次,我们应用端到端的多轮强化学习(RL)来增强LLM结合深度搜索的长程推理。实验表明,DeepDive-32B在BrowseComp上取得了新的开源竞争性结果,优于WebSailor、DeepSeek-R1-Browse和Search-o1。我们证明了多轮RL训练提升了深度搜索能力,并显著促进了多个基准上的性能提升。我们还观察到,DeepDive支持工具调用和并行采样的测试时扩展。所有数据集、模型和代码均可在https://github.com/THUDM/DeepDive公开获取。
摘要:Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
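"从开放知识图谱自动合成多跳问题"的思路,可用一条沿关系链游走并套模板的草图示意(图谱、模板与游走方式均为本文虚构的说明性示例,并非论文的合成流程):

```python
import random

# 虚构的迷你知识图谱:实体 -> [(关系, 目标实体)]
kg = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}

def synthesize_question(kg, start, hops, seed=0):
    # 从 start 出发沿关系游走 hops 步,终点实体即为问题答案
    random.seed(seed)
    path, node = [], start
    for _ in range(hops):
        rel, node = random.choice(kg[node])
        path.append(rel)
    # 将关系链套入一个简单的问题模板(模板为示意)
    chain = " 的 ".join(path)
    return f"{start} 的 {chain} 是什么?", node

question, answer = synthesize_question(kg, "Marie Curie", 2)
print(question, "答案:", answer)
```

游走步数越多,问题的"跳数"越多、越难以通过单次检索直接找到,正适合作为深度搜索的训练监督。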

【2】SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
标题:SI-FACT:通过自改进的忠实度感知对比微调缓解知识冲突
链接:https://arxiv.org/abs/2509.10208

作者:g Fu
摘要:大型语言模型在知识密集型任务中常因知识冲突(即倾向于依赖内部参数化知识而非给定上下文)而生成不忠实的回复。针对这一问题,我们提出了一种新的自改进框架SI-FACT(Self-Improving Faithfulness-Aware Contrastive Tuning)。该框架采用自指令机制,使基础LLM能够自动生成高质量的结构化对比学习数据,包括锚样本、语义等价的正样本和模拟不忠实场景的负样本,从而大幅降低人工标注成本。随后应用对比学习训练模型,使其在表示空间中拉近忠实回复、推远不忠实回复。在知识冲突评估基准ECARE KRE和COSE KRE上的实验表明,基于Llama3 8B Instruct的SI-FACT模型将上下文召回率(Contextual Recall Rate)较最佳基线方法提升了6.2%,同时显著降低了对内部记忆的依赖。结果表明,SI-FACT在提高LLM的上下文忠实度方面有效且数据高效,为构建更主动、更可信的语言模型提供了一条切实可行的途径。
摘要:Large Language Models often generate unfaithful responses in knowledge intensive tasks due to knowledge conflict,that is,a preference for relying on internal parametric knowledge rather than the provided context.To address this issue,we propose a novel self improving framework,Self Improving Faithfulness Aware Contrastive Tuning.The framework uses a self instruct mechanism that allows the base LLM to automatically generate high quality,structured contrastive learning data,including anchor samples,semantically equivalent positive samples,and negative samples simulating unfaithful scenarios.This approach significantly reduces the cost of manual annotation.Subsequently,contrastive learning is applied to train the model,enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space.Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method,while significantly reducing dependence on internal memory.The results indicate that SI FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs,offering a practical pathway toward building more proactive and trustworthy language models.
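SI-FACT"在表示空间中拉近忠实回复、推远不忠实回复"的训练目标,可用一个假设的triplet margin损失草图示意(嵌入为二维玩具向量;论文实际采用的对比学习目标可能不同):

```python
import math

def dist(u, v):
    # 欧氏距离,作为表示空间中的远近度量(简化示例)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # 要求 anchor 离正样本(忠实回复)比离负样本(不忠实回复)至少近 margin
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor   = [0.0, 0.0]   # 锚样本的嵌入
positive = [0.1, 0.0]   # 语义等价的忠实正样本
negative = [2.0, 0.0]   # 模拟不忠实场景的负样本
print(triplet_loss(anchor, positive, negative))  # 0.0:已满足间隔,无梯度
```

当负样本离锚不够远时损失为正,梯度便会把不忠实表示推离、把忠实表示拉近。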

【3】Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs
标题:结构化信息很重要:使用患者级知识图进行可解释的ICD编码
链接:https://arxiv.org/abs/2509.09699

作者:Li, Viktor Schlegel, Tingting Mu, Warren Del-Pinto, Goran Nenadic
摘要:将临床文档映射到标准化的临床词汇表是一项重要任务,它为信息检索和分析提供了结构化数据,这对临床研究、医院管理和改善患者护理至关重要。然而,手动编码既困难又耗时,难以规模化。自动编码有望减轻这种负担,提高结构化临床数据的可用性和准确性。这项任务难以自动化,因为它需要映射到高维且长尾的目标空间,如国际疾病分类(ICD)。虽然外部知识源已被广泛用于增强输出编码的表示,但利用外部资源来表示输入文档仍然探索不足。在这项工作中,我们利用文档级知识图谱(KG)计算输入文档的结构化表示,提供患者病情的全面结构化视图。所得知识图谱仅用原始文本23%的篇幅就高效表示了以患者为中心的输入文档,同时保留了90%的信息。我们将其集成到最先进的ICD编码架构PLM-ICD中,以评估该图谱用于自动ICD-9编码的有效性。我们的实验在常用基准上将Macro-F1分数提升了至多3.20%,同时提高了训练效率。我们将这一改进归因于KG中不同类型的实体和关系,并证明了该方法相对于纯文本基线具有更好的可解释性潜力。
摘要:Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient's condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23\% of the original text while retaining 90\% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20\% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline.

【4】AI-Powered Assistant for Long-Term Access to RHIC Knowledge
标题:人工智能支持的长期获取RHIC知识的助理
链接:https://arxiv.org/abs/2509.09688

作者:Atif, Vincent Garonne, Eric Lancon, Jerome Lauret, Alexandr Prozorov, Michal Vranovsky
摘要:随着布鲁克海文国家实验室的相对论重离子对撞机(RHIC)结束了25年的运行,保护其庞大的数据(1 ExaByte)和嵌入的科学知识成为一个关键的优先事项。RHIC数据和分析保存计划(DAPP)引入了一个人工智能辅助系统,该系统提供对文档、工作流程和软件的自然语言访问,旨在支持再现性、教育和未来发现。该助手基于使用检索增强生成和模型上下文协议的大型语言模型,对RHIC实验中的结构化和非结构化内容进行索引,并实现领域适应性交互。我们报告了部署,计算性能,正在进行的多实验集成以及为可持续和可解释的长期AI访问而设计的架构功能。我们的经验说明了现代AI/ML工具如何改变科学遗留数据的可用性和可重复性。
摘要:As the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory concludes 25 years of operation, preserving not only its vast data holdings ($\sim$1 ExaByte) but also the embedded scientific knowledge becomes a critical priority. The RHIC Data and Analysis Preservation Plan (DAPP) introduces an AI-powered assistant system that provides natural language access to documentation, workflows, and software, with the aim of supporting reproducibility, education, and future discovery. Built upon Large Language Models using Retrieval-Augmented Generation and the Model Context Protocol, this assistant indexes structured and unstructured content from RHIC experiments and enables domain-adapted interaction. We report on the deployment, computational performance, ongoing multi-experiment integration, and architectural features designed for a sustainable and explainable long-term AI access. Our experience illustrates how modern AI/ML tools can transform the usability and discoverability of scientific legacy data.

推理|分析|理解|解释(3篇)

【1】Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
标题:绑架,行动,预测:多智能体系统中自动故障归因的脚手架因果推理
链接:https://arxiv.org/abs/2509.10401

作者:, Yixuan Weng, Minjun Zhu, Zhen Lin, Yue Zhang
摘要:多智能体系统中的故障归因,即精确定位发生决定性错误的确切步骤,是一个关键但尚未解决的挑战。目前的方法将其视为长对话日志上的模式识别任务,导致步骤级准确率极低(低于17%),使其难以用于复杂系统的调试。其核心弱点是根本无法进行稳健的反事实推理:即判断纠正某一个动作是否真的能避免任务失败。为了弥合这一反事实推理差距,我们提出了"溯因-行动-预测"(Abduct-Act-Predict,A2P)脚手架,一种新颖的智能体框架,它将故障归因从模式识别转化为结构化的因果推理任务。A2P在单次推理过程中明确引导大型语言模型完成一个正式的三步推理流程:(1)溯因,推断智能体行为背后隐藏的根本原因;(2)行动,定义最小的纠正性干预;(3)预测,模拟后续轨迹并验证该干预是否解决了故障。这种结构化方法利用整个对话的整体上下文,同时对模型的分析施加严格的因果逻辑。我们在Who&When基准上的大量实验证明了其有效性。在Algorithm-Generated数据集上,A2P达到47.46%的步骤级准确率,相比基线的16.67%提升了2.85倍。在更复杂的Hand-Crafted数据集上,它达到29.31%的步骤级准确率,相比基线的12.07%提升了2.43倍。通过从因果视角重新定义问题,A2P脚手架为自动化故障归因提供了一个稳健、可验证且显著更准确的解决方案。
摘要:Failure attribution in multi-agent systems -- pinpointing the exact step where a decisive error occurs -- is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17\%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent's actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model's analysis. Our extensive experiments on the Who\&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46\% step-level accuracy, a 2.85$\times$ improvement over the 16.67\% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31\% step accuracy, a 2.43$\times$ improvement over the baseline's 12.07\%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution.
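A2P把"溯因—行动—预测"编成单次推理中的结构化提示,其脚手架形态可用一个假设的模板函数示意(措辞为本文示例,并非论文原文):

```python
def a2p_prompt(conversation_log):
    # 在单次推理中引导 LLM 依次完成溯因、行动、预测三步
    return (
        "以下是一个多智能体系统的失败对话日志:\n"
        f"{conversation_log}\n\n"
        "请按三步作答:\n"
        "1. 溯因(Abduction):推断导致失败的隐藏根因及对应步骤;\n"
        "2. 行动(Action):给出最小的纠正性干预;\n"
        "3. 预测(Prediction):模拟干预后的轨迹,验证失败是否被避免。"
    )

prompt = a2p_prompt("step 3: agent_B 调用了错误的 API ...")
print(prompt.splitlines()[0])
```

关键在第三步:只有当模拟显示干预确实避免了失败,被指出的步骤才算通过了反事实检验。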

【2】Unsupervised Hallucination Detection by Inspecting Reasoning Processes
标题:通过检查推理过程进行无监督幻觉检测
链接:https://arxiv.org/abs/2509.10004

作者:Srey, Xiaobao Wu, Anh Tuan Luu
备注:To appear in EMNLP 2025
摘要:无监督幻觉检测旨在不依赖标注数据的情况下,识别大型语言模型(LLM)生成的幻觉内容。虽然无监督方法因省去劳动密集的人工标注而受到欢迎,但它们经常依赖与事实正确性无关的代理信号。这种错位使检测探针偏向表面的或与真实性无关的方面,限制了跨数据集和场景的泛化能力。为克服这些局限,我们提出了IRIS,一个利用与事实正确性内在相关的内部表示的无监督幻觉检测框架。IRIS提示LLM仔细验证给定陈述的真实性,并获取其上下文化嵌入作为训练的信息特征;同时,每个回复的不确定性被视为真实性的软伪标签。实验结果表明,IRIS始终优于现有的无监督方法。我们的方法完全无监督、计算成本低,即使训练数据很少也能表现良好,适合实时检测。
摘要:Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtain its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.
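IRIS将"每个回复的不确定性"作为真实性的软伪标签来训练探针。下面用单特征逻辑回归的纯Python草图示意如何用软标签训练(特征、标签与超参数均为虚构示例,并非论文设置):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (嵌入特征, 软伪标签 = 模型对"陈述为真"的置信度),均为虚构示例
data = [(2.0, 0.9), (1.5, 0.8), (-1.0, 0.2), (-2.0, 0.1)]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    for x, y in data:
        p = sigmoid(w * x + b)
        # 交叉熵对软标签同样适用:梯度为 (p - y)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

# 探针应给正特征较高的"真实"概率,给负特征较低的概率
print(round(sigmoid(w * 2.0 + b), 2), round(sigmoid(w * -2.0 + b), 2))
```

软标签让探针直接拟合不确定性本身,而不是被迫把每个样本二值化成"真/假"。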

【3】Error Analysis in a Modular Meeting Transcription System
标题:模块化会议转录系统中的错误分析
链接:https://arxiv.org/abs/2509.10143

作者:ting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
备注:Accepted at ITG Conference on Speech Communication 2025
摘要:会议转录是近年来进展显著且高度相关的研究领域,但仍存在限制其性能的挑战。在这项工作中,我们扩展了先前提出的用于分析语音分离中泄漏的框架,使其对时间局部性具有适当的敏感性。我们发现,在只有主说话人活跃的区域,存在明显的向交叉通道的泄漏。同时,结果表明这对最终性能影响不大,因为这些泄漏片段在很大程度上被语音活动检测(VAD)忽略。此外,我们比较了不同的分割方式,表明与简单的基于能量的VAD相比,先进的说话人日志(diarization)方法能够将与oracle分割之间的差距缩小三分之一。我们还进一步揭示了造成剩余差距的因素。这些结果代表了仅在LibriSpeech数据上训练识别模块的系统在LibriCSS上的最先进性能。
摘要:Meeting transcription is a field of high relevance and remarkable progress in recent years. Still, challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage to the cross channel in areas where only the primary speaker is active. At the same time, the results demonstrate that this does not affect the final performance much as these leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, different segmentations are compared showing that advanced diarization approaches are able to reduce the gap to oracle segmentation by a third compared to a simple energy-based VAD. We additionally reveal what factors contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only.

检测相关(1篇)

【1】Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
标题:用于细粒度幻觉检测的跨层注意力探测
链接:https://arxiv.org/abs/2509.09700

作者:Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
备注:To be published at the TRUST-AI workshop, ECAI 2025
摘要:随着大型语言模型(LLM)在各种应用中的大规模采用,其生成不准确文本(即幻觉)的倾向引发了日益增长的可靠性担忧。在这项工作中,我们提出了跨层注意力探测(CLAP),一种新颖的用于幻觉检测的激活探测技术,它将整个残差流上的LLM激活作为一个联合序列来处理。我们使用五个LLM和三个任务的实证评估表明,无论是对贪心解码的回复,还是在较高温度下采样的回复,CLAP的幻觉检测都优于基线,从而实现细粒度检测,即能够在给定提示的不同采样回复之间区分幻觉与非幻觉。基于此,我们提出了一种使用CLAP的"先检测后缓解"策略,与直接缓解方法相比,可减少幻觉并提高LLM的可靠性。最后,我们表明即使在分布外应用时,CLAP也能保持高可靠性。
摘要:With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.
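CLAP将整个残差流上的各层激活作为一个联合序列来处理;这一特征构造可用如下草图示意(层数、token数与激活数值均为占位假设,探测器本身从略):

```python
# 假设:2 层残差流,每层对 2 个 token 各给出 3 维激活(数值为占位)
layers = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # 第 1 层的 token 激活
    [[0.2, 0.1, 0.0], [0.3, 0.3, 0.3]],  # 第 2 层的 token 激活
]

def joint_sequence(layers):
    # 将各层的 token 激活沿序列维拼接为一个联合序列,供探测器分类
    return [vec for layer in layers for vec in layer]

seq = joint_sequence(layers)
print(len(seq))  # 4 = 层数 x token 数
```

与只取单层激活的探针不同,联合序列让探测器同时看到各层的证据,这正是"跨层"探测的出发点。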

识别/分类(1篇)

【1】Prominence-aware automatic speech recognition for conversational speech
标题:面向会话语音的突出度感知自动语音识别
链接:https://arxiv.org/abs/2509.10116

作者:nke, Barbara Schuppler
摘要:本文通过将突出度检测与语音识别相结合，研究了面向会话型奥地利德语的突出度感知自动语音识别(ASR)。首先，通过微调wav2vec2模型开发了用于分类词级突出度的突出度检测器。然后使用该检测器在大型语料库中自动标注韵律突出度。基于这些标注，我们训练了能够同时转录单词及其突出度级别的新型突出度感知ASR系统。与我们的基线ASR系统相比，引入突出度信息没有改变识别性能，同时在识别出的单词序列正确的话语上达到了85.53%的突出度检测准确率。本文表明基于transformer的模型能够有效编码韵律信息，是对韵律增强ASR的一项新贡献，在语言学研究和韵律感知对话系统中具有潜在应用。
摘要:This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. First, prominence detectors were developed by fine-tuning wav2vec2 models to classify word-level prominence. The detector was then used to automatically annotate prosodic prominence in a large corpus. Based on those annotations, we trained novel prominence-aware ASR systems that simultaneously transcribe words and their prominence levels. The integration of prominence information did not change performance compared to our baseline ASR system, while reaching a prominence detection accuracy of 85.53% for utterances where the recognized word sequence was correct. This paper shows that transformer-based models can effectively encode prosodic information and represents a novel contribution to prosody-enhanced ASR, with potential applications for linguistic research and prosody-informed dialogue systems.
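摘要提到的"同时转录单词及其突出度级别"可以通过联合目标序列来实现。下面是一个假设的编码格式示意(并非论文的实际格式)，把每个词与其词级突出度拼成单个标签：

```python
def encode(words, levels):
    """把词序列及其词级突出度(0=无,1=中,2=强)编码为联合目标序列。"""
    return " ".join(f"{w}_{l}" for w, l in zip(words, levels))

def decode(target):
    """从联合目标序列还原词序列与突出度序列。"""
    pairs = [tok.rsplit("_", 1) for tok in target.split()]
    return [w for w, _ in pairs], [int(l) for _, l in pairs]

# 假设的会话语音示例:大写仅用于标示被标为突出的词,非真实训练数据
target = encode(["na", "SERVUS", "wie", "GEHTS"], [0, 2, 0, 2])
print(target)  # na_0 SERVUS_2 wie_0 GEHTS_2
```

这样的格式让单个序列到序列模型在一次解码中同时产出词和突出度，评测时再用decode拆开分别计算词错误率与突出度检测准确率。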

Zero/Few/One-Shot|迁移|自适应(1篇)

【1】VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
标题:VStyle：基于口头指令的语音风格适应基准
链接:https://arxiv.org/abs/2509.09716

作者: Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
摘要:口语语言模型(SLM)已经成为语音理解与生成的统一范式，使自然的人机交互成为可能。然而，虽然大多数进展都集中在语义准确性和指令遵循上，SLM根据口头指令调整其说话风格的能力却很少受到关注。我们介绍了语音风格适应(VSA)，这是一个新任务，考察SLM能否按照自然语言口头指令修改其说话风格，例如音色、韵律或人设。为了研究这个任务，我们提出了VStyle，一个双语(中文和英文)基准，涵盖四类语音生成：声学属性、自然语言指令、角色扮演和内隐共情。我们还引入了"大型音频语言模型作为评委"(LALM as a Judge)框架，该框架沿文本忠实性、风格遵循度和自然度逐级评估输出，确保可复现且客观的评估。在商业系统和开源SLM上的实验表明，当前模型在可控风格适应方面存在明显局限，凸显了这项任务的新颖性和挑战性。通过发布VStyle及其评估工具包，我们旨在为社区提供推进以人为本口语交互的基础。数据集和代码可在\href{https://junzhan2000.github.io/VStyle.github.io/}{project's homepage}公开获取。
摘要:Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human centered spoken interaction. The dataset and code are publicly available at \href{https://junzhan2000.github.io/VStyle.github.io/}{project's homepage}.

Word2Vec|文本|单词(1篇)

【1】Whisper Has an Internal Word Aligner
标题:Whisper有一个内部单词对齐器
链接:https://arxiv.org/abs/2509.09987

作者:Yeh, Yen Meng, Hao Tang
备注:ASRU 2025
摘要:从强大的自动语音识别器(特别是Whisper)中获得准确的词级时间戳正受到越来越多的关注。现有方法要么需要额外训练，要么根本不具竞争力。以往工作的评估也相对宽松，通常使用超过200毫秒的容差。在这项工作中，我们发现Whisper中存在能够捕捉准确单词对齐的注意力头，它们与其余注意力头明显不同。此外，我们发现使用字符比使用子词(wordpiece)能产生更精细、更准确的对齐。基于这些发现，我们提出了一种无监督方法：在用字符对Whisper进行教师强制(teacher forcing)的同时筛选注意力头，并从中提取单词对齐。我们的方法不仅无需训练，而且在20毫秒到100毫秒的更严格容差下产生了比先前工作更准确的单词对齐。
摘要:There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
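从筛选出的"对齐头"的交叉注意力中提取词级时间戳，可以按如下思路示意：对教师强制解码出的每个字符取注意力峰值帧并强制单调，再按词聚合为起止时间。矩阵为人工构造的玩具数据，帧长20毫秒等设定均为假设，并非Whisper内部的真实数值：

```python
import numpy as np

rng = np.random.default_rng(1)

# 玩具示例:假设已筛选出某个"对齐头"的交叉注意力矩阵,
# 行=教师强制解码出的字符,列=音频帧(设每帧20ms)
text = "hi you"
n_frames, frame_ms = 40, 20
attn = rng.random((len(text), n_frames)) * 0.1
for i in range(len(text)):             # 人为注入近似对角的注意力脊
    attn[i, 4 + i * 5] += 1.0

def word_timestamps(attn, text, frame_ms):
    """取每个字符注意力峰值帧(强制单调),再按词聚合为起止时间戳(ms)。"""
    frames = np.maximum.accumulate(attn.argmax(axis=1))
    out, start = [], 0
    for w in text.split():
        end = start + len(w) - 1
        out.append((w, int(frames[start]) * frame_ms,
                    int(frames[end]) * frame_ms))
        start = end + 2                # 跳过词间空格字符
    return out

res = word_timestamps(attn, text, frame_ms)
print(res)
```

真实系统中还会对多个对齐头的注意力做平均或DTW寻径，这里的逐字符argmax只是最简化的版本。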

其他神经网络|深度学习|模型|建模(2篇)

【1】Is In-Context Learning Learning?
标题:上下文学习是学习吗?
链接:https://arxiv.org/abs/2509.10414

作者: Wynter
备注:Director's cut
摘要:上下文学习(ICL)允许一些自回归模型通过下一个令牌预测来解决任务，而无需进一步训练。这引出了此类模型仅凭提示中的少量样例(few shots)即可解决(学习)未见过任务的说法。然而，演绎并不总是意味着学习，因为ICL并不显式编码给定的观察。相反，这些模型依赖于它们的先验知识和给定的样例(如果有的话)。我们认为，从数学上讲，ICL确实构成学习，但其完整刻画需要实证工作。随后我们对ICL进行了大规模分析，消融或控制了记忆、预训练、分布偏移以及提示风格和措辞等因素。我们发现ICL是一种有效的学习范式，但其学习和泛化到未见任务的能力有限。我们注意到，在样例数量趋多的极限下，准确率对样例分布、模型、提示风格和输入的语言特征都不敏感。相反，模型从提示中的规律推断模式，这导致了分布敏感性，尤其是在思维链等提示风格中。鉴于形式上相似的任务准确率各异，我们得出结论：自回归的临时性(ad-hoc)编码并不是一种稳健的机制，其通用泛化能力有限。
摘要:In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these model's ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism, and suggests limited all-purpose generalisability.

【2】Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA
标题:基于HaluEval和TruthfulQA考察Gemma模型中幻觉的符号触发因素
链接:https://arxiv.org/abs/2509.09715

作者:mba, Sanju Tiwari, Manas Gaur
摘要:大型语言模型(LLM)中的幻觉是一个研究得很充分的问题。然而，使LLM本质上易产生幻觉的属性尚未被识别和研究。本研究识别并刻画了这些关键属性，使我们能够查明模型内部机制中的漏洞。为了验证这些属性，我们利用了HaluEval和TruthfulQA这两个已有数据集，并将其现有的问答格式转换为多种其他格式，以将这些属性确定为幻觉的成因。我们的研究结果表明，Gemma-2-2B在符号属性上的幻觉百分比非常高，在各任务和数据集上平均为79.0%。随着模型规模的增加，幻觉率在Gemma-2-9B上降至73.6%，在Gemma-2-27B上降至63.9%，总体下降了15个百分点。虽然幻觉率随模型规模增大而降低，但由符号属性引起的大量幻觉依然存在。这在所有Gemma模型和两个数据集上的修饰语(84.76%至94.98%)和命名实体(83.87%至93.96%)上尤为明显。这些发现表明，符号元素持续困扰着模型，揭示了这些LLM处理此类输入方式上的根本弱点，无论其规模如何。
摘要:Hallucination in Large Language Models (LLMs) is a well studied problem. However, the properties that make LLM intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model's internal mechanisms. To solidify on these properties, we utilized two established datasets, HaluEval and TruthfulQA and convert their existing format of question answering into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs--regardless of their scale.

其他(16篇)

【1】Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
标题:使用合成数据扩展阿拉伯医疗聊天机器人:使用合成患者记录增强生成人工智能
链接:https://arxiv.org/abs/2509.10108

作者:an Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
备注:Accepted in AICCSA 2025
摘要:阿拉伯语医疗聊天机器人的发展受到大规模、高质量标注数据集稀缺的严重制约。虽然之前的努力汇编了来自社交媒体的20,000条阿拉伯语患者与医生互动的数据集来微调大型语言模型(LLM)，但模型的可扩展性和泛化性仍然有限。在这项研究中，我们提出了一种可扩展的合成数据增强策略，将训练语料库扩展到100,000条记录。使用先进的生成式AI系统ChatGPT-4o和Gemini 2.5 Pro，我们基于原始数据集的结构生成了80,000个上下文相关且医学上连贯的合成问答对。这些合成样本经过语义过滤、人工验证并集成到训练管道中。我们微调了五个LLM(包括Mistral-7B和AraGPT2)，并使用BERTScore指标和专家驱动的定性评估来评价它们的性能。为了进一步分析合成来源的有效性，我们进行了一项消融研究，独立比较了ChatGPT-4o与Gemini生成的数据。结果表明，ChatGPT-4o数据在所有模型中始终带来更高的F1分数和更少的幻觉。总的来说，我们的研究结果证明了合成增强作为在低资源医疗NLP中增强特定领域语言模型的实用解决方案的可行性，为更具包容性、可扩展性和准确性的阿拉伯语医疗聊天机器人系统铺平了道路。
摘要:The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using advanced generative AI systems ChatGPT-4o and Gemini 2.5 Pro we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in-low resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
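摘要中的"语义过滤"步骤可以用问答对之间的相似度阈值来示意。下面用词袋余弦相似度给出一个最小草图；实际工作更可能使用多语言句向量模型，阈值0.2也只是假设值：

```python
from collections import Counter
import math

def cosine(a, b):
    """两个词袋Counter的余弦相似度。"""
    dot = sum(cnt * b[tok] for tok, cnt in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_filter(pairs, threshold=0.2):
    """保留问题与答案在词袋空间中足够相关的合成问答对。"""
    kept = []
    for q, a in pairs:
        qa_sim = cosine(Counter(q.lower().split()), Counter(a.lower().split()))
        if qa_sim >= threshold:
            kept.append((q, a))
    return kept

# 示意用的英文玩具样本(真实语料为阿拉伯语问答对)
pairs = [
    ("what treats a headache", "aspirin treats a headache"),   # 相关,保留
    ("what treats a headache", "the weather is sunny today"),  # 无关,过滤
]
kept = semantic_filter(pairs)
print(len(kept), kept[0][1])
```

过滤之后再进入人工验证，可大幅减少需要医学专家审阅的明显离题样本。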

【2】VARCO-VISION-2.0 Technical Report
标题:VARCO-VISION-2.0技术报告
链接:https://arxiv.org/abs/2509.10105

作者: Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim
备注:19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities
摘要:我们介绍了VARCO-VISION-2.0，一种面向韩语和英语的开放权重双语视觉语言模型(VLM)，与上一代模型VARCO-VISION-14B相比能力有所提升。该模型支持对文档、图表和表格等复杂输入的多图像理解，并通过同时预测文本内容及其空间位置来提供布局感知(layout-aware)的OCR。模型采用四阶段课程与内存高效技术进行训练，在保留核心语言能力的同时实现了增强的多模态对齐，并通过偏好优化提升了安全性。广泛的基准评估表明，模型在两种语言上均具有强大的空间定位能力和有竞争力的结果，其中14B模型在OpenCompass VLM排行榜上于同规模模型中位列第8。除14B规模的模型外，我们还发布了针对设备端部署优化的1.7B版本。我们相信这些模型将推动双语VLM及其实际应用的发展。VARCO-VISION-2.0的两个变体可在Hugging Face获得：全尺寸的14B模型和轻量级的1.7B模型。
摘要:We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

【3】Linguistic trajectories of bipolar disorder on social media
标题:双相情感障碍在社交媒体上的语言轨迹
链接:https://arxiv.org/abs/2509.10035

作者:ank, Armin Zlomuzica
备注:Pre-print
摘要:语言为双相情感障碍(BD)等情感障碍提供了有价值的标志物，但临床评估的规模仍然有限。相应地，社交媒体(SM)语言分析因其高时间分辨率和纵向跨度而受到重视。在这里，我们介绍了一种确定用户诊断时间的方法，并将其应用于研究从BD诊断前3年到诊断后21年的语言轨迹，同时与报告单极抑郁症(UD)的用户和未受影响的用户(HC)进行对比。我们发现，BD诊断伴随着普遍的语言改变，反映出情绪紊乱、精神共病、药物滥用、住院、躯体共病、不寻常的思维内容和紊乱的思维。我们进一步观察到诊断后二十年间反复出现的情绪相关语言变化，其中明显的12个月周期性提示季节性情绪发作。最后，趋势水平的证据表明，被估计为女性的用户表现出更强的周期性。总之，我们的研究结果为BD急性期和慢性期的语言改变提供了证据，这验证并扩展了近来利用SM对心理健康进行可扩展监测的努力。
摘要:Language provides valuable markers of affective disorders such as bipolar disorder (BD), yet clinical assessments remain limited in scale. In response, analyses of social media (SM) language have gained prominence due to their high temporal resolution and longitudinal scope. Here, we introduce a method to determine the timing of users' diagnoses and apply it to study language trajectories from 3 years before to 21 years after BD diagnosis - contrasted with users reporting unipolar depression (UD) and non-affected users (HC). We show that BD diagnosis is accompanied by pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, unusual thought content, and disorganized thought. We further observe recurring mood-related language changes across two decades after the diagnosis, with a pronounced 12-month periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence suggests an increased periodicity in users estimated to be female. In sum, our findings provide evidence for language alterations in the acute and chronic phase of BD. This validates and extends recent efforts leveraging SM for scalable monitoring of mental health.
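检测摘要所述的"12个月周期性"，一种常见做法是对去均值后的月度语言指标序列做FFT并取幅值峰值。下面用人工构造的20年月度序列示意这一做法(振幅、噪声水平均为假设值，并非论文的实际分析细节)：

```python
import numpy as np

rng = np.random.default_rng(2)

# 玩具数据:诊断后20年(240个月)的某项情绪相关语言指标,
# 内置一个周期为12个月的季节性成分(振幅0.5、噪声0.1均为假设值)
months = np.arange(240)
series = 0.5 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 0.1, 240)

# 去均值后做实数FFT,跳过直流分量,取幅值最大的频率对应的周期
spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(240, d=1.0)          # 单位:周期/月
peak = spectrum[1:].argmax() + 1
period = 1.0 / freqs[peak]
print(f"dominant period: {period:.1f} months")
```

对不规则采样的社交媒体时间序列，实践中也常改用Lomb-Scargle周期图等不要求等间隔采样的方法。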

【4】Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
标题:模拟公众舆论：AI生成合成调查回答在智利案例上的概念验证
链接:https://arxiv.org/abs/2509.09871

作者:onzález-Bustamante, Nando Verelst, Carla Cisternas
备注:Working paper: 18 pages, 4 tables, 2 figures
摘要:大型语言模型(LLM)通过使用合成受访者模拟人类的回答和行为，为调查研究中的方法和应用创新提供了有希望的途径，从而可能减少测量和代表性误差。然而，LLM在多大程度上能恢复题项的总体分布仍不确定，下游应用也有可能复制从训练数据中继承的社会刻板印象和偏见。我们以智利一项公众舆论概率调查中的真实人类回答为基准，评估LLM生成的合成调查回答的可靠性。具体而言，我们对128个"提示-模型-问题"三元组进行基准测试，生成了189,696份合成画像，并在针对128个"问题-子样本"对的荟萃分析中汇总性能指标(即准确率、精确率、召回率和F1分数)，以检验关键社会人口学维度上的偏差。评估涵盖OpenAI的GPT系列和o系列推理模型，以及Llama和Qwen检查点。有三项结果值得注意。首先，合成回答在信任类题项上表现优异(F1分数和准确率均大于0.90)。其次，GPT-4o、GPT-4o-mini和Llama 4 Maverick在此任务上表现相当。第三，合成与人类的对齐度在45-59岁的受访者中最高。总体而言，基于LLM的合成样本能够近似概率样本的回答，但存在显著的题项级异质性。捕捉公众舆论的全部细微差别仍然具有挑战性，需要仔细校准和额外的分布检验，以确保算法保真度并减少误差。
摘要:Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI's GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.

【5】Latency and Token-Aware Test-Time Compute
标题:延迟和令牌感知测试时计算
链接:https://arxiv.org/abs/2509.09864

作者:Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
摘要:推理时扩展已成为提升大型语言模型(LLM)性能的一种强大方法：生成多个候选响应并从中进行选择。然而，现有关于测试时计算动态分配的工作通常只考虑best-of-N等并行生成方法，忽略了波束搜索等增量解码方法，并且在很大程度上忽视了延迟，只关注令牌用量。我们将推理时扩展形式化为一个动态计算分配与方法选择问题：系统必须逐查询决定应用哪种策略以及分配多少计算。我们的框架同时显式纳入令牌成本和挂钟延迟，后者对用户体验至关重要，尤其是在模型必须高效发出多个查询的智能体工作流中。推理基准上的实验表明，我们的方法始终优于静态策略，在保持部署实用性的同时实现了有利的准确率与成本权衡。
摘要:Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
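逐查询的"计算分配与方法选择"可以抽象为一个同时计入令牌成本与挂钟延迟的效用最大化问题。下面是一个最小示意，其中各策略的预期准确率、令牌数和延迟均为假设的示意数值，并非论文报告的结果：

```python
# 候选策略及其(假设的)预期准确率、令牌数与挂钟延迟
strategies = [
    {"name": "greedy",    "acc": 0.62, "tokens": 300,  "latency_s": 1.2},
    {"name": "best-of-4", "acc": 0.71, "tokens": 1200, "latency_s": 1.5},
    {"name": "beam-4",    "acc": 0.69, "tokens": 700,  "latency_s": 3.8},
]

def pick(strategies, tok_cost, lat_cost):
    """按"预期准确率 - 令牌成本 - 延迟成本"的效用逐查询选择策略。"""
    return max(strategies,
               key=lambda s: s["acc"] - tok_cost * s["tokens"]
                             - lat_cost * s["latency_s"])

print(pick(strategies, tok_cost=1e-4, lat_cost=0.02)["name"])  # greedy
print(pick(strategies, tok_cost=1e-5, lat_cost=0.02)["name"])  # best-of-4
```

两次调用展示了成本权重如何改变决策：令牌昂贵时退回贪心解码，令牌便宜时并行采样的best-of-N更划算；对延迟敏感的智能体工作流则会进一步惩罚波束搜索这类高延迟方法。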

【6】Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization
标题:手势引发的语用框架：FrameNet Brasil的多模态话轮组织方法
链接:https://arxiv.org/abs/2509.09804

作者:Andrade Abreu, Tiago Timponi Torrent, Ely Edison da Silva Matos
备注:Paper submitted to Language Sciences Journal
摘要:本文提出了一个通过语言与互动手势之间相关性的命题来建模多模态会话话轮组织的框架，分析交流者如何概念化并唤起语用框架。为了给该分析提供证据，我们开发了一种标注方法，用建模会话话轮组织的语用框架来丰富一个(已标注语义框架的)多模态数据集。虽然不同领域的研究者都研究过会话话轮组织，但交流者使用的具体策略(尤其是手势)尚未被编码进可用于机器学习的数据集中。为填补这一空白，我们在Frame2数据集中补充了话轮组织手势的标注。Frame2数据集包含巴西电视连续剧Pedro Pelo Mundo的10集内容，对视频和文本中唤起的语义框架进行了标注。该数据集使我们能够近距离观察交流者在实验室之外的场景中如何使用互动手势，据我们所知，这在以往相关文献中未有记录。我们的结果证实，参与面对面交谈的交流者使用手势作为传递、获取和保持话轮的工具，还揭示了一些此前未被记录的手势变体。我们提出，这些手势的使用源自语用框架的概念化，涉及心理空间、概念整合和概念隐喻。此外，我们的数据表明，语用框架的标注有助于更深入地理解人类认知和语言。
摘要:This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language.

【7】Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture
标题:可执行本体：用数据流架构合成事件语义
链接:https://arxiv.org/abs/2509.09775

作者: Boldachev
备注:22 pages, 6 figures
摘要:本文介绍了boldsea，即Boldachev的语义事件方法(semantic-event approach)：一种使用可执行本体建模复杂动态系统的体系结构，其中语义模型作为动态结构直接控制流程执行。我们证明，将事件语义与数据流架构集成可以解决传统业务流程管理(BPM)系统和面向对象语义技术的局限性。本文给出了形式化的BSL(boldsea Semantic Language)及其BNF文法，并概述了boldsea引擎的体系结构：它无需编译即可直接将语义模型解释为可执行算法。该引擎支持在运行时修改事件模型，确保时间透明性，并在统一的语义框架内无缝融合数据与业务逻辑。
摘要:This paper presents boldsea, Boldachev's semantic-event approach -- an architecture for modeling complex dynamic systems using executable ontologies -- semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine's architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.

【8】MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
标题:MCP-AgentBench：使用MCP介导的工具评估现实世界的语言代理性能
链接:https://arxiv.org/abs/2509.09734

作者:o, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
摘要:模型上下文协议(MCP)正迅速成为一项关键的开放标准，旨在增强代理与工具的集成和互操作性，并有望开启一个强大、互连且真正实用的代理式AI新时代。然而，尽管MCP的采用日益广泛，现有基准往往无法捕捉这一新范式下真实世界中的代理性能，导致对其真正运营价值的认识失真，也无法可靠地区分各模型的能力水平。为弥合这一关键的评估差距，我们引入了MCP-AgentBench，一个专为严格评估语言代理在MCP介导的工具交互中的能力而设计的综合基准。MCP-AgentBench的核心贡献包括：建立了一个由33个运行中的服务器和188个不同工具组成的稳健MCP测试平台；构建了一个包含600个系统化设计查询的基准，这些查询分布在交互复杂度各异的6个类别中；以及引入MCP-Eval，一种以结果为导向、优先考虑真实世界任务成功的全新评估方法。通过对领先语言代理的广泛实证评估，我们提供了基础性的见解。MCP-AgentBench旨在为研究社区提供一个标准化且可靠的框架，以构建、验证和推进能够充分利用MCP变革性优势的代理，从而加速实现真正有能力且可互操作的AI系统。
摘要:The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.

【9】MultimodalHugs: Enabling Sign Language Processing in Hugging Face
标题:MultimodalHugs：在Hugging Face中实现手语处理
链接:https://arxiv.org/abs/2509.09729

作者:nt, Zifan Jiang, Carlos Escolano, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling
摘要:近年来，手语处理(SLP)在自然语言处理领域中的重要性日益提升。然而，与口语研究相比，SLP研究受到复杂的临时(ad-hoc)代码的阻碍，无意中导致了低可复现性和不公平的比较。为快速且可复现的实验而构建的现有工具(如Hugging Face)不够灵活，无法无缝集成手语实验。我们在SLP研究人员中开展的一项调查证实了这一观点。为应对这些挑战，我们引入了MultimodalHugs，一个构建在Hugging Face之上的框架，它支持更多样的数据模态和任务，同时继承了Hugging Face生态系统广为人知的优势。尽管手语是我们的主要关注点，MultimodalHugs增加了一层抽象，使其能更广泛地适用于不符合Hugging Face标准模板的其他用例。我们提供了定量实验，说明MultimodalHugs如何适应不同的模态，例如手语的姿态估计数据或文本字符的像素数据。
摘要:In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.

【10】BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
标题:BIBERT-Pipe在BioASQ 2025生物医学嵌套命名实体链接任务中的应用
链接:https://arxiv.org/abs/2509.09725

作者:, Xindi Zheng, Siqi Liu
摘要:生物医学文本的实体链接(EL)通常在仅有英语、且提及为扁平结构的语料上进行基准测试，使得嵌套、多语言提及这一更现实的场景基本未被探索。我们介绍了为BioNNE 2025多语言生物医学嵌套命名实体链接共享任务(英语和俄语)构建的系统，通过一个轻量级管道来弥补这一差距：该管道保持原始EL模型不变，仅修改三个与任务对齐的组件。(1)两阶段检索-排名：两个阶段使用相同的基础编码器模型，检索阶段使用原始预训练模型，排名阶段则进行领域特定微调。(2)边界线索：在排名阶段，我们用可学习的[Ms]/[Me]标签包裹每个提及，在鲁棒处理重叠与嵌套之前，为编码器提供明确的、与语言无关的跨度信息。(3)数据集扩增：我们还用三个互补的数据源自动扩展排名训练语料，在无需额外人工标注的情况下提升覆盖率。在BioNNE 2025排行榜上，我们的两阶段系统bilingual bert(BIBERT-Pipe)在多语言赛道中排名第三，证明了这些最小但有原则的修改的有效性和竞争力。代码公开于https://github.com/Kaggle-Competitions-Code/BioNNE-L。
摘要:Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span before robustness to overlap and nesting. Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two stage system, bilingual bert (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code are publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L.
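排名阶段的边界标签可以示意如下：用[Ms]/[Me]把提及跨度显式标出后再送入编码器，嵌套提及(外层短语与内层实体)各自生成一条输入。标签写法取自摘要，其余细节(示例句、偏移)为假设：

```python
def wrap_mention(text, start, end, ms="[Ms]", me="[Me]"):
    """用可学习的[Ms]/[Me]标签包裹一个提及跨度(字符偏移,左闭右开)。"""
    return text[:start] + ms + text[start:end] + me + text[end:]

sent = "chronic kidney disease"
# 嵌套提及:外层为整个短语,内层为"kidney",分别包裹后各自送入排名编码器
outer = wrap_mention(sent, 0, len(sent))
inner = wrap_mention(sent, 8, 14)
print(outer)  # [Ms]chronic kidney disease[Me]
print(inner)  # chronic [Ms]kidney[Me] disease
```

实际系统中，[Ms]/[Me]需注册为分词器的特殊token并扩展编码器的嵌入表，其向量随排名阶段的微调一起学习。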

【11】Improving MLLM Historical Record Extraction with Test-Time Image
标题:利用测试时图像改进MLLM历史记录提取
链接:https://arxiv.org/abs/2509.09722

作者:chibald, Tony Martinez
摘要:我们提出了一种新颖的集成框架，用于稳定基于LLM的噪声历史文档文本提取。我们用Gemini 2.0 Flash转录每张图像的多个增强变体，并用一个自定义的Needleman-Wunsch风格对齐器融合这些输出，该对齐器同时产生共识转录和置信度分数。我们发布了一个包含622条宾夕法尼亚州死亡记录的新数据集，并证明我们的方法相对于单次转录基线将转录准确率提升了4个百分点。我们发现填充和模糊增强对提升准确率最有用，而网格扭曲扰动最适合区分高置信度与低置信度的情况。该方法简单、可扩展，并可立即部署到其他文档集合和转录模型上。
摘要:We present a novel ensemble framework that stabilizes LLM based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman Wunsch style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records, and demonstrate our method improves transcription accuracy by 4 percentage points relative to a single shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high and low confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models.
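摘要中"Needleman-Wunsch风格对齐器+共识转录与置信度"的融合步骤可以按如下最小草图实现：以首个转录为参考，将其余变体逐一做经典Needleman-Wunsch全局对齐，再按列多数投票。为简化起见忽略了相对参考的插入词，打分参数与示例数据也都是假设值，并非论文的实际对齐器：

```python
from collections import Counter

def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """经典Needleman-Wunsch全局对齐,返回两条等长的、以None表示空位的序列。"""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + d, S[i - 1][j] + gap, S[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m          # 回溯出对齐路径
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append(None); i -= 1
        else:
            ra.append(None); rb.append(b[j - 1]); j -= 1
    return ra[::-1], rb[::-1]

def consensus(transcripts):
    """以首个转录为参考,将其余变体逐一对齐后按列多数投票,并给出置信度。"""
    ref = transcripts[0]
    votes = [Counter([tok]) for tok in ref]
    for var in transcripts[1:]:
        ra, va = nw_align(ref, var)
        k = 0
        for x, y in zip(ra, va):
            if x is not None:
                if y is not None:
                    votes[k][y] += 1
                k += 1
    words, confs = [], []
    for c in votes:
        tok, n = c.most_common(1)[0]
        words.append(tok); confs.append(n / len(transcripts))
    return words, confs

# 同一张死亡记录图像三个增强变体的(假设)转录结果
runs = [["john", "smith", "d.", "1892"],
        ["john", "smith", "1892"],
        ["john", "smyth", "d.", "1892"]]
words, confs = consensus(runs)
print(words, [round(c, 2) for c in confs])
```

置信度即该列上投给共识词的票数占比，可直接用于筛出需要人工复核的低置信度字段。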

【12】DB3 Team's Solution For Meta KDD Cup' 25
标题:DB3团队针对Meta KDD Cup'25的解决方案
链接:https://arxiv.org/abs/2509.09681

作者:a, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao
摘要:本文介绍了db3团队在KDD Cup'25的Meta CRAG-MM Challenge 2025中的获胜方案。针对该挑战独特的多模态、多轮问答基准(CRAG-MM)，我们开发了一个综合框架，将针对不同任务定制的检索管道与用于幻觉控制的统一LLM调优方法集成在一起。我们的解决方案包括：(1)处理图像索引知识图谱、Web来源和多轮对话的领域特定检索管道；(2)使用SFT、DPO和RL的高级拒答训练。该系统在任务1中获得第二名，在任务2中获得第二名，在任务3中获得第一名，并凭借对第一人称视角挑战的出色处理，获得了以自我为中心查询方面的卓越大奖。
摘要:This paper presents the db3 team's winning solution for the Meta CRAG-MM Challenge 2025 at KDD Cup'25. Addressing the challenge's unique multi-modal, multi-turn question answering benchmark (CRAG-MM), we developed a comprehensive framework that integrates tailored retrieval pipelines for different tasks with a unified LLM-tuning approach for hallucination control. Our solution features (1) domain-specific retrieval pipelines handling image-indexed knowledge graphs, web sources, and multi-turn conversations; and (2) advanced refusal training using SFT, DPO, and RL. The system achieved 2nd place in Task 1, 2nd place in Task 2, and 1st place in Task 3, securing the grand prize for excellence in ego-centric queries through superior handling of first-person perspective challenges.

【13】Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
标题:公平地裁剪你的序列：为序列级RL强制执行长度公平性
链接:https://arxiv.org/abs/2509.09177

作者:, Quanjia Xiao, Lei Pang, Haixiao Liu
摘要:我们提出了FSPO(公平序列策略优化)，一种用于LLM的序列级强化学习方法，它直接在重要性采样(IS)权重空间中执行长度公平的裁剪。我们重新审视了序列级RL方法，发现将PPO/GRPO风格的裁剪移植到序列层面时存在不匹配：固定的裁剪范围会系统性地对短响应与长响应重新加权，扭曲有效目标。在理论上，我们通过长度重加权误差(LRE)形式化长度公平性，并证明小的LRE能在裁剪后更新与真实更新之间给出方向余弦保证。FSPO引入了一个简单的、受高斯启发的补救措施：我们用一个施加KL校正漂移项、并按$\sqrt{L}$缩放的区间来裁剪序列级对数IS比。实验上，FSPO使各长度区间的裁剪率趋于平坦，稳定了训练，并在多个评估数据集上优于所有基线。
摘要:We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
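FSPO的核心操作是在重要性采样权重空间中做长度公平的裁剪：裁剪区间按$\sqrt{L}$缩放，并用KL校正漂移项平移。下面是一个最小示意，其中eps与kl_drift的取值均为假设，区间的精确形式以论文为准：

```python
import numpy as np

def fspo_clip(log_is_ratio, seq_len, eps=0.2, kl_drift=0.0):
    """把序列级对数IS比裁剪到按sqrt(L)缩放、并以KL漂移项为中心的区间内。
    eps与kl_drift为假设取值,仅示意裁剪带随序列长度的缩放方式。"""
    half = eps * np.sqrt(seq_len)
    return float(np.clip(log_is_ratio, kl_drift - half, kl_drift + half))

# 同样的对数比在长序列下允许更大的绝对波动,
# 避免固定裁剪范围系统性地重加权短响应与长响应
print(fspo_clip(1.5, seq_len=16))    # 裁剪到 0.2*sqrt(16) = 0.8
print(fspo_clip(1.5, seq_len=400))   # 带宽 0.2*sqrt(400) = 4,不触发裁剪
```

直觉上，序列对数IS比是逐token项之和，其尺度随长度近似按$\sqrt{L}$增长(高斯动机)，因此裁剪带也应随之缩放，使各长度区间的裁剪率大致持平。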

【14】Generative Engine Optimization: How to Dominate AI Search
Link: https://arxiv.org/abs/2509.08919

Authors: , Xiaoxuan Wang, Kaiwen Chen, Nick Koudas
Abstract: The rapid adoption of generative AI-powered search engines like ChatGPT, Perplexity, and Gemini is fundamentally reshaping information retrieval, moving from traditional ranked lists to synthesized, citation-backed answers. This shift challenges established Search Engine Optimization (SEO) practices and necessitates a new paradigm, which we term Generative Engine Optimization (GEO). This paper presents a comprehensive comparative analysis of AI Search and traditional web search (Google). Through a series of large-scale, controlled experiments across multiple verticals, languages, and query paraphrases, we quantify critical differences in how these systems source information. Our key findings reveal that AI Search exhibits a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned and Social content, a stark contrast to Google's more balanced mix. We further demonstrate that AI Search services differ significantly from each other in their domain diversity, freshness, cross-language stability, and sensitivity to phrasing. Based on these empirical results, we formulate a strategic GEO agenda. We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability and justification, (2) dominate earned media to build AI-perceived authority, (3) adopt engine-specific and language-aware strategies, and (4) overcome the inherent "big brand bias" for niche players. Our work provides the foundational empirical analysis and a strategic framework for achieving visibility in the new generative search landscape.

【15】Unified Learnable 2D Convolutional Feature Extraction for ASR
Link: https://arxiv.org/abs/2509.10031

Authors: ting, Benedikt Hilmes, Ralf Schlüter, Hermann Ney
Comments: Accepted at ITG Conference on Speech Communication 2025
Abstract: Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems, as they make it possible to learn features specifically tailored to different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture, in contrast to existing approaches that apply a composition of several layer topologies originating from different sources. Our experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and, unlike large models pre-trained on unlabeled audio, suitable for scenarios with limited computational resources. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors.
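The abstract stays at the architecture level, so the following is only an illustrative building block, not the paper's front-end: a plain 2D cross-correlation over a toy time-frequency input. The input representation, layer count, kernel sizes, and strides are all assumptions here.

```python
def conv2d(x, kernel, stride=(1, 1)):
    """Valid (no-padding) 2D cross-correlation over a 2D input.
    Stacking learnable layers of this kind on a time-frequency
    representation is the style of unified 2D convolutional front-end
    the abstract describes; this single fixed-kernel layer is only a
    sketch of the operation itself."""
    kh, kw = len(kernel), len(kernel[0])
    sh, sw = stride
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H - kh + 1, sh):
        row = []
        for j in range(0, W - kw + 1, sw):
            row.append(sum(x[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A 2x2 averaging kernel with stride 2 halves both the time and the
# feature axis of a toy 4x4 "spectrogram" of ones:
feat = conv2d([[1.0] * 4 for _ in range(4)],
              [[0.25, 0.25], [0.25, 0.25]], stride=(2, 2))
```

In practice the kernels would be learned parameters and the layer would be one stage of a strided stack that downsamples the signal to frame-level features.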

【16】HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets
Link: https://arxiv.org/abs/2509.09740

Authors: , Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov
Abstract: Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework that transforms cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pairwise cosine similarities, and let an agent referee panel score (i) the internal consistency of the predictions (high average similarity within the same cluster, termed intra-cluster agreement) and (ii) their external distinctiveness (low similarity between clusters, termed inter-cluster separation). These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our resolution score selects clustering granularities that align better with known pathways than classical metrics such as silhouette and modularity scores, as judged by gene functional enrichment summaries. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies.
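The two quantities behind the resolution score are straightforward to compute from the description embeddings. A minimal sketch follows, assuming the combination rule is simply intra-cluster agreement minus mean inter-cluster similarity (the abstract only says the two are combined); `resolution_score` and its inputs are illustrative names, and the LLM/embedding steps are replaced by ready-made vectors.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def resolution_score(clusters):
    """clusters: one list of embedding vectors per cluster, each vector
    embedding a predicted GO-based description.  Returns intra-cluster
    agreement (mean pairwise cosine within clusters) minus inter-cluster
    similarity (mean cosine between clusters); high when clusters are
    internally coherent and mutually exclusive."""
    intra = [cosine(u, v)
             for c in clusters for u, v in combinations(c, 2)]
    inter = [cosine(u, v)
             for c1, c2 in combinations(clusters, 2)
             for u in c1 for v in c2]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra) - mean(inter)

# Coherent, well-separated clusters score higher than mixed ones:
tight = [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
mixed = [[(1.0, 0.0), (0.0, 1.0)], [(0.9, 0.1), (0.1, 0.9)]]
```

Sweeping this score over candidate clustering resolutions and picking the maximum reproduces the selection step the abstract describes.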

Machine translations provided by Tencent TranSmart, for reference only.
