
Artificial Intelligence Academic Digest [9.1]

Sophie外贸笔记
2025-09-01
Overview: cs.AI, 116 papers today



cs.AI (Artificial Intelligence): 116 papers


【1】The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Link: https://arxiv.org/abs/2508.21816

Authors: n, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Note: Accepted by ICDM 2025
Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual event (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR, carefully constructed to fairly assess model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
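The single positive multi-label setting can be made concrete with a minimal sketch. The snippet below is not the paper's GE-VerbMLP; it only illustrates the simplest SPMLL baseline, in which the one observed verb label is treated as positive and every unobserved label is (possibly wrongly) assumed negative. All probabilities and indices are illustrative.

```python
import math

def bce(p, y):
    """Binary cross-entropy for a single label, p in (0, 1)."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def spmll_assumed_negative_loss(probs, positive_idx):
    """SPMLL 'assumed negative' baseline: the single observed verb label
    is treated as positive, every unobserved label as negative."""
    return sum(
        bce(p, 1.0 if i == positive_idx else 0.0)
        for i, p in enumerate(probs)
    ) / len(probs)

# predicted per-verb probabilities for one image (made-up values)
loss_good = spmll_assumed_negative_loss([0.9, 0.1, 0.1, 0.1], positive_idx=0)
loss_bad = spmll_assumed_negative_loss([0.1, 0.9, 0.1, 0.1], positive_idx=0)
```

The ambiguity the paper targets is exactly where this baseline breaks: a second verb that also fits the image gets penalized as a false negative.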


【2】Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture
Link: https://arxiv.org/abs/2508.21803

Authors: e, Xiaoyang Wang, Christopher C. Yang
Note: Accepted to The 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025) (Poster Paper)
Abstract: Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team's reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.
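The hierarchical, iterative debate can be pictured with a toy consensus loop. This is a deliberately simplified stand-in for the LLM agents (plain string "opinions" instead of model outputs, hypothetical diagnosis labels); it also shows mechanically why majority pressure can produce the groupthink the authors observe.

```python
from collections import Counter

def debate_round(opinions):
    """One debate round: each agent adopts the current majority view
    if one exists; ties leave all opinions unchanged."""
    top, n = Counter(opinions).most_common(1)[0]
    if n > len(opinions) / 2:
        return [top] * len(opinions)  # majority pressure: groupthink risk
    return opinions

def run_debate(opinions, max_rounds=3):
    """Iterate rounds until unanimity or a round budget is exhausted."""
    for _ in range(max_rounds):
        opinions = debate_round(opinions)
        if len(set(opinions)) == 1:
            return opinions[0], True   # consensus reached
    return Counter(opinions).most_common(1)[0][0], False

# three specialist agents assessing one SOAP note (hypothetical labels)
decision, agreed = run_debate(["sepsis", "sepsis", "AKI"])
```

A real Manager agent would route evidence between specialists rather than just count votes, but the convergence logic is the same shape.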


【3】Tree-Guided Diffusion Planner
Link: https://arxiv.org/abs/2508.21800

Authors: g Jeon, Cheolhong Min, Jaesik Park
Note: 20 pages, 11 figures, 14 tables (main paper + appendix) / under review / project page will be available after the paper becomes public on arXiv
Abstract: Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.


【4】DynaMark: A Reinforcement Learning Framework for Dynamic Watermarking in Industrial Machine Tool Controllers
Link: https://arxiv.org/abs/2508.21797

Authors: abi, Abhishek Hanchate, Satish Bukkapatnam, Dan Li
Abstract: Industry 4.0's highly networked Machine Tool Controllers (MTCs) are prime targets for replay attacks that use outdated sensor data to manipulate actuators. Dynamic watermarking can reveal such tampering, but current schemes assume linear-Gaussian dynamics and use constant watermark statistics, making them vulnerable to the time-varying, partly proprietary behavior of MTCs. We close this gap with DynaMark, a reinforcement learning framework that models dynamic watermarking as a Markov decision process (MDP). It learns an adaptive policy online that dynamically adapts the covariance of a zero-mean Gaussian watermark using available measurements and detector feedback, without needing system knowledge. DynaMark maximizes a unique reward function balancing control performance, energy consumption, and detection confidence dynamically. We develop a Bayesian belief updating mechanism for real-time detection confidence in linear systems. This approach, independent of specific system assumptions, underpins the MDP for systems with linear dynamics. On a Siemens Sinumerik 828D controller digital twin, DynaMark achieves a reduction in watermark energy by 70% while preserving the nominal trajectory, compared to constant variance baselines. It also maintains an average detection delay equivalent to one sampling interval. A physical stepper-motor testbed validates these findings, rapidly triggering alarms with less control performance decline and exceeding existing benchmarks.
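The core idea of dynamic watermarking (inject a zero-mean Gaussian signal into the control input and check whether the measurements still echo it) can be sketched without any of DynaMark's RL machinery. The plant response, noise levels, and the fixed sigma below are all toy assumptions; DynaMark's contribution is adapting that sigma online.

```python
import random

def watermarked_control(u_nominal, sigma):
    """Add a zero-mean Gaussian watermark to the nominal control input.
    (DynaMark adapts sigma online; here it is a fixed toy parameter.)"""
    w = random.gauss(0.0, sigma)
    return u_nominal + w, w

def detection_statistic(watermarks, residuals):
    """Sample covariance between injected watermarks and measurement
    residuals: near zero under a replay attack, positive otherwise."""
    n = len(watermarks)
    mw, mr = sum(watermarks) / n, sum(residuals) / n
    return sum((w - mw) * (r - mr) for w, r in zip(watermarks, residuals)) / n

random.seed(0)
ws, live, replayed = [], [], []
for _ in range(2000):
    u, w = watermarked_control(1.0, sigma=0.1)
    ws.append(w)
    live.append(w + random.gauss(0, 0.05))   # live plant echoes the watermark
    replayed.append(random.gauss(0, 0.12))   # replayed data ignores it

stat_live = detection_statistic(ws, live)
stat_replay = detection_statistic(ws, replayed)
```

Replayed sensor data cannot correlate with a freshly drawn watermark, which is what the statistic exposes.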


【5】TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
Link: https://arxiv.org/abs/2508.21795

Authors: u, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
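The memory-bank scoring step can be sketched as nearest-neighbour retrieval at each level followed by a weighted fusion. The 2-D "features", banks, and uniform weights below are toy stand-ins for TMUAD's learned text, object, and patch embeddings.

```python
import math

def nn_distance(query, bank):
    """Anomaly score at one level: distance from the query feature to
    its nearest neighbour in the normal-sample memory bank."""
    return min(math.dist(query, ref) for ref in bank)

def fused_anomaly_score(query_feats, banks, weights=(1.0, 1.0, 1.0)):
    """Fuse class-, object-, and patch-level scores into one number."""
    scores = [nn_distance(q, b) for q, b in zip(query_feats, banks)]
    return sum(w * s for w, s in zip(weights, scores))

# toy 2-D "features" at three levels (text, object, patch)
banks = [[(0, 0), (1, 0)], [(0, 1)], [(2, 2), (3, 3)]]
normal_q = [(0.1, 0.0), (0.0, 1.1), (2.0, 2.1)]
anomal_q = [(5.0, 5.0), (4.0, 4.0), (9.0, 9.0)]

s_normal = fused_anomaly_score(normal_q, banks)
s_anomaly = fused_anomaly_score(anomal_q, banks)
```

A query far from every normal sample at any level inflates the fused score, which is how both structural and logical deviations surface in one number.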


【6】MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction
Link: https://arxiv.org/abs/2508.21793

Authors: Wang, Christopher C. Yang
Note: Accepted to The 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025)
Abstract: Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. The MoE-Health architecture is specifically developed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality prediction, prolonged length of stay, and hospital readmission prediction. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments with heterogeneous data availability.
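The availability-masked gating can be sketched in a few lines: experts whose modality is missing receive zero gate weight, and the remaining weights are renormalized before combining expert outputs. The expert outputs and gate logits here are hypothetical scalars rather than network outputs.

```python
import math

def masked_softmax(logits, available):
    """Gate weights over experts, zeroing experts whose modality is
    missing and renormalizing over the rest."""
    exps = [math.exp(l) if a else 0.0 for l, a in zip(logits, available)]
    total = sum(exps)
    return [e / total for e in exps]

def moe_predict(expert_outputs, gate_logits, available):
    """Combine per-modality expert outputs with availability-masked gates."""
    gates = masked_softmax(gate_logits, available)
    return sum(g * o for g, o in zip(gates, expert_outputs))

# three experts: EHR, clinical notes, imaging (hypothetical risk scores)
outputs = [0.8, 0.6, 0.2]
logits = [1.0, 0.5, 0.1]

full = moe_predict(outputs, logits, available=[1, 1, 1])
no_img = moe_predict(outputs, logits, available=[1, 1, 0])
```

Because the mask sits inside the softmax, a patient missing imaging still gets a well-formed prediction from the remaining experts with no retraining.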


【7】Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Link: https://arxiv.org/abs/2508.21788

Authors: mir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all finish under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
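The paper's pipeline is built on ElasticSearch; the stdlib sketch below is only meant to illustrate the underlying idea of an inverted index, i.e. why term lookups cost milliseconds once a one-time indexing pass over the corpus has been paid. Document contents and ids are invented.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: token -> set of document ids. One pass over the
    corpus up front buys near-constant-time term lookups afterwards."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """AND-query: ids of documents containing every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for t in tokens[1:]:
        result &= index.get(t, set())
    return result

# toy corpus standing in for FineWeb-2 shards
docs = {
    1: "harmful content example",
    2: "benign web page",
    3: "harmful web page",
}
index = build_index(docs)
hits = search(index, "harmful page")
```

Scanning 1.5 TB per query would be hopeless; the index turns each query into a handful of set intersections.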


【8】PiCSAR: Probabilistic Confidence Selection And Ranking
Link: https://arxiv.org/abs/2508.21787

Authors: g Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
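The scoring rule itself is simple enough to sketch: sum the token log-probabilities of the reasoning and of the final answer, and pick the best-of-n candidate by that joint score. The token log-prob values below are made up for illustration; in practice they come from the generating model.

```python
def picsar_score(reasoning_logprobs, answer_logprobs):
    """PiCSAR-style score: joint log-likelihood of reasoning + answer,
    which decomposes into reasoning and answer confidence."""
    reasoning_conf = sum(reasoning_logprobs)
    answer_conf = sum(answer_logprobs)
    return reasoning_conf + answer_conf

def best_of_n(candidates):
    """Select the sampled candidate with the highest joint score."""
    return max(
        candidates,
        key=lambda c: picsar_score(c["reasoning_lp"], c["answer_lp"]),
    )

# toy token log-probs for three sampled solutions (hypothetical values)
candidates = [
    {"answer": "42", "reasoning_lp": [-0.2, -0.3], "answer_lp": [-0.1]},
    {"answer": "41", "reasoning_lp": [-1.5, -2.0], "answer_lp": [-0.9]},
    {"answer": "42", "reasoning_lp": [-0.4, -0.4], "answer_lp": [-0.2]},
]
best = best_of_n(candidates)
```

No reward model or ground truth is consulted: only quantities the sampler already computes.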


【9】Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
Link: https://arxiv.org/abs/2508.21777

Authors: , Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Note: Under review in Frontiers in Artificial Intelligence
Abstract: Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use.   Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' kappa.   Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation.   Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
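The inter-rater reliability measure used here is the standard Fleiss' kappa, which can be computed directly from a ratings table (rows = items, columns = categories, cell = number of raters choosing that category). The toy table below assumes 4 raters and 2 categories; it is not the paper's data.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a ratings table where table[i][j] is the number
    of raters assigning item i to category j (same r raters per item)."""
    n = len(table)          # number of items
    r = sum(table[0])       # raters per item
    # mean per-item agreement
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in table
    ) / n
    # chance agreement from category marginals
    k = len(table[0])
    p_j = [sum(row[j] for row in table) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# 4 raters, 3 items, 2 categories (e.g. "correct" / "incorrect")
perfect = [[4, 0], [0, 4], [4, 0]]   # unanimous on every item
mixed = [[2, 2], [2, 2], [2, 2]]     # maximal disagreement
```

Values near 0 (like the paper's 0.083) mean agreement barely above chance, which is why the authors read it as genuine variability in clinical judgment.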


【10】Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
Link: https://arxiv.org/abs/2508.21773

Authors: Kurpukdee, Adrian G. Bors
Note: Accepted to The 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK
Abstract: We propose a realistic scenario for unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent a complex and rich spatio-temporal media information, widely used in many applications, but which have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while having labeled data is costly and not practical. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage the use of transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, without using any labels or class boundaries.
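The KDE-based novelty criterion can be sketched in one dimension: estimate the density of an incoming feature under each memory cluster, and open a new cluster when even the best density falls below a threshold. The bandwidth, threshold, and scalar "features" are toy assumptions; the paper works with deep video-transformer embeddings.

```python
import math

def gaussian_kde(x, samples, bandwidth=1.0):
    """1-D Gaussian kernel density estimate at x from a cluster's samples."""
    norm = math.sqrt(2 * math.pi) * bandwidth * len(samples)
    return sum(
        math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2)) for s in samples
    ) / norm

def assign_or_expand(feature, clusters, threshold=0.05):
    """Novelty criterion: if no existing cluster gives the incoming
    feature enough density, open a new memory cluster for it."""
    densities = [gaussian_kde(feature, c) for c in clusters]
    if not densities or max(densities) < threshold:
        clusters.append([feature])   # new knowledge: expand memory
        return len(clusters) - 1
    best = max(range(len(clusters)), key=lambda i: densities[i])
    clusters[best].append(feature)
    return best

clusters = [[0.0, 0.2, -0.1]]        # one existing cluster of scalar features
idx_known = assign_or_expand(0.1, clusters)
idx_novel = assign_or_expand(8.0, clusters)
```

Because KDE is non-parametric, no label, task id, or cluster count needs to be fixed in advance; memory grows only when the data demands it.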


【11】Reasoning-Intensive Regression
Link: https://arxiv.org/abs/2508.21762

Authors: uindjo, Omar Khattab
Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.


【12】Orientability of Causal Relations in Time Series using Summary Causal Graphs and Faithful Distributions
Link: https://arxiv.org/abs/2508.21742

Authors: Loranchet, Charles K. Assaad
Abstract: Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph (SCG), which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.


【13】Neural Network Acceleration on MPSoC board: Integrating SLAC's SNL, Rogue Software and Auto-SNL
Link: https://arxiv.org/abs/2508.21739

Authors: aoui Rahali, Abhilasha Dave, Larry Ruckman, Mohammad Mehdi Rahimifar, Audrey C. Therrien, James J. Russel, Ryan T. Herbst
Abstract: The LCLS-II Free Electron Laser (FEL) will generate X-ray pulses for beamline experiments at rates of up to 1 MHz, with detectors producing data throughputs exceeding 1 TB/s. Managing such massive data streams presents significant challenges, as transmission and storage infrastructures become prohibitively expensive. Machine learning (ML) offers a promising solution for real-time data reduction, but conventional implementations introduce excessive latency, making them unsuitable for high-speed experimental environments. To address these challenges, SLAC developed the SLAC Neural Network Library (SNL), a specialized framework designed to deploy real-time ML inference models on Field-Programmable Gate Arrays (FPGA). SNL's key feature is the ability to dynamically update model weights without requiring FPGA resynthesis, enhancing flexibility for adaptive learning applications. To further enhance usability and accessibility, we introduce Auto-SNL, a Python extension that streamlines the process of converting Python-based neural network models into SNL-compatible high-level synthesis code. This paper presents a benchmark comparison against hls4ml, the current state-of-the-art tool, across multiple neural network architectures, fixed-point precisions, and synthesis configurations targeting a Xilinx ZCU102 FPGA. The results showed that SNL achieves competitive or superior latency in most tested architectures, while in some cases also offering FPGA resource savings. This adaptability demonstrates SNL's versatility, opening new opportunities for researchers and academics in fields such as high-energy physics, medical imaging, robotics, and many more.


【14】Developer Insights into Designing AI-Based Computer Perception Tools
Link: https://arxiv.org/abs/2508.21733

Authors: n (1), Meghan E. Hurley (1), Eric A. Storch (2), John Herrington (3), Casey Zampella (3), Julia Parish-Morris (3), Gabriel Lázaro-Muñoz (4), Kristin Kostick-Quenet (1) ((1) Center for Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA, (2) Department of Psychiatry and Behavioral Sciences, Baylor College of Medicine, Houston, TX, USA, (3) Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children's Hospital of Philadelphia, Philadelphia, PA, USA, (4) Center for Bioethics, Harvard Medical School, Boston, MA, USA)
Note: 15 pages
Abstract: Artificial intelligence (AI)-based computer perception (CP) technologies use mobile sensors to collect behavioral and physiological data for clinical decision-making. These tools can reshape how clinical knowledge is generated and interpreted. However, effective integration of these tools into clinical workflows depends on how developers balance clinical utility with user acceptability and trustworthiness. Our study presents findings from 20 in-depth interviews with developers of AI-based CP tools. Interviews were transcribed, and inductive, thematic analysis was performed to identify 4 key design priorities: 1) account for context and ensure explainability for both patients and clinicians; 2) align tools with existing clinical workflows; 3) appropriately customize to relevant stakeholders for usability and acceptability; and 4) push the boundaries of innovation while aligning with established paradigms. Our findings highlight that developers view themselves as not merely technical architects but also ethical stewards, designing tools that are both acceptable by users and epistemically responsible (prioritizing objectivity and pushing clinical knowledge forward). We offer the following suggestions to help achieve this balance: documenting how design choices around customization are made, defining limits for customization choices, transparently conveying information about outputs, and investing in user training. Achieving these goals will require interdisciplinary collaboration between developers, clinicians, and ethicists.


【15】CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
Link: https://arxiv.org/abs/2508.21732

Authors: nte, Atabak Dehban, Rodrigo Ventura
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur; common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRAs of these models on CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.


【16】Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem
标题:冻结与征服:求解旅行商问题的可重用Ansatz
链接:https://arxiv.org/abs/2508.21730

作者:Fagiolo, Nicolo' Vescera
摘要:在本文中,我们提出了一个求解旅行商问题(TSP)的变分算法,它结合了:(i)排列的紧凑编码,同时降低了量子比特需求;(ii)"优化-冻结-重用"策略:电路拓扑("Ansatz")首先通过模拟退火(SA)在训练实例上优化,然后"冻结"并在新实例上重用,仅对电路参数进行快速重新优化。这一流程消除了测试阶段昂贵的结构搜索,使该过程可以立即在NISQ硬件上实现。   在一组40个随机生成的、城市数为4至7的对称实例上,所得Ansatz在4城市情形下实现了平均100%的最优路线采样概率,5城市为90%,6城市为80%。7城市时成功率显著下降到平均约20%,揭示了所提方法可扩展性限制的出现。   结果表明,该方法对中等规模问题具有稳健的泛化能力,并表明冻结Ansatz可以在不降低解质量的情况下显著缩短求解时间。本文还讨论了可扩展性限制、参数"热启动"初始化的影响,以及扩展到车辆路径和作业车间调度等更复杂问题的前景。
摘要:In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which reduces the qubit requirement too, (ii) an optimize-freeze-reuse strategy: where the circuit topology (``Ansatz'') is first optimized on a training instance by Simulated Annealing (SA), then ``frozen'' and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural research in testing, making the procedure immediately implementable on NISQ hardware.   On a set of $40$ randomly generated symmetric instances that span $4 - 7$ cities, the resulting Ansatz achieves an average optimal trip sampling probability of $100\%$ for 4 city cases, $90\%$ for 5 city cases and $80\%$ for 6 city cases. With 7 cities the success rate drops markedly to an average of $\sim 20\%$, revealing the onset of scalability limitations of the proposed method.   The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of ``warm-start'' initialization of parameters, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.


【17】OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization
标题:OptMark:通过推理时间优化的稳健多位扩散水印
链接:https://arxiv.org/abs/2508.21727

作者:Xing, Hai Ci, Hongbin Xu, Hangjie Yuan, Yong Liu, Mike Zheng Shou
摘要:扩散生成图像水印对于版权保护和用户跟踪至关重要。然而,目前的扩散水印方法面临着很大的局限性:零比特水印系统缺乏大规模用户跟踪的能力,而多比特方法对某些图像变换或生成攻击高度敏感,导致缺乏全面的鲁棒性。在本文中,我们提出了OptMark,一种基于优化的方法,将鲁棒的多比特水印嵌入到扩散去噪过程的中间潜变量(latents)中。OptMark策略性地在早期插入结构水印以抵抗生成攻击,并在后期插入细节水印以承受图像变换,并使用定制的正则化项来保持图像质量并确保不可感知性。为了解决优化过程中内存消耗随去噪步骤数量线性增长的挑战,OptMark采用了伴随梯度方法,将内存使用量从O(N)减少到O(1)。实验结果表明,OptMark实现了不可见的多比特水印,同时确保对数值度量(valuemetric)变换、几何变换、编辑和再生成攻击的鲁棒性。
摘要:Watermarking diffusion-generated images is crucial for copyright protection and user tracking. However, current diffusion watermarking methods face significant limitations: zero-bit watermarking systems lack the capacity for large-scale user tracking, while multi-bit methods are highly sensitive to certain image transformations or generative attacks, resulting in a lack of comprehensive robustness. In this paper, we propose OptMark, an optimization-based approach that embeds a robust multi-bit watermark into the intermediate latents of the diffusion denoising process. OptMark strategically inserts a structural watermark early to resist generative attacks and a detail watermark late to withstand image transformations, with tailored regularization terms to preserve image quality and ensure imperceptibility. To address the challenge of memory consumption growing linearly with the number of denoising steps during optimization, OptMark incorporates adjoint gradient methods, reducing memory usage from O(N) to O(1). Experimental results demonstrate that OptMark achieves invisible multi-bit watermarking while ensuring robust resilience against valuemetric transformations, geometric transformations, editing, and regeneration attacks.


【18】PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
标题:PosterForest:科学海报生成的分层多智能体协作
链接:https://arxiv.org/abs/2508.21720

作者:, Seojeong Park, Seongjong Song, Hyunjung Shim
摘要:我们提出了一个用于自动化科学海报生成的全新免训练框架PosterForest。与以往在很大程度上忽略科学文献层次结构以及文本与视觉元素语义整合的方法不同,我们的方法直接应对这两个挑战。我们引入了"海报树"(Poster Tree):一种分层的中间表示,在多个层级上联合编码文档结构与视觉-文本关系。我们的框架采用多智能体协作策略,由专门负责内容摘要和布局规划的智能体迭代协调并相互反馈。这种方法可以联合优化逻辑一致性、内容保真度和视觉连贯性。在多个学术领域的大量实验表明,我们的方法在定性和定量评估中均优于现有基线。生成的海报质量最接近专家设计的真实海报,并在信息保留、结构清晰度和用户偏好方面表现更优。
摘要:We present a novel training-free framework, \textit{PosterForest}, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the \textit{Poster Tree}, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.


【19】Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks
标题:基于熵的卷积神经网络可靠性无创监测
链接:https://arxiv.org/abs/2508.21715

作者:in Nazeri, Wael Hafez
备注:8 pages, 3 figures, 2 tables
摘要:卷积神经网络(CNN)已成为现代计算机视觉的基础,在各种图像识别任务中实现了前所未有的准确性。虽然这些网络在分布数据上表现出色,但它们仍然容易受到对抗性扰动的影响-无法察觉的输入修改,从而导致高置信度的错误分类。然而,现有的检测方法要么需要昂贵的再训练,修改网络架构,或降低性能的干净的输入。在这里,我们证明了对抗性扰动在CNN激活中创建了即时的、可检测的熵签名,可以在不修改任何模型的情况下对其进行监测。使用VGG-16上的并行熵监测,我们证明了对抗性输入在早期卷积层中始终将激活熵移动7%,实现了90%的检测准确率,误报率和漏报率低于20%。干净和对抗性熵分布之间的完全分离表明,CNN固有地编码其激活模式中的分布偏移。这项工作确定了CNN的可靠性可以单独通过激活熵来评估,从而实现了实时检测对抗性输入而不影响原始模型性能的自诊断视觉系统的实际部署。
摘要:Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positives and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
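下面给出一个极简示意,帮助理解上文摘要中"激活熵偏移检测"的思路(假设性草图:函数名、分箱数与7%阈值均为示意取值,并非论文的实际实现):

```python
import numpy as np

def activation_entropy(activations, bins=64):
    # Histogram the layer's activation values and compute Shannon entropy (bits)
    hist, _ = np.histogram(activations, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def flags_adversarial(activations, clean_baseline, shift=0.07):
    # Flag the input if entropy deviates from the clean baseline by more than 7%
    h = activation_entropy(activations)
    return abs(h - clean_baseline) / clean_baseline > shift
```

实际系统中,clean_baseline需在干净样本的激活统计上离线估计;监测本身不修改被监测模型。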


【20】Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
标题:为什么停在言语上?通过行级OCR揭开更大的图景
链接:https://arxiv.org/abs/2508.21693

作者:Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
备注:11 pages. Project Website: this https URL
摘要:传统的光学字符识别(OCR)技术将每个字符分割后再识别。这使得它们在字符分割时容易出现错误,并且缺乏上下文来利用语言模型。在过去十年中,序列到序列翻译的进步使现代技术先检测单词,再将单词逐个输入模型,直接以字符序列的形式输出完整单词。这允许更好地利用语言模型并绕过容易出错的字符分割步骤。我们观察到,上述风格的转变已经将准确性的瓶颈转移到了分词。因此,在本文中,我们提出了从词级OCR到行级OCR的自然而合乎逻辑的演进。该方案可以绕过单词检测中的错误,并提供更大的句子上下文以更好地利用语言模型。我们表明,所提出的技术不仅提高了OCR的准确性,还提高了其效率。尽管我们进行了全面的文献调查,但我们没有找到任何公共数据集来训练和基准测试从词级到行级OCR的这种转变。因此,我们还贡献了一个精心策划的数据集,包含251个带有行级注释的英文页面图像。我们的实验显示,端到端的准确性显著提高了5.4%,这突出了向行级OCR过渡的潜在好处,特别是对于文档图像。我们还报告了与基于单词的管道相比,效率提高了4倍。随着大型语言模型的不断改进,我们的方法也有可能利用这些进步。项目网址:https://nishitanand.github.io/line-level-ocr-website
摘要:Conventional optical character recognition (OCR) techniques segmented each character and then recognized. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to sequence translation in last decade led to modern techniques first detecting words and then inputting one word at a time to a model to directly output full words as sequence of characters. This allowed better utilization of language models and bypass error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word level OCR to line-level OCR. The proposal allows to bypass errors in word detection, and provides larger sentence context for better utilization of language models. We show that the proposed technique not only improves the accuracy but also efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such shift from word to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website


【21】Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
标题:利用物联网和生成式人工智能进行气候适应性教育中的天气适应性学习
链接:https://arxiv.org/abs/2508.21666

作者:A. Khan, Emmanuel G. Blanchard, Sébastien George
摘要:本文介绍了未来大气条件培训系统(FACTS),这是一个新的平台,通过基于地点的适应性学习经验推进气候适应性教育。FACTS将物联网传感器收集的实时大气数据与知识库中的精选资源相结合,以动态生成本地化的学习挑战。学习者的反应由生成AI驱动的服务器进行分析,该服务器提供个性化的反馈和自适应支持。用户评价的结果表明,参与者认为该系统易于使用,对建立与气候抗御能力有关的知识有效。这些研究结果表明,将物联网和生成人工智能集成到大气适应性学习技术中,对于提高教育参与度和培养气候意识具有重要意义。
摘要:This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.


【22】Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI
标题:以MEDLEY利用不完美:一种驾驭医疗人工智能偏见的多模型方法
链接:https://arxiv.org/abs/2508.21648

作者:tahi, Mehdi Astaraki, Fernando Seoane
摘要:医学人工智能中的偏见通常被视为需要消除的缺陷。然而,人类的推理本质上包含了由教育、文化和经验形成的偏见,这表明它们的存在可能是不可避免的,也可能是有价值的。我们提出了MEDLEY(Medical Ensemble Diagnostic system with Leveraged diversitY,利用多样性的医学集成诊断系统),这是一个概念框架,可以编排多个AI模型,同时保留其多样化的输出,而不是将它们坍缩为单一共识。与抑制分歧的传统方法不同,MEDLEY将模型特异性偏差记录为潜在优势,并将幻觉视为有待临床医生验证的临时假设。使用30多个大型语言模型开发了一个概念验证演示器,创建了一个最小可行产品,在合成病例中保留了共识和少数观点,使诊断的不确定性和潜在的偏见对临床监督透明。虽然它还不是经过验证的临床工具,但该演示说明了结构化的多样性如何在临床医生监督下增强医疗推理。通过将人工智能的缺陷重新定义为一种资源,MEDLEY提供了一种范式转变,为开发值得信赖的医疗人工智能系统开辟了新的监管、道德和创新途径。
摘要:Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.


【23】A-MHA*: Anytime Multi-Heuristic A*
标题:A-MHA*:随时多启发式A*
链接:https://arxiv.org/abs/2508.21637

作者:Natarajan, Muhammad Suhail Saleem, William Xiao, Sandip Aine, Howie Choset, Maxim Likhachev
摘要:为图搜索设计好的启发式函数需要足够的领域知识。通常很容易设计出在搜索空间某些部分表现良好并与真实剩余成本相关的启发式,但它们在整个域中可能不可采纳(inadmissible),从而影响搜索的最优性保证。多启发式A*(MHA*)提出了利用若干这类部分有效但不可采纳的启发式的有界次优搜索。虽然MHA*利用多个不可采纳的启发式来更快地生成次优解,但其原始版本并不会随着时间推移改进该解。它是一种一次性(one-shot)算法,需要仔细设置膨胀因子以获得所需的解。在这项工作中,我们将MHA*扩展为一个随时(anytime)版本来解决这一问题:它能快速找到可行的次优解,并在时间耗尽前不断改进。我们的工作受Anytime Repairing A*(ARA*)算法启发。我们证明,将ARA*的思想精确适配到MHA*框架中保留了原有的次优性界和完备性保证,并使MHA*能以随时方式运行。此外,我们报告了A-MHA*在3D路径规划和滑动拼图问题上的性能,并与MHA*及其他随时算法进行了比较。
摘要:Designing good heuristic functions for graph search requires adequate domain knowledge. It is often easy to design heuristics that perform well and correlate with the underlying true cost-to-go values in certain parts of the search space but these may not be admissible throughout the domain thereby affecting the optimality guarantees of the search. Bounded suboptimal search using several such partially good but inadmissible heuristics was developed in Multi-Heuristic A* (MHA*). Although MHA* leverages multiple inadmissible heuristics to potentially generate a faster suboptimal solution, the original version does not improve the solution over time. It is a one shot algorithm that requires careful setting of inflation factors to obtain a desired one time solution. In this work, we tackle this issue by extending MHA* to an anytime version that finds a feasible suboptimal solution quickly and continually improves it until time runs out. Our work is inspired from the Anytime Repairing A* (ARA*) algorithm. We prove that our precise adaptation of ARA* concepts in the MHA* framework preserves the original suboptimality and completeness guarantees and enhances MHA* to perform in an anytime fashion. Furthermore, we report the performance of A-MHA* in 3-D path planning domain and sliding tiles puzzle and compare against MHA* and other anytime algorithms.
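作为上文"随时(anytime)搜索"思想的极简示意,下面的草图通过逐步降低膨胀因子w反复运行加权A*并保留最优解(假设性示例:真正的ARA*/A-MHA*会复用先前搜索的信息而非从头重启,并且使用多个启发式,此处仅作说明):

```python
import heapq

def weighted_astar(graph, h, start, goal, w):
    # Weighted A*: priority f = g + w*h; returns (cost, path) or None.
    open_list = [(w * h[start], 0, start, [start])]
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path
        for nbr, cost in graph[node]:
            ng = g + cost
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(open_list, (ng + w * h[nbr], ng, nbr, path + [nbr]))
    return None

def anytime_astar(graph, h, start, goal, weights=(3.0, 2.0, 1.0)):
    # Naive anytime loop: shrink the inflation factor, keep the best solution so far.
    best = None
    for w in weights:
        result = weighted_astar(graph, h, start, goal, w)
        if result and (best is None or result[0] < best[0]):
            best = result
    return best
```

膨胀因子w越大,搜索越贪心、解越快但可能越差;随时间降低w即可在截止前持续改进解的质量。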


【24】QZhou-Embedding Technical Report
标题:QZhou-Embedding技术报告
链接:https://arxiv.org/abs/2508.21632

作者:En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
摘要:我们提出了QZhou-Embedding,一个通用的上下文文本嵌入模型,具有特殊的文本表示能力。基于Qwen2.5-7B-Instruct基础模型,我们设计了一个统一的多任务框架,包括专门的数据转换和培训策略。数据转换方案可以合并更多样化的文本训练数据集,而特定于任务的训练策略可以提高模型学习效率。我们开发了一个利用LLM API的数据合成管道,结合了释义、增强和硬否定示例生成等技术,以提高训练集的语义丰富性和样本难度。此外,我们采用了两阶段的训练策略,包括初始的以检索为重点的预训练,然后进行全任务微调,使嵌入模型能够基于强大的检索性能扩展其功能。我们的模型在MTEB和CMTEB基准测试中取得了最先进的结果,在两个排行榜上均排名第一(2025年8月27日),并同时在包括重新排序,聚类等任务上实现最先进的性能。我们的研究结果表明,更高质量,更多样化的数据对于提高检索模型性能至关重要,利用LLM生成能力可以进一步优化嵌入模型突破的数据质量。我们的模型权重在Apache 2.0许可下发布在HuggingFace上。为了重现性,我们在GitHub上提供了评估代码和说明。
摘要:We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (August 27 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.


【25】Integrating Large Language Models with Network Optimization for Interactive and Explainable Supply Chain Planning: A Real-World Case Study
标题:将大型语言模型与网络优化集成,以实现交互式和可解释的供应链规划:现实案例研究
链接:https://arxiv.org/abs/2508.21622

作者: Venkatachalam
摘要:本文提出了一种集成框架,将传统的网络优化模型与大型语言模型(LLM)相结合,为供应链计划提供交互式,可解释和角色感知的决策支持。该系统通过生成自然语言摘要、上下文可视化和量身定制的关键绩效指标(KPI),弥合了复杂的运筹学输出与业务利益相关者理解之间的差距。核心优化模型解决了战术库存再分配网络的配送中心多时期和多项目,使用混合整数制定。该技术架构结合了AI代理,RESTful API和动态用户界面,以支持实时交互,配置更新和基于模拟的见解。一个案例研究演示了该系统如何通过防止缺货、降低成本和保持服务水平来改善计划结果。未来的扩展包括集成私有LLM、迁移学习、强化学习和贝叶斯神经网络,以增强可解释性、适应性和实时决策。
摘要:This paper presents an integrated framework that combines traditional network optimization models with large language models (LLMs) to deliver interactive, explainable, and role-aware decision support for supply chain planning. The proposed system bridges the gap between complex operations research outputs and business stakeholder understanding by generating natural language summaries, contextual visualizations, and tailored key performance indicators (KPIs). The core optimization model addresses tactical inventory redistribution across a network of distribution centers for multi-period and multi-item, using a mixed-integer formulation. The technical architecture incorporates AI agents, RESTful APIs, and a dynamic user interface to support real-time interaction, configuration updates, and simulation-based insights. A case study demonstrates how the system improves planning outcomes by preventing stockouts, reducing costs, and maintaining service levels. Future extensions include integrating private LLMs, transfer learning, reinforcement learning, and Bayesian neural networks to enhance explainability, adaptability, and real-time decision-making.


【26】Physics-Informed Spectral Modeling for Hyperspectral Imaging
标题:用于高光谱成像的物理信息光谱建模
链接:https://arxiv.org/abs/2508.21618

作者:awrysiak, Krzysztof Krawiec
摘要:我们提出了PhISM,这是一种物理信息的深度学习架构,可以在没有监督的情况下学习,以明确地解开高光谱观测,并使用连续基函数对其进行建模。PhISM在几个分类和回归基准上优于先前的方法,需要有限的标记数据,并提供额外的见解,这要归功于可解释的潜在表示。
摘要:We present PhISM, a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions. PhISM outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights thanks to interpretable latent representation.
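上文"用连续基函数建模高光谱观测"的思路可用如下草图示意(假设性示例:这里以高斯基函数的线性组合近似光谱,PhISM实际使用的基函数与解耦机制以论文为准):

```python
import numpy as np

def gaussian_basis(wavelengths, centers, widths):
    # Evaluate one Gaussian basis function per (center, width) over a wavelength grid
    w = wavelengths[:, None]
    return np.exp(-0.5 * ((w - centers) / widths) ** 2)

def reconstruct_spectrum(wavelengths, centers, widths, coeffs):
    # Model a spectrum as a linear combination of continuous basis functions
    return gaussian_basis(wavelengths, centers, widths) @ coeffs
```

在这种表示下,系数与基函数参数构成低维、可解释的潜变量,这正是摘要所述"可解释潜在表示"的直观来源。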


【27】Scalable Solution Methods for Dec-POMDPs with Deterministic Dynamics
标题:具有确定性动力学的Dec-POMDPs可扩展求解方法
链接:https://arxiv.org/abs/2508.21595

作者: Alex Schutz, Zhikun Li, Bruno Lacerda, Robert Skilton, Nick Hawes
摘要:许多高层次的多智能体规划问题,包括多机器人导航和路径规划,可以有效地建模使用确定性的行动和观察。   在这项工作中,我们专注于这样的领域,并介绍了类的确定性分散POMDPs(Det-Dec-POMDPs)。这是Dec-POMDP的一个子类,其特征在于确定性转换和以状态和联合动作为条件的观察。   然后,我们提出了一个实用的求解器称为迭代确定性POMDP规划(IDPP)。该方法建立在经典的联合均衡搜索策略框架的基础上,并经过专门优化,以处理当前Dec-POMDP求解器无法有效解决的大规模Det-Dec-POMDP。
摘要:Many high-level multi-agent planning problems, including multi-robot navigation and path planning, can be effectively modeled using deterministic actions and observations.   In this work, we focus on such domains and introduce the class of Deterministic Decentralized POMDPs (Det-Dec-POMDPs). This is a subclass of Dec-POMDPs characterized by deterministic transitions and observations conditioned on the state and joint actions.   We then propose a practical solver called Iterative Deterministic POMDP Planning (IDPP). This method builds on the classic Joint Equilibrium Search for Policies framework and is specifically optimized to handle large-scale Det-Dec-POMDPs that current Dec-POMDP solvers are unable to address efficiently.


【28】Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
标题:Middo:模型知情的动态数据优化,通过闭环学习进行增强的LLM微调
链接:https://arxiv.org/abs/2508.21589

作者:g, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
备注:Accepted by EMNLP 2025 (main)
摘要:监督微调(SFT)大型语言模型(LLM)从根本上依赖于高质量的训练数据。虽然数据选择和数据合成是提高数据质量的两种常见策略,但现有方法通常面临静态数据集管理的局限性,无法适应不断发展的模型功能。在本文中,我们介绍了Middo,一个自我进化的模型通知的动态数据优化框架,使用模型感知的数据选择和上下文保持数据细化。与传统的一次性滤波/综合方法不同,我们的框架建立了一个闭环优化系统:(1)自参考诊断模块通过三轴模型信号-损失模式主动识别次优样本(复杂性),嵌入集群动态(多样性)和自我调整分数(质量);(2)自适应优化引擎然后将次优样本转换为教学上有价值的训练点,同时保持语义完整性;(3)这个优化过程通过动态学习原理随着模型能力不断发展。在多个基准测试上的实验表明,我们的方法始终如一地提高了种子数据的质量,并提高了LLM的性能,在保持原始数据集规模的同时,平均提高了7.15%的准确率。这项工作通过数据和模型的动态人类-人工智能共同进化为可持续LLM培训建立了一个新的范式。我们的数据集、模型和代码即将推出。
摘要:Supervised Fine-Tuning (SFT) Large Language Models (LLM) fundamentally rely on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our method consistently enhances the quality of seed data and boosts LLM's performance with improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.


【29】A Survey on Current Trends and Recent Advances in Text Anonymization
标题:文本匿名化的当前趋势和最新进展综述
链接:https://arxiv.org/abs/2508.21587

作者:ußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, Rafet Sifa
备注:Accepted at IEEE DSAA 2025
摘要:包含敏感个人信息的文本数据在各个领域的激增,要求有强大的匿名化技术在保护隐私、遵守法规的同时,为各类关键下游任务保留数据可用性。本综述全面概述了文本匿名化技术的当前趋势和最新进展。我们首先讨论以命名实体识别为核心的基础方法,然后考察大型语言模型带来的变革性影响,详述其作为复杂匿名化工具和强大去匿名化威胁的双重角色。综述进一步探讨了医疗、法律、金融和教育等关键领域的特定挑战与定制解决方案。我们考察了结合形式化隐私模型和风险感知框架的先进方法,并讨论了作者身份匿名化这一专门子领域。此外,我们还回顾了用于匿名化方案实际部署的评估框架、综合指标、基准和实用工具包。本综述巩固了当前的知识,指出了新兴趋势和持续存在的挑战,包括不断演变的隐私-效用权衡、处理准标识符的需要以及LLM能力带来的影响,旨在为该领域的学者和从业者指明未来的研究方向。
摘要:The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.


【30】Revisiting Landmarks: Learning from Previous Plans to Generalize over Problem Instances
标题:重访地标:从以往规划中学习以泛化到不同问题实例
链接:https://arxiv.org/abs/2508.21564

作者:u, Sebastijan Dumančić, Mathijs de Weerdt
摘要:我们提出了一个发现地标(landmark)并使其自动泛化到整个域的新框架。这些广义地标从一组已求解的实例中学习而来,描述了传统地标提取算法无法刻画的规划问题中间目标。我们的广义地标通过使用独立于具体问题对象的状态函数超越了域谓词的范围,这些函数适用于所有类似对象,从而捕获重复结构。基于这些函数,我们构建了定义地标推进过程的有向广义地标图,其中包括表示重复子计划的循环。我们展示了如何在启发式中使用该图来求解同一域的新问题实例。结果表明,从少量小实例中学习到的广义地标图对同一域中更大的实例同样有效。如果识别出指示重复的循环,启发式性能相比基线会有显著提升。广义地标捕获的域信息具有可解释性并对自动规划器有用。这些信息可以从同一域的少量计划中发现。
摘要:We propose a new framework for discovering landmarks that automatically generalize across a domain. These generalized landmarks are learned from a set of solved instances and describe intermediate goals for planning problems where traditional landmark extraction algorithms fall short. Our generalized landmarks extend beyond the predicates of a domain by using state functions that are independent of the objects of a specific problem and apply to all similar objects, thus capturing repetition. Based on these functions, we construct a directed generalized landmark graph that defines the landmark progression, including loop possibilities for repetitive subplans. We show how to use this graph in a heuristic to solve new problem instances of the same domain. Our results show that the generalized landmark graphs learned from a few small instances are also effective for larger instances in the same domain. If a loop that indicates repetition is identified, we see a significant improvement in heuristic performance over the baseline. Generalized landmarks capture domain information that is interpretable and useful to an automated planner. This information can be discovered from a small set of plans for the same domain.
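作为上文地标思想的极简示意,下面的草图实现了经典的地标计数启发式及"前驱均已达成"的推进前沿(假设性示例:graph以后继列表表示地标推进顺序,论文中的广义地标与循环结构在此未体现):

```python
def remaining_landmarks(graph, achieved):
    # graph: landmark -> list of successor landmarks (progression order).
    # The count of unachieved landmarks gives a landmark-count heuristic value;
    # the frontier is the set of landmarks whose predecessors are all achieved.
    unachieved = {n for n in graph if n not in achieved}
    frontier = {n for n in unachieved
                if all(p in achieved for p, succs in graph.items() if n in succs)}
    return len(unachieved), frontier
```

规划器可将剩余地标数作为启发式值,并优先朝前沿地标推进。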


【31】Limitations of Physics-Informed Neural Networks: a Study on Smart Grid Surrogation
标题:物理信息神经网络的局限性:智能电网替代研究
链接:https://arxiv.org/abs/2508.21559

作者:tero, Carmine Delle Femine, Kenji S. Muro, Marco Quartulli, Marcello Restelli
备注:Presented in PowerTech2025
摘要:物理信息神经网络(PINN)通过将物理定律直接集成到学习框架中,为智能电网建模提供了一种变革性的方法,解决了传统数据驱动方法中数据稀缺和物理一致性的关键挑战。本文评估了PINN作为智能电网动态替代模型的能力,比较了它们在三个关键实验中与XGBoost、随机森林和线性回归的性能:插值、交叉验证和情景轨迹预测。通过仅使用基于物理的损失函数(强制执行功率平衡、运行约束和电网稳定性)训练PINN,我们证明了它们优越的泛化能力,在误差降低方面优于数据驱动模型。值得注意的是,PINN在动态电网运行中保持相对较低的MAE,在随机和专家驱动的控制场景中可靠地捕获状态转换,而传统模型表现出不稳定的性能。尽管在极端运行条件下性能略有下降,但PINN始终保持物理可行性,这对于安全关键型应用至关重要。我们的研究结果有助于确立PINN作为智能电网替代建模的范式转换工具,将数据驱动的灵活性与第一性原理的严谨性相结合。这项工作推进了实时电网控制和可扩展的数字孪生,强调了关键任务能源系统中物理感知架构的必要性。
摘要:Physics-Informed Neural Networks (PINNs) present a transformative approach for smart grid modeling by integrating physical laws directly into learning frameworks, addressing critical challenges of data scarcity and physical consistency in conventional data-driven methods. This paper evaluates PINNs' capabilities as surrogate models for smart grid dynamics, comparing their performance against XGBoost, Random Forest, and Linear Regression across three key experiments: interpolation, cross-validation, and episodic trajectory prediction. By training PINNs exclusively through physics-based loss functions (enforcing power balance, operational constraints, and grid stability) we demonstrate their superior generalization, outperforming data-driven models in error reduction. Notably, PINNs maintain comparatively lower MAE in dynamic grid operations, reliably capturing state transitions in both random and expert-driven control scenarios, while traditional models exhibit erratic performance. Despite slight degradation in extreme operational regimes, PINNs consistently enforce physical feasibility, proving vital for safety-critical applications. Our results contribute to establishing PINNs as a paradigm-shifting tool for smart grid surrogation, bridging data-driven flexibility with first-principles rigor. This work advances real-time grid control and scalable digital twins, emphasizing the necessity of physics-aware architectures in mission-critical energy systems.
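上文"仅通过基于物理的损失函数训练"的做法可用如下草图示意(假设性示例:以功率平衡残差与运行约束越界量构造损失,函数名与约束形式均为示意,具体以论文为准):

```python
import numpy as np

def power_balance_loss(p_gen, p_load, p_loss):
    # Penalize violation of generation = load + losses at each time step / bus
    residual = p_gen - (p_load + p_loss)
    return float(np.mean(residual ** 2))

def physics_loss(pred, limits):
    # Combine the power-balance residual with operational-limit penalties
    balance = power_balance_loss(pred["gen"], pred["load"], pred["loss"])
    lo, hi = limits
    violation = np.maximum(lo - pred["gen"], 0) + np.maximum(pred["gen"] - hi, 0)
    return balance + float(np.mean(violation ** 2))
```

训练时将该物理损失(而非数据拟合误差)反向传播到网络参数,即可在无标注数据的情况下强制预测满足物理可行性。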


【32】EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
标题:EZ-Sort:通过基于Zero-Shot CLIP的预排序和人在环排序实现高效的成对比较
链接:https://arxiv.org/abs/2508.21550

作者:k, Haejun Chung, Ikbeom Jang
备注:5 pages, 2 figures, Accepted at CIKM 2025 (ACM International Conference on Information and Knowledge Management)
摘要:在主观或困难的注释任务中,成对比较通常比绝对评级或顺序分类更受欢迎,因为它提高了可靠性。然而,穷举比较需要大量的注释(O(n^2))。最近的工作通过使用排序算法主动采样成对比较,大大减少了注释负担(O(n log n))。我们通过以下方式进一步提高了标注效率:(1)使用对比语言-图像预训练(CLIP)模型在无需训练的情况下分层地对项目进行粗略预排序;(2)用自动比较代替简单、明显的人工比较。EZ-Sort首先产生一个基于CLIP的zero-shot预排序,然后初始化桶感知(bucket-aware)的Elo评分,最后运行一个不确定性引导的人在环MergeSort。使用各种数据集进行验证:面部年龄估计(FGNET)、历史图像年代排序(DHCI)和视网膜图像质量评估(EyePACS)。结果表明,与穷举成对比较相比,EZ-Sort将人类注释成本降低了90.5%,与之前的工作相比(当n = 100时)降低了19.8%,同时提高或保持了评分者间的可靠性。这些结果表明,将基于CLIP的先验与不确定性感知采样相结合,可以为成对排序提供一个高效且可扩展的解决方案。
摘要:Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
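上文提到的Elo评分更新可用标准Elo公式示意(假设性示例:论文中桶感知的初始化与不确定性引导的MergeSort此处未实现,仅展示一次成对比较后的评分更新):

```python
def elo_update(r_winner, r_loser, k=32):
    # Expected score of the winner from the rating gap (standard Elo formula),
    # then a K-factor update: upsets move ratings more than expected wins.
    expected_w = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_w)
    return r_winner + delta, r_loser - delta
```

评分差越大而结果越"爆冷",更新幅度越大;这类与预期不符的比较正是不确定性引导采样希望优先交给人工标注的比较。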


【33】What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems
标题:哪些数据是真正需要的?推荐系统推理数据最小化的可行性研究
链接:https://arxiv.org/abs/2508.21547

作者:en, Marco Favier, Bart Goethals
备注:Accepted for publication at the 34th ACM International Conference on Information and Knowledge Management (CIKM '25), November 10-14, 2025, Seoul, Republic of Korea
摘要:数据最小化是一项法律原则,要求个人数据处理仅限于特定目的所必需的范围。对于依赖大量个人数据的推荐系统,将这一原则付诸实施仍然是一个重大挑战。本文对这类系统的隐式反馈推理数据最小化问题进行了可行性研究。我们提出了一个新的问题表述,分析了多种最小化技术,并研究了影响其有效性的关键因素。我们证明,在没有显著性能损失的前提下,大幅减少推理数据在技术上是可行的。然而,其实用性主要取决于两个因素:技术设定(例如性能目标、模型的选择)和用户特性(例如历史长度、偏好复杂度)。因此,虽然我们确立了其技术可行性,但我们的结论是:数据最小化在实践中仍然具有挑战性,其对技术与用户情境的依赖使得统一的数据"必要性"标准难以落实。
摘要:Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data `necessity' difficult to implement.
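摘要中"在不损失性能的前提下寻找推理所真正需要的最少数据"这一思路,可以用一个玩具示例说明:一个极简的"推荐器"只推荐用户历史中最常见的类别,而最小化过程则寻找能复现完整历史推荐结果的最短近期历史。以下草图完全是本文假设的示意,不代表论文的方法细节:

```python
from collections import Counter

def recommend(history):
    """玩具推荐器:推荐历史中最常见的类别(平局取字典序较小者)。"""
    if not history:
        return None
    counts = Counter(history)
    return max(sorted(counts), key=counts.get)

def minimal_history_size(history):
    """数据最小化:找到最小的k,使得仅用最近k条记录
    即可得到与完整历史相同的推荐结果。"""
    full_rec = recommend(history)
    for k in range(1, len(history) + 1):
        if recommend(history[-k:]) == full_rec:
            return k
    return len(history)
```

在真实系统中,"推荐结果不变"会被替换为摘要所述的性能目标(例如NDCG下降不超过某阈值),而k则对应于需要保留的用户数据量。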


【34】Complete Gaussian Splats from a Single Image with Denoising Diffusion Models
标题:利用去噪扩散模型从单幅图像获得完整的高斯溅射
链接:https://arxiv.org/abs/2508.21542

作者:o, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, Michael Firman
备注:Main paper: 11 pages; Supplementary materials: 7 pages
摘要:高斯溅射通常需要对场景进行密集观察,并且可能无法重建被遮挡和未被观察到的区域。我们提出了一种潜在扩散模型,在推理时仅凭单幅图像即可用高斯溅射重建完整的3D场景,包括被遮挡的部分。补全场景中未被观察到的表面具有挑战性,因为合理的表面存在歧义。传统方法使用基于回归的表述为被遮挡和视锥外的表面预测单一"模式",导致模糊、不合理,且无法捕获多种可能的解释。因此,它们通常只能部分地解决这个问题:要么只处理与背景隔离的物体,要么只重建可见表面,要么无法从输入视图向远处外推。相比之下,我们提出了一种生成式表述,学习以单幅输入图像为条件的高斯溅射3D表示的分布。为了解决真值训练数据缺失的问题,我们提出了变分自动重建器(Variational AutoReconstructor),以自监督方式仅从2D图像中学习潜在空间,并在其上训练扩散模型。我们的方法生成忠实的重建和多样的样本,能够补全被遮挡的表面,实现高质量的360度渲染。
摘要:Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.


【35】HealthProcessAI: A Technical Framework and Proof-of-Concept for LLM-Enhanced Healthcare Process Mining
标题:HealthProcessAI:LLM-Enhanced Healthcare Process Mining的技术框架和概念验证
链接:https://arxiv.org/abs/2508.21540

作者:llueca-Fernandez, Kaile Chen, Fernando Seoane, Farhad Abtahi
摘要:流程挖掘已经成为一种强大的分析技术,用于理解复杂的医疗保健工作流程。然而,它的应用面临重大障碍,包括技术复杂性、缺乏标准化方法以及获得实践培训资源的机会有限。我们介绍HealthProcessAI,这是一个GenAI框架,旨在通过为现有Python(PM4PY)和R(bupaR)库提供全面的包装器,简化流程挖掘在医疗保健和流行病学中的应用。为了解决不熟悉问题并提高可及性,该框架集成了多个大型语言模型(LLM),用于自动化流程图解释和报告生成,帮助将技术分析转化为不同用户可以轻松理解的输出。我们使用脓毒症进展数据作为概念验证示例验证了该框架,并通过OpenRouter平台比较了五个最先进LLM模型的输出。在功能测试中,该框架成功地在四个概念验证场景中处理了脓毒症数据,展示了稳健的技术性能及其通过自动LLM分析生成报告的能力。使用五个独立LLM作为自动评估器的评估显示了不同模型的优势:Claude Sonnet-4和Gemini 2.5-Pro获得了最高的一致性得分(3.79/4.0和3.65/4.0)。通过集成多个大型语言模型(LLM)进行自动解释和报告生成,该框架解决了流程挖掘输出普遍不为人熟悉的问题,使临床医生、数据科学家和研究人员更容易使用它们。这种结构化分析与AI驱动解释的组合,代表了将复杂流程挖掘结果转化为医疗保健应用中潜在可操作见解的新颖方法学进展。
摘要:Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. To address unfamiliarity and improve accessibility, the framework integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLM models through the OpenRouter platform. To test its functionality, the framework successfully processed sepsis data across four proof-of-concept scenarios, demonstrating robust technical performance and its capability to generate reports through automated LLM analysis. LLM evaluation using five independent LLMs as automated evaluators revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0) when evaluated by automated LLM assessors. By integrating multiple Large Language Models (LLMs) for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, making them more accessible to clinicians, data scientists, and researchers. This structured analytics and AI-driven interpretation combination represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications.
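摘要提到该框架包装了PM4PY等流程挖掘库。流程挖掘最基础的构件之一是"直接跟随图"(directly-follows graph, DFG):统计事件日志中活动A紧接着活动B出现的频次。下面用纯Python手写一个极简DFG发现示例(非HealthProcessAI或PM4PY的真实API,事件日志为本文虚构的脓毒症风格玩具数据):

```python
from collections import Counter

def directly_follows(event_log):
    """从事件日志(每条病例为一个活动序列)统计"直接跟随"关系频次。"""
    dfg = Counter()
    for trace in event_log:
        for a, b in zip(trace, trace[1:]):  # 相邻活动对
            dfg[(a, b)] += 1
    return dfg

# 虚构的玩具日志:每个列表是一个病例的活动轨迹
log = [
    ["挂号", "分诊", "抗生素", "出院"],
    ["挂号", "分诊", "转ICU", "抗生素", "出院"],
    ["挂号", "分诊", "抗生素", "转ICU", "出院"],
]
dfg = directly_follows(log)
```

真实工具(如PM4PY)在此之上还提供频次过滤、可视化与一致性检验;摘要中的LLM层则负责把这类图转述成临床人员可读的报告。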


【36】Counterfactual Scenarios for Automated Planning
标题:自动规划的反事实场景
链接:https://arxiv.org/abs/2508.21521

作者:gante, Francesco Leofante, Andrea Micheli
备注:Accepted at the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025)
摘要:反事实解释(CE)是一种用于解释机器学习模型的强大技术,它展示模型的输入应如何被最小限度地改变才能使模型产生不同的输出。在自动规划的背景下也有类似的提议,其中CE被刻画为对现有计划的最小修改,使其能满足一个不同的目标。虽然这样的解释可能有助于诊断故障和推理计划的特征,但它们无法捕获所求解问题的更高层性质。为了解决这一局限,我们提出了一种基于反事实场景的新解释范式。特别地,给定一个规划问题$P$和一个定义计划所需性质的LTLf公式$\psi$,反事实场景确定对$P$的最小修改,使其容许符合$\psi$的计划。在本文中,我们基于对必须满足$\psi$的计划的显式量化,给出了反事实场景的两种定性实例化。随后,我们刻画了在允许对$P$进行不同类型修改时生成此类反事实场景的计算复杂性。我们表明,生成反事实场景的代价通常仅相当于为$P$计算一个计划,从而证明了我们提议的实际可行性,并最终为在该领域构建实用算法提供了一个框架。
摘要:Counterfactual Explanations (CEs) are a powerful technique used to explain Machine Learning models by showing how the input to a model should be minimally changed for the model to produce a different output. Similar proposals have been made in the context of Automated Planning, where CEs have been characterised in terms of minimal modifications to an existing plan that would result in the satisfaction of a different goal. While such explanations may help diagnose faults and reason about the characteristics of a plan, they fail to capture higher-level properties of the problem being solved. To address this limitation, we propose a novel explanation paradigm that is based on counterfactual scenarios. In particular, given a planning problem $P$ and an LTLf formula $\psi$ defining desired properties of a plan, counterfactual scenarios identify minimal modifications to $P$ such that it admits plans that comply with $\psi$. In this paper, we present two qualitative instantiations of counterfactual scenarios based on an explicit quantification over plans that must satisfy $\psi$. We then characterise the computational complexity of generating such counterfactual scenarios when different types of changes are allowed on $P$. We show that producing counterfactual scenarios is often only as expensive as computing a plan for $P$, thus demonstrating the practical viability of our proposal and ultimately providing a framework to construct practical algorithms in this area.
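"反事实场景"的核心思想(对规划问题$P$做最小修改,使其容许满足性质$\psi$的计划)可以在一个玩具设定中示意:把规划问题简化为有向图可达性(动作=边),$\psi$简化为"计划必须避开某个节点",并在候选修改中穷举最小的新增边集合。以下完全是本文构造的示意草图,与论文的LTLf形式化和复杂性结果无关:

```python
from collections import defaultdict, deque
from itertools import combinations

def has_plan(edges, start, goal, forbidden=frozenset()):
    """BFS:是否存在从start到goal且避开forbidden节点的"计划"(路径)。"""
    if start in forbidden:
        return False
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            return True
        for v in adj[u]:
            if v not in seen and v not in forbidden:
                seen.add(v)
                queue.append(v)
    return False

def counterfactual_scenario(edges, candidate_mods, start, goal, forbidden):
    """按规模从小到大枚举候选修改子集,返回使psi可满足的最小新增边集。"""
    for k in range(len(candidate_mods) + 1):
        for extra in combinations(candidate_mods, k):
            if has_plan(list(edges) + list(extra), start, goal, forbidden):
                return set(extra)
    return None

# 原问题P:唯一计划 s->x->g,但 psi 要求避开 x
P_edges = [("s", "x"), ("x", "g")]
candidates = [("s", "y"), ("y", "g"), ("s", "z")]
scenario = counterfactual_scenario(P_edges, candidates, "s", "g", frozenset({"x"}))
```

这里最小的反事实场景是同时新增两条边开辟绕行路线,呼应摘要中"对$P$的最小修改"的定义。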


【37】Modeling Wise Decision Making: A Z-Number Fuzzy Framework Inspired by Phronesis
标题:明智决策建模:受Phronesis启发的Z数模糊框架
链接:https://arxiv.org/abs/2508.21517

作者:an, Ankita Sharma, Romi Banerjee
备注:total 17 pages, main manuscript 12 pages, supplementary 5 pages, 6 tables in main manuscript, 5 figures in main manuscript, 2 tables in supplementary, and 3 figures in supplementary
摘要:背景:智慧是一种高阶结构,包括观点采择、反思性、亲社会取向、反思性移情行动和智识谦逊。与被二元思维严格束缚的传统推理模型不同,智慧在模糊的阴影中展开,既需要分级评估,也需要自我反思的谦逊。目前的测量依赖自我报告,很少反映明智推理中固有的谦逊与不确定性。一个同时考虑多维性和置信度的计算框架,有望改进心理科学并造就更人性化的人工智能。方法:我们提出了一个带Z数的模糊推理系统,每个决策都用一个智慧分数(限制部分)和一个置信分数(确定性部分)表示。作为本研究的一部分,参与者(N = 100)接受文化中立的图片式道德困境任务,产生有声思维的语言反应,这些反应被映射到五个基于理论的智慧成分上。各成分得分通过包含21条规则的规则库进行组合,隶属函数经高斯核密度估计调优。结果:在一项概念验证研究中,该系统产生的双属性智慧表征与既有量表呈适度但显著的相关,而与无关特质的关系可以忽略不计,支持了收敛效度和区分效度。贡献:本文的贡献在于将智慧形式化为一个多维的、具有不确定性意识的结构,并以Z数的形式将其操作化。除了推进心理学测量之外,它还展示了模糊Z数如何为人工智能系统提供可解释的、对置信度敏感的推理,在严格计算与类人判断之间提供一个安全的中间地带。
摘要:Background: Wisdom is a superordinate construct that embraces perspective taking, reflectiveness, prosocial orientation, reflective empathetic action, and intellectual humility. Unlike conventional models of reasoning that are rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity, requiring both graded evaluation and self-reflective humility. Current measures depend on self-reports and seldom reflect the humility and uncertainty inherent in wise reasoning. A computational framework that takes into account both multidimensionality and confidence has the potential to improve psychological science and allow humane AI. Method: We present a fuzzy inference system with Z numbers, each of the decisions being expressed in terms of a wisdom score (restriction) and confidence score (certainty). As part of this study, participants (N = 100) were exposed to culturally neutral pictorial moral dilemma tasks to which they generated think-aloud linguistic responses, which were mapped into five theoretically based components of wisdom. The scores of each individual component were combined using a base of 21 rules, with membership functions tuned via Gaussian kernel density estimation. Results: In a proof of concept study, the system produced dual attribute wisdom representations that correlated modestly but significantly with established scales while showing negligible relations with unrelated traits, supporting convergent and divergent validity. Contribution: The contribution is to formalize wisdom as a multidimensional, uncertainty-conscious construct, operationalized in the form of Z-numbers. In addition to progressing measurement in psychology, it calculates how fuzzy Z numbers can provide AI systems with interpretable, confidence-sensitive reasoning that affords a safe, middle ground between rigorous computation and human-like judgment.
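摘要中"每个决策表示为(智慧分数, 置信分数)的Z数"可以用一个高度简化的草图示意:对五个智慧成分各取一个"高智慧"三角隶属度,限制部分用模糊AND(取最小值)聚合,确定性部分取各成分证据置信度的均值。这只是对Z数思想的极简演示,并非论文的21条规则系统或核密度调优方法,所有数值均为虚构:

```python
def triangular(x, a, b, c):
    """三角隶属函数:在b处为1,在[a, c]之外为0。"""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def z_number(components, confidences):
    """极简Z数聚合:限制部分取各成分"高智慧"隶属度的最小值(模糊AND),
    确定性部分取各成分置信度的均值。"""
    wisdom = min(triangular(v, 0.0, 1.0, 2.0) for v in components.values())
    certainty = sum(confidences.values()) / len(confidences)
    return wisdom, certainty

# 虚构的五个智慧成分得分(0~1)及对应证据置信度
components = {"观点采择": 0.8, "反思性": 0.6, "亲社会": 0.9,
              "移情行动": 0.7, "智识谦逊": 0.5}
confidences = {"观点采择": 0.8, "反思性": 0.8, "亲社会": 0.8,
               "移情行动": 0.8, "智识谦逊": 0.6}
w, c = z_number(components, confidences)
```

取最小值意味着任何一个薄弱成分都会拉低整体"智慧"评分,这与把智慧视为多成分合取的观点一致;真实系统会用完整规则库替代这一步。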


【38】On the Hardness of Learning GNN-based SAT Solvers: The Role of Graph Ricci Curvature
标题:关于学习基于GNN的SAT求解器的难度:图Ricci曲率的作用
链接:https://arxiv.org/abs/2508.21513

作者:deri
备注:Preprint
摘要:图神经网络(GNN)最近通过在逻辑公式的图表示上运行,显示出作为布尔可满足性问题(SAT)求解器的潜力。然而,它们在更难的实例上性能急剧下降,这是否反映了基本的架构限制?在这项工作中,我们通过图Ricci曲率(RC)的视角提供了一个几何解释,该曲率量化了局部连接瓶颈。我们证明了由随机k-SAT公式导出的二部图本质上是负曲率的,并且这种曲率随实例难度增加而减小。在此基础上,我们证明了基于GNN的SAT求解器受到过度挤压(oversquashing)的影响,即长程依赖关系无法被压缩进固定长度的表示。我们在不同的SAT基准上对我们的论断进行了实证验证,并确认曲率既是问题复杂性的有力指标,也可用于预测性能。最后,我们将我们的发现与现有求解器的设计原则联系起来,并勾勒出未来工作的有希望的方向。
摘要:Graph Neural Networks (GNNs) have recently shown promise as solvers for Boolean Satisfiability Problems (SATs) by operating on graph representations of logical formulas. However, their performance degrades sharply on harder instances, raising the question of whether this reflects fundamental architectural limitations. In this work, we provide a geometric explanation through the lens of graph Ricci Curvature (RC), which quantifies local connectivity bottlenecks. We prove that bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that this curvature decreases with instance difficulty. Building on this, we show that GNN-based SAT solvers are affected by oversquashing, a phenomenon where long-range dependencies become impossible to compress into fixed-length representations. We validate our claims empirically across different SAT benchmarks and confirm that curvature is both a strong indicator of problem complexity and can be used to predict performance. Finally, we connect our findings to design principles of existing solvers and outline promising directions for future work.
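为直观展示"k-SAT二部图负曲率"的含义,下面用一个常见的廉价近似——Forman-Ricci曲率的最简形式(无权图、忽略高阶项时,边$e=(u,v)$的曲率为$F(e)=4-\deg(u)-\deg(v)$)——在一个小的3-SAT变量-子句二部图上计算曲率。这只是示意,论文具体采用的曲率定义以原文为准:

```python
from collections import defaultdict

def forman_curvature(edges):
    """简化的Forman-Ricci曲率(无权图,忽略三角形等高阶项):
    F(u, v) = 4 - deg(u) - deg(v)。"""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): 4 - deg[u] - deg[v] for u, v in edges}

def sat_bipartite(clauses):
    """k-SAT公式的变量-子句二部图:每个子句与其中出现的变量相连。"""
    edges = []
    for i, clause in enumerate(clauses):
        for lit in clause:
            edges.append((f"c{i}", f"x{abs(lit)}"))
    return edges

# 一个小的3-SAT实例:(x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x2 ∨ x4) ∧ (x2 ∨ ¬x3 ∨ ¬x4)
clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]
curv = forman_curvature(sat_bipartite(clauses))
```

每个子句节点的度至少为k=3,变量节点的度随出现次数增长,因此所有边的曲率都是负的;变量复用越多(实例越"纠缠"),曲率越负,对应摘要中曲率随难度下降的论断。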


【39】ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
标题:ELV-Halluc:对长视频理解中的语义聚合幻觉进行基准测试
链接:https://arxiv.org/abs/2508.21496

作者:iahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
摘要:视频多模态大语言模型(Video-MLLM)在视频理解方面取得了显著进展。然而,它们仍然容易产生与视频输入不一致或无关的幻觉内容。以往的视频幻觉基准主要集中在短视频上,它们将幻觉归因于强语言先验、缺失帧或视觉编码器引入的视觉-语言偏差等因素。虽然这些原因确实解释了短视频中的大多数幻觉,但它们仍然过度简化了幻觉的成因。有时,模型生成错误的输出,但帧级语义却是正确的。我们把这种幻觉称为语义聚合幻觉(SAH),它产生于将帧级语义聚合为事件级语义组的过程中。由于跨多个事件的语义复杂性增加,SAH在长视频中变得尤为关键,因此必须分离并彻底调查这类幻觉的成因。为了解决上述问题,我们引入了ELV-Halluc,这是第一个致力于长视频幻觉的基准,使得对SAH的系统性研究成为可能。我们的实验证实了SAH的存在,并表明它随语义复杂性而增加。此外,我们发现模型在语义快速变化时更容易产生SAH。我们还讨论了缓解SAH的潜在方法:我们证明了位置编码策略有助于减轻SAH,并进一步采用DPO策略来提高模型区分事件内与跨事件语义的能力。为此,我们策划了一个包含8K对抗数据对的数据集,并在ELV-Halluc和Video-MME上均取得改进,包括SAH比率大幅降低27.7%。
摘要:Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination-producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short-videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that positional encoding strategy contributes to alleviating SAH, and further adopt DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
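摘要中用于缓解SAH的DPO(Direct Preference Optimization)策略,其核心是一个标量损失:给定被选回答与被拒回答在策略模型和参考模型下的对数概率,损失为$-\log\sigma\big(\beta[(\log\pi_w-\log\pi_{ref,w})-(\log\pi_l-\log\pi_{ref,l})]\big)$。下面是这一损失的标量草图(仅示意公式本身,数值为虚构,不涉及论文的8K对抗数据构造):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO损失:鼓励策略相对参考模型更偏好被选(chosen)回答。
    logp_w/logp_l:策略模型对被选/被拒回答的对数概率;
    ref_logp_*:参考模型的对应对数概率。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

当策略与参考模型无差异时损失为log 2;策略越偏好被选回答(例如正确的事件级描述而非SAH式的错配描述),损失越低。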


【40】Priors Matter: Addressing Misspecification in Bayesian Deep Q-Learning
标题:先验知识很重要:解决贝叶斯深度Q学习中的错误指定
链接:https://arxiv.org/abs/2508.21488

作者: van der Vaart, Neil Yorke-Smith, Matthijs T.J. Spaan
摘要:强化学习中的不确定性量化可以大大提高探索性和鲁棒性。近似贝叶斯方法最近已经推广到量化无模型算法中的不确定性。然而,到目前为止,重点一直是提高后验近似的准确性,而不是研究后验的先验和似然假设的准确性。在这项工作中,我们证明了贝叶斯深度Q学习中存在冷后验效应,与理论相反,当降低后验温度时,性能会提高。为了识别和克服可能的原因,我们挑战对贝叶斯无模型算法中的可能性和先验知识所做的常见假设。我们实证研究先验分布,并通过统计检验表明,常见的高斯似然假设经常被违反。我们认为,开发更合适的似然和先验应该是未来贝叶斯强化学习研究的重点,我们为深度Q学习中更好的先验提供了简单,可实现的解决方案,从而导致更高性能的贝叶斯算法。
摘要:Uncertainty quantification in reinforcement learning can greatly improve exploration and robustness. Approximate Bayesian approaches have recently been popularized to quantify uncertainty in model-free algorithms. However, so far the focus has been on improving the accuracy of the posterior approximation, instead of studying the accuracy of the prior and likelihood assumptions underlying the posterior. In this work, we demonstrate that there is a cold posterior effect in Bayesian deep Q-learning, where contrary to theory, performance increases when reducing the temperature of the posterior. To identify and overcome likely causes, we challenge common assumptions made on the likelihood and priors in Bayesian model-free algorithms. We empirically study prior distributions and show through statistical tests that the common Gaussian likelihood assumption is frequently violated. We argue that developing more suitable likelihoods and priors should be a key focus in future Bayesian reinforcement learning research and we offer simple, implementable solutions for better priors in deep Q-learning that lead to more performant Bayesian algorithms.
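摘要中的"冷后验效应"指对后验按温度$T<1$回火(对后验密度取$1/T$次幂)反而提升性能。在共轭高斯模型中,这一操作有封闭形式:回火后验与标准后验同均值,方差乘以$T$。下面用一个玩具高斯示例演示这一点(仅为说明冷后验的含义,与论文的深度Q学习实验无关):

```python
def gaussian_posterior(prior_mu, prior_var, data, noise_var, temperature=1.0):
    """已知噪声方差的共轭高斯后验;temperature < 1 即"冷后验":
    对后验密度取1/T次幂,使精度放大1/T,方差乘以T,均值不变。"""
    prec = 1.0 / prior_var + len(data) / noise_var   # 标准后验精度
    mu = (prior_mu / prior_var + sum(data) / noise_var) / prec
    var = temperature / prec                          # 回火只缩放方差
    return mu, var

data = [1.2, 0.8, 1.0]
warm = gaussian_posterior(0.0, 1.0, data, 0.5)                   # T = 1
cold = gaussian_posterior(0.0, 1.0, data, 0.5, temperature=0.1)  # 冷后验
```

冷后验把概率质量集中到高密度区域,相当于更"自信"的信念;论文指出在贝叶斯深度Q学习中观察到这种回火带来增益,恰恰提示所用的先验/似然假设(如高斯似然)可能被违反。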


【41】HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
标题:HSFN:通过构建异构集成进行假新闻检测的分层选择
链接:https://arxiv.org/abs/2508.21482

作者:outinho, Rafael M.O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti
备注:Accepted by IEEE International Conference on Systems, Man, and Cybernetics (SMC) - IEEE SMC 2025
摘要:确认偏误等心理偏差使个体特别容易相信并传播社交媒体上的假新闻,在公共卫生和政治等领域造成重大后果。基于机器学习的事实核查系统已被广泛研究以缓解这一问题。其中,集成方法通过组合多个分类器来提高鲁棒性,尤其有效。然而,其性能在很大程度上取决于组成分类器的多样性——选择真正多样的模型仍然是一个关键挑战,尤其当模型倾向于学习冗余模式时。在这项工作中,我们提出了一种新的自动分类器选择方法,优先考虑多样性,并辅以性能考量。该方法首先计算分类器之间的成对多样性,并应用层次聚类将它们组织成不同粒度级别的组。随后,HierarchySelect探索这些层次级别,在每个级别上选择一个分类器池,每个池代表一种独特的池内多样性,并从中确定并选择最多样化的池用于集成构建。选择过程还结合了反映各分类器性能的评价指标,以确保集成模型也具有良好的泛化能力。我们在来自不同应用领域、类别数各异的六个数据集上用40个异构分类器进行了实验,并与肘部启发式方法和最先进的基线进行了比较。结果表明,我们的方法在六个数据集中的两个上达到了最高的准确率。实现细节可在项目仓库中找到:https://github.com/SaraBCoutinho/HSFN。
摘要:Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers-selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity, also extended by performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. The most diverse pool is identified and selected for ensemble construction from these. The selection process incorporates an evaluation metric reflecting each classifier's performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project's repository: https://github.com/SaraBCoutinho/HSFN .
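摘要中"计算成对多样性并选出最多样化的分类器池"这一步,可用一个极简草图示意:用预测不一致率度量成对多样性,在小规模下直接穷举选出平均不一致度最高的池(论文用的是层次聚类加HierarchySelect,此处仅演示多样性度量与池选择的思想,数据为虚构):

```python
from itertools import combinations

def disagreement(pred_a, pred_b):
    """成对多样性:两个分类器在验证集上预测不一致的比例。"""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)

def most_diverse_pool(preds, pool_size):
    """小规模穷举:返回平均成对不一致度最高的分类器池及其多样性。"""
    names = list(preds)
    best, best_div = None, -1.0
    for pool in combinations(names, pool_size):
        pairs = list(combinations(pool, 2))
        div = sum(disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)
        if div > best_div:
            best, best_div = pool, div
    return best, best_div

# 四个虚构分类器在4个样本上的预测:A与B冗余,C与它们互补
preds = {
    "A": [0, 0, 1, 1],
    "B": [0, 0, 1, 1],
    "C": [1, 1, 0, 0],
    "D": [0, 1, 0, 1],
}
pool, div = most_diverse_pool(preds, 2)
```

冗余的A、B不会被同时选入,互补的A、C组合胜出;实际方法还会像摘要所述,叠加性能指标以保证所选池的泛化能力。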


【42】Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
标题:点燃小型语言模型的创意写作:LLM-as-a-Judge与多智能体精炼奖励
链接:https://arxiv.org/abs/2508.21476

作者:Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
备注:EMNLP 2025 Main
摘要:大型语言模型(LLM)已经展现出非凡的创意写作能力,但其巨大的计算需求阻碍了其广泛使用。增强小型语言模型(SLM)提供了一个有前景的替代方案,但目前的方法各有局限:监督微调(SFT)难以产生新颖性,而基于人类反馈的强化学习(RLHF)成本高昂。本文在基于AI反馈的强化学习(RLAIF)框架内探索了两种不同的AI驱动奖励策略,以点燃一个7B参数SLM的创意写作能力,具体任务是生成中文祝福语。第一种策略采用在高质量偏好数据上训练的奖励模型(RM),该数据由一个为创造性任务设计的新型多智能体拒绝采样框架策划。第二种更新颖的策略利用原则引导的LLM-as-a-Judge,其奖励函数通过带反思机制的对抗训练方案进行优化,以直接提供奖励信号。综合实验表明,虽然两种方法都显著提升了超越基线的创意输出,但原则引导的LLM-as-a-Judge明显产生了更优的生成质量。此外,它在训练效率和减少对人工标注数据的依赖方面具有显著优势,为创造性SLM提供了一条更具可扩展性和有效性的路径。我们的自动化评估方法也与人类判断保持高度一致。我们的代码和数据可在https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models上公开获取。
摘要:Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.


【43】MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents
标题:MMSearch-Plus:一个简单但具有挑战性的多模态浏览智能体基准
链接:https://arxiv.org/abs/2508.21475

作者:, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong
备注:Project Page: this https URL
摘要:大型多模态语言模型(MLLM)越来越多地被部署为网络智能体,但许多多模态浏览基准可以通过依赖高召回率图像搜索和邻近文本的浅层固定工作流来解决,这掩盖了细粒度视觉推理、来源验证和长程工具使用等真正的多模态挑战。我们引入MMSearch-Plus,这是一个包含311个任务的基准,这些任务对多模态理解提出高要求,同时保留了强文本浏览测试套件的难度特征。每个条目都被构造为包含多个微弱的、局部化的视觉信号,这些信号必须被提取、经由迭代的文本-图像搜索传播,并在检索噪声下交叉验证后才能作答。我们的数据策划流程"时空外推"所构造的问题,其答案需要从空间线索(微小文字、部件级外观、布局、标牌)和时间痕迹(转播叠加信息、季节性背景)外推到图像之外的事实,如事件、日期和地点。我们提供了一个与模型无关、配备浏览工具的智能体框架,并评估了一系列闭源和开源MLLM。在我们的框架下,最强的智能体(o3)在不使用搜索时达到15.1%的准确率,使用rollout后达到36.0%,而一个强大的开源模型(Qwen-2.5-VL-72B-Instruct)在不使用搜索时为0.0%,经过20轮搜索后为6.9%。除了答案准确率之外,我们还评估了边界框生成和裁剪图像搜索,并进行了错误分析,揭示了在来源验证、基于部件的推理和长程规划方面的失败。
摘要:Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text-masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that highly demand multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% without search and 36.0% accuracy with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.


【44】Controllable 3D Molecular Generation for Structure-Based Drug Design Through Bayesian Flow Networks and Gradient Integration
标题:通过Bayesian流网络和梯度积分实现基于结构的药物设计的可控3D分子生成
链接:https://arxiv.org/abs/2508.21468

作者: Choi, Hwanhee Kim, Chihyun Park, Dahyeon Lee, Seungyong Lee, Yoonju Kim, Hyoungjoon Park, Sein Kwon, Youngwan Jo, Sanghyun Park
摘要:基于结构的药物设计(SBDD)的最新进展已利用生成模型进行3D分子生成,主要通过与靶蛋白的结合亲和力来评估模型性能。然而,实际的药物发现不仅需要高结合亲和力,还需要合成可行性和选择性,这些关键性质在以往的评估中很大程度上被忽视。为弥补这一差距,我们指出了传统基于扩散的生成模型在有效引导分子生成朝向这些多样化药理学性质方面的根本局限。我们提出了CByG,一个将贝叶斯流网络扩展为基于梯度的条件生成模型的新框架,它稳健地集成了特定性质的引导。此外,我们引入了一个综合评估方案,纳入了结合亲和力、合成可行性和选择性的实用基准,克服了传统评估方法的局限性。大量实验表明,我们提出的CByG框架在多个基本评估标准上显著优于基线模型,突显了其在现实世界药物发现应用中的有效性和实用性。
摘要:Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose CByG, a novel framework extending Bayesian Flow Network into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed CByG framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world drug discovery applications.


【45】Diffusion-based Multi-modal Synergy Interest Network for Click-through Rate Prediction
标题:基于扩散的多模态协同兴趣网络用于点击率预测
链接:https://arxiv.org/abs/2508.21460

作者:i, Weihai Lu, Yu Tong, Yiheng Li, Zhejun Zhao
备注:SIGIR 2025
摘要:点击率(CTR)预测通过对用户兴趣建模来完成。然而,现有的CTR预测方法大多仅基于ID模态,因此无法全面建模用户的多模态偏好,有必要引入多模态CTR预测。虽然直接将现有的多模态融合方法应用于点击率预测模型看似可行,但这些方法(1)未能有效解耦不同模态之间的共性和特性;(2)未能考虑模态之间的协同效应并对模态间的复杂交互进行建模。   针对上述问题,本文提出了用于点击率预测的基于扩散的多模态协同兴趣网络(Diff-MSIN)框架。该框架引入了三个创新模块:多模态特征增强(MFE)模块、协同关系捕获(SRC)模块和特征动态自适应融合(FDAF)模块。MFE模块和SRC模块提取不同模态之间的协同、共性和特性信息,有效增强了模态表示,提高了融合的整体质量。为了鼓励不同特征之间的区分性,我们设计了一种知识解耦方法。此外,FDAF模块专注于捕捉用户偏好并减少融合噪声。为了验证Diff-MSIN框架的有效性,我们使用Rec-Tmall和三个Amazon数据集进行了广泛的实验。结果表明,与基线相比,我们的方法取得了至少1.67%的显著改进,突显了其增强多模态推荐系统的潜力。我们的代码可在以下链接中找到:https://github.com/Cxx-0/Diff-MSIN。
摘要:In click-through rate prediction, click-through rate prediction is used to model users' interests. However, most of the existing CTR prediction methods are mainly based on the ID modality. As a result, they are unable to comprehensively model users' multi-modal preferences. Therefore, it is necessary to introduce multi-modal CTR prediction. Although it seems appealing to directly apply the existing multi-modal fusion methods to click-through rate prediction models, these methods (1) fail to effectively disentangle commonalities and specificities across different modalities; (2) fail to consider the synergistic effects between modalities and model the complex interactions between modalities.   To address the above issues, this paper proposes the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN) framework for click-through prediction. This framework introduces three innovative modules: the Multi-modal Feature Enhancement (MFE) Module, the Synergistic Relationship Capture (SRC) Module, and the Feature Dynamic Adaptive Fusion (FDAF) Module. The MFE Module and SRC Module extract synergistic, common, and special information among different modalities. They effectively enhance the representation of the modalities, improving the overall quality of the fusion. To encourage distinctiveness among different features, we design a Knowledge Decoupling method. Additionally, the FDAF Module focuses on capturing user preferences and reducing fusion noise. To validate the effectiveness of the Diff-MSIN framework, we conducted extensive experiments using the Rec-Tmall and three Amazon datasets. The results demonstrate that our approach yields a significant improvement of at least 1.67% compared to the baseline, highlighting its potential for enhancing multi-modal recommendation systems. Our code is available at the following link: https://github.com/Cxx-0/Diff-MSIN.


【46】Learning Lifted Action Models From Traces of Incomplete Actions and States
标题:从不完整动作和状态的轨迹中学习提升的动作模型
链接:https://arxiv.org/abs/2508.21449

作者:nsen, Jonas Gösgens, Hector Geffner
备注:To be presented at KR 2025
摘要:考虑从随机状态-动作轨迹中学习滑动瓦片谜题的提升(lifted)STRIPS模型的问题,其中状态仅表示瓦片的位置,动作是不带参数的标签up、down、left、right。这个问题涉及两个挑战。首先,这些状态不是完整的STRIPS状态,因为缺少了一些谓词,比如表示"空格(blank)"位置的原子。其次,这些动作也不是完整的STRIPS动作,因为它们没有揭示动作的效果和前提条件所涉及的所有对象。以前的方法已经解决了这个模型学习问题的不同版本,但大多数假设轨迹中的动作是完整的STRIPS动作,或者领域谓词都是可观察的。本文考虑的新设定更加"现实":观察到的原子传达了世界的状态但不是完整的STRIPS状态,动作揭示了选择动作所需的参数,但不是在STRIPS中对其建模所需的参数。为了形式化并解决该学习问题,我们引入了STRIPS的一个变体,称为STRIPS+,其中某些STRIPS动作参数可以在前提条件中保持隐式,前提条件还可以包含受限形式的存在量化。学习问题由此变为从STRIPS+状态-动作轨迹学习STRIPS+模型的问题。为此,所提出的学习算法SYNTH为每个动作构造一个分层的前提条件表达式(合取)序列或"查询",它们指称状态中的唯一对象,并为STRIPS+中的隐式动作参数提供接地。我们建立了SYNTH的正确性和完整性,并在由现有STRIPS领域导出的STRIPS+模型所生成的状态-动作轨迹上测试了其可扩展性。
摘要:Consider the problem of learning a lifted STRIPS model of the sliding-tile puzzle from random state-action traces where the states represent the location of the tiles only, and the actions are the labels up, down, left, and right, with no arguments. Two challenges are involved in this problem. First, the states are not full STRIPS states, as some predicates are missing, like the atoms representing the position of the ``blank''. Second, the actions are not full STRIPS either, as they do not reveal all the objects involved in the actions effects and preconditions. Previous approaches have addressed different versions of this model learning problem, but most assume that actions in the traces are full STRIPS actions or that the domain predicates are all observable. The new setting considered in this work is more ``realistic'', as the atoms observed convey the state of the world but not full STRIPS states, and the actions reveal the arguments needed for selecting the action but not the ones needed for modeling it in STRIPS. For formulating and addressing the learning problem, we introduce a variant of STRIPS, which we call STRIPS+, where certain STRIPS action arguments can be left implicit in preconditions which can also involve a limited form of existential quantification. The learning problem becomes the problem of learning STRIPS+ models from STRIPS+ state-action traces. For this, the proposed learning algorithm, called SYNTH, constructs a stratified sequence (conjunction) of precondition expressions or ``queries'' for each action, that denote unique objects in the state and ground the implicit action arguments in STRIPS+. The correctness and completeness of SYNTH is established, and its scalability is tested on state-action traces obtained from STRIPS+ models derived from existing STRIPS domains.
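从状态-动作轨迹归纳动作模型的经典命题化基线思路是:效果取前后状态之差,前提条件取该动作所有出现时成立原子的交集。论文的SYNTH算法学习的是带隐式参数的提升STRIPS+模型,远比这复杂;下面的草图仅演示这一基础思想,轨迹数据为本文虚构:

```python
def learn_action_model(traces):
    """从(前状态, 动作, 后状态)三元组归纳命题化动作模型:
    前提条件 = 该动作所有出现时成立原子的交集;
    添加/删除效果 = 后状态与前状态的差集。"""
    model = {}
    for pre, act, post in traces:
        add, delete = post - pre, pre - post
        if act not in model:
            model[act] = {"pre": set(pre), "add": set(add), "del": set(delete)}
        else:
            model[act]["pre"] &= pre      # 交集过滤掉偶然成立的原子
            model[act]["add"] |= add
            model[act]["del"] |= delete
    return model

# 虚构轨迹:同一接地动作出现两次;"light_on"只是偶然成立的噪声原子
traces = [
    ({"at(c1)", "blank(c2)", "light_on"}, "move12",
     {"at(c2)", "blank(c1)", "light_on"}),
    ({"at(c1)", "blank(c2)"}, "move12",
     {"at(c2)", "blank(c1)"}),
]
model = learn_action_model(traces)
```

第二次出现不含"light_on",交集操作便把这个偶然原子从前提条件中剔除——这正是多条轨迹帮助收紧模型的机制;摘要中的不完整观察(缺失blank原子、隐式参数)使真实问题比这困难得多。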


【47】A General Framework of Epistemic Forgetting and its Instantiation by Ranking Functions
标题:认识遗忘的一般框架及其通过排名函数的实例化
链接:https://arxiv.org/abs/2508.21441

作者: Beierle, Alexander Hahn, Diana Howey, Gabriele Kern-Isberner, Kai Sauerwald
摘要:遗忘作为一种知识管理操作,出于各种原因,有意地忽略代理的部分知识和信念。遗忘有很多方面:人们可能想忘记部分语法、一个命题或一个条件句。文献中已经提出并深入研究了两类适合执行遗忘的主要算子:第一,变量消除是一种语法方法,它屏蔽掉某些原子变量,以便聚焦于语言的其余部分,主要用于逻辑编程和回答集编程领域。第二,AGM信念修正理论中的收缩在逻辑演绎下有效地将命题从信念集中移除。这两种操作主要依赖于经典逻辑。在本文中,我们采取认知视角,研究具有更丰富语义结构、但与命题逻辑有明确联系的认知状态上的遗忘操作。这使我们能够研究在认知背景下遗忘意味着什么,从而将众所周知的和新颖的遗忘操作提升到认知层面。我们提出了五种一般类型的认知遗忘,并针对Spohn的排序函数用七个具体的遗忘操作将其实例化。我们从逻辑编程和AGM理论中的遗忘公设中获得灵感,提出了一个丰富的公设景观来评估遗忘操作。最后,我们根据所有公设评估所有具体的遗忘操作,得到一个新的全面概览,突出各遗忘算子之间的差异和共性。
摘要:Forgetting as a knowledge management operation deliberately ignores parts of the knowledge and beliefs of an agent, for various reasons. Forgetting has many facets, one may want to forget parts of the syntax, a proposition, or a conditional. In the literature, two main operators suitable for performing forgetting have been proposed and investigated in depth: First, variable elimination is a syntactical method that blends out certain atomic variables to focus on the rest of the language. It has been mainly used in the area of logic programming and answer set programming. Second, contraction in AGM belief revision theory effectively removes propositions from belief sets under logical deduction. Both operations rely mainly on classical logics. In this article, we take an epistemic perspective and study forgetting operations in epistemic states with richer semantic structures, but with clear links to propositional logic. This allows us to investigate what forgetting in the epistemic background means, thereby lifting well-known and novel forgetting operations to the epistemic level. We present five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn's ranking functions. We take inspiration from postulates of forgetting both from logic programming and AGM theory to propose a rich landscape of axioms for evaluating forgetting operations. Finally, we evaluate all concrete forgetting operations according to all postulates, leading to a novel comprehensive overview highlighting differences and commonalities among the forgetting operators.
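Spohn排序函数把每个可能世界映射到一个不可信度等级(至少有一个世界等级为0)。下面是一个玩具Python示意(假设性实现,并非论文中七个遗忘算子中的任何一个),展示通过对被遗忘变量取最小值(边缘化)实现的一种最简单的遗忘方式:

```python
# Spohn 排序函数上"遗忘一个变量"的玩具示意(假设性实现)。
# 可能世界用 ((变量, 真值), ...) 的元组表示;κ 给出每个世界的不可信度等级。

kappa = {
    (("p", True), ("q", True)): 0,
    (("p", True), ("q", False)): 1,
    (("p", False), ("q", True)): 2,
    (("p", False), ("q", False)): 3,
}

def forget(kappa, var):
    """通过对被遗忘变量的所有取值取最小等级(边缘化)来遗忘它。"""
    out = {}
    for world, rank in kappa.items():
        reduced = tuple((v, b) for v, b in world if v != var)
        out[reduced] = min(out.get(reduced, rank), rank)
    return out

k2 = forget(kappa, "q")
print(k2)  # {(('p', True),): 0, (('p', False),): 2}
```

遗忘后的排序函数只在剩余语言上定义,且最小等级仍为0,保持了排序函数的合法性。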


【48】MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
标题:MedShift:用于X射线域自适应的隐式条件传输
链接:https://arxiv.org/abs/2508.21435

作者: Caetano, Christiaan Viviers, Peter H.H. de With, Fons van der Sommen
备注:Accepted at the ICCV 2025 AIM Workshop
摘要:合成医疗数据为训练鲁棒模型提供了一种可扩展的解决方案,但显著的领域差距限制了其对现实世界临床环境的推广。本文讨论头部的合成X射线图像与真实X射线图像之间跨域转换的挑战,重点是弥合衰减行为、噪声特性和软组织表示方面的差异。我们提出了MedShift,一个基于流匹配和薛定谔桥的统一类条件生成模型,它能够在多个域之间进行高保真的非配对图像翻译。与需要特定领域训练或依赖配对数据的先前方法不同,MedShift学习一个共享的、领域无关的潜在空间,并支持在训练期间见过的任意一对领域之间无缝转换。我们引入了X-DigiSkull,一个包含不同辐射剂量下对齐的合成与真实头骨X射线的新数据集,用于对域转换模型进行基准测试。实验结果表明,尽管与基于扩散的方法相比模型尺寸更小,MedShift仍提供了强大的性能,并且在推理时保持灵活:它可以被调节为优先考虑感知保真度或结构一致性,使其成为医学成像领域自适应的可扩展、可推广的解决方案。代码和数据集可在https://caetas.github.io/medshift.html上获得。
摘要:Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html


【49】The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management
标题:复杂性陷阱:在代理上下文管理中,简单的观察掩蔽与LLM摘要同样高效
链接:https://arxiv.org/abs/2508.21433

作者:ndenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, Yaroslav Zharov
摘要:基于大型语言模型(LLM)的代理通过迭代推理、探索和工具使用来解决复杂任务,这一过程可能产生冗长、昂贵的上下文历史。虽然最先进的软件工程(SE)代理(如OpenHands或Cursor)使用基于LLM的摘要来解决这个问题,但与简单地省略较旧的观察相比,这种增加的复杂性是否带来切实的性能优势尚不清楚。我们在SWE-bench Verified上、针对五种不同的模型配置,对SWE-agent中的这些策略进行了系统比较。我们发现,一个简单的观察掩蔽策略相对于原始代理可将成本减半,同时匹配(有时略微超过)LLM摘要的解决率。例如,使用Qwen3-Coder 480B,掩蔽将解决率从53.8%(原始代理)提高到54.8%,同时以更低的成本与摘要保持竞争力。这些结果表明,至少对于SWE-bench Verified上的SWE-agent而言,最有效、最高效的上下文管理可以是最简单的。我们发布代码和数据以保证可复现性。
摘要:Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these strategies within SWE-agent on SWE-bench Verified across five diverse model configurations. We find that a simple observation-masking strategy halves cost relative to a raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. For example, with Qwen3-Coder 480B, masking improves solve rate from 53.8% (raw agent) to 54.8%, while remaining competitive with summarization at a lower cost. These results suggest that, at least within SWE-agent on SWE-bench Verified, the most effective and efficient context management can be the simplest. We release code and data for reproducibility.
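论文所比较的"观察掩蔽"思路可以用几行代码示意。以下为假设性的极简实现(数据结构与占位符文本均为示例,并非SWE-agent的实际接口):只保留最近k条观察的完整内容,更早的观察替换为占位符,从而截断代理的上下文历史、降低推理成本:

```python
# 观察掩蔽的极简示意(假设性实现):保留最近 keep_last 条观察,
# 更早的观察用占位符替换;动作(action)条目保持不变。

def mask_observations(history, keep_last=2, placeholder="[OMITTED]"):
    n_obs = sum(1 for role, _ in history if role == "observation")
    masked, seen = [], 0
    for role, text in history:
        if role == "observation":
            seen += 1
            if seen <= n_obs - keep_last:   # 早于最近 keep_last 条的观察被掩蔽
                text = placeholder
        masked.append((role, text))
    return masked

history = [
    ("action", "ls"),
    ("observation", "src tests README.md"),
    ("action", "cat README.md"),
    ("observation", "# Project ..."),
    ("action", "pytest"),
    ("observation", "3 passed"),
]
for item in mask_observations(history, keep_last=2):
    print(item)
```

与基于LLM的摘要相比,这种策略不需要额外的模型调用,这正是它成本减半的来源。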


【50】Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
标题:Med-RewardBench:面向医疗多模态大型语言模型的奖励模型与评判模型基准
链接:https://arxiv.org/abs/2508.21430

作者:ng, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
备注:19 pages, 5 figures, 3 tables
摘要:多模态大语言模型(MLLM)在医学应用中具有巨大的潜力,包括疾病诊断和临床决策。然而,这些任务需要高度准确、上下文敏感且符合专业规范的响应,因此可靠的奖励模型和评判模型(judges)至关重要。尽管重要性很高,医疗奖励模型(MRM)和评判模型仍然没有得到充分的探索,也没有专门针对临床要求的基准。现有的基准集中于通用MLLM能力,或把模型当作求解器来评估,忽略了诊断准确性和临床相关性等基本评估维度。为了解决这个问题,我们引入了Med-RewardBench,第一个专门用于评估医疗场景中的MRM和评判模型的基准。Med-RewardBench拥有一个跨13个器官系统和8个临床科室的多模态数据集,包含1,026个专家注释病例。严格的三步流程确保了六个临床关键维度上的高质量评估数据。我们评估了32个最先进的MLLM,包括开源、专有和医疗专用模型,揭示了将输出与专家判断对齐的重大挑战。此外,我们还开发了基线模型,通过微调展示了实质性的性能改进。
摘要:Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.


【51】Benchmarking the State of Networks with a Low-Cost Method Based on Reservoir Computing
标题:利用基于储备池计算的低成本方法对网络状态进行基准测试
链接:https://arxiv.org/abs/2508.21420

作者:on Reimers, Carl-Hendrik Peters, Stefano Nichele
备注:Net-Zero Future 2025 Conference
摘要:利用挪威移动网络利用率的数据,我们展示了用一种非侵入性、低成本的方法监测通信和移动网络状态的可能性。该方法将网络数据转换为储备池计算(reservoir computing)框架内的模型,然后测量该模型在若干代理任务(proxy tasks)上的表现。我们通过实验展示了这些代理任务上的性能与网络状态之间的关系。这种方法的一个主要优点是,它使用现成的数据集,并借助储备池计算框架得到一种廉价且很大程度上与具体领域无关的方法。移动网络利用率数据以匿名、聚合的形式提供,每天有多个快照,可以被视为一个加权网络。储备池计算允许把加权但未经训练的网络用作机器学习工具。该网络被初始化为所谓的回声状态网络(ESN),把传入信号投射到更高维的空间中,在该空间上只有一个经过训练的层进行操作。这比每个权重都要训练的深度神经网络消耗更少的能量。我们采用受神经科学启发的任务,并训练我们的ESN模型来解决它们。然后,我们展示了性能如何取决于某些网络配置,以及在扰动网络时性能如何明显下降。虽然这项工作只是概念验证,但我们相信它可以进一步发展,用于近实时监控以及识别移动通信网络和交通网络的潜在薄弱环节。
摘要:Using data from mobile network utilization in Norway, we showcase the possibility of monitoring the state of communication and mobility networks with a non-invasive, low-cost method. This method transforms the network data into a model within the framework of reservoir computing and then measures the model's performance on proxy tasks. Experimentally, we show how the performance on these proxies relates to the state of the network. A key advantage of this approach is that it uses readily available data sets and leverages the reservoir computing framework for an inexpensive and largely agnostic method. Data from mobile network utilization is available in an anonymous, aggregated form with multiple snapshots per day. This data can be treated like a weighted network. Reservoir computing allows the use of weighted, but untrained networks as a machine learning tool. The network, initialized as a so-called echo state network (ESN), projects incoming signals into a higher dimensional space, on which a single trained layer operates. This consumes less energy than deep neural networks in which every weight of the network is trained. We use neuroscience inspired tasks and trained our ESN model to solve them. We then show how the performance depends on certain network configurations and also how it visibly decreases when perturbing the network. While this work serves as proof of concept, we believe it can be elevated to be used for near-real-time monitoring as well as the identification of possible weak spots of both mobile communication networks as well as transportation networks.
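回声状态网络的核心思想(随机且不训练的内部权重,加上唯一需要训练的线性读出层)可以用NumPy简单示意。以下为假设性的玩具实现(任务与超参数均为示例,并非论文所用的神经科学任务):读出层通过岭回归学习重现延迟一步的输入信号:

```python
import numpy as np

# 回声状态网络(ESN)的极简示意(假设性实现):
# 随机、不训练的内部权重把输入投射到高维状态空间,只训练线性读出层。
# 玩具任务:从当前储备池状态重现上一步的输入(延迟 1 步的记忆)。

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 100, 500

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # 谱半径缩放到 0.9,保证回声状态性质

u = rng.uniform(-1, 1, (T, n_in))
x = np.zeros(n_res)
states = []
for t in range(T):
    x = np.tanh(W @ x + W_in @ u[t])              # 储备池状态更新
    states.append(x.copy())

X = np.array(states)[1:]                          # 丢弃第一步
y = u[:-1, 0]                                     # 目标:上一步的输入

ridge = 1e-6                                      # 岭回归训练读出层
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
mse = float(np.mean((X @ W_out - y) ** 2))
print("训练MSE:", mse)                            # 应远小于输入方差(约 0.33)
```

扰动内部权重W(对应论文中扰动网络)会使同一代理任务上的误差明显上升,这正是该监测方法所利用的信号。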


【52】CARJAN: Agent-Based Generation and Simulation of Traffic Scenarios with AJAN
标题:CARJAN:基于Agent的AJAN交通场景生成与仿真
链接:https://arxiv.org/abs/2508.21411

作者:rank Neis, Andre Antakli, Matthias Klusch
摘要:用户友好的建模和虚拟仿真的城市交通场景与不同类型的交互代理,如行人,骑自行车和自动驾驶汽车仍然是一个挑战。我们提出了CARJAN,一种新的工具,用于半自动生成和模拟这种情况下的多智能体工程框架AJAN和驾驶模拟器CARLA的基础上。CARJAN为交通场景布局的建模、存储和维护提供了一个可视化的用户界面,并在CARLA的动态场景模拟中利用基于SPARQL行为树的决策和交互。CARJAN为CARLA中的虚拟交通场景的交互式、智能的基于代理的生成和仿真提供了第一种集成方法。
摘要:User-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents such as pedestrians, cyclists and autonomous vehicles remains a challenge. We present CARJAN, a novel tool for semi-automated generation and simulation of such scenarios based on the multi-agent engineering framework AJAN and the driving simulator CARLA. CARJAN provides a visual user interface for the modeling, storage and maintenance of traffic scenario layouts, and leverages SPARQL Behavior Tree-based decision-making and interactions for agents in dynamic scenario simulations in CARLA. CARJAN provides a first integrated approach for interactive, intelligent agent-based generation and simulation of virtual traffic scenarios in CARLA.


【53】DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction
标题:DRASP:用于自动MOS预测的双分辨率注意力统计池化框架
链接:https://arxiv.org/abs/2508.21407

作者: Yang, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
备注:Accepted to APSIPA ASC 2025
摘要:池化机制对平均意见得分(MOS)预测必不可少,它将可变长度的音频特征转换为简洁的固定大小表示,有效地编码语音质量。现有的池化方法通常在单一粒度上操作,要么集中于全面的全局视角,要么集中于详细的帧级分析,这可能忽略互补的感知线索。为此,我们引入了双分辨率注意力统计池化(Dual-Resolution Attentive Statistics Pooling,DRASP)框架。DRASP同时整合了粗粒度的全局统计摘要和对感知上重要片段的细粒度注意力分析。这种双视角架构使我们的模型能够形成更全面、更鲁棒的表示,同时捕获总体结构背景和突出的局部细节。大量实验验证了该框架的有效性和较强的泛化能力。它在不同的数据集(MusicEval和AES-Natural)、不同的MOS预测骨干(包括基于CLAP的模型和AudioBox-Aesthetics)以及不同的音频生成系统中始终优于各种基线方法,与广泛使用的平均池化方法相比,系统级Spearman秩相关系数(SRCC)相对提高了10.39%。
摘要:A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely-used average pooling approach.
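全局统计池化与注意力统计池化的组合思路可以用NumPy示意。以下为假设性实现(仅说明"双分辨率拼接"的结构,打分向量用随机值代替已训练的注意力,并非论文DRASP的实际网络):

```python
import numpy as np

# 双分辨率统计池化的示意(假设性实现):粗粒度分支取全局均值/标准差,
# 细粒度分支用 softmax 注意力权重做加权统计,两者拼接为固定长度表示。

def dual_resolution_pool(frames, w_att):
    """frames: (T, D) 帧级特征;w_att: (D,) 打分向量(此处随机,实际中需训练)。"""
    g_mean, g_std = frames.mean(0), frames.std(0)            # 全局统计
    scores = frames @ w_att
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                     # softmax 注意力权重
    a_mean = (alpha[:, None] * frames).sum(0)                # 加权均值
    a_std = np.sqrt((alpha[:, None] * (frames - a_mean) ** 2).sum(0) + 1e-9)
    return np.concatenate([g_mean, g_std, a_mean, a_std])    # 形状 (4D,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 16))      # 可变长度(T=120)的音频特征序列
pooled = dual_resolution_pool(frames, rng.normal(size=16))
print(pooled.shape)  # (64,)
```

无论输入序列多长,输出维度固定为4D,因此可直接接入后续的MOS回归头。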


【54】AI Compute Architecture and Evolution Trends
标题:AI计算架构与演进趋势
链接:https://arxiv.org/abs/2508.21394

作者:Liang
备注:29 pages, 26 figures
摘要:人工智能发展的重点已经从学术研究转向实际应用,但人工智能的发展在各个层面都面临许多挑战。本文尝试用结构化方法,从几个不同的角度分析人工智能的机遇和挑战。本文提出了一个七层的AI计算架构模型,从下到上包括物理层、链路层、神经网络层、上下文层、代理层、编排层和应用层,并解释了AI计算如何经由大规模语言模型(LLM)的三阶段演进发展为这一7层架构。对于每一层,我们描述其发展轨迹和关键技术。在第1层和第2层,我们讨论AI计算问题以及纵向扩展(Scale-Up)和横向扩展(Scale-Out)策略对计算架构的影响。在第3层,我们探讨LLM的两条不同发展路径。在第4层,我们讨论上下文记忆对LLM的影响,并将其与传统的处理器内存进行比较。在第5层到第7层,我们讨论AI代理的趋势,探讨从单个AI代理演变为基于AI的生态系统过程中的问题,以及它们对AI行业的影响。此外,人工智能的发展不仅涉及技术挑战,还涉及构建自我可持续生态系统的经济问题。本文通过分析互联网行业来预测人工智能发展的未来轨迹。
摘要:The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article will attempt to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. This article proposes a seven-layer model for AI compute architecture, including Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains how AI computing has evolved into this 7-layer architecture through the three-stage evolution on large-scale language models (LLMs). For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compares it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolution from a single AI agent to an AI-based ecosystem, and their impact on the AI industry. Furthermore, AI development involves not only technical challenges but also the economic issues to build self-sustainable ecosystem. This article analyzes the internet industry to provide predictions on the future trajectory of AI development.


【55】zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs
标题:zkLoRA:通过零知识证明微调大型语言模型,具有可验证的安全性
链接:https://arxiv.org/abs/2508.21393

作者:o, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, Dacheng Tao
摘要:微调大型语言模型(LLM)对使其适应特定任务至关重要,但它仍然对计算要求很高,并引发对正确性和隐私的担忧,特别是在不受信任的环境中。虽然低秩自适应(LoRA)等参数高效方法显著降低了资源需求,但在零知识约束下确保微调的安全性和可验证性仍是一个悬而未决的挑战。为了解决这个问题,我们引入了zkLoRA,第一个将LoRA微调与零知识证明(ZKP)集成的框架,实现了可证明的安全性和正确性。zkLoRA采用先进的密码学技术,例如查找论证(lookup arguments)、sumcheck协议和多项式承诺,来验证基于Transformer的架构中的算术和非算术操作。该框架为LoRA微调期间的前向传播、反向传播和参数更新提供端到端的可验证性,同时保护模型参数和训练数据的隐私。借助基于GPU的实现,zkLoRA通过在LLaMA等开源LLM上的实验验证(可扩展到130亿参数)证明了其实用性和效率。通过将参数高效微调与ZKP相结合,zkLoRA弥合了一个关键差距,使LLM能够在敏感或不受信任的环境中安全可靠地部署。
摘要:Fine-tuning large language models (LLMs) is crucial for adapting them to specific tasks, yet it remains computationally demanding and raises concerns about correctness and privacy, particularly in untrusted environments. Although parameter-efficient methods like Low-Rank Adaptation (LoRA) significantly reduce resource requirements, ensuring the security and verifiability of fine-tuning under zero-knowledge constraints remains an unresolved challenge. To address this, we introduce zkLoRA, the first framework to integrate LoRA fine-tuning with zero-knowledge proofs (ZKPs), achieving provable security and correctness. zkLoRA employs advanced cryptographic techniques -- such as lookup arguments, sumcheck protocols, and polynomial commitments -- to verify both arithmetic and non-arithmetic operations in Transformer-based architectures. The framework provides end-to-end verifiability for forward propagation, backward propagation, and parameter updates during LoRA fine-tuning, while safeguarding the privacy of model parameters and training data. Leveraging GPU-based implementations, zkLoRA demonstrates practicality and efficiency through experimental validation on open-source LLMs like LLaMA, scaling up to 13 billion parameters. By combining parameter-efficient fine-tuning with ZKPs, zkLoRA bridges a critical gap, enabling secure and trustworthy deployment of LLMs in sensitive or untrusted environments.
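zkLoRA所验证的对象,即LoRA微调的低秩更新本身,可以用几行代码示意。以下为假设性的极简实现(不包含任何零知识证明部分,维度均为示例):

```python
import numpy as np

# LoRA 低秩更新的极简示意(假设性实现;zkLoRA 的 ZKP 验证部分不在此列)。
# 冻结权重 W 保持不变,只训练低秩因子 B (d×r) 和 A (r×k),
# 前向输出为 W x + (alpha/r) * B(A x)。

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d, k))          # 冻结的预训练权重
A = rng.normal(size=(r, k)) * 0.01   # 可训练低秩因子
B = np.zeros((d, r))                 # B 初始化为 0,训练开始时增量恰为 0

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
print(np.allclose(lora_forward(x), W @ x))  # True:B=0 时与原模型一致
```

可训练参数量从d×k降为r×(d+k),这正是zkLoRA需要证明其更新正确性的那一小部分参数。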


【56】AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume
标题:AllSummedUp:用于比较摘要评估指标的开源框架
链接:https://arxiv.org/abs/2508.21389

作者:rserant, Vincent Guigue
备注:in French language
摘要:本文研究自动文本摘要评估中的可复现性挑战。基于在六个代表性指标(从ROUGE等经典方法到G-Eval、SEval-Ex等最近基于LLM的方法)上进行的实验,我们强调了文献中报告的性能与我们实验环境中观察到的性能之间的显著差异。我们引入了一个统一的开源框架,应用于SummEval数据集,旨在支持对评估指标进行公平、透明的比较。我们的结果揭示了一种结构性权衡:与人类判断最一致的指标往往计算密集,并且在多次运行之间不够稳定。除了比较分析之外,这项研究还强调了依赖LLM进行评估的关键问题,指出其随机性、技术依赖性和有限的可复现性。我们提倡更健壮的评估协议,包括详尽的文档和方法标准化,以确保自动摘要评估具有更高的可靠性。
摘要:This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.
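所比较的经典指标中,ROUGE-1召回率的核心计算可以简单示意。以下为假设性的简化实现(未做词干化、去停用词等ROUGE工具链中常见的预处理):

```python
from collections import Counter

# ROUGE-1 召回率的简化示意:统计候选摘要与参考摘要之间
# unigram 的(截断)重叠次数,再除以参考摘要的词数。

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the cat sat on the mat"
cand = "the cat is on the mat"
print(round(rouge1_recall(cand, ref), 3))  # 0.833(重叠 5 词 / 参考 6 词)
```

与基于LLM的指标不同,这类计数式指标是确定性的,这正是摘要中所说"稳定但与人类判断一致性较低"的一端。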


【57】Normality and the Turing Test
标题:正常性与图灵测试
链接:https://arxiv.org/abs/2508.21382

作者: Kabbach
摘要:本文提出通过正常性(normality)概念重新审视图灵测试。其核心论点是,对"正常"的统计解释(在规范和数学双重意义上理解为"平均")至少在两个方面有助于理解图灵测试。首先,图灵测试针对的是正常/平均而非超常的人类智能,因此成功通过测试需要建造像正常/平均人类一样"犯错误"并表现出不完美行为的机器。其次,图灵测试是一种统计测试:对智能的判断从来不是由单个"平均"法官(理解为非专家)作出,而总是由一个完整的陪审团作出。因此,图灵在原始论文中谈到的"平均人类审讯者"概念,应主要被理解为指由多个法官的个人判断经归一化聚合而构成的数学抽象。简而言之,本文认为图灵测试是一种正常智能的测试,由一个刻画了一群人类审讯者平均判断的正常法官来评估。其结论有两点。首先,它认为像ChatGPT这样的大型语言模型不太可能通过图灵测试,因为这些模型恰恰针对的是超常而非正常/平均的人类智能;因此,它们构成的是本文所称的"人工聪明"(artificial smartness)的模型,而非严格意义上的人工智能的模型。其次,它认为图灵测试能否对理解人类认知作出贡献,其核心问题在于人类心智是否真的可以还原为正常/平均心智,这一问题在很大程度上超出了图灵测试本身,并质疑其所属的常态主义范式的概念基础。
摘要:This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal--understood as the average both in the normative and mathematical sense of the term--proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind--a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.


【58】Iterative Inference in a Chess-Playing Neural Network
标题:国际象棋神经网络中的迭代推理
链接:https://arxiv.org/abs/2508.21380

作者:dmann, Sebastian Lapuschkin, Wojciech Samek
摘要:神经网络是通过平滑、渐进的细化来构建其表示,还是通过更复杂的计算过程?我们通过将logit镜头扩展到Leela Chess Zero(一个超人水平的国际象棋引擎)的策略网络来研究这一问题。我们发现,棋力和解谜能力在各层之间呈强烈的单调趋势,但策略分布经常遵循非平滑的轨迹。这方面的证据包括:早期发现但随后被丢弃的正确解谜着法、与最终输出相关性较差的着法排序,以及直到网络后期仍然很高的策略分歧。这些发现与语言模型中通常观察到的平滑分布收敛形成对比。
摘要:Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. We find strong monotonic trends in playing strength and puzzle-solving ability across layers, yet policy distributions frequently follow non-smooth trajectories. Evidence for this includes correct puzzle solutions that are discovered early but subsequently discarded, move rankings that remain poorly correlated with final outputs, and high policy divergence until late in the network. These findings contrast with the smooth distributional convergence typically observed in language models.
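logit镜头的做法是把网络最终的读出(解码)矩阵套用到各中间层的表示上,逐层观察输出分布如何演化。以下为假设性的玩具示意(随机小网络,与Leela Chess Zero的实际结构无关):

```python
import numpy as np

# logit 镜头的玩具示意(假设性实现):用最终的策略读出矩阵
# 解码每一层的中间表示,跟踪 top-1"着法"随深度的变化。

rng = np.random.default_rng(0)
d, n_moves, n_layers = 32, 10, 6

W_out = rng.normal(size=(n_moves, d))     # 最终的策略读出头
h = rng.normal(size=d)                    # 某个局面的初始表示
tops = []
for _ in range(n_layers):
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    h = np.tanh(W @ h)                    # 一层变换的玩具替代
    logits = W_out @ h                    # 关键步骤:用最终读出解码中间层
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax 得到逐层"策略分布"
    tops.append(int(probs.argmax()))
print("各层 top-1 着法索引:", tops)
```

论文观察到的"早期发现又被丢弃的正确着法",对应的就是tops这样的逐层序列中途出现又消失的情形。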


【59】RoboInspector: Unveiling the Unreliability of Policy Code for LLM-enabled Robotic Manipulation
标题:RoboInspector:揭示LLM驱动的机器人操纵中策略代码的不可靠性
链接:https://arxiv.org/abs/2508.21378

作者:ing, Linkang Du, Peng Cheng, Yuanchao Shu
摘要:大型语言模型(LLM)在推理和代码生成方面表现出卓越的能力,使机器人操作只需一条指令即可启动。LLM通过生成控制机器人所需的策略代码来执行各种任务。尽管LLM取得了进展,但由于现实任务的不同要求和用户指令的固有复杂性,实现可靠的策略代码生成仍然是一个重大挑战。在实践中,不同的用户可能会提供不同的指令来驱动机器人执行相同的任务,这可能会导致策略代码生成的不可靠性。为了弥合这一差距,我们设计了RoboInspector,这是一个管道,可以从两个角度揭示和表征启用LLM的机器人操作的策略代码的不可靠性:操作任务的复杂性和指令的粒度。我们在两个突出的框架中使用168种不同的任务,指令和LLM组合进行全面的实验。RoboInspector识别了导致操作失败的四种主要不可靠行为。我们提供了这些行为及其根本原因的详细描述,为实际开发提供了见解,以减少不可靠性。此外,我们引入了一种由故障策略代码反馈指导的改进方法,该方法在LLM启用的机器人操作中将策略代码生成的可靠性提高了35%,在模拟和现实环境中进行了评估。
摘要:Large language models (LLMs) demonstrate remarkable capabilities in reasoning and code generation, enabling robotic manipulation to be initiated with just a single instruction. The LLM carries out various tasks by generating policy code required to control the robot. Despite advances in LLMs, achieving reliable policy code generation remains a significant challenge due to the diverse requirements of real-world tasks and the inherent complexity of user instructions. In practice, different users may provide distinct instructions to drive the robot for the same task, which may cause the unreliability of policy code generation. To bridge this gap, we design RoboInspector, a pipeline to unveil and characterize the unreliability of the policy code for LLM-enabled robotic manipulation from two perspectives: the complexity of the manipulation task and the granularity of the instruction. We perform comprehensive experiments with 168 distinct combinations of tasks, instructions, and LLMs in two prominent frameworks. The RoboInspector identifies four main unreliable behaviors that lead to manipulation failure. We provide a detailed characterization of these behaviors and their underlying causes, giving insight for practical development to reduce unreliability. Furthermore, we introduce a refinement approach guided by failure policy code feedback that improves the reliability of policy code generation by up to 35% in LLM-enabled robotic manipulation, evaluated in both simulation and real-world environments.


【60】Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
标题:大型语言模型的挑战和应用:GPT和DeepSeek系列模型的比较
链接:https://arxiv.org/abs/2508.21377

作者:harma, Sneha Tuli, Narendra Badam
备注:18 pages, 7 figures
摘要:大型语言模型(LLM)正在改变各行各业的人工智能,但其开发和部署仍然很复杂。本综述回顾了构建和使用LLM时面临的16个关键挑战,并考察了两个方法各具特色的最先进模型如何应对这些挑战:OpenAI的闭源GPT-4o(2024年5月更新)和DeepSeek-V3-0324(2025年3月),后者是一个大型开源混合专家(Mixture-of-Experts)模型。通过这种比较,我们展示了闭源模型(强大的安全性、经微调的可靠性)和开源模型(效率、适应性)之间的权衡。我们还探讨了LLM在不同领域的应用(从聊天机器人和编码工具到医疗保健和教育),并指出哪些模型属性最适合每种用例。本文旨在指导AI研究人员、开发人员和决策者了解当前LLM的能力、局限和最佳实践。
摘要:Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.


【61】AHELM: A Holistic Evaluation of Audio-Language Models
标题:AHELM:音频语言模型的整体评估
链接:https://arxiv.org/abs/2508.21376

作者: Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
摘要:音频语言模型(ALM,即以交错的音频和文本作为输入并输出文本的多模态模型)的评估受到缺乏标准化基准的阻碍;大多数基准只测量一两种能力,并忽略公平性或安全性等评估方面。此外,由于各自独立的评估只测试有限数量的模型,并使用不同的提示方法和推理参数,跨模型比较十分困难。为了解决这些不足,我们引入了AHELM,一个聚合多种数据集的基准,其中包括两个新的合成音频-文本数据集:PARADE(评估ALM避免刻板印象的能力)和CoRe-Bench(通过推断式多轮问答衡量对对话音频的推理能力)。AHELM从我们认定对ALM的开发和使用十分重要的10个方面全面衡量ALM的性能:音频感知、知识、推理、情感检测、偏差、公平性、多语言性、鲁棒性、毒性和安全性。我们还标准化了提示、推理参数和评估指标,以确保模型之间的公平比较。我们测试了来自3个开发商的14个开放权重和封闭API的ALM,以及3个额外的简单基线系统,每个基线系统由一个自动语音识别器和一个语言模型组成。结果表明,虽然Gemini 2.5 Pro在10个方面中的5个排名第一,但它在ASR任务上表现出群体不公平性($p=0.01$),而大多数其他模型没有。我们还发现,基线系统在AHELM上表现相当不错,其中一个尽管只有语音转文本能力,但总体排名第五。为了提高透明度,所有原始提示、模型生成和输出都可以在我们的网站https://crfm.stanford.edu/helm/audio/v1.0.0上找到。AHELM旨在成为一个持续更新的基准,新的数据集和模型将随着时间的推移而加入。
摘要:Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.


【62】Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
标题:游戏中思考:通过大型语言模型的强化学习来学习游戏中的推理
链接:https://arxiv.org/abs/2508.21365

作者:Yu Gu, Yuan Sui, Zining Zhu, Yifan Lu, Guohua Tang, Zhongqian Sun, Wei Yang
摘要:大型语言模型(LLM)擅长数学和编码等复杂的推理任务,但他们经常与幼儿毫不费力地完成的简单交互任务作斗争。这种差异突出了陈述性知识(知道某事)和程序性知识(知道如何做某事)之间的关键差距。虽然传统的强化学习(RL)代理可以通过环境交互来获取程序知识,但它们通常作为黑箱运行,需要大量的训练数据。相比之下,LLM拥有广泛的世界知识和推理能力,但无法有效地将这种静态知识转化为互动环境中的动态决策。为了应对这一挑战,我们提出了游戏思维(TiG),这是一个新的框架,使LLM能够通过与游戏环境的直接交互来发展程序理解,同时保留其固有的推理和解释能力。具体来说,TiG将基于RL的决策重新制定为语言建模任务:LLM生成语言指导的策略,这些策略通过基于环境反馈的在线强化学习迭代地进行优化。我们的实验结果表明,TiG成功地弥合了陈述性和程序性知识之间的差距,与传统的RL方法相比,具有显着降低的数据和计算需求,实现了有竞争力的性能。此外,TiG为其决策提供了逐步的自然语言解释,大大提高了复杂交互任务的透明度和可解释性。
摘要:Large language models (LLMs) excel at complex reasoning tasks such as mathematics and coding, yet they frequently struggle with simple interactive tasks that young children perform effortlessly. This discrepancy highlights a critical gap between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). Although traditional reinforcement learning (RL) agents can acquire procedural knowledge through environmental interaction, they often operate as black boxes and require substantial training data. In contrast, LLMs possess extensive world knowledge and reasoning capabilities, but are unable to effectively convert this static knowledge into dynamic decision-making in interactive settings. To address this challenge, we propose Think in Games (TiG), a novel framework that empowers LLMs to develop procedural understanding through direct interaction with game environments, while retaining their inherent reasoning and explanatory abilities. Specifically, TiG reformulates RL-based decision-making as a language modeling task: LLMs generate language-guided policies, which are refined iteratively through online reinforcement learning based on environmental feedback. Our experimental results show that TiG successfully bridges the gap between declarative and procedural knowledge, achieving competitive performance with dramatically lower data and computational demands compared to conventional RL methods. Moreover, TiG provides step-by-step natural language explanations for its decisions, greatly improving transparency and interpretability in complex interactive tasks.


【63】Adaptive Heavy-Tailed Stochastic Gradient Descent
标题:自适应重尾随机梯度下降
链接:https://arxiv.org/abs/2508.21353

作者:, Gustavo Enrique Batista, Pierre Lafaye de Micheaux
摘要:在大规模神经网络模型时代,由于过度依赖训练损失,优化算法通常难以泛化。机器学习社区广泛接受的一个关键见解是宽盆地(局部最小值周围损失逐渐增加的区域)通过为输入数据或模型参数的微小变化提供更大的稳定性来促进更好的泛化。相比之下,尖锐的最小值通常更敏感且不太稳定。受两个关键经验观察的启发,即随机梯度下降中梯度噪声固有的重尾分布,以及神经网络训练过程中曲率先增长再趋于平稳的稳定边缘现象,我们引入了自适应重尾随机梯度下降(AHTSGD)。该算法在训练的早期阶段向优化器注入较重尾的噪声以增强探索,并随着锐度稳定而逐渐过渡到较轻尾的噪声。通过在整个训练过程中动态适应损失景观的锐度,AHTSGD促进了向宽盆地的加速收敛。AHTSGD是第一个基于稳定边缘现象调整注入优化器的噪声性质的算法。AHTSGD在MNIST和CIFAR-10等基准测试中始终优于SGD和其他基于噪声的方法,在SVHN等噪声数据集上具有显著的增益。它最终加速了初始化不佳时的早期训练,并提高了在干净和嘈杂环境中的泛化能力,对学习率选择保持鲁棒性。
摘要:In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations - the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of injected noise into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.
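作为示意,下面给出一个极简的数值草图(非作者实现;退火调度与所有超参数均为假设):将注入噪声从重尾的低自由度Student-t分布逐渐退火为高自由度、接近高斯的轻尾分布,并在一个二次碗状损失上验证其收敛行为。

```python
import numpy as np

def ahtsgd(grad_fn, w0, lr=0.1, steps=200, df_start=3.0, df_end=30.0,
           noise_scale=0.01, seed=0):
    """AHTSGD 思想的玩具草图: 注入噪声从重尾 (低自由度 Student-t)
    退火到轻尾 (高自由度, 接近高斯)。线性调度为假设, 论文中由锐度驱动。"""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for t in range(steps):
        # 自由度随训练线性增大 => 噪声尾部逐渐变轻
        df = df_start + (df_end - df_start) * t / max(steps - 1, 1)
        noise = rng.standard_t(df, size=w.shape) * noise_scale
        w -= lr * (grad_fn(w) + noise)
    return w

# 在二次碗状损失 L(w)=||w||^2 上演示: 梯度为 2w, 最小值在原点
w_final = ahtsgd(lambda w: 2.0 * w, w0=np.array([5.0, -3.0]))
```

早期的重尾噪声偶尔产生大步长以跳出窄盆地,后期的轻尾噪声则允许在宽盆地内稳定收敛。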


【64】DLGAN : Time Series Synthesis Based on Dual-Layer Generative Adversarial Networks
标题:DLGAN:基于双层生成对抗网络的时间序列合成
链接:https://arxiv.org/abs/2508.21340

作者: Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
备注:8 pages, 3 figures
摘要:时间序列合成是保证时间序列数据安全流通的有效途径。现有的时间序列合成方法通常基于随机序列进行时间建模以生成目标序列,这通常难以确保所生成的时间序列中的时间依赖性。此外,直接在随机序列上建模时间特征使得准确捕获原始时间序列的特征信息具有挑战性。为了解决上述问题,我们提出了一个简单但有效的生成模型\textbf{D}ual-\textbf{L}ayer \textbf{G}enerative \textbf{A}dversarial \textbf{N}etworks,命名为\textbf{DLGAN}。该模型将时间序列生成过程分解为两个阶段:序列特征提取和序列重建。首先,这两个阶段形成了一个完整的时间序列自动编码器,可以对原始时间序列进行监督学习,以确保重建过程可以恢复序列的时间依赖性。其次,生成对抗网络(GAN)用于生成与真实序列特征向量对齐的合成特征向量,确保生成器可以从真实时间序列中捕获时间特征。在四个公共数据集上进行的大量实验证明了该模型在各种评估指标上的优越性。
摘要:Time series synthesis is an effective approach to ensuring the secure circulation of time series data. Existing time series synthesis methods typically perform temporal modeling based on random sequences to generate target sequences, which often struggle to ensure the temporal dependencies in the generated time series. Additionally, directly modeling temporal features on random sequences makes it challenging to accurately capture the feature information of the original time series. To address the above issues, we propose a simple but effective generative model \textbf{D}ual-\textbf{L}ayer \textbf{G}enerative \textbf{A}dversarial \textbf{N}etworks, named \textbf{DLGAN}. The model decomposes the time series generation process into two stages: sequence feature extraction and sequence reconstruction. First, these two stages form a complete time series autoencoder, enabling supervised learning on the original time series to ensure that the reconstruction process can restore the temporal dependencies of the sequence. Second, a Generative Adversarial Network (GAN) is used to generate synthetic feature vectors that align with the real-time sequence feature vectors, ensuring that the generator can capture the temporal features from real time series. Extensive experiments on four public datasets demonstrate the superiority of this model across various evaluation metrics.


【65】Stairway to Fairness: Connecting Group and Individual Fairness
标题:通往公平的阶梯:连接群体与个人公平
链接:https://arxiv.org/abs/2508.21334

作者:Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Falk Scholer, Christina Lioma
备注:Accepted to RecSys 2025 (short paper)
摘要:推荐系统中的公平性通常分为群体公平性和个体公平性。然而,对这两种公平类型之间的关系没有既定的科学认识,因为先前关于这两种类型的工作对每种公平类型使用了不同的评价措施或评价目标,因此无法对两者进行适当的比较。因此,目前还不知道增加一种公平性会如何影响另一种公平性。为了填补这一空白,我们通过全面比较可同时用于两种公平类型的评价措施,研究了群体公平与个体公平的关系。我们在3个数据集上进行的8组实验表明,对群体高度公平的推荐对个体可能非常不公平。对于旨在提高系统公平性的推荐系统从业者来说,我们的发现是新颖且有用的。我们的代码可在https://github.com/theresiavr/stairway-to-fairness上获取。
摘要:Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: https://github.com/theresiavr/stairway-to-fairness.
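要在同一把尺子下比较群体公平与个体公平,一种常见做法(此处仅为说明性草图,基尼系数与效用定义均为假设,未必是论文所用度量)是对逐用户效用计算不平等度量,再将同一度量应用于按组聚合后的均值:

```python
import numpy as np

def gini(x):
    """非负效用的基尼系数: 0 表示完全均等, 越接近 1 越集中。"""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    cum = np.cumsum(x) / total
    return float((n + 1 - 2 * cum.sum()) / n)

def fairness_report(utility, group):
    """同一度量用于两种粒度: 个体层面对逐用户效用, 群体层面对各组均值。"""
    utility = np.asarray(utility, dtype=float)
    group = np.asarray(group)
    individual = gini(utility)
    group_means = [utility[group == g].mean() for g in np.unique(group)]
    return {"individual_gini": individual, "group_gini": gini(group_means)}

# 两组平均效用相同(群体层面完全公平), 但个体间高度不均
rep = fairness_report([1.0, 0.0, 0.0, 1.0], group=[0, 0, 1, 1])
```

该例正体现了摘要的核心发现:群体层面的度量可以为零(完全公平),而个体层面的同一度量却很高。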


【66】Stage-Diff: Stage-wise Long-Term Time Series Generation Based on Diffusion Models
标题:Stage-Diff:基于扩散模型的分阶段长期时间序列生成
链接:https://arxiv.org/abs/2508.21330

作者: Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
备注:8 pages, 5 figures
摘要:生成式模型已成功地应用于时间序列生成领域。然而,当处理跨越较长时期、表现出更复杂长期时间模式的长期时间序列时,生成任务变得更具挑战性。长期时间序列表现出长程时间依赖性,但其数据分布也会随着时间的推移而逐渐变化。在这些长期依赖性和数据分布的漂移之间找到平衡是一个关键挑战。另一方面,长期时间序列包含不同特征序列之间更复杂的相互关系,使得有效捕获序列内和序列间依赖性成为另一个重要挑战。为了解决这些问题,我们提出了Stage-Diff,一个基于扩散模型的长期时间序列分阶段生成模型。首先,通过分阶段序列生成和阶段间信息传递,该模型在保留长期序列依赖性的同时,能够对数据分布的变化进行建模。其次,在每个阶段内,应用渐进序列分解在不同的时间尺度上执行通道独立的建模,而阶段间信息传递则利用多通道融合建模。该方法结合了通道独立建模的鲁棒性和多通道建模的信息融合优势,有效地平衡了长期时间序列的序列内和序列间依赖性。在多个真实数据集上的大量实验验证了Stage-Diff在长期时间序列生成任务中的有效性。
摘要:Generative models have been successfully used in the field of time series generation. However, when dealing with long-term time series, which span over extended periods and exhibit more complex long-term temporal patterns, the task of generation becomes significantly more challenging. Long-term time series exhibit long-range temporal dependencies, but their data distribution also undergoes gradual changes over time. Finding a balance between these long-term dependencies and the drift in data distribution is a key challenge. On the other hand, long-term time series contain more complex interrelationships between different feature sequences, making the task of effectively capturing both intra-sequence and inter-sequence dependencies another important challenge. To address these issues, we propose Stage-Diff, a staged generative model for long-term time series based on diffusion models. First, through stage-wise sequence generation and inter-stage information transfer, the model preserves long-term sequence dependencies while enabling the modeling of data distribution shifts. Second, within each stage, progressive sequence decomposition is applied to perform channel-independent modeling at different time scales, while inter-stage information transfer utilizes multi-channel fusion modeling. This approach combines the robustness of channel-independent modeling with the information fusion advantages of multi-channel modeling, effectively balancing the intra-sequence and inter-sequence dependencies of long-term time series. Extensive experiments on multiple real-world datasets validate the effectiveness of Stage-Diff in long-term time series generation tasks.


【67】Multi-Ontology Integration with Dual-Axis Propagation for Medical Concept Representation
标题:用于医学概念表示的多本体集成与双轴传播
链接:https://arxiv.org/abs/2508.21320

作者:yebi Kerdabadi, Arya Hadizadeh Moghaddam, Dongjie Wang, Zijun Yao
备注:This work has been accepted as a full research paper at CIKM 2025
摘要:医学本体图通过结构化关系将外部知识映射到电子健康记录中的医学代码。通过利用领域认可的连接(例如,父-子关系),预测模型可以通过合并来自相关概念的上下文信息来生成更丰富的医学概念表示。然而,现有文献主要集中于合并来自单个本体系统的领域知识,或孤立地合并来自多个本体系统(例如,疾病、药物和程序)的领域知识,而没有将它们整合到一个统一的学习结构中。因此,概念表示学习往往仍然局限于本体内的关系,忽略了跨本体的连接。在本文中,我们提出了LINKO,一个大型语言模型(LLM)增强的综合本体学习框架,它同时利用多个本体图,通过在异构本体系统内部和之间实现双轴知识传播,来增强医学概念表示学习。具体来说,LINKO首先采用LLM为本体概念嵌入提供图检索增强的初始化,所用的工程化提示包含概念描述,并进一步以本体上下文加以增强。其次,我们的方法通过在两个轴上执行知识传播来联合学习不同本体图中的医学概念:(1)跨层次本体级别的本体内垂直传播,以及(2)在每个级别内并行进行的本体间水平传播。最后,通过在两个公共数据集上的大量实验,我们验证了LINKO相对于最先进基线的优越性能。作为与现有EHR预测模型兼容的插件式编码器,LINKO进一步证明了在数据可用性有限和罕见疾病预测场景中增强的鲁棒性。
摘要:Medical ontology graphs map external knowledge to medical codes in electronic health records via structured relationships. By leveraging domain-approved connections (e.g., parent-child), predictive models can generate richer medical concept representations by incorporating contextual information from related concepts. However, existing literature primarily focuses on incorporating domain knowledge from a single ontology system, or from multiple ontology systems (e.g., diseases, drugs, and procedures) in isolation, without integrating them into a unified learning structure. Consequently, concept representation learning often remains limited to intra-ontology relationships, overlooking cross-ontology connections. In this paper, we propose LINKO, a large language model (LLM)-augmented integrative ontology learning framework that leverages multiple ontology graphs simultaneously by enabling dual-axis knowledge propagation both within and across heterogeneous ontology systems to enhance medical concept representation learning. Specifically, LINKO first employs LLMs to provide a graph-retrieval-augmented initialization for ontology concept embedding, through an engineered prompt that includes concept descriptions, and is further augmented with ontology context. Second, our method jointly learns the medical concepts in diverse ontology graphs by performing knowledge propagation in two axes: (1) intra-ontology vertical propagation across hierarchical ontology levels and (2) inter-ontology horizontal propagation within every level in parallel. Last, through extensive experiments on two public datasets, we validate the superior performance of LINKO over state-of-the-art baselines. As a plug-in encoder compatible with existing EHR predictive models, LINKO further demonstrates enhanced robustness in scenarios involving limited data availability and rare disease prediction.
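"双轴传播"可以用邻接矩阵上的一轮消息传递来示意(玩具草图,归一化方式、残差组合与系数均为假设,非作者实现):垂直步沿本体内父子边混合概念嵌入,水平步在同一层级上跨本体混合。

```python
import numpy as np

def dual_axis_propagate(X, A_vert, A_horiz, alpha=0.5):
    """一轮双轴传播: X 为 [概念数, 维度] 嵌入, A_vert / A_horiz 分别是
    本体内(垂直)与本体间(水平)邻接矩阵; 结果与原嵌入做残差组合。"""
    def row_normalize(A):
        d = A.sum(axis=1, keepdims=True)
        return A / np.maximum(d, 1.0)   # 孤立节点保持零行
    X_v = row_normalize(A_vert) @ X     # 本体内垂直传播
    X_h = row_normalize(A_horiz) @ X    # 本体间水平传播
    return (1 - alpha) * X + alpha * 0.5 * (X_v + X_h)

X = np.eye(3)                           # 3 个概念的 one-hot 初始嵌入
I = np.eye(3)
X_same = dual_axis_propagate(X, I, I)   # 纯自环图: 传播后应保持不变
```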


【68】MultiFluxAI Enhancing Platform Engineering with Advanced Agent-Orchestrated Retrieval Systems
标题:MultiFluxAI通过先进的代理编排检索系统增强平台工程
链接:https://arxiv.org/abs/2508.21307

作者:acharla, Sridhar Murthy J, Anjaneyulu Pasala
备注:Abstract accepted for presentation at ACM ISEC 2025
摘要:MultiFluxAI是一个创新的人工智能平台,旨在解决跨应用领域的产品工程中管理和集成大量不同数据源的挑战。它解决了当前和新的服务相关查询,增强了用户在数字生态系统中的参与度。该平台利用先进的AI技术,如生成AI,矢量化和代理编排,为复杂的用户查询提供动态和上下文感知的响应。
摘要:MultiFluxAI is an innovative AI platform developed to address the challenges of managing and integrating vast, disparate data sources in product engineering across application domains. It addresses both current and new service related queries that enhance user engagement in the digital ecosystem. This platform leverages advanced AI techniques, such as Generative AI, vectorization, and agentic orchestration to provide dynamic and context-aware responses to complex user queries.


【69】Locus: Agentic Predicate Synthesis for Directed Fuzzing
标题:Locus:面向定向模糊测试的智能体谓词合成
链接:https://arxiv.org/abs/2508.21302

作者:Chihao Shen, Ziyang Li, Jiahao Yu, Yizheng Chen, Kexin Pei
摘要:定向模糊测试的目的是找到导致指定目标程序状态的程序输入。它有广泛的应用,如调试系统崩溃,确认报告的错误,并为潜在漏洞生成利用程序。这项任务本质上具有挑战性,因为目标状态通常深深嵌入程序中,而众多可能的程序输入所构成的搜索空间又大得令人望而却步。现有的方法依赖于分支距离或手动指定的约束来指导搜索;然而,单独的分支通常不足以精确地刻画朝向目标状态的进展,而手动指定的约束通常针对特定的错误类型而定制,因此难以推广到不同的目标状态和程序。   我们提出了Locus,一个提高定向模糊测试效率的新框架。我们的主要见解是合成谓词,以语义上有意义的中间状态的形式捕捉模糊测试的进展,作为达到目标状态的里程碑。当用这些谓词对被测程序进行插桩时,它们可以拒绝不太可能到达目标状态的执行,同时提供额外的覆盖率指导。为了自动化这项任务并推广到不同的程序,Locus提供了一个带有程序分析工具的智能体框架来合成并迭代地改进候选谓词,同时通过符号执行确保谓词是目标状态的严格松弛,以防止错误拒绝。我们的评估表明,Locus大大提高了八个最先进的模糊器在发现真实世界漏洞方面的效率,平均加速41.6倍。到目前为止,Locus已经发现了8个以前未修补的错误,其中一个已经通过草案补丁得到了确认。
摘要:Directed fuzzing aims to find program inputs that lead to specified target program states. It has broad applications, such as debugging system crashes, confirming reported bugs, and generating exploits for potential vulnerabilities. This task is inherently challenging because target states are often deeply nested in the program, while the search space manifested by numerous possible program inputs is prohibitively large. Existing approaches rely on branch distances or manually-specified constraints to guide the search; however, the branches alone are often insufficient to precisely characterize progress toward reaching the target states, while the manually specified constraints are often tailored for specific bug types and thus difficult to generalize to diverse target states and programs.   We present Locus, a novel framework to improve the efficiency of directed fuzzing. Our key insight is to synthesize predicates to capture fuzzing progress as semantically meaningful intermediate states, serving as milestones towards reaching the target states. When used to instrument the program under fuzzing, they can reject executions unlikely to reach the target states, while providing additional coverage guidance. To automate this task and generalize to diverse programs, Locus features an agentic framework with program analysis tools to synthesize and iteratively refine the candidate predicates, while ensuring the predicates strictly relax the target states to prevent false rejections via symbolic execution. Our evaluation shows that Locus substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, achieving an average speedup of 41.6x. So far, Locus has found eight previously unpatched bugs, with one already acknowledged with a draft patch.
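谓词插桩的剪枝机制可以用几行代码示意(纯玩具草图,示例中的步骤函数与谓词均为假设;真实的Locus由智能体合成谓词并用符号执行验证其为目标状态的严格松弛):

```python
def instrument(program_step, predicate):
    """用里程碑谓词包装一个执行步骤: 中间状态不满足谓词的输入被提前
    拒绝, 从而把算力集中在可能到达目标状态的执行上。"""
    def run(inp):
        state = program_step(inp)
        if not predicate(state):
            return None                 # 剪枝: 不太可能到达目标状态
        return state
    return run

# 假设的示例: 只有中间值为偶数的执行才可能触达目标状态
step = lambda x: x * 3
guarded = instrument(step, lambda s: s % 2 == 0)
kept, pruned = guarded(4), guarded(3)   # 偶数输入保留, 奇数输入被剪枝
```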


【70】MyGO: Memory Yielding Generative Offline-consolidation for Lifelong Learning Systems
标题:MyGO:终身学习系统的记忆生成性离线整合
链接:https://arxiv.org/abs/2508.21296

作者:, Zihui Song
备注:5 pages
摘要:持续学习或终身学习旨在开发能够从一系列任务中获取新知识的模型,而不会灾难性地忘记以前学到的知识。现有的方法通常依赖于存储来自先前任务的样本(经验重放)或采用复杂的正则化项来保护已学习的权重。然而,这些方法面临数据隐私、存储限制以及任务差异较大时性能下降等挑战。为了应对这些挑战,我们引入了MyGO(Memory Yielding Generative Offline-consolidation),这是一种受生物唤醒-睡眠周期启发的新型终身学习框架。在"唤醒"阶段,系统快速学习新任务,并训练一个紧凑的生成模型(生成记忆,G-mem)来捕获其数据分布。在"睡眠"阶段,系统进入离线状态,使用所有已学习的G-mem模型生成伪数据("梦"),并通过知识蒸馏将新旧知识合并到核心特征提取器中。这种方法消除了存储任何原始数据的需要,仅保留紧凑的生成模型,这在隐私和存储效率方面具有显著优势。我们在计算机视觉(Split-MNIST)和自然语言处理(Split-AG News)基准上评估了MyGO,并将其与顺序微调基线进行了比较。结果表明,MyGO显著减轻了灾难性遗忘,并在任务中保持了较高的平均准确率,证明了该框架的有效性和领域通用性。
摘要:Continual or Lifelong Learning aims to develop models capable of acquiring new knowledge from a sequence of tasks without catastrophically forgetting what has been learned before. Existing approaches often rely on storing samples from previous tasks (experience replay) or employing complex regularization terms to protect learned weights. However, these methods face challenges related to data privacy, storage limitations, and performance degradation when tasks are dissimilar. To address these challenges, we introduce MyGO (Memory Yielding Generative Offline-consolidation), a novel lifelong learning framework inspired by the biological wake-sleep cycle. During the "wake" phase, the system rapidly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture its data distribution. During the "sleep" phase, the system enters an offline state, using all learned G-mem models to generate pseudo-data ("dreams") and consolidate new and old knowledge into a core feature extractor via knowledge distillation. This approach obviates the need to store any raw data, retaining only compact generative models, which offers significant advantages in privacy and storage efficiency. We evaluate MyGO on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks, comparing it against a sequential fine-tuning baseline. The results demonstrate that MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks, proving the framework's effectiveness and domain-generality.
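"睡眠"阶段的巩固信号本质上是知识蒸馏损失;下面是一个极简草图(温度等超参数为假设,非作者实现),在伪数据上最小化教师与学生温度软化输出之间的KL散度:

```python
import numpy as np

def softmax(z, T=1.0):
    """温度软化的 softmax。"""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student), 乘以 T^2 以保持梯度量级(Hinton 式蒸馏)。"""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum() * T * T)

same = distill_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])   # 输出一致 => 损失为 0
diff = distill_loss([0.0, 0.0, 5.0], [5.0, 0.0, 0.0])   # 分歧越大损失越大
```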


【71】BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
标题:重温BLUEX:通过自动字幕增强基准覆盖率
链接:https://arxiv.org/abs/2508.21294

作者:herme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida
备注:12 pages, 5 figures, 2 tables
摘要:随着大型语言模型(LLM)功能的不断增长,对鲁棒的评估方法的需求越来越大,特别是在多语言和非英语环境中。我们提出了BLUEX数据集的更新版本,现在包括2024-2025年的考试和使用最先进的模型自动生成的图像标题,增强了其与LLM预训练中数据污染研究的相关性。字幕策略将纯文本模型的可访问性提高了40%以上,产生了1,422个可用问题,是原始BLUEX的两倍多。我们评估了商业和开源LLM及其通过标题利用视觉上下文的能力。
摘要:With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.


【72】Efficient Code Embeddings from Code Generation Models
标题:来自代码生成模型的高效代码嵌入
链接:https://arxiv.org/abs/2508.21290

作者:vosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
备注:9 pages, table and evaluations 5-9
摘要:jina-code-embeddings是一个新颖的代码嵌入模型套件,旨在从自然语言查询中检索代码、执行技术问答,以及跨编程语言识别语义相似的代码片段。它创新性地使用了在文本和代码上预训练的自回归主干,通过最后令牌池化(last-token pooling)生成嵌入。我们概述了训练配方,并展示了尽管模型规模相对较小仍达到最先进的性能,验证了这种构建代码嵌入模型的方法。
摘要:jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
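最后令牌池化的机制很简单:取自回归主干最后一个非填充位置的隐状态作为整段嵌入。下面是一个numpy草图(形状约定与L2归一化为常见做法,属假设,非该模型的官方实现):

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """hidden_states: [batch, seq, dim]; attention_mask: [batch, seq],
    1 表示真实令牌。取每个样本最后一个真实令牌的隐状态并做 L2 归一化。"""
    hs = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(attention_mask)
    last_idx = mask.sum(axis=1) - 1               # 最后一个真实令牌的位置
    pooled = hs[np.arange(hs.shape[0]), last_idx]
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

hs = np.zeros((1, 3, 2))
hs[0, 1] = [3.0, 4.0]                 # 位置 1 是最后一个非填充令牌
vec = last_token_pool(hs, [[1, 1, 0]])
```

对因果注意力模型,只有最后一个令牌"看到"了整个序列,因此比平均池化更契合自回归主干。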


【73】Breaking the Cold-Start Barrier: Reinforcement Learning with Double and Dueling DQNs
标题:打破冷启动障碍:双重和决斗DQN的强化学习
链接:https://arxiv.org/abs/2508.21259

作者:o
摘要:推荐系统很难为交互历史有限的新用户提供准确的建议,这一挑战被称为冷用户问题。本文提出了一种强化学习方法,使用双重和决斗深度Q网络(DQN)从稀疏反馈中动态学习用户偏好,在不依赖敏感人口统计数据的情况下提高推荐准确性。通过将这些先进的DQN变体与矩阵分解模型相结合,与基于流行度和主动学习策略等传统方法相比,我们在大型电子商务数据集上实现了卓越的性能。实验结果表明,我们的方法,特别是决斗DQN,降低了冷用户的均方根误差(RMSE),为隐私受限的环境提供了有效的解决方案。
摘要:Recommender systems struggle to provide accurate suggestions to new users with limited interaction history, a challenge known as the cold-user problem. This paper proposes a reinforcement learning approach using Double and Dueling Deep Q-Networks (DQN) to dynamically learn user preferences from sparse feedback, enhancing recommendation accuracy without relying on sensitive demographic data. By integrating these advanced DQN variants with a matrix factorization model, we achieve superior performance on a large e-commerce dataset compared to traditional methods like popularity-based and active learning strategies. Experimental results show that our method, particularly Dueling DQN, reduces Root Mean Square Error (RMSE) for cold users, offering an effective solution for privacy-constrained environments.
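决斗DQN的核心是把Q函数分解为状态价值流与优势流:Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)。下面用numpy给出头部前向的草图(权重为玩具占位,属假设):

```python
import numpy as np

def dueling_q(features, W_v, b_v, W_a, b_a):
    """决斗头部: V 流给出状态价值, A 流给出各动作优势,
    对优势去均值以保证分解的可辨识性。"""
    V = features @ W_v + b_v                       # [batch, 1]
    A = features @ W_a + b_a                       # [batch, n_actions]
    return V + A - A.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8))                        # 4 个状态, 8 维特征
W_v = rng.normal(size=(8, 1))
W_a = rng.normal(size=(8, 5))                      # 5 个候选动作(物品)
Q = dueling_q(f, W_v, 0.0, W_a, 0.0)
```

去均值后,各动作Q值的均值恰好等于V(s):这使模型即便在稀疏反馈下也能先学好"该状态有多好",对冷用户尤为有用。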


【74】A Mixture of Experts Gating Network for Enhanced Surrogate Modeling in External Aerodynamics
标题:用于增强外部空气动力学代理建模的混合专家门控网络
链接:https://arxiv.org/abs/2508.21249

作者:Amin Nabian, Sanjay Choudhry
摘要:与高保真CFD模拟相关的计算成本仍然是汽车设计和优化周期中的一个重要瓶颈。虽然基于ML的代理模型已成为加速空气动力学预测的一种有前途的替代方案,但该领域的特点是专业神经网络架构多样化且快速演进,没有单一模型表现出普遍的优越性。本文介绍了一种新的元学习框架,将这种架构多样性作为一种优势加以利用。我们提出了一种专家混合(MoE)模型,该模型采用专用的门控网络来动态地最优组合来自三个异构的最先进代理模型的预测:DoMINO,一种可分解的多尺度神经算子;X-MeshGraphNet,一种可扩展的多尺度图神经网络;以及FigConvNet,一种因子化的隐式全局卷积网络。门控网络学习一种空间变化的加权策略,根据每个专家在预测表面压力和壁面剪切应力场方面的局部性能为其分配可信度。为了防止模型崩溃并鼓励专家贡献的平衡,我们将熵正则化项集成到训练损失函数中。整个系统在DrivAerML数据集上进行训练和验证,该数据集是汽车空气动力学高保真CFD模拟的大规模公共基准。定量结果表明,MoE模型显著降低了L-2预测误差,在所有评估的物理量上不仅优于集成平均,而且优于最准确的单个专家模型。这项工作确立了MoE框架作为一种强大而有效的策略,通过协同结合专门架构的互补优势,创建更鲁棒、更准确的复合代理模型。
摘要:The computational cost associated with high-fidelity CFD simulations remains a significant bottleneck in the automotive design and optimization cycle. While ML-based surrogate models have emerged as a promising alternative to accelerate aerodynamic predictions, the field is characterized by a diverse and rapidly evolving landscape of specialized neural network architectures, with no single model demonstrating universal superiority. This paper introduces a novel meta-learning framework that leverages this architectural diversity as a strength. We propose a Mixture of Experts (MoE) model that employs a dedicated gating network to dynamically and optimally combine the predictions from three heterogeneous, state-of-the-art surrogate models: DoMINO, a decomposable multi-scale neural operator; X-MeshGraphNet, a scalable multi-scale graph neural network; and FigConvNet, a factorized implicit global convolution network. The gating network learns a spatially-variant weighting strategy, assigning credibility to each expert based on its localized performance in predicting surface pressure and wall shear stress fields. To prevent model collapse and encourage balanced expert contributions, we integrate an entropy regularization term into the training loss function. The entire system is trained and validated on the DrivAerML dataset, a large-scale, public benchmark of high-fidelity CFD simulations for automotive aerodynamics. Quantitative results demonstrate that the MoE model achieves a significant reduction in L-2 prediction error, outperforming not only the ensemble average but also the most accurate individual expert model across all evaluated physical quantities. This work establishes the MoE framework as a powerful and effective strategy for creating more robust and accurate composite surrogate models by synergistically combining the complementary strengths of specialized architectures.
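逐点门控与熵正则的计算可以用几行numpy示意(玩具草图,正则系数与形状约定为假设,非作者实现):门控网络对每个表面点输出各专家的softmax权重,熵正则项惩罚过于尖锐的门控分布以防止塌缩到单一专家。

```python
import numpy as np

def moe_combine(expert_preds, gate_logits):
    """expert_preds: [n_experts, points]; gate_logits: [points, n_experts]。
    返回逐点加权预测与门控权重。"""
    g = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    g = g / g.sum(axis=-1, keepdims=True)
    return (g.T * expert_preds).sum(axis=0), g

def entropy_penalty(gates, coeff=0.01):
    """负的平均门控熵, 加入总损失后鼓励门控保持较高熵。"""
    ent = -(gates * np.log(gates + 1e-12)).sum(axis=-1)
    return -coeff * float(ent.mean())

preds = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])   # 3 个专家, 2 个表面点
pred, gates = moe_combine(preds, np.zeros((2, 3)))        # 零 logits => 均匀门控
pen = entropy_penalty(gates)
```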


【75】Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification
标题:用于增强音频分类的全频率时间修补和结构化掩蔽
链接:https://arxiv.org/abs/2508.21243

作者:kineni, Baocheng Geng, Qing Tian
摘要:Transformer和状态空间模型(SSM)通过将频谱图建模为补丁序列来改进音频分类。然而,现有的模型,如音频频谱图Transformer(AST)和音频曼巴(AuM),采用来自计算机视觉的方形修补,这会破坏连续的频率模式并产生过多的补丁,减慢训练速度并增加计算量。我们提出了全频时间修补(FFTP),这是一种更好地匹配频谱图时频不对称性的修补策略:它以局部化的时间上下文跨越整个频带,保留谐波结构,并显著减少补丁数量和计算量。我们还介绍了SpecMask,一种与补丁对齐的频谱图增强方法,它在固定的掩蔽预算下结合了全频掩码和局部时频掩码,在保持频谱连续性的同时增强了时间鲁棒性。当应用于AST和AuM时,我们的修补方法结合SpecMask在AudioSet-18k上将mAP提高了至多+6.76,在SpeechCommandsV2上将准确度提高了至多+8.46,同时将计算量减少了至多83.26%,证明了性能和效率的双重提升。
摘要:Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training, and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure, and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied on both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
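两种机制都可以直接在频谱图数组上示意(草图,补丁宽度、掩码数量与尺寸均为假设的超参数,非作者实现)。注意补丁数的差异:对128x100的频谱图,宽度为4的全频补丁只有25个,而16x16的方形补丁有48个。

```python
import numpy as np

def fftp_patches(spec, t_width):
    """全频时间修补: 每个补丁覆盖全部 F 个频率 bin、t_width 帧。"""
    F, T = spec.shape
    n = T // t_width
    return spec[:, : n * t_width].reshape(F, n, t_width).transpose(1, 0, 2)

def spec_mask(spec, n_full=1, band=8, n_local=2, tw=10, fw=8, rng=None):
    """SpecMask 草图: 全频时间掩码 + 局部时频掩码, 共用固定掩蔽预算。"""
    rng = rng or np.random.default_rng(0)
    out = spec.copy()
    F, T = spec.shape
    for _ in range(n_full):                       # 跨全部频带的时间掩码
        t0 = int(rng.integers(0, T - band))
        out[:, t0:t0 + band] = 0.0
    for _ in range(n_local):                      # 局部时频掩码
        t0 = int(rng.integers(0, T - tw))
        f0 = int(rng.integers(0, F - fw))
        out[f0:f0 + fw, t0:t0 + tw] = 0.0
    return out

spec = np.ones((128, 100))
patches = fftp_patches(spec, 4)          # [25, 128, 4]: 25 个全频补丁
square = (128 // 16) * (100 // 16)       # 对比: 16x16 方形补丁为 48 个
masked = spec_mask(spec)
```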


【76】Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs
标题:通过知识图谱解决阿尔茨海默病研究中LLM的准确性和幻觉
链接:https://arxiv.org/abs/2508.21238

作者:Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman
摘要:在过去的两年中,基于大型语言模型(LLM)的聊天机器人,如ChatGPT,通过实现多样化的任务完成和问答功能,彻底改变了各个领域。然而,它们在科学研究中的应用仍然受到诸如幻觉、特定领域知识有限以及响应缺乏可解释性或可追溯性等挑战的限制。基于图的检索增强生成(GraphRAG)已经成为一种很有前途的方法,通过在响应生成之前集成特定领域的上下文信息来提高聊天机器人的可靠性,解决标准LLM的一些限制。尽管有潜力,但评估GraphRAG在阿尔茨海默病或其他生物医学领域等需要深入知识的特定领域上表现的研究仍然有限。在本文中,我们评估了两个流行的GraphRAG系统的质量和可追溯性。我们编译了一个包含50篇论文和70个与阿尔茨海默病相关的专家问题的数据库,构建了一个GraphRAG知识库,并采用GPT-4o作为回答查询的LLM。然后,我们将GraphRAG生成的响应质量与标准GPT-4o模型生成的响应质量进行比较。此外,我们还讨论和评估了几个检索增强生成(RAG)和GraphRAG系统的可追溯性。最后,我们提供了一个带有预构建阿尔茨海默病数据库的易用界面,供研究人员测试标准RAG和GraphRAG的性能。
摘要:In the past two years, large language model (LLM)-based chatbots, such as ChatGPT, have revolutionized various domains by enabling diverse task completion and question-answering capabilities. However, their application in scientific research remains constrained by challenges such as hallucinations, limited domain-specific knowledge, and lack of explainability or traceability for the response. Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising approach to improving chatbot reliability by integrating domain-specific contextual information before response generation, addressing some limitations of standard LLMs. Despite its potential, there are only limited studies that evaluate GraphRAG on specific domains that require intensive knowledge, like Alzheimer's disease or other biomedical domains. In this paper, we assess the quality and traceability of two popular GraphRAG systems. We compile a database of 50 papers and 70 expert questions related to Alzheimer's disease, construct a GraphRAG knowledge base, and employ GPT-4o as the LLM for answering queries. We then compare the quality of responses generated by GraphRAG with those from a standard GPT-4o model. Additionally, we discuss and evaluate the traceability of several Retrieval-Augmented Generation (RAG) and GraphRAG systems. Finally, we provide an easy-to-use interface with a pre-built Alzheimer's disease database for researchers to test the performance of both standard RAG and GraphRAG.


【77】Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
标题:解码记忆:自我一致性幻觉检测的有效管道
链接:https://arxiv.org/abs/2508.21228

作者:o, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
备注:14 pages, under review
摘要:大型语言模型(LLM)在研究和现实世界的应用中都表现出了令人印象深刻的性能,但它们仍然在与幻觉作斗争。现有的幻觉检测方法往往在句子级生成上表现不佳,或严重依赖于特定领域的知识。虽然自我一致性方法有助于解决这些限制,但由于重复生成,它们会产生高计算成本。在本文中,我们进行了第一项识别自我一致性方法中冗余的研究,这种冗余表现为跨多次生成共享的前缀令牌,并观察到非精确答案令牌对语义内容的贡献极小。基于这些见解,我们提出了一种新的解码记忆流水线(Decoding Memory Pipeline, DMP),通过选择性推理和退火解码来加速生成。我们的DMP与模型、数据集、解码策略和自我一致性基线正交,持续提高多响应生成的效率,并有望扩展到对齐和推理任务。大量的实验表明,我们的方法在不牺牲AUROC性能的情况下实现了高达3倍的加速。
摘要:Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
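摘要中提到的前缀冗余可以用一个玩具核算来示意(仅为说明,非作者实现):k条采样回答共享的最长公共前缀令牌只需解码一次,其余k-1次可以复用。

```python
def shared_prefix_savings(generations):
    """generations: k 个令牌列表。返回因共享前缀而省去的解码步数。"""
    if not generations:
        return 0
    prefix_len = 0
    for column in zip(*generations):        # 按位置逐列比较各回答的令牌
        if len(set(column)) == 1:
            prefix_len += 1
        else:
            break
    return prefix_len * (len(generations) - 1)

gens = [["The", "answer", "is", "4"],
        ["The", "answer", "is", "5"],
        ["The", "answer", "is", "4"]]
saved = shared_prefix_savings(gens)   # 前缀长 3, 共省 3*(3-1)=6 步
```

这也直观地解释了为何自我一致性的多数开销花在语义贡献极小的共享令牌上。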


【78】Generalizable Object Re-Identification via Visual In-Context Prompting
标题:通过视觉上下文提示实现可泛化的对象重识别
链接:https://arxiv.org/abs/2508.21222

作者:Huang, Xiaoming Liu
备注:ICCV 2025
摘要:当前的对象重识别(ReID)方法训练领域特定模型(例如,针对人或车辆),其缺乏泛化能力并且需要为新类别提供昂贵的标记数据。虽然自监督学习通过学习实例级不变性来减少注释需求,但它很难捕获对ReID至关重要的身份敏感特征。本文提出了一种新的框架Visual In-Context Prompting(VICP),在该框架中,在已知类别上训练的模型可以仅使用上下文示例作为提示,直接泛化到未见过的新类别,而不需要参数自适应。VICP协同LLM和视觉基础模型(VFM):LLM通过特定任务提示从少样本正/负对中推断语义身份规则,然后引导VFM(例如,DINO)通过动态视觉提示提取ID区分特征。通过将LLM派生的语义概念与VFM预训练的先验知识对齐,VICP能够泛化到新的类别,从而消除了针对特定数据集再训练的需要。为了支持评估,我们引入了ShopID10K,这是一个来自电子商务平台的10K对象实例的数据集,具有多视图图像和跨域测试。在ShopID10K和各种ReID基准上的实验表明,VICP在未见类别上明显优于基线。代码可在https://github.com/Hzzone/VICP上获得。
摘要:Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture \textit{identity-sensitive} features critical for ReID. This paper proposes Visual In-Context Prompting~(VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only \textit{in-context examples} as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models~(VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (\eg, DINO) to extract ID-discriminative features via \textit{dynamic visual prompts}. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.


【79】Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
标题:通过基于像素的方法增强自回归语言模型对正字法攻击的鲁棒性
链接:https://arxiv.org/abs/2508.21206

作者: Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
摘要:自回归语言模型容易受到正字法攻击,其中输入文本被来自多语言字母表的字符扰动,导致性能大幅下降。此漏洞主要源于子词分词器及其嵌入中固有的词汇表外问题。为了解决这个问题,我们提出了一种基于像素的生成语言模型,通过将单词渲染为单独的图像,用基于像素的表示取代基于文本的嵌入。这种设计对噪声输入提供了更强的鲁棒性,同时扩展了对跨不同书写系统的多语言文本的兼容性。我们在多语言LAMBADA数据集、WMT24数据集和SST-2基准测试上评估了所提出的方法,证明了其对正字法噪声的韧性及其在多语言环境中的有效性。
摘要:Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while an extension of compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
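"单词即图像"的输入流水线形态可以用一个玩具渲染器示意(强假设的草图:真实系统会用字体光栅化,使形近字符如拉丁'a'与西里尔'а'得到相近的像素表示;这里仅用码点字节位生成占位图块,演示"无词表、固定尺寸图像输入"的接口):

```python
import numpy as np

def render_word(word, glyph=8):
    """把每个字符映射为 glyph x glyph 的二值图块, 横向拼接成单词图像。
    返回形状 [glyph, glyph * len(word)] 的数组, 不依赖任何词表。"""
    tiles = []
    for ch in word:
        bits = np.unpackbits(np.frombuffer(ch.encode("utf-8"), dtype=np.uint8))
        tiles.append(np.resize(bits, (glyph, glyph)).astype(float))
    return np.concatenate(tiles, axis=1)

img = render_word("ab")   # 两个字符 => 8 x 16 的二值图
```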


【80】Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
标题:模糊、象征性和上下文:通过认知支架加强LLM教学
链接:https://arxiv.org/abs/2508.21204

作者:igueiredo
摘要:我们研究了架构归纳偏置如何影响大型语言模型(LLM)在教学对话中的认知行为。我们引入了一种与短期记忆模式配对的符号支架机制,旨在促进苏格拉底式辅导中的自适应、结构化推理。通过对五个系统变体的受控消融,我们依据专家设计的评分标准(涵盖支架、响应性、符号推理和会话记忆)评估模型输出。我们给出了使用基于LLM的评估框架(与认知基础的评分标准对齐)得到的初步结果。这使得在早期实验中能够对各架构变体进行可扩展的系统比较。初步结果表明,我们的完整系统始终优于基线变体。分析表明,去除记忆或符号结构会削弱关键的认知行为,包括抽象、自适应追问和概念连续性。这些发现支持一种处理层面的解释:架构支架能够可靠地塑造LLM中涌现的教学策略。
摘要:We study how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding mechanism paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which architectural scaffolds can reliably shape emergent instructional strategies in LLMs.


【81】Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
标题:改进航空安全分析:使用强化学习和群体相对政策优化的自动化HFACS分类
链接:https://arxiv.org/abs/2508.21201

作者:adi, Sarah Sharif, Yaser Banad
摘要:分析航空事故背后的人为因素对于预防未来的事故至关重要,但使用人为因素分析和分类系统(HFACS)的传统方法受到可扩展性和一致性的限制。为了解决这个问题,我们引入了一个用于航空安全分析的自动化HFACS分类框架,该框架利用强化学习和组相对策略优化(GRPO)来微调Llama-3.1 8B语言模型。我们的方法结合了一个专为航空安全分析设计的多组件奖励系统,并集成了合成数据生成,以克服事故数据集的类不平衡。由此产生的GRPO优化模型实现了显著的性能提升,包括精确匹配精度提高了350%(从0.0400提高到0.1800),部分匹配精度提高到0.8800。值得注意的是,我们的专业模型在关键指标上优于最先进的LLM(大型语言模型),包括GPT-5-mini和Gemini-2.5-flash。本研究还提出将多标签HFACS分类问题中的精确匹配精度作为一种新的基准方法,来评估语言模型的高级推理能力。最终,我们的工作验证了更小的、领域优化的模型可以为关键安全分析提供一个计算高效且更优的解决方案。这种方法使在资源受限的边缘设备上进行功能强大的低延迟部署成为可能。
摘要:Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
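该框架核心的GRPO以组内奖励归一化取代学习的价值网络。下面是该优势计算的一个最小示意(仅为说明;论文中面向航空安全的多组件奖励并未在此复现,以下奖励数值均为假设):

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled completion's reward
    by the mean and std of its own group, removing the need for a critic."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# one prompt, a group of 4 sampled HFACS classifications with scalar rewards
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
print([round(a, 3) for a in adv])  # → [-1.414, 1.414, 0.0, 0.0]
```

得到的优势随后像PPO一样与裁剪后的策略比率相乘,但整个过程无需价值模型。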


【82】Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium
标题:下一词元预测中的流形轨迹:从复制子动力学到Softmax均衡
链接:https://arxiv.org/abs/2508.21186

作者:er R. Lee-Jenkins
摘要:大型语言模型中的解码通常被描述为对词元打分并用softmax归一化。我们将这一步骤刻画为概率单纯形上的约束变分原理,给出一个最小且自洽的解释。保持归一化的离散上升即经典的乘法权重(熵镜像)更新;其连续时间极限是复制子流。由这些要素我们证明,对于固定的上下文和温度,下一词元分布沿单纯形内部的平滑轨迹演化,并收敛到softmax均衡。这在输出分布层面形式化了常见的"流形穿越"直觉。该分析产生了精确且面向实践的结论:温度相当于沿同一轨迹对时间的精确重缩放,而top-k与核采样则将流限制在具有相同保证的面上。我们还概述了对路径依赖分数调整的受控解释,及其与循环式、幻觉式行为的联系。我们不对训练动态或内部表征作任何断言;这些留待未来工作。
摘要:Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common "manifold traversal" intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
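摘要的核心论断——保持归一化的上升是乘法权重更新,其不动点即softmax分布——可以数值验证。以下是一个小示意,假设目标函数为熵正则化得分 ⟨s, p⟩ + T·H(p)(与摘要的变分原理一致,但步长、迭代数等均为示例取值):

```python
import math

def softmax(scores, temperature=1.0):
    m = max(s / temperature for s in scores)
    exps = [math.exp(s / temperature - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mirror_ascent(scores, temperature=1.0, step=0.1, iters=2000):
    """Entropic mirror ascent on F(p) = <s, p> + T*H(p) over the simplex.
    Each update is a multiplicative-weights step, the discrete analogue of
    the replicator flow; the constant part of the gradient is absorbed by
    normalization."""
    n = len(scores)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # gradient of F at p (up to a constant): s_i - T * log p_i
        w = [pi * math.exp(step * (s - temperature * math.log(pi)))
             for pi, s in zip(p, scores)]
        z = sum(w)
        p = [wi / z for wi in w]
    return p

scores = [2.0, 1.0, 0.5, -1.0]
traj_end = mirror_ascent(scores, temperature=0.7)
target = softmax(scores, temperature=0.7)
# ~0: the trajectory ends at the softmax equilibrium softmax(scores / T)
print(max(abs(a - b) for a, b in zip(traj_end, target)))
```

把步长 `step` 缩小一半、迭代数加倍会沿同一条轨迹到达同一终点,这正对应"温度/步长是对时间的重缩放"这一结论。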


【83】BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
标题:BED-LLM:使用LLM和Bayesian实验设计的智能信息收集
链接:https://arxiv.org/abs/2508.21184

作者:oudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
摘要:我们提出了一种通用方法,利用序贯贝叶斯实验设计(BED)框架,提升大型语言模型(LLM)从用户或其他外部来源智能、自适应地收集信息的能力。这使LLM能够充当有效的多轮对话代理,并与外部环境交互。我们的方法称为BED-LLM(Bayesian Experimental Design with Large Language Models),基于迭代选择问题或查询,使其在给定已收集响应的条件下最大化关于目标任务的预期信息增益(EIG)。我们展示了如何利用源自LLM信念分布的概率模型,以有原则的方式构造该EIG,并对其构建中的关键决策提供了详细的见解。BED-LLM成功的另一个关键在于若干具体创新,例如精心设计的EIG估计器、不单纯依赖上下文更新来对先前响应进行条件化,以及提出候选查询的有针对性策略。我们发现,在基于20个问题游戏的广泛测试以及使用LLM主动推断用户偏好的任务中,与直接提示LLM及其他自适应设计策略相比,BED-LLM在性能上取得了实质性提升。
摘要:We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
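其查询选择准则——在当前信念下挑选期望信息增益(EIG)最大的问题——可在一个"20个问题"式的玩具设定上演示(数值为假设;论文中的EIG由LLM的信念分布导出,而非手工编码):

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def eig(prior, yes_prob):
    """Expected information gain of a yes/no query under the current belief:
    H(belief) - E_answers[H(posterior)].  yes_prob[h] = P('yes' | hypothesis h)."""
    p_yes = sum(p * a for p, a in zip(prior, yes_prob))
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is already determined: zero information
    post_yes = [p * a / p_yes for p, a in zip(prior, yes_prob)]
    post_no = [p * (1 - a) / (1 - p_yes) for p, a in zip(prior, yes_prob)]
    return entropy(prior) - p_yes * entropy(post_yes) - (1 - p_yes) * entropy(post_no)

# 20-questions toy: four equally likely items, two candidate questions
prior = [0.25] * 4
balanced = [1, 1, 0, 0]  # splits the hypotheses 2/2
skewed = [1, 1, 1, 0]    # splits them 3/1
print(eig(prior, balanced), eig(prior, skewed))  # the balanced split wins (1.0 bit vs ~0.81)
```

贪心地选EIG最大的查询,即摘要中"迭代选择问题"的一步;论文的估计器还需处理LLM信念分布本身的噪声。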


【84】FUTURE: Flexible Unlearning for Tree Ensemble
标题:FUTURE:树集成的灵活遗忘学习
链接:https://arxiv.org/abs/2508.21181

作者:en, Jin Huang, Jiali Cheng, Yuchan Guo, Mengjie Wang, Lalitesh Morishetti, Kaushiki Nag, Hadi Amiri
备注:CIKM 2025
摘要:树集成因其在分类任务中的有效性而被广泛认可,在生物信息学、金融和医疗诊断等不同领域实现了最先进的性能。随着对数据隐私和被遗忘权的日益重视,已有若干遗忘学习算法被提出,以使树集成忘记敏感信息。然而,现有方法往往针对特定模型或依赖离散的树结构,难以推广到复杂集成,且在大规模数据集上效率低下。为了解决这些限制,我们提出了FUTURE,一种新的树集成遗忘学习算法。具体来说,我们将遗忘样本的问题表述为基于梯度的优化任务。为了适应树集成的不可微性,我们在优化框架中采用概率模型近似。这使得端到端遗忘能够以有效且高效的方式进行。在真实世界数据集上的大量实验表明,FUTURE取得了显著且成功的遗忘性能。
摘要:Tree ensembles are widely recognized for their effectiveness in classification tasks, achieving state-of-the-art performance across diverse domains, including bioinformatics, finance, and medical diagnosis. With increasing emphasis on data privacy and the right to be forgotten, several unlearning algorithms have been proposed to enable tree ensembles to forget sensitive information. However, existing methods are often tailored to a particular model or rely on the discrete tree structure, making them difficult to generalize to complex ensembles and inefficient for large-scale datasets. To address these limitations, we propose FUTURE, a novel unlearning algorithm for tree ensembles. Specifically, we formulate the problem of forgetting samples as a gradient-based optimization task. In order to accommodate the non-differentiability of tree ensembles, we adopt probabilistic model approximations within the optimization framework. This enables end-to-end unlearning in an effective and efficient manner. Extensive experiments on real-world datasets show that FUTURE yields significant and successful unlearning performance.
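摘要的关键一步——用概率(sigmoid)近似替代硬划分,使"遗忘"变成基于梯度的优化——可在单个树桩上示意。以下为说明性重构,并非论文的实际公式:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SoftStump:
    """Sigmoid relaxation of a hard decision stump: the larger the
    sharpness k, the closer routing is to the original hard split."""
    def __init__(self, threshold, left, right, sharpness=8.0):
        self.t, self.left, self.right, self.k = threshold, left, right, sharpness

    def predict(self, x):
        g = sigmoid(self.k * (x - self.t))  # soft routing weight to the right leaf
        return (1 - g) * self.left + g * self.right

def unlearn_step(stump, x, y, lr=0.5):
    """One gradient *ascent* step on the squared error of a forget sample,
    pushing the leaf values away from fitting (x, y)."""
    g = sigmoid(stump.k * (x - stump.t))
    err = stump.predict(x) - y
    # d(loss)/d(left) = err*(1-g), d(loss)/d(right) = err*g; ascend to forget
    stump.left += lr * err * (1 - g)
    stump.right += lr * err * g

stump = SoftStump(threshold=0.0, left=-1.0, right=1.0)
x, y = 0.5, 0.9                       # the sample to be forgotten
before = abs(stump.predict(x) - y)
unlearn_step(stump, x, y)
after = abs(stump.predict(x) - y)
print(before < after)  # → True: the stump's fit to the forget sample degrades
```

论文在整个集成上端到端地做这种基于梯度的优化,并配合保持保留数据性能的约束;此处仅展示"软化后即可求导"这一点。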


【85】Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks
标题:深度残差回声状态网络:探索未训练的循环神经网络中的残差正交连接
链接:https://arxiv.org/abs/2508.21172

作者:nna, Andrea Ceni, Claudio Gallicchio
备注:10 pages, 6 figures
摘要:回声状态网络(ESN)是储备池计算(RC)框架内的一类特殊的未训练循环神经网络(RNN),因其快速高效的学习而流行。然而,传统ESN往往难以进行长期信息处理。在本文中,我们介绍了一类新的基于时间残差连接的深度未训练RNN,称为深度残差回声状态网络(DeepResESN)。我们表明,利用未训练残差循环层的层次结构可以显著提升记忆容量和长期时间建模能力。对于时间残差连接,我们考虑了不同的正交配置,包括随机生成和固定结构的配置,并研究了它们对网络动力学的影响。深入的数学分析给出了确保DeepResESN内部动态稳定的充要条件。我们在各种时间序列任务上的实验展示了所提出的方法相对于传统浅层和深层RC的优势。
摘要:Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.
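对时间残差连接的一种可能解读(仅为示意,并非作者的确切更新规则):将前一状态经过一个固定的正交映射——这里用循环移位,一种最简单的固定结构正交选择——再加到通常的未训练储备池转移上。正交性的意义在于精确保持所携带状态的范数:

```python
import math, random

random.seed(0)
N_IN, N_RES = 1, 8

# untrained (fixed, random) input and recurrent weights, as in reservoir computing
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_RES)]
W = [[random.uniform(-0.3, 0.3) / N_RES for _ in range(N_RES)] for _ in range(N_RES)]

def ortho_residual(h):
    """Cyclic shift of the state vector: a fixed-structure orthogonal map."""
    return h[-1:] + h[:-1]

def step(h, x):
    """Residual reservoir update: orthogonal carry of the old state plus a
    standard tanh reservoir transition (all weights untrained)."""
    pre = [sum(wi * xj for wi, xj in zip(W_in[i], x)) +
           sum(wj * hj for wj, hj in zip(W[i], h)) for i in range(N_RES)]
    return [r + math.tanh(p) for r, p in zip(ortho_residual(h), pre)]

# the orthogonal carry preserves the norm of the residual branch
norm = lambda u: math.sqrt(sum(c * c for c in u))
v = [random.gauss(0, 1) for _ in range(N_RES)]
print(abs(norm(ortho_residual(v)) - norm(v)))  # essentially zero

h = [0.0] * N_RES
for _ in range(5):
    h = step(h, [1.0])  # drive the reservoir with a constant input
```

在真实的ESN中,只有读出层在储备池状态上训练;论文分析的正是这类残差连接在何种条件下仍保持稳定动态。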


【86】Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
标题:量化大型语言模型自我评估和交叉评估中标签引起的偏见
链接:https://arxiv.org/abs/2508.21164

作者:raf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
摘要:大型语言模型(LLM)越来越多地被用于评估输出,但它们的判断可能受到影响。本研究考察了ChatGPT、Gemini和Claude在四种条件下自我和跨模型评估中的偏差:无标签、真实标签以及两种虚假标签场景。由每个模型撰写的博客文章由三个模型共同评估,使用整体偏好投票以及连贯性、信息性和简洁性的质量评分,所有分数都表示为百分比以便直接比较。结果显示出惊人的不对称性:无论实际内容如何,"Claude"标签一贯提高分数,而"Gemini"标签一贯降低分数。虚假标签经常颠倒排名,导致偏好投票发生高达50个百分点的变化,换算后的质量评分发生高达12个百分点的变化。Gemini的自我评分在真实标签下崩溃,而Claude的自我偏好则加剧。这些发现表明,感知到的模型身份会严重扭曲高层次判断,并微妙地影响细粒度的质量评分,凸显了采用盲评或多模型评估协议以确保LLM基准测试公平性的必要性。
摘要:Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the "Claude" label consistently boosts scores, while the "Gemini" label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini's self-scores collapsed under true labels, while Claude's self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.


【87】RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
标题:RadGS-Reg:通过联合3D辐射高斯重建和3D/3D配准将脊柱CT与双平面X射线配准
链接:https://arxiv.org/abs/2508.21154

作者:Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
备注:11 pages, 2 figures
摘要:由于对高精度和实时性能的严格要求,图像引导导航中的计算机断层扫描(CT)/X射线配准仍然具有挑战性。传统的“绘制和比较”方法依赖于迭代投影和比较,存在空间信息丢失和域间隙问题。从双平面X射线的3D重建补充了2D/3D配准的空间和形状信息,但目前的方法受到密集视图要求的限制,并与噪声X射线作斗争。为了解决这些局限性,我们引入了RadGS-Reg,这是一种通过联合3D辐射高斯(RadGS)重建和3D/3D配准进行椎体级CT/X射线配准的新框架。具体来说,我们的双平面X射线椎骨RadGS重建模块探索了基于学习的RadGS重建方法,具有反事实注意力学习(CAL)机制,专注于噪声X射线中的椎骨区域。此外,患者特定的预训练策略逐步将RadGS-Reg从模拟数据调整为真实数据,同时学习椎骨形状先验知识。在内部数据集上的实验证明了这两项任务的最新性能,超过了现有的方法。该代码可从以下网址获得:https://github.com/shenao1995/RadGS_Reg。
摘要:Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggles with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts the RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate the state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.


【88】WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration
标题:WaveLLDM:用于语音增强和恢复的轻量级潜在扩散模型的设计和开发
链接:https://arxiv.org/abs/2508.21153

作者:ra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi
摘要:高质量的音频在广泛的应用中至关重要,包括在线通信、虚拟助手和多媒体行业。然而,由噪声、压缩和传输伪影引起的降级仍然是一个主要挑战。虽然扩散模型已被证明对音频恢复是有效的,但它们通常需要大量的计算资源,并且难以处理较长的缺失片段。本研究介绍了WaveLLDM(Wave Lightweight Latent Diffusion Model),这是一种集成了高效神经音频编解码器和潜在扩散的架构,用于音频恢复和去噪。与在时域或频谱域中操作的传统方法不同,WaveLLDM在压缩的潜在空间中处理音频,从而在保持重建质量的同时降低计算复杂度。在Voicebank+DEMAND测试集上的经验评估表明,WaveLLDM实现了准确的频谱重建,具有低对数频谱距离(LSD)分数(0.48至0.60)和对未知数据的良好适应性。然而,与最先进的方法相比,它在感知质量和语音清晰度方面仍然表现不佳,WB-PESQ得分范围为1.62至1.71,STOI得分在0.76至0.78之间。这些限制归因于次优的架构调优、缺乏微调以及训练持续时间不足。尽管如此,结合神经音频编解码器和潜在扩散模型的灵活架构为未来的开发提供了坚实的基础。
摘要:High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log-Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state-of-the-art methods in terms of perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
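上文报告的对数谱距离(LSD)有标准定义,可直接示意(论文实际使用的加窗与尺度处理可能不同):

```python
import math

def log_spectral_distance(spec_ref, spec_est, eps=1e-12):
    """Log-Spectral Distance between two magnitude spectrograms (lists of
    frames, each a list of magnitudes), in dB; lower is better."""
    frame_dists = []
    for ref_frame, est_frame in zip(spec_ref, spec_est):
        diffs = [(20 * math.log10(r + eps) - 20 * math.log10(e + eps)) ** 2
                 for r, e in zip(ref_frame, est_frame)]
        frame_dists.append(math.sqrt(sum(diffs) / len(diffs)))
    return sum(frame_dists) / len(frame_dists)

# toy magnitude spectrograms: 2 frames x 3 frequency bins (hypothetical values)
ref = [[1.0, 0.5, 0.25], [0.8, 0.4, 0.2]]
est = [[m * 1.1 for m in frame] for frame in ref]  # uniform +10% magnitude error
print(log_spectral_distance(ref, ref))  # → 0.0 (identical spectra)
print(round(log_spectral_distance(ref, est), 3))  # ≈ 0.828 dB
```

将重建音频与参考音频的STFT幅度谱代入该式,即得到摘要中0.48至0.60一类的LSD分数。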


【89】A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
标题:科学大型语言模型概览:从数据基础到代理前沿
链接:https://arxiv.org/abs/2508.21148

作者:Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He
摘要:科学大语言模型(Sci-LLM)正在改变知识在科学研究中的表示、集成和应用方式,但它们的进展是由科学数据的复杂性决定的。这项调查提出了一个全面的,以数据为中心的综合,重新构建了Sci-LLM的发展,作为模型及其底层数据之间的共同进化。我们制定了一个统一的科学数据分类和科学知识的层次模型,强调多模态,跨尺度和特定领域的挑战,区分科学语料库从一般的自然语言处理数据集。我们系统地回顾了最近的Sci-LLM,从通用基础到跨不同科学学科的专业模型,以及对270多个训练前/训练后数据集的广泛分析,说明了为什么Sci-LLM提出了不同的要求-异构,多尺度,不确定性负载语料库,需要保留域不变性并实现跨模态推理的表示。在评估方面,我们检查了190多个基准数据集,并通过高级评估协议跟踪了从静态考试向过程和发现导向评估的转变。这些以数据为中心的分析突出了科学数据开发中的持续问题,并讨论了涉及半自动注释管道和专家验证的新兴解决方案。最后,我们概述了一个向闭环系统的范式转变,在闭环系统中,基于Sci-LLM的自主代理积极进行实验,验证并为一个活生生的,不断发展的知识库做出贡献。总的来说,这项工作为构建可信赖的、不断发展的人工智能(AI)系统提供了路线图,这些系统在加速科学发现方面发挥着真正的合作伙伴作用。
摘要:Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.


【90】HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
标题:HiddenObject:用于多模态隐藏目标检测的模态不可知融合
链接:https://arxiv.org/abs/2508.21135

作者:ng, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
摘要:检测隐藏或部分遮蔽的物体仍然是多模态环境中的一个基本挑战,遮挡、伪装和光照变化等因素会显著阻碍性能。传统的基于RGB的检测方法在这种不利条件下往往失效,这促使人们需要更鲁棒的、与模态无关的方法。在这项工作中,我们提出了HiddenObject,一个使用基于Mamba的融合机制集成RGB、热成像和深度数据的融合框架。我们的方法捕获不同模态间的互补信号,从而增强对被遮蔽或伪装目标的检测。具体来说,所提出的方法识别特定于模态的特征,并将其融合为一个统一表示,该表示能很好地泛化到具有挑战性的场景。我们在多个基准数据集上验证了HiddenObject,与现有方法相比展示了最先进或具有竞争力的性能。这些结果突出了我们融合设计的有效性,并揭示了当前单模态和朴素融合策略的关键局限。更广泛地说,我们的研究结果表明,基于Mamba的融合架构可以显著推进多模态目标检测领域,尤其是在视觉退化或复杂条件下。
摘要:Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naive fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.


【91】R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
标题:R-4B:通过双模式退火和强化学习激励MLLM中的通用自动思考能力
链接:https://arxiv.org/abs/2508.21113

作者:, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
备注:20 pages, 14 figures, 5 tables
摘要:多模态大型语言模型(MLLM)具有逐步思维能力,在复杂的推理问题上表现出了卓越的性能。然而,这种思考过程对于无需复杂推理即可解决的简单问题来说是多余的。为了解决这种效率低下的问题,我们提出了R-4B,一种自动思考的MLLM,它可以根据问题的复杂性自适应地决定何时思考。R-4B的核心思想是使用双模式退火赋予模型思考和非思考能力,并应用双模式策略优化(BPO)来提高模型在确定是否激活思考过程时的准确性。具体来说,我们首先在一个精心策划的涵盖各种主题的数据集上训练模型,该数据集包含来自思维和非思维模式的样本。然后,在改进的GRPO框架下进行第二阶段的训练,其中策略模型被迫为每个输入查询生成来自两种模式的响应。实验结果表明,R-4B在25个具有挑战性的基准测试中达到了最先进的性能。它在大多数任务中的性能优于Qwen2.5-VL-7B,并在推理密集型基准测试中以较低的计算成本实现了与Kimi-VL-A3B-Thinking-2506(16B)等大型模型相当的性能。
摘要:Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization~(BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.


【92】EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
标题:EmbodiedOneVision:用于通用机器人控制的交错视觉-文本-动作预训练
链接:https://arxiv.org/abs/2508.21112

作者: Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
摘要:人类在开放世界中无缝执行多模态推理和物理交互的能力是通用具身智能系统的核心目标。最近的视觉-语言-动作(VLA)模型在大规模机器人和视觉-文本数据上联合训练,在通用机器人控制方面取得了显著进展。然而,它们仍然无法在交错推理和交互中实现人类水平的灵活性。本文介绍了EO-Robotics,包括EO-1模型和EO-Data1.5M数据集。EO-1是一个统一的具身基础模型,通过交错的视觉-文本-动作预训练,在多模态具身推理和机器人控制方面实现了卓越的性能。EO-1的开发基于两个关键支柱:(i)一个不加区分地处理多模态输入(图像、文本、视频和动作)的统一架构,以及(ii)一个大规模、高质量的多模态具身推理数据集EO-Data1.5M,其中包含超过150万个样本,重点是交错的视觉-文本-动作理解。EO-1通过EO-Data1.5M上自回归解码与流匹配去噪之间的协同进行训练,实现无缝的机器人动作生成和多模态具身推理。大量实验证明了交错视觉-文本-动作学习对于开放世界理解和泛化的有效性,并通过跨多种具身形态的各种长视野、灵巧操纵任务得到验证。本文详细介绍了EO-1的体系结构、EO-Data1.5M的数据构建策略和训练方法,为开发先进的具身基础模型提供了有价值的见解。
摘要:The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.


【93】Automating the Deep Space Network Data Systems; A Case Study in Adaptive Anomaly Detection through Agentic AI
标题:深空网络数据系统自动化;通过代理式人工智能(Agentic AI)进行自适应异常检测的案例研究
链接:https://arxiv.org/abs/2508.21111

作者:hou (1 and 2), Lisa S. Locke (3), Harvey M. Soldan (3) ((1) University of California San Diego, (2) Pasadena City College, (3) Jet Propulsion Laboratory California Institute of Technology)
摘要:深空网络(DSN)是美国宇航局最大的天线设施网络,可生成大量多变量时间序列数据。这些设施包含深空网络天线和发射器,这些天线和发射器在很长一段时间内会发生退化,这可能会导致数据流中断,并威胁到数十个依赖深空网络作为生命线的航天器的地球连接。这项研究的目的是试验不同的方法,这些方法将能够帮助喷气推进实验室的工程师通过收集的数据直接查明异常和设备退化,并继续为未来的宇宙空间任务进行DSN的维护和操作。因此,我们研究了各种机器学习技术,这些技术可以通过预测分析完全重建数据,并通过统计计算和阈值确定实时数据集中的异常数据条目。除了经过充分训练和测试的机器学习模型之外,我们还集成了强化学习子系统的使用,该子系统根据严重程度对已识别的异常进行分类,并使用大型语言模型为每个异常数据条目标记解释,所有这些都可以通过人工反馈/输入随着时间的推移进行改进和微调。具体来说,对于DSN发射器,我们还实现了一个完整的数据管道系统,将数据提取、解析和处理工作流连接在一起,因为之前没有连贯的程序或脚本来执行这些任务。使用这个数据管道系统,我们还能够连接从DSN天线数据训练的模型,完成DSN异常检测的数据工作流程。这一切都由代理人工智能系统包裹并进一步连接,其中利用复杂的推理来确定异常数据的分类和预测。
摘要:The Deep Space Network (DSN) is NASA's largest network of antenna facilities that generate a large volume of multivariate time-series data. These facilities contain DSN antennas and transmitters that undergo degradation over long periods of time, which may cause costly disruptions to the data flow and threaten the earth-connection of dozens of spacecraft that rely on the Deep Space Network for their lifeline. The purpose of this study was to experiment with different methods that would be able to assist JPL engineers with directly pinpointing anomalies and equipment degradation through collected data, and continue conducting maintenance and operations of the DSN for future space missions around our universe. As such, we have researched various machine learning techniques that can fully reconstruct data through predictive analysis, and determine anomalous data entries within real-time datasets through statistical computations and thresholds. On top of the fully trained and tested machine learning models, we have also integrated the use of a reinforcement learning subsystem that classifies identified anomalies based on severity level and a Large Language Model that labels an explanation for each anomalous data entry, all of which can be improved and fine-tuned over time through human feedback/input. Specifically, for the DSN transmitters, we have also implemented a full data pipeline system that connects the data extraction, parsing, and processing workflow all together as there was no coherent program or script for performing these tasks before. Using this data pipeline system, we were able to then also connect the models trained from DSN antenna data, completing the data workflow for DSN anomaly detection. This was all wrapped around and further connected by an agentic AI system, where complex reasoning was utilized to determine the classifications and predictions of anomalous data.
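文中描述的对实时数据条目做统计计算和阈值判断,可以简化为一个z分数规则来示意(玩具示例;该研究的实际流水线还结合了学习到的重建模型与强化学习分级):

```python
import math

def detect_anomalies(series, threshold=3.0):
    """Flag entries whose z-score exceeds the threshold - a simple
    statistical rule of the kind described for telemetry streams."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return []  # a constant stream has no statistical outliers
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]

# hypothetical telemetry channel with one obvious spike at index 22
telemetry = [10.0] * 20 + [10.2, 9.9, 45.0] + [10.1] * 10
print(detect_anomalies(telemetry))  # → [22]
```

实际部署通常以滑动窗口计算均值和标准差,使阈值能随信道的缓慢漂移自适应调整。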


【94】An Explainable, Attention-Enhanced, Bidirectional Long Short-Term Memory Neural Network for Joint 48-Hour Forecasting of Temperature, Irradiance, and Relative Humidity
标题:可解释、注意力增强、双向长短期记忆神经网络,用于48小时联合预测温度、辐照度和相对湿度
链接:https://arxiv.org/abs/2508.21109

作者:Vamvouras, Konstantinos Braimakis, Christos Tzivanidis
备注:27 pages, 8 figures
摘要:本文提出了一种深度学习(DL)框架,用于48小时预测温度、太阳辐照度和相对湿度,以支持智能HVAC系统中的模型预测控制(MPC)。该方法采用带注意力的堆叠双向长短期记忆(BiLSTM)网络,通过联合预测所有三个变量来捕获时间和跨特征依赖性。带有编码周期性时间特征的历史气象数据(2019-2022)用于训练,2023年的数据用于评估泛化。该模型的平均绝对误差为1.3摄氏度(温度)、31 W/m2(辐照度)和6.7个百分点(湿度),优于最先进的数值天气预报和机器学习基准。积分梯度(Integrated Gradients)量化了特征贡献,注意力权重揭示了时间模式,增强了可解释性。通过结合多变量预测、基于注意力的DL和可解释性,这项工作推进了数据驱动的天气预测。所展示的准确性和透明度突出了该框架通过可靠的短期气象预报实现节能建筑控制的潜力。
摘要:This paper presents a Deep Learning (DL) framework for 48-hour forecasting of temperature, solar irradiance, and relative humidity to support Model Predictive Control (MPC) in smart HVAC systems. The approach employs a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with attention, capturing temporal and cross-feature dependencies by jointly predicting all three variables. Historical meteorological data (2019-2022) with encoded cyclical time features were used for training, while 2023 data evaluated generalization. The model achieved Mean Absolute Errors of 1.3 degrees Celsius (temperature), 31 W/m2 (irradiance), and 6.7 percentage points (humidity), outperforming state-of-the-art numerical weather prediction and machine learning benchmarks. Integrated Gradients quantified feature contributions, and attention weights revealed temporal patterns, enhancing interpretability. By combining multivariate forecasting, attention-based DL, and explainability, this work advances data-driven weather prediction. The demonstrated accuracy and transparency highlight the framework's potential for energy-efficient building control through reliable short-term meteorological forecasting.


【95】Learning to Generate Unit Test via Adversarial Reinforcement Learning
标题:通过对抗强化学习学习生成单元测试
链接:https://arxiv.org/abs/2508.21107

作者:ee, Changho Hwang, Kimin Lee
备注:Code is available at: this https URL
摘要:单元测试是编程中的核心实践,可以对人类开发人员或大型语言模型(LLM)生成的程序进行系统评估。考虑到编写全面单元测试的挑战,LLM已被用于自动化测试生成,但训练LLM生成高质量测试的方法仍未得到充分探索。在这项工作中,我们提出了UTRL,一种新的强化学习框架,它可以训练LLM在给定编程指令的情况下生成高质量的单元测试。我们的核心思想是通过强化学习以对抗的方式迭代训练两个LLM:单元测试生成器和代码生成器。单元测试生成器被训练以最大化区分奖励,该奖励反映其生成的测试能否暴露代码生成器解决方案中的缺陷;代码生成器被训练以最大化代码奖励,该奖励反映其解决方案能否通过测试生成器生成的单元测试。在我们的实验中,我们证明了通过UTRL训练的Qwen3-4B生成的单元测试,比在人类编写的真实单元测试上进行监督微调的同一模型所生成的测试质量更高,其产生的代码评估也与真实测试所引起的代码评估更为一致。此外,用UTRL训练的Qwen3-4B在生成高质量单元测试方面优于GPT-4.1等前沿模型,突出了UTRL在训练LLM执行此任务方面的有效性。
摘要:Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator's solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.
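摘要中"区分奖励"的直观含义是:一个好的测试应当让参考解通过、而让有缺陷的候选解失败。下面是一个玩具化示意(并非UTRL论文的实际奖励定义;所有函数名与任务均为笔者虚构):

```python
def discrimination_reward(test, reference_solution, candidates):
    # 有效测试必须让参考解通过; 奖励 = 被该测试"拒绝"的候选解比例
    if not test(reference_solution):
        return 0.0
    rejected = sum(1 for s in candidates if not test(s))
    return rejected / len(candidates)

# 玩具任务: 实现绝对值函数
def good(x): return abs(x)
def buggy(x): return x                      # 在负数上出错
def also_good(x): return -x if x < 0 else x

def weak_test(f): return f(3) == 3          # 暴露不了bug
def strong_test(f): return f(-2) == 2 and f(3) == 3

cands = [buggy, also_good]
print(discrimination_reward(weak_test, good, cands))    # → 0.0
print(discrimination_reward(strong_test, good, cands))  # → 0.5
```

可以看到,只覆盖正数的弱测试得不到区分奖励,而覆盖负数分支的强测试因拒绝了有bug的候选解而获得正奖励;这正是对抗训练中驱动测试生成器改进的信号。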


【96】Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models
标题:训练广义线性模型的全矩阵预处理器动态低阶逼近
链接:https://arxiv.org/abs/2508.21106

作者:atveeva, Aleksandr Katrutsa, Evgeny Frolov
摘要:Adagrad及其变体等自适应梯度方法在大规模优化中被广泛使用。然而,它们使用对角预处理矩阵,限制了捕获参数相关性的能力。近似精确海森矩阵的全矩阵自适应方法可以建模这些相关性,并可能实现更快的收敛。但与此同时,它们的计算和内存成本对于大规模模型来说往往是令人望而却步的。为了解决这个问题,我们提出了AdaGram,一个能高效执行全矩阵自适应梯度更新的优化器。为了减少内存和计算开销,我们在每次迭代中利用快速对称分解来计算预处理后的更新方向。此外,我们使用矩阵积分器方法,沿优化轨迹保持预处理器的低秩结构。标准机器学习任务上的数值实验表明,在使用秩为5及更小的低秩近似时,AdaGram的收敛速度快于或与对角自适应优化器的性能相当。这证明了AdaGram作为大型模型自适应优化的可扩展解决方案的潜力。
摘要:Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods, approximating the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we utilize fast symmetric factorization for computing the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of a preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster or matches the performance of diagonal adaptive optimizers when using rank five and smaller rank approximations. This demonstrates AdaGram's potential as a scalable solution for adaptive optimization in large models.
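AdaGram的核心思路是用低秩结构近似全矩阵预处理器。下面用numpy给出一个粗略概念示意:对累积梯度外积矩阵取前r个特征对来近似G^{-1/2}g。注意这只是笔者构造的草图,论文实际采用的快速对称分解与矩阵积分器方法并未在此实现,`history`缓冲区等也是示意性假设。

```python
import numpy as np

def low_rank_precondition(history, g, rank=5, damping=1e-4):
    # 用 G = sum g_i g_i^T 的前rank个特征对近似 G^{-1/2} g
    G = sum(np.outer(gi, gi) for gi in history)
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(vals)[::-1][:rank]
    U, s = vecs[:, idx], np.clip(vals[idx], 0.0, None)
    coef = U.T @ g
    residual = g - U @ coef          # 低秩子空间外的分量, 此处示意性地保持不变
    return U @ (coef / np.sqrt(s + damping)) + residual

rng = np.random.default_rng(1)
history = [rng.normal(size=8) for _ in range(20)]
g = rng.normal(size=8)
step = low_rank_precondition(history, g)
print(step.shape)  # → (8,)
```

与对角方法相比,这里的更新方向会沿梯度历史的主方向被各向异性地缩放,这正是全矩阵方法捕获参数相关性的方式;低秩截断则把O(d^2)的代价降到O(dr)。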


【97】PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
标题:PVPO:用于智能体推理的基于预估值的策略优化
链接:https://arxiv.org/abs/2508.21104

作者:eng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
备注:14 pages, 5 figures
摘要:无评论家(critic-free)的强化学习方法,特别是组策略方法,因其在复杂任务中的效率而引起了相当大的关注。然而,这些方法在很大程度上依赖于策略内部的多次采样和比较来估计优势,这可能导致策略陷入局部最优并增加计算成本。为了解决这些问题,我们提出了PVPO,一种通过优势参考锚和数据预采样增强的高效强化学习方法。具体来说,我们使用参考模型提前进行推演(rollout),并采用计算得到的奖励分数作为参考锚。我们的方法有效地纠正了组内比较引入的累积偏差,并显著降低了对推演数量的依赖。同时,参考模型可以在数据预采样时评估样本难度,从而有效选择高增益数据,提高训练效率。在两个领域的九个数据集上进行的实验表明,PVPO实现了最先进的(SOTA)性能。我们的方法不仅在多个任务中表现出强大的泛化能力,而且在不同规模的模型中表现出可扩展的性能。
摘要:Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to rollout in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
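摘要中"优势参考锚"可以理解为:不再用组内多次rollout的奖励均值作基线,而是用参考模型预先算好的奖励分数。以下为笔者构造的概念示意(数值均为虚构,并非PVPO的完整算法):

```python
def group_advantages(rewards):
    # GRPO式: 以同组rollout的奖励均值作为基线
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def anchored_advantages(rewards, reference_reward):
    # PVPO式示意: 以参考模型预先rollout算出的奖励分数作为固定锚点
    return [r - reference_reward for r in rewards]

rewards = [0.2, 0.4, 0.9]
print([round(a, 2) for a in group_advantages(rewards)])          # → [-0.3, -0.1, 0.4]
print([round(a, 2) for a in anchored_advantages(rewards, 0.6)])  # → [-0.4, -0.2, 0.3]
```

固定锚点使基线不随当前组的采样波动,因而对rollout数量的依赖更小,这与摘要所述"纠正组内比较引入的累积偏差、降低对推演数量的依赖"相对应。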


【98】Spatiotemporal EEG-Based Emotion Recognition Using SAM Ratings from Serious Games with Hybrid Deep Learning
标题:使用混合深度学习与严肃游戏的SAM评级进行基于时空脑电的情绪识别
链接:https://arxiv.org/abs/2508.21103

作者:man, Ilona Heldal, Jerry Chun-Wei Lin
摘要:基于EEG的情感识别的最新进展使用深度学习和经典机器学习方法显示出有希望的结果;然而,大多数现有研究都狭隘地关注二进制效价预测或特定于主题的分类,这限制了现实世界情感计算系统的可推广性和部署。为了解决这一差距,本文提出了一个统一的,多粒度的EEG情绪分类框架,建立在GAMEEMO数据集,其中包括14通道EEG记录和连续自我报告的情绪评级(无聊,可怕,平静,有趣)从28个主题在四个情绪诱导游戏场景。我们的管道采用了结构化的预处理策略,包括时间窗口分割,混合统计和频域特征提取,以及z分数归一化,将原始EEG信号转换为鲁棒的,有区别的输入向量。情绪标签是在三个互补的轴上导出和编码的:(i)基于积极和消极情绪评级的平均极性的二进制效价分类,以及(ii)多类情绪分类,其中预测最情绪状态的存在。(iii)通过将每种情绪分为10个有序类,实现细粒度多标签表示。我们评估了广泛的模型,包括随机森林,XGBoost和SVM,以及LSTM,LSTM-GRU和CNN-LSTM等深度神经架构。其中,LSTM-GRU模型始终优于其他模型,在二进制效价任务中实现了0.932的F1分数,在多类和多标签情感分类中分别达到了94.5%和90.6%。
摘要:Recent advancements in EEG-based emotion recognition have shown promising outcomes using both deep learning and classical machine learning approaches; however, most existing studies focus narrowly on binary valence prediction or subject-specific classification, which limits generalizability and deployment in real-world affective computing systems. To address this gap, this paper presents a unified, multigranularity EEG emotion classification framework built on the GAMEEMO dataset, which consists of 14-channel EEG recordings and continuous self-reported emotion ratings (boring, horrible, calm, and funny) from 28 subjects across four emotion-inducing gameplay scenarios. Our pipeline employs a structured preprocessing strategy that comprises temporal window segmentation, hybrid statistical and frequency-domain feature extraction, and z-score normalization to convert raw EEG signals into robust, discriminative input vectors. Emotion labels are derived and encoded across three complementary axes: (i) binary valence classification based on the averaged polarity of positive and negative emotion ratings, and (ii) Multi-class emotion classification, where the presence of the most affective state is predicted. (iii) Fine-grained multi-label representation via binning each emotion into 10 ordinal classes. We evaluate a broad spectrum of models, including Random Forest, XGBoost, and SVM, alongside deep neural architectures such as LSTM, LSTM-GRU, and CNN-LSTM. Among these, the LSTM-GRU model consistently outperforms the others, achieving an F1-score of 0.932 in the binary valence task and 94.5% and 90.6% in both multi-class and Multi-Label emotion classification.
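摘要中的预处理流程(时间窗分割→混合统计/频域特征→z分数归一化)可以概括为如下numpy草图。窗长、步长与频带均为示意性假设,并非GAMEEMO实验的实际配置:

```python
import numpy as np

def segment(signal, win, step):
    # 将一维信号切成重叠窗口
    return np.array([signal[i:i + win] for i in range(0, len(signal) - win + 1, step)])

def features(window, fs):
    # 混合特征: 基本统计量 + 8-13Hz(alpha频带)功率
    spec = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    alpha_power = spec[(freqs >= 8) & (freqs < 13)].sum()
    return np.array([window.mean(), window.std(), window.min(), window.max(), alpha_power])

def zscore(X):
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

fs = 128
sig = np.sin(2 * np.pi * 10 * np.arange(4 * fs) / fs)   # 4秒的10Hz正弦, 落在alpha频带
wins = segment(sig, win=fs, step=fs // 2)
X = zscore(np.array([features(w, fs) for w in wins]))
print(wins.shape, X.shape)  # → (7, 128) (7, 5)
```

归一化后的特征矩阵即可作为随机森林、SVM或LSTM类模型的输入;多通道EEG只需对每个通道重复上述流程再拼接特征。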


【99】Beyond Prediction: Reinforcement Learning as the Defining Leap in Healthcare AI
标题:超越预测:强化学习是医疗保健人工智能的定义性飞跃
链接:https://arxiv.org/abs/2508.21101

作者:rera, Gousia Habib, Qianyi Xu, Daniel J. Tan, Kai He, Erik Cambria, Mengling Feng
备注:40 pages in total (including appendix)
摘要:强化学习(RL)标志着人工智能在医疗保健中应用的根本转变。RL不是仅仅预测结果,而是以长期目标为导向主动决定干预措施。与基于固定关联的传统模型不同,强化学习系统通过试验、反馈和长期奖励优化来学习,从而引入了变革的可能性和新的风险。从信息融合的角度来看,医疗RL通常使用时间和决策级机制集成多源信号,如生命体征、化验结果、临床记录、成像和设备遥测。这些系统可以在集中式、联邦式或边缘架构中运行,以满足实时临床约束,并自然地跨越数据、特征和决策融合层级。本综述探讨RL在医疗保健领域的崛起:它不仅仅是一套工具,更是临床环境中向智能体式智能的转变。我们首先梳理RL技术的全景,包括基于模型和无模型的方法、离线和批量约束的方法,以及在医疗约束视角下进行奖励规范和不确定性校准的新兴策略。然后,我们全面分析了涵盖重症监护、慢性病、心理健康、诊断和机器人辅助的RL应用,确定了它们的趋势、差距和转化瓶颈。与之前的综述相比,我们批判性地分析了RL的伦理、部署和奖励设计挑战,并总结了安全、符合人类价值观的策略学习经验。本文既是一份技术路线图,也是对RL在医疗AI中新兴变革性作用的批判性反思:RL不是预测机器,而是智能体化的临床智能。
摘要:Reinforcement learning (RL) marks a fundamental shift in how artificial intelligence is applied in healthcare. Instead of merely predicting outcomes, RL actively decides interventions with long term goals. Unlike traditional models that operate on fixed associations, RL systems learn through trial, feedback, and long-term reward optimization, introducing transformative possibilities and new risks. From an information fusion lens, healthcare RL typically integrates multi-source signals such as vitals, labs, clinical notes, imaging, and device telemetry using temporal and decision-level mechanisms. These systems can operate within centralized, federated, or edge architectures to meet real-time clinical constraints, and naturally span data, feature, and decision fusion levels. This survey explores RL's rise in healthcare as more than a set of tools: rather, a shift toward agentive intelligence in clinical environments. We first structure the landscape of RL techniques including model-based and model-free methods, offline and batch-constrained approaches, and emerging strategies for reward specification and uncertainty calibration through the lens of healthcare constraints. We then comprehensively analyze RL applications spanning critical care, chronic disease, mental health, diagnostics, and robotic assistance, identifying their trends, gaps, and translational bottlenecks. In contrast to prior reviews, we critically analyze RL's ethical, deployment, and reward design challenges, and synthesize lessons for safe, human-aligned policy learning. This paper serves as both a technical roadmap and a critical reflection of RL's emerging transformative role in healthcare AI: not as prediction machinery, but as agentive clinical intelligence.


【100】Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models
标题:Safe-Control:用于缓解文本到图像生成模型中不安全内容的安全补丁
链接:https://arxiv.org/abs/2508.21099

作者:Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
摘要:尽管文本到图像(T2I)生成模型取得了进步,但其被误用甚至滥用的可能性引起了严重的安全问题。模型开发人员已经做出了巨大的努力来引入安全机制,以解决T2I模型中的这些问题。然而,现有的安全机制,无论是外部的还是内部的,要么在分布偏移下仍然容易被规避,要么需要进行大量特定于模型的调整。为了解决这些限制,我们引入了Safe-Control,这是一种创新的即插即用安全补丁,旨在减轻T2I模型中的不安全内容生成。Safe-Control使用数据驱动策略和安全感知条件,将安全控制信号注入已锁定的T2I模型,以类似补丁的方式进行更新。模型开发人员还可以构建各种安全补丁以满足不断变化的安全需求,这些安全补丁可以灵活地合并为单个统一的补丁。其即插即用设计进一步确保了适应性,使其与具有类似去噪架构的其他T2I模型兼容。我们对六种不同的公共T2I模型进行了广泛的评估。实证结果表明,Safe-Control可有效减少六种具有类似生成架构的T2I模型中的不安全内容生成,同时成功保持了良性图像的质量和文本对齐。与包括外部和内部防御在内的七种最先进的安全机制相比,Safe-Control在减少不安全内容生成方面明显优于所有基线。例如,在不安全提示和最新的对抗性攻击下,它将不安全内容生成的概率降低到7%,而大多数基线方法约为20%。
摘要:Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.


【101】TrInk: Ink Generation with Transformer Network
标题:TrInk:通过Transformer Network生成墨水
链接:https://arxiv.org/abs/2508.21098

作者:in, Shubhang Desai, Xu Chen, Biyi Fang, Zhuoyi Huang, Zhe Li, Chong-Xin Gan, Xiao Tu, Man-Wai Mak, Yan Lu, Shujie Liu
备注:Accepted to EMNLP 2025 Main Conference
摘要:在本文中,我们提出了TrInk,一个基于Transformer的墨迹生成模型,它有效地捕获了全局依赖关系。为了更好地促进输入文本和生成的笔画点之间的对齐,我们在交叉注意力模块中引入了缩放位置嵌入和高斯记忆掩码。此外,我们设计了主观和客观评估管道,以全面评估生成笔迹的易读性和风格一致性。实验表明,与以前的方法相比,我们的基于Transformer的模型在IAM-OnDB数据集上将字符错误率(CER)降低了35.56%,将单词错误率(WER)降低了29.66%。我们提供了一个演示页面,其中包含来自TrInk和基线模型的手写样本,网址为:https://akahello-a11y.github.io/trink-demo/
摘要:In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
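摘要中的"高斯记忆掩码"旨在让交叉注意力偏向与当前查询位置成比例对齐的键位置。以下是一个概念性numpy草图(并非TrInk的实际实现;`sigma`与比例对齐方式均为笔者假设):

```python
import numpy as np

def gaussian_memory_mask(n_queries, n_keys, sigma=1.5):
    # 对第t个查询, 在与其成比例对齐的键位置附近给出高斯形偏置
    # (log空间, 以加法叠加到注意力logits上, 再做softmax)
    q = np.arange(n_queries)[:, None]
    k = np.arange(n_keys)[None, :]
    centers = q * (n_keys - 1) / max(n_queries - 1, 1)
    return -((k - centers) ** 2) / (2 * sigma ** 2)

mask = gaussian_memory_mask(4, 8)
print(mask.argmax(axis=1))  # 每个查询的最大偏置落在其成比例对齐的键位置 → [0 2 5 7]
```

这类软性单调先验在文本到笔迹、文本到语音等序列对齐任务中很常见:它不硬性限制注意力,但让模型更容易学到随书写推进而前移的对齐。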


【102】Model-Driven Quantum Code Generation Using Large Language Models and Retrieval-Augmented Generation
标题:使用大型语言模型和检索增强生成的模型驱动量子代码生成
链接:https://arxiv.org/abs/2508.21097

作者:iavash, Armin Moin
备注:This paper is accepted to the New Ideas and Emerging Results (NIER) track of the ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems (MODELS)
摘要:本文通过利用可由检索增强生成(RAG)管道增强的大型语言模型(LLM),介绍了模型到文本/代码转换的一个新研究方向。重点是量子和混合量子-经典软件系统,其中模型驱动的方法可以帮助降低成本,减轻与异构平台环境和开发人员技能缺乏相关的风险。我们验证了所提想法之一:从软件系统的UML模型实例生成代码。生成的Python代码使用名为Qiskit的成熟库,可在基于门或基于电路的量子计算机上执行。我们部署的RAG管道包含来自公共GitHub存储库的示例Qiskit代码。实验结果表明,精心设计的提示可以将CodeBLEU分数最多提高至四倍,从而产生更准确和一致的量子代码。然而,所提出的研究方向未来还可以通过进一步调查超越这一点,例如通过实验来解决我们在此提出的其他研究问题和想法,如将软件系统模型实例部署为RAG管道中的信息源,或将LLM用于代码到代码转换,例如用于转译(transpilation)用例。
摘要:This paper introduces a novel research direction for model-to-text/code transformations by leveraging Large Language Models (LLMs) that can be enhanced with Retrieval-Augmented Generation (RAG) pipelines. The focus is on quantum and hybrid quantum-classical software systems, where model-driven approaches can help reduce the costs and mitigate the risks associated with the heterogeneous platform landscape and lack of developers' skills. We validate one of the proposed ideas regarding generating code out of UML model instances of software systems. This Python code uses a well-established library, called Qiskit, to execute on gate-based or circuit-based quantum computers. The RAG pipeline that we deploy incorporates sample Qiskit code from public GitHub repositories. Experimental results show that well-engineered prompts can improve CodeBLEU scores by up to a factor of four, yielding more accurate and consistent quantum code. However, the proposed research direction can go beyond this through further investigation in the future by conducting experiments to address our other research questions and ideas proposed here, such as deploying software system model instances as the source of information in the RAG pipelines, or deploying LLMs for code-to-code transformations, for instance, for transpilation use cases.
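摘要中的RAG管道大致为"检索示例Qiskit代码→拼入提示→交给LLM生成"。下面用纯Python勾勒前两步(以词重叠代替真实的向量检索,语料与提示模板均为笔者虚构的示意,并非论文管道):

```python
def retrieve(query, corpus, k=1):
    # 以词重叠数排序, 作为真实向量检索器的最小替代
    q = set(query.lower().split())
    return sorted(corpus, key=lambda s: len(q & set(s.lower().split())), reverse=True)[:k]

def build_prompt(instruction, snippets):
    context = "\n\n".join(snippets)
    return f"Reference Qiskit examples:\n{context}\n\nTask: {instruction}"

corpus = [
    "from qiskit import QuantumCircuit\nqc = QuantumCircuit(2)\nqc.h(0)\nqc.cx(0, 1)",
    "import numpy as np\nprint(np.eye(2))",
]
prompt = build_prompt("create a Bell state circuit",
                      retrieve("qiskit bell state circuit", corpus))
print("QuantumCircuit" in prompt)  # → True
```

在论文设定下,检索语料来自公共GitHub上的Qiskit示例,查询则可由UML模型实例导出;此处仅演示"检索到的代码作为上下文进入提示"这一结构。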


【103】CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples
标题:CoBA:反偏见文本增强,通过语义三重组缓解各种虚假相关性
链接:https://arxiv.org/abs/2508.21083

作者:in, Juhwan Choi, Jungmin Yun, Junho Lee, Soojin Jang, Youngbin Kim
备注:Accepted at EMNLP 2025
摘要:深度学习模型经常学习和利用训练数据中的虚假相关性,使用这些非目标特征来通知它们的预测。这种依赖导致性能下降和对未知数据的泛化能力差。为了解决这些限制,我们引入了一种更通用的反事实数据增强形式,称为反偏见数据增强,它同时解决了多种偏见(例如,性别偏差、简单性偏差),并增强分布外稳健性。我们介绍CoBA:CounterBias Augmentation是一个统一的框架,在语义三重层次上运行:首先将文本分解为主谓宾三元组,然后选择性地修改这些三元组以破坏虚假的相关性。通过从这些调整后的三元组中重建文本,CoBA生成了消除虚假模式的反偏见数据。通过大量实验,我们证明CoBA不仅提高了下游任务性能,还有效地减少了偏差并增强了分布外弹性,为虚假相关性带来的挑战提供了通用且强大的解决方案。
摘要:Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.


【104】Evaluating Differentially Private Generation of Domain-Specific Text
标题:评估特定领域文本的差分隐私生成
链接:https://arxiv.org/abs/2508.20452

作者:, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Warren Del-Pinto, Goran Nenadic, Siew-Kei Lam, Jie Zhang, Anil A Bharath
摘要:生成式人工智能为医疗保健和金融等高风险领域提供了变革潜力,但隐私和监管障碍阻碍了真实世界数据的使用。为了解决这个问题,差分隐私合成数据生成已成为一个有前途的替代方案。在这项工作中,我们引入了一个统一的基准,系统地评估在形式化差分隐私(DP)保证下生成的文本数据集的实用性和保真度。我们的基准解决了特定领域基准测试中的关键挑战,包括代表性数据和现实隐私预算的选择、对预训练的考虑以及多样的评估指标。我们在五个特定领域数据集上评估了最先进的隐私保护生成方法,结果显示与真实数据相比,尤其是在严格的隐私约束下,实用性和保真度显著下降。这些发现强调了当前方法的局限性,指出了对先进的隐私保护数据共享方法的需求,并为在现实场景中评估这些方法树立了先例。
摘要:Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.


【105】QuadKAN: KAN-Enhanced Quadruped Motion Control via End-to-End Reinforcement Learning
标题:QuadKAN:通过端到端强化学习实现KAN增强的四足动物运动控制
链接:https://arxiv.org/abs/2508.19153

作者:g, Gavin Tao
备注:14pages, 9 figures, Journal paper
摘要:我们用强化学习(RL)解决视觉引导的四足运动控制问题,并强调将本体感觉与视觉相结合以实现鲁棒控制的必要性。我们提出了QuadKAN,一种用Kolmogorov-Arnold网络(KAN)实例化的样条参数化跨模态策略。该框架包含一个用于本体感觉的样条编码器和一个用于本体感觉-视觉输入的样条融合头。这种结构化函数类将状态到动作的映射与步态的分段平滑性质对齐,提高了样本效率,减少了动作抖动和能耗,并提供了可解释的姿势-动作灵敏度。我们采用多模态延迟随机化(MMDR),并使用近端策略优化(PPO)进行端到端训练。在不同地形(包括平坦和不平坦的表面以及具有静态或动态障碍物的场景)上的评估表明,与最先进的(SOTA)基线相比,QuadKAN始终获得更高的回报、更长的行进距离和更少的碰撞。这些结果表明,样条参数化策略为鲁棒的视觉引导运动提供了一种简单、有效且可解释的替代方案。论文录用后将公开代码仓库。
摘要:We address vision-guided quadruped motion control with reinforcement learning (RL) and highlight the necessity of combining proprioception with vision for robust control. We propose QuadKAN, a spline-parameterized cross-modal policy instantiated with Kolmogorov-Arnold Networks (KANs). The framework incorporates a spline encoder for proprioception and a spline fusion head for proprioception-vision inputs. This structured function class aligns the state-to-action mapping with the piecewise-smooth nature of gait, improving sample efficiency, reducing action jitter and energy consumption, and providing interpretable posture-action sensitivities. We adopt Multi-Modal Delay Randomization (MMDR) and perform end-to-end training with Proximal Policy Optimization (PPO). Evaluations across diverse terrains, including both even and uneven surfaces and scenarios with static or dynamic obstacles, demonstrate that QuadKAN achieves consistently higher returns, greater distances, and fewer collisions than state-of-the-art (SOTA) baselines. These results show that spline-parameterized policies offer a simple, effective, and interpretable alternative for robust vision-guided locomotion. A repository will be made available upon acceptance.


【106】Database Normalization via Dual-LLM Self-Refinement
标题:通过Dual-LLM自我细化实现数据库规范化
链接:https://arxiv.org/abs/2508.17693

作者:, Nakyung Lee, Gyuyeong Kim
备注:5 pages
摘要:数据库规范化对于保持数据完整性至关重要。然而,由于它通常由数据工程师手动执行,因此耗时且容易出错。为此,我们提出了Miffie,一个利用大型语言模型能力的数据库规范化框架。Miffie可以实现无需人工的自动数据规范化,同时保持高准确性。Miffie的核心是一个双模型自细化架构,它分别结合了在规范化模式生成和验证上表现最佳的模型。生成模块根据验证模块的反馈消除异常,直到输出模式满足规范化要求。我们还精心设计了针对特定任务的zero-shot提示,以指导模型实现高精度和成本效益。实验结果表明,Miffie可以在保持高精度的同时规范化复杂的数据库模式。
摘要:Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.


【107】NSPDI-SNN: An efficient lightweight SNN based on nonlinear synaptic pruning and dendritic integration
标题:NSPDI-SNN:基于非线性突触修剪和树枝整合的高效轻量级SNN
链接:https://arxiv.org/abs/2508.21566

作者:, Hongze Sun, Jiayi He, Qianqian Liao, Yunliang Zang, Duo Chen, Dezhong Yao, Daqing Guo
备注:13 pages, 8 figures, 5 tables; This manuscript has been submitted for possible publication
摘要:脉冲神经网络(Spiking Neural Networks,SNN)是基于模拟生物神经元的人工神经网络,在近年来的人工智能技术研究中引起了广泛的关注。生物神经元中的树突具有高效的信息处理能力和计算能力,然而SNN中的神经元模型很少能匹配树突的复杂结构。受神经元树突的非线性结构和高度稀疏特性的启发,在这项研究中,我们提出了一种高效、轻量级的非线性突触修剪与树突整合SNN方法(NSPDI-SNN)。在该方法中,我们引入了非线性树突整合(NDI),以改善神经元对时空信息的表示。我们实现了树突棘的异构状态转换率,并构造了一种新的、灵活的非线性突触修剪(NSP)方法来实现SNN的高稀疏性。我们在三个基准数据集(DVS128 Gesture、CIFAR10-DVS和CIFAR10)上进行了系统的实验,并将评估扩展到两个复杂任务(语音识别和基于强化学习的迷宫导航任务)。在所有任务中,NSPDI-SNN始终以最小的性能下降实现高稀疏性。特别是,我们的方法在所有三个事件流数据集上都取得了最好的实验结果。进一步分析表明,随着稀疏度的增加,NSPDI显著提高了突触信息传递的效率。总之,我们的研究结果表明,神经元树突的复杂结构和非线性计算为开发高效的SNN方法提供了一条有前景的途径。
摘要:Spiking neural networks (SNNs) are artificial neural networks based on simulated biological neurons and have attracted much attention in recent artificial intelligence technology studies. The dendrites in biological neurons have efficient information processing ability and computational power; however, the neurons of SNNs rarely match the complex structure of the dendrites. Inspired by the nonlinear structure and highly sparse properties of neuronal dendrites, in this study, we propose an efficient, lightweight SNN method with nonlinear pruning and dendritic integration (NSPDI-SNN). In this method, we introduce nonlinear dendritic integration (NDI) to improve the representation of the spatiotemporal information of neurons. We implement heterogeneous state transition ratios of dendritic spines and construct a new and flexible nonlinear synaptic pruning (NSP) method to achieve the high sparsity of SNN. We conducted systematic experiments on three benchmark datasets (DVS128 Gesture, CIFAR10-DVS, and CIFAR10) and extended the evaluation to two complex tasks (speech recognition and reinforcement learning-based maze navigation task). Across all tasks, NSPDI-SNN consistently achieved high sparsity with minimal performance degradation. In particular, our method achieved the best experimental results on all three event stream datasets. Further analysis showed that NSPDI significantly improved the efficiency of synaptic information transfer as sparsity increased. In conclusion, our results indicate that the complex structure and nonlinear computation of neuronal dendrites provide a promising approach for developing efficient SNN methods.


【108】EconAgentic in DePIN Markets: A Large Language Model Approach to the Sharing Economy of Decentralized Physical Infrastructure
标题:DePIN市场中的EconAgentic:去中心化物理基础设施共享经济的大语言模型方法
链接:https://arxiv.org/abs/2508.21368

作者:, Mocca Schweitzer
摘要:去中心化物理基础设施(DePIN)市场正在通过基于代币的经济学和管理去中心化运营的智能合约彻底改变共享经济。到2024年,DePIN项目的市值已超过100亿美元,凸显其快速增长。然而,这些市场不受监管的性质,加上智能合约中人工智能代理的自主部署,带来了效率低下和与人类价值观潜在不一致等风险。为了解决这些问题,我们引入了EconAgentic,一个由大型语言模型(LLM)驱动的、旨在缓解这些挑战的框架。我们的研究集中在三个关键领域:1)建模DePIN市场的动态演变;2)评估利益相关者的行动及其经济影响;3)分析宏观经济指标,使市场结果与社会目标保持一致。通过EconAgentic,我们模拟了AI代理如何响应代币激励、投资基础设施并适应市场条件,并将AI驱动的决策与人类启发式基准进行比较。我们的结果表明,EconAgentic为DePIN市场的效率、包容性和稳定性提供了有价值的见解,有助于学术理解和去中心化代币化经济的设计与治理的实际改进。
摘要:The Decentralized Physical Infrastructure (DePIN) market is revolutionizing the sharing economy through token-based economics and smart contracts that govern decentralized operations. By 2024, DePIN projects have exceeded $10 billion in market capitalization, underscoring their rapid growth. However, the unregulated nature of these markets, coupled with the autonomous deployment of AI agents in smart contracts, introduces risks such as inefficiencies and potential misalignment with human values. To address these concerns, we introduce EconAgentic, a Large Language Model (LLM)-powered framework designed to mitigate these challenges. Our research focuses on three key areas: 1) modeling the dynamic evolution of DePIN markets, 2) evaluating stakeholders' actions and their economic impacts, and 3) analyzing macroeconomic indicators to align market outcomes with societal goals. Through EconAgentic, we simulate how AI agents respond to token incentives, invest in infrastructure, and adapt to market conditions, comparing AI-driven decisions with human heuristic benchmarks. Our results show that EconAgentic provides valuable insights into the efficiency, inclusion, and stability of DePIN markets, contributing to both academic understanding and practical improvements in the design and governance of decentralized, tokenized economies.


【109】A Financial Brain Scan of the LLM
标题:大语言模型(LLM)的金融脑部扫描
链接:https://arxiv.org/abs/2508.21285

作者: Antoine Didisheim, Luciano Somoza, Hanqing Tian
备注:47 pages
摘要:计算机科学中的新兴技术使“大脑扫描”大型语言模型(LLM)成为可能,识别指导其推理的简单英语概念,并在保持其他因素不变的情况下引导它们。我们表明,这种方法可以将LLM生成的经济预测映射到情绪,技术分析和时机等概念,并在不降低性能的情况下计算它们的相对重要性。我们还表明,模型可以被引导到或多或少的风险厌恶,乐观或悲观,这使得研究人员能够纠正或模拟偏见。该方法对于社会科学的实证研究来说是透明、轻量级且可复制的。
摘要:Emerging techniques in computer science make it possible to "brain scan" large language models (LLMs), identify the plain-English concepts that guide their reasoning, and steer them while holding other factors constant. We show that this approach can map LLM-generated economic forecasts to concepts such as sentiment, technical analysis, and timing, and compute their relative importance without reducing performance. We also show that models can be steered to be more or less risk-averse, optimistic, or pessimistic, which allows researchers to correct or simulate biases. The method is transparent, lightweight, and replicable for empirical research in the social sciences.


【110】Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance
标题:胸部X光检查肺部疾病严重程度分类的深度主动学习:在阶级失衡的情况下用更少的数据进行学习
链接:https://arxiv.org/abs/2508.21263

作者:briel, Mohammadreza Zandehshahvar, Marly van Assen, Nattakorn Kittisut, Kyle Peters, Carlo N. De Cecco, Ali Adibi
摘要:为了在类别不平衡情况下减少从胸部X射线(CXR)进行肺部疾病严重程度分类所需的标记数据量,本研究应用了具有贝叶斯神经网络(BNN)近似和加权损失函数的深度主动学习。这项回顾性研究收集了2020年1月至11月期间Emory Healthcare附属医院963名患者(平均年龄59.2±16.6岁;481名女性)的2,319份CXR。所有患者均经临床确诊为COVID-19。每份CXR由3到6名委员会认证的放射科医生独立标记为正常、中度或重度。使用主动学习训练具有Monte Carlo Dropout的深度神经网络以分类疾病严重程度。使用各种采集函数从未标记样本池中迭代选择信息量最大的样本。使用准确度、受试者工作特征曲线下面积(AUROC)和精确率-召回率曲线下面积(AUPRC)评估性能。记录了训练时间和采集时间。统计分析包括描述性指标和跨采集策略的性能比较。熵采样在二分类(正常与患病)中仅使用15.4%的训练数据即实现了93.7%的准确度(AUROC为0.91)。在多分类设置中,Mean STD采样使用23.1%的标记数据实现了70.3%的准确度(AUROC为0.86)。这些方法优于更复杂且计算昂贵的采集函数,并显著减少了标注需求。具有BNN近似和加权损失的深度主动学习在解决类别不平衡的同时有效减少了标记数据需求,保持甚至超过了诊断性能。
摘要:To reduce the amount of required labeled data for lung disease severity classification from chest X-rays (CXRs) under class imbalance, this study applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function. This retrospective study collected 2,319 CXRs from 963 patients (mean age, 59.2 ± 16.6 years; 481 female) at Emory Healthcare affiliated hospitals between January and November 2020. All patients had clinically confirmed COVID-19. Each CXR was independently labeled by 3 to 6 board-certified radiologists as normal, moderate, or severe. A deep neural network with Monte Carlo Dropout was trained using active learning to classify disease severity. Various acquisition functions were used to iteratively select the most informative samples from an unlabeled pool. Performance was evaluated using accuracy, area under the receiver operating characteristic curve (AU ROC), and area under the precision-recall curve (AU PRC). Training time and acquisition time were recorded. Statistical analysis included descriptive metrics and performance comparisons across acquisition strategies. Entropy Sampling achieved 93.7% accuracy (AU ROC, 0.91) in binary classification (normal vs. diseased) using 15.4% of the training data. In the multi-class setting, Mean STD sampling achieved 70.3% accuracy (AU ROC, 0.86) using 23.1% of the labeled data. These methods outperformed more complex and computationally expensive acquisition functions and significantly reduced labeling needs. Deep active learning with BNN approximation and weighted loss effectively reduces labeled data requirements while addressing class imbalance, maintaining or exceeding diagnostic performance.
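摘要中"熵采样"采集函数的核心是:对Monte Carlo Dropout多次前向得到的类别概率取平均,再按预测熵从高到低挑选待标注样本。以下为numpy示意(数据为虚构的玩具概率,并非论文实验):

```python
import numpy as np

def entropy_acquisition(mc_probs, k):
    # mc_probs: (n_mc, n_samples, n_classes), 多次MC Dropout前向的类别概率
    p = mc_probs.mean(axis=0)                      # 平均得到预测分布
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)     # 每个样本的预测熵
    return np.argsort(ent)[::-1][:k]               # 取熵最高的k个样本

# 两次MC前向, 池中三个样本: 0号很自信, 1号接近均匀(最不确定), 2号居中
mc = np.array([
    [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33], [0.70, 0.20, 0.10]],
    [[0.96, 0.02, 0.02], [0.32, 0.35, 0.33], [0.72, 0.18, 0.10]],
])
print(entropy_acquisition(mc, k=3))  # → [1 2 0]
```

被选中的样本交由放射科医生标注后并入训练集,如此迭代即可在少量标注下逼近全量标注的性能;摘要中的Mean STD采样只是把熵换成MC前向间的标准差。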


【111】Reinforcement Learning for Optimizing Large Qubit Array based Quantum Sensor Circuits
标题:优化基于大型量子位阵列的量子传感器电路的强化学习
链接:https://arxiv.org/abs/2508.21253

作者:Ashok Attisara, Sathish Kumar
备注:10 pages, 13 figures, 2 tables
摘要:随着传感器中量子比特数量的增加,设计和控制量子电路的复杂性呈指数级增长,手动优化这些电路变得不可行。优化大规模量子电路中的纠缠分布对于提高量子传感器的灵敏度和效率至关重要[5],[6]。本文提出了一种将强化学习与基于张量网络的仿真相结合的工程方法,用于对多达60个量子比特的量子传感器电路进行可扩展优化。为了实现高效的模拟和可扩展性,我们采用张量网络方法,特别是矩阵乘积态(MPS)表示,而非传统的状态向量或密度矩阵方法。我们的强化学习智能体学习重构电路,以最大化量子Fisher信息(QFI)和纠缠熵,同时减少门数和电路深度。实验结果表明改进是一致的:QFI值接近1,纠缠熵在0.8-1.0范围内,深度和门数最多减少90%。这些结果突出了将量子机器学习与张量网络相结合、在现实约束下优化复杂量子电路的潜力。
摘要:As the number of qubits in a sensor increases, the complexity of designing and controlling the quantum circuits grows exponentially, and manually optimizing these circuits becomes infeasible. Optimizing entanglement distribution in large-scale quantum circuits is critical for enhancing the sensitivity and efficiency of quantum sensors [5], [6]. This paper presents an engineering integration of reinforcement learning with tensor-network-based simulation for the scalable optimization of quantum sensor circuits with up to 60 qubits. To enable efficient simulation and scalability, we adopt tensor network methods, specifically the Matrix Product State (MPS) representation, instead of traditional state vector or density matrix approaches. Our reinforcement learning agent learns to restructure circuits to maximize Quantum Fisher Information (QFI) and entanglement entropy while reducing gate counts and circuit depth. Experimental results show consistent improvements, with QFI values approaching 1, entanglement entropy in the 0.8-1.0 range, and up to 90% reduction in depth and gate count. These results highlight the potential of combining quantum machine learning and tensor networks to optimize complex quantum circuits under realistic constraints.
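As a rough illustration of the kind of objective such an agent might optimize, the sketch below combines QFI, entanglement entropy, and a normalized circuit-cost penalty into a scalar reward. The weights and normalization constants are hypothetical assumptions, not taken from the paper.

```python
def circuit_reward(qfi, ent_entropy, depth, gate_count,
                   max_depth=200, max_gates=500,
                   w_qfi=1.0, w_ent=0.5, w_cost=0.25):
    """Hypothetical RL reward: favor high QFI and entanglement entropy
    (both assumed normalized to [0, 1]) while penalizing a normalized
    circuit cost built from depth and gate count."""
    cost = 0.5 * (depth / max_depth) + 0.5 * (gate_count / max_gates)
    return w_qfi * qfi + w_ent * ent_entropy - w_cost * cost
```

An agent maximizing this reward is pushed toward the regime the paper reports: QFI near 1, high entanglement entropy, and sharply reduced depth and gate count.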


【112】Quantum Machine Learning for Optimizing Entanglement Distribution in Quantum Sensor Circuits
标题:量子机器学习优化量子传感器电路中的纠缠分布
链接:https://arxiv.org/abs/2508.21252

作者:Ashok Attisara, Sathish Kumar
备注:11 pages, 13 figures, 4 tables
摘要:在快速发展的量子计算领域,针对特定任务优化量子电路对于提高性能和效率至关重要。近年来,量子传感已成为量子科学与技术领域中一个独特且快速增长的研究分支,预计将带来新的机遇,特别是在高灵敏度和高精度方面。纠缠是实现高灵敏度和测量精度的关键因素之一[3]。本文提出了一种利用量子机器学习技术优化量子传感器电路中纠缠分布的新方法。通过在量子环境中利用强化学习,我们的目标是优化纠缠布局,以最大化量子Fisher信息(QFI)和纠缠熵这两项量子系统灵敏度与相干性的关键指标,同时最小化电路深度和门数。我们基于Qiskit的实现集成了噪声模型和误差缓解策略,以模拟现实的量子环境。结果表明电路性能和灵敏度得到显著改善:测得的QFI和熵处于0.84-1.0的高水平范围,深度和门数减少了20-86%,突出了机器学习在量子电路优化中的潜力。
摘要:In the rapidly evolving field of quantum computing, optimizing quantum circuits for specific tasks is crucial for enhancing performance and efficiency. More recently, quantum sensing has become a distinct and rapidly growing branch of research within the area of quantum science and technology. The field is expected to provide new opportunities, especially regarding high sensitivity and precision. Entanglement is one of the key factors in achieving high sensitivity and measurement precision [3]. This paper presents a novel approach utilizing quantum machine learning techniques to optimize entanglement distribution in quantum sensor circuits. By leveraging reinforcement learning within a quantum environment, we aim to optimize the entanglement layout to maximize Quantum Fisher Information (QFI) and entanglement entropy, which are key indicators of a quantum system's sensitivity and coherence, while minimizing circuit depth and gate counts. Our implementation, based on Qiskit, integrates noise models and error mitigation strategies to simulate realistic quantum environments. The results demonstrate significant improvements in circuit performance and sensitivity, with measured QFI and entropy in the 0.84-1.0 range and depth and gate-count reductions of 20-86%, highlighting the potential of machine learning in quantum circuit optimization.
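The entanglement entropy reported in both of these quantum-sensing papers can be computed from the Schmidt coefficients of a bipartition (for an MPS, the singular values at the cut bond). A minimal, self-contained sketch:

```python
import numpy as np

def entanglement_entropy(schmidt_values, base=2):
    """Von Neumann entanglement entropy of a bipartition from its
    Schmidt coefficients: S = -sum_i p_i * log(p_i), where
    p_i = lambda_i^2 after normalization."""
    lam = np.asarray(schmidt_values, dtype=float)
    p = lam**2 / np.sum(lam**2)   # squared, normalized coefficients
    p = p[p > 0]                  # drop zero modes: 0 * log 0 := 0
    return float(-(p * np.log(p)).sum() / np.log(base))
```

A Bell state has two equal Schmidt coefficients and yields exactly 1 ebit; a product state yields 0, bracketing the 0.8-1.0 range the papers report.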


【113】Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models
标题:使用SSL模型的逐层特征实现儿童语音的Zero-Shot KWS
链接:https://arxiv.org/abs/2508.21248

作者:tum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
备注:Accepted
摘要:已有许多方法被提出用于增强成人语音中的关键词检出(KWS),但由于儿童语音具有独特的声学和语言特征,它对KWS系统提出了独特的挑战。本文介绍了一种零样本(zero-shot)KWS方法,该方法利用了最先进的自监督学习(SSL)模型,包括Wav2Vec2、HuBERT和Data2Vec。从这些SSL模型中逐层提取特征,并用于训练基于Kaldi的DNN KWS系统。WSJCAM0成人语音数据集用于训练,而PFSTAR儿童语音数据集用于测试,证明了我们方法的zero-shot能力。我们的方法在儿童语音的所有关键词集上都取得了最先进的结果。值得注意的是,Wav2Vec2模型(特别是第22层)表现最好:对于一组30个关键词,ATWV得分为0.691,MTWV得分为0.7003,虚警概率和漏检概率分别为0.0164和0.0547。此外,针对不同年龄段儿童的性能评估证实了该系统的有效性。为了评估系统对噪声的鲁棒性,还使用表现最佳的Wav2Vec2模型的最佳层进行了额外实验。结果表明,与传统的基于MFCC的基线相比有显著改进,凸显了SSL嵌入即使在嘈杂条件下的潜力。为了进一步验证KWS框架的泛化能力,还在额外的CMU数据集上重复了实验。总体而言,结果突出了SSL特征在增强儿童语音zero-shot KWS性能方面的重要贡献,有效解决了与儿童说话人独特特征相关的挑战。
摘要:Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and probabilities of false alarm and miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over a traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated for an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
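The ATWV and MTWV scores quoted above are built on the NIST Term-Weighted Value. A minimal sketch of the underlying formula follows; the customary NIST setting β = 999.9 is assumed here, and the paper does not state how it scales its false-alarm probability, so the sample numbers below are illustrative only.

```python
def term_weighted_value(p_miss, p_fa, beta=999.9):
    """NIST-style Term-Weighted Value at one operating point:
    TWV = 1 - P_miss - beta * P_fa.
    A perfect system scores 1.0; beta = 999.9 is the customary
    NIST keyword-search setting."""
    return 1.0 - p_miss - beta * p_fa

def atwv(p_misses, p_fas, beta=999.9):
    """Actual TWV: the average TWV over a keyword set."""
    pairs = list(zip(p_misses, p_fas))
    return sum(term_weighted_value(m, f, beta) for m, f in pairs) / len(pairs)
```

MTWV is the maximum of the same average over all detection thresholds, which is why it upper-bounds ATWV in the reported results.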


【114】HCQA: Hybrid Classical-Quantum Agent for Generating Optimal Quantum Sensor Circuits
标题:HCQA:用于生成最佳量子传感器电路的混合经典量子代理
链接:https://arxiv.org/abs/2508.21246

作者:mari, Sathish A. P. Kumar
备注:9 pages, 9 figures
摘要:本研究提出了一种混合经典-量子智能体(HCQA),用于设计最优量子传感器电路(QSC),以解决复杂的量子物理问题。HCQA通过利用深度Q网络(DQN)进行学习和策略优化来集成计算智能技术,并通过基于Q值的量子化动作选择机制加以增强。量子电路使用Ry门对智能体的当前状态进行编码,然后创建可能动作的叠加态。对电路的测量产生概率性的动作结果,使智能体能够通过选择在最小化门数的同时最大化量子Fisher信息(QFI)的门序列来生成最优QSC。这种由计算智能驱动的HCQA能够自动生成具有高QFI灵敏度的纠缠量子态(特别是压缩态),用于量子态估计和控制。在由两个量子比特以及一系列Rx、Ry和S门组成的QSC上对HCQA的评估证明了其在生成QFI为1的最优QSC方面的效率。这项工作突出了AI驱动的学习与量子计算之间的协同作用,说明了智能体如何自主发现最优量子电路设计,以增强传感和估计任务。
摘要:This study proposes an HCQA for designing optimal Quantum Sensor Circuits (QSCs) to address complex quantum physics problems. The HCQA integrates computational intelligence techniques by leveraging a Deep Q-Network (DQN) for learning and policy optimization, enhanced by a quantum-based action selection mechanism based on the Q-values. A quantum circuit encodes the agent's current state using Ry gates, and then creates a superposition of possible actions. Measurement of the circuit results in probabilistic action outcomes, allowing the agent to generate optimal QSCs by selecting sequences of gates that maximize the Quantum Fisher Information (QFI) while minimizing the number of gates. This computational intelligence-driven HCQA enables the automated generation of entangled quantum states, specifically the squeezed states, with high QFI sensitivity for quantum state estimation and control. Evaluation of the HCQA on a QSC that consists of two qubits and a sequence of Rx, Ry, and S gates demonstrates its efficiency in generating optimal QSCs with a QFI of 1. This work highlights the synergy between AI-driven learning and quantum computation, illustrating how intelligent agents can autonomously discover optimal quantum circuit designs for enhanced sensing and estimation tasks.
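The Q-value-based quantum action selection can be approximated classically as amplitude encoding followed by a Born-rule measurement. The softmax mapping below is an assumption of this sketch; the abstract only specifies that Ry gates encode the state and that measurement yields probabilistic actions.

```python
import numpy as np

def q_to_amplitudes(q_values, temperature=1.0):
    """Map Q-values to a normalized amplitude vector: a softmax over
    the Q-values gives measurement probabilities, whose square roots
    serve as the amplitudes of the action superposition."""
    q = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(q - q.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.sqrt(probs)

def sample_action(q_values, rng):
    """Born-rule measurement: P(action i) = |amplitude_i|^2."""
    amps = q_to_amplitudes(q_values)
    return int(rng.choice(len(amps), p=amps**2))
```

Higher-valued actions thus receive larger amplitudes and are measured more often, which is the stochastic exploration behavior a DQN-driven circuit builder needs.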


【115】Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?
标题:逐层SSL特征能否提高儿童语音的Zero-Shot ASR性能?
链接:https://arxiv.org/abs/2508.21225

作者:inha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
备注:Accepted
摘要:由于儿童语音具有独特且高度可变的声学和语言特征,自动语音识别(ASR)系统往往难以准确处理儿童语音。虽然自监督学习(SSL)模型的最新进展极大地提升了成人语音的转写效果,但准确转写儿童语音仍然是一个重大挑战。本研究考察了从最先进的SSL预训练模型(特别是Wav2Vec2、HuBERT、Data2Vec和WavLM)中提取的逐层特征在zero-shot场景下提升儿童语音ASR性能的有效性。对从这些模型中提取的特征进行了详细分析,并使用Kaldi工具包将其集成到简化的基于DNN的ASR系统中。该分析确定了在zero-shot场景(使用WSJCAM0成人语音训练、PFSTAR儿童语音测试)中最能增强儿童语音ASR性能的层。实验结果表明,Wav2Vec2模型的第22层实现了5.15%的最低词错误率(WER),相比使用Wav2Vec2直接进行zero-shot解码(WER为10.65%)相对改善了51.64%。此外,按年龄组的分析表明性能随年龄增长而持续改善,即使在较低年龄组中使用SSL特征也观察到了显著收益。在CMU Kids数据集上的进一步实验证实了类似趋势,突出了所提方法的泛化能力。
摘要:Automatic Speech Recognition (ASR) systems often struggle to accurately process children's speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children's speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT, Data2Vec, and WavLM in improving the performance of ASR for children's speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
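The WER figures above are the word-level Levenshtein distance normalized by reference length. A self-contained sketch of the standard computation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via dynamic-programming edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)
```

On this scale, the paper's improvement from 10.65% to 5.15% WER is the reported 51.64% relative reduction.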


【116】Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics
标题:Pep2Prob基准:预测基于MS$^2$的蛋白质组学碎片离子概率
链接:https://arxiv.org/abs/2508.21076

作者:hichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira
备注:Dataset is available at HuggingFace: this https URL
摘要:蛋白质执行几乎所有的细胞功能,并构成大多数药物靶点,使其分析成为理解人类健康和疾病生物学的基础。串联质谱法(Tandem mass spectrometry,MS$^2$)是蛋白质组学中的主要分析技术,它通过电离、裂解多肽,并利用产生的质谱来鉴定和定量生物样品中的蛋白质。在MS$^2$分析中,肽碎片离子概率预测发挥着关键作用,作为强度信息的补充,提高了从质谱中识别肽的准确性。目前的方法依赖于碎片化的全局统计,其假设碎片的概率在所有肽中是均匀的。然而,从生物化学原理的角度来看,这种假设过于简化,限制了准确的预测。为了解决这一差距,我们提出了Pep2Prob,这是第一个为肽特异性碎片离子概率预测而设计的综合数据集和基准。该数据集包含608,780个独特前体(每个前体是一对肽序列和电荷状态)的碎片离子概率统计,总结自超过1.83亿张高质量、高分辨率、经过肽分配验证和碎片注释的HCD MS$^2$光谱。我们使用简单的统计规则和基于学习的方法建立基线性能,并发现利用肽特异性信息的模型显著优于仅使用全局碎片化统计的先前方法。此外,随着模型容量增加,各基准模型的性能表明肽-碎片化关系表现出复杂的非线性,需要复杂的机器学习方法。
摘要:Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment's probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
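The per-precursor fragment ion probabilities that Pep2Prob tabulates can, in the simplest reading, be estimated as empirical frequencies over spectra. The data layout below is a hypothetical simplification for illustration, not the dataset's actual schema:

```python
from collections import defaultdict

def fragment_probabilities(spectra):
    """Empirical fragment ion probabilities per precursor:
    P(fragment | precursor) = (#spectra of that precursor containing
    the fragment) / (#spectra of that precursor).

    spectra: iterable of ((peptide, charge), set_of_observed_fragments),
    a hypothetical simplification of annotated MS2 spectra.
    """
    n_spectra = defaultdict(int)
    n_with_frag = defaultdict(lambda: defaultdict(int))
    for precursor, fragments in spectra:
        n_spectra[precursor] += 1
        for frag in fragments:
            n_with_frag[precursor][frag] += 1
    return {prec: {frag: cnt / n_spectra[prec]
                   for frag, cnt in frags.items()}
            for prec, frags in n_with_frag.items()}
```

Conditioning on the precursor (peptide sequence plus charge state) is exactly what distinguishes this peptide-specific statistic from the global fragmentation statistics the abstract criticizes.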


机器翻译由腾讯交互翻译提供,仅供参考


【声明】内容源于网络