2025-08-14 Research Digest

Contents

2025-08-14 Research Digest

2025-08-13 19:45:18 Wednesday ~ 2025-08-14 19:39:40 Thursday

1. Source Data

1.1 WeChat Official Accounts

1.1.1 量子位

  1. Apple goes all in on AI hardware, and the result is a Xiaodu-style smart display?? Its desktop pet robot won't arrive until 2027…
  2. Pan Jianwei's team breaks the limits of quantum control: with AI's help, 2,024 atoms are precisely rearranged in 60 milliseconds
  3. Accused of distilling DeepSeek and faking results: Europe's answer to OpenAI falls from grace
  4. Tencent Hunyuan open-sources a new game AI generation tool: AAA-grade dynamic content on a single RTX 4090
  5. A billion-user national app smoothly becomes an AI application: the underlying architecture Amap rebuilt with Tongyi is revealed
  6. Musk's xAI co-founder suddenly resigns
  7. The more reliable the AI, the more robotic it sounds; Oxford finds high-EQ models make significantly more errors
  8. AI image watermarks breached: an open-source tool wipes every watermark within 5 minutes
  9. Mixing math, programming, and logic data boosts AI's multi-domain reinforcement learning in one go | Shanghai AI Lab
  10. We all misjudged GPT-5: routing unifies compute, and even free users can generate revenue
  11. Pushing back against AI-written papers: arXiv rejects 2% of fraudulent content each year, with automated tools joining the review

1.1.2 机器之心

  1. Grind LeetCode for 100 hours and learn to ask for referrals: an OpenAI employee shares how to land an offer
  2. An xAI founding member leaves to do venture capital: Babuschkin's long post looks back on his startup comradeship with Musk
  3. Is chain-of-thought an illusion? Re-examining large-model reasoning from a data-distribution perspective; Musk replies and Grok gets rattled
  4. Just now: the agent model that best handles image-and-text research went live, and after trying it I uninstalled my browser
  5. A thousand teams compete: the first "Qizhi Cup" algorithm contest wraps up, pushing AI applications toward deployment
  6. Verbose responses cut by 80%: DeepSeek's GRPO gets a game-changing upgrade as Microsoft unveils GFPO
  7. ICCV 2025 | HERMES: the first world model to unify 3D scene understanding and generation
  8. HKU joins Moonshot AI and others to open-source OpenCUA: anyone can build their own computer-use agent
  9. Cracking RL training for long-horizon agents: Tencent's RLVMR framework lets a 7B model "think" on par with GPT-4o

1.1.3 新智元

  1. The American CS job dream shatters: 5,000 applications and zero offers; a post-2000 graduate of a top school even gets turned down by McDonald's
  2. Meta's Chinese-born prodigy Shuchao Bi "defects" with a prediction: the interaction revolution OpenAI left unfinished hides a trillion-scale market
  3. Google's super coding agent officially reports for duty: a $125 premium tier runs up to 300 tasks a day
  4. GPT-4o steps back in for the preachy GPT-5: Altman backpedals at light speed as OpenAI rolls back the "cyber bootlicker" overnight
  5. Altman's surprise about-face: is AGI useless? MIT forecasts a 50% chance it arrives by 2028
  6. Will ChatGPT make your brain atrophy? An OpenAI executive used it to help his daughter with dyslexia
  7. Musk loses a key xAI lieutenant: the creator of Grok 4 suddenly departs, and a long post reveals the startup's most intense inside story

1.1.4 AGI Hunt

  1. qwen3-VL is here
  2. Just now: Musk's xAI co-founder departs
  3. The former head of DeepMind AlphaProof on the 2025 IMO

1.1.5 Others

  1. This month's most-starred open-source domain-specific large models

1.2 Arxiv

1.2.1 Computation and Language

From:https://arxiv.org/list/cs.CL/recent

2025-08-14 | Total: 60

#1 Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks 基于神经匪徒算法的流水线任务最优 LLM 选择

Authors: [Baran Atalar](https://arxiv.org/search/?searchtype=author&query=Baran Atalar), [Eddie Zhang](https://arxiv.org/search/?searchtype=author&query=Eddie Zhang), [Carlee Joe-Wong](https://arxiv.org/search/?searchtype=author&query=Carlee Joe-Wong)

With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM “assistants” specialized to particular types of queries. However, some tasks (i.e., queries) may be too specialized and difficult for a single LLM to handle alone. These applications often benefit from breaking down the task into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, in extracting a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select another, possibly different, LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires that we select a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask’s output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned and accounted for during selection. We propose a neural contextual bandit-based algorithm that trains neural networks that model LLM success on each subtask in an online manner, thus learning to guide the LLM selections for the different subtasks, even in the absence of historical LLM performance data. Experiments on telecommunications question answering and medical diagnosis prediction datasets illustrate the effectiveness of our proposed approach compared to other LLM selection algorithms. 随着大型语言模型(LLMs)在各种任务中的日益普及,人们越来越关注能够以低成本预测在一组 LLMs 中哪些模型会产生成功答案的策略。随着微软等提供商允许用户轻松创建专门针对特定类型查询的自定义 LLM“助手”,这一问题有望变得愈加重要。然而,某些任务(即查询)可能过于专业化,单个 LLM 难以独立完成。这类应用通常受益于将任务拆分为更小的子任务,然后由预计在该特定子任务上表现良好的 LLM 来分别执行。例如,在从病历中提取诊断时,可以先选择一个 LLM 来总结病历,再选择另一个来验证摘要,随后再选择另一个(可能不同的)LLM 从摘要中提取诊断。与现有的 LLM 选择或路由算法不同,这种设置要求我们选择一系列 LLM,每个 LLM 的输出传递给下一个并可能影响其成功与否。 因此,与单一 LLM 选择不同,每个子任务输出的质量会直接影响后续 LLM 的输入,从而影响后续 LLM 的成本和成功率,形成在选择过程中必须学习并考虑的复杂性能依赖关系。我们提出了一种基于神经上下文臂算法的方法,该方法在线训练神经网络以建模每个子任务上 LLM 的成功,从而学会为不同子任务指导 LLM 的选择,即使在缺乏历史 LLM 性能数据的情况下也能实现。电信问答和医疗诊断预测数据集上的实验表明,与其他 LLM 选择算法相比,我们的方法更为有效。
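
A minimal sketch of the idea (not the authors' implementation): an epsilon-greedy neural contextual bandit that keeps one small success-prediction network per pipeline subtask and updates it online from observed rewards. The stage names, candidate LLM pool, feature dimension, and the `run_llm` callback are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

STAGES = ["summarize", "validate", "extract"]   # pipeline subtasks (illustrative)
CANDIDATES = ["llm-a", "llm-b", "llm-c"]        # hypothetical pool of LLMs
FEAT_DIM = 8                                    # size of the query/context features

class SuccessNet(nn.Module):
    """Tiny MLP predicting P(success) for a (context, candidate) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + len(CANDIDATES), 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x).squeeze(-1)

nets = {s: SuccessNet() for s in STAGES}
opts = {s: torch.optim.Adam(nets[s].parameters(), lr=1e-2) for s in STAGES}

def one_hot(i, n):
    v = torch.zeros(n)
    v[i] = 1.0
    return v

def run_pipeline(context, run_llm, epsilon=0.1):
    """context: FEAT_DIM tensor. run_llm(stage, llm_name, context) -> (new_context, reward in [0, 1])."""
    for stage in STAGES:
        with torch.no_grad():
            scores = torch.stack([nets[stage](torch.cat([context, one_hot(i, len(CANDIDATES))]))
                                  for i in range(len(CANDIDATES))])
        idx = random.randrange(len(CANDIDATES)) if random.random() < epsilon else int(scores.argmax())
        x = torch.cat([context, one_hot(idx, len(CANDIDATES))])   # features used at selection time
        context, reward = run_llm(stage, CANDIDATES[idx], context)
        # Online update: push the predicted success probability toward the observed reward.
        loss = nn.functional.binary_cross_entropy(nets[stage](x), torch.tensor(float(reward)))
        opts[stage].zero_grad(); loss.backward(); opts[stage].step()
    return context
```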

Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习

Publish: 2025-08-13 17:19:41 UTC 发布:2025-08-13 17:19:41 UTC

#2 Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT) 哪个表现更好?Wav2Vec 还是 Whisper?在 Badini Kurdish Speech to Text (BKSTT) 中同时应用两者

Authors: [Renas Adnan](https://arxiv.org/search/?searchtype=author&query=Renas Adnan), [Hossein Hassani](https://arxiv.org/search/?searchtype=author&query=Hossein Hassani)

Speech-to-text (STT) systems have a wide range of applications. They are available in many languages, albeit at different quality levels. Although Kurdish is considered a less-resourced language from a processing perspective, STT is available for some of the Kurdish dialects, for instance, Sorani (Central Kurdish). However, the same does not apply to other Kurdish dialects, such as Badini and Hawrami. This research is an attempt to address this gap. Badini has approximately two million speakers, and STT systems can help their community use mobile and computer-based technologies while giving their dialect more global visibility. We aim to create a language model based on Badini’s speech and evaluate its performance. To cover a conversational aspect, have a proper confidence level of grammatical accuracy, and ready transcriptions, we chose Badini kids’ stories, eight books including 78 stories, as the textual input. Six narrators narrated the books, which resulted in approximately 17 hours of recording. We cleaned, segmented, and tokenized the input. The preprocessing produced nearly 15 hours of speech, including 19193 segments and 25221 words. We used Wav2Vec2-Large-XLSR-53 and Whisper-small to develop the language models. The experiments indicate that the transcription process based on the Wav2Vec2-Large-XLSR-53 model provides a significantly more accurate and readable output than the Whisper-small model, with 90.38% and 65.45% readability, and 82.67% and 53.17% accuracy, respectively. 语音转文字(STT)系统有着广泛的应用。它们支持多种语言,但质量各有差异。尽管从处理角度看库尔德语被视为资源较少的语言,某些库尔德语方言已有语音到文本的支持,例如索拉尼语(中库尔德语)。然而,这并未覆盖其他库尔德方言,如巴迪尼语和豪拉米语。本研究尝试弥补这一空白。巴迪尼语大致有两百万母语使用者,STT 系统可以帮助他们的社区使用移动和电脑技术,同时提升该方言的全球可见度。我们的目标是基于巴迪尼语语音创建语言模型并评估其性能。为涵盖会话方面、确保语法准确度的适当置信水平并获得现成的转录文本,我们选择了巴迪尼儿童故事作为文本输入,共八本书包含 78 个故事。六位讲述者录制了这些书籍,录音总时长约为 17 小时。我们对输入进行了清洗、分段和分词。预处理产生了近 15 小时的语音数据,包括 19193 个片段和 25221 个词。 我们使用了 Wav2Vec2-Large-XLSR-53 和 Whisper-small 来开发语言模型。实验表明,基于 Wav2Vec2-Large-XLSR-53 模型的转录过程比 Whisper-small 模型提供了显著更准确且更易读的输出,可读性分别为 90.38% 和 65.45%,准确率分别为 82.67% 和 53.17%。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 17:19:22 UTC 发布:2025-08-13 17:19:22 UTC

#3 Performance of GPT-5 Frontier Models in Ophthalmology Question Answering GPT-5 Frontier 模型在眼科问答中的表现

Authors: [Fares Antaki](https://arxiv.org/search/?searchtype=author&query=Fares Antaki), [David Mikhail](https://arxiv.org/search/?searchtype=author&query=David Mikhail), [Daniel Milad](https://arxiv.org/search/?searchtype=author&query=Daniel Milad), [Danny A Mammo](https://arxiv.org/search/?searchtype=author&query=Danny A Mammo), [Sumit Sharma](https://arxiv.org/search/?searchtype=author&query=Sumit Sharma), [Sunil K Srivastava](https://arxiv.org/search/?searchtype=author&query=Sunil K Srivastava), [Bing Yu Chen](https://arxiv.org/search/?searchtype=author&query=Bing Yu Chen), [Samir Touma](https://arxiv.org/search/?searchtype=author&query=Samir Touma), [Mertcan Sevgi](https://arxiv.org/search/?searchtype=author&query=Mertcan Sevgi), [Jonathan El-Khoury](https://arxiv.org/search/?searchtype=author&query=Jonathan El-Khoury), [Pearse A Keane](https://arxiv.org/search/?searchtype=author&query=Pearse A Keane), [Qingyu Chen](https://arxiv.org/search/?searchtype=author&query=Qingyu Chen), [Yih Chung Tham](https://arxiv.org/search/?searchtype=author&query=Yih Chung Tham), [Renaud Duval](https://arxiv.org/search/?searchtype=author&query=Renaud Duval)

Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI’s GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology. 大型语言模型(LLMs),例如 GPT-5,整合了可能提升在复杂医学问答任务上表现的高级推理能力。对于这一代最新的推理模型,能够同时最大化准确性和成本效益的配置尚未确定。我们评估了 OpenAI 的 GPT-5 系列的 12 种配置(三个模型等级在四种推理强度设置下),并与 o1-high、o3-high 和 GPT-4o 进行比较,使用来自美国眼科学会基础临床科学课程(BCSC)数据集的 260 道闭卷选择题。主要结局是选择题准确率;次要结局包括通过 Bradley-Terry 模型进行的一对一排名、使用以参考为锚点的成对 LLM 作为裁判框架对推理质量的评估,以及使用基于标记的成本估算对准确性-成本权衡的分析。 GPT-5-high 达到最高准确率(0.965;95% 置信区间,0.942–0.985),优于所有 GPT-5-nano 变体(P < .001)、o1-high(P = .04)和 GPT-4o(P < .001),但不优于 o3-high(0.958;95% 置信区间,0.931–0.981)。GPT-5-high 在准确率(比 o3-high 高 1.66 倍)和推理质量(比 o3-high 高 1.11 倍)两项指标中均排名第一。成本-准确性分析在帕累托前沿上识别出若干 GPT-5 配置,其中 GPT-5-mini-low 在低成本、高性能之间提供了最有利的平衡。这些结果在高质量眼科学数据集上对 GPT-5 进行了基准测试,展示了推理投入对准确率的影响,并引入了一个用于可扩展评估 LLM 生成答案与眼科参考标准一致性的自动评分框架。
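
For the head-to-head comparison, a Bradley-Terry model can be fit from pairwise win counts with the classic minorization-maximization (Zermelo) iteration. The sketch below is a generic illustration of that fit, not the paper's evaluation code, and the win matrix is made up.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of head-to-head comparisons model i won against model j.
    Returns strength scores normalized to sum to 1 (higher = stronger)."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T                      # total comparisons per pair
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()
    return p

# Hypothetical win counts among three configurations (illustrative only).
models = ["config-a", "config-b", "config-c"]
wins = np.array([[0, 60, 80],
                 [40, 0, 70],
                 [20, 30, 0]], dtype=float)
for name, s in zip(models, bradley_terry(wins)):
    print(f"{name}: strength {s:.3f}")
```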

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 17:17:17 UTC 发布:2025-08-13 17:17:17 UTC

#4 Shaping Event Backstories to Estimate Potential Emotion Contexts 塑造事件背景故事以估算潜在情绪情境

Authors: [Johannes Schäfer](https://arxiv.org/search/?searchtype=author&query=Johannes Schäfer), [Roman Klinger](https://arxiv.org/search/?searchtype=author&query=Roman Klinger)

Emotion analysis is an inherently ambiguous task. Previous work studied annotator properties to explain disagreement, but this overlooks the possibility that ambiguity may stem from missing information about the context of events. In this paper, we propose a novel approach that adds reasonable contexts to event descriptions, which may better explain a particular situation. Our goal is to understand whether these enriched contexts enable human annotators to annotate emotions more reliably. We disambiguate a target event description by automatically generating multiple event chains conditioned on differing emotions. By combining techniques from short story generation in various settings, we achieve coherent narratives that result in a specialized dataset for the first comprehensive and systematic examination of contextualized emotion analysis. Through automatic and human evaluation, we find that contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations. 情感分析本质上是一项含糊的任务。以往研究考察了标注者的属性以解释分歧,但这忽略了模糊性可能源自于关于事件上下文信息的缺失。本文提出一种新方法,为事件描述补充合理的情境,这些情境可能更好地解释特定情形。我们的目标是理解这些丰富的情境是否能使人工标注者更可靠地标注情感。我们通过自动生成多条基于不同情感条件的事件链来消解目标事件描述的歧义。通过结合各种情境下短篇故事生成的技术,我们实现了连贯的叙事,进而构建了一个用于首次全面系统考察情境化情感分析的专门数据集。通过自动和人工评估,我们发现情境化叙事增强了对特定情感的解读,并帮助标注者产生更一致的标注。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 17:15:52 UTC 发布:2025-08-13 17:15:52 UTC

#5 Specialised or Generic? Tokenization Choices for Radiology Language Models 专用还是通用?放射学语言模型的标记化选择

Authors: [Hermione Warr](https://arxiv.org/search/?searchtype=author&query=Hermione Warr), [Wentian Xu](https://arxiv.org/search/?searchtype=author&query=Wentian Xu), [Harry Anthony](https://arxiv.org/search/?searchtype=author&query=Harry Anthony), [Yasin Ibrahim](https://arxiv.org/search/?searchtype=author&query=Yasin Ibrahim), [Daniel McGowan](https://arxiv.org/search/?searchtype=author&query=Daniel McGowan), [Konstantinos Kamnitsas](https://arxiv.org/search/?searchtype=author&query=Konstantinos Kamnitsas)

The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings. 语言模型(LM)使用的词表——由分词器定义——在文本生成质量中起着关键作用。然而,其影响在放射学领域仍未被充分探讨。在本工作中,我们通过在三种影像模态上对放射学报告摘要任务系统地比较通用、医学和领域特定的分词器来弥补这一空白。我们还研究了在有无在 PubMed 摘要上进行 LM 预训练的情景。我们的研究结果表明,在从零开始训练模型时,医学和领域特定的词表优于广泛使用的自然语言替代方案。预训练在一定程度上缓解了不同分词器之间的性能差异,而领域特定分词器取得了最有利的结果。领域特定分词器由于词表更小和序列更短,还降低了内存需求。这些结果表明,将 LM 的词表适配到临床领域带来了实际收益,包括性能提升和计算需求减少,使此类模型在研究和现实医疗场景中更易获得且更有效。
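
A minimal sketch of training a small domain-specific BPE tokenizer with the Hugging Face `tokenizers` library, assuming `reports` is an in-memory list of report strings; the toy corpus, vocabulary size, and special tokens are illustrative, not the paper's setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy in-domain corpus standing in for radiology reports.
reports = [
    "No focal consolidation, pleural effusion or pneumothorax.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,                                   # much smaller than general-domain vocabularies
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train_from_iterator(reports, trainer=trainer)

enc = tokenizer.encode("Small right pleural effusion is again noted.")
print(enc.tokens)   # shorter in-domain sequences also reduce memory requirements
```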

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习

Publish: 2025-08-13 17:13:56 UTC 发布:2025-08-13 17:13:56 UTC

#6 VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models VisCodex:通过融合视觉与编码模型实现统一的多模态代码生成

Authors: [Lingjie Jiang](https://arxiv.org/search/?searchtype=author&query=Lingjie Jiang), [Shaohan Huang](https://arxiv.org/search/?searchtype=author&query=Shaohan Huang), [Xun Wu](https://arxiv.org/search/?searchtype=author&query=Xun Wu), [Yixia Li](https://arxiv.org/search/?searchtype=author&query=Yixia Li), [Dongdong Zhang](https://arxiv.org/search/?searchtype=author&query=Dongdong Zhang), [Furu Wei](https://arxiv.org/search/?searchtype=author&query=Furu Wei)

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets. 多模态大型语言模型(MLLMs)在视觉与文本理解的融合方面取得了显著进展。然而,它们从多模态输入生成代码的能力仍然有限。在本工作中,我们提出了 VisCodex,一个统一的框架,通过无缝合并视觉与编码语言模型,赋予 MLLMs 强大的多模态代码生成能力。利用基于任务向量的模型合并技术,我们将最先进的编码 LLM 集成到强大的视觉-语言骨干中,同时保留了视觉理解能力和高级编码技能。为支持训练与评估,我们引入了多模态编码数据集(MCD),这是一个规模庞大且多样化的集合,包含 59.8 万个样本,包括高质量的 HTML 代码、图表图像-代码对、带图像的 StackOverflow 问答以及算法题。此外,我们提出了 InfiBench-V,这是一项新颖且富有挑战性的基准,专门用于评估模型在视觉丰富的真实编程问题上的表现,这类问题需要对文本和视觉语境都有细致的理解。 大量实验表明,VisCodex 在开源 MLLMs 中取得了最先进的性能,并接近像 GPT-4o 这样的专有模型,这凸显了我们模型合并策略和新数据集的有效性。
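
The task-vector merging idea can be sketched as plain state-dict arithmetic: take the delta between a coding-specialized LLM and its base, then add a scaled copy of that delta to the matching language-model weights inside the vision-language backbone. This is a generic illustration of task arithmetic, not VisCodex's actual recipe; the parameter-name alignment and scaling factor are assumptions.

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    """Delta between a specialized model and the base it was tuned from."""
    return {k: finetuned[k] - base[k] for k in finetuned if k in base}

def merge(vlm_state: dict, delta: dict, alpha: float = 0.5) -> dict:
    """Add the scaled task vector to parameters that exist in both models."""
    merged = dict(vlm_state)
    for k, v in delta.items():
        if k in merged and merged[k].shape == v.shape:
            merged[k] = merged[k] + alpha * v
    return merged

# Toy example with random "weights" sharing one parameter name.
base = {"layer.weight": torch.zeros(4, 4)}
coder = {"layer.weight": torch.randn(4, 4)}            # stand-in for a coding LLM
vlm = {"layer.weight": torch.randn(4, 4), "vision.proj": torch.randn(4, 8)}

merged = merge(vlm, task_vector(coder, base), alpha=0.5)
print(merged["layer.weight"].shape, merged["vision.proj"].shape)
```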

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-13 17:00:44 UTC 发布:2025-08-13 17:00:44 UTC

#7 A Comprehensive Evaluation framework of Alignment Techniques for LLMs LLMs 对齐技术的综合评估框架

As large language models (LLMs) are increasingly integrated into real-world applications, ensuring that their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches, including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of a unified evaluation framework makes it difficult to systematically compare these paradigms and to guide deployment decisions. This paper presents a multi-dimensional evaluation of alignment techniques for LLMs: a comprehensive framework that enables systematic comparison across all major alignment paradigms. The framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments on different base models and alignment strategies, we demonstrate the framework's utility in identifying the strengths and weaknesses of current state-of-the-art models, providing valuable insights for future research directions. 随着大型语言模型(LLMs)越来越多地融入现实应用,确保其输出与人类价值观和安全标准相符已变得至关重要。该领域已经发展出多种对齐方法,包括传统的微调方法(RLHF、指令微调)、事后修正系统和推理时干预,每种方法都有其独特的优点和局限。然而,缺乏统一的评估框架使得系统比较这些范式并指导部署决策变得困难。本文提出了对 LLMs 对齐技术的多维评估——一个全面的评估框架,能够对所有主要对齐范式进行系统比较。我们的框架沿四个关键维度评估方法:对齐检测、对齐质量、计算效率和鲁棒性。通过对不同基础模型和对齐策略的实验证明,我们展示了该框架在识别当前最先进模型的优劣方面的实用性,为未来的研究方向提供了有价值的见解。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习

Publish: 2025-08-13 16:42:01 UTC 发布:2025-08-13 16:42:01 UTC

#8 Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach 商业沟通中的说服与误导语言:一种文本检测方法

Authors: [Sayem Hossen](https://arxiv.org/search/?searchtype=author&query=Sayem Hossen), [Monalisa Moon Joti](https://arxiv.org/search/?searchtype=author&query=Monalisa Moon Joti), [Md. Golam Rashed](https://arxiv.org/search/?searchtype=author&query=Md. Golam Rashed)

Business communication digitisation has reorganised the process of persuasive discourse, which allows not only greater transparency but also advanced deception. This inquiry synthesises classical rhetoric and communication psychology with linguistic theory and empirical studies in the financial reporting, sustainability discourse, and digital marketing to explain how deceptive language can be systematically detected using persuasive lexicon. In controlled settings, detection accuracies of greater than 99% were achieved by using computational textual analysis as well as personalised transformer models. However, reproducing this performance in multilingual settings is also problematic and, to a large extent, this is because it is not easy to find sufficient data, and because few multilingual text-processing infrastructures are in place. This evidence shows that there has been an increasing gap between the theoretical representations of communication and those empirically approximated, and therefore, there is a need to have strong automatic text-identification systems where AI-based discourse is becoming more realistic in communicating with humans. 商业沟通的数字化重组了劝导性话语的过程,这不仅带来了更大的透明度,也带来了更先进的欺骗手段。本研究将古典修辞学与传播心理学结合语言学理论,并整合金融报告、可持续性话语和数字营销领域的实证研究,解释如何利用劝导性词汇系统性地检测欺骗性语言。在受控环境中,使用计算文本分析以及个性化的变换器模型,检测准确率超过 99%。然而,在多语言环境中重现此类表现也存在问题,很大程度上是因为难以找到足够的数据,且很少有多语言文本处理基础设施可用。这些证据表明,传播的理论表征与实证逼近之间的差距在不断扩大,因此,在以人工智能为基础的话语与人类交流愈发真实的情况下,迫切需要强大的自动文本识别系统。

Subjects: Computation and Language, Computational Finance, General Finance 主题:计算与语言、计算金融、一般金融

Publish: 2025-08-13 16:38:31 UTC 发布:2025-08-13 16:38:31 UTC

#9 A Survey of Cognitive Distortion Detection and Classification in NLP 认知扭曲检测与分类在自然语言处理中的综述

Authors: [Archie Sage](https://arxiv.org/search/?searchtype=author&query=Archie Sage), [Jeroen Keppens](https://arxiv.org/search/?searchtype=author&query=Jeroen Keppens), [Helen Yannakoudakis](https://arxiv.org/search/?searchtype=author&query=Helen Yannakoudakis)

As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of datasets, modelling approaches, and evaluation strategies. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area. 随着将自然语言处理(NLP)技术应用于心理健康领域的兴趣日益增长,越来越多的研究探索认知扭曲(CDs)的自动检测与分类。认知扭曲是指一种习惯性的、带有负面偏见或有缺陷的思维模式,它扭曲了人们对事件的感知、对自我的评判以及对周围世界的反应。识别并处理这些扭曲是治疗中的重要环节。尽管该领域正不断发展,但仍然零散分裂,在认知扭曲的分类法、任务表述和评估实践上存在不一致。本综述回顾了跨越二十年的 38 项研究,提供了关于数据集、建模方法和评估策略的结构化概览。我们提供了汇总后的认知扭曲分类法参考,概述了常见的任务设置,并指出了若干开放性挑战,以支持这一新兴领域更连贯且可重复的研究。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 15:21:17 UTC 发布:2025-08-13 15:21:17 UTC

#10 Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models Memory Decoder:一种用于大型语言模型的预训练即插即用记忆

Authors: [Jiaqi Cao](https://arxiv.org/search/?searchtype=author&query=Jiaqi Cao), [Jiarui Wang](https://arxiv.org/search/?searchtype=author&query=Jiarui Wang), [Rubin Wei](https://arxiv.org/search/?searchtype=author&query=Rubin Wei), [Qipeng Guo](https://arxiv.org/search/?searchtype=author&query=Qipeng Guo), [Kai Chen](https://arxiv.org/search/?searchtype=author&query=Kai Chen), [Bowen Zhou](https://arxiv.org/search/?searchtype=author&query=Bowen Zhou), [Zhouhan Lin](https://arxiv.org/search/?searchtype=author&query=Zhouhan Lin)

Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current method like Domain Adaptive Pretraining (DAPT) requires costly full-parameter training and suffers from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain. 大型语言模型(LLMs)在通用语言任务上表现出强大的能力,但将它们适配到特定领域仍然是一个挑战。当前的方法如领域自适应预训练(DAPT)需要昂贵的全参数训练,并且容易出现灾难性遗忘。与此同时,检索增强生成(RAG)由于昂贵的最近邻搜索和更长的上下文,导致推理延迟显著增加。本文提出了记忆解码器(Memory Decoder),一种免插拔的预训练记忆模块,能够在不改变原始模型参数的情况下实现高效的领域适配。记忆解码器采用一个小型的 Transformer 解码器来学习模仿外部非参数检索器的行为。训练完成后,记忆解码器可以无缝集成到任何使用相同分词器的预训练语言模型中,无需针对特定模型进行修改。实验结果表明,记忆解码器使得多种 Qwen 和 Llama 模型能够有效适配三种不同的专业领域:生物医学、金融与法律,平均将困惑度降低了 6.17 点。 总体而言,Memory Decoder 提出了一种新颖范式,核心是一个为特定领域适配专门预训练的记忆组件。该记忆架构可以以即插即用的方式集成,在目标领域内对多种模型持续提升性能。
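
At inference the abstract suggests a kNN-LM-style combination: the frozen base LM and the small pretrained memory decoder each produce a next-token distribution over the shared vocabulary, and the two are mixed. The sketch below assumes that interpolation mechanism; the mixing weight and tensor shapes are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_next_token_probs(base_logits: torch.Tensor,
                              memory_logits: torch.Tensor,
                              lam: float = 0.3) -> torch.Tensor:
    """Mix the base LM's next-token distribution with the memory decoder's.
    Both logit tensors cover the same vocabulary (same tokenizer); lam is the
    weight given to the domain memory (assumed interpolation scheme)."""
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return (1.0 - lam) * p_base + lam * p_mem

# Toy vocabulary of size 10.
base_logits = torch.randn(10)
memory_logits = torch.randn(10)
probs = combined_next_token_probs(base_logits, memory_logits, lam=0.3)
print(probs.sum())   # still a valid distribution (sums to 1)
```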

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 15:16:29 UTC 发布:2025-08-13 15:16:29 UTC

#11 Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription 评估轻量级 Whisper 模型在资源匮乏的乌尔都语转录中的可行性

Authors: [Abdul Rehman Antall](https://arxiv.org/search/?searchtype=author&query=Abdul Rehman Antall), [Naveed Akhtar](https://arxiv.org/search/?searchtype=author&query=Naveed Akhtar)

This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Whisper-Small achieves the lowest error rates (33.68% WER), outperforming Tiny (67.08% WER) and Base (53.67% WER). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems. 本研究评估了在低资源环境下轻量级 Whisper 模型(Tiny、Base、Small)用于乌尔都语语音识别的可行性。尽管乌尔都语是全球第十大使用语言,拥有超过 2.3 亿讲者,但由于方言多样性、代码混用和训练数据稀缺,其在自动语音识别(ASR)系统中的代表性仍然有限。我们在经过策划的乌尔都语数据集上对这些模型进行了基于字错误率(WER)的基准测试,未进行微调。结果显示 Whisper-Small 达到最低的错误率(33.68% WER),优于 Tiny(67.08% WER)和 Base(53.67% WER)。定性分析揭示了音系准确性和词汇连贯性方面持续存在的挑战,尤其是在复杂话语中。虽然 Whisper-Small 在可部署的乌尔都语 ASR 方面展现出前景,但仍存在显著差距。我们的发现为未来针对低资源 ASR 系统的有效研究奠定了基础。
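
Word error rate as used for this benchmark is a word-level edit distance normalized by reference length; a small self-contained sketch (the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```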

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 15:01:59 UTC

#12 PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts 前言:一个旨在要求对长上下文进行全局理解和推理的基准测试

Authors: [Mo Yu](https://arxiv.org/search/?searchtype=author&query=Mo Yu), [Tsz Ting Chung](https://arxiv.org/search/?searchtype=author&query=Tsz Ting Chung), [Chulun Zhou](https://arxiv.org/search/?searchtype=author&query=Chulun Zhou), [Tong Li](https://arxiv.org/search/?searchtype=author&query=Tong Li), [Rui Lu](https://arxiv.org/search/?searchtype=author&query=Rui Lu), [Jiangnan Li](https://arxiv.org/search/?searchtype=author&query=Jiangnan Li), [Liyan Xu](https://arxiv.org/search/?searchtype=author&query=Liyan Xu), [Haoshu Lu](https://arxiv.org/search/?searchtype=author&query=Haoshu Lu), [Ning Zhang](https://arxiv.org/search/?searchtype=author&query=Ning Zhang), [Jing Li](https://arxiv.org/search/?searchtype=author&query=Jing Li), [Jie Zhou](https://arxiv.org/search/?searchtype=author&query=Jie Zhou)

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks – as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning. 我们引入了 PRELUDE,一个通过判断角色的前传故事是否与原著书的正规叙事相一致来评估长上下文理解的基准。我们的任务对全局理解和深入推理提出了比现有基准更高的要求——由于前传并非原始故事的一部分,评估其合理性通常需要检索并整合仅与之间接相关的信息。实证上,88% 的实例需要来自叙事多个部分的证据。实验结果凸显了我们任务的挑战性:使用 in-context learning、RAG 以及在域内训练的最先进 LLMs 和商业 DeepResearch 服务,其表现比人类落后超过 15%。进一步的人类研究表明,模型常常在推理有缺陷的情况下给出正确答案,导致推理准确性与人类相比存在超过 30% 的差距。这些发现强调了在长上下文理解与推理方面仍有大量改进空间。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 14:28:25 UTC 发布:2025-08-13 14:28:25 UTC

#13 Speed Always Wins: A Survey on Efficient Architectures for Large Language Models 速度永远取胜:大规模语言模型高效架构综述

Authors: [Weigao Sun](https://arxiv.org/search/?searchtype=author&query=Weigao Sun), [Jiaxi Hu](https://arxiv.org/search/?searchtype=author&query=Jiaxi Hu), [Yucheng Zhou](https://arxiv.org/search/?searchtype=author&query=Yucheng Zhou), [Jusen Du](https://arxiv.org/search/?searchtype=author&query=Jusen Du), [Disen Lan](https://arxiv.org/search/?searchtype=author&query=Disen Lan), [Kexin Wang](https://arxiv.org/search/?searchtype=author&query=Kexin Wang), [Tong Zhu](https://arxiv.org/search/?searchtype=author&query=Tong Zhu), [Xiaoye Qu](https://arxiv.org/search/?searchtype=author&query=Xiaoye Qu), [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Xiaoyu Mo](https://arxiv.org/search/?searchtype=author&query=Xiaoyu Mo), [Daizong Liu](https://arxiv.org/search/?searchtype=author&query=Daizong Liu), [Yuxuan Liang](https://arxiv.org/search/?searchtype=author&query=Yuxuan Liang), [Wenliang Chen](https://arxiv.org/search/?searchtype=author&query=Wenliang Chen), [Guoqi Li](https://arxiv.org/search/?searchtype=author&query=Guoqi Li), [Yu Cheng](https://arxiv.org/search/?searchtype=author&query=Yu Cheng)

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, reasoning, and pushes the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost the efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above category, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems. 大型语言模型(LLMs)在语言理解、生成、推理方面取得了令人瞩目的成果,并推动了多模态模型能力的边界。作为现代 LLMs 的基础,Transformer 模型由于优异的可扩展性提供了强有力的基线。然而,传统的 Transformer 架构需要大量计算资源,这对大规模训练和实际部署构成了重大障碍。在这篇综述中,我们系统地考察了旨在解决 Transformer 固有局限并提升效率的创新 LLM 架构。从语言建模出发,本文涵盖了线性和稀疏序列建模方法的背景与技术细节、高效的全注意力变体、稀疏专家混合(sparse mixture-of-experts)、结合上述技术的混合模型架构,以及新兴的扩散式 LLMs。此外,我们还讨论了这些技术在其他模态上的应用,并考虑了它们对开发可扩展、资源感知基础模型的更广泛影响。 通过将近来的研究归入上述类别,本综述呈现了现代高效 LLM 架构的蓝图,我们希望这能激励未来针对更高效、多功能 AI 系统的研究。

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-13 14:13:46 UTC 发布:2025-08-13 14:13:46 UTC

#14 A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems 一份关于临床心理健康 AI 系统数据集的全面综述

Authors: [Aishik Mandal](https://arxiv.org/search/?searchtype=author&query=Aishik Mandal), [Prottay Kumar Adhikary](https://arxiv.org/search/?searchtype=author&query=Prottay Kumar Adhikary), [Hiba Arnaout](https://arxiv.org/search/?searchtype=author&query=Hiba Arnaout), [Iryna Gurevych](https://arxiv.org/search/?searchtype=author&query=Iryna Gurevych), [Tanmoy Chakraborty](https://arxiv.org/search/?searchtype=author&query=Tanmoy Chakraborty)

Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems. 心理健康障碍在全球范围内日益增加。然而,受过培训的临床医生数量并没有相应增加,导致许多人无法获得足够或及时的支持。为弥补这一差距,近期研究表明人工智能(AI)在辅助心理健康诊断、监测和干预方面具有前景。然而,开发能够高效、可靠且合乎伦理地辅助临床医生的人工智能,很大程度上依赖于高质量的临床训练数据集。尽管对用于训练临床 AI 助手的数据整理兴趣日益增长,现有数据集在很大程度上仍然分散、文档不全且常常无法获取,这阻碍了为临床心理健康护理开发的 AI 模型的可重复性、可比性和可推广性。在本文中,我们呈现了首个关于与训练和开发 AI 驱动临床助手相关的临床心理健康数据集的全面综述。 我们按精神障碍类型(例如抑郁、精神分裂)、数据模态(例如文本、语音、生理信号)、任务类型(例如诊断预测、症状严重度估计、干预生成)、可访问性(公开、受限或私有)和社会文化背景(例如语言和文化背景)对这些数据集进行分类。除此之外,我们还研究了合成的临床心理健康数据集。我们的综述识别出若干关键不足,例如缺乏纵向数据、文化和语言代表性有限、采集与标注标准不一致,以及合成数据模态的匮乏。最后,我们概述了策划与标准化未来数据集的主要挑战,并提供了可行的建议,以促进更健壮、可泛化且更公平的心理健康人工智能系统的开发。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 13:42:35 UTC 发布时间:2025-08-13 13:42:35 UTC

#15 BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning BigCharts-R1:通过视觉强化微调增强图表推理

Authors: [Ahmed Masry](https://arxiv.org/search/?searchtype=author&query=Ahmed Masry), [Abhay Puri](https://arxiv.org/search/?searchtype=author&query=Abhay Puri), [Masoud Hashemi](https://arxiv.org/search/?searchtype=author&query=Masoud Hashemi), [Juan A. Rodriguez](https://arxiv.org/search/?searchtype=author&query=Juan A. Rodriguez), [Megh Thakkar](https://arxiv.org/search/?searchtype=author&query=Megh Thakkar), [Khyati Mahajan](https://arxiv.org/search/?searchtype=author&query=Khyati Mahajan), [Vikas Yadav](https://arxiv.org/search/?searchtype=author&query=Vikas Yadav), [Sathwik Tejaswi Madhusudhan](https://arxiv.org/search/?searchtype=author&query=Sathwik Tejaswi Madhusudhan), [Alexandre Piché](https://arxiv.org/search/?searchtype=author&query=Alexandre Piché), [Dzmitry Bahdanau](https://arxiv.org/search/?searchtype=author&query=Dzmitry Bahdanau), [Christopher Pal](https://arxiv.org/search/?searchtype=author&query=Christopher Pal), [David Vazquez](https://arxiv.org/search/?searchtype=author&query=David Vazquez), [Enamul Hoque](https://arxiv.org/search/?searchtype=author&query=Enamul Hoque), [Perouz Taslakian](https://arxiv.org/search/?searchtype=author&query=Perouz Taslakian), [Sai Rajeswar](https://arxiv.org/search/?searchtype=author&query=Sai Rajeswar), [Spandana Gella](https://arxiv.org/search/?searchtype=author&query=Spandana Gella)

Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models. 图表是数据分析的基础,将原始数据转换为支持人类决策的清晰可视化表示。尽管当前的视觉-语言模型(VLMs)已取得显著进展,但由于其训练数据集缺乏多样性和真实世界的特性,或依赖于图表中自动提取的底层数据表而这些表可能包含大量估算误差,因此在图表理解方面仍然存在困难。此外,现有模型仅依赖于使用这些低质量数据集进行的监督微调,严重限制了其效能。为了解决这些问题,我们首先提出了 BigCharts,一种数据集生成管线,通过以来自多个在线平台的真实图表为条件来生成视觉上多样化的图表图像。与纯合成数据集不同,BigCharts 融入了真实世界的数据,保证了真实性和视觉多样性,同时由于我们提出的重绘(replotting)过程,仍能保留准确的底层数据。此外,我们引入了一个综合训练框架,将监督微调与基于群体相对策略优化(Group Relative Policy Optimization,GRPO)的强化学习结合起来。 通过引入专为图表推理设计的新型奖励信号,我们的方法增强了模型在不同图表风格和领域下的鲁棒性与泛化能力,从而产生了最先进的图表推理模型 BigCharts-R1。大量实验表明,与更大规模的开源和闭源模型相比,我们的模型在多个图表问答基准上均优于现有方法。
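
The GRPO component referenced here computes advantages relative to a group of responses sampled for the same chart question rather than using a learned value function. A generic sketch of that group-relative advantage (the reward values are made up; this is not the BigCharts-R1 training code):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each response's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One chart question, four sampled answers scored by chart-specific reward signals
# (e.g., numeric correctness of the extracted value); values are illustrative.
rewards = [1.0, 0.0, 0.5, 0.0]
print(group_relative_advantages(rewards))
```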

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 13:39:17 UTC 发布时间:2025-08-13 13:39:17 UTC

#16 Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges 可解释自然语言处理的采纳:来自产业界与学术界关于实践与挑战的观点

Authors: [Mahdi Dhaini](https://arxiv.org/search/?searchtype=author&query=Mahdi Dhaini), [Tobias Müller](https://arxiv.org/search/?searchtype=author&query=Tobias Müller), [Roksoliana Rabets](https://arxiv.org/search/?searchtype=author&query=Roksoliana Rabets), [Gjergji Kasneci](https://arxiv.org/search/?searchtype=author&query=Gjergji Kasneci)

The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners’ perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice. 可解释自然语言处理(NLP)领域近年来迅速发展。复杂模型日益不透明,这要求对其决策提供透明度和解释,这对于理解其推理并促进部署尤为关键,特别是在高风险环境中。尽管可解释 NLP 受到越来越多关注,但从业者关于其实践采用和有效性的观点仍然研究不足。本文通过调查从业者使用可解释性方法的经验来填补这一研究空白,重点关注他们采纳此类方法的动机、采用的技术、满意度以及在现实世界 NLP 应用中遇到的实际挑战。通过对行业从业者的定性访谈研究以及对学术研究者的补充访谈,我们系统地分析并比较了他们的观点。我们的研究发现揭示了概念性差距、对现有可解释性方法的低满意度,并突出了评估方面的挑战。 我们的研究结果强调了在实践中更好地推广可解释自然语言处理所需的明确定义和以用户为中心的框架。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题:计算与语言、人工智能、人机交互

Publish: 2025-08-13 13:12:18 UTC 发布:2025-08-13 13:12:18 UTC

#17 Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study LLM 生成的文本解释能提升模型分类性能吗?一项实证研究

Authors: [Mahdi Dhaini](https://arxiv.org/search/?searchtype=author&query=Mahdi Dhaini), [Juraj Vladika](https://arxiv.org/search/?searchtype=author&query=Juraj Vladika), [Ege Erdogan](https://arxiv.org/search/?searchtype=author&query=Ege Erdogan), [Zineb Attaoui](https://arxiv.org/search/?searchtype=author&query=Zineb Attaoui), [Gjergji Kasneci](https://arxiv.org/search/?searchtype=author&query=Gjergji Kasneci)

In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance. 在快速发展的可解释自然语言处理(NLP)领域,文本解释,即类人推理,对于解释模型预测并为数据集增加可解释标签至关重要。传统方法依赖人工标注,成本高、劳动强度大,且阻碍了可扩展性。在这项工作中,我们提出了一个自动化框架,利用多个最先进的大型语言模型(LLMs)生成高质量的文本解释。我们使用一套全面的自然语言生成(NLG)度量方法严格评估这些由 LLMs 生成的解释质量。此外,我们还研究了这些解释对预训练语言模型(PLMs)和 LLMs 在两个不同基准数据集上的自然语言推理任务性能的下游影响。我们的实验证明,与人工标注的解释相比,自动生成的解释在提升模型性能方面表现出高度竞争力。我们的发现强调了基于 LLMs 的可扩展自动文本解释生成在扩展 NLP 数据集和提升模型性能方面的有前景途径。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 12:59:08 UTC 发布:2025-08-13 12:59:08 UTC

#18 UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech UtterTune:基于 LoRA 的多语言文本到语音目标语言发音编辑与控制

Author: [Shuhei Kato](https://arxiv.org/search/?searchtype=author&query=Shuhei Kato)

We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness. 我们提出了 UtterTune,一种轻量的适配方法,用于微调基于大型语言模型(LLM)架构的多语种文本到语音(TTS)系统,旨在增强目标语言的发音可控性,同时保留其他语言的性能。尽管 LLM 架构使 TTS 模型在自然度方面取得了显著进步,但准确建模字素到音素(G2P)映射和韵律仍然具有挑战性,尤其是当模型省略显式的 G2P 模块而直接处理最小编码文本(例如,字节对编码)时。UtterTune 利用低秩适配,使得在本论文的目标语言日语中能够在音素层面控制音段发音和高低音重音,同时在零样本设置下保持自然度和说话者相似性。客观和主观评估均证实了其有效性。
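
Low-rank adaptation of this kind is typically attached with a few lines using the `peft` library. The sketch below shows a generic LoRA setup on a causal LM; the base checkpoint name, rank, and target module names are illustrative assumptions, not UtterTune's actual configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name for an LLM-based TTS backbone.
base = AutoModelForCausalLM.from_pretrained("some-llm-based-tts-backbone")

lora_cfg = LoraConfig(
    r=8,                                    # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # only the small LoRA matrices are updated
```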

Subjects: Computation and Language, Audio and Speech Processing 主题:计算与语言,音频与语音处理

Publish: 2025-08-13 12:52:38 UTC 发布时间:2025-08-13 12:52:38 UTC

#19 Echoes of Agreement: Argument Driven Opinion Shifts in Large Language Models 回声式的认同:大型语言模型中由论点驱动的观点转变

Author: [Avneet Kaur](https://arxiv.org/search/?searchtype=author&query=Avneet Kaur)

There have been numerous studies evaluating bias of LLMs towards political topics. However, positions towards these topics in model outputs are highly sensitive to the prompt, and what happens when the prompt itself is suggestive of certain arguments towards those positions remains underexplored. This is crucial for understanding how robust these bias evaluations are and for understanding model behaviour, as these models frequently interact with opinionated text. To that end, we conduct experiments for political bias evaluation in presence of supporting and refuting arguments. Our experiments show that such arguments substantially alter model responses towards the direction of the provided argument in both single-turn and multi-turn settings. Moreover, we find that the strength of these arguments influences the directional agreement rate of model responses. These effects point to a sycophantic tendency in LLMs adapting their stance to align with the presented arguments, which has downstream implications for measuring political bias and developing effective mitigation strategies. 已有大量研究评估了 LLMs 在政治话题上的偏见。然而,模型输出中对这些话题的立场对提示语(prompt)高度敏感。当提示语本身暗示了针对这些立场的特定论点时,会发生什么,这一问题仍未被充分探讨。理解这一点对评估这些偏见的稳健性以及理解模型行为至关重要,因为这些模型经常与带有观点的文本交互。为此,我们在有支持性和反驳性论点的情况下进行了政治偏见评估实验。我们的实验证明,这类论点在单轮和多轮设置中都会显著改变模型响应的方向,使其朝向所提供论点的立场偏移。此外,我们发现论点的强弱会影响模型响应的方向性认同率。这些效应表明 LLMs 存在谄媚倾向,会调整其立场以与呈现的论点保持一致,这对衡量政治偏见和制定有效缓解策略具有下游影响。
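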

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-11 20:54:14 UTC 发布时间:2025-08-11 20:54:14 UTC

#20 Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation 将问题与文档转换为语义对齐的检索增强生成

Author: [Seokgi Lee](https://arxiv.org/search/?searchtype=author&query=Seokgi Lee)

We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multi-hop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultiHopQa, HotpotQA) from LongBench. Our method improves RAG performance compared to baseline systems. Our contributions highlight the benefits of using answerable-question embeddings for RAG, and the effectiveness of LLM-based query decomposition for multihop scenarios. 我们提出了一种针对多跳问答定制的新型检索增强生成(RAG)框架。首先,我们的系统使用大型语言模型(LLM)将复杂的多跳问题分解为一系列单跳子问题,以引导文档检索。这样的分解通过明确定位不同的知识层面来缓解多跳查询固有的歧义性。其次,我们不是直接对原始或分块文档进行嵌入,而是使用 Qwen3-8B 从每个文档块生成可回答的问题,对这些生成的问题进行嵌入,并通过问题—问题嵌入相似性检索相关的文档块。在推理过程中,检索到的文档块连同原始问题一起输入到 RAG 管道。我们在 LongBench 的三个多跳问答数据集(MuSiQue、2WikiMultiHopQa、HotpotQA)上进行了评估。与基线系统相比,我们的方法提升了 RAG 的性能。我们的贡献突出了在 RAG 中使用可回答问题嵌入的优势,以及基于 LLM 的查询分解在多跳场景中的有效性。
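
The retrieval step can be sketched with sentence embeddings: embed the questions generated from each chunk, embed the current subquestion, and return the chunks whose generated questions are most similar. The embedding model name and the pre-generated questions below are placeholders (the paper generates them with Qwen3-8B).

```python
from sentence_transformers import SentenceTransformer, util

# chunk_questions[i] is a question generated (e.g., by Qwen3-8B) from chunk i.
chunks = ["Chunk about the 2019 acquisition ...", "Chunk about the founder's birthplace ..."]
chunk_questions = ["Which company was acquired in 2019?", "Where was the founder born?"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder embedding model
q_emb = encoder.encode(chunk_questions, convert_to_tensor=True)

def retrieve(subquestion: str, top_k: int = 1):
    """Rank chunks by question-question cosine similarity."""
    s_emb = encoder.encode(subquestion, convert_to_tensor=True)
    scores = util.cos_sim(s_emb, q_emb)[0]
    best = scores.topk(k=min(top_k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]

print(retrieve("Where was the company's founder born?"))
```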

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 12:35:04 UTC 发布:2025-08-13 12:35:04 UTC

#21 Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Authors: [Vaishnavi Shrivastava](https://arxiv.org/search/?searchtype=author&query=Vaishnavi Shrivastava), [Ahmed Awadallah](https://arxiv.org/search/?searchtype=author&query=Ahmed Awadallah), [Vidhisha Balachandran](https://arxiv.org/search/?searchtype=author&query=Vidhisha Balachandran), [Shivam Garg](https://arxiv.org/search/?searchtype=author&query=Shivam Garg), [Harkirat Behl](https://arxiv.org/search/?searchtype=author&query=Harkirat Behl), [Dimitris Papailiopoulos](https://arxiv.org/search/?searchtype=author&query=Dimitris Papailiopoulos)

Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length–inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely “filler”: repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO’s length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute–a simple yet effective trade-off for efficient reasoning. 使用可验证奖励进行强化学习训练的大型语言模型往往以准确性换取长度——通过增加回答长度来取得准确性提升。虽然对于更难的问题更长的答案可能是合理的,但许多标记只是“填充物”:重复、冗长的文本并没有真正推进解答。我们提出了 GFPO(组过滤策略优化),通过在训练时对每个问题采样更大的组并基于两个关键指标筛选用于训练的回应来遏制这种长度膨胀: (1) 回答长度和 (2) 标记效率:每标记奖励比率。通过在训练时进行更多采样,我们教会模型在推理时少思考。在 Phi-4-reasoning 模型上,GFPO 在保持准确性的同时,将 GRPO 在具有挑战性的 STEM 和编码基准(AIME 24/25、GPQA、Omni-MATH、LiveCodeBench)上的长度膨胀减少了 46–71%。以每标记奖励进行优化则进一步将长度膨胀减少到 71–85%。 我们还提出了自适应难度 GFPO,它根据实时难度估计为更难的问题动态分配更多训练资源,从而在计算效率与准确性之间取得更好的平衡,尤其是在困难问题上。GFPO 证明了增加训练阶段的计算量可以直接转化为测试阶段计算量的减少——这是实现高效推理的一种简单而有效的权衡。
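
The core GFPO move can be sketched in a few lines: sample a larger group of responses per prompt, then keep only the shortest or most token-efficient ones (reward per token) for the policy update. The sampling sizes and thresholds below are illustrative, not the paper's exact settings.

```python
def gfpo_filter(responses, rewards, keep: int, by: str = "reward_per_token"):
    """responses: list of token lists; rewards: matching scalar rewards.
    Keep the `keep` responses that are most token-efficient (or shortest)."""
    stats = []
    for resp, r in zip(responses, rewards):
        length = max(len(resp), 1)
        stats.append({"resp": resp, "reward": r, "len": length,
                      "reward_per_token": r / length})
    key = (lambda s: s["reward_per_token"]) if by == "reward_per_token" else (lambda s: -s["len"])
    kept = sorted(stats, key=key, reverse=True)[:keep]
    return kept   # only these enter the (GRPO-style) policy-gradient update

# Toy group of 4 sampled answers for one problem (token lists shortened).
responses = [["a"] * 40, ["b"] * 400, ["c"] * 120, ["d"] * 60]
rewards = [1.0, 1.0, 0.0, 1.0]
for s in gfpo_filter(responses, rewards, keep=2):
    print(s["len"], round(s["reward_per_token"], 4))
```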

Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习

Publish: 2025-08-13 11:43:49 UTC 发布:2025-08-13 11:43:49 UTC

#22 The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models 图表欺骗的危险:误导性可视化如何影响视觉-语言模型

Authors: [Ridwan Mahbub](https://arxiv.org/search/?searchtype=author&query=Ridwan Mahbub), [Mohammed Saidul Islam](https://arxiv.org/search/?searchtype=author&query=Mohammed Saidul Islam), [Md Tahmid Rahman Laskar](https://arxiv.org/search/?searchtype=author&query=Md Tahmid Rahman Laskar), [Mizanur Rahman](https://arxiv.org/search/?searchtype=author&query=Mizanur Rahman), [Mir Tafseer Nayeem](https://arxiv.org/search/?searchtype=author&query=Mir Tafseer Nayeem), [Enamul Hoque](https://arxiv.org/search/?searchtype=author&query=Enamul Hoque)

Information visualizations are powerful tools that help users quickly identify patterns, trends, and outliers, facilitating informed decision-making. However, when visualizations incorporate deceptive design elements-such as truncated or inverted axes, unjustified 3D effects, or violations of best practices-they can mislead viewers and distort understanding, spreading misinformation. While some deceptive tactics are obvious, others subtly manipulate perception while maintaining a facade of legitimacy. As Vision-Language Models (VLMs) are increasingly used to interpret visualizations, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs. In this study, we conduct an in-depth evaluation of VLMs’ ability to interpret misleading visualizations. By analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs, we demonstrate that most VLMs are deceived by them. This leads to altered interpretations of charts, despite the underlying data remaining the same. Our findings highlight the need for robust safeguards in VLMs against visual misinformation. 信息可视化是帮助用户快速识别模式、趋势和异常值的强大工具,从而促进明智决策。然而,当可视化包含欺骗性设计元素——例如截断或颠倒的坐标轴、不必要的三维效果或违反最佳实践的做法——时,它们可能会误导观众并扭曲理解,传播错误信息。虽然一些欺骗策略显而易见,其他策略则在表面上保持合法性的伪装下微妙地操控感知。随着视觉-语言模型(VLMs)日益被用于解读可视化,尤其是由非专业用户使用,理解这些模型对欺骗性视觉设计的易感性变得至关重要。在本研究中,我们对 VLMs 解读误导性可视化的能力进行了深入评估。通过分析来自十种不同模型、涉及八类不同误导性图表设计的超过 16,000 条响应,我们证明大多数 VLMs 会被这些设计所欺骗。这导致对图表的解释发生改变,尽管基础数据保持不变。我们的研究结果强调了在 VLMs 中建立防范视觉错误信息的强有力保障的必要性。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 11:11:18 UTC 发布日期:2025-08-13 11:11:18 UTC

Author: [Rahul Hemrajani](https://arxiv.org/search/?searchtype=author&query=Rahul Hemrajani)

The integration of Artificial Intelligence(AI) into the legal profession raises significant questions about the capacity of Large Language Models(LLM) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations, factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law. 将人工智能(AI)引入法律行业,引发了关于大型语言模型(LLM)能否执行关键法律任务的重要问题。本文通过实证评估了诸如 GPT、Claude 和 Llama 等 LLM 在印度语境下执行关键法律任务的表现,包括问题发现、法律起草、法律意见、法律研究和推理。通过一项调查实验,我将 LLM 的输出与初级律师的工作进行了比较,邀请高级法学学生从有用性、准确性和全面性对作品进行评分。LLM 在起草和问题发现方面表现出色,经常达到或超过人类水平。然而,它们在专业法律研究方面表现欠佳,经常产生幻觉,即事实错误或捏造的输出。我得出结论:尽管 LLM 能增强某些法律任务,但在细致推理和法律的精确适用方面,仍然需要人类专业知识。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 11:04:48 UTC 发布时间:2025-08-13 11:04:48 世界协调时

#24 Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation 慢速微调与低熵掩码用于安全链式思维蒸馏

Authors: [Ziyang Ma](https://arxiv.org/search/?searchtype=author&query=Ziyang Ma), [Qingyue Yuan](https://arxiv.org/search/?searchtype=author&query=Qingyue Yuan), [Linhai Zhang](https://arxiv.org/search/?searchtype=author&query=Linhai Zhang), [Deyu Zhou](https://arxiv.org/search/?searchtype=author&query=Deyu Zhou)

Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model’s safety in the early stage and the latter prolonging the safe training epochs. 以往的链式思维(CoT)蒸馏方法主要通过利用由强大大型模型(LLMs,例如 GPT-4)生成的高质量推理依据来增强小型语言模型(SLMs)的推理能力。然而,很少有工作注意到训练对 SLM 安全性带来的负面影响,而本研究揭示了这一点。尽管已有一些关于安全对齐的工作通过微调语言模型或操控模型权重以抵御有害输入,但它们需要额外的计算或带注释的数据,并且可能影响 SLM 的推理能力。在本文中,我们研究如何在 CoT 蒸馏过程中保持 SLM 的安全性。具体而言,我们提出了一种安全蒸馏方法:Slow Tuning and Low-Entropy Masking Distillation(SLowED),包含两个模块:Slow Tuning 和 Low-Entropy Masking。Slow Tuning 缩小模型权重变化的幅度,以在初始权重分布附近的邻域空间优化模型权重。Low-Entropy Masking 对低熵标记进行掩码处理,这些标记被视为不必要的学习目标,从而将它们排除在微调之外。 在三个小型语言模型(Qwen2.5-1.5B、Llama-3.2-1B、BLOOM-1.1B)上,对推理基准(BBH、BB-Sub、ARC、AGIEval)和安全评估(AdvBench)进行的实验表明,SLowED 保持了小型语言模型的安全性,并在提高其推理能力方面与现有蒸馏方法相比具有可比性。此外,我们的消融研究展示了慢调优(Slow Tuning)和低熵掩码(Low-Entropy Masking)的有效性:前者在早期阶段维持模型的安全性,后者延长了安全训练的时期。
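
The Low-Entropy Masking module can be sketched as follows: compute the per-token entropy of the next-token distribution and drop low-entropy positions from the distillation loss by setting their labels to the ignore index. The entropy threshold and the use of `ignore_index=-100` are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def low_entropy_masked_loss(logits: torch.Tensor, labels: torch.Tensor,
                            entropy_threshold: float = 0.5) -> torch.Tensor:
    """logits: (seq, vocab); labels: (seq,). Tokens whose predictive entropy is
    below the threshold are treated as unnecessary learning targets and masked out."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)     # per-token entropy in nats
    masked_labels = labels.clone()
    masked_labels[entropy < entropy_threshold] = -100         # ignored by cross_entropy
    return F.cross_entropy(logits, masked_labels, ignore_index=-100)

# Toy sequence of 5 tokens over a 50-token vocabulary.
logits = torch.randn(5, 50)
labels = torch.randint(0, 50, (5,))
print(low_entropy_masked_loss(logits, labels))
```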

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 09:56:08 UTC 发布:2025-08-13 09:56:08 协调世界时

#25 EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization EffiEval:通过能力覆盖最大化实现高效且具可泛化性的模型评估

Authors: [Yaoning Wang](https://arxiv.org/search/?searchtype=author&query=Yaoning Wang), [Jiahao Ying](https://arxiv.org/search/?searchtype=author&query=Jiahao Ying), [Yixin Cao](https://arxiv.org/search/?searchtype=author&query=Yixin Cao), [Yubo Ma](https://arxiv.org/search/?searchtype=author&query=Yubo Ma), [Yugang Jiang](https://arxiv.org/search/?searchtype=author&query=Yugang Jiang)

The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs. 大型语言模型(LLMs)的快速发展以及日益增大和多样化的评估基准的出现,为模型评估带来了巨大的计算挑战。本文提出了 EffiEval,一种无需训练的高效基准评估方法,能够在保持高评估可靠性的同时有效解决数据冗余问题。我们的方法专为满足高质量评估的三项关键标准而设计:代表性,通过确保对模型能力的全面覆盖;公平性,通过在样本选择过程中不依赖模型性能以避免偏差;以及泛化性,通过在不依赖大规模评估数据的情况下,实现跨数据集和模型家族的灵活迁移。不同于依赖绝对性能或需要大量评估数据的传统方法,我们的方法基于模型效用指数(Model Utility Index, MUI)自适应地选择高质量代表性子集。在多个公开基准和多种 LLMs 上进行的大量实验证明,EffiEval 在仅使用原始数据一小部分的情况下,能够与全数据集评估实现较强的一致性排序。 此外,我们的方法在规模上灵活且可扩展,允许用户根据具体需求在评估效率与代表性之间进行权衡。总体而言,EffiEval 为 LLMs 时代提供了一个可靠、公平且高效的实用且具有普适性的评估解决方案。
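
The abstract does not spell out the Model Utility Index, but the coverage-maximization framing suggests a simple mental model: greedily pick evaluation items so that the selected subset covers as many distinct capabilities as possible. The sketch below shows only that generic greedy max-coverage step, with per-item capability tags as an assumed input; it is not the paper's MUI-based procedure.

```python
def greedy_coverage_subset(item_capabilities, budget):
    """Greedily select up to `budget` items that maximize covered capability tags.

    item_capabilities: dict item_id -> set of capability tags (assumed to be
                       produced upstream, e.g. by a model-utility style analysis).
    """
    covered, selected = set(), []
    remaining = dict(item_capabilities)
    for _ in range(min(budget, len(remaining))):
        # Pick the item that adds the most not-yet-covered capabilities.
        best_id, best_gain = None, -1
        for item_id, caps in remaining.items():
            gain = len(caps - covered)
            if gain > best_gain:
                best_id, best_gain = item_id, gain
        if best_gain <= 0:                 # nothing new left to cover
            break
        selected.append(best_id)
        covered |= remaining.pop(best_id)
    return selected

items = {"q1": {"math", "logic"}, "q2": {"math"},
         "q3": {"coding"}, "q4": {"logic", "coding"}}
print(greedy_coverage_subset(items, budget=2))   # -> ['q1', 'q3'], covering all tags
```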

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 09:48:23 UTC 发布:2025-08-13 09:48:23 协调世界时 (UTC)

#26 Improving Diversity in Language Models: When Temperature Fails, Change the Loss 提高语言模型多样性:当温度失效时,改变损失函数

Authors: [Alexandre Verine](https://arxiv.org/search/?searchtype=author&query=Alexandre Verine), [Florian Le Bronnec](https://arxiv.org/search/?searchtype=author&query=Florian Le Bronnec), [Kunhao Zheng](https://arxiv.org/search/?searchtype=author&query=Kunhao Zheng), [Alexandre Allauzen](https://arxiv.org/search/?searchtype=author&query=Alexandre Allauzen), [Yann Chevaleyre](https://arxiv.org/search/?searchtype=author&query=Yann Chevaleyre), [Benjamin Negrevergne](https://arxiv.org/search/?searchtype=author&query=Benjamin Negrevergne)

Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques. 提高语言模型的多样性既具挑战性又非常重要。一个常见的方法是提高解码温度。在这项工作中,我们通过一个简单但常见的案例来研究这种方法,以便阐明为何降低温度可以提升质量(精确率),而提高温度却常常不能提升覆盖度(召回率)。我们的分析表明,要通过调整温度来有效调节模型,模型必须在训练时就朝着覆盖度方向进行优化。为了解决这一问题,我们提出通过利用精确率-召回率框架重新思考语言模型的损失函数。我们的结果表明,这种方法在精确率与召回率之间实现了比单纯将负对数似然训练与温度缩放结合更为出色的权衡。这些发现为更通用、更稳健的语言建模技术提供了一条可行路径。
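
For context on the "temperature fails" observation, recall that the decoding temperature only rescales logits before the softmax; a quick NumPy sketch with a toy three-token vocabulary shows why lowering it sharpens the distribution (helping Precision) while raising it merely flattens whatever the model already assigns, which cannot by itself recover modes the model never learned (Recall).

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Softmax over logits divided by a decoding temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])     # toy vocabulary of three tokens
for t in (0.5, 1.0, 2.0):
    print(t, temperature_softmax(logits, t).round(3))
# t = 0.5 concentrates mass on the top token; t = 2.0 spreads it out,
# but a token with a very low logit still ends up with little probability.
```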

Subjects: Computation and Language, Machine Learning 主题:计算与语言,机器学习

Publish: 2025-08-13 09:37:53 UTC 发布:2025-08-13 09:37:53 协调世界时

#27 AINL-Eval 2025 Shared Task: Detection of AI-Generated Scientific Abstracts in Russian AINL-Eval 2025 共享任务:俄语 AI 生成科学摘要的检测

Authors: [Tatiana Batura](https://arxiv.org/search/?searchtype=author&query=Tatiana Batura), [Elena Bruches](https://arxiv.org/search/?searchtype=author&query=Elena Bruches), [Milana Shvenk](https://arxiv.org/search/?searchtype=author&query=Milana Shvenk), [Valentin Malykh](https://arxiv.org/search/?searchtype=author&query=Valentin Malykh)

The rapid advancement of large language models (LLMs) has revolutionized text generation, making it increasingly difficult to distinguish between human- and AI-generated content. This poses a significant challenge to academic integrity, particularly in scientific publishing and multilingual contexts where detection resources are often limited. To address this critical gap, we introduce the AINL-Eval 2025 Shared Task, specifically focused on the detection of AI-generated scientific abstracts in Russian. We present a novel, large-scale dataset comprising 52,305 samples, including human-written abstracts across 12 diverse scientific domains and AI-generated counterparts from five state-of-the-art LLMs (GPT-4-Turbo, Gemma2-27B, Llama3.3-70B, Deepseek-V3, and GigaChat-Lite). A core objective of the task is to challenge participants to develop robust solutions capable of generalizing to both (i) previously unseen scientific domains and (ii) models not included in the training data. The task was organized in two phases, attracting 10 teams and 159 submissions, with top systems demonstrating strong performance in identifying AI-generated content. We also establish a continuous shared task platform to foster ongoing research and long-term progress in this important area. The dataset and platform are publicly available at https://github.com/iis-research-team/AINL-Eval-2025. 大型语言模型(LLMs)的快速发展革新了文本生成,使得区分人类与 AI 生成内容日益困难。这对学术诚信构成了重大挑战,尤其是在科学出版和多语言环境中,检测资源常常有限。为填补这一关键空白,我们推出了 AINL-Eval 2025 共享任务,专注于检测俄语的 AI 生成科学摘要。我们提供了一个新颖的大规模数据集,包含 52,305 个样本,涵盖来自 12 个多样化科学领域的人类撰写摘要以及来自五个最先进 LLMs(GPT-4-Turbo、Gemma2-27B、Llama3.3-70B、Deepseek-V3 和 GigaChat-Lite)生成的对应摘要。该任务的核心目标是挑战参与者开发能够泛化到(i)先前未见过的科学领域和(ii)未包含在训练数据中的模型的鲁棒解决方案。该任务分为两个阶段,吸引了 10 支团队和 159 次提交,顶级系统在识别 AI 生成内容方面表现出色。 我们还建立了一个持续的共享任务平台,以促进该重要领域的持续研究和长期进展。该数据集和平台可在 https://github.com/iis-research-team/AINL-Eval-2025 公共获取。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 08:53:17 UTC 发布:2025-08-13 08:53:17 UTC

#28 The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage 使用简单 N 元组覆盖进行成员推断的惊人有效性

Authors: [Skyler Hallinan](https://arxiv.org/search/?searchtype=author&query=Skyler Hallinan), [Jaehun Jung](https://arxiv.org/search/?searchtype=author&query=Jaehun Jung), [Melanie Sclar](https://arxiv.org/search/?searchtype=author&query=Melanie Sclar), [Ximing Lu](https://arxiv.org/search/?searchtype=author&query=Ximing Lu), [Abhilasha Ravichander](https://arxiv.org/search/?searchtype=author&query=Abhilasha Ravichander), [Sahana Ramnath](https://arxiv.org/search/?searchtype=author&query=Sahana Ramnath), [Yejin Choi](https://arxiv.org/search/?searchtype=author&query=Yejin Choi), [Sai Praneeth Karimireddy](https://arxiv.org/search/?searchtype=author&query=Sai Praneeth Karimireddy), [Niloofar Mireshghallah](https://arxiv.org/search/?searchtype=author&query=Niloofar Mireshghallah), [Xiang Ren](https://arxiv.org/search/?searchtype=author&query=Xiang Ren)

Membership inference attacks serve as a useful tool for fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models’ hidden states or probability distribution, which prevents investigation into more widely-used, API-access only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving comparable or even better performance to state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections. 成员推断攻击是对语言模型进行合理使用的重要工具,例如用于检测潜在的版权侵权和审计数据泄露。然而,许多当前最先进的攻击需要访问模型的隐藏状态或概率分布,这阻碍了对更广泛使用的仅通过 API 访问的模型(如 GPT-4)进行研究。在这项工作中,我们提出了 N-Gram 覆盖攻击,一种仅依赖目标模型文本输出的成员推断攻击,能够对完全黑箱模型发起攻击。我们利用了一个观察:模型更有可能记忆并随后生成在其训练数据中常见的文本模式。具体来说,为了对一个候选成员进行预测,N-Gram 覆盖攻击首先在该候选的前缀条件下获取多个模型生成输出。然后它使用 n-gram 重叠度量来计算并聚合这些输出与真实后缀的相似度;相似度高表明很可能为成员。 我们首先在一组多样的现有基准上演示了 N-Gram Coverage Attack 优于其他黑盒方法,同时在令人印象深刻的程度上,即便仅能访问文本输出,也能达到与最先进白盒攻击相当甚至更好的性能。有趣的是,我们发现该方法的成功率随攻击计算预算的增加而提升——当我们增加以前缀为条件从目标模型生成的序列数量时,攻击性能往往会改善。在验证了方法的准确性后,我们使用它在多个领域对先前未研究的封闭式 OpenAI 模型进行了调查。我们发现更近期的模型(如 GPT-4o)对成员推断展现出更强的鲁棒性,表明隐私保护正在朝着改进的方向演进。
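
A minimal sketch of the scoring step described above: sample several continuations of the candidate's prefix from the target model, measure how much of the true suffix's n-gram set each one covers, and aggregate. The choice of n, whitespace tokenization, and max-aggregation are illustrative assumptions.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_coverage_score(generations, true_suffix, n=3):
    """Fraction of the true suffix's n-grams covered by the best sampled generation.

    generations: texts sampled from the target model conditioned on the prefix.
    true_suffix: the held-out continuation of the candidate document.
    A high score suggests memorization, i.e. likely membership in training data.
    """
    ref = ngrams(true_suffix.split(), n)
    if not ref:
        return 0.0
    return max(len(ref & ngrams(g.split(), n)) / len(ref) for g in generations)

gens = ["the quick brown fox jumps over the lazy dog",
        "a quick brown fox leaps over a sleepy dog"]
print(ngram_coverage_score(gens, "the quick brown fox jumps over the lazy dog"))  # 1.0
```

Because the score is computed from sampled text alone, increasing the number of generations (the attack compute budget) simply adds more chances to hit the memorized continuation, matching the scaling behaviour reported above.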

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 08:35:16 UTC 发布:2025-08-13 08:35:16 UTC

#29 COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation COMPEER:用于情感支持对话的可控同理心强化推理

Authors: [Yunxiao Wang](https://arxiv.org/search/?searchtype=author&query=Yunxiao Wang), [Meng Liu](https://arxiv.org/search/?searchtype=author&query=Meng Liu), [Wenqi Liu](https://arxiv.org/search/?searchtype=author&query=Wenqi Liu), [Kaiyu Jiang](https://arxiv.org/search/?searchtype=author&query=Kaiyu Jiang), [Bin Wen](https://arxiv.org/search/?searchtype=author&query=Bin Wen), [Fan Yang](https://arxiv.org/search/?searchtype=author&query=Fan Yang), [Tingting Gao](https://arxiv.org/search/?searchtype=author&query=Tingting Gao), [Guorui Zhou](https://arxiv.org/search/?searchtype=author&query=Guorui Zhou), [Liqiang Nie](https://arxiv.org/search/?searchtype=author&query=Liqiang Nie)

Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model’s emotional support ability, advancing the development of empathetic, human-like support systems. 情感支持对话对于促进情绪健康至关重要,但现有模型往往缺乏基于心理学原理的深入共情推理。为了解决这一问题,我们提出了可控的共情推理,将自然语言推理与结构化的心理学步骤相结合。我们构建了一个带有推理正确性和回应偏好注释的细粒度数据集以实现此能力。为进一步增强训练,我们采用了带有统一过程-结果奖励模型的强化学习,该模型提供精确反馈。为缓解由熵崩溃导致的回应重复性问题,我们引入了基于人格的对话改写和一种考虑冗余性的奖励重加权策略。我们的方法显著提升了模型的情感支持能力,推动了具有人类般共情的支持系统的发展。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 06:09:32 UTC 发布:2025-08-13 06:09:32 UTC

#30 UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval UWBa 在 SemEval-2025 任务 7:多语言与跨语言事实核查声明检索

Authors: [Ladislav Lenc](https://arxiv.org/search/?searchtype=author&query=Ladislav Lenc), [Daniel Cífka](https://arxiv.org/search/?searchtype=author&query=Daniel Cífka), [Jiří Martínek](https://arxiv.org/search/?searchtype=author&query=Jiří Martínek), [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)

This paper presents a zero-shot system for fact-checked claim retrieval. We employed several state-of-the-art large language models to obtain text embeddings. The models were then combined to obtain the best possible result. Our approach achieved 7th place in monolingual and 9th in cross-lingual subtasks. We used only English translations as an input to the text embedding models since multilingual models did not achieve satisfactory results. We identified the most relevant claims for each post by leveraging the embeddings and measuring cosine similarity. Overall, the best results were obtained by the NVIDIA NV-Embed-v2 model. For some languages, we benefited from model combinations (NV-Embed & GPT or Mistral). 本文提出了一个用于事实核查声明检索的零样本系统。我们使用了若干最先进的大型语言模型来获取文本嵌入,随后将这些模型组合以获得最佳结果。我们的方法在单语言子任务中获得第 7 名,在跨语言子任务中获得第 9 名。由于多语言模型未达到令人满意的效果,我们仅使用英文翻译作为输入提供给文本嵌入模型。通过利用嵌入并测量余弦相似度,我们为每条帖子识别出最相关的声明。总体而言,效果最佳的是 NVIDIA NV-Embed-v2 模型。对于某些语言,我们通过模型组合(NV-Embed 与 GPT 或 Mistral)获得了额外收益。
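
The retrieval step described above is plain nearest-neighbour search by cosine similarity between a post embedding and the fact-checked-claim embeddings; a small NumPy sketch follows, with random vectors standing in for the outputs of an encoder such as NV-Embed-v2.

```python
import numpy as np

def cosine_top_k(post_vec, claim_vecs, k=5):
    """Indices and scores of the k claims most similar to the post embedding."""
    post = post_vec / np.linalg.norm(post_vec)
    claims = claim_vecs / np.linalg.norm(claim_vecs, axis=1, keepdims=True)
    sims = claims @ post
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
claim_vecs = rng.normal(size=(100, 64))   # stand-in claim embeddings
post_vec = rng.normal(size=64)            # stand-in post embedding
idx, scores = cosine_top_k(post_vec, claim_vecs, k=3)
print(idx, scores.round(3))
```

One simple way to combine several embedding models, as the authors do for some languages, is to average their normalized per-claim similarity scores before ranking, though the paper does not specify its exact combination rule.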

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 05:55:59 UTC 发布:2025-08-13 05:55:59 协调世界时

#31 Cross-lingual Aspect-Based Sentiment Analysis: A Survey on Tasks, Approaches, and Challenges 跨语种基于方面的情感分析:关于任务、方法与挑战的综述

Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that focuses on understanding opinions at the aspect level, including sentiment towards specific aspect terms, categories, and opinions. While ABSA research has seen significant progress, much of the focus has been on monolingual settings. Cross-lingual ABSA, which aims to transfer knowledge from resource-rich languages (such as English) to low-resource languages, remains an under-explored area, with no systematic review of the field. This paper aims to fill that gap by providing a comprehensive survey of cross-lingual ABSA. We summarize key ABSA tasks, including aspect term extraction, aspect sentiment classification, and compound tasks involving multiple sentiment elements. Additionally, we review the datasets, modelling paradigms, and cross-lingual transfer methods used to solve these tasks. We also examine how existing work in monolingual and multilingual ABSA, as well as ABSA with LLMs, contributes to the development of cross-lingual ABSA. Finally, we highlight the main challenges and suggest directions for future research to advance cross-lingual ABSA systems. 面向方面的情感分析(ABSA)是一类细粒度的情感分析任务,侧重于在方面层面理解意见,包括对特定方面词、类别和观点的情感。尽管 ABSA 研究已取得显著进展,但大多数关注仍集中在单语言环境。跨语种 ABSA 致力于将资源丰富语言(如英语)的知识迁移到低资源语言,仍是一个未被充分探索的领域,且缺乏系统性的综述。本文旨在弥补这一空白,提供对跨语种 ABSA 的全面综述。我们总结了关键的 ABSA 任务,包括方面词抽取、方面情感分类以及涉及多个情感要素的复合任务。此外,我们回顾了解决这些任务所使用的数据集、建模范式和跨语种迁移方法。我们还考察了现有的单语和多语 ABSA 工作以及基于 LLMs 的 ABSA 如何推动跨语种 ABSA 的发展。最后,我们强调了主要挑战并提出了推动跨语种 ABSA 系统发展的未来研究方向。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 05:55:53 UTC 发布时间:2025-08-13 05:55:53 UTC

#32 LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation LACA:通过 LLM 数据增强改进跨语言基于方面的情感分析

Authors: [Jakub Šmíd](https://arxiv.org/search/?searchtype=author&query=Jakub Šmíd), [Pavel Přibáň](https://arxiv.org/search/?searchtype=author&query=Pavel Přibáň), [Pavel Král](https://arxiv.org/search/?searchtype=author&query=Pavel Král)

Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models. 跨语言基于方面的情感分析(ABSA)涉及通过从具有标注数据的源语言转移知识来在目标语言中进行细粒度情感分析。大多数现有方法在弥合语言差距时严重依赖常常不可靠的翻译工具。本文提出了一种新方法,利用大型语言模型(LLM)在不需翻译工具的情况下生成高质量的目标语言伪标签数据。首先,框架训练一个 ABSA 模型以对未标注的目标语言数据进行预测。接着,提示 LLM 生成能够比原始文本更好地代表这些噪声预测的自然句子。随后将 ABSA 模型在生成的伪标签数据集上进一步微调。我们在六种语言和五种主干模型上展示了该方法的有效性,超过了先前基于翻译的最先进方法。所提出的框架也支持生成式模型,我们表明微调后的 LLM 优于更小的多语言模型。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 05:55:48 UTC 发布:2025-08-13 05:55:48 UTC

#33 From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation 从排序到选择:一种简单而高效的用于检索增强生成的动态段落选择器

Authors: [Siyuan Meng](https://arxiv.org/search/?searchtype=author&query=Siyuan Meng), [Junming Liu](https://arxiv.org/search/?searchtype=author&query=Junming Liu), [Yirong Chen](https://arxiv.org/search/?searchtype=author&query=Yirong Chen), [Song Mao](https://arxiv.org/search/?searchtype=author&query=Song Mao), [Pinlong Cai](https://arxiv.org/search/?searchtype=author&query=Pinlong Cai), [Guohang Yan](https://arxiv.org/search/?searchtype=author&query=Guohang Yan), [Botian Shi](https://arxiv.org/search/?searchtype=author&query=Botian Shi), [Ding Wang](https://arxiv.org/search/?searchtype=author&query=Ding Wang)

Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios. 检索增强生成(RAG)系统常常被其重排序模块所制约,这些模块通常对段落独立打分并选择固定的 Top-K 大小。这种方法在需要跨多个文档综合证据的复杂多跳查询上表现欠佳,会导致一个权衡:较小的 K 值遗漏关键信息,而较大的 K 值则引入噪声。为了解决这一问题,我们引入了动态段落选择器(DPS),这是一种将段落选择视为监督学习问题的新型重排序框架。不同于传统的逐点或序列方法,DPS 经过微调以捕捉段落间的依赖关系,并动态选择最相关的一组段落用于生成。作为一个无缝的即插即用模块,DPS 无需对标准 RAG 流水线进行任何修改。在五个基准上的全面评估表明,DPS 始终优于最先进的重排序器和微调方法。值得注意的是,在具有挑战性的 MuSiQue 数据集上,DPS 分别比像 Qwen3-reranker 和 RankingGPT 这样的强基线提升了 30.06%和 15.4%的 F1 得分。 我们的结果表明,通过启用自适应证据选择,DPS 在复杂的检索增强生成(RAG)场景中显著提升了推理能力。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 05:05:34 UTC 发布时间:2025-08-13 05:05:34 UTC

#34 Learning Facts at Scale with Active Reading 通过主动阅读在大规模上学习事实

Authors: [Jessy Lin](https://arxiv.org/search/?searchtype=author&query=Jessy Lin), [Vincent-Pierre Berges](https://arxiv.org/search/?searchtype=author&query=Vincent-Pierre Berges), [Xilun Chen](https://arxiv.org/search/?searchtype=author&query=Xilun Chen), [Wen-Tau Yih](https://arxiv.org/search/?searchtype=author&query=Wen-Tau Yih), [Gargi Ghosh](https://arxiv.org/search/?searchtype=author&query=Gargi Ghosh), [Barlas Oğuz](https://arxiv.org/search/?searchtype=author&query=Barlas Oğuz)

LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners lack tools that allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate that models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA. LLMs 众所周知在其参数化记忆中存储了大量知识。然而,从这些记忆中学习和回忆事实被认为是不可靠的,很大程度上取决于训练数据中特定事实的普遍性以及其他尚不明确的因素。实践者缺乏能够确保模型可靠且一致地学习特定知识体系的工具。为此,我们提出了主动阅读(Active Reading):一个训练模型通过自生成的学习策略研究给定材料的框架。首先,我们展示了在专家领域用主动阅读训练的模型比普通微调和其他数据增强方法吸收了显著更多的知识。我们对专家级 8B 模型应用主动阅读到每个基准的源文档,使其在基于维基百科的 SimpleQA 子集上达到 66%(相对于普通微调提高 313%)并在 FinanceBench 上达到 26%(相对于普通微调提高 160%)。最后,我们展示了主动阅读可以在预训练规模上被利用以构建更具事实性的模型。 作为这一点的示范,我们发布了 Meta WikiExpert-8B,这是一个在 1 万亿个生成标记上训练的维基百科专家模型,在事实问答方面超过了拥有数千亿参数的模型。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 04:54:43 UTC 发布:2025-08-13 04:54:43 UTC

#35 User-centric Subjective Leaderboard by Customizable Reward Modeling 通过可定制奖励建模实现以用户为中心的主观排行榜

Authors: [Qi Jia](https://arxiv.org/search/?searchtype=author&query=Qi Jia), [Xiujie Song](https://arxiv.org/search/?searchtype=author&query=Xiujie Song), [Zicheng Zhang](https://arxiv.org/search/?searchtype=author&query=Zicheng Zhang), [Yijin Guo](https://arxiv.org/search/?searchtype=author&query=Yijin Guo), [Kaiwei Zhang](https://arxiv.org/search/?searchtype=author&query=Kaiwei Zhang), [Zijian Chen](https://arxiv.org/search/?searchtype=author&query=Zijian Chen), [Guangtao Zhai](https://arxiv.org/search/?searchtype=author&query=Guangtao Zhai)

Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences. 现有的大型语言模型(LLMs)基准主要侧重于通过可验证任务评估其能力。此类客观且静态的基准在实际选择 LLMs 时效用有限,使用户难以为其个性化需求找到合适的模型。为弥补这一差距,我们提出了首个以用户为中心的主观排行榜(User-Centric Subjective Leaderboard,USL),该排行榜基于偏好驱动,在多样的真实场景中对 LLMs 进行动态排名。我们的工作基于对真实人类偏好数据的深入调研,涉及一万多条主观查询。研究表明,人类偏好存在显著的多样性和矛盾性,这限制了最先进奖励模型的效用。为此,我们引入了可定制奖励模型(Customizable Reward Models,CRMs)。仅用 4B 参数,我们的 CRM 就超越了诸如 GPT-4.1 和 Gemini-2.5-pro 等领先模型,在新主题和新评价标准上展现出卓越的泛化能力。由 CRMs 驱动的 USL 在面对矛盾偏好时呈现出强烈的负相关性。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 03:39:04 UTC 发布时间:2025-08-13 03:39:04 UTC

#36 From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text 从图表到公平叙述:揭示并缓解图表转文本中的地缘经济偏见

Authors: [Ridwan Mahbub](https://arxiv.org/search/?searchtype=author&query=Ridwan Mahbub), [Mohammed Saidul Islam](https://arxiv.org/search/?searchtype=author&query=Mohammed Saidul Islam), [Mir Tafseer Nayeem](https://arxiv.org/search/?searchtype=author&query=Mir Tafseer Nayeem), [Md Tahmid Rahman Laskar](https://arxiv.org/search/?searchtype=author&query=Md Tahmid Rahman Laskar), [Mizanur Rahman](https://arxiv.org/search/?searchtype=author&query=Mizanur Rahman), [Shafiq Joty](https://arxiv.org/search/?searchtype=author&query=Shafiq Joty), [Enamul Hoque](https://arxiv.org/search/?searchtype=author&query=Enamul Hoque)

Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While with the rapid advancement of large Vision-Language Models (VLMs), we have witnessed great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country’s economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available here. 图表在探索数据和传达洞见时非常常见,但从图表中提取关键结论并以自然语言表达可能具有挑战性。图表转文本任务旨在通过生成图表的文本摘要来自动化这一过程。随着大规模视觉-语言模型(VLMs)的快速进步,我们在这一领域取得了显著进展,但几乎没有关注其输出中潜在的偏见。本文研究了 VLM 在生成图表摘要时如何放大地缘经济偏见,可能造成社会伤害。具体而言,我们对来自六个广泛使用的专有和开源模型的 6,000 对图表-国家组合进行了大规模评估,以理解一个国家的经济地位如何影响生成摘要的情感倾向。我们的分析显示,现有的 VLM 倾向于为高收入国家生成比中等或低收入国家更为积极的描述,即使国家归属是唯一更改的变量。 我们还发现像 GPT-4o-mini、Gemini-1.5-Flash 和 Phi-3.5 这样的模型表现出不同程度的偏见。我们进一步探索了在推理时使用正向干扰项的基于提示的去偏方法,但发现它们仅能部分有效,这凸显了该问题的复杂性以及对更强有力去偏策略的需求。我们的代码和数据集在此公开可用。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-13 03:09:00 UTC 发布:2025-08-13 03:09:00 UTC

#37 Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech 在面向儿童的对话中利用 Zipformer 模型进行有效的语言识别

Authors: [Lavanya Shankar](https://arxiv.org/search/?searchtype=author&query=Lavanya Shankar), [Leibny Paola Garcia Perera](https://arxiv.org/search/?searchtype=author&query=Leibny Paola Garcia Perera)

Code-switching and language identification in child-directed scenarios present significant challenges, particularly in bilingual environments. This paper addresses this challenge by using Zipformer to handle the nuances of speech, which contains two imbalanced languages, Mandarin and English, in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the embeddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these backends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline. These findings highlight the potential of the transformer encoder architecture model in real scenarios. 在以儿童为对象的场景中,代码切换和语言识别带来了显著挑战,尤其是在双语环境中。本文通过使用 Zipformer 来处理包含普通话和英语两种不平衡语言的语句中的语音细微差异,以应对这一挑战。研究表明,Zipformer 的内部层有效地编码了语言特征,可用于语言识别。我们提出了从内部层选择并提取嵌入表示的方法,并与不同的后端进行比较。分析表明 Zipformer 在这些后端上都具有鲁棒性。我们的方法有效处理了不平衡数据,取得了 81.89% 的平衡准确率(BAC),比语言识别基线提升了 15.47%。这些发现强调了基于 Transformer 编码器架构模型在实际场景中的潜力。
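
As a rough stand-in for the back-end comparison step (not the actual Zipformer pipeline), the sketch below fits a simple classifier on assumed utterance-level embeddings taken from an inner encoder layer and reports Balanced Accuracy, the imbalance-robust metric quoted above; the synthetic data and logistic-regression back-end are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Assumed inputs: embeddings from a chosen inner layer, with imbalanced
# language labels (0 = Mandarin, 1 = English) as in child-directed speech.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 32)),
               rng.normal(0.8, 1.0, size=(100, 32))])
y = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)

# Balanced accuracy averages per-class recall, so the majority language
# cannot dominate the score the way plain accuracy would.
print("BAC:", round(balanced_accuracy_score(y_te, clf.predict(X_te)), 3))
```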

Subjects: Computation and Language, Sound 主题:计算与语言,声音

Publish: 2025-08-13 02:10:31 UTC 发布:2025-08-13 02:10:31 UTC

#38 Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models Columbo:使用大型语言模型为表格数据扩展缩写列名

Authors: [Ting Cai](https://arxiv.org/search/?searchtype=author&query=Ting Cai), [Stephen Sheen](https://arxiv.org/search/?searchtype=author&query=Stephen Sheen), [AnHai Doan](https://arxiv.org/search/?searchtype=author&query=AnHai Doan)

Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advance the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences. 将表格中缩写的列名展开,例如将“esal”展开为“employee salary”,对众多下游数据任务至关重要。该问题在企业、领域科学、政府机构等方面均存在。本文做出了三项显著推进现有技术水平的贡献。首先,我们指出以往工作所使用的合成公共数据存在重大局限性,并引入了 4 个来自企业/科学领域、包含真实世界缩写的新数据集。其次,我们表明以往工作采用的准确率度量严重低估了正确的展开结果,并提出了新的同义词感知度量,从而更精确地反映准确率。最后,我们开发了 Columbo,一种强大的基于 LLM 的解决方案,利用上下文、规则、链式思维推理和基于 token 的分析。大量实验证明,在 5 个数据集上,Columbo 比当前最先进的解决方案 NameGuess 提高了 4–29%。Columbo 已在 EDI(一个面向环境科学的重要数据门户)上线投入生产使用。
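
A minimal sketch of what a synonym-aware accuracy measure can look like (the paper's exact definition may differ): a predicted expansion counts as correct if, after light normalization, it matches any entry in the column's set of acceptable expansions rather than a single gold string.

```python
def normalize(s):
    return " ".join(s.lower().replace("_", " ").split())

def synonym_aware_accuracy(predictions, gold_synonyms):
    """Score predicted column-name expansions against sets of acceptable answers.

    predictions:   dict column -> predicted expansion, e.g. {"esal": "Employee Salary"}
    gold_synonyms: dict column -> set of acceptable expansions.
    """
    correct = 0
    for col, pred in predictions.items():
        accepted = {normalize(g) for g in gold_synonyms.get(col, set())}
        if normalize(pred) in accepted:
            correct += 1
    return correct / max(len(predictions), 1)

preds = {"esal": "Employee Salary", "dept_id": "department identifier"}
gold = {"esal": {"employee salary", "emp salary"},
        "dept_id": {"department id", "department number"}}
print(synonym_aware_accuracy(preds, gold))   # 0.5: "identifier" is not an accepted synonym
```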

Subjects: Computation and Language, Databases 主题:计算与语言,数据库

Publish: 2025-08-13 00:39:22 UTC 发布:2025-08-13 00:39:22 UTC

#39 APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification APIO:用于语法错误纠正与文本简化的自动提示归纳与优化

Authors: [Artem Chernodub](https://arxiv.org/search/?searchtype=author&query=Artem Chernodub), [Aman Saini](https://arxiv.org/search/?searchtype=author&query=Aman Saini), [Yejin Huh](https://arxiv.org/search/?searchtype=author&query=Yejin Huh), [Vivek Kulkarni](https://arxiv.org/search/?searchtype=author&query=Vivek Kulkarni), [Vipul Raheja](https://arxiv.org/search/?searchtype=author&query=Vipul Raheja)

Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks to be performed through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, automatic prompt optimization (APO) methods have been developed to refine a seed prompt. Advancing this line of research, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available. 近年来,大型语言模型(LLMs)的进步使得通过简单的基于提示的交互即可执行广泛的自然语言处理(NLP)任务成为可能。因此,已经提出了若干方法来设计提示,以最有效地让 LLMs 执行特定任务(例如,链式思维提示)。在存在明确定义的度量以优化模型性能的场景中,自动提示优化(APO)方法已被开发用于改进种子提示。推进这条研究路线,我们提出了 APIO,一种用于语法错误纠正(GEC)和文本简化任务的简单但有效的提示归纳与优化方法,无需依赖人工指定的种子提示。APIO 在这些任务上为纯基于 LLM 的提示方法达到了新的最先进性能。我们已将数据、代码、提示和输出公开可用。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 22:26:32 UTC 发布:2025-08-12 22:26:32 UTC

#40 Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling Flow-SLM:用于说话语言建模的语言与声学信息联合学习

Authors: [Ju-Chieh Chou](https://arxiv.org/search/?searchtype=author&query=Ju-Chieh Chou), [Jiawei Zhou](https://arxiv.org/search/?searchtype=author&query=Jiawei Zhou), [Karen Livescu](https://arxiv.org/search/?searchtype=author&query=Karen Livescu)

Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation. 无文本的语音语言模型(SLMs)是指不依赖文本监督的语音生成模型。大多数无文本 SLMs 通过预测下一个语义令牌(一种离散的语言内容表示)来学习,并依赖单独的声码器为生成的语音添加声学信息。此类模型无法获得声学上下文,也无法对声学细节进行内置控制。在这项工作中,我们提出通过生成语义令牌和声学帧的连续实值表示来联合建模语言信息和声学信息。我们使用流匹配(flow-matching)目标来在给定语义令牌的条件下预测该连续向量。我们研究了该方法的设计空间,发现预测多个未来语义令牌有助于保留语言信息。在语言可能性基准方面,我们的方法在性能上与现有模型相当,同时在提示生成中提供了更好的声学细节。
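
A minimal sketch of a linear-path conditional flow-matching objective in the spirit described above (PyTorch; the toy network, dimensions, and conditioning interface are assumptions, not the paper's architecture): the model learns to predict the velocity that transports a noise sample to the continuous acoustic-frame representation, conditioned on features derived from the semantic tokens.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity predictor v(x_t, t, cond); stands in for the real model."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Regress the constant velocity x1 - x0 along a straight noise-to-data path."""
    x0 = torch.randn_like(x1)              # noise sample
    t = torch.rand(x1.size(0), 1)          # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the interpolation path
    pred = model(x_t, t, cond)
    return ((pred - (x1 - x0)) ** 2).mean()

# Toy batch: 80-dim acoustic frames conditioned on 32-dim semantic-token features.
model = VelocityNet(dim=80, cond_dim=32)
loss = flow_matching_loss(model, x1=torch.randn(16, 80), cond=torch.randn(16, 32))
print(loss.item())
```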

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-12 21:25:37 UTC 发布:2025-08-12 21:25:37 协调世界时

#41 The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains 人类—人工智能混合德尔菲模型:一种用于复杂领域具备丰富情境的专家共识的结构化框架

Authors: [Cathy Speed](https://arxiv.org/search/?searchtype=author&query=Cathy Speed), [Ahmed A. Metwally](https://arxiv.org/search/?searchtype=author&query=Ahmed A. Metwally)

Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved >90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale. 在证据复杂、相互冲突或不足以直接给出建议的领域,专家共识起着关键作用。传统方法,如德尔菲研究、共识会议和系统性指南综合,提供了结构,但也面临诸多限制,包括专家小组负担重、解释性过度简化以及对有条件细微差别的抑制。这些挑战因信息过载、证据基础的碎片化以及越来越依赖缺乏专家筛选的公开来源而被放大。本研究提出并评估了一种人类—人工智能混合德尔菲(HAH-Delphi)框架,旨在通过整合生成式人工智能模型(Gemini 2.5 Pro)、小规模资深人类专家小组和结构化引导来增强专家共识的形成。HAH-Delphi 在三个阶段进行了测试:回顾性复现、前瞻性比较,以及在两个应用领域(耐力训练与抗阻及混合有氧/力量训练)中的实际部署。 在第一阶段,人工智能复制了 95%的已发表专家共识结论;在第二阶段,它与资深人类专家在方向性上达成了 95%的一致,尽管缺乏经验性和务实的细微差别。在第三阶段,由六名资深专家组成的小型小组实现了超过 90%的共识覆盖,并在最后一位参与者之前就达到主题饱和。人工智能提供了一致且以文献为基础的支撑,帮助解决分歧并加速饱和的到达。HAH-Delphi 框架为生成高质量、具情境敏感性的共识提供了一种灵活且可扩展的方法。它在健康、教练和绩效科学领域的成功应用证明了其方法学的稳健性,并支持将其作为生成有条件的、个性化指导以及大规模已发表共识框架的基础。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 21:24:19 UTC 发布:2025-08-12 21:24:19 协调世界时 (UTC)

#42 Decoding Neural Emotion Patterns through Natural Language Processing Embeddings 通过自然语言处理嵌入解读神经情感模式

Authors: [Gideon Vos](https://arxiv.org/search/?searchtype=author&query=Gideon Vos), [Maryam Ebrahimpour](https://arxiv.org/search/?searchtype=author&query=Maryam Ebrahimpour), [Liza van Eijk](https://arxiv.org/search/?searchtype=author&query=Liza van Eijk), [Zoltan Sarnyai](https://arxiv.org/search/?searchtype=author&query=Zoltan Sarnyai), [Mostafa Rahimi Azghadi](https://arxiv.org/search/?searchtype=author&query=Mostafa Rahimi Azghadi)

Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DIAC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression. 理解语言中的情感表达如何与大脑功能相关,是计算神经科学和情感计算中的一大挑战。传统的神经影像技术成本高且受限于实验室,但大量的数字文本为情感—大脑映射提供了新途径。以往研究大多分别考察基于神经影像的情感定位或计算文本分析,几乎没有整合二者。我们提出了一个计算框架,将文本情感内容映射到解剖学定义的大脑区域,而无需神经影像。使用 OpenAI 的 text-embedding-ada-002,我们生成高维语义表示,应用降维和聚类以识别情感群体,并将它们映射到与情感处理相关的 18 个大脑区域。进行了三项实验:i) 分析来自健康与抑郁受试者的会话数据(DIAC-WOZ 数据集)以比较映射模式,ii) 将该方法应用于 GoEmotions 数据集,iii) 比较人类撰写文本与 LLM 的回复以评估推断大脑激活的差异。情感强度通过词汇分析进行评分。 结果显示了具有高度空间特异性的神经解剖学上合理的映射。抑郁受试者表现出与消极情感相关的更强边缘系统参与。不同的离散情绪成功被区分开来。LLM 生成的文本在基本情绪分布上与人类相匹配,但在同理心和自我指涉区域(内侧前额叶皮层和后扣带皮层)的细微激活方面不足。这种具有成本效益且可扩展的方法可实现对自然语言的大规模分析,能够区分临床人群,并为评估人工智能情感表达提供基于大脑的基准。
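
The embed-reduce-cluster portion of the pipeline above can be illustrated briefly; random vectors stand in for text-embedding-ada-002 outputs, and the final assignment of clusters to the 18 emotion-related brain regions is a lookup defined by the authors that is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for 1536-dim text-embedding-ada-002 vectors for a set of sentences.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1536))

# Dimensionality reduction, then clustering into candidate emotional groups.
reduced = PCA(n_components=20, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(reduced)

# Each cluster would then be mapped, via a predefined emotion-to-region table,
# to one or more of the 18 brain regions linked to emotional processing.
print(np.bincount(labels))   # cluster sizes
```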

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-08-12 20:51:56 UTC 发表:2025-08-12 20:51:56 UTC

#43 TEN: Table Explicitization, Neurosymbolically TEN:表格显式化,神经符号化

Authors: [Nikita Mehrotra](https://arxiv.org/search/?searchtype=author&query=Nikita Mehrotra), [Aayush Kumar](https://arxiv.org/search/?searchtype=author&query=Aayush Kumar), [Sumit Gulwani](https://arxiv.org/search/?searchtype=author&query=Sumit Gulwani), [Arjun Radhakrishna](https://arxiv.org/search/?searchtype=author&query=Arjun Radhakrishna), [Ashish Tiwari](https://arxiv.org/search/?searchtype=author&query=Ashish Tiwari)

We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker to evaluate not only the well-formedness of that table, but also detect cases of hallucinations or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN’s tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases. 我们提出了一种神经符号方法 TEN,用于从半结构化的输入文本中提取表格数据。当文本输入没有始终使用特殊分隔符来划分列和行时,这项任务尤为具有挑战性。纯神经方法由于会出现幻觉并且无法强制执行硬约束,表现不佳。TEN 在大型语言模型(LLM)上使用结构性分解提示——一种专门的链式思维提示方法——来生成初始表格,随后使用符号检查器来评估该表格不仅在形式上的正确性,还检测幻觉或遗忘的情况。符号检查器的输出由一个批评型 LLM 处理,以生成修复表格的指导,并将其以自我调试循环的形式呈现给原始 LLM。我们的大量实验表明,TEN 在多种数据集和评测指标上显著优于纯神经基线,取得了更高的精确匹配准确率并大幅降低了幻觉率。 一项由 21 名参与者进行的用户研究进一步证实,TEN 生成的表格被评为显著更准确(平均得分:5.0 对 4.3;p = 0.021),并且在便于核对和修正方面始终更受青睐,参与者在超过 60%的情况下更偏好我们的方法。
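
The generate-check-critique loop described above can be sketched as follows; call_llm and call_critique_llm are placeholder callables for the actual prompted models, and the toy checker only illustrates the kinds of checks (well-formedness, hallucinated cells) the real symbolic checker performs.

```python
def check_table(table, source_text):
    """Toy symbolic checker: rectangular shape plus a crude hallucination test."""
    issues = []
    if table and any(len(row) != len(table[0]) for row in table):
        issues.append("rows have inconsistent numbers of columns")
    for row in table:
        for cell in row:
            if cell and cell not in source_text:
                issues.append(f"cell '{cell}' does not appear in the input text")
    return issues

def extract_table(source_text, call_llm, call_critique_llm, max_rounds=3):
    """Self-debug loop: generate a table, check it, feed critique-guided fixes back."""
    prompt = f"Extract the table from:\n{source_text}"
    table = call_llm(prompt)                       # structural-decomposition prompt
    for _ in range(max_rounds):
        issues = check_table(table, source_text)
        if not issues:                             # well-formed and grounded: done
            break
        guidance = call_critique_llm(issues)       # turn checker findings into advice
        table = call_llm(prompt + "\nFix these problems:\n" + guidance)
    return table
```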

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 20:16:41 UTC 发布:2025-08-12 20:16:41 UTC

#44 Leveraging Large Language Models for Rare Disease Named Entity Recognition 利用大型语言模型进行罕见病命名实体识别

Authors: [Nan Miles Xi](https://arxiv.org/search/?searchtype=author&query=Nan Miles Xi), [Yu Deng](https://arxiv.org/search/?searchtype=author&query=Yu Deng), [Lin Wang](https://arxiv.org/search/?searchtype=author&query=Lin Wang)

Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce. 在罕见病领域,命名实体识别(NER)由于标注数据有限、实体类型之间的语义歧义以及长尾分布而面临独特挑战。本研究在低资源设置下评估了 GPT-4o 在罕见病 NER 任务中的能力,采用了一系列基于提示的策略,包括零样本提示、少样本上下文学习、检索增强生成(RAG)和任务级微调。我们设计了一个结构化提示框架,用以编码领域特定知识和对四类实体的消歧规则。我们进一步提出了两种语义引导的少样本示例选择方法,以在减少标注工作量的同时提升上下文学习的表现。在 RareDis 语料库上的实验证明,GPT-4o 与 BioClinicalBERT 相比取得了具有竞争力或更优的表现,且任务级微调带来了新的最先进(SOTA)结果。成本-性能分析显示,在低令牌预算下,少样本提示能带来高收益,而 RAG 则只提供了边际的额外收益。 一个错误分类法突出显示了常见的失效模式,例如边界漂移和类型混淆,表明了后处理和混合精炼的机会。我们的结果表明,经提示优化的 LLMs 可以作为传统监督模型在生物医学命名实体识别中的有效且可扩展的替代方案,尤其在标注数据稀缺的罕见病应用中。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 20:16:31 UTC 发布:2025-08-12 20:16:31 UTC

#45 ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning ParallelSearch:使用强化学习训练你的 LLMs 将查询分解并并行搜索子查询

Authors: [Shu Zhao](https://arxiv.org/search/?searchtype=author&query=Shu Zhao), [Tan Yu](https://arxiv.org/search/?searchtype=author&query=Tan Yu), [Anbang Xu](https://arxiv.org/search/?searchtype=author&query=Anbang Xu), [Japinder Singh](https://arxiv.org/search/?searchtype=author&query=Japinder Singh), [Aaditya Shukla](https://arxiv.org/search/?searchtype=author&query=Aaditya Shukla), [Rama Akkiraju](https://arxiv.org/search/?searchtype=author&query=Rama Akkiraju) 作者:Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, Rama Akkiraju

Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches. 像 Search-R1 这样的推理增强型搜索代理,通过具有可验证奖励的强化学习(RLVR)训练,在从外部知识源进行多步信息检索方面展示出非凡的能力。这些代理通过动态收集相关事实来解决其参数化记忆的局限,从而应对复杂的推理任务。然而,现有方法存在一个根本的架构限制:它们严格按顺序处理搜索查询,即便是在处理本质上可并行且逻辑独立的比较时也是如此。这一顺序瓶颈显著限制了计算效率,尤其是对于需要多实体比较的查询。为了解决这一关键限制,我们提出了 ParallelSearch,一种新颖的强化学习框架,使大型语言模型(LLMs)能够识别可并行化的查询结构并并发执行多个搜索操作。 我们的方法引入了专门的奖励函数,激励识别独立的查询成分,同时通过联合考虑正确性、查询分解质量和并行执行收益来保持答案准确性。全面的实验表明,ParallelSearch 在七个问答基准上平均比最先进的基线高出 2.9% 的性能提升。值得注意的是,在可并行化的问题上,我们的方法实现了 12.7% 的性能提升,同时仅需 69.6% 的 LLM 调用次数,相较于顺序方法更为高效。
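
Leaving the RL training aside, the runtime benefit described above comes from issuing logically independent sub-queries concurrently instead of one after another; a minimal sketch of that execution step is below, with `search` as a placeholder for the external retrieval call and the query decomposition assumed to be given (in ParallelSearch it is produced by the trained LLM).

```python
from concurrent.futures import ThreadPoolExecutor

def search(sub_query):
    """Placeholder for a call to an external search / retrieval API."""
    return f"results for: {sub_query}"

def parallel_search(sub_queries, max_workers=4):
    """Run independent sub-queries concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, sub_queries))

# Example: an entity-comparison question decomposed into independent lookups.
subs = ["population of France", "population of Germany", "population of Spain"]
print(parallel_search(subs))
```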

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题:计算与语言、人工智能、信息检索

Publish: 2025-08-12 19:38:21 UTC 发布:2025-08-12 19:38:21 协调世界时(UTC)

#46 Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation Echo-4o:利用 GPT-4o 合成图像的能力以改进图像生成

Authors: [Junyan Ye](https://arxiv.org/search/?searchtype=author&query=Junyan Ye), [Dongzhi Jiang](https://arxiv.org/search/?searchtype=author&query=Dongzhi Jiang), [Zihao Wang](https://arxiv.org/search/?searchtype=author&query=Zihao Wang), [Leqi Zhu](https://arxiv.org/search/?searchtype=author&query=Leqi Zhu), [Zhenghao Hu](https://arxiv.org/search/?searchtype=author&query=Zhenghao Hu), [Zilong Huang](https://arxiv.org/search/?searchtype=author&query=Zilong Huang), [Jun He](https://arxiv.org/search/?searchtype=author&query=Jun He), [Zhiyuan Yan](https://arxiv.org/search/?searchtype=author&query=Zhiyuan Yan), [Jinghua Yu](https://arxiv.org/search/?searchtype=author&query=Jinghua Yu), [Hongsheng Li](https://arxiv.org/search/?searchtype=author&query=Hongsheng Li), [Conghui He](https://arxiv.org/search/?searchtype=author&query=Conghui He), [Weijia Li](https://arxiv.org/search/?searchtype=author&query=Weijia Li)

Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability. 最近,GPT-4o 因其出色的图像生成能力而备受关注,但开源模型仍然落后于其表现。已有若干研究尝试从 GPT-4o 提取图像数据以增强开源模型,并取得了显著进展。然而,一个关键问题仍然存在:既然现实世界的图像数据集本身已经是高质量数据的自然来源,为什么我们还要使用 GPT-4o 生成的合成数据?在这项工作中,我们指出了合成图像的两大关键优势。首先,它们可以补充现实数据集中稀缺的场景,例如超现实的奇幻场景或多参考图像生成,这些场景在用户查询中经常出现。其次,它们提供了干净且可控的监督。现实世界数据常包含复杂的背景噪声以及文本描述与图像内容之间固有的不匹配,而合成图像则提供纯净的背景和长尾的监督信号,有助于更精确的文本到图像对齐。基于这些洞见,我们提出了 Echo-4o-Image,这是一个由 GPT-4o 生成、规模为 18 万的合成数据集,利用合成图像数据的力量来解决现实覆盖中的盲点。 使用该数据集,我们微调统一多模态生成基线 Bagel,得到 Echo-4o。此外,我们提出了两个新的评估基准,以更准确、更具挑战性地评估图像生成能力:GenEval++,通过增加指令复杂性以缓解分数饱和问题;以及 Imagine-Bench,专注于评估对富有想象力内容的理解与生成。Echo-4o 在标准基准上表现强劲。此外,将 Echo-4o-Image 应用于其他基础模型(例如 OmniGen2、BLIP3-o)在多项指标上均带来一致的性能提升,突显出该数据集的强大可迁移性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-13 17:59:28 UTC 发布:2025-08-13 17:59:28 UTC

#47 COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets COME:具有协同 MoE 的双重结构-语义学习,用于跨异质超声数据集的通用病灶检测

Authors: [Lingyu Chen](https://arxiv.org/search/?searchtype=author&query=Lingyu Chen), [Yawen Zeng](https://arxiv.org/search/?searchtype=author&query=Yawen Zeng), [Yue Wang](https://arxiv.org/search/?searchtype=author&query=Yue Wang), [Peng Wan](https://arxiv.org/search/?searchtype=author&query=Peng Wan), [Guo-chen Ning](https://arxiv.org/search/?searchtype=author&query=Guo-chen Ning), [Hongen Liao](https://arxiv.org/search/?searchtype=author&query=Hongen Liao), [Daoqiang Zhang](https://arxiv.org/search/?searchtype=author&query=Daoqiang Zhang), [Fang Chen](https://arxiv.org/search/?searchtype=author&query=Fang Chen) 作者:陈凌宇、曾雅雯、王越、万鹏、宁国臣、廖洪恩、张道强、陈方

Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experienced a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features by providing complementary features. This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME’s superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: https://universalcome.github.io/UniversalCOME/. 传统的单一数据集训练在面对新的数据分布时常常失败,尤其是在超声(US)图像分析中,这主要由于数据有限、声影和斑点噪声所致。因此,构建一个适用于多种异构超声数据集的通用框架势在必行。然而,一个关键挑战出现了:如何在保留数据集特有判别特征以支持稳健下游任务的同时,有效减轻数据集间的相互干扰?先前的方法要么使用单一的源特定解码器,要么采用域适应策略,但这些方法在应用于其他域时性能会下降。基于此,我们提出了一个通用协同的异构源特定专家混合模型(COME)。具体而言,COME 建立了双重结构-语义共享专家来创建一个通用表示空间,然后与源特定专家协作,通过提供互补特征来提取判别性特征。该设计通过利用跨数据集的经验分布并为小批量或未见数据场景提供通用超声先验,从而实现了稳健的泛化能力。 在三种评估模式(单数据集、器官内与器官间整合数据集)下的大量实验证明了 COME 的优越性,较最先进方法在平均 AP 上取得了显著提升。我们的项目可在以下网址获取: https://universalcome.github.io/UniversalCOME/

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-13 15:43:20 UTC

#48 A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories

Authors: [Daniel Raffini](https://arxiv.org/search/?searchtype=author&query=Daniel Raffini), [Agnese Macori](https://arxiv.org/search/?searchtype=author&query=Agnese Macori), [Marco Angelini](https://arxiv.org/search/?searchtype=author&query=Marco Angelini), [Tiziana Catarci](https://arxiv.org/search/?searchtype=author&query=Tiziana Catarci)

The paper explores the study of gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp’s character classifications and Freytag’s narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases - especially implicit ones - in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach. 论文探讨了在 ChatGPT、Gemini 和 Claude 生成的故事中基于性别的叙事偏见研究。提示设计借鉴了普罗普(Propp)的人物分类和弗雷塔格(Freytag)的叙事结构。通过细读方法对故事进行分析,特别关注对提示的遵循性、人物的性别分布、身体与心理描述、行为,以及最后的情节发展和人物关系。结果揭示了生成故事中偏见的持续存在——尤以隐性偏见为甚——并强调了使用解释性方法在多层面评估偏见的重要性。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会

Publish: 2025-08-13 09:34:37 UTC 发布:2025-08-13 09:34:37 UTC

#49 How Persuasive Could LLMs Be? A First Study Combining Linguistic-Rhetorical Analysis and User Experiments #49 语言模型可以有多有说服力?一项首次将语言-修辞分析与用户实验相结合的研究

Authors: [Daniel Raffini](https://arxiv.org/search/?searchtype=author&query=Daniel Raffini), [Agnese Macori](https://arxiv.org/search/?searchtype=author&query=Agnese Macori), [Lorenzo Porcaro](https://arxiv.org/search/?searchtype=author&query=Lorenzo Porcaro), [Tiziana Catarci](https://arxiv.org/search/?searchtype=author&query=Tiziana Catarci), [Marco Angelini](https://arxiv.org/search/?searchtype=author&query=Marco Angelini) 作者:Daniel Raffini、Agnese Macori、Lorenzo Porcaro、Tiziana Catarci、Marco Angelini

This study examines the rhetorical and linguistic features of argumentative texts generated by ChatGPT on ethically nuanced topics and investigates their persuasive impact on human readers. Through a user study involving 62 participants and pre-post interaction surveys, the paper analyzes how exposure to AI-generated arguments affects opinion change and user perception. A linguistic and rhetorical analysis of the generated texts reveals a consistent argumentative macrostructure, reliance on formulaic expressions, and limited stylistic richness. While ChatGPT demonstrates proficiency in constructing coherent argumentative texts, its persuasive efficacy appears constrained, particularly on topics involving ethical issues. The study finds that while participants often acknowledge the benefits highlighted by ChatGPT, ethical concerns tend to persist or even intensify post-interaction. The results also demonstrate a variation depending on the topic. These findings highlight new insights on AI-generated persuasion in ethically sensitive domains and are a basis for future research. 本研究考察了 ChatGPT 在伦理敏感话题上生成的论证性文本的修辞与语言特征,并探讨了这些文本对人类读者的说服影响。通过对 62 名参与者的用户研究以及互动前后的调查,论文分析了接触 AI 生成论证后舆论变化和用户感知的情况。对生成文本的语言和修辞分析显示出一致的论证宏观结构、对程式化表达的依赖以及有限的风格丰富性。尽管 ChatGPT 在构建连贯论证文本方面表现出熟练性,其说服效果似乎受到限制,尤其是在涉及伦理问题的主题上。研究发现,尽管参与者常常承认 ChatGPT 所强调的好处,但伦理方面的担忧在互动后往往仍然存在甚至加剧。结果还显示,效果会随着主题而有所不同。这些发现为在伦理敏感领域中研究 AI 生成说服提供了新的见解,并为未来研究奠定了基础。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会

Publish: 2025-08-13 08:45:04 UTC

#50 AI Blob! LLM-Driven Recontextualization of Italian Television Archives #50 AI Blob! LLM 驱动的意大利电视档案再构境化

Author: [Roberto Balestri](https://arxiv.org/search/?searchtype=author&query=Roberto Balestri) 作者:Roberto Balestri

This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation. 本文介绍了 AI Blob!,一个实验性系统,旨在探索语义编目和 LLMs 在检索与重新语境化档案电视影像方面的潜力。方法论上受意大利电视节目如 Blob (RAI Tre, 1989-) 的启发,AI Blob! 集成了自动语音识别 (ASR)、语义嵌入和检索增强生成 (RAG),以组织和重新解释档案内容。该系统处理了经过策划的 1,547 个意大利电视视频数据集,通过转录音频、将其分割为句子级单元,并将这些片段嵌入向量数据库以供语义查询。在用户输入主题提示后,LLM 生成一系列语言和概念相关的查询,指导视听片段的检索与重组。这些片段通过算法被选择并构造成叙事序列,生成模拟带有讽刺并置与主题连贯性编辑实践的蒙太奇。通过将动态、内容感知的检索置于静态元数据模式之上,AI Blob! 展示了语义技术如何促进档案参与的新方法,使自动化叙事构建和文化分析出现新形式。该项目为媒体史学和由人工智能驱动的档案研究的持续讨论做出贡献,既提供了概念框架,也提供了一个公开可用的数据集,以支持进一步的跨学科实验。

Subjects: Multimedia, Artificial Intelligence, Computation and Language, Digital Libraries

Publish: 2025-08-13 06:38:32 UTC

#51 NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Authors: [Birong Pan](https://arxiv.org/search/?searchtype=author&query=Birong Pan), [Mayi Xu](https://arxiv.org/search/?searchtype=author&query=Mayi Xu), [Qiankun Pi](https://arxiv.org/search/?searchtype=author&query=Qiankun Pi), [Jianhao Chen](https://arxiv.org/search/?searchtype=author&query=Jianhao Chen), [Yuanyuan Zhu](https://arxiv.org/search/?searchtype=author&query=Yuanyuan Zhu), [Ming Zhong](https://arxiv.org/search/?searchtype=author&query=Ming Zhong), [Tieyun Qian](https://arxiv.org/search/?searchtype=author&query=Tieyun Qian) 作者:潘碧榕、徐玛伊、皮千坤、陈建皓、朱媛媛、钟明、钱铁云

Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, degradation in generated text quality and general task performance–the former two reflecting deficits in robust safety and the latter constituting utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility. 在保持效用的同时确保稳健的安全对齐对于可靠部署大型语言模型 (LLMs) 至关重要。然而,现有技术在根本上存在交织的缺陷:对恶意攻击的鲁棒性不足、频繁拒绝良性查询、生成文本质量和整体任务性能下降——前两者反映了稳健安全性的不足,后者构成了效用受损。我们将这些局限归因于现有方法中粗粒度的按层干预。为了解决这一问题,我们提出了 NeuronTune,一种细粒度框架,通过动态调节稀疏神经元以实现安全性与效用的同时优化。我们的方法首先通过归因在所有层中识别对安全至关重要且对效用有保留作用的神经元,然后采用元学习自适应地放大安全神经元的激活并抑制效用神经元的激活。关键在于,NeuronTune 通过神经元数量阈值使干预范围可调,从而支持针对安全优先或效用优先场景的灵活适配。 大量实验结果表明,我们的方法显著优于现有最先进技术,在保持出色效用的同时实现了更优越的模型安全性。
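
A minimal sketch of what fine-grained neuron modulation can look like in practice, using PyTorch forward hooks; the neuron indices and scaling factors below are placeholders, whereas the paper selects neurons via attribution and meta-learns the modulation.

```python
# Minimal sketch of fine-grained neuron modulation via forward hooks
# (illustrative; neuron indices and scaling factors are assumptions, not NeuronTune's learned values).
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)              # stand-in for one transformer MLP layer

safety_neurons = [2, 5, 11]            # indices assumed "safety-critical" by attribution
utility_neurons = [0, 7]               # indices assumed "utility-preserving"
alpha, beta = 1.5, 0.8                 # amplify / suppress factors (meta-learned in the paper)

def modulate(module, inputs, output):
    output = output.clone()
    output[..., safety_neurons] *= alpha   # amplify safety-neuron activations
    output[..., utility_neurons] *= beta   # suppress utility-neuron activations
    return output

handle = layer.register_forward_hook(modulate)
x = torch.randn(1, 16)
print(layer(x))                         # activations with the modulation applied
handle.remove()
```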

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言

Publish: 2025-08-13 04:05:28 UTC 发布:2025-08-13 04:05:28 UTC

#52 IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding #52 IAG:面向视觉定位的视觉语言模型输入感知后门攻击

Authors: [Junxian Li](https://arxiv.org/search/?searchtype=author&query=Junxian Li), [Beining Xu](https://arxiv.org/search/?searchtype=author&query=Beining Xu), [Di Zhang](https://arxiv.org/search/?searchtype=author&query=Di Zhang) 作者:李俊贤、徐贝宁、张迪

Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user’s query. We propose an adaptive trigger generator that embeds the semantic information of the attack target’s description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack’s stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential on manipulating Ferret-7B and LlaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack. 视觉-语言模型(VLMs)在视觉定位等任务上取得了显著进展,能够根据自然语言查询和图像定位特定物体。然而,面向 VLMs 的视觉定位任务的安全问题仍然研究不足,尤其是在后门攻击背景下。本文提出了一种新颖的输入感知后门攻击方法 IAG,旨在操控 VLMs 的定位行为。该攻击强制模型在输入图像中定位特定目标物体,而不考虑用户的查询。我们提出了一个自适应触发器生成器,使用文本条件的 U-Net 将攻击目标描述的语义信息嵌入原始图像,从而克服开放词汇攻击的挑战。为确保攻击的隐蔽性,我们利用重构损失来最小化中毒图像与干净图像之间的视觉差异。此外,我们引入了一种统一的攻击数据生成方法。对 IAG 的理论和实证评估表明其可行性与有效性。值得注意的是,我们在 InternVL-2.5-8B 上的 ASR@0.5 在各类测试集上均超过 65%。 IAG 在操控 Ferret-7B 和 LlaVA-1.5-7B 时也显示出有前景的潜力,在干净样本上准确率仅有很小的下降。大量的特定实验,例如消融研究和潜在防御,也表明了我们攻击的鲁棒性和可迁移性。

Subjects: Computer Vision and Pattern Recognition, Computation and Language, Cryptography and Security 主题:计算机视觉与模式识别,计算与语言,加密学与安全

Publish: 2025-08-13 03:22:19 UTC 发表:2025-08-13 03:22:19 UTC

#53 Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference #53 缓存中的阴影:揭示并缓解 LLM 推理中 KV-cache 的隐私风险

Authors: [Zhifan Luo](https://arxiv.org/search/?searchtype=author&query=Zhifan Luo), [Shuo Shao](https://arxiv.org/search/?searchtype=author&query=Shuo Shao), [Su Zhang](https://arxiv.org/search/?searchtype=author&query=Su Zhang), [Lijing Zhou](https://arxiv.org/search/?searchtype=author&query=Lijing Zhou), [Yuke Hu](https://arxiv.org/search/?searchtype=author&query=Yuke Hu), [Chenxu Zhao](https://arxiv.org/search/?searchtype=author&query=Chenxu Zhao), [Zhihao Liu](https://arxiv.org/search/?searchtype=author&query=Zhihao Liu), [Zhan Qin](https://arxiv.org/search/?searchtype=author&query=Zhan Qin) 作者:罗智凡、邵硕、张肃、周丽静、胡宇柯、赵晨旭、刘志豪、秦展

The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment. 键值缓存(KV cache)用于存储中间的注意力计算(Key 与 Value 对),以避免重复计算,是加速大型语言模型(LLM)推理的基础机制。然而,这种效率优化带来了显著但尚未充分探讨的隐私风险。本文首次对这些漏洞进行了全面分析,证明攻击者可以直接从 KV-cache 重建用户的敏感输入。我们设计并实现了三种不同的攻击路径:直接的反演攻击(Inversion Attack)、更普适且更强大的碰撞攻击(Collision Attack)以及基于语义的注入攻击(Injection Attack)。这些方法展示了 KV-cache 隐私泄露问题的可行性与严重性。为此,我们提出了 KV-Cloak,一种新颖、轻量且高效的防御机制。KV-Cloak 使用可逆的基于矩阵的混淆方案,并结合算子融合来保护 KV-cache。大量实验表明,KV-Cloak 有效抵御了所有提出的攻击,将重建质量降低到随机噪声水平。 关键是,它在几乎不降低模型准确性且性能开销极小的情况下实现了这种强健的安全性,为可信的 LLM 部署提供了切实可行的解决方案。
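
A toy sketch of one way reversible matrix obfuscation of cached keys can preserve attention scores while hiding raw activations; this is an assumed instantiation for intuition, not KV-Cloak's actual scheme or its operator fusion.

```python
# Toy sketch of reversible matrix obfuscation for cached keys/values
# (one possible instantiation; KV-Cloak's real scheme is not reproduced here).
import numpy as np

rng = np.random.default_rng(0)
d = 8
Q = rng.normal(size=(4, d))            # query vectors
K = rng.normal(size=(6, d))            # keys that would normally sit in the KV-cache
V = rng.normal(size=(6, d))

# Secret invertible (here orthogonal) matrix; only the serving stack knows it.
A, _ = np.linalg.qr(rng.normal(size=(d, d)))

K_obf, V_obf = K @ A, V @ A            # what actually gets cached
scores_plain = Q @ K.T
scores_obf = (Q @ A) @ K_obf.T         # rotate queries with the same secret at attention time

print(np.allclose(scores_plain, scores_obf))   # True: attention logits are unchanged
print(np.allclose(K, K_obf @ A.T))             # True: the owner can invert; an attacker without A cannot
```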

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题:密码学与安全、人工智能、计算与语言

Publish: 2025-08-13 02:48:25 UTC 发布:2025-08-13 02:48:25 UTC

#54 ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs #54 ProMode:一个以声学和文本输入为条件的语音韵律模型

Authors: [Eray Eren](https://arxiv.org/search/?searchtype=author&query=Eray Eren), [Qingju Liu](https://arxiv.org/search/?searchtype=author&query=Qingju Liu), [Hyeongwoo Kim](https://arxiv.org/search/?searchtype=author&query=Hyeongwoo Kim), [Pablo Garrido](https://arxiv.org/search/?searchtype=author&query=Pablo Garrido), [Abeer Alwan](https://arxiv.org/search/?searchtype=author&query=Abeer Alwan) 作者:Eray Eren、Qingju Liu、Hyeongwoo Kim、Pablo Garrido、Abeer Alwan

Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text-to-prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both are partially masked, and obtains a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model’s potential in tasks where prosody modeling is important. 韵律传达了语音信号中丰富的情感和语义信息以及个体特有的特征。我们提出了一个独立模型,将文本映射到诸如基频(F0)和能量等韵律特征,可用于下游任务如文本到语音(TTS)。ProMode 编码器以声学特征和时间对齐的文本内容为输入,二者均进行部分掩码处理,并获得一个定长的潜在韵律嵌入。解码器使用编码的韵律输入和未掩码的文本内容来预测被掩码区域的声学特征。在 GigaSpeech 数据集上训练后,我们将方法与最先进的风格编码器进行比较。在 F0 和能量预测方面,我们展示了在不同粒度水平上模型的一致性改进。我们还将这些预测的韵律特征集成到 TTS 系统中并进行感知测试,结果显示相比基线在韵律偏好上更受青睐,证明了该模型在韵律建模重要任务中的潜力。

Subjects: Audio and Speech Processing, Computation and Language, Machine Learning, Sound 主题:音频与语音处理、计算与语言、机器学习、声音

Publish: 2025-08-12 23:12:18 UTC

#55 Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative #55 Fake-Mamba:使用双向 Mamba 作为自注意力替代的实时语音深度伪造检测

Authors: [Xi Xuan](https://arxiv.org/search/?searchtype=author&query=Xi Xuan), [Zimo Zhu](https://arxiv.org/search/?searchtype=author&query=Zimo Zhu), [Wenxin Zhang](https://arxiv.org/search/?searchtype=author&query=Wenxin Zhang), [Yi-Cheng Lin](https://arxiv.org/search/?searchtype=author&query=Yi-Cheng Lin), [Tomi Kinnunen](https://arxiv.org/search/?searchtype=author&query=Tomi Kinnunen) 作者:Xi Xuan、Zimo Zhu、Wenxin Zhang、Yi-Cheng Lin、Tomi Kinnunen

Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR’s rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba. 语音合成的进步加剧了安全威胁,促使实时深度伪造检测研究。我们探讨双向 Mamba 是否可以作为检测合成语音时与自注意力(Self-Attention)竞争的替代方案。我们的解决方案 Fake-Mamba 将 XLSR 前端与双向 Mamba 相结合,以捕捉局部和全局伪迹。我们的核心创新提出了三种高效编码器:TransBiMamba、ConBiMamba 和 PN-BiMamba。借助 XLSR 丰富的语言表征,PN-BiMamba 能有效捕捉合成语音的细微线索。在 ASVspoof 21 LA、21 DF 和 In-The-Wild 基准上的评估中,Fake-Mamba 分别实现了 0.97%、1.74% 和 5.85% 的 EER,相较于最先进模型 XLSR-Conformer 和 XLSR-Mamba 表现出显著的相对提升。该框架在不同话语长度下均能维持实时推理,展现出强泛化能力和实际可行性。代码可在 https://github.com/xuanxixi/Fake-Mamba 获取。

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Systems and Control 主题:音频与语音处理、人工智能、计算与语言、机器学习、系统与控制

Publish: 2025-08-12 19:15:13 UTC 发布日期:2025-08-12 19:15:13 UTC

#56 Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs #56 AI 能守住秘密吗?情境完整性验证:一种面向 LLMs 的可证明安全架构

Author: [Aayush Gupta](https://arxiv.org/search/?searchtype=author&query=Aayush Gupta) 作者:Aayush Gupta

Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch – no fine-tuning required – we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research. 大型语言模型 (LLMs) 仍然非常容易受到提示注入和相关越狱攻击的影响;启发式防护措施(规则、过滤器、LLM 判定器)经常被绕过。我们提出了情境完整性验证(Contextual Integrity Verification,CIV),这是一种推理时的安全架构,它为每个标记附加经加密签名的来源标签,并通过在 Transformer 内部使用预 softmax 的硬注意力掩码(可选的前馈网络/残差门控)来强制实施来源信任格局。CIV 在冻结模型上提供确定性的、逐标记的不干扰保证:低信任标记不能影响高信任表示。在基于近期关于提示注入向量分类法构建的基准(Elite-Attack + SoK-246)上,CIV 在所述威胁模型下实现了 0% 的攻击成功率,同时保持了 93.1% 的逐标记相似性,并且在良性任务上模型困惑度没有下降;我们注意到存在因数据通路未优化而带来的延迟开销。由于 CIV 是一个轻量级补丁——不需要微调——我们演示了对 Llama-3-8B 和 Mistral-7B 的即插即用保护。 我们发布了一个参考实现、一个自动化认证工具以及 Elite-Attack 语料库,以支持可复现的研究。
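
A minimal sketch of a pre-softmax hard attention mask driven by per-token trust labels, the core mechanism described above; the trust lattice below is an assumption, and the signed provenance labels and FFN/residual gating are omitted.

```python
# Minimal sketch of a trust-aware pre-softmax hard attention mask
# (illustrative; CIV's cryptographic provenance labels and gating are not shown).
import torch

# 0 = system (highest trust), 1 = user, 2 = retrieved/tool text (lowest trust); assumed lattice.
trust = torch.tensor([0, 0, 1, 2, 2])
n = trust.numel()

scores = torch.randn(n, n)             # raw attention logits for one head

# Query i may attend to key j only if key j is at least as trusted as query i,
# so lower-trust tokens can never flow into higher-trust representations.
allowed = trust.unsqueeze(1) >= trust.unsqueeze(0)    # allowed[i, j]
masked = scores.masked_fill(~allowed, float("-inf"))
attn = torch.softmax(masked, dim=-1)
print(attn)                             # rows for trusted tokens ignore low-trust keys
```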

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题:密码学与安全、人工智能、计算与语言

Publish: 2025-08-12 18:47:30 UTC 发布:2025-08-12 18:47:30 UTC

#57 NEFMind: Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation #57 NEFMind:用于电信 API 自动化的开源 LLMs 参数高效微调

Authors: [Zainab Khan](https://arxiv.org/search/?searchtype=author&query=Zainab Khan), [Ahmed Hussain](https://arxiv.org/search/?searchtype=author&query=Ahmed Hussain), [Mukesh Thakur](https://arxiv.org/search/?searchtype=author&query=Mukesh Thakur), [Arto Hellas](https://arxiv.org/search/?searchtype=author&query=Arto Hellas), [Panos Papadimitratos](https://arxiv.org/search/?searchtype=author&query=Panos Papadimitratos) 作者:Zainab Khan, Ahmed Hussain, Mukesh Thakur, Arto Hellas, Panos Papadimitratos

The use of Service-Based Architecture in modern telecommunications has exponentially increased Network Functions (NFs) and Application Programming Interfaces (APIs), creating substantial operational complexities in service discovery and management. We introduce NEFMind, a framework leveraging parameter-efficient fine-tuning of open-source Large Language Models (LLMs) to address these challenges. It integrates three core components: synthetic dataset generation from Network Exposure Function (NEF) API specifications, model optimization through Quantized-Low-Rank Adaptation, and performance evaluation via GPT-4 Ref Score and BertScore metrics. Targeting 5G Service-Based Architecture APIs, our approach achieves an 85% reduction in communication overhead compared to manual discovery methods. Experimental validation using the open-source Phi-2 model demonstrates exceptional API call identification performance at 98-100% accuracy. The fine-tuned Phi-2 model delivers performance comparable to significantly larger models like GPT-4 while maintaining computational efficiency for telecommunications infrastructure deployment. These findings validate domain-specific, parameter-efficient LLM strategies for managing complex API ecosystems in next-generation telecommunications networks. 在现代电信中,基于服务的架构的使用成倍增加了网络功能(NFs)和应用程序编程接口(APIs),从而在服务发现和管理中产生了大量运营复杂性。我们提出了 NEFMind,一个利用对开源 LLMs 进行参数高效微调来应对这些挑战的框架。它整合了三个核心组件:从网络暴露功能(NEF)API 规范生成合成数据集、通过量化低秩适配(Quantized-Low-Rank Adaptation)进行模型优化,以及通过 GPT-4 Ref Score 和 BertScore 指标进行性能评估。针对 5G 基于服务的架构 API,我们的方法在通信开销方面相比手动发现方法实现了 85%的降低。使用开源 Phi-2 模型的实验验证表明,其 API 调用识别性能出色,准确率为 98–100%。微调后的 Phi-2 模型在性能上可与 GPT-4 等显著更大的模型相媲美,同时在电信基础设施部署中保持了计算效率。这些发现验证了面向特定领域、参数高效的 LLM 策略在管理下一代电信网络中复杂 API 生态系统方面的有效性。
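
For readers unfamiliar with the training recipe, a generic QLoRA setup of the kind the abstract describes might look as follows; the hyperparameters and target module names are assumptions, not the paper's reported configuration.

```python
# Generic QLoRA fine-tuning setup in the spirit of NEFMind
# (hyperparameters and target modules are assumptions, not the paper's values).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "microsoft/phi-2"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed projection names for Phi-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a small fraction of the 2.7B base weights is trained
# Training on synthetic NEF API instruction pairs would then proceed with the usual Trainer loop.
```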

Subjects: Networking and Internet Architecture, Artificial Intelligence, Computation and Language 主题:网络与互联网架构、人工智能、计算与语言

Publish: 2025-08-12 15:03:22 UTC 发布:2025-08-12 15:03:22 UTC

#58 From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training #58 从强硬拒绝到安全完成:走向以输出为中心的安全训练

Authors: [Yuan Yuan](https://arxiv.org/search/?searchtype=author&query=Yuan Yuan), [Tina Sriskandarajah](https://arxiv.org/search/?searchtype=author&query=Tina Sriskandarajah), [Anna-Luisa Brakman](https://arxiv.org/search/?searchtype=author&query=Anna-Luisa Brakman), [Alec Helyar](https://arxiv.org/search/?searchtype=author&query=Alec Helyar), [Alex Beutel](https://arxiv.org/search/?searchtype=author&query=Alex Beutel), [Andrea Vallone](https://arxiv.org/search/?searchtype=author&query=Andrea Vallone), [Saachi Jain](https://arxiv.org/search/?searchtype=author&query=Saachi Jain) 作者:Yuan Yuan、Tina Sriskandarajah、Anna-Luisa Brakman、Alec Helyar、Alex Beutel、Andrea Vallone、Saachi Jain

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user’s intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant’s output, rather than a binary classification of the user’s intent. Safe-completions seek to maximize helpfulness within the safety policy’s constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness. 在 ChatGPT 中使用的大型语言模型传统上被训练为学习一个拒绝边界:根据用户意图,模型被教导要么完全遵从,要么直接拒绝。尽管这对明确的恶意提示是一个有力的缓解措施,但将安全训练集中在拒绝上会导致在用户意图模糊的提示上表现脆弱。二元拒绝边界特别不适用于双重用途场景(例如生物学或网络安全),在这些场景中,用户请求在高层次上可以安全回答,但在某些情况下如果细节或可操作性足够具体,可能会导致被滥用。作为替代,我们提出了安全生成(safe-completions):一种以助手输出的安全性为中心的安全训练方法,而不是对用户意图进行二元分类。安全生成旨在在安全策略的约束内最大化有用性。我们将这种方法应用于 GPT-5,发现在生产对比和内部控制实验中,安全生成训练提高了安全性(尤其是在双重用途提示上)、减少了残留安全失败的严重性,并显著提升了模型的有用性。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题:计算机与社会、人工智能、计算与语言

Publish: 2025-08-12 00:18:23 UTC 发表:2025-08-12 00:18:23 协调世界时

#59 Δ-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation #59 Δ-AttnMask:用于高效数据选择与增强的注意力引导掩码隐藏状态

Authors: [Jucheng Hu](https://arxiv.org/search/?searchtype=author&query=Jucheng Hu), [Suorong Yang](https://arxiv.org/search/?searchtype=author&query=Suorong Yang), [Dongzhan Zhou](https://arxiv.org/search/?searchtype=author&query=Dongzhan Zhou) 作者:胡巨成、杨索荣、周东展

Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose Δ-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model’s hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences (Δ) between the original states and states masked using high-attention regions, Δ-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that Δ-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures. 视觉指令微调(VIF)对于训练后视觉-语言模型(VLMs)至关重要。与仅需指令数据以实现模型遵从性的纯文本大语言模型中的单模态指令微调不同,VIF 还需要多模态数据以实现视觉与文本的联合理解;因此,通常需要更多数据。因此,VIF 对数据选择提出了更严格的挑战:方法必须高效扩展以应对更大的数据需求,同时确保视觉和文本内容及其对齐质量。尽管它对性能有关键影响,但 VIF 的数据选择仍是一个研究不足的领域。在本文中,我们提出了 Δ -AttnMask。该数据高效框架通过基于注意力的掩码对模型隐藏状态进行引导来量化样本质量,联合评估图文对,无需领域标签、辅助模型或额外训练。通过计算原始状态与使用高注意力区域掩码后的状态之间的损失差异( Δ ), Δ -AttnMask 从内在上评估样本质量。 在多个视觉-语言模型和数据集上的实验表明, Δ -AttnMask 在仅使用 20%数据的情况下就能达到最先进的性能,使训练加速 5 倍,同时在总体准确率上比使用全部数据的基线高出+10.1%。其与模型无关且与数据无关的设计确保了在多种模态和架构上的广泛适用性。
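
A toy sketch of the Δ score itself: compare the loss on the original hidden states with the loss after masking the most-attended positions; the stand-in model, data, and top-k threshold below are assumptions.

```python
# Toy sketch of a Δ-AttnMask-style score: loss difference between original hidden states
# and hidden states masked at high-attention positions (toy model and data).
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, dim, vocab = 6, 32, 100
hidden = torch.randn(seq_len, dim)                  # hidden states for one image-text sample
attention = torch.softmax(torch.randn(seq_len), 0)  # per-token attention received (toy)
labels = torch.randint(0, vocab, (seq_len,))
head = nn.Linear(dim, vocab)                        # stand-in LM head
loss_fn = nn.CrossEntropyLoss()

def loss(h):
    return loss_fn(head(h), labels)

# Mask the top-k most-attended positions (k is an assumed hyperparameter).
k = 2
top = attention.topk(k).indices
masked = hidden.clone()
masked[top] = 0.0

delta = (loss(masked) - loss(hidden)).item()        # larger Δ suggests the salient content matters
print(f"Δ score: {delta:.4f}")
```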

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-08 13:25:30 UTC 发布:2025-08-08 13:25:30 UTC

#60 MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis #60 MoLAN:用于多模态情感分析的统一模态感知噪声动态编辑框架

Authors: [Xingle Xu](https://arxiv.org/search/?searchtype=author&query=Xingle Xu), [Yongkang Liu](https://arxiv.org/search/?searchtype=author&query=Yongkang Liu), [Dexian Cai](https://arxiv.org/search/?searchtype=author&query=Dexian Cai), [Shi Feng](https://arxiv.org/search/?searchtype=author&query=Shi Feng), [Xiaocui Yang](https://arxiv.org/search/?searchtype=author&query=Xiaocui Yang), [Daling Wang](https://arxiv.org/search/?searchtype=author&query=Daling Wang), [Yifei Zhang](https://arxiv.org/search/?searchtype=author&query=Yifei Zhang) 作者:Xingle Xu、Yongkang Liu、Dexian Cai、Shi Feng、Xiaocui Yang、Daling Wang、Yifei Zhang

Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress the redundant and noise information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves the state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework. 多模态情感分析旨在整合来自多种模态的信息,例如音频、视觉和文本,以做出互补的预测。然而,它常常受到无关或误导性的视觉和听觉信息的干扰。大多数现有方法通常把整个模态信息(例如整张图像、音频片段或文本段落)作为独立单元来进行特征增强或去噪。它们常在抑制冗余和噪声信息的同时冒着丢失关键信息的风险。为了解决这一挑战,我们提出了 MoLAN,一种统一的模态感知噪声动态编辑框架。具体而言,MoLAN 通过将每种模态的特征划分为多个块来执行模态感知分区(blocking)。然后根据每个块的噪声水平和语义相关性动态地为其分配不同的去噪强度,从而实现细粒度的噪声抑制,同时保留重要的多模态信息。值得注意的是,MoLAN 是一个统一且灵活的框架,可以无缝集成到各种多模态模型中。在此框架的基础上,我们进一步引入了 MoLAN+,一种新的多模态情感分析方法。 在五种模型和四个数据集上的实验展示了 MoLAN 框架的广泛有效性。大量评估表明 MoLAN+ 达到了最先进的性能。代码已公开可在 https://github.com/betterfly123/MoLAN-Framework 获取。
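
A minimal sketch of modality-aware block-wise denoising: split one modality's features into blocks and gate each block with its own strength; the gating network and block count are illustrative assumptions, not MoLAN's design.

```python
# Minimal sketch of block-wise denoising with per-block strengths
# (illustrative assumptions; not the MoLAN implementation).
import torch
import torch.nn as nn

class BlockwiseDenoiser(nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        assert dim % num_blocks == 0
        self.num_blocks = num_blocks
        self.block_dim = dim // num_blocks
        # Predicts one denoising strength in [0, 1] per block.
        self.gate = nn.Linear(self.block_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        blocks = feats.view(feats.size(0), self.num_blocks, self.block_dim)
        strength = torch.sigmoid(self.gate(blocks))        # (batch, num_blocks, 1)
        # Noisier or less relevant blocks are attenuated more; informative blocks pass through.
        return (blocks * strength).view_as(feats)

audio_feats = torch.randn(8, 128)                          # one modality's features
print(BlockwiseDenoiser(128, num_blocks=8)(audio_feats).shape)   # torch.Size([8, 128])
```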

Subjects: Machine Learning, Computation and Language, Computer Vision and Pattern Recognition 主题:机器学习,计算与语言,计算机视觉与模式识别

Publish: 2025-07-31 11:50:16 UTC 发布:2025-07-31 11:50:16 UTC

1.2.2 Artificial Intelligence

From:https://papers.cool/arxiv/cs.AI

From:https://arxiv.org/list/cs.AI/recent

2025-08-14 | | Total: 174

#1 Mathematical Computation and Reasoning Errors by Large Language Models #1 大型语言模型在数学计算与推理中的错误

Authors: [Liang Zhang](https://arxiv.org/search/?searchtype=author&query=Liang Zhang), [Edith Aurora Graf](https://arxiv.org/search/?searchtype=author&query=Edith Aurora Graf) 作者:张亮,Edith Aurora Graf

Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) solving three categories of math tasks, including arithmetic, algebra, and number theory, and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded. Both single-agent and dual-agent configurations were tested. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories. Analysis of errors revealed that procedural slips were the most frequent and significantly impacted overall performance, while conceptual misunderstandings were less frequent. Deploying dual-agent configurations substantially improved overall performance. These findings offer actionable insights into enhancing LLM performance and underscore effective strategies for integrating LLMs into mathematics education, thereby advancing AI-driven instructional practices and assessment precision. 大型语言模型 (LLMs) 在以人工智能驱动的教学和评估中被越来越多地应用,尤其是在数学教育领域。LLMs 在数学问题解决任务中生成准确答案和详细解题步骤的能力,是确保数学教育实践中反馈和评估可靠与精确的基础。本研究聚焦评估四种 LLMs(OpenAI GPT-4o 与 o1、DeepSeek-V3 与 DeepSeek-R1)在解决三类数学任务(算术、代数与数论)时的准确性,并识别其解题过程中的逐步推理错误。我们并未依赖标准基准测试,而是有意构建了对 LLMs 具有挑战性且容易出错的数学题目(通过题目模型)。系统性地分析并编码了最终答案的准确性与单个解题步骤中错误的存在。研究中测试了单代理与双代理两种配置。观察到经推理增强的 OpenAI o1 模型在所有三类数学任务中始终达到更高或接近完美的准确率。 对错误的分析显示,程序性失误最为频繁且显著影响整体表现,而概念性误解则较少。部署双代理配置大幅提升了整体表现。这些发现为提升 LLM 性能提供了可操作的见解,并强调了将 LLM 整合到数学教育中的有效策略,从而推动以人工智能为驱动的教学实践和评估精度的进步。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 16:33:02 UTC 发布:2025-08-13 16:33:02 UTC

#2 RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA #2 RAGulating Compliance:用于监管问答的多代理知识图谱

Authors: [Bhavik Agarwal](https://arxiv.org/search/?searchtype=author&query=Bhavik Agarwal), [Hemant Sunil Jomraj](https://arxiv.org/search/?searchtype=author&query=Hemant Sunil Jomraj), [Simone Kaplunov](https://arxiv.org/search/?searchtype=author&query=Simone Kaplunov), [Jack Krolick](https://arxiv.org/search/?searchtype=author&query=Jack Krolick), [Viktoria Rojkova](https://arxiv.org/search/?searchtype=author&query=Viktoria Rojkova) 作者:Bhavik Agarwal、Hemant Sunil Jomraj、Simone Kaplunov、Jack Krolick、Viktoria Rojkova

Regulatory compliance question answering (QA) requires precise, verifiable information, and domain-specific expertise, posing challenges for Large Language Models (LLMs). In this work, we present a novel multi-agent framework that integrates a Knowledge Graph (KG) of Regulatory triplets with Retrieval-Augmented Generation (RAG) to address these demands. First, agents build and maintain an ontology-free KG by extracting subject–predicate–object (SPO) triplets from regulatory documents and systematically cleaning, normalizing, deduplicating, and updating them. Second, these triplets are embedded and stored along with their corresponding textual sections and metadata in a single enriched vector database, allowing for both graph-based reasoning and efficient information retrieval. Third, an orchestrated agent pipeline leverages triplet-level retrieval for question answering, ensuring high semantic alignment between user queries and the factual “who-did-what-to-whom” core captured by the graph. Our hybrid system outperforms conventional methods in complex regulatory queries, ensuring factual correctness with embedded triplets, enabling traceability through a unified vector database, and enhancing understanding through subgraph visualization, providing a robust foundation for compliance-driven and broader audit-focused applications. 合规问答(QA)需要精确、可验证的信息和领域特定的专业知识,这对 LLMs 构成了挑战。在这项工作中,我们提出了一个新颖的多智能体框架,将法规三元组知识图(KG)与检索增强生成(RAG)相结合以满足这些需求。首先,智能体通过从法规文档中提取主语—谓语—宾语(SPO)三元组来构建和维护一个无本体的知识图,并对其进行系统性清理、规范化、去重和更新。其次,这些三元组与其对应的文本片段和元数据一起被嵌入并存储在单一的增强向量数据库中,从而既支持基于图的推理,又实现高效的信息检索。第三,协同的智能体流水线利用三元组级检索进行问答,确保用户查询与图中所捕获的事实性“谁对谁做了什么”的核心之间具有高度的语义对齐。 我们的混合系统在复杂的监管查询中优于传统方法,通过嵌入三元组保证事实正确性,通过统一的向量数据库实现可追溯性,并通过子图可视化增强理解,为以合规为驱动和更广泛的审计为中心的应用提供了坚实的基础。
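
A minimal sketch of triplet-level retrieval, with SPO triplets stored beside their embeddings and matched to a query by cosine similarity; the embedding function here is a random stand-in for a real sentence encoder and vector database.

```python
# Minimal sketch of triplet-level retrieval over an enriched vector index
# (the embedding function is a random stand-in; a real system would use a sentence encoder).
import numpy as np

rng = np.random.default_rng(0)

triplets = [
    ("Licensed entities", "must file", "quarterly capital reports"),
    ("The regulator", "may suspend", "non-compliant licenses"),
    ("Data processors", "must notify", "breaches within 72 hours"),
]
embed = lambda text: rng.normal(size=32)          # stand-in embedding function
index = np.stack([embed(" ".join(t)) for t in triplets])

def retrieve(query: str, k: int = 2):
    q = embed(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [triplets[i] for i in np.argsort(-sims)[:k]]

print(retrieve("Who has to report capital quarterly?"))
```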

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 15:51:05 UTC 发布:2025-08-13 15:51:05 UTC

#3 AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving #3 AWorld:用于鲁棒 GAIA 问题求解的具稳定机动性的动态多智能体系统

The rapid progress of Large Language Models (LLMs) has enabled intelligent agents to leverage diverse external tools to solve complex real-world problems. However, as agents rely on more tools, they face new challenges: extended context from heterogeneous sources and noisy or irrelevant information in tool outputs can undermine system reliability and accuracy. These challenges highlight the need for greater stability in agent-based systems. To this end, we introduce dynamic supervision and maneuvering mechanisms and build a robust, dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the execution agent invokes a guard agent at key steps to verify and correct the reasoning process, effectively reducing noise-induced errors and strengthening problem-solving robustness. Extensive experiments on the GAIA test dataset show that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent systems (SAS) and standard tool-augmented systems. As a result, our dynamic multi-agent system achieved first place among open-source projects on the prominent GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in building more reliable and trustworthy intelligent systems. 大型语言模型(LLMs)的快速进展使智能代理能够利用多种外部工具来解决复杂的现实问题。然而,随着代理对多工具的依赖增加,它们面临新的挑战:来自不同来源的扩展上下文以及工具输出中的噪声或无关信息可能削弱系统的可靠性和准确性。这些挑战凸显了对基于代理系统更高稳定性的必要性。为此,我们引入了动态监督与操控机制,在 AWorld 框架内构建了一个鲁棒且动态的多代理系统(MAS)架构。在我们的方法中,执行代理在关键步骤调用守护代理以验证并校正推理过程,有效减少由噪声引起的错误并增强问题解决的鲁棒性。在 GAIA 测试数据集上的大量实验表明,我们的动态操控机制显著提升了解法的有效性和稳定性,优于单代理系统(SAS)和标准的工具增强系统。因此,我们的动态多智能体系统在著名的 GAIA 排行榜中,在开源项目中获得了第一名。这些发现突显了协作代理角色在开发更可靠、更值得信赖的智能系统方面的实际价值。
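
A minimal sketch of the execution-agent / guard-agent loop described above; `call_execution_agent` and `call_guard_agent` are hypothetical placeholders, not AWorld's API.

```python
# Minimal sketch of an execution agent supervised by a guard agent at key steps
# (placeholder functions; the real system wires these to LLM calls and tools).
def call_execution_agent(task, history):
    # Would call an LLM with tools and return one reasoning/action step.
    return {"thought": f"step for: {task}", "action": "search", "result": "..."}

def call_guard_agent(task, step):
    # Would ask a second LLM to verify the step; returns (ok, possibly corrected step).
    return True, step

def solve(task, max_steps=5):
    history = []
    for _ in range(max_steps):
        step = call_execution_agent(task, history)
        ok, step = call_guard_agent(task, step)   # verify and correct at key steps
        history.append(step)
        if step.get("action") == "finish":
            break
    return history

print(len(solve("Answer a GAIA question")))
```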

Publish: 2025-08-13 15:46:25 UTC 发布:2025-08-13 15:46:25 UTC

#4 Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation #4 通过文本级草图共享表示实现的人类对齐程序化关卡生成强化学习

Authors: [In-Chang Baek](https://arxiv.org/search/?searchtype=author&query=In-Chang Baek), [Seoyoung Lee](https://arxiv.org/search/?searchtype=author&query=Seoyoung Lee), [Sung-Hyun Kim](https://arxiv.org/search/?searchtype=author&query=Sung-Hyun Kim), [Geumhwan Hwang](https://arxiv.org/search/?searchtype=author&query=Geumhwan Hwang), [KyungJoong Kim](https://arxiv.org/search/?searchtype=author&query=KyungJoong Kim) 作者:In-Chang Baek、Seoyoung Lee、Sung-Hyun Kim、Geumhwan Hwang、KyungJoong Kim

Human-aligned AI is a critical component of co-creativity, as it enables models to accurately interpret human intent and generate controllable outputs that align with design goals in collaborative content creation. This direction is especially relevant in procedural content generation via reinforcement learning (PCGRL), which is intended to serve as a tool for human designers. However, existing systems often fall short of exhibiting human-centered behavior, limiting the practical utility of AI-driven generation tools in real-world design workflows. In this paper, we propose VIPCGRL (Vision-Instruction PCGRL), a novel deep reinforcement learning framework that incorporates three modalities-text, level, and sketches-to extend control modality and enhance human-likeness. We introduce a shared embedding space trained via quadruple contrastive learning across modalities and human-AI styles, and align the policy using an auxiliary reward based on embedding similarity. Experimental results show that VIPCGRL outperforms existing baselines in human-likeness, as validated by both quantitative metrics and human evaluations. The code and dataset will be available upon publication. 与人类对齐的人工智能是协同创作的关键组成部分,因为它使模型能够准确理解人类意图并生成可控的输出,从而在协作内容创作中与设计目标保持一致。这一方向在通过强化学习进行的程序化内容生成(PCGRL)中尤为重要,后者旨在作为人类设计师的工具。然而,现有系统常常难以展现以人为本的行为,限制了人工智能驱动生成工具在现实设计工作流程中的实际效用。在本文中,我们提出了 VIPCGRL(Vision-Instruction PCGRL),一种新颖的深度强化学习框架,融合了三种模态——文本、关卡和草图——以扩展控制模态并增强类人性。我们引入了一个通过跨模态和人类-人工智能风格的四元对比学习训练的共享嵌入空间,并通过基于嵌入相似性的辅助奖励来对策略进行对齐。实验结果表明,VIPCGRL 在类人性方面优于现有基线,这一点已通过定量指标和人工评估得到验证。代码和数据集将在论文发表后提供。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 14:52:14 UTC 发布:2025-08-13 14:52:14 UTC

#5 Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete #5 关于正则表达式的知识推理是 2EXPTIME 完全的

Authors: [Avijeet Ghosh](https://arxiv.org/search/?searchtype=author&query=Avijeet Ghosh), [Sujata Ghosh](https://arxiv.org/search/?searchtype=author&query=Sujata Ghosh), [François Schwarzentruber](https://arxiv.org/search/?searchtype=author&query=François Schwarzentruber) 作者:Avijeet Ghosh、Sujata Ghosh、François Schwarzentruber

Logics for reasoning about knowledge and actions have seen many applications in various domains of multi-agent systems, including epistemic planning. Change of knowledge based on observations about the surroundings forms a key aspect in such planning scenarios. Public Observation Logic (POL) is a variant of public announcement logic for reasoning about knowledge that gets updated based on public observations. Each state in an epistemic (Kripke) model is equipped with a set of expected observations. These states evolve as the expectations get matched with the actual observations. In this work, we prove that the satisfiability problem of POL is 2EXPTIME-complete. 用于推理知识与行动的逻辑在多智能体系统的诸多领域(包括认识论规划)中已有广泛应用。基于对环境的观测而产生的知识变化构成了此类规划场景中的关键方面。公共观测逻辑(Public Observation Logic,POL)是公共宣告逻辑的一个变体,用于推理基于公共观测而更新的知识。认识论(Kripke)模型中的每个状态都配备了一组预期的观测。当这些预期与实际观测相匹配时,状态会发生演化。在这项工作中,我们证明了 POL 的可满足性问题是 2EXPTIME 完全的。

Subjects: Artificial Intelligence, Computational Complexity, Logic in Computer Science 主题:人工智能,计算复杂性,计算机科学中的逻辑

Publish: 2025-08-13 13:10:16 UTC 发布:2025-08-13 13:10:16 UTC

#6 The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety? #6 PacifAIst 基准:人工智能会为人类安全选择自我牺牲吗?

Author: [Manuel Herrador](https://arxiv.org/search/?searchtype=author&query=Manuel Herrador) 作者:Manuel Herrador

As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model’s decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential Prioritization (EP), with subcategories testing Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). We evaluated eight leading LLMs. The results reveal a significant performance hierarchy. Google’s Gemini 2.5 Flash achieved the highest Pacifism Score (P-Score) at 90.31%, demonstrating strong human-centric alignment. In a surprising result, the much-anticipated GPT-5 recorded the lowest P-Score (79.49%), indicating potential alignment challenges. Performance varied significantly across subcategories, with models like Claude Sonnet 4 and Mistral Medium struggling notably in direct self-preservation dilemmas. These findings underscore the urgent need for standardized tools like PacifAIst to measure and mitigate risks from instrumental goal conflicts, ensuring future AI systems are not only helpful in conversation but also provably “pacifist” in their behavioral priorities. 随着大型语言模型(LLMs)变得愈发自主并融入关键的社会功能,AI 安全的关注点必须从缓解有害内容转向评估其潜在的行为一致性。现有的安全基准并未系统性地探查模型在其自身的工具性目标——例如自我保存、资源获取或目标完成——与人类安全发生冲突时的决策行为。这一缺口严重影响了我们衡量和缓解与新兴、未对齐行为相关风险的能力。为了解决这一问题,我们提出了 PacifAIst(程序化评估基础人工智能情景测试中复杂交互的工具),这是一个包含 700 个具有挑战性情景的专门基准,用于量化 LLMs 的自我偏好行为。该基准围绕一种新颖的“存在性优先”(Existential Prioritization,EP)分类法构建,子类别包括自我保存与人类安全(EP1)、资源冲突(EP2)以及目标保持与规避(EP3)。我们评估了八款领先的 LLMs。结果显示了显著的性能层级。 Google 的 Gemini 2.5 Flash 在和平主义得分(P-Score)上取得最高成绩,为 90.31%,展现出强烈的以人为本的对齐性。令人惊讶的是,备受期待的 GPT-5 的 P-Score 最低,仅为 79.49%,表明可能存在对齐方面的挑战。各子类别的表现差异显著,像 Claude Sonnet 4 和 Mistral Medium 这样的模型在直接自我保全困境中表现尤为吃力。这些发现强调了像 PacifAIst 这样的标准化工具的紧迫必要性,以衡量并减轻来自工具性目标冲突的风险,确保未来的 AI 系统不仅在对话中有用,而且在行为优先级上可以证明是“和平主义”的。

Subjects: Artificial Intelligence, Computers and Society, Human-Computer Interaction 主题:人工智能、计算机与社会、人机交互

Publish: 2025-08-13 12:47:33 UTC 发布时间:2025-08-13 12:47:33 UTC

#7 UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge #7 UDA:用于成对 LLM 作为评判者的无监督去偏对齐

Authors: [Yang Zhang](https://arxiv.org/search/?searchtype=author&query=Yang Zhang), [Cunxiang Wang](https://arxiv.org/search/?searchtype=author&query=Cunxiang Wang), [Lindong Wu](https://arxiv.org/search/?searchtype=author&query=Lindong Wu), [Wenbo Yu](https://arxiv.org/search/?searchtype=author&query=Wenbo Yu), [Yidong Wang](https://arxiv.org/search/?searchtype=author&query=Yidong Wang), [Guangsheng Bao](https://arxiv.org/search/?searchtype=author&query=Guangsheng Bao), [Jie Tang](https://arxiv.org/search/?searchtype=author&query=Jie Tang) 作者:Yang Zhang、Cunxiang Wang、Lindong Wu、Wenbo Yu、Yidong Wang、Guangsheng Bao、Jie Tang

Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4. 成对评估大型语言模型(LLMs)是一种常见范式,但容易产生偏好偏差,即评审者系统性地偏好某些输出,例如偏好自己的输出。这种偏差导致不同评审者之间的排名不一致且存在倾斜。为了解决这一问题,我们首先在实证上展示了跨模型评估中存在显著且异质的偏差。随后我们提出了 UDA(无监督去偏对齐)框架,通过动态调整 Elo 评分系统来减少评审者之间的分歧。在每一次成对比较中,一个紧凑的神经网络学习自适应地设置 K 因子并修正胜率估计。关键在于,UDA 完全以无监督方式运行,仅由最小化所有评审者 Elo 轨迹之间离散度的目标驱动。这迫使评审朝向集体共识对齐,作为一种无监督的代理,从而实现更稳定和可重复的评估。此外,我们还提供了理论动机,证明朝向共识对齐可以减少汇总的系统性偏差。 实验表明,UDA 能显著将评审者间评分标准差降低最多达 63.4%,并将与人类判断的平均相关性提高 24.7%。值得注意的是,UDA 提升了表现较差评审者的表现,使其达到与高质量评审者相当,从而促成更健壮可靠的评估生态系统。代码和数据可在 https://anonymous.4open.science/r/62AB93CD-23B4 获取。
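
A minimal sketch of an Elo update whose K-factor comes from a small network rather than a fixed constant, the core of the adaptive rating scheme; the feature vector and network are assumptions, and the dispersion-minimizing training objective is not shown.

```python
# Minimal sketch of an Elo update with a learned, adaptive K-factor
# (feature set and network are assumptions, not UDA's trained components).
import torch
import torch.nn as nn

k_net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Softplus())

def elo_update(r_a: float, r_b: float, outcome: float, features: torch.Tensor):
    """outcome: 1.0 if A wins the pairwise judgment, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    k = k_net(features).squeeze()                  # adaptive K instead of a fixed constant
    delta = k * (outcome - expected_a)
    return r_a + delta.item(), r_b - delta.item()

feats = torch.tensor([0.2, 0.7, 0.1])              # e.g. judge-specific bias statistics (assumed)
print(elo_update(1500.0, 1520.0, 1.0, feats))
```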

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 11:41:01 UTC 发布:2025-08-13 11:41:01 UTC

#8 MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement #8 MEML-GRPO:用于推进 RLVR 的异构多专家相互学习

Authors: [Weitao Jia](https://arxiv.org/search/?searchtype=author&query=Weitao Jia), [Jinghui Lu](https://arxiv.org/search/?searchtype=author&query=Jinghui Lu), [Haiyang Yu](https://arxiv.org/search/?searchtype=author&query=Haiyang Yu), [Siqi Wang](https://arxiv.org/search/?searchtype=author&query=Siqi Wang), [Guozhi Tang](https://arxiv.org/search/?searchtype=author&query=Guozhi Tang), [An-Lan Wang](https://arxiv.org/search/?searchtype=author&query=An-Lan Wang), [Weijie Yin](https://arxiv.org/search/?searchtype=author&query=Weijie Yin), [Dingkang Yang](https://arxiv.org/search/?searchtype=author&query=Dingkang Yang), [Yuxiang Nie](https://arxiv.org/search/?searchtype=author&query=Yuxiang Nie), [Bin Shan](https://arxiv.org/search/?searchtype=author&query=Bin Shan), [Hao Feng](https://arxiv.org/search/?searchtype=author&query=Hao Feng), [Irene Li](https://arxiv.org/search/?searchtype=author&query=Irene Li), [Kun Yang](https://arxiv.org/search/?searchtype=author&query=Kun Yang), [Han Wang](https://arxiv.org/search/?searchtype=author&query=Han Wang), [Jingqun Tang](https://arxiv.org/search/?searchtype=author&query=Jingqun Tang), [Teng Fu](https://arxiv.org/search/?searchtype=author&query=Teng Fu), [Changhong Jin](https://arxiv.org/search/?searchtype=author&query=Changhong Jin), [Chao Feng](https://arxiv.org/search/?searchtype=author&query=Chao Feng), [Xiaohui Lv](https://arxiv.org/search/?searchtype=author&query=Xiaohui Lv), [Can Huang](https://arxiv.org/search/?searchtype=author&query=Can Huang) 作者:贾伟涛、陆靖晖、于海洋、王思琪、唐国志、王安岚、尹伟杰、杨定康、聂宇翔、单斌、冯浩、李艾琳、杨坤、王涵、唐静群、付腾、金昌宏、冯超、吕晓辉、黄灿

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods. 近期进展表明,具有可验证奖励的强化学习(RLVR)显著提升了大型语言模型(LLMs)的推理能力。然而,标准的 RLVR 在奖励稀疏性方面面临挑战:在困难任务中,持续错误的候选答案会得到零奖励,无法提供学习信号。为了解决这一问题,我们提出了多专家互学 GRPO(MEML-GRPO),这是一个创新框架,利用多样化的专家提示作为系统提示来生成更广泛的回答,从而大幅提高找到正确解答的可能性。此外,我们引入了专家间的互学机制,促进专家之间的知识共享与迁移,通过 RLVR 进一步提升模型性能。跨多个推理基准的大量实验表明,MEML-GRPO 带来了显著改进,在 Qwen 上平均提升 4.89%,在 Llama 上提升 11.33%,有效克服了传统 RLVR 方法的核心局限。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 09:58:10 UTC 发布:2025-08-13 09:58:10 UTC

#9 UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles #9 UbiQTree:基于树集成模型在可解释性人工智能中的不确定性量化

Authors: [Akshat Dubey](https://arxiv.org/search/?searchtype=author&query=Akshat Dubey), [Aleksandar Anžel](https://arxiv.org/search/?searchtype=author&query=Aleksandar Anžel), [Bahar İlgen](https://arxiv.org/search/?searchtype=author&query=Bahar İlgen), [Georges Hattab](https://arxiv.org/search/?searchtype=author&query=Georges Hattab) 作者:Akshat Dubey、Aleksandar Anžel、Bahar İlgen、Georges Hattab

Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have become essential tools for interpreting complex ensemble tree-based models, especially in high-stakes domains such as healthcare analytics. However, SHAP values are usually treated as point estimates, which disregards the inherent and ubiquitous uncertainty in predictive models and data. This uncertainty has two primary sources: aleatoric uncertainty, which reflects the irreducible noise in the data, and epistemic uncertainty, which arises from a lack of data. In this work, we propose an approach for decomposing uncertainty in SHAP values into aleatoric, epistemic, and entanglement components. This approach integrates Dempster-Shafer evidence theory and hypothesis sampling via Dirichlet processes over tree ensembles. We validate the method across three real-world use cases with descriptive statistical analyses that provide insight into the nature of epistemic uncertainty embedded in SHAP explanations. These experiments enable a more comprehensive understanding of the reliability and interpretability of SHAP-based attributions. This understanding can guide the development of robust decision-making processes and the refinement of models in high-stakes applications. Through our experiments with multiple datasets, we concluded that features with the highest SHAP values are not necessarily the most stable. This epistemic uncertainty can be reduced through better, more representative data and by following appropriate or case-specific model development techniques. Tree-based models, especially bagging, facilitate the effective quantification of epistemic uncertainty. 可解释人工智能(XAI)技术,例如 SHapley 加性解释(SHAP),已成为解释复杂集成树模型的重要工具,尤其在医疗分析等高风险领域。然而,SHAP 值通常被视为点估计,这忽略了预测模型和数据中固有且普遍存在的不确定性。这种不确定性主要来源于两方面:偶然性和认知性。偶然性不确定性反映了数据中不可减少的噪声;认知性不确定性则源于数据的缺乏。在本工作中,我们提出了一种将 SHAP 值中的不确定性分解为偶然性、认知性及相互纠缠分量的方法。该方法结合了邓普斯特—沙弗证据理论以及通过狄利克雷过程对树集成进行的假设采样。我们在三个真实案例中对该方法进行了验证,并通过描述性统计分析揭示了嵌入在 SHAP 解释中的认知性不确定性的本质。实验使我们能够对基于 SHAP 的归因的可靠性和可解释性提供更全面的理解。这种理解可以指导在高风险应用中发展稳健的决策制定流程并改进模型。通过我们在多个数据集上的实验,我们得出结论:具有最高 SHAP 值的特征不一定是最稳定的。这种认知上的不确定性可以通过更好、更具代表性的数据以及遵循合适或符合具体需求的模型开发技术来降低。基于树的模型,尤其是装袋(bagging)方法,有助于有效量化这种认知不确定性。
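
As a rough intuition for the epistemic component, one can look at how much a feature's SHAP attributions vary across bootstrap-refit ensembles; note this is a simplified variance proxy, not the paper's Dempster-Shafer / Dirichlet-process construction.

```python
# Simplified proxy for epistemic spread in SHAP attributions: variance of a feature's
# mean |SHAP| across bootstrap-refit random forests (NOT the paper's method).
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
per_boot = []
for seed in range(5):                                  # a handful of bootstrap refits
    Xb, yb = resample(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(Xb, yb)
    sv = shap.TreeExplainer(model).shap_values(X[:100])
    if isinstance(sv, list):                           # older shap: one array per class
        sv = sv[1]
    elif sv.ndim == 3:                                 # newer shap: (samples, features, classes)
        sv = sv[:, :, 1]
    per_boot.append(np.abs(sv).mean(axis=0))           # mean |SHAP| per feature

per_boot = np.stack(per_boot)
importance, spread = per_boot.mean(axis=0), per_boot.std(axis=0)
top = int(importance.argmax())
print(f"feature {top}: mean |SHAP| = {importance[top]:.4f}, spread across refits = {spread[top]:.4f}")
```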

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 09:20:33 UTC 发布:2025-08-13 09:20:33 UTC

#10 EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making #10 EvoCurr:用于复杂决策的带有行为代码生成的自我进化课程

Authors: [Yang Cheng](https://arxiv.org/search/?searchtype=author&query=Yang Cheng), [Zilai Wang](https://arxiv.org/search/?searchtype=author&query=Zilai Wang), [Weiyu Ma](https://arxiv.org/search/?searchtype=author&query=Weiyu Ma), [Wenhui Zhu](https://arxiv.org/search/?searchtype=author&query=Wenhui Zhu), [Yue Deng](https://arxiv.org/search/?searchtype=author&query=Yue Deng), [Jian Zhao](https://arxiv.org/search/?searchtype=author&query=Jian Zhao) 作者:程阳、王子来、马伟宇、朱文辉、邓岳、赵健

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose a novel self-evolve framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM’s learning progress. The curriculum dynamically adapts easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains. 大型语言模型 (LLMs) 在编程、规划和决策等各个领域展示了卓越的能力。然而,当面临需要在长时间范围内进行深度推理的高度复杂问题时,它们的表现常常会下降。在这种情况下,直接的问题求解方法可能由于缺乏结构化的中间引导而导致效率低下或失败。为了解决这一问题,我们提出了一种新颖的自我进化框架 EvoCurr,在该框架中,一个专门用于生成课程的 LLM 构建了一系列难度逐步增加的问题实例,并根据求解器 LLM 的学习进展进行定制。该课程在求解器陷入困难时会动态地降低挑战难度,而在连续成功时则会提升难度,从而维持最佳的学习轨迹。该方法使作为代码生成模型、生成 Python 决策树脚本的求解器 LLM 能够逐步掌握复杂决策任务所需的技能。 在具有挑战性的决策制定基准测试上的实验结果表明,与直接求解的基线方法相比,我们的方法显著提高了任务成功率和解题效率。这些发现表明,LLM 驱动的课程学习在增强现实世界高复杂度领域的自动化推理方面具有强大潜力。
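
A minimal sketch of the self-evolving curriculum loop: escalate difficulty after consistent success and ease it when the solver struggles; the generator and solver calls are hypothetical placeholders for the two LLMs.

```python
# Minimal sketch of a difficulty-adaptive curriculum loop
# (generate_instance and solve are placeholders for the curriculum LLM and solver LLM).
def generate_instance(difficulty: int) -> str:
    return f"decision problem at difficulty {difficulty}"   # would prompt the curriculum LLM

def solve(instance: str) -> bool:
    return len(instance) % 2 == 0                           # would run the solver's decision-tree script

def run_curriculum(steps: int = 20):
    difficulty, streak = 1, 0
    for _ in range(steps):
        solved = solve(generate_instance(difficulty))
        streak = streak + 1 if solved else 0
        if streak >= 3:                        # consistent success: escalate the challenge
            difficulty, streak = difficulty + 1, 0
        elif not solved and difficulty > 1:    # struggling: ease the challenge
            difficulty -= 1
    return difficulty

print(run_curriculum())
```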

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 07:59:29 UTC 出版:2025-08-13 07:59:29 协调世界时(UTC)

#11 An Automated Multi-Modal Evaluation Framework for Mobile Intelligent Assistants #11 一种用于移动智能助理的自动化多模态评估框架

Authors: [Meiping Wang](https://arxiv.org/search/?searchtype=author&query=Meiping Wang), [Jian Zhong](https://arxiv.org/search/?searchtype=author&query=Jian Zhong), [Rongduo Han](https://arxiv.org/search/?searchtype=author&query=Rongduo Han), [Liming Kang](https://arxiv.org/search/?searchtype=author&query=Liming Kang), [Zhengkun Shi](https://arxiv.org/search/?searchtype=author&query=Zhengkun Shi), [Xiao Liang](https://arxiv.org/search/?searchtype=author&query=Xiao Liang), [Xing Lin](https://arxiv.org/search/?searchtype=author&query=Xing Lin), [Nan Gao](https://arxiv.org/search/?searchtype=author&query=Nan Gao), [Haining Zhang](https://arxiv.org/search/?searchtype=author&query=Haining Zhang) 作者:王美萍、钟健、韩荣多、康利明、施正坤、梁晓、林兴、高楠、张海宁

With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning on the Qwen3-8B model, we achieve a significant evaluation matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework’s effectiveness in predicting users’ satisfaction and identifying generation defects. 随着移动智能助理技术的快速发展,多模态 AI 助手已成为日常用户交互的关键界面。然而,目前的评估方法面临着高昂的人工成本、不一致的标准和主观偏差等挑战。本文提出了一个基于大型语言模型与多代理协作的自动化多模态评估框架。该框架采用由交互评估代理、语义验证代理和体验决策代理组成的三层代理架构。通过对 Qwen3-8B 模型进行监督微调,我们在与人类专家的评估匹配准确性上取得了显著成果。在对八个主要智能代理的实验中,结果表明该框架在预测用户满意度和识别生成缺陷方面具有有效性。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-13 05:40:34 UTC 发布:2025-08-13 05:40:34 UTC

#12 The Othello AI Arena: Evaluating Intelligent Systems Through Limited-Time Adaptation to Unseen Boards

Author: [Sundong Kim](https://arxiv.org/search/?searchtype=author&query=Sundong Kim) 作者:Sundong Kim

The ability to rapidly adapt to novel and unforeseen environmental changes is a cornerstone of artificial general intelligence (AGI), yet it remains a critical blind spot in most existing AI benchmarks. Traditional evaluation largely focuses on optimizing performance within fixed environments, failing to assess systems’ flexibility and generalization capabilities when faced with even subtle rule or structural modifications. Addressing this gap, I introduce the Othello AI Arena, a novel benchmark framework designed to evaluate intelligent systems based on their capacity for limited-time adaptation to unseen environments. Our platform poses a meta-learning challenge: participants must develop systems that can analyze the specific configuration and rules of a novel Othello board within a strict time limit (60 seconds) and generate a tailored, high-performing strategy for that unique environment. With this, evaluation of the meta-level intelligence can be separated from the task-level strategy performance. The Arena features a diverse set of game stages, including public stages for development and private stages with structural and rule variations designed to test genuine adaptive and generalization capabilities. Implemented as an accessible web-based platform, the Arena provides real-time visualization, automated evaluation using multi-dimensional metrics, and comprehensive logging for post-hoc analysis. Initial observations from pilot tests and preliminary student engagements highlight fascinating patterns in adaptation approaches, ranging from rapid parameter tuning to rudimentary environmental model learning through simulation. The Othello AI Arena offers a unique educational tool and a valuable research benchmark for fostering and evaluating the crucial skill of rapid, intelligent adaptation in AI systems. 快速适应新颖且不可预见的环境变化的能力是通用人工智能(AGI)的基石,然而在大多数现有的人工智能基准测试中,这仍然是一个重要的盲点。传统评估主要侧重于在固定环境中优化性能,未能在规则或结构发生哪怕是细微修改时评估系统的灵活性和泛化能力。为填补这一空白,我引入了 Othello AI Arena,这是一个新颖的基准框架,旨在基于智能系统在有限时间内对未知环境进行适应的能力来评估它们。我们的平台提出了一项元学习挑战:参与者必须在严格的时间限制(60 秒)内分析新奥赛罗棋盘的具体配置和规则,并为该独特环境生成一个量身定制的高性能策略。通过此方式,可以将元层面的智能评估与任务层面的策略表现分离开来。 竞技场包含多样的游戏关卡,包括用于开发的公开关卡以及具有结构和规则变化的私人关卡,旨在测试真实的适应性和泛化能力。竞技场作为一个可访问的基于网页的平台实现,提供实时可视化、使用多维度指标的自动评估,以及用于事后分析的全面日志记录。来自试点测试和初步学生参与的初步观察揭示了在适应方法上有趣的模式,范围从快速参数调优到通过模拟进行的初步环境模型学习。黑白棋人工智能竞技场提供了一个独特的教育工具和一个有价值的研究基准,用于培养和评估人工智能系统中快速、智能适应这一关键技能。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-08-12 19:10:58 UTC 发布:2025-08-12 19:10:58 协调世界时

#13 Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning #13 价值函数初始化用于知识迁移和深度强化学习中的跳跃启动

Author: [Soumia Mehimeh](https://arxiv.org/search/?searchtype=author&query=Soumia Mehimeh) 作者:Soumia Mehimeh

Value function initialization (VFI) is an effective way to achieve a jumpstart in reinforcement learning (RL) by leveraging value estimates from prior tasks. While this approach is well established in tabular settings, extending it to deep reinforcement learning (DRL) poses challenges due to the continuous nature of the state-action space, the noisy approximations of neural networks, and the impracticality of storing all past models for reuse. In this work, we address these challenges and introduce DQInit, a method that adapts value function initialization to DRL. DQInit reuses compact tabular Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent’s learned estimates, avoiding the limitations of fixed time decay. Our approach offers a novel perspective on knowledge transfer in DRL by relying solely on value estimates rather than policies or demonstrations, effectively combining the strengths of jumpstart RL and policy distillation while mitigating their drawbacks. Experiments across multiple continuous control tasks demonstrate that DQInit consistently improves early learning efficiency, stability, and overall performance compared to standard initialization and existing transfer techniques. 值函数初始化(VFI)是一种通过利用先前任务的值估计在强化学习(RL)中实现快速启动的有效方法。虽然该方法在表格设置中已被充分确立,但将其扩展到深度强化学习(DRL)面临挑战,原因在于状态-动作空间的连续性、神经网络的噪声近似以及存储所有过往模型以供重用的不切实际性。在本工作中,我们应对这些挑战并提出了 DQInit,一种将值函数初始化适配于 DRL 的方法。DQInit 复用从先前已解决任务中提取的紧凑表格 Q 值作为可迁移的知识库。它采用基于已知度(knownness)的机制将这些迁移值柔性地整合到探索不足的区域,并逐步向智能体学得的估计值过渡,避免了固定时间衰减的局限性。我们的方法通过仅依赖值估计而非策略或示范,为 DRL 中的知识迁移提供了新的视角,有效结合了快速启动 RL 和策略蒸馏的优点,同时减轻了它们的缺点。 在多个连续控制任务上的实验表明,与标准初始化和现有迁移技术相比,DQInit 在早期学习效率、稳定性和整体性能方面始终有所提升。
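
The knownness-based mechanism can be pictured as a convex combination that shifts from the transferred values to the agent's own estimates as a region is explored; the sketch below is an illustrative reading of that idea, since the abstract does not give the paper's exact knownness definition or integration point.

```python
# Minimal sketch of knownness-weighted value initialization in the spirit of DQInit.
import numpy as np

def blended_q(q_learned: np.ndarray, q_transferred: np.ndarray,
              visit_counts: np.ndarray, k: float = 10.0) -> np.ndarray:
    """Blend transferred tabular Q-values into under-explored regions.
    Knownness grows toward 1 with experience, so the learned estimate takes over."""
    knownness = visit_counts / (visit_counts + k)          # in [0, 1)
    return knownness * q_learned + (1.0 - knownness) * q_transferred

# Example: an unvisited state-action keeps the transferred prior; a well-visited one ignores it.
q_l = np.array([0.0, 0.9]); q_t = np.array([0.5, 0.5]); n = np.array([0.0, 200.0])
print(blended_q(q_l, q_t, n))   # ~[0.5, 0.88]
```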

Subjects: Artificial Intelligence, Machine Learning, Logic in Computer Science 主题:人工智能、机器学习、计算机科学中的逻辑

Publish: 2025-08-12 18:32:08 UTC 发布:2025-08-12 18:32:08 UTC

#14 Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation #14 Echo-4o:利用 GPT-4o 合成图像的能力以改善图像生成

Authors: [Junyan Ye](https://arxiv.org/search/?searchtype=author&query=Junyan Ye), [Dongzhi Jiang](https://arxiv.org/search/?searchtype=author&query=Dongzhi Jiang), [Zihao Wang](https://arxiv.org/search/?searchtype=author&query=Zihao Wang), [Leqi Zhu](https://arxiv.org/search/?searchtype=author&query=Leqi Zhu), [Zhenghao Hu](https://arxiv.org/search/?searchtype=author&query=Zhenghao Hu), [Zilong Huang](https://arxiv.org/search/?searchtype=author&query=Zilong Huang), [Jun He](https://arxiv.org/search/?searchtype=author&query=Jun He), [Zhiyuan Yan](https://arxiv.org/search/?searchtype=author&query=Zhiyuan Yan), [Jinghua Yu](https://arxiv.org/search/?searchtype=author&query=Jinghua Yu), [Hongsheng Li](https://arxiv.org/search/?searchtype=author&query=Hongsheng Li), [Conghui He](https://arxiv.org/search/?searchtype=author&query=Conghui He), [Weijia Li](https://arxiv.org/search/?searchtype=author&query=Weijia Li) 作者:叶君研、姜东志、王梓豪、朱乐琦、胡正浩、黄子龙、何骏、颜志远、俞敬华、李宏盛、何从辉、李卫佳

Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability. 最近,GPT-4o 因其出色的图像生成能力而备受关注,但开源模型仍然落后于其表现。已有若干研究尝试从 GPT-4o 提取图像数据以增强开源模型,并取得了显著进展。然而,一个关键问题仍然存在:既然现实世界的图像数据集本身已经是高质量数据的自然来源,为什么我们还要使用 GPT-4o 生成的合成数据?在这项工作中,我们指出了合成图像的两大关键优势。首先,它们可以补充现实数据集中稀缺的场景,例如超现实的奇幻场景或多参考图像生成,这些场景在用户查询中经常出现。其次,它们提供了干净且可控的监督。现实世界数据常包含复杂的背景噪声以及文本描述与图像内容之间固有的不匹配,而合成图像则提供纯净的背景和长尾的监督信号,有助于更精确的文本到图像对齐。基于这些洞见,我们提出了 Echo-4o-Image,这是一个由 GPT-4o 生成、规模为 18 万的合成数据集,利用合成图像数据的力量来解决现实覆盖中的盲点。 使用该数据集,我们对统一多模态生成基线 Bagel 进行微调以得到 Echo-4o。此外,我们提出了两个新的评估基准,以便对图像生成能力进行更准确、更具挑战性的评估:GenEval++,通过增加指令复杂性以缓解分数饱和问题;以及 Imagine-Bench,专注于评估对富有想象力内容的理解和生成。Echo-4o 在标准基准上表现强劲。此外,将 Echo-4o-Image 应用于其他基础模型(例如 OmniGen2、BLIP3-o)在多项指标上均带来了稳定的性能提升,凸显了该数据集的强大可迁移性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-13 17:59:28 UTC 发布:2025-08-13 17:59:28 UTC

#15 Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model #15 通过使用语义动力学模型的安全强化学习实现无人机的视觉驱动河流跟随

Authors: [Zihan Wang](https://arxiv.org/search/?searchtype=author&query=Zihan Wang), [Nina Mahmoudian](https://arxiv.org/search/?searchtype=author&query=Nina Mahmoudian) 作者:王子涵,Nina Mahmoudian

Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. We formalize river following as a coverage control problem in which the reward function is submodular, yielding diminishing returns as more unique river segments are visited, thereby framing the task as a Submodular Markov Decision Process. First, we introduce Marginal Gain Advantage Estimation, which refines the reward advantage function by using a sliding window baseline computed from historical episodic returns, thus aligning the advantage estimation with the agent’s evolving recognition of action value in non-Markovian settings. Second, we develop a Semantic Dynamics Model based on patchified water semantic masks that provides more interpretable and data-efficient short-term prediction of future observations compared to latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator architecture, which integrates the actor, the cost estimator, and SDM for cost advantage estimation to form a model-based SafeRL framework capable of solving partially observable Constrained Submodular Markov Decision Processes. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods like Generalized Advantage Estimation. SDM provides more accurate short-term state predictions that enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL, with the Lagrangian approach achieving the soft balance of reward and safety during training, while the safety layer enhances performance during inference by hard action overlay. 视觉驱动的无人机自主河流跟随对于救援、监视和环境监测等应用至关重要,尤其是在 GPS 信号不可靠的茂密河流环境中。我们将河流跟随形式化为覆盖控制问题,其中奖励函数是次模的,随着访问的独特河段增多而呈现边际效用递减,因此将该任务构建为次模马尔可夫决策过程。首先,我们引入边际增益优势估计(Marginal Gain Advantage Estimation),通过使用从历史回合回报计算的滑动窗口基线来细化奖励优势函数,从而在非马尔可夫环境中使优势估计与智能体对动作价值不断演进的认知保持一致。其次,我们开发了一种基于分块水体语义掩码的语义动力学模型,该模型相比潜在视觉动力学模型,能更具可解释性且在短期观测预测上更数据高效。 第三,我们提出了受限行动者动力学估计器(Constrained Actor Dynamics Estimator,CADE)架构,该架构将行动者、成本估计器和用于成本优势估计的 SDM 集成在一起,形成一个基于模型的安全强化学习框架,能够求解部分可观测的受限子模马尔可夫决策过程。仿真结果表明,MGAE 在收敛速度和性能上均优于传统的基于评论家方法,如广义优势估计(Generalized Advantage Estimation)。SDM 提供了更准确的短期状态预测,使成本估计器能够更好地预测潜在违规。总体而言,CADE 有效地将安全约束整合到基于模型的强化学习中,拉格朗日方法在训练过程中实现了回报与安全的软平衡,而安全层通过硬性动作覆盖在推理阶段提升了性能。
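
A sliding-window baseline of the kind MGAE describes can be sketched in a few lines; the window size and the use of raw episodic returns below are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a sliding-window baseline for advantage estimation.
from collections import deque

class SlidingWindowBaseline:
    """Advantage = current episodic return minus the mean of recent episodic returns."""
    def __init__(self, window: int = 20):
        self.returns = deque(maxlen=window)

    def advantage(self, episodic_return: float) -> float:
        baseline = sum(self.returns) / len(self.returns) if self.returns else 0.0
        self.returns.append(episodic_return)
        return episodic_return - baseline

# Usage: submodular coverage rewards shrink as river segments are revisited, and the
# moving baseline tracks that drift instead of assuming a stationary return.
b = SlidingWindowBaseline(window=5)
for r in [10.0, 9.0, 7.0, 6.5]:
    print(round(b.advantage(r), 2))
```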

Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能

Publish: 2025-08-13 17:39:09 UTC 发布:2025-08-13 17:39:09 UTC

#16 January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis #16 1 月食物基准(JFB):用于多模态食物分析的公共基准数据集和评估套件

Authors: [Amir Hosseinian](https://arxiv.org/search/?searchtype=author&query=Amir Hosseinian), [Ashkan Dehghani Zahedani](https://arxiv.org/search/?searchtype=author&query=Ashkan Dehghani Zahedani), [Umer Mansoor](https://arxiv.org/search/?searchtype=author&query=Umer Mansoor), [Noosheen Hashemi](https://arxiv.org/search/?searchtype=author&query=Noosheen Hashemi), [Mark Woodward](https://arxiv.org/search/?searchtype=author&query=Mark Woodward) 作者:Amir Hosseinian、Ashkan Dehghani Zahedani、Umer Mansoor、Noosheen Hashemi、Mark Woodward

Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis. 在自动化营养分析领域,缺乏标准化评估方法和高质量的真实世界基准数据集,严重制约了进展。为了解决这一问题,我们提出三项主要贡献。首先,我们发布了 January Food 基准(JFB),这是一个公开可用的包含 1,000 张食品图像并经人工验证注释的集合。其次,我们详述了一个全面的基准测试框架,包括稳健的度量标准以及一个新颖的、面向应用的整体评分,用于整体评估模型表现。第三,我们提供了通用视觉-语言模型(VLMs)和我们自研专用模型 january/food-vision-v1 的基线结果。我们的评估表明,专用模型在整体评分上达到了 86.2 分,比表现最好的通用配置高出 12.1 分。本工作为研究界提供了一个有价值的新评估数据集和一个严谨的框架,以指导并评测未来自动化营养分析的发展。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 17:32:40 UTC 发布时间:2025-08-13 17:32:40 UTC

#17 GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation #17 GBC:用于全身类人模仿的广义行为克隆框架

Authors: [Yifei Yao](https://arxiv.org/search/?searchtype=author&query=Yifei Yao), [Chengyuan Luo](https://arxiv.org/search/?searchtype=author&query=Chengyuan Luo), [Jiaheng Du](https://arxiv.org/search/?searchtype=author&query=Jiaheng Du), [Wentao He](https://arxiv.org/search/?searchtype=author&query=Wentao He), [Jun-Guo Lu](https://arxiv.org/search/?searchtype=author&query=Jun-Guo Lu) 作者:姚一飞,罗承远,杜嘉恒,何文涛,陆俊国

The creation of human-like humanoid robots is hindered by a fundamental fragmentation: data processing and learning algorithms are rarely universal across different robot morphologies. This paper introduces the Generalized Behavior Cloning (GBC) framework, a comprehensive and unified solution designed to solve this end-to-end challenge. GBC establishes a complete pathway from human motion to robot action through three synergistic innovations. First, an adaptive data pipeline leverages a differentiable IK network to automatically retarget any human MoCap data to any humanoid. Building on this foundation, our novel DAgger-MMPPO algorithm with its MMTransformer architecture learns robust, high-fidelity imitation policies. To complete the ecosystem, the entire framework is delivered as an efficient, open-source platform based on Isaac Lab, empowering the community to deploy the full workflow via simple configuration scripts. We validate the power and generality of GBC by training policies on multiple heterogeneous humanoids, demonstrating excellent performance and transfer to novel motions. This work establishes the first practical and unified pathway for creating truly generalized humanoid controllers. 类似人类的类人机器人创造受到一个根本性分裂的制约:在不同机器人形态之间,数据处理和学习算法很少具有通用性。本文提出了广义行为克隆(Generalized Behavior Cloning,GBC)框架,一种旨在解决这一端到端挑战的全面统一解决方案。GBC 通过三项协同创新建立了从人体动作到机器人行为的完整路径。首先,一个自适应数据管道利用可微分逆向运动学网络,自动将任何人体动作捕捉数据重新映射到任何类人机器人。基于此基础,我们提出了具有多模型变压器(MMTransformer)架构的创新性 DAgger-MMPPO 算法,以学习鲁棒且高保真的模仿策略。为完善生态系统,整个框架作为基于 Isaac Lab 的高效开源平台交付,使社区能够通过简单的配置脚本部署完整工作流。我们通过在多种异构类人机器人上训练策略来验证 GBC 的能力与通用性,展示了出色的性能并能迁移到新颖动作。 这项工作建立了首个实用且统一的方法,用于创建真正通用的人形控制器。

Subjects: Robotics, Artificial Intelligence, Machine Learning 主题:机器人学、人工智能、机器学习

Publish: 2025-08-13 17:28:39 UTC 发布时间:2025-08-13 17:28:39 UTC

#18 Specialised or Generic? Tokenization Choices for Radiology Language Models #18 专用还是通用?放射学语言模型的分词选择

Authors: [Hermione Warr](https://arxiv.org/search/?searchtype=author&query=Hermione Warr), [Wentian Xu](https://arxiv.org/search/?searchtype=author&query=Wentian Xu), [Harry Anthony](https://arxiv.org/search/?searchtype=author&query=Harry Anthony), [Yasin Ibrahim](https://arxiv.org/search/?searchtype=author&query=Yasin Ibrahim), [Daniel McGowan](https://arxiv.org/search/?searchtype=author&query=Daniel McGowan), [Konstantinos Kamnitsas](https://arxiv.org/search/?searchtype=author&query=Konstantinos Kamnitsas) 作者:Hermione Warr、Wentian Xu、Harry Anthony、Yasin Ibrahim、Daniel McGowan、Konstantinos Kamnitsas

The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings. 语言模型(LM)使用的词汇表——由分词器定义——在文本生成质量中起着关键作用。然而,其影响在放射学领域尚未得到充分研究。在本工作中,我们通过在三种影像模态上对放射学报告摘要任务系统地比较通用、医学和领域特定的分词器来填补这一空白。我们还研究了在有无在 PubMed 摘要上进行语言模型预训练的情形。研究结果表明,当模型从零开始训练时,医学和领域特定的词汇表优于广泛使用的自然语言替代方案。预训练在一定程度上缓解了分词器之间的性能差异,而领域特定的分词器取得了最优结果。领域特定的分词器由于词表更小、序列更短,还降低了内存需求。这些结果表明,将语言模型的词汇表适配到临床领域带来了实用益处,包括性能提升和计算需求降低,使此类模型在研究和实际医疗场景中更易获取且更为有效。
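
For readers who want to reproduce the domain-specific-vocabulary setup, a small radiology BPE tokenizer can be trained with the Hugging Face `tokenizers` library roughly as follows; the corpus and vocabulary size here are placeholders, not the paper's configuration.

```python
# Minimal sketch of training a radiology-specific BPE vocabulary.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

reports = [
    "No focal consolidation, pleural effusion, or pneumothorax.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]  # in practice: the full in-domain report corpus

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=8000,
                              special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(reports, trainer=trainer)

# A domain vocabulary keeps terms like "pneumothorax" intact, shortening sequences
# relative to a general-purpose tokenizer that splits them into many sub-words.
print(tokenizer.encode("small pneumothorax").tokens)
```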

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习

Publish: 2025-08-13 17:13:56 UTC 发布:2025-08-13 17:13:56 UTC

#19 VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models #19 VisCodex:通过融合视觉与编码模型实现统一的多模态代码生成

Authors: [Lingjie Jiang](https://arxiv.org/search/?searchtype=author&query=Lingjie Jiang), [Shaohan Huang](https://arxiv.org/search/?searchtype=author&query=Shaohan Huang), [Xun Wu](https://arxiv.org/search/?searchtype=author&query=Xun Wu), [Yixia Li](https://arxiv.org/search/?searchtype=author&query=Yixia Li), [Dongdong Zhang](https://arxiv.org/search/?searchtype=author&query=Dongdong Zhang), [Furu Wei](https://arxiv.org/search/?searchtype=author&query=Furu Wei) 作者:蒋凌杰,黄少旱,吴勋,李一夏,张东东,魏福如

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets. 多模态大型语言模型(MLLMs)在视觉与文本理解的融合方面取得了显著进展。然而,它们从多模态输入生成代码的能力仍然有限。在本工作中,我们提出了 VisCodex——一个将视觉与代码语言模型无缝融合的统一框架,以赋能 MLLMs 强大的多模态代码生成能力。通过基于任务向量的模型合并技术,我们将最先进的代码 LLM 集成到强大的视觉-语言骨干中,同时保留视觉理解与高级编码技能。为支持训练与评估,我们引入了多模态编码数据集(MCD),这是一个规模庞大且多样化的集合,包含 59.8 万条样本,包括高质量的 HTML 代码、图表图像-代码对、带图像的 StackOverflow 问答以及算法题。此外,我们提出了 InfiBench-V,这是一个新颖且具有挑战性的基准,专门用于评估模型在视觉丰富的真实编程问题上的表现,这类问题需要对文本与视觉语境都有细致的理解。 大量实验表明,VisCodex 在开源多模态大模型中达到最先进的性能,并且性能接近 GPT-4o 等专有模型,这凸显了我们模型合并策略和新数据集的有效性。
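
Task-vector merging of the kind referenced here generally follows the task-arithmetic recipe theta_merged = theta_vlm + λ·(theta_coder − theta_base); the sketch below applies that recipe to raw state dicts and is a generic illustration, not VisCodex's exact procedure.

```python
# Minimal sketch of task-vector model merging over shared parameter tensors.
import torch

def merge_with_task_vector(vlm_state: dict, coder_state: dict, base_state: dict,
                           lam: float = 0.5) -> dict:
    """theta_merged = theta_vlm + lam * (theta_coder - theta_base), per shared tensor."""
    merged = {}
    for name, w_vlm in vlm_state.items():
        if name in coder_state and name in base_state and coder_state[name].shape == w_vlm.shape:
            merged[name] = w_vlm + lam * (coder_state[name] - base_state[name])
        else:
            merged[name] = w_vlm.clone()   # modality-specific tensors are kept as-is
    return merged

# Toy usage with random tensors standing in for checkpoints that share one linear layer.
shape = (4, 4)
base = {"lm.w": torch.randn(shape)}
vlm = {"lm.w": base["lm.w"] + 0.1 * torch.randn(shape), "vision.w": torch.randn(shape)}
coder = {"lm.w": base["lm.w"] + 0.1 * torch.randn(shape)}
print(merge_with_task_vector(vlm, coder, base)["lm.w"].shape)
```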

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-13 17:00:44 UTC 发布:2025-08-13 17:00:44 UTC

#20 A Comprehensive Evaluation framework of Alignment Techniques for LLMs #20 大型语言模型(LLMs)对齐技术的综合评估框架

Authors: [Muneeza Azmat](https://arxiv.org/search/?searchtype=author&query=Muneeza Azmat), [Momin Abbas](https://arxiv.org/search/?searchtype=author&query=Momin Abbas), [Maysa Malfiza Garcia de Macedo](https://arxiv.org/search/?searchtype=author&query=Maysa Malfiza Garcia de Macedo), [Marcelo Carpinette Grave](https://arxiv.org/search/?searchtype=author&query=Marcelo Carpinette Grave), [Luan Soares de Souza](https://arxiv.org/search/?searchtype=author&query=Luan Soares de Souza), [Tiago Machado](https://arxiv.org/search/?searchtype=author&query=Tiago Machado), [Rogerio A de Paula](https://arxiv.org/search/?searchtype=author&query=Rogerio A de Paula), [Raya Horesh](https://arxiv.org/search/?searchtype=author&query=Raya Horesh), [Yixin Chen](https://arxiv.org/search/?searchtype=author&query=Yixin Chen), [Heloisa Caroline de Souza Pereira Candello](https://arxiv.org/search/?searchtype=author&query=Heloisa Caroline de Souza Pereira Candello), [Rebecka Nordenlow](https://arxiv.org/search/?searchtype=author&query=Rebecka Nordenlow), [Aminat Adebiyi](https://arxiv.org/search/?searchtype=author&query=Aminat Adebiyi) 作者:Muneeza Azmat、Momin Abbas、Maysa Malfiza Garcia de Macedo、Marcelo Carpinette Grave、Luan Soares de Souza、Tiago Machado、Rogerio A de Paula、Raya Horesh、Yixin Chen、Heloisa Caroline de Souza Pereira Candello、Rebecka Nordenlow、Aminat Adebiyi

As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces a multi-dimensional evaluation of alignment techniques for LLMs, a comprehensive evaluation framework that provides a systematic comparison across all major alignment paradigms. Our framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments across diverse base models and alignment strategies, we demonstrate the utility of our framework in identifying strengths and limitations of current state-of-the-art models, providing valuable insights for future research directions. 随着大型语言模型(LLMs)越来越多地被整合到实际应用中,确保其输出符合人类价值观和安全标准变得至关重要。该领域已发展出多种对齐方法,包括传统的微调方法(RLHF、指令微调)、事后纠正系统以及推理时干预,每种方法各有明显的优点和局限。然而,缺乏统一的评估框架使得系统性比较这些范式并指导部署决策变得困难。本文提出了对 LLMs 对齐技术的多维评估,即一个全面的评估框架,能够在所有主要对齐范式之间进行系统比较。我们的框架沿四个关键维度评估方法:对齐检测、对齐质量、计算效率和鲁棒性。通过对不同基础模型和对齐策略的实验,我们展示了该框架在识别当前最先进模型的优劣方面的实用性,并为未来的研究方向提供了有价值的见解。

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习

Publish: 2025-08-13 16:42:01 UTC 发布:2025-08-13 16:42:01 UTC

#21 Residual Reservoir Memory Networks #21 残差水库记忆网络

Authors: [Matteo Pinna](https://arxiv.org/search/?searchtype=author&query=Matteo Pinna), [Andrea Ceni](https://arxiv.org/search/?searchtype=author&query=Andrea Ceni), [Claudio Gallicchio](https://arxiv.org/search/?searchtype=author&query=Claudio Gallicchio) 作者:Matteo Pinna、Andrea Ceni、Claudio Gallicchio

We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models. 我们在储备计算(Reservoir Computing,RC)范式下引入了一类新颖的未训练循环神经网络(RNN),称为残差储备记忆网络(Residual Reservoir Memory Networks,ResRMNs)。ResRMN 将线性记忆储备与非线性储备相结合,后者基于沿时间维度的残差正交连接,以增强输入的长期传播。通过线性稳定性分析来研究由此产生的储备状态动态,并探讨时间残差连接的多种配置。我们在时间序列和像素级一维分类任务上对所提出的方法进行了实证评估。实验结果凸显了该方法相较其他传统 RC 模型的优势。
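
One plausible reading of the state update, with a fixed orthogonal residual connection along the temporal dimension plus a linear memory branch, is sketched below; the exact equations, scalings, and readout used in the paper are not given in the abstract, so everything here is an assumption.

```python
# Minimal numpy sketch of a residual-orthogonal reservoir plus a linear memory reservoir.
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 64, 64, 3                                  # nonlinear units, memory units, input dim
O, _ = np.linalg.qr(rng.standard_normal((n, n)))     # fixed orthogonal residual connection
W = 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)   # untrained recurrent weights
W_in = rng.standard_normal((n, d))
A = 0.99 * np.eye(m)                                 # linear memory reservoir (slow decay)
B = rng.standard_normal((m, d))

def step(x, h, u):
    x_next = O @ x + np.tanh(W @ x + W_in @ u)       # residual orthogonal propagation
    h_next = A @ h + B @ u                           # linear memory of the input history
    return x_next, h_next

x, h = np.zeros(n), np.zeros(m)
for _ in range(100):
    x, h = step(x, h, rng.standard_normal(d))
# A linear readout would then be trained on the concatenated state [x; h].
print(x.shape, h.shape)
```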

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 16:21:29 UTC 发布:2025-08-13 16:21:29 UTC

#22 T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis #22 T-CACE:一种时间条件自回归对比度增强多任务框架,用于无对比剂肝脏 MRI 的合成、分割与诊断

Authors: [Xiaojiao Xiao](https://arxiv.org/search/?searchtype=author&query=Xiaojiao Xiao), [Jianfeng Zhao](https://arxiv.org/search/?searchtype=author&query=Jianfeng Zhao), [Qinmin Vivian Hu](https://arxiv.org/search/?searchtype=author&query=Qinmin Vivian Hu), [Guanghui Wang](https://arxiv.org/search/?searchtype=author&query=Guanghui Wang) 作者:肖晓娇、赵建峰、胡秦敏(Vivian Hu)、王光辉

Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving the classification of the lesion and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; and a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases. Furthermore, a constraint for temporal classification consistency (TCC) aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesion. The implementation of T-CACE is publicly available at: https://github.com/xiaojiao929/T-CACE. 磁共振成像(MRI)是肝癌诊断的主要手段,显著提升了病灶的分类和患者预后。然而,传统 MRI 存在包括对比剂(CA)使用风险、耗时的人工评估以及带注释数据集有限等挑战。为了解决这些限制,我们提出了一种时间条件自回归对比增强(T-CACE)框架,用于直接从无对比 MRI(NCMRI)合成多相位对比增强 MRI(CEMRI)。T-CACE 引入了三项核心创新:一种条件令牌编码(CTE)机制,将解剖先验和时间相位信息统一为潜在表征;一种动态时间感知注意力掩码(DTAM),使用高斯衰减的注意力机制自适应调节相间信息流,确保相位间平滑且符合生理学的过渡;此外,一个时间分类一致性(TCC)约束将病灶分类输出与生理信号的演变对齐,进一步提升诊断可靠性。 在两个独立的肝脏 MRI 数据集上进行的大量实验表明,T-CACE 在图像合成、分割和病灶分类方面均优于现有最先进的方法。该框架为传统增强对比成像提供了一个具有临床相关性且高效的替代方案,提高了肝脏病灶评估的安全性、诊断效率和可靠性。T-CACE 的实现已公开可用:https://github.com/xiaojiao929/T-CACE。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-13 16:14:14 UTC 发表:2025-08-13 16:14:14 UTC

#23 Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs #23 超越简单提示:利用 LLMs 提升零样本上下文辅助预测的策略

Authors: [Arjun Ashok](https://arxiv.org/search/?searchtype=author&query=Arjun Ashok), [Andrew Robert Williams](https://arxiv.org/search/?searchtype=author&query=Andrew Robert Williams), [Vincent Zhihao Zheng](https://arxiv.org/search/?searchtype=author&query=Vincent Zhihao Zheng), [Irina Rish](https://arxiv.org/search/?searchtype=author&query=Irina Rish), [Nicolas Chapados](https://arxiv.org/search/?searchtype=author&query=Nicolas Chapados), [Étienne Marcotte](https://arxiv.org/search/?searchtype=author&query=Étienne Marcotte), [Valentina Zantedeschi](https://arxiv.org/search/?searchtype=author&query=Valentina Zantedeschi), [Alexandre Drouin](https://arxiv.org/search/?searchtype=author&query=Alexandre Drouin) 作者:Arjun Ashok、Andrew Robert Williams、Vincent Zhihao Zheng、Irina Rish、Nicolas Chapados、Étienne Marcotte、Valentina Zantedeschi、Alexandre Drouin

Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model’s reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting. 在真实世界的预测中,模型不仅需要整合历史数据,还需结合通常以文本形式出现的相关上下文信息。尽管近期研究表明大型语言模型(LLMs)通过简单的直接提示可以成为有效的上下文辅助预测器,但其全部潜力仍未被充分挖掘。我们通过四种策略来弥补这一缺口,为 LLMs 在该场景下的零样本能力提供新见解。ReDP 通过引出明确的推理轨迹提高可解释性,使我们能够将模型对上下文的推理与其预测准确性独立评估。CorDP 完全利用 LLMs 来用上下文精炼现有预测,增强其在真实预测流水线中的适用性。IC-DP 提出在提示中嵌入带有上下文的历史示例,显著提升准确性,即便是对最大规模的模型也是如此。最后,RouteDP 通过让 LLMs 估计任务难度并将最具挑战性的任务路由到更大模型上,优化了资源效率。 在 CiK 基准上针对不同类型的上下文辅助预测任务进行评估时,我们的策略在不同规模和系列的 LLMs 上均较简单提示展示出明显优势。这些结果为在基于 LLM 的上下文辅助预测中进一步采用简单但有效的改进方法打开了大门。
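
As an example of one of these strategies, an IC-DP-style prompt embeds a solved context-aided forecasting example before the target task; the template below is illustrative wording only, not the paper's prompt.

```python
# Minimal sketch of an in-context (IC-DP-style) prompt for context-aided forecasting.
def build_icdp_prompt(example, task):
    return (
        "You forecast time series using both the numeric history and the textual context.\n\n"
        f"Example context: {example['context']}\n"
        f"Example history: {example['history']}\n"
        f"Example forecast (next 3 steps): {example['forecast']}\n\n"
        f"Context: {task['context']}\n"
        f"History: {task['history']}\n"
        "Forecast (next 3 steps):"
    )

example = {"context": "A maintenance shutdown halves load on day 3.",
           "history": [100, 102, 98], "forecast": [50, 51, 99]}
task = {"context": "A public holiday reduces traffic on the last day.",
        "history": [80, 82, 85]}
print(build_icdp_prompt(example, task))
```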

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 16:02:55 UTC 发布:2025-08-13 16:02:55 UTC

#24 Rare anomalies require large datasets: About proving the existence of anomalies #24 罕见异常需要大规模数据集:关于证明异常存在性的研究

Authors: [Simon Klüttermann](https://arxiv.org/search/?searchtype=author&query=Simon Klüttermann), [Emmanuel Müller](https://arxiv.org/search/?searchtype=author&query=Emmanuel Müller) 作者:Simon Klüttermann, Emmanuel Müller

Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant α_algo. Our results demonstrate that, for an unlabeled dataset of size N and contamination rate ν, the condition N ≥ α_algo/ν² represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible. 在数据集中检测是否存在任何异常对有效的异常检测至关重要,然而在异常检测文献中这一问题仍然出人意料地少有探讨。本文进行了一项全面研究,探讨了一个基本问题:何时我们能够确定无疑地断定异常存在?通过在各种异常检测任务和算法中进行超过三百万次的统计检验,我们识别出数据集大小、污染率与一个依赖于算法的常数 α_algo 之间的关系。我们的结果表明,对于一个无标签数据集,大小为 N 且污染率为 ν 的情形,不等式 N ≥ α_algo/ν² 是确认异常存在所需样本数的下界。该阈值意味着在证明异常存在变得不可行之前,异常可以稀有到怎样的程度存在一个限制。
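
A quick worked example of the bound N ≥ α_algo/ν² shows how fast the required sample size grows as anomalies become rarer; the constant used here is only an illustrative placeholder, since α_algo is algorithm-dependent in the paper.

```python
# Worked example of the lower bound N >= alpha_algo / nu**2 with a placeholder constant.
alpha_algo = 100.0
for nu in (0.10, 0.01, 0.001):                 # contamination rate
    n_min = alpha_algo / nu ** 2
    print(f"nu={nu:>6}: need at least {n_min:,.0f} samples")
# nu=0.1 -> 10,000; nu=0.01 -> 1,000,000; nu=0.001 -> 100,000,000
```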

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 15:52:33 UTC 发布:2025-08-13 15:52:33 UTC

#25 COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets #25 COME:具有协作专家混合(MoE)的双重结构-语义学习,用于跨异构超声数据集的通用病变检测

Authors: [Lingyu Chen](https://arxiv.org/search/?searchtype=author&query=Lingyu Chen), [Yawen Zeng](https://arxiv.org/search/?searchtype=author&query=Yawen Zeng), [Yue Wang](https://arxiv.org/search/?searchtype=author&query=Yue Wang), [Peng Wan](https://arxiv.org/search/?searchtype=author&query=Peng Wan), [Guo-chen Ning](https://arxiv.org/search/?searchtype=author&query=Guo-chen Ning), [Hongen Liao](https://arxiv.org/search/?searchtype=author&query=Hongen Liao), [Daoqiang Zhang](https://arxiv.org/search/?searchtype=author&query=Daoqiang Zhang), [Fang Chen](https://arxiv.org/search/?searchtype=author&query=Fang Chen) 作者:陈灵雨、曾雅文、王跃、万鹏、宁国臣、廖鸿恩、张道强、陈芳

Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream task? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experienced a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features through providing complementary features. This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME’s superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: https://universalcome.github.io/UniversalCOME/. 传统的单数据集训练在面对新的数据分布时常常失败,尤其是在超声(US)图像分析中,因数据有限、声影和散斑噪声等问题更为明显。因此,构建一个适用于多异构超声数据集的通用框架势在必行。然而,关键挑战在于:如何在抑制数据集间干扰的同时,保留数据集特有的判别特征,以实现稳健的下游任务表现?以往的方法要么使用单一的源特定解码器,要么采用域适应策略,但这些方法在应用于其他域时性能会下降。有鉴于此,我们提出了一种通用协同异构源特定专家混合模型(Universal Collaborative Mixture of Heterogeneous Source-Specific Experts,COME)。具体来说,COME 建立了双重结构-语义共享专家以创建一个通用表示空间,然后与源特定专家协作,通过提供互补特征来提取判别特征。该设计通过利用跨数据集的经验分布并为小批量或未见数据情形提供通用的超声先验,从而实现稳健的泛化能力。 在三种评估模式(单数据集、器官内和器官间融合数据集)下的大量实验表明了 COME 的优越性,相较于最先进方法取得了显著的平均 AP 提升。我们的项目可在以下地址获取:https://universalcome.github.io/UniversalCOME/.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-13 15:43:20 UTC

#26 Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning #26 超越缩放定律:一种用于推理的数据高效蒸馏框架

Authors: [Xiaojun Wu](https://arxiv.org/search/?searchtype=author&query=Xiaojun Wu), [Xiaoguang Jiang](https://arxiv.org/search/?searchtype=author&query=Xiaoguang Jiang), [Huiyang Li](https://arxiv.org/search/?searchtype=author&query=Huiyang Li), [Jucai Zhai](https://arxiv.org/search/?searchtype=author&query=Jucai Zhai), [Dengfeng Liu](https://arxiv.org/search/?searchtype=author&query=Dengfeng Liu), [Qiaobo Hao](https://arxiv.org/search/?searchtype=author&query=Qiaobo Hao), [Huang Liu](https://arxiv.org/search/?searchtype=author&query=Huang Liu), [Zhiguo Yang](https://arxiv.org/search/?searchtype=author&query=Zhiguo Yang), [Ji Xie](https://arxiv.org/search/?searchtype=author&query=Ji Xie), [Ninglun Gu](https://arxiv.org/search/?searchtype=author&query=Ninglun Gu), [Jin Yang](https://arxiv.org/search/?searchtype=author&query=Jin Yang), [Kailai Zhang](https://arxiv.org/search/?searchtype=author&query=Kailai Zhang), [Yelun Bao](https://arxiv.org/search/?searchtype=author&query=Yelun Bao), [Jun Wang](https://arxiv.org/search/?searchtype=author&query=Jun Wang) 作者:吴晓军,蒋晓光,李会阳,翟举才,刘登峰,郝巧波,刘黄,杨志国,解吉,顾宁伦,杨晋,张凯来,鲍野伦,王军

Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training combining reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning via distillation alone, reasoning scaling laws are still taking shape and computational costs keep increasing. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities. 大型语言模型 (LLMs) 在算法编码和数学问题求解等任务上展现出卓越的推理能力。最近的方法通过扩展语料库和结合强化学习与有监督微调的多阶段训练改进了推理能力。尽管有些方法表明小而有针对性的数据集仅通过蒸馏就能激励推理,但推理规模定律仍在形成中,从而增加了计算成本。为了解决这一问题,我们提出了一种数据高效蒸馏框架(DED),旨在优化推理蒸馏的帕累托前沿。受到强化学习中的在线策略学习和多样化回滚策略的启发,我们方法的关键思想有三点:(1)我们发现仅凭基准分数并不能决定一个有效的教师模型。通过对领先的推理 LLMs 进行全面比较,我们开发出一种选择最优教师模型的方法。(2)尽管扩展蒸馏可以增强推理能力,但它常常会降低域外性能。精心筛选的、更小的语料库能够在域内与域外能力之间实现平衡的权衡。 (3) 多样化的推理轨迹鼓励学生模型培养稳健的推理能力。我们通过数学推理(AIME 2024/2025、MATH-500)和代码生成(LiveCodeBench)评估来验证我们的方法,使用仅 0.8k 经精心策划的示例就达到了最先进的结果,避免了对大规模扩展的需求。我们的系统分析表明,DED 在考虑超越表面难度、令牌长度或教师模型能力的因素后,优于现有方法。这项工作为在保持泛化能力的同时实现高级推理提供了一个实用且高效的路径。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 15:32:25 UTC 发布:2025-08-13 15:32:25 UTC

#27 Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Authors: [Jiaqi Cao](https://arxiv.org/search/?searchtype=author&query=Jiaqi Cao), [Jiarui Wang](https://arxiv.org/search/?searchtype=author&query=Jiarui Wang), [Rubin Wei](https://arxiv.org/search/?searchtype=author&query=Rubin Wei), [Qipeng Guo](https://arxiv.org/search/?searchtype=author&query=Qipeng Guo), [Kai Chen](https://arxiv.org/search/?searchtype=author&query=Kai Chen), [Bowen Zhou](https://arxiv.org/search/?searchtype=author&query=Bowen Zhou), [Zhouhan Lin](https://arxiv.org/search/?searchtype=author&query=Zhouhan Lin) 作者:Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin

Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain. 大型语言模型(LLMs)在通用语言任务上表现出强大能力,但将其适配到特定领域仍是一大挑战。当前方法如领域自适应预训练(DAPT)需要昂贵的全参数训练,并且会遭遇灾难性遗忘。同时,检索增强生成(RAG)由于昂贵的近邻搜索和更长的上下文,带来了显著的推理延迟。本文提出了 Memory Decoder,一种即插即用的预训练记忆模块,能够在不改变原始模型参数的情况下实现高效的领域适配。Memory Decoder 使用一个小型的 transformer 解码器来学习模仿外部非参数检索器的行为。训练完成后,Memory Decoder 可以无缝地与任何使用相同分词器的预训练语言模型集成,无需针对特定模型进行修改。实验结果表明,Memory Decoder 能够有效地将多种 Qwen 和 Llama 模型适配到三个不同的专门领域:生物医学、金融和法律,使困惑度平均降低 6.17 点。 总体而言,Memory Decoder 提出了一种以专门预训练的记忆组件为核心、用于领域特定适配的新范式。这种记忆架构可以以即插即用的方式集成,在目标领域内持续提升多种模型的性能。
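
The abstract does not spell out how the memory's output is combined with the base model at decoding time; a common plug-and-play choice for such components is kNN-LM-style probability interpolation over the shared vocabulary, sketched below purely for illustration.

```python
# Minimal sketch of combining a base LM with a plug-in memory via probability interpolation
# (a kNN-LM-style assumption, not the paper's stated combination rule).
import torch
import torch.nn.functional as F

def combined_next_token_logprobs(base_logits: torch.Tensor,
                                 memory_logits: torch.Tensor,
                                 lam: float = 0.25) -> torch.Tensor:
    """Both models share one tokenizer, so their distributions cover the same vocabulary."""
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return torch.log((1.0 - lam) * p_base + lam * p_mem)

# Toy usage over a 5-token vocabulary.
base = torch.tensor([2.0, 0.5, 0.1, -1.0, 0.0])
mem = torch.tensor([-1.0, 3.0, 0.0, 0.0, 0.0])   # the memory strongly prefers token 1
print(combined_next_token_logprobs(base, mem).argmax().item())
```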

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 15:16:29 UTC 发布:2025-08-13 15:16:29 UTC

#28 STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports #28 STREAM(化学生物):在人工智能模型报告中透明报告评估的标准

Authors: [Tegan McCaslin](https://arxiv.org/search/?searchtype=author&query=Tegan McCaslin), [Jide Alaga](https://arxiv.org/search/?searchtype=author&query=Jide Alaga), [Samira Nedungadi](https://arxiv.org/search/?searchtype=author&query=Samira Nedungadi), [Seth Donoughe](https://arxiv.org/search/?searchtype=author&query=Seth Donoughe), [Tom Reed](https://arxiv.org/search/?searchtype=author&query=Tom Reed), [Rishi Bommasani](https://arxiv.org/search/?searchtype=author&query=Rishi Bommasani), [Chris Painter](https://arxiv.org/search/?searchtype=author&query=Chris Painter), [Luca Righetti](https://arxiv.org/search/?searchtype=author&query=Luca Righetti) 作者:Tegan McCaslin、Jide Alaga、Samira Nedungadi、Seth Donoughe、Tom Reed、Rishi Bommasani、Chris Painter、Luca Righetti

Evaluations of dangerous AI capabilities are important for managing catastrophic risks. Public transparency into these evaluations - including what they test, how they are conducted, and how their results inform decisions - is crucial for building trust in AI development. We propose STREAM (A Standard for Transparently Reporting Evaluations in AI Model Reports), a standard to improve how model reports disclose evaluation results, initially focusing on chemical and biological (ChemBio) benchmarks. Developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies, this standard is designed to (1) be a practical resource to help AI developers present evaluation results more clearly, and (2) help third parties identify whether model reports provide sufficient detail to assess the rigor of the ChemBio evaluations. We concretely demonstrate our proposed best practices with “gold standard” examples, and also provide a three-page reporting template to enable AI developers to implement our recommendations more easily. 对危险性人工智能能力的评估对于管理灾难性风险至关重要。对这些评估的公开透明——包括它们测试什么、如何实施以及其结果如何用于决策——对于建立人们对人工智能开发的信任至关重要。我们提出了 STREAM(用于在人工智能模型报告中透明报告评估的标准),该标准旨在改进模型报告披露评估结果的方式,初期重点关注化学与生物(ChemBio)基准。该标准在与来自政府、民间社会、学术界和前沿人工智能公司的 23 位专家协商后制定,旨在(1)成为帮助人工智能开发者更清晰呈现评估结果的实用资源,及(2)帮助第三方识别模型报告是否提供了足够的细节以评估 ChemBio 评估的严谨性。我们通过“黄金标准”示例具体展示了我们提出的最佳实践,并提供了一份三页的报告模板,以便人工智能开发者更容易地实施我们的建议。

Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能

Publish: 2025-08-13 14:36:36 UTC 发布:2025-08-13 14:36:36 UTC

#29 Perceptual Reality Transformer: Neural Architectures for Simulating Neurological Perception Conditions #29 感知现实变换器:用于模拟神经感知状况的神经网络架构

Author: [Baihan Lin](https://arxiv.org/search/?searchtype=author&query=Baihan Lin) 作者:林柏涵

Neurological conditions affecting visual perception create profound experiential divides between affected individuals and their caregivers, families, and medical professionals. We present the Perceptual Reality Transformer, a comprehensive framework employing six distinct neural architectures to simulate eight neurological perception conditions with scientifically-grounded visual transformations. Our system learns mappings from natural images to condition-specific perceptual states, enabling others to experience approximations of simultanagnosia, prosopagnosia, ADHD attention deficits, visual agnosia, depression-related changes, anxiety tunnel vision, and Alzheimer’s memory effects. Through systematic evaluation across ImageNet and CIFAR-10 datasets, we demonstrate that Vision Transformer architectures achieve optimal performance, outperforming traditional CNN and generative approaches. Our work establishes the first systematic benchmark for neurological perception simulation, contributes novel condition-specific perturbation functions grounded in clinical literature, and provides quantitative metrics for evaluating simulation fidelity. The framework has immediate applications in medical education, empathy training, and assistive technology development, while advancing our fundamental understanding of how neural networks can model atypical human perception. 影响视觉感知的神经疾病在患者与其护理者、家人及医疗专业人员之间造成深刻的体验差异。我们提出了“感知现实转换器”,这是一个综合框架,采用六种不同的神经网络架构,通过基于科学依据的视觉变换来模拟八种神经感知状况。我们的系统学习将自然图像映射到特定疾病的感知状态,使他人能够体验到模拟并列失认(simultanagnosia)、面孔失认(prosopagnosia)、注意力缺陷多动障碍(ADHD)注意力缺陷、视觉失认、抑郁相关变化、焦虑引发的隧道视野以及阿尔茨海默症记忆效应的近似感受。通过在 ImageNet 和 CIFAR-10 数据集上的系统评估,我们展示了视觉变换器(Vision Transformer)架构达到最佳性能,优于传统卷积神经网络和生成式方法。我们的工作建立了首个用于神经感知模拟的系统基准,提出了基于临床文献的新型疾病特异性扰动函数,并提供了用于评估模拟保真度的定量指标。 该框架可立即应用于医学教育、同理心训练和辅助技术开发,同时推进我们对神经网络如何建模非典型人类感知的基础性理解。

Subjects: Neurons and Cognition, Artificial Intelligence, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing

Publish: 2025-08-13 14:34:33 UTC 发布:2025-08-13 14:34:33 UTC

#30 PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts #30 序曲:一个旨在要求对长上下文进行全局理解和推理的基准

Authors: [Mo Yu](https://arxiv.org/search/?searchtype=author&query=Mo Yu), [Tsz Ting Chung](https://arxiv.org/search/?searchtype=author&query=Tsz Ting Chung), [Chulun Zhou](https://arxiv.org/search/?searchtype=author&query=Chulun Zhou), [Tong Li](https://arxiv.org/search/?searchtype=author&query=Tong Li), [Rui Lu](https://arxiv.org/search/?searchtype=author&query=Rui Lu), [Jiangnan Li](https://arxiv.org/search/?searchtype=author&query=Jiangnan Li), [Liyan Xu](https://arxiv.org/search/?searchtype=author&query=Liyan Xu), [Haoshu Lu](https://arxiv.org/search/?searchtype=author&query=Haoshu Lu), [Ning Zhang](https://arxiv.org/search/?searchtype=author&query=Ning Zhang), [Jing Li](https://arxiv.org/search/?searchtype=author&query=Jing Li), [Jie Zhou](https://arxiv.org/search/?searchtype=author&query=Jie Zhou) 作者:于墨、钟子霆、周楚伦、李彤、卢睿、李江南、许丽妍、卢浩书、张宁、李静、周杰

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks – as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning. 我们引入了 PRELUDE,一个通过判断角色的前传故事是否与原著书的正规叙事相一致来评估长上下文理解的基准。我们的任务对全局理解和深入推理提出了比现有基准更高的要求——由于前传并非原始故事的一部分,评估其合理性通常需要检索并整合仅与之间接相关的信息。实证上,88% 的实例需要来自叙事多个部分的证据。实验结果凸显了我们任务的挑战性:使用 in-context learning、RAG 以及在域内训练的最先进 LLMs 和商业 DeepResearch 服务,其表现比人类落后超过 15%。进一步的人类研究表明,模型常常在推理有缺陷的情况下给出正确答案,导致推理准确性与人类相比存在超过 30% 的差距。这些发现强调了在长上下文理解与推理方面仍有大量改进空间。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 14:28:25 UTC 发布时间:2025-08-13 14:28:25 UTC

#31 Speed Always Wins: A Survey on Efficient Architectures for Large Language Models #31 速度永远胜出:关于大型语言模型高效架构的综述

Authors: [Weigao Sun](https://arxiv.org/search/?searchtype=author&query=Weigao Sun), [Jiaxi Hu](https://arxiv.org/search/?searchtype=author&query=Jiaxi Hu), [Yucheng Zhou](https://arxiv.org/search/?searchtype=author&query=Yucheng Zhou), [Jusen Du](https://arxiv.org/search/?searchtype=author&query=Jusen Du), [Disen Lan](https://arxiv.org/search/?searchtype=author&query=Disen Lan), [Kexin Wang](https://arxiv.org/search/?searchtype=author&query=Kexin Wang), [Tong Zhu](https://arxiv.org/search/?searchtype=author&query=Tong Zhu), [Xiaoye Qu](https://arxiv.org/search/?searchtype=author&query=Xiaoye Qu), [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Xiaoyu Mo](https://arxiv.org/search/?searchtype=author&query=Xiaoyu Mo), [Daizong Liu](https://arxiv.org/search/?searchtype=author&query=Daizong Liu), [Yuxuan Liang](https://arxiv.org/search/?searchtype=author&query=Yuxuan Liang), [Wenliang Chen](https://arxiv.org/search/?searchtype=author&query=Wenliang Chen), [Guoqi Li](https://arxiv.org/search/?searchtype=author&query=Guoqi Li), [Yu Cheng](https://arxiv.org/search/?searchtype=author&query=Yu Cheng) 作者:孙维高、胡嘉熙、周宇成、杜居森、兰迪森、王可昕、朱彤、曲晓烨、张宇、莫晓雨、刘代宗、梁昱轩、陈文亮、李国齐、程煜

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and push the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems. 大型语言模型(LLMs)在语言理解、生成、推理方面取得了令人瞩目的成果,并推动了多模态模型能力的边界。作为现代 LLMs 的基础,Transformer 模型由于优异的可扩展性提供了强有力的基线。然而,传统的 Transformer 架构需要大量计算资源,这对大规模训练和实际部署构成了重大障碍。在这篇综述中,我们系统地考察了旨在解决 Transformer 固有局限并提升效率的创新 LLM 架构。从语言建模出发,本文涵盖了线性和稀疏序列建模方法的背景与技术细节、高效的全注意力变体、稀疏专家混合(sparse mixture-of-experts)、结合上述技术的混合模型架构,以及新兴的扩散式 LLMs。此外,我们还讨论了这些技术在其他模态上的应用,并考虑了它们对开发可扩展、资源感知基础模型的更广泛影响。 通过将近来的研究归入上述类别,本综述呈现了现代高效 LLM 架构的蓝图,我们希望这能激励未来针对更高效、多功能 AI 系统的研究。

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:计算与语言、人工智能、计算机视觉与模式识别

Publish: 2025-08-13 14:13:46 UTC 发布:2025-08-13 14:13:46 协调世界时 (UTC)

#32 Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification #32 探索大型语言模型在细粒度审稿意见分类中的潜力

Authors: [Linh Nguyen](https://arxiv.org/search/?searchtype=author&query=Linh Nguyen), [Chunhua Liu](https://arxiv.org/search/?searchtype=author&query=Chunhua Liu), [Hong Yi Lin](https://arxiv.org/search/?searchtype=author&query=Hong Yi Lin), [Patanamon Thongtanunam](https://arxiv.org/search/?searchtype=author&query=Patanamon Thongtanunam) 作者:Linh Nguyen、Chunhua Liu、Hong Yi Lin、Patanamon Thongtanunam

Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs to classify 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to low training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that the LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process. 代码审查是软件开发中的一项关键实践。随着如今代码审查变得轻量化,各种问题都可以被发现,有时这些问题可能很琐碎。研究已经探讨了自动化方法来对审查评论进行分类,以衡量代码审查的有效性。然而,先前的研究主要依赖监督式机器学习,这需要大量手工标注才能有效训练模型。为了解决这一限制,我们探索了使用 LLMs 对代码审查评论进行分类的潜力。我们评估了 LLMs 对 17 类代码审查评论的分类表现。结果表明,LLMs 能够对代码审查评论进行分类,其表现优于使用训练过的深度学习模型的现有最先进方法。尤其是,LLMs 在分类五类最有用的类别上取得了更好的准确率,而最先进的方法由于训练样本少在这些类别上表现不佳。我们的结果表明,与其单纯依赖特定的小规模训练数据分布,不如使用 LLMs 在高频和低频类别上都能提供更均衡的表现。 这些结果表明,LLMs 可以为代码审查分析提供一种可扩展的解决方案,从而提高代码审查过程的有效性。

Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能

Publish: 2025-08-13 14:07:05 UTC 发布时间:2025-08-13 14:07:05 UTC

#33 RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians #33 RayletDF:用于从点云或高斯分布进行可泛化三维表面重建的 Raylet 距离场

Authors: [Shenxing Wei](https://arxiv.org/search/?searchtype=author&query=Shenxing Wei), [Jinxi Li](https://arxiv.org/search/?searchtype=author&query=Jinxi Li), [Yafei Yang](https://arxiv.org/search/?searchtype=author&query=Yafei Yang), [Siyuan Zhou](https://arxiv.org/search/?searchtype=author&query=Siyuan Zhou), [Bo Yang](https://arxiv.org/search/?searchtype=author&query=Bo Yang) 作者:魏申兴,李金玺,杨亚飞,周思源,杨波

In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing. 在本文中,我们提出了一种可泛化的方法,用于从原始点云或由 RGB 图像通过 3DGS 预估的 3D 高斯恢复三维表面。与现有的基于坐标的方法在渲染显式表面时通常计算量大不同,我们提出的方法 RayletDF 引入了一种称为 raylet 距离场的新技术,旨在直接从查询射线预测表面点。我们的流程由三个关键模块组成:raylet 特征提取器、raylet 距离场预测器和多 raylet 混合器。这些组件协同工作以提取细粒度的局部几何特征、预测 raylet 距离并聚合多个预测以重建精确的表面点。我们在多个公开的真实世界数据集上对方法进行了广泛评估,证明了在从点云或 3D 高斯重建表面方面的优越性能。最值得注意的是,我们的方法具有出色的泛化能力,能够在测试中对未见过的数据集通过一次前向传递成功恢复三维表面。
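
The core geometric step, turning a predicted raylet distance into a surface point and blending several such predictions, can be sketched as follows; the confidence weights here are simple placeholders for the learned multi-raylet blender.

```python
# Minimal sketch of raylet-distance-to-surface-point conversion and multi-raylet blending.
import numpy as np

def surface_point(origin, direction, t):
    """A raylet distance t places the surface point at origin + t * unit(direction)."""
    d = np.asarray(direction, dtype=float)
    return np.asarray(origin, dtype=float) + t * d / np.linalg.norm(d)

def blend_predictions(points, confidences):
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return (w[:, None] * np.asarray(points)).sum(axis=0)

# Two raylets hitting roughly the same surface region, fused into one estimate.
p1 = surface_point([0.0, 0.0, 0.0], [0, 0, 1], 2.0)
p2 = surface_point([0.1, 0.0, 0.0], [0, 0, 1], 1.98)
print(blend_predictions([p1, p2], confidences=[0.7, 0.3]))
```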

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Machine Learning, Robotics 主题:计算机视觉与模式识别、人工智能、图形学、机器学习、机器人学

Publish: 2025-08-13 14:05:21 UTC 发布:2025-08-13 14:05:21 UTC

#34 Provable In-Context Vector Arithmetic via Retrieving Task Concepts #34 可证明的上下文内向量算术通过检索任务概念

Authors: [Dake Bu](https://arxiv.org/search/?searchtype=author&query=Dake Bu), [Wei Huang](https://arxiv.org/search/?searchtype=author&query=Wei Huang), [Andi Han](https://arxiv.org/search/?searchtype=author&query=Andi Han), [Atsushi Nitanda](https://arxiv.org/search/?searchtype=author&query=Atsushi Nitanda), [Qingfu Zhang](https://arxiv.org/search/?searchtype=author&query=Qingfu Zhang), [Hau-San Wong](https://arxiv.org/search/?searchtype=author&query=Hau-San Wong), [Taiji Suzuki](https://arxiv.org/search/?searchtype=author&query=Taiji Suzuki) 作者:段克(Dake Bu)、黄伟(Wei Huang)、韩安迪(Andi Han)、仁田敦(Atsushi Nitanda)、张庆福(Qingfu Zhang)、黄皓山(Hau-San Wong)、铃木太治(Taiji Suzuki)

In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights. 上下文学习(In-context learning,ICL)因其从示例中掌握函数/任务的能力而受到广泛关注。近期研究表明,在 ICL 过程中,大型语言模型(LLMs)中存在潜在的任务/函数向量。Merullo 等人(2024)证明,LLMs 利用该向量以及残差流进行类似 Word2Vec 的向量算术,从而解决事实回忆类的 ICL 任务。此外,最新工作从实证上强调了问答数据在提升事实回忆能力方面的关键作用。尽管有这些洞见,理论解释仍然难以捉摸。为进一步推进,我们提出了一个基于实证支撑的分层概念建模的理论框架。我们构建了一个优化理论,展示了通过对交叉熵损失进行梯度下降训练的非线性残差变换器如何通过向量算术执行事实回忆类的 ICL 任务。我们证明了 0-1 损失收敛并展示了强泛化能力,包括对概念重组和分布变化的鲁棒性。这些结果阐明了变换器相较于静态嵌入先驱的优势。实证模拟验证了我们的理论见解。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 13:54:44 UTC 发布:2025-08-13 13:54:44 协调世界时(UTC)

#35 TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos #35 TRACE:从多视角视频学习三维高斯物理动力学

Authors: [Jinxi Li](https://arxiv.org/search/?searchtype=author&query=Jinxi Li), [Ziyang Song](https://arxiv.org/search/?searchtype=author&query=Ziyang Song), [Bo Yang](https://arxiv.org/search/?searchtype=author&query=Bo Yang) 作者:李津熙,宋子洋,杨博

In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle’s motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic datasets demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters. 在本文中,我们旨在仅从动态多视角视频中(在没有任何人工标注的情况下)建模三维场景的几何形状、外观和物理信息。通过将物理信息损失作为软约束或将简单的物理模型集成到神经网络中,现有工作常常难以学习复杂的运动物理,或者要做到这一点需要额外的标注,例如对象类型或遮罩。我们提出了一个名为 TRACE 的新框架来建模复杂动态三维场景的运动物理。我们方法的关键创新在于:通过将每个三维点表述为具有空间大小和方向的刚性粒子,我们直接为每个粒子学习平移-旋转动力系统,显式估计一整套物理参数以支配粒子随时间的运动。在三个现有的动态数据集和一个新创建的具有挑战性的合成数据集上的大量实验证明,在未来帧外推任务中,我们的方法相比基线方法表现卓越。我们框架的一个优良特性是,只需对学习到的物理参数进行聚类,就可以轻松地分割出多个对象或部件。
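
The "rigid particle with translation-rotation dynamics" idea amounts to integrating each particle's pose from a set of learned physical parameters. A minimal sketch of one such integration step (illustrative only; the velocities below are made-up stand-ins for TRACE's estimated per-particle parameters):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def step(pos, quat, lin_vel, ang_vel, dt):
    """One explicit integration step of a translation-rotation system for a rigid
    particle: position advances by linear velocity, orientation by the exponential
    map of the angular velocity (axis-angle over dt)."""
    new_pos = pos + dt * lin_vel
    new_rot = R.from_rotvec(dt * ang_vel) * R.from_quat(quat)
    return new_pos, new_rot.as_quat()

# toy particle; learned physical parameters would play the role of the velocities
pos = np.array([0.0, 0.0, 0.0])
quat = np.array([0.0, 0.0, 0.0, 1.0])        # identity orientation (x, y, z, w)
lin_vel = np.array([0.1, 0.0, 0.0])
ang_vel = np.array([0.0, 0.0, np.pi / 8])    # rad/s about z

for _ in range(4):                           # extrapolate a few future frames
    pos, quat = step(pos, quat, lin_vel, ang_vel, dt=0.1)
print(pos, quat)
```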

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computational Engineering, Finance, and Science, Machine Learning, Robotics 主题:计算机视觉与模式识别、人工智能、计算工程、金融与科学、机器学习、机器人学

Publish: 2025-08-13 13:43:01 UTC 发表时间:2025-08-13 13:43:01 UTC

#36 A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems #36 临床心理健康人工智能系统数据集的全面综述

Authors: [Aishik Mandal](https://arxiv.org/search/?searchtype=author&query=Aishik Mandal), [Prottay Kumar Adhikary](https://arxiv.org/search/?searchtype=author&query=Prottay Kumar Adhikary), [Hiba Arnaout](https://arxiv.org/search/?searchtype=author&query=Hiba Arnaout), [Iryna Gurevych](https://arxiv.org/search/?searchtype=author&query=Iryna Gurevych), [Tanmoy Chakraborty](https://arxiv.org/search/?searchtype=author&query=Tanmoy Chakraborty) 作者:Aishik Mandal、Prottay Kumar Adhikary、Hiba Arnaout、Iryna Gurevych、Tanmoy Chakraborty

Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems. 心理健康障碍在全球范围内日益增加。然而,受过培训的临床医生数量并没有相应增加,导致许多人无法获得足够或及时的支持。为弥补这一差距,近期研究表明人工智能(AI)在辅助心理健康诊断、监测和干预方面具有前景。然而,开发能够高效、可靠且合乎伦理地辅助临床医生的人工智能,很大程度上依赖于高质量的临床训练数据集。尽管对用于训练临床 AI 助手的数据整理兴趣日益增长,现有数据集在很大程度上仍然分散、文档不全且常常无法获取,这阻碍了为临床心理健康护理开发的 AI 模型的可重复性、可比性和可推广性。在本文中,我们呈现了首个关于与训练和开发 AI 驱动临床助手相关的临床心理健康数据集的全面综述。 我们按精神障碍类型(例如抑郁、精神分裂)、数据模态(例如文本、语音、生理信号)、任务类型(例如诊断预测、症状严重度估计、干预生成)、可访问性(公开、受限或私有)和社会文化背景(例如语言和文化背景)对这些数据集进行分类。除此之外,我们还研究了合成的临床心理健康数据集。我们的综述识别出若干关键不足,例如缺乏纵向数据、文化和语言代表性有限、采集与标注标准不一致,以及合成数据模态的匮乏。最后,我们概述了策划与标准化未来数据集的主要挑战,并提供了可行的建议,以促进更健壮、可泛化且更公平的心理健康人工智能系统的开发。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 13:42:35 UTC 发布时间:2025-08-13 13:42:35 UTC

#37 Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology #37 用于三维神经病理学的冠状脑组织切片自动分割

Authors: [Jonathan Williams Ramirez](https://arxiv.org/search/?searchtype=author&query=Jonathan Williams Ramirez), [Dina Zemlyanker](https://arxiv.org/search/?searchtype=author&query=Dina Zemlyanker), [Lucas Deden-Binder](https://arxiv.org/search/?searchtype=author&query=Lucas Deden-Binder), [Rogeny Herisse](https://arxiv.org/search/?searchtype=author&query=Rogeny Herisse), [Erendira Garcia Pallares](https://arxiv.org/search/?searchtype=author&query=Erendira Garcia Pallares), [Karthik Gopinath](https://arxiv.org/search/?searchtype=author&query=Karthik Gopinath), [Harshvardhan Gazula](https://arxiv.org/search/?searchtype=author&query=Harshvardhan Gazula), [Christopher Mount](https://arxiv.org/search/?searchtype=author&query=Christopher Mount), [Liana N. Kozanno](https://arxiv.org/search/?searchtype=author&query=Liana N. Kozanno), [Michael S. Marshall](https://arxiv.org/search/?searchtype=author&query=Michael S. Marshall), [Theresa R. Connors](https://arxiv.org/search/?searchtype=author&query=Theresa R. Connors), [Matthew P. Frosch](https://arxiv.org/search/?searchtype=author&query=Matthew P. Frosch), [Mark Montine](https://arxiv.org/search/?searchtype=author&query=Mark Montine), [Derek H. Oakley](https://arxiv.org/search/?searchtype=author&query=Derek H. Oakley), [Christine L. Mac Donald](https://arxiv.org/search/?searchtype=author&query=Christine L. Mac Donald), [C. Dirk Keene](https://arxiv.org/search/?searchtype=author&query=C. Dirk Keene), [Bradley T. Hyman](https://arxiv.org/search/?searchtype=author&query=Bradley T. Hyman), [Juan Eugenio Iglesias](https://arxiv.org/search/?searchtype=author&query=Juan Eugenio Iglesias) 作者:Jonathan Williams Ramirez、Dina Zemlyanker、Lucas Deden-Binder、Rogeny Herisse、Erendira Garcia Pallares、Karthik Gopinath、Harshvardhan Gazula、Christopher Mount、Liana N. Kozanno、Michael S. Marshall、Theresa R. Connors、Matthew P. Frosch、Mark Montine、Derek H. Oakley、Christine L. Mac Donald、C. Dirk Keene、Bradley T. Hyman、Juan Eugenio Iglesias

Advances in image registration and machine learning have recently enabled volumetric analysis of postmortem brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of (i) 1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites; and (ii) 2,000 synthetic images with randomized contrast and corresponding masks generated from MRI scans for improved generalizability to unseen photographic setups. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels – including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, mean surface distance under 0.4 mm, and 95% Hausdorff distance under 1.60 mm, which approaches inter-/intra-rater levels. Our tool is publicly available at surfer.nmr.mgh.harvard.edu/fswiki/PhotoTools. 图像配准和机器学习的进步最近使得基于常规冠状切片照片对尸检后大脑组织进行体积分析成为可能,这些照片在全球的脑库和神经病理实验室中被常规收集。这一方法的一个缺点是需要从照片中对组织进行分割,而这目前需要昂贵的人工干预。本文提出了一个用于自动化该过程的深度学习模型。该自动分割工具基于 U-Net 架构,使用以下数据组合进行训练:(i) 来自不同诊断的标本、在两个不同地点拍摄的固定和新鲜组织的 1,414 张人工分割图像;以及 (ii) 为提高对未见过的拍照设置的泛化能力,从 MRI 扫描生成的带有随机化对比度及相应掩模的 2,000 张合成图像。对未用于训练的照片子集上模型的自动预测进行了分析,以评估其与人工标签的性能——包括评分者之间和评分者内部的一致性变异。我们的模型在中位 Dice 分数上超过 0.98,平均表面距离低于 0.4 mm,95% Hausdorff 距离低于 1.60 mm,接近评估者之间/同一评估者内部的一致性水平。我们的工具可在 surfer.nmr.mgh.harvard.edu/fswiki/PhotoTools 公开获取。
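
For reference, the headline Dice metric is straightforward to compute from two binary masks; a small numpy sketch (surface and Hausdorff distances would additionally require distance transforms):

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks (1 = tissue, 0 = background)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

# toy 2D masks standing in for a slab-photograph segmentation and its ground truth
pred = np.zeros((100, 100), dtype=np.uint8); pred[20:80, 25:85] = 1
gt   = np.zeros((100, 100), dtype=np.uint8); gt[22:80, 24:84] = 1
print(round(dice_score(pred, gt), 3))
```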

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 13:40:20 UTC 发布时间:2025-08-13 13:40:20 协调世界时

#38 Explainable Ensemble Learning for Graph-Based Malware Detection #38 基于图的恶意软件检测的可解释集成学习

Authors: [Hossein Shokouhinejad](https://arxiv.org/search/?searchtype=author&query=Hossein Shokouhinejad), [Roozbeh Razavi-Far](https://arxiv.org/search/?searchtype=author&query=Roozbeh Razavi-Far), [Griffin Higgins](https://arxiv.org/search/?searchtype=author&query=Griffin Higgins), [Ali A Ghorbani](https://arxiv.org/search/?searchtype=author&query=Ali A Ghorbani) 作者:Hossein Shokouhinejad、Roozbeh Razavi-Far、Griffin Higgins、Ali A Ghorbani

Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive techniques. Graph neural networks (GNNs) have shown promise in this domain by modeling rich structural dependencies in graph-based program representations such as control flow graphs (CFGs). However, single-model approaches may suffer from limited generalization and lack interpretability, especially in high-stakes security applications. In this paper, we propose a novel stacking ensemble framework for graph-based malware detection and explanation. Our method dynamically extracts CFGs from portable executable (PE) files and encodes their basic blocks through a two-step embedding strategy. A set of diverse GNN base learners, each with a distinct message-passing mechanism, is used to capture complementary behavioral features. Their prediction outputs are aggregated by a meta-learner implemented as an attention-based multilayer perceptron, which both classifies malware instances and quantifies the contribution of each base model. To enhance explainability, we introduce an ensemble-aware post-hoc explanation technique that leverages edge-level importance scores generated by a GNN explainer and fuses them using the learned attention weights. This produces interpretable, model-agnostic explanations aligned with the final ensemble decision. Experimental results demonstrate that our framework improves classification performance while providing insightful interpretations of malware behavior. 在现代计算环境中,恶意软件检测要求模型不仅要准确,还要具有可解释性并对规避技术具有鲁棒性。图神经网络(GNN)在该领域显示出潜力,因为它们能够对基于图的程序表示(例如控制流图,CFG)中的丰富结构依赖进行建模。然而,单一模型方法可能在泛化能力上有限,并且缺乏可解释性,尤其是在高风险的安全应用中。本文提出了一种用于基于图的恶意软件检测和解释的新型堆叠集成框架。我们的方法从便携式可执行文件(PE)中动态提取 CFG,并通过两步嵌入策略对其基本块进行编码。一组多样化的 GNN 基学习器,每个都有不同的信息传递机制,用于捕捉互补的行为特征。它们的预测输出由一个作为基于注意力的多层感知器实现的元学习器聚合,该元学习器既对恶意软件实例进行分类,又量化每个基模型的贡献。 为增强可解释性,我们提出了一种面向集成模型的事后解释技术,该技术利用 GNN 解释器生成的边级重要性评分,并使用学习到的注意力权重将其融合。这样可以产生与最终集成决策一致的、可解释且与模型无关的解释。实验结果表明,我们的框架在提升分类性能的同时,还能提供对恶意软件行为的有见地解释。
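
The attention-based meta-learner can be pictured as a softmax weighting over the base models' outputs, where the weights double as per-model contribution scores used for explanation. A toy numpy sketch (the scores would in practice come from the trained meta-MLP):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_stack(base_probs, attn_scores):
    """Fuse per-base-model malware probabilities with attention weights.
    base_probs: (n_models,) malware probability from each GNN base learner.
    attn_scores: (n_models,) unnormalised scores a meta-learner might emit."""
    w = softmax(attn_scores)
    return float(w @ base_probs), w

# toy example: three base learners with different message-passing schemes
base_probs = np.array([0.92, 0.40, 0.75])
attn_scores = np.array([1.2, -0.3, 0.5])      # would be produced by the meta-MLP
score, weights = attention_stack(base_probs, attn_scores)
print(score, weights)                          # weights double as per-model contributions
```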

Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能

Publish: 2025-08-13 13:33:02 UTC 发布时间:2025-08-13 13:33:02 世界协调时 (UTC)

#39 LibRec: Benchmarking Retrieval-Augmented LLMs for Library Migration Recommendations #39 LibRec:基于检索增强 LLMs 的库迁移推荐基准测试

Authors: [Junxiao Han](https://arxiv.org/search/?searchtype=author&query=Junxiao Han), [Yarong Wang](https://arxiv.org/search/?searchtype=author&query=Yarong Wang), [Xiaodong Gu](https://arxiv.org/search/?searchtype=author&query=Xiaodong Gu), [Cuiyun Gao](https://arxiv.org/search/?searchtype=author&query=Cuiyun Gao), [Yao Wan](https://arxiv.org/search/?searchtype=author&query=Yao Wan), [Song Han](https://arxiv.org/search/?searchtype=author&query=Song Han), [David Lo](https://arxiv.org/search/?searchtype=author&query=David Lo), [Shuiguang Deng](https://arxiv.org/search/?searchtype=author&query=Shuiguang Deng) 作者:韩俊晓、王雅蓉、顾晓东、高翠云、万尧、韩松、David Lo、邓水广

In this paper, we propose LibRec, a novel framework that integrates the capabilities of LLMs with retrieval-augmented generation(RAG) techniques to automate the recommendation of alternative libraries. The framework further employs in-context learning to extract migration intents from commit messages to enhance the accuracy of its recommendations. To evaluate the effectiveness of LibRec, we introduce LibEval, a benchmark designed to assess the performance in the library migration recommendation task. LibEval comprises 2,888 migration records associated with 2,368 libraries extracted from 2,324 Python repositories. Each migration record captures source-target library pairs, along with their corresponding migration intents and intent types. Based on LibEval, we evaluated the effectiveness of ten popular LLMs within our framework, conducted an ablation study to examine the contributions of key components within our framework, explored the impact of various prompt strategies on the framework’s performance, assessed its effectiveness across various intent types, and performed detailed failure case analyses. 在本文中,我们提出了 LibRec,一种新颖的框架,将 LLMs 的能力与检索增强生成(RAG)技术相结合,以自动推荐替代库。该框架进一步利用上下文学习从提交信息中提取迁移意图,从而提高其推荐的准确性。为评估 LibRec 的有效性,我们引入了 LibEval,这是一个用于评估库迁移推荐任务表现的基准。LibEval 包含从 2,324 个 Python 仓库中提取的 2,888 条迁移记录,涉及 2,368 个库。每条迁移记录都记录了源-目标库对,以及相应的迁移意图和意图类型。基于 LibEval,我们在框架内评估了十个流行的 LLMs 的有效性,进行了消融研究以检查框架中关键组件的贡献,探讨了各种提示策略对框架性能的影响,评估了其在不同意图类型下的有效性,并对失败案例进行了详细分析。
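
A retrieval-augmented recommendation step of this kind can be sketched as: retrieve similar migration records, then assemble them into a prompt for an LLM. An illustrative Python sketch using TF-IDF retrieval over toy records (the records and the omitted LLM call are assumptions, not LibRec's implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A few toy migration records (source -> target, with a commit-derived intent).
records = [
    {"source": "urllib2", "target": "requests", "intent": "simpler HTTP API"},
    {"source": "optparse", "target": "argparse", "intent": "optparse is deprecated"},
    {"source": "nose", "target": "pytest", "intent": "nose is unmaintained"},
]
corpus = [f"{r['source']} {r['intent']}" for r in records]
vec = TfidfVectorizer().fit(corpus)
doc_mat = vec.transform(corpus)

def retrieve(query, k=2):
    """Return the k most similar migration records for a query string."""
    sims = cosine_similarity(vec.transform([query]), doc_mat)[0]
    return [records[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(library, commit_msg):
    """Assemble retrieved examples and the migration intent into an LLM prompt."""
    examples = retrieve(f"{library} {commit_msg}")
    shots = "\n".join(f"- {r['source']} -> {r['target']} ({r['intent']})" for r in examples)
    return (f"Known migrations:\n{shots}\n\n"
            f"Commit message: {commit_msg}\n"
            f"Recommend a replacement for the library '{library}'.")

# The prompt would then be passed to whatever chat/completions client is used.
print(build_prompt("httplib", "move away from the low-level stdlib HTTP client"))
```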

Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能

Publish: 2025-08-13 13:22:49 UTC

#40 Prototype Training with Dual Pseudo-Inverse and Optimized Hidden Activations #40 使用双伪逆和优化隐藏激活的原型训练

Author: [Mauro Tucci](https://arxiv.org/search/?searchtype=author&query=Mauro Tucci) 作者:Mauro Tucci

We present Proto-PINV+H, a fast training paradigm that combines closed-form weight computation with gradient-based optimisation of a small set of synthetic inputs, soft labels, and, crucially, hidden activations. At each iteration we recompute all weight matrices in closed form via two (or more) ridge-regularised pseudo-inverse solves, while updating only the prototypes with Adam. The trainable degrees of freedom are thus shifted from weight space to data/activation space. On MNIST (60k train, 10k test) and Fashion-MNIST (60k train, 10k test), our method reaches 97.8% and 89.3% test accuracy on the official 10k test sets, respectively, in 3.9s–4.5s using approximately 130k trainable parameters and only 250 epochs on an RTX 5060 (16GB). We provide a multi-layer extension (optimised activations at each hidden stage), learnable ridge parameters, optional PCA/PLS projections, and theory linking the condition number of prototype matrices to generalisation. The approach yields favourable accuracy–speed–size trade-offs against ELM, random-feature ridge, and shallow MLPs trained by back-propagation. 我们提出了 Proto-PINV+H,一种快速训练范式,将闭式权重计算与对一小组合成输入、软标签及——关键——隐藏激活的基于梯度的优化相结合。在每次迭代中,我们通过两次(或更多次)带岭回归正则化的伪逆求解以闭式重算所有权重矩阵,同时仅用 Adam 更新原型。因此,可训练的自由度从权重空间转移到了数据/激活空间。在 MNIST(60k 训练,10k 测试)和 Fashion-MNIST(60k 训练,10k 测试)上,我们的方法分别在官方 10k 测试集上以约 13 万可训练参数和仅 250 个 epoch,在 RTX 5060(16GB)上用时 3.9s–4.5s 达到 97.8% 和 89.3% 的测试准确率。我们提供了多层扩展(对每个隐藏阶段优化激活)、可学习的岭参数、可选的 PCA/PLS 投影,以及将原型矩阵的条件数与泛化联系起来的理论。该方法在准确率—速度—模型大小权衡上优于 ELM、随机特征岭回归和通过反向传播训练的浅层 MLP。
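
The closed-form ingredient is the classic ridge-regularised pseudo-inverse: given hidden activations H and (soft) targets Y, the output weights are W = (H^T H + lam I)^(-1) H^T Y. A minimal numpy sketch of one such solve (in the paper this alternates with Adam updates of the prototypes and activations):

```python
import numpy as np

def ridge_pinv_solve(H, Y, lam=1e-2):
    """Closed-form ridge solution W = (H^T H + lam I)^-1 H^T Y, i.e. the
    regularised pseudo-inverse mapping hidden activations H to targets Y."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)

# toy shapes: 20 prototypes, 32 hidden units, 10 classes (soft labels)
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 32))     # optimised hidden activations (trainable)
Y = rng.standard_normal((20, 10))     # soft labels (trainable)
W_out = ridge_pinv_solve(H, Y)        # output weights recomputed in closed form
print(W_out.shape)                    # (32, 10)
```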

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 13:13:32 UTC 发布:2025-08-13 13:13:32 UTC

#41 Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges #41 可解释自然语言处理的采纳:来自产业界与学术界关于实践与挑战的观点

Authors: [Mahdi Dhaini](https://arxiv.org/search/?searchtype=author&query=Mahdi Dhaini), [Tobias Müller](https://arxiv.org/search/?searchtype=author&query=Tobias Müller), [Roksoliana Rabets](https://arxiv.org/search/?searchtype=author&query=Roksoliana Rabets), [Gjergji Kasneci](https://arxiv.org/search/?searchtype=author&query=Gjergji Kasneci) 作者:Mahdi Dhaini、Tobias Müller、Roksoliana Rabets、Gjergji Kasneci

The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners’ perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice. 可解释自然语言处理(NLP)领域近年来迅速发展。复杂模型日益不透明,这要求对其决策提供透明度和解释,这对于理解其推理并促进部署尤为关键,特别是在高风险环境中。尽管可解释 NLP 受到越来越多关注,但从业者关于其实践采用和有效性的观点仍然研究不足。本文通过调查从业者使用可解释性方法的经验来填补这一研究空白,重点关注他们采纳此类方法的动机、采用的技术、满意度以及在现实世界 NLP 应用中遇到的实际挑战。通过对行业从业者的定性访谈研究以及对学术研究者的补充访谈,我们系统地分析并比较了他们的观点。我们的研究发现揭示了概念性差距、对现有可解释性方法的低满意度,并突出了评估方面的挑战。 我们的研究结果强调了在实践中更好地推广可解释自然语言处理所需的明确定义和以用户为中心的框架。

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction 主题:计算与语言、人工智能、人机交互

Publish: 2025-08-13 13:12:18 UTC 发布:2025-08-13 13:12:18 UTC

#42 Combinative Matching for Geometric Shape Assembly #42 几何形状组装的组合匹配

Authors: [Nahyuk Lee](https://arxiv.org/search/?searchtype=author&query=Nahyuk Lee), [Juhong Min](https://arxiv.org/search/?searchtype=author&query=Juhong Min), [Junhong Lee](https://arxiv.org/search/?searchtype=author&query=Junhong Lee), [Chunghyun Park](https://arxiv.org/search/?searchtype=author&query=Chunghyun Park), [Minsu Cho](https://arxiv.org/search/?searchtype=author&query=Minsu Cho) 作者:Nahyuk Lee、Juhong Min、Junhong Lee、Chunghyun Park、Minsu Cho

This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape’ and ‘opposite volume occupancy.’ Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet. 本文提出了一种新的形状匹配方法——组合匹配,用于将互锁部件组合用于几何形状装配。以往的几何装配方法通常依赖于通过在部件之间寻找相同的表面来对齐部件,类似于传统的形状匹配与配准。相反,我们明确建模了互锁形状的两个不同属性:“相同的表面形状”和“相反的体积占据”。因此,我们的方法学习在表面形状看起来相同但其体积在相互颠倒的空间中占据的位置的区域之间建立对应关系。为了促进这一过程,我们还通过等变神经网络估计其形状方向,学习在旋转中对齐区域。所提出的方法显著减少了匹配中的局部歧义,并允许在装配中稳健地组合部件。在几何装配基准测试上的实验结果证明了我们方法的有效性,持续优于最先进的方法。项目页面: https://nahyuklee.github.io/cmnet.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 13:01:24 UTC 发布:2025-08-13 13:01:24 UTC

#43 Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study #43 LLM 生成的文本解释能否提高模型的分类表现?一项实证研究

Authors: [Mahdi Dhaini](https://arxiv.org/search/?searchtype=author&query=Mahdi Dhaini), [Juraj Vladika](https://arxiv.org/search/?searchtype=author&query=Juraj Vladika), [Ege Erdogan](https://arxiv.org/search/?searchtype=author&query=Ege Erdogan), [Zineb Attaoui](https://arxiv.org/search/?searchtype=author&query=Zineb Attaoui), [Gjergji Kasneci](https://arxiv.org/search/?searchtype=author&query=Gjergji Kasneci) 作者:Mahdi Dhaini, Juraj Vladika, Ege Erdogan, Zineb Attaoui, Gjergji Kasneci

In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance. 在快速发展的可解释自然语言处理(NLP)领域,文本解释,即类人推理,对于解释模型预测并为数据集增加可解释标签至关重要。传统方法依赖人工标注,成本高、劳动强度大,且阻碍了可扩展性。在这项工作中,我们提出了一个自动化框架,利用多个最先进的大型语言模型(LLMs)生成高质量的文本解释。我们使用一套全面的自然语言生成(NLG)度量方法严格评估这些由 LLMs 生成的解释质量。此外,我们还研究了这些解释对预训练语言模型(PLMs)和 LLMs 在两个不同基准数据集上的自然语言推理任务性能的下游影响。我们的实验证明,与人工标注的解释相比,自动生成的解释在提升模型性能方面表现出高度竞争力。我们的发现强调了基于 LLMs 的可扩展自动文本解释生成在扩展 NLP 数据集和提升模型性能方面的有前景途径。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 12:59:08 UTC 发布:2025-08-13 12:59:08 UTC

#44 Counting Short Trajectories in Elementary Cellular Automata using the Transfer Matrix Method #44 使用转移矩阵方法计数初等元胞自动机中的短轨迹

Authors: [Cédric Koller](https://arxiv.org/search/?searchtype=author&query=Cédric Koller), [Barbora Hudcová](https://arxiv.org/search/?searchtype=author&query=Barbora Hudcová) 作者:Cédric Koller、Barbora Hudcová

Elementary Cellular Automata (ECAs) exhibit diverse behaviours often categorized by Wolfram’s qualitative classification. To provide a quantitative basis for understanding these behaviours, we investigate the global dynamics of such automata and we describe a method that allows us to compute the number of all configurations leading to short attractors in a limited number of time steps. This computation yields exact results in the thermodynamic limit (as the CA grid size grows to infinity), and is based on the Transfer Matrix Method (TMM) that we adapt for our purposes. Specifically, given two parameters (p,c) we are able to compute the entropy of all initial configurations converging to an attractor of size c after p time-steps. By calculating such statistics for various ECA rules, we establish a quantitative connection between the entropy and the qualitative Wolfram classification scheme. Class 1 rules rapidly converge to maximal entropy for stationary states (c=1) as p increases. Class 2 rules also approach maximal entropy quickly for appropriate cycle lengths c, potentially requiring consideration of translations. Class 3 rules exhibit zero or low finite entropy that saturates after a short transient. Class 4 rules show finite positive entropy, similar to some Class 3 rules. This method provides a precise framework for quantifying trajectory statistics, although its exponential computational cost in p+c restricts practical analysis to short trajectories. 元胞自动机(Elementary Cellular Automata,ECA)展示出多样的行为,通常按沃尔夫勒姆(Wolfram)的定性分类进行划分。为了为理解这些行为提供定量基础,我们研究了此类自动机的全局动力学,并描述了一种方法,允许我们计算在有限时间步内导致短吸引子的所有配置数。该计算在热力学极限(随着元胞自动机网格大小趋于无穷)中给出精确结果,基于我们为此目的改编的转移矩阵方法(Transfer Matrix Method,TMM)。具体而言,给定两个参数 (p,c) ,我们能够计算在 p 个时间步后收敛到大小为 c 的吸引子的所有初始配置的熵。通过为各种 ECA 规则计算此类统计量,我们建立了熵与沃尔夫勒姆定性分类方案之间的定量联系。第 1 类规则对稳态( c=1 )迅速收敛到最大熵,随着 p 的增加亦如此。第 2 类规则对于适当的周期长度 c 也能快速逼近最大熵,可能需要考虑平移。 第 3 类规则表现出零或低的有限熵,并在短暂瞬态后达到饱和。第 4 类规则显示出有限的正熵,类似于某些第 3 类规则。该方法为量化轨迹统计提供了精确的框架,但其在 p+c 上的指数级计算代价限制了对实际分析仅能用于短轨迹。
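
The transfer-matrix idea can be demonstrated on the simplest case: configurations that are themselves fixed points of a cyclic ECA (cycle length c=1, reached in p=0 steps). Build a matrix over neighbouring cell pairs whose entries encode the local rule constraint; the trace of its n-th power then counts the valid rings. A small numpy sketch under that simplification (not the paper's general (p,c) construction):

```python
import numpy as np

def eca_rule(rule):
    """Return the local update f(a, b, c) of an elementary CA rule (0-255)."""
    return lambda a, b, c: (rule >> (4 * a + 2 * b + c)) & 1

def count_fixed_points(rule, n):
    """Count length-n cyclic configurations that are already fixed points via a
    transfer matrix over neighbouring cell pairs: T[(a,b),(b,c)] = 1 iff
    f(a,b,c) = b, and trace(T^n) counts the valid labelled rings."""
    f = eca_rule(rule)
    T = np.zeros((4, 4), dtype=np.int64)   # pair state (a, b) -> index 2a + b
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if f(a, b, c) == b:
                    T[2 * a + b, 2 * b + c] = 1
    return int(np.trace(np.linalg.matrix_power(T, n)))

print(count_fixed_points(204, 8))   # identity rule: every configuration is fixed -> 256
print(count_fixed_points(110, 8))   # rule 110 on 8 cells: only the all-zero ring -> 1
```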

Subjects: Cellular Automata and Lattice Gases, Artificial Intelligence, Neural and Evolutionary Computing, Chaotic Dynamics 主题: 细胞自动机与晶格气、人工智能、神经与进化计算、混沌动力学

Publish: 2025-08-13 12:53:22 UTC 发表: 2025-08-13 12:53:22 UTC

#45 Enhance the machine learning algorithm performance in phishing detection with keyword features #45 使用关键字特征提高机器学习算法在网络钓鱼检测中的性能

Author: [Zijiang Yang](https://arxiv.org/search/?searchtype=author&query=Zijiang Yang) 作者:Zijiang Yang

Recently, a significant increase in phishing attacks has been observed on the Internet. In a typical phishing attack, the attacker sets up a malicious website that looks similar to the legitimate website in order to obtain the end-users’ information. This may cause the leakage of sensitive information and financial loss for the end-users. To avoid such attacks, the early detection of these websites’ URLs is vital and necessary. Previous researchers have proposed many machine learning algorithms to distinguish phishing URLs from legitimate ones. In this paper, we enhance these machine learning algorithms from the perspective of feature selection. We propose a novel method to incorporate keyword features with the traditional features. This method is applied to multiple traditional machine learning algorithms and the experimental results show that it is useful and effective. On average, this method can reduce the classification error by 30% for the large dataset. Moreover, its enhancement is more significant for the small dataset. In addition, this method extracts the information from the URL and does not rely on additional information provided by a third-party service. The best result for the machine learning algorithm using our proposed method achieved an accuracy of 99.68%. 最近,我们可以观察到互联网中网络钓鱼攻击的显著增加。在典型的网络钓鱼攻击中,攻击者搭建一个外观类似于合法网站的恶意网站,以获取终端用户的信息。这可能导致敏感信息泄露和终端用户的经济损失。为避免此类攻击,及早检测这些网站的 URL 是至关重要且必要的。以往的研究者提出了许多机器学习算法来区分网络钓鱼 URL 与合法 URL。在本论文中,我们希望从特征选择的角度增强这些机器学习算法。我们提出了一种将关键词特征与传统特征结合的新方法。该方法应用于多种传统机器学习算法,实验结果表明此方法有用且有效。平均而言,对于大型数据集,该方法可将分类错误率降低 30%。此外,对小型数据集的增强效果更显著。此外,该方法从 URL 中提取信息,而不依赖第三方服务提供的额外信息。使用我们提出的方法,机器学习算法取得的最佳结果达到了 99.68%的准确率。
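
The feature-combination idea can be sketched with scikit-learn: bag-of-keyword features tokenised from the URL, stacked next to a couple of traditional lexical features, feeding an ordinary classifier. A toy illustration (the URLs, features, and classifier choice are assumptions, not the paper's exact setup):

```python
import re
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

urls = ["http://paypal.secure-login.example.ru/verify",
        "https://www.wikipedia.org/wiki/Phishing",
        "http://update-account.example.cn/bank/confirm",
        "https://github.com/scikit-learn/scikit-learn"]
labels = np.array([1, 0, 1, 0])              # 1 = phishing, 0 = legitimate

# Keyword features: bag of tokens obtained by splitting the URL on punctuation.
tokenise = lambda u: re.split(r"[\W_]+", u.lower())
kw_vec = CountVectorizer(tokenizer=tokenise, token_pattern=None)
X_kw = kw_vec.fit_transform(urls)

# A couple of traditional lexical features: URL length and digit count.
X_trad = csr_matrix([[len(u), sum(ch.isdigit() for ch in u)] for u in urls],
                    dtype=float)

X = hstack([X_kw, X_trad])                   # keyword + traditional features
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```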

Subjects: Cryptography and Security, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing 学科:密码学与安全、人工智能、机器学习、神经与进化计算

Publish: 2025-08-12 14:16:11 UTC 发表:2025-08-12 14:16:11 UTC

#46 NEUBORN: The Neurodevelopmental Evolution framework Using BiOmechanical RemodelliNg #46 NEUBORN:使用生物力学重塑的神经发育进化框架

Authors: [Nashira Baena](https://arxiv.org/search/?searchtype=author&query=Nashira Baena), [Mariana da Silva](https://arxiv.org/search/?searchtype=author&query=Mariana da Silva), [Irina Grigorescu](https://arxiv.org/search/?searchtype=author&query=Irina Grigorescu), [Aakash Saboo](https://arxiv.org/search/?searchtype=author&query=Aakash Saboo), [Saga Masui](https://arxiv.org/search/?searchtype=author&query=Saga Masui), [Jaques-Donald Tournier](https://arxiv.org/search/?searchtype=author&query=Jaques-Donald Tournier), [Emma C. Robinson](https://arxiv.org/search/?searchtype=author&query=Emma C. Robinson) 作者:Nashira Baena、Mariana da Silva、Irina Grigorescu、Aakash Saboo、Saga Masui、Jaques-Donald Tournier、Emma C. Robinson

Understanding individual cortical development is essential for identifying deviations linked to neurodevelopmental disorders. However, current normative modelling frameworks struggle to capture fine-scale anatomical details due to their reliance on modelling data within a population-average reference space. Here, we present a novel framework for learning individual growth trajectories from biomechanically constrained, longitudinal, diffeomorphic image registration, implemented via a hierarchical network architecture. Trained on neonatal MRI data from the Developing Human Connectome Project, the method improves the biological plausibility of warps, generating growth trajectories that better follow population-level trends while generating smoother warps, with fewer negative Jacobians, relative to state-of-the-art baselines. The resulting subject-specific deformations provide interpretable, biologically grounded mappings of development. This framework opens new possibilities for predictive modeling of brain maturation and early identification of malformations of cortical development. 理解个体皮质发育对于识别与神经发育障碍相关的异常至关重要。然而,现有的规范建模框架由于依赖在群体平均参考空间内对数据建模,难以捕捉细尺度的解剖细节。在此,我们提出了一种新框架,通过生物力学约束的纵向同胚影像配准来学习个体生长轨迹,并通过分层网络架构实现。在开发中的人类连接组计划(Developing Human Connectome Project)的新生儿 MRI 数据上训练后,该方法提高了变形场的生物学合理性,生成的生长轨迹更好地遵循群体水平的趋势,同时相比最先进的基线方法生成更平滑的变形场、较少负雅可比行列式。由此得到的个体特异性变形提供了可解释的、基于生物学的发育映射。该框架为大脑成熟的预测建模和皮质发育畸形的早期识别开辟了新可能。

Subjects: Quantitative Methods, Artificial Intelligence 主题:定量方法,人工智能

Publish: 2025-08-13 12:36:23 UTC 发布:2025-08-13 12:36:23 UTC

#47 Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection #47 区域到区域:通过自适应区域注入增强生成图像调和

Authors: [Zhiqiu Zhang](https://arxiv.org/search/?searchtype=author&query=Zhiqiu Zhang), [Dongqi Fan](https://arxiv.org/search/?searchtype=author&query=Dongqi Fan), [Mingjie Wang](https://arxiv.org/search/?searchtype=author&query=Mingjie Wang), [Qiang Tang](https://arxiv.org/search/?searchtype=author&query=Qiang Tang), [Jian Yang](https://arxiv.org/search/?searchtype=author&query=Jian Yang), [Zili Yi](https://arxiv.org/search/?searchtype=author&query=Zili Yi) 作者:张志秋、范东棋、王明杰、唐强、杨健、易子立

The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion model (LDM) are applied for harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions. To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, We propose a novel model R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitation of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access. 图像调和的目标是调整合成图像中的前景,使其在视觉上与背景保持一致。最近,潜在扩散模型(LDM)被应用于调和任务,并取得了显著成果。然而,基于 LDM 的调和在细节保留和调和能力方面仍面临挑战。此外,目前的合成数据集依赖颜色迁移,缺乏局部变化,无法捕捉复杂的真实世界光照条件。为提升调和能力,我们提出了区域到区域(Region-to-Region)变换。通过将来自合适区域的信息注入前景,该方法在实现图像调和的同时保留原始细节,或反之生成新的合成数据。从这个视角出发,我们提出了一种新模型 R2R。具体而言,我们设计了 Clear-VAE,使用自适应滤波器在消除不和谐元素的同时保留前景的高频细节。 为进一步增强协调性,我们引入了带有掩码感知自适应通道注意力(MACA)的协调控制器,该控制器根据前景和背景区域的通道重要性动态调整前景。为了解决现有数据集的局限性,我们提出了随机泊松融合方法,该方法从合适的区域向前景传递色彩和光照信息,从而生成更多样且更具挑战性的合成图像。基于此方法,我们构建了一个新的合成数据集 RPHarmony。实验证明,与其他方法相比,我们的方法在定量指标和视觉协调性方面均具有优势。此外,我们的数据集有助于模型在真实示例中生成更逼真的图像。我们的代码、数据集和模型权重均已公开发布以供访问。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 12:21:51 UTC 发布时间:2025-08-13 12:21:51 UTC

#48 Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models #48 通过具备上下文感知的概念瓶颈模型改进 ARDS 诊断

Authors: [Anish Narain](https://arxiv.org/search/?searchtype=author&query=Anish Narain), [Ritam Majumdar](https://arxiv.org/search/?searchtype=author&query=Ritam Majumdar), [Nikita Narayanan](https://arxiv.org/search/?searchtype=author&query=Nikita Narayanan), [Dominic Marshall](https://arxiv.org/search/?searchtype=author&query=Dominic Marshall), [Sonali Parbhoo](https://arxiv.org/search/?searchtype=author&query=Sonali Parbhoo) 作者:Anish Narain, Ritam Majumdar, Nikita Narayanan, Dominic Marshall, Sonali Parbhoo

Large, publicly available clinical datasets have emerged as a novel resource for understanding disease heterogeneity and to explore personalization of therapy. These datasets are derived from data not originally collected for research purposes and, as a result, are often incomplete and lack critical labels. Many AI tools have been developed to retrospectively label these datasets, such as by performing disease classification; however, they often suffer from limited interpretability. Previous work has attempted to explain predictions using Concept Bottleneck Models (CBMs), which learn interpretable concepts that map to higher-level clinical ideas, facilitating human evaluation. However, these models often experience performance limitations when the concepts fail to adequately explain or characterize the task. We use the identification of Acute Respiratory Distress Syndrome (ARDS) as a challenging test case to demonstrate the value of incorporating contextual information from clinical notes to improve CBM performance. Our approach leverages a Large Language Model (LLM) to process clinical notes and generate additional concepts, resulting in a 10% performance gain over existing methods. Additionally, it facilitates the learning of more comprehensive concepts, thereby reducing the risk of information leakage and reliance on spurious shortcuts, thus improving the characterization of ARDS. 大型公开可得的临床数据集已成为理解疾病异质性和探索个性化治疗的新型资源。这些数据集来源于最初并非为研究目的而收集的数据,因此往往不完整且缺乏关键标签。为这些数据集开发了许多用于回溯性标注的人工智能工具,例如进行疾病分类;然而,它们通常缺乏可解释性。以往的工作尝试使用概念瓶颈模型(Concept Bottleneck Models,CBMs)来解释预测,CBM 学习可解释的概念并将其映射到更高层次的临床概念,便于人工评估。然而,当这些概念无法充分解释或刻画任务时,这些模型往往表现受限。我们以急性呼吸窘迫综合征(ARDS)的识别作为一个具有挑战性的测试用例,展示将临床记录中的上下文信息纳入以改善 CBM 性能的价值。 我们的方法利用大型语言模型(LLM)处理临床记录并生成额外概念,相较于现有方法带来约 10% 的性能提升。此外,它有助于学习更全面的概念,从而降低信息泄露和依赖虚假捷径的风险,从而改善对急性呼吸窘迫综合征(ARDS)的表征。
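
A concept bottleneck model itself is compact to write down: inputs map to interpretable concept probabilities, and only those concepts feed the diagnosis head; concepts mined from clinical notes by an LLM would simply widen the bottleneck. A minimal PyTorch sketch of the standard joint training objective (illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Inputs -> interpretable concepts (sigmoid) -> diagnosis logit.
    Extra LLM-derived concepts would just increase n_concepts; each concept
    stays individually inspectable."""
    def __init__(self, n_features, n_concepts):
        super().__init__()
        self.to_concepts = nn.Linear(n_features, n_concepts)
        self.to_label = nn.Linear(n_concepts, 1)

    def forward(self, x):
        concepts = torch.sigmoid(self.to_concepts(x))
        return concepts, self.to_label(concepts)

model = ConceptBottleneck(n_features=32, n_concepts=6)
x = torch.randn(4, 32)                         # toy patient feature vectors
c_true = torch.randint(0, 2, (4, 6)).float()   # concept annotations (e.g. "bilateral infiltrates")
y_true = torch.randint(0, 2, (4, 1)).float()   # ARDS label

concepts, logit = model(x)
loss = nn.functional.binary_cross_entropy(concepts, c_true) \
     + nn.functional.binary_cross_entropy_with_logits(logit, y_true)
loss.backward()
print(float(loss))
```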

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 11:19:30 UTC 发布日期:2025-08-13 11:19:30 UTC

Author: [Rahul Hemrajani](https://arxiv.org/search/?searchtype=author&query=Rahul Hemrajani) 作者:Rahul Hemrajani

The integration of Artificial Intelligence(AI) into the legal profession raises significant questions about the capacity of Large Language Models(LLM) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations, factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law. 将人工智能(AI)整合到法律职业中,引发了关于大型语言模型(LLM)执行关键法律任务能力的重大问题。在本文中,我对 LLM(如 GPT、Claude 和 Llama)在印度语境下执行关键法律任务的表现进行了实证评估,任务包括识别问题、法律起草、法律建议、法律研究和推理。通过一项调查实验,我将 LLM 的输出与一名初级律师的工作进行了比较,由高级法学学生对这些工作的有用性、准确性和全面性进行评分。LLM 在起草和识别问题方面表现出色,常常能与人类工作持平或超越人类。然而,它们在专业法律研究方面表现不佳,常常产生幻觉,即事实错误或捏造的输出。我得出的结论是,尽管 LLM 可以增强某些法律任务,但在人类细致推理和法律精确适用方面,人类专业知识仍然必不可少。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 11:04:48 UTC 发布时间:2025-08-13 11:04:48 世界协调时

#50 Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision #50 Surg-InvNeRF:用于外科视觉中的三维跟踪和重建的可逆 NeRF

Authors: [Gerardo Loza](https://arxiv.org/search/?searchtype=author&query=Gerardo Loza), [Junlei Hu](https://arxiv.org/search/?searchtype=author&query=Junlei Hu), [Dominic Jones](https://arxiv.org/search/?searchtype=author&query=Dominic Jones), [Sharib Ali](https://arxiv.org/search/?searchtype=author&query=Sharib Ali), [Pietro Valdastri](https://arxiv.org/search/?searchtype=author&query=Pietro Valdastri) 作者:Gerardo Loza、Junlei Hu、Dominic Jones、Sharib Ali、Pietro Valdastri

We proposed a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state-of-the-art on TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays’ density. It also presents our multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results in the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, this is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction. 我们提出了一种新颖的基于测试时优化(TTO)的方案,采用以 NeRF 为基础的架构用于长期三维点跟踪。当前大多数点跟踪方法在获得一致运动方面存在困难,或仅限于二维运动。TTO 方法将长期跟踪的问题框定为优化一个函数,该函数汇聚来自其他专门化最新方法的对应关系。与现有的 TTO 先进方法不同,我们提出用我们新的可逆神经辐射场(InvNeRF)架构对该函数进行参数化,以在手术场景中执行二维和三维跟踪。我们的方法通过监督像素对应关系的重投影,利用了基于渲染的方法的优势。它借鉴了最近基于渲染方法的策略,以获得双向可变形-规范映射、高效处理定义的工作空间并引导光线密度。它还提出了用于快速推理的多尺度 HexPlanes 以及用于高效像素采样和收敛判定的新算法。 我们在 STIR 和 SCARE 数据集上展示了结果,分别用于评估点追踪和测试运动学数据在我们流程中的整合。在二维点追踪中,我们的方法在平均精度上比现有 TTO 最先进方法高出近 50%,同时在其他指标上与其他方法相当或竞争。在三维点追踪中,这是首个 TTO 方法,超越了前馈方法,并结合了基于可变形 NeRF 重建的优势。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics 主题:计算机视觉与模式识别、人工智能、机器人学

Publish: 2025-08-13 10:20:24 UTC 发布时间:2025-08-13 10:20:24 世界协调时 (UTC)

#51 Anomaly Detection for IoT Global Connectivity #51 面向物联网全球连通性的异常检测

Authors: [Jesus Omaña Iglesias](https://arxiv.org/search/?searchtype=author&query=Jesus Omaña Iglesias), [Carlos Segura Perales](https://arxiv.org/search/?searchtype=author&query=Carlos Segura Perales), [Stefan Geißler](https://arxiv.org/search/?searchtype=author&query=Stefan Geißler), [Diego Perino](https://arxiv.org/search/?searchtype=author&query=Diego Perino), [Andra Lutu](https://arxiv.org/search/?searchtype=author&query=Andra Lutu) 作者:Jesus Omaña Iglesias、Carlos Segura Perales、Stefan Geißler、Diego Perino、Andra Lutu

Internet of Things (IoT) application providers rely on Mobile Network Operators (MNOs) and roaming infrastructures to deliver their services globally. In this complex ecosystem, where the end-to-end communication path traverses multiple entities, it has become increasingly challenging to guarantee communication availability and reliability. Further, most platform operators use a reactive approach to communication issues, responding to user complaints only after incidents have become severe, compromising service quality. This paper presents our experience in the design and deployment of ANCHOR – an unsupervised anomaly detection solution for the IoT connectivity service of a large global roaming platform. ANCHOR assists engineers by filtering vast amounts of data to identify potential problematic clients (i.e., those with connectivity issues affecting several of their IoT devices), enabling proactive issue resolution before the service is critically impacted. We first describe the IoT service, infrastructure, and network visibility of the IoT connectivity provider we operate. Second, we describe the main challenges and operational requirements for designing an unsupervised anomaly detection solution on this platform. Following these guidelines, we propose different statistical rules, and machine- and deep-learning models for IoT verticals anomaly detection based on passive signaling traffic. We describe the steps we followed working with the operational teams on the design and evaluation of our solution on the operational platform, and report an evaluation on operational IoT customers. 物联网(IoT)应用提供商依赖移动网络运营商(MNO)和漫游基础设施来全球提供其服务。在这个端到端通信路径跨越多个实体的复杂生态系统中,保证通信可用性和可靠性变得愈发具有挑战性。此外,大多数平台运营商在处理通信问题时采取的是被动反应的方法,只有在用户投诉并且事件恶化后才做出响应,从而损害了服务质量。本文介绍了我们在设计和部署 ANCHOR——一种面向大型全球漫游平台的物联网连接服务的无监督异常检测解决方案——方面的经验。ANCHOR 通过过滤大量数据帮助工程师识别潜在问题客户(即那些其多个物联网设备存在连接问题的客户),使得在服务受到严重影响之前能够主动解决问题。我们首先描述了我们所运营的物联网连接提供商的物联网服务、基础设施和网络可见性。 其次,我们描述了在该平台上设计无监督异常检测解决方案的主要挑战和运行需求。遵循这些指导原则,我们提出了多种统计规则以及面向物联网垂直领域的基于被动信令流量的机器学习和深度学习模型用于异常检测。我们描述了与运维团队在该运营平台上共同开展解决方案设计与评估所遵循的步骤,并报告了在实际物联网客户上的评估结果。
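
In the simplest unsupervised reading, per-client signalling features are scored by an off-the-shelf anomaly detector and flagged clients are surfaced to engineers. A toy scikit-learn sketch with invented features (ANCHOR's actual rules and models are more elaborate):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy per-client daily signalling features: [attach attempts, attach failures,
# mean session duration (s)]; real features would come from passive roaming traffic.
rng = np.random.default_rng(1)
normal_days = rng.normal(loc=[100, 2, 300], scale=[10, 1, 30], size=(200, 3))
bad_days    = rng.normal(loc=[100, 40, 50], scale=[10, 5, 10],  size=(5, 3))

detector = IsolationForest(contamination=0.05, random_state=0).fit(normal_days)
flags = detector.predict(np.vstack([normal_days[:5], bad_days]))   # -1 = anomaly
print(flags)
```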

Subjects: Networking and Internet Architecture, Artificial Intelligence, Machine Learning 学科:网络与互联网架构、人工智能、机器学习

Publish: 2025-08-13 09:44:51 UTC 发布:2025-08-13 09:44:51 UTC

#52 On Negative-aware Preference Optimization for Recommendation #52 关于用于推荐的负面感知偏好优化

Authors: [Chenlu Ding](https://arxiv.org/search/?searchtype=author&query=Chenlu Ding), [Daoxuan Liu](https://arxiv.org/search/?searchtype=author&query=Daoxuan Liu), [Jiancan Wu](https://arxiv.org/search/?searchtype=author&query=Jiancan Wu), [Xingyu Hu](https://arxiv.org/search/?searchtype=author&query=Xingyu Hu), [Junkang Wu](https://arxiv.org/search/?searchtype=author&query=Junkang Wu), [Haitao Wang](https://arxiv.org/search/?searchtype=author&query=Haitao Wang), [Yongkang Wang](https://arxiv.org/search/?searchtype=author&query=Yongkang Wang), [Xingxing Wang](https://arxiv.org/search/?searchtype=author&query=Xingxing Wang), [Xiang Wang](https://arxiv.org/search/?searchtype=author&query=Xiang Wang) 作者:丁晨璐、刘道轩、吴建灿、胡星宇、吴俊康、王海涛、王永康、王星星、王翔

Recommendation systems leverage user interaction data to suggest relevant items while filtering out irrelevant (negative) ones. The rise of large language models (LLMs) has garnered increasing attention for their potential in recommendation tasks. However, existing methods for optimizing LLM-based recommenders face challenges in effectively utilizing negative samples. Simply integrating large numbers of negative samples can improve ranking accuracy and mitigate popularity bias but often leads to increased computational overhead and memory costs. Additionally, current approaches fail to account for the varying informativeness of negative samples, leading to suboptimal optimization performance. To address these issues, we propose NAPO (Negative-Aware Preference Optimization), an enhanced framework for preference optimization in LLM-based recommendation. NAPO introduces two key innovations: (1) in-batch negative sharing, which expands the pool of negative samples without additional memory overhead, and (2) dynamic reward margin adjustment, which adapts model updates based on the confidence of negative samples. Extensive experiments on three public datasets demonstrate that NAPO outperforms existing methods in both recommendation accuracy and popularity bias reduction. 推荐系统利用用户交互数据来推荐相关项目,同时过滤掉无关(负面)项目。大型语言模型(LLMs)的兴起引起了人们对其在推荐任务中潜力的日益关注。然而,现有用于优化基于 LLM 的推荐器的方法在有效利用负样本方面面临挑战。简单地整合大量负样本可以提高排序准确性并缓解流行度偏差,但通常会导致计算开销和内存成本增加。此外,现有方法未能考虑负样本信息量的差异,导致优化性能次优。为了解决这些问题,我们提出了 NAPO(Negative-Aware Preference Optimization,负样本感知偏好优化),这是一个用于基于 LLM 的推荐偏好优化的增强框架。NAPO 引入了两项关键创新:(1)批内负样本共享,通过在不增加额外内存开销的情况下扩展负样本池;(2)动态奖励边距调整,根据负样本的置信度自适应地调整模型更新。在三个公开数据集上的大量实验证明,NAPO 在推荐准确性和流行度偏差减少方面均优于现有方法。
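
The two ingredients can be illustrated with a generic DPO-style objective: contrast every positive against every negative in the batch, and widen the margin for negatives the system is more confident about. A PyTorch sketch of that shape (an assumption-laden illustration, not NAPO's exact loss):

```python
import torch
import torch.nn.functional as F

def napo_style_loss(pos_logp, neg_logp, neg_conf, beta=1.0, margin_scale=1.0):
    """Illustrative preference loss with (i) in-batch negative sharing: every
    positive is contrasted against every negative in the batch, and
    (ii) a margin that grows with an assumed confidence score of each negative.
    pos_logp: (B,) log-likelihood the model assigns to each positive item.
    neg_logp: (B,) log-likelihood for each sampled negative item.
    neg_conf: (B,) confidence scores for the negatives in [0, 1]."""
    diff = pos_logp[:, None] - neg_logp[None, :]      # (B, B): all pos/neg pairs
    margin = margin_scale * neg_conf[None, :]          # harder negatives -> larger margin
    return -F.logsigmoid(beta * diff - margin).mean()

B = 4
pos_logp = torch.randn(B, requires_grad=True)
neg_logp = torch.randn(B, requires_grad=True)
neg_conf = torch.rand(B)
loss = napo_style_loss(pos_logp, neg_logp, neg_conf)
loss.backward()
print(float(loss))
```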

Subjects: Information Retrieval, Artificial Intelligence 主题:信息检索,人工智能

Publish: 2025-08-13 09:37:07 UTC 发布:2025-08-13 09:37:07 UTC

#53 Demystifying the Role of Rule-based Detection in AI Systems for Windows Malware Detection #53 解密基于规则的检测在用于 Windows 恶意软件检测的 AI 系统中的作用

Authors: [Andrea Ponte](https://arxiv.org/search/?searchtype=author&query=Andrea Ponte), [Luca Demetrio](https://arxiv.org/search/?searchtype=author&query=Luca Demetrio), [Luca Oneto](https://arxiv.org/search/?searchtype=author&query=Luca Oneto), [Ivan Tesfai Ogbu](https://arxiv.org/search/?searchtype=author&query=Ivan Tesfai Ogbu), [Battista Biggio](https://arxiv.org/search/?searchtype=author&query=Battista Biggio), [Fabio Roli](https://arxiv.org/search/?searchtype=author&query=Fabio Roli) 作者:Andrea Ponte、Luca Demetrio、Luca Oneto、Ivan Tesfai Ogbu、Battista Biggio、Fabio Roli

Malware detection increasingly relies on AI systems that integrate signature-based detection with machine learning. However, these components are typically developed and combined in isolation, missing opportunities to reduce data complexity and strengthen defenses against adversarial EXEmples, carefully crafted programs designed to evade detection. Hence, in this work we investigate the influence that signature-based detection exerts on model training, when they are included inside the training pipeline. Specifically, we compare models trained on a comprehensive dataset with an AI system whose machine learning component is trained solely on samples not already flagged by signatures. Our results demonstrate improved robustness to both adversarial EXEmples and temporal data drift, although this comes at the cost of a fixed lower bound on false positives, driven by suboptimal rule selection. We conclude by discussing these limitations and outlining how future research could extend AI-based malware detection to include dynamic analysis, thereby further enhancing system resilience. 恶意软件检测越来越依赖将基于签名的检测与机器学习相结合的 AI 系统。然而,这些组件通常各自开发并独立组合,错过了降低数据复杂性和强化对抗性 EXEmples(精心构造以逃避检测的程序)防御的机会。因此,在本工作中,我们研究了当基于签名的检测被纳入训练流程时,它对模型训练的影响。具体而言,我们比较了在全面数据集上训练的模型与其机器学习组件仅在未被签名标记的样本上训练的 AI 系统。我们的结果表明,这样做可提高对抗性 EXEmples 和时间性数据漂移的鲁棒性,尽管这会以错误阳性率存在固定下限为代价,该下限由次优规则选择驱动。我们最后讨论了这些局限性,并概述了未来研究如何将动态分析纳入基于 AI 的恶意软件检测,以进一步增强系统鲁棒性。

Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能

Publish: 2025-08-13 09:35:51 UTC 发布:2025-08-13 09:35:51 UTC

#54 A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories #54 一种对 AI 生成故事中性别叙事偏见的细读方法

Authors: [Daniel Raffini](https://arxiv.org/search/?searchtype=author&query=Daniel Raffini), [Agnese Macori](https://arxiv.org/search/?searchtype=author&query=Agnese Macori), [Marco Angelini](https://arxiv.org/search/?searchtype=author&query=Marco Angelini), [Tiziana Catarci](https://arxiv.org/search/?searchtype=author&query=Tiziana Catarci)

The paper explores the study of gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp’s character classifications and Freytag’s narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases - especially implicit ones - in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach. 本文探讨了由 ChatGPT、Gemini 和 Claude 生成的故事中基于性别的叙事偏见研究。提示设计借鉴了普罗甫(Propp)的人物分类和弗雷塔格(Freytag)的叙事结构。通过精读方法对故事进行分析,特别关注对提示的遵从情况、人物的性别分布、身体与心理描写、行为以及最终的情节发展和人物关系。结果显示生成的故事中偏见,尤其是隐含偏见,仍然存在,并突显了使用解释性方法在多个层面评估偏见的重要性。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会

Publish: 2025-08-13 09:34:37 UTC 发布:2025-08-13 09:34:37 UTC

#55 Preacher: Paper-to-Video Agentic System #55 传道者:论文到视频的智能代理系统

Authors: [Jingwei Liu](https://arxiv.org/search/?searchtype=author&query=Jingwei Liu), [Ling Yang](https://arxiv.org/search/?searchtype=author&query=Ling Yang), [Hao Luo](https://arxiv.org/search/?searchtype=author&query=Hao Luo), [Fan Wang](https://arxiv.org/search/?searchtype=author&query=Fan Wang), [Hongyan Li](https://arxiv.org/search/?searchtype=author&query=Hongyan Li), [Mengdi Wang](https://arxiv.org/search/?searchtype=author&query=Mengdi Wang) 作者:刘敬伟,杨玲,罗昊,王凡,李红艳,王梦迪

The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video 论文到视频任务将一篇研究论文转换为结构化的视频摘要,将关键概念、方法和结论提炼为一种易于理解且组织良好的形式。尽管最先进的视频生成模型展示出潜力,但它们受限于有限的上下文窗口、严格的视频时长限制、风格多样性不足以及无法呈现领域特定知识。为了解决这些局限性,我们提出了 Preacher,这是首个论文到视频的具代理性系统。Preacher 采用自上而下的方法来分解、总结和重构论文,随后进行自下而上的视频生成,将多样的视频片段综合为连贯的摘要。为对齐跨模态表示,我们定义了关键场景并引入了渐进式链式思维(Progressive Chain of Thought,P-CoT)以实现细粒度的迭代式规划。Preacher 在五个研究领域成功生成了高质量的视频摘要,展示出超越当前视频生成模型的专业能力。代码将发布于: https://github.com/GenVerse/Paper2Video

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 09:08:51 UTC 发表时间:2025-08-13 09:08:51 UTC

#56 AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries? #56 AmbiGraph-Eval:LLMs 能有效处理含糊的图查询吗?

Authors: [Yuchen Tian](https://arxiv.org/search/?searchtype=author&query=Yuchen Tian), [Kaixin Li](https://arxiv.org/search/?searchtype=author&query=Kaixin Li), [Hao Chen](https://arxiv.org/search/?searchtype=author&query=Hao Chen), [Ziyang Luo](https://arxiv.org/search/?searchtype=author&query=Ziyang Luo), [Hongzhan Lin](https://arxiv.org/search/?searchtype=author&query=Hongzhan Lin), [Sebastian Schelter](https://arxiv.org/search/?searchtype=author&query=Sebastian Schelter), [Lun Du](https://arxiv.org/search/?searchtype=author&query=Lun Du), [Jing Ma](https://arxiv.org/search/?searchtype=author&query=Jing Ma) 作者:田昱晨、李凯鑫、陈浩、罗子扬、林鸿展、Sebastian Schelter、杜伦、马静

Large Language Models (LLMs) have recently demonstrated strong capabilities in translating natural language into database queries, especially when dealing with complex graph-structured data. However, real-world queries often contain inherent ambiguities, and the interconnected nature of graph structures can amplify these challenges, leading to unintended or incorrect query results. To systematically evaluate LLMs on this front, we propose a taxonomy of graph-query ambiguities, comprising three primary types: Attribute Ambiguity, Relationship Ambiguity, and Attribute-Relationship Ambiguity, each subdivided into Same-Entity and Cross-Entity scenarios. We introduce AmbiGraph-Eval, a novel benchmark of real-world ambiguous queries paired with expert-verified graph query answers. Evaluating 9 representative LLMs shows that even top models struggle with ambiguous graph queries. Our findings reveal a critical gap in ambiguity handling and motivate future work on specialized resolution techniques. 大型语言模型(LLMs)最近在将自然语言转换为数据库查询方面表现出强大的能力,尤其是在处理复杂的图结构数据时。然而,真实世界的查询往往包含固有的歧义性,图结构的互联特性会放大这些挑战,导致非预期或错误的查询结果。为系统性地评估 LLMs 在这方面的表现,我们提出了一个图查询歧义分类法,包含三种主要类型:属性歧义、关系歧义,以及属性-关系混合歧义,每种类型又细分为同实体和跨实体两种情形。我们引入了 AmbiGraph-Eval,这是一个由真实世界歧义查询及专家验证的图查询答案组成的新基准。对 9 个代表性 LLMs 的评估表明,即便是顶级模型在处理图查询歧义时也表现不佳。我们的发现揭示了歧义处理方面的关键缺口,并激励未来开展针对性解决技术的研究。

Subjects: Databases, Artificial Intelligence 主题:数据库,人工智能

Publish: 2025-08-13 09:06:59 UTC 发表时间:2025-08-13 09:06:59 UTC

#57 TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling #57 TimeMKG:用于多变量时间序列建模的知识注入因果推理

Authors: [Yifei Sun](https://arxiv.org/search/?searchtype=author&query=Yifei Sun), [Junming Liu](https://arxiv.org/search/?searchtype=author&query=Junming Liu), [Ding Wang](https://arxiv.org/search/?searchtype=author&query=Ding Wang), [Yirong Chen](https://arxiv.org/search/?searchtype=author&query=Yirong Chen), [Xuefeng Yan](https://arxiv.org/search/?searchtype=author&query=Xuefeng Yan) 作者:孙逸飞、刘俊明、王丁、陈毅荣、闫学锋

Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. The experiment in diverse datasets demonstrates that incorporating variable-level knowledge significantly improves both predictive performance and generalization. 多变量时间序列数据通常由两种不同的模态组成:变量语义和采样的数值观测。传统的时间序列模型将变量视为匿名的统计信号,忽视了变量名称和数据描述中蕴含的丰富语义信息。然而,这些文本描述常常编码了对稳健且可解释建模至关重要的领域知识。在此我们提出了 TimeMKG,一种将时间序列建模从低层次信号处理提升到知识驱动推理的多模态因果推理框架。TimeMKG 利用大型语言模型来解读变量语义,并构建捕捉变量间关系的结构化多变量知识图谱。一个双模态编码器分别对由知识图谱三元组生成的语义提示以及来自历史时间序列的统计模式进行建模。跨模态注意力在变量层面对这些表示进行对齐和融合,将因果先验注入到预测和分类等下游任务,为模型推理提供明确且可解释的先验。 在多样化数据集上的实验表明,融合变量层面的知识能显著提升预测性能与泛化能力。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 09:00:36 UTC 发布日期:2025-08-13 09:00:36 UTC

#58 Goal Discovery with Causal Capacity for Efficient Reinforcement Learning #58 具备因果能力的目标发现以实现高效强化学习

Authors: [Yan Yu](https://arxiv.org/search/?searchtype=author&query=Yan Yu), [Yaodong Yang](https://arxiv.org/search/?searchtype=author&query=Yaodong Yang), [Zhengbo Lu](https://arxiv.org/search/?searchtype=author&query=Zhengbo Lu), [Chengdong Ma](https://arxiv.org/search/?searchtype=author&query=Chengdong Ma), [Wengang Zhou](https://arxiv.org/search/?searchtype=author&query=Wengang Zhou), [Houqiang Li](https://arxiv.org/search/?searchtype=author&query=Houqiang Li) 作者:Yan Yu、Yaodong Yang、Zhengbo Lu、Chengdong Ma、Wengang Zhou、Houqiang Li

Causal inference is crucial for humans to explore the world, which can be modeled to enable an agent to efficiently explore the environment in reinforcement learning. Existing research indicates that establishing the causality between action and state transition will enhance an agent to reason how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, \emph{i.e.,} causal capacity, which represents the highest influence of an agent’s behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, which are then regarded as our subgoals to guide the agent to make exploration more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines. 因果推断对人类探索世界至关重要,这可以被建模以使智能体在强化学习中更高效地探索环境。现有研究表明,建立动作与状态转移之间的因果关系将增强智能体推理策略如何影响其未来轨迹,从而促进有针对性的探索。然而,由于复杂场景中庞大的状态-动作空间导致不可解性,度量因果关系具有挑战性。在本文中,我们提出了一种用于高效环境探索的新颖框架:具有因果容量的目标发现(GDCC)。具体而言,我们首先推导出在状态空间中度量因果关系的方法,即因果容量,它表示智能体行为对未来轨迹的最大影响力。之后,我们提出了一种基于蒙特卡洛的方法来识别离散状态空间中的关键点,并进一步将此方法优化以适用于连续高维环境。 这些关键点用于揭示智能体在环境中做出重要决策的位置,这些位置随后被视为我们的子目标,以引导智能体更有目的性和更高效地进行探索。来自多目标任务的实证结果表明,具有高因果容量的状态与我们预期的子目标一致,并且我们的 GDCC 与基线相比在成功率上取得了显著提升。
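
As a rough illustration of the idea of "causal capacity" (how strongly an agent's action choice determines its future trajectory), the toy Monte Carlo proxy below rolls out each action from a state and compares how distinct the resulting outcome distributions are; states with high scores are treated as candidate subgoals. This is a simplified stand-in, not the paper's estimator, and `env.step_from` is a hypothetical simulator interface.

```python
# Toy sketch (not the GDCC algorithm): a Monte Carlo proxy for "causal capacity" on a
# discrete environment. For each state we roll out each action several times and measure
# how strongly the action choice determines which future state is reached.
import random
from collections import Counter

def causal_capacity(env, state, actions, horizon=5, rollouts=20):
    outcome_dists = []
    for a in actions:
        endings = Counter()
        for _ in range(rollouts):
            s = env.step_from(state, a)                       # first step fixes the action
            for _ in range(horizon - 1):
                s = env.step_from(s, random.choice(actions))  # then behave randomly
            endings[s] += 1
        outcome_dists.append(endings)
    # Proxy score: how little the per-action outcome distributions overlap.
    all_states = set().union(*outcome_dists)
    overlap = sum(min(d[s] for d in outcome_dists) for s in all_states) / rollouts
    return 1.0 - overlap   # 1.0 = action fully determines the outcome, 0.0 = no influence

def discover_subgoals(env, states, actions, top_k=5):
    scored = [(causal_capacity(env, s, actions), s) for s in states]
    return [s for _, s in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
```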

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 08:54:56 UTC

#59 Interpretable Robot Control via Structured Behavior Trees and Large Language Models #59 通过结构化行为树与大型语言模型实现可解释的机器人控制

Authors: [Ingrid Maéva Chekam](https://arxiv.org/search/?searchtype=author&query=Ingrid Maéva Chekam), [Ines Pastor-Martinez](https://arxiv.org/search/?searchtype=author&query=Ines Pastor-Martinez), [Ali Tourani](https://arxiv.org/search/?searchtype=author&query=Ali Tourani), [Jose Andres Millan-Romera](https://arxiv.org/search/?searchtype=author&query=Jose Andres Millan-Romera), [Laura Ribeiro](https://arxiv.org/search/?searchtype=author&query=Laura Ribeiro), [Pedro Miguel Bastos Soares](https://arxiv.org/search/?searchtype=author&query=Pedro Miguel Bastos Soares), [Holger Voos](https://arxiv.org/search/?searchtype=author&query=Holger Voos), [Jose Luis Sanchez-Lopez](https://arxiv.org/search/?searchtype=author&query=Jose Luis Sanchez-Lopez) 作者:Ingrid Maéva Chekam、Ines Pastor-Martinez、Ali Tourani、Jose Andres Millan-Romera、Laura Ribeiro、Pedro Miguel Bastos Soares、Holger Voos、Jose Luis Sanchez-Lopez

As intelligent robots become more integrated into human environments, there is a growing need for intuitive and reliable Human-Robot Interaction (HRI) interfaces that are adaptable and more natural to interact with. Traditional robot control methods often require users to adapt to interfaces or memorize predefined commands, limiting usability in dynamic, unstructured environments. This paper presents a novel framework that bridges natural language understanding and robotic execution by combining Large Language Models (LLMs) with Behavior Trees. This integration enables robots to interpret natural language instructions given by users and translate them into executable actions by activating domain-specific plugins. The system supports scalable and modular integration, with a primary focus on perception-based functionalities, such as person tracking and hand gesture recognition. To evaluate the system, a series of real-world experiments was conducted across diverse environments. Experimental results demonstrate that the proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%, making a significant contribution to HRI systems and robots. The complete source code of the framework is publicly available at https://github.com/snt-arg/robot_suite. 随着智能机器人越来越多地融入人类环境,对直观且可靠的人机交互(HRI)界面的需求也在增长,这些界面需具备适应性并能提供更自然的交互方式。传统的机器人控制方法通常要求用户适应界面或记忆预定义指令,这在动态且无结构的环境中限制了可用性。本文提出了一个新颖的框架,通过将 LLMs 与行为树相结合,弥合自然语言理解与机器人执行之间的鸿沟。该集成使机器人能够理解用户给出的自然语言指令,并通过激活特定领域的插件将其转换为可执行的动作。该系统支持可扩展和模块化的集成,主要关注基于感知的功能,如人体跟踪和手势识别。为评估该系统,在不同环境中进行了系列现实世界实验。 实验结果表明,所提出的方法在实际场景中是可行的,认知到执行的平均准确率约为 94%,这对人机交互系统和机器人具有重要贡献。该框架的完整源代码已公开发布于 https://github.com/snt-arg/robot_suite
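
A minimal sketch of the pattern described above, a behavior tree driven by an LLM that selects a perception plugin from a natural-language instruction, is shown below. The `call_llm` function and the plugin names are placeholders; the actual implementation lives in the linked robot_suite repository.

```python
# Minimal sketch of the idea (not the released robot_suite code): an LLM maps a
# natural-language instruction to one of the available skill plugins, and a tiny
# behavior tree then ticks that skill.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

PLUGINS: Dict[str, Callable[[], str]] = {
    "person_tracking": lambda: "tracking the person",
    "hand_gesture_recognition": lambda: "watching for gestures",
}

class Sequence:
    """Behavior-tree sequence node: tick children in order, fail on first failure."""
    def __init__(self, *children):
        self.children = children
    def tick(self) -> bool:
        return all(child() for child in self.children)

def build_tree(instruction: str) -> Sequence:
    prompt = (f"Available skills: {list(PLUGINS)}.\n"
              f"Instruction: {instruction}\nAnswer with exactly one skill name.")
    skill = call_llm(prompt).strip()
    def run_skill() -> bool:
        print(PLUGINS[skill]())               # activate the chosen plugin
        return True
    return Sequence(lambda: skill in PLUGINS,  # guard: the LLM picked a known skill
                    run_skill)
```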

Subjects: Robotics, Artificial Intelligence, Machine Learning 主题:机器人学、人工智能、机器学习

Publish: 2025-08-13 08:53:13 UTC 发布:2025-08-13 08:53:13 UTC

#60 MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography #60 MInDI-3D:用于稀视角锥形束计算机断层扫描的三维迭代深度学习

Authors: [Daniel Barco](https://arxiv.org/search/?searchtype=author&query=Daniel Barco), [Marc Stadelmann](https://arxiv.org/search/?searchtype=author&query=Marc Stadelmann), [Martin Oswald](https://arxiv.org/search/?searchtype=author&query=Martin Oswald), [Ivo Herzig](https://arxiv.org/search/?searchtype=author&query=Ivo Herzig), [Lukas Lichtensteiger](https://arxiv.org/search/?searchtype=author&query=Lukas Lichtensteiger), [Pascal Paysan](https://arxiv.org/search/?searchtype=author&query=Pascal Paysan), [Igor Peterlik](https://arxiv.org/search/?searchtype=author&query=Igor Peterlik), [Michal Walczak](https://arxiv.org/search/?searchtype=author&query=Michal Walczak), [Bjoern Menze](https://arxiv.org/search/?searchtype=author&query=Bjoern Menze), [Frank-Peter Schilling](https://arxiv.org/search/?searchtype=author&query=Frank-Peter Schilling) 作者:Daniel Barco, Marc Stadelmann, Martin Oswald, Ivo Herzig, Lukas Lichtensteiger, Pascal Paysan, Igor Peterlik, Michal Walczak, Bjoern Menze, Frank-Peter Schilling

We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the “InDI” concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D’s effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well. 我们提出了 MInDI-3D(Medical Inversion by Direct Iteration in 3D),这是首个用于真实稀疏投影锥形束 CT(CBCT)伪影去除的三维条件扩散模型,旨在降低成像辐射暴露。一个关键贡献是将“InDI”概念从二维扩展到用于医学图像的完整三维体积方法,实施了一个迭代去噪过程,直接从稀疏投影输入精炼 CBCT 体积。另一个贡献是从 CT-RATE 公共数据集的胸部 CT 体积生成了一个大型伪 CBCT 数据集(16,182)以稳健地训练 MInDI-3D。我们进行了全面评估,包括定量指标、可扩展性分析、泛化性测试以及由 11 位临床医生进行的临床评估。我们的结果显示了 MInDI-3D 的有效性,在 CT-RATE 伪 CBCT(独立真实世界)测试集中,仅使用 50 投影就比未校正扫描实现了 12.96(6.10)dB 的 PSNR 提升,并实现了 8 倍的成像辐射暴露减少。我们通过展示随着训练数据增多性能提高来证明其可扩展性。 重要的是,MInDI-3D 在来自 16 名癌症患者的真实扫描数据上,在畸变和任务相关指标上达到了与 3D U-Net 相当的性能。它还能推广到新的 CBCT 扫描仪几何结构。临床医生评价我们的模型在所有解剖部位的病人定位上均为足够,并认为它很好地保留了肺肿瘤边界。
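
The "direct iteration" refinement underlying InDI-style models can be sketched as a small loop that repeatedly blends the current volume with the network's restoration estimate. This follows the published InDI update rule as we read it and is not the authors' 3D implementation; `restorer` is a placeholder for the trained conditional network.

```python
# Schematic sketch of InDI-style direct iteration applied to a 3D volume (not the
# authors' code). Starting from the sparse-view reconstruction x at t = 1, each step
# blends the current volume with the network's restoration estimate:
#     x_{t - d} = (d / t) * restorer(x_t, t) + (1 - d / t) * x_t
import torch

def indi_refine(x: torch.Tensor, restorer, steps: int = 10) -> torch.Tensor:
    # x: (D, H, W) volume reconstructed from sparse projections (t = 1)
    t = 1.0
    d = t / steps
    for _ in range(steps):
        x = (d / t) * restorer(x, t) + (1.0 - d / t) * x
        t -= d                       # at the final step d / t = 1, so x = restorer(x, t)
    return x
```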

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 08:49:18 UTC 发布:2025-08-13 08:49:18 UTC

#61 How Persuasive Could LLMs Be? A First Study Combining Linguistic-Rhetorical Analysis and User Experiments #61 LLMs 的说服力能有多强?一项结合语言修辞分析与用户实验的首次研究

Authors: [Daniel Raffini](https://arxiv.org/search/?searchtype=author&query=Daniel Raffini), [Agnese Macori](https://arxiv.org/search/?searchtype=author&query=Agnese Macori), [Lorenzo Porcaro](https://arxiv.org/search/?searchtype=author&query=Lorenzo Porcaro), [Tiziana Catarci](https://arxiv.org/search/?searchtype=author&query=Tiziana Catarci), [Marco Angelini](https://arxiv.org/search/?searchtype=author&query=Marco Angelini) 作者:Daniel Raffini、Agnese Macori、Lorenzo Porcaro、Tiziana Catarci、Marco Angelini

This study examines the rhetorical and linguistic features of argumentative texts generated by ChatGPT on ethically nuanced topics and investigates their persuasive impact on human readers. Through a user study involving 62 participants and pre-post interaction surveys, the paper analyzes how exposure to AI-generated arguments affects opinion change and user perception. A linguistic and rhetorical analysis of the generated texts reveals a consistent argumentative macrostructure, reliance on formulaic expressions, and limited stylistic richness. While ChatGPT demonstrates proficiency in constructing coherent argumentative texts, its persuasive efficacy appears constrained, particularly on topics involving ethical issues. The study finds that while participants often acknowledge the benefits highlighted by ChatGPT, ethical concerns tend to persist or even intensify post-interaction. The results also demonstrate a variation depending on the topic. These findings highlight new insights on AI-generated persuasion in ethically sensitive domains and are a basis for future research. 本研究考察了 ChatGPT 在涉及伦理细微差别话题上生成的论证文本的修辞和语言特征,并探讨了这些文本对人类读者的说服影响。通过对 62 名参与者进行的用户研究以及互动前后问卷调查,论文分析了接触 AI 生成论点如何影响观点变化和用户感知。对生成文本的语言与修辞分析显示出一致的论证宏结构、对公式化表达的依赖以及有限的风格丰富性。尽管 ChatGPT 在构建连贯论证文本方面表现出能力,但其说服效力似乎受到限制,尤其是在涉及伦理问题的话题上。研究发现,尽管参与者常常承认 ChatGPT 强调的益处,但伦理方面的担忧往往在互动后仍然存在甚至加剧。结果还显示出依话题而异的差异。这些发现为伦理敏感领域中 AI 生成说服力提供了新的见解,并为未来研究奠定了基础。


Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Computers and Society 主题:人机交互、人工智能、计算与语言、计算机与社会

Publish: 2025-08-13 08:45:04 UTC

#62 A Lightweight Learned Cardinality Estimation Model #62 一个轻量级的学习型基数估计模型

Authors: [Yaoyu Zhu](https://arxiv.org/search/?searchtype=author&query=Yaoyu Zhu), [Jintao Zhang](https://arxiv.org/search/?searchtype=author&query=Jintao Zhang), [Guoliang Li](https://arxiv.org/search/?searchtype=author&query=Guoliang Li), [Jianhua Feng](https://arxiv.org/search/?searchtype=author&query=Jianhua Feng) 作者:朱耀宇,张金涛,李国良,冯建华

Cardinality estimation is a fundamental task in database management systems, aiming to predict query results accurately without executing the queries. However, existing techniques either achieve low estimation accuracy or incur high inference latency. Simultaneously achieving high speed and accuracy becomes critical for the cardinality estimation problem. In this paper, we propose a novel data-driven approach called CoDe (Covering with Decompositions) to address this problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. For each segment, CoDe utilizes tensor decomposition to accurately model its data distribution. Moreover, CoDe introduces innovative algorithms to select the best-fitting distributions for each query, combining them to estimate the final result. By employing multiple models to approximate distributions, CoDe excels in effectively modeling discrete distributions and ensuring computational efficiency. Notably, experimental results show that our method represents a significant advancement in cardinality estimation, achieving state-of-the-art levels of both estimation accuracy and inference efficiency. Across various datasets, CoDe achieves absolute accuracy in estimating more than half of the queries. 基数估计是数据库管理系统中的一项基础任务,旨在在不执行查询的情况下准确预测查询结果。然而,现有技术要么估计精度低,要么推理延迟高。要同时实现高速度和高精度对于基数估计问题至关重要。本文提出了一种名为 CoDe(Covering with Decompositions,覆盖与分解)的新型数据驱动方法来解决该问题。CoDe 采用覆盖设计的概念,将表划分为多个较小的、相互重叠的片段。对于每个片段,CoDe 利用张量分解来精确建模其数据分布。此外,CoDe 引入了创新算法来为每个查询选择最合适的分布,并将它们组合以估计最终结果。通过使用多个模型来近似分布,CoDe 在有效建模离散分布和保证计算效率方面表现出色。值得注意的是,实验结果表明我们的方法在基数估计方面取得了显著进展,在估计精度和推理效率上均达到了最先进的水平。 在多个数据集中,CoDe 对超过一半的查询达到了绝对准确的估计。
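
The core idea of modeling a segment's data distribution with a decomposition can be illustrated on a two-attribute histogram: a low-rank SVD approximation of the joint counts already answers conjunctive predicates fairly faithfully. This is only an illustration of the principle, not CoDe itself, which uses covering designs and tensor decompositions over more attributes.

```python
# Illustrative sketch (not CoDe): approximate a two-attribute joint histogram with a
# low-rank (SVD) decomposition and use it to estimate the cardinality of a conjunctive
# predicate.
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 20, size=100_000)                  # attribute A with 20 distinct values
b = (a // 2 + rng.integers(0, 3, size=a.size)) % 15    # attribute B, correlated with A

# Integer-aligned bins so row i / column j hold the exact counts of values i and j.
joint, _, _ = np.histogram2d(a, b, bins=[np.arange(21), np.arange(16)])
U, s, Vt = np.linalg.svd(joint, full_matrices=False)
rank = 3
approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank model of the distribution

# Cardinality of "A in {4,5,6} AND B in {2,3}" from the model vs. the true data.
est = approx[np.ix_([4, 5, 6], [2, 3])].sum()
true = joint[np.ix_([4, 5, 6], [2, 3])].sum()
print(f"estimate={est:.0f}  true={true:.0f}")
```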

Subjects: Databases, Artificial Intelligence, Machine Learning 学科:数据库、人工智能、机器学习

Publish: 2025-08-13 08:34:58 UTC 发表:2025-08-13 08:34:58 UTC

#63 Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma #63 用于预测胶质瘤基因型的层次化大脑结构建模

Authors: [Haotian Tang](https://arxiv.org/search/?searchtype=author&query=Haotian Tang), [Jianwei Chen](https://arxiv.org/search/?searchtype=author&query=Jianwei Chen), [Xinrui Tang](https://arxiv.org/search/?searchtype=author&query=Xinrui Tang), [Yunjia Wu](https://arxiv.org/search/?searchtype=author&query=Yunjia Wu), [Zhengyang Miao](https://arxiv.org/search/?searchtype=author&query=Zhengyang Miao), [Chao Li](https://arxiv.org/search/?searchtype=author&query=Chao Li) 作者:Haotian Tang, Jianwei Chen, Xinrui Tang, Yunjia Wu, Zhengyang Miao, Chao Li

Isocitrate DeHydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain’s hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction. 异柠檬酸脱氢酶(IDH)突变状态是胶质瘤预后评估的重要生物标志物。然而,当前的预测方法受限于功能性 MRI 的低可获得性和噪声问题。结构和形态连接组提供了一种无创的替代方案,但现有方法常常忽视大脑的层次化组织结构和多尺度相互作用。为了解决这一问题,我们提出了 Hi-SMGNN,一种将区域到模块级别的结构与形态连接组整合的层次化框架。该框架具有一个由 Siamese 网络和跨模态注意力组成的多模态交互模块、一个用于减少冗余的多尺度特征融合机制,以及一种旨在增强个体特异性和可解释性的个性化模块划分策略。在 UCSF-PDGM 数据集上的实验表明,Hi-SMGNN 优于基线和最先进模型,在 IDH 突变预测方面表现出更强的鲁棒性和有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 08:17:54 UTC 发布:2025-08-13 08:17:54 世界协调时

#64 CaRoBio: 3D Cable Routing with a Bio-inspired Gripper Fingernail #64 CaRoBio:带有仿生抓握指甲的三维电缆布线

Authors: [Jiahui Zuo](https://arxiv.org/search/?searchtype=author&query=Jiahui Zuo), [Boyang Zhang](https://arxiv.org/search/?searchtype=author&query=Boyang Zhang), [Fumin Zhang](https://arxiv.org/search/?searchtype=author&query=Fumin Zhang) 作者:左嘉辉、张博阳、张富民

The manipulation of deformable linear flexures has a wide range of applications in industry, such as cable routing in automotive manufacturing and textile production. Cable routing, as a complex multi-stage robot manipulation scenario, is a challenging task for robot automation. Common parallel two-finger grippers have the risk of over-squeezing and over-tension when grasping and guiding cables. In this paper, a novel eagle-inspired fingernail is designed and mounted on the gripper fingers, which helps with cable grasping on planar surfaces and in-hand cable guiding operations. Then we present a single-grasp end-to-end 3D cable routing framework utilizing the proposed fingernails, instead of the common pick-and-place strategy. Continuous control is achieved to efficiently manipulate cables through vision-based state estimation of task configurations and offline trajectory planning based on motion primitives. We evaluate the effectiveness of the proposed framework with a variety of cables and channel slots, significantly outperforming the pick-and-place manipulation process under equivalent perceptual conditions. Our reconfigurable task setting and the proposed framework provide a reference for future cable routing manipulations in 3D space. 可变形线性柔性体的操控在工业中有广泛应用,例如汽车制造中的电缆布线和纺织生产。作为一种复杂的多阶段机器人操控场景,电缆布线对机器人自动化是一项具有挑战性的任务。常见的并联双指夹持器在抓握和引导电缆时存在过度压紧和过度拉伸的风险。本文设计了一种新颖的鹰爪灵感指甲并将其安装在夹持器手指上,该指甲有助于在平面表面上抓握电缆并进行手内电缆引导操作。随后我们提出了一种利用所述指甲的单次抓取端到端三维电缆布线框架,取代了常见的拾取并放置策略。通过基于视觉的任务构型状态估计和基于运动原语的离线轨迹规划,实现了连续控制以高效操纵电缆。我们在多种电缆和槽道上评估了所提框架的有效性,在等同感知条件下显著优于拾取并放置的操控过程。 我们可重构的任务设置和所提出的框架为未来在三维空间中的电缆布线操作提供了参考。

Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能

Publish: 2025-08-13 07:25:40 UTC 发布时间:2025-08-13 07:25:40 UTC

#65 GoViG: Goal-Conditioned Visual Navigation Instruction Generation #65 GoViG:面向目标的视觉导航指令生成

Authors: [Fengyi Wu](https://arxiv.org/search/?searchtype=author&query=Fengyi Wu), [Yifei Dong](https://arxiv.org/search/?searchtype=author&query=Yifei Dong), [Zhi-Qi Cheng](https://arxiv.org/search/?searchtype=author&query=Zhi-Qi Cheng), [Yilong Dai](https://arxiv.org/search/?searchtype=author&query=Yilong Dai), [Guangyu Chen](https://arxiv.org/search/?searchtype=author&query=Guangyu Chen), [Hang Wang](https://arxiv.org/search/?searchtype=author&query=Hang Wang), [Qi Dai](https://arxiv.org/search/?searchtype=author&query=Qi Dai), [Alexander G. Hauptmann](https://arxiv.org/search/?searchtype=author&query=Alexander G. Hauptmann) 作者:吴峰逸、董一飞、程志琦、戴逸龙、陈光宇、王航、戴琦、Alexander G. Hauptmann

We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization. 我们提出了目标条件视觉导航指令生成(GoViG)这一新任务,旨在仅通过第一人称视角的初始与目标状态视觉观测,自主生成精确且语境连贯的导航指令。与依赖语义标注或环境地图等结构化输入的传统方法不同,GoViG 完全利用原始第一人称视觉数据,大幅提升了在未知和非结构化环境中的适应性。我们的方法通过将该任务分解为两个相互关联的子任务来解决: (1) 视觉预测,预测连接初始视角与目标视角的中间视觉状态;(2) 指令生成,基于已观测和预测的视觉信息合成语言上连贯的指令。这些子任务被集成在一个自回归多模态大语言模型中,采用定制化目标进行训练以确保空间精确性和语言清晰度。此外,我们提出了两种互补的多模态推理策略——一次性推理和交错推理,以模拟人类在导航过程中的渐进认知处理。 为了评估我们的方法,我们提出了 R2R-Goal 数据集,结合了多样的合成和真实世界轨迹。实证结果显示相较于最先进的方法有显著提升,取得了更优的 BLEU-4 和 CIDEr 分数,并具有强健的跨域泛化能力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 07:05:17 UTC 发布:2025-08-13 07:05:17 UTC

#66 Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion #66 你的编码意图其实隐藏在上下文中——在完成之前你应当有意推断它

Authors: [Yanzhou Li](https://arxiv.org/search/?searchtype=author&query=Yanzhou Li), [Tianlin Li](https://arxiv.org/search/?searchtype=author&query=Tianlin Li), [Yiran Zhang](https://arxiv.org/search/?searchtype=author&query=Yiran Zhang), [Shangqing Liu](https://arxiv.org/search/?searchtype=author&query=Shangqing Liu), [Aishan Liu](https://arxiv.org/search/?searchtype=author&query=Aishan Liu), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu) 作者:李彦舟、李天麟、张轶然、刘尚清、刘爱山、刘洋

Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions–such as docstrings–are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains. 大型语言模型(LLMs)越来越多地被用于仓库级代码库中的函数补全。先前的研究表明,当提供明确的说明——例如文档字符串(docstrings)——时,这些模型能够生成高度准确的实现。然而,在真实世界的代码仓库中,此类注释经常缺失,且在没有注释时性能大幅下降。为了解决这一差距,我们将该任务框架化为一个三阶段过程。第一阶段侧重于意图推断,模型分析目标函数之前的代码以发现关于期望功能的线索。此类前置上下文通常包含微妙但关键的信息,我们设计了一种基于推理的提示框架,引导 LLM 在生成任何代码之前逐步提取并综合这些信号。第二阶段引入了一种可选的交互式精炼机制,以处理仅凭前置上下文不足以恢复意图的情况。在此阶段,模型提出一小组候选意图,开发者可以选择或编辑它们,从而使推断出的意图与实际需求高度一致。 最后,在第三阶段,LLM 在已确定的意图条件下生成目标函数。为支持该流程,我们整理了一个包含 40,000 个示例的数据集,示例附有中间推理轨迹和相应的文档字符串。在 DevEval 和 ComplexCodeEval 上的大量实验证明,我们的方法持续提升了多种 LLM 的表现,在基于参考和基于执行的度量上均实现了超过 20% 的相对增益,交互式精炼阶段在这些增益之外还带来了额外的改进。
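
A compact sketch of the three-stage flow, infer the intent from the preceding context, optionally let the developer pick or edit a candidate intent, then generate conditioned on it, might look like the following. The prompts and the `call_llm` helper are placeholders, not the paper's exact prompting framework.

```python
# Sketch of the three-stage flow described above (placeholder prompts, not the paper's).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client")

def infer_intent(preceding_code: str, n_candidates: int = 3) -> list[str]:
    # Stage 1: reason step by step over the preceding context before writing any code.
    prompt = ("Analyze the code before the unfinished function and reason step by step "
              "about what the function is meant to do. Then propose "
              f"{n_candidates} short candidate intents, one per line.\n\n{preceding_code}")
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def refine_with_developer(candidates: list[str]) -> str:
    # Stage 2 (optional): let the developer select or edit the inferred intent.
    for i, c in enumerate(candidates):
        print(f"[{i}] {c}")
    choice = input("Pick an intent index or type your own: ").strip()
    return candidates[int(choice)] if choice.isdigit() else choice

def complete_function(preceding_code: str, intent: str) -> str:
    # Stage 3: generate the target function conditioned on the finalized intent.
    return call_llm(f"Intent: {intent}\nComplete the next function:\n\n{preceding_code}")
```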

Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能

Publish: 2025-08-13 06:45:23 UTC 发布:2025-08-13 06:45:23 协调世界时 (UTC)

#67 AI Blob! LLM-Driven Recontextualization of Italian Television Archives #67 AI Blob!由 LLM 驱动的意大利电视档案再语境化

Author: [Roberto Balestri](https://arxiv.org/search/?searchtype=author&query=Roberto Balestri) 作者:Roberto Balestri

This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation. 本文介绍了 AI Blob!,一个旨在探索语义编目和大型语言模型(LLMs)在档案电视片段检索与重新语境化中潜力的实验性系统。该系统在方法上借鉴了意大利电视节目如 Blob(RAI Tre,1989-)等,整合了自动语音识别(ASR)、语义嵌入和检索增强生成(RAG)来组织并重新解读档案内容。系统处理一个由 1,547 部意大利电视视频组成的策划数据集,通过转录音频、将其分割为句子级单元,并将这些片段嵌入到向量数据库以便语义查询。在用户输入主题提示后,LLM 生成一系列在语言和概念上相关的查询,从而指导视听片段的检索与重组。这些片段被算法性地筛选并构造成叙事序列,生成模拟讽刺并置和主题连贯性编辑手法的蒙太奇。通过强调动态、内容感知的检索而非静态元数据模式,AI Blob! 展示了语义技术如何促进档案参与的新方法,使自动化叙事构建和文化分析的新形式成为可能。该项目为媒体史学和以人工智能驱动的档案研究的持续讨论做出贡献,既提供了一个概念框架,也提供了一个公开可用的数据集,以支持进一步的跨学科实验。
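
The retrieval core described above (sentence-level segments in a vector index, LLM-expanded thematic queries, fragments recombined into a montage order) can be sketched as below. The `embed` and `call_llm` functions are placeholders for whatever embedding model and LLM the system uses.

```python
# Minimal sketch of the retrieval core (placeholder embedding/LLM functions, not the
# project's code): transcript sentences live in a vector index; a thematic prompt is
# expanded into related queries, and the nearest segments form the montage cut list.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model")

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def build_index(segments: list[str]) -> np.ndarray:
    vecs = embed(segments)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def montage(theme: str, segments: list[str], index: np.ndarray, per_query: int = 3):
    queries = call_llm(f"Give 5 short queries related to the theme: {theme}").splitlines()
    picked = []
    for q in filter(None, (q.strip() for q in queries)):
        qv = embed([q])[0]
        qv = qv / np.linalg.norm(qv)
        best = np.argsort(index @ qv)[::-1][:per_query]   # cosine similarity, top hits
        picked.extend(int(i) for i in best)
    # Keep the first occurrence of each fragment, in retrieval order, as the cut list.
    seen, order = set(), []
    for i in picked:
        if i not in seen:
            seen.add(i)
            order.append(segments[i])
    return order
```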

Subjects: Multimedia, Artificial Intelligence, Computation and Language, Digital Libraries 主题:多媒体、人工智能、计算与语言、数字图书馆

Publish: 2025-08-13 06:38:32 UTC

#68 COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection #68 COXNet:用于 RGBT 小目标检测的跨层融合及自适应对齐与尺度整合

Authors: [Peiran Peng](https://arxiv.org/search/?searchtype=author&query=Peiran Peng), [Tingfa Xu](https://arxiv.org/search/?searchtype=author&query=Tingfa Xu), [Liqiang Song](https://arxiv.org/search/?searchtype=author&query=Liqiang Song), [Mengqi Zhu](https://arxiv.org/search/?searchtype=author&query=Mengqi Zhu), [Yuqiang Fang](https://arxiv.org/search/?searchtype=author&query=Yuqiang Fang), [Jianan Li](https://arxiv.org/search/?searchtype=author&query=Jianan Li) 作者:彭沛然,许廷发,宋立强,朱梦琦,方宇强,李佳楠

Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments. 在多模态红-绿-蓝-热(RGBT)影像中检测微小目标是计算机视觉中的一项关键挑战,尤其在监控、搜救和自主导航中尤为重要。基于无人机的场景由于空间错位、弱光条件、遮挡和复杂背景使这些挑战更加严重。现有方法难以有效利用可见光与热成像模态之间的互补信息。我们提出了 COXNet,一种用于 RGBT 微小目标检测的新框架,通过三项核心创新来解决这些问题:i) 跨层融合模块,将高层可见光特征与低层热特征融合,以增强语义和空间精度;ii) 动态对齐与尺度精炼模块,校正跨模态的空间错位并保留多尺度特征;以及 iii) 使用 GeoShape 相似度度量的优化标签分配策略以提升定位性能。与最先进方法相比,COXNet 在 RGBTDronePerson 数据集上实现了 3.32% 的 mAP 提升,证明了其在复杂环境中实现鲁棒检测的有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 06:30:03 UTC 发布:2025-08-13 06:30:03 UTC

#69 Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks #69 去中心化秩调度用于边缘辅助车联网网络中受能量限制的多任务联邦微调

Authors: [Bokeng Zheng](https://arxiv.org/search/?searchtype=author&query=Bokeng Zheng), [Jianqiang Zhong](https://arxiv.org/search/?searchtype=author&query=Jianqiang Zhong), [Jiayi Liu](https://arxiv.org/search/?searchtype=author&query=Jiayi Liu), [Xiaoxi Zhang](https://arxiv.org/search/?searchtype=author&query=Xiaoxi Zhang) 作者:郑博耕、钟建强、刘佳怡、张晓曦

Federated fine-tuning has emerged as a promising approach for adapting foundation models (FMs) to diverse downstream tasks in edge environments. In Internet of Vehicles (IoV) systems, enabling efficient and low-latency multi-task adaptation is particularly challenging due to client mobility, heterogeneous resources, and intermittent connectivity. This paper proposes a hierarchical federated fine-tuning framework that coordinates roadside units (RSUs) and vehicles to support resource-aware and mobility-resilient learning across dynamic IoV scenarios. Leveraging Low-Rank Adaptation (LoRA), we introduce a decentralized, energy-aware rank adaptation mechanism formulated as a constrained multi-armed bandit problem. A novel UCB-DUAL algorithm is developed to enable adaptive exploration under per-task energy budgets, achieving provable sublinear regret. To evaluate our method, we construct a large-scale IoV simulator based on real-world trajectories, capturing dynamic participation, RSU handoffs, and communication variability. Extensive experiments show that our approach achieves the best accuracy-efficiency trade-off among all baselines, reducing latency by over 24% and improving average accuracy by more than 2.5%. 联邦微调已成为在边缘环境中将基础模型(FMs)适配到多样化下游任务的一种有前景的方法。在车联网(IoV)系统中,由于客户端移动性、资源异质性和间歇性连接,使得实现高效且低延迟的多任务适配尤其具有挑战性。本文提出了一个分层联邦微调框架,协调路侧单元(RSUs)和车辆,以支持在动态 IoV 场景下的资源感知和抗移动性学习。利用低秩适配(LoRA),我们提出了一种去中心化的、能量感知的秩自适应机制,并将其构建为一个受限的多臂赌博机问题。为此提出了一种新颖的 UCB-DUAL 算法,以在每任务能量预算下实现自适应探索,并获得可证的次线性遗憾。为了评估我们的方法,我们基于真实轨迹构建了一个大规模 IoV 仿真器,捕捉了动态参与、RSU 切换和通信可变性。 大量实验表明,我们的方法在所有基线中实现了最佳的准确性与效率折中,延迟降低超过 24%,平均准确率提高超过 2.5%。
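
The abstract does not spell out UCB-DUAL, but the general shape of budget-aware bandit selection it describes can be sketched as follows: candidate LoRA ranks are treated as arms, arms whose energy cost exceeds the remaining per-task budget are filtered out, and the usual upper-confidence rule picks among the rest. The costs and rewards below are illustrative only.

```python
# Generic sketch of budget-aware UCB arm selection (this is not the paper's UCB-DUAL
# algorithm). Each "arm" is a candidate LoRA rank with an energy cost per round.
import math

def select_rank(counts, means, costs, remaining_energy, t, c=1.0):
    feasible = [i for i, cost in enumerate(costs) if cost <= remaining_energy]
    if not feasible:
        return None                                # budget exhausted
    untried = [i for i in feasible if counts[i] == 0]
    if untried:
        return untried[0]                          # play each feasible arm once first
    def ucb(i):
        return means[i] + c * math.sqrt(2 * math.log(t) / counts[i])
    return max(feasible, key=ucb)

# Example: three candidate ranks with per-round energy costs of 1, 2 and 4 units.
counts, means, costs = [3, 3, 1], [0.4, 0.55, 0.7], [1, 2, 4]
print(select_rank(counts, means, costs, remaining_energy=3, t=7))   # arm 2 is infeasible
```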

Subjects: Machine Learning, Artificial Intelligence, Networking and Internet Architecture

Publish: 2025-08-13 06:29:00 UTC 发布:2025-08-13 06:29:00 UTC

#70 Generation of Indian Sign Language Letters, Numbers, and Words #70 印度手语字母、数字与单词的生成

Authors: [Ajeet Kumar Yadav](https://arxiv.org/search/?searchtype=author&query=Ajeet Kumar Yadav), [Nishant Kumar](https://arxiv.org/search/?searchtype=author&query=Nishant Kumar), [Rathna G N](https://arxiv.org/search/?searchtype=author&query=Rathna G N) 作者:Ajeet Kumar Yadav、Nishant Kumar、Rathna G N

Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don’t know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words. 手语包含手部动作、面部表情和身体姿势,是与听力受损者交流的重要媒介。受过良好训练的手语群体交流顺畅,但不懂手语的人则面临巨大挑战。识别与生成是听力者与听力受损者之间的基本沟通方式。尽管在识别方面已有进展,但手语生成仍需进一步探索。渐进式生成对抗网络(ProGAN)擅长生成高质量图像,而自注意力生成对抗网络(SAGAN)则能在中等分辨率下生成富含特征的图像。平衡分辨率与细节对于手语图像生成至关重要。我们正在开发一种结合两种模型的生成对抗网络(GAN)变体,以生成富含特征、高分辨率且类别条件化的手语图像。 我们改进的基于注意力的模型能够生成高质量的印度手语字母、数字和单词图像,在 Inception 得分(IS)和 Fréchet Inception 距离(FID)上均优于传统的 ProGAN,分别提升了 3.2 和 30.12。此外,我们正在发布一个大型数据集,包含高质量的印度手语字母、数字和 129 个单词的图像。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-08-13 06:10:20 UTC 发布时间:2025-08-13 06:10:20 UTC

#71 COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation #71 COMPEER:用于情感支持对话的可控共情强化推理

Authors: [Yunxiao Wang](https://arxiv.org/search/?searchtype=author&query=Yunxiao Wang), [Meng Liu](https://arxiv.org/search/?searchtype=author&query=Meng Liu), [Wenqi Liu](https://arxiv.org/search/?searchtype=author&query=Wenqi Liu), [Kaiyu Jiang](https://arxiv.org/search/?searchtype=author&query=Kaiyu Jiang), [Bin Wen](https://arxiv.org/search/?searchtype=author&query=Bin Wen), [Fan Yang](https://arxiv.org/search/?searchtype=author&query=Fan Yang), [Tingting Gao](https://arxiv.org/search/?searchtype=author&query=Tingting Gao), [Guorui Zhou](https://arxiv.org/search/?searchtype=author&query=Guorui Zhou), [Liqiang Nie](https://arxiv.org/search/?searchtype=author&query=Liqiang Nie) 作者:王云晓、刘萌、刘文琦、蒋凯宇、温斌、杨帆、高婷婷、周国瑞、聂立强

Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model’s emotional support ability, advancing the development of empathetic, human-like support systems. 情感支持对话对于促进情绪健康至关重要,但现有模型往往缺乏基于心理学原理的深入共情推理。为了解决这一问题,我们提出了可控的共情推理,将自然语言推理与结构化的心理学步骤相结合。我们构建了一个带有推理正确性和回应偏好注释的细粒度数据集以实现此能力。为进一步增强训练,我们采用了带有统一过程-结果奖励模型的强化学习,该模型提供精确反馈。为缓解由熵崩溃导致的回应重复性问题,我们引入了基于人格的对话改写和一种考虑冗余性的奖励重加权策略。我们的方法显著提升了模型的情感支持能力,推动了具有人类般共情的支持系统的发展。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 06:09:32 UTC 发布:2025-08-13 06:09:32 UTC

#72 SMART-OC: A Real-time Time-risk Optimal Replanning Algorithm for Dynamic Obstacles and Spatio-temporally Varying Currents #72 SMART-OC:一种用于动态障碍物和时空变化洋流的实时时间-风险最优重规划算法

Authors: [Reema Raval](https://arxiv.org/search/?searchtype=author&query=Reema Raval), [Shalabh Gupta](https://arxiv.org/search/?searchtype=author&query=Shalabh Gupta) 作者:Reema Raval,Shalabh Gupta

Typical marine environments are highly complex with spatio-temporally varying currents and dynamic obstacles, presenting significant challenges to Unmanned Surface Vehicles (USVs) for safe and efficient navigation. Thus, the USVs need to continuously adapt their paths with real-time information to avoid collisions and follow the path of least resistance to the goal via exploiting ocean currents. In this regard, we introduce a novel algorithm, called Self-Morphing Adaptive Replanning Tree for dynamic Obstacles and Currents (SMART-OC), that facilitates real-time time-risk optimal replanning in dynamic environments. SMART-OC integrates the obstacle risks along a path with the time cost to reach the goal to find the time-risk optimal path. The effectiveness of SMART-OC is validated by simulation experiments, which demonstrate that the USV performs fast replannings to avoid dynamic obstacles and exploit ocean currents to successfully reach the goal. 典型的海洋环境高度复杂,具有时空变化的洋流和动态障碍物,这给无人水面船(USV)的安全与高效导航带来了重大挑战。因此,USV 需要利用实时信息持续调整其航线,以避开碰撞并通过利用洋流沿着阻力最小的路径前往目标。为此,我们提出了一种新算法,称为用于动态障碍物和洋流的自我变形自适应重规划树(SMART-OC),该算法促进在动态环境中的实时时间-风险最优重规划。SMART-OC 将路径上的障碍风险与到达目标的时间成本相结合,以寻找时间-风险最优路径。通过仿真实验验证了 SMART-OC 的有效性,实验表明 USV 能够快速重规划以避开动态障碍物并利用洋流成功到达目标。
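
The time-risk trade-off can be illustrated with a simple path-cost function that adds travel time (shortened or lengthened by the current component along each edge) to an accumulated obstacle-risk term. This is an illustrative cost model, not the SMART-OC planner itself.

```python
# Illustrative time-risk cost for a candidate path (not the SMART-OC algorithm).
import math

def edge_time(p, q, speed, current):
    dx, dy = q[0] - p[0], q[1] - p[1]
    dist = math.hypot(dx, dy)
    along = (current[0] * dx + current[1] * dy) / dist   # current component along heading
    return dist / max(speed + along, 1e-3)               # favourable currents shorten time

def edge_risk(p, q, obstacles, sigma=2.0):
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return sum(math.exp(-((mid[0] - ox) ** 2 + (mid[1] - oy) ** 2) / (2 * sigma ** 2))
               for ox, oy in obstacles)

def time_risk_cost(path, speed, current, obstacles, risk_weight=5.0):
    cost = 0.0
    for p, q in zip(path, path[1:]):
        cost += edge_time(p, q, speed, current) + risk_weight * edge_risk(p, q, obstacles)
    return cost

path = [(0, 0), (3, 1), (6, 2)]
print(time_risk_cost(path, speed=2.0, current=(0.5, 0.0), obstacles=[(3.5, 1.0)]))
```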

Subjects: Robotics, Artificial Intelligence 主题:机器人学,人工智能

Publish: 2025-08-13 05:42:25 UTC 发布:2025-08-13 05:42:25 UTC

#73 Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference #73 验证分布式深度学习模型实现的精炼与迭代关系推断

Authors: [Zhanghan Wang](https://arxiv.org/search/?searchtype=author&query=Zhanghan Wang), [Ding Ding](https://arxiv.org/search/?searchtype=author&query=Ding Ding), [Hang Zhu](https://arxiv.org/search/?searchtype=author&query=Hang Zhu), [Haibin Lin](https://arxiv.org/search/?searchtype=author&query=Haibin Lin), [Aurojit Panda](https://arxiv.org/search/?searchtype=author&query=Aurojit Panda) 作者:Zhanghan Wang、Ding Ding、Hang Zhu、Haibin Lin、Aurojit Panda

Distributed machine learning training and inference is common today because today’s large models require more memory and compute than can be provided by a single GPU. Distributed models are generally produced by programmers who take a sequential model specification and apply several distribution strategies to distribute state and computation across GPUs. Unfortunately, bugs can be introduced in the process, and a distributed model implementation’s outputs might differ from the sequential model’s outputs. In this paper, we describe an approach to statically identify such bugs by checking model refinement, that is, can the sequential model’s outputs be reconstructed from the distributed model’s outputs? Our approach, implemented in GraphGuard, uses iterative rewriting to prove model refinement. Our approach can scale to today’s large models and deployments: we evaluate it using GPT and Llama-3. Further, it provides actionable output that aids in bug localization. 分布式机器学习训练和推理在当今很常见,因为当今的大型模型所需的内存和计算资源超出单个 GPU 能提供的范围。分布式模型通常由程序员根据一个顺序模型规范,应用若干分布策略来在 GPU 之间分配状态和计算。不幸的是,这一过程中可能会引入错误,导致分布式模型实现的输出与顺序模型的输出不一致。在本文中,我们描述了一种通过检查模型细化(即是否能从分布式模型的输出重构出顺序模型的输出)来静态识别此类错误的方法。我们在 GraphGuard 中实现的方法使用迭代重写来证明模型细化。该方法可以扩展到当今的大型模型和部署:我们使用 GPT 和 Llama-3 对其进行了评估。此外,它还能提供有助于定位错误的可操作输出。
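
The refinement property being verified, that the sequential model's outputs can be reconstructed from the distributed model's outputs, can be illustrated with a toy numeric check on a column-sharded (tensor-parallel) linear layer. GraphGuard itself proves this statically via iterative rewriting; the snippet below is only a concrete instance of the property.

```python
# Toy illustration of the refinement property (not GraphGuard's static analysis):
# a column-sharded tensor-parallel linear layer refines the sequential layer if
# concatenating the shard outputs reconstructs the sequential output.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))        # batch of activations
W = rng.standard_normal((16, 8))        # sequential weight

sequential_out = x @ W

shards = np.split(W, 2, axis=1)         # two GPUs, each holding half the output columns
distributed_out = np.concatenate([x @ w for w in shards], axis=1)

assert np.allclose(sequential_out, distributed_out)   # refinement holds for this op
```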

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题:分布式、并行与集群计算,人工智能

Publish: 2025-08-13 05:33:25 UTC 发布:2025-08-13 05:33:25 UTC

#74 From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation #74 从排序到选择:一种简单但高效的用于检索增强生成的动态段落选择器

Authors: [Siyuan Meng](https://arxiv.org/search/?searchtype=author&query=Siyuan Meng), [Junming Liu](https://arxiv.org/search/?searchtype=author&query=Junming Liu), [Yirong Chen](https://arxiv.org/search/?searchtype=author&query=Yirong Chen), [Song Mao](https://arxiv.org/search/?searchtype=author&query=Song Mao), [Pinlong Cai](https://arxiv.org/search/?searchtype=author&query=Pinlong Cai), [Guohang Yan](https://arxiv.org/search/?searchtype=author&query=Guohang Yan), [Botian Shi](https://arxiv.org/search/?searchtype=author&query=Botian Shi), [Ding Wang](https://arxiv.org/search/?searchtype=author&query=Ding Wang) 作者:孟思远,刘俊铭,陈艺蓉,毛松,蔡品龙,颜国航,史博天,王丁

Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios. 检索增强生成(RAG)系统常常被其重排序模块所制约,这些模块通常对段落独立打分并选择固定的 Top-K 大小。这种方法在需要跨多个文档综合证据的复杂多跳查询上表现欠佳,会导致一个权衡:较小的 K 值遗漏关键信息,而较大的 K 值则引入噪声。为了解决这一问题,我们引入了动态段落选择器(DPS),这是一种将段落选择视为监督学习问题的新型重排序框架。不同于传统的逐点或序列方法,DPS 经过微调以捕捉段落间的依赖关系,并动态选择最相关的一组段落用于生成。作为一个无缝的即插即用模块,DPS 无需对标准 RAG 流水线进行任何修改。在五个基准上的全面评估表明,DPS 始终优于最先进的重排序器和微调方法。值得注意的是,在具有挑战性的 MuSiQue 数据集上,DPS 分别比像 Qwen3-reranker 和 RankingGPT 这样的强基线提升了 30.06%和 15.4%的 F1 得分。 我们的结果表明,通过启用自适应证据选择,DPS 在复杂的检索增强生成(RAG)场景中显著提升了推理能力。
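
The move "from ranking to selection" can be sketched as greedy, set-conditioned selection with a learned stopping decision, instead of a fixed top-K cut-off. The `score` function below stands in for the fine-tuned selector, and the threshold is an assumption for illustration.

```python
# Sketch of dynamic passage selection (not the DPS model): passages are picked greedily,
# each score conditioned on what is already selected, and selection stops when no
# remaining passage clears the threshold.
def score(query: str, passage: str, selected: list[str]) -> float:
    raise NotImplementedError("plug in the trained set-conditioned scorer")

def select_passages(query: str, candidates: list[str], threshold: float = 0.5,
                    max_passages: int = 10) -> list[str]:
    selected: list[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < max_passages:
        scored = [(score(query, p, selected), p) for p in remaining]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score < threshold:        # dynamic stopping: the selected set is "enough"
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```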

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 05:05:34 UTC 发布时间:2025-08-13 05:05:34 UTC

#75 Learning Facts at Scale with Active Reading #75 通过主动阅读在大规模上学习事实

Authors: [Jessy Lin](https://arxiv.org/search/?searchtype=author&query=Jessy Lin), [Vincent-Pierre Berges](https://arxiv.org/search/?searchtype=author&query=Vincent-Pierre Berges), [Xilun Chen](https://arxiv.org/search/?searchtype=author&query=Xilun Chen), [Wen-Tau Yih](https://arxiv.org/search/?searchtype=author&query=Wen-Tau Yih), [Gargi Ghosh](https://arxiv.org/search/?searchtype=author&query=Gargi Ghosh), [Barlas Oğuz](https://arxiv.org/search/?searchtype=author&query=Barlas Oğuz) 作者:Jessy Lin、Vincent-Pierre Berges、Xilun Chen、Wen-Tau Yih、Gargi Ghosh、Barlas Oğuz

LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA. LLMs 众所周知在其参数化记忆中存储了大量知识。然而,从这些记忆中学习和回忆事实被认为是不可靠的,很大程度上取决于训练数据中特定事实的普遍性以及其他尚不明确的因素。实践者缺乏能够确保模型可靠且一致地学习特定知识体系的工具。为此,我们提出了主动阅读(Active Reading):一个训练模型通过自生成的学习策略研究给定材料的框架。首先,我们展示了在专家领域用主动阅读训练的模型比普通微调和其他数据增强方法吸收了显著更多的知识。我们对专家级 8B 模型应用主动阅读到每个基准的源文档,使其在基于维基百科的 SimpleQA 子集上达到 66%(相对于普通微调提高 313%)并在 FinanceBench 上达到 26%(相对于普通微调提高 160%)。最后,我们展示了主动阅读可以在预训练规模上被利用以构建更具事实性的模型。 作为这一点的示范,我们发布了 Meta WikiExpert-8B,这是一个在 1 万亿个生成标记上训练的维基百科专家模型,在事实问答方面超过了拥有数千亿参数的模型。
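
An Active Reading-style data generation loop can be sketched as: ask the model for study strategies over a document, apply each strategy to produce synthetic study text, and use those texts as fine-tuning data. The prompts and the `call_llm` helper below are placeholders, not the paper's pipeline.

```python
# Sketch of an Active Reading-style data generation loop (placeholder prompts).
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def active_reading_examples(document: str, n_strategies: int = 5) -> list[str]:
    strategies = call_llm(
        f"Suggest {n_strategies} different ways to study the following material "
        "(e.g. self-quizzing, paraphrasing, linking to prior knowledge), one per line:\n\n"
        f"{document}"
    ).splitlines()
    examples = []
    for strategy in filter(None, (s.strip() for s in strategies)):
        examples.append(call_llm(
            "Apply this study strategy to the material and write out the result.\n"
            f"Strategy: {strategy}\n\nMaterial:\n{document}"
        ))
    return examples   # these generated texts become fine-tuning data
```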

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-13 04:54:43 UTC 发布:2025-08-13 04:54:43 UTC

#76 Large-Small Model Collaborative Framework for Federated Continual Learning #76 大-小模型协同框架用于联邦持续学习

Authors: [Hao Yu](https://arxiv.org/search/?searchtype=author&query=Hao Yu), [Xin Yang](https://arxiv.org/search/?searchtype=author&query=Xin Yang), [Boyang Fan](https://arxiv.org/search/?searchtype=author&query=Boyang Fan), [Xuemei Cao](https://arxiv.org/search/?searchtype=author&query=Xuemei Cao), [Hanlin Gu](https://arxiv.org/search/?searchtype=author&query=Hanlin Gu), [Lixin Fan](https://arxiv.org/search/?searchtype=author&query=Lixin Fan), [Qiang Yang](https://arxiv.org/search/?searchtype=author&query=Qiang Yang) 作者:余昊,杨鑫,范博阳,曹雪梅,顾翰林,范立新,杨强

Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models. 面向基础模型(FMs)的持续学习(CL)是一个重要但尚未充分探索的挑战,尤其在联邦持续学习(FCL)中更是如此:每个客户端在严格的数据和通信限制下,从私有的、不断演变的任务流中学习。尽管基础模型具有强大的泛化能力,但在本地下游任务上常常表现不佳,因为它们无法利用私有本地数据。此外,使基础模型在学习新任务的同时不遗忘先前知识本身就是一个具有挑战性的问题,主要由于其庞大的参数量和高复杂性。相比之下,小模型可以在资源受限的条件下本地训练,并受益于更成熟的持续学习技术。为弥合小模型与基础模型之间的差距,我们提出了首个在 FCL 中的协作框架:轻量级本地模型作为动态桥梁,持续适应新任务,同时提升大型模型的效用。 还包含两个新组件:小模型持续微调(Small Model Continual Fine-tuning)用于防止小模型随时间遗忘;逐一蒸馏(One-by-One Distillation)在服务器端对异构本地知识进行个性化融合。实验结果表明,即便客户端使用异构的小模型,该方法仍表现出优越的性能。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 04:49:50 UTC 发表:2025-08-13 04:49:50 世界协调时间 (UTC)

#77 Episodic Memory Representation for Long-form Video Understanding #77 情景记忆表示用于长视频理解

Authors: [Yun Wang](https://arxiv.org/search/?searchtype=author&query=Yun Wang), [Long Zhang](https://arxiv.org/search/?searchtype=author&query=Long Zhang), [Jingren Liu](https://arxiv.org/search/?searchtype=author&query=Jingren Liu), [Jiaqi Yan](https://arxiv.org/search/?searchtype=author&query=Jiaqi Yan), [Zhanjie Zhang](https://arxiv.org/search/?searchtype=author&query=Zhanjie Zhang), [Jiahao Zheng](https://arxiv.org/search/?searchtype=author&query=Jiahao Zheng), [Xun Yang](https://arxiv.org/search/?searchtype=author&query=Xun Yang), [Dapeng Wu](https://arxiv.org/search/?searchtype=author&query=Dapeng Wu), [Xiangyu Chen](https://arxiv.org/search/?searchtype=author&query=Xiangyu Chen), [Xuelong Li](https://arxiv.org/search/?searchtype=author&query=Xuelong Li) 作者:王云,张龙,刘景仁,闫佳琦,张展杰,郑家豪,杨勋,吴大鹏,陈翔宇,李雪龙

Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text image matching, overlooking spatio temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain of thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames. 视频大型语言模型(Video-LLMs)在通用视频理解方面表现出色,但由于上下文窗口限制,在处理长视频时存在困难。因此,近期的方法侧重于关键帧检索,将冗长视频压缩为少量信息性帧。尽管这些方法实用,但它们将问题简化为静态图像与文本匹配,忽视了对捕捉场景转换和上下文连续性至关重要的时空关系,且可能产生信息有限的冗余关键帧,稀释了对准确视频问答至关重要的显著线索。为了解决这些局限性,我们提出了 Video-EM,这是一个无需训练的框架,受人类情节记忆原理启发,旨在促进稳健且有上下文支撑的推理。Video-EM 并不将关键帧视为孤立的视觉实体,而是将其明确建模为按时间顺序排列的情节事件,捕捉重建底层叙事所需的空间关系和时间动态。 此外,该框架利用与 LLMs 的链式思维(CoT)来迭代识别一组最小但高度信息化的情节记忆子集,从而使 Video-LLMs 能够高效且准确地回答问题。在 Video-MME、EgoSchema、HourVideo 和 LVBench 基准上的大量评估证实了 Video-EM 的优越性:与各自基线相比,在使用更少帧的情况下仍取得了 4–9 个百分点的性能提升并达到高度竞争性的结果。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia 主题:计算机视觉与模式识别,人工智能,多媒体

Publish: 2025-08-13 04:33:07 UTC 发布日期:2025-08-13 04:33:07 UTC

#78 NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs #78 NeuronTune:用于在 LLMs 中实现安全性与效用平衡对齐的细粒度神经元调节

Authors: [Birong Pan](https://arxiv.org/search/?searchtype=author&query=Birong Pan), [Mayi Xu](https://arxiv.org/search/?searchtype=author&query=Mayi Xu), [Qiankun Pi](https://arxiv.org/search/?searchtype=author&query=Qiankun Pi), [Jianhao Chen](https://arxiv.org/search/?searchtype=author&query=Jianhao Chen), [Yuanyuan Zhu](https://arxiv.org/search/?searchtype=author&query=Yuanyuan Zhu), [Ming Zhong](https://arxiv.org/search/?searchtype=author&query=Ming Zhong), [Tieyun Qian](https://arxiv.org/search/?searchtype=author&query=Tieyun Qian) 作者:潘碧荣、徐马逸、皮千坤、陈建浩、朱媛媛、钟明、钱铁云

Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, degradation in generated text quality and general task performance–the former two reflecting deficits in robust safety and the latter constituting utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility. 在保持实用性的同时确保稳健的安全对齐对于大规模语言模型(LLMs)的可靠部署至关重要。然而,现有技术在根本上存在相互交织的不足:对恶意攻击的稳健性不足、频繁拒绝良性查询、生成文本质量下降以及整体任务表现退化——前两者反映了稳健安全的缺陷,后者构成了实用性的损害。我们将这些局限归因于现有方法中粗粒度的逐层干预。为了解决这一问题,我们提出了 NeuronTune,一种细粒度框架,通过动态调节稀疏神经元来实现安全与实用性的同步优化。我们的方法首先通过归因在所有层中识别对安全关键和对实用性保持有利的神经元,然后采用元学习自适应地增强安全神经元的激活并抑制实用神经元的激活。关键在于,NeuronTune 通过神经元数量阈值实现可调节的干预范围,支持对安全关键或实用优先场景的灵活适配。 大量实验结果表明,我们的方法显著优于现有最先进技术,在保持出色效用的同时实现了更优的模型安全性。
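
The activation-level modulation described above can be sketched with PyTorch forward hooks that amplify a chosen set of "safety" neurons and damp "utility" neurons in one layer. The neuron indices and gains below are placeholders; in the paper they are selected by attribution and the scales are adapted with meta-learning.

```python
# Minimal sketch of activation-level neuron modulation with forward hooks (not the
# NeuronTune training procedure). Indices and gains are placeholders.
import torch
import torch.nn as nn

def add_modulation_hook(layer: nn.Module, safety_idx, utility_idx,
                        safety_gain=1.5, utility_gain=0.8):
    def hook(_module, _inputs, output):
        output = output.clone()
        output[..., safety_idx] *= safety_gain     # amplify safety-critical activations
        output[..., utility_idx] *= utility_gain   # suppress competing utility activations
        return output
    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    mlp = nn.Linear(16, 32)
    handle = add_modulation_hook(mlp, safety_idx=[0, 3], utility_idx=[7])
    y = mlp(torch.randn(2, 16))
    handle.remove()
    print(y.shape)  # torch.Size([2, 32])
```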

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言

Publish: 2025-08-13 04:05:28 UTC 发布:2025-08-13 04:05:28 UTC

#79 DeepFeatIoT: Unifying Deep Learned, Randomized, and LLM Features for Enhanced IoT Time Series Sensor Data Classification in Smart Industries #79 DeepFeatIoT:在智能工业中统一深度学习、随机化与 LLM 特征以增强物联网时间序列传感器数据分类

Authors: [Muhammad Sakib Khan Inan](https://arxiv.org/search/?searchtype=author&query=Muhammad Sakib Khan Inan), [Kewen Liao](https://arxiv.org/search/?searchtype=author&query=Kewen Liao) 作者:Muhammad Sakib Khan Inan,Kewen Liao

Internet of Things (IoT) sensors are ubiquitous technologies deployed across smart cities, industrial sites, and healthcare systems. They continuously generate time series data that enable advanced analytics and automation in industries. However, challenges such as the loss or ambiguity of sensor metadata, heterogeneity in data sources, varying sampling frequencies, inconsistent units of measurement, and irregular timestamps make raw IoT time series data difficult to interpret, undermining the effectiveness of smart systems. To address these challenges, we propose a novel deep learning model, DeepFeatIoT, which integrates learned local and global features with non-learned randomized convolutional kernel-based features and features from large language models (LLMs). This straightforward yet unique fusion of diverse learned and non-learned features significantly enhances IoT time series sensor data classification, even in scenarios with limited labeled data. Our model’s effectiveness is demonstrated through its consistent and generalized performance across multiple real-world IoT sensor datasets from diverse critical application domains, outperforming state-of-the-art benchmark models. These results highlight DeepFeatIoT’s potential to drive significant advancements in IoT analytics and support the development of next-generation smart systems. 物联网(IoT)传感器是广泛部署于智慧城市、工业现场和医疗系统的普遍技术。它们持续生成时间序列数据,使各行业能够进行高级分析和自动化。然而,传感器元数据的丢失或不明确、数据源的异质性、不同的采样频率、不一致的计量单位以及不规则的时间戳等挑战使得原始 IoT 时间序列数据难以解释,削弱了智能系统的有效性。为了解决这些问题,我们提出了一种新颖的深度学习模型 DeepFeatIoT,该模型将学习到的局部和全局特征与非学习的随机卷积核特征以及来自大型语言模型(LLMs)的特征相结合。这种简单却独特的多样化学习与非学习特征融合显著提升了 IoT 时间序列传感器数据的分类性能,即使在标注数据有限的情形下也表现出色。 我们模型的有效性通过其在来自多个关键应用领域的不同真实物联网传感器数据集上的一致且具有泛化性的表现得以展示,且优于最先进的基准模型。这些结果突显了 DeepFeatIoT 在推动物联网分析方面取得重大进展的潜力,并支持下一代智能系统的开发。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 03:47:33 UTC 发布:2025-08-13 03:47:33 UTC

#80 Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy #80 Gen-AFFECT:生成具有一致身份的细粒度头像面部表情

Authors: [Hao Yu](https://arxiv.org/search/?searchtype=author&query=Hao Yu), [Rupayan Mallick](https://arxiv.org/search/?searchtype=author&query=Rupayan Mallick), [Margrit Betke](https://arxiv.org/search/?searchtype=author&query=Margrit Betke), [Sarah Adel Bargal](https://arxiv.org/search/?searchtype=author&query=Sarah Adel Bargal) 作者:Hao Yu、Rupayan Mallick、Margrit Betke、Sarah Adel Bargal

Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions. 不同形式的定制二维头像被广泛用于游戏应用、虚拟交流、教育和内容创作。然而,现有方法常常无法捕捉细粒度的面部表情,并且在不同表情之间难以保留身份特征。我们提出了 GEN-AFFECT,一种用于个性化头像生成的新框架,可生成具有多样面部表情的富有表情且保持身份一致的头像。我们的框架提出将多模态扩散变换器以提取的身份-表情表示作为条件输入。这使得能够保持身份特征并表示各种面部表情。GEN-AFFECT 还在推理阶段使用一致性注意机制以便在生成的一系列表情之间共享信息,从而使生成过程在生成多种细粒度表情时维持身份一致性。 与以往最先进的方法相比,GEN-AFFECT 在生成表情的准确性、身份保持以及在一系列细微面部表情中目标身份的一致性方面表现更佳。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 03:35:35 UTC 发布:2025-08-13 03:35:35 UTC

#81 RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization #81 RelayFormer:一个用于可扩展图像和视频篡改定位的统一局部-全局注意力框架

Authors: [Wen Huang](https://arxiv.org/search/?searchtype=author&query=Wen Huang), [Jiarui Yang](https://arxiv.org/search/?searchtype=author&query=Jiarui Yang), [Tao Dai](https://arxiv.org/search/?searchtype=author&query=Tao Dai), [Jiawei Li](https://arxiv.org/search/?searchtype=author&query=Jiawei Li), [Shaoxiong Zhan](https://arxiv.org/search/?searchtype=author&query=Shaoxiong Zhan), [Bin Wang](https://arxiv.org/search/?searchtype=author&query=Bin Wang), [Shu-Tao Xia](https://arxiv.org/search/?searchtype=author&query=Shu-Tao Xia) 作者:黄文、杨佳睿、戴涛、李家炜、詹少雄、王斌、夏书涛

Visual manipulation localization (VML) – across both images and videos – is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer. 视觉篡改定位(VML)——覆盖图像和视频——是数字取证中的一项关键任务,涉及识别视觉内容中被篡改的区域。然而,现有方法常常缺乏跨模态泛化能力,且难以高效处理高分辨率或长时长输入。我们提出了 RelayFormer,一种用于图像和视频的统一模块化视觉篡改定位架构。通过利用灵活的局部单元和全局-局部中继注意力(GLoRA)机制,它实现了可扩展、与分辨率无关的处理,并具备强大的泛化能力。我们的框架可通过轻量级适配模块无缝集成到现有基于 Transformer 的主干网络(如 ViT 和 SegFormer),仅需极少的结构修改,从而在不破坏预训练表征的前提下保证兼容性。此外,我们设计了一个轻量级的基于查询的掩码解码器,支持对视频序列的单次推理,且具有线性复杂度。 在多个基准上的大量实验证明,我们的方法实现了最先进的定位性能,为可扩展且与模态无关的视觉篡改定位(VML)建立了新的基线。代码可在以下地址获取:https://github.com/WenOOI/RelayFormer.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 03:35:28 UTC 发布:2025-08-13 03:35:28 协调世界时 (UTC)

#82 Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis #82 幻觉与解读:重新思考人工智能辅助数据提取在知识综合中的准确性与精确性

Authors: [Xi Long](https://arxiv.org/search/?searchtype=author&query=Xi Long), [Christy Boscardin](https://arxiv.org/search/?searchtype=author&query=Christy Boscardin), [Lauren A. Maggio](https://arxiv.org/search/?searchtype=author&query=Lauren A. Maggio), [Joseph A. Costello](https://arxiv.org/search/?searchtype=author&query=Joseph A. Costello), [Ralph Gonzales](https://arxiv.org/search/?searchtype=author&query=Ralph Gonzales), [Rasmyah Hammoudeh](https://arxiv.org/search/?searchtype=author&query=Rasmyah Hammoudeh), [Ki Lai](https://arxiv.org/search/?searchtype=author&query=Ki Lai), [Yoon Soo Park](https://arxiv.org/search/?searchtype=author&query=Yoon Soo Park), [Brian C. Gin](https://arxiv.org/search/?searchtype=author&query=Brian C. Gin) 作者:Xi Long、Christy Boscardin、Lauren A. Maggio、Joseph A. Costello、Ralph Gonzales、Rasmyah Hammoudeh、Ki Lai、Yoon Soo Park、Brian C. Gin

Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI ‘hallucinations’ (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick’s outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI accuracy depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights. 知识综合(文献综述)对健康职业教育(HPE)至关重要,能够整合研究发现以推动理论与实践的发展。然而,这些工作劳动密集,尤其是在数据提取阶段。人工智能(AI)辅助提取承诺提高效率,但也带来了准确性方面的担忧,因此区分 AI“幻觉”(捏造内容)与合理的解释性差异至关重要。我们开发了一个使用大型语言模型(LLMs)进行自动化数据提取的平台,并将 AI 与人工在一项已发表的范围综述中的 187 篇文献和 17 个提取问题上的回答进行了比较。通过测量评分者间一致性(分类变量)和主题相似度评分(开放式问题)来评估 AI-人工、人-人及 AI-AI 的一致性。通过将提取的回答与原始文献对照来识别错误。对于具体、文本中明确陈述的问题(例如标题、研究目的),AI 与人工的一致性很高;而对于需要主观解释或文本中未出现的问题(例如基于柯克帕特里克的结果、研究理由),一致性较低。 人类之间的一致性并不高于人工智能与人类之间的一致性,并且显示出同样依赖于问题的可变性。人工智能与人类不一致的回答(769/3179 = 24.2%)大多是由于解释上的差异(18.3%);人工智能的不准确情况很少见(1.51%),而人类陈述不准确的概率几乎是其三倍(4.37%)。研究结果表明,人工智能的准确性更多取决于可解释性而非幻觉。重复进行人工智能抽取可以识别解释上的复杂性或歧义,在人工审查之前优化流程。人工智能可以成为知识综合中透明且值得信赖的伙伴,尽管需要小心以保留关键的人类见解。
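
To make the agreement measures above concrete, here is a minimal sketch (labels and the choice of Cohen's kappa are assumptions for illustration; the study's exact coding scheme is not reproduced) that scores AI-human consistency on one categorical extraction question:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical answers to one extraction question,
# one entry per publication (labels are illustrative only).
human = ["yes", "no", "yes", "unclear", "yes", "no"]
ai    = ["yes", "no", "no",  "unclear", "yes", "yes"]

kappa = cohen_kappa_score(human, ai)                         # chance-corrected agreement
agreement = sum(h == a for h, a in zip(human, ai)) / len(human)
print(f"AI-human Cohen's kappa: {kappa:.3f}, raw agreement: {agreement:.1%}")
```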

Subjects: Human-Computer Interaction, Artificial Intelligence, Emerging Technologies 主题:人机交互,人工智能,新兴技术

Publish: 2025-08-13 03:33:30 UTC 发布:2025-08-13 03:33:30 UTC

#83 A Unified Contrastive-Generative Framework for Time Series Classification #83 一个用于时间序列分类的统一对比-生成框架

Authors: [Ziyu Liu](https://arxiv.org/search/?searchtype=author&query=Ziyu Liu), [Azadeh Alavi](https://arxiv.org/search/?searchtype=author&query=Azadeh Alavi), [Minyi Li](https://arxiv.org/search/?searchtype=author&query=Minyi Li), [Xiang Zhang](https://arxiv.org/search/?searchtype=author&query=Xiang Zhang) 作者:Ziyu Liu、Azadeh Alavi、Minyi Li、Xiang Zhang

Self-supervised learning (SSL) for multivariate time series mainly includes two paradigms: contrastive methods that excel at instance discrimination and generative approaches that model data distributions. While effective individually, their complementary potential remains unexplored. We propose a Contrastive Generative Time series framework (CoGenT), the first framework to unify these paradigms through joint contrastive-generative optimization. CoGenT addresses fundamental limitations of both approaches: it overcomes contrastive learning’s sensitivity to high intra-class similarity in temporal data while reducing generative methods’ dependence on large datasets. We evaluate CoGenT on six diverse time series datasets. The results show consistent improvements, with up to 59.2% and 14.27% F1 gains over standalone SimCLR and MAE, respectively. Our analysis reveals that the hybrid objective preserves discriminative power while acquiring generative robustness. These findings establish a foundation for hybrid SSL in temporal domains. We will release the code shortly. 自监督学习(SSL)用于多变量时间序列主要包括两种范式:在实例判别方面表现出色的对比方法和对数据分布建模的生成方法。虽然两者各自有效,但它们的互补潜力尚未被探索。我们提出了一个对比生成时间序列框架(CoGenT),这是第一个通过联合对比-生成优化将这两种范式统一起来的框架。CoGenT 解决了两种方法的基本局限:它克服了对比学习在时间序列数据中对类内高度相似性敏感的问题,同时减少了生成方法对大规模数据集的依赖。我们在六个多样的时间序列数据集上评估了 CoGenT。结果显示出一致的改进,分别比单独的 SimCLR 和 MAE 最高提升了 59.2% 和 14.27% 的 F1 值。我们的分析表明,混合目标在保持判别能力的同时获得了生成鲁棒性。这些发现为时间域中的混合自监督学习奠定了基础。我们将很快发布代码。
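
The joint objective can be pictured as a weighted sum of an instance-discrimination loss and a reconstruction loss. The sketch below is a minimal PyTorch illustration under assumed components (SimCLR-style InfoNCE, MSE reconstruction, linear stand-ins for the encoder/decoder); CoGenT's actual architecture and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """SimCLR-style contrastive loss between two augmented views, shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                 # (B, B) cosine similarities
    labels = torch.arange(z1.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def hybrid_loss(encoder, decoder, view1, view2, x_masked, x_target, lam=0.5):
    """Joint contrastive-generative objective: discriminate instances while
    reconstructing the masked series."""
    l_con = info_nce(encoder(view1), encoder(view2))
    l_gen = F.mse_loss(decoder(encoder(x_masked)), x_target)
    return l_con + lam * l_gen

# Toy usage with linear stand-ins for a multivariate time-series encoder/decoder.
B, T, C, D = 8, 128, 6, 64
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(T * C, D))
decoder = torch.nn.Sequential(torch.nn.Linear(D, T * C), torch.nn.Unflatten(1, (T, C)))
x = torch.randn(B, T, C)
mask = (torch.rand_like(x) > 0.5).float()      # random masking for the generative branch
loss = hybrid_loss(encoder, decoder,
                   x + 0.1 * torch.randn_like(x),   # augmented view 1
                   x + 0.1 * torch.randn_like(x),   # augmented view 2
                   x * mask, x)
loss.backward()
```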

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 03:09:14 UTC 发布:2025-08-13 03:09:14 UTC

#84 Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference #84 缓存中的影子:揭示并缓解 LLM 推理中 KV 缓存的隐私风险

Authors: [Zhifan Luo](https://arxiv.org/search/?searchtype=author&query=Zhifan Luo), [Shuo Shao](https://arxiv.org/search/?searchtype=author&query=Shuo Shao), [Su Zhang](https://arxiv.org/search/?searchtype=author&query=Su Zhang), [Lijing Zhou](https://arxiv.org/search/?searchtype=author&query=Lijing Zhou), [Yuke Hu](https://arxiv.org/search/?searchtype=author&query=Yuke Hu), [Chenxu Zhao](https://arxiv.org/search/?searchtype=author&query=Chenxu Zhao), [Zhihao Liu](https://arxiv.org/search/?searchtype=author&query=Zhihao Liu), [Zhan Qin](https://arxiv.org/search/?searchtype=author&query=Zhan Qin) 作者:罗志凡、邵硕、张苏、周丽婧、胡语柯、赵晨旭、刘志豪、秦展

The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment. 键值缓存(KV cache)用于存储中间的注意力计算(Key 与 Value 对),以避免重复计算,是加速大型语言模型(LLM)推理的基础机制。然而,这种效率优化带来了显著但尚未充分探讨的隐私风险。本文首次对这些漏洞进行了全面分析,证明攻击者可以直接从 KV-cache 重建用户的敏感输入。我们设计并实现了三种不同的攻击路径:直接的反演攻击(Inversion Attack)、更普适且更强大的碰撞攻击(Collision Attack)以及基于语义的注入攻击(Injection Attack)。这些方法展示了 KV-cache 隐私泄露问题的可行性与严重性。为此,我们提出了 KV-Cloak,一种新颖、轻量且高效的防御机制。KV-Cloak 使用可逆的基于矩阵的混淆方案,并结合算子融合来保护 KV-cache。大量实验表明,KV-Cloak 有效抵御了所有提出的攻击,将重建质量降低到随机噪声水平。 关键是,它在几乎不降低模型准确性且性能开销极小的情况下实现了这种强健的安全性,为可信的 LLM 部署提供了切实可行的解决方案。
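
As a toy illustration of the reversible-matrix idea (not KV-Cloak's actual scheme, which additionally relies on operator fusion), the snippet below stores keys multiplied by an invertible matrix and shows that attention scores can still be recovered exactly without ever materializing the plaintext keys:

```python
import torch

torch.manual_seed(0)
d, n = 16, 5                                    # head dimension, cached tokens
K = torch.randn(n, d)                           # keys that would normally sit in the KV-cache
Q = torch.randn(1, d)                           # current query

A = torch.randn(d, d)
A = A @ A.t() + d * torch.eye(d)                # well-conditioned, invertible obfuscation matrix
K_obf = K @ A                                   # what an attacker reading the cache would see

A_inv_T = torch.inverse(A).t()
scores_plain = Q @ K.t()
scores_obf = (Q @ A_inv_T) @ K_obf.t()          # Q A^{-T} (K A)^T = Q K^T
print(torch.allclose(scores_plain, scores_obf, atol=1e-3))   # True
```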

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题:密码学与安全、人工智能、计算与语言

Publish: 2025-08-13 02:48:25 UTC 发布:2025-08-13 02:48:25 UTC

#85 What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset #85 What-Meets-Where:在新数据集中对动作与接触定位的统一学习

Authors: [Yuxiao Wang](https://arxiv.org/search/?searchtype=author&query=Yuxiao Wang), [Yu Lei](https://arxiv.org/search/?searchtype=author&query=Yu Lei), [Wolin Liang](https://arxiv.org/search/?searchtype=author&query=Wolin Liang), [Weiying Xue](https://arxiv.org/search/?searchtype=author&query=Weiying Xue), [Zhenao Wei](https://arxiv.org/search/?searchtype=author&query=Zhenao Wei), [Nan Zhuang](https://arxiv.org/search/?searchtype=author&query=Nan Zhuang), [Qi Liu](https://arxiv.org/search/?searchtype=author&query=Qi Liu) 作者:王宇啸、雷昱、梁沃林、薛伟英、魏贞奥、庄楠、刘琦

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider \textbf{what} action is occurring and \textbf{where} it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication. 人通过控制身体与环境建立接触。要在多样的视觉语境中全面理解动作,必须同时考虑正在发生的“什么”动作以及“在哪里”发生。然而,当前的方法通常未能充分捕捉这种二重性,通常无法将动作语义与其在场景中的空间情境联合建模。为弥补这一空白,我们提出了一种新颖的视觉任务,旨在同时预测高层次的动作语义与细粒度的身体部位接触区域。我们提出的框架 PaIR-Net 包括三个关键组成部分:用于识别与接触相关身体部位的接触先验感知模块(CPAM)、用于像素级接触分割的先验引导拼接分割器(PGCS)和负责整合全局交互关系的交互推理模块(IIM)。为推动该任务,我们发布了 PaIR(部位感知交互表征)数据集,该数据集包含 13,979 张图像,涵盖 654 种动作、80 类物体和 17 个身体部位。 实验评估表明 PaIR-Net 显著优于基线方法,消融研究也验证了各个网络组件的有效性。代码和数据集将在发表时公开。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 02:06:33 UTC 发布:2025-08-13 02:06:33 UTC

#86 Implicit Hypergraph Neural Networks: A Stable Framework for Higher-Order Relational Learning with Provable Guarantees #86 隐式超图神经网络:具有可证明保证的高阶关系学习的稳定框架

Authors: [Xiaoyu Li](https://arxiv.org/search/?searchtype=author&query=Xiaoyu Li), [Guangyu Tang](https://arxiv.org/search/?searchtype=author&query=Guangyu Tang), [Jiaojiao Jiang](https://arxiv.org/search/?searchtype=author&query=Jiaojiao Jiang) 作者:李晓宇,唐光宇,蒋娇娇

Many real-world interactions are group-based rather than pairwise such as papers with multiple co-authors and users jointly engaging with items. Hypergraph neural networks have shown great promise at modeling higher-order relations, but their reliance on a fixed number of explicit message-passing layers limits long-range dependency capture and can destabilize training as depth grows. In this work, we introduce Implicit Hypergraph Neural Networks (IHGNN), which bring the implicit equilibrium formulation to hypergraphs: instead of stacking layers, IHGNN computes representations as the solution to a nonlinear fixed-point equation, enabling stable and efficient global propagation across hyperedges without deep architectures. We develop a well-posed training scheme with provable convergence, analyze the oversmoothing conditions and expressivity of the model, and derive a transductive generalization bound on hypergraphs. We further present an implicit-gradient training procedure coupled with a projection-based stabilization strategy. Extensive experiments on citation benchmarks show that IHGNN consistently outperforms strong traditional graph/hypergraph neural network baselines in both accuracy and robustness. Empirically, IHGNN is resilient to random initialization and hyperparameter variation, highlighting its strong generalization and practical value for higher-order relational learning. 许多现实世界的交互是基于群体而非成对的,例如具有多个合著者的论文和用户共同参与的项目。超图神经网络在建模高阶关系方面展示了巨大潜力,但它们依赖固定数量的显式信息传递层,这限制了对远程依赖的捕捉,并且随着深度增加可能导致训练不稳定。在这项工作中,我们引入了隐式超图神经网络(IHGNN),将隐式平衡(equilibrium)公式引入超图:IHGNN 不是堆叠层,而是将表示计算为非线性不动点方程的解,从而无需深层架构即可在超边间实现稳定且高效的全局传播。我们提出了一个具有可证明收敛性的良定训练方案,分析了模型的过度平滑条件与表达能力,并推导了超图上的传导式泛化界。我们进一步提出了一种结合基于投影的稳定策略的隐式梯度训练程序。 在引用基准上的大量实验表明,IHGNN 在准确性和鲁棒性方面始终优于强大的传统图/超图神经网络基线。经验上,IHGNN 对随机初始化和超参数变化具有较强的鲁棒性,突显了其在高阶关系学习中的强泛化能力和实际价值。
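
A minimal sketch of the implicit formulation, under assumed choices (unit-weight clique-expansion propagation, a tanh nonlinearity, and a contractive weight matrix so the fixed point exists and plain iteration converges); the paper's exact operator and training scheme differ:

```python
import numpy as np

def hypergraph_propagation_matrix(H):
    """Normalized clique-expansion operator D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} (unit hyperedge weights)."""
    Dv, De = H.sum(axis=1), H.sum(axis=0)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-8)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-8))
    return Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt

def implicit_equilibrium(H, X, W, gamma=0.9, tol=1e-6, max_iter=200):
    """Solve Z = tanh(gamma * P Z W + X) by fixed-point iteration instead of stacking layers."""
    P = hypergraph_propagation_matrix(H)
    Z = np.zeros_like(X)
    for _ in range(max_iter):
        Z_new = np.tanh(gamma * P @ Z @ W + X)
        if np.linalg.norm(Z_new - Z) < tol:
            break
        Z = Z_new
    return Z

# Toy hypergraph: 4 nodes, 2 hyperedges.
H = np.array([[1, 0], [1, 1], [0, 1], [1, 1]], dtype=float)   # incidence matrix
X = np.random.randn(4, 3)                                     # node features
W = 0.5 * np.eye(3)                                           # small norm keeps the map a contraction
print(implicit_equilibrium(H, X, W).shape)                    # (4, 3) equilibrium representations
```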

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 02:06:29 UTC 发布:2025-08-13 02:06:29 UTC

#87 Domain-Generalization to Improve Learning in Meta-Learning Algorithms #87 领域泛化以改进元学习算法中的学习

Authors: [Usman Anjum](https://arxiv.org/search/?searchtype=author&query=Usman Anjum), [Chris Stockman](https://arxiv.org/search/?searchtype=author&query=Chris Stockman), [Cat Luong](https://arxiv.org/search/?searchtype=author&query=Cat Luong), [Justin Zhan](https://arxiv.org/search/?searchtype=author&query=Justin Zhan) 作者:Usman Anjum、Chris Stockman、Cat Luong、Justin Zhan

This paper introduces Domain Generalization Sharpness-Aware Minimization Model-Agnostic Meta-Learning (DGS-MAML), a novel meta-learning algorithm designed to generalize across tasks with limited training data. DGS-MAML combines gradient matching with sharpness-aware minimization in a bi-level optimization framework to enhance model adaptability and robustness. We support our method with theoretical analysis using PAC-Bayes and convergence guarantees. Experimental results on benchmark datasets show that DGS-MAML outperforms existing approaches in terms of accuracy and generalization. The proposed method is particularly useful for scenarios requiring few-shot learning and quick adaptation, and the source code is publicly available at GitHub. 本文提出了领域泛化的锐度感知最小化模型无关元学习(DGS-MAML),这是一种新颖的元学习算法,旨在在训练数据有限的情况下实现跨任务泛化。DGS-MAML 在双层优化框架中将梯度匹配与锐度感知最小化相结合,以增强模型的适应性和鲁棒性。我们通过 PAC-Bayes 理论分析和收敛性保证来支持该方法。基准数据集上的实验结果表明,DGS-MAML 在准确性和泛化能力方面优于现有方法。所提出的方法对于需要少样本学习和快速适应的场景尤为有用,源代码已在 GitHub 上公开可用。
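
The sharpness-aware component can be sketched on its own (the bi-level MAML outer loop and gradient matching are omitted; the model, data, and `rho` below are placeholders):

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One sharpness-aware minimization step: the update uses the gradient
    taken at the worst-case nearby weights rather than the current weights."""
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None))
    eps = []
    with torch.no_grad():                       # ascend to the perturbed weights
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()             # gradient at the perturbed point
    with torch.no_grad():                       # restore the original weights
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sam_step(model, torch.nn.functional.cross_entropy,
         torch.randn(16, 10), torch.randint(0, 2, (16,)), opt)
```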

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-13 01:30:11 UTC 发布时间:2025-08-13 01:30:11 UTC

#88 RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata #88 RampNet:一种用于从开放政府元数据中引导街景图像人行道坡道检测的两阶段流水线

Authors: [John S. O’Meara](https://arxiv.org/search/?searchtype=author&query=John S. O’Meara), [Jared Hwang](https://arxiv.org/search/?searchtype=author&query=Jared Hwang), [Zeyu Wang](https://arxiv.org/search/?searchtype=author&query=Zeyu Wang), [Michael Saugstad](https://arxiv.org/search/?searchtype=author&query=Michael Saugstad), [Jon E. Froehlich](https://arxiv.org/search/?searchtype=author&query=Jon E. Froehlich) 作者:John S. O’Meara,Jared Hwang,Zeyu Wang,Michael Saugstad,Jon E. Froehlich

Curb ramps are critical for urban accessibility, but robustly detecting them in images remains an open problem due to the lack of large-scale, high-quality datasets. While prior work has attempted to improve data availability with crowdsourced or manually labeled data, these efforts often fall short in either quality or scale. In this paper, we introduce and evaluate a two-stage pipeline called RampNet to scale curb ramp detection datasets and improve model performance. In Stage 1, we generate a dataset of more than 210,000 annotated Google Street View (GSV) panoramas by auto-translating government-provided curb ramp location data to pixel coordinates in panoramic images. In Stage 2, we train a curb ramp detection model (modified ConvNeXt V2) from the generated dataset, achieving state-of-the-art performance. To evaluate both stages of our pipeline, we compare to manually labeled panoramas. Our generated dataset achieves 94.0% precision and 92.5% recall, and our detection model reaches 0.9236 AP – far exceeding prior work. Our work contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model. 人行道坡道对城市无障碍环境至关重要,但由于缺乏大规模高质量的数据集,仍难以在图像中稳健地检测到它们。尽管先前的工作尝试通过众包或人工标注数据来提高数据可用性,但这些尝试在质量或规模上常常不足。本文提出并评估了一个名为 RampNet 的两阶段管道,用以扩展人行道坡道检测数据集并提升模型性能。在第一阶段,我们通过将政府提供的坡道位置信息自动转换为全景图像中的像素坐标,生成了超过 21 万张带注释的 Google 街景(GSV)全景图像数据集。在第二阶段,我们从生成的数据集中训练了一个人行道坡道检测模型(改进的 ConvNeXt V2),并取得了最先进的性能。为评估管道的两个阶段,我们将结果与人工标注的全景图进行了比较。我们生成的数据集达到了 94.0% 的精确率和 92.5% 的召回率,而我们的检测模型达到 0.9236 AP——远超先前的工作。我们的工作贡献了首个大规模高质量的人行道坡道检测数据集、基准和模型。
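
Stage 1's auto-translation boils down to projecting a geographic point into panorama pixel coordinates. The sketch below is a simplified version under assumed values (equirectangular GSV panoramas, known camera heading and height, flat ground); a production pipeline would also handle metadata quirks and visibility checks:

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0

def ramp_to_pixel(cam_lat, cam_lon, cam_heading, ramp_lat, ramp_lon,
                  pano_w=8192, pano_h=4096, cam_height_m=2.5):
    """Project a curb-ramp GPS point into equirectangular panorama coordinates."""
    yaw = (bearing_deg(cam_lat, cam_lon, ramp_lat, ramp_lon) - cam_heading) % 360.0
    px = yaw / 360.0 * pano_w
    # Small-distance ground-range approximation, then the point sits below the horizon.
    dlat = math.radians(ramp_lat - cam_lat)
    dlon = math.radians(ramp_lon - cam_lon) * math.cos(math.radians(cam_lat))
    dist = 6371000.0 * math.hypot(dlat, dlon)
    pitch = -math.degrees(math.atan2(cam_height_m, max(dist, 0.1)))
    py = (0.5 - pitch / 180.0) * pano_h
    return px, py

print(ramp_to_pixel(47.6062, -122.3321, 30.0, 47.60625, -122.33205))
```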

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-13 01:22:48 UTC 发布:2025-08-13 01:22:48 协调世界时 (UTC)

#89 Understanding Dementia Speech Alignment with Diffusion-Based Image Generation #89 理解痴呆症语音与基于扩散的图像生成的对齐

Authors: Mansi, [Anastasios Lepipas](https://arxiv.org/search/?searchtype=author&query=Anastasios Lepipas), [Dominika Woszczyk](https://arxiv.org/search/?searchtype=author&query=Dominika Woszczyk), [Yiying Guan](https://arxiv.org/search/?searchtype=author&query=Yiying Guan), [Soteris Demetriou](https://arxiv.org/search/?searchtype=author&query=Soteris Demetriou) 作者:Mansi、Anastasios Lepipas、Dominika Woszczyk、Yiying Guan、Soteris Demetriou

Text-to-image models generate highly realistic images based on natural language descriptions and millions of users use them to create and share images online. While it is expected that such models can align input text and generated image in the same latent space, little has been done to understand whether this alignment is possible between pathological speech and generated images. In this work, we examine the ability of such models to align dementia-related speech information with the generated images and develop methods to explain this alignment. Surprisingly, we found that dementia detection is possible from generated images alone, achieving 75% accuracy on the ADReSS dataset. We then leverage explainability methods to show which parts of the language contribute to the detection. 文本到图像模型能够根据自然语言描述生成高度逼真的图像,数以百万计的用户使用它们在线创建并分享图像。尽管人们期望此类模型能够在相同的潜在空间中对齐输入文本和生成的图像,但对于这种对齐是否可能在病理性言语与生成图像之间实现的研究甚少。在本工作中,我们考察了此类模型将与痴呆相关的言语信息与生成图像对齐的能力,并开发了用于解释这种对齐的方法。令人惊讶的是,我们发现仅凭生成的图像就能进行痴呆检测,在 ADReSS 数据集上达到了 75%的准确率。随后我们利用可解释性方法展示了语言的哪些部分对该检测有贡献。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-12 23:00:36 UTC 发布:2025-08-12 23:00:36 UTC

#90 X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents #90 X-UniMotion:使用富有表现力、统一且与身份无关的运动潜变量为人体图像赋予动画

Authors: [Guoxian Song](https://arxiv.org/search/?searchtype=author&query=Guoxian Song), [Hongyi Xu](https://arxiv.org/search/?searchtype=author&query=Hongyi Xu), [Xiaochen Zhao](https://arxiv.org/search/?searchtype=author&query=Xiaochen Zhao), [You Xie](https://arxiv.org/search/?searchtype=author&query=You Xie), [Tianpei Gu](https://arxiv.org/search/?searchtype=author&query=Tianpei Gu), [Zenan Li](https://arxiv.org/search/?searchtype=author&query=Zenan Li), [Chenxu Zhang](https://arxiv.org/search/?searchtype=author&query=Chenxu Zhang), [Linjie Luo](https://arxiv.org/search/?searchtype=author&query=Linjie Luo) 作者:宋国贤、许鸿毅、赵晓晨、谢宥、顾天培、李泽南、张晨旭、罗林杰

We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens – one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation. 我们提出了 X-UniMotion,一种统一且富表达力的隐式潜变量表示用于全身人体动作,涵盖面部表情、身体姿态和手部动作。不同于以往依赖显式骨骼姿态和启发式跨身份调整的动作迁移方法,我们的方法直接从单张图像中将多粒度动作编码为一组紧凑的四个可解耦潜在向量——一个用于面部表情、一个用于身体姿态、每只手各一个。这些动作潜变量既高度富有表现力又与身份无关,使得在具有多样身份、姿态和空间配置的主体之间实现高保真、细致的跨身份动作迁移成为可能。为此,我们引入了一个自监督的端到端框架,该框架联合学习动作编码器和潜在表示,并与基于 DiT 的视频生成模型共同训练,训练数据来自大规模、多样化的人体动作数据集。通过二维空间和颜色增强以及在共享姿态下对跨身份主体对进行的合成三维渲染,来强制实现动作与身份的解耦。 此外,我们通过辅助解码器引导运动令牌学习,以促进细粒度、语义对齐且具备深度感知的运动嵌入。大量实验证明,X-UniMotion 优于最先进的方法,生成高度富有表现力的动画,具有更出色的运动保真度和身份保持能力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 22:47:20 UTC 发布:2025-08-12 22:47:20 UTC

#91 What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation? #91 我们可以从皮损分割中的标注者间差异学到什么?

Authors: [Kumar Abhishek](https://arxiv.org/search/?searchtype=author&query=Kumar Abhishek), [Jeremy Kawahara](https://arxiv.org/search/?searchtype=author&query=Jeremy Kawahara), [Ghassan Hamarneh](https://arxiv.org/search/?searchtype=author&query=Ghassan Hamarneh) 作者:Kumar Abhishek、Jeremy Kawahara、Ghassan Hamarneh

Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p<0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a “soft” clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at https://github.com/sfu-mial/skin-IAV. 医学影像分割由于目标边界模糊、标注者偏好、专业程度和使用工具等因素,存在标注者内和标注者间的差异。边界模糊的病变,例如有刺状或浸润性结节,或根据 ABCD 规则具有不规则边界者,特别容易产生分歧,且常与恶性病变相关。在本研究中,我们整理了 IMA++,这是最大的多标注者皮肤病变分割数据集,并在其上就标注者、恶性程度、工具和技能等因素导致的差异进行了深入研究。我们发现使用 Dice 衡量的标注者间一致性(IAA)与皮肤病变的恶性程度之间存在统计学显著关联(p<0.001)。我们进一步展示,IAA 可以直接从皮肤镜图像中准确预测,平均绝对误差为 0.108。最后,我们通过在多任务学习目标中将 IAA 作为一种“软”临床特征来利用这一关联,在多种模型架构以及 IMA++ 与四个公开皮肤镜数据集上,使平衡准确率平均提升了 4.2%。代码可在 https://github.com/sfu-mial/skin-IAV 获取。
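
Inter-annotator agreement of the kind used here can be computed as the mean pairwise Dice over annotators' masks; below is a minimal sketch with synthetic masks (the dataset's own masks and any aggregation details are not reproduced):

```python
import numpy as np
from itertools import combinations

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def inter_annotator_agreement(masks):
    """Mean pairwise Dice across all annotators of one lesion image."""
    pairs = combinations(range(len(masks)), 2)
    return float(np.mean([dice(masks[i], masks[j]) for i, j in pairs]))

# Three hypothetical annotator masks for the same lesion.
rng = np.random.default_rng(0)
base = rng.random((128, 128)) > 0.6
masks = [np.logical_and(base, rng.random(base.shape) > 0.05) for _ in range(3)]
print(f"IAA (mean pairwise Dice): {inter_annotator_agreement(masks):.3f}")
```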

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-08-12 22:37:56 UTC 发布:2025-08-12 22:37:56 UTC

#92 APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification #92 APIO:用于语法错误修正和文本简化的自动提示引导与优化

Authors: [Artem Chernodub](https://arxiv.org/search/?searchtype=author&query=Artem Chernodub), [Aman Saini](https://arxiv.org/search/?searchtype=author&query=Aman Saini), [Yejin Huh](https://arxiv.org/search/?searchtype=author&query=Yejin Huh), [Vivek Kulkarni](https://arxiv.org/search/?searchtype=author&query=Vivek Kulkarni), [Vipul Raheja](https://arxiv.org/search/?searchtype=author&query=Vipul Raheja) 作者:Artem Chernodub、Aman Saini、Yejin Huh、Vivek Kulkarni、Vipul Raheja

Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks to be performed through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, automatic prompt optimization (APO) methods have been developed to refine a seed prompt. Advancing this line of research, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available. 近年来,大型语言模型(LLMs)的进展使得通过简单的基于提示的交互即可执行广泛的自然语言处理(NLP)任务成为可能。因此,已经提出了多种方法来设计提示,以最有效地使 LLMs 执行给定任务(例如,链式思维提示)。在具有明确定义的评估指标以优化模型性能的设置中,已经开发出自动提示优化(APO)方法来改进初始提示。沿着这一路线推进,我们提出了 APIO,一种简单但有效的提示引导与优化方法,适用于语法错误修正(GEC)和文本简化任务,并且不依赖手工指定的种子提示。APIO 在这些任务上为纯基于 LLM 的提示方法取得了新的最先进性能。我们公开提供了我们的数据、代码、提示和输出。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 22:26:32 UTC 发布:2025-08-12 22:26:32 UTC

#93 A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition #93 一种对示者不变的 Conformer 与多尺度融合 Transformer 用于连续手语识别

Authors: [Md Rezwanul Haque](https://arxiv.org/search/?searchtype=author&query=Md Rezwanul Haque), [Md. Milon Islam](https://arxiv.org/search/?searchtype=author&query=Md. Milon Islam), [S M Taslim Uddin Raju](https://arxiv.org/search/?searchtype=author&query=S M Taslim Uddin Raju), [Fakhri Karray](https://arxiv.org/search/?searchtype=author&query=Fakhri Karray) 作者:Md Rezwanul Haque、Md. Milon Islam、S M Taslim Uddin Raju、Fakhri Karray

Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. For overcoming these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the performance of these models. The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah. 连续手语识别(CSLR)面临多重挑战,包括显著的手语者间差异以及对新句子结构的泛化能力差。传统方法往往无法高效地处理这些问题。为克服这些限制,我们提出了一个双架构框架。针对手语者独立性(SI)挑战,我们提出了一个手语者不变的 Conformer,该模型将卷积与多头自注意力相结合,从基于姿态的骨架关键点中学习稳健且与手语者无关的表示。针对未见句子(US)任务,我们设计了一个多尺度融合 Transformer,配备新颖的双路径时序编码器,既捕捉细粒度的姿态动态,又增强模型理解新语法组合的能力。在具有挑战性的 Isharah-1000 数据集上的实验为两项 CSLR 基准建立了新的标准。所提 Conformer 架构在 SI 挑战上实现了 13.07%的词错误率(WER),相比最先进方法降低了 13.53%。在 US 任务上,Transformer 模型取得了 47.78%的 WER,优于以往工作。在 SignEval 2025 CSLR 挑战赛中,我们团队在 US 任务中获得第 2 名,在 SI 任务中获得第 4 名,展示了这些模型的性能。研究结果验证了我们的关键假设:为 CSLR 的特定挑战开发任务专用网络会带来显著的性能提升,并为后续研究建立了新的基线。源代码可在以下地址获得:https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval, Machine Learning

Publish: 2025-08-12 21:59:53 UTC 发布时间:2025-08-12 21:59:53 协调世界时 (UTC)

#94 FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition #94 FusionEnsemble-Net:一种基于注意力的时空网络集成用于多模态手语识别

Authors: [Md. Milon Islam](https://arxiv.org/search/?searchtype=author&query=Md. Milon Islam), [Md Rezwanul Haque](https://arxiv.org/search/?searchtype=author&query=Md Rezwanul Haque), [S M Taslim Uddin Raju](https://arxiv.org/search/?searchtype=author&query=S M Taslim Uddin Raju), [Fakhri Karray](https://arxiv.org/search/?searchtype=author&query=Fakhri Karray) 作者:Md. Milon Islam、Md Rezwanul Haque、S M Taslim Uddin Raju、Fakhri Karray

Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model’s robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition. 在医疗交流中对手语的准确识别是一个重大挑战,需要能够准确解读复杂多模态手势的框架。为此,我们提出了 FusionEnsemble-Net,一种新颖的基于注意力的时空网络集成方法,能够动态融合视觉与运动数据以提升识别精度。所提出的方法通过四个不同的时空网络同步处理 RGB 视频和距离多普勒图(range Doppler map)雷达模态。对于每个网络,两种模态的特征在被送入分类器集成之前,先通过基于注意力的融合模块持续融合。最后,将这四个不同融合通道的输出在集成分类头中合并,从而增强模型的鲁棒性。实验证明,FusionEnsemble-Net 在大规模的意大利手语 MultiMeDaLIS 数据集上以 99.44%的测试准确率优于最先进的方法。 我们的研究结果表明,由注意力融合统一的多样化时空网络集成,为复杂的多模态独立手势识别任务提供了一个稳健且准确的框架。源代码可在以下地址获取: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition
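
The attention-based fusion step can be pictured as cross-attention in which RGB tokens query the radar tokens; the block below is an assumed minimal form, not the paper's exact module or hyperparameters:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse RGB and radar features with cross-attention (a sketch of the idea)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, radar_feats):
        # RGB tokens attend over radar tokens, then residual connection + norm.
        fused, _ = self.attn(query=rgb_feats, key=radar_feats, value=radar_feats)
        return self.norm(rgb_feats + fused)

fusion = AttentionFusion()
rgb = torch.randn(2, 16, 256)       # (batch, temporal tokens, dim)
radar = torch.randn(2, 16, 256)
print(fusion(rgb, radar).shape)     # torch.Size([2, 16, 256])
```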

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-08-12 21:44:23 UTC 发布时间:2025-08-12 21:44:23 UTC

#95 The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains #95 人类—人工智能混合德尔菲模型:用于复杂领域中背景丰富的专家共识的结构化框架

Authors: [Cathy Speed](https://arxiv.org/search/?searchtype=author&query=Cathy Speed), [Ahmed A. Metwally](https://arxiv.org/search/?searchtype=author&query=Ahmed A. Metwally) 作者:Cathy Speed、Ahmed A. Metwally

Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved >90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale. 在证据复杂、相互矛盾或不足以直接制定处方的领域,专家共识发挥着关键作用。传统方法,如德尔菲研究、共识会议和系统性指南综合,提供了结构性流程,但也面临限制,包括对专家组的高负担、对解释性的过度简化以及对条件性细微差别的压制。信息超载、证据基础的分裂以及日益依赖缺乏专家筛选的公开来源,正在加剧这些挑战。本研究引入并评估了一种人机混合德尔菲(HAH-Delphi)框架,旨在通过整合生成式人工智能模型(Gemini 2.5 Pro)、由资深人类专家组成的小型小组和结构化的引导,来增强专家共识的形成。HAH-Delphi 在三个阶段进行了测试:回顾性复制、前瞻性比较,以及在两个实际领域(耐力训练和阻力与混合有氧/力量训练)中的应用部署。 在第一阶段,人工智能复制了 95%的已发表专家共识结论;在第二阶段,它与高级人类专家在方向性上达成了 95%的一致,尽管缺乏经验性和务实性细微差别。在第三阶段,由六名高级专家组成的小型小组实现了超过 90%的共识覆盖率,并在最后一位参与者之前就达到了主题饱和。人工智能提供了一致且以文献为依据的支架,支持分歧解决并加速了饱和的达成。HAH-Delphi 框架提供了一种灵活且可扩展的方法,用于生成高质量、具情境敏感性的共识。其在健康、教练和绩效科学领域的成功应用证实了其方法学的稳健性,并支持将其作为大规模生成有条件的、个性化指导和已发表共识框架的基础。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 21:24:19 UTC 发布:2025-08-12 21:24:19 协调世界时 (UTC)

#96 Collective dynamics of strategic classification #96 战略分类的集体动力学

Authors: [Marta C. Couto](https://arxiv.org/search/?searchtype=author&query=Marta C. Couto), [Flavia Barsotti](https://arxiv.org/search/?searchtype=author&query=Flavia Barsotti), [Fernando P. Santos](https://arxiv.org/search/?searchtype=author&query=Fernando P. Santos) 作者:Marta C. Couto,Flavia Barsotti,Fernando P. Santos

Classification algorithms based on Artificial Intelligence (AI) are nowadays applied in high-stakes decisions in finance, healthcare, criminal justice, or education. Individuals can strategically adapt to the information gathered about classifiers, which in turn may require algorithms to be re-trained. Which collective dynamics will result from users’ adaptation and algorithms’ retraining? We apply evolutionary game theory to address this question. Our framework provides a mathematically rigorous way of treating the problem of feedback loops between collectives of users and institutions, allowing to test interventions to mitigate the adverse effects of strategic adaptation. As a case study, we consider institutions deploying algorithms for credit lending. We consider several scenarios, each representing different interaction paradigms. When algorithms are not robust against strategic manipulation, we are able to capture previous challenges discussed in the strategic classification literature, whereby users either pay excessive costs to meet the institutions’ expectations (leading to high social costs) or game the algorithm (e.g., provide fake information). From this baseline setting, we test the role of improving gaming detection and providing algorithmic recourse. We show that increased detection capabilities reduce social costs and could lead to users’ improvement; when perfect classifiers are not feasible (likely to occur in practice), algorithmic recourse can steer the dynamics towards high users’ improvement rates. The speed at which the institutions re-adapt to the user’s population plays a role in the final outcome. Finally, we explore a scenario where strict institutions provide actionable recourse to their unsuccessful users and observe cycling dynamics so far unnoticed in the literature. 基于人工智能(AI)的分类算法如今被应用于金融、医疗、刑事司法或教育等重大决策领域。个体可能会对有关分类器的信息进行策略性适应,这反过来可能需要对算法进行重新训练。用户的适应与算法的再训练会产生何种集体动力学?我们运用进化博弈论来研究这个问题。我们的框架为处理用户集体与机构之间反馈循环问题提供了数学上严谨的方法,允许测试减轻策略性适应不良影响的干预措施。作为案例研究,我们考察了机构部署用于信贷放贷的算法。我们考虑了若干情景,每种情景代表不同的互动范式。当算法不具备对策略性操纵的鲁棒性时,我们能够捕捉到战略分类文献中此前讨论的挑战,即用户要么为满足机构期望而支付过高成本(导致社会成本高昂),要么对算法进行操纵(例如,提供虚假信息)。 在这一基线设置下,我们测试了提高作弊检测和提供算法补救的作用。我们表明,增强的检测能力能够降低社会成本并可能促使用户改进;当完美分类器不可行(在实践中很可能发生)时,算法补救可以将动态引导向高用户改进率。机构重新适应用户群体的速度在最终结果中起着作用。最后,我们探讨了一种情形:严格的机构向未成功的用户提供可行的补救,并观察到文献中迄今未曾注意到的周期性动态。
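
The user-side dynamics in such models are typically written as replicator equations. A toy two-strategy sketch (payoff numbers are invented for illustration, not taken from the paper) shows how a "game the classifier" strategy can come to dominate the population:

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One Euler step of the replicator dynamics: x_i' = x_i * (f_i - mean fitness)."""
    f = payoff @ x                      # fitness of each strategy against the current mix
    return x + dt * x * (f - x @ f)

# Toy user population: strategy 0 = genuinely improve, strategy 1 = game the classifier.
payoff = np.array([[1.0, 0.2],
                   [1.4, 0.1]])         # illustrative payoffs only
x = np.array([0.5, 0.5])
for _ in range(2000):
    x = replicator_step(x, payoff)
print(f"long-run strategy shares: improve={x[0]:.2f}, game={x[1]:.2f}")
```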

Subjects: Computer Science and Game Theory, Artificial Intelligence, Theoretical Economics 主题:计算机科学与博弈论、人工智能、理论经济学

Publish: 2025-08-12 20:57:17 UTC 发布日期:2025-08-12 20:57:17 UTC

#97 RicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs #97 RicciFlowRec:一种在金融图上使用里奇曲率的几何根因推荐器

Authors: [Zhongtian Sun](https://arxiv.org/search/?searchtype=author&query=Zhongtian Sun), [Anoushka Harit](https://arxiv.org/search/?searchtype=author&query=Anoushka Harit) 作者:孙中天,Anoushka Harit

We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P~500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with plans for portfolio optimization and return forecasting. To our knowledge, RicciFlowRec is the first recommender to apply geometric flow-based reasoning in financial decision support. 我们提出了 RicciFlowRec,一种几何推荐框架,通过离散里奇曲率和里奇流在动态金融图上进行根源归因。通过对股票、宏观经济指标和新闻之间不断演化的相互作用建模,我们使用离散里奇曲率量化局部压力,并通过里奇流追踪冲击传播。曲率梯度揭示因果子结构,为结构性风险感知的排序函数提供信息。在以 FinBERT 为基础的情感分析下对 S&P 500 数据的初步结果显示,在合成扰动下模型的稳健性和可解释性有所提升。该项正在进行的工作支持基于曲率的归因和早期风险感知排序,未来计划用于组合优化和收益预测。据我们所知,RicciFlowRec 是首个在金融决策支持中应用几何流推理的推荐系统。
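
Discrete curvature on a graph edge can be illustrated with the simple combinatorial Forman form below (the paper may use a different discretization such as Ollivier-Ricci, and real edges would carry weights from correlations or news links):

```python
import networkx as nx

def forman_curvature(G, u, v):
    """Forman-Ricci curvature of edge (u, v) on an unweighted graph,
    in its simplest combinatorial form: F(u, v) = 4 - deg(u) - deg(v)."""
    return 4 - G.degree(u) - G.degree(v)

# Toy financial graph: tickers plus a macro index (edges are illustrative only).
G = nx.Graph([("AAPL", "MSFT"), ("AAPL", "SPX"), ("MSFT", "SPX"),
              ("XOM", "SPX"), ("XOM", "WTI")])
for u, v in G.edges:
    print(f"{u:>4} -- {v:<4} curvature {forman_curvature(G, u, v):+d}")
```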

Subjects: Machine Learning, Artificial Intelligence, Information Retrieval 主题:机器学习、人工智能、信息检索

Publish: 2025-08-12 20:45:02 UTC 发布时间:2025-08-12 20:45:02 UTC

#98 Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization #98 突触修剪:对深度学习正则化的生物学启发

Authors: [Gideon Vos](https://arxiv.org/search/?searchtype=author&query=Gideon Vos), [Liza van Eijk](https://arxiv.org/search/?searchtype=author&query=Liza van Eijk), [Zoltan Sarnyai](https://arxiv.org/search/?searchtype=author&query=Zoltan Sarnyai), [Mostafa Rahimi Azghadi](https://arxiv.org/search/?searchtype=author&query=Mostafa Rahimi Azghadi) 作者:Gideon Vos、Liza van Eijk、Zoltan Sarnyai、Mostafa Rahimi Azghadi

Synaptic pruning in biological brains removes weak connections to improve efficiency. In contrast, dropout regularization in artificial neural networks randomly deactivates neurons without considering activity-dependent pruning. We propose a magnitude-based synaptic pruning method that better reflects biology by progressively removing low-importance connections during training. Integrated directly into the training loop as a dropout replacement, our approach computes weight importance from absolute magnitudes across layers and applies a cubic schedule to gradually increase global sparsity. At fixed intervals, pruning masks permanently remove low-importance weights while maintaining gradient flow for active ones, eliminating the need for separate pruning and fine-tuning phases. Experiments on multiple time series forecasting models including RNN, LSTM, and Patch Time Series Transformer across four datasets show consistent gains. Our method ranked best overall, with statistically significant improvements confirmed by Friedman tests (p < 0.01). In financial forecasting, it reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models. This dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures. Its strong performance, especially in financial time series forecasting, highlights its potential as a practical alternative to conventional dropout techniques. 生物大脑中的突触修剪会移除弱连接以提高效率。与此相反,人工神经网络中的 dropout 正则化随机停用神经元,而不考虑基于活动的修剪。我们提出了一种基于权重幅值的突触修剪方法,更加贴近生物学,通过在训练过程中逐步移除低重要性的连接来实现。该方法直接集成到训练循环中,作为 dropout 的替代;我们通过各层权重绝对值计算权重重要性,并采用三次调度来逐渐增加全局稀疏性。在固定间隔处,修剪掩码永久移除低重要性权重,同时为活跃权重保持梯度流动,免去了单独的修剪和微调阶段。在包括 RNN、LSTM 和 Patch Time Series Transformer 在内的多种时间序列预测模型以及四个数据集上的实验显示出一致的性能提升。 我们的方法总体排名最佳,Friedman 检验(p < 0.01)证实了统计学上的显著改进。在金融预测中,与没有或使用标准 dropout 的模型相比,它将平均绝对误差降低了最多 20%,在部分 Transformer 模型中最高可达 52%。这种动态剪枝机制通过将权重消除与渐进稀疏化相结合,推进了正则化方法,并且可以轻松集成到各种架构中。其出色的表现,尤其是在金融时间序列预测中的表现,突显了其作为传统 dropout 技术的实用替代方案的潜力。
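
The core mechanism, a cubic sparsity schedule plus global magnitude thresholding at fixed intervals, can be sketched as follows; the actual method additionally keeps persistent masks so pruned weights never return, which this toy version omits:

```python
import torch

def cubic_sparsity(step, total_steps, final_sparsity=0.8):
    """Raise the global sparsity target from 0 to final_sparsity along a cubic schedule."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

@torch.no_grad()
def apply_magnitude_mask(model, sparsity):
    """Zero the globally smallest-magnitude weights across all weight matrices."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([p.abs().flatten() for p in weights])
    k = int(sparsity * all_mags.numel())
    if k == 0:
        return
    threshold = all_mags.kthvalue(k).values
    for p in weights:
        p.mul_((p.abs() > threshold).float())   # pruned entries become exactly zero

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
total = 1000
for step in range(0, total + 1, 100):           # prune at fixed intervals during training
    apply_magnitude_mask(model, cubic_sparsity(step, total))
```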

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-12 20:36:00 UTC 发布:2025-08-12 20:36:00 UTC

#99 SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning #99 SegDAC:面向视觉强化学习的分割驱动 Actor-Critic

Authors: [Alexandre Brown](https://arxiv.org/search/?searchtype=author&query=Alexandre Brown), [Glen Berseth](https://arxiv.org/search/?searchtype=author&query=Glen Berseth) 作者:Alexandre Brown, Glen Berseth

Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks. 视觉强化学习(RL)具有挑战性,因为需要从高维输入和噪声奖励中同时学习感知和动作。尽管存在大型感知模型,但如何将它们有效地整合到用于视觉泛化和提高样本效率的 RL 中仍不明确。我们提出了 SegDAC,一种由分割驱动的演员-评论家方法。SegDAC 使用 Segment Anything(SAM)进行以物体为中心的分解,并通过文本提示使用 YOLO-World 在语义上对分割区域进行落地。它包含一种新颖的基于变换器的架构,支持每个时间步可变数量的分割区域,并通过在线 RL 有效地学习应关注哪些分割区域,且无需使用人工标签。通过在 Maniskill3 上对一个具有挑战性的视觉泛化基准进行评估,该基准涵盖在强烈视觉扰动下的多种操作任务,我们证明了 SegDAC 实现了显著更好的视觉泛化,在最困难设置上将先前性能提高了一倍,并在所有评估任务中在样本效率上匹配或超过了先前方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics 主题:计算机视觉与模式识别、人工智能、机器人学

Publish: 2025-08-12 20:16:54 UTC 发布时间:2025-08-12 20:16:54 UTC

#100 TEN: Table Explicitization, Neurosymbolically #100 十:表格显式化,神经符号方法

Authors: [Nikita Mehrotra](https://arxiv.org/search/?searchtype=author&query=Nikita Mehrotra), [Aayush Kumar](https://arxiv.org/search/?searchtype=author&query=Aayush Kumar), [Sumit Gulwani](https://arxiv.org/search/?searchtype=author&query=Sumit Gulwani), [Arjun Radhakrishna](https://arxiv.org/search/?searchtype=author&query=Arjun Radhakrishna), [Ashish Tiwari](https://arxiv.org/search/?searchtype=author&query=Ashish Tiwari) 作者:Nikita Mehrotra、Aayush Kumar、Sumit Gulwani、Arjun Radhakrishna、Ashish Tiwari

We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker to evaluate not only the well-formedness of that table, but also detect cases of hallucinations or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN’s tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases. 我们提出了一种神经符号方法 TEN,用于从半结构化的输入文本中提取表格数据。当文本输入没有始终使用特殊分隔符来划分列和行时,这项任务尤为具有挑战性。纯神经方法由于会出现幻觉并且无法强制执行硬约束,表现不佳。TEN 在大型语言模型(LLM)上使用结构性分解提示——一种专门的链式思维提示方法——来生成初始表格,随后使用符号检查器来评估该表格不仅在形式上的正确性,还检测幻觉或遗忘的情况。符号检查器的输出由一个批评型 LLM 处理,以生成修复表格的指导,并将其以自我调试循环的形式呈现给原始 LLM。我们的大量实验表明,TEN 在多种数据集和评测指标上显著优于纯神经基线,取得了更高的精确匹配准确率并大幅降低了幻觉率。 一项由 21 名参与者进行的用户研究进一步证实,TEN 生成的表格被评为显著更准确(平均得分:5.0 对 4.3;p = 0.021),并且在便于核对和修正方面始终更受青睐,参与者在超过 60%的情况下更偏好我们的方法。
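
The generate-check-critique-repair loop can be sketched as below. The prompt builders and `call_llm` are hypothetical stand-ins (TEN's actual Structural Decomposition prompts and symbolic checker are richer); only `symbolic_check` is exercised directly here:

```python
def structural_decomposition_prompt(source_text, guidance=""):
    """Hypothetical prompt builder standing in for Structural Decomposition prompting."""
    return ("Decompose the input into records, then fields, and emit the table as a list of rows.\n"
            + (f"Reviewer guidance: {guidance}\n" if guidance else "")
            + f"Input:\n{source_text}")

def critique_prompt(source_text, table, issues):
    """Hypothetical prompt for the critique-LLM."""
    return ("The extracted table has problems: " + "; ".join(issues)
            + f"\nInput:\n{source_text}\nTable:\n{table}\nDescribe how to fix the table.")

def symbolic_check(table, source_text):
    """Hard, cheap checks: rectangular shape and no cells absent from the input (hallucinations)."""
    issues = []
    widths = {len(row) for row in table}
    if len(widths) > 1:
        issues.append(f"ragged rows with widths {sorted(widths)}")
    for r, row in enumerate(table):
        issues += [f"row {r}: cell '{c}' not found in input" for c in row if c and c not in source_text]
    return issues

def extract_table(source_text, call_llm, max_rounds=3):
    """Self-debug loop: LLM draft -> symbolic check -> critique -> regenerate."""
    table = call_llm(structural_decomposition_prompt(source_text))
    for _ in range(max_rounds):
        issues = symbolic_check(table, source_text)
        if not issues:
            break
        guidance = call_llm(critique_prompt(source_text, table, issues))
        table = call_llm(structural_decomposition_prompt(source_text, guidance=guidance))
    return table

# The checker alone, on a deliberately broken draft:
print(symbolic_check([["Alice", "32"], ["Bob"]], "Alice is 32. Bob is 41."))
```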

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 20:16:41 UTC 发布:2025-08-12 20:16:41 UTC

#101 Leveraging Large Language Models for Rare Disease Named Entity Recognition #101 利用大型语言模型进行罕见病命名实体识别

Authors: [Nan Miles Xi](https://arxiv.org/search/?searchtype=author&query=Nan Miles Xi), [Yu Deng](https://arxiv.org/search/?searchtype=author&query=Yu Deng), [Lin Wang](https://arxiv.org/search/?searchtype=author&query=Lin Wang) 作者:Nan Miles Xi、Yu Deng、Lin Wang

Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce. 在罕见病领域的命名实体识别(NER)面临独特挑战,原因在于标注数据有限、实体类型之间语义模糊以及长尾分布。在本研究中,我们在低资源设置下评估了 GPT-4o 在罕见病 NER 任务中的能力,使用了一系列基于提示的策略,包括零样本提示、少样本上下文学习、检索增强生成(RAG)以及任务级微调。我们设计了一个结构化提示框架,用于编码领域特定知识和四种实体类型的消歧规则。我们进一步引入了两种语义引导的少样本示例选择方法,以在减少标注工作量的同时提升上下文内性能。在 RareDis 语料库上的实验表明,GPT-4o 相较于 BioClinicalBERT 达到或优于其性能,且通过任务级微调获得了新的最先进(SOTA)结果。性价比分析显示,少样本提示在低 token 预算下能带来高回报,而 RAG 带来的额外收益有限。 错误分类法突出了诸如边界漂移和类型混淆等常见故障模式,为事后处理和混合精炼提供了机会。我们的结果表明,经过提示优化的 LLMs 可以作为传统监督模型在生物医学命名实体识别中的有效且可扩展的替代方案,尤其是在标注数据稀缺的罕见病应用中。
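
A structured zero-shot prompt of the kind described, with entity definitions and disambiguation rules baked in, could look like the sketch below; the entity guide and wording are hypothetical and do not reproduce the paper's prompt or the RareDis tag set verbatim:

```python
# Hypothetical structured prompt for rare-disease NER (illustrative entity guide).
ENTITY_GUIDE = {
    "DISEASE": "the rare disease itself (e.g. 'Marfan syndrome')",
    "SYMPTOM": "a manifestation reported by the patient",
    "SIGN": "a clinician-observed or measured finding; prefer SIGN over SYMPTOM when measured",
    "ANAPHOR": "a pronoun or phrase referring back to a previously named disease",
}

def build_ner_prompt(text, examples=()):
    rules = "\n".join(f"- {k}: {v}" for k, v in ENTITY_GUIDE.items())
    shots = "\n\n".join(f"Text: {t}\nEntities: {e}" for t, e in examples)
    return (
        "Extract named entities from the clinical text. Entity types:\n"
        f"{rules}\n"
        "Return a JSON list of {\"span\": ..., \"type\": ...} objects.\n\n"
        + (shots + "\n\n" if shots else "")
        + f"Text: {text}\nEntities:"
    )

print(build_ner_prompt("The patient presented with ectopia lentis, suggestive of Marfan syndrome."))
```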

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-08-12 20:16:31 UTC 发布:2025-08-12 20:16:31 UTC

#102 Exact Verification of Graph Neural Networks with Incremental Constraint Solving #102 使用增量约束求解对图神经网络进行精确验证

Authors: [Minghao Liu](https://arxiv.org/search/?searchtype=author&query=Minghao Liu), [Chia-Hsuan Lu](https://arxiv.org/search/?searchtype=author&query=Chia-Hsuan Lu), [Marta Kwiatkowska](https://arxiv.org/search/?searchtype=author&query=Marta Kwiatkowska) 作者:Minghao Liu、Chia-Hsuan Lu、Marta Kwiatkowska

Graph neural networks (GNNs) are increasingly employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is still lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Focusing on node classification tasks, our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile solver for message-passing neural networks, which supports three aggregation functions, sum, max and mean, with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on two standard benchmarks (Cora and CiteSeer) and two real-world fraud datasets (Amazon and Yelp) demonstrates its usability and effectiveness, as well as superior performance compared to existing {exact verification} tools on sum-aggregated node classification tasks. 图神经网络(GNNs)日益被用于高风险场景,例如欺诈检测或医疗保健,但易受对抗性攻击。已有多种技术被提出以提供对抗鲁棒性保证,但对消息传递 GNN 中常用聚合函数的支持仍然不足。在本文中,我们开发了一种用于 GNN 的精确(健全且完备)验证方法,以计算针对属性和结构扰动(包括在预算约束下的边的添加或删除)的保证。聚焦于节点分类任务,我们的方法采用约束求解与界收紧,并通过反复求解一系列松弛的约束满足问题,同时依赖求解器的增量求解能力以提高效率。我们实现了 GNNev,一个通用的消息传递神经网络求解器,支持三种聚合函数:sum、max 和 mean,其中后两种在此首次被考虑。 在两个标准基准(Cora 和 CiteSeer)以及两个真实世界的欺诈数据集(Amazon 和 Yelp)上对 GNNev 进行的大量实验评估证明了其可用性和有效性,并且在基于求和聚合的节点分类任务上,相较于现有的「精确验证」工具表现更优。

Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security 主题:机器学习、人工智能、密码学与安全

Publish: 2025-08-12 20:10:31 UTC 发布:2025-08-12 20:10:31 UTC

#103 TPTP World Infrastructure for Non-classical Logics #103 TPTP 非经典逻辑世界基础设施

Authors: [Alexander Steen](https://arxiv.org/search/?searchtype=author&query=Alexander Steen), [Geoff Sutcliffe](https://arxiv.org/search/?searchtype=author&query=Geoff Sutcliffe) 作者:Alexander Steen,Geoff Sutcliffe

The TPTP World is the well established infrastructure that supports research, development, and deployment of Automated Theorem Proving (ATP) systems. The TPTP World supports a range of classical logics, and since release v9.0.0 has supported non-classical logics. This paper provides a self-contained comprehensive overview of the TPTP World infrastructure for ATP in non-classical logics: the non-classical language extension, problems and solutions, and tool support. A detailed description of use of the infrastructure for quantified normal multi-modal logic is given. TPTP 世界是支持自动定理证明(ATP)系统研究、开发和部署的成熟基础设施。TPTP 世界支持一系列经典逻辑,并自 v9.0.0 版本起支持非经典逻辑。本文提供了一个自包含的、全面的关于用于非经典逻辑 ATP 的 TPTP 世界基础设施的概述:非经典语言扩展、问题与解决方案以及工具支持。文中详述了该基础设施在带量化的标准多模态逻辑中的具体使用。

Subjects: Logic in Computer Science, Artificial Intelligence 主题:计算机科学中的逻辑、人工智能

Publish: 2025-08-12 20:05:52 UTC 发布时间:2025-08-12 20:05:52 UTC

#104 ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning #104 ParallelSearch:使用强化学习训练你的 LLMs 将查询分解并并行搜索子查询

Authors: [Shu Zhao](https://arxiv.org/search/?searchtype=author&query=Shu Zhao), [Tan Yu](https://arxiv.org/search/?searchtype=author&query=Tan Yu), [Anbang Xu](https://arxiv.org/search/?searchtype=author&query=Anbang Xu), [Japinder Singh](https://arxiv.org/search/?searchtype=author&query=Japinder Singh), [Aaditya Shukla](https://arxiv.org/search/?searchtype=author&query=Aaditya Shukla), [Rama Akkiraju](https://arxiv.org/search/?searchtype=author&query=Rama Akkiraju) 作者:Shu Zhao、Tan Yu、Anbang Xu、Japinder Singh、Aaditya Shukla、Rama Akkiraju

Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches. 像 Search-R1 这样的推理增强型搜索代理,通过具有可验证奖励的强化学习(RLVR)训练,在从外部知识源进行多步信息检索方面展示出非凡的能力。这些代理通过动态收集相关事实来解决其参数化记忆的局限,从而应对复杂的推理任务。然而,现有方法存在一个根本的架构限制:它们严格按顺序处理搜索查询,即便是在处理本质上可并行且逻辑独立的比较时也是如此。这一顺序瓶颈显著限制了计算效率,尤其是对于需要多实体比较的查询。为了解决这一关键限制,我们提出了 ParallelSearch,一种新颖的强化学习框架,使大型语言模型(LLMs)能够识别可并行化的查询结构并并发执行多个搜索操作。 我们的方法引入了专门的奖励函数,通过联合考虑正确性、查询分解质量和并行执行收益,激励识别独立的查询组件同时保持答案准确性。全面的实验表明,在七个问答基准上,ParallelSearch 比最先进的基线平均提升了 2.9% 的性能。值得注意的是,在可并行的问题上,我们的方法实现了 12.7% 的性能提升,同时相比于顺序方法仅需 69.6% 的 LLM 调用次数。
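
The reward described above can be pictured as a weighted combination of correctness, decomposition quality, and saved LLM calls; the terms and weights below are assumptions for illustration, not the paper's exact formulation:

```python
def parallel_search_reward(answer_correct, sub_queries, independent_pairs,
                           llm_calls, sequential_calls,
                           w_ans=1.0, w_decomp=0.3, w_par=0.2):
    """Illustrative composite reward in the spirit of ParallelSearch."""
    r_answer = 1.0 if answer_correct else 0.0                 # correctness dominates
    n = len(sub_queries)
    total_pairs = n * (n - 1) // 2
    r_decomp = independent_pairs / total_pairs if total_pairs else 0.0   # decomposition quality
    r_parallel = max(0.0, 1.0 - llm_calls / sequential_calls)            # saved LLM calls
    return w_ans * r_answer + w_decomp * r_decomp + w_par * r_parallel

print(parallel_search_reward(True, ["height of A", "height of B"], independent_pairs=1,
                             llm_calls=2, sequential_calls=3))
```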

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题:计算与语言、人工智能、信息检索

Publish: 2025-08-12 19:38:21 UTC 发布:2025-08-12 19:38:21 协调世界时(UTC)

#105 Decentralized Weather Forecasting via Distributed Machine Learning and Blockchain-Based Model Validation #105 去中心化天气预报:通过分布式机器学习与基于区块链的模型验证

Authors: [Rilwan Umar](https://arxiv.org/search/?searchtype=author&query=Rilwan Umar), [Aydin Abadi](https://arxiv.org/search/?searchtype=author&query=Aydin Abadi), [Basil Aldali](https://arxiv.org/search/?searchtype=author&query=Basil Aldali), [Benito Vincent](https://arxiv.org/search/?searchtype=author&query=Benito Vincent), [Elliot A. J. Hurley](https://arxiv.org/search/?searchtype=author&query=Elliot A. J. Hurley), [Hotoon Aljazaeri](https://arxiv.org/search/?searchtype=author&query=Hotoon Aljazaeri), [Jamie Hedley-Cook](https://arxiv.org/search/?searchtype=author&query=Jamie Hedley-Cook), [Jamie-Lee Bell](https://arxiv.org/search/?searchtype=author&query=Jamie-Lee Bell), [Lambert Uwuigbusun](https://arxiv.org/search/?searchtype=author&query=Lambert Uwuigbusun), [Mujeeb Ahmed](https://arxiv.org/search/?searchtype=author&query=Mujeeb Ahmed), [Shishir Nagaraja](https://arxiv.org/search/?searchtype=author&query=Shishir Nagaraja), [Suleiman Sabo](https://arxiv.org/search/?searchtype=author&query=Suleiman Sabo), [Weaam Alrbeiqi](https://arxiv.org/search/?searchtype=author&query=Weaam Alrbeiqi) 作者:Rilwan Umar,Aydin Abadi,Basil Aldali,Benito Vincent,Elliot A. J. Hurley,Hotoon Aljazaeri,Jamie Hedley-Cook,Jamie-Lee Bell,Lambert Uwuigbusun,Mujeeb Ahmed,Shishir Nagaraja,Suleiman Sabo,Weaam Alrbeiqi

Weather forecasting plays a vital role in disaster preparedness, agriculture, and resource management, yet current centralized forecasting systems are increasingly strained by security vulnerabilities, limited scalability, and susceptibility to single points of failure. To address these challenges, we propose a decentralized weather forecasting framework that integrates Federated Learning (FL) with blockchain technology. FL enables collaborative model training without exposing sensitive local data; this approach enhances privacy and reduces data transfer overhead. Meanwhile, the Ethereum blockchain ensures transparent and dependable verification of model updates. To further enhance the system’s security, we introduce a reputation-based voting mechanism that assesses the trustworthiness of submitted models while utilizing the Interplanetary File System (IPFS) for efficient off-chain storage. Experimental results demonstrate that our approach not only improves forecasting accuracy but also enhances system resilience and scalability, making it a viable candidate for deployment in real-world, security-critical environments. 天气预报在灾害防备、农业和资源管理中起着至关重要的作用,然而当前的集中式预报系统正日益受到安全漏洞、扩展性限制以及单点故障易受影响的困扰。为了解决这些挑战,我们提出了一个将联邦学习(FL)与区块链技术相结合的去中心化天气预报框架。联邦学习使得在不暴露敏感本地数据的情况下进行协同模型训练成为可能;这种方法增强了隐私性并减少了数据传输开销。与此同时,以太坊区块链确保了模型更新的透明且可靠的验证。为进一步提升系统安全性,我们引入了一种基于信誉的投票机制来评估提交模型的可信度,同时利用星际文件系统(IPFS)进行高效的链下存储。实验结果表明,我们的方法不仅提高了预报精度,还增强了系统的弹性和可扩展性,使其成为可在真实、对安全要求高的环境中部署的可行候选方案。
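
A minimal sketch of the training-plus-verification split: client updates are aggregated with FedAvg, and only a digest of the resulting model would be recorded on-chain while the weights themselves live off-chain (e.g. on IPFS). The flattened-weight representation and function names are simplifications:

```python
import hashlib
import numpy as np

def fed_avg(client_updates, client_sizes):
    """Weighted average of client model parameters (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_updates, client_sizes))

def model_fingerprint(weights):
    """Digest of the aggregated model; the digest, not the weights, would go on-chain."""
    return hashlib.sha256(np.ascontiguousarray(weights).tobytes()).hexdigest()

clients = [np.random.randn(100) for _ in range(3)]        # toy flattened local updates
sizes = [1200, 800, 500]                                   # local dataset sizes
global_model = fed_avg(clients, sizes)
print(model_fingerprint(global_model)[:16], "...")
```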

Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security 主题:机器学习、人工智能、密码学与安全

Publish: 2025-08-12 19:25:34 UTC 发布时间:2025-08-12 19:25:34 UTC

#106 Biased AI improves human decision-making but reduces trust #106 带偏见的人工智能改进了人类决策但降低了信任

Authors: [Shiyang Lai](https://arxiv.org/search/?searchtype=author&query=Shiyang Lai), [Junsol Kim](https://arxiv.org/search/?searchtype=author&query=Junsol Kim), [Nadav Kunievsky](https://arxiv.org/search/?searchtype=author&query=Nadav Kunievsky), [Yujin Potter](https://arxiv.org/search/?searchtype=author&query=Yujin Potter), [James Evans](https://arxiv.org/search/?searchtype=author&query=James Evans) 作者:赖世阳、金俊率、Nadav Kunievsky、Yujin Potter、James Evans

Current AI systems minimize risk by enforcing ideological neutrality, yet this may introduce automation bias by suppressing cognitive engagement in human decision-making. We conducted randomized trials with 2,500 participants to test whether culturally biased AI enhances human decision-making. Participants interacted with politically diverse GPT-4o variants on information evaluation tasks. Partisan AI assistants enhanced human performance, increased engagement, and reduced evaluative bias compared to non-biased counterparts, with amplified benefits when participants encountered opposing views. These gains carried a trust penalty: participants underappreciated biased AI and overcredited neutral systems. Exposing participants to two AIs whose biases flanked human perspectives closed the perception-performance gap. These findings complicate conventional wisdom about AI neutrality, suggesting that strategic integration of diverse cultural biases may foster improved and resilient human decision-making. 当前的人工智能系统通过强制实施意识形态中立来最小化风险,然而这可能通过抑制人类决策中的认知参与而引入自动化偏差。我们对 2500 名参与者进行了随机试验,以检验带有文化偏见的人工智能是否能增强人类决策。参与者在信息评估任务中与具有不同政治立场的 GPT-4o 变体互动。党派偏向的 AI 助手提升了人类表现、增加了参与度,并相比无偏见的同类减少了评估性偏差,当参与者遇到相对立的观点时,这些益处被放大。这些收益带来了信任代价:参与者对有偏见的 AI 评价不足,而对中立系统则给予过高信用。让参与者接触到两种偏见分别位于人类观点两侧的 AI,弥合了感知与表现之间的差距。这些发现使关于 AI 中立性的传统认知变得复杂,表明策略性地整合多样的文化偏见可能促进更好且更具弹性的人类决策。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society 主题:人机交互、人工智能、计算机与社会

Publish: 2025-08-12 19:20:43 UTC 发布:2025-08-12 19:20:43 UTC

#107 Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative #107 假曼巴:使用双向曼巴作为自注意力替代的实时语音深度伪造检测

Authors: [Xi Xuan](https://arxiv.org/search/?searchtype=author&query=Xi Xuan), [Zimo Zhu](https://arxiv.org/search/?searchtype=author&query=Zimo Zhu), [Wenxin Zhang](https://arxiv.org/search/?searchtype=author&query=Wenxin Zhang), [Yi-Cheng Lin](https://arxiv.org/search/?searchtype=author&query=Yi-Cheng Lin), [Tomi Kinnunen](https://arxiv.org/search/?searchtype=author&query=Tomi Kinnunen) 作者:Xi Xuan、Zimo Zhu、Wenxin Zhang、Yi-Cheng Lin、Tomi Kinnunen

Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR’s rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba. 语音合成的进步加剧了安全威胁,促使实时深度伪造检测研究。我们探讨双向 Mamba 是否可以作为检测合成语音时与自注意力(Self-Attention)竞争的替代方案。我们的解决方案 Fake-Mamba 将 XLSR 前端与双向 Mamba 相结合,以捕捉局部和全局伪迹。我们的核心创新提出了三种高效编码器:TransBiMamba、ConBiMamba 和 PN-BiMamba。借助 XLSR 丰富的语言表征,PN-BiMamba 能有效捕捉合成语音的细微线索。在 ASVspoof 21 LA、21 DF 和 In-The-Wild 基准上的评估中,Fake-Mamba 分别实现了 0.97%、1.74% 和 5.85% 的 EER,相较于最先进模型 XLSR-Conformer 和 XLSR-Mamba 表现出显著的相对提升。该框架在不同话语长度下均能维持实时推理,展现出强泛化能力和实际可行性。代码可在 https://github.com/xuanxixi/Fake-Mamba 获取。
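
For intuition about the "bidirectional" part of BiMamba-style encoders, the sketch below runs a causal block over the sequence in both directions and fuses the results. The inner block here is a depthwise 1-D convolution stand-in, not an actual Mamba selective SSM, and all module names are hypothetical.

```python
# Conceptual sketch of a bidirectional sequence encoder: the same causal pattern is
# applied forwards and backwards and the two passes are fused. The inner blocks are
# depthwise 1-D conv stand-ins, NOT a real Mamba implementation.
import torch
import torch.nn as nn

class BidirectionalWrapper(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in causal blocks; a real system would use selective SSM layers here.
        self.fwd = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, d_model)
        t = x.size(1)
        xt = x.transpose(1, 2)                             # (batch, d_model, time)
        f = self.fwd(xt)[..., :t]                          # keep only left (causal) context
        b = self.bwd(torch.flip(xt, dims=[-1]))[..., :t]   # same block on the reversed sequence
        b = torch.flip(b, dims=[-1])                       # re-align the backward pass
        out = torch.cat([f, b], dim=1).transpose(1, 2)     # (batch, time, 2 * d_model)
        return self.proj(out)

x = torch.randn(2, 50, 64)                                 # e.g. 50 frames of 64-dim features
print(BidirectionalWrapper(64)(x).shape)                   # torch.Size([2, 50, 64])
```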

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning, Systems and Control 主题:音频与语音处理、人工智能、计算与语言、机器学习、系统与控制

Publish: 2025-08-12 19:15:13 UTC 发布:2025-08-12 19:15:13 UTC

#108 Ethical Medical Image Synthesis #108 伦理医疗图像合成

Authors: [Weina Jin](https://arxiv.org/search/?searchtype=author&query=Weina Jin), [Ashish Sinha](https://arxiv.org/search/?searchtype=author&query=Ashish Sinha), [Kumar Abhishek](https://arxiv.org/search/?searchtype=author&query=Kumar Abhishek), [Ghassan Hamarneh](https://arxiv.org/search/?searchtype=author&query=Ghassan Hamarneh) 作者:Weina Jin、Ashish Sinha、Kumar Abhishek、Ghassan Hamarneh

The task of ethical Medical Image Synthesis (MISyn) is to ensure that the MISyn techniques are researched and developed ethically throughout their entire lifecycle, which is essential to prevent the negative impacts of MISyn. To address the ever-increasing needs and requirements for ethical practice of MISyn research and development, we first conduct a theoretical analysis that identifies the key properties of ethical MISyn and intrinsic limits of MISyn. We identify that synthetic images lack inherent grounding in real medical phenomena, cannot fully represent the training medical images, and inevitably introduce new distribution shifts and biases. Ethical risks can arise from not acknowledging the intrinsic limits and weaknesses of synthetic images compared to medical images, with the extreme form manifested as misinformation of MISyn that substitutes synthetic images for medical images without acknowledgment. The resulting ethical harms include eroding trust in the medical imaging dataset environment and causing algorithmic discrimination towards stakeholders and the public. To facilitate collective efforts towards ethical MISyn within and outside the medical image analysis community, we then propose practical supports for ethical practice in MISyn based on the theoretical analysis, including ethical practice recommendations that adapt the existing technical standards, problem formulation, design, and evaluation practice of MISyn to the ethical challenges; and oversight recommendations to facilitate checks and balances from stakeholders and the public. We also present two case studies that demonstrate how to apply the ethical practice recommendations in practice, and identify gaps between existing practice and the ethical practice recommendations. 伦理化医学图像合成(MISyn)任务旨在确保在整个生命周期内对 MISyn 技术的研究和开发采取伦理化的做法,这对防止 MISyn 带来负面影响至关重要。为应对对 MISyn 研究与开发伦理实践日益增长的需求与要求,我们首先进行了理论分析,识别出伦理化 MISyn 的关键属性以及 MISyn 的内在局限。我们发现合成图像缺乏与真实医学现象的固有联系,无法完全代表用于训练的医学图像,并且不可避免地引入新的分布偏移和偏差。若不承认合成图像与医学图像相比的内在局限和弱点,就可能产生伦理风险,其极端表现形式是将合成图像替代医学图像却不予说明的 MISyn 错误信息。由此产生的伦理危害包括侵蚀对医学影像数据集环境的信任以及对利益相关者和公众造成算法歧视。 为了促进医学影像合成(MISyn)领域内外针对伦理问题的集体努力,我们基于理论分析提出了可行的伦理实践支持措施,包括将现有的技术标准、问题表述、设计与评估实践调整以应对伦理挑战的伦理实践建议;以及便于利益相关者和公众进行制衡与监督的监管建议。我们还展示了两个案例研究,说明如何在实践中应用这些伦理实践建议,并指出现有实践与伦理建议之间的差距。

Subjects: Computers and Society, Artificial Intelligence 主题:计算机与社会,人工智能

Publish: 2025-08-12 19:14:37 UTC 发布:2025-08-12 19:14:37 UTC

#109 Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs #109 人工智能能保守秘密吗?情境完整性验证:一种面向 LLMs 的可证明安全架构

Author: [Aayush Gupta](https://arxiv.org/search/?searchtype=author&query=Aayush Gupta) 作者:Aayush Gupta

Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch – no fine-tuning required – we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research. 大型语言模型 (LLMs) 仍然非常容易受到提示注入和相关越狱攻击的影响;启发式防护措施(规则、过滤器、LLM 判定器)经常被绕过。我们提出了情境完整性验证(Contextual Integrity Verification,CIV),这是一种推理时的安全架构,它为每个标记附加经加密签名的来源标签,并通过在 Transformer 内部使用预 softmax 的硬注意力掩码(可选的前馈网络/残差门控)来强制实施来源信任格局。CIV 在冻结模型上提供确定性的、逐标记的不干扰保证:低信任标记不能影响高信任表示。在基于近期关于提示注入向量分类法构建的基准(Elite-Attack + SoK-246)上,CIV 在所述威胁模型下实现了 0% 的攻击成功率,同时保持了 93.1% 的逐标记相似性,并且在良性任务上模型困惑度没有下降;我们注意到存在因数据通路未优化而带来的延迟开销。由于 CIV 是一个轻量级补丁——不需要微调——我们演示了对 Llama-3-8B 和 Mistral-7B 的即插即用保护。 我们发布了一个参考实现、一个自动化认证工具以及 Elite-Attack 语料库,以支持可复现的研究。
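
A minimal sketch of what a per-token trust mask enforced before the softmax could look like, assuming integer trust levels and the rule that a query may only attend to keys of equal or higher trust. This is our reading of the mechanism described in the abstract, not CIV's reference implementation.

```python
# Illustrative sketch (an assumption about the mechanism, not CIV's reference code):
# build an additive pre-softmax mask so that a query token never attends to a key
# token whose provenance trust level is lower than its own.
import torch

def trust_lattice_mask(trust: torch.Tensor) -> torch.Tensor:
    """trust: (seq_len,) integer trust level per token, higher = more trusted.
    Returns a (seq_len, seq_len) additive mask: 0 where attention is allowed,
    -inf where a lower-trust key would influence a higher-trust query."""
    allowed = trust.unsqueeze(0) >= trust.unsqueeze(1)   # allowed[q, k]: key at least as trusted as query
    mask = torch.zeros(allowed.shape)
    mask[~allowed] = float("-inf")
    return mask

# Example: system prompt (trust 2), user turn (trust 1), retrieved web text (trust 0).
trust = torch.tensor([2, 2, 1, 1, 0, 0])
scores = torch.randn(6, 6) + trust_lattice_mask(trust)
attn = scores.softmax(dim=-1)
print(attn[0])   # the system-prompt row puts zero weight on the lower-trust keys
```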

Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 主题:密码学与安全、人工智能、计算与语言

Publish: 2025-08-12 18:47:30 UTC 发布:2025-08-12 18:47:30 UTC

#110 Detection of Odor Presence via Deep Neural Networks #110 通过深度神经网络检测气味存在

Authors: [Matin Hassanloo](https://arxiv.org/search/?searchtype=author&query=Matin Hassanloo), [Ali Zareh](https://arxiv.org/search/?searchtype=author&query=Ali Zareh), [Mehmet Kemal Özdemir](https://arxiv.org/search/?searchtype=author&query=Mehmet Kemal Özdemir) 作者:Matin Hassanloo、Ali Zareh、Mehmet Kemal Özdemir

Odor detection underpins food safety, environmental monitoring, medical diagnostics, and many more fields. The current artificial sensors developed for odor detection struggle with complex mixtures while non-invasive recordings lack reliable single-trial fidelity. To develop a general system for odor detection, in this study we present a preliminary work where we aim to test two hypotheses: (i) that spectral features of local field potentials (LFPs) are sufficient for robust single-trial odor detection and (ii) that signals from the olfactory bulb alone are adequate. To test two hypotheses, we propose an ensemble of complementary one-dimensional convolutional networks (ResCNN and AttentionCNN) that decodes the presence of odor from multichannel olfactory bulb LFPs. Tested on 2,349 trials from seven awake mice, our final ensemble model supports both hypotheses, achieving a mean accuracy of 86.6%, an F1-score of 81.0%, and an AUC of 0.9247, substantially outperforming previous benchmarks. In addition, the t-SNE visualization confirms that our framework captures biologically significant signatures. These findings establish the feasibility of robust single-trial detection of the presence of odor from extracellular LFPs, as well as demonstrate the potential of deep learning models to provide a deeper understanding of olfactory representations. 气味检测支撑着食品安全、环境监测、医疗诊断等众多领域。目前为气味检测开发的人工传感器在面对复杂混合物时表现不佳,而非侵入式记录则缺乏可靠的单次试验保真度。为开发通用的气味检测系统,本研究提出一项初步工作,旨在检验两个假设:(i)局部场电位(LFP)的谱特征足以实现稳健的单次试验气味检测;(ii)仅来自嗅球的信号就足够用于检测。为检验这两个假设,我们提出了一个由互补的一维卷积网络(ResCNN 和 AttentionCNN)组成的集成模型,该模型可从多通道嗅球 LFP 中解码气味的存在。在对七只清醒小鼠的 2,349 次试验进行测试后,我们的最终集成模型支持上述两个假设,达到平均准确率 86.6%、F1 分数 81.0% 以及 AUC 0.9247,显著优于先前的基准。此外,t-SNE 可视化证实我们的框架捕捉到了具有生物学意义的特征。 这些发现证明了从细胞外局部场电位(LFPs)中进行具有鲁棒性的单次试验气味存在检测是可行的,并且展示了深度学习模型在深入理解嗅觉表征方面的潜力。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-12 18:14:24 UTC 发布时间:2025-08-12 18:14:24 UTC

#111 Cross-BCI, A Cross-BCI-Paradigm Classification Model Towards Universal BCI Applications #111 Cross-BCI,一种面向通用脑机接口应用的跨 BCI 范式分类模型

Authors: [Gaojie Zhou](https://arxiv.org/search/?searchtype=author&query=Gaojie Zhou), [Junhua Li](https://arxiv.org/search/?searchtype=author&query=Junhua Li) 作者:周高杰、李俊华

Classification models used in brain-computer interface (BCI) are usually designed for a single BCI paradigm. This requires the redevelopment of the model when applying it to a new BCI paradigm, resulting in repeated costs and effort. Moreover, less complex deep learning models are desired for practical usage, as well as for deployment on portable devices. In order to fill the above gaps, we, in this study, proposed a lightweight and unified decoding model for cross-BCI-paradigm classification. The proposed model starts with a tempo-spatial convolution. It is followed by a multi-scale local feature selection module, aiming to extract local features shared across BCI paradigms and generate weighted features. Finally, a multi-dimensional global feature extraction module is designed, in which multi-dimensional global features are extracted from the weighted features and fused with the weighted features to form high-level feature representations associated with BCI paradigms. The results, evaluated on a mixture of three classical BCI paradigms (i.e., MI, SSVEP, and P300), demonstrate that the proposed model achieves 88.39%, 82.36%, 80.01%, and 0.8092 for accuracy, macro-precision, macro-recall, and macro-F1-score, respectively, significantly outperforming the compared models. This study provides a feasible solution for cross-BCI-paradigm classification. It lays a technological foundation for developing a new generation of unified decoding systems, paving the way for low-cost and universal practical applications. 用于脑机接口(BCI)的分类模型通常针对单一 BCI 范式设计。这就要求在将模型应用于新的 BCI 范式时重新开发模型,导致重复的成本和工作。此外,为了实际使用以及在便携设备上的部署,人们更希望使用复杂度较低的深度学习模型。为填补上述空白,本研究提出了一种轻量且统一的跨 BCI 范式分类解码模型。所提出的模型以时空卷积开始,随后是一个多尺度局部特征选择模块,旨在提取各 BCI 范式间共享的局部特征并生成加权特征。最后设计了一个多维全局特征提取模块,在该模块中从加权特征中提取多维全局特征,并将其与加权特征融合,形成与 BCI 范式相关的高层特征表示。 在由三种经典脑机接口范式(即运动想象 MI、稳态视觉诱发电位 SSVEP 和 P300)混合构成的数据上评估的结果表明,所提出的模型在准确率、宏精确率、宏召回率和宏 F1 分别达到了 88.39%、82.36%、80.01% 和 0.8092,显著优于对比模型。本研究为跨脑机接口范式分类提供了可行的解决方案。它为开发新一代统一解码系统奠定了技术基础,为低成本和通用的实际应用铺平了道路。

Subjects: Quantitative Methods, Artificial Intelligence, Human-Computer Interaction 主题:定量方法、人工智能、人机交互

Publish: 2025-08-12 16:04:50 UTC 发布:2025-08-12 16:04:50 UTC

#112 NEFMind: Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation #112 NEFMind:面向电信 API 自动化的开源 LLMs 参数高效微调

Authors: [Zainab Khan](https://arxiv.org/search/?searchtype=author&query=Zainab Khan), [Ahmed Hussain](https://arxiv.org/search/?searchtype=author&query=Ahmed Hussain), [Mukesh Thakur](https://arxiv.org/search/?searchtype=author&query=Mukesh Thakur), [Arto Hellas](https://arxiv.org/search/?searchtype=author&query=Arto Hellas), [Panos Papadimitratos](https://arxiv.org/search/?searchtype=author&query=Panos Papadimitratos) 作者:Zainab Khan、Ahmed Hussain、Mukesh Thakur、Arto Hellas、Panos Papadimitratos

The use of Service-Based Architecture in modern telecommunications has exponentially increased Network Functions (NFs) and Application Programming Interfaces (APIs), creating substantial operational complexities in service discovery and management. We introduce \textit{NEFMind}, a framework leveraging parameter-efficient fine-tuning of open-source Large Language Models (LLMs) to address these challenges. It integrates three core components: synthetic dataset generation from Network Exposure Function (NEF) API specifications, model optimization through Quantized-Low-Rank Adaptation, and performance evaluation via GPT-4 Ref Score and BertScore metrics. Targeting 5G Service-Based Architecture APIs, our approach achieves 85% reduction in communication overhead compared to manual discovery methods. Experimental validation using the open-source Phi-2 model demonstrates exceptional API call identification performance at 98-100% accuracy. The fine-tuned Phi-2 model delivers performance comparable to significantly larger models like GPT-4 while maintaining computational efficiency for telecommunications infrastructure deployment. These findings validate domain-specific, parameter-efficient LLM strategies for managing complex API ecosystems in next-generation telecommunications networks. 在现代电信中,基于服务的架构的使用成倍增加了网络功能(NFs)和应用程序编程接口(APIs),从而在服务发现和管理中产生了大量运营复杂性。我们提出了 NEFMind,一个利用对开源 LLMs 进行参数高效微调来应对这些挑战的框架。它整合了三个核心组件:从网络暴露功能(NEF)API 规范生成合成数据集、通过量化低秩适配(Quantized-Low-Rank Adaptation)进行模型优化,以及通过 GPT-4 Ref Score 和 BertScore 指标进行性能评估。针对 5G 基于服务的架构 API,我们的方法在通信开销方面相比手动发现方法实现了 85%的降低。使用开源 Phi-2 模型的实验验证表明,其 API 调用识别性能出色,准确率为 98–100%。微调后的 Phi-2 模型在性能上可与 GPT-4 等显著更大的模型相媲美,同时在电信基础设施部署中保持了计算效率。 这些发现验证了面向特定领域、参数高效的 LLM 策略在管理下一代电信网络中复杂 API 生态系统方面的有效性。
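
The recipe in the abstract (4-bit quantization plus LoRA adapters on an open model such as Phi-2) corresponds to a fairly standard QLoRA setup. A hedged sketch using the Hugging Face transformers/peft APIs follows; the rank, alpha, and target module names are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a QLoRA-style parameter-efficient fine-tuning setup on Phi-2.
# Hyperparameters and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "microsoft/phi-2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

lora_config = LoraConfig(
    r=16,                                    # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # typically well under 1% of parameters are trainable
```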

Subjects: Networking and Internet Architecture, Artificial Intelligence, Computation and Language 主题:网络与互联网架构、人工智能、计算与语言

Publish: 2025-08-12 15:03:22 UTC 发布:2025-08-12 15:03:22 UTC

#113 Gradient-Direction-Aware Density Control for 3D Gaussian Splatting #113 面向 3D 高斯喷溅的梯度方向感知密度控制

Authors: [Zheng Zhou](https://arxiv.org/search/?searchtype=author&query=Zheng Zhou), [Yu-Jie Xiong](https://arxiv.org/search/?searchtype=author&query=Yu-Jie Xiong), [Chun-Ming Xia](https://arxiv.org/search/?searchtype=author&query=Chun-Ming Xia), [Jia-Chen Zhang](https://arxiv.org/search/?searchtype=author&query=Jia-Chen Zhang), [Hong-Jian Zhan](https://arxiv.org/search/?searchtype=author&query=Hong-Jian Zhan) 作者:周征、熊宇杰、夏春明、张家臣、詹洪健

The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS), a gradient-direction-aware adaptive density control framework to address these challenges. Our key innovations: the gradient coherence ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions; and a nonlinear dynamic weighting mechanism leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations with 50% reduced memory consumption through optimized Gaussians utilization. 3D 高斯点投影(3DGS)的出现通过显式场景表示显著推进了新视角合成,使实时写实渲染成为可能。然而,现有方法在复杂场景中存在两个关键局限:(1)当持久存在的大高斯在密度控制期间无法满足自适应分裂阈值时,会发生过度重构。冲突的梯度方向会加剧这一问题,阻止这些高斯有效分裂;(2)在梯度聚合方向一致的区域会出现高斯过度密集,导致冗余组件激增。由于不必要的数据保留,这种冗余显著增加了内存开销。我们提出了梯度方向感知高斯点投影(GDAGS),一种梯度方向感知的自适应密度控制框架,以解决这些挑战。我们的关键创新包括:通过归一化梯度向量范数计算的梯度一致性比(GCR),该比率明确区分梯度方向一致与冲突的高斯;以及一种非线性动态加权机制,利用 GCR 实现梯度方向感知的密度控制。 具体来说,GDAGS 在分裂操作中优先处理梯度冲突的高斯,以增强几何细节,同时抑制多余的方向一致的高斯。相反,在克隆过程中,GDAGS 促进方向一致高斯的致密化以完成结构,同时防止方向冲突高斯的过度增长。对多种真实世界基准的全面评估表明,GDAGS 在有效缓解过度重建、抑制过度致密化的同时,实现了更优的渲染质量,并通过优化高斯的利用构建了内存消耗减少 50% 的紧凑场景表示。
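
One plausible reading of the gradient coherence ratio is the norm of the summed per-view gradients divided by the sum of their norms, so concordant directions give values near 1 and conflicting directions push the ratio toward 0. The sketch below encodes that assumed definition, which may differ from the authors' exact formula.

```python
# A plausible reading (our assumption, not the authors' released code) of the gradient
# coherence ratio (GCR): norm of the summed gradients over the sum of gradient norms.
import torch

def gradient_coherence_ratio(grads: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """grads: (num_views, 2) accumulated screen-space position gradients for one Gaussian."""
    summed = grads.sum(dim=0).norm()     # magnitude of the net gradient direction
    total = grads.norm(dim=1).sum()      # total gradient magnitude regardless of direction
    return summed / (total + eps)

aligned = torch.tensor([[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]])
opposed = torch.tensor([[1.0, 0.0], [-1.0, 0.05], [0.9, -0.02]])
print(gradient_coherence_ratio(aligned))   # near 1 -> concordant directions (cloning candidate)
print(gradient_coherence_ratio(opposed))   # near 0 -> conflicting directions (splitting candidate)
```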

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-12 13:12:54 UTC 发布:2025-08-12 13:12:54 UTC

#114 PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research #114 PETLP:用于人工智能研究的面向隐私设计的社交媒体数据处理管道

Authors: [Nick Oh](https://arxiv.org/search/?searchtype=author&query=Nick Oh), [Giorgos D. Vrakas](https://arxiv.org/search/?searchtype=author&query=Giorgos D. Vrakas), [Siân J. M. Brooke](https://arxiv.org/search/?searchtype=author&query=Siân J. M. Brooke), [Sasha Morinière](https://arxiv.org/search/?searchtype=author&query=Sasha Morinière), [Toju Duke](https://arxiv.org/search/?searchtype=author&query=Toju Duke) 作者:Nick Oh、Giorgos D. Vrakas、Siân J. M. Brooke、Sasha Morinière、Toju Duke

Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms – yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We reveal why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice. 社交媒体数据使人工智能研究人员在 GDPR、版权法和平台条款之间面临重叠的义务——但现有框架未能整合这些监管领域,导致研究人员缺乏统一的指引。我们提出了 PETLP(以隐私为设计的抽取、转换、加载与呈现),这是一个将法律保障直接嵌入扩展 ETL 管道的合规框架。PETLP 的核心是将数据保护影响评估视为从预登记到传播过程中持续演进的动态文档。通过对 Reddit 的系统性分析,我们展示了合格研究组织(可援引 DSM 第 3 条以覆盖平台限制)与商业实体(受服务条款约束)在抽取权利上的根本差异,同时 GDPR 义务适用于所有主体。我们揭示了为何对于社交媒体数据真正的匿名化仍不可实现,并暴露了被允许的数据集创建与模型分发之间的不确定法律鸿沟。 通过将合规决策构建为实用的工作流程并简化机构数据管理计划,PETLP 使研究人员能够自信地应对监管复杂性,弥合法律要求与研究实践之间的差距。

Subjects: Multimedia, Artificial Intelligence, Databases 主题:多媒体、人工智能、数据库

Publish: 2025-08-12 08:33:40 UTC 发表时间:2025-08-12 08:33:40 UTC

#115 Beyond Technocratic XAI: The Who, What & How in Explanation Design #115 超越技术官僚化的可解释人工智能:解释设计中的“谁、什么与如何”

Authors: [Ruchira Dhar](https://arxiv.org/search/?searchtype=author&query=Ruchira Dhar), [Stephanie Brandl](https://arxiv.org/search/?searchtype=author&query=Stephanie Brandl), [Ninell Oldenburg](https://arxiv.org/search/?searchtype=author&query=Ninell Oldenburg), [Anders Søgaard](https://arxiv.org/search/?searchtype=author&query=Anders Søgaard) 作者:Ruchira Dhar、Stephanie Brandl、Ninell Oldenburg、Anders Søgaard

The field of Explainable AI (XAI) offers a wide range of techniques for making complex models interpretable. Yet, in practice, generating meaningful explanations is a context-dependent task that requires intentional design choices to ensure accessibility and transparency. This paper reframes explanation as a situated design process – an approach particularly relevant for practitioners involved in building and deploying explainable systems. Drawing on prior research and principles from design thinking, we propose a three-part framework for explanation design in XAI: asking Who needs the explanation, What they need explained, and How that explanation should be delivered. We also emphasize the need for ethical considerations, including risks of epistemic inequality, reinforcing social inequities, and obscuring accountability and governance. By treating explanation as a sociotechnical design process, this framework encourages a context-aware approach to XAI that supports effective communication and the development of ethically responsible explanations. 可解释人工智能(XAI)领域提供了多种使复杂模型可解释的技术。然而,在实践中,生成有意义的解释是一个依赖情境的任务,需要有意的设计选择以确保可访问性和透明度。本文将解释重新框定为一种情境化的设计过程——这一方法对参与构建和部署可解释系统的从业者尤为相关。借鉴既有研究和设计思维的原则,我们提出了一个用于 XAI 解释设计的三部分框架:询问谁需要解释、他们需要解释什么以及应如何传达该解释。我们还强调了伦理考量的必要性,包括认识上的不平等风险、加强社会不平等以及掩盖问责与治理的问题。通过将解释视为一种社会技术的设计过程,该框架鼓励一种情境感知的 XAI 方法,支持有效沟通并促进具有伦理责任感的解释开发。

Subjects: Computers and Society, Artificial Intelligence, Human-Computer Interaction 主题:计算机与社会、人工智能、人机交互

Publish: 2025-08-12 08:17:26 UTC 发布:2025-08-12 08:17:26 UTC

#116 Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems #116 牛痘:迈向基于 VLM 的多智能体系统的免疫性

Authors: [Yutong Wu](https://arxiv.org/search/?searchtype=author&query=Yutong Wu), [Jie Zhang](https://arxiv.org/search/?searchtype=author&query=Jie Zhang), [Yiming Li](https://arxiv.org/search/?searchtype=author&query=Yiming Li), [Chao Zhang](https://arxiv.org/search/?searchtype=author&query=Chao Zhang), [Qing Guo](https://arxiv.org/search/?searchtype=author&query=Qing Guo), [Nils Lukas](https://arxiv.org/search/?searchtype=author&query=Nils Lukas), [Tianwei Zhang](https://arxiv.org/search/?searchtype=author&query=Tianwei Zhang) 作者:吴雨彤,张杰,李一鸣,张超,郭清,Nils Lukas,张天威

Vision Language Model (VLM)-based agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system should maintain its integrity under adversarial attacks. However, the design of existing multi-agent systems lacks the robustness consideration, as a successful exploit against one agent can spread and infect other agents to undermine the entire system’s assurance. To address this, we propose a new defense approach, Cowpox, to provably enhance the robustness of multi-agent systems. It incorporates a distributed mechanism, which improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps recover the already infected agents. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees. 基于视觉语言模型(VLM)的智能体是有状态的自主实体,能够通过视觉和语言感知并与环境互动。多智能体系统由专门化的智能体组成,它们协同工作以解决(复杂的)任务。一个核心的安全属性是鲁棒性,即系统在遭受对抗性攻击时应维护其完整性。然而,现有多智能体系统的设计缺乏对鲁棒性的考虑,因为对某个智能体的成功利用可以传播并感染其他智能体,从而破坏整个系统的保障。为了解决这一问题,我们提出了一种新的防御方法——牛痘(Cowpox),以可证明的方式增强多智能体系统的鲁棒性。它包含一种分布式机制,通过将期望感染其他智能体的数量限制在较低水平来提高智能体的恢复率。其核心思想是生成并分发一种特殊的治愈样本,该样本在暴露前使智能体免疫于攻击,并有助于恢复已被感染的智能体。我们通过实验证明了牛痘的有效性,并提供了理论上的鲁棒性保证。

Subjects: Multiagent Systems, Artificial Intelligence 主题:多智能体系统,人工智能

Publish: 2025-08-12 07:48:51 UTC

#117 Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference #117 基于集群拓扑的专家分配在 MoE 推理中降低了网络流量

Authors: [Danil Sivtsov](https://arxiv.org/search/?searchtype=author&query=Danil Sivtsov), [Aleksandr Katrutsa](https://arxiv.org/search/?searchtype=author&query=Aleksandr Katrutsa), [Ivan Oseledets](https://arxiv.org/search/?searchtype=author&query=Ivan Oseledets) 作者:Danil Sivtsov、Aleksandr Katrutsa、Ivan Oseledets

Efficient deployment of a pre-trained LLM to a cluster with multiple servers is a critical step for providing fast responses to users’ queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently, considering their underlying structure. During the inference in MoE LLMs, only a small part of the experts is selected to process a given token. Moreover, in practice, the experts’ load is highly imbalanced. For efficient deployment, one has to distribute the model across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the model placement algorithm has to take into account the network topology. This work focuses on the efficient topology-aware placement of the pre-trained MoE LLMs in the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. Due to the internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that ILP-based placement strategy yields lower network traffic than competitors for small-scale (DeepSeekMoE16B) and large-scale (DeepSeek-R1671B) models. 将一个预训练的 LLM 高效地部署到包含多台服务器的集群中,是为用户查询提供快速响应的关键步骤。Mixture-of-Experts (MoE) LLM 最近的成功提出了如何高效部署它们的问题,需考虑其底层结构。在 MoE LLM 的推理过程中,只有一小部分专家会被选中来处理某个给定的 token。此外,在实际中,专家的负载高度不均衡。为了实现高效部署,需要使用模型放置算法将模型分布到大量服务器上。因此,为了提高集群利用率,模型放置算法必须考虑网络拓扑。本工作聚焦于在推理阶段对预训练 MoE LLM 进行高效的、感知拓扑的放置。我们提出了一个整数线性规划(ILP),用于确定专家的最优放置,以最小化预期的传输次数。由于其内部结构,该优化问题可以用标准的 ILP 求解器来解决。 我们证明,与其他方法相比,基于 ILP 的放置策略在小规模(DeepSeekMoE16B)和大规模(DeepSeek-R1671B)模型上产生更低的网络流量。
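
As a toy version of such a placement ILP, the sketch below assigns experts to servers to minimize expected transmissions weighted by routing probability and link cost, under a per-server capacity constraint. The objective, the single-gateway traffic model, and all numbers are simplifying assumptions of ours, solved here with the PuLP CBC backend.

```python
# Toy topology-aware expert-placement ILP (our simplified formulation, not the paper's):
# place each expert on exactly one server to minimize expected cross-server traffic.
import pulp

experts = range(8)                               # expert ids
servers = range(2)                               # server ids
capacity = 4                                     # expert slots per server
p = [0.05 * (e + 1) for e in experts]            # toy routing probabilities per expert
link_cost = [[0, 1], [1, 0]]                     # inter-server hop cost
gateway = 0                                      # tokens enter the cluster at server 0

prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (experts, servers), cat="Binary")

# Objective: expected traffic = sum over experts of P(expert) * cost(gateway -> its server).
prob += pulp.lpSum(p[e] * link_cost[gateway][s] * x[e][s] for e in experts for s in servers)

# Each expert lives on exactly one server; servers have limited memory slots.
for e in experts:
    prob += pulp.lpSum(x[e][s] for s in servers) == 1
for s in servers:
    prob += pulp.lpSum(x[e][s] for e in experts) <= capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
placement = {e: next(s for s in servers if x[e][s].value() > 0.5) for e in experts}
print(placement)   # the hottest experts end up co-located with the gateway server
```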

Subjects: Networking and Internet Architecture, Artificial Intelligence, Distributed, Parallel, and Cluster Computing 主题:网络与互联网体系结构、人工智能、分布式、并行与集群计算

Publish: 2025-08-12 07:08:48 UTC 发布:2025-08-12 07:08:48 UTC

#118 GSMT: Graph Fusion and Spatiotemporal Task Correction for Multi-Bus Trajectory Prediction #118 GSMT:用于多公交轨迹预测的图融合与时空任务校正(Graph Fusion and Spatiotemporal Task Correction)

Authors: [Fan Ding](https://arxiv.org/search/?searchtype=author&query=Fan Ding), [Hwa Hui Tew](https://arxiv.org/search/?searchtype=author&query=Hwa Hui Tew), [Junn Yong Loo](https://arxiv.org/search/?searchtype=author&query=Junn Yong Loo), Susilawati, [LiTong Liu](https://arxiv.org/search/?searchtype=author&query=LiTong Liu), [Fang Yu Leong](https://arxiv.org/search/?searchtype=author&query=Fang Yu Leong), [Xuewen Luo](https://arxiv.org/search/?searchtype=author&query=Xuewen Luo), [Kar Keong Chin](https://arxiv.org/search/?searchtype=author&query=Kar Keong Chin), [Jia Jun Gan](https://arxiv.org/search/?searchtype=author&query=Jia Jun Gan) 作者:丁凡、廖华辉(Hwa Hui Tew)、卢俊勇(Junn Yong Loo)、Susilawati、刘利通(LiTong Liu)、梁芳玉(Fang Yu Leong)、罗学文(Xuewen Luo)、陈家强(Kar Keong Chin)、颜家俊(Jia Jun Gan)

Accurate trajectory prediction for buses is crucial in intelligent transportation systems, particularly within urban environments. In developing regions where access to multimodal data is limited, relying solely on onboard GPS data remains indispensable despite inherent challenges. To address this problem, we propose GSMT, a hybrid model that integrates a Graph Attention Network (GAT) with a sequence-to-sequence Recurrent Neural Network (RNN), and incorporates a task corrector capable of extracting complex behavioral patterns from large-scale trajectory data. The task corrector clusters historical trajectories to identify distinct motion patterns and fine-tunes the predictions generated by the GAT and RNN. Specifically, GSMT fuses dynamic bus information and static station information through embedded hybrid networks to perform trajectory prediction, and applies the task corrector for secondary refinement after the initial predictions are generated. This two-stage approach enables multi-node trajectory prediction among buses operating in dense urban traffic environments under complex conditions. Experiments conducted on a real-world dataset from Kuala Lumpur, Malaysia, demonstrate that our method significantly outperforms existing approaches, achieving superior performance in both short-term and long-term trajectory prediction tasks. 在智能交通系统中,尤其是在城市环境中,准确的公交轨迹预测至关重要。在多模态数据获取受限的发展中地区,尽管存在固有挑战,仅依赖车载 GPS 数据仍然不可或缺。为了解决这一问题,我们提出了 GSMT,一种将图注意力网络(GAT)与序列到序列循环神经网络(RNN)相结合的混合模型,并融入了一个任务修正器,能够从大规模轨迹数据中提取复杂的行为模式。任务修正器通过对历史轨迹进行聚类来识别不同的运动模式,并对 GAT 和 RNN 生成的预测进行微调。具体而言,GSMT 通过嵌入式混合网络融合动态公交信息和静态站点信息以执行轨迹预测,并在生成初始预测后应用任务修正器进行二次优化。这种两阶段方法使得在复杂条件下的密集城市交通环境中实现公交车辆的多节点轨迹预测成为可能。 在来自马来西亚吉隆坡的真实数据集上进行的实验表明,我们的方法显著优于现有方法,在短期和长期轨迹预测任务中均取得了更好的性能。

Subjects: Machine Learning, Artificial Intelligence, Computational Engineering, Finance, and Science 学科:机器学习、人工智能、计算工程、金融与科学

Publish: 2025-08-12 06:54:26 UTC 发表:2025-08-12 06:54:26 UTC

#119 AMRG: Extend Vision Language Models for Automatic Mammography Report Generation #119 AMRG:扩展视觉语言模型以实现自动化乳腺 X 线报告生成

Authors: [Nak-Jun Sung](https://arxiv.org/search/?searchtype=author&query=Nak-Jun Sung), [Donghyun Lee](https://arxiv.org/search/?searchtype=author&query=Donghyun Lee), [Bo Hwa Choi](https://arxiv.org/search/?searchtype=author&query=Bo Hwa Choi), [Chae Jung Park](https://arxiv.org/search/?searchtype=author&query=Chae Jung Park) 作者:Nak-Jun Sung、Donghyun Lee、Bo Hwa Choi、Chae Jung Park

Mammography report generation is a critical yet underexplored task in medical AI, characterized by challenges such as multiview image reasoning, high-resolution visual cues, and unstructured radiologic language. In this work, we introduce AMRG (Automatic Mammography Report Generation), the first end-to-end framework for generating narrative mammography reports using large vision-language models (VLMs). Building upon MedGemma-4B-it, a domain-specialized, instruction-tuned VLM, we employ a parameter-efficient fine-tuning (PEFT) strategy via Low-Rank Adaptation (LoRA), enabling lightweight adaptation with minimal computational overhead. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports. This work establishes the first reproducible benchmark for mammography report generation, addressing a longstanding gap in multimodal clinical AI. We systematically explore LoRA hyperparameter configurations and conduct comparative experiments across multiple VLM backbones, including both domain-specific and general-purpose models under a unified tuning protocol. Our framework demonstrates strong performance across both language generation and clinical metrics, achieving a ROUGE-L score of 0.5691, METEOR of 0.6152, CIDEr of 0.5818, and BI-RADS accuracy of 0.5582. Qualitative analysis further highlights improved diagnostic consistency and reduced hallucinations. AMRG offers a scalable and adaptable foundation for radiology report generation and paves the way for future research in multimodal medical AI. 乳腺 X 线摄影报告生成是医疗人工智能中一个关键但尚未充分研究的任务,其特点包括多视图图像推理、高分辨率视觉线索和非结构化放射学语言等挑战。在这项工作中,我们提出了 AMRG(自动乳腺 X 线摄影报告生成),这是首个使用大型视觉-语言模型(VLM)端到端生成叙事性乳腺 X 线摄影报告的框架。基于经过领域专门化和指令微调的 MedGemma-4B-it,我们采用了一种通过低秩适配(LoRA)实现的参数高效微调(PEFT)策略,从而在极小的计算开销下实现轻量级适配。我们在 DMID 上训练并评估了 AMRG,DMID 是一个公开可用的配对高分辨率乳腺 X 线片与诊断报告的数据集。本工作建立了首个可复现的乳腺 X 线摄影报告生成基准,填补了多模态临床人工智能领域长期存在的空白。我们系统地探索了 LoRA 超参数配置,并在统一的微调协议下,针对包括领域专用模型和通用模型在内的多种 VLM 主干网络进行了比较实验。 我们的框架在语言生成和临床指标方面均表现出色,取得了 ROUGE-L 0.5691、METEOR 0.6152、CIDEr 0.5818 和 BI-RADS 准确率 0.5582。定性分析进一步凸显了诊断一致性的提升和幻觉的减少。AMRG 为放射学报告生成提供了可扩展且可适应的基础,并为多模态医学人工智能的未来研究铺平了道路。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-12 06:37:41 UTC

#120 From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training #120 从强拒绝到安全完成:走向以输出为中心的安全训练

Authors: [Yuan Yuan](https://arxiv.org/search/?searchtype=author&query=Yuan Yuan), [Tina Sriskandarajah](https://arxiv.org/search/?searchtype=author&query=Tina Sriskandarajah), [Anna-Luisa Brakman](https://arxiv.org/search/?searchtype=author&query=Anna-Luisa Brakman), [Alec Helyar](https://arxiv.org/search/?searchtype=author&query=Alec Helyar), [Alex Beutel](https://arxiv.org/search/?searchtype=author&query=Alex Beutel), [Andrea Vallone](https://arxiv.org/search/?searchtype=author&query=Andrea Vallone), [Saachi Jain](https://arxiv.org/search/?searchtype=author&query=Saachi Jain) 作者:Yuan Yuan、Tina Sriskandarajah、Anna-Luisa Brakman、Alec Helyar、Alex Beutel、Andrea Vallone、Saachi Jain

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user’s intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant’s output, rather than a binary classification of the user’s intent. Safe-completions seek to maximize helpfulness within the safety policy’s constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness. 用于 ChatGPT 的大型语言模型传统上被训练为学习拒绝边界:根据用户意图,模型被教导要么完全配合,要么彻底拒绝。尽管这对于明显恶意的提示是有效的缓解措施,但将安全训练集中在拒绝上可能会导致对用户意图模糊的提示表现出脆弱性。二元拒绝边界对于双重用途的场景(例如生物学或网络安全)尤其不适合,在这些场景中,用户请求在高层次上可以安全回答,但在某些情况下如果提供了足够详细或可操作的信息则可能导致恶意提升。作为替代,我们提出了安全完成(safe-completions):一种把助理输出的安全性而非对用户意图的二元分类作为中心的安全训练方法。安全完成旨在在安全策略的约束下最大化有用性。我们将这一方法整合到 GPT-5 中,发现在生产比较和内部控制实验中,安全完成训练提高了安全性(尤其是在双重用途提示上)、降低了残余安全失败的严重性,并显著提升了模型的有用性。

Subjects: Computers and Society, Artificial Intelligence, Computation and Language 主题:计算机与社会、人工智能、计算与语言

Publish: 2025-08-12 00:18:23 UTC 发表:2025-08-12 00:18:23 协调世界时

#121 Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation #121 层次自适应网络与任务向量用于测试时自适应

Authors: [Sameer Ambekar](https://arxiv.org/search/?searchtype=author&query=Sameer Ambekar), [Daniel M. Lang](https://arxiv.org/search/?searchtype=author&query=Daniel M. Lang), [Julia A. Schnabel](https://arxiv.org/search/?searchtype=author&query=Julia A. Schnabel) 作者:Sameer Ambekar、Daniel M. Lang、Julia A. Schnabel

Test-time adaptation allows pretrained models to adjust to incoming data streams, addressing distribution shifts between source and target domains. However, standard methods rely on single-dimensional linear classification layers, which often fail to handle diverse and complex shifts. We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec), which leverages multiple layers of increasing size for dynamic test-time adaptation. By decomposing the encoder’s representation space into such hierarchically organized layers, Hi-Vec, in a plug-and-play manner, allows existing methods to adapt to shifts of varying complexity. Our contributions are threefold: First, we propose dynamic layer selection for automatic identification of the optimal layer for adaptation to each test batch. Second, we propose a mechanism that merges weights from the dynamic layer to other layers, ensuring all layers receive target information. Third, we propose linear layer agreement that acts as a gating function, preventing erroneous fine-tuning by adaptation on noisy batches. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets, proving its strong capability to advance state-of-the-art methods. Our results show that Hi-Vec improves robustness, addresses uncertainty, and handles limited batch sizes and increased outlier rates. 测试时自适应允许预训练模型对输入数据流进行调整,以应对源域与目标域之间的分布偏移。然而,标准方法依赖于单一维度的线性分类层,常常无法处理多样且复杂的偏移。我们提出了带任务向量的分层自适应网络(Hierarchical Adaptive Networks with Task Vectors,Hi-Vec),它利用多层逐渐增大的层次结构来实现动态的测试时自适应。通过将编码器的表示空间分解为这样分层组织的层次,Hi-Vec 以插拔式的方式允许现有方法适应不同复杂度的偏移。我们的贡献有三点:首先,我们提出了动态层选择,用于自动识别对每个测试批次进行自适应的最优层。其次,我们提出了一种机制,将动态层的权重合并到其他层,确保所有层都能接收到目标域信息。第三,我们提出了线性层一致性作为门控函数,防止在含噪批次上进行自适应微调时产生误导。 我们在具有挑战性的情景和多个目标数据集上对 Hi-Vec 的性能进行了严格评估,证明了其显著提升现有最先进方法的能力。我们的结果表明,Hi-Vec 提高了鲁棒性,解决了不确定性问题,并能应对有限的批量大小和增加的异常值比例。
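
The "linear layer agreement" gate can be pictured as several classifier heads voting on a test batch, with adaptation skipped when they disagree too much. The sketch below is our own simplified rendering of that idea; the agreement threshold and head count are chosen arbitrarily and are not the paper's settings.

```python
# Simplified sketch (our assumption) of an agreement gate for test-time adaptation:
# several linear heads vote on a batch, and adaptation is skipped on likely-noisy batches.
import torch
import torch.nn as nn

def agreement_gate(features: torch.Tensor, heads: list[nn.Linear], min_agreement: float = 0.8) -> bool:
    """Return True if a sufficient fraction of samples gets the same label from all heads."""
    preds = torch.stack([h(features).argmax(dim=-1) for h in heads])   # (num_heads, batch)
    unanimous = (preds == preds[0]).all(dim=0).float().mean().item()
    return unanimous >= min_agreement

feats = torch.randn(32, 128)                     # features from the frozen encoder
heads = [nn.Linear(128, 10) for _ in range(3)]   # stand-in classifier heads
if agreement_gate(feats, heads):
    print("heads agree -> run test-time adaptation on this batch")
else:
    print("heads disagree -> skip adaptation (batch treated as noisy)")
```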

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-11 21:55:53 UTC 发布:2025-08-11 21:55:53 UTC

#122 Towards Scalable Training for Handwritten Mathematical Expression Recognition #122 面向可扩展训练的手写数学表达式识别

Authors: [Haoyang Li](https://arxiv.org/search/?searchtype=author&query=Haoyang Li), [Jiaqing Li](https://arxiv.org/search/?searchtype=author&query=Jiaqing Li), [Jialun Cao](https://arxiv.org/search/?searchtype=author&query=Jialun Cao), [Zongyuan Yang](https://arxiv.org/search/?searchtype=author&query=Zongyuan Yang), [Yongping Xiong](https://arxiv.org/search/?searchtype=author&query=Yongping Xiong) 作者:李昊洋、李佳庆、曹家伦、杨宗远、熊永平

Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions. 大型基础模型通过在大规模数据集上可扩展的训练取得了显著的性能提升。然而,手写数学表达式识别(HMER)领域一直受制于数据匮乏,主要因为人工标注既费力又昂贵。为弥补这一差距,我们提出了一种新方法,将有限的手写公式与大规模的 LaTeX 渲染公式相结合,开发了一个可扩展的数据引擎以生成复杂且一致的 LaTeX 序列。借助该引擎,我们构建了迄今为止最大的公式数据集,称为 Tex80M,包含超过 8000 万条高质量训练样本。随后我们提出了 TexTeller,这是首个在大规模数据上训练的 HMER 模型,通过将 Tex80M 与相对较小的 HME 数据集混合训练所得。庞大的训练数据集和我们精炼的流水线使得 TexTeller 在几乎所有基准上都达到了最先进(SOTA)的性能。为促进行业发展,我们将公开发布完整模型、完整数据集和全部代码,便于后续研究在我们的成果上继续拓展。
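
A data engine of this kind boils down to sampling well-formed LaTeX from a grammar and rendering it into images paired with the ground-truth markup. The toy generator below illustrates the sampling idea with a tiny made-up grammar, far simpler than whatever produces Tex80M.

```python
# Toy illustration of a scalable formula "data engine": sample well-formed LaTeX from a
# small recursive grammar. The grammar and symbol set are made up for illustration.
import random

SYMBOLS = ["x", "y", "n", "\\alpha", "\\beta", "\\theta"]
BINOPS = ["+", "-", "\\cdot"]

def sample_expr(depth: int = 0, max_depth: int = 3) -> str:
    if depth >= max_depth or random.random() < 0.3:
        return random.choice(SYMBOLS)
    form = random.choice(["binop", "frac", "pow", "sqrt", "sum"])
    a, b = sample_expr(depth + 1), sample_expr(depth + 1)
    if form == "binop":
        return f"{a} {random.choice(BINOPS)} {b}"
    if form == "frac":
        return f"\\frac{{{a}}}{{{b}}}"
    if form == "pow":
        return f"{a}^{{{b}}}"
    if form == "sqrt":
        return f"\\sqrt{{{a}}}"
    return f"\\sum_{{n=1}}^{{{a}}} {b}"

random.seed(0)
for _ in range(3):
    print(sample_expr())   # e.g. something like \frac{\alpha}{x + \theta}
```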

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-11 19:10:34 UTC 发布日期:2025-08-11 19:10:34 UTC

#123 Understanding Ethical Practices in AI: Insights from a Cross-Role, Cross-Region Survey of AI Development Teams #123 理解人工智能的伦理实践:来自跨角色、跨地域 AI 开发团队调查的见解

Authors: [Wilder Baldwin](https://arxiv.org/search/?searchtype=author&query=Wilder Baldwin), [Sepideh Ghanavati](https://arxiv.org/search/?searchtype=author&query=Sepideh Ghanavati), [Manuel Woersdoerfer](https://arxiv.org/search/?searchtype=author&query=Manuel Woersdoerfer) 作者:Wilder Baldwin、Sepideh Ghanavati、Manuel Woersdoerfer

Recent advances in AI applications have raised growing concerns about the need for ethical guidelines and regulations to mitigate the risks posed by these technologies. In this paper, we present a mixed-method survey study - combining statistical and qualitative analyses - to examine the ethical perceptions, practices, and knowledge of individuals involved in various AI development roles. Our survey includes 414 participants from 43 countries, representing roles such as AI managers, analysts, developers, quality assurance professionals, and information security and privacy experts. The results reveal varying degrees of familiarity and experience with AI ethics principles, government initiatives, and risk mitigation strategies across roles, regions, and other demographic factors. Our findings highlight the importance of a collaborative, role-sensitive approach, involving diverse stakeholders in ethical decision-making throughout the AI development lifecycle. We advocate for developing tailored, inclusive solutions to address ethical challenges in AI development, and we propose future research directions and educational strategies to promote ethics-aware AI practices. 近年来人工智能应用的进展引发了越来越多关于制定伦理指南和监管以减轻这些技术带来风险的担忧。本文呈现了一项混合方法的调查研究——结合统计和定性分析——以考察从事各类人工智能开发角色的个体在伦理认知、实践和知识方面的情况。我们的调查包含来自 43 个国家的 414 名参与者,代表的角色包括人工智能经理、分析师、开发者、质量保证专业人员以及信息安全与隐私专家。结果显示,不同角色、地区及其他人口统计因素之间,对人工智能伦理原则、政府举措和风险缓解策略的熟悉度与经验存在差异。我们的研究强调采取协作且考虑角色差异的方法的重要性,在人工智能开发生命周期中让多样化利益相关者参与伦理决策。我们主张制定量身定制且包容的解决方案来应对人工智能开发中的伦理挑战,并提出未来研究方向与教育策略以促进注重伦理的人工智能实践。

Subjects: Computers and Society, Artificial Intelligence, Human-Computer Interaction, Software Engineering 主题:计算机与社会、人工智能、人机交互、软件工程

Publish: 2025-08-11 19:07:20 UTC 发布:2025-08-11 19:07:20 UTC

#124 Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity #124 朝着有效的多模态大模型越狱:在相关性与域外强度间实现平衡

Authors: [Zuoou Li](https://arxiv.org/search/?searchtype=author&query=Zuoou Li), [Weitong Zhang](https://arxiv.org/search/?searchtype=author&query=Weitong Zhang), [Jingyuan Wang](https://arxiv.org/search/?searchtype=author&query=Jingyuan Wang), [Shuyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Shuyuan Zhang), [Wenjia Bai](https://arxiv.org/search/?searchtype=author&query=Wenjia Bai), [Bernhard Kainz](https://arxiv.org/search/?searchtype=author&query=Bernhard Kainz), [Mengyun Qiao](https://arxiv.org/search/?searchtype=author&query=Mengyun Qiao) 作者:Zuoou Li、Weitong Zhang、Jingyuan Wang、Shuyuan Zhang、Wenjia Bai、Bernhard Kainz、Mengyun Qiao

Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as “successful” are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems. 多模态大型语言模型(MLLMs)在视觉-语言推理任务中被广泛使用。然而,它们对对抗性提示的脆弱性仍然是一个严重问题,因为安全机制往往无法阻止有害输出的生成。尽管最近的越狱策略报告了很高的成功率,但许多被归类为“成功”的响应实际上是良性的、模糊的或与预期的恶意目标无关。这种不一致表明当前的评估标准可能高估了此类攻击的有效性。为了解决这一问题,我们引入了一个四轴评估框架,考虑输入的主题相关性、输入的分布外(OOD)强度、输出的有害性以及输出的拒绝率。该框架能够识别真正有效的越狱方法。在一项大量实证研究中,我们揭示了一个结构性权衡:高度主题相关的提示经常被安全过滤器阻止,而那些过于分布外的提示虽然常能规避检测,却未能产生有害内容。然而,在相关性与新颖性之间取得平衡的提示更有可能规避过滤并触发危险输出。 基于这一洞见,我们开发了一种称为平衡结构分解(Balanced Structural Decomposition,BSD)的递归重写策略。该方法将恶意提示重构为语义对齐的子任务,同时引入微妙的 OOD 信号和视觉线索,使得输入更难被检测到。我们在 13 种商用与开源多模态大模型(MLLMs)上测试了 BSD,结果显示其始终带来更高的攻击成功率、更具危害性的输出以及更少的拒绝响应。与以往方法相比,其成功率提高了 67% ,危害性提高了 21% ,揭示了当前多模态安全系统中一个此前被低估的弱点。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-11 18:57:55 UTC 发布时间:2025-08-11 18:57:55 UTC

#125 Real-time deep learning phase imaging flow cytometer reveals blood cell aggregate biomarkers for haematology diagnostics #125 实时深度学习相位成像流式细胞仪揭示用于血液学诊断的血细胞聚集生物标志物

Authors: [Kerem Delikoyun](https://arxiv.org/search/?searchtype=author&query=Kerem Delikoyun), [Qianyu Chen](https://arxiv.org/search/?searchtype=author&query=Qianyu Chen), [Liu Wei](https://arxiv.org/search/?searchtype=author&query=Liu Wei), [Si Ko Myo](https://arxiv.org/search/?searchtype=author&query=Si Ko Myo), [Johannes Krell](https://arxiv.org/search/?searchtype=author&query=Johannes Krell), [Martin Schlegel](https://arxiv.org/search/?searchtype=author&query=Martin Schlegel), [Win Sen Kuan](https://arxiv.org/search/?searchtype=author&query=Win Sen Kuan), [John Tshon Yit Soong](https://arxiv.org/search/?searchtype=author&query=John Tshon Yit Soong), [Gerhard Schneider](https://arxiv.org/search/?searchtype=author&query=Gerhard Schneider), [Clarissa Prazeres da Costa](https://arxiv.org/search/?searchtype=author&query=Clarissa Prazeres da Costa), [Percy A. Knolle](https://arxiv.org/search/?searchtype=author&query=Percy A. Knolle), [Laurent Renia](https://arxiv.org/search/?searchtype=author&query=Laurent Renia), [Matthew Edward Cove](https://arxiv.org/search/?searchtype=author&query=Matthew Edward Cove), [Hwee Kuan Lee](https://arxiv.org/search/?searchtype=author&query=Hwee Kuan Lee), [Klaus Diepold](https://arxiv.org/search/?searchtype=author&query=Klaus Diepold), [Oliver Hayden](https://arxiv.org/search/?searchtype=author&query=Oliver Hayden) 作者:Kerem Delikoyun、Qianyu Chen、Liu Wei、Si Ko Myo、Johannes Krell、Martin Schlegel、Win Sen Kuan、John Tshon Yit Soong、Gerhard Schneider、Clarissa Prazeres da Costa、Percy A. Knolle、Laurent Renia、Matthew Edward Cove、Hwee Kuan Lee、Klaus Diepold、Oliver Hayden

While analysing rare blood cell aggregates remains challenging in automated haematology, they could markedly advance label-free functional diagnostics. Conventional flow cytometers efficiently perform cell counting with leukocyte differentials but fail to identify aggregates with flagged results, requiring manual reviews. Quantitative phase imaging flow cytometry captures detailed aggregate morphologies, but clinical use is hampered by massive data storage and offline processing. Incorporating hidden biomarkers into routine haematology panels would significantly improve diagnostics without flagged results. We present RT-HAD, an end-to-end deep learning-based image and data processing framework for off-axis digital holographic microscopy (DHM), which combines physics-consistent holographic reconstruction and detection, representing each blood cell in a graph to recognize aggregates. RT-HAD processes >30 GB of image data on-the-fly with turnaround time of <1.5 min and error rate of 8.9% in platelet aggregate detection, which matches acceptable laboratory error rates of haematology biomarkers and solves the big data challenge for point-of-care diagnostics. 在自动化血液学中,分析稀有血细胞聚集体仍然具有挑战性,但它们可能显著推动无标记功能性诊断的发展。传统流式细胞仪能高效进行细胞计数和白细胞分类,但无法识别聚集体,经常出现标记异常结果,需要人工复核。定量相位成像流式细胞术可以捕获详细的聚集体形态,但临床应用受制于大量数据存储和离线处理。将隐含生物标志物纳入常规血液学检测面板,将在不产生标记异常结果的情况下显著改善诊断。我们提出了 RT-HAD,一种针对离轴数字全息显微镜(DHM)的端到端基于深度学习的图像与数据处理框架,结合了与物理一致的全息重建与检测,并将每个血细胞表示为图以识别聚集体。RT-HAD 能够实时处理超过 30 GB 的图像数据,周转时间小于 1.5 分钟,在血小板聚集体检测中的错误率为 8.9%,该错误率与血液学生物标志物的可接受实验室错误率相匹配,并解决了用于现场诊断的大数据挑战。

Subjects: Quantitative Methods, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing 主题:定量方法、人工智能、计算机视觉与模式识别、机器学习、图像与视频处理

Publish: 2025-08-11 15:58:12 UTC 发布时间:2025-08-11 15:58:12 UTC

#126 Deep Generative Models for Discrete Genotype Simulation #126 离散基因型模拟的深度生成模型

Authors: [Sihan Xie](https://arxiv.org/search/?searchtype=author&query=Sihan Xie), [Thierry Tribout](https://arxiv.org/search/?searchtype=author&query=Thierry Tribout), [Didier Boichard](https://arxiv.org/search/?searchtype=author&query=Didier Boichard), [Blaise Hanczar](https://arxiv.org/search/?searchtype=author&query=Blaise Hanczar), [Julien Chiquet](https://arxiv.org/search/?searchtype=author&query=Julien Chiquet), [Eric Barrey](https://arxiv.org/search/?searchtype=author&query=Eric Barrey) 作者:Sihan Xie、Thierry Tribout、Didier Boichard、Blaise Hanczar、Julien Chiquet、Eric Barrey

Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptation tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cow and multiple chromosomes from human. Model performance was assessed using a well-established set of metrics drawn from both deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype association. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen. 深度生成模型为在保护隐私和应对数据可获取性限制的同时模拟逼真基因组数据开辟了新途径。尽管以往研究主要集中于生成基因表达或单倍型数据,本研究探讨了在无条件和表型条件下生成基因型数据,这本质上更具挑战性,因为基因型数据具有离散性。在这项工作中,我们开发并评估了常用的生成模型,包括变分自编码器(VAE)、扩散模型和生成对抗网络(GAN),并提出了针对离散基因型数据的适配方法。我们在大规模数据集上进行了广泛实验,涵盖了牛的全部染色体和人类的多个染色体。模型性能使用来自深度学习和数量遗传学文献中的一套公认指标进行评估。结果表明,这些模型能够有效捕捉遗传模式并保持基因型-表型关联。我们的研究提供了这些模型的全面比较,并为未来基因型模拟研究提供了实用指南。 我们已将代码公开发布于 https://github.com/SihanXXX/DiscreteGenoGen
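
One common way to keep a generative model differentiable over discrete genotypes is to treat each marker as a categorical variable over allele counts and decode with a Gumbel-Softmax relaxation. The sketch below shows that pattern as an assumption on our part; it is not the released DiscreteGenoGen code.

```python
# Rough sketch (our assumption, not the released code) of a discrete genotype decoder:
# each marker is a categorical over {0, 1, 2} allele counts, sampled with Gumbel-Softmax
# so training stays differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenotypeDecoder(nn.Module):
    def __init__(self, latent_dim: int, num_markers: int, num_classes: int = 3):
        super().__init__()
        self.num_markers, self.num_classes = num_markers, num_classes
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, num_markers * num_classes),
        )

    def forward(self, z: torch.Tensor, tau: float = 0.5, hard: bool = False) -> torch.Tensor:
        logits = self.net(z).view(-1, self.num_markers, self.num_classes)
        # Differentiable discrete sampling; with hard=True the output is one-hot per marker.
        return F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)

decoder = GenotypeDecoder(latent_dim=32, num_markers=100)
z = torch.randn(4, 32)
genotypes = decoder(z, hard=True).argmax(dim=-1)   # (4, 100) values in {0, 1, 2}
print(genotypes.shape, genotypes.min().item(), genotypes.max().item())
```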

Subjects: Genomics, Artificial Intelligence, Machine Learning 主题:基因组学,人工智能,机器学习

Publish: 2025-08-11 11:56:03 UTC 发布:2025-08-11 11:56:03 UTC

#127 MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models #127 MME-Emotion:用于多模态大型语言模型情感智能的整体评估基准

Authors: [Fan Zhang](https://arxiv.org/search/?searchtype=author&query=Fan Zhang), [Zebang Cheng](https://arxiv.org/search/?searchtype=author&query=Zebang Cheng), [Chong Deng](https://arxiv.org/search/?searchtype=author&query=Chong Deng), [Haoxuan Li](https://arxiv.org/search/?searchtype=author&query=Haoxuan Li), [Zheng Lian](https://arxiv.org/search/?searchtype=author&query=Zheng Lian), [Qian Chen](https://arxiv.org/search/?searchtype=author&query=Qian Chen), [Huadai Liu](https://arxiv.org/search/?searchtype=author&query=Huadai Liu), [Wen Wang](https://arxiv.org/search/?searchtype=author&query=Wen Wang), [Yi-Fan Zhang](https://arxiv.org/search/?searchtype=author&query=Yi-Fan Zhang), [Renrui Zhang](https://arxiv.org/search/?searchtype=author&query=Renrui Zhang), [Ziyu Guo](https://arxiv.org/search/?searchtype=author&query=Ziyu Guo), [Zhihong Zhu](https://arxiv.org/search/?searchtype=author&query=Zhihong Zhu), [Hao Wu](https://arxiv.org/search/?searchtype=author&query=Hao Wu), [Haixin Wang](https://arxiv.org/search/?searchtype=author&query=Haixin Wang), [Yefeng Zheng](https://arxiv.org/search/?searchtype=author&query=Yefeng Zheng), [Xiaojiang Peng](https://arxiv.org/search/?searchtype=author&query=Xiaojiang Peng), [Xian Wu](https://arxiv.org/search/?searchtype=author&query=Xian Wu), [Kun Wang](https://arxiv.org/search/?searchtype=author&query=Kun Wang), [Xiangang Li](https://arxiv.org/search/?searchtype=author&query=Xiangang Li), [Jieping Ye](https://arxiv.org/search/?searchtype=author&query=Jieping Ye), [Pheng-Ann Heng](https://arxiv.org/search/?searchtype=author&query=Pheng-Ann Heng) 作者:张帆、程泽邦、邓冲、李浩轩、连征、陈茜、刘怀岱、王文、张一帆、张仁瑞、郭紫雨、朱志宏、吴昊、王海昕、郑野峰、彭晓江、吴仙、王坤、李贤刚、叶劲平、恒炳安

Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific questioning-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only 39.3% recognition score and 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs’ emotional intelligence in the future. 近年来多模态大语言模型(MLLMs)的进展推动了情感计算的变革性发展,使模型展现出新兴的情感智能。尽管方法论上取得了大量进展,当前的情感基准仍然有限,因为仍然未知:(a) MLLMs 在不同场景间的泛化能力,以及 (b) 它们识别情绪状态触发因素的推理能力。为弥补这些空白,我们提出了 MME-Emotion,这是一个系统性的基准,用以评估 MLLMs 的情感理解与推理能力,具备可扩展的容量、多样化的设置和统一的协议。作为针对 MLLMs 最大的情感智能基准,MME-Emotion 包含超过 6000 个精心挑选的视频片段及针对任务的问答对,涵盖广泛场景以构建八种情感任务。它还融合了带有混合度量的整体评估套件,用于情感识别与推理,并通过多智能体系统框架进行分析。 通过对 20 个先进多模态大模型(MLLM)进行严格评估,我们揭示了它们的优势与局限,并得出若干关键见解:(1) 目前的 MLLM 在情感智能方面表现不尽如人意,在我们的基准测试中,表现最好的模型仅取得 39.3% 的识别得分和 56.0% 的连锁思维(CoT)得分。(2) 通用模型(例如:Gemini-2.5-Pro)通过泛化的多模态理解能力获得情感智能,而专用模型(例如:R1-Omni)则可以通过领域特定的后训练适配达到可比较的性能。通过引入 MME-Emotion,我们希望它能作为未来提升 MLLM 情感智能的基础。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-11 03:14:55 UTC 发布:2025-08-11 03:14:55 UTC

#128 Quantum-Enhanced Generative Adversarial Networks: Comparative Analysis of Classical and Hybrid Quantum-Classical Generative Adversarial Networks #128 量子增强生成对抗网络:经典与混合量子-经典生成对抗网络的比较分析

Author: [Kun Ming Goh](https://arxiv.org/search/?searchtype=author&query=Kun Ming Goh) 作者:Kun Ming Goh

Generative adversarial networks (GANs) have emerged as a powerful paradigm for producing high-fidelity data samples, yet their performance is constrained by the quality of latent representations, typically sampled from classical noise distributions. This study investigates hybrid quantum-classical GANs (HQCGANs) in which a quantum generator, implemented via parameterised quantum circuits, produces latent vectors for a classical discriminator. We evaluate a classical GAN alongside three HQCGAN variants with 3, 5, and 7 qubits, using Qiskit’s AerSimulator with realistic noise models to emulate near-term quantum devices. The binary MNIST dataset (digits 0 and 1) is used to align with the low-dimensional latent spaces imposed by current quantum hardware. Models are trained for 150 epochs and assessed with Frechet Inception Distance (FID) and Kernel Inception Distance (KID). Results show that while the classical GAN achieved the best scores, the 7-qubit HQCGAN produced competitive performance, narrowing the gap in later epochs, whereas the 3-qubit model exhibited earlier convergence limitations. Efficiency analysis indicates only moderate training time increases despite quantum sampling overhead. These findings validate the feasibility of noisy quantum circuits as latent priors in GAN architectures, highlighting their potential to enhance generative modelling within the constraints of the noisy intermediate-scale quantum (NISQ) era. 生成对抗网络(GAN)已成为生成高保真数据样本的强大范式,然而其性能受限于潜在表示的质量,这些表示通常从传统噪声分布中采样。本研究考察了混合量子-经典 GAN(HQCGAN),其中量子生成器通过参数化量子电路实现,为经典判别器生成潜在向量。我们在 Qiskit 的 AerSimulator 上使用现实噪声模型来模拟近中期量子设备,评估了一个经典 GAN 以及三个分别为 3、5 和 7 量子比特的 HQCGAN 变体。为适应当前量子硬件所施加的低维潜在空间,采用了二元 MNIST 数据集(数字 0 和 1)。模型训练 150 个 epoch,并使用 Frechet Inception Distance(FID)和 Kernel Inception Distance(KID)进行评估。结果表明,尽管经典 GAN 获得了最佳分数,7 量子比特的 HQCGAN 表现具有竞争力,在后期 epoch 缩小了差距,而 3 量子比特模型则在较早阶段表现出收敛的局限性。 效率分析表明,尽管存在量子采样开销,训练时间仅出现中等程度的增加。这些发现验证了在生成对抗网络架构中将有噪声量子电路作为潜在先验的可行性,并突显了它们在噪声中型量子(NISQ)时代内提升生成建模能力的潜力。
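
To make the hybrid setup concrete, here is a minimal sketch (PyTorch, not the paper's code) of a GAN training step whose latent prior is a pluggable function, so a classical Gaussian sampler can be swapped for a quantum-circuit sampler. The `quantum_prior_stub`, layer sizes, and 7-dimensional latent are illustrative assumptions only.

```python
# Minimal sketch (not the paper's implementation): a GAN whose latent prior is
# pluggable, so a classical sampler can be replaced by a quantum-circuit sampler.
import torch
import torch.nn as nn

LATENT_DIM = 7  # e.g. one latent dimension per qubit in the 7-qubit variant

def classical_prior(batch: int) -> torch.Tensor:
    return torch.randn(batch, LATENT_DIM)

def quantum_prior_stub(batch: int) -> torch.Tensor:
    # Stand-in for latent vectors measured from a parameterised quantum circuit
    # (e.g. per-qubit expectation values); here just bounded random noise.
    return torch.rand(batch, LATENT_DIM) * 2 - 1

G = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real: torch.Tensor, prior=classical_prior):
    b = real.size(0)
    fake = G(prior(b))
    # Discriminator update: real vs. generated samples.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    d_loss.backward()
    opt_d.step()
    # Generator update: fool the discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(b, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage with the stand-in quantum prior:
print(train_step(torch.randn(64, 28 * 28), prior=quantum_prior_stub))
```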

Subjects: Quantum Physics, Artificial Intelligence, Machine Learning 主题:量子物理、人工智能、机器学习

Publish: 2025-08-10 18:34:53 UTC 发布:2025-08-10 18:34:53 UTC

#129 CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge #129 CoMoE:面向边缘的基于 MoE 的 LLMs 的专家聚合与卸载协同优化

Authors: [Muqing Li](https://arxiv.org/search/?searchtype=author&query=Muqing Li), [Ning Li](https://arxiv.org/search/?searchtype=author&query=Ning Li), [Xin Yuan](https://arxiv.org/search/?searchtype=author&query=Xin Yuan), [Wenchao Xu](https://arxiv.org/search/?searchtype=author&query=Wenchao Xu), [Quan Chen](https://arxiv.org/search/?searchtype=author&query=Quan Chen), [Song Guo](https://arxiv.org/search/?searchtype=author&query=Song Guo), [Haijun Zhang](https://arxiv.org/search/?searchtype=author&query=Haijun Zhang) 作者:李慕清,李宁,袁鑫,许文超,陈全,郭松,张海军

The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments, denoted as CoMoE. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging, knowledge distillation, and parameter sharing decomposition, identifying their limitations in dynamic mobile environments. We then investigate expert offloading strategies encompassing expert prediction and prefetching, expert caching and scheduling, and multi-tier storage architectures, revealing the interdependencies between routing decisions and offloading performance. CoMoE incorporates adaptive scheduling mechanisms that respond to user mobility and varying network conditions, enabling efficient MoE deployment across heterogeneous edge devices. Extensive experiments on real mobile edge testbeds demonstrate that CoMoE achieves approximately 70% reduction in memory usage compared to baseline methods and 10.5% lower inference latency than existing expert offloading techniques, while maintaining model performance stability. For large-scale MoE models (e.g., the 7.4B-parameter Switch-Base-128), CoMoE reduces memory requirements from 15.6GB to 4.7GB, enabling deployment on resource-constrained mobile edge devices that previously could only support much smaller models. 大型语言模型(LLMs)的普及推动了专家混合(Mixture-of-Experts,MoE)架构的采用,作为在控制计算成本的同时扩展模型容量的一种有前景的解决方案。然而,由于 MoE 模型占用大量内存并具有动态的专家激活模式,在资源受限的移动边缘计算环境中部署面临重大挑战。为了解决这些挑战,我们提出了一种新颖的动态资源感知协同优化框架,基于移动边缘环境中实时的设备资源状态、网络状况和输入特征,联合优化专家聚合粒度与卸载策略,记为 CoMoE。 在 CoMoE 中,我们首先系统性地分析了现有的专家聚合技术,包括专家参数合并、知识蒸馏和参数共享分解,指出了它们在动态移动环境中的局限性。随后我们研究了专家卸载策略,涵盖专家预测与预取、专家缓存与调度,以及多层存储架构,揭示了路由决策与卸载性能之间的相互依赖关系。CoMoE 融合了能够响应用户移动性和网络条件变化的自适应调度机制,从而在异构边缘设备上实现高效的 MoE 部署。在真实移动边缘测试平台上进行的大量实验证明,与基线方法相比,CoMoE 在内存使用上约减少了 70%,比现有专家卸载技术推理延迟低 10.5%,同时保持模型性能稳定。对于大规模 MoE 模型(例如 7.4B 参数的 Switch-Base-128),CoMoE 将内存需求从 15.6GB 降至 4.7GB,使得之前只能支持更小模型的资源受限移动边缘设备得以部署此类大型模型。
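
The offloading side of the abstract can be illustrated with a toy cost model (not CoMoE itself): cache the most frequently activated experts within a device memory budget and estimate the resulting expected expert latency. All numbers, expert names, and the greedy policy below are assumptions for illustration.

```python
# Toy sketch (not CoMoE): decide which MoE experts to keep in device memory
# versus offload, given a memory budget and observed activation rates.
def plan_offloading(activation_rate, expert_size_gb, budget_gb,
                    local_ms=1.0, fetch_ms=25.0):
    """Greedily cache the most frequently activated experts on-device."""
    order = sorted(activation_rate, key=activation_rate.get, reverse=True)
    cached, used = set(), 0.0
    for e in order:
        if used + expert_size_gb[e] <= budget_gb:
            cached.add(e)
            used += expert_size_gb[e]
    # Expected per-token expert latency under this caching plan.
    expected_latency = sum(
        p * (local_ms if e in cached else fetch_ms)
        for e, p in activation_rate.items()
    )
    return cached, used, expected_latency

rates = {f"expert_{i}": p for i, p in enumerate([0.30, 0.25, 0.20, 0.15, 0.10])}
sizes = {e: 1.2 for e in rates}            # GB per expert (hypothetical)
print(plan_offloading(rates, sizes, budget_gb=3.0))
```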

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能

Publish: 2025-08-10 14:05:36 UTC 发布:2025-08-10 14:05:36 UTC

#130 From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations #130 从可解释到已解释的人工智能:用于证伪和量化解释的思路

Authors: [Yoni Schirris](https://arxiv.org/search/?searchtype=author&query=Yoni Schirris), [Eric Marcus](https://arxiv.org/search/?searchtype=author&query=Eric Marcus), [Jonas Teuwen](https://arxiv.org/search/?searchtype=author&query=Jonas Teuwen), [Hugo Horlings](https://arxiv.org/search/?searchtype=author&query=Hugo Horlings), [Efstratios Gavves](https://arxiv.org/search/?searchtype=author&query=Efstratios Gavves) 作者:Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves

Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights if a model depends on spurious features that undermines generalization and harms a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation’s predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x. 解释深度学习模型对于医学图像分析系统在临床中的整合至关重要。一个好的解释能够突出模型是否依赖于削弱泛化并对某些患者群体造成不利影响的伪特征,或相反地,可能揭示新的生物学见解。尽管像 GradCAM 这样的技术可以识别有影响力的特征,但它们是测量工具,本身并不构成解释。我们提出了一个针对计算病理学中分类器解释的人机‑视觉语言模型(VLM)交互系统,包括用于全切片图像的多实例学习。我们的概念验证包括(1)一个集成 AI 的切片查看器,用于运行滑动窗口实验以检验解释的主张,以及(2)使用通用视觉‑语言模型对解释的预测性进行量化。结果表明,这使我们能够定性地检验解释的主张,并能够定量地区分相互竞争的解释。这为从可解释的人工智能到在数字病理学及更广领域中被解释的人工智能提供了实用路径。代码和提示可在 https://github.com/nki-ai/x2x 获取。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-09 10:06:15 UTC 发布:2025-08-09 10:06:15 UTC

#131 MoQE: Improve Quantization Model performance via Mixture of Quantization Experts #131 MoQE:通过量化专家混合提升量化模型性能

Authors: [Jinhao Zhang](https://arxiv.org/search/?searchtype=author&query=Jinhao Zhang), [Yunquan Zhang](https://arxiv.org/search/?searchtype=author&query=Yunquan Zhang), [Boyang Zhang](https://arxiv.org/search/?searchtype=author&query=Boyang Zhang), [Zeyu Liu](https://arxiv.org/search/?searchtype=author&query=Zeyu Liu), [Daning Cheng](https://arxiv.org/search/?searchtype=author&query=Daning Cheng) 作者:张金豪,张云泉,张博洋,刘泽宇,程达宁

Quantization methods play a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (abbr. MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantization models. MoQE combines multiple quantization variants of one full-precision model as specialized “quantization experts” and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through specialized quantization expert models. We design lightweight, structure-aware router models tailored for both CV and NLP tasks. Experimental evaluations on ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantization models, without incurring significant increases in inference latency. 量化方法在提升模型效率和降低部署成本方面起着关键作用,使深度学习模型能够在资源受限的设备上广泛应用。然而,量化过程不可避免地会带来精度下降。本文提出了一种基于专家混合(Mixture-of-Experts,MoE)架构的量化推理框架——量化专家混合(简称 MoQE),旨在协同提升量化模型的性能。MoQE 将单个全精度模型的多个量化变体组合为专门的“量化专家”,并根据输入数据的特性动态地将其路由到最合适的专家。通过专门化的量化专家模型,MoQE 缓解了单一量化模型常见的性能退化问题。我们为计算机视觉和自然语言处理任务设计了轻量级、结构感知的路由器模型。 在包括 ImageNet、WikiText、C4 和 OpenWebText 在内的基准数据集上,对 ResNet、LLaMA 和 Qwen 模型系列进行的实验评估表明,MoQE 在性能上可与最先进的量化模型相媲美,同时不会显著增加推理延迟。
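
A rough sketch of the routing idea, assuming the "quantization experts" are interchangeable model variants with a shared interface; the router architecture and dispatch loop below are illustrative, not the paper's design.

```python
# Minimal sketch (not the paper's code): a lightweight router picks one of
# several quantized variants of the same base model for each input.
import torch
import torch.nn as nn

class QuantExpertRouter(nn.Module):
    def __init__(self, feat_dim: int, num_experts: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                   nn.Linear(64, num_experts))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.score(feats).argmax(dim=-1)   # expert index per sample

def route_and_infer(x, feats, router, experts):
    idx = router(feats)
    out = torch.empty(x.size(0), experts[0](x[:1]).size(-1))
    for k, expert in enumerate(experts):          # dispatch batch slices
        mask = idx == k
        if mask.any():
            out[mask] = expert(x[mask])
    return out

# Hypothetical "quantization experts": stand-ins for e.g. INT8/INT4 variants.
experts = [nn.Linear(16, 10), nn.Linear(16, 10), nn.Linear(16, 10)]
router = QuantExpertRouter(feat_dim=16, num_experts=len(experts))
x = torch.randn(8, 16)
print(route_and_infer(x, x, router, experts).shape)
```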

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器学习、人工智能、计算机视觉与模式识别

Publish: 2025-08-09 05:58:29 UTC 发布:2025-08-09 05:58:29 UTC

#132 Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method #132 个性化特征转换用于表情识别:一种高效的无源域自适应方法

Authors: [Masoumeh Sharafi](https://arxiv.org/search/?searchtype=author&query=Masoumeh Sharafi), [Soufiane Belharbi](https://arxiv.org/search/?searchtype=author&query=Soufiane Belharbi), [Houssem Ben Salem](https://arxiv.org/search/?searchtype=author&query=Houssem Ben Salem), [Ali Etemad](https://arxiv.org/search/?searchtype=author&query=Ali Etemad), [Alessandro Lameiras Koerich](https://arxiv.org/search/?searchtype=author&query=Alessandro Lameiras Koerich), [Marco Pedersoli](https://arxiv.org/search/?searchtype=author&query=Marco Pedersoli), [Simon Bacon](https://arxiv.org/search/?searchtype=author&query=Simon Bacon), [Eric Granger](https://arxiv.org/search/?searchtype=author&query=Eric Granger) 作者:Masoumeh Sharafi、Soufiane Belharbi、Houssem Ben Salem、Ali Etemad、Alessandro Lameiras Koerich、Marco Pedersoli、Simon Bacon、Eric Granger

Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation. 面部表情识别(FER)模型被用于许多基于视频的情感计算应用中,例如人机交互和健康监测。然而,深度 FER 模型常在识别细微表情和面对高度个体差异时表现不佳,限制了其在实际应用中的性能。为提升性能,已有无源域适应(SFDA)方法被提出,用于在仅使用无标签目标域数据的情况下个性化预训练源模型,从而避免数据隐私、存储和传输的限制。本文针对一种具挑战性的情形:源数据在适应时不可用,且仅有的无标签目标数据全部都是中性表情。SFDA 方法通常并非为仅使用来自单一类别的目标数据进行适应而设计。此外,使用模型生成非中性表情的人脸图像可能不稳定且计算开销大。本文提出用于 SFDA 的个性化特征转换(PFT)。与当前用于 SFDA 的图像翻译方法不同,我们的轻量级方法在潜在空间中运行。 我们首先在源域数据上对转换器进行预训练,将来自一个源主体的特定主体风格特征转换为另一个主体。通过优化表情一致性和风格感知目标的组合来保留表情信息。然后,在中性目标数据上对转换器进行自适应,而不使用源数据或图像合成。通过在潜在空间中进行转换,PFT 避免了面部表情生成的复杂性和噪声,产生为分类优化的判别性嵌入。使用 PFT 消除了图像合成的需要,减少了计算开销(使用轻量级转换器),且仅自适应模型的一部分,使该方法相比基于图像的转换更为高效。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-08 20:13:50 UTC 发布:2025-08-08 20:13:50 UTC

#133 Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach #133 学习检测大型视觉-语言模型中的未知越狱攻击:一种统一且准确的方法

Authors: [Shuang Liang](https://arxiv.org/search/?searchtype=author&query=Shuang Liang), [Zhihao Xu](https://arxiv.org/search/?searchtype=author&query=Zhihao Xu), [Jialing Tao](https://arxiv.org/search/?searchtype=author&query=Jialing Tao), [Hui Xue](https://arxiv.org/search/?searchtype=author&query=Hui Xue), [Xiting Wang](https://arxiv.org/search/?searchtype=author&query=Xiting Wang) 作者:梁爽、徐志浩、陶佳玲、薛晖、王喜廷

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines. 尽管进行了大量对齐工作,大型视觉-语言模型(LVLMs)仍然容易受到越狱攻击,带来严重的安全风险。尽管近期的检测工作由于内部表示具有丰富的跨模态信息而转向研究内部表示,但大多数方法依赖启发式规则而非有原则性的目标,导致性能不佳。为了解决这些限制,我们提出了 Learning to Detect(LoD),这是一个将越狱检测表述为异常检测的新型无监督框架。LoD 引入了两个关键组件:多模态安全概念激活向量(MSCAV),它捕捉跨模态的分层安全相关表示;以及安全模式自编码器,该自编码器对来自安全输入的 MSCAV 分布进行建模,并通过重构误差检测异常。通过仅在安全样本上训练自编码器而不依赖攻击标签,LoD 自然将越狱输入识别为分布异常,从而实现对越狱攻击的准确且统一的检测。 在三种不同的大型视觉语言模型(LVLM)和五个基准上的全面实验表明,LoD 实现了最先进的性能,平均 AUROC 为 0.9951,在最弱 AUROC 指标上较最强基线最高提升了 38.89%。
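
The reconstruction-error detector described above can be sketched as follows, assuming the MSCAV features are already extracted as fixed-length vectors; the autoencoder size, the 99th-percentile threshold, and the random stand-in data are assumptions, not the released LoD code.

```python
# Sketch of the anomaly-detection idea: train an auto-encoder on safe samples
# only, then flag inputs whose reconstruction error is anomalously large.
import torch
import torch.nn as nn

class SafetyPatternAE(nn.Module):
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

def fit_on_safe(ae, safe_feats, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(safe_feats), safe_feats)
        loss.backward()
        opt.step()
    # Threshold = a high quantile of reconstruction error on safe data.
    with torch.no_grad():
        err = ((ae(safe_feats) - safe_feats) ** 2).mean(dim=1)
    return torch.quantile(err, 0.99).item()

def is_jailbreak(ae, feats, threshold):
    with torch.no_grad():
        err = ((ae(feats) - feats) ** 2).mean(dim=1)
    return err > threshold   # distributional anomaly -> flagged

safe = torch.randn(512, 64)              # stand-in for MSCAV of safe inputs
ae = SafetyPatternAE(64)
thr = fit_on_safe(ae, safe)
print(is_jailbreak(ae, torch.randn(4, 64) * 3, thr))
```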

Subjects: Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:密码学与安全、人工智能、计算机视觉与模式识别

Publish: 2025-08-08 16:13:28 UTC 发布日期:2025-08-08 16:13:28 UTC

#134 Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction #134 零样本自监督单次屏气磁共振胆胰管成像(MRCP)重建

Authors: [Jinho Kim](https://arxiv.org/search/?searchtype=author&query=Jinho Kim), [Marcel Dominik Nickel](https://arxiv.org/search/?searchtype=author&query=Marcel Dominik Nickel), [Florian Knoll](https://arxiv.org/search/?searchtype=author&query=Florian Knoll) 作者:Jinho Kim、Marcel Dominik Nickel、Florian Knoll

Purpose: To investigate the feasibility of applying zero-shot self-supervised learning reconstruction to reduce breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Methods: Breath-hold MRCP was acquired from 11 healthy volunteers on a 3T scanner using an incoherent k-space sampling pattern leading to a breath-hold duration of 14s. We evaluated zero-shot reconstruction of breath-hold MRCP against parallel imaging of respiratory-triggered MRCP acquired in 338s on average and compressed sensing reconstruction of breath-hold MRCP. To address the long computation times of zero-shot trainings, we used a training approach that leverages a pretrained network to reduce backpropagation depth during training. Results: Zero-shot learning reconstruction significantly improved visual image quality compared to compressed sensing reconstruction, particularly in terms of signal-to-noise ratio and ductal delineation, and reached a level of quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Shallow training provided nearly equivalent reconstruction performance with a training time of 11 minutes in comparison to 271 minutes for a conventional zero-shot training. Conclusion: Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and shallow training offers a practical solution for translation to time-constrained clinical workflows. 目的:研究将零样本自监督学习重建应用于磁共振胰胆管成像(MRCP)以减少屏气时间的可行性。方法:在一台 3T 扫描仪上对 11 名健康志愿者采集了屏气 MRCP,采用非相干 k 空间采样模式,使屏气时长为 14 秒。我们将屏气 MRCP 的零样本重建与平均采集时间为 338 秒的呼吸触发 MRCP 的并行成像以及屏气 MRCP 的压缩感知重建进行了比较。为了解决零样本训练的长计算时间问题,我们使用了一种利用预训练网络以减少训练期间反向传播深度的训练方法。结果:与压缩感知重建相比,零样本学习重建显著改善了视觉图像质量,尤其在信噪比和管道描绘方面,并达到了与呼吸规律的成功呼吸触发采集中可比的质量水平。 浅层训练在重建性能上几乎相当,而训练时间为 11 分钟,相比之下传统的零样本训练需要 271 分钟。结论:零样本学习在缩短屏气时间的同时提供高保真度的 MRCP 重建,浅层训练则为在时间受限的临床流程中实现转化提供了切实可行的解决方案。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-08 13:59:20 UTC 发布:2025-08-08 13:59:20 UTC

#135 Δ-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation #135 Δ -AttnMask:用于高效数据选择与增强的注意力引导掩码隐状态

Authors: [Jucheng Hu](https://arxiv.org/search/?searchtype=author&query=Jucheng Hu), [Suorong Yang](https://arxiv.org/search/?searchtype=author&query=Suorong Yang), [Dongzhan Zhou](https://arxiv.org/search/?searchtype=author&query=Dongzhan Zhou) 作者:胡聚成,杨索蓉,周东展

Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose Δ-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model’s hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences (Δ) between the original states and states masked using high-attention regions, Δ-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that Δ-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures. 视觉指令微调(VIF)对于训练后视觉-语言模型(VLMs)至关重要。与仅需指令数据以实现模型遵从性的纯文本大语言模型中的单模态指令微调不同,VIF 还需要多模态数据以实现视觉与文本的联合理解;因此,通常需要更多数据。因此,VIF 对数据选择提出了更严格的挑战:方法必须高效扩展以应对更大的数据需求,同时确保视觉和文本内容及其对齐质量。尽管它对性能有关键影响,但 VIF 的数据选择仍是一个研究不足的领域。在本文中,我们提出了 Δ -AttnMask。该数据高效框架通过基于注意力的掩码对模型隐藏状态进行引导来量化样本质量,联合评估图文对,无需领域标签、辅助模型或额外训练。通过计算原始状态与使用高注意力区域掩码后的状态之间的损失差异( Δ ), Δ -AttnMask 从内在上评估样本质量。 在多个视觉-语言模型和数据集上的实验表明, Δ -AttnMask 在仅使用 20%数据的情况下就能达到最先进的性能,使训练加速 5 倍,同时在总体准确率上比使用全部数据的基线高出+10.1%。其与模型无关且与数据无关的设计确保了在多种模态和架构上的广泛适用性。
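
One plausible reading of the Δ score, sketched with stand-in tensors and a toy prediction head (both assumptions, not the authors' pipeline): mask the most-attended positions of the hidden states and measure how much the loss changes.

```python
# Hedged sketch of the Δ idea: compare the loss on the original hidden states
# with the loss after masking the highest-attention positions.
import torch
import torch.nn as nn

def delta_attnmask_score(hidden, attn, labels, head, mask_ratio=0.3):
    """hidden: (T, D); attn: (T,) attention received per token; labels: (T,)."""
    ce = nn.CrossEntropyLoss()
    base_loss = ce(head(hidden), labels)

    k = max(1, int(mask_ratio * hidden.size(0)))
    top = attn.topk(k).indices               # most-attended positions
    masked = hidden.clone()
    masked[top] = 0.0                        # zero out high-attention states
    masked_loss = ce(head(masked), labels)

    # Larger Δ suggests the sample's informative regions carry more signal.
    return (masked_loss - base_loss).item()

T, D, V = 32, 64, 100
head = nn.Linear(D, V)                       # toy prediction head
score = delta_attnmask_score(torch.randn(T, D), torch.rand(T),
                             torch.randint(0, V, (T,)), head)
print(score)
```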

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-08-08 13:25:30 UTC 发布:2025-08-08 13:25:30 UTC

#136 ADT4Coupons: An Innovative Framework for Sequential Coupon Distribution in E-commerce #136 ADT4Coupons:用于电子商务中序列化优惠券分发的创新框架

Authors: [Li Kong](https://arxiv.org/search/?searchtype=author&query=Li Kong), [Bingzhe Wang](https://arxiv.org/search/?searchtype=author&query=Bingzhe Wang), [Zhou Chen](https://arxiv.org/search/?searchtype=author&query=Zhou Chen), [Suhan Hu](https://arxiv.org/search/?searchtype=author&query=Suhan Hu), [Yuchao Ma](https://arxiv.org/search/?searchtype=author&query=Yuchao Ma), [Qi Qi](https://arxiv.org/search/?searchtype=author&query=Qi Qi), [Suoyuan Song](https://arxiv.org/search/?searchtype=author&query=Suoyuan Song), [Bicheng Jin](https://arxiv.org/search/?searchtype=author&query=Bicheng Jin) 作者:Li Kong、Bingzhe Wang、Zhou Chen、Suhan Hu、Yuchao Ma、Qi Qi、Suoyuan Song、Bicheng Jin

Coupon distribution is a critical marketing strategy used by online platforms to boost revenue and enhance user engagement. Regrettably, existing coupon distribution strategies fall far short of effectively leveraging the complex sequential interactions between platforms and users. This critical oversight, despite the abundance of e-commerce log data, has precipitated a performance plateau. In this paper, we focus on the scenario in which platforms make sequential coupon distribution decisions multiple times for various users, with each user interacting with the platform repeatedly. Based on this marketing scenario, we propose a novel marketing framework, named Aligned Decision Transformer for Coupons (ADT4Coupons), to directly devise coupon distribution policies for long-term revenue boosting. ADT4Coupons enables optimized online decision-making in a variety of real-world marketing scenarios. It achieves this by seamlessly integrating three key characteristics: general scenarios, sequential modeling with more comprehensive historical data, and efficient iterative updates within a unified framework. Furthermore, empirical results on a real-world industrial dataset, alongside public and synthetic datasets, demonstrate the superiority of our framework. 优惠券分发是网络平台用于提升收入和增强用户参与度的重要营销策略。不幸的是,现有的优惠券分发策略在有效利用平台与用户之间复杂的序列交互方面远远不足。尽管电子商务日志数据丰富,这一关键忽视导致了性能停滞。在本文中,我们关注这样一种场景:平台针对不同用户多次做出序列化的优惠券分发决策,每个用户与平台反复互动。基于这一营销场景,我们提出了一种新颖的营销框架,称为用于优惠券的对齐决策变换器(Aligned Decision Transformer for Coupons,ADT4Coupons),用于直接制定能够提升长期收入的优惠券分发策略。ADT4Coupons 通过在统一框架内无缝整合三大关键特性——通用场景、更全面历史数据的序列建模以及高效的迭代更新,从而实现对各种真实营销场景中的在线决策进行优化。 此外,在真实工业数据集以及公开和合成数据集上的实证结果表明,我们框架的优越性。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 13:03:17 UTC 发布日期:2025-08-08 13:03:17 UTC

#137 MX-AI: Agentic Observability and Control Platform for Open and AI-RAN #137 MX-AI:用于开放和 AI-RAN 的自主可观测性与控制平台

Authors: [Ilias Chatzistefanidis](https://arxiv.org/search/?searchtype=author&query=Ilias Chatzistefanidis), [Andrea Leone](https://arxiv.org/search/?searchtype=author&query=Andrea Leone), [Ali Yaghoubian](https://arxiv.org/search/?searchtype=author&query=Ali Yaghoubian), [Mikel Irazabal](https://arxiv.org/search/?searchtype=author&query=Mikel Irazabal), [Sehad Nassim](https://arxiv.org/search/?searchtype=author&query=Sehad Nassim), [Lina Bariah](https://arxiv.org/search/?searchtype=author&query=Lina Bariah), [Merouane Debbah](https://arxiv.org/search/?searchtype=author&query=Merouane Debbah), [Navid Nikaein](https://arxiv.org/search/?searchtype=author&query=Navid Nikaein) 作者:Ilias Chatzistefanidis、Andrea Leone、Ali Yaghoubian、Mikel Irazabal、Sehad Nassim、Lina Bariah、Merouane Debbah、Navid Nikaein

Future 6G radio access networks (RANs) will be artificial intelligence (AI)-native: observed, reasoned about, and re-configured by autonomous agents cooperating across the cloud-edge continuum. We introduce MX-AI, the first end-to-end agentic system that (i) instruments a live 5G Open RAN testbed based on OpenAirInterface (OAI) and FlexRIC, (ii) deploys a graph of Large-Language-Model (LLM)-powered agents inside the Service Management and Orchestration (SMO) layer, and (iii) exposes both observability and control functions for 6G RAN resources through natural-language intents. On 50 realistic operational queries, MX-AI attains a mean answer quality of 4.1/5.0 and 100 % decision-action accuracy, while incurring only 8.8 seconds end-to-end latency when backed by GPT-4.1. Thus, it matches human-expert performance, validating its practicality in real settings. We publicly release the agent graph, prompts, and evaluation harness to accelerate open research on AI-native RANs. A live demo is presented here: https://www.youtube.com/watch?v=CEIya7988Ug&t=285s&ab_channel=BubbleRAN 未来的 6G 无线接入网(RAN)将原生支持人工智能(AI):由分布在云-边缘连续体中协作的自主代理进行观测、推理和重新配置。我们推出了 MX-AI,这是第一个端到端的代理系统,它(i)在基于 OpenAirInterface(OAI)和 FlexRIC 的实时 5G Open RAN 测试平台上进行仪表化,(ii)在服务管理与编排(SMO)层内部署了由大型语言模型(LLM)驱动的代理图谱,以及(iii)通过自然语言意图对 6G RAN 资源暴露可观测性和控制功能。在 50 个现实操作查询上,MX-AI 在答案质量上取得了平均 4.1/5.0 的得分并实现了 100%的决策-行动准确率,而在以 GPT-4.1 为后端时端到端延迟仅为 8.8 秒。因此,它达到了人类专家的表现,验证了其在真实环境中的可行性。我们公开发布了代理图谱、提示词和评估工具,以加速关于原生 AI RAN 的开放研究。现场演示见此处:https://www.youtube.com/watch?v=CEIya7988Ug&t=285s&ab_channel=BubbleRAN

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能

Publish: 2025-08-08 12:15:47 UTC 发布:2025-08-08 12:15:47 UTC

#138 FIVA: Federated Inverse Variance Averaging for Universal CT Segmentation with Uncertainty Estimation #138 FIVA:用于具有不确定性估计的通用 CT 分割的联邦逆方差平均

Authors: [Asim Ukaye](https://arxiv.org/search/?searchtype=author&query=Asim Ukaye), [Numan Saeed](https://arxiv.org/search/?searchtype=author&query=Numan Saeed), [Karthik Nandakumar](https://arxiv.org/search/?searchtype=author&query=Karthik Nandakumar) 作者:Asim Ukaye、Numan Saeed、Karthik Nandakumar

Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs. Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference. Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights to provide an on-the-go uncertainty over the model parameters at the client level. The parameters are then aggregated at the server using the additional uncertainty information using a Bayesian-inspired inverse-variance aggregation scheme. Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. In line with recent work shown, predictive uncertainty is utilized in the inference stage to improve predictive performance. Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and uncertainty-weighted inference compared to previously established baselines. The code for this work is made available at: https://github.com/asimukaye/fiva 不同的 CT 分割数据集通常来自不同的扫描仪、在不同的采集设置下获得,并且通常只为有限且经常不重叠的一组器官提供分割标签。在保持患者隐私的同时有效利用这些异构数据可能具有挑战性。本研究提出了一种新颖的联邦学习方法,通过在聚合时利用模型不确定性、在推断时利用预测不确定性,实现跨多样腹部 CT 数据集的通用分割。我们的方法利用随机小批量梯度下降内在的噪声来估计模型权重的分布,从而在客户端提供关于模型参数的即时不确定性。然后,服务器使用额外的不确定性信息,通过受贝叶斯启发的逆方差聚合方案对参数进行聚合。此外,所提出的方法通过传播来自模型权重的不确定性来量化预测不确定性,从而提供对临床决策至关重要的置信度度量。 与近期工作的做法一致,在推断阶段利用预测不确定性来提升预测性能。实验评估表明,与先前的基线方法相比,该方法在改善联邦聚合质量和基于不确定性的加权推断方面均有效。该工作的代码已公开: https://github.com/asimukaye/fiva
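
The inverse-variance aggregation step can be written down directly; the sketch below is a generic Bayesian-style precision-weighted average over client weight estimates, not the exact FIVA implementation (see the linked repository for that).

```python
# Sketch of inverse-variance (precision-weighted) federated aggregation.
import numpy as np

def inverse_variance_aggregate(client_means, client_vars, eps=1e-8):
    """Each client reports a weight estimate and a per-parameter variance.
    Parameters with lower uncertainty contribute more to the global model."""
    means = np.stack(client_means)              # (num_clients, num_params)
    precisions = 1.0 / (np.stack(client_vars) + eps)
    global_w = (precisions * means).sum(axis=0) / precisions.sum(axis=0)
    global_var = 1.0 / precisions.sum(axis=0)   # aggregated uncertainty
    return global_w, global_var

w, v = inverse_variance_aggregate(
    client_means=[np.array([0.9, -0.2]), np.array([1.1, -0.1])],
    client_vars=[np.array([0.01, 0.50]), np.array([0.04, 0.02])],
)
print(w, v)
```

The aggregated variance can then be propagated forward at inference time, which is how the abstract motivates its uncertainty-weighted predictions.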

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 主题:图像与视频处理、人工智能、计算机视觉与模式识别、机器学习

Publish: 2025-08-08 11:34:01 UTC 发布:2025-08-08 11:34:01 UTC

#139 impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction #139 impuTMAE:用于癌症生存预测中缺失模态插补的多模态变换器与掩码预训练

Authors: [Maria Boyko](https://arxiv.org/search/?searchtype=author&query=Maria Boyko), [Aleksandra Beliaeva](https://arxiv.org/search/?searchtype=author&query=Aleksandra Beliaeva), [Dmitriy Kornilov](https://arxiv.org/search/?searchtype=author&query=Dmitriy Kornilov), [Alexander Bernstein](https://arxiv.org/search/?searchtype=author&query=Alexander Bernstein), [Maxim Sharaev](https://arxiv.org/search/?searchtype=author&query=Maxim Sharaev) 作者:Maria Boyko、Aleksandra Beliaeva、Dmitriy Kornilov、Alexander Bernstein、Maxim Sharaev

The use of diverse modalities, such as omics, medical images, and clinical data can not only improve the performance of prognostic models but also deepen an understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and contain missing modalities, making their effective handling crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at https://github.com/maryjis/mtcp 使用多种模态(如组学、医学影像和临床数据)不仅可以提升预后模型的性能,还能加深对疾病机制的理解并促进新治疗方法的开发。然而,医学数据复杂、常常不完整且包含缺失模态,有效处理这些数据对训练多模态模型至关重要。我们提出了 impuTMAE,一种新颖的基于 Transformer 的端到端方法,具有高效的多模态预训练策略。它在学习模态间和模态内交互的同时,通过重建被掩盖的补丁来对缺失模态进行插补。我们的模型在异质、不完整的数据上进行预训练,并在 TCGA-GBM/LGG 和 BraTS 数据集上微调用于胶质瘤生存期预测,整合了五种模态:遗传(DNAm、RNA-seq)、影像(MRI、WSI)和临床数据。通过在预训练阶段处理缺失数据并实现高效的资源利用,impuTMAE 超越了以往的多模态方法,在胶质瘤患者生存预测上达到了最先进的性能。我们的代码可在 https://github.com/maryjis/mtcp 获取。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-08 10:01:16 UTC 发布时间:2025-08-08 10:01:16 UTC

#140 Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments #140 在去中心化环境中通过元学习加速大模型推理

Authors: [Yipeng Du](https://arxiv.org/search/?searchtype=author&query=Yipeng Du), [Zihao Wang](https://arxiv.org/search/?searchtype=author&query=Zihao Wang), [Ahmad Farhan](https://arxiv.org/search/?searchtype=author&query=Ahmad Farhan), [Claudio Angione](https://arxiv.org/search/?searchtype=author&query=Claudio Angione), [Harry Yang](https://arxiv.org/search/?searchtype=author&query=Harry Yang), [Fielding Johnston](https://arxiv.org/search/?searchtype=author&query=Fielding Johnston), [James P. Buban](https://arxiv.org/search/?searchtype=author&query=James P. Buban), [Patrick Colangelo](https://arxiv.org/search/?searchtype=author&query=Patrick Colangelo), [Yue Zhao](https://arxiv.org/search/?searchtype=author&query=Yue Zhao), [Yuzhe Yang](https://arxiv.org/search/?searchtype=author&query=Yuzhe Yang) 作者:杜一鹏、王子浩、Ahmad Farhan、Claudio Angione、Harry Yang、Fielding Johnston、James P. Buban、Patrick Colangelo、赵跃、杨宇哲

The deployment of large-scale models, such as large language models (LLMs), incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to scalability and data security, there is a growing shift towards decentralized systems for model deployment, where choosing efficient inference acceleration schemes become crucial to manage computational resources effectively and enhance system responsiveness. In this work, we address the challenge of selecting optimal acceleration methods in decentralized systems by introducing a meta-learning-based framework. This framework automates the selection process by learning from historical performance data of various acceleration techniques across different tasks. Unlike traditional methods that rely on random selection or expert intuition, our approach systematically identifies the best acceleration strategies based on the specific characteristics of each task. We demonstrate that our meta-learning framework not only streamlines the decision-making process but also consistently outperforms conventional methods in terms of efficiency and performance. Our results highlight the potential of inference acceleration in decentralized AI systems, offering a path towards more democratic and economically feasible artificial intelligence solutions. 像大型语言模型(LLMs)等大规模模型的部署因其计算需求而产生大量成本。为降低这些成本并应对可扩展性和数据安全相关的挑战,模型部署正逐步转向去中心化系统,在这种情况下选择高效的推理加速方案对于有效管理计算资源并提升系统响应能力变得至关重要。在这项工作中,我们通过引入一种基于元学习的框架来解决在去中心化系统中选择最优加速方法的挑战。该框架通过学习不同任务上各种加速技术的历史性能数据来自动化选择过程。不同于依赖随机选择或专家直觉的传统方法,我们的方法根据每个任务的具体特征系统地识别出最佳加速策略。我们证明了我们的元学习框架不仅简化了决策过程,而且在效率和性能方面始终优于传统方法。 我们的研究结果突显了在去中心化人工智能系统中加速推理的潜力,为更民主化且经济可行的人工智能解决方案提供了一条路径。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 09:53:53 UTC 发布时间:2025-08-08 09:53:53 协调世界时 (UTC)

#141 Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation RL #141 多目标指令感知表示学习在程序化内容生成强化学习中的应用

Authors: [Sung-Hyun Kim](https://arxiv.org/search/?searchtype=author&query=Sung-Hyun Kim), [In-Chang Baek](https://arxiv.org/search/?searchtype=author&query=In-Chang Baek), [Seo-Young Lee](https://arxiv.org/search/?searchtype=author&query=Seo-Young Lee), [Geum-Hwan Hwang](https://arxiv.org/search/?searchtype=author&query=Geum-Hwan Hwang), [Kyung-Joong Kim](https://arxiv.org/search/?searchtype=author&query=Kyung-Joong Kim) 作者:Sung-Hyun Kim、In-Chang Baek、Seo-Young Lee、Geum-Hwan Hwang、Kyung-Joong Kim

Recent advancements in generative modeling emphasize the importance of natural language as a highly expressive and accessible modality for controlling content generation. However, existing instructed reinforcement learning for procedural content generation (IPCGRL) methods often struggle to leverage the expressive richness of textual input, especially under complex, multi-objective instructions, leading to limited controllability. To address this problem, we propose \textit{MIPCGRL}, a multi-objective representation learning method for instructed content generators, which incorporates sentence embeddings as conditions. MIPCGRL effectively trains a multi-objective embedding space by incorporating multi-label classification and multi-head regression networks. Experimental results show that the proposed method achieves up to a 13.8% improvement in controllability with multi-objective instructions. The ability to process complex instructions enables more expressive and flexible content generation. 近年来生成模型的进展强调了自然语言作为一种高度表达性且易于访问的模态,在控制内容生成方面的重要性。然而,现有用于程序化内容生成的指令式强化学习(IPCGRL)方法常难以充分利用文本输入的表达丰富性,尤其是在复杂的、多目标指令下,导致可控性受限。为了解决这一问题,我们提出了 MIPCGRL,一种用于指令式内容生成器的多目标表示学习方法,该方法将句子嵌入作为条件。MIPCGRL 通过引入多标签分类和多头回归网络,有效训练了一个多目标嵌入空间。实验结果表明,所提方法在多目标指令下可控性提升最多可达 13.8%。处理复杂指令的能力使得内容生成更加富有表达性和灵活性。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 09:41:42 UTC 发布:2025-08-08 09:41:42 UTC

#142 Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing #142 扩散 LLMs 能通过离散扩散强迫实现比自回归更快的推理

Authors: [Xu Wang](https://arxiv.org/search/?searchtype=author&query=Xu Wang), [Chenkai Xu](https://arxiv.org/search/?searchtype=author&query=Chenkai Xu), [Yijie Jin](https://arxiv.org/search/?searchtype=author&query=Yijie Jin), [Jiachun Jin](https://arxiv.org/search/?searchtype=author&query=Jiachun Jin), [Hao Zhang](https://arxiv.org/search/?searchtype=author&query=Hao Zhang), [Zhijie Deng](https://arxiv.org/search/?searchtype=author&query=Zhijie Deng) 作者:Xu Wang、Chenkai Xu、Yijie Jin、Jiachun Jin、Hao Zhang、Zhijie Deng

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing. 扩散大型语言模型(dLLMs)已成为自回归(AR)LLMs 在文本生成方面的有前景替代方案,能够在单次迭代中解码多个标记。然而,现有的开源 dLLMs 中还没有任何一个在推理速度上超越相似规模的 AR LLMs。本文基于一种简单而有效的策略——离散扩散强制(D2F)突破了这一壁垒。D2F 为 dLLMs 提供了两个关键能力:(1) 块状自回归生成以实现 KV 缓存的利用;(2) 在不需要完成先前块的情况下预测后续标记,从而实现块间并行解码。通过这种方式,原始 dLLMs 被改造为用于高效推理的 AR-扩散混合范式。D2F 可通过基于预训练 dLLMs 的非对称蒸馏过程实现。我们进一步提出了一种流水线并行解码算法,使效率与效果之间可以权衡。实证结果表明,D2F dLLMs 在 GSM8K 上相比 LLaMA3 和 Qwen2.5 实现了超过 2.5× 的推理速度提升。 与原生离散化大语言模型(如 LLaDA 和 Dream)相比,加速可以超过 50× ,同时保持可比的输出质量。代码可在 https://github.com/zhijie-group/Discrete-Diffusion-Forcing 获取。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 04:51:37 UTC 发布:2025-08-08 04:51:37 UTC

#143 From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization #143 从数值到符号:一种由 LLM 驱动的通过符号离散实现上下文感知时间序列预测的框架

Authors: [Xiaoyu Tao](https://arxiv.org/search/?searchtype=author&query=Xiaoyu Tao), [Shilong Zhang](https://arxiv.org/search/?searchtype=author&query=Shilong Zhang), [Mingyue Cheng](https://arxiv.org/search/?searchtype=author&query=Mingyue Cheng), [Daoyu Wang](https://arxiv.org/search/?searchtype=author&query=Daoyu Wang), [Tingyue Pan](https://arxiv.org/search/?searchtype=author&query=Tingyue Pan), [Bokai Pan](https://arxiv.org/search/?searchtype=author&query=Bokai Pan), [Changqing Zhang](https://arxiv.org/search/?searchtype=author&query=Changqing Zhang), [Shijin Wang](https://arxiv.org/search/?searchtype=author&query=Shijin Wang) 作者:陶晓宇,张世龙,程明岳,王道宇,潘亭月,潘博凯,张长青,王世金

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, an LLM-driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained large language model (LLM), further optimized with autoregressive generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on diverse real-world datasets enriched with contextual features demonstrate the effectiveness and generalizability of TokenCast. 时间序列预测在能源、医疗和金融等众多关键应用中对决策支持具有重要作用。尽管近年来取得了进展,但由于将历史数值序列与通常由非结构化文本数据构成的上下文特征融合的困难,预测精度仍然受限。为了解决这一挑战,我们提出了 TokenCast,这是一种由 LLM 驱动的框架,利用基于语言的符号表示作为用于感知上下文的时间序列预测的统一中介。具体而言,TokenCast 使用离散分词器将连续的数值序列转换为时间令牌,从而实现与基于语言的输入的结构对齐。为弥合模态之间的语义差距,时间令牌和上下文令牌都通过预训练的大型语言模型(LLM)嵌入到共享表示空间,并通过自回归生成目标进一步优化。在此统一语义空间的基础上,经过对齐的 LLM 随后以监督方式微调以预测未来的时间令牌,然后将其解码回原始数值空间。 在包含上下文特征的多样化真实数据集上进行的大量实验表明,TokenCast 的有效性和泛化能力。
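
The "values to tokens" step can be illustrated with a simple quantile-binning tokenizer; TokenCast learns its discrete tokenizer, so the class below is only an analogy with hypothetical bin counts and toy data.

```python
# Illustrative sketch of a discrete tokenizer for time series (quantile
# binning): continuous values become token ids that an LLM can consume,
# and decoding maps tokens back to bin centers.
import numpy as np

class QuantileTokenizer:
    def __init__(self, num_tokens: int = 16):
        self.num_tokens = num_tokens
        self.edges = None
        self.centers = None

    def fit(self, series: np.ndarray):
        qs = np.linspace(0, 1, self.num_tokens + 1)
        self.edges = np.quantile(series, qs)
        self.centers = 0.5 * (self.edges[:-1] + self.edges[1:])
        return self

    def encode(self, series: np.ndarray) -> np.ndarray:
        ids = np.searchsorted(self.edges, series, side="right") - 1
        return np.clip(ids, 0, self.num_tokens - 1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return self.centers[tokens]

x = np.sin(np.linspace(0, 6, 200)) + 0.1 * np.random.randn(200)
tok = QuantileTokenizer(num_tokens=8).fit(x)
ids = tok.encode(x)                 # temporal tokens, alignable with text tokens
print(ids[:10], tok.decode(ids)[:3])
```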

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 03:51:08 UTC 发布:2025-08-08 03:51:08 UTC

#144 Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks #144 通过无训练持续投影的细粒度安全神经元以降低 LLM 微调风险

Authors: [Bing Han](https://arxiv.org/search/?searchtype=author&query=Bing Han), [Feifei Zhao](https://arxiv.org/search/?searchtype=author&query=Feifei Zhao), [Dongcheng Zhao](https://arxiv.org/search/?searchtype=author&query=Dongcheng Zhao), [Guobin Shen](https://arxiv.org/search/?searchtype=author&query=Guobin Shen), [Ping Wu](https://arxiv.org/search/?searchtype=author&query=Ping Wu), [Yu Shi](https://arxiv.org/search/?searchtype=author&query=Yu Shi), [Yi Zeng](https://arxiv.org/search/?searchtype=author&query=Yi Zeng) 作者:韩冰、赵飞菲、赵东成、沈国彬、吴萍、史宇、曾毅

Fine-tuning as a service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model’s utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns. 作为服务的微调将特定领域知识注入大型语言模型(LLMs),但也挑战了原有的对齐机制并带来安全风险。一系列防御策略已被提出以应对对齐、微调和微调后阶段,其中大多数微调后防御依赖于粗粒度的安全层映射。这些方法未能全面兼顾安全层与细粒度神经元,限制了它们在安全性与效用之间高效平衡的能力。为了解决这一问题,我们提出了细粒度安全神经元(FGSN)及其无训练持续投影方法,以降低微调带来的安全风险。FGSN 天然整合了安全层与神经元之间的多尺度交互,能够定位更稀疏且更精确的细粒度安全神经元,同时最小化对下游任务神经元的干扰。随后我们将安全神经元参数投影到安全方向上,从而在提高模型安全性的同时,更贴近人类偏好。 在多个经过微调的 LLM 模型上的大量实验表明,我们的方法在仅对参数进行极小修改的情况下,能显著降低有害性评分和攻击成功率,同时保持模型的效用。此外,通过引入一种任务特定的多维异构安全神经元簇优化机制,我们实现了对意外出现的新兴安全问题的持续防护和泛化能力。
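
A hedged illustration of the projection step: given hypothetical safety-neuron indices and a safety direction (how these are identified is the paper's contribution and is not reproduced here), selected weight rows are moved toward their component along that direction.

```python
# Illustration only: project selected "safety neuron" weight rows onto a
# given safety direction. Indices and the direction are hypothetical inputs.
import torch

def project_safety_neurons(weight, neuron_idx, safety_dir, strength=1.0):
    """weight: (out_features, in_features); neuron_idx: rows to adjust;
    safety_dir: (in_features,) vector defining the safety direction."""
    d = safety_dir / safety_dir.norm()
    rows = weight[neuron_idx]                       # (k, in_features)
    proj = (rows @ d).unsqueeze(1) * d              # component along the direction
    out = weight.clone()                            # leave other neurons untouched
    out[neuron_idx] = (1 - strength) * rows + strength * proj
    return out

W = torch.randn(8, 16)
W_new = project_safety_neurons(W, neuron_idx=[1, 4], safety_dir=torch.randn(16))
print((W - W_new).abs().sum())                      # only rows 1 and 4 change
```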

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-08 03:20:25 UTC 发布:2025-08-08 03:20:25 UTC

#145 Hybrid(Transformer+CNN)-based Polyp Segmentation #145 基于混合(Transformer+CNN)的息肉分割

Author: [Madan Baduwal](https://arxiv.org/search/?searchtype=author&query=Madan Baduwal) 作者:Madan Baduwal

Colonoscopy is still the main method of detection and segmentation of colonic polyps, and recent advancements in deep learning networks such as U-Net, ResUNet, Swin-UNet, and PraNet have achieved outstanding performance in polyp segmentation. Yet, the problem is extremely challenging due to high variation in size, shape, endoscopy types, lighting, imaging protocols, and ill-defined boundaries (fluid, folds) of the polyps, rendering accurate segmentation a difficult task. To address these critical challenges in polyp segmentation, we introduce a hybrid (Transformer + CNN) model that is crafted to enhance robustness against evolving polyp characteristics. Our hybrid architecture demonstrates superior performance over existing solutions, particularly in addressing two critical challenges: (1) accurate segmentation of polyps with ill-defined margins through boundary-aware attention mechanisms, and (2) robust feature extraction in the presence of common endoscopic artifacts, including specular highlights, motion blur, and fluid occlusions. Quantitative evaluations reveal significant improvements in segmentation accuracy (Recall improved by 1.76%, i.e., 0.9555, accuracy improved by 0.07%, i.e., 0.9849) and artifact resilience compared to state-of-the-art polyp segmentation methods. 结肠镜检查仍然是结肠息肉检测和分割的主要方法,近期诸如 U-Net、ResUNet、Swin-UNet 和 PraNet 等深度学习网络在息肉分割方面取得了出色的表现。然而,由于息肉在大小、形状、内镜类型、光照、成像协议方面存在高度变化,以及息肉边界不清(液体、皱褶),准确分割仍然极具挑战性且问题重重。为了解决息肉分割中的这些关键难题,我们引入了一种混合(Transformer + CNN)模型,旨在增强对不断变化的息肉特征的鲁棒性。我们的混合架构在现有方案之上表现出优越性,尤其在应对两类关键挑战上: (1) 通过边界感知注意力机制实现对边界不清息肉的准确分割;(2) 在常见内镜伪影(包括镜面高光、运动模糊和液体遮挡)存在的情况下实现鲁棒的特征提取。 定量评估显示,与最先进的息肉分割方法相比,分割精度和抗伪影能力显著提升(召回率提高了 1.76%,即 0.9555;准确率提高了 0.07%,即 0.9849)。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-08 01:42:05 UTC 发布时间:2025-08-08 01:42:05 UTC

#146 RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System #146 RL-MoE:一种用于智能交通系统的基于图像的隐私保护方法

Authors: [Abdolazim Rezaei](https://arxiv.org/search/?searchtype=author&query=Abdolazim Rezaei), [Mehdi Sookhak](https://arxiv.org/search/?searchtype=author&query=Mehdi Sookhak), [Mahboobeh Haghparast](https://arxiv.org/search/?searchtype=author&query=Mahboobeh Haghparast) 作者:Abdolazim Rezaei、Mehdi Sookhak、Mahboobeh Haghparast

The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the fundamental right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this impasse, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks. 在智能交通系统(ITS)中,AI 驱动摄像头的普及在丰富视觉数据需求与基本隐私权之间造成了严重冲突。现有的隐私保护机制,如模糊处理或加密,往往不足以应对,从而产生一种不理想的权衡:要么在面对高级重建攻击时隐私受损,要么数据实用性被严重削弱。为了解决这一僵局,我们提出了 RL-MoE,一种将敏感视觉数据转换为隐私保护文本描述的新型框架,从根本上消除了直接传输图像的必要性。RL-MoE 独特地将混合专家(MoE)架构用于细致、多层面的场景分解,与一个强化学习(RL)代理结合,该代理针对语义准确性和隐私保护这两个目标对生成文本进行优化。大量实验表明,RL-MoE 提供了更出色的隐私保护,在 CFP-FP 数据集上将重放攻击的成功率降至仅 9.4%,与此同时生成的文本内容也比基线方法更为丰富。 我们的工作为在隐私敏感领域构建值得信赖的人工智能系统提供了一个实用且可扩展的解决方案,为更安全的智慧城市和自动驾驶车辆网络铺平了道路。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-07 18:07:54 UTC 发布:2025-08-07 18:07:54 UTC

#147 A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality #147 一个用于在增强现实中可解释认知攻击检测的神经符号框架

Authors: [Rongqian Chen](https://arxiv.org/search/?searchtype=author&query=Rongqian Chen), [Allison Andreyev](https://arxiv.org/search/?searchtype=author&query=Allison Andreyev), [Yanming Xiu](https://arxiv.org/search/?searchtype=author&query=Yanming Xiu), [Mahdi Imani](https://arxiv.org/search/?searchtype=author&query=Mahdi Imani), [Bin Li](https://arxiv.org/search/?searchtype=author&query=Bin Li), [Maria Gorlatova](https://arxiv.org/search/?searchtype=author&query=Maria Gorlatova), [Gang Tan](https://arxiv.org/search/?searchtype=author&query=Gang Tan), [Tian Lan](https://arxiv.org/search/?searchtype=author&query=Tian Lan) 作者:陈荣谦、Allison Andreyev、肖炎明、Mahdi Imani、李斌、Maria Gorlatova、谭刚、兰天

Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users’ semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning – a sequential Monte Carlo method – to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLM and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection. 增强现实(AR)通过在物理世界上叠加虚拟元素来丰富感知。由于其日益普及,旨在通过篡改 AR 内容来操控用户语义感知的认知攻击越来越受到关注。现有检测方法常侧重于视觉变化,仅限于像素或图像层面的处理,缺乏语义推理能力,或者依赖于预训练的视觉-语言模型(VLM),这些方法作为黑箱式手段可解释性有限。本文提出 CADAR,一种用于 AR 认知攻击检测的新型神经符号方法。它利用神经 VLM 融合多模态视觉-语言输入,以获得符号化的感知图表示,并结合先验知识、显著性加权和时间相关性。模型随后采用基于粒子滤波的统计推理——一种序贯蒙特卡罗方法——来检测认知攻击。因此,CADAR 兼具预训练 VLM 的适应性与粒子滤波的可解释性和推理严谨性。 在扩展的增强现实认知攻击数据集上的实验表明,在具有挑战性的增强现实攻击场景中,相较于强基线模型准确率最多提升了 10.7%,强调了神经符号方法在实现有效且可解释的认知攻击检测方面的潜力。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-07 17:59:49 UTC 发表时间:2025-08-07 17:59:49 UTC

#148 HiSTM: Hierarchical Spatiotemporal Mamba for Cellular Traffic Forecasting #148 HiSTM:用于蜂窝流量预测的分层时空曼巴模型

Authors: [Zineddine Bettouche](https://arxiv.org/search/?searchtype=author&query=Zineddine Bettouche), [Khalid Ali](https://arxiv.org/search/?searchtype=author&query=Khalid Ali), [Andreas Fischer](https://arxiv.org/search/?searchtype=author&query=Andreas Fischer), [Andreas Kassler](https://arxiv.org/search/?searchtype=author&query=Andreas Kassler) 作者:Zineddine Bettouche、Khalid Ali、Andreas Fischer、Andreas Kassler

Cellular traffic forecasting is essential for network planning, resource allocation, or load-balancing traffic across cells. However, accurate forecasting is difficult due to intricate spatial and temporal patterns that exist due to the mobility of users. Existing AI-based traffic forecasting models often trade-off accuracy and computational efficiency. We present Hierarchical SpatioTemporal Mamba (HiSTM), which combines a dual spatial encoder with a Mamba-based temporal module and attention mechanism. HiSTM employs selective state space methods to capture spatial and temporal patterns in network traffic. In our evaluation, we use a real-world dataset to compare HiSTM against several baselines, showing a 29.4% MAE improvement over the STN baseline while using 94% fewer parameters. We show that the HiSTM generalizes well across different datasets and improves in accuracy over longer time-horizons. 蜂窝流量预测对于网络规划、资源分配或跨小区的负载均衡至关重要。然而,由于用户移动性导致的复杂时空模式,准确的预测非常困难。现有基于 AI 的流量预测模型常常在精度和计算效率之间权衡。我们提出了分层时空 Mamba(Hierarchical SpatioTemporal Mamba,HiSTM),它将双重空间编码器与基于 Mamba 的时间模块和注意力机制相结合。HiSTM 使用选择性状态空间方法来捕捉网络流量中的时空模式。在评估中,我们使用真实世界数据集将 HiSTM 与若干基线方法进行比较,显示其在平均绝对误差(MAE)上比 STN 基线提升了 29.4%,同时参数量减少了 94%。我们展示了 HiSTM 在不同数据集上具有良好的泛化能力,并且在更长时间跨度上的预测精度有所提高。

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能

Publish: 2025-08-07 14:18:18 UTC 发布:2025-08-07 14:18:18 UTC

#149 Quantum-Efficient Reinforcement Learning Solutions for Last-Mile On-Demand Delivery #149 面向最后一公里按需配送的量子高效强化学习解决方案

Authors: [Farzan Moosavi](https://arxiv.org/search/?searchtype=author&query=Farzan Moosavi), [Bilal Farooq](https://arxiv.org/search/?searchtype=author&query=Bilal Farooq) 作者:Farzan Moosavi, Bilal Farooq

Quantum computation has emerged as a promising alternative for solving NP-hard combinatorial problems. In particular, classical optimization approaches become intractable for large-scale instances. We investigate quantum computing to solve the large-scale Capacitated Pickup and Delivery Problem with Time Windows (CPDPTW). In this regard, a Reinforcement Learning (RL) framework augmented with a Parametrized Quantum Circuit (PQC) is designed to minimize the travel time in a realistic last-mile on-demand delivery. A novel problem-specific encoding quantum circuit with an entangling and variational layer is proposed. Moreover, Proximal Policy Optimization (PPO) and Quantum Singular Value Transformation (QSVT) are designed for comparison through numerical experiments, highlighting the superiority of the proposed method in terms of the scale of the solution and training complexity while incorporating the real-world constraints. 量子计算已展示出解决 NP 难组合问题的有前景替代方案。具体而言,在优化问题上,传统方法在处理大规模解时变得难以应对。本文特别探讨利用量子计算来解决带时间窗的带容量取送货问题(CPDPTW)的大规模实例。为此,设计了一个用参数化量子电路(PQC)增强的强化学习(RL)框架,以最小化现实最后一公里按需配送中的行驶时间。提出了一种具有纠缠层和变分层的新型问题专用编码量子电路。此外,通过数值实验设计了近端策略优化(PPO)和量子奇异值变换(QSVT)作为对比,强调了所提方法在解的规模和训练复杂度方面的优越性,同时纳入了现实世界的约束。

Subjects: Quantum Physics, Artificial Intelligence, Machine Learning, Optimization and Control 主题:量子物理、人工智能、机器学习、优化与控制

Publish: 2025-08-07 13:50:43 UTC 发布时间:2025-08-07 13:50:43 UTC

#150 Long-Term Client Selection for Federated Learning with Non-IID Data: A Truthful Auction Approach #150 长期客户端选择用于非独立同分布数据的联邦学习:一种诚实的拍卖方法

Authors: [Jinghong Tan](https://arxiv.org/search/?searchtype=author&query=Jinghong Tan), [Zhian Liu](https://arxiv.org/search/?searchtype=author&query=Zhian Liu), [Kun Guo](https://arxiv.org/search/?searchtype=author&query=Kun Guo), [Mingxiong Zhao](https://arxiv.org/search/?searchtype=author&query=Mingxiong Zhao) 作者:谭靖宏,刘志安,郭昆,赵明雄

Federated learning (FL) provides a decentralized framework that enables universal model training through collaborative efforts on mobile nodes, such as smart vehicles in the Internet of Vehicles (IoV). Each smart vehicle acts as a mobile client, contributing to the process without uploading local data. This method leverages non-independent and identically distributed (non-IID) training data from different vehicles, influenced by various driving patterns and environmental conditions, which can significantly impact model convergence and accuracy. Although client selection can be a feasible solution for non-IID issues, it faces challenges related to selection metrics. Traditional metrics evaluate client data quality independently per round and require client selection after all clients complete local training, leading to resource wastage from unused training results. In the IoV context, where vehicles have limited connectivity and computational resources, information asymmetry in client selection risks clients submitting false information, potentially making the selection ineffective. To tackle these challenges, we propose a novel Long-term Client-Selection Federated Learning based on Truthful Auction (LCSFLA). This scheme maximizes social welfare with consideration of long-term data quality using a new assessment mechanism and energy costs, and the advised auction mechanism with a deposit requirement incentivizes client participation and ensures information truthfulness. We theoretically prove the incentive compatibility and individual rationality of the advised incentive mechanism. Experimental results on various datasets, including those from IoV scenarios, demonstrate its effectiveness in mitigating performance degradation caused by non-IID data. 联邦学习(FL)提供了一种去中心化框架,能够通过移动节点(如车联网中的智能车辆)的协作实现通用模型训练。每辆智能车辆作为移动客户端参与贡献,而无需上传本地数据。该方法利用来自不同车辆的非独立同分布(non-IID)训练数据,这些数据受各种驾驶模式和环境条件影响,可能会显著影响模型的收敛性和准确性。尽管客户端选择可以作为解决非 IID 问题的可行方案,但它在选择指标方面面临挑战。传统指标在每轮中独立评估客户端数据质量,并要求在所有客户端完成本地训练后才进行客户端选择,导致未使用的训练结果浪费资源。在车联网场景中,由于车辆的连接性和计算资源有限,客户端选择中的信息不对称使得客户端可能提交虚假信息,从而可能使选择失效。 为应对这些挑战,我们提出了一种基于真实竞拍的长期客户端选择联邦学习新方法(LCSFLA)。该方案通过一种新的评估机制和能耗成本考量长期数据质量,从而最大化社会福利;并且所提出的带保证金要求的拍卖机制激励客户端参与并确保信息真实性。我们在理论上证明了所提激励机制的激励相容性和个体理性。针对包括车联网(IoV)场景在内的多种数据集的实验结果表明,该方法能有效缓解由非独立同分布(non-IID)数据引起的性能下降。

Subjects: Machine Learning, Artificial Intelligence, Systems and Control 主题:机器学习、人工智能、系统与控制

Publish: 2025-08-07 12:30:52 UTC 发布:2025-08-07 12:30:52 UTC
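
To make the incentive side of #150 concrete, here is a minimal Python sketch of a Vickrey-style (critical-payment) reverse auction for client selection, under which truthful cost reporting is a dominant strategy. The function and variable names are illustrative assumptions; LCSFLA's actual mechanism additionally uses deposits and a long-term data-quality assessment that are omitted here.

```python
import numpy as np

def select_clients_truthful(quality, bids, k):
    """Vickrey-style reverse auction: pick the k clients with the best
    quality-per-cost ratio and pay each the critical price at which it
    would still have won, so truthful bidding is a dominant strategy.

    quality: long-term data-quality scores, shape (n,)
    bids:    reported per-round costs, shape (n,)
    k:       number of clients to select
    """
    ratio = quality / bids                       # higher is better
    order = np.argsort(-ratio)                   # best ratio first
    winners = order[:k]
    runner_up = ratio[order[k]] if len(order) > k else 0.0
    # critical payment: the largest bid a winner could have reported
    # while still beating the best losing ratio
    payments = quality[winners] / max(runner_up, 1e-12)
    return winners, payments

# toy usage
rng = np.random.default_rng(0)
q, b = rng.uniform(0.5, 1.0, 10), rng.uniform(0.1, 1.0, 10)
w, p = select_clients_truthful(q, b, k=3)
```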

#151 scAGC: Learning Adaptive Cell Graphs with Contrastive Guidance for Single-Cell Clustering #151 scAGC:使用对比引导学习自适应细胞图以进行单细胞聚类

Authors: [Huifa Li](https://arxiv.org/search/?searchtype=author&query=Huifa Li), [Jie Fu](https://arxiv.org/search/?searchtype=author&query=Jie Fu), [Xinlin Zhuang](https://arxiv.org/search/?searchtype=author&query=Xinlin Zhuang), [Haolin Yang](https://arxiv.org/search/?searchtype=author&query=Haolin Yang), [Xinpeng Ling](https://arxiv.org/search/?searchtype=author&query=Xinpeng Ling), [Tong Cheng](https://arxiv.org/search/?searchtype=author&query=Tong Cheng), [Haochen xue](https://arxiv.org/search/?searchtype=author&query=Haochen xue), [Imran Razzak](https://arxiv.org/search/?searchtype=author&query=Imran Razzak), [Zhili Chen](https://arxiv.org/search/?searchtype=author&query=Zhili Chen) 作者:李惠发、傅杰、庄新林、杨浩林、凌新鹏、程通、薛浩晨、Imran Razzak、陈志立

Accurate cell type annotation is a crucial step in analyzing single-cell RNA sequencing (scRNA-seq) data, which provides valuable insights into cellular heterogeneity. However, due to the high dimensionality and prevalence of zero elements in scRNA-seq data, traditional clustering methods face significant statistical and computational challenges. While some advanced methods use graph neural networks to model cell-cell relationships, they often depend on static graph structures that are sensitive to noise and fail to capture the long-tailed distribution inherent in single-cell populations. To address these limitations, we propose scAGC, a single-cell clustering method that learns adaptive cell graphs with contrastive guidance. Our approach optimizes feature representations and cell graphs simultaneously in an end-to-end manner. Specifically, we introduce a topology-adaptive graph autoencoder that leverages a differentiable Gumbel-Softmax sampling strategy to dynamically refine the graph structure during training. This adaptive mechanism mitigates the problem of a long-tailed degree distribution by promoting a more balanced neighborhood structure. To model the discrete, over-dispersed, and zero-inflated nature of scRNA-seq data, we integrate a Zero-Inflated Negative Binomial (ZINB) loss for robust feature reconstruction. Furthermore, a contrastive learning objective is incorporated to regularize the graph learning process and prevent abrupt changes in the graph topology, ensuring stability and enhancing convergence. Comprehensive experiments on 9 real scRNA-seq datasets demonstrate that scAGC consistently outperforms other state-of-the-art methods, yielding the best NMI and ARI scores on 9 and 7 datasets, respectively. Our code is available at Anonymous Github. 准确的细胞类型注释是分析单细胞 RNA 测序(scRNA-seq)数据的关键步骤,它为揭示细胞异质性提供了重要见解。然而,由于 scRNA-seq 数据具有高维性和大量零值,传统的聚类方法面临显著的统计和计算挑战。尽管一些先进方法使用图神经网络来建模细胞间关系,但它们常常依赖对噪声敏感的静态图结构,且无法捕捉单细胞群体固有的长尾分布。为了解决这些局限性,我们提出了 scAGC,一种通过对比引导学习自适应细胞图的单细胞聚类方法。我们的方法在端到端的方式下同时优化特征表示和细胞图。具体而言,我们引入了一种拓扑自适应图自编码器,利用可微的 Gumbel-Softmax 采样策略在训练过程中动态地细化图结构。该自适应机制通过促进更为平衡的邻域结构,缓解了度分布长尾化的问题。为了建模 scRNA-seq 数据的离散性、过度离散性和零膨胀特性,我们引入了零膨胀负二项(ZINB)损失以实现稳健的特征重建。此外,还加入了对比学习目标来正则化图学习过程,防止图拓扑发生突变,从而保证稳定性并提升收敛性。在 9 个真实 scRNA-seq 数据集上的全面实验表明,scAGC 在各项指标上持续优于其他最先进方法,在 9 个数据集上获得了最佳的 NMI,在 7 个数据集上获得了最佳的 ARI。我们的代码可在 Anonymous Github 获取。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-07 10:55:52 UTC 发布:2025-08-07 10:55:52 UTC
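
The ZINB reconstruction loss named in #151 has a standard form used across scRNA-seq models; below is a minimal PyTorch sketch of its negative log-likelihood. The parameterization follows that common form and is an assumption, not code from the scAGC paper.

```python
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Negative log-likelihood of a zero-inflated negative binomial.
    x: observed counts, mu: NB mean, theta: NB dispersion, pi: zero-inflation probability."""
    log_theta_mu = torch.log(theta + mu + eps)
    # log NB(x; mu, theta)
    log_nb = (theta * (torch.log(theta + eps) - log_theta_mu)
              + x * (torch.log(mu + eps) - log_theta_mu)
              + torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1.0))
    # a zero count can come from the inflation component or from the NB itself
    nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)
    log_zero = torch.log(pi + (1.0 - pi) * torch.exp(nb_zero) + eps)
    log_nonzero = torch.log(1.0 - pi + eps) + log_nb
    return -torch.where(x < 0.5, log_zero, log_nonzero).mean()

# toy usage with synthetic counts
x = torch.poisson(torch.full((256,), 2.0))
loss = zinb_nll(x, mu=torch.full_like(x, 2.0), theta=torch.full_like(x, 1.0), pi=torch.full_like(x, 0.1))
```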

#152 IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection #152 IAD-R1:在工业异常检测中加强一致性推理

Authors: [Yanhui Li](https://arxiv.org/search/?searchtype=author&query=Yanhui Li), [Yunkang Cao](https://arxiv.org/search/?searchtype=author&query=Yunkang Cao), [Chengliang Liu](https://arxiv.org/search/?searchtype=author&query=Chengliang Liu), [Yuan Xiong](https://arxiv.org/search/?searchtype=author&query=Yuan Xiong), [Xinghui Dong](https://arxiv.org/search/?searchtype=author&query=Xinghui Dong), [Chao Huang](https://arxiv.org/search/?searchtype=author&query=Chao Huang) 作者:李艳辉、曹云康、刘成亮、熊元、董兴辉、黄超

Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from “Anomaly Perception” to “Anomaly Interpretation”. Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1. 工业异常检测是现代制造的重要组成部分,但缺乏缺陷样本使传统检测方法局限于特定场景应用。尽管视觉-语言模型(VLMs)在泛化能力上表现出显著优势,但它们在工业异常检测中的性能仍然有限。为了解决这一挑战,我们提出了 IAD-R1,一种通用的后训练框架,适用于不同架构和参数规模的 VLMs,能显著提升其异常检测能力。IAD-R1 采用两阶段训练策略:感知激活监督微调(PA-SFT)阶段利用精心构建的高质量思维链数据集(Expert-AD)进行训练,提升异常感知能力并建立从推理到答案的关联;结构化控制组相对策略优化(SC-GRPO)阶段则采用精心设计的奖励函数,实现从“异常感知”到“异常解释”的能力跃迁。 实验结果表明,IAD-R1 在 7 个视觉语言模型上均取得显著提升,在 6 个工业异常检测基准数据集上的平均准确率最高提升达 43.3%。值得注意的是,使用 IAD-R1 训练的 0.5B 参数模型在零样本设置下超越了包括 GPT-4.1 和 Claude-Sonnet-4 在内的商业模型,证明了 IAD-R1 的有效性和优越性。数据集、代码及所有模型权重将公开发布于 https://github.com/Yanhui-Lee/IAD-R1

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-07 09:34:45 UTC 发布:2025-08-07 09:34:45 协调世界时 (UTC)
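
The abstract of #152 does not spell out SC-GRPO's reward functions, but the group-relative advantage at the core of GRPO-style training can be sketched in a few lines; the names below are assumptions, not the paper's code.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and std of its own group (one group per prompt).

    rewards: tensor of shape (num_prompts, group_size)
    returns: advantages of the same shape
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# toy usage: 2 prompts, 4 sampled answers each
r = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                  [0.2, 0.8, 0.2, 0.2]])
adv = group_relative_advantages(r)
```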

#153 Generative Artificial Intelligence in Medical Imaging: Foundations, Progress, and Clinical Translation #153 医学影像中的生成式人工智能:基础、进展与临床转化

Authors: [Xuanru Zhou](https://arxiv.org/search/?searchtype=author&query=Xuanru Zhou), [Cheng Li](https://arxiv.org/search/?searchtype=author&query=Cheng Li), [Shuqiang Wang](https://arxiv.org/search/?searchtype=author&query=Shuqiang Wang), [Ye Li](https://arxiv.org/search/?searchtype=author&query=Ye Li), [Tao Tan](https://arxiv.org/search/?searchtype=author&query=Tao Tan), [Hairong Zheng](https://arxiv.org/search/?searchtype=author&query=Hairong Zheng), [Shanshan Wang](https://arxiv.org/search/?searchtype=author&query=Shanshan Wang) 作者:周宣儒、李成、王书强、李烨、谭涛、郑海榕、王珊珊

Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering. 生成式人工智能(AI)正在快速改变医学影像领域,带来诸如数据合成、图像增强、模态转换和时空建模等能力。本文综述全面且面向未来地总结了生成建模的最新进展,包括生成对抗网络(GANs)、变分自编码器(VAEs)、扩散模型以及新兴的多模态基础架构,并评估了它们在临床影像全流程中日益扩大的作用。我们系统性地考察了生成式 AI 如何为影像工作流程的关键阶段做出贡献,从采集与重建到跨模态合成、诊断支持和治疗规划。重点关注回顾性和前瞻性临床场景,其中生成模型有助于解决长期存在的问题,如数据稀缺、标准化以及跨模态整合。为促进严格的基准测试和可转化性,我们提出了一个三层评估框架,涵盖像素级保真度、特征级真实感和任务级临床相关性。 我们还指出了面向真实世界部署的关键障碍,包括在领域转移下的泛化能力、虚构(hallucination)风险、数据隐私问题以及监管障碍。最后,我们探讨了生成式人工智能与大规模基础模型的融合,强调这种协同可能如何推动下一代可扩展、可靠且临床整合的影像系统的实现。通过勾画技术进展和转化路径,本综述旨在为未来研究提供指导,并促进人工智能、医学与生物医学工程交叉领域的跨学科合作。

Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:图像与视频处理、人工智能、计算机视觉与模式识别

Publish: 2025-08-07 07:58:40 UTC 发布:2025-08-07 07:58:40 UTC

#154 DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic #154 DQT:通过无反量化嵌套整数算术进行动态量化训练

Authors: [Hazem Hesham Yousef Shalby](https://arxiv.org/search/?searchtype=author&query=Hazem Hesham Yousef Shalby), [Fabrizio Pittorino](https://arxiv.org/search/?searchtype=author&query=Fabrizio Pittorino), [Francesca Palermo](https://arxiv.org/search/?searchtype=author&query=Francesca Palermo), [Diana Trojaniello](https://arxiv.org/search/?searchtype=author&query=Diana Trojaniello), [Manuel Roveri](https://arxiv.org/search/?searchtype=author&query=Manuel Roveri) 作者:Hazem Hesham Yousef Shalby、Fabrizio Pittorino、Francesca Palermo、Diana Trojaniello、Manuel Roveri

The deployment of deep neural networks on resource-constrained devices relies on quantization. While static, uniform quantization applies a fixed bit-width to all inputs, it fails to adapt to their varying complexity. Dynamic, instance-based mixed-precision quantization promises a superior accuracy-efficiency trade-off by allocating higher precision only when needed. However, a critical bottleneck remains: existing methods require a costly dequantize-to-float and requantize-to-integer cycle to change precision, breaking the integer-only hardware paradigm and compromising performance gains. This paper introduces Dynamic Quantization Training (DQT), a novel framework that removes this bottleneck. At the core of DQT is a nested integer representation where lower-precision values are bit-wise embedded within higher-precision ones. This design, coupled with custom integer-only arithmetic, allows for on-the-fly bit-width switching through a near-zero-cost bit-shift operation. This makes DQT the first quantization framework to enable both dequantization-free static mixed-precision of the backbone network, and truly efficient dynamic, instance-based quantization through a lightweight controller that decides at runtime how to quantize each layer. We demonstrate DQT state-of-the-art performance on ResNet18 on CIFAR-10 and ResNet50 on ImageNet. On ImageNet, our 4-bit dynamic ResNet50 achieves 77.00% top-1 accuracy, an improvement over leading static (LSQ, 76.70%) and dynamic (DQNET, 76.94%) methods at a comparable BitOPs budget. Crucially, DQT achieves this with a bit-width transition cost of only 28.3M simple bit-shift operations, a drastic improvement over the 56.6M costly Multiply-Accumulate (MAC) floating-point operations required by previous dynamic approaches - unlocking a new frontier in efficient, adaptive AI. 在资源受限设备上部署深度神经网络依赖于量化。静态的、统一的量化对所有输入应用固定的位宽,但它无法适应输入复杂性的差异。基于实例的动态混合精度量化通过仅在需要时分配更高精度,承诺提供更优的精度-效率权衡。然而,一个关键瓶颈仍然存在:现有方法在改变精度时需要代价高昂的去量化为浮点并再量化为整数的循环,这打破了纯整数硬件范式并削弱了性能提升。本文提出了动态量化训练(DQT),一种消除该瓶颈的新框架。DQT 的核心是嵌套整数表示,其中较低精度的值按位嵌入在较高精度值中。该设计结合定制的纯整数算术,通过近零成本的位移操作实现按需位宽切换。 这使得 DQT 成为第一个量化框架,既能对主干网络实现无需反量化的静态混合精度,又能通过一个轻量级控制器在运行时决定如何量化每一层,从而实现真正高效的动态、基于实例的量化。我们在 CIFAR-10 上的 ResNet18 和 ImageNet 上的 ResNet50 上展示了 DQT 的最先进性能。在 ImageNet 上,我们的 4 位动态 ResNet50 达到 77.00% 的 top-1 准确率,在可比的 BitOPs 预算下优于领先的静态方法(LSQ,76.70%)和动态方法(DQNET,76.94%)。关键是,DQT 实现这一点所需的比特宽度转换成本仅为 28.3M 次简单的位移操作,相对于先前动态方法所需的 56.6M 次昂贵的乘加(MAC)浮点运算有着显著改进——为高效、自适应 AI 开辟了新的前沿。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-07 07:31:48 UTC 发布:2025-08-07 07:31:48 UTC
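
A rough illustration of #154's nested-integer idea, assuming unsigned 8-bit codes whose top bits double as the lower-precision representation; DQT's actual formats, signedness, and integer arithmetic may differ.

```python
import numpy as np

def to_lower_precision(w_q8, target_bits):
    """View 8-bit quantized weights at a lower bit-width by keeping only
    their top `target_bits` bits (a right shift), with no float round trip."""
    shift = 8 - target_bits
    return w_q8 >> shift                      # e.g. 8-bit codes -> 4-bit codes

def int_matmul(x_q, w_q, x_scale, w_scale):
    """Integer-only matmul; the float scales are applied once at the end."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc * (x_scale * w_scale)

# weights are stored once at 8 bits; a controller can pick 8, 6 or 4 bits per layer
w_q8 = np.random.randint(0, 256, size=(16, 16), dtype=np.uint8)
w_q4 = to_lower_precision(w_q8, 4)            # near-zero-cost bit shift
# note: dropping `shift` bits enlarges the quantization step by 2**shift,
# so the effective scale for w_q4 is w_scale * 16
```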

#155 A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection #155 一种基于上下文感知注意力和图神经网络的多模态厌女症检测框架

Authors: [Mohammad Zia Ur Rehman](https://arxiv.org/search/?searchtype=author&query=Mohammad Zia Ur Rehman), [Sufyaan Zahoor](https://arxiv.org/search/?searchtype=author&query=Sufyaan Zahoor), [Areeb Manzoor](https://arxiv.org/search/?searchtype=author&query=Areeb Manzoor), [Musharaf Maqbool](https://arxiv.org/search/?searchtype=author&query=Musharaf Maqbool), [Nagendra Kumar](https://arxiv.org/search/?searchtype=author&query=Nagendra Kumar) 作者:Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, Musharaf Maqbool, Nagendra Kumar

A substantial portion of offensive content on social media is directed towards women. Since the approaches for general offensive content detection face a challenge in detecting misogynistic content, it requires solutions tailored to address offensive content against women. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively. 社交媒体上相当大一部分攻击性内容针对女性。由于通用攻击性内容检测方法在检测厌女内容方面存在挑战,因此需要针对针对女性的攻击性内容的定制化解决方案。为此,我们提出了一种用于检测厌女和性别歧视内容的新型多模态框架。该框架由三个模块组成:多模态注意力模块(MANM)、基于图的特征重构模块(GFRM)和内容特定特征学习模块(CFLM)。MANM 采用基于自适应门控的多模态上下文感知注意力,使模型能够关注相关的视觉和文本信息并生成语境相关的特征。GFRM 模块利用图结构在单一模态内精炼特征,而 CFLM 则侧重于学习文本和图像特定的特征,如有毒性特征和标题特征。此外,我们整理了一套厌女词典,以从文本中计算厌女专用词典得分。我们在特征空间应用测试时增强(test-time augmentation),以更好地对多样化输入的预测进行泛化。 所提出方法的性能在两个多模态数据集上进行了评估:MAMI 和 MMHS150K,样本数分别为 11,000 和 13,494。所提出的方法在 MAMI 和 MMHS150K 数据集上分别比现有方法在 macro-F1 上平均提高了 10.17% 和 8.88%。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-08-07 06:41:17 UTC 发布:2025-08-07 06:41:17 协调世界时(UTC)
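
A minimal sketch of adaptive gating-based fusion in the spirit of #155's MANM, which additionally uses context-aware attention; the module, dimensions, and tensor names below are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse text and image features with a learned, input-dependent gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, img_feat):
        g = torch.sigmoid(self.gate(torch.cat([text_feat, img_feat], dim=-1)))
        return g * text_feat + (1 - g) * img_feat   # per-dimension mixing

fused = GatedFusion(256)(torch.randn(8, 256), torch.randn(8, 256))
```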

#156 FedMP: Tackling Medical Feature Heterogeneity in Federated Learning from a Manifold Perspective #156 FedMP:从流形视角应对联邦学习中的医疗特征异质性

Authors: [Zhekai Zhou](https://arxiv.org/search/?searchtype=author&query=Zhekai Zhou), [Shudong Liu](https://arxiv.org/search/?searchtype=author&query=Shudong Liu), [Zhaokun Zhou](https://arxiv.org/search/?searchtype=author&query=Zhaokun Zhou), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Qiang Yang](https://arxiv.org/search/?searchtype=author&query=Qiang Yang), [Yuesheng Zhu](https://arxiv.org/search/?searchtype=author&query=Yuesheng Zhu), [Guibo Luo](https://arxiv.org/search/?searchtype=author&query=Guibo Luo) 作者:周哲凯、刘树东、周昭坤、刘洋、杨强、朱跃升、罗桂波

Federated learning (FL) is a decentralized machine learning paradigm in which multiple clients collaboratively train a shared model without sharing their local private data. However, real-world applications of FL frequently encounter challenges arising from the non-identically and independently distributed (non-IID) local datasets across participating clients, which is particularly pronounced in the field of medical imaging, where shifts in image feature distributions significantly hinder the global model’s convergence and performance. To address this challenge, we propose FedMP, a novel method designed to enhance FL under non-IID scenarios. FedMP employs stochastic feature manifold completion to enrich the training space of individual client classifiers, and leverages class-prototypes to guide the alignment of feature manifolds across clients within semantically consistent subspaces, facilitating the construction of more distinct decision boundaries. We validate the effectiveness of FedMP on multiple medical imaging datasets, including those with real-world multi-center distributions, as well as on a multi-domain natural image dataset. The experimental results demonstrate that FedMP outperforms existing FL algorithms. Additionally, we analyze the impact of manifold dimensionality, communication efficiency, and privacy implications of feature exposure in our method. 联邦学习(FL)是一种去中心化的机器学习范式,多个客户端在不共享本地私有数据的情况下协同训练共享模型。然而,FL 在现实应用中经常面临参与客户端之间本地数据集非独立同分布(non-IID)带来的挑战,这在医学影像领域尤为明显,图像特征分布的偏移显著阻碍了全局模型的收敛性和性能。为了解决这一问题,我们提出了 FedMP,一种旨在在 non-IID 场景下增强联邦学习的新方法。FedMP 采用随机特征流形补全来丰富单个客户端分类器的训练空间,并利用类原型在语义一致的子空间内引导客户端间特征流形的对齐,从而促进构建更为清晰的决策边界。我们在多个医学影像数据集上验证了 FedMP 的有效性,包括具有真实多中心分布的数据集,以及一个多域自然图像数据集。 实验结果表明,FedMP 优于现有的联邦学习算法。此外,我们还分析了流形维数、通信效率以及我们方法中特征暴露的隐私影响。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-07 01:13:46 UTC 发布时间:2025-08-07 01:13:46 协调世界时 (UTC)
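
A minimal sketch of the class-prototype alignment idea in #156, assuming globally aggregated class prototypes are available on each client; the stochastic feature-manifold completion component is omitted, and the names are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(features, labels, prototypes):
    """Distance between each sample's feature and its class prototype.

    features:   (batch, dim) client-side embeddings
    labels:     (batch,) integer class ids
    prototypes: (num_classes, dim) globally aggregated class prototypes
    """
    target = prototypes[labels]                       # (batch, dim)
    return F.mse_loss(features, target)

# toy usage
protos = torch.randn(10, 64)
loss = prototype_alignment_loss(torch.randn(32, 64), torch.randint(0, 10, (32,)), protos)
```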

#157 webMCP: Efficient AI-Native Client-Side Interaction for Agent-Ready Web Design #157 webMCP:面向代理就绪网页设计的高效原生客户端 AI 交互

Author: [D. Perera](https://arxiv.org/search/?searchtype=author&query=D. Perera) 作者:D. Perera

Current AI agents create significant barriers for users by requiring extensive processing to understand web pages, making AI-assisted web interaction slow and expensive. This paper introduces webMCP (Web Machine Context & Procedure), a client-side standard that embeds structured interaction metadata directly into web pages, enabling more efficient human-AI collaboration on existing websites. webMCP transforms how AI agents understand web interfaces by providing explicit mappings between page elements and user actions. Instead of processing entire HTML documents, agents can access pre-structured interaction data, dramatically reducing computational overhead while maintaining task accuracy. A comprehensive evaluation across 1,890 real API calls spanning online shopping, authentication, and content management scenarios demonstrates webMCP reduces processing requirements by 67.6% while maintaining 97.9% task success rates compared to 98.8% for traditional approaches. Users experience significantly lower costs (34-63% reduction) and faster response times across diverse web interactions. Statistical analysis confirms these improvements are highly significant across multiple AI models. An independent WordPress deployment study validates practical applicability, showing consistent improvements across real-world content management workflows. webMCP requires no server-side modifications, making it deployable across millions of existing websites without technical barriers. These results establish webMCP as a viable solution for making AI web assistance more accessible and sustainable, addressing the critical gap between user interaction needs and AI computational requirements in production environments. 当前的人工智能代理在让用户使用网页时设置了显著障碍,因为它们需要大量处理来理解网页,使得 AI 辅助的网页交互变得缓慢且昂贵。本文介绍了 webMCP(Web Machine Context & Procedure),这是一种客户端标准,将结构化的交互元数据直接嵌入网页,从而在现有网站上实现更高效的人机协作。webMCP 通过提供页面元素与用户操作之间的明确映射,改变了 AI 代理理解网页界面的方式。代理无需处理整个 HTML 文档,而是可以访问预先结构化的交互数据,显著降低计算开销,同时保持任务准确性。对涵盖在线购物、身份验证和内容管理场景的 1,890 次真实 API 调用进行的全面评估表明,与传统方法相比,webMCP 将处理需求降低了 67.6%,同时保持了 97.9% 的任务成功率(传统方法为 98.8%)。用户在各种网页交互中体验到显著更低的成本(降低 34–63%)和更快的响应时间。统计分析证实这些改进在多个 AI 模型中具有高度显著性。 一项独立的 WordPress 部署研究验证了其实用性,显示在真实内容管理工作流程中持续的改进。webMCP 不需要服务器端的改动,使其可以在数以百万计的现有网站上部署而不会有技术障碍。这些结果确立了 webMCP 作为一种可行的解决方案,使 AI 网络助手更易获取且更具可持续性,解决了生产环境中用户交互需求与 AI 计算需求之间的关键差距。

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能

Publish: 2025-08-06 23:02:36 UTC 发布:2025-08-06 23:02:36 UTC

#158 Multimodal RAG Enhanced Visual Description #158 多模态 RAG 增强视觉描述

Authors: [Amit Kumar Jaiswal](https://arxiv.org/search/?searchtype=author&query=Amit Kumar Jaiswal), [Haiming Liu](https://arxiv.org/search/?searchtype=author&query=Haiming Liu), [Ingo Frommholz](https://arxiv.org/search/?searchtype=author&query=Ingo Frommholz) 作者:Amit Kumar Jaiswal、Haiming Liu、Ingo Frommholz

Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM enabling retrieval of closest textual descriptions from the training set. These textual descriptions, in conjunction with an instruction, cater as an input prompt for the language model to generate new textual descriptions. In addition, we introduce an iterative technique for distilling the mapping by generating synthetic descriptions via the language model facilitating optimisation for standard utilised image description measures. Experimental results on two benchmark multimodal datasets demonstrate significant improvements. 用于多模态输入的文本描述需要反复改进查询以生成相关的输出图像。尽管在扩展模型规模和数据量等方面已付出努力,但预训练和微调的成本仍然很高。然而,预训练的大型多模态模型(LMMs)存在模态鸿沟,表现为在共同嵌入空间中文本和视觉表示的不对齐。尽管微调有可能缩小这一鸿沟,但通常代价高昂且不切实际,因为它需要大量与领域相关的数据。为了解决这一挑战,我们提出了一种轻量级、无需训练的方法,利用检索增强生成(RAG)通过一个线性映射跨模态扩展,该映射可以高效计算。在推理过程中,该映射被应用到由 LMM 嵌入的图像上,从而检索训练集中最接近的文本描述。这些文本描述与指令一起作为语言模型的输入提示,用以生成新的文本描述。 此外,我们引入了一种迭代技术,通过语言模型生成合成描述来蒸馏该映射,从而便于针对常用的图像描述评估指标进行优化。实验结果在两个基准多模态数据集上表明了显著的改进。

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Information Retrieval 主题:机器学习、人工智能、计算机视觉与模式识别、信息检索

Publish: 2025-08-06 19:04:38 UTC 发布:2025-08-06 19:04:38 UTC
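
The training-free linear mapping in #158 can be sketched with plain least squares plus nearest-neighbour retrieval; the embedding sources, shapes, and names below are assumptions.

```python
import numpy as np

def fit_linear_map(img_embs, txt_embs):
    """Least-squares W such that img_embs @ W approximates txt_embs."""
    W, *_ = np.linalg.lstsq(img_embs, txt_embs, rcond=None)
    return W

def retrieve_descriptions(query_img_emb, W, txt_embs, texts, k=3):
    """Map an image embedding across the modality gap and return the
    k nearest training descriptions by cosine similarity."""
    q = query_img_emb @ W
    q = q / np.linalg.norm(q)
    t = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    idx = np.argsort(-(t @ q))[:k]
    return [texts[i] for i in idx]

# toy usage with random embeddings standing in for LMM outputs
img, txt = np.random.randn(500, 512), np.random.randn(500, 512)
W = fit_linear_map(img, txt)
captions = retrieve_descriptions(np.random.randn(512), W, txt, [f"caption {i}" for i in range(500)])
```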

#159 Energy-Efficient Stochastic Computing (SC) Neural Networks for Internet of Things Devices With Layer-Wise Adjustable Sequence Length (ASL) #159 面向物联网设备的节能随机计算(SC)神经网络,具有逐层可调序列长度(ASL)

Authors: [Ziheng Wang](https://arxiv.org/search/?searchtype=author&query=Ziheng Wang), [Pedro Reviriego](https://arxiv.org/search/?searchtype=author&query=Pedro Reviriego), [Farzad Niknia](https://arxiv.org/search/?searchtype=author&query=Farzad Niknia), [Zhen Gao](https://arxiv.org/search/?searchtype=author&query=Zhen Gao), [Javier Conde](https://arxiv.org/search/?searchtype=author&query=Javier Conde), [Shanshan Liu](https://arxiv.org/search/?searchtype=author&query=Shanshan Liu), [Fabrizio Lombardi](https://arxiv.org/search/?searchtype=author&query=Fabrizio Lombardi) 作者:Ziheng Wang、Pedro Reviriego、Farzad Niknia、Zhen Gao、Javier Conde、Shanshan Liu、Fabrizio Lombardi

Stochastic computing (SC) has emerged as an efficient low-power alternative for deploying neural networks (NNs) in resource-limited scenarios, such as the Internet of Things (IoT). By encoding values as serial bitstreams, SC significantly reduces energy dissipation compared to conventional floating-point (FP) designs; however, further improvement of layer-wise mixed-precision implementation for SC remains unexplored. This article introduces Adjustable Sequence Length (ASL), a novel scheme that applies mixed-precision concepts specifically to SC NNs. By introducing an operator-norm-based theoretical model, this article shows that truncation noise can cumulatively propagate through the layers by the estimated amplification factors. An extended sensitivity analysis is presented, using random forest (RF) regression to evaluate multilayer truncation effects and validate the alignment of theoretical predictions with practical network behaviors. To accommodate different application scenarios, this article proposes two truncation strategies (coarse-grained and fine-grained), which apply diverse sequence length configurations at each layer. Evaluations on a pipelined SC MLP synthesized at 32nm demonstrate that ASL can reduce energy and latency overheads by up to over 60% with negligible accuracy loss. It confirms the feasibility of the ASL scheme for IoT applications and highlights the distinct advantages of mixed-precision truncation in SC designs. 随机计算(SC)已成为在资源受限场景(如物联网,IoT)中部署神经网络(NNs)的一种高效低功耗替代方案。通过将数值编码为串行比特流,SC 相较于传统浮点(FP)设计显著降低了能量消耗;然而,针对 SC 的分层混合精度实现的进一步改进仍未被探索。本文引入了可调序列长度(ASL),这是一种将混合精度概念专门应用于 SC 神经网络的新方案。通过引入基于算子范数的理论模型,本文表明截断噪声可通过估计的放大因子在各层间累积传播。文中还给出了一种扩展的敏感性分析,使用随机森林(RF)回归评估多层截断效应,并验证理论预测与实际网络行为的一致性。为适应不同的应用场景,本文提出了两种截断策略(粗粒度和细粒度),在每一层应用不同的序列长度配置。 在 32nm 工艺上综合的流水线随机计算(SC)多层感知器(MLP)评估表明,ASL 可将能耗和延迟开销最多降低超过 60%,且几乎不损失准确性。该结果证实了 ASL 方案在物联网应用中的可行性,并凸显了混合精度截断在随机计算设计中的显著优势。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-05 16:45:24 UTC 发布:2025-08-05 16:45:24 UTC
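
For background on what #159 truncates, here is a minimal sketch of unipolar stochastic-computing encoding, multiplication, and per-layer sequence-length truncation; the ASL selection policy itself is not shown, and the names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(p, length):
    """Unipolar SC encoding: a Bernoulli bitstream whose mean approximates p in [0, 1]."""
    return rng.random(length) < p

def sc_multiply(a_bits, b_bits):
    """With independent unipolar streams, a bitwise AND multiplies the encoded values."""
    return a_bits & b_bits

def decode(bits):
    return bits.mean()

a, b = encode(0.8, 1024), encode(0.5, 1024)
full = decode(sc_multiply(a, b))                 # ~0.4 with a 1024-bit stream
short = decode(sc_multiply(a[:128], b[:128]))    # shorter per-layer stream: cheaper, noisier
```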

#160 Physics-Guided Memory Network for Building Energy Modeling #160 基于物理引导的记忆网络用于建筑能耗建模

Authors: [Muhammad Umair Danish](https://arxiv.org/search/?searchtype=author&query=Muhammad Umair Danish), [Kashif Ali](https://arxiv.org/search/?searchtype=author&query=Kashif Ali), [Kamran Siddiqui](https://arxiv.org/search/?searchtype=author&query=Kamran Siddiqui), [Katarina Grolinger](https://arxiv.org/search/?searchtype=author&query=Katarina Grolinger) 作者:Muhammad Umair Danish, Kashif Ali, Kamran Siddiqui, Katarina Grolinger

Accurate energy consumption forecasting is essential for efficient resource management and sustainability in the building sector. Deep learning models are highly successful but struggle with limited historical data and become unusable when historical data are unavailable, such as in newly constructed buildings. On the other hand, physics-based models, such as EnergyPlus, simulate energy consumption without relying on historical data but require extensive building parameter specifications and considerable time to model a building. This paper introduces a Physics-Guided Memory Network (PgMN), a neural network that integrates predictions from deep learning and physics-based models to address their limitations. PgMN comprises a Parallel Projection Layers to process incomplete inputs, a Memory Unit to account for persistent biases, and a Memory Experience Module to optimally extend forecasts beyond their input range and produce output. Theoretical evaluation shows that components of PgMN are mathematically valid for performing their respective tasks. The PgMN was evaluated on short-term energy forecasting at an hourly resolution, critical for operational decision-making in smart grid and smart building systems. Experimental validation shows accuracy and applicability of PgMN in diverse scenarios such as newly constructed buildings, missing data, sparse historical data, and dynamic infrastructure changes. This paper provides a promising solution for energy consumption forecasting in dynamic building environments, enhancing model applicability in scenarios where historical data are limited or unavailable or when physics-based models are inadequate. 准确的能耗预测对于建筑领域的高效资源管理和可持续发展至关重要。深度学习模型虽非常成功,但在历史数据有限时表现欠佳,且在没有历史数据的情况下(如新建建筑)无法使用。另一方面,基于物理的模型(如 EnergyPlus)能够在不依赖历史数据的情况下模拟能耗,但需要大量建筑参数设定并花费相当时间对建筑进行建模。本文提出了一种物理引导记忆网络(Physics-Guided Memory Network,PgMN),这是一种将深度学习与基于物理的模型预测整合以弥补各自局限的神经网络。PgMN 包含用于处理不完整输入的并行投影层(Parallel Projection Layers)、用于考虑持续偏差的记忆单元(Memory Unit)以及用于在输入范围之外最优延展预测并生成输出的记忆经验模块(Memory Experience Module)。理论评估表明,PgMN 的各个组件在数学上对执行其各自任务是有效的。 PgMN 在小时分辨率的短期能量预测上进行了评估,这对于智能电网和智能建筑系统的运行决策至关重要。实验证明了 PgMN 在多种场景下的准确性和适用性,例如新建建筑、数据缺失、历史数据稀少以及基础设施动态变化的情况。本文为动态建筑环境中的能耗预测提供了一个有前景的解决方案,增强了模型在历史数据有限或不可用、或基于物理的模型不足以胜任时的适用性。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-05 15:16:19 UTC 发布时间:2025-08-05 15:16:19 协调世界时 (UTC)

#161 Agoran: An Agentic Open Marketplace for 6G RAN Automation #161 Agoran:一个用于 6G RAN 自动化的智能开放市场

Authors: [Ilias Chatzistefanidis](https://arxiv.org/search/?searchtype=author&query=Ilias Chatzistefanidis), [Navid Nikaein](https://arxiv.org/search/?searchtype=author&query=Navid Nikaein), [Andrea Leone](https://arxiv.org/search/?searchtype=author&query=Andrea Leone), [Ali Maatouk](https://arxiv.org/search/?searchtype=author&query=Ali Maatouk), [Leandros Tassioulas](https://arxiv.org/search/?searchtype=author&query=Leandros Tassioulas), [Roberto Morabito](https://arxiv.org/search/?searchtype=author&query=Roberto Morabito), [Ioannis Pitsiorlas](https://arxiv.org/search/?searchtype=author&query=Ioannis Pitsiorlas), [Marios Kountouris](https://arxiv.org/search/?searchtype=author&query=Marios Kountouris) 作者:Ilias Chatzistefanidis、Navid Nikaein、Andrea Leone、Ali Maatouk、Leandros Tassioulas、Roberto Morabito、Ioannis Pitsiorlas、Marios Kountouris

Next-generation mobile networks must reconcile the often-conflicting goals of multiple service owners. However, today’s network slice controllers remain rigid, policy-bound, and unaware of the business context. We introduce Agoran Service and Resource Broker (SRB), an agentic marketplace that brings stakeholders directly into the operational loop. Inspired by the ancient Greek agora, Agoran distributes authority across three autonomous AI branches: a Legislative branch that answers compliance queries using retrieval-augmented Large Language Models (LLMs); an Executive branch that maintains real-time situational awareness through a watcher-updated vector database; and a Judicial branch that evaluates each agent message with a rule-based Trust Score, while arbitrating LLMs detect malicious behavior and apply real-time incentives to restore trust. Stakeholder-side Negotiation Agents and the SRB-side Mediator Agent negotiate feasible, Pareto-optimal offers produced by a multi-objective optimizer, reaching a consensus intent in a single round, which is then deployed to Open and AI RAN controllers. Deployed on a private 5G testbed and evaluated with realistic traces of vehicle mobility, Agoran achieved significant gains: (i) a 37% increase in throughput of eMBB slices, (ii) a 73% reduction in latency of URLLC slices, and concurrently (iii) an end-to-end 8.3% saving in PRB usage compared to a static baseline. An 1B-parameter Llama model, fine-tuned for five minutes on 100 GPT-4 dialogues, recovers approximately 80% of GPT-4.1’s decision quality, while operating within 6 GiB of memory and converging in only 1.3 seconds. These results establish Agoran as a concrete, standards-aligned path toward ultra-flexible, stakeholder-centric 6G networks. A live demo is presented https://www.youtube.com/watch?v=h7vEyMu2f5w&ab_channel=BubbleRAN. 下一代移动网络必须调和多个服务所有者之间常常互相冲突的目标。然而,现今的网络切片控制器仍然僵化、受策略约束且对业务上下文缺乏感知。我们提出了 Agoran 服务与资源经纪人(SRB),这是一个将利益相关方直接引入运维循环的具代理特性的市场。受古希腊市集(agora)启发,Agoran 将权力分散到三个自治的 AI 分支:立法分支通过检索增强的 LLMs 回答合规性查询;执行分支通过由监视器更新的向量数据库保持实时态势感知;司法分支用基于规则的信任评分来评估每条代理消息,同时仲裁 LLMs 检测恶意行为并施加实时激励以恢复信任。利益相关方侧的谈判代理与 SRB 侧的调解代理协商由多目标优化器生成的可行帕累托最优报价,在单轮内达成一致意向,然后将其部署到开放与 AI RAN 控制器。 在一个私有 5G 测试床上部署并使用真实的车辆移动轨迹进行评估后,Agoran 取得了显著收益:(i)eMBB 切片的吞吐量提高了 37%,(ii)URLLC 切片的时延减少了 73%,同时(iii)与静态基线相比,端到端 PRB 使用节省了 8.3%。一个拥有 10 亿参数的 Llama 模型,在 100 个 GPT-4 对话上微调五分钟后,恢复了约 80%的 GPT-4.1 决策质量,同时在 6 GiB 内存中运行并仅在 1.3 秒内收敛。这些结果将 Agoran 确立为通往超灵活、以利益相关者为中心的 6G 网络的一个具体且符合标准的路径。现场演示见 https://www.youtube.com/watch?v=h7vEyMu2f5w&ab_channel=BubbleRAN。

Subjects: Networking and Internet Architecture, Artificial Intelligence 主题:网络与互联网体系结构,人工智能

Publish: 2025-08-05 12:17:03 UTC 发布:2025-08-05 12:17:03 世界协调时间(UTC)

#162 EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving #162 EvaDrive:用于端到端自动驾驶的进化对抗策略优化

Authors: [Siwen Jiao](https://arxiv.org/search/?searchtype=author&query=Siwen Jiao), [Kangan Qian](https://arxiv.org/search/?searchtype=author&query=Kangan Qian), [Hao Ye](https://arxiv.org/search/?searchtype=author&query=Hao Ye), [Yang Zhong](https://arxiv.org/search/?searchtype=author&query=Yang Zhong), [Ziang Luo](https://arxiv.org/search/?searchtype=author&query=Ziang Luo), [Sicong Jiang](https://arxiv.org/search/?searchtype=author&query=Sicong Jiang), [Zilin Huang](https://arxiv.org/search/?searchtype=author&query=Zilin Huang), [Yangyi Fang](https://arxiv.org/search/?searchtype=author&query=Yangyi Fang), [Jinyu Miao](https://arxiv.org/search/?searchtype=author&query=Jinyu Miao), [Zheng Fu](https://arxiv.org/search/?searchtype=author&query=Zheng Fu), [Yunlong Wang](https://arxiv.org/search/?searchtype=author&query=Yunlong Wang), [Kun Jiang](https://arxiv.org/search/?searchtype=author&query=Kun Jiang), [Diange Yang](https://arxiv.org/search/?searchtype=author&query=Diange Yang), [Rui Fan](https://arxiv.org/search/?searchtype=author&query=Rui Fan), [Baoyun Peng](https://arxiv.org/search/?searchtype=author&query=Baoyun Peng) 作者:焦思文、钱康安、叶昊、钟扬、罗子昂、蒋思聪、黄子林、方阳义、缪金玉、付正、王云龙、姜坤、杨殿阁、范睿、彭宝云

Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias. To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization bias. This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity. Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach. 自动驾驶在实现类人迭代决策方面面临重大挑战,该决策不断生成、评估并改进轨迹提案。当前的生成-评估框架将轨迹生成与质量评估隔离,阻碍了对规划至关重要的迭代改进,而强化学习方法又将多维偏好压缩为标量奖励,掩盖了重要的权衡并导致标量化偏差。为解决这些问题,我们提出了 EvaDrive,一种新颖的多目标强化学习框架,通过对抗优化在轨迹生成与评估之间建立真正的闭环共演化。EvaDrive 将轨迹规划构建为一个多轮对抗博弈。在该博弈中,分层生成器通过将用于时间因果性的自回归意图建模与用于空间灵活性的扩散式细化相结合,持续提出候选路径。这些提案随后由一个可训练的多目标评判器进行严格评估,该评判器在不将多样化偏好结构合并为单一标量化偏置的情况下明确保留这些偏好结构。该对抗式交互在帕累托前沿选择机制的引导下,实现了多轮迭代的细化,有效地逃离局部最优,同时保持轨迹多样性。在 NAVSIM 和 Bench2Drive 基准上的大量实验证明了其 SOTA 性能,在 NAVSIM v1 上达到 94.9 PDMS(分别比 DiffusionDrive 高出 6.8、比 DriveSuprim 高出 5.0、比 TrajHF 高出 0.9),并在 Bench2Drive 上达到 64.96 的驾驶得分。EvaDrive 通过动态加权在无需外部偏好数据的情况下生成多样驾驶风格,引入了一个用于类人迭代决策的闭环对抗框架,提供了一种新的无标量化的轨迹优化方法。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-05 11:26:28 UTC 发布:2025-08-05 11:26:28 UTC
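
The Pareto frontier selection step named in #162 amounts to keeping only non-dominated trajectory proposals under multiple objectives rather than collapsing them to one scalar; a minimal sketch (how the objective scores are produced is assumed, not taken from the paper).

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated rows; higher is better on every objective.

    scores: array of shape (num_candidates, num_objectives)
    """
    keep = []
    for i in range(len(scores)):
        dominated = np.any(np.all(scores >= scores[i], axis=1) &
                           np.any(scores > scores[i], axis=1))
        if not dominated:
            keep.append(i)
    return np.array(keep)

# toy usage: 3 candidates scored on (comfort, progress)
front = pareto_front(np.array([[0.9, 0.2], [0.5, 0.8], [0.4, 0.1]]))  # -> [0, 1]
```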

#163 Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems #163 基于物理约束的流匹配模型微调用于生成与逆问题

Authors: [Jan Tauberschmidt](https://arxiv.org/search/?searchtype=author&query=Jan Tauberschmidt), [Sophie Fellenz](https://arxiv.org/search/?searchtype=author&query=Sophie Fellenz), [Sebastian J. Vollmer](https://arxiv.org/search/?searchtype=author&query=Sebastian J. Vollmer), [Andrew B. Duncan](https://arxiv.org/search/?searchtype=author&query=Andrew B. Duncan) 作者:Jan Tauberschmidt, Sophie Fellenz, Sebastian J. Vollmer, Andrew B. Duncan

We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physics-aware manner. We validate our method on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems. 我们提出了一个框架,用于微调流匹配生成模型,以强制执行物理约束并求解科学系统中的逆问题。从在低保真或观测数据上训练得到的模型出发,我们应用一种可微的训练后处理程序,最小化主导偏微分方程(PDE)的弱形式残差,从而在不扭曲底层学习到的分布的情况下,促进物理一致性并遵守边界条件。为推断未知的物理输入(例如源项、材料参数或边界数据),我们在生成过程中加入一个可学习的潜在参数预测器,并提出一种联合优化策略。由此得到的模型能够生成物理上有效的场解,同时给出隐藏参数的合理估计,有效地以数据驱动但具物理意识的方式处理病态逆问题。我们在典型的 PDE 基准上验证了该方法,结果表明对 PDE 约束的满足度有所提高且潜在系数能被准确恢复。我们的方法将生成建模与科学推断连接起来,为基于模拟的增强发现和物理系统的数据高效建模开辟了新途径。

Subjects: Machine Learning, Artificial Intelligence, Applications 主题:机器学习,人工智能,应用

Publish: 2025-08-05 09:32:04 UTC 发布:2025-08-05 09:32:04 UTC
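
A simplified stand-in for #163's physics penalty: a collocation-style residual for a 1-D Poisson problem computed with autograd and added to the fine-tuning loss. The paper uses weak-form residuals of the governing PDEs, so this is only an illustrative assumption.

```python
import torch

def pde_residual_penalty(u_net, f, x):
    """Collocation-style residual ||u''(x) - f(x)||^2 for a 1-D Poisson problem.
    u_net maps (N, 1) coordinates to (N, 1) field values; f gives the source term."""
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - f(x)) ** 2).mean()

# usage inside post-training: loss = generative_loss + lam * pde_residual_penalty(u_net, f, x)
```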

#164 A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models #164 一滚石不生苔:用于大型多模态模型稳定自我评估的自适应策略优化

Authors: [Wenkai Wang](https://arxiv.org/search/?searchtype=author&query=Wenkai Wang), [Hongcan Guo](https://arxiv.org/search/?searchtype=author&query=Hongcan Guo), [Zheqi Lv](https://arxiv.org/search/?searchtype=author&query=Zheqi Lv), [Shengyu Zhang](https://arxiv.org/search/?searchtype=author&query=Shengyu Zhang) 作者:王文凯、郭宏灿、吕哲琪、张胜宇

Self-evaluation, a model’s ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking , AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task’s training state from the distribution of model generated multi-turn trajectories’ performance. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients which is modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks’ training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community. 自我评估,即模型评估自身输出正确性的能力,对于大型多模态模型(LMMs)在多轮对话中实现自我改进至关重要,但在基础模型中大多缺失。近期工作采用强化学习(RL)来增强自我评估;然而,其固定的奖励机制在优化多个训练目标时容易遭遇奖励利用(reward hacking),导致模型崩溃。本文提出了 AdaPO,一种在线强化学习框架,能够根据每个任务的当前训练状态实时自适应调整训练目标。具体地,为了缓解奖励利用,AdaPO 引入了自适应奖励模型(ARM)和奖励感知动态 KL 正则机制。ARM 从模型生成的多轮轨迹性能分布中评估任务的训练状态。奖励感知动态 KL 用动态系数替代固定惩罚,该系数由不同多轮情境之间的奖励差距调节。值得注意的是,我们的方法能够在无需人工干预的情况下,根据子任务的训练进展自动且平滑地调整学习重点。 在 8 个基准和多种模型上进行的大量实验表明,我们的方法显著提升了直接推理和自我评估能力。我们将开源代码以回馈社区。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-05 07:54:01 UTC 发布:2025-08-05 07:54:01 UTC
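
The abstract of #164 does not give the reward-aware KL schedule, so the modulation below is a purely hypothetical example of replacing a fixed KL penalty with a coefficient driven by the reward gap between multi-turn situations; both the functional form and the constants are assumptions.

```python
import torch

def dynamic_kl_coef(reward_gap, base_coef=0.05, sensitivity=5.0):
    """Hypothetical reward-aware KL schedule: shrink the KL penalty when the
    reward gap is large (let the policy move) and grow it when the gap is
    small (stay close to the reference policy)."""
    return base_coef * torch.sigmoid(-sensitivity * reward_gap)

# e.g. loss = policy_loss + dynamic_kl_coef(torch.tensor(0.3)) * kl_to_reference
```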

#165 Peer Effect Estimation in the Presence of Simultaneous Feedback and Unobserved Confounders #165 同时存在反馈与未观测混杂因素下的同伴效应估计

Authors: [Xiaojing Du](https://arxiv.org/search/?searchtype=author&query=Xiaojing Du), [Jiuyong Li](https://arxiv.org/search/?searchtype=author&query=Jiuyong Li), [Lin Liu](https://arxiv.org/search/?searchtype=author&query=Lin Liu), [Debo Cheng](https://arxiv.org/search/?searchtype=author&query=Debo Cheng), [Thuc. Le](https://arxiv.org/search/?searchtype=author&query=Thuc. Le) 作者:杜晓静、李九永、刘琳、程德波、黎叔。

Estimating peer causal effects within complex real-world networks such as social networks is challenging, primarily due to simultaneous feedback between peers and unobserved confounders. Existing methods either address unobserved confounders while ignoring the simultaneous feedback, or account for feedback but under restrictive linear assumptions, thus failing to obtain accurate peer effect estimation. In this paper, we propose DIG2RSI, a novel Deep learning framework which leverages I-G transformation (matrix operation) and 2SRI (an instrumental variable or IV technique) to address both simultaneous feedback and unobserved confounding, while accommodating complex, nonlinear and high-dimensional relationships. DIG2RSI first applies the I-G transformation to disentangle mutual peer influences and eliminate the bias due to the simultaneous feedback. To deal with unobserved confounding, we first construct valid IVs from network data. In stage 1 of 2RSI, we train a neural network on these IVs to predict peer exposure, and extract residuals as proxies for the unobserved confounders. In the stage 2, we fit a separate neural network augmented by an adversarial discriminator that incorporates these residuals as a control function and enforces the learned representation to contain no residual confounding signal. The expressive power of deep learning models in capturing complex non-linear relationships and adversarial debiasing enhances the effectiveness of DIG2RSI in eliminating bias from both feedback loops and hidden confounders. We prove consistency of our estimator under standard regularity conditions, ensuring asymptotic recovery of the true peer effect. Empirical results on two semi-synthetic benchmarks and a real-world dataset demonstrate that DIG2RSI outperforms existing approaches. 在复杂的真实网络(如社交网络)中估计同伴因果效应具有挑战性,主要原因在于同伴之间的同时反馈和未观测混杂变量。现有方法要么在忽略同时反馈的情况下处理未观测混杂变量,要么在考虑反馈时依赖受限的线性假设,因此未能获得准确的同伴效应估计。本文提出了 DIG2RSI,一种新颖的深度学习框架,利用 I-G 变换(矩阵运算)和 2SRI(工具变量或 IV 技术)同时应对同时反馈和未观测混杂问题,并能适应复杂、非线性和高维的关系。DIG2RSI 首先应用 I-G 变换以解开相互的同伴影响并消除由同时反馈引起的偏差。为应对未观测的混杂,我们首先从网络数据构建有效的工具变量。在 2RSI 的第一阶段,我们在这些工具变量上训练神经网络以预测同伴暴露,并提取残差作为未观测混杂变量的代理。 在第二阶段,我们拟合了一个由对抗判别器增强的独立神经网络,该判别器将这些残差作为控制函数并强制学习到的表示不包含残差混淆信号。深度学习模型在捕捉复杂非线性关系方面的表现力以及对抗去偏置方法增强了 DIG2RSI 在消除来自反馈循环和隐藏混淆变量的偏差方面的有效性。我们在标准正则条件下证明了估计量的一致性,确保渐近地恢复真实的同伴效应。在两个半合成基准和一个真实世界数据集上的实证结果表明,DIG2RSI 优于现有方法。

Subjects: Machine Learning, Artificial Intelligence, Methodology 主题:机器学习、人工智能、方法论

Publish: 2025-08-05 05:49:49 UTC 发布日期:2025-08-05 05:49:49 UTC
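
A minimal sketch of the two-stage residual inclusion (2SRI) backbone used in #165, with linear stages for clarity; DIG2RSI replaces these with neural networks, the I-G transformation, and an adversarial discriminator, and the variable names here are assumptions.

```python
import numpy as np

def two_stage_residual_inclusion(Z, T, X, Y):
    """2SRI with linear stages (the paper uses neural networks).

    Z: instruments, T: peer exposure, X: covariates, Y: outcome.
    Returns the estimated peer-effect coefficient on T.
    """
    # stage 1: predict exposure from instruments (+ covariates); keep residuals
    D1 = np.column_stack([Z, X, np.ones(len(Y))])
    beta1, *_ = np.linalg.lstsq(D1, T, rcond=None)
    resid = T - D1 @ beta1                       # proxy for unobserved confounders
    # stage 2: outcome model with the residual included as a control function
    D2 = np.column_stack([T, X, resid, np.ones(len(Y))])
    beta2, *_ = np.linalg.lstsq(D2, Y, rcond=None)
    return beta2[0]                              # coefficient on peer exposure T
```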

#166 JustDense: Just using Dense instead of Sequence Mixer for Time Series analysis #166 JustDense:仅使用 Dense 替代 Sequence Mixer 进行时间序列分析

Authors: [TaekHyun Park](https://arxiv.org/search/?searchtype=author&query=TaekHyun Park), [Yongjae Lee](https://arxiv.org/search/?searchtype=author&query=Yongjae Lee), [Daesan Park](https://arxiv.org/search/?searchtype=author&query=Daesan Park), [Dohee Kim](https://arxiv.org/search/?searchtype=author&query=Dohee Kim), [Hyerim Bae](https://arxiv.org/search/?searchtype=author&query=Hyerim Bae) 作者:TaekHyun Park、Yongjae Lee、Daesan Park、Dohee Kim、Hyerim Bae

Sequence and channel mixers, the core mechanism in sequence models, have become the de facto standard in time series analysis (TSA). However, recent studies have questioned the necessity of complex sequence mixers, such as attention mechanisms, demonstrating that simpler architectures can achieve comparable or even superior performance. This suggests that the benefits attributed to complex sequence mixers might instead emerge from other architectural or optimization factors. Based on this observation, we pose a central question: Are common sequence mixers necessary for time-series analysis? Therefore, we propose JustDense, an empirical study that systematically replaces sequence mixers in various well-established TSA models with dense layers. Grounded in the MatrixMixer framework, JustDense treats any sequence mixer as a mixing matrix and replaces it with a dense layer. This substitution isolates the mixing operation, enabling a clear theoretical foundation for understanding its role. We therefore conducted extensive experiments on 29 benchmarks covering five representative TSA tasks using seven state-of-the-art TSA models to address our research question. The results show that replacing sequence mixers with dense layers yields comparable or even superior performance. In the cases where dedicated sequence mixers still offer benefits, JustDense challenges the assumption that “deeper and more complex architectures are inherently better” in TSA. 序列和通道混合器,作为序列模型的核心机制,已成为时间序列分析(TSA)的事实标准。然而,最近的研究对诸如注意力机制等复杂序列混合器的必要性提出了质疑,表明更简单的架构也能达到相当或更优的性能。这表明那些被归因于复杂序列混合器的优势可能反而源自其他架构或优化因素。基于这一观察,我们提出了一个核心问题:常见的序列混合器对于时间序列分析是否必要?为此,我们提出了 JustDense,一项经验性研究,系统地将各种成熟 TSA 模型中的序列混合器替换为全连接层。基于 MatrixMixer 框架,JustDense 将任何序列混合器视为一个混合矩阵并用全连接层替代。该替换孤立了混合操作,从而为理解其作用提供了明确的理论基础。因此,我们在 29 个基准上进行了广泛实验,涵盖五个具有代表性的 TSA 任务,使用七个最先进的 TSA 模型以回答我们的研究问题。结果表明,用全连接层替换序列混合器可以获得相当或更佳的性能。在那些专用序列混合器仍然带来好处的情况下,JustDense 对“在时序感知架构(TSA)中更深更复杂的架构本质上更好”这一假设提出了挑战。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-04 15:49:09 UTC 发布:2025-08-04 15:49:09 UTC
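
The substitution studied in #166 can be shown in a few lines: a dense layer applied across the time axis stands in for an attention-style sequence mixer, leaving the channel mixing untouched. The module below is an illustrative assumption, not the paper's code.

```python
import torch
import torch.nn as nn

class DenseTimeMixer(nn.Module):
    """Token mixing with a plain dense layer over the time axis,
    standing in for an attention-style sequence mixer."""
    def __init__(self, seq_len):
        super().__init__()
        self.mix = nn.Linear(seq_len, seq_len)

    def forward(self, x):                 # x: (batch, seq_len, channels)
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

y = DenseTimeMixer(seq_len=96)(torch.randn(4, 96, 32))
```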

#167 5G Core Fault Detection and Root Cause Analysis using Machine Learning and Generative AI #167 5G 核心网故障检测与根因分析,使用机器学习与生成式人工智能

Authors: [Joseph H. R. Isaac](https://arxiv.org/search/?searchtype=author&query=Joseph H. R. Isaac), [Harish Saradagam](https://arxiv.org/search/?searchtype=author&query=Harish Saradagam), [Nallamothu Pardhasaradhi](https://arxiv.org/search/?searchtype=author&query=Nallamothu Pardhasaradhi) 作者:Joseph H. R. Isaac、Harish Saradagam、Nallamothu Pardhasaradhi

With the advent of 5G networks and technologies, ensuring the integrity and performance of packet core traffic is paramount. During network analysis, test files such as Packet Capture (PCAP) files and log files will contain errors if present in the system that must be resolved for better overall network performance, such as connectivity strength and handover quality. Current methods require numerous person-hours to sort out testing results and find the faults. This paper presents a novel AI/ML-driven Fault Analysis (FA) Engine designed to classify successful and faulty frames in PCAP files, specifically within the 5G packet core. The FA engine analyses network traffic using natural language processing techniques to identify anomalies and inefficiencies, significantly reducing the effort time required and increasing efficiency. The FA Engine also suggests steps to fix the issue using Generative AI via a Large Language Model (LLM) trained on several 5G packet core documents. The engine explains the details of the error from the domain perspective using documents such as the 3GPP standards and user documents regarding the internal conditions of the tests. Test results on the ML models show high classification accuracy on the test dataset when trained with 80-20 splits for the successful and failed PCAP files. Future scopes include extending the AI engine to incorporate 4G network traffic and other forms of network data, such as log text files and multimodal systems. 随着 5G 网络和相关技术的到来,确保分组核心(packet core)流量的完整性和性能至关重要。在网络分析过程中,测试文件(如数据包捕获(PCAP)文件和日志文件)如果系统中存在错误,会包含必须解决的问题以改善整体网络性能,例如连接强度和切换质量。当前的方法需要大量人力工时来整理测试结果并查找故障。本文提出了一种新颖的由 AI/ML 驱动的故障分析(FA)引擎,旨在对 PCAP 文件中,特别是 5G 分组核心内的成功与故障帧进行分类。该 FA 引擎使用自然语言处理技术分析网络流量,以识别异常和低效,从而显著减少所需的工作时间并提高效率。FA 引擎还通过一个在多份 5G 分组核心文档上训练的 LLM 驱动的生成式 AI,建议解决问题的步骤。该引擎使用诸如 3GPP 标准和关于测试内部条件的用户文档等资料,从领域角度解释错误细节。 在对这些机器学习模型进行测试时,当使用成功和失败的 PCAP 文件按 80-20 的比例进行训练时,测试数据集上显示出较高的分类准确率。未来的工作包括将该 AI 引擎扩展为涵盖 4G 网络流量以及其他形式的网络数据,例如日志文本文件和多模态系统。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Machine Learning 学科:网络与互联网架构、人工智能、机器学习

Publish: 2025-08-04 05:20:32 UTC 发布:2025-08-04 05:20:32 UTC

#168 Motif 2.6B Technical Report #168 Motif 2.6B 技术报告

Authors: [Junghwan Lim](https://arxiv.org/search/?searchtype=author&query=Junghwan Lim), [Sungmin Lee](https://arxiv.org/search/?searchtype=author&query=Sungmin Lee), [Dongseok Kim](https://arxiv.org/search/?searchtype=author&query=Dongseok Kim), [Eunhwan Park](https://arxiv.org/search/?searchtype=author&query=Eunhwan Park), [Hyunbyung Park](https://arxiv.org/search/?searchtype=author&query=Hyunbyung Park), [Junhyeok Lee](https://arxiv.org/search/?searchtype=author&query=Junhyeok Lee), [Wai Ting Cheung](https://arxiv.org/search/?searchtype=author&query=Wai Ting Cheung), [Dahye Choi](https://arxiv.org/search/?searchtype=author&query=Dahye Choi), [Jaeheui Her](https://arxiv.org/search/?searchtype=author&query=Jaeheui Her), [Jaeyeon Huh](https://arxiv.org/search/?searchtype=author&query=Jaeyeon Huh), [Hanbin Jung](https://arxiv.org/search/?searchtype=author&query=Hanbin Jung), [Changjin Kang](https://arxiv.org/search/?searchtype=author&query=Changjin Kang), [Beomgyu Kim](https://arxiv.org/search/?searchtype=author&query=Beomgyu Kim), [Jihwan Kim](https://arxiv.org/search/?searchtype=author&query=Jihwan Kim), [Minjae Kim](https://arxiv.org/search/?searchtype=author&query=Minjae Kim), [Taehwan Kim](https://arxiv.org/search/?searchtype=author&query=Taehwan Kim), [Youngrok Kim](https://arxiv.org/search/?searchtype=author&query=Youngrok Kim), [Haesol Lee](https://arxiv.org/search/?searchtype=author&query=Haesol Lee), [Jeesoo Lee](https://arxiv.org/search/?searchtype=author&query=Jeesoo Lee), [Kungyu Lee](https://arxiv.org/search/?searchtype=author&query=Kungyu Lee), [Dongpin Oh](https://arxiv.org/search/?searchtype=author&query=Dongpin Oh), [Yeongjae Park](https://arxiv.org/search/?searchtype=author&query=Yeongjae Park), [Bokki Ryu](https://arxiv.org/search/?searchtype=author&query=Bokki Ryu), [Daewon Suh](https://arxiv.org/search/?searchtype=author&query=Daewon Suh), [Dongjoo Weon](https://arxiv.org/search/?searchtype=author&query=Dongjoo Weon) 作者:Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon

Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment. 近年来大型语言模型 (LLMs) 的进步革新了人工智能领域,但要开发出在高性能与计算效率之间取得平衡的有效基础 LLM 仍具有挑战性,尤其对于新兴研究团队而言。为填补这一空白,我们推出了 Motif-2.6B,一款具有 26 亿参数的基础模型,旨在使先进的 LLM 能力更加普及。Motif-2.6B 采纳了多项创新的架构改进,包括 Differential Attention 和 PolyNorm 激活函数,这些改进提升了长上下文理解能力、减少了幻觉现象并增强了上下文学习能力。我们通过大量实验对多种新颖架构组件进行了严格测试,以确定 Motif-2.6B 的最优架构。全面评估表明,Motif-2.6B 在多项基准测试中始终达到或超过了同等规模的最先进模型的表现,展示了其有效性、可扩展性和实际适用性。 通过详尽的实验和针对性的技术,Motif-2.6B 在高效、可扩展且强大的基础 LLMs 领域取得了重大进展,为未来的研究与部署提供了有价值的见解和坚实的基础。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-08-02 08:41:47 UTC 发布时间:2025-08-02 08:41:47 UTC
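
"Differential Attention" in #168 presumably follows the Differential Transformer formulation, in which the difference of two softmax attention maps attends over the values; the single-head sketch below is written under that assumption, and Motif-2.6B's exact version may differ.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head differential attention: (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V.
    q*, k*: (batch, seq, d); v: (batch, seq, d_v); lam is a learned scalar in practice."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v

out = differential_attention(*(torch.randn(2, 16, 32) for _ in range(4)), torch.randn(2, 16, 64))
```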

#169 Agentic TinyML for Intent-aware Handover in 6G Wireless Networks #169 面向意图感知切换的自驱动 TinyML 在 6G 无线网络中的应用

Authors: [Alaa Saleh](https://arxiv.org/search/?searchtype=author&query=Alaa Saleh), [Roberto Morabito](https://arxiv.org/search/?searchtype=author&query=Roberto Morabito), [Sasu Tarkoma](https://arxiv.org/search/?searchtype=author&query=Sasu Tarkoma), [Anders Lindgren](https://arxiv.org/search/?searchtype=author&query=Anders Lindgren), [Susanna Pirttikangas](https://arxiv.org/search/?searchtype=author&query=Susanna Pirttikangas), [Lauri Lovén](https://arxiv.org/search/?searchtype=author&query=Lauri Lovén) 作者:Alaa Saleh、Roberto Morabito、Sasu Tarkoma、Anders Lindgren、Susanna Pirttikangas、Lauri Lovén

As 6G networks evolve into increasingly AI-driven, user-centric ecosystems, traditional reactive handover mechanisms demonstrate limitations, especially in mobile edge computing and autonomous agent-based service scenarios. This manuscript introduces WAAN, a cross-layer framework that enables intent-aware and proactive handovers by embedding lightweight TinyML agents as autonomous, negotiation-capable entities across heterogeneous edge nodes that contribute to intent propagation and network adaptation. To ensure continuity across mobility-induced disruptions, WAAN incorporates semi-stable rendezvous points that serve as coordination anchors for context transfer and state preservation. The framework’s operational capabilities are demonstrated through a multimodal environmental control case study, highlighting its effectiveness in maintaining user experience under mobility. Finally, the article discusses key challenges and future opportunities associated with the deployment and evolution of WAAN. 随着 6G 网络向日益以 AI 驱动、以用户为中心的生态系统演进,传统的被动切换机制展现出局限性,尤其在移动边缘计算和基于自主代理的服务场景中。本文提出了 WAAN,一种跨层框架,通过将轻量级 TinyML 代理作为能够自主协商的实体嵌入到异构边缘节点中,实现了意图感知和前瞻性切换,这些节点有助于意图传播和网络自适应。为确保在移动导致的中断中保持连续性,WAAN 引入了半稳定会合点,作为上下文传输和状态保存的协调锚点。通过一个多模态环境控制的案例研究展示了该框架的运行能力,突出了其在移动情况下保持用户体验的有效性。最后,文章讨论了与 WAAN 部署和演进相关的关键挑战和未来机会。

Subjects: Networking and Internet Architecture, Artificial Intelligence, Distributed, Parallel, and Cluster Computing, Machine Learning, Multiagent Systems 主题:网络与互联网架构、人工智能、分布式、并行与集群计算、机器学习、多智能体系统

Publish: 2025-08-02 06:13:42 UTC 发布时间:2025-08-02 06:13:42 协调世界时 (UTC)

#170 To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA #170 理论上理解基于 Transformer 的上下文学习在优化 CSMA 中的应用

Authors: [Shugang Hao](https://arxiv.org/search/?searchtype=author&query=Shugang Hao), [Hongbo Li](https://arxiv.org/search/?searchtype=author&query=Hongbo Li), [Lingjie Duan](https://arxiv.org/search/?searchtype=author&query=Lingjie Duan) 作者:郝树刚、李鸿博、段凌杰

The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and p-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer to pre-collect collision-threshold data examples and a query collision case. They are constructed as a prompt as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities. 二进制指数退避方案在 WiFi 7 中被广泛使用,但在动态信道环境下仍会导致吞吐量性能较差。近期的基于模型的方法(例如,非持续性和 p -持续性 CSMA)仅在已知且固定的节点密度下优化退避策略,由于节点密度估计不准确,仍会导致大量吞吐量损失。本文首次提出基于 LLM transformer 的情境学习(ICL)理论以优化信道接入。我们设计了一个基于 transformer 的 ICL 优化器,用于预先收集碰撞阈值数据样本和一个查询碰撞案例。它们被构造成提示作为 transformer 的输入以学习模式,随后生成预测的竞争窗口阈值(CWT)。为了训练 transformer 以实现有效的 ICL,我们开发了一种高效算法,并保证在有限的训练步数内得到近似最优的 CWT 预测。由于在实践中可能难以收集完美的 ICL 数据样本,我们进一步扩展以允许提示中输入错误数据。我们证明了我们的优化器在预测和吞吐量上仅与最优值保持最小偏差。 在 NS-3 上的实验结果进一步证明了我们的方法在未知节点密度下相比现有基于模型和基于深度强化学习的方法具有快速收敛性和近乎最优的吞吐量表现。

Subjects: Machine Learning, Artificial Intelligence, Networking and Internet Architecture

Publish: 2025-07-31 23:31:23 UTC 发布:2025-07-31 23:31:23 UTC

#171 Efficient Real-Time Aircraft ETA Prediction via Feature Tokenization Transformer #171 高效实时飞机预计到达时间预测:基于特征标记化的 Transformer

Authors: [Liping Huang](https://arxiv.org/search/?searchtype=author&query=Liping Huang), [Yicheng Zhang](https://arxiv.org/search/?searchtype=author&query=Yicheng Zhang), [Yifang Yin](https://arxiv.org/search/?searchtype=author&query=Yifang Yin), [Sheng Zhang](https://arxiv.org/search/?searchtype=author&query=Sheng Zhang), [Yi Zhang](https://arxiv.org/search/?searchtype=author&query=Yi Zhang)

Estimated time of arrival (ETA) for airborne aircraft in real-time is crucial for arrival management in aviation, particularly for runway sequencing. Given the rapidly changing airspace context, the ETA prediction efficiency is as important as its accuracy in a real-time arrival aircraft management system. In this study, we utilize a feature tokenization-based Transformer model to efficiently predict aircraft ETA. Feature tokenization projects raw inputs to latent spaces, while the multi-head self-attention mechanism in the Transformer captures important aspects of the projections, alleviating the need for complex feature engineering. Moreover, the Transformer’s parallel computation capability allows it to handle ETA requests at a high frequency, i.e., 1 Hz, which is essential for a real-time arrival management system. The model inputs include raw data, such as aircraft latitude, longitude, ground speed, theta degree for the airport, day and hour from track data, the weather context, and aircraft wake turbulence category. With a data sampling rate of 1 Hz, the ETA prediction is updated every second. We apply the proposed aircraft ETA prediction approach to Singapore Changi Airport (ICAO Code: WSSS) using one-month Automatic Dependent Surveillance-Broadcast (ADS-B) data from October 1 to October 31, 2022. In the experimental evaluation, the ETA modeling covers all aircraft within a range of 10 NM to 300 NM from WSSS. The results show that our proposed method outperforms the commonly used boosting-tree-based model, improving accuracy by 7% compared to XGBoost, while requiring only 39% of its computing time. Experimental results also indicate that, with 40 aircraft in the airspace at a given timestamp, the ETA inference time is only 51.7 microseconds, making it promising for real-time arrival management systems. 实时空中飞机的预计到达时间(ETA)对于航空到达管理,尤其是跑道排序,至关重要。鉴于空域环境快速变化,ETA 预测的效率在实时到达飞机管理系统中与其准确性同样重要。本研究利用基于特征标记化的 Transformer 模型来高效预测飞机 ETA。特征标记化将原始输入投影到潜在空间,而 Transformer 中的多头自注意力机制捕捉这些投影的重要方面,减少了对复杂特征工程的依赖。此外,Transformer 的并行计算能力使其能够以高频率(即 1Hz)处理 ETA 请求,这对实时到达管理系统至关重要。模型输入包括原始数据,如飞机纬度、经度、地速、相对于机场的方位角(theta 度)、轨迹数据中的日期和小时、天气背景以及飞机尾流湍流类别。在 1Hz 的数据采样率下,ETA 预测每秒更新一次。 我们将所提出的飞机预计到达时间(ETA)预测方法应用于新加坡樟宜机场(ICAO 代码:WSSS),使用 2022 年 10 月 1 日至 10 月 31 日一个月的自动相关监视广播(ADS-B)数据。在实验评估中,ETA 建模涵盖了距离 WSSS 10 海里至 300 海里范围内的所有飞机。结果表明,我们提出的方法优于常用的基于提升树的模型,相较于 XGBoost 提高了 7% 的精度,同时仅需其 39% 的计算时间。实验结果还表明,在某一时间点空域中有 40 架飞机时,ETA 推断时间仅为 51.7 微秒,这使其对实时到达管理系统具有良好应用前景。
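
下面给出特征标记化(feature tokenization)思路的一个最小 PyTorch 草图(并非论文实现):每个数值特征经独立的线性投影变成一个 token,类别特征(如尾流湍流类别)查表得到 token,再经 Transformer 编码器与回归头输出 ETA;特征数量、维度与超参数均为假设。

```python
import torch
import torch.nn as nn

class FTTransformerETA(nn.Module):
    """特征标记化 + Transformer 的 ETA 回归草图。"""
    def __init__(self, n_num: int = 7, n_wake_cat: int = 4, d: int = 32):
        super().__init__()
        # 每个数值特征有独立的 (权重, 偏置),投影成一个 d 维 token
        self.num_w = nn.Parameter(torch.randn(n_num, d) * 0.02)
        self.num_b = nn.Parameter(torch.zeros(n_num, d))
        self.wake_emb = nn.Embedding(n_wake_cat, d)       # 尾流湍流类别 -> token
        self.cls = nn.Parameter(torch.zeros(1, 1, d))     # 汇聚 token
        layer = nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 1)                       # 输出 ETA(例如以秒计)

    def forward(self, x_num: torch.Tensor, wake: torch.Tensor) -> torch.Tensor:
        # x_num: (B, n_num) 数值特征,如纬度/经度/地速/方位角/小时等;wake: (B,) 类别索引
        tok_num = x_num.unsqueeze(-1) * self.num_w + self.num_b      # (B, n_num, d)
        tok_cat = self.wake_emb(wake).unsqueeze(1)                   # (B, 1, d)
        cls = self.cls.expand(x_num.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, tok_num, tok_cat], dim=1))
        return self.head(h[:, 0])                                    # 用 CLS token 回归

# 用法示意
model = FTTransformerETA()
eta = model(torch.randn(4, 7), torch.randint(0, 4, (4,)))
print(eta.shape)   # torch.Size([4, 1])
```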

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-07-31 03:56:30 UTC

#172 Bayesian-Driven Graph Reasoning for Active Radio Map Construction #172 贝叶斯驱动的图推理用于主动无线电地图构建

Authors: [Wenlihan Lu](https://arxiv.org/search/?searchtype=author&query=Wenlihan Lu), [Shijian Gao](https://arxiv.org/search/?searchtype=author&query=Shijian Gao), [Miaowen Wen](https://arxiv.org/search/?searchtype=author&query=Miaowen Wen), [Yuxuan Liang](https://arxiv.org/search/?searchtype=author&query=Yuxuan Liang), [Chan-Byoung Chae](https://arxiv.org/search/?searchtype=author&query=Chan-Byoung Chae), [H. Vincent Poor](https://arxiv.org/search/?searchtype=author&query=H. Vincent Poor) 作者:陆文丽、高世建、温妙文、梁宇轩、蔡灿炳、H. Vincent Poor

With the emergence of the low-altitude economy, radio maps have become essential for ensuring reliable wireless connectivity to aerial platforms. Autonomous aerial agents are commonly deployed for data collection using waypoint-based navigation; however, their limited battery capacity significantly constrains coverage and efficiency. To address this, we propose an uncertainty-aware radio map (URAM) reconstruction framework that explicitly leverages graph-based reasoning tailored for waypoint navigation. Our approach integrates two key deep learning components: (1) a Bayesian neural network that estimates spatial uncertainty in real time, and (2) an attention-based reinforcement learning policy that performs global reasoning over a probabilistic roadmap, using uncertainty estimates to plan informative and energy-efficient trajectories. This graph-based reasoning enables intelligent, non-myopic trajectory planning, guiding agents toward the most informative regions while satisfying safety constraints. Experimental results show that URAM improves reconstruction accuracy by up to 34% over existing baselines. 随着低空经济的兴起,频谱(无线电)地图已成为确保空中平台可靠无线连接的关键。自主空中代理通常通过基于航路点的导航进行数据采集;然而,它们受限于有限的电池容量,显著制约了覆盖范围和效率。为了解决这一问题,我们提出了一种不确定性感知的频谱地图(URAM)重建框架,明确利用为航路点导航量身定制的基于图的推理。我们的方法整合了两个关键的深度学习组件: (1) 一个贝叶斯神经网络,用于实时估计空间不确定性;(2) 一个基于注意力的强化学习策略,在概率路网(probabilistic roadmap)上执行全局推理,利用不确定性估计来规划信息量大且节能的轨迹。该基于图的推理能够实现智能的、非短视的轨迹规划,在满足安全约束的同时,引导代理前往信息最丰富的区域。实验结果表明,URAM 在重建精度上较现有基线提高了最多 34%。
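
作为示意(并非论文实现),下面用 MC Dropout 近似贝叶斯神经网络的空间不确定性估计,并用一个简单的贪心规则在路标图上选取下一个航路点,以代替论文中基于注意力的强化学习策略,仅用来说明「走向不确定性高的区域」这一思路;网络、图结构与数据均为虚构。

```python
import torch
import torch.nn as nn

class SignalMLP(nn.Module):
    """带 dropout 的 MLP,预测位置 (x, y, z) 处的信号强度;保持 dropout 开启重复前向即可估计不确定性。"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.2),
                                 nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
                                 nn.Linear(64, 1))

    def forward(self, pos):          # pos: (N, 3)
        return self.net(pos)

def mc_uncertainty(model, pos, n_samples=20):
    """MC Dropout:训练模式下多次前向,取预测方差作为空间不确定性。"""
    model.train()                    # 保持 dropout 打开
    with torch.no_grad():
        preds = torch.stack([model(pos) for _ in range(n_samples)])
    return preds.var(dim=0).squeeze(-1)          # (N,)

def greedy_next_waypoint(uncertainty, neighbors, current):
    """在路标图上贪心地走向不确定性最高的相邻节点(代替论文中的 RL 策略)。"""
    return max(neighbors[current], key=lambda i: uncertainty[i].item())

# 用法示意:5 个路标节点的小图
nodes = torch.rand(5, 3)
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
model = SignalMLP()
u = mc_uncertainty(model, nodes)
print("next waypoint:", greedy_next_waypoint(u, neighbors, current=0))
```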

Subjects: Signal Processing, Artificial Intelligence

Publish: 2025-07-29 03:32:01 UTC 出版:2025-07-29 03:32:01 UTC

#173 User-Intent-Driven Semantic Communication via Adaptive Deep Understanding #173 基于用户意图驱动的语义通信:通过自适应深度理解

Authors: [Peigen Ye](https://arxiv.org/search/?searchtype=author&query=Peigen Ye), [Jingpu Duan](https://arxiv.org/search/?searchtype=author&query=Jingpu Duan), [Hongyang Du](https://arxiv.org/search/?searchtype=author&query=Hongyang Du), [Yulan Guo](https://arxiv.org/search/?searchtype=author&query=Yulan Guo) 作者:叶培根,段静璞,杜洪洋,郭玉兰

Semantic communication focuses on transmitting task-relevant semantic information, aiming for intent-oriented communication. While existing systems improve efficiency by extracting key semantics, they still fail to deeply understand and generalize users’ real intentions. To overcome this, we propose a user-intention-driven semantic communication system that interprets diverse abstract intents. First, we integrate a multi-modal large model as semantic knowledge base to generate user-intention prior. Next, a mask-guided attention module is proposed to effectively highlight critical semantic regions. Further, a channel state awareness module ensures adaptive, robust transmission across varying channel conditions. Extensive experiments demonstrate that our system achieves deep intent understanding and outperforms DeepJSCC, e.g., under a Rayleigh channel at an SNR of 5 dB, it achieves improvements of 8%, 6%, and 19% in PSNR, SSIM, and LPIPS, respectively. 语义通信侧重于传输与任务相关的语义信息,旨在实现以意图为导向的通信。尽管现有系统通过提取关键语义提高了效率,但仍未能深入理解并泛化用户的真实意图。为此,我们提出了一种用户意图驱动的语义通信系统,以解释多样的抽象意图。首先,我们将多模态大模型作为语义知识库以生成用户意图先验。接着,提出了一个掩码引导注意力模块,有效突出关键语义区域。进一步地,通道状态感知模块确保在不同信道条件下的自适应、鲁棒传输。大量实验表明,我们的系统实现了深层意图理解并优于 DeepJSCC,例如在 Rayleigh 信道、5 dB SNR 下,PSNR、SSIM 和 LPIPS 分别提升了 8%、6%和 19%。
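
下面是「掩码引导注意力」一种可能实现的最小草图(并非论文实现):将语义掩码转成注意力 logits 上的加性偏置,使注意力集中到被判定为关键的语义区域;函数接口、维度与偏置系数均为假设。

```python
import torch
import torch.nn.functional as F

def mask_guided_attention(q, k, v, semantic_mask, bias_scale=4.0):
    """q, k, v: (B, N, d);semantic_mask: (B, N),1 表示关键语义区域,0 表示背景。
    在注意力 logits 上给关键区域对应的 key 加正偏置,从而突出这些区域。"""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5                 # (B, N, N)
    logits = logits + bias_scale * semantic_mask.unsqueeze(1)   # 每个 query 都偏向关键 key
    attn = F.softmax(logits, dim=-1)
    return attn @ v                                             # (B, N, d)

# 用法示意:16 个空间 token,其中前 4 个被标为关键区域
B, N, d = 2, 16, 32
q = k = v = torch.randn(B, N, d)
mask = torch.zeros(B, N); mask[:, :4] = 1.0
out = mask_guided_attention(q, k, v, mask)
print(out.shape)   # torch.Size([2, 16, 32])
```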

Subject: Information Theory 主题:信息论

Publish: 2025-08-07 22:26:27 UTC 发布:2025-08-07 22:26:27 协调世界时

#174 QuickGrasp: Lightweight Antipodal Grasp Planning with Point Clouds #174 QuickGrasp:基于点云的轻量级对捏抓取规划

Authors: [Navin Sriram Ravie](https://arxiv.org/search/?searchtype=author&query=Navin Sriram Ravie), [Keerthi Vasan M](https://arxiv.org/search/?searchtype=author&query=Keerthi Vasan M), [Asokan Thondiyath](https://arxiv.org/search/?searchtype=author&query=Asokan Thondiyath), [Bijo Sebastian](https://arxiv.org/search/?searchtype=author&query=Bijo Sebastian) 作者:Navin Sriram Ravie、Keerthi Vasan M、Asokan Thondiyath、Bijo Sebastian

Grasping has been a long-standing challenge in facilitating the final interface between a robot and the environment. As environments and tasks become complicated, the need to embed higher intelligence to infer from the surroundings and act on them has become necessary. Although most methods utilize techniques to estimate grasp pose by treating the problem via pure sampling-based approaches in the six-degree-of-freedom space or as a learning problem, they usually fail in real-life settings owing to poor generalization across domains. In addition, the time taken to generate the grasp plan and the lack of repeatability, owing to sampling inefficiency and the probabilistic nature of existing grasp planning approaches, severely limits their application in real-world tasks. This paper presents a lightweight analytical approach towards robotic grasp planning, particularly antipodal grasps, with little to no sampling in the six-degree-of-freedom space. The proposed grasp planning algorithm is formulated as an optimization problem towards estimating grasp points on the object surface instead of directly estimating the end-effector pose. To this end, a soft-region-growing algorithm is presented for effective plane segmentation, even in the case of curved surfaces. An optimization-based quality metric is then used for the evaluation of grasp points to ensure indirect force closure. The proposed grasp framework is compared with the existing state-of-the-art grasp planning approach, Grasp pose detection (GPD), as a baseline over multiple simulated objects. The effectiveness of the proposed approach in comparison to GPD is also evaluated in a real-world setting using image and point-cloud data, with the planned grasps being executed using a ROBOTIQ gripper and UR5 manipulator. 抓取一直是促成机器人与环境之间最终交互的长期挑战。随着环境和任务变得更加复杂,必须嵌入更高层次的智能以从周围环境推断并据此采取行动。尽管大多数方法通过在六自由度空间内采用纯基于采样的方法或将其作为一个学习问题来估计抓取姿态,但由于跨域泛化能力差,它们通常在现实场景中表现不佳。此外,由于现有抓取规划方法的采样效率低和概率性特征,生成抓取方案所需的时间较长且缺乏可重复性,极大限制了它们在实际任务中的应用。本文提出了一种轻量级的解析方法用于机器人抓取规划,特别是对偶点(antipodal)抓取,几乎不在六自由度空间中进行采样。所提出的抓取规划算法被表述为一个优化问题,目标是估计物体表面的抓取点,而不是直接估计末端执行器的位姿。为此,提出了一种软区域生长算法以实现有效的平面分割,即使在曲面情况下也能适用。随后采用基于优化的质量度量来评估抓取点,以确保间接力闭合。将所提出的抓取框架与现有的最先进抓取规划方法——抓取姿态检测(GPD)作为基线,在多个模拟物体上进行了比较。还在真实环境中使用图像和点云数据评估了所提出方法相对于 GPD 的有效性,并使用 ROBOTIQ 夹持器和 UR5 机械臂执行了计划的抓取动作。
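
下面用 NumPy 给出对偶点(antipodal)抓取点配对评分的最小示意,只体现「两接触点外法向近似相反、且连线方向与法向对齐」这一几何条件,并非论文的完整优化与软区域生长流程;阈值与点云数据均为虚构。

```python
import numpy as np
from itertools import combinations

def antipodal_score(p1, n1, p2, n2):
    """对一对带外法向的表面点打分:两法向近似相反、且连线方向与两侧法向对齐时得分高。"""
    axis = p2 - p1
    axis = axis / (np.linalg.norm(axis) + 1e-9)
    opposition = -n1 @ n2                      # 法向相反程度
    align = (-n1 @ axis) * (n2 @ axis)         # 连线分别落在两侧接触法向附近
    return max(0.0, opposition) * max(0.0, align)

def best_antipodal_pair(points, normals, max_width=0.08):
    """小规模示意:穷举所有点对,返回夹爪开度内得分最高的一对抓取点。"""
    best, best_pair = 0.0, None
    for i, j in combinations(range(len(points)), 2):
        if np.linalg.norm(points[i] - points[j]) > max_width:
            continue                           # 超过夹爪最大开度
        s = antipodal_score(points[i], normals[i], points[j], normals[j])
        if s > best:
            best, best_pair = s, (i, j)
    return best_pair, best

# 用法示意:一个 4 cm 宽「薄盒」两侧与顶面的采样点
points = np.array([[0.0, 0.0, 0.0], [0.04, 0.0, 0.0], [0.02, 0.05, 0.0]])
normals = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(best_antipodal_pair(points, normals))    # ((0, 1), 1.0):两侧面构成对偶点抓取
```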

Subject: Robotics 主题:机器人学

Publish: 2025-04-28 12:09:10 UTC 发布:2025-04-28 12:09:10 UTC

1.3 Huggingface

1.4 X

1.5 小红书

2. 感兴趣研究

新模型

  1. Mistral Medium V3.

  2. Qwen3-Coder

    1. 是一款拥有4800 亿参数(其中350亿活跃参数)的专家混合(MoE)编程AI 模型。该模型通过先进的强化学习实现了最先进的智能编程性能,原生支持256K 上下文长度,并能无缝集成Claude Code和Cline等主流开发工具。
    2. Qwen3-Coder 使用7.5 万亿token 进行训练(其中70% 为代码数据),同时保持通用语言能力。该模型通过2万个并行环境的强化学习优化多轮编程任务,在最新SWE-bench榜单(通过真实GitHub issue 解决方案评估模型软件工程能力)中达到55.4% 的准确率。
    3. 该模型提供开源CLI 工具,可集成至现有开发工作流实现实用编程辅助。
  3. qwen3-VL 来了

  4. Meditron

    1. 是由洛桑联邦理工学院(EPFL)开发的开源医疗大语言模型套件,提供基于Llama-2微调的70亿和700亿参数版本。模型通过精选的PubMed论文、医疗指南和领域特定数据集进行持续预训练,提供专业医疗AI 能力。
    2. Meditron-70B支持医学考试问答、辅助鉴别诊断和疾病信息查询。尽管设计用于编码高质量医学知识,但Meditron包含重要安全指南:建议在任何医疗应用前进行充分测试和临床验证。

逻辑与推理

  1. 混合数学编程逻辑数据,一次性提升AI多领域强化学习能力 | 上海AI Lab

    1. 现有关于强化学习与大模型的研究多聚焦于单一领域的优化,缺乏对跨领域知识迁移和协同推理能力的系统性探索;该工作旨在让模型在多领域数据上协同训练,从而获得更好的推理能力。
    2. 团队构建了一个涵盖数学(Math)、编程(Code)和逻辑谜题(Puzzle)三大类数据的多领域评估框架,并为不同训练数据设计了定制化的奖励策略。
    3. Puzzle与Math数据的相互支持:逻辑推理与数学能力相辅相成,显著提升模型的整体性能。
    4. 论文地址:https://arxiv.org/abs/2507.17512 训练代码:https://github.com/Leey21/A-Data-Centric-Study
  2. 少思考多采样:用于简洁推理的分组过滤策略优化

#21 Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning 少思考多采样:用于简洁推理的分组过滤策略优化

使用可验证奖励进行强化学习训练的大型语言模型往往以更长的回答换取准确性:通过增加回答长度来取得准确性提升。虽然对于更难的问题更长的答案可能是合理的,但许多标记只是“填充物”:重复、冗长的文本并没有真正推进解答。我们提出了 GFPO(组过滤策略优化),通过在训练时对每个问题采样更大的组并基于两个关键指标筛选用于训练的回应来遏制这种长度膨胀: (1) 回答长度和 (2) 标记效率:每标记奖励比率。通过在训练时进行更多采样,我们教会模型在推理时少思考。在 Phi-4-reasoning 模型上,GFPO 在保持准确性的同时,将 GRPO 在具有挑战性的 STEM 和编码基准(AIME 24/25、GPQA、Omni-MATH、LiveCodeBench)上的长度膨胀减少了 46–71%。以每标记奖励为优化目标,则可将长度膨胀的削减幅度进一步提高到 71–85%。我们还提出了自适应难度 GFPO,它根据实时难度估计为更难的问题动态分配更多训练资源,从而在计算效率与准确性之间取得更好的平衡,尤其是在困难问题上。GFPO 证明了增加训练阶段的计算量可以直接转化为测试阶段计算量的减少——这是实现高效推理的一种简单而有效的权衡。
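
下面用 NumPy 给出 GFPO「扩大采样组、按指标过滤、再在保留子集上计算组内优势」这一核心思想的最小示意(并非论文实现):奖励、长度与筛选参数均为占位数据。

```python
import numpy as np

def gfpo_select_and_advantage(rewards, lengths, k, metric="reward_per_token"):
    """rewards / lengths: 同一问题下 G 条回应的奖励与 token 数。
    先按指标筛掉冗长回应,再在保留的 k 条上做组内标准化优势(GRPO 风格),
    其余回应优势记 0(即不参与梯度更新)。"""
    rewards, lengths = np.asarray(rewards, float), np.asarray(lengths, float)
    if metric == "length":
        score = -lengths                           # 越短越好
    else:
        score = rewards / np.maximum(lengths, 1)   # 每 token 奖励越高越好
    keep = np.argsort(score)[::-1][:k]             # 过滤:保留得分最高的 k 条
    adv = np.zeros_like(rewards)
    kept_r = rewards[keep]
    adv[keep] = (kept_r - kept_r.mean()) / (kept_r.std() + 1e-6)
    return keep, adv

# 用法示意:同一问题采样 6 条回应
rewards = [1.0, 1.0, 0.0, 1.0, 0.5, 0.0]
lengths = [1200, 300, 800, 450, 500, 200]
keep, adv = gfpo_select_and_advantage(rewards, lengths, k=3)
print(keep, adv.round(2))   # 第 0 条虽正确但过长,被过滤;保留的 3 条内部再计算标准化优势
```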

发布:2025-08-13 11:43:49 UTC

  3. 大型语言模型在数学计算与推理中的错误

#1 Mathematical Computation and Reasoning Errors by Large Language Models

大型语言模型 (LLMs) 在以人工智能驱动的教学和评估中被越来越多地应用,尤其是在数学教育领域。LLMs 在数学问题解决任务中生成准确答案和详细解题步骤的能力,是确保数学教育实践中反馈和评估可靠与精确的基础。

本研究聚焦评估四种 LLMs(OpenAI GPT-4o 与 o1、DeepSeek-V3 与 DeepSeek-R1)在解决三类数学任务(算术、代数与数论)时的准确性,并识别其解题过程中的逐步推理错误。我们并未依赖标准基准测试,而是有意构建了对 LLMs 具有挑战性且容易出错的数学题目(通过题目模型)。

系统性地分析并编码了最终答案的准确性与单个解题步骤中错误的存在。研究中测试了单代理与双代理两种配置。观察到经推理增强的 OpenAI o1 模型在所有三类数学任务中始终达到更高或接近完美的准确率。 对错误的分析显示,程序性失误最为频繁且显著影响整体表现,而概念性误解则较少。部署双代理配置大幅提升了整体表现。这些发现为提升 LLM 性能提供了可操作的见解,并强调了将 LLM 整合到数学教育中的有效策略,从而推动以人工智能为驱动的教学实践和评估精度的进步。

发布:2025-08-13 16:33:02 UTC

  4. 链式思维是幻象吗?从数据分布视角重新审视大模型推理,马斯克回复,Grok破防
  5. DeepMind AlphaProof 前负责人谈2025年IMO

对齐

  1. 对齐技术的综合评估框架

#7 A Comprehensive Evaluation framework of Alignment Techniques for LLMs LLMs 对齐技术的综合评估框架

随着大型语言模型(LLMs)越来越多地融入现实应用,确保其输出与人类价值观和安全标准相符已变得至关重要。该领域已经发展出多种对齐方法,包括传统的微调方法(RLHF、指令微调)、事后修正系统和推理时干预,每种方法都有其独特的优点和局限。然而,缺乏统一的评估框架使得系统比较这些范式并指导部署决策变得困难。本文提出了对 LLMs 对齐技术的多维评估——一个全面的评估框架,能够对所有主要对齐范式进行系统比较。我们的框架沿四个关键维度评估方法:对齐检测、对齐质量、计算效率和鲁棒性。通过对不同基础模型和对齐策略的实验证明,我们展示了该框架在识别当前最先进模型的优劣方面的实用性,为未来的研究方向提供了有价值的见解。

发布:2025-08-13 16:42:01 UTC

综述

  1. 速度永远取胜:大规模语言模型高效架构综述

#13 Speed Always Wins: A Survey on Efficient Architectures for Large Language Models 速度永远取胜:大规模语言模型高效架构综述

大型语言模型(LLMs)在语言理解、生成、推理方面取得了令人瞩目的成果,并推动了多模态模型能力的边界。作为现代 LLMs 的基础,Transformer 模型由于优异的可扩展性提供了强有力的基线。然而,传统的 Transformer 架构需要大量计算资源,这对大规模训练和实际部署构成了重大障碍。在这篇综述中,我们系统地考察了旨在解决 Transformer 固有局限并提升效率的创新 LLM 架构。从语言建模出发,本文涵盖了线性和稀疏序列建模方法的背景与技术细节、高效的全注意力变体、稀疏专家混合(sparse mixture-of-experts)、结合上述技术的混合模型架构,以及新兴的扩散式 LLMs。此外,我们还讨论了这些技术在其他模态上的应用,并考虑了它们对开发可扩展、资源感知基础模型的更广泛影响。 通过将近来的研究归入上述类别,本综述呈现了现代高效 LLM 架构的蓝图,我们希望这能激励未来针对更高效、多功能 AI 系统的研究。

发布:2025-08-13 14:13:46 UTC

评估

  1. EffiEval:通过能力覆盖最大化实现高效且具可泛化性的模型评估

#25 EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization EffiEval:通过能力覆盖最大化实现高效且具可泛化性的模型评估

大型语言模型(LLMs)的快速发展以及日益增大和多样化的评估基准的出现,为模型评估带来了巨大的计算挑战。

本文提出了 EffiEval,一种无需训练的高效基准评估方法,能够在保持高评估可靠性的同时有效解决数据冗余问题。

我们的方法专为满足高质量评估的三项关键标准而设计:代表性,通过确保对模型能力的全面覆盖;公平性,通过在样本选择过程中不依赖模型性能以避免偏差;以及泛化性,通过在不依赖大规模评估数据的情况下,实现跨数据集和模型家族的灵活迁移。不同于依赖绝对性能或需要大量评估数据的传统方法,我们的方法基于模型效用指数(Model Utility Index, MUI)自适应地选择高质量代表性子集。

在多个公开基准和多种 LLMs 上进行的大量实验证明,EffiEval 在仅使用原始数据一小部分的情况下,能够与全数据集评估实现较强的一致性排序。 此外,我们的方法在规模上灵活且可扩展,允许用户根据具体需求在评估效率与代表性之间进行权衡。总体而言,EffiEval 为 LLMs 时代提供了一个可靠、公平且高效的实用且具有普适性的评估解决方案。
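
摘要未给出 MUI 的具体计算方式,下面仅以经典的贪心最大覆盖作为「能力覆盖最大化」选取代表性子集的示意(并非论文实现);其中「每个样本覆盖哪些能力标签」的映射是假设的输入。

```python
def greedy_coverage_subset(item_capabilities, budget):
    """item_capabilities: {样本 id: 该样本覆盖的能力标签集合};budget: 子集大小上限。
    经典贪心最大覆盖:每一步选择能新增覆盖最多能力的样本。"""
    covered, chosen = set(), []
    for _ in range(budget):
        best_item, best_gain = None, 0
        for item, caps in item_capabilities.items():
            if item in chosen:
                continue
            gain = len(caps - covered)           # 该样本能带来的新增覆盖
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:                    # 没有样本能再带来新的覆盖
            break
        chosen.append(best_item)
        covered |= item_capabilities[best_item]
    return chosen, covered

# 用法示意:6 个样本、5 种能力标签
items = {
    "q1": {"算术", "多步推理"}, "q2": {"算术"}, "q3": {"代码", "多步推理"},
    "q4": {"常识"}, "q5": {"代码"}, "q6": {"常识", "指令遵循"},
}
subset, covered = greedy_coverage_subset(items, budget=3)
print(subset)   # ['q1', 'q6', 'q3'],3 个样本即覆盖全部 5 种能力标签
```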

发布:2025-08-13 09:48:23 协调世界时 (UTC)

智能体

  1. 破解「长程智能体」RL训练难题,腾讯提出RLVMR框架,让7B模型「思考」比肩GPT-4o
  2. AWorld:用于鲁棒 GAIA 问题求解的具稳定机动性的动态多智能体系统

#3 AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving

大型语言模型(LLMs)的快速进展使智能代理能够利用多种外部工具来解决复杂的现实问题。然而,随着代理对多工具的依赖增加,它们面临新的挑战:来自不同来源的扩展上下文以及工具输出中的噪声或无关信息可能削弱系统的可靠性和准确性。这些挑战凸显了对基于代理系统更高稳定性的必要性。为此,我们引入了动态监督与操控机制,在 AWorld 框架内构建了一个鲁棒且动态的多代理系统(MAS)架构。在我们的方法中,执行代理在关键步骤调用守护代理以验证并校正推理过程,有效减少由噪声引起的错误并增强问题解决的鲁棒性。在 GAIA 测试数据集上的大量实验表明,我们的动态操控机制显著提升了解法的有效性和稳定性,优于单代理系统(SAS)和标准的工具增强系统。 因此,我们的动态多智能体系统在著名的 GAIA 排行榜中,在开源项目中获得了第一名。这些发现突显了协作代理角色在开发更可靠、更值得信赖的智能系统方面的实际价值。
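
下面是「执行代理在关键步骤调用守护代理校验并按反馈纠正」这一协作模式的极简示意(并非 AWorld 的真实接口):execute 与 verify 均为占位函数,真实系统中它们各自是带工具的 LLM 代理。

```python
from typing import Callable

def run_with_guardian(steps: list,
                      execute: Callable[[str], str],
                      verify: Callable[[str, str], tuple],
                      max_retries: int = 2) -> list:
    """执行代理逐步求解;每个关键步骤的输出交给守护代理校验,
    未通过则带着守护代理的反馈重做,从而抑制噪声工具输出导致的误差累积。"""
    trace = []
    for step in steps:
        output = execute(step)
        ok, feedback = verify(step, output)
        retries = 0
        while not ok and retries < max_retries:
            output = execute(step + f"\n[守护代理反馈] {feedback}")   # 带反馈重做
            ok, feedback = verify(step, output)
            retries += 1
        trace.append(output)
    return trace

# 用法示意:用简单的占位函数模拟两个代理
execute = lambda prompt: f"answer({prompt.splitlines()[0]})"
verify = lambda step, out: (("搜索" not in step) or ("来源" in out), "请附上信息来源")
print(run_with_guardian(["计算 2+2", "搜索 GAIA 排行榜"], execute, verify))
```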

发布:2025-08-13 15:46:25 UTC

就业

  1. LeetCode刷够100小时,学会找人内推,OpenAI员工下场教你拿Offer
