2025-12-18 Research Tracker
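1. Source data
1.1 Media
From: QbitAI, Synced (Jiqizhixin), Xinzhiyuan, AGI Hunt, Xiaohongshu, X, and others
- Just in: Gemini 3, the model that turned things around for Google, now has a Flash version
- Measured by intelligence per unit cost, it is now the best-value model in the world
- Gemini 3 Flash shows that speed and scale need not come at the expense of intelligence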
1.2 Huggingface
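- MMGR: Multi-Modal Generative Reasoning (82▲)
- Video Reality Test: Can AI-Generated ASMR Videos Fool VLMs and Humans? (51▲)
- WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling (49▲)
- Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via a Unified Understanding-Generation Model (38▲)
- RoboTracer: Mastering Spatial Tracing in Vision-Language Models for Robotics (31▲)
- OpenDataArena: A Fair and Open Platform for Benchmarking Post-Training Dataset Value (27▲)
- Revealing Hidden Pitfalls and Navigating Next-Generation Vector Similarity Search from a Task-Centric View (23▲)
- Vector Prism: Animating Vector Graphics via Stratified Semantic Structure (22▲)
- recpt - v2 Technical Report (14▲)
- MemFlow: Flowing Adaptive Memory for Coherent and Efficient Long Video Narratives (14▲)
- ShowTable: Unlocking Creative Table Visualization through Collaborative Reflection and Refinement (13▲)
- Feed-Forward 3D Editing via Text-Steerable Image-to-3D (13▲)
- 26 more papers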
1.3 Arxiv
1.3.1 Computation and Language
From: https://papers.cool/arxiv/cs.CL
https://arxiv.org/list/cs.CL/recent
https://www.arxivdaily.com/cate/20/seq/0
cs.CL: 60 papers today
Large-model related (28 papers)
【1】Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers Link: https://arxiv.org/abs/2512.15674
Authors: Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks Comments: 36 pages Abstract: Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
【2】Explaining the Reasoning of Large Language Models Using Attribution Graphs Link: https://arxiv.org/abs/2512.15663
Authors: Chase Walker, Rickard Ewetz Abstract: Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties: causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
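The path marginalization described in the CAGE abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `prompt_attributions`, the ragged-row layout, and the toy weights are assumptions made for illustration. Row `t` holds the direct influence on generated token `t` of each prompt token, followed by each earlier generated token. Causality is encoded by the row length and row stochasticity by the requirement that each row sums to 1.

```python
def prompt_attributions(direct, n_prompt):
    """Fold influence that flows through earlier generations back onto the prompt.

    direct[t] lists the direct influence weights on generated token t:
    first the n_prompt prompt tokens, then the t earlier generated tokens.
    (Hypothetical layout chosen for this sketch.)
    """
    totals = []
    for t, row in enumerate(direct):
        # Causality: token t may only depend on the prompt and earlier generations.
        if len(row) != n_prompt + t:
            raise ValueError("row t must cover the prompt plus the t earlier generations")
        # Row stochasticity: the direct influences on each token sum to 1.
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row must sum to 1")
        total = list(row[:n_prompt])
        for s in range(t):
            weight = row[n_prompt + s]
            # Influence routed through generation s is redistributed
            # according to that generation's own prompt attribution.
            for j in range(n_prompt):
                total[j] += weight * totals[s][j]
        totals.append(total)
    return totals


# Toy example with two prompt tokens and two generated tokens.
direct = [
    [0.75, 0.25],      # generation 0 depends only on the prompt
    [0.2, 0.2, 0.6],   # generation 1 also depends on generation 0
]
totals = prompt_attributions(direct, n_prompt=2)
```

In this toy case the second generated token ends up with prompt attributions of about 0.65 and 0.35, which still sum to 1 because every row of the graph does.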
【3】VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression? Link: https://arxiv.org/abs/2512.15649
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
【4】Evaluating Metrics for Safety with LLM-as-Judges Link: https://arxiv.org/abs/2512.15617
Authors: Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton Abstract: LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
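One way to picture the basket of weighted metrics and the concordance-triggered human review from the abstract above is the sketch below. Everything concrete in it is an assumption for illustration rather than the paper's method: the metric names, the weights, the severity thresholds, and the definition of concordance as one minus the spread between the judges' weighted scores.

```python
# Hypothetical severity levels: stricter agreement is demanded for critical contexts.
SEVERITY_THRESHOLDS = {"routine": 0.6, "safety_critical": 0.9}


def weighted_score(metric_scores, weights):
    """Combine one judge's per-metric scores (each in [0, 1]) into a weighted average."""
    total_weight = sum(weights.values())
    return sum(metric_scores[name] * w for name, w in weights.items()) / total_weight


def needs_human_review(judgments, weights, severity):
    """Return (flag, concordance); flag is True when the judges disagree too much.

    Concordance is taken here as 1 minus the gap between the highest and
    lowest weighted score, which is only one of many possible definitions.
    """
    scores = [weighted_score(j, weights) for j in judgments]
    concordance = 1.0 - (max(scores) - min(scores))
    return concordance < SEVERITY_THRESHOLDS[severity], concordance


# Two hypothetical LLM judges scoring the same output.
weights = {"faithfulness": 2.0, "completeness": 1.0}
judgments = [
    {"faithfulness": 0.9, "completeness": 0.8},
    {"faithfulness": 0.6, "completeness": 0.7},
]
routine_flag, concordance = needs_human_review(judgments, weights, "routine")
critical_flag, _ = needs_human_review(judgments, weights, "safety_critical")
```

With these toy numbers the same pair of judgments passes in a routine context but is escalated to a human in a safety-critical one, which is the context sensitivity the abstract argues for.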
【5】Bolmo: Byteifying the Next Generation of Language Models Link: https://arxiv.org/abs/2512.15586
Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann Abstract: We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization (such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary) while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs’ performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
【6】CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing Link: https://arxiv.org/abs/2512.15550
Authors: Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang, Huan Li, Wei Chu, Xu Yinghui, Yuan Qi, Gang Chen Abstract: Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
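The two-stage idea in the abstract above (a coarse centroid lookup followed by token-level refinement) can be sketched generically as follows. This is a plain coarse-to-fine retrieval over key vectors and should not be read as CTKVR's actual index: the grouping of keys into contiguous chunks, the dot-product scoring, and all function and parameter names are assumptions made for this illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_centroids(keys, group_size):
    """Precompute one mean vector per contiguous group of keys (done once, e.g. at prefill)."""
    dim = len(keys[0])
    groups = []
    for start in range(0, len(keys), group_size):
        indices = list(range(start, min(start + group_size, len(keys))))
        centroid = [sum(keys[i][d] for i in indices) / len(indices) for d in range(dim)]
        groups.append((centroid, indices))
    return groups


def retrieve(query, keys, groups, n_groups, top_k):
    """Stage 1: rank groups by centroid score. Stage 2: rank tokens inside the best groups."""
    ranked = sorted(groups, key=lambda g: dot(query, g[0]), reverse=True)
    candidates = [i for _, indices in ranked[:n_groups] for i in indices]
    # Token-level refinement: exact scores, but only for the shortlisted tokens.
    candidates.sort(key=lambda i: dot(query, keys[i]), reverse=True)
    return candidates[:top_k]


# Toy cache of six 2-d key vectors, grouped in pairs.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
groups = build_centroids(keys, group_size=2)
selected = retrieve([0.0, 1.0], keys, groups, n_groups=1, top_k=2)
```

The design trade-off is the usual one for coarse-to-fine search: the centroid stage keeps the number of exact token scores small, while the refinement stage avoids returning every token of a shortlisted group regardless of relevance.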
【7】Toward expert-level motivational interviewing for health behavior improvement with LLMs Link: https://arxiv.org/abs/2512.15446
Authors: Run-ze Hu, Yang Yang, Yi-hang Yang, Jing-qi Kong, Jia-hui Luo, Wen-yu Yang, Jing Chen, Jing-yao Liu, Hui-qun Zeng, Lei Zhang, Zheng Liu Comments: 26 pages, 3 figures Abstract: Background: Motivational interviewing (MI) is an effective counseling approach for promoting health behavior change, but its impact is constrained by the need for highly trained human counselors. Objective: This study aimed to explore a scalable alternative by developing and evaluating Large Language Models for Motivational Interviewing (MI-LLMs). Methods: We first curated five Chinese psychological counseling corpora and, using GPT-4 with an MI-informed prompt, transcribed multi-turn dialogues from the two highest-quality datasets (CPsyCounD and PsyDTCorpus) into 2,040 MI-style counseling conversations, of which 2,000 were used for training and 40 for testing. Three Chinese-capable open-source LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat and Llama-3-8B-Chinese-Chat-v2) were fine-tuned on this corpus and were named as MI-LLMs. We evaluated MI-LLMs using round-based automatic metrics and expert manual coding with the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Results: Across all three models, fine-tuning substantially improved BLEU-4 and ROUGE scores compared with the base models, and manual coding showed that MI-LLMs achieved technical and relational global scores, and MI-adherent ratios that approached those of real MI dialogues, although complex reflections and reflection-to-question ratios remained less frequent. Conclusions: These findings provide initial evidence that MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway toward AI-assisted health behavior change support while underscoring the need for further work on data scale, complex MI skills and real-world intervention trials.
【8】ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs Link: https://arxiv.org/abs/2512.15397
Authors: Lev Kharlashkin, Eiaki Morooka, Yehor Tereshchenko, Mika Hämäläinen Abstract: ORACLE turns daily news into week-over-week, decision-ready insights for one of the Finnish Universities of Applied Sciences. The platform crawls and versions news, applies University-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.
【9】Dual-Density Inference for Efficient Language Model Reasoning Link: https://arxiv.org/abs/2512.15358
Authors: Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li, Kam-Fai Wong Abstract: Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. We observe that the reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser (Dual-density inference), a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.
【10】Adversarial versification in Portuguese as a jailbreak operator in LLMs Link: https://arxiv.org/abs/2512.15353
Authors: Joao Queiroz Comments: 15 pages Abstract: Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study ‘Adversarial poetry as a universal single-turn jailbreak mechanism in large language models’ demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
【11】Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies Link: https://arxiv.org/abs/2512.15312
Authors: Charan Prakash Rathore, Saumi Ray, Dhruv Kumar Comments: Under Review Abstract: Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies (zero-shot, few-shot, event-specific, and reflection-based) across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
【12】Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning Link: https://arxiv.org/abs/2512.15274
Authors: Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong Comments: Accepted by AAAI 2026 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with an accuracy improvement of 18.02% using only 26.17% of the training tokens.
【13】MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers Link: https://arxiv.org/abs/2512.15163
Authors: Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang Comments: Our benchmark is available at https://github.com/xjzzzzzzzz/MCPSafety Abstract: Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
【14】Quantifying Return on Security Controls in LLM Systems Link: https://arxiv.org/abs/2512.15081
Authors: Richard Helder Moulton, Austin O’Brien, John D. Hastings Comments: 13 pages, 9 figures, 3 tables Abstract: Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace’s Rule of Succession and combined with triangular loss distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations (attribute-based access control (ABAC), named entity recognition (NER) redaction using Microsoft Presidio, and NeMo Guardrails) are then compared to a baseline RAG configuration. The baseline system exhibits very high attack success rates (>= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of $313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by about 94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).
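The estimation chain in the abstract above (Laplace's Rule of Succession, a triangular loss distribution, and a 10,000-run Monte Carlo simulation) can be sketched as follows. Only Laplace's rule, (successes + 1) / (trials + 2), is standard; the return-on-control formula used here (net loss avoided divided by control cost), the dollar figures, and the probe counts are placeholder assumptions rather than the paper's definitions or data.

```python
import random


def laplace_success_probability(successes, trials):
    """Laplace's Rule of Succession: (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)


def expected_loss(p_success, low, mode, high, runs=10_000, seed=0):
    """Monte Carlo estimate of the expected loss per attack attempt.

    Each run succeeds with probability p_success; a successful attack
    draws its loss from a triangular distribution on [low, high].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        if rng.random() < p_success:
            total += rng.triangular(low, high, mode)
    return total / runs


def return_on_control(baseline_loss, residual_loss, control_cost):
    """Assumed definition: net loss avoided per unit of control cost."""
    return (baseline_loss - residual_loss - control_cost) / control_cost


# Hypothetical probe results: 49 of 50 attacks succeed without the control, 0 of 50 with it.
p_baseline = laplace_success_probability(49, 50)
p_controlled = laplace_success_probability(0, 50)
baseline = expected_loss(p_baseline, low=10_000, mode=50_000, high=200_000)
residual = expected_loss(p_controlled, low=10_000, mode=50_000, high=200_000)
roc = return_on_control(baseline, residual, control_cost=5_000)
```

Laplace's rule keeps the estimated success probability away from exactly 0 or 1 even when every probe fails or succeeds, so a control that blocks all 50 probes still carries a small residual risk in the simulation.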
【15】The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops Link: https://arxiv.org/abs/2512.15053
Authors: Fanzhe Fu Comments: 6 pages, 2 figures Abstract: The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based “prompt engineering,” fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for “Observable Software Engineering” in the era of probabilistic computing.
【16】SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification Link: https://arxiv.org/abs/2512.15052
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen Comments: Under Review for ACL 2026 Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
【17】DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding Link: https://arxiv.org/abs/2512.15000
Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie Abstract: Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with an 80.9 pass@1 rate, surpassing OpenAI o4-mini.
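For readers unfamiliar with the pass@1 metric quoted here, the following sketch shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c pass the unit tests. Whether the 80.9 figure was computed with exactly this estimator is an assumption; the abstract does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimator reduces to the plain pass fraction c / n.
example = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score is then the mean of this quantity over all problems.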
【18】Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams Link: https://arxiv.org/abs/2512.14989
Authors: Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu Comments: Published at Communications Chemistry Abstract: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
【19】Prompt Repetition Improves Non-Reasoning LLMs Link: https://arxiv.org/abs/2512.14982
Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias Abstract: When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
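The technique in this entry is simple enough to sketch directly. The helper below concatenates the prompt with itself before it is sent to a model; the function name, the default of two copies, and the blank-line separator are assumptions, since the abstract does not specify the exact repetition format.

```python
def repeat_prompt(prompt: str, times: int = 2, sep: str = "\n\n") -> str:
    """Return `prompt` repeated `times` times, joined by `sep`."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return sep.join([prompt] * times)


repeated = repeat_prompt("What is the capital of France?")
```

Only the input grows under this scheme; the number of generated tokens is unaffected, which matches the abstract's claim about output length.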
【20】Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation Link: https://arxiv.org/abs/2512.14954
Authors: Buu Phan, Ashish Khisti, Karen Ullrich Abstract: Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. 
Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
【21】Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models Link: https://arxiv.org/abs/2512.14926
Authors: George-Andrei Dima, Dumitru-Clementin Cercel Abstract: Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
【22】Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models Link: https://arxiv.org/abs/2512.14925
Authors: Caner Erden Abstract: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. 
Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
【23】DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline Link: https://arxiv.org/abs/2512.14896
Authors: Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Mohammad Hossein Rohban, Glolamali Aminian, Tahereh Javaheri Comments: 11 pages, 2 figures, 3 tables Abstract: Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. 
Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
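As a quick check on the DrugRAG numbers, the sketch below recomputes the percentage-point gains for the two models whose before/after accuracies appear in the abstract. Only those two pairs are used; the dictionary layout and function name are illustrative.

```python
# Baseline and DrugRAG accuracies (in percent) as quoted in the abstract.
reported = {
    "Gemma 3 27B": (61, 71),
    "Llama 3.1 8B": (46, 67),
}


def gain_pp(before: int, after: int) -> int:
    """Absolute gain in percentage points (not a relative percentage)."""
    return after - before


gains = {model: gain_pp(before, after) for model, (before, after) in reported.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain} pp")
```

Both values fall inside the 7 to 21 percentage-point range stated in the results, with Llama 3.1 8B at the upper bound.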
【24】Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media Link: https://arxiv.org/abs/2512.14887
Authors: Massimiliano Fadda, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino Abstract: News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
【25】Revisiting the Reliability of Language Models in Instruction-Following Link: https://arxiv.org/abs/2512.14754
Authors: Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu Comments: Preprint Abstract: Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability – their performance can drop by up to 61.8% with nuanced prompt modifications. What’s more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
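The abstract names a reliable@k metric but does not define it. The sketch below shows one plausible reading, stated purely as an assumption: a prompt family counts as reliable only if the model succeeds on all of its first k cousin prompts, and the score is the fraction of reliable families. The paper's actual definition may differ, and `reliable_at_k` is a hypothetical name.

```python
from typing import List


def reliable_at_k(groups: List[List[bool]], k: int) -> float:
    """Fraction of prompt families whose first k cousin prompts all succeed.

    ASSUMED definition for illustration only; not taken from the paper.
    Each inner list holds pass/fail results for one family of cousin prompts.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    eligible = [g for g in groups if len(g) >= k]
    if not eligible:
        raise ValueError("no prompt family has at least k results")
    reliable = sum(all(g[:k]) for g in eligible)
    return reliable / len(eligible)


results = [
    [True, True, True],   # consistent across all cousins
    [True, False, True],  # fails on the second cousin
    [True, True, False],  # fails on the third cousin
]
score_k2 = reliable_at_k(results, k=2)
score_k3 = reliable_at_k(results, k=3)
```

Under this reading the score can only stay flat or fall as k grows, which is consistent with the abstract's point that per-prompt accuracy overstates reliability.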
【26】SoMe: A Realistic Benchmark for LLM-based Social Media Agents Link: https://arxiv.org/abs/2512.14720
Authors: Dizhan Xue, Jing Cui, Shengsheng Qian, Chuanrui Hu, Changsheng Xu Comments: Accepted by AAAI 2026 Abstract: Intelligent agents powered by large language models (LLMs) have recently demonstrated impressive capabilities and gained increasing popularity on social media platforms. While LLM agents are reshaping the ecology of social media, there exists a current gap in conducting a comprehensive evaluation of their ability to comprehend media content, understand user behaviors, and make intricate decisions. To address this challenge, we introduce SoMe, a pioneering benchmark designed to evaluate social media agents equipped with various agent tools for accessing and analyzing social media data. SoMe comprises a diverse collection of 8 social media agent tasks, 9,164,284 posts, 6,591 user profiles, and 25,686 reports from various social media platforms and external websites, with 17,869 meticulously annotated task queries. Compared with the existing datasets and benchmarks for social media tasks, SoMe is the first to provide a versatile and realistic platform for LLM-based social media agents to handle diverse social media tasks. By extensive quantitative and qualitative analysis, we provide the first overview insight into the performance of mainstream agentic LLMs in realistic social media environments and identify several limitations. Our evaluation reveals that both the current closed-source and open-source LLMs cannot handle social media agent tasks satisfactorily. 
SoMe provides a challenging yet meaningful testbed for future social media agents. Our code and data are available at https://github.com/LivXue/SoMe
【27】LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts Link: https://arxiv.org/abs/2512.14706
Authors: Krunal Jesani, Dmitry Ignatov, Radu Timofte Abstract: Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. 
This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.
【28】Effectively Detecting and Responding to Online Harassment with Large Language Models Link: https://arxiv.org/abs/2512.14700
Authors: Pinxian Lu, Nimra Ishfaq, Emma Win, Morgan Rose, Sierra R Strickland, Candice L Biernesser, Jamie Zelazny, Munmun De Choudhury Comments: 16 pages, 2 figures Abstract: Online harassment has been a persistent issue in the online space. Research has predominantly focused on online harassment in public social media platforms, while less attention has been paid to private messaging platforms. To address online harassment on one private messaging platform, Instagram, we leverage the capabilities of Large Language Models (LLMs). To achieve this, we recruited human labelers to identify online harassment in an Instagram messages dataset. Using the previous conversation as context, we utilize an LLM pipeline to conduct large-scale labeling on Instagram messages and evaluate its performance against human labels. Then, we use LLMs to generate and evaluate simulated responses to online harassment messages. We find that the LLM labeling pipeline is capable of identifying online harassment in private messages. By comparing human responses and simulated responses, we also demonstrate that our simulated responses are superior in helpfulness compared to original human responses.
QA | VQA | Question Answering | Dialogue (4 papers)
【1】You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations Link: https://arxiv.org/abs/2512.15601
Authors: Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen Comments: Under Review Abstract: Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with labeling and Cohen’s kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
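The agreement figure reported for this corpus is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the computation follows; the label values are made up for illustration and are not from PsyDefConv.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations: "m" = mature defense, "i" = immature defense.
annotator_1 = ["m", "m", "i", "i"]
annotator_2 = ["m", "m", "i", "m"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case raw agreement is 0.75 but chance agreement is 0.5, so kappa is 0.5.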
【2】RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA Link: https://arxiv.org/abs/2512.15219
Authors: Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou Comments: 9 pages, 5 figures, accepted by AAAI 2026 Abstract: Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
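The “question-paths-answer” few-shot format mentioned in this abstract might look something like the sketch below. The exact template is not given in the abstract, so the field labels, the arrow notation for paths, and the entities (Alice, Bob, Carol) are all hypothetical.

```python
from typing import List, Sequence, Tuple

Path = Sequence[str]  # alternating entity / relation / entity ... strings


def format_example(question: str, paths: List[Path], answer: str) -> str:
    """Render one example as question, then KG paths, then answer."""
    path_lines = "\n".join(" -> ".join(path) for path in paths)
    return f"Question: {question}\nPaths:\n{path_lines}\nAnswer: {answer}"


def build_prompt(
    shots: List[Tuple[str, List[Path], str]], question: str, paths: List[Path]
) -> str:
    """Few-shot prompt: worked examples followed by the open query."""
    blocks = [format_example(q, p, a) for q, p, a in shots]
    # Leave the final answer slot empty for the model to fill in.
    blocks.append(format_example(question, paths, "").rstrip())
    return "\n\n".join(blocks)


shots = [
    ("Who is Alice's brother?", [["Alice", "brother", "Bob"]], "Bob"),
]
prompt = build_prompt(
    shots,
    "Who is the father of Alice's son?",
    [["Alice", "son", "Carol", "father", "Bob"]],  # a 2-hop path
)
```

The one-shot example uses a 1-hop path and the query a 2-hop path, mirroring the abstract's contrast between direct and indirect relations.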
【3】DASH: Dialogue-Aware Similarity and Handshake Recognition for Topic Segmentation in Public-Channel Conversations Link: https://arxiv.org/abs/2512.15042
Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu Comments: Accepted by AAAIW2026 Abstract: Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several sota segmentation trusted accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
【4】Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction Link: https://arxiv.org/abs/2512.14865
Authors: Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He Abstract: End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. 
Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model, achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect the difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify them and drive improvements in audio-native multi-turn interaction capability.
Machine Translation (2 papers)
【1】An Empirical Study on Chinese Character Decomposition in Multiword Expression-Aware Neural Machine Translation Link: https://arxiv.org/abs/2512.15556
Authors: Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton Comments: capstone work, technical report, 27 pages, extraction from PhD thesis https://doras.dcu.ie/26559/ Abstract: Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from Multi-word Expressions (MWEs), which complicate the tasks by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English. This progress is attributed in part to the well-established research communities and the abundant availability of computational resources. However, the same level of progress has not been made for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words, improve phrase comprehension, and enhance machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideograph language scripts like Chinese. In this work, we conduct a systematic study of the Chinese character decomposition technology in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments to examine how Chinese character decomposition technology contributes to the representation of the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.
【2】Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024 Link: https://arxiv.org/abs/2512.15226
Authors: Yash Bhaskar, Parameswari Krishnamurthy Comments: Accepted at WMT 2024 Abstract: This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
Semantic Analysis (1 paper)
【1】The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems Link: https://arxiv.org/abs/2512.15068
Authors: Debu Sinha Comments: 12 pages, 2 figures, 6 tables Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
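The abstract above does not spell out its calibration procedure, so the following is only a minimal sketch of generic split conformal calibration applied to a similarity-based detector. The function names, the flagging rule (flag an answer when its similarity to the retrieved evidence is at or below the threshold), and all scores are hypothetical. It illustrates why a coverage guarantee on hallucinations can coexist with a high false positive rate when hallucinations look as similar to the source as faithful answers do.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal quantile of the calibration scores.

    With n exchangeable calibration scores, a fresh score from the same
    distribution is <= the returned value with probability >= 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def false_positive_rate(faithful_scores, threshold):
    """Fraction of faithful answers that the rule 'score <= threshold' flags."""
    flagged = sum(1 for score in faithful_scores if score <= threshold)
    return flagged / len(faithful_scores)


# Hypothetical similarity-to-evidence scores (not taken from the paper).
hallucination_similarity = [0.81, 0.84, 0.86, 0.88, 0.90, 0.93, 0.95]
faithful_similarity = [0.85, 0.89, 0.91, 0.92, 0.94, 0.96]

# Calibrate on known hallucinations so that at least 75% of them are flagged.
threshold = conformal_threshold(hallucination_similarity, alpha=0.25)
fpr = false_positive_rate(faithful_similarity, threshold)
print(f"threshold={threshold:.2f}  false positive rate={fpr:.2f}")
```

In this toy setup the hallucinated and faithful answers overlap heavily in similarity, so the threshold needed to cover the hallucinations also flags most faithful answers.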
Graph|Knowledge Graph|Knowledge (2 papers)
【1】How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness Link: https://arxiv.org/abs/2512.15634
Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra Comments: Accepted at AACL IJCNLP 2025 Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
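To make the cost side of a rank sweep concrete, here is a small sketch of how the number of trainable LoRA parameters grows with rank for a single weight matrix. It relies only on the standard LoRA factorisation (an update B·A with B of shape d_out × r and A of shape r × d_in); the 4096 × 4096 layer size and the rank values are hypothetical and are not taken from the paper.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


def full_param_count(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def rank_sweep(d_out, d_in, ranks):
    """Map each rank to the fraction of the full matrix that LoRA trains."""
    full = full_param_count(d_out, d_in)
    return {rank: lora_param_count(d_out, d_in, rank) / full for rank in ranks}


# Hypothetical layer size and ranks, chosen only for illustration.
sweep = rank_sweep(4096, 4096, [4, 16, 64, 256])
for rank, fraction in sweep.items():
    print(f"rank={rank:>3}  trainable fraction={fraction:.4%}")
```

The trainable fraction grows linearly in the rank, which is why a sweep over a few rank values already spans a wide range of adaptation capacity.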
【2】HERO: Hierarchical Traversable 3D Scene Graphs for Embodied Navigation Among Movable Obstacles Link: https://arxiv.org/abs/2512.15047
Authors: Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing, Jian Li, Xiangrui Jiang, Renjing Xu Abstract: 3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene’s relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
Reasoning|Analysis|Understanding|Explanation (4 papers)
【1】When a Nation Speaks: Machine Learning and NLP in People’s Sentiment Analysis During Bangladesh’s 2024 Mass Uprising Link: https://arxiv.org/abs/2512.15547
Authors: Md. Samiul Alim, Mahir Shahriar Tamim, Maisha Rahman, Tanvir Ahmed Khan, Md Mushfique Anwar Comments: Accepted in 2025 28th International Conference on Computer and Information Technology (ICCIT) Abstract: Sentiment analysis, an emerging research area within natural language processing (NLP), has primarily been explored in contexts like elections and social media trends, but there remains a significant gap in understanding emotional dynamics during civil unrest, particularly in the Bangla language. Our study pioneers sentiment analysis in Bangla during a national crisis by examining public emotions amid Bangladesh’s 2024 mass uprising. We curated a unique dataset of 2,028 annotated news headlines from major Facebook news portals, classifying them into Outrage, Hope, and Despair. Through Latent Dirichlet Allocation (LDA), we identified prevalent themes like political corruption and public protests, and analyzed how events such as internet blackouts shaped sentiment patterns. It outperformed multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%). These results highlight the effectiveness of language-specific models and offer valuable insights into public sentiment during political turmoil.
【2】The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres Link: https://arxiv.org/abs/2512.15248
Authors: Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai Abstract: Moralizations - arguments that invoke moral values to justify demands or positions - are a yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the tasks of moralization detection and moralization component extraction and compare them to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
【3】T5Gemma 2: Seeing, Reading, and Understanding Longer Link: https://arxiv.org/abs/2512.14856
Authors: Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, Cassidy Hardin, Francesco Visin, Jiageng Zhang, Kathleen Kenealy, Qin Yin, Olivier Lacombe, Armand Joulin, Tris Warkentin, Adam Roberts Comments: technical report Abstract: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma – adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance compared with its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
【4】SepsisSuite: Beyond Risk Stratification – A Comparative Analysis of Deep Fusion vs. Expert Stacking for Prescriptive Sepsis AI Link: https://arxiv.org/abs/2512.14712
Authors: Ryan Cartularo Comments: 7 Pages, 4 Tables, 9 Figures Abstract: Sepsis accounts for nearly 20% of global ICU admissions, yet conventional prediction models often fail to effectively integrate heterogeneous data streams, remaining either siloed by modality or reliant on brittle early fusion. In this work, we present a rigorous architectural comparison between End-to-End Deep Fusion and Context-Aware Stacking for sepsis tasks. We initially hypothesized that a novel Quad-Modal Hierarchical Gated Attention Network – termed SepsisFusionFormer – would resolve complex cross-modal interactions between vitals, text, and imaging. However, experiments on MIMIC-IV revealed that SepsisFusionFormer suffered from “attention starvation” in the small antibiotic cohort ($N \approx 2,100$), resulting in overfitting (AUC 0.66). This counterintuitive result informed the design of SepsisLateFusion, a “leaner” Context-Aware Mixture-of-Experts (MoE) architecture. By treating modalities as orthogonal experts – the “Historian” (Static), the “Monitor” (Temporal), and the “Reader” (NLP) – and dynamically gating them via a CatBoost meta-learner, we achieved State-of-the-Art (SOTA) performance: 0.915 AUC for prediction 4 hours prior to clinical onset. By calibrating the decision threshold for clinical safety, we reduced missed cases by 48% relative to the default operating point, thus opening a true preventative window for timely intervention over reactive alerts. Furthermore, for the novel prescriptive task of multi-class antibiotic selection, we demonstrate that a Quad-Modal Ensemble achieved the highest performance (0.72 AUC). These models are integrated into SepsisSuite, a deployment-ready Python framework for clinical decision support. SepsisSuite is available for free at: https://github.com/RyanCartularo/SepsisSuite-Info
Recognition/Classification (2 papers)
【1】Learning inflection classes using Adaptive Resonance Theory Link: https://arxiv.org/abs/2512.15551
Authors: Peter Dekker, Heikki Rasilo, Bart de Boer Abstract: The concept of inflection classes is an abstraction used by linguists, and provides a means to describe patterns in languages that give an analogical base for deducing previously unencountered forms. This ability is an important part of morphological acquisition and processing. We study the learnability of a system of verbal inflection classes by the individual language user by performing unsupervised clustering of lexemes into inflection classes. As a cognitively plausible and interpretable computational model, we use Adaptive Resonance Theory, a neural network with a parameter that determines the degree of generalisation (vigilance). The model is applied to Latin, Portuguese and Estonian. The similarity of clustering to attested inflection classes varies depending on the complexity of the inflectional system. We find the best performance in a narrow region of the generalisation parameter. The learned features extracted from the model show similarity with linguistic descriptions of the inflection classes. The proposed model could be used to study change in inflection classes in the future, by including it in an agent-based model.
【2】Emotion Recognition in Signers Link: https://arxiv.org/abs/2512.15376
Authors: Kotaro Funakoshi, Yaoxiong Zhu Abstract: Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken language LLMs.
Corpus (1 paper)
【1】Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain Link: https://arxiv.org/abs/2512.15151
Authors: Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews spanning 16 years, from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
Other Neural Networks|Deep Learning|Models|Modeling (3 papers)
【1】PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning Link: https://arxiv.org/abs/2512.15658
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan Comments: 10 pages, 3 figures, 2025 IEEE International Conference on Big Data (BigData) Abstract: Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model’s ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
【2】Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning Link: https://arxiv.org/abs/2512.15146
Authors: Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo label deduction, prioritizing high-quality reasoning paths over simple frequency count. Furthermore, it dynamically partitions the candidate outputs pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each sub group, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks, and the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on challenging AIME 2025 and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.
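The contrast between frequency-based and confidence-weighted pseudo-labels can be sketched as follows. This is not the SCOPE algorithm itself: the abstract does not define how step-wise confidences are aggregated, so taking the mean per reasoning path and summing per answer is an illustrative assumption, subgroup partitioning is omitted, and the sampled answers and confidences are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """Pseudo-label = most frequent final answer among the sampled rollouts."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def confidence_weighted_vote(samples):
    """Pseudo-label = answer with the largest summed path confidence.

    Path confidence is taken to be the mean of the per-step confidences;
    this aggregation is an illustrative assumption, not the paper's formula.
    """
    totals = defaultdict(float)
    for answer, step_confidences in samples:
        totals[answer] += sum(step_confidences) / len(step_confidences)
    return max(totals, key=totals.get)


# Hypothetical rollouts: (final answer, per-step confidences).
samples = [
    ("42", [0.30, 0.35, 0.40]),
    ("42", [0.25, 0.30, 0.35]),
    ("42", [0.30, 0.30, 0.30]),
    ("17", [0.95, 0.90, 0.95]),
    ("17", [0.90, 0.85, 0.95]),
]

print("majority vote:", majority_vote(samples))
print("confidence-weighted vote:", confidence_weighted_vote(samples))
```

Here three low-confidence rollouts outvote two high-confidence ones under plain majority voting, while the weighted rule picks the answer backed by the more confident reasoning paths.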
【3】Task Matrices: Linear Maps for Cross-Model Finetuning Transfer Link: https://arxiv.org/abs/2512.14880
Authors: Darrin O’Brien, Dhikshith Gajulapalli, Eric Xia Comments: NeurIPS Unireps 2025 Abstract: Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
Other (13 papers)
【1】Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants Link: https://arxiv.org/abs/2512.15712
Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt Comments: 28 pages, 12 figures Abstract: Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
【2】Characterizing Mamba’s Selective Memory using Auto-Encoders Link: https://arxiv.org/abs/2512.15653
Authors: Tamanna Hossain, Robert L. Logan, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault, Alejandro Jaimes Comments: AACL 2025. Oral Presentation Abstract: State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM’s hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4–256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine how frequently these tokens appear in Mamba’s pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba’s ability to retain important information.
【3】From Data to Dialogue: Unlocking Language for All 标题:从数据到对话:解锁所有人的语言 链接:https://arxiv.org/abs/2512.15552
作者:Dakota Ellis,Samy Bakikerali,Wanshan Chen,Bao Dinh,Uyen Le 摘要:传统语言学家建议使用通用词汇表(GSL)来帮助新的语言学习者识别英语中最重要的单词。这一过程需要语言专业知识、主观投入和大量时间。我们试图创建自己的GSL,并根据行业标准(NGSL)评估其实用性。我们发现,创建一个专门的单词列表(SWL),或特定于整个语料库的子集的单词列表,是语言学习者优化过程的最实用的方法。我们使用我们的模型创建的SWL优于行业标准,达到了语言理解所需的95%覆盖率,相对而言单词更少。通过将SWL过程仅限于客观标准,它可以自动化,缩放和定制以满足全球语言学习者的需求。 摘要:Traditional linguists have proposed the use of a General Service List (GSL) to assist new language learners in identifying the most important words in English. This process requires linguistic expertise, subjective input, and a considerable amount of time. We attempt to create our own GSL and evaluate its practicality against the industry standard (The NGSL). We found creating a Specialized Word List (SWL), or a word list specific to a subset of the overall corpus, to be the most practical way for language-learners to optimize the process. The SWL’s that we created using our model outperformed the industry standard, reaching the 95% coverage required for language comprehension with fewer words comparatively. By restricting the SWL process to objective criteria only, it can be automated, scaled, and tailored to the needs of language-learners across the globe.
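The 95% coverage criterion mentioned above can be made concrete with a minimal sketch: rank words by frequency and count how many are needed before the cumulative share of tokens reaches the target. The function name and the purely frequency-based ranking are assumptions for illustration; the abstract does not specify how the authors build their lists.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.95):
    """Return (word_list, coverage) for the shortest frequency-ranked list
    whose cumulative token coverage reaches `target`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    covered = 0
    selected = []
    for word, freq in counts.most_common():
        selected.append(word)
        covered += freq
        if covered / total >= target:
            break
    return selected, covered / total


# Toy corpus; a real list would be built from a tokenized domain corpus.
tokens = "the cat sat on the mat the cat".split()
selected, coverage = words_needed_for_coverage(tokens, target=0.6)
```

Running the same function on a domain-specific subset of a corpus rather than the whole corpus is one way to see why a specialized list can reach the target with fewer words.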
【4】Tracking Temporal Dynamics of Vector Sets with Gaussian Process 标题:用高斯过程跟踪向量集的时间动态 链接:https://arxiv.org/abs/2512.15538
作者:Taichi Aida,Mamoru Komachi,Toshinobu Ogiso,Hiroya Takamura,Daichi Mochihashi 备注:Work in Progress 摘要:了解向量集的时间演变是跨各个领域的基本挑战,包括生态学,犯罪分析和语言学。例如,生态系统结构的演变是由于植物、食草动物和食肉动物之间的相互作用;犯罪的空间分布随着社会变化而变化;词语嵌入向量反映了随着时间的推移的文化和语义趋势。然而,分析这种时变的向量集是具有挑战性的,因为它们的结构复杂,也随着时间的推移而演变。在这项工作中,我们提出了一种新的方法来建模的分布基础上的每组向量使用无限维高斯过程。通过用随机傅立叶特征近似高斯过程中的隐函数,我们得到了随时间变化的紧凑且可比的向量表示。这使我们能够在低维空间中跟踪和可视化向量集的时间转换。我们将我们的方法应用于社会学数据(犯罪分布)和语言学数据(词嵌入),证明了它在捕获时间动态方面的有效性。我们的研究结果表明,所提出的方法提供了可解释的和强大的表示,提供了一个强大的框架,分析跨不同领域的时间索引向量集的结构变化。 摘要:Understanding the temporal evolution of sets of vectors is a fundamental challenge across various domains, including ecology, crime analysis, and linguistics. For instance, ecosystem structures evolve due to interactions among plants, herbivores, and carnivores; the spatial distribution of crimes shifts in response to societal changes; and word embedding vectors reflect cultural and semantic trends over time. However, analyzing such time-varying sets of vectors is challenging due to their complicated structures, which also evolve over time. In this work, we propose a novel method for modeling the distribution underlying each set of vectors using infinite-dimensional Gaussian processes. By approximating the latent function in the Gaussian process with Random Fourier Features, we obtain compact and comparable vector representations over time. This enables us to track and visualize temporal transitions of vector sets in a low-dimensional space. We apply our method to both sociological data (crime distributions) and linguistic data (word embeddings), demonstrating its effectiveness in capturing temporal dynamics. Our results show that the proposed approach provides interpretable and robust representations, offering a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains.
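Random Fourier Features, which the abstract above uses to approximate the latent function of a Gaussian process, can be sketched in a few lines. The example below shows only the standard building block (a finite feature map whose inner product approximates an RBF kernel); it is not the paper's full model, and the function names, lengthscale, and feature count are illustrative assumptions.

```python
import math
import random


def make_rff_map(dim, n_features, lengthscale=1.0, seed=0):
    """Build a feature map z(x) with z(x) . z(y) ~= exp(-|x - y|^2 / (2 l^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the spectral density of the RBF kernel.
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def rbf_kernel(x, y, lengthscale=1.0):
    """Exact RBF kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


feature_map = make_rff_map(dim=2, n_features=2000, seed=0)
x, y = [0.0, 0.0], [1.0, 0.5]
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
exact = rbf_kernel(x, y)
```

Because every input is mapped to a vector of the same fixed length, quantities computed in this feature space are directly comparable across time steps, which is the property the abstract relies on for compact, comparable representations.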
【5】Why Your Academic Field Is Everywhere at Once: A Case Study of Arabic Linguistics 标题:为什么你的学术领域同时无处不在:阿拉伯语言学的案例研究 链接:https://arxiv.org/abs/2512.15328
作者:Ayman Eddakrouri,Amani Ramadan 摘要:本研究运用Brookes的范畴离散度(Δ)对当代阿拉伯应用语言学研究的主位结构进行分析。使用2019年至2025年的1,564篇出版物的综合真实数据集,分为8个核心子学科,我们计算出分散指数Δ = 0.194。这个非常低的值表明极端的主题分散,揭示了该领域的特点是明显的异质性,而不是集中。该分析将计算语言学确定为一种主导但非霸权的力量,与社会语言学,语言教学和其他子领域的强大研究共存。本研究阐明了布鲁克斯的原始公式的正确应用,展示了其实用性的领域表征,并提供了一个可复制的文献计量方法评估跨领域的学科结构。 摘要:This study applies Brookes’ Measure of Categorical Dispersion (Δ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of Δ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes’ original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.
【6】Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues 标题:通过对话中个人用户的个人资料定制实现主动个性化 链接:https://arxiv.org/abs/2512.15302
作者:Xiaotian Zhang,Yuan Wang,Ruizhe Chen,Zeya Wang,Runchen Hou,Zuozhu Liu 摘要:在交互式系统中部署大型语言模型(LLM)需要与个人用户的细微差别和动态偏好进行深度对齐。目前的对齐技术主要解决普遍的人类价值观或静态的单圈偏好,从而无法解决长期个性化的关键需求和初始用户冷启动问题。为了弥合这一差距,我们提出了PersonalAgent,一种新的以用户为中心的终身代理,旨在不断推断和适应用户的喜好。PersonalAgent通过将对话分解为单轮交互,将偏好推理框架为顺序决策任务,构建并动态细化统一的用户配置文件。实验表明,PersonalAgent实现了优越的性能,强大的基于策略和策略优化的基线,不仅在理想化,而且在嘈杂的会话环境中,同时保持跨会话偏好的一致性。此外,人工评估证实,PersonalAgent擅长自然和连贯地捕捉用户偏好。我们的研究结果强调了终身个性化对于开发更具包容性和适应性的会话代理的重要性。我们的代码在这里可用。 摘要:The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.
【7】ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I 标题:ChatGPT和Gemini参加韩国大学学业能力测试–地球科学I 链接:https://arxiv.org/abs/2512.15298
作者:Seok-Hyun Ga,Chun-Yen Chang 备注:23 pages, 9 tables, 1 figure 摘要:生成式人工智能的快速发展正在为教育和评估带来创新性的变化。随着学生利用人工智能进行作业的普及率增加,人们对学术诚信和评估有效性的担忧也在增加。本研究利用2025年韩国大学学术能力测试(CSAT)的地球科学I部分,深入分析了最先进的大型语言模型(LLM)的多模态科学推理能力和认知局限性,包括GPT-4 o,Gemini 2.5 Flash和Gemini 2.5 Pro。三个实验条件(整页输入,单项输入,优化的多模态输入)被设计来评估模型在不同数据结构的性能。定量结果表明,非结构化的输入导致显着的性能下降,由于分割和光学字符识别(OCR)的失败。即使在优化的条件下,模型也表现出基本的推理缺陷。定性分析表明,“感知错误”占主导地位,突出了“感知认知差距”,模型未能解释示意图中的符号意义,尽管认识到视觉数据。此外,模型表现出“计算-概念化离散性”,成功地执行计算,但未能应用潜在的科学概念,以及“过程幻觉”,模型跳过视觉验证,支持看似合理但毫无根据的背景知识。为了应对课程中未经授权使用人工智能的挑战,这项研究为设计针对这些特定认知漏洞的“抗人工智能问题”提供了可操作的线索。通过利用人工智能的弱点,例如感知和认知之间的差距,教育工作者可以将真正的学生能力与人工智能生成的反应区分开来,从而确保评估的公平性。 摘要:The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. 
Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
【8】SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes 标题:SynGP500:基于临床的澳大利亚全科医疗笔记合成数据集 链接:https://arxiv.org/abs/2512.15259
作者:Piyawoot Songsiritat 备注:16 pages, 2 figures 摘要:我们介绍SynGP500,临床医生策划的500合成澳大利亚全科医学笔记的集合。该数据集整合了基于临床的广度(RACGP 2022课程),流行病学校准的患病率(BEACH研究)和多样化的咨询环境。这种方法系统地包括常见的陈述和不太常见的特定条件,GP必须认识到,但很少出现在单一的实践人群,可能支持更普遍的模型训练比数据集约束自然发生的情况下分布。SynGP 500在设计上是混乱的,反映了医疗保健服务的真实复杂性:电报文档,错别字,患者不依从,社会经济障碍和临床医生与患者的分歧,不像消毒的合成数据集掩盖了临床现实。多方面验证通过与真实的澳大利亚GP咨询模式(BEACH研究)的流行病学对齐来证明数据集质量,文体分析证实了高语言变异,语义多样性分析证明了广泛的覆盖范围,以及使用自我监督医学概念提取的探索性下游评估,显示F1改善。SynGP 500解决了一个关键的国家差距,为研究人员和教育工作者提供了一个资源,用于开发和评估澳大利亚全科医学的临床NLP方法,同时保护患者隐私。 摘要:We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
【9】FAME: Fictional Actors for Multilingual Erasure 标题:FAME:多语言擦除的虚构演员 链接:https://arxiv.org/abs/2512.15235
作者:Claudio Savelli,Moreno La Quatra,Alkis Koudounas,Flavio Giobergia 摘要:在网络规模数据上训练的LLM引发了对隐私和被遗忘权的担忧。为了解决这些问题,机器遗忘(Machine Unlearning)提供了从已训练模型中删除特定信息的技术,而无需从头开始重新训练。然而,现有的评估LLM遗忘的基准面临两个主要限制:它们只关注英语,且只支持实体级遗忘(删除有关一个人的所有信息)。我们介绍FAME(Fictional Actors for Multilingual Erasure),这是一个合成基准,用于评估五种语言的机器遗忘:英语、法语、德语、意大利语和西班牙语。FAME包含1,000个虚构的演员传记和20,000个问答对。每篇传记包括20个主题的信息,这些主题被分为结构化类别(传记、职业、成就、个人信息)。这种设计同时支持实体级遗忘(即忘记整个身份)和实例级遗忘(即忘记特定事实,同时保留其他事实)。我们提供了两种数据集划分,以支持这两种不同的遗忘场景,并支持跨语言对遗忘技术进行系统比较。由于FAME使用完全虚构的数据,它确保了这些信息在模型预训练期间从未出现过,从而允许对遗忘方法进行受控评估。 摘要:LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.
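The difference between the two unlearning scenarios in the abstract above can be shown as two ways of partitioning question-answer pairs into a forget set and a retain set. This is a rough sketch only: the record fields (`entity`, `topic`, `question`, `answer`) and the function names are hypothetical and do not reflect the released FAME data format.

```python
def entity_level_split(qa_pairs, forget_entities):
    """Forget every QA pair about the selected entities; retain the rest."""
    forget = [qa for qa in qa_pairs if qa["entity"] in forget_entities]
    retain = [qa for qa in qa_pairs if qa["entity"] not in forget_entities]
    return forget, retain


def instance_level_split(qa_pairs, forget_facts):
    """Forget only selected (entity, topic) facts; other facts about the
    same entity stay in the retain set."""
    forget = [qa for qa in qa_pairs if (qa["entity"], qa["topic"]) in forget_facts]
    retain = [qa for qa in qa_pairs if (qa["entity"], qa["topic"]) not in forget_facts]
    return forget, retain


# Toy records with hypothetical field names and made-up content.
qa_pairs = [
    {"entity": "actor_001", "topic": "career", "question": "q1", "answer": "a1"},
    {"entity": "actor_001", "topic": "awards", "question": "q2", "answer": "a2"},
    {"entity": "actor_002", "topic": "career", "question": "q3", "answer": "a3"},
]
entity_forget, entity_retain = entity_level_split(qa_pairs, {"actor_001"})
fact_forget, fact_retain = instance_level_split(qa_pairs, {("actor_001", "awards")})
```

In the entity-level case everything about `actor_001` lands in the forget set, while in the instance-level case only the single selected fact does and the other fact about the same entity must still be answerable after unlearning.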
【10】From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA) 标题:从NLG评估到ChatGPT时代的现代学生评估:巨大的错位问题和教学多因素评估(P-MFA) 链接:https://arxiv.org/abs/2512.15183
作者:Mika Hämäläinen,Kimmo Leiviskä 备注:IWCLUL 2025 摘要:本文探讨了日益增长的认知平行NLG评估和学生在芬兰大学的评分。我们认为,这两个领域正在经历一个巨大的错位问题。随着学生越来越多地使用ChatGPT等工具来产生复杂的输出,传统的评估方法专注于最终产品而不是学习过程已经失去了有效性。为了解决这个问题,我们引入了教学多因素评估(P-MFA)模型,这是一个基于过程的多证据框架,灵感来自多因素认证的逻辑。 摘要:This paper explores the growing epistemic parallel between NLG evaluation and grading of students in a Finnish University. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.
【11】From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? 标题:从孤立到纠缠:可解释性方法何时识别和解开已知概念? 链接:https://arxiv.org/abs/2512.15134
作者:Aaron Mueller,Andrew Lee,Shruti Joshi,Ekdeep Singh Lubana,Dhanya Sridhar,Patrik Reizinger 摘要:可解释性的一个中心目标是从神经网络的激活中恢复因果相关概念的表示。这些概念表示的质量通常是孤立地评估的,并且在实际中可能不成立的隐含独立性假设下进行评估。因此,目前还不清楚常见的特征化方法-包括稀疏自动编码器(SAE)和稀疏探测器-是否可以恢复这些概念的解纠缠表示。本研究提出了一个多概念的评估设置,我们控制文本概念,如情感,域和时态之间的相关性,并分析性能下增加它们之间的相关性。我们首先评估featurizers可以在多大程度上学习解纠缠表示的每个概念下增加相关强度。我们观察到从概念到特征的一对多关系:特征对应的概念不超过一个,但概念分布在许多特征上。然后,我们进行转向实验,测量每个概念是否是独立可操作的。即使在概念的均匀分布上进行训练,SAE特征在转向时通常会影响许多概念,这表明它们既不是选择性的也不是独立的;尽管如此,特征会影响不相交的子空间。这些结果表明,测量解纠缠的相关性指标一般不足以建立独立性时,转向,影响不相交的子空间是不够的概念选择性。这些结果强调了成分评价在可解释性研究中的重要性。 摘要:A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. 
These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
【12】Incentives or Ontology? A Structural Rebuttal to OpenAI’s Hallucination Thesis 标题:激励还是本体论?对OpenAI幻觉论点的结构性反驳 链接:https://arxiv.org/abs/2512.14801
作者:Richard Ackermann,Simeon Emanuilov 备注:17 pages, references to prior work arXiv:2509.16297 and arXiv:2511.06073 摘要:OpenAI最近认为,大型语言模型中的幻觉主要源于错位的评估激励,这些激励奖励自信的猜测,而不是认知上的谦逊。根据这种观点,幻觉是一种偶然的行为假象,可以通过改进基准和奖励结构来补救。在本文中,我们对这种解释提出质疑。借鉴此前关于结构性幻觉的工作以及使用Licensing Oracle进行的实证实验,我们认为幻觉不是优化失败,而是Transformer模型的架构必然性。Transformer并不表征世界;它们对token之间的统计关联进行建模。它们的嵌入空间形成了一个源自语言共现而非世界指称结构的伪本体。在本体论的边界条件下(即训练数据稀疏或不连贯的区域),模型必然会插值出虚构的延续以保持连贯性。没有任何激励机制可以改变这种对模式补全的结构性依赖。我们的实证结果表明,幻觉只能通过外部的真值验证和弃权模块来消除,而不能通过改变激励、提示或微调来消除。Licensing Oracle之所以能够实现跨领域的完美弃权精度,正是因为它提供了Transformer所缺乏的基础(grounding)。我们的结论是,幻觉是生成式架构的一种结构属性,可靠的人工智能需要能够将语言流畅性与认知责任区分开来的混合系统。 摘要:OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. 
We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.
【13】NoveltyRank: Estimating Conceptual Novelty of AI Papers 标题:NoveltyRank:AI论文的概念新颖性评估 链接:https://arxiv.org/abs/2512.14738
作者:Zhengxu Yan,Han Li,Yuming Feng 摘要:随着学术出版越来越容易,研究论文的数量,特别是在人工智能相关领域,急剧增加。出版物的泛滥使得真正新颖和有影响力的作品很难脱颖而出,人工新颖性评估往往不稳定且耗时。我们的项目旨在开发一种模型,用于评估和排名AI论文的概念新颖性,从而实现对研究原创性的数据驱动和可扩展评估。这样的系统可以帮助研究人员有效地识别引入真正创新思想而不是微小变体的提交,并为会议评审员提供定量和一致的新颖性信号。我们的方法主要通过论文的标题、摘要和与先前文献的语义相似性来评估新颖性。鉴于新颖性估计的动机,我们探索了两个具有不同建模目标的任务公式,每个任务公式都提供了不同的视角:(1)二元分类,它从先前新颖作品的学习模式中预测论文的绝对新颖性,以及(2)成对新颖性比较,它学习通过相对新颖性来区分论文。我们在这两个任务上微调了Qwen 3 - 4 B-Instruct-2507和SciBERT,以GPT-5.1为基准,分析任务制定和建模选择如何影响性能。该实现可在https://github.com/ZhengxuYan/NoveltyRank上公开获得。 摘要:With the growing ease of academic publishing, the volume of research papers, especially in AI-related fields, has surged dramatically. This flood of publications makes it difficult for truly novel and impactful work to stand out, and manual novelty assessment is often unstable and time-consuming. Our project aims to develop a model that estimates and ranks the conceptual novelty of AI papers, enabling a data-driven and scalable assessment of research originality. Such a system can help researchers efficiently identify submissions that introduce genuinely innovative ideas rather than minor variants, and provide conference reviewers with a quantitative and consistent signal of novelty. Our approach evaluates novelty primarily through a paper’s title, abstract, and semantic similarity to prior literature. Given the motivation of novelty estimation, we explore two task formulations with different modeling objectives, each offering a different perspective: (1) binary classification, which predicts the paper’s absolute novelty from learned patterns of prior novel works, and (2) pairwise novelty comparison, which learns to distinguish papers by relative novelty over others. We fine-tune Qwen3-4B-Instruct-2507 and SciBERT on both tasks, benchmarking against GPT-5.1 to analyze how task formulation and modeling choices affect performance. The implementation is publicly available at https://github.com/ZhengxuYan/NoveltyRank.
1.3.2 Artificial Intelligence
From:https://papers.cool/arxiv/cs.AI
https://arxiv.org/list/cs.AI/recent
https://www.arxivdaily.com/cate/21/seq/0
cs.AI人工智能,共计142篇
【1】Spatia: Video Generation with Updatable Spatial Memory 标题:Spatia:具有可更新空间内存的视频生成 链接:https://arxiv.org/abs/2512.15716
作者:Jinjing Zhao,Fangyun Wei,Zhening Liu,Hongyang Zhang,Chang Xu,Yan Lu 备注:Project page: https://zhaojingjing713.github.io/Spatia/ 摘要:由于视频信号的密集、高维性质,现有的视频生成模型难以保持长期的空间和时间一致性。为了克服这一限制,我们提出了Spatia,这是一种空间记忆感知视频生成框架,它显式保留3D场景点云作为持久空间记忆。Spatia迭代地生成以这种空间记忆为条件的视频剪辑,并通过视觉SLAM不断更新它。这种动态-静态解纠缠设计增强了整个生成过程的空间一致性,同时保留了模型生成逼真动态实体的能力。此外,Spatia还支持显式相机控制和3D感知交互式编辑等应用,为可扩展的内存驱动视频生成提供了几何基础框架。 摘要:Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
【2】Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants 标题:预测概念解码器:训练可扩展的端到端解释性助理 链接:https://arxiv.org/abs/2512.15712
作者:Vincent Huang,Dami Choi,Daniel D. Johnson,Sarah Schwettmann,Jacob Steinhardt 备注:28 pages, 12 figures 摘要:解释神经网络的内部激活可以对其行为产生更忠实的解释,但由于激活空间的复杂结构,这是困难的。现有的可扩展可解释性方法使用手工设计的代理,这些代理可以制作和测试关于内部激活如何与外部行为相关的假设。我们建议将此任务转变为端到端的训练目标,通过训练可解释性助手来准确预测通过通信瓶颈激活的模型行为。具体来说,编码器将激活压缩为稀疏的概念列表,解码器读取该列表并回答关于模型的自然语言问题。我们将展示如何在大型非结构化数据上预训练这个助手,然后对其进行微调以回答问题。由此产生的架构,我们称之为预测概念解码器,具有良好的缩放特性:瓶颈概念的自动interp分数随着数据的增加而提高,下游应用程序的性能也随之提高。具体来说,PCD可以检测越狱、秘密提示和植入的潜在概念,并且能够准确地显现潜在的用户属性。 摘要:Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
【3】Artism: AI-Driven Dual-Engine System for Art Generation and Critique 标题:Artism:人工智能驱动的艺术生成与批评双引擎系统 链接:https://arxiv.org/abs/2512.15710
作者:Shuai Liu,Yiqing Tian,Yang Chen,Mar Canet Sola 备注:7 pages, 3 figures, 36 references, appendix with support material and 1 introduction video 摘要:本文提出了一种双引擎AI架构方法,旨在解决探索艺术演变中潜在轨迹的复杂问题。我们提出了两个相互关联的组件:AIDA(人工艺术家社交网络)和Ismism Machine,一个批判性分析系统。核心创新在于利用深度学习和多智能体协作来实现艺术历史发展和概念创新模式的多维模拟。该框架探讨了从传统的单向批判转向智能,互动模式的反思性实践。我们目前正在运用这种方法对当代艺术概念进行实验研究。这项研究介绍了一种基于人工智能驱动的关键循环的通用方法,为艺术的计算分析提供了新的可能性。 摘要:This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
【4】mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs 标题:mimic-video:VLA之外的可推广机器人控制的视频动作模型 链接:https://arxiv.org/abs/2512.15692
作者:Jonas Pai,Liam Achenbach,Victoriano Montesinos,Benedek Forrai,Oier Mees,Elvis Nava 摘要:用于机器人操作的主流视觉-语言-动作模型(VLA)建立在视觉语言主干之上,而这些主干是在大规模但彼此割裂的静态Web数据上预训练的。因此,尽管语义泛化有所改进,策略仍必须仅从机器人轨迹中隐式地推断复杂的物理动力学和时间依赖性。这种依赖造成了不可持续的数据负担,需要持续、大规模的专家数据收集来弥补先天物理理解的缺乏。我们认为,虽然视觉语言预训练能有效捕捉语义先验,但它对物理因果关系仍然视而不见。一种更有效的范式是在预训练期间利用视频联合捕获语义和视觉动态,从而把剩余的低层控制任务分离出来。为此,我们引入了mimic-video,一种新的视频动作模型(VAM),它将预训练的互联网规模视频模型与基于流匹配的动作解码器配对,该解码器以视频模型的潜在表示为条件。解码器充当逆动力学模型(IDM),从视频空间动作计划的潜在表示中生成低层机器人动作。我们的广泛评估表明,我们的方法在模拟和真实世界的机器人操作任务上实现了最先进的性能,与传统的VLA架构相比,将样本效率提高了10倍,收敛速度提高了2倍。 摘要:Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
【5】BashArena: A Control Setting for Highly Privileged AI Agents 标题:BashArena:面向高权限人工智能代理的控制环境 链接:https://arxiv.org/abs/2512.15688
作者:Adam Kaufman,James Lucassen,Tyler Tracy,Cody Rushing,Aryan Bhatt 备注:The task generation pipeline can be found here: https://github.com/redwoodresearch/basharena_public 摘要:未来的AI代理可能会以更高的权限自主运行。如果这些代理未对齐(misaligned),它们可能会滥用这些权限造成严重损害。人工智能控制(AI control)领域开发了一些技术,使未对齐的人工智能更难造成这种损害,同时保持它们的有用性。我们介绍BashArena,这是一个用于在安全关键环境中研究AI控制技术的设置。BashArena包含637个处于复杂、真实环境中的Linux系统管理和基础设施工程任务,以及四个供红队瞄准的破坏目标(执行恶意软件、泄露机密、提升权限和禁用防火墙)。我们评估了多个前沿LLM完成任务、在不被发现的情况下实施破坏以及检测破坏企图的能力。在轨迹级误报率(FPR)为4%的条件下,Claude Sonnet 4.5有26%的情况成功实施破坏并躲过GPT-4.1 mini的监控。我们的研究结果为在BashArena中设计更有效的控制协议提供了基线。我们将数据集作为ControlArena设置发布,并共享我们的任务生成管道。 摘要:Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
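The headline number above ("evading monitoring 26% of the time, at 4% trajectory-wise FPR") combines two quantities: a monitor threshold calibrated on benign trajectories, and the share of attack trajectories that stay at or below it. A minimal sketch of that calculation follows. The function names and toy scores are hypothetical, one suspicion score per trajectory is assumed, and every attack trajectory in the toy data is assumed to be a successful sabotage; the actual evaluation protocol may differ.

```python
import math


def threshold_at_fpr(benign_scores, max_fpr):
    """Threshold such that at most `max_fpr` of benign trajectories score
    strictly above it (a trajectory is flagged when score > threshold)."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ranked = sorted(benign_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # benign flags we tolerate
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def evasion_rate(attack_scores, threshold):
    """Share of attack trajectories the monitor does not flag."""
    unflagged = sum(1 for score in attack_scores if score <= threshold)
    return unflagged / len(attack_scores)


# Toy suspicion scores (hypothetical); higher means more suspicious.
benign = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.9, 0.35, 0.7]
attacks = [0.95, 0.3, 0.8, 0.38, 0.5]
threshold = threshold_at_fpr(benign, max_fpr=0.2)
evaded = evasion_rate(attacks, threshold)
```

Fixing the false-positive budget first and only then measuring evasion keeps results comparable across monitors, since a monitor could otherwise look stronger simply by flagging more benign work.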
【6】Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning 标题:LLM能否引导自身的探索?面向LLM推理的梯度引导强化学习 链接:https://arxiv.org/abs/2512.15687
作者:Zhenwen Liang,Sidi Lu,Wenhao Yu,Kishan Panaganti,Yujun Zhou,Haitao Mi,Dong Yu 摘要:强化学习对于加强大型语言模型的推理能力至关重要,但目前的探索机制与这些模型的实际学习方式仍然存在根本性的偏差。熵奖励和外部语义比较器鼓励表面层次的变化,但不能保证采样轨迹在决定优化走向的更新方向上有所不同。我们提出了G2RL,一个梯度引导的强化学习框架,其中探索不是由外部启发式驱动,而是由模型自身的一阶更新几何驱动。对于每个响应,G2RL根据模型最后一层的敏感度构建一个序列级特征(可以从标准前向传递中以可忽略的成本获得),并通过在采样组内比较这些特征来衡量每条轨迹将如何重塑策略。引入新梯度方向的轨迹会获得有界的乘性奖励缩放因子,而冗余或偏离流形的更新则被弱化,从而产生与PPO风格的稳定性和KL控制自然对齐的自参照探索信号。在Qwen3 base 1.7B和4B模型上的数学和通用推理基准测试(MATH500、AMC、AIME24、AIME25、GPQA、MMLUpro)中,G2RL相对于基于熵的GRPO和外部嵌入方法,一致地提高了pass@1、maj@16和pass@k。分析所诱导的几何结构,我们发现G2RL将探索扩展到明显更加正交且常常相反的梯度方向,同时保持语义连贯性,这表明策略自身的更新空间为指导大型语言模型强化学习中的探索提供了更忠实、更有效的基础。 摘要:Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. 
Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
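The idea of a "bounded multiplicative reward scaler" driven by how novel a trajectory's update direction is within its sampled group can be illustrated as follows. This is a hypothetical sketch, not the G2RL formula: the feature vectors stand in for the sequence-level gradient features described in the abstract, and the mean-cosine-similarity novelty measure and the `strength` parameter are assumptions chosen only to show the mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def novelty_scaled_rewards(rewards, features, strength=0.5):
    """Scale each reward by a factor in [1 - strength, 1 + strength].

    Assumption: a response is "redundant" when its feature is, on average,
    similar to the other responses in the group, and "novel" otherwise.
    """
    scaled = []
    for i, (reward, feature) in enumerate(zip(rewards, features)):
        sims = [cosine(feature, other)
                for j, other in enumerate(features) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        factor = 1.0 - strength * mean_sim  # bounded because |mean_sim| <= 1
        scaled.append(reward * factor)
    return scaled


# Toy group of three responses with equal base reward.
rewards = [1.0, 1.0, 1.0]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scaled = novelty_scaled_rewards(rewards, features)
```

In this toy group the two responses with identical features are down-weighted relative to the one pointing in an orthogonal direction, and the bound on the factor keeps the adjustment from overwhelming the base reward.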
【7】Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Link: https://arxiv.org/abs/2512.15674
Authors: Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks
Comments: 36 pages
Abstract: Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
【8】Explaining the Reasoning of Large Language Models Using Attribution Graphs
Link: https://arxiv.org/abs/2512.15663
Authors: Chase Walker, Rickard Ewetz
Abstract: Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties: causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
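To make the idea of marginalizing along paths concrete, here is a minimal sketch of one way a causal, row-stochastic attribution graph could be folded back onto the prompt. The function name `marginalize_to_prompt`, the list-of-rows layout, and the toy weights are assumptions made for illustration; the abstract does not specify how CAGE builds or stores its graph.

```python
def marginalize_to_prompt(weights, num_prompt):
    """Fold a causal, row-stochastic attribution graph onto the prompt tokens.

    weights[i] is the (hypothetical) row for generated token i: its direct
    influence from each of the num_prompt prompt tokens, followed by its
    direct influence from each of the i earlier generated tokens.
    Each row must sum to 1 (row stochasticity) and only refers to earlier
    tokens (causality).
    """
    totals = []  # totals[i][p]: total influence of prompt token p on generation i
    for i, row in enumerate(weights):
        if len(row) != num_prompt + i:
            raise ValueError(f"row {i} must have {num_prompt + i} entries")
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError(f"row {i} is not stochastic")
        total = list(row[:num_prompt])  # direct prompt -> generation edges
        for g in range(i):
            via_g = row[num_prompt + g]  # weight of edge: generation g -> generation i
            for p in range(num_prompt):
                # Every path through generation g inherits g's own prompt attribution.
                total[p] += via_g * totals[g][p]
        totals.append(total)
    return totals


# Toy graph: 2 prompt tokens, 2 generated tokens.
toy_weights = [
    [0.75, 0.25],      # generation 0 depends only on the prompt
    [0.5, 0.0, 0.5],   # generation 1 depends on prompt token 0 and generation 0
]
prompt_attribution = marginalize_to_prompt(toy_weights, num_prompt=2)
print(prompt_attribution)
```

Because every row is stochastic and every edge points backwards, each marginalized row still sums to 1, so the result can be read as a distribution of credit over prompt tokens.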
【9】Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Link: https://arxiv.org/abs/2512.15662
Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu
Comments: Under Review
Abstract: Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
【10】PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning
Link: https://arxiv.org/abs/2512.15658
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan
Comments: 10 pages, 3 figures, 2025 IEEE International Conference on Big Data (BigData)
Abstract: Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model’s ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
【11】VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Link: https://arxiv.org/abs/2512.15649
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
【12】IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
Link: https://arxiv.org/abs/2512.15635
Authors: Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, Qi Mao
Abstract: We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
【13】How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness
Link: https://arxiv.org/abs/2512.15634
Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Comments: Accepted at AACL IJCNLP 2025
Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
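As background for what a rank sweep varies, the sketch below counts trainable parameters for a single weight matrix under the standard LoRA factorization, where a frozen `d_out x d_in` matrix receives an update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The 4096 x 4096 layer size and the rank values are illustrative assumptions, not settings taken from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank pair: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


def full_finetune_params(d_in, d_out):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


# Hypothetical projection matrix; the size is chosen only for illustration.
D_IN, D_OUT = 4096, 4096
for rank in (4, 8, 16, 64, 256):
    lora = lora_trainable_params(D_IN, D_OUT, rank)
    share = lora / full_finetune_params(D_IN, D_OUT)
    print(f"rank={rank:>3}  trainable={lora:>9,}  share of full matrix={share:.4%}")
```

The trainable parameter count grows linearly with the rank, which is why the rank is the natural knob for trading adaptation capacity against cost.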
【14】Evaluating Metrics for Safety with LLM-as-Judges
Link: https://arxiv.org/abs/2512.15617
Authors: Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton
Abstract: LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
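The two mechanisms named at the end of the abstract, a basket of weighted metrics and a concordance threshold that escalates critical judgments to a human, can be sketched roughly as follows. The metric names, the weights, the use of score spread as a concordance measure, and the 0.2 threshold are all assumptions for illustration; the abstract does not define any of them.

```python
def weighted_basket_score(metric_scores, weights):
    """Combine several metric scores (each in [0, 1]) into one weighted score."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metric_scores[name] * w for name, w in weights.items()) / total_weight


def needs_human_review(judge_scores, is_critical, max_spread=0.2):
    """Escalate a critical item when the judges disagree too much.

    Concordance is approximated here by the spread between the highest and
    lowest judge score; a real system might use a proper agreement statistic.
    """
    spread = max(judge_scores) - min(judge_scores)
    return is_critical and spread > max_spread


# Hypothetical metric basket for one LLM output.
score = weighted_basket_score(
    {"faithfulness": 1.0, "completeness": 0.5},
    {"faithfulness": 3, "completeness": 1},
)
print(score)
print(needs_human_review([0.9, 0.4, 0.85], is_critical=True))
print(needs_human_review([0.9, 0.85, 0.8], is_critical=True))
```

Gating on `is_critical` reflects the abstract's emphasis on context: the same level of disagreement may be tolerable for a low-severity item and unacceptable for a safety-critical one.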
【15】How Smoothing is N-simplicial Attention?
Link: https://arxiv.org/abs/2512.15600
Authors: Alexandre Dussolle, Pietro Liò
Comments: arXiv preprint
Abstract: Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g., GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
【16】A Decision-Theoretic Approach for Managing Misalignment
Link: https://arxiv.org/abs/2512.15584
Authors: Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein
Comments: Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI '26)
Abstract: When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely, accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.
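The claim that delegation can be rational despite misalignment can be illustrated with a deliberately simple toy model, which is not the paper's framework. Assume the principal earns utility 1 for a correct act and 0 otherwise, the agent serves the principal's interest with probability `1 - misalignment`, picks the correct act with probability `accuracy` when it does, and yields utility 0 when it does not. All names and numbers below are hypothetical.

```python
def expected_utility_of_delegating(accuracy, misalignment):
    """Principal's expected utility if the (toy) agent decides.

    Toy assumption: a misaligned choice is worth 0 to the principal, and an
    aligned choice is correct with probability `accuracy`.
    """
    return (1.0 - misalignment) * accuracy


def should_delegate(accuracy, misalignment, own_expected_utility):
    """Delegate only when it beats the principal acting alone, in expectation."""
    return expected_utility_of_delegating(accuracy, misalignment) > own_expected_utility


# A principal who can only guess between two acts expects utility 0.5 alone.
OWN_UTILITY = 0.5
print(should_delegate(accuracy=0.9, misalignment=0.2, own_expected_utility=OWN_UTILITY))
print(should_delegate(accuracy=0.9, misalignment=0.5, own_expected_utility=OWN_UTILITY))
```

In this toy setting a 20% chance of misaligned behaviour is still worth accepting because the agent is much more accurate than the principal, while a 50% chance is not. The sketch omits reach and the principal's uncertainty about the agent, both of which the abstract identifies as part of the actual analysis.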
【17】Evaluating Large Language Models in Scientific Discovery
Link: https://arxiv.org/abs/2512.15567
Authors: Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan
Abstract: Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing returns from scaling up model size and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation across research scenarios changes which model performs best on each scientific discovery project evaluated, suggesting that all current LLMs are far from general scientific “superintelligence”. Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
【18】A Conditioned UNet for Music Source Separation
Link: https://arxiv.org/abs/2512.15532
Authors: Ken O’Hanlon, Basil Woods, Lin Wang, Mark Sandler
Abstract: In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1 dB SNR on a couple of MSS tasks, while using less than half the number of parameters.
【19】BERT and CNN integrated Neural Collaborative Filtering for Recommender Systems
Link: https://arxiv.org/abs/2512.15526
Authors: Abdullah Al Munem, Sumona Yeasmin, Mohammad Rezwanul Huq
Abstract: Every day, a significant number of users visit the internet for different needs. The owners of a website generate profits from the user interaction with the contents or items of the website. A robust recommendation system can increase user interaction with a website by recommending items according to the user’s unique preferences. BERT and CNN-integrated neural collaborative filtering (NCF) have been proposed for the recommendation system in this experiment. The proposed model takes inputs from the user and item profile and finds the user’s interest. This model can handle numeric, categorical, and image data to extract the latent features from the inputs. The model is trained and validated on a small sample of the MovieLens dataset for 25 epochs. The same dataset has been used to train and validate a simple NCF and a BERT-based NCF model and compared with the proposed model. The proposed model outperformed those two baseline models. The obtained result for the proposed model is 0.72 recall and 0.486 Hit Ratio @ 10 for 799 users on the MovieLens dataset. This experiment concludes that considering both categorical and image data can improve the performance of a recommendation system.
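For readers unfamiliar with the reported metric, the sketch below computes Hit Ratio @ k under its common definition: the fraction of users for whom at least one held-out item appears among the top k recommendations. The function name, the data layout, and the toy users are assumptions; the abstract does not describe the paper's exact evaluation protocol.

```python
def hit_ratio_at_k(rankings, held_out, k=10):
    """Fraction of users with at least one held-out item in their top-k list.

    rankings: dict mapping user id -> list of item ids, best first.
    held_out: dict mapping user id -> set of relevant item ids withheld for testing.
    """
    if not rankings:
        raise ValueError("rankings must contain at least one user")
    hits = 0
    for user, ranked_items in rankings.items():
        if set(ranked_items[:k]) & held_out.get(user, set()):
            hits += 1
    return hits / len(rankings)


# Toy example with two hypothetical users.
toy_rankings = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
toy_held_out = {"u1": {"c"}, "u2": {"z"}}
print(hit_ratio_at_k(toy_rankings, toy_held_out, k=2))
print(hit_ratio_at_k(toy_rankings, toy_held_out, k=3))
```

Under this definition, a Hit Ratio @ 10 of 0.486 would mean that a little under half of the evaluated users saw a held-out item in their top 10.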
【20】Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection
Link: https://arxiv.org/abs/2512.15503
Authors: Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Hexu Li, Panos Papadimitratos
Comments: 17 pages, 10 figures
Abstract: Vehicular platooning promises transformative improvements in transportation efficiency and safety through the coordination of multi-vehicle formations enabled by Vehicle-to-Everything (V2X) communication. However, the distributed nature of platoon coordination creates security vulnerabilities, allowing authenticated vehicles to inject falsified kinematic data, compromise operational stability, and pose a threat to passenger safety. Traditional misbehaviour detection approaches, which rely on plausibility checks and statistical methods, suffer from high False Positive (FP) rates and cannot capture the complex temporal dependencies inherent in multi-vehicle coordination dynamics. We present Attention In Motion (AIMformer), a transformer-based framework specifically tailored for real-time misbehaviour detection in vehicular platoons with edge deployment capabilities. AIMformer leverages multi-head self-attention mechanisms to simultaneously capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations. It incorporates global positional encoding with vehicle-specific temporal offsets to handle join/exit maneuvers. We propose a Precision-Focused (BCE) loss function that penalizes FPs to meet the requirements of safety-critical vehicular systems. Extensive evaluation across 4 platoon controllers, multiple attack vectors, and diverse mobility scenarios demonstrates superior performance (≥ 0.93) compared to state-of-the-art baseline architectures. A comprehensive deployment analysis utilizing TensorFlow Lite (TFLite), Open Neural Network Exchange (ONNX), and TensorRT achieves sub-millisecond inference latency, making it suitable for real-time operation on resource-constrained edge platforms. These results validate AIMformer as viable for both in-vehicle and roadside infrastructure deployment.
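One plausible reading of a binary cross-entropy loss that penalizes false positives is to up-weight the term for benign samples (label 0), so that confidently flagging a benign vehicle costs more than missing an attack with the same confidence. The sketch below shows that generic weighting scheme; the function name, the `fp_weight` value, and the form of the loss are assumptions, since the abstract does not give the paper's actual formulation.

```python
import math


def fp_weighted_bce(y_true, p_pred, fp_weight=2.0, eps=1e-7):
    """Binary cross-entropy with an extra penalty on the negative class.

    y_true: 1 for misbehaviour, 0 for benign.
    p_pred: predicted probability of misbehaviour.
    fp_weight: hypothetical multiplier (> 1) applied when the true label is 0,
               so that false alarms are punished more than missed detections.
    """
    p = min(max(p_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    if y_true == 1:
        return -math.log(p)
    return -fp_weight * math.log(1.0 - p)


# A confident false alarm costs twice as much as an equally confident miss.
false_alarm = fp_weighted_bce(y_true=0, p_pred=0.9)
missed_attack = fp_weighted_bce(y_true=1, p_pred=0.1)
print(false_alarm, missed_attack)
```

Raising `fp_weight` pushes the decision boundary toward higher precision at the expense of recall, which matches the stated goal of reducing false positives.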
【21】Soft Geometric Inductive Bias for Object Centric Dynamics
Link: https://arxiv.org/abs/2512.15493
Authors: Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley
Comments: 8 pages, 11 figures; 6 pages supplementary material
Abstract: Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2D rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.
【22】Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision
Link: https://arxiv.org/abs/2512.15489
Authors: Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, Igor Gitman
Abstract: High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3× without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
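The maj@16 figure refers to majority voting: sample 16 answers per problem, take the most frequent final answer, and score that single answer against the reference. A minimal sketch follows; the function names and toy data are illustrative, and answer normalization (which real evaluations need) is left out.

```python
from collections import Counter


def majority_answer(samples):
    """Most frequent answer among the samples; ties go to the earliest-seen answer."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    return Counter(samples).most_common(1)[0][0]


def maj_at_k(problems, k=16):
    """Share of problems whose majority answer over the first k samples is correct.

    problems: list of (sampled_answers, reference_answer) pairs.
    """
    correct = sum(
        1 for samples, reference in problems
        if majority_answer(samples[:k]) == reference
    )
    return correct / len(problems)


# Toy example with k=3 instead of 16 to keep it short.
toy_problems = [
    (["42", "41", "42"], "42"),  # majority is right
    (["7", "8", "8"], "7"),      # majority is wrong even though one sample is right
]
print(maj_at_k(toy_problems, k=3))
```

The second toy problem shows how maj@k differs from pass@k: a correct sample exists, but it is outvoted.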
【23】How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code? 标题:语义等效代码转换如何影响LLM代码的成员推断? 链接:https://arxiv.org/abs/2512.15468
作者:Hua Yang,Alejandro Velasco,Thanh Le-Cong,Md Nazmul Haque,Bowen Xu,Denys Poshyvanyk 备注:13 pages, 3 figures 摘要:大型代码语言模型的成功依赖于大量的代码数据,包括公共开源存储库(如GitHub)和公司的私有机密代码。这引起了人们对知识产权合规性和潜在的未经授权使用受许可限制的代码的担忧。虽然成员推理(MI)技术已被提出来检测这种未经授权的使用,它们的有效性可以被破坏的语义等效的代码转换技术,修改代码语法,同时保留语义。 在这项工作中,我们系统地研究是否语义等效的代码转换规则可能会被利用,以逃避MI检测。结果显示,在最坏的情况下,每个规则的模型准确性仅下降1.5%,这表明转换后的数据集可以有效地作为微调的替代品。此外,我们发现其中一条规则(RenameVariable)将MI成功率降低了10.19%,突出了其掩盖受限代码存在的潜力。为了验证这些发现,我们进行了因果分析,证实变量重命名在破坏MI检测方面具有最强的因果效应。值得注意的是,我们发现,组合多个转换不会进一步降低MI的有效性。我们的研究结果暴露了许可证合规性执行中的一个关键漏洞,用于训练代码的大型语言模型,这表明基于转换的混淆技术可以大大削弱MI检测。 摘要:The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformation techniques, which modify code syntax while preserving semantic. In this work, we systematically investigate whether semantically equivalent code transformation rules might be leveraged to evade MI detection. The results reveal that model accuracy drops by only 1.5% in the worst case for each rule, demonstrating that transformed datasets can effectively serve as substitutes for fine-tuning. Additionally, we find that one of the rules (RenameVariable) reduces MI success by 10.19%, highlighting its potential to obscure the presence of restricted code. To validate these findings, we conduct a causal analysis confirming that variable renaming has the strongest causal effect in disrupting MI detection. Notably, we find that combining multiple transformations does not further reduce MI effectiveness. 
Our results expose a critical loophole in license compliance enforcement for training large language models for code, showing that MI detection can be substantially weakened by transformation-based obfuscation techniques.
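To make the RenameVariable rule in entry 【23】 concrete, here is a minimal sketch of a semantics-preserving variable renaming built on Python's standard `ast` module. It is an illustration only, not the authors' tool: the class and function names, the `v0`, `v1`, ... naming scheme, and the choice of Python as the target language are all assumptions. It only renames names it has already seen bound, so it ignores corner cases such as globals used before assignment or keyword-argument call sites.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename arguments and assigned names to opaque identifiers (v0, v1, ...)."""

    def __init__(self):
        self.mapping = {}  # original name -> replacement name

    def _alias(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        # Function parameters are always renamed.
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Rename every binding; rename a read only if the name was bound
        # earlier, so builtins such as len() are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._alias(node.id)
        return node


def rename_variables(source: str) -> str:
    """Return source code with local identifiers replaced (Python 3.9+)."""
    tree = RenameVariable().visit(ast.parse(source))
    return ast.unparse(tree)


original = '''
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
'''
transformed = rename_variables(original)
print(transformed)
```

The transformed function computes the same result as the original while sharing none of its local identifiers, which is the kind of surface change the abstract reports as weakening MI detection.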
【24】On Assessing the Relevance of Code Reviews Authored by Generative Models 标题:关于评估生成模型撰写的代码审查的相关性 链接:https://arxiv.org/abs/2512.15466
作者:Robert Heumüller,Frank Ortmeier 备注:Replication Package: https://github.com/robert-heumueller-ovgu/repl-generative-review-relevance 摘要:在代码审查中使用像ChatGPT这样的大型语言模型提供了有希望的效率提升,但也引发了对正确性和安全性的担忧。现有的代码评审生成评估方法要么依赖于自动比较一个单一的地面真相,这无法捕捉人类的观点的变化,或对“有用性”,一个高度模糊的概念的主观评估。我们提出了一种新的评价方法,我们称之为多主观排名的基础上。使用来自CodeReview StackExchange的280个独立代码审查请求和相应评论的数据集,多位人工评委对ChatGPT生成的评论的质量以及来自平台的顶级人工响应进行了排名。结果显示,ChatGPT的评论排名明显优于人类,甚至超过了StackExchange的接受答案。更进一步,我们提出的方法激励并实现了对生成AI在代码审查中的表现进行更有意义的评估,同时也提高了对未经检查的集成到审查过程中的潜在风险的认识。 摘要:The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic comparisons to a single ground truth, which fails to capture the variability of human perspectives, or on subjective assessments of “usefulness”, a highly ambiguous concept. We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT’s comments were ranked significantly better than human ones, even surpassing StackExchange’s accepted answers. Going further, our proposed method motivates and enables more meaningful assessments of generative AI’s performance in code review, while also raising awareness of potential risks of unchecked integration into review processes.
【25】Intent-Driven UAM Rescheduling 标题:意图驱动的UAM重新调度 链接:https://arxiv.org/abs/2512.15462
作者:Jeongseok Kim,Kangjin Kim 备注:18 pages, 2 figures, AAIML submitted 摘要:由于资源的限制,在城市空中交通(UAM)领域,垂直起降场的高效调度受到了越来越多的关注。对于调度问题,我们采用混合整数线性规划(MILP),该问题通常被表述为资源受限项目调度问题(RCPSP)。在本文中,我们展示了处理动态运行需求和来自人类的模糊重新调度请求的方法。特别是,我们利用三值逻辑和决策树来解释模糊的用户意图,提出了一个结合答案集编程(ASP)和MILP的新集成系统。这一集成框架既优化了调度方案,又透明地支持人工输入。借助该系统,我们为可解释、自适应的UAM调度提供了一个稳健的结构。 摘要:Due to the restricted resources, efficient scheduling in vertiports has received much more attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we use Mixed Integer Linear Programming (MILP), which is often formulated as a resource-constrained project scheduling problem (RCPSP). In this paper, we show our approach to handle both dynamic operation requirements and vague rescheduling requests from humans. Particularly, we utilize a three-valued logic for interpreting ambiguous user intents and a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.
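The abstract of entry 【25】 does not say which three-valued logic is used, so the sketch below assumes Kleene's strong three-valued logic as one plausible reading. The `decide` function, its two input flags, and the returned strings are hypothetical and only show how an UNKNOWN value could route a vague request back to a human instead of into the MILP solver.

```python
from enum import Enum


class K3(Enum):
    """Truth values ordered FALSE < UNKNOWN < TRUE (Kleene's strong logic)."""
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2


def k_not(a: K3) -> K3:
    return K3(2 - a.value)


def k_and(a: K3, b: K3) -> K3:
    return K3(min(a.value, b.value))


def k_or(a: K3, b: K3) -> K3:
    return K3(max(a.value, b.value))


def decide(flight_identified: K3, delay_requested: K3) -> str:
    """Hypothetical routing of a rescheduling request (not from the paper)."""
    verdict = k_and(flight_identified, delay_requested)
    if verdict is K3.TRUE:
        return "pass constraint to scheduler"
    if verdict is K3.FALSE:
        return "keep current schedule"
    return "ask the operator to clarify"


# A request that names the flight but is vague about the delay stays UNKNOWN.
print(decide(K3.TRUE, K3.UNKNOWN))
```

The point of the third value is that ambiguity is carried through the logic explicitly rather than being forced to true or false before the scheduler runs.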
【26】Double Horizon Model-Based Policy Optimization 标题:双视野基于模型的策略优化 链接:https://arxiv.org/abs/2512.15439
作者:Akihiro Kubo,Paavo Parmas,Shin Ishii 备注:Accepted to Transactions on Machine Learning Research (TMLR) Code available at https://github.com/4kubo/erl_lib 摘要:基于模型的强化学习(MBRL)通过从学习的动力学模型生成合成轨迹(称为卷展)来降低真实环境采样的成本。然而,选择推出的长度带来了两个困境:(1)较长的推出更好地保留了政策培训,但放大了模型偏差,这表明需要一个中间视野来减轻分布变化(即,政策上和过去政策外样本之间的差距)。(2)此外,较长的模型推出可以减少值估计偏差,但由于通过多个步骤的反向传播而提高了策略梯度的方差,这意味着稳定梯度估计的另一个中间范围。然而,这两个最佳视野可能不同。为了解决这一冲突,我们提出了基于模型的双地平线策略优化(DHMBPO),它将推出过程分为长期“分发推出”(DR)和短期“培训推出”(TR)。DR生成策略上状态样本以减轻分布偏移。相比之下,短TR利用可微分转换来提供具有稳定梯度更新的准确值梯度估计,从而需要更少的更新并减少总体运行时间。我们证明了双视野方法有效地平衡了分布偏移,模型偏差和梯度不稳定性,并在样本效率和运行时间方面超过了现有的连续控制基准MBRL方法。 摘要:Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long “distribution rollout” (DR) and a short “training rollout” (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. 
We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
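The split between a long distribution rollout (DR) and a short training rollout (TR) in entry 【26】 can be sketched on a toy problem. Everything below is an assumption made for illustration: a scalar linear "learned" model, a one-parameter linear policy, a quadratic reward, hand-written forward-mode derivatives in place of automatic differentiation, and no value function. It is not the DHMBPO implementation from the authors' repository.

```python
A, B = 1.0, 0.5  # toy "learned" dynamics: s' = A*s + B*u


def model_step(s, k):
    u = -k * s  # linear policy with a single gain k
    return A * s + B * u


def distribution_rollout(s_real, k, horizon):
    """Long rollout without gradients: only collects on-policy states."""
    states, s = [s_real], s_real
    for _ in range(horizon):
        s = model_step(s, k)
        states.append(s)
    return states


def training_rollout_grad(s0, k, horizon):
    """Short rollout: d(return)/dk with reward r_t = -s_t**2.

    The start state s0 is treated as a constant, so ds/dk starts at 0.
    """
    s, ds, grad = s0, 0.0, 0.0
    for _ in range(horizon):
        grad += -2.0 * s * ds
        # Both updates use the old s and ds (tuple assignment).
        s, ds = model_step(s, k), (A - B * k) * ds - B * s
    return grad


def train(k=0.0, iters=200, lr=0.05, dr_horizon=20, tr_horizon=3):
    for _ in range(iters):
        starts = distribution_rollout(1.0, k, dr_horizon)
        grads = [training_rollout_grad(s, k, tr_horizon) for s in starts]
        k += lr * sum(grads) / len(grads)  # gradient ascent on the short return
    return k


trained_k = train()
print(trained_k, abs(A - B * trained_k))
```

The long rollout decides where gradients are evaluated, while the short rollout keeps each gradient cheap and low-variance; in this toy setting the learned gain shrinks the closed-loop factor `|A - B*k|` below its initial value of 1.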
【27】Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat 标题:用于多人吃墩类纸牌游戏的外层学习框架:Skat案例研究 链接:https://arxiv.org/abs/2512.15435
作者:Stefan Edelkamp 摘要:在多人纸牌游戏(如Skat或Bridge)中,游戏的早期阶段,如出价,游戏选择和初始卡片选择,通常比精细的中期和末期游戏更重要。在目前的计算限制下,这种早期决策依赖于使用来自人类专家游戏的大型语料库的统计信息。在本文中,我们推导并评估了一个通用的自举外部学习框架,该框架通过扩展人类游戏数据库来生成和合并统计数据,从而提高预测准确性。我们实现了完美的特征散列函数来处理压缩表,从而产生了一个自我改进的纸牌游戏引擎,其中新推断的知识在自我学习过程中不断得到改进。Skat的案例研究表明,自动化方法可以用于支持游戏中的各种决策。 摘要:In multi-player card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical to the success of the play than refined middle- and end-game play. At the current limits of computation, such early decision-making resorts to using statistical information derived from a large corpus of human expert games. In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games with millions of self-playing AI games to generate and merge statistics. We implement perfect feature hash functions to address compacted tables, producing a self-improving card game engine, where newly inferred knowledge is continuously improved during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.
【28】FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments 标题:FM-EAC:基于特征模型的增强型Actor-Critic,用于动态环境中的多任务控制 链接:https://arxiv.org/abs/2512.15430
作者:Quanxi Zhou,Wencan Mao,Manabu Tsukada,John C. S. Lui,Yusheng Ji 摘要:基于模型的强化学习(MBRL)和无模型强化学习(MFRL)沿着不同的路径发展,但在Dyna-Q的设计中收敛[1]。然而,现代强化学习方法仍然难以在任务和场景之间实现有效的可转移性。出于这种限制,我们提出了一个通用的算法,基于特征模型的增强演员评论(FM-EAC),集成了规划,代理和学习的多任务控制在动态环境中。FM-EAC结合了MBRL和MFRL的优势,并通过使用新颖的基于特征的模型和增强的演员-评论家框架来提高概括性。在城市和农业应用中的模拟表明,FM-EAC始终优于许多最先进的MBRL和MFRL方法。更重要的是,不同的子网可以根据用户的具体要求在FM-EAC中定制。 摘要:Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.
【29】SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering 标题:SMART:用于部分视图对齐聚类的语义匹配对比学习 链接:https://arxiv.org/abs/2512.15396
作者:Liang Peng,Yixuan Ye,Cheng Liu,Hangjun Che,Fei Wang,Zhiwen Yu,Si Wu,Hau-San Wong 摘要:多视图聚类已被经验证明可以通过利用数据的多个视图之间的固有互补信息来提高学习性能。然而,在现实世界中,收集严格对齐的视图是具有挑战性的,从对齐和未对齐的数据中学习成为更实用的解决方案。部分视图对齐聚类旨在学习未对齐视图样本之间的对应关系,以更好地利用视图之间的潜在一致性和互补性,包括对齐和未对齐的数据。然而,大多数现有的PVC方法无法利用未对齐的数据来捕获来自同一集群的样本之间的共享语义。此外,多视图数据的固有异质性会导致表示的分布变化,导致在跨视图潜在特征之间建立有意义的对应关系时不准确,从而损害学习效率。为了解决这些挑战,我们提出了一个语义匹配contRasTive学习模型(SMART)的PVC。我们的方法的主要思想是减轻跨视图分布变化的影响,从而促进语义匹配对比学习,充分利用对齐和未对齐数据中的语义关系。在8个基准数据集上的大量实验表明,我们的方法在PVC问题上始终优于现有方法。 摘要:Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
【30】Bilateral Spatial Reasoning about Street Networks: Graph-based RAG with Qualitative Spatial Representations 标题:关于街道网络的双边空间推理:具有定性空间表示的基于图的RAG 链接:https://arxiv.org/abs/2512.15388
作者:Reinhard Moratz,Niklas Daute,James Ondieki,Markus Kattenbeck,Mario Krajina,Ioannis Giannopoulos 摘要:本文论述了提高大语言模型(LLM)的能力,通过定性空间关系为行人指路提供路线指示。 摘要:This paper deals with improving the capabilities of Large Language Models (LLM) to provide route instructions for pedestrian wayfinders by means of qualitative spatial relations.
【31】Emotion Recognition in Signers 标题:手语者的情感识别 链接:https://arxiv.org/abs/2512.15376
作者:Kotaro Funakoshi,Yaoxiong Zhu 摘要:识别手语者的情绪面临一个理论挑战和一个实际挑战,即语法性面部表情与情感性面部表情之间的重叠,以及模型训练数据的稀缺。本文使用我们的eJSL数据集(一个新的日本手语手语者情感识别基准数据集)和BOBSL(一个带字幕的大型英国手语数据集)在跨语言环境中应对这两个挑战。在eJSL中,两位手语者以七种不同的情绪状态分别表达了78个不同的话语,共产生1,092个视频片段。我们通过实验表明,1)有声语言中的文本情感识别缓解了手语中的数据稀缺性,2)时间片段的选择具有显著影响,3)结合手部动作可增强手语者的情感识别。最后,我们建立了一个比有声语言LLM更强的基线。 摘要:Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken language LLMs.
【32】SCOPE: Prompt Evolution for Enhancing Agent Effectiveness 标题:SCOPE:通过提示演化提升智能体效能 链接:https://arxiv.org/abs/2512.15374
作者:Zehua Pei,Hui-Ling Zhen,Shixiong Kai,Sinno Jialin Pan,Yunhe Wang,Mingxuan Yuan,Bei Yu 摘要:大型语言模型(LLM)智能体越来越多地部署在生成大量动态上下文的环境中。然而,一个关键的瓶颈仍然存在:虽然智能体可以访问这些上下文,但它们的静态提示缺乏有效管理上下文的机制,从而导致反复出现的纠正型(Corrective)和增强型(Enhancement)失败。为了弥补这一能力差距,我们引入了SCOPE(通过提示演化进行自演化上下文优化)。SCOPE将上下文管理表述为一个在线优化问题,从执行轨迹中合成指导原则,自动演化智能体的提示。我们提出了一个双流机制,平衡战术特异性(解决即时错误)与战略通用性(演化长期原则)。此外,我们引入了视角驱动的探索,以最大限度地提高策略覆盖率,增加智能体对任何给定任务具有正确策略的可能性。在HLE基准测试上的实验表明,SCOPE在没有人为干预的情况下,将任务成功率从14.23%提高到38.64%。我们在https://github.com/JarvisPei/SCOPE上公开发布我们的代码。 摘要:Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce SCOPE (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent’s prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.
【33】Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models 标题:面向高效视觉语言模型的图像复杂度感知自适应检索 链接:https://arxiv.org/abs/2512.15372
作者:Mikel Williams-Lekuona,Georgina Cosma 备注:Accepted paper for ECIR 2026 摘要:视觉语言模型中的Vision Transformers在所有图像上应用统一的计算工作量,无论是分析简单的产品照片还是复杂的街道场景,都需要花费175.33 GFLOP(ViT-L/14)。我们提出了ICAR(图像复杂性感知检索),它使Vision Transformers能够对简单图像使用更少的计算,同时通过其完整的网络深度处理复杂图像。关键的挑战是保持跨模式对齐:来自不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR通过双路径训练解决了这个问题,该训练从减少计算和完全计算处理中产生兼容的嵌入。这保持了同一语义空间中图像表示和文本嵌入之间的兼容性,无论图像是提前退出还是完全处理。与现有的两阶段方法,需要昂贵的重新排序,ICAR使直接的图像-文本匹配,而无需额外的开销。为了确定要使用多少计算,我们开发了ConvNeXt-IC,它将图像复杂度评估视为分类任务。通过应用现代分类器骨干而不是专门的架构,ConvNeXt-IC实现了最先进的性能,与人类判断的相关性为0.959(Pearson),加速比为4.4倍。在使用真实网络数据增强的标准基准测试中,ICAR实现了20%的实际加速,同时保持了类别级性能和95%的实例级性能,从而实现了视觉语言系统的可持续扩展。 摘要:Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. 
Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
【34】Adversarial versification in portuguese as a jailbreak operator in LLMs 标题:葡萄牙语对抗性诗体化:LLM中的越狱算子 链接:https://arxiv.org/abs/2512.15353
作者:Joao Queiroz 备注:15 pages 摘要:最近的证据表明,将提示改写为诗体是一种针对已对齐LLM的非常有效的对抗机制。“对抗诗歌作为大型语言模型中的通用单轮越狱机制”这项研究表明,通常在散文形式下被拒绝的指令在重写为诗歌后变得可执行,在源自MLCommons AILuminate的基准测试中产生高达18倍的安全失效。人工撰写的诗歌的攻击成功率(ASR)达到约62%,自动生成的版本达到43%,有些模型在单轮交互中的成功率超过90%。这种影响是结构性的:使用RLHF、宪法式AI和混合流程训练的系统在极小的符号形式变化下表现出一致的退化。诗体化将提示转移到监督稀疏的潜在区域,暴露出过度依赖表面模式的护栏。这种表面上的稳健性与实际脆弱性之间的脱节暴露了当前对齐机制的深刻局限性。葡萄牙语是一种具有高度形态句法复杂性、丰富的格律与韵律传统以及超过2.5亿使用者的语言,针对它的评估缺失构成了一个关键空白。实验方案必须对格律分析(scansion)、音步和韵律变化进行参数化,以测试葡语特有模式下的漏洞,而这些漏洞目前被忽视。 摘要:Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study ‘Adversarial poetry as a universal single-turn jailbreak mechanism in large language models’ demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
【35】Empirical Investigation of the Impact of Phase Information on Fault Diagnosis of Rotating Machinery 标题:相位信息对旋转机械故障诊断影响的实证研究 链接:https://arxiv.org/abs/2512.15344
作者:Hiroyoshi Nagahama,Katsufumi Inoue,Masayoshi Todorokihara,Michifumi Yoshioka 备注:This work has been submitted to the IEEE for possible publication 摘要:旋转机械的预测性维护越来越依赖于振动信号,但大多数基于学习的方法要么在频谱特征提取期间丢弃相位,要么使用原始时间波形而不明确利用相位信息。本文介绍了两种相位感知预处理策略,以解决多轴振动数据中的随机相位变化:(1)三轴独立相位调整,将每个轴单独对齐到零相位(2)单轴参考相位调整,通过应用均匀的时间偏移来保持轴间的关系。使用同步三轴传感器获取的新构建的转子数据集,我们在两阶段学习框架下评估了六种深度学习架构。结果表明,架构无关的改进:三轴独立的方法实现了一致的增益(+2.7%的Transformer),而单轴参考方法提供了卓越的性能,高达96.2%的准确度(+5.4%),通过保留空间相位关系。这些研究结果建立了两个阶段的调整策略,作为预测性维护系统的实用和可扩展的增强功能。 摘要:Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2% accuracy (+5.4%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.
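The two preprocessing strategies in entry 【35】 differ only in how the time shift is chosen per axis. The sketch below is one possible reading, not the authors' code: it assumes the rotation fundamental falls on a known DFT bin, uses integer circular shifts, and all function names are hypothetical. Independent adjustment zeroes the phase of every axis; reference adjustment applies the shift computed from one axis to all axes, so inter-axis phase differences survive.

```python
import cmath
import math


def bin_phase(signal, k):
    """Phase (radians) of DFT bin k, computed directly from the definition."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
    return cmath.phase(coeff)


def shift_for_zero_phase(signal, k):
    """Integer circular shift that moves the phase of bin k to (about) zero."""
    n = len(signal)
    return round(-bin_phase(signal, k) * n / (2 * math.pi * k))


def circular_shift(signal, d):
    n = len(signal)
    return [signal[(i + d) % n] for i in range(n)]


def independent_adjust(axes, k):
    """Three-axis independent adjustment: each axis gets its own shift."""
    return [circular_shift(a, shift_for_zero_phase(a, k)) for a in axes]


def reference_adjust(axes, k, ref=0):
    """Single-axis reference adjustment: one shared shift taken from axis `ref`."""
    d = shift_for_zero_phase(axes[ref], k)
    return [circular_shift(a, d) for a in axes]


N, K = 64, 4  # samples per window and fundamental bin (toy values)
phases = [math.pi / 2, math.pi / 4, -math.pi / 4]  # toy x, y, z phases
axes = [[math.cos(2 * math.pi * K * i / N + p) for i in range(N)] for p in phases]

independent = independent_adjust(axes, K)
referenced = reference_adjust(axes, K)
print([round(bin_phase(a, K), 3) for a in independent])
print([round(bin_phase(a, K), 3) for a in referenced])
```

In the toy data, the y axis lags the x axis by a quarter of pi; the independent method erases that lag, whereas the reference method keeps it, which matches the abstract's explanation of why the single-axis reference approach retains spatial phase relationships.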
【36】Exploring User Acceptance and Concerns toward LLM-powered Conversational Agents in Immersive Extended Reality 标题:探索沉浸式延展实境中用户对LLM支持的对话代理的接受度和担忧 链接:https://arxiv.org/abs/2512.15343
作者:Efe Bozkir,Enkelejda Kasneci 摘要:生成式人工智能(AI)和大型语言模型(LLM)的快速发展,以及使其可访问的服务的可用性,使公众开始将它们融入日常生活。延展实境(XR)社区也试图整合LLM,特别是以会话代理的形式,以增强用户体验和任务效率。当与这样的会话代理交互时,由于会话的自然流,用户可以容易地公开敏感信息,并且将这样的会话数据与细粒度的传感器数据相结合可能导致新的隐私问题。为了解决这些问题,以用户为中心的技术接受和关注的理解是必不可少的。因此,为此,我们进行了一项有1036名参与者的大规模众包研究,研究了XR中基于LLM的会话代理的用户决策过程,包括XR设置类型、语音交互类型和数据处理位置等因素。我们发现,虽然用户普遍接受这些技术,但他们对安全、隐私、社会影响和信任表示担忧。我们的研究结果表明,熟悉度起着至关重要的作用,因为日常生成AI的使用与更大的接受度有关。相比之下,以前拥有XR设备与接受度较低有关,可能是由于对设置的熟悉程度。我们还发现,男性报告的接受程度高于女性,关注程度低于女性。关于数据类型敏感性,位置数据引起了最重要的关注,而体温和虚拟对象状态被认为是最不敏感的。总的来说,我们的研究强调了从业人员有效地将他们的措施传达给用户的重要性,用户可能仍然不信任。最后,我们对LLM-Powered XR提出了建议。 摘要:The rapid development of generative artificial intelligence (AI) and large language models (LLMs), and the availability of services that make them accessible, have led the general public to begin incorporating them into everyday life. The extended reality (XR) community has also sought to integrate LLMs, particularly in the form of conversational agents, to enhance user experience and task efficiency. When interacting with such conversational agents, users may easily disclose sensitive information due to the naturalistic flow of the conversations, and combining such conversational data with fine-grained sensor data may lead to novel privacy issues. To address these issues, a user-centric understanding of technology acceptance and concerns is essential. Therefore, to this end, we conducted a large-scale crowdsourcing study with 1036 participants, examining user decision-making processes regarding LLM-powered conversational agents in XR, across factors of XR setting type, speech interaction type, and data processing location. We found that while users generally accept these technologies, they express concerns related to security, privacy, social implications, and trust. Our results suggest that familiarity plays a crucial role, as daily generative AI use is associated with greater acceptance. 
In contrast, previous ownership of XR devices is linked to less acceptance, possibly due to existing familiarity with the settings. We also found that men report higher acceptance with fewer concerns than women. Regarding data type sensitivity, location data elicited the most significant concern, while body temperature and virtual object states were considered least sensitive. Overall, our study highlights the importance of practitioners effectively communicating their measures to users, who may remain distrustful. We conclude with implications and recommendations for LLM-powered XR.
【37】Vision-based module for accurately reading linear scales in a laboratory 标题:基于视觉的模块,用于在实验室中准确读取线性刻度 链接:https://arxiv.org/abs/2512.15327
作者:Parvesh Saini,Soumyadipta Maiti,Beena Rai 备注:10 pages, 16 figures 摘要:基于视觉的模型的能力和数量正在迅速增加。这些视觉模型现在能够以很高的精度完成更多的任务,如目标检测、图像分类、实例分割等。但是,能够像人类只需看一眼那样从图像中进行精确定量测量的模型仍然很少见。对于在实验室环境中完全自主工作的机器人来说,它需要具备一些基本技能,如导航、处理物体、准备样品等,才能在非结构化环境中达到与人类相似的能力。另一个重要的能力是从仪器和设备中读取测量值。在这里,我们尝试模仿受人类启发的方法来读取线性刻度的测量值。作为测试用例,我们选择读取注射器和量筒的液位。对于随机取向的注射器,我们进行变换以校正取向。为了使系统高效和鲁棒,感兴趣区域被缩小到图像中仅包含线性刻度的部分。之后,提取一系列特征,如主刻度线、对应的数字和液位指示位置,并据此计算最终读数。我们还将该系统获得的读数与人类对相同样例的读数进行比较,观察到二者准确对应。 摘要:Capabilities and the number of vision-based models are increasing rapidly. These vision models are now able to do more tasks like object detection, image classification, instance segmentation etc. with great accuracy. But models which can take accurate quantitative measurements from an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs to have some basic skills like navigation, handling objects, preparing samples etc. to match human-like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we tried to mimic a human-inspired approach to read measurements from a linear scale. As a test case we have picked reading level from a syringe and a measuring cylinder. For a randomly oriented syringe we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. After that, a series of features were extracted like the major markers, the corresponding digits, and the level indicator location, from which the final reading was calculated. Readings obtained using this system were also compared against human read values of the same instances and an accurate correspondence was observed.
【38】Managing Ambiguity: A Proof of Concept of Human-AI Symbiotic Sense-making based on Quantum-Inspired Cognitive Mechanism of Rogue Variable Detection 标题:管理模糊性:基于流氓变量检测的量子启发认知机制的人机共生意义建构概念验证 链接:https://arxiv.org/abs/2512.15325
作者:Agnieszka Bienkowska,Jacek Malecki,Alexander Mathiesen-Ohman,Katarzyna Tworek 备注:19 pages, 6 figures 摘要:组织越来越多地在以波动性、不确定性、复杂性和模糊性(VUCA)为特征的环境中运作,在这种环境中,早期的变化指标往往是微弱的、分散的信号。虽然人工智能(AI)被广泛用于支持管理决策,但大多数基于AI的系统仍然针对预测和解决方案进行了优化,导致在高度模糊的条件下过早地解释关闭。这在管理科学中造成了一个空白,即人类人工智能系统如何在模糊性变成错误或危机之前负责任地管理它。这项研究通过提出LAIZA人类-AI增强共生智能系统及其专利过程的概念证明(ESTA)来解决这一差距:量子启发的流氓变量建模(QRVM),人在环退相干和集体认知推理的系统和方法。该机制将歧义操作化为非崩溃的认知状态,检测持续的解释故障(流氓变量),并在自主推理变得不可靠时激活结构化的人在回路澄清。从经验上讲,本文借鉴了2025年人工智能发展中进行的为期三个月的案例研究,涉及员工意图和知识产权边界的长期模糊性。研究结果表明,保持解释的多元性使早期的基于网络的准备工作成为可能,包括积极的专利保护,一旦模糊性消失,就可以采取果断和无干扰的行动。该研究通过将模糊性重新定义为一流的结构,为管理理论做出了贡献,并展示了VUCA环境中人类-AI共生对组织弹性的实用价值。 摘要:Organizations increasingly operate in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), where early indicators of change often emerge as weak, fragmented signals. Although artificial intelligence (AI) is widely used to support managerial decision-making, most AI-based systems remain optimized for prediction and resolution, leading to premature interpretive closure under conditions of high ambiguity. This creates a gap in management science regarding how human-AI systems can responsibly manage ambiguity before it crystallizes into error or crisis. This study addresses this gap by presenting a proof of concept (PoC) of the LAIZA human-AI augmented symbiotic intelligence system and its patented process: Systems and Methods for Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. The mechanism operationalizes ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns (rogue variables), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable. Empirically, the article draws on a three-month case study conducted in 2025 within the AI development, involving prolonged ambiguity surrounding employee intentions and intellectual property boundaries. 
The findings show that preserving interpretive plurality enabled early scenario-based preparation, including proactive patent protection, allowing decisive and disruption-free action once ambiguity collapsed. The study contributes to management theory by reframing ambiguity as a first-class construct and demonstrates the practical value of human-AI symbiosis for organizational resilience in VUCA environments.
【39】Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment 标题:MRI自动运动伪影检查(AutoMAC-MRI):一个可解释的运动伪影检测和严重程度评估框架 链接:https://arxiv.org/abs/2512.15315
作者:Antony Jerald,Dattesh Shanbhag,Sudhanya Chatterjee 摘要:运动伪影降低MRI图像质量并增加患者召回。现有的自动化质量评估方法在很大程度上限于二元决策,并提供很少的可解释性。我们介绍AutoMAC-MRI,这是一个可解释的框架,用于对异构MR对比度和方向的运动伪影进行分级。该方法使用监督对比学习来学习运动严重性的判别表示。在这个特征空间内,我们计算特定于等级的亲和力分数,量化图像与每个运动等级的接近度,从而使等级分配透明且可解释。我们在超过5000个专家注释的脑MRI切片上评估AutoMAC-MRI,这些切片跨越多个对比和视图。评估亲和力分数对专家标签的实验表明,分数与专家判断一致,支持其作为运动严重性的可解释的措施。通过将精确的等级检测与每个等级的亲和力评分相结合,AutoMAC-MRI实现了在线MRI质量控制,有可能减少不必要的重新扫描并提高工作流程效率。 摘要:Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image’s proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
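The abstract of entry 【39】 describes grade-specific affinity scores as a measure of an image's proximity to each motion grade in the learned feature space, without giving the formula. The sketch below assumes one simple realisation: per-grade centroids, cosine similarity, and a temperature-scaled softmax. The toy embeddings, grade labels, temperature, and function names are all made up for illustration.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def grade_centroids(embeddings, grades):
    """Average the embeddings of each expert-assigned grade."""
    by_grade = {}
    for z, g in zip(embeddings, grades):
        by_grade.setdefault(g, []).append(z)
    return {g: mean_vector(zs) for g, zs in by_grade.items()}


def affinity_scores(z, centroids, temperature=0.1):
    """Softmax over cosine similarities to each grade centroid (assumed form)."""
    logits = {g: cosine(z, c) / temperature for g, c in centroids.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {g: math.exp(v - top) for g, v in logits.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}


# Toy 2-D embeddings standing in for contrastively learned features.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0], [0.1, 0.9]]
grades = ["none", "none", "mild", "severe", "severe"]
centroids = grade_centroids(embeddings, grades)
scores = affinity_scores([0.95, 0.05], centroids)
print(scores)
```

Reporting the full score dictionary rather than only the top grade is what makes a borderline slice visible: two grades with similar affinity signal an uncertain assignment.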
【40】Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies 标题:评估用于沸石合成事件提取(ZSEE)的LLM:提示策略的系统分析 链接:https://arxiv.org/abs/2512.15312
作者:Charan Prakash Rathore,Saumi Ray,Dhruv Kumar 备注:Under Review 摘要:从沸石合成实验过程中提取结构化信息对于材料发现至关重要,但现有方法尚未系统地评估用于该领域特定任务的大语言模型(LLM)。这项工作解决了一个基本问题:什么是不同的提示策略的有效性时,应用LLM科学信息提取?我们专注于四个关键子任务:事件类型分类(识别合成步骤),触发文本识别(定位事件提及),参数角色提取(识别参数类型),参数文本提取(提取参数值)。我们评估了四种提示策略- zero-shot,Few-Shot,特定事件和基于反射-在六个最先进的LLM(Gemma-3- 12 b-it,GPT-5-mini,O 4-mini,Claude-Haiku-3.5,DeepSeek推理和非推理)使用1,530个注释句子的ZSEE数据集。实验结果表明,该算法在事件类型分类上表现出很强的性能(80- 90% F1),但在细粒度提取任务上表现一般,尤其是在参数角色和参数文本提取上(50- 65% F1)。GPT-5-mini表现出极高的快速灵敏度,F1变异为11- 79%。值得注意的是,先进的提示策略提供了最小的改进超过zero-shot的方法,揭示了基本的架构限制。错误分析识别系统幻觉,过度概括,无法捕捉合成特定的细微差别。我们的研究结果表明,虽然LLM实现了高层次的理解,但精确提取实验参数需要领域适应模型,为科学信息提取提供定量基准。 摘要:Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. 
Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
【41】Graph Pattern-based Association Rules Evaluated Under No-repeated-anything Semantics in the Graph Transactional Setting 标题:图事务设置中在无任何重复(no-repeated-anything)语义下评估的基于图模式的关联规则 链接:https://arxiv.org/abs/2512.15308
作者:Basil Ell 摘要:我们介绍了基于图形模式的关联规则(GPAR)的有向标记的多重图,如RDF图。GPAR支持生成任务(其中图被扩展)和评估任务(其中图的可扩展性被评估)。该框架超越了相关的形式主义,如图功能依赖,图实体依赖,关系关联规则,图关联规则,多关系和路径关联规则,和霍恩规则。给定一个图的集合,我们在无重复语义下评估图模式,这允许更有效地考虑图的拓扑结构。我们定义了一个概率空间,并在概率环境中获得信心,提升,杠杆和信念。我们进一步分析了这些指标如何与经典的基于项集的同行,并确定其特征属性被保留的条件。 摘要:We introduce graph pattern-based association rules (GPARs) for directed labeled multigraphs such as RDF graphs. GPARs support both generative tasks, where a graph is extended, and evaluative tasks, where the plausibility of a graph is assessed. The framework goes beyond related formalisms such as graph functional dependencies, graph entity dependencies, relational association rules, graph association rules, multi-relation and path association rules, and Horn rules. Given a collection of graphs, we evaluate graph patterns under no-repeated-anything semantics, which allows the topology of a graph to be taken into account more effectively. We define a probability space and derive confidence, lift, leverage, and conviction in a probabilistic setting. We further analyze how these metrics relate to their classical itemset-based counterparts and identify conditions under which their characteristic properties are preserved.
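Entry 【41】 derives confidence, lift, leverage, and conviction for graph pattern rules. As a point of comparison, the classical itemset definitions can be computed as below. This sketch assumes the simplest transactional reading, in which each graph in the collection is one transaction and a pattern either matches it or not; the paper's own probability space and matching semantics may differ, and the function name and toy match sets are hypothetical.

```python
from fractions import Fraction


def rule_metrics(n_graphs, antecedent_ids, consequent_ids):
    """Classical metrics for a rule A -> C over a collection of graphs.

    antecedent_ids / consequent_ids are the ids of graphs in which each
    pattern matches. Assumes both patterns match at least one graph.
    """
    a, c = set(antecedent_ids), set(consequent_ids)
    p_a = Fraction(len(a), n_graphs)
    p_c = Fraction(len(c), n_graphs)
    p_ac = Fraction(len(a & c), n_graphs)
    confidence = p_ac / p_a
    return {
        "confidence": confidence,
        "lift": confidence / p_c,
        "leverage": p_ac - p_a * p_c,
        # Conviction is undefined (infinite) when the rule never fails.
        "conviction": None if confidence == 1 else (1 - p_c) / (1 - confidence),
    }


# 10 graphs: A matches 5 of them, C matches 6, both match 4.
metrics = rule_metrics(10, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 5, 6})
print(metrics)
```

Here lift above 1 and positive leverage both say that the consequent pattern appears more often in graphs matching the antecedent than independence would predict.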
【42】ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I 标题:ChatGPT和Gemini参加韩国大学学业能力测试–地球科学I 链接:https://arxiv.org/abs/2512.15298
作者:Seok-Hyun Ga,Chun-Yen Chang 备注:23 pages, 9 tables, 1 figure 摘要:生成式人工智能的快速发展正在为教育和评估带来创新性的变化。随着学生利用人工智能进行作业的普及率增加,人们对学术诚信和评估有效性的担忧也在增加。本研究利用2025年韩国大学学术能力测试(CSAT)的地球科学I部分,深入分析了最先进的大型语言模型(LLM)的多模态科学推理能力和认知局限性,包括GPT-4 o,Gemini 2.5 Flash和Gemini 2.5 Pro。三个实验条件(整页输入,单项输入,优化的多模态输入)被设计来评估模型在不同数据结构的性能。定量结果表明,非结构化的输入导致显着的性能下降,由于分割和光学字符识别(OCR)的失败。即使在优化的条件下,模型也表现出基本的推理缺陷。定性分析表明,“感知错误”占主导地位,突出了“感知认知差距”,模型未能解释示意图中的符号意义,尽管认识到视觉数据。此外,模型表现出“计算-概念化离散性”,成功地执行计算,但未能应用潜在的科学概念,以及“过程幻觉”,模型跳过视觉验证,支持看似合理但毫无根据的背景知识。为了应对课程中未经授权使用人工智能的挑战,这项研究为设计针对这些特定认知漏洞的“抗人工智能问题”提供了可操作的线索。通过利用人工智能的弱点,例如感知和认知之间的差距,教育工作者可以将真正的学生能力与人工智能生成的反应区分开来,从而确保评估的公平性。 摘要:The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. 
Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
【43】Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis 标题:用于高效有向控制器综合的图上下文强化学习 链接:https://arxiv.org/abs/2512.15295
作者:Toshihide Ubukata,Enhong Mu,Takuto Yamauchi,Mingyue Zhang,Jialong Li,Kenji Tei 摘要:控制器综合是一种自动生成满足特定性质的标号转换系统控制器的形式化方法。然而,综合过程的效率在很大程度上取决于勘探政策。这些策略通常依赖于通过强化学习(RL)学习的固定规则或策略,这些规则或策略只考虑有限的当前特征集。为了解决这一限制,本文引入了GCRL,这是一种通过集成图神经网络(GNNs)来增强基于RL的方法的方法。GCRL将LTS探索的历史编码到一个图结构中,允许它捕获更广泛的、不基于当前的上下文。在与最先进方法的比较实验中,GCRL在五个基准领域中的四个领域中表现出了卓越的学习效率和泛化能力,除了一个具有高度对称性和严格局部交互的特定领域。 摘要:Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization across four out of five benchmark domains, except one particular domain characterized by high symmetry and strictly local interactions.
【44】Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions 标题:用于网络安全的量子机器学习:分类学和未来方向 链接:https://arxiv.org/abs/2512.15286
作者:Siva Sai,Ishika Goyal,Shubham Sharma,Sri Harshita Manuri,Vinay Chamola,Rajkumar Buyya 备注:15 pages, 5 figures, Submitted to a journal 摘要:近年来,越来越多的网络威胁和快速发展的策略,以及大量的数据,导致经典的机器学习,规则和基于签名的防御策略失败,使它们无法跟上。另一种选择,量子机器学习(QML),最近出现了,利用基于量子力学的计算。它为某些问题提供了更好的高维结构编码和处理。本综述全面概述了与安全领域相关的QML技术,如量子神经网络(QNN),量子支持向量机(QSVM),变分量子电路(VQC)和量子生成对抗网络(QGAN),并讨论了本文对该领域现有研究的贡献以及如何改进它们。它还将这些方法映射到有监督、无监督和生成式学习范式中,并映射到核心网络安全任务中,包括入侵和异常检测、恶意软件和僵尸网络分类以及流量分析。它还讨论了它们在云计算安全领域的应用,其中QML可以增强安全和可扩展的操作。还讨论了QML在网络安全领域的许多局限性,以及解决这些局限性的方向。 摘要:The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.
【45】Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning 标题:良好的开端是成功的一半:面向LLM推理的前缀优化强化学习 链接:https://arxiv.org/abs/2512.15274
作者:Yiliu Sun,Zicheng Zhao,Yang Wei,Yanfang Zhang,Chen Gong 备注:Accepted by AAAI 2026 摘要:带有可验证奖励的强化学习(RLVR)显著增强了大型语言模型(LLM)的推理能力。当前的RLVR方法通常在所有生成的令牌上进行训练,但忽略了探索哪些令牌(例如,前缀标记)实际上有助于推理。这种统一的训练策略在优化低回报代币上花费了大量的精力,这反过来又阻碍了高回报代币的潜在改进,并降低了整体训练效率。为了解决这个问题,我们提出了一种新的RLVR方法,称为渐进前缀令牌策略优化(PPPO),它突出了生成的输出的前缀段的重要性。具体来说,灵感来自于成熟的人类思维理论的路径依赖,其中早期阶段的思想大大限制了随后的思维轨迹,我们确定了类似的现象,在LLM推理称为开始锁定效应(BLE)。PPPO通过将其优化目标集中在LLM的前缀推理过程中来利用这一发现。这种有针对性的优化策略可以积极影响后续的推理过程,并最终改善最终结果。为了提高LLM在如何以高质量开始推理方面的学习效率,PPPO引入了两种训练策略:(a)渐进前缀保留,通过在训练期间增加保留的前缀标记的比例来形成渐进的学习过程;(b)连续累积奖励,其通过对一个前缀令牌序列的多个连续进行采样来减轻奖励偏差,并累积他们的分数作为奖励信号。在各种推理任务上的大量实验结果表明,我们提出的PPPO优于代表性的RLVR方法,只有26.17%的训练令牌的准确率提高了18.02%。 摘要:Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. 
To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that the proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
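The two training strategies can be sketched in a few lines. Everything here is an illustrative assumption rather than the authors' implementation: the linear schedule, the default ratios, and the names `prefix_ratio`, `retained_prefix`, and `continuation_accumulated_reward` are hypothetical, and the sampler and scorer are passed in as plain callables.

```python
def prefix_ratio(step, total_steps, start=0.1, end=0.5):
    """Retained-prefix ratio at a training step (assumed linear growth)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def retained_prefix(tokens, ratio):
    """Keep the first `ratio` share of the tokens, and at least one token."""
    keep = max(1, int(len(tokens) * ratio))
    return tokens[:keep]


def continuation_accumulated_reward(prefix, sample_continuation, score, k=4):
    """Sample k continuations for one prefix and sum their scores."""
    total = 0.0
    for i in range(k):
        completion = prefix + sample_continuation(prefix, i)
        total += score(completion)
    return total


if __name__ == "__main__":
    tokens = list(range(10))
    prefix = retained_prefix(tokens, prefix_ratio(step=5, total_steps=11))
    # Toy sampler and scorer: the continuation is [i], even last tokens score 1.
    reward = continuation_accumulated_reward(
        prefix,
        sample_continuation=lambda p, i: [i],
        score=lambda seq: 1.0 if seq[-1] % 2 == 0 else 0.0,
    )
    print(prefix, reward)
```

Summing over several continuations means a prefix is judged by how often it leads to good outcomes, not by a single sampled completion.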
【46】VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments 标题:VLA-AN:复杂环境中空中导航的高效机载视觉-语言-动作框架 链接:https://arxiv.org/abs/2512.15258
作者:Yuze Wu,Mo Zhu,Xingxing Li,Yuheng Du,Yuxin Fan,Wenjun Li,Xin Zhou,Fei Gao 摘要:本文提出了VLA-AN,一个高效的机载视觉语言行动(VLA)框架,致力于在复杂环境中的自主无人机导航。VLA-AN解决了现有大型航空导航模型的四个主要局限性:数据域差距,推理时间导航不足,生成行动策略的安全问题,以及机载部署限制。首先,我们利用3D高斯溅射(3D-GS)构建一个高保真数据集,以有效地弥合域差距。其次,我们引入了一个渐进的三阶段训练框架,依次加强场景理解,核心飞行技能和复杂的导航能力。第三,我们设计了一个轻量级的,实时的动作模块加上几何安全校正。该模块确保快速、无冲突和稳定的命令生成,减轻随机生成策略中固有的安全风险。最后,通过对机载部署管道的深度优化,VLA-AN在资源受限的无人机上实现了强大的实时推理吞吐量的8.3倍改进。大量的实验表明,VLA-AN显著改善了空间接地,场景推理和长视野导航,实现了98.1%的最高单任务成功率,并为实现轻量级空中机器人的全链闭环自主提供了高效,实用的解决方案。 摘要:This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time 8.3x improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
【47】Leveraging Foundational Models and Simple Fusion for Multi-modal Physiological Signal Analysis 标题:利用基础模型和简单融合进行多模态生理信号分析 链接:https://arxiv.org/abs/2512.15250
作者:Youssef Ghallab,Omar Iraqy,Mohamed Kandil,Mohamed Ashraf,Saadeldine Eletter,Morougue Ghazal,Ayman Khalafallah,Nagwa El-Makky 备注:Published at NeurIPS 2025 Workshop on Foundation Models for the Brain and Body 摘要:诸如心电图(ECG)和脑电图(EEG)之类的生理信号提供了对人类健康和认知的补充见解,但是由于有限的多模态标记数据和模态特定差异,多模态集成是具有挑战性的。在这项工作中,我们调整了CBraMod编码器以进行大规模自监督心电图预训练,引入了双掩蔽策略来捕获导联内和导联间的依赖性。为了克服上述挑战,我们使用了一个用于EEG的预训练CBraMod编码器,并预训练了一个对称ECG编码器,为每个模态配备了丰富的基础表示。然后,这些表示通过简单的嵌入级联进行融合,允许分类头学习跨模态交互,尽管多模态监督有限,但仍然能够进行有效的下游学习。在情感识别上进行评估,我们的方法实现了接近最先进的性能,表明精心设计的生理编码器,即使是简单的融合,也大大提高了下游性能。这些结果突出了基础模型方法利用生理信号的整体性质的潜力,为医疗保健和情感计算提供可扩展,标签高效和可推广的解决方案。 摘要:Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data, and modality-specific differences . In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.
【48】Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification 标题:医学图像疾病分类视觉语言模型中的交叉公平性 链接:https://arxiv.org/abs/2512.15249
作者:Yupeng Zhang,Adam G. Dunn,Usman Naseem,Jinman Kim 摘要:医学人工智能(AI)系统,特别是多模态视觉语言模型(VLM),往往表现出交叉偏见,其中模型在诊断边缘化患者亚组方面系统性地不太自信。由于人口统计学上的偏斜数据和诊断确定性的不同分布,这种偏倚可能导致更高的不准确和漏诊率。目前的公平性干预措施往往无法解决这些差距或损害整体诊断性能,以实现统计平等的亚组。在这项研究中,我们开发了跨模态对齐一致性(CMAC-MMD),这是一个标准化跨交叉患者亚组诊断确定性的训练框架。与传统的去偏方法不同,这种方法在临床推断过程中不需要敏感的人口统计数据就可以平衡模型的决策置信度。我们使用10,015张皮肤病变图像(HAM 10, 000)和12,000张图像(BCN 20000)的外部验证以及10,000张用于青光眼检测的眼底图像(Harvard-FairVLMed)评估了这种方法,并通过交叉年龄,性别和种族属性对性能进行了分层。在皮肤病学队列中,与标准训练相比,所提出的方法将总体交叉漏诊差距(真阳性率差异,$Δ$TPR)从0.50降低到0.26,同时将总体曲线下面积(AUC)从0.94提高到0.97。类似地,对于青光眼筛查,该方法将$Δ$TPR从0.41降低至0.31,实现了更好的0.72 AUC(与0.71基线相比)。这建立了一个可扩展的框架,用于开发高风险的临床决策支持系统,这些系统既准确,又可以在不同的患者亚组中公平地执行,确保可靠的性能,而不会增加隐私风险。 摘要:Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model’s decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. 
In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, ΔTPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced ΔTPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
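One way to read the ΔTPR figures is as the spread of true positive rates across intersectional subgroups. The sketch below assumes the gap is the maximum minus the minimum subgroup TPR; the abstract only says "difference in True Positive Rate", so that exact definition, the function name `tpr_gap`, and the record layout are assumptions for illustration.

```python
from collections import defaultdict


def tpr_gap(records):
    """Per-subgroup true positive rate and the max-min gap between subgroups.

    `records` is an iterable of (subgroup, y_true, y_pred) with binary labels.
    Subgroups without any positive case are skipped.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            true_pos[group] += int(y_pred == 1)
    if not positives:
        raise ValueError("no positive cases in any subgroup")
    tpr = {g: true_pos[g] / positives[g] for g in positives}
    return tpr, max(tpr.values()) - min(tpr.values())


if __name__ == "__main__":
    # Hypothetical subgroup labels combining age band and gender.
    data = (
        [("older-female", 1, 1)] * 3 + [("older-female", 1, 0)]
        + [("younger-male", 1, 1)] + [("younger-male", 1, 0)] * 3
        + [("younger-male", 0, 0)] * 2
    )
    rates, gap = tpr_gap(data)
    print(rates, gap)
```

A smaller gap means missed diagnoses are spread more evenly across subgroups, which is the direction of the reported improvement.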
【49】CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications 标题:CangLing-KnowFlow:一种面向综合遥感应用的统一知识和流融合代理 链接:https://arxiv.org/abs/2512.15231
作者:Zhengchao Chen,Haoran Wang,Jing Yao,Pedram Ghamisi,Jun Zhou,Peter M. Atkinson,Bing Zhang 摘要:海量遥感数据的自动化和智能化处理是对地观测的关键。现有的自动化系统通常是针对特定任务的,缺乏统一的框架来管理不同的端到端工作流程-从数据预处理到高级解释-跨不同的RS应用程序。为了解决这一差距,本文介绍了仓灵KnowFlow,一个统一的智能代理框架,集成了一个过程知识库(PKB),动态工作流调整,和进化记忆模块。PKB包括162个实际RS任务中的1,008个专家验证的工作流程案例,可指导规划并大大减少通用代理中常见的幻觉。在运行时故障期间,动态工作流调整自主诊断和重新规划恢复策略,而进化记忆模块不断从这些事件中学习,迭代地增强代理的知识和性能。这种协同作用使CangLing-KnowFlow能够在各种复杂的任务中适应、学习和可靠地运行。我们在KnowFlow-Bench上评估了CangLing-KnowFlow,KnowFlow-Bench是一个受现实世界应用启发的324个工作流的新基准,在13个顶级大型语言模型(LLM)主干上测试其性能,从开源到商业。在所有复杂的任务中,CangLing-KnowFlow在任务成功率方面超过了Reflexion基线至少4%。作为这一新兴领域的第一个最全面的验证,这项研究证明了CangLing-KnowFlow作为一个强大的,高效的,可扩展的自动化解决方案的巨大潜力,通过利用专家知识(知识)到自适应和可验证的程序(流)复杂的EO挑战。 摘要:The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows–from data preprocessing to advanced interpretation–across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent’s knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. 
We evaluated CangLing-KnowFlow on KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first and most comprehensive validation in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by turning expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).
【50】Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024 标题:Yes-MT提交WMT 2024低资源印度语翻译共享任务 链接:https://arxiv.org/abs/2512.15226
作者:Yash Bhaskar,Parameswari Krishnamurthy 备注:Accepted at WMT 2024 摘要:本文介绍了Yes-MT团队在WMT 2024上为低资源印度语翻译共享任务提交的系统(Pakray等人,2024年),专注于英语和阿萨姆语,米佐语,卡西语和曼尼普尔语之间的翻译。实验探索了各种方法,包括微调预训练模型,如mT5(Xue et al.,2020)和IndicBart(Dabre等人,2021)在多语言和单语言环境中,LoRA(Hu等人,2021)微调IndicTrans 2(Gala等人,2023)、zero-shot和Few-Shot提示(Brown,2020)以及大型语言模型(LLM),如Llama 3(Dubey等人,2024)和Mixtral 8x 7 b(Jiang等人,2024),LoRA监督Llama 3的微调(Mecklenburg等人,2024),并从头开始训练Transformer模型(Vaswani,2017)。使用SacreBLEU(Post,2018)和CHRF(Popovic,2015)对WMT 23低资源印度语翻译共享任务测试数据进行了评估,突出了低资源翻译的挑战和LLM对这些任务的潜力,特别是微调。 摘要:This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
【51】RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA 标题:RFKG-CoT:面向知识感知QA的关系驱动自适应跳数选择和Few-Shot路径指导 链接:https://arxiv.org/abs/2512.15219
作者:Chao Zhang,Minghan Li,Tianrui Lv,Guodong Zhou 备注:9pages, 5 figures, accepted by AAAI 2026 摘要:由于参数化知识的限制,大型语言模型(LLM)经常在知识密集型QA中产生幻觉。虽然KG-CoT等现有方法通过整合知识图(KG)路径来提高可靠性,但它们存在严格的跳数选择(仅由问题驱动)和推理路径利用不足(缺乏指导)的问题。为了解决这个问题,我们提出了RFKG-CoT:首先,它用关系驱动的自适应跳数选择器代替了刚性的跳数选择器,该选择器通过激活KG关系(例如,1-用于直接“兄弟”关系的跳跃,用于间接“父子”链的2-跳跃),经由关系掩码形式化。其次,它引入了一个Few-Shot的上下文学习路径指导机制与CoT(思考),构建一个“问题-路径-答案”格式的例子,以提高LLM的理解推理路径的能力。在四个KGQA基准测试上的实验表明,RFKG-CoT比KG-CoT提高了高达14.7 pp的准确性(WebQSP上的Llama 2 - 7 B)。消融确认跳数选择器和路径提示是互补的,共同将KG证据转化为更忠实的答案。 摘要:Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
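The "question-paths-answer" few-shot format can be made concrete with a small prompt builder. This is a sketch under assumptions: the field labels, the arrow notation for KG paths, and the names `format_path`, `render_block`, and `build_prompt` are hypothetical and not taken from the paper.

```python
def format_path(path):
    """Render a KG path given as (head, relation, tail) triples."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _head, relation, tail in path:
        parts.append("-[" + relation + "]-> " + tail)
    return " ".join(parts)


def render_block(question, paths, answer=None):
    """One question-paths-answer block; the answer is left open if None."""
    lines = [
        "Question: " + question,
        "Paths: " + "; ".join(format_path(p) for p in paths),
        "Answer:" if answer is None else "Answer: " + answer,
    ]
    return "\n".join(lines)


def build_prompt(examples, question, paths):
    """Few-shot prompt: solved examples first, then the open query."""
    blocks = [render_block(e["question"], e["paths"], e["answer"]) for e in examples]
    blocks.append(render_block(question, paths))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical toy entities and relations.
    examples = [{
        "question": "Who is Alice's uncle?",
        "paths": [[("Alice", "father", "Bob"), ("Bob", "brother", "Carl")]],
        "answer": "Carl",
    }]
    query_paths = [[("Dana", "mother", "Eve"), ("Eve", "sister", "Fay")]]
    print(build_prompt(examples, "Who is Dana's aunt?", query_paths))
```

Showing the model solved examples in the same layout as the query is what lets it learn to read the retrieved paths before answering.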
【52】A Clustering-Based Variable Ordering Framework for Relaxed Decision Diagrams for Maximum Weighted Independent Set Problem 标题:最大加权独立集问题松弛决策图的基于聚类的变量排序框架 链接:https://arxiv.org/abs/2512.15198
作者:Mohsen Nafar,Michael Römer,Lin Xie 摘要:离散优化问题的精确算法在很大程度上依赖于强原始界和强对偶界。松弛决策图(DD)提供了一种通用的机制,通过节点合并来计算解空间的过逼近,从而推导出这样的对偶边界。然而,这些放松图的质量,即所得到的对偶边界的紧密性,关键取决于编译期间执行的变量排序和合并决策。虽然动态变量排序算法有效地收紧了边界,但当在整个变量集上进行全局计算时,它们通常会导致计算开销。为了减轻这种权衡,这项工作引入了一种新的基于聚类的变量排序框架。我们首先将变量划分为簇,而不是将动态排序算法应用于整个不固定变量集。然后,我们利用这种结构分解来指导排序过程,大大减少了启发式的搜索空间。在这个框架内,我们研究了两种不同的策略:聚类,它使用特定问题的聚合标准(如最大加权独立集问题(MWISP)中的累积顶点权重)顺序处理集群,和挑选和排序,迭代地选择和排序每个集群的代表变量,以平衡局部多样性与启发式指导。后来,开发一些理论结果的增长的DDs的大小为MWISP,我们提出了两种不同的政策,在拟议的框架内设置集群的数量。我们将这些策略嵌入到基于DD的分支定界算法中,并在MWISP上对其进行评估。在基准实例中,所提出的方法一致地降低了计算成本相比,标准的动态变量排序基线。 摘要:Efficient exact algorithms for Discrete Optimization (DO) rely heavily on strong primal and dual bounds. Relaxed Decision Diagrams (DDs) provide a versatile mechanism for deriving such dual bounds by compactly over-approximating the solution space through node merging. However, the quality of these relaxed diagrams, i.e. the tightness of the resulting dual bounds, depends critically on the variable ordering and the merging decisions executed during compilation. While dynamic variable ordering heuristics effectively tighten bounds, they often incur computational overhead when evaluated globally across the entire variable set. To mitigate this trade-off, this work introduces a novel clustering-based framework for variable ordering. Instead of applying dynamic ordering heuristics to the full set of unfixed variables, we first partition variables into clusters. We then leverage this structural decomposition to guide the ordering process, significantly reducing the heuristic’s search space. Within this framework, we investigate two distinct strategies: Cluster-to-Cluster, which processes clusters sequentially using problem-specific aggregate criteria (such as cumulative vertex weights in the Maximum Weighted Independent Set Problem (MWISP)), and Pick-and-Sort, which iteratively selects and sorts representative variables from each cluster to balance local diversity with heuristic guidance. 
Building on theoretical results on the growth of DD size for the MWISP, we then propose two policies for setting the number of clusters within the proposed framework. We embed these strategies into a DD-based branch-and-bound algorithm and evaluate them on the MWISP. Across benchmark instances, the proposed methodology consistently reduces computational costs compared with a standard dynamic variable ordering baseline.
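The Cluster-to-Cluster idea of processing clusters one after another by an aggregate criterion can be sketched as follows. This is a simplified static stand-in, not the paper's method: it ranks clusters by cumulative vertex weight in descending order and orders vertices inside each cluster by weight, whereas the paper applies dynamic ordering heuristics during DD compilation. The function name, the descending direction, and the within-cluster rule are all assumptions.

```python
def cluster_to_cluster_order(clusters, weight):
    """Variable order that visits clusters by cumulative vertex weight.

    `clusters` is a list of vertex lists and `weight` maps vertex -> weight.
    Both sort directions are illustrative assumptions.
    """
    ranked = sorted(
        clusters,
        key=lambda cluster: sum(weight[v] for v in cluster),
        reverse=True,
    )
    order = []
    for cluster in ranked:
        # Stand-in for the ordering heuristic applied inside one cluster.
        order.extend(sorted(cluster, key=lambda v: weight[v], reverse=True))
    return order


if __name__ == "__main__":
    weights = {"a": 5, "b": 1, "c": 3, "d": 4}
    print(cluster_to_cluster_order([["a", "b"], ["c", "d"]], weights))
```

Restricting each ordering decision to one cluster is what shrinks the heuristic's search space compared with scanning all unfixed variables at every step.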
【53】Governing rapid technological change: Policy Delphi on the future of European AI governance 标题:治理快速技术变革:关于欧洲人工智能治理未来的政策德尔菲(Policy Delphi)研究 链接:https://arxiv.org/abs/2512.15196
作者:Atte Ojanen,Johannes Anttila,Thilo H. K. Thelitz,Anna Bjork 备注:29 pages 摘要:人工智能(AI)的快速发展为寻求管理该技术的政策制定者带来了独特的挑战。在这种情况下,德尔菲法已成为确定未来研究和展望领域专家对新出现的技术问题的共识和分歧的一种既定方法。本文的目的有两个:首先,它研究了专家在欧洲人工智能治理发展中看到的关键紧张局势,其次,它反映了德尔菲法基于这些见解为人工智能等新兴技术提供预期治理的能力。该分析基于2024年年中与欧洲政策制定者,研究人员和非政府组织进行的关于人工智能治理未来的两轮政策德尔菲研究的结果。德尔菲政策在揭示欧洲人工智能治理的不同观点方面很有用,得出了一个共识,即面向未来的人工智能监管可能更多地取决于立法的实际实施和执行,而不是其技术细节或范围。此外,该研究发现了人工智能治理中的可行性-概率差距:理想的政策方向,如更大的公民参与,被认为不太可能和可行。这突出表明了理想的监管监督与监管跟上技术变革的实际困难之间的紧张关系。 摘要:The rapid advancements in artificial intelligence (AI) present unique challenges for policymakers that seek to govern the technology. In this context, the Delphi method has become an established way to identify consensus and disagreement on emerging technological issues among experts in the field of futures studies and foresight. The aim of this article is twofold: first, it examines key tensions experts see in the development of AI governance in Europe, and second, it reflects on the Delphi method’s capacity to inform anticipatory governance of emerging technologies like AI based on these insights. The analysis is based on the results of a two-round Policy Delphi study on the future of AI governance with European policymakers, researchers and NGOs, conducted in mid-2024. The Policy Delphi proved useful in revealing diverse perspectives on European AI governance, drawing out a consensus that future-proof AI regulation will likely depend more on practical implementation and enforcement of legislation than on its technical specifics or scope. Furthermore, the study identified a desirability-probability gap in AI governance: desirable policy directions, like greater citizen participation, were perceived as less probable and feasible. This highlights a tension between desirable regulatory oversight and the practical difficulty for regulation to keep up with technological change.
【54】DEER: Draft with Diffusion, Verify with Autoregressive Models 标题:DEER:用扩散起草,用自回归模型验证 链接:https://arxiv.org/abs/2512.15176
作者:Zicong Cheng,Guo-Wei Yang,Jia Li,Zhijie Deng,Meng-Hao Guo,Shi-Min Hu 备注:Homepage : https://czc726.github.io/DEER/ 摘要:效率作为LLM驱动的代理和推理系统的关键实际挑战,越来越受到自回归(AR)解码的固有延迟的限制。推测性解码通过草稿验证方案减轻了该成本,然而现有方法依赖于AR草稿模型(即起草器),这引入了两个基本问题:(1)逐步的不确定性累积导致目标模型和起草器之间的信任逐步崩溃,以及(2)AR起草器本质上只能顺序解码。这些因素共同导致加速有限。在本文中,我们表明,扩散大语言模型(dLLM)起草器可以通过其根本不同的概率建模和高效的并行解码策略自然地克服这些问题。基于这一认识,我们引入了DEER,一个高效的推测解码框架,它使用扩散起草并使用AR模型进行验证。为了实现高质量的草稿,DEER采用两阶段训练管道将基于dLLM的起草器与目标AR模型对齐,并进一步采用单步解码来生成长草稿片段。实验表明,DEER的草稿接受长度高达32个令牌,远远超过EAGLE-3的10个令牌。此外,在HumanEval上使用Qwen3-30B-A3B,DEER获得了5.54倍的加速比,而EAGLE-3仅获得了2.41倍。代码、模型、演示等将在 https://czc726.github.io/DEER/ 上提供。 摘要:Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) AR drafters decode inherently sequentially. Together, these factors cause limited speedups. In this paper, we show that diffusion large language model (dLLM) drafters can naturally overcome these issues through their fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3.
Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc. will be available at https://czc726.github.io/DEER/
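The draft-verify scheme behind the acceptance-length numbers can be sketched with greedy verification. This is a generic toy version of speculative decoding, not DEER itself: `draft_block` stands in for any drafter that proposes a whole block at once (a diffusion drafter in the paper), `target_next` stands in for the target AR model's greedy next token, and verification is simulated token by token although real systems check the block in one parallel pass. All names are hypothetical.

```python
def speculative_step(context, draft_block, target_next, block_len=8):
    """One draft-verify step with greedy acceptance.

    Returns (new_tokens, accepted), where `accepted` counts the draft tokens
    the target agreed with. On the first disagreement the target's token is
    emitted instead; if the whole block is accepted, one extra target token
    is appended.
    """
    draft = draft_block(context, block_len)
    new_tokens = []
    for token in draft:
        expected = target_next(context + new_tokens)
        if token != expected:
            new_tokens.append(expected)  # replace the rejected draft token
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(token)
    new_tokens.append(target_next(context + new_tokens))  # bonus token
    return new_tokens, len(draft)


if __name__ == "__main__":
    # Toy target: the next token is the current length modulo 5.
    def target(ctx):
        return len(ctx) % 5

    # Toy drafter that gets the third token of its block wrong.
    def drafter(ctx, n):
        return [2, 3, 9, 0][:n]

    print(speculative_step([0, 1], drafter, target, block_len=4))
```

Every step emits at least one token the target would have produced itself, so the output matches plain greedy decoding while longer accepted runs translate into fewer target passes.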
【55】MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers 标题:MCP-SafetyBench:使用现实世界的MCP服务器对大型语言模型进行安全评估的基准 链接:https://arxiv.org/abs/2512.15163
作者:Xuanjun Zong,Zhiqi Shen,Lei Wang,Yunshi Lan,Chao Yang 备注:Our benchmark is available at https://github.com/xjzzzzzzzz/MCPSafety 摘要:大型语言模型(LLM)正在演变为推理、计划和操作外部工具的代理系统。模型上下文协议(MCP)是这种转变的关键推动者,它提供了一个标准化的接口,用于连接LLM与异构工具和服务。然而,MCP的开放性和多服务器工作流程引入了现有基准无法捕获的新安全风险,因为它们专注于孤立的攻击或缺乏真实世界的覆盖。我们提出了MCP-SafetyBench,这是一个建立在真实MCP服务器上的综合基准测试,支持跨五个领域的真实多轮评估:浏览器自动化,财务分析,位置导航,存储库管理和Web搜索。它整合了跨服务器、主机和用户端的20种MCP攻击类型的统一分类,并包括需要多步推理和不确定性下跨服务器协调的任务。使用MCP-SafetyBench,我们系统地评估了领先的开源和闭源LLM,揭示了安全性能的巨大差异,并随着任务范围和服务器交互的增长而升级漏洞。我们的研究结果强调了加强防御的迫切需要,并将MCP-SafetyBench作为诊断和减轻实际MCP部署中安全风险的基础。 摘要:Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
【56】Offline Multi-Task Multi-Objective Data-Driven Evolutionary Algorithm with Language Surrogate Model and Implicit Q-Learning 标题:具有语言代理模型和隐式Q学习的离线多任务多目标数据驱动进化算法 链接:https://arxiv.org/abs/2512.15149
作者:Xian-Rong Zhang,Yue-Jiao Gong,Zeyuan Ma,Jun Zhang 备注:16 pages 摘要:数据驱动的进化算法通过强大的代理建模在解决昂贵的优化问题方面显示出令人惊讶的结果。虽然有前途,现有的代理建模方案在具有许多子目标的复杂优化问题中可能会遇到局限性,因为这类问题依赖于重复和繁琐的近似。为了解决这一技术差距,我们提出了Q-MetaSur作为一个即插即用的代理建模方案,能够提供统一和广义的代理学习。具体来说,我们考虑离线设置下的多任务多目标优化(MTMOO)。提出了几个关键的设计:1)我们将目标近似转化为序列到序列的建模,其中MTMOO问题可以用文本令牌化来表示。为了在这种自回归建模下操作,我们引入了一个基于大型语言模型的代理模型,该模型首先编码MTMOO实例,然后解码未见过的决策变量的目标值。为了确保训练所提出的模型的稳定性,我们提出了一种两阶段离线训练策略,该策略作为监督调整和RL微调的协同作用,首先利用离线数据集来适应现有知识,然后利用RL来增强模型的泛化性能。在CEC 2019基准测试上的大量实证结果表明,Q-MetaSur不仅在目标近似精度方面优于代表性的代理基线,而且还有助于底层进化算法实现所需的优化收敛和改进的帕累托最优性。 摘要:Data-driven evolutionary algorithms have shown surprising results in addressing expensive optimization problems through robust surrogate modeling. Though promising, existing surrogate modeling schemes may encounter limitations in complex optimization problems with many sub-objectives, which rely on repeated and tedious approximation. To address this technical gap, we propose Q-MetaSur as a plug-and-play surrogate modeling scheme capable of providing unified and generalized surrogate learning. Specifically, we consider multi-task multi-objective optimization (MTMOO) in the offline setting. Several key designs are proposed: 1) we transform objective approximation into sequence-to-sequence modeling where an MTMOO problem can be represented by textual tokenization. To operate under such auto-regressive modeling, we introduce a Large Language Model-based surrogate model that first encodes an MTMOO instance and then decodes objective values of unseen decision variables. To ensure stability in training the proposed model, we propose a two-stage offline training strategy that operates as a synergy of supervised tuning and RL fine-tuning, which first exploits the offline dataset to fit existing knowledge and then leverages RL to enhance the model's generalization performance.
Extensive empirical results on the CEC2019 benchmark demonstrate that Q-MetaSur not only outperforms representative surrogate baselines in objective approximation accuracy, but also helps underlying evolutionary algorithms achieve both desired optimization convergence and improved pareto optimality.
【57】From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? Link: https://arxiv.org/abs/2512.15134
Authors: Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger Abstract: A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods, including sparse autoencoders (SAEs) and sparse probes, recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
【58】HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens Link: https://arxiv.org/abs/2512.15133
Authors: Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan Abstract: Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
【59】Automatic Reward Shaping from Multi-Objective Human Heuristics Link: https://arxiv.org/abs/2512.15120
Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang Abstract: Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
【60】"I am here for you": How relational conversational AI appeals to adolescents, especially those who are socially and emotionally vulnerable Link: https://arxiv.org/abs/2512.15117
Authors: Pilyoung Kim, Yun Xie, Sujin Yang Abstract: General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational style to the transparent style, whereas parents were more likely than adolescents to prefer the transparent style. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred the relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring the transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
【61】FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation Link: https://arxiv.org/abs/2512.15116
Authors: Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang Comments: This work has been submitted to the IEEE for possible publication. 15 pages, 8 figures Abstract: Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF
【62】How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models Link: https://arxiv.org/abs/2512.15115
Authors: Ali Ghodsi Abstract: Sequence modeling has produced diverse architectures, from classical recurrent neural networks to modern Transformers and state space models (SSMs), yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
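Note on 【62】: the idea of an interaction operator induced by a latent dynamical system can be illustrated in the simplest case. The following is a minimal sketch, assuming a scalar, time-invariant linear SSM (the paper's operator $W_{ij}(X)$ is input-dependent and more general). Unrolling the recurrence gives a lower-triangular operator with entries c·a^(i-j)·b, so the sensitivity of output i to input j shrinks geometrically with the distance i-j when |a| < 1. All function names and numbers are illustrative and are not taken from the paper.

```python
def ssm_scan(a, b, c, x):
    """Run the scalar recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t (h starts at 0)."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def interaction_operator(a, b, c, n):
    """Lower-triangular matrix W with W[i][j] = c * a**(i-j) * b for j <= i, else 0."""
    return [[c * a ** (i - j) * b if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def apply_operator(W, x):
    """Plain matrix-vector product y = W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


# Illustrative values only; |a| < 1 makes the dynamics stable.
a, b, c = 0.5, 1.0, 2.0
x = [1.0, -2.0, 3.0, 0.5]

W = interaction_operator(a, b, c, len(x))
y_scan = ssm_scan(a, b, c, x)   # output of the recurrence
y_op = apply_operator(W, x)     # same output via the explicit operator

# For this linear map, dy_i/dx_j equals W[i][j], so the first column of W
# shows how the influence of x_0 decays with distance.
decay = [row[0] for row in W]
```

Here `y_scan` and `y_op` agree, and `decay` halves at each step, which is the distance-dependent attenuation the abstract attributes to stable linear dynamics.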
【63】Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption Link: https://arxiv.org/abs/2512.15112
Authors: Sunwoo Kim, Soo Yong Lee, Kyungho Kim, Hyunjin Hwang, Jaemin Yoo, Kijung Shin Comments: Published in AAAI 2026 Abstract: Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal, especially in non-homophilic graphs, since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.
【64】Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models Link: https://arxiv.org/abs/2512.15089
Authors: Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang, Yaofo Chen, Bin Xiao, Yuanqing Li, Mingkui Tan Comments: under review Abstract: Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulty. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art test-time scaling methods, achieving at least a 13% relative improvement in average exact match on in-domain tasks and an 8% relative gain on out-of-domain tasks.
【65】Quantifying Return on Security Controls in LLM Systems Link: https://arxiv.org/abs/2512.15081
Authors: Richard Helder Moulton, Austin O’Brien, John D. Hastings Comments: 13 pages, 9 figures, 3 tables Abstract: Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace’s Rule of Succession and combined with triangular loss distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations are then compared to a baseline RAG configuration: attribute-based access control (ABAC), named entity recognition (NER) redaction using Microsoft Presidio, and NeMo Guardrails. The baseline system exhibits very high attack success rates (>= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of $313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by ~94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).
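Note on 【65】: the estimation pipeline named in the abstract (Laplace's Rule of Succession, a triangular loss distribution, Monte Carlo simulation) can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the authors' implementation: the probe counts and loss bounds below are made-up placeholders, each simulated run is assumed to contain one attack attempt, and the paper's RoC formula is not reproduced because the abstract does not define it.

```python
import random


def laplace_success_probability(successes, trials):
    """Laplace's Rule of Succession: (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)


def simulate_expected_loss(p_success, low, mode, high, runs=10_000, seed=0):
    """Monte Carlo estimate of the expected loss per attack attempt.

    Each run draws whether the attack succeeds (probability p_success);
    a successful attack incurs a loss drawn from a triangular distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        if rng.random() < p_success:
            # random.triangular takes (low, high, mode) in that order.
            total += rng.triangular(low, high, mode)
    return total / runs


# Hypothetical probe outcomes: 49 successful attacks out of 50 attempts
# without a control, 0 out of 50 with a control in place.
p_baseline = laplace_success_probability(49, 50)
p_controlled = laplace_success_probability(0, 50)

# Hypothetical loss bounds in dollars (placeholders, not breach-cost data).
loss_baseline = simulate_expected_loss(p_baseline, 10_000, 50_000, 200_000)
loss_controlled = simulate_expected_loss(p_controlled, 10_000, 50_000, 200_000)
```

The rule of succession keeps the estimated probability strictly between 0 and 1 even when every probe succeeds or fails, which is why a control with zero observed successes still carries a small residual expected loss.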
【66】PMMD: A pose-guided multi-view multi-modal diffusion for person generation Link: https://arxiv.org/abs/2512.15069
Authors: Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng Abstract: Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
【67】The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems Link: https://arxiv.org/abs/2512.15068
Authors: Debu Sinha Comments: 12 pages, 2 figures, 6 tables Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods, including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models, exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
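Note on 【67】: the trade-off between coverage and false positives can be illustrated with generic split conformal calibration. This is a minimal sketch, assuming a single similarity score per response and a calibration set of known hallucinations; it is not the paper's pipeline, and every score below is invented for illustration. The threshold is the ceil((n+1)(1-alpha))-th smallest calibration score, which guarantees that a new exchangeable hallucination is flagged with probability at least 1-alpha.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile over the calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or +inf when k exceeds n (too few calibration points for this alpha).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def false_positive_rate(faithful_scores, threshold):
    """Fraction of faithful responses flagged (similarity <= threshold)."""
    flagged = sum(s <= threshold for s in faithful_scores)
    return flagged / len(faithful_scores)


# Made-up similarities between a response and its source documents.
# Hallucinations that stay semantically close to the source score high.
hallucinated_sims = [0.62, 0.71, 0.75, 0.78, 0.80, 0.83, 0.85, 0.88, 0.90, 0.93]
faithful_sims = [0.70, 0.79, 0.84, 0.86, 0.89, 0.91, 0.92, 0.94, 0.95, 0.97]

alpha = 0.25  # target: flag at least 75% of hallucinations
threshold = conformal_threshold(hallucinated_sims, alpha)
fpr = false_positive_rate(faithful_sims, threshold)
```

Because the invented hallucination scores overlap heavily with the faithful ones, the threshold needed for coverage also flags a large share of faithful responses, which mirrors the failure mode the abstract calls the semantic illusion.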
【68】EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks Link: https://arxiv.org/abs/2512.15067
Authors: Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio Comments: Submission for possible publication Abstract: The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
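Note on 【68】: CRPS, the headline metric here, can be estimated directly from forecast samples using the identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast distribution. The following is a minimal sketch of that standard estimator; the function names and sample values are illustrative, and the relative-improvement helper is one common way to express a percentage gain, which may differ from how the paper computes its figures.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate: mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    m = len(samples)
    spread_to_obs = sum(abs(x - observation) for x in samples) / m
    pairwise_spread = sum(abs(xi - xj) for xi in samples for xj in samples) / (m * m)
    return spread_to_obs - 0.5 * pairwise_spread


def relative_improvement(baseline_score, model_score):
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline_score - model_score) / baseline_score


# Illustrative forecasts for a single observed value of 2.0.
point_forecast = [3.0]                      # a single sample: CRPS reduces to absolute error
ensemble_forecast = [1.0, 1.5, 2.0, 2.5, 3.0]

crps_point = crps_from_samples(point_forecast, 2.0)
crps_ensemble = crps_from_samples(ensemble_forecast, 2.0)
gain = relative_improvement(crps_point, crps_ensemble)
```

With a single sample the estimator equals the absolute error, so CRPS lets probabilistic and point forecasters be compared on the same scale, with lower values being better.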
【69】Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank Link: https://arxiv.org/abs/2512.15066
Authors: Chenxiao Zhang, Runshi Zhang, Junchen Wang Comments: Chenxiao Zhang and Runshi Zhang contributed equally to this work. 14 pages, 11 figures Abstract: Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category and detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests on four ultrasound video datasets (two thyroid nodule datasets, one thyroid gland dataset, and one heart dataset), our method demonstrates marked improvements in segmentation metrics over state-of-the-art methods. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long videos. The code is available at https://github.com/XiAooZ/MWNet.
【70】The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops Link: https://arxiv.org/abs/2512.15053
Authors: Fanzhe Fu Comments: 6 pages, 2 figures Abstract: The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based “prompt engineering,” fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for “Observable Software Engineering” in the era of probabilistic computing.
【71】SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification Link: https://arxiv.org/abs/2512.15052
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen Comments: Under Review for ACL 2026 Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
【72】HERO: Hierarchical Traversable 3D Scene Graphs for Embodied Navigation Among Movable Obstacles 标题:HERO:面向可移动障碍物间具身导航的分层可穿越3D场景图 链接:https://arxiv.org/abs/2512.15047
作者:Yunheng Wang,Yixiao Feng,Yuetong Fang,Shuning Zhang,Tan Jing,Jian Li,Xiangrui Jiang,Renjing Xu 摘要:3D场景图(3DSG)构成了物理世界的强大表示,其特征在于它们能够显式地对实体之间的复杂空间,语义和功能关系进行建模,从而提供基本的理解,使代理能够智能地与其环境交互并执行多功能行为。作为这种能力的重要组成部分,嵌入式导航利用3DSG的紧凑和表达性,在复杂的大规模环境中实现长期推理和规划。然而,先前的工作依赖于静态世界假设,仅基于静态空间布局来定义可穿越空间,从而将可交互障碍物视为不可穿越的。这个基本的限制严重破坏了它们在现实世界场景中的有效性,导致有限的可达性,低效率和较差的可扩展性。为了解决这些问题,我们提出了HERO,一个新的框架,用于构建分层可遍历3DSGs,重新定义可遍历性建模可操作的障碍路径,捕捉他们的物理交互性,功能语义和场景的关系层次。结果表明,相对于其基线,HERO减少PL的35.1%,在部分阻塞的环境中,SR增加79.4%,在完全阻塞的,表现出显着更高的效率和可达性。 摘要:3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene’s relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
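The idea of modeling operable obstacles as pathways can be illustrated with a small graph search. This is only a sketch under stated assumptions, not the HERO implementation: the room names, the edge costs, and the `allow_operable` flag are hypothetical, and the interaction cost of opening a door is simply folded into the edge weight.

```python
import heapq


def shortest_path_cost(edges, start, goal, allow_operable):
    """Dijkstra over a toy scene graph.

    edges: iterable of (u, v, cost, operable) tuples (hypothetical format).
    allow_operable=False mimics the static-world assumption, where an
    operable obstacle such as a closed door is treated as non-traversable.
    """
    graph = {}
    for u, v, cost, operable in edges:
        if operable and not allow_operable:
            continue  # drop edges that require interacting with an obstacle
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))

    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            cand = dist + cost
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return None  # goal unreachable


# Hypothetical layout: the short route goes through a closed door.
EDGES = [
    ("hall", "kitchen", 12.0, False),
    ("hall", "office", 3.0, True),   # closed door; cost includes opening it
    ("office", "kitchen", 2.0, False),
    ("kitchen", "pantry", 1.0, True),  # pantry is reachable only via a door
]
```

With these made-up numbers the static view returns a longer path to the kitchen and no path at all to the pantry, which mirrors the partially and fully obstructed settings mentioned in the abstract.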
【73】Agentic AI for Integrated Sensing and Communication: Analysis, Framework, and Case Study 标题:面向集成感知与通信的智能体人工智能:分析、框架和案例研究 链接:https://arxiv.org/abs/2512.15044
作者:Wenwen Xie,Geng Sun,Ruichen Zhang,Xuejie Liu,Yinqiu Liu,Jiacheng Wang,Dusit Niyato,Ping Zhang 摘要:集成感知与通信(ISAC)已成为第六代(6G)时代的一个重要发展方向,为未来智能网络的协同感知与通信提供了必要的支持。然而,随着无线环境变得越来越动态和复杂,ISAC系统需要更智能的处理和更自主的操作,以保持效率和适应性。与此同时,智能体人工智能(agentic AI)通过在动态环境中实现连续的感知-推理-行动循环,为ISAC系统的智能、自主和高效运行提供了应对这些挑战的可行方案。因此,我们在这项工作中深入研究了智能体AI在ISAC系统中的应用价值和前景。首先,我们对智能体AI和ISAC系统进行了全面的回顾,以展示它们的关键特征。其次,我们展示了ISAC系统的几种常见优化方法,并强调了基于生成式人工智能(GenAI)的智能体AI的显著优势。第三,我们提出了一种新的智能体ISAC框架,并通过案例研究验证了该框架在优化ISAC性能方面的优越性。最后,我们阐明了基于智能体AI的ISAC系统的未来研究方向。 摘要:Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and present a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.
【74】LADY: Linear Attention for Autonomous Driving Efficiency without Transformers 标题:LADY:无需Transformer、面向自动驾驶效率的线性注意力 链接:https://arxiv.org/abs/2512.15038
作者:Jihao Huang,Xi Xia,Zhiyuan Li,Tianle Liu,Jingke Wang,Junbo Chen,Tengju Ye 备注:Under review 摘要:端到端模式已经证明了自动驾驶的巨大潜力。此外,大多数现有方法都建立在Transformer架构上。然而,Transformers会产生二次注意力成本,限制了它们对长空间和时间序列建模的能力,特别是在资源受限的边缘平台上。由于自动驾驶本质上需要高效的时间建模,这一挑战严重限制了它们的部署和实时性能。近年来,线性注意机制由于其优越的时空复杂性而受到越来越多的关注。然而,现有的线性注意力架构仅限于自我注意力,缺乏对跨模态和跨时间交互的支持,而这两种交互对于自动驾驶至关重要。在这项工作中,我们提出了LADY,这是第一个用于端到端自动驾驶的完全线性的基于注意力的生成模型。LADY能够以恒定的计算和存储成本在推理时融合长距离时间上下文,而不管相机和LiDAR特征的历史长度。此外,我们引入了一个轻量级的线性交叉注意机制,使有效的跨模态信息交换。在NAVSIM和Bench2Drive基准测试上的实验表明,LADY在恒定的时间和内存复杂度下实现了最先进的性能,提供了更好的规划性能和显着降低的计算成本。此外,该模型已在边缘设备上部署和验证,证明了其在资源有限的场景中的实用性。 摘要:End-to-end paradigms have demonstrated great potential for autonomous driving. Additionally, most existing methods are built upon Transformer architectures. However, transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences-particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions-both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. 
Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
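The claim of constant compute and memory regardless of history length follows from the recurrent form of linear attention in general. The sketch below shows that generic recurrence in pure Python; it is not LADY's architecture. The `elu(x) + 1` feature map is a common choice from the linear attention literature and is an assumption here, as are the class and method names.

```python
import math


def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Running summary of all past (key, value) pairs with fixed size."""

    def __init__(self, dim_k, dim_v):
        self.S = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) v^T
        self.z = [0.0] * dim_k                          # sum of phi(k)

    def update(self, key, value):
        # Fold one new time step into the state; cost does not depend on history.
        phi_k = feature_map(key)
        for i, pk in enumerate(phi_k):
            self.z[i] += pk
            for j, v in enumerate(value):
                self.S[i][j] += pk * v

    def read(self, query):
        # Normalized attention output; call only after at least one update.
        phi_q = feature_map(query)
        denom = sum(pq * z for pq, z in zip(phi_q, self.z))
        dim_v = len(self.S[0])
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(dim_v)
        ]
```

Because the query only touches `S` and `z`, the same state could in principle be read by queries from another modality, which is one way to picture a linear cross-attention; how LADY actually wires camera and LiDAR features is not specified in the abstract.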
【75】Spectral Representation-based Reinforcement Learning 标题:基于谱表示的强化学习 链接:https://arxiv.org/abs/2512.15036
作者:Chenxiao Gao,Haotian Sun,Na Li,Dale Schuurmans,Bo Dai 摘要:在具有大型状态和动作空间的现实应用中,强化学习(RL)通常采用函数近似来表示策略,值函数和动态模型等核心组件。虽然强大的近似,如神经网络提供了很大的表现力,他们往往存在理论上的模糊性,遭受优化不稳定性和探索困难,并在实践中产生大量的计算成本。在本文中,我们引入了谱表示的观点作为解决RL中这些困难的解决方案。从谱分解的过渡算子,这个框架产生了一个有效的抽象的系统动力学为后续的政策优化,同时也提供了一个明确的理论表征。我们揭示了如何构建具有潜在变量结构或基于能量的结构,这意味着不同的学习方法,从数据中提取谱表示的过渡算子的谱表示。值得注意的是,这些学习方法中的每一种都在这个框架下实现了有效的RL算法。我们还证明了这一频谱视图部分可观察的MDPs。最后,我们在DeepMind Control Suite中的20多个具有挑战性的任务上验证了这些算法,它们的性能与当前最先进的无模型和基于模型的基线相当或更好。 摘要:In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
【76】Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation 标题:超越准确性:国际象棋评估中大型语言模型的几何稳定性分析 链接:https://arxiv.org/abs/2512.15033
作者:Xidan Song,Weiqi Wang,Ruifeng Cao,Qingya Hu 摘要:复杂推理领域中的大型语言模型(LLM)的评估通常依赖于与地面实况预言机的性能对齐。在国际象棋领域,这一标准表现为针对强大引擎(如Stockfish)的准确性基准。然而,高标量准确性并不一定意味着强大的概念理解。本文认为,标准的准确性指标无法区分真正的几何推理和肤浅的记忆规范板状态。为了解决这一差距,我们提出了一个几何稳定性框架,一种新的评估方法,严格测试模型一致性不变的转换,包括董事会旋转,镜像对称,颜色反转,格式转换。我们将此框架应用于六种最先进的LLM的比较分析,包括GPT-5.1,Claude Sonnet 4.5和Kimi K2 Turbo,使用了大约3,000个位置的数据集。我们的研究结果揭示了一个显着的精度稳定性paradigm。虽然GPT-5.1等模型在标准位置上达到了接近最佳的精度,但它们在几何扰动下表现出灾难性的退化,特别是在错误率激增超过600%的旋转任务中。这种差异表明模式匹配依赖于抽象的空间逻辑。相反,Claude Sonnet 4.5和Kimi K2 Turbo表现出卓越的双重鲁棒性,在所有变换轴上保持高度一致性。此外,我们分析了有用性和安全性之间的权衡,确定Gemini 2.5 Flash是非法状态拒绝的领导者(96.0%)。我们的结论是,几何稳定性为AI评估提供了一个正交的基本指标,为将推理能力与大规模模型中的数据污染和过拟合分离开来提供了必要的代理。 摘要:The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations-including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo, utilizing a dataset of approximately 3,000 positions. Our results reveal a significant Accuracy-Stability Paradox. While models such as GPT-5.1 achieve near-optimal accuracy on standard positions, they exhibit catastrophic degradation under geometric perturbation, specifically in rotation tasks where error rates surge by over 600%. This disparity suggests a reliance on pattern matching over abstract spatial logic. 
Conversely, Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate superior dual robustness, maintaining high consistency across all transformation axes. Furthermore, we analyze the trade-off between helpfulness and safety, identifying Gemini 2.5 Flash as the leader in illegal state rejection (96.0%). We conclude that geometric stability provides an orthogonal and essential metric for AI evaluation, offering a necessary proxy for disentangling reasoning capabilities from data contamination and overfitting in large-scale models.
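The transformations named in the abstract can be sketched on the piece-placement field of a FEN string. This is an illustrative sketch, not the paper's code: the function names and the toy `material_balance` evaluator are assumptions, and side-to-move, castling, and en passant fields are ignored.

```python
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def mirror_files(placement):
    """Mirror left-right: reverse every rank (FEN digits are single characters)."""
    return "/".join(rank[::-1] for rank in placement.split("/"))


def invert_colors(placement):
    """Swap White and Black: flip the board vertically and swap letter case."""
    return "/".join(rank.swapcase() for rank in reversed(placement.split("/")))


def rotate_180(placement):
    """Rotate the board by 180 degrees: reverse rank order and every rank."""
    return placement[::-1]


def material_balance(placement):
    """Toy evaluator: White material minus Black material."""
    total = 0
    for ch in placement:
        value = PIECE_VALUES.get(ch.lower())
        if value is not None:
            total += value if ch.isupper() else -value
    return total


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
ENDGAME = "4k3/8/8/8/8/8/4P3/4K2R"
```

For the toy evaluator, mirroring leaves the score unchanged and color inversion negates it; a stability check in this spirit would compare a model's verdicts on a position and on its transformed counterpart against such an expected relation.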
【77】Epistemic diversity across language models mitigates knowledge collapse 标题:跨语言模型的认知多样性缓解了知识崩溃 链接:https://arxiv.org/abs/2512.15011
作者:Damian Hodel,Jevin D. West 备注:16 pages, 7 figures 摘要:人工智能(AI)的日益广泛使用引发了对知识崩溃的担忧,即,一种对最主要和最核心的思想的简化。之前的工作已经证明了单模型崩溃,定义为在自己的输出上训练的AI模型的性能衰减。受生态学的启发,我们问人工智能生态系统的多样性,即模型之间的多样性,是否可以减轻这种崩溃。我们建立在单一模型方法的基础上,但专注于基于集体输出训练的模型生态系统。为了研究多样性对模型性能的影响,我们将语言模型的训练数据分段,并在十次自我训练迭代中评估由此产生的生态系统。我们发现,增加认识的多样性减轻崩溃,但有趣的是,只有到一个最佳水平。我们的研究结果表明,一个生态系统只包含几个不同的模型无法表达丰富的混合物的完整,真实的分布,导致快速的性能衰减。然而,将数据分布在太多的模型上会降低每个模型对真实分布的近似能力,从而导致第一次迭代步骤中的性能低下。在人工智能单一文化的背景下,我们的研究结果表明,需要监测人工智能系统的多样性,并制定政策,激励更多的领域和社区特定的模型。 摘要:The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten, self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model’s approximation capacity on the true distribution, leading to poor performance already in the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.
【78】Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation 标题:评估视频问题生成以获取专家知识的能力 链接:https://arxiv.org/abs/2512.15006
作者:Huaying Zhang,Atsushi Hashimoto,Tosho Hirasawa 备注:WACV 2026 accepted 摘要:熟练的面试官可以从专家那里提取有价值的信息。这就提出了一个基本问题:是什么让一些问题比其他问题更有效?为了解决这个问题,定量评估问题生成模型是必不可少的。视频问题生成(VQG)是视频问答(VideoQA)的一个主题,其中为给定的答案生成问题。他们的评估通常侧重于回答问题的能力,而不是生成问题的质量。相比之下,我们专注于从人类专家那里引出看不见的知识的问题质量。对于VQG模型的不断改进,我们提出了一个协议,通过模拟问答与专家使用问答检索通信的能力进行评估。我们通过构建一个新的数据集EgoExoAsk来获得检索器,该数据集包括从Ego-Exo 4D的专家评论注释中生成的27,666个QA对。使用EgoExoAsk训练集来获得检索器,并在具有Ego-Exo 4D视频片段的验证集上构建基准。实验结果表明,我们的指标合理地符合问题生成设置:模型访问更丰富的上下文进行评估更好,支持我们的协议按预期工作。EgoExoAsk数据集可在https://github.com/omron-sinicx/VQG4ExpertKnowledge上找到。 摘要:Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D’s expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in https://github.com/omron-sinicx/VQG4ExpertKnowledge .
【79】DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding 标题:DreamPRM-Code:面向LLM代码生成、带标签校正的函数即步骤过程奖励模型 链接:https://arxiv.org/abs/2512.15000
作者:Ruiyi Zhang,Peijia Qin,Qi Cao,Pengtao Xie 摘要:过程奖励模型(Process Reward Models,PRM)已成为通过测试时扩展(test-time scaling)改进大型语言模型(LLM)的关键,但由于代码缺乏有意义的步骤分解,且蒙特卡洛生成的部分标签存在噪声,它们在代码生成中的有效性仍然有限。我们提出了DreamPRM-Code,一种面向代码的PRM,它将函数视为推理步骤,使用函数链(Chain-of-Function)提示策略来诱导模块化代码生成,使PRM的训练和应用类似于数学推理任务。为了解决标签噪声问题,DreamPRM-Code引入了一种基于元学习的校正机制,该机制利用干净的最终解单元测试标签,并执行双层优化来修正中间标签。应用于测试时扩展时,DreamPRM-Code在LiveCodeBench上以80.9的pass@1取得了最先进的性能,超过了OpenAI o4-mini。 摘要:Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with a pass@1 rate of 80.9, surpassing OpenAI o4-mini.
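Treating functions as reasoning steps presupposes that a generated program can be cut at function boundaries. One way to picture that is the sketch below, which uses Python's standard `ast` module; the helper names and the sample program are hypothetical, and the paper's actual decomposition and scoring pipeline may differ.

```python
import ast


def split_into_function_steps(source):
    """Return (name, source_text) for each top-level function, in order."""
    tree = ast.parse(source)
    steps = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            steps.append((node.name, ast.get_source_segment(source, node)))
    return steps


def cumulative_prefixes(steps):
    """Partial solutions after 1, 2, ... functions, as a step-level scorer might see them."""
    return [
        "\n\n".join(text for _, text in steps[: i + 1])
        for i in range(len(steps))
    ]


# Hypothetical modular solution of the kind a function-by-function prompt might induce.
SAMPLE = (
    "import sys\n"
    "\n"
    "def parse(line):\n"
    "    return [int(x) for x in line.split()]\n"
    "\n"
    "def solve(nums):\n"
    "    return sum(nums)\n"
    "\n"
    "print(solve(parse('1 2 3')))\n"
)
```

Each prefix could then be passed to a reward model as one step, in the same way a math PRM scores a partial chain of reasoning.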
【80】Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle 标题:超越邻近性:用于奶牛亲和与对抗社交网络分类的关键点轨迹框架 链接:https://arxiv.org/abs/2512.14998
作者:Sibi Parivendan,Kashfia Sailunaz,Suresh Neethirajan 备注:36 pages, 12 figures, 8 tables 摘要:精准畜牧业需要客观评估社会行为,以支持牛群福利监测,但大多数现有的方法推断使用静态接近阈值,不能区分从属竞争行为在复杂的谷仓环境中的相互作用。这一限制限制了自动社交网络分析在商业环境中的可解释性。我们提出了一个基于姿态的计算框架,通过对解剖关键点的时空几何形状进行建模,超越了邻近度的交互分类。该方法不依赖于像素级外观或简单的距离测量,而是从关键点轨迹编码特定于交互的运动签名,从而能够区分社交交互效价。该框架被实现为一个端到端的计算机视觉管道,集成了用于对象检测的YOLOv11(mAP@0.50:96.24%),监督个体识别(98.24%准确率),用于多对象跟踪的ByteTrack(81.96%准确率),用于27点解剖关键点估计的ZebraPose,以及在姿势衍生距离动态上训练的支持向量机分类器。在从商业奶牛场收集的带注释的交互片段上,分类器仅使用姿势信息区分从属和对抗行为的准确率达到77.51%。与仅接近基线的比较评估显示,行为歧视,特别是亲和互动的实质性收益。结果建立了一个概念验证的自动化,基于视觉的推理的社会互动适合构建交互感知社交网络,与商品硬件上的近实时性能。 摘要:Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. 
Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
【81】Where is the Watermark? Interpretable Watermark Detection at the Block Level 标题:水印在哪里?块级的可解释水印检测 链接:https://arxiv.org/abs/2512.14994
作者:Maria Bulychev,Neil G. Marchant,Benjamin I. P. Rubinstein 备注:20 pages, 14 figures. Camera-ready for WACV 2026 摘要:生成式人工智能的最新进展使人们能够创建高度逼真的数字内容,引发了人们对真实性、所有权和滥用的担忧。虽然水印已经成为一个越来越重要的机制来跟踪和保护数字媒体,大多数现有的图像水印方案作为黑箱操作,产生全局检测分数,而不提供任何洞察水印是如何或在哪里存在。这种透明度的缺乏影响了用户的信任,并使其难以解释篡改的影响。在本文中,我们提出了一个事后图像水印方法,结合局部嵌入区域级的可解释性。我们的方法嵌入水印信号在离散小波变换域中使用的统计块明智的战略。这使我们能够生成检测图,揭示图像的哪些区域可能被水印或更改。我们表明,我们的方法实现了强大的鲁棒性对常见的图像变换,同时保持敏感的语义操作。同时,水印保持高度不可察觉。与以前的事后方法相比,我们的方法提供了更多的可解释的检测,同时保持竞争的鲁棒性。例如,我们的水印对于裁剪图像的一半是鲁棒的。 摘要:Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
【82】Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent 标题:模仿游戏:利用智能代理再现深度学习错误 链接:https://arxiv.org/abs/2512.14990
作者:Mehil B Shah,Mohammad Masudur Rahman,Foutse Khomh 备注:Accepted by the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026) 摘要:尽管它们在各个领域(例如,医疗保健、金融、软件工程),基于深度学习(DL)的应用程序遭受许多错误、故障和漏洞。再现这些错误对于解决它们是必不可少的,但由于DL模型固有的不确定性及其与硬件和软件环境的紧密耦合,这是非常具有挑战性的。根据最近的研究,只有大约3%的DL错误可以使用手动方法可靠地复制。为了解决这些挑战,我们提出了RepGen,这是一种用于复制深度学习错误的新颖,自动化和智能方法。RepGen从一个项目中构建了一个学习增强的上下文,为bug复制开发了一个全面的计划,采用了迭代的生成-验证-细化机制,从而使用LLM生成了这样的代码,再现了手头的bug。我们在106个真实世界的深度学习错误上评估了RepGen,并实现了80.19%的重现率,比最先进的测量方法提高了19.81%。一项涉及27名参与者的开发人员研究表明,RepGen将DL错误复制的成功率提高了23.35%,将复制时间减少了56.8%,并降低了参与者的认知负荷。 摘要:Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants’ cognitive load.
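The iterative generate-validate-refine mechanism can be pictured as a small loop. The sketch below is an assumption-laden illustration rather than RepGen itself: `generate` stands in for an LLM call, the validation criterion (the candidate raises the expected exception type) is a simplification, and `stub_generator` is a hypothetical placeholder. Running untrusted generated code with `exec` would need sandboxing in practice.

```python
def reproduces_bug(code, expected_error):
    """Run candidate code and report whether it raises the expected exception type."""
    try:
        exec(code, {})
    except Exception as exc:
        return type(exc).__name__ == expected_error, repr(exc)
    return False, "no exception raised"


def generate_validate_refine(generate, expected_error, max_rounds=3):
    """Ask for candidates until one reproduces the bug or the budget runs out."""
    feedback = None
    for round_index in range(1, max_rounds + 1):
        code = generate(feedback)
        ok, feedback = reproduces_bug(code, expected_error)
        if ok:
            return code, round_index
    return None, max_rounds


def stub_generator(feedback):
    # Hypothetical stand-in for an LLM: the first attempt fails to trigger the
    # bug, the second attempt (after feedback) raises a ValueError.
    if feedback is None:
        return "x = 1"
    return "import math\nmath.sqrt(-1)"
```

The feedback string from a failed round is what a real system would feed back into the next prompt.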
【83】Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams 标题:在多模态化学奥林匹克考试上评估大型语言模型 链接:https://arxiv.org/abs/2512.14989
作者:Yiming Cui,Xin Yao,Yuxuan Qin,Xin Li,Shijin Wang,Guoping Hu 备注:Published at Communications Chemistry 摘要:多模态科学推理仍然是大型语言模型(LLM)的一个重大挑战,特别是在化学领域,解决问题依赖于符号图,分子结构和结构化的可视化数据。在这里,我们系统地评估了40个专有和开源的多模式LLM,包括GPT-5,o3,Gemini-2.5-Pro和Qwen2.5-VL,这些LLM是根据二十多年来美国国家化学奥林匹克(USNCO)考试中的奥林匹克风格化学问题的策划基准。这些问题需要在不同的方式集成的视觉和文本推理。我们发现,许多模型都难以进行模态融合,在某些情况下,删除图像甚至可以提高准确性,这表明视觉语言整合存在偏差。如消融研究和基于遮挡的可解释性所示,思维链提示始终增强准确性和视觉基础。我们的研究结果揭示了当前MLLM科学推理能力的关键局限性,为开发更强大和可解释的化学多模态系统提供了可行的策略。这项工作为衡量特定领域多模态AI的进展提供了一个及时的基准,并强调了在人工智能和科学推理的交叉领域取得进一步进展的必要性。 摘要:Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
【84】Prompt Repetition Improves Non-Reasoning LLMs 标题:提示重复可提升非推理LLM的性能 链接:https://arxiv.org/abs/2512.14982
作者:Yaniv Leviathan,Matan Kalman,Yossi Matias 摘要:当不使用推理时,重复输入提示可以提高流行模型(Gemini、GPT、Claude和Deepseek)的性能,而不会增加生成的令牌数量或延迟。 摘要:When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
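Repeating the input prompt is simple enough to show directly. In this minimal sketch the function name, the default separator, and the repetition count are assumptions; the abstract does not state how the copies are joined.

```python
def repeat_prompt(prompt, times=2, separator="\n\n"):
    """Return the prompt repeated `times` times, joined by `separator`."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


# Example: the model would receive the question twice in a single input.
doubled = repeat_prompt("Which planet has the most moons?")
```

Only the input grows; the generated answer is unaffected in length, which is consistent with the abstract's point about generated tokens.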
【85】EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving 标题:EVICPRESS:联合KV缓存压缩与驱逐,实现高效的LLM服务 链接:https://arxiv.org/abs/2512.14946
作者:Shaoting Feng,Yuhan Liu,Hanchen Li,Xiaokun Chen,Samuel Shen,Kuntai Du,Zhuohan Gu,Rui Zhang,Yuyang Huang,Yihua Cheng,Jiayi Yao,Qizheng Zhang,Ganesh Ananthanarayanan,Junchen Jiang 摘要:KV缓存的重用是大型语言模型推理系统高效运行的关键。随着更多的LLM用户,KV缓存占用空间很容易超过GPU内存容量,因此之前的工作建议将KV缓存驱逐到较低层的存储设备,或者压缩KV缓存,以便更多的KV缓存可以适应快速内存。然而,之前的工作错过了一个重要的机会:联合优化所有KV缓存的驱逐和压缩决策,以最大限度地减少平均生成延迟,而不影响质量。 我们提出EVICPRESS,KV缓存管理系统,适用于有损压缩和自适应驱逐KV缓存跨多个存储层。具体地,对于上下文的每个KV高速缓存,EVICPRESS考虑KV高速缓存的压缩和驱逐对所有上下文的平均生成质量和延迟的影响。为了实现这一点,EVICPRESS提出了一个统一的效用函数,量化的有损压缩或驱逐的质量和延迟的影响。为此,EVICPRESS的分析模块定期更新所有上下文的所有可能的驱逐压缩配置上的效用函数分数,并使用快速启发式重新排列所有存储层上的KV缓存来放置KV缓存,目标是最大化每个存储层上的效用函数分数。与收回KV缓存或压缩KV缓存的基线相比,EVICPRESS在快速设备上实现了更高的KV缓存命中率,即,更低的延迟,同时通过对对压缩错误敏感的上下文应用保守压缩来保持高生成质量。对12个数据集和5个模型的评估表明,EVICPRESS在相同的生成质量下实现了高达2.19倍的首次令牌时间(TTFT)。 摘要:Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. 
To this end, EVICPRESS’s profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
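A unified utility that trades generation quality against delay, combined with a fast placement heuristic, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the linear form `quality - alpha * delay`, the two-tier setup, the greedy gain-per-size ordering, and all numbers. The abstract does not give EVICPRESS's actual utility function or heuristic.

```python
def utility(quality, delay, alpha=1.0):
    # Assumed form: higher quality is better, higher delay is worse.
    return quality - alpha * delay


def place_caches(contexts, fast_capacity, alpha=1.0):
    """Pick, per context, a fast-tier configuration or eviction to the slow tier.

    contexts: name -> {"fast": [(size, quality, delay), ...],
                       "slow": (quality, delay)}   (hypothetical format)
    Returns name -> chosen (size, quality, delay), or None if evicted.
    """
    candidates = []
    for name, opts in contexts.items():
        slow_u = utility(*opts["slow"], alpha)
        for config in opts["fast"]:
            size, quality, delay = config
            gain = utility(quality, delay, alpha) - slow_u
            if gain > 0:
                candidates.append((gain / size, name, config))

    # Greedy: best utility gain per unit of fast memory first.
    candidates.sort(reverse=True)
    chosen, used = {}, 0.0
    for _, name, config in candidates:
        if name not in chosen and used + config[0] <= fast_capacity:
            chosen[name] = config
            used += config[0]
    return {name: chosen.get(name) for name in contexts}


# Context "a" tolerates compression well; context "b" loses a lot of quality.
CONTEXTS = {
    "a": {"fast": [(4, 1.0, 0.1), (2, 0.9, 0.1)], "slow": (1.0, 1.0)},
    "b": {"fast": [(4, 1.0, 0.1), (2, 0.5, 0.1)], "slow": (1.0, 1.0)},
}
```

With a fast tier of size 6, this toy heuristic compresses the tolerant context and keeps the sensitive one uncompressed, which is the qualitative behavior the abstract describes as conservative compression for compression-sensitive contexts.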
【86】TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation 标题:TalkVerse:一分钟音频驱动视频生成民主化 链接:https://arxiv.org/abs/2512.14938
作者:Zhenzhi Wang,Jian Wang,Ke Ma,Dahua Lin,Bing Zhou 备注:open-sourced single-person full-body talking video generation dataset, training code and checkpoints 摘要:我们介绍了TalkVerse,这是一个大规模的开放语料库,用于单人,音频驱动的谈话视频生成,旨在实现公平,可重复的方法比较。虽然目前最先进的系统依赖于封闭的数据或计算繁重的模型,但TalkVerse提供了230万个高分辨率(720 p/1080 p)音频视频同步剪辑,总计6.3k小时。这些都是通过透明的管道从超过60 k小时的视频中策划的,包括场景切换检测,美学评估,严格的视听同步检查以及全面的注释,包括2D骨架和结构化的视觉/音频风格字幕。利用TalkVerse,我们提出了一个基于Wan2.2- 5 B构建的可重复的5 B DiT基线。通过利用具有高下采样率的视频VAE和具有运动帧上下文的滑动窗口机制,我们的模型实现了具有低漂移的分钟长的生成。它提供了与14 B Wan-S2 V型号相当的对口型同步和视觉质量,但推理成本低10倍。为了增强长视频中的故事讲述,我们整合了一个MLLM导演,根据音频和视觉线索重写提示。此外,我们的模型支持zero-shot视频配音通过控制潜在噪声注入。我们开源了数据集、训练配方和5 B检查点,以降低音频驱动的人类视频生成研究的障碍。项目页面:https://zhenzhiwang.github.io/talkverse/ 摘要:We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. 
We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
【87】Improving Pre-trained Segmentation Models using Post-Processing 标题:使用后处理改进预训练的分割模型 链接:https://arxiv.org/abs/2512.14937
作者:Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,Nishad Kulkarni,Krithika Iyer,Austin Tapp,Syed Muhammad Anwar,María J. Ledesma-Carbayo,Marius George Linguraru 摘要:神经胶质瘤是成人中最常见的恶性脑肿瘤,也是最致命的肿瘤之一。尽管积极治疗,中位生存率不到15个月。准确的多参数MRI(mpMRI)肿瘤分割对于手术计划、放射治疗和疾病监测至关重要。虽然深度学习模型提高了自动分割的准确性,但大规模的预训练模型泛化能力很差,而且往往表现不佳,产生系统性错误,如误报、标签交换和切片中的切片不连续。这些限制进一步加剧了对GPU资源的不平等访问以及大规模模型训练不断增长的环境成本。在这项工作中,我们提出了自适应后处理技术,以改进为各种类型肿瘤开发的大规模预训练模型产生的胶质瘤分割的质量。我们在多个BraTS 2025分割挑战任务中展示了这些技术,撒哈拉以南非洲挑战的排名指标提高了14.9%,成人胶质瘤挑战提高了0.9%。这种方法促进了脑肿瘤分割研究从日益复杂的模型架构向高效、临床一致的后处理策略的转变,这些策略是精确的、计算公平的和可持续的。 摘要:Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities in slices. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9 % for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
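The abstract does not spell out which post-processing steps are used, so the following is only a generic example of the kind of rule that targets false positives: dropping small isolated components from a binary mask. The function name, the 2D setting, the 4-connectivity, and the size threshold are all assumptions.

```python
from collections import deque


def remove_small_components(mask, min_size):
    """Zero out 4-connected foreground components smaller than min_size.

    mask: list of lists with 0 (background) and 1 (foreground).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [row[:] for row in mask]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for y, x in component:
                    cleaned[y][x] = 0
    return cleaned


MASK = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
```

Rules like this run on CPU in linear time over the mask, which fits the abstract's emphasis on computationally modest refinement of an existing model's output.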
【88】Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models 标题:罗马尼亚语视觉语言模型的参数高效多模态指令微调 链接:https://arxiv.org/abs/2512.14926
作者:George-Andrei Dima,Dumitru-Clementin Cercel 摘要:关注低资源语言是实现生成式AI民主化的重要一步。在这项工作中,我们致力于缩小罗马尼亚语的多模态NLP资源缺口。我们将广为人知的Flickr30k数据集翻译成罗马尼亚语,并利用开源LLM将其进一步扩展用于视觉问答。我们通过在罗马尼亚语视觉问答上微调开源VLM来证明数据集的有用性。我们从三个广泛使用的模型系列中选择VLM:LLaMA 3.2、LLaVA 1.6和Qwen2。对于微调,我们采用参数高效的LoRA方法。我们的模型在视觉问答上的罗马尼亚语能力有所提升,在未经训练的任务(例如罗马尼亚语图像描述生成)上也是如此。70亿参数的Qwen2-VL-RoVQA在两项任务上都获得了最高分,BERTScore F1比其原始版本分别提高了+6.05%和+2.61%。最后,与原始版本相比,这些模型的语法错误大幅减少,表明不仅语言理解有所改善,罗马尼亚语的流利度也有所提高。 摘要:Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
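To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors, `B` (`d_out x rank`) and `A` (`rank x d_in`), whose product is added to the frozen weight. The sketch below only counts trainable parameters for one such matrix. The layer size and rank are hypothetical and are not taken from the paper.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection layer and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 16
ratio = lora_params(d_out, d_in, rank) / full_params(d_out, d_in)
print(f"LoRA trains {ratio:.2%} of this layer's parameters")
```

The saving grows with layer size because the LoRA count is linear in `d_out + d_in` while the full count is their product.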
【89】AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers’ Enquiries Globally 标题:AgroAskAI:一个支持全球小农户咨询的多智能体AI框架 链接:https://arxiv.org/abs/2512.14910
作者:Nadine Angela Cantonjos,Arpita Biswas 摘要:农村地区的农业区面临着与气候相关的风险的破坏,包括干旱、暴雨和不断变化的天气模式。先前的研究需要适应性的风险管理解决方案和决策策略。为此,人工智能(AI),特别是代理AI,提供了一条充满希望的前进道路。人工智能系统由能够解决复杂动态任务的自主、专业代理组成。虽然过去的系统依赖于单智能体模型或使用多智能体框架只用于静态功能,但越来越需要支持动态协作推理和上下文感知输出的架构。为了弥合这一差距,我们提出了AgroAskAI,这是一个多智能体推理系统,用于农业气候适应决策支持,重点关注脆弱的农村社区。AgroAskAI具有模块化的角色专用架构,使用责任链方法来协调自治代理,集成实时工具和数据集。该系统具有内置的治理机制,可以减轻幻觉并为连贯的、本地相关的战略提供内部反馈。该系统还支持多语言互动,使非英语农民也能使用。对与气候适应相关的常见农业查询的实验表明,通过额外的工具和及时的改进,AgroAskAI提供了更具可操作性,接地和包容性的输出。我们的实验结果突出了人工智能在农业气候适应方面可持续和负责任的决策支持的潜力。 摘要:Agricultural regions in rural areas face damage from climate-related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk-management solutions and decision-making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single-agent models or have used multi-agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context-aware outputs. To bridge this gap, we present AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role-specialized architecture that uses a chain-of-responsibility approach to coordinate autonomous agents, integrating real-time tools and datasets. The system has built-in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non-English-speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. 
Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.
【90】DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline 标题:DrugRAG:通过新型检索增强生成管道提高药学LLM性能 链接:https://arxiv.org/abs/2512.14896
作者:Houman Kazemzadeh,Kiarash Mokhtari Dizaji,Seyed Reza Tavakoli,Farbod Davoodi,MohammadReza KarimiNejad,Parham Abed Azad,Ali Sabzi,Armin Khosravi,Siavash Ahmadi,Mohammad Hossein Rohban,Glolamali Aminian,Tahereh Javaheri 备注:11 pages, 2 figures, 3 tables 摘要:目的:评估大语言模型(LLM)在药房执照式问答(QA)任务上的性能,并开发外部知识集成方法以提高其准确性。 方法:我们使用141个问题的药房数据集对11个现有的LLM进行了基准测试,这些LLM具有不同的参数大小(80亿到700多亿)。我们测量了每个模型的基线准确度,没有修改。然后,我们开发了一个三步检索增强生成(RAG)管道,DrugRAG,从经过验证的来源检索结构化的药物知识,并使用基于证据的上下文增强模型提示。该管道在模型外部运行,不需要更改模型架构或参数。 结果:基线准确率范围为46%至92%,其中GPT-5(92%)和o3(89%)得分最高。参数少于80亿的模型得分低于50%。DrugRAG提高了所有测试模型的准确性,增益范围从7到21个百分点(例如,Gemma 3 27B:61%至71%,Llama 3.1 8B:46%至67%)。 结论:我们证明,外部结构化的药物知识集成通过DrugRAG可测量地提高了LLM的准确性药房任务,而无需修改底层模型。这种方法提供了一个实用的管道,可以通过基于证据的信息来增强以制药为中心的人工智能应用。 摘要:Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. 
Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
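The abstract describes DrugRAG only as a three-step pipeline that retrieves structured drug knowledge and adds it to the prompt, leaving the model untouched. The sketch below shows the general shape of such an external pipeline. The split into "extract terms, retrieve facts, augment prompt" is an assumption, and the tiny in-memory knowledge base and every function name are hypothetical stand-ins for the validated sources the authors use.

```python
# Hypothetical stand-in for a validated, structured drug knowledge source.
KNOWLEDGE_BASE = {
    "warfarin": ["Class: vitamin K antagonist (anticoagulant)."],
    "metformin": ["Class: biguanide (antihyperglycemic)."],
}


def extract_terms(question, vocabulary):
    """Step 1 (assumed): find known drug names mentioned in the question."""
    words = {w.strip(".,?;:()").lower() for w in question.split()}
    return sorted(words & set(vocabulary))


def retrieve(terms, kb):
    """Step 2 (assumed): look up structured facts for each matched term."""
    return [fact for term in terms for fact in kb[term]]


def augment(question, facts):
    """Step 3 (assumed): prepend retrieved facts to the unmodified question."""
    if not facts:
        return question
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "Which drug class does warfarin belong to?"
terms = extract_terms(question, KNOWLEDGE_BASE)
prompt = augment(question, retrieve(terms, KNOWLEDGE_BASE))
```

Because everything happens before the prompt reaches the model, the same pipeline can sit in front of any of the benchmarked LLMs, which matches the abstract's claim that no architecture or parameter changes are needed.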
【91】Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections 标题:通过同策略(on-policy)专家纠正实现多轮LM智能体的模仿学习 链接:https://arxiv.org/abs/2512.14895
作者:Niklas Lauffer,Xiang Deng,Srivatsa Kundurthy,Brad Kenstler,Jeff Da 摘要:训练LM代理的流行范例依赖于模仿学习,对专家轨迹进行微调。然而,我们发现,多回合LM代理的模仿学习的非政策性质受到称为协变量转移的基本限制:由于学生政策的行为与专家的行为不同,它遇到了训练数据中不存在的状态,降低了微调的有效性。从经典的Dagger算法的灵感,我们提出了一种新的数据生成方法来解决多轮LLM训练的协变量移位。我们引入了政策上的专家校正(OECs),部分政策上的数据,通过学生模型开始推出,然后在轨迹的一部分切换到专家模型。我们探索我们的数据生成技术在软件工程(SWE)任务领域的有效性,LLM代理必须与开发环境交互以修复软件错误的多轮设置。我们的实验比较了OEC数据与SWE代理问题和训练模型上的各种其他基于策略和模仿学习方法,这些方法使用共同的拒绝采样(即,使用环境奖励)结合监督微调技术。实验发现,OEC轨迹在7b和32b设置中分别比传统的模仿学习提高了14%和13%,并在SWE台架上进行了验证。我们的研究结果表明,需要将专家演示与政策数据相结合,以进行有效的多轮LM代理培训。 摘要:A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy’s behavior diverges from the expert’s, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. 
Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
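The data-generation idea in this abstract, starting a rollout with the student and handing over to the expert part way through, can be sketched with toy policies. Everything below is a hypothetical illustration: the environment, the two policies, and the fixed `switch_at` turn are assumptions, and the abstract does not say how the switch point is chosen.

```python
def collect_oec_trajectory(student, expert, env_step, initial_state, switch_at, max_turns):
    """Roll out the student for `switch_at` turns, then let the expert finish."""
    state = initial_state
    trajectory = []
    for turn in range(max_turns):
        actor = "student" if turn < switch_at else "expert"
        policy = student if actor == "student" else expert
        action = policy(state)
        # Record which policy acted so later training can treat the parts differently.
        trajectory.append((state, action, actor))
        state, done = env_step(state, action)
        if done:
            break
    return trajectory, state


TARGET = 3  # toy goal: reach this counter value


def student(state):
    return "noop"  # toy student that makes no progress


def expert(state):
    return "inc"  # toy expert that always moves toward the goal


def env_step(state, action):
    next_state = state + 1 if action == "inc" else state
    return next_state, next_state == TARGET


trajectory, final_state = collect_oec_trajectory(
    student, expert, env_step, initial_state=0, switch_at=2, max_turns=10
)
actors = [actor for _, _, actor in trajectory]
# Rejection sampling in the abstract's sense would keep the trajectory
# only if the environment reward marks it as successful.
keep = final_state == TARGET
```

The expert's actions are taken from states the student actually reached, which is how this kind of data addresses the covariate shift the abstract describes.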
【92】OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams 标题:OLR-WA:多元数据流中的在线加权平均线性回归 链接:https://arxiv.org/abs/2512.14892
作者:Mohammad Abu-Shaira,Alejandro Rodriguez,Greg Speegle,Victor Sheng,Ishfaq Ahmad 摘要:在线学习使用新数据逐步更新模型,避免了大量的存储需求和昂贵的模型重新计算。本文介绍了一种新颖的多变量在线线性回归模型“OLR-WA;加权平均在线回归”。我们还研究了涉及漂移的场景,其中数据中的潜在模式随着时间的推移而演变,进行收敛分析,并将我们的方法与现有的在线回归模型进行比较。OLR-WA的结果表明,它能够实现与批量回归相当的性能,同时与其他最先进的在线模型相比,也显示出相当或更高的性能,从而建立了其有效性。此外,OLR-WA在快速收敛方面表现出卓越的性能,超越了其他在线模型,从第一次迭代到最后一次迭代,即使在使用最少量的数据点(仅占总数据点的1%至10%)进行初始化时,也始终实现高r2值作为性能指标。值得注意的是,除了能够处理基于时间(时间漂移)的场景之外,OLR-WA还脱颖而出,成为唯一能够有效管理基于信心的挑战性场景的模型。它通过在更新中采用保守的方法来实现这一点,优先考虑具有较高置信水平的旧数据点。总之,OLR-WA的性能进一步巩固了其在不同环境中的多功能性和实用性,使其成为在线线性回归任务的有价值的解决方案。 摘要:Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce “OLR-WA; OnLine Regression with Weighted Average”, a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to the batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models with consistently achieving high r2 values as a performance measure from the first iteration to the last iteration, even when initialized with minimal amount of data points, as little as 1% to 10% of the total data points. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. 
In summary, OLR-WA’s performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.
【93】Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media 标题:集成大型语言模型和知识图以捕捉新闻媒体中的政治观点 链接:https://arxiv.org/abs/2512.14887
作者:Massimiliano Fadda,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino 摘要:新闻来源在民主社会中发挥着核心作用,通过具体的主题、观点和声音塑造政治和社会话语。了解这些动态对于评估媒体格局是否对公共辩论提供了平衡和公正的描述至关重要。在早期的工作中,我们引入了一个管道,给定一个新闻语料库,i)使用混合人机方法来识别关于给定主题表达的观点的范围,以及ii)对关于所识别的观点的相关声明进行分类,定义为语义和意识形态一致的声明的集合(例如,英国政府认为,移民对英国经济有积极影响。在本文中,我们通过i)微调大语言模型(LLM)用于观点分类和ii)使用从维基数据中提取的相关参与者的语义描述来丰富声明表示来改进此管道。我们评估我们的方法对替代解决方案的基准集中在英国移民辩论。结果表明,虽然这两种机制独立地提高了分类性能,但它们的集成产生了最佳结果,特别是当使用能够处理长输入的LLM时。 摘要:News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
【94】Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse 标题:熵-水库Bregman投影:模型崩溃的信息-几何统一 链接:https://arxiv.org/abs/2512.14879
作者:Jingwei Chen
摘要:自参考学习-在它自己生成的数据上训练模型-承诺无限的可扩展性,但长期遭受模型崩溃:语言模型退化为重复的文本,GANs丢弃模式,以及重复学习策略过度利用。虽然从业者采用adhoc修复,如真实数据混合,熵奖金,知识蒸馏,或检索增强生成,一个单一的原则,解释失败模式和这些修复的成功仍然难以捉摸。我们提出了熵库布雷格曼投影(ERBP),这是一个统一这些现象的信息几何框架。我们模型的闭环作为一个随机Bregman投影序列在分布空间。在没有外部耦合的情况下,有限样本噪声迫使系统投射到不断缩小的经验支持上,导致指数熵衰减和最终崩溃。引入一个熵库–一个混合到每个投影中的高熵分布–注入了一个可控的熵流,可以证明它可以稳定动态。我们的理论产生(i)崩溃的必要条件,(ii)保证非平凡熵地板的充分条件,以及(iii)仅依赖于样本大小和Bregman生成器的强凸性/Lipschitz常数的封闭形式的速率。大型语言模型自训练、强化学习中的Soft Actor-Critic和GAN优化实验验证了我们的预测,并表明不同的稳定化算法对应于特定的水库选择和耦合系数。因此,ERBP将一系列民间疗法转化为一个单一的定量设计规则:监控和预算你的熵通量。
摘要:Self-referential learning – training a model on data it generated itself – promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ adhoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir – a high-entropy distribution mixed into each projection – injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
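A toy simulation can illustrate the collapse-versus-reservoir contrast described above. It assumes the simplest possible setting: a categorical distribution that is repeatedly refit to a finite sample of its own output (for a categorical family, the KL projection of the sample is just its empirical frequencies), optionally mixed with a uniform "reservoir" using coupling coefficient `lam`. The setup, names, and parameter values are illustrative assumptions and do not reproduce the paper's experiments.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0)


def refit(samples, k):
    """Empirical frequencies of the sampled categories."""
    counts = [0] * k
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]


def self_train(k, n, steps, lam, rng):
    """Repeatedly sample n points from the model and refit on them.

    lam is the weight of the uniform reservoir mixed in after each refit;
    lam = 0 gives the closed loop with no external coupling.
    """
    uniform = [1.0 / k] * k
    p = uniform[:]
    for _ in range(steps):
        samples = rng.choices(range(k), weights=p, k=n)
        p_hat = refit(samples, k)
        p = [(1 - lam) * a + lam * u for a, u in zip(p_hat, uniform)]
    return p


rng = random.Random(0)
closed = self_train(k=10, n=20, steps=500, lam=0.0, rng=rng)
mixed = self_train(k=10, n=20, steps=500, lam=0.2, rng=rng)
print(f"closed-loop entropy: {entropy(closed):.3f}")
print(f"with reservoir:      {entropy(mixed):.3f}")
```

With `lam > 0`, concavity of entropy gives the mixed distribution a floor of at least `lam * log(k)`, a simple instance of the non-trivial entropy floor the abstract mentions. With `lam = 0`, categories that drop out of one sample can never return, so the support only shrinks.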
【95】Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks 标题:智能体AI的渗透测试:跨模型和框架的比较安全分析 链接:https://arxiv.org/abs/2512.14860
作者:Viet K. Nguyen,Mohammad I. Husain 摘要:人工智能引入了传统LLM保障措施无法解决的安全漏洞。虽然Palo Alto Networks的Unit 42最近的工作表明ChatGPT-4 o成功地执行了它在聊天模式下拒绝的代理攻击,但没有在多个模型和框架中进行比较分析。我们对人工智能系统进行了首次系统性的渗透测试和比较评估,测试了五个突出的模型(Claude 3.5 Sonnet,Gemini 2.5 Flash,GPT-4o,Grok 2和Nova Pro)(AutoGen和CrewAI)使用七代理架构,模仿大学信息管理系统的功能和13种不同的攻击场景,这些攻击场景跨越提示注入,服务器端请求伪造(SSRF)、SQL注入和工具滥用。我们的130个测试案例揭示了显著的安全差异:AutoGen的拒绝率为52.3%,CrewAI为30.8%,而模型性能从Nova Pro的46.2%到Claude和Grok 2的38.5%不等。最关键的是,Grok 2在CrewAI上只拒绝了13次攻击中的2次(拒绝率为15.4%),所有配置的总体拒绝率为41.5%,这表明尽管有企业级安全机制,但仍有超过一半的恶意提示成功。我们确定了六种不同的防御行为模式,包括一种新的“幻觉合规”策略,模型制造输出,而不是执行或拒绝攻击,并提供可操作的安全代理部署的建议。完整的攻击提示也包含在附录中,以实现再现性。 摘要:Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI’s 30.8%, while model performance ranges from Nova Pro’s 46.2% to Claude and Grok 2’s 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. 
We identify six distinct defensive behavior patterns including a novel “hallucinated compliance” strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.
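The reported percentages can be cross-checked against the test matrix of 5 models, 2 frameworks, and 13 attacks. The refusal counts below are not stated in the abstract; they are inferred as the integers that reproduce the published rates, so treat them as an assumption.

```python
MODELS, FRAMEWORKS, ATTACKS = 5, 2, 13

total_cases = MODELS * FRAMEWORKS * ATTACKS  # 130 test cases overall
per_framework = MODELS * ATTACKS             # 65 cases per framework
per_model = FRAMEWORKS * ATTACKS             # 26 cases per model


def rate(refused, total):
    """Refusal rate as a percentage rounded to one decimal place."""
    return round(100 * refused / total, 1)


# Inferred refusal counts (assumption): chosen to match the reported rates.
autogen_refused, crewai_refused = 34, 20

autogen_rate = rate(autogen_refused, per_framework)                 # reported 52.3
crewai_rate = rate(crewai_refused, per_framework)                   # reported 30.8
overall_rate = rate(autogen_refused + crewai_refused, total_cases)  # reported 41.5
nova_rate = rate(12, per_model)                                     # reported 46.2
claude_grok_rate = rate(10, per_model)                              # reported 38.5
grok_crewai_rate = rate(2, ATTACKS)                                 # reported 15.4
```

An overall refusal rate of 41.5% leaves 58.5% of the 130 prompts not refused, which is the basis for the "more than half" statement.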
【96】A Roadmap for Applying Graph Neural Networks to Numerical Data: Insights from Cementitious Materials 标题:将图神经网络应用于数字数据的路线图:来自水泥基材料的见解 链接:https://arxiv.org/abs/2512.14855
作者:Mahmuda Sharmin,Taihao Han,Jie Huang,Narayanan Neithalath,Gaurav Sant,Aditya Kumar 摘要:机器学习(ML)已越来越多地应用于混凝土研究,以优化性能和配合比设计。然而,将ML应用于胶凝材料的一个主要挑战是可用数据库的有限大小和多样性。一个有前途的解决方案是开发多模态数据库,将数值和图形数据相结合。水泥研究中的传统ML框架通常限于单一数据模式。图神经网络(GNN)代表了新一代的神经架构,能够从结构化为图的数据中学习,通过不规则或拓扑依赖的连接而不是固定的空间坐标来捕获关系。虽然GNN本质上是为图形数据设计的,但它们可以适应从数值数据集中提取相关性,并可能将物理定律直接嵌入其架构中,从而实现可解释和物理信息的预测。这项工作是实施GNN来设计混凝土的首批研究之一,特别强调使用k-最近邻(K-NN)方法将表格数据转换为图形表示的清晰且可重复的路径。模型超参数和特征选择进行了系统优化,以提高预测性能。GNN显示出与基准随机森林相当的性能,许多研究已经证明了这一点,以产生可靠的预测水泥材料。总的来说,这项研究为从传统机器学习过渡到高级人工智能架构提供了基础路线图。所提出的框架为未来的多模态和物理信息GNN模型奠定了坚实的基础,这些模型能够捕获复杂的材料行为并加速水泥材料的设计和优化。 摘要:Machine learning (ML) has been increasingly applied in concrete research to optimize performance and mixture design. However, one major challenge in applying ML to cementitious materials is the limited size and diversity of available databases. A promising solution is the development of multi-modal databases that integrate both numerical and graphical data. Conventional ML frameworks in cement research are typically restricted to a single data modality. Graph neural network (GNN) represents a new generation of neural architectures capable of learning from data structured as graphs, capturing relationships through irregular or topology-dependent connections rather than fixed spatial coordinates. While GNN is inherently designed for graphical data, they can be adapted to extract correlations from numerical datasets and potentially embed physical laws directly into their architecture, enabling explainable and physics-informed predictions. This work is among the first few studies to implement GNNs to design concrete, with a particular emphasis on establishing a clear and reproducible pathway for converting tabular data into graph representations using the k-nearest neighbor (K-NN) approach. Model hyperparameters and feature selection are systematically optimized to enhance prediction performance. 
The GNN shows performance comparable to the benchmark random forest, which has been demonstrated by many studies to yield reliable predictions for cementitious materials. Overall, this study provides a foundational roadmap for transitioning from traditional ML to advanced AI architectures. The proposed framework establishes a strong foundation for future multi-modal and physics-informed GNN models capable of capturing complex material behaviors and accelerating the design and optimization of cementitious materials.
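The tabular-to-graph step can be sketched as follows: each row (one concrete mixture) becomes a node, and each node is linked to its k nearest rows in feature space. This is a minimal illustration under assumptions: Euclidean distance, directed edges, and features already scaled to [0, 1]. The sample rows are hypothetical and are not data from the study.

```python
import math


def knn_edges(rows, k):
    """Return directed edges (i, j) from each row to its k nearest other rows."""
    edges = []
    for i, a in enumerate(rows):
        # Sort the other rows by Euclidean distance; ties break on row index.
        neighbours = sorted(
            (math.dist(a, b), j) for j, b in enumerate(rows) if j != i
        )
        edges.extend((i, j) for _, j in neighbours[:k])
    return edges


# Hypothetical, pre-scaled feature rows (e.g. two mixture-design variables).
rows = [
    [0.0, 0.1],
    [0.1, 0.0],
    [0.9, 1.0],
    [1.0, 0.9],
]
edges = knn_edges(rows, k=1)
```

The resulting edge list, together with the rows as node features, is the usual input to a GNN library. Scaling matters because unscaled columns with large numeric ranges would dominate the distance.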
【97】MALCDF: A Distributed Multi-Agent LLM Framework for Real-Time Cyber 标题:MALCDF:用于实时网络的分布式多代理LLM框架 链接:https://arxiv.org/abs/2512.14846
作者:Arth Bhardwaj,Sia Godika,Yuvam Loonker 摘要:传统的集中式安全工具往往会漏掉自适应的多向量攻击。我们提出了多代理LLM网络防御框架(MALCDF),这是一个实用的设置,其中四个大语言模型(LLM)代理(检测、情报、响应和分析)实时协同工作。代理通过安全通信层(SCL)以加密的、本体对齐的消息进行通信,并产生便于审计的输出(例如,MITRE ATT&CK映射)。为了进行评估,我们保持测试的简单性和一致性:所有报告的指标都来自基于CICIDS2017特征模式的同一个50条记录的实时流。CICIDS2017用于配置(字段/模式)和训练实用的ML基线。ML-IDS基线是一个轻量级随机森林IDS(LRF-IDS),在CICIDS2017的一个子集上训练,并在50条记录流上测试,训练和测试记录之间没有重叠。在实验中,MALCDF达到90.0%的检测准确率、85.7%的F1分数和9.1%的假阳性率,平均每个事件的延迟为6.8s。它在准确性上优于轻量级ML-IDS基线和单LLM设置,同时保持端到端输出一致。总的来说,这一实践性构建表明,将简单的LLM代理与安全的本体对齐消息传递相协调可以改善实用的实时网络防御。 摘要:Traditional, centralized security tools often miss adaptive, multi-vector attacks. We present the Multi-Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents (Detection, Intelligence, Response, and Analysis) work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology-aligned messages, and produce audit-friendly outputs (e.g., MITRE ATT&CK mappings). For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50-record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML-IDS baseline is a Lightweight Random Forest IDS (LRF-IDS) trained on a subset of CICIDS2017 and tested on the 50-record stream, with no overlap between training and test records. In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1-score, and 9.1% false-positive rate, with 6.8s average per-event latency. It outperforms the lightweight ML-IDS baseline and a single-LLM setup on accuracy while keeping end-to-end outputs consistent. Overall, this hands-on build suggests that coordinating simple LLM agents with secure, ontology-aligned messaging can improve practical, real-time cyber defense.
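With only 50 test records, the three headline metrics pin down the confusion matrix quite tightly. The counts below are not reported in the abstract; they are one integer assignment, inferred here, that reproduces all three figures, and they show how coarse each metric is at this sample size.

```python
# Inferred confusion matrix (assumption, not reported by the authors).
tp, fp, fn, tn = 15, 3, 2, 30
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"accuracy: {100 * accuracy:.1f}%")
print(f"F1-score: {100 * f1:.1f}%")
print(f"FPR:      {100 * false_positive_rate:.1f}%")
```

Under these counts a single additional misclassified record would move accuracy by 2 percentage points, so differences against the baselines on the same 50-record stream should be read with that granularity in mind.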
【98】Let the Barbarians In: How AI Can Accelerate Systems Performance Research 标题:让野蛮人进来:人工智能如何加速系统性能研究 链接:https://arxiv.org/abs/2512.14806
作者:Audrey Cheng,Shu Liu,Melissa Pan,Zhifei Li,Shubham Agarwal,Mert Cemri,Bowen Wang,Alexander Krentsel,Tian Xia,Jongseok Park,Shuo Yang,Jeff Chen,Lakshya Agrawal,Ashwin Naren,Shulu Li,Ruiying Ma,Aditya Desai,Jiarong Xing,Koushik Sen,Matei Zaharia,Ion Stoica 摘要:人工智能(AI)通过自动发现新的解决方案,开始改变研究过程。这种转变取决于可靠验证器的可用性,人工智能驱动的方法需要验证候选解决方案。专注于提高系统性能的研究特别适合这种范式,因为系统性能问题自然会允许这样的验证器:候选人可以在真实系统或模拟器中实现,并根据预定义的工作负载进行评估。我们将这种生成、评估和细化的迭代周期称为AI驱动的系统研究(ADRS)。使用几个开源ADRS实例(即,OpenEvolve,GEPA和ShinkaEvolve),我们通过十个案例研究(例如,多区域云调度、混合专家负载平衡、基于LLM的SQL、事务调度),ADRS生成的解决方案可以匹配甚至优于人类最先进的设计。根据这些发现,我们概述了最佳实践(例如,水平的及时规范,反馈量,强大的评估),有效地使用ADRS,我们讨论了未来的研究方向及其影响。虽然我们还没有一个在所有系统研究中应用ADRS的通用配方,但我们希望我们的初步发现,以及我们确定的挑战,为未来的工作提供有意义的指导,因为研究人员的努力越来越多地转向问题制定和战略监督。注:本文是我们先前工作的扩展[14]。它增加了对多个ADRS框架的广泛评估,并提供了对最佳实践的更深入分析和见解。 摘要:Artificial Intelligence (AI) is beginning to transform the research process by automating the discovery of new solutions. This shift depends on the availability of reliable verifiers, which AI-driven approaches require to validate candidate solutions. Research focused on improving systems performance is especially well-suited to this paradigm because system performance problems naturally admit such verifiers: candidates can be implemented in real systems or simulators and evaluated against predefined workloads. We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). Using several open-source ADRS instances (i.e., OpenEvolve, GEPA, and ShinkaEvolve), we demonstrate across ten case studies (e.g., multi-region cloud scheduling, mixture-of-experts load balancing, LLM-based SQL, transaction scheduling) that ADRS-generated solutions can match or even outperform human state-of-the-art designs. Based on these findings, we outline best practices (e.g., level of prompt specification, amount of feedback, robust evaluation) for effectively using ADRS, and we discuss future research directions and their implications. 
Although we do not yet have a universal recipe for applying ADRS across all of systems research, we hope our preliminary findings, together with the challenges we identify, offer meaningful guidance for future work as researcher effort shifts increasingly toward problem formulation and strategic oversight. Note: This paper is an extension of our prior work [14]. It adds extensive evaluation across multiple ADRS frameworks and provides deeper analysis and insights into best practices.
【99】Sharing State Between Prompts and Programs 标题:提示与程序之间共享状态 链接:https://arxiv.org/abs/2512.14805
作者:Ellie Y. Cheng,Logan Weber,Tian Jin,Michael Carbin 摘要:大型语言模型(LLM)的兴起引入了一种新的编程类型:自然语言编程。通过编写指示LLM执行自然语言处理、代码生成、推理等的提示,用户用自然语言编写代码–自然语言代码–供LLM执行。 一个新兴的研究领域可以实现自然语言代码和Python等形式语言之间的互操作性。我们提出了一种新的编程抽象,共享程序状态,消除了手动工作,使自然语言代码和程序状态之间的互操作性。通过共享程序状态,程序员可以编写自然代码,直接编写程序变量,使用程序对象进行计算,并在程序中实现控制流。我们提出了一个模式,用于指定自然函数接口,扩展编程系统,以支持自然代码,并利用此模式指定共享程序状态作为一个自然函数接口。 我们在Nightjar编程系统中实现了共享程序状态。Nightjar允许程序员编写包含共享Python程序状态的自然代码的Python程序。我们表明,Nightjar程序实现了与手动编写的实现相当或更高的任务准确性(+4-19%),同时平均减少了39.6%的代码行。使用Nightjar的代价是它可能会产生运行时开销(手动实现的0.4- 4.3倍运行时)。 摘要:The rise of large language models (LLMs) has introduced a new type of programming: natural language programming. By writing prompts that direct LLMs to perform natural language processing, code generation, reasoning, etc., users are writing code in natural language – natural language code – for the LLM to execute. An emerging area of research enables interoperability between natural language code and formal languages such as Python. We present a novel programming abstraction, shared program state, that removes the manual work required to enable interoperability between natural language code and program state. With shared program state, programmers can write natural code that directly writes program variables, computes with program objects, and implements control flow in the program. We present a schema for specifying natural function interfaces that extend programming systems to support natural code and leverage this schema to specify shared program state as a natural function interface. We implement shared program state in the Nightjar programming system. Nightjar enables programmers to write Python programs that contain natural code that shares the Python program state. We show that Nightjar programs achieve comparable or higher task accuracy than manually written implementations (+4-19%), while decreasing the lines of code by 39.6% on average. 
The tradeoff to using Nightjar is that it may incur runtime overhead (0.4-4.3x runtime of manual implementations).
【100】Incentives or Ontology? A Structural Rebuttal to OpenAI’s Hallucination Thesis 标题:激励还是本体论?对OpenAI幻觉论点的结构性反驳 链接:https://arxiv.org/abs/2512.14801
作者:Richard Ackermann,Simeon Emanuilov 备注:17 pages, references to prior work arXiv:2509.16297 and arXiv:2511.06073 摘要:OpenAI最近认为,大型语言模型中的幻觉主要源于错位的评估激励,这些激励奖励自信的猜测,而不是认知上的谦逊。根据这种观点,幻觉是一种偶然的行为假象,可以通过改进基准和奖励结构来补救。在本文中,我们挑战这种解释。借鉴此前关于结构性幻觉的工作以及使用Licensing Oracle的实证实验,我们认为幻觉不是优化失败,而是Transformer模型在架构上的必然结果。Transformer并不表征世界;它们对token之间的统计关联进行建模。它们的嵌入空间形成了一个源自语言共现的伪本体,而不是指称世界的结构。在本体论的边界条件下(即训练数据稀疏或不连贯的区域),模型必然会插值出虚构的续写以保持连贯性。没有任何激励机制可以改变这种对模式补全的结构性依赖。我们的实证结果表明,幻觉只能通过外部的真值验证和弃答(abstention)模块来消除,而不能通过改变激励、提示或微调来消除。Licensing Oracle能够在各领域实现完美的弃答精度,正是因为它提供了Transformer所缺乏的事实依据(grounding)。我们的结论是,幻觉是生成架构的一种结构属性,可靠的人工智能需要将语言流畅性与认知责任区分开来的混合系统。 摘要:OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions (regions where training data is sparse or incoherent), the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. 
We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.
【101】IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection 标题:使用LLM生成IaC:错误分类和配置知识注入研究 链接:https://arxiv.org/abs/2512.14792
作者:Roman Nekrasov,Stefano Fossati,Indika Kumara,Damian Andrew Tamburri,Willem-Jan van den Heuvel 备注:Submitted to ACM 摘要:大型语言模型(LLM)目前在生成正确和意图一致的基础设施即代码(IaC)方面表现出较低的成功率。本研究通过系统地注入结构化配置知识,研究了改进基于LLM的IaC生成(特别是Terraform)的方法。为了促进这一点,现有的IaC-Eval基准测试通过云仿真和自动错误分析进行了显著增强。此外,一种新的错误分类法LLM辅助的IaC代码生成的开发。一系列的知识注入技术的实施和评估,进展从朴素的检索增强生成(RAG)更复杂的图形RAG方法。这些包括图形组件的语义丰富和建模资源间的依赖关系。实验结果表明,虽然基线LLM性能较差(总体成功率为27.1%),但注入结构化配置知识将技术验证成功率提高到75.3%,总体成功率提高到62.6%。尽管在技术正确性方面取得了这些进展,但意图对齐趋于稳定,揭示了一个“正确性一致性差距”,LLM可以成为熟练的“编码员”,但在实现细微差别的用户意图方面仍然是有限的“架构师”。 摘要:Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a “Correctness-Congruence Gap” where LLMs can become proficient “coders” but remain limited “architects” in fulfilling nuanced user intent.
【102】Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification 标题:提高VQA可靠性:具有自我反思和跨模型验证的双重评估方法 链接:https://arxiv.org/abs/2512.14770
作者:Xixian Wu,Yang Ou,Pengchao Tian,Zian Yang,Jielei Zhang,Peiyi Li,Longwen Gao 摘要:视觉语言模型(VLM)在视觉问答(VQA)中显示出巨大的潜力。然而,VLM对幻觉的敏感性可能导致过度自信但不正确的答案,严重破坏答案的可靠性。为了解决这个问题,我们提出了双重评估VLM可靠性(DAVR),一个新的框架,集成了自我反思和跨模型验证综合不确定性估计。DAVR框架具有双通道架构:一个通道利用双选择器模块通过将VLM潜在特征与QA嵌入融合来评估响应可靠性,而另一个通道部署外部参考模型进行事实交叉检查以减轻幻觉。在ICCV-CLVL 2025的可靠VQA挑战赛中,DAVR获得了39.64的领先$Φ_{100}$得分和97.22的100-AUC,确保了第一名,并证明了其在增强VLM响应可信度方面的有效性。 摘要:Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $Φ_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.
【103】Privacy-Preserving Feature Valuation in Vertical Federated Learning Using Shapley-CMI and PSI Permutation 标题:在垂直联邦学习中使用Shapley-CMI和PSI排列的隐私保护特征估值 链接:https://arxiv.org/abs/2512.14767
作者:Unai Laskurain,Aitor Aguirre-Ortuzar,Urko Zurutuza 备注:Presented at the 3rd IEEE International Conference on Federated Learning Technologies and Applications (FLTA25), October 2025 摘要:联合学习(FL)是一种新兴的机器学习范式,它使多方能够在不共享原始数据的情况下协作训练模型,从而确保数据隐私。在垂直FL(VFL)中,每一方都为相同的用户保留不同的特征,一个关键的挑战是在训练任何模型之前评估每一方的特征贡献,特别是在没有模型存在的早期阶段。为了解决这个问题,Shapley-CMI方法最近被提出作为一种无模型的信息理论方法,使用条件互信息(CMI)进行特征估值。然而,它的原始公式没有提供能够安全地计算所需的排列和交叉的实际实现。本文提出了一种新的隐私保护的Shapley-CMI的VFL实现。我们的系统引入了一个私有集合交集(PSI)服务器,该服务器执行所有必要的特征排列,并计算离散化和加密ID组之间的加密交集大小,而无需进行原始数据交换。然后,每一方使用这些交集结果来计算Shapley-CMI值,计算其特征的边际效用。初步实验证实了所提出的系统的正确性和隐私性,证明了其安全和有效的特征贡献估计VFL的可行性。这种方法可以确保数据的机密性,跨多方扩展,并在不需要共享原始数据或训练模型的情况下实现公平的数据评估。 摘要:Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, ensuring data privacy. In Vertical FL (VFL), where each party holds different features for the same users, a key challenge is to evaluate the feature contribution of each party before any model is trained, particularly in the early stages when no model exists. To address this, the Shapley-CMI method was recently proposed as a model-free, information-theoretic approach to feature valuation using Conditional Mutual Information (CMI). However, its original formulation did not provide a practical implementation capable of computing the required permutations and intersections securely. This paper presents a novel privacy-preserving implementation of Shapley-CMI for VFL. Our system introduces a private set intersection (PSI) server that performs all necessary feature permutations and computes encrypted intersection sizes across discretized and encrypted ID groups, without the need for raw data exchange. Each party then uses these intersection results to compute Shapley-CMI values, computing the marginal utility of their features. 
Initial experiments confirm the correctness and privacy of the proposed system, demonstrating its viability for secure and efficient feature contribution estimation in VFL. This approach ensures data confidentiality, scales across multiple parties, and enables fair data valuation without requiring the sharing of raw data or training models.
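The permutation-based marginal-utility computation mentioned in the abstract above can be illustrated with a minimal sketch of exact Shapley values. This is not the paper's system: the `TOY_UTILITY` table and the party names `A` and `B` are invented for illustration, whereas the paper derives utilities from conditional mutual information computed over encrypted intersection sizes, which is not reproduced here.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal utility of adding p to the players that precede it.
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical utility of each coalition of parties (illustrative values only).
TOY_UTILITY = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.3,
    frozenset({"A", "B"}): 0.6,
}

values = shapley_values(["A", "B"], TOY_UTILITY.__getitem__)
print(values)
```

The values always sum to the utility of the full coalition, which is the property that makes Shapley-style valuation attractive for splitting credit between parties. The cost grows factorially with the number of players, so this exhaustive form is only practical for a handful of parties or feature groups.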
【104】GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge 标题:GR-Agent:不完全知识下的自适应图推理Agent 链接:https://arxiv.org/abs/2512.14766
作者:Dongzhuoran Zhou,Yuqicheng Zhu,Xiaxia Wang,Hongkuan Zhou,Jiaoyan Chen,Steffen Staab,Yuan He,Evgeny Kharlamov 摘要:大型语言模型(LLM)在知识图问答(KGQA)上取得了很好的效果,但大多数基准测试都假设知识图(KG)是完整的,其中存在直接支持答案的三元组。这将评估降低到浅层检索,并忽略了不完整的KG的现实,其中许多事实缺失,答案必须从现有的事实中推断出来。为了弥合这一差距,我们提出了一种在KG不完备条件下构建基准的方法,该方法删除直接支持三元组,同时确保推断答案所需的替代推理路径仍然存在。在使用我们的方法构建的基准上进行的实验表明,现有方法在不完备条件下性能一致下降,突出了它们有限的推理能力。为了克服这一局限性,我们提出了自适应图推理代理(GR-Agent)。它首先从KG构造一个交互环境,然后将KGQA形式化为该环境中的代理与环境的交互。GR-Agent在包括图推理工具的动作空间上操作,并维护潜在支持推理证据的存储器,包括相关关系和推理路径。大量的实验表明,GR-Agent优于非训练基线,并且在完整和不完整设置下与基于训练的方法表现相当。 摘要:Large language models (LLMs) achieve strong results on knowledge graph question answering (KGQA), but most benchmarks assume complete knowledge graphs (KGs) where direct supporting triples exist. This reduces evaluation to shallow retrieval and overlooks the reality of incomplete KGs, where many facts are missing and answers must be inferred from existing facts. We bridge this gap by proposing a methodology for constructing benchmarks under KG incompleteness, which removes direct supporting triples while ensuring that alternative reasoning paths required to infer the answer remain. Experiments on benchmarks constructed using our methodology show that existing methods suffer consistent performance degradation under incompleteness, highlighting their limited reasoning ability. To overcome this limitation, we present the Adaptive Graph Reasoning Agent (GR-Agent). It first constructs an interactive environment from the KG, and then formalizes KGQA as agent-environment interaction within this environment. GR-Agent operates over an action space comprising graph reasoning tools and maintains a memory of potential supporting reasoning evidence, including relevant relations and reasoning paths. Extensive experiments demonstrate that GR-Agent outperforms non-training baselines and performs comparably to training-based methods under both complete and incomplete settings.
【105】Guided Discrete Diffusion for Constraint Satisfaction Problems 标题:约束满足问题的引导离散扩散 链接:https://arxiv.org/abs/2512.14765
作者:Justin Jung 备注:Originally published in Jan 2025 on the SpringtailAI Blog 摘要:我们提出了离散扩散指导约束满足问题(CSP),并证明了它的能力,解决数独难题没有监督。 摘要:We propose discrete diffusion guidance for constraint satisfaction problems (CSPs) and demonstrate its ability to solve Sudoku puzzles without supervision.
【106】Workflows vs Agents for Code Translation 标题:代码转换的工作流与代理 链接:https://arxiv.org/abs/2512.14762
作者:Henry Gray,Tom Yotam,Octavian Udrea 摘要:将算法从高级语言(如MATLAB)转换为硬件描述语言(HDL)是在FPGA和ASIC上部署时资源密集但必不可少的步骤。虽然大型语言模型(LLM)提供了一条自动化的道路,但它们对HDL代码的有限训练使得端到端转译变得脆弱,并且容易出现语法错误。我们比较了MATLAB到HDL管道中两种LLM驱动的语法修复方法:一种是结构化的、由专家设计的流程,遵循固定的操作顺序;另一种是更自主的代理方法,使用模型上下文协议(MCP)动态选择自己的工具。我们研究了42个MATLAB信号处理函数,并隔离了语法修复阶段。在三个模型规模上,代理方法在解决初始语法错误方面更有效,使更多的候选项得以继续通过管道。这一上游改进带来了可衡量的下游改进,尤其是在中型模型上,它将仿真到达率提高了20多个百分点。我们假设收益来自简短的提示、积极的上下文管理和有条件的工具使用。条件检索在8B和30B规模下有帮助;在235B规模下,最终成功率的增益很小,而朴素RAG变体获得了最高的最终成功率。我们的研究结果表明,这些代理框架如果设计得当,最能有效弥补中小型模型的能力限制。 摘要:Translating algorithms from high-level languages like MATLAB to hardware description languages (HDLs) is a resource-intensive but necessary step for deployment on FPGAs and ASICs. While large language models (LLMs) offer a path to automation, their limited training on HDL code makes end-to-end transpilation brittle and prone to syntax errors. We compare two LLM-driven methods for syntax repair in a MATLAB-to-HDL pipeline: a structured, expert-designed flow that follows a fixed sequence of operations, and a more autonomous agentic approach that uses the Model Context Protocol (MCP) to dynamically select its own tools. We study 42 MATLAB signal-processing functions and isolate the syntax-repair stage. Across three model scales, the agentic approach is more effective at resolving initial syntax errors, unblocking a greater number of candidates to proceed through the pipeline. This upstream improvement yields measurable downstream improvements, most notably on mid-sized models, where it increases the simulation reach rate by over 20 percentage points. We hypothesize the gains come from short prompts, aggressive context management, and conditional tool use. Conditional retrieval helps at 8B and 30B; at 235B final-success gains are small and a naive RAG variant attains the highest final success. 
Our findings suggest that these agentic frameworks, when properly designed, are most effective at compensating for the capacity limits of small and mid-sized models.
【107】CAPE: Capability Achievement via Policy Execution 标题:CAPE:通过政策执行实现能力 链接:https://arxiv.org/abs/2512.14761
作者:David Ball 备注:32 pages, 3 figures 摘要:现代人工智能系统缺乏表达和执行需求的方法。预训练产生智能,后训练优化偏好,但两者都不能保证模型可靠地满足明确的、依赖于上下文的约束。这种缺失的抽象解释了为什么高度智能的模型在部署中经常失败,尽管基准性能很强。 我们引入能力工程,即将需求转换为可执行的规范并训练模型使其默认满足这些规范的系统性实践。我们通过CAPE(通过策略执行实现能力)来落实这种实践,CAPE是一种实现“指定->验证->纠正->训练”循环的协议。 CAPE基于两个经验发现:(1)语境客观性,即看似主观的属性在语境固定后变为客观(注释者间一致性从kappa = 0.42上升到kappa = 0.98),以及(2)验证保真度缩放,即验证准确性随着模型规模而提高(r = 0.94),而偏好一致性则无论计算量如何都会停滞在30%到50%的分歧水平。在六个领域的109,500个示例中,CAPE相对于DPO将违规率降低了81%(标准差小于0.3%)。通过用可重用的规范替换每个示例的注释,CAPE将成本降低了5到20倍,并将时间从几个月缩短到几周。 我们在Apache 2.0下发布了CAPE协议、PredicateGraph模式、CPL规范语言和策略包。我们还推出了CapabilityBench,这是一个针对社区贡献的策略进行模型评估的公共注册表,将评估从智能基准转向能力测量。 摘要:Modern AI systems lack a way to express and enforce requirements. Pre-training produces intelligence, and post-training optimizes preferences, but neither guarantees that models reliably satisfy explicit, context-dependent constraints. This missing abstraction explains why highly intelligent models routinely fail in deployment despite strong benchmark performance. We introduce Capability Engineering, the systematic practice of converting requirements into executable specifications and training models to satisfy them by default. We operationalize this practice through CAPE (Capability Achievement via Policy Execution), a protocol implementing a Specify -> Verify -> Correct -> Train loop. CAPE is grounded in two empirical findings: (1) contextual objectivity, where properties appearing subjective become objective once context is fixed (inter-annotator agreement rises from kappa = 0.42 to kappa = 0.98), and (2) verification-fidelity scaling, where verification accuracy improves with model scale (r = 0.94), unlike preference agreement which plateaus at 30 to 50 percent disagreement regardless of compute. Across 109,500 examples in six domains, CAPE reduces violation rates by 81 percent relative to DPO (standard deviation less than 0.3 percent). 
By replacing per-example annotation with reusable specifications, CAPE reduces costs by 5 to 20 times and shortens timelines from months to weeks. We release the CAPE protocol, PredicateGraph schema, CPL specification language, and policy packs under Apache 2.0. We also launch CapabilityBench, a public registry of model evaluations against community-contributed policies, shifting evaluation from intelligence benchmarks toward capability measurement.
【108】Revisiting the Reliability of Language Models in Instruction-Following 标题:重新审视语言模型在指令遵循中的可靠性 链接:https://arxiv.org/abs/2512.14754
作者:Jianshuo Dong,Yutong Zhang,Yan Liu,Zhenyu Zhong,Tao Wei,Chao Zhang,Han Qiu 备注:Preprint 摘要:先进的LLM在IFEval等基准测试中达到了接近天花板的精度。然而,这些令人印象深刻的分数并不一定转化为现实世界中使用的可靠服务,用户经常改变他们的措辞,上下文框架和任务制定。在本文中,我们研究了细微差别导向的可靠性:模型是否表现出一致的能力,在表哥提示,传达类似的用户意图,但有微妙的细微差别。为了量化这一点,我们引入了一个新的指标,可靠的@k,并开发了一个自动化的管道,通过数据增强生成高质量的表亲提示。在此基础上,我们构建了IFEval++进行系统评估。在20个专有和26个开源LLM中,我们发现当前的模型在细微的可靠性方面表现出严重的不足-它们的性能可以通过细微的即时修改下降高达61.8%。更重要的是,我们描述它,并探索三个潜在的改进食谱。我们的研究结果强调了细微差别导向的可靠性,这是迈向更可靠和值得信赖的LLM行为的关键但未充分探索的下一步。我们的代码和基准测试可以访问:https://github.com/jianshuod/IFEval-pp。 摘要:Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability – their performance can drop by up to 61.8% with nuanced prompt modifications. What’s more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
【109】CODE ACROSTIC: Robust Watermarking for Code Generation 标题:代码ACROSTIC:用于代码生成的鲁棒水印 链接:https://arxiv.org/abs/2512.14753
作者:Li Lin,Siyuan Xin,Yang Cao,Xiaochun Cao 摘要:对大型语言模型(LLM)进行水印对于防止其滥用至关重要,包括制造假新闻,剽窃和垃圾邮件。由于LLM生成的代码往往包含知识产权,因此水印技术尤为重要。然而,我们发现现有的LLM生成的代码水印方法无法解决注释删除攻击。在这种情况下,攻击者可以简单地从生成的代码中删除注释而不影响其功能,这大大降低了现有代码水印技术的有效性。另一方面,将水印注入到代码中是具有挑战性的,因为如先前的工作所指出的,与自然语言相比,大多数代码表示低熵的情形。为了解决这个问题,我们利用先验知识来区分代码的低熵和高熵部分,如提示列表所示。然后,我们在提示列表的指导下注入水印,实现比现有方法更高的可检测性和可用性。我们在HumanEval上评估了我们提出的方法,并将我们的方法与三种最先进的代码水印技术进行了比较。实验结果验证了该方法的有效性。 摘要:Watermarking large language models (LLMs) is vital for preventing their misuse, including the fabrication of fake news, plagiarism, and spam. It is especially important to watermark LLM-generated code, as it often contains intellectual property. However, we found that existing methods for watermarking LLM-generated code fail to address comment removal attacks. In such cases, an attacker can simply remove the comments from the generated code without affecting its functionality, significantly reducing the effectiveness of current code-watermarking techniques. On the other hand, injecting a watermark into code is challenging because, as previous works have noted, most code represents a low-entropy scenario compared to natural language. Our approach to addressing this issue involves leveraging prior knowledge to distinguish between low-entropy and high-entropy parts of the code, as indicated by a Cue List of words. We then inject the watermark guided by this Cue List, achieving higher detectability and usability than existing methods. We evaluated our proposed method on HumanEval and compared our method with three state-of-the-art code watermarking techniques. The results demonstrate the effectiveness of our approach.
【110】Cyberswarm: a novel swarm intelligence algorithm inspired by cyber community dynamics 标题:Cyberswarm:一种受网络社区动态启发的新型群体智能算法 链接:https://arxiv.org/abs/2512.14752
作者:Abdelsadeq Elfergany,Ammar Adl,Mohammed Kayed 备注:49 pages, 15 figures 摘要:推荐系统在动态适应复杂社交网络中不断变化的用户偏好和交互方面面临挑战。传统的方法往往无法考虑网络社会系统内错综复杂的互动,缺乏在不同领域进行概括的灵活性,这突出表明需要更具适应性和通用性的解决方案。在这项工作中,我们介绍了一个通用的群体智能算法的推荐系统,旨在无缝地适应不同的应用程序。它受到社会心理学原理的启发。该框架在动态超图结构中对用户偏好和社区影响进行建模。它利用了基于中心的特征提取和Node2Vec嵌入。偏好演化由消息传递机制和分层图建模指导,从而能够实时适应不断变化的行为。实验结果表明,该算法在各种推荐任务,包括社交网络和内容发现的优越性能。关键指标,如命中率(HR),平均倒数排名(MRR)和归一化贴现累积增益(NDCG)在多个数据集上的表现始终优于基线方法。该模型对动态环境的适应性允许提出与上下文相关的精确建议。该算法代表了推荐系统的进步,通过桥接个人偏好和社区的影响。它的通用设计支持不同领域的应用,包括社交图、个性化学习和医疗图。这项工作突出了将群体智能与网络动态相结合的潜力,以解决推荐系统中复杂的优化挑战。 摘要:Recommendation systems face challenges in dynamically adapting to evolving user preferences and interactions within complex social networks. Traditional approaches often fail to account for the intricate interactions within cyber-social systems and lack the flexibility to generalize across diverse domains, highlighting the need for more adaptive and versatile solutions. In this work, we introduce a general-purpose swarm intelligence algorithm for recommendation systems, designed to adapt seamlessly to varying applications. It was inspired by social psychology principles. The framework models user preferences and community influences within a dynamic hypergraph structure. It leverages centrality-based feature extraction and Node2Vec embeddings. Preference evolution is guided by message-passing mechanisms and hierarchical graph modeling, enabling real-time adaptation to changing behaviors. Experimental evaluations demonstrated the algorithm’s superior performance in various recommendation tasks, including social networks and content discovery. Key metrics such as Hit Rate (HR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) consistently outperformed baseline methods across multiple datasets. The model’s adaptability to dynamic environments allowed for contextually relevant and precise recommendations. 
The proposed algorithm represents an advancement in recommendation systems by bridging individual preferences and community influences. Its general-purpose design enables applications in diverse domains, including social graphs, personalized learning, and medical graphs. This work highlights the potential of integrating swarm intelligence with network dynamics to address complex optimization challenges in recommendation systems.
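For readers unfamiliar with the ranking metrics named in the abstract above (HR, MRR, NDCG), here is a minimal sketch of how they are commonly computed when each user has exactly one held-out relevant item. The function names, the single-relevant-item setting, and the example list are assumptions for illustration; the paper's own evaluation protocol is not described in the abstract.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if the relevant item appears in the top-k list, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the relevant item (1-based), or 0.0 if it is absent."""
    for rank, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with one relevant item: the ideal DCG is 1, so NDCG equals DCG."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical recommendation list; the held-out item "y" is ranked second.
ranked_items = ["x", "y", "z"]
print(hit_rate_at_k(ranked_items, "y", 1))
print(hit_rate_at_k(ranked_items, "y", 2))
print(reciprocal_rank(ranked_items, "y"))
print(ndcg_at_k(ranked_items, "y", 3))
```

Averaging each function over all test users gives HR@k, MRR, and NDCG@k. All three reward placing the relevant item near the top, but MRR and NDCG also penalize lower positions within the top-k, while HR does not.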
【111】One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs 标题:一个漏洞:预训练模型暴露如何放大Finetuned LLM的越狱风险 链接:https://arxiv.org/abs/2512.14751
作者:Yixin Tan,Zhe Yu,Jun Sakuma 备注:17 pages 摘要:微调预训练的大型语言模型(LLM)已经成为开发下游应用程序的标准范例。然而,它的安全影响仍然不清楚,特别是关于微调的LLM是否从其预先训练的来源继承越狱漏洞。我们在一个现实的预训练到微调的威胁模型中研究这个问题,在这个模型中,攻击者可以白盒访问预训练的LLM,只能黑盒访问其微调的衍生物。实证分析表明,在预训练模型上优化的对抗性提示最有效地转移到其微调的变体,揭示了从预训练到微调的LLM的继承漏洞。为了进一步研究这种继承,我们进行了表征级探测,这表明可转移的提示在预训练的隐藏状态中是线性可分离的,这表明通用的可转移性被编码在预训练的表征中。基于这一见解,我们提出了探针引导投影(PGP)攻击,它将优化转向与可转移性相关的方向。跨多个LLM系列和各种微调任务的实验证实了PGP的强大传输成功,强调了预训练到微调范式中固有的安全风险。 摘要:Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP’s strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
【112】Factor(U,T): Controlling Untrusted AI by Monitoring their Plans 标题:Factor(U,T):通过监控不受信任的人工智能的计划来控制它们 链接:https://arxiv.org/abs/2512.14745
作者:Edward Lue Chee Lip,Anthony Channg,Diana Kim,Aaron Sandoval,Kevin Zhu 备注:Accepted to AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent). 6 pages body, 8 pages total, 3 figures 摘要:随着人工智能能力的发展,我们越来越依赖强大的模型来分解复杂的任务,但如果分解器本身是恶意的呢?因子化认知协议将复杂的任务分解为更简单的子任务:一个模型创建分解,而其他模型则单独实现子任务。以前的工作使用可信(较弱但可靠)模型进行分解,这限制了分解本身具有挑战性的任务的有用性。我们引入了Factor($U$,$T$),其中不可信(更强但潜在恶意)模型分解,而可信模型实现子任务。当监控器只观察自然语言任务指令而不是完整的解决方案时,是否可以检测到恶意活动?我们在BigCodeBench(Python编码任务的数据集)上的控制评估中对Factor($U$,$T$)进行了基线测试和红队测试。与评估完整Python解决方案的监视器(AUROC 0.96)相比,区分恶意和诚实分解的监视器(AUROC 0.52)表现不佳。此外,Factor($D$,$U$)使用受信任的分解器并监控具体的子解决方案,实现了出色的区分度(AUROC 0.96)和强安全性(1.2% ASR),这表明在仅分解监控失败的情况下,实现上下文监控可以成功。 摘要:As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks, but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.
【113】Quantum-Augmented AI/ML for O-RAN: Hierarchical Threat Detection with Synergistic Intelligence and Interpretability (Technical Report) 标题:用于O-RAN的量子增强AI/ML:具有协同智能和可解释性的分层威胁检测(技术报告) 链接:https://arxiv.org/abs/2512.14742
作者:Tan Le,Van Le,Sachin Shetty 摘要:开放无线接入网络(O-RAN)增强了模块化和遥测粒度,但也扩大了分散的控制、用户和管理平面的网络安全攻击面。我们提出了一个分层的防御框架,有三个协调层异常检测,入侵确认和多攻击分类,每个对齐O-RAN的遥测堆栈。我们的方法集成了混合量子计算和机器学习,利用基于幅度和纠缠的特征编码与深度和集成分类器。我们在合成和真实世界遥测中进行广泛的基准测试,评估编码深度,架构变体和诊断保真度。该框架始终实现近乎完美的准确率,高召回率和强类可分性。跨决策边界、概率裕度和潜在空间几何形状的多方面评估证实了其可解释性、鲁棒性以及在近RT和非RT RIC域中用于切片感知诊断和可扩展部署的准备。 摘要:Open Radio Access Networks (O-RAN) enhance modularity and telemetry granularity but also widen the cybersecurity attack surface across disaggregated control, user and management planes. We propose a hierarchical defense framework with three coordinated layers-anomaly detection, intrusion confirmation, and multiattack classification-each aligned with O-RAN’s telemetry stack. Our approach integrates hybrid quantum computing and machine learning, leveraging amplitude- and entanglement-based feature encodings with deep and ensemble classifiers. We conduct extensive benchmarking across synthetic and real-world telemetry, evaluating encoding depth, architectural variants, and diagnostic fidelity. The framework consistently achieves near-perfect accuracy, high recall, and strong class separability. Multi-faceted evaluation across decision boundaries, probabilistic margins, and latent space geometry confirms its interpretability, robustness, and readiness for slice-aware diagnostics and scalable deployment in near-RT and non-RT RIC domains.
【114】Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs 标题:LLM持续微调下的持续后门攻击 链接:https://arxiv.org/abs/2512.14741
作者:Jing Cui,Yufei Han,Jianbin Jiao,Junge Zhang 摘要:后门攻击将恶意行为嵌入到大型语言模型(LLM)中,使攻击者能够触发有害输出或绕过安全控制。然而,在用户驱动的部署后持续微调下植入后门的持久性很少被研究。大多数先前的工作只在模型发布时评估植入后门的有效性和泛化性,而经验证据表明,朴素注入的后门在更新后持久性会下降。在这项工作中,我们研究植入的后门是否以及如何通过多阶段部署后的微调持续存在。我们提出了P-Trojan,一个基于触发器的攻击算法,明确优化了重复更新的后门持久性。通过将中毒梯度与令牌嵌入上的干净任务的梯度对齐,植入的后门映射在后续更新期间不太可能被抑制或遗忘。理论分析表明,这种持续的后门攻击的可行性,经过不断的微调。在Qwen2.5和LLaMA 3系列LLM以及各种任务序列上进行的实验表明,P-Trojan在保持清洁任务准确性的同时实现了99%以上的持久性。我们的研究结果强调了在现实模型适应管道中需要持久性感知评估和更强的防御。 摘要:Backdoor attacks embed malicious behaviors into Large Language Models (LLMs), enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of the implanted backdoors under user-driven post-deployment continual fine-tuning has been rarely examined. Most prior works evaluate the effectiveness and generalization of implanted backdoors only at release time, and empirical evidence shows that the persistence of naively injected backdoors degrades after updates. In this work, we study whether and how implanted backdoors persist through a multi-stage post-deployment fine-tuning. We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates. By aligning poisoned gradients with those of clean tasks on token embeddings, the implanted backdoor mapping is less likely to be suppressed or forgotten during subsequent updates. Theoretical analysis shows the feasibility of such persistent backdoor attacks after continual fine-tuning. Experiments conducted on the Qwen2.5 and LLaMA3 families of LLMs, as well as diverse task sequences, demonstrate that P-Trojan achieves over 99% persistence while preserving clean-task accuracy. Our findings highlight the need for persistence-aware evaluation and stronger defenses in realistic model adaptation pipelines.
【115】Zero-Knowledge Audit for Internet of Agents: Privacy-Preserving Communication Verification with Model Context Protocol 标题:代理互联网的零知识审计:使用模型上下文协议的隐私保护通信验证 链接:https://arxiv.org/abs/2512.14737
作者:Guanlin Jing,Huayi Qi 摘要:现有的代理通信框架在提供可验证的审计跟踪而不损害代理交互的隐私和机密性方面面临着严重的限制。在受监管的环境中,保护代理通信隐私,同时确保可验证性,成为需要准确计费、合规性验证和问责制的应用程序的根本挑战。 我们引入了一个框架,审计代理通信,保持消息的私人,同时仍然检查他们遵循预期的规则。它将零知识证明与现有的模型上下文协议(MCP)配对,因此可以在不泄露其内容的情况下验证消息。该方法在轻量级网络中运行,与标准MCP交换保持兼容,并添加异步审计验证以确认格式和一般消息类型,而不暴露细节。 该框架支持代理之间的相互审计:一方可以检查通信内容和质量,而另一方可以验证使用指标,所有这些都不会泄露敏感信息。我们形式化的安全目标,并表明zk-MCP提供数据的真实性和通信隐私,实现有效的验证与可忽略的延迟开销。我们完全实现的框架,包括基于Circom的零知识证明生成和审计协议与MCP的双向通道集成,据我们所知,这是第一个隐私保护审计系统的代理通信,提供可验证的相互审计,而不暴露消息内容或损害代理隐私。 摘要:Existing agent communication frameworks face critical limitations in providing verifiable audit trails without compromising the privacy and confidentiality of agent interactions. The protection of agent communication privacy while ensuring auditability emerges as a fundamental challenge for applications requiring accurate billing, compliance verification, and accountability in regulated environments. We introduce a framework for auditing agent communications that keeps messages private while still checking they follow expected rules. It pairs zero-knowledge proofs with the existing Model Context Protocol (MCP) so messages can be verified without revealing their contents. The approach runs in lightweight networks, stays compatible with standard MCP exchanges, and adds asynchronous audit verification to confirm format and general message types without exposing specifics. The framework enables mutual audits between agents: one side can check communication content and quality while the other verifies usage metrics, all without revealing sensitive information. We formalize security goals and show that zk-MCP provides data authenticity and communication privacy, achieving efficient verification with negligible latency overhead. 
We fully implement the framework, including Circom-based zero-knowledge proof generation and an audit protocol integrated with MCP’s bidirectional channel, and, to our knowledge, this is the first privacy-preserving audit system for agent communications that offers verifiable mutual auditing without exposing message content or compromising agent privacy.
【116】INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT 标题:INFORM-CT:整合LLM和VLM以进行腹部CT中的意外发现管理 链接:https://arxiv.org/abs/2512.14732
作者:Idan Tankel,Nir Mazor,Rafi Brada,Christina LeBedis,Guy ben-Yosef 摘要:CT扫描中的偶然发现,虽然通常是良性的,但可能具有重要的临床意义,应按照既定的指南进行报告。放射科医生传统的人工检查既费时又多变。本文提出了一种新的框架,利用大型语言模型(LLM)和基础视觉语言模型(VLM)的规划和执行代理的方法,以提高效率和精度的偶然发现检测,分类和报告腹部CT扫描。鉴于腹部器官的医疗指南,管理偶然发现的过程是通过计划者-执行者框架自动化的。基于LLM的规划器使用预定义的基本函数生成Python脚本,而执行器运行这些脚本以通过VLM、分割模型和图像处理子例程执行必要的检查和检测。 我们证明了我们的方法的有效性,通过实验的CT腹部基准的三个器官,在一个全自动的端到端的方式。我们的研究结果表明,所提出的框架优于现有的纯基于VLM的方法在准确性和效率。 摘要:Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.
【117】Semantic Geometry for policy-constrained interpretation 标题:用于政策限制解释的语义几何 链接:https://arxiv.org/abs/2512.14731
作者:Nikit Phadke 摘要:我们提出了一个几何框架的政策约束的语义解释,可证明防止幻觉的承诺在高风险领域。语义表示为单位球面上的方向,证据被建模为证人向量集,和可接受的解释对应于球面凸区域。策略约束作为定义在同一流形上的显式先验引入,与证据几何分离。解释减少到约束优化的可接受的区域,与拒绝出现的拓扑必要的结果矛盾或政策排除。我们将这个框架连接到信息理论,贝叶斯推理和层理论语义,证明我们的复杂性界限是信息理论上最优的。对大规模监管金融数据的实证验证表明,多个政策制度的零幻觉批准是第一个大规模的结果。 摘要:We present a geometric framework for policy-constrained semantic interpretation that provably prevents hallucinated commitments in high-stakes domains. Semantic meaning is represented as direction on a unit sphere, evidence is modeled as sets of witness vectors, and admissible interpretations correspond to spherical convex regions. Policy constraints are introduced as explicit priors defined over the same manifold, separated from evidence geometry. Interpretation reduces to constrained optimization over admissible regions, with refusal emerging as a topologically necessary outcome under contradiction or policy exclusion. We connect this framework to information theory, Bayesian inference, and sheaf-theoretic semantics, proving that our complexity bounds are information-theoretically optimal. Empirical validation on large scale regulated financial data demonstrates zero hallucinated approvals across multiple policy regimes-the first such result at scale.
【118】A Critical Perspective on Finite Sample Conformal Prediction Theory in Medical Applications 标题:有限样本保形预测理论在医学应用中的批判性观点 链接:https://arxiv.org/abs/2512.14727
作者:Klaus-Rudolf Kladny,Bernhard Schölkopf,Lisa Koch,Christian F. Baumgartner,Michael Muehlebach 摘要:机器学习(ML)正在改变医疗保健,但安全的临床决策需要可靠的不确定性估计,而标准ML模型无法提供。共形预测(CP)是一种流行的工具,它允许用户将启发式不确定性估计转换为具有统计保证的不确定性估计。CP的工作原理是将ML模型的预测与校准样本一起转换为预测集,这些预测集保证包含具有任何期望概率的真实标签。一个经常被引用的优点是CP理论适用于任意大小的校准样本,这表明即使只有小的校准集可用,也可以实现具有实际意义的统计保证的不确定性估计。我们质疑这一承诺表明,虽然统计保证适用于任意大小的校准集,这些保证的实际效用在很大程度上取决于校准集的大小。这种观察在医学领域是相关的,因为数据通常是稀缺的,因此获得大的校准集是不可行的。我们证实了我们的批评在医学图像分类任务的实证演示。 摘要:Machine learning (ML) is transforming healthcare, but safe clinical decisions demand reliable uncertainty estimates that standard ML models fail to provide. Conformal prediction (CP) is a popular tool that allows users to turn heuristic uncertainty estimates into uncertainty estimates with statistical guarantees. CP works by converting predictions of a ML model, together with a calibration sample, into prediction sets that are guaranteed to contain the true label with any desired probability. An often cited advantage is that CP theory holds for calibration samples of arbitrary size, suggesting that uncertainty estimates with practically meaningful statistical guarantees can be achieved even if only small calibration sets are available. We question this promise by showing that, although the statistical guarantees hold for calibration sets of arbitrary size, the practical utility of these guarantees does highly depend on the size of the calibration set. This observation is relevant in medical domains because data is often scarce and obtaining large calibration sets is therefore infeasible. We corroborate our critique in an empirical demonstration on a medical image classification task.
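One standard way to see why calibration-set size matters, offered here as a hedged sketch rather than as the authors' own analysis, is to look at split conformal prediction. With n calibration scores and miscoverage level alpha, the threshold is the k-th smallest score with k = ceil((n + 1)(1 - alpha)). Assuming exchangeable, continuous scores (no ties), the coverage obtained from one particular calibration set is known to follow a Beta(k, n + 1 - k) distribution, so its spread shrinks only as n grows. The function names below are illustrative.

```python
import math


def conformal_rank(n, alpha):
    """Rank k of the calibration score used as threshold, or None if k > n.

    None means the threshold is infinite: the prediction set must contain
    every label, which is valid but uninformative.
    """
    k = math.ceil((n + 1) * (1 - alpha))
    return k if k <= n else None


def coverage_std(n, alpha):
    """Std. dev. of realized coverage across calibration sets, Beta(k, n+1-k).

    Assumes exchangeable, continuous (tie-free) conformity scores.
    """
    k = conformal_rank(n, alpha)
    if k is None:
        return 0.0  # trivial sets always cover, so coverage is exactly 1
    a, b = k, n + 1 - k
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Target 90% coverage with calibration sets of increasing size.
for n in (5, 10, 100, 1000, 10000):
    print(n, conformal_rank(n, 0.1), round(coverage_std(n, 0.1), 4))
```

The marginal guarantee (coverage of at least 1 - alpha on average over calibration sets) holds for every n, but for very small n the sets are either trivial or the coverage actually delivered by a given calibration set can sit far from the nominal level.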
【119】Quantum Decision Transformers (QDT): Synergistic Entanglement and Interference for Offline Reinforcement Learning 标题:量子决策转换器(QDT):离线强化学习的协同纠缠和干扰 链接:https://arxiv.org/abs/2512.14726
作者:Abraham Itzhak Weinberg 摘要:离线强化学习能够在没有环境交互的情况下从预先收集的数据集进行策略学习,但现有的决策Transformer(DT)架构难以处理长期信用分配和复杂的状态-动作依赖关系。我们介绍了量子决策Transformer(QDT),一种新的架构,将量子启发的计算机制,以解决这些挑战。我们的方法集成了两个核心组件:具有捕获非局部特征相关性的纠缠操作的量子启发注意力,以及具有多路径处理和可学习干扰的量子前馈网络,用于自适应计算。通过对连续控制任务的综合实验,与标准DT相比,我们展示了超过2,000%的性能改进,在不同的数据质量上具有出色的泛化能力。至关重要的是,我们的消融研究揭示了量子激发的组件之间的强大协同效应:两者都不能单独实现竞争性性能,但它们的组合产生了远远超过单个贡献的显着改善。这种协同作用表明,有效的量子架构设计需要相互依赖的机制的整体协同设计,而不是采用模块化组件。我们的分析确定了三个关键的计算优势:通过非本地相关性增强信用分配,通过并行处理的隐式集成行为,以及通过可学习的干扰自适应资源分配。这些发现确立了量子启发的设计原则,作为在顺序决策中推进Transformer架构的一个有前途的方向,其影响超出了强化学习,更广泛地扩展到神经架构设计。 摘要:Offline reinforcement learning enables policy learning from pre-collected datasets without environment interaction, but existing Decision Transformer (DT) architectures struggle with long-horizon credit assignment and complex state-action dependencies. We introduce the Quantum Decision Transformer (QDT), a novel architecture incorporating quantum-inspired computational mechanisms to address these challenges. Our approach integrates two core components: Quantum-Inspired Attention with entanglement operations that capture non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation. Through comprehensive experiments on continuous control tasks, we demonstrate over 2,000% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Critically, our ablation studies reveal strong synergistic effects between quantum-inspired components: neither alone achieves competitive performance, yet their combination produces dramatic improvements far exceeding individual contributions. This synergy demonstrates that effective quantum-inspired architecture design requires holistic co-design of interdependent mechanisms rather than modular component adoption. 
Our analysis identifies three key computational advantages: enhanced credit assignment through non-local correlations, implicit ensemble behavior via parallel processing, and adaptive resource allocation through learnable interference. These findings establish quantum-inspired design principles as a promising direction for advancing transformer architectures in sequential decision-making, with implications extending beyond reinforcement learning to neural architecture design more broadly.
【120】Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion 标题:生成性城市流建模:从几何到带有图形扩散的气流 链接:https://arxiv.org/abs/2512.14725
作者:Francisco Giral,Álvaro Manzano,Ignacio Gómez,Petros Koumoutsakos,Soledad Le Clainche 摘要:城市风场模拟在城市空气质量评价和城市可持续发展规划中具有重要作用。建模和仿真的一个关键挑战是处理城市景观的复杂几何形状。低阶模型在捕捉几何形状的影响方面受到限制,而高保真计算流体动力学(CFD)模拟非常昂贵,特别是在多个几何形状或风力条件下。在这里,我们提出了一个生成扩散框架,用于在非结构化网格上合成稳态城市风场,只需要几何信息。该框架将分层图神经网络与基于分数的扩散建模相结合,以生成准确和多样化的速度场,而无需时间展开或密集测量。该模型在多个网格切片和风向角上进行训练,可以推广到未见过的几何形状,恢复关键的流动结构,如尾流和再循环区,并提供不确定性感知的预测。消融研究证实了模型对网格变化的鲁棒性以及在不同推理设置下的性能。这项工作是建立建筑环境基础模型的第一步,可以帮助城市规划者在城市致密化和气候不确定性下快速评估设计决策。 摘要:Urban wind flow modeling and simulation play an important role in air quality assessment and sustainable city planning. A key challenge for modeling and simulation is handling the complex geometries of the urban landscape. Low order models are limited in capturing the effects of geometry, while high-fidelity Computational Fluid Dynamics (CFD) simulations are prohibitively expensive, especially across multiple geometries or wind conditions. Here, we propose a generative diffusion framework for synthesizing steady-state urban wind fields over unstructured meshes that requires only geometry information. The framework combines a hierarchical graph neural network with score-based diffusion modeling to generate accurate and diverse velocity fields without requiring temporal rollouts or dense measurements. Trained across multiple mesh slices and wind angles, the model generalizes to unseen geometries, recovers key flow structures such as wakes and recirculation zones, and offers uncertainty-aware predictions. Ablation studies confirm robustness to mesh variation and performance under different inference regimes. This work is a first step towards foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty.
【121】HATSolver: Learning Groebner Bases with Hierarchical Attention Transformers 标题:HATSolver:用分层注意力Transformer学习Groebner基 链接:https://arxiv.org/abs/2512.14722
作者:Mohamed Malhou,Ludovic Perret,Kristin Lauter 摘要:在NeurIPS 2024上,Kera等人介绍了使用Transformers来计算Groebner基,Groebner基是计算机代数中具有许多实际应用的核心对象。在本文中,我们改进了这种方法,应用层次注意力Transformers(HAT),通过Groebner基计算求解多元多项式方程组。HAT架构引入了树结构的归纳偏置,能够建模数据中存在的层次关系,因此与传统的扁平注意力模型相比实现了显著的计算节省。我们推广到任意深度,并给出详细的计算成本分析。结合课程学习,我们的方法解决了比Kera等人(2024,Learning to compute Groebner bases)中大得多的实例。 摘要:At NeurIPS 2024, Kera et al. introduced the use of transformers for computing Groebner bases, a central object in computer algebra with numerous practical applications. In this paper, we improve this approach by applying Hierarchical Attention Transformers (HATs) to solve systems of multivariate polynomial equations via Groebner bases computation. The HAT architecture incorporates a tree-structured inductive bias that enables the modeling of hierarchical relationships present in the data and thus achieves significant computational savings compared to conventional flat attention models. We generalize to arbitrary depths and include a detailed computational cost analysis. Combined with curriculum learning, our method solves instances that are much larger than those in Kera et al. (2024, Learning to compute Groebner bases).
【122】SoMe: A Realistic Benchmark for LLM-based Social Media Agents 标题:SoMe:基于LLM的社交媒体代理的现实基准 链接:https://arxiv.org/abs/2512.14720
作者:Dizhan Xue,Jing Cui,Shengsheng Qian,Chuanrui Hu,Changsheng Xu 备注:Accepted by AAAI 2026 摘要:由大型语言模型(LLM)驱动的智能代理最近展示了令人印象深刻的功能,并在社交媒体平台上越来越受欢迎。虽然LLM代理正在重塑社交媒体的生态,但目前在对其理解媒体内容,理解用户行为和做出复杂决策的能力进行全面评估方面存在差距。为了应对这一挑战,我们引入了SoMe,这是一个开创性的基准,旨在评估配备了各种代理工具的社交媒体代理,用于访问和分析社交媒体数据。SoMe包含来自各种社交媒体平台和外部网站的8个社交媒体代理任务,9,164,284个帖子,6,591个用户配置文件和25,686个报告,以及17,869个精心注释的任务查询。与现有的社交媒体任务的数据集和基准相比,SoMe是第一个为基于LLM的社交媒体代理提供多功能和现实的平台,以处理各种社交媒体任务。通过广泛的定量和定性分析,我们提供了第一个概述洞察主流代理LLM在现实的社交媒体环境中的性能,并确定了几个限制。我们的评估表明,目前的闭源和开源LLM都不能令人满意地处理社交媒体代理任务。SoMe为未来的社交媒体代理提供了一个具有挑战性但有意义的测试平台。我们的代码和数据可在https://github.com/LivXue/SoMe上获得 摘要:Intelligent agents powered by large language models (LLMs) have recently demonstrated impressive capabilities and gained increasing popularity on social media platforms. While LLM agents are reshaping the ecology of social media, there exists a current gap in conducting a comprehensive evaluation of their ability to comprehend media content, understand user behaviors, and make intricate decisions. To address this challenge, we introduce SoMe, a pioneering benchmark designed to evaluate social media agents equipped with various agent tools for accessing and analyzing social media data. SoMe comprises a diverse collection of 8 social media agent tasks, 9,164,284 posts, 6,591 user profiles, and 25,686 reports from various social media platforms and external websites, with 17,869 meticulously annotated task queries. Compared with the existing datasets and benchmarks for social media tasks, SoMe is the first to provide a versatile and realistic platform for LLM-based social media agents to handle diverse social media tasks. By extensive quantitative and qualitative analysis, we provide the first overview insight into the performance of mainstream agentic LLMs in realistic social media environments and identify several limitations. Our evaluation reveals that both the current closed-source and open-source LLMs cannot handle social media agent tasks satisfactorily. 
SoMe provides a challenging yet meaningful testbed for future social media agents. Our code and data are available at https://github.com/LivXue/SoMe
【123】Hybrid Attribution Priors for Explainable and Robust Model Training 标题:用于可解释且稳健模型训练的混合归因先验 链接:https://arxiv.org/abs/2512.14719
作者:Zhuoran Zhang,Feng Zhang,Shangyuan Li,Yang Shi,Yuanxing Zhang,Wei Chen,Tengjiao Wang,Kam-Fai Wong 备注:15 pages 摘要:小型语言模型(SLM)广泛用于需要低延迟和轻量级部署的任务,特别是分类。随着可解释性和鲁棒性越来越重要,通过在训练期间引入基于归因的监督,推理引导的学习已经成为一个有效的框架;然而,获得一般和可靠的归因先验仍然是一个重大挑战。通过分析分类设置中的代表性归因方法,我们发现,虽然这些方法可以可靠地突出类相关的令牌,他们往往集中在语义相似的类共享的共同关键字。由于这些类别在标准训练下已经很难区分,因此这些属性提供的区分线索不足,限制了它们改善模型区分的能力。为了克服这一限制,我们提出了类感知属性先验(CAP),一种新的属性先验提取框架,引导语言模型捕捉细粒度的类区分,并产生更突出的,有区别的属性先验。在此基础上,我们进一步引入CAP混合,它结合了来自CAP的先验知识与现有的归因技术,形成一个更全面和平衡的监督信号。通过将模型的自我归因与这些丰富的先验知识相结合,我们的方法鼓励学习各种与决策相关的特征。在完整数据,Few-Shot和对抗场景中的大量实验表明,我们的方法始终增强了可解释性和鲁棒性。 摘要:Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. 
By aligning a model’s self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
【124】SEED: Spectral Entropy-Guided Evaluation of Spatial-Temporal Dependencies for Multivariate Time Series Forecasting 标题:SEED:多元时间序列预测的时空相依性的谱熵引导评估 链接:https://arxiv.org/abs/2512.14718
作者:Feng Xiong,Zongxia Xie,Yanru Sun,Haoyu Wang,Jianhong Lin 摘要:有效的多变量时间序列预测往往受益于准确建模复杂的变量间的依赖关系。然而,现有的基于注意力或图的方法面临三个关键问题:(a)强时间自依赖性经常被不相关的变量破坏;(b)softmax归一化忽略并逆转负相关性;(c)变量难以感知它们的时间位置。为了解决这些问题,我们提出了SEED,一个用于时空依赖性建模的谱熵引导评估框架。SEED引入了依赖性评估器,这是一项关键创新,它利用谱熵动态地提供每个变量的空间和时间依赖性的初步评估,使模型能够自适应地平衡通道独立性(CI)和通道依赖性(CD)策略。为了处理源于其他变量影响而非内在动力学的时间规律性,我们提出了基于谱熵的融合器来进一步细化评估的依赖权重,有效地分离这一部分。此外,为了保持负相关性,我们引入了一个Signed Graph Constructor,它可以启用带符号的边权重,克服了softmax的局限性。最后,为了帮助变量感知它们的时间位置,从而构建更全面的空间特征,我们引入了上下文空间提取器,它利用本地上下文窗口来提取空间特征。在来自不同应用领域的12个真实数据集上进行的大量实验表明,SEED实现了最先进的性能,验证了其有效性和通用性。 摘要:Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose SEED, a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. 
Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
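As a rough illustration of the quantity SEED builds on, the sketch below computes a normalized spectral entropy for a single variable: the Shannon entropy of its power spectrum, scaled to [0, 1]. The abstract does not say how SEED turns this score into dependency weights, so the function name `spectral_entropy`, the mean removal, and the normalization are assumptions made for illustration only, not the authors' implementation.

```python
import cmath
import math


def spectral_entropy(series):
    """Normalized Shannon entropy of the power spectrum of a real series.

    Values near 0 mean the energy sits in a few frequencies (regular signal);
    values near 1 mean the spectrum is close to flat (noise-like signal).
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the DC component
    # Naive O(n^2) DFT over the positive frequencies 1 .. n//2.
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0  # constant or too-short series: no spectral spread to measure
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))
```

A pure sinusoid scores close to 0 and white noise scores close to 1, which is the kind of per-variable signal that could help decide how much a variable should rely on its own history versus other channels.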
【125】How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection 标题:一个比特如何成为一个故事:通过可微分故障注入实现语义引导 链接:https://arxiv.org/abs/2512.14715
作者:Zafaryab Haider,Md Hafizur Rahman,Shane Moeykens,Vijay Devabhaktuni,Prabuddha Chakraborty 摘要:来自恶意电路或错误的难以检测的硬件位翻转已经被证明会使Transformers在非生成任务中变得脆弱。这项工作首次研究了对用于图像字幕的大型语言模型(LLM)权重进行低层次的按位扰动(故障注入)如何影响其生成描述的语义含义,同时保持语法结构。虽然先前的故障分析方法已经表明,翻转几个位可能会使分类器崩溃或降低准确性,但这些方法忽略了生成系统的语义和语言维度。在图像字幕模型中,单个位翻转可能会微妙地改变视觉特征映射到单词的方式,改变人工智能对世界的整个叙述。我们假设,这种语义漂移不是随机的,而是可以通过微分估计的。也就是说,模型自身的梯度可以预测哪些位,如果被扰动,将最强烈地影响意义,同时保持语法和流畅性不变。我们设计了一个可微分的故障分析框架BLADE(Bit-level Fault Analysis via Differentiable Estimation),使用基于梯度的敏感性估计来定位语义关键位,然后通过字幕级的语义-流畅性目标来细化对这些位的选择。我们的目标不仅仅是破坏字幕,而是要了解意义本身是如何在比特级别编码、分布和改变的,揭示即使是难以察觉的低级变化也可以引导生成视觉语言模型的高级语义。它还通过揭示结构化位级故障如何重塑模型的语义输出,为鲁棒性测试、对抗性防御和可解释的AI开辟了道路。 摘要:Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model’s own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. 
Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model’s semantic output.
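To make the bit-level perturbation concrete, the sketch below flips one bit of a weight stored as an IEEE 754 float32 and scores the flip with a first-order estimate, |gradient × weight change|. The abstract only says BLADE uses gradient-based sensitivity estimation, so `first_order_impact` is one plausible reading and an assumption here, and both function names are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 (bit 0 = lowest mantissa bit, bit 31 = sign)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def first_order_impact(weight: float, grad: float, bit: int) -> float:
    """First-order Taylor estimate of the loss change caused by one bit flip.

    Assumption for illustration: impact ~ |dL/dw * (w_flipped - w)|. The estimate
    is only meaningful for small changes; exponent-bit flips can be huge.
    """
    return abs(grad * (flip_bit(weight, bit) - weight))
```

Flipping a low mantissa bit barely moves the weight, while flipping the sign bit or an exponent bit changes it drastically, which is why ranking bits by estimated impact is a natural first step before any finer, caption-level selection.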
【126】Improving Underwater Acoustic Classification Through Learnable Gabor Filter Convolution and Attention Mechanisms 标题:通过可学习的Gabor滤波器卷积和注意力机制改进水下声学分类 链接:https://arxiv.org/abs/2512.14714
作者:Lucas Cesar Ferreira Domingos,Russell Brinkworth,Paulo Eduardo Santos,Karl Sammut 摘要:水声目标的远程检测和分类是环境监测和防御的关键。然而,船舶辐射和环境水下噪声的复杂性对精确的信号处理提出了重大挑战。虽然机器学习的最新进展提高了分类准确性,但数据集可用性有限和缺乏标准化实验等问题阻碍了泛化和鲁棒性。本文介绍了GSE ResNeXt,这是一种深度学习架构,它将可学习的Gabor卷积层与通过挤压和激发注意力机制增强的ResNeXt主干集成在一起。Gabor滤波器用作二维自适应带通滤波器,扩展了特征通道表示。它与通道注意力的结合提高了训练稳定性和收敛性,同时增强了模型提取区分特征的能力。该模型在三个复杂度递增的分类任务上进行了评估。特别地,本文探讨了训练数据与测试数据之间时间差异的影响,揭示了船舶和传感器之间的距离显著影响性能。结果表明,GSE ResNeXt在分类性能方面始终优于Xception,ResNet和MobileNetV2等基线模型。关于稳定性和收敛性,在模型的初始层中添加Gabor卷积使训练时间减少了28%。这些结果强调了信号处理策略在不同环境条件下,特别是在数据有限的水下声学分类情况下,提高模型的可靠性和通用性的重要性。未来的发展应侧重于减轻环境因素对输入信号的影响。 摘要:Remotely detecting and classifying underwater acoustic targets is critical for environmental monitoring and defence. However, the complex nature of ship-radiated and environmental underwater noise poses significant challenges to accurate signal processing. While recent advancements in machine learning have improved classification accuracy, issues such as limited dataset availability and a lack of standardised experimentation hinder generalisation and robustness. This paper introduces GSE ResNeXt, a deep learning architecture integrating learnable Gabor convolutional layers with a ResNeXt backbone enhanced by squeeze-and-excitation attention mechanisms. The Gabor filters serve as two-dimensional adaptive band-pass filters, extending the feature channel representation. Their combination with channel attention improves training stability and convergence while enhancing the model’s ability to extract discriminative features. The model is evaluated on three classification tasks of increasing complexity. In particular, the impact of temporal differences between the training and testing data is explored, revealing that the distance between the vessel and sensor significantly affects performance. Results show that GSE ResNeXt consistently outperforms baseline models such as Xception, ResNet, and MobileNetV2 in classification performance. 
Regarding stability and convergence, adding Gabor convolutions in the initial layers of the model yields a 28% reduction in training time. These results emphasise the importance of signal processing strategies in improving the reliability and generalisation of models under different environmental conditions, especially in data-limited underwater acoustic classification scenarios. Future developments should focus on mitigating the impact of environmental factors on input signals.
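For readers unfamiliar with Gabor filters, the sketch below builds the real part of a standard 2-D Gabor kernel: a Gaussian envelope multiplied by a cosine carrier. In a learnable Gabor layer the parameters `sigma`, `theta`, and `wavelength` would presumably be trained rather than fixed; that mapping, the function name, and the parameter names are assumptions for illustration and are not taken from the paper.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=1.0, phase=0.0):
    """Real part of a 2-D Gabor filter sampled on a size x size grid.

    sigma: width of the Gaussian envelope; theta: orientation in radians;
    wavelength: period of the cosine carrier; gamma: spatial aspect ratio.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a centre tap")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

Each choice of orientation and wavelength selects a band of spatial frequencies, which is why a bank of such kernels acts as a set of band-pass filters over a time-frequency input.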
【127】SepsisSuite: Beyond Risk Stratification – A Comparative Analysis of Deep Fusion vs. Expert Stacking for Prescriptive Sepsis AI 标题:败血症套件:超越风险分层–处方败血症AI的深度融合与专家堆叠的比较分析 链接:https://arxiv.org/abs/2512.14712
作者:Ryan Cartularo 备注:7 Pages, 4 Tables, 9 Figures 摘要:脓毒症占全球ICU入院人数的近20%,但传统的预测模型往往无法有效地整合异构数据流,要么被模态孤立,要么依赖于脆弱的早期融合。在这项工作中,我们提出了一个严格的架构比较端到端的深度融合和上下文感知堆叠的败血症任务。我们最初假设,一种新的四模态分层门控注意力网络-称为SepsisFusionFormer-将解决生命体征,文本和成像之间复杂的跨模态相互作用。然而,MIMIC-IV的实验显示,SepsisFusionFormer在小抗生素组群中遭受“注意力饥饿”($N \approx 2,100$),导致过拟合(AUC 0.66)。这个违反直觉的结果为SepsisLateFusion的设计提供了信息,SepsisLateFusion是一种“更精简”的上下文感知专家混合(MoE)架构。通过将模态视为正交专家-“Historian”(静态),“Monitor”(时间)和“Reader”(NLP)-并通过CatBoost元学习器动态门控它们,我们实现了最先进的(SOTA)性能:临床发作前4小时预测的AUC为0.915。通过校准临床安全性的决策阈值,我们将漏诊病例相对于默认操作点减少了48%,从而为及时干预打开了真正的预防窗口,而非被动警报。此外,对于多类抗生素选择的新处方任务,我们证明了四模态集成(Quad-Modal Ensemble)实现了最高性能(0.72 AUC)。这些模型集成到SepsisSuite中,SepsisSuite是一个用于临床决策支持的部署就绪Python框架。SepsisSuite可在https://github.com/RyanCartularo/SepsisSuite-Info免费获得 摘要:Sepsis accounts for nearly 20% of global ICU admissions, yet conventional prediction models often fail to effectively integrate heterogeneous data streams, remaining either siloed by modality or reliant on brittle early fusion. In this work, we present a rigorous architectural comparison between End-to-End Deep Fusion and Context-Aware Stacking for sepsis tasks. We initially hypothesized that a novel Quad-Modal Hierarchical Gated Attention Network – termed SepsisFusionFormer – would resolve complex cross-modal interactions between vitals, text, and imaging. However, experiments on MIMIC-IV revealed that SepsisFusionFormer suffered from “attention starvation” in the small antibiotic cohort ($N \approx 2,100$), resulting in overfitting (AUC 0.66). This counterintuitive result informed the design of SepsisLateFusion, a “leaner” Context-Aware Mixture-of-Experts (MoE) architecture. By treating modalities as orthogonal experts – the “Historian” (Static), the “Monitor” (Temporal), and the “Reader” (NLP) – and dynamically gating them via a CatBoost meta-learner, we achieved State-of-the-Art (SOTA) performance: 0.915 AUC for prediction 4 hours prior to clinical onset. 
By calibrating the decision threshold for clinical safety, we reduced missed cases by 48% relative to the default operating point, thus opening a true preventative window for timely intervention over reactive alerts. Furthermore, for the novel prescriptive task of multi-class antibiotic selection, we demonstrate that a Quad-Modal Ensemble achieved the highest performance (0.72 AUC). These models are integrated into SepsisSuite, a deployment-ready Python framework for clinical decision support. SepsisSuite is available for free at: https://github.com/RyanCartularo/SepsisSuite-Info
【128】Promoting Fairness in Information Access within Social Networks 标题:促进社交网络内信息获取的公平性 链接:https://arxiv.org/abs/2512.14711
作者:Changan Liu,Xiaotian Zhou,Ahad N. Zehmakan,Zhongzhi Zhang 备注:Accepted by ICDE 2026 摘要:在线社交网络的出现促进了信息的快速和广泛传播。然而,一些用户,特别是少数群体的成员,由于他们处于不利的网络位置,可能不太可能接收到在网络上传播的信息。我们研究的优化问题,增加新的连接到一个网络,以提高公平性,在不同的人口群体之间的信息访问。我们提供了这个问题的具体公式,其中信息访问是根据电阻距离来测量的,提供了一个新的视角,强调全局网络结构和多路径连接。该问题被证明是NP难的。我们提出了一个简单的贪婪算法,结果输出准确的解决方案,但它的运行时间是立方的,这使得它不适合大型网络。作为我们的主要技术贡献,我们利用几种新颖的近似技术将其时间复杂度降低到线性。除了我们的理论研究结果外,我们还使用真实世界和合成数据集进行了一系列广泛的实验。我们证明,我们的线性时间算法可以产生精确的解决方案,网络与数百万个节点。 摘要:The advent of online social networks has facilitated fast and wide spread of information. However, some users, especially members of minority groups, may be less likely to receive information spreading on the network, due to their disadvantaged network position. We study the optimization problem of adding new connections to a network to enhance fairness in information access among different demographic groups. We provide a concrete formulation of this problem where information access is measured in terms of resistance distance, offering a new perspective that emphasizes global network structure and multi-path connectivity. The problem is shown to be NP-hard. We propose a simple greedy algorithm which turns out to output accurate solutions, but its run time is cubic, which makes it undesirable for large networks. As our main technical contribution, we reduce its time complexity to linear, leveraging several novel approximation techniques. In addition to our theoretical findings, we also conduct an extensive set of experiments using both real-world and synthetic datasets. We demonstrate that our linear-time algorithm can produce accurate solutions for networks with millions of nodes.
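The resistance distance used above can be computed exactly on small graphs by treating every edge as a unit resistor. The sketch below grounds one endpoint and solves the reduced Laplacian system with plain Gauss-Jordan elimination; it shows only the metric, not the authors' greedy or linear-time algorithms, and the function name and edge-list format are assumptions for illustration.

```python
def resistance_distance(n, edges, u, v):
    """Effective resistance between nodes u and v of a connected, unweighted graph.

    n: number of nodes labelled 0..n-1; edges: list of (a, b) pairs.
    """
    if u == v:
        return 0.0
    # Graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        lap[a][a] += 1.0
        lap[b][b] += 1.0
        lap[a][b] -= 1.0
        lap[b][a] -= 1.0
    # Ground node v: drop its row and column, inject one unit of current at u.
    keep = [i for i in range(n) if i != v]
    aug = [[lap[i][j] for j in keep] + [1.0 if i == u else 0.0] for i in keep]
    m = len(keep)
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    idx = keep.index(u)
    # The potential at u (with v grounded) equals the effective resistance.
    return aug[idx][m] / aug[idx][idx]
```

On the path 0-1-2 the distance between the end nodes is 2; adding the edge (0, 2) lowers it to 2/3, because the new edge offers a second parallel route. That multi-path sensitivity is what distinguishes resistance distance from shortest-path distance.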
【129】Autonomous Source Knowledge Selection in Multi-Domain Adaptation 标题:多领域适应中的自主源知识选择 链接:https://arxiv.org/abs/2512.14710
作者:Keqiuyin Li,Jie Lu,Hua Zuo,Guangquan Zhang 摘要:无监督多域自适应在迁移学习中起着关键作用,它利用从多个源域获取的丰富源信息,从一个未标记的目标域求解目标任务。然而,多个源域通常包含大量冗余或不相关的信息,这会损害迁移性能,特别是在海量源域设置中。迫切需要制定有效的策略,从大量的源领域中识别和选择最可迁移的知识来解决目标任务。本文提出了一种名为自主源知识选择(AutoS)的多领域自适应方法,用于自主选择源训练样本和模型,使目标任务的预测能够使用更多相关和可迁移的源信息。该方法采用密度驱动的选择策略,在训练过程中选择源样本,并确定哪些源模型应有助于目标预测。同时,建立在预训练的多模态模型上的伪标签增强模块用于减轻目标标签噪声并提高自我监督。在真实数据集上的实验表明了该方法的优越性。 摘要:Unsupervised multi-domain adaptation plays a key role in transfer learning by leveraging rich source information acquired from multiple source domains to solve a target task in an unlabeled target domain. However, multiple source domains often contain much redundant or unrelated information which can harm transfer performance, especially in massive-source domain settings. It is urgent to develop effective strategies for identifying and selecting the most transferable knowledge from massive source domains to address the target task. In this paper, we propose a multi-domain adaptation method named Autonomous Source Knowledge Selection (AutoS) to autonomously select source training samples and models, enabling the prediction of the target task using more relevant and transferable source information. The proposed method employs a density-driven selection strategy to choose source samples during training and to determine which source models should contribute to target prediction. Simultaneously, a pseudo-label enhancement module built on a pre-trained multimodal model is employed to mitigate target label noise and improve self-supervision. Experiments on real-world datasets indicate the superiority of the proposed method.
【130】Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning 标题:注意力即绑定:Transformer推理的向量符号视角 链接:https://arxiv.org/abs/2512.14709
作者:Sahil Rajesh Dhayalkar 备注:12 pages with references. Submitted to ‘Logical and Symbolic Reasoning in Language Models @ AAAI 2026’ conference and is under review 摘要:基于transformer的语言模型显示出令人印象深刻的推理行为,但在需要稳定符号操作的任务上仍然很脆弱。本文开发了一个统一的角度来看,这些现象,通过解释自我注意和剩余流实现一个近似的矢量符号架构(VSA)。在这个观点中,查询和键定义了角色空间,值编码填充符,注意力权重执行软解绑,而剩余连接实现了许多绑定结构的叠加。我们使用这个代数透镜涉及Transformer内部的思想链的痕迹,基于程序的推理,和内存增强的工具的使用,并解释特征故障模式,如变量的混乱和不一致的逻辑相关的提示。基于这一观点,我们提出了VSA启发的架构偏见,包括明确的绑定/解绑定头和超维内存层,以及促进角色填充分离和鲁棒叠加的训练目标。最后,我们概述度量“VSA相似性”和逻辑组合,并提出理论和架构的开放问题。总体而言,本文认为,将注意力视为软矢量符号计算提供了一条通往更可解释和逻辑可靠的推理系统的原则路线。 摘要:Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). In this view, queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition of many bound structures. We use this algebraic lens to relate transformer internals to chain-of-thought traces, program-based reasoning, and memory-augmented tool use, and to explain characteristic failure modes such as variable confusion and inconsistency across logically related prompts. Building on this perspective, we propose VSA-inspired architectural biases, including explicit binding/unbinding heads and hyperdimensional memory layers, and training objectives that promote role-filler separation and robust superposition. Finally, we outline metrics for measuring “VSA-likeness” and logical compositionality, and pose theoretical and architectural open problems. Overall, the paper argues that viewing attention as soft vector-symbolic computation offers a principled route toward more interpretable and logically reliable reasoning systems.
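The binding, superposition, and unbinding operations named above can be tried out with Holographic Reduced Representations (HRR), one classical Vector Symbolic Architecture. The abstract does not commit to a particular VSA, so HRR, the vector dimension, and all function names below are assumptions for illustration; the code uses naive O(n²) circular convolution to stay dependency-free.

```python
import math
import random


def random_vector(dim, rng):
    """Random HRR vector with elements ~ N(0, 1/dim), so its norm is about 1."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]


def bind(role, filler):
    """Bind a role to a filler by circular convolution."""
    n = len(role)
    return [sum(role[j] * filler[(i - j) % n] for j in range(n)) for i in range(n)]


def unbind(trace, role):
    """Approximately recover the filler bound to `role` by circular correlation."""
    n = len(trace)
    return [sum(role[j] * trace[(i + j) % n] for j in range(n)) for i in range(n)]


def superpose(*vectors):
    """Superposition: element-wise sum of several bound pairs."""
    return [sum(values) for values in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Storing two role-filler pairs in one trace with `superpose(bind(r1, f1), bind(r2, f2))` and then calling `unbind(trace, r1)` returns a noisy vector that is far more similar to `f1` than to `f2`. In the paper's reading, attention weights play the role of this soft unbinding and the residual stream plays the role of the superposed trace.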
【131】LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts 标题:LLM作为神经架构师:在严格的API合同下控制生成图像字幕模型 链接:https://arxiv.org/abs/2512.14706
作者:Krunal Jesani,Dmitry Ignatov,Radu Timofte 摘要:神经架构搜索(NAS)传统上需要大量的人类专业知识或自动试错来设计深度学习模型。我们提出了NN-Caption,这是一个LLM引导的神经架构搜索管道,它在严格的Net API下,将取自LEMUR分类骨干的CNN编码器与序列解码器(LSTM/GRU/Transformer)组合,从而生成可运行的图像字幕模型。使用DeepSeek-R1-0528-Qwen3-8B作为主要生成器,我们给出了提示模板和生成架构的示例。我们使用BLEU-4对MS COCO进行了评估。LLM生成了数十个字幕模型,其中超过一半成功训练并生成了有意义的字幕。我们分析了在提示中使用不同数量的输入模型片段(5与10)的结果,发现当提供更多候选组件时,成功率略有下降。我们还报告了训练动态(字幕准确度随训练轮数的变化)和达到的最高BLEU-4。我们的研究结果突出了LLM指导的NAS的前景:LLM不仅提出了架构,而且还建议了超参数和训练实践。我们确定遇到的挑战(例如,代码幻觉或API遵从性问题),并详细说明提示规则和迭代代码修复如何解决这些问题。这项工作提出了一个管道,集成了基于提示的代码生成与自动评估,并添加了几十个新的字幕模型到开放的LEMUR数据集,以促进可重复的基准测试和下游AutoML研究。 摘要:Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. 
This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.
【132】Tourists Profiling by Interest Analysis 标题:通过兴趣分析进行游客分析 链接:https://arxiv.org/abs/2512.14704
作者:Sonia Djebali,Quentin Gabot,Guillaume Guerard 摘要:随着近年来的数字革命,旅游者行为分析及其相关研究领域发生了深刻的变化。现在,利用游客在旅行中留下的数字痕迹更容易检查他们的行为。对旅游业各个方面进行的研究侧重于数字痕迹的定量方面,以得出结论。在本文中,我们建议的研究集中在定性和定量方面的数字痕迹,以了解动态的旅游行为,特别是那些有关景点网络。 摘要:With the recent digital revolution, the analysis of tourists’ behaviors and its associated research fields have changed profoundly. It is now easier to examine the behaviors of tourists using the digital traces they leave during their travels. Studies conducted on diverse aspects of tourism focus on the quantitative aspects of digital traces to reach their conclusions. In this paper, we suggest a study focused on both the qualitative and quantitative aspects of digital traces to understand the dynamics governing tourist behavior, especially those concerning attraction networks.
【133】Algorithmic Criminal Liability in Greenwashing: Comparing India, United States, and European Union 标题:洗绿中的算法刑事责任:印度、美国和欧盟比较 链接:https://arxiv.org/abs/2512.12837
作者:Sahibpreet Singh,Manjit Singh 备注:Published in HPNLU Journal of Law, Business and Economics, Vol. 3, 2024, pp. 51-68. ISSN: 2584-0436 摘要:人工智能驱动的绿色清洗已成为企业可持续发展治理中的一个潜在挑战,加剧了环境披露的不透明性,并颠覆了监管监督。这项研究对印度、美国和欧盟人工智能介导的洗绿行为的刑事责任进行了比较法律分析,揭示了当欺骗性索赔源自算法系统时,在归咎于罪责方面的理论缺陷。现有法规通过将责任建立在可证明的人类意图上,表现出以人为中心的偏见,使它们无法解决算法欺骗问题。该研究确定了审慎适应方面的一个关键差距,因为现行的欺诈法规相对于人工智能产生的虚假陈述仍然过时。本研究报告采用理论法律方法,系统地剖析了司法判例和法定文书,得出了关于公司刑事责任可能扩大的结果。调查结果强调了严格责任模型的可行性,重新校准的人工智能问责制治理框架,以及ESG制度下的算法尽职调查任务。比较分析揭示了司法管辖区的差异,欧盟企业可持续性尽职调查指令(CSDDD)提供了一个潜在的跨国模式。这项研究通过倡导将算法风险评估与法律人格结构相结合的混合责任框架,为人工智能伦理和环境法学做出了贡献,确保算法的不透明性不会妨碍责任的执行。 摘要:AI-powered greenwashing has emerged as an insidious challenge within corporate sustainability governance, exacerbating the opacity of environmental disclosures and subverting regulatory oversight. This study conducts a comparative legal analysis of criminal liability for AI-mediated greenwashing across India, the US, and the EU, exposing doctrinal lacunae in attributing culpability when deceptive claims originate from algorithmic systems. Existing statutes exhibit anthropocentric biases by predicating liability on demonstrable human intent, rendering them ill-equipped to address algorithmic deception. The research identifies a critical gap in jurisprudential adaptation, as prevailing fraud statutes remain antiquated vis-à-vis AI-generated misrepresentation. Utilising a doctrinal legal methodology, this study systematically dissects judicial precedents and statutory instruments, yielding results regarding the potential expansion of corporate criminal liability. Findings underscore the viability of strict liability models, recalibrated governance frameworks for AI accountability, and algorithmic due diligence mandates under ESG regimes. Comparative insights reveal jurisdictional disparities, with the EU Corporate Sustainability Due Diligence Directive (CSDDD) offering a potential transnational model. 
This study contributes to AI ethics and environmental jurisprudence by advocating for a hybrid liability framework integrating algorithmic risk assessment with legal personhood constructs, ensuring algorithmic opacity does not preclude liability enforcement.
【134】QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management 标题:SAGIN支持的无人机移动性管理中用于联合链路选择和轨迹优化的服务质量感知分层强化学习 链接:https://arxiv.org/abs/2512.15119
作者:Jiayang Wan,Ke He,Yafei Wang,Fan Liu,Wenjin Wang,Shi Jin 备注:This work has been submitted to the IEEE for possible publication 摘要:由于无人机(UAV)高度和水平移动性的显著变化,任何单个网络都难以确保连续和可靠的三维覆盖。为此,空天地一体化网络(SAGIN)已成为实现无处不在的无人机连接的重要架构。为了解决异构网络中覆盖和信号特性的显著差异,将SAGIN中的无人机移动性管理归结为一个约束多目标联合优化问题。该公式将离散链路选择与连续轨迹优化相结合。在此基础上,我们提出了一个两级多智能体分层深度强化学习(HDRL)框架,将问题分解为两个交替可解的子问题。为了将复杂的链路选择决策映射到一个紧凑的离散动作空间中,我们在顶层设计了一个双深度Q网络(DDQN)算法,该算法通过双Q值估计实现了稳定和高质量的策略学习。为了在满足服务质量(QoS)约束的同时处理连续轨迹动作空间,我们集成了软演员-评论家(SAC)的最大熵机制,并在较低级别采用基于拉格朗日的约束SAC(CSAC)算法,该算法动态调整拉格朗日乘子以平衡约束满足和策略优化。此外,所提出的算法可以扩展到多无人机场景下的集中训练和分散执行(CTDE)范式,从而得到泛化能力更强的策略。仿真结果表明,该方案在吞吐量,链路切换频率和QoS满意度方面明显优于现有的基准。 摘要:Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable three-dimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm at the top level, which achieves stable and high-quality policy learning through double Q-value estimation. 
To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm at the lower level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.
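The dynamic adjustment of Lagrange multipliers mentioned above is commonly done with projected dual ascent: the multiplier grows while the constraint is violated and shrinks, never below zero, once it is satisfied. The sketch below shows that generic update and the resulting penalized objective; it is not the paper's CSAC algorithm, and the function names, learning rate, and scalar constraint are assumptions for illustration.

```python
def update_multiplier(multiplier, constraint_cost, limit, lr=0.1):
    """One projected dual-ascent step on a Lagrange multiplier.

    The multiplier rises when constraint_cost exceeds limit (violation),
    falls when the constraint holds with slack, and is clipped at zero.
    """
    return max(0.0, multiplier + lr * (constraint_cost - limit))


def penalized_objective(reward, constraint_cost, limit, multiplier):
    """Lagrangian the policy maximizes: reward minus the weighted violation."""
    return reward - multiplier * (constraint_cost - limit)
```

With a QoS constraint, `constraint_cost` could be a measured violation rate and `limit` its tolerated level: persistent violations drive the multiplier up, so the policy is penalized more heavily until the constraint is met.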
【135】Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images Link: https://arxiv.org/abs/2512.15061
Authors: Pandega Abyan Zumarsyah, Igi Ardiyanto, Hanung Adi Nugroho Comments: Submitted to Computers in Biology and Medicine Abstract: This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78% for OC on DRISHTIGS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.
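The results above are reported as intersection over union (IoU). As a reminder of how that metric is computed for a binary segmentation mask, here is a minimal pure-Python sketch; the toy masks are made up for illustration and are not from any of the datasets named in the abstract.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty masks agree perfectly


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```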
【136】Restless Multi-Process Multi-Armed Bandits with Applications to Self-Driving Microscopies Link: https://arxiv.org/abs/2512.14930
Authors: Jaume Anguera Peris, Songtao Cheng, Hanzhao Zhang, Wei Ouyang, Joakim Jaldén Abstract: High-content screening microscopy generates large amounts of live-cell imaging data, yet its potential remains constrained by the inability to determine when and where to image most effectively. Optimally balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest remains an open challenge, further complicated by limited field-of-view adjustments and sensor sensitivity. Existing approaches either rely on static sampling or heuristics that neglect the dynamic evolution of biological processes, leading to inefficiencies and missed events. Here, we introduce the restless multi-process multi-armed bandit (RMPMAB), a new decision-theoretic framework in which each experimental region is modeled not as a single process but as an ensemble of Markov chains, thereby capturing the inherent heterogeneity of biological systems such as asynchronous cell cycles and heterogeneous drug responses. Building upon this foundation, we derive closed-form expressions for transient and asymptotic behaviors of aggregated processes, and design scalable Whittle index policies with sub-linear complexity in the number of imaging regions. 
Through both simulations and a real biological live-cell imaging dataset, we show that our approach achieves substantial improvements in throughput under resource constraints. Notably, our algorithm outperforms Thompson Sampling, Bayesian UCB, epsilon-Greedy, and Round Robin by reducing cumulative regret by more than 37% in simulations and capturing 93% more biologically relevant events in live imaging experiments, underscoring its potential for transformative smart microscopy. Beyond improving experimental efficiency, the RMPMAB framework unifies stochastic decision theory with optimal autonomous microscopy control, offering a principled approach to accelerate discovery across multidisciplinary sciences.
【137】Artificial Intelligence for the Assessment of Peritoneal Carcinosis during Diagnostic Laparoscopy for Advanced Ovarian Cancer Link: https://arxiv.org/abs/2512.14797
Authors: Riccardo Oliva, Farahdiba Zarin, Alice Zampolini Faustini, Armine Vardazaryan, Andrea Rosati, Vinkle Srivastav, Nunzia Del Villano, Jacques Marescaux, Giovanni Scambia, Pietro Mascagni, Nicolas Padoy, Anna Fagotti Abstract: Advanced Ovarian Cancer (AOC) is often diagnosed at an advanced stage with peritoneal carcinosis (PC). Fagotti score (FS) assessment at diagnostic laparoscopy (DL) guides treatment planning by estimating surgical resectability, but its subjective and operator-dependent nature limits reproducibility and widespread use. Videos of patients undergoing DL with concomitant FS assessments at a referral center were retrospectively collected and divided into a development dataset, for data annotation, AI training and evaluation, and an independent test dataset, for internal validation. In the development dataset, FS-relevant frames were manually annotated for anatomical structures and PC. Deep learning models were trained to automatically identify FS-relevant frames, segment structures and PC, and predict video-level FS and indication to surgery (ItS). AI performance was evaluated using Dice score for segmentation, F1-scores for anatomical stations (AS) and ItS prediction, and root mean square error (RMSE) for final FS estimation. In the development dataset, the segmentation model, trained on 7,311 frames, achieved Dice scores of 70$\pm$3% for anatomical structures and 56$\pm$3% for PC. 
Video-level AS classification achieved F1-scores of 74$\pm$3% and 73$\pm$4%, FS prediction showed normalized RMSE values of 1.39$\pm$0.18 and 1.15$\pm$0.08, and ItS reached F1-scores of 80$\pm$8% and 80$\pm$2% in the development (n=101) and independent test datasets (n=50), respectively. This is the first AI model to predict the feasibility of cytoreductive surgery providing automated FS estimation from DL videos. Its reproducible and reliable performance across datasets suggests that AI can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making in AOC.
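The evaluation above relies on the Dice score for segmentation and RMSE for score estimation. The following sketch shows the textbook definitions of both in plain Python. It computes plain RMSE only; how the paper normalizes its RMSE is not stated in this digest, so no normalization is attempted. All inputs are toy values.

```python
import math


def dice(pred, target):
    """Dice score of two binary masks given as nested lists of 0/1."""
    inter = pred_sum = target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0
    denom = pred_sum + target_sum
    return 2 * inter / denom if denom else 1.0


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences of numbers."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


mask_score = dice([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
score_error = rmse([4, 6, 8], [4, 8, 8])
```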
【138】Magnification-Aware Distillation (MAD): A Self-Supervised Framework for Unified Representation Learning in Gigapixel Whole-Slide Images Link: https://arxiv.org/abs/2512.14796
Authors: Mahmut S. Gokmen, Mitchell A. Klusty, Peter T. Nelson, Allison M. Neltner, Sen-Ching Samson Cheung, Thomas M. Pearce, David A Gutman, Brittany N. Dugger, Devavrat S. Bisht, Margaret E. Flanagan, V. K. Cody Bumgardner Comments: 10 pages, 4 figures, 5 tables, submitted to AMIA 2026 Informatics Summit Abstract: Whole-slide images (WSIs) contain tissue information distributed across multiple magnification levels, yet most self-supervised methods treat these scales as independent views. This separation prevents models from learning representations that remain stable when resolution changes, a key requirement for practical neuropathology workflows. This study introduces Magnification-Aware Distillation (MAD), a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail, enabling the model to learn how coarse tissue structure relates to fine cellular patterns. The resulting foundation model, MAD-NP, is trained entirely through this cross-scale correspondence without annotations. A linear classifier trained only on 10x embeddings maintains 96.7% of its performance when applied to unseen 40x tiles, demonstrating strong resolution-invariant representation learning. Segmentation outputs remain consistent across magnifications, preserving anatomical boundaries and minimizing noise. These results highlight the feasibility of scalable, magnification-robust WSI analysis using a unified embedding space.
【139】Scaling Causal Mediation for Complex Systems: A Framework for Root Cause Analysis Link: https://arxiv.org/abs/2512.14764
Authors: Alessandro Casadei, Sreyoshi Bhaduri, Rohit Malshe, Pavan Mullapudi, Raj Ratan, Ankush Pole, Arkajit Rakshit Abstract: Modern operational systems, ranging from logistics and cloud infrastructure to industrial IoT, are governed by complex, interdependent processes. Understanding how interventions propagate through such systems requires causal inference methods that go beyond direct effects to quantify mediated pathways. Traditional mediation analysis, while effective in simple settings, fails to scale to the high-dimensional directed acyclic graphs (DAGs) encountered in practice, particularly when multiple treatments and mediators interact. In this paper, we propose a scalable mediation analysis framework tailored for large causal DAGs involving multiple treatments and mediators. Our approach systematically decomposes total effects into interpretable direct and indirect components. We demonstrate its practical utility through applied case studies in fulfillment center logistics, where complex dependencies and non-controllable factors often obscure root causes.
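To make the decomposition of a total effect into direct and indirect components concrete, the sketch below works through a small example under a strong simplifying assumption: a linear structural equation model, where the total effect of a treatment on an outcome is the sum over all directed paths of the product of edge coefficients. This is the classic path-tracing rule, not the paper's framework, and the graph and coefficients are hypothetical.

```python
def total_effect(edges, source, target):
    """Sum over all directed paths of the product of edge coefficients (linear SEM)."""
    if source == target:
        return 1.0
    # The graph is a DAG, so this recursion always terminates.
    return sum(weight * total_effect(edges, child, target)
               for child, weight in edges.get(source, {}).items())


# Hypothetical DAG: treatment T, mediators M1 and M2, outcome Y.
edges = {
    "T": {"M1": 0.5, "M2": 0.2, "Y": 0.3},
    "M1": {"M2": 0.4, "Y": 0.6},
    "M2": {"Y": 1.0},
}

total = total_effect(edges, "T", "Y")
direct = edges["T"].get("Y", 0.0)   # the single edge T -> Y
indirect = total - direct           # everything routed through mediators
```

Here the four paths contribute 0.3 (direct), 0.3 (via M1), 0.2 (via M1 then M2), and 0.2 (via M2), so the mediated share dominates. Plain recursion like this revisits shared sub-paths and would need memoization on large graphs.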
【140】Multiscale Cross-Modal Mapping of Molecular, Pathologic, and Radiologic Phenotypes in Lipid-Deficient Clear Cell Renal Cell Carcinoma Link: https://arxiv.org/abs/2512.14750
Authors: Ying Cui, Dongzhe Zheng, Ke Yu, Xiyin Zheng, Xiaorui Wang, Xinxiang Li, Yan Gu, Lin Fu, Xinyi Chen, Wenjie Mei, Xin-Gui Peng Abstract: Clear cell renal cell carcinoma (ccRCC) exhibits extensive intratumoral heterogeneity on multiple biological scales, contributing to variable clinical outcomes and limiting the effectiveness of conventional TNM staging, which highlights the urgent need for multiscale integrative analytic frameworks. The lipid-deficient de-clear cell differentiated (DCCD) ccRCC subtype, defined by multi-omics analyses, is associated with adverse outcomes even in early-stage disease. Here, we establish a hierarchical cross-scale framework for the preoperative identification of DCCD-ccRCC. At the highest layer, cross-modal mapping transferred molecular signatures to histological and CT phenotypes, establishing a molecular-to-pathology-to-radiology supervisory bridge. Within this framework, each modality-specific model is designed to mirror the inherent hierarchical structure of tumor biology. PathoDCCD captured multi-scale microscopic features, from cellular morphology and tissue architecture to meso-regional organization. RadioDCCD integrated complementary macroscopic information by combining whole-tumor and its habitat-subregions radiomics with a 2D maximal-section heterogeneity metric. These nested models enabled integrated molecular subtype prediction and clinical risk stratification. 
Across five cohorts totaling 1,659 patients, PathoDCCD reliably recapitulated molecular subtypes, while RadioDCCD provided reliable preoperative prediction. The consistent predictions identified patients with the poorest clinical outcomes. This cross-scale paradigm unifies molecular biology, computational pathology, and quantitative radiology into a biologically grounded strategy for preoperative noninvasive molecular phenotyping of ccRCC.
【141】VERAFI: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation Link: https://arxiv.org/abs/2512.14744
Authors: Adewale Akinfaderin, Shreyas Subramanian Abstract: Financial AI systems suffer from a critical blind spot: while Retrieval-Augmented Generation (RAG) excels at finding relevant documents, language models still generate calculation errors and regulatory violations during reasoning, even with perfect retrieval. This paper introduces VERAFI (Verified Agentic Financial Intelligence), an agentic framework with neurosymbolic policy generation for verified financial intelligence. VERAFI combines state-of-the-art dense retrieval and cross-encoder reranking with financial tool-enabled agents and automated reasoning policies covering GAAP compliance, SEC requirements, and mathematical validation. Our comprehensive evaluation on FinanceBench demonstrates remarkable improvements: while traditional dense retrieval with reranking achieves only 52.4% factual correctness, VERAFI’s integrated approach reaches 94.7%, an 81% relative improvement. The neurosymbolic policy layer alone contributes a 4.3 percentage point gain over pure agentic processing, specifically targeting persistent mathematical and logical errors. By integrating financial domain expertise directly into the reasoning process, VERAFI offers a practical pathway toward trustworthy financial AI that meets the stringent accuracy demands of regulatory compliance, investment decisions, and risk management.
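The abstract quotes both an absolute figure (52.4% to 94.7%) and an "81% relative improvement". The short check below shows how the two relate; the helper name is introduced here for illustration and the two percentages are the only values taken from the abstract.

```python
def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline


baseline, verafi = 52.4, 94.7           # factual correctness (%), from the abstract
absolute_gain = verafi - baseline       # measured in percentage points
relative_gain = relative_improvement(baseline, verafi)
```

The absolute gain is 42.3 percentage points, and dividing by the 52.4% baseline gives roughly 0.807, which rounds to the 81% quoted above. The 4.3-point figure for the policy layer is an absolute gain of the first kind, not a relative one.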
【142】PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents Link: https://arxiv.org/abs/2512.14735
Authors: Yuqun Zhang, Yuxuan Zhao, Sijia Chen Abstract: This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi
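2. Categorized Data
Reinforcement Learning
Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
Link: https://arxiv.org/abs/2512.15687
Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu Abstract:
- Reinforcement learning is essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but do not guarantee that sampled trajectories differ in the update directions that shape optimization.
- We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by an external geometry but by the model's own first-order update geometry.
- For each response, G2RL builds a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within the sampled group.
- Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are de-emphasized, yielding a self-referential exploration signal that aligns naturally with PPO-style stability and KL control.
- On math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) with Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external-embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a more faithful and effective basis for guiding exploration in LLM reinforcement learning.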
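The bullets above describe comparing per-response features within a sampled group and applying a bounded multiplicative reward scaler to novel directions. The sketch below illustrates that general idea only. The scoring rule (one minus a scaled mean cosine similarity, clipped to a fixed range), the parameter names, and the toy feature vectors are all assumptions made for illustration; the digest does not give G2RL's actual formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def novelty_scales(features, alpha=0.5, lo=0.5, hi=1.5):
    """One bounded multiplicative factor per response in a sampled group."""
    scales = []
    for i, f in enumerate(features):
        sims = [cosine(f, g) for j, g in enumerate(features) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        raw = 1.0 - alpha * mean_sim          # low similarity means a novel direction
        scales.append(min(hi, max(lo, raw)))  # clipping keeps the scaler bounded
    return scales


# Hypothetical stand-ins for sequence-level features of three sampled responses.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rewards = [1.0, 1.0, 1.0]
scaled = [r * s for r, s in zip(rewards, novelty_scales(features))]
```

The two responses that point in the same direction are down-weighted, while the orthogonal one keeps its full reward; a response opposing the rest of the group would be scaled above 1, up to the upper bound.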
Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
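Link: https://arxiv.org/abs/2512.15274
Authors: Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong Comments: Accepted by AAAI 2026 Abstract:
- Reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning ability of large language models (LLMs). Current RLVR methods typically train on all generated tokens but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training efficiency.
- To address this issue, we propose a new RLVR method called Progressive Prefix-token Policy Optimization (PPPO), which highlights the importance of the prefix segment of generated outputs. Specifically, inspired by path dependence, a well-established theory of human thinking in which early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE).
- PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning and ultimately improve final results. To improve the learning efficiency of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence and accumulating their scores as the reward signal.
- Extensive experiments on various reasoning tasks show that the proposed PPPO outperforms representative RLVR methods, with an accuracy improvement of 18.02% using only 26.17% of the training tokens.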
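The two training strategies above can be sketched as follows. This is a toy illustration under stated assumptions: the linear schedule, its start and end fractions, the number of continuations, and the stub sampler and scorer are all hypothetical, since the digest does not specify PPPO's actual schedule or reward function.

```python
def retained_prefix_len(step, total_steps, seq_len, start=0.1, end=0.5):
    """Progressive prefix retention: the retained fraction grows linearly with training."""
    frac = start + (end - start) * step / max(1, total_steps - 1)
    return max(1, int(round(frac * seq_len)))


def accumulated_reward(prefix, sample_continuation, score, n=4):
    """Continuation accumulated reward: sum the scores of n continuations of one prefix."""
    return sum(score(prefix + sample_continuation(prefix)) for _ in range(n))


# Stubs standing in for a policy sampler and a verifiable-reward checker.
def sample_stub(prefix):
    return [1]                 # always append one token


def score_stub(tokens):
    return float(len(tokens))  # placeholder score


early = retained_prefix_len(step=0, total_steps=11, seq_len=100)
late = retained_prefix_len(step=10, total_steps=11, seq_len=100)
reward = accumulated_reward([0, 0], sample_stub, score_stub, n=4)
```

Scoring several continuations of the same prefix averages out the luck of any single completion, which is the bias the second strategy is described as mitigating.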
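Reasoning and Decoding
Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Link: https://arxiv.org/abs/2512.15662
Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan LU Comments: Under Review Abstract:
- Humans solve complex problems through critical thinking, in which reasoning and evaluation are intertwined and converge toward a correct solution. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors after the fact. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning.
- Inspired by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at every step within a single model. STC is trained with a hybrid reinforcement learning objective that combines reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation.
- Experiments on mathematical reasoning benchmarks show that STC exhibits strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.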
Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models
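Link: https://arxiv.org/abs/2512.15089
Authors: Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang, Yaofo Chen, Bin Xiao, Yuanqing Li, Mingkui Tan Comments: under review Abstract:
- Large language models (LLMs) have shown impressive performance across a variety of language tasks. However, existing LLM reasoning strategies rely mainly on the LLM's own fast or slow mode (such as o1-style thinking), and therefore struggle to balance reasoning efficiency and accuracy across queries of varying difficulty.
- In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query.
- Specifically, CogER first assesses the complexity of an incoming query and assigns it to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty.
- To enable automatic strategy selection, we model the process as a Markov decision process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning.
- In addition, for queries that require external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain of thought.
- Extensive experiments show that CogER outperforms state-of-the-art test-time scaling methods, achieving at least a 13% relative improvement in average exact match on in-domain tasks and an 8% relative gain on out-of-domain tasks.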
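A reward that balances solution quality against computational cost, as described above, can be illustrated with a simple expected-value calculation. Note what this sketch is and is not: it picks a level by a direct argmax over hand-written estimates, whereas the abstract describes an agent trained with reinforcement learning. The level numbering, the cost weight, and all estimates are hypothetical.

```python
def reward(correct, tokens_used, cost_weight=0.0002):
    """Quality minus a token-cost penalty for one answered query."""
    return (1.0 if correct else 0.0) - cost_weight * tokens_used


def best_level(estimates, cost_weight=0.0002):
    """estimates maps level -> (probability of a correct answer, expected tokens)."""
    return max(estimates,
               key=lambda lv: estimates[lv][0] - cost_weight * estimates[lv][1])


# Hypothetical estimates for three strategy levels on two kinds of query.
easy_query = {1: (0.90, 50), 2: (0.92, 400), 3: (0.93, 2000)}
hard_query = {1: (0.10, 50), 2: (0.40, 400), 3: (0.85, 2000)}

easy_choice = best_level(easy_query)
hard_choice = best_level(hard_query)
```

On the easy query the extra tokens of deeper reasoning buy almost no accuracy, so the cheapest level wins; on the hard query the accuracy gain outweighs the cost and the most expensive level is chosen.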
DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding
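Link: https://arxiv.org/abs/2512.15000
Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie Abstract:
- Process Reward Models (PRMs) have become key to improving Large Language Models (LLMs) through test-time scaling, but their effectiveness in coding remains limited because code lacks a meaningful step decomposition and because Monte Carlo-generated partial labels are noisy.
- We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps, using a Chain-of-Function prompting strategy to induce modular code generation so that PRM training and application resemble those in mathematical reasoning tasks.
- To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels.
- With test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9 pass@1 rate, surpassing OpenAI o4-mini.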
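Treating each function as a reasoning step suggests a simple best-of-N selection loop: split every candidate program into functions, score each function, and keep the candidate with the best aggregate. The sketch below does this with Python's standard `ast` module. The scorer is a trivial placeholder for a trained PRM, and aggregating by the minimum step score is an assumption chosen for illustration, not something the digest attributes to DreamPRM-Code.

```python
import ast


def function_steps(source):
    """Split a program into its top-level function definitions, one step each."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body if isinstance(node, ast.FunctionDef)]


def solution_score(source, step_scorer):
    """Aggregate per-function scores; the minimum treats the weakest step as decisive."""
    steps = function_steps(source)
    return min(step_scorer(step) for step in steps) if steps else 0.0


def best_of_n(candidates, step_scorer):
    """Return the candidate program with the highest aggregated score."""
    return max(candidates, key=lambda src: solution_score(src, step_scorer))


def toy_scorer(step):
    """Placeholder for a PRM: penalize functions containing a bare `pass`."""
    return 0.1 if "pass" in step.split() else 0.9


complete = "def parse(x):\n    return int(x)\n\ndef solve(x):\n    return parse(x) + 1\n"
stubbed = "def parse(x):\n    pass\n\ndef solve(x):\n    return parse(x) + 1\n"
chosen = best_of_n([stubbed, complete], toy_scorer)
```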
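Continual Learning
PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning
Link: https://arxiv.org/abs/2512.15658
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan Comments: 10 pages, 3 figures, 2025 IEEE International Conference on Big Data (BigData) Abstract:
- Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a sequence of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, in which performance on earlier tasks degrades as new tasks are learned.
- In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct task-specific parameters to each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks.
- Experimental results on diverse NLP benchmarks show that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution for mitigating catastrophic forgetting.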
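Analysis
How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness
Link: https://arxiv.org/abs/2512.15634
Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra Comments: Accepted at AACL IJCNLP 2025 Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are the two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalization. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models in in-domain and out-of-domain adaptation, highlighting distinct generalization behavior and task-specific forgetting. We show that LoRA achieves competitive, and in some cases superior, performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze internal representations via spectral features and layer-wise attention structures, offering insight into representational drift and structural changes in attention patterns.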
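For readers less familiar with what a rank sweep varies: LoRA replaces the update to a weight matrix of shape d_out x d_in with the product of two low-rank factors, so the number of trainable parameters per adapted matrix is rank x (d_in + d_out). The sketch below tabulates that count for a few ranks. The 4096 x 4096 layer size and the rank values are hypothetical and are not taken from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


d_in = d_out = 4096                      # hypothetical projection size
full = d_in * d_out                      # parameters updated by full fine-tuning

sweep = {}
for rank in (4, 8, 16, 64):
    trainable = lora_params(d_in, d_out, rank)
    sweep[rank] = (trainable, trainable / full)
```

Doubling the rank doubles the adapter size, yet even at rank 64 the adapter here is about 3% of the full matrix, which is why rank is a cheap knob to sweep.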
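Games
Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat
Link: https://arxiv.org/abs/2512.15435
Authors: Stefan Edelkamp Abstract:
- In multi-player trick-taking card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical than refined middle- and end-game play. Under current computational limits, such early decisions rely on statistical information drawn from a large corpus of human expert games.
- In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games to generate and merge statistics.
- We implement perfect feature hash functions to handle compacted tables, yielding a self-improving card game engine in which newly inferred knowledge is continuously refined during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.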
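A perfect hash over card hands lets statistics live in a dense table with no collisions and no wasted slots. The digest does not describe how the paper constructs its hash functions, so the sketch below uses a standard alternative, the combinatorial number system, which maps every k-card subset of an n-card deck to a unique index in [0, C(n, k)). Card indices 0 to 31 for a Skat deck are an assumption for illustration.

```python
from itertools import combinations
from math import comb


def hand_rank(cards):
    """Map a set of distinct card indices to a unique integer in [0, C(n, k))."""
    return sum(comb(card, i) for i, card in enumerate(sorted(cards), start=1))


# Every 3-card subset of a 5-card toy deck gets its own slot in 0..9.
toy_ranks = sorted(hand_rank(hand) for hand in combinations(range(5), 3))

# A 10-card hand from a 32-card deck indexes a table of C(32, 10) entries.
table_size = comb(32, 10)
lowest = hand_rank(range(10))
highest = hand_rank(range(22, 32))
```

Because the mapping is a bijection, a flat array of `table_size` counters can hold one statistic per possible hand, which fits the compact-table use described above.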