2025-12-14 Research Roundup

Contents

2025-12-14 Research Roundup

1. Source Data

1.1 Media

From: 量子位 (QbitAI), 机器之心 (Synced), 新智元 (AI Era), AGI Hunt, 小红书 (Xiaohongshu), and other X accounts

  1. Is RL a "philosopher's stone" or an "excavator"? CMU answers with controlled experiments
    1. Builds GSM-Infinite, a controllable synthetic-data framework based on dependency-graph DAGs, and in a fully decoupled setting quantitatively analyzes the causal effects of pre-training, mid-training (continued pre-training, CPT), and RL on a model's reasoning generalization
    2. Conclusions
      1. RL delivers genuine capability gains (pass@128) only when pre-training leaves sufficient headroom and the RL data targets the model's capability boundary (tasks that are difficult yet still within the model's reach).
      2. Contextual generalization requires minimal but sufficient pre-training exposure, after which RL achieves reliable transfer.
      3. At fixed compute, mid-training substantially outperforms RL alone, showing that it occupies a central yet under-explored place in the training pipeline.
      4. Process-level rewards reduce reward hacking and improve reasoning faithfulness.
  2. Breaking: GPT-5.2 tops the benchmarks with perfect scores as OpenAI marks its tenth anniversary
    1. GPT-5.2, 24 hours after launch: a flood of negative reviews!
  3. Farewell to "blind confidence": CCD sets a new SOTA for diffusion language model inference
    1. Coherent Contextual Decoding (CCD) exploits the context augmentation available during the diffusion process to correct, on theoretical grounds, the "short-sightedness" of conventional DLM decoding strategies; combined with an adaptive decoding scheme, it simultaneously achieves a 3.48x speedup and a 3.9% performance gain across several open-source DLMs
  4. Six former DeepMind veterans build an "AI commander" that sets a new SOTA at half the cost
    1. They built a meta-system that lets a frontier LLM automatically generate the strategy and model combination needed to solve a given task. This both addresses the pain point that frontier models struggle to solve complex real-world problems on their own and cuts overall inference cost in half.
    2. The meta-system is designed to exploit any off-the-shelf frontier model to automatically generate a complete system for a specific task, with no need to build, or even fine-tune, one's own large frontier model.
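Item 1's dependency-DAG construction can be sketched with a stdlib-only toy generator. The variable chains, literal probabilities, and operations below are illustrative assumptions, not the actual GSM-Infinite design; the point is that DAG size gives a dial for problem difficulty while the ground-truth answer stays exactly computable:

```python
import random

def make_dag_problem(n_vars, seed=0):
    """Toy GSM-Infinite-style generator: sample a dependency DAG over
    arithmetic variables, then render it as a symbolic reasoning chain.
    Difficulty is controlled by the number of variables (DAG size)."""
    rng = random.Random(seed)
    values, lines = {}, []
    for i in range(n_vars):
        name = f"x{i}"
        if i == 0 or rng.random() < 0.3:
            # Source node of the DAG: a literal value.
            values[name] = rng.randint(1, 9)
            lines.append(f"{name} = {values[name]}")
        else:
            # Internal node: depends on two earlier variables (DAG edges).
            if len(values) > 1:
                a, b = rng.sample(sorted(values), 2)
            else:
                a = b = "x0"
            op = rng.choice(["+", "*"])
            values[name] = values[a] + values[b] if op == "+" else values[a] * values[b]
            lines.append(f"{name} = {a} {op} {b}")
    target = f"x{n_vars - 1}"
    return "\n".join(lines) + f"\nWhat is {target}?", values[target]

problem, answer = make_dag_problem(6, seed=42)
```

Because generation is seeded, the same (problem, answer) pair can be regenerated on demand, which is what makes such frameworks suitable for controlled causal studies of training stages.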

1.2 Huggingface

1.3 Arxiv

1.3.1 Computation and Language

From:https://papers.cool/arxiv/cs.CL

https://arxiv.org/list/cs.CL/recent

2025-12-15 | Total: 40

1 SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support

Authors: [Yuming Feng](https://arxiv.org/search/?searchtype=author&query=Yuming Feng), [Xinrui Jiang](https://arxiv.org/search/?searchtype=author&query=Xinrui Jiang)

Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.

Subject: Computation and Language

Publish: 2025-12-12 18:05:52 UTC

2 Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Authors: [Sergey Pankratov](https://arxiv.org/search/?searchtype=author&query=Sergey Pankratov), [Dan Alistarh](https://arxiv.org/search/?searchtype=author&query=Dan Alistarh)

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as E[X] ≤ (μ + μ^(2)) · log(P) / μ² + O(1), where P is the verifier's capacity, μ is the expected entropy of the verifier's output distribution, and μ^(2) is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
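Plugging hypothetical values into the bound (dropping the additive O(1) term) makes its logarithmic dependence on verifier capacity concrete; the μ and μ^(2) values below are made up for illustration:

```python
import math

def speculative_bound(mu, mu2, capacity):
    """Evaluate the paper's bound E[X] <= (mu + mu2) * log(P) / mu^2
    (ignoring the additive O(1) term) on the expected number of draft
    tokens accepted per speculative iteration.

    mu       -- expected entropy of the verifier's output distribution
    mu2      -- expected second log-moment
    capacity -- P, how many draft tokens the verifier checks in parallel
    """
    return (mu + mu2) * math.log(capacity) / mu**2

# Hypothetical regime: entropy 2.0 nats, second log-moment 5.0.
bound_64 = speculative_bound(2.0, 5.0, 64)
bound_128 = speculative_bound(2.0, 5.0, 128)
```

The takeaway is the log(P) factor: doubling verifier capacity only adds a constant increment of (μ + μ^(2))·ln 2 / μ² to the bound, so raw parallelism quickly hits diminishing returns.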

Subject: Computation and Language

Publish: 2025-12-12 16:54:33 UTC

3 Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Authors: [Keerthana Murugaraj](https://arxiv.org/search/?searchtype=author&query=Keerthana Murugaraj), [Salima Lamsiyah](https://arxiv.org/search/?searchtype=author&query=Salima Lamsiyah), [Marten During](https://arxiv.org/search/?searchtype=author&query=Marten During), [Martin Theobald](https://arxiv.org/search/?searchtype=author&query=Martin Theobald)

Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics and, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and shifts in their topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
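For intuition, here is a stdlib-only toy of the class-based TF-IDF (c-TF-IDF) step BERTopic uses to label clusters with keywords. The real library pairs this with transformer embeddings and density-based clustering, and its exact weighting scheme differs; the formula below is a simplified stand-in:

```python
import math
from collections import Counter

def c_tf_idf(clustered_docs):
    """Toy class-based TF-IDF (c-TF-IDF): score how characteristic each
    word is for a cluster by treating the whole cluster as one
    pseudo-document.

    clustered_docs: dict mapping topic id -> list of documents (strings).
    Returns: dict mapping topic id -> [(word, score), ...], best first.
    """
    class_counts = {c: Counter(" ".join(docs).lower().split())
                    for c, docs in clustered_docs.items()}
    total_freq = Counter()
    for counts in class_counts.values():
        total_freq.update(counts)
    avg_words_per_class = sum(total_freq.values()) / len(class_counts)

    keywords = {}
    for c, counts in class_counts.items():
        n_c = sum(counts.values())
        scored = [(w, (cnt / n_c) * math.log(1 + avg_words_per_class / total_freq[w]))
                  for w, cnt in counts.items()]
        keywords[c] = sorted(scored, key=lambda kv: kv[1], reverse=True)
    return keywords

topics = c_tf_idf({
    0: ["nuclear power plant reactor", "reactor safety nuclear energy"],
    1: ["election vote parliament", "parliament debate vote"],
})
```

Words frequent in one cluster but rare overall rise to the top, which is how topic labels like "nuclear power" emerge from raw article clusters.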

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval

Publish: 2025-12-12 15:15:02 UTC

4 Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Authors: [Björn Deiseroth](https://arxiv.org/search/?searchtype=author&query=Björn Deiseroth), [Max Henning Höth](https://arxiv.org/search/?searchtype=author&query=Max Henning Höth), [Kristian Kersting](https://arxiv.org/search/?searchtype=author&query=Kristian Kersting), [Letitia Parcalabescu](https://arxiv.org/search/?searchtype=author&query=Letitia Parcalabescu)

Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline – both the retriever and the generator – as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A-certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations – without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.

Subjects: Computation and Language, Artificial Intelligence, Machine Learning

Publish: 2025-12-12 14:50:38 UTC

5 Visualizing token importance for black-box language models

Authors: [Paulius Rauba](https://arxiv.org/search/?searchtype=author&query=Paulius Rauba), [Qiyao Wei](https://arxiv.org/search/?searchtype=author&query=Qiyao Wei), [Mihaela van der Schaar](https://arxiv.org/search/?searchtype=author&query=Mihaela van der Schaar)

We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question: can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need for such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e., two outputs will differ by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight, model-agnostic procedure to evaluate the sensitivity of the output of a language model to each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs' reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.
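The general recipe can be demonstrated with a mock stochastic black-box LM: mask one input token at a time, repeatedly sample outputs, and measure how far the output distribution shifts. This is a sketch of the idea, not the paper's exact DBSA estimator; the masking scheme and distance are illustrative choices:

```python
import random
from collections import Counter

def total_variation(p, q):
    """Total-variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def output_distribution(lm, tokens, n_samples=200):
    """Empirical output distribution from repeated black-box calls."""
    counts = Counter(lm(tokens) for _ in range(n_samples))
    return {k: v / n_samples for k, v in counts.items()}

def token_sensitivity(lm, tokens, mask="[MASK]", n_samples=200):
    """For each input token, estimate how far the stochastic LM's output
    distribution shifts when that token is masked. Needs only repeated
    black-box calls: no gradients, no access to model internals."""
    base = output_distribution(lm, tokens, n_samples)
    return [
        total_variation(
            base,
            output_distribution(lm, tokens[:i] + [mask] + tokens[i + 1:], n_samples))
        for i in range(len(tokens))
    ]

def mock_lm(tokens):
    # Hypothetical black-box LM: the answer hinges on the token "paris".
    if "paris" in tokens:
        return "france" if random.random() < 0.9 else "unsure"
    return "unsure" if random.random() < 0.9 else "france"

random.seed(0)
scores = token_sensitivity(mock_lm, ["well", "paris", "is", "nice"])
```

The token the output actually depends on gets a large distributional shift, while filler tokens score near zero, which is what a per-token importance visualization would plot.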

Subjects: Computation and Language, Machine Learning

Publish: 2025-12-12 14:01:43 UTC

6 Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet

Authors: [Mevlüt Bagci](https://arxiv.org/search/?searchtype=author&query=Mevlüt Bagci), [Ali Abusaleh](https://arxiv.org/search/?searchtype=author&query=Ali Abusaleh), [Daniel Baumartz](https://arxiv.org/search/?searchtype=author&query=Daniel Baumartz), [Giueseppe Abrami](https://arxiv.org/search/?searchtype=author&query=Giueseppe Abrami), [Maxim Konca](https://arxiv.org/search/?searchtype=author&query=Maxim Konca), [Alexander Mehler](https://arxiv.org/search/?searchtype=author&query=Alexander Mehler)

Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39,546 tweets, including 19,056 media items. Furthermore, we enriched MultiParTweet with emotion, sentiment, and topic annotations produced by nine text-based models and one vision-language model (VLM). The automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether each model's annotations can be predicted from the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text- and media-based annotations validated against human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more closely with human interpretation.

Subjects: Computation and Language, Multimedia

Publish: 2025-12-12 13:55:02 UTC

7 Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Authors: [Mohor Banerjee](https://arxiv.org/search/?searchtype=author&query=Mohor Banerjee), [Nadya Yuki Wangsajaya](https://arxiv.org/search/?searchtype=author&query=Nadya Yuki Wangsajaya), [Syed Ali Redha Alsagoff](https://arxiv.org/search/?searchtype=author&query=Syed Ali Redha Alsagoff), [Min Sen Tan](https://arxiv.org/search/?searchtype=author&query=Min Sen Tan), [Zachary Choy Kit Chun](https://arxiv.org/search/?searchtype=author&query=Zachary Choy Kit Chun), [Alvin Chan Guo Wei](https://arxiv.org/search/?searchtype=author&query=Alvin Chan Guo Wei)

Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generation remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques – Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG) – affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B-70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

Subjects: Computation and Language, Artificial Intelligence

Publish: 2025-12-12 12:14:29 UTC

8 Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction

Authors: [Kai Golan Hashiloni](https://arxiv.org/search/?searchtype=author&query=Kai Golan Hashiloni), [Brenda Kasabe Nokai](https://arxiv.org/search/?searchtype=author&query=Brenda Kasabe Nokai), [Michal Shevach](https://arxiv.org/search/?searchtype=author&query=Michal Shevach), [Esthy Shemesh](https://arxiv.org/search/?searchtype=author&query=Esthy Shemesh), [Ronit Bartin](https://arxiv.org/search/?searchtype=author&query=Ronit Bartin), [Anna Bergrin](https://arxiv.org/search/?searchtype=author&query=Anna Bergrin), [Liran Harel](https://arxiv.org/search/?searchtype=author&query=Liran Harel), [Nachum Dershowitz](https://arxiv.org/search/?searchtype=author&query=Nachum Dershowitz), [Liat Nadai Arad](https://arxiv.org/search/?searchtype=author&query=Liat Nadai Arad), [Kfir Bar](https://arxiv.org/search/?searchtype=author&query=Kfir Bar)

We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets – one from internal medicine and emergency departments, and another from oncology – annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.

Subject: Computation and Language

Publish: 2025-12-12 11:54:50 UTC

9 Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning

Authors: [Xuanbo Su](https://arxiv.org/search/?searchtype=author&query=Xuanbo Su), [Yingfang Zhang](https://arxiv.org/search/?searchtype=author&query=Yingfang Zhang), [Hao Luo](https://arxiv.org/search/?searchtype=author&query=Hao Luo), [Xiaoteng Liu](https://arxiv.org/search/?searchtype=author&query=Xiaoteng Liu), [Leo Huang](https://arxiv.org/search/?searchtype=author&query=Leo Huang)

Large language models (LLMs) adapt to tasks via gradient fine-tuning (computationally heavy, prone to catastrophic forgetting) or In-Context Learning (ICL; low robustness, poor learning from mistakes). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (a 47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1%), proving it a strong training-free alternative for complex reasoning.
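The hold-out gating step described above can be sketched in a few lines. The scorer here is a toy stand-in, and the batch-wise error abstraction (distilling guidance from failure batches with an LLM) is not shown; the point is the admission rule that makes improvement monotonic:

```python
def update_notebook(notebook, candidate, evaluate):
    """Hold-out gating step of a mistake notebook: admit a candidate
    guidance entry only if it strictly improves hold-out accuracy over
    the current notebook, guaranteeing monotonic improvement.

    evaluate(notebook) -> hold-out accuracy in [0, 1] (user-supplied).
    Returns (possibly updated notebook, whether the candidate was kept).
    """
    trial = notebook + [candidate]
    if evaluate(trial) > evaluate(notebook):
        return trial, True
    return notebook, False

def toy_evaluate(notebook):
    # Stand-in hold-out scorer: in this toy, only guidance mentioning
    # "units" actually helps on the validation questions.
    return min(1.0, 0.5 + 0.2 * sum("units" in g for g in notebook))

notebook = []
notebook, kept_a = update_notebook(notebook, "re-check units in word problems", toy_evaluate)
notebook, kept_b = update_notebook(notebook, "prefer longer answers", toy_evaluate)
```

Guidance that fails the hold-out check is silently dropped, so the notebook's measured accuracy can never regress as new error batches are processed.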

Subject: Computation and Language

Publish: 2025-12-12 11:33:09 UTC

10 CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Authors: [Akash Ghosh](https://arxiv.org/search/?searchtype=author&query=Akash Ghosh), [Srivarshinee Sridhar](https://arxiv.org/search/?searchtype=author&query=Srivarshinee Sridhar), [Raghav Kaushik Ravi](https://arxiv.org/search/?searchtype=author&query=Raghav Kaushik Ravi), [Muhsin Muhsin](https://arxiv.org/search/?searchtype=author&query=Muhsin Muhsin), [Sriparna Saha](https://arxiv.org/search/?searchtype=author&query=Sriparna Saha), [Chirag Agarwal](https://arxiv.org/search/?searchtype=author&query=Chirag Agarwal)

Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

Subject: Computation and Language

Publish: 2025-12-12 10:19:27 UTC

11 Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Authors: [Galann Pennec](https://arxiv.org/search/?searchtype=author&query=Galann Pennec), [Zhengyuan Liu](https://arxiv.org/search/?searchtype=author&query=Zhengyuan Liu), [Nicholas Asher](https://arxiv.org/search/?searchtype=author&query=Nicholas Asher), [Philippe Muller](https://arxiv.org/search/?searchtype=author&query=Philippe Muller), [Nancy F. Chen](https://arxiv.org/search/?searchtype=author&query=Nancy F. Chen)

Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.

Subjects: Computation and Language, Computer Vision and Pattern Recognition

Publish: 2025-12-12 09:19:45 UTC

12 Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis

Authors: [Felipe Ribeiro Fujita de Mello](https://arxiv.org/search/?searchtype=author&query=Felipe Ribeiro Fujita de Mello), [Hideyuki Takada](https://arxiv.org/search/?searchtype=author&query=Hideyuki Takada)

We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
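As a minimal illustration of the lexical end of this spectrum, here is a stdlib-only toy TF-IDF selector. Real selectors score parallel sentence pairs with far richer signals (COMET Kiwi is a learned quality estimator, for instance); this sketch only shows the ranking-and-keep-top-k shape shared by all of them:

```python
import math
from collections import Counter

def tfidf_select(candidates, k):
    """Toy lexical selector: rank candidate fine-tuning examples by the
    mean TF-IDF weight of their tokens and keep the top-k, on the
    heuristic that lexically distinctive examples are more informative."""
    docs = [c.lower().split() for c in candidates]
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)

    def mean_tfidf(d):
        tf = Counter(d)
        total = sum((cnt / len(d)) * math.log(n / df[w]) for w, cnt in tf.items())
        return total / len(set(d))

    ranked = sorted(candidates, key=lambda c: mean_tfidf(c.lower().split()),
                    reverse=True)
    return ranked[:k]

pool = ["good morning", "good evening", "quantum entanglement phenomena", "good afternoon"]
selected = tfidf_select(pool, 1)
```

Swapping `mean_tfidf` for a semantic scorer (embedding similarity, learned quality estimation) while keeping the same ranking harness is exactly the kind of controlled comparison the paper runs.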

Subject: Computation and Language

Publish: 2025-12-12 08:59:27 UTC

Authors: [Tomáš Koref](https://arxiv.org/search/?searchtype=author&query=Tomáš Koref), [Lena Held](https://arxiv.org/search/?searchtype=author&query=Lena Held), [Mahammad Namazov](https://arxiv.org/search/?searchtype=author&query=Mahammad Namazov), [Harun Kumru](https://arxiv.org/search/?searchtype=author&query=Harun Kumru), [Yassine Thlija](https://arxiv.org/search/?searchtype=author&query=Yassine Thlija), [Christoph Burchard](https://arxiv.org/search/?searchtype=author&query=Christoph Burchard), [Ivan Habernal](https://arxiv.org/search/?searchtype=author&query=Ivan Habernal)

Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts’ decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic/non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and shows the potential of legal argument mining for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon. 
法院必须为其裁决提供理由,但大规模系统地分析司法推理仍然困难。本研究通过开发自动化方法来检测和分类捷克最高法院判决中的司法推理,利用最先进的自然语言处理方法,驳斥了关于中东欧(CEE)形式主义裁判的说法。我们创建了 MADON 数据集,包含来自两家捷克最高法院的 272 份裁决,专家对 9,183 个段落进行了注释,标注了八种论证类型以及用于监督训练和评估的整体形式主义标签。利用包含 30 万份捷克法院裁决的语料库,我们通过继续预训练将变换器 LLMs 适配于捷克法律领域,并尝试了包括非对称损失和类别加权在内的处理数据集不平衡的方法。表现最好的模型成功检测出论证性段落(宏平均 F1 为 82.6%)、分类传统法律论证类型(宏平均 F1 为 77.5%),并将裁决分类为形式主义/非形式主义(宏平均 F1 为 83.2%)。我们结合 ModernBERT、Llama 3.1 和基于传统特征的机器学习的三阶段管道在决策分类方面取得了可观的结果,同时降低了计算成本并提高了可解释性。实证上,我们对关于 CEE 形式主义的主流叙述提出了质疑。该工作表明,法律论证挖掘能够实现可靠的司法哲学分类,并展示了法律论证挖掘在计算法学其他重要任务中的潜力。我们的方法易于在不同司法辖区复制,且我们的完整流程、数据集、指南、模型和源代码可在 https://github.com/trusthlt/madon 获取。
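
As a quick illustration of the class-weighting idea mentioned above for imbalanced paragraph labels, inverse-frequency weights are one common choice; the paper's exact weighting scheme is not given in the abstract, so this is only a sketch:

```python
def inverse_frequency_weights(labels):
    """Assign each class a weight inversely proportional to its frequency,
    so rare classes contribute more to a weighted loss. This is a standard
    recipe, not necessarily the one used in the paper."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    total = len(labels)
    # total / (num_classes * count) keeps the average weight near 1.
    return {y: total / (len(counts) * c) for y, c in counts.items()}

# Toy imbalanced labels: 8 argumentative paragraphs vs. 2 non-argumentative.
weights = inverse_frequency_weights(["arg"] * 8 + ["non"] * 2)
```

Such weights are typically passed to a cross-entropy loss so that the minority class is not drowned out during training.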

Subjects: Computation and Language, Computers and Society 主题:计算与语言,计算机与社会

Publish: 2025-12-12 08:37:53 UTC 发布:2025-12-12 08:37:53 UTC

14 qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs 14 qa-FLoRA:面向 LLMs 的无数据查询自适应 LoRA 融合

Authors: [Shreya Shukla](https://arxiv.org/search/?searchtype=author&query=Shreya Shukla), [Aditya Sriram](https://arxiv.org/search/?searchtype=author&query=Aditya Sriram), [Milinda Kuppur Narayanaswamy](https://arxiv.org/search/?searchtype=author&query=Milinda Kuppur Narayanaswamy), [Hiteshi Jain](https://arxiv.org/search/?searchtype=author&query=Hiteshi Jain) 作者:Shreya Shukla、Aditya Sriram、Milinda Kuppur Narayanaswamy、Hiteshi Jain

The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains, show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation. 将大型语言模型部署到专门任务上通常需要通过低秩自适应(LoRA)模块进行面向领域的参数高效微调。然而,有效融合这些适配器以处理复杂的多领域复合查询仍然是一个关键挑战。现有的 LoRA 融合方法要么使用静态权重,给每个参与的 LoRA 分配相同的重要性,要么需要对每一种可能的 LoRA 组合进行数据密集的有监督训练以获得各自的最优融合权重。我们提出了 qa-FLoRA,这是一种新颖的针对查询自适应、无需数据和训练的 LoRA 融合方法,通过衡量基础模型与各适配器之间的分布差异,动态计算层级融合权重。我们的方法消除了对复合训练数据或领域代表性样本的需求,使其可直接应用于现有的适配器集合。 在涵盖数学、编码和医学领域的九个多语言复合任务上进行的大规模实验证明,qa-FLoRA 比静态融合在 LLaMA-2 上高出约 5%,在 LLaMA-3 上高出约 6%;相比无需训练的基线,qa-FLoRA 在 LLaMA-2 上高出约 7%,在 LLaMA-3 上高出约 10%,同时显著缩小了与有监督基线之间的差距。 此外,对我们的融合权重进行的层级分析揭示了可解释的融合模式,展示了我们的方法在实现稳健的多领域适应方面的有效性。
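
The abstract does not specify qa-FLoRA's divergence measure, so the following is only a minimal sketch of the core idea: score each LoRA adapter at a layer by how far its output distribution diverges from the base model's, then normalize the scores into fusion weights. KL divergence and softmax normalization are assumptions made here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def layer_fusion_weights(base_logits, adapter_logits):
    """Score each adapter by the KL divergence between its output
    distribution and the base model's at one layer, then turn the
    scores into fusion weights. KL + softmax are illustrative choices;
    the paper's actual divergence and normalization are not specified
    in the abstract."""
    p_base = softmax(base_logits)
    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))
    scores = np.array([kl(softmax(a), p_base) for a in adapter_logits])
    return softmax(scores)

# Adapter 0 deviates strongly from the base model, adapter 1 barely at all,
# so adapter 0 receives the larger fusion weight.
weights = layer_fusion_weights(
    np.array([1.0, 0.5, 0.2]),
    [np.array([3.0, 0.1, 0.1]), np.array([1.1, 0.5, 0.3])],
)
```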

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 08:27:34 UTC 发布:2025-12-12 08:27:34 UTC

15 Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture 15 通过认知记忆架构统一动态工具创建与跨任务经验共享

Authors: [Jiarun Liu](https://arxiv.org/search/?searchtype=author&query=Jiarun Liu), [Shiyue Xu](https://arxiv.org/search/?searchtype=author&query=Shiyue Xu), [Yang Li](https://arxiv.org/search/?searchtype=author&query=Yang Li), [Shangkun Liu](https://arxiv.org/search/?searchtype=author&query=Shangkun Liu), [Yongli Yu](https://arxiv.org/search/?searchtype=author&query=Yongli Yu), [Peng Cao](https://arxiv.org/search/?searchtype=author&query=Peng Cao) 作者:Jiarun Liu、Shiyue Xu、Yang Li、Shangkun Liu、Yongli Yu、Peng Cao

Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH’s effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation. 大型语言模型代理在适应新任务时面临根本性挑战,原因在于工具可用性的限制和经验重用的不足。现有方法要么依赖覆盖范围有限的预定义工具,要么从零构建工具而不利用以往经验,导致探索效率低下和性能不佳。我们提出了 SMITH(共享记忆集成工具中心),一种统一的认知架构,能够通过分层记忆组织无缝整合动态工具创建与跨任务经验共享。SMITH 将代理记忆组织为程序性、语义性和情景性组件,实现系统化能力扩展的同时保留成功的执行模式。我们的方法将工具创建形式化为受控沙箱环境中的迭代代码生成,并通过结合语义相似度匹配的情景记忆检索来实现经验共享。我们进一步提出了一种基于代理集成难度重估的课程学习策略。 在 GAIA 基准上的大量实验证明了 SMITH 的有效性,实现了 81.8%的 Pass@1 准确率,且超越了包括 Alita(75.2%)和 Memento(70.9%)在内的最先进基线模型。我们的工作为构建真正自适应的智能体奠定了基础,这类智能体通过有原则地整合工具创建与经验积累,持续进化其能力。
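
Episodic-memory retrieval by semantic similarity, as described above, can be sketched as follows; the `embed` function and memory layout are hypothetical stand-ins, not SMITH's actual interfaces.

```python
import numpy as np

def embed(text):
    """Toy bag-of-characters embedding standing in for a real sentence
    encoder; only for demonstrating the retrieval step."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, episodes, k=1):
    """Return the k stored episodes most similar to the query by cosine
    similarity (embeddings are unit-normalized, so a dot product suffices)."""
    q = embed(query)
    scored = sorted(episodes, key=lambda ep: -float(embed(ep) @ q))
    return scored[:k]
```

In a full agent, the retrieved episodes would be injected into the context so that past successful execution patterns inform the current task.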

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 06:00:11 UTC 发布:2025-12-12 06:00:11 UTC

16 LegalRikai: Open Benchmark – A Benchmark for Complex Japanese Corporate Legal Tasks 16 LegalRikai:开放基准 —— 针对复杂日本企业法律任务的基准测试

Authors: [Shogo Fujita](https://arxiv.org/search/?searchtype=author&query=Shogo Fujita), [Yuji Naraki](https://arxiv.org/search/?searchtype=author&query=Yuji Naraki), [Yiqing Zhu](https://arxiv.org/search/?searchtype=author&query=Yiqing Zhu), [Shinsuke Mori](https://arxiv.org/search/?searchtype=author&query=Shinsuke Mori) 作者:Shogo Fujita, Yuji Naraki, Yiqing Zhu, Shinsuke Mori

This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, and assessing structural consistency remains a challenge. The result demonstrates the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain. 本文介绍了 LegalRikai: Open Benchmark,一项新的基准测试,包含四个模拟日本公司法律实务的复杂任务。该基准由法律专业人士在一名律师的监督下创建。该基准包含 100 个样本,要求生成长篇、结构化的输出,并且我们根据多项实用标准对其进行了评估。我们使用包括 GPT-5、Gemini 2.5 Pro 和 Claude Opus 4.1 在内的领先 LLMs 进行了人工和自动评估。我们的人工评估显示,抽象的指示会促使不必要的修改,凸显了模型在文档级编辑方面的弱点,而这些弱点在传统的短文本任务中未被发现。此外,我们的分析表明,在具有明确语言基础的评估标准上,自动评估与人工判断高度一致,但评估结构一致性仍然是一个挑战。结果表明,当专家可用性有限时,自动评估作为筛查工具具有实用价值。我们提出了一个数据集评估框架,以促进法律领域更贴近实务的研究。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 05:47:06 UTC 发布:2025-12-12 05:47:06 UTC

17 CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise 17 CIP:一种即插即用的因果提示框架,用于在长上下文噪声下缓解幻觉

Authors: [Qingsen Ma](https://arxiv.org/search/?searchtype=author&query=Qingsen Ma), [Dianyun Wang](https://arxiv.org/search/?searchtype=author&query=Dianyun Wang), [Ran Jing](https://arxiv.org/search/?searchtype=author&query=Ran Jing), [Yujun Sun](https://arxiv.org/search/?searchtype=author&query=Yujun Sun), [Zhenbo Xu](https://arxiv.org/search/?searchtype=author&query=Zhenbo Xu) 作者:马青森、王殿云、景然、孙宇峻、徐振博

Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non-causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving a 2.6-point improvement in Attributable Rate, a 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling further shows that CIP accelerates contextual understanding and reduces end-to-end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models. 大型语言模型在处理冗长且噪声较多的检索上下文时常常出现幻觉,因为它们依赖于虚假的相关性而非真实的因果关系。我们提出了 CIP,一种轻量且即插即用的因果提示框架,用于在输入阶段减轻幻觉。CIP 构建了实体、动作和事件之间的因果关系序列,并将其注入提示中,以引导推理朝向与因果相关的证据。通过因果干预和反事实推理,CIP 抑制了非因果的推理路径,提高了事实依据性和可解释性。在包括 GPT-4o、Gemini 2.0 Flash 和 Llama 3.1 在内的七种主流语言模型上的实验表明,CIP 持续提升了推理质量和可靠性,在可归因率上提高了 2.6 个点,因果一致性得分提高了 0.38,且有效信息密度增加了四倍。API 级别的分析进一步显示,CIP 加速了上下文理解,并将端到端响应延迟最多减少了 55.1%。这些结果表明,因果推理可能成为提高大型语言模型可解释性、稳定性和效率的一种有前途的范式。
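
A hedged sketch of injecting a causal relation sequence into a prompt, in the spirit of CIP; the template and wording below are invented for illustration and not taken from the paper.

```python
def build_causal_prompt(question, context, causal_chain):
    """Assemble a prompt that prepends an explicit causal chain so the
    model is steered toward causally relevant evidence. The field names
    and instruction text are hypothetical."""
    chain = " -> ".join(causal_chain)
    return (
        "Context:\n" + context + "\n\n"
        "Causal chain (reason only along these links):\n" + chain + "\n\n"
        "Question: " + question + "\n"
        "Answer using only causally relevant evidence from the context."
    )

prompt = build_causal_prompt(
    "Why did the shipment arrive late?",
    "A storm closed the port on Monday, forcing carriers to reroute.",
    ["storm", "port closure", "rerouted shipment", "late arrival"],
)
```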

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-12 05:02:26 UTC 发布时间:2025-12-12 05:02:26 UTC

18 AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference 18 AdaSD:用于高效语言模型推理的自适应投机解码

Authors: [Kuan-Wei Lu](https://arxiv.org/search/?searchtype=author&query=Kuan-Wei Lu), [Ding-Yong Hong](https://arxiv.org/search/?searchtype=author&query=Ding-Yong Hong), [Pangfeng Liu](https://arxiv.org/search/?searchtype=author&query=Pangfeng Liu)

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference. 大型语言模型(LLMs)在广泛任务上取得了显著性能,但其不断增长的参数规模大幅减慢了推理速度。投机性解码通过利用较小的草稿模型来预测候选标记,然后由更大的目标模型验证这些候选,从而缓解了这一问题。然而,现有方法通常需要额外训练、广泛的超参数调优或在部署前对模型和任务进行预先分析。本文提出了自适应投机性解码(AdaSD),这是一种无需超参数的解码方案,在推理过程中动态调整生成长度和接受标准。AdaSD 引入了两个自适应阈值:一个用于判断何时停止候选标记生成,另一个用于决定标记是否被接受,两个阈值均根据标记熵和詹森-香农距离实时更新。这一方法消除了预先分析或微调的需要,并兼容现成模型。 在基准数据集上的实验表明,AdaSD 在将准确性降幅控制在 2% 以内的同时,相较于标准投机解码最多可实现 49% 的速度提升,使其成为一种用于高效且自适应 LLM 推理的实用方案。
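
Per the abstract, AdaSD's two adaptive thresholds are driven by token entropy (when to stop drafting) and the Jensen-Shannon distance between draft and target distributions (whether to accept a token). The threshold-update rule itself is not detailed in the abstract, so only the two signals are sketched here.

```python
import math

def token_entropy(p):
    """Shannon entropy of a token distribution; high entropy suggests the
    draft model is uncertain and drafting should stop."""
    return -sum(x * math.log(x) for x in p if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance (square root of JS divergence, natural log)
    between draft and target token distributions; a small distance
    suggests the draft token can be accepted."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```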

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 04:56:08 UTC 发布:2025-12-12 04:56:08 UTC

19 When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents 19 当行动教你思考:通过强化学习在会话代理中实现推理—行动协同

Authors: [Mrinal Rawat](https://arxiv.org/search/?searchtype=author&query=Mrinal Rawat), [Arkajyoti Chakraborty](https://arxiv.org/search/?searchtype=author&query=Arkajyoti Chakraborty), [Neha Gupta](https://arxiv.org/search/?searchtype=author&query=Neha Gupta), [Roberto Pieraccini](https://arxiv.org/search/?searchtype=author&query=Roberto Pieraccini) 作者:Mrinal Rawat、Arkajyoti Chakraborty、Neha Gupta、Roberto Pieraccini

Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging – annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the base of the vanilla Qwen3-1.7B model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents. 
监督微调(SFT)已成为提高大型语言模型(LLMs)在下游任务中表现的最有效方法之一。然而,当基础数据分布发生变化时,即便新数据并非完全超出训练领域,SFT 的泛化能力也可能受到影响。最近一些以推理为核心的模型(如 o1 和 R1)在与非推理对应模型的比较中持续取得提升,凸显了推理对于改善泛化性和可靠性的关键作用。然而,为 SFT 收集高质量的推理轨迹仍然具有挑战性——标注成本高、主观性强且难以扩展。为了解决这一限制,我们利用强化学习(RL)使模型能够直接从任务结果中学习推理策略。我们提出了一个流程,其中 LLMs 生成推理步骤,以指导工具调用(例如函数调用)和对话代理的最终答案生成。我们的方法采用群体相对策略优化(GRPO),并围绕工具准确性和答案正确性设计奖励,使模型能够迭代地改进其推理和动作。 实验结果表明,我们的方法同时提升了推理质量和工具调用的准确性,相较于未进行显式思考训练的 SFT 模型实现了 1.5%的相对提升,并且相较于原始 Qwen3-1.7B 基础模型实现了 40%的增益。这些发现证明了通过强化学习将推理与行动学习统一起来以构建更有能力且更具泛化性的对话式智能体的前景。
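
A minimal sketch of the reward design described above (tool accuracy plus answer correctness) combined with GRPO's group-relative advantage normalization; the equal 0.5/0.5 weighting is an assumption for illustration, not the paper's stated configuration.

```python
import statistics

def reward(tool_correct: bool, answer_correct: bool) -> float:
    """Composite reward over tool-call accuracy and final-answer
    correctness; the 0.5/0.5 split is illustrative."""
    return 0.5 * tool_correct + 0.5 * answer_correct

def group_relative_advantages(rewards):
    """GRPO's core step: standardize each sampled completion's reward
    against the group of completions for the same prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / sigma if sigma else 0.0 for r in rewards]

# Four sampled rollouts for one prompt, scored and normalized.
advs = group_relative_advantages([reward(True, True), reward(True, False),
                                  reward(False, False), reward(False, False)])
```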

Subjects: Computation and Language, Machine Learning 主题:计算与语言 , 机器学习

Publish: 2025-12-12 04:44:40 UTC 发布:2025-12-12 04:44:40 UTC

20 Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach 20 利用 LLMs 对系统综述的题目和摘要进行筛选:一种具有成本效益的动态少样本学习方法

Authors: [Yun-Chung Liu](https://arxiv.org/search/?searchtype=author&query=Yun-Chung Liu), [Rui Yang](https://arxiv.org/search/?searchtype=author&query=Rui Yang), [Jonathan Chong Kai Liew](https://arxiv.org/search/?searchtype=author&query=Jonathan Chong Kai Liew), [Ziran Yin](https://arxiv.org/search/?searchtype=author&query=Ziran Yin), [Henry Foote](https://arxiv.org/search/?searchtype=author&query=Henry Foote), [Christopher J. Lindsell](https://arxiv.org/search/?searchtype=author&query=Christopher J. Lindsell), [Chuan Hong](https://arxiv.org/search/?searchtype=author&query=Chuan Hong) 作者:Yun-Chung Liu, Rui Yang, Jonathan Chong Kai Liew, Ziran Yin, Henry Foote, Christopher J. Lindsell, Chuan Hong

Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications. 系统综述是循证医学的重要组成部分,在综合现有研究证据并指导临床决策方面发挥着关键作用。然而,随着研究出版物的快速增长,开展系统综述变得越来越繁重,其中题目和摘要筛选是最耗时、最耗资源的步骤之一。为缓解这一问题,我们设计了一种两阶段动态少样本学习(DFSL)方法,旨在提高大型语言模型(LLMs)在题目和摘要筛选任务中的效率与性能。具体而言,该方法首先使用低成本的 LLM 进行初步筛选,然后对低置信度的实例使用高性能的 LLM 重新评估,从而在控制计算成本的同时提升筛选性能。我们在 10 个系统综述上评估了该方法,结果表明其具有很强的泛化性和成本效益,有望在实际应用中减少人工筛选负担并加速系统综述流程。
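
The two-stage routing described above can be sketched as follows; the `(label, confidence)` model interface and the 0.8 confidence threshold are hypothetical stand-ins for the paper's actual setup.

```python
def screen(record, cheap_model, strong_model, conf_threshold=0.8):
    """Stage 1: a low-cost model screens the record. Stage 2: if its
    confidence is below the threshold, a stronger model re-evaluates.
    Returns (label, which_stage_decided)."""
    label, conf = cheap_model(record)
    if conf >= conf_threshold:
        return label, "cheap"
    label, conf = strong_model(record)
    return label, "strong"

# Stub models standing in for LLM calls, for demonstration only.
cheap = lambda r: ("include", 0.95) if "randomized" in r else ("exclude", 0.4)
strong = lambda r: ("include", 0.9)
```

In practice only the low-confidence minority of records reaches the expensive model, which is what keeps the overall screening cost down.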

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 03:51:54 UTC 发布:2025-12-12 03:51:54 UTC

Authors: [Di Wu](https://arxiv.org/search/?searchtype=author&query=Di Wu), [Ruiyu Fang](https://arxiv.org/search/?searchtype=author&query=Ruiyu Fang), [Liting Jiang](https://arxiv.org/search/?searchtype=author&query=Liting Jiang), [Shuangyong Song](https://arxiv.org/search/?searchtype=author&query=Shuangyong Song), [Xiaomeng Huang](https://arxiv.org/search/?searchtype=author&query=Xiaomeng Huang), [Shiquan Wang](https://arxiv.org/search/?searchtype=author&query=Shiquan Wang), [Zhongqiu Li](https://arxiv.org/search/?searchtype=author&query=Zhongqiu Li), [Lingling Shi](https://arxiv.org/search/?searchtype=author&query=Lingling Shi), [Mengjiao Bao](https://arxiv.org/search/?searchtype=author&query=Mengjiao Bao), [Yongxiang Li](https://arxiv.org/search/?searchtype=author&query=Yongxiang Li), [Hao Huang](https://arxiv.org/search/?searchtype=author&query=Hao Huang) 作者:吴迪、方锐宇、姜丽婷、宋双勇、黄晓蒙、王世权、李忠秋、石玲玲、鲍梦娇、李永祥、黄浩

Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU. 多意图语音语言理解(SLU)包括两个任务:多意图检测和槽位填充,共同处理包含多个意图的语句。由于这一特性与现实应用高度契合,该任务吸引了越来越多的研究关注,并取得了可观的进展。然而,关于多意图 SLU 的现有研究仍缺乏全面而系统的综述。为此,本文对近期多意图 SLU 的进展进行了综述。我们从两方面对以往研究进行了深入概述:解码范式和建模方法。在此基础上,我们进一步比较了代表性模型的性能并分析了它们的优缺点。最后,我们讨论了当前的挑战并勾画了未来研究的有前景方向。我们希望本综述能提供有价值的见解并成为推进多意图 SLU 研究的有用参考。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-12 03:46:39 UTC 发布:2025-12-12 03:46:39 UTC

22 SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing 22 SciLaD: 一个用于自然科学语言处理的大规模、透明且可复现的数据集

Authors: [Luca Foppiano](https://arxiv.org/search/?searchtype=author&query=Luca Foppiano), [Sotaro Takeshita](https://arxiv.org/search/?searchtype=author&query=Sotaro Takeshita), [Pedro Ortiz Suarez](https://arxiv.org/search/?searchtype=author&query=Pedro Ortiz Suarez), [Ekaterina Borisova](https://arxiv.org/search/?searchtype=author&query=Ekaterina Borisova), [Raia Abu Ahmad](https://arxiv.org/search/?searchtype=author&query=Raia Abu Ahmad), [Malte Ostendorff](https://arxiv.org/search/?searchtype=author&query=Malte Ostendorff), [Fabio Barth](https://arxiv.org/search/?searchtype=author&query=Fabio Barth), [Julian Moreno-Schneider](https://arxiv.org/search/?searchtype=author&query=Julian Moreno-Schneider), [Georg Rehm](https://arxiv.org/search/?searchtype=author&query=Georg Rehm) 作者:Luca Foppiano、Sotaro Takeshita、Pedro Ortiz Suarez、Ekaterina Borisova、Raia Abu Ahmad、Malte Ostendorff、Fabio Barth、Julian Moreno-Schneider、Georg Rehm

SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding including scholarly document processing. SciLaD 是一个新颖的大规模科学语言数据集,完全使用开源框架和公开可用的数据源构建。它包含一个精心整理的英文子集,涵盖超过一千万篇科学出版物,以及一个多语言、未过滤的 TEI XML 子集,包含超过三千五百万篇出版物。我们还发布了用于生成 SciLaD 的可扩展管道。数据集构建和处理工作流展示了开源工具如何在保持高数据质量的同时实现大规模科学数据的整理。最后,我们在该数据集上对 RoBERTa 模型进行了预训练,并在一系列综合基准上进行了评估,取得了与其他同等规模的科学语言模型相当的性能,验证了 SciLaD 的质量和实用性。我们发布该数据集和评估管道,以促进可复现性、透明性以及包括学术文档处理在内的自然科学语言处理与理解领域的进一步研究。

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-12 00:40:40 UTC 发布:2025-12-12 00:40:40 UTC

23 FIBER: A Multilingual Evaluation Resource for Factual Inference Bias 23 FIBER:用于事实推理偏见的多语言评估资源

Authors: [Evren Ayberk Munis](https://arxiv.org/search/?searchtype=author&query=Evren Ayberk Munis), [Deniz Yılmaz](https://arxiv.org/search/?searchtype=author&query=Deniz Yılmaz), [Arianna Muti](https://arxiv.org/search/?searchtype=author&query=Arianna Muti), [Çağrı Toraman](https://arxiv.org/search/?searchtype=author&query=Çağrı Toraman) 作者:Evren Ayberk Munis、Deniz Yılmaz、Arianna Muti、Çağrı Toraman

Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model’s generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across topics: 31% of the topics exhibit a factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages: Turkish prompts show higher bias than Italian prompts in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.
大型语言模型在各个领域被广泛使用,但其事实可靠性和偏见问题令人担忧。事实知识探测为评估这些方面提供了系统化的方法。现有大多数基准主要集中在单实体事实和单语数据上。因此,我们提出了 FIBER,这是一个用于评估单实体与多实体情境下事实知识的多语种基准。该数据集包括英文、意大利文和土耳其文的句子补全、问答和对象计数预测任务。利用 FIBER,我们考察了提示语言是否会在实体选择上引入推理偏差,以及大型语言模型在多实体问题与单实体问题上的表现。结果表明,提示的语言会影响模型生成的输出,尤其是与该语言对应国家相关的实体。然而,这种影响因不同主题而异,31% 的主题表现出大于 0.5 的事实推理偏差得分。 此外,偏见程度在不同语言间存在差异:在 83%的主题中,土耳其语提示显示的偏见高于意大利语,表明存在与语言相关的模式。我们的发现还显示,模型在处理多实体问题时比单实体问题面临更大难度。模型性能在不同语言和模型规模之间也有所差异。平均精确率最高的是英语,而土耳其语和意大利语的得分明显较低。较大模型(包括 Llama-3.1-8B 和 Qwen-2.5-7B)比较小的 3B–4B 模型表现出持续更好的性能。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-11 20:51:16 UTC 发布时间:2025-12-11 20:51:16 UTC

24 Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution 24 解释偏差是产物:揭示事后特征归因中的隐含词汇和位置偏好

Authors: [Jonathan Kamp](https://arxiv.org/search/?searchtype=author&query=Jonathan Kamp), [Roos Bakker](https://arxiv.org/search/?searchtype=author&query=Roos Bakker), [Dominique Blok](https://arxiv.org/search/?searchtype=author&query=Dominique Blok) 作者:Jonathan Kamp、Roos Bakker、Dominique Blok

Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type score low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves. 高质量的解释增强了对语言模型和数据的理解。特征归因方法,例如积分梯度,是一种事后解释器,可以提供基于标记的见解。然而,由于不同方法的潜在偏差,对相同输入的解释可能会有很大差异。知情用户可能意识到这一问题并因此不信任其效用,而不知情的用户则可能对其过度信任。在本工作中,我们深入探讨归因方法之间表面不一致之下的结构性偏差,通过一个与模型和方法无关的由三项评估指标构成的框架来构建这些偏差。我们系统性地评估了两种 Transformer 的词汇和位置偏差(输入中是什么以及在哪里);首先,在人工数据上的受控伪随机分类任务中;然后,在自然数据上的半受控因果关系检测任务中。我们发现,在模型比较中,词汇偏差和位置偏差在结构上并不平衡,某一类得分高的模型在另一类上得分低。我们还发现,产生异常解释的方法更有可能自身存在偏差。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-11 20:48:22 UTC 发布:2025-12-11 20:48:22 UTC

25 Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment 25 将自然语言处理应用于 iMessage:理解话题回避、响应性与情感

Authors: [Alan Gerber](https://arxiv.org/search/?searchtype=author&query=Alan Gerber), [Sam Cooperman](https://arxiv.org/search/?searchtype=author&query=Sam Cooperman) 作者:Alan Gerber,Sam Cooperman

What is your messaging data used for? While many users do not often think about the information companies can gather based on their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to its walled-garden ecosystem, providing iMessage users on Mac with one file storing all their messages and attached metadata. With knowledge of this locally stored file, the question now becomes: What can our data do for us? In the creation of our iMessage text message analyzer, we set out to answer five main research questions focusing on topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered using our analyzer and its potential in future studies on iMessage data. 你的消息数据被用来做什么?尽管许多用户并不常考虑公司可以基于他们所选消息平台收集的信息,但随着社会越来越依赖短文本电子通信,这一点仍然很重要。尽管大多数公司将其数据严密保护,用户或潜在黑客无法访问,苹果却为其封闭生态系统打开了一扇门,为 Mac 上的 iMessage 用户提供了一个存储其所有消息及附带元数据的文件。了解了这个本地存储的文件后,问题变成了:我们的数据能为我们做什么?在创建我们的 iMessage 短信分析器时,我们着手回答五个主要研究问题,重点关注主题建模、响应时间、回避评分和情感分析。本文使用我们的探索性数据展示了如何使用我们的分析器回答这些问题,以及其在未来 iMessage 数据研究中的潜力。

Subjects: Computation and Language, Computers and Society, Applications, Other Statistics 主题:计算与语言、计算机与社会、应用、其他统计

Publish: 2025-12-11 19:48:51 UTC 发表:2025-12-11 19:48:51 UTC

26 MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data 26 MultiScript30k:利用多语种嵌入扩展跨书写系统平行数据

Authors: [Christopher Driggers-Ellis](https://arxiv.org/search/?searchtype=author&query=Christopher Driggers-Ellis), [Detravious Brinkley](https://arxiv.org/search/?searchtype=author&query=Detravious Brinkley), [Ray Chen](https://arxiv.org/search/?searchtype=author&query=Ray Chen), [Aashish Dhawan](https://arxiv.org/search/?searchtype=author&query=Aashish Dhawan), [Daisy Zhe Wang](https://arxiv.org/search/?searchtype=author&query=Daisy Zhe Wang), [Christan Grant](https://arxiv.org/search/?searchtype=author&query=Christan Grant) 作者:Christopher Driggers-Ellis、Detravious Brinkley、Ray Chen、Aashish Dhawan、Daisy Zhe Wang、Christan Grant

Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that our Multi30k extension consistently achieves cosine similarity greater than 0.8 and symmetric KL divergence less than 0.000251 for all supported languages except Zh_Hant, which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split.
Multi30k 在多模态机器翻译(MMT)文献中被频繁引用,为训练和微调深度学习模型提供并行文本数据。然而,它仅限于四种语言:捷克语、英语、法语和德语。这一限制导致许多研究者只将研究集中在这些语言上。因此,由于官方 Multi30k 数据集仅代表使用拉丁字母的欧洲语言,关于多样化语言的 MMT 研究受到阻碍。此前曾有扩展 Multi30k 的尝试,但支持的语言列表、所代表的语系和书写系统仍然很少。为了解决这些问题,我们提出了 MultiScript30k,这是一个面向全球多种书写系统语言的全新 Multi30k 数据集扩展,通过使用 NLLB200-3.3B 将 Multi30k 的英文版本(Multi30k-En)翻译而成。该数据集包含超过 30000 个句子,并提供了 Multi30k-En 中所有句子的阿拉伯语(Ar)、西班牙语(Es)、乌克兰语(Uk)、简体中文(Zh_Hans)和繁体中文(Zh_Hant)翻译。相似性分析表明,Multi30k 扩展在所有支持的语言中(除了与先前的 Multi30k 扩展 ArEnMulti30k 和 Multi30k-Uk 相当的繁体中文 Zh_Hant)始终达到大于 0.8 的余弦相似度和小于 0.000251 的对称 KL 散度。COMETKiwi 得分对 MultiScript30k 作为 Multi30k-En 翻译的评价与相关工作相比呈混合态势。ArEnMulti30k 的得分几乎等于 MultiScript30k-Ar,但 Multi30k-Uk 的得分在每个划分上比 MultiScript30k-Uk 高出 6.4%。
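
The two similarity measures used in the analysis above can be sketched on toy inputs as follows (the paper computes them over sentence representations and distributions; the values here are placeholders):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (e.g., sentence embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def symmetric_kl(p, q):
    """Symmetrized Kullback-Leibler divergence: KL(p||q) + KL(q||p)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return kl(p, q) + kl(q, p)
```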

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Multimedia 主题:计算与语言、人工智能、机器学习、多媒体

Publish: 2025-12-11 19:43:19 UTC 发布:2025-12-11 19:43:19 UTC

27 PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data 27 PIAST:用于稀缺训练数据的快速提示与上下文增强

Authors: [Pawel Batorski](https://arxiv.org/search/?searchtype=author&query=Pawel Batorski), [Paul Swoboda](https://arxiv.org/search/?searchtype=author&query=Paul Swoboda) 作者:Pawel Batorski、Paul Swoboda

LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. For faster execution, we use aggressive subsampling and a replay buffer for faster evaluations. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data efficient prompt engineering. Our code is available at https://github.com/Batorskq/PIAST. LLMs 对提示设计高度敏感,但手工制作有效提示很困难,且常常需要精心构造的少量示例。我们提出了一种快速的自动提示构建算法,通过生成一小组少量示例来增强人工指令。我们的方法使用蒙特卡洛沙普利值估计示例效用,迭代地替换/删除/保留少量示例。为加快执行,我们采用了激进的子采样和重放缓冲区以加速评估。我们的方法可以在不同的计算时间预算下运行。在有限预算下,我们在文本简化和 GSM8K 上优于现有的自动提示方法,并在分类和摘要任务上获得第二佳结果。在扩展但仍然适度的计算预算下,我们在自动提示方法中在分类、简化和 GSM8K 上建立了新的最先进记录。我们的结果表明,精心构造的示例,而非穷尽的指令搜索,是实现快速且数据高效的提示工程的主要杠杆。我们的代码可在 https://github.com/Batorskq/PIAST 获取。
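The core of the method is Monte Carlo estimation of each few-shot example's Shapley value, i.e., its average marginal contribution to prompt utility over random orderings. A hedged sketch, where `utility` stands in for an actual prompt-evaluation call (hypothetical; the paper's replay buffer and subsampling are omitted):

```python
import random

def monte_carlo_shapley(examples, utility, num_samples=200, seed=0):
    """Estimate each few-shot example's Shapley value by sampling random
    permutations and accumulating its marginal gain in utility."""
    rng = random.Random(seed)
    values = {ex: 0.0 for ex in examples}
    for _ in range(num_samples):
        perm = examples[:]
        rng.shuffle(perm)
        prefix, prev = [], utility([])
        for ex in perm:
            prefix.append(ex)
            cur = utility(prefix)
            values[ex] += cur - prev  # marginal contribution of ex
            prev = cur
    return {ex: v / num_samples for ex, v in values.items()}
```

High-value examples would be kept, low-value ones dropped or replaced, iterating toward a compact few-shot set.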

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-11 16:55:30 UTC 发布:2025-12-11 16:55:30 UTC

28 KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering 28 KBQA-R1: 强化大型语言模型用于知识库问答

Authors: [Xin Sun](https://arxiv.org/search/?searchtype=author&query=Xin Sun), [Zhongqi Chen](https://arxiv.org/search/?searchtype=author&query=Zhongqi Chen), [Xing Zheng](https://arxiv.org/search/?searchtype=author&query=Xing Zheng), [Qiang Liu](https://arxiv.org/search/?searchtype=author&query=Qiang Liu), [Shu Wu](https://arxiv.org/search/?searchtype=author&query=Shu Wu), [Bowen Song](https://arxiv.org/search/?searchtype=author&query=Bowen Song), [Zilei Wang](https://arxiv.org/search/?searchtype=author&query=Zilei Wang), [Weiqiang Wang](https://arxiv.org/search/?searchtype=author&query=Weiqiang Wang), [Liang Wang](https://arxiv.org/search/?searchtype=author&query=Liang Wang) 作者:孙鑫、陈中奇、郑兴、刘强、吴舒、宋博文、王子磊、王伟强、王亮

Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution. 知识库问答(KBQA)要求模型通过生成可执行的逻辑形式来弥合自然语言与严格知识图谱模式之间的差距。尽管大型语言模型(LLMs)在该领域取得了进展,但现有方法常常在两种失败模式之间徘徊:要么生成未经验证模式存在性的虚构查询,要么表现出僵化的、基于模板的推理,模仿合成的推理痕迹而未真正理解环境。为了解决这些局限性,我们提出了 KBQA-R1,一个将范式从文本模仿转向通过强化学习进行交互优化的框架。将 KBQA 视为一个多回合决策过程,我们的模型通过一系列动作学习在知识库中导航,利用群体相对策略优化(Group Relative Policy Optimization,GRPO)根据具体的执行反馈而非静态监督来改进策略。此外,我们引入了引用拒绝采样(Referenced Rejection Sampling,RRS),这是一种数据合成方法,通过严格将推理痕迹与真实动作序列对齐来解决冷启动问题。在 WebQSP、GrailQA 和 GraphQuestions 上的大量实验表明,KBQA-R1 达到了最先进的性能,有效地将 LLM 推理建立在可验证的执行之上。
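GRPO computes advantages relative to a group of sampled rollouts for the same question, rather than using a learned value function. A minimal sketch of the group-relative advantage normalization (the standard GRPO formulation, not the authors' training code):

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by its group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0:
        # All rollouts tied: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Rollouts whose executed logical forms return the correct answer get positive advantages and are reinforced; failed executions are pushed down.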

Subject: Computation and Language 主题:Computation and Language

Publish: 2025-12-10 17:45:42 UTC 发布:2025-12-10 17:45:42 UTC

29 MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA 29 MedBioRAG:用于医学与生物问答的语义搜索与检索增强生成,结合大规模语言模型

Author: [Seonok Kim](https://arxiv.org/search/?searchtype=author&query=Seonok Kim) 作者:Seonok Kim

Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications. 在检索增强生成(RAG)方面的最新进展显著提升了大型语言模型(LLMs)执行复杂问答(QA)任务的能力。在本文中,我们提出了 MedBioRAG,一种通过语义与词汇检索、文献检索和监督微调相结合来提升生物医学问答性能的检索增强模型。MedBioRAG 能高效检索并排序相关的生物医学文献,从而实现精确且具上下文意识的响应生成。我们在 NFCorpus、TREC-COVID、MedQA、PubMedQA 和 BioASQ 等基准数据集上对 MedBioRAG 在文本检索、封闭式问答和长格式问答任务上进行了评估。实验结果表明,MedBioRAG 在所有评估任务中均优于先前的最先进(SoTA)模型及 GPT-4o 基础模型。值得注意的是,我们的方法在文献检索上提高了 NDCG 和 MRR 得分,同时在封闭式问答上实现了更高的准确率,在长格式问答上取得了更高的 ROUGE 得分。我们的研究结果强调了基于语义搜索的检索与对 LLM 的微调在生物医学应用中的有效性。
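The combination of semantic and lexical search can be sketched as a weighted fusion of per-document scores. The `alpha` weight and the score sources here are illustrative assumptions, not MedBioRAG's actual configuration:

```python
def hybrid_score(semantic, lexical, alpha=0.7):
    # Weighted fusion of a semantic (embedding) score and a
    # lexical (term-matching) score for one document.
    return alpha * semantic + (1 - alpha) * lexical

def rank_documents(query_scores, alpha=0.7):
    """Rank documents by fused score. `query_scores` maps a document
    id to a (semantic_score, lexical_score) pair."""
    fused = {d: hybrid_score(s, l, alpha) for d, (s, l) in query_scores.items()}
    return sorted(fused, key=fused.get, reverse=True)
```

The top-ranked documents would then be passed to the fine-tuned LLM as retrieval context for answer generation.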

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-10 15:43:25 UTC 发布:2025-12-10 15:43:25 UTC

30 Benchmarking Automatic Speech Recognition Models for African Languages 30 为非洲语言基准测试自动语音识别模型

Authors: [Alvin Nahabwe](https://arxiv.org/search/?searchtype=author&query=Alvin Nahabwe), [Sulaiman Kagumire](https://arxiv.org/search/?searchtype=author&query=Sulaiman Kagumire), [Denis Musinguzi](https://arxiv.org/search/?searchtype=author&query=Denis Musinguzi), [Bruno Beijuka](https://arxiv.org/search/?searchtype=author&query=Bruno Beijuka), [Jonah Mubuuke Kyagaba](https://arxiv.org/search/?searchtype=author&query=Jonah Mubuuke Kyagaba), [Peter Nabende](https://arxiv.org/search/?searchtype=author&query=Peter Nabende), [Andrew Katumba](https://arxiv.org/search/?searchtype=author&query=Andrew Katumba), [Joyce Nakatumba-Nabende](https://arxiv.org/search/?searchtype=author&query=Joyce Nakatumba-Nabende) 作者:Alvin Nahabwe、Sulaiman Kagumire、Denis Musinguzi、Bruno Beijuka、Jonah Mubuuke Kyagaba、Peter Nabende、Andrew Katumba、Joyce Nakatumba-Nabende

Automatic speech recognition (ASR) for African languages remains constrained by limited labeled data and the lack of systematic guidance on model selection, data scaling, and decoding strategies. Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. Beyond reporting error rates, we provide new insights into why models behave differently under varying conditions. We show that MMS and W2v-BERT are more data efficient in very low-resource regimes, XLS-R scales more effectively as additional data becomes available, and Whisper demonstrates advantages in mid-resource conditions. We also analyze where external language model decoding yields improvements and identify cases where it plateaus or introduces additional errors, depending on the alignment between acoustic and text resources. By highlighting the interaction between pre-training coverage, model architecture, dataset domain, and resource availability, this study offers practical insights into the design of ASR systems for underrepresented languages. 针对非洲语言的自动语音识别(ASR)仍受限于标注数据稀缺以及在模型选择、数据扩展和解码策略方面缺乏系统性指导。诸如 Whisper、XLS-R、MMS 和 W2v-BERT 等大规模预训练系统虽扩展了 ASR 技术的可及性,但它们在非洲低资源语境中的比较行为尚未以统一且系统的方式进行研究。在本研究中,我们在 13 种非洲语言上对四种最先进的 ASR 模型进行了基准测试,对它们在逐步增大的转录数据子集(从 1 到 400 小时)上进行了微调。除了报告错误率外,我们还提供了关于模型在不同条件下为何表现不同的新见解。我们表明,在极低资源情形下 MMS 和 W2v-BERT 更具数据效率,随着额外数据的可得性增加,XLS-R 的扩展效果更为显著,而 Whisper 在中等资源条件下展现出优势。我们还分析了外部语言模型解码在哪些情况下能带来改进,并识别出根据声学资源与文本资源的一致性,会出现性能趋于平稳或引入额外错误的情形。通过强调预训练覆盖范围、模型架构、数据集领域和资源可用性之间的相互作用,本研究为设计面向弱势语言的自动语音识别系统提供了实用见解。
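The error rates reported in ASR benchmarks like this are typically word error rates (WER). For reference, a self-contained edit-distance WER implementation (the standard metric; shown for context, not taken from the paper):

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein-based WER: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```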

Subjects: Computation and Language, Sound, Audio and Speech Processing 主题:计算与语言、声音、音频与语音处理

Publish: 2025-11-30 10:21:19 UTC 发布日期:2025-11-30 10:21:19 UTC

31 ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages 31 听诊器下的语音识别:评估印度语言临床语音识别中的偏差

Authors: [Subham Kumar](https://arxiv.org/search/?searchtype=author&query=Subham Kumar), [Prakrithi Shivaprakash](https://arxiv.org/search/?searchtype=author&query=Prakrithi Shivaprakash), [Abhishek Manoharan](https://arxiv.org/search/?searchtype=author&query=Abhishek Manoharan), [Astut Kurariya](https://arxiv.org/search/?searchtype=author&query=Astut Kurariya), [Diptadhi Mukherjee](https://arxiv.org/search/?searchtype=author&query=Diptadhi Mukherjee), [Lekhansh Shukla](https://arxiv.org/search/?searchtype=author&query=Lekhansh Shukla), [Animesh Mukherjee](https://arxiv.org/search/?searchtype=author&query=Animesh Mukherjee), [Prabhat Chand](https://arxiv.org/search/?searchtype=author&query=Prabhat Chand), [Pratima Murthy](https://arxiv.org/search/?searchtype=author&query=Pratima Murthy) 作者:Subham Kumar、Prakrithi Shivaprakash、Abhishek Manoharan、Astut Kurariya、Diptadhi Mukherjee、Lekhansh Shukla、Animesh Mukherjee、Prabhat Chand、Pratima Murthy

Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we conduct the first systematic audit of ASR performance on real world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech to text, Gemma3n, Omnilingual, Vaani, and Gemini. We evaluate transcription accuracy across languages, speakers, and demographic subgroups, with a particular focus on error patterns affecting patients vs. clinicians and gender based or intersectional disparities. Our results reveal substantial variability across models and languages, with some systems performing competitively on Indian English but failing on code mixed or vernacular speech. We also uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings. By providing a comprehensive multilingual benchmark and fairness analysis, our work highlights the need for culturally and demographically inclusive ASR development for the healthcare ecosystem in India. 自动语音识别(ASR)正越来越多地用于记录临床会诊,但其在多语言且人口统计多样的印度医疗环境中的可靠性在很大程度上仍不为人知。在本研究中,我们对真实世界的临床访谈数据进行了首次系统性审计,覆盖卡纳达语、印地语和印度英语,比较了包括 Indic Whisper、Whisper、Sarvam、Google speech to text、Gemma3n、Omnilingual、Vaani 和 Gemini 在内的领先模型。我们评估了不同语言、说话者和人口统计子群的转录准确性,特别关注影响患者与临床医生的错误模式,以及基于性别的差异或交叉性差异。我们的结果显示,各模型和语言之间存在显著差异,部分系统在印度英语上表现具有竞争力,但在语码混合或方言语音上表现不佳。我们还发现与说话者角色和性别相关的系统性性能差距,这对在临床环境中的公平部署提出了担忧。通过提供一个全面的多语言基准和公平性分析,我们的工作强调了为印度医疗生态系统开发在文化和人口层面具有包容性的自动语音识别(ASR)的必要性。

Subjects: Computation and Language, Sound, Audio and Speech Processing 主题:计算与语言、声音、音频与语音处理

Publish: 2025-11-30 06:37:40 UTC 发布:2025-11-30 06:37:40 UTC

32 From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines 32 从信号到轮次:模块化语音到语音管道中的互动摩擦

Authors: [Titaya Mairittha](https://arxiv.org/search/?searchtype=author&query=Titaya Mairittha), [Tanakon Sawanglok](https://arxiv.org/search/?searchtype=author&query=Tanakon Sawanglok), [Panuwit Raden](https://arxiv.org/search/?searchtype=author&query=Panuwit Raden), [Jirapast Buntub](https://arxiv.org/search/?searchtype=author&query=Jirapast Buntub), [Thanapat Warunee](https://arxiv.org/search/?searchtype=author&query=Thanapat Warunee), [Napat Asawachaisuvikrom](https://arxiv.org/search/?searchtype=author&query=Napat Asawachaisuvikrom), [Thanaphum Saiwongin](https://arxiv.org/search/?searchtype=author&query=Thanaphum Saiwongin) 作者:Titaya Mairittha、Tanakon Sawanglok、Panuwit Raden、Jirapast Buntub、Thanapat Warunee、Napat Asawachaisuvikrom、Thanaphum Saiwongin

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them. 尽管基于语音的人工智能系统在生成能力上取得了显著进展,但它们的交互常常在会话层面显得破碎。本文考察了模块化语音到语音检索增强生成(S2S-RAG)流水线中出现的交互摩擦。通过分析一个具有代表性的生产系统,我们超越了简单的延迟指标,识别出会话崩溃的三种反复出现的模式: (1) 时间错位,系统延迟违背了用户对会话节奏的预期; (2) 表达扁平化,副语言线索的丧失导致字面化且不恰当的回应;以及 (3) 修正僵化,架构性的门控阻止用户实时纠正错误。通过系统级分析,我们展示了这些摩擦点不应被理解为缺陷或失败,而是优先考虑可控性而非流畅性的模块化设计所带来的结构性后果。我们得出结论:构建自然的语音人工智能是一个基础设施设计挑战,需要从优化孤立组件转向精心编排它们之间的缝隙。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Software Engineering 主题:人机交互、人工智能、计算与语言、软件工程

Publish: 2025-12-12 17:05:11 UTC 发布:2025-12-12 17:05:11 UTC

33 DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry 33 DentalGPT:激励牙科领域的多模态复杂推理

Authors: [Zhenyang Cai](https://arxiv.org/search/?searchtype=author&query=Zhenyang Cai), [Jiaming Zhang](https://arxiv.org/search/?searchtype=author&query=Jiaming Zhang), [Junjie Zhao](https://arxiv.org/search/?searchtype=author&query=Junjie Zhao), [Ziyi Zeng](https://arxiv.org/search/?searchtype=author&query=Ziyi Zeng), [Yanchao Li](https://arxiv.org/search/?searchtype=author&query=Yanchao Li), [Jingyi Liang](https://arxiv.org/search/?searchtype=author&query=Jingyi Liang), [Junying Chen](https://arxiv.org/search/?searchtype=author&query=Junying Chen), [Yunjin Yang](https://arxiv.org/search/?searchtype=author&query=Yunjin Yang), [Jiajun You](https://arxiv.org/search/?searchtype=author&query=Jiajun You), [Shuzhi Deng](https://arxiv.org/search/?searchtype=author&query=Shuzhi Deng), [Tongfei Wang](https://arxiv.org/search/?searchtype=author&query=Tongfei Wang), [Wanting Chen](https://arxiv.org/search/?searchtype=author&query=Wanting Chen), [Chunxiu Hao](https://arxiv.org/search/?searchtype=author&query=Chunxiu Hao), [Ruiqi Xie](https://arxiv.org/search/?searchtype=author&query=Ruiqi Xie), [Zhenwei Wen](https://arxiv.org/search/?searchtype=author&query=Zhenwei Wen), [Xiangyi Feng](https://arxiv.org/search/?searchtype=author&query=Xiangyi Feng), [Zou Ting](https://arxiv.org/search/?searchtype=author&query=Zou Ting), [Jin Zou Lin](https://arxiv.org/search/?searchtype=author&query=Jin Zou Lin), [Jianquan Li](https://arxiv.org/search/?searchtype=author&query=Jianquan Li), [Guangjun Yu](https://arxiv.org/search/?searchtype=author&query=Guangjun Yu), [Liangyi Chen](https://arxiv.org/search/?searchtype=author&query=Liangyi Chen), [Junwen Wang](https://arxiv.org/search/?searchtype=author&query=Junwen Wang), [Shan Jiang](https://arxiv.org/search/?searchtype=author&query=Shan Jiang), [Benyou Wang](https://arxiv.org/search/?searchtype=author&query=Benyou Wang) 作者:蔡振阳、张家明、赵俊杰、曾子易、李岩超、梁静怡、陈君英、杨云锦、尤佳俊、邓淑芝、王同斐、陈婉婷、郝纯秀、谢瑞琦、温振威、冯祥伊、邹婷、邹金楼、李建全、于光军、陈良意、王俊文、蒋珊、王本友

Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM’s visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs. 在牙科领域对多模态数据进行可靠解读对于自动化口腔医疗至关重要,然而现有的多模态大语言模型(MLLMs)在捕捉细粒度的牙科视觉细节方面表现不足,并且缺乏进行精确诊断所需的推理能力。为了解决这些局限性,我们提出了 DentalGPT,一种通过高质量领域知识注入和强化学习开发的专业牙科多模态大模型。具体而言,我们构建了迄今为止最大的牙科带注释多模态数据集,汇聚了超过 12 万张牙科图像及其详细描述,这些描述突出了具有诊断意义的视觉特征,使其成为目前收集牙科图像最为丰富的多模态数据集。在该数据集上的训练显著增强了模型对牙科病况的视觉理解,而随后进行的强化学习阶段进一步加强了其进行多模态复杂推理的能力。 在口内和全景基准以及医学 VQA 基准的牙科子集上的全面评估表明,DentalGPT 在疾病分类和牙科视觉问答任务上取得了优越的表现,尽管仅有 7B 参数,但其性能超过了许多最先进的多模态大模型。这些结果表明,高质量的牙科数据与分阶段适配相结合,为构建具有能力且领域专精的牙科多模态大模型提供了一条有效路径。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-12-12 13:42:57 UTC 发表于:2025-12-12 13:42:57 UTC

34 HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning 34 HFS:用于高效视频推理的整体查询感知帧选择

Authors: [Yiqing Yang](https://arxiv.org/search/?searchtype=author&query=Yiqing Yang), [Kin-Man Lam](https://arxiv.org/search/?searchtype=author&query=Kin-Man Lam) 作者:杨意清,林建文

Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches. 
视频理解中的关键帧选择面临重大挑战。传统的逐帧独立打分的 top-K 选择方法往往无法对整体选择进行优化。这种独立打分常常导致选择的帧在时间上聚集且在视觉上冗余。此外,使用由多模态大语言模型(MLLMs)离线生成的伪标签来训练轻量选择器,会阻止监督信号根据任务目标进行动态调整。为了解决这些限制,我们提出了一个端到端可训练、任务自适应的帧选择框架。基于思维链(Chain-of-Thought)的方法引导小型语言模型(SLM)生成任务特定的隐式查询向量,这些向量与多模态特征结合以实现动态帧评分。我们进一步定义了一个包含相关性、覆盖性和冗余性的连续集合级目标函数,通过 Gumbel-Softmax 实现可微分优化,以在集合级别选择最优的帧组合。最后,采用了学生-教师相互学习的策略,学生选择器(SLM)和教师推理器(MLLM)通过最小化它们在帧重要性分布上的 KL 散度来对齐。结合交叉熵损失,这使得端到端优化成为可能,消除了对静态伪标签的依赖。在包括 Video-MME、LongVideoBench、MLVU 和 NExT-QA 等多个基准上的实验表明,我们的方法显著优于现有方法。
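The set-level selection hinges on the Gumbel-Softmax relaxation, which turns discrete frame choice into a differentiable weighting. A minimal sketch of Gumbel-Softmax over frame scores (the temperature and noise handling are illustrative; the paper's full objective additionally scores coverage and redundancy at the set level):

```python
import math
import random

def gumbel_softmax(logits, tau=0.5, seed=0):
    """Perturb frame-score logits with Gumbel noise, then apply a
    temperature-scaled softmax to obtain soft selection weights."""
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1).
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    scaled = [x / tau for x in noisy]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

As `tau` shrinks, the weights approach a one-hot frame choice while remaining differentiable in a training framework.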

Subjects: Computer Vision and Pattern Recognition, Computation and Language, Multimedia 主题:计算机视觉与模式识别,计算与语言,多媒体

Publish: 2025-12-12 13:10:30 UTC 发布时间:2025-12-12 13:10:30 协调世界时 (UTC)

35 Rethinking Expert Trajectory Utilization in LLM Post-training 35 重新思考在 LLM 后训练中对专家轨迹的利用

Authors: [Bowen Ding](https://arxiv.org/search/?searchtype=author&query=Bowen Ding), [Yuhan Chen](https://arxiv.org/search/?searchtype=author&query=Yuhan Chen), [Jiayang Lv](https://arxiv.org/search/?searchtype=author&query=Jiayang Lv), [Jiyao Yuan](https://arxiv.org/search/?searchtype=author&query=Jiyao Yuan), [Qi Zhu](https://arxiv.org/search/?searchtype=author&query=Qi Zhu), [Shuangshuang Tian](https://arxiv.org/search/?searchtype=author&query=Shuangshuang Tian), [Dantong Zhu](https://arxiv.org/search/?searchtype=author&query=Dantong Zhu), [Futing Wang](https://arxiv.org/search/?searchtype=author&query=Futing Wang), [Heyuan Deng](https://arxiv.org/search/?searchtype=author&query=Heyuan Deng), [Fei Mi](https://arxiv.org/search/?searchtype=author&query=Fei Mi), [Lifeng Shang](https://arxiv.org/search/?searchtype=author&query=Lifeng Shang), [Tao Lin](https://arxiv.org/search/?searchtype=author&query=Tao Lin) 作者:丁博文、陈雨涵、吕佳阳、袁佳尧、朱琦、田爽爽、朱丹彤、王福廷、邓贺源、米飞、尚立峰、林涛

While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting ``Less is More’’ in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories. 尽管有效的后训练整合了监督微调(SFT)与强化学习(RL),但如何最优利用专家轨迹仍未解决。我们提出了可塑性上限框架以在理论上奠定该领域基础,将性能分解为基础的 SFT 表现与随后 RL 的可塑性。通过大量基准测试,我们确立了先 SFT 后 RL 的顺序流水线作为更优标准,克服了同步方法的稳定性缺陷。此外,我们推导出精确的扩展指南:(1)在 SFT 的稳定期或轻微过拟合子阶段转入 RL,可通过保障基础 SFT 表现而不损害 RL 可塑性,从而最大化最终上限;(2)在 SFT→RL 扩展的情境下否定“少即是多”的观点,我们展示了数据规模决定了后训练的主要潜力,而轨迹难度则作为性能的乘数;(3)识别出最小 SFT 验证损失是选择能最大化最终性能上限的专家轨迹的稳健指标。 我们的研究结果为从专家轨迹中最大化提取价值提供了可行的指导。
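Guideline (3) above, using the minimum SFT validation loss as the hand-off signal, can be sketched as an early-stopping-style checkpoint pick. The `patience` heuristic is an illustrative assumption, not a detail from the paper:

```python
def select_transition_checkpoint(val_losses, patience=2):
    """Pick the SFT checkpoint at the minimum validation loss, i.e. at
    the stable / mild-overfitting boundary where RL should take over.
    Stops scanning once the loss has failed to improve `patience` times."""
    best_idx, best = 0, float("inf")
    since_improve = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, i
            since_improve = 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break
    return best_idx
```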

Subjects: Machine Learning, Computation and Language 主题:机器学习,计算与语言

Publish: 2025-12-12 11:13:00 UTC 发布:2025-12-12 11:13:00 UTC

36 Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models 36 针对分子毒性预测的任务特定稀疏特征掩码,基于化学语言模型

Authors: [Kwun Sy Lee](https://arxiv.org/search/?searchtype=author&query=Kwun Sy Lee), [Jiawei Chen](https://arxiv.org/search/?searchtype=author&query=Jiawei Chen), [Fuk Sheng Ford Chung](https://arxiv.org/search/?searchtype=author&query=Fuk Sheng Ford Chung), [Tianyu Zhao](https://arxiv.org/search/?searchtype=author&query=Tianyu Zhao), [Zhenyuan Chen](https://arxiv.org/search/?searchtype=author&query=Zhenyuan Chen), [Debby D. Wang](https://arxiv.org/search/?searchtype=author&query=Debby D. Wang) 作者:Kwun Sy Lee、Jiawei Chen、Fuk Sheng Ford Chung、Tianyu Zhao、Zhenyuan Chen、Debby D. Wang

Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model’s decision-making process. 可靠的计算机内(in silico)分子毒性预测是现代药物发现的基石,为实验筛选提供了可扩展的替代方案。然而,最先进模型的黑箱特性仍然是采用的重大障碍,因为高风险的安全决策不仅需要预测性能,还需要可验证的结构性见解。为了解决这一问题,我们提出了一种新颖的多任务学习(MTL)框架,旨在同时提升准确性与可解释性。我们的架构将共享的化学语言模型与任务特定的注意力模块相结合。通过对这些模块施加 L1 稀疏性惩罚,框架被约束为针对每个不同的毒性端点聚焦于最少的一组显著分子片段。所得框架进行端到端训练,并能够方便地适配各种基于 Transformer 的骨干模型。在 ClinTox、SIDER 和 Tox21 基准数据集上的评估表明,我们的方法始终优于单任务和标准多任务学习基线。 关键是,稀疏注意力权重提供了符合化学直觉的可视化,揭示了影响预测的具体片段,从而增强了对模型决策过程的洞察。
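The L1 sparsity penalty on task-specific attention can be sketched as an additive term in the multi-task objective. This is a toy scalar version under stated assumptions: real attention weights are tensors and the backbone is a transformer, neither of which is modeled here:

```python
def l1_penalty(attention_weights, lam=0.01):
    """L1 sparsity penalty pushing a task's attention toward a small
    set of salient molecular fragments."""
    return lam * sum(abs(w) for w in attention_weights)

def total_loss(task_losses, task_attention, lam=0.01):
    # Multi-task objective: sum of per-task prediction losses plus a
    # per-task L1 penalty on that task's attention module.
    return sum(task_losses) + sum(l1_penalty(a, lam) for a in task_attention)
```

Because the penalty is per task, each toxicity endpoint can attend to its own minimal fragment set, which is what makes the learned attention interpretable.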

Subjects: Computational Engineering, Finance, and Science, Artificial Intelligence, Computation and Language, Machine Learning, Biomolecules 主题:计算工程、金融与科学、人工智能、计算与语言、机器学习、生物分子

Publish: 2025-12-12 09:41:04 UTC 发布:2025-12-12 09:41:04 UTC

37 Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference 37 自适应软滚动键值冻结与熵引导恢复:用于高效 LLM 推理的亚线性内存增长

Authors: [Adilet Metinov](https://arxiv.org/search/?searchtype=author&query=Adilet Metinov), [Gulida M. Kudakeeva](https://arxiv.org/search/?searchtype=author&query=Gulida M. Kudakeeva), [Bolotbek uulu Nursultan](https://arxiv.org/search/?searchtype=author&query=Bolotbek uulu Nursultan), [Gulnara D. Kabaeva](https://arxiv.org/search/?searchtype=author&query=Gulnara D. Kabaeva) 作者:Adilet Metinov、Gulida M. Kudakeeva、Bolotbek uulu Nursultan、Gulnara D. Kabaeva

We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs. 我们提出了自适应软滚动 KV 冻结与熵引导恢复(ASR-KF-EGR),这是一种无需训练、在推理阶段用于高效大语言模型生成的框架。我们的方法引入了一种可逆的软冻结机制,在滑动注意窗口内暂时中止对被识别为低重要性的标记的键值(KV)更新。不同于将上下文永久驱逐的逐出式方法,ASR-KF-EGR 将所有标记保存在 GPU 外存储中,并按需恢复。我们将该框架扩展为次线性冻结调度,即冻结时长随着重复检测到低重要性而次线性增长,从而防止过度激进的压缩。对 LLaMA-3 8B 的初步实验表明,在保持生成质量并通过“针在干草堆”检索测试的情况下,活动 KV 缓存大小减少了 55%–67%。该方法与体系结构无关,不需微调,为内存受限的长上下文 LLMs 部署提供了实用的解决方案。
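A sublinear freeze schedule, where freeze spans grow more slowly than the count of low-importance detections, might look like the following. The square-root schedule and its constants are illustrative assumptions; the abstract does not give exact formulas:

```python
import math

class SoftRollingFreezer:
    """Sketch of a reversible soft-freeze scheduler: tokens repeatedly
    flagged low-importance get frozen for sublinearly growing spans,
    and can be restored simply by letting the freeze expire."""

    def __init__(self):
        self.low_counts = {}    # token id -> times flagged low-importance
        self.frozen_until = {}  # token id -> step at which freeze expires

    def flag_low_importance(self, token_id, step):
        c = self.low_counts.get(token_id, 0) + 1
        self.low_counts[token_id] = c
        # Sublinear growth: freeze span ~ sqrt of the repeat count.
        self.frozen_until[token_id] = step + int(math.sqrt(c) * 4)

    def is_frozen(self, token_id, step):
        # Frozen tokens skip KV updates; their entries stay recoverable.
        return self.frozen_until.get(token_id, -1) > step
```

Frozen entries would live in off-GPU storage and be restored on demand, which is what distinguishes this from eviction-based KV compression.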

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言

Publish: 2025-12-12 02:02:02 UTC 发布:2025-12-12 02:02:02 UTC

38 FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration 38 FutureWeaver:为多智能体系统在测试时规划计算以实现模块化协作

Authors: [Dongwon Jung](https://arxiv.org/search/?searchtype=author&query=Dongwon Jung), [Peng Shi](https://arxiv.org/search/?searchtype=author&query=Peng Shi), [Yi Zhang](https://arxiv.org/search/?searchtype=author&query=Yi Zhang) 作者:Dongwon Jung、Peng Shi、Yi Zhang

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization. 在测试时增加计算量可以在不额外训练的情况下提升大型语言模型的性能。近期研究表明,通过在推理阶段分配更多计算资源,诸如重复采样、自我验证和自我反思等技术能够显著提高任务成功率。然而,将这些技术应用于多智能体系统中的多个主体却很困难:尚无可行的机制来分配计算以促进智能体间的协作、将测试时扩展应用到协作性交互中,或在明确的预算限制下在智能体间分配计算资源。为了解决这一空白,我们提出了 FutureWeaver,这是一个在固定预算下为多智能体系统规划和优化测试时计算分配的框架。FutureWeaver 引入了模块化协作,以可调用函数的形式形式化,封装了可重用的多智能体工作流。这些模块通过自我对弈反思自动生成,方法是从过去的轨迹中抽象出重复出现的交互模式。 在这些模块的基础上,FutureWeaver 采用了双层规划架构,通过对当前任务状态进行推理并对未来步骤进行预测,从而优化计算资源的分配。 在复杂智能体基准测试上的实验表明,FutureWeaver 在各种预算设置下始终优于基线方法,验证了其在推理时优化多智能体协作方面的有效性。

Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言

Publish: 2025-12-12 01:43:48 UTC 发布:2025-12-12 01:43:48 UTC

39 SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models 39 SCOUT:针对微调语言模型中数据投毒攻击的防御

Authors: [Mohamed Afane](https://arxiv.org/search/?searchtype=author&query=Mohamed Afane), [Abhishek Satyam](https://arxiv.org/search/?searchtype=author&query=Abhishek Satyam), [Ke Chen](https://arxiv.org/search/?searchtype=author&query=Ke Chen), [Tao Li](https://arxiv.org/search/?searchtype=author&query=Tao Li), [Junaid Farooq](https://arxiv.org/search/?searchtype=author&query=Junaid Farooq), [Juntao Chen](https://arxiv.org/search/?searchtype=author&query=Juntao Chen) 作者:Mohamed Afane、Abhishek Satyam、Ke Chen、Tao Li、Junaid Farooq、Juntao Chen

Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present \textbf{SCOUT (Saliency-based Classification Of Untrusted Tokens)}, a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model’s output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs. 
后门攻击通过嵌入隐蔽触发器在推理时操控模型行为,对语言模型构成重大安全威胁,这在部署于医疗和其他敏感领域的人工智能系统中带来关键风险。尽管现有防御能有效应对诸如语境不符的触发词和安全对齐违规等明显威胁,但它们对使用语境上恰当、能够无缝融入自然语言的高级攻击束手无策。本文提出三种新颖的、具语境感知性的攻击场景,利用领域特定知识和语义合理性:针对社交媒体成瘾分类的 ViralApp 攻击、将医学诊断操纵为高血压的 Fever 攻击,以及引导临床建议的 Referral 攻击。这些攻击代表了现实威胁,其中恶意行为者在保持语义连贯的同时利用领域特定词汇,展示了对手如何将语境适宜性武器化以规避传统检测方法。 为应对传统攻击和这些复杂攻击,我们提出了 SCOUT(基于显著性的未信任标记分类,Saliency-based Classification Of Untrusted Tokens),这是一种新颖的防御框架,通过基于标记级别的显著性分析来识别后门触发器,而不是传统的基于上下文的检测方法。SCOUT 通过衡量移除单个标记对模型在目标标签上输出 logits 的影响来构建显著性图,从而能够检测明显和隐蔽的操控尝试。我们在已建立的基准数据集(SST-2、IMDB、AG News)上针对传统攻击(BadNet、AddSent、SynBkd、StyleBkd)和我们提出的新型攻击评估了 SCOUT,结果表明 SCOUT 在保持对干净输入准确率的同时,能够成功检测这些复杂威胁。
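The token-removal saliency idea behind SCOUT can be sketched as a leave-one-out loop: re-score the input with each token ablated and record how much the target-label logit drops. The toy scoring function below stands in for a real classifier; all names and scores are illustrative, not the paper's code.

```python
def target_logit(tokens, target_label):
    # Stand-in for a classifier's logit for `target_label`; this toy
    # model simply treats the word "free" as highly indicative.
    return sum(2.0 if t == "free" else 0.1 for t in tokens)

def saliency_map(tokens, target_label):
    """Score each token by how much its removal lowers the target logit."""
    base = target_logit(tokens, target_label)
    scores = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        scores.append(base - target_logit(ablated, target_label))
    return scores

tokens = ["claim", "your", "free", "prize", "now"]
scores = saliency_map(tokens, target_label="spam")
# The token whose removal most reduces the target logit is flagged
# as a candidate trigger.
flagged = tokens[scores.index(max(scores))]  # -> "free"
```

A real detector would threshold these scores and inspect outliers, which is how both conspicuous and contextually plausible triggers can surface under the same mechanism.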

Subjects: Cryptography and Security, Computation and Language 主题:密码学与安全,计算与语言

Publish: 2025-12-10 17:25:55 UTC 发布:2025-12-10 17:25:55 UTC

40 TV2TV: A Unified Framework for Interleaved Language and Video Generation 40 TV2TV:用于交错语言与视频生成的统一框架

Authors: [Xiaochuang Han](https://arxiv.org/search/?searchtype=author&query=Xiaochuang Han), [Youssef Emad](https://arxiv.org/search/?searchtype=author&query=Youssef Emad), [Melissa Hall](https://arxiv.org/search/?searchtype=author&query=Melissa Hall), [John Nguyen](https://arxiv.org/search/?searchtype=author&query=John Nguyen), [Karthik Padthe](https://arxiv.org/search/?searchtype=author&query=Karthik Padthe), [Liam Robbins](https://arxiv.org/search/?searchtype=author&query=Liam Robbins), [Amir Bar](https://arxiv.org/search/?searchtype=author&query=Amir Bar), [Delong Chen](https://arxiv.org/search/?searchtype=author&query=Delong Chen), [Michal Drozdzal](https://arxiv.org/search/?searchtype=author&query=Michal Drozdzal), [Maha Elbayad](https://arxiv.org/search/?searchtype=author&query=Maha Elbayad), [Yushi Hu](https://arxiv.org/search/?searchtype=author&query=Yushi Hu), [Shang-Wen Li](https://arxiv.org/search/?searchtype=author&query=Shang-Wen Li), [Sreya Dutta Roy](https://arxiv.org/search/?searchtype=author&query=Sreya Dutta Roy), [Jakob Verbeek](https://arxiv.org/search/?searchtype=author&query=Jakob Verbeek), [XuDong Wang](https://arxiv.org/search/?searchtype=author&query=XuDong Wang), [Marjan Ghazvininejad](https://arxiv.org/search/?searchtype=author&query=Marjan Ghazvininejad), [Luke Zettlemoyer](https://arxiv.org/search/?searchtype=author&query=Luke Zettlemoyer), [Emily Dinan](https://arxiv.org/search/?searchtype=author&query=Emily Dinan) 作者:Xiaochuang Han、Youssef Emad、Melissa Hall、John Nguyen、Karthik Padthe、Liam Robbins、Amir Bar、Delong Chen、Michal Drozdzal、Maha Elbayad、Yushi Hu、Shang-Wen Li、Sreya Dutta Roy、Jakob Verbeek、XuDong Wang、Marjan Ghazvininejad、Luke Zettlemoyer、Emily Dinan

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
视频生成模型正在迅速发展,但在需要大量语义分支或反复进行关于接下来应发生什么的高层次推理的复杂视频输出方面仍可能表现欠佳。在本文中,我们引入了一类新的全能视频-文本模型,结合了近期语言模型推理进展中的思想以应对这一挑战。更具体地,我们提出了 TV2TV,一种统一的生成建模框架,它将视频生成分解为交替进行的文本与视频生成过程。TV2TV 使用混合变换器(Mixture-of-Transformers,MoT)架构联合学习语言建模(下一个标记预测)和视频流匹配(下一帧预测)。在推理时,TV2TV 决定何时在生成文本和视频帧之间切换,使模型能够在“用文字思考”后再“用像素行动”以生成帧。该设计将大量关于接下来应发生什么的决策责任卸载给语言建模塔,从而提升生成视频的视觉质量和与提示的对齐度。 它还实现了细粒度可控性,允许用户在任何时间点通过文本干预来修改视频生成轨迹。在对视频游戏数据的受控实验中,TV2TV 在视觉质量和可控性方面均表现出显著提升。我们还通过使用视觉—语言模型(VLMs)在体育视频中插入自然语言动作描述,展示了 TV2TV 能够扩展到自然视频。将 TV2TV 在该语料上进行训练后,显示出强大的视觉质量和提示对齐能力,证明了模型推理并生成复杂真实世界动作序列的能力。 总之,这些结果表明 TV2TV 朝向具有开放式文本推理和控制的视频生成迈出了有前景的一步。
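The interleaved decode loop described above can be sketched as a mode-switching generator: the model emits text tokens until a special token hands control to the video branch. The control tokens (`<BOV>`, `<EOT>`) and the scripted `step` function are illustrative assumptions, not TV2TV's actual interface.

```python
def interleaved_generate(step, state, max_steps=8):
    """Alternate 'thinking in words' and 'acting in pixels'.

    `step(state, mode)` returns the next emission and updated state;
    special tokens switch modes or end the trajectory.
    """
    outputs, mode = [], "text"
    for _ in range(max_steps):
        token, state = step(state, mode)
        if token == "<BOV>":        # begin-of-video: switch to frame decoding
            mode = "video"
        elif token == "<EOT>":      # end of trajectory
            break
        else:
            outputs.append((mode, token))
    return outputs

# A scripted stand-in for the model: plan in text, then emit two frames.
script = ["plan", "<BOV>", "frame0", "frame1", "<EOT>"]
def step(state, mode):
    return script[state], state + 1

out = interleaved_generate(step, 0)
```

The point of the sketch is the control flow: the language side decides what happens next, and the video side only produces frames once that decision is committed.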

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器学习、人工智能、计算机视觉与模式识别

Publish: 2025-12-04 18:59:09 UTC 发布:2025-12-04 18:59:09 UTC

1.3.2 Artificial Intelligence

From:https://papers.cool/arxiv/cs.AI

https://arxiv.org/list/cs.AI/recent

2025-12-15 | | Total: 90 2025-12-15 | | 共计:90

1 MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition 1 MedAI:在 NeurIPS CURE-Bench 竞赛中评估 TxAgent 的治疗性自主推理

Authors: [Tim Cofala](https://arxiv.org/search/?searchtype=author&query=Tim Cofala), [Christian Kalfar](https://arxiv.org/search/?searchtype=author&query=Christian Kalfar), [Jingge Xiao](https://arxiv.org/search/?searchtype=author&query=Jingge Xiao), [Johanna Schrader](https://arxiv.org/search/?searchtype=author&query=Johanna Schrader), [Michelle Tang](https://arxiv.org/search/?searchtype=author&query=Michelle Tang), [Wolfgang Nejdl](https://arxiv.org/search/?searchtype=author&query=Wolfgang Nejdl) 作者:Tim Cofala、Christian Kalfar、Jingge Xiao、Johanna Schrader、Michelle Tang、Wolfgang Nejdl

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/. 
临床医学中的治疗决策是一个高风险领域,人工智能的指导需与患者特征、疾病过程和药理制剂之间的复杂相互作用相结合。诸如药物推荐、治疗方案制定和不良反应预测等任务要求基于可靠生物医学知识进行稳健的、多步骤推理。以 TxAgent 为代表的主体化(agentic)人工智能方法通过迭代的检索增强生成(RAG)来应对这些挑战。TxAgent 使用经微调的 Llama-3.1-8B 模型,动态生成并执行对统一生物医学工具套件(ToolUniverse)的函数调用,整合了 FDA Drug API、OpenTargets 和 Monarch 资源,以确保获取最新的治疗信息。与通用 RAG 系统不同,医疗应用施加了严格的安全约束,使得推理过程的正确性和工具调用序列的准确性变得至关重要。这些考量促使评估协议将令牌级的推理和工具使用行为视为显式监督信号来处理。 本工作介绍了我们参与 CURE-Bench NeurIPS 2025 挑战赛所获得的见解,该挑战赛使用评估正确性、工具使用情况和推理质量的指标来基准化治疗推理系统。我们分析了用于函数(工具)调用的检索质量如何影响整体模型性能,并展示了通过改进工具检索策略获得的性能提升。我们的工作获得了开放科学卓越奖。完整信息请见 https://curebench.ai/

Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习

Publish: 2025-12-12 16:01:48 UTC 发布时间:2025-12-12 16:01:48 UTC

2 Causal Inference in Energy Demand Prediction 2 能源需求预测中的因果推断

Authors: [Chutian Ma](https://arxiv.org/search/?searchtype=author&query=Chutian Ma), [Grigorii Pomazkin](https://arxiv.org/search/?searchtype=author&query=Grigorii Pomazkin), [Giacinto Paolo Saggese](https://arxiv.org/search/?searchtype=author&query=Giacinto Paolo Saggese), [Paul Smith](https://arxiv.org/search/?searchtype=author&query=Paul Smith) 作者:Chutian Ma、Grigorii Pomazkin、Giacinto Paolo Saggese、Paul Smith

Energy demand prediction is critical for grid operators, industrial energy consumers, and service providers. Energy demand is influenced by multiple factors, including weather conditions (e.g. temperature, humidity, wind speed, solar radiation), and calendar information (e.g. hour of day and month of year), which further affect daily work and life schedules. These factors are causally interdependent, making the problem more complex than simple correlation-based learning techniques satisfactorily allow for. We propose a structural causal model that explains the causal relationship between these variables. A full analysis is performed to validate our causal beliefs, also revealing important insights consistent with prior studies. For example, our causal model reveals that energy demand responds to temperature fluctuations with season-dependent sensitivity. Additionally, we find that energy demand exhibits lower variance in winter due to the decoupling effect between temperature changes and daily activity patterns. We then build a Bayesian model, which takes advantage of the causal insights we learned as prior knowledge. The model is trained and tested on unseen data and yields state-of-the-art performance in the form of a 3.84 percent MAPE on the test set. The model also demonstrates strong robustness, as the cross-validation across two years of data yields an average MAPE of 3.88 percent. 能源需求预测对电网运营商、工业能源消费者和服务提供商至关重要。能源需求受多种因素影响,包括天气状况(如温度、湿度、风速、太阳辐射)和日历信息(如一天中的小时和一年中的月份),这些因素进一步影响日常工作和生活安排。这些因素在因果上相互依赖,使得问题比基于简单相关性的学习技术更为复杂。我们提出了一个结构因果模型来解释这些变量之间的因果关系。我们进行了全面分析以验证我们的因果假设,同时也揭示了与先前研究一致的重要见解。例如,我们的因果模型显示能源需求对温度波动的响应具有季节依赖的敏感性。此外,我们发现由于温度变化与日常活动模式之间的解耦效应,冬季的能源需求方差较低。随后我们构建了一个贝叶斯模型,将学到的因果见解作为先验知识加以利用。该模型在未见过的数据上进行训练和测试,并在测试集上取得了 3.84% MAPE 的业界领先表现。该模型还表现出很强的鲁棒性,跨两年数据的交叉验证平均 MAPE 为 3.88%。
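For reference, the MAPE metric quoted above (3.84 percent on the test set) is just the mean of absolute percentage errors; the demand figures below are made up for illustration, not from the paper.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    assert len(actual) == len(predicted) and all(a != 0 for a in actual)
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)

demand = [120.0, 95.0, 150.0]     # observed energy demand (illustrative)
forecast = [114.0, 98.0, 150.0]   # model predictions (illustrative)
err = mape(demand, forecast)      # roughly 2.72 for this toy data
```

Note that MAPE is undefined when any actual value is zero, which is why the sketch asserts nonzero demand before dividing.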

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 15:30:46 UTC 发布:2025-12-12 15:30:46 协调世界时

3 AI Benchmark Democratization and Carpentry 3 AI 基准民主化与基准工艺(Carpentry)

Authors: [Gregor von Laszewski](https://arxiv.org/search/?searchtype=author&query=Gregor von Laszewski), [Wesley Brewer](https://arxiv.org/search/?searchtype=author&query=Wesley Brewer), [Jeyan Thiyagalingam](https://arxiv.org/search/?searchtype=author&query=Jeyan Thiyagalingam), [Juri Papay](https://arxiv.org/search/?searchtype=author&query=Juri Papay), [Armstrong Foundjem](https://arxiv.org/search/?searchtype=author&query=Armstrong Foundjem), [Piotr Luszczek](https://arxiv.org/search/?searchtype=author&query=Piotr Luszczek), [Murali Emani](https://arxiv.org/search/?searchtype=author&query=Murali Emani), [Shirley V. Moore](https://arxiv.org/search/?searchtype=author&query=Shirley V. Moore), [Vijay Janapa Reddi](https://arxiv.org/search/?searchtype=author&query=Vijay Janapa Reddi), [Matthew D. Sinclair](https://arxiv.org/search/?searchtype=author&query=Matthew D. Sinclair), [Sebastian Lobentanzer](https://arxiv.org/search/?searchtype=author&query=Sebastian Lobentanzer), [Sujata Goswami](https://arxiv.org/search/?searchtype=author&query=Sujata Goswami), [Benjamin Hawks](https://arxiv.org/search/?searchtype=author&query=Benjamin Hawks), [Marco Colombo](https://arxiv.org/search/?searchtype=author&query=Marco Colombo), [Nhan Tran](https://arxiv.org/search/?searchtype=author&query=Nhan Tran), [Christine R. Kirkpatrick](https://arxiv.org/search/?searchtype=author&query=Christine R. Kirkpatrick), [Abdulkareem Alsudais](https://arxiv.org/search/?searchtype=author&query=Abdulkareem Alsudais), [Gregg Barrett](https://arxiv.org/search/?searchtype=author&query=Gregg Barrett), [Tianhao Li](https://arxiv.org/search/?searchtype=author&query=Tianhao Li), [Kirsten Morehouse](https://arxiv.org/search/?searchtype=author&query=Kirsten Morehouse), [Shivaram Venkataraman](https://arxiv.org/search/?searchtype=author&query=Shivaram Venkataraman), [Rutwik Jain](https://arxiv.org/search/?searchtype=author&query=Rutwik Jain), [Kartik Mathur](https://arxiv.org/search/?searchtype=author&query=Kartik Mathur), [Victor Lu](https://arxiv.org/search/?searchtype=author&query=Victor Lu), [Tejinder Singh](https://arxiv.org/search/?searchtype=author&query=Tejinder Singh), [Khojasteh Z. Mirza](https://arxiv.org/search/?searchtype=author&query=Khojasteh Z. Mirza), [Kongtao Chen](https://arxiv.org/search/?searchtype=author&query=Kongtao Chen), [Sasidhar Kunapuli](https://arxiv.org/search/?searchtype=author&query=Sasidhar Kunapuli), [Gavin Farrell](https://arxiv.org/search/?searchtype=author&query=Gavin Farrell), [Renato Umeton](https://arxiv.org/search/?searchtype=author&query=Renato Umeton), [Geoffrey C. Fox](https://arxiv.org/search/?searchtype=author&query=Geoffrey C. Fox) 作者:Gregor von Laszewski、Wesley Brewer、Jeyan Thiyagalingam、Juri Papay、Armstrong Foundjem、Piotr Luszczek、Murali Emani、Shirley V. Moore、Vijay Janapa Reddi、Matthew D. Sinclair、Sebastian Lobentanzer、Sujata Goswami、Benjamin Hawks、Marco Colombo、Nhan Tran、Christine R. Kirkpatrick、Abdulkareem Alsudais、Gregg Barrett、Tianhao Li、Kirsten Morehouse、Shivaram Venkataraman、Rutwik Jain、Kartik Mathur、Victor Lu、Tejinder Singh、Khojasteh Z. Mirza、Kongtao Chen、Sasidhar Kunapuli、Gavin Farrell、Renato Umeton、Geoffrey C. Fox

Benchmarks are a cornerstone of modern machine learning, enabling reproducibility, comparison, and scientific progress. However, AI benchmarks are increasingly complex, requiring dynamic, AI-focused workflows. Rapid evolution in model architectures, scale, datasets, and deployment contexts makes evaluation a moving target. Large language models often memorize static benchmarks, causing a gap between benchmark results and real-world performance. Beyond traditional static benchmarks, continuous adaptive benchmarking frameworks are needed to align scientific assessment with deployment risks. This calls for skills and education in AI Benchmark Carpentry. From our experience with MLCommons, educational initiatives, and programs like the DOE’s Trillion Parameter Consortium, key barriers include high resource demands, limited access to specialized hardware, lack of benchmark design expertise, and uncertainty in relating results to application domains. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Benchmarking must become dynamic, incorporating evolving models, updated data, and heterogeneous platforms while maintaining transparency, reproducibility, and interpretability. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use. Benchmarks should support application-relevant comparisons, enabling informed, context-sensitive decisions. Dynamic, inclusive benchmarking will ensure evaluation keeps pace with AI evolution and supports responsible, reproducible, and accessible AI deployment. Community efforts can provide a foundation for AI Benchmark Carpentry. 
基准测试是现代机器学习的基石,使可重复性、比较和科学进步成为可能。然而,人工智能基准测试正变得愈发复杂,要求动态、以 AI 为中心的工作流程。模型架构、规模、数据集和部署环境的快速演变使得评估成为一个不断变化的目标。大型语言模型常常记忆静态基准,导致基准结果与现实世界性能之间出现差距。除了传统的静态基准之外,还需要连续自适应的基准框架,以使科学评估与部署风险保持一致。这要求在 AI 基准制作方面具备技能和教育培训。根据我们在 MLCommons、教育项目以及 DOE 的“万亿参数联盟”等计划中的经验,主要障碍包括资源需求高、对专用硬件的访问有限、缺乏基准设计专业知识,以及难以将结果与应用领域关联的不确定性。目前的基准往往强调在顶级硬件上的峰值性能,对多样化的现实场景提供的指导有限。 基准测试必须变得动态化,纳入不断发展的模型、更新的数据和异构平台,同时保持透明性、可重复性和可解释性。普及化既需要技术创新,也需要各层面的系统化教育,在基准设计与使用方面建立持续的专业知识。基准应支持与应用相关的比较,促进有信息的、具上下文敏感性的决策。动态且包容的基准测试将确保评估跟上人工智能的发展步伐,并支持负责任、可重复且可获取的人工智能部署。社区努力可以为人工智能基准工艺奠定基础。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 14:20:05 UTC 发表:2025-12-12 14:20:05 协调世界时

4 AI-MASLD Metabolic Dysfunction and Information Steatosis of Large Language Models in Unstructured Clinical Narratives 4 AI-MASLD 代谢功能障碍与大型语言模型在非结构化临床叙事中的信息脂肪变性

Authors: [Yuan Shen](https://arxiv.org/search/?searchtype=author&query=Yuan Shen), [Xiaojun Wu](https://arxiv.org/search/?searchtype=author&query=Xiaojun Wu), [Linghua Yu](https://arxiv.org/search/?searchtype=author&query=Linghua Yu) 作者:沈元,吴晓军,于灵华

This study aims to simulate real-world clinical scenarios to systematically evaluate the ability of Large Language Models (LLMs) to extract core medical information from patient chief complaints laden with noise and redundancy, and to verify whether they exhibit a functional decline analogous to Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD). We employed a cross-sectional analysis design based on standardized medical probes, selecting four mainstream LLMs as research subjects: GPT-4o, Gemini 2.5, DeepSeek 3.1, and Qwen3-Max. An evaluation system comprising twenty medical probes across five core dimensions was used to simulate a genuine clinical communication environment. All probes had gold-standard answers defined by clinical experts and were assessed via a double-blind, inverse rating scale by two independent clinicians. The results show that all tested models exhibited functional defects to varying degrees, with Qwen3-Max demonstrating the best overall performance and Gemini 2.5 the worst. Under conditions of extreme noise, most models experienced a functional collapse. Notably, GPT-4o made a severe misjudgment in the risk assessment for pulmonary embolism (PE) secondary to deep vein thrombosis (DVT). This research is the first to empirically confirm that LLMs exhibit features resembling metabolic dysfunction when processing clinical information, proposing the innovative concept of “AI-Metabolic Dysfunction-Associated Steatotic Liver Disease (AI-MASLD)”. These findings offer a crucial safety warning for the application of Artificial Intelligence (AI) in healthcare, emphasizing that current LLMs must be used as auxiliary tools under human expert supervision, as there remains a significant gap between their theoretical knowledge and practical clinical application. 
本研究旨在模拟真实临床情境,系统评估 LLMs 从充斥噪音和冗余的患者主诉中提取核心医疗信息的能力,并验证其是否表现出类似代谢功能障碍相关脂肪肝病(MASLD)那样的功能性下降。我们采用基于标准化医疗探查的横断面分析设计,选取四款主流 LLMs 作为研究对象:GPT-4o、Gemini 2.5、DeepSeek 3.1 和 Qwen3-Max。通过由五个核心维度构成、共二十项医疗探查的问题评估体系来模拟真实的临床交流环境。所有探查均由临床专家制定金标准答案,并由两位独立临床医生以双盲、逆向评分量表进行评估。结果显示,所有被测模型均在不同程度上表现出功能缺陷,其中 Qwen3-Max 总体表现最佳,Gemini 2.5 最差;在极端噪音条件下,大多数模型出现了功能性崩溃。 值得注意的是,GPT-4o 在评估继发于深静脉血栓(DVT)的肺栓塞(PE)风险时做出了严重错误判断。这项研究首次通过实证确认 LLMs 在处理临床信息时表现出类似代谢功能障碍的特征,并提出了创新概念“AI-代谢功能障碍相关脂肪肝病(AI-MASLD)”。 这些发现为人工智能(AI)在医疗领域的应用提供了重要的安全警示,强调目前的 LLMs 必须作为辅助工具在人工专家监督下使用,因为它们的理论知识与实际临床应用之间仍存在显著差距。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 13:25:19 UTC 发布:2025-12-12 13:25:19 UTC

5 EmeraldMind: A Knowledge Graph-Augmented Framework for Greenwashing Detection 5 EmeraldMind:一种知识图谱增强的漂绿检测框架

Authors: [Georgios Kaoukis](https://arxiv.org/search/?searchtype=author&query=Georgios Kaoukis), [Ioannis Aris Koufopoulos](https://arxiv.org/search/?searchtype=author&query=Ioannis Aris Koufopoulos), [Psaroudaki Eleni](https://arxiv.org/search/?searchtype=author&query=Psaroudaki Eleni), [Danae Pla Karidi](https://arxiv.org/search/?searchtype=author&query=Danae Pla Karidi), [Evaggelia Pitoura](https://arxiv.org/search/?searchtype=author&query=Evaggelia Pitoura), [George Papastefanatos](https://arxiv.org/search/?searchtype=author&query=George Papastefanatos), [Panayiotis Tsaparas](https://arxiv.org/search/?searchtype=author&query=Panayiotis Tsaparas) 作者:Georgios Kaoukis、Ioannis Aris Koufopoulos、Psaroudaki Eleni、Danae Pla Karidi、Evaggelia Pitoura、George Papastefanatos、Panayiotis Tsaparas

As AI and web agents become pervasive in decision-making, it is critical to design intelligent systems that not only support sustainability efforts but also guard against misinformation. Greenwashing, i.e., misleading corporate sustainability claims, poses a major challenge to environmental progress. To address this challenge, we introduce EmeraldMind, a fact-centric framework integrating a domain-specific knowledge graph with retrieval-augmented generation to automate greenwashing detection. EmeraldMind builds the EmeraldGraph from diverse corporate ESG (environmental, social, and governance) reports, surfacing verifiable evidence, often missing in generic knowledge bases, and supporting large language models in claim assessment. The framework delivers justification-centric classifications, presenting transparent, evidence-backed verdicts and abstaining responsibly when claims cannot be verified. Experiments on a new greenwashing claims dataset demonstrate that EmeraldMind achieves competitive accuracy, greater coverage, and superior explanation quality compared to generic LLMs, without the need for fine-tuning or retraining. 随着人工智能和网络代理在决策中变得普遍,设计不仅支持可持续发展努力而且能防止错误信息的智能系统变得至关重要。漂绿(即具有误导性的企业可持续性声明)对环境进展构成了重大挑战。为应对这一挑战,我们引入了 EmeraldMind,一个以事实为中心的框架,将特定领域知识图与检索增强生成相结合,以自动检测漂绿行为。EmeraldMind 从多样的企业 ESG(环境、社会与治理)报告中构建 EmeraldGraph,揭示了常见通用知识库中往往缺失的可验证证据,并支持大型语言模型在评估声明时使用这些证据。该框架提供以论证为中心的分类,呈现透明、有证据支持的判定,并在无法验证声明时负责地选择回避。对一个新的漂绿声明数据集的实验表明,EmeraldMind 在无需微调或重新训练的情况下,与通用 LLMs 相比实现了具有竞争力的准确性、更广的覆盖率和更优的解释质量。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 12:06:36 UTC 发布:2025-12-12 12:06:36 UTC

6 BAID: A Benchmark for Bias Assessment of AI Detectors 6 BAID:用于评估 AI 检测器偏见的基准

Authors: [Priyam Basu](https://arxiv.org/search/?searchtype=author&query=Priyam Basu), [Yunfeng Zhang](https://arxiv.org/search/?searchtype=author&query=Yunfeng Zhang), [Vipul Raheja](https://arxiv.org/search/?searchtype=author&query=Vipul Raheja) 作者:Priyam Basu、Yunfeng Zhang、Vipul Raheja

AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs); however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive evaluation framework for AI detectors across various types of biases. As a part of the framework, we introduce over 200k samples spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generated synthetic versions of each sample with carefully crafted prompts to preserve the original content while reflecting subgroup-specific writing styles. Using this, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use. 人工智能生成文本检测器最近在教育和职业场景中开始被采用。先前的研究发现了孤立的偏见案例,尤其是针对英语学习者(ELLs)的偏见;然而,针对更广泛社会语言学因素的系统性评估仍然缺乏。在这项工作中,我们提出了 BAID,一种针对各种偏见类型的 AI 检测器的综合评估框架。作为该框架的一部分,我们引入了涵盖七大类的超过 20 万个样本:人口统计、年龄、教育年级水平、方言、正式程度、政治倾向和话题。我们还使用精心设计的提示为每个样本生成了合成版本,以在保留原始内容的同时反映子群体特有的写作风格。基于此,我们评估了四种开源的最先进 AI 文本检测器,发现检测性能存在持续差异,特别是来自代表性不足群体的文本召回率偏低。我们的贡献提供了一种可扩展、透明的 AI 检测器审计方法,并强调在这些工具公开部署前进行具备偏见意识的评估的必要性。

Subjects: Artificial Intelligence, Machine Learning 主题:人工智能、机器学习

Publish: 2025-12-12 12:01:42 UTC 发布:2025-12-12 12:01:42 UTC

7 General-purpose AI models can generate actionable knowledge on agroecological crop protection 7 通用型人工智能模型可以生成关于农业生态作物保护的可付诸行动的知识

Author: [Kris A. G. Wyckhuys](https://arxiv.org/search/?searchtype=author&query=Kris A. G. Wyckhuys) 作者:Kris A. G. Wyckhuys

Generative artificial intelligence (AI) offers potential for democratizing scientific knowledge and converting this to clear, actionable information, yet its application in agri-food science remains unexplored. Here, we verify the scientific knowledge on agroecological crop protection that is generated by either web-grounded or non-grounded large language models (LLMs), i.e., DeepSeek versus the free-tier version of ChatGPT. For nine globally limiting pests, weeds, and plant diseases, we assessed the factual accuracy, data consistency, and breadth of knowledge or data completeness of each LLM. Overall, DeepSeek consistently screened a 4.8-49.7-fold larger literature corpus and reported 1.6-2.4-fold more biological control agents or management solutions than ChatGPT. As a result, DeepSeek reported 21.6% higher efficacy estimates, exhibited greater laboratory-to-field data consistency, and showed more realistic effects of pest identity and management tactics. However, both models hallucinated, i.e., fabricated fictitious agents or references, reported on implausible ecological interactions or outcomes, confused old and new scientific nomenclatures, and omitted data on key agents or solutions. Despite these shortcomings, both LLMs correctly reported low-resolution efficacy trends. Overall, when paired with rigorous human oversight, LLMs may pose a powerful tool to support farm-level decision-making and unleash scientific creativity. 生成式人工智能(AI)有可能使科学知识大众化并将其转化为清晰、可行的信息,但其在农业食品科学中的应用尚未被探索。在此,我们验证了由基于网络检索或非检索的大型语言模型(LLMs)——即 DeepSeek 与 ChatGPT 免费版——生成的关于农业生态作物保护的科学知识。针对九种在全球范围内受限制的害虫、杂草和植物病害,我们评估了每个 LLM 的事实准确性、数据一致性以及知识广度或数据完整性。总体而言,DeepSeek 持续筛选了比 ChatGPT 大 4.8–49.7 倍的文献语料,并报告了 1.6–2.4 倍更多的生物防治因子或管理方案。因此,DeepSeek 报告的疗效估计高出 21.6%,表现出更高的实验室到田间数据一致性,并显示出害虫种类和管理策略更为现实的影响。 然而,这两种模型都出现了幻觉,即捏造虚构的代理或参考资料,报告不可信的生态相互作用或结果,混淆旧的与新的科学命名法,并遗漏关于关键因子或解决方案的数据。尽管存在这些缺点,两个 LLMs 都正确报告了低分辨率的效能趋势。总体而言,在严格的人类监督下,LLMs 可能成为支持农场层面决策并释放科学创造力的强大工具。

Subjects: Artificial Intelligence, Computers and Society, Information Retrieval 主题:人工智能,计算机与社会,信息检索

Publish: 2025-12-12 11:17:13 UTC 发表:2025-12-12 11:17:13 UTC

8 Three methods, one problem: Classical and AI approaches to no-three-in-line 8 三种方法,一个问题:经典与人工智能方法解决“无三点共线”问题

Authors: [Pranav Ramanathan](https://arxiv.org/search/?searchtype=author&query=Pranav Ramanathan), [Thomas Prellberg](https://arxiv.org/search/?searchtype=author&query=Thomas Prellberg), [Matthew Lewis](https://arxiv.org/search/?searchtype=author&query=Matthew Lewis), [Prathamesh Dinesh Joshi](https://arxiv.org/search/?searchtype=author&query=Prathamesh Dinesh Joshi), [Raj Abhijit Dandekar](https://arxiv.org/search/?searchtype=author&query=Raj Abhijit Dandekar), [Rajat Dandekar](https://arxiv.org/search/?searchtype=author&query=Rajat Dandekar), [Sreedath Panat](https://arxiv.org/search/?searchtype=author&query=Sreedath Panat) 作者:Pranav Ramanathan,Thomas Prellberg,Matthew Lewis,Prathamesh Dinesh Joshi,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat

The No-Three-In-Line problem asks for the maximum number of points that can be placed on an n by n grid with no three collinear, representing a famous problem in combinatorial geometry. While classical methods like Integer Linear Programming (ILP) guarantee optimal solutions, they face exponential scaling with grid size, and recent advances in machine learning offer promising alternatives for pattern-based approximation. This paper presents the first systematic comparison of classical optimization and AI approaches to this problem, evaluating their performance against traditional algorithms. We apply PatternBoost transformer learning and reinforcement learning (PPO) to this problem for the first time, comparing them against ILP. ILP achieves provably optimal solutions up to 19 by 19 grids, while PatternBoost matches optimal performance up to 14 by 14 grids with 96% test loss reduction. PPO achieves perfect solutions on 10 by 10 grids but fails at 11 by 11 grids, where constraint violations prevent valid configurations. These results demonstrate that classical optimization remains essential for exact solutions while AI methods offer competitive performance on smaller instances, with hybrid approaches presenting the most promising direction for scaling to larger problem sizes. 无三点共线问题询问在 n×n 网格上可以放置的最多点数,且不允许有三点共线,这代表组合几何学中的一个著名问题。虽然整数线性规划(ILP)等经典方法能保证最优解,但它们在网格规模增大时面临指数级扩展,而机器学习的最新进展为基于模式的近似提供了有前景的替代方案。本文首次对经典优化方法和人工智能方法对该问题的系统性比较,评估它们相对于传统算法的性能。我们首次将 PatternBoost 变换器学习和强化学习(PPO)应用于该问题,并将它们与 ILP 进行比较。ILP 在最多 19×19 网格上可获得可证明的最优解,而 PatternBoost 在最多 14×14 网格上匹配了最优性能,测试损失降低了 96%。PPO 在 10×10 网格上达到了完美解,但在 11×11 网格上失败,约束违规导致无法得到有效配置。 这些结果表明,经典优化方法在求解精确解方面仍然不可或缺,而人工智能方法在较小规模实例上表现具有竞争力,混合方法则为扩展到更大规模问题提供了最有希望的方向。
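Independent of the solver (ILP, PatternBoost, or PPO), a candidate placement can be verified with a brute-force collinearity check over all point triples; the cross-product test below is a simple sketch of such a validator, not code from the paper.

```python
from itertools import combinations

def collinear(p, q, r):
    # The cross product of (q - p) and (r - p) is zero
    # iff the three points lie on one line.
    return (q[0] - p[0]) * (r[1] - p[1]) == (q[1] - p[1]) * (r[0] - p[0])

def valid_placement(points):
    """True iff no three of the given grid points are collinear."""
    return not any(collinear(p, q, r) for p, q, r in combinations(points, 3))

# The four corners of a 2x2 grid achieve the 2n bound for n = 2:
ok = valid_placement([(0, 0), (0, 1), (1, 0), (1, 1)])   # no line holds 3
bad = valid_placement([(0, 0), (1, 1), (2, 2)])          # diagonal -> invalid
```

The check runs in O(k^3) for k points, which is cheap for the grid sizes discussed here (k <= 2n) and makes it practical to reject RL-proposed configurations that violate the constraint.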

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 11:12:42 UTC 发布:2025-12-12 11:12:42 协调世界时 (UTC)

9 Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes 9 Motif-2-12.7B-Reasoning:实践者的强化学习训练配方指南

Authors: [Junghwan Lim](https://arxiv.org/search/?searchtype=author&query=Junghwan Lim), [Sungmin Lee](https://arxiv.org/search/?searchtype=author&query=Sungmin Lee), [Dongseok Kim](https://arxiv.org/search/?searchtype=author&query=Dongseok Kim), [Taehyun Kim](https://arxiv.org/search/?searchtype=author&query=Taehyun Kim), [Eunhwan Park](https://arxiv.org/search/?searchtype=author&query=Eunhwan Park), [Jeesoo Lee](https://arxiv.org/search/?searchtype=author&query=Jeesoo Lee), [Jeongdoo Lee](https://arxiv.org/search/?searchtype=author&query=Jeongdoo Lee), [Junhyeok Lee](https://arxiv.org/search/?searchtype=author&query=Junhyeok Lee), [Wai Ting Cheung](https://arxiv.org/search/?searchtype=author&query=Wai Ting Cheung), [Dahye Choi](https://arxiv.org/search/?searchtype=author&query=Dahye Choi), [Minsu Ha](https://arxiv.org/search/?searchtype=author&query=Minsu Ha), [Jaeheui Her](https://arxiv.org/search/?searchtype=author&query=Jaeheui Her), [Jaeyeon Huh](https://arxiv.org/search/?searchtype=author&query=Jaeyeon Huh), [Hanbin Jung](https://arxiv.org/search/?searchtype=author&query=Hanbin Jung), [Changjin Kang](https://arxiv.org/search/?searchtype=author&query=Changjin Kang), [Beomgyu Kim](https://arxiv.org/search/?searchtype=author&query=Beomgyu Kim), [Minjae Kim](https://arxiv.org/search/?searchtype=author&query=Minjae Kim), [Taewhan Kim](https://arxiv.org/search/?searchtype=author&query=Taewhan Kim), [Youngrok Kim](https://arxiv.org/search/?searchtype=author&query=Youngrok Kim), [Hyukjin Kweon](https://arxiv.org/search/?searchtype=author&query=Hyukjin Kweon), [Haesol Lee](https://arxiv.org/search/?searchtype=author&query=Haesol Lee), [Kungyu Lee](https://arxiv.org/search/?searchtype=author&query=Kungyu Lee), [Dongpin Oh](https://arxiv.org/search/?searchtype=author&query=Dongpin Oh), [Yeongjae Park](https://arxiv.org/search/?searchtype=author&query=Yeongjae Park), [Bokki Ryu](https://arxiv.org/search/?searchtype=author&query=Bokki Ryu), [Dongjoo Weon](https://arxiv.org/search/?searchtype=author&query=Dongjoo Weon) 作者:Junghwan Lim、Sungmin Lee、Dongseok Kim、Taehyun Kim、Eunhwan Park、Jeesoo Lee、Jeongdoo Lee、Junhyeok Lee、Wai Ting Cheung、Dahye Choi、Minsu Ha、Jaeheui Her、Jaeyeon Huh、Hanbin Jung、Changjin Kang、Beomgyu Kim、Minjae Kim、Taewhan Kim、Youngrok Kim、Hyukjin Kweon、Haesol Lee、Kungyu Lee、Dongpin Oh、Yeongjae Park、Bokki Ryu、Dongjoo Weon

We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints. 我们介绍了 Motif-2-12.7B-Reasoning,一个拥有 127 亿参数的语言模型,旨在缩小开放权重系统与专有前沿模型在复杂推理和长上下文理解方面的差距。为了解决推理适配中常见的模型崩溃和训练不稳定问题,我们提出了一套全面且可复现的训练配方,涵盖系统、数据和算法层面的优化。我们的方法将针对 64K 标记上下文的高效内存基础设施(使用混合并行和内核级优化)与两阶段监督微调(SFT)课程相结合,通过验证过的、对齐的合成数据来缓解分布不匹配。此外,我们详细说明了一个稳健的强化学习微调(RLFT)流水线,该流水线通过基于难度的数据筛选和混合策略轨迹重用来稳定训练。实证结果表明,Motif-2-12.7B-Reasoning 在数学、编程和代理基准测试中表现可与参数量显著更大的模型相媲美,为社区提供了一个具有竞争力的开源模型和在真实计算限制下扩展推理能力的实用蓝图。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-11 00:51:18 UTC 发布:2025-12-11 00:51:18 UTC
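The "difficulty-aware data filtering" step in the RLFT pipeline above can be sketched in a few lines. This is a minimal illustration under assumed thresholds, not the paper's actual recipe: prompts are kept only if their empirical pass rate under the current policy falls in a mid-range "learnable" band, dropping items that are already solved or hopeless.

```python
# Hypothetical sketch of difficulty-aware filtering for RL fine-tuning data:
# keep only prompts whose empirical pass rate lies in a "learnable" band;
# the band edges (0.1, 0.9) are illustrative assumptions, not from the paper.

def filter_by_difficulty(pass_rates, low=0.1, high=0.9):
    """pass_rates maps prompt_id -> fraction of sampled rollouts that passed."""
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]

pass_rates = {"p1": 0.0, "p2": 0.25, "p3": 0.5, "p4": 1.0}
kept = filter_by_difficulty(pass_rates)
print(kept)  # p1 (never solved) and p4 (always solved) are dropped
```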

10 Back to the Baseline: Examining Baseline Effects on Explainability Metrics 10 回到基线:检视基线对可解释性指标的影响

Authors: [Agustin Martin Picard](https://arxiv.org/search/?searchtype=author&query=Agustin Martin Picard), [Thibaut Boissin](https://arxiv.org/search/?searchtype=author&query=Thibaut Boissin), [Varshini Subhash](https://arxiv.org/search/?searchtype=author&query=Varshini Subhash), [Rémi Cadène](https://arxiv.org/search/?searchtype=author&query=Rémi Cadène), [Thomas Fel](https://arxiv.org/search/?searchtype=author&query=Thomas Fel) 作者:Agustin Martin Picard、Thibaut Boissin、Varshini Subhash、Rémi Cadène、Thomas Fel

Attribution methods are among the most prevalent techniques in Explainable Artificial Intelligence (XAI) and are usually evaluated and compared using Fidelity metrics, with Insertion and Deletion being the most popular. These metrics rely on a baseline function to alter the pixels of the input image that the attribution map deems most important. In this work, we highlight a critical problem with these metrics: the choice of a given baseline will inevitably favour certain attribution methods over others. More concerningly, even a simple linear model with commonly used baselines contradicts itself by designating different optimal methods. A question then arises: which baseline should we use? We propose to study this problem through two desirable properties of a baseline: (i) that it removes information and (ii) that it does not produce overly out-of-distribution (OOD) images. We first show that none of the tested baselines satisfy both criteria, and there appears to be a trade-off among current baselines: either they remove information or they produce a sequence of OOD images. Finally, we introduce a novel baseline by leveraging recent work in feature visualisation to artificially produce a model-dependent baseline that removes information without being overly OOD, thus improving on the trade-off when compared to other existing baselines. Our code is available at https://github.com/deel-ai-papers/Back-to-the-Baseline 归因方法是可解释人工智能(XAI)中最常见的技术之一,通常使用保真度(Fidelity)度量来评估和比较,其中插入(Insertion)和删除(Deletion)是最流行的。这些度量依赖于一个基线函数,用以改变归因图认为最重要的输入图像像素。在本工作中,我们强调了这些度量的一个关键问题:特定基线的选择不可避免地会偏向某些归因方法而非其他方法。更令人担忧的是,即便是使用常见基线的简单线性模型,也会自相矛盾地指定不同的最佳方法。于是一个问题出现了:我们应当使用哪种基线?我们提出通过基线的两个理想属性来研究此问题:(i)它应当去除信息;(ii)它不应产生过度的非分布(OOD)图像。我们首先展示了,所测试的基线中没有一种同时满足这两个标准,并且当前基线似乎存在权衡:它们要么能去除信息,要么会生成一系列 OOD 图像。 最后,我们引入了一个新基线,利用最近在特征可视化方面的工作,人工生成一个依赖于模型的基线,该基线在去除信息的同时不过度成为分布外(OOD),从而在与其他现有基线相比时改进了这一权衡。我们的代码可在 https://github.com/deel-ai-papers/Back-to-the-Baseline 获取

Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition 主题:人工智能,计算机视觉与模式识别

Publish: 2025-12-12 10:13:44 UTC 发布:2025-12-12 10:13:44 协调世界时
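The Insertion metric that the paper scrutinizes can be sketched as follows. This is an illustrative toy with a stand-in "model" and made-up data (the real metric queries a trained network on images): pixels are revealed in order of attribution importance, the rest are filled from a baseline, and swapping the baseline changes the resulting score, which is exactly the sensitivity the paper highlights.

```python
import numpy as np

# Illustrative sketch (not the paper's code) of the Insertion metric:
# reveal pixels in order of attribution importance, filling the rest with a
# chosen baseline, and average the model's score along the way.

def insertion_score(model, image, attribution, baseline, steps=10):
    order = np.argsort(-attribution.ravel())          # most important first
    out = baseline.copy().ravel()
    img = image.ravel()
    scores = []
    for k in range(1, steps + 1):
        n = int(len(order) * k / steps)
        out[order[:n]] = img[order[:n]]               # reveal top-n pixels
        scores.append(model(out.reshape(image.shape)))
    return float(np.mean(scores))                     # area under the curve

# Toy "model": responds to the mean intensity of the image.
model = lambda x: float(x.mean())
image = np.ones((4, 4))
attr = np.arange(16, dtype=float).reshape(4, 4)
zero_baseline = np.zeros_like(image)
mean_baseline = np.full_like(image, 0.5)
print(insertion_score(model, image, attr, zero_baseline))
print(insertion_score(model, image, attr, mean_baseline))  # different baseline, different score
```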

11 AgentBalance: Backbone-then-Topology Design for Cost-Effective Multi-Agent Systems under Budget Constraints 11 AgentBalance:在预算约束下面向成本效益的多代理系统的“先骨干后拓扑”设计

Authors: [Shuowei Cai](https://arxiv.org/search/?searchtype=author&query=Shuowei Cai), [Yansong Ning](https://arxiv.org/search/?searchtype=author&query=Yansong Ning), [Hao Liu](https://arxiv.org/search/?searchtype=author&query=Hao Liu) 作者:蔡硕伟,宁延松,刘浩

Large Language Model (LLM)-based multi-agent systems (MAS) are becoming indispensable building blocks for web-scale applications such as web search, social network analytics, and online customer support, where cost-effectiveness is increasingly the primary constraint for large-scale deployment. While recent work improves MAS cost-effectiveness by shaping inter-agent communication topologies and selecting agent backbones, it rarely models and optimizes under explicit token-cost and latency budgets that reflect deployment constraints. This often leads to topology-first designs and suboptimal cost-effectiveness when budgets are binding. We present AgentBalance, a framework for constructing cost-effective MAS under explicit token-cost and latency budgets via a backbone-then-topology design. AgentBalance first performs backbone-oriented agent generation, constructing agents with heterogeneous backbones through LLM pool construction, pool selection, and role-backbone matching. It then performs adaptive MAS topology generation, guiding inter-agent communication via agent representation learning, gating, and latency-aware topology synthesis. Experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance achieves up to 10% and 22% performance gains under matched token-cost and latency budgets, respectively, and yields strong AUC on performance-versus-budget curves across benchmarks. AgentBalance also functions as a plug-in for existing MAS, improving performance under the same token-cost and latency constraints, and it generalizes well to unseen LLMs for practical, budget-aware deployment. 
Code: https://github.com/usail-hkust/AgentBalance 基于大语言模型(LLM)的多智能体系统(MAS)正逐渐成为网络级应用(如网络搜索、社交网络分析和在线客户支持)的不可或缺的构件,而在大规模部署中,成本效益正日益成为首要约束。尽管近期工作通过设计智能体间通信拓扑和选择智能体主干模型来提升 MAS 的成本效益,但很少在反映部署约束的明确令牌成本和延迟预算下进行建模和优化。这常导致以拓扑优先的设计,并在预算受限时产生次优的成本效益。我们提出了 AgentBalance,一个通过“先主干后拓扑”的设计,在明确的令牌成本和延迟预算下构建具成本效益 MAS 的框架。AgentBalance 首先进行面向主干的智能体生成,通过构建 LLM 池、池选择和角色—主干匹配来构造具有异质主干的智能体;随后进行自适应 MAS 拓扑生成,通过智能体表示学习、门控和考虑延迟的拓扑合成来引导智能体间的通信。 在使用 14 个候选 LLM 骨干模型的基准测试实验中,AgentBalance 在匹配的令牌成本和延迟预算下分别实现了最多 10% 和 22% 的性能提升,并且在各基准上的性能-预算曲线中呈现出强劲的 AUC。AgentBalance 还可作为现有多代理系统(MAS)的插件,在相同的令牌成本和延迟约束下提升性能,并且对未见过的 LLM 具有良好的泛化能力,适用于实际的预算感知部署。代码: https://github.com/usail-hkust/AgentBalance

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 10:08:03 UTC 发布:2025-12-12 10:08:03 UTC
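As a rough illustration of budget-aware role-backbone matching, one can assign each agent role the highest-quality backbone that still fits the remaining token-cost budget. The greedy rule, quality scores, and costs below are our own stand-ins; AgentBalance learns this matching rather than hard-coding it.

```python
# Hypothetical sketch of budget-aware role-to-backbone matching: greedily
# give each role the best backbone that fits the remaining token-cost budget.

def match_backbones(roles, backbones, budget):
    """backbones maps name -> (quality, cost_per_call). Returns (role -> name, spent)."""
    plan, spent = {}, 0.0
    for role in roles:
        # candidates that fit the remaining budget, best quality first
        feasible = [(q, -c, name) for name, (q, c) in backbones.items()
                    if spent + c <= budget]
        if not feasible:
            break
        q, neg_c, name = max(feasible)
        plan[role] = name
        spent += -neg_c
    return plan, spent

backbones = {"large": (0.9, 5.0), "mid": (0.7, 2.0), "small": (0.5, 0.5)}
plan, cost = match_backbones(["planner", "critic", "executor"], backbones, budget=8.0)
print(plan, cost)  # later roles get cheaper backbones as the budget tightens
```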

12 Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance 12 通过行为引导走向可信的多轮 LLM 代理

Author: [Gonca Gürsun](https://arxiv.org/search/?searchtype=author&query=Gonca Gürsun) 作者:Gonca Gürsun

Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation - action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior. 大型语言模型展示了强大的推理和生成能力,然而在多回合任务中的行为往往缺乏可靠性和可验证性。我们提出了一个任务完成框架,使基于 LLM 的代理能够在以强化学习形式描述、具有明确定义的观测、动作和奖励信号的环境中,在明确的行为引导下行动。该框架整合了三部分:一个轻量级任务分析器,用于选择推理和生成策略;一个推理模块,用于学习可验证的观测—动作映射;以及一个生成模块,通过验证或确定性合成来强制输出符合约束。我们展示了随着代理与环境的交互,这些组件共同进化,从而产生可信的行为。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 10:03:24 UTC 发布:2025-12-12 10:03:24 UTC

13 CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving 13 CAPTURE:用于 LVLM 在 CAPTCHA 解析上的基准和评估

Authors: [Jianyi Zhang](https://arxiv.org/search/?searchtype=author&query=Jianyi Zhang), [Ziyin Zhou](https://arxiv.org/search/?searchtype=author&query=Ziyin Zhou), [Xu Ji](https://arxiv.org/search/?searchtype=author&query=Xu Ji), [Shizhao Liu](https://arxiv.org/search/?searchtype=author&query=Shizhao Liu), [Zhangchi Zhao](https://arxiv.org/search/?searchtype=author&query=Zhangchi Zhao) 作者:张建义、周子尹、季旭、刘世钊、赵章驰

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce CAPTURE (CAPTCHA for Testing Under Real-world Experiments), the first CAPTCHA benchmark designed specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs. 得益于强大且高效的多模态对齐策略,大型视觉语言模型(LVLMs)能够模拟人类的视觉与推理能力,例如解答验证码。然而,现有基于视觉验证码的基准测试仍存在局限性。以往研究在设计基准和数据集时,会根据其研究目标进行定制。因此,这些基准无法全面覆盖所有验证码类型。值得注意的是,专门针对 LVLM 的基准几乎不存在。为解决这一问题,我们首次提出了一个专为 LVLM 设计的新型验证码基准,名为 CAPTURE(CAPTCHA for Testing Under Real-world Experiments)。我们的基准涵盖来自 31 家厂商的 4 类主要验证码和 25 个子类。这种多样性使得对 LVLM 性能的评估可以做到多维且全面。CAPTURE 具备广泛的类别多样性、大规模数据以及为 LVLM 量身定制的标签,填补了以往研究在数据全面性和标签相关性方面的空白。通过该基准评估时,当前 LVLM 在解答验证码方面表现不佳。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 06:50:27 UTC 发布时间:2025-12-12 06:50:27 协调世界时 (UTC)

14 TriFlow: A Progressive Multi-Agent Framework for Intelligent Trip Planning 14 TriFlow:一种用于智能行程规划的渐进式多代理框架

Authors: [Yuxing Chen](https://arxiv.org/search/?searchtype=author&query=Yuxing Chen), [Basem Suleiman](https://arxiv.org/search/?searchtype=author&query=Basem Suleiman), [Qifan Chen](https://arxiv.org/search/?searchtype=author&query=Qifan Chen) 作者:陈宇星、Basem Suleiman、陈齐凡

Real-world trip planning requires transforming open-ended user requests into executable itineraries under strict spatial, temporal, and budgetary constraints while aligning with user preferences. Existing LLM-based agents struggle with constraint satisfaction, tool coordination, and efficiency, often producing infeasible or costly plans. To address these limitations, we present TriFlow, a progressive multi-agent framework that unifies structured reasoning and language-based flexibility through a three-stage pipeline of retrieval, planning, and governance. By this design, TriFlow progressively narrows the search space, assembles constraint-consistent itineraries via rule-LLM collaboration, and performs bounded iterative refinement to ensure global feasibility and personalisation. Evaluations on TravelPlanner and TripTailor benchmarks demonstrated state-of-the-art results, achieving 91.1% and 97% final pass rates, respectively, with over 10x runtime efficiency improvement compared to current SOTA. 现实世界的行程规划需要将开放式的用户请求转化为在严格的空间、时间和预算约束下可执行的行程,同时符合用户偏好。现有基于 LLM 的代理在满足约束、工具协调和效率方面存在困难,常常产生不可行或成本过高的计划。为了解决这些局限性,我们提出了 TriFlow,一种渐进式多代理框架,通过检索、规划和治理三个阶段的流水线将结构化推理与基于语言的灵活性统一起来。通过这种设计,TriFlow 逐步缩小搜索空间,通过规则-LLM 协作组装与约束一致的行程,并执行有界的迭代完善以保证全局可行性和个性化。在 TravelPlanner 和 TripTailor 基准上的评估显示出最先进的结果,最终通过率分别达到 91.1% 和 97%,与当前最先进方法相比运行时效率提高超过 10 倍。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 04:27:22 UTC 发布:2025-12-12 04:27:22 UTC
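The three-stage retrieval, planning, and governance pipeline can be caricatured in a few lines. This is purely illustrative: TriFlow's stages combine rules with LLM reasoning, not the toy heuristics below, and the POI data is made up.

```python
# Illustrative sketch (not the paper's system) of a retrieve -> plan -> govern
# pipeline: filter candidates by hard constraints, greedily assemble an
# itinerary, then refine for a bounded number of rounds.

def retrieve(pois, budget):
    return [p for p in pois if p["cost"] <= budget]          # narrow search space

def plan(pois, budget, slots):
    picks, spent = [], 0.0
    for p in sorted(pois, key=lambda p: -p["rating"]):       # best rated first
        if len(picks) < slots and spent + p["cost"] <= budget:
            picks.append(p)
            spent += p["cost"]
    return picks, spent

def govern(picks, spent, budget, max_rounds=3):
    for _ in range(max_rounds):                              # bounded refinement
        if spent <= budget:
            break
        worst = min(picks, key=lambda p: p["rating"])
        picks.remove(worst)
        spent -= worst["cost"]
    return picks, spent

pois = [{"name": "museum", "cost": 30, "rating": 4.8},
        {"name": "tower", "cost": 60, "rating": 4.5},
        {"name": "park", "cost": 0, "rating": 4.2}]
picks, spent = govern(*plan(retrieve(pois, 100), 100, slots=3), budget=100)
print([p["name"] for p in picks], spent)
```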

15 A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation 15 A-LAMP:基于代理的 LLM 框架用于自动化 MDP 建模和策略生成

Authors: [Hong Je-Gal](https://arxiv.org/search/?searchtype=author&query=Hong Je-Gal), [Chan-Bin Yi](https://arxiv.org/search/?searchtype=author&query=Chan-Bin Yi), [Hyun-Suk Lee](https://arxiv.org/search/?searchtype=author&query=Hyun-Suk Lee) 作者:Hong Je-Gal、Chan-Bin Yi、Hyun-Suk Lee

Applying reinforcement learning (RL) to real-world tasks requires converting informal descriptions into a formal Markov decision process (MDP), implementing an executable environment, and training a policy agent. Automating this process is challenging due to modeling errors, fragile code, and misaligned objectives, which often impede policy training. We introduce an agentic large language model (LLM)-based framework for automated MDP modeling and policy generation (A-LAMP), that automatically translates free-form natural language task descriptions into an MDP formulation and trained policy. The framework decomposes modeling, coding, and training into verifiable stages, ensuring semantic alignment throughout the pipeline. Across both classic control and custom RL domains, A-LAMP consistently achieves higher policy generation capability than a single state-of-the-art LLM model. Notably, even its lightweight variant, which is built on smaller language models, approaches the performance of much larger models. Failure analysis reveals why these improvements occur. In addition, a case study also demonstrates that A-LAMP generates environments and policies that preserve the task’s optimality, confirming its correctness and reliability. 将强化学习(RL)应用于现实世界任务需要把非正式的描述转换为形式化的马尔可夫决策过程(MDP)、实现可执行的环境并训练策略智能体。由于建模错误、脆弱的代码和目标不一致等问题,这一过程自动化面临挑战,常常阻碍策略训练。我们提出了一个基于智能体化大型语言模型(LLM)的自动化 MDP 建模与策略生成框架(A-LAMP),可将自由形式的自然语言任务描述自动翻译为 MDP 表述并生成训练好的策略。该框架将建模、编码和训练分解为可验证的阶段,确保整个流程中的语义对齐。在经典控制和自定义强化学习领域中,A-LAMP 在策略生成能力上始终优于单一的最先进 LLM 模型。值得注意的是,即使是基于较小语言模型构建的轻量级变体,其性能也接近更大模型的表现。故障分析揭示了这些改进产生的原因。 此外,一项案例研究还表明,A-LAMP 生成的环境和策略能够保持任务的最优性,证实了其正确性和可靠性。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 04:21:17 UTC 发布:2025-12-12 04:21:17 协调世界时 (UTC)
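The "verifiable stages" idea, checking a structured MDP formulation for well-formedness before any policy training so that modeling errors are caught early, can be sketched as follows. The spec format and checks are our own assumptions, not A-LAMP's, and the solver here is plain value iteration rather than a trained policy agent.

```python
# Illustrative sketch of staged, verifiable MDP construction: validate the
# spec (hypothetical dict format), then solve it with value iteration.

def validate(mdp):
    states, actions = mdp["states"], mdp["actions"]
    for (s, a), dist in mdp["P"].items():
        assert s in states and a in actions, "unknown state/action"
        assert abs(sum(dist.values()) - 1.0) < 1e-9, "probs must sum to 1"
    return mdp

def value_iteration(mdp, gamma=0.9, iters=100):
    V = {s: 0.0 for s in mdp["states"]}
    for _ in range(iters):
        for s in mdp["states"]:
            V[s] = max(sum(p * (mdp["R"].get(s2, 0.0) + gamma * V[s2])
                           for s2, p in mdp["P"][(s, a)].items())
                       for a in mdp["actions"])
    return V

mdp = validate({
    "states": ["s0", "goal"], "actions": ["go", "stay"],
    "P": {("s0", "go"): {"goal": 1.0}, ("s0", "stay"): {"s0": 1.0},
          ("goal", "go"): {"goal": 1.0}, ("goal", "stay"): {"goal": 1.0}},
    "R": {"goal": 1.0},
})
V = value_iteration(mdp)
print(round(V["s0"], 3))
```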

16 FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration 16 FutureWeaver:为多智能体系统在测试时规划计算资源的模块化协作

Authors: [Dongwon Jung](https://arxiv.org/search/?searchtype=author&query=Dongwon Jung), [Peng Shi](https://arxiv.org/search/?searchtype=author&query=Peng Shi), [Yi Zhang](https://arxiv.org/search/?searchtype=author&query=Yi Zhang) 作者:Dongwon Jung、Peng Shi、Yi Zhang

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: no principled mechanisms exist to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization. 在测试时增加计算量可以在不额外训练的情况下提升大型语言模型的性能。近期研究表明,通过在推理阶段分配更多计算资源,诸如重复采样、自我验证和自我反思等技术能够显著提高任务成功率。然而,将这些技术应用于多智能体系统中的多个主体却很困难:尚无可行的机制来分配计算以促进智能体间的协作、将测试时扩展应用到协作性交互中,或在明确的预算限制下在智能体间分配计算资源。为了解决这一空白,我们提出了 FutureWeaver,这是一个在固定预算下为多智能体系统规划和优化测试时计算分配的框架。FutureWeaver 引入了模块化协作,以可调用函数的形式形式化,封装了可重用的多智能体工作流。这些模块通过自我对弈反思自动生成,方法是从过去的轨迹中抽象出重复出现的交互模式。在这些模块的基础上,FutureWeaver 采用了双层规划架构,通过对当前任务状态进行推理并对未来步骤进行预测,从而优化计算资源的分配。在复杂智能体基准测试上的实验表明,FutureWeaver 在各种预算设置下始终优于基线方法,验证了其在推理时优化多智能体协作方面的有效性。

Subjects: Artificial Intelligence, Computation and Language 主题:人工智能,计算与语言

Publish: 2025-12-12 01:43:48 UTC 发布时间:2025-12-12 01:43:48 世界协调时
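A minimal sketch of budget-constrained test-time compute allocation: repeatedly give one more sample to whichever agent currently has the best estimated marginal gain, until the call budget runs out. The diminishing-returns gain model below is a stand-in of ours; FutureWeaver plans with learned modules and speculation rather than a fixed formula.

```python
# Hypothetical greedy allocation of inference calls across agents under a
# fixed budget, driven by an assumed marginal-gain model.

def allocate(marginal_gain, agents, budget):
    """marginal_gain(agent, k): estimated gain of that agent's (k+1)-th sample."""
    counts = {a: 0 for a in agents}
    for _ in range(budget):
        best = max(agents, key=lambda a: marginal_gain(a, counts[a]))
        counts[best] += 1
    return counts

# Toy diminishing-returns model: gain halves with each extra sample.
base = {"solver": 1.0, "verifier": 0.6, "reflector": 0.3}
gain = lambda a, k: base[a] / (2 ** k)
print(allocate(gain, list(base), budget=6))
```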

17 Deep Learning–Accelerated Multi-Start Large Neighborhood Search for Real-time Freight Bundling 17 深度学习加速的多起点大邻域搜索用于实时货运捆绑

Authors: [Haohui Zhang](https://arxiv.org/search/?searchtype=author&query=Haohui Zhang), [Wouter van Heeswijk](https://arxiv.org/search/?searchtype=author&query=Wouter van Heeswijk), [Xinyu Hu](https://arxiv.org/search/?searchtype=author&query=Xinyu Hu), [Neil Yorke-Smith](https://arxiv.org/search/?searchtype=author&query=Neil Yorke-Smith), [Martijn Mes](https://arxiv.org/search/?searchtype=author&query=Martijn Mes) 作者:张昊辉、Wouter van Heeswijk、胡欣宇、Neil Yorke-Smith、Martijn Mes

Online Freight Exchange Systems (OFEX) play a crucial role in modern freight logistics by facilitating real-time matching between shippers and carriers. However, efficient combinatorial bundling of transportation jobs remains a bottleneck. We model the OFEX combinatorial bundling problem as a multi-commodity one-to-one pickup-and-delivery selective traveling salesperson problem (m1-PDSTSP), which optimizes revenue-driven freight bundling under capacity, precedence, and route-length constraints. The key challenge is to couple combinatorial bundle selection with pickup-and-delivery routing under sub-second latency. We propose a learning–accelerated hybrid search pipeline that pairs a Transformer Neural Network-based constructive policy with an innovative Multi-Start Large Neighborhood Search (MSLNS) metaheuristic within a rolling-horizon scheme in which the platform repeatedly freezes the current marketplace into a static snapshot and solves it under a short time budget. This pairing leverages the low-latency, high-quality inference of the learning-based constructor alongside the robustness of improvement search; the multi-start design and plausible seeds help LNS to explore the solution space more efficiently. Across benchmarks, our method outperforms state-of-the-art neural combinatorial optimization and metaheuristic baselines in solution quality with comparable time, achieving an optimality gap of less than 2% in total revenue relative to the best available exact baseline method. To our knowledge, this is the first work to establish that a Deep Neural Network-based constructor can reliably provide high-quality seeds for (multi-start) improvement heuristics, with applicability beyond the m1-PDSTSP to a broad class of selective traveling salesperson problems and pickup and delivery problems.
在线货运交易系统(OFEX)在现代货运物流中起着关键作用,通过促成托运方与承运方之间的实时匹配。然而,高效的运输作业组合打包仍然是一个瓶颈。我们将 OFEX 组合打包问题建模为一种多商品一对一揽收与交付选择性旅行推销员问题(m1-PDSTSP),在容量、先后顺序和路线长度约束下,优化以收益为驱动的货运打包。关键挑战在于在亚秒级延迟下将组合包裹选择与揽收与交付路径规划耦合。我们提出了一种学习加速的混合搜索流程,该流程将基于 Transformer 神经网络的构造性策略与一种创新的多起点大领域搜索(MSLNS)元启发式相结合,运行于一个滚动时域方案中,在该方案中平台反复将当前市场冻结为静态快照并在短时间预算内求解。 这种组合利用了基于学习的构造器在低延迟、高质量推理方面的优势,同时结合了改进搜索的鲁棒性;多起点设计和合理的种子帮助 LNS 更高效地探索解空间。在各类基准测试中,我们的方法以相当的时间成本在解质量上优于最先进的神经组合优化和元启发式基线方法,相对于最好的可用精确基线方法,在总收益上的最优性差距小于 2%。 据我们所知,这是首个建立基于深度神经网络的构造器能够可靠地为(多起点)改进启发式方法提供高质量种子的工作,其适用性超出了 m1-PDSTSP,扩展到一类广泛的选择性旅行商问题和接送问题。

Subject: Artificial Intelligence 主题:人工智能

Publish: 2025-12-12 00:29:37 UTC 发布:2025-12-12 00:29:37 UTC
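The multi-start LNS loop with externally provided seeds can be sketched on a toy bundle-selection problem. This is illustrative only: in the paper the seeds come from a Transformer-based constructor and the underlying problem is the m1-PDSTSP, whereas here the seeds are hand-picked and the "bundle" is just a capacity-limited set of jobs.

```python
import random

# Illustrative multi-start large neighborhood search (not the paper's code):
# each start is seeded with a candidate solution, then repeatedly "destroys"
# part of it and greedily repairs it; the best solution across starts wins.

def lns(items, capacity, seed_sol, rounds=50, rng=None):
    rng = rng or random.Random(0)
    best = list(seed_sol)
    for _ in range(rounds):
        sol = [i for i in best if rng.random() > 0.3]         # destroy step
        for i in sorted(items, key=lambda i: -items[i]):      # greedy repair
            if i not in sol and len(sol) < capacity:
                sol.append(i)
        if sum(items[i] for i in sol) > sum(items[i] for i in best):
            best = sol
    return best

items = {"a": 5, "b": 3, "c": 8, "d": 1}                      # job -> revenue
starts = [["a"], ["d"], ["b", "d"]]                           # seed solutions
best = max((lns(items, capacity=2, seed_sol=s) for s in starts),
           key=lambda sol: sum(items[i] for i in sol))
print(sorted(best))
```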

18 CORL: Reinforcement Learning of MILP Policies Solved via Branch and Bound 18 CORL:通过分支定界法求解的混合整数线性规划策略的强化学习

Authors: [Akhil S Anand](https://arxiv.org/search/?searchtype=author&query=Akhil S Anand), [Elias Aarekol](https://arxiv.org/search/?searchtype=author&query=Elias Aarekol), [Martin Mziray Dalseg](https://arxiv.org/search/?searchtype=author&query=Martin Mziray Dalseg), [Magnus Stalhane](https://arxiv.org/search/?searchtype=author&query=Magnus Stalhane), [Sebastien Gros](https://arxiv.org/search/?searchtype=author&query=Sebastien Gros) 作者:Akhil S Anand、Elias Aarekol、Martin Mziray Dalseg、Magnus Stalhane、Sebastien Gros

Combinatorial sequential decision making problems are typically modeled as mixed integer linear programs (MILPs) and solved via branch and bound (B&B) algorithms. The inherent difficulty of modeling MILPs that accurately represent stochastic real world problems leads to suboptimal performance in the real world. Recently, machine learning methods have been applied to build MILP models for decision quality rather than how accurately they model the real world problem. However, these approaches typically rely on supervised learning, assume access to true optimal decisions, and use surrogates for the MILP gradients. In this work, we introduce a proof-of-concept CORL framework that end-to-end fine-tunes an MILP scheme using reinforcement learning (RL) on real world data to maximize its operational performance. We enable this by casting an MILP solved by B&B as a differentiable stochastic policy compatible with RL. We validate the CORL method in a simple illustrative combinatorial sequential decision making example. 组合式序贯决策问题通常被建模为混合整数线性规划(MILP),并通过分支定界(B&B)算法求解。准确表示随机现实世界问题的 MILP 建模固有困难导致现实世界中的次优性能。最近,机器学习方法被用于构建以决策质量为目标而非对现实世界问题建模精度的 MILP。然而,这些方法通常依赖监督学习、假设可以获得真实最优决策,并且使用 MILP 梯度的替代品。在本工作中,我们提出了一个概念验证的 CORL 框架,该框架通过在现实世界数据上使用强化学习(RL)端到端微调 MILP 方案,以最大化其运营性能。我们通过将由 B&B 求解的 MILP 表述为与 RL 兼容的可微随机策略来实现这一点。我们在一个简单的说明性组合序贯决策示例中验证了 CORL 方法。

Subjects: Artificial Intelligence, Machine Learning, Systems and Control, Optimization and Control 主题:人工智能、机器学习、系统与控制、优化与控制

Publish: 2025-12-11 23:20:13 UTC 发布:2025-12-11 23:20:13 UTC
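The core trick, treating a discrete optimizer's output as a stochastic policy trainable with REINFORCE, can be illustrated on a toy problem. Our simplification: the "solver" is an enumeration over three feasible integer solutions scored by learned objective weights, not an actual branch-and-bound run, and the reward table is made up.

```python
import math
import random

# Conceptual sketch (not the paper's method): candidate integer solutions are
# scored by a parameterized objective, sampled via softmax, and the objective
# weights are updated with REINFORCE from observed reward.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

candidates = [(0, 1), (1, 0), (1, 1)]            # feasible integer solutions
theta = [0.0, 0.0]                               # learned objective weights
true_reward = {(0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.5}

rng = random.Random(0)
for _ in range(2000):
    scores = [theta[0] * x + theta[1] * y for x, y in candidates]
    probs = softmax(scores)
    i = rng.choices(range(len(candidates)), probs)[0]
    r = true_reward[candidates[i]]
    for d in range(2):                           # REINFORCE: r * (x_i - E[x])
        ex = sum(p * c[d] for p, c in zip(probs, candidates))
        theta[d] += 0.1 * r * (candidates[i][d] - ex)

scores = [theta[0] * x + theta[1] * y for x, y in candidates]
best = candidates[max(range(3), key=lambda j: scores[j])]
print(best)  # the policy concentrates on the highest-reward solution
```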

19 Unified Smart Factory Model: A model-based Approach for Integrating Industry 4.0 and Sustainability for Manufacturing Systems 19 统一智能工厂模型:一种基于模型的方法,用于将工业 4.0 与制造系统的可持续性集成

Authors: [Ishaan Kaushal](https://arxiv.org/search/?searchtype=author&query=Ishaan Kaushal), [Amaresh Chakrabarti](https://arxiv.org/search/?searchtype=author&query=Amaresh Chakrabarti) 作者:Ishaan Kaushal、Amaresh Chakrabarti

This paper presents the Unified Smart Factory Model (USFM), a comprehensive framework designed to translate high-level sustainability goals into measurable factory-level indicators with a systematic information map of manufacturing activities. The manufacturing activities were modelled as a set of manufacturing, assembly and auxiliary processes using Object Process Methodology, a Model Based Systems Engineering (MBSE) language. USFM integrates Manufacturing Process and System, Data Process, and Key Performance Indicator (KPI) Selection and Assessment in a single framework. Through a detailed case study of a Printed Circuit Board (PCB) assembly factory, the paper demonstrates how environmental sustainability KPIs can be selected, modelled, and mapped to the necessary data, highlighting energy consumption and environmental impact metrics. The model’s systematic approach can reduce redundancy, minimize the risk of missing critical information, and enhance data collection. The paper concludes that the USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries, specifically SMEs, aiming to achieve sustainability targets. 本文提出了统一智能工厂模型(USFM),一个全面的框架,旨在将高层次的可持续发展目标转化为可衡量的工厂级指标,并提供制造活动的系统信息映射。制造活动使用基于模型的系统工程(MBSE)语言——对象过程方法(Object Process Methodology)建模为一系列制造、装配和辅助过程。USFM 在单一框架中整合了制造过程与系统、数据流程以及关键绩效指标(KPI)的选择与评估。通过对印制电路板(PCB)装配工厂的详细案例研究,本文展示了如何选择、建模并将环境可持续性 KPI 映射到所需数据,重点突出能耗和环境影响指标。该模型的系统性方法可以减少冗余、降低遗漏关键信息的风险,并增强数据采集。论文得出结论,USFM 弥合了可持续发展目标与实际落实之间的鸿沟,为尤其是寻求实现可持续目标的中小企业等行业提供了显著益处。

Subject: Emerging Technologies 主题:新兴技术

Publish: 2025-12-11 13:30:38 UTC 发布日期:2025-12-11 13:30:38 UTC

20 Particulate: Feed-Forward 3D Object Articulation 20 Particulate:前馈式 3D 物体关节化

Authors: [Ruining Li](https://arxiv.org/search/?searchtype=author&query=Ruining Li), [Yuxin Yao](https://arxiv.org/search/?searchtype=author&query=Yuxin Yao), [Chuanxia Zheng](https://arxiv.org/search/?searchtype=author&query=Chuanxia Zheng), [Christian Rupprecht](https://arxiv.org/search/?searchtype=author&query=Christian Rupprecht), [Joan Lasenby](https://arxiv.org/search/?searchtype=author&query=Joan Lasenby), [Shangzhe Wu](https://arxiv.org/search/?searchtype=author&query=Shangzhe Wu), [Andrea Vedaldi](https://arxiv.org/search/?searchtype=author&query=Andrea Vedaldi) 作者:Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network’s feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches. 我们提出了 Particulate,一种前馈方法:给定一个日常物体的单个静态三维网格,直接推断出底层关节结构的所有属性,包括其三维部件、运动学结构和运动约束。其核心是一个变换器网络,称为 Part Articulation Transformer,该网络使用灵活且可扩展的架构处理输入网格的点云,以原生多关节支持预测上述所有属性。我们在来自公开数据集的多样化关节化三维资产集合上端到端训练该网络。在推理阶段,Particulate 将网络的前馈预测提升回输入网格,几秒钟内生成一个完全关节化的三维模型,远快于那些需要对每个物体进行优化的先前方法。Particulate 还可以准确推断 AI 生成三维资产的关节结构,当与现成的图像到三维生成器结合时,可实现从单张(真实或合成)图像完整提取关节化三维物体。 我们进一步引入了一个新的挑战性基准,用于从高质量的公共 3D 资源中策划的 3D 关节估计,并重新设计了评估协议,使其更符合人类偏好。定量和定性结果表明,Particulate 明显优于最先进的方法。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics 主题:计算机视觉与模式识别,人工智能,图形学

Publish: 2025-12-12 18:59:51 UTC 发布时间:2025-12-12 18:59:51 协调世界时 (UTC)

21 Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously 21 Super Suffixes: 同时绕过文本生成对齐和防护模型

Authors: [Andrew Adiletta](https://arxiv.org/search/?searchtype=author&query=Andrew Adiletta), [Kathryn Adiletta](https://arxiv.org/search/?searchtype=author&query=Kathryn Adiletta), [Kemal Derya](https://arxiv.org/search/?searchtype=author&query=Kemal Derya), [Berk Sunar](https://arxiv.org/search/?searchtype=author&query=Berk Sunar) 作者:Andrew Adiletta、Kathryn Adiletta、Kemal Derya、Berk Sunar

The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to reveal that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing the changing similarity of a model’s internal state to specific concept directions during token sequence processing, we propose an effective and lightweight method to detect Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, significantly improves the detection of malicious prompts generated through Super Suffixes. It increases the non-benign classification rate to nearly 100%, making DeltaGuard a valuable addition to the guard model stack and enhancing robustness against adversarial prompt attacks. 
大型语言模型 (LLMs) 的快速部署带来了对机器学习 (ML) 更高的安全性和隐私保护需求。LLMs 越来越多地被用于处理不受信任的文本输入,甚至生成可执行代码,且常常能够访问敏感的系统控制权限。为了解决这些安全问题,一些公司引入了守护模型,这些模型体积更小、专门用于保护文本生成模型免受对抗性或恶意输入的影响。在本研究中,我们通过提出超后缀(Super Suffixes)来推进对对抗性输入的研究,超后缀能够在具有不同分词方案的多种模型上覆盖多个对齐目标。我们通过联合优化技术展示了它们的有效性,成功绕过了 Llama Prompt Guard 2 在五种不同文本生成模型上的恶意文本和代码生成防护机制。据我们所知,这是首个披露可以通过联合优化破坏 Llama Prompt Guard 2 的工作。 此外,通过分析模型在处理序列标记时其内部状态与特定概念方向相似度的变化,我们提出了一种有效且轻量的方法来检测超级后缀(Super Suffix)攻击。我们证明了残差流与某些概念方向之间的余弦相似度可作为模型意图的鲜明指纹。我们提出的对策 DeltaGuard 显著提高了对通过超级后缀生成的恶意提示的检测能力。它将非良性分类率提高到接近 100%,使 DeltaGuard 成为防护模型堆栈中的有价值补充,并增强了对对抗性提示攻击的鲁棒性。

Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能

Publish: 2025-12-12 18:52:09 UTC 发布:2025-12-12 18:52:09 UTC
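The DeltaGuard-style detection signal, cosine similarity between hidden states and a concept direction, can be sketched with synthetic vectors. The dimensions, threshold, and data below are made up for illustration; real residual-stream states come from a transformer, and the concept direction would be estimated from labeled activations.

```python
import numpy as np

# Illustrative sketch of the detection idea: track the cosine similarity
# between per-token hidden states and a known "harmful intent" concept
# direction; flag the prompt if similarity crosses a threshold anywhere.

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def flags_prompt(residual_states, concept_dir, threshold=0.5):
    return any(cosine(h, concept_dir) > threshold for h in residual_states)

rng = np.random.default_rng(0)
concept = rng.normal(size=256)                      # assumed concept direction
benign = [rng.normal(size=256) for _ in range(8)]   # uncorrelated states
attack = benign[:-1] + [0.9 * concept + 0.1 * rng.normal(size=256)]  # drifts toward concept
print(flags_prompt(benign, concept), flags_prompt(attack, concept))
```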

22 Agile Flight Emerges from Multi-Agent Competitive Racing 22 敏捷飞行从多智能体竞争性赛跑中涌现

Authors: [Vineet Pasumarti](https://arxiv.org/search/?searchtype=author&query=Vineet Pasumarti), [Lorenzo Bianchi](https://arxiv.org/search/?searchtype=author&query=Lorenzo Bianchi), [Antonio Loquercio](https://arxiv.org/search/?searchtype=author&query=Antonio Loquercio) 作者:Vineet Pasumarti、Lorenzo Bianchi、Antonio Loquercio

Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent 通过多智能体竞争和稀疏的高层目标——赢得比赛,我们发现,经由强化学习训练的智能体会自发出现灵活飞行(例如推动平台到其物理极限的高速运动)和策略性行为(例如超车或阻挡)。我们在仿真和现实世界中均提供了证据,表明这种方法优于常见范式:即让智能体在孤立情况下以规定行为的奖励(例如沿赛道进展)进行训练,尤其是在环境复杂度增加(例如存在障碍物)时更为明显。此外,我们发现,多智能体竞争产生的策略比单智能体以进展为基础的奖励训练的策略更可靠地迁移到现实世界,尽管两种方法使用相同的仿真环境、随机化策略和硬件。除改进的仿真到现实迁移外,多智能体策略在一定程度上也能对训练时未见过的对手表现出一定的泛化能力。 总体而言,我们的工作延续了数字领域多智能体竞争博弈的传统,表明稀疏的任务级奖励足以训练出能够在物理世界中实现高级低级控制的智能体。代码: https://github.com/Jirl-upenn/AgileFlight_MultiAgent

Subjects: Robotics, Artificial Intelligence, Multiagent Systems 主题:机器人学、人工智能、多智能体系统

Publish: 2025-12-12 18:48:50 UTC 发表:2025-12-12 18:48:50 UTC

23 Conditional Coverage Diagnostics for Conformal Prediction 23 用于保形预测的条件覆盖诊断

Authors: [Sacha Braun](https://arxiv.org/search/?searchtype=author&query=Sacha Braun), [David Holzmüller](https://arxiv.org/search/?searchtype=author&query=David Holzmüller), [Michael I. Jordan](https://arxiv.org/search/?searchtype=author&query=Michael I. Jordan), [Francis Bach](https://arxiv.org/search/?searchtype=author&query=Francis Bach) 作者:Sacha Braun、David Holzmüller、Michael I. Jordan、Francis Bach

Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although conformal methods can give guarantees on marginal coverage, no method can guarantee to produce sets with correct conditional coverage, leaving practitioners without a clear way to interpret local deviations. To overcome sample-inefficiency and overfitting issues of existing metrics, we cast conditional coverage estimation as a classification problem. Conditional coverage is violated if and only if any classifier can achieve lower risk than the target coverage. Through the choice of a (proper) loss function, the resulting risk difference gives a conservative estimate of natural miscoverage measures such as L1 and L2 distance, and can even separate the effects of over- and under-coverage, and non-constant target coverages. We call the resulting family of metrics excess risk of the target coverage (ERT). We show experimentally that the use of modern classifiers provides much higher statistical power than simple classifiers underlying established metrics like CovGap. Additionally, we use our metric to benchmark different conformal prediction methods. Finally, we release an open-source package for ERT as well as previous conditional coverage metrics. Together, these contributions provide a new lens for understanding, diagnosing, and improving the conditional reliability of predictive systems. 评估条件覆盖率仍然是评估预测系统可靠性时最持久的挑战之一。尽管保形方法可以对边际覆盖率提供保证,但没有任何方法能保证生成具有正确条件覆盖率的集合,使得实践者无法清楚地解释局部偏差。为克服现有度量的样本低效和过拟合问题,我们将条件覆盖率估计视为一个分类问题。当且仅当存在任一分类器能够获得比目标覆盖率更低的风险时,条件覆盖率才被违反。通过选择(适当的)损失函数,得到的风险差为诸如 L1 和 L2 距离等自然失覆盖度量提供了保守估计,甚至可以区分过度覆盖与覆盖不足的影响,以及非恒定目标覆盖率的情况。我们将由此得到的一族度量称为目标覆盖率的超额风险(ERT)。我们通过实验表明,使用现代分类器相比于像 CovGap 这样的既有度量所用的简单分类器,具有更高的统计检验力。 此外,我们使用我们的度量来对不同的保形预测方法进行基准测试。最后,我们发布了一个用于 ERT 以及以往条件覆盖度量的开源软件包。 这些贡献共同为理解、诊断和改进预测系统的条件可靠性提供了新的视角。
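下面给出一个仅用标准库的自包含小示意,演示上文"将条件覆盖估计转化为分类问题"的核心思想:当且仅当某个分类器在预测覆盖指示变量时的风险低于常数目标覆盖率,条件覆盖才被违反。其中的玩具数据、Brier 损失选择与"oracle 分类器"均为示意性假设,并非论文原实现。

```python
import random

random.seed(0)
target = 0.9  # 名义(边际)覆盖率

# 构造一个"边际覆盖达标、条件覆盖不达标"的玩具场景:
# 覆盖指示 C 的概率依赖于特征 x(x<0.5 时仅 0.8,否则为 1.0),边际上仍约为 0.9
data = []
for _ in range(20000):
    x = random.random()
    p_cover = 0.8 if x < 0.5 else 1.0
    data.append((x, 1 if random.random() < p_cover else 0))

def brier_risk(predict):
    # (proper)Brier 损失下的经验风险
    return sum((predict(x) - c) ** 2 for x, c in data) / len(data)

risk_constant = brier_risk(lambda x: target)                     # 基线:常数目标覆盖
risk_classifier = brier_risk(lambda x: 0.8 if x < 0.5 else 1.0)  # 某个(oracle)分类器

ert = risk_constant - risk_classifier  # "目标覆盖率的超额风险"(ERT)的一个估计
print(f"ERT ≈ {ert:.4f}")              # 显著大于 0 即提示条件覆盖被违反
```

论文方法中,分类器换成在校准数据上拟合的现代模型,从而获得比 CovGap 这类基于简单划分的指标更高的统计检验力。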

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题:机器学习,人工智能,机器学习

Publish: 2025-12-12 18:47:39 UTC 发布:2025-12-12 18:47:39 UTC

24 Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints 24 模糊指纹:对 AI 图像指纹鲁棒性的系统性评估

Authors: [Kai Yao](https://arxiv.org/search/?searchtype=author&query=Kai Yao), [Marc Juarez](https://arxiv.org/search/?searchtype=author&query=Marc Juarez) 作者:Kai Yao、Marc Juarez

Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal. 模型指纹检测技术已经成为将 AI 生成图像归因到其源模型的一种有前景的方法,但其在对抗性条件下的鲁棒性仍大多未被探索。我们提出了对这些技术的首个系统性安全评估,形式化了涵盖白盒和黑盒访问的威胁模型以及两个攻击目标:指纹移除(通过抹去识别痕迹以规避归因)和指纹伪造(试图使归因错误地指向目标模型)。我们实现了五种攻击策略,并在 RGB、频域和学习特征域对 12 个最先进的图像生成器评估了 14 种具有代表性的指纹方法。实验揭示了清洁条件与对抗条件性能之间明显差距。移除攻击非常有效,在白盒设置中成功率常常超过 80%,在受限黑盒访问下也超过 50%。尽管伪造比移除更具挑战性,但其成功率在不同目标模型间差异显著。我们还发现了效用与鲁棒性的权衡:归因准确率最高的方法往往容易受到攻击。 尽管某些技术在特定情境中表现出稳健性,但没有哪种技术在所有评估的威胁模型中同时实现高稳健性和高准确性。 这些发现凸显了需要在稳健性与准确性之间取得平衡的技术,并指出了最有前景的途径以推动这一目标的实现。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别、人工智能

Publish: 2025-12-12 18:33:14 UTC 发布:2025-12-12 18:33:14 UTC

25 Generative Parametric Design (GPD): A framework for real-time geometry generation and on-the-fly multiparametric approximation 25 生成参数化设计(GPD):一个用于实时几何生成和实时多参数近似的框架

Authors: [Mohammed El Fallaki Idrissi](https://arxiv.org/search/?searchtype=author&query=Mohammed El Fallaki Idrissi), [Jad Mounayer](https://arxiv.org/search/?searchtype=author&query=Jad Mounayer), [Sebastian Rodriguez](https://arxiv.org/search/?searchtype=author&query=Sebastian Rodriguez), [Fodil Meraghni](https://arxiv.org/search/?searchtype=author&query=Fodil Meraghni), [Francisco Chinesta](https://arxiv.org/search/?searchtype=author&query=Francisco Chinesta) 作者:Mohammed El Fallaki Idrissi、Jad Mounayer、Sebastian Rodriguez、Fodil Meraghni、Francisco Chinesta

This paper presents a novel paradigm in simulation-based engineering sciences by introducing a new framework called Generative Parametric Design (GPD). The GPD framework enables the generation of new designs along with their corresponding parametric solutions given as a reduced basis. To achieve this, two Rank Reduction Autoencoders (RRAEs) are employed, one for encoding and generating the design or geometry, and the other for encoding the sparse Proper Generalized Decomposition (sPGD) mode solutions. These models are linked in the latent space using regression techniques, allowing efficient transitions between design and their associated sPGD modes. By empowering design exploration and optimization, this framework also advances digital and hybrid twin development, enhancing predictive modeling and real-time decision-making in engineering applications. The developed framework is demonstrated on two-phase microstructures, in which the multiparametric solutions account for variations in two key material parameters. 本文通过引入一种称为生成参数化设计(GPD)的新框架,在基于仿真的工程科学中提出了一种新范式。GPD 框架能够生成新的设计以及其对应的以约简基表示的参数化解。为此,采用了两个秩约简自编码器(RRAE),一个用于编码和生成设计或几何形状,另一个用于编码稀疏广义分解(sPGD)模态解。通过回归技术在潜在空间中将这些模型联系起来,使得在设计与其相关的 sPGD 模态之间能够高效转换。该框架在增强设计探索与优化的同时,也推动了数字孪生和混合孪生的发展,提升了工程应用中的预测建模和实时决策能力。所开发的框架在两相微观结构上进行了演示,其中多参数解考虑了两个关键材料参数的变化。

Subjects: Computational Engineering, Finance, and Science, Artificial Intelligence 学科:计算工程、金融与科学,人工智能

Publish: 2025-12-12 17:44:38 UTC 发布:2025-12-12 17:44:38 UTC

26 CogniSNN: Enabling Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability with Random Graph Architectures in Spiking Neural Networks 26 CogniSNN:在尖峰神经网络中通过随机图架构实现神经元可扩展性、通路可重用性和动态可配置性

Authors: [Yongsheng Huang](https://arxiv.org/search/?searchtype=author&query=Yongsheng Huang), [Peibo Duan](https://arxiv.org/search/?searchtype=author&query=Peibo Duan), [Yujie Wu](https://arxiv.org/search/?searchtype=author&query=Yujie Wu), [Kai Sun](https://arxiv.org/search/?searchtype=author&query=Kai Sun), [Zhipeng Liu](https://arxiv.org/search/?searchtype=author&query=Zhipeng Liu), [Changsheng Zhang](https://arxiv.org/search/?searchtype=author&query=Changsheng Zhang), [Bin Zhang](https://arxiv.org/search/?searchtype=author&query=Bin Zhang), [Mingkun Xu](https://arxiv.org/search/?searchtype=author&query=Mingkun Xu) 作者:黄永胜、段培博、吴宇杰、孙凯、刘志鹏、张昌盛、张斌、徐明坤

Spiking neural networks (SNNs), regarded as the third generation of artificial neural networks, are expected to bridge the gap between artificial intelligence and computational neuroscience. However, most mainstream SNN research directly adopts the rigid, chain-like hierarchical architecture of traditional artificial neural networks (ANNs), ignoring key structural characteristics of the brain. Biological neurons are stochastically interconnected, forming complex neural pathways that exhibit Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability. In this paper, we introduce a new SNN paradigm, named Cognition-aware SNN (CogniSNN), by incorporating Random Graph Architecture (RGA). Furthermore, we address the issues of network degradation and dimensional mismatch in deep pathways by introducing an improved pure spiking residual mechanism alongside an adaptive pooling strategy. Then, we design a Key Pathway-based Learning without Forgetting (KP-LwF) approach, which selectively reuses critical neural pathways while retaining historical knowledge, enabling efficient multi-task transfer. Finally, we propose a Dynamic Growth Learning (DGL) algorithm that allows neurons and synapses to grow dynamically along the internal temporal dimension. Extensive experiments demonstrate that CogniSNN achieves performance comparable to, or even surpassing, current state-of-the-art SNNs on neuromorphic datasets and Tiny-ImageNet. The Pathway-Reusability enhances the network’s continuous learning capability across different scenarios, while the dynamic growth algorithm improves robustness against interference and mitigates the fixed-timestep constraints during neuromorphic chip deployment. This work demonstrates the potential of SNNs with random graph structures in advancing brain-inspired intelligence and lays the foundation for their practical application on neuromorphic hardware. 
脉冲神经网络(SNNs)被视为第三代人工神经网络,有望弥合人工智能与计算神经科学之间的鸿沟。然而,大多数主流的 SNN 研究直接采纳了传统人工神经网络(ANNs)那种僵化的链式层级结构,忽视了大脑的关键结构特征。生物神经元以随机方式互连,形成复杂的神经通路,表现出神经元可扩展性、通路可重用性以及动态可配置性。在本文中,我们通过引入随机图架构(RGA)提出了一种新的 SNN 范式,称为认知感知 SNN(CogniSNN)。此外,我们通过引入改进的纯脉冲残差机制和自适应池化策略来解决深层通路中的网络退化和维度不匹配问题。随后,我们设计了一种基于关键通路的无遗忘学习(KP-LwF)方法,该方法有选择地重用关键神经通路,同时保留历史知识,从而实现高效的多任务迁移。最后,我们提出了一种动态增长学习(DGL)算法,允许神经元和突触沿内部时间维度动态生长。大量实验表明,CogniSNN 在类神经形态数据集和 Tiny-ImageNet 上的性能可与当前最先进的 SNN 相媲美,甚至超越之。通路可重用性提升了网络在不同场景下的连续学习能力,而动态增长算法则增强了对干扰的鲁棒性,并缓解了在类神经形态芯片部署时固定时间步长的限制。本工作展示了具有随机图结构的 SNN 在推动类脑智能方面的潜力,并为其在类神经形态硬件上的实际应用奠定了基础。

Subjects: Neural and Evolutionary Computing, Artificial Intelligence 主题:神经与进化计算,人工智能

Publish: 2025-12-12 17:36:31 UTC 发布:2025-12-12 17:36:31 UTC

27 From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines 27 从信号到话轮:模块化语音到语音流水线中的交互摩擦

Authors: [Titaya Mairittha](https://arxiv.org/search/?searchtype=author&query=Titaya Mairittha), [Tanakon Sawanglok](https://arxiv.org/search/?searchtype=author&query=Tanakon Sawanglok), [Panuwit Raden](https://arxiv.org/search/?searchtype=author&query=Panuwit Raden), [Jirapast Buntub](https://arxiv.org/search/?searchtype=author&query=Jirapast Buntub), [Thanapat Warunee](https://arxiv.org/search/?searchtype=author&query=Thanapat Warunee), [Napat Asawachaisuvikrom](https://arxiv.org/search/?searchtype=author&query=Napat Asawachaisuvikrom), [Thanaphum Saiwongin](https://arxiv.org/search/?searchtype=author&query=Thanaphum Saiwongin) 作者:Titaya Mairittha、Tanakon Sawanglok、Panuwit Raden、Jirapast Buntub、Thanapat Warunee、Napat Asawachaisuvikrom、Thanaphum Saiwongin

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them. 尽管基于语音的人工智能系统在生成能力方面取得了显著进展,但其交互常常在会话上显得支离破碎。本文考察了模块化语音到语音检索增强生成(S2S-RAG)流水线中出现的交互摩擦。通过分析一个具有代表性的生产系统,我们超越了简单的延迟指标,识别出三种反复出现的会话崩溃模式:(1)时间错位:系统延迟违背了用户对会话节奏的期望;(2)表达扁平化:副语言线索的丧失导致字面化、不恰当的回应;以及(3)修正僵化:架构上的门控阻止用户实时纠正错误。通过系统级分析,我们证明这些摩擦点不应被理解为缺陷或失败,而是优先控制而非流畅性的模块化设计的结构性后果。我们得出结论:构建自然的语音人工智能是一个基础设施设计挑战,需要从优化孤立组件转向精心编排它们之间的接缝。

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Software Engineering 主题:人机交互、人工智能、计算与语言、软件工程

Publish: 2025-12-12 17:05:11 UTC 发布:2025-12-12 17:05:11 UTC

28 From Verification Burden to Trusted Collaboration: Design Goals for LLM-Assisted Literature Reviews 28 从验证负担到可信协作:LLM 辅助文献综述的设计目标

Authors: [Brenda Nogueira](https://arxiv.org/search/?searchtype=author&query=Brenda Nogueira), [Werner Geyer](https://arxiv.org/search/?searchtype=author&query=Werner Geyer), [Andrew Anderson](https://arxiv.org/search/?searchtype=author&query=Andrew Anderson), [Toby Jia-Jun Li](https://arxiv.org/search/?searchtype=author&query=Toby Jia-Jun Li), [Dongwhi Kim](https://arxiv.org/search/?searchtype=author&query=Dongwhi Kim), [Nuno Moniz](https://arxiv.org/search/?searchtype=author&query=Nuno Moniz), [Nitesh V. Chawla](https://arxiv.org/search/?searchtype=author&query=Nitesh V. Chawla) 作者:Brenda Nogueira、Werner Geyer、Andrew Anderson、Toby Jia-Jun Li、Dongwhi Kim、Nuno Moniz、Nitesh V. Chawla

Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and *pain points* in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) requiring multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved related papers visualization, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems. 大型语言模型(LLMs)正日益被嵌入学术写作实践中。尽管大量研究探讨了研究者如何使用这些工具进行科学写作,但它们在文献综述过程中具体的实施方式、局限性和设计挑战仍未得到充分研究。在本文中,我们报告了一项面向多个学科研究者的用户研究,以描述在使用 LLMs 调查相关工作的过程中当前的实践、益处和痛点。我们识别出三个反复出现的差距:(i)对输出缺乏信任,(ii)持续的验证负担,以及(iii)需要使用多种工具。基于此,我们提出了六个设计目标和一个高层次框架,通过改进相关论文的可视化、在每一步进行验证,以及通过生成引导的解释实现人类反馈对齐来将这些目标付诸实践。总体而言,通过将我们的工作基于研究者的日常实际需求,我们设计了一个解决这些局限并模拟现实世界中 LLM 辅助写作的框架,通过可验证的操作提升信任,并促进研究者与 AI 系统之间的实用协作。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能

Publish: 2025-12-12 15:38:34 UTC 发布:2025-12-12 15:38:34 UTC

29 Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling 29 通过神经主题建模从大规模报刊档案中自动提取历史洞见

Authors: [Keerthana Murugaraj](https://arxiv.org/search/?searchtype=author&query=Keerthana Murugaraj), [Salima Lamsiyah](https://arxiv.org/search/?searchtype=author&query=Salima Lamsiyah), [Marten During](https://arxiv.org/search/?searchtype=author&query=Marten During), [Martin Theobald](https://arxiv.org/search/?searchtype=author&query=Martin Theobald) 作者:Keerthana Murugaraj,Salima Lamsiyah,Marten During,Martin Theobald

Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work. 从大量无结构的历史报纸档案中提取连贯且人类可理解的主题具有显著挑战,原因包括主题演变、光学字符识别(OCR)噪声以及庞大的文本体量。传统的主题建模方法,例如潜在狄利克雷分配(LDA),常常无法捕捉历史文本中话语的复杂性和动态特性。为了解决这些局限性,我们采用了 BERTopic。这种神经主题建模方法利用基于 Transformer 的嵌入来提取和分类主题,尽管其越来越受欢迎,但在历史研究中仍然使用不足。我们的研究聚焦于 1955 年至 2018 年间发表的文章,具体审视关于核能和核安全的话语。我们分析了语料库中不同的主题分布并追踪其时间演变,以揭示长期趋势和公共话语的转变。这使我们能够更准确地探索公共话语中的模式,包括与核能和核武器相关主题的共现及其随时间在主题重要性上的变化。我们的研究展示了 BERTopic 作为传统方法替代方案的可扩展性和上下文敏感性,能够为从报纸档案中提取的历史话语提供更丰富的洞见。这些发现有助于历史学、核能与社会科学研究,同时反思当前的局限并提出未来工作的潜在方向。
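BERTopic 的核心步骤是"嵌入 → 聚类 → 按簇计算 c-TF-IDF 提取主题词"。下面仅用标准库示意最后一步 c-TF-IDF 的计算;其中文档与簇划分均为硬编码的示意性假设,真实流程中簇由 Transformer 嵌入上的聚类(BERTopic 默认为 UMAP 降维 + HDBSCAN)得到。

```python
import math
from collections import Counter

# 两个硬编码的"簇"(示意性假设),模拟聚类后的文档分组
clusters = {
    "energy": ["nuclear power plant reactor energy",
               "reactor safety nuclear energy grid"],
    "weapons": ["nuclear weapons test treaty",
                "weapons treaty disarmament nuclear"],
}

# 逐簇词频 tf 与全语料词频 f
tf = {c: Counter(w for doc in docs for w in doc.split())
      for c, docs in clusters.items()}
f = Counter()
for counts in tf.values():
    f.update(counts)

avg_words = sum(f.values()) / len(clusters)  # 每簇平均词数

def ctfidf(cluster):
    # c-TF-IDF:簇内词频 * log(1 + 平均词数 / 全语料词频)
    # 跨簇共享的词(如 "nuclear")被降权,簇特有的词被凸显
    return {t: n * math.log(1 + avg_words / f[t]) for t, n in tf[cluster].items()}

for c in clusters:
    top = sorted(ctfidf(c).items(), key=lambda kv: -kv[1])[:3]
    print(c, [t for t, _ in top])
```

时间演化分析则相当于按(簇, 时间段)重复这一计算,追踪同一主题词权重随年代的变化。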

Subjects: Computation and Language, Artificial Intelligence, Information Retrieval 主题:计算与语言、人工智能、信息检索

Publish: 2025-12-12 15:15:02 UTC 发表:2025-12-12 15:15:02 UTC

30 Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols 30 限制幻觉:通过梅林-亚瑟协议为检索增强生成(RAG)系统提供信息论保证

Authors: [Björn Deiseroth](https://arxiv.org/search/?searchtype=author&query=Björn Deiseroth), [Max Henning Höth](https://arxiv.org/search/?searchtype=author&query=Max Henning Höth), [Kristian Kersting](https://arxiv.org/search/?searchtype=author&query=Kristian Kersting), [Letitia Parcalabescu](https://arxiv.org/search/?searchtype=author&query=Letitia Parcalabescu) 作者:Björn Deiseroth、Max Henning Höth、Kristian Kersting、Letitia Parcalabescu

Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline – both the retriever and the generator – as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors.
检索增强生成(RAG)模型依赖检索到的证据来指导大型语言模型(LLM)生成器,但现有系统将检索视为一种弱启发式而非可验证的证据。因此,LLM 在没有支持的情况下给出答案,在上下文不完整或具有误导性时产生幻觉,并依赖伪证据。我们引入了一个训练框架,通过对梅林-亚瑟(Merlin-Arthur,M/A)协议的改编,将整个 RAG 流水线——包括检索器和生成器——视为一个交互式证明系统。亚瑟(生成器 LLM)在来源不明的问题上进行训练:梅林提供有用证据,而摩甘娜(Morgana)注入对抗性、误导性的上下文。两者都使用线性时间的可解释性(XAI)方法来识别并修改对亚瑟影响最大的证据。因此,亚瑟学会了(i)在上下文支持答案时给出答案,(ii)在证据不足时拒绝回答,以及(iii)依赖真正支撑答案的特定上下文片段。我们进一步引入了一个严格的评估框架,以将解释性保真度与基线预测错误区分开来。 这使我们能够引入并衡量“可解释信息分数”(EIF),它将经 M/A 认证的互信息保证相对于模型容量和不完美基准进行归一化。在三个 RAG 数据集和两个不同规模的模型族上,经过 M/A 训练的 LLMs 在有根性、完整性、可靠性和拒绝行为方面均有所提升,同时幻觉现象减少——且不需要人工标注的不可回答问题。通过自动生成的 M/A 难度正例和负例,检索器的召回率和 MRR 也得到了提升。我们的结果表明,自主的交互式证明风格监督为构建可靠的 RAG 系统提供了一条有原则且实用的途径,使检索到的文档不再被视为建议,而是可验证的证据。
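下面是对上文 Merlin/Arthur/Morgana 角色分工的玩具化示意:用"候选答案必须在上下文中有字面支撑"这一桩(stub)规则代替真实的生成器 Arthur;问答数据与判定规则均为示意性假设,仅用来说明 completeness(Merlin 提供证据时应作答)与 soundness(Morgana 提供误导上下文时应拒答)两类目标,并非论文的训练实现。

```python
# 玩具问答库(示意性假设)
QA = {"Who wrote Faust?": "Goethe", "Capital of France?": "Paris"}

def merlin(q):
    # 有益证明者:提供真实证据
    return f"Reference text states: {QA[q]}."

def morgana(q):
    # 对抗证明者:注入误导性上下文
    return "Reference text states: unrelated trivia."

def arthur(q, context, candidate):
    # 桩生成器:仅当候选答案在上下文中有字面支撑时才作答,否则拒答
    # 对应上文目标 (i)-(iii)
    return candidate if candidate in context else "REJECT"

completeness = all(arthur(q, merlin(q), a) == a for q, a in QA.items())
soundness = all(arthur(q, morgana(q), a) == "REJECT" for q, a in QA.items())
print(f"completeness={completeness}, soundness={soundness}")
```

论文中 Arthur 是经 M/A 协议训练的 LLM,"字面支撑"则由线性时间 XAI 方法定位的关键上下文片段代替。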

Subjects: Computation and Language, Artificial Intelligence, Machine Learning 主题:计算与语言,人工智能,机器学习

Publish: 2025-12-12 14:50:38 UTC 发布:2025-12-12 14:50:38 UTC

31 Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents 31 原子动作切片:面向通用视觉-语言-动作(VLA)智能体的规划器对齐选项

Authors: [Stefan Tabakov](https://arxiv.org/search/?searchtype=author&query=Stefan Tabakov), [Asen Popov](https://arxiv.org/search/?searchtype=author&query=Asen Popov), [Dimitar Dimitrov](https://arxiv.org/search/?searchtype=author&query=Dimitar Dimitrov), [S. Ensiye Kiyamousavi](https://arxiv.org/search/?searchtype=author&query=S. Ensiye Kiyamousavi), [Vladimir Hristov](https://arxiv.org/search/?searchtype=author&query=Vladimir Hristov), [Boris Kraychev](https://arxiv.org/search/?searchtype=author&query=Boris Kraychev) 作者:Stefan Tabakov、Asen Popov、Dimitar Dimitrov、S. Ensiye Kiyamousavi、Vladimir Hristov、Boris Kraychev

Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace(https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets) 当前的视觉-语言-动作(VLA)模型泛化能力较差,尤其在任务需要新的技能或物体组合时表现不佳。我们提出了原子动作切分(Atomic Action Slicing,AAS),这是一种与规划器对齐的方法,将长时域示范分解为更短、更具类型化的原子动作,这些动作更便于规划器使用和策略学习。使用 LIBERO 示范,AAS 生成了一个经过验证的数据集,包含 2,124 个原子片段,标注了动作类型、时间跨度和置信度。更强的分段器(Gemini 2.5 Pro)与规划器定义的计划高度匹配,并在关键帧抖动下依然保持鲁棒性,而较小的模型在多物体任务上表现较差。在我们的原子数据集上微调 CLIP-RT+,使得在 LIBERO-Goal 上的任务成功率从 94.2% 提升到 95.3%,在 LIBERO-Long 上从 83.8% 提升到 88.8%。我们在 HuggingFace 上公开发布了 GATE-VLAP 数据集(https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets)
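上文提到 AAS 产出的每个原子片段带有动作类型、时间跨度与置信度标注。下面用一个极简的数据结构草图示意这类记录以及按置信度筛选"已验证"片段的做法;字段名、动作词表与阈值均为示意性假设,并非 GATE-VLAP 数据集的真实格式。

```python
from dataclasses import dataclass

@dataclass
class AtomicSegment:
    action: str          # 类型化原子动作,例如 "reach"、"grasp"
    start: int           # 在示范中的起始帧
    end: int             # 结束帧(不含)
    confidence: float    # 分段器置信度

# 一条长时域示范被切成的原子片段(示意数据)
demo = [
    AtomicSegment("reach", 0, 40, 0.97),
    AtomicSegment("grasp", 40, 55, 0.91),
    AtomicSegment("move", 55, 120, 0.62),
    AtomicSegment("place", 120, 150, 0.95),
]

# 数据管道按置信度过滤,只保留高置信度切片用于微调
validated = [s for s in demo if s.confidence >= 0.9]
print([s.action for s in validated])
```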

Subjects: Machine Learning, Artificial Intelligence, Robotics 主题:机器学习,人工智能,机器人学

Publish: 2025-12-12 14:14:27 UTC 发布:2025-12-12 14:14:27 UTC

32 Multi-temporal Calving Front Segmentation 32 多时相崩解前缘分割

Authors: [Marcel Dreier](https://arxiv.org/search/?searchtype=author&query=Marcel Dreier), [Nora Gourmelon](https://arxiv.org/search/?searchtype=author&query=Nora Gourmelon), [Dakota Pyles](https://arxiv.org/search/?searchtype=author&query=Dakota Pyles), [Fei Wu](https://arxiv.org/search/?searchtype=author&query=Fei Wu), [Matthias Braun](https://arxiv.org/search/?searchtype=author&query=Matthias Braun), [Thorsten Seehaus](https://arxiv.org/search/?searchtype=author&query=Thorsten Seehaus), [Andreas Maier](https://arxiv.org/search/?searchtype=author&query=Andreas Maier), [Vincent Christlein](https://arxiv.org/search/?searchtype=author&query=Vincent Christlein) 作者:Marcel Dreier, Nora Gourmelon, Dakota Pyles, Fei Wu, Matthias Braun, Thorsten Seehaus, Andreas Maier, Vincent Christlein

The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier’s mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6. 海洋终端冰川的崩解前缘不断变化。这些变化显著影响冰川的质量与动力学,因而需要持续监测。为满足这一需求,研究者开发了能够在合成孔径雷达影像中自动勾画崩解前缘的深度学习模型。然而,这些模型常常难以正确分类受季节性条件影响的区域,例如冰混杂带或被雪覆盖的表面。为了解决这一问题,我们提出并行处理来自同一冰川的卫星影像时间序列的多帧,并在对应的特征图之间交换时序信息,以稳定每次预测。我们将该方法集成到当前最先进的架构 Tyrion 中,并在 CaFFe 基准数据集上实现了新的最先进性能。具体而言,我们达到了平均距离误差 184.4 米、平均交并比 83.6。
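下面用一维玩具"特征图"示意上文跨帧交换时序信息、稳定单帧预测的思路:每帧特征与整段时间序列的均值加权融合后再解码崩解前缘位置。数据、融合权重与阈值解码器均为示意性假设,与论文的 Tyrion 架构无关。

```python
# 同一冰川三帧影像的一维玩具特征(值高 = 冰,值低 = 水;示意数据)
frames = [
    [0.9, 0.8, 0.1, 0.0],   # 清晰帧
    [0.9, 0.9, 0.2, 0.1],   # 清晰帧
    [0.2, 0.9, 0.8, 0.0],   # 受冰混杂带干扰的帧
]

def exchange(frames, alpha=0.3):
    # "时序信息交换"的最简形式:每帧与跨帧均值加权融合,稳定受干扰帧
    mean = [sum(col) / len(frames) for col in zip(*frames)]
    return [[alpha * v + (1 - alpha) * m for v, m in zip(f, mean)]
            for f in frames]

def front_position(feature):
    # 玩具解码器:第一个低于 0.5 的位置视作崩解前缘
    return next(i for i, v in enumerate(feature) if v < 0.5)

single = [front_position(f) for f in frames]            # 单帧独立预测
multi = [front_position(f) for f in exchange(frames)]   # 经时序交换后的预测
print(single, multi)   # 受干扰帧由 [2, 2, 0] 修正为 [2, 2, 2]
```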

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别、人工智能

Publish: 2025-12-12 13:45:05 UTC 发布时间:2025-12-12 13:45:05 协调世界时 (UTC)

33 DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry 33 DentalGPT:激励牙科领域的多模态复杂推理

Authors: [Zhenyang Cai](https://arxiv.org/search/?searchtype=author&query=Zhenyang Cai), [Jiaming Zhang](https://arxiv.org/search/?searchtype=author&query=Jiaming Zhang), [Junjie Zhao](https://arxiv.org/search/?searchtype=author&query=Junjie Zhao), [Ziyi Zeng](https://arxiv.org/search/?searchtype=author&query=Ziyi Zeng), [Yanchao Li](https://arxiv.org/search/?searchtype=author&query=Yanchao Li), [Jingyi Liang](https://arxiv.org/search/?searchtype=author&query=Jingyi Liang), [Junying Chen](https://arxiv.org/search/?searchtype=author&query=Junying Chen), [Yunjin Yang](https://arxiv.org/search/?searchtype=author&query=Yunjin Yang), [Jiajun You](https://arxiv.org/search/?searchtype=author&query=Jiajun You), [Shuzhi Deng](https://arxiv.org/search/?searchtype=author&query=Shuzhi Deng), [Tongfei Wang](https://arxiv.org/search/?searchtype=author&query=Tongfei Wang), [Wanting Chen](https://arxiv.org/search/?searchtype=author&query=Wanting Chen), [Chunxiu Hao](https://arxiv.org/search/?searchtype=author&query=Chunxiu Hao), [Ruiqi Xie](https://arxiv.org/search/?searchtype=author&query=Ruiqi Xie), [Zhenwei Wen](https://arxiv.org/search/?searchtype=author&query=Zhenwei Wen), [Xiangyi Feng](https://arxiv.org/search/?searchtype=author&query=Xiangyi Feng), [Zou Ting](https://arxiv.org/search/?searchtype=author&query=Zou Ting), [Jin Zou Lin](https://arxiv.org/search/?searchtype=author&query=Jin Zou Lin), [Jianquan Li](https://arxiv.org/search/?searchtype=author&query=Jianquan Li), [Guangjun Yu](https://arxiv.org/search/?searchtype=author&query=Guangjun Yu), [Liangyi Chen](https://arxiv.org/search/?searchtype=author&query=Liangyi Chen), [Junwen Wang](https://arxiv.org/search/?searchtype=author&query=Junwen Wang), [Shan Jiang](https://arxiv.org/search/?searchtype=author&query=Shan Jiang), [Benyou Wang](https://arxiv.org/search/?searchtype=author&query=Benyou Wang) 作者:蔡振阳、张家明、赵俊杰、曾子易、李岩超、梁静怡、陈君英、杨云锦、尤佳俊、邓淑芝、王同斐、陈婉婷、郝纯秀、谢瑞琦、温振威、冯祥伊、邹婷、邹金楼、李建全、于光军、陈良意、王俊文、蒋珊、王本友

Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM’s visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs. 在牙科领域对多模态数据进行可靠解读对于自动化口腔医疗至关重要,然而现有的多模态大语言模型(MLLMs)在捕捉细粒度的牙科视觉细节方面表现不足,并且缺乏进行精确诊断所需的推理能力。为了解决这些局限性,我们提出了 DentalGPT,一种通过高质量领域知识注入和强化学习开发的专业牙科多模态大模型。具体而言,我们构建了迄今为止最大的牙科带注释多模态数据集,汇聚了超过 12 万张牙科图像及其详细描述,这些描述突出了具有诊断意义的视觉特征,使其成为目前收集牙科图像最为丰富的多模态数据集。在该数据集上的训练显著增强了模型对牙科病况的视觉理解,而随后进行的强化学习阶段进一步加强了其进行多模态复杂推理的能力。 在口内和全景基准以及医学 VQA 基准的牙科子集上的全面评估表明,DentalGPT 在疾病分类和牙科视觉问答任务上取得了优越的表现,尽管仅有 7B 参数,但其性能超过了许多最先进的多模态大模型。这些结果表明,高质量的牙科数据与分阶段适配相结合,为构建具有能力且领域专精的牙科多模态大模型提供了一条有效路径。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 主题:计算机视觉与模式识别、人工智能、计算与语言

Publish: 2025-12-12 13:42:57 UTC 发表于:2025-12-12 13:42:57 UTC

34 Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting 34 优化训练“饮食”:用于鲁棒时间序列预测的数据混合搜索

Authors: [Federico Pennino](https://arxiv.org/search/?searchtype=author&query=Federico Pennino), [Maurizio Gabbrielli](https://arxiv.org/search/?searchtype=author&query=Maurizio Gabbrielli) 作者:Federico Pennino,Maurizio Gabbrielli

The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute equally to model generalization. In this paper, we show that, in some cases, “less is more” when considering datasets. We do this by reframing the data selection problem: rather than tuning model hyperparameters, we fix the model and optimize the composition of the training data itself. We introduce a framework for discovering the optimal “training diet” from a large, unlabeled time series corpus. Our framework first uses a large-scale encoder and k-means clustering to partition the dataset into distinct, behaviorally consistent clusters. These clusters represent the fundamental ‘ingredients’ available for training. We then employ the Optuna optimization framework to search the high-dimensional space of possible data mixtures. For each trial, Optuna proposes a specific sampling ratio for each cluster, and a new training set is constructed based on this recipe. A smaller target model is then trained and evaluated. Our experiments reveal that this data-centric search consistently discovers data mixtures that yield models with significantly higher performance compared to baselines trained on the entire dataset. Specifically - evaluated on PMSM dataset - our method improved performance from a baseline MSE of 1.70 to 1.37, a 19.41% improvement. 用于传感器数据训练深度学习模型的标准范式假定更多数据总是更好。然而,原始传感器流通常存在不平衡并含有大量冗余,这意味着并非所有数据点对模型泛化都有同等贡献。在本文中,我们展示了就数据集而言,某些情况下“少即是多”。我们通过重新构架数据选择问题来实现这一点:与其调优模型超参数,我们固定模型并优化训练数据本身的组成。我们提出了一个从大型未标注时间序列语料中发现最优“训练饮食”的框架。我们的框架首先使用大规模编码器和 k-means 聚类将数据集划分为行为一致的不同簇。这些簇代表了可用于训练的基本“成分”。然后我们使用 Optuna 优化框架在可能的数据混合的高维空间中进行搜索。对于每次试验,Optuna 为每个簇提出一个具体的采样比例,并基于该配方构建一个新的训练集。然后训练并评估一个较小的目标模型。我们的实验表明,这种以数据为中心的搜索始终能够发现数据混合,使模型性能显著高于在整个数据集上训练的基线。具体来说,在 PMSM 数据集上的评估中,我们的方法将基线均方误差(MSE)从 1.70 降低到 1.37,改进了 19.41%。
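下面用纯标准库给出该"数据混合搜索"的自包含草图:以随机搜索代替 Optuna、以单参数线性模型代替目标网络,为每个簇搜索采样权重,并以干净验证集上的 MSE 作为优化目标。簇的构造方式、权重范围与模型均为示意性假设。

```python
import random

random.seed(1)
true_fn = lambda x: 2.0 * x   # 真实关系(示意)

def make_cluster(n, label_noise):
    # 生成一个 (x, y) 簇;label_noise 控制标签噪声强度
    out = []
    for _ in range(n):
        x = random.uniform(1, 2)
        out.append((x, true_fn(x) + random.gauss(0, label_noise)))
    return out

clusters = [make_cluster(200, 0.05),   # 干净簇
            make_cluster(200, 0.05),   # 干净但冗余的簇
            make_cluster(200, 2.0)]    # 标签严重污染的簇

val = make_cluster(200, 0.0)           # 干净验证集

def fit_and_score(weights):
    # 按簇权重做加权最小二乘拟合 y ≈ a*x(相当于"训练"),返回验证 MSE
    num = sum(w * x * y for w, cl in zip(weights, clusters) for x, y in cl)
    den = sum(w * x * x for w, cl in zip(weights, clusters) for x, y in cl)
    a = num / den
    return sum((a * x - y) ** 2 for x, y in val) / len(val)

baseline = fit_and_score([1.0, 1.0, 1.0])   # 基线:全量数据等权使用

# 随机搜索混合比例;首个候选即基线权重,保证搜索结果不差于基线
candidates = [[1.0, 1.0, 1.0]] + [[random.random() for _ in clusters]
                                  for _ in range(200)]
best_w, best_mse = None, float("inf")
for w in candidates:
    mse = fit_and_score(w)
    if mse < best_mse:
        best_w, best_mse = w, mse

print(f"baseline MSE={baseline:.5f}, best MSE={best_mse:.5f}")
```

换用 Optuna 时,候选权重改由 objective 内的 `trial.suggest_float("w_i", 0, 1)` 提议,并用 `optuna.create_study(direction="minimize").optimize(objective, n_trials=200)` 驱动搜索。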

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-12 13:26:07 UTC 发布时间:2025-12-12 13:26:07 协调世界时

35 Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition 35 使用梅尔谱图的图嵌入用于水下声学目标识别

Authors: [Sheng Feng](https://arxiv.org/search/?searchtype=author&query=Sheng Feng), [Shuqing Ma](https://arxiv.org/search/?searchtype=author&query=Shuqing Ma), [Xiaoqian Zhu](https://arxiv.org/search/?searchtype=author&query=Xiaoqian Zhu) 作者:Sheng Feng、Shuqing Ma、Xiaoqian Zhu

Underwater acoustic target recognition (UATR) is extremely challenging due to the complexity of ship-radiated noise and the variability of ocean environments. Although deep learning (DL) approaches have achieved promising results, most existing models implicitly assume that underwater acoustic data lie in a Euclidean space. This assumption, however, is unsuitable for the inherently complex topology of underwater acoustic signals, which exhibit non-stationary, non-Gaussian, and nonlinear characteristics. To overcome this limitation, this paper proposes the UATR-GTransformer, a non-Euclidean DL model that integrates Transformer architectures with graph neural networks (GNNs). The model comprises three key components: a Mel patchify block, a GTransformer block, and a classification head. The Mel patchify block partitions the Mel-spectrogram into overlapping patches, while the GTransformer block employs a Transformer Encoder to capture mutual information between split patches to generate Mel-graph embeddings. Subsequently, a GNN enhances these embeddings by modeling local neighborhood relationships, and a feed-forward network (FFN) further performs feature transformation. Experiments results based on two widely used benchmark datasets demonstrate that the UATR-GTransformer achieves performance competitive with state-of-the-art methods. In addition, interpretability analysis reveals that the proposed model effectively extracts rich frequency-domain information, highlighting its potential for applications in ocean engineering. 
水下声学目标识别(UATR)由于船舶辐射噪声的复杂性和海洋环境的多变性而极具挑战性。尽管深度学习(DL)方法已取得了可喜的成果,但大多数现有模型隐含地假设水下声学数据位于欧几里得空间。然而,这一假设并不适用于水下声学信号固有的复杂拓扑特性,这些信号表现出非平稳、非高斯和非线性的特征。为克服这一限制,本文提出了 UATR-GTransformer,一种将 Transformer 架构与图神经网络(GNN)相结合的非欧几里得深度学习模型。该模型由三个关键组件组成:Mel 分块(Mel patchify)模块、GTransformer 模块和分类头。Mel 分块模块将梅尔谱(Mel-spectrogram)划分为重叠的块,而 GTransformer 模块则采用 Transformer 编码器来捕捉分割块之间的相互信息以生成 Mel-图嵌入。随后,图神经网络通过建模局部邻域关系来增强这些嵌入,前馈网络(FFN)则进一步进行特征变换。 基于两个广泛使用的基准数据集的实验结果表明,UATR-GTransformer 在性能上与最先进的方法具有竞争力。 此外,可解释性分析显示,该模型能有效提取丰富的频域信息,凸显了其在海洋工程应用中的潜力。
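Mel patchify 模块的基本操作是把梅尔谱切成重叠的小块。下面是一个极简的重叠分块示意(纯 Python,参数与数据均为假设,与论文实现细节无关):

```python
def mel_patchify(spec, patch, stride):
    """把二维梅尔谱(按行存储的列表)切成重叠小块(示意代码)。

    stride < patch 时相邻块之间存在重叠。
    """
    n_mels, n_frames = len(spec), len(spec[0])
    patches = []
    for i in range(0, n_mels - patch + 1, stride):
        for j in range(0, n_frames - patch + 1, stride):
            patches.append([row[j:j + patch] for row in spec[i:i + patch]])
    return patches

spec = [[float(i * 8 + j) for j in range(8)] for i in range(8)]  # 8x8 玩具谱图
patches = mel_patchify(spec, patch=4, stride=2)  # 每个维度 3 个起点
print(len(patches))  # 9
```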

Subjects: Sound, Artificial Intelligence 主题:声音,人工智能

Publish: 2025-12-12 13:25:54 UTC 发布:2025-12-12 13:25:54 UTC

36 Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems 36 Parallax:面向异构边缘系统中运算符回退的运行时并行化

Authors: [Chong Tang](https://arxiv.org/search/?searchtype=author&query=Chong Tang), [Hao Dai](https://arxiv.org/search/?searchtype=author&query=Hao Dai), [Jagmohan Chauhan](https://arxiv.org/search/?searchtype=author&query=Jagmohan Chauhan) 作者:唐崇(Chong Tang)、戴浩(Hao Dai)、Jagmohan Chauhan

The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, meanwhile, fine-grained subgraph control enables heterogeneous inference of dynamic models. By evaluating on five representative DNNs across three different mobile devices, Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, offering improvements aligned with the responsiveness demands of real-time mobile inference. 对边缘设备上实时深度神经网络应用日益增长的需求,要求对日益复杂的模型进行更快的推理。尽管许多设备包含专用加速器(例如移动 GPU),动态控制流算子和不受支持的内核经常回退到 CPU 执行。现有框架对这些回退处理不佳,导致 CPU 核心空闲并引起高延迟和内存峰值。我们提出了 Parallax,一个在不重构模型或实现自定义算子的情况下加速移动端 DNN 推理的框架。Parallax 首先对计算有向无环图进行划分以揭示并行性,随后采用具备专用区和缓冲重用的分支感知内存管理以减少运行时内存占用。一个自适应调度器根据设备内存约束执行分支,同时,细粒度子图控制使动态模型的异构推理成为可能。 通过对三种不同移动设备上的五个具有代表性的深度神经网络进行评估,Parallax 在延迟上最多可降低 46%,保持受控的内存开销(平均 26.5%),并与最先进的框架相比最多节省 30% 的能量,提供了符合实时移动推理响应性要求的改进。
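Parallax 的第一步是对计算 DAG 分区以暴露并行性。下面用拓扑分层勾勒“找出可并发执行的算子组”这一思路(仅为示意,并非 Parallax 的实际分区或调度算法):

```python
from collections import defaultdict

def parallel_levels(edges, nodes):
    """把 DAG 节点按拓扑层级分组:同一层内的节点互不依赖,
    可以并发执行(示意代码)。"""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [n for n in nodes if indeg[n] == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for u in level:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        level = nxt
    return levels

# 菱形算子图:B 与 C 是 A 之后两条相互独立的分支。
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(parallel_levels(edges, ["A", "B", "C", "D"]))  # [['A'], ['B', 'C'], ['D']]
```

在此基础上,真正的运行时还需结合分支感知的内存管理与设备内存约束来决定各分支的实际调度顺序。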

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence, Computer Vision and Pattern Recognition 主题:分布式、并行与集群计算,人工智能,计算机视觉与模式识别

Publish: 2025-12-12 13:07:00 UTC 发表:2025-12-12 13:07:00 UTC

37 Contrastive Time Series Forecasting with Anomalies 37 带异常的对比时间序列预测

Authors: [Joel Ekstrand](https://arxiv.org/search/?searchtype=author&query=Joel Ekstrand), [Zahra Taghiyarrenani](https://arxiv.org/search/?searchtype=author&query=Zahra Taghiyarrenani), [Slawomir Nowaczyk](https://arxiv.org/search/?searchtype=author&query=Slawomir Nowaczyk) 作者:Joel Ekstrand、Zahra Taghiyarrenani、Slawomir Nowaczyk

Time series forecasting predicts future values from past data. In real-world settings, some anomalous events have lasting effects and influence the forecast, while others are short-lived and should be ignored. Standard forecasting models fail to make this distinction, often either overreacting to noise or missing persistent shifts. We propose Co-TSFA (Contrastive Time Series Forecasting with Anomalies), a regularization framework that learns when to ignore anomalies and when to respond. Co-TSFA generates input-only and input-output augmentations to model forecast-irrelevant and forecast-relevant anomalies, and introduces a latent-output alignment loss that ties representation changes to forecast changes. This encourages invariance to irrelevant perturbations while preserving sensitivity to meaningful distributional shifts. Experiments on the Traffic and Electricity benchmarks, as well as on a real-world cash-demand dataset, demonstrate that Co-TSFA improves performance under anomalous conditions while maintaining accuracy on normal data. An anonymized GitHub repository with the implementation of Co-TSFA is provided and will be made public upon acceptance. 时间序列预测基于过去数据预测未来值。在真实场景中,某些异常事件具有持久影响并会影响预测,而另一些则是短暂的应被忽略。标准预测模型无法区分这两类异常,常常对噪声反应过度或错过持久性变化。我们提出了 Co-TSFA(带异常的对比时间序列预测),这是一种正则化框架,能够学习何时忽略异常、何时响应。Co-TSFA 生成仅输入和输入-输出增强来模拟与预测无关和与预测相关的异常,并引入潜在-输出对齐损失,将表示的变化与预测的变化联系起来。这鼓励对无关扰动的不变性,同时保留对有意义分布变化的敏感性。在 Traffic 和 Electricity 基准以及一个真实的现金需求数据集上的实验表明,Co-TSFA 在异常条件下提高了性能,同时在正常数据上保持了准确性。提供了包含 Co-TSFA 实现的匿名 GitHub 仓库,并将在论文被接受后公开。
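Co-TSFA 的两类正则项可以用玩具代码示意:仅输入增强对应“潜变量不变性”,输入-输出增强对应“潜变量变化与预测变化对齐”。以下为本文作者之外的简化演绎(损失形式、数值均为假设,论文实际损失定义请以原文为准):

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def co_tsfa_losses(z, z_in_aug, z_io_aug, y, y_io_aug):
    """Co-TSFA 思路的示意版本(非论文官方实现):
    - inv:仅输入增强不应移动潜表示(忽略与预测无关的异常);
    - align:输入-输出增强下,潜表示的变化幅度应跟随预测的变化幅度。"""
    inv = mse(z, z_in_aug)
    latent_shift = mse(z, z_io_aug)
    forecast_shift = mse(y, y_io_aug)
    align = (latent_shift - forecast_shift) ** 2
    return inv, align

z, y = [0.0, 0.0], [0.0, 0.0]
inv, align = co_tsfa_losses(z, [0.0, 0.0], [1.0, 1.0], y, [1.0, 1.0])
print(inv, align)  # 0.0 0.0:编码器对无关扰动完全不变,且潜变化与预测变化一致
```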

Subjects: Machine Learning, Artificial Intelligence, Machine Learning 主题:机器学习,人工智能,机器学习

Publish: 2025-12-12 12:54:24 UTC 发表:2025-12-12 12:54:24 UTC

38 NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics 38 NeuralOGCM:具有可学习物理的可微分海洋建模

Authors: [Hao Wu](https://arxiv.org/search/?searchtype=author&query=Hao Wu), [Yuan Gao](https://arxiv.org/search/?searchtype=author&query=Yuan Gao), [Fan Xu](https://arxiv.org/search/?searchtype=author&query=Fan Xu), [Fan Zhang](https://arxiv.org/search/?searchtype=author&query=Fan Zhang), [Guangliang Liu](https://arxiv.org/search/?searchtype=author&query=Guangliang Liu), [Yuxuan Liang](https://arxiv.org/search/?searchtype=author&query=Yuxuan Liang), [Xiaomeng Huang](https://arxiv.org/search/?searchtype=author&query=Xiaomeng Huang) 作者:吴昊、高原、徐帆、张凡、刘广亮、梁宇轩、黄晓猛

High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms key physical parameters (e.g., diffusion coefficients) into learnable parameters, enabling the model to autonomously optimize its physical core via end-to-end training. Concurrently, a deep neural network learns to correct for subgrid-scale processes and discretization errors not captured by the physics model. Both components work in synergy, with their outputs integrated by a unified ODE solver. Experiments demonstrate that NeuralOGCM maintains long-term stability and physical consistency, significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy. Our work paves a new path for building fast, stable, and physically-plausible models for scientific computing. 高精度科学模拟长期存在计算效率与物理逼真性之间的权衡。为了解决这一挑战,我们提出了 NeuralOGCM,一种将可微分编程与深度学习融合的海洋建模框架。NeuralOGCM 的核心是一个完全可微分的动力学求解器,它以物理知识作为核心归纳偏置。可学习的物理积分捕捉大尺度的确定性物理演化,并将关键物理参数(例如扩散系数)转化为可学习参数,使模型能够通过端到端训练自主优化其物理核心。同时,一个深度神经网络学习校正物理模型未捕捉到的子网格尺度过程和离散化误差。两个组件协同工作,其输出由统一的常微分方程求解器整合。实验表明,NeuralOGCM 保持长期稳定性和物理一致性,在速度上显著优于传统数值模型,在精度上显著优于纯人工智能基线。我们的工作为构建快速、稳定且物理上合理的科学计算模型开辟了一条新路径。
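“把物理参数变成可学习参数”可以用一维扩散方程的玩具例子说明:把扩散系数当作参数,用梯度下降让可微求解器的一步输出拟合观测。以下仅为示意(解析梯度、数值设置均为假设,与 NeuralOGCM 的实际求解器无关):

```python
def laplacian(u):
    """一维离散拉普拉斯算子,边界固定为 0(示意)。"""
    return [0.0] + [u[i - 1] - 2 * u[i] + u[i + 1] for i in range(1, len(u) - 1)] + [0.0]

def step(u, kappa):
    """显式欧拉的一步扩散:u <- u + kappa * Laplacian(u)。"""
    return [ui + kappa * li for ui, li in zip(u, laplacian(u))]

# "观测":用未知的真实扩散系数走一步得到的场。
u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
true_kappa = 0.25
target = step(u0, true_kappa)

# 把 kappa 当作可学习参数,用解析梯度做梯度下降。
kappa, lr = 0.0, 0.05
for _ in range(200):
    pred = step(u0, kappa)
    lap = laplacian(u0)
    grad = 2 * sum((p - t) * l for p, t, l in zip(pred, target, lap))
    kappa -= lr * grad
print(round(kappa, 3))  # -> 0.25,恢复出真实系数
```

真实框架依赖自动微分而非手写梯度,并把这种可学习物理核与神经网络修正项在统一的 ODE 求解器中联合训练。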

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-12 12:53:46 UTC 发布:2025-12-12 12:53:46 UTC

39 Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs 39 更少的幻觉是否意味着更少的创造力?对 LLMs 的实证研究

Authors: [Mohor Banerjee](https://arxiv.org/search/?searchtype=author&query=Mohor Banerjee), [Nadya Yuki Wangsajaya](https://arxiv.org/search/?searchtype=author&query=Nadya Yuki Wangsajaya), [Syed Ali Redha Alsagoff](https://arxiv.org/search/?searchtype=author&query=Syed Ali Redha Alsagoff), [Min Sen Tan](https://arxiv.org/search/?searchtype=author&query=Min Sen Tan), [Zachary Choy Kit Chun](https://arxiv.org/search/?searchtype=author&query=Zachary Choy Kit Chun), [Alvin Chan Guo Wei](https://arxiv.org/search/?searchtype=author&query=Alvin Chan Guo Wei) 作者:Mohor Banerjee、Nadya Yuki Wangsajaya、Syed Ali Redha Alsagoff、Min Sen Tan、Zachary Choy Kit Chun、Alvin Chan Guo Wei

Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial. 大型语言模型(LLMs)在自然语言理解和推理方面表现出显著能力,但存在幻觉问题:生成事实不准确的内容。尽管已经开发了许多减少幻觉的方法,但它们对创造性生成的影响尚未被探索。这一空白对于需要事实准确性和创造性假设生成并重的人工智能辅助科学发现尤其关键。我们研究了三种减少幻觉的技术——验证链(Chain of Verification,CoVe)、层对比解码(Decoding by Contrasting Layers,DoLa)和检索增强生成(Retrieval-Augmented Generation,RAG)——如何影响 LLMs 的创造力。在不同规模(1B–70B 参数)和多个模型系列(LLaMA、Qwen、Mistral)上,以及在两个创造性基准(NeoCoder 和 CS4)上的评估表明,这些方法对发散性创造力具有相反的影响:CoVe 增强了发散性思维,DoLa 抑制了它,而 RAG 的影响则很小。我们的研究结果为在科学应用中选择合适的减少幻觉方法提供了指导,在这些应用中,事实准确性与创造性探索之间的平衡至关重要。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-12 12:14:29 UTC 发布:2025-12-12 12:14:29 协调世界时 (UTC)

40 Towards Privacy-Preserving Code Generation: Differentially Private Code Language Models 40 朝向隐私保护的代码生成:差分隐私代码语言模型

Authors: [Melih Catal](https://arxiv.org/search/?searchtype=author&query=Melih Catal), [Pooja Rani](https://arxiv.org/search/?searchtype=author&query=Pooja Rani), [Harald C. Gall](https://arxiv.org/search/?searchtype=author&query=Harald C. Gall)

Large language models specialized for code (CodeLLMs) have demonstrated remarkable capabilities in generating code snippets, documentation, and test cases. However, despite their promising capabilities, CodeLLMs can inadvertently memorize and reproduce snippets from their training data, which poses risks of privacy breaches and intellectual property violations. These risks restrict the deployment of CodeLLMs in sensitive domains and limit their training datasets to publicly available sources. To mitigate the memorization risk without compromising their task performance, we apply Differential Privacy (DP) to CodeLLMs. To the best of our knowledge, this is the first comprehensive study that systematically evaluates the effectiveness of DP in CodeLLMs. DP adds calibrated noise to the training process to protect individual data points while still allowing the model to learn useful patterns. To this end, we first identify and understand the driving reasons of the memorization behaviour of the CodeLLMs during their fine-tuning. Then, to address this issue, we empirically evaluate the effect of DP on mitigating memorization while preserving code generation capabilities. Our findings show that DP substantially reduces memorization in CodeLLMs across all the tested snippet types. The snippet types most prone to memorization are also the most effectively mitigated by DP. Furthermore, we observe that DP slightly increases perplexity but preserves, and can even enhance, the code generation capabilities of CodeLLMs, which makes it feasible to apply DP in practice without significantly compromising model utility. Finally, we analyze the impact of DP on training efficiency and energy consumption, finding that DP does not significantly affect training time or energy usage, making it a practical choice for privacy-preserving CodeLLMs training. 
专门用于代码的大型语言模型(CodeLLMs)在生成代码片段、文档和测试用例方面展现了卓越的能力。然而,尽管其能力有前景,CodeLLMs 可能会无意中记忆并复现训练数据中的片段,这带来了隐私泄露和知识产权侵犯的风险。这些风险限制了 CodeLLMs 在敏感领域的部署,并将其训练数据集限制为公开可用的来源。为了在不损害任务性能的前提下缓解记忆化风险,我们将差分隐私(DP)应用于 CodeLLMs。据我们所知,这是首个系统评估差分隐私在 CodeLLMs 中效果的全面研究。差分隐私通过在训练过程中加入校准噪声来保护单个数据点,同时仍允许模型学习有用的模式。为此,我们首先识别并理解在微调过程中驱动 CodeLLMs 记忆化行为的原因。然后,为了解决这一问题,我们通过实证评估差分隐私在减轻记忆化的同时保持代码生成能力方面的效果。 我们的研究结果表明,差分隐私(DP)在所有测试的代码片段类型中都能显著减少 CodeLLMs 的记忆化现象。最容易出现记忆化的片段类型,也是 DP 最有效缓解的类型。此外,我们观察到 DP 会略微增加困惑度,但能够保留甚至增强 CodeLLMs 的代码生成能力,这使得在实际应用中采用 DP 成为可行的选择,而不会显著降低模型的实用性。 最后,我们分析了 DP 对训练效率和能耗的影响,发现 DP 并不显著影响训练时间或能量消耗,因此它是用于保护隐私的 CodeLLMs 训练的实用选择。
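差分隐私训练通常以 DP-SGD 的形式落地:对每个样本的梯度做 L2 裁剪,再在聚合梯度上加入按裁剪范数校准的高斯噪声。下面是这一聚合步骤的极简示意(非该论文的具体超参或实现):

```python
import math
import random

def dp_noisy_grad(per_example_grads, clip_norm, noise_mult, seed=0):
    """DP-SGD 风格的梯度聚合(示意代码):逐样本裁剪 L2 范数,
    求和后加入尺度为 noise_mult * clip_norm 的高斯噪声,再取平均。"""
    rng = random.Random(seed)
    total = [0.0] * len(per_example_grads[0])
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(g):
            total[i] += x * scale
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, noise_mult * clip_norm)) / n for t in total]

grads = [[3.0, 4.0], [0.3, 0.4]]  # 两个样本的梯度,范数分别为 5.0 和 0.5
noisy = dp_noisy_grad(grads, clip_norm=1.0, noise_mult=0.0)  # 关掉噪声,只看裁剪
print(noisy)  # ≈ [0.45, 0.6]:大梯度被裁到范数 1,小梯度保持不变
```

裁剪限制了单个训练样本(如某段敏感代码)对更新的影响上限,噪声则在此基础上提供 (ε, δ)-差分隐私保证。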

Subjects: Software Engineering, Artificial Intelligence, Cryptography and Security 主题:软件工程、人工智能、密码学与安全

Publish: 2025-12-12 11:31:13 UTC 发布:2025-12-12 11:31:13 协调世界时 (UTC)

41 Exploring MLLM-Diffusion Information Transfer with MetaCanvas 41 使用 MetaCanvas 探索 MLLM 与扩散模型间的信息传递

Authors: [Han Lin](https://arxiv.org/search/?searchtype=author&query=Han Lin), [Xichen Pan](https://arxiv.org/search/?searchtype=author&query=Xichen Pan), [Ziqi Huang](https://arxiv.org/search/?searchtype=author&query=Ziqi Huang), [Ji Hou](https://arxiv.org/search/?searchtype=author&query=Ji Hou), [Jialiang Wang](https://arxiv.org/search/?searchtype=author&query=Jialiang Wang), [Weifeng Chen](https://arxiv.org/search/?searchtype=author&query=Weifeng Chen), [Zecheng He](https://arxiv.org/search/?searchtype=author&query=Zecheng He), [Felix Juefei-Xu](https://arxiv.org/search/?searchtype=author&query=Felix Juefei-Xu), [Junzhe Sun](https://arxiv.org/search/?searchtype=author&query=Junzhe Sun), [Zhipeng Fan](https://arxiv.org/search/?searchtype=author&query=Zhipeng Fan), [Ali Thabet](https://arxiv.org/search/?searchtype=author&query=Ali Thabet), [Mohit Bansal](https://arxiv.org/search/?searchtype=author&query=Mohit Bansal), [Chu Wang](https://arxiv.org/search/?searchtype=author&query=Chu Wang) 作者:Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation. 多模态学习在视觉理解方面迅速进步,很大程度上得益于将强大大语言模型(LLMs)作为认知核心的多模态大语言模型(MLLMs)。然而,在视觉生成领域,这些强大的核心模型通常被简化为扩散模型的全局文本编码器,其大部分推理和规划能力未被利用。这造成了一个差距:当前的多模态 LLMs 能够解析复杂的布局、属性和需要丰富知识的场景,但在以同样精确和结构化的控制生成图像或视频方面却表现欠佳。我们提出了 MetaCanvas,一个轻量级框架,使 MLLMs 能够直接在空间和时空潜在空间中进行推理和规划,并与扩散生成器紧密对接。我们在三种不同的扩散主干上对 MetaCanvas 进行了实证实现,并在六项任务上进行了评估,包括文本到图像生成、文本/图像到视频生成、图像/视频编辑以及上下文视频生成,每项任务都需要精确的布局、稳健的属性绑定和依赖推理的控制。 MetaCanvas 在性能上始终优于全局条件基线,这表明将多模态大模型视为潜在空间规划器,是缩小多模态理解与生成之间差距的有前景方向。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-12-12 11:07:11 UTC 发布:2025-12-12 11:07:11 UTC

42 Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation 42 通过免训练的测试时自适应提升基于骨架的零样本动作识别

Authors: [Jingmin Zhu](https://arxiv.org/search/?searchtype=author&query=Jingmin Zhu), [Anqi Zhu](https://arxiv.org/search/?searchtype=author&query=Anqi Zhu), [Hossein Rahmani](https://arxiv.org/search/?searchtype=author&query=Hossein Rahmani), [Jun Liu](https://arxiv.org/search/?searchtype=author&query=Jun Liu), [Mohammed Bennamoun](https://arxiv.org/search/?searchtype=author&query=Mohammed Bennamoun), [Qiuhong Ke](https://arxiv.org/search/?searchtype=author&query=Qiuhong Ke) 作者:Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke

We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache. 我们提出了 Skeleton-Cache,这是首个针对基于骨架的零样本动作识别(SZAR)在测试时无需训练即可自适应的框架,旨在提升模型在推理阶段对未见过动作的泛化能力。Skeleton-Cache 将推理重构为对一个非参数缓存的轻量检索过程,该缓存存储结构化的骨架表示,结合了全局和细粒度的局部描述符。为引导对各描述符预测的融合,我们利用大型语言模型(LLMs)的语义推理能力来分配类别特定的重要性权重。通过将这些结构化描述符与由 LLMs 引导的语义先验相结合,Skeleton-Cache 能在无需任何额外训练或访问训练数据的情况下动态适应未见动作。在 NTU RGB+D 60/120 和 PKU-MMD II 上的大量实验证明,Skeleton-Cache 在零样本和广义零样本设置下均能持续提升多种 SZAR 骨干网络的性能。代码已公开于 https://github.com/Alchemist0754/Skeleton-Cache
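“把推理重构为对非参数缓存的检索 + 按类别权重融合描述符”这一思路可以如下示意(缓存内容、权重与相似度函数均为玩具设定,并非论文实现;论文中的权重由 LLM 的语义推理给出):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_predict(query_descs, cache, weights):
    """对每个类:每种描述符(全局/局部)与缓存范例取最大相似度,
    再按类别相关的描述符权重加权求和,取分数最高的类(示意代码)。"""
    scores = {}
    for cls, exemplars in cache.items():
        s = 0.0
        for d, q in query_descs.items():
            sim = max(cosine(q, e[d]) for e in exemplars)
            s += weights[cls][d] * sim
        scores[cls] = s
    return max(scores, key=scores.get)

cache = {"wave": [{"global": [1.0, 0.0], "hand": [0.9, 0.1]}],
         "kick": [{"global": [0.0, 1.0], "hand": [0.1, 0.9]}]}
weights = {"wave": {"global": 0.4, "hand": 0.6},   # 对"挥手"而言手部描述符更重要
           "kick": {"global": 0.7, "hand": 0.3}}
query = {"global": [0.9, 0.1], "hand": [1.0, 0.0]}
print(cache_predict(query, cache, weights))  # wave
```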

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-12-12 10:53:51 UTC 发布时间:2025-12-12 10:53:51 UTC

43 Flowception: Temporally Expansive Flow Matching for Video Generation 43 Flowception:用于视频生成的时间扩展流匹配

Authors: [Tariq Berrada Ifriqi](https://arxiv.org/search/?searchtype=author&query=Tariq Berrada Ifriqi), [John Nguyen](https://arxiv.org/search/?searchtype=author&query=John Nguyen), [Karteek Alahari](https://arxiv.org/search/?searchtype=author&query=Karteek Alahari), [Jakob Verbeek](https://arxiv.org/search/?searchtype=author&query=Jakob Verbeek), [Ricky T. Q. Chen](https://arxiv.org/search/?searchtype=author&query=Ricky T. Q. Chen) 作者:Tariq Berrada Ifriqi、John Nguyen、Karteek Alahari、Jakob Verbeek、Ricky T. Q. Chen

We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation. 我们提出了 Flowception,一种新颖的非自回归且可变长度的视频生成框架。Flowception 学习一条概率路径,该路径将离散的帧插入与连续的帧去噪交织在一起。与自回归方法相比,Flowception 缓解了误差积累/漂移问题,因为采样过程中的帧插入机制作为一种高效的压缩机制来处理长期上下文。与全序列流方法相比,我们的方法将训练所需的 FLOPs 减少到原来的三分之一,同时更易于与局部注意力变体结合,并允许与视频内容共同学习视频的长度。定量实验结果表明,与自回归和全序列基线相比,在 FVD 和 VBench 指标上有所提升,定性结果进一步验证了这一点。最后,通过学习在序列中插入和去噪帧,Flowception 无缝整合了图像到视频生成和视频插帧等不同任务。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-12-12 10:23:47 UTC 发布:2025-12-12 10:23:47 UTC

44 Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models 44 针对任务的稀疏特征掩码用于化学语言模型的分子毒性预测

Authors: [Kwun Sy Lee](https://arxiv.org/search/?searchtype=author&query=Kwun Sy Lee), [Jiawei Chen](https://arxiv.org/search/?searchtype=author&query=Jiawei Chen), [Fuk Sheng Ford Chung](https://arxiv.org/search/?searchtype=author&query=Fuk Sheng Ford Chung), [Tianyu Zhao](https://arxiv.org/search/?searchtype=author&query=Tianyu Zhao), [Zhenyuan Chen](https://arxiv.org/search/?searchtype=author&query=Zhenyuan Chen), [Debby D. Wang](https://arxiv.org/search/?searchtype=author&query=Debby D. Wang) 作者:Kwun Sy Lee、Jiawei Chen、Fuk Sheng Ford Chung、Tianyu Zhao、Zhenyuan Chen、Debby D. Wang

Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model’s decision-making process. 可靠的计算机内(in silico)分子毒性预测是现代药物发现的基石,为实验筛选提供了可扩展的替代方案。然而,最先进模型的黑箱特性仍然是采用的重大障碍,因为高风险的安全决策不仅需要预测性能,还需要可验证的结构性见解。为了解决这一问题,我们提出了一种新颖的多任务学习(MTL)框架,旨在同时提升准确性与可解释性。我们的架构将共享的化学语言模型与任务特定的注意力模块相结合。通过对这些模块施加 L1 稀疏性惩罚,框架被约束为针对每个不同的毒性端点聚焦于最少的一组显著分子片段。所得框架进行端到端训练,并能够方便地适配各种基于 Transformer 的骨干模型。在 ClinTox、SIDER 和 Tox21 基准数据集上的评估表明,我们的方法始终优于单任务和标准多任务学习基线。 关键是,稀疏注意力权重提供了符合化学直觉的可视化,揭示了影响预测的具体片段,从而增强了对模型决策过程的洞察。
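“对任务特定注意力施加 L1 稀疏惩罚,使每个毒性端点只聚焦少数分子片段”可用门控注意力的玩具版本示意(此处用 sigmoid 门控代替论文中的注意力模块,属于作者之外的简化演绎):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sparse_attention_pool(frag_feats, gate_logits, lam=0.1):
    """给每个分子片段特征一个 sigmoid 注意力门,并对门控权重加
    L1 惩罚,促使大多数门趋于 0(示意代码,非论文实现)。"""
    gates = [sigmoid(g) for g in gate_logits]
    dim = len(frag_feats[0])
    pooled = [sum(g * f[k] for g, f in zip(gates, frag_feats)) for k in range(dim)]
    l1_penalty = lam * sum(gates)  # 加到任务损失上,驱动稀疏性
    return pooled, gates, l1_penalty

frags = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 三个片段的嵌入
pooled, gates, pen = sparse_attention_pool(frags, [4.0, -4.0, -4.0])
print([round(g, 2) for g in gates])  # -> [0.98, 0.02, 0.02]:只剩一个显著片段
```

训练后可直接可视化各端点的门控权重,即论文所说“揭示影响预测的具体片段”的来源。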

Subjects: Computational Engineering, Finance, and Science, Artificial Intelligence, Computation and Language, Machine Learning, Biomolecules 主题:计算工程、金融与科学、人工智能、计算与语言、机器学习、生物分子

Publish: 2025-12-12 09:41:04 UTC 发布:2025-12-12 09:41:04 UTC

45 REMODEL-LLM: Transforming C code to Java using LLMs 45 REMODEL-LLM:使用 LLMs 将 C 代码转换为 Java

Authors: [Aryan Gupta](https://arxiv.org/search/?searchtype=author&query=Aryan Gupta), [Y. Raghu Reddy](https://arxiv.org/search/?searchtype=author&query=Y. Raghu Reddy) 作者:Aryan Gupta,Y. Raghu Reddy

The automated translation of C code to Java is a notoriously difficult task, fraught with challenges stemming from fundamental paradigm shifts (procedural vs. Object Oriented), memory models (manual pointers vs. Garbage Collection), and incompatible data types. This paper investigates the efficacy of 19 small, quantized LLMs (under 20 billion parameters) for the C to Java translation task. We use a novel, hybrid pipeline that leverages Abstract Syntax Trees (ASTs) for semantic decomposition and employs a highly constrained, rule based prompting strategy. The results are stark: a clear multi tiered performance divide emerged. The vast majority of models (Tier 3, e.g., llama3.1, gemma3, starcoder2) failed 100% of the tests, proving incapable of generating even basic, runnable Java boilerplate. A small middle tier (Tier 2, e.g., mistral-nemo and mistral) produced runnable code but was plagued by dangerous semantic failures and wrong translations. Only three models (Tier 1: phi4, deepseek-coder-v2, codeqwen) proved viable, passing over 50% of the test suite. Even these top models failed on the most complex C concepts, such as function pointers, sizeof, and enum logic, revealing a hard ceiling for the reasoning capabilities of current quantized models. 将 C 代码自动翻译为 Java 代码是一项众所周知的困难任务,问题源于根本的范式转变(过程式与面向对象)、内存模型(手动指针与垃圾回收)以及不兼容的数据类型。本文研究了 19 个小型、量化的 LLMs(参数低于 200 亿)在 C 到 Java 翻译任务中的有效性。我们使用了一种新颖的混合流程,利用抽象语法树(AST)进行语义分解,并采用高度受限的基于规则的提示策略。结果十分鲜明:模型呈现出清晰的多层级性能分化。绝大多数模型(第 3 层,例如 llama3.1、gemma3、starcoder2)在 100% 的测试中失败,连最基本的、可运行的 Java 样板代码都无法生成。少数中间层模型(第 2 层,例如 mistral-nemo 和 mistral)能够生成可运行代码,但存在危险的语义错误和错误翻译。仅有三个模型(第 1 层:phi4、deepseek-coder-v2、codeqwen)被证明是可行的,通过了超过 50% 的测试用例。即便是这些顶级模型在最复杂的 C 概念上也表现不佳,例如函数指针、sizeof 和枚举逻辑,这暴露了当前量化模型在推理能力上的一道天花板。
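“高度受限、基于规则的提示策略”背后往往是一张显式的 C→Java 映射规则表。下面是这类规则表及其应用方式的极简示意(映射项与函数名均为假设举例,并非论文使用的规则集):

```python
# 受限提示策略中可能编码的一小段类型映射规则(示意,非论文规则集)。
C_TO_JAVA_TYPES = {
    "int": "int", "long": "long", "float": "float", "double": "double",
    "char *": "String", "unsigned int": "long",  # Java 无无符号 int,需拓宽
    "int *": "int[]",                            # 指向缓冲区的指针映射为数组
}

def translate_decl(c_decl):
    """把 'char * name;' 这类简单 C 声明映射为 Java 形式(示意代码)。
    真实流程会先做 AST 分解;这里只处理扁平声明,按最长类型名优先匹配。"""
    for c_type in sorted(C_TO_JAVA_TYPES, key=len, reverse=True):
        if c_decl.startswith(c_type + " "):
            ident = c_decl[len(c_type):].strip().rstrip(";")
            return f"{C_TO_JAVA_TYPES[c_type]} {ident};"
    return None  # 规则覆盖不到的构造:回退交给 LLM 处理

print(translate_decl("char * name;"))      # String name;
print(translate_decl("unsigned int n;"))   # long n;
```

论文的失败案例(函数指针、sizeof、enum 逻辑)正是这类简单规则覆盖不到、必须依赖模型推理的部分。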

Subjects: Software Engineering, Artificial Intelligence 主题:软件工程,人工智能

Publish: 2025-12-12 09:25:10 UTC 发布:2025-12-12 09:25:10 UTC

46 Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture 46 基于 Transformer 架构的监控视频交通事故检测

Authors: [Tanu Singh](https://arxiv.org/search/?searchtype=author&query=Tanu Singh), [Pranamesh Chakraborty](https://arxiv.org/search/?searchtype=author&query=Pranamesh Chakraborty), [Long T. Truong](https://arxiv.org/search/?searchtype=author&query=Long T. Truong) 作者:Tanu Singh、Pranamesh Chakraborty、Long T. Truong

Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method. 
道路交通事故是全球主要的致死原因之一,随着人口增长、城市化和机动车普及,事故发生率不断上升。事故率上升对交通监控的有效性提出了挑战。传统的用于事故检测的计算机视觉方法在时空理解方面有限,且跨域泛化能力差。近期在变换器(transformer)架构方面的进展在建模全局时空依赖性和并行计算方面表现出色。然而,将这些模型应用于自动交通事故检测受到数据集规模小且多样性不足的限制,阻碍了稳健且具泛化能力系统的发展。为了解决这一问题,我们整理了一个全面且平衡的数据集,覆盖了广泛的交通环境、事故类型和情境变体。基于该整理后数据集,我们提出了一个基于变换器架构并使用预提取空间视频特征的事故检测模型。 该架构采用卷积层提取单帧内不同模式的局部相关性,同时利用变换器捕捉所提取特征之间的序列-时间依赖性。此外,大多数现有研究忽视了运动线索的整合,而运动线索对于理解动态场景(尤其是在事故发生时)至关重要。这些方法通常依赖静态特征或粗略的时间信息。本研究评估了多种融合运动线索的方法,以确定最有效的策略。在测试的输入方法中,将 RGB 特征与光流拼接达到最高准确率 88.3%。研究结果还与视觉语言模型(如 GPT、Gemini 和 LLaVA-NeXT-Video)进行了比较,以评估所提方法的有效性。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-12-12 07:57:36 UTC 发布:2025-12-12 07:57:36 协调世界时

47 MLLM Machine Unlearning via Visual Knowledge Distillation 47 基于视觉知识蒸馏的 MLLM 机器遗忘

Authors: [Yuhang Wang](https://arxiv.org/search/?searchtype=author&query=Yuhang Wang), [Zhenxing Niu](https://arxiv.org/search/?searchtype=author&query=Zhenxing Niu), [Haoxuan Ji](https://arxiv.org/search/?searchtype=author&query=Haoxuan Ji), [Guangyu He](https://arxiv.org/search/?searchtype=author&query=Guangyu He), [Haichang Gao](https://arxiv.org/search/?searchtype=author&query=Haichang Gao), [Gang Hua](https://arxiv.org/search/?searchtype=author&query=Gang Hua) 作者:王宇航、牛振兴、季昊轩、贺光宇、高海昌、华刚

Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks. 最近,已经提出了机器消除学习(machine unlearning)方法,用于从训练良好的大型模型中移除敏感信息。然而,大多数现有方法针对的是 LLMs,而面向多模态大型模型(MLLM)的消除学习仍处于早期阶段。受近期研究探讨 MLLM 内部机制的启发,我们提出将 MLLM 中嵌入的视觉和文本知识进行解耦,并引入一种专门的方法来选择性地擦除目标视觉知识,同时保留文本知识。与依赖输出层监督的既有消除学习方法不同,我们的方法引入了一种视觉知识蒸馏(Visual Knowledge Distillation,VKD)方案,利用 MLLM 中间的视觉表征作为监督信号。该设计显著提升了消除学习的效果和模型实用性。此外,由于我们的方法只微调 MLLM 的视觉组件,因此在效率上具有显著优势。大量实验表明,我们的方法在效果和效率方面均优于最先进的消除学习方法。此外,我们首次评估了 MLLM 消除学习对再学习(relearning)攻击的鲁棒性。
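“以中间视觉表征为监督信号做遗忘”可以用一个玩具损失示意:保留集样本的视觉特征向冻结教师靠拢,遗忘集样本的特征则被推开至少一个间隔。以下为作者之外的简化演绎(损失形式为假设,论文的 VKD 具体定义请以原文为准):

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def vkd_unlearn_loss(stu_feat, tea_feat, is_forget, margin=1.0):
    """对中间视觉特征做蒸馏式监督(示意代码,非论文实现):
    保留样本:最小化与冻结教师特征的距离;
    遗忘样本:对特征距离施加 hinge,推开至少 margin。"""
    d = mse(stu_feat, tea_feat)
    return max(0.0, margin - d) if is_forget else d

retain = vkd_unlearn_loss([1.0, 2.0], [1.0, 2.0], is_forget=False)
forget = vkd_unlearn_loss([1.0, 2.0], [1.0, 2.0], is_forget=True)
print(retain, forget)  # 0.0 1.0:与教师一致对保留样本最优,对遗忘样本则被惩罚
```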

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-12-12 06:51:02 UTC 发布:2025-12-12 06:51:02 UTC
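上述"以中间视觉特征作监督"的 VKD 思路可以用一个极简草图说明(以下目标函数形式纯属假设,仅示意摘要描述的机制,并非论文的确切算法;`scrubbed` 参考特征如何得到也未在此复现):对标记为擦除的视觉 token,学生特征被拉向"擦除后"的参考特征,其余 token 则保持贴近原始特征,从而不触及掩码之外编码的文本知识。

```python
import numpy as np

def vkd_loss(student, original, scrubbed, erase_mask):
    """Sketch of intermediate-feature supervision in the spirit of VKD
    (assumed form, not the paper's exact objective): masked visual tokens
    are pulled toward a 'scrubbed' reference feature, all other tokens
    toward the original feature, leaving non-target knowledge untouched."""
    # broadcast the per-token mask over the feature dimension
    target = np.where(erase_mask[:, None], scrubbed, original)
    return float(np.mean((student - target) ** 2))
```

当掩码全为 False 时损失退化为普通的特征保持项,这对应摘要中"仅擦除目标视觉知识、保留其余知识"的取舍。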

48 Condensation-Concatenation Framework for Dynamic Graph Continual Learning 48 用于动态图持续学习的凝聚-拼接框架

Authors: [Tingxu Yan](https://arxiv.org/search/?searchtype=author&query=Tingxu Yan), [Ye Yuan](https://arxiv.org/search/?searchtype=author&query=Ye Yuan) 作者:Tingxu Yan、Ye Yuan

Dynamic graphs are prevalent in real-world scenarios, where continuous structural changes induce catastrophic forgetting in graph neural networks (GNNs). While continual learning has been extended to dynamic graphs, existing methods overlook the effects of topological changes on existing nodes. To address it, we propose a novel framework for continual learning on dynamic graphs, named Condensation-Concatenation-based Continual Learning (CCC). Specifically, CCC first condenses historical graph snapshots into compact semantic representations while aiming to preserve the original label distribution and topological properties. Then it concatenates these historical embeddings with current graph representations selectively. Moreover, we refine the forgetting measure (FM) to better adapt to dynamic graph scenarios by quantifying the predictive performance degradation of existing nodes caused by structural updates. CCC demonstrates superior performance over state-of-the-art baselines across four real-world datasets in extensive experiments. 动态图在现实场景中普遍存在,连续的结构变化会导致图神经网络(GNN)出现灾难性遗忘。尽管持续学习已被扩展到动态图,但现有方法忽视了拓扑变化对已有节点的影响。为了解决这一问题,我们提出了一个用于动态图持续学习的新框架,称为基于凝聚-拼接的持续学习(Condensation-Concatenation-based Continual Learning,CCC)。具体而言,CCC 首先将历史图快照凝聚为紧凑的语义表示,同时旨在保留原始标签分布和拓扑特性。然后有选择地将这些历史嵌入与当前图表示拼接。此外,我们改进了遗忘度量(FM),通过量化结构更新导致的对已有节点预测性能的下降,使其更好地适应动态图场景。在大量实验中,CCC 在四个真实数据集上的表现均优于最新的基线方法。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-12 06:32:16 UTC 发布:2025-12-12 06:32:16 UTC

49 Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining 49 基于少样本视觉语言模型的数控加工 G 代码与人机界面验证

Authors: [Yasaman Hashem Pour](https://arxiv.org/search/?searchtype=author&query=Yasaman Hashem Pour), [Nazanin Mahjourian](https://arxiv.org/search/?searchtype=author&query=Nazanin Mahjourian), [Vinh Nguyen](https://arxiv.org/search/?searchtype=author&query=Vinh Nguyen) 作者:Yasaman Hashem Pour、Nazanin Mahjourian、Vinh Nguyen

Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The VLM showed that few-shot prompting led to overall enhancement of detecting HMI errors and discrepancies with the G-code for more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training. 
手工生成 G-code 对学习数控机床的操作非常重要。以往关于 G-code 验证的工作使用了 Large-Language Models (LLMs),这些模型主要检查书面编程中的错误。然而,CNC 加工大量依赖并需要了解人机界面(HMI),该界面显示机器状态和错误信息。由于无法访问视觉模态,LLMs 目前无法利用 HMI 知识。本文提出了一种基于视觉语言模型(VLM)的少样本验证方法,该方法同时评估 G-code 和 HMI 显示的错误及安全状态。输入数据集包括来自 15-slant-PRO 车床的配对 G-code 文本和相关 HMI 屏幕截图,涵盖正确和存在错误的情况。为实现少样本学习,我们为 VLM 提供了基于先验启发式知识的结构化 JSON 模式。确定提示语后,将包含错误或无错误的 G-code 与 HMI 实例作为少样本示例来引导 VLM。该模型随后在多种错误 G-code 和 HMI 错误场景中,按每槽(per-slot)准确度与零样本 VLM 进行了比较评估。实验显示,少样本提示能够总体上提升对 HMI 错误和与 G-code 不一致之处的检测,从而实现更全面的调试。因此,所提出的框架被证明适用于验证通常在数控(CNC)培训中手工生成的 G-code。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction 主题:计算机视觉与模式识别,人工智能,人机交互

Publish: 2025-12-12 05:42:36 UTC 发布:2025-12-12 05:42:36 世界协调时 (UTC)

50 AI Autonomy or Human Dependency? Defining the Boundary in Responsible AI with the α-Coefficient 50 AI 自主还是依赖人类?在负责任的人工智能中用 α 系数界定边界

Authors: [Nattaya Mairittha](https://arxiv.org/search/?searchtype=author&query=Nattaya Mairittha), [Gabriel Phorncharoenmusikul](https://arxiv.org/search/?searchtype=author&query=Gabriel Phorncharoenmusikul), [Sorawit Worapradidth](https://arxiv.org/search/?searchtype=author&query=Sorawit Worapradidth) 作者:Nattaya Mairittha、Gabriel Phorncharoenmusikul、Sorawit Worapradidth

The integrity of contemporary AI systems is undermined by a critical design flaw: the misappropriation of Human-in-the-Loop (HITL) models to mask systems that are fundamentally reliant on human labor. We term this structural reliance Human-Instead-of-AI (HISOAI). HISOAI systems represent an ethical failure and an unsustainable economic dependency, where human workers function as hidden operational fallbacks rather than strategic collaborators. To rectify this, we propose the AI-First, Human-Empowered (AFHE) paradigm. AFHE mandates a technological design where the AI component must achieve a minimum, quantifiable level of functional independence prior to deployment. This standard is formalized through the AI Autonomy Coefficient (alpha), a metric that determines the proportion of tasks that the AI successfully processes without mandatory human substitution. We introduce the AFHE Deployment Algorithm, an algorithmic gate that requires the system to meet a specified alpha threshold across both offline and shadow testing. By enforcing this structural separation, the AFHE framework redefines the human’s role to focus exclusively on high-value tasks, including ethical oversight, boundary pushing, and strategic model tuning, thereby ensuring true system transparency and operational independence. This work advocates for a critical shift toward metric-driven, structurally sound AI architecture, moving the industry beyond deceptive human dependency toward verifiable autonomy. 当代人工智能系统的完整性被一个关键的设计缺陷削弱:将“人类在环”(HITL)模型被不当利用,以掩盖那些在根本上依赖人工劳动的系统。我们将这种结构性依赖称为“以人为代替人工智能”(HISOAI)。HISOAI 系统代表了一种伦理失败和不可持续的经济依赖,其中人类工作者作为隐藏的操作后备而非策略性合作者发挥作用。为纠正这一点,我们提出了“以 AI 为先、人类赋能”(AFHE)范式。AFHE 要求在部署前,技术设计必须使 AI 组件达到一个最低的、可量化的功能独立性水平。该标准通过“AI 自主系数(alpha)”来形式化,这是一个度量,用以确定 AI 在无需人为替代的情况下成功处理任务的比例。我们引入了 AFHE 部署算法——一个算法门控,要求系统在离线测试和影子测试中达到指定的 alpha 阈值。 通过强制这种结构性分离,AFHE 框架将人的角色重新定义为专注于高价值任务,包括伦理监督、突破界限以及战略性模型调优,从而确保真正的系统透明性和操作独立性。该工作倡导向以指标为驱动、结构上健全的人工智能架构进行关键性转变,使行业从具有欺骗性的对人类依赖迈向可验证的自主性。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能

Publish: 2025-12-12 05:41:20 UTC 发布:2025-12-12 05:41:20 UTC

51 CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise 51 CIP:一种即插即用的因果提示框架,用于在长上下文噪声下缓解幻觉问题

Authors: [Qingsen Ma](https://arxiv.org/search/?searchtype=author&query=Qingsen Ma), [Dianyun Wang](https://arxiv.org/search/?searchtype=author&query=Dianyun Wang), [Ran Jing](https://arxiv.org/search/?searchtype=author&query=Ran Jing), [Yujun Sun](https://arxiv.org/search/?searchtype=author&query=Yujun Sun), [Zhenbo Xu](https://arxiv.org/search/?searchtype=author&query=Zhenbo Xu) 作者:Qingsen Ma, Dianyun Wang, Ran Jing, Yujun Sun, Zhenbo Xu

Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API level profiling further shows that CIP accelerates contextual understanding and reduces end to end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models. 大型语言模型在处理冗长且噪声较多的检索上下文时经常出现幻觉,因为它们依赖于虚假的相关性而非真实的因果关系。我们提出了 CIP,一种轻量且即插即用的因果提示框架,旨在在输入阶段缓解幻觉。CIP 在实体、动作和事件之间构建因果关系序列,并将其注入提示中,引导推理朝向具有因果相关性的证据。通过因果干预和反事实推理,CIP 抑制非因果的推理路径,提升事实依据性和可解释性。在包括 GPT-4o、Gemini 2.0 Flash 和 Llama 3.1 在内的七种主流语言模型上的实验表明,CIP 持续提升了推理质量和可靠性,在可归因率上提高了 2.6 个百分点,因果一致性得分提高了 0.38,有效信息密度提高了四倍。API 级别的分析进一步显示,CIP 加速了上下文理解,并将端到端响应延迟最多减少了 55.1%。这些结果表明,因果推理可能成为提高大型语言模型可解释性、稳定性和效率的一种有前景的范式。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-12 05:02:26 UTC 发布时间:2025-12-12 05:02:26 UTC
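按摘要的描述,CIP 的核心是把抽取出的因果链注入提示、引导模型只沿因果相关证据推理。下面是一个示意性的提示构造草图(模板文字与函数名均为假设,论文的因果链抽取与反事实推理步骤未在此复现):

```python
def build_causal_prompt(question, context, causal_chain):
    """Illustrative sketch of CIP-style prompt injection (the template is
    an assumption, not the paper's exact wording). causal_chain is a list
    of (cause, relation, effect) triples distilled from the retrieved
    context; they are serialized and prepended to steer the model toward
    causally relevant evidence."""
    chain_lines = "\n".join(
        f"- {c} --[{r}]--> {e}" for c, r, e in causal_chain)
    return (
        "Answer using only evidence that lies on the causal chain below; "
        "ignore co-occurring but non-causal details.\n"
        f"Causal chain:\n{chain_lines}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

这种做法是"即插即用"的:不改模型参数,只改输入,因此可直接套在任意现成模型的调用前。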

52 Words to Describe What I'm Feeling: Exploring the Potential of AI Agents for High Subjectivity Decisions in Advance Care Planning 52 用词描述我的感受:探索 AI 代理在高主观性预先医疗决策中的潜力

Authors: [Kellie Yu Hui Sim](https://arxiv.org/search/?searchtype=author&query=Kellie Yu Hui Sim), [Pin Sym Foong](https://arxiv.org/search/?searchtype=author&query=Pin Sym Foong), [Chenyu Zhao](https://arxiv.org/search/?searchtype=author&query=Chenyu Zhao), [Melanie Yi Ning Quek](https://arxiv.org/search/?searchtype=author&query=Melanie Yi Ning Quek), [Swarangi Subodh Mehta](https://arxiv.org/search/?searchtype=author&query=Swarangi Subodh Mehta), [Kenny Tsu Wei Choo](https://arxiv.org/search/?searchtype=author&query=Kenny Tsu Wei Choo) 作者:Kellie Yu Hui Sim、Pin Sym Foong、Chenyu Zhao、Melanie Yi Ning Quek、Swarangi Subodh Mehta、Kenny Tsu Wei Choo

Serious illness can deprive patients of the capacity to speak for themselves. As populations age and caregiver networks shrink, the need for reliable support in Advance Care Planning (ACP) grows. To probe this fraught design space of using proxy agents for high-risk, high-subjectivity decisions, we built an experience prototype (\acpagent{}) and asked 15 participants in 4 workshops to train it to be their personal proxy in ACP decisions. We analysed their coping strategies and feature requests and mapped the results onto axes of agent autonomy and human control. Our findings argue for a potential new role of AI in ACP where agents act as personal advocates for individuals, building mutual intelligibility over time. We conclude with design recommendations to balance the risks and benefits of such an agent. 严重疾病可能剥夺患者为自己发声的能力。随着人口老龄化和照护网络的缩小,对可靠的预先医疗决策(Advance Care Planning,ACP)支持的需求日益增长。为了探讨在高风险、高主观性决策中使用代理代表的这一充满挑战的设计空间,我们构建了一个体验原型(\acpagent{}),并在 4 个工作坊中邀请 15 位参与者训练它成为他们在 ACP 决策中的个人代理。我们分析了他们的应对策略和功能需求,并将结果映射到代理自主性与人类控制的坐标轴上。我们的发现提出了 AI 在 ACP 中一种潜在的新角色:代理作为个人倡导者,为个体随时间建立相互可理解性。最后我们给出设计建议,以平衡此类代理的风险与收益。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能

Publish: 2025-12-12 04:39:34 UTC 发布时间:2025-12-12 04:39:34 协调世界时 (UTC)

53 A Scalable Multi-GPU Framework for Encrypted Large-Model Inference 53 用于加密大模型推理的可扩展多 GPU 框架

Authors: [Siddharth Jayashankar](https://arxiv.org/search/?searchtype=author&query=Siddharth Jayashankar), [Joshua Kim](https://arxiv.org/search/?searchtype=author&query=Joshua Kim), [Michael B. Sullivan](https://arxiv.org/search/?searchtype=author&query=Michael B. Sullivan), [Wenting Zheng](https://arxiv.org/search/?searchtype=author&query=Wenting Zheng), [Dimitrios Skarlatos](https://arxiv.org/search/?searchtype=author&query=Dimitrios Skarlatos) 作者:Siddharth Jayashankar、Joshua Kim、Michael B. Sullivan、Wenting Zheng、Dimitrios Skarlatos

Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees; but its slow performance has limited practical deployment. Recent works proposed ASICs to accelerate FHE, but require expensive advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance using GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, outright matching prior FHE ASIC CraterLake. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and is the first to demonstrate encrypted inference for BERT-Base and Llama3-8B in 8 seconds and 134 seconds, respectively. 
使用全同态加密(FHE)的加密人工智能提供了强大的隐私保障;但其低速性能限制了实际部署。最近的研究提出了用于加速 FHE 的 ASIC,但这些 ASIC 需要昂贵的先进制造工艺,从而限制了其可及性。GPU 是一个更容易获得的平台,但在 GPU 上实现与 ASIC 相当的性能一直难以实现。此外,最先进的方法主要针对能轻松放入单个设备的小模型。将诸如 LLMs 之类的大模型应用于 FHE 会显著增加计算复杂度,这需要优化的 GPU 内核,同时还要管理远超单个 GPU 容量的太字节级内存占用。本文提出了 Cerium,这是一个用于大模型 FHE 推理的多 GPU 框架。Cerium 集成了领域专用语言、优化编译器和运行时系统,能够自动生成高性能 GPU 内核、管理太字节级内存占用,并在多 GPU 间并行化计算。 它引入了新的中间表示构造、编译器传递、稀疏多项式表示、节省内存的数据布局以及考虑通信的并行化技术,这些技术共同使从小型卷积神经网络到 Llama3-8B 的模型实现加密推理成为可能。我们在 NVIDIA GPU 上构建了 Cerium 并展示了显著的性能提升。对于小模型,Cerium 的性能比专家手写的手工优化 GPU 库最高快 2.25 倍。Cerium 的性能可与最先进的 FHE 专用芯片(ASIC)竞争,甚至完全匹配先前的 FHE ASIC CraterLake。它是第一个在 10 毫秒内完成引导(bootstrapping)的 GPU 系统,达到了 7.5 毫秒,并且是第一个分别在 8 秒和 134 秒内展示对 BERT-Base 和 Llama3-8B 的加密推理的系统。

Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能

Publish: 2025-12-12 04:15:38 UTC

Authors: [Di Wu](https://arxiv.org/search/?searchtype=author&query=Di Wu), [Ruiyu Fang](https://arxiv.org/search/?searchtype=author&query=Ruiyu Fang), [Liting Jiang](https://arxiv.org/search/?searchtype=author&query=Liting Jiang), [Shuangyong Song](https://arxiv.org/search/?searchtype=author&query=Shuangyong Song), [Xiaomeng Huang](https://arxiv.org/search/?searchtype=author&query=Xiaomeng Huang), [Shiquan Wang](https://arxiv.org/search/?searchtype=author&query=Shiquan Wang), [Zhongqiu Li](https://arxiv.org/search/?searchtype=author&query=Zhongqiu Li), [Lingling Shi](https://arxiv.org/search/?searchtype=author&query=Lingling Shi), [Mengjiao Bao](https://arxiv.org/search/?searchtype=author&query=Mengjiao Bao), [Yongxiang Li](https://arxiv.org/search/?searchtype=author&query=Yongxiang Li), [Hao Huang](https://arxiv.org/search/?searchtype=author&query=Hao Huang) 作者:吴迪、方瑞雨、姜丽婷、宋双勇、黄小萌、王世全、李忠秋、史玲玲、鲍梦娇、李永祥、黄浩

Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU. 多意图口语语言理解(SLU)包括两个任务:多意图检测和槽位填充,二者共同处理包含多个意图的语句。由于这一特性与现实应用高度贴合,该任务吸引了越来越多的研究关注,并取得了显著进展。然而,现有关于多意图 SLU 的研究尚缺乏全面系统的综述。为此,本文对多意图 SLU 的最新进展进行了综述。我们从两种视角对以往研究进行了深入概述:解码范式与建模方法。在此基础上,我们进一步比较了代表性模型的性能并分析了它们的优缺点。最后,我们讨论了当前的挑战并勾勒了未来研究的有前景方向。我们希望本综述能提供有价值的见解并成为推进多意图 SLU 研究的有用参考。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-12 03:46:39 UTC 发布:2025-12-12 03:46:39 UTC

55 A Simple Generalisation of the Implicit Dynamics of In-Context Learning 55 上下文学习隐式动态的一个简单推广

Authors: [Francesco Innocenti](https://arxiv.org/search/?searchtype=author&query=Francesco Innocenti), [El Mehdi Achour](https://arxiv.org/search/?searchtype=author&query=El Mehdi Achour) 作者:Francesco Innocenti、El Mehdi Achour

In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models. 上下文学习(In-context learning,ICL)是指模型在不更新任何参数的情况下从输入中的示例中学习新任务的能力。与以往依赖玩具模型和数据设置的 ICL 理论不同,最近有研究表明,变换器(transformer)模块的一个抽象可以被视为根据上下文隐式更新其前馈网络的权重(Dherin et al., 2025)。在此,我们对该结果给出了一个简单的推广,适用于 (i) 除最后一位之外的所有序列位置,(ii) 除第一个之外的任意变换器模块,以及 (iii) 更现实的残差块(residual blocks),包括层归一化(layer normalisation)。我们在简单的上下文线性回归任务上对理论进行了实证验证,并研究了与不同标记(token)在模块内和模块之间相关的隐式更新之间的关系。这些结果有助于将 Dherin et al. (2025) 的理论更接近实践,并有可能在大规模模型上进行验证。

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-12 03:26:16 UTC 发布:2025-12-12 03:26:16 UTC

56 VFMF: World Modeling by Forecasting Vision Foundation Model Features 56 VFMF:通过预测视觉基础模型特征进行世界建模

Authors: [Gabrijel Boduljak](https://arxiv.org/search/?searchtype=author&query=Gabrijel Boduljak), [Yushi Lan](https://arxiv.org/search/?searchtype=author&query=Yushi Lan), [Christian Rupprecht](https://arxiv.org/search/?searchtype=author&query=Christian Rupprecht), [Andrea Vedaldi](https://arxiv.org/search/?searchtype=author&query=Andrea Vedaldi) 作者:Gabrijel Boduljak、Yushi Lan、Christian Rupprecht、Andrea Vedaldi

Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models. 
从部分观测中进行预测是世界建模的核心。许多最新方法通过图像来表示世界,并将预测问题简化为随机视频生成。尽管这些方法在逼真度和视觉保真度方面表现出色,但预测像素计算成本高昂,并且在许多应用中并不直接有用,因为这需要将 RGB 转换为对决策有用的信号。另一种方法使用视觉基础模型(VFM)的特征作为世界表示,执行确定性回归以预测未来世界状态。这些特征可以直接转换为语义分割和深度等可操作信号,同时保持计算效率。然而,确定性回归会在多种可能的未来之间取平均,通过未能捕捉不确定性而削弱了预测精度。为了解决这一关键限制,我们提出了一种生成式预测器,它在 VFM 特征空间中执行自回归流匹配。我们的关键见解是,在该空间中进行生成建模需要将 VFM 特征编码为适合扩散的紧凑潜在空间。 我们证明了这种潜在空间比先前使用的基于 PCA 的替代方法更有效地保留信息,无论是在预测方面还是在其他应用(例如图像生成)中。我们的潜在预测可以很容易解码为多种有用且可解释的输出模态:语义分割、深度、表面法线,甚至 RGB。在匹配的架构和计算条件下,我们的方法在所有模态上都比回归产生更清晰、更准确的预测。我们的结果表明,对 VFM 特征进行有条件的随机生成,为未来的世界模型提供了一个有前景且可扩展的基础。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-12-12 02:10:05 UTC 发布:2025-12-12 02:10:05 UTC

57 Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference 57 自适应软滚动 KV 冻结与熵引导恢复:用于高效 LLM 推理的次线性内存增长

Authors: [Adilet Metinov](https://arxiv.org/search/?searchtype=author&query=Adilet Metinov), [Gulida M. Kudakeeva](https://arxiv.org/search/?searchtype=author&query=Gulida M. Kudakeeva), [Bolotbek uulu Nursultan](https://arxiv.org/search/?searchtype=author&query=Bolotbek uulu Nursultan), [Gulnara D. Kabaeva](https://arxiv.org/search/?searchtype=author&query=Gulnara D. Kabaeva) 作者:Adilet Metinov、Gulida M. Kudakeeva、Bolotbek uulu Nursultan、Gulnara D. Kabaeva

We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs. 我们提出了带熵引导恢复的自适应软滚动 KV 冻结(ASR-KF-EGR),这是一种无需训练、在推理时使用的高效大语言模型生成框架。我们的方法引入了一种可逆的软冻结机制,在滑动注意力窗口内对被识别为低重要性的标记暂时暂停键值(KV)更新。与永久丢弃上下文的逐出式方法不同,ASR-KF-EGR 将所有标记保存在 GPU 外存储中,并按需恢复。我们将该框架扩展为次线性冻结调度,其中冻结持续时间随着重复检测到低重要性而以次线性增长,从而防止过度激进的压缩。对 LLaMA-3 8B 的初步实验表明,在保持生成质量并通过大海捞针式检索测试的同时,活动 KV 缓存大小减少了 55–67%。该方法与架构无关,不需要微调,为在内存受限环境中部署长上下文 LLMs 提供了实用解决方案。

Subjects: Machine Learning, Artificial Intelligence, Computation and Language 主题:机器学习、人工智能、计算与语言

Publish: 2025-12-12 02:02:02 UTC 发布:2025-12-12 02:02:02 UTC
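摘要中"滑动窗口内软冻结、离 GPU 暂存、按需恢复、冻结时长次线性增长"这几个部件可以用一个玩具草图串起来(重要性打分方式、窗口与阈值取值、sqrt 式调度均为示意性假设,并非论文的确切算法):

```python
import math

class SoftRollingKVFreeze:
    """Toy sketch of the soft-freeze idea above (details are assumptions).
    Tokens whose attention score inside a sliding window falls below a
    threshold have their KV entry moved to off-GPU storage (here: a dict)
    for a duration that grows sublinearly with repeated detections, and
    are restored on demand rather than evicted for good."""

    def __init__(self, window=8, threshold=0.1):
        self.window = window
        self.threshold = threshold
        self.active = {}   # token_id -> KV entry (stand-in for GPU cache)
        self.frozen = {}   # token_id -> (KV entry, thaw_step), off-GPU
        self.hits = {}     # token_id -> number of low-importance detections

    def step(self, t, token_id, kv, scores):
        self.active[token_id] = kv
        # recovery: thaw entries whose freeze duration has elapsed
        for tid in [i for i, (_, thaw) in self.frozen.items() if thaw <= t]:
            self.active[tid] = self.frozen.pop(tid)[0]
        # soft freeze: score only tokens inside the sliding window
        for tid in sorted(self.active)[-self.window:]:
            if scores.get(tid, 1.0) < self.threshold:
                self.hits[tid] = self.hits.get(tid, 0) + 1
                # sublinear schedule: duration grows like sqrt(#detections)
                duration = math.isqrt(self.hits[tid]) + 1
                self.frozen[tid] = (self.active.pop(tid), t + duration)
```

与逐出式缓存压缩的关键区别在于 `frozen` 中的条目从未被丢弃,这对应摘要所说的"可逆"。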

58 amc: The Automated Mission Classifier for Telescope Bibliographies 58 amc:望远镜书目自动任务分类器

Authors: [John F. Wu](https://arxiv.org/search/?searchtype=author&query=John F. Wu), [Joshua E. G. Peek](https://arxiv.org/search/?searchtype=author&query=Joshua E. G. Peek), [Sophie J. Miller](https://arxiv.org/search/?searchtype=author&query=Sophie J. Miller), [Jenny Novacescu](https://arxiv.org/search/?searchtype=author&query=Jenny Novacescu), [Achu J. Usha](https://arxiv.org/search/?searchtype=author&query=Achu J. Usha), [Christopher A. Wilkinson](https://arxiv.org/search/?searchtype=author&query=Christopher A. Wilkinson) 作者:John F. Wu、Joshua E. G. Peek、Sophie J. Miller、Jenny Novacescu、Achu J. Usha、Christopher A. Wilkinson

Telescope bibliographies record the pulse of astronomy research by capturing publication statistics and citation metrics for telescope facilities. Robust and scalable bibliographies ensure that we can measure the scientific impact of our facilities and archives. However, the growing rate of publications threatens to outpace our ability to manually label astronomical literature. We therefore present the Automated Mission Classifier (amc), a tool that uses large language models (LLMs) to identify and categorize telescope references by processing large quantities of paper text. A modified version of amc performs well on the TRACS Kaggle challenge, achieving a macro F1 score of 0.84 on the held-out test set. amc is valuable for other telescopes beyond TRACS; we developed the initial software for identifying papers that featured scientific results by NASA missions. Additionally, we investigate how amc can also be used to interrogate historical datasets and surface potential label errors. Our work demonstrates that LLM-based applications offer powerful and scalable assistance for library sciences. 望远镜文献目录通过记录望远镜设施的出版统计和被引指标,反映天文学研究的脉动。健壮且可扩展的文献目录能确保我们衡量设施和档案的科学影响。然而,出版数量的增长可能超出我们手工标注天文文献的能力。因此我们提出了自动任务分类器(amc),该工具使用大型语言模型(LLMs)通过处理大量论文文本来识别和分类望远镜引用。amc 的一个改进版本在 TRACS Kaggle 挑战中表现良好,在保留的测试集上取得了宏观 F1 得分 0.84。amc 对 TRACS 以外的其他望远镜也很有价值;我们最初开发该软件是为了识别呈现 NASA 任务科学成果的论文。此外,我们还研究了 amc 如何用于审视历史数据集并揭示潜在的标签错误。我们的工作表明,基于 LLMs 的应用为图书馆学提供了强大且可扩展的辅助。

Subjects: Instrumentation and Methods for Astrophysics, Artificial Intelligence, Digital Libraries, Machine Learning 主题:天体物理学的仪器与方法、人工智能、数字图书馆、机器学习

Publish: 2025-12-12 01:24:42 UTC 发布:2025-12-12 01:24:42 UTC

59 Fast EXP3 Algorithms 59 快速 EXP3 算法

Authors: [Ryoma Sato](https://arxiv.org/search/?searchtype=author&query=Ryoma Sato), [Shinji Ito](https://arxiv.org/search/?searchtype=author&query=Shinji Ito) 作者:Ryoma Sato、Shinji Ito

We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms. 我们指出 EXP3 可以在每轮常数时间内实现,提出了更实用的算法,并分析了这些算法的遗憾界(regret bound)与时间复杂度之间的权衡。

Subjects: Machine Learning, Artificial Intelligence, Data Structures and Algorithms 主题:机器学习、人工智能、数据结构与算法

Publish: 2025-12-12 01:18:32 UTC 发布:2025-12-12 01:18:32 UTC
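作为参照,标准 EXP3 的一个极简实现如下(仅为教科书式示意,并非论文提出的快速变体;注意每轮重新计算分布带来的 O(K) 开销,这正是论文将其降到每轮常数时间所改进之处):

```python
import math
import random

def exp3(reward_means, gamma=0.1, rounds=1000, seed=0):
    """Minimal EXP3 sketch: exponential weights over K arms; only the
    pulled arm is updated, using an importance-weighted reward estimate
    reward / p(arm) to keep the update unbiased. reward_means are
    Bernoulli means used here to simulate the environment."""
    rng = random.Random(seed)
    k = len(reward_means)
    weights = [1.0] * k
    total = 0.0
    for _ in range(rounds):
        s = sum(weights)
        # mix the weight distribution with uniform exploration gamma
        probs = [(1 - gamma) * w / s + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_means[arm] else 0.0
        total += reward
        weights[arm] *= math.exp(gamma * reward / (probs[arm] * k))
        # renormalise so the weights stay numerically bounded
        m = max(weights)
        weights = [w / m for w in weights]
    return total / rounds, weights
```

运行足够多轮后,较优臂的权重会占据主导,平均奖励接近最优臂的均值。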

60 Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context 60 图像分块以实现高分辨率推理:在局部细节与全局语境之间取得平衡

Authors: [Anatole Jacquin de Margerie](https://arxiv.org/search/?searchtype=author&query=Anatole Jacquin de Margerie), [Alexis Roger](https://arxiv.org/search/?searchtype=author&query=Alexis Roger), [Irina Rish](https://arxiv.org/search/?searchtype=author&query=Irina Rish)

Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work further, by investigating the effect of the inclusion of the global context, which provide practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity. 可重复性仍然是科学进步的基石,然而复杂的多模态模型常常缺乏透明的实现细节和可访问的训练基础设施。在本工作中,我们对发表于 CVPR24 的 Monkey 视觉-语言模型(VLM)(Li et al. 2023b)进行了详尽的复现与批判性分析,该模型是一种通过图像切片实现高分辨率图像理解的最新方法。原始论文提出将大图像拆分为若干切片,以在保持计算效率的同时恢复细粒度视觉细节。我们的研究使用开放的检查点复现了这一策略并重新实现了训练流水线。我们确认了原始 Monkey VLM 工作的关键发现,即切片确实能有效恢复局部细节。随后我们进一步扩展了该工作,研究了全局上下文的引入效果,为未来高分辨率多模态建模提供了实用见解。然而,我们也报告了结果上的偏差,这些影响的幅度在很大程度上依赖于任务类型和切片粒度。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-12-11 23:17:38 UTC 发布:2025-12-11 23:17:38 UTC
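摘要中的切片策略可以用如下草图说明(切片尺寸、零填充方式与全局视图的降采样均为示意性假设,并非 Monkey 的原始实现):把大图切成固定大小的局部切片以保留细节,并附加一个粗分辨率的整图视图以保留全局语境。

```python
import numpy as np

def tile_image(img, tile=224, include_global=True):
    """Sketch of the tiling strategy discussed above. img is an H x W x C
    array; returns a list of tile arrays. When include_global is True the
    last entry is a coarse whole-image view (a crude stride-based
    downsample standing in for a proper resize)."""
    h, w = img.shape[:2]
    # zero-pad so both dimensions are multiples of the tile size
    ph, pw = (-h) % tile, (-w) % tile
    img = np.pad(img, ((0, ph), (0, pw), (0, 0)))
    tiles = [img[r:r + tile, c:c + tile]
             for r in range(0, img.shape[0], tile)
             for c in range(0, img.shape[1], tile)]
    if include_global:
        sh = max(1, img.shape[0] // tile)
        sw = max(1, img.shape[1] // tile)
        tiles.append(img[::sh, ::sw][:tile, :tile])
    return tiles
```

`include_global` 开关对应文中考察的"是否引入全局上下文"这一消融维度。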

61 MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents 61 MiniScope:一个用于对工具调用代理进行授权的最小权限框架

Authors: [Jinhao Zhu](https://arxiv.org/search/?searchtype=author&query=Jinhao Zhu), [Kevin Tseng](https://arxiv.org/search/?searchtype=author&query=Kevin Tseng), [Gil Vernik](https://arxiv.org/search/?searchtype=author&query=Gil Vernik), [Xiao Huang](https://arxiv.org/search/?searchtype=author&query=Xiao Huang), [Shishir G. Patil](https://arxiv.org/search/?searchtype=author&query=Shishir G. Patil), [Vivian Fang](https://arxiv.org/search/?searchtype=author&query=Vivian Fang), [Raluca Ada Popa](https://arxiv.org/search/?searchtype=author&query=Raluca Ada Popa) 作者:Jinhao Zhu、Kevin Tseng、Gil Vernik、Xiao Huang、Shishir G. Patil、Vivian Fang、Raluca Ada Popa

Tool calling agents are an emerging paradigm in LLM deployment, with major platforms such as ChatGPT, Claude, and Gemini adding connectors and autonomous capabilities. However, the inherent unreliability of LLMs introduces fundamental security risks when these agents operate over sensitive user services. Prior approaches either rely on manually written policies that require security expertise, or place LLMs in the confinement loop, which lacks rigorous security guarantees. We present MiniScope, a framework that enables tool calling agents to operate on user accounts while confining potential damage from unreliable LLMs. MiniScope introduces a novel way to automatically and rigorously enforce least privilege principles by reconstructing permission hierarchies that reflect relationships among tool calls and combining them with a mobile-style permission model to balance security and ease of use. To evaluate MiniScope, we create a synthetic dataset derived from ten popular real-world applications, capturing the complexity of realistic agentic tasks beyond existing simplified benchmarks. Our evaluation shows that MiniScope incurs only 1-6% latency overhead compared to vanilla tool calling agents, while significantly outperforming the LLM based baseline in minimizing permissions as well as computational and operational costs. 工具调用代理是在 LLM 部署中出现的一种新范式,主要平台如 ChatGPT、Claude 和 Gemini 都在增加连接器和自主功能。然而,LLM 的固有不可靠性在这些代理操作敏感用户服务时带来了根本性的安全风险。以往的方法要么依赖需要安全专长的手工编写策略,要么将 LLM 置于限制循环中,但这缺乏严格的安全保证。我们提出了 MiniScope,一个使工具调用代理能够在用户账户上运行同时将来自不可靠 LLM 的潜在损害限制住的框架。MiniScope 提出了一种新颖的方法,通过重建反映工具调用之间关系的权限层次结构并将其与类移动端的权限模型相结合,自动且严格地执行最小权限原则,以在安全性与易用性之间取得平衡。为评估 MiniScope,我们从十个流行的真实应用派生出一个合成数据集,捕捉了超出现有简化基准的真实代理任务复杂性。 我们的评估显示,与普通的工具调用代理相比,MiniScope 仅增加了 1–6% 的延迟开销,同时在最小化权限以及降低计算和运营成本方面显著优于基于 LLM 的基线。

Subjects: Cryptography and Security, Artificial Intelligence 主题:密码学与安全性 , 人工智能

Publish: 2025-12-11 22:10:39 UTC 发表时间:2025-12-11 22:10:39 UTC

62 Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles 62 基于自编码器的科学集合半监督降维与聚类

Authors: [Lennard Manuel](https://arxiv.org/search/?searchtype=author&query=Lennard Manuel), [Hamid Gadirov](https://arxiv.org/search/?searchtype=author&query=Hamid Gadirov), [Steffen Frey](https://arxiv.org/search/?searchtype=author&query=Steffen Frey) 作者:Lennard Manuel, Hamid Gadirov, Steffen Frey

Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches. 分析和可视化具有高维度和复杂性的科学集合数据集带来了重大挑战。降维技术和自编码器是提取特征的有力工具,但它们通常在处理此类高维数据时表现不佳。本文提出了一种增强的自编码器框架,该框架将基于软轮廓系数的聚类损失与对比损失相结合,以改善集合数据集的可视化和可解释性。首先,使用 EfficientNetV2 为科学集合数据集中未标注的部分生成伪标签。通过联合优化重建、聚类和对比目标,我们的方法鼓励相似的数据点在潜在空间中聚在一起,同时将不同的簇分离开来。随后在该潜在表示上应用 UMAP 以生成二维投影,并使用轮廓系数对其进行评估。评估并比较了多种类型的自编码器在提取有意义特征方面的能力。 在两个科学集合数据集上的实验——由马尔可夫链蒙特卡洛法得到的土壤通道结构,以及液滴撞击薄膜的动力学——表明加入聚类或对比损失的模型在性能上略优于基线方法。
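作为理解辅助,下面用纯 numpy 实现硬标签版的轮廓系数(silhouette score);论文在损失中使用的是其可微的"软"变体,此处仅示意该指标如何度量潜在空间的聚类质量(示意草图,非论文代码):

```python
import numpy as np

def silhouette(latent, labels):
    """Mean silhouette coefficient (hard-label version) over all points."""
    n = len(latent)
    # pairwise Euclidean distances in the latent space
    d = np.linalg.norm(latent[:, None, :] - latent[None, :, :], axis=-1)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = d[i][same].mean()  # mean intra-cluster distance
        b = min(d[i][labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# two well-separated clusters -> score close to 1
z = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y = np.array([0, 0, 1, 1])
print(silhouette(z, y))
```

训练时把 1 − silhouette 作为聚类损失项加入重建与对比目标,即得到摘要所述的联合优化。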

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器学习、人工智能、计算机视觉与模式识别

Publish: 2025-12-11 22:09:31 UTC 发布时间:2025-12-11 22:09:31 UTC

63 Fairness-Regularized Online Optimization with Switching Costs 63 带切换成本的公平性正则化在线优化

Authors: [Pengfei Li](https://arxiv.org/search/?searchtype=author&query=Pengfei Li), [Yuelin Han](https://arxiv.org/search/?searchtype=author&query=Yuelin Han), [Adam Wierman](https://arxiv.org/search/?searchtype=author&query=Adam Wierman), [Shaolei Ren](https://arxiv.org/search/?searchtype=author&query=Shaolei Ren) 作者:Pengfei Li、Yuelin Han、Adam Wierman、Shaolei Ren

Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length T increases. Then, we propose FairOBD (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, FairOBD decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that FairOBD offers a worst-case asymptotic competitive ratio against a novel benchmark – the optimal offline algorithm with parameterized constraints – by considering T→∞. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate FairOBD, showing that FairOBD can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions. 
公平性和行动平滑性是许多在线优化问题中的两个关键考虑,但它们尚未同时得到解决。在本文中,我们研究了一种新的且具有挑战性的设置:带切换成本的公平性正则化平滑在线凸优化。首先,为了突出基于整套行动序列评估的长期公平性正则项所引入的根本性挑战,我们证明即使在没有切换成本的情况下,相对于离线最优算法,随着问题回合长度 T 增加,也不存在任何在线算法能够实现次线性后悔或有限的竞争比。然后,我们提出了 FairOBD(公平性正则化的在线平衡下降),它在最小化命中成本(hitting cost)、切换成本和公平性成本之间协调冲突。具体而言,FairOBD 通过引入辅助变量将长期公平性成本分解为一系列在线成本,然后利用该辅助变量对在线动作进行正则化以实现公平的结果。 基于一种考虑切换成本的新方法,我们证明了在 T→∞ 时,FairOBD 相对于一种新的基准——具有参数化约束的最优离线算法——具有最坏情况下的渐近竞争比。最后,我们对面向社会责任 AI 推理的动态计算资源配置进行了基于轨迹的实验,以实证评估 FairOBD,结果表明与现有基线解决方案相比,FairOBD 能有效降低总的带公平性正则化的成本并更好地促进公平结果。
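摘要中"通过辅助变量把长期公平性成本分解为逐步的在线成本"这一思路,可在标量二次情形下写出闭式解来演示(以下只是对该分解思想的极简示意,成本形式与参数均为假设,并非论文的 FairOBD 算法本身):

```python
def fairobd_step(x_prev, z_prev, t, y_t, beta=1.0, gamma=0.0, target=0.0):
    """One step: choose x_t minimizing
         0.5*(x - y_t)**2              # hitting cost
       + 0.5*beta*(x - x_prev)**2      # switching cost
       + 0.5*gamma*(z_t - target)**2   # fairness on the running mean z_t
    where z_t = ((t-1)*z_prev + x)/t; quadratic in x, so solve in closed form."""
    a = 1.0 + beta + gamma / t**2
    b = y_t + beta * x_prev + (gamma / t) * (target - (t - 1) * z_prev / t)
    x_t = b / a
    z_t = ((t - 1) * z_prev + x_t) / t
    return x_t, z_t

def run(gamma):
    """Run 10 rounds with constant target y_t = 1; return final running mean."""
    x, z = 0.0, 0.0
    for t in range(1, 11):
        x, z = fairobd_step(x, z, t, y_t=1.0, gamma=gamma)
    return z

print(run(0.0), run(100.0))  # a strong fairness weight pulls the running mean toward target
```

当 gamma = 0 时该步骤退化为普通的在线平衡下降(命中成本与切换成本的折中);gamma 越大,辅助变量 z(动作的运行均值)越被拉向公平目标。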

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-11 21:36:34 UTC 发布:2025-12-11 21:36:34 UTC

64 In-Context Multi-Objective Optimization 64 上下文内多目标优化

Authors: [Xinyu Zhang](https://arxiv.org/search/?searchtype=author&query=Xinyu Zhang), [Conor Hassan](https://arxiv.org/search/?searchtype=author&query=Conor Hassan), [Julien Martinelli](https://arxiv.org/search/?searchtype=author&query=Julien Martinelli), [Daolang Huang](https://arxiv.org/search/?searchtype=author&query=Daolang Huang), [Samuel Kaski](https://arxiv.org/search/?searchtype=author&query=Samuel Kaski) 作者:张欣宇、Conor Hassan、Julien Martinelli、黄道朗、Samuel Kaski

Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows. 
在从药物设计到自主系统的各个学科中,平衡相互竞争的目标无处不在。多目标贝叶斯优化是解决此类昂贵的黑盒问题的有前途的方法:它拟合概率代理模型,并通过在探索与利用之间取得平衡的获取函数来选择新的设计。实际上,这需要为代理模型和获取函数做出特定选择,而这些选择很少能迁移到下一个问题;当常常需要多步规划时,它又表现为短视;并且在并行或对时间敏感的循环中,会增加重新拟合的开销。我们提出了 TAMO,一种针对多目标黑盒优化的完全摊销(fully amortized)、通用策略。TAMO 使用一种 Transformer 架构,可在不同的输入和目标维度上运行,使其能够在多样语料上进行预训练并在无需重训练的情况下迁移到新问题:在测试时,预训练模型只需一次前向传播即可提出下一个设计。我们通过强化学习对该策略进行预训练,以在整个轨迹上最大化累积超体积改进(hypervolume improvement),并基于完整的查询历史进行条件化以近似帕累托前沿。 在合成基准和真实任务中,TAMO 能生成快速的候选方案,在严格的评估预算下与替代方法相比将候选生成时间缩短了 50–1000 倍,同时在帕累托质量上达到或超过其他方法。这些结果表明,Transformer 能够完全在上下文中执行多目标优化,消除了每个任务的代理模型拟合和获取函数设计工作,并为面向科学发现工作流程的基础式、即插即用优化器开辟了路径。
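TAMO 预训练目标中最大化的"超体积改进"可以在二维最小化情形下直观示意(以下为按标准定义写的示意实现,非论文代码):

```python
def hypervolume_2d(points, ref):
    """Hypervolume (area) dominated by a 2-D point set w.r.t. reference point
    `ref`, for minimization problems: larger is better."""
    # keep only points strictly dominating the reference point
    pts = sorted(t for t in set(map(tuple, points))
                 if t[0] < ref[0] and t[1] < ref[1])
    front, best_y = [], float("inf")
    for x, y in pts:                      # extract the non-dominated frontier
        if y < best_y:
            front.append((x, y))
            best_y = y
    hv = 0.0
    for i, (x, y) in enumerate(front):    # sum the dominated rectangles
        next_x = front[i + 1][0] if i + 1 < len(front) else ref[0]
        hv += (next_x - x) * (ref[1] - y)
    return hv

front = [(1, 3), (2, 2), (3, 1)]
base = hypervolume_2d(front, ref=(4, 4))
hvi = hypervolume_2d(front + [(1, 1)], ref=(4, 4)) - base  # improvement of a new design
print(base, hvi)  # 6.0 3.0
```

强化学习策略在每步提出一个新设计,其即时回报即为该设计带来的超体积增量 hvi;累计回报就是整条轨迹的超体积改进。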

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-12-11 20:56:42 UTC 发布:2025-12-11 20:56:42 UTC

65 FIBER: A Multilingual Evaluation Resource for Factual Inference Bias 65 FIBER:用于事实推断偏见的多语言评估资源

Authors: [Evren Ayberk Munis](https://arxiv.org/search/?searchtype=author&query=Evren Ayberk Munis), [Deniz Yılmaz](https://arxiv.org/search/?searchtype=author&query=Deniz Yılmaz), [Arianna Muti](https://arxiv.org/search/?searchtype=author&query=Arianna Muti), [Çağrı Toraman](https://arxiv.org/search/?searchtype=author&query=Çağrı Toraman) 作者:Evren Ayberk Munis、Deniz Yılmaz、Arianna Muti、Çağrı Toraman

Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model’s generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across different topics such that 31% of the topics exhibit factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than the single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models. 
大型语言模型在各个领域被广泛使用,但人们对其事实可靠性和偏见存在担忧。事实知识探测为评估这些方面提供了一种系统方法。现有的大多数基准着重于单实体事实和单语数据。因此我们提出了 FIBER,这是一个用于评估单实体与多实体情境下事实知识的多语言基准。该数据集包含英语、意大利语和土耳其语的句子补全、问答和对象计数预测任务。利用 FIBER,我们考察了提示语言是否会在实体选择上引发推理偏差,以及大型语言模型在多实体问题与单实体问题上的表现。结果表明,提示语言可能会影响模型生成的输出,尤其是与该语言对应国家相关的实体。不过,这种效应在不同主题间存在差异,约有 31%的主题表现出大于 0.5 的事实推理偏差得分。 此外,不同语言之间的偏见程度存在差异,在 83%的主题中,土耳其语提示相较于意大利语表现出更高的偏见,表明存在语言依赖的模式。我们的研究结果还显示,模型在处理多实体问题时比单实体问题面临更大困难。模型性能在不同语言和不同规模之间存在差异。平均精确度最高的是英语,而土耳其语和意大利语的得分明显较低。包括 Llama-3.1-8B 和 Qwen-2.5-7B 在内的较大模型的表现持续优于较小的 3B–4B 模型。

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-11 20:51:16 UTC 发布时间:2025-12-11 20:51:16 UTC

66 Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution 66 解释性偏差是产物:揭示后验特征归因中的隐藏词汇与位置偏好

Authors: [Jonathan Kamp](https://arxiv.org/search/?searchtype=author&query=Jonathan Kamp), [Roos Bakker](https://arxiv.org/search/?searchtype=author&query=Roos Bakker), [Dominique Blok](https://arxiv.org/search/?searchtype=author&query=Dominique Blok) 作者:Jonathan Kamp、Roos Bakker、Dominique Blok

Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type score low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves. 高质量的解释增强了对语言模型和数据的理解。特征归因方法,例如积分梯度,是一种事后解释器,可以提供基于标记的见解。然而,由于不同方法的潜在偏差,对相同输入的解释可能会有很大差异。知情用户可能意识到这一问题并因此不信任其效用,而不知情的用户则可能对其过度信任。在本工作中,我们深入探讨归因方法之间表面不一致之下的结构性偏差,通过一个与模型和方法无关的由三项评估指标构成的框架来构建这些偏差。我们系统性地评估了两种 Transformer 的词汇和位置偏差(输入中是什么以及在哪里);首先,在人工数据上的受控伪随机分类任务中;然后,在自然数据上的半受控因果关系检测任务中。我们发现,在模型比较中,词汇偏差和位置偏差在结构上并不平衡,某一类得分高的模型在另一类上得分低。我们还发现,产生异常解释的方法更有可能自身存在偏差。
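文中考察的"位置偏好"可以这样度量:把每条样本的归因分数归一化后按位置取平均,平坦的分布意味着无位置偏好,峰值则揭示偏差(示意草图,并非论文的三个评估指标本身):

```python
import numpy as np

def position_bias_profile(attributions):
    """Average share of absolute attribution mass at each token position
    (all examples assumed equal length here for simplicity)."""
    A = np.stack([np.abs(a) / np.abs(a).sum() for a in attributions])
    return A.mean(axis=0)

# an attribution method that always spikes on the first token shows clear position bias
biased = [np.array([5.0, 1.0, 1.0, 1.0]), np.array([8.0, 2.0, 1.0, 1.0])]
print(position_bias_profile(biased))
```

词汇偏好可用同样的思路按词(而非位置)聚合归因质量来刻画。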

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-11 20:48:22 UTC 发布:2025-12-11 20:48:22 UTC

67 Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerating Neural Network Verification 67 Clip-and-Verify:基于线性约束的域剪裁以加速神经网络验证

Authors: [Duo Zhou](https://arxiv.org/search/?searchtype=author&query=Duo Zhou), [Jorge Chavez](https://arxiv.org/search/?searchtype=author&query=Jorge Chavez), [Hesun Chen](https://arxiv.org/search/?searchtype=author&query=Hesun Chen), [Grani A. Hanasusanto](https://arxiv.org/search/?searchtype=author&query=Grani A. Hanasusanto), [Huan Zhang](https://arxiv.org/search/?searchtype=author&query=Huan Zhang) 作者:周铎(Duo Zhou)、Jorge Chavez、陈鹤孙(Hesun Chen)、Grani A. Hanasusanto、张焕(Huan Zhang)

State-of-the-art neural network (NN) verifiers demonstrate that applying the branch-and-bound (BaB) procedure with fast bounding techniques plays a key role in tackling many challenging verification properties. In this work, we introduce the linear constraint-driven clipping framework, a class of scalable and efficient methods designed to enhance the efficacy of NN verifiers. Under this framework, we develop two novel algorithms that efficiently utilize linear constraints to 1) reduce portions of the input space that are either verified or irrelevant to a subproblem in the context of branch-and-bound, and 2) directly improve intermediate bounds throughout the network. The process novelly leverages linear constraints that often arise from bound propagation methods and is general enough to also incorporate constraints from other sources. It efficiently handles linear constraints using a specialized GPU procedure that can scale to large neural networks without the use of expensive external solvers. Our verification procedure, Clip-and-Verify, consistently tightens bounds across multiple benchmarks and can significantly reduce the number of subproblems handled during BaB. We show that our clipping algorithms can be integrated with BaB-based verifiers such as α,β-CROWN, utilizing either the split constraints in activation-space BaB or the output constraints that denote the unverified input space. We demonstrate the effectiveness of our procedure on a broad range of benchmarks where, in some instances, we witness a 96% reduction in the number of subproblems during branch-and-bound, and also achieve state-of-the-art verified accuracy across multiple benchmarks. Clip-and-Verify is part of the α,β-CROWN verifier (http://abcrown.org), the VNN-COMP 2025 winner. Code available at https://github.com/Verified-Intelligence/Clip_and_Verify. 
最先进的神经网络(NN)验证器表明,在分支定界(BaB)过程中结合快速边界估计技术,对解决许多具有挑战性的验证性质起着关键作用。在这项工作中,我们提出了线性约束驱动的裁剪框架,这是一类可扩展且高效的方法,旨在提高神经网络验证器的效能。在该框架下,我们开发了两种新算法,这些算法高效利用线性约束来:1)在分支定界的子问题上下文中减少那些已被验证或与子问题无关的输入空间部分;以及 2)直接改进网络中间层的边界。该过程新颖地利用了通常由边界传播方法产生的线性约束,并且足够通用以纳入来自其他来源的约束。它使用专门的 GPU 过程有效处理线性约束,能够在不使用昂贵外部求解器的情况下扩展到大型神经网络。我们的验证流程 Clip-and-Verify 在多个基准上持续收紧边界,并且可以显著减少分支定界过程中需要处理的子问题数量。 我们展示了我们的裁剪算法可以与基于 BaB 的验证器(例如 α,β -CROWN)集成,既可利用激活空间 BaB 中的分裂约束,也可利用表示未验证输入空间的输出约束。我们在广泛的基准测试上证明了该方法的有效性,在某些情况下,在分支定界过程中子问题数量减少了 96%,并在多个基准上达到了最先进的已验证准确率。Clip-and-Verify 是 α,β -CROWN 验证器(http://abcrown.org,VNN-COMP 2025 冠军)的一部分。代码可在 https://github.com/Verified-Intelligence/Clip_and_Verify 获得。

Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security, Optimization and Control 主题:机器学习、人工智能、密码学与安全、优化与控制

Publish: 2025-12-11 19:59:37 UTC 发布:2025-12-11 19:59:37 UTC

68 A probabilistic foundation model for crystal structure denoising, phase classification, and order parameters 68 一个用于晶体结构去噪、相分类和序参量的概率基础模型

Authors: [Hyuna Kwon](https://arxiv.org/search/?searchtype=author&query=Hyuna Kwon), [Babak Sadigh](https://arxiv.org/search/?searchtype=author&query=Babak Sadigh), [Sebastien Hamel](https://arxiv.org/search/?searchtype=author&query=Sebastien Hamel), [Vincenzo Lordi](https://arxiv.org/search/?searchtype=author&query=Vincenzo Lordi), [John Klepeis](https://arxiv.org/search/?searchtype=author&query=John Klepeis), [Fei Zhou](https://arxiv.org/search/?searchtype=author&query=Fei Zhou) 作者:Hyuna Kwon、Babak Sadigh、Sebastien Hamel、Vincenzo Lordi、John Klepeis、Fei Zhou

Atomistic simulations generate large volumes of noisy structural data, but extracting phase labels, order parameters (OPs), and defect information in a way that is universal, robust, and interpretable remains challenging. Existing tools such as PTM and CNA are restricted to a small set of hand-crafted lattices (e.g. FCC/BCC/HCP), degrade under strong thermal disorder or defects, and produce hard, template-based labels without per-atom probability or confidence scores. Here we introduce a log-probability foundation model that unifies denoising, phase classification, and OP extraction within a single probabilistic framework. We reuse the MACE-MP foundation interatomic potential on crystal structures mapped to AFLOW prototypes, training it to predict per-atom, per-phase logits l_ac and to aggregate them into a global log-density log P̂_θ(r) whose gradient defines a conservative score field. Denoising corresponds to gradient ascent on this learned log-density, phase labels follow from argmax_c l_ac, and the l values act as continuous, defect-sensitive and interpretable OPs quantifying the Euclidean distance to ideal phases. We demonstrate universality across hundreds of prototypes, robustness under strong thermal and defect-induced disorder, and accurate treatment of complex systems such as ice polymorphs, ice–water interfaces, and shock-compressed Ti. 原子尺度模拟会产生大量带噪声的结构数据,但以通用、稳健且可解释的方式提取相标签、序参量(OP)和缺陷信息仍然具有挑战性。现有工具如 PTM 和 CNA 仅限于少数手工设计的晶格(例如 FCC/BCC/HCP),在强热扰动或存在缺陷时性能下降,并且产生基于模板的硬标签,缺乏每个原子的概率或置信度分数。在此我们引入了一种对数概率基础模型,将去噪、相分类和序参量提取统一到单一的概率框架中。我们在映射到 AFLOW 原型的晶体结构上重用 MACE-MP 基础原子间势,训练其预测每个原子、每个相的对数几率(logit)l_ac,并将它们聚合成全局对数密度 log P̂_θ(r),其梯度定义了一个保守的评分场。去噪对应于在这个学得的对数密度上进行梯度上升,相标签由 argmax_c l_ac 给出,而对数几率 l 的取值则作为连续的、对缺陷敏感且可解释的序参量,用以量化到理想相的欧几里得距离。 我们在数百个原型结构上展示了普适性,在强热扰动和缺陷引起的无序下表现出鲁棒性,并能准确处理复杂体系,例如多晶型冰、冰—水界面以及受冲击压缩的钛。
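"去噪即在学到的对数密度上做梯度上升"可以用一个各向同性高斯的玩具密度直观演示(grad_logp 在此代表模型学到的 score 场,全部数值均为示意):

```python
import numpy as np

def denoise(x, grad_logp, steps=200, lr=0.1):
    """Gradient ascent on log P(x): the score field pulls a perturbed
    configuration back toward the high-probability (ideal) structure."""
    for _ in range(steps):
        x = x + lr * grad_logp(x)
    return x

ideal = np.array([1.0, 2.0])           # stand-in for an ideal lattice site
grad = lambda x: -(x - ideal)          # score of an isotropic Gaussian centred at `ideal`
noisy = ideal + np.array([0.5, -0.4])  # thermally perturbed position
print(denoise(noisy, grad))            # converges back to ~[1.0, 2.0]
```

论文中 score 场由学得的 log P̂_θ(r) 的梯度给出,同一框架下相标签与序参量则由各相的 logit 读出。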

Subjects: Materials Science, Artificial Intelligence 学科:材料科学,人工智能

Publish: 2025-12-11 19:46:56 UTC 发布:2025-12-11 19:46:56 UTC

69 Fast, accurate measurement of the worker populations of honey bee colonies using deep learning 69 使用深度学习对蜜蜂群体工蜂数量进行快速、准确的测量

Authors: [Junmin Zhong](https://arxiv.org/search/?searchtype=author&query=Junmin Zhong), [Jon F. Harrison](https://arxiv.org/search/?searchtype=author&query=Jon F. Harrison), [Jennie Si](https://arxiv.org/search/?searchtype=author&query=Jennie Si), [Jun Chen](https://arxiv.org/search/?searchtype=author&query=Jun Chen) 作者:钟俊敏、Jon F. Harrison、Jennie Si、陈俊

Honey bees play a crucial role in pollination, contributing significantly to global agriculture and ecosystems. Accurately estimating hive populations is essential for understanding the effects of environmental factors on bee colonies, yet traditional methods of counting bees are time-consuming, labor-intensive, and prone to human error, particularly in large-scale studies. In this paper, we present a deep learning-based solution for automating bee population counting using CSRNet and introduce ASUBEE, the FIRST high-resolution dataset specifically designed for this task. Our method employs density map estimation to predict bee populations, effectively addressing challenges such as occlusion and overlapping bees that are common in hive monitoring. We demonstrate that CSRNet achieves superior performance in terms of time efficiency, with a computation time of just 1 second per image, while delivering accurate counts even in complex and densely populated hive scenarios. Our findings show that deep learning approaches like CSRNet can dramatically enhance the efficiency of hive population assessments, providing a valuable tool for researchers and beekeepers alike. This work marks a significant advancement in applying AI technologies to ecological research, offering scalable and precise monitoring solutions for honey bee populations. 蜜蜂在授粉中发挥着关键作用,为全球农业和生态系统做出重要贡献。准确估计蜂箱内的数量对于理解环境因素对蜂群的影响至关重要,然而传统的计数方法既耗时又费力,而且在大规模研究中容易出现人为误差。本文提出了一种基于深度学习的自动化蜜蜂数量计数解决方案,采用 CSRNet,并引入了 ASUBEE,这是第一个为此任务专门设计的高分辨率数据集。我们的方法使用密度图估计来预测蜜蜂数量,有效应对蜂箱监测中常见的遮挡和重叠等挑战。实验表明,CSRNet 在时间效率方面表现优越,每张图像仅需 1 秒的计算时间,同时在复杂且密集的蜂群场景中也能提供准确的计数。我们的研究结果表明,像 CSRNet 这样的深度学习方法可以显著提升蜂群数量评估的效率,为研究人员和养蜂人提供有价值的工具。 这项工作在将人工智能技术应用于生态学研究方面取得了重要进展,为蜜蜂群体提供了可扩展且精确的监测解决方案。
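密度图计数方法的核心约定是:预测计数等于密度图逐像素求和(真值图中每只蜜蜂贡献单位质量)。如下为玩具数据上的示意(非 ASUBEE/CSRNet 代码):

```python
import numpy as np

def count_from_density(density_map):
    """CSRNet-style counting: the predicted count is the sum
    (discrete integral) of the per-pixel density map."""
    return float(density_map.sum())

# toy map: three "bees", each contributing unit mass spread over two pixels
dm = np.zeros((8, 8))
for r, c in [(1, 1), (3, 5), (6, 2)]:
    dm[r, c] += 0.5
    dm[r, c + 1] += 0.5   # mass split across pixels still sums to 1 per bee
print(count_from_density(dm))  # 3.0
```

正因为计数只依赖密度的总质量,该方法天然能处理蜂体遮挡与重叠:个体轮廓无需被分割出来。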

Subjects: Quantitative Methods, Artificial Intelligence 主题:定量方法,人工智能

Publish: 2025-12-11 19:44:36 UTC 发布日期:2025-12-11 19:44:36 UTC

70 MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data 70 MultiScript30k:利用多语言嵌入扩展跨书写体系平行数据

Authors: [Christopher Driggers-Ellis](https://arxiv.org/search/?searchtype=author&query=Christopher Driggers-Ellis), [Detravious Brinkley](https://arxiv.org/search/?searchtype=author&query=Detravious Brinkley), [Ray Chen](https://arxiv.org/search/?searchtype=author&query=Ray Chen), [Aashish Dhawan](https://arxiv.org/search/?searchtype=author&query=Aashish Dhawan), [Daisy Zhe Wang](https://arxiv.org/search/?searchtype=author&query=Daisy Zhe Wang), [Christan Grant](https://arxiv.org/search/?searchtype=author&query=Christan Grant) 作者:Christopher Driggers-Ellis、Detravious Brinkley、Ray Chen、Aashish Dhawan、Daisy Zhe Wang、Christan Grant

Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over 30000 sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that Multi30k extension consistently achieves greater than 0.8 cosine similarity and symmetric KL divergence less than 0.000251 for all languages supported except Zh_Hant which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores 6.4% greater than MultiScript30k-Uk per split. 
Multi30k 在多模态机器翻译(MMT)文献中被频繁引用,为训练和微调深度学习模型提供了平行文本数据。然而,它仅限于四种语言:捷克语、英语、法语和德语。此限制导致许多研究者只将研究集中在这些语言上。因此,由于官方 Multi30k 数据集仅代表使用拉丁字母的欧洲语言,多样化语言的 MMT 研究一直停滞不前。此前曾有扩展 Multi30k 的努力,但受支持语言的列表、所代表的语系和书写系统仍然非常有限。为了解决这些问题,我们提出了 MultiScript30k,这是一个针对多种书写系统的全球语言的 Multi30k 数据集扩展,通过使用 NLLB200-3.3B 将 Multi30k 的英语版本(Multi30k-En)翻译而成。该数据集包含超过 30000 个句子,并提供了 Multi30k-En 中所有句子的阿拉伯语(Ar)、西班牙语(Es)、乌克兰语(Uk)、简体中文(Zh_Hans)和繁体中文(Zh_Hant)翻译。 相似性分析表明,MultiScript30k 在除繁体中文(Zh_Hant)之外的所有支持语言中均始终达到大于 0.8 的余弦相似度和小于 0.000251 的对称 KL 散度,而繁体中文的结果与之前的 Multi30k 扩展 ArEnMulti30k 和 Multi30k-Uk 相当。与相关工作相比,COMETKiwi 分数对 MultiScript30k 作为 Multi30k-En 译文的质量给出了喜忧参半的评估:ArEnMulti30k 的得分几乎等同于 MultiScript30k-Ar,但 Multi30k-Uk 在每个划分上的得分比 MultiScript30k-Uk 高出 6.4%。
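摘要相似性分析所用的两个量——句向量余弦相似度与(离散化后的)对称 KL 散度——可按定义直接实现(示意;论文所用的嵌入模型与离散化方式未在摘要中给出):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two distributions."""
    p, q = p / p.sum(), q / q.sum()
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return kl(p, q) + kl(q, p)

print(cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0 (parallel vectors)
print(sym_kl(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
```

余弦相似度大于 0.8、对称 KL 小于 0.000251,即摘要中用来说明译文与原文语义接近的两条阈值。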

Subjects: Computation and Language, Artificial Intelligence, Machine Learning, Multimedia 主题:计算与语言、人工智能、机器学习、多媒体

Publish: 2025-12-11 19:43:19 UTC 发布:2025-12-11 19:43:19 UTC

71 KathDB: Explainable Multimodal Database Management System with Human-AI Collaboration 71 KathDB:具有人机协作的可解释多模态数据库管理系统

Authors: [Guorui Xiao](https://arxiv.org/search/?searchtype=author&query=Guorui Xiao), [Enhao Zhang](https://arxiv.org/search/?searchtype=author&query=Enhao Zhang), [Nicole Sullivan](https://arxiv.org/search/?searchtype=author&query=Nicole Sullivan), [Will Hansen](https://arxiv.org/search/?searchtype=author&query=Will Hansen), [Magdalena Balazinska](https://arxiv.org/search/?searchtype=author&query=Magdalena Balazinska) 作者:Guorui Xiao、Enhao Zhang、Nicole Sullivan、Will Hansen、Magdalena Balazinska

Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but writing complex SQL is hard and focuses only on structured tables. Contemporary multimodal systems (which operate over relations but also text, images, and even videos) either expose low-level controls that force users to use (and possibly create) machine learning UDFs manually within SQL or offload execution entirely to black-box LLMs, sacrificing usability or explainability. We propose KathDB, a new system that combines relational semantics with the reasoning power of foundation models over multimodal data. Furthermore, KathDB includes human-AI interaction channels during query parsing, execution, and result explanation, such that users can iteratively obtain explainable answers across data modalities. 传统的关系型数据库管理系统(DBMS)在关系数据上执行用户或应用提供的 SQL 查询,具有强语义保证和先进的查询优化,但编写复杂的 SQL 很难,且仅关注结构化表格。现代多模态系统(在关系数据之外还处理文本、图像甚至视频)要么暴露低级控制,迫使用户在 SQL 中手动使用(并可能创建)机器学习 UDF,要么将执行完全交给黑箱的 LLMs,从而牺牲可用性或可解释性。我们提出了 KathDB,一种新系统,将关系语义与基础模型在多模态数据上推理的能力结合起来。此外,KathDB 在查询解析、执行和结果解释过程中包含人机交互通道,使用户能够跨数据模态迭代地获得可解释的答案。

Subjects: Databases, Artificial Intelligence 主题:数据库,人工智能

Publish: 2025-12-11 19:36:23 UTC 发布:2025-12-11 19:36:23 UTC

72 WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control 72 WholeBodyVLA:迈向用于全身运动操控控制的统一潜在 VLA

Authors: [Haoran Jiang](https://arxiv.org/search/?searchtype=author&query=Haoran Jiang), [Jin Chen](https://arxiv.org/search/?searchtype=author&query=Jin Chen), [Qingwen Bu](https://arxiv.org/search/?searchtype=author&query=Qingwen Bu), [Li Chen](https://arxiv.org/search/?searchtype=author&query=Li Chen), [Modi Shi](https://arxiv.org/search/?searchtype=author&query=Modi Shi), [Yanjie Zhang](https://arxiv.org/search/?searchtype=author&query=Yanjie Zhang), [Delong Li](https://arxiv.org/search/?searchtype=author&query=Delong Li), [Chuanzhe Suo](https://arxiv.org/search/?searchtype=author&query=Chuanzhe Suo), [Chuang Wang](https://arxiv.org/search/?searchtype=author&query=Chuang Wang), [Zhihui Peng](https://arxiv.org/search/?searchtype=author&query=Zhihui Peng), [Hongyang Li](https://arxiv.org/search/?searchtype=author&query=Hongyang Li) 作者:蒋浩然、陈进、卜庆文、陈力、石墨迪、张艳洁、李德龙、索传哲、王创、彭志辉、李洪洋

Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To more precisely execute the desired locomotion commands, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks. 
类人机器人在执行具有挑战性的移动-操控任务时需要精确的步态和灵巧的操作。然而,现有方法,无论是模块化的还是端到端的,在操控感知的步态控制方面都存在不足。这将机器人限制在有限的工作空间内,阻止其执行大空间的移动-操控任务。我们将其归因于:(1) 由于类人远程操控数据稀缺,获取移动-操控知识具有挑战性;以及 (2) 由于现有强化学习控制器在精度和稳定性方面的局限,导致忠实且可靠地执行步态命令变得困难。为获取更丰富的移动-操控知识,我们提出了一个统一的潜在学习框架,使视觉-语言-动作(VLA)系统能够从低成本、无动作标注的第一视角视频中学习。此外,我们设计了一个高效的人类数据收集流程,以扩充数据集并放大收益。为更精确地执行期望的步态命令,我们提出了一个面向移动-操控的(LMO)强化学习策略,专门用于实现诸如前进、转向和下蹲等精确且稳定的核心移动-操控动作。 在这些组件的基础上,我们提出了 WholeBodyVLA,一个用于类人移动-操控的统一框架。据我们所知,WholeBodyVLA 是少数能够实现大空间类人移动-操控的框架之一。通过对 AgiBot X2 类人机器人进行的综合实验验证,其性能比之前的基线方法提升了 21.3%。它还在广泛任务上表现出强大的泛化能力和高度的可扩展性。

Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition 学科:机器人、人工智能、计算机视觉与模式识别

Publish: 2025-12-11 19:07:31 UTC 发布:2025-12-11 19:07:31 UTC

73 SoccerMaster: A Vision Foundation Model for Soccer Understanding 73 SoccerMaster:用于足球理解的视觉基础模型

Authors: [Haolin Yang](https://arxiv.org/search/?searchtype=author&query=Haolin Yang), [Jiayuan Rao](https://arxiv.org/search/?searchtype=author&query=Jiayuan Rao), [Haoning Wu](https://arxiv.org/search/?searchtype=author&query=Haoning Wu), [Weidi Xie](https://arxiv.org/search/?searchtype=author&query=Weidi Xie) 作者:Haolin Yang、Jiayuan Rao、Haoning Wu、Weidi Xie

Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available. 由于其领域特定的复杂性和独特挑战,足球理解近年来引起了日益增长的研究兴趣。与以往通常依赖孤立的、面向特定任务的专家模型的工作不同,本研究旨在提出一个统一模型来处理多样的足球视觉理解任务,涵盖从细粒度感知(例如运动员检测)到语义推理(例如事件分类)。具体而言,我们的贡献有三点: (i) 我们提出了 SoccerMaster,这是首个足球特定的视觉基础模型,通过监督多任务预训练在单一框架内统一多种理解任务; (ii) 我们开发了一个自动化数据策划管道以生成可扩展的空间注释,并将其与各种现有的足球视频数据集整合,构建了 SoccerFactory——一个全面的预训练数据资源;(iii) 我们进行了广泛评估,证明 SoccerMaster 在多种下游任务上持续优于面向特定任务的专家模型,突出了其广度和优越性。数据、代码和模型将公开可用。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 主题:计算机视觉与模式识别,人工智能

Publish: 2025-12-11 18:03:30 UTC 发布:2025-12-11 18:03:30 UTC

74 Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification 74 利用文本引导提升性别分类中的人口统计公平性

Author: [Anoop Krishnan](https://arxiv.org/search/?searchtype=author&query=Anoop Krishnan) 作者:Anoop Krishnan

In the quest for fairness in artificial intelligence, novel approaches to enhance it in facial image based gender classification algorithms using text guided methodologies are presented. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments conducted on benchmark datasets demonstrate that these approaches effectively mitigate bias and improve accuracy across gender and racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic. 在追求人工智能公平性的过程中,提出了使用文本引导方法来增强基于人脸图像的性别分类算法公平性的创新方法。核心方法是在模型训练期间利用来自图像标题的语义信息以提高泛化能力。提出了两种关键策略:图像文本匹配(ITM)指导和图像文本融合。ITM 指导训练模型辨别图像与文本之间的细粒度对齐,从而获得增强的多模态表示。图像文本融合将两种模态结合为综合表示以改善公平性。在基准数据集上进行的大量实验表明,与现有方法相比,这些方法能够有效缓解偏差并提高各性别与种族群体间的准确性。此外,文本引导的独特整合强调了对计算机视觉系统而言可解释且直观的训练范式。通过审视语义信息在多大程度上减少差距,本研究为培育更公平的人脸分析算法提供了有价值的见解。 所提出的方法有助于解决从面部图像进行性别分类时的人口统计偏差这一关键问题。此外,该技术在没有人口统计标签的情况下运行,并且与应用无关。

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 主题:计算机视觉与模式识别,人工智能,机器学习

Publish: 2025-12-11 17:56:09 UTC 发布:2025-12-11 17:56:09 UTC

75 Beyond Memristor: Neuromorphic Computing Using Meminductor 75 超越忆阻器:使用忆感器的类神经计算

Author: [Frank Zhigang Wang](https://arxiv.org/search/?searchtype=author&query=Frank Zhigang Wang) 作者:王志刚(Frank Zhigang Wang)

Memristor (resistor with memory), inductor with memory (meminductor) and capacitor with memory (memcapacitor) have different roles to play in novel computing architectures. We found that a coil with a magnetic core is an inductor with memory (meminductor) in terms of its inductance L(q) being a function of the charge q. The history of the current passing through the coil is remembered by the magnetization inside the magnetic core. Such a meminductor can play a unique role (that cannot be played by a memristor) in neuromorphic computing, deep learning and brain-inspired computing, since the time constant of a neuromorphic RLC circuit is jointly determined by the inductance and capacitance, rather than the resistance. As an experimental verification, this newly invented meminductor was used to reproduce the observed biological behaviour of amoebae (the memorizing, timing and anticipating mechanisms). In conclusion, a beyond-memristor computing paradigm is theoretically sensible and experimentally practical. 忆阻器(具有记忆的电阻)、具有记忆的电感器(忆感器)和具有记忆的电容器(忆容器)在新型计算架构中扮演不同的角色。我们发现,带有磁心的线圈在其电感 L(q) 是电荷 q 的函数时,可视为一种具有记忆的电感器(忆感器)。流经线圈的电流历史被磁心内部的磁化所记忆。这种忆感器在类脑计算、深度学习和受大脑启发的系统中可以发挥一种独特作用(这是忆阻器无法替代的),因为类脑 RLC 电路的时间常数由电感和电容共同决定,而不是由电阻决定。作为实验验证,这种新发明的忆感器被用于再现变形虫观察到的生物行为(记忆、计时和预期机制)。总之,超越忆阻器的计算范式在理论上是合理的,并且在实验上是可行的。
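The abstract's central claim, that the circuit timescale is set by L and C rather than R, can be sketched numerically for a (nearly lossless) LC loop. Component values below are illustrative.

```python
# Minimal sketch: natural oscillation period of an LC loop,
# T = 2*pi*sqrt(L*C), independent of the resistance R.
import math

def resonant_period(L, C):
    """Period in seconds for inductance L (henries) and capacitance C (farads)."""
    return 2 * math.pi * math.sqrt(L * C)

L, C = 1e-3, 1e-6          # 1 mH, 1 uF (illustrative values)
T = resonant_period(L, C)
print(f"{T * 1e6:.1f} us")  # prints "198.7 us"

# Doubling the (mem)inductance scales the period by sqrt(2):
assert abs(resonant_period(2 * L, C) / T - math.sqrt(2)) < 1e-12
```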

Subjects: Emerging Technologies, Artificial Intelligence 主题:新兴技术,人工智能

Publish: 2025-12-10 22:45:27 UTC 发布:2025-12-10 22:45:27 UTC

76 Unambiguous Representations in Neural Networks: An Information-Theoretic Approach to Intentionality 76 神经网络中的明确表示:一种关于意向性的信息论方法

Author: [Francesco Lässig](https://arxiv.org/search/?searchtype=author&query=Francesco Lässig) 作者:Francesco Lässig

Representations pervade our daily experience, from letters representing sounds to bit strings encoding digital files. While such representations require externally defined decoders to convey meaning, conscious experience appears fundamentally different: a neural state corresponding to perceiving a red square cannot alternatively encode the experience of a green square. This intrinsic property of consciousness suggests that conscious representations must be unambiguous in a way that conventional representations are not. We formalize this intuition using information theory, defining representational ambiguity as the conditional entropy H(I|R) over possible interpretations I given a representation R. Through experiments on neural networks trained to classify MNIST digits, we demonstrate that relational structures in network connectivity can unambiguously encode representational content. Using both learned decoders and direct geometric matching, we achieve perfect (100%) accuracy for dropout-trained networks and 38% for standard backpropagation in identifying output neuron class identity, despite identical task performance, demonstrating that representational ambiguity can arise orthogonally to behavioral accuracy. We further show that spatial position information of input neurons can be decoded from network connectivity with R2 up to 0.844. These results provide a quantitative method for measuring representational ambiguity in neural systems and demonstrate that neural networks can exhibit the low-ambiguity representations posited as necessary (though not sufficient) by theoretical accounts of consciousness. 
表征充斥着我们的日常体验,从表示声音的字母到编码数字文件的比特串。尽管此类表征需要外部定义的解码器来传达意义,但意识体验似乎根本不同:对应于感知红色方块的神经状态不可能被替换为编码绿色方块的体验。意识的这一内在特性表明,意识表征必须以某种传统表征所不具备的方式做到不含糊。我们使用信息论对这一直觉进行形式化,将表征歧义性定义为在给定表征 R 的情况下对可能解释 I 的条件熵 H(I|R)。通过在训练用于分类 MNIST 数字的神经网络上的实验,我们展示了网络连接中的关系结构可以无歧义地编码表征内容。 通过使用学习到的解码器和直接几何匹配,我们在识别输出神经元类别身份方面对使用丢弃法训练的网络达到了完美(100%)准确率,而对使用标准反向传播的网络则为 38%,尽管任务表现相同,这表明表征歧义可以与行为准确性正交地产生。我们进一步表明,输入神经元的空间位置信息可以从网络连通性中解码,R2 高达 0.844。这些结果提供了一种用于量化神经系统表征歧义的方法,并证明神经网络可以表现出理论上关于意识所假定的低歧义表征(作为必要但不充分条件)。
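The paper's ambiguity measure H(I|R) is the conditional entropy of the interpretation I given the representation R. A minimal sketch, computed from an illustrative joint sample of (representation, interpretation) pairs:

```python
# Conditional entropy H(I|R) in bits from (r, i) samples:
# 0 means each representation decodes unambiguously.
import math
from collections import Counter

def conditional_entropy(pairs):
    joint = Counter(pairs)
    marg_r = Counter(r for r, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (r, i), c in joint.items():
        p_ri = c / n                 # joint probability p(r, i)
        p_i_given_r = c / marg_r[r]  # conditional probability p(i | r)
        h -= p_ri * math.log2(p_i_given_r)
    return h

# Unambiguous: each representation maps to exactly one interpretation.
print(conditional_entropy([("r1", "red"), ("r2", "green")]))  # 0.0
# Ambiguous: one representation decodes two ways -> 1 bit of ambiguity.
print(conditional_entropy([("r1", "red"), ("r1", "green")]))  # 1.0
```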

Subjects: Neurons and Cognition, Artificial Intelligence, Neural and Evolutionary Computing 主题:神经元与认知,人工智能,神经与进化计算

Publish: 2025-12-10 19:00:34 UTC 发布:2025-12-10 19:00:34 UTC

77 MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA 77 MedBioRAG:用于医学与生物问答的语义搜索与检索增强生成结合大型语言模型

Author: [Seonok Kim](https://arxiv.org/search/?searchtype=author&query=Seonok Kim) 作者:Seonok Kim

Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications. 在检索增强生成(RAG)方面的最新进展显著提升了大型语言模型(LLMs)执行复杂问答(QA)任务的能力。在本文中,我们提出了 MedBioRAG,一种通过语义与词汇检索、文献检索和监督微调相结合来提升生物医学问答性能的检索增强模型。MedBioRAG 能高效检索并排序相关的生物医学文献,从而实现精确且具上下文意识的响应生成。我们在 NFCorpus、TREC-COVID、MedQA、PubMedQA 和 BioASQ 等基准数据集上对 MedBioRAG 在文本检索、封闭式问答和长格式问答任务上进行了评估。实验结果表明,MedBioRAG 在所有评估任务中均优于先前的最先进(SoTA)模型及 GPT-4o 基础模型。值得注意的是,我们的方法在文献检索上提高了 NDCG 和 MRR 得分,同时在封闭式问答上实现了更高的准确率,在长格式问答上取得了更高的 ROUGE 得分。我们的研究结果强调了基于语义搜索的检索与对 LLM 的微调在生物医学应用中的有效性。
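The abstract reports retrieval gains in NDCG and MRR. As a reference point, a minimal sketch of Mean Reciprocal Rank over ranked document lists (document IDs below are illustrative):

```python
# Mean Reciprocal Rank: average of 1/rank of the first relevant
# document per query; 0 when the gold document never appears.

def mean_reciprocal_rank(rankings, relevant):
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        rr = 0.0
        for pos, doc in enumerate(ranked, start=1):
            if doc == gold:
                rr = 1.0 / pos
                break
        total += rr
    return total / len(rankings)

rankings = [["d3", "d1", "d2"], ["d2", "d9", "d7"]]
relevant = ["d1", "d2"]
print(mean_reciprocal_rank(rankings, relevant))  # (1/2 + 1/1) / 2 = 0.75
```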

Subjects: Computation and Language, Artificial Intelligence 主题:计算与语言,人工智能

Publish: 2025-12-10 15:43:25 UTC 发布:2025-12-10 15:43:25 UTC

78 MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax 78 MolSculpt:从化学语法雕塑三维分子几何形状

Authors: [Zhanpeng Chen](https://arxiv.org/search/?searchtype=author&query=Zhanpeng Chen), [Weihao Gao](https://arxiv.org/search/?searchtype=author&query=Weihao Gao), [Shunyu Wang](https://arxiv.org/search/?searchtype=author&query=Shunyu Wang), [Yanan Zhu](https://arxiv.org/search/?searchtype=author&query=Yanan Zhu), [Hong Meng](https://arxiv.org/search/?searchtype=author&query=Hong Meng), [Yuexian Zou](https://arxiv.org/search/?searchtype=author&query=Yuexian Zou) 作者:陈展鹏、高伟浩、王舜宇、朱雅楠、孟宏、邹越先

Generating precise 3D molecular geometries is crucial for drug discovery and material science. While prior efforts leverage 1D representations like SELFIES to ensure molecular validity, they fail to fully exploit the rich chemical knowledge entangled within 1D models, leading to a disconnect between 1D syntactic generation and 3D geometric realization. To bridge this gap, we propose MolSculpt, a novel framework that “sculpts” 3D molecular geometries from chemical syntax. MolSculpt is built upon a frozen 1D molecular foundation model and a 3D molecular diffusion model. We introduce a set of learnable queries to extract inherent chemical knowledge from the foundation model, and a trainable projector then injects this cross-modal information into the conditioning space of the diffusion model to guide the 3D geometry generation. In this way, our model deeply integrates 1D latent chemical knowledge into the 3D generation process through end-to-end optimization. Experiments demonstrate that MolSculpt achieves state-of-the-art (SOTA) performance in *de novo* 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on both the GEOM-DRUGS and QM9 datasets. Code is available at https://github.com/SakuraTroyChen/MolSculpt. 生成精确的三维分子几何构型对于药物发现和材料科学至关重要。尽管先前工作利用 SELFIES 等一维表示来保证分子的有效性,但它们未能充分利用一维模型中蕴含的丰富化学知识,导致一维语法生成与三维几何实现之间存在脱节。为弥合这一差距,我们提出了 MolSculpt,一种从化学语法“雕刻”三维分子几何的新框架。MolSculpt 建立在一个冻结的一维分子基础模型和一个三维分子扩散模型之上。我们引入了一组可学习的查询以从基础模型中提取内在的化学知识,随后一个可训练的投影器将这些跨模态信息注入扩散模型的条件空间以指导三维几何的生成。通过这种方式,我们的模型通过端到端优化将一维潜在化学知识深度整合到三维生成过程中。 实验表明,MolSculpt 在新分子(de novo)三维生成和条件三维分子生成任务上达到了最先进(SOTA)的性能,在 GEOM-DRUGS 和 QM9 数据集上表现出更优越的三维保真度和稳定性。代码可在 https://github.com/SakuraTroyChen/MolSculpt 获取。
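A hedged sketch of the query-based extraction step described above: a small set of learnable query vectors cross-attends over frozen 1D-model token features to produce a compact conditioning summary. Shapes and values are illustrative, not MolSculpt's actual dimensions.

```python
# Cross-attention readout: queries attend over frozen token features
# and return a (num_queries, d) summary for the diffusion conditioner.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # feature dimension (illustrative)
tokens = rng.normal(size=(16, d))    # frozen 1D foundation-model features
queries = rng.normal(size=(4, d))    # learnable queries (trained end to end)

def cross_attend(Q, K):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over tokens
    return weights @ K

cond = cross_attend(queries, tokens)
print(cond.shape)  # (4, 8): then projected into the diffusion conditioning space
```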

Subjects: Machine Learning, Artificial Intelligence, Chemical Physics, Quantitative Methods 主题:机器学习、人工智能、化学物理、定量方法

Publish: 2025-12-09 06:48:46 UTC 发布:2025-12-09 06:48:46 协调世界时

79 Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI 79 Dora: 面向 QoE 的分布式边缘 AI 混合并行方案

Authors: [Jianli Jin](https://arxiv.org/search/?searchtype=author&query=Jianli Jin), [Ziyang Lin](https://arxiv.org/search/?searchtype=author&query=Ziyang Lin), [Qianli Dong](https://arxiv.org/search/?searchtype=author&query=Qianli Dong), [Yi Chen](https://arxiv.org/search/?searchtype=author&query=Yi Chen), [Jayanth Srinivasa](https://arxiv.org/search/?searchtype=author&query=Jayanth Srinivasa), [Myungjin Lee](https://arxiv.org/search/?searchtype=author&query=Myungjin Lee), [Zhaowei Tan](https://arxiv.org/search/?searchtype=author&query=Zhaowei Tan), [Fan Lai](https://arxiv.org/search/?searchtype=author&query=Fan Lai) 作者:晋建立、林子阳、董倩丽、陈毅、Jayanth Srinivasa、Myungjin Lee、谭昭伟、赖帆

With the proliferation of edge AI applications, satisfying user quality of experience (QoE) requirements, such as model inference latency, has become a first-class objective, as these models operate in resource-constrained settings and directly interact with users. Yet, modern AI models routinely exceed the resource capacity of individual devices, necessitating distributed execution across heterogeneous devices over variable and contention-prone networks. Existing planners for hybrid (e.g., data and pipeline) parallelism largely optimize for throughput or device utilization, overlooking QoE, leading to severe resource inefficiency (e.g., unnecessary energy drain) or QoE violations under runtime dynamics. We present Dora, a framework for QoE-aware hybrid parallelism in distributed edge AI training and inference. Dora jointly optimizes heterogeneous computation, contention-prone networks, and multi-dimensional QoE objectives via three key mechanisms: (i) a heterogeneity-aware model partitioner that determines and assigns model partitions across devices, forming a compact set of QoE-compliant plans; (ii) a contention-aware network scheduler that further refines these candidate plans by maximizing compute-communication overlap; and (iii) a runtime adapter that adaptively composes multiple plans to maximize global efficiency while respecting overall QoEs. Across representative edge deployments, including smart homes, traffic analytics, and small edge clusters, Dora achieves 1.1–6.3 times faster execution and, alternatively, reduces energy consumption by 21–82 percent, all while maintaining QoE under runtime dynamics.
随着边缘人工智能应用的普及,满足用户体验质量(QoE)要求(例如模型推理延迟)已成为首要目标,因为这些模型在资源受限的环境中运行并直接与用户交互。然而,现代人工智能模型常常超出单个设备的资源容量,需在异构设备之间通过易受争用和变动影响的网络进行分布式执行。现有的混合并行(例如数据并行和流水线并行)调度器大多优化吞吐量或设备利用率,忽视了 QoE,导致在运行时动态变化下出现严重的资源低效(例如不必要的能量消耗)或 QoE 违规。我们提出了 Dora,一个用于分布式边缘 AI 训练和推理的面向 QoE 的混合并行框架。 Dora 通过三项关键机制联合优化异构计算、易发生争用的网络以及多维度的 QoE 目标:(i) 一个具备异构感知的模型分割器,用于决定并在设备间分配模型分区,从而形成一组紧凑且符合 QoE 的计划;(ii) 一个感知争用的网络调度器,通过最大化计算与通信的重叠进一步优化这些候选计划;以及 (iii) 一个运行时适配器,自适应地组合多个计划以在尊重整体 QoE 的同时最大化全局效率。在包括智能家居、交通分析和小型边缘集群等具有代表性的边缘部署中,Dora 实现了 1.1–6.3 倍更快的执行速度,或者在保持运行时动态下 QoE 不变的前提下将能耗降低 21–82%。
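An illustrative sketch of step (i) above: keep only parallelism plans whose predicted latency and energy fit the QoE budget, then rank the survivors by efficiency. Plan names and numbers are made up, not Dora's actual planner output.

```python
# QoE-compliant plan filtering: drop plans that violate latency or
# energy budgets, return the rest ordered by energy cost.

plans = [
    {"name": "pipeline-2", "latency_ms": 120, "energy_j": 5.0},
    {"name": "data-4",     "latency_ms":  80, "energy_j": 9.0},
    {"name": "hybrid-2x2", "latency_ms":  95, "energy_j": 6.0},
]

def qoe_compliant(plans, max_latency_ms, max_energy_j):
    ok = [p for p in plans
          if p["latency_ms"] <= max_latency_ms and p["energy_j"] <= max_energy_j]
    return sorted(ok, key=lambda p: p["energy_j"])

best = qoe_compliant(plans, max_latency_ms=100, max_energy_j=8.0)
print([p["name"] for p in best])  # ['hybrid-2x2']: the others break a budget
```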

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题:分布式、并行与集群计算,人工智能

Publish: 2025-12-09 03:19:16 UTC 发布:2025-12-09 03:19:16 UTC

80 Mathematics of natural intelligence 80 自然智能的数学

Author: [Evgenii Vityaev](https://arxiv.org/search/?searchtype=author&query=Evgenii Vityaev) 作者:Evgenii Vityaev

In the process of evolution, the brain has achieved a level of perfection that artificial intelligence systems lack, and this perfection calls for its own mathematics. The concept of cognitome, introduced by the academician K.V. Anokhin, as the cognitive structure of the mind – a high-order structure of the brain and a neural hypernetwork, is considered as the basis for modeling. Consciousness then is a special form of dynamics in this hypernetwork – a large-scale integration of its cognitive elements. The cognitome, in turn, consists of interconnected COGs (cognitive groups of neurons) of two types – functional systems and cellular ensembles. K.V. Anokhin sees the task of the fundamental theory of the brain and mind in describing these structures, their origin, functions and processes in them. The paper presents mathematical models of these structures based on new mathematical results, as well as models of different cognitive processes in terms of these models. In addition, it is shown that these models can be derived based on a fairly general principle of how the brain works: *the brain discovers all possible causal relationships in the external world and draws all possible conclusions from them*. Based on these results, the paper presents models of: "natural" classification; theory of functional brain systems by P.K. Anokhin; prototypical theory of categorization by E. Roche; theory of causal models by Bob Rehter; theory of consciousness as integrated information by G. Tononi. 在进化过程中,大脑达到了人工智能系统所不具备且需要自身数学来描述的完美程度。由院士 K.V. Anokhin 提出的认知基因组(cognitome)概念,作为心智的认知结构——大脑的高阶结构和神经超网络,被视为建模的基础。意识则是该超网络中一种特殊的动力学形式——其认知要素的大规模整合。认知基因组由两类相互连接的 COG(认知神经元群体)组成——功能系统和细胞集合体。K.V. Anokhin 认为,大脑与心智的基础理论任务在于描述这些结构、它们的起源、功能及其中的过程。本文基于新的数学成果给出了这些结构的数学模型,以及用这些模型表达的不同认知过程模型。 此外,研究表明这些模型可以基于一个相当普遍的关于大脑工作方式的原则推导出来:大脑发现外部世界中所有可能的因果关系并从中得出所有可能的结论。基于这些结果,论文提出了以下模型: “自然”分类;P.K. Anokhin 的功能性脑系统理论;E. Roche 的原型范畴化理论;Bob Rehter 的因果模型理论;G. Tononi 的作为整合信息的意识理论。

Subjects: Neurons and Cognition, Artificial Intelligence 主题:神经元与认知,人工智能

Publish: 2025-12-07 10:15:00 UTC 发表:2025-12-07 10:15:00 UTC

81 Marti-5: A Mathematical Model of "Self in the World" as a First Step Toward Self-Awareness 81 Marti-5:作为迈向自我意识的第一步的“自我在世界中”数学模型

Authors: [Igor Pivovarov](https://arxiv.org/search/?searchtype=author&query=Igor Pivovarov), [Sergey Shumsky](https://arxiv.org/search/?searchtype=author&query=Sergey Shumsky) 作者:Igor Pivovarov,Sergey Shumsky

The existence of ‘what’ and ‘where’ pathways of information processing in the brain was proposed almost 30 years ago, but there is still a lack of a clear mathematical model that could show how these pathways work together. We propose a biologically inspired mathematical model that uses this idea to identify and separate the self from the environment and then build and use a self-model for better predictions. This is a model of neocortical columns governed by the basal ganglia to make predictions and choose the next action, where some columns act as ‘what’ columns and others act as ‘where’ columns. Based on this model, we present a reinforcement learning agent that learns purposeful behavior in a virtual environment. We evaluate the agent on the Atari games Pong and Breakout, where it successfully learns to play. We conclude that the ability to separate the self from the environment gives advantages to the agent and therefore such a model could appear in living organisms during evolution. We propose Self-Awareness Principle 1: the ability to separate the self from the world is a necessary but insufficient condition for self-awareness. 近 30 年前有人提出了大脑中信息处理的“何物(what)”和“何处(where)”通路的存在,但仍缺乏能够清楚展示这些通路如何协同工作的数学模型。我们提出了一个以生物学为灵感的数学模型,利用这一思想来识别并将自我与环境分离,随后构建并使用自我模型以获得更优的预测。该模型描述新皮层柱由基底节支配以进行预测并选择下一个动作,其中部分柱子充当“何物”柱,另一些则充当“何处”柱。基于此模型,我们提出了一个在虚拟环境中学习有目的行为的强化学习智能体。我们在 Atari 游戏《乒乓》(Pong)和《打砖块》(Breakout)上评估了该智能体,结果显示其成功学会了游戏。我们得出结论:将自我与环境分离的能力为智能体带来了优势,因此此类模型可能在生物进化过程中出现。我们提出自我意识原理 1:将自我与世界分离的能力是自我意识的必要但不充分条件。

Subjects: Neurons and Cognition, Artificial Intelligence, Machine Learning 主题:神经元与认知、人工智能、机器学习

Publish: 2025-12-05 11:15:06 UTC 发布:2025-12-05 11:15:06 协调世界时

82 Developmental Symmetry-Loss: A Free-Energy Perspective on Brain-Inspired Invariance Learning 82 发展性对称性损失:一种关于受大脑启发的不变性学习的自由能视角

Author: [Arif Dönmez](https://arxiv.org/search/?searchtype=author&query=Arif Dönmez) 作者:Arif Dönmez

We propose Symmetry-Loss, a brain-inspired algorithmic principle that enforces invariance and equivariance through a differentiable constraint derived from environmental symmetries. The framework models learning as the iterative refinement of an effective symmetry group, paralleling developmental processes in which cortical representations align with the world’s structure. By minimizing structural surprise, i.e. deviations from symmetry consistency, Symmetry-Loss operationalizes a Free-Energy–like objective for representation learning. This formulation bridges predictive-coding and group-theoretic perspectives, showing how efficient, stable, and compositional representations can emerge from symmetry-based self-organization. The result is a general computational mechanism linking developmental learning in the brain with principled representation learning in artificial systems. 我们提出了对称性损失(Symmetry-Loss),这是一种受大脑启发的算法原则,通过从环境对称性导出的可微分约束来强制不变性与等变性。该框架将学习建模为对有效对称群的迭代精炼,类似于皮层表征与世界结构对齐的发展过程。通过最小化结构惊奇(即与对称性一致性偏离),对称性损失将一种类自由能(Free-Energy)目标运用于表征学习。这一表述桥接了预测编码与群论视角,展示了如何从基于对称性的机制中涌现出高效、稳定且具组合性的表征。其结果是一个将大脑中的发展性学习与人工系统中有原则的表征学习联系起来的一般计算机制。
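A minimal sketch of a symmetry-consistency penalty in the spirit described above: for a group action g on inputs, penalize the "structural surprise" ||f(g·x) − g·f(x)||² so that f is pushed toward equivariance. Here g is a 2D rotation and f is a plain linear map; all of this is an illustrative stand-in, not the paper's objective.

```python
# Equivariance penalty: zero when the map commutes with every
# group element, positive otherwise.
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def symmetry_loss(W, xs, thetas):
    loss = 0.0
    for x in xs:
        for t in thetas:
            g = rot(t)
            loss += np.sum((W @ (g @ x) - g @ (W @ x)) ** 2)
    return loss / (len(xs) * len(thetas))

xs = [np.array([1.0, 0.0]), np.array([0.3, -0.7])]
thetas = [0.5, 1.2]
print(symmetry_loss(np.eye(2), xs, thetas))                # identity commutes: 0.0
print(symmetry_loss(np.diag([1.0, 2.0]), xs, thetas) > 0)  # anisotropic map: True
```

Minimizing this term alongside a task loss would drive the learned map toward the invariant/equivariant structure the environment's symmetries prescribe.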

Subjects: Neurons and Cognition, Artificial Intelligence, Machine Learning, Adaptation and Self-Organizing Systems 学科:神经元与认知、人工智能、机器学习、自适应与自组织系统

Publish: 2025-12-04 22:12:15 UTC 发布:2025-12-04 22:12:15 UTC

83 Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling 83 通过动态多目标调度减少 GPU 集群中的碎片和饥饿

Author: [Akhmadillo Mamirov](https://arxiv.org/search/?searchtype=author&query=Akhmadillo Mamirov) 作者:Akhmadillo Mamirov

GPU clusters have become essential for training and deploying modern AI systems, yet real deployments continue to report average utilization near 50%. This inefficiency is largely caused by fragmentation, heterogeneous workloads, and the limitations of static scheduling policies. This work presents a systematic evaluation of these issues and introduces three specialized dynamic schedulers: Hybrid Priority (HPS), Predictive Backfill (PBS), and Smart Batch (SBS). These schedulers are designed to improve utilization, fairness, and overall throughput in multi-tenant GPU clusters. We evaluate all schedulers using a controlled simulation of 1,000 AI jobs on a 64-GPU, 8-node cluster that includes a realistic mix of training, inference, and research workloads. Static baselines (FIFO, SJF, Shortest, Shortest-GPU) achieve 45 to 67% GPU utilization and 12.5 to 18.3 jobs per hour and experience severe starvation, with as many as 156 jobs waiting longer than 30 minutes. The dynamic schedulers significantly outperform these policies. HPS achieves the highest utilization (78.2%), highest throughput (25.8 jobs per hour), and the lowest fairness variance among dynamic methods (457), reducing starvation to 12 jobs. PBS improves fragmentation handling and reaches 76.1% utilization, while SBS increases efficiency for structurally similar jobs and reaches 74.6% utilization. Across all key metrics, including throughput, job wait times, fairness variance, and starvation, dynamic multi-objective schedulers consistently outperform single-objective heuristics. These results show that targeted and transparent scheduling strategies can meaningfully increase GPU efficiency in heterogeneous AI clusters and provide a practical foundation for future production scheduling frameworks. 
GPU 集群已成为训练和部署现代 AI 系统的关键,但实际部署中平均利用率仍接近 50%。这种低效率主要由碎片化、异构工作负载和静态调度策略的局限性造成。本文系统评估了这些问题,并引入了三种专用的动态调度器:混合优先(HPS)、预测回填(PBS)和智能批处理(SBS)。这些调度器旨在提高多租户 GPU 集群的利用率、公平性和整体吞吐量。我们在一个包含训练、推理和研究工作负载的真实混合场景中,对一个由 8 节点、64 块 GPU 组成的集群上 1000 个 AI 作业进行了受控仿真,以评估所有调度器。静态基线(FIFO、SJF、Shortest、Shortest-GPU)实现了 45% 到 67% 的 GPU 利用率和每小时 12.5 到 18.3 个作业的吞吐量,并且出现了严重的饥饿现象,最长有多达 156 个作业等待超过 30 分钟。动态调度器显著优于这些策略。HPS 实现了最高的利用率(78.2%)和最高的吞吐量(每小时 25.8 个作业),并且在动态方法中具有最低的公平性方差(457),将饥饿情况减少到 12 个作业。PBS 改进了碎片处理并达到了 76.1%的利用率,而 SBS 则提高了结构相似作业的效率并达到了 74.6%的利用率。在包括吞吐量、作业等待时间、公平性方差和饥饿在内的所有关键指标上,动态多目标调度器始终优于单目标启发式方法。这些结果表明,有针对性且透明的调度策略能够显著提高异构 AI 集群中的 GPU 效率,并为未来的生产调度框架提供了切实可行的基础。
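A hedged sketch of the starvation-avoidance idea behind a dynamic priority scheduler like HPS: a job's effective priority blends its base priority with an aging term, so long-waiting jobs eventually win. The weighting and job records are illustrative, not the paper's actual policy.

```python
# Aging-based dynamic priority: priority grows with waiting time,
# so small jobs cannot be starved indefinitely by large ones.

def effective_priority(base, wait_min, aging_weight=0.1):
    return base + aging_weight * wait_min

jobs = [
    {"id": "train-big",  "base": 5.0, "wait_min": 2},
    {"id": "infer-tiny", "base": 1.0, "wait_min": 45},  # would starve otherwise
]
next_job = max(jobs, key=lambda j: effective_priority(j["base"], j["wait_min"]))
print(next_job["id"])  # 'infer-tiny': 1.0 + 4.5 beats 5.0 + 0.2
```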

Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题:分布式、并行与集群计算,人工智能

Publish: 2025-12-04 04:14:03 UTC 发布:2025-12-04 04:14:03 UTC

84 Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning 84 认知镜像:探索注意力头在 LLM 推理中的多样功能角色

Authors: [Xueqi Ma](https://arxiv.org/search/?searchtype=author&query=Xueqi Ma), [Jun Wang](https://arxiv.org/search/?searchtype=author&query=Jun Wang), [Yanbei Jiang](https://arxiv.org/search/?searchtype=author&query=Yanbei Jiang), [Sarah Monazam Erfani](https://arxiv.org/search/?searchtype=author&query=Sarah Monazam Erfani), [Tongliang Liu](https://arxiv.org/search/?searchtype=author&query=Tongliang Liu), [James Bailey](https://arxiv.org/search/?searchtype=author&query=James Bailey) 作者:马雪琪,王军,姜燕蓓,Sarah Monazam Erfani,刘通亮,James Bailey

Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. Understanding these mechanisms is crucial to improve their reasoning abilities. Drawing inspiration from the interplay between neural processes and human cognition, we propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads, which are key components of LLMs. We introduce CogQA, a dataset that decomposes complex questions into step-by-step subquestions with a chain-of-thought design, each associated with specific cognitive functions such as retrieval or logical reasoning. By applying a multi-class probing method, we identify the attention heads responsible for these functions. Our analysis across multiple LLM families reveals that attention heads exhibit functional specialization, characterized as cognitive heads. These cognitive heads exhibit several key properties: they are universally sparse, vary in number and distribution across different cognitive functions, and display interactive and hierarchical structures. We further show that cognitive heads play a vital role in reasoning tasks - removing them leads to performance degradation, while augmenting them enhances reasoning accuracy. These insights offer a deeper understanding of LLM reasoning and suggest important implications for model design, training, and fine-tuning strategies. 大型语言模型(LLMs)在多种任务上已取得最先进的表现,但其内部机制在很大程度上仍不透明。理解这些机制对于提升其推理能力至关重要。借鉴神经过程与人类认知之间的相互作用,我们提出了一种新颖的可解释性框架,用以系统分析注意力头(attention heads)的作用和行为——注意力头是 LLM 的关键组成部分。我们引入了 CogQA,这是一个将复杂问题分解为逐步子问题并采用链式思维设计的数据集,每个子问题都关联特定的认知功能,例如检索或逻辑推理。通过应用多类探测方法,我们识别出负责这些功能的注意力头。我们在多个 LLM 家族上的分析显示,注意力头表现出功能专门化,可被表征为认知头(cognitive heads)。这些认知头具有若干关键特性:它们在全局上呈稀疏分布,不同认知功能的数量和分布存在差异,并展现出交互性和层级结构。 我们进一步表明,认知注意头在推理任务中起着关键作用——移除它们会导致性能下降,而增加它们则能提升推理准确性。这些见解提供了对 LLM 推理的更深理解,并对模型设计、训练和微调策略提出了重要启示。
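An illustrative stand-in for the multi-class probing step: score each attention head by how well a simple probe on its activations predicts the cognitive-function label, and treat high-scoring heads as candidate "cognitive heads". The probe here is nearest-centroid and the data is synthetic, not CogQA activations.

```python
# Probe each head's activations; only the head carrying label
# structure reaches high probe accuracy.
import numpy as np

rng = np.random.default_rng(1)
labels = np.array([0, 0, 1, 1] * 10)             # two cognitive functions
# Head A carries the label signal; head B is pure noise.
head_a = labels[:, None] * 2.0 + rng.normal(size=(40, 4)) * 0.1
head_b = rng.normal(size=(40, 4))

def probe_accuracy(acts, labels):
    centroids = {c: acts[labels == c].mean(axis=0) for c in np.unique(labels)}
    preds = [min(centroids, key=lambda c: np.linalg.norm(a - centroids[c]))
             for a in acts]
    return float(np.mean(np.array(preds) == labels))

print(probe_accuracy(head_a, labels))  # ~1.0: a candidate "cognitive head"
print(probe_accuracy(head_b, labels))  # ~0.5: chance level
```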

Subjects: Neurons and Cognition, Artificial Intelligence 主题:神经元与认知,人工智能

Publish: 2025-12-03 10:24:34 UTC 发布:2025-12-03 10:24:34 UTC

85 Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems 85 基于代理的模块化学习用于人机系统中的多模态情感识别

Authors: [Matvey Nepomnyaschiy](https://arxiv.org/search/?searchtype=author&query=Matvey Nepomnyaschiy), [Oleg Pereziabov](https://arxiv.org/search/?searchtype=author&query=Oleg Pereziabov), [Anvar Tliamov](https://arxiv.org/search/?searchtype=author&query=Anvar Tliamov), [Stanislav Mikhailov](https://arxiv.org/search/?searchtype=author&query=Stanislav Mikhailov), [Ilya Afanasyev](https://arxiv.org/search/?searchtype=author&query=Ilya Afanasyev) 作者:Matvey Nepomnyaschiy、Oleg Pereziabov、Anvar Tliamov、Stanislav Mikhailov、Ilya Afanasyev

Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios. 有效的人—代理交互(HAI)依赖于对人类情绪状态的准确且自适应的感知。尽管利用面部表情、语音和文本线索的多模态深度学习模型在情绪识别方面具有很高的准确性,但它们的训练和维护通常计算开销大且对模态变化缺乏灵活性。在本工作中,我们提出了一种用于训练多模态情绪识别系统的新型多智能体框架,其中每个模态编码器和融合分类器作为由中央监督者协调的自治智能体运行。该架构支持模块化集成新模态(例如,通过 emotion2vec 引入音频特征)、无缝替换过时组件,并在训练期间降低计算开销。我们通过一个概念验证实现展示了该方法的可行性,该实现支持视觉、音频和文本模态,分类器作为共享的决策智能体。我们的框架不仅提高了训练效率,还为在 HAI 场景中为具身和虚拟代理设计更灵活、可扩展且易维护的感知模块做出了贡献。

Subjects: Machine Learning, Artificial Intelligence, Human-Computer Interaction, Multiagent Systems 主题:机器学习、人工智能、人机交互、多智能体系统

Publish: 2025-12-02 21:47:00 UTC 发布:2025-12-02 21:47:00 协调世界时

86 Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer's Disease Diagnosis 86 区域脑专家的多模态融合用于可解释的阿尔茨海默病诊断

Authors: [Farica Zhuang](https://arxiv.org/search/?searchtype=author&query=Farica Zhuang), [Dinara Aliyeva](https://arxiv.org/search/?searchtype=author&query=Dinara Aliyeva), [Shu Yang](https://arxiv.org/search/?searchtype=author&query=Shu Yang), [Zixuan Wen](https://arxiv.org/search/?searchtype=author&query=Zixuan Wen), [Duy Duong-Tran](https://arxiv.org/search/?searchtype=author&query=Duy Duong-Tran), [Christos Davatzikos](https://arxiv.org/search/?searchtype=author&query=Christos Davatzikos), [Tianlong Chen](https://arxiv.org/search/?searchtype=author&query=Tianlong Chen), [Song Wang](https://arxiv.org/search/?searchtype=author&query=Song Wang), [Li Shen](https://arxiv.org/search/?searchtype=author&query=Li Shen) 作者:Farica Zhuang、Dinara Aliyeva、Shu Yang、Zixuan Wen、Duy Duong-Tran、Christos Davatzikos、Tianlong Chen、Song Wang、Li Shen

Accurate and early diagnosis of Alzheimer’s disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models meso-scale brain regions in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging. 通过整合来自多模态的互补信息以反映临床实践,阿尔茨海默病(AD)的精确与早期诊断可以受益。然而,传统的融合方法通常依赖于简单的特征拼接,无法自适应地平衡诸如淀粉样蛋白 PET 和 MRI 等生物标志物在不同脑区域的贡献。在本工作中,我们提出了 MREF-AD,一种用于 AD 诊断的多模态区域专家融合模型(Multimodal Regional Expert Fusion)。它是一个专家混合(Mixture-of-Experts, MoE)框架,将每种模态中的中尺度脑区建模为独立专家,并采用两级门控网络来学习个体化的融合权重。除提高诊断性能外,MREF-AD 还能在模态和区域层面上提供结构成像与分子成像如何共同促进疾病诊断的见解。使用阿尔茨海默病神经影像学计划(ADNI)的数据,MREF-AD 在超越基线方法的同时实现了最先进的性能,并提供了对特定脑区生物标志物相关性的增强可解释性,凸显了其作为一种在神经影像学中用于自适应且可解释的多模态融合的一般框架的实用性。
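A minimal sketch of the two-level gated fusion described above: region-level experts within each modality are mixed by a region gate, then modality scores are mixed by a subject-level gate. All expert logits and gate inputs below are illustrative, not ADNI-derived.

```python
# Two-level Mixture-of-Experts fusion: softmax gates weight regional
# experts per modality, then weight modalities per subject.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-region expert scores for a diagnosis, one array per modality.
experts = {
    "MRI": np.array([0.2, 0.9, 0.4]),   # 3 regional experts
    "PET": np.array([0.8, 0.1, 0.6]),
}
region_gates = {"MRI": softmax(np.array([0.5, 2.0, 0.1])),
                "PET": softmax(np.array([1.5, 0.2, 0.9]))}
modality_gate = softmax(np.array([0.7, 1.3]))   # subject-specific weights

modal_scores = np.array([region_gates[m] @ experts[m] for m in ("MRI", "PET")])
score = float(modality_gate @ modal_scores)
print(round(score, 3))
```

The gate weights double as the interpretability signal: they expose which regions and which modality drove the subject's score.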

Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Image and Video Processing 主题:机器学习、人工智能、计算机视觉与模式识别、图像与视频处理

Publish: 2025-11-30 02:12:12 UTC 发布:2025-11-30 02:12:12 UTC

87 Emotion-Driven Personalized Recommendation for AI-Generated Content Using Multi-Modal Sentiment and Intent Analysis 87 基于情感驱动的 AI 生成内容个性化推荐:使用多模态情感与意图分析

Authors: [Zheqi Hu](https://arxiv.org/search/?searchtype=author&query=Zheqi Hu), [Xuanjing Chen](https://arxiv.org/search/?searchtype=author&query=Xuanjing Chen), [Jinlin Hu](https://arxiv.org/search/?searchtype=author&query=Jinlin Hu) 作者:Zheqi Hu、Xuanjing Chen、Jinlin Hu

With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users’ real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model’s effectiveness in aligning AI-generated content with users’ affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences. 
随着音乐、视频和文学等领域中人工智能生成内容(AIGC)的快速增长,具备情感感知能力的推荐系统需求日益增加。传统推荐系统主要依赖用户的行为数据(如点击、浏览或评分),却忽视了用户在与内容交互时的实时情感和意图状态。为了解决这一局限性,本研究提出了一种基于 BERT 的跨模态 Transformer 并采用基于注意力的融合的多模态情感与意图识别模型(MMEI),并将其集成到云原生的个性化 AIGC 推荐框架中。该系统通过预训练编码器 ViT、Wav2Vec2 和 BERT 联合处理视觉(面部表情)、听觉(语音语调)和文本(评论或话语)模态,然后通过基于注意力的融合模块学习情感—意图表示。这些嵌入随后通过上下文匹配层用于驱动个性化内容推荐。 在基准情感数据集(AIGC-INT、MELD 和 CMU-MOSEI)以及一个 AIGC 交互数据集上进行的实验表明,所提出的 MMEI 模型相比最佳基于融合的 Transformer 基线在 F1 分数上提高了 4.3%,在交叉熵损失上减少了 12.3%。此外,基于用户层面的在线评估显示,情感驱动的推荐将参与时长提升了 15.2%,并将满意度得分提高了 11.8%,验证了该模型在将 AI 生成内容与用户的情感及意图状态对齐方面的有效性。这项工作突出了跨模态情感智能在下一代 AIGC 生态系统中的潜力,使推荐体验具备自适应、富有同理心并且具备上下文感知能力。

Subjects: Information Retrieval, Artificial Intelligence, Human-Computer Interaction, Machine Learning, Multimedia 主题:信息检索、人工智能、人机交互、机器学习、多媒体

Publish: 2025-11-25 17:52:22 UTC 发布:2025-11-25 17:52:22 UTC

88 Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering 88 可扩展的数据合成用于具备步骤级过滤的计算机使用代理

Authors: [Yifei He](https://arxiv.org/search/?searchtype=author&query=Yifei He), [Pranit Chawla](https://arxiv.org/search/?searchtype=author&query=Pranit Chawla), [Yaser Souri](https://arxiv.org/search/?searchtype=author&query=Yaser Souri), [Subhojit Som](https://arxiv.org/search/?searchtype=author&query=Subhojit Som), [Xia Song](https://arxiv.org/search/?searchtype=author&query=Xia Song) 作者:何一飞,Pranit Chawla,Yaser Souri,Subhojit Som,宋霞

Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions constituting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI’s computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight reward model (StepRM) as practical tools to advance robust and efficient CUAs.
计算机使用代理(CUA)能够操作现实世界的数字界面,但由于图形用户界面(GUI)交互成本高昂且高质量轨迹数据稀缺,训练仍然困难。现有数据集依赖人工示范,限制了可扩展性。一个自然的替代方案是从强大的 CUA 合成数据,然而它们的回放(rollouts)噪声极大,不正确或次优的动作占了很大比例的步骤,使得简单的模仿无效。为应对这一挑战,我们提出了一个可扩展的数据合成流水线,将噪声回放转换为无需人工标注的可靠监督。核心思想是步骤级过滤,对动作逐一评估以仅保留正确步骤,并辅以推理增强以改进规划。使用该流水线,我们构建了 WebSTAR 数据集:由 OpenAI 的 computer-use-preview 模型合成的 13.3K 条轨迹和 100K 条经分级、富含推理的步骤。我们在 WebSTAR 上训练了 Qwen-2.5-VL-Instruct 模型(7B 和 32B)。在 WebVoyager 上,我们的 7B 模型仅通过监督微调就超过了最先进的开源 CUA 模型 UI-TARS-1.5-7B 超过 15%。 在基于步骤级评分的基础上,我们进一步创建了 WebSCORE——一个带有评分的步骤级动作数据集,并训练了 StepRM,这是一个由 o4-mini 蒸馏而来的 7B 多模态奖励模型,其评分质量与 o4-mini 相当,同时在大规模部署时效率远高于后者。我们的结果确立了步骤级过滤作为可扩展 CUA 训练的关键原则,并构建了两个新数据集(WebSTAR、WebSCORE)和一个轻量级奖励模型(StepRM),作为推进健壮且高效 CUA 的实用工具。
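上文的步骤级过滤思想可以用如下极简示意代码表示。其中 `grade_step` 是论文中强评分模型(o4-mini 蒸馏出的 StepRM 所扮演的角色)的假想替身,`Step` 数据结构、阈值与判定逻辑均为示意,并非论文原始实现:

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: str   # screenshot / page-state description
    action: str        # e.g. "click(login_button)"
    reasoning: str     # the model's rationale for this action

def grade_step(step: Step) -> float:
    """Hypothetical judge returning a correctness score in [0, 1].
    In the paper this role is played by a strong grader model."""
    return 0.0 if "error" in step.action else 1.0

def filter_trajectory(steps, threshold=0.5):
    """Keep only steps the judge scores at or above the threshold,
    turning a noisy rollout into reliable step-level supervision."""
    return [s for s in steps if grade_step(s) >= threshold]

rollout = [
    Step("login page", "click(login_button)", "need to sign in"),
    Step("login page", "click(error_popup)", "misclick"),
    Step("home page", "type(search_box, 'news')", "start the search task"),
]
kept = filter_trajectory(rollout)
print(len(kept))  # 2
```

关键在于过滤发生在步骤粒度而非整条轨迹:一条整体失败的 rollout 中仍可回收其中正确的步骤作为监督信号,这正是论文能从噪声回放中构建 WebSTAR 的原因。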

Subjects: Machine Learning, Artificial Intelligence 主题:机器学习,人工智能

Publish: 2025-11-22 23:12:56 UTC 发布:2025-11-22 23:12:56 UTC

89 AI as Cognitive Amplifier: Rethinking Human Judgment in the Age of Generative AI 89 人工智能作为认知放大器:在生成式人工智能时代重新思考人类判断力

Author: [Tao An](https://arxiv.org/search/?searchtype=author&query=Tao An) 作者:Tao An

Through extensive experience training professionals and individual users in AI tool adoption since the GPT-3 era, I have observed a consistent pattern: the same AI tool produces dramatically different results depending on who uses it. While some frame AI as a replacement for human intelligence, and others warn of cognitive decline, this position paper argues for a third perspective grounded in practical observation: AI as a cognitive amplifier that magnifies existing human capabilities rather than substituting for them. Drawing on research in human-computer interaction, cognitive augmentation theory, and educational technology, alongside field observations from corporate training across writing, software development, and data analysis domains, I present a framework positioning AI tools as intelligence amplification systems where output quality depends fundamentally on user expertise and judgment. Through analysis of empirical studies on expert-novice differences and systematic observations from professional training contexts, I demonstrate that domain knowledge, quality judgment, and iterative refinement capabilities create substantial performance gaps between users. I propose a three-level model of AI engagement – from passive acceptance through iterative collaboration to cognitive direction – and argue that the transition between levels requires not technical training but development of domain expertise and metacognitive skills. This position has critical implications for workforce development and AI system design. Rather than focusing solely on AI literacy or technical prompt engineering, I advocate for integrated approaches that strengthen domain expertise, evaluative judgment, and reflective practice. 
自从 GPT-3 时代起,我在培训专业人士和个人用户采用 AI 工具方面积累了大量经验,并观察到一个一致的模式:相同的 AI 工具在不同使用者手中会产生截然不同的结果。有人将 AI 描述为对人类智力的替代,也有人警告会导致认知能力衰退,而本文则主张第三种基于实际观察的观点:将 AI 视为认知放大器,它放大的是现有的人类能力,而不是取代它们。本文结合人机交互、认知增强理论和教育技术领域的研究,以及在写作、软件开发和数据分析等企业培训中的实地观察,提出了一个框架,将 AI 工具定位为智能放大系统,其输出质量在根本上取决于使用者的专业知识和判断力。通过对专家与新手差异的实证研究分析以及来自职业培训环境的系统观察,我论证了领域知识、质量判断和迭代完善能力会在使用者之间造成显著的性能差距。 我提出了一个三层次的人工智能参与模型——从被动接受、通过迭代协作到认知引导——并论证了层级间的转变所需的并非技术训练,而是领域专长与元认知技能的发展。该立场对劳动力培养与人工智能系统设计具有重要影响。我主张采用综合方法,强化领域专长、评估判断力与反思实践,而不是仅仅关注人工智能素养或技术性提示工程。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能

Publish: 2025-10-30 11:55:34 UTC 发布:2025-10-30 11:55:34 UTC

90 Measuring skill-based uplift from AI in a real biological laboratory 90 在真实生物实验室中测量人工智能带来的基于技能的提升

Authors: [Ethan Obie Romero-Severson](https://arxiv.org/search/?searchtype=author&query=Ethan Obie Romero-Severson), [Tara Harvey](https://arxiv.org/search/?searchtype=author&query=Tara Harvey), [Nick Generous](https://arxiv.org/search/?searchtype=author&query=Nick Generous), [Phillip M. Mach](https://arxiv.org/search/?searchtype=author&query=Phillip M. Mach) 作者:Ethan Obie Romero-Severson、Tara Harvey、Nick Generous、Phillip M. Mach

Understanding how AI systems are used by people in real situations that mirror aspects of both legitimate and illegitimate use is key to predicting the risks and benefits of AI systems. This is especially true in biological applications, where skill rather than knowledge is often the primary barrier for an untrained person. The challenge is that these studies are difficult to execute well and can take months to plan and run. Here we report the results of a pilot study that attempted to empirically measure the magnitude of *skills-based uplift* caused by access to an AI reasoning model, compared with a control group that had only internet access. Participants – drawn from a diverse pool of Los Alamos National Laboratory employees with no prior wet-lab experience – were asked to transform E. coli with a provided expression construct, induce expression of a reporter peptide, and have expression confirmed by mass spectrometry. We recorded quantitative outcomes (e.g., successful completion of experimental segments) and qualitative observations about how participants interacted with the AI system, the internet, laboratory equipment, and one another. We present the results of the study and lessons learned in designing and executing this type of study, and we discuss these results in the context of future studies of the evolving relationship between AI and global biosecurity. 了解人们在现实情境中如何使用人工智能系统——这些情境既可能包含合法使用的方面,也可能包含不当使用的方面——对于预测人工智能系统的风险与收益至关重要。在生物应用中尤为如此,因为对于未经培训的人来说,技能而非知识通常是主要障碍。挑战在于,这类研究难以高质量执行,并且可能需要数月时间来策划与实施。本文报告了一项试点研究的结果,该研究试图通过实证测量接触人工智能推理模型所带来的“基于技能的提升”(skills-based uplift)的幅度,并将其与仅能访问互联网的对照组进行比较。参与者来自洛斯阿拉莫斯国家实验室的一支多样化员工群体,均无先前湿实验室经验——他们被要求用提供的表达构建体改造大肠杆菌(E. coli),诱导报告肽的表达,并通过质谱确认表达。我们记录了量化结果(例如,实验环节的成功完成情况)以及关于参与者如何与人工智能系统、互联网、实验室设备和彼此互动的定性观察。

Subjects: Human-Computer Interaction, Artificial Intelligence 主题:人机交互,人工智能

Publish: 2025-10-29 16:34:57 UTC 发布:2025-10-29 16:34:57 UTC
