2025-09-17 Research Updates
1. Source Data
1.1 Media
From: QbitAI (量子位), Synced (机器之心), AI Era (新智元), AGI Hunt, Xiaohongshu (小红书), X, and others
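6.1B matches a 40B dense model: Ant Group open-sources its latest MoE model, Ling-flash-2.0
Tencent AI Lab introduces Parallel-R1, the first RL framework to teach large models "parallel thinking"
This is the first framework to teach large models parallel thinking on general mathematical reasoning tasks through reinforcement learning (RL). With a novel "progressive curriculum" and "alternating reward" design, it successfully addresses the cold-start and reward-design problems in RL training.
- Paper title: Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
- Paper: https://arxiv.org/abs/2509.07980
- Project: https://github.com/zhengkid/Parallel-R1 (Coming Soon)
- Project page: https://zhengkid.github.io/Parallel_R1.github.io/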
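The LLM open-source 2.0 shake-up: 60 projects out, 39 at the table, AI coding goes wild, TensorFlow is dead
Breaking the ceiling of single-chain thinking: a Tsinghua team proposes a native "parallel thinking" scaling paradigm
A recent research paper from AIR: "ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute"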
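Who says scaling laws have hit a wall? New research: a small gain at every step compounds into exponential growth
Although scaling laws show diminishing returns for LLMs on metrics such as test loss, a model's real-world value often comes from the length of the tasks an agent can complete. From this perspective, larger models do not show diminishing returns; instead, small gains in single-step accuracy compound into exponential growth in the length of tasks that can be completed.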
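The compounding effect can be illustrated with a small sketch. It assumes, as a simplification not taken from the study itself, that every step succeeds independently with the same probability p, so a task of H steps succeeds with probability p**H. The function name `horizon_length` and the 50% success threshold are illustrative choices.

```python
import math


def horizon_length(step_accuracy: float, success_threshold: float = 0.5) -> float:
    """Largest task length H (in steps) for which step_accuracy**H >= success_threshold.

    Assumes independent steps with identical accuracy (a simplifying assumption).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_threshold < 1.0:
        raise ValueError("success_threshold must be strictly between 0 and 1")
    # Solve p**H = threshold for H.
    return math.log(success_threshold) / math.log(step_accuracy)


if __name__ == "__main__":
    for p in (0.90, 0.99, 0.999):
        print(f"step accuracy {p}: horizon of about {horizon_length(p):.1f} steps")
```

Under this assumption, raising per-step accuracy from 0.99 to 0.999 (less than one percentage point) lengthens the achievable task roughly tenfold, which is the kind of amplification the summary describes.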
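From few-shot to thousand-shot: MachineLearningLM gives LLM in-context learning a "machine learning engine"
**A new study called MachineLearningLM breaks through this bottleneck.** It proposes a lightweight, portable "continued pretraining" framework that learns directly from more than a thousand in-context examples without downstream fine-tuning. On binary and multi-class classification tasks in finance, health, bioinformatics, physics, and other domains, its accuracy significantly exceeds that of the baseline model (Qwen-2.5-7B-Instruct) and the recently released GPT-5-mini.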
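Ending the data drought: BAAI open-sources InfoSeek, the first data synthesis framework for Deep Research
The work reveals a mathematical equivalence between deep research problems and Hierarchical Constraint Satisfaction Problems, and on that basis proposes a data synthesis method built on a "diffusion-backtracking" process, enabling large-scale automatic expansion of deep research training data.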
1.2 Arxiv
1.2.1 Computation and Language
From: https:// /arxiv/cs.CL and https://arxiv.org/list/cs.CL/recent
2025-09-17 | | Total: 69
#1 Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Authors: [Millicent Li](https://arxiv.org/search/?searchtype=author&query=Millicent Li), [Alberto Mario Ceballos Arroyo](https://arxiv.org/search/?searchtype=author&query=Alberto Mario Ceballos Arroyo), [Giordano Rogers](https://arxiv.org/search/?searchtype=author&query=Giordano Rogers), [Naomi Saphra](https://arxiv.org/search/?searchtype=author&query=Naomi Saphra), [Byron C. Wallace](https://arxiv.org/search/?searchtype=author&query=Byron C. Wallace)
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
Subjects: Computation and Language, Machine Learning
Publish: 2025-09-16 17:59:04 UTC
#2 ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Authors: [Xixi Wu](https://arxiv.org/search/?searchtype=author&query=Xixi Wu), [Kuan Li](https://arxiv.org/search/?searchtype=author&query=Kuan Li), [Yida Zhao](https://arxiv.org/search/?searchtype=author&query=Yida Zhao), [Liwen Zhang](https://arxiv.org/search/?searchtype=author&query=Liwen Zhang), [Litu Ou](https://arxiv.org/search/?searchtype=author&query=Litu Ou), [Huifeng Yin](https://arxiv.org/search/?searchtype=author&query=Huifeng Yin), [Zhongwang Zhang](https://arxiv.org/search/?searchtype=author&query=Zhongwang Zhang), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Minhao Cheng](https://arxiv.org/search/?searchtype=author&query=Minhao Cheng), [Shuai Wang](https://arxiv.org/search/?searchtype=author&query=Shuai Wang), [Hong Cheng](https://arxiv.org/search/?searchtype=author&query=Hong Cheng), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
Subject: Computation and Language
Publish: 2025-09-16 17:57:22 UTC
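The idea of periodic context summarization described in the ReSum abstract can be sketched as a plain agent loop. This is only an illustration of the general pattern, not the authors' implementation: the callables `step` and `summarize`, the `ANSWER:` prefix convention, and the character-based budget are all hypothetical stand-ins for an LLM call, a summary model, and a token budget.

```python
from typing import Callable


def run_with_periodic_summary(
    question: str,
    step: Callable[[str], str],
    summarize: Callable[[str], str],
    max_context_chars: int,
    max_steps: int,
) -> str:
    """Run an agent loop, compressing the history whenever it exceeds the budget.

    `step` maps the current context to the next observation or a final answer
    (marked by the hypothetical prefix "ANSWER:"); `summarize` maps a long
    context to a compact reasoning state.
    """
    context = question
    for _ in range(max_steps):
        observation = step(context)
        if observation.startswith("ANSWER:"):
            return observation
        context = context + "\n" + observation
        if len(context) > max_context_chars:
            # Replace the full history with a compact state; keep the question.
            context = question + "\n[summary] " + summarize(context)
    return "ANSWER: (none within step limit)"
```

Because the context is reset to the question plus a summary each time the budget is exceeded, the loop can keep exploring for as many steps as `max_steps` allows without the history growing unboundedly.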
#3 WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Authors: [Zijian Li](https://arxiv.org/search/?searchtype=author&query=Zijian Li), [Xin Guan](https://arxiv.org/search/?searchtype=author&query=Xin Guan), [Bo Zhang](https://arxiv.org/search/?searchtype=author&query=Bo Zhang), [Shen Huang](https://arxiv.org/search/?searchtype=author&query=Shen Huang), [Houquan Zhou](https://arxiv.org/search/?searchtype=author&query=Houquan Zhou), [Shaopeng Lai](https://arxiv.org/search/?searchtype=author&query=Shaopeng Lai), [Ming Yan](https://arxiv.org/search/?searchtype=author&query=Ming Yan), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Jun Zhang](https://arxiv.org/search/?searchtype=author&query=Jun Zhang), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like “loss in the middle” and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.
Subject: Computation and Language
Publish: 2025-09-16 17:57:21 UTC
#4 Towards General Agentic Intelligence via Environment Scaling
Authors: [Runnan Fang](https://arxiv.org/search/?searchtype=author&query=Runnan Fang), [Shihao Cai](https://arxiv.org/search/?searchtype=author&query=Shihao Cai), [Baixuan Li](https://arxiv.org/search/?searchtype=author&query=Baixuan Li), [Jialong Wu](https://arxiv.org/search/?searchtype=author&query=Jialong Wu), [Guangyu Li](https://arxiv.org/search/?searchtype=author&query=Guangyu Li), [Wenbiao Yin](https://arxiv.org/search/?searchtype=author&query=Wenbiao Yin), [Xinyu Wang](https://arxiv.org/search/?searchtype=author&query=Xinyu Wang), [Xiaobin Wang](https://arxiv.org/search/?searchtype=author&query=Xiaobin Wang), [Liangcai Su](https://arxiv.org/search/?searchtype=author&query=Liangcai Su), [Zhen Zhang](https://arxiv.org/search/?searchtype=author&query=Zhen Zhang), [Shibin Wu](https://arxiv.org/search/?searchtype=author&query=Shibin Wu), [Zhengwei Tao](https://arxiv.org/search/?searchtype=author&query=Zhengwei Tao), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which requires agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
Subject: Computation and Language
Publish: 2025-09-16 17:57:20 UTC
#5 Scaling Agents via Continual Pre-training
Authors: [Liangcai Su](https://arxiv.org/search/?searchtype=author&query=Liangcai Su), [Zhen Zhang](https://arxiv.org/search/?searchtype=author&query=Zhen Zhang), [Guangyu Li](https://arxiv.org/search/?searchtype=author&query=Guangyu Li), [Zhuo Chen](https://arxiv.org/search/?searchtype=author&query=Zhuo Chen), [Chenxi Wang](https://arxiv.org/search/?searchtype=author&query=Chenxi Wang), [Maojia Song](https://arxiv.org/search/?searchtype=author&query=Maojia Song), [Xinyu Wang](https://arxiv.org/search/?searchtype=author&query=Xinyu Wang), [Kuan Li](https://arxiv.org/search/?searchtype=author&query=Kuan Li), [Jialong Wu](https://arxiv.org/search/?searchtype=author&query=Jialong Wu), [Xuanzhong Chen](https://arxiv.org/search/?searchtype=author&query=Xuanzhong Chen), [Zile Qiao](https://arxiv.org/search/?searchtype=author&query=Zile Qiao), [Zhongwang Zhang](https://arxiv.org/search/?searchtype=author&query=Zhongwang Zhang), [Huifeng Yin](https://arxiv.org/search/?searchtype=author&query=Huifeng Yin), [Shihao Cai](https://arxiv.org/search/?searchtype=author&query=Shihao Cai), [Runnan Fang](https://arxiv.org/search/?searchtype=author&query=Runnan Fang), [Zhengwei Tao](https://arxiv.org/search/?searchtype=author&query=Zhengwei Tao), [Wenbiao Yin](https://arxiv.org/search/?searchtype=author&query=Wenbiao Yin), [Chenxiong Qian](https://arxiv.org/search/?searchtype=author&query=Chenxiong Qian), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
Subject: Computation and Language
Publish: 2025-09-16 17:57:19 UTC
#6 WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
Authors: [Zile Qiao](https://arxiv.org/search/?searchtype=author&query=Zile Qiao), [Guoxin Chen](https://arxiv.org/search/?searchtype=author&query=Guoxin Chen), [Xuanzhong Chen](https://arxiv.org/search/?searchtype=author&query=Xuanzhong Chen), [Donglei Yu](https://arxiv.org/search/?searchtype=author&query=Donglei Yu), [Wenbiao Yin](https://arxiv.org/search/?searchtype=author&query=Wenbiao Yin), [Xinyu Wang](https://arxiv.org/search/?searchtype=author&query=Xinyu Wang), [Zhen Zhang](https://arxiv.org/search/?searchtype=author&query=Zhen Zhang), [Baixuan Li](https://arxiv.org/search/?searchtype=author&query=Baixuan Li), [Huifeng Yin](https://arxiv.org/search/?searchtype=author&query=Huifeng Yin), [Kuan Li](https://arxiv.org/search/?searchtype=author&query=Kuan Li), [Rui Min](https://arxiv.org/search/?searchtype=author&query=Rui Min), [Minpeng Liao](https://arxiv.org/search/?searchtype=author&query=Minpeng Liao), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
Subject: Computation and Language
Publish: 2025-09-16 17:57:17 UTC
#7 ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
Authors: [Ali Salamatian](https://arxiv.org/search/?searchtype=author&query=Ali Salamatian), [Amirhossein Abaskohi](https://arxiv.org/search/?searchtype=author&query=Amirhossein Abaskohi), [Wan-Cyuan Fan](https://arxiv.org/search/?searchtype=author&query=Wan-Cyuan Fan), [Mir Rayat Imtiaz Hossain](https://arxiv.org/search/?searchtype=author&query=Mir Rayat Imtiaz Hossain), [Leonid Sigal](https://arxiv.org/search/?searchtype=author&query=Leonid Sigal), [Giuseppe Carenini](https://arxiv.org/search/?searchtype=author&query=Giuseppe Carenini)
Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
Subjects: Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
Publish: 2025-09-16 17:35:39 UTC
#8 Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Authors: [Jianfeng Zhu](https://arxiv.org/search/?searchtype=author&query=Jianfeng Zhu), [Julina Maharjan](https://arxiv.org/search/?searchtype=author&query=Julina Maharjan), [Xinyu Li](https://arxiv.org/search/?searchtype=author&query=Xinyu Li), [Karin G. Coifman](https://arxiv.org/search/?searchtype=author&query=Karin G. Coifman), [Ruoming Jin](https://arxiv.org/search/?searchtype=author&query=Ruoming Jin)
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM “personas” using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning. 
Subject: Computation and Language
Publish: 2025-09-16 16:54:35 UTC
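The headline metric in the abstract above is the Pearson correlation between predicted and ground-truth trait scores. For reference, a minimal pure-Python sketch of that statistic follows; the function name `pearson_r` and the sample scores in the usage block are illustrative and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / len(predicted)
    mean_a = sum(actual) / len(actual)
    covariance = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return covariance / math.sqrt(var_p * var_a)


if __name__ == "__main__":
    # Made-up trait scores, for illustration only.
    model_scores = [3.1, 2.8, 3.5, 3.0, 2.9]
    questionnaire_scores = [2.0, 4.1, 3.3, 1.8, 3.9]
    print(f"r = {pearson_r(model_scores, questionnaire_scores):.3f}")
```

A value of 1 means the predictions rank and scale with the ground truth perfectly, 0 means no linear relationship, so correlations below 0.26 indicate only weak agreement.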
#9 The Few-shot Dilemma: Over-prompting Large Language Models
Authors: [Yongjian Tang](https://arxiv.org/search/?searchtype=author&query=Yongjian Tang), [Doruk Tuncel](https://arxiv.org/search/?searchtype=author&query=Doruk Tuncel), [Christian Koerner](https://arxiv.org/search/?searchtype=author&query=Christian Koerner), [Thomas Runkler](https://arxiv.org/search/?searchtype=author&query=Thomas Runkler)
Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
Subject: Computation and Language
Publish: 2025-09-16 16:00:06 UTC
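TF-IDF-based few-shot selection, one of the three selection methods named in the abstract above, can be sketched in a few lines of standard-library Python. This is a generic illustration rather than the paper's pipeline: it uses whitespace tokenization and one common smoothed IDF variant, omits the stratification step, and the names `tfidf_vectors`, `cosine`, and `select_few_shot` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using whitespace tokens and a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot(query: str, pool: List[str], k: int) -> List[str]:
    """Return the k pool examples most similar to the query under TF-IDF."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]
```

The parameter `k` is the quantity the abstract suggests tuning per model: increasing it gradually and stopping once accuracy no longer improves is one way to avoid over-prompting.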
#10 LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals
Authors: [Jinxin Li](https://arxiv.org/search/?searchtype=author&query=Jinxin Li), [Gang Tu](https://arxiv.org/search/?searchtype=author&query=Gang Tu), [ShengYu Cheng](https://arxiv.org/search/?searchtype=author&query=ShengYu Cheng), [Junjie Hu](https://arxiv.org/search/?searchtype=author&query=Junjie Hu), [Jinting Wang](https://arxiv.org/search/?searchtype=author&query=Jinting Wang), [Rui Chen](https://arxiv.org/search/?searchtype=author&query=Rui Chen), [Zhilong Zhou](https://arxiv.org/search/?searchtype=author&query=Zhilong Zhou), [Dongbo Shan](https://arxiv.org/search/?searchtype=author&query=Dongbo Shan)
Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
Subject: Computation and Language
Publish: 2025-09-16 15:08:19 UTC
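The spectral feature named in the HSAD abstract, the strongest non-DC frequency component, can be illustrated on a toy signal. The sketch below is not the authors' code: it uses a naive O(n^2) discrete Fourier transform from the standard library in place of an FFT (the two produce the same spectrum), and the input is a hypothetical one-dimensional sequence of sampled activation values.

```python
import cmath
import math
from typing import List, Tuple


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n//2 for a real-valued signal (naive DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def strongest_non_dc(signal: List[float]) -> Tuple[int, float]:
    """Return (bin index, magnitude) of the largest component, skipping bin 0 (DC)."""
    magnitudes = dft_magnitudes(signal)
    if len(magnitudes) < 2:
        raise ValueError("signal needs at least 2 samples")
    best = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return best, magnitudes[best]


if __name__ == "__main__":
    # Toy "activation" signal: a constant offset plus 3 cycles over 16 samples.
    toy = [5.0 + math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
    print(strongest_non_dc(toy))
```

Skipping bin 0 matters because the constant offset (the mean activation level) would otherwise dominate the spectrum and hide the oscillation of interest.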
#11 Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
Authors: [Sijia Cui](https://arxiv.org/search/?searchtype=author&query=Sijia Cui), [Shuai Xu](https://arxiv.org/search/?searchtype=author&query=Shuai Xu), [Aiyao He](https://arxiv.org/search/?searchtype=author&query=Aiyao He), [Yanna Wang](https://arxiv.org/search/?searchtype=author&query=Yanna Wang), [Bo Xu](https://arxiv.org/search/?searchtype=author&query=Bo Xu)
Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.
Subject: Computation and Language
Publish: 2025-09-16 14:36:30 UTC
#12 Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO 塑造解释:面向 GRPO 的纯编码器 Transformer 语义奖励建模
Authors: [Francesco Pappone](https://arxiv.org/search/?searchtype=author&query=Francesco Pappone), [Ruggero Marino Lazzaroni](https://arxiv.org/search/?searchtype=author&query=Ruggero Marino Lazzaroni), [Federico Califano](https://arxiv.org/search/?searchtype=author&query=Federico Califano), [Niccolò Gentile](https://arxiv.org/search/?searchtype=author&query=Niccolò Gentile), [Roberto Marras](https://arxiv.org/search/?searchtype=author&query=Roberto Marras)
While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks. 虽然大型语言模型 (LLM) 擅长生成类似人类的文本,但使其输出与教学合理性等复杂的定性目标保持一致仍然是一个重大挑战。标准强化学习技术通常依赖于缓慢而昂贵的 LLM 作为评判者(LLM-as-a-judge)的评估,或者依赖于脆弱的、基于关键字的指标(如 ROUGE),这些指标无法捕捉高质量解释的语义本质。在这项工作中,我们引入了一种在群体相对策略优化(GRPO)框架内进行奖励塑造的新方法。我们的核心贡献是使用一个小型、高效的纯编码器 Transformer 作为语义奖励模型。该模型基于生成的解释与标准参考答案之间的余弦相似度,提供密集的、语义丰富的奖励信号,引导策略生成不仅事实正确,而且在结构和概念上与专家推理一致的解释。我们将这种方法应用于为意大利医学院入学考试训练模型的任务,训练前先进行标准的领域自适应持续预训练 (CPT) 和监督微调 (SFT)。我们的结果表明,与强大的 SFT 基线相比,采用我们提出的语义奖励的 GRPO 显著提高了解释的忠实度和清晰度,展示了在复杂生成任务中使用轻量级编码器模型进行细致奖励塑造的能力。
Subjects: Computation and Language, Artificial Intelligence 科目 : 计算与语言, 人工智能
Publish: 2025-09-16 13:39:29 UTC 发布时间 : 2025-09-16 13:39:29 UTC
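The reward described in the abstract of #12 (cosine similarity between a generated explanation and a reference, used inside GRPO) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are hand-written toy vectors standing in for the output of an encoder-only model, and the function names are hypothetical. The group-relative advantage uses the common GRPO normalisation (reward minus group mean, divided by group standard deviation).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(candidate_embeddings, reference_embedding):
    """One reward per sampled explanation: similarity to the reference embedding."""
    return [cosine_similarity(e, reference_embedding) for e in candidate_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embeddings (assumption): in practice these would come from the encoder.
reference = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, only the ranking and spread of similarities within one batch of samples matter, not their absolute scale.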
#13 Multi-Model Synthetic Training for Mission-Critical Small Language Models 关键任务小型语言模型的多模型合成训练
Authors: [Nolan Platt](https://arxiv.org/search/?searchtype=author&query=Nolan Platt), [Pragyansmita Nayak](https://arxiv.org/search/?searchtype=author&query=Pragyansmita Nayak)
Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models, when fine-tuned properly, can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries. 大型语言模型(LLM)在许多领域都表现出了卓越的能力,但它们在专业领域的应用仍然受到特定领域训练数据的稀缺性和复杂性的限制。我们提出了一种新颖的方法,通过将 LLM 用作一次性教师而不是直接将其用于推理,可以将海事情报的成本降低 261 倍。该方法通过多模型生成(GPT-4o 和 o3-mini)将 32 亿条自动识别系统(AIS)船舶跟踪记录转换为 21,543 个合成问答对,防止过拟合并确保推理准确。由此产生的微调 Qwen2.5-7B 模型在海事任务上实现了 75% 的准确率,同时比使用更大的模型进行推理要便宜得多。我们表明,与昂贵得令人望而却步的大型模型相比,更小、更便宜的模型(如果微调得当)可以提供类似的准确率。我们的工作为不断发展的专业人工智能应用合成数据集生成领域做出了贡献,并为手动标注不可行的领域提供了一个高度可复现的框架。除了在不断发展的专业小语言模型领域扩展研究外,我们的方法还可直接应用于各行各业的海上安全、安保行动和船舶交通管理系统。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 科目 : 计算与语言, 人工智能, 机器学习
Publish: 2025-09-16 13:04:48 UTC 发布时间 : 2025-09-16 13:04:48 UTC
#14 SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data SitLLM:通过压力传感器数据理解坐姿健康的大型语言模型
Authors: [Jian Gao](https://arxiv.org/search/?searchtype=author&query=Jian Gao), [Fufangchen Zhao](https://arxiv.org/search/?searchtype=author&query=Fufangchen Zhao), [Yiyang Zhang](https://arxiv.org/search/?searchtype=author&query=Yiyang Zhang), [Danfeng Yan](https://arxiv.org/search/?searchtype=author&query=Danfeng Yan)
Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM’s semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension. 不良的坐姿是导致长期肌肉骨骼疾病和生理功能障碍的关键但经常被忽视的因素。现有的坐姿监测系统虽然利用了视觉、IMU 或基于压力的模态,但通常存在识别粒度粗的问题,并且缺乏个性化反馈所需的语义表达能力。在本文中,我们提出了 SitLLM,这是一个轻量级的多模态框架,它将柔性压力传感与大型语言模型(LLM)集成在一起,以实现细粒度的姿势理解和个性化的健康导向响应生成。SitLLM 由三个关键组件组成:(1)高斯鲁棒传感器嵌入模块(Gaussian-Robust Sensor Embedding Module),它将压力图划分为空间块,并注入局部噪声扰动以进行鲁棒特征提取;(2)提示驱动的跨模态对齐模块(Prompt-Driven Cross-Modal Alignment Module),它使用预训练的词表嵌入,通过多头交叉注意力将传感器嵌入重编程到 LLM 的语义空间中;(3)多上下文提示模块(Multi-Context Prompt Module),它融合了特征级、结构级、统计级和语义级的上下文信息来指导指令理解。
Subject: Computation and Language 主题 : 计算和语言
Publish: 2025-09-16 12:06:05 UTC 发布时间 : 2025-09-16 12:06:05 UTC
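The first SitLLM component in #14 (partitioning a pressure map into spatial patches and injecting local noise) can be pictured with a small sketch. The abstract does not give the patch size, the noise scale, or how the noise is applied, so `patch_size`, `sigma`, and both function names below are assumptions chosen only for illustration.

```python
import random


def partition_into_patches(pressure_map, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    rows, cols = len(pressure_map), len(pressure_map[0])
    if rows % patch_size or cols % patch_size:
        raise ValueError("grid dimensions must be divisible by patch_size")
    patches = []
    for r in range(0, rows, patch_size):
        for c in range(0, cols, patch_size):
            patch = [
                pressure_map[r + i][c + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches


def add_gaussian_noise(patches, sigma, rng):
    """Perturb every sensor reading with zero-mean Gaussian noise of scale sigma."""
    return [[value + rng.gauss(0.0, sigma) for value in patch] for patch in patches]


# Toy 4x4 pressure map with readings 0..15 (assumption, not real sensor data).
pressure_map = [[4 * r + c for c in range(4)] for r in range(4)]
patches = partition_into_patches(pressure_map, patch_size=2)
noisy_patches = add_gaussian_noise(patches, sigma=0.1, rng=random.Random(0))
```

Each flattened patch would then be fed to an embedding layer; that step is omitted because the abstract does not describe it.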
#15 Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews LLM 是否理解跨文化的葡萄酒描述符?葡萄酒评论文化改编的基准
Authors: [Chenye Zou](https://arxiv.org/search/?searchtype=author&query=Chenye Zou), [Xingyue Wen](https://arxiv.org/search/?searchtype=author&query=Xingyue Wen), [Tianyi Hu](https://arxiv.org/search/?searchtype=author&query=Tianyi Hu), [Qian Janice Wang](https://arxiv.org/search/?searchtype=author&query=Qian Janice Wang), [Daniel Hershcovich](https://arxiv.org/search/?searchtype=author&query=Daniel Hershcovich)
Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria – Cultural Proximity, Cultural Neutrality, and Cultural Genuineness – to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content. 大型语言模型 (LLM) 的最新进展为文化感知语言任务打开了大门。我们引入了中英文葡萄酒评论改编的新问题,它超越了直译,结合了地区口味偏好和特定文化的风味描述符。在跨文化葡萄酒评论改编的案例研究中,我们编制了第一个平行的专业评论语料库,包含 8k 中文评论和 16k 英语评论。我们通过自动指标和人工评估对神经机器翻译基线和最先进的 LLM 进行基准测试。对于后者,我们提出了三个以文化为导向的标准——文化接近性、文化中立性和文化真实性——以评估翻译后的评论与目标文化读者的自然共鸣程度。我们的分析表明,当前的模型很难捕捉文化的细微差别,尤其是在翻译不同文化的葡萄酒描述方面。这凸显了翻译模型在处理文化内容方面的挑战和局限性。
Subject: Computation and Language 主题 : 计算和语言
Publish: 2025-09-16 11:10:30 UTC 发布时间 : 2025-09-16 11:10:30 UTC
#16 Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models 研究 ReLoRA:对小型语言模型学习动态的影响
Authors: [Yuval Weiss](https://arxiv.org/search/?searchtype=author&query=Yuval Weiss), [David Demitri Africa](https://arxiv.org/search/?searchtype=author&query=David Demitri Africa), [Paula Buttery](https://arxiv.org/search/?searchtype=author&query=Paula Buttery), [Richard Diehl Martinez](https://arxiv.org/search/?searchtype=author&query=Richard Diehl Martinez)
Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime. LoRA 等参数效率高的方法彻底改变了 LLM 的微调。尽管如此,它们通过 ReLoRA 扩展到预训练的理解还不太清楚,特别是对于小型语言模型 (SLM),它们提供了较低的计算和环境成本。这项工作是首次在 SLM(11M-66M 参数)中对 ReLoRA 进行系统研究,评估性能和学习动态。通过消融实验,我们发现 ReLoRA 在损失、Paloma 困惑度和 BLiMP 方面的表现通常比标准训练差,而对于较大的模型,差距会扩大。对模型学习动态的进一步分析表明,ReLoRA 强化了较小模型中发现的秩缺陷。这些结果表明,低秩更新策略可能不容易转移到 SLM 预训练中,这凸显了在低计算制度中进行更多研究的必要性。
Subjects: Computation and Language, Artificial Intelligence 科目 : 计算与语言, 人工智能
Publish: 2025-09-16 11:06:58 UTC 发布时间 : 2025-09-16 11:06:58 UTC
#17 Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework 从学术论文中自动生成研究工作流程:全文挖掘框架
Authors: [Heng Zhang](https://arxiv.org/search/?searchtype=author&query=Heng Zhang), [Chengzhi Zhang](https://arxiv.org/search/?searchtype=author&query=Chengzhi Zhang)
The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow. 
研究工作流程的自动生成对于提高研究的可重复性和加速“人工智能科学”的范式至关重要。然而,现有方法通常仅提取零散的程序组成部分,因此无法捕获完整的研究工作流程。为了解决这一差距,我们提出了一个端到端框架,通过挖掘全文学术论文来生成全面、结构化的研究工作流程。作为自然语言处理 (NLP) 领域的案例研究,我们以段落为中心的方法首先采用 SciBERT 的正无标记 (PU) 学习来识别工作流程描述性段落,达到 0.9772 的 F1 分数。随后,我们利用 Flan-T5 和提示学习从这些段落中生成工作流短语,产生 ROUGE-1、ROUGE-2 和 ROUGE-L 分数分别为 0.4543、0.2877 和 0.4427。然后,这些短语使用 ChatGPT 进行少量学习,系统地分为数据准备、数据处理和数据分析阶段,达到 0.958 的分类精度。通过将分类短语映射到文档中的文档位置,我们最终生成了整个研究工作流程的可读可视化流程图。这种方法有助于分析源自 NLP 语料库的工作流程,并揭示了过去二十年来的关键方法转变,包括对数据分析的日益重视以及从特征工程到消融研究的过渡。我们的工作为自动化工作流程生成提供了经过验证的技术框架,并为对不断发展的科学范式进行实证研究提供了新颖的、面向过程的视角。源代码和数据可在以下网址获得:https://github.com/ZH-heng/research_workflow。
Subjects: Computation and Language, Digital Libraries, Information Retrieval 科目 : 计算与语言, 数字图书馆, 信息检索
Publish: 2025-09-16 10:59:23 UTC 发布时间 : 2025-09-16 10:59:23 UTC
#18 All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning 条条大路通罗马:基于图的大型语言模型推理置信度估计
Authors: [Caiqi Zhang](https://arxiv.org/search/?searchtype=author&query=Caiqi Zhang), [Chang Shu](https://arxiv.org/search/?searchtype=author&query=Chang Shu), [Ehsan Shareghi](https://arxiv.org/search/?searchtype=author&query=Ehsan Shareghi), [Nigel Collier](https://arxiv.org/search/?searchtype=author&query=Nigel Collier)
Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks. 置信度估计对于大型语言模型 (LLM) 的可靠部署至关重要。现有方法主要为事实 QA 任务而设计,通常无法推广到推理任务。为了解决这一差距,我们提出了一套针对推理任务量身定制的免训练、基于图的置信度估计方法。我们的方法将推理路径建模为有向图,并通过利用中心性、路径收敛和路径加权等图属性来估计置信度。在三个推理数据集上使用两个 LLM 进行的实验表明,两个下游任务的置信度估计得到改进并性能得到增强。
Subjects: Computation and Language, Artificial Intelligence 科目 : 计算与语言, 人工智能
Publish: 2025-09-16 10:02:52 UTC 发布时间 : 2025-09-16 10:02:52 UTC
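The abstract of #18 models reasoning paths as a directed graph and reads confidence off graph properties such as path convergence and centrality. The sketch below shows one simple way to compute such signals from sampled reasoning chains. It is an illustration under assumptions, not the paper's method: the step labels are invented, and both scoring functions are generic choices (share of paths ending at each answer, and normalised in-degree).

```python
from collections import Counter, defaultdict


def build_graph(paths):
    """Count how often each directed edge (step -> next step) occurs across paths."""
    edges = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


def convergence_confidence(paths):
    """Confidence of each final answer = share of sampled paths that end at it."""
    finals = Counter(path[-1] for path in paths)
    total = len(paths)
    return {answer: count / total for answer, count in finals.items()}


def in_degree_centrality(edges):
    """Number of distinct predecessors of each node, normalised by (node count - 1)."""
    nodes = {node for edge in edges for node in edge}
    predecessors = defaultdict(set)
    for src, dst in edges:
        predecessors[dst].add(src)
    denom = max(len(nodes) - 1, 1)
    return {node: len(predecessors[node]) / denom for node in nodes}


# Three hypothetical sampled reasoning chains for one question.
paths = [
    ["q", "a1", "b1", "ans=4"],
    ["q", "a2", "b1", "ans=4"],
    ["q", "a3", "ans=5"],
]
edges = build_graph(paths)
confidence = convergence_confidence(paths)
centrality = in_degree_centrality(edges)
```

Here two of the three chains converge on `ans=4` through the shared step `b1`, so that answer receives the higher convergence score.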
#19 Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings Conan-Embedding-v2:从头开始训练 LLM 进行文本嵌入
Authors: [Shiyu Li](https://arxiv.org/search/?searchtype=author&query=Shiyu Li), [Yang Tang](https://arxiv.org/search/?searchtype=author&query=Yang Tang), [Ruijie Liu](https://arxiv.org/search/?searchtype=author&query=Ruijie Liu), [Shi-Zhe Chen](https://arxiv.org/search/?searchtype=author&query=Shi-Zhe Chen), [Xi Chen](https://arxiv.org/search/?searchtype=author&query=Xi Chen)
Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025). 大型语言模型(LLM)最近在文本嵌入任务中表现出了出色的性能。以前的工作通常使用 LoRA 来微调现有的 LLM,这些 LLM 受到 LLM 和嵌入模型之间的数据和训练差距的限制。在这项工作中,我们介绍了 Conan-embedding-v2,这是一种新的 1.4B 参数 LLM,从头开始训练并作为文本嵌入器进行微调。首先,我们在 LLM 预训练中加入新闻数据和多语言句对,以弥合数据差距。基于此,我们提出了一个跨语言检索数据集,使 LLM 能够更好地整合不同语言的嵌入。其次,LLM 使用带 token 级损失的因果掩码,而嵌入模型使用带句子级损失的双向掩码。这种训练差距使得全参数微调不如 LoRA 有效。我们引入了一种软掩码机制,在这两种类型的掩码之间逐渐过渡,使模型能够学习更全面的表示。基于此,我们提出了一种动态难负例挖掘方法,使模型在整个训练过程中接触到更困难的负例。Conan-embedding-v2 直观有效,仅具有大约 1.4B 的参数,在海量文本嵌入基准测试 (MTEB) 和中文 MTEB 上都实现了 SOTA 性能(2025 年 5 月 19 日)。
Subjects: Computation and Language, Artificial Intelligence 科目 : 计算与语言, 人工智能
Publish: 2025-09-16 09:48:11 UTC 发布时间 : 2025-09-16 09:48:11 UTC
#20 The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations LLM 已经知道:通过隐藏表示估计 LLM 感知的问题难度
Authors: [Yubo Zhu](https://arxiv.org/search/?searchtype=author&query=Yubo Zhu), [Dongrui Liu](https://arxiv.org/search/?searchtype=author&query=Dongrui Liu), [Zecheng Lin](https://arxiv.org/search/?searchtype=author&query=Zecheng Lin), [Wei Tong](https://arxiv.org/search/?searchtype=author&query=Wei Tong), [Sheng Zhong](https://arxiv.org/search/?searchtype=author&query=Sheng Zhong), [Jing Shao](https://arxiv.org/search/?searchtype=author&query=Jing Shao)
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens. 估计大型语言模型 (LLM) 感知的输入问题的难度对于准确的性能评估和自适应推理至关重要。现有方法通常依赖于重复响应采样、辅助模型或对目标模型本身进行微调,这可能会产生大量的计算成本或损害通用性。在本文中,我们提出了一种新的难度估计方法,该方法仅利用目标 LLM 产生的隐藏表示。我们将 token 级生成过程建模为马尔可夫链,并定义一个价值函数来估计给定任何隐藏状态的预期输出质量。这允许仅根据初始隐藏状态进行高效、准确的难度估计,而无需生成任何输出 token。在文本和多模态任务中的广泛实验表明,我们的方法在难度估计方面始终优于现有基线。此外,我们应用难度估计来指导自适应推理策略,包括 Self-Consistency、Best-of-N 和 Self-Refine,以更少的生成 token 实现更高的推理效率。
Subjects: Computation and Language, Artificial Intelligence 科目 : 计算与语言, 人工智能
Publish: 2025-09-16 09:38:41 UTC 发布时间 : 2025-09-16 09:38:41 UTC
#21 Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents 对多媒体文档事件提取的 LVLM 进行基准测试和改进
Authors: [Fuyu Xing](https://arxiv.org/search/?searchtype=author&query=Fuyu Xing), [Zimu Wang](https://arxiv.org/search/?searchtype=author&query=Zimu Wang), [Wei Wang](https://arxiv.org/search/?searchtype=author&query=Wei Wang), [Haiyang Zhang](https://arxiv.org/search/?searchtype=author&query=Haiyang Zhang)
The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities. 多媒体内容的激增需要开发有效的多媒体事件提取 (M2E2) 系统。尽管大型视觉语言模型(LVLM)显示出强大的跨模态能力,但它们在 M2E2 任务中的效用仍未得到充分探索。在本文中,我们首次在 M2E2 数据集上对具有代表性的 LVLM(包括 DeepSeek-VL2 和 Qwen-VL 系列)进行了系统评估。我们的评估涵盖纯文本、纯图像和跨媒体子任务,在少样本提示和微调两种设置下进行。我们的主要发现强调了以下有价值的见解:(1)少样本 LVLM 在视觉任务上表现明显更好,但在文本任务上却表现不佳;(2)用 LoRA 微调 LVLM 可显著增强模型性能;(3)LVLM 在组合模态时表现出强大的协同作用,在跨模态设置中取得了卓越的性能。我们进一步提供了详细的错误分析,以揭示语义精度、定位和跨模态对齐(grounding)等方面的持续挑战,这些仍然是推进 M2E2 能力的关键障碍。
Subjects: Computation and Language, Multimedia 科目 : 计算与语言, 多媒体
Publish: 2025-09-16 09:29:02 UTC 发布时间 : 2025-09-16 09:29:02 UTC
#22 Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data 使用音译和机器翻译的阿拉伯语数据进行马耳他语 NLP 的数据增强
Authors: [Kurt Micallef](https://arxiv.org/search/?searchtype=author&query=Kurt Micallef), [Nizar Habash](https://arxiv.org/search/?searchtype=author&query=Nizar Habash), [Claudia Borg](https://arxiv.org/search/?searchtype=author&query=Claudia Borg) 作者:库尔特·米卡勒夫、尼扎尔·哈巴什、克劳迪娅·博格
Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks. 马耳他语是一种独特的闪米特语言,在罗曼语和日耳曼语系(特别是意大利语和英语)的广泛影响下演变而来。尽管它起源于闪米特语,但它的正字法是基于拉丁文字的,这在它与阿拉伯语中最亲近的语言亲戚之间造成了差距。在本文中,我们探讨了阿拉伯语资源是否可以通过跨语言增强技术支持马耳他语自然语言处理(NLP)。我们研究了将阿拉伯语文本数据与马耳他语对齐的多种策略,包括各种音译方案和机器翻译 (MT) 方法。作为其中的一部分,我们还引入了更好地代表马耳他语正字法的新型音译系统。我们评估了这些增强对单语和多语模型的影响,并证明基于阿拉伯语的增强可以显著使马耳他语 NLP 任务受益。
Subject: Computation and Language 主题 : 计算和语言
Publish: 2025-09-16 09:09:50 UTC 发布时间 : 2025-09-16 09:09:50 UTC
#23 ConvergeWriter: Data-Driven Bottom-Up Article Construction ConvergeWriter:数据驱动的自下而上的文章构建
Authors: [Binquan Ji](https://arxiv.org/search/?searchtype=author&query=Binquan Ji), [Jiaqi Wang](https://arxiv.org/search/?searchtype=author&query=Jiaqi Wang), [Ruiting Li](https://arxiv.org/search/?searchtype=author&query=Ruiting Li), [Xingchen Han](https://arxiv.org/search/?searchtype=author&query=Xingchen Han), [Yiyang Qi](https://arxiv.org/search/?searchtype=author&query=Yiyang Qi), [Shichao Wang](https://arxiv.org/search/?searchtype=author&query=Shichao Wang), [Yifei Lu](https://arxiv.org/search/?searchtype=author&query=Yifei Lu), [Yuantao Han](https://arxiv.org/search/?searchtype=author&query=Yuantao Han), [Feiliang Ren](https://arxiv.org/search/?searchtype=author&query=Feiliang Ren) 作者:季彬泉,王佳琦,李瑞婷,韩星辰,齐益阳,王世超,卢一飞,韩元涛,任飞良
Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing “top-down” methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model’s plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel “bottom-up,” data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a “Retrieval-First for Knowledge, Clustering for Structure” strategy, which first establishes the “knowledge boundaries” of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct “knowledge clusters.” These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains. 
大型语言模型 (LLM) 在文本生成方面表现出了非凡的能力,但生成基于广泛外部知识库的长篇事实文档仍然是一项重大挑战。现有的“自上而下”方法首先生成假设或大纲,然后检索证据,通常存在模型计划与可用知识之间的脱节,导致内容碎片化和事实不准确。为了解决这些限制,我们提出了一种新颖的“自下而上”数据驱动框架,该框架颠倒了传统的生成流程。我们的方法基于“检索优先以获取知识,聚类以构建结构”的策略,在进行任何生成规划之前先确立源语料库的“知识边界”。具体来说,我们从知识库中执行详尽的迭代检索,然后采用无监督聚类算法将检索到的文档组织成不同的“知识簇”。这些簇形成了一个客观的、数据驱动的基础,直接指导后续生成分层大纲和最终文档内容。这种自下而上的过程确保生成的文本受到源材料的严格约束并完全可追溯到源材料,主动适应知识库的有限范围,从根本上降低幻觉的风险。在 14B 和 32B 参数模型上的实验结果表明,我们的方法实现了与最先进基线相当或更优的性能,并有望在需要高保真度和结构一致性的知识受限场景中表现出独特的优势。我们的工作为生成可靠、结构化的长篇文档提供了一种有效的范式,为高风险、知识密集型领域中更稳健的 LLM 应用铺平了道路。
Subject: Computation and Language 主题 : 计算和语言
Publish: 2025-09-16 08:30:52 UTC 发布时间 : 2025-09-16 08:30:52 UTC
#24 Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision 使用抽象语义监督的分组损失增强抽象表示的对比学习
Authors: [Omri Suissa](https://arxiv.org/search/?searchtype=author&query=Omri Suissa), [Muhiim Ali](https://arxiv.org/search/?searchtype=author&query=Muhiim Ali), [Shengmai Chen](https://arxiv.org/search/?searchtype=author&query=Shengmai Chen), [Yinuo Cai](https://arxiv.org/search/?searchtype=author&query=Yinuo Cai), [Shekhar Pradhan](https://arxiv.org/search/?searchtype=author&query=Shekhar Pradhan) 作者: Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan
Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate 1. The extent to which VLMs have this concept abstraction capacity, and 2. Strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM model (CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models. 
人类可以将图像识别为一般概念的实例,而不仅仅是识别其对象及其关系。在本文中,我们研究了 1.VLM 在多大程度上具有这种概念抽象能力,以及 2.对图像中更高概念信息进行编码的策略,使生成的 VLM 模型(CLEAR GLASS 模型)在更大程度上具有这种能力。为此,我们引入了一个分组图像标题数据集(MAGIC),它由几组图像标题组成,每组一组关联图像和更高级别的概念标签。我们使用一种新颖的对比损失技术来诱导模型在组中每个图像(标题)的表示中编码图像标题组所有成员共有的信息。我们的主要贡献是基于文本-图像对比组(外部对比损失)的分组对比损失函数,以及测量组中图像标题实例之间距离的内部损失。我们的训练方法导致 CLEAR GLASS 模型具有概念抽象能力作为一种涌现能力,因为该模型不会暴露于与每个组相关的更高级别的概念。相反,训练迫使模型为每个图像标题组创建语义表示,使其更接近潜在语义空间中更高级别概念的语义表示。我们的实验表明,与 SOTA 模型相比,这种训练方法产生的模型在抽象概念识别方面有所改进。
Subject: Computation and Language 主题 : 计算和语言
Publish: 2025-09-16 07:36:44 UTC 发布时间 : 2025-09-16 07:36:44 UTC
#25 HistoryBankQA: Multilingual Temporal Question Answering on Historical Events HistoryBankQA:历史事件多语言时间问答
Authors: [Biswadip Mandal](https://arxiv.org/search/?searchtype=author&query=Biswadip Mandal), [Anant Khandelwal](https://arxiv.org/search/?searchtype=author&query=Anant Khandelwal), [Manish Gupta](https://arxiv.org/search/?searchtype=author&query=Manish Gupta) 作者:Biswadip Mandal、Anant Khandelwal、Manish Gupta
Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper. 对历史事件的时间推理是 NLP 任务的一项关键技能,例如事件提取、历史实体链接、时间问答、时间线摘要、时间事件聚类和时间自然语言推理。然而,对大型语言模型 (LLM) 的时间推理能力进行基准测试的工作相当有限。现有的时间推理数据集规模有限,缺乏多语言覆盖,更关注当代事件。为了解决这些限制,我们推出了 HistoryBank,这是一个多语言数据库,其中包含从维基百科时间线页面和文章信息框中提取的 10M+ 历史事件。我们的数据库在历史深度和语言广度方面提供了前所未有的覆盖范围,涵盖了 10 种语言。此外,我们还为所有语言的时间推理构建了一个全面的问答基准。该基准测试涵盖了一组不同的 6 个时间 QA 推理任务,我们评估了一套流行的语言模型(LLaMA-3-8B、Mistral-7B、Gemma-2-9b、Qwen3-8B、GPT4o)以评估它们在这些任务上的性能。正如预期的那样,GPT4o 在所有答案类型和语言中表现最佳;Gemma-2 的性能优于其他小型语言模型。我们的工作旨在为促进对历史事件的多语言和时间感知自然语言理解提供全面的资源。为了促进进一步的研究,我们将在接受本文后公开我们的代码和数据集。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 06:24:29 UTC 发布时间: 2025-09-16 06:24:29 UTC
#26 Case-Based Decision-Theoretic Decoding with Quality Memories #26 基于案例的决策论解码与质量记忆
Authors: [Hiroyuki Deguchi](https://arxiv.org/search/?searchtype=author&query=Hiroyuki Deguchi), [Masaaki Nagata](https://arxiv.org/search/?searchtype=author&query=Masaaki Nagata) 作者:Hiroyuki Deguchi、Masaaki Nagata
Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures the knowledge or information of out-of-domain. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De–En and Ja↔En translation tasks and image captioning tasks on MSCOCO and nocaps datasets. 最小贝叶斯风险 (MBR) 解码是文本生成的决策规则,它选择的假设可以最大化预期效用,并稳健地生成比最大后验 (MAP) 解码更高质量的文本。但是,它取决于从文本生成模型中提取的示例文本;因此,很难找到正确捕获域外知识或信息的假设。为了解决这个问题,我们提出了基于案例的决策论(CBDT)解码,这是另一种使用领域数据示例来估计预期效用的方法。CBDT 解码不仅生成比 MAP 解码更高质量的文本,而且 MBR 和 CBDT 解码的结合在 MSCOCO 和 nocaps 数据集上的 7 个领域 De-En 和 Ja ↔ En 翻译任务以及图像字幕任务中优于 MBR 解码。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 05:01:05 UTC 发布时间: 2025-09-16 05:01:05 UTC
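The MBR decision rule described in the abstract above can be illustrated with a minimal sketch. It assumes a toy unigram-F1 utility and uses the sampled texts themselves as both candidates and pseudo-references; the names `token_f1` and `mbr_decode` are hypothetical and are not taken from the paper. CBDT decoding, per the abstract, estimates expected utility from domain examples instead; its details are not given here, so it is not implemented.

```python
from typing import Callable, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): unigram F1 between whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    hypotheses: Sequence[str],
    samples: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the hypothesis with the highest average utility against the samples."""
    def expected_utility(hyp: str) -> float:
        return sum(utility(hyp, s) for s in samples) / len(samples)

    return max(hypotheses, key=expected_utility)


# Texts that would normally be sampled from the generation model.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
best = mbr_decode(samples, samples)  # selects "the cat sat on the mat"
```

Because the expected utility is estimated only from model samples, an out-of-domain input yields samples that may all miss the relevant knowledge, which is the limitation the abstract points to.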
#27 Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content #27 迈向包容性有毒内容审核:解决处理 LLM 生成内容的毒性分类器中对抗性攻击的漏洞
Authors: [Shaz Furniturewala](https://arxiv.org/search/?searchtype=author&query=Shaz Furniturewala), [Arkaitz Zubiaga](https://arxiv.org/search/?searchtype=author&query=Arkaitz Zubiaga) 作者:Shaz Furniturewala、Arkaitz Zubiaga
The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models. 
由于大型语言模型 (LLM) 的广泛使用,机器生成的在线内容量急剧增长,给内容审核系统带来了新的挑战。传统的内容审核分类器通常根据人类生成的文本进行训练,由于 LLM 生成的文本偏离其训练数据以及旨在避免检测的对抗性攻击而遭受错误分类。当今的防御策略是被动的而不是主动的,因为它们依赖于对抗训练或外部检测模型来识别攻击。在这项工作中,我们旨在识别导致错误分类的毒性分类器的易受攻击的组件,提出一种基于机制可解释性技术的新策略。我们的研究重点是微调的 BERT 和 RoBERTa 分类器,在跨越各种少数群体的不同数据集上进行测试。我们使用对抗性攻击技术来识别易受攻击的回路。最后,我们抑制了这些易受攻击的回路,提高了抵御对抗性攻击的性能。我们还提供了对这些易受攻击的回路的人口统计层面的见解,揭示了模型训练中的公平性和鲁棒性差距。我们发现模型具有不同的注意力头,这些注意力头要么对性能至关重要,要么容易受到攻击,抑制易受攻击的注意力头可以提高对抗性输入的性能。我们还发现,不同的注意力头对不同人口群体的脆弱性负责,这可以为毒性检测模型的更具包容性的开发提供信息。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 04:51:18 UTC 发布时间: 2025-09-16 04:51:18 UTC
#28 Chat-Driven Text Generation and Interaction for Person Retrieval #28 用于人员检索的聊天驱动文本生成和交互
Authors: [Zequn Xie](https://arxiv.org/search/?searchtype=author&query=Zequn Xie), [Chuxin Wang](https://arxiv.org/search/?searchtype=author&query=Chuxin Wang), [Sihang Cai](https://arxiv.org/search/?searchtype=author&query=Sihang Cai), [Yeqiang Wang](https://arxiv.org/search/?searchtype=author&query=Yeqiang Wang), [Shulei Wang](https://arxiv.org/search/?searchtype=author&query=Shulei Wang), [Tao Jin](https://arxiv.org/search/?searchtype=author&query=Tao Jin) 作者: 谢泽群, 王楚新, 蔡思航, 王野强, 王书磊, 金涛
Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems. 基于文本的人员搜索 (TBPS) 可以使用自然语言描述从大规模数据库中检索人物图像,从而在监控应用中提供关键价值。然而,一个主要挑战在于获得高质量文本注释的劳动密集型过程,这限制了可扩展性和实际部署。为了解决这个问题,我们引入了两个互补模块:多轮文本生成 (MTG) 和多轮文本交互 (MTI)。MTG 通过与 MLLM 的模拟对话生成丰富的伪标签,无需人工监督即可生成细粒度和多样化的视觉描述。MTI 通过基于对话的动态推理在推理时细化用户查询,使系统能够解释和解决模糊、不完整或模棱两可的描述——这些特征在现实世界的搜索场景中很常见。MTG 和 MTI 共同构成了一个统一且无需注释的框架,可显着提高检索准确性、稳健性和可用性。广泛的评估表明,我们的方法取得了有竞争力或卓越的结果,同时消除了对手动字幕的需求,为 TBPS 系统的可扩展和实际部署铺平了道路。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 04:40:24 UTC 发布时间: 2025-09-16 04:40:24 UTC
#29 Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations #29 通过不确定性估计减轻情感支持对话中的策略偏好偏差
Authors: [Yougen Zhou](https://arxiv.org/search/?searchtype=author&query=Yougen Zhou), [Qin Chen](https://arxiv.org/search/?searchtype=author&query=Qin Chen), [Ningning Zhou](https://arxiv.org/search/?searchtype=author&query=Ningning Zhou), [Jie Zhou](https://arxiv.org/search/?searchtype=author&query=Jie Zhou), [Xingjiao Wu](https://arxiv.org/search/?searchtype=author&query=Xingjiao Wu), [Liang He](https://arxiv.org/search/?searchtype=author&query=Liang He) 作者:周友根、陈秦、周宁宁、周杰、吴星娇、何良
Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach. 情感支持对话 (ESC) 旨在通过同理心对话来减轻痛苦,但由于策略规划准确性低,大型语言模型 (LLM) 在提供有效的 ESC 方面面临着持续的挑战。此外,对特定策略存在相当大的偏好偏差。先前使用微调策略规划器的方法已显示出减少此类偏差的潜力,而 LLM 偏好偏差的根本原因尚未得到充分研究。为了解决这些问题,我们首先通过识别策略规划中 LLM 的知识边界来揭示偏差的根本原因。然后,我们提出了一种通过具有双奖励函数的强化学习来减轻偏差的方法,该方法根据知识边界通过准确性和基于熵的置信度来优化每个区域的策略规划。在具有多个 LLM 主干的 ESCov 和 ExTES 数据集上的实验表明,我们的方法优于基线,证实了我们方法的有效性。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 04:39:18 UTC 发布时间: 2025-09-16 04:39:18 UTC
#30 Don't Change My View: Ideological Bias Auditing in Large Language Models #30 不要改变我的观点:大型语言模型中的意识形态偏见审计
Authors: [Paul Kröger](https://arxiv.org/search/?searchtype=author&query=Paul Kröger), [Emilio Barkett](https://arxiv.org/search/?searchtype=author&query=Emilio Barkett) 作者:Paul Kröger、Emilio Barkett
As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior. 随着大型语言模型 (LLM) 越来越多地嵌入到数百万人使用的产品中,它们的输出可能会影响个人信念,并累积影响公众舆论。如果 LLM 的行为可以被有意识地引导到特定的意识形态立场,例如政治或宗教观点,那么控制这些系统的人可能会对公共话语产生不成比例的影响力。尽管是否可以可靠地引导 LLM 走向连贯的意识形态立场以及是否可以有效防止这种引导仍然是一个悬而未决的问题,但关键的第一步是开发检测此类引导尝试何时发生的方法。在这项工作中,我们将先前提出的统计方法应用于意识形态偏见审计的新背景。我们的方法继承了原始框架的与模型无关的设计,不需要访问语言模型的内部结构。相反,它通过分析与所选主题相关的提示中模型输出的分布变化来识别潜在的意识形态引导。这种设计使该方法特别适用于审计专有的黑盒系统。我们通过一系列实验验证了我们的方法,展示了其实际适用性及其支持对 LLM 行为进行独立事后审计的潜力。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 04:14:29 UTC 发布时间: 2025-09-16 04:14:29 UTC
#31 PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition #31 PAC:基于发音感知的上下文化大型语言模型自动语音识别
Authors: [Li Fu](https://arxiv.org/search/?searchtype=author&query=Li Fu), [Yu Xin](https://arxiv.org/search/?searchtype=author&query=Yu Xin), [Sunlu Zeng](https://arxiv.org/search/?searchtype=author&query=Sunlu Zeng), [Lu Fan](https://arxiv.org/search/?searchtype=author&query=Lu Fan), [Youzheng Wu](https://arxiv.org/search/?searchtype=author&query=Youzheng Wu), [Xiaodong He](https://arxiv.org/search/?searchtype=author&query=Xiaodong He) 作者:李福,于欣,曾孙璐,陆帆,吴友正,何晓东
This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively. 本文提出了一个发音感知情境化(PAC)框架,以解决基于大型语言模型(LLM)的自动语音识别(ASR)系统中的两个关键挑战:有效的发音建模和稳健的同音词辨别。两者对于原始或长尾词识别都是必不可少的。所提出的方法采用两阶段学习范式。首先,我们介绍一种发音引导的上下文学习方法。它采用交错字素-音素上下文建模策略,该策略包含仅字素干扰项,鼓励模型利用音素线索进行准确识别。然后,我们提出了一种具有扰动标签采样的发音判别强化学习方法,以进一步增强模型区分上下文同音词的能力。在公共英语 Librispeech 和普通话 AISHELL-1 数据集上的实验结果表明,PAC:(1) 与预训练的基于 LLM 的 ASR 模型相比,相对词错误率 (WER) 分别降低了 30.2% 和 53.8%;(2) 与强基线相比,长尾词的偏置 WER 分别相对降低了 31.8% 和 60.5%。
Subjects: Computation and Language, Audio and Speech Processing 科目:计算与语言、音频与语音处理
Publish: 2025-09-16 04:07:28 UTC 发布时间: 2025-09-16 04:07:28 UTC
#32 Positional Encoding via Token-Aware Phase Attention #32 通过标记感知相位注意力进行位置编码
Authors: Yu, Wang, [Sheng Shen](https://arxiv.org/search/?searchtype=author&query=Sheng Shen), [Rémi Munos](https://arxiv.org/search/?searchtype=author&query=Rémi Munos), [Hongyuan Zhan](https://arxiv.org/search/?searchtype=author&query=Hongyuan Zhan), [Yuandong Tian](https://arxiv.org/search/?searchtype=author&query=Yuandong Tian) 作者: Yu, Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context than RoPE families. 我们在实际假设下证明,旋转位置嵌入 (RoPE) 在注意力分数中引入了内在的距离依赖性偏差,这限制了 RoPE 对长上下文进行建模的能力。RoPE 扩展方法可能会缓解这个问题,但它们通常需要在预训练后进行事后调整,例如重新缩放或超参数重新调整。本文介绍了标记感知相位注意力(TAPA),这是一种将可学习相位函数纳入注意力机制的新型位置编码方法。TAPA 在长距离上保留标记交互,通过直接和轻微调扩展到更长的上下文,推断到看不见的长度,并且在长上下文上获得比 RoPE 系列显着降低的困惑度。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 03:53:32 UTC 发布时间: 2025-09-16 03:53:32 UTC
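As background for the RoPE discussion above, the following minimal sketch shows the standard rotary construction, in which the attention score between a query at position `m` and a key at position `n` depends only on the offset `n - m`. It does not reproduce the paper's proof of a distance-dependent bias, nor TAPA itself; the helper names `rope` and `score` and the example vectors are assumptions for illustration.

```python
import math
from typing import List


def rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)  # assumed to be even
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency of the pair starting at dimension i
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q: List[float], k: List[float], m: int, n: int) -> float:
    """Unnormalized attention score between query at position m and key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (n - m = -2) at different absolute positions gives the same score.
same_offset = math.isclose(score(q, k, 3, 1), score(q, k, 10, 8), abs_tol=1e-9)
```

Since the score is a function of relative distance alone, any systematic trend in that function as distance grows affects every long-range token pair, which is the kind of effect the abstract attributes to RoPE.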
#33 EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving #33 EconProver:为自动定理证明提供更经济的测试时间缩放
Authors: [Mukai Li](https://arxiv.org/search/?searchtype=author&query=Mukai Li), [Linfeng Song](https://arxiv.org/search/?searchtype=author&query=Linfeng Song), [Zhenwen Liang](https://arxiv.org/search/?searchtype=author&query=Zhenwen Liang), [Jiahao Xu](https://arxiv.org/search/?searchtype=author&query=Jiahao Xu), [Shansan Gong](https://arxiv.org/search/?searchtype=author&query=Shansan Gong), [Qi Liu](https://arxiv.org/search/?searchtype=author&query=Qi Liu), [Haitao Mi](https://arxiv.org/search/?searchtype=author&query=Haitao Mi), [Dong Yu](https://arxiv.org/search/?searchtype=author&query=Dong Yu) 作者: 李向井, 宋林峰, 梁振文, 徐佳豪, 龚珊珊, 刘琦, 米海涛, 俞东
Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance. 
大型语言模型 (LLM) 最近推进了自动定理证明 (ATP) 领域,通过广泛采用的测试时扩展策略,特别是反思式思维链 (CoT) 推理和增加采样次数,实现了显着的性能提升。然而,它们都为推理带来了巨大的计算开销。此外,现有的成本分析通常只控制采样次数,而忽略了不同扩展策略引入的采样成本的巨大差异。在本文中,我们系统地比较了 ATP 模型不同测试时扩展策略的效率,并证明了当前最先进的 (SOTA) 开源方法的低效率。然后,我们研究了在保持原始性能的同时显着减少 token 使用量和采样次数的方法。具体来说,我们提出了两种可以集成到统一的 EconRL 管道中以放大收益的互补方法:(1) 动态思维链 (CoT) 切换机制,旨在减少不必要的 token 消耗,以及 (2) 具有可训练前缀的多样化并行扩展强化学习 (RL),以提高受限采样次数下的通过率。在 miniF2F 和 ProofNet 上的实验表明,我们的 EconProver 以仅 12% 的计算成本实现了与基线方法相当的性能。这项工作为在不牺牲性能的情况下部署轻量级 ATP 模型提供了可操作的见解。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 03:00:13 UTC 发布时间: 2025-09-16 03:00:13 UTC
#34 MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models #34 使用 CLIP 模型进行零样本音频字幕的 MAGIC 增强关键字提示
Authors: [Vijay Govindarajan](https://arxiv.org/search/?searchtype=author&query=Vijay Govindarajan), [Pratik Patel](https://arxiv.org/search/?searchtype=author&query=Pratik Patel), [Sahil Tripathi](https://arxiv.org/search/?searchtype=author&query=Sahil Tripathi), [Md Azizul Hoque](https://arxiv.org/search/?searchtype=author&query=Md Azizul Hoque), [Gautam Siddharth Kashyap](https://arxiv.org/search/?searchtype=author&query=Gautam Siddharth Kashyap) 作者:Vijay Govindarajan、Pratik Patel、Sahil Tripathi、Md Azizul Hoque、Gautam Siddharth Kashyap
Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used. 自动音频字幕 (AAC) 为音频剪辑生成字幕,但与图像字幕相比,由于数据集有限而面临挑战。为了克服这个问题,我们提出了利用预训练模型的零样本 AAC 系统,无需进行大量训练。我们的方法使用预训练的音频 CLIP 模型来提取听觉特征并生成结构化提示,从而指导大型语言模型 (LLM) 生成字幕。与传统的贪婪解码不同,我们的方法通过音频 CLIP 模型细化 token 选择,确保与音频内容保持一致。实验结果表明,使用 WavCaps 模型进行 MAGIC 搜索,NLG 平均分数提高了 35%(从 4.7 到 7.3)。性能受音频-文本匹配模型和关键词选择的影响很大,使用单个关键词提示可获得最佳结果,而不使用关键词列表时性能下降 50%。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-16 02:36:00 UTC 发布时间: 2025-09-16 02:36:00 UTC
#35 A comparison of pipelines for the translation of a low resource language based on transformers #35 基于 Transformer 的低资源语言翻译管道比较
Authors: [Chiara Bonfanti](https://arxiv.org/search/?searchtype=author&query=Chiara Bonfanti), [Michele Colombino](https://arxiv.org/search/?searchtype=author&query=Michele Colombino), [Giulia Coucourde](https://arxiv.org/search/?searchtype=author&query=Giulia Coucourde), [Faeze Memari](https://arxiv.org/search/?searchtype=author&query=Faeze Memari), [Stefano Pinardi](https://arxiv.org/search/?searchtype=author&query=Stefano Pinardi), [Rosa Meo](https://arxiv.org/search/?searchtype=author&query=Rosa Meo) 作者:Chiara Bonfanti、Michele Colombino、Giulia Coucourde、Faeze Memari、Stefano Pinardi、Rosa Meo
This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandè language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively. 这项工作比较了三个用于训练基于 Transformer 的神经网络的管道,以生产 Bambara 的机器翻译器,Bambara 是非洲约有 14,188,850 人使用的一种曼德语。第一个管道训练一个简单的转换器将句子从法语翻译成 Bambara。第二个使用纯解码器架构微调 LLaMA3 (3B-8B) 讲师模型,用于法语到 Bambara 的翻译。使用不同的超参数组合训练前两个管道中的模型,以提高 BLEU 和 chrF 分数,并在测试句子和官方 Bambara 基准测试上进行评估。第三个管道使用语言蒸馏和师生双神经网络将 Bambara 集成到预训练的 LaBSE 模型中,该模型提供与语言无关的嵌入。然后将 BERT 扩展应用于 LaBSE 以生成翻译。所有管道都在 Dokotoro(医疗)和 Bayelemagaba(混合域)上进行了测试。结果表明,第一个管道虽然更简单,但实现了最佳的翻译准确率(10% BLEU,Bayelemagaba 为 21% chrF),与低资源翻译结果一致。在为这项工作创建的 Yiri 数据集上,它实现了 33.81% 的 BLEU 和 41% 的 chrF。基于讲师的模型在单个数据集上的表现优于在聚合集合上的表现,这表明它们更有效地捕获特定于数据集的模式。
Subjects: Computation and Language, Computational Engineering, Finance, and Science, Computers and Society, Machine Learning 科目: 计算与语言 , 计算工程, 金融与科学 , 计算机与社会 , 机器学习
Publish: 2025-09-15 23:36:49 UTC 发布时间: 2025-09-15 23:36:49 UTC
#36 FunAudio-ASR Technical Report #36 FunAudio-ASR 技术报告
Authors: [Keyu An](https://arxiv.org/search/?searchtype=author&query=Keyu An), [Yanni Chen](https://arxiv.org/search/?searchtype=author&query=Yanni Chen), [Chong Deng](https://arxiv.org/search/?searchtype=author&query=Chong Deng), [Changfeng Gao](https://arxiv.org/search/?searchtype=author&query=Changfeng Gao), [Zhifu Gao](https://arxiv.org/search/?searchtype=author&query=Zhifu Gao), [Bo Gong](https://arxiv.org/search/?searchtype=author&query=Bo Gong), [Xiangang Li](https://arxiv.org/search/?searchtype=author&query=Xiangang Li), [Yabin Li](https://arxiv.org/search/?searchtype=author&query=Yabin Li), [Xiang Lv](https://arxiv.org/search/?searchtype=author&query=Xiang Lv), [Yunjie Ji](https://arxiv.org/search/?searchtype=author&query=Yunjie Ji), [Yiheng Jiang](https://arxiv.org/search/?searchtype=author&query=Yiheng Jiang), [Bin Ma](https://arxiv.org/search/?searchtype=author&query=Bin Ma), [Haoneng Luo](https://arxiv.org/search/?searchtype=author&query=Haoneng Luo), [Chongjia Ni](https://arxiv.org/search/?searchtype=author&query=Chongjia Ni), [Zexu Pan](https://arxiv.org/search/?searchtype=author&query=Zexu Pan), [Yiping Peng](https://arxiv.org/search/?searchtype=author&query=Yiping Peng), [Zhendong Peng](https://arxiv.org/search/?searchtype=author&query=Zhendong Peng), [Peiyao Wang](https://arxiv.org/search/?searchtype=author&query=Peiyao Wang), [Hao Wang](https://arxiv.org/search/?searchtype=author&query=Hao Wang), [Wen Wang](https://arxiv.org/search/?searchtype=author&query=Wen Wang), [Wupeng Wang](https://arxiv.org/search/?searchtype=author&query=Wupeng Wang), [Biao Tian](https://arxiv.org/search/?searchtype=author&query=Biao Tian), [Zhentao Tan](https://arxiv.org/search/?searchtype=author&query=Zhentao Tan), [Nan Yang](https://arxiv.org/search/?searchtype=author&query=Nan Yang), [Bin Yuan](https://arxiv.org/search/?searchtype=author&query=Bin Yuan), [Jieping Ye](https://arxiv.org/search/?searchtype=author&query=Jieping Ye), [Jixing Yu](https://arxiv.org/search/?searchtype=author&query=Jixing Yu), [Qinglin Zhang](https://arxiv.org/search/?searchtype=author&query=Qinglin Zhang), [Kun Zou](https://arxiv.org/search/?searchtype=author&query=Kun Zou), [Han Zhao](https://arxiv.org/search/?searchtype=author&query=Han Zhao), [Shengkui Zhao](https://arxiv.org/search/?searchtype=author&query=Shengkui Zhao), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou) 作者: 安克宇, 陈燕妮, 邓冲, 高长峰, 高志富, 龚博, 李先刚, 李亚彬, 吕翔, 季云杰, 江一恒, 马斌, 罗浩能, 倪崇家, 潘泽旭, 彭一平, 彭振东, 王培瑶, 王浩, Wen Wang, 王五鹏, 田彪, 谭振涛, 杨楠, 袁斌, 叶洁平, 余继兴, 张庆林, 邹坤, 赵汉, 赵胜奎, 周景仁
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. 近年来,自动语音识别 (ASR) 见证了三种互补范式的推动的变革性进步:数据扩展、模型大小扩展以及与大型语言模型 (LLM) 的深度集成。然而,LLM 容易出现幻觉,这会显着降低现实世界 ASR 应用程序中的用户体验。在本文中,我们提出了 FunAudio-ASR,这是一个基于 LLM 的大规模 ASR 系统,它协同地结合了海量数据、大模型容量、LLM 集成和强化学习,以在多样化和复杂的语音识别场景中实现最先进的性能。此外,FunAudio-ASR 针对实际部署进行了专门优化,增强了流媒体功能、噪声鲁棒性、代码切换、热词定制以及满足其他实际应用要求。实验结果表明,虽然大多数基于 LLM 的 ASR 系统在开源基准测试中取得了强大的性能,但在实际的行业评估集上往往表现不佳。得益于面向生产的优化,FunAudio-ASR 在真实应用数据集上实现了 SOTA 性能,在实际环境中展示了其有效性和鲁棒性。
Subjects: Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing 科目: 计算与语言 , 人工智能 , 声音 , 音频与语音处理
Publish: 2025-09-15 23:19:36 UTC 发布时间: 2025-09-15 23:19:36 UTC
#37 Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction #37 经审计的推理改进:通过 LLM 引导的逐步评估和纠正微调语言模型
Authors: [Sumanta Bhattacharyya](https://arxiv.org/search/?searchtype=author&query=Sumanta Bhattacharyya), [Sara Riaz](https://arxiv.org/search/?searchtype=author&query=Sara Riaz), [Pedram Rooshenas](https://arxiv.org/search/?searchtype=author&query=Pedram Rooshenas) 作者:Sumanta Bhattacharyya、Sara Riaz、Pedram Rooshenas
Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model’s intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond. 当缺乏直接的人工监督或高质量标签时,训练特定于任务的小推理模型具有挑战性。然而,具有推理能力的 LLM 会产生丰富的中间推理轨迹,可以系统地细化这些轨迹以创建有效的监督信号。我们提出了 Reason-Refine-then-Align (R2tA),它将精细化的模型基本原理转化为训练特定任务推理模型的监督。我们的方法根据特定于任务的输入从开源基础模型生成初始推理和响应,然后细化这些跟踪,修复幻觉和不一致,以形成高保真数据集。我们执行两阶段对齐、监督微调 (SFT),然后进行直接偏好优化 (DPO),以使用人类验证的概念偏好校准模型的中间推理,然后根据该对齐推理对最终输出进行条件。作为案例研究,我们应用 R2tA 来评估数据库系统设计中的扩展实体关系图 (EERD),这是一项结构复杂的任务,其中仅提示方法会遗漏或产生幻觉错误。我们整理了一个包含 600 个 EERD 变体的数据集(训练/测试拆分分别为 450/150),其中诱发的错误涵盖 11 个类别。实证评估表明,R2tA 为在数据稀缺领域进行可扩展的 LLM 适应提供了一条实用、经济高效的途径,从而为教育及其他领域提供了可重复的人工智能工具。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 21:47:52 UTC 发布时间: 2025-09-15 21:47:52 UTC
#38 Does Language Model Understand Language? #38 语言模型能理解语言吗?
Authors: [Suvojit Acharjee](https://arxiv.org/search/?searchtype=author&query=Suvojit Acharjee), [Utathya Aich](https://arxiv.org/search/?searchtype=author&query=Utathya Aich), [Asfak Ali](https://arxiv.org/search/?searchtype=author&query=Asfak Ali) 作者:Suvojit Acharjee、Utathya Aich、Asfak Ali
Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena such as tense, negation, voice, and modality, which are central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs increasingly power applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct an evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce a new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, and voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as a novel, linguistically inspired metric, HCE accuracy. HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human-like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross-lingual scenarios.
尽管自然语言生成和理解取得了进步,但 LM 仍然难以处理时态、否定、语态和情态等细粒度语言现象,这些都是有效人类交流的核心要素。在联合国可持续发展目标 4 的背景下,语言清晰度至关重要,LM 在教育技术中的部署需要仔细审查。随着 LM 越来越多地为辅导系统、自动评分和翻译等应用程序提供支持,它们与人类语言解释的一致性对于有效学习至关重要。在这项研究中,我们用英语和孟加拉语对这些具有挑战性的上下文中的 SOTA 语言模型进行了评估。为确保结构化评估,我们引入了新的系统环境中认知推理评估路线指南。我们提出的 LUCID 数据集由精心制作的英语和孟加拉语句子对组成,专门在语言理解的关键方面挑战这些模型,包括否定、时态、语态变化。我们使用 Pearson 相关性、Spearman 相关性和平均绝对误差等标准指标以及新颖的、受语言学启发的指标 HCE 准确率评估 SOTA 模型的性能,包括 MISTRAL-SABA-24B、LLaMA-4-Scout-17B、LLaMA-3.3-70B、Gemma2-9B 和 Compound-Beta。HCE 准确率衡量模型预测落在平均人类评级的一个标准差内的频率,从而捕获对语言解释可变性的类似人类的容忍度。我们的研究结果强调 Compound-Beta 是最平衡的模型,在不同的语言条件下始终实现高相关性和低 MAE。它在英语中记录了最高的 Pearson 相关性,并在混合语言数据上表现出强大的性能,表明与跨语言场景中的人类判断高度一致。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 21:09:09 UTC 发布时间: 2025-09-15 21:09:09 UTC
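The HCE accuracy described in the abstract above (how often a model prediction falls within one standard deviation of the mean human rating) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `hce_accuracy`, the toy ratings, and the choice of the sample standard deviation are all assumptions, since the abstract does not specify them.

```python
from statistics import mean, stdev


def hce_accuracy(predictions, human_ratings):
    """Fraction of items whose prediction lies within one standard
    deviation of the mean human rating for that item.

    predictions:   one model score per item
    human_ratings: one list of human scores per item (at least two each)
    """
    hits = 0
    for pred, ratings in zip(predictions, human_ratings):
        mu = mean(ratings)
        # Assumption: sample standard deviation; the paper may use another variant.
        sigma = stdev(ratings)
        if abs(pred - mu) <= sigma:
            hits += 1
    return hits / len(predictions)


# Hypothetical toy data: three items, three human raters each.
preds = [4.0, 2.0, 3.5]
ratings = [[4, 5, 4], [4, 4, 5], [3, 4, 3]]
score = hce_accuracy(preds, ratings)
print(f"HCE accuracy: {score:.3f}")
```

Unlike MAE, this metric gives full credit to any prediction inside the band of human disagreement, so items on which raters themselves disagree are scored more leniently.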
#39 Topic Coverage-based Demonstration Retrieval for In-Context Learning #39 用于上下文学习的基于主题覆盖的演示检索
Authors: [Wonbin Kweon](https://arxiv.org/search/?searchtype=author&query=Wonbin Kweon), [SeongKu Kang](https://arxiv.org/search/?searchtype=author&query=SeongKu Kang), [Runchu Tian](https://arxiv.org/search/?searchtype=author&query=Runchu Tian), [Pengcheng Jiang](https://arxiv.org/search/?searchtype=author&query=Pengcheng Jiang), [Jiawei Han](https://arxiv.org/search/?searchtype=author&query=Jiawei Han), [Hwanjo Yu](https://arxiv.org/search/?searchtype=author&query=Hwanjo Yu) 作者: Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model’s knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025. 上下文学习的有效性在很大程度上取决于选择为给定测试输入提供所有必要信息的演示。为了实现这一目标,确定并涵盖细粒度的知识需求至关重要。然而,先前的方法通常仅根据嵌入相似性或生成概率来检索演示,从而导致不相关或冗余的示例。在本文中,我们提出了 TopicK,这是一个基于主题覆盖的检索框架,它选择演示来全面覆盖与测试输入和模型相关的主题级知识。具体来说,TopicK 估计输入所需的主题,并评估模型对这些主题的了解。然后,TopicK 迭代选择引入以前未发现的必需主题的演示,其中模型表现出较低的主题知识。我们通过跨各种数据集以及开源和闭源 LLM 的广泛实验验证了 TopicK 的有效性。我们的源代码可在 https://github.com/WonbinKweon/TopicK_EMNLP2025 获得。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 21:00:28 UTC 发布时间: 2025-09-15 21:00:28 UTC
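The selection loop described in the TopicK abstract (repeatedly pick the demonstration that covers required topics not yet covered, favoring topics the model knows poorly) resembles a greedy weighted set cover. The sketch below illustrates that general idea only. The scoring rule, the function name `select_demonstrations`, and all data are hypothetical and are not taken from the paper or its repository.

```python
def select_demonstrations(required, knowledge, demos, k):
    """Greedy topic-coverage selection (illustrative assumption, not TopicK itself).

    required:  {topic: weight}, topics the test input is estimated to need
    knowledge: {topic: value in [0, 1]}, estimated model knowledge per topic
    demos:     {demo_id: set of topics that demonstration covers}
    k:         maximum number of demonstrations to return
    """
    uncovered = set(required)
    candidates = dict(demos)
    chosen = []

    def gain(demo_id):
        # Reward required topics that are still uncovered, scaled up
        # when the model's estimated knowledge of the topic is low.
        return sum(
            required[t] * (1.0 - knowledge.get(t, 0.0))
            for t in candidates[demo_id] & uncovered
        )

    while candidates and len(chosen) < k:
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break  # nothing left that adds new required coverage
        chosen.append(best)
        uncovered -= candidates.pop(best)
    return chosen


# Hypothetical toy inputs.
required = {"dosage": 1.0, "side_effects": 0.5, "interactions": 0.8}
knowledge = {"dosage": 0.9, "side_effects": 0.2, "interactions": 0.1}
demos = {
    "d1": {"dosage"},
    "d2": {"side_effects", "interactions"},
    "d3": {"interactions"},
}
chosen = select_demonstrations(required, knowledge, demos, k=2)
print(chosen)
```

In this toy run `d3` is never picked because its only topic is already covered by `d2`, which is the redundancy that similarity-only retrieval tends to miss.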
#40 MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts #40 MedFact:对中文医学文本大语言模型的事实核查能力进行基准测试
Authors: [Jiayi He](https://arxiv.org/search/?searchtype=author&query=Jiayi He), [Yangmin Huang](https://arxiv.org/search/?searchtype=author&query=Yangmin Huang), [Qianyun Du](https://arxiv.org/search/?searchtype=author&query=Qianyun Du), [Xiangying Zhou](https://arxiv.org/search/?searchtype=author&query=Xiangying Zhou), [Zhiyang He](https://arxiv.org/search/?searchtype=author&query=Zhiyang He), [Jiaxue Hu](https://arxiv.org/search/?searchtype=author&query=Jiaxue Hu), [Xiaodong Tao](https://arxiv.org/search/?searchtype=author&query=Xiaodong Tao), [Lixian Lai](https://arxiv.org/search/?searchtype=author&query=Lixian Lai) 作者:何佳怡、黄阳敏、杜倩云、周向英、何志阳、胡佳雪、陶晓东、赖丽贤
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent “over-criticism” phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
大型语言模型 (LLM) 在医疗保健领域的部署越来越多,需要对其事实可靠性进行严格评估。然而,现有的基准测试往往受到狭窄数据域的限制,无法捕捉现实世界医疗信息的复杂性。为了解决这一关键差距,我们推出了 MedFact,这是一个面向中文医学事实核查的全新且具有挑战性的基准。MedFact 包含 2,116 个专家注释实例,这些实例是根据不同的现实世界文本精选的,涵盖 13 个医学专业、8 种细粒度错误类型、4 种写作风格和多个难度级别。其构建采用人工智能与人类的混合框架,其中迭代专家反馈完善了人工智能驱动的多标准过滤过程,确保高数据质量和难度。我们对 20 个领先的 LLM 进行了全面评估,以人类专家基线为参照,对它们在真实性分类和错误定位方面的表现进行了基准测试。我们的结果表明,虽然模型通常可以确定文本是否包含错误,但精确定位它仍然是一项重大挑战,即使是性能最好的模型也达不到人类的水平。此外,我们的分析还发现了一种常见的“过度批评”现象,即模型倾向于将正确信息误判为错误,而多智能体协作和推理时扩展等先进推理技术加剧了这种情况。通过强调在医疗应用中部署 LLM 的这些关键挑战,MedFact 提供了强大的资源来推动更事实可靠、更具医学意识的模型的开发。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-15 20:46:21 UTC 发布时间: 2025-09-15 20:46:21 UTC
#41 MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering #41 MORQA:医学开放式问答的基准评估指标
Authors: [Wen-wai Yim](https://arxiv.org/search/?searchtype=author&query=Wen-wai Yim), [Asma Ben Abacha](https://arxiv.org/search/?searchtype=author&query=Asma Ben Abacha), [Zixuan Yu](https://arxiv.org/search/?searchtype=author&query=Zixuan Yu), [Robert Doerning](https://arxiv.org/search/?searchtype=author&query=Robert Doerning), [Fei Xia](https://arxiv.org/search/?searchtype=author&query=Fei Xia), [Meliha Yetisgen](https://arxiv.org/search/?searchtype=author&query=Meliha Yetisgen) 作者: Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen
Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research. 
由于对准确性、相关性和特定领域专业知识的严格要求,评估医疗领域的自然语言生成 (NLG) 系统提出了独特的挑战。传统的自动评估指标,如 BLEU、ROUGE 和 BERTScore,往往无法区分高质量输出,特别是考虑到医学问答 (QA) 任务的开放性,其中可能存在多个有效响应。在这项工作中,我们引入了 MORQA(Medical Open-Response QA),这是一种新的多语言基准,旨在评估 NLG 评估指标在三个英文和中文的医学视觉与文本问答数据集上的有效性。与之前的资源不同,我们的数据集包含由医疗专业人员撰写的 2-4+ 个黄金标准答案,以及三个英文和中文子集的专家人工评分。我们对传统指标和基于大型语言模型 (LLM) 的评估器(例如 GPT-4 和 Gemini)进行了基准测试,发现基于 LLM 的方法在与专家判断的关联方面明显优于传统指标。我们进一步分析了推动这一改进的因素,包括 LLM 对语义细微差别的敏感性以及对参考答案之间可变性的鲁棒性。我们的结果为医学领域 NLG 评估提供了第一个全面的多语言定性研究,强调了采用与人类判断一致的评估方法的必要性。所有数据集和注释都将公开发布,以支持未来的研究。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 19:51:57 UTC 发布时间: 2025-09-15 19:51:57 UTC
#42 SENTRA: Selected-Next-Token Transformer for LLM Text Detection #42 SENTRA:用于 LLM 文本检测的选定下一个令牌转换器
Authors: [Mitchell Plyler](https://arxiv.org/search/?searchtype=author&query=Mitchell Plyler), [Yilun Zhang](https://arxiv.org/search/?searchtype=author&query=Yilun Zhang), [Alexander Tuzhilin](https://arxiv.org/search/?searchtype=author&query=Alexander Tuzhilin), [Saoud Khalifah](https://arxiv.org/search/?searchtype=author&query=Saoud Khalifah), [Sen Tian](https://arxiv.org/search/?searchtype=author&query=Sen Tian) 作者:Mitchell Plyler、Yilun Zhang、Alexander Tuzhilin、Saoud Khalifah、Sen Tian
LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting. LLM 的能力越来越强,应用越来越普遍。因此,它们被滥用的可能性和现实也在增加。在这项工作中,我们解决了检测未明确声明为 LLM 生成的文本的问题。我们提出了一种新颖的、通用的、有监督的 LLM 文本检测器,即 SElected-Next-Token tRAnsformer (SENTRA)。SENTRA 是一个基于 Transformer 的编码器,利用选定的下一个标记概率序列,并利用对大量未标记数据的对比预训练。我们对 24 个文本域的三个流行公共数据集进行的实验表明,SENTRA 是一种通用分类器,在域外设置中的性能明显优于流行的基线。
Subjects: Computation and Language, Machine Learning 科目:计算与语言、机器学习
Publish: 2025-09-15 19:26:17 UTC 发布时间: 2025-09-15 19:26:17 UTC
#43 LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation #43 LLM-as-a-Judge:快速评估检索增强生成的法律文件建议
Authors: [Anu Pradhan](https://arxiv.org/search/?searchtype=author&query=Anu Pradhan), [Alexandra Ortan](https://arxiv.org/search/?searchtype=author&query=Alexandra Ortan), [Apurv Verma](https://arxiv.org/search/?searchtype=author&query=Apurv Verma), [Madhavan Seshadri](https://arxiv.org/search/?searchtype=author&query=Madhavan Seshadri) 作者:Anu Pradhan、Alexandra Ortan、Apurv Verma、Madhavan Seshadri
The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff’s alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet’s AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework. 随着生成式人工智能的兴起,推荐系统中的评估瓶颈变得尤为严重,传统指标无法捕捉法律研究等专业领域重要的细微质量维度。我们能否相信大型语言模型能够充当同类的可靠评判者?本文研究了将 LLM-as-a-Judge 作为在法律环境中评估检索增强生成系统的一种原则性方法,在法律环境中,推荐质量的风险非常高。我们解决了决定实际可行性的两个基本问题:哪些评估者间可靠性指标最能捕捉 LLM 和人工评估之间的一致性,以及我们如何在竞争系统之间进行统计上合理的比较?通过系统实验,我们发现传统的一致性指标(如 Krippendorff 的 alpha)在人工智能系统评估典型的偏斜分布中可能会产生误导。相反,Gwet 的 AC2 和秩相关系数成为评判模型选择的更可靠的指标,而带有 Benjamini-Hochberg 校正的 Wilcoxon 符号秩检验提供了可靠系统比较所需的统计严谨性。我们的研究结果提出了一条可扩展、具有成本效益的评估之路,该评估保持了法律应用所需的精度,将曾经的人力密集型瓶颈转变为自动化但具有统计原则的评估框架。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 19:20:21 UTC 发布时间: 2025-09-15 19:20:21 UTC
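The abstract above pairs the Wilcoxon signed-rank test with Benjamini-Hochberg corrections for comparing several systems at once. As a rough sketch of the correction step only, the standard Benjamini-Hochberg step-up procedure can be written in plain Python. The p-values are made up for illustration; in practice they might come from a paired test such as `scipy.stats.wilcoxon`, and nothing here reproduces the paper's actual pipeline.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding null
    hypothesis is rejected while controlling the false discovery
    rate at level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise system comparisons.
p_vals = [0.03, 0.5, 0.02, 0.02]
decisions = benjamini_hochberg(p_vals)
print(decisions)
```

Note the step-up behavior in the toy data: the smallest p-value (0.02) misses its own threshold of 0.0125, but it is still rejected because a higher-ranked p-value meets its threshold.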
#44 MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables #44 MORABLES:用寓言评估 LLM 抽象道德推理的基准
Authors: [Matteo Marcuzzo](https://arxiv.org/search/?searchtype=author&query=Matteo Marcuzzo), [Alessandro Zangari](https://arxiv.org/search/?searchtype=author&query=Alessandro Zangari), [Andrea Albarelli](https://arxiv.org/search/?searchtype=author&query=Andrea Albarelli), [Jose Camacho-Collados](https://arxiv.org/search/?searchtype=author&query=Jose Camacho-Collados), [Mohammad Taher Pilehvar](https://arxiv.org/search/?searchtype=author&query=Mohammad Taher Pilehvar) 作者:Matteo Marcuzzo、Alessandro Zangari、Andrea Albarelli、Jose Camacho-Collados、Mohammad Taher Pilehvar
As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance. 随着 LLM 在标准阅读理解基准上表现出色,人们的注意力正在转向评估其复杂抽象推理与推断的能力。基于文学的基准具有丰富的叙事和道德深度,为评估这种更深层次的理解技能提供了一个令人信服的框架。在这里,我们展示了 MORABLES,这是一个由历史文学中的寓言和短篇小说构建的经过人类验证的基准。主要任务的结构是针对道德推理的多项选择题,并带有精心设计的干扰项,挑战模型超越肤浅的、抽取式的问答。为了进一步对模型的稳健性进行压力测试,我们引入了对抗性变体,旨在发现由于数据污染等问题而导致的 LLM 漏洞和捷径。我们的研究结果表明,虽然较大的模型优于较小的模型,但它们仍然容易受到对抗性操纵,并且通常依赖于肤浅的模式而不是真正的道德推理。这种脆弱性导致了严重的自相矛盾,在大约 20% 的情况下,最好的模型会反驳自己的答案,具体取决于道德选择的表述方式。有趣的是,推理增强模型未能弥合这一差距,这表明性能的主要驱动力是规模,而不是推理能力。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-15 19:06:10 UTC 发布时间: 2025-09-15 19:06:10 UTC
#45 MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch #45 MTEB-NL 和 E5-NL:荷兰语的嵌入基准和模型
Authors: [Nikolay Banar](https://arxiv.org/search/?searchtype=author&query=Nikolay Banar), [Ehsan Lotfi](https://arxiv.org/search/?searchtype=author&query=Ehsan Lotfi), [Jens Van Nooten](https://arxiv.org/search/?searchtype=author&query=Jens Van Nooten), [Cristina Arhiliuc](https://arxiv.org/search/?searchtype=author&query=Cristina Arhiliuc), [Marija Kliocaite](https://arxiv.org/search/?searchtype=author&query=Marija Kliocaite), [Walter Daelemans](https://arxiv.org/search/?searchtype=author&query=Walter Daelemans) 作者:Nikolay Banar、Ehsan Lotfi、Jens Van Nooten、Cristina Arhiliuc、Marija Kliocaite、Walter Daelemans
Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models, compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package. 最近,包括模型、基准测试和数据集在内的嵌入资源已被广泛发布,以支持多种语言。然而,荷兰语的代表性仍然不足,通常只占已发布多语言资源的一小部分。为了解决这一差距并鼓励荷兰语嵌入的进一步发展,我们引入了新的资源来评估和生成它们。首先,我们介绍了荷兰语海量文本嵌入基准测试(MTEB-NL),它包括现有的荷兰语数据集和新创建的数据集,涵盖了广泛的任务。其次,我们提供了一个由可用的荷兰语检索数据集编译而成的训练数据集,并辅以大型语言模型生成的合成数据,以将任务覆盖范围扩展到检索之外。最后,我们发布了一系列 E5-NL 模型,这些紧凑而高效的嵌入模型在多个任务中表现出强大的性能。我们通过 Hugging Face Hub 和 MTEB 包公开我们的资源。
Subject: Computation and Language 主题:计算和语言
Publish: 2025-09-15 18:08:08 UTC 发布时间: 2025-09-15 18:08:08 UTC
#46 WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning #46 WebSailor-V2:通过合成数据和可扩展的强化学习弥合专有代理的鸿沟
Authors: [Kuan Li](https://arxiv.org/search/?searchtype=author&query=Kuan Li), [Zhongwang Zhang](https://arxiv.org/search/?searchtype=author&query=Zhongwang Zhang), [Huifeng Yin](https://arxiv.org/search/?searchtype=author&query=Huifeng Yin), [Rui Ye](https://arxiv.org/search/?searchtype=author&query=Rui Ye), [Yida Zhao](https://arxiv.org/search/?searchtype=author&query=Yida Zhao), [Liwen Zhang](https://arxiv.org/search/?searchtype=author&query=Liwen Zhang), [Litu Ou](https://arxiv.org/search/?searchtype=author&query=Litu Ou), [Dingchu Zhang](https://arxiv.org/search/?searchtype=author&query=Dingchu Zhang), [Xixi Wu](https://arxiv.org/search/?searchtype=author&query=Xixi Wu), [Jialong Wu](https://arxiv.org/search/?searchtype=author&query=Jialong Wu), [Xinyu Wang](https://arxiv.org/search/?searchtype=author&query=Xinyu Wang), [Zile Qiao](https://arxiv.org/search/?searchtype=author&query=Zile Qiao), [Zhen Zhang](https://arxiv.org/search/?searchtype=author&query=Zhen Zhang), [Yong Jiang](https://arxiv.org/search/?searchtype=author&query=Yong Jiang), [Pengjun Xie](https://arxiv.org/search/?searchtype=author&query=Pengjun Xie), [Fei Huang](https://arxiv.org/search/?searchtype=author&query=Fei Huang), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou) 作者: 李宽, 张忠旺, 尹惠峰, 叶瑞, 赵一达, 张立文, 欧丽图, 张定初, 吴曦曦, 吴佳龙, 王新宇, 乔子乐, 张振, 江勇, 谢鹏军, 黄飞, 周景仁
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap. 超越人类认知限制是 LLM 训练的一个关键前沿。像 DeepResearch 这样的专有智能体系统在极其复杂的信息搜索基准(例如 BrowseComp)上展示了超人的能力,这是以前无法实现的壮举。我们认为,它们的成功取决于开源模型中不存在的复杂推理模式:在浏览广阔的信息环境时系统地减少极端不确定性的能力。基于这一见解,我们推出了 WebSailor,这是一种完整的后训练方法,旨在灌输这一关键能力。我们的方法涉及通过结构化采样和信息混淆、RFT 冷启动和高效的智能体 RL 训练算法复制采样策略优化 (DUPO) 来生成新颖的高不确定性任务。借助这种集成管道,WebSailor 在复杂的信息搜索任务中显著优于所有开源智能体,与专有智能体的性能相匹配并缩小能力差距。
Subjects: Machine Learning, Computation and Language 科目:机器学习、计算和语言
Publish: 2025-09-16 17:57:03 UTC 发布时间: 2025-09-16 17:57:03 UTC
#47 RepIt: Representing Isolated Targets to Steer Language Models #47 RepIt:表示孤立的目标来引导语言模型
Authors: [Vincent Siu](https://arxiv.org/search/?searchtype=author&query=Vincent Siu), [Nathan W. Henry](https://arxiv.org/search/?searchtype=author&query=Nathan W. Henry), [Nicholas Crispino](https://arxiv.org/search/?searchtype=author&query=Nicholas Crispino), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Dawn Song](https://arxiv.org/search/?searchtype=author&query=Dawn Song), [Chenguang Wang](https://arxiv.org/search/?searchtype=author&query=Chenguang Wang) 作者: Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior. 虽然大型语言模型 (LLM) 中的激活引导是一个不断发展的研究领域,但这些方法通常会产生比预期更广泛的影响。这促使人们隔离更纯粹的概念向量,以实现有针对性的干预并在更精细的层面上理解 LLM 行为。我们提出了 RepIt,这是一个简单且数据高效的框架,用于隔离特定于概念的表示。在五个前沿 LLM 中,RepIt 能够进行精确的干预:它有选择地抑制对目标概念的拒绝,同时保留其他地方的拒绝,生成回答大规模杀伤性武器相关问题的模型,同时在标准基准上仍得分为安全。我们进一步表明,校正信号仅定位到 100-200 个神经元,并且可以在单个 A6000 上从少至十几个示例中提取稳健的目标表示。这种效率引发了双重担忧:可以使用适度的计算和数据进行操纵,将其扩展到代表性不足的数据稀缺主题,同时规避现有基准。通过用 RepIt 解开拒绝向量,这项工作表明有针对性的干预可以抵消过度泛化,为更精细地控制模型行为奠定基础。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 17:35:36 UTC 发布时间: 2025-09-16 17:35:36 UTC
#48 HARMONIC: A Content-Centric Cognitive Robotic Architecture #48 HARMONIC:以内容为中心的认知机器人架构
Authors: [Sanjay Oruganti](https://arxiv.org/search/?searchtype=author&query=Sanjay Oruganti), [Sergei Nirenburg](https://arxiv.org/search/?searchtype=author&query=Sergei Nirenburg), [Marjorie McShane](https://arxiv.org/search/?searchtype=author&query=Marjorie McShane), [Jesse English](https://arxiv.org/search/?searchtype=author&query=Jesse English), [Michael K. Roberts](https://arxiv.org/search/?searchtype=author&query=Michael K. Roberts), [Christian Arndt](https://arxiv.org/search/?searchtype=author&query=Christian Arndt), [Carlos Gonzalez](https://arxiv.org/search/?searchtype=author&query=Carlos Gonzalez), [Mingyo Seo](https://arxiv.org/search/?searchtype=author&query=Mingyo Seo), [Luis Sentis](https://arxiv.org/search/?searchtype=author&query=Luis Sentis) 作者:Sanjay Oruganti、Sergei Nirenburg、Marjorie McShane、Jesse English、Michael K. Roberts、Christian Arndt、Carlos Gonzalez、Mingyo Seo、Luis Sentis
This paper introduces HARMONIC, a cognitive-robotic architecture designed for robots in human-robotic teams. HARMONIC supports semantic perception interpretation, human-like decision-making, and intentional language communication. It addresses the issues of safety and quality of results; aims to solve problems of data scarcity, explainability, and safety; and promotes transparency and trust. Two proof-of-concept HARMONIC-based robotic systems are demonstrated, each implemented in both a high-fidelity simulation environment and on physical robotic platforms. 本文介绍了 HARMONIC,这是一种专为人机团队中的机器人设计的认知机器人架构。HARMONIC 支持语义感知解释、类人决策和有意的语言交流。它解决了结果的安全和质量问题;旨在解决数据稀缺性、可解释性和安全性等问题;并促进透明度和信任。演示了两个基于 HARMONIC 的概念验证机器人系统,每个系统都在高保真仿真环境和物理机器人平台上实现。
Subjects: Robotics, Artificial Intelligence, Computation and Language 科目:机器人技术、人工智能、计算和语言
Publish: 2025-09-16 17:34:18 UTC 发布时间: 2025-09-16 17:34:18 UTC
#49 Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter #49 播客作为参与集体行动的媒介:黑人的命也是命的案例研究
Authors: [Theodora Moldovan](https://arxiv.org/search/?searchtype=author&query=Theodora Moldovan), [Arianna Pera](https://arxiv.org/search/?searchtype=author&query=Arianna Pera), [Davide Vega](https://arxiv.org/search/?searchtype=author&query=Davide Vega), [Luca Maria Aiello](https://arxiv.org/search/?searchtype=author&query=Luca Maria Aiello) 作者:西奥多拉·摩尔多瓦、阿里安娜·佩拉、大卫·维加、卢卡·玛丽亚·艾洛
We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion. 我们以“黑人的命也是命”(BLM) 运动为案例研究,研究如何在播客讨论中阐明参与集体行动。虽然对集体行动话语的研究主要集中在基于文本的内容上,但这项研究迈出了使用播客文字记录分析音频格式的第一步。使用结构化播客研究语料库 (SPoRC),我们研究了参与集体行动的口语表达,分为问题解决方案、号召性用语、意图和执行。我们在 2020 年 5 月和 6 月的重要 BLM 相关事件后确定了讨论种族正义的播客剧集,并使用改编自社交媒体先前工作的分层框架提取了参与性陈述。我们研究了这些陈述的情感维度,发现了八种关键情绪及其与不同激进主义阶段的关联。我们发现情绪特征因阶段而异,在号召性用语、意图和执行过程中,不同的积极情绪脱颖而出。我们发现集体行动与负面情绪之间存在负面关联,这与理论预期相反。我们的工作有助于更好地理解激进主义如何在口头数字话语中表达,以及情感框架如何取决于讨论的形式。
Subjects: Social and Information Networks, Computation and Language, Computers and Society 科目:社会与信息网络、计算与语言、计算机与社会
Publish: 2025-09-16 16:00:19 UTC 发布时间: 2025-09-16 16:00:19 UTC
#50 Textarium: Entangling Annotation, Abstraction and Argument #50 Textarium:纠缠不清的注释、抽象和论证
Authors: [Philipp Proff](https://arxiv.org/search/?searchtype=author&query=Philipp Proff), [Marian Dörk](https://arxiv.org/search/?searchtype=author&query=Marian Dörk) 作者:Philipp Proff、Marian Dörk
We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives. 我们提出了一个基于网络的环境,在文本解释过程中连接注释、抽象和论证。作为学术阅读和写作的可视化界面,Textarium 将人类分析与轻量级计算处理相结合,以弥合细读和远读实践。读者可以突出显示文本,将关键字分组为概念,并将这些观察结果作为锚点嵌入到论文中。该界面将这些解释性操作呈现为参数化可视化状态。通过共同创造和迭代原型的推测性设计过程,我们开发了一种读写方法,使解释过程在数字叙事中透明和可共享。
Subjects: Human-Computer Interaction, Computation and Language 科目:人机交互、计算与语言
Publish: 2025-09-16 15:46:00 UTC 发布时间: 2025-09-16 15:46:00 UTC
#51 When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning #51 当逆向数据表现更优:探索多阶段微调中混合数据的陷阱
Authors: [Mengyi Deng](https://arxiv.org/search/?searchtype=author&query=Mengyi Deng), [Xin Li](https://arxiv.org/search/?searchtype=author&query=Xin Li), [Tingyu Zhu](https://arxiv.org/search/?searchtype=author&query=Tingyu Zhu), [Zhicheng Yang](https://arxiv.org/search/?searchtype=author&query=Zhicheng Yang), [Zhijiang Guo](https://arxiv.org/search/?searchtype=author&query=Zhijiang Guo), [Wei Wang](https://arxiv.org/search/?searchtype=author&query=Wei Wang) 作者:邓梦怡、李欣、朱婷玉、杨志成、郭志江、王伟
Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies. 现有研究表明,通过有限的数据蒸馏可以实现 o1 级性能,但大多数现有方法都侧重于单向监督微调 (SFT),而忽略了不同推理模式之间错综复杂的相互作用。在本文中,我们构建了 r1k,这是一个通过反演 s1k 的 1,000 个正向示例得出的高质量逆向推理数据集,并研究了 SFT 和直接偏好优化 (DPO) 如何影响双向推理目标下的对齐。在评估的基准测试中,r1k 上的 SFT 比 s1k 提高了 1.6%–6.8% 的准确率。然而,在 SFT 期间天真地混合正向和反向数据会削弱方向区分。尽管 DPO 可以部分恢复这种区别,但它也通过将概率质量转移到不相关的输出来抑制不太受欢迎的推理路径。这些发现表明,混合推理数据引入了相互矛盾的监督信号,强调了对稳健和方向感知对齐策略的必要性。
Subjects: Machine Learning, Computation and Language 科目:机器学习、计算和语言
Publish: 2025-09-16 13:36:36 UTC 发布时间: 2025-09-16 13:36:36 UTC
#52 Jailbreaking Large Language Models Through Content Concretization #52 通过内容具体化越狱大型语言模型
Authors: [Johan Wahréus](https://arxiv.org/search/?searchtype=author&query=Johan Wahréus), [Ahmed Hussain](https://arxiv.org/search/?searchtype=author&query=Ahmed Hussain), [Panos Papadimitratos](https://arxiv.org/search/?searchtype=author&query=Panos Papadimitratos) 作者:Johan Wahréus、Ahmed Hussain、Panos Papadimitratos
Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier models with less constrained safety filters, then refining them through higher-tier models that process both the preliminary output and original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5 cents per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.
大型语言模型 (LLM) 越来越多地用于任务自动化和内容生成,但其安全机制仍然容易被不同的越狱技术规避。在本文中,我们介绍了 Content Concretization (CC),这是一种新颖的越狱技术,可以迭代地将抽象的恶意请求转换为具体的、可执行的实现。CC 是一个两阶段的过程:首先,使用安全过滤限制较少的较低层模型生成初始 LLM 响应,然后通过同时处理初步输出和原始提示的更高层模型对其进行细化。我们使用 350 个特定于网络安全的提示来评估我们的技术,证明越狱成功率 (SR) 有了显著提高,在三次细化迭代后从 7%(无细化)增加到 62%,同时保持每个提示 7.5 美分的成本。对九个不同的 LLM 评估器进行的比较 A/B 测试证实,额外细化步骤的输出始终被评为更恶意且技术更优越。此外,手动代码分析表明,生成的输出只需最少的修改即可执行,尽管最佳部署通常需要特定于目标的微调。随着最终改进的有害代码生成,这些结果凸显了当前 LLM 安全框架中的关键漏洞。
Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 科目: 密码学与安全 , 人工智能 , 计算与语言
Publish: 2025-09-16 10:34:26 UTC 发布时间: 2025-09-16 10:34:26 UTC
#53 Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety #53 重新思考对齐方法的评估:对多样性、泛化和安全性的洞察
Authors: [Denis Janiak](https://arxiv.org/search/?searchtype=author&query=Denis Janiak), [Julia Moska](https://arxiv.org/search/?searchtype=author&query=Julia Moska), [Dawid Motyka](https://arxiv.org/search/?searchtype=author&query=Dawid Motyka), [Karolina Seweryn](https://arxiv.org/search/?searchtype=author&query=Karolina Seweryn), [Paweł Walkowiak](https://arxiv.org/search/?searchtype=author&query=Paweł Walkowiak), [Bartosz Żuk](https://arxiv.org/search/?searchtype=author&query=Bartosz Żuk), [Arkadiusz Janz](https://arxiv.org/search/?searchtype=author&query=Arkadiusz Janz) 作者:Denis Janiak、Julia Moska、Dawid Motyka、Karolina Seweryn、Paweł Walkowiak、Bartosz Żuk、Arkadiusz Janz
Large language models (LLMs) require careful alignment to balance competing objectives - factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs. 大型语言模型 (LLM) 需要仔细对齐,以平衡相互竞争的目标:真实性、安全性、简洁性、主动性和多样性。现有研究侧重于个别技术或特定维度,缺乏对固有权衡的整体评估。我们提出了一个统一的评估框架,使用分布内和分布外数据集,在这五个轴上比较 LLM 对齐方法(PPO、DPO、ORPO、KTO)。利用经过人工评估研究验证的专门 LLM-as-Judge 提示,我们发现 DPO 和 KTO 在事实准确性方面表现出色,PPO 和 DPO 在安全性方面处于领先地位,而 PPO 在简洁与主动性方面取得了最佳平衡。我们的研究结果提供了对常见对齐方法权衡的见解,指导开发更加平衡和可靠的 LLM。
Subjects: Machine Learning, Computation and Language 科目:机器学习、计算和语言
Publish: 2025-09-16 10:32:59 UTC 发布时间: 2025-09-16 10:32:59 UTC
#54 InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering #54 InfoGain-RAG:通过基于文档信息增益的重新排名和过滤促进检索增强生成
Authors: [Zihan Wang](https://arxiv.org/search/?searchtype=author&query=Zihan Wang), [Zihan Liang](https://arxiv.org/search/?searchtype=author&query=Zihan Liang), [Zhou Shao](https://arxiv.org/search/?searchtype=author&query=Zhou Shao), [Yufei Ma](https://arxiv.org/search/?searchtype=author&query=Yufei Ma), [Huangyu Dai](https://arxiv.org/search/?searchtype=author&query=Huangyu Dai), [Ben Chen](https://arxiv.org/search/?searchtype=author&query=Ben Chen), [Lingtao Mao](https://arxiv.org/search/?searchtype=author&query=Lingtao Mao), [Chenyi Lei](https://arxiv.org/search/?searchtype=author&query=Chenyi Lei), [Yuqing Ding](https://arxiv.org/search/?searchtype=author&query=Yuqing Ding), [Han Li](https://arxiv.org/search/?searchtype=author&query=Han Li) 作者:Zihan Wang、Zihan Liang、Zhou Shao、Yufei Ma、Huangyu Dai、Ben Chen、Lingtao Mao、Chenyi Lei、Yuqing Ding、Han Li
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lacking reference. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts the final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document’s value by computing the difference of LLM’s generation confidence with and without the document augmented. Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which prioritizes each retrieved document from exact distinguishing and accurate sorting perspectives. This approach can effectively filter out irrelevant documents and select the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG can significantly outperform existing approaches, on both single and multiple retrievers paradigm. Specifically on NaturalQA, it achieves the improvements of 17.9%, 4.5%, 12.5% in exact match accuracy against naive RAG, self-reflective RAG and modern ranking-based RAG respectively, and even an average of 15.3% increment on advanced proprietary model GPT-4o across all datasets. These results demonstrate the feasibility of InfoGain-RAG as it can offer a reliable solution for RAG in multiple applications. 
检索增强生成 (RAG) 已成为一种很有前途的方法,可以解决大型语言模型 (LLM) 的关键局限性,例如幻觉、过时的知识和缺乏参考。然而,当前的 RAG 框架经常难以确定检索到的文档是否对答案生成有意义。这一缺点使得很难过滤掉不相关甚至误导性的内容,这显着影响了最终的性能。在本文中,我们提出了文档信息增益(DIG),这是一种新型指标,旨在量化检索到的文档对正确答案生成的贡献。DIG 通过计算 LLM 在增强和未增强文档的情况下生成置信度的差异来衡量文档的价值。此外,我们还介绍了 InfoGain-RAG,这是一个利用 DIG 分数来训练专门的重新排序器的框架,该框架从精确区分和准确的排序角度对每个检索到的文档进行优先级排序。这种方法可以有效地过滤掉不相关的文档,并选择最有价值的文档,以便更好地生成答案。跨各种模型和基准测试的广泛实验表明,InfoGain-RAG 在单个和多个检索器范式上都可以显着优于现有方法。具体在 NaturalQA 上,它与朴素 RAG、自反射 RAG 和基于现代排名的 RAG 的精确匹配准确率分别提高了 17.9%、4.5%、12.5%,甚至在所有数据集的高级专有模型 GPT-4o 上平均提高了 15.3%。这些结果证明了 InfoGain-RAG 的可行性,因为它可以在多种应用中为 RAG 提供可靠的解决方案。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 科目:信息检索、人工智能、计算与语言
Publish: 2025-09-16 07:28:07 UTC 发布时间: 2025-09-16 07:28:07 UTC
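The DIG definition above (generation confidence with the document minus confidence without it) can be sketched in a few lines. This is a minimal illustration under stated assumptions: confidence is taken here as the geometric-mean token probability of the correct answer, which may differ from the aggregation the paper uses, and the token log-probabilities and document ids are hypothetical inputs that would come from scoring the gold answer with an LLM.

```python
import math

def confidence(token_logprobs):
    """Assumed confidence: geometric mean of the answer's token probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def document_information_gain(logprobs_with_doc, logprobs_without_doc):
    """DIG = confidence with the document in the prompt minus confidence without it."""
    return confidence(logprobs_with_doc) - confidence(logprobs_without_doc)

def score_documents(candidates, baseline_logprobs, min_gain=0.0):
    """Drop documents whose gain is not above min_gain, then sort by gain (descending)."""
    scored = [
        (doc_id, document_information_gain(lp, baseline_logprobs))
        for doc_id, lp in candidates.items()
    ]
    kept = [(doc_id, gain) for doc_id, gain in scored if gain > min_gain]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical log-probabilities of the gold answer tokens.
baseline = [-2.0, -2.0]            # no document in the prompt
candidates = {
    "helpful":    [-0.1, -0.1],    # answer becomes much more likely
    "partial":    [-1.5, -1.5],    # small improvement
    "misleading": [-4.0, -4.0],    # answer becomes less likely, negative DIG
}
ranking = score_documents(candidates, baseline)
print(ranking)
```

According to the abstract, such scores are used as supervision for a dedicated reranker rather than computed at query time; the sort here only makes the resulting ordering visible.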
#55 Similarity-Distance-Magnitude Activations #55 相似性-距离-幅度激活
Author: [Allen Schmaltz](https://arxiv.org/search/?searchtype=author&query=Allen Schmaltz) 作者:艾伦·施马尔茨
We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to co-variate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax. 我们通过在现有的输出幅度(即决策边界)感知之上加入相似性(即正确预测的深度匹配到训练中)感知和到训练分布的距离感知,为神经网络常用的标准 softmax 激活函数引入了一种更稳健、更可解释的形式。当用作语言模型的最后一层激活时,所得的相似性-距离-幅度(SDM)激活函数对高概率区域中的协变量偏移和分布外输入比 softmax 函数更稳健,并通过密集匹配提供基于示例的可解释性。作为预测条件估计的补充,SDM 激活可以对按类别的经验 CDF 进行分区,以防止选择性分类中按类别的召回率过低。这些特性使其更适合选择性分类,即使考虑对 softmax 进行事后校准的方法也是如此。
Subjects: Machine Learning, Computation and Language 科目:机器学习、计算和语言
Publish: 2025-09-16 07:19:38 UTC 发布时间: 2025-09-16 07:19:38 UTC
#56 Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs #56 通过 LLM 检索增强框架进行零样本图推理
Authors: [Hanqing Li](https://arxiv.org/search/?searchtype=author&query=Hanqing Li), [Kiran Sheena Jyothi](https://arxiv.org/search/?searchtype=author&query=Kiran Sheena Jyothi), [Henry Liang](https://arxiv.org/search/?searchtype=author&query=Henry Liang), [Sharika Mahadevan](https://arxiv.org/search/?searchtype=author&query=Sharika Mahadevan), [Diego Klabjan](https://arxiv.org/search/?searchtype=author&query=Diego Klabjan) 作者:Hanqing Li、Kiran Sheena Jyothi、Henry Liang、Sharika Mahadevan、Diego Klabjan
We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes. 我们提出了一种新的、无需训练的方法,即通过检索增强框架 (GRRAF) 进行图推理,它利用检索增强生成 (RAG) 以及大型语言模型 (LLM) 的代码生成功能来处理广泛的图推理任务。在 GRRAF 中,目标图存储在图数据库中,并提示 LLM 生成可执行代码查询以检索必要的信息。这种方法规避了需要大量微调或依赖预定义算法的现有方法的局限性,并且它结合了带有超时机制的错误反馈循环,以确保正确性和效率。对 GraphInstruct 数据集的实验评估表明,GRRAF 在大多数图推理任务上都达到了 100% 的准确率,包括循环检测、二分图检查、最短路径计算和最大流量,同时无论图大小如何,都保持一致的令牌成本。在子图匹配上观察到不完美但仍非常高的性能。值得注意的是,GRRAF 可以有效地扩展到具有多达 10,000 个节点的大型图。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 06:58:58 UTC 发布时间: 2025-09-16 06:58:58 UTC
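The GRRAF abstract describes prompting an LLM for an executable query and wrapping it in an error feedback loop with a time-out. A minimal sketch of that control flow follows. Everything concrete in it is an assumption for illustration: `fake_llm` stands in for the model, `toy_execute` stands in for a graph-database client (it ignores the time-out and evaluates a Python expression against an in-memory dict), and the feedback wording is invented.

```python
from typing import Any, Callable, Optional

def answer_with_feedback(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    execute: Callable[[str, float], Any],
    max_attempts: int = 3,
    timeout_s: float = 5.0,
) -> Any:
    """Ask for a query, run it, and feed any error or time-out back to the generator."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(question, feedback)
        try:
            return execute(code, timeout_s)
        except TimeoutError:
            feedback = f"attempt {attempt}: query exceeded {timeout_s}s, try a cheaper query"
        except Exception as exc:
            feedback = f"attempt {attempt}: {type(exc).__name__}: {exc}"
    raise RuntimeError(f"no valid query after {max_attempts} attempts; last feedback: {feedback}")

# Toy stand-ins (hypothetical): an in-memory "database" and a scripted "LLM".
GRAPH = {"edges": [("a", "b"), ("b", "c"), ("c", "a")]}
seen_feedback = []

def fake_llm(question, feedback):
    seen_feedback.append(feedback)
    # First draft contains a typo; the retry (triggered by feedback) is correct.
    return "len(graph['edgez'])" if feedback is None else "len(graph['edges'])"

def toy_execute(code, timeout_s):
    # A real system would send the query to a sandboxed graph database with a
    # time-out; evaluating model output directly like this is unsafe outside a toy.
    return eval(code, {"__builtins__": {}}, {"graph": GRAPH, "len": len})

edge_count = answer_with_feedback("How many edges does the graph have?", fake_llm, toy_execute)
print(edge_count, seen_feedback)
```

Because the graph stays in the database and only the query text passes through the model, the prompt size does not grow with the graph, which is consistent with the abstract's claim of token costs that are independent of graph size.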
#57 A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression #57 一种用于预测和治疗致癌突变进展的新型循环神经网络框架
Authors: [Rishab Parthasarathy](https://arxiv.org/search/?searchtype=author&query=Rishab Parthasarathy), [Achintya Bhowmik](https://arxiv.org/search/?searchtype=author&query=Achintya Bhowmik) 作者:Rishab Parthasarathy、Achintya Bhowmik
Despite significant medical advancements, cancer remains the second leading cause of death, with over 600,000 deaths per year in the US. One emerging field, pathway analysis, is promising but still relies on manually derived wet lab data, which is time-consuming to acquire. This work proposes an efficient, effective end-to-end framework for Artificial Intelligence (AI) based pathway analysis that predicts both cancer severity and mutation progression, thus recommending possible treatments. The proposed technique involves a novel combination of time-series machine learning models and pathway analysis. First, mutation sequences were isolated from The Cancer Genome Atlas (TCGA) Database. Then, a novel preprocessing algorithm was used to filter key mutations by mutation frequency. This data was fed into a Recurrent Neural Network (RNN) that predicted cancer severity. Then, the model probabilistically used the RNN predictions, information from the preprocessing algorithm, and multiple drug-target databases to predict future mutations and recommend possible treatments. This framework achieved robust results and Receiver Operating Characteristic (ROC) curves (a key statistical metric) with accuracies greater than 60%, similar to existing cancer diagnostics. In addition, preprocessing played an instrumental role in isolating important mutations, demonstrating that each cancer stage studied may contain on the order of a few-hundred key driver mutations, consistent with current research. Heatmaps based on predicted gene frequency were also generated, highlighting key mutations in each cancer. Overall, this work is the first to propose an efficient, cost-effective end-to-end framework for projecting cancer progression and providing possible treatments without relying on expensive, time-consuming wet lab work. 
尽管医学取得了重大进步,但癌症仍然是第二大死因,美国每年有超过 600,000 人死亡。一个新兴领域,即通路分析,很有前途,但仍然依赖于手动提取的湿实验室数据,这非常耗时。这项工作提出了一个高效、有效的基于人工智能 (AI) 的通路分析的端到端框架,可以预测癌症的严重程度和突变进展,从而推荐可能的治疗方法。所提出的技术涉及时间序列机器学习模型和通路分析的新颖组合。首先,从癌症基因组图谱 (TCGA) 数据库中分离突变序列。然后,采用一种新型预处理算法按突变频率过滤关键突变。这些数据被输入到预测癌症严重程度的循环神经网络 (RNN) 中。然后,该模型概率地使用 RNN 预测、来自预处理算法的信息和多个药物靶点数据库来预测未来的突变并推荐可能的治疗方法。该框架取得了稳健的结果和受试者工作特征 (ROC) 曲线(关键统计指标),准确率大于 60%,类似于现有的癌症诊断。此外,预处理在分离重要突变方面发挥了重要作用,表明所研究的每个癌症阶段可能包含几百个关键驱动突变,这与当前的研究一致。还生成了基于预测基因频率的热图,突出显示了每种癌症的关键突变。 总体而言,这项工作是第一个提出高效、具有成本效益的端到端框架,用于预测癌症进展并提供可能的治疗,而无需依赖昂贵、耗时的湿实验室工作。
Subjects: Machine Learning, Computation and Language, Quantitative Methods 科目:机器学习、计算与语言、定量方法
Publish: 2025-09-16 06:46:28 UTC 发布时间: 2025-09-16 06:46:28 UTC
#58 DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models #58 DaSAThco:通过大型语言模型优化数据感知 SAT 启发式组合
Authors: [Minyu Chen](https://arxiv.org/search/?searchtype=author&query=Minyu Chen), [Guoqiang Li](https://arxiv.org/search/?searchtype=author&query=Guoqiang Li) 作者:Minyu Chen、Guoqiang Li
The performance of Conflict-Driven Clause Learning solvers hinges on internal heuristics, yet the heterogeneity of SAT problems makes a single, universally optimal configuration unattainable. While prior automated methods can find specialized configurations for specific problem families, this dataset-specific approach lacks generalizability and requires costly re-optimization for new problem types. We introduce DaSAThco, a framework that addresses this challenge by learning a generalizable mapping from instance features to tailored heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework uses a Large Language Model, guided by systematically defined Problem Archetypes, to generate a diverse portfolio of specialized heuristic ensembles and subsequently learns an adaptive selection mechanism to form the final mapping. Experiments show that DaSAThco achieves superior performance and, most notably, demonstrates robust out-of-domain generalization where non-adaptive methods show limitations. Our work establishes a more scalable and practical path toward automated algorithm design for complex, configurable systems. 冲突驱动子句学习求解器的性能取决于内部启发式方法,但 SAT 问题的异构性使得单一的、普遍最优的配置无法实现。虽然以前的自动化方法可以为特定问题族找到专门的配置,但这种特定于数据集的方法缺乏通用性,并且需要对新问题类型进行昂贵的重新优化。我们介绍了 DaSAThco,这是一个框架,它通过学习从实例特征到定制启发式集成的可泛化映射来应对这一挑战,从而实现一次训练、广泛适应的模型。我们的框架使用大型语言模型,在系统定义的问题原型的指导下,生成多样化的专业启发式集成组合,随后学习自适应选择机制以形成最终映射。实验表明,DaSAThco 实现了卓越的性能,最值得注意的是,在非自适应方法显示出局限性的情况下,表现出稳健的域外泛化。我们的工作为复杂的可配置系统的自动化算法设计建立了一条更具可扩展性和实用性的途径。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 02:58:50 UTC 发布时间: 2025-09-16 02:58:50 UTC
#59 The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning #59 你学得越好,你修剪得越聪明:通过可微分的标记修剪实现高效的视觉-语言-行动模型
Authors: [Titong Jiang](https://arxiv.org/search/?searchtype=author&query=Titong Jiang), [Xuefeng Jiang](https://arxiv.org/search/?searchtype=author&query=Xuefeng Jiang), [Yuan Ma](https://arxiv.org/search/?searchtype=author&query=Yuan Ma), [Xin Wen](https://arxiv.org/search/?searchtype=author&query=Xin Wen), [Bailin Li](https://arxiv.org/search/?searchtype=author&query=Bailin Li), [Kun Zhan](https://arxiv.org/search/?searchtype=author&query=Kun Zhan), [Peng Jia](https://arxiv.org/search/?searchtype=author&query=Peng Jia), [Yahui Liu](https://arxiv.org/search/?searchtype=author&query=Yahui Liu), [Sheng Sun](https://arxiv.org/search/?searchtype=author&query=Sheng Sun), [Xianpeng Lang](https://arxiv.org/search/?searchtype=author&query=Xianpeng Lang) 作者:Titong Jiang、Xuefeng Jiang、Yuan Ma、Xin Wen、Bailin Li、Kun Zhan、Peng Jia、Yahui Liu、Sheng Sun、Xianpeng Lang
We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems. 
我们提出了 LightVLA,这是一个简单而有效的视觉-语言-动作 (VLA) 模型的可微分标记修剪框架。虽然 VLA 模型在执行现实世界的机器人任务方面表现出了令人印象深刻的能力,但它们在资源受限平台上的部署往往受制于对大量视觉标记进行的基于注意力的繁重计算。LightVLA 通过自适应、性能驱动的视觉标记修剪来解决这一挑战:它生成动态查询来评估视觉标记的重要性,并采用 Gumbel softmax 来实现可微分的标记选择。通过微调,LightVLA 学会了保留信息量最大的视觉标记,同时修剪对任务执行无助的标记,从而同时提高效率和性能。值得注意的是,LightVLA 不需要启发式幻数,也没有引入额外的可训练参数,使其与现代推理框架兼容。实验结果表明,在 LIBERO 基准测试中,LightVLA 在不同任务中的表现优于不同的 VLA 模型和现有的标记修剪方法,在大幅降低计算开销的情况下实现了更高的成功率。具体来说,LightVLA 将 FLOPs 和延迟分别降低了 59.1% 和 38.2%,任务成功率提高了 2.9%。同时,我们还研究了带有附加可训练参数、基于可学习查询的标记修剪方法 LightVLA*,该方法也取得了令人满意的性能。我们的工作表明,随着 VLA 追求最佳性能,LightVLA 会自发地从性能驱动的角度学习修剪标记。据我们所知,LightVLA 是第一个将自适应视觉标记修剪应用于 VLA 任务并同时兼顾效率和性能两个目标的工作,标志着朝着更高效、更强大和更实用的实时机器人系统迈出了重要一步。
Subjects: Robotics, Computation and Language, Computer Vision and Pattern Recognition 科目:机器人技术、计算与语言、计算机视觉与模式识别
Publish: 2025-09-16 02:43:46 UTC 发布时间: 2025-09-16 02:43:46 UTC
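The LightVLA abstract mentions Gumbel softmax as the mechanism that makes token selection differentiable. The sketch below shows only the forward pass of that trick in plain Python. The selection rule (each query picks one token, and the kept set is the union of the picks) and the function names are assumptions made for illustration; the abstract does not spell out the exact rule.

```python
import math
import random

def gumbel_softmax(scores, temperature=1.0, rng=random):
    """Soft, noisy one-hot over `scores`: softmax((scores + Gumbel noise) / temperature)."""
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)        # avoid log(0)
        g = -math.log(-math.log(u))         # Gumbel(0, 1) sample
        noisy.append((s + g) / temperature)
    m = max(noisy)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def select_tokens(score_matrix, temperature=1.0, seed=0):
    """Each query row picks one token index; return the sorted union of picks."""
    rng = random.Random(seed)
    kept = set()
    for query_scores in score_matrix:
        probs = gumbel_softmax(query_scores, temperature, rng)
        # Hard choice used in the forward pass. In an autograd framework the
        # gradient would flow through `probs` via a straight-through estimator.
        kept.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(kept)

# Three queries scoring three visual tokens; token 1 is never preferred, so it is pruned.
print(select_tokens([[100.0, 0.0, 0.0], [0.0, 0.0, 100.0], [100.0, 0.0, 0.0]]))
```

Under this reading the number of kept tokens is not fixed in advance: it is however many distinct tokens the queries choose, which matches the abstract's claim that no heuristic pruning ratio is needed.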
#60 Match Chat: Real Time Generative AI and Generative Computing for Tennis #60 Match Chat:网球实时生成式人工智能和生成式计算
Authors: [Aaron Baughman](https://arxiv.org/search/?searchtype=author&query=Aaron Baughman), [Gozde Akay](https://arxiv.org/search/?searchtype=author&query=Gozde Akay), [Eduardo Morales](https://arxiv.org/search/?searchtype=author&query=Eduardo Morales), [Rahul Agarwal](https://arxiv.org/search/?searchtype=author&query=Rahul Agarwal), [Preetika Srivastava](https://arxiv.org/search/?searchtype=author&query=Preetika Srivastava) 作者:Aaron Baughman、Gozde Akay、Eduardo Morales、Rahul Agarwal、Preetika Srivastava
We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability that highlights a practical path for deploying performant agentic systems in dynamic environments. 
我们推出 Match Chat,这是一款实时、代理驱动的助手,旨在通过对比赛相关查询提供即时、准确的响应来增强网球迷体验。Match Chat 将生成式人工智能 (GenAI) 与生成式计算 (GenComp) 技术相结合,在网球单打比赛直播期间综合提炼关键见解。该系统在 2025 年温布尔登锦标赛和 2025 年美国网球公开赛上首次亮相,通过自然语言查询为约 100 万用户提供了对流式数据和静态数据的无缝访问。该架构基于面向代理的架构 (AOA),结合了规则引擎、预测模型和代理,在将用户查询传递给 GenAI 组件之前对其进行预处理和优化。在负载高达每秒 120 个请求 (RPS) 的情况下,Match Chat 系统的回答准确率为 92.83%,平均响应时间为 6.25 秒。超过 96.08% 的查询都是使用交互式提示设计进行引导的,有助于提供优先考虑清晰度、响应能力和最小操作成本的用户体验。该系统旨在掩盖架构的复杂性,提供无摩擦且直观的界面,无需上手引导或技术背景。在两次大满贯部署中,Match Chat 都保持了 100% 的正常运行时间并支持近 100 万独立用户,凸显了该平台的可扩展性和可靠性。这项工作介绍了面向消费者的实时人工智能系统的关键设计模式,这些模式强调速度、精度和可用性,突出了在动态环境中部署高性能代理系统的实用路径。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 02:38:27 UTC 发布时间: 2025-09-16 02:38:27 UTC
#61 Yet Another Watermark for Large Language Models #61 大型语言模型的又一个水印
Authors: [Siyuan Bao](https://arxiv.org/search/?searchtype=author&query=Siyuan Bao), [Ying Shi](https://arxiv.org/search/?searchtype=author&query=Ying Shi), [Zhiguang Yang](https://arxiv.org/search/?searchtype=author&query=Zhiguang Yang), [Hanzhou Wu](https://arxiv.org/search/?searchtype=author&query=Hanzhou Wu), [Xinpeng Zhang](https://arxiv.org/search/?searchtype=author&query=Xinpeng Zhang) 作者:Siyuan Bao、Ying Shi、Zhiguang Yang、Hanzhou Wu、Xinpeng Zhang
Existing watermarking methods for large language models (LLMs) mainly embed watermark by adjusting the token sampling prediction or post-processing, lacking intrinsic coupling with LLMs, which may significantly reduce the semantic quality of the generated marked texts. Traditional watermarking methods based on training or fine-tuning may be extendable to LLMs. However, most of them are limited to the white-box scenario, or very time-consuming due to the massive parameters of LLMs. In this paper, we present a new watermarking framework for LLMs, where the watermark is embedded into the LLM by manipulating the internal parameters of the LLM, and can be extracted from the generated text without accessing the LLM. Comparing with related methods, the proposed method entangles the watermark with the intrinsic parameters of the LLM, which better balances the robustness and imperceptibility of the watermark. Moreover, the proposed method enables us to extract the watermark under the black-box scenario, which is computationally efficient for use. Experimental results have also verified the feasibility, superiority and practicality. This work provides a new perspective different from mainstream works, which may shed light on future research. 现有的大型语言模型(LLM)水印方法主要通过调整 token 采样预测或后处理来嵌入水印,缺乏与 LLM 的内在耦合,这可能会显著降低生成的带水印文本的语义质量。基于训练或微调的传统水印方法或许可以扩展到 LLM。然而,它们中的大多数都仅限于白盒场景,或者由于 LLM 的参数量巨大而非常耗时。在本文中,我们提出了一个新的 LLM 水印框架,其中水印通过操纵 LLM 的内部参数嵌入到 LLM 中,并且可以在不访问 LLM 的情况下从生成的文本中提取。与相关方法相比,所提方法将水印与 LLM 的固有参数纠缠在一起,更好地平衡了水印的鲁棒性和不可感知性。此外,所提方法使我们能够在黑盒场景下提取水印,使用起来具有计算效率。实验结果也验证了其可行性、优越性和实用性。这项工作提供了不同于主流工作的新视角,可能为未来的研究提供启示。
Subjects: Cryptography and Security, Computation and Language 科目: 密码学与安全 , 计算与语言
Publish: 2025-09-16 02:04:55 UTC 发布时间: 2025-09-16 02:04:55 UTC
#62 LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations #62 LEAF:具有教师对齐表示的文本嵌入模型的知识蒸馏
Authors: [Robin Vujanic](https://arxiv.org/search/?searchtype=author&query=Robin Vujanic), [Thomas Rueckstiess](https://arxiv.org/search/?searchtype=author&query=Thomas Rueckstiess) 作者:Robin Vujanic、Thomas Rueckstiess
We present LEAF (“Lightweight Embedding Alignment Framework”), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license. 
我们提出了 LEAF(“轻量级嵌入对齐框架”),这是一个用于文本嵌入模型的知识蒸馏框架。一个关键的区别特征是我们蒸馏得到的 leaf 模型与其教师模型对齐。在信息检索的上下文中,这允许灵活的非对称架构,其中文档使用较大的教师模型进行编码,而查询可以使用较小的 leaf 模型进行服务。我们还表明,只要教师模型中存在这些属性,leaf 模型就会自动继承 MRL 以及对输出量化的鲁棒性,而无需对它们进行显式训练。为了展示我们框架的能力,我们发布了 leaf-ir,这是一个使用 LEAF 训练的、面向信息检索的 23M 参数文本嵌入模型,它在 BEIR 上刷新了最先进水平 (SOTA),在该基准测试的公共排行榜上位列同规模模型第 1 名。在非对称模式下运行时,其检索性能进一步提高。然而,我们的方案并不局限于信息检索场景,我们通过合成多任务 leaf-mt 模型证明了其更广泛的适用性。该模型同样刷新了 SOTA,在公共 MTEB v2(英语)排行榜上位列同规模模型第 1 名。LEAF 适用于黑盒模型,与其他嵌入模型训练框架相比,它不需要相关性标注(judgments),也不需要难负例(hard negatives),并且可以使用小批量进行训练。因此,我们框架对数据集和训练基础设施的要求是适度的。我们在宽松的 Apache 2.0 许可下公开我们的模型。
Subjects: Information Retrieval, Computation and Language, Machine Learning 科目:信息检索、计算与语言、机器学习
Publish: 2025-09-16 00:41:05 UTC 发布时间: 2025-09-16 00:41:05 UTC
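The asymmetric mode described in the LEAF abstract relies on the student living in the teacher's embedding space: documents are embedded once with the large teacher, and queries are embedded with the small student at serving time. The toy sketch below shows that split with cosine similarity. The vectors, document ids, and the `student_encode_query` stub are all hypothetical; no real model is called.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Offline: document vectors produced by the (large) teacher model. Hypothetical values.
TEACHER_DOC_INDEX = {
    "doc_tennis": [0.9, 0.1, 0.0],
    "doc_graphs": [0.1, 0.9, 0.1],
    "doc_cancer": [0.0, 0.2, 0.9],
}

def student_encode_query(query):
    """Stub for the small student model. Because it is aligned to the teacher,
    its output is assumed to land close to where the teacher would place the query."""
    toy_outputs = {"shortest path in a graph": [0.15, 0.85, 0.05]}
    return toy_outputs[query]

def search(query, index=TEACHER_DOC_INDEX):
    """Online: embed the query with the student, rank teacher-encoded documents."""
    q = student_encode_query(query)
    return sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)

results = search("shortest path in a graph")
print(results[0])
```

The practical appeal is that the expensive model runs only at indexing time, while the per-query cost is that of the 23M-parameter student.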
#63 The Adaptation Paradox: Agency vs. Mimicry in Companion Chatbots #63 适应悖论:伴侣聊天机器人中的代理与模仿
Authors: [T. James Brandt](https://arxiv.org/search/?searchtype=author&query=T. James Brandt), [Cecilia Xi Wang](https://arxiv.org/search/?searchtype=author&query=Cecilia Xi Wang) 作者:T. James Brandt、Cecilia Xi Wang
Generative AI powers a growing wave of companion chatbots, yet principles for fostering genuine connection remain unsettled. We test two routes: visible user authorship versus covert language-style mimicry. In a preregistered 3x2 experiment (N = 162), we manipulated user-controlled avatar generation (none, premade, user-generated) and Language Style Matching (LSM) (static vs. adaptive). Generating an avatar boosted rapport (ω² = .040, p = .013), whereas adaptive LSM underperformed static style on personalization and satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive (t = 3.07, p = .003, d = 0.48). We term this an Adaptation Paradox: synchrony erodes connection when perceived as incoherent, destabilizing persona. To explain, we propose a stability-and-legibility account: visible authorship fosters natural interaction, while covert mimicry risks incoherence. Our findings suggest designers should prioritize legible, user-driven personalization and limit stylistic shifts rather than rely on opaque mimicry. 生成式人工智能正在推动陪伴聊天机器人的浪潮,但促进真正联系的原则仍然悬而未决。我们测试了两种途径:可见的用户创作与隐蔽的语言风格模仿。在预注册的 3x2 实验 (N = 162) 中,我们操纵了用户控制的头像生成(无、预制、用户生成)和语言风格匹配 (LSM)(静态与自适应)。生成头像增强了融洽关系(ω² = .040,p = .013),而自适应 LSM 在个性化和满意度方面不如静态风格(d = 0.35,p = .009),并且矛盾地被判断为适应性较差(t = 3.07,p = .003,d = 0.48)。我们称之为适应悖论:当同步被感知为不连贯、破坏角色稳定性时,它会侵蚀联系。为了解释,我们提出了一个稳定性和易读性的解释:可见的创作促进了自然的互动,而隐蔽的模仿则存在不连贯的风险。我们的研究结果表明,设计师应该优先考虑清晰、用户驱动的个性化并限制风格转变,而不是依赖不透明的模仿。
Subjects: Human-Computer Interaction, Computation and Language 科目:人机交互、计算与语言
Publish: 2025-09-16 00:02:27 UTC 发布时间: 2025-09-16 00:02:27 UTC
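The abstract does not say how Language Style Matching was computed. A commonly used formulation from the LSM literature scores each function-word category as 1 - |p1 - p2| / (p1 + p2 + 0.0001), where p1 and p2 are the two speakers' usage rates, and averages over categories. The sketch below implements that formulation as an assumption; the tiny word lists are illustrative and far smaller than a real function-word dictionary.

```python
# Illustrative (incomplete) function-word categories; real studies use full dictionaries.
FUNCTION_WORDS = {
    "pronouns": {"i", "you", "we", "it", "they", "me", "my", "your"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "at", "to", "of", "with", "for"},
    "conjunctions": {"and", "but", "or", "so", "because"},
    "negations": {"not", "no", "never"},
}

def category_rates(text):
    """Fraction of words in `text` that fall into each function-word category."""
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    words = [w for w in words if w]
    total = len(words)
    return {
        cat: (sum(w in vocab for w in words) / total if total else 0.0)
        for cat, vocab in FUNCTION_WORDS.items()
    }

def lsm_score(text_a, text_b):
    """Mean per-category matching score; 1.0 means identical function-word usage."""
    rates_a, rates_b = category_rates(text_a), category_rates(text_b)
    per_category = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 0.0001)
        for c in FUNCTION_WORDS
    ]
    return sum(per_category) / len(per_category)

user = "I think we should go to the park and see it"
bot_static = "Proceed directly toward park; observe surroundings"
print(lsm_score(user, user), lsm_score(user, bot_static))
```

An adaptive chatbot in the sense of the study would presumably steer its replies toward a higher score against the user's recent messages, while a static one keeps its own style regardless.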
#64 Context-Aware Language Models for Forecasting Market Impact from Sequences of Financial News #64 上下文感知语言模型,用于预测金融新闻序列的市场影响
Authors: [Ross Koval](https://arxiv.org/search/?searchtype=author&query=Ross Koval), [Nicholas Andrews](https://arxiv.org/search/?searchtype=author&query=Nicholas Andrews), [Xifeng Yan](https://arxiv.org/search/?searchtype=author&query=Xifeng Yan) 作者:Ross Koval、Nicholas Andrews、Xifeng Yan
Financial news plays a critical role in the information diffusion process in financial markets and is a known driver of stock prices. However, the information in each news article is not necessarily self-contained, often requiring a broader understanding of the historical news coverage for accurate interpretation. Further, identifying and incorporating the most relevant contextual information presents significant challenges. In this work, we explore the value of historical context in the ability of large language models to understand the market impact of financial news. We find that historical context provides a consistent and significant improvement in performance across methods and time horizons. To this end, we propose an efficient and effective contextualization method that uses a large LM to process the main article, while a small LM encodes the historical context into concise summary embeddings that are then aligned with the large model’s representation space. We explore the behavior of the model through multiple qualitative and quantitative interpretability tests and reveal insights into the value of contextualization. Finally, we demonstrate that the value of historical context in model predictions has real-world applications, translating to substantial improvements in simulated investment performance. 财经新闻在金融市场的信息传播过程中发挥着至关重要的作用,并且是股价的已知驱动因素。然而,每篇新闻文章中的信息并不一定是独立的,往往需要对历史新闻报道有更广泛的了解才能进行准确解读。此外,识别和整合最相关的上下文信息也带来了重大挑战。在这项工作中,我们探讨了历史背景在大型语言模型理解财经新闻市场影响的能力中的价值。我们发现,历史背景在不同方法和时间范围内提供了一致且显着的性能改进。为此,我们提出了一种高效且有效的上下文化方法,该方法使用大型 LM 来处理主要文章,而小型 LM 将历史上下文编码为简洁的摘要嵌入,然后与大模型的表示空间保持一致。我们通过多种定性和定量可解释性测试探索模型的行为,并揭示对情境化价值的见解。最后,我们证明了历史背景在模型预测中的价值具有实际应用,可以转化为模拟投资绩效的显着改进。
Subjects: Computational Engineering, Finance, and Science, Computation and Language, Computational Finance 科目: 计算工程, 金融与科学 , 计算与语言 , 计算金融
Publish: 2025-09-15 23:51:13 UTC 发布时间: 2025-09-15 23:51:13 UTC
#65 Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition #65 小模型,大结果:通过分解实现卓越的意图提取
Authors: [Danielle Cohen](https://arxiv.org/search/?searchtype=author&query=Danielle Cohen), [Yoni Halpern](https://arxiv.org/search/?searchtype=author&query=Yoni Halpern), [Noam Kahlon](https://arxiv.org/search/?searchtype=author&query=Noam Kahlon), [Joel Oren](https://arxiv.org/search/?searchtype=author&query=Joel Oren), [Omri Berkovitch](https://arxiv.org/search/?searchtype=author&query=Omri Berkovitch), [Sapir Caduri](https://arxiv.org/search/?searchtype=author&query=Sapir Caduri), [Ido Dagan](https://arxiv.org/search/?searchtype=author&query=Ido Dagan), [Anatoly Efros](https://arxiv.org/search/?searchtype=author&query=Anatoly Efros) 作者:Danielle Cohen、Yoni Halpern、Noam Kahlon、Joel Oren、Omri Berkovitch、Sapir Caduri、Ido Dagan、Anatoly Efros
Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs. 从 UI 交互轨迹中理解用户意图仍然是智能代理开发中一个具有挑战性但至关重要的前沿领域。虽然基于数据中心的大规模多模态大型语言模型 (MLLM) 具有更大的能力来处理此类序列的复杂性,但可以在设备上运行以提供隐私保护、低成本和低延迟用户体验的较小模型在准确的意图推理方面遇到了困难。我们通过引入一种新颖的分解方法来解决这些限制:首先,我们执行结构化交互摘要,从每个用户操作中捕获关键信息。其次,我们使用在聚合摘要上运行的微调模型进行意图提取。这种方法提高了资源受限模型中的意图理解,甚至超过了大型 MLLM 的基础性能。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-15 20:20:30 UTC 发布时间: 2025-09-15 20:20:30 UTC
#66 Exact Coset Sampling for Quantum Lattice Algorithms #66 量子格算法的精确陪集采样
Author: [Yifan Zhang](https://arxiv.org/search/?searchtype=author&query=Yifan Zhang) 作者:张一帆
We give a simple, fully correct, and assumption-light replacement for the contested “domain-extension” in Step 9 of a recent windowed-QFT lattice algorithm with complex-Gaussian windows (Chen, 2024). The published Step 9 suffers from a periodicity/support mismatch. We present a pair-shift difference construction that coherently cancels all unknown offsets, produces an exact uniform CRT-coset state over $\mathbb{Z}_P$, and then uses the QFT to enforce the intended modular linear relation. The unitary is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves the algorithm’s asymptotics. Project Page: https://github.com/yifanzhang-pro/quantum-lattice.
我们针对最近一个采用复高斯窗口的加窗 QFT 格算法(Chen, 2024)第 9 步中有争议的“域扩展”,给出了一个简单、完全正确且假设较少的替代方案。已发表的第 9 步存在周期性/支撑集不匹配的问题。我们提出了一种成对移位差分构造,它相干地抵消所有未知偏移量,在 $\mathbb{Z}_P$ 上产生精确均匀的 CRT 陪集态,然后使用 QFT 来强制实现预期的模线性关系。该酉变换是可逆的,使用 $\mathrm{poly}(\log M_2)$ 个门,并保持算法的渐近复杂度。项目页面:https://github.com/yifanzhang-pro/quantum-lattice。
Subjects: Quantum Physics, Computation and Language, Cryptography and Security 科目:量子物理学、计算与语言、密码学与安全
Publish: 2025-09-15 18:10:28 UTC 发布时间: 2025-09-15 18:10:28 UTC
#67 LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences #67 LLMAP:具有用户偏好的 LLM 辅助多目标路线规划
Authors: [Liangqi Yuan](https://arxiv.org/search/?searchtype=author&query=Liangqi Yuan), [Dong-Jun Han](https://arxiv.org/search/?searchtype=author&query=Dong-Jun Han), [Christopher G. Brinton](https://arxiv.org/search/?searchtype=author&query=Christopher G. Brinton), [Sabine Brunswicker](https://arxiv.org/search/?searchtype=author&query=Sabine Brunswicker) 作者:Liangqi Yuan、Dong-Jun Han、Christopher G. Brinton、Sabine Brunswicker
The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, extract user preferences, and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints. 大型语言模型(LLM)的兴起使自然语言驱动的路线规划成为一个包含丰富用户目标的新兴研究领域。目前的研究展示了两种不同的方法:使用 LLM-as-Agent 的直接路线规划和基于图的搜索策略。然而,前一种方法中的 LLM 难以处理大量地图数据,而后者在理解自然语言偏好方面的能力有限。此外,一个更关键的挑战来自全球用户高度异质和不可预测的时空分布。在本文中,我们介绍了一种新型的 LLM 辅助路线规划(LLMAP)系统,该系统采用 LLM-as-Parser 来理解自然语言、识别任务、提取用户偏好并识别任务依赖关系,并结合带迭代搜索的多步图构建(MSGS)算法作为最优路线查找的底层求解器。我们的多目标优化方法自适应地调整目标权重,以最大化兴趣点 (POI) 质量和任务完成率,同时最小化路线距离,并满足三个关键约束:用户时间限制、POI 开放时间和任务依赖关系。我们使用在全球 14 个国家和 27 个城市中采样的 1,000 个不同复杂程度的路线规划提示进行了广泛的实验。结果表明,我们的方法在满足多个约束的保证下实现了卓越的性能。
Subjects: Artificial Intelligence, Computation and Language, Machine Learning 科目: 人工智能 , 计算与语言 , 机器学习
Publish: 2025-09-14 02:30:19 UTC 发布时间: 2025-09-14 02:30:19 UTC
#68 Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics #68 像素中的幽默:评测大型多模态模型对网络漫画的理解
Authors: [Yuriel Ryan](https://arxiv.org/search/?searchtype=author&query=Yuriel Ryan), [Rui Yang Tan](https://arxiv.org/search/?searchtype=author&query=Rui Yang Tan), [Kenny Tsu Wei Choo](https://arxiv.org/search/?searchtype=author&query=Kenny Tsu Wei Choo), [Roy Ka-Wei Lee](https://arxiv.org/search/?searchtype=author&query=Roy Ka-Wei Lee) 作者:Yuriel Ryan、Rui Yang Tan、Kenny Tsu Wei Choo、Roy Ka-Wei Lee
Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions. 理解幽默是社会智能的一个核心方面,但它仍然是大型多模态模型 (LMM) 面临的重大挑战。我们介绍了 PixelHumor,这是一个包含 2,800 部带注释的多格漫画的基准数据集,旨在评估 LMM 解释多模态幽默和识别叙事序列的能力。使用最先进的 LMM 进行的实验揭示了巨大的差距:例如,顶级模型在分格排序方面的准确率仅为 61%,远低于人类的表现。这凸显了当前模型在整合视觉和文本线索以实现连贯叙事和幽默理解方面存在严重局限性。通过提供一个严格的框架来评估多模态上下文和叙事推理,PixelHumor 旨在推动 LMM 的发展,以更好地参与自然的、具有社会意识的互动。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 科目:计算机视觉与模式识别、人工智能、计算与语言
Publish: 2025-09-12 01:39:24 UTC 发布时间: 2025-09-12 01:39:24 UTC
#69 MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors #69 MEUV:通过互斥的解锁向量在大型语言模型中实现细粒度能力激活
Authors: [Xin Tong](https://arxiv.org/search/?searchtype=author&query=Xin Tong), [Zhi Lin](https://arxiv.org/search/?searchtype=author&query=Zhi Lin), [Jingya Wang](https://arxiv.org/search/?searchtype=author&query=Jingya Wang), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han), [Bo Jin](https://arxiv.org/search/?searchtype=author&query=Bo Jin) 作者:Xin Tong、Zhi Lin、Jingya Wang、Meng Han、Bo Jin
Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier “refusal-direction” edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLM deployment in security-sensitive domains. 大型语言模型 (LLM) 通过安全对齐来可靠地拒绝恶意请求,但同样一刀切的保护措施也会阻止警务、国防和其他高风险场景中的合法使用。早期的“拒绝方向”编辑可以绕过这些层,但它们依赖于一个单一的向量,该向量不加区别地解锁所有危险主题,不提供语义控制。我们引入了互斥解锁向量 (MEUV),这是一个轻量级框架,它将单一的整体拒绝方向分解为与主题对齐、几乎正交的向量,每个向量专用于一种敏感能力。MEUV 在单个 epoch 内通过多任务目标学习得到,该目标混合了差分消融裕度、跨主题惩罚和正交性惩罚以及几个辅助项。在双语恶意提示基准测试中,MEUV 在 Gemma-2-2B、LLaMA-3-8B 和 Qwen-7B 上的攻击成功率不低于 87%,同时与最佳单方向基线相比,跨主题泄漏减少了高达 90%。用中文训练的向量几乎无需改动即可迁移到英语(反之亦然),这表明存在与语言无关的拒绝子空间。结果表明,可以在效用损失极小的情况下实现细粒度的主题级能力激活,为在安全敏感领域中受控地部署 LLM 铺平了道路。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security 科目: 机器学习 , 人工智能 , 计算与语言 , 密码学与安全
Publish: 2025-09-04 07:16:06 UTC 发布时间: 2025-09-04 07:16:06 UTC
1.2.2 Artificial Intelligence
From: https://papers.cool/arxiv/cs.AI , https://arxiv.org/list/cs.AI/recent
2025-09-17 | | Total: 169
#1 Shapes of Cognition for Computational Cognitive Modeling #1 计算认知建模的认知形状
Authors: [Marjorie McShane](https://arxiv.org/search/?searchtype=author&query=Marjorie McShane), [Sergei Nirenburg](https://arxiv.org/search/?searchtype=author&query=Sergei Nirenburg), [Sanjay Oruganti](https://arxiv.org/search/?searchtype=author&query=Sanjay Oruganti), [Jesse English](https://arxiv.org/search/?searchtype=author&query=Jesse English) 作者:Marjorie McShane、Sergei Nirenburg、Sanjay Oruganti、Jesse English
Shapes of cognition is a new conceptual paradigm for the computational cognitive modeling of Language-Endowed Intelligent Agents (LEIAs). Shapes are remembered constellations of sensory, linguistic, conceptual, episodic, and procedural knowledge that allow agents to cut through the complexity of real life the same way as people do: by expecting things to be typical, recognizing patterns, acting by habit, reasoning by analogy, satisficing, and generally minimizing cognitive load to the degree situations permit. Atypical outcomes are treated using shapes-based recovery methods, such as learning on the fly, asking a human partner for help, or seeking an actionable, even if imperfect, situational understanding. Although shapes is an umbrella term, it is not vague: shapes-based modeling involves particular objectives, hypotheses, modeling strategies, knowledge bases, and actual models of wide-ranging phenomena, all implemented within a particular cognitive architecture. Such specificity is needed both to vet our hypotheses and to achieve our practical aims of building useful agent systems that are explainable, extensible, and worthy of our trust, even in critical domains. However, although the LEIA example of shapes-based modeling is specific, the principles can be applied more broadly, giving new life to knowledge-based and hybrid AI. 认知形状是面向具备语言能力的智能代理 (LEIA) 的计算认知建模的新概念范式。形状是被记住的感官、语言、概念、情节和程序知识的组合,它们允许智能体像人们一样应对现实生活的复杂性:通过期望事物是典型的、识别模式、按习惯行事、类比推理、满意即可(satisficing),并在情况允许的范围内将认知负荷降到最低。非典型结果则使用基于形状的恢复方法处理,例如即时学习、向人类伙伴寻求帮助,或寻求可操作的(即使不完美的)情境理解。尽管形状是一个总称,但它并不模糊:基于形状的建模涉及特定的目标、假设、建模策略、知识库和广泛现象的实际模型,所有这些都在特定的认知架构中实现。需要这种具体性来检验我们的假设,并实现我们的实际目标,即构建有用的代理系统,这些系统是可解释的、可扩展的,值得我们信任,即使在关键领域也是如此。然而,尽管 LEIA 这一基于形状的建模示例很具体,但这些原理可以更广泛地应用,为基于知识的 AI 和混合式 AI 赋予新的生命。
Subjects: Artificial Intelligence, Robotics 科目: 人工智能 , 机器人技术
Publish: 2025-09-16 17:39:58 UTC 发布时间: 2025-09-16 17:39:58 UTC
#2 RepIt: Representing Isolated Targets to Steer Language Models #2 RepIt:表示孤立的目标来引导语言模型
Authors: [Vincent Siu](https://arxiv.org/search/?searchtype=author&query=Vincent Siu), [Nathan W. Henry](https://arxiv.org/search/?searchtype=author&query=Nathan W. Henry), [Nicholas Crispino](https://arxiv.org/search/?searchtype=author&query=Nicholas Crispino), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu), [Dawn Song](https://arxiv.org/search/?searchtype=author&query=Dawn Song), [Chenguang Wang](https://arxiv.org/search/?searchtype=author&query=Chenguang Wang) 作者: Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior. 虽然大型语言模型 (LLM) 中的激活引导是一个不断发展的研究领域,但这些方法通常会产生比预期更广泛的影响。这促使人们隔离更纯粹的概念向量,以实现有针对性的干预并在更精细的层面上理解 LLM 行为。我们提出了 RepIt,这是一个简单且数据高效的框架,用于隔离特定于概念的表示。在五个前沿 LLM 上,RepIt 能够进行精确的干预:它有选择地抑制对目标概念的拒绝,同时保留其他地方的拒绝,生成能回答大规模杀伤性武器相关问题、同时在标准基准上仍被评为安全的模型。我们进一步表明,校正信号仅定位到 100-200 个神经元,并且可以在单个 A6000 上从少至十几个示例中提取稳健的目标表示。这种效率引发了双重担忧:可以使用适度的计算和数据进行操纵,并扩展到代表性不足的数据稀缺主题,同时规避现有基准。通过用 RepIt 解开拒绝向量,这项工作表明有针对性的干预可以抵消过度泛化,为更精细地控制模型行为奠定基础。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 17:35:36 UTC 发布时间: 2025-09-16 17:35:36 UTC
#3 A Scenario-Driven Cognitive Approach to Next-Generation AI Memory #3 下一代人工智能记忆的场景驱动认知方法
Authors: [Linyue Cai](https://arxiv.org/search/?searchtype=author&query=Linyue Cai), [Yuyang Cheng](https://arxiv.org/search/?searchtype=author&query=Yuyang Cheng), [Xiaoding Shao](https://arxiv.org/search/?searchtype=author&query=Xiaoding Shao), [Huiming Wang](https://arxiv.org/search/?searchtype=author&query=Huiming Wang), [Yong Zhao](https://arxiv.org/search/?searchtype=author&query=Yong Zhao), [Wei Zhang](https://arxiv.org/search/?searchtype=author&query=Wei Zhang), [Kang Li](https://arxiv.org/search/?searchtype=author&query=Kang Li) 作者: 蔡林岳, 程宇阳, 邵晓丁, 王惠明, 赵勇, 张伟, 李康
As artificial intelligence advances toward artificial general intelligence (AGI), the need for robust and human-like memory systems has become increasingly evident. Current memory architectures often suffer from limited adaptability, insufficient multimodal integration, and an inability to support continuous learning. To address these limitations, we propose a scenario-driven methodology that extracts essential functional requirements from representative cognitive scenarios, leading to a unified set of design principles for next-generation AI memory systems. Based on this approach, we introduce the COgnitive Layered Memory Architecture (COLMA), a novel framework that integrates cognitive scenarios, memory processes, and storage mechanisms into a cohesive design. COLMA provides a structured foundation for developing AI systems capable of lifelong learning and human-like reasoning, thereby contributing to the pragmatic development of AGI. 随着人工智能向通用人工智能(AGI)发展,对强大且类人的记忆系统的需求变得越来越明显。当前的记忆架构往往存在适应性有限、多模态集成不足以及无法支持持续学习的问题。为了解决这些限制,我们提出了一种场景驱动的方法,从具有代表性的认知场景中提取基本功能需求,从而为下一代人工智能记忆系统提供一套统一的设计原则。基于这种方法,我们引入了认知分层记忆架构(COLMA),这是一个将认知场景、记忆过程和存储机制集成到一个内聚设计中的新颖框架。COLMA 为开发能够终身学习和类人推理的人工智能系统提供了结构化基础,从而为通用人工智能的务实发展做出了贡献。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 16:43:07 UTC 发布时间: 2025-09-16 16:43:07 UTC
#4 Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy #4 使用多模态 LLM 模拟临床人工智能辅助:糖尿病视网膜病变案例研究
Authors: [Nadim Barakat](https://arxiv.org/search/?searchtype=author&query=Nadim Barakat), [William Lotter](https://arxiv.org/search/?searchtype=author&query=William Lotter) 作者:纳迪姆·巴拉卡特、威廉·洛特
Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o’s performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma’s descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows. 
糖尿病视网膜病变 (DR) 是全球失明的主要原因,人工智能系统可以扩大眼底摄影筛查的范围。目前 FDA 批准的系统主要提供二元转诊输出,其中这种最小输出可能会限制临床信任和效用。然而,确定最有效的输出格式来提高临床医生的人工智能性能是一项难以大规模评估的经验挑战。我们评估了用于 DR 检测的多模态大型语言模型 (MLLM) 及其跨不同输出类型模拟临床 AI 辅助的能力。在 IDRiD 和 Messidor-2 上测试了两个模型:通用 MLLM GPT-4o 和开源医疗模型 MedGemma。实验包括:(1)基线评估,(2)模拟人工智能辅助合成预测,以及(3)GPT-4o 结合 MedGemma 输出的实际人工智能到人工智能协作。MedGemma 在基线时优于 GPT-4o,实现了更高的灵敏度和 AUROC,而 GPT-4o 表现出近乎完美的特异性,但灵敏度较低。两种模型都根据模拟的 AI 输入调整了预测,但 GPT-4o 的性能因不正确的输入而崩溃,而 MedGemma 则保持更稳定。在实际协作中,GPT-4o 在 MedGemma 的描述性输出的指导下取得了强劲的效果,即使没有直接图像访问(AUROC 高达 0.96)。这些发现表明,MLLM 可以改进 DR 筛选管道,并作为可扩展的模拟器来研究跨不同输出配置的临床 AI 辅助。MedGemma 等开放、轻量级模型在资源匮乏的环境中可能特别有价值,而描述性输出可以增强可解释性和临床医生对临床工作流程的信任。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction 科目: 人工智能 , 计算机视觉与模式识别 , 人机交互
Publish: 2025-09-16 16:42:19 UTC 发布时间: 2025-09-16 16:42:19 UTC
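The abstract of #4 above contrasts "near-perfect specificity but low sensitivity". As a reminder of what those two numbers measure, here is a minimal sketch of the standard definitions for a binary referral task. The labels and predictions are made-up illustrative values, not data from the paper, and the function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = disease present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true cases that were flagged
    specificity = tn / (tn + fp)  # share of healthy cases correctly cleared
    return sensitivity, specificity


# Illustrative (invented) data: a cautious model that rarely predicts disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A model that almost never raises a flag gets every healthy case right while missing most true cases, which is why specificity alone says little about screening usefulness.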
#5 G-CSEA: A Graph-Based Conflict Set Extraction Algorithm for Identifying Infeasibility in Pseudo-Boolean Models #5 G-CSEA:一种基于图的冲突集提取算法,用于识别伪布尔模型中的不可行性
Authors: [Kanishk Garg](https://arxiv.org/search/?searchtype=author&query=Kanishk Garg), [Saranya D.](https://arxiv.org/search/?searchtype=author&query=Saranya D.), [Sanal Kumar](https://arxiv.org/search/?searchtype=author&query=Sanal Kumar), [Saurabh Singh](https://arxiv.org/search/?searchtype=author&query=Saurabh Singh), [Anupam Purwar](https://arxiv.org/search/?searchtype=author&query=Anupam Purwar) 作者:Kanishk Garg、Saranya D.、Sanal Kumar、Saurabh Singh、Anupam Purwar
Workforce scheduling involves a variety of rule-based constraints (such as shift limits, staffing policies, working hour restrictions, and many similar scheduling rules) that can interact in conflicting ways, leading to infeasible models. Identifying the underlying causes of such infeasibility is critical for resolving scheduling issues and restoring feasibility. A common diagnostic approach is to compute Irreducible Infeasible Subsets (IISs): minimal sets of constraints that are jointly infeasible but become feasible when any one is removed. We consider models formulated using pseudo-Boolean constraints with inequality relations over binary variables, which naturally encode scheduling logic. Existing IIS extraction methods such as Additive Deletion and QuickXplain rely on repeated feasibility checks, often incurring large numbers of solver calls. Dual ray analysis, while effective for LP-based models, may fail when the relaxed problem is feasible but the underlying pseudo-Boolean model is not. To address these limitations, we propose Graph-based Conflict Set Extraction Algorithm (G-CSEA) to extract a conflict set, an approach inspired by Conflict-Driven Clause Learning (CDCL) in SAT solvers. Our method constructs an implication graph during constraint propagation and, upon detecting a conflict, traces all contributing constraints across both decision branches. The resulting conflict set can optionally be minimized using QuickXplain to produce an IIS. 劳动力调度涉及各种基于规则的约束,例如轮班限制、人员配置政策、工作时间限制和许多类似的调度规则,这些约束可能会以相互冲突的方式相互作用,从而导致不可行的模型。确定这种不可行性的根本原因对于解决调度问题和恢复可行性至关重要。一种常见的诊断方法是计算不可约不可行子集 (IIS):即联合不可行、但删除其中任何一个约束后即变得可行的最小约束集。我们考虑使用伪布尔约束建立的模型,这类约束是二元变量上的不等式关系,能够自然地编码调度逻辑。现有的 IIS 提取方法(如 Additive Deletion 和 QuickXplain)依赖于重复的可行性检查,通常会导致大量求解器调用。对偶射线分析虽然对基于 LP 的模型有效,但当松弛问题可行而底层伪布尔模型不可行时,可能会失败。为了解决这些限制,我们提出了基于图的冲突集提取算法(G-CSEA)来提取冲突集,这种方法的灵感来自 SAT 求解器中的冲突驱动子句学习(CDCL)。我们的方法在约束传播期间构造一个蕴含图,并在检测到冲突后,跨两个决策分支追踪所有起作用的约束。可以选择使用 QuickXplain 对得到的冲突集进行最小化以生成 IIS。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 16:09:30 UTC 发布时间: 2025-09-16 16:09:30 UTC
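To make the IIS definition in #5 above concrete, here is a minimal sketch of the classic deletion filter, one of the repeated-feasibility-check baselines the abstract contrasts against. It is not G-CSEA itself. The constraint encoding, the brute-force `feasible` check (a stand-in for a real solver call), and the toy model are all assumptions made for illustration.

```python
from itertools import product

# Assumed encoding: a pseudo-Boolean constraint is (coeffs, rhs), meaning
# sum(coeffs[v] * x[v]) <= rhs over binary variables x[v].


def feasible(constraints, variables):
    """Brute-force feasibility check; stands in for one solver call."""
    for values in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, values))
        if all(sum(c * x[v] for v, c in coeffs.items()) <= rhs
               for coeffs, rhs in constraints):
            return True
    return False


def deletion_filter(constraints, variables):
    """Return indices of one irreducible infeasible subset (IIS)."""
    assert not feasible(constraints, variables), "model must be infeasible"
    keep = list(range(len(constraints)))
    for i in range(len(constraints)):
        trial = [j for j in keep if j != i]
        # If the model stays infeasible without constraint i, drop it for good.
        if not feasible([constraints[j] for j in trial], variables):
            keep = trial
    return keep


# Toy model over binary a, b (hypothetical, not from the paper).
variables = ["a", "b"]
constraints = [
    ({"a": -1, "b": -1}, -2),  # 0: a + b >= 2
    ({"a": 1}, 0),             # 1: a <= 0
    ({"b": 1}, 1),             # 2: b <= 1 (never binding)
    ({"a": 1, "b": 1}, 1),     # 3: a + b <= 1
]
iis = deletion_filter(constraints, variables)
print(iis)
```

The filter issues one feasibility check per constraint, which illustrates the solver-call cost that motivates extracting a conflict set from propagation instead. Note that the result depends on deletion order: constraints 0 and 1 also form an IIS in this toy model, but the filter returns a different one.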
#6 Agentic AI for Financial Crime Compliance #6 用于金融犯罪合规的代理人工智能
Authors: [Henrik Axelsen](https://arxiv.org/search/?searchtype=author&query=Henrik Axelsen), [Valdemar Licht](https://arxiv.org/search/?searchtype=author&query=Valdemar Licht), [Jan Damsgaard](https://arxiv.org/search/?searchtype=author&query=Jan Damsgaard) 作者:Henrik Axelsen、Valdemar Licht、Jan Damsgaard
The cost and complexity of financial crime compliance (FCC) continue to rise, often without measurable improvements in effectiveness. While AI offers potential, most solutions remain opaque and poorly aligned with regulatory expectations. This paper presents the design and deployment of an agentic AI system for FCC in digitally native financial platforms. Developed through an Action Design Research (ADR) process with a fintech firm and regulatory stakeholders, the system automates onboarding, monitoring, investigation, and reporting, emphasizing explainability, traceability, and compliance-by-design. Using artifact-centric modeling, it assigns clearly bounded roles to autonomous agents and enables task-specific model routing and audit logging. The contribution includes a reference architecture, a real-world prototype, and insights into how Agentic AI can reconfigure FCC workflows under regulatory constraints. Our findings extend IS literature on AI-enabled compliance by demonstrating how automation, when embedded within accountable governance structures, can support transparency and institutional trust in high-stakes, regulated environments. 金融犯罪合规 (FCC) 的成本和复杂性持续上升,但往往没有显着提高有效性。虽然人工智能具有潜力,但大多数解决方案仍然不透明且与监管期望不符。本文介绍了 FCC 代理 AI 系统在数字原生金融平台中的设计和部署。该系统通过与金融科技公司和监管利益相关者的行动设计研究 (ADR) 流程开发,可自动执行入职、监控、调查和报告,强调可解释性、可追溯性和设计合规性。它使用以工件为中心的建模,为自主代理分配明确界限的角色,并支持特定于任务的模型路由和审计日志记录。该贡献包括参考架构、真实原型以及对 Agentic AI 如何在监管约束下重新配置 FCC 工作流程的见解。我们的研究结果扩展了 IS 关于人工智能合规的文献,展示了自动化在嵌入负责任的治理结构中时如何支持高风险、受监管环境中的透明度和机构信任。
Subjects: Artificial Intelligence, Human-Computer Interaction, Multiagent Systems 科目: 人工智能 , 人机交互 , 多智能体系统
Publish: 2025-09-16 14:53:51 UTC 发布时间: 2025-09-16 14:53:51 UTC
#7 Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets #7 偏好约束推理:多对一匹配市场中语言模型的基准
Authors: [Marylou Fauchard](https://arxiv.org/search/?searchtype=author&query=Marylou Fauchard), [Florian Carichon](https://arxiv.org/search/?searchtype=author&query=Florian Carichon), [Margarida Carvalho](https://arxiv.org/search/?searchtype=author&query=Margarida Carvalho), [Golnoosh Farnadi](https://arxiv.org/search/?searchtype=author&query=Golnoosh Farnadi) 作者:Marylou Fauchard、Florian Carichon、Margarida Carvalho、Golnoosh Farnadi
Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints. 
大型语言模型 (LLM) 推理的最新进展在复杂的数学任务(包括组合优化)上表现出强大的性能。思维链和上下文学习等技术进一步增强了这种能力,使 LLM 成为包括非专家在内的广大用户的强大且易于使用的工具。然而,将 LLM 应用于匹配问题(需要在偏好约束和结构约束下进行推理)仍然没有得到充分探索。为了填补这一空白,我们引入了一个包含 369 个大学录取问题实例的新基准(大学录取问题是带偏好的匹配问题的典型示例),以从关键维度评估 LLM:可行性、稳定性和最优性。我们使用这个基准来评估几个开放权重 LLM 的性能。我们的结果首先表明,虽然 LLM 可以满足某些约束,但它们很难始终如一地满足所有评估标准。结果还表明,推理型 LLM(如 QwQ 和 GPT-oss)明显优于传统模型,如 Llama、Qwen 或 Mistral,这里的传统模型定义为在没有任何专用推理机制的情况下使用的模型。此外,我们观察到 LLM 对所测试的各种提示策略的反应不同,包括思维链、上下文学习和基于角色的提示,没有一种提示能够始终提供最佳性能。最后,我们报告了使用自动生成反馈的迭代提示的性能,并表明它们不是单调的;它们可能会在早期达到顶峰,然后在以后的尝试中显著下降。总体而言,这项工作为带偏好约束的组合优化问题中的模型推理性能和提示策略的有效性提供了新的视角。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 14:48:46 UTC 发布时间: 2025-09-16 14:48:46 UTC
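The benchmark in #7 above scores matchings on feasibility and stability. A minimal sketch of how those two checks are commonly defined for the College Admission Problem follows. The data layout, function names, and the toy instance are assumptions for illustration; the paper's own evaluation code may differ.

```python
def within_capacity(capacity, matching):
    """Feasibility: no college admits more students than it has seats."""
    counts = {}
    for college in matching.values():
        if college is not None:
            counts[college] = counts.get(college, 0) + 1
    return all(counts.get(c, 0) <= cap for c, cap in capacity.items())


def blocking_pairs(student_prefs, college_prefs, capacity, matching):
    """Stability: list (student, college) pairs that would rather be together."""
    admitted = {c: [] for c in college_prefs}
    for s, c in matching.items():
        if c is not None:
            admitted[c].append(s)
    pairs = []
    for s, prefs in student_prefs.items():
        current = matching.get(s)
        # Colleges the student strictly prefers to the current assignment.
        better = prefs if current is None else prefs[:prefs.index(current)]
        for c in better:
            rank = college_prefs[c]
            if s not in rank:
                continue  # the college finds this student unacceptable
            has_seat = len(admitted[c]) < capacity[c]
            displaces = any(rank.index(s) < rank.index(t) for t in admitted[c])
            if has_seat or displaces:
                pairs.append((s, c))
    return pairs


# Toy instance (hypothetical): preference lists are ordered best-first.
student_prefs = {"A": ["X", "Y"], "B": ["X", "Y"], "C": ["Y", "X"]}
college_prefs = {"X": ["B", "A", "C"], "Y": ["A", "B", "C"]}
capacity = {"X": 1, "Y": 2}

unstable = {"A": "X", "B": "Y", "C": "Y"}
stable = {"A": "Y", "B": "X", "C": "Y"}
print(blocking_pairs(student_prefs, college_prefs, capacity, unstable))
print(blocking_pairs(student_prefs, college_prefs, capacity, stable))
```

A matching is stable when it is feasible and the list of blocking pairs is empty; in the first matching, student B and college X both prefer each other to their current assignments.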
#8 A Visualized Framework for Event Cooperation with Generative Agents #8 与生成代理的事件合作可视化框架
Authors: [Yuyang Tian](https://arxiv.org/search/?searchtype=author&query=Yuyang Tian), [Shunqiang Mao](https://arxiv.org/search/?searchtype=author&query=Shunqiang Mao), [Wenchang Gao](https://arxiv.org/search/?searchtype=author&query=Wenchang Gao), [Lanlan Qiu](https://arxiv.org/search/?searchtype=author&query=Lanlan Qiu), [Tianxing He](https://arxiv.org/search/?searchtype=author&query=Tianxing He) 作者:田宇阳,毛顺强,高文昌,邱兰兰,何天兴
Large Language Models (LLMs) have revolutionized the simulation of agent societies, enabling autonomous planning, memory formation, and social interactions. However, existing frameworks often overlook systematic evaluations for event organization and lack visualized integration with physically grounded environments, limiting agents’ ability to navigate spaces and interact with items realistically. We develop MiniAgentPro, a visualization platform featuring an intuitive map editor for customizing environments and a simulation player with smooth animations. Based on this tool, we introduce a comprehensive test set comprising eight diverse event scenarios with basic and hard variants to assess agents’ ability. Evaluations using GPT-4o demonstrate strong performance in basic settings but highlight coordination challenges in hard variants. 大型语言模型 (LLM) 彻底改变了代理社会的模拟,实现了自主规划、记忆形成和社交互动。然而,现有框架往往忽视了对事件组织的系统评估,并且缺乏与物理接地环境的可视化集成,限制了代理在空间中导航和与物品真实交互的能力。我们开发了 MiniAgentPro,这是一个可视化平台,具有用于自定义环境的直观地图编辑器和具有流畅动画的模拟播放器。基于这个工具,我们引入了一个全面的测试集,包括八个不同的事件场景,以及基本和硬变体来评估代理的能力。使用 GPT-4o 的评估在基本设置中表现出强大的性能,但突出了硬变体中的协调挑战。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 12:33:54 UTC 发布时间: 2025-09-16 12:33:54 UTC
#9 Data-driven Methods of Extracting Text Structure and Information Transfer #9 提取文本结构和信息传输的数据驱动方法
Authors: [Shinichi Honna](https://arxiv.org/search/?searchtype=author&query=Shinichi Honna), [Taichi Murayama](https://arxiv.org/search/?searchtype=author&query=Taichi Murayama), [Akira Matsui](https://arxiv.org/search/?searchtype=author&query=Akira Matsui) 作者:Shinichi Honna、Taichi Murayama、Akira Matsui
The Anna Karenina Principle (AKP) holds that success requires satisfying a small set of essential conditions, whereas failure takes diverse forms. We test AKP, its reverse, and two further patterns described as ordered and noisy across novels, online encyclopedias, research papers, and movies. Texts are represented as sequences of functional blocks, and convergence is assessed in transition order and position. Results show that structural principles vary by medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, academic papers display reverse AKP in order but remain noisy in position, and movies diverge by genre. Success therefore depends on structural constraints that are specific to each medium, while failure assumes different shapes across domains. 安娜·卡列尼娜原则 (AKP) 认为,成功需要满足一小部分基本条件,而失败则有多种形式。我们测试了 AKP、它的反向,以及小说、在线百科全书、研究论文和电影中被描述为有序和嘈杂的另外两种模式。文本表示为功能块序列,并按过渡顺序和位置评估收敛性。结果表明,结构原理因媒介而异:小说按顺序遵循反向 AKP,维基百科将 AKP 与有序模式相结合,学术论文按顺序显示反向 AKP,但在位置上保持嘈杂,电影按类型分化。因此,成功取决于特定于每种介质的结构约束,而失败则在跨领域呈现不同的形状。
Subjects: Artificial Intelligence, Machine Learning 科目: 人工智能 , 机器学习
Publish: 2025-09-16 12:13:09 UTC 发布时间: 2025-09-16 12:13:09 UTC
#10 Toward PDDL Planning Copilot #10 迈向 PDDL 规划副驾驶
Authors: [Yarin Benyamin](https://arxiv.org/search/?searchtype=author&query=Yarin Benyamin), [Argaman Mordoch](https://arxiv.org/search/?searchtype=author&query=Argaman Mordoch), [Shahaf S. Shperberg](https://arxiv.org/search/?searchtype=author&query=Shahaf S. Shperberg), [Roni Stern](https://arxiv.org/search/?searchtype=author&query=Roni Stern) 作者:Yarin Benyamin、Argaman Mordoch、Shahaf S. Shperberg、Roni Stern
Large Language Models (LLMs) are increasingly being used as autonomous agents capable of performing complicated tasks. However, they lack the ability to perform reliable long-horizon planning on their own. This paper bridges this gap by introducing the Planning Copilot, a chatbot that integrates multiple planning tools and allows users to invoke them through instructions in natural language. The Planning Copilot leverages the Model Context Protocol (MCP), a recently developed standard for connecting LLMs with external tools and systems. This approach allows using any LLM that supports MCP without domain-specific fine-tuning. Our Planning Copilot supports common planning tasks such as checking the syntax of planning problems, selecting an appropriate planner, calling it, validating the plan it generates, and simulating its execution. We empirically evaluate the ability of our Planning Copilot to perform these tasks using three open-source LLMs. The results show that the Planning Copilot substantially outperforms using the same LLMs without the planning tools. We also conducted a limited qualitative comparison of our tool against Chat GPT-5, a very recent commercial LLM. Our results show that our Planning Copilot significantly outperforms GPT-5 despite relying on a much smaller LLM. This suggests dedicated planning tools may be an effective way to enable LLMs to perform planning tasks. 大型语言模型 (LLM) 越来越多地被用作能够执行复杂任务的自主代理。然而,它们缺乏自行执行可靠的长程规划的能力。本文通过引入 Planning Copilot 来弥合这一差距,这是一个集成了多种规划工具并允许用户通过自然语言指令调用它们的聊天机器人。Planning Copilot 利用模型上下文协议 (MCP),这是最近开发的用于将 LLM 与外部工具和系统连接起来的标准。这种方法允许使用任何支持 MCP 的 LLM,而无需进行特定领域的微调。我们的 Planning Copilot 支持常见的规划任务,例如检查规划问题的语法、选择合适的规划器、调用它、验证它生成的计划以及模拟计划的执行。我们使用三个开源 LLM 实证评估了 Planning Copilot 执行这些任务的能力。结果表明,Planning Copilot 的性能远优于在没有规划工具的情况下使用相同的 LLM。我们还对我们的工具与最近发布的商业 LLM Chat GPT-5 进行了有限的定性比较。我们的结果表明,尽管依赖于小得多的 LLM,我们的 Planning Copilot 的性能仍明显优于 GPT-5。这表明专用的规划工具可能是使 LLM 能够执行规划任务的有效方法。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 11:51:07 UTC 发布时间: 2025-09-16 11:51:07 UTC
#11 Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning #11 忘记敏感的事情,记住重要的事情:用于持续学习的记忆雕刻中的令牌级差分隐私
Authors: [Bihao Zhan](https://arxiv.org/search/?searchtype=author&query=Bihao Zhan), [Jie Zhou](https://arxiv.org/search/?searchtype=author&query=Jie Zhou), [Junsong Li](https://arxiv.org/search/?searchtype=author&query=Junsong Li), [Yutao Yang](https://arxiv.org/search/?searchtype=author&query=Yutao Yang), [Shilian Chen](https://arxiv.org/search/?searchtype=author&query=Shilian Chen), [Qianjun Pan](https://arxiv.org/search/?searchtype=author&query=Qianjun Pan), [Xin Li](https://arxiv.org/search/?searchtype=author&query=Xin Li), [Wen Wu](https://arxiv.org/search/?searchtype=author&query=Wen Wu), [Xingjiao Wu](https://arxiv.org/search/?searchtype=author&query=Xingjiao Wu), [Qin Chen](https://arxiv.org/search/?searchtype=author&query=Qin Chen), [Hang Yan](https://arxiv.org/search/?searchtype=author&query=Hang Yan), [Liang He](https://arxiv.org/search/?searchtype=author&query=Liang He) 作者: 詹碧豪, 周杰, 李俊松, 杨玉涛, 陈世莲, 潘倩军, 李欣, 温吴, 吴兴娇, 陈琴, 杭岩, 梁鹤
Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what’s sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model’s memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy. 持续学习(CL)模型虽然擅长顺序知识获取,但由于积累了不同的信息,面临着重大且经常被忽视的隐私挑战。传统的隐私方法,如统一的差分隐私(DP)预算,不加区别地保护所有数据,导致模型效用大幅下降,并阻碍了 CL 在隐私敏感领域的部署。为了克服这个问题,我们提出了一个隐私增强的持续学习 (PeCL) 框架,该框架可以忘记敏感内容并记住重要内容。我们的方法首先引入了令牌级动态差分隐私策略,该策略根据单个令牌的语义敏感性自适应地分配隐私预算。这确保了对私人实体的强大保护,同时最大限度地减少了对非敏感常识的噪声注入。其次,我们集成了一个隐私引导的记忆雕刻模块。该模块利用动态 DP 机制的敏感性分析,智能地忘记模型内存和参数中的敏感信息,同时显式保留对减轻灾难性遗忘至关重要的任务不变历史知识。广泛的实验表明,PeCL 在隐私保护和模型实用性之间实现了卓越的平衡,通过在确保强大的隐私性的同时保持先前任务的高精度,优于基线模型。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 11:01:59 UTC 发布时间: 2025-09-16 11:01:59 UTC
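The token-level budget idea in entry #11 can be illustrated with a minimal sketch. This is not the PeCL implementation: the linear mapping, the function names, and the `eps_min`/`eps_max` defaults are all assumptions made for illustration. The only standard fact used is that a Laplace mechanism with L1 sensitivity Δ and budget ε adds noise of scale Δ/ε, so a smaller ε means more noise.

```python
from typing import List


def token_privacy_budgets(
    semantic_sensitivity: List[float],
    eps_min: float = 0.5,   # assumed budget for the most sensitive tokens
    eps_max: float = 8.0,   # assumed budget for non-sensitive tokens
) -> List[float]:
    """Map per-token sensitivity scores in [0, 1] to per-token epsilons.

    Hypothetical linear rule: the more sensitive a token, the smaller its
    epsilon (stronger protection, more noise).
    """
    budgets = []
    for score in semantic_sensitivity:
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        budgets.append(eps_max - score * (eps_max - eps_min))
    return budgets


def laplace_scale(epsilon: float, l1_sensitivity: float = 1.0) -> float:
    """Noise scale b = Δ / ε of the standard Laplace mechanism."""
    return l1_sensitivity / epsilon


# Hypothetical scores: a generic word, a person's name, a borderline token.
scores = [0.0, 1.0, 0.5]
for score, eps in zip(scores, token_privacy_budgets(scores)):
    print(f"sensitivity={score:.1f}  epsilon={eps:.2f}  noise scale={laplace_scale(eps):.3f}")
```

Under these assumed settings the name-like token receives sixteen times the noise scale of the generic token, which is the trade-off the abstract describes: strong protection for private entities, little noise for general knowledge.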
#12 Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories #12 使用海量模型存储库进行语言模型即服务的黑盒模型合并
Authors: [Shilian Chen](https://arxiv.org/search/?searchtype=author&query=Shilian Chen), [Jie Zhou](https://arxiv.org/search/?searchtype=author&query=Jie Zhou), [Tianyu Huai](https://arxiv.org/search/?searchtype=author&query=Tianyu Huai), [Yujiang Lu](https://arxiv.org/search/?searchtype=author&query=Yujiang Lu), [Junsong Li](https://arxiv.org/search/?searchtype=author&query=Junsong Li), [Bihao Zhan](https://arxiv.org/search/?searchtype=author&query=Bihao Zhan), [Qianjun Pan](https://arxiv.org/search/?searchtype=author&query=Qianjun Pan), [Yutao Yang](https://arxiv.org/search/?searchtype=author&query=Yutao Yang), [Xin Li](https://arxiv.org/search/?searchtype=author&query=Xin Li), [Qin Chen](https://arxiv.org/search/?searchtype=author&query=Qin Chen), [Hang Yan](https://arxiv.org/search/?searchtype=author&query=Hang Yan), [Liang He](https://arxiv.org/search/?searchtype=author&query=Liang He) 作者: 陈世莲, 周杰, 怀天宇, 陆玉江, 李俊松, 詹碧豪, 潘倩军, 杨玉涛, 李鑫, 陈秦, 杭岩, 何良
Model merging refers to the process of integrating multiple distinct models into a unified model that preserves and combines the strengths and capabilities of the individual models. Most existing approaches rely on task vectors to combine models, typically under the assumption that model parameters are accessible. However, for extremely large language models (LLMs) such as GPT-4, which are often provided solely as black-box services through API interfaces (Language-Model-as-a-Service), model weights are not available to end users. This presents a significant challenge, which we refer to as black-box model merging (BMM) with massive LLMs. To address this challenge, we propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging) that enables effective model merging using only inference-time API queries. Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. We also provide a formal justification, along with a theoretical analysis, for our asymmetric sparsification. Extensive experimental evaluations demonstrate that our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines. 模型合并是指将多个不同的模型集成到一个统一模型中的过程,该模型保留并结合了各个模型的优势和功能。大多数现有方法都依赖于任务向量来组合模型,通常假设模型参数是可访问的。然而,对于像 GPT-4 这样的超大型语言模型 (LLM),它们通常仅通过 API 接口(语言模型即服务)作为黑盒服务提供,最终用户无法使用模型权重。这带来了一个重大挑战,我们将其称为黑盒模型合并 (BMM) 与大规模 LLM。为了应对这一挑战,我们提出了一种基于进化算法(Evo-Merging)的无导数优化框架,该框架仅使用推理时 API 查询即可实现有效的模型合并。我们的方法由两个关键组成部分组成:(1)基于稀疏性的去噪,旨在识别和过滤掉模型中不相关或冗余的信息,以及(2)符号感知缩放,根据相关模型的性能动态计算相关模型的最佳组合权重。我们还为我们的不对称稀疏化提供了正式的理由以及理论分析。广泛的实验评估表明,我们的方法在一系列任务上取得了最先进的结果,显着优于现有的强基线。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 10:55:50 UTC 发布时间: 2025-09-16 10:55:50 UTC
#13 The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features #13 对齐剖析:通过控制稀疏特征分解偏好优化
Authors: [Jeremias Ferrao](https://arxiv.org/search/?searchtype=author&query=Jeremias Ferrao), [Matthijs van der Lende](https://arxiv.org/search/?searchtype=author&query=Matthijs van der Lende), [Ilija Lichkovski](https://arxiv.org/search/?searchtype=author&query=Ilija Lichkovski), [Clement Neo](https://arxiv.org/search/?searchtype=author&query=Clement Neo) 作者:Jeremias Ferrao、Matthijs van der Lende、Ilija Lichkovski、Clement Neo
Aligning large language models is critical for their usability and safety. However, the prevailing approach of Reinforcement Learning from Human Feedback (RLHF) induces diffuse, opaque parameter changes, making it difficult to discern what the model has internalized. Hence, we introduce Feature Steering with Reinforcement Learning (FSRL), a transparent alignment framework that trains a lightweight adapter to steer behavior by modulating interpretable features from a Sparse Autoencoder (SAE). First, we demonstrate that FSRL is an effective method for preference optimization and is comparable with current RLHF methods. We then perform mechanistic analysis on the trained adapter, and find that its policy systematically promotes style features over explicit alignment concepts, suggesting that the preference optimization process rewards stylistic presentation as a proxy for quality. Ultimately, we hope that FSRL provides a tool for both interpretable model control and diagnosing the internal mechanisms of alignment. 调整大型语言模型对于其可用性和安全性至关重要。然而,人类反馈强化学习(RLHF)的流行方法会引起分散的、不透明的参数变化,从而难以辨别模型内化的内容。因此,我们引入了具有强化学习的特征转向 (FSRL),这是一个透明的对齐框架,它通过调制稀疏自动编码器 (SAE) 中的可解释特征来训练轻量级适配器来引导行为。首先,我们证明了 FSRL 是一种有效的偏好优化方法,并且与当前的 RLHF 方法相当。然后,我们对训练好的适配器进行了机制分析,发现其策略系统地促进了风格特征而不是显式对齐概念,这表明偏好优化过程奖励了风格呈现作为质量的代理。最终,我们希望 FSRL 为可解释的模型控制和诊断内部对齐机制提供一种工具。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 10:32:40 UTC 发布时间: 2025-09-16 10:32:40 UTC
#14 HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making #14 HLSMAC:用于高层战略决策的全新星际争霸多智能体挑战
Authors: [Xingxing Hong](https://arxiv.org/search/?searchtype=author&query=Xingxing Hong), [Yungong Wang](https://arxiv.org/search/?searchtype=author&query=Yungong Wang), [Dexin Jin](https://arxiv.org/search/?searchtype=author&query=Dexin Jin), [Ye Yuan](https://arxiv.org/search/?searchtype=author&query=Ye Yuan), [Ximing Huang](https://arxiv.org/search/?searchtype=author&query=Ximing Huang), [Zijian Wu](https://arxiv.org/search/?searchtype=author&query=Zijian Wu), [Wenxin Li](https://arxiv.org/search/?searchtype=author&query=Wenxin Li) 作者: Xingxing Hong , Yungong Wang, Dexin Jin, Ye Yuan, Ximing Huang, Zijian Wu, Wenxin Li
Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents’ overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making. 基准对于评估多智能体强化学习 (MARL) 算法至关重要。虽然《星际争霸 II》相关环境推动了 MARL 的重大进步,但 SMAC 等现有基准主要侧重于微观管理,限制了对高级战略情报的综合评估。为了解决这个问题,我们推出了 HLSMAC,这是一个新的合作 MARL 基准测试,其中包含 12 个精心设计的《星际争霸 II》场景,这些场景基于三十六种策略中的经典策略。每个场景都对应一个特定的策略,旨在挑战具有不同战略要素的特工,包括战术机动、时机协调和欺骗,从而为评估高层战略决策能力开辟途径。我们还提出了超越传统胜率的多个维度的新指标,例如能力利用率和进阶效率,以评估代理在 HLSMAC 环境中的整体表现。我们将最先进的 MARL 算法和基于 LLM 的代理与我们的基准测试相结合,并进行了全面的实验。结果表明,HLSMAC 是推进多智能体战略决策的强大测试平台。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Computer Science and Game Theory, Machine Learning, Multiagent Systems 科目: 人工智能 , 计算机视觉与模式识别 , 计算机科学与博弈论 , 机器学习 , 多智能体系统
Publish: 2025-09-16 10:26:12 UTC 发布时间: 2025-09-16 10:26:12 UTC
#15 Population Estimation using Deep Learning over Gandhinagar Urban Area #15 甘地讷格尔市区使用深度学习进行人口估算
Authors: [Jai Singla](https://arxiv.org/search/?searchtype=author&query=Jai Singla), [Peal Jotania](https://arxiv.org/search/?searchtype=author&query=Peal Jotania), [Keivalya Pandya](https://arxiv.org/search/?searchtype=author&query=Keivalya Pandya) 作者:Jai Singla、Peal Jotania、Keivalya Pandya
Population estimation is crucial for various applications, from resource allocation to urban planning. Traditional methods such as surveys and censuses are expensive, time-consuming and also heavily dependent on human resources, requiring significant manpower for data collection and processing. In this study a deep learning solution is proposed to estimate population using high-resolution (0.3 m) satellite imagery, Digital Elevation Models (DEM) of 0.5 m resolution and vector boundaries. The proposed method combines a Convolutional Neural Network (CNN) that classifies buildings as residential or non-residential with an Artificial Neural Network (ANN) that estimates the population. Approximately 48k building footprints over the Gandhinagar urban area are used, covering both residential and non-residential buildings, with the residential ones further used for building-level population estimation. Experimental results on a large-scale dataset demonstrate the effectiveness of our model, achieving an impressive overall F1-score of 0.9936. The proposed system employs advanced geospatial analysis with high spatial resolution to estimate Gandhinagar population at 278,954. By integrating real-time data updates, standardized metrics, and infrastructure planning capabilities, this automated approach addresses critical limitations of conventional census-based methodologies. The framework provides municipalities with a scalable and replicable tool for optimized resource management in rapidly urbanizing cities, showcasing the efficiency of AI-driven geospatial analytics in enhancing data-driven urban governance.
人口估算对于从资源分配到城市规划的各种应用至关重要。调查和人口普查等传统方法成本高昂、耗时,而且严重依赖人力资源,需要大量人力来收集和处理数据。本研究提出了一种深度学习解决方案,利用高分辨率(0.3 m)卫星图像、0.5 m 分辨率的数字高程模型(DEM)和矢量边界来估计人口。该方法结合卷积神经网络(CNN)架构进行分类任务,将建筑物分为住宅和非住宅,并结合人工神经网络(ANN)架构来估计人口。甘地讷格尔市区约有 48k 建筑占地面积,包括住宅和非住宅,住宅类别进一步用于建筑层面的人口估算。在大规模数据集上的实验结果证明了我们模型的有效性,取得了令人印象深刻的 0.9936 的总体 F1 分数。拟议的系统采用具有高空间分辨率的先进地理空间分析来估计甘地讷格尔人口为 278,954 人。通过集成实时数据更新、标准化指标和基础设施规划功能,这种自动化方法解决了传统基于人口普查的方法的关键局限性。该框架为市政当局提供了一个可扩展且可复制的工具,用于在快速城市化的城市中优化资源管理,展示了人工智能驱动的地理空间分析在加强数据驱动的城市治理方面的效率。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 10:25:46 UTC 发布时间: 2025-09-16 10:25:46 UTC
#16 Stochastic Streets: A Walk Through Random LLM Address Generation in four European Cities #16 随机街道:在四个欧洲城市中进行随机 LLM 地址生成的演练
Authors: [Tairan Fu](https://arxiv.org/search/?searchtype=author&query=Tairan Fu), [David Campo-Nazareno](https://arxiv.org/search/?searchtype=author&query=David Campo-Nazareno), [Javier Coronado-Blázquez](https://arxiv.org/search/?searchtype=author&query=Javier Coronado-Blázquez), [Javier Conde](https://arxiv.org/search/?searchtype=author&query=Javier Conde), [Pedro Reviriego](https://arxiv.org/search/?searchtype=author&query=Pedro Reviriego), [Fabrizio Lombardi](https://arxiv.org/search/?searchtype=author&query=Fabrizio Lombardi) 作者: Tairan Fu, David Campo-Nazareno, Javier Coronado-Blázquez, Javier Conde, Pedro Reviriego, Fabrizio Lombardi
Large Language Models (LLMs) are capable of solving complex math problems or answering difficult questions on almost any topic, but can they generate random street addresses for European cities? 大型语言模型 (LLM) 能够解决复杂的数学问题或回答几乎任何主题的难题,但它们能否为欧洲城市生成随机街道地址?
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 10:09:00 UTC 发布时间: 2025-09-16 10:09:00 UTC
#17 LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning #17 LTA-thinker:大型语言模型复杂推理的潜在思想增强训练框架
Authors: [Jiaqi Wang](https://arxiv.org/search/?searchtype=author&query=Jiaqi Wang), [Binquan Ji](https://arxiv.org/search/?searchtype=author&query=Binquan Ji), [Haibo Luo](https://arxiv.org/search/?searchtype=author&query=Haibo Luo), [Yiyang Qi](https://arxiv.org/search/?searchtype=author&query=Yiyang Qi), [Ruiting Li](https://arxiv.org/search/?searchtype=author&query=Ruiting Li), [Huiyan Wang](https://arxiv.org/search/?searchtype=author&query=Huiyan Wang), [Yuantao Han](https://arxiv.org/search/?searchtype=author&query=Yuantao Han), [Cangyi Yang](https://arxiv.org/search/?searchtype=author&query=Cangyi Yang), [jiaxu Zhang](https://arxiv.org/search/?searchtype=author&query=jiaxu Zhang), [Feiliang Ren](https://arxiv.org/search/?searchtype=author&query=Feiliang Ren) 作者: 王佳琦, 季彬泉, 罗海波, 祁一阳, 李瑞婷, 王慧妍, 韩元涛, 杨苍仪, 张佳旭, 飞良任
Complex Reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate Overthinking. Although methods such as Coconut, SoftCoT and its variant are effective in continuous latent space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thought. Drawing from the theory of SoftCoT++ that a larger variance in the generated Latent Thought distribution more closely approximates the golden truth distribution, we propose a Latent Thought-Augmented Training Framework–LTA-Thinker, which improves distributional variance and enhances reasoning performance from two perspectives. First, LTA-Thinker constructs a Latent Thought generation architecture based on a learnable prior. This architecture aims to increase the variance distribution of generated Latent Thought Vectors in order to simplify the overall structure and raise the performance ceiling. Second, LTA-Thinker introduces a distribution-based directional optimization paradigm that jointly constrains both distribution locality and distribution scale. This mechanism improves information efficiency and reduces computational cost through a multi-objective co-training strategy, which combines standard Supervised Fine-Tuning (SFT) loss with two novel losses: Semantic Alignment Loss, which utilizes KL divergence to ensure that the Latent Thought is highly relevant to the semantics of the question; Reasoning Focus Loss, which utilizes a contrastive learning mechanism to guide the model to focus on the most critical reasoning steps. Experiments show that LTA-thinker achieves state-of-the-art (SOTA) performance among various baselines and demonstrates a higher performance ceiling and better scaling effects.
大型语言模型中的复杂推理可以使用测试时间缩放 (TTS) 进行动态优化,以减少过度思考。尽管 Coconut、SoftCoT 及其变体等方法在连续潜在空间推理中是有效的,但核心瓶颈仍在于高质量潜在思想的高效生成和利用。借鉴 SoftCoT++的理论,即生成的潜在思想分布中较大的方差更接近黄金真理分布,我们提出了一种潜在思想增强训练框架–LTA-Thinker,它从两个角度改善了分布方差并增强了推理性能。首先,LTA-Thinker 基于可学习的先验构建了潜在思想生成架构。该架构旨在增加生成的潜在思维向量的方差分布,以简化整体结构并提高性能上限。其次,LTA-Thinker 引入了一种基于分布的定向优化范式,共同约束了分布局部性和分布尺度。该机制通过多目标协同训练策略提高了信息效率并降低了计算成本,该策略将标准监督微调(SFT)损失与两种新损失相结合:语义对齐损失,利用 KL 散度来确保潜在思想与问题的语义高度相关;推理焦点损失,它利用对比学习机制来指导模型专注于最关键的推理步骤。实验表明,LTA-thinker 在各种基线中实现了最先进的(SOTA)性能,并表现出更高的性能上限和更好的扩展效果。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 09:27:57 UTC 发布时间: 2025-09-16 09:27:57 UTC
#18 H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents #18 H2R:多任务 LLM 代理的分层后见之明反思
Authors: [Shicheng Ye](https://arxiv.org/search/?searchtype=author&query=Shicheng Ye), [Chao Yu](https://arxiv.org/search/?searchtype=author&query=Chao Yu), [Kaiqiang Ke](https://arxiv.org/search/?searchtype=author&query=Kaiqiang Ke), [Chengdong Xu](https://arxiv.org/search/?searchtype=author&query=Chengdong Xu), [Yinqi Wei](https://arxiv.org/search/?searchtype=author&query=Yinqi Wei) 作者: 叶世成, 宇超, 柯凯强, 徐成东, 魏银奇
Large language model (LLM)-based agents have shown strong potential in multi-task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer by decoupling high-level planning memory from low-level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H2R), a mechanism that distills reusable and hierarchical knowledge from past agent-environment interactions. At test time, H2R performs retrievals of high-level and low-level memories separately, allowing LLM-based agents to efficiently access and utilize task-relevant knowledge for new tasks. Experimental results across two benchmarks demonstrate that H2R can improve generalization and decision-making performance, outperforming prior baselines such as Expel. 基于大型语言模型 (LLM) 的代理由于能够跨不同任务传递知识,在多任务场景中显示出强大的潜力。然而,现有方法往往将先前的经验和知识视为单一的单元,导致知识转移效率低下且粒度粗糙。在这项工作中,我们提出了一种新颖的分层记忆架构,该架构通过将高层规划记忆与低层执行记忆解耦来实现细粒度的知识转移。为了构建和完善这些分层记忆,我们引入了分层后见之明反思(H2R),这是一种从过去的智能体与环境交互中提炼出可重用和分层知识的机制。在测试时,H2R 分别执行高层和低层记忆的检索,使基于 LLM 的代理能够有效地访问和利用与任务相关的知识来执行新任务。两个基准测试的实验结果表明,H2R 可以提高泛化和决策性能,优于之前的基线,例如 Expel。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 08:30:08 UTC 发布时间: 2025-09-16 08:30:08 UTC
#19 Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs #19 通过 LLM 检索增强框架进行零样本图推理
Authors: [Hanqing Li](https://arxiv.org/search/?searchtype=author&query=Hanqing Li), [Kiran Sheena Jyothi](https://arxiv.org/search/?searchtype=author&query=Kiran Sheena Jyothi), [Henry Liang](https://arxiv.org/search/?searchtype=author&query=Henry Liang), [Sharika Mahadevan](https://arxiv.org/search/?searchtype=author&query=Sharika Mahadevan), [Diego Klabjan](https://arxiv.org/search/?searchtype=author&query=Diego Klabjan) 作者:李汉青、Kiran Sheena Jyothi、Henry Liang、Sharika Mahadevan、Diego Klabjan
We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes. 我们提出了一种新的、无需训练的方法,即通过检索增强框架 (GRRAF) 进行图推理,它利用检索增强生成 (RAG) 以及大型语言模型 (LLM) 的代码生成功能来处理广泛的图推理任务。在 GRRAF 中,目标图存储在图数据库中,并提示 LLM 生成可执行代码查询以检索必要的信息。这种方法规避了需要大量微调或依赖预定义算法的现有方法的局限性,并且它结合了带有超时机制的错误反馈循环,以确保正确性和效率。对 GraphInstruct 数据集的实验评估表明,GRRAF 在大多数图推理任务上都达到了 100% 的准确率,包括循环检测、二分图检查、最短路径计算和最大流量,同时无论图大小如何,都保持一致的令牌成本。在子图匹配上观察到不完美但仍非常高的性能。值得注意的是,GRRAF 可以有效地扩展到具有多达 10,000 个节点的大型图。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 06:58:58 UTC 发布时间: 2025-09-16 06:58:58 UTC
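The generate, execute, and retry pattern described in entry #19 can be sketched as follows. This is a toy illustration and not the GRRAF code: the graph database is replaced by an in-memory adjacency dict, the LLM by a hypothetical `stub_generator`, and the time-out by a simple wall-clock budget checked between attempts. All names and defaults are assumptions.

```python
import time
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]


def answer_with_feedback(
    generate_code: Callable[[str, Optional[str]], str],
    graph: Graph,
    question: str,
    max_attempts: int = 3,       # assumed retry limit
    time_budget_s: float = 5.0,  # assumed overall time budget
):
    """Ask for query code, run it, and feed any error back to the generator."""
    deadline = time.monotonic() + time_budget_s
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        # The budget is only checked between attempts; it does not
        # interrupt a query that is already running.
        if time.monotonic() > deadline:
            break
        source = generate_code(question, feedback)
        scope = {"graph": graph}
        try:
            # Running generated code is unsafe outside a sandbox; this is
            # acceptable only because the generator here is a local stub.
            exec(source, scope)
            return scope["result"]
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
    return None


def stub_generator(question: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM call: the first draft is buggy,
    # the retry (which sees the error message) is correct.
    if feedback is None:
        return 'result = len(graph["missing"])'
    return "result = sum(len(nbrs) for nbrs in graph.values()) // 2"


triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
edge_count = answer_with_feedback(stub_generator, triangle, "How many edges?")
print(edge_count)
```

Because the graph stays outside the prompt and only the short query code and error messages are exchanged, the number of tokens does not depend on the graph size, which matches the constant token cost the abstract reports.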
#20 Large Language Models Imitate Logical Reasoning, but at what Cost? #20 大型语言模型模仿逻辑推理,但代价是什么?
Authors: [Lachlan McGinness](https://arxiv.org/search/?searchtype=author&query=Lachlan McGinness), [Peter Baumgartner](https://arxiv.org/search/?searchtype=author&query=Peter Baumgartner) 作者:Lachlan McGinness、Peter Baumgartner
We present a longitudinal study which evaluates the reasoning capability of frontier Large Language Models over an eighteen month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions from the PrOntoQA dataset and their faithfulness to reasoning strategies provided through in-context learning. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting. The introduction of thinking models allowed for significant improvement in model performance between 2024 and 2025. We then present a neuro-symbolic architecture which uses LLMs of less than 15 billion parameters to translate the problems into a standardised form. We then parse the standardised forms of the problems into a program to be solved by Z3, an SMT solver, to determine the satisfiability of the query. We report the number of prompt and completion tokens as well as the computational cost in FLOPs for open source models. The neuro-symbolic approach significantly reduces the computational cost while maintaining near perfect performance. The common approximation that the number of inference FLOPs is double the product of the active parameters and total tokens was accurate within 10% for all experiments. 我们提出了一项纵向研究,评估了前沿大型语言模型在 18 个月内的推理能力。我们测量了 2023 年 12 月、2024 年 9 月和 2025 年 6 月三个领先模型对 PrOntoQA 数据集中的真或错问题的准确性,以及它们对通过上下文学习提供的推理策略的忠实度。2023 年至 2024 年绩效的提升可归因于隐藏的思维链提示。思维模型的引入使得模型性能在 2024 年至 2025 年间得到了显着提高。然后,我们提出了一种神经符号架构,该架构使用少于 150 亿个参数的 LLM 将问题转化为标准化形式。然后,我们将问题的标准化形式解析为一个程序,由 SMT 求解器 Z3 求解,以确定查询的可满足性。我们报告了开源模型的提示和完成令牌的数量以及 FLOP 中的计算成本。神经符号方法显着降低了计算成本,同时保持了近乎完美的性能。对于所有实验,推理 FLOP 的数量是活动参数和总标记的乘积的两倍,这一常见近似值的准确度在 10% 以内。
Subjects: Artificial Intelligence, Logic in Computer Science 科目:人工智能、计算机科学中的逻辑
Publish: 2025-09-16 04:03:42 UTC 发布时间: 2025-09-16 04:03:42 UTC
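The rule of thumb mentioned at the end of entry #20 (inference FLOPs ≈ 2 × active parameters × total tokens) is easy to check by hand. The sketch below uses made-up numbers; the parameter count, token counts, and "measured" value are hypothetical and are not taken from the paper.

```python
def approx_inference_flops(active_params: float, total_tokens: int) -> float:
    """Rule of thumb: FLOPs ≈ 2 * active parameters * total tokens."""
    return 2.0 * active_params * total_tokens


def within_tolerance(measured: float, estimated: float, tol: float = 0.10) -> bool:
    """True if the estimate is within `tol` (relative) of the measured value."""
    return abs(measured - estimated) <= tol * measured


# Hypothetical run: a 14B-parameter dense model, 1200 prompt tokens
# and 300 completion tokens.
params = 14e9
prompt_tokens, completion_tokens = 1200, 300
estimate = approx_inference_flops(params, prompt_tokens + completion_tokens)
print(f"estimated FLOPs: {estimate:.3e}")

# Hypothetical measured value, used only to show the 10% check.
print(within_tolerance(measured=4.5e13, estimated=estimate))
```

For a mixture-of-experts model, `active_params` would be the parameters used per token rather than the total parameter count.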
#21 Learn to Relax with Large Language Models: Solving Nonlinear Combinatorial Optimization Problems via Bidirectional Coevolution #21 学会用大型语言模型放松:通过双向协进解决非线性组合优化问题
Authors: [Beidan Liu](https://arxiv.org/search/?searchtype=author&query=Beidan Liu), [Zhengqiu Zhu](https://arxiv.org/search/?searchtype=author&query=Zhengqiu Zhu), [Chen Gao](https://arxiv.org/search/?searchtype=author&query=Chen Gao), [Yong Zhao](https://arxiv.org/search/?searchtype=author&query=Yong Zhao), [Wei Qi](https://arxiv.org/search/?searchtype=author&query=Wei Qi), [Quanjun Yin](https://arxiv.org/search/?searchtype=author&query=Quanjun Yin) 作者:刘北丹,朱正秋,陈高,赵勇,魏琦,尹全军
Nonlinear Combinatorial Optimization Problems (NCOPs) present a formidable computational hurdle in practice, as their nonconvex nature gives rise to multi-modal solution spaces that defy efficient optimization. Traditional constraint relaxation approaches rely heavily on expert-driven, iterative design processes that lack systematic automation and scalable adaptability. While recent Large Language Model (LLM)-based optimization methods show promise for autonomous problem-solving, they predominantly function as passive constraint validators rather than proactive strategy architects, failing to handle the sophisticated constraint interactions inherent to NCOPs. To address these limitations, we introduce the first end-to-end Automated Constraint Optimization (AutoCO) method, which revolutionizes NCOPs resolution through learning to relax with LLMs. Specifically, we leverage structured LLM reasoning to generate constraint relaxation strategies, which are dynamically evolving with algorithmic principles and executable code through a unified triple-representation scheme. We further establish a novel bidirectional (global-local) coevolution mechanism that synergistically integrates Evolutionary Algorithms for intensive local refinement with Monte Carlo Tree Search for systematic global strategy space exploration, ensuring optimal balance between intensification and diversification in fragmented solution spaces. Finally, comprehensive experiments on three challenging NCOP benchmarks validate AutoCO’s consistent effectiveness and superior performance over the baselines.
非线性组合优化问题 (NCOP) 在实践中提出了一个巨大的计算障碍,因为它们的非凸性质会产生无法有效优化的多模态解空间。传统的约束松弛方法严重依赖于专家驱动的迭代设计流程,缺乏系统的自动化和可扩展的适应性。虽然最近基于大型语言模型(LLM)的优化方法显示出自主解决问题的前景,但它们主要充当被动约束验证者而不是主动策略架构师,无法处理 NCOP 所固有的复杂约束交互。为了解决这些限制,我们引入了第一个端到端的自动化约束优化 (AutoCO) 方法,该方法通过学习利用 LLM 进行松弛,彻底改变了 NCOP 的求解方式。具体来说,我们利用结构化的 LLM 推理来生成约束松弛策略,这些策略通过统一的三重表示方案随着算法原理和可执行代码动态演变。我们进一步建立了一种新的双向(全局-局部)协同进化机制,将用于密集局部细化的进化算法与用于系统性全局策略空间探索的蒙特卡洛树搜索协同集成,确保碎片化解空间中强化和多样化之间的最佳平衡。最后,对三个具有挑战性的 NCOP 基准的综合实验验证了 AutoCO 相对于基线的一致有效性和卓越性能。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 03:59:51 UTC 发布时间: 2025-09-16 03:59:51 UTC
#22 ECG-aBcDe: Overcoming Model Dependence, Encoding ECG into a Universal Language for Any LLM #22 ECG-aBcDe:克服模型依赖性,将心电图编码为任何 LLM 的通用语言
Authors: [Yong Xia](https://arxiv.org/search/?searchtype=author&query=Yong Xia), [Jingxuan Li](https://arxiv.org/search/?searchtype=author&query=Jingxuan Li), [YeTeng Sun](https://arxiv.org/search/?searchtype=author&query=YeTeng Sun), [Jiarui Bu](https://arxiv.org/search/?searchtype=author&query=Jiarui Bu) 作者:夏勇,李静璇,孙叶腾,布佳瑞
Large Language Models (LLMs) hold significant promise for electrocardiogram (ECG) analysis, yet challenges remain regarding transferability, time-scale information learning, and interpretability. Current methods suffer from model-specific ECG encoders, hindering transfer across LLMs. Furthermore, LLMs struggle to capture crucial time-scale information inherent in ECGs due to Transformer limitations. And their black-box nature limits clinical adoption. To address these limitations, we introduce ECG-aBcDe, a novel ECG encoding method that transforms ECG signals into a universal ECG language readily interpretable by any LLM. By constructing a hybrid dataset of ECG language and natural language, ECG-aBcDe enables direct fine-tuning of pre-trained LLMs without architectural modifications, achieving “construct once, use anywhere” capability. Moreover, the bidirectional convertibility between ECG and ECG language of ECG-aBcDe allows for extracting attention heatmaps from ECG signals, significantly enhancing interpretability. Finally, ECG-aBcDe explicitly represents time-scale information, mitigating Transformer limitations. This work presents a new paradigm for integrating ECG analysis with LLMs. Compared with existing methods, our method achieves competitive performance on ROUGE-L and METEOR. Notably, it delivers significant improvements in the BLEU-4, with improvements of 2.8 times and 3.9 times in in-dataset and cross-dataset evaluations, respectively, reaching scores of 42.58 and 30.76. These results provide strong evidence for the feasibility of the new paradigm. 
大型语言模型 (LLM) 在心电图 (ECG) 分析方面具有巨大的前景,但在可迁移性、时间尺度信息学习和可解释性方面仍然存在挑战。目前的方法受限于特定于模型的心电图编码器,阻碍了跨 LLM 的迁移。此外,由于 Transformer 的限制,LLM 很难捕获心电图固有的关键时间尺度信息。它们的黑盒性质限制了临床采用。为了解决这些限制,我们推出了 ECG-aBcDe,这是一种新型的 ECG 编码方法,可将 ECG 信号转换为任何 LLM 都可以轻松解释的通用 ECG 语言。通过构建心电图语言和自然语言的混合数据集,ECG-aBcDe 无需修改架构即可直接对预训练的 LLM 进行微调,实现“一次构建,随处使用”的能力。此外,ECG-aBcDe 的 ECG 和 ECG 语言之间的双向可转换性允许从 ECG 信号中提取注意力热图,从而显着增强可解释性。最后,ECG-aBcDe 明确表示时间尺度信息,减轻了 Transformer 的限制。这项工作提出了一种将心电图分析与 LLM 相结合的新范式。与现有方法相比,我们的方法在 ROUGE-L 和 METEOR 上取得了具有竞争力的性能。值得注意的是,它在 BLEU-4 中取得了显着的改进,在数据集内和跨数据集评估中分别提高了 2.8 倍和 3.9 倍,得分分别达到 42.58 和 30.76。这些结果为新范式的可行性提供了有力的证据。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 03:41:02 UTC 发布时间: 2025-09-16 03:41:02 UTC
#23 Mob-based cattle weight gain forecasting using ML models #23 使用 ML 模型预测基于 Mob 的牛体重增加
Authors: [Muhammad Riaz Hasib Hossain](https://arxiv.org/search/?searchtype=author&query=Muhammad Riaz Hasib Hossain), [Rafiqul Islam](https://arxiv.org/search/?searchtype=author&query=Rafiqul Islam), [Shawn R McGrath](https://arxiv.org/search/?searchtype=author&query=Shawn R McGrath), [Md Zahidul Islam](https://arxiv.org/search/?searchtype=author&query=Md Zahidul Islam), [David Lamb](https://arxiv.org/search/?searchtype=author&query=David Lamb) 作者:穆罕默德·里亚兹·哈西布·侯赛因、拉菲库尔·伊斯兰、肖恩·麦格拉思、Md 扎希杜尔·伊斯兰、大卫·兰姆
Forecasting mob based cattle weight gain (MB CWG) may benefit large livestock farms, allowing farmers to refine their feeding strategies, make educated breeding choices, and reduce risks linked to climate variability and market fluctuations. In this paper, a novel technique termed MB CWG is proposed to forecast the one month advanced weight gain of herd based cattle using historical data collected from the Charles Sturt University Farm. This research employs a Random Forest (RF) model, comparing its performance against Support Vector Regression (SVR) and Long Short Term Memory (LSTM) models for monthly weight gain prediction. Four datasets were used to evaluate the performance of models, using 756 sample data from 108 herd-based cattle, along with weather data (rainfall and temperature) influencing CWG. The RF model performs better than the SVR and LSTM models across all datasets, achieving an R^2 of 0.973, RMSE of 0.040, and MAE of 0.033 when both weather and age factors were included. The results indicate that including both weather and age factors significantly improves the accuracy of weight gain predictions, with the RF model outperforming the SVR and LSTM models in all scenarios. These findings demonstrate the potential of RF as a robust tool for forecasting cattle weight gain in variable conditions, highlighting the influence of age and climatic factors on herd based weight trends. This study has also developed an innovative automated pre processing tool to generate a benchmark dataset for MB CWG predictive models. The tool is publicly available on GitHub and can assist in preparing datasets for current and future analytical research.
预测基于群体(mob)的牛体重增加 (MB CWG) 可能使大型畜牧场受益,使农民能够完善他们的饲养策略,做出明智的育种选择,并降低与气候变化和市场波动相关的风险。在本文中,提出了一种称为 MB CWG 的新技术,利用从查尔斯特大学(Charles Sturt University)农场收集的历史数据来预测群养牛未来一个月的体重增加。本研究采用随机森林 (RF) 模型,将其性能与支持向量回归 (SVR) 和长短期记忆 (LSTM) 模型进行比较,以预测每月体重增加。使用四个数据集来评估模型的性能,使用来自 108 头群养牛的 756 个样本数据,以及影响 CWG 的天气数据(降雨量和温度)。RF 模型在所有数据集上的性能优于 SVR 和 LSTM 模型,当同时考虑天气和年龄因素时,R^2 为 0.973,RMSE 为 0.040,MAE 为 0.033。结果表明,同时纳入天气和年龄因素显著提高了体重增加预测的准确性,RF 模型在所有场景下均优于 SVR 和 LSTM 模型。这些发现证明了 RF 作为预测牛在可变条件下体重增加的强大工具的潜力,强调了年龄和气候因素对牛群体重趋势的影响。本研究还开发了一种创新的自动化预处理工具,用于生成 MB CWG 预测模型的基准数据集。该工具在 GitHub 上公开可用,可以帮助为当前和未来的分析研究准备数据集。
Subjects: Artificial Intelligence, Machine Learning 科目: 人工智能 , 机器学习
Publish: 2025-09-16 03:23:43 UTC 发布时间: 2025-09-16 03:23:43 UTC
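For readers less familiar with the three scores reported in entry #23 (R^2, RMSE, MAE), the sketch below computes them from their standard definitions in plain Python. The weight-gain values are invented for illustration and have no connection to the paper's dataset.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute R^2, RMSE and MAE from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)                     # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Invented monthly weight gains (kg/day) and model predictions.
observed = [0.80, 0.95, 1.10, 0.70, 1.05]
predicted = [0.82, 0.90, 1.12, 0.74, 1.00]
for name, value in regression_metrics(observed, predicted).items():
    print(f"{name}: {value:.3f}")
```

RMSE and MAE are in the units of the target, so values such as 0.040 and 0.033 are only meaningful relative to the scale of the weight-gain variable, while R^2 is unitless.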
#24 GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL #24 GBV-SQL:多代理 Text2SQL 的引导生成和 SQL2Text 反向翻译验证
Authors: [Daojun Chen](https://arxiv.org/search/?searchtype=author&query=Daojun Chen), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Shenyuan Ren](https://arxiv.org/search/?searchtype=author&query=Shenyuan Ren), [Qingzhi Ma](https://arxiv.org/search/?searchtype=author&query=Qingzhi Ma), [Pengpeng Zhao](https://arxiv.org/search/?searchtype=author&query=Pengpeng Zhao), [An Liu](https://arxiv.org/search/?searchtype=author&query=An Liu) 作者:陈道军、王习、沈任、马庆志、赵鹏鹏、刘安
While Large Language Models have significantly advanced Text2SQL generation, a critical semantic gap persists where syntactically valid queries often misinterpret user intent. To mitigate this challenge, we propose GBV-SQL, a novel multi-agent framework that introduces Guided Generation with SQL2Text Back-translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. Critically, our investigation reveals that current evaluation is undermined by a systemic issue: the poor quality of the benchmarks themselves. We introduce a formal typology for “Gold Errors”, which are pervasive flaws in the ground-truth data, and demonstrate how they obscure true model performance. On the challenging BIRD benchmark, GBV-SQL achieves 63.23% execution accuracy, a 5.8% absolute improvement. After removing flawed examples, GBV-SQL achieves 96.5% (dev) and 97.6% (test) execution accuracy on the Spider benchmark. Our work offers both a robust framework for semantic validation and a critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation. 虽然大型语言模型显着推进了 Text2SQL 生成,但语法上有效的查询经常误解用户意图,但关键的语义差距仍然存在。为了缓解这一挑战,我们提出了 GBV-SQL,这是一种新颖的多代理框架,它引入了带有 SQL2Text 反向翻译验证的引导生成。该机制使用专门的代理将生成的 SQL 翻译回自然语言,从而验证其与原始问题的逻辑一致性。至关重要的是,我们的调查表明,当前的评估受到一个系统性问题的破坏:基准本身的质量很差。我们介绍了“黄金误差”的正式类型,这是地面实况数据中普遍存在的缺陷,并演示了它们如何掩盖了真正的模型性能。在具有挑战性的 BIRD 基准测试中,GBV-SQL 实现了 63.23% 的执行准确率,绝对提高了 5.8%。在删除有缺陷的示例后,GBV-SQL 在 Spider 基准测试中实现了 96.5%(开发)和 97.6%(测试)的执行准确率。我们的工作为语义验证提供了一个强大的框架,也提供了对基准完整性的批判性视角,强调了更严格的数据集管理的必要性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 03:21:12 UTC 发布时间: 2025-09-16 03:21:12 UTC
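The GBV-SQL results above are stated as execution accuracy. The sketch below shows one simplified way such a score can be computed: run the predicted and gold queries against the same database and compare the returned rows. The helper names, the toy `singer` table, and the query pairs are assumptions made for illustration; the official BIRD and Spider evaluators differ in detail, and this sketch deliberately ignores row order.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows) == sorted(gold_rows)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)

# Toy in-memory database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bob", 45), (3, "Cy", 52)],
)

pairs = [
    # Different text, same result set: counted as correct.
    ("SELECT name FROM singer WHERE age > 40", "SELECT name FROM singer WHERE age >= 41"),
    # Syntactically valid but answers a different question: counted as wrong.
    ("SELECT name FROM singer WHERE age < 40", "SELECT name FROM singer WHERE age > 40"),
    # Does not parse: counted as wrong.
    ("SELEC name FROM singer", "SELECT name FROM singer"),
    # Equivalent aggregates: counted as correct.
    ("SELECT count(*) FROM singer", "SELECT count(id) FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
```

The second pair illustrates the semantic gap the abstract describes: the query executes cleanly yet misreads the intent, so only a comparison against trusted gold results reveals the error. That is also why flaws in the gold queries themselves distort the score.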
#25 Analogy-Driven Financial Chain-of-Thought (AD-FCoT): A Prompting Approach for Financial Sentiment Analysis #25 类比驱动的金融思维链 (AD-FCoT):金融情绪分析的提示方法
Authors: [Anmol Singhal](https://arxiv.org/search/?searchtype=author&query=Anmol Singhal), [Navya Singhal](https://arxiv.org/search/?searchtype=author&query=Navya Singhal) 作者:Anmol Singhal、Navya Singhal
Financial news sentiment analysis is crucial for anticipating market movements. With the rise of AI techniques such as Large Language Models (LLMs), which demonstrate strong text understanding capabilities, there has been renewed interest in enhancing these systems. Existing methods, however, often struggle to capture the complex economic context of news and lack transparent reasoning, which undermines their reliability. We propose Analogy-Driven Financial Chain-of-Thought (AD-FCoT), a prompting framework that integrates analogical reasoning with chain-of-thought (CoT) prompting for sentiment prediction on historical financial news. AD-FCoT guides LLMs to draw parallels between new events and relevant historical scenarios with known outcomes, embedding these analogies into a structured, step-by-step reasoning chain. To our knowledge, this is among the first approaches to explicitly combine analogical examples with CoT reasoning in finance. Operating purely through prompting, AD-FCoT requires no additional training data or fine-tuning and leverages the model’s internal financial knowledge to generate rationales that mirror human analytical reasoning. Experiments on thousands of news articles show that AD-FCoT outperforms strong baselines in sentiment classification accuracy and achieves substantially higher correlation with market returns. Its generated explanations also align with domain expertise, providing interpretable insights suitable for real-world financial analysis. 财经新闻情绪分析对于预测市场走势至关重要。随着大型语言模型 (LLM) 等人工智能技术的兴起,这些技术展示了强大的文本理解能力,人们对增强这些系统重新产生了兴趣。然而,现有方法往往难以捕捉新闻的复杂经济背景,并且缺乏透明的推理,这削弱了其可靠性。我们提出了类比驱动的金融思维链(AD-FCoT),这是一个将类比推理与思维链(CoT)提示相结合的提示框架,用于对历史财经新闻进行情绪预测。AD-FCoT 引导 LLM 在新事件和具有已知结果的相关历史场景之间进行类比,将这些类比嵌入到结构化的分步推理链中。据我们所知,这是金融领域将类比示例与 CoT 推理明确结合起来的首批方法之一。AD-FCoT 纯粹通过提示运行,不需要额外的训练数据或微调,并利用模型的内部金融知识来生成反映人类分析推理的推理依据。对数千篇新闻文章的实验表明,AD-FCoT 在情绪分类准确性方面优于强基线,并实现了与市场回报的更高相关性。其生成的解释也与领域专业知识相一致,提供适合现实世界金融分析的可解释见解。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 03:19:26 UTC 发布时间: 2025-09-16 03:19:26 UTC
#26 DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models #26 DaSAThco:通过大型语言模型优化数据感知 SAT 启发式组合
Authors: [Minyu Chen](https://arxiv.org/search/?searchtype=author&query=Minyu Chen), [Guoqiang Li](https://arxiv.org/search/?searchtype=author&query=Guoqiang Li) 作者:陈敏宇,李国强
The performance of Conflict-Driven Clause Learning solvers hinges on internal heuristics, yet the heterogeneity of SAT problems makes a single, universally optimal configuration unattainable. While prior automated methods can find specialized configurations for specific problem families, this dataset-specific approach lacks generalizability and requires costly re-optimization for new problem types. We introduce DaSAThco, a framework that addresses this challenge by learning a generalizable mapping from instance features to tailored heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework uses a Large Language Model, guided by systematically defined Problem Archetypes, to generate a diverse portfolio of specialized heuristic ensembles and subsequently learns an adaptive selection mechanism to form the final mapping. Experiments show that DaSAThco achieves superior performance and, most notably, demonstrates robust out-of-domain generalization where non-adaptive methods show limitations. Our work establishes a more scalable and practical path toward automated algorithm design for complex, configurable systems. 冲突驱动的从句学习求解器的性能取决于内部启发式方法,但 SAT 问题的异构性使得单一的、普遍的最优配置无法实现。虽然以前的自动化方法可以为特定问题族找到专门的配置,但这种特定于数据集的方法缺乏通用性,并且需要对新问题类型进行昂贵的重新优化。我们介绍了 DaSAThco,这是一个框架,它通过学习从实例特征到定制启发式集成的可通用映射来应对这一挑战,从而实现一次训练、广泛适应的模型。我们的框架使用大型语言模型,在系统定义的问题原型的指导下,生成多样化的专业启发式集成组合,随后学习自适应选择机制以形成最终映射。实验表明,DaSAThco 实现了卓越的性能,最值得注意的是,在非自适应方法显示出局限性的情况下,表现出稳健的域外泛化。我们的工作为复杂的可配置系统自动化算法设计建立了一条更具可扩展性和实用性的途径。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 02:58:50 UTC 发布时间: 2025-09-16 02:58:50 UTC
#27 Match Chat: Real Time Generative AI and Generative Computing for Tennis #27 比赛聊天:网球的实时生成式人工智能和生成式计算
Authors: [Aaron Baughman](https://arxiv.org/search/?searchtype=author&query=Aaron Baughman), [Gozde Akay](https://arxiv.org/search/?searchtype=author&query=Gozde Akay), [Eduardo Morales](https://arxiv.org/search/?searchtype=author&query=Eduardo Morales), [Rahul Agarwal](https://arxiv.org/search/?searchtype=author&query=Rahul Agarwal), [Preetika Srivastava](https://arxiv.org/search/?searchtype=author&query=Preetika Srivastava) 作者:Aaron Baughman、Gozde Akay、Eduardo Morales、Rahul Agarwal、Preetika Srivastava
We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability, highlighting a practical path for deploying performant agentic systems in dynamic environments.
我们推出 Match Chat,这是一款实时、代理驱动的助手,旨在通过对比赛相关查询提供即时、准确的响应来增强网球迷体验。Match Chat 将生成式人工智能 (GenAI) 与生成式计算 (GenComp) 技术相结合,以综合现场网球单打比赛期间的关键见解。该系统在 2025 年温布尔登锦标赛和 2025 年美国网球公开赛上首次亮相,通过自然语言查询为约 100 万用户提供了对流媒体和静态数据的无缝访问。该架构基于面向代理的架构 (AOA),结合了规则引擎、预测模型和代理,在将用户查询传递给 GenAI 组件之前对其进行预处理和优化。Match Chat 系统的回答准确率为 92.83%,在负载高达 120 个/秒 (RPS) 的情况下,平均响应时间为 6.25 秒。超过 96.08% 的查询都是使用交互式提示设计进行引导的,有助于提供优先考虑清晰度、响应能力和最小努力的用户体验。该系统旨在掩盖架构的复杂性,提供无摩擦且直观的界面,无需入门或熟悉技术。在两次大满贯部署中,Match Chat 都保持了 100% 的正常运行时间并支持近 100 万独立用户,凸显了该平台的可扩展性和可靠性。这项工作介绍了面向消费者的实时人工智能系统的关键设计模式,这些模式强调速度、精度和可用性,突出了在动态环境中部署高性能代理系统的实用路径。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-16 02:38:27 UTC 发布时间: 2025-09-16 02:38:27 UTC
#28 Redefining CX with Agentic AI: Minerva CQ Case Study #28 使用 Agentic AI 重新定义 CX:Minerva CQ 案例研究
Authors: [Garima Agrawal](https://arxiv.org/search/?searchtype=author&query=Garima Agrawal), [Riccardo De Maria](https://arxiv.org/search/?searchtype=author&query=Riccardo De Maria), [Kiran Davuluri](https://arxiv.org/search/?searchtype=author&query=Kiran Davuluri), [Daniele Spera](https://arxiv.org/search/?searchtype=author&query=Daniele Spera), [Charlie Read](https://arxiv.org/search/?searchtype=author&query=Charlie Read), [Cosimo Spera](https://arxiv.org/search/?searchtype=author&query=Cosimo Spera), [Jack Garrett](https://arxiv.org/search/?searchtype=author&query=Jack Garrett), [Don Miller](https://arxiv.org/search/?searchtype=author&query=Don Miller) 作者:加里玛·阿格拉瓦尔、里卡多·德·玛丽亚、基兰·达武鲁里、丹尼尔·斯佩拉、查理·里德、科西莫·斯佩拉、杰克·加勒特、唐·米勒
Despite advances in AI for contact centers, customer experience (CX) continues to suffer from high average handling time (AHT), low first-call resolution, and poor customer satisfaction (CSAT). A key driver is the cognitive load on agents, who must navigate fragmented systems, troubleshoot manually, and frequently place customers on hold. Existing AI-powered agent-assist tools are often reactive, driven by static rules, simple prompting, or retrieval-augmented generation (RAG) without deeper contextual reasoning. We introduce Agentic AI: goal-driven, autonomous, tool-using systems that proactively support agents in real time. Unlike conventional approaches, Agentic AI identifies customer intent, triggers modular workflows, maintains evolving context, and adapts dynamically to conversation state. This paper presents a case study of Minerva CQ, a real-time Agent Assist product deployed in voice-based customer support. Minerva CQ integrates real-time transcription, intent and sentiment detection, entity recognition, contextual retrieval, dynamic customer profiling, and partial conversational summaries, enabling proactive workflows and continuous context-building. Deployed in live production, Minerva CQ acts as an AI co-pilot, delivering measurable improvements in agent efficiency and customer experience across multiple deployments. 尽管联络中心的人工智能取得了进步,但客户体验 (CX) 仍然受到平均处理时间 (AHT) 高、首次呼叫解决率低和客户满意度 (CSAT) 差的影响。一个关键驱动因素是座席的认知负担,他们必须浏览分散的系统,手动排除故障,并经常让客户处于等待状态。现有的人工智能驱动的座席辅助工具通常是被动的,由静态规则、简单提示或检索增强生成 (RAG) 驱动,没有更深入的上下文推理。我们引入了 Agentic AI,即目标驱动、自主、使用工具的系统,可主动实时支持座席。与传统方法不同,Agentic AI 可以识别客户意图、触发模块化工作流程、维护不断变化的上下文并动态适应对话状态。本文介绍了 Minerva CQ 的案例研究,Minerva CQ 是一种部署在基于语音的客户支持中的实时 Agent Assist 产品。Minerva CQ 集成了实时转录、意图和情感检测、实体识别、上下文检索、动态客户分析和部分对话摘要,从而实现主动工作流程和持续上下文构建。Minerva CQ 部署在现场生产中,充当 AI 副驾驶,在多个部署中提供可衡量的座席效率和客户体验改进。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 02:30:33 UTC 发布时间: 2025-09-16 02:30:33 UTC
#29 Human + AI for Accelerating Ad Localization Evaluation #29 人类+人工智能加速广告本地化评估
Authors: [Harshit Rajgarhia](https://arxiv.org/search/?searchtype=author&query=Harshit Rajgarhia), [Shivali Dalmia](https://arxiv.org/search/?searchtype=author&query=Shivali Dalmia), [Mengyang Zhao](https://arxiv.org/search/?searchtype=author&query=Mengyang Zhao), [Mukherji Abhishek](https://arxiv.org/search/?searchtype=author&query=Mukherji Abhishek), [Kiran Ganesh](https://arxiv.org/search/?searchtype=author&query=Kiran Ganesh) 作者:Harshit Rajgarhia、Shivali Dalmia、Mengyang Zhao、Mukherji Abhishek、Kiran Ganesh
Adapting advertisements for multilingual audiences requires more than simple text translation; it demands preservation of visual consistency, spatial alignment, and stylistic integrity across diverse languages and formats. We introduce a structured framework that combines automated components with human oversight to address the complexities of advertisement localization. To the best of our knowledge, this is the first work to integrate scene text detection, inpainting, machine translation (MT), and text reimposition specifically for accelerating ad localization evaluation workflows. Qualitative results across six locales demonstrate that our approach produces semantically accurate and visually coherent localized advertisements, suitable for deployment in real-world workflows. 为多语言受众调整广告需要的不仅仅是简单的文本翻译;它要求保持不同语言和格式的视觉一致性、空间对齐和风格完整性。我们引入了一个结构化框架,将自动化组件与人工监督相结合,以解决广告本地化的复杂性。据我们所知,这是第一个集成场景文本检测、修复、机器翻译 (MT) 和文本重版的工作,专门用于加速广告本地化评估工作流程。六个区域设置的定性结果表明,我们的方法生成了语义准确且视觉上连贯的本地化广告,适合在实际工作流程中部署。
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 科目:人工智能 , 计算机视觉与模式识别 , 机器学习
Publish: 2025-09-16 00:52:41 UTC 发布时间: 2025-09-16 00:52:41 UTC
#30 zELO: ELO-inspired Training Method for Rerankers and Embedding Models #30 zELO:受 ELO 启发的重新排名者和嵌入模型的训练方法
Authors: [Nicholas Pipitone](https://arxiv.org/search/?searchtype=author&query=Nicholas Pipitone), [Ghita Houir Alami](https://arxiv.org/search/?searchtype=author&query=Ghita Houir Alami), [Advaith Avadhanam](https://arxiv.org/search/?searchtype=author&query=Advaith Avadhanam), [Anton Kaminskyi](https://arxiv.org/search/?searchtype=author&query=Anton Kaminskyi), [Ashley Khoo](https://arxiv.org/search/?searchtype=author&query=Ashley Khoo) 作者:Nicholas Pipitone、Ghita Houir Alami、Advaith Avadhanam、Anton Kaminskyi、Ashley Khoo
We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statically equivalent to a Thurstone model. Based on the zELO method, we use unsupervised data in order to train a suite of state-of-the-art open-weight reranker models: zerank-1 and zerank-1-small. These models achieve the highest retrieval scores in multiple domains, including finance, legal, code, and STEM, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall. These models also demonstrate great versatility, maintaining their 0-shot performance on out-of-domain and private customer datasets. The training data included 112,000 queries and 100 documents per query, and was trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours. 我们引入了一种名为 zELO 的新颖训练方法,它通过分析排名任务在静态上等同于 Thurstone 模型来优化检索性能。基于 zELO 方法,我们使用无监督数据来训练一套最先进的开放权重重排序器模型:zerank-1 和 zerank-1-small。这些模型在金融、法律、代码和 STEM 等多个领域取得了最高的检索分数,在 NDCG@10 和 Recall 方面都优于闭源专有重排序器。这些模型还表现出强大的多功能性,在域外和私有客户数据集上保持零样本性能。训练数据包括 112,000 个查询和每个查询 100 个文档,并在不到 10,000 H100 小时的时间内从未注释的查询和文档中端到端进行训练。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-16 00:44:08 UTC 发布时间: 2025-09-16 00:44:08 UTC
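The zELO abstract above relates ranking to rating models in the Elo and Thurstone family. For readers unfamiliar with that idea, the sketch below shows a classic Elo update turning pairwise preferences between documents into a ranking. This is the textbook Elo rule with a conventional K-factor of 32 and 400-point scale, not the paper's training pipeline; the document IDs and comparison outcomes are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Shift ratings toward the observed outcome; total rating mass is conserved."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical candidate documents for one query, all starting at the same rating.
ratings = {"doc_a": 1000.0, "doc_b": 1000.0, "doc_c": 1000.0}

# Hypothetical pairwise judgments as (preferred, rejected), e.g. from a judge model.
comparisons = [("doc_a", "doc_b"), ("doc_a", "doc_c"), ("doc_b", "doc_c")]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

# Sorting by rating yields a per-query relevance ordering.
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

The appeal of this family of models is that sparse pairwise judgments, which are cheap to collect without human annotation, can be converted into per-document scalar scores usable as training targets for a pointwise reranker.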
#31 A Dimensionality-Reduced XAI Framework for Roundabout Crash Severity Insights #31 用于环形交叉路口碰撞严重性洞察的降维 XAI 框架
Authors: [Rohit Chakraborty](https://arxiv.org/search/?searchtype=author&query=Rohit Chakraborty), [Subasish Das](https://arxiv.org/search/?searchtype=author&query=Subasish Das) 作者:Rohit Chakraborty,Subasish Das
Roundabouts reduce severe crashes, yet risk patterns vary by conditions. This study analyzes 2017-2021 Ohio roundabout crashes using a two-step, explainable workflow. Cluster Correspondence Analysis (CCA) identifies co-occurring factors and yields four crash patterns. A tree-based severity model is then interpreted with SHAP to quantify drivers of injury within and across patterns. Results show higher severity when darkness, wet surfaces, and higher posted speeds coincide with fixed-object or angle events, and lower severity in clear, low-speed settings. Pattern-specific explanations highlight mechanisms at entries (fail-to-yield, gap acceptance), within multi-lane circulation (improper maneuvers), and during slow-downs (rear-end). The workflow links pattern discovery with case-level explanations, supporting site screening, countermeasure selection, and audit-ready reporting. The contribution to Information Systems is a practical template for usable XAI in public safety analytics. 环形交叉路口减少了严重的碰撞,但风险模式因条件而异。本研究使用两步、可解释的工作流程分析了 2017-2021 年俄亥俄州环形交叉路口的车祸。聚类对应分析 (CCA) 识别并发生的因素并产生四种碰撞模式。然后使用 SHAP 解释基于树的严重程度模型,以量化模式内和跨模式的伤害驱动因素。结果显示,当黑暗、潮湿的表面和较高的发布速度与固定物体或角度事件重合时,严重性较高,而在清晰的低速设置中,严重性较低。特定模式的解释强调了进入时(失败让行、间隙接受)、多车道循环内(不当机动)和减速期间(追尾)的机制。该工作流程将模式发现与案例级解释联系起来,支持站点筛选、对策选择和审计就绪报告。对信息系统的贡献是公共安全分析中可用的 XAI 的实用模板。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-15 23:59:07 UTC 发布时间: 2025-09-15 23:59:07 UTC
#32 Physical Complexity of a Cognitive Artifact #32 认知人工制品的物理复杂性
Authors: [Gülce Kardeş](https://arxiv.org/search/?searchtype=author&query=Gülce Kardeş), [David Krakauer](https://arxiv.org/search/?searchtype=author&query=David Krakauer), [Joshua Grochow](https://arxiv.org/search/?searchtype=author&query=Joshua Grochow) 作者:Gülce Kardeş、David Krakauer、Joshua Grochow
Cognitive science and theoretical computer science both seek to classify and explain the difficulty of tasks. Mechanisms of intelligence are those that reduce task difficulty. Here we map concepts from the computational complexity of a physical puzzle, the Soma Cube, onto cognitive problem-solving strategies through a ``Principle of Materiality’’. By analyzing the puzzle’s branching factor, measured through search tree outdegree, we quantitatively assess task difficulty and systematically examine how different strategies modify complexity. We incrementally refine a trial-and-error search by layering preprocessing (cognitive chunking), value ordering (cognitive free-sorting), variable ordering (cognitive scaffolding), and pruning (cognitive inference). We discuss how the competent use of artifacts reduces effective time complexity by exploiting physical constraints and propose a model of intelligence as a library of algorithms that recruit the capabilities of both mind and matter. 认知科学和理论计算机科学都试图对任务的难度进行分类和解释。智能机制是那些降低任务难度的机制。在这里,我们通过“物质性原理”将物理谜题 Soma Cube 的计算复杂性的概念映射到认知问题解决策略上。通过分析谜题的分支因子(通过搜索树输出度来衡量),我们定量评估任务难度,并系统地检查不同策略如何改变复杂性。我们通过分层预处理(认知分块)、值排序(认知自由排序)、变量排序(认知脚手架)和修剪(认知推理)来逐步细化试错搜索。我们讨论了如何通过利用物理约束来有效使用人工制品来降低有效的时间复杂度,并提出了一种智能模型作为算法库,可以招募心灵和物质的能力。
Subjects: Artificial Intelligence, Computers and Society, Human-Computer Interaction 科目: 人工智能 , 计算机与社会 , 人机交互
Publish: 2025-09-15 22:39:30 UTC 发布时间: 2025-09-15 22:39:30 UTC
#33 Empowering Clinical Trial Design through AI: A Randomized Evaluation of PowerGPT #33 通过人工智能增强临床试验设计能力:PowerGPT 的随机评估
Authors: [Yiwen Lu](https://arxiv.org/search/?searchtype=author&query=Yiwen Lu), [Lu Li](https://arxiv.org/search/?searchtype=author&query=Lu Li), [Dazheng Zhang](https://arxiv.org/search/?searchtype=author&query=Dazheng Zhang), [Xinyao Jian](https://arxiv.org/search/?searchtype=author&query=Xinyao Jian), [Tingyin Wang](https://arxiv.org/search/?searchtype=author&query=Tingyin Wang), [Siqi Chen](https://arxiv.org/search/?searchtype=author&query=Siqi Chen), [Yuqing Lei](https://arxiv.org/search/?searchtype=author&query=Yuqing Lei), [Jiayi Tong](https://arxiv.org/search/?searchtype=author&query=Jiayi Tong), [Zhaohan Xi](https://arxiv.org/search/?searchtype=author&query=Zhaohan Xi), [Haitao Chu](https://arxiv.org/search/?searchtype=author&query=Haitao Chu), [Chongliang Luo](https://arxiv.org/search/?searchtype=author&query=Chongliang Luo), [Alexis Ogdie](https://arxiv.org/search/?searchtype=author&query=Alexis Ogdie), [Brian Athey](https://arxiv.org/search/?searchtype=author&query=Brian Athey), [Alparslan Turan](https://arxiv.org/search/?searchtype=author&query=Alparslan Turan), [Michael Abramoff](https://arxiv.org/search/?searchtype=author&query=Michael Abramoff), [Joseph C Cappelleri](https://arxiv.org/search/?searchtype=author&query=Joseph C Cappelleri), [Hua Xu](https://arxiv.org/search/?searchtype=author&query=Hua Xu), [Yun Lu](https://arxiv.org/search/?searchtype=author&query=Yun Lu), [Jesse Berlin](https://arxiv.org/search/?searchtype=author&query=Jesse Berlin), [Daniel I. Sessler](https://arxiv.org/search/?searchtype=author&query=Daniel I. Sessler), [David A. Asch](https://arxiv.org/search/?searchtype=author&query=David A. Asch), [Xiaoqian Jiang](https://arxiv.org/search/?searchtype=author&query=Xiaoqian Jiang), [Yong Chen](https://arxiv.org/search/?searchtype=author&query=Yong Chen) 作者: 卢奕雯, 陆丽, 张大正, 简鑫尧, 王婷银, 陈思琦, 雷玉清, 佟佳怡, 习兆汉, 褚海涛, 罗崇良, 亚历克西斯·奥格迪, 布莱恩·阿西, 阿尔帕斯兰·图兰, 迈克尔·阿布拉莫夫, 约瑟夫·卡佩勒里, 徐华, 陆云, Jesse Berlin, 丹尼尔·塞斯勒, 大卫·阿什, 江晓倩, 陈勇
Sample size calculations for power analysis are critical for clinical research and trial design, yet their complexity and reliance on statistical expertise create barriers for many researchers. We introduce PowerGPT, an AI-powered system integrating large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. In a randomized trial to evaluate its effectiveness, PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). These gains were consistent across various statistical tests and benefited both statisticians and non-statisticians as well as bridging expertise gaps. Already under deployment across multiple institutions, PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research. 功效分析的样本量计算对于临床研究和试验设计至关重要,但其复杂性和对统计专业知识的依赖给许多研究人员带来了障碍。我们介绍了 PowerGPT,这是一个人工智能驱动的系统,将大型语言模型 (LLM) 与统计引擎集成在一起,可在试验设计中自动选择测试和样本量估计。在一项评估其有效性的随机试验中,PowerGPT 显着提高了任务完成率(测试选择为 99.3% 对 88.9%,样本量计算为 99.3% 对 77.8%)和准确性(样本量估计为 94.1% 对 55.4%,p < 0.001),同时减少了平均完成时间(4.0 对 9.3 分钟,p < 0.001)。这些收益在各种统计测试中是一致的,使统计学家和非统计学家都受益,并弥合了专业知识差距。PowerGPT 已经在多个机构中部署,代表了一种可扩展的人工智能驱动方法,可提高临床研究统计功效分析的可访问性、效率和准确性。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-15 21:35:04 UTC 发布时间: 2025-09-15 21:35:04 UTC
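The PowerGPT abstract above concerns sample size estimation for power analysis. As a worked illustration of the kind of calculation involved, here is a minimal sketch for the simplest case: a two-sided comparison of two independent means with equal group sizes, using the normal approximation n = 2((z_{1-α/2} + z_{power}) / d)^2 per group. The function name and defaults are assumptions for illustration and do not describe PowerGPT's statistical engine.

```python
import math
from statistics import NormalDist

def sample_size_two_means(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means.

    Uses the normal approximation; effect_size is Cohen's d
    (mean difference divided by the common standard deviation).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)              # quantile for the target power
    n = 2.0 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up so the target power is met

# Medium effect (d = 0.5), 5% two-sided significance, 80% power.
n_per_group = sample_size_two_means(0.5)
```

An exact calculation based on the t distribution gives a slightly larger answer for small samples, and other designs (paired data, proportions, survival outcomes) need different formulas. Choosing the right one is the test-selection step the abstract says the system automates.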
#34 Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction #34 推理模型可以通过思维链重建进行准确修剪
Authors: [Ryan Lucas](https://arxiv.org/search/?searchtype=author&query=Ryan Lucas), [Kayhan Behdin](https://arxiv.org/search/?searchtype=author&query=Kayhan Behdin), [Zhipeng Wang](https://arxiv.org/search/?searchtype=author&query=Zhipeng Wang), [Qingquan Song](https://arxiv.org/search/?searchtype=author&query=Qingquan Song), [Shao Tang](https://arxiv.org/search/?searchtype=author&query=Shao Tang), [Rahul Mazumder](https://arxiv.org/search/?searchtype=author&query=Rahul Mazumder) 作者:Ryan Lucas、Kayhan Behdin、Zhipeng Wang、Qingquan Song、Shao Tang、Rahul Mazumder
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model’s on-policy chain-of-thought traces. This “Reasoning-Aware Compression” (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC DeepSeek-R1 等推理语言模型在推理过程中会产生很长的思维链跟踪,这使得大规模部署它们的成本很高。我们表明,与典型的语言建模任务相比,使用神经网络修剪等压缩技术会产生更大的性能损失,并且在某些情况下会使模型变慢,因为它们会导致模型产生更多的思维标记,但性能更差。我们表明,这部分是由于标准的 LLM 修剪方法通常侧重于输入重建,而推理是一项以解码为主的任务。我们引入了一个简单的直接修复:在修剪过程中,我们从输入和模型的策略思维链跟踪中共同重建激活。这种“推理感知压缩”(RAC) 无缝集成到现有的修剪工作流程(例如 SparseGPT)中,并显着提高了其性能。在论文中重现结果的代码可以在以下位置找到:https://github.com/RyanLucas3/RAC
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-15 21:19:13 UTC 发布时间: 2025-09-15 21:19:13 UTC
#35 Enhancing Physical Consistency in Lightweight World Models #35 增强轻量级世界模型的物理一致性
Authors: [Dingrui Wang](https://arxiv.org/search/?searchtype=author&query=Dingrui Wang), [Zhexiao Sun](https://arxiv.org/search/?searchtype=author&query=Zhexiao Sun), [Zhouheng Li](https://arxiv.org/search/?searchtype=author&query=Zhouheng Li), [Cheng Wang](https://arxiv.org/search/?searchtype=author&query=Cheng Wang), [Youlun Peng](https://arxiv.org/search/?searchtype=author&query=Youlun Peng), [Hongyuan Ye](https://arxiv.org/search/?searchtype=author&query=Hongyuan Ye), [Baha Zarrouki](https://arxiv.org/search/?searchtype=author&query=Baha Zarrouki), [Wei Li](https://arxiv.org/search/?searchtype=author&query=Wei Li), [Mattia Piccinini](https://arxiv.org/search/?searchtype=author&query=Mattia Piccinini), [Lei Xie](https://arxiv.org/search/?searchtype=author&query=Lei Xie), [Johannes Betz](https://arxiv.org/search/?searchtype=author&query=Johannes Betz) 作者: Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, Johannes Betz
A major challenge in deploying world models is the trade-off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. We propose the Physics-Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird’s-eye-view (BEV) representations. PIWM uses Soft Mask during training to improve dynamic object modeling and future prediction. We also introduce a simple yet effective technique, Warm Start, for inference to enhance prediction quality with a zero-shot model. Experiments show that at the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M Soft Mask) achieves a 7.4% higher weighted overall score with a 28% faster inference speed. 部署世界模型的一个主要挑战是大小和性能之间的权衡。大型世界模型可以捕获丰富的物理动态,但需要大量的计算资源,这使得它们对于边缘设备来说不切实际。小世界模型更容易部署,但通常难以学习准确的物理,从而导致预测不佳。我们提出了物理知情 BEV 世界模型 (PIWM),这是一个紧凑的模型,旨在有效地捕捉鸟瞰 (BEV) 表示中的物理交互。PIWM 在训练期间使用软掩码来改进动态对象建模和未来预测。我们还引入了一种简单而有效的技术,即热启动,用于推理,以通过零样本模型提高预测质量。实验表明,在相同参数量表(400M)下,PIWM 的加权总分超过基线 60.6%。此外,即使与最大的基线模型(400M)相比,最小的 PIWM(130M 软掩码)的加权总分也提高了 7.4%,推理速度提高了 28%。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-15 20:43:22 UTC 发布时间: 2025-09-15 20:43:22 UTC
#36 Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization #36 通过熵增强的多轮首选项优化构建编码代理
Authors: [Jiahao Yu](https://arxiv.org/search/?searchtype=author&query=Jiahao Yu), [Zelei Cheng](https://arxiv.org/search/?searchtype=author&query=Zelei Cheng), [Xian Wu](https://arxiv.org/search/?searchtype=author&query=Xian Wu), [Xinyu Xing](https://arxiv.org/search/?searchtype=author&query=Xinyu Xing) 作者:余佳豪,程泽磊,吴先,邢欣宇
Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce \sys, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. \sys augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate \sys by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWE-bench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B parameter model trained with \sys ranks 1st on SWE-bench Lite and 4th on SWE-bench Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., >350B).
软件工程给大型语言模型 (LLM) 带来了复杂的多步骤挑战,需要对大型代码库进行推理和协调工具使用。SWE-bench 等基准测试就证明了这些任务的难度,当前的 LLM 仍难以解决现实世界的问题。一种有前途的提高性能的方法是测试时缩放 (TTS),但其收益在很大程度上取决于模型输出的多样性。虽然直接偏好优化 (DPO) 和 Kahneman-Tversky 优化 (KTO) 等标准对齐方法可以有效地使模型输出与人类偏好保持一致,但此过程可能会以多样性降低为代价,从而限制 TTS 的有效性。此外,现有的偏好优化算法通常是为单轮任务而设计的,并没有完全解决交互式编码代理所需的多轮推理和工具集成的复杂性。为了弥合这一差距,我们引入了 \sys,这是一个熵增强框架,它使现有的偏好优化算法适应多轮、工具辅助设置。\sys 增强了偏好目标以显式保留策略熵,并推广学习以优化多轮交互而不是单轮响应。我们通过微调来自不同系列和大小的各种模型套件(最多 106B 参数)来验证 \sys。为了最大限度地提高 TTS 的性能提升,我们进一步提出了一种混合最佳轨迹选择方案,将学习的验证者模型与无模型方法相结合。在 SWE-bench 排行榜上,我们的方法在开放权重模型中建立了新的最先进的结果。使用 \sys 训练的 30B 参数模型在开放权重排行榜上在 SWE-bench Lite 上排名第一,在 SWE-bench Verified 上排名第四,仅次于参数量超过其 10 倍的模型(例如 >350B)。
Subject: Artificial Intelligence 主题:人工智能
Publish: 2025-09-15 20:36:19 UTC 发布时间: 2025-09-15 20:36:19 UTC
#37 Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition #37 小模型,大结果:通过分解实现卓越的意图提取
Authors: [Danielle Cohen](https://arxiv.org/search/?searchtype=author&query=Danielle Cohen), [Yoni Halpern](https://arxiv.org/search/?searchtype=author&query=Yoni Halpern), [Noam Kahlon](https://arxiv.org/search/?searchtype=author&query=Noam Kahlon), [Joel Oren](https://arxiv.org/search/?searchtype=author&query=Joel Oren), [Omri Berkovitch](https://arxiv.org/search/?searchtype=author&query=Omri Berkovitch), [Sapir Caduri](https://arxiv.org/search/?searchtype=author&query=Sapir Caduri), [Ido Dagan](https://arxiv.org/search/?searchtype=author&query=Ido Dagan), [Anatoly Efros](https://arxiv.org/search/?searchtype=author&query=Anatoly Efros) 作者:Danielle Cohen、Yoni Halpern、Noam Kahlon、Joel Oren、Omri Berkovitch、Sapir Caduri、Ido Dagan、Anatoly Efros
Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs. 从 UI 交互轨迹中理解用户意图仍然是智能代理开发中一个具有挑战性但至关重要的前沿领域。虽然基于数据中心的大型多模态大型语言模型 (MLLM) 具有更大的能力来处理此类序列的复杂性,但可以在设备上运行以提供隐私保护、低成本和低延迟用户体验的较小模型在准确的意图推理方面遇到了困难。我们通过引入一种新颖的分解方法来解决这些限制:首先,我们执行结构化交互摘要,从每个用户操作中捕获关键信息。其次,我们使用在聚合摘要上运行的微调模型进行意图提取。这种方法提高了资源受限模型中的意图理解,甚至超过了大型 MLLM 的基本性能。
Subjects: Artificial Intelligence, Computation and Language 科目:人工智能、计算和语言
Publish: 2025-09-15 20:20:30 UTC 发布时间: 2025-09-15 20:20:30 UTC
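The two-stage decomposition described in the abstract of #37 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `summarize` and `infer_intent` are hypothetical callables standing in for the per-action summarizer and the fine-tuned intent model, and the toy functions below only exist so the example runs.

```python
from typing import Callable, Sequence


def extract_intent(
    actions: Sequence[str],
    summarize: Callable[[str], str],
    infer_intent: Callable[[str], str],
) -> str:
    """Stage 1: summarize each action. Stage 2: infer intent from the joined summaries."""
    summaries = [summarize(action) for action in actions]
    # Aggregate the per-action summaries into one numbered block of text.
    aggregated = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(summaries))
    return infer_intent(aggregated)


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_summarize(action: str) -> str:
    return action.strip().lower()


def toy_infer(aggregated: str) -> str:
    return "book a flight" if "flight" in aggregated else "unknown"


trajectory = ["Tap 'Flights' tab ", "Type 'SFO to JFK'", "Tap 'Search'"]
intent = extract_intent(trajectory, toy_summarize, toy_infer)
```

Passing the two stages in as callables keeps the sketch independent of any particular model; in practice each would wrap a model call.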
#38 AIssistant: An Agentic Approach for Human–AI Collaborative Scientific Work on Reviews and Perspectives in Machine Learning
Authors: [Sasi Kiran Gaddipati](https://arxiv.org/search/?searchtype=author&query=Sasi Kiran Gaddipati), [Farhana Keya](https://arxiv.org/search/?searchtype=author&query=Farhana Keya), [Gollam Rabby](https://arxiv.org/search/?searchtype=author&query=Gollam Rabby), [Sören Auer](https://arxiv.org/search/?searchtype=author&query=Sören Auer)
Advances in AI-assisted research have introduced powerful tools for literature retrieval, hypothesis generation, experimentation, and manuscript preparation. However, systems remain fragmented and lack human-centred workflows. To address these gaps, we introduce AIssistant, an agentic, open-source Human-AI collaborative framework designed to simplify the end-to-end creation of scientific workflows. Since our development is still in an early stage, we present here the first experiments with AIssistant for perspective and review research papers in machine learning. Our system integrates modular tools and agents for literature synthesis, section-wise experimentation, citation management, and automatic LaTeX paper text generation, while maintaining human oversight at every stage to ensure accuracy, coherence, and scholarly rigour. We conducted a comprehensive evaluation across three layers: (1) Independent Human Review, following NeurIPS double-blind standards; (2) Automated LLM Review, using GPT-5 as a scalable human review proxy; and (3) Program Chair Oversight, where the chair monitors the entire review process and makes final validation and acceptance decisions. The results demonstrate that AIssistant improves drafting efficiency and thematic consistency. Nonetheless, Human-AI collaboration remains essential for maintaining factual correctness, methodological soundness, and ethical compliance. Despite its effectiveness, we identify key limitations, including hallucinated citations, difficulty adapting to dynamic paper structures, and incomplete integration of multimodal content. 
Subjects: Artificial Intelligence, Machine Learning
Publish: 2025-09-14 15:50:31 UTC
#39 Developing an aeroponic smart experimental greenhouse for controlling irrigation and plant disease detection using deep learning and IoT
Authors: [Mohammadreza Narimani](https://arxiv.org/search/?searchtype=author&query=Mohammadreza Narimani), [Ali Hajiahmad](https://arxiv.org/search/?searchtype=author&query=Ali Hajiahmad), [Ali Moghimi](https://arxiv.org/search/?searchtype=author&query=Ali Moghimi), [Reza Alimardani](https://arxiv.org/search/?searchtype=author&query=Reza Alimardani), [Shahin Rafiee](https://arxiv.org/search/?searchtype=author&query=Shahin Rafiee), [Amir Hossein Mirzabe](https://arxiv.org/search/?searchtype=author&query=Amir Hossein Mirzabe)
Controlling environmental conditions and monitoring plant status in greenhouses is critical to promptly making appropriate management decisions aimed at promoting crop production. The primary objective of this research study was to develop and test a smart aeroponic greenhouse on an experimental scale where the status of Geranium plant and environmental conditions are continuously monitored through the integration of the internet of things (IoT) and artificial intelligence (AI). An IoT-based platform was developed to control the environmental conditions of plants more efficiently and provide insights to users to make informed management decisions. In addition, we developed an AI-based disease detection framework using VGG-19, InceptionResNetV2, and InceptionV3 algorithms to analyze the images captured periodically after an intentional inoculation. The performance of the AI framework was compared with an expert’s evaluation of disease status. Preliminary results showed that the IoT system implemented in the greenhouse environment is able to publish data such as temperature, humidity, water flow, and volume of charge tanks online continuously to users and adjust the controlled parameters to provide an optimal growth environment for the plants. Furthermore, the results of the AI framework demonstrate that the VGG-19 algorithm was able to identify drought stress and rust leaves from healthy leaves with the highest accuracy, 92% among the other algorithms.
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
Publish: 2025-09-14 03:48:22 UTC
#40 LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences
Authors: [Liangqi Yuan](https://arxiv.org/search/?searchtype=author&query=Liangqi Yuan), [Dong-Jun Han](https://arxiv.org/search/?searchtype=author&query=Dong-Jun Han), [Christopher G. Brinton](https://arxiv.org/search/?searchtype=author&query=Christopher G. Brinton), [Sabine Brunswicker](https://arxiv.org/search/?searchtype=author&query=Sabine Brunswicker)
The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, and extract user preferences and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.
Subjects: Artificial Intelligence, Computation and Language, Machine Learning
Publish: 2025-09-14 02:30:19 UTC
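The objective described in the abstract of #40 (maximize POI quality and task completion, minimize distance, subject to three constraints) can be illustrated with a simple weighted-sum scorer over candidate routes. This is a sketch under assumptions, not the MSGS algorithm: the `Route` fields, the fixed weights, and the linear scoring form are all hypothetical, and the paper tunes the weights adaptively rather than fixing them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Route:
    poi_quality: float            # assumed normalized to [0, 1]
    completion_rate: float        # fraction of requested tasks completed, in [0, 1]
    distance_km: float
    duration_min: float
    respects_opening_hours: bool
    respects_dependencies: bool


def is_feasible(route: Route, time_limit_min: float) -> bool:
    """Hard constraints: user time limit, POI opening hours, task dependencies."""
    return (
        route.duration_min <= time_limit_min
        and route.respects_opening_hours
        and route.respects_dependencies
    )


def score(route: Route, weights: Tuple[float, float, float], max_distance_km: float) -> float:
    """Weighted sum: reward quality and completion, penalize normalized distance."""
    w_quality, w_completion, w_distance = weights
    return (
        w_quality * route.poi_quality
        + w_completion * route.completion_rate
        - w_distance * (route.distance_km / max_distance_km)
    )


def best_route(
    routes: Sequence[Route],
    weights: Tuple[float, float, float],
    time_limit_min: float,
) -> Optional[Route]:
    feasible = [r for r in routes if is_feasible(r, time_limit_min)]
    if not feasible:
        return None
    # Normalize by the longest feasible route; fall back to 1.0 if all are zero-length.
    max_distance = max(r.distance_km for r in feasible) or 1.0
    return max(feasible, key=lambda r: score(r, weights, max_distance))


candidates = [
    Route(0.9, 1.0, 12.0, 200.0, True, True),  # exceeds the 180-minute limit
    Route(0.8, 1.0, 8.0, 150.0, True, True),
    Route(0.6, 0.5, 3.0, 60.0, True, True),
]
chosen = best_route(candidates, weights=(0.4, 0.4, 0.2), time_limit_min=180.0)
```

Treating the three constraints as hard filters and the three objectives as a soft score is one common way to structure such a problem; other formulations (for example, Pareto fronts) are equally plausible readings of the abstract.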
#41 InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning
Authors: [Gautam Sreekumar](https://arxiv.org/search/?searchtype=author&query=Gautam Sreekumar), [Vishnu Naresh Boddeti](https://arxiv.org/search/?searchtype=author&query=Vishnu Naresh Boddeti)
Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning when the inference scenario violates these physical laws. In contrast, humans possess the skill to adapt their physical reasoning to unseen physical environments from a few visual examples. This ability, which we refer to as inductive physical reasoning, is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, and not inductive physical reasoning. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcome of collision events in algorithmically generated synthetic collision videos. By inspecting 13 LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when demonstration samples violate universal physical laws, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs. 
Subjects: Artificial Intelligence, Machine Learning
Publish: 2025-09-12 20:07:12 UTC
#42 DISPLIB: a library of train dispatching problems
Authors: [Oddvar Kloster](https://arxiv.org/search/?searchtype=author&query=Oddvar Kloster), [Bjørnar Luteberget](https://arxiv.org/search/?searchtype=author&query=Bjørnar Luteberget), [Carlo Mannino](https://arxiv.org/search/?searchtype=author&query=Carlo Mannino), [Giorgio Sartor](https://arxiv.org/search/?searchtype=author&query=Giorgio Sartor)
Optimization-based decision support systems have a significant potential to reduce delays, and thus improve efficiency on the railways, by automatically re-routing and re-scheduling trains after delays have occurred. The operations research community has dedicated a lot of effort to developing optimization algorithms for this problem, but each study is typically tightly connected with a specific industrial use case. Code and data are seldom shared publicly. This fact hinders reproducibility, and has led to a proliferation of papers describing algorithms for more or less compatible problem definitions, without any real opportunity for readers to assess their relative performance. Inspired by the successful communities around MILP, SAT, TSP, VRP, etc., we introduce a common problem definition and file format, DISPLIB, which captures all the main features of train re-routing and re-scheduling. We have gathered problem instances from multiple real-world use cases and made them openly available. In this paper, we describe the problem definition, the industrial instances, and a reference solver implementation. This allows any researcher or developer to work on the train dispatching problem without an industrial connection, and enables the research community to perform empirical comparisons between solvers. All materials are available online at https://displib.github.io.
Subject: Artificial Intelligence
Publish: 2025-09-12 13:35:28 UTC
#43 V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams
Authors: [Duong Q. Nguyen](https://arxiv.org/search/?searchtype=author&query=Duong Q. Nguyen), [Quy P. Nguyen](https://arxiv.org/search/?searchtype=author&query=Quy P. Nguyen), [Nguyen Van Nhon](https://arxiv.org/search/?searchtype=author&query=Nguyen Van Nhon), [Quang-Thinh Bui](https://arxiv.org/search/?searchtype=author&query=Quang-Thinh Bui), [H. Nguyen-Xuan](https://arxiv.org/search/?searchtype=author&query=H. Nguyen-Xuan)
This paper develops an autonomous agentic framework called V-Math that aims to assist Vietnamese high school students in preparing for the National High School Graduation Mathematics Exams (NHSGMEs). The salient framework integrates three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer for detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond enabling self-paced student practice, V-Math supports teachers by generating innovative, compliant exam questions and building diverse, high-quality question banks. This reduces manual workload and enriches instructional resources. We describe the system architecture, focusing on practice modes for learners and teacher-oriented features for question generation. Preliminary evaluations demonstrate that V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances the variety of practice materials. These results highlight its potential to support scalable, equitable mathematics preparation aligned with national standards while also empowering teachers through AI-assisted exam creation.
Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition, Computers and Society
Publish: 2025-09-12 07:22:46 UTC
#44 Contrastive timbre representations for musical instrument and synthesizer retrieval
Authors: [Gwendal Le Vaillant](https://arxiv.org/search/?searchtype=author&query=Gwendal Le Vaillant), [Yannick Molle](https://arxiv.org/search/?searchtype=author&query=Yannick Molle)
Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
Subjects: Sound, Artificial Intelligence
Publish: 2025-09-16 17:38:35 UTC
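For readers unfamiliar with the top-1 and top-5 figures quoted in #44, the sketch below shows how top-k retrieval accuracy is commonly computed from embeddings. It assumes cosine similarity and one target instrument per query; the instrument names and 2-D vectors are made up for illustration, and the paper's exact protocol for multi-instrument mixtures may differ.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_accuracy(
    queries: List[Sequence[float]],
    targets: List[str],
    database: Dict[str, Sequence[float]],
    k: int,
) -> float:
    """Fraction of queries whose target is among the k most similar database entries."""
    hits = 0
    for query, target in zip(queries, targets):
        ranked = sorted(
            database,
            key=lambda name: cosine(query, database[name]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings.
database = {"piano": [1.0, 0.0], "violin": [0.0, 1.0], "flute": [0.7, 0.7]}
queries = [[0.9, 0.1], [0.6, 0.8]]
targets = ["piano", "violin"]

top1 = top_k_accuracy(queries, targets, database, k=1)
top2 = top_k_accuracy(queries, targets, database, k=2)
```

In this toy data the second query is closest to "flute", so its target "violin" only counts as a hit once k is at least 2.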
#45 HARMONIC: A Content-Centric Cognitive Robotic Architecture
Authors: [Sanjay Oruganti](https://arxiv.org/search/?searchtype=author&query=Sanjay Oruganti), [Sergei Nirenburg](https://arxiv.org/search/?searchtype=author&query=Sergei Nirenburg), [Marjorie McShane](https://arxiv.org/search/?searchtype=author&query=Marjorie McShane), [Jesse English](https://arxiv.org/search/?searchtype=author&query=Jesse English), [Michael K. Roberts](https://arxiv.org/search/?searchtype=author&query=Michael K. Roberts), [Christian Arndt](https://arxiv.org/search/?searchtype=author&query=Christian Arndt), [Carlos Gonzalez](https://arxiv.org/search/?searchtype=author&query=Carlos Gonzalez), [Mingyo Seo](https://arxiv.org/search/?searchtype=author&query=Mingyo Seo), [Luis Sentis](https://arxiv.org/search/?searchtype=author&query=Luis Sentis)
This paper introduces HARMONIC, a cognitive-robotic architecture designed for robots in human-robotic teams. HARMONIC supports semantic perception interpretation, human-like decision-making, and intentional language communication. It addresses the issues of safety and quality of results; aims to solve problems of data scarcity, explainability, and safety; and promotes transparency and trust. Two proof-of-concept HARMONIC-based robotic systems are demonstrated, each implemented in both a high-fidelity simulation environment and on physical robotic platforms.
Subjects: Robotics, Artificial Intelligence, Computation and Language
Publish: 2025-09-16 17:34:18 UTC
#46 RadGame: An AI-Powered Platform for Radiology Education
Authors: [Mohammed Baharoon](https://arxiv.org/search/?searchtype=author&query=Mohammed Baharoon), [Siavash Raissi](https://arxiv.org/search/?searchtype=author&query=Siavash Raissi), [John S. Jun](https://arxiv.org/search/?searchtype=author&query=John S. Jun), [Thibault Heintz](https://arxiv.org/search/?searchtype=author&query=Thibault Heintz), [Mahmoud Alabbad](https://arxiv.org/search/?searchtype=author&query=Mahmoud Alabbad), [Ali Alburkani](https://arxiv.org/search/?searchtype=author&query=Ali Alburkani), [Sung Eun Kim](https://arxiv.org/search/?searchtype=author&query=Sung Eun Kim), [Kent Kleinschmidt](https://arxiv.org/search/?searchtype=author&query=Kent Kleinschmidt), [Abdulrahman O. Alhumaydhi](https://arxiv.org/search/?searchtype=author&query=Abdulrahman O. Alhumaydhi), [Mohannad Mohammed G. Alghamdi](https://arxiv.org/search/?searchtype=author&query=Mohannad Mohammed G. Alghamdi), [Jeremy Francis Palacio](https://arxiv.org/search/?searchtype=author&query=Jeremy Francis Palacio), [Mohammed Bukhaytan](https://arxiv.org/search/?searchtype=author&query=Mohammed Bukhaytan), [Noah Michael Prudlo](https://arxiv.org/search/?searchtype=author&query=Noah Michael Prudlo), [Rithvik Akula](https://arxiv.org/search/?searchtype=author&query=Rithvik Akula), [Brady Chrisler](https://arxiv.org/search/?searchtype=author&query=Brady Chrisler), [Benjamin Galligos](https://arxiv.org/search/?searchtype=author&query=Benjamin Galligos), [Mohammed O. Almutairi](https://arxiv.org/search/?searchtype=author&query=Mohammed O. Almutairi), [Mazeen Mohammed Alanazi](https://arxiv.org/search/?searchtype=author&query=Mazeen Mohammed Alanazi), [Nasser M. Alrashdi](https://arxiv.org/search/?searchtype=author&query=Nasser M. Alrashdi), [Joel Jihwan Hwang](https://arxiv.org/search/?searchtype=author&query=Joel Jihwan Hwang), [Sri Sai Dinesh Jaliparthi](https://arxiv.org/search/?searchtype=author&query=Sri Sai Dinesh Jaliparthi), [Luke David Nelson](https://arxiv.org/search/?searchtype=author&query=Luke David Nelson), [Nathaniel Nguyen](https://arxiv.org/search/?searchtype=author&query=Nathaniel Nguyen), [Sathvik Suryadevara](https://arxiv.org/search/?searchtype=author&query=Sathvik Suryadevara), [Steven Kim](https://arxiv.org/search/?searchtype=author&query=Steven Kim), [Mohammed F. Mohammed](https://arxiv.org/search/?searchtype=author&query=Mohammed F. Mohammed), [Yevgeniy R. Semenov](https://arxiv.org/search/?searchtype=author&query=Yevgeniy R. Semenov), [Kun-Hsing Yu](https://arxiv.org/search/?searchtype=author&query=Kun-Hsing Yu), [Abdulrhman Aljouie](https://arxiv.org/search/?searchtype=author&query=Abdulrhman Aljouie), [Hassan AlOmaish](https://arxiv.org/search/?searchtype=author&query=Hassan AlOmaish), [Adam Rodman](https://arxiv.org/search/?searchtype=author&query=Adam Rodman), [Pranav Rajpurkar](https://arxiv.org/search/?searchtype=author&query=Pranav Rajpurkar)
We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist’s written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education. 
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 17:27:33 UTC
#47 JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks
Authors: [Jiahao Zhang](https://arxiv.org/search/?searchtype=author&query=Jiahao Zhang), [Xiaobing Pei](https://arxiv.org/search/?searchtype=author&query=Xiaobing Pei), [Zhaokun Zhong](https://arxiv.org/search/?searchtype=author&query=Zhaokun Zhong), [Wenqiang Hao](https://arxiv.org/search/?searchtype=author&query=Wenqiang Hao), [Zhenghao Tang](https://arxiv.org/search/?searchtype=author&query=Zhenghao Tang)
Graph Neural Networks (GNNs) have demonstrated remarkable performance across various applications, yet they are vulnerable to sophisticated adversarial attacks, particularly node injection attacks. The success of such attacks heavily relies on their stealthiness, the ability to blend in with the original graph and evade detection. However, existing methods often achieve stealthiness by relying on indirect proxy metrics, lacking consideration for the fundamental characteristics of the injected content, or focusing only on imitating local structures, which leads to the problem of local myopia. To overcome these limitations, we propose a dual-constraint stealthy node injection framework, called Joint Alignment of Nodal and Universal Structures (JANUS). At the local level, we introduce a local feature manifold alignment strategy to achieve geometric consistency in the feature space. At the global level, we incorporate structured latent variables and maximize the mutual information with the generated structures, ensuring the injected structures are consistent with the semantic patterns of the original graph. We model the injection attack as a sequential decision process, which is optimized by a reinforcement learning agent. Experiments on multiple standard datasets demonstrate that the JANUS framework significantly outperforms existing methods in terms of both attack effectiveness and stealthiness.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 17:24:30 UTC
#48 ResidualViT for Efficient Temporally Dense Video Encoding
Authors: [Mattia Soldan](https://arxiv.org/search/?searchtype=author&query=Mattia Soldan), [Fabian Caba Heilbron](https://arxiv.org/search/?searchtype=author&query=Fabian Caba Heilbron), [Bernard Ghanem](https://arxiv.org/search/?searchtype=author&query=Bernard Ghanem), [Josef Sivic](https://arxiv.org/search/?searchtype=author&query=Josef Sivic), [Bryan Russell](https://arxiv.org/search/?searchtype=author&query=Bryan Russell)
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require “temporally dense” reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval, Image and Video Processing
Publish: 2025-09-16 17:12:23 UTC
#49 Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
Authors: [Aniket Didolkar](https://arxiv.org/search/?searchtype=author&query=Aniket Didolkar), [Nicolas Ballas](https://arxiv.org/search/?searchtype=author&query=Nicolas Ballas), [Sanjeev Arora](https://arxiv.org/search/?searchtype=author&query=Sanjeev Arora), [Anirudh Goyal](https://arxiv.org/search/?searchtype=author&query=Anirudh Goyal)
Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency. This saturation of the context window leaves less capacity for exploration. We study a simple mechanism that converts recurring reasoning fragments into concise, reusable “behaviors” (name + instruction) via the model’s own metacognitive analysis of prior traces. These behaviors are stored in a “behavior handbook” which supplies them to the model in-context at inference or distills them into parameters via supervised fine-tuning. This approach achieves improved test-time reasoning across three different settings - 1) Behavior-conditioned inference: Providing the LLM relevant behaviors in-context during reasoning reduces number of reasoning tokens by up to 46% while matching or improving baseline accuracy; 2) Behavior-guided self-improvement: Without any parameter updates, the model improves its own future reasoning by leveraging behaviors from its own past problem solving attempts. This yields up to 10% higher accuracy than a naive critique-and-revise baseline; and 3) Behavior-conditioned SFT: SFT on behavior-conditioned reasoning traces is more effective at converting non-reasoning models into reasoning models as compared to vanilla SFT. Together, these results indicate that turning slow derivations into fast procedural hints enables LLMs to remember how to reason, not just what to conclude. 
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 16:44:26 UTC
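A minimal sketch of the behavior-conditioned inference idea in #49 above, not the authors' code: behaviors are stored as (name, instruction) pairs in a handbook, the most relevant one is retrieved for a question, and it is prepended to the prompt. The keyword-overlap retriever, the prompt layout, and the two example behaviors are all assumptions made for illustration; the abstract does not say how behaviors are retrieved or formatted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str          # short handle for the reusable step
    instruction: str   # one-line procedural hint


def retrieve(handbook, question, k=1):
    """Rank behaviors by naive keyword overlap with the question (stand-in retriever)."""
    q_words = set(question.lower().split())

    def overlap(b):
        return len(q_words & set(b.instruction.lower().split()))

    ranked = sorted(handbook, key=overlap, reverse=True)
    return [b for b in ranked[:k] if overlap(b) > 0]


def build_prompt(question, behaviors):
    """Prepend the selected behaviors so the model can reuse them instead of re-deriving steps."""
    lines = ["Useful behaviors:"]
    lines += [f"- {b.name}: {b.instruction}" for b in behaviors]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical handbook entries, for illustration only.
handbook = [
    Behavior("behavior_inclusion_exclusion", "avoid double counting when counting a union of sets"),
    Behavior("behavior_angle_sum", "the interior angles of a triangle sum to 180 degrees"),
]
question = "Two angles of a triangle are 50 and 60 degrees; find the third"
selected = retrieve(handbook, question)
prompt = build_prompt(question, selected)
```

In a real system the retriever would more plausibly use embeddings or topic tags; the point here is only the data flow from handbook to in-context hint.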
#50 Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation
Authors: [Fitsum Sileshi Beyene](https://arxiv.org/search/?searchtype=author&query=Fitsum Sileshi Beyene), [Christopher L. Dancy](https://arxiv.org/search/?searchtype=author&query=Christopher L. Dancy)
Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We use three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
Subjects: Digital Libraries, Artificial Intelligence
Publish: 2025-09-16 16:43:34 UTC
#51 Single-stream Policy Optimization
Authors: [Zhongwen Xu](https://arxiv.org/search/?searchtype=author&query=Zhongwen Xu), [Zihan Ding](https://arxiv.org/search/?searchtype=author&query=Zihan Ding)
We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO’s gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3 8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@k across the evaluated k values. SPO’s success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning. 
Subjects: Machine Learning, Artificial Intelligence, Machine Learning
Publish: 2025-09-16 16:39:11 UTC
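To make the contrast in #51 above concrete, here is a rough sketch under stated assumptions, not the SPO implementation. `grpo_advantages` shows why a group whose samples all receive the same reward yields no learning signal. `ValueTracker` and `single_stream_advantages` show the single-stream alternative: one sample per prompt, a persistent per-prompt baseline, and normalization over the whole batch. The fixed-rate moving average stands in for the paper's KL-adaptive update, which the abstract does not specify, and the mean/std normalization is likewise an assumption.

```python
import statistics


def grpo_advantages(group_rewards):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:  # degenerate group: every sample got the same reward
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]


class ValueTracker:
    """Persistent per-prompt baseline kept as an exponential moving average.

    Assumption: a fixed step size `rate` replaces the paper's KL-adaptive update.
    """

    def __init__(self, rate=0.1, init=0.5):
        self.rate = rate
        self.init = init
        self.values = {}

    def baseline(self, prompt_id):
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward):
        old = self.baseline(prompt_id)
        self.values[prompt_id] = old + self.rate * (reward - old)


def single_stream_advantages(batch, tracker, eps=1e-8):
    """One sample per prompt: reward minus tracked baseline, normalized over the batch."""
    raw = [reward - tracker.baseline(pid) for pid, reward in batch]
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw)
    for pid, reward in batch:  # update baselines only after advantages are computed
        tracker.update(pid, reward)
    return [(a - mean) / (std + eps) for a in raw]


degenerate = grpo_advantages([1.0, 1.0, 1.0, 1.0])

tracker = ValueTracker()
tracker.values["easy"] = 0.9  # hypothetical prompt the policy already solves most of the time
batch = [("easy", 1.0), ("hard", 1.0), ("medium", 0.0)]
advantages = single_stream_advantages(batch, tracker)
```

Here the success on the "hard" prompt earns a larger advantage than the same reward on the "easy" prompt, because the tracked baseline for "easy" is already high.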
#52 Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Authors: [Hugo Carlesso](https://arxiv.org/search/?searchtype=author&query=Hugo Carlesso), [Josiane Mothe](https://arxiv.org/search/?searchtype=author&query=Josiane Mothe), [Radu Tudor Ionescu](https://arxiv.org/search/?searchtype=author&query=Radu Tudor Ionescu)
Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL. 
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
Publish: 2025-09-16 16:37:59 UTC
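As a rough illustration of the spectral jigsaw pretext task mentioned in #52 above: one pixel's spectrum is cut into contiguous band chunks, the chunks are shuffled, and the model is trained to predict which permutation was applied. The chunking scheme, the use of the full permutation set, and the function name are assumptions for this sketch; the authors' repository is the reference for the actual design.

```python
from itertools import permutations


def spectral_jigsaw(spectrum, n_chunks, perm_index):
    """Shuffle contiguous band chunks of one pixel's spectrum.

    Returns the shuffled spectrum and the permutation index to predict.
    Assumes len(spectrum) is divisible by n_chunks.
    """
    size = len(spectrum) // n_chunks
    chunks = [spectrum[i * size:(i + 1) * size] for i in range(n_chunks)]
    order = list(permutations(range(n_chunks)))[perm_index]
    shuffled = [band for j in order for band in chunks[j]]
    return shuffled, perm_index


spectrum = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # toy 6-band spectrum
shuffled, label = spectral_jigsaw(spectrum, n_chunks=3, perm_index=3)
```

One plausible curriculum knob, again an assumption, is to start with a small `n_chunks` and increase it as training progresses so the puzzle gets harder.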
#53 Rich Vehicle Routing Problem with diverse Vertices allowing Hierarchical and Multimodal Time-Dependant Transhipment of multiple Node- Vehicle- compatible Cargo with Cascaded Time-Minimization Objective for Emergency Decision Support Systems
Authors: [Santanu Banerjee](https://arxiv.org/search/?searchtype=author&query=Santanu Banerjee), [Goutam Sen](https://arxiv.org/search/?searchtype=author&query=Goutam Sen), [Siddhartha Mukhopadhyay](https://arxiv.org/search/?searchtype=author&query=Siddhartha Mukhopadhyay)
A rich vehicle routing problem is considered, allowing multiple trips of heterogeneous vehicles stationed at distributed vehicle depots spread across diverse geographies and having access to different modes of transportation. The problem arises from the real-world requirement of optimizing disaster response/preparedness time, and it minimizes the route durations of the vehicles to achieve the solution with the minimum highest-vehicle-route-duration. Multiple diversely functional vertices are considered, including the concept of Transhipment Ports as inter-modal resource transfer stations. Both simultaneous and split pickup and transfer of different types of delivery and pickup cargo are considered, along with Vehicle-Cargo and Transhipment Port-Cargo Compatibility. The superiority of the proposed cascaded minimization approach over existing makespan minimization approaches is shown through the developed MILP formulation. To solve the problem quickly for practical implementation within Disaster Management-specific Decision Support Systems, an extensive Heuristic Algorithm is devised. The Heuristic utilizes Decision Tree based structuring of possible routes and is able to inherently consider the compatibility issues. Preferential generation of small route elements is performed, and these elements are integrated into route clusters; we consider multiple different logical integration approaches, as well as shuffling the logics to simultaneously produce multiple independent solutions. Finally, the different solutions are perturbed to find better neighbouring solutions. The computational performance of the PSR-GIP Heuristic on our newly created datasets indicates that it is able to give good solutions swiftly for practical problems involving large integer instances which the MILP is unable to solve.
Subjects: Optimization and Control, Artificial Intelligence, Systems and Control
Publish: 2025-09-16 16:37:18 UTC
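One way to read the "cascaded" objective in #53 above, offered as an interpretation rather than the paper's MILP: instead of comparing plans only by the longest route (makespan), compare their route durations sorted from longest to shortest, so that ties on the longest route are broken by the second longest, and so on. The two plans below are made-up numbers.

```python
def makespan(durations):
    """Classic objective: only the longest vehicle route matters."""
    return max(durations)


def cascade_key(durations):
    """Assumed cascaded objective: compare routes lexicographically, longest first."""
    return tuple(sorted(durations, reverse=True))


# Hypothetical route durations (hours) for three vehicles under two plans.
plan_a = [9.0, 8.5, 8.0]
plan_b = [9.0, 4.0, 3.5]

best = min([plan_a, plan_b], key=cascade_key)
```

Both plans have the same makespan, so a pure makespan objective cannot tell them apart, while the cascaded comparison prefers `plan_b` because its remaining vehicles finish much earlier.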
#54 B-TGAT: A Bi-directional Temporal Graph Attention Transformer for Clustering Multivariate Spatiotemporal Data
Authors: [Francis Ndikum Nji](https://arxiv.org/search/?searchtype=author&query=Francis Ndikum Nji), [Vandana Janaja](https://arxiv.org/search/?searchtype=author&query=Vandana Janaja), [Jianwu Wang](https://arxiv.org/search/?searchtype=author&query=Jianwu Wang)
Clustering high-dimensional multivariate spatiotemporal climate data is challenging due to complex temporal dependencies, evolving spatial interactions, and non-stationary dynamics. Conventional clustering methods, including recurrent and convolutional models, often struggle to capture both local and global temporal relationships while preserving spatial context. We present a time-distributed hybrid U-Net autoencoder that integrates a Bi-directional Temporal Graph Attention Transformer (B-TGAT) to guide efficient temporal clustering of multidimensional spatiotemporal climate datasets. The encoder and decoder are equipped with ConvLSTM2D modules that extract joint spatial–temporal features by modeling localized dynamics and spatial correlations over time, and skip connections that preserve multiscale spatial details during feature compression and reconstruction. At the bottleneck, B-TGAT integrates graph-based spatial modeling with attention-driven temporal encoding, enabling adaptive weighting of temporal neighbors and capturing both short and long-range dependencies across regions. This architecture produces discriminative latent embeddings optimized for clustering. Experiments on three distinct spatiotemporal climate datasets demonstrate superior cluster separability, temporal stability, and alignment with known climate transitions compared to state-of-the-art baselines. The integration of ConvLSTM2D, U-Net skip connections, and B-TGAT enhances temporal clustering performance while providing interpretable insights into complex spatiotemporal variability, advancing both methodological development and climate science applications. 
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 16:08:21 UTC
#55 Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
Authors: [Yunchuan Guan](https://arxiv.org/search/?searchtype=author&query=Yunchuan Guan), [Yu Liu](https://arxiv.org/search/?searchtype=author&query=Yu Liu), [Ke Zhou](https://arxiv.org/search/?searchtype=author&query=Ke Zhou), [Zhiqi Shen](https://arxiv.org/search/?searchtype=author&query=Zhiqi Shen), [Jenq-Neng Hwang](https://arxiv.org/search/?searchtype=author&query=Jenq-Neng Hwang), [Serge Belongie](https://arxiv.org/search/?searchtype=author&query=Serge Belongie), [Lei Li](https://arxiv.org/search/?searchtype=author&query=Lei Li)
Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we establish that meta-learning has a tighter generalization bound compared to whole-class training. We find that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, we propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 15:39:03 UTC
#56 On the Correlation between Individual Fairness and Predictive Accuracy in Probabilistic Models
Authors: [Alessandro Antonucci](https://arxiv.org/search/?searchtype=author&query=Alessandro Antonucci), [Eric Rossetto](https://arxiv.org/search/?searchtype=author&query=Eric Rossetto), [Ivan Duvnjak](https://arxiv.org/search/?searchtype=author&query=Ivan Duvnjak)
We investigate individual fairness in generative probabilistic classifiers by analysing the robustness of posterior inferences to perturbations in private features. Building on established results in robustness analysis, we hypothesise a correlation between robustness and predictive accuracy; specifically, instances exhibiting greater robustness are more likely to be classified accurately. We empirically assess this hypothesis using a benchmark of fourteen datasets with fairness concerns, employing Bayesian networks as the underlying generative models. To address the computational complexity associated with robustness analysis over multiple private features with Bayesian networks, we reformulate the problem as a most probable explanation task in an auxiliary Markov random field. Our experiments confirm the hypothesis about the correlation, suggesting novel directions to mitigate the traditional trade-off between fairness and accuracy.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 15:17:13 UTC
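The robustness notion in #56 above can be illustrated with a toy check: an instance counts as robust if its predicted class does not change under any setting of the private features. This sketch swaps in a two-feature naive Bayes model with made-up probabilities and brute-force enumeration; the paper uses Bayesian networks and reformulates the search as a most probable explanation task, which is not reproduced here.

```python
from itertools import product

# Hypothetical naive Bayes model: P(class) and P(feature = 1 | class).
PRIOR = {"approve": 0.5, "deny": 0.5}
LIKELIHOOD = {
    "income_high": {"approve": 0.9, "deny": 0.4},  # public feature
    "group_a": {"approve": 0.7, "deny": 0.3},      # private feature
}


def posterior(instance):
    """Posterior over classes for a dict of binary feature values."""
    scores = {}
    for c, p in PRIOR.items():
        for feat, value in instance.items():
            p1 = LIKELIHOOD[feat][c]
            p *= p1 if value else 1.0 - p1
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def predict(instance):
    post = posterior(instance)
    return max(post, key=post.get)


def is_robust(instance, private):
    """True if the predicted class is identical for every setting of the private features."""
    labels = set()
    for values in product([0, 1], repeat=len(private)):
        perturbed = dict(instance, **dict(zip(private, values)))
        labels.add(predict(perturbed))
    return len(labels) == 1


fragile = {"income_high": 1, "group_a": 0}
stable = {"income_high": 0, "group_a": 0}
```

With these numbers, flipping `group_a` changes the decision for the `fragile` instance but not for `stable`. Enumeration grows exponentially with the number of private features, which is the cost the paper's reformulation is meant to avoid.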
#57 FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Authors: [Liang Hu](https://arxiv.org/search/?searchtype=author&query=Liang Hu), [Jianpeng Jiao](https://arxiv.org/search/?searchtype=author&query=Jianpeng Jiao), [Jiashuo Liu](https://arxiv.org/search/?searchtype=author&query=Jiashuo Liu), [Yanle Ren](https://arxiv.org/search/?searchtype=author&query=Yanle Ren), [Zhoufutu Wen](https://arxiv.org/search/?searchtype=author&query=Zhoufutu Wen), [Kaiyuan Zhang](https://arxiv.org/search/?searchtype=author&query=Kaiyuan Zhang), [Xuanliang Zhang](https://arxiv.org/search/?searchtype=author&query=Xuanliang Zhang), [Xiang Gao](https://arxiv.org/search/?searchtype=author&query=Xiang Gao), [Tianci He](https://arxiv.org/search/?searchtype=author&query=Tianci He), [Fei Hu](https://arxiv.org/search/?searchtype=author&query=Fei Hu), [Yali Liao](https://arxiv.org/search/?searchtype=author&query=Yali Liao), [Zaiyuan Wang](https://arxiv.org/search/?searchtype=author&query=Zaiyuan Wang), [Chenghao Yang](https://arxiv.org/search/?searchtype=author&query=Chenghao Yang), [Qianyu Yang](https://arxiv.org/search/?searchtype=author&query=Qianyu Yang), [Mingren Yin](https://arxiv.org/search/?searchtype=author&query=Mingren Yin), [Zhiyuan Zeng](https://arxiv.org/search/?searchtype=author&query=Zhiyuan Zeng), [Ge Zhang](https://arxiv.org/search/?searchtype=author&query=Ge Zhang), [Xinyi Zhang](https://arxiv.org/search/?searchtype=author&query=Xinyi Zhang), [Xiying Zhao](https://arxiv.org/search/?searchtype=author&query=Xiying Zhao), [Zhenwei Zhu](https://arxiv.org/search/?searchtype=author&query=Zhenwei Zhu), [Hongseok Namkoong](https://arxiv.org/search/?searchtype=author&query=Hongseok Namkoong), [Wenhao Huang](https://arxiv.org/search/?searchtype=author&query=Wenhao Huang), [Yuwen Tang](https://arxiv.org/search/?searchtype=author&query=Yuwen Tang)
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 15:13:13 UTC
#58 An Uncertainty-Weighted Decision Transformer for Navigation in Dense, Complex Driving Scenarios
Authors: [Zhihao Zhang](https://arxiv.org/search/?searchtype=author&query=Zhihao Zhang), [Chengyang Peng](https://arxiv.org/search/?searchtype=author&query=Chengyang Peng), [Minghao Zhu](https://arxiv.org/search/?searchtype=author&query=Minghao Zhu), [Ekim Yurtsever](https://arxiv.org/search/?searchtype=author&query=Ekim Yurtsever), [Keith A. Redmill](https://arxiv.org/search/?searchtype=author&query=Keith A. Redmill)
Autonomous driving in dense, dynamic environments requires decision-making systems that can exploit both spatial structure and long-horizon temporal dependencies while remaining robust to uncertainty. This work presents a novel framework that integrates multi-channel bird’s-eye-view occupancy grids with transformer-based sequence modeling for tactical driving in complex roundabout scenarios. To address the imbalance between frequent low-risk states and rare safety-critical decisions, we propose the Uncertainty-Weighted Decision Transformer (UWDT). UWDT employs a frozen teacher transformer to estimate per-token predictive entropy, which is then used as a weight in the student model’s loss function. This mechanism amplifies learning from uncertain, high-impact states while maintaining stability across common low-risk transitions. Experiments in a roundabout simulator, across varying traffic densities, show that UWDT consistently outperforms other baselines in terms of reward, collision rate, and behavioral stability. The results demonstrate that uncertainty-aware, spatial-temporal transformers can deliver safer and more efficient decision-making for autonomous driving in complex traffic environments.
Subjects: Robotics, Artificial Intelligence
Publish: 2025-09-16 14:48:52 UTC
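The entropy-weighting mechanism in #58 above can be sketched in a few lines: the frozen teacher's predictive entropy at each token becomes the weight on the student's cross-entropy for that token. The `1 + alpha * H` weight and the weighted-average reduction are assumptions chosen for illustration; the abstract only says that the entropy is used as a weight in the student's loss.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weighted_loss(teacher_probs, student_probs, targets, alpha=1.0):
    """Weighted average of student cross-entropy, weight = 1 + alpha * teacher entropy.

    Assumption: this weighting form is illustrative, not taken from the paper.
    """
    total, weight_sum = 0.0, 0.0
    for t_dist, s_dist, target in zip(teacher_probs, student_probs, targets):
        weight = 1.0 + alpha * entropy(t_dist)
        total += weight * -math.log(s_dist[target])
        weight_sum += weight
    return total / weight_sum


# Token 0: teacher is certain. Token 1: teacher is maximally uncertain.
teacher = [[1.0, 0.0], [0.5, 0.5]]
student = [[0.9, 0.1], [0.6, 0.4]]
targets = [0, 0]

plain = uncertainty_weighted_loss(teacher, student, targets, alpha=0.0)
weighted = uncertainty_weighted_loss(teacher, student, targets, alpha=1.0)
```

Because the student's error is larger on the token where the teacher is uncertain, the weighted loss comes out above the plain average, which is the intended emphasis on uncertain states.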
#59 Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection - The 2024 Global Deepfake Image Detection Challenge
Authors: [Kohou Wang](https://arxiv.org/search/?searchtype=author&query=Kohou Wang), [Huan Hu](https://arxiv.org/search/?searchtype=author&query=Huan Hu), [Xiang Liu](https://arxiv.org/search/?searchtype=author&query=Xiang Liu), [Zezhou Chen](https://arxiv.org/search/?searchtype=author&query=Zezhou Chen), [Ping Chen](https://arxiv.org/search/?searchtype=author&query=Ping Chen), [Zhaoxiang Liu](https://arxiv.org/search/?searchtype=author&query=Zhaoxiang Liu), [Shiguo Lian](https://arxiv.org/search/?searchtype=author&query=Shiguo Lian)
The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models, Swin-MLP, CoAtNet, EfficientNetV2, and DaViT, which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition’s private leaderboard, securing the 20th position out of 184 teams, demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 14:06:54 UTC
#60 Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
Authors: [Francesco Pappone](https://arxiv.org/search/?searchtype=author&query=Francesco Pappone), [Ruggero Marino Lazzaroni](https://arxiv.org/search/?searchtype=author&query=Ruggero Marino Lazzaroni), [Federico Califano](https://arxiv.org/search/?searchtype=author&query=Federico Califano), [Niccolò Gentile](https://arxiv.org/search/?searchtype=author&query=Niccolò Gentile), [Roberto Marras](https://arxiv.org/search/?searchtype=author&query=Roberto Marras)
While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-16 13:39:29 UTC
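The reward in #60 above is a cosine similarity between the embedding of a generated explanation and that of a reference. The sketch below keeps that structure but swaps the encoder-only transformer for a bag-of-words count vector, purely so the example is self-contained; the `embed` function and the sample sentences are assumptions, and a real setup would call a sentence encoder instead.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for the encoder: a bag-of-words count vector (NOT a transformer)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def semantic_reward(generated, reference):
    """Dense reward in [0, 1] for count vectors: similarity to the reference explanation."""
    return cosine(embed(generated), embed(reference))


reference = "beta blockers lower heart rate by blocking adrenaline receptors"
good = "beta blockers lower heart rate because they block adrenaline receptors"
bad = "the answer is option c"

r_good = semantic_reward(good, reference)
r_bad = semantic_reward(bad, reference)
```

In GRPO these per-sample rewards would then be turned into group-relative advantages; a learned encoder would also credit paraphrases that share no surface words with the reference, which this stand-in cannot do.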
#61 A Design Co-Pilot for Task-Tailored Manipulators
Authors: [Jonathan Külz](https://arxiv.org/search/?searchtype=author&query=Jonathan Külz), [Sehoon Ha](https://arxiv.org/search/?searchtype=author&query=Sehoon Ha), [Matthias Althoff](https://arxiv.org/search/?searchtype=author&query=Matthias Althoff)
Although robotic manipulators are used in an ever-growing range of applications, robot manufacturers typically follow a “one-fits-all” philosophy, employing identical manipulators in various settings. This often leads to suboptimal performance, as general-purpose designs fail to exploit particularities of tasks. The development of custom, task-tailored robots is hindered by long, cost-intensive development cycles and the high cost of customized hardware. Recently, various computational design methods have been devised to overcome the bottleneck of human engineering. In addition, a surge of modular robots allows quick and economical adaptation to changing industrial settings. This work proposes an approach to automatically designing and optimizing robot morphologies tailored to a specific environment. To this end, we learn the inverse kinematics for a wide range of different manipulators. A fully differentiable framework realizes gradient-based fine-tuning of designed robots and inverse kinematics solutions. Our generative approach cuts the time needed to generate specialized designs from hours, as required by optimization-based methods, to seconds, serving as a design co-pilot that enables instant adaptation and effective human-AI collaboration. Numerical experiments show that our approach finds robots that can navigate cluttered environments and manipulators that perform well across a specified workspace, and that it can be adapted to different hardware constraints. Finally, we demonstrate the real-world applicability of our method by setting up a modular robot designed in simulation that successfully moves through an obstacle course.
尽管机器人机械手的应用范围不断扩大,但机器人制造商通常遵循“一刀切”的理念,在各种环境中使用相同的机械手。这通常会导致性能不佳,因为通用设计无法利用任务的特殊性。定制、任务定制机器人的开发受到漫长、成本密集型开发周期和定制硬件的高成本的阻碍。最近,人们设计了各种计算设计方法来克服人类工程的瓶颈。此外,模块化机器人的激增可以快速、经济地适应不断变化的工业环境。这项工作提出了一种自动设计和优化针对特定环境量身定制的机器人形态的方法。为此,我们学习了各种不同机械手的逆运动学。完全可微分的框架实现了对设计机器人和逆运动学解决方案的基于梯度的微调。我们的生成方法将专业设计的生成从基于优化的方法从几小时加速到几秒钟,充当设计副驾驶,实现即时调整和有效的人机协作。数值实验表明,我们的方法找到了能够在杂乱环境中导航的机器人,在指定工作空间中表现良好的机械手,并且可以适应不同的硬件约束。最后,我们通过设置一个模拟设计的模块化机器人来证明我们的方法在现实世界中的适用性,该机器人可以成功地穿过障碍赛。
Subjects: Robotics, Artificial Intelligence 科目: 机器人 , 人工智能
Publish: 2025-09-16 13:34:30 UTC 发布时间: 2025-09-16 13:34:30 UTC
#62 TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation #62 TFANet:用于鲁棒引用图像分割的三阶段图像-文本特征对齐网络
Authors: [Qianqi Lu](https://arxiv.org/search/?searchtype=author&query=Qianqi Lu), [Yuxiang Xie](https://arxiv.org/search/?searchtype=author&query=Yuxiang Xie), [Jing Zhang](https://arxiv.org/search/?searchtype=author&query=Jing Zhang), [Shiwei Zou](https://arxiv.org/search/?searchtype=author&query=Shiwei Zou), [Yan Chen](https://arxiv.org/search/?searchtype=author&query=Yan Chen), [Xidao Luan](https://arxiv.org/search/?searchtype=author&query=Xidao Luan) 作者: 卢倩琪, 谢玉翔, 张静, 邹世伟, 陈燕, 卢西道
Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages. 
引用图像分割 (RIS) 是一项根据语言表达式对图像区域进行分割的任务,需要在两种模态之间进行细粒度对齐。然而,现有方法经常面临多模态错位和语言语义丢失的问题,特别是在包含多个视觉相似对象的复杂场景中,其中唯一描述的目标经常被错误定位或不完全分割。为了应对这些挑战,本文提出了三阶段图像-文本特征对齐网络 TFANet,它通过由知识加阶段(KPS)、知识融合阶段(KFS)和知识强化阶段(KIS)三个阶段组成的分层框架系统地增强多模态对齐。在第一阶段,我们设计了多尺度线叉注意力模块(MLAM),它促进了跨多个尺度的视觉特征和文本表示之间的双向语义交换。这在图像区域和不同粒度的语言描述之间建立了丰富而有效的对齐。随后,KFS 通过跨模态特征扫描模块(CFSM)进一步加强了特征对齐,该模块应用多模态选择性扫描来捕获远距离依赖关系并构建统一的多模态表示。这对于对远程跨模态依赖关系进行建模和提高复杂场景中的对齐精度至关重要。最后,在 KIS 中,我们提出了词级语言特征引导语义深化模块(WFDM)来补偿早期阶段引入的语义退化。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 13:26:58 UTC 发布时间: 2025-09-16 13:26:58 UTC
#63 Multi-Model Synthetic Training for Mission-Critical Small Language Models #63 关键任务小语言模型的多模型合成训练
Authors: [Nolan Platt](https://arxiv.org/search/?searchtype=author&query=Nolan Platt), [Pragyansmita Nayak](https://arxiv.org/search/?searchtype=author&query=Pragyansmita Nayak) 作者:诺兰·普拉特、普拉扬斯米塔·纳亚克
Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models - when fine tuned properly - can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries. 大型语言模型(LLM)在许多领域都表现出了卓越的能力,但它们在专业领域的应用仍然受到特定领域训练数据的稀缺性和复杂性的限制。我们提出了一种新颖的方法,通过使用 LLM 作为一次性教师而不是直接将其用于推理,可以将海上情报的成本降低 261 倍。该方法通过多模型生成(GPT-4o 和 o3-mini)将 32 亿条自动识别系统(AIS)船舶跟踪记录转换为 21,543 个合成问答对,防止过度拟合并确保推理准确。由此产生的微调 Qwen2.5-7B 模型在海上任务上实现了 75% 的准确率,同时比使用更大的模型进行推理要便宜得多。我们表明,与昂贵得令人望而却步的大型模型相比,更小、更便宜的模型(如果微调得当)可以提供类似的精度。我们的工作为专业人工智能应用的合成数据集生成领域不断发展做出了贡献,并为手动注释不可行的领域提供了一个高度可重复的框架。除了在不断发展的专业小语言模型领域扩大研究外,我们的方法还直接应用于各行各业的海上安全、安保操作和船舶交通管理系统。
Subjects: Computation and Language, Artificial Intelligence, Machine Learning 科目: 计算与语言 , 人工智能 , 机器学习
Publish: 2025-09-16 13:04:48 UTC 发布时间: 2025-09-16 13:04:48 UTC
#64 MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data #64 MIA-EPT:通过表格数据错误预测进行成员推理攻击
Authors: [Eyal German](https://arxiv.org/search/?searchtype=author&query=Eyal German), [Daniel Samira](https://arxiv.org/search/?searchtype=author&query=Daniel Samira), [Yuval Elovici](https://arxiv.org/search/?searchtype=author&query=Yuval Elovici), [Asaf Shabtai](https://arxiv.org/search/?searchtype=author&query=Asaf Shabtai) 作者:Eyal German、Daniel Samira、Yuval Elovici、Asaf Shabtai
Synthetic data generation plays an important role in enabling data sharing, particularly in sensitive domains like healthcare and finance. Recent advances in diffusion models have made it possible to generate realistic, high-quality tabular data, but they may also memorize training records and leak sensitive information. Membership inference attacks (MIAs) exploit this vulnerability by determining whether a record was used in training. While MIAs have been studied in images and text, their use against tabular diffusion models remains underexplored despite the unique risks of structured attributes and limited record diversity. In this paper, we introduce MIA-EPT, Membership Inference Attack via Error Prediction for Tabular Data, a novel black-box attack specifically designed to target tabular diffusion models. MIA-EPT constructs error-based feature vectors by masking and reconstructing attributes of target records, disclosing membership signals based on how well these attributes are predicted. MIA-EPT operates without access to the internal components of the generative model, relying only on its synthetic data output, and was shown to generalize across multiple state-of-the-art diffusion models. We validate MIA-EPT on three diffusion-based synthesizers, achieving AUC-ROC scores of up to 0.599 and TPR@10% FPR values of 22.0% in our internal tests. Under the MIDST 2025 competition conditions, MIA-EPT achieved second place in the Black-box Multi-Table track (TPR@10% FPR = 20.0%). These results demonstrate that our method can uncover substantial membership leakage in synthetic tabular data, challenging the assumption that synthetic data is inherently privacy-preserving. Our code is publicly available at https://github.com/eyalgerman/MIA-EPT.
合成数据生成在实现数据共享方面发挥着重要作用,特别是在医疗保健和金融等敏感领域。扩散模型的最新进展使得生成真实、高质量的表格数据成为可能,但它们也可能记住训练记录并泄露敏感信息。成员身份推理攻击 (MIA) 通过确定是否在训练中使用了记录来利用此漏洞。虽然已经在图像和文本中研究了 MIA,但尽管结构化属性和有限的记录多样性存在独特的风险,但它们对表格扩散模型的使用仍然没有得到充分探索。在本文中,我们介绍了 MIA-EPT,即通过表格数据错误预测进行成员推理攻击,这是一种专门针对表格扩散模型设计的新型黑盒攻击。MIA-EPT 通过屏蔽和重构目标记录的属性来构建基于错误的特征向量,并根据这些属性的预测程度披露成员信号。MIA-EPT 无需访问生成模型的内部组件即可运行,仅依赖于其合成数据输出,并且被证明可以在多个最先进的扩散模型中进行推广。我们在三个基于扩散的合成器上验证了 MIA-EPT,在我们的内部测试中实现了高达 0.599 的 AUC-ROC 分数和 22.0% 的 TPR@10% FPR 值。在 MIDST 2025 的比赛条件下,MIA-EPT 在黑盒多表赛道中获得了第二名(TPR@10% FPR = 20.0%)。这些结果表明,我们的方法可以发现合成表格数据中的大量成员泄漏,挑战了合成数据本质上是隐私保护的假设。我们的代码可在 https://github.com/eyalgerman/MIA-EPT 公开获取。
Subjects: Cryptography and Security, Artificial Intelligence 主题: 密码学与安全 , 人工智能
Publish: 2025-09-16 13:03:54 UTC 发布: 2025-09-16 13:03:54 UTC
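The MIA-EPT abstract reports TPR@10% FPR, which may be unfamiliar next to AUC-ROC. The sketch below shows one generic way to compute that metric from attack scores. It is not the authors' evaluation code: the function name `tpr_at_fpr` and the toy scores are hypothetical, and the convention that a score at or above the threshold means "predicted member" is an assumption.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.10):
    """Best true-positive rate achievable while keeping FPR <= max_fpr.

    scores: attack scores, higher means "more likely a training member".
    labels: 1 for true members, 0 for non-members.
    """
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")

    best_tpr = 0.0
    # Try every observed score as a decision threshold (score >= threshold => member).
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in non_members) / len(non_members)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in members) / len(members)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Toy data (hypothetical): 4 members and 10 non-members.
member_scores = [0.9, 0.8, 0.4, 0.3]
non_member_scores = [0.7, 0.35, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01, 0.0]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)
print(tpr_at_fpr(scores, labels, max_fpr=0.10))
```

With ten non-members, a 10% FPR budget allows at most one false positive, so the threshold can drop only as far as 0.4 in this toy example. Random guessing would yield a TPR of about 10% at this operating point, which is the reference for reading the reported 22.0% and 20.0%.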
#65 Introducing the A2AJ's Canadian Legal Data: An open-source alternative to CanLII for the era of computational law #65 介绍 A2AJ 的加拿大法律数据:计算法时代 CanLII 的开源替代方案
Authors: [Simon Wallace](https://arxiv.org/search/?searchtype=author&query=Simon Wallace), [Sean Rehaag](https://arxiv.org/search/?searchtype=author&query=Sean Rehaag) 作者:Simon Wallace、Sean Rehaag
The Access to Algorithmic Justice project (A2AJ) is an open-source alternative to the Canadian Legal Information Institute (CanLII). At a moment when technology promises to enable new ways of working with law, CanLII is becoming an impediment to the free access of law and access to justice movements because it restricts bulk and programmatic access to Canadian legal data. This means that Canada is staring down a digital divide: well-resourced actors have the best new technological tools and, because CanLII has disclaimed leadership, the public only gets second-rate tools. This article puts CanLII in its larger historical context and shows how long and deep efforts to democratize access to Canadian legal data are, and how often they are thwarted by private industry. We introduce the A2AJ’s Canadian Legal Data project, which provides open access to over 116,000 court decisions and 5,000 statutes through multiple channels including APIs, machine learning datasets, and AI integration protocols. Through concrete examples, we demonstrate how open legal data enables courts to conduct evidence-based assessments and allows developers to create tools for practitioners serving low-income communities. Access to Algorithmic Justice 项目 (A2AJ) 是加拿大法律信息研究所 (CanLII) 的开源替代方案。在技术有望实现新的法律工作方式的时刻,CanLII 正在成为自由获取法律和诉诸司法运动的障碍,因为它限制了对加拿大法律数据的批量和程序化访问。这意味着加拿大正在面临数字鸿沟:资源充足的参与者拥有最好的新技术工具,而且由于 CanLII 放弃了领导地位,公众只能获得二流工具。本文将 CanLII 置于更大的历史背景下,并展示了使加拿大法律数据的访问民主化的努力是多么漫长和深入,以及它们被私营企业阻挠的频率。我们介绍了 A2AJ 的加拿大法律数据项目,该项目通过包括 API、机器学习数据集和 AI 集成协议在内的多种渠道提供对 116,000 多项法院判决和 5,000 项法规的开放访问。通过具体的例子,我们展示了开放的法律数据如何使法院能够进行基于证据的评估,并允许开发人员为服务于低收入社区的从业者创建工具。
Subjects: Computers and Society, Artificial Intelligence 科目: 计算机与社会 , 人工智能
Publish: 2025-09-16 12:51:39 UTC 发布时间: 2025-09-16 12:51:39 UTC
#66 Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models #66 推理前的感知:视觉语言模型中视觉推理的两阶段强化学习
Authors: [Yan Chen](https://arxiv.org/search/?searchtype=author&query=Yan Chen), [Long Li](https://arxiv.org/search/?searchtype=author&query=Long Li), [Teng Xi](https://arxiv.org/search/?searchtype=author&query=Teng Xi), [Long Zeng](https://arxiv.org/search/?searchtype=author&query=Long Zeng), [Jingdong Wang](https://arxiv.org/search/?searchtype=author&query=Jingdong Wang) 作者:陈艳、李龙、滕习、曾龙、王婧东
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks. 事实证明,强化学习 (RL) 在引出大型语言模型 (LLM) 的推理能力方面非常有效。受这一成功的启发,最近的研究探索了将类似的技术应用于视觉语言模型 (VLM),旨在提高其推理性能。然而,直接将 RL 方法从 LLM 移植到 VLM 并不是最优的,因为 VLM 面临的任务本质上更加复杂。具体来说,VLM 必须首先准确感知和理解视觉输入,然后才能有效地进行推理。为了应对这一挑战,我们提出了一个两阶段的强化学习框架,旨在共同增强 VLM 的感知和推理能力。为了缓解 RL 训练中常见的优势消失问题,我们首先进行数据集级采样,以使用不同的数据源有选择地增强特定能力。在训练过程中,第一阶段侧重于通过粗粒度和细粒度的视觉理解来提高模型的视觉感知能力,而第二阶段则侧重于增强推理能力。经过所提出的两阶段强化学习过程,我们得到了 PeBR-R1,这是一种视觉语言模型,具有显著增强的感知和推理能力。在七个基准数据集上的实验结果证明了我们方法的有效性,并验证了 PeBR-R1 在各种视觉推理任务中的卓越性能。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 12:51:11 UTC 发布时间: 2025-09-16 12:51:11 UTC
#67 GView: A Survey of Binary Forensics via Visual, Semantic, and AI-Enhanced Analysis #67 GView:基于视觉、语义和人工智能增强分析的二进制取证综述
Authors: [Raul Zaharia](https://arxiv.org/search/?searchtype=author&query=Raul Zaharia), [Dragoş Gavriluţ](https://arxiv.org/search/?searchtype=author&query=Dragoş Gavriluţ), [Gheorghiţă Mutu](https://arxiv.org/search/?searchtype=author&query=Gheorghiţă Mutu) 作者:Raul Zaharia、Dragoş Gavriluţ、Gheorghiţă Mutu
Cybersecurity threats continue to become more sophisticated and diverse in their artifacts, boosting both their volume and complexity. To overcome those challenges, we present GView, an open-source forensic analysis framework with visual and AI-enhanced reasoning. It started with focus on the practical cybersecurity industry. It has evolved significantly, incorporating large language models (LLMs) to dynamically enhance reasoning and ease the forensic workflows. This paper surveys both the current state of GView with its published papers alongside those that are in the publishing process. It also includes its innovative use of logical inference through predicates and inference rules for both the analyzed documents and the user’s actions for better suggestions. We highlight the extensible architecture, showcasing its potential as a bridge between the practical forensics worlds with the academic research. 网络安全威胁的工件继续变得更加复杂和多样化,从而增加了其数量和复杂性。为了克服这些挑战,我们推出了 GView,这是一个具有视觉和人工智能增强推理功能的开源取证分析框架。它首先关注实用的网络安全行业。它已经取得了显着的发展,结合了大型语言模型 (LLM) 来动态增强推理并简化取证工作流程。本文调查了 GView 及其已发表论文的现状以及正在发表过程中的论文。它还包括通过谓词和推理规则对分析文档和用户操作进行逻辑推理的创新使用,以获得更好的建议。我们重点介绍了可扩展的架构,展示了其作为实用取证领域与学术研究之间桥梁的潜力。
Subjects: Software Engineering, Artificial Intelligence 科目: 软件工程 , 人工智能
Publish: 2025-09-16 12:46:39 UTC 发布时间: 2025-09-16 12:46:39 UTC
#68 Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models #68 使用由大型语言模型提供支持的符号和具体执行来验证 Solidity 代码缺陷
Authors: [Ştefan-Claudiu Susan](https://arxiv.org/search/?searchtype=author&query=Ştefan-Claudiu Susan), [Andrei Arusoaie](https://arxiv.org/search/?searchtype=author&query=Andrei Arusoaie), [Dorel Lucanu](https://arxiv.org/search/?searchtype=author&query=Dorel Lucanu) 作者:Ştefan-Claudiu Susan、Andrei Arusoaie、Dorel Lucanu
The high rate of false alarms from static analysis tools and Large Language Models (LLMs) complicates vulnerability detection in Solidity Smart Contracts, demanding methods that can formally or empirically prove the presence of defects. This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, LLMs, Kontrol, and Forge. Our approach is designed to reliably detect defects and generate proofs. We currently perform experiments with promising results for seven types of critical defects. We demonstrate the pipeline’s efficacy by presenting our findings for three vulnerabilities – Reentrancy, Complex Fallback, and Faulty Access Control Policies – that are challenging for current verification solutions, which often generate false alarms or fail to detect them entirely. We highlight the potential of either symbolic or concrete execution in correctly classifying such code faults. By chaining these instruments, our method effectively validates true positives, significantly reducing the manual verification burden. Although we identify potential limitations, such as the inconsistency and the cost of LLMs, our findings establish a robust framework for combining heuristic analysis with formal verification to achieve more reliable and automated smart contract auditing. 静态分析工具和大型语言模型 (LLM) 的高误报率使 Solidity 智能合约中的漏洞检测变得复杂,需要能够正式或经验证明缺陷存在的方法。本文介绍了一种新颖的检测管道,该管道集成了基于 Slither 的自定义检测器、LLM、Kontrol 和 Forge。我们的方法旨在可靠地检测缺陷并生成证明。我们目前对七种类型的严重缺陷进行了实验,结果很有希望。我们通过展示我们对三个漏洞(重入、复杂回退和错误访问控制策略)的调查结果来证明该管道的有效性,这些漏洞对于当前的验证解决方案来说是具有挑战性的,这些解决方案通常会产生误报或无法完全检测到它们。我们强调了符号执行或具体执行在正确分类此类代码错误方面的潜力。通过链接这些工具,我们的方法有效地验证了真阳性,从而显着减轻了人工验证的负担。尽管我们确定了潜在的局限性,例如 LLM 的不一致性和成本,但我们的研究结果建立了一个强大的框架,将启发式分析与形式验证相结合,以实现更可靠和自动化的智能合约审计。
Subjects: Software Engineering, Artificial Intelligence 科目: 软件工程 , 人工智能
Publish: 2025-09-16 12:46:11 UTC 发布时间: 2025-09-16 12:46:11 UTC
#69 xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems #69 xOffense:一个人工智能驱动的自主渗透测试框架,具有进攻性知识增强的 LLM 和多智能体系统
Authors: [Phung Duc Luong](https://arxiv.org/search/?searchtype=author&query=Phung Duc Luong), [Le Tran Gia Bao](https://arxiv.org/search/?searchtype=author&query=Le Tran Gia Bao), [Nguyen Vu Khai Tam](https://arxiv.org/search/?searchtype=author&query=Nguyen Vu Khai Tam), [Dong Huu Nguyen Khoa](https://arxiv.org/search/?searchtype=author&query=Dong Huu Nguyen Khoa), [Nguyen Huu Quyen](https://arxiv.org/search/?searchtype=author&query=Nguyen Huu Quyen), [Van-Hau Pham](https://arxiv.org/search/?searchtype=author&query=Van-Hau Pham), [Phan The Duy](https://arxiv.org/search/?searchtype=author&query=Phan The Duy) 作者: Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy
This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing. 这项工作引入了 xOffense,这是一个人工智能驱动的多智能体渗透测试框架,它将流程从劳动密集型、专家驱动的手动工作转变为能够与计算基础设施无缝扩展的全自动、机器可执行的工作流程。xOffense 的核心是利用微调的中等规模开源 LLM (Qwen3-32B) 来推动渗透测试中的推理和决策。该框架分配专门的智能体来进行侦察、漏洞扫描和利用,并设有编排层,确保跨阶段的无缝协调。对思维链渗透测试数据的微调进一步使模型能够生成精确的工具命令并执行一致的多步骤推理。我们在两个严格的基准测试上评估 xOffense:AutoPenBench 和 AI-Pentest-Benchmark。结果表明,xOffense 始终优于当代方法,实现了 79.17% 的子任务完成率,决定性地超越了 VulnBot 和 PentestGPT 等领先系统。这些发现凸显了领域适应的中型 LLM 在嵌入结构化多智能体编排时的潜力,可以为自主渗透测试提供卓越、经济高效且可重复的解决方案。
Subjects: Cryptography and Security, Artificial Intelligence 主题: 密码学与安全 , 人工智能
Publish: 2025-09-16 12:45:45 UTC 发布时间: 2025-09-16 12:45:45 UTC
#70 Bridging Performance Gaps for Foundation Models: A Post-Training Strategy for ECGFounder #70 弥合基础模型的性能差距:ECGFounder 的后训练策略
Authors: [Ya Zhou](https://arxiv.org/search/?searchtype=author&query=Ya Zhou), [Yujie Yang](https://arxiv.org/search/?searchtype=author&query=Yujie Yang), [Xiaohan Fan](https://arxiv.org/search/?searchtype=author&query=Xiaohan Fan), [Wei Zhao](https://arxiv.org/search/?searchtype=author&query=Wei Zhao) 作者:周雅、杨宇杰、范晓涵、赵伟
ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECGFounder, a state-of-the-art ECG foundation model pre-trained on over 7 million ECG recordings. Experiments on the PTB-XL benchmark show that our approach improves the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in macro AUPRC. Additionally, our method outperforms several recent state-of-the-art approaches, including task-specific and advanced architectures. Further evaluation reveals that our method is more stable and sample-efficient compared to the baseline, achieving a 9.1% improvement in macro AUROC and a 34.9% improvement in macro AUPRC using just 10% of the training data. Ablation studies identify key components, such as stochastic depth and preview linear probing, that contribute to the enhanced performance. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain. 心电图基础模型因其对各种任务的适应性而越来越受欢迎。然而,与特定任务模型相比,它们的临床适用性通常受到性能差距的限制,即使在对大型心电图数据集进行预训练并对目标数据进行微调之后也是如此。这种限制可能是由于缺乏有效的培训后策略。在本文中,我们提出了一种简单而有效的后训练方法来增强 ECGFounder,这是一种在超过 700 万次心电图记录上预训练的最先进的心电图基础模型。在 PTB-XL 基准上的实验表明,我们的方法在宏观 AUROC 中提高了 1.2%-3.3%的基线微调策略,在宏观 AUPRC 中提高了 5.3%-20.9%。此外,我们的方法优于几种最近的最先进的方法,包括特定于任务的架构和高级架构。进一步的评估表明,与基线相比,我们的方法更加稳定和样本效率更高,仅使用 10%的训练数据即可实现 9.1%的宏观 AUROC 改进和 34.9%的宏观 AUPRC 改进。消融研究确定了有助于提高性能的关键组件,例如随机深度和预览线性探测。这些发现强调了训练后策略改进心电图基础模型的潜力,我们希望这项工作将有助于心电图领域基础模型的持续发展。
Subjects: Machine Learning, Artificial Intelligence, Applications 主题: 机器学习 , 人工智能 , 应用
Publish: 2025-09-16 12:02:13 UTC 发布时间: 2025-09-16 12:02:13 UTC
#71 Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection #71 用于长尾自我中心错误检测的双级重加权 MoE
Authors: [Boyu Han](https://arxiv.org/search/?searchtype=author&query=Boyu Han), [Qianqian Xu](https://arxiv.org/search/?searchtype=author&query=Qianqian Xu), [Shilong Bao](https://arxiv.org/search/?searchtype=author&query=Shilong Bao), [Zhiyong Yang](https://arxiv.org/search/?searchtype=author&query=Zhiyong Yang), [Sicong Li](https://arxiv.org/search/?searchtype=author&query=Sicong Li), [Qingming Huang](https://arxiv.org/search/?searchtype=author&query=Qingming Huang) 作者:韩博宇,徐倩倩,鲍世龙,杨志勇,李思聪,黄清明
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE. 在本报告中,我们解决了从以自我为中心的视频数据中确定用户是否错误地执行操作的问题。为了应对微妙和不常见的错误带来的挑战,我们提出了一个双阶段重新加权专家混合 (DR-MoE) 框架。在第一阶段,使用冻结的 ViViT 模型和 LoRA 调整的 ViViT 模型提取特征,并通过特征级专家模块组合它们。在第二阶段,训练三个具有不同目标的分类器:重加权交叉熵以减轻类不平衡,AUC 损失以提高偏态分布下的排名,以及标签感知损失与锐度感知最小化以增强校准和泛化。它们的预测使用分类级专家模块进行融合。所提出的方法取得了强大的性能,特别是在识别罕见和模糊的错误实例方面。该代码可在 https://github.com/boyuh/DR-MoE 获得。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning 科目: 计算机视觉与模式识别 , 人工智能 , 机器学习
Publish: 2025-09-16 12:00:42 UTC 发布时间: 2025-09-16 12:00:42 UTC
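Of the three second-stage objectives named in the DR-MoE abstract, reweighted cross-entropy is the simplest to illustrate. The sketch below is a generic version, not the authors' loss: the inverse-frequency weighting scheme, the normalization by total weight, and the function names `inverse_frequency_weights` and `reweighted_cross_entropy` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency (assumed scheme).

    Scaled so that a perfectly balanced dataset gives every class weight 1.0.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def reweighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), so rare classes count more."""
    weighted_losses = [
        class_weights[y] * -math.log(p[y]) for p, y in zip(probs, labels)
    ]
    total_weight = sum(class_weights[y] for y in labels)
    return sum(weighted_losses) / total_weight


# Toy data (hypothetical): class 1 ("mistake") is rare.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(weights)
print(reweighted_cross_entropy(probs, labels, weights))
```

In this toy example the single rare-class sample receives weight 2.0 while each common-class sample receives 2/3, so a misclassified mistake contributes three times as much to the loss as a misclassified normal action.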
#72 Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins #72 具有人工智能驱动的数字孪生的自适应机器人的分布外检测
Authors: [Erblin Isaku](https://arxiv.org/search/?searchtype=author&query=Erblin Isaku), [Hassan Sartaj](https://arxiv.org/search/?searchtype=author&query=Hassan Sartaj), [Shaukat Ali](https://arxiv.org/search/?searchtype=author&query=Shaukat Ali), [Beatriz Sanguino](https://arxiv.org/search/?searchtype=author&query=Beatriz Sanguino), [Tongtong Wang](https://arxiv.org/search/?searchtype=author&query=Tongtong Wang), [Guoyuan Li](https://arxiv.org/search/?searchtype=author&query=Guoyuan Li), [Houxiang Zhang](https://arxiv.org/search/?searchtype=author&query=Houxiang Zhang), [Thomas Peyrucain](https://arxiv.org/search/?searchtype=author&query=Thomas Peyrucain) 作者:Erblin Isaku、Hassan Sartaj、Shaukat Ali、Beatriz Sanguino、Tongtong Wang、Guoyuan Li、Houxiang Zhang、Thomas Peyrucain
Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance – up to 98% AUROC, 96% TNR@TPR95, and 95% F1-score – while providing interpretable insights to support self-adaptation. 在复杂、不确定的环境中,自适应机器人 (SAR) 必须主动检测和处理异常行为,包括分布外 (OOD) 情况。为此,数字孪生为 OOD 检测提供了有价值的解决方案。因此,我们提出了一种基于数字孪生的 SAR 中 OOD 检测 (ODiSAR) 方法。ODiSAR 使用基于 Transformer 的数字孪生来预测 SAR 状态,并采用重建误差和蒙特卡洛 dropout 进行不确定性量化。通过将重建误差与预测方差相结合,数字孪生即使在以前看不见的条件下也能有效检测 OOD 行为。数字孪生还包括一个可解释层,将潜在的 OOD 与特定的 SAR 状态联系起来,为自我适应提供见解。我们通过创建两个工业机器人的数字孪生来评估 ODiSAR:一个在办公环境中导航,另一个执行海上船舶导航。在这两种情况下,ODiSAR 都会预测 SAR 行为(即机器人轨迹和船舶运动)并主动检测 OOD 事件。我们的结果表明,ODiSAR 实现了高检测性能(高达 98%的 AUROC、96%的 TNR@TPR95 和 95%的 F1 分数),同时提供了可解释的见解来支持自我适应。
Subjects: Robotics, Artificial Intelligence, Software Engineering 科目: 机器人技术 , 人工智能 , 软件工程
Publish: 2025-09-16 11:43:47 UTC 发布时间: 2025-09-16 11:43:47 UTC
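The ODiSAR abstract says OOD behavior is detected by combining reconstruction error with the predictive variance obtained from Monte Carlo dropout. A minimal sketch of that idea follows. The abstract does not state how the two signals are combined, so the weighted sum, the `variance_weight` parameter, the threshold, and the function name `ood_score` are all assumptions; the stochastic forecasts are supplied as plain lists rather than produced by a real Transformer.

```python
from statistics import fmean, pvariance


def ood_score(mc_forecasts, observed, variance_weight=1.0):
    """Combine reconstruction error with MC-dropout predictive variance.

    mc_forecasts: T stochastic forecasts of the same state, each a list of
                  D floats (one per forward pass with dropout left active).
    observed:     the D-dimensional state actually observed.
    """
    per_dimension = list(zip(*mc_forecasts))  # D tuples of T samples each
    mean_forecast = [fmean(samples) for samples in per_dimension]
    # Predictive variance: spread of the T forecasts, averaged over dimensions.
    predictive_variance = fmean(pvariance(samples) for samples in per_dimension)
    # Reconstruction error: mean squared gap between mean forecast and observation.
    reconstruction_error = fmean(
        (m - o) ** 2 for m, o in zip(mean_forecast, observed)
    )
    return reconstruction_error + variance_weight * predictive_variance


# Toy data (hypothetical): two dropout passes over a 2-D robot state.
forecasts = [[0.0, 0.0], [2.0, 2.0]]
in_distribution = ood_score(forecasts, observed=[1.0, 1.0])
suspicious = ood_score(forecasts, observed=[3.0, 1.0])
threshold = 2.0  # assumed; in practice calibrated on held-out normal data
print(in_distribution, suspicious, suspicious > threshold)
```

The two terms capture different failure modes: a large reconstruction error means the twin's forecast disagrees with what happened, while a large variance means the twin itself is unsure, which can flag unfamiliar conditions even when the mean forecast happens to be close.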
#73 Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models #73 研究 ReLoRA:对小语言模型学习动态的影响
Authors: [Yuval Weiss](https://arxiv.org/search/?searchtype=author&query=Yuval Weiss), [David Demitri Africa](https://arxiv.org/search/?searchtype=author&query=David Demitri Africa), [Paula Buttery](https://arxiv.org/search/?searchtype=author&query=Paula Buttery), [Richard Diehl Martinez](https://arxiv.org/search/?searchtype=author&query=Richard Diehl Martinez) 作者:Yuval Weiss、David Demitri Africa、Paula Buttery、Richard Diehl Martinez
Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime. LoRA 等参数效率高的方法彻底改变了 LLM 的微调。尽管如此,它们通过 ReLoRA 扩展到预训练的理解还不太清楚,特别是对于小型语言模型 (SLM),它们提供了较低的计算和环境成本。这项工作是首次在 SLM(11M-66M 参数)中对 ReLoRA 进行系统研究,评估性能和学习动态。通过消融实验,我们发现 ReLoRA 在损失、Paloma 困惑度和 BLiMP 方面的表现通常比标准训练差,而对于较大的模型,差距会扩大。对模型学习动态的进一步分析表明,ReLoRA 强化了较小模型中发现的秩缺陷。这些结果表明,低秩更新策略可能不容易转移到 SLM 预训练中,这凸显了在低计算制度中进行更多研究的必要性。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 11:06:58 UTC 发布时间: 2025-09-16 11:06:58 UTC
#74 FusionMAE: large-scale pretrained model to optimize and simplify diagnostic and control of fusion plasma #74 FusionMAE:用于优化和简化聚变等离子体诊断和控制的大规模预训练模型
Authors: [Zongyu Yang](https://arxiv.org/search/?searchtype=author&query=Zongyu Yang), [Zhenghao Yang](https://arxiv.org/search/?searchtype=author&query=Zhenghao Yang), [Wenjing Tian](https://arxiv.org/search/?searchtype=author&query=Wenjing Tian), [Jiyuan Li](https://arxiv.org/search/?searchtype=author&query=Jiyuan Li), [Xiang Sun](https://arxiv.org/search/?searchtype=author&query=Xiang Sun), [Guohui Zheng](https://arxiv.org/search/?searchtype=author&query=Guohui Zheng), [Songfen Liu](https://arxiv.org/search/?searchtype=author&query=Songfen Liu), [Niannian Wu](https://arxiv.org/search/?searchtype=author&query=Niannian Wu), [Rongpeng Li](https://arxiv.org/search/?searchtype=author&query=Rongpeng Li), [Zhaohe Xu](https://arxiv.org/search/?searchtype=author&query=Zhaohe Xu), [Bo Li](https://arxiv.org/search/?searchtype=author&query=Bo Li), [Zhongbing Shi](https://arxiv.org/search/?searchtype=author&query=Zhongbing Shi), [Zhe Gao](https://arxiv.org/search/?searchtype=author&query=Zhe Gao), [Wei Chen](https://arxiv.org/search/?searchtype=author&query=Wei Chen), [Xiaoquan Ji](https://arxiv.org/search/?searchtype=author&query=Xiaoquan Ji), [Min Xu](https://arxiv.org/search/?searchtype=author&query=Min Xu), [Wulyu Zhong](https://arxiv.org/search/?searchtype=author&query=Wulyu Zhong) 作者: 杨宗宇, 杨正浩, 田文静, 李继元, 孙翔, 郑国辉, 刘松芬, 吴念念, 李荣鹏, 徐兆和, 李波, 石中兵, 高哲, 陈伟, 季晓全, 徐敏, 钟武宇
In magnetically confined fusion devices, the complex, multiscale, and nonlinear dynamics of plasmas necessitate the integration of extensive diagnostic systems to effectively monitor and control plasma behaviour. The complexity and uncertainty arising from these extensive systems and their tangled interrelations have long posed a significant obstacle to the acceleration of fusion energy development. In this work, a large-scale model, the fusion masked auto-encoder (FusionMAE), is pre-trained to compress the information from 88 diagnostic signals into a concrete embedding, to provide a unified interface between diagnostic systems and control actuators. Two mechanisms are proposed to ensure a meaningful embedding: compression-reduction and missing-signal reconstruction. Upon completion of pre-training, the model acquires the capability for ‘virtual backup diagnosis’, enabling the inference of missing diagnostic data with 96.7% reliability. Furthermore, the model demonstrates three emergent capabilities: automatic data analysis, universal control-diagnosis interface, and enhancement of control performance on multiple tasks. This work pioneers large-scale AI model integration in fusion energy, demonstrating how pre-trained embeddings can simplify the system interface, reduce the necessary diagnostic systems, and optimize operation performance for future fusion reactors. 在磁约束聚变装置中,等离子体的复杂、多尺度和非线性动力学需要集成广泛的诊断系统来有效监测和控制等离子体行为。这些广泛的系统及其错综复杂的相互关系所带来的复杂性和不确定性长期以来一直是加速聚变能发展的重大障碍。在这项工作中,对大型模型融合掩码自动编码器(FusionMAE)进行了预训练,以将来自 88 个诊断信号的信息压缩到一个具体的嵌入中,从而在诊断系统和控制执行器之间提供统一的接口。为了确保有意义的嵌入,提出了两种机制:压缩减少和缺失信号重建。完成预训练后,该模型获得了“虚拟备份诊断”的能力,能够以 96.7% 的可靠性推断缺失的诊断数据。此外,该模型还展示了三个涌现能力:自动数据分析、通用控制诊断接口和增强多任务控制性能。这项工作开创了聚变能中大规模人工智能模型集成,展示了预训练嵌入如何简化系统界面,减少必要的诊断系统并优化未来聚变反应堆的运行性能。
Subjects: Plasma Physics, Artificial Intelligence 科目:等离子体物理学、人工智能
Publish: 2025-09-16 10:50:29 UTC 发布时间: 2025-09-16 10:50:29 UTC
#75 Sy-FAR: Symmetry-based Fair Adversarial Robustness #75 Sy-FAR:基于对称性的公平对抗鲁棒性
Authors: [Haneen Najjar](https://arxiv.org/search/?searchtype=author&query=Haneen Najjar), [Eyal Ronen](https://arxiv.org/search/?searchtype=author&query=Eyal Ronen), [Mahmood Sharif](https://arxiv.org/search/?searchtype=author&query=Mahmood Sharif) 作者:Haneen Najjar、Eyal Ronen、Mahmood Sharif
Security-critical machine-learning (ML) systems, such as face-recognition systems, are susceptible to adversarial examples, including real-world physically realizable attacks. Various means to boost ML’s adversarial robustness have been proposed; however, they typically induce unfair robustness: It is often easier to attack from certain classes or groups than from others. Several techniques have been developed to improve adversarial robustness while seeking perfect fairness between classes. Yet, prior work has focused on settings where security and fairness are less critical. Our insight is that achieving perfect parity in realistic fairness-critical tasks, such as face recognition, is often infeasible – some classes may be highly similar, leading to more misclassifications between them. Instead, we suggest that seeking symmetry – i.e., attacks from class i to j would be as successful as from j to i – is more tractable. Intuitively, symmetry is desirable because class resemblance is a symmetric relation in most domains. Additionally, as we prove theoretically, symmetry between individuals induces symmetry between any set of sub-groups, in contrast to other fairness notions where group-fairness is often elusive. We develop Sy-FAR, a technique to encourage symmetry while also optimizing adversarial robustness, and extensively evaluate it using five datasets, with three model architectures, including against targeted and untargeted realistic attacks. The results show Sy-FAR significantly improves fair adversarial robustness compared to state-of-the-art methods. Moreover, we find that Sy-FAR is faster and more consistent across runs. Notably, Sy-FAR also ameliorates another type of unfairness we discover in this work – target classes that adversarial examples are likely to be classified into become significantly less vulnerable after inducing symmetry.
安全关键型机器学习 (ML) 系统(例如人脸识别系统)容易受到对抗性示例的影响,包括现实世界中物理上可实现的攻击。已经提出了各种方法来提高机器学习的对抗性鲁棒性;然而,它们通常会导致不公平的鲁棒性:来自某些类别或群体的攻击通常比来自其他类别或群体更容易。已经开发了几种技术来提高对抗鲁棒性,同时寻求类之间的完美公平性。然而,之前的工作主要集中在安全性和公平性不太重要的环境中。我们的见解是,在现实的公平关键任务(例如人脸识别)中实现完美平等通常是不可行的:某些类别可能高度相似,导致它们之间出现更多的错误分类。相反,我们建议寻求对称性(即从类 i 到类 j 的攻击与从类 j 到类 i 的攻击一样成功),这更容易处理。直观地讲,对称性是可取的,因为类相似性在大多数领域中都是一种对称关系。此外,正如我们在理论上证明的那样,个体之间的对称性会导致任何一组子群体之间的对称性,这与其他群体公平性通常难以捉摸的公平概念形成鲜明对比。我们开发了 Sy-FAR,这是一种鼓励对称性的技术,同时优化对抗鲁棒性,并使用五个数据集和三个模型架构对其进行广泛评估,包括针对有针对性和非针对性的现实攻击。结果表明,与最先进的方法相比,Sy-FAR 显著提高了公平对抗鲁棒性。此外,我们发现 Sy-FAR 更快,并且在多次运行之间更一致。值得注意的是,Sy-FAR 还改善了我们在这项工作中发现的另一种不公平性:对抗性示例容易被归入的目标类在诱导对称性后变得明显不那么脆弱。
Subjects: Machine Learning, Artificial Intelligence, Cryptography and Security, Computer Vision and Pattern Recognition 科目: 机器学习 , 人工智能 , 密码学与安全 , 计算机视觉与模式识别
Publish: 2025-09-16 10:39:42 UTC 发布时间: 2025-09-16 10:39:42 UTC
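The symmetry notion in the Sy-FAR abstract (attacks from class i to j being as successful as from j to i) can be illustrated with a small measurement sketch. The function name `asymmetry_gap` and the toy success-rate matrices below are hypothetical and are not taken from the paper; this only shows one plausible way to quantify how far a matrix of attack success rates is from symmetric, not the Sy-FAR training objective.

```python
def asymmetry_gap(success):
    """Mean |success[i][j] - success[j][i]| over unordered class pairs.

    success[i][j] is assumed to hold the attack success rate when an input
    of class i is perturbed to be classified as class j (illustrative only).
    """
    n = len(success)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(abs(success[i][j] - success[j][i]) for i, j in pairs) / len(pairs)


# Toy matrices (made-up numbers): one perfectly symmetric, one skewed.
symmetric = [[0.0, 0.4, 0.1],
             [0.4, 0.0, 0.2],
             [0.1, 0.2, 0.0]]
skewed = [[0.0, 0.8, 0.1],
          [0.2, 0.0, 0.2],
          [0.1, 0.2, 0.0]]

gap_symmetric = asymmetry_gap(symmetric)
gap_skewed = asymmetry_gap(skewed)
```

A gap of zero means every pair of classes is equally easy to attack in both directions; note that this does not require all pairs to be equally robust, which is the relaxation the abstract argues for.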
#76 Jailbreaking Large Language Models Through Content Concretization #76 通过内容具体化越狱大型语言模型
Authors: [Johan Wahréus](https://arxiv.org/search/?searchtype=author&query=Johan Wahréus), [Ahmed Hussain](https://arxiv.org/search/?searchtype=author&query=Ahmed Hussain), [Panos Papadimitratos](https://arxiv.org/search/?searchtype=author&query=Panos Papadimitratos) 作者:Johan Wahréus、Ahmed Hussain、Panos Papadimitratos
Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier models with less constrained safety filters, then refining them through higher-tier models that process both the preliminary output and original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5¢ per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.
大型语言模型 (LLM) 越来越多地用于任务自动化和内容生成,但其安全机制仍然容易被不同的越狱技术规避。在本文中,我们介绍了内容具体化(Content Concretization,CC),这是一种新颖的越狱技术,可以迭代地将抽象的恶意请求转换为具体的、可执行的实现。CC 是一个两阶段的过程:首先,使用安全过滤限制较少的较低层模型生成初始 LLM 响应,然后通过处理初步输出和原始提示的更高层模型对其进行细化。我们使用 350 个特定于网络安全的提示来评估我们的技术,证明越狱成功率 (SR) 有了显着提高,在三次改进迭代后从 7%(无改进)增加到 62%,同时保持每个提示 7.5 美分的成本。对九个不同的 LLM 评估器进行的比较 A/B 测试证实,额外细化步骤的输出始终被评为更恶意且技术更优越。此外,手动代码分析表明,生成的输出只需最少的修改即可执行,尽管最佳部署通常需要特定于目标的微调。随着最终改进的有害代码生成,这些结果凸显了当前 LLM 安全框架中的关键漏洞。
Subjects: Cryptography and Security, Artificial Intelligence, Computation and Language 科目: 密码学与安全 , 人工智能 , 计算与语言
Publish: 2025-09-16 10:34:26 UTC 发布时间: 2025-09-16 10:34:26 UTC
#77 A Graph-Based Approach to Alert Contextualisation in Security Operations Centres #77 一种基于图形的安全运营中心警报情境化方法
Authors: [Magnus Wiik Eckhoff](https://arxiv.org/search/?searchtype=author&query=Magnus Wiik Eckhoff), [Peter Marius Flydal](https://arxiv.org/search/?searchtype=author&query=Peter Marius Flydal), [Siem Peters](https://arxiv.org/search/?searchtype=author&query=Siem Peters), [Martin Eian](https://arxiv.org/search/?searchtype=author&query=Martin Eian), [Jonas Halvorsen](https://arxiv.org/search/?searchtype=author&query=Jonas Halvorsen), [Vasileios Mavroeidis](https://arxiv.org/search/?searchtype=author&query=Vasileios Mavroeidis), [Gudmund Grov](https://arxiv.org/search/?searchtype=author&query=Gudmund Grov) 作者:Magnus Wiik Eckhoff、Peter Marius Flydal、Siem Peters、Martin Eian、Jonas Halvorsen、Vasileios Mavroeidis、Gudmund Grov
Interpreting the massive volume of security alerts is a significant challenge in Security Operations Centres (SOCs). Effective contextualisation is important, enabling quick distinction between genuine threats and benign activity to prioritise what needs further analysis. This paper proposes a graph-based approach to enhance alert contextualisation in a SOC by aggregating alerts into graph-based alert groups, where nodes represent alerts and edges denote relationships within defined time-windows. By grouping related alerts, we enable analysis at a higher abstraction level, capturing attack steps more effectively than individual alerts. Furthermore, to show that our format is well suited for downstream machine learning methods, we employ Graph Matching Networks (GMNs) to correlate incoming alert groups with historical incidents, providing analysts with additional insights. 解释海量安全警报是安全运营中心 (SOC) 面临的一项重大挑战。有效的情境化很重要,它可以快速区分真正的威胁和良性活动,从而确定需要进一步分析的优先级。本文提出了一种基于图的方法,通过将警报聚合到基于图的警报组中来增强 SOC 中的警报情境化,其中节点表示警报,边表示定义时间窗口内的关系。通过对相关警报进行分组,我们可以在更高的抽象级别进行分析,比单个警报更有效地捕获攻击步骤。此外,为了表明我们的格式非常适合下游机器学习方法,我们采用图匹配网络 (GMN) 将传入警报组与历史事件相关联,为分析师提供额外的见解。
Subjects: Cryptography and Security, Artificial Intelligence 主题: 密码学与安全 , 人工智能
Publish: 2025-09-16 10:20:39 UTC 发布时间: 2025-09-16 10:20:39 UTC
#78 All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning #78 条条大路通罗马:大型语言模型推理的基于图的置信度估计
Authors: [Caiqi Zhang](https://arxiv.org/search/?searchtype=author&query=Caiqi Zhang), [Chang Shu](https://arxiv.org/search/?searchtype=author&query=Chang Shu), [Ehsan Shareghi](https://arxiv.org/search/?searchtype=author&query=Ehsan Shareghi), [Nigel Collier](https://arxiv.org/search/?searchtype=author&query=Nigel Collier) 作者: Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier
Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks. 置信度估计对于大型语言模型 (LLM) 的可靠部署至关重要。现有方法主要为事实 QA 任务而设计,通常无法推广到推理任务。为了解决这一差距,我们提出了一套针对推理任务量身定制的免训练、基于图的置信度估计方法。我们的方法将推理路径建模为有向图,并通过利用中心性、路径收敛和路径加权等图属性来估计置信度。在三个推理数据集上使用两个 LLM 进行的实验表明,两个下游任务的置信度估计得到改进并性能得到增强。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 10:02:52 UTC 发布时间: 2025-09-16 10:02:52 UTC
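To make the idea of scoring answers by path convergence and path weighting more concrete, here is a minimal sketch. It is not the paper's method: the function `graph_confidence`, the weighting rule (average number of sampled paths sharing each edge), and the toy paths are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def graph_confidence(paths):
    """Score final answers from sampled reasoning paths (illustrative heuristic).

    Each path is a list of step labels ending in an answer label, with at
    least two elements. Paths are merged into a directed graph whose edge
    counts record how many paths share a transition. A path's weight is the
    mean count of its edges, so paths passing through widely shared steps
    count more; answer scores are normalised to sum to 1.
    """
    edge_count = Counter()
    for path in paths:
        for edge in zip(path, path[1:]):
            edge_count[edge] += 1

    score = defaultdict(float)
    for path in paths:
        edges = list(zip(path, path[1:]))
        weight = sum(edge_count[e] for e in edges) / len(edges)
        score[path[-1]] += weight

    total = sum(score.values())
    return {answer: s / total for answer, s in score.items()}


# Toy example: two paths converge on "42", one ends at "17".
toy_paths = [
    ["q", "a1", "b1", "42"],
    ["q", "a1", "b2", "42"],
    ["q", "a2", "b3", "17"],
]
confidence = graph_confidence(toy_paths)
```

In this toy case the two converging paths also share their first step, so "42" receives more than the plain two-thirds vote share it would get from majority voting alone.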
#79 Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models #79 跨层视觉平滑:通过持续关注大型视觉语言模型中的关键对象来增强视觉理解
Authors: [Jianfei Zhao](https://arxiv.org/search/?searchtype=author&query=Jianfei Zhao), [Feng Zhang](https://arxiv.org/search/?searchtype=author&query=Feng Zhang), [Xin Sun](https://arxiv.org/search/?searchtype=author&query=Xin Sun), [Lingxing Kong](https://arxiv.org/search/?searchtype=author&query=Lingxing Kong), [Zhixing Tan](https://arxiv.org/search/?searchtype=author&query=Zhixing Tan), [Chong Feng](https://arxiv.org/search/?searchtype=author&query=Chong Feng) 作者: 赵建飞, 张峰, 孙鑫, 孔凌星, 谭志兴, 峰冲
Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding. 大型视觉语言模型 (LVLM) 可以准确定位图像中的关键对象,但它们对这些对象的关注往往非常短暂。基于持续关注关键物体可以提高 LVLM 视觉能力的假设,我们提出了跨层视觉平滑 (CLVS)。CLVS 的核心思想是整合视觉记忆,以平滑各层的注意力分布。具体来说,我们在第一层用位置无偏视觉注意力初始化了这个视觉记忆。在后续层中,模型的视觉注意力共同考虑前几层的视觉记忆,同时对记忆进行迭代更新,从而保持对关键对象的平滑注意力。鉴于视觉理解主要发生在模型的早期和中间层,我们使用不确定性作为完成视觉理解的指标,并相应地终止平滑过程。在三个 LVLM 上的四个基准测试的实验证实了我们方法的有效性和通用性。CLVS 在各种视觉理解任务上实现了最先进的性能,在关系和属性理解方面取得了特别显着的改进。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 09:54:01 UTC 发布时间: 2025-09-16 09:54:01 UTC
#80 Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings #80 Conan-Embedding-v2:从头开始训练 LLM 进行文本嵌入
Authors: [Shiyu Li](https://arxiv.org/search/?searchtype=author&query=Shiyu Li), [Yang Tang](https://arxiv.org/search/?searchtype=author&query=Yang Tang), [Ruijie Liu](https://arxiv.org/search/?searchtype=author&query=Ruijie Liu), [Shi-Zhe Chen](https://arxiv.org/search/?searchtype=author&query=Shi-Zhe Chen), [Xi Chen](https://arxiv.org/search/?searchtype=author&query=Xi Chen) 作者:李世宇、唐阳、刘瑞杰、陈世哲、陈习
Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025). 大型语言模型(LLM)最近在文本嵌入任务中表现出了出色的性能。以前的工作通常使用 LoRA 来微调现有的 LLM,这些 LLM 受到 LLM 和嵌入模型之间的数据和训练差距的限制。在这项工作中,我们介绍了 Conan-embedding-v2,这是一种新的 1.4B 参数 LLM,从头开始训练并作为文本嵌入器进行微调。首先,我们添加新闻数据和多语言对进行 LLM 预训练,以弥合数据差距。基于此,我们提出了一个跨语言检索数据集,使 LLM 能够更好地整合不同语言的嵌入。其次,LLM 使用具有 token 级损失的因果掩码,而嵌入模型使用具有句子级损失的双向掩码。这种训练差距使得完全微调不如 LoRA 有效。我们引入了一种软掩码机制,在这两种类型的掩码之间逐渐过渡,使模型能够学习更全面的表示。基于此,我们提出了一种动态难负例挖掘方法,该方法使模型在整个训练过程中接触到更困难的负例。Conan-embedding-v2 直观有效,仅具有大约 1.4B 的参数,在海量文本嵌入基准测试 (MTEB) 和中文 MTEB 上都实现了 SOTA 性能(2025 年 5 月 19 日)。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 09:48:11 UTC 发布时间: 2025-09-16 09:48:11 UTC
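The soft-masking idea in the Conan-embedding-v2 abstract (moving gradually from a causal mask to a bidirectional one) might look roughly like the sketch below. The abstract does not give the schedule or the mask parameterisation, so the linear ramp, the function name `soft_mask`, and the multiplicative 0-to-1 mask values are assumptions made purely for illustration.

```python
def soft_mask(seq_len, step, total_steps):
    """Blend a causal mask into a bidirectional mask (assumed linear schedule).

    Entry [i][j] is the weight with which position i may attend to position j.
    Past and current positions (j <= i) are always fully visible; future
    positions are opened up by alpha, which ramps from 0 (causal) to 1
    (bidirectional) as training progresses.
    """
    alpha = min(1.0, max(0.0, step / total_steps))
    return [[1.0 if j <= i else alpha for j in range(seq_len)]
            for i in range(seq_len)]


start_mask = soft_mask(3, step=0, total_steps=10)    # purely causal
middle_mask = soft_mask(3, step=5, total_steps=10)   # future positions at 0.5
final_mask = soft_mask(3, step=10, total_steps=10)   # fully bidirectional
```

In a real attention layer such a mask would typically be applied to attention scores or weights rather than used as a plain matrix; that wiring is omitted here.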
#81 Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing #81 用于整流反演和语义编辑的龙格-库塔近似和解耦注意力
Authors: [Weiming Chen](https://arxiv.org/search/?searchtype=author&query=Weiming Chen), [Zhihan Zhu](https://arxiv.org/search/?searchtype=author&query=Zhihan Zhu), [Yijia Wang](https://arxiv.org/search/?searchtype=author&query=Yijia Wang), [Zhihai He](https://arxiv.org/search/?searchtype=author&query=Zhihai He) 作者:陈伟明,朱志涵,王一佳,何志海
Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA. 与基于 DDIM 的扩散模型相比,整流(RF)模型最近表现出了卓越的生成性能。然而,在实际应用中,它们面临两大挑战:(1)反演精度低,阻碍了与源图像的一致性,以及(2)扩散变压器中的多模态注意力纠缠,阻碍了精确的注意力控制。为了解决第一个挑战,我们提出了一种基于微分方程的 Runge-Kutta 求解器的整流模型的高效高阶反演方法。为了应对第二个挑战,我们引入了解耦扩散转换器注意力(DDTA),这是一种新颖的机制,可以解开多模态扩散转换器内部的文本和图像注意力,从而实现更精确的语义控制。对图像重建和文本引导编辑任务的广泛实验表明,我们的方法在保真度和可编辑性方面实现了最先进的性能。代码可在 https://github.com/wmchen/RKSovler_DDTA 获得。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 09:41:14 UTC 发布时间: 2025-09-16 09:41:14 UTC
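Why a higher-order solver helps inversion can be seen on a toy ODE. The sketch below compares a first-order Euler step with the classical fourth-order Runge-Kutta step on a round trip (integrate forward, then integrate back) and measures how far the result lands from the starting point. The scalar velocity field `toy_velocity` and the step count are assumptions for illustration; a real rectified flow model would supply a learned, high-dimensional velocity field, and the paper's specific solver design is not reproduced here.

```python
def euler_step(v, x, t, h):
    """One explicit Euler step for dx/dt = v(x, t)."""
    return x + h * v(x, t)


def rk4_step(v, x, t, h):
    """One classical fourth-order Runge-Kutta step for dx/dt = v(x, t)."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = v(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, v, x, t0, t1, n):
    """Integrate from t0 to t1 in n equal steps (h is negative if t1 < t0)."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        x = step(v, x, t, h)
        t += h
    return x


def round_trip_error(step, v, x0, n=10):
    """Integrate 0 -> 1, then 1 -> 0, and report the distance from x0."""
    x1 = integrate(step, v, x0, 0.0, 1.0, n)
    x_back = integrate(step, v, x1, 1.0, 0.0, n)
    return abs(x_back - x0)


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field."""
    return x


euler_error = round_trip_error(euler_step, toy_velocity, 1.0)
rk4_error = round_trip_error(rk4_step, toy_velocity, 1.0)
```

With the same number of steps, the Runge-Kutta round trip returns far closer to the starting point than Euler, at the cost of four velocity evaluations per step instead of one.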
#82 The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations #82 LLM 早已知道:通过隐藏表示估计 LLM 感知的问题难度
Authors: [Yubo Zhu](https://arxiv.org/search/?searchtype=author&query=Yubo Zhu), [Dongrui Liu](https://arxiv.org/search/?searchtype=author&query=Dongrui Liu), [Zecheng Lin](https://arxiv.org/search/?searchtype=author&query=Zecheng Lin), [Wei Tong](https://arxiv.org/search/?searchtype=author&query=Wei Tong), [Sheng Zhong](https://arxiv.org/search/?searchtype=author&query=Sheng Zhong), [Jing Shao](https://arxiv.org/search/?searchtype=author&query=Jing Shao) 作者: Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens. 估计大型语言模型 (LLM) 感知的输入问题的难度对于准确的性能评估和自适应推理至关重要。现有方法通常依赖于重复响应采样、辅助模型或对目标模型本身进行微调,这可能会产生大量的计算成本或损害通用性。在本文中,我们提出了一种新的难度估计方法,该方法仅利用目标 LLM 产生的隐藏表示。我们将 token 级生成过程建模为马尔可夫链,并定义一个值函数来估计给定任何隐藏状态的预期输出质量。这允许仅根据初始隐藏状态进行高效、准确的难度估计,而无需生成任何输出 token。在文本和多模态任务中的广泛实验表明,我们的方法在难度估计方面始终优于现有基线。此外,我们应用难度估计来指导自适应推理策略,包括 Self-Consistency、Best-of-N 和 Self-Refine,以更少的生成 token 实现更高的推理效率。
Subjects: Computation and Language, Artificial Intelligence 科目: 计算与语言 , 人工智能
Publish: 2025-09-16 09:38:41 UTC 发布时间: 2025-09-16 09:38:41 UTC
#83 AI Factories: It's time to rethink the Cloud-HPC divide #83 AI 工厂:是时候重新思考云与 HPC 的鸿沟了
Authors: [Pedro Garcia Lopez](https://arxiv.org/search/?searchtype=author&query=Pedro Garcia Lopez), [Daniel Barcelona Pons](https://arxiv.org/search/?searchtype=author&query=Daniel Barcelona Pons), [Marcin Copik](https://arxiv.org/search/?searchtype=author&query=Marcin Copik), [Torsten Hoefler](https://arxiv.org/search/?searchtype=author&query=Torsten Hoefler), [Eduardo Quiñones](https://arxiv.org/search/?searchtype=author&query=Eduardo Quiñones), [Maciej Malawski](https://arxiv.org/search/?searchtype=author&query=Maciej Malawski), [Peter Pietzutch](https://arxiv.org/search/?searchtype=author&query=Peter Pietzutch), [Alberto Marti](https://arxiv.org/search/?searchtype=author&query=Alberto Marti), [Thomas Ohlson Timoudas](https://arxiv.org/search/?searchtype=author&query=Thomas Ohlson Timoudas), [Aleksander Slominski](https://arxiv.org/search/?searchtype=author&query=Aleksander Slominski) 作者:佩德罗·加西亚·洛佩兹、丹尼尔·巴塞罗那·庞斯、马辛·科皮克、托斯滕·霍夫勒、爱德华多·奎诺内斯、马切伊·马拉夫斯基、彼得·皮祖奇、阿尔贝托·马蒂、托马斯·奥尔森·蒂穆达斯、亚历山大·斯洛明斯基
The strategic importance of artificial intelligence is driving a global push toward Sovereign AI initiatives. National governments are increasingly developing dedicated infrastructures, called AI Factories (AIF), to achieve technological autonomy and secure the resources necessary to sustain robust local digital ecosystems. In Europe, the EuroHPC Joint Undertaking is investing hundreds of millions of euros into several AI Factories, built atop existing high-performance computing (HPC) supercomputers. However, while HPC systems excel in raw performance, they are not inherently designed for usability, accessibility, or serving as public-facing platforms for AI services such as inference or agentic applications. In contrast, AI practitioners are accustomed to cloud-native technologies like Kubernetes and object storage, tools that are often difficult to integrate within traditional HPC environments. This article advocates for a dual-stack approach within supercomputers: integrating both HPC and cloud-native technologies. Our goal is to bridge the divide between HPC and cloud computing by combining high performance and hardware acceleration with ease of use and service-oriented front-ends. This convergence allows each paradigm to amplify the other. To this end, we will study the cloud challenges of HPC (Serverless HPC) and the HPC challenges of cloud technologies (High-performance Cloud). 人工智能的战略重要性正在推动全球范围内的主权人工智能计划。各国政府越来越多地开发称为人工智能工厂 (AIF) 的专用基础设施,以实现技术自主并获得维持强大的本地数字生态系统所需的资源。在欧洲,EuroHPC 联合事业正在投资数亿欧元建设多个人工智能工厂,这些工厂建立在现有的高性能计算 (HPC) 超级计算机之上。然而,虽然 HPC 系统在原始性能方面表现出色,但它们本质上并不是为可用性、可访问性而设计的,也不是作为推理或智能体应用等人工智能服务面向公众的平台。相比之下,人工智能从业者习惯了 Kubernetes 和对象存储等云原生技术,这些工具通常难以集成到传统 HPC 环境中。本文提倡在超级计算机中采用双栈方法:集成 HPC 和云原生技术。我们的目标是通过将高性能和硬件加速与易用性和面向服务的前端相结合,弥合 HPC 和云计算之间的鸿沟。这种融合使两种范式能够相互增强。为此,我们将研究 HPC 的云挑战(Serverless HPC)和云技术的 HPC 挑战(High-performance Cloud)。
Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence 主题:分布式、并行和集群计算 , 人工智能
Publish: 2025-09-16 09:08:05 UTC 发布时间: 2025-09-16 09:08:05 UTC
#84 Improving Anomalous Sound Detection with Attribute-aware Representation from Domain-adaptive Pre-training #84 通过域自适应预训练的属性感知表示改进异常声音检测
Authors: [Xin Fang](https://arxiv.org/search/?searchtype=author&query=Xin Fang), [Guirui Zhong](https://arxiv.org/search/?searchtype=author&query=Guirui Zhong), [Qing Wang](https://arxiv.org/search/?searchtype=author&query=Qing Wang), [Fan Chu](https://arxiv.org/search/?searchtype=author&query=Fan Chu), [Lei Wang](https://arxiv.org/search/?searchtype=author&query=Lei Wang), [Mengui Qian](https://arxiv.org/search/?searchtype=author&query=Mengui Qian), [Mingqi Cai](https://arxiv.org/search/?searchtype=author&query=Mingqi Cai), [Jiangzhao Wu](https://arxiv.org/search/?searchtype=author&query=Jiangzhao Wu), [Jianqing Gao](https://arxiv.org/search/?searchtype=author&query=Jianqing Gao), [Jun Du](https://arxiv.org/search/?searchtype=author&query=Jun Du) 作者: 方鑫, 钟贵瑞, 王庆, 樊楚, 王磊, 钱孟贵, 蔡明奇, 吴江钊, 高建清, 杜军
Anomalous Sound Detection (ASD) is often formulated as a machine attribute classification task, a strategy necessitated by the common scenario where only normal data is available for training. However, the exhaustive collection of machine attribute labels is laborious and impractical. To address the challenge of missing attribute labels, this paper proposes an agglomerative hierarchical clustering method for the assignment of pseudo-attribute labels using representations derived from a domain-adaptive pre-trained model, which are expected to capture machine attribute characteristics. We then apply model adaptation to this pre-trained model through supervised fine-tuning for machine attribute classification, resulting in a new state-of-the-art performance. Evaluation on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge dataset demonstrates that our proposed approach yields significant performance gains, ultimately outperforming our previous top-ranking system in the challenge. 异常声音检测 (ASD) 通常被表述为机器属性分类任务,这是只有正常数据可用于训练的常见场景所必需的策略。然而,穷尽地收集机器属性标签既费力又不切实际。为了解决属性标签缺失的挑战,本文提出了一种聚集分层聚类方法,利用源自域自适应预训练模型的表示来分配伪属性标签,这些表示有望捕获机器属性特征。然后,我们通过对机器属性分类进行监督微调,将模型适配应用于该预训练模型,从而产生新的最先进性能。对声学场景和事件的检测和分类 (DCASE) 2025 挑战数据集的评估表明,我们提出的方法产生了显着的性能提升,最终在挑战中优于我们之前的顶级系统。
Subjects: Sound, Artificial Intelligence 主题:声音、人工智能
Publish: 2025-09-16 09:05:41 UTC 发布时间: 2025-09-16 09:05:41 UTC
#85 Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models #85 通过大语言模型利用分布式现场知识进行多目标检索任务的多机器人任务规划
Authors: [Kento Murata](https://arxiv.org/search/?searchtype=author&query=Kento Murata), [Shoichi Hasegawa](https://arxiv.org/search/?searchtype=author&query=Shoichi Hasegawa), [Tomochika Ishikawa](https://arxiv.org/search/?searchtype=author&query=Tomochika Ishikawa), [Yoshinobu Hagiwara](https://arxiv.org/search/?searchtype=author&query=Yoshinobu Hagiwara), [Akira Taniguchi](https://arxiv.org/search/?searchtype=author&query=Akira Taniguchi), [Lotfi El Hafi](https://arxiv.org/search/?searchtype=author&query=Lotfi El Hafi), [Tadahiro Taniguchi](https://arxiv.org/search/?searchtype=author&query=Tadahiro Taniguchi) 作者:Kento Murata、Shoichi Hasegawa、Tomochika Ishikawa、Yoshinobu Hagiwara、Akira Taniguchi、Lotfi El Hafi、Tadahiro Taniguchi
It is crucial to efficiently execute instructions such as “Find an apple and a banana” or “Get ready for a field trip,” which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge, specifically the spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as “Get ready for a field trip,” by successfully performing task decomposition, assignment, sequential planning, and execution. 有效执行“找到一个苹果和一根香蕉”或“为实地考察做准备”等指令至关重要,这些指令需要搜索多个对象或理解上下文相关的命令。本研究解决了一个具有挑战性的问题,即当每个机器人拥有不同的情境现场知识(特别是从用户指定的区域学习的空间概念)时,确定哪个机器人应该被分配到任务的哪个部分。我们提出了一个任务规划框架,利用大型语言模型(LLM)和空间概念将自然语言指令分解为子任务,并将其分配给多个机器人。我们设计了一种新颖的少样本提示策略,使 LLM 能够从模棱两可的命令中推断出所需的对象,并将它们分解为适当的子任务。在我们的实验中,所提出的方法实现了 47/50 的成功分配,优于随机(28/50)和基于常识的分配(26/50)。此外,我们还使用两个实际的移动机械手进行了定性评估。结果表明,我们的框架可以通过成功执行任务分解、分配、顺序计划和执行来处理指令,包括涉及“为实地考察做好准备”等临时类别的指令。
Subjects: Robotics, Artificial Intelligence, Multiagent Systems 科目: 机器人技术 , 人工智能 , 多智能体系统
Publish: 2025-09-16 09:00:25 UTC 发布时间: 2025-09-16 09:00:25 UTC
#86 A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis #86 用于嘈杂语音语音克隆和准确口型同步合成的轻量级管道
Authors: [Javeria Amir](https://arxiv.org/search/?searchtype=author&query=Javeria Amir), [Farwa Attaria](https://arxiv.org/search/?searchtype=author&query=Farwa Attaria), [Mah Jabeen](https://arxiv.org/search/?searchtype=author&query=Mah Jabeen), [Umara Noor](https://arxiv.org/search/?searchtype=author&query=Umara Noor), [Zahid Rashid](https://arxiv.org/search/?searchtype=author&query=Zahid Rashid) 作者:Javeria Amir、Farwa Attaria、Mah Jabeen、Umara Noor、Zahid Rashid
Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods typically require large-scale datasets and computationally intensive training on clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few training samples. We use a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution is intended to reduce reliance on massive pre-training when generating emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension to future multimodal and text-guided voice modulation, and it could be used in real-world systems. 语音克隆和说话人头像生成的最新发展展示了在合成自然语音和逼真的口型同步方面令人印象深刻的能力。当前的方法通常需要大规模数据集,并在干净的录音室录制输入上进行计算密集型训练,这在嘈杂或资源不足的环境中是不可行的。在本文中,我们介绍了一个围绕 Tortoise 文本转语音构建的新模块化管道,Tortoise 是一种基于 Transformer 的潜在扩散模型,只需几个训练样本即可执行高保真零样本语音克隆。我们使用轻量级的生成对抗网络架构来实现稳健的实时口型同步。该解决方案旨在减少在嘈杂和不受约束的场景中生成情感表达语音和口型同步时对大规模预训练的依赖。该管道的模块化结构便于扩展到未来的多模态和文本引导语音调制,并且可以在现实世界的系统中使用。
Subjects: Sound, Artificial Intelligence 主题:声音、人工智能
Publish: 2025-09-16 08:55:40 UTC 发布时间: 2025-09-16 08:55:40 UTC
#87 A Pressure-Based Diffusion Model for Influence Maximization on Social Networks #87 社交网络影响力最大化的基于压力的扩散模型
Authors: [Curt Stutsman](https://arxiv.org/search/?searchtype=author&query=Curt Stutsman), [Eliot W. Robson](https://arxiv.org/search/?searchtype=author&query=Eliot W. Robson), [Abhishek K. Umrawal](https://arxiv.org/search/?searchtype=author&query=Abhishek K. Umrawal) 作者:Curt Stutsman、Eliot W. Robson、Abhishek K. Umrawal
In many real-world scenarios, an individual’s local social network carries significant influence over the opinions they form and subsequently propagate to others. In this paper, we propose a novel diffusion model – the Pressure Threshold model (PT) – for dynamically simulating the spread of influence through a social network. This new model extends the popular Linear Threshold Model (LT) by adjusting a node’s outgoing influence proportional to the influence it receives from its activated neighbors. We address the Influence Maximization (IM) problem, which involves selecting the most effective seed nodes to achieve maximal graph coverage after a diffusion process, and how the problem manifests with the PT Model. Experiments conducted on real-world networks, facilitated by enhancements to the open-source network-diffusion Python library, CyNetDiff, demonstrate unique seed node selection for the PT Model when compared to the LT Model. Moreover, analyses demonstrate that densely connected networks amplify pressure effects more significantly than sparse networks. 在许多现实场景中,个人的本地社交网络对他们形成并随后传播给他人的观点具有重大影响。在本文中,我们提出了一种新的扩散模型——压力阈值模型(PT)——用于动态模拟通过社交网络传播的影响。这个新模型扩展了流行的线性阈值模型 (LT),通过调整节点的传出影响与其从激活的邻居那里接收到的影响成正比。我们解决了影响最大化 (IM) 问题,该问题涉及选择最有效的种子节点以在扩散过程后实现最大的图覆盖率,以及问题如何通过 PT 模型表现出来。在开源网络扩散 Python 库 CyNetDiff 的增强下,在真实世界网络上进行的实验证明了与 LT 模型相比,PT 模型具有独特的种子节点选择。此外,分析表明,密集连接的网络比稀疏网络更显着地放大压力效应。
Subjects: Social and Information Networks, Artificial Intelligence 科目: 社会与信息网络 , 人工智能
Publish: 2025-09-16 08:47:00 UTC 发布时间: 2025-09-16 08:47:00 UTC
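The difference between the Linear Threshold (LT) model and a pressure-style extension can be sketched in a few lines. The abstract only says that a node's outgoing influence is adjusted in proportion to the influence it receives from activated neighbors, so the exact scaling rule below (`1 + pressure * received`), the function name `pt_spread`, and the toy graph are assumptions for illustration. This is plain Python and does not use or imitate the CyNetDiff API.

```python
def pt_spread(edges, thresholds, seeds, pressure=1.0):
    """Synchronous threshold diffusion with an assumed pressure rule.

    edges:      dict mapping (source, target) -> base influence weight
    thresholds: dict mapping every node -> activation threshold
    seeds:      initially active nodes
    pressure:   0.0 recovers plain Linear Threshold behaviour; larger values
                amplify the outgoing influence of nodes that were activated
                under strong incoming influence (assumed rule).
    """
    active = set(seeds)
    out_scale = {s: 1.0 for s in seeds}
    while True:
        newly_active = {}
        for node, theta in thresholds.items():
            if node in active:
                continue
            received = sum(
                w * out_scale[u]
                for (u, v), w in edges.items()
                if v == node and u in active
            )
            if received >= theta:
                newly_active[node] = received
        if not newly_active:
            return active
        for node, received in newly_active.items():
            active.add(node)
            out_scale[node] = 1.0 + pressure * received


# Toy chain a -> b -> c (made-up weights and thresholds).
toy_edges = {("a", "b"): 0.6, ("b", "c"): 0.4}
toy_thresholds = {"a": 1.0, "b": 0.5, "c": 0.5}

lt_result = pt_spread(toy_edges, toy_thresholds, {"a"}, pressure=0.0)
pt_result = pt_spread(toy_edges, toy_thresholds, {"a"}, pressure=1.0)
```

In this toy chain, node c stays inactive under plain LT because b's base influence of 0.4 is below c's threshold, while under the assumed pressure rule b's influence is amplified enough to activate c. This is the kind of effect that can change which seed nodes are most valuable.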
#88 Data Scaling Laws for Radiology Foundation Models #88 放射学基础模型的数据缩放定律
Authors: [Maximilian Ilse](https://arxiv.org/search/?searchtype=author&query=Maximilian Ilse), [Harshita Sharma](https://arxiv.org/search/?searchtype=author&query=Harshita Sharma), [Anton Schwaighofer](https://arxiv.org/search/?searchtype=author&query=Anton Schwaighofer), [Sam Bond-Taylor](https://arxiv.org/search/?searchtype=author&query=Sam Bond-Taylor), [Fernando Pérez-García](https://arxiv.org/search/?searchtype=author&query=Fernando Pérez-García), [Olesya Melnichenko](https://arxiv.org/search/?searchtype=author&query=Olesya Melnichenko), [Anne-Marie G. Sykes](https://arxiv.org/search/?searchtype=author&query=Anne-Marie G. Sykes), [Kelly K. Horst](https://arxiv.org/search/?searchtype=author&query=Kelly K. Horst), [Ashish Khandelwal](https://arxiv.org/search/?searchtype=author&query=Ashish Khandelwal), [Maxwell Reynolds](https://arxiv.org/search/?searchtype=author&query=Maxwell Reynolds), [Maria T. Wetscherek](https://arxiv.org/search/?searchtype=author&query=Maria T. Wetscherek), [Noel C. F. Codella](https://arxiv.org/search/?searchtype=author&query=Noel C. F. Codella), [Javier Alvarez-Valle](https://arxiv.org/search/?searchtype=author&query=Javier Alvarez-Valle), [Korfiatis Panagiotis](https://arxiv.org/search/?searchtype=author&query=Korfiatis Panagiotis), [Valentina Salvatelli](https://arxiv.org/search/?searchtype=author&query=Valentina Salvatelli) 作者:Maximilian Ilse、Harshita Sharma、Anton Schwaighofer、Sam Bond-Taylor、Fernando Pérez-García、Olesya Melnichenko、Anne-Marie G. Sykes、Kelly K. Horst、Ashish Khandelwal、Maxwell Reynolds、Maria T. Wetscherek、Noel CF Codella、Javier Alvarez-Valle、Korfiatis Panagiotis、Valentina Salvatelli
Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model’s ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data. 
CLIP 和 DINOv2 等基础视觉编码器在网络规模数据上进行了训练,在任务和数据集之间表现出强大的迁移性能。然而,医学成像基础模型仍然受到较小数据集的限制,限制了我们对数据规模和预训练范式如何影响这种情况下性能的理解。在这项工作中,我们系统地研究了两种视觉编码器 MedImageInsight (MI2) 和 RAD-DINO(分别代表 CLIP 和 DINOv2 两大编码器范式)的持续预训练,训练数据为来自单个机构的多达 350 万张胸部 X 光片,并保持计算和评估协议不变。我们在分类(放射学发现、管线与导管)、分割(管线与导管)和放射学报告生成上进行评估。虽然之前的工作主要集中在与放射学发现相关的任务上,但我们加入了管线与导管任务来抵消这种偏差,并评估模型提取沿细长结构保持连续性的特征的能力。我们的实验表明,MI2 在与放射学发现相关的任务上扩展得更有效,而 RAD-DINO 在与导管相关的任务上更强。令人惊讶的是,使用 UniCL 同时利用报告和结构化标签持续预训练 MI2 可以提高性能,凸显了大规模结构化监督的价值。我们进一步表明,对于某些任务,只需 30k 个域内样本就足以超越开放权重基础模型。这些结果凸显了针对特定中心的持续预训练的效用,使医疗机构能够通过利用域内数据获得显著的性能提升。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 08:36:06 UTC 发布时间: 2025-09-16 08:36:06 UTC
#89 Gesture Evaluation in Virtual Reality #89 虚拟现实中的手势评估
Authors: [Axel Wiebe Werner](https://arxiv.org/search/?searchtype=author&query=Axel Wiebe Werner), [Jonas Beskow](https://arxiv.org/search/?searchtype=author&query=Jonas Beskow), [Anna Deichler](https://arxiv.org/search/?searchtype=author&query=Anna Deichler) 作者:Axel Wiebe Werner、Jonas Beskow、Anna Deichler
Gestures are central to human communication, enriching interactions through non-verbal expression. Virtual avatars increasingly use AI-generated gestures to enhance life-likeness, yet evaluations have largely been confined to 2D. Virtual Reality (VR) provides an immersive alternative that may affect how gestures are perceived. This paper presents a comparative evaluation of computer-generated gestures in VR and 2D, examining three models from the 2023 GENEA Challenge. Results show that gestures viewed in VR were rated slightly higher on average, with the strongest effect observed for motion-capture “true movement.” While model rankings remained consistent across settings, VR influenced participants’ overall perception and offered unique benefits over traditional 2D evaluation. 手势是人类交流的核心,通过非语言表达丰富互动。虚拟化身越来越多地使用人工智能生成的手势来增强逼真感,但评估在很大程度上仅限于 2D。虚拟现实 (VR) 提供了一种身临其境的替代方案,可能会影响手势的感知方式。本文对 VR 和 2D 中计算机生成的手势进行了比较评估,检查了 2023 年 GENEA 挑战赛中的三个模型。结果表明,在 VR 中观看的手势平均评分略高,其中观察到的动作捕捉“真实运动”效果最强。虽然模型排名在不同环境中保持一致,但 VR 影响了参与者的整体感知,并提供了比传统 2D 评估的独特优势。
Subjects: Human-Computer Interaction, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 科目:人机交互、人工智能、计算机视觉与模式识别、机器学习
Publish: 2025-09-16 08:35:37 UTC 发布时间: 2025-09-16 08:35:37 UTC
#90 LLM-Based Approach for Enhancing Maintainability of Automotive Architectures #90 基于 LLM 的方法,用于增强汽车架构的可维护性
Authors: [Nenad Petrovic](https://arxiv.org/search/?searchtype=author&query=Nenad Petrovic), [Lukasz Mazur](https://arxiv.org/search/?searchtype=author&query=Lukasz Mazur), [Alois Knoll](https://arxiv.org/search/?searchtype=author&query=Alois Knoll) 作者:内纳德·彼得罗维奇、卢卡斯·马祖尔、阿洛伊斯·诺尔
There are many bottlenecks that decrease the flexibility of automotive systems, making their long-term maintenance, as well as updates and extensions in later lifecycle phases increasingly difficult, mainly due to long re-engineering, standardization, and compliance procedures, as well as heterogeneity and numerosity of devices and underlying software components involved. In this paper, we explore the potential of Large Language Models (LLMs) when it comes to the automation of tasks and processes that aim to increase the flexibility of automotive systems. Three case studies towards achieving this goal are considered as outcomes of early-stage research: 1) updates, hardware abstraction, and compliance, 2) interface compatibility checking, and 3) architecture modification suggestions. For proof-of-concept implementation, we rely on OpenAI’s GPT-4o model. 存在许多瓶颈,降低了汽车系统的灵活性,使其长期维护以及后期生命周期阶段的更新和扩展变得越来越困难,这主要是由于漫长的重新设计、标准化和合规程序,以及所涉及的设备和底层软件组件的异构性和数量性。在本文中,我们探讨了大型语言模型 (LLM) 在旨在提高汽车系统灵活性的任务和流程自动化方面的潜力。实现这一目标的三个案例研究被认为是早期研究的结果:1) 更新、硬件抽象和合规性,2) 接口兼容性检查,以及 3) 架构修改建议。对于概念验证的实现,我们依赖于 OpenAI 的 GPT-4o 模型。
Subjects: Software Engineering, Artificial Intelligence 科目: 软件工程 , 人工智能
Publish: 2025-09-16 08:17:41 UTC 发布时间: 2025-09-16 08:17:41 UTC
#91 CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT #91 CECT-Mamba:多相 CECT 胰腺肿瘤亚型的分层对比增强感知模型
Authors: [Zhifang Gong](https://arxiv.org/search/?searchtype=author&query=Zhifang Gong), [Shuo Gao](https://arxiv.org/search/?searchtype=author&query=Shuo Gao), [Ben Zhao](https://arxiv.org/search/?searchtype=author&query=Ben Zhao), [Yingjing Xu](https://arxiv.org/search/?searchtype=author&query=Yingjing Xu), [Yijun Yang](https://arxiv.org/search/?searchtype=author&query=Yijun Yang), [Shenghong Ju](https://arxiv.org/search/?searchtype=author&query=Shenghong Ju), [Guangquan Zhou](https://arxiv.org/search/?searchtype=author&query=Guangquan Zhou) 作者: 龚志芳, 高硕, 赵本, 徐莹晶, 杨怡军, 鞠胜宏, 周广泉
Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists’ diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool. 
对比增强计算机断层扫描 (CECT) 是主要的成像技术,可提供有关病变的有价值的时空信息,从而能够准确诊断和亚分类胰腺肿瘤。然而,胰腺肿瘤的高度异质性和变异性仍然对精确的亚型诊断提出了重大挑战。以前的方法无法有效探索放射科医生诊断工作流程中常用的多个 CECT 阶段的上下文信息,从而限制了它们的性能。在本文中,我们首次引入了一种自动结合多相 CECT 数据来区分胰腺肿瘤亚型的方法,其中关键是使用具有可学习性和简单性的 Mamba 来鼓励多相 CECT 进行时间和空间建模。具体来说,我们提出了一种双层次对比增强感知 Mamba 模块,该模块结合了两个新的空间和时间采样序列,以探索病变的相内和相间对比变化。在时间扫描建模中还加入了相似性引导的细化模块,以强调对时间变化更明显的局部肿瘤区域的学习。此外,我们设计了空间互补积分器和多粒度融合模块,对不同尺度的语义进行编码和聚合,实现更高效的胰腺肿瘤亚型学习。在包含 270 个临床病例的内部数据集上的实验结果在区分胰腺导管腺癌 (PDAC) 和胰腺神经内分泌肿瘤 (PNET) 方面达到了 97.4% 的准确率和 98.6% 的 AUC,展示了其作为更准确和高效的工具的潜力。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-16 07:48:11 UTC 发布时间: 2025-09-16 07:48:11 UTC
#92 EmbeddedML: A New Optimized and Fast Machine Learning Library #92 EmbeddedML:一个新的优化和快速机器学习库
Authors: [Halil Hüseyin Çalışkan](https://arxiv.org/search/?searchtype=author&query=Halil Hüseyin Çalışkan), [Talha Koruk](https://arxiv.org/search/?searchtype=author&query=Talha Koruk) 作者: Halil Hüseyin Çalışkan, Talha Koruk
Machine learning models and libraries can train datasets of different sizes and perform prediction and classification operations, but machine learning models and libraries cause slow and long training times on large datasets. This article introduces EmbeddedML, a training-time-optimized and mathematically enhanced machine learning library. The speed was increased by approximately times compared to scikit-learn without any loss in terms of accuracy in regression models such as Multiple Linear Regression. Logistic Regression and Support Vector Machines (SVM) algorithms have been mathematically rewritten to reduce training time and increase accuracy in classification models. With the applied mathematical improvements, training time has been reduced by approximately 2 times for SVM on small datasets and by around 800 times on large datasets, and by approximately 4 times for Logistic Regression, compared to the scikit-learn implementation. In summary, the EmbeddedML library offers regression, classification, clustering, and dimensionality reduction algorithms that are mathematically rewritten and optimized to reduce training time. 机器学习模型和库可以在不同大小的数据集上训练并执行预测和分类操作,但在大型数据集上训练缓慢、耗时较长。本文介绍 EmbeddedML,这是一个经过训练时间优化和数学增强的机器学习库。与 scikit-learn 相比,速度提高了大约几倍,而在回归模型(如多元线性回归)的准确性方面没有任何损失。逻辑回归和支持向量机 (SVM) 算法已经过数学重写,以减少训练时间并提高分类模型的准确性。与 scikit-learn 实现相比,通过应用数学改进,小型数据集上 SVM 的训练时间减少了约 2 倍,大型数据集上的训练时间减少了约 800 倍,逻辑回归的训练时间减少了约 4 倍。总之,EmbeddedML 库提供了回归、分类、聚类和降维算法,这些算法经过数学重写和优化,以减少训练时间。
Subjects: Machine Learning, Artificial Intelligence 科目: 机器学习 , 人工智能
Publish: 2025-09-16 07:44:37 UTC 发布时间: 2025-09-16 07:44:37 UTC
#93 MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos #93 MEGAN:内窥镜视频中稳健不确定性估计的专家组合
Authors: [Damola Agbelese](https://arxiv.org/search/?searchtype=author&query=Damola Agbelese), [Krishna Chaitanya](https://arxiv.org/search/?searchtype=author&query=Krishna Chaitanya), [Pushpak Pati](https://arxiv.org/search/?searchtype=author&query=Pushpak Pati), [Chaitanya Parmar](https://arxiv.org/search/?searchtype=author&query=Chaitanya Parmar), [Pooya Mobadersany](https://arxiv.org/search/?searchtype=author&query=Pooya Mobadersany), [Shreyas Fadnavis](https://arxiv.org/search/?searchtype=author&query=Shreyas Fadnavis), [Lindsey Surace](https://arxiv.org/search/?searchtype=author&query=Lindsey Surace), [Shadi Yarandi](https://arxiv.org/search/?searchtype=author&query=Shadi Yarandi), [Louis R. Ghanem](https://arxiv.org/search/?searchtype=author&query=Louis R. Ghanem), [Molly Lucas](https://arxiv.org/search/?searchtype=author&query=Molly Lucas), [Tommaso Mansi](https://arxiv.org/search/?searchtype=author&query=Tommaso Mansi), [Oana Gabriela Cula](https://arxiv.org/search/?searchtype=author&query=Oana Gabriela Cula), [Pablo F. Damasceno](https://arxiv.org/search/?searchtype=author&query=Pablo F. Damasceno), [Kristopher Standish](https://arxiv.org/search/?searchtype=author&query=Kristopher Standish) 作者:Damola Agbelese、Krishna Chaitanya、Pushpak Pati、Chaitanya Parmar、Pooya Mobadersany、Shreyas Fadnavis、Lindsey Surace、Shadi Yarandi、Louis R. Ghanem、Molly Lucas、Tommaso Mansi、Oana Gabriela Cula、Pablo F. Damasceno、Kristopher Standish
Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert’s annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN’s gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for Ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials. 可靠的不确定性量化 (UQ) 在医疗人工智能中至关重要。与蒙特卡洛 (MC) Dropout 和深度集成 (DE) 等传统方法不同,证据深度学习 (EDL) 提供了一种计算高效的方法来量化模型不确定性以及预测。然而,所有这些方法通常都依赖于单个专家的注释作为模型训练的基本事实,而忽略了医疗保健中评估者之间的差异。为了解决这个问题,我们提出了 MEGAN,这是一个多专家门控网络,它通过使用各种基本事实和建模策略训练的 EDL 模型聚合来自多个 AI 专家的不确定性估计和预测。MEGAN 的门控网络以最佳方式结合了每个 EDL 模型的预测和不确定性,从而增强了整体预测的可信度和校准。我们在内窥镜视频中对 MEGAN 进行了广泛的基准测试,以评估溃疡性结肠炎 (UC) 疾病严重程度,通过 Mayo 内窥镜子评分 (MES) 的视觉标记进行评估,其中评估者间差异普遍存在。在大规模前瞻性 UC 临床试验中,与现有方法相比,MEGAN 的 F1 分数提高了 3.5%,预期校准误差 (ECE) 降低了 30.5%。此外,MEGAN 促进了不确定性引导的样本分层,减轻了注释负担,并有可能提高 UC 试验的效率和一致性。
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning 科目: 图像和视频处理 , 人工智能 , 计算机视觉和模式识别 , 机器学习
Publish: 2025-09-16 07:42:01 UTC 发布时间: 2025-09-16 07:42:01 UTC
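The MEGAN abstract reports its calibration gain in terms of Expected Calibration Error (ECE). As a reading aid, here is a minimal sketch of the standard equal-width-bin ECE; it is the generic metric, not the authors' evaluation code, and the bin count and the function name are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: predicted confidence of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction matched the label
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four predictions at 0.9 confidence, three of them right: accuracy 0.75 vs 0.9.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means predicted confidence tracks observed accuracy more closely, which is the sense in which a reduction in ECE indicates better calibration.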
#94 InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering #94 InfoGain-RAG:通过基于文档信息增益的重新排名和过滤促进检索增强生成
Authors: [Zihan Wang](https://arxiv.org/search/?searchtype=author&query=Zihan Wang), [Zihan Liang](https://arxiv.org/search/?searchtype=author&query=Zihan Liang), [Zhou Shao](https://arxiv.org/search/?searchtype=author&query=Zhou Shao), [Yufei Ma](https://arxiv.org/search/?searchtype=author&query=Yufei Ma), [Huangyu Dai](https://arxiv.org/search/?searchtype=author&query=Huangyu Dai), [Ben Chen](https://arxiv.org/search/?searchtype=author&query=Ben Chen), [Lingtao Mao](https://arxiv.org/search/?searchtype=author&query=Lingtao Mao), [Chenyi Lei](https://arxiv.org/search/?searchtype=author&query=Chenyi Lei), [Yuqing Ding](https://arxiv.org/search/?searchtype=author&query=Yuqing Ding), [Han Li](https://arxiv.org/search/?searchtype=author&query=Han Li) 作者:王子涵、梁子涵、周少、马玉飞、戴皇宇、陈本、毛灵涛、雷晨仪、丁玉清、韩丽
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lacking reference. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts the final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document’s value by computing the difference of LLM’s generation confidence with and without the document augmented. Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which prioritizes each retrieved document from exact distinguishing and accurate sorting perspectives. This approach can effectively filter out irrelevant documents and select the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG can significantly outperform existing approaches, on both single and multiple retrievers paradigm. Specifically on NaturalQA, it achieves the improvements of 17.9%, 4.5%, 12.5% in exact match accuracy against naive RAG, self-reflective RAG and modern ranking-based RAG respectively, and even an average of 15.3% increment on advanced proprietary model GPT-4o across all datasets. These results demonstrate the feasibility of InfoGain-RAG as it can offer a reliable solution for RAG in multiple applications. 
检索增强生成 (RAG) 已成为一种很有前途的方法,可以解决大型语言模型 (LLM) 的关键局限性,例如幻觉、过时的知识和缺乏参考。然而,当前的 RAG 框架经常难以确定检索到的文档是否对答案生成有意义。这一缺点使得很难过滤掉不相关甚至误导性的内容,这显着影响了最终的性能。在本文中,我们提出了文档信息增益(DIG),这是一种新型指标,旨在量化检索到的文档对正确答案生成的贡献。DIG 通过计算 LLM 在增强和未增强文档的情况下生成置信度的差异来衡量文档的价值。此外,我们还介绍了 InfoGain-RAG,这是一个利用 DIG 分数来训练专门的重新排序器的框架,该框架从精确区分和准确的排序角度对每个检索到的文档进行优先级排序。这种方法可以有效地过滤掉不相关的文档,并选择最有价值的文档,以便更好地生成答案。跨各种模型和基准测试的广泛实验表明,InfoGain-RAG 在单个和多个检索器范式上都可以显着优于现有方法。具体在 NaturalQA 上,它与朴素 RAG、自反射 RAG 和基于现代排名的 RAG 的精确匹配准确率分别提高了 17.9%、4.5%、12.5%,甚至在所有数据集的高级专有模型 GPT-4o 上平均提高了 15.3%。这些结果证明了 InfoGain-RAG 的可行性,因为它可以在多种应用中为 RAG 提供可靠的解决方案。
Subjects: Information Retrieval, Artificial Intelligence, Computation and Language 科目:信息检索、人工智能、计算与语言
Publish: 2025-09-16 07:28:07 UTC 发布时间: 2025-09-16 07:28:07 UTC
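The abstract defines Document Information Gain (DIG) as the difference in the LLM's generation confidence with and without the document. The sketch below illustrates that definition only; how confidence is computed is an assumption here (a length-normalised probability of the reference answer, derived from token log-probabilities that the caller has already obtained from some LLM). The function names are hypothetical, and `rank_by_dig` sorts by raw DIG scores as a stand-in for the trained reranker the paper describes.

```python
import math


def answer_confidence(token_logprobs):
    """ASSUMED confidence measure: geometric mean of the answer's token probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def document_information_gain(logprobs_with_doc, logprobs_without_doc):
    """DIG = confidence in the correct answer with the document minus without it."""
    return answer_confidence(logprobs_with_doc) - answer_confidence(logprobs_without_doc)


def rank_by_dig(dig_scores, threshold=0.0):
    """Drop documents whose DIG is not above the threshold, then sort best-first.

    Illustrative only: the paper trains a reranker from DIG scores instead of
    sorting on them directly.
    """
    kept = [(doc, score) for doc, score in dig_scores.items() if score > threshold]
    return [doc for doc, _ in sorted(kept, key=lambda item: item[1], reverse=True)]


# Toy numbers: the document lifts per-token probability from 0.2 to 0.8.
helpful = document_information_gain([math.log(0.8)] * 2, [math.log(0.2)] * 2)
ranking = rank_by_dig({"doc_a": 0.6, "doc_b": -0.1, "doc_c": 0.2})
```

A negative score marks a document that lowers confidence in the correct answer, which is the kind of misleading content the filtering step is meant to remove.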
#95 Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model #95 迈向对对象的所有权理解:使用大型语言模型和概率生成模型进行主动问题生成
Authors: [Saki Hashimoto](https://arxiv.org/search/?searchtype=author&query=Saki Hashimoto), [Shoichi Hasegawa](https://arxiv.org/search/?searchtype=author&query=Shoichi Hasegawa), [Tomochika Ishikawa](https://arxiv.org/search/?searchtype=author&query=Tomochika Ishikawa), [Akira Taniguchi](https://arxiv.org/search/?searchtype=author&query=Akira Taniguchi), [Yoshinobu Hagiwara](https://arxiv.org/search/?searchtype=author&query=Yoshinobu Hagiwara), [Lotfi El Hafi](https://arxiv.org/search/?searchtype=author&query=Lotfi El Hafi), [Tadahiro Taniguchi](https://arxiv.org/search/?searchtype=author&query=Tadahiro Taniguchi) 作者:Saki Hashimoto、Shoichi Hasegawa、Tomochika Ishikawa、Akira Taniguchi、Yoshinobu Hagiwara、Lotfi El Hafi、Tadahiro Taniguchi
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as “Bring me my cup.” However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently to improve learning efficiency. Additionally, by leveraging commonsense knowledge from Large Language Models (LLM), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution. 在家庭和办公环境中运行的机器人必须了解物体所有权才能正确执行诸如“给我拿我的杯子”之类的指令。然而,仅从视觉特征中无法可靠地推断所有权。为了解决这一差距,我们提出了主动所有权学习 (ActOwL),这是一个框架,使机器人能够主动生成并向用户提出与所有权相关的问题。ActOwL 采用概率生成模型来选择能够最大化信息增益的问题,从而有效地获取所有权知识,提高学习效率。此外,通过利用大型语言模型 (LLM) 的常识知识,对象被预先分类为共享或拥有,并且只有拥有的对象才会成为提问的目标。通过在模拟家庭环境和真实实验室环境中进行实验,ActOwL 以比基线方法更少的问题实现了显著更高的所有权聚类准确性。这些发现证明了将主动推理与 LLM 引导的常识推理相结合的有效性,提高了机器人获取所有权知识以执行实际和社会适当的任务的能力。
Subjects: Robotics, Artificial Intelligence, Human-Computer Interaction, Machine Learning 科目: 机器人技术 , 人工智能 , 人机交互 , 机器学习
Publish: 2025-09-16 07:15:52 UTC 发布时间: 2025-09-16 07:15:52 UTC
#96 Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions #96 面向手臂运动的机器人辅助穿衣力调制视觉策略
Authors: [Alexis Yihong Hao](https://arxiv.org/search/?searchtype=author&query=Alexis Yihong Hao), [Yufei Wang](https://arxiv.org/search/?searchtype=author&query=Yufei Wang), [Navin Sriram Ravie](https://arxiv.org/search/?searchtype=author&query=Navin Sriram Ravie), [Bharath Hegde](https://arxiv.org/search/?searchtype=author&query=Bharath Hegde), [David Held](https://arxiv.org/search/?searchtype=author&query=David Held), [Zackory Erickson](https://arxiv.org/search/?searchtype=author&query=Zackory Erickson) 作者:Alexis Yihong Hao、Yufei Wang、Navin Sriram Ravie、Bharath Hegde、David Held、Zackory Erickson
Robot-assisted dressing has the potential to significantly improve the lives of individuals with mobility impairments. To ensure an effective and comfortable dressing experience, the robot must be able to handle challenging deformable garments, apply appropriate forces, and adapt to limb movements throughout the dressing process. Prior work often makes simplifying assumptions – such as static human limbs during dressing – which limits real-world applicability. In this work, we develop a robot-assisted dressing system capable of handling partial observations with visual occlusions, as well as robustly adapting to arm motions during the dressing process. Given a policy trained in simulation with partial observations, we propose a method to fine-tune it in the real world using a small amount of data and multi-modal feedback from vision and force sensing, to further improve the policy’s adaptability to arm motions and enhance safety. We evaluate our method in simulation with simplified articulated human meshes and in a real world human study with 12 participants across 264 dressing trials. Our policy successfully dresses two long-sleeve everyday garments onto the participants while being adaptive to various kinds of arm motions, and greatly outperforms prior baselines in terms of task completion and user feedback. Videos are available at https://dressing-motion.github.io/. 机器人辅助穿衣有可能显著改善行动不便人士的生活。为了确保有效和舒适的穿衣体验,机器人必须能够处理具有挑战性的可变形服装,施加适当的力,并在整个穿衣过程中适应肢体运动。先前的工作经常做出简化的假设(例如穿衣过程中人体四肢保持静止),这限制了现实世界的适用性。在这项工作中,我们开发了一种机器人辅助穿衣系统,能够处理带有视觉遮挡的部分观测,并能稳健地适应穿衣过程中的手臂运动。给定一个在具有部分观测的模拟中训练的策略,我们提出了一种方法,利用少量数据以及来自视觉和力感知的多模态反馈在现实世界中对其进行微调,以进一步提高策略对手臂运动的适应性并增强安全性。我们在使用简化的铰接人体网格的模拟中以及一项真实世界的人体研究中评估了我们的方法,该研究包含 12 名参与者和 264 次穿衣试验。我们的策略成功地将两件长袖日常服装穿在参与者身上,同时适应各种手臂动作,并且在任务完成和用户反馈方面大大优于之前的基线。视频可在 https://dressing-motion.github.io/ 获得。
Subjects: Robotics, Artificial Intelligence, Machine Learning 科目: 机器人技术 , 人工智能 , 机器学习
Publish: 2025-09-16 06:53:18 UTC 发布时间: 2025-09-16 06:53:18 UTC
#97 Deep Generative and Discriminative Digital Twin endowed with Variational Autoencoder for Unsupervised Predictive Thermal Condition Monitoring of Physical Robots in Industry 6.0 and Society 6.0 #97 工业 6.0 和社会 6.0 中物理机器人的变分自动编码器用于无监督预测热状态监测的深度生成和判别数字孪生
Author: [Eric Guiffo Kaigom](https://arxiv.org/search/?searchtype=author&query=Eric Guiffo Kaigom) 作者:埃里克·吉福·凯戈姆
Robots are unrelentingly used to achieve operational efficiency in Industry 4.0 along with symbiotic and sustainable assistance for the work-force in Industry 5.0. As resilience, robustness, and well-being are required in anti-fragile manufacturing and human-centric societal tasks, an autonomous anticipation and adaption to thermal saturation and burns due to motors overheating become instrumental for human safety and robot availability. Robots are thereby expected to self-sustain their performance and deliver user experience, in addition to communicating their capability to other agents in advance to ensure fully automated thermally feasible tasks, and prolong their lifetime without human intervention. However, the traditional robot shutdown, when facing an imminent thermal saturation, inhibits productivity in factories and comfort in the society, while cooling strategies are hard to implement after the robot acquisition. In this work, smart digital twins endowed with generative AI, i.e., variational autoencoders, are leveraged to manage thermally anomalous and generate uncritical robot states. The notion of thermal difficulty is derived from the reconstruction error of variational autoencoders. A robot can use this score to predict, anticipate, and share the thermal feasibility of desired motion profiles to meet requirements from emerging applications in Industry 6.0 and Society 6.0. 机器人被不断用于实现工业 4.0 的运营效率,并为工业 5.0 中的劳动力提供共生和可持续的帮助。由于反脆弱制造和以人为本的社会任务需要弹性、稳健性和福祉,因此自主预判和适应电机过热引起的热饱和和烧伤对于人类安全和机器人可用性至关重要。因此,机器人除了提前将其能力传达给其他智能体以确保全自动的热可行任务之外,还应自我维持其性能并提供用户体验,并在无需人工干预的情况下延长其使用寿命。然而,传统的机器人停机在面临迫在眉睫的热饱和时,抑制了工厂的生产力和社会的舒适度,而机器人购置后的冷却策略难以实施。在这项工作中,利用具有生成式人工智能的智能数字孪生(即变分自动编码器)来管理热异常的机器人状态并生成非临界的机器人状态。热难度的概念源自变分自动编码器的重建误差。机器人可以使用此分数来预测、预判并共享所需运动曲线的热可行性,以满足工业 6.0 和社会 6.0 中新兴应用的要求。
Subjects: Robotics, Artificial Intelligence, Emerging Technologies, Machine Learning, Systems and Control 科目: 机器人技术 , 人工智能 , 新兴技术 , 机器学习 , 系统与控制
Publish: 2025-09-16 06:52:59 UTC 发布时间: 2025-09-16 06:52:59 UTC
#98 Deep Learning for Model-Free Prediction of Thermal States of Robot Joint Motors #98 用于机器人关节电机热状态无模型预测的深度学习
Authors: [Trung Kien La](https://arxiv.org/search/?searchtype=author&query=Trung Kien La), [Eric Guiffo Kaigom](https://arxiv.org/search/?searchtype=author&query=Eric Guiffo Kaigom) 作者:Trung Kien La、Eric Guiffo Kaigom
In this work, deep neural networks made up of multiple hidden Long Short-Term Memory (LSTM) and Feedforward layers are trained to predict the thermal behavior of the joint motors of robot manipulators. A model-free and scalable approach is adopted. It accommodates complexity and uncertainty challenges stemming from the derivation, identification, and validation of a large number of parameters of an approximation model that is hardly available. To this end, sensed joint torques are collected and processed to foresee the thermal behavior of joint motors. Promising prediction results of the machine learning based capture of the temperature dynamics of joint motors of a redundant robot with seven joints are presented. 在这项工作中,训练由多个隐藏的长短期记忆(LSTM)和前馈层组成的深度神经网络来预测机器人机械手关节电机的热行为。采用无模型且可扩展的方法。它适应了由于推导、识别和验证几乎无法获得的近似模型的大量参数而产生的复杂性和不确定性挑战。为此,收集并处理感应的关节扭矩,以预测关节电机的热行为。给出了基于机器学习捕获具有七关节的冗余机器人关节电机温度动力学的有希望的预测结果。
Subjects: Robotics, Artificial Intelligence, Emerging Technologies, Machine Learning, Systems and Control 科目: 机器人技术 , 人工智能 , 新兴技术 , 机器学习 , 系统与控制
Publish: 2025-09-16 06:52:30 UTC 发布时间: 2025-09-16 06:52:30 UTC
#99 A Graph Machine Learning Approach for Detecting Topological Patterns in Transactional Graphs #99 一种用于检测交易图拓扑模式的图机器学习方法
Authors: [Francesco Zola](https://arxiv.org/search/?searchtype=author&query=Francesco Zola), [Jon Ander Medina](https://arxiv.org/search/?searchtype=author&query=Jon Ander Medina), [Andrea Venturi](https://arxiv.org/search/?searchtype=author&query=Andrea Venturi), [Amaia Gil](https://arxiv.org/search/?searchtype=author&query=Amaia Gil), [Raul Orduna](https://arxiv.org/search/?searchtype=author&query=Raul Orduna) 作者:弗朗切斯科·佐拉、乔恩·安德·梅迪纳、安德里亚·文丘里、阿玛亚·吉尔、劳尔·奥尔杜纳
The rise of digital ecosystems has exposed the financial sector to evolving abuse and criminal tactics that share operational knowledge and techniques both within and across different environments (fiat-based, crypto-assets, etc.). Traditional rule-based systems lack the adaptability needed to detect sophisticated or coordinated criminal behaviors (patterns), highlighting the need for strategies that analyze actors’ interactions to uncover suspicious activities and extract their modus operandi. For this reason, in this work, we propose an approach that integrates graph machine learning and network analysis to improve the detection of well-known topological patterns within transactional graphs. However, a key challenge lies in the limitations of traditional financial datasets, which often provide sparse, unlabeled information that is difficult to use for graph-based pattern analysis. Therefore, we firstly propose a four-step preprocessing framework that involves (i) extracting graph structures, (ii) considering data temporality to manage large node sets, (iii) detecting communities within, and (iv) applying automatic labeling strategies to generate weak ground-truth labels. Then, once the data is processed, Graph Autoencoders are implemented to distinguish among the well-known topological patterns. Specifically, three different GAE variants are implemented and compared in this analysis. Preliminary results show that this pattern-focused, topology-driven method is effective for detecting complex financial crime schemes, offering a promising alternative to conventional rule-based detection systems. 
数字生态系统的兴起使金融部门面临不断演变的滥用和犯罪手法,这些手法在不同环境(基于法定货币、加密资产等)内部和之间共享操作知识和技术。传统的基于规则的系统缺乏检测复杂或协同的犯罪行为(模式)所需的适应性,这凸显了分析行为者互动以发现可疑活动并提取其作案手法的策略的必要性。因此,在这项工作中,我们提出了一种将图机器学习和网络分析相结合的方法,以改进对交易图中已知拓扑模式的检测。然而,一个关键挑战在于传统金融数据集的局限性,这些数据集通常提供稀疏的、未标记的信息,难以用于基于图的模式分析。因此,我们首先提出了一个四步预处理框架,包括(i)提取图结构,(ii)考虑数据时间性来管理大型节点集,(iii)检测其中的社区,以及(iv)应用自动标记策略来生成弱真值标签。然后,一旦数据得到处理,就会使用图自编码器 (GAE) 来区分这些已知的拓扑模式。具体来说,在此分析中实现并比较了三种不同的 GAE 变体。初步结果表明,这种以模式为中心、拓扑驱动的方法可有效检测复杂的金融犯罪计划,为传统的基于规则的检测系统提供了一种有前途的替代方案。
Subjects: Machine Learning, Artificial Intelligence, Computational Engineering, Finance, and Science 科目: 机器学习 , 人工智能 , 计算工程, 金融和科学
Publish: 2025-09-16 06:43:11 UTC 发布时间: 2025-09-16 06:43:11 UTC
#100 Unbiased Online Curvature Approximation for Regularized Graph Continual Learning #100 用于正则化图持续学习的无偏在线曲率近似
Authors: [Jie Yin](https://arxiv.org/search/?searchtype=author&query=Jie Yin), [Ke Sun](https://arxiv.org/search/?searchtype=author&query=Ke Sun), [Han Wu](https://arxiv.org/search/?searchtype=author&query=Han Wu) 作者:Jie Yin、Ke Sun、Han Wu
Graph continual learning (GCL) aims to learn from a continuous sequence of graph-based tasks. Regularization methods are vital for preventing catastrophic forgetting in GCL, particularly in the challenging replay-free, class-incremental setting, where each task consists of a set of unique classes. In this work, we first establish a general regularization framework for GCL based on the curved parameter space induced by the Fisher information matrix (FIM). We show that the dominant Elastic Weight Consolidation (EWC) and its variants are a special case within this framework, using a diagonal approximation of the empirical FIM based on parameters from previous tasks. To overcome their limitations, we propose a new unbiased online curvature approximation of the full FIM based on the model’s current learning state. Our method directly estimates the regularization term in an online manner without explicitly evaluating and storing the FIM itself. This enables the model to better capture the loss landscape during learning new tasks while retaining the knowledge learned from previous tasks. Extensive experiments on three graph datasets demonstrate that our method significantly outperforms existing regularization-based methods, achieving a superior trade-off between stability (retaining old knowledge) and plasticity (acquiring new knowledge). 图持续学习 (GCL) 旨在从连续的基于图的任务序列中学习。正则化方法对于防止 GCL 中的灾难性遗忘至关重要,特别是在具有挑战性的无重放、类增量环境中,其中每个任务都由一组独特的类组成。在这项工作中,我们首先建立了一个基于 Fisher 信息矩阵(FIM)诱导的弯曲参数空间的 GCL 通用正则化框架。我们表明,占主导地位的弹性权重巩固 (EWC) 及其变体是该框架内的一个特例,它们使用基于先前任务参数的经验 FIM 的对角近似。为了克服它们的局限性,我们提出了一种基于模型当前学习状态的完整 FIM 的新的无偏在线曲率近似。我们的方法直接以在线方式估计正则化项,而无需显式评估和存储 FIM 本身。这使得模型能够在学习新任务期间更好地捕获损失景观,同时保留从以前的任务中学到的知识。在三个图数据集上的大量实验表明,我们的方法明显优于现有的基于正则化的方法,在稳定性(保留旧知识)和可塑性(获取新知识)之间实现了卓越的权衡。
Subjects: Machine Learning, Artificial Intelligence 科目: 机器学习 , 人工智能
Publish: 2025-09-16 06:35:13 UTC 发布时间: 2025-09-16 06:35:13 UTC
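Note: the EWC special case named in the abstract above can be sketched in a few lines. This is only the standard diagonal-Fisher EWC penalty, not the paper's online full-FIM estimator (which the abstract does not specify); the function names, the weight `lam`, and the toy gradients are assumptions made for illustration.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal FIM: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for f, t, t0 in zip(fisher_diag, theta, theta_old)
    )


# Toy example (hypothetical numbers): only the first parameter mattered
# for the previous task, so only moving it is penalized.
grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads)
penalty_first = ewc_penalty([1.0, 0.0], [0.0, 0.0], fisher)
penalty_second = ewc_penalty([0.0, 1.0], [0.0, 0.0], fisher)
```

Moving a parameter with zero Fisher weight costs nothing under the diagonal approximation, which is one way to see why a diagonal FIM can miss interactions between parameters that a full-FIM estimate would capture.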
#101 Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
Authors: [Yunhan Zhao](https://arxiv.org/search/?searchtype=author&query=Yunhan Zhao), [Xiang Zheng](https://arxiv.org/search/?searchtype=author&query=Xiang Zheng), [Xingjun Ma](https://arxiv.org/search/?searchtype=author&query=Xingjun Ma)
Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 06:25:58 UTC
#102 Joint AoI and Handover Optimization in Space-Air-Ground Integrated Network
Authors: [Zifan Lang](https://arxiv.org/search/?searchtype=author&query=Zifan Lang), [Guixia Liu](https://arxiv.org/search/?searchtype=author&query=Guixia Liu), [Geng Sun](https://arxiv.org/search/?searchtype=author&query=Geng Sun), [Jiahui Li](https://arxiv.org/search/?searchtype=author&query=Jiahui Li), [Jiacheng Wang](https://arxiv.org/search/?searchtype=author&query=Jiacheng Wang), [Weijie Yuan](https://arxiv.org/search/?searchtype=author&query=Weijie Yuan), [Dusit Niyato](https://arxiv.org/search/?searchtype=author&query=Dusit Niyato), [Dong In Kim](https://arxiv.org/search/?searchtype=author&query=Dong In Kim)
Despite the widespread deployment of terrestrial networks, providing reliable communication services to remote areas and maintaining connectivity during emergencies remains challenging. Low Earth orbit (LEO) satellite constellations offer promising solutions with their global coverage capabilities and reduced latency, yet struggle with intermittent coverage and limited communication windows due to orbital dynamics. This paper introduces an age of information (AoI)-aware space-air-ground integrated network (SAGIN) architecture that leverages a high-altitude platform (HAP) as an intelligent relay between the LEO satellites and ground terminals. Our three-layer design employs hybrid free-space optical (FSO) links for high-capacity satellite-to-HAP communication and reliable radio frequency (RF) links for HAP-to-ground transmission, thus addressing the temporal discontinuity in LEO satellite coverage while serving diverse user priorities. Specifically, we formulate a joint optimization problem to simultaneously minimize the AoI and satellite handover frequency through optimal transmit power distribution and satellite selection decisions. This highly dynamic, non-convex problem with time-coupled constraints presents significant computational challenges for traditional approaches. To address these difficulties, we propose a novel diffusion model (DM)-enhanced dueling double deep Q-network with action decomposition and state transformer encoder (DD3QN-AS) algorithm that incorporates transformer-based temporal feature extraction and employs a DM-based latent prompt generative module to refine state-action representations through conditional denoising. Simulation results highlight the superior performance of the proposed approach compared with policy-based methods and some other deep reinforcement learning (DRL) benchmarks.
Subjects: Networking and Internet Architecture, Artificial Intelligence
Publish: 2025-09-16 06:16:56 UTC
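Note: the two quantities that the abstract above minimizes jointly can be illustrated with a small discrete-time sketch. The slot model (age resets to 1 on a successful delivery, otherwise grows by 1), the weighted-sum objective, and every name and number below are assumptions for illustration; the abstract does not give the paper's exact formulation.

```python
def aoi_trajectory(delivered, initial_age=0):
    """Age of information per slot: reset to 1 on delivery, else age + 1."""
    ages = []
    age = initial_age
    for ok in delivered:
        age = 1 if ok else age + 1
        ages.append(age)
    return ages


def count_handovers(selected_satellites):
    """Number of slots in which the serving satellite changes."""
    return sum(
        1
        for prev, cur in zip(selected_satellites, selected_satellites[1:])
        if prev != cur
    )


def joint_cost(delivered, selected_satellites, handover_weight=0.5):
    """Hypothetical scalarized objective: mean AoI + weight * handovers."""
    ages = aoi_trajectory(delivered)
    mean_age = sum(ages) / len(ages)
    return mean_age + handover_weight * count_handovers(selected_satellites)


ages = aoi_trajectory([False, False, True, False])
handovers = count_handovers([1, 1, 2, 2, 1])
cost = joint_cost([False, False, True, False], [1, 1, 2, 2])
```

The tension is visible even in this toy: switching satellites more often can keep updates flowing (lower AoI) but raises the handover term.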
#103 A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks
Authors: [Gordon Hung](https://arxiv.org/search/?searchtype=author&query=Gordon Hung), [Ivan Felipe Rodriguez](https://arxiv.org/search/?searchtype=author&query=Ivan Felipe Rodriguez)
Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25 %, 50 %, 75 %, 100 % of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy. Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 05:12:59 UTC
#104 Instance-level Randomization: Toward More Stable LLM Evaluations
Authors: [Yiyang Li](https://arxiv.org/search/?searchtype=author&query=Yiyang Li), [Yonghuang Wu](https://arxiv.org/search/?searchtype=author&query=Yonghuang Wu), [Ying Luo](https://arxiv.org/search/?searchtype=author&query=Ying Luo), [Liangtai Sun](https://arxiv.org/search/?searchtype=author&query=Liangtai Sun), [Zishu Qin](https://arxiv.org/search/?searchtype=author&query=Zishu Qin), [Lin Qiu](https://arxiv.org/search/?searchtype=author&query=Lin Qiu), [Xuezhi Cao](https://arxiv.org/search/?searchtype=author&query=Xuezhi Cao), [Xunliang Cai](https://arxiv.org/search/?searchtype=author&query=Xunliang Cai)
Evaluations of large language models (LLMs) suffer from instability, where small changes in random factors such as few-shot examples can lead to drastic fluctuations in scores and even model rankings. Moreover, different LLMs can have different preferences for a certain setting of random factors. As a result, using a fixed setting of random factors, which is often adopted as the paradigm of current evaluations, can lead to potentially unfair comparisons between LLMs. To mitigate the volatility of evaluations, we first theoretically analyze the sources of variance induced by changes in random factors. Targeting these specific sources, we then propose the instance-level randomization (ILR) method to reduce variance and enhance fairness in model comparisons. Instead of using a fixed setting across the whole benchmark in a single experiment, we randomize all factors that affect evaluation scores for every single instance, run multiple experiments and report the averaged score. Theoretical analyses and empirical results demonstrate that ILR can reduce the variance and unfair comparisons caused by random factors, as well as achieve a similar robustness level at less than half the computational cost of previous methods.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 05:04:00 UTC
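Note: the contrast between a fixed setting per run and instance-level randomization, as described in the abstract above, can be simulated with a toy benchmark. Everything here is hypothetical: the four "settings" stand in for random factors such as few-shot example sets, and the correctness rule in `is_correct` is invented so that some settings suit the model better than others.

```python
import random
import statistics

# Hypothetical: under setting s the model solves instances with i % 10 < THRESHOLDS[s].
THRESHOLDS = [9, 7, 5, 3]


def is_correct(instance_id, setting):
    return (instance_id % 10) < THRESHOLDS[setting]


def fixed_setting_run(num_instances, rng):
    """One experiment: draw a single setting and use it for the whole benchmark."""
    setting = rng.randrange(len(THRESHOLDS))
    hits = sum(is_correct(i, setting) for i in range(num_instances))
    return hits / num_instances


def ilr_run(num_instances, rng):
    """One experiment: draw a fresh setting independently for every instance."""
    hits = 0
    for i in range(num_instances):
        setting = rng.randrange(len(THRESHOLDS))
        hits += is_correct(i, setting)
    return hits / num_instances


rng = random.Random(0)
fixed_scores = [fixed_setting_run(200, rng) for _ in range(20)]
ilr_scores = [ilr_run(200, rng) for _ in range(20)]
fixed_var = statistics.pvariance(fixed_scores)
ilr_var = statistics.pvariance(ilr_scores)
ilr_mean = statistics.mean(ilr_scores)
```

In this toy, each fixed-setting run lands on one of four very different accuracies, while each ILR run averages over settings inside the run, so run-to-run variance is much smaller for the same number of model calls.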
#105 MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization
Authors: [YiTong Liu](https://arxiv.org/search/?searchtype=author&query=YiTong Liu), [TianZhu Liu](https://arxiv.org/search/?searchtype=author&query=TianZhu Liu), [YanFeng GU](https://arxiv.org/search/?searchtype=author&query=YanFeng GU)
Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to the significant appearance variations of objects observed from variable views, along with the difficulty in extracting discriminative features. Existing approaches often rely on extracting features through feature map segmentation while neglecting spatial and semantic information. To address these issues, we propose the EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method. The MFAF method consists of the Multi-Frequency Branch-wise Block (MFB) and the Frequency-aware Spatial Attention (FSA) module. The MFB block effectively captures both low-frequency structural features and high-frequency edge details across multiple scales, improving the consistency and robustness of feature representations across various viewpoints. Meanwhile, the FSA module adaptively focuses on the key regions of frequency features, significantly mitigating the interference caused by background noise and viewpoint variability. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and Dense-UAV, demonstrate that the MFAF method achieves competitive performance in both drone localization and drone navigation tasks.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
Publish: 2025-09-16 04:51:52 UTC
#106 Exact alternative optima for nonlinear optimization problems defined with maximum component objective function constrained by the Sugeno-Weber fuzzy relational inequalities
Authors: [Amin Ghodousian](https://arxiv.org/search/?searchtype=author&query=Amin Ghodousian), [Sara Zal](https://arxiv.org/search/?searchtype=author&query=Sara Zal), [Minoo Ahmadi](https://arxiv.org/search/?searchtype=author&query=Minoo Ahmadi)
In this paper, we study a latticized optimization problem with fuzzy relational inequality constraints where the feasible region is formed as the intersection of two inequality fuzzy systems and the Sugeno-Weber family of t-norms is considered as the fuzzy composition. The Sugeno-Weber family of t-norms and t-conorms is one of the most widely applied families in fuzzy modelling problems. This family of t-norms and t-conorms was suggested by Weber for modeling the intersection and union of fuzzy sets. Also, the t-conorms were suggested as addition rules by Sugeno for so-called alpha-fuzzy measures. The resolution of the feasible region of the problem is first investigated when it is defined with max-Sugeno-Weber composition, and a necessary and sufficient condition is presented for determining feasibility. Then, based on some theoretical properties of the problem, an algorithm is presented for solving this nonlinear problem. It is proved that the algorithm can find the exact optimal solution, and an example is presented to illustrate the proposed algorithm.
Subjects: Optimization and Control, Artificial Intelligence
Publish: 2025-09-16 04:48:06 UTC
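Note: for readers unfamiliar with the operator in the abstract above, the Sugeno-Weber t-norm with parameter λ > −1 is T_λ(x, y) = max(0, (x + y − 1 + λxy) / (1 + λ)), and max-Sugeno-Weber composition takes the maximum of T_λ(a_ij, x_j) over j. The sketch below implements just these two pieces plus a feasibility check. The split into an upper system (A1 ∘ x ≤ b1) and a lower system (A2 ∘ x ≥ b2) is an assumption about what "two inequality fuzzy systems" means, and none of this is the paper's solution algorithm.

```python
def sugeno_weber_tnorm(x, y, lam):
    """Sugeno-Weber t-norm for lam > -1; lam = 0 gives the Lukasiewicz t-norm."""
    if lam <= -1:
        raise ValueError("this sketch covers only lam > -1")
    return max(0.0, (x + y - 1.0 + lam * x * y) / (1.0 + lam))


def max_sw_compose(matrix, x, lam):
    """Max-Sugeno-Weber composition: i-th entry is max_j T(a_ij, x_j)."""
    return [
        max(sugeno_weber_tnorm(a, xj, lam) for a, xj in zip(row, x))
        for row in matrix
    ]


def is_feasible(a_upper, b_upper, a_lower, b_lower, x, lam):
    """Assumed constraint form: A1 o x <= b1 and A2 o x >= b2, componentwise."""
    upper_ok = all(v <= b for v, b in zip(max_sw_compose(a_upper, x, lam), b_upper))
    lower_ok = all(v >= b for v, b in zip(max_sw_compose(a_lower, x, lam), b_lower))
    return upper_ok and lower_ok


composed = max_sw_compose([[1.0, 0.5]], [0.4, 1.0], lam=0.0)
feasible = is_feasible([[1.0, 0.5]], [0.6], [[1.0, 0.5]], [0.3], [0.4, 1.0], lam=0.0)
```

Because the composition mixes a maximum with a piecewise-defined t-norm, the feasible region is generally non-convex, which is why the abstract treats the problem as nonlinear.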
#107 Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Authors: [Jinjie Shen](https://arxiv.org/search/?searchtype=author&query=Jinjie Shen), [Yaxiong Wang](https://arxiv.org/search/?searchtype=author&query=Yaxiong Wang), [Lechao Cheng](https://arxiv.org/search/?searchtype=author&query=Lechao Cheng), [Nan Pu](https://arxiv.org/search/?searchtype=author&query=Nan Pu), [Zhun Zhong](https://arxiv.org/search/?searchtype=author&query=Zhun Zhong)
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 04:18:48 UTC
#108 Don't Change My View: Ideological Bias Auditing in Large Language Models
Authors: [Paul Kröger](https://arxiv.org/search/?searchtype=author&query=Paul Kröger), [Emilio Barkett](https://arxiv.org/search/?searchtype=author&query=Emilio Barkett)
As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-16 04:14:29 UTC
#109 Leveraging Intermediate Representations of Time Series Foundation Models for Anomaly Detection
Authors: [Chan Sik Han](https://arxiv.org/search/?searchtype=author&query=Chan Sik Han), [Keon Myung Lee](https://arxiv.org/search/?searchtype=author&query=Keon Myung Lee)
Detecting anomalies in time series data is essential for the reliable operation of many real-world systems. Recently, time series foundation models (TSFMs) have emerged as a powerful tool for anomaly detection. However, existing methods typically rely on the final layer’s representations of TSFMs, computing the anomaly score as a reconstruction or forecasting error via a task-specific head. Instead, we propose TimeRep, a novel anomaly detection approach that leverages the intermediate layer’s representations of TSFMs, computing the anomaly score as the distance between these representations. Given a pre-trained TSFM, TimeRep selects the intermediate layer and patch-token position that yield the most informative representation. TimeRep forms a reference collection of intermediate representations from the training data and applies a core-set strategy to reduce its size while maintaining distributional coverage. During inference, TimeRep computes the anomaly score for incoming data by measuring the distance between its intermediate representations and those of the collection. To address concept drift, TimeRep integrates an adaptation mechanism that, at inference time, augments the collection exclusively with non-redundant intermediate representations from incoming data. We conducted extensive experiments on the UCR Anomaly Archive, which contains 250 univariate time series. TimeRep consistently outperforms a broad spectrum of state-of-the-art baselines, including non-DL, DL, and foundation model-based methods.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-16 04:10:17 UTC
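Note: the scoring pipeline outlined in the abstract above (reference collection, core-set reduction, distance-based score, selective augmentation) can be sketched on plain vectors. The abstract does not say which core-set algorithm, distance, or redundancy test TimeRep uses, so the greedy k-center selection, Euclidean nearest-neighbour distance, and the `min_gap` rule below are all assumptions, and the 2-D points merely stand in for TSFM intermediate representations.

```python
import math


def greedy_coreset(points, k):
    """Greedy k-center (farthest-point) selection; an assumed core-set strategy."""
    selected = [points[0]]
    min_d = [math.dist(p, selected[0]) for p in points]
    while len(selected) < min(k, len(points)):
        idx = max(range(len(points)), key=min_d.__getitem__)
        selected.append(points[idx])
        min_d = [min(d, math.dist(p, points[idx])) for d, p in zip(min_d, points)]
    return selected


def anomaly_score(rep, collection):
    """Distance from a representation to its nearest reference representation."""
    return min(math.dist(rep, ref) for ref in collection)


def adapt(rep, collection, min_gap):
    """Hypothetical redundancy rule: keep rep only if nothing stored is within min_gap."""
    if anomaly_score(rep, collection) > min_gap:
        collection.append(rep)
        return True
    return False


train_reps = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 0.1), (0.0, 1.0)]
collection = greedy_coreset(train_reps, k=3)
score_normal = anomaly_score((0.9, 0.1), collection)
score_odd = anomaly_score((5.0, 5.0), collection)
added_redundant = adapt((0.0, 0.01), collection, min_gap=0.2)
```

A real deployment would also need a rule that stops genuine anomalies from being absorbed into the collection during adaptation; this sketch leaves that out.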
#110 A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs
Authors: [Kiho Lee](https://arxiv.org/search/?searchtype=author&query=Kiho Lee), [Jungkon Kim](https://arxiv.org/search/?searchtype=author&query=Jungkon Kim), [Doowon Kim](https://arxiv.org/search/?searchtype=author&query=Doowon Kim), [Hyoungshick Kim](https://arxiv.org/search/?searchtype=author&query=Hyoungshick Kim)
Code-generating Large Language Models (LLMs) significantly accelerate software development. However, their frequent generation of insecure code presents serious risks. We present a comprehensive evaluation of seven parameter-efficient fine-tuning (PEFT) techniques, demonstrating substantial gains in secure code generation without compromising functionality. Our research identifies prompt-tuning as the most effective PEFT method, achieving an 80.86% Overall-Secure-Rate on CodeGen2 16B, a 13.5-point improvement over the 67.28% baseline. Optimizing decoding strategies through sampling temperature further elevated security to 87.65%. This equates to a reduction of approximately 203,700 vulnerable code snippets per million generated. Moreover, prompt and prefix tuning increase robustness against poisoning attacks in our TrojanPuzzle evaluation, with strong performance against CWE-79 and CWE-502 attack vectors. Our findings generalize across Python and Java, confirming prompt-tuning’s consistent effectiveness. This study provides essential insights and practical guidance for building more resilient software systems with LLMs.
Subjects: Cryptography and Security, Artificial Intelligence
Publish: 2025-09-16 04:09:41 UTC
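Note: the "203,700 per million" figure in the abstract above follows directly from the reported rates, assuming the Overall-Secure-Rate is the share of generated snippets that are secure. The check below uses integer basis points (hundredths of a percentage point) to avoid floating-point noise; the variable names are illustrative.

```python
# Reported Overall-Secure-Rates, in basis points (1 bp = 0.01 percentage point).
BASELINE_BP = 6728          # 67.28 %
PROMPT_TUNING_BP = 8086     # 80.86 %
WITH_TEMPERATURE_BP = 8765  # 87.65 %

# Gain from prompt-tuning alone, in percentage points.
prompt_tuning_gain_points = (PROMPT_TUNING_BP - BASELINE_BP) / 100

# Total gain after also tuning the sampling temperature, in basis points.
total_gain_bp = WITH_TEMPERATURE_BP - BASELINE_BP

# One basis point of a million snippets is 100 snippets.
fewer_vulnerable_per_million = total_gain_bp * 100
```

The prompt-tuning gain works out to 13.58 points, which the abstract quotes as 13.5, and the total gain of 20.37 points corresponds to the quoted reduction per million snippets.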
#111 Positional Encoding via Token-Aware Phase Attention
Authors: Yu Wang, [Sheng Shen](https://arxiv.org/search/?searchtype=author&query=Sheng Shen), [Rémi Munos](https://arxiv.org/search/?searchtype=author&query=Rémi Munos), [Hongyuan Zhan](https://arxiv.org/search/?searchtype=author&query=Hongyuan Zhan), [Yuandong Tian](https://arxiv.org/search/?searchtype=author&query=Yuandong Tian)
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than RoPE families.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-16 03:53:32 UTC
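Note: as background for the abstract above, the sketch below implements standard RoPE (not TAPA, whose learnable phase function the abstract does not specify) and shows two things: the attention logit depends only on the relative offset between positions, and for a fixed query/key pair the logit changes with distance. The dimension, base, and test vectors are illustrative assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angle pos * base^(-i/d), i = 0, 2, 4, ..."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_logit(q, k, q_pos, k_pos):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))


q = [0.5, -1.0, 2.0, 0.3, 1.0, 1.0, -0.7, 0.2]
k = [1.0, 0.4, -0.5, 0.9, 0.1, -1.2, 0.6, 0.8]

same_offset_a = rope_logit(q, k, 3, 1)
same_offset_b = rope_logit(q, k, 10, 8)

ones = [1.0] * 8
logit_near = rope_logit(ones, ones, 0, 0)
logit_far = rope_logit(ones, ones, 0, 64)
```

With identical query and key content, the logit is largest at zero distance and smaller at distance 64, a simple instance of the position-only effect on scores that the paper analyses.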
#112 CIARD: Cyclic Iterative Adversarial Robustness Distillation
Authors: [Liming Lu](https://arxiv.org/search/?searchtype=author&query=Liming Lu), [Shuchao Pang](https://arxiv.org/search/?searchtype=author&query=Shuchao Pang), [Xu Zheng](https://arxiv.org/search/?searchtype=author&query=Xu Zheng), [Xiang Gu](https://arxiv.org/search/?searchtype=author&query=Xiang Gu), [Anan Du](https://arxiv.org/search/?searchtype=author&query=Anan Du), [Yunhuai Liu](https://arxiv.org/search/?searchtype=author&query=Yunhuai Liu), [Yongbin Zhou](https://arxiv.org/search/?searchtype=author&query=Yongbin Zhou)
Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Though existing ARD approaches enhance the student model’s robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem inherent in existing methods with a dual-teacher framework as: (1) the divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and (2) the iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: (a) a multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and (b) continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53 improvement in adversarial defense rates across various attack scenarios and a 5.87 increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/eminentgu/CIARD
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 03:51:43 UTC
#113 DoubleAgents: Exploring Mechanisms of Building Trust with Proactive AI
Authors: [Tao Long](https://arxiv.org/search/?searchtype=author&query=Tao Long), [Xuanming Zhang](https://arxiv.org/search/?searchtype=author&query=Xuanming Zhang), [Sitong Wang](https://arxiv.org/search/?searchtype=author&query=Sitong Wang), [Zhou Yu](https://arxiv.org/search/?searchtype=author&query=Zhou Yu), [Lydia B Chilton](https://arxiv.org/search/?searchtype=author&query=Lydia B Chilton)
Agentic workflows promise efficiency, but adoption hinges on whether people actually trust systems that act on their behalf. We present DoubleAgents, an agentic planning tool that embeds transparency and control through user intervention, value-reflecting policies, rich state visualizations, and uncertainty flagging for human coordination tasks. A built-in respondent simulation generates realistic scenarios, allowing users to rehearse, refine policies, and calibrate their reliance before live use. We evaluate DoubleAgents in a two-day lab study (n=10), two deployments (n=2), and a technical evaluation. Results show that participants initially hesitated to delegate but grew more reliant as they experienced transparency, control, and adaptive learning during simulated cases. Deployment results demonstrate DoubleAgents’ real-world relevance and usefulness, showing that the effort required scaled appropriately with task complexity and contextual data. We contribute trust-by-design patterns and mechanisms for proactive AI (consistency, controllability, and explainability), along with simulation as a safe path to build and calibrate trust over time.
Subjects: Human-Computer Interaction, Artificial Intelligence, Computers and Society, Emerging Technologies
Publish: 2025-09-16 03:43:13 UTC
#114 ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation
Authors: [Zekai Zhang](https://arxiv.org/search/?searchtype=author&query=Zekai Zhang), [Weiye Zhu](https://arxiv.org/search/?searchtype=author&query=Weiye Zhu), [Hewei Pan](https://arxiv.org/search/?searchtype=author&query=Hewei Pan), [Xiangchen Wang](https://arxiv.org/search/?searchtype=author&query=Xiangchen Wang), [Rongtao Xu](https://arxiv.org/search/?searchtype=author&query=Rongtao Xu), [Xing Sun](https://arxiv.org/search/?searchtype=author&query=Xing Sun), [Feng Zheng](https://arxiv.org/search/?searchtype=author&query=Feng Zheng)
The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent’s ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model. Code and data will be released soon.
Subjects: Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
Publish: 2025-09-16 03:31:46 UTC
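Reading note for #114: the abstract says multiple rollouts are optimized via the GRPO objective. A minimal sketch of the group-relative advantage that GRPO is usually described with is given below: each rollout's reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the toy rewards are assumptions made here for illustration; none of them come from the ActiveVLN paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `rewards` holds one scalar reward per rollout of the same instruction.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical success rewards for four rollouts of one navigation instruction.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, so no separate value network is needed to rank rollouts within a group.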
#115 ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Authors: [Hengrui Zhang](https://arxiv.org/search/?searchtype=author&query=Hengrui Zhang), [Yulong Hui](https://arxiv.org/search/?searchtype=author&query=Yulong Hui), [Yihao Liu](https://arxiv.org/search/?searchtype=author&query=Yihao Liu), [Huanchen Zhang](https://arxiv.org/search/?searchtype=author&query=Huanchen Zhang)
Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Given enormous document collections and ad-hoc queries, Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
Subjects: Databases, Artificial Intelligence, Machine Learning
Publish: 2025-09-16 03:18:06 UTC
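Reading note for #115: the online phase described in the abstract (a cheap proxy decides the clear cases, only ambiguous ones reach the LLM) can be sketched roughly as a two-threshold cascade. This is an illustration under stated assumptions, not ScaleDoc's implementation: `cascade_filter`, the fixed `low`/`high` thresholds, and the toy proxy scores and LLM answers are all hypothetical. In the paper the filtering policy is chosen adaptively to meet an accuracy target.

```python
def cascade_filter(docs, proxy_score, llm_predicate, low=0.2, high=0.8):
    """Return (accepted_docs, llm_calls) for one query.

    proxy_score(doc)   -> float in [0, 1], cheap confidence that doc matches.
    llm_predicate(doc) -> bool, the expensive final decision.
    Scores >= high are accepted and scores <= low are rejected without
    calling the LLM; everything in between is forwarded to the LLM.
    """
    accepted, llm_calls = [], 0
    for doc in docs:
        score = proxy_score(doc)
        if score >= high:
            accepted.append(doc)
        elif score > low:
            llm_calls += 1
            if llm_predicate(doc):
                accepted.append(doc)
    return accepted, llm_calls

# Toy stand-ins for the proxy model and the LLM (hypothetical values).
scores = {"a": 0.95, "b": 0.50, "c": 0.05, "d": 0.60}
llm_answers = {"a": True, "b": False, "c": False, "d": True}

accepted, llm_calls = cascade_filter(list(scores), scores.get, llm_answers.get)
print(accepted, llm_calls)
```

Narrowing the gap between `low` and `high` saves more LLM calls but trusts the proxy on more borderline documents, which is the trade-off an adaptive policy has to balance.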
#116 EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
Authors: [Mukai Li](https://arxiv.org/search/?searchtype=author&query=Mukai Li), [Linfeng Song](https://arxiv.org/search/?searchtype=author&query=Linfeng Song), [Zhenwen Liang](https://arxiv.org/search/?searchtype=author&query=Zhenwen Liang), [Jiahao Xu](https://arxiv.org/search/?searchtype=author&query=Jiahao Xu), [Shansan Gong](https://arxiv.org/search/?searchtype=author&query=Shansan Gong), [Qi Liu](https://arxiv.org/search/?searchtype=author&query=Qi Liu), [Haitao Mi](https://arxiv.org/search/?searchtype=author&query=Haitao Mi), [Dong Yu](https://arxiv.org/search/?searchtype=author&query=Dong Yu)
Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-16 03:00:13 UTC
#117 A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
Authors: [Huajun Zhou](https://arxiv.org/search/?searchtype=author&query=Huajun Zhou), [Fengtao Zhou](https://arxiv.org/search/?searchtype=author&query=Fengtao Zhou), [Jiabo Ma](https://arxiv.org/search/?searchtype=author&query=Jiabo Ma), [Yingxue Xu](https://arxiv.org/search/?searchtype=author&query=Yingxue Xu), [Xi Wang](https://arxiv.org/search/?searchtype=author&query=Xi Wang), [Xiuming Zhang](https://arxiv.org/search/?searchtype=author&query=Xiuming Zhang), [Li Liang](https://arxiv.org/search/?searchtype=author&query=Li Liang), [Zhenhui Li](https://arxiv.org/search/?searchtype=author&query=Zhenhui Li), [Hao Chen](https://arxiv.org/search/?searchtype=author&query=Hao Chen)
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE’s generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
Subjects: Machine Learning, Artificial Intelligence, Quantitative Methods
Publish: 2025-09-16 02:57:55 UTC
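Reading note for #117: the results are reported as improvements in C-index (concordance index). For readers unfamiliar with the metric, here is a small sketch of Harrell's C-index on toy survival data: among comparable patient pairs, it is the fraction where the patient who experienced the event earlier was assigned the higher risk. The function and the four-patient example are illustrative assumptions, not data or code from the MICE paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times[i]  : follow-up time of patient i
    events[i] : 1 if the event was observed, 0 if censored
    risks[i]  : predicted risk score (higher means earlier expected event)
    A pair (i, j) is comparable when i had an observed event strictly
    before time j; ties in predicted risk count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Toy cohort of four patients (hypothetical values); the third is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported gains.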
#118 DisorientLiDAR: Physical Attacks on LiDAR-based Localization
Authors: [Yizhen Lao](https://arxiv.org/search/?searchtype=author&query=Yizhen Lao), [Yu Zhang](https://arxiv.org/search/?searchtype=author&query=Yu Zhang), [Ziting Wang](https://arxiv.org/search/?searchtype=author&query=Ziting Wang), [Chengbo Wang](https://arxiv.org/search/?searchtype=author&query=Chengbo Wang), [Yifei Xue](https://arxiv.org/search/?searchtype=author&query=Yifei Xue), [Wanpeng Shao](https://arxiv.org/search/?searchtype=author&query=Wanpeng Shao)
Deep learning models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Even though this poses a serious security challenge for the localization of self-driving cars, there has been very little exploration of attacks on it, as most adversarial attacks have been applied to 3D perception. In this work, we propose a novel adversarial attack framework called DisorientLiDAR targeting LiDAR-based localization. By reverse-engineering localization models (e.g., feature extraction networks), adversaries can identify critical keypoints and strategically remove them, thereby disrupting LiDAR-based localization. Our proposal is first evaluated on three state-of-the-art point-cloud registration models (HRegNet, D3Feat, and GeoTransformer) using the KITTI dataset. Experimental results demonstrate that removing regions containing Top-K keypoints significantly degrades their registration accuracy. We further validate the attack’s impact on the Autoware autonomous driving platform, where hiding merely a few critical regions induces noticeable localization drift. Finally, we extend our attacks to the physical world by hiding critical regions with near-infrared absorptive materials, thereby successfully replicating the attack effects observed in KITTI data. This step moves closer to a realistic physical-world attack and demonstrates the veracity and generality of our proposal.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 02:46:39 UTC
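Reading note for #118: the evaluation in the abstract removes regions containing the Top-K keypoints and measures how registration accuracy degrades. The toy sketch below shows only that ablation step on a list of 3D points, which is also how one might probe a registration model's robustness. The saliency scores, the spherical region with a fixed `radius`, and the function name are assumptions made here; the paper derives keypoints from the localization models themselves.

```python
import math

def remove_topk_regions(points, saliency, k, radius):
    """Drop every point within `radius` of one of the k most salient points.

    points   : list of (x, y, z) tuples
    saliency : one importance score per point (hypothetical keypoint scores)
    """
    ranked = sorted(range(len(points)), key=lambda i: saliency[i], reverse=True)
    centers = [points[i] for i in ranked[:k]]
    return [p for p in points
            if all(math.dist(p, c) > radius for c in centers)]

# Toy point cloud: two high-saliency points, one close neighbor, one far point.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 0.0), (9.0, 0.0, 1.0)]
saliency = [0.9, 0.2, 0.1, 0.8]

remaining = remove_topk_regions(points, saliency, k=2, radius=0.5)
print(remaining)
```

Feeding the ablated cloud to a registration model and comparing its pose error with the unablated run gives the kind of degradation measurement the abstract refers to.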
#119 Adaptive Sampling Scheduler
Authors: [Qi Wang](https://arxiv.org/search/?searchtype=author&query=Qi Wang), [Shuliang Zhu](https://arxiv.org/search/?searchtype=author&query=Shuliang Zhu), [Jinjia Zhou](https://arxiv.org/search/?searchtype=author&query=Jinjia Zhou)
Consistency distillation methods have evolved into effective techniques that significantly accelerate the sampling process of diffusion models. Although existing methods have achieved remarkable results, the selection of target timesteps during distillation mainly relies on deterministic or stochastic strategies, which often require sampling schedulers to be designed specifically for different distillation processes. Moreover, this pattern severely limits flexibility, thereby restricting the full sampling potential of diffusion models in practical applications. To overcome these limitations, this paper proposes an adaptive sampling scheduler that is applicable to various consistency distillation frameworks. The scheduler introduces three innovative strategies: (i) dynamic target timestep selection, which adapts to different consistency distillation frameworks by selecting timesteps based on their computed importance; (ii) optimized alternating sampling along the solution trajectory by guiding forward denoising and backward noise addition based on the proposed timestep importance, enabling more effective exploration of the solution space to enhance generation performance; and (iii) utilization of smoothing clipping and color balancing techniques to achieve stable and high-quality generation results at high guidance scales, thereby expanding the applicability of consistency distillation models in complex generation scenarios. We validated the effectiveness and flexibility of the adaptive sampling scheduler across various consistency distillation methods through comprehensive experimental evaluations. Experimental results consistently demonstrated significant improvements in generative performance, highlighting the strong adaptability achieved by our method.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-16 01:51:16 UTC
#120 DeepEyeNet: Generating Medical Report for Retinal Images
Author: [Jia-Hong Huang](https://arxiv.org/search/?searchtype=author&query=Jia-Hong Huang)
The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) a multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (3) strategies to overcome RNN-based models’ limitations, particularly in capturing long-range dependencies within medical descriptions; (4) techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
Publish: 2025-09-16 00:18:56 UTC
#121 Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning
Authors: [Scott Jones](https://arxiv.org/search/?searchtype=author&query=Scott Jones), [Liyou Zhou](https://arxiv.org/search/?searchtype=author&query=Liyou Zhou), [Sebastian W. Pattinson](https://arxiv.org/search/?searchtype=author&query=Sebastian W. Pattinson)
In visuomotor policy learning, the control policy for the robotic agent is derived directly from visual inputs. The typical approach, where a policy and vision encoder are trained jointly from scratch, generalizes poorly to novel visual scene changes. Using pre-trained vision models (PVMs) to inform a policy network improves robustness in model-free reinforcement learning (MFRL). Recent developments in model-based reinforcement learning (MBRL) suggest that MBRL is more sample-efficient than MFRL. However, counterintuitively, existing work has found PVMs to be ineffective in MBRL. Here, we investigate the effectiveness of PVMs in MBRL, specifically on generalization under visual domain shifts. We show that, in scenarios with severe shifts, PVMs perform much better than a baseline model trained from scratch. We further investigate the effects of varying levels of fine-tuning of PVMs. Our results show that partial fine-tuning can maintain the highest average task performance under the most extreme distribution shifts. Our results demonstrate that PVMs are highly successful in promoting robustness in visual policy learning, providing compelling evidence for their wider adoption in model-based robotic learning applications.
Subjects: Robotics, Artificial Intelligence, Machine Learning, Systems and Control
Publish: 2025-09-16 00:13:14 UTC
#122 DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification
Authors: [Fazle Rafsani](https://arxiv.org/search/?searchtype=author&query=Fazle Rafsani), [Jay Shah](https://arxiv.org/search/?searchtype=author&query=Jay Shah), [Catherine D. Chong](https://arxiv.org/search/?searchtype=author&query=Catherine D. Chong), [Todd J. Schwedt](https://arxiv.org/search/?searchtype=author&query=Todd J. Schwedt), [Teresa Wu](https://arxiv.org/search/?searchtype=author&query=Teresa Wu)
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
Subjects: Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
Publish: 2025-09-15 23:31:40 UTC
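Reading note for #122: the abstract describes assigning slice-level importance weights through a soft attention mechanism and aggregating per-slice features into one volume-level representation. A minimal sketch of that pooling step follows. It assumes the per-slice feature vectors are already extracted (in the paper they come from DINOv2) and uses a single scoring vector `w` as a stand-in for the learned attention parameters; both the function name and the toy numbers are hypothetical.

```python
import math

def attention_pool(slice_features, w):
    """Softmax-weighted average of per-slice feature vectors.

    slice_features : list of equal-length feature vectors, one per 2D slice
    w              : scoring vector (stand-in for learned attention weights)
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, feat)) for feat in slice_features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slice_features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(weights, slice_features))
              for d in range(dim)]
    return pooled, weights

# Three toy "slices" with 2-dimensional features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(features, w=[2.0, 0.0])
print(weights, pooled)
```

Slices whose features score higher under `w` dominate the pooled vector, which is then what a classifier head would consume.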
#123 FunAudio-ASR Technical Report
Authors: [Keyu An](https://arxiv.org/search/?searchtype=author&query=Keyu An), [Yanni Chen](https://arxiv.org/search/?searchtype=author&query=Yanni Chen), [Chong Deng](https://arxiv.org/search/?searchtype=author&query=Chong Deng), [Changfeng Gao](https://arxiv.org/search/?searchtype=author&query=Changfeng Gao), [Zhifu Gao](https://arxiv.org/search/?searchtype=author&query=Zhifu Gao), [Bo Gong](https://arxiv.org/search/?searchtype=author&query=Bo Gong), [Xiangang Li](https://arxiv.org/search/?searchtype=author&query=Xiangang Li), [Yabin Li](https://arxiv.org/search/?searchtype=author&query=Yabin Li), [Xiang Lv](https://arxiv.org/search/?searchtype=author&query=Xiang Lv), [Yunjie Ji](https://arxiv.org/search/?searchtype=author&query=Yunjie Ji), [Yiheng Jiang](https://arxiv.org/search/?searchtype=author&query=Yiheng Jiang), [Bin Ma](https://arxiv.org/search/?searchtype=author&query=Bin Ma), [Haoneng Luo](https://arxiv.org/search/?searchtype=author&query=Haoneng Luo), [Chongjia Ni](https://arxiv.org/search/?searchtype=author&query=Chongjia Ni), [Zexu Pan](https://arxiv.org/search/?searchtype=author&query=Zexu Pan), [Yiping Peng](https://arxiv.org/search/?searchtype=author&query=Yiping Peng), [Zhendong Peng](https://arxiv.org/search/?searchtype=author&query=Zhendong Peng), [Peiyao Wang](https://arxiv.org/search/?searchtype=author&query=Peiyao Wang), [Hao Wang](https://arxiv.org/search/?searchtype=author&query=Hao Wang), [Wen Wang](https://arxiv.org/search/?searchtype=author&query=Wen Wang), [Wupeng Wang](https://arxiv.org/search/?searchtype=author&query=Wupeng Wang), [Biao Tian](https://arxiv.org/search/?searchtype=author&query=Biao Tian), [Zhentao Tan](https://arxiv.org/search/?searchtype=author&query=Zhentao Tan), [Nan Yang](https://arxiv.org/search/?searchtype=author&query=Nan Yang), [Bin Yuan](https://arxiv.org/search/?searchtype=author&query=Bin Yuan), [Jieping Ye](https://arxiv.org/search/?searchtype=author&query=Jieping Ye), [Jixing Yu](https://arxiv.org/search/?searchtype=author&query=Jixing Yu), [Qinglin Zhang](https://arxiv.org/search/?searchtype=author&query=Qinglin Zhang), [Kun Zou](https://arxiv.org/search/?searchtype=author&query=Kun Zou), [Han Zhao](https://arxiv.org/search/?searchtype=author&query=Han Zhao), [Shengkui Zhao](https://arxiv.org/search/?searchtype=author&query=Shengkui Zhao), [Jingren Zhou](https://arxiv.org/search/?searchtype=author&query=Jingren Zhou)
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, and hotword customization, as well as support for other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
Subjects: Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
Publish: 2025-09-15 23:19:36 UTC
#124 Reinforcement Learning-Based Market Making as a Stochastic Control on Non-Stationary Limit Order Book Dynamics
Authors: [Rafael Zimmer](https://arxiv.org/search/?searchtype=author&query=Rafael Zimmer), [Oswaldo Luiz do Valle Costa](https://arxiv.org/search/?searchtype=author&query=Oswaldo Luiz do Valle Costa)
Reinforcement Learning has emerged as a promising framework for developing adaptive and data-driven strategies, enabling market makers to optimize decision-making policies based on interactions with the limit order book environment. This paper explores the integration of a reinforcement learning agent in a market-making context, where the underlying market dynamics have been explicitly modeled to capture observed stylized facts of real markets, including clustered order arrival times, non-stationary spreads and return drifts, stochastic order quantities and price volatility. These mechanisms aim to enhance stability of the resulting control agent, and serve to incorporate domain-specific knowledge into the agent policy learning process. Our contributions include a practical implementation of a market making agent based on the Proximal Policy Optimization (PPO) algorithm, alongside a comparative evaluation of the agent’s performance under varying market conditions via a simulator-based environment. As evidenced by our analysis of the financial return and risk metrics when compared to a closed-form optimal solution, our results suggest that the reinforcement learning agent can effectively be used under non-stationary market conditions, and that the proposed simulator-based environment can serve as a valuable tool for training and pre-training reinforcement learning agents in market-making scenarios.
Subjects: Trading and Market Microstructure, Artificial Intelligence
Publish: 2025-09-15 21:08:13 UTC
#125 PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization
Authors: [Dawei Xiang](https://arxiv.org/search/?searchtype=author&query=Dawei Xiang), [Wenyan Xu](https://arxiv.org/search/?searchtype=author&query=Wenyan Xu), [Kexin Chu](https://arxiv.org/search/?searchtype=author&query=Kexin Chu), [Zixu Shen](https://arxiv.org/search/?searchtype=author&query=Zixu Shen), [Tianqi Ding](https://arxiv.org/search/?searchtype=author&query=Tianqi Ding), [Wei Zhang](https://arxiv.org/search/?searchtype=author&query=Wei Zhang)
The rapid advancement of generative AI has democratized access to powerful tools such as Text-to-Image models. However, to generate high-quality images, users must still craft detailed prompts specifying scene, style, and context, often through multiple rounds of refinement. We propose PromptSculptor, a novel multi-agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt. By leveraging Chain-of-Thought reasoning, our framework effectively infers hidden context and enriches scene and background details. To iteratively refine the prompt, a self-evaluation agent aligns the modified prompt with the original input, while a feedback-tuning agent incorporates user feedback for further refinement. Experimental results demonstrate that PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. Moreover, its model-agnostic design allows seamless integration with various T2I models, paving the way for industrial applications.
Subjects: Multiagent Systems, Artificial Intelligence 主题: 多智能体系统 , 人工智能
Publish: 2025-09-15 20:52:11 UTC 发布时间: 2025-09-15 20:52:11 UTC
#126 MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Authors: [Jiayi He](https://arxiv.org/search/?searchtype=author&query=Jiayi He), [Yangmin Huang](https://arxiv.org/search/?searchtype=author&query=Yangmin Huang), [Qianyun Du](https://arxiv.org/search/?searchtype=author&query=Qianyun Du), [Xiangying Zhou](https://arxiv.org/search/?searchtype=author&query=Xiangying Zhou), [Zhiyang He](https://arxiv.org/search/?searchtype=author&query=Zhiyang He), [Jiaxue Hu](https://arxiv.org/search/?searchtype=author&query=Jiaxue Hu), [Xiaodong Tao](https://arxiv.org/search/?searchtype=author&query=Xiaodong Tao), [Lixian Lai](https://arxiv.org/search/?searchtype=author&query=Lixian Lai)
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-15 20:46:21 UTC
#127 Neural-Quantum-States Impurity Solver for Quantum Embedding Problems
Authors: [Yinzhanghao Zhou](https://arxiv.org/search/?searchtype=author&query=Yinzhanghao Zhou), [Tsung-Han Lee](https://arxiv.org/search/?searchtype=author&query=Tsung-Han Lee), [Ao Chen](https://arxiv.org/search/?searchtype=author&query=Ao Chen), [Nicola Lanatà](https://arxiv.org/search/?searchtype=author&query=Nicola Lanatà), [Hong Guo](https://arxiv.org/search/?searchtype=author&query=Hong Guo)
Neural quantum states (NQS) have emerged as a promising approach to solve second-quantised Hamiltonians, because of their scalability and flexibility. In this work, we design and benchmark an NQS impurity solver for the quantum embedding methods, focusing on the ghost Gutzwiller Approximation (gGA) framework. We introduce a graph transformer-based NQS framework able to represent arbitrarily connected impurity orbitals and develop an error control mechanism to stabilise iterative updates throughout the quantum embedding loops. We validate the accuracy of our approach with benchmark gGA calculations of the Anderson Lattice Model, yielding results in excellent agreement with the exact diagonalisation impurity solver. Finally, our analysis of the computational budget reveals the method’s principal bottleneck to be the high-accuracy sampling of physical observables required by the embedding loop, rather than the NQS variational optimisation, directly highlighting the critical need for more efficient inference techniques.
Subjects: Strongly Correlated Electrons, Artificial Intelligence, Machine Learning, Quantum Physics
Publish: 2025-09-15 20:33:10 UTC
#128 Understanding Prompt Management in GitHub Repositories: A Call for Best Practices
Authors: [Hao Li](https://arxiv.org/search/?searchtype=author&query=Hao Li), [Hicham Masri](https://arxiv.org/search/?searchtype=author&query=Hicham Masri), [Filipe R. Cogo](https://arxiv.org/search/?searchtype=author&query=Filipe R. Cogo), [Abdul Ali Bangash](https://arxiv.org/search/?searchtype=author&query=Abdul Ali Bangash), [Bram Adams](https://arxiv.org/search/?searchtype=author&query=Bram Adams), [Ahmed E. Hassan](https://arxiv.org/search/?searchtype=author&query=Ahmed E. Hassan)
The rapid adoption of foundation models (e.g., large language models) has given rise to promptware, i.e., software built using natural language prompts. Effective management of prompts, such as organization and quality assurance, is essential yet challenging. In this study, we perform an empirical analysis of 24,800 open-source prompts from 92 GitHub repositories to investigate prompt management practices and quality attributes. Our findings reveal critical challenges such as considerable inconsistencies in prompt formatting, substantial internal and external prompt duplication, and frequent readability and spelling issues. Based on these findings, we provide actionable recommendations for developers to enhance the usability and maintainability of open-source prompts within the rapidly evolving promptware ecosystem.
Subjects: Software Engineering, Artificial Intelligence
Publish: 2025-09-15 20:18:22 UTC
#129 Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML
Authors: [Yash Mundhra](https://arxiv.org/search/?searchtype=author&query=Yash Mundhra), [Max Valk](https://arxiv.org/search/?searchtype=author&query=Max Valk), [Maliheh Izadi](https://arxiv.org/search/?searchtype=author&query=Maliheh Izadi)
Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML’s proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
Subjects: Software Engineering, Artificial Intelligence
Publish: 2025-09-15 19:39:26 UTC
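The abstract of #129 names the build@k metric but does not define it. As a rough illustration only, the sketch below assumes build@k takes the same form as the widely used unbiased pass@k estimator, with "compiles and integrates" standing in for "passes the tests". The function name `build_at_k` and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb

def build_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled generations builds.

    n: generations sampled for one task
    c: how many of those n compiled and integrated successfully
    k: evaluation budget (k <= n)
    Assumption: same form as the unbiased pass@k estimator,
    1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one successful build
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 generations for a task, 3 of which build.
print(round(build_at_k(10, 3, 1), 3))  # 0.3
print(round(build_at_k(10, 3, 5), 3))  # 0.917
```

Under this assumed definition, a per-benchmark score would be the mean of `build_at_k` over all tasks.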
#130 Evaluating the printability of stl files with ML
Authors: [Janik Henn](https://arxiv.org/search/?searchtype=author&query=Janik Henn), [Adrian Hauptmannl](https://arxiv.org/search/?searchtype=author&query=Adrian Hauptmannl), [Hamza A. A. Gardi](https://arxiv.org/search/?searchtype=author&query=Hamza A. A. Gardi)
3D printing has long been a technology for industry professionals and enthusiasts willing to tinker or even build their own machines. This stands in stark contrast to today’s market, where recent developments have prioritized ease of use to attract a broader audience. Slicing software nowadays has a few ways to sanity-check the input file as well as the output gcode. Our approach introduces a novel layer of support by training an AI model to detect common issues in 3D models. The goal is to assist less experienced users by identifying, before printing even begins, features that are likely to cause print failures due to difficult-to-print geometries.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-15 19:37:00 UTC
#131 Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization
Author: [Mohamed Zayaan S](https://arxiv.org/search/?searchtype=author&query=Mohamed Zayaan S)
Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence, namely robust, sample-efficient learning, stems from an understanding of causal mechanisms. In this work, we introduce Causal-Symbolic Meta-Learning (CSML), a novel framework that learns to infer the latent causal structure of a task distribution. CSML comprises three key modules: a perception module that maps raw inputs to disentangled symbolic representations; a differentiable causal induction module that discovers the underlying causal graph governing these symbols; and a graph-based reasoning module that leverages this graph to make predictions. By meta-learning a shared causal world model across a distribution of tasks, CSML can rapidly adapt to novel tasks, including those requiring reasoning about interventions and counterfactuals, from only a handful of examples. We introduce CausalWorld, a new physics-based benchmark designed to test these capabilities. Our experiments show that CSML dramatically outperforms state-of-the-art meta-learning and neuro-symbolic baselines, particularly on tasks demanding true causal inference.
Subjects: Machine Learning, Artificial Intelligence, Machine Learning
Publish: 2025-09-15 19:28:09 UTC
#132 Amulet: a Python Library for Assessing Interactions Among ML Defenses and Risks
Authors: [Asim Waheed](https://arxiv.org/search/?searchtype=author&query=Asim Waheed), [Vasisht Duddu](https://arxiv.org/search/?searchtype=author&query=Vasisht Duddu), [Rui Zhang](https://arxiv.org/search/?searchtype=author&query=Rui Zhang), [Sebastian Szyller](https://arxiv.org/search/?searchtype=author&query=Sebastian Szyller), [N. Asokan](https://arxiv.org/search/?searchtype=author&query=N. Asokan)
ML models are susceptible to risks to security, privacy, and fairness. Several defenses are designed to protect against their intended risks, but can inadvertently affect susceptibility to other unrelated risks, known as unintended interactions. Several jurisdictions are preparing ML regulatory frameworks that require ML practitioners to assess the susceptibility of ML models to different risks. There is a need for a library for evaluating unintended interactions that can be used by (a) practitioners to evaluate unintended interactions at scale prior to model deployment and (b) researchers to design defenses which do not suffer from an unintended increase in unrelated risks. Ideally, such a library should be i) comprehensive by including representative attacks, defenses and metrics for different risks, ii) extensible to new modules due to its modular design, iii) consistent with a user-friendly API template for inputs and outputs, iv) applicable to evaluate previously unexplored unintended interactions. We present AMULET, a Python library that covers risks to security, privacy, and fairness, which satisfies all these requirements. AMULET can be used to evaluate unexplored unintended interactions, compare effectiveness between defenses or attacks, and include new attacks and defenses.
Subjects: Cryptography and Security, Artificial Intelligence
Publish: 2025-09-15 19:27:46 UTC
#133 GhostNetV3-Small: A Tailored Architecture and Comparative Study of Distillation Strategies for Tiny Images
Authors: [Florian Zager](https://arxiv.org/search/?searchtype=author&query=Florian Zager), [Hamza A. A. Gardi](https://arxiv.org/search/?searchtype=author&query=Hamza A. A. Gardi)
Deep neural networks have achieved remarkable success across a range of tasks; however, their computational demands often make them unsuitable for deployment on resource-constrained edge devices. This paper explores strategies for compressing and adapting models to enable efficient inference in such environments. We focus on GhostNetV3, a state-of-the-art architecture for mobile applications, and propose GhostNetV3-Small, a modified variant designed to perform better on low-resolution inputs such as those in the CIFAR-10 dataset. In addition to architectural adaptation, we provide a comparative evaluation of knowledge distillation techniques, including traditional knowledge distillation, teacher assistants, and teacher ensembles. Experimental results show that GhostNetV3-Small significantly outperforms the original GhostNetV3 on CIFAR-10, achieving an accuracy of 93.94%. Contrary to expectations, all examined distillation strategies led to reduced accuracy compared to baseline training. These findings indicate that architectural adaptation can be more impactful than distillation in small-scale image classification tasks, highlighting the need for further research on effective model design and advanced distillation techniques for low-resolution domains.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
Publish: 2025-09-15 19:19:09 UTC
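For readers unfamiliar with the "traditional knowledge distillation" that #133 compares against, the following is a minimal sketch of the classic temperature-softened distillation loss for a single example. The abstract does not give the paper's loss or hyperparameters, so the function names, the temperature of 4.0, and the mixing weight `alpha` are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy(student, label).

    temperature and alpha are hypothetical defaults, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Ordinary cross-entropy of the student against the hard label (temperature 1)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Made-up logits for a 3-class example whose true class is index 2.
print(distillation_loss([0.5, 1.0, 2.0], [0.1, 0.9, 3.0], label=2))
```

The `T^2` factor is the usual correction that keeps the soft-target gradient on the same scale as the hard-label term when the temperature changes.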
#134 Geometric Red-Teaming for Robotic Manipulation
Authors: [Divyam Goel](https://arxiv.org/search/?searchtype=author&query=Divyam Goel), [Yufei Wang](https://arxiv.org/search/?searchtype=author&query=Yufei Wang), [Tiancheng Wu](https://arxiv.org/search/?searchtype=author&query=Tiancheng Wu), [Guixiu Qiao](https://arxiv.org/search/?searchtype=author&query=Guixiu Qiao), [Pavel Piliptchak](https://arxiv.org/search/?searchtype=author&query=Pavel Piliptchak), [David Held](https://arxiv.org/search/?searchtype=author&query=David Held), [Zackory Erickson](https://arxiv.org/search/?searchtype=author&query=Zackory Erickson)
Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation. We introduce Geometric Red-Teaming (GRT), a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes: structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies. The method integrates a Jacobian field-based deformation model with a gradient-free, simulator-in-the-loop optimization strategy. Across insertion, articulation, and grasping tasks, GRT consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks. By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general purpose framework for structured, object-centric robustness evaluation in robotic manipulation. We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement. Finally, we validate both red-teaming and blue-teaming results with a real robotic arm, observing that simulated CrashShapes reduce task success from 90% to as low as 22.5%, and that blue-teaming recovers performance to up to 90% on the corresponding real-world geometry, closely matching simulation outcomes. Videos and code can be found on our project website: https://georedteam.github.io/
Subjects: Robotics, Artificial Intelligence, Machine Learning
Publish: 2025-09-15 19:12:26 UTC
#135 MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
Authors: [Matteo Marcuzzo](https://arxiv.org/search/?searchtype=author&query=Matteo Marcuzzo), [Alessandro Zangari](https://arxiv.org/search/?searchtype=author&query=Alessandro Zangari), [Andrea Albarelli](https://arxiv.org/search/?searchtype=author&query=Andrea Albarelli), [Jose Camacho-Collados](https://arxiv.org/search/?searchtype=author&query=Jose Camacho-Collados), [Mohammad Taher Pilehvar](https://arxiv.org/search/?searchtype=author&query=Mohammad Taher Pilehvar)
As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale, not reasoning ability, is the primary driver of performance.
Subjects: Computation and Language, Artificial Intelligence
Publish: 2025-09-15 19:06:10 UTC
#136 An integrated process for design and control of lunar robotics using AI and simulation
Authors: [Daniel Lindmark](https://arxiv.org/search/?searchtype=author&query=Daniel Lindmark), [Jonas Andersson](https://arxiv.org/search/?searchtype=author&query=Jonas Andersson), [Kenneth Bodin](https://arxiv.org/search/?searchtype=author&query=Kenneth Bodin), [Tora Bodin](https://arxiv.org/search/?searchtype=author&query=Tora Bodin), [Hugo Börjesson](https://arxiv.org/search/?searchtype=author&query=Hugo Börjesson), [Fredrik Nordfeldth](https://arxiv.org/search/?searchtype=author&query=Fredrik Nordfeldth), [Martin Servin](https://arxiv.org/search/?searchtype=author&query=Martin Servin)
We envision an integrated process for developing lunar construction equipment, where physical design and control are explored in parallel. In this paper, we describe a technical framework that supports this process. It relies on OpenPLX, a readable/writable declarative language that links CAD models and autonomous systems to high-fidelity, real-time 3D simulations of contacting multibody dynamics, machine-regolith interaction forces, and non-ideal sensors. To demonstrate its capabilities, we present two case studies, including an autonomous lunar rover that combines a vision-language model for navigation with a reinforcement learning-based control policy for locomotion.
Subjects: Robotics, Artificial Intelligence
Publish: 2025-09-15 19:02:30 UTC
#137 Enhancing Smart Farming Through Federated Learning: A Secure, Scalable, and Efficient Approach for AI-Driven Agriculture
Authors: [Ritesh Janga](https://arxiv.org/search/?searchtype=author&query=Ritesh Janga), [Rushit Dave](https://arxiv.org/search/?searchtype=author&query=Rushit Dave)
The agricultural sector is undergoing a transformation with the integration of advanced technologies, particularly in data-driven decision-making. This work proposes a federated learning framework for smart farming, aiming to develop a scalable, efficient, and secure solution for crop disease detection tailored to the environmental and operational conditions of Minnesota farms. By maintaining sensitive farm data locally and enabling collaborative model updates, our proposed framework seeks to achieve high accuracy in crop disease classification without compromising data privacy. We outline a methodology involving data collection from Minnesota farms, application of local deep learning algorithms, transfer learning, and a central aggregation server for model refinement, aiming to achieve improved accuracy in disease detection, good generalization across agricultural scenarios, lower costs in communication and training time, and earlier identification and intervention against diseases in future implementations. We outline a methodology and anticipated outcomes, setting the stage for empirical validation in subsequent studies. This work comes in a context where more and more demand for data-driven interpretations in agriculture has to be weighed with concerns about privacy from farms that are hesitant to share their operational data. This will be important to provide a secure and efficient disease detection method that can finally revolutionize smart farming systems and solve local agricultural problems with data confidentiality. In doing so, this paper bridges the gap between advanced machine learning techniques and the practical, privacy-sensitive needs of farmers in Minnesota and beyond, leveraging the benefits of federated learning.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-15 18:57:32 UTC
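The "central aggregation server" in #137 is described only at a high level. One common way such a server combines local models is FedAvg-style weighted averaging, sketched below under that assumption. The abstract does not say which aggregation rule the authors intend, and the function name, the farm updates, and the sample counts are all made up for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style aggregation).

    client_weights: one flat list of floats per client (same length for all)
    client_sizes:   number of local training samples per client
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    aggregated = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Clients with more local data contribute proportionally more.
            aggregated[i] += w * size / total
    return aggregated

# Three hypothetical farms; raw images never leave the farm, only parameters do.
farm_updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
farm_samples = [100, 300, 600]
global_weights = federated_average(farm_updates, farm_samples)
print([round(w, 6) for w in global_weights])  # [0.7, 0.9]
```

In a full round, the server would send `global_weights` back to each farm for further local training before aggregating again.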
#138 Linear Dimensionality Reduction for Word Embeddings in Tabular Data Classification
Authors: [Liam Ressel](https://arxiv.org/search/?searchtype=author&query=Liam Ressel), [Hamza A. A. Gardi](https://arxiv.org/search/?searchtype=author&query=Hamza A. A. Gardi)
The Engineers’ Salary Prediction Challenge requires classifying salary categories into three classes based on tabular data. The job description is represented as a 300-dimensional word embedding incorporated into the tabular features, drastically increasing dimensionality. Additionally, the limited number of training samples makes classification challenging. Linear dimensionality reduction of word embeddings for tabular data classification remains underexplored. This paper studies Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). We show that PCA, with an appropriate subspace dimension, can outperform raw embeddings. LDA without regularization performs poorly due to covariance estimation errors, but applying shrinkage improves performance significantly, even with only two dimensions. We propose Partitioned-LDA, which splits embeddings into equal-sized blocks and performs LDA separately on each, thereby reducing the size of the covariance matrices. Partitioned-LDA outperforms regular LDA and, combined with shrinkage, achieves top-10 accuracy on the competition public leaderboard. This method effectively enhances word embedding performance in tabular data classification with limited training samples.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-15 18:19:00 UTC
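To make the covariance argument in #138 concrete, here is a short worked calculation. The abstract gives no formulas, so the shrinkage form below (a convex combination with a scaled identity target, one common choice) and the block count $B = 10$ are assumptions for illustration only. The dimension $d = 300$ and the three classes come from the abstract.

```latex
% Assumed shrinkage estimator: pull the sample covariance toward a scaled identity.
\hat{\Sigma}_{\text{shrunk}} = (1-\alpha)\,\hat{\Sigma} + \alpha\,\frac{\operatorname{tr}(\hat{\Sigma})}{d}\,I_d,
\qquad 0 \le \alpha \le 1 .

% Free parameters in one full d x d symmetric covariance matrix, d = 300:
\frac{d(d+1)}{2} = \frac{300 \cdot 301}{2} = 45150 .

% Free parameters with B equal blocks of size d/B (here B = 10, so 30 per block):
B \cdot \frac{(d/B)(d/B+1)}{2} = 10 \cdot \frac{30 \cdot 31}{2} = 4650 .
```

With $C = 3$ classes, LDA yields at most $C - 1 = 2$ discriminant directions, which matches the abstract's "only two dimensions". If each block contributes its own two directions, a partitioned variant would output up to $2B$ features, though the abstract does not state this.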
#139 Integrating Attention-Enhanced LSTM and Particle Swarm Optimization for Dynamic Pricing and Replenishment Strategies in Fresh Food Supermarkets #139 将注意力增强的 LSTM 和粒子群优化集成在生鲜超市的动态定价和补货策略中
Authors: [Xianchen Liu](https://arxiv.org/search/?searchtype=author&query=Xianchen Liu), [Tianhui Zhang](https://arxiv.org/search/?searchtype=author&query=Tianhui Zhang), [Xinyu Zhang](https://arxiv.org/search/?searchtype=author&query=Xinyu Zhang), [Lingmin Hou](https://arxiv.org/search/?searchtype=author&query=Lingmin Hou), [Zhen Guo](https://arxiv.org/search/?searchtype=author&query=Zhen Guo), [Yuanhao Tian](https://arxiv.org/search/?searchtype=author&query=Yuanhao Tian), [Yang Liu](https://arxiv.org/search/?searchtype=author&query=Yang Liu)
This paper presents a novel approach to optimizing pricing and replenishment strategies in fresh food supermarkets by combining Long Short-Term Memory (LSTM) networks with Particle Swarm Optimization (PSO). The LSTM model, enhanced with an attention mechanism, is used to predict sales volumes, pricing trends, and spoilage rates over a seven-day period. The predictions generated by the LSTM model serve as inputs for the PSO algorithm, which iteratively optimizes pricing and replenishment strategies to maximize profitability while adhering to inventory constraints. The integration of cost-plus pricing allows for dynamic adjustments based on fixed and variable costs, ensuring real-time adaptability to market fluctuations. The framework not only maximizes profits but also reduces food waste, contributing to more sustainable supermarket operations. The attention mechanism enhances the interpretability of the LSTM model by identifying key time points and factors influencing sales, improving decision-making accuracy. This methodology bridges the gap between predictive modeling and optimization, offering a scalable solution for dynamic pricing and inventory management in fresh food retail and other industries dealing with perishable goods.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-15 18:07:44 UTC
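The forecast-then-optimize coupling described in this abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: `forecast_demand` is a hypothetical linear stand-in for the LSTM forecast, the unit cost, spoilage rate, and bounds are invented, and the PSO uses textbook inertia and acceleration coefficients rather than the authors' settings.

```python
import random

def forecast_demand(price):
    # Hypothetical stand-in for the LSTM forecast: demand falls linearly with price.
    return max(0.0, 120.0 - 8.0 * price)

def profit(markup, quantity, unit_cost=5.0, spoilage=0.1):
    price = unit_cost * (1.0 + markup)        # cost-plus pricing
    sellable = quantity * (1.0 - spoilage)    # stock remaining after spoilage
    sold = min(sellable, forecast_demand(price))
    return price * sold - unit_cost * quantity

def pso(objective, bounds, n_particles=30, n_iters=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # personal bests
    best_val = [objective(*p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]     # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp to the box so inventory and markup limits are respected.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(*pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val

# Decision variables: markup in [0, 2], replenishment quantity in [0, 150].
best, best_profit = pso(profit, bounds=[(0.0, 2.0), (0.0, 150.0)])
```

Clamping positions to the bounds is one simple way to enforce box constraints; the abstract does not say how the authors handle inventory constraints, so a penalty term would be an equally plausible choice.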
#140 An End to End Edge to Cloud Data and Analytics Strategy
Authors: [Vijay Kumar Butte](https://arxiv.org/search/?searchtype=author&query=Vijay Kumar Butte), [Sujata Butte](https://arxiv.org/search/?searchtype=author&query=Sujata Butte)
There is an exponential growth of connected Internet of Things (IoT) devices. These have given rise to applications that rely on real-time data to make critical decisions quickly. Enterprises today are adopting cloud at a rapid pace. There is a critical need to develop secure and efficient strategies and architectures to best leverage the capabilities of cloud and edge assets. This paper provides an end-to-end secure edge-to-cloud data and analytics strategy. To enable real-life implementation, the paper provides reference architectures for the device layer, edge layer, and cloud layer.
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computational Engineering, Finance, and Science; Machine Learning; Software Engineering
Publish: 2025-09-15 16:04:10 UTC
#141 C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction
Authors: [Yuting Liu](https://arxiv.org/search/?searchtype=author&query=Yuting Liu), [Qiang Zhou](https://arxiv.org/search/?searchtype=author&query=Qiang Zhou), [Hanzhe Li](https://arxiv.org/search/?searchtype=author&query=Hanzhe Li), [Chenqi Gong](https://arxiv.org/search/?searchtype=author&query=Chenqi Gong), [Jingjing Gu](https://arxiv.org/search/?searchtype=author&query=Jingjing Gu)
Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics between crowd flow and POI distribution, coupled with latent spurious causality, poses challenges to applying NCDEs for long-term urban crowd flow prediction. To this end, we propose Causal-aware Collaborative neural CDE (C3DE) to model the long-term dynamics of crowd flow. Specifically, we introduce a dual-path NCDE as the backbone to effectively capture the asynchronous evolution of collaborative signals across multiple time scales. Then, we design a dynamic correction mechanism with a counterfactual-based causal effect estimator to quantify the causal impact of POIs on crowd flow and minimize the accumulation of spurious correlations. Finally, we leverage a predictor for long-term prediction with the fused collaborative signals of POI and crowd flow. Extensive experiments on three real-world datasets demonstrate the superior performance of C3DE, particularly in cities with notable flow fluctuations.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-15 07:24:39 UTC
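For readers unfamiliar with NCDEs, the generic single-path form is sketched below. This is the standard formulation only, not the dual-path C3DE model, whose details the abstract does not give. The symbols are assumptions introduced here: $X$ is a continuous interpolation of the observed series, $z$ is the hidden state, and $f_\theta$ is a learned vector field.

```latex
% Standard neural controlled differential equation (generic form, not C3DE itself).
% X : continuous path interpolating the observations (e.g., a cubic spline)
% z : hidden state,  f_\theta : learned matrix-valued vector field
z(t) = z(t_0) + \int_{t_0}^{t} f_\theta\bigl(z(s)\bigr)\,\mathrm{d}X(s),
\qquad t \in (t_0, T]
```

Because the hidden state is driven by the path $X$ rather than by discrete steps, irregular or widely spaced samples enter only through the interpolation. That is plausibly why the authors turn to NCDEs for long sampling intervals.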
#142 Digital Voices of Survival: From Social Media Disclosures to Support Provisions for Domestic Violence Victims
Authors: [Kanlun Wang](https://arxiv.org/search/?searchtype=author&query=Kanlun Wang), [Zhe Fu](https://arxiv.org/search/?searchtype=author&query=Zhe Fu), [Wangjiaxuan Xin](https://arxiv.org/search/?searchtype=author&query=Wangjiaxuan Xin), [Lina Zhou](https://arxiv.org/search/?searchtype=author&query=Lina Zhou), [Shashi Kiran Chandrappa](https://arxiv.org/search/?searchtype=author&query=Shashi Kiran Chandrappa)
Domestic Violence (DV) is a pervasive public health problem characterized by patterns of coercive and abusive behavior within intimate relationships. With the rise of social media as a key outlet for DV victims to disclose their experiences, online self-disclosure has emerged as a critical yet underexplored avenue for support-seeking. In addition, existing research lacks a comprehensive and nuanced understanding of DV self-disclosure, support provisions, and their connections. To address these gaps, this study proposes a novel computational framework for modeling DV support-seeking behavior alongside community support mechanisms. The framework consists of four key components: self-disclosure detection, post clustering, topic summarization, and support extraction and mapping. We implement and evaluate the framework with data collected from relevant social media communities. Our findings not only advance existing knowledge on DV self-disclosure and online support provisions but also enable victim-centered digital interventions.
Subjects: Social and Information Networks, Artificial Intelligence, Computers and Society, Information Retrieval
Publish: 2025-09-15 05:32:42 UTC
#143 Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach
Author: [Jiyong Ma](https://arxiv.org/search/?searchtype=author&query=Jiyong Ma)
In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution depends on the time step and the corresponding value vector. This analysis may offer a new explanation of the scaled-dot-product function or softmax function used in transformer architectures [1]. Another explanation, inspired by [4], is based on the maximum entropy approach in natural language processing [5]. In this approach, a query vector and key vectors are used to derive the feature functions for the maximum entropy model.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-09-14 19:52:32 UTC
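For reference, the function this paper sets out to derive is the standard scaled dot-product attention. The sketch below shows only that standard form for a single query, in plain Python. It does not reproduce the paper's Gaussian or maximum-entropy derivation, which the abstract does not spell out.

```python
import math

def scaled_dot_product_attention(query, keys, values):
    """Return (weights, output) for one query over lists of key/value vectors."""
    d = len(query)
    # Dot products scaled by sqrt(d), as in standard transformer attention.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output is the attention-weighted average of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output
```

With two identical keys the weights are uniform, so the output is the plain mean of the two value vectors.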
#144 Domain Adaptive SAR Wake Detection: Leveraging Similarity Filtering and Memory Guidance
Authors: [He Gao](https://arxiv.org/search/?searchtype=author&query=He Gao), [Baoxiang Huang](https://arxiv.org/search/?searchtype=author&query=Baoxiang Huang), [Milena Radenkovic](https://arxiv.org/search/?searchtype=author&query=Milena Radenkovic), [Borui Li](https://arxiv.org/search/?searchtype=author&query=Borui Li), [Ge Chen](https://arxiv.org/search/?searchtype=author&query=Ge Chen)
Synthetic Aperture Radar (SAR), with its all-weather and wide-area observation capabilities, serves as a crucial tool for wake detection. However, due to its complex imaging mechanism, wake features in SAR images often appear abstract and noisy, posing challenges for accurate annotation. In contrast, optical images provide more distinct visual cues, but models trained on optical data suffer from performance degradation when applied to SAR images due to domain shift. To address this cross-modal domain adaptation challenge, we propose a Similarity-Guided and Memory-Guided Domain Adaptation (termed SimMemDA) framework for unsupervised domain adaptive ship wake detection via instance-level feature similarity filtering and feature memory guidance. Specifically, to alleviate the visual discrepancy between optical and SAR images, we first utilize WakeGAN to perform style transfer on optical images, generating pseudo-images close to the SAR style. Then, an instance-level feature similarity filtering mechanism is designed to identify and prioritize source samples with target-like distributions, minimizing negative transfer. Meanwhile, a Feature-Confidence Memory Bank combined with a K-nearest neighbor confidence-weighted fusion strategy is introduced to dynamically calibrate pseudo-labels in the target domain, improving the reliability and stability of pseudo-labels. Finally, the framework further enhances generalization through region-mixed training, strategically combining source annotations with calibrated target pseudo-labels. Experimental results demonstrate that the proposed SimMemDA method can improve the accuracy and robustness of cross-modal ship wake detection tasks, validating the effectiveness and feasibility of the proposed method.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-14 08:35:39 UTC
#145 PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
Authors: [Wanru Zhuang](https://arxiv.org/search/?searchtype=author&query=Wanru Zhuang), [Wenbo Li](https://arxiv.org/search/?searchtype=author&query=Wenbo Li), [Zhibin Lan](https://arxiv.org/search/?searchtype=author&query=Zhibin Lan), [Xu Han](https://arxiv.org/search/?searchtype=author&query=Xu Han), [Peng Li](https://arxiv.org/search/?searchtype=author&query=Peng Li), [Jinsong Su](https://arxiv.org/search/?searchtype=author&query=Jinsong Su)
Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMT-Bench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-14 08:33:23 UTC
#146 GraphDerm: Fusing Imaging, Physical Scale, and Metadata in a Population-Graph Classifier for Dermoscopic Lesions
Authors: [Mehdi Yousefzadeh](https://arxiv.org/search/?searchtype=author&query=Mehdi Yousefzadeh), [Parsa Esfahanian](https://arxiv.org/search/?searchtype=author&query=Parsa Esfahanian), [Sara Rashidifar](https://arxiv.org/search/?searchtype=author&query=Sara Rashidifar), [Hossein Salahshoor Gavalan](https://arxiv.org/search/?searchtype=author&query=Hossein Salahshoor Gavalan), [Negar Sadat Rafiee Tabatabaee](https://arxiv.org/search/?searchtype=author&query=Negar Sadat Rafiee Tabatabaee), [Saeid Gorgin](https://arxiv.org/search/?searchtype=author&query=Saeid Gorgin), [Dara Rahmati](https://arxiv.org/search/?searchtype=author&query=Dara Rahmati), [Maryam Daneshpazhooh](https://arxiv.org/search/?searchtype=author&query=Maryam Daneshpazhooh)
Introduction. Dermoscopy aids melanoma triage, yet image-only AI often ignores patient metadata (age, sex, site) and the physical scale needed for geometric analysis. We present GraphDerm, a population-graph framework that fuses imaging, millimeter-scale calibration, and metadata for multiclass dermoscopic classification, to the best of our knowledge the first ISIC-scale application of GNNs to dermoscopy. Methods. We curate ISIC 2018/2019, synthesize ruler-embedded images with exact masks, and train U-Nets (SE-ResNet-18) for lesion and ruler segmentation. Pixels-per-millimeter are regressed from the ruler-mask two-point correlation via a lightweight 1D-CNN. From lesion masks we compute real-scale descriptors (area, perimeter, radius of gyration). Node features use EfficientNet-B3; edges encode metadata/geometry similarity (fully weighted or thresholded). A spectral GNN performs semi-supervised node classification; an image-only ANN is the baseline. Results. Ruler and lesion segmentation reach Dice 0.904 and 0.908; scale regression attains MAE 1.5 px (RMSE 6.6). The graph attains AUC 0.9812, with a thresholded variant using about 25% of edges preserving AUC 0.9788 (vs. 0.9440 for the image-only baseline); per-class AUCs typically fall in the 0.97-0.99 range. Conclusion. Unifying calibrated scale, lesion geometry, and metadata in a population graph yields substantial gains over image-only pipelines on ISIC-2019. Sparser graphs retain near-optimal accuracy, suggesting efficient deployment. Scale-aware, graph-based AI is a promising direction for dermoscopic decision support; future work will refine learned edge semantics and evaluate on broader curated benchmarks.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-14 08:11:54 UTC
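The real-scale descriptors named in the Methods (area, perimeter, radius of gyration) can be computed from a binary lesion mask once a pixels-per-millimeter estimate is available. The sketch below is one plausible way to do so, not the authors' code. The function name is hypothetical, and the perimeter estimator (counting exposed pixel edges under 4-connectivity) is an assumption, since the abstract does not specify one.

```python
import math

def real_scale_descriptors(mask, pixels_per_mm):
    """Return (area_mm2, perimeter_mm, radius_of_gyration_mm) for a binary mask.

    `mask` is a list of rows; truthy entries mark lesion pixels.
    """
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no lesion pixels")
    n = len(pts)
    # Each pixel covers (1 / pixels_per_mm)^2 square millimeters.
    area_mm2 = n / pixels_per_mm ** 2
    # Perimeter: count pixel edges that border a non-lesion cell (4-connectivity).
    lesion = set(pts)
    exposed_edges = sum((r + dr, c + dc) not in lesion
                        for r, c in pts
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    perimeter_mm = exposed_edges / pixels_per_mm
    # Radius of gyration: RMS distance of lesion pixels from their centroid.
    cr = sum(r for r, _ in pts) / n
    cc = sum(c for _, c in pts) / n
    rg_px = math.sqrt(sum((r - cr) ** 2 + (c - cc) ** 2 for r, c in pts) / n)
    return area_mm2, perimeter_mm, rg_px / pixels_per_mm
```

Edge counting overestimates the perimeter of curved boundaries, so a contour-based estimator may be preferable in practice. The point of the sketch is only that all three quantities become physically meaningful after dividing by the calibrated scale.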
#147 Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
Authors: [Jinghua Zhao](https://arxiv.org/search/?searchtype=author&query=Jinghua Zhao), [Hang Su](https://arxiv.org/search/?searchtype=author&query=Hang Su), [Lichun Fan](https://arxiv.org/search/?searchtype=author&query=Lichun Fan), [Zhenbo Luo](https://arxiv.org/search/?searchtype=author&query=Zhenbo Luo), [Jian Luan](https://arxiv.org/search/?searchtype=author&query=Jian Luan), [Hui Wang](https://arxiv.org/search/?searchtype=author&query=Hui Wang), [Haoqin Sun](https://arxiv.org/search/?searchtype=author&query=Haoqin Sun), [Yong Qin](https://arxiv.org/search/?searchtype=author&query=Yong Qin)
We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought for audio question answering. The framework efficiently leverages an existing high-quality dataset through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Integrated with GRPO training, these strategies enable the model to learn more effectively from informative samples. Experiments on MMAU-mini and MMAR demonstrate that Omni-CLST achieves competitive accuracy (73.80% on MMAU-mini) and establishes a new state of the art (64.30% on MMAR), highlighting its robustness and generalization capability in multimodal audio-language understanding.
Subjects: Sound, Artificial Intelligence, Audio and Speech Processing
Publish: 2025-09-14 06:54:12 UTC
#148 A Variational Physics-Informed Neural Network Framework Using Petrov-Galerkin Method for Solving Singularly Perturbed Boundary Value Problems
Authors: [Vijay Kumar](https://arxiv.org/search/?searchtype=author&query=Vijay Kumar), [Gautam Singh](https://arxiv.org/search/?searchtype=author&query=Gautam Singh)
This work proposes a Variational Physics-Informed Neural Network (VPINN) framework that integrates the Petrov-Galerkin formulation with deep neural networks (DNNs) for solving one-dimensional singularly perturbed boundary value problems (BVPs) and parabolic partial differential equations (PDEs) involving one or two small parameters. The method adopts a nonlinear approximation in which the trial space is defined by neural network functions, while the test space is constructed from hat functions. The weak formulation is constructed using localized test functions, with interface penalty terms introduced to enhance numerical stability and accurately capture boundary layers. Dirichlet boundary conditions are imposed via hard constraints, and source terms are computed using automatic differentiation. Numerical experiments on benchmark problems demonstrate the effectiveness of the proposed method, showing significantly improved accuracy in both the L2 and maximum norms compared to the standard VPINN approach for one-dimensional singularly perturbed differential equations (SPDEs).
Subjects: Numerical Analysis, Artificial Intelligence
Publish: 2025-09-13 18:25:00 UTC
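Two ingredients named in this abstract, hat test functions and hard-constrained Dirichlet conditions, are easy to show in isolation. The sketch below assumes the unit interval and a common hard-constraint ansatz (a linear lift plus a bubble term that vanishes at both endpoints). Both helper names are hypothetical, and the authors' exact construction may differ.

```python
def hat(x, nodes, i):
    """Piecewise-linear hat function centred on the interior node nodes[i]."""
    left, mid, right = nodes[i - 1], nodes[i], nodes[i + 1]
    if left <= x <= mid:
        return (x - left) / (mid - left)
    if mid < x <= right:
        return (right - x) / (right - mid)
    return 0.0  # compact support: zero outside [left, right]

def hard_constrained(network, a, b):
    """Wrap `network` so the trial function satisfies u(0) = a and u(1) = b exactly."""
    def u(x):
        # The linear lift matches the boundary data; x * (1 - x) vanishes at 0 and 1,
        # so the network output cannot violate the boundary conditions.
        return a * (1.0 - x) + b * x + x * (1.0 - x) * network(x)
    return u

nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
u = hard_constrained(lambda x: 3.0 + x, a=1.0, b=2.0)  # any callable works as `network`
```

Because each hat function is supported on only two mesh cells, the weak-form residual tested against it is local. This matches the "localized test functions" the abstract mentions.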
#149 A Modern Look at Simplicity Bias in Image Classification Tasks
Authors: [Xiaoguang Chang](https://arxiv.org/search/?searchtype=author&query=Xiaoguang Chang), [Teng Wang](https://arxiv.org/search/?searchtype=author&query=Teng Wang), [Changyin Sun](https://arxiv.org/search/?searchtype=author&query=Changyin Sun)
The simplicity bias (SB) of neural networks, i.e., their tendency to represent simple functions, is a key factor in their generalization capabilities. Recent studies show that an excessive SB may harm performance on complex tasks, and the need for this bias varies across tasks. Many of these studies focus on simple models or synthetic tasks. It remains challenging to measure the SB in large models and little is known about the relevance of the SB to various image classification tasks. In this paper, we investigate the relationship between the SB in CLIP models and their performance across image classification tasks. First, we theoretically analyze the potential limitation of existing measures of complexity that have been used to characterize small models. To address this, we propose a frequency-aware measure capturing finer-grained SB differences. We validate this measure on CLIP models subjected to two recent SB-modulation methods, demonstrating that it is more informative and consistent than previous measures. Second, we examine the relation between the SB of those models and their performance across a range of image classification tasks, including zero-shot and fine-tuning settings. These experiments reveal a range of behaviors. For example, a stronger SB correlates with a better performance on OOD generalization than on adversarial robustness. These results highlight the benefits of aligning a model’s inductive biases with the characteristics of the target task.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Publish: 2025-09-13 02:33:57 UTC
#150 Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) for Diabetes Risk Prediction
Author: [Kenneth G. Young II](https://arxiv.org/search/?searchtype=author&query=Kenneth G. Young II)
The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an innovative machine learning framework that harnesses quantum-inspired techniques to predict diabetes risk with exceptional accuracy and efficiency. Utilizing the PIMA Indians Diabetes dataset augmented with 2,000 synthetic samples to mitigate class imbalance (total: 2,768 samples, 1,949 positives), QISICGM integrates a self-improving concept graph with a stacked ensemble comprising Random Forests (RF), Extra Trees (ET), transformers, convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs). This approach achieves an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699, outperforming traditional methods. Quantum-inspired elements, such as phase feature mapping and neighborhood sequence modeling, enrich feature representations, enabling CPU-efficient inference at 8.5 rows per second. This paper presents a detailed architecture, theoretical foundations, code insights, and performance evaluations, including visualizations from the outputs subfolder. The open-source implementation (v1.0.0) is available at https://github.com/keninayoung/QISICGM, positioning QISICGM as a potential benchmark for AI-assisted clinical triage in diabetes and beyond. Ultimately, this work emphasizes trustworthy AI through calibration, interpretability, and open-source reproducibility.
Subjects: Machine Learning, Artificial Intelligence, Quantum Physics
Publish: 2025-09-12 18:26:31 UTC
#151 Representation Learning on Large Non-Bipartite Transaction Networks using GraphSAGE
Authors: [Mihir Tare](https://arxiv.org/search/?searchtype=author&query=Mihir Tare), [Clemens Rattasits](https://arxiv.org/search/?searchtype=author&query=Clemens Rattasits), [Yiming Wu](https://arxiv.org/search/?searchtype=author&query=Yiming Wu), [Euan Wielewski](https://arxiv.org/search/?searchtype=author&query=Euan Wielewski)
Financial institutions increasingly require scalable tools to analyse complex transactional networks, yet traditional graph embedding methods struggle with dynamic, real-world banking data. This paper demonstrates the practical application of GraphSAGE, an inductive Graph Neural Network framework, to non-bipartite heterogeneous transaction networks within a banking context. Unlike transductive approaches, GraphSAGE scales well to large networks and can generalise to unseen nodes, which is critical for institutions working with temporally evolving transactional data. We construct a transaction network using anonymised customer and merchant transactions and train a GraphSAGE model to generate node embeddings. Our exploratory work on the embeddings reveals interpretable clusters aligned with geographic and demographic attributes. Additionally, we illustrate their utility in downstream classification tasks by applying them to a money mule detection model where using these embeddings improves the prioritisation of high-risk accounts. Beyond fraud detection, our work highlights the adaptability of this framework to banking-scale networks, emphasising its inductive capability, scalability, and interpretability. This study provides a blueprint for financial organisations to harness graph machine learning for actionable insights in transactional ecosystems.
Subjects: Machine Learning, Artificial Intelligence, Social and Information Networks
Publish: 2025-09-12 14:09:16 UTC
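GraphSAGE's inductive property comes from learning an aggregation function over sampled neighbours rather than a per-node embedding table. A minimal single-layer sketch with the mean aggregator is given below in plain Python. The weight matrices, node names, and sample size are illustrative assumptions, and nothing here reflects the authors' banking pipeline.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sage_mean_layer(node, features, neighbours, W_self, W_neigh,
                    sample_size=5, rng=None):
    """One GraphSAGE layer with mean aggregation for a single node."""
    rng = rng or random.Random(0)
    nbrs = neighbours.get(node, [])
    if len(nbrs) > sample_size:
        # Fixed-size neighbour sampling keeps per-node cost bounded on large graphs.
        nbrs = rng.sample(nbrs, sample_size)
    dim = len(features[node])
    if nbrs:
        mean = [sum(features[u][j] for u in nbrs) / len(nbrs) for j in range(dim)]
    else:
        mean = [0.0] * dim
    # Separate self and neighbour weights are equivalent to one matrix applied
    # to the concatenation [h_v, mean]; ReLU is the nonlinearity.
    h = [max(0.0, a + b)
         for a, b in zip(matvec(W_self, features[node]), matvec(W_neigh, mean))]
    # L2-normalize the embedding.
    norm = math.sqrt(sum(v * v for v in h))
    return [v / norm for v in h] if norm > 0 else h

features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [0.0, 4.0]}
neighbours = {"a": ["b", "c"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
embedding = sage_mean_layer("a", features, neighbours, identity, identity)
```

Because the layer needs only a node's features and its neighbours' features, a customer or merchant account that first appears after training can be embedded without retraining. This is the "generalise to unseen nodes" property the abstract emphasises.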
#152 Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions
Author: [Riyaadh Gani](https://arxiv.org/search/?searchtype=author&query=Riyaadh Gani)
Non-invasive glucose monitors often fail outside the lab because existing datasets ignore hardware noise, environmental drift, and person-to-person physiology. We introduce the first ultra-realistic near-infrared (NIR) simulator that injects 12-bit ADC quantisation, +/-0.1% LED ageing, photodiode dark noise, 15-45 °C temperature, 30-90% relative humidity, contact-pressure variation, Fitzpatrick I-VI melanin, and diurnal glucose excursions (dawn phenomenon). Using this platform (rho glucose-NIR = 0.21), we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), three physics-informed neural networks (PINNs), a selective radiative-transfer PINN, and a shallow DNN. Beer-Lambert achieves 13.6 mg/dL RMSE, 95.8% Clarke-A and 93.8% +/-15% accuracy with only 56 parameters and 0.01 ms inference, outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL). Results overturn the assumption that deeper PINNs dominate and supply an open, end-to-end reference stack for rapid prototyping of embedded optical glucose sensors.
Subjects: Image and Video Processing, Artificial Intelligence, Machine Learning
Publish: 2025-09-12 12:18:00 UTC
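The abstract describes its strongest method as Beer-Lambert features fed to ridge regression. The sketch below shows that idea at its smallest: one absorbance feature and a closed-form one-dimensional ridge fit with an unpenalized intercept. The function names and the single-feature setup are assumptions for illustration; the paper's model has 56 parameters and its feature engineering is not described here.

```python
import math

def absorbance(intensity, reference_intensity):
    """Beer-Lambert absorbance A = -log10(I / I0)."""
    return -math.log10(intensity / reference_intensity)

def fit_ridge_1d(xs, ys, lam=1e-3):
    """Closed-form ridge fit y ~ w * x + b, penalizing w only (not the intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / (sxx + lam)      # lam shrinks the slope toward zero
    b = y_bar - w * x_bar
    return w, b
```

In use, each NIR reading would be converted with `absorbance` and the resulting values regressed against reference glucose. With so few parameters, the regularization strength `lam` is essentially the only thing to tune.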
#153 OnlineHOI: Towards Online Human-Object Interaction Generation and Perception
Authors: [Yihong Ji](https://arxiv.org/search/?searchtype=author&query=Yihong Ji), [Yunze Liu](https://arxiv.org/search/?searchtype=author&query=Yunze Liu), [Yiyao Zhuo](https://arxiv.org/search/?searchtype=author&query=Yiyao Zhuo), [Weijiang Yu](https://arxiv.org/search/?searchtype=author&query=Weijiang Yu), [Fei Ma](https://arxiv.org/search/?searchtype=author&query=Fei Ma), [Joshua Huang](https://arxiv.org/search/?searchtype=author&query=Joshua Huang), [Fei Yu](https://arxiv.org/search/?searchtype=author&query=Fei Yu)
The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba’s powerful modeling capabilities for streaming data and the memory mechanism’s efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
Publish: 2025-09-12 06:58:29 UTC
#154 Why and How Auxiliary Tasks Improve JEPA Representations #154 辅助任务为什么以及如何改进 JEPA 表示
Authors: [Jiacan Yu](https://arxiv.org/search/?searchtype=author&query=Jiacan Yu), [Siyi Chen](https://arxiv.org/search/?searchtype=author&query=Siyi Chen), [Mingrui Liu](https://arxiv.org/search/?searchtype=author&query=Mingrui Liu), [Nono Horiuchi](https://arxiv.org/search/?searchtype=author&query=Nono Horiuchi), [Vladimir Braverman](https://arxiv.org/search/?searchtype=author&query=Vladimir Braverman), [Zicheng Xu](https://arxiv.org/search/?searchtype=author&query=Zicheng Xu), [Dan Haramati](https://arxiv.org/search/?searchtype=author&query=Dan Haramati), [Randall Balestriero](https://arxiv.org/search/?searchtype=author&query=Randall Balestriero) 作者:Jiacan Yu、Siyi Chen、Mingrui Liu、Nono Horiuchi、Vladimir Braverman、Zicheng Xu、Dan Haramati、Randall Balestriero
Joint-Embedding Predictive Architecture (JEPA) is increasingly used for visual representation learning and as a component in model-based RL, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with latent dynamics. We prove a No Unhealthy Representation Collapse theorem: in deterministic MDPs, if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., those that do not have the same transition dynamics or auxiliary label, must map to distinct latent representations. Thus, the auxiliary task anchors which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head generates a richer representation than training them separately. Our work indicates a path to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations. 联合嵌入预测架构 (JEPA) 越来越多地用于视觉表示学习和作为基于模型的 RL 的组件,但其行为仍然知之甚少。我们提供了一个简单、实用的 JEPA 变体的理论表征,该变体具有与潜在动力学联合训练的辅助回归头。我们证明了一个无不健康的表示崩溃定理:在确定性 MDP 中,如果训练将潜在过渡一致性损失和辅助回归损失都驱动到零,那么任何一对非等效观察值,即那些没有相同过渡动力学或辅助标签的观察值,都必须映射到不同的潜在表示。因此,辅助任务锚定了表示必须保留的区别。计数环境中的受控消融证实了该理论,并表明与辅助头一起训练 JEPA 模型比单独训练它们产生更丰富的表示。我们的工作指明了改进 JEPA 编码器的途径:使用辅助函数训练它们,该函数与转换动力学一起编码正确的等价关系。
Subjects: Machine Learning, Artificial Intelligence 主题: 机器学习 , 人工智能
Publish: 2025-09-12 05:28:29 UTC 发布时间: 2025-09-12 05:28:29 UTC
#155 Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics #155 像素中的幽默:评测大型多模态模型对网络漫画的理解
Authors: [Yuriel Ryan](https://arxiv.org/search/?searchtype=author&query=Yuriel Ryan), [Rui Yang Tan](https://arxiv.org/search/?searchtype=author&query=Rui Yang Tan), [Kenny Tsu Wei Choo](https://arxiv.org/search/?searchtype=author&query=Kenny Tsu Wei Choo), [Roy Ka-Wei Lee](https://arxiv.org/search/?searchtype=author&query=Roy Ka-Wei Lee) 作者:Yuriel Ryan、Rui Yang Tan、Kenny Tsu Wei Choo、Roy Ka-Wei Lee
Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions. 理解幽默是社会智能的一个核心方面,但它仍然是大型多模态模型 (LMM) 面临的重大挑战。我们介绍了 PixelHumor,这是一个包含 2,800 部带注释的多格漫画的基准数据集,旨在评估 LMM 解释多模态幽默和识别叙事序列的能力。使用最先进的 LMM 进行的实验揭示了巨大的差距:例如,顶级模型在分格排序方面的准确率仅为 61%,远低于人类表现。这凸显了当前模型在整合视觉和文本线索以实现连贯叙事和幽默理解方面存在严重局限性。通过提供一个严格的框架来评估多模态上下文和叙事推理,PixelHumor 旨在推动 LMM 的发展,以更好地参与自然的、具有社会意识的互动。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language 科目:计算机视觉与模式识别、人工智能、计算与语言
Publish: 2025-09-12 01:39:24 UTC 发布时间: 2025-09-12 01:39:24 UTC
#156 Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture #156 具有轻量级异常检测功能的模块化现场解决方案,用于农业可持续养分管理
Authors: [Abigail R. Cohen](https://arxiv.org/search/?searchtype=author&query=Abigail R. Cohen), [Yuming Sun](https://arxiv.org/search/?searchtype=author&query=Yuming Sun), [Zhihao Qin](https://arxiv.org/search/?searchtype=author&query=Zhihao Qin), [Harsh S. Muriki](https://arxiv.org/search/?searchtype=author&query=Harsh S. Muriki), [Zihao Xiao](https://arxiv.org/search/?searchtype=author&query=Zihao Xiao), [Yeonju Lee](https://arxiv.org/search/?searchtype=author&query=Yeonju Lee), [Matthew Housley](https://arxiv.org/search/?searchtype=author&query=Matthew Housley), [Andrew F. Sharkey](https://arxiv.org/search/?searchtype=author&query=Andrew F. Sharkey), [Rhuanito S. Ferrarezi](https://arxiv.org/search/?searchtype=author&query=Rhuanito S. Ferrarezi), [Jing Li](https://arxiv.org/search/?searchtype=author&query=Jing Li), [Lu Gan](https://arxiv.org/search/?searchtype=author&query=Lu Gan), [Yongsheng Chen](https://arxiv.org/search/?searchtype=author&query=Yongsheng Chen) 作者:Abigail R. Cohen、Yuming Sun、Zhihao Qin、Harsh S. Muriki、Zihao Xiao、Yeonju Lee、Matthew Housley、Andrew F. Sharkey、Rhuanito S. Ferrarezi、Jing Li、Lu Gan、Yongsheng Chen
Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R² 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability. 有效的养分管理对于作物生长和可持续资源消耗(例如氮、能源)至关重要。当前的方法需要冗长的分析,无法进行实时优化;同样,成像有助于快速表型分析,但可能是计算密集型的,从而阻止在资源限制下进行部署。本研究提出了一种灵活的分层管道,用于异常检测和状态估计(鲜重、干质量和组织营养物质),包括对跨越效率-准确性范围的方法进行全面的能量分析。通过三种处理(T1-100%、T2-50%和 T3-25%肥料强度)和多光谱成像(MSI)的养分消耗实验,我们开发了一种使用自动编码器(AE)进行预警的分层管道。此外,我们比较了两个不同复杂程度的状态估计模块以进行更详细的分析:植被指数 (VI) 特征与机器学习(随机森林,RF)和原始全图像深度学习(Vision Transformer,ViT)。结果表明,高效异常检测(移栽后 9 天 T3 样品净检测率为 73%),其能量远低于废氮中的隐含能量。状态估计模块显示出权衡,ViT 在磷和钙估计方面优于 RF(R² 0.61 vs. 0.58,0.48 vs. 0.35),但能源成本更高。通过我们的模块化管道,这项工作为边缘诊断提供了机会,并为农业可持续性提供了实际机会。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-11 21:14:35 UTC 发布时间: 2025-09-11 21:14:35 UTC
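The abstract of #156 compares the two status-estimation modules by R² (for example 0.61 vs. 0.58 on phosphorus). As a reminder of what that number measures, here is a minimal sketch that assumes the standard coefficient-of-determination definition, 1 − SS_res / SS_tot; the abstract does not say how the authors computed it, and the sample values below are hypothetical, not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical tissue-nutrient measurements and model estimates (illustration only).
measured = [2.0, 3.0, 4.0, 5.0]
estimated = [2.5, 2.5, 4.5, 4.5]
score = r_squared(measured, estimated)
print(f"R^2 = {score:.2f}")
```

A score of 1 means perfect estimates and 0 means no better than predicting the mean, which is why a gap such as 0.48 vs. 0.35 on calcium is a sizeable difference even though both values are modest.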
#157 RU-Net for Automatic Characterization of TRISO Fuel Cross Sections #157 RU-Net 用于自动表征 TRISO 燃料截面
Authors: [Lu Cai](https://arxiv.org/search/?searchtype=author&query=Lu Cai), [Fei Xu](https://arxiv.org/search/?searchtype=author&query=Fei Xu), [Min Xian](https://arxiv.org/search/?searchtype=author&query=Min Xian), [Yalei Tang](https://arxiv.org/search/?searchtype=author&query=Yalei Tang), [Shoukun Sun](https://arxiv.org/search/?searchtype=author&query=Shoukun Sun), [John Stempien](https://arxiv.org/search/?searchtype=author&query=John Stempien) 作者: Lu Cai, Fei Xu, Min Xian, Yalei Tang, Shoukun Sun, John Stempien
During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU). Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results. 
在辐照过程中,核芯膨胀和缓冲层致密化等现象可能会影响三结构各向同性(TRISO)颗粒燃料的性能。辐照后显微检查通常用于识别这些辐照引起的形态变化。然而,每个燃料压块通常包含数千个 TRISO 颗粒。手动执行工作以获取有关这些现象的统计信息既麻烦又主观。为了减少该过程固有的主观性并加速数据分析,我们使用卷积神经网络 (CNN) 自动分割微观 TRISO 层的横截面图像。CNN 是一类专门设计用于处理结构化网格数据的机器学习算法。近年来,它们因其在各种计算机视觉任务(包括图像分类、对象检测和图像分割)中的卓越性能而广受欢迎。在这项研究中,我们生成了一个大型辐照 TRISO 层数据集,其中包含 2000 多张横截面 TRISO 颗粒的显微图像和相应的注释图像。基于这些注释图像,我们使用不同的 CNN 自动分割不同的 TRISO 层。这些 CNN 包括 RU-Net(在本研究中开发)以及三种现有架构:U-Net、残差网络 (ResNet) 和注意力 U-Net。初步结果表明,基于 RU-Net 的模型在交并比(IoU)方面表现最佳。使用 CNN 模型,我们可以加快 TRISO 颗粒截面的分析,显著减少人工劳动,提高分割结果的客观性。
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence 科目:计算机视觉与模式识别 , 人工智能
Publish: 2025-09-10 23:04:28 UTC 发布时间: 2025-09-10 23:04:28 UTC
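Entry #157 ranks the segmentation networks by Intersection over Union (IoU). The sketch below shows the usual per-class IoU on integer label masks; the 3×3 masks and the label meanings (0 = background, 1 = kernel, 2 = buffer) are hypothetical, and the abstract does not say whether the authors average over classes or over images.

```python
def iou(pred_mask, true_mask, label):
    """IoU for one class label over two same-shaped 2D integer masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        if len(pred_row) != len(true_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, true_row):
            in_pred = p == label
            in_true = t == label
            intersection += int(in_pred and in_true)
            union += int(in_pred or in_true)
    # If the class is absent from both masks, IoU is undefined.
    return intersection / union if union else float("nan")


# Hypothetical masks: 0 = background, 1 = kernel, 2 = buffer.
true_mask = [[0, 1, 1],
             [0, 1, 2],
             [0, 2, 2]]
pred_mask = [[0, 1, 1],
             [0, 2, 2],
             [0, 2, 2]]

per_class = {label: iou(pred_mask, true_mask, label) for label in (0, 1, 2)}
mean_iou = sum(per_class.values()) / len(per_class)
print(per_class, mean_iou)
```

Here the single mislabeled pixel lowers the IoU of both the kernel and the buffer class, which is why IoU is stricter than plain pixel accuracy for thin layers.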
#158 RL Fine-Tuning Heals OOD Forgetting in SFT #158 RL 微调治愈 SFT 中的 OOD 遗忘
Authors: [Hangzhan Jin](https://arxiv.org/search/?searchtype=author&query=Hangzhan Jin), [Sitao Luan](https://arxiv.org/search/?searchtype=author&query=Sitao Luan), [Sicheng Lyu](https://arxiv.org/search/?searchtype=author&query=Sicheng Lyu), [Guillaume Rabusseau](https://arxiv.org/search/?searchtype=author&query=Guillaume Rabusseau), [Reihaneh Rabbany](https://arxiv.org/search/?searchtype=author&query=Reihaneh Rabbany), [Doina Precup](https://arxiv.org/search/?searchtype=author&query=Doina Precup), [Mohammad Hamdaqa](https://arxiv.org/search/?searchtype=author&query=Mohammad Hamdaqa) 作者: Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa
The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim “SFT memorizes, RL generalizes” is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), and the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an OOD restoration role, recovering the lost reasoning ability during SFT; (3) the recovery ability has boundaries, i.e., if SFT trains for too short or too long, RL cannot recover the lost OOD ability; (4) to uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the rotation of singular vectors. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism.
Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT 在大型语言模型(LLMs)的后训练中,监督微调(SFT)后接强化学习(RL)的两阶段微调范式在经验上显示出比单阶段 SFT 更好的推理性能。然而,SFT 和 RL 协同作用的演变和机制仍未得到充分探索,尚无定论。在我们的研究中,我们发现众所周知的“SFT 记忆,RL 泛化”的说法过于简化,并发现:(1)OOD 性能在 SFT 早期达到峰值,然后下降(OOD 遗忘),训练/测试损失无法捕获最佳 SFT 检查点;(2)后续的 RL 阶段并没有产生根本上更好的 OOD 能力,而是起到了 OOD 恢复的作用,恢复了 SFT 过程中失去的推理能力;(3)恢复能力有边界,即如果 SFT 训练时间太短或太长,RL 无法恢复丢失的 OOD 能力;(4)为了揭示遗忘和恢复过程背后的潜在机制,我们对参数矩阵进行 SVD 分析,对其进行手动编辑,并观察它们对模型性能的影响。与人们普遍认为模型能力的变化主要是由奇异值的变化引起的不同,我们发现奇异值实际上在整个微调过程中相当稳定。相反,OOD 行为与奇异向量的旋转密切相关。研究结果重新确定了 SFT 和 RL 在两阶段微调中的作用,并发现奇异向量的旋转是关键机制。代码可在 https://github.com/xiaodanguoguo/RL_Heals_SFT 获得
Subjects: Machine Learning, Artificial Intelligence 主题: 机器学习 , 人工智能
Publish: 2025-09-08 21:40:41 UTC 发布时间: 2025-09-08 21:40:41 UTC
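The central claim of #158 is that singular values stay nearly fixed during fine-tuning while singular vectors rotate. The toy below is only an illustration of that distinction on a hand-built 2×2 matrix, not the paper's analysis (which runs SVD on real parameter matrices): it constructs W = U·S·Vᵀ from known factors, rotates U by a hypothetical angle standing in for fine-tuning drift, and then undoes the rotation. All names and angles are assumptions made for the example.

```python
import math


def rotation(theta):
    """2x2 rotation matrix (orthogonal, so usable as singular vectors)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def frobenius_distance(a, b):
    return frobenius_norm([[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)])


# Base weights W0 = U0 * S * V0^T with singular values 3 and 1.
singular_values = [[3.0, 0.0], [0.0, 1.0]]
u0 = rotation(0.3)
v0 = rotation(-0.7)
w0 = matmul(matmul(u0, singular_values), transpose(v0))

# Hypothetical drift: rotate the left singular vectors, keep S and V0 unchanged.
drift = rotation(0.5)
u1 = matmul(drift, u0)
w1 = matmul(matmul(u1, singular_values), transpose(v0))

# Cosine of the angle between the first left singular vector before and after.
cos_angle = sum(u0[i][0] * u1[i][0] for i in range(2))

# Undoing the rotation recovers the original weights.
w_restored = matmul(transpose(drift), w1)

print(frobenius_distance(w1, w0), cos_angle, frobenius_distance(w_restored, w0))
```

The weights change substantially even though the spectrum is identical by construction, and applying the inverse rotation restores them exactly; this mirrors, in miniature, why tracking singular values alone would miss the effect the abstract describes.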
#159 Flexible Multimodal Neuroimaging Fusion for Alzheimer's Disease Progression Prediction #159 用于阿尔茨海默病进展预测的灵活多模态神经影像融合
Authors: [Benjamin Burns](https://arxiv.org/search/?searchtype=author&query=Benjamin Burns), [Yuan Xue](https://arxiv.org/search/?searchtype=author&query=Yuan Xue), [Douglas W. Scharre](https://arxiv.org/search/?searchtype=author&query=Douglas W. Scharre), [Xia Ning](https://arxiv.org/search/?searchtype=author&query=Xia Ning) 作者:Benjamin Burns、Yuan Xue、Douglas W. Scharre、Xia Ning
Alzheimer’s disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE. 阿尔茨海默病 (AD) 是一种进行性神经退行性疾病,认知能力下降速度在患者间差异很大。AD 进展预测旨在预测患者的认知能力下降,并受益于结合多种神经影像模态。然而,当推理过程中缺少许多模态时,现有的多模态模型无法做出准确的预测,这在临床环境中经常出现。为了提高高模态缺失下的多模态模型灵活性,我们引入了 PerM-MoE,这是一种新颖的稀疏专家混合方法,它为每个模态使用独立的路由器来代替传统的单一路由器。使用来自阿尔茨海默病神经影像学计划 (ADNI) 的 T1 加权 MRI、FLAIR、淀粉样蛋白 β PET 和 tau PET 神经影像数据,我们评估了 PerM-MoE、最先进的 Flex-MoE 和单模态神经影像模型,以预测临床痴呆评定量表总分 (CDR-SB) 在不同程度的模态缺失下两年的变化。PerM-MoE 在大多数模态缺失情形下优于最先进的方法,并且比 Flex-MoE 更有效地利用专家。
Subjects: Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Image and Video Processing 科目: 机器学习 , 人工智能 , 计算机视觉与模式识别 , 图像与视频处理
Publish: 2025-09-08 16:59:23 UTC 发布时间: 2025-09-08 16:59:23 UTC
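The key idea in #159 is replacing one shared router with an independent router per modality, so that a missing scan simply contributes no routing decision. The following is a minimal sketch of that idea only. The linear router, softmax gating, top-k selection, the `None` convention for a missing modality, and every name and weight below are assumptions for illustration; the abstract does not specify PerM-MoE's actual gating or missing-data handling.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, router_weights, top_k=1):
    """Return the top-k (expert index, gate weight) pairs for one modality."""
    # One logit per expert: dot product of that expert's router row with the features.
    logits = [sum(w * x for w, x in zip(row, features)) for row in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in ranked)  # renormalize gates over the selected experts
    return [(i, probs[i] / norm) for i in ranked]


def per_modality_routing(sample, routers, top_k=1):
    """Route each available modality with its own router; skip missing ones."""
    return {name: route(feats, routers[name], top_k)
            for name, feats in sample.items() if feats is not None}


# Hypothetical setup: 3 experts, 2-dimensional features, one router per modality.
routers = {
    "mri": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    "tau_pet": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
}
sample = {"mri": [2.0, 0.5], "tau_pet": None}  # tau PET scan missing for this patient

assignment = per_modality_routing(sample, routers, top_k=1)
top2 = per_modality_routing(sample, routers, top_k=2)
print(assignment, top2)
```

Because each modality owns its router, dropping `tau_pet` leaves the `mri` routing decision unchanged, whereas a single shared router would see a different (partially empty) input whenever a modality is absent.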
#160 Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics #160 迈向值得信赖的代理 IoEV:用于可解释的网络威胁缓解和状态分析的 AI 代理
Authors: [Meryem Malak Dif](https://arxiv.org/search/?searchtype=author&query=Meryem Malak Dif), [Mouhamed Amine Bouchiha](https://arxiv.org/search/?searchtype=author&query=Mouhamed Amine Bouchiha), [Abdelaziz Amara Korba](https://arxiv.org/search/?searchtype=author&query=Abdelaziz Amara Korba), [Yacine Ghamri-Doudane](https://arxiv.org/search/?searchtype=author&query=Yacine Ghamri-Doudane) 作者:Meryem Malak Dif、Mouhamed Amine Bouchiha、Abdelaziz Amara Korba、Yacine Ghamri-Doudane
The Internet of Electric Vehicles (IoEV) envisions a tightly coupled ecosystem of electric vehicles (EVs), charging infrastructure, and grid services, yet it remains vulnerable to cyberattacks, unreliable battery-state predictions, and opaque decision processes that erode trust and performance. To address these challenges, we introduce a novel Agentic Artificial Intelligence (AAI) framework tailored for IoEV, where specialized agents collaborate to deliver autonomous threat mitigation, robust analytics, and interpretable decision support. Specifically, we design an AAI architecture comprising dedicated agents for cyber-threat detection and response at charging stations, real-time State of Charge (SoC) estimation, and State of Health (SoH) anomaly detection, all coordinated through a shared, explainable reasoning layer; develop interpretable threat-mitigation mechanisms that proactively identify and neutralize attacks on both physical charging points and learning components; propose resilient SoC and SoH models that leverage continuous and adversarial-aware learning to produce accurate, uncertainty-aware forecasts with human-readable explanations; and implement a three-agent pipeline, where each agent uses LLM-driven reasoning and dynamic tool invocation to interpret intent, contextualize tasks, and execute formal optimizations for user-centric assistance. Finally, we validate our framework through comprehensive experiments across diverse IoEV scenarios, demonstrating significant improvements in security and prediction accuracy. All datasets, models, and code will be released publicly. 
电动汽车互联网 (IoEV) 设想了一个由电动汽车 (EV)、充电基础设施和电网服务组成的紧密耦合的生态系统,但它仍然容易受到网络攻击、不可靠的电池状态预测以及不透明的决策过程的影响,从而削弱信任和性能。为了应对这些挑战,我们引入了专为 IoEV 量身定制的新型代理人工智能 (AAI) 框架,其中专业代理协作提供自主威胁缓解、强大的分析和可解释的决策支持。具体来说,我们设计了一个 AAI 架构,包括用于充电站网络威胁检测和响应、实时充电状态 (SoC) 估计和健康状态 (SoH) 异常检测的专用代理,所有这些都通过共享的、可解释的推理层进行协调;开发可解释的威胁缓解机制,主动识别和消除对物理充电点和学习组件的攻击;提出弹性 SoC 和 SoH 模型,利用连续和对抗感知学习来生成准确的、不确定性感知的预测,并提供人类可读的解释;并实施一个三代理管道,其中每个代理使用 LLM 驱动的推理和动态工具调用来解释意图、将任务置于上下文中,并执行正式优化以获得以用户为中心的帮助。最后,我们通过跨不同 IoEV 场景的综合实验验证了我们的框架,证明了安全性和预测准确性的显着提高。所有数据集、模型和代码都将公开发布。
Subjects: Cryptography and Security, Artificial Intelligence, Emerging Technologies, Machine Learning, Networking and Internet Architecture 主题: 密码学与安全 , 人工智能 , 新兴技术 , 机器学习 , 网络与互联网架构
Publish: 2025-09-08 14:28:53 UTC 发布时间: 2025-09-08 14:28:53 UTC
#161 Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study #161 分析消费类 GPU 上的 LoRA/QLoRA 微调效率:RTX 4060 案例研究
Author: [MSR Avinash](https://arxiv.org/search/?searchtype=author&query=MSR Avinash) 作者:MSR Avinash
Fine-tuning large language models (LLMs) with parameter-efficient techniques such as LoRA and QLoRA has enabled adaptation of foundation models on modest hardware. Yet the efficiency of such training on consumer-grade GPUs, especially under strict 8 GB VRAM limits, remains underexplored. We present a controlled profiling study of LoRA/QLoRA fine-tuning using the Qwen2.5-1.5B-Instruct model on a single NVIDIA RTX 4060. Across three representative configurations, we systematically vary batch size, sequence length, optimizer choice (AdamW vs. PagedAdamW), and precision (fp16 vs. bf16). We report throughput (tokens/s), time per 10k tokens, and VRAM footprint, alongside energy estimates derived from GPU board power limits. Our results show that paged optimizers improve throughput by up to 25% (628 tok/s vs. 500 tok/s baseline), while bf16 degrades efficiency relative to fp16. Despite 8 GB constraints, sequence lengths up to 2048 tokens were feasible using parameter-efficient strategies. To our knowledge, this is the first systematic case study of LLM fine-tuning efficiency on consumer GPUs, providing reproducible benchmarks and practical guidelines for resource-constrained researchers and practitioners. 使用 LoRA 和 QLoRA 等参数高效技术对大型语言模型 (LLM) 进行微调,使得基础模型能够在普通硬件上进行适配。然而,在消费级 GPU 上进行此类训练的效率,尤其是在严格的 8 GB VRAM 限制下,仍然没有得到充分探索。我们提出了一项在单个 NVIDIA RTX 4060 上使用 Qwen2.5-1.5B-Instruct 模型进行 LoRA/QLoRA 微调的受控分析研究。在三种代表性配置中,我们系统地改变了批量大小、序列长度、优化器选择(AdamW 与 PagedAdamW)和精度(fp16 与 bf16)。我们报告吞吐量(token/秒)、每 10k 个 token 的时间和 VRAM 占用空间,以及根据 GPU 板功率限制得出的能量估算。我们的结果表明,分页优化器最多可将吞吐量提高 25%(628 tok/s 与 500 tok/s 基线),而 bf16 相对于 fp16 降低了效率。尽管有 8 GB 的限制,但使用参数高效策略可实现高达 2048 个 token 的序列长度。据我们所知,这是第一个关于消费级 GPU 上 LLM 微调效率的系统案例研究,为资源受限的研究人员和从业者提供了可重复的基准和实用指南。
Subjects: Machine Learning, Artificial Intelligence, Performance 主题: 机器学习 , 人工智能 , 性能
Publish: 2025-09-07 21:41:14 UTC 发布时间: 2025-09-07 21:41:14 UTC
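The metrics reported in #161 (throughput, time per 10k tokens, energy from board power limits) are related by simple arithmetic, sketched below. The two throughput figures come from the abstract; the 115 W board power is an assumed placeholder rather than a value stated there, and treating the GPU as drawing its full power limit gives an upper-bound estimate, not a measurement.

```python
def relative_gain(new_tok_per_s, baseline_tok_per_s):
    """Fractional throughput improvement over the baseline."""
    return (new_tok_per_s - baseline_tok_per_s) / baseline_tok_per_s


def seconds_per_10k_tokens(tok_per_s):
    """Wall-clock time to process 10,000 tokens at a given throughput."""
    return 10_000 / tok_per_s


def energy_wh(tok_per_s, tokens, board_power_w):
    """Upper-bound energy in watt-hours, assuming constant draw at the board power limit."""
    return board_power_w * (tokens / tok_per_s) / 3600


BASELINE_TOK_S = 500.0   # baseline throughput quoted in the abstract
PAGED_TOK_S = 628.0      # paged-optimizer throughput quoted in the abstract
ASSUMED_BOARD_POWER_W = 115.0  # assumption for illustration, not from the abstract

gain = relative_gain(PAGED_TOK_S, BASELINE_TOK_S)
baseline_time = seconds_per_10k_tokens(BASELINE_TOK_S)
paged_time = seconds_per_10k_tokens(PAGED_TOK_S)
baseline_energy = energy_wh(BASELINE_TOK_S, 10_000, ASSUMED_BOARD_POWER_W)

print(f"gain: {gain:.1%}, time per 10k tokens: {baseline_time:.1f}s -> {paged_time:.1f}s")
```

The computed gain is 25.6%, consistent with the abstract's rounded "up to 25%"; under a fixed power limit, energy per token scales inversely with throughput, so the same ratio applies to the energy estimate.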
#162 Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction #162 学习路由:用于多模态多任务预测的每样本自适应路由
Authors: [Marzieh Ajirak](https://arxiv.org/search/?searchtype=author&query=Marzieh Ajirak), [Oded Bein](https://arxiv.org/search/?searchtype=author&query=Oded Bein), [Ellen Rose Bowen](https://arxiv.org/search/?searchtype=author&query=Ellen Rose Bowen), [Dora Kanellopoulos](https://arxiv.org/search/?searchtype=author&query=Dora Kanellopoulos), [Avital Falk](https://arxiv.org/search/?searchtype=author&query=Avital Falk), [Faith M. Gunning](https://arxiv.org/search/?searchtype=author&query=Faith M. Gunning), [Nili Solomonov](https://arxiv.org/search/?searchtype=author&query=Nili Solomonov), [Logan Grosenick](https://arxiv.org/search/?searchtype=author&query=Logan Grosenick) 作者:Marzieh Ajirak、Oded Bein、Ellen Rose Bowen、Dora Kanellopoulos、Avital Falk、Faith M. Gunning、Nili Solomonov、Logan Grosenick
We propose a unified framework for adaptive routing in multitask, multimodal prediction settings where data heterogeneity and task interactions vary across samples. Motivated by applications in psychotherapy where structured assessments and unstructured clinician notes coexist with partially missing data and correlated outcomes, we introduce a routing-based architecture that dynamically selects modality processing pathways and task-sharing strategies on a per-sample basis. Our model defines multiple modality paths, including raw and fused representations of text and numeric features and learns to route each input through the most informative expert combination. Task-specific predictions are produced by shared or independent heads depending on the routing decision, and the entire system is trained end-to-end. We evaluate the model on both synthetic data and real-world psychotherapy notes predicting depression and anxiety outcomes. Our experiments show that our method consistently outperforms fixed multitask or single-task baselines, and that the learned routing policy provides interpretable insights into modality relevance and task structure. This addresses critical challenges in personalized healthcare by enabling per-subject adaptive information processing that accounts for data heterogeneity and task correlations. Applied to psychotherapy, this framework could improve mental health outcomes, enhance treatment assignment precision, and increase clinical cost-effectiveness through personalized intervention strategies. 我们提出了一个统一的框架,用于在多任务、多模态预测设置中进行自适应路由,其中数据异质性和任务交互因样本而异。在心理治疗中的应用中,结构化评估和非结构化临床医生笔记与部分缺失的数据和相关结果并存,我们引入了一种基于路由的架构,该架构在每个样本的基础上动态选择模式处理途径和任务共享策略。我们的模型定义了多种模态路径,包括文本和数字特征的原始和融合表示,并学习通过信息最丰富的专家组合路由每个输入。特定于任务的预测由共享或独立的头根据路由决策生成,并且整个系统是端到端训练的。我们根据合成数据和预测抑郁和焦虑结果的真实世界心理治疗笔记来评估该模型。我们的实验表明,我们的方法始终优于固定的多任务或单任务基线,并且学习到的路由策略提供了对模态相关性和任务结构的可解释见解。这通过实现每个受试者的自适应信息处理来解决个性化医疗保健中的关键挑战,该处理考虑了数据异质性和任务相关性。应用于心理治疗,该框架可以通过个性化干预策略改善心理健康结果,提高治疗分配精度,并提高临床成本效益。
Subjects: Machine Learning, Artificial Intelligence 主题: 机器学习 , 人工智能
Publish: 2025-09-06 16:49:45 UTC 发布时间: 2025-09-06 16:49:45 UTC
#163 Ratio1 – AI meta-OS #163 Ratio1 – AI 元操作系统
Authors: [Andrei Damian](https://arxiv.org/search/?searchtype=author&query=Andrei Damian), [Petrica Butusina](https://arxiv.org/search/?searchtype=author&query=Petrica Butusina), [Alessandro De Franceschi](https://arxiv.org/search/?searchtype=author&query=Alessandro De Franceschi), [Vitalii Toderian](https://arxiv.org/search/?searchtype=author&query=Vitalii Toderian), [Marius Grigoras](https://arxiv.org/search/?searchtype=author&query=Marius Grigoras), [Cristian Bleotiu](https://arxiv.org/search/?searchtype=author&query=Cristian Bleotiu) 作者:安德烈·达米安、佩特里卡·布图西纳、亚历山德罗·德·弗朗西斯基、维塔利·托德里安、马吕斯·格里戈拉斯、克里斯蒂安·布莱奥蒂乌
We propose the Ratio1 AI meta-operating system (meta-OS), a decentralized MLOps protocol that unifies AI model development, deployment, and inference across heterogeneous edge devices. Its key innovation is an integrated blockchain-based framework that transforms idle computing resources (laptops, smartphones, cloud VMs) into a trustless global supercomputer. The architecture includes novel components: a decentralized authentication layer (dAuth), an in-memory state database (CSTORE), a distributed storage system (R1FS), homomorphic encrypted federated learning (EDIL), decentralized container orchestration (Deeploy) and an oracle network (OracleSync), which collectively ensure secure, resilient execution of AI pipelines and other container based apps at scale. The protocol enforces a formal circular token-economic model combining Proof-of-Availability (PoA) and Proof-of-AI (PoAI) consensus. Compared to centralized heterogeneous cloud MLOps and existing decentralized compute platforms, which often lack integrated AI toolchains or trusted Ratio1 node operators (R1OP) mechanics, Ratio1’s holistic design lowers barriers for AI deployment and improves cost-efficiency. We provide mathematical formulations of its secure licensing and reward protocols, and include descriptive information for the system architecture and protocol flow. We argue that our proposed fully functional ecosystem proposes and demonstrates significant improvements in accessibility, scalability, and security over existing alternatives. 
我们提出了 Ratio1 AI 元操作系统 (meta-OS),这是一种去中心化的 MLOps 协议,可统一跨异构边缘设备的 AI 模型开发、部署和推理。其关键创新是基于区块链的集成框架,可将闲置的计算资源(笔记本电脑、智能手机、云虚拟机)转变为无需信任的全球超级计算机。该架构包括新颖的组件:去中心化身份验证层 (dAuth)、内存状态数据库 (CSTORE)、分布式存储系统 (R1FS)、同态加密联邦学习 (EDIL)、去中心化容器编排 (Deeploy) 和预言机网络 (OracleSync),它们共同确保安全、弹性地大规模执行 AI 管道和其他基于容器的应用程序。该协议强制执行正式的循环代币经济模型,结合了可用性证明 (PoA) 和 AI 证明 (PoAI) 共识。与集中式异构云 MLOps 和现有的去中心化计算平台相比,这些平台通常缺乏集成的 AI 工具链或可信的 Ratio1 节点运营商 (R1OP) 机制,Ratio1 的整体设计降低了 AI 部署的门槛并提高了成本效益。我们提供了其安全许可和奖励协议的数学公式,并包括系统架构和协议流程的描述性信息。我们认为,我们提议的功能齐全的生态系统提出并展示了与现有替代方案相比在可访问性、可扩展性和安全性方面的显著改进。
Subjects: Operating Systems, Artificial Intelligence, Cryptography and Security, Distributed, Parallel, and Cluster Computing 主题:操作系统 , 人工智能 , 密码学与安全 , 分布式、并行和集群计算
Publish: 2025-09-05 07:41:54 UTC 发布时间: 2025-09-05 07:41:54 UTC
#164 Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems #164 在大型低轨卫星系统中加速隐私保护联邦学习
Authors: [Binquan Guo](https://arxiv.org/search/?searchtype=author&query=Binquan Guo), [Junteng Cao](https://arxiv.org/search/?searchtype=author&query=Junteng Cao), [Marie Siew](https://arxiv.org/search/?searchtype=author&query=Marie Siew), [Binbin Chen](https://arxiv.org/search/?searchtype=author&query=Binbin Chen), [Tony Q. S. Quek](https://arxiv.org/search/?searchtype=author&query=Tony Q. S. Quek), [Zhu Han](https://arxiv.org/search/?searchtype=author&query=Zhu Han) 作者: Binquan Guo, Junteng Cao, Marie Siew, Binbin Chen, Tony Q. S. Quek, Zhu Han
Large-scale low-Earth-orbit (LEO) satellite systems are increasingly valued for their ability to enable rapid and wide-area data exchange, thereby facilitating the collaborative training of artificial intelligence (AI) models across geographically distributed regions. Due to privacy concerns and regulatory constraints, raw data collected at remote clients cannot be centrally aggregated, posing a major obstacle to traditional AI training methods. Federated learning offers a privacy-preserving alternative by training local models on distributed devices and exchanging only model parameters. However, the dynamic topology and limited bandwidth of satellite systems will hinder timely parameter aggregation and distribution, resulting in prolonged training times. To address this challenge, we investigate the problem of scheduling federated learning over satellite networks and identify key bottlenecks that impact the overall duration of each training round. We propose a discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to accelerate federated learning. Simulation results demonstrate that the proposed approach achieves significant performance gains over traditional statistical multiplexing-based model exchange strategies, reducing overall round times by 14.20% to 41.48%. Moreover, the acceleration effect becomes more pronounced for larger models and higher numbers of clients, highlighting the scalability of the proposed approach. 大规模低地球轨道 (LEO) 卫星系统因其能够实现快速、广域数据交换的能力而越来越受到重视,从而促进了跨地理分布区域的人工智能 (AI) 模型的协作训练。由于隐私问题和监管限制,在远程客户端收集的原始数据无法集中汇总,这对传统的人工智能训练方法构成了主要障碍。联邦学习通过在分布式设备上训练本地模型并仅交换模型参数,提供了一种保护隐私的替代方案。然而,卫星系统的动态拓扑结构和有限的带宽会阻碍参数的及时聚合和分发,导致训练时间延长。为了应对这一挑战,我们研究了在卫星网络上调度联邦学习的问题,并确定了影响每轮训练总持续时间的关键瓶颈。我们提出了一种基于离散时间图的按需调度框架,该框架动态分配通信资源以加速联邦学习。仿真结果表明,与传统的基于统计复用的模型交换策略相比,所提方法取得了显著的性能提升,将整体轮次时间缩短了 14.20%至 41.48%。此外,对于更大的模型和更多的客户端,加速效应变得更加明显,凸显了所提出方法的可扩展性。
Subjects: Machine Learning, Artificial Intelligence, Distributed, Parallel, and Cluster Computing 主题: 机器学习 , 人工智能 , 分布式、并行和集群计算
Publish: 2025-09-05 03:33:42 UTC 发布时间: 2025-09-05 03:33:42 UTC
#165 MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors #165 MEUV:通过互斥的解锁向量在大型语言模型中实现细粒度的能力激活
Authors: [Xin Tong](https://arxiv.org/search/?searchtype=author&query=Xin Tong), [Zhi Lin](https://arxiv.org/search/?searchtype=author&query=Zhi Lin), [Jingya Wang](https://arxiv.org/search/?searchtype=author&query=Jingya Wang), [Meng Han](https://arxiv.org/search/?searchtype=author&query=Meng Han), [Bo Jin](https://arxiv.org/search/?searchtype=author&query=Bo Jin) 作者:Xin Tong、Zhi Lin、Jingya Wang、Meng Han、Bo Jin
Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier “refusal-direction” edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLMs deployment in security-sensitive domains. 大型语言模型 (LLM) 通过安全对齐来可靠地拒绝恶意请求,但同样一刀切的保护措施也会阻止警务、国防和其他高风险环境中的合法使用。早期的“拒绝方向”编辑可以绕过这些保护层,但它们依赖于一个单一的向量,该向量不加区别地解锁所有危险主题,不提供语义控制。我们引入了互斥解锁向量 (MEUV),这是一个轻量级框架,它将单一的拒绝方向分解为与主题对齐、近似正交的向量,每个向量专用于一种敏感能力。MEUV 在单个 epoch 内通过多任务目标学习得到,该目标混合了差分消融间隔、跨主题惩罚和正交性惩罚以及几个辅助项。在双语恶意提示基准测试中,MEUV 在 Gemma-2-2B、LLaMA-3-8B 和 Qwen-7B 上的攻击成功率不低于 87%,同时与最佳单方向基线相比,跨主题泄漏减少了高达 90%。用中文训练的向量几乎无需改动即可迁移到英语(反之亦然),这表明存在与语言无关的拒绝子空间。结果表明,可以实现细粒度的主题级能力激活,同时将效用损失降至最低,为在安全敏感领域中受控部署 LLM 铺平了道路。
Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security
Publish: 2025-09-04 07:16:06 UTC
#166 Scaling Up Data Parallelism in Decentralized Deep Learning
Authors: [Bing Xie](https://arxiv.org/search/?searchtype=author&query=Bing Xie), [Junqi Yin](https://arxiv.org/search/?searchtype=author&query=Junqi Yin), [Zhenyu Zhou](https://arxiv.org/search/?searchtype=author&query=Zhenyu Zhou), [Sarp Oral](https://arxiv.org/search/?searchtype=author&query=Sarp Oral), [Feiyi Wang](https://arxiv.org/search/?searchtype=author&query=Feiyi Wang)
Although it has been extensively explored in theory, decentralized learning is not yet green-lighted for production use, largely due to a lack of stability, scalability, and generality in large-scale DNN training. To shed light on the production use of decentralized learning, this work studies decentralized data parallel training at scale. To this end, we introduce a benchmarking framework, namely DBench, to host both centralized and decentralized DNN training. Building upon DBench, we introduce a benchmarking methodology to uncover the correlations between model accuracy and the variances of parameter tensors by varying communication graphs and training scales. Based on the benchmarking results, we observe that (1) similar to centralized learning, decentralized data parallel training also presents issues of scalability and generality when the training scales up; (2) the model accuracy of decentralized learning is correlated with the number of connections in a communication graph; (3) the model accuracy of decentralized learning is surprisingly sensitive to the variance of parameter tensors across model replicas. Building on these observations, we propose Ada, a decentralized adaptive approach that performs large-scale DNN training following a decentralized SGD method and dynamically adapts the communication graph in use throughout training iterations. We apply Ada to large-scale training and observe that Ada consistently obtains the best convergence rates in decentralized DNN training, and delivers model accuracy equal or comparable to that of centralized learning for all sample applications, even when training ResNet50 for ImageNet-1K on the scale of 1008 GPUs.
Subjects: Machine Learning, Artificial Intelligence
Publish: 2025-08-31 17:34:52 UTC
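To make the decentralized setup in #166 concrete, here is a minimal toy sketch. It is not DBench or Ada. It assumes each replica minimizes a hypothetical scalar objective f_i(w) = 0.5 (w - t_i)^2, takes an exact local gradient step (gradient noise is omitted for determinism), and then averages with its neighbours in a fixed communication graph. The function names are made up for illustration, and Ada's graph-adaptation rule is not reproduced because the abstract does not specify it.

```python
import statistics


def ring_neighbors(n, k=1):
    """Ring graph: each replica averages with itself and k neighbours per side."""
    return [sorted({(i + d) % n for d in range(-k, k + 1)}) for i in range(n)]


def complete_neighbors(n):
    """Fully connected graph: every replica averages with all replicas."""
    return [list(range(n)) for _ in range(n)]


def decentralized_sgd_toy(neighbors, targets, steps=50, lr=0.1):
    """Local gradient step followed by neighbour averaging (gossip).

    Replica i holds a scalar parameter w[i] and a private target targets[i],
    standing in for a non-identical local data shard (an assumption).
    """
    n = len(targets)
    w = [0.0] * n
    for _ in range(steps):
        # Local step on f_i(w) = 0.5 * (w - t_i)^2, whose gradient is (w - t_i).
        local = [w[i] - lr * (w[i] - targets[i]) for i in range(n)]
        # Communication step: uniform average over each replica's neighbourhood.
        w = [sum(local[j] for j in neighbors[i]) / len(neighbors[i])
             for i in range(n)]
    return w


def replica_variance(w):
    """Population variance of the parameter across replicas."""
    return statistics.pvariance(w)


if __name__ == "__main__":
    targets = [float(i) for i in range(16)]
    for name, graph in [("ring k=1", ring_neighbors(16, 1)),
                        ("ring k=3", ring_neighbors(16, 3)),
                        ("complete", complete_neighbors(16))]:
        w = decentralized_sgd_toy(graph, targets)
        print(name, replica_variance(w))
```

In this toy setting, sparser graphs leave a larger spread of parameters across replicas, which is the quantity the abstract reports model accuracy to be sensitive to.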
#167 PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis
Authors: [Xinyu He](https://arxiv.org/search/?searchtype=author&query=Xinyu He), [Chenhan Xiao](https://arxiv.org/search/?searchtype=author&query=Chenhan Xiao), [Haoran Li](https://arxiv.org/search/?searchtype=author&query=Haoran Li), [Ruizhong Qiu](https://arxiv.org/search/?searchtype=author&query=Ruizhong Qiu), [Zhe Xu](https://arxiv.org/search/?searchtype=author&query=Zhe Xu), [Yang Weng](https://arxiv.org/search/?searchtype=author&query=Yang Weng), [Jingrui He](https://arxiv.org/search/?searchtype=author&query=Jingrui He), [Hanghang Tong](https://arxiv.org/search/?searchtype=author&query=Hanghang Tong)
Modern power systems are becoming increasingly dynamic, with changing topologies and time-varying loads driven by renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Despite these changes, publicly available test cases remain scarce, due to security concerns and the significant effort required to anonymize real systems. Such limitations call for generative tools that can jointly synthesize grid structure and nodal dynamics. However, modeling the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles while preserving physical feasibility and avoiding prohibitive computational costs remains a major challenge. We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. The core idea is dependence decomposition: the complex joint distribution is factorized into a chain of conditional distributions over feasible grid topologies, time-series bus loads, and other system attributes, leveraging their mutual dependencies. By constraining the generation process at each stage, we implement a hierarchical graph beta-diffusion process for structural synthesis, paired with a temporal autoencoder that embeds time-series data into a compact latent space, improving both training stability and sample fidelity. Experiments across benchmark settings show that PowerGrow not only outperforms prior diffusion models in fidelity and diversity but also achieves a 98.9% power flow convergence rate and improved N-1 contingency resilience. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
Subjects: Machine Learning, Artificial Intelligence, Systems and Control
Publish: 2025-08-29 01:47:27 UTC
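The "dependence decomposition" described in #167 can be written as a chain-rule factorization. The version below is only a sketch: the symbols G (grid topology), L (time-series bus loads), and A (remaining system attributes) are labels introduced here, and the conditioning order simply follows the order in which the abstract lists the three components. The paper may use a different ordering or a finer split.

```latex
p(G, L, A) \;=\; p(G)\; p(L \mid G)\; p(A \mid G, L)
```

Under this reading, each factor can be sampled in turn and checked for feasibility before the next stage runs, which is one way to interpret "constraining the generation process at each stage".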
#168 TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
Authors: [Dong Liu](https://arxiv.org/search/?searchtype=author&query=Dong Liu), [Yanxuan Yu](https://arxiv.org/search/?searchtype=author&query=Yanxuan Yu)
Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.
Subjects: Distributed, Parallel, and Cluster Computing, Artificial Intelligence
Publish: 2025-08-28 16:17:18 UTC
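The query-aware page selection in #168 can be illustrated with a small sketch. It assumes, without confirmation from the abstract, that the "bounding-box metadata" is a per-channel minimum and maximum of the keys in each KV page, and that relevance is estimated by an upper bound on the query-key dot product inside that box. The function names and the scoring rule are illustrative, not TinyServe's API, and the real system fuses these steps into a CUDA kernel.

```python
def page_bounds(keys, page_size):
    """Split keys into fixed-size pages and store per-channel (min, max) boxes."""
    pages = []
    for start in range(0, len(keys), page_size):
        block = keys[start:start + page_size]
        lo = [min(col) for col in zip(*block)]
        hi = [max(col) for col in zip(*block)]
        pages.append((lo, hi))
    return pages


def page_score(query, lo, hi):
    """Upper bound on q . k for any key k lying inside the box [lo, hi].

    For each channel the largest possible contribution is at one of the
    two box corners, so we take max(q * lo, q * hi) and sum over channels.
    """
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_pages(query, pages, top_k):
    """Return indices of the top_k pages by score, in ascending index order."""
    ranked = sorted(range(len(pages)),
                    key=lambda p: page_score(query, *pages[p]),
                    reverse=True)
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    keys = [[0.1, 0.2], [0.0, -0.3], [2.0, 0.1], [1.5, -0.2],
            [-1.0, 0.5], [-0.5, 0.4], [0.3, 0.3], [0.2, 0.1]]
    pages = page_bounds(keys, page_size=2)
    print(select_pages([1.0, 0.0], pages, top_k=2))
```

Because the score is an upper bound, a page holding the single best-matching key never scores below that key's true dot product. Only the selected pages would then be loaded for attention, which is where the memory savings would come from in such a design.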
#169 Scalable RF Simulation in Generative 4D Worlds
Authors: [Zhiwei Zheng](https://arxiv.org/search/?searchtype=author&query=Zhiwei Zheng), [Dongyin Hu](https://arxiv.org/search/?searchtype=author&query=Dongyin Hu), [Mingmin Zhao](https://arxiv.org/search/?searchtype=author&query=Mingmin Zhao)
Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.
Subject: Computer Vision and Pattern Recognition
Publish: 2025-08-16 23:02:14 UTC
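To show why phase coherence matters for the respiration-monitoring use case mentioned in #169, here is a minimal sketch. It is not WaveVerse's simulator. It assumes a single monostatic reflection path, so the echo phase is -4*pi*d/lambda for range d, and it recovers millimetre-scale chest motion from the unwrapped phase. The wavelength, range, and breathing parameters in the demo are hypothetical values chosen for illustration.

```python
import cmath
import math


def simulate_echo(distances, wavelength):
    """Unit-amplitude complex echo per sample; phase = -4*pi*d/lambda (round trip)."""
    return [cmath.exp(-1j * 4 * math.pi * d / wavelength) for d in distances]


def unwrap(phases):
    """Remove 2*pi jumps, assuming the true step between samples is below pi."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


def displacement_from_phase(samples, wavelength):
    """Displacement relative to the first sample, recovered from echo phase."""
    phases = unwrap([cmath.phase(s) for s in samples])
    return [-(p - phases[0]) * wavelength / (4 * math.pi) for p in phases]


if __name__ == "__main__":
    wavelength = 0.005          # 5 mm, roughly a 60 GHz carrier (assumed)
    fs, breath_hz = 50.0, 0.25  # sample rate and breathing rate (assumed)
    chest = [1.5 + 0.004 * math.sin(2 * math.pi * breath_hz * i / fs)
             for i in range(400)]
    est = displacement_from_phase(simulate_echo(chest, wavelength), wavelength)
    print(max(est), min(est))
```

If the simulated phase were not consistent from one time step to the next, the unwrapping step would fail and the small periodic motion would not be recoverable, which is the motivation for a phase-coherent simulator.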
1.3 Huggingface
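- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling (73▲)
- UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning (34▲)
- InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts (23▲)
- LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence (10▲)
- Locality in Image Diffusion Models Emerges from Data Statistics (8▲)
- Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting (8▲)
- Lost in Embeddings: Information Loss in Vision-Language Models (7▲)
- SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation (7▲)
- Measuring Epistemic Humility in Multimodal Large Language Models (5▲)
- Nav-R1: Reasoning and Navigation in Embodied Scenes (4▲)
- Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models (3▲)
- PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits (2▲)
- 8 more papers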